# convergence_aspects_of_hybrid_kernel_svgd__c043ad93.pdf

Published in Transactions on Machine Learning Research (12/2025)

Convergence Aspects of Hybrid Kernel SVGD

Anson Mac Donald anson.macdonald@unsw.edu.au School of Mathematics and Statistics University of New South Wales

Scott A. Sisson scott.sisson@unsw.edu.au School of Mathematics and Statistics University of New South Wales

Sahani Pathiraja s.pathiraja@unsw.edu.au School of Mathematics and Statistics University of New South Wales

Reviewed on Open Review: https: // openreview. net/ forum? id= JZkb MSQDm D

Stein variational gradient descent (SVGD) is a particle-based approximate inference algorithm. Many variants of SVGD have been proposed in recent years, including the hybrid kernel variant (h-SVGD), which has demonstrated promising results on image classiﬁcation with deep neural network ensembles. By framing h-SVGD as a kernelised Wasserstein gradient ﬂow on a functional that is not the Kullback-Leibler divergence, we demonstrate that h-SVGD does not converge to the target distribution in the mean ﬁeld limit. Despite this theoretical result, we provide intuition and experimental support for the ability of h-SVGD to improve variance estimation in high dimensions. Unlike other SVGD variants that also alleviate variance collapse, this is achieved at no additional computational cost and without further assumptions on the posterior.

1 Introduction

Stein variational gradient descent (SVGD) is a variational inference algorithm that generates samples from a target probability density (Liu & Wang, 2016). It has proven useful in many tasks in Bayesian inference and machine learning. SVGD evolves an interacting particle system until the particles resemble a sample from a target density. The dynamics of this system include a driving term that moves particles to regions of high probability, and a repulsive term that repels particles from one another. This repulsive term prevents particles from converging to the same mode. It has been shown that within a unit ball of a reproducing kernel Hilbert space (RKHS), the SVGD update direction optimally reduces the Kullback-Leibler (KL) divergence between the target density and the approximating density (Liu & Wang, 2016). The reproducing kernel of this RKHS appears in both the driving and repulsive terms, making the choice of kernel a key ingredient for SVGD.

The theoretical properties of vanilla SVGD have been studied extensively. Liu showed that the empirical measure of the particles converges weakly to the target distribution (Liu, 2017). SVGD in the mean ﬁeld regime has been described as a gradient ﬂow on the KL divergence (Liu & Wang, 2016) and the chi-squared divergence (Chewi et al., 2020). Furthering this geometric point of view, Duncan et al. (2023) developed the Stein geometry along with its associated tangent spaces and geodesics, leading to guidelines for choosing kernels to improve convergence. The existence and uniqueness of the solution to the Stein partial diﬀerential equation (PDE) has been established (Lu et al., 2019) along with various descent lemmas bounding the decrease in KL divergence at each iteration (Liu, 2017; Korba et al., 2020; Salim et al., 2022).

Published in Transactions on Machine Learning Research (12/2025)

SVGD is known to suﬀer from the curse of dimensionality through variance collapse (Wang et al., 2018; Ba et al., 2019) whereby the marginal variances of the particles underestimate the true marginal variances of the target in high dimensions. Zhuo et al. (2018) explained that this phenomenon is due to the size of the repulsive term of the update direction scaling inversely with dimension. This enables the driving term to dominate in high dimensions, thereby forcing particles to converge to the mode(s) of the target. This insight suggests that strengthening the repulsive term in SVGD should lead to better variance estimation, an idea which we explore in Sections 3 and 4.

Many variants of SVGD have also been proposed in recent years, some oﬀering improvements and others providing generalisations. Riemannian SVGD (Liu & Zhu, 2018) generalises SVGD by allowing for target densities on Riemannian manifolds, not just Euclidean spaces. Matrix SVGD (Wang et al., 2019) replaces the scalar valued kernel with a matrix valued kernel to incorporate preconditioning information and speed up particle exploration. Message passing SVGD (Zhuo et al., 2018) and graphical SVGD (Wang et al., 2018) focus on target densities that factorise according to a graph structure. This approach reduces variance collapse in high dimensions by converting the problem to a collection of low dimensional problems. Projected SVGD (Chen & Ghattas, 2020), sliced SVGD (Gong et al., 2021) and Grassman SVGD (Liu et al., 2022) also mitigate the issue of variance collapse by updating particles within lower dimensional subspaces, which comes at the expense of additional computation.

In this work, we study a variant called hybrid kernel Stein variational gradient descent (h-SVGD). The name comes from its use of two distinct kernels for the driving and repulsive terms with the aim of mitigating variance collapse. This variant was originally proposed by D Angelo et al. (2021) in the context of training deep neural network ensembles by sampling from the distribution of network parameters. In that setting, two particles may parameterise networks with very similar outputs despite being far apart in the weight space. Their insight was to encourage functional diversity between networks in the ensemble by using a standard kernel in the driving term, but a functional kernel in the repulsive term. In this neural network ensemble setting, h-SVGD demonstrated better performance than other variants on image classiﬁcation. Annealed SVGD (D Angelo & Fortuin, 2021) may also be considered an example of h-SVGD. In this variant, the driving kernel is a scalar multiple γ(ℓ) [0, 1] of the repulsive kernel, and this factor γ(ℓ) gradually increases to 1 as the iteration ℓincreases. Numerical experiments show that annealed SVGD improves the ability of particles to escape local modes. Scaling one of the update terms has also been used as a computational technique to aid other SVGD variants when training Bayesian neural networks (Gong et al., 2021).

Although preliminary numerical experiments have shown the beneﬁts of h-SVGD (D Angelo et al., 2021), the theoretical results for SVGD do not directly apply in the hybrid kernel setting due to the presence of a second kernel. In this paper, we address this theoretical gap and reinforce the practical beneﬁts of h-SVGD through the following contributions.

We establish existence of a solution to the hybrid Stein PDE and a kernelised Wasserstein gradient ﬂow interpretation. Through the study of this gradient ﬂow, we demonstrate that h-SVGD does not converge to the target distribution in the mean ﬁeld limit.

We also quantify the rate of dissipation of the gradient ﬂow functional and develop a discrete time version of this result, otherwise known as a descent lemma.

Despite not converging to the target distribution, we demonstrate through numerical experiments that h-SVGD can mitigate variance collapse in the ﬁnite particle regime at negligible additional cost, whilst remaining competitive at high dimensional inference tasks.

In Section 2 we clarify notation, recall necessary theory, and outline the vanilla and hybrid SVGD algorithms. Section 3 contains the theoretical contributions, with proofs relegated to Appendix A. Numerical experiments are in Section 4 with additional experiments and details in Appendix B.

Published in Transactions on Machine Learning Research (12/2025)

2 Background

2.1 Notation

Let X Rd. Let π denote the target probability density on X and let sπ(x) = x log π(x). We will often write π for the corresponding measure. Assume that π(x) = e V (x) for some potential V . Let P(X) be the set of probability measures on X and let PV (X) denote the subset where µ PV := R

X (1+V (x))dµ(x) < . For each p 1, let Pp(X) denote the subset satisfying µ Pp := R

X x pdµ(x) < and deﬁne the Wasserstein

p-distance between the two measures µ, ν Pp(X) as Wp(µ, ν) := infγ Γ(µ,ν) R x y pdγ(µ, ν) 1/p, where Γ(µ, ν) is the set of couplings between µ and ν. Let L c (X) denote the set of probability densities bounded almost everywhere with compact support. Let L2(µ) denote the set of functions that are square integrable with respect to the measure µ. Given µ P(X) and a smooth, invertible transform T : X X, let T#µ denote the pushforward measure of µ through T . The KL divergence between two measures µ, ν P(X) is denoted by KL(µ ν).

2.2 Reproducing Kernel Hilbert Spaces

A function k : X X R is positive deﬁnite if P i,j aik(xi, xj)aj > 0 for any choice of a1, . . . , ad R and x1, . . . , xd X. Given a Hilbert space H of functions ϕ : X R, a function k : X X R is said to be a reproducing kernel for H if it satisﬁes the reproducing property, ϕ(x) = ϕ, k(x, ) H for all ϕ H. A positive deﬁnite k : X X R admits a unique Hilbert space H of functions ϕ : X R for which the Dirac functionals δx : H R, δxϕ = ϕ(x) are all continuous and k is a reproducing kernel. This Hilbert space is called the reproducing kernel Hilbert space (RKHS) of k and it is equal to the closure of the span of {k(x, ) : x R}. Let Hd = H H denote the Hilbert space of functions ϕ : X Rd whose components are all in H, and equip it with the usual inner product ϕ, ψ Hd = Pd i=1 ϕi, ψi H. Given two kernels k1, k2 : X X R, let H1, H2 denote their respective RKHS. An important kernel used throughout this paper is the radial basis function (RBF) kernel k RBF(x, y; h) := exp( x y 2 2 /(2h)) with bandwidth h > 0. For a thorough treatment of RKHS we refer the reader to Aronszajn (1950), Steinwart & Christmann (2008) and Berlinet & Thomas-Agnan (2011).

2.3 Stein Variational Gradient Descent

The key result from Liu & Wang (2016) identiﬁes a transform T : X X that optimally decreases the KL divergence from an arbitrary probability measure to π. More precisely, let H be an RKHS with kernel k : X X R and consider transforms of the form T (x) = x + ϵϕ(x) where ϵ > 0 and ϕ is in the unit ball {ϕ Hd : ϕ Hd 1}. The maximum value of

ϵKL(T#µ π)|ϵ=0 (1)

occurs at ϕk µ,π/ ϕk µ,π Hd, where

ϕk µ,π( ) := Ex µ [k(x, )sπ(x) + xk(x, )] . (2)

When µ is an empirical distribution (i.e. a sum of Dirac measures), the expectation in (2) can be computed exactly by summing over the particles of each Dirac measure. Using this observation, the SVGD algorithm starts with an initial set of N particles (xi 0)N i=1 and iteratively applies the transform T with (2) as the update direction. At each iteration ℓ, this yields a set of particles (xi ℓ)N i=1 and a corresponding empirical distribution µℓ= 1

i δxi ℓ. This is captured in Algorithm 1. The intention is that after suﬃciently many iterations, the set of particles will resemble samples from π and expectations of the form Ex πh(x) can be approximated by Ex µℓh(x) = 1 N P i h(xi ℓ). We also recall the deﬁnition of the kernelised Stein discrepancy (KSD) from Liu et al. (2016),

Sk(µ, π) := Ex,y µ δπ,µ(x) k(x, y)δπ,µ(x)

where δπ,µ := sπ(x) sµ(x).

Published in Transactions on Machine Learning Research (12/2025)

Algorithm 1 Stein Variational Gradient Descent (Liu & Wang, 2016)

Input: A target probability distribution π, a kernel k, an initial set of particles (xi 0)N i=1 in X, and a sequence of step sizes (ϵℓ). Output: A set of particles (xi)N i=1 in X whose empirical distribution approximates π. for iteration ℓdo

xi ℓ+1 xi ℓ+ ϵℓˆϕ µℓ,π(xi ℓ), i = 1, . . . , N

ˆϕ µℓ,π(x) = 1

j=1 k(xj ℓ, x)sπ(xj ℓ) | {z } driving force

+ xj ℓk(xj ℓ, x) | {z } repulsive force

Algorithm 2 Hybrid Kernel Stein Variational Gradient Descent

Input: A target probability distribution π, two kernels k1, k2, an initial set of particles (xi 0)N i=1 in X, and a sequence of step sizes (ϵℓ). Output: A set of particles (xi)N i=1 in X whose empirical distribution approximates π. for iteration ℓdo

xi ℓ+1 xi ℓ+ ϵℓˆϕ µℓ,π(xi ℓ), i = 1, . . . , N

ˆϕ µℓ,π(x) = 1

j=1 k1(xj ℓ, x)sπ(xj ℓ) | {z } driving force

+ xj ℓk2(xj ℓ, x) | {z } repulsive force

2.4 Hybrid Kernel Stein Variational Gradient Descent

The SVGD update in (4) contains two terms, each using the same kernel. The ﬁrst term, often referred to as the driving term, uses the score function to move particles towards regions of high probability density, and the repulsive term prevents particles from collapsing at the modes. The h-SVGD variant proposed by D Angelo et al. (2021) uses a diﬀerent kernel in each term. Let k1 denote the kernel that appears alongside the score function, and let k2 denote the repulsive kernel. For the remainder of this paper, k1 and k2 will both be positive deﬁnite. We present h-SVGD in Algorithm 2.

3 Theoretical Results

3.1 Deﬁnitions and Assumptions

A function f : X R is in the Stein class of π if it is smooth and satisﬁes R

x X x (f(x)π(x)) dx = 0. A function f = (f1, . . . , fd) : X Rd is in the Stein class of π if each fi belongs to the Stein class of π. A kernel k : X X R is in the Stein class of π if it has continuous second order partial derivatives and both k(x, ) and k( , y) are in the Stein class of π for all x, y X. The hybrid Stein operator acts on a pair of kernels k1, k2 : X X R by

Sπ (k1, k2)(x, ) := k1(x, )sπ(x) + xk2(x, ),

provided k1 and k2 both belong to the Stein class of π. This reduces to the Stein operator deﬁned in Liu et al. (2016) when k1 = k2. Motivated by the h-SVGD update in (5), deﬁne the update direction

ϕk1,k2 µ,π ( ) := Ex µ [Sπ (k1, k2)( , x)] , (6)

and write ϕ = ϕk1,k2 µ,π / ϕk1,k2 µ,π ( ) H1 for the normalised direction. Let G( ; k1, µ, π) := Ex µ [k1(x, )sπ(x)] and R( ; k2, µ) := Ex µ [ xk2(x, )] denote the driving (or gradient) term and the repulsive term respec-

Published in Transactions on Machine Learning Research (12/2025)

tively. We can then write ϕk1,k2 µ,π ( ) = G( ; k1, µ, π) + R( ; k2, µ). The update transform

T k1,k2 µ,π (x) = x + ϵϕk1,k2 µ,π (x) (7)

and the map Φk1,k2 π : µ 7 T k1,k2 µ,π

# µ characterise the h-SVGD dynamics. For each ℓ, deﬁne

µN ℓ+1 := Φk1,k2 π (µN ℓ), µ ℓ+1 := Φk1,k2 π (µ ℓ), (8)

where µN 0 is the empirical measure of the initial particles (xi 0)N i=1 drawn i.i.d. from some µ 0 .

All technical assumptions required in the theorems throughout this section are detailed here for completeness. The ﬁrst set of assumptions relate to the potential of the target distribution.

(A1) V C (X), V 0, and lim|x| V (x) = + .

(A2) There exist constants CV > 0 and q > 1 such that for all x, y X,

|V (x)|q CV (1 + V (x))

sup θ [0,1]

2V (θx + (1 θ)y) q CV (1 + V (x) + V (y)).

(A3) For any α, β > 0, there exists a constant Cα,β > 0 such that

(1 + |x|)(| V (y)| + 2V (y) ) Cα,β(1 + V (x))

whenever |y| α |x| + β.

(A4) The Hessian HV of V is well-deﬁned and satisﬁes HV op M for some M > 0.

Assumptions (A1), (A2) and (A3) are identical to those from Lu et al. (2019). Assumption (A4) is identical to Assumption (A2) from Korba et al. (2020), and Assumption 2.1 from Salim et al. (2022). Assumptions on the kernels are also required.

(B1) There exist symmetric functions K1, K2 : X R such that ki(x, y) = Ki(x y) for i = 1, 2, K1 is C2 with bounded derivatives, and K2 is C4 with bounded derivatives. We use B > 0 as a bound for all derivatives in the proofs.

(B2) There exists a constant D > 0 such that both k1 and k2 are D-Lipschitz, and such that V ( )k1( , z) is D-Lipschitz for each z. That is,

|k1(x, x ) k1(y, y )| D ( x y 2 + x y 2) ,

xk2(x, x ) yk2(y, y ) D ( x y 2 + x y 2) ,

V (x)k1(x, z) V (y)k1(y, z) D ( x y 2)

for all x, x , y, y , z X.

Assumption (B1) is a slight relaxation of Assumption 2.1 from Lu et al. (2019), and contains Assumption 2.6 from Salim et al. (2022). The ﬁrst two parts of Assumption (B2) are hybrid kernel versions of Assumption (B2) from Korba et al. (2020), and the third part of Assumption (B2) relaxes the restrictive Assumption (B1) from Korba et al. (2020).

Published in Transactions on Machine Learning Research (12/2025)

3.2 Large Particle Limit

We begin our theoretical study with a result on the deviation of the empirical distributions of h-SVGD to the large particle limit. The single kernel version (Korba et al., 2020) of the following result uses an assumption that is quite restrictive. It requires |V (x)| CV for some constant CV > 0, which rules out even a Gaussian target distribution. We relax this with the third part of Assumption (B2) and provide an updated proof in the appendix.

Proposition 3.1. Assume (A1), (A4), (B1) and (B2), and let T > 0. For any 0 ℓ T

ϵℓ, there exists a constant L depending on k1, k2 and p such that

E W 2 2 (µN ℓ, µ ℓ) 1 2

var(µ 0 )e LT (e2LT 1).

3.3 Hybrid Stein PDE

In the continuous time limit, equation (5) becomes a coupled system of diﬀerential equations,

j=1 k1(xj, xi)sπ(xj) + xjk2(xj, xi) (9)

for i = 1, . . . , N. In the mean ﬁeld limit, integration by parts and the identity ρ = ρ log ρ yields

dt = Z k1(x , x)sπ(x ) + x k2(x , x) ρ(x )dx (10)

= Z k1(x , x) log π(x )ρ(x )dx Z k2(x , x) x ρ(x )dx

= Tk1,ρ ( log π) (x) Tk2,ρ ( log ρ) (x), (11)

where Tk,ρ : L2(ρ)d Hd k is the Hilbert-Schmidt operator given by (Tk,ρf)(x) = R k(x , x)f(x )ρ(x )dx for a kernel k. Recalling that V is the potential of π and so V = log π, this mean ﬁeld limit can be described by the hybrid Stein partial diﬀerential equation

tρt = (ρt (K1 V ρt + K2 ρt)) . (12)

Deﬁnition 3.2. Given a probability measure ν on Rd, a map X(t, x; ν) : [0, ) Rd Rd that is C1 with respect to t and satisﬁes

t X(t, x; ν) = (K1 ( V ρt)) (X(t, x; ν)) ( K2 ρt) (X(t, x; ν)) ρt = X(t, , ν)#ν (13)

X(0, x; ν) = x

is called a mean ﬁeld characteristic ﬂow of (10) or of (12).

We now generalise Theorem 2.4 from Lu et al. (2019), ensuring the existence of a solution to the hybrid Stein PDE. First, deﬁne Y := {u C(X, X) : supx X |u(x) x| < } with d Y (u, v) = supx X |u(x) v(x)| and note that (Y, d Y ) is a complete metric space.

Proposition 3.3. Assume (A1), (A2), (A3) and (B1), and let T > 0. Then there exists a unique solution X( , , ν) C1([0, T]; Y ) to (13) and the corresponding ρt is a weak solution to (12) satisfying

ρt PV π PV exp (C min ( K1 , K2 ) t) (14)

for some constant C > 0 depending on K1, K2 and V .

The second kernel enables a stronger bound than Theorem 2.4 from Lu et al. (2019) by careful modiﬁcation of the telescoping section of the proof (see Appendix A and Equation (3.8)). In particular, ensuring that K1 < K2 when choosing K2 yields a stronger bound in (14) than if K1 were used for both kernels. We remark that this bound describes regularity of the solution to the PDE, not a rate of convergence.

Published in Transactions on Machine Learning Research (12/2025)

3.4 Kernelised Wasserstein Gradient Flow and Asymptotic Density: k2 = ck1

Zhuo et al. (2018) uncovered a correlation between dimension and the magnitude of the repulsive force R, as deﬁned at the beginning of Section 3.1. Under some technical conditions, for any α, δ (0, 1), they show that with probability at least 1 δ, SVGD under an RBF kernel yields R( ; k, µ) = O(d α). This suggests that simply scaling the repulsive force by dα for some α (0, 1) should oﬀset the decrease in R( ; k2, µ) in high dimensions, thereby alleviating variance collapse at negligible additional computational cost. Scaling the repulsive kernel in this way corresponds to h-SVGD where k1 is an RBF kernel and k2 = dαk1. This motivates our study of the h-SVGD gradient ﬂow under the special case k2 = ck1.

In the case where k = k1 = k2, equations (10) and (12) describe a kernelised Wasserstein gradient ﬂow of the form tρt = (ρt Tk,ρt W F(ρt)), where

F(ρ) = KL(ρ π) = Ex ρ [log ρ(x) log π(x)]

is the KL divergence functional (Liu, 2017). Recall that a functional derivative of F is a measurable function δF

δρ (ρ) satisfying

d dϵF(ρ + ϵχ) ϵ=0 = Z δF

for all perturbations χ = ρ ρ with ρ L c (X) P(X) (Santambrogio, 2015, Deﬁnition 7.12). In particular δF

δρ = log ρ log π + 1 up to a constant, and W F(ρ) = δF

δρ (ρ) = log ρ log π is the Wasserstein gradient of F. Using (11) and the linearity of Tk,ρ, the corresponding Fokker-Planck equation is then

tρt = (ρt Tk,ρt ( log ρt log π)) (15)

where {ρt : t 0} is a curve of probability densities. The following result generalises this gradient ﬂow interpretation to the case where k2 = ck1. Note that it applies to any positive deﬁnite kernel k1.

Proposition 3.4. Given a positive deﬁnite kernel k1 and constant c > 0, let k2 = ck1. Then the mean ﬁeld dynamics of h-SVGD describe a kernelised Wasserstein gradient ﬂow on the functional

F(ρ) = Ex ρ [c log ρ log π(x)] . (16)

The corresponding continuity equation is

tρt = (ρt Tk1,ρt (c log ρt log π)) . (17)

Even in this simple hybrid kernel setting, the following result establishes that the limiting distribution ρ of the mean ﬁeld regime is not equal to the target distribution π.

Corollary 3.5. If k2 = ck1 for some constant c > 0 where k1 is a positive deﬁnite kernel, then the mean ﬁeld h-SVGD has a ﬁxed point ρ (x) π(x)1/c.

Although Corollary 3.5 applies to any target density π satisfying (A1), it is insightful to consider a Gaussian target. If π(x) = N(x; µ, Σ), then ρ (x) = N(x; µ, cΣ). So scaling the repulsive kernel k2 adjusts the variance of the target by the same factor. This supports the motivation that scaling the repulsive kernel should oﬀset the variance underestimation in the ﬁnite particle regime in high dimensions at a negligible additional cost. This idea will be revisited in Section 4. Note that Corollary 3.5 extends the c = 1 case where SVGD converges to the target π in the mean ﬁeld limit (Liu & Wang, 2016).

We now generalise an existing result (Korba et al., 2020) that describes the dissipation of the KL divergence along the SVGD gradient ﬂow. The result below describes the dissipation of the functional in (16) along the h-SVGD gradient ﬂow, ensuring that the functional always decreases. It also describes the dissipation of the KL divergence with respect to the mean ﬁeld limiting distribution ρ , which we emphasise is not equal to the target distribution π, as per Corollary 3.5.

Published in Transactions on Machine Learning Research (12/2025)

Proposition 3.6. Under the assumptions of Proposition 3.4,

d dt F(ρt) = c2 Tk1,ρt( log ρ log ρ ) 2 Hd 1 + c Z ρt(x)

t log ρt(x)dx, (18)

d dt KL(ρt ρ ) = c Tk1,ρt( log ρ log ρ ) 2 Hd 1 + Z ρt(x)

t log ρt(x)dx, (19)

where ρ (x) π(x)1/c is the mean ﬁeld ﬁxed point. Furthermore, Z ρt(x)

t log ρt(x)dx 0,

so d dt F(ρt) 0 and d dt KL(ρ ρ ) 0 for t 0, and d dt F(ρ ) = 0.

The following descent lemma is adapted from Liu (2017) and provides a discrete time version of Proposition 3.6. We use µℓto denote discrete time steps, as in (8), as opposed to ρt for the continuous time analysis. We remark that other descent lemmas for SVGD have been proved (Korba et al., 2020; Salim et al., 2022).

Proposition 3.7. Under the assumptions of Proposition 3.4, let σ( ) denote the spectral radius and set ϵℓ (2 supx σ( ϕ (x) + ϕ (x) )) 1. Deﬁne R := supx{ 1

2 log π Lip k1(x, x) + 2 x,yk1(x, x)} where

f Lip := supx =y |f(x) f(y)|

x y 2 is the Lipschitz norm and x,yk(x, x) := P

i xi yik(x, y)|x=y. Then

F(µ ℓ+1) F(µ ℓ) c(1 ϵℓR) S(µℓ, ρ ).

3.5 Kernelised Wasserstein Gradient Flow: The General Case

In this section, we present a generalisation of Proposition 3.4 and discuss some diﬃculties in ﬁnding kernels that satisfy the required conditions. For ease of presentation, we restrict our attention to d = 1. Throughout, we assume (B1) and that any required Fourier transforms exist. Deﬁne the function r : X R by

r(x; ρ) := F 1 F(K2)

F(K1) F( ρ) (x) (20)

where F denotes the Fourier transform (see proof in Appendix A), and let R : X R be a function satisfying R(x; ρ) = r(x; ρ)/ρ(x).

Proposition 3.8. Assume that both r and R exist. Then the corresponding continuity equation is

tρt = (ρt Tk1,ρt ( R( ; ρt) log π)) . (21)

If in addition Z ϵR(x; ρ + ϵχ)d(ρ + ϵχ)(x) ϵ=0 = 0 (22)

for any measure χ = ρ ρ with ρ L c (X) P(X), then the mean ﬁeld dynamics of h-SVGD describe a kernelised Wasserstein gradient ﬂow on the functional

F(ρ) = Ex ρ [R(x; ρ) log π(x)] . (23)

Note that k2 = ck1 implies F(K2)/F(K1) = c and so r(x; ρ) = c ρ(x). Therefore, R(x; ρ) = c log ρ(x), and so (21) reduces to (17). Furthermore, the left hand side of (22) is equal to c R dχ(x) = c R d ρ(x) c R dρ(x), which is zero because ρ, ρ are both probability measures. So (22) is satisﬁed in this special case. We remark that verifying (22) remains a challenge in general hybrid kernel settings. However, under this assumption, the mean ﬁeld dynamics of h-SVGD converges to a distribution that is not the target distribution.

Proposition 3.9. If k1 = k2, both r and R exist, and (22) is satisﬁed, then any ﬁxed point ρ of the gradient ﬂow in Proposition (3.8) is not equal to π.

Published in Transactions on Machine Learning Research (12/2025)

The example below demonstrates some behaviour of SVGD in the general hybrid kernel setting. In particular, the form of the target π and the mean ﬁeld steady state ρ can look quite diﬀerent.

Example 3.10. Let h1, h2, σ > 0 and assume that h := h2 h1 = 0. Let ki(x, y) = k RBF(x, y; hi) for i = 1, 2, and let π(x) exp( α exp( x2

2β )) be the target on X = R where

h, β = σ2(σ2 + h)

Then ρ (x) = N(x; 0, σ2) is a ﬁxed point of the h-SVGD mean ﬁeld dynamics.

We brieﬂy remark that the right hand side of Equation (20) exists in this setting of two RBF kernels with bandwidths h2 > h1 provided ρ decays suﬃciently fast. In particular,

F(K2)(ω) F(K1)(ω) F( ρ)(ω) = i2πω

h2 h1 exp 2π2(h2 h1)ω2 F(ρ)(ω).

We also remark that Section 4 focuses on experimental results in the k2 = ck1 case due to its capacity to improve variance estimation. We leave a more detailed study of the general hybrid setting for future work.

4 Experiments

Although Corollary 3.5 shows that h-SVGD does not converge to the target distribution, we demonstrate in this section that it has the ability to improve variance estimation when compared to SVGD. Furthermore, it does this at no extra computational cost, and without any assumptions on the structure of the posterior, as is common in other SVGD variants that alleviate variance collapse. We measure variance collapse using dimension averaged marginal variance (DAMV), 1 d Pd j=1 Varj {xi}N i=1 , as is standard in the literature (Zhuo et al., 2018; Ba et al., 2019; 2021; Gong et al., 2021). Further details on the the experimental details can be found in Appendix B.1

Before describing the ﬁrst numerical example, we reiterate the intuition behind the ability of h-SVGD to improve variance estimation. In high dimensions, with a ﬁnite number of particles, SVGD is well known to suﬀer from variance collapse (Wang et al., 2018; Ba et al., 2019). Indeed, for any α, δ (0, 1), the repulsive force of SVGD under an RBF kernel decreases as R( ; k, µ) = O(d α) with probability at least 1 δ (Zhuo et al., 2018). By Corollary 3.5, under a Gaussian target distribution, a repulsive kernel k2 = ck1 with c > 0 will scale the variance of the limiting distribution by c in the inﬁnite particle limit. So scaling the repulsive kernel by c = dα for some α (0, 1) should compensate for the decrease in R( ; k2, µ) , and as a consequence oﬀset the variance collapse in high dimensions in the ﬁnite particle regime. The factor α should be tuned and we expect that a suitable choice of α will depend on the number of particles N, the target distribution π and the kernel k1. We leave a more detailed theoretical analysis guiding the choice of α for future work. In the experiments that follow, we found that α = 0.5 provided a good balance between variance estimation and inference.

4.1 Multivariate Gaussian Mixture

In this ﬁrst example, we sample from a high-dimensional mixture of Gaussians with 10 equally weighted components. The mean vectors are randomly initialised from N(0, Id), and the covariance matrices are c Id, where c > 0 is a factor that makes the DAMV equal to 1. We repeat this for dimensions up to d = 1000 at intervals of 100. We choose to sample N = 50 particles in order to demonstrate the performance of h SVGD when d is much greater than N, as is often the case in high dimensional Bayesian inference. Particles are initialised from N(0, Id) and each SVGD variant is run for 2000 iterations with an initial step size of ϵ = 0.01, adapted using Ada Grad. We run SVGD and sliced SVGD (SSVGD) (Gong et al., 2021) with the RBF kernel k RBF and compare against h-SVGD with kernels k1 = k RBF and k2 =

d k RBF. All algorithms use h = med2/ log(N) as the bandwidth, where med is the median pairwise distance between particles (Liu

1The python code for reproducing these experiments is available at https://github.com/anson-macdonald-unsw/h-SVGD

Published in Transactions on Machine Learning Research (12/2025)

& Wang, 2016). For SSVGD, there is one bandwidth per dimension, so the median distances are computed along each projection, as described in Gong et al. (2021). We also compute the energy distance between the samples generated by each method and the target distribution as described in Székely & Rizzo (2013). All conﬁgurations are run 10 times with a diﬀerent initialisation and results for each conﬁguration are averaged.

Figure 1 demonstrates that h-SVGD provides an uplift in marginal variance estimation and an improved energy distance when compared to SVGD. Although SSVGD estimates the variance fairly consistently as the dimension increases, Figure 1 shows that this comes at a signiﬁcant increase in runtime, whereas there is no noticeable diﬀerence between the runtimes of SVGD and h-SVGD. Furthermore, h-SVGD outperforms SSVGD in terms of the energy distance. Figure 2 shows projections of the N = 50 particles onto the ﬁrst two two dimensions along with a contour plot of the marginal density along those two dimensions. This visual representation emphasises the extent to which both h-SVGD and SSVGD oﬀset variance collapse in the ﬁnite particle setting. Further ﬁgures are included in Appendix B.

(b) Energy Distance

(c) Time (seconds)

Figure 1: DAMV, energy distance, and runtime of diﬀerent SVGD variants sampling from high-dimensional Gaussian mixtures.

Figure 2: Projections onto the ﬁrst two dimensions of particles sampled from a high-dimensional Gaussian mixture using diﬀerent SVGD variants. The contour plot of the corresponding marginal density is overlaid.

4.2 Bayesian Neural Network

In this section, we sample weights from a Bayesian neural network (BNN). Aside from scaling the repulsive kernel by

d, as described in the previous experiment, our setup is identical to Liu & Wang (2016). For completeness, details are included in Appendix B. Using the no-U-turn sampler (NUTS) (Hoﬀman et al., 2014) with 10 parallel chains, we generated 10 000 ground truth samples for 8 of the 10 datasets. The Protein and Year datasets were large enough to make NUTS prohibitively slow. We then use these ground

Published in Transactions on Machine Learning Research (12/2025)

Figure 3: Energy distance of BNN samples generated by each SVGD variant as compared against ground truth samples generated by NUTS. Ranges indicate one standard deviation above and below the mean over repeated trials.

truth samples to compute the energy distance (Székely & Rizzo, 2013) of samples generated by each variant. Table 1 shows that with no additional computational cost, the problem of variance collapse is mitigated under h-SVGD through an increased DAMV. Table 2 demonstrates that it remains competitive at inference through improved test log-likelihood (LL) for all datasets and improved root mean squared error (RMSE) for all but one dataset. Appendix B includes a comparison of the same metrics between h-SVGD and SSVGD to emphasise that the alleviation of variance collapse under h-SVGD comes at a lower cost. Figure 3 shows that across all datasets except Wine, the samples generated by h-SVGD are closer to the ground truth than those generated by SVGD, as measured by the energy distance. The h-SVGD samples are also closer to the ground truth than the SSVGD samples for all but two datasets.

Table 1: DAMV and runtime in seconds of SVGD and h-SVGD. Ranges indicate one standard deviation above and below the mean over repeated trials. NUTS runtime is included for those datasets where it was computationally feasible.

DAMV Runtime (seconds) Dataset SVGD h-SVGD SVGD h-SVGD NUTS Boston 0.051 0.011 0.112 0.015 0.112 0.015 0.112 0.015 28.9 1.4 28.6 0.8 28.6 0.8 28.6 0.8 133 Concrete 0.084 0.010 0.120 0.010 0.120 0.010 0.120 0.010 28.7 1.3 28.7 1.3 28.7 1.3 28.8 1.8 189 Energy 0.065 0.015 0.154 0.023 0.154 0.023 0.154 0.023 30.8 2.8 30.8 2.8 30.8 2.8 30.9 2.5 167 Kin8nm 0.105 0.003 0.105 0.003 0.105 0.003 0.102 0.003 35.9 1.3 35.9 1.3 35.9 1.3 36.2 1.3 1 748 Naval 0.059 0.004 0.091 0.019 0.091 0.019 0.091 0.019 33.4 0.8 33.1 1.5 33.1 1.5 33.1 1.5 51 Combined 0.128 0.008 0.145 0.007 0.145 0.007 0.145 0.007 36.3 1.5 36.3 1.5 36.3 1.5 36.8 2.7 954 Protein 0.089 0.001 0.089 0.001 0.089 0.001 0.087 0.001 72.7 1.0 72.7 1.0 72.7 1.0 72.8 1.7 NA Wine 0.068 0.005 0.090 0.003 0.090 0.003 0.090 0.003 29.5 1.4 29.5 1.4 29.5 1.4 29.8 1.1 393 Yacht 0.060 0.020 0.194 0.034 0.194 0.034 0.194 0.034 29.3 0.4 29.3 0.4 29.3 0.4 29.6 0.4 83 Year 0.011 0.000 0.012 0.000 0.012 0.000 0.012 0.000 592 24 563 23 563 23 563 23 NA

5 Conclusion

In this paper we have developed the mean ﬁeld theory of h-SVGD by proving the existence of a solution to the hybrid Stein PDE and identifying it as a gradient ﬂow on a functional other than the KL divergence. We characterised the mean ﬁeld ﬁxed point for the special case k2 = ck1 and demonstrated that h-SVGD does not converge to the target in the mean ﬁeld limit unless c = 1. This suggests that h-SVGD may

Published in Transactions on Machine Learning Research (12/2025)

Table 2: Average RMSE and LL of SGVD and h-SVGD evaluated on the test dataset. Ranges indicate one standard deviation above and below the mean over repeated trials.

Test RMSE Test LL Dataset SVGD h-SVGD SVGD h-SVGD Boston 3.094 0.579 3.034 0.587 3.034 0.587 3.034 0.587 -2.123 0.116 -1.959 0.148 -1.959 0.148 -1.959 0.148 Concrete 5.857 0.468 5.384 0.504 5.384 0.504 5.384 0.504 -2.616 0.099 -2.499 0.128 -2.499 0.128 -2.499 0.128 Energy 1.528 0.169 1.157 0.138 1.157 0.138 1.157 0.138 -1.702 0.094 -1.072 0.096 -1.072 0.096 -1.072 0.096 Kin8nm 0.124 0.005 0.100 0.003 0.100 0.003 0.100 0.003 -1.293 0.108 -0.314 0.104 -0.314 0.104 -0.314 0.104 Naval 0.006 0.000 0.004 0.000 0.004 0.000 0.004 0.000 -1.353 0.161 -0.404 0.140 -0.404 0.140 -0.404 0.140 Combined 4.105 0.220 4.072 0.220 4.072 0.220 4.072 0.220 -2.459 0.051 -2.367 0.050 -2.367 0.050 -2.367 0.050 Protein 4.791 0.025 4.679 0.025 4.679 0.025 4.679 0.025 -2.633 0.035 -2.511 0.023 -2.511 0.023 -2.511 0.023 Wine 0.637 0.044 0.631 0.045 0.631 0.045 0.631 0.045 -1.463 0.120 -0.819 0.080 -0.819 0.080 -0.819 0.080 Yacht 1.677 0.571 1.677 0.571 1.677 0.571 1.886 0.664 -1.587 0.120 -1.045 0.160 -1.045 0.160 -1.045 0.160 Year 8.882 0.043 8.804 0.039 8.804 0.039 8.804 0.039 -2.908 0.031 -2.890 0.019 -2.890 0.019 -2.890 0.019

not converge to the target more generally, even in its continuous time, mean-ﬁeld form. We provided a result on the dissipation of the new functional, as well as a discrete time version, otherwise known as a descent lemma. We also highlighted the complexities of the gradient ﬂow in the general hybrid kernel setting. Despite non-convergence to the target, experimental results demonstrated that h-SVGD can alleviate variance collapse in high dimensions observed in ﬁnite-particle SVGD at a much lower cost than other SVGD variants. We also showed that h-SVGD maintains its performance on high dimensional inference tasks, whilst improving variance estimation without the additional computational cost required of other SVGD variants. One interesting direction for future research is to ﬁnd a principled method of scaling the repulsive kernel. Another avenue is to further develop the theory of h-SVGD in the general hybrid kernel setting. It would also be interesting to incorporate tempered target densities, as considered in (Chehab et al., 2024).

Acknowledgments

SP gratefully acknowledges funding from the Eva Mayr-Stihl foundation.

Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337 404, 1950.

Jimmy Ba, Murat A Erdogdu, Marzyeh Ghassemi, Taiji Suzuki, Shengyang Sun, Denny Wu, and Tianzong Zhang. Towards characterizing the high-dimensional bias of kernel-based particle inference algorithms. In Second Symposium on Advances in Approximate Bayesian Inference, 2019.

Jimmy Ba, Murat A Erdogdu, Marzyeh Ghassemi, Shengyang Sun, Taiji Suzuki, Denny Wu, and Tianzong Zhang. Understanding the variance collapse of SVGD in high dimensions. In International Conference on Learning Representations, 2021.

Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.

Omar Chehab, Anna Korba, Austin Stromme, and Adrien Vacher. Provable convergence and limitations of geometric tempering for langevin dynamics. ar Xiv preprint ar Xiv:2410.09697, 2024.

Peng Chen and Omar Ghattas. Projected Stein variational gradient descent. Advances in Neural Information Processing Systems, 33:1947 1958, 2020.

Sinho Chewi, Thibaut Le Gouic, Chen Lu, Tyler Maunu, and Philippe Rigollet. SVGD as a kernelized Wasserstein gradient ﬂow of the chi-squared divergence. Advances in Neural Information Processing Systems, 33:2098 2109, 2020.

Published in Transactions on Machine Learning Research (12/2025)

Francesco D Angelo and Vincent Fortuin. Annealed Stein variational gradient descent. In Third Symposium on Advances in Approximate Bayesian Inference, 2021.

Francesco D Angelo, Vincent Fortuin, and Florian Wenzel. On Stein variational neural network ensembles. ar Xiv preprint ar Xiv:2106.10760, 2021.

Andrew Duncan, Nikolas Nüsken, and Lukasz Szpruch. On the geometry of Stein variational gradient descent. Journal of Machine Learning Research, 24:1 39, 2023.

Wenbo Gong, Yingzhen Li, and José Miguel Hernández-Lobato. Sliced kernelized Stein discrepancy. In International Conference on Learning Representations, 2021.

Matthew D Hoﬀman, Andrew Gelman, et al. The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo. J. Mach. Learn. Res., 15(1):1593 1623, 2014.

Anna Korba, Adil Salim, Michael Arbel, Giulia Luise, and Arthur Gretton. A non-asymptotic analysis for Stein variational gradient descent. Advances in Neural Information Processing Systems, 33:4672 4682, 2020.

Chang Liu and Jun Zhu. Riemannian Stein variational gradient descent for Bayesian inference. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, volume 32, 2018.

Qiang Liu. Stein variational gradient descent as gradient ﬂow. Advances in Neural Information Processing Systems, 30, 2017.

Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. Advances in Neural Information Processing Systems, 29, 2016.

Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-ﬁt tests. In International Conference on Machine Learning, pp. 276 284. PMLR, 2016.

Xing Liu, Harrison Zhu, Jean-François Ton, George Wynne, and Andrew Duncan. Grassmann Stein variational gradient descent. In Proceedings of The 25th International Conference on Artiﬁcial Intelligence and Statistics, volume 151, pp. 2002 2021. PMLR, 2022.

Jianfeng Lu, Yulong Lu, and James Nolen. Scaling limit of the Stein variational gradient descent: The mean ﬁeld regime. SIAM Journal on Mathematical Analysis, 51(2):648 671, 2019.

Adil Salim, Lukang Sun, and Peter Richtarik. A convergence theory for SVGD in the population limit under Talagrands inequality T1. In International Conference on Machine Learning, pp. 19139 19152. PMLR, 2022.

Filippo Santambrogio. Optimal transport for applied mathematicians. Birkäuser, NY, 55(58-63):94, 2015.

Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008.

Gábor J Székely and Maria L Rizzo. Energy statistics: A class of statistics based on distances. Journal of statistical planning and inference, 143(8):1249 1272, 2013.

Dilin Wang, Zhe Zeng, and Qiang Liu. Stein variational message passing for continuous graphical models. In International Conference on Machine Learning, pp. 5219 5227. PMLR, 2018.

Dilin Wang, Ziyang Tang, Chandrajit Bajaj, and Qiang Liu. Stein variational gradient descent with matrixvalued kernels. Advances in Neural Information Processing Systems, 32, 2019.

Jingwei Zhuo, Chang Liu, Jiaxin Shi, Jun Zhu, Ning Chen, and Bo Zhang. Message passing Stein variational gradient descent. In International Conference on Machine Learning, pp. 6018 6027. PMLR, 2018.

Published in Transactions on Machine Learning Research (12/2025)

Lemma A.1. Under assumptions (A1), (A4), (B1), (B2) the map

(z, µ) 7 E(z, µ) := Z

X k1(x, z) V (x) + xk2(x, z)dµ(x)

is L-Lipschitz. That is,

E(z, µ) E(z , µ ) 2 L( z z 2 + W2(µ, µ ))

where L > 0 depends on k1, k2 and V .

Proof. Largely following the proof of Lemma 14 Korba et al. (2020), choosing an optimal coupling γ of µ and µ ,

E(z, µ) E(z , µ ) 2 Eγ [ V (x)(k1(x, z) k1(x , z ))]

+ Eγ [( V (x ) V (x))k1(x , z )]

+ Eγ [ xk2(x, z) x k2(x , z )]

DEγ [ x x 2 + z z 2]

+ BMEγ [ x x 2]

+ DEγ [ x x 2 + z z 2]

(2D + BM) ( z z 2 + W2(µ, µ )) .

Note that the second term is bounded using the relaxed Assumption (B2) and there is no need to require that |V | is bounded by a constant.

Proof of Proposition 3.1. This follows identically to the proof of Proposition 7 in Korba et al. (2020) with Lemma A.1 in place of Lemma 14.

Proof of Proposition 3.3. The proof largely follows those of Theorems 2.4 and 3.2 in Lu et al. (2019) with some minor adjustments. Notably, after ﬁxing r > 0 and deﬁning

Yr := u Y : sup x X |u(x) x| < r

and the complete metric space

Sr := C([0, T0]; Yr),

d S(u, v) := sup t [0,T0] d Y (u(t), v(t))

for some suﬃciently small T0 (to be determined later), the operator G : u(t, ) 7 G(u)(t, ) must be modiﬁed to act on u Sr via

G(u)(t, x) = x Z t

X K2(u(s, x) u(s, x ))ν(dx )ds

X K1(u(s, x) u(s, x )) V (u(s, x ))ν(dx )ds.

Note that we use G instead of the F used in Lu et al. (2019) to avoid confusion between the functional F deﬁned in (16). The same techniques of Lu et al. (2019) are suﬃcient to establish the required bounds to show that G is a contraction on Sr for suﬃciently small T0. Note that Assumptions (B1), (A2) and (A3) are used to establish this. So the unique ﬁxed point X( , ; ν) Sr of G solves (13) in the interval [0, T0].

The min ( K1 , K2 ) term emerges because the telescoping in Equation (3.8) of Lu et al. (2019) can be performed with either kernel. The remainder of the proof follows Theorems 2.4 and 3.2 of Lu et al. (2019).

Published in Transactions on Machine Learning Research (12/2025)

Proof of Proposition 3.4. Following Deﬁnition 7.12 in Santambrogio (2015), a functional derivative of F is a measurable function δF

δρ (ρ) satisfying

d dϵF(ρ + ϵχ) ϵ=0 = Z δF

for all perturbations χ = ρ ρ with ρ L c (X) P(X). If it exists, δF

δρ (ρ) is unique up to additive constants.

We ﬁrst compute

d dϵF(ρ + ϵχ) ϵ=0

Z c log(ρ(x) + ϵχ(x))ρ(x)dx + ϵ Z c log(ρ(x) + ϵχ(x))χ(x)dx

Z log π(x)ρ(x)dx ϵ Z log π(x)χ(x)dx ϵ=0

= Z c χ(x) ρ(x) + ϵχ(x)ρ(x)dx + Z c log(ρ(x) + ϵχ(x))χ(x)dx

+ ϵ Z c χ(x) ρ(x) + ϵχ(x)χ(x)dx Z log π(x)χ(x)dx ϵ=0

= Z c log ρ(x) log π(x)χ(x)dx + c Z χ(x)dx. (24)

Since ρ, ρ P(X), the ﬁnal integral is zero. So the functional gradient of F is

δρ (ρ) = c log ρ log π.

Its Wasserstein gradient is then

W F(ρ) = c log ρ log π.

Since k2 = ck1, equation (10) can be written as

dt = Z k1(x , x) x log π(x ) + c x k1(x , x) ρ(x )dx

= Z k1(x , x) x log π(x )ρ(x ) ck1(x , x) x ρ(x )dx

= Z k1(x , x) x log π(x )ρ(x ) ck1(x , x) x log ρ(x )ρ(x )dx

= Tk1,ρ( log π c log ρ)(x)

= Tk1,ρ( W F(ρ))(x).

The continuity equation tρt + ρt dx

dt = 0 then becomes

tρt = (ρt Tk1,ρ( W F(ρ))(x))

= (ρt Tk1,ρt (c log ρt log π)) .

Published in Transactions on Machine Learning Research (12/2025)

Proof of Corollary (3.5). A measure ρ satisfying c log ρ = log π will be a ﬁxed point because this would imply W F(ρ) = 0. Solving this for ρ , we have

c log ρ (x) = log π(x)

c log ρ (x) = log π(x) + A

log(ρ (x)c) = log(e Aπ(x))

ρ (x)c = e Aπ(x)

ρ (x) = e A/cπ(x)1/c

for some A R.

Proof of Proposition 3.6. Using (17), integration by parts, and the fact that Tk1,ρ : L2(ρt)d Hd 1 is the adjoint of the inclusion ı : Hd 1 L2(ρt)d,

d dt F(ρt) = d

Z ρt(x) (c log ρt(x) log π(x)) dx

= Z (c log ρt(x) log π(x))

tρt(x) + cρt(x)

t log ρt(x)dx

= Z (c log ρt(x) log π(x)) (ρt(x)Tk1,ρt(c log ρt(x) log π)(x)) dx

t log ρt(x)dx

= Z (c log ρt(x) log π(x)) Tk1,ρt(c log ρt log π)(x)ρt(x)dx

t log ρt(x)dx

= c log ρt log π, Tk1,ρt(c log ρt log π) L2(ρt)d

t log ρt(x)dx

= Tk1,ρt(c log ρt log π) 2 Hd 1 + c Z ρt(x)

t log ρt(x)dx

= c2 Tk1,ρt( log ρt log ρ ) 2 Hd 1 + c Z ρt(x)

t log ρt(x)dx. (25)

F(ρt) = Ex ρt [c log ρt(x) log π(x)]

= Ex ρt [c log ρt(x) log (Aρ (x)c)]

= c Ex ρt [log ρt(x) log ρ (x)] + log(A) = c KL(ρt ρ ) + log(A) (26)

for some A R, we have 1

c d dt F(ρt) = d

dt KL(ρt ρ ). Therefore,

d dt KL(ρt ρ ) = 1

c Tk1,ρt(c log ρt log π) 2 Hd 1 + Z ρt(x)

t log ρt(x)dx

= c Tk1,ρt( log ρt log π1/c) 2

Hd 1 + Z ρt(x)

t log ρt(x)dx

= c Tk1,ρt( log ρt log ρ ) 2 Hd 1 + Z ρt(x)

t log ρt(x)dx.

Published in Transactions on Machine Learning Research (12/2025)

To simplify notation in the calculations below, set u = Tk1,ρt(c log ρt) and v = Tk1,ρt( log π). Recall the identity

u 2 + u v 2 v 2 . (27)

Now we apply the multivariate chain rule to the remainder term along with the fact about the adjoint of Tk1,ρt, the identity in (27), and the triangle inequality. This gives

t log ρt(x)dx = Z ρt(x) log ρt(x) dx

= Z ρt(x) log ρt(x) Tk1,ρt(c log ρt log π)(x)dx (28)

= log ρt, Tk1,ρt(c log ρt log π) L2(ρt)d

= Tk1,ρt( log ρt), Tk1,ρt(c log ρt log π) Hd 1

c u, u v Hd 1

u 2 Hd 1 + v 2 Hd 1 u v 2 Hd 1

u 2 Hd 1 + u v 2 Hd 1 + u 2 Hd 1 u v 2 Hd 1

The ﬁnal statement that d dt F(ρ ) = 0 follows from equations (25) and (28) by observing that c log ρ log π = 0.

Proof of Proposition 3.7. This follows from applying Theorem 3.3 Liu (2017) with ρ instead of the target, then substituting in (26).

Proof of Proposition 3.8. We ﬁrst deﬁne the Fourier transform of a real-valued function f to be the function F(f) deﬁned by

F(f)(ω) = Z f(x) exp( i2πω)dx,

given that the integral exists. Similarly, the inverse Fourier transform of a real-valued function F is the function F 1(F) deﬁned by

F 1(F)(x) = Z F(ω) exp(i2πωx)dω,

where the integral exists. To recover the continuity equation, apply the convolution theorem to (20)

F(K1) F( ρ) (x) = r(x; ρ)

F(K2)(ω) F( ρ)(ω) = F(K1)(ω) F(r)(ω)

F(K2 ρ)(ω) = F(K1 r)(ω)

(K2 ρ)(x) = (K1 r)(x) Z K2(x y) log ρ(y)ρ(y)dy = Z K1(x y)r(y)

(Tk2,ρ log ρ) (x) = (Tk1,ρ(r/ρ)) (x)

Published in Transactions on Machine Learning Research (12/2025)

Equation (10) can now be rewritten as

dt = Z k1(x , x) x log π(x ) + x k2(x , x) ρ(x )dx

= Z k1(x , x) x log π(x )ρ(x ) k2(x , x) x ρ(x )dx

= Z k1(x , x) x log π(x ) k2(x , x) x log ρ(x ) ρ(x )dx

= (Tk1,ρ log π) (x) (Tk2,ρ log ρ) (x)

= (Tk1,ρ ( log π R( ; ρ))) (x),

where R(x; ρ) := r(x;ρ)

ρ(x) , yielding the continuity equation (21).

As in the proof of Proposition 3.4, let χ = ρ ρ with ρ L c (X) P(X). We ﬁrst compute

d dϵF(ρ + ϵχ) ϵ=0

Z R(x; ρ + ϵχ)dρ(x) + ϵ Z R(x; ρ + ϵχ)dχ(x)

Z log π(x)dρ(x) ϵ Z log π(x)dχ(x) ϵ=0

Z R(x; ρ + ϵχ)dρ(x) + Z R(x; ρ + ϵχ)dχ(x) + ϵ d

Z R(x; ρ + ϵχ)dχ(x)

Z log π(x)dχ(x) ϵ=0

= Z R(x; ρ) log π(x)dχ(x)

+ Z ϵR(x; ρ + ϵχ)dρ(x) + ϵ Z ϵR(x; ρ + ϵχ)dχ(x) ϵ=0

= Z R(x; ρ) log π(x)dχ(x) + Z ϵR(x; ρ + ϵχ)d(ρ + ϵχ)(x) ϵ=0 . (29)

By assumption, the remainder term above is equal to zero and the functional derivative of F is therefore

δρ (ρ) = R1(x; ρ) log π(x),

so its Wasserstein gradient is

W F(ρ) = δF

= R1(x; ρ) log π(x).

Published in Transactions on Machine Learning Research (12/2025)

Proof of Proposition 3.9. To show that ρ is not the target, assume for the sake of a contradiction that ρ = π. So W F(π) = 0, and Proposition 3.8 gives

π(x) = log π(x)

r(x; π) = π(x)

F(K2) F(K1)F( π) = F( π)

F(K2) = F(K1)

Therefore, it cannot be the case that ρ = π.

Proof of Example 3.10. A direct computation of (20) along with the deﬁnitions of r and R yields

F(K2)(ω) F(K1)(ω) F( ρ )(ω) =

h2 h1 exp( 2π2ω2 h) 2πiω exp( 2π2ω2σ2)

h2 h1 2πiω exp( 2π2ω2(σ2 + h))

2π x (σ2 + h)3/2 exp x2

h2 h1 σ (σ2 + h)3/2 x exp x2

h2 h1 σ (σ2 + h)3/2 x exp hx2

2σ2(σ2 + h)

Now computing log π, using some A R for the normalising constant, we have

π(x) = A exp α exp x2

log π(x) = log(A) α exp x2

log π(x) = αx

h2 h1 σ (σ2 + h)3/2 x exp hx2

2σ2(σ2 + h)

Since R(x; ρ ) = log π(x), equation (21) implies that ρ is a ﬁxed point.

B Additional Experimental Results and Details

B.1 Multivariate Gaussian Mixture

Figure 2 in Section 4 shows the projection onto the ﬁrst two dimensions of particles sampled from highdimensional Gaussian mixtures using diﬀerent SVGD variants. The marginal density of the corresponding projection of the target density is overlaid. We extend this here with Figure 4, showing 5 additional projection plots of randomly selected pairs of dimensions between 1 and 1000.

Published in Transactions on Machine Learning Research (12/2025)

Figure 4: Projections onto 5 pairs of randomly selected dimensions of particles sampled from a highdimensional Gaussian mixture using diﬀerent SVGD variants. The contour plots of the corresponding marginal density are overlaid.

Published in Transactions on Machine Learning Research (12/2025)

B.2 Bayesian Neural Network

The results presented in Section 4.2 follow the settings of (Liu & Wang, 2016). In particular, we use Gaussian priors for the network weights and Gamma priors for the inverse covariances. There is one hidden layer with 50 units for most datasets, Protein and Year being the exceptions with 100 units each. The datasets are randomly partitioned into 90% for training and 10% for testing with results averaged over 20 trials, Protein and Year being the exceptions with 5 trials and 3 trials respectively. The number of particles in each case is 20, the activation function is RELU(x) = max(0, x), the number of iterations is 2000, and the mini-batch size is 100 for all datasets except for Year, which uses a mini-batch size of 1000.

We recreate Tables 1 and 2 below, this time comparing h-SVGD against SSVGD (Gong et al., 2021). Table 3 shows that h-SVGD outperforms SSVGD in mitigation of variance collapse on all but one dataset. We note that h-SVGD achieves this with a signiﬁcantly faster runtime for all datasets, which is due to SSVGD requiring additional optimistaion of the projection matrix. Table 4 shows that SSVGD and h-SVGD are comparable in both RMSE and LL metrics. The h-SVGD algorithm achieves a better LL score on more datasets, but SSVGD achieves a better RMSE score on more datasets.

Table 3: DAMV and runtime in seconds of SSVGD and h-SVGD.

DAMV Runtime (seconds) Dataset SSVGD h-SVGD SSVGD h-SVGD Boston 0.035 0.002 0.087 0.010 0.087 0.010 0.087 0.010 208 13 28.5 0.9 28.5 0.9 28.5 0.9 Concrete 0.070 0.004 0.102 0.006 0.102 0.006 0.102 0.006 148 56 28.7 1.3 28.7 1.3 28.7 1.3 Energy 0.053 0.005 0.106 0.011 0.106 0.011 0.106 0.011 156 28 30.7 2.2 30.7 2.2 30.7 2.2 Kin8nm 0.083 0.002 0.093 0.003 0.093 0.003 0.093 0.003 141 3.9 36.0 1.3 36.0 1.3 36.0 1.3 Naval 0.070 0.021 0.070 0.021 0.070 0.021 0.068 0.011 237 11 34.5 0.9 34.5 0.9 34.5 0.9 Combined 0.118 0.005 0.138 0.006 0.138 0.006 0.138 0.006 116 18 36.7 2.4 36.7 2.4 36.7 2.4 Protein 0.057 0.006 0.084 0.001 0.084 0.001 0.084 0.001 390 20 72.8 1.7 72.8 1.7 72.8 1.7 Wine 0.029 0.002 0.075 0.005 0.075 0.005 0.075 0.005 210 10 29.8 1.1 29.8 1.1 29.8 1.1 Yacht 0.066 0.009 0.121 0.012 0.121 0.012 0.121 0.012 97.9 1.2 30.0 1.3 30.0 1.3 30.0 1.3 Year 0.012 NA 0.012 NA 12488 NA 666 NA 666 NA 666 NA

Table 4: Average RMSE and LL of SSGVD and h-SVGD evaluated on the test dataset.

Test RMSE Test LL Dataset SSVGD h-SVGD SSVGD h-SVGD Boston 3.024 0.604 3.001 0.584 3.001 0.584 3.001 0.584 -2.088 0.322 -1.988 0.221 -1.988 0.221 -1.988 0.221 Concrete 5.073 0.522 5.073 0.522 5.073 0.522 5.210 0.529 -2.563 0.239 -2.535 0.179 -2.535 0.179 -2.535 0.179 Energy 0.923 0.123 1.040 0.128 -0.631 0.162 -0.631 0.162 -0.631 0.162 -0.805 0.104 Kin8nm 0.084 0.003 0.084 0.003 0.084 0.003 0.090 0.003 0.232 0.135 0.468 0.090 0.468 0.090 0.468 0.090 Naval 0.003 0.000 0.003 0.000 0.003 0.000 0.004 0.000 -0.624 0.161 -0.090 0.105 -0.090 0.105 -0.090 0.105 Combined 4.028 0.220 4.028 0.220 4.028 0.220 4.057 0.218 -2.335 0.066 -2.335 0.066 -2.335 0.066 -2.354 0.052 Protein 4.581 0.026 4.581 0.026 4.581 0.026 4.600 0.026 -2.526 0.045 -2.456 0.017 -2.456 0.017 -2.456 0.017 Wine 0.676 0.051 0.626 0.045 -1.261 0.172 -0.750 0.097 -0.750 0.097 -0.750 0.097 Yacht 1.664 0.607 1.664 0.607 1.664 0.607 1.861 0.662 -0.788 0.511 -0.788 0.511 -0.788 0.511 -0.813 0.227 Year 8.922 NA 8.689 NA 8.689 NA 8.689 NA -2.940 NA -2.872 NA -2.872 NA -2.872 NA