# Kernel Stein Discrepancy Descent

Anna Korba¹, Pierre-Cyril Aubin-Frankowski², Szymon Majewski³, Pierre Ablin⁴

¹CREST, ENSAE, Institut Polytechnique de Paris. ²CAS, MINES ParisTech, Paris, France. ³CMAP, Ecole Polytechnique, Institut Polytechnique de Paris. ⁴CNRS and DMA, Ecole Normale Supérieure, Paris, France. Correspondence to: Anna Korba. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Among dissimilarities between probability distributions, the Kernel Stein Discrepancy (KSD) has received much interest recently. We investigate the properties of its Wasserstein gradient flow to approximate a target probability distribution $\pi$ on $\mathbb{R}^d$, known up to a normalization constant. This leads to a straightforwardly implementable, deterministic score-based method to sample from $\pi$, named KSD Descent, which uses a set of particles to approximate $\pi$. Remarkably, owing to a tractable loss function, KSD Descent can leverage robust parameter-free optimization schemes such as L-BFGS; this contrasts with other popular particle-based schemes such as the Stein Variational Gradient Descent algorithm. We study the convergence properties of KSD Descent and demonstrate its practical relevance. However, we also highlight failure cases by showing that the algorithm can get stuck in spurious local minima.

## 1. Introduction

An important problem in machine learning and computational statistics is to sample from an intractable target distribution $\pi$. In Bayesian inference for instance, $\pi$ corresponds to the posterior probability of the parameters, which is required to compute the posterior predictive distribution. It is known only up to an intractable normalization constant. In Generative Adversarial Networks (GANs, Goodfellow et al., 2014), the goal is to generate data whose distribution is similar to that of the training set, defined by samples of $\pi$. In the first setting, one has access to the score of $\pi$ (the gradient of its log density), while in the second, one has access to samples of $\pi$.

Assessing how different the target $\pi$ and a given approximation $\mu$ are can be performed through a dissimilarity function $D(\mu|\pi)$. As summarized by Simon-Gabriel (2018), classical dissimilarities include $f$-divergences such as the KL (Kullback-Leibler) or the $\chi^2$ (chi-squared), the Wasserstein distances in Optimal Transport (OT), and Integral Probability Metrics (IPMs), such as the Maximum Mean Discrepancy (MMD, Gretton et al., 2012). Dissimilarity functions can hence be used to characterize $\pi$ since, under mild assumptions, they only vanish at $\mu = \pi$. Setting $F(\mu) = D(\mu|\pi)$, and assuming in our case that $\pi \in \mathcal{P}_2(\mathbb{R}^d)$, the set of probability measures $\mu$ with finite second moment ($\int \|x\|^2\, d\mu(x) < \infty$), the sampling task can then be recast as an optimization problem over $\mathcal{P}_2(\mathbb{R}^d)$:
$$\min_{\mu \in \mathcal{P}_2(\mathbb{R}^d)} F(\mu). \qquad (1)$$
Starting from an initial distribution $\mu_0$, one can then apply a descent scheme to (1) to converge to $\pi$. In particular, one can consider the Wasserstein gradient flow of $F$ over $\mathcal{P}_2(\mathbb{R}^d)$. This operation can be interpreted as a vector field continuously displacing the particles constituting $\mu$.
Among dissimilarities, the Kernel Stein Discrepancy (KSD, introduced independently by Liu et al., 2016; Chwialkowski et al., 2016; Gorham & Mackey, 2017) writes as follows:
$$\mathrm{KSD}(\mu|\pi) = \left(\iint k_\pi(x, y)\, d\mu(x)\, d\mu(y)\right)^{1/2}, \qquad (2)$$
where $k_\pi$ is the Stein kernel, defined through the score of $\pi$, $s(x) = \nabla\log\pi(x)$, and through a positive semi-definite kernel $k$ (see Section 2.1 for the meaning of $\nabla_1$ and $\nabla_2$):
$$k_\pi(x, y) = s(x)^\top s(y)\, k(x, y) + s(x)^\top \nabla_2 k(x, y) + \nabla_1 k(x, y)^\top s(y) + \nabla_1\cdot\nabla_2 k(x, y). \qquad (3)$$
The great advantage of the KSD is that it can be readily computed when one has access to the score of $\pi$ and uses a discrete measure $\hat\mu$, since (2) writes as a finite double sum of $k_\pi$ in this case. Furthermore, the definition of the KSD was inspired by Stein's method (see Anastasiou et al., 2021, for a review) and no sampling from $\pi$ is required in (2). This motivated the use of the KSD in a growing number of problems. The KSD has been widely used in nonparametric statistical tests for goodness-of-fit (e.g. Xu & Matsuda, 2020; Kanagawa et al., 2020). It was also used for sampling tasks: to select a suitable set of static points to approximate $\pi$, adding a new one at each iteration (Chen et al., 2018; 2019); to compress (Riabiz et al., 2020) or reweight (Hodgkinson et al., 2020) Markov Chain Monte Carlo (MCMC) outputs; and to learn a static transport map from $\mu_0$ to $\pi$ (Fisher et al., 2021). In this paper, we consider $F(\mu) = \frac{1}{2}\mathrm{KSD}^2(\mu|\pi)$ and its Wasserstein gradient $\nabla_{W_2} F$ to define a flow over particles to approximate $\pi$.
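To make the closed-form nature of (2)-(3) concrete, here is a minimal NumPy sketch (ours, not the authors' implementation) of the Stein kernel and of the squared KSD of an empirical measure; it assumes a Gaussian base kernel $k(x, y) = \exp(-\|x - y\|^2/(2h))$ whose derivatives are computed analytically, and the function names and bandwidth `h` are illustrative.

```python
import numpy as np

def stein_kernel(x, y, score, h=1.0):
    """Stein kernel k_pi(x, y) of Eq. (3) for the Gaussian base kernel
    k(x, y) = exp(-||x - y||^2 / (2 h)); `score` maps a point to grad log pi."""
    d = x.shape[0]
    diff = x - y
    sq = diff @ diff
    k = np.exp(-sq / (2 * h))
    sx, sy = score(x), score(y)
    # four terms of Eq. (3): s(x)^T s(y) k, s(x)^T grad_2 k, grad_1 k^T s(y), div_1 div_2 k
    return k * (sx @ sy + (sx @ diff - sy @ diff) / h + d / h - sq / h ** 2)

def ksd_squared(particles, score, h=1.0):
    """Squared KSD (Eq. (2)) of the empirical measure of `particles`: a double sum."""
    n = len(particles)
    return sum(stein_kernel(xi, xj, score, h)
               for xi in particles for xj in particles) / n ** 2

# example: standard Gaussian target, whose score is score(x) = -x
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2)) + 1.0
print(ksd_squared(list(x), lambda z: -z))
```

For a discrete measure the double integral in (2) indeed reduces to the double sum evaluated above, which only requires the score of $\pi$.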
Related works. Minimizing a dissimilarity $D$ is a popular approach to fit an unnormalized density model in the machine learning and computational statistics literature. For instance, Hyvärinen & Dayan (2005) proposed to minimize the Fisher divergence. Alternatively, $D$ is often taken as the KL divergence. Indeed, since the seminal paper by Jordan et al. (1998), the Wasserstein gradient flow of the KL has been extensively studied and related to the Langevin Monte Carlo (LMC) algorithm (e.g. Wibisono, 2018; Durmus et al., 2019). However, an unbiased time-discretization of the KL flow is hard to implement (Salim et al., 2020). To tackle this point, a recent successful kernel-based approximation of the KL flow was introduced as the Stein Variational Gradient Descent (SVGD, Liu & Wang, 2016). Several variants were considered (see Chewi et al., 2020, and references therein). Another line of work considers $D$ as the MMD (Mroueh et al., 2019; Arbel et al., 2019), with either regularized or exact Wasserstein gradient flow of the MMD, especially for GAN training. However, these two approaches require samples of $\pi$ to evaluate the gradient of the MMD. As the KSD is a specific case of the MMD with the Stein kernel (Chen et al., 2018), our approach is similar to Arbel et al. (2019) but better suited for a score-based sampling task, owing to the properties of the Stein kernel.

Contributions. In this paper, in contrast with the aforementioned approaches, we choose the dissimilarity $D$ in (1) to be the KSD. As in SVGD, our approach, KSD Descent, optimizes the positions of a finite set of particles to approximate $\pi$, but through a descent scheme of the KSD (in contrast to the KL for SVGD) in the space of probability measures. KSD Descent comes with several advantages. First, it benefits from a closed-form cost function which can be optimized with a fast and hyperparameter-free algorithm such as L-BFGS (Liu & Nocedal, 1989). Second, our analysis comes with several theoretical guarantees (namely, existence of the flow and a descent lemma in discrete time) under a Lipschitz assumption on the gradient of $k_\pi$. We also provide negative results highlighting some weaknesses of the convergence of the KSD gradient flow, such as the absence of exponential decay near equilibrium. Moreover, stationary points of the KSD flow may differ from the target $\pi$ and even be local minima of the flow, which implies that some particles can get stuck far from high-probability regions of $\pi$. Sometimes a simple annealing strategy mitigates this convergence issue. On practical machine learning problems, the performance of KSD Descent highly depends on the local minima of $\log\pi$. KSD Descent achieves comparable performance to SVGD on convex (i.e., $\pi$ log-concave) toy examples and Bayesian inference tasks, while it is outperformed on non-convex tasks with several saddle points like independent component analysis.

This paper is organized as follows. Section 2 introduces the necessary background on optimal transport and on the Kernel Stein Discrepancy. Section 3 presents our approach and discusses its connections with related works. Section 4 is devoted to the theoretical analysis of KSD Descent. Our numerical results are to be found in Section 5.

## 2. Background

This section introduces the high-level idea of the gradient flow approach to sampling. It also summarizes the known properties of the KSD.

### 2.1. Notations

The space of $l$-times continuously differentiable functions on $\mathbb{R}^d$ is $C^l(\mathbb{R}^d)$. The space of smooth functions with compact support is $C_c^\infty(\mathbb{R}^d)$. If $\psi : \mathbb{R}^d \to \mathbb{R}^p$ is differentiable, we denote by $J_\psi : \mathbb{R}^d \to \mathbb{R}^{p\times d}$ its Jacobian. If $p = 1$, we denote by $\nabla\psi$ the gradient of $\psi$. Moreover, if $\nabla\psi$ is differentiable, the Jacobian of $\nabla\psi$ is the Hessian of $\psi$, denoted $H_\psi$. If $p = d$, $\nabla\cdot\psi$ denotes the divergence of $\psi$, i.e. the trace of the Jacobian. We also denote by $\Delta\psi$ the Laplacian of $\psi$, where $\Delta\psi = \nabla\cdot\nabla\psi$. For a differentiable kernel $k : \mathbb{R}^d\times\mathbb{R}^d \to \mathbb{R}$, $\nabla_1 k$ (resp. $\nabla_2 k$) is the gradient of the kernel w.r.t. its first (resp. second) variable, while $H_1 k$ denotes its Hessian w.r.t. the first variable. Consider the set $\mathcal{P}_2(\mathbb{R}^d)$ of probability measures $\mu$ on $\mathbb{R}^d$ with finite second moment and $\mathcal{P}_c(\mathbb{R}^d)$ the set of probability measures with compact support. For $\mu \in \mathcal{P}_2(\mathbb{R}^d)$, we denote by $\frac{d\mu}{d\pi}$ its Radon-Nikodym density if $\mu$ is absolutely continuous w.r.t. $\pi$. For any $\mu \in \mathcal{P}_2(\mathbb{R}^d)$, $L^2(\mu)$ is the space of functions $f : \mathbb{R}^d \to \mathbb{R}$ such that $\int f^2\, d\mu < \infty$. We denote by $\|\cdot\|_{L^2(\mu)}$ and $\langle\cdot,\cdot\rangle_{L^2(\mu)}$ respectively the norm and the inner product of the Hilbert space $L^2(\mu)$. Given a measurable map $T : \mathbb{R}^d \to \mathbb{R}^d$ and $\mu \in \mathcal{P}_2(\mathbb{R}^d)$, $T_\#\mu$ is the pushforward measure of $\mu$ by $T$. We consider, for $\mu, \nu \in \mathcal{P}_2(\mathbb{R}^d)$, the 2-Wasserstein distance $W_2(\mu, \nu)$, and we refer to the metric space $(\mathcal{P}_2(\mathbb{R}^d), W_2)$ as the Wasserstein space. In a Riemannian interpretation of $(\mathcal{P}_2(\mathbb{R}^d), W_2)$, the tangent space of $\mathcal{P}_2(\mathbb{R}^d)$ at $\mu$ is denoted $T_\mu\mathcal{P}_2(\mathbb{R}^d)$ and is a subset of $L^2(\mu)$ (Otto, 2001). We refer to Appendix A.2 for more details on the Wasserstein distance and related flows.

### 2.2. Lyapunov analysis and gradient flows

To sample from a target distribution $\pi$, a now classical approach consists in identifying a continuous process which moves particles from an initial probability distribution $\mu_0$ toward samples of $\pi$.
This can be expressed as searching for vector fields $v_t : \mathbb{R}^d \to \mathbb{R}^d$ transporting the distribution $\mu_t$ through the continuity equation (see Appendix A.1)
$$\partial_t \mu_t + \mathrm{div}(\mu_t v_t) = 0, \qquad (4)$$
where $v_t$ should ensure the convergence of $\mu_t$ to $\pi$, for some topology over measures, in finite or infinite time. Due to $v_t$, eq. (4) is nonlinear in $\mu_t$. Cauchy-Lipschitz-style assumptions for existence and uniqueness of the solution of (4) are provided in Appendix A.3. The continuity equation ensures that the mass is conserved and that it is not teleported as for a mixture $\mu_t = (1-t)\mu_0 + t\pi$.

In order to adjust the position of particles only depending on the present distribution $\mu_t$ and to have an automated choice of $v_t$ at any given time, it is favorable to have $v_t$ as a function of $\mu_t$, written $v_{\mu_t}$. A principled way to select such a $v_{\mu_t}$ is to define it based on a Lyapunov functional $F(\mu)$ over measures, decreasing along the Wasserstein gradient flow (see Appendix A.2):
$$\dot F(\mu_t) := \frac{d F(\mu_t)}{dt} = \langle \nabla_{W_2} F(\mu_t),\, v_{\mu_t}\rangle_{L^2(\mu_t)} \le 0. \qquad (5)$$
Any dissimilarity $F(\cdot) = D(\cdot|\pi)$ is a valid Lyapunov candidate since it is non-negative and vanishes at $\pi$. Hence, (4) can be seen as a continuous descent scheme for (1) or, conversely, (1)-(5) can be interpreted as a way to choose $v_{\mu_t}$ in (4) to steer $\mu_0$ to $\pi$. In short, any Lyapunov-based approach rests upon three quantities $(F(\mu_t), v_{\mu_t}, \dot F(\mu_t))$, related by (5). A natural choice of $v_{\mu_t}$ satisfying (5) and realizing the steepest descent is the Wasserstein gradient itself, $v_{\mu_t} = -\nabla_{W_2} F(\mu_t)$. Depending on the choice of $F$, this $v_{\mu_t}$ may be hard to implement, or require a specific analysis of the resulting dissipation $\dot F(\mu_t)$. Otherwise, to ensure that $\dot F(\mu_t)$ only vanishes at $\pi$, one can choose $v_{\mu_t}$ so that both $F(\mu_t)$ and $-\dot F(\mu_t)$ are known dissimilarities. As a matter of fact, if there exists a dissimilarity $\tilde D$ separating measures such that $-\dot F(\mu) \ge \tilde D(\mu|\pi)$, then $\pi$ is asymptotically stable for the flow and, if moreover $\tilde D(\mu|\pi) \gtrsim F(\mu)$, then $\pi$ is exponentially stable (by Gronwall's lemma). This relates the Lyapunov analysis to functional inequalities (Villani, 2003), expressing domination w.r.t. $F$ of the Wasserstein gradient of $F$ under specific assumptions on $\pi$, e.g. log-Sobolev for the KL, or Poincaré for the $\chi^2$ (Chewi et al., 2020).

### 2.3. Kernel Stein Discrepancy

Consider a positive semi-definite kernel $k : \mathbb{R}^d\times\mathbb{R}^d \to \mathbb{R}$ and its corresponding RKHS $\mathcal{H}_k$ of real-valued functions on $\mathbb{R}^d$. The space $\mathcal{H}_k$ is a Hilbert space with inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}_k}$ and norm $\|\cdot\|_{\mathcal{H}_k}$. Moreover, $k$ satisfies the reproducing property: $\forall f \in \mathcal{H}_k$, $f(x) = \langle f, k(x, \cdot)\rangle_{\mathcal{H}_k}$; which for smooth kernels also holds for derivatives, e.g. $\partial_i f(x) = \langle f, (\nabla_1 k(x, \cdot))_i\rangle_{\mathcal{H}_k}$ (see Saitoh & Sawano, 2016). Let $\mu \in \mathcal{P}_2(\mathbb{R}^d)$. If $\int k(x, x)\, d\mu(x) < \infty$, then the integral operator associated to the kernel $k$ and measure $\mu$, denoted by $S_{\mu,k} : L^2(\mu) \to \mathcal{H}_k$ and defined as
$$S_{\mu,k} f = \int k(x, \cdot)\, f(x)\, d\mu(x), \qquad (6)$$
is a Hilbert-Schmidt operator, and $\mathcal{H}_k \subset L^2(\mu)$. In this case, the identity embedding $\iota : \mathcal{H}_k \to L^2(\mu)$ is a bounded operator and is the adjoint of $S_{\mu,k}$, i.e. $\iota^\star = S_{\mu,k}$ (Steinwart & Christmann, 2008, Theorems 4.26 and 4.27). Hence, for any $(f, g) \in \mathcal{H}_k \times L^2(\mu)$, $\langle \iota f, g\rangle_{L^2(\mu)} = \langle f, S_{\mu,k}\, g\rangle_{\mathcal{H}_k}$. We denote by $\mathcal{H}_k^d$ the Cartesian product RKHS consisting of elements $f = (f_1, \ldots, f_d)$ with $f_i \in \mathcal{H}_k$, and with inner product $\langle f, g\rangle_{\mathcal{H}_k^d} = \sum_{i=1}^d \langle f_i, g_i\rangle_{\mathcal{H}_k}$. For vector inputs, we extend $S_{\mu,k}$, applying it component-wise.
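As an illustration of (6), the following short sketch (ours, not from the paper) evaluates $S_{\hat\mu,k} f$ for an empirical measure $\hat\mu = \frac{1}{N}\sum_i \delta_{x_i}$, for which the integral reduces to a weighted sum of kernel sections; the Gaussian kernel and the helper names are assumptions for illustration.

```python
import numpy as np

def gaussian_k(x, y, h=1.0):
    # base kernel k(x, y) = exp(-||x - y||^2 / (2 h))
    return np.exp(-np.sum((x - y) ** 2) / (2 * h))

def S_mu_k(f, particles, h=1.0):
    """Kernel integral operator (6) for mu = (1/N) sum_i delta_{x_i}:
    (S_{mu,k} f)(y) = (1/N) sum_i k(x_i, y) f(x_i), an element of H_k."""
    fx = [f(x) for x in particles]
    return lambda y: np.mean([gaussian_k(x, y, h) * v for x, v in zip(particles, fx)])

# example: embed f(x) = x_0 under the empirical measure of three points
pts = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([-1.0, 2.0])]
Sf = S_mu_k(lambda x: x[0], pts)
print(Sf(np.array([0.5, 0.5])))
```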
The Stein kernel $k_\pi$ (3) is a reproducing kernel and satisfies a Stein identity ($\int_{\mathbb{R}^d} k_\pi(x, \cdot)\, d\pi(x) = 0$) under mild regularity assumptions on $k$ and $\pi$.¹ It allows for several interpretations of the KSD (2), as already discussed by Liu et al. (2016). It can be introduced as an IPM in the specific case of a Stein operator applied to $\mathcal{H}_k$ (e.g. Gorham & Mackey, 2017). It can then be identified as an asymmetric MMD in $\mathcal{H}_{k_\pi}$ (see Section 3.3). Alternatively, the squared KSD can be seen as a kernelized Fisher divergence, where the Fisher information $\|\nabla\log(\frac{d\mu}{d\pi})\|^2_{L^2(\mu)}$ is smoothed through the kernel integral operator, i.e. $\mathrm{KSD}^2(\mu|\pi) = \|S_{\mu,k}\nabla\log(\frac{d\mu}{d\pi})\|^2_{\mathcal{H}_k^d}$. In this sense, the squared KSD has also been referred to as the Stein Fisher information (Duncan et al., 2019). Hence, minimizing the KSD can be thought of as a kernelized version of score matching (Hyvärinen & Dayan, 2005).

¹ E.g., $k$ is a Gaussian kernel and $\pi$ is a smooth density fully supported on $\mathbb{R}^d$, see Liu et al. (2016, Theorem 3.7).

The metrization of weak convergence by the KSD, i.e. that $\lim_{t\to\infty} \mathrm{KSD}(\mu_t|\pi) = 0$ implies the weak convergence of $\mu_t$ to $\pi$, depends on the choice of the kernel relatively to the target. This question has been considered by Gorham & Mackey (2017), who show this is the case under assumptions akin to strong log-concavity of $\pi$ at infinity (namely distant dissipativity, Eberle, 2016), and for a kernel $k$ with a slow decay rate. This includes finite Gaussian mixtures with common covariance, and kernels that are translation-invariant with heavy tails and non-vanishing Fourier transform, such as the inverse multi-quadratic (IMQ) kernel defined by $k(x, y) = (c^2 + \|x - y\|_2^2)^\beta$ for $c > 0$ and $\beta \in (-1, 0)$, or its variants considered in Chen et al. (2018).

## 3. Sampling as optimization of the KSD

This section defines KSD Descent and relates it to other gradient flows, especially the MMD gradient flow, of which the KSD Descent is a special case. In all the following, we assume that $k \in C^{3,3}(\mathbb{R}^d\times\mathbb{R}^d, \mathbb{R})$, and that $\pi$ is such that $s = \nabla\log\pi \in C^2(\mathbb{R}^d)$.

### 3.1. Continuous time dynamics

Consider the functional $F : \mathcal{P}_2(\mathbb{R}^d) \to [0, +\infty)$, $\mu \mapsto \frac{1}{2}\mathrm{KSD}^2(\mu|\pi)$, defined over the Wasserstein space. If $\mu \in \mathcal{P}_2(\mathbb{R}^d)$ satisfies some mild regularity conditions (i.e., it has a $C^1$ density w.r.t. the Lebesgue measure, and it is in the domain of $F$, see Appendix A.2), the gradient of $F$ at $\mu$ is well-defined and denoted by $\nabla_{W_2} F(\mu) \in L^2(\mu)$. We shall consider the following assumptions on the Stein kernel:

(A1) There exists a map $L(\cdot) \in C^0(\mathbb{R}^d, \mathbb{R}_+)$, which is $\mu$-integrable for any $\mu \in \mathcal{P}_2(\mathbb{R}^d)$, such that, for any $y \in \mathbb{R}^d$, the maps $x \mapsto \nabla_1 k_\pi(x, y)$ and $x \mapsto \nabla_2 k_\pi(x, y)$ are $L(y)$-Lipschitz.

(A2) There exists $m > 0$ such that for any $\mu \in \mathcal{P}_c(\mathbb{R}^d)$ and all $y \in \mathbb{R}^d$, we have $\big\|\int \nabla_2 k_\pi(x, y)\, d\mu(x)\big\| \le m\big(1 + \|y\| + \int \|x\|\, d\mu(x)\big)$.

(A3) The map $(x, y) \mapsto \|H_1 k_\pi(x, y)\|_{\mathrm{op}}$ is $\mu\otimes\nu$-integrable for every $\mu, \nu \in \mathcal{P}_2(\mathbb{R}^d)$.

(A4) For all $\mu \in \mathcal{P}_2(\mathbb{R}^d)$, $\int k_\pi(x, x)\, d\mu(x) < \infty$.

The KSD gradient flow is defined as the flow induced by the continuity equation:
$$\partial_t \mu_t + \mathrm{div}(\mu_t v_{\mu_t}) = 0 \quad \text{for} \quad v_{\mu_t} := -\nabla_{W_2} F(\mu_t). \qquad (7)$$
Assumptions (A1) and (A2) ensure that the KSD gradient flow exists and is unique; they are further discussed in Appendix A.3. Assumptions (A1) and (A3) are needed so that the Hessian of $F$ is well defined (see Section 4). Assumption (A4) guarantees that the integral operator $S_{\mu,k_\pi}$ (6) is well-defined and that $F(\mu) < \infty$ for all $\mu \in \mathcal{P}_2(\mathbb{R}^d)$.

Lemma 1. Assume that $k$, its derivatives up to order 3, and their product by $\|x - y\|$ are uniformly bounded over $\mathbb{R}^d$; and that $s$ is Lipschitz and has a bounded Hessian over $\mathbb{R}^d$. Then Assumptions (A1), (A3) and (A4) hold.
If, furthermore, there exist $M > 0$ and $M_0$ such that, for all $x \in \mathbb{R}^d$, $\|s(x)\| \le M\sqrt{\|x\|} + M_0$, then Assumption (A2) also holds.

See the proof in Appendix B.1. Smoothed Laplace distributions $\pi$ paired with a Gaussian $k$ satisfy the assumptions of Lemma 1. For a Gaussian $\pi$, $s$ is linear, so Assumptions (A1), (A3) and (A4) hold for smooth kernels, but Assumption (A2) does not hold in general because of the $s(x)^\top s(y)$ term in $k_\pi$. Notice that most of our results are stated without Assumption (A2), which is only required to establish the global existence of the KSD flow, in the sense that the particle trajectories are well-defined and do not explode in finite time.

Proposition 2. Under Assumptions (A1) and (A2), the $W_2$ gradient of $F$ evaluated at $\mu$ and its dissipation (5) along (7) are
$$\nabla_{W_2} F(\mu) = \mathbb{E}_{x\sim\mu}[\nabla_2 k_\pi(x, \cdot)], \qquad (8)$$
$$\dot F(\mu_t) = -\mathbb{E}_{y\sim\mu_t}\Big[\big\|\mathbb{E}_{x\sim\mu_t}[\nabla_2 k_\pi(x, y)]\big\|^2\Big]. \qquad (9)$$
Since the r.h.s. of (9) is negative, Proposition 2 shows that the squared KSD w.r.t. $\pi$ decreases along the KSD gradient flow dynamics. In other words, $F$ is indeed a Lyapunov functional for the dynamics (7), as discussed in Section 2.2.

### 3.2. Discrete time and discrete measures

A straightforward time-discretization of (7) is a gradient descent in the Wasserstein space applied to $F(\mu) = \frac{1}{2}\mathrm{KSD}^2(\mu|\pi)$. Starting from an initial distribution $\mu_0 \in \mathcal{P}_2(\mathbb{R}^d)$, it writes as follows at iteration $n \in \mathbb{N}$:
$$\mu_{n+1} = \big(I - \gamma\, \nabla_{W_2} F(\mu_n)\big)_\#\, \mu_n, \qquad (10)$$
for a step-size $\gamma > 0$. However, for discrete measures $\hat\mu = \frac{1}{N}\sum_{j=1}^N \delta_{x^j}$, we can make the problem more explicit by setting a loss function
$$\mathcal{F}([x^j]_{j=1}^N) := F(\hat\mu) = \frac{1}{2N^2}\sum_{i,j=1}^N k_\pi(x^i, x^j). \qquad (11)$$
Problem (1) then corresponds to a standard non-convex optimization problem over the finite-dimensional, Euclidean space of particle positions. The gradient of $\mathcal{F}$ is readily obtained as
$$\nabla_{x^i} \mathcal{F}([x^j]_{j=1}^N) = \frac{1}{N^2}\sum_{j=1}^N \nabla_2 k_\pi(x^j, x^i),$$
since, by symmetry of $k_\pi$, $\nabla_1 k_\pi(x, y) = \nabla_2 k_\pi(y, x)$. As both $\mathcal{F}$ and $\nabla_{x^i}\mathcal{F}$ can be explicitly computed, one can implement KSD Descent either using a gradient descent (Algorithm 1) or through a quasi-Newton algorithm such as L-BFGS (Algorithm 2). As a matter of fact, L-BFGS (Liu & Nocedal, 1989) is often faster and more robust than the conventional gradient descent. It also does not require choosing critical hyper-parameters, such as a learning rate, since L-BFGS performs a line search to find suitable step-sizes. It only requires a tolerance parameter on the norm of the gradient, which is in practice set to machine precision.

Algorithm 1 (KSD Descent GD). Input: initial particles $(x^i_0)_{i=1}^N \sim \mu_0$, number of iterations $M$, step-size $\gamma$.
For $n = 1$ to $M$:
$$[x^i_{n+1}]_{i=1}^N = [x^i_n]_{i=1}^N - \frac{\gamma}{N^2}\sum_{j=1}^N \big[\nabla_2 k_\pi(x^j_n, x^i_n)\big]_{i=1}^N. \qquad (12)$$
Return: $[x^i_M]_{i=1}^N$.

A technical descent lemma for (10) (Proposition 14), showing that $F$ decreases at each iteration (10), is provided in Appendix A.6. It requires the boundedness of $(\|L(\cdot)\|_{L^2(\mu_n)})_{n\ge 0}$, the $L^2$-norm of the Lipschitz constants of Assumption (A1) along the flow, as well as the convexity of $L(\cdot)$ and a compactly-supported initialization.

Algorithm 2 (KSD Descent L-BFGS). Input: initial particles $(x^i_0)_{i=1}^N \sim \mu_0$, tolerance tol.
Return: $[x^i_\star]_{i=1}^N = \text{L-BFGS}(\mathcal{F}, \nabla\mathcal{F}, [x^i_0]_{i=1}^N, \text{tol})$.

Remark 1. As L-BFGS requires access to exact gradients, Algorithm 2 cannot be used in a stochastic setting. However, this can be done for Algorithm 1 by subsampling over particles in the double sum in Equation (11). Moreover, in some settings like Bayesian inference, the score itself writes as a sum over observations. In this case, the loss $\mathcal{F}$ writes as a double sum over observations, and a stochastic variant of Algorithm 1 tailored for this problem could be devised, in the spirit of Clémençon et al. (2016).
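The sketch below implements the discrete loss (11) and the spirit of Algorithm 2 with a Gaussian base kernel, relying on PyTorch autodifferentiation and `torch.optim.LBFGS` rather than the authors' released package; the bandwidth `h` and helper names are assumptions made for the illustration.

```python
import torch

def gaussian_stein_kernel(x, y, score_x, score_y, h=1.0):
    """Vectorized Stein kernel (3) for k(x, y) = exp(-||x - y||^2 / (2 h))."""
    d = x.shape[1]
    diff = x[:, None, :] - y[None, :, :]            # (N, M, d)
    sq = (diff ** 2).sum(-1)                        # ||x - y||^2
    k = torch.exp(-sq / (2 * h))                    # base kernel
    dot = score_x @ score_y.T                       # s(x)^T s(y)
    sx_diff = (score_x[:, None, :] * diff).sum(-1)  # s(x)^T (x - y)
    sy_diff = (score_y[None, :, :] * diff).sum(-1)  # s(y)^T (x - y)
    return k * (dot + (sx_diff - sy_diff) / h + d / h - sq / h ** 2)

def ksd_loss(x, score_fn, h=1.0):
    """Empirical loss (11): (1 / 2 N^2) * sum_{i,j} k_pi(x_i, x_j)."""
    s = score_fn(x)
    return gaussian_stein_kernel(x, x, s, s, h).sum() / (2 * x.shape[0] ** 2)

def ksd_descent_lbfgs(x0, score_fn, h=1.0, max_iter=1000):
    """KSD Descent in the spirit of Algorithm 2: L-BFGS on the particle positions."""
    x = x0.clone().requires_grad_(True)
    opt = torch.optim.LBFGS([x], max_iter=max_iter, line_search_fn="strong_wolfe")
    def closure():
        opt.zero_grad()
        loss = ksd_loss(x, score_fn, h)
        loss.backward()   # exact gradient of (11) via autodiff
        return loss
    opt.step(closure)
    return x.detach()
```

The strong-Wolfe line search is what removes the step-size hyper-parameter discussed above; only a tolerance on the gradient norm remains.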
### 3.3. Related work

Several recent works fall within the framework sketched in Section 2.2. In SVGD, Liu et al. (2016) take $F$ as the KL, and set $v_{\mu_t} = -S_{\mu_t,k}\nabla\ln\big(\frac{d\mu_t}{d\pi}\big)$ to obtain $-\dot F$ as the (squared) KSD. Integrating this inequality w.r.t. time yields a $1/T$ convergence rate for the average KSD between $\mu_t$ and $\pi$ for $t \in [0, T]$. This enabled Korba et al. (2020, Proposition 5, Corollary 6) to obtain a discrete-time descent lemma for bounded kernels, as well as rates of convergence for the averaged KSD. In contrast, since the dissipation (9) of the KSD along its $W_2$ gradient flow does not correspond to any dissimilarity, our descent lemma for (10) (Proposition 14) does not yield similar rates of convergence. Alternatively, in the LAWGD algorithm recently proposed by Chewi et al. (2020), $F$ is the KL, and $v_{\mu_t} = -\nabla S_{\pi,k_{\mathcal{L}_\pi}}\big(\frac{d\mu_t}{d\pi}\big)$, with $k_{\mathcal{L}_\pi}$ chosen such that $-\dot F$ is the $\chi^2$, by taking $S_{\pi,k_{\mathcal{L}_\pi}}$ as the inverse of the diffusion operator
$$\mathcal{L}_\pi : f \mapsto -\Delta f - \langle \nabla\log\pi, \nabla f\rangle. \qquad (13)$$
Their elegant approach results in a linear convergence of the KL along their flow, but implementing LAWGD in practice requires computing the spectrum of $\mathcal{L}_\pi$. This is in general as difficult as solving a linear PDE, and Chewi et al. (2020) admit it is unlikely to scale to high dimensions.²

² The update rules of SVGD, LAWGD and MMD-GD can be found in Appendix A.4.

Beyond studies on the KL, Mroueh et al. (2019) considered $F$ as the MMD and pick $v_{\mu_t}$ based on a kernelized Sobolev norm so that $-\dot F$ resembles the MMD, but without proving convergence of their scheme. Arbel et al. (2019) also analyzed $F$ as the MMD, but for $v_{\mu_t} = -\nabla_{W_2}\frac{1}{2}\mathrm{MMD}^2(\mu_t, \pi)$, with a dissipation $\dot F$ similar to ours and with a dedicated analysis of their MMD-GD flow. We recall that $\mathrm{MMD}^2(\mu, \pi) = \big\|\int k(x,\cdot)\, d\mu(x) - \int k(x,\cdot)\, d\pi(x)\big\|^2_{\mathcal{H}_k}$. Since the Stein kernel satisfies the Stein identity $\int k_\pi(x,\cdot)\, d\pi(x) = 0$, the KSD (2) can be identified to an MMD with the Stein kernel (Chen et al., 2018). However, the assumptions of Arbel et al. (2019) -- that $\nabla k$ is $L$-Lipschitz for some $L \in \mathbb{R}_+$ -- do not hold in general for unbounded Stein kernels. Here, we provide the right set of assumptions (A1)-(A2) on $k_\pi$ for the flow to exist and for a descent lemma to hold. Also, as noted in Figure 1, the sample-based MMD flow, defined through
$$\nabla_{W_2}\tfrac{1}{2}\mathrm{MMD}^2(\mu, \pi) = \int \nabla_2 k(x, \cdot)\, d(\mu - \pi)(x), \qquad (14)$$
can fail dramatically while the KSD flow succeeds. This suggests that the geometrical properties of the KSD flow are more favorable than the ones of the regular MMD flow. In other words, choosing an appropriate (target-dependent) kernel, as in our method or in LAWGD, appears more propitious than taking a kernel $k$ unrelated to $\pi$.

Related to the optimization of the KSD, Stein points (Chen et al., 2018) also propose to use the KSD loss for sampling, but the loss is minimized using very different tools. While KSD Descent uses first-order information by following the gradient (or L-BFGS) direction with a fixed number of particles, Stein points use a Frank-Wolfe scheme, adding particles one by one in a greedy fashion. Given $N$ particles $x^1, \ldots, x^N$, in Stein points, the next particle is set as
$$x^{N+1} \in \arg\min_x\ \tfrac{1}{2}k_\pi(x, x) + \sum_{i=1}^N k_\pi(x, x^i).$$
This problem is solved using derivative-free (a.k.a. zeroth-order) algorithms like grid search or random sampling. The main drawback of such an approach is that it scales poorly with the dimension $d$ when compared to first-order methods. In the same spirit, Futami et al. (2019) have recently proposed a similar algorithm to optimize the MMD.
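For comparison, here is a minimal grid-search sketch of the greedy Stein-point update displayed above, reusing the hypothetical `stein_kernel` helper from the sketch in the introduction; the candidate grid is purely illustrative and this is not the full procedure of Chen et al. (2018).

```python
import itertools
import numpy as np

def next_stein_point(points, score, grid_1d, h=1.0):
    """Greedy step: pick the 2D grid candidate x minimizing
    0.5 * k_pi(x, x) + sum_i k_pi(x, x_i)."""
    candidates = [np.array(c) for c in itertools.product(grid_1d, repeat=2)]
    def obj(x):
        return 0.5 * stein_kernel(x, x, score, h) + sum(
            stein_kernel(x, xi, score, h) for xi in points)
    return min(candidates, key=obj)

# example: grow a small point set for a standard 2D Gaussian target
points = []
for _ in range(5):
    points.append(next_stein_point(points, lambda z: -z, np.linspace(-2, 2, 21)))
```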
## 4. Theoretical properties of the KSD flow

In this section, we provide a theoretical study of the convergence of the KSD Wasserstein gradient flow, assessing the convexity of $F$ and discussing the stationary points of its gradient flow. Remarkably, we encounter pitfalls similar to other deterministic flows derived from IPMs. This issue arises because IPMs are always mixture convex, but seldom geodesically convex. We first investigate the convexity properties of $F$ along $W_2$ geodesics and show that exponential convergence near equilibrium cannot hold in general. Then, we examine some stationary points of its $W_2$ gradient flow, which explain the failure cases met in Section 5, where $\hat\mu_n$ converges to a degenerate measure.

### 4.1. Convexity properties of the KSD flow

As is well known, decay along $W_2$ gradient flows can be obtained from convexity properties along geodesics. A natural object of interest is then the Hessian of the objective $F$. We define this object below, in a similar way as Duncan et al. (2019). We recall that $\{\nabla\psi,\ \psi \in C_c^\infty(\mathbb{R}^d)\}$ is by definition dense in $T_\mu\mathcal{P}_2(\mathbb{R}^d) \subset L^2(\mu)$ for any $\mu \in \mathcal{P}_2(\mathbb{R}^d)$ (Ambrosio et al., 2008, Definition 8.4.1).

Definition 1. Consider $\psi \in C_c^\infty(\mathbb{R}^d)$ and the path $\rho_t$ from $\mu$ to $(I + \nabla\psi)_\#\mu$ given by $\rho_t = (I + t\nabla\psi)_\#\mu$, for all $t \in [0, 1]$. The Hessian of $F$ at $\mu$, $HF|_\mu$, is defined as a symmetric bilinear form on $C_c^\infty(\mathbb{R}^d)$ associated with the quadratic form $\mathrm{Hess}_\mu F(\psi, \psi) := \frac{d^2}{dt^2}\big|_{t=0} F(\rho_t)$.

Definition 1 can be straightforwardly related to the usual symmetric bilinear form defined on $T_\mu\mathcal{P}_2(\mathbb{R}^d)\times T_\mu\mathcal{P}_2(\mathbb{R}^d)$ (Otto & Villani, 2000, Section 3).³

³ The $W_2$ Hessian of $F$, denoted $HF|_\mu$, is an operator over $T_\mu\mathcal{P}_2(\mathbb{R}^d)$ verifying $\langle HF|_\mu v_t, v_t\rangle_{L^2(\mu)} = \frac{d^2}{dt^2}\big|_{t=0} F(\rho_t)$ if $t \mapsto \rho_t$ is a geodesic starting at $\mu$ with vector field $t \mapsto v_t$.

Proposition 3. Under Assumptions (A1) and (A3), the Hessian of $F$ at $\mu$ is given, for any $\psi \in C_c^\infty(\mathbb{R}^d)$, by
$$\mathrm{Hess}_\mu F(\psi, \psi) = \mathbb{E}_{x,y\sim\mu}\big[\nabla\psi(x)^\top \nabla_1\nabla_2 k_\pi(x, y)\, \nabla\psi(y)\big] + \mathbb{E}_{x,y\sim\mu}\big[\nabla\psi(x)^\top H_1 k_\pi(x, y)\, \nabla\psi(x)\big]. \qquad (15)$$

A proof of Proposition 3 is provided in Appendix B.3. Our computations are similar to the ones in (Arbel et al., 2019, Lemma 23), with some terms getting simpler owing to the Stein identity satisfied by the Stein kernel. As for the squared MMD (Arbel et al., 2019, Proposition 5), the squared KSD is unlikely to be geodesically convex. Indeed, while the first term is always positive, the second term in (15) can in general take negative values, unless $H_1 k_\pi(x, y)$ is positive for all $x, y \in \mathrm{supp}(\mu)$. Nevertheless, at $\mu = \pi$, this second term vanishes, again owing to the Stein property of $k_\pi$.

Corollary 4. Under Assumptions (A1), (A3) and (A4), the Hessian of $F$ at $\pi$ is given, for any $\psi \in C_c^\infty(\mathbb{R}^d)$, by
$$\mathrm{Hess}_\pi F(\psi, \psi) = \|S_{\pi,k_\pi}\mathcal{L}_\pi\psi\|^2_{\mathcal{H}_{k_\pi}},$$
where $S_{\pi,k_\pi}$ and $\mathcal{L}_\pi$ are defined in Equations (6) and (13).

A proof of Corollary 4 is provided in Appendix B.4. We now study the curvature properties near equilibrium, characterized by $\mathrm{Hess}_\pi F$. In particular, inspired by the methodology described in Villani (2003) and recently applied by Duncan et al. (2019), we expect exponential convergence of solutions initialized near $\pi$ whenever the Hessian is bounded from below by a quadratic form on the tangent space of $\mathcal{P}_2(\mathbb{R}^d)$ at $\pi$, included in $L^2(\pi)$.

Definition 2. We say that exponential decay near equilibrium holds if there exists $\lambda > 0$ such that for any $\psi \in C_c^\infty(\mathbb{R}^d)$,
$$\mathrm{Hess}_\pi F(\psi, \psi) \ge \lambda\, \|\nabla\psi\|^2_{L^2(\pi)}. \qquad (16)$$
According to Corollary 4, (16) can be seen as a kernelized version of the following form of the Poincaré inequality for $\pi$ (Bakry et al., 2013, Chapter 5):
$$\|\mathcal{L}_\pi\psi\|^2_{L^2(\pi)} \ge \lambda_\pi\, \|\nabla\psi\|^2_{L^2(\pi)}. \qquad (17)$$
Condition (16) is similar to the Stein-Poincaré inequality (Duncan et al., 2019, Lemma 32). We will now argue that (16) is hardly ever satisfied, thus obtaining an impossibility result reminiscent of the one for SVGD in (Duncan et al., 2019, Lemma 36), which states that exponential convergence (of the KL) for the SVGD gradient flow does not hold whenever $\pi$ has exponential tails and the derivatives of $\log\pi$ and $k$ grow at most at a polynomial rate. We start with the following characterization of exponential decay near equilibrium.

Proposition 5. Let $T_{\pi,k_\pi} = S^\star_{\pi,k_\pi} S_{\pi,k_\pi}$ and $L^2_0(\pi) = \{\phi \in L^2(\pi),\ \int \phi\, d\pi = 0\}$. The exponential decay near equilibrium (16) holds if and only if $\mathcal{L}_\pi^{-1} : L^2_0(\pi) \to L^2_0(\pi)$, the inverse of $\mathcal{L}_\pi|_{L^2_0(\pi)}$, is well-defined, bounded, and for all $\phi \in L^2_0(\pi)$ we have
$$\langle \phi,\, T_{\pi,k_\pi}\phi\rangle_{L^2(\pi)} \ge \lambda\, \langle \phi,\, \mathcal{L}_\pi^{-1}\phi\rangle_{L^2(\pi)}. \qquad (18)$$

See Appendix B.5 for a proof. By the spectral theorem for compact, self-adjoint operators (Kreyszig, 1978, Section 8.3), $T_{\pi,k_\pi}$ has a discrete spectrum $(l_n)_{n\in\mathbb{N}}$ which satisfies $l_n \ge 0$ and $l_n \to 0$. Under mild assumptions on $\pi$, the operator $\mathcal{L}_\pi$ also has a discrete, positive spectrum (Chewi et al., 2020, Appendix A). Proposition 5 implies the following necessary condition on the spectra of $\mathcal{L}_\pi^{-1}$ and $T_{\pi,k_\pi}$ for the exponential decay near equilibrium (16) to hold.

Corollary 6. If $\mathcal{L}_\pi^{-1}$ has a discrete spectrum $(\lambda_n)_{n\in\mathbb{N}}$ and (16) holds, then $\lambda_n = O(l_n)$, i.e. the eigenvalue decay of $\mathcal{L}_\pi^{-1}$ is at least as fast as the one of $T_{\pi,k_\pi}$.

We also show that if $\mathcal{H}_{k_\pi}$ is infinite-dimensional and exponential convergence near equilibrium holds, then $\mathcal{L}_\pi^{-1}$ has a discrete spectrum (Lemma 16). We now present our impossibility result on the exponential decay near equilibrium.

Theorem 7. Let $\pi \propto e^{-V}$. Assume that $V \in C^2(\mathbb{R}^d)$, $\nabla V$ is Lipschitz and $\mathcal{L}_\pi$ has a discrete spectrum. Then exponential decay near equilibrium does not hold.

The main idea behind the proof of Theorem 7 (Appendix B.7) is that $T_{\pi,k_\pi}$ is nuclear (Steinwart & Christmann, 2008, Theorem 4.27), which implies that its eigenvalues $(l_n)_{n\in\mathbb{N}}$ are summable. On the other hand, the eigenvalue decay of $\mathcal{L}_\pi^{-1}$ when $\pi$ is a Gaussian can be seen to be of order $O(1/n^{1/d})$ (Lemma 17 in Appendix B.7), which is not summable. The general case is obtained by comparison with a Gaussian. Remark that, by Lemma 16, the assumption that $\mathcal{L}_\pi$ has a discrete spectrum in Theorem 7 can be exchanged for the assumption that $\mathcal{H}_{k_\pi}$ is infinite-dimensional, which is discussed in the proof of Theorem 7. Despite the lack of (strong) geodesic convexity near equilibrium, we still observe empirically good convergence properties of the KSD flow for discrete measures to a stationary measure. Hence, we now investigate these stationary measures.

### 4.2. Stationary measures of the KSD flow

The KSD gradient flow leads to a deterministic algorithm, as for SVGD and LAWGD. To study the convergence of these algorithms in continuous time, it is relevant to characterize the stationary measures, i.e. the ones which cancel the dissipation $\dot F$ (5) of the objective functional $F$ along the respective gradient flow dynamics. Unfortunately, unlike for the SVGD and LAWGD algorithms, the dissipation related to the KSD flow (9) does not yield a dissimilarity between measures. Consequently, the study of the stationary measures of the KSD is more involved. We discuss below when failure cases may happen.
Lemma 8. Assume Assumption (A4) holds. Then $\mathcal{H}_{k_\pi}$ does not contain non-zero constant functions.

A proof of Lemma 8 is provided in Appendix B.8. This result has the immediate consequence of considerably restricting the number of candidate fully-supported measures that are stationary for the KSD gradient flow. Consider one such measure $\mu^\star$. At equilibrium, $\dot F(\mu^\star) = 0$, which implies that $\int k_\pi(x, \cdot)\, d\mu^\star(x)$ is $\mu^\star$-a.e. equal to a constant function $c$. Since $x \mapsto c$ is then also an element of $\mathcal{H}_{k_\pi}$, the previous lemma implies that $c = 0$. Hence, if $\mu^\star$ and $\pi$ are full-support, $F(\mu^\star) = 0$. Provided $k_\pi$ is characteristic (Sriperumbudur et al., 2011), then $\mu^\star = \pi$. However, as Algorithms 1 and 2 rely on discrete measures, the dissipation $\dot F$ (9) can vanish even for $\mu \ne \pi$, because $\mu$ is not full-support. Depending on the properties of $\pi$ and $k$, this may happen even for trivial measures such as a single Dirac mass, as stated in the following lemma.

Lemma 9. Let $x_0$ be such that $s(x_0) = 0$ and $J_s(x_0)$ is invertible, and consider a translation-invariant kernel $k(x, y) = \phi(x - y)$, for $\phi \in C^3(\mathbb{R}^d)$. Then $\delta_{x_0}$ is a stable fixed measure of (7), i.e. it is stationary and any small push-forward of $\delta_{x_0}$ is attracted back by the flow.

Proof: For $\varepsilon > 0$ and $\psi \in C_c^\infty(\mathbb{R}^d)$, set $\mu_\varepsilon = (I + \varepsilon\psi)_\#\delta_{x_0}$. We then have $F(\mu_\varepsilon) = \frac{1}{2}k_\pi(x_0 + \varepsilon\psi(x_0),\, x_0 + \varepsilon\psi(x_0))$. Expanding $k_\pi(x, x)$ at the first order around $x = x_0$ gives
$$2F(\mu_\varepsilon) = \varepsilon^2\, \big\|[J_s(x_0)]\psi(x_0)\big\|^2\, \phi(0) - \Delta\phi(0) + o(\varepsilon^2).$$
This quantity is minimized for $\psi(x_0) = 0$, which shows that $\delta_{x_0}$ is indeed a local minimum for $F$.

Importantly, this result applies whenever the score $s$ vanishes at $x_0$, not only when $x_0$ is a stable local minimum of the potential $-\log\pi$. This means that for a single particle, KSD Descent is attracted to any stationary point of $\log\pi$, whereas SVGD converges only to local maxima of $\log\pi$ (Liu & Wang, 2016). Nonetheless, if $\pi$ is log-concave, there is no spurious stationary point.

For cases more general than Lemma 9, we are interested in the sets that are kept invariant by the gradient flow. For these sets, an erroneous initialization may prevent the particles from reaching the support of $\pi$. We provide below a general result holding for any deterministic flow, beyond our specific choice (7) of $v_{\mu_t}$, and thus holding also for SVGD.

Definition 3. Let $M \subset \mathbb{R}^d$ be a closed nonempty set. We say that $M$ is a flow-invariant support set for the flow $(\mu_t)_{t\ge 0}$ of (4) if for any $\mu_0$ s.t. $\mathrm{supp}(\mu_0) \subset M$, the flow verifies $\mathrm{supp}(\mu_t) \subset M$ for all $t \ge 0$.

Proposition 10. (Informal) Let $M \subset \mathbb{R}^d$ be a smooth nonempty submanifold and $\mu_0 \in \mathcal{P}_c(\mathbb{R}^d)$ with $\mathrm{supp}(\mu_0) \subset M$. Assume that, for a deterministic $(v_{\mu_t})_{t\ge 0}$ satisfying classical Caratheodory-Lipschitz assumptions (Appendix A.3), we have $v_{\mu_t}(x) \in T_M(x)$, where $T_M(x)$ is the tangent space to $M$ at $x$. Then $M$ is flow-invariant for (4).

The formal statement, Proposition 20, stated and proved in Appendix B.9, can in particular be applied to the ubiquitous radial kernels and to planes of symmetry of $\pi$, i.e. affine subspaces $M \subset \mathbb{R}^d$ such that the density of $\pi$ is symmetric w.r.t. $M$. Lemma 11 is illustrated in Section 5 for a mixture of two Gaussians with the same variance (Figure 3).

Lemma 11. Let $M$ be a plane of symmetry of $\pi$ and consider a radial kernel $k(x, y) = \phi(\|x - y\|^2/2)$ with $\phi \in C^3(\mathbb{R})$. Then, for all $(x, y) \in M^2$, $\nabla_2 k_\pi(x, y) \in T_M(x)$ and $M$ is flow-invariant for (7).

Proof idea: We show that all the terms in $\nabla_2 k_\pi(x, y)$ belong to $T_M(x)$. This implies that the convex combination $\nabla_{W_2} F(\mu)(y) = \mathbb{E}_{x\sim\mu}[\nabla_2 k_\pi(x, y)] \in T_M(x)$. We then apply Proposition 10 to conclude.
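To illustrate Lemma 9 numerically, the following sketch of ours assumes a Gaussian kernel of bandwidth $h$, for which $k_\pi(x, x) = \|s(x)\|^2 + d/h$ (the Lemma 9 expansion with $\phi(0) = 1$), and checks with PyTorch autodifferentiation that the single-particle objective has a vanishing gradient at the midpoint of a symmetric two-Gaussian mixture, where the score vanishes; the mixture parameters are illustrative.

```python
import torch

def log_prob(x, sigma2=0.1):
    # balanced two-Gaussian mixture with means (+1, 0) and (-1, 0), variance sigma2
    m = torch.tensor([[1.0, 0.0], [-1.0, 0.0]])
    lp = -((x - m) ** 2).sum(-1) / (2 * sigma2)
    return torch.logsumexp(lp, dim=0)

def score(x):
    return torch.autograd.grad(log_prob(x), x, create_graph=True)[0]

# Single-particle objective: F(delta_x) = 0.5 * k_pi(x, x) = 0.5 * ||s(x)||^2 + d/(2h),
# so its gradient is J_s(x)^T s(x), which vanishes wherever s(x) = 0.
x0 = torch.zeros(2, requires_grad=True)      # midpoint of the two modes: s(x0) = 0
s = score(x0)
grad_F = torch.autograd.grad(0.5 * (s ** 2).sum(), x0)[0]
print(s, grad_F)                             # both are (numerically) zero
```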
## 5. Experiments

In this section, we discuss the performance of KSD Descent to sample from $\pi$ in practice, on toy examples and real-world problems. The code to reproduce the experiments and a package to use KSD Descent are available at https://github.com/pierreablin/ksddescent. For all our experiments, we use a Gaussian kernel, as we did not notice any difference in practice w.r.t. the IMQ kernel. Its bandwidth is selected by cross-validation. Implementation details and additional experiments can be found in Appendix D.

### 5.1. Toy examples

In the first example, we choose $\pi$ to be a standard 2D Gaussian, and a Gaussian $k$ with unit bandwidth. We initialize with 50 particles drawn from a Gaussian with mean $(1, 1)$. Figure 1 displays the trajectories of several different methods: SVGD, KSD Descent implemented using gradient descent (Algorithm 1) and L-BFGS (Algorithm 2), and the MMD flow (Arbel et al., 2019).

Figure 1. Toy example with a 2D standard Gaussian. The green points represent the initial positions of the particles. The light grey curves correspond to their trajectories under the different $v_{\mu_t}$.

To assess the convergence of the algorithms, for SVGD we monitored the norm of the displacement, while for the KSD and MMD gradient flows we used the tolerance parameter of L-BFGS. KSD Descent successfully pushes the particles towards the target distribution, with a final result that is well distributed around the mode. While KSD performs similarly to SVGD, we can notice that the trajectories of the particles behave very differently. Indeed, SVGD trajectories appear to be at first driven by the score term in the update, while the repulsive term acts later to spread the particles around the mode. In contrast, trajectories of the particles updated by KSD Descent are first influenced by the last, repulsive term of the update, which seems to determine their global layout, and are then translated and contracted towards the mode under the action of the driving terms. Finally, for the MMD descent, some particles collapse around the mode, while others stay far from the target. This behavior was documented in Arbel et al. (2019), and can be partly circumvented by injecting some noise in the updates.

We then compare the convergence speed of KSD Descent and SVGD in terms of the KL or KSD objective (see Figure 2). With a fine-tuned step-size, SVGD is the fastest method. However, taking a step-size too large leads to non-convergence, while one too small leads to slow convergence. It should be stressed that it is hard to select a good step-size, or to implement a line-search strategy, since SVGD does not minimize a simple function. In contrast, the empirical KSD (11) can be minimized using L-BFGS, which does not have any critical hyper-parameter.

Figure 2. Convergence speed of KSD and SVGD on a Gaussian problem in 1D, with 30 particles (x-axis: time in seconds; curves: KSD, and SVGD with small, good, and big step-sizes).
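As a usage illustration of this first toy example, the snippet below relies on the `ksd_descent_lbfgs` sketch from Section 3.2 (not the released package): 50 particles initialized around $(1, 1)$ are driven toward a standard 2D Gaussian target, whose score is simply $s(x) = -x$.

```python
import torch

torch.manual_seed(0)
x0 = torch.randn(50, 2) + torch.tensor([1.0, 1.0])   # initial particles around (1, 1)
particles = ksd_descent_lbfgs(x0, lambda z: -z, h=1.0)
print(particles.mean(0), particles.var(0))           # roughly zero mean, unit variance
```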
In our second example, we apply KSD Descent for $\pi$ taken as a symmetric mixture of two Gaussians with the same variance. This highlights the results of Section 4.2. If initialized on the axis of symmetry, the particles are indeed stuck on it, as stated in Lemma 11. We noticed that, for a large variance of $\pi$ (e.g. in $[0.2, 1]$), this axis is unstable. However, when the variance is too small (e.g. set to 0.1 as in Figure 3), the axis can even become a locally stable set. We also observed that, for a distribution initialized exclusively on one side of the axis, a single component of the mixture can be favored. This is a classical behavior of score-based methods, depending typically on the variance of $\pi$ (Wenliang, 2020).

Figure 3. KSD Descent applied on a balanced mixture of Gaussians with small variance (0.1) in 2D. The centroids are at $(-1, 0)$ and $(1, 0)$. The green crosses indicate the initial particle positions, while the blue ones are the final positions. The light red arrows correspond to the score directions. Left: the initial positions are all chosen close to the line $x = 0$, which corresponds to an axis of symmetry of $\pi$. Right: even when the initialization is more spread, some particles are still caught in this spurious local minimum.

To fix this issue, we consider an annealing strategy as suggested by Wenliang (2020). It consists in adding an inverse temperature variable $\beta$ to the log density of the model, i.e. $\pi_\beta(x) \propto \exp(-\beta V(x))$ for $\pi(x) \propto \exp(-V(x))$, with $\beta \in (0, 1]$. This is easily implemented with score-based methods, since it simply corresponds to multiplying $s(x)$ by $\beta$. When $\beta$ is small, annealing smoothes the target distribution and the last term of the Stein kernel, repulsive at short distance, becomes dominant; on the other hand, for $\beta$ close to 1, we recover the true log density. To implement this method, we start with $\beta = 0.1$, and run KSD Descent to obtain particles at "high temperature". KSD Descent is then re-run starting from these particles, setting now $\beta = 1$. One can see that this strategy successfully solves the issues encountered when the KSD flow was failing to converge to the target $\pi$ (Figure 4). This correction differs from the noise-injection strategy proposed in Arbel et al. (2019) for the MMD flow, which is rather related to randomized smoothing (Duchi et al., 2012). Noise-injection would prevent the use of L-BFGS in our case, as it requires exact values of the gradients of previous iterations. Annealing, on the other hand, is compatible with Algorithm 2.
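Here is a minimal sketch of this two-stage annealing, assuming the `ksd_descent_lbfgs` helper from Section 3.2: the tempered score is simply $\beta\, s(x)$, and the next stage is warm-started from the previous particles. The example mixture mirrors the narrow two-Gaussian target above; its parameter values are illustrative.

```python
import torch

def annealed_ksd_descent(x0, score_fn, betas=(0.1, 1.0), h=1.0):
    """Run KSD Descent on the tempered target pi_beta (score beta * s),
    warm-starting each stage from the previous particles."""
    x = x0
    for beta in betas:
        x = ksd_descent_lbfgs(x, lambda z: beta * score_fn(z), h=h)
    return x

def mixture_score(x, sigma2=0.1):
    # score of a balanced mixture of two narrow Gaussians at (+-1, 0)
    m = torch.tensor([[1.0, 0.0], [-1.0, 0.0]])
    diff = m[None, :, :] - x[:, None, :]            # (N, 2, d)
    logw = -(diff ** 2).sum(-1) / (2 * sigma2)      # per-component log-weights
    w = torch.softmax(logw, dim=1)                  # responsibilities
    return (w[:, :, None] * diff).sum(1) / sigma2   # grad log pi(x)

x0 = torch.randn(50, 2)
particles = annealed_ksd_descent(x0, mixture_score)
```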
Figure 4. Effect of the annealing strategy on a mixture of Gaussians. Left: without annealing, some particles fall into a spurious minimum. Middle: with a higher temperature ($\beta = 0.1$), the particles are more spread out. Right: starting from the particles in the middle figure and setting $\beta = 1$, we converge to a distribution which minimizes the KSD and has no spurious particles.

### 5.2. Bayesian Independent Component Analysis

Independent Component Analysis (ICA, Comon, 1994) is the generative model $x = W^{-1}s$, where $x$ is an observed sample in $\mathbb{R}^p$, $W \in \mathbb{R}^{p\times p}$ is the unknown square unmixing matrix, and $s \in \mathbb{R}^p$ are the independent sources. We assume that each component has the same density, $s_i \sim p_s$. The log-likelihood of the model is $\log p(x|W) = \log|W| + \sum_{i=1}^p \log p_s([Wx]_i)$. For our prior, we assume that $W$ has i.i.d. entries of law $\mathcal{N}(0, 1)$. The posterior is $p(W|x) \propto p(x|W)\, p(W)$, and the score is given by $s(W) = W^{-\top} - \psi(Wx)x^\top - W$, where $\psi = -\frac{p_s'}{p_s}$. In practice, we choose $p_s$ such that $\psi(\cdot) = \tanh(\cdot)$. We then use the presented algorithms to draw particles $W \sim p(W|x)$. We use $N = 10$ particles, and take 1000 samples $x$ from the ICA model for $p \in \{2, 4, 8\}$. Each method outputs $N$ estimated unmixing matrices, $[\widehat W_i]_{i=1}^N$. We compute the Amari distance (Amari et al., 1996) between each $\widehat W_i$ and $W$: the Amari distance vanishes if and only if the two matrices are the same up to scale and permutation, which are the natural indeterminacies of ICA. We repeat the experiment 50 times, resulting in 500 values for each algorithm (Figure 5). We also add the results of a random output, where the estimated matrices are obtained with i.i.d. $\mathcal{N}(0, 1)$ entries. We see that for this experiment, KSD performs barely better than random, while SVGD finds matrices with lower Amari distance. One explanation is that the ICA likelihood is highly non-convex (Cardoso, 1998). This is easily seen with the invariances of the problem: permuting the rows of $W$ does not change $p(x|W)$. As a consequence, the posterior has many saddle points, in which particles might get trapped. Unfortunately, the annealing strategy proposed above did not improve the achieved performance for this problem.

Figure 5. Bayesian ICA results. Left: $p = 2$. Middle: $p = 4$. Right: $p = 8$. Each dot corresponds to the Amari distance between an estimated matrix and the true unmixing matrix (methods: Random, KSD, SVGD).

### 5.3. Real-world data

We compare KSD Descent and SVGD in the Bayesian logistic regression setting described in Gershman et al. (2012); Liu & Wang (2016). Given datapoints $d_1, \ldots, d_q \in \mathbb{R}^p$ and labels $y_1, \ldots, y_q \in \{\pm 1\}$, the labels $y_i$ are modelled as $p(y_i = 1 | d_i, w) = (1 + \exp(-w^\top d_i))^{-1}$ for some $w \in \mathbb{R}^p$. The parameters $w$ follow the law $p(w|\alpha) = \mathcal{N}(0, \alpha^{-1}I_p)$, and $\alpha > 0$ is drawn from an exponential law $p(\alpha) = \mathrm{Exp}(0.01)$. The parameter vector is then $x = [w, \log\alpha] \in \mathbb{R}^{p+1}$, and we use Algorithm 2 to obtain samples from $p(x\,|\,(d_i, y_i)_{i=1}^q)$ for 13 datasets, with $N = 10$ particles for each. The learning rate for SVGD and the bandwidth of the kernel for both methods are chosen through grid search, and for each problem we select the hyper-parameters yielding the best test accuracy. For all problems, the running times of SVGD with the best step-size and of KSD Descent were similar, while KSD Descent has the advantage of having one less hyper-parameter. We present in Figure 6 the accuracy of each method on each dataset, where KSD Descent was applied without annealing since it did not change the numerical results. Our results show that we match the SVGD performance without having to fine-tune the step-size, owing to Algorithm 2. We posit that KSD succeeds on this task because the posterior $p(x\,|\,(d_i, y_i)_{i=1}^q)$ is log-concave and does not have saddle points.

Figure 6. Accuracy of KSD Descent and SVGD on Bayesian logistic regression for 13 datasets. Both methods yield similar results. KSD is better by 2% on one dataset.

Discussion. KSD Descent benefits from a tractable loss and can be straightforwardly implemented with L-BFGS, achieving performance on par with SVGD on convex problems. However, its dissipation has non-trivial stationary points, which prevents its use for non-convex problems with saddle points such as ICA. Convergence of kernel-based sampling schemes is known to be difficult, and we provided some intuitions on the reasons for it. This leaves the door open to a more in-depth analysis of kernel-based gradient flows, especially for unbounded kernels.

Acknowledgments. A.K. thanks the GENES, and S.M. the A.N.R. ABSint (Projet-ANR-18-CE40-0034) for the financial support. P.A. thanks A.N.R. ANR19-P3IA-0001.
## References

Amari, S.-I., Cichocki, A., and Yang, H. H. A new learning algorithm for blind signal separation. In Advances in Neural Information Processing Systems, pp. 757–763, 1996.

Ambrosio, L. and Crippa, G. Continuity equations and ODE flows with non-smooth velocity. Proceedings of the Royal Society of Edinburgh: Section A Mathematics, 144(6):1191–1244, 2014.

Ambrosio, L., Gigli, N., and Savaré, G. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.

Anastasiou, A., Barp, A., Briol, F.-X., Ebner, B., Gaunt, R. E., Ghaderinezhad, F., Gorham, J., Gretton, A., Ley, C., Liu, Q., Mackey, L., Oates, C. J., Reinert, G., and Swan, Y. Stein's method meets statistics: A review of some recent developments. arXiv preprint arXiv:2105.03481, 2021.

Arbel, M., Korba, A., Salim, A., and Gretton, A. Maximum mean discrepancy gradient flow. Advances in Neural Information Processing Systems, 2019.

Aubin, J.-P. and Frankowska, H. Set-Valued Analysis. Birkhäuser Boston, 1990.

Bakry, D., Gentil, I., and Ledoux, M. Analysis and geometry of Markov diffusion operators, volume 348. Springer Science & Business Media, 2013.

Bonnet, B. and Frankowska, H. Differential inclusions in Wasserstein spaces: The Cauchy-Lipschitz framework. Journal of Differential Equations, 271:594–637, January 2021.

Brezis, H. Functional analysis, Sobolev spaces and partial differential equations. Springer Science & Business Media, 2010.

Cardoso, J.-F. Blind signal separation: statistical principles. Proceedings of the IEEE, 86(10):2009–2025, 1998.

Chen, W. Y., Mackey, L., Gorham, J., Briol, F.-X., and Oates, C. J. Stein points. International Conference on Machine Learning (ICML), 2018.

Chen, W. Y., Barp, A., Briol, F.-X., Gorham, J., Girolami, M., Mackey, L., Oates, C., et al. Stein point Markov Chain Monte Carlo. International Conference on Machine Learning (ICML), 2019.

Chewi, S., Gouic, T. L., Lu, C., Maunu, T., and Rigollet, P. SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence. Advances in Neural Information Processing Systems, 2020.

Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. In International Conference on Machine Learning (ICML), pp. 2606–2615, 2016.

Clémençon, S., Colin, I., and Bellet, A. Scaling-up empirical risk minimization: optimization of incomplete U-statistics. Journal of Machine Learning Research, 17(1):2682–2717, 2016.

Comon, P. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.

Duchi, J. C., Bartlett, P. L., and Wainwright, M. J. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012.

Duncan, A., Nüsken, N., and Szpruch, L. On the geometry of Stein variational gradient descent. arXiv preprint arXiv:1912.00894, 2019.

Durmus, A., Majewski, S., and Miasojedow, B. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.

Eberle, A. Reflection couplings and contraction rates for diffusions. Probability Theory and Related Fields, 166(3):851–886, 2016.

Fisher, M. A., Nolan, T., Graham, M. M., Prangle, D., and Oates, C. J. Measure transport with kernel Stein discrepancy. arXiv preprint arXiv:2010.11779, 130:1054–1062, 2021.

Futami, F., Cui, Z., Sato, I., and Sugiyama, M. Bayesian posterior approximation via greedy particle optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3606–3613, 2019.

Gershman, S., Hoffman, M., and Blei, D. Nonparametric variational inference. In International Conference on Machine Learning (ICML), pp. 235–242, 2012.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, volume 27, pp. 2672–2680, 2014.

Gorham, J. and Mackey, L. Measuring sample quality with kernels. In International Conference on Machine Learning (ICML), volume 70, pp. 1292–1301. JMLR.org, 2017.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.

Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., and Oliphant, T. E. Array programming with NumPy. Nature, 585(7825):357–362, September 2020. doi: 10.1038/s41586-020-2649-2. URL https://doi.org/10.1038/s41586-020-2649-2.

Hodgkinson, L., Salomone, R., and Roosta, F. The reproducing Stein kernel approach for post-hoc corrected sampling. arXiv preprint arXiv:2001.09266, 2020.

Hunter, J. D. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi: 10.1109/MCSE.2007.55.

Hyvärinen, A. and Dayan, P. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.

Jordan, R., Kinderlehrer, D., and Otto, F. The variational formulation of the Fokker-Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.

Kanagawa, H., Jitkrittum, W., Mackey, L., Fukumizu, K., and Gretton, A. A kernel Stein test for comparing latent variable models. arXiv preprint arXiv:1907.00586, 2020.

Klenke, A. Probability theory: a comprehensive course. Springer Science & Business Media, 2013.

Korba, A., Salim, A., Arbel, M., Luise, G., and Gretton, A. A non-asymptotic analysis for Stein variational gradient descent. Advances in Neural Information Processing Systems, 33, 2020.

Kreyszig, E. Introductory functional analysis with applications, volume 1. Wiley, New York, 1978.

Liu, D. C. and Nocedal, J. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.

Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378–2386, 2016.

Liu, Q., Lee, J., and Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning (ICML), pp. 276–284, 2016.

Mroueh, Y., Sercu, T., and Raj, A. Sobolev descent. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 2976–2985, 2019.

Otto, F. The geometry of dissipative evolution equations: the porous medium equation. Communications in Partial Differential Equations, 26(1-2):101–174, 2001.

Otto, F. and Villani, C. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173(2):361–400, 2000.

Pankrashkin, K. Introduction to the spectral theory. Lecture notes, Université Paris-Sud, 2014.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.

Pavliotis, G. A. Stochastic processes and applications: diffusion processes, the Fokker-Planck and Langevin equations, volume 60. Springer, 2014.

Renardy, M. and Rogers, R. C. An introduction to partial differential equations, volume 13. Springer Science & Business Media, 2006.

Riabiz, M., Chen, W., Cockayne, J., Swietach, P., Niederer, S. A., Mackey, L., Oates, C., et al. Optimal thinning of MCMC output. arXiv preprint arXiv:2005.03952, 2020.

Saitoh, S. and Sawano, Y. Theory of Reproducing Kernels and Applications. Springer Singapore, 2016.

Salim, A., Korba, A., and Luise, G. Wasserstein proximal gradient. In Advances in Neural Information Processing Systems, 2020.

Simon-Gabriel, C. J. Distribution-Dissimilarities in Machine Learning. PhD thesis, Eberhard Karls Universität Tübingen, Germany, 2018.

Sriperumbudur, B. K., Fukumizu, K., and Lanckriet, G. R. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(70):2389–2410, 2011.

Steinwart, I. and Christmann, A. Support vector machines. Springer Science & Business Media, 2008.

Villani, C. Optimal transportation, dissipative PDE's and functional inequalities. In Optimal Transportation and Applications, pp. 53–89. Springer, 2003.

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C. J., Polat, I., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, F., van Mulbregt, P., and SciPy 1.0 Contributors. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261–272, 2020. doi: 10.1038/s41592-019-0686-2.

Wenliang, L. K. Blindness of score-based methods to isolated components and mixing proportions. arXiv preprint arXiv:2008.10087, 2020.

Wibisono, A. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. Conference on Learning Theory, 2018.

Xu, W. and Matsuda, T. A Stein goodness-of-fit test for directional distributions. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 108, pp. 320–330, 2020.