Stein Variational Gradient Descent as Gradient Flow

Qiang Liu
Department of Computer Science
Dartmouth College
Hanover, NH 03755
qiang.liu@dartmouth.edu

Abstract

Stein variational gradient descent (SVGD) is a deterministic sampling algorithm that iteratively transports a set of particles to approximate given distributions, based on a gradient-based update that is guaranteed to optimally decrease the KL divergence within a function space. This paper develops the first theoretical analysis of SVGD. We establish that the empirical measures of the SVGD samples weakly converge to the target distribution, and show that the asymptotic behavior of SVGD is characterized by a nonlinear Fokker-Planck equation known in physics as the Vlasov equation. We develop a geometric perspective that views SVGD as a gradient flow of the KL divergence functional under a new metric structure on the space of distributions induced by the Stein operator.

1 Introduction

Stein variational gradient descent (SVGD) [1] is a particle-based algorithm for approximating complex distributions. Unlike typical Monte Carlo algorithms that rely on randomness for approximation, SVGD constructs a set of points (or particles) by iteratively applying deterministic updates that are constructed to optimally decrease the KL divergence to the target distribution at each iteration. SVGD has a simple form that efficiently leverages the gradient information of the distribution, and can be readily applied to complex models with massive datasets for which typical gradient descent has been found efficient. A nice property of SVGD is that it strictly reduces to typical gradient ascent for maximum a posteriori (MAP) estimation when using only a single particle ($n = 1$), while it turns into a full sampling method with more particles.
Because MAP often provides reasonably good results in practice, SVGD is found to be more particle-efficient than typical Monte Carlo methods, which require much larger numbers of particles to achieve good results. SVGD can be viewed as a variational inference algorithm [e.g., 2], but it is significantly different from typical parametric variational inference algorithms, which use parametric sets to approximate given distributions and have the disadvantage of introducing deterministic bias and (often) requiring non-convex optimization. The non-parametric nature of SVGD allows it to provide consistent estimation for generic distributions, as Monte Carlo does. There are also particle algorithms based on optimization, or variational principles, with theoretical guarantees [e.g., 3-5], but they often do not use the gradient information effectively and do not scale well in high dimensions.

However, SVGD is difficult to analyze theoretically because it involves a system of particles that interact with each other in a complex way. In this work, we take an initial step towards analyzing SVGD. We characterize the SVGD dynamics using an evolutionary process of the empirical measures of the particles that is known as a Vlasov process in physics, and establish that the empirical measures of the particles weakly converge to the given target distribution. We develop a geometric interpretation of SVGD that views it as a gradient flow of the KL divergence, defined on a new Riemannian-like metric structure imposed on the space of density functions.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

2 Stein Variational Gradient Descent (SVGD)

We start with a brief overview of SVGD [1]. Let $\nu_p$ be a probability measure of interest with a positive, (weakly) differentiable density $p(x)$ on an open set $\mathcal{X} \subseteq \mathbb{R}^d$.
We want to approximate $\nu_p$ with a set of particles $\{x^i\}_{i=1}^n$ whose empirical measure $\hat\mu_n(dx) = \sum_{i=1}^n \delta(x - x^i)/n \, dx$ weakly converges to $\nu_p$ as $n \to \infty$ (denoted by $\hat\mu_n \Rightarrow \nu_p$), in the sense that $\mathbb{E}_{\hat\mu_n}[h] \to \mathbb{E}_{\nu_p}[h]$ as $n \to \infty$ for all bounded, continuous test functions $h$.

To achieve this, we initialize the particles with some simple distribution $\mu$, and update them via the map $T(x) = x + \epsilon \phi(x)$, where $\epsilon$ is a small step size, and $\phi(x)$ is a perturbation direction, or velocity field, which should be chosen to maximally decrease the KL divergence between the particle distribution and the target distribution; this is framed by [1] as solving the following functional optimization:

$$\max_{\phi \in \mathcal{H}} \Big\{ -\frac{d}{d\epsilon} \mathrm{KL}(T\mu \,\|\, \nu_p) \big|_{\epsilon=0} \quad \text{s.t.} \quad \|\phi\|_{\mathcal{H}} \le 1 \Big\}, \qquad (1)$$

where $\mu$ denotes the (empirical) measure of the current particles, $T\mu$ is the measure of the updated particles $x' = T(x)$ with $x \sim \mu$, i.e., the pushforward measure of $\mu$ through the map $T$, and $\mathcal{H}$ is a normed function space chosen to optimize over.

A key observation is that the objective in (1) is a linear functional of $\phi$ that draws connections to ideas in Stein's method [6], used for proving limit theorems and probabilistic bounds in theoretical statistics. Liu and Wang [1] showed that

$$-\frac{d}{d\epsilon} \mathrm{KL}(T\mu \,\|\, \nu_p) \big|_{\epsilon=0} = \mathbb{E}_\mu[\mathcal{S}_p \phi], \quad \text{with} \quad \mathcal{S}_p \phi(x) := \nabla \log p(x)^\top \phi(x) + \nabla \cdot \phi(x), \qquad (2)$$

where $\nabla \cdot \phi := \sum_{k=1}^d \partial_{x_k} \phi_k(x)$, and $\mathcal{S}_p$ is a linear operator that maps a vector-valued function $\phi$ to a scalar-valued function $\mathcal{S}_p \phi$; $\mathcal{S}_p$ is called the Stein operator, in connection with the so-called Stein's identity, which shows that the RHS of (2) equals zero if $\mu = \nu_p$:

$$\mathbb{E}_p[\mathcal{S}_p \phi] = \mathbb{E}_p[\nabla \log p^\top \phi + \nabla \cdot \phi] = \int \nabla \cdot (p \phi) \, dx = 0; \qquad (3)$$

this is the result of integration by parts, assuming proper zero boundary conditions. Therefore, the optimization (1) reduces to

$$\mathbb{D}(\mu \,\|\, \nu_p) := \max_{\phi \in \mathcal{H}} \big\{ \mathbb{E}_\mu[\mathcal{S}_p \phi] \quad \text{s.t.} \quad \|\phi\|_{\mathcal{H}} \le 1 \big\}, \qquad (4)$$

where $\mathbb{D}(\mu \,\|\, \nu_p)$ is called the Stein discrepancy, which provides a discrepancy measure between $\mu$ and $\nu_p$, since $\mathbb{D}(\mu \,\|\, \nu_p) = 0$ if $\mu = \nu_p$ and $\mathbb{D}(\mu \,\|\, \nu_p) > 0$ if $\mu \neq \nu_p$, given that $\mathcal{H}$ is sufficiently large.
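As a quick numerical illustration of Stein's identity (3) (our own sketch, not part of the paper), the following checks by Monte Carlo that $\mathbb{E}_p[\mathcal{S}_p\phi] \approx 0$ for a standard Gaussian target, whose score is $\nabla \log p(x) = -x$, and an arbitrary smooth choice $\phi(x) = \sin(x)$:

```python
import numpy as np

# Monte Carlo check of Stein's identity E_p[S_p phi] = 0 for the target
# p = N(0, 1), whose score is grad log p(x) = -x.  phi(x) = sin(x) is an
# arbitrary smooth test function (an illustrative choice).
rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)

score = -x                # grad_x log p(x) for N(0, 1)
phi = np.sin(x)           # velocity field phi(x)
div_phi = np.cos(x)       # phi'(x), i.e. the divergence term in 1-d

stein = np.mean(score * phi + div_phi)   # estimate of E_p[S_p phi]
print(f"E_p[S_p phi] ~ {stein:.4f}")     # close to 0
```

For a distribution $\mu \neq \nu_p$, the same expectation is generally nonzero; the Stein discrepancy (4) is precisely its largest possible value over unit-norm $\phi$.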
Because (4) is an infinite-dimensional functional optimization, it is critical to select a space $\mathcal{H}$ that is both sufficiently rich and computationally tractable in practice. Kernelized Stein discrepancy (KSD) provides one way to achieve this by taking $\mathcal{H}$ to be a reproducing kernel Hilbert space (RKHS), for which the optimization yields a closed-form solution [7-10]. To be specific, let $\mathcal{H}_0$ be an RKHS of scalar-valued functions with a positive definite kernel $k(x, x')$, and $\mathcal{H} = \mathcal{H}_0 \times \cdots \times \mathcal{H}_0$ the corresponding $d \times 1$ vector-valued RKHS. Then it can be shown that the optimal solution of (4) is

$$\phi^*_{\mu,p}(\cdot) \propto \mathbb{E}_{x \sim \mu}[\mathcal{S}_p k(x, \cdot)], \quad \text{with} \quad \mathcal{S}_p k(x, \cdot) := \nabla \log p(x) \, k(x, \cdot) + \nabla_x k(x, \cdot), \qquad (5)$$

where $\mathcal{S}_p$ is an outer-product variant of the Stein operator which maps a scalar-valued function to a vector-valued one. Further, it has been shown [e.g., 7] that

$$\mathbb{D}(\mu \,\|\, \nu_p) = \|\phi^*_{\mu,p}\|_{\mathcal{H}} = \sqrt{\mathbb{E}_{x,x' \sim \mu}[\kappa_p(x, x')]}, \quad \text{with} \quad \kappa_p(x, x') := \mathcal{S}_p^x \mathcal{S}_p^{x'} k(x, x'), \qquad (6)$$

where $\kappa_p(x, x')$ is a "Steinalized" positive definite kernel obtained by applying the Stein operator twice; $\mathcal{S}_p^x$ and $\mathcal{S}_p^{x'}$ are the Stein operators w.r.t. the variables $x$ and $x'$, respectively. The key advantage of KSD is its computational tractability: it can be empirically evaluated with samples drawn from $\mu$ and the gradient $\nabla \log p$, which is independent of the normalization constant of $p$ [see 7, 8].

Algorithm 1 Stein Variational Gradient Descent [1]
Input: The score function $\nabla_x \log p(x)$.
Goal: A set of particles $\{x^i\}_{i=1}^n$ that approximates $p(x)$.
Initialize a set of particles $\{x_0^i\}_{i=1}^n$; pick a positive definite kernel $k(x, x')$ and step sizes $\{\epsilon_\ell\}$.
For iteration $\ell$ do:

$$x_{\ell+1}^i \leftarrow x_\ell^i + \epsilon_\ell \, \phi^*_{\hat\mu_\ell^n, p}(x_\ell^i), \quad \forall i = 1, \ldots, n, \quad \text{where} \quad \phi^*_{\hat\mu_\ell^n, p}(x) = \frac{1}{n} \sum_{j=1}^n \big[ \nabla \log p(x_\ell^j) \, k(x_\ell^j, x) + \nabla_{x_\ell^j} k(x_\ell^j, x) \big]. \qquad (8)$$

An important theoretical issue related to KSD is to characterize when $\mathcal{H}$ is rich enough to ensure $\mathbb{D}(\mu \,\|\, \nu_p) = 0$ iff $\mu = \nu_p$; this has been studied by Liu et al. [7], Chwialkowski et al. [8], and Oates et al. [11].
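To illustrate how the squared KSD in (6) can be evaluated in practice (an illustrative sketch under our own choices, not code from the paper), the following computes the V-statistic estimate $\frac{1}{n^2}\sum_{i,j}\kappa_p(x^i, x^j)$ for a 1-d standard Gaussian target with an RBF kernel, expanding the four terms of $\kappa_p$ in closed form:

```python
import numpy as np

def ksd_sq(x, score, h=1.0):
    """V-statistic estimate of the squared KSD, (1/n^2) sum_{i,j} kappa_p(x_i, x_j),
    for 1-d samples x, score(x) = grad log p(x), and RBF kernel
    k(x, y) = exp(-(x - y)^2 / (2 h^2))."""
    d = x[:, None] - x[None, :]                 # d = x - y
    k = np.exp(-d**2 / (2 * h**2))
    sx, sy = score(x)[:, None], score(x)[None, :]
    # kappa_p(x, y) = s(x)s(y)k + s(x) d_y k + s(y) d_x k + d_x d_y k
    kappa = (sx * sy * k
             + sx * (d / h**2) * k              # s(x) * d_y k
             + sy * (-d / h**2) * k             # s(y) * d_x k
             + (1.0 / h**2 - d**2 / h**4) * k)  # d_x d_y k
    return kappa.mean()

score = lambda x: -x                            # score of target p = N(0, 1)
rng = np.random.default_rng(1)
print(ksd_sq(rng.standard_normal(500), score))        # near 0: samples from p
print(ksd_sq(rng.standard_normal(500) + 2.0, score))  # large: mean-shifted samples
```

Note that only the score function appears, so the estimate is unaffected by the normalization constant of $p$, as stated above.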
More recently, Gorham and Mackey [10] (Theorem 8) established a stronger result that Stein discrepancy implies weak convergence on $\mathcal{X} = \mathbb{R}^d$: let $\{\mu_\ell\}_{\ell=1}^\infty$ be a sequence of probability measures; then

$$\mathbb{D}(\mu_\ell \,\|\, \nu_p) \to 0 \implies \mu_\ell \Rightarrow \nu_p \quad \text{as } \ell \to \infty, \qquad (7)$$

for $\nu_p$ that are distantly dissipative (Definition 4 of Gorham and Mackey [10]) and a class of inverse multi-quadric kernels. Since the focus of this work is on SVGD, we will assume (7) holds without further examination.

The SVGD algorithm iteratively updates a set of particles using the optimal transform just derived, starting from a certain initialization. Let $\{x_\ell^i\}_{i=1}^n$ be the particles at the $\ell$-th iteration. In this case, the exact distributions of $\{x_\ell^i\}_{i=1}^n$ are unknown or difficult to keep track of, but can be best approximated by their empirical measure $\hat\mu_\ell^n(dx) = \sum_i \delta(x - x_\ell^i) \, dx / n$. Therefore, it is natural to think that $\phi^*_{\hat\mu_\ell^n, p}$, with $\mu$ in (5) replaced by $\hat\mu_\ell^n$, provides the best update direction for moving the particles (and equivalently $\hat\mu_\ell^n$) closer to $\nu_p$. Implementing this update (8) iteratively, we get the main SVGD algorithm in Algorithm 1.

Intuitively, the update in (8) pushes the particles towards the high-probability regions of the target via the gradient term $\nabla \log p$, while maintaining a degree of diversity via the second term $\nabla k(x, x^i)$. In addition, (8) reduces to typical gradient descent for maximizing $\log p$ if we use only a single particle ($n = 1$) and the kernel satisfies $\nabla k(x, x') = 0$ for $x = x'$; this allows SVGD to provide a spectrum of approximations that smoothly interpolates between maximum a posteriori (MAP) optimization and a full sampling approximation by using different particle sizes, enabling an efficient trade-off between accuracy and computation cost.

Despite the similarity to gradient descent, we should point out that the SVGD update in (8) does not correspond to minimizing any objective function $F(\{x_\ell^i\})$ in terms of the particle locations $\{x_\ell^i\}$, because if it did, one would find $\partial_{x^i} \partial_{x^j} F = \partial_{x^j} \partial_{x^i} F$, a symmetry that the SVGD update does not satisfy.
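A minimal 1-d sketch of Algorithm 1 may be helpful at this point (illustrative only: the target, step size, and iteration count are our own choices; the median-heuristic bandwidth follows [1]):

```python
import numpy as np

def svgd_update(x, score, stepsize):
    """One SVGD iteration (8) for 1-d particles x, with an RBF kernel whose
    bandwidth is set by the median heuristic used in [1]."""
    n = x.shape[0]
    d = x[:, None] - x[None, :]                    # d[i, j] = x_i - x_j
    h2 = np.median(d**2) / np.log(n + 1) + 1e-12   # median-heuristic bandwidth
    k = np.exp(-d**2 / (2 * h2))                   # k(x_j, x_i)
    # phi*(x_i) = (1/n) sum_j [score(x_j) k(x_j, x_i) + grad_{x_j} k(x_j, x_i)]
    phi = (k @ score(x) + (d * k).sum(axis=1) / h2) / n
    return x + stepsize * phi

# Illustrative run: sample p = N(3, 1) starting from particles near 0.
rng = np.random.default_rng(2)
x = rng.standard_normal(100)
score = lambda x: -(x - 3.0)        # grad log p for p = N(3, 1)
for _ in range(500):
    x = svgd_update(x, score, stepsize=0.1)
print(x.mean(), x.std())            # roughly 3 and 1
```

With $n = 1$ the repulsive term vanishes (since $\nabla k(x, x')|_{x'=x} = 0$ for the RBF kernel) and the update is plain gradient ascent on $\log p$, matching the MAP reduction discussed above.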
Instead of an objective minimization, it is best to view SVGD as a type of (particle-based) numerical approximation of an evolutionary partial differential equation (PDE) of densities or measures, which corresponds to a special type of gradient flow of the KL divergence functional whose equilibrium state equals the given target distribution $\nu_p$, as we discuss in the sequel.

3 Density Evolution of SVGD Dynamics

This section collects our main results. We characterize the evolutionary process of the empirical measures $\hat\mu_\ell^n$ of the SVGD particles and their large sample limit as $n \to \infty$ (Section 3.1) and large time limit as $\ell \to \infty$ (Section 3.2), which together establish the weak convergence of $\hat\mu_\ell^n$ to the target measure $\nu_p$. Further, we show that the large sample limit of the SVGD dynamics is characterized by a Vlasov process, which monotonically decreases the KL divergence to the target distribution with a decreasing rate that equals the square of the Stein discrepancy (Sections 3.2-3.3). We also establish a geometric intuition that interprets SVGD as a gradient flow of the KL divergence under a new Riemannian metric structure induced by the Stein operator (Section 3.4). Section 3.5 provides a brief discussion of the connection to Langevin dynamics.

3.1 Large Sample Asymptotic of SVGD

Consider the optimal transform $T_{\mu,p}(x) = x + \epsilon \phi^*_{\mu,p}(x)$ with $\phi^*_{\mu,p}$ defined in (5). We define its related map $\Phi_p \colon \mu \mapsto T_{\mu,p}\mu$, where $T_{\mu,p}\mu$ denotes the pushforward measure of $\mu$ through the transform $T_{\mu,p}$. This map fully characterizes the SVGD dynamics in the sense that the empirical measures $\hat\mu_\ell^n$ can be obtained by recursively applying $\Phi_p$ starting from the initial measure $\hat\mu_0^n$:

$$\hat\mu_{\ell+1}^n = \Phi_p(\hat\mu_\ell^n), \quad \forall \ell \in \mathbb{N}. \qquad (9)$$

Note that $\Phi_p$ is a nonlinear map because the transform $T_{\mu,p}$ depends on the input measure $\mu$. If $\mu$ has a density $q$ and $\epsilon$ is small enough so that $T_{\mu,p}$ is invertible, the density $q'$ of $\mu' = \Phi_p(\mu)$ is given by the change of variables formula:

$$q'(z) = q(T_{\mu,p}^{-1}(z)) \, \big|\det\big(\nabla T_{\mu,p}^{-1}(z)\big)\big|. \qquad (10)$$

When $\mu$ is an empirical measure and $q$ is a Dirac delta function, this equation still holds formally in the sense of distributions (generalized functions).

Critically, $\Phi_p$ also fully characterizes the large sample limit of SVGD. Assume the initial empirical measure $\hat\mu_0^n$ at the 0-th iteration weakly converges to a measure $\mu_0^\infty$ as $n \to \infty$, which can be achieved, for example, by drawing $\{x_0^i\}$ i.i.d. from $\mu_0^\infty$, or by using MCMC or quasi-Monte Carlo methods. Starting from the limit initial measure $\mu_0^\infty$ and applying $\Phi_p$ recursively, we get

$$\mu_{\ell+1}^\infty = \Phi_p(\mu_\ell^\infty), \quad \forall \ell \in \mathbb{N}. \qquad (11)$$

Assuming $\hat\mu_0^n \Rightarrow \mu_0^\infty$ by initialization, we may expect that $\hat\mu_\ell^n \Rightarrow \mu_\ell^\infty$ for all finite iterations $\ell$ if $\Phi_p$ satisfies a certain Lipschitz condition. This is naturally captured by the bounded Lipschitz metric. For two measures $\mu$ and $\nu$, their bounded Lipschitz (BL) metric is defined to be their difference of means on the set of bounded, Lipschitz test functions:

$$\mathrm{BL}(\mu, \nu) = \sup_f \big\{ \mathbb{E}_\mu f - \mathbb{E}_\nu f \quad \text{s.t.} \quad \|f\|_{\mathrm{BL}} \le 1 \big\}, \quad \text{where} \quad \|f\|_{\mathrm{BL}} = \max\{\|f\|_\infty, \, \|f\|_{\mathrm{Lip}}\},$$

with $\|f\|_\infty = \sup_x |f(x)|$ and $\|f\|_{\mathrm{Lip}} = \sup_{x \neq y} \frac{|f(x) - f(y)|}{\|x - y\|_2}$. For a vector-valued bounded Lipschitz function $f = [f_1, \ldots, f_d]^\top$, we define its norm by $\|f\|_{\mathrm{BL}}^2 = \sum_{i=1}^d \|f_i\|_{\mathrm{BL}}^2$. It is known that the BL metric metrizes weak convergence, that is, $\mathrm{BL}(\mu_n, \nu) \to 0$ if and only if $\mu_n \Rightarrow \nu$.

Lemma 3.1. Assuming $g(x, y) := \mathcal{S}_p^x k(x, y)$ is bounded Lipschitz jointly in $(x, y)$ with norm $\|g\|_{\mathrm{BL}} < \infty$, then for any two probability measures $\mu$ and $\mu'$, we have

$$\mathrm{BL}(\Phi_p(\mu), \Phi_p(\mu')) \le (1 + 2\epsilon \|g\|_{\mathrm{BL}}) \, \mathrm{BL}(\mu, \mu').$$

Theorem 3.2. Let $\hat\mu_\ell^n$ be the empirical measure of $\{x_\ell^i\}_{i=1}^n$ at the $\ell$-th iteration of SVGD. Assuming $\lim_{n\to\infty} \mathrm{BL}(\hat\mu_0^n, \mu_0^\infty) = 0$, then for $\mu_\ell^\infty$ defined in (11), at any finite iteration $\ell$, we have $\lim_{n\to\infty} \mathrm{BL}(\hat\mu_\ell^n, \mu_\ell^\infty) = 0$.

Proof. It is a direct result of Lemma 3.1.

Since the BL metric metrizes weak convergence, our result suggests that $\hat\mu_\ell^n \Rightarrow \mu_\ell^\infty$ for each $\ell$ if $\hat\mu_0^n \Rightarrow \mu_0^\infty$ by initialization. The bound of the BL metric in Lemma 3.1 increases by a factor of $(1 + 2\epsilon \|g\|_{\mathrm{BL}})$ at each iteration.
We can prevent the explosion of the BL bound by decaying the step size sufficiently fast. It may be possible to obtain tighter bounds; however, it is fundamentally impossible to get a factor smaller than one without further assumptions: suppose we could get $\mathrm{BL}(\Phi_p(\mu), \Phi_p(\mu')) \le \alpha \, \mathrm{BL}(\mu, \mu')$ for some constant $\alpha \in [0, 1)$; then starting from any initial $\hat\mu_0^n$, with any fixed particle size $n$ (e.g., $n = 1$), we would have $\mathrm{BL}(\hat\mu_\ell^n, \nu_p) = O(\alpha^\ell) \to 0$ as $\ell \to \infty$, which is impossible because we cannot get arbitrarily accurate approximations of $\nu_p$ with finite $n$. It turns out that we need to look at the KL divergence in order to establish convergence towards $\nu_p$ as $\ell \to \infty$, as we discuss in Sections 3.2-3.3.

Remark. Because $g(x, y) = \nabla_x \log p(x) \, k(x, y) + \nabla_x k(x, y)$, and $\nabla_x \log p(x)$ is often unbounded when the domain $\mathcal{X}$ is unbounded, the condition that $g(x, y)$ must be bounded in Lemma 3.1 suggests that it can only be used when $\mathcal{X}$ is compact. It is an open question to establish results that work for more general domains $\mathcal{X}$.

3.2 Large Time Asymptotic of SVGD

Theorem 3.2 ensures that we only need to consider the update (11) starting from the limit initial measure $\mu_0^\infty$, which we can assume to have a nice density function and finite KL divergence to the target $\nu_p$. We show that the update (11) monotonically decreases the KL divergence between $\mu_\ell^\infty$ and $\nu_p$, which allows us to establish the convergence $\mu_\ell^\infty \Rightarrow \nu_p$.

Theorem 3.3. 1. Assuming $p$ is a density that satisfies Stein's identity (3) for all $\phi \in \mathcal{H}$, the measure $\nu_p$ of $p$ is a fixed point of the map $\Phi_p$ in (11).

2. Assume $R := \sup_x \big\{ \frac{1}{2} \|\nabla \log p\|_{\mathrm{Lip}} \, k(x, x) + 2 \nabla_{xx'} \cdot k(x, x) \big\} < \infty$, where $\nabla_{xx'} \cdot k(x, x) = \sum_i \partial_{x_i} \partial_{x'_i} k(x, x') \big|_{x'=x}$, and the step size $\epsilon_\ell$ at the $\ell$-th iteration is no larger than $\epsilon_\ell^* := \big( 2 \sup_x \rho(\nabla \phi^*_{\mu_\ell,p} + \nabla \phi^{*\top}_{\mu_\ell,p}) \big)^{-1}$, where $\rho(A)$ denotes the spectral norm of a matrix $A$.
If $\mathrm{KL}(\mu_0^\infty \,\|\, \nu_p) < \infty$ by initialization, then

$$\mathrm{KL}(\mu_{\ell+1}^\infty \,\|\, \nu_p) \le \mathrm{KL}(\mu_\ell^\infty \,\|\, \nu_p) - (1 - \epsilon_\ell R) \, \epsilon_\ell \, \mathbb{D}(\mu_\ell^\infty \,\|\, \nu_p)^2, \qquad (12)$$

that is, the population SVGD dynamics always decreases the KL divergence when using sufficiently small step sizes, with a decreasing rate governed by the squared Stein discrepancy. Further, if we set the step size $\epsilon_\ell \propto \mathbb{D}(\mu_\ell^\infty \,\|\, \nu_p)^\beta$ for any $\beta > 0$, then (12) implies that $\mathbb{D}(\mu_\ell^\infty \,\|\, \nu_p) \to 0$ as $\ell \to \infty$.

Remark. Assuming $\mathbb{D}(\mu_\ell^\infty \,\|\, \nu_p) \to 0$ implies $\mu_\ell^\infty \Rightarrow \nu_p$ (see (7)), Theorem 3.3(2) implies $\mu_\ell^\infty \Rightarrow \nu_p$. Further, together with Theorem 3.2, we can establish the weak convergence of the empirical measures of the SVGD particles: $\hat\mu_\ell^n \Rightarrow \nu_p$ as $\ell \to \infty$, $n \to \infty$.

Remark. Theorem 3.3 cannot be directly applied to the empirical measures $\hat\mu_\ell^n$ with finite sample size $n$, since it would give $\mathrm{KL}(\hat\mu_\ell^n \,\|\, \nu_p) = \infty$ at the beginning. It is necessary to use the BL metric and the KL divergence to establish convergence w.r.t. the sample size $n$ and the iteration $\ell$, respectively.

Remark. The requirement that $\epsilon_\ell \le \epsilon_\ell^*$ is needed to guarantee that the transform $T_{\mu_\ell,p}(x) = x + \epsilon_\ell \phi^*_{\mu_\ell,p}(x)$ has a non-singular Jacobian matrix everywhere. From the bound in Equation A.6 of the Appendix, we can derive an upper bound on the spectral radius:

$$\sup_x \rho(\nabla \phi^*_{\mu_\ell,p} + \nabla \phi^{*\top}_{\mu_\ell,p}) \le 2 \sup_x \|\nabla \phi^*_{\mu_\ell,p}\|_F \le 2 \sup_x \sqrt{\nabla_{xx'} \cdot k(x, x)} \; \mathbb{D}(\mu_\ell \,\|\, \nu_p).$$

This suggests that the step size should be upper bounded by the inverse of the Stein discrepancy, i.e., $\epsilon_\ell^* \propto \mathbb{D}(\mu_\ell \,\|\, \nu_p)^{-1} = \|\phi^*_{\mu_\ell,p}\|_{\mathcal{H}}^{-1}$, where $\mathbb{D}(\mu_\ell \,\|\, \nu_p)$ can be estimated using (6) (see [7]).

3.3 Continuous Time Limit and Vlasov Process

Many properties can be understood more easily by taking the continuous time limit ($\epsilon \to 0$), reducing our system to a partial differential equation (PDE) of the particle densities (or measures), under which we show that the negative time derivative of the KL divergence exactly equals the squared Stein discrepancy (the limit of (12) as $\epsilon \to 0$).
To be specific, we define a continuous time $t = \epsilon \ell$ and take an infinitesimal step size $\epsilon \to 0$; the evolution of the density $q$ in (10) then formally reduces to the following nonlinear Fokker-Planck equation (see Appendix A.3 for the derivation):

$$\partial_t q_t(x) = -\nabla \cdot \big( \phi^*_{q_t,p}(x) \, q_t(x) \big). \qquad (13)$$

This PDE is a type of deterministic Fokker-Planck equation that characterizes the movement of particles under deterministic forces, but it is nonlinear in that the velocity field depends on the current particle density $q_t$ through the drift term $\phi^*_{q_t,p}(\cdot) = \mathbb{E}_{x \sim q_t}[\mathcal{S}_p^x k(x, \cdot)]$. It is not surprising to establish the following continuous version of Theorem 3.3(2), which is of central importance to our gradient flow perspective in Section 3.4:

Theorem 3.4. Assume $\{\mu_t\}$ are probability measures whose densities $\{q_t\}$ satisfy the PDE in (13), and $\mathrm{KL}(\mu_0 \,\|\, \nu_p) < \infty$. Then

$$\frac{d}{dt} \mathrm{KL}(\mu_t \,\|\, \nu_p) = -\mathbb{D}(\mu_t \,\|\, \nu_p)^2. \qquad (14)$$

Remark. This result suggests a path integration formula, $\mathrm{KL}(\mu_0 \,\|\, \nu_p) = \int_0^\infty \mathbb{D}(\mu_t \,\|\, \nu_p)^2 \, dt$, which can potentially be useful for estimating the KL divergence or the normalization constant.

The PDE (13) only works for differentiable densities $q_t$. Similarly to the case of $\Phi_p$ as a map between (empirical) measures, one can extend (13) to a measure-valued PDE that incorporates empirical measures as weak solutions. Take a differentiable test function $h$ and integrate both sides of (13):

$$\int h(x) \, \partial_t q_t(x) \, dx = -\int h(x) \, \nabla \cdot \big( \phi^*_{q_t,p}(x) q_t(x) \big) \, dx.$$

Using integration by parts on the right-hand side to shift the derivative operator from $\phi^*_{q_t,p} q_t$ to $h$, we get

$$\frac{d}{dt} \mathbb{E}_{\mu_t}[h] = \mathbb{E}_{\mu_t}\big[ \nabla h^\top \phi^*_{\mu_t,p} \big], \qquad (15)$$

which depends on $\mu_t$ only through the expectation operator and hence works for empirical measures as well. A set of measures $\{\mu_t\}$ is called a weak solution of (13) if it satisfies (15).
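The weak-solution identity (15) can be checked numerically (our own illustration, not from the paper): along one Euler step of the particle dynamics, the finite-difference change of $\mathbb{E}_{\hat\mu^n_t}[h]$ should match $\mathbb{E}_{\hat\mu^n_t}[\nabla h^\top \phi^*]$ up to $O(dt)$. Here we use the illustrative choices $p = \mathcal{N}(0, 1)$, $h(x) = x^2$, and a unit-bandwidth RBF kernel:

```python
import numpy as np

# Check (15) for an empirical measure: d/dt E[h] vs. E[grad h . phi*],
# with target p = N(0, 1), test function h(x) = x^2, RBF kernel of unit bandwidth.
rng = np.random.default_rng(4)
x = rng.normal(1.0, 0.5, size=200)      # particles of the empirical measure

def phi_star(x):
    """phi*(x_i) = (1/n) sum_j [score(x_j) k(x_j, x_i) + grad_{x_j} k(x_j, x_i)]."""
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / 2)
    return (k @ (-x) + (d * k).sum(axis=1)) / x.shape[0]

dt = 1e-4
lhs = (np.mean((x + dt * phi_star(x))**2) - np.mean(x**2)) / dt  # finite-difference d/dt E[h]
rhs = np.mean(2 * x * phi_star(x))                               # E[grad h . phi*]
print(abs(lhs - rhs))                                            # O(dt)
```

The residual is exactly $dt \cdot \mathbb{E}[\phi^{*2}]$ here, shrinking linearly as $dt \to 0$, consistent with the empirical measure being a weak solution of (13).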
Using results on Fokker-Planck equations, the measure process (13)-(15) can be translated to an ordinary differential equation on random particles $\{x_t\}$ whose distribution is $\mu_t$:

$$dx_t = \phi^*_{\mu_t,p}(x_t) \, dt, \quad \text{where } \mu_t \text{ is the distribution of the random variable } x_t, \qquad (16)$$

initialized from a random variable $x_0$ with distribution $\mu_0$. Here the nonlinearity is reflected in the fact that the velocity field depends on the distribution $\mu_t$ of the particle at the current time. In particular, if we initialize (15) with the empirical measure $\hat\mu_0^n$ of a set of finite particles $\{x_0^i\}_{i=1}^n$, (16) reduces to the following continuous time limit of the $n$-particle SVGD dynamics:

$$dx_t^i = \phi^*_{\hat\mu_t^n,p}(x_t^i) \, dt, \quad \forall i = 1, \ldots, n, \quad \text{with} \quad \hat\mu_t^n(dx) = \frac{1}{n} \sum_{i=1}^n \delta(x - x_t^i) \, dx, \qquad (17)$$

where $\{\hat\mu_t^n\}$ can be shown to be a weak solution of (13)-(15), parallel to (9) in the discrete time case. (16) can be viewed as the large sample limit ($n \to \infty$) of (17).

The process (13)-(17) is a type of Vlasov process [12, 13]: a (deterministic) interacting particle process in which the particles interact with each other through the dependency on their mean field $\mu_t$ (or $\hat\mu_t^n$). Such processes have found important applications in physics, biology, and many other areas. There is a vast literature on theories and applications of interacting particle systems in general; we only refer to Spohn [14], Del Moral [15], and references therein as examples. Our particular form of Vlasov process, constructed based on the Stein operator in order to approximate arbitrary given distributions, seems to be new to the best of our knowledge.

3.4 Gradient Flow, Optimal Transport, Geometry

We now develop a geometric view of the Vlasov process in Section 3.3, interpreting it as a gradient flow for minimizing the KL divergence functional, defined on a new type of optimal transport metric on the space of density functions induced by the Stein operator. We focus on the set of nice densities $q$ paired with a well-defined Stein operator $\mathcal{S}_q$, acting on a Hilbert space $\mathcal{H}$.
To develop intuition, consider a density $q$ and a nearby density $q'$ obtained by applying the transform $T(x) = x + \phi(x) \, dt$ on $x \sim q$, with infinitesimal $dt$ and $\phi \in \mathcal{H}$. Then we can show that (see Appendix A.3)

$$\log q'(x) = \log q(x) - \mathcal{S}_q \phi(x) \, dt, \qquad q'(x) = q(x) - q(x) \, \mathcal{S}_q \phi(x) \, dt. \qquad (18)$$

Because one can show that $\mathcal{S}_q \phi = \nabla \cdot (\phi q)/q$ from (2), we define the operator $q\mathcal{S}_q$ by $q\mathcal{S}_q \phi(x) = q(x) \mathcal{S}_q \phi(x) = \nabla \cdot (\phi(x) q(x))$. Eq. (18) suggests that the Stein operator $\mathcal{S}_q$ (resp. $q\mathcal{S}_q$) serves to translate a $\phi$-perturbation on the random variable $x$ into the corresponding change of the log-density (resp. density). This fact plays a central role in our development.

Denote by $\mathcal{H}_q$ (resp. $q\mathcal{H}_q$) the space of functions of the form $\mathcal{S}_q \phi$ (resp. $q\mathcal{S}_q \phi$) with $\phi \in \mathcal{H}$, that is,

$$\mathcal{H}_q = \{\mathcal{S}_q \phi \colon \phi \in \mathcal{H}\}, \qquad q\mathcal{H}_q = \{q \mathcal{S}_q \phi \colon \phi \in \mathcal{H}\}.$$

Equivalently, $q\mathcal{H}_q$ is the space of functions of the form $qf$ with $f \in \mathcal{H}_q$. This allows us to consider the inverse of the Stein operator for functions in $\mathcal{H}_q$. For each $f \in \mathcal{H}_q$, we can identify a unique function $\psi_{q,f} \in \mathcal{H}$ that has minimum $\|\cdot\|_{\mathcal{H}}$ norm in the set of $\psi$ that satisfy $\mathcal{S}_q \psi = f$:

$$\psi_{q,f} = \arg\min_{\psi \in \mathcal{H}} \big\{ \|\psi\|_{\mathcal{H}} \quad \text{s.t.} \quad \mathcal{S}_q \psi = f \big\},$$

where $\mathcal{S}_q \psi = f$ is known as the Stein equation. This allows us to define inner products on $\mathcal{H}_q$ and $q\mathcal{H}_q$ using the inner product on $\mathcal{H}$:

$$\langle f_1, f_2 \rangle_{\mathcal{H}_q} := \langle q f_1, \, q f_2 \rangle_{q\mathcal{H}_q} := \langle \psi_{q,f_1}, \, \psi_{q,f_2} \rangle_{\mathcal{H}}. \qquad (19)$$

Based on standard results on RKHSs [e.g., 16], one can show that if $\mathcal{H}$ is an RKHS with kernel $k(x, x')$, then $\mathcal{H}_q$ and $q\mathcal{H}_q$ are both RKHSs; the reproducing kernel of $\mathcal{H}_q$ is $\kappa_q(x, x')$ as defined in (6) (with $p$ replaced by $q$), and correspondingly, the kernel of $q\mathcal{H}_q$ is $q(x) \kappa_q(x, x') q(x')$.

Now consider $q$ and a nearby $q' = q + qf \, dt$, $f \in \mathcal{H}_q$, obtained by an infinitesimal perturbation of the density using functions in the space $\mathcal{H}_q$. Then $\psi_{q,f}$ can be viewed as the optimal transform, in the sense of having minimum $\|\cdot\|_{\mathcal{H}}$ norm, that transports $q$ to $q'$ via $T(x) = x + \psi_{q,f}(x) \, dt$. It is therefore natural to define a notion of distance between $q$ and $q' = q + qf \, dt$ via

$$W_{\mathcal{H}}(q, q') := \|\psi_{q,f}\|_{\mathcal{H}} \, dt.$$

From (18) and (19), this is equivalent to $W_{\mathcal{H}}(q, q') = \|q' - q\|_{q\mathcal{H}_q} = \|\log q' - \log q\|_{\mathcal{H}_q}$.
Under this definition, we can see that the infinitesimal neighborhood $\{q' \colon W_{\mathcal{H}}(q, q') \le dt\}$ of $q$ consists of densities (resp. log-densities) of the form

$$q' = q + g \, dt, \quad g \in q\mathcal{H}_q, \; \|g\|_{q\mathcal{H}_q} \le 1, \qquad \log q' = \log q + f \, dt, \quad f \in \mathcal{H}_q, \; \|f\|_{\mathcal{H}_q} \le 1.$$

Geometrically, this means that $q\mathcal{H}_q$ (resp. $\mathcal{H}_q$) can be viewed as the tangent space around the density $q$ (resp. log-density $\log q$). Therefore, the related inner product $\langle \cdot, \cdot \rangle_{q\mathcal{H}_q}$ (resp. $\langle \cdot, \cdot \rangle_{\mathcal{H}_q}$) forms a Riemannian metric structure that corresponds to $W_{\mathcal{H}}(q, q')$.

This also induces a geodesic distance that corresponds to a general, $\mathcal{H}$-dependent form of optimal transport metric between distributions. Consider two densities $p$ and $q$ that can be transformed into one another with functions in $\mathcal{H}$, in the sense that there exists a curve of velocity fields $\{\phi_t \colon \phi_t \in \mathcal{H}, \; t \in [0, 1]\}$ that transforms a random variable $x_0 \sim q$ into $x_1 \sim p$ via $dx_t = \phi_t(x_t) \, dt$. This is equivalent to saying that there exists a curve of densities $\{\rho_t \colon t \in [0, 1]\}$ such that $\partial_t \rho_t = -\nabla \cdot (\phi_t \rho_t)$, with $\rho_0 = q$ and $\rho_1 = p$. It is therefore natural to define a geodesic distance between $q$ and $p$ via

$$W_{\mathcal{H}}(q, p) = \inf_{\{\phi_t, \rho_t\}} \Big\{ \int_0^1 \|\phi_t\|_{\mathcal{H}} \, dt \quad \text{s.t.} \quad \partial_t \rho_t = -\nabla \cdot (\phi_t \rho_t), \; \rho_0 = q, \; \rho_1 = p \Big\}. \qquad (20)$$

We call $W_{\mathcal{H}}(q, p)$ an $\mathcal{H}$-Wasserstein (or optimal transport) distance between $p$ and $q$, in connection with the typical 2-Wasserstein distance, which can be viewed as a special case of (20) obtained by taking $\mathcal{H}$ to be the $L^2_{\rho_t}$ space equipped with norm $\|f\|^2_{L^2_{\rho_t}} = \mathbb{E}_{\rho_t}[\|f\|^2]$, replacing the cost with $\int_0^1 \|\phi_t\|_{L^2_{\rho_t}} \, dt$; the 2-Wasserstein distance is widely known to relate to Langevin dynamics, as we discuss further in Section 3.5 [e.g., 17, 18].

Now, for a given functional $F(q)$, this metric structure induces a notion of functional covariant gradient: the covariant gradient $\mathrm{grad}_{\mathcal{H}} F(q)$ of $F(q)$ is defined to be a functional that maps $q$ to an element of the tangent space $q\mathcal{H}_q$ of $q$ and satisfies

$$F(q + f \, dt) = F(q) + \langle \mathrm{grad}_{\mathcal{H}} F(q), \, f \rangle_{q\mathcal{H}_q} \, dt \qquad (21)$$

for any $f$ in the tangent space $q\mathcal{H}_q$.

Theorem 3.5.
Following (21), the gradient of the KL divergence functional $F(q) := \mathrm{KL}(q \,\|\, p)$ is

$$\mathrm{grad}_{\mathcal{H}} \mathrm{KL}(q \,\|\, p) = \nabla \cdot (\phi^*_{q,p} \, q).$$

Therefore, the SVGD-Vlasov equation (13) is a gradient flow of the KL divergence under the metric $W_{\mathcal{H}}(\cdot, \cdot)$:

$$\frac{\partial q_t}{\partial t} = -\mathrm{grad}_{\mathcal{H}} \mathrm{KL}(q_t \,\|\, p).$$

In addition, $\|\mathrm{grad}_{\mathcal{H}} \mathrm{KL}(q \,\|\, p)\|_{q\mathcal{H}_q} = \mathbb{D}(q \,\|\, p)$.

Remark. We can also define the functional gradient via

$$\mathrm{grad}_{\mathcal{H}} F(q) \propto \arg\max_{f \colon \|f\|_{q\mathcal{H}_q} \le 1} \; \lim_{\epsilon \to 0^+} \frac{F(q + \epsilon f) - F(q)}{W_{\mathcal{H}}(q + \epsilon f, \, q)},$$

which specifies the steepest ascent direction of $F(q)$ (with unit norm). The result in Theorem 3.5 is consistent with this definition.

3.5 Comparison with Langevin Dynamics

The theory of SVGD is parallel to that of Langevin dynamics in many respects, but with important differences. We give a brief discussion of their similarities and differences. Langevin dynamics works by iterative updates of the form

$$x_{\ell+1} \leftarrow x_\ell + \epsilon \nabla \log p(x_\ell) + \sqrt{2\epsilon} \, \xi_\ell, \qquad \xi_\ell \sim \mathcal{N}(0, I),$$

where a single particle $\{x_\ell\}$ moves along the gradient direction, perturbed by random Gaussian noise that plays the role of enforcing the diversity needed to match the variation in $p$ (which is accounted for by the deterministic repulsive force in SVGD). Taking the continuous time limit ($\epsilon \to 0$), we obtain an Ito stochastic differential equation,

$$dx_t = \nabla \log p(x_t) \, dt + \sqrt{2} \, dW_t,$$

where $W_t$ is a standard Brownian motion and $x_0$ is a random variable with initial distribution $q_0$. Standard results show that the density $q_t$ of the random variable $x_t$ is governed by a linear Fokker-Planck equation, following which the KL divergence to $p$ decreases with a rate that equals the Fisher divergence:

$$\frac{\partial q_t}{\partial t} = -\nabla \cdot (q_t \nabla \log p) + \Delta q_t, \qquad \frac{d}{dt} \mathrm{KL}(q_t \,\|\, p) = -\mathbb{F}(q_t, p), \qquad (22)$$

where $\mathbb{F}(q, p) = \|\nabla \log(q/p)\|^2_{L^2_q}$. This result is parallel to Theorem 3.4, with the role of the squared Stein discrepancy (and the RKHS $\mathcal{H}$) replaced by the Fisher divergence (and the $L^2_q$ space). Further, parallel to Theorem 3.5, it is well known that (22) can also be treated as a gradient flow of the KL functional $\mathrm{KL}(q \,\|\, p)$, but under the 2-Wasserstein metric $W_2(q, p)$ [17].
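For concreteness, here is the (unadjusted) Langevin update above applied to a 1-d Gaussian target (an illustrative sketch; the target and step size are our own choices). Note that the injected noise $\sqrt{2\epsilon}\,\xi_\ell$ takes the place of SVGD's deterministic repulsive term:

```python
import numpy as np

# Unadjusted Langevin dynamics x_{l+1} = x_l + eps * grad log p(x_l) + sqrt(2 eps) * xi_l
# for the illustrative target p = N(3, 1).  Running many independent chains makes
# the comparison with an SVGD particle cloud visible.
rng = np.random.default_rng(3)
score = lambda x: -(x - 3.0)            # grad log p for p = N(3, 1)
eps = 0.01
x = rng.standard_normal(1000)           # 1000 independent chains
for _ in range(2000):
    x = x + eps * score(x) + np.sqrt(2 * eps) * rng.standard_normal(1000)
print(x.mean(), x.std())                # roughly 3 and 1 (up to O(eps) discretization bias)
```

Unlike SVGD, each chain here evolves independently, so the per-step cost is $O(n)$ rather than $O(n^2)$, at the price of Monte Carlo randomness and a small step-size bias (for this linear drift, the stationary variance is inflated by a factor $1/(1 - \epsilon/2)$).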
The main advantage of using an RKHS over $L^2_q$ is that it allows tractable computation of the optimal transport direction; this is not the case with $L^2_q$, and as a result Langevin dynamics requires a random diffusion term in order to form a proper approximation. Practically, SVGD has the advantage of being deterministic and of reducing to exact MAP optimization when using only a single particle, while Langevin dynamics has the advantages of being a standard MCMC method, inheriting its statistical properties, and of not requiring the $O(n^2)$ cost that SVGD needs to calculate the $n$-body interactions. However, the connections between SVGD and Langevin dynamics may allow us to develop theories and algorithms that unify the two, or combine their advantages.

4 Conclusion and Open Questions

We developed a theoretical framework for analyzing the asymptotic properties of Stein variational gradient descent. Many components of the analysis provide new insights in both theoretical and practical aspects. For example, our new metric structure can be useful for solving other learning problems by leveraging its computational tractability. Many important problems remain open. For example, an important open problem is to establish an explicit convergence rate for SVGD, for which the existing theoretical literature on Langevin dynamics and interacting particle systems may provide insights. Another problem is to develop finite sample bounds for SVGD that take into account the fact that it reduces to MAP optimization when $n = 1$. It is also an important direction to understand the bias and variance of the SVGD particles, or to combine SVGD with traditional Monte Carlo methods whose bias-variance analysis is clearer (see e.g., [19]).

Acknowledgement

This work is supported in part by NSF CRII 1565796. We thank Lester Mackey and the anonymous reviewers for their comments.

References

[1] Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm.
In Advances in Neural Information Processing Systems, 2016.
[2] M. J. Wainwright, M. I. Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.
[3] Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding. In Conference on Uncertainty in Artificial Intelligence (UAI), 2010.
[4] J. Dick, F. Y. Kuo, and I. H. Sloan. High-dimensional integration: the quasi-Monte Carlo way. Acta Numerica, 22:133-288, 2013.
[5] B. Dai, N. He, H. Dai, and L. Song. Provable Bayesian inference via particle mirror descent. In The 19th International Conference on Artificial Intelligence and Statistics, 2016.
[6] C. Stein. Approximate computation of expectations. Lecture Notes-Monograph Series, 7:i-164, 1986.
[7] Q. Liu, J. D. Lee, and M. I. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests and model evaluation. In International Conference on Machine Learning (ICML), 2016.
[8] K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness-of-fit. In International Conference on Machine Learning (ICML), 2016.
[9] C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society, Series B, 2017.
[10] J. Gorham and L. Mackey. Measuring sample quality with kernels. In International Conference on Machine Learning (ICML), 2017.
[11] C. J. Oates, J. Cockayne, F.-X. Briol, and M. Girolami. Convergence rates for a class of estimators based on Stein's identity. arXiv preprint arXiv:1603.03220, 2016.
[12] W. Braun and K. Hepp. The Vlasov dynamics and its fluctuations in the 1/n limit of interacting classical particles. Communications in Mathematical Physics, 56(2):101-113, 1977.
[13] A. A. Vlasov. On vibration properties of electron gas. J. Exp. Theor. Phys., 8(3):291, 1938.
[14] H. Spohn. Large scale dynamics of interacting particles. Springer Science & Business Media, 2012.
[15] P. Del Moral.
Mean field simulation for Monte Carlo integration. CRC Press, 2013.
[16] A. Berlinet and C. Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.
[17] F. Otto. The geometry of dissipative evolution equations: the porous medium equation. Communications in Partial Differential Equations, 2001.
[18] C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
[19] J. Han and Q. Liu. Stein variational adaptive importance sampling. In Uncertainty in Artificial Intelligence, 2017.