# Stein Variational Gradient Descent as Moment Matching

Qiang Liu, Dilin Wang
Department of Computer Science
The University of Texas at Austin
Austin, TX 78712
{lqiang, dilin}@cs.utexas.edu

**Abstract**

Stein variational gradient descent (SVGD) is a non-parametric inference algorithm that evolves a set of particles to fit a given distribution of interest. We analyze the non-asymptotic properties of SVGD, showing that there exists a set of functions, which we call the Stein matching set, whose expectations are exactly estimated by any set of particles that satisfies the fixed point equation of SVGD. This set is the image of the Stein operator applied to the feature maps of the positive definite kernel used in SVGD. Our results provide a theoretical framework for analyzing properties of SVGD with different kernels, shedding insight into optimal kernel choice. In particular, we show that SVGD with linear kernels yields exact estimation of means and variances on Gaussian distributions, while random Fourier features enable probabilistic bounds for distributional approximation. Our results offer a refreshing view of the classical inference problem as fitting Stein's identity or solving the Stein equation, which may motivate more efficient algorithms.

## 1 Introduction

One of the core problems of modern statistics and machine learning is to approximate difficult-to-compute probability distributions. Two fundamental ideas have been extensively studied and used in the literature: variational inference (VI) and Markov chain Monte Carlo (MCMC) sampling (e.g., Koller & Friedman, 2009; Wainwright et al., 2008). MCMC has the advantage of being non-parametric and asymptotically exact, but often suffers from difficulty in convergence, while VI frames inference as a parametric optimization of the KL divergence and works much faster in practice, but loses asymptotic consistency. An ongoing theme of research is to combine the advantages of these two methodologies.

Stein variational gradient descent (SVGD) (Liu & Wang, 2016) is a synthesis of MCMC and VI that inherits the non-parametric nature of MCMC while maintaining the optimization perspective of VI. In brief, SVGD for a distribution $p(x)$ updates a set of particles $\{x_i\}_{i=1}^n$ in parallel with a velocity field $\phi(\cdot)$ that balances a gradient force and a repulsive force,
$$x_i \leftarrow x_i + \epsilon\,\phi(x_i), \qquad \phi(\cdot) = \frac{1}{n}\sum_{j=1}^n \big[\nabla_{x_j}\log p(x_j)\,k(x_j,\cdot) + \nabla_{x_j}k(x_j,\cdot)\big],$$
where $\epsilon$ is a step size and $k(x, x')$ is a positive definite kernel defined by the user. This update is derived as an approximation of a kernelized Wasserstein gradient flow of the KL divergence (Liu et al., 2017), with connections to Stein's method (Stein, 1972) and optimal transport (Ollivier et al., 2014); see also Anderes & Coram (2002). SVGD has been applied to solve challenging inference problems in various domains; examples include Bayesian inference (Liu & Wang, 2016; Feng et al., 2017), uncertainty quantification (Zhu & Zabaras, 2018), reinforcement learning (Liu et al., 2017; Haarnoja et al., 2017), learning deep probabilistic models (Wang & Liu, 2016; Pu et al., 2017) and Bayesian meta learning (Feng et al., 2017; Kim et al., 2018).
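For concreteness, the following is a minimal NumPy sketch of the update above with a Gaussian RBF kernel. The function names (`rbf_kernel`, `svgd_update`), the fixed bandwidth, and the toy Gaussian target are our own illustrative choices, not anything prescribed by the paper.

```python
import numpy as np

def rbf_kernel(X, h=1.0):
    """Pairwise RBF kernel K[i, j] = exp(-||x_i - x_j||^2 / (2 h^2)) and differences x_i - x_j."""
    diff = X[:, None, :] - X[None, :, :]        # (n, n, d), diff[i, j] = x_i - x_j
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h ** 2))
    return K, diff

def svgd_update(X, score, eps=0.1, h=1.0):
    """One SVGD step: x_i <- x_i + eps * phi(x_i), with
    phi(x_i) = (1/n) sum_j [ grad log p(x_j) k(x_j, x_i) + grad_{x_j} k(x_j, x_i) ]."""
    n = X.shape[0]
    K, diff = rbf_kernel(X, h)
    grad_logp = score(X)                        # (n, d), rows are grad_x log p(x_j)
    attract = K.T @ grad_logp                   # sum_j grad log p(x_j) k(x_j, x_i)
    repulse = -np.einsum('ji,jid->id', K, diff) / h ** 2   # sum_j grad_{x_j} k(x_j, x_i)
    return X + eps * (attract + repulse) / n

# Toy example: target p = N(0, I) in 2D, so grad_x log p(x) = -x.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2)) + 3.0          # particles initialized away from the target
    for _ in range(500):
        X = svgd_update(X, score=lambda X: -X, eps=0.1)
    print("particle mean ~", X.mean(axis=0))    # moves toward 0 as the particles fit p
```

Iterating `svgd_update` pulls the particles toward high-density regions of $p$ while the repulsive term keeps them spread apart.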
However, the theoretical properties of SVGD are still largely unexplored. The only exceptions are Liu et al. (2017) and Lu et al. (2018), which studied the partial differential equation that governs the evolution of the limit densities of the particles, with which convergence to the distribution of interest can be established. However, the results in Liu et al. (2017) and Lu et al. (2018) are asymptotic in nature and hold only when the number of particles is very large. More recently, Chen et al. (2018) studied non-asymptotic properties of a linear combination of SVGD and Langevin dynamics, whose analysis, however, mainly exploits the properties of Langevin dynamics and does not apply to SVGD alone. A theoretical understanding of SVGD in the finite sample size regime is still missing and of great practical importance, especially given that the particle sizes used in practice are often relatively small, thanks to the property that SVGD with a single particle ($n = 1$) exactly reduces to finding the mode (a.k.a. maximum a posteriori (MAP)).

**Our Results** We analyze the finite sample properties of SVGD. In contrast to the dynamical perspective of Liu et al. (2017), we directly study what properties a set of particles would have if it satisfies the fixed point equation of SVGD, regardless of how we obtain the particles algorithmically, or whether the fixed point is unique. Our analysis indicates that the fixed point equation of SVGD is essentially a moment matching condition which ensures that the fixed point particles $\{x_i^*\}_{i=1}^n$ exactly estimate the expectations of all the functions in a special function set $\mathcal{F}$,
$$\frac{1}{n}\sum_{i=1}^n f(x_i^*) = \mathbb{E}_p f, \qquad \forall f \in \mathcal{F}.$$
This set $\mathcal{F}$, which we call the Stein matching set, consists of the functions obtained by applying the Stein operator to the linear span of the feature maps of the kernel used by SVGD. This framework allows us to understand the properties of different kernels (and the related feature maps) by studying their Stein matching sets $\mathcal{F}$, which should ideally either match the test functions that we are actually interested in estimating, or be as large as possible so as to approximate the overall distribution. This is difficult in general, but we make two observations in this work:

i) We show that, by using linear kernels (features), SVGD can exactly estimate the mean and variance of Gaussian distributions when the number of particles is larger than the dimension. Since Gaussian-like distributions appear widely in practice, and the estimates of mean and variance are often of special importance, linear kernels can provide a significant advantage over the typical Gaussian RBF kernels, especially in estimating the variance.

ii) Linear features are not sufficient to approximate the whole distribution. We show that, by using random features of strictly positive definite kernels, the fixed points of SVGD approximate the whole distribution with an $O(1/\sqrt{n})$ rate in kernelized Stein discrepancy.

Overall, our framework reveals a novel perspective that reduces the inference problem to either a regression problem of fitting Stein's identities, or inverting the Stein operator, which is framed as solving a differential equation called the Stein equation. These ideas are significantly different from the traditional MCMC and VI methods that are currently popular in the machine learning literature, and draw novel connections to quasi-Monte Carlo and quadrature methods, among other techniques in applied mathematics. New efficient approximate inference methods may be motivated by our new perspectives.
## 2 Background

We introduce the basic background of the Stein variational method, a framework of approximate inference that integrates ideas from Stein's method, kernel methods, and variational inference. Readers are referred to Liu et al. (2016), Liu & Wang (2016), Liu et al. (2017) and references therein for more details.

For notation, all vectors are assumed to be column vectors. The differential operator $\nabla_x$ is viewed as a column vector of the same size as $x \in \mathbb{R}^d$. For example, $\nabla_x\phi$ is an $\mathbb{R}^d$-valued function when $\phi$ is scalar-valued, and $\nabla_x^\top\phi(x) = \sum_{i=1}^d \nabla_{x_i}\phi_i(x)$ is a scalar-valued function when $\phi$ is $\mathbb{R}^d$-valued.

**Stein's Identity** Stein's identity forms the foundation of our framework. Given a positive differentiable density $p(x)$ on $\mathcal{X} \subseteq \mathbb{R}^d$, one form of Stein's identity is
$$\mathbb{E}_p\big[\nabla_x\log p(x)^\top\phi(x) + \nabla_x^\top\phi(x)\big] = 0, \qquad \forall \phi,$$
which holds for any differentiable, $\mathbb{R}^d$-valued function $\phi$ that satisfies a proper zero-boundary condition. Stein's identity can be proved by a simple exercise of integration by parts. We may write Stein's identity in a more compact way by defining a Stein operator $\mathcal{P}_x$:
$$\mathbb{E}_p[\mathcal{P}_x^\top\phi(x)] = 0, \qquad \text{where} \quad \mathcal{P}_x^\top\phi(x) = \nabla_x\log p(x)^\top\phi(x) + \nabla_x^\top\phi(x),$$
where $\mathcal{P}_x$ is formally viewed as a $d$-dimensional column vector like $\nabla_x$, and hence $\mathcal{P}_x^\top\phi$ is the inner product of $\mathcal{P}_x$ and $\phi$, yielding a scalar-valued function. The power of Stein's identity is that, for a given distribution $p$, it defines an infinite number of functions of the form $\mathcal{P}_x^\top\phi$ that have zero expectation under $p$, all of which depend on $p$ only through the Stein operator $\mathcal{P}_x$, or the score function $\nabla_x\log p(x) = \nabla_x p(x)/p(x)$, which is independent of the normalization constant of $p$ that is often difficult to calculate.

**Stein Discrepancy on RKHS** Stein's identity can be leveraged to characterize the discrepancy between different distributions. The idea is that, for two different distributions $p \neq q$, there must exist a function $\phi$ such that $\mathbb{E}_q[\mathcal{P}_x^\top\phi] \neq 0$. Consider functions $\phi$ in an $\mathbb{R}^d$-valued reproducing kernel Hilbert space (RKHS) of the form $\mathcal{H} = \mathcal{H}_0 \times \cdots \times \mathcal{H}_0$, where $\mathcal{H}_0$ is an $\mathbb{R}$-valued RKHS with positive definite kernel $k(x, x')$. We may define a kernelized Stein discrepancy (KSD) (Liu et al., 2016; Chwialkowski et al., 2016; Oates et al., 2017):
$$\mathbb{D}_k(q \,\|\, p) = \max_{\phi\in\mathcal{H}}\big\{\mathbb{E}_q[\mathcal{P}_x^\top\phi(x)] \colon \|\phi\|_{\mathcal{H}} \le 1\big\}. \tag{1}$$
The optimal $\phi$ in (1) can be solved in closed form:
$$\phi^*_{q,p}(\cdot) \propto \mathbb{E}_{x\sim q}[\mathcal{P}_x k(x, \cdot)], \tag{2}$$
which yields a simple kernel-based representation of KSD:
$$\mathbb{D}_k^2(q \,\|\, p) = \mathbb{E}_{x,x'\sim q}[\kappa_p(x, x')], \qquad \text{with} \quad \kappa_p(x, x') = \mathcal{P}_x^\top\big(\mathcal{P}_{x'}k(x, x')\big), \tag{3}$$
where $x$ and $x'$ are i.i.d. draws from $q$, and $\kappa_p(x, x')$ is a new "Steinalized" positive definite kernel obtained by applying the Stein operator twice, first w.r.t. variable $x$ and then $x'$. It turns out that the RKHS related to kernel $\kappa_p(x, x')$ is exactly the space of functions obtained by applying the Stein operator to functions in $\mathcal{H}$, that is, $\mathcal{H}_p = \{\mathcal{P}_x^\top\phi \colon \phi\in\mathcal{H}\}$. By Stein's identity, all the functions in $\mathcal{H}_p$ have zero expectation under $p$. We can also define $\mathcal{H}_p^+$ to be the space of functions in $\mathcal{H}_p$ plus arbitrary constants, that is, $\mathcal{H}_p^+ := \{f(x) + c \colon f\in\mathcal{H}_p,\ c\in\mathbb{R}\}$, which can also be viewed as an RKHS, with kernel $\kappa_p(x, x') + 1$. The Stein discrepancy can be viewed as a maximum mean discrepancy (MMD) on the Steinalized RKHS $\mathcal{H}_p^+$ (or equivalently $\mathcal{H}_p$):
$$\mathbb{D}_k(q \,\|\, p) = \max_{f\in\mathcal{H}_p^+}\big\{\mathbb{E}_q f - \mathbb{E}_p f \colon \|f\|_{\mathcal{H}_p^+} \le 1\big\}. \tag{4}$$
Different from a typical MMD, here the RKHS depends on the distribution $p$. In order to make the Stein discrepancy discriminative, in the sense that $\mathbb{D}_k(q \,\|\, p) = 0$ implies $q = p$, we need to take kernels $k(x, x')$ such that $\mathcal{H}_p^+$ is sufficiently large. It has been shown that this can be achieved if $k(x, x')$ is strictly positive definite or universal, in a proper technical sense (Liu et al., 2016; Chwialkowski et al., 2016; Gorham & Mackey, 2017; Oates et al., 2017).
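To make the kernel representation in (3) concrete, here is a hedged NumPy sketch that evaluates the Steinalized kernel $\kappa_p$ in closed form for a Gaussian RBF base kernel and averages it over samples of $q$ to estimate $\mathbb{D}_k^2(q\,\|\,p)$. The RBF derivative formulas are standard; the function names, bandwidth, and toy example are our own choices.

```python
import numpy as np

def stein_kernel_rbf(X, score, h=1.0):
    """Steinalized kernel matrix kappa_p(x_i, x_j) from eq. (3), for the RBF base kernel
    k(x, x') = exp(-||x - x'||^2 / (2 h^2)) and score s(x) = grad_x log p(x)."""
    n, d = X.shape
    S = score(X)                                   # (n, d): s(x_i)
    diff = X[:, None, :] - X[None, :, :]           # diff[i, j] = x_i - x_j
    sq = np.sum(diff ** 2, axis=-1)
    K = np.exp(-sq / (2 * h ** 2))
    # kappa = s(x)^T s(x') k + s(x)^T grad_{x'} k + s(x')^T grad_x k + sum_i d^2 k / dx_i dx'_i
    term1 = (S @ S.T) * K
    term2 = np.einsum('id,ijd->ij', S, diff) / h ** 2 * K     # s(x_i)^T (x_i - x_j)/h^2 * k
    term3 = -np.einsum('jd,ijd->ij', S, diff) / h ** 2 * K    # s(x_j)^T (x_j - x_i)/h^2 * k
    term4 = (d / h ** 2 - sq / h ** 4) * K                    # trace of the cross second derivative
    return term1 + term2 + term3 + term4

def ksd_squared(Xq, score, h=1.0):
    """Plug-in estimate of D_k^2(q || p) = E_{x, x' ~ q}[kappa_p(x, x')] from a sample of q."""
    return stein_kernel_rbf(Xq, score, h).mean()

# Example: p = N(0, I) so score(x) = -x; q is a shifted Gaussian, hence KSD^2 > 0.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Xq = rng.normal(size=(500, 2)) + 1.0
    print("KSD^2 estimate:", ksd_squared(Xq, score=lambda X: -X))
```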
It is useful to consider kernels in a random feature representation (Rahimi & Recht, 2007),
$$k(x, x') = \mathbb{E}_{w\sim p_w}[\phi(x, w)\phi(x', w)], \tag{5}$$
where $\phi(x, w)$ is a set of features indexed by a random parameter $w$ drawn from a distribution $p_w$. For example, the Gaussian RBF kernel $k(x, x') = \exp(-\frac{1}{2h^2}\|x - x'\|_2^2)$ admits the random feature representation
$$\phi(x, w) = \sqrt{2}\,\cos\!\Big(\frac{1}{h}w_1^\top x + w_0\Big), \tag{6}$$
where $w_0 \sim \mathrm{Unif}([0, 2\pi])$ and $w_1 \sim \mathcal{N}(0, I)$. With the random feature representation, KSD can be rewritten as
$$\mathbb{D}_k^2(q \,\|\, p) = \mathbb{E}_{w\sim p_w}\Big[\big\|\mathbb{E}_{x\sim q}[\mathcal{P}_x\phi(x, w)]\big\|_2^2\Big], \tag{7}$$
which can be viewed as the mean squared error of Stein's identity $\mathbb{E}_{x\sim q}[\mathcal{P}_x\phi(x, w)] = 0$ over the random features. $\mathbb{D}_k^2(q \,\|\, p) = 0$ implies $q = p$ if the feature set $\mathcal{G} = \{\phi(\cdot, w)\colon w\}$ is rich enough. Note that the RKHS $\mathcal{H}$ and the feature set $\mathcal{G}$ are different; the Stein discrepancy is an expected loss over $\mathcal{G}$ as shown in (7), but a worst-case loss over $\mathcal{H}$ as shown in (1).
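The following sketch illustrates (5)-(7) for the Gaussian RBF kernel: it draws $m$ random Fourier features as in (6) and estimates $\mathbb{D}_k^2(q\,\|\,p)$ by averaging the squared Stein residuals of the features over a sample from $q$. It is an illustration under our own naming and hyperparameter choices, not code from the paper.

```python
import numpy as np

def random_fourier_features(d, m, h=1.0, seed=None):
    """Draw parameters of m random Fourier features phi(x, w) = sqrt(2) cos(w1^T x / h + w0),
    which represent the Gaussian RBF kernel with bandwidth h as in (5)-(6)."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(size=(m, d)) / h            # w1 ~ N(0, I), pre-scaled by 1/h
    W0 = rng.uniform(0.0, 2 * np.pi, size=m)    # w0 ~ Unif([0, 2*pi])
    return W1, W0

def ksd_squared_rf(Xq, score, W1, W0):
    """Random-feature estimate of (7):
    D_k^2(q || p) ~ (1/m) sum_l || mean_x [ grad log p(x) phi(x, w_l) + grad_x phi(x, w_l) ] ||^2,
    with the inner expectation over q replaced by a sample average over the rows of Xq."""
    S = score(Xq)                                # (n, d)
    angles = Xq @ W1.T + W0                      # (n, m)
    phi = np.sqrt(2) * np.cos(angles)            # (n, m): phi(x_i, w_l)
    dphi = -np.sqrt(2) * np.sin(angles)[:, :, None] * W1[None, :, :]   # (n, m, d): grad_x phi
    stein = S[:, None, :] * phi[:, :, None] + dphi                     # (n, m, d): P_x phi
    g = stein.mean(axis=0)                       # (m, d): estimated E_q[P_x phi(x, w_l)]
    return np.mean(np.sum(g ** 2, axis=1))

# Example: p = N(0, I) (score(x) = -x), q a shifted Gaussian; the estimate is clearly positive.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Xq = rng.normal(size=(500, 2)) + 1.0
    W1, W0 = random_fourier_features(d=2, m=200, h=1.0, seed=0)
    print("random-feature KSD^2:", ksd_squared_rf(Xq, lambda X: -X, W1, W0))
```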
**Stein Variational Gradient Descent (SVGD)** SVGD is a deterministic sampling algorithm motivated by the Stein discrepancy. It is based on the following basic observation: given a distribution $q$, let $q_{[\phi]}$ be the distribution of $x' = x + \epsilon\phi(x)$ obtained by updating $x\sim q$ with a velocity field $\phi$, where $\epsilon$ is a small step size; then
$$\mathrm{KL}(q_{[\phi]} \,\|\, p) = \mathrm{KL}(q \,\|\, p) - \epsilon\,\mathbb{E}_q[\mathcal{P}_x^\top\phi] + O(\epsilon^2),$$
which shows that the decrease of the KL divergence is dominated by $\epsilon\,\mathbb{E}_q[\mathcal{P}_x^\top\phi]$. In order to make $q_{[\phi]}$ move towards $p$ as fast as possible, we should choose $\phi$ to maximize $\mathbb{E}_q[\mathcal{P}_x^\top\phi]$, whose solution is exactly $\phi^*_{q,p}(\cdot) \propto \mathbb{E}_{x\sim q}[\mathcal{P}_x k(x,\cdot)]$ as shown in (2). This suggests that $\phi^*_{q,p}$ is the best velocity field for pushing the probability mass of $q$ towards $p$ as fast as possible. Motivated by this, SVGD approximates $q$ with the empirical distribution of a set of particles $\{x_i\}_{i=1}^n$, and iteratively updates the particles by
$$x_i \leftarrow x_i + \frac{\epsilon}{n}\sum_{j=1}^n \big[\mathcal{P}_{x_j}k(x_j, x_i)\big]. \tag{8}$$
Liu et al. (2017) studied the asymptotic properties of the dynamical system underlying SVGD, showing that the evolution of the limit density of the particles as $n\to\infty$ can be captured by a nonlinear Fokker-Planck equation, and established its weak convergence to the target distribution $p$. However, the analyses in Liu et al. (2017) and Lu et al. (2018) do not cover the case when the sample size $n$ is finite, which is more relevant to the practical performance. We address this problem by directly analyzing the properties of the fixed point equation of SVGD, yielding results that hold for finite sample size $n$ and are independent of the update rule used to arrive at the fixed points.

## 3 SVGD as Moment Matching

This section presents our main results on the moment matching properties of SVGD and the related Stein matching sets. We start with Section 3.1, which introduces the basic idea and characterizes the Stein matching set of SVGD with general positive definite kernels. We then analyze in Section 3.2 the special case when the rank of the kernel is less than the particle size, in which case the Stein matching set is independent of the fixed points themselves. Section 3.3 shows that SVGD with linear features exactly estimates the first- and second-order moments of Gaussian distributions. Section 3.4 establishes a probabilistic bound when random features are used.

### 3.1 Fixed Point of SVGD

Our basic idea is rather simple to illustrate. Assume $X^* = \{x_i^*\}_{i=1}^n$ is a fixed point of SVGD and $\hat\mu_{X^*}$ is its related empirical measure; then according to (8), the fixed point condition of SVGD ensures
$$\mathbb{E}_{x\sim\hat\mu_{X^*}}[\mathcal{P}_x k(x, x_i^*)] = 0, \qquad \forall i = 1,\ldots,n. \tag{9}$$
On the other hand, by Stein's identity, we have $\mathbb{E}_{x\sim p}[\mathcal{P}_x k(x, x_i^*)] = 0$ for all $i = 1,\ldots,n$. This suggests that $\hat\mu_{X^*}$ exactly estimates the expectations under $p$ of functions of the form $f(x) = \mathcal{P}_x k(x, x_i^*)$, all of which are zero. By the linearity of expectation, the same holds for all functions in the linear span of $\{\mathcal{P}_x k(x, x_i^*)\}$.

**Lemma 3.1.** Assume $X^* = \{x_i^*\}_{i=1}^n$ satisfies the fixed point equation (9) of SVGD. We have
$$\frac{1}{n}\sum_{i=1}^n f(x_i^*) = \mathbb{E}_p f, \qquad \forall f\in\mathcal{F},$$
where the Stein matching set $\mathcal{F}$ is the linear span of $\{\mathcal{P}_x k(x, x_i^*)\}_{i=1}^n \cup \{1\}$, that is, $\mathcal{F}$ consists of functions of the form
$$f(x) = \sum_{i=1}^n a_i^\top\mathcal{P}_x k(x, x_i^*) + b, \qquad a_i\in\mathbb{R}^d,\ b\in\mathbb{R}.$$
Equivalently, $f(x) = \mathcal{P}_x^\top\phi(x) + b$ where $\phi$ is in the linear span of $\{k(x, x_i^*)\}_{i=1}^n$, that is, $\phi(x) = \sum_{i=1}^n a_i k(x, x_i^*)$.

Extending Lemma 3.1, one can readily see that the SVGD fixed points also approximately estimate the expectations of functions that are close to $\mathcal{F}$. Specifically, let $\mathcal{F}^\epsilon$ be the $\epsilon$-neighborhood of $\mathcal{F}$, that is, $\mathcal{F}^\epsilon = \{f \colon \inf_{f'\in\mathcal{F}}\|f - f'\| \le \epsilon\}$; then it is easy to show that
$$\Big|\frac{1}{n}\sum_{i=1}^n f(x_i^*) - \mathbb{E}_p f\Big| \le 2\epsilon, \qquad \forall f\in\mathcal{F}^\epsilon.$$
Therefore, the SVGD approximation can be viewed as prioritizing the functions within, or close to, $\mathcal{F}$. This is different in nature from Monte Carlo, which approximates the expectations of all bounded-variance functions with the same $O(1/\sqrt{n})$ error rate. Instead, SVGD shares more similarity with quadrature and sigma-point methods, which also find points (particles) to match expectations over certain classes of functions, but mostly only for polynomial functions and for simple distributions such as uniform or Gaussian distributions. SVGD provides a more general approach that can match moments of richer classes of functions for more general, complex multivariate distributions. As we show in Section 3.3, when using polynomial kernels, SVGD reduces to matching polynomials when applied to multivariate Gaussian distributions.

In this view, the performance of SVGD is essentially decided by the Stein matching set $\mathcal{F}$. We should design the algorithm, by engineering the kernels or feature maps, to make $\mathcal{F}$ as large as possible in order to approximate the distribution well, or to include the test functions of actual interest, such as the mean and variance.

### 3.2 Fixed Point of Feature-based SVGD

One undesirable property of $\mathcal{F}$ in Lemma 3.1 is that it depends on the values of the fixed point particles $X^*$, whose properties are difficult to characterize a priori. This makes it difficult to infer what kernel should be used to obtain a desirable $\mathcal{F}$. It turns out that the dependency of $\mathcal{F}$ on $X^*$ can be essentially decoupled by using degenerate kernels corresponding to a finite number of feature maps. Specifically, we consider kernels of the form
$$k(x, x') = \sum_{\ell=1}^m \phi_\ell(x)\phi_\ell(x'),$$
where we assume the number $m$ of features is no larger than the particle size $n$. Then the fixed point equation of SVGD reduces to
$$\mathbb{E}_{x\sim\hat\mu_{X^*}}\Big[\sum_{\ell=1}^m \mathcal{P}_x\phi_\ell(x)\,\phi_\ell(x_j^*)\Big] = 0, \qquad \forall j\in[n]. \tag{10}$$
Define $\Phi = [\phi_\ell(x_j^*)]_{\ell,j}$, which is a matrix of size $m\times n$. If $\mathrm{rank}(\Phi) = m$, then (10) reduces to
$$\mathbb{E}_{x\sim\hat\mu_{X^*}}[\mathcal{P}_x\phi_\ell(x)] = 0, \qquad \ell = 1,\ldots,m, \tag{11}$$
where the test functions $f(x) := \mathcal{P}_x\phi_\ell(x)$ no longer depend on the fixed point $X^*$.

**Theorem 3.2.** Assume $X^*$ is a fixed point of SVGD with kernel $k(x, x') = \sum_{\ell=1}^m \phi_\ell(x)\phi_\ell(x')$. Define the $m\times n$ matrix $\Phi = [\phi_\ell(x_i^*)]_{\ell\in[m],\,i\in[n]}$. If $\mathrm{rank}(\Phi) = m$, then
$$\frac{1}{n}\sum_{i=1}^n f(x_i^*) = \mathbb{E}_p f, \qquad \forall f\in\mathcal{F},$$
where the Stein matching set $\mathcal{F}$ is the linear span of $\{\mathcal{P}_x\phi_\ell(x)\}_{\ell=1}^m \cup \{1\}$, that is, it is the set of functions of the form
$$f(x) = \sum_{\ell=1}^m a_\ell^\top\mathcal{P}_x\phi_\ell(x) + b, \qquad a_\ell\in\mathbb{R}^d,\ b\in\mathbb{R}. \tag{12}$$
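As a small illustration of Theorem 3.2 (a sketch with our own helper names and an arbitrary particle set, not code from the paper), the snippet below assembles the feature matrix $\Phi$ for a user-supplied feature set, checks the rank condition numerically, and reports the residuals of the decoupled fixed point condition (11); at an exact fixed point satisfying the rank condition these residuals vanish.

```python
import numpy as np

def fixed_point_residuals(X, score, features, feature_grads):
    """Check the decoupled fixed point condition (11) of feature-based SVGD.

    X: (n, d) candidate particles; score(X): (n, d) rows grad_x log p(x_i).
    features[l](X) -> (n,) values of phi_l; feature_grads[l](X) -> (n, d) gradients.
    Returns the m x n matrix Phi = [phi_l(x_i)], its rank, and the residuals
    || (1/n) sum_i [ grad log p(x_i) phi_l(x_i) + grad_x phi_l(x_i) ] || for each l."""
    S = score(X)
    Phi = np.stack([f(X) for f in features])               # (m, n)
    rank = np.linalg.matrix_rank(Phi)
    residuals = []
    for f, g in zip(features, feature_grads):
        stein_f = S * f(X)[:, None] + g(X)                 # (n, d): P_x phi_l at each particle
        residuals.append(np.linalg.norm(stein_f.mean(axis=0)))
    return Phi, rank, np.array(residuals)

# Example: linear features phi = (1, x_1, ..., x_d), target p = N(0, I) in 2D.
if __name__ == "__main__":
    d = 2
    features = [lambda X: np.ones(X.shape[0])] + \
               [lambda X, j=j: X[:, j] for j in range(d)]
    feature_grads = [lambda X: np.zeros_like(X)] + \
                    [lambda X, j=j, d=d: np.tile(np.eye(d)[j], (X.shape[0], 1)) for j in range(d)]
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, d))                           # arbitrary particles, not a fixed point
    Phi, rank, res = fixed_point_residuals(X, lambda X: -X, features, feature_grads)
    print("rank(Phi) =", rank, "(m =", len(features), "); residuals:", res)
```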
Note that the rank condition implies that we must have $m\le n$. The idea is that $n$ particles can match at most $n$ linearly independent features exactly. Here, although the rank condition still depends on the fixed point $X^* = \{x_i^*\}_{i=1}^n$ and cannot be guaranteed a priori, it can be numerically verified once we obtain the values of $X^*$. In our experiments, we find that the rank condition tends to hold in practice when $n = m$. In cases when it does fail, we can always rerun the algorithm with a larger $n$ until it is satisfied. Intuitively, it seems to require bad luck to have $\Phi$ low rank when there are more particles than features ($n \ge m$), although a theoretical guarantee is still missing.

**Query-Specific Inference as Solving the Stein Equation** Assume we are interested in the query-specific task of estimating $\mathbb{E}_p f$ for a specific test function $f$. In this case, we should ideally select the features $\{\phi_\ell\}_\ell$ such that (12) holds, to yield an exact estimate of $\mathbb{E}_p f$. By the linearity of the Stein operator, (12) is equivalent to the
$$\text{Stein equation:} \qquad f(x) = \mathcal{P}_x^\top\phi(x) + b, \tag{13}$$
where $\phi(x) = \sum_{\ell=1}^m a_\ell\phi_\ell(x)$. Eq (13) is known as the Stein equation when solving for $\phi$ and $b$ given $f$, which effectively computes the inverse of the Stein operator. The Stein equation plays a central role in Stein's method as a theoretical tool (Barbour & Chen, 2005). Here, we highlight its fundamental connection to the approximate inference problem: if we can exactly solve for $\phi$ and $b$ given $f$, then the inference problem regarding $f$ is already solved (without running SVGD), since we can easily see that $\mathbb{E}_p f = b$ by taking the expectation of both sides of (13). Mathematically, this reduces the integration problem of estimating $\mathbb{E}_p f$ to solving a differential equation. It suggests that the Stein equation is at least as hard as the inference problem itself, and we should not expect a tractable way to solve it in general. On the other hand, it also suggests that efficient approximate inference methods may be developed from approximate solutions of the Stein equation. A similar idea was investigated by Oates et al. (2017), who developed a kernel approximation of the Stein equation based on a given set of points. SVGD allows us to further extend this idea by optimizing the set of points (particles) on which the approximation is defined.

### 3.3 Linear-Feature SVGD is Exact for Gaussians

Although the Stein equation is difficult to solve in general, it is significantly simplified when the distribution $p$ of interest is Gaussian. In the following, we show that when $p$ is a multivariate Gaussian distribution, we can use linear features, corresponding to the linear kernel $k(x, x') = x^\top x' + 1$, to ensure that SVGD exactly estimates all the first- and second-order moments of $p$. This insight provides important practical guidance on the optimal kernel choice for Gaussian-like distributions.

**Theorem 3.3.** Assume $X^*$ is a fixed point of SVGD with the polynomial kernel $k(x, x') = x^\top x' + 1$. Let $\mathcal{F}$ be the Stein matching set in Theorem 3.2. If $p$ is a multivariate normal distribution on $\mathbb{R}^d$, then $\mathcal{F} \subseteq \mathrm{Poly}(2)$, where $\mathrm{Poly}(2)$ is the set of all polynomials up to the second order, that is,
$$\mathrm{Poly}(2) = \{x^\top A x + b^\top x + c \colon A\in\mathbb{R}^{d\times d},\ b\in\mathbb{R}^d,\ c\in\mathbb{R}\}.$$
Further, denote by $\Phi$ the $(d+1)\times n$ matrix
$$\Phi = \begin{bmatrix} x_1^* & x_2^* & \cdots & x_n^* \\ 1 & 1 & \cdots & 1 \end{bmatrix}.$$
If $\mathrm{rank}(\Phi) = d + 1$, then $\mathcal{F} = \mathrm{Poly}(2)$. In this case, any fixed point of SVGD exactly estimates both the mean and the covariance matrix of the target distribution.
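To see Theorem 3.3 numerically, the hedged sketch below runs SVGD with the linear kernel $k(x, x') = x^\top x' + 1$ on a correlated Gaussian target and compares the particle mean and covariance with the true parameters; the target, step size, and iteration count are illustrative choices of ours.

```python
import numpy as np

def svgd_update_linear(X, score, eps=0.05):
    """One SVGD step with the linear kernel k(x, x') = x^T x' + 1, for which
    grad_{x_j} k(x_j, x_i) = x_i, so (8) becomes
    phi(x_i) = (1/n) sum_j grad log p(x_j) (x_j^T x_i + 1) + x_i."""
    n = X.shape[0]
    S = score(X)                                 # (n, d)
    K = X @ X.T + 1.0                            # (n, n): k(x_j, x_i)
    attract = K.T @ S / n
    repulse = X                                  # (1/n) sum_j x_i = x_i
    return X + eps * (attract + repulse)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
    Sigma_inv = np.linalg.inv(Sigma)
    score = lambda X: -(X - mu) @ Sigma_inv      # grad_x log N(mu, Sigma)
    X = rng.normal(size=(20, 2))                 # n = 20 particles > d + 1 = 3
    for _ in range(2000):
        X = svgd_update_linear(X, score, eps=0.05)
    # At an exact fixed point satisfying the rank condition both errors are zero;
    # after finitely many iterations they should merely be small.
    print("mean error:", np.abs(X.mean(axis=0) - mu).max())
    print("cov  error:", np.abs(np.cov(X.T, bias=True) - Sigma).max())
```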
More generally, if the features are polynomials of order $j$, the related Stein matching set consists of polynomials of order $j+1$ for Gaussian distributions. We do not investigate this further because it is less common to estimate higher-order moments in multivariate settings.

Theorem 3.3 suggests that it is a good heuristic to include linear features in SVGD, because Gaussian-like distributions appear widely thanks to the central limit theorem and the Bernstein-von Mises theorem, and the main goal of inference is often to estimate the mean and variance. In contrast, the more commonly used Gaussian RBF kernel does not enjoy similar exact recovery results for the mean and variance, even for Gaussian distributions. A nice property of our result is that once we use fewer features than particles and solve the fixed point exactly, the features do not interfere with each other. This allows us to "program" our algorithm by adding different types of features that serve different purposes in different cases.

### 3.4 Random-Feature SVGD

Linear features are not sufficient to provide a consistent estimate of the whole distribution, even for Gaussian distributions. Non-degenerate kernels are required to obtain bounds on the whole distribution, but they complicate the analysis because their Stein matching set depends on the solution $X^*$, as shown in Lemma C.1. Random features can be used to sidestep this difficulty (Rahimi & Recht, 2007), enabling us to analyze a random-feature variant of SVGD with probabilistic bounds.

To set up, assume $k(x, x')$ is a universal kernel whose Stein discrepancy $\mathbb{D}_k(q\,\|\,p)$ yields a discriminative measure of the difference between distributions. Assume $k(x, x')$ has the random feature representation in (5), and approximate it by drawing $m$ random features,
$$\hat k(x, x') = \frac{1}{m}\sum_{\ell=1}^m \phi(x, w_\ell)\phi(x', w_\ell),$$
where the $w_\ell$ are i.i.d. draws from $p_w$. We assume $m\le n$; then running SVGD with kernel $\hat k(x, x')$ (with the random features fixed during the iterations) yields a matching set that decouples from the fixed point $X^*$. In this way, our results below establish that $\mathbb{D}_k(\hat\mu_{X^*}\,\|\,p) = O(1/\sqrt{n})$ with high probability. According to (4), this provides a uniform bound on $|\mathbb{E}_{\hat\mu_{X^*}}f - \mathbb{E}_p f|$ for all functions in the unit ball of $\mathcal{H}_p^+$.

Here, random features are introduced mainly to facilitate the theoretical analysis, but we also find that random-feature SVGD works comparably to, and sometimes even better than, SVGD with the original non-degenerate kernel (see Appendix). This is because with a finite number $n$ of particles, at most $n$ basis functions of $k(x, x')$ can be effectively used, even if $k(x, x')$ itself has infinite rank. From the perspective of moment matching, there is no benefit in using universal kernels when the particle size $n$ is finite.

In the sequel, we first explain the intuitive idea behind our result, highlighting a perspective that views inference as fitting a zero-valued curve with Stein's identity, and then introduce the technical details.

**Distributional Inference as Fitting Stein's Identity** Recall that our goal can be viewed as finding particles $X = \{x_i\}$ such that their empirical measure $\hat\mu_X$ approximates the target distribution $p$. We reframe this as finding $\hat\mu_X$ such that Stein's identity holds (approximately):
$$\text{Find } \hat\mu_X \quad \text{s.t.} \quad \mathbb{E}_{\hat\mu_X}[\mathcal{P}_x\phi(x, w)] \approx 0, \qquad \forall w.$$
We may view this as a special curve-fitting problem: writing $g_X(w) = \mathbb{E}_{\hat\mu_X}[\mathcal{P}_x\phi(x, w)]$, we want to find a parameter $X$ such that $g_X(w) \approx 0$ for all inputs $w$.
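The sketch below (again with our own names and settings, and the same form of random Fourier features used earlier) instantiates this curve-fitting view: it runs SVGD with the degenerate kernel $\hat k$ built from $m$ fixed random features, which drives the residuals $g_X(w_\ell)$ toward zero at the fixed point.

```python
import numpy as np

def stein_features(X, score, W1, W0):
    """phi(x_i, w_l) and the Steinalized features P_x phi(x_i, w_l)
    = grad log p(x_i) phi(x_i, w_l) + grad_x phi(x_i, w_l),
    for phi(x, w) = sqrt(2) cos(w1^T x + w0) with W1 already scaled by 1/h."""
    angles = X @ W1.T + W0                        # (n, m)
    phi = np.sqrt(2) * np.cos(angles)
    dphi = -np.sqrt(2) * np.sin(angles)[:, :, None] * W1[None, :, :]
    return phi, score(X)[:, None, :] * phi[:, :, None] + dphi

def svgd_update_rf(X, score, W1, W0, eps=0.1):
    """One SVGD step (8) with the degenerate kernel k_hat(x,x') = (1/m) sum_l phi(x,w_l) phi(x',w_l).
    The velocity at x_i is (1/m) sum_l g_X(w_l) phi(x_i, w_l), where g_X(w_l) is the residual
    (1/n) sum_j P_{x_j} phi(x_j, w_l) from the curve-fitting view above."""
    m = W0.shape[0]
    phi, stein = stein_features(X, score, W1, W0)
    g = stein.mean(axis=0)                        # (m, d): residuals g_X(w_l)
    return X + eps * (phi @ g) / m

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    d, m, h = 2, 30, 1.0
    W1 = rng.normal(size=(m, d)) / h              # features drawn once and kept fixed
    W0 = rng.uniform(0.0, 2 * np.pi, size=m)
    X = rng.normal(size=(50, d)) + 3.0            # particles for the target p = N(0, I)
    for _ in range(1000):
        X = svgd_update_rf(X, score=lambda X: -X, W1=W1, W0=W0, eps=0.1)
    _, stein = stein_features(X, lambda X: -X, W1, W0)
    print("max residual ||g_X(w_l)||:", np.linalg.norm(stein.mean(axis=0), axis=1).max())
```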
The kernelized Stein discrepancy (KSD), as shown in (7), can be viewed as the expected squared loss of this fitting problem:
$$\mathbb{D}_k^2(\hat\mu_X \,\|\, p) = \mathbb{E}_{w\sim p_w}\big[\|g_X(w)\|_2^2\big]. \tag{14}$$
When replacing $k(x, x')$ with its random feature approximation $\hat k(x, x')$, the corresponding KSD can be viewed as an empirical loss on the random sample $\{w_\ell\}$ from $p_w$:
$$\mathbb{D}_{\hat k}^2(\hat\mu_X \,\|\, p) = \frac{1}{m}\sum_{\ell=1}^m \|g_X(w_\ell)\|_2^2.$$
By running SVGD with $\hat k(x, x')$, we achieve $g_{X^*}(w_\ell) = 0$ for all $\ell$ at the fixed point, implying a zero empirical loss $\mathbb{D}_{\hat k}(\hat\mu_{X^*}\,\|\,p) = 0$, assuming the rank condition holds. The key question, however, is to bound the expected loss $\mathbb{D}_k(\hat\mu_{X^*}\,\|\,p)$, which can be achieved using generalization bounds from statistical learning theory. In fact, standard results in learning theory suggest that the difference between the empirical and expected loss is $O(m^{-1/2})$, yielding $\mathbb{D}_k^2(\hat\mu_{X^*}\,\|\,p) = O(m^{-1/2})$. However, following (4), this implies $|\mathbb{E}_{\hat\mu_{X^*}}f - \mathbb{E}_p f| = O(m^{-1/4})$ for $f\in\mathcal{H}_p^+$, which does not achieve the standard $O(m^{-1/2})$ rate. Fortunately, our setting is noise-free and we achieve zero empirical loss; thus, we can get a better rate of $\mathbb{D}_k^2(\hat\mu_{X^*}\,\|\,p) = O(m^{-1})$ using the techniques in Srebro et al. (2010).

**Bound for Random Features** We now present our concentration bounds for random-feature SVGD.

**Assumption 3.4.** 1) Assume $\{\phi(x, w_\ell)\}_{\ell=1}^m$ is a set of random features with $w_\ell$ i.i.d. drawn from $p_w$ on domain $\mathcal{W}$, and $X^* = \{x_i^*\}_{i=1}^n$ is an approximate fixed point of SVGD with random features $\phi(x, w_\ell)$ in the sense that
$$\big|\mathbb{E}_{x\sim\hat\mu_{X^*}}[\mathcal{P}_{x_j}\phi(x, w_\ell)]\big| \le \frac{\epsilon_j}{\sqrt{m}},$$
where $\mathcal{P}_{x_j}$ is the Stein operator w.r.t. the $j$-th coordinate $x_j$ of $x$. Assume $\epsilon^2 := \sum_{j=1}^d \epsilon_j^2 < \infty$.
2) Let $\sup_{x\in\mathcal{X},\,w\in\mathcal{W}}|\mathcal{P}_{x_j}\phi(x, w)| = M_j$, and assume $M^2 := \sum_{j=1}^d M_j^2 < \infty$. This may require $\mathcal{X}$ to be compact, since $\nabla_x\log p(x)$ is typically unbounded on non-compact $\mathcal{X}$ (e.g., when $p$ is standard Gaussian, $\nabla_x\log p(x) = -x$).
3) Define the function set $\mathcal{P}_j\Phi = \{w \mapsto \mathcal{P}_{x_j}\phi(x, w)\colon x\in\mathcal{X}\}$. We assume the Rademacher complexity of $\mathcal{P}_j\Phi$ satisfies $\mathcal{R}_m(\mathcal{P}_j\Phi) \le R_j/\sqrt{m}$, and $R^2 := \sum_{j=1}^d R_j^2 < \infty$.

**Theorem 3.5.** Under Assumption 3.4, for any $\delta > 0$, with probability at least $1-\delta$ (with respect to the randomness of the feature parameters $\{w_\ell\}_{\ell=1}^m$),
$$\mathbb{D}_k(\hat\mu_{X^*}\,\|\,p) \le \frac{C}{\sqrt{m}}\Big(\epsilon^2 + \log^3 m + \log(1/\delta)\Big)^{1/2}, \tag{15}$$
where $C$ is a constant that depends on $R$ and $M$.

**Remark** Recalling (4), Eq (15) provides a uniform bound
$$\sup_{\|f\|_{\mathcal{H}_p^+}\le 1}\big|\mathbb{E}_{\hat\mu_{X^*}}f - \mathbb{E}_p f\big| = O\big(m^{-1/2}\log^{1.5} m\big).$$
This uniform bound controls the worst-case error over all $f\in\mathcal{H}_p^+$. It is unclear whether the logarithmic factor $\log^{1.5} m$ is essential. In the following, we present a result with an $O(1/\sqrt{m})$ rate, without the logarithmic factor, but which holds only for individual functions.

**Theorem 3.6.** Let $\mathcal{F}$ be the linear span of the Steinalized features:
$$f(x) = \mathbb{E}_{w\sim p_w}\big[v(w)^\top\mathcal{P}_x\phi(x, w)\big], \tag{16}$$
where $v(w) = [v_1(w),\ldots,v_d(w)]^\top\in\mathbb{R}^d$ are combination weights that satisfy $\sup_w\|v(w)\| < \infty$. We may define a norm on $\mathcal{F}$ by $\|f\|_{\mathcal{F}}^2 := \inf_v \sum_{j=1}^d \sup_w |v_j(w)|^2$, where the infimum is taken over all $v(w)$ satisfying (16). Assume Assumption 3.4 holds; then for any given function $f\in\mathcal{F}$ with $\|f\|_{\mathcal{F}}\le 1$, we have with probability at least $1-\delta$,
$$\big|\mathbb{E}_{\hat\mu_{X^*}}f - \mathbb{E}_p f\big| \le \frac{C}{\sqrt{m}}\Big(1 + \epsilon + \sqrt{2\log(1/\delta)}\Big),$$
where $C$ is a constant that depends on $R$ and $M$.

The $\mathcal{F}$ defined above is closely related to the RKHS $\mathcal{H}_p$. In fact, one can show that $\mathcal{F}$ is a dense subset of $\mathcal{H}_p$ (Rahimi & Recht, 2008) and is hence quite rich when $k(x, x')$ is universal.

## 4 Conclusion

We analyze SVGD through the lens of moment matching. Our results are non-asymptotic in nature and provide an insightful framework for understanding the influence of kernels on the behavior of SVGD fixed points.
Our framework suggests promising directions for developing systematic ways of optimizing the choice of kernels, especially for query-specific inference that focuses on specific test functions. A particularly appealing idea is to "program" the inference algorithm by adding features that serve specific purposes, so that the algorithm can be easily adapted to meet the needs of different users. In general, we expect that the connection between approximate inference and Stein's identity and the Stein equation will provide further opportunities for deriving new generations of approximate inference algorithms.

Another advantage of our framework is that it separates the design of the fixed point equation from the numerical algorithm used to reach the fixed point. In this way, the iterative algorithm does not have to be derived as an approximation of an infinite-dimensional gradient flow, in contrast to the original SVGD. This allows us to apply various practical numerical methods and acceleration techniques to solve the fixed point equation faster, with convergence guarantees.

## References

Anderes, Ethan and Coram, Marc. A general spline representation for nonparametric and semiparametric density estimates using diffeomorphisms. arXiv preprint arXiv:1205.5314, 2002.

Barbour, Andrew D and Chen, Louis Hsiao Yun. An introduction to Stein's method, volume 4. World Scientific, 2005.

Bartlett, Peter L and Mendelson, Shahar. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.

Chen, Changyou, Zhang, Ruiyi, Wang, Wenlin, Li, Bai, and Chen, Liqun. A unified particle-optimization framework for scalable Bayesian sampling. arXiv preprint arXiv:1805.11659, 2018.

Chwialkowski, Kacper, Strathmann, Heiko, and Gretton, Arthur. A kernel test of goodness-of-fit. In International Conference on Machine Learning (ICML), 2016.

Feng, Yihao, Wang, Dilin, and Liu, Qiang. Learning to draw samples with amortized Stein variational gradient descent. In Conference on Uncertainty in Artificial Intelligence (UAI), 2017.

Gorham, Jackson and Mackey, Lester. Measuring sample quality with kernels. In International Conference on Machine Learning (ICML), 2017.

Haarnoja, Tuomas, Tang, Haoran, Abbeel, Pieter, and Levine, Sergey. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning (ICML), pp. 1352-1361, 2017.

Kim, Taesup, Yoon, Jaesik, Dia, Ousmane, Kim, Sungwoong, Bengio, Yoshua, and Ahn, Sungjin. Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems (NIPS), 2018.

Koller, Daphne and Friedman, Nir. Probabilistic graphical models: principles and techniques. MIT Press, 2009.

Liu, Qiang and Wang, Dilin. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems (NIPS), pp. 2378-2386, 2016.

Liu, Qiang, Lee, Jason, and Jordan, Michael. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning (ICML), pp. 276-284, 2016.

Liu, Yang, Ramachandran, Prajit, Liu, Qiang, and Peng, Jian. Stein variational policy gradient. In Conference on Uncertainty in Artificial Intelligence (UAI), 2017.

Lu, Jianfeng, Lu, Yulong, and Nolen, James. Scaling limit of the Stein variational gradient descent part I: the mean field regime. arXiv preprint arXiv:1805.04035, 2018.

Oates, Chris J, Girolami, Mark, and Chopin, Nicolas. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society, Series B, 2017.
Ollivier, Yann, Pajot, Hervé, and Villani, Cédric. Optimal Transport: Theory and Applications, volume 413. Cambridge University Press, 2014.

Pu, Yuchen, Gan, Zhe, Henao, Ricardo, Li, Chunyuan, Han, Shaobo, and Carin, Lawrence. VAE learning via Stein variational gradient descent. In Advances in Neural Information Processing Systems (NIPS), pp. 4239-4248, 2017.

Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), pp. 1177-1184, 2007.

Rahimi, Ali and Recht, Benjamin. Uniform approximation of functions with random bases. In Communication, Control, and Computing, 46th Annual Allerton Conference on, pp. 555-561. IEEE, 2008.

Srebro, Nathan, Sridharan, Karthik, and Tewari, Ambuj. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems (NIPS), pp. 2199-2207, 2010.

Stein, Charles. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory, pp. 583-602, 1972.

Wainwright, Martin J, Jordan, Michael I, et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.

Wang, Dilin and Liu, Qiang. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.

Zhu, Yinhao and Zabaras, Nicholas. Bayesian deep convolutional encoder-decoder networks for surrogate modeling and uncertainty quantification. Journal of Computational Physics, 366:415-447, 2018.