# Understanding and Accelerating Particle-Based Variational Inference

Chang Liu 1, Jingwei Zhuo 1, Pengyu Cheng 2, Ruiyi Zhang 2, Jun Zhu 1, Lawrence Carin 2

1 Dept. of Comp. Sci. & Tech., Institute for AI, BNRist Center, Tsinghua-Fuzhou Inst. for Data Tech., THBI Lab, Tsinghua University, Beijing, 100084, China. 2 Dept. of Elec. & Comp. Engineering, Duke University, NC, USA. Correspondence to: Jun Zhu, Lawrence Carin.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Particle-based variational inference methods (ParVIs) have gained attention in the Bayesian inference literature, for their capacity to yield flexible and accurate approximations. We explore ParVIs from the perspective of Wasserstein gradient flows, and make both theoretical and practical contributions. We unify the various finite-particle approximations that existing ParVIs use, and recognize that the approximation is essentially a compulsory smoothing treatment, in either of two equivalent forms. This novel understanding reveals the assumptions and relations of existing ParVIs, and also inspires new ParVIs. We propose an acceleration framework and a principled bandwidth-selection method for general ParVIs; both are based on the developed theory and leverage the geometry of the Wasserstein space. Experimental results show improved convergence from the acceleration framework and enhanced sample accuracy from the bandwidth-selection method.

1. Introduction

Bayesian inference provides powerful tools for modeling and reasoning with uncertainty. In Bayesian learning one seeks access to the posterior distribution p on the support space X (i.e., the latent space) given data. As p is intractable in general, various approximations have been developed. Variational inference methods (VIs) approximate p within a certain distribution family, typically by minimizing the Kullback-Leibler (KL) divergence to p. The approximating distribution is commonly chosen from a parametric family (Wainwright et al., 2008; Hoffman et al., 2013), but this often poses a restrictive assumption and the closeness to p suffers. Markov chain Monte Carlo (MCMC) methods (Geman & Geman, 1987; Neal et al., 2011; Welling & Teh, 2011; Ding et al., 2014) aim to directly draw samples from p. Although asymptotically accurate, they often converge slowly in practice due to undesirable autocorrelation between samples, and a relatively large sample size is needed (Liu & Wang, 2016) for a good result, increasing the cost of downstream tasks.

Recently, particle-based variational inference methods (ParVIs) have been proposed. They use a set of samples, or particles, to represent the approximating distribution (like MCMC) and deterministically update the particles by minimizing the KL-divergence to p (like VIs). ParVIs have greater non-parametric flexibility than classical VIs, and are also more particle-efficient than MCMCs, since they make full use of a finite number of particles by taking particle interaction into account. The availability of optimization-based update rules also makes them converge faster. Stein Variational Gradient Descent (SVGD) (Liu & Wang, 2016) is a representative method of this type; it updates particles by leveraging a proper vector field that minimizes the KL-divergence.
Its unique benefits make SVGD popular, with many variants (Liu & Zhu, 2018; Zhuo et al., 2018; Chen et al., 2018b; Futami et al., 2018) and applications (Wang & Liu, 2016; Pu et al., 2017; Liu et al., 2017a; Haarnoja et al., 2017; Zhang et al., 2018; 2019). SVGD was later understood as simulating the steepest descending curves, or gradient flows, of the KL-divergence on a certain kernel-related distribution space PH(X) (Liu, 2017). Inspired by this, more ParVIs have been developed by simulating the gradient flow on the Wasserstein space P2(X) (Ambrosio et al., 2008; Villani, 2008) with a finite set of particles. The particle optimization method (PO) (Chen & Zhang, 2017) and the w-SGLD method (Chen et al., 2018a) explore the minimizing movement scheme (MMS; Jordan et al. (1998); Ambrosio et al. (2008), Def. 2.0.6) of the gradient flow and make approximations for tractability with a finite set of particles. The Blob method (originally called w-SGLD-B) (Chen et al., 2018a) uses the vector-field form of the gradient flow and approximates the update direction with a finite set of particles. Empirical comparisons of these ParVIs have been conducted, but a theoretical understanding of their finite-particle approximations for gradient flow simulation is still lacking, particularly regarding the assumptions behind and the relations among these approximations. We also note that, from the optimization point of view, all existing ParVIs simulate the gradient flow, but no ParVI yet exploits the geometry of P2(X) and uses the more appealing accelerated first-order methods on the manifold P2(X). Moreover, the bandwidth of the smoothing kernel of ParVIs is known to be crucial for performance (Zhuo et al., 2018), and a more principled bandwidth selection method is needed, relative to the current heuristic median method (Liu & Wang, 2016).

In this work, we examine the P2(X) gradient flow perspective to address these problems, and contribute to the ParVI field a unified theory on finite-particle approximations, together with two practical techniques: an acceleration framework and a principled bandwidth selection method. The theory shows that the various finite-particle approximations of ParVIs are essentially a smoothing operation, in the form of either smoothing the density or smoothing functions. We reveal the two forms of smoothing, and draw a connection among ParVIs by establishing their equivalence. Furthermore, we recognize that ParVIs actually and necessarily make an assumption on the approximation family. The theory also establishes a principle for developing ParVIs, and we use this principle to conceive two novel models. The acceleration framework follows the Riemannian version (Liu et al., 2017b; Zhang & Sra, 2018) of Nesterov's acceleration method (Nesterov, 1983), which enjoys a proven convergence improvement over direct gradient flow simulation. In developing the framework, we make novel use of the Riemannian structure of P2(X), particularly the inverse exponential map and parallel transport. We emphasize that directly applying Nesterov's acceleration to every particle in X is theoretically unsound, since the KL-divergence is minimized on P2(X) rather than on X, and a single particle is not optimizing any function. For the bandwidth method, we elaborate on the goal of smoothing and develop a principled approach for setting the bandwidth parameter.
Experimental results show the improved sample quality of the principled bandwidth method over the median method (Liu & Wang, 2016), and the improved convergence of the acceleration framework on both supervised and unsupervised tasks.

Related Work. On understanding SVGD, Liu (2017) first views it as a gradient flow on PH(X). Chen et al. (2018a) then find it implausible to exactly formulate SVGD as a gradient flow on P2(X). In this work, we find that SVGD approximates the P2(X) gradient flow by smoothing functions, achieving an understanding and improvement of all ParVIs within a universal perspective. The PH(X) viewpoint is difficult to apply in general and lacks appealing properties.

On accelerating ParVIs, the particle optimization method (PO) (Chen & Zhang, 2017) resembles the Polyak's momentum (Polyak, 1964) version of SVGD. However, it was not originally derived for acceleration and is thus not theoretically sound for that purpose; we observe that PO is less stable than our acceleration framework, similar to the discussion by Sutskever et al. (2013). More recently, Taghvaei & Mehta (2018) also considered accelerating the P2(X) gradient flow. They use the variational formulation of optimization methods (Wibisono et al., 2016) and define the components of that formulation for P2(X), while we leverage the geometry of P2(X) and apply Riemannian acceleration methods. Algorithmically, Taghvaei & Mehta (2018) use a set of momenta while we use auxiliary particles. However, both approaches face the problem of a finite-particle approximation, which we solve here systematically using our theory; their implementation is recognized, in our theory, as smoothing the density. Moreover, our novel perspective on P2(X) and the corresponding techniques provide a general tool that enables other Riemannian optimization techniques for ParVIs.

On incorporating Riemannian structure with ParVIs, Liu & Zhu (2018) consider the case for which the support space X is a Riemannian manifold, either specified by the task or taken as the manifold in information geometry (Amari, 2016). In summary, they utilize the geometry of the Riemannian support space X, while we leverage deeper knowledge of the Riemannian structure of P2(X) itself. Algorithmically, our acceleration is model-agnostic and computationally cheaper. The work of Detommaso et al. (2018) considers second-order information of the KL-divergence on PH(X). Our acceleration remains first-order, and we consider the Wasserstein space P2(X) so that, with our theory, the acceleration is applicable to all ParVIs.

2. Preliminaries

We first introduce the Wasserstein space P2(X) and the gradient flow on it, and review related ParVIs. We only consider a Euclidean support $\mathcal{X} = \mathbb{R}^D$ to reduce unnecessary sophistication and to highlight our key contributions. We denote $\mathcal{C}_c^\infty$ as the set of compactly supported $\mathbb{R}^D$-valued smooth functions on X, and $C_c^\infty$ for scalar-valued ones. Denote $\mathcal{L}^2_q$ as the Hilbert space of $\mathbb{R}^D$-valued functions $\{u : \mathbb{R}^D \to \mathbb{R}^D \mid \int \|u(x)\|_2^2\,\mathrm{d}q < \infty\}$ with inner product $\langle u, v \rangle_{\mathcal{L}^2_q} := \int u(x)^\top v(x)\,\mathrm{d}q$, and $L^2_q$ for scalar-valued functions. The Lebesgue measure is taken if q is not specified. We define the push-forward of a distribution q under a measurable transformation $T : \mathcal{X} \to \mathcal{X}$ as the distribution of the T-transformed random variable of q, denoted $T_\# q$.

2.1. The Riemannian Structure of the Wasserstein Space P2(X)

Figure 1 illustrates the related concepts discussed here. Consider distributions on a support space X with distance $d(\cdot,\cdot)$, and denote P(X) as the set of all such distributions.
The Wasserstein space is the metric space $\mathcal{P}_2(\mathcal{X}) := \{q \in \mathcal{P}(\mathcal{X}) : \exists x_0 \in \mathcal{X} \text{ s.t. } \mathbb{E}_q[d(x_0, x)^2] < +\infty\}$ equipped with the well-known Wasserstein distance W2 (Villani (2008), Def. 6.4).

Figure 1. Illustration of the Wasserstein space P2(X) and related concepts.

Its Riemannian structure was then discovered (Otto, 2001; Benamou & Brenier, 2000), enabling explicit expression of quantities of interest, like the gradient. The first step is the recognition of tangent vectors and tangent spaces. For any smooth curve $(q_t)_t$ on P2(X), there exists an a.e.-unique time-dependent vector field $v_t(x)$ on X such that for a.e. $t \in \mathbb{R}$, $\partial_t q_t + \nabla \cdot (v_t q_t) = 0$ and $v_t \in \overline{\{\nabla\varphi : \varphi \in C_c^\infty\}}^{\mathcal{L}^2_{q_t}}$, where the overline denotes closure (Villani (2008), Thm. 13.8; Ambrosio et al. (2008), Thm. 8.3.1, Prop. 8.4.5). The unique existence of such a $v_t$ allows us to recognize $v_t$ as the tangent vector of the curve at $q_t$, and the mentioned closure as the tangent space at $q_t$, denoted $T_{q_t}\mathcal{P}_2$ (Ambrosio et al. (2008), Def. 8.4.1). The inner product that $T_{q_t}\mathcal{P}_2$ inherits from $\mathcal{L}^2_{q_t}$ defines a Riemannian structure on P2(X), and it is consistent with the Wasserstein distance W2 due to the Benamou-Brenier formula (Benamou & Brenier, 2000). Restricted to a parametric family as a submanifold of P2(X), this structure gives a metric in the parameter space that differs from the Fisher-Rao metric (Amari, 2016), and it has been used in classical VIs (Chen & Li, 2018). Finally, the vector field representation provides a convenient means of simulating the distribution curve $(q_t)_t$: it is known that $(\mathrm{id} + \varepsilon v_t)_\# q_t$ is a first-order approximation of $q_{t+\varepsilon}$ in terms of W2 (Ambrosio et al. (2008), Prop. 8.4.6). This means that given a set of samples $\{x^{(i)}\}_i$ of $q_t$, $\{x^{(i)} + \varepsilon v_t(x^{(i)})\}_i$ is approximately a set of samples of $q_{t+\varepsilon}$ for small $\varepsilon$.

2.2. Gradient Flows on P2(X)

The gradient flow of a function F is roughly the family of steepest descending curves $\{(q_t)_t\}$ of F. It has various technical definitions on metric spaces (Ambrosio et al. (2008), Def. 11.1.1; Villani (2008), Def. 23.7), e.g., as the limit of the minimizing movement scheme (MMS; Ambrosio et al. (2008), Def. 2.0.6):

$$q_{t+\varepsilon} = \operatorname*{argmin}_{q \in \mathcal{P}_2(\mathcal{X})}\ F(q) + \frac{1}{2\varepsilon} W_2^2(q, q_t), \qquad (1)$$

and these definitions all coincide when a Riemannian structure is endowed (Villani (2008), Prop. 23.1, Rem. 23.4; Ambrosio et al. (2008), Thm. 11.1.6; Erbar et al. (2010), Lem. 2.7). In this case the gradient flow $(q_t)_t$ has its tangent vector at any t given by the negative gradient of F at $q_t$, where the gradient is defined as

$$\operatorname{grad} F(q_t) := \operatorname*{max\,argmax}_{v:\ \|v\|_{T_{q_t}\mathcal{P}_2} = 1}\ \frac{\mathrm{d}}{\mathrm{d}\varepsilon} F\big((\mathrm{id} + \varepsilon v)_\# q_t\big)\Big|_{\varepsilon=0}, \qquad (2)$$

where "max argmax" denotes the scalar multiplication of the maximum and the maximizer. For Bayesian inference tasks, given an absolutely continuous target distribution p, we aim to minimize the KL-divergence (a.k.a. relative entropy) $\mathrm{KL}_p(q) := \int_{\mathcal{X}} \log(q/p)\,\mathrm{d}q$. Its gradient flow $(q_t)_t$ has its tangent vector at any t given by

$$v_t^{\mathrm{GF}} := -\operatorname{grad} \mathrm{KL}_p(q_t) = \nabla\log p - \nabla\log q_t, \qquad (3)$$

whenever $q_t$ is absolutely continuous (Villani (2008), Thm. 23.18; Ambrosio et al. (2008), Example 11.1.2). When $\mathrm{KL}_p$ is geodesically $\mu$-convex on P2(X) (e.g., when p is $\mu$-log-concave on X; Villani (2008), Thm. 17.15), the gradient flow $(q_t)_t$ enjoys exponential convergence, $W_2(q_t, p) \le e^{-\mu t}\, W_2(q_0, p)$ (Villani (2008), Thm. 23.25, Thm. 24.7; Ambrosio et al. (2008), Thm. 11.1.4), as expected.

Remark 1. The Langevin dynamics $\mathrm{d}x = \nabla\log p(x)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t(x)$ ($B_t$ is the Brownian motion) is also known to produce the gradient flow of $\mathrm{KL}_p$ on P2(X) (e.g., Jordan et al. (1998) from the MMS perspective). It produces the same distribution curve $(q_t)_t$ as the deterministic dynamics $\mathrm{d}x = v_t^{\mathrm{GF}}(x)\,\mathrm{d}t$.
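To make Remark 1 concrete, the following is a minimal NumPy sketch (ours, not from the paper) that simulates the Langevin dynamics for a toy 2D Gaussian target; the target, step size, and particle count are arbitrary illustrative choices. The marginal distribution of the particles follows the same curve $(q_t)_t$ as the P2(X) gradient flow of $\mathrm{KL}_p$.

```python
import numpy as np

def grad_log_p(x, mean, cov_inv):
    # Score of a Gaussian target p = N(mean, cov): grad log p(x) = -cov_inv (x - mean).
    return -(x - mean) @ cov_inv

def langevin_step(particles, eps, mean, cov_inv, rng):
    # Euler-Maruyama discretization of dx = grad log p(x) dt + sqrt(2) dB_t (Remark 1),
    # whose marginal distributions follow the Wasserstein gradient flow of KL_p.
    noise = rng.standard_normal(particles.shape)
    return particles + eps * grad_log_p(particles, mean, cov_inv) + np.sqrt(2.0 * eps) * noise

rng = np.random.default_rng(0)
mean, cov = np.zeros(2), np.array([[1.0, 0.8], [0.8, 1.0]])
cov_inv = np.linalg.inv(cov)
x = 3.0 * rng.standard_normal((200, 2))   # 200 particles from a broad initial distribution
for _ in range(2000):
    x = langevin_step(x, 1e-2, mean, cov_inv, rng)
print(np.cov(x.T))                        # should be close to the target covariance
```

ParVIs replace the injected noise by a deterministic, interacting-particle estimate of $-\nabla\log q_t$, which is where the smoothing treatments of Section 3 enter.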
2.3. Particle-Based Variational Inference Methods

Stein Variational Gradient Descent (SVGD) (Liu & Wang, 2016) uses a vector field v to update particles, $x^{(i)}_{k+1} = x^{(i)}_k + \varepsilon v(x^{(i)}_k)$, and v is selected to maximize the decreasing rate $-\frac{\mathrm{d}}{\mathrm{d}\varepsilon} \mathrm{KL}_p\big((\mathrm{id}+\varepsilon v)_\# q\big)\big|_{\varepsilon=0}$, where q is the distribution that $\{x^{(i)}\}_i$ obeys. When v is optimized over the vector-valued reproducing kernel Hilbert space (RKHS) $\mathcal{H}^D$ of a kernel K (Steinwart & Christmann (2008), Def. 4.18), the solution can be derived in closed form:

$$v^{\mathrm{SVGD}}(\cdot) := \mathbb{E}_{q(x)}\big[K(x, \cdot)\,\nabla\log p(x) + \nabla_x K(x, \cdot)\big]. \qquad (4)$$

Noting that the optimization problem fits the form of Eq. (2), Liu (2017) interprets SVGD as the gradient flow on PH, a distribution manifold that takes $\mathcal{H}^D$ as its tangent space. Equation (4) can be estimated with a finite set of particles, equivalently taking q(x) as the empirical distribution $\hat q(x) := \frac{1}{N}\sum_{i=1}^N \delta_{x^{(i)}}(x)$.

Other methods have been developed to simulate the P2(X) gradient flow. The Blob method (Chen et al., 2018a) estimates Eq. (3) with a finite set of particles. It reformulates the intractable part $u^{\mathrm{GF}} := -\nabla\log q$ from the perspective of variation, $u^{\mathrm{GF}} = -\nabla \frac{\delta}{\delta q}\mathbb{E}_q[\log q]$, and then partly smooths the density q by convolving with a kernel K: $-\nabla \frac{\delta}{\delta q}\mathbb{E}_q[\log(q * K)] = -\nabla\big(\log\tilde q + (q/\tilde q) * K\big)$, where $\tilde q := q * K$ and $*$ denotes convolution. This form enables the usage of $\hat q$.

The particle optimization method (PO) (Chen & Zhang, 2017) simulates the P2(X) gradient flow by the MMS (1), where the Wasserstein distance W2 is estimated by solving a dual optimal transport problem that optimizes over quadratic functions. The resulting update rule, $x^{(i)}_k = x^{(i)}_{k-1} + \varepsilon\big(v^{\mathrm{SVGD}}(x^{(i)}_{k-1}) + \mathcal{N}(0, \sigma^2 I)\big) + \mu\big(x^{(i)}_{k-1} - x^{(i)}_{k-2}\big)$ (with parameters $\varepsilon, \sigma, \mu$), takes the form of the Polyak's momentum (Polyak, 1964) version of SVGD. The w-SGLD method (Chen et al., 2018a) estimates W2 by solving the primal problem with entropy regularization. The algorithm is similar to PO.

3. ParVIs as Approximations to the P2(X) Gradient Flow

This part of the paper constitutes our main theoretical contributions. We find that ParVIs approximate the P2(X) gradient flow by smoothing, either smoothing the density or smoothing functions. We identify the treatment used by existing ParVIs, analyze the equivalence and the necessity of smoothing, and develop two novel ParVIs based on the theory.

3.1. SVGD Approximates the P2(X) Gradient Flow

Currently SVGD is only known to simulate the gradient flow on the distribution space PH(X). We first interpret SVGD as a simulation of the gradient flow on the Wasserstein space P2(X) with a finite set of particles, so that all existing ParVIs can be analyzed from a common perspective. Noting that $v^{\mathrm{GF}}$ is an element of the Hilbert space $\mathcal{L}^2_q$, we can identify it by

$$v^{\mathrm{GF}} = \operatorname*{max\,argmax}_{v \in \mathcal{L}^2_q,\ \|v\|_{\mathcal{L}^2_q} = 1}\ \langle v^{\mathrm{GF}}, v \rangle_{\mathcal{L}^2_q}. \qquad (5)$$

We then find that by changing the optimization domain from $\mathcal{L}^2_q$ to the vector-valued RKHS $\mathcal{H}^D$ of a kernel K, the problem can be solved in closed form and the solution coincides with $v^{\mathrm{SVGD}}$. This connects the two notions:

Theorem 2 ($v^{\mathrm{SVGD}}$ approximates $v^{\mathrm{GF}}$). The SVGD vector field $v^{\mathrm{SVGD}}$ defined in Eq. (4) approximates the vector field $v^{\mathrm{GF}}$ of the gradient flow on P2(X), in the following sense:
$$v^{\mathrm{SVGD}} = \operatorname*{max\,argmax}_{v \in \mathcal{H}^D,\ \|v\|_{\mathcal{H}^D} = 1}\ \langle v^{\mathrm{GF}}, v \rangle_{\mathcal{L}^2_q}.$$

The proof is provided in Appendix A.1. We will see that $\mathcal{H}^D$ is roughly a subspace of $\mathcal{L}^2_q$, so $v^{\mathrm{SVGD}}$ can be seen as the projection of $v^{\mathrm{GF}}$ onto $\mathcal{H}^D$.
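For concreteness, below is a minimal NumPy sketch (ours, not from the paper) of the finite-particle SVGD update in Eq. (4) with $q = \hat q$ and an RBF kernel $K(x, y) = \exp(-\|x-y\|^2/h)$; the bandwidth h, step size, and all function names are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, h):
    # Pairwise RBF kernel K[i, j] = exp(-||x_i - x_j||^2 / h) and its gradients w.r.t. x_i.
    diff = X[:, None, :] - X[None, :, :]           # diff[i, j] = x_i - x_j, shape (N, N, D)
    sq = np.sum(diff ** 2, axis=-1)                # squared distances, shape (N, N)
    K = np.exp(-sq / h)
    gradK = -2.0 / h * diff * K[:, :, None]        # gradK[i, j] = grad_{x_i} K(x_i, x_j)
    return K, gradK

def svgd_direction(X, grad_log_p, h):
    # Finite-particle estimate of Eq. (4):
    # v(x_j) = (1/N) sum_i [ K(x_i, x_j) grad log p(x_i) + grad_{x_i} K(x_i, x_j) ].
    K, gradK = rbf_kernel(X, h)
    N = X.shape[0]
    return (K.T @ grad_log_p(X) + gradK.sum(axis=0)) / N

# Example: a few SVGD steps towards a standard Gaussian target (grad log p(x) = -x).
X = np.random.default_rng(0).standard_normal((50, 2)) * 3.0
for _ in range(500):
    X = X + 0.1 * svgd_direction(X, lambda Y: -Y, h=1.0)
```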
The current PH(X) gradient flow interpretation of SVGD (Liu, 2017; Chen et al., 2018a) is not fully satisfying. We point out that PH is not yet a well-defined Riemannian manifold. It is characterized by taking $\mathcal{H}^D$ as its tangent space, but it is only known that the tangent space of a manifold is determined by the manifold's topology (Do Carmo, 1992), and it is unknown whether there uniquely exists a manifold with a specified tangent space. In particular, the tangent vector of any smooth curve should exist uniquely in the tangent space. The Wasserstein space P2(X) satisfies this (Villani (2008), Thm. 13.8; Ambrosio et al. (2008), Thm. 8.3.1, Prop. 8.4.5), but it remains unknown for PH. The manifold PH also lacks appealing properties like an explicit expression of the distance (Chen et al., 2018a). SVGD has also been formulated as a Vlasov process (Liu, 2017; Chen et al., 2018a); this shows that SVGD keeps p invariant, but does not provide much knowledge of its convergence behavior.

3.2. ParVIs Approximate the P2(X) Gradient Flow by Smoothing

With the above knowledge, all ParVIs approximate the P2(X) gradient flow. We then find that the approximation is made by a compulsory smoothing treatment, either smoothing the density or smoothing functions.

Smoothing the Density. We note that the Blob method approximates the P2(X) gradient flow by replacing q with a smoothed density $\tilde q := \hat q * K$ in the variational formulation of the gradient flow. In the w-SGLD method (Chen et al., 2018a), an entropy regularization is introduced to the primal optimal transport problem. This term prevents the density solution from being highly concentrated, and thus effectively poses a smoothing requirement on densities.

Smoothing Functions. We note in Theorem 2 that SVGD approximates the P2(X) gradient flow by replacing the function family $\mathcal{L}^2_q$ with $\mathcal{H}^D$ in an optimization formulation of the gradient flow. We then reveal the fact that a function in $\mathcal{H}^D$ is roughly a kernel-smoothed function in $\mathcal{L}^2_q$, as stated formally in the following theorem:

Theorem 3 ($\mathcal{H}^D$ smooths $\mathcal{L}^2_q$). For $\mathcal{X} = \mathbb{R}^D$, a Gaussian kernel K on X and an absolutely continuous q, the vector-valued RKHS $\mathcal{H}^D$ of K is isometrically isomorphic to the closure $\mathcal{G} := \overline{\{\varphi * K : \varphi \in \mathcal{C}_c^\infty\}}^{\mathcal{L}^2_q}$.

The proof is presented in Appendix A.2. Noting that $\mathcal{C}_c^\infty$ is roughly $\mathcal{L}^2_q$ in the sense $\overline{\mathcal{C}_c^\infty}^{\mathcal{L}^2_q} = \mathcal{L}^2_q$ (Kováčik & Rákosník (1991), Thm. 2.11), the closure $\mathcal{G}$ is roughly the set of kernel-smoothed functions in $\mathcal{L}^2_q$, and it is roughly $\mathcal{H}^D$. As mentioned in Section 2.3, the particle optimization method (PO) (Chen & Zhang, 2017) restricts the optimization domain to quadratic functions when solving the dual optimal transport problem. Since quadratic functions have restricted sharpness (no change in second-order derivatives), this treatment effectively smooths the function family.

Equivalence. The above analysis gains further importance when we note the equivalence between the two smoothing approaches. Recall that the objective in the optimization problem that SVGD uses, Eq. (5), is $\langle v^{\mathrm{GF}}, v \rangle_{\mathcal{L}^2_q} = \mathbb{E}_q[v^{\mathrm{GF}\top} v]$. We generalize this objective to the form $\mathbb{E}_q[L(v)]$ with a linear map $L : \mathcal{L}^2_q \to L^2_q$. Due to the interchangeability of the integral and the linearity of L, we have the relation
$$\mathbb{E}_{\tilde q}[L(v)] = \mathbb{E}_{q * K}[L(v)] = \mathbb{E}_q[L(v) * K] = \mathbb{E}_q[L(v * K)],$$
which indicates that smoothing the density, $q * K$, is equivalent to smoothing functions, $v * K$. This equivalence bridges the two types of ParVIs, making the analysis and techniques for one type (e.g., our acceleration and bandwidth methods in Sections 4 and 5, respectively) also applicable to the other.
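As an illustration of the smoothing-the-density treatment, the following sketch (ours, reusing the rbf_kernel helper and imports from the SVGD sketch above) implements the finite-particle Blob direction obtained from the reformulation $-\nabla\big(\log\tilde q + (q/\tilde q)*K\big)$ with $q = \hat q$; the kernel choice and names are assumptions.

```python
def blob_direction(X, grad_log_p, h):
    # Blob (w-SGLD-B) direction with the smoothed density q~ = q_hat * K:
    # v(x_i) = grad log p(x_i) - grad log q~(x_i) - grad_x [(q_hat / q~) * K](x_i).
    K, gradK = rbf_kernel(X, h)                    # gradK[i, j] = grad_{x_i} K(x_i, x_j)
    row_sum = K.sum(axis=1)                        # row_sum[i] = sum_l K(x_i, x_l), proportional to q~(x_i)
    grad_log_qs = gradK.sum(axis=1) / row_sum[:, None]           # grad log q~(x_i)
    correction = (gradK / row_sum[None, :, None]).sum(axis=1)    # grad of the (q_hat/q~) * K term at x_i
    return grad_log_p(X) - grad_log_qs - correction
```

Compared with the SVGD sketch, the $\nabla\log p$ term here enters directly rather than through a kernel average; only the estimate of $-\nabla\log q$ is built from the smoothed density.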
Necessity and ParVI Assumption. We stress that this smoothing endeavor is essentially required for a well-defined gradient flow of the KL-divergence $\mathrm{KL}_p(q)$: the divergence is infinite when q is not absolutely continuous, as is the case for $q = \hat q$, so the gradient flow cannot be reasonably defined. We therefore recognize that ParVIs have to make the assumption that q is smooth, and this can be implemented by either smoothing the density or smoothing functions. The claim seems straightforward for smoothing-density methods, but is perhaps obscure for smoothing-function methods. We now give a direct analysis for SVGD showing that if one neither smooths the density (i.e., takes $q = \hat q$) nor smooths the functions (i.e., optimizes over $\mathcal{L}^2_p$; the space $\mathcal{L}^2_p$ rather than $\mathcal{L}^2_q$ is required by the condition of Stein's identity (Liu, 2017), on which SVGD is based, and $\hat q$ is not absolutely continuous so $\mathcal{L}^2_{\hat q}$ is not a proper Hilbert space of functions), the outcome is unreasonable.

Theorem 4 (Necessity of smoothing for SVGD). For $q = \hat q$ and $v \in \mathcal{L}^2_p$, problem (5) has no optimal solution. In fact the supremum of the objective is infinite, indicating that a maximizing sequence of v tends to be ill-posed.

The proof is given in Appendix A.3. SVGD claims no assumption on the form of the approximating density q as long as its samples are known, but we find that it actually transmits the restriction on q to the functions v. The choice of $\mathcal{H}^D$ for v is not just for a tractable solution, but more importantly, for guaranteeing a valid vector field. We see that there is no free lunch in making the smoothing assumption: ParVIs have to assume a smoothed density or smoothed functions.

3.3. New ParVIs with Smoothing

The theoretical understanding of smoothing for the finite-particle approximation constitutes a principle for developing new ParVIs. We conceive two new instances based on the smoothing-density and smoothing-function formulations.

GFSD. We directly approximate q with the smoothed density $\tilde q := \hat q * K$ and adopt the vector-field form of the gradient flow (Eq. (3)): $u^{\mathrm{GFSD}} := -\nabla\log\tilde q$. We call the corresponding method Gradient Flow with Smoothed Density (GFSD).

GFSF. We discover another novel optimization formulation to identify $u^{\mathrm{GF}}$, which builds a new ParVI by smoothing functions. We rewrite the intractable component $u^{\mathrm{GF}} := -\nabla\log q$ as $q\,u^{\mathrm{GF}} + \nabla q = 0$, and treat it as an equality that holds in the weak sense. This means $\mathbb{E}_q[\varphi^\top u - \nabla\cdot\varphi] = 0$ for all $\varphi \in \mathcal{C}_c^\infty$ (we also consider scalar-valued functions $\varphi \in C_c^\infty$ smoothed in $\mathcal{H}$, which gives the same result, as shown in Appendix B.2), or equivalently,
$$u^{\mathrm{GF}} = \operatorname*{argmin}_{u \in \mathcal{L}^2}\ \max_{\varphi \in \mathcal{C}_c^\infty,\ \|\varphi\|_{\mathcal{L}^2_q} = 1}\ \big(\mathbb{E}_q[\varphi^\top u - \nabla\cdot\varphi]\big)^2.$$
We take $q = \hat q$ and smooth the functions $\varphi \in \mathcal{C}_c^\infty$ with the kernel K, which is equivalent to taking $\varphi$ from the vector-valued RKHS $\mathcal{H}^D$ according to Theorem 3:
$$u^{\mathrm{GFSF}} := \operatorname*{argmin}_{u \in \mathcal{L}^2}\ \max_{\varphi \in \mathcal{H}^D,\ \|\varphi\|_{\mathcal{H}^D} = 1}\ \big(\mathbb{E}_{\hat q}[\varphi^\top u - \nabla\cdot\varphi]\big)^2. \qquad (6)$$
The closed-form solution is $\hat u^{\mathrm{GFSF}} = \hat K' \hat K^{-1}$ in matrix form, where $\hat u^{\mathrm{GFSF}}_{:,i} := u^{\mathrm{GFSF}}(x^{(i)})$, $\hat K_{ij} := K(x^{(i)}, x^{(j)})$, and $\hat K'_{:,i} := \sum_j \nabla_{x^{(j)}} K(x^{(j)}, x^{(i)})$ (see Appendix B.1). We call this method Gradient Flow with Smoothed test Functions (GFSF). Note that the above objective fits the form $\mathbb{E}_q[L(\varphi)]$ with L linear, indicating the equivalence to smoothing the density, as discussed in Section 3.2. An interesting relation between the matrix-form expressions of GFSF and SVGD is that $\hat v^{\mathrm{GFSF}} = \hat g + \hat K' \hat K^{-1}$ while $\hat v^{\mathrm{SVGD}} = (\hat g \hat K + \hat K')/N$, where $\hat g_{:,i} := \nabla\log p(x^{(i)})$. We also note that the GFSF estimate of $-\nabla\log q$ coincides with the method of Li & Turner (2017), which is derived from Stein's identity by approximating the $\mathcal{L}^2$ space with the RKHS $\mathcal{H}$.
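The two new ParVIs admit compact finite-particle implementations. Below is our illustrative NumPy sketch of the GFSD and GFSF estimates of $u = -\nabla\log q$ at the particles, again reusing the rbf_kernel helper and imports from the SVGD sketch; the ridge term added to $\hat K$ is a numerical-stability assumption of ours. The full update direction in either case is $\nabla\log p(x^{(i)}) + u(x^{(i)})$.

```python
def gfsd_u(X, h):
    # GFSD: u(x_i) = -grad log q~(x_i), with q~ = q_hat * K the kernel-smoothed density.
    K, gradK = rbf_kernel(X, h)
    return -gradK.sum(axis=1) / K.sum(axis=1, keepdims=True)

def gfsf_u(X, h, ridge=1e-6):
    # GFSF closed form: U_hat = K'_hat K_hat^{-1}, where K_hat[i, j] = K(x_i, x_j) and
    # K'_hat has columns sum_j grad_{x_j} K(x_j, x_i); rows of the returned array index particles.
    K, gradK = rbf_kernel(X, h)
    Kp = gradK.sum(axis=0)                         # Kp[i] = sum_j grad_{x_j} K(x_j, x_i)
    return np.linalg.solve(K + ridge * np.eye(X.shape[0]), Kp)
```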
Due to Remark 1, all these ParVIs aim to simulate the same path on P2(X) as the Langevin dynamics (LD) (Roberts & Stramer, 2002). They directly utilize particle interaction via the smoothing kernel, so every particle is aware of the others, and they can be more particle-efficient (Liu & Wang, 2016) than a vanilla LD simulation. To scale to large datasets, LD has employed stochastic gradients in simulation (Welling & Teh, 2011), which is appropriate (Chen et al., 2015). Due to the connection to LD, ParVIs can also adopt stochastic gradients for scalability. Finally, our theory could utilize more techniques for developing ParVIs, e.g., implicit distribution gradient estimation (Shi et al., 2018).

4. Accelerated First-Order Methods on P2(X)

We have developed a unified understanding of ParVIs under the P2(X) gradient flow perspective, which corresponds to the gradient descent method on P2(X). It is well known that Nesterov's acceleration method (Nesterov, 1983) gives a faster convergence rate, and its Riemannian variants have been developed recently, such as the Riemannian Accelerated Gradient (RAG) (Liu et al., 2017b) and Riemannian Nesterov's method (RNes) (Zhang & Sra, 2018). We aim to employ ParVIs with these methods. However, this requires more knowledge of the geometry of P2(X).

4.1. Leveraging the Riemannian Structure of P2(X)

RAG and RNes require the exponential map (and its inverse) and parallel transport on P2(X).

Figure 2. Illustration of the exponential map (left) and parallel transport (right).

As depicted in Fig. 2, the exponential map $\mathrm{Exp}_q : T_q\mathcal{P}_2(\mathcal{X}) \to \mathcal{P}_2(\mathcal{X})$ is the displacement from position q to a new position along the geodesic (the analogue of a straight line on a manifold) in a given direction, and the parallel transport $\Gamma^r_q : T_q\mathcal{P}_2(\mathcal{X}) \to T_r\mathcal{P}_2(\mathcal{X})$ transforms a tangent vector at q into one at r, in a certain sense of being parallel along the geodesic from q to r. We now examine these components and find practical estimates with a finite set of particles.

Exponential Map on P2(X). For an absolutely continuous measure q, $\mathrm{Exp}_q(v) = (\mathrm{id} + v)_\# q$ (Villani (2008), Coro. 7.22; Ambrosio et al. (2008), Prop. 8.4.6; Erbar et al. (2010), Prop. 2.1). Finite-particle estimation is just an application of the map $x \mapsto x + v(x)$ to each particle.

Inverse of the Exponential Map. With details in Appendix A.4, we find that the inverse exponential map $\mathrm{Exp}^{-1}_q(r)$ for $q, r \in \mathcal{P}_2(\mathcal{X})$ can be expressed by the optimal transport map from q to r. We can approximate the map by a discrete one between particles $\{x^{(i)}\}_{i=1}^N$ of q and $\{y^{(i)}\}_{i=1}^N$ of r. However, this is still a costly task (Pele & Werman, 2009). Faster approaches like the Sinkhorn method (Cuturi, 2013; Xie et al., 2018) still require empirically $O(N^2)$ time, and they appear unacceptably unstable in our experiments. We consider an approximation when $\{x^{(i)}\}_i$ and $\{y^{(i)}\}_i$ are pairwise close: $d(x^{(i)}, y^{(i)}) \ll \min\big\{\min_{j\ne i} d(x^{(i)}, x^{(j)}),\ \min_{j\ne i} d(y^{(i)}, y^{(j)})\big\}$. In this case we have the following result (deduction in Appendix A.4):

Proposition 5 (Inverse exponential map). For pairwise close samples $\{x^{(i)}\}_{i=1}^N$ of q and $\{y^{(i)}\}_{i=1}^N$ of r, we have $\big(\mathrm{Exp}^{-1}_q(r)\big)(x^{(i)}) \approx y^{(i)} - x^{(i)}$.

Parallel Transport on P2(X). There has been formal research on parallel transport on P2(X) (Lott, 2008; 2017), but the result is expensive to estimate with a finite set of particles.
Here we utilize Schild's ladder method (Ehlers et al., 1972; Kheyfets et al., 2000), a first-order approximation of parallel transport that only requires the exponential map. We derive the following estimate with a finite set of particles (deduction in Appendix A.5):

Proposition 6 (Parallel transport). For pairwise close samples $\{x^{(i)}\}_{i=1}^N$ of q and $\{y^{(i)}\}_{i=1}^N$ of r, we have $\big(\Gamma^r_q(v)\big)(y^{(i)}) \approx v(x^{(i)})$ for any $v \in T_q\mathcal{P}_2(\mathcal{X})$.

Neither result may seem surprising. This is because the geometry of P2(X) is determined by that of X; we consider a Euclidean X, so P2(X) also appears flat. Extension to a non-flat Riemannian X can be done by the same procedure.

Algorithm 1: The acceleration framework with Wasserstein Accelerated Gradient (WAG) and Wasserstein Nesterov's method (WNes)
1: WAG: select acceleration factor $\alpha > 3$; WNes: select or calculate $c_1, c_2 \in \mathbb{R}_+$ (Appendix C.2);
2: Initialize $\{x^{(i)}_0\}_{i=1}^N$ distinctly; let $y^{(i)}_0 = x^{(i)}_0$;
3: for $k = 1, 2, \dots, k_{\max}$ do
4: &nbsp;&nbsp;for $i = 1, \dots, N$ do
5: &nbsp;&nbsp;&nbsp;&nbsp;Find $v(y^{(i)}_{k-1})$ by SVGD/Blob/GFSD/GFSF;
6: &nbsp;&nbsp;&nbsp;&nbsp;$x^{(i)}_k = y^{(i)}_{k-1} + \varepsilon v(y^{(i)}_{k-1})$;
7: &nbsp;&nbsp;&nbsp;&nbsp;$y^{(i)}_k = x^{(i)}_k + \Big[$ WAG: $\tfrac{k-1}{k}\big(y^{(i)}_{k-1} - x^{(i)}_{k-1}\big) + \tfrac{k+\alpha-2}{k}\,\varepsilon v(y^{(i)}_{k-1})$; WNes: $c_1(c_2 - 1)\big(x^{(i)}_k - x^{(i)}_{k-1}\big)$ $\Big]$;
8: &nbsp;&nbsp;end for
9: end for
10: Return $\{x^{(i)}_{k_{\max}}\}_{i=1}^N$.

4.2. Acceleration Framework for ParVIs

Now we apply RAG and RNes to the Wasserstein space P2(X) and construct an accelerated sequence $\{q_k\}_k$ minimizing $\mathrm{KL}_p$. Both methods introduce an auxiliary variable $r_k \in \mathcal{P}_2(\mathcal{X})$, at which the descent direction is evaluated: $v_k := -\operatorname{grad}\mathrm{KL}_p(r_k)$. RAG (Liu et al., 2017b) needs to solve a nonlinear equation in each step. We simplify it with moderate approximations to give an explicit update rule:
$$q_k = \mathrm{Exp}_{r_{k-1}}(\varepsilon v_{k-1}),\qquad r_k = \mathrm{Exp}_{q_k}\Big[\Gamma^{q_k}_{r_{k-1}}\Big(-\tfrac{k-1}{k}\,\mathrm{Exp}^{-1}_{r_{k-1}}(q_{k-1}) + \tfrac{k+\alpha-2}{k}\,\varepsilon v_{k-1}\Big)\Big]$$
(details in Appendix C.1). RNes (Zhang & Sra, 2018) involves an additional variable. We collapse that variable and reformulate RNes as
$$q_k = \mathrm{Exp}_{r_{k-1}}(\varepsilon v_{k-1}),\qquad r_k = \mathrm{Exp}_{q_k}\Big[c_1\,\mathrm{Exp}^{-1}_{q_k}\Big(\mathrm{Exp}_{r_{k-1}}\big((1-c_2)\,\mathrm{Exp}^{-1}_{r_{k-1}}(q_{k-1}) + c_2\,\mathrm{Exp}^{-1}_{r_{k-1}}(q_k)\big)\Big)\Big]$$
(details in Appendix C.2). To implement both methods with a finite set of particles, we leverage the geometric calculations of the previous subsection and estimate $v_k$ with ParVIs. The resulting algorithms are called Wasserstein Accelerated Gradient (WAG) and Wasserstein Nesterov's method (WNes), and are presented in Alg. 1 (deduction details in Appendix C.3). They form an acceleration framework for ParVIs. In the deduction, the pairwise-close condition is satisfied, so the usage of Propositions 5 and 6 is appropriate. In theory, the acceleration framework inherits the proven improvement in convergence rate from RAG and RNes, and it can be applied to all ParVIs, since our theory has established the equivalence of ParVIs. In practice, the framework imposes a computational cost linear in the particle size N, which is not a significant overhead. Moreover, we emphasize that we cannot directly apply the vanilla Nesterov's acceleration method (Nesterov, 1983) in X to every particle, since a single particle is not optimizing a certain function. Finally, the developed knowledge of the geometry of P2(X) (Propositions 5 and 6) also makes it possible to apply other optimization techniques on Riemannian manifolds to benefit ParVIs, e.g., Riemannian BFGS (Gabay, 1982; Qi et al., 2010; Yuan et al., 2016) and the Riemannian stochastic variance reduction gradient (Zhang et al., 2016).
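In particle form, Alg. 1 reduces to simple array updates. The sketch below (ours) wraps any of the direction estimators above (e.g. svgd_direction or blob_direction) in the WAG and WNes branches of Alg. 1; the step size, $\alpha$, $c_1$, and $c_2$ are hyperparameters to be chosen as in Appendix C.2, and the function names are ours.

```python
def wag(X0, direction_fn, eps, alpha, n_iters):
    # WAG branch of Alg. 1: x_k = y_{k-1} + eps*v;  y_k = x_k + (k-1)/k (y_{k-1} - x_{k-1}) + (k+alpha-2)/k eps*v.
    X_prev, Y = X0.copy(), X0.copy()
    for k in range(1, n_iters + 1):
        V = direction_fn(Y)                              # ParVI direction at the auxiliary particles y_{k-1}
        X = Y + eps * V
        Y = X + (k - 1.0) / k * (Y - X_prev) + (k + alpha - 2.0) / k * eps * V
        X_prev = X
    return X_prev

def wnes(X0, direction_fn, eps, c1, c2, n_iters):
    # WNes branch of Alg. 1: x_k = y_{k-1} + eps*v;  y_k = x_k + c1*(c2-1)*(x_k - x_{k-1}).
    X_prev, Y = X0.copy(), X0.copy()
    for k in range(1, n_iters + 1):
        V = direction_fn(Y)
        X = Y + eps * V
        Y = X + c1 * (c2 - 1.0) * (X - X_prev)
        X_prev = X
    return X_prev

# Example usage (illustrative hyperparameters):
# X = wnes(X0, lambda Y: svgd_direction(Y, grad_log_p, h=1.0), eps=0.1, c1=0.5, c2=1.5, n_iters=500)
```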
5. Bandwidth Selection via the Heat Equation

Our theory points out that all ParVIs need smoothing, and this can be done with a kernel. It is thus an essential problem to choose the bandwidth of the kernel. SVGD uses the median method (Liu & Wang, 2016), a heuristic aimed at numerical stability. We work towards a more principled method: we first analyze the goal of smoothing, and then build a practical algorithm.

As noted in Remark 1, the deterministic dynamics $\mathrm{d}x = v^{\mathrm{GF}}(x)\,\mathrm{d}t$ and the Langevin dynamics produce the same rule of density evolution. In particular, the deterministic dynamics $\mathrm{d}x = -\nabla\log q_t(x)\,\mathrm{d}t$ and the Brownian motion $\mathrm{d}x = \sqrt{2}\,\mathrm{d}B_t(x)$ produce the same evolution rule, specified by the heat equation (HE): $\partial_t q_t(x) = \Delta q_t(x)$. A good smoothing kernel should therefore let the evolving density under the approximated dynamics match the rule of the HE. This is our principle for selecting the kernel bandwidth.

We implement this principle for GFSD, which estimates $q_t$ by the kernel-smoothed density $\tilde q(x) = \tilde q(x; \{x^{(i)}\}_i) = \frac{1}{N}\sum_{i=1}^N K(x, x^{(i)})$. The approximate dynamics $\mathrm{d}x = -\nabla\log\tilde q(x)\,\mathrm{d}t$ moves the particles $\{x^{(i)}\}_i$ of $q_t$ to $\{x^{(i)} - \varepsilon\nabla\log\tilde q(x^{(i)})\}_i$, which approximates $q_{t+\varepsilon}$. On the other hand, according to the evolution rule of the HE, the new density $q_{t+\varepsilon}$ should be approximated by $q_t + \varepsilon\,\partial_t q_t \approx \tilde q + \varepsilon\,\Delta\tilde q$. The two approximations should match, which means $\tilde q\big(x; \{x^{(i)} - \varepsilon\nabla\log\tilde q(x^{(i)})\}_i\big)$ should be close to $\tilde q + \varepsilon\,\Delta\tilde q$. Expanding up to first order in $\varepsilon$, this requirement translates into imposing that the function
$$\lambda(x) := \Delta\tilde q\big(x; \{x^{(i)}\}_i\big) + \sum_j \nabla_{x^{(j)}}\tilde q\big(x; \{x^{(i)}\}_i\big)^\top \nabla\log\tilde q\big(x^{(j)}; \{x^{(i)}\}_i\big)$$
should be close to zero. To achieve this practically, we propose to minimize $h^{D+2}\,\mathbb{E}_{q(x)}[\lambda(x)^2] \approx \frac{h^{D+2}}{N}\sum_k \lambda(x^{(k)})^2$ with respect to the bandwidth h. The factor $h^{D+2}$ (note that $\|x\|^2/h$ is dimensionless) makes the final objective dimensionless. We call this the HE method. Although the derivation is based on GFSD, the resulting algorithm can be applied to all kernel-based ParVIs, like SVGD, Blob and GFSF, due to the equivalence of smoothing the density and smoothing functions from our theory. See Appendix D for further details.
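The following is a sketch of how the HE criterion could be evaluated and minimized over a bandwidth grid, under our reading of the objective above; the normalized Gaussian kernel, the grid search, and all names are our own illustrative choices rather than the paper's reference implementation.

```python
import numpy as np

def he_objective(X, h):
    # HE bandwidth criterion (our reading): h^(D+2) * mean_k lambda(x_k)^2, with
    # lambda(x) = Laplacian of q~(x) + sum_j grad_{x_j} q~(x) . grad log q~(x_j),
    # where q~(x) = (1/N) sum_j K_h(x, x_j) and K_h is a normalized Gaussian kernel with bandwidth h.
    N, D = X.shape
    diff = X[:, None, :] - X[None, :, :]                   # diff[k, j] = x_k - x_j
    sq = np.sum(diff ** 2, axis=-1)
    K = (np.pi * h) ** (-D / 2.0) * np.exp(-sq / h)        # K[k, j] = K_h(x_k, x_j)
    grad_log_q = (-2.0 / h) * (diff * K[:, :, None]).sum(axis=1) / K.sum(axis=1, keepdims=True)
    lap_q = ((4.0 * sq / h ** 2 - 2.0 * D / h) * K).mean(axis=1)   # Laplacian of q~ at x_k
    grad_xj_q = (2.0 / h) * diff * K[:, :, None] / N               # grad_{x_j} q~(x_k)
    lam = lap_q + np.einsum('kjd,jd->k', grad_xj_q, grad_log_q)
    return h ** (D + 2) * np.mean(lam ** 2)

def select_bandwidth(X, candidates):
    # Pick the bandwidth minimizing the HE criterion over a user-supplied grid.
    return min(candidates, key=lambda h: he_objective(X, h))
```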
6. Experiments

Detailed experimental settings and parameters are provided in Appendix E, and code is available at https://github.com/chang-ml-thu/AWGF. Corresponding to WAG and WNes, we refer to the vanilla gradient flow simulation of ParVIs as Wasserstein Gradient Descent (WGD).

6.1. Toy Experiments

Figure 3. Comparison of HE (right column) with the median method (left column) for bandwidth selection. Rows correspond to SVGD, Blob, GFSD and GFSF, respectively.

We first investigate the benefit of the HE method for selecting the bandwidth, in comparison to the median method. Figure 3 shows 200 particles produced by the four ParVIs using both the HE and median methods. We find that the median method makes the particles collapse to the modes, since the numerical heuristic cannot guarantee the effect of smoothing. The HE method achieves an attractive effect: the particles align neatly and distribute almost uniformly along the contour, building a more representative approximation. SVGD gives diverse particles even with the median method, which may be due to the averaging of gradients for each particle.

6.2. Bayesian Logistic Regression (BLR)

We show the accelerated convergence of the proposed WAG and WNes methods (Alg. 1) for various ParVIs on BLR. Although PO is not developed for acceleration, we treat it as an empirical acceleration. We follow the same settings as Liu & Wang (2016) and Chen et al. (2018a), except that results are averaged over 10 random trials. Results are evaluated by test accuracy (Fig. 4) and log-likelihood (Fig. 8 in Appendix F.1).

Figure 4. Acceleration effect of WAG and WNes on BLR on the Covertype dataset (test accuracy vs. iteration for SVGD, Blob, GFSD and GFSF, each with WGD, PO, WAG and WNes). Curves are averaged over 10 runs.

For all four ParVIs, WAG and WNes notably improve the convergence over WGD and PO. Moreover, WNes achieves better results than WAG, especially at the early stage, and it is also more stable w.r.t. hyperparameters. The performance of the PO method is roughly the same as WGD, matching the observation of Chen & Zhang (2017). We also note that the four ParVIs have similar performance, which is natural since they approximate the same gradient flow with equivalent smoothing treatments.

6.3. Bayesian Neural Networks (BNNs)

We test all methods on BNNs for a fixed number of iterations, following the settings of Liu & Wang (2016), and present results in Table 1. We observe that the WAG and WNes acceleration methods outperform WGD and PO for all four ParVIs. The PO method also improves the performance, but not to the extent of WAG.

Table 1. Results on BNN on the Kin8nm dataset (one of the UCI datasets (Asuncion & Newman, 2007)). Results averaged over 20 runs. Left block: average test RMSE (×10⁻²); right block: average test log-likelihood.

| Method | SVGD (RMSE) | Blob (RMSE) | GFSD (RMSE) | GFSF (RMSE) | SVGD (LL) | Blob (LL) | GFSD (LL) | GFSF (LL) |
|---|---|---|---|---|---|---|---|---|
| WGD | 8.4 ± 0.2 | 8.2 ± 0.2 | 8.0 ± 0.3 | 8.3 ± 0.2 | 1.042 ± 0.016 | 1.079 ± 0.021 | 1.087 ± 0.029 | 1.044 ± 0.016 |
| PO | 7.8 ± 0.2 | 8.1 ± 0.2 | 8.1 ± 0.2 | 8.0 ± 0.2 | 1.114 ± 0.022 | 1.070 ± 0.020 | 1.067 ± 0.017 | 1.073 ± 0.016 |
| WAG | 7.0 ± 0.2 | 7.0 ± 0.2 | 7.1 ± 0.1 | 7.0 ± 0.1 | 1.167 ± 0.015 | 1.169 ± 0.015 | 1.167 ± 0.017 | 1.190 ± 0.014 |
| WNes | 6.9 ± 0.1 | 7.0 ± 0.2 | 6.9 ± 0.1 | 6.8 ± 0.1 | 1.171 ± 0.014 | 1.168 ± 0.014 | 1.173 ± 0.016 | 1.193 ± 0.014 |

6.4. Latent Dirichlet Allocation (LDA)

We show the improved performance of the acceleration methods on an unsupervised task: posterior inference for LDA (Blei et al., 2003). We follow the same settings as Ding et al. (2014), including the ICML dataset (https://cse.buffalo.edu/~changyou/code/SGNHT.zip) and the Expanded-Natural parameterization (Patterson & Teh, 2013). The particle size is fixed at 20. Inference results are evaluated by the conventional hold-out perplexity (the lower the better).

Figure 5. Acceleration effect of WAG and WNes on LDA (hold-out perplexity vs. iteration for SVGD, Blob, GFSD and GFSF, each with WGD, PO, WAG and WNes). Curves are averaged over 10 runs.

The acceleration effect is shown in Fig. 5. We see again that WAG and WNes improve the convergence rate over WGD. PO has a comparable empirical acceleration performance. We note that WAG is sensitive to its parameter α and exhibits minor oscillation, while WNes is more stable. We also compare ParVIs with stochastic gradient Nosé-Hoover thermostats (SGNHT) (Ding et al., 2014), an advanced MCMC method. Its samples are taken as either the last 20 samples from one chain (-seq), or the very last sample from each of 20 parallel chains (-para).

Figure 6. Comparison of SVGD and SGNHT on LDA, as representatives of ParVIs and MCMCs (hold-out perplexity vs. iteration). Average over 10 runs.

From Figure 6, we observe the faster convergence of ParVIs over the MCMC method.
7. Conclusions

By exploiting the P2(X) gradient flow perspective on ParVIs, we establish a unified theory of the finite-particle approximations of ParVIs, and propose an acceleration framework and a principled bandwidth-selection method to improve ParVIs. The theory recognizes the various approximations as a smoothing treatment, either smoothing the density or smoothing functions. The equivalence of the two smoothing forms connects existing ParVIs, and their necessity reveals the assumptions of ParVIs. The acceleration framework is developed via a deep exploration of the geometry of P2(X), and the bandwidth method is based on a principle of smoothing. Experiments show more representative particles from the principled bandwidth method, and the speed-up of ParVIs by the acceleration framework.

Acknowledgments

This work was supported by the National Key Research and Development Program of China (No. 2017YFA0700904), NSFC Projects (Nos. 61620106010, 61621136008, 61571261), Beijing NSF Project (No. L172037), DITD Program JCKY2017204B064, the Tiangong Institute for Intelligent Computing, the Beijing Academy of Artificial Intelligence (BAAI), the NVIDIA NVAIL Program, and projects from Siemens and Intel. This work was done when C. Liu was visiting Duke University, during which he was supported by the China Scholarship Council.

References

Amari, S.-I. Information geometry and its applications. Springer, 2016.

Ambrosio, L., Gigli, N., and Savaré, G. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.

Asuncion, A. and Newman, D. UCI machine learning repository, 2007.

Benamou, J.-D. and Brenier, Y. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375-393, 2000.

Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993-1022, 2003.

Chen, C. and Zhang, R. Particle optimization in stochastic gradient MCMC. arXiv preprint arXiv:1711.10927, 2017.

Chen, C., Ding, N., and Carin, L. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems, pp. 2278-2286, 2015.

Chen, C., Zhang, R., Wang, W., Li, B., and Chen, L. A unified particle-optimization framework for scalable Bayesian sampling. arXiv preprint arXiv:1805.11659, 2018a.

Chen, W. Y., Mackey, L., Gorham, J., Briol, F.-X., and Oates, C. J. Stein points. arXiv preprint arXiv:1803.10161, 2018b.

Chen, Y. and Li, W. Natural gradient in Wasserstein statistical manifold. arXiv preprint arXiv:1805.08380, 2018.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pp. 2292-2300, 2013.

Detommaso, G., Cui, T., Marzouk, Y., Spantini, A., and Scheichl, R. A Stein variational Newton method. In Advances in Neural Information Processing Systems, pp. 9187-9197, 2018.

Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R. D., and Neven, H. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pp. 3203-3211, 2014.

Do Carmo, M. P. Riemannian Geometry. 1992.

Ehlers, J., Pirani, F., and Schild, A. The geometry of free fall and light propagation. In General Relativity (Papers in Honour of J. L. Synge), pp. 63-84, 1972.
Erbar, M. et al. The heat equation on manifolds as a gradient flow in the Wasserstein space. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 46, pp. 1-23. Institut Henri Poincaré, 2010.

Futami, F., Cui, Z., Sato, I., and Sugiyama, M. Frank-Wolfe Stein sampling. arXiv preprint arXiv:1805.07912, 2018.

Gabay, D. Minimizing a differentiable function over a differential manifold. Journal of Optimization Theory and Applications, 37(2):177-219, 1982.

Geman, S. and Geman, D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. In Readings in Computer Vision, pp. 564-584. Elsevier, 1987.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303-1347, 2013.

Jordan, R., Kinderlehrer, D., and Otto, F. The variational formulation of the Fokker-Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1-17, 1998.

Kheyfets, A., Miller, W. A., and Newton, G. A. Schild's ladder parallel transport procedure for an arbitrary connection. International Journal of Theoretical Physics, 39(12):2891-2898, 2000.

Kováčik, O. and Rákosník, J. On spaces $L^{p(x)}$ and $W^{k,p(x)}$. Czechoslovak Mathematical Journal, 41(4):592-618, 1991.

Li, Y. and Turner, R. E. Gradient estimators for implicit models. arXiv preprint arXiv:1705.07107, 2017.

Liu, C. and Zhu, J. Riemannian Stein variational gradient descent for Bayesian inference. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Liu, Q. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pp. 3118-3126, 2017.

Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378-2386, 2016.

Liu, Y., Ramachandran, P., Liu, Q., and Peng, J. Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017a.

Liu, Y., Shang, F., Cheng, J., Cheng, H., and Jiao, L. Accelerated first-order methods for geodesically convex optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pp. 4875-4884, 2017b.

Lott, J. Some geometric calculations on Wasserstein space. Communications in Mathematical Physics, 277(2):423-437, 2008.

Lott, J. An intrinsic parallel transport in Wasserstein space. Proceedings of the American Mathematical Society, 145(12):5329-5340, 2017.

Neal, R. M. et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.

Nesterov, Y. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady, volume 27, pp. 372-376, 1983.

Otto, F. The geometry of dissipative evolution equations: the porous medium equation. 2001.

Patterson, S. and Teh, Y. W. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, pp. 3102-3110, 2013.

Pele, O. and Werman, M. Fast and robust earth mover's distances. In ICCV, volume 9, pp. 460-467, 2009.

Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1-17, 1964.

Pu, Y., Gan, Z., Henao, R., Li, C., Han, S., and Carin, L. VAE learning via Stein variational gradient descent. In Advances in Neural Information Processing Systems, pp.
4239-4248, 2017.

Qi, C., Gallivan, K. A., and Absil, P.-A. Riemannian BFGS algorithm with applications. In Recent Advances in Optimization and its Applications in Engineering, pp. 183-192. Springer, 2010.

Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

Roberts, G. O. and Stramer, O. Langevin diffusions and Metropolis-Hastings algorithms. Methodology and Computing in Applied Probability, 4(4):337-357, 2002.

Shi, J., Sun, S., and Zhu, J. A spectral approach to gradient estimation for implicit distributions. arXiv preprint arXiv:1806.02925, 2018.

Steinwart, I. and Christmann, A. Support vector machines. Springer Science & Business Media, 2008.

Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pp. 1139-1147, 2013.

Taghvaei, A. and Mehta, P. G. Accelerated gradient flow for probability distributions. ICLR 2019 submission, 2018. URL https://openreview.net/forum?id=HkxCEhAqtQ&noteId=H1lvDj2FCX.

Villani, C. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

Wainwright, M. J., Jordan, M. I., et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.

Wang, D. and Liu, Q. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681-688, 2011.

Wibisono, A., Wilson, A. C., and Jordan, M. I. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351-E7358, 2016.

Xie, Y., Wang, X., Wang, R., and Zha, H. A fast proximal point method for computing Wasserstein distance. arXiv preprint arXiv:1802.04307, 2018.

Yuan, X., Huang, W., Absil, P.-A., and Gallivan, K. A. A Riemannian limited-memory BFGS algorithm for computing the matrix geometric mean. Procedia Computer Science, 80:2147-2157, 2016.

Zhang, H. and Sra, S. An estimate sequence for geodesically convex optimization. In Conference On Learning Theory, pp. 1703-1723, 2018.

Zhang, H., Reddi, S. J., and Sra, S. Riemannian SVRG: fast stochastic optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pp. 4592-4600, 2016.

Zhang, R., Chen, C., Li, C., and Duke, L. C. Policy optimization as Wasserstein gradient flows. In ICML, 2018.

Zhang, R., Wen, Z., Chen, C., and Carin, L. Scalable Thompson sampling via optimal transport. In AISTATS, 2019.

Zhou, D.-X. Derivative reproducing properties for kernel methods in learning theory. Journal of Computational and Applied Mathematics, 220(1):456-463, 2008.

Zhuo, J., Liu, C., Shi, J., Zhu, J., Chen, N., and Zhang, B. Message passing Stein variational gradient descent. In Proceedings of the 35th International Conference on Machine Learning (ICML-18), pp. 6013-6022, 2018.