# Stein Variational Gradient Descent with Matrix-Valued Kernels

Dilin Wang*, Ziyang Tang, Chandrajit Bajaj, Qiang Liu
Department of Computer Science, UT Austin
{dilin, ztang, bajaj, lqiang}@cs.utexas.edu

*Equal contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

Stein variational gradient descent (SVGD) is a particle-based inference algorithm that leverages gradient information for efficient approximate inference. In this work, we enhance SVGD by leveraging preconditioning matrices, such as the Hessian and Fisher information matrix, to incorporate geometric information into SVGD updates. We achieve this by presenting a generalization of SVGD that replaces the scalar-valued kernels in vanilla SVGD with more general matrix-valued kernels. This yields a significant extension of SVGD, and more importantly, allows us to flexibly incorporate various preconditioning matrices to accelerate the exploration of the probability landscape. Empirical results show that our method outperforms vanilla SVGD and a variety of baseline approaches over a range of real-world Bayesian inference tasks.

1 Introduction

Approximate inference of intractable distributions is a central task in probabilistic learning and statistics. An efficient approximate inference algorithm must perform both efficient optimization, to explore the high-probability regions of the distributions of interest, and reliable uncertainty quantification, to evaluate the variation of the given distributions. Stein variational gradient descent (SVGD) (Liu & Wang, 2016) is a deterministic sampling algorithm that achieves both desiderata by optimizing the samples using a procedure similar to gradient-based optimization, while achieving reliable uncertainty estimation through an interacting repulsive mechanism. SVGD has been shown to provide a fast and flexible alternative to traditional methods such as Markov chain Monte Carlo (MCMC) (e.g., Neal et al., 2011; Hoffman & Gelman, 2014) and parametric variational inference (VI) (e.g., Wainwright et al., 2008; Blei et al., 2017) in various challenging applications (e.g., Pu et al., 2017; Wang & Liu, 2016; Kim et al., 2018; Haarnoja et al., 2017).

On the other hand, standard SVGD only uses first-order gradient information and cannot leverage the advantages of second-order methods, such as Newton's method and natural gradient, to achieve better performance on challenging problems with complex loss landscapes or domains. Unfortunately, due to the special form of SVGD, it is not straightforward to derive second-order extensions of SVGD by simply extending similar ideas from optimization. While this problem has recently been considered (e.g., Detommaso et al., 2018; Liu & Zhu, 2018; Chen et al., 2019), the proposed solutions either require heuristic approximations (Detommaso et al., 2018) or lead to complex algorithmic procedures that are difficult to implement in practice (Liu & Zhu, 2018).

Our solution to this problem is a key generalization of SVGD that replaces the original scalar-valued positive definite kernels in SVGD with a class of more general matrix-valued positive definite kernels. Our generalization includes all previous variants of SVGD (e.g., Wang et al., 2018; Han & Liu, 2018) as special cases. More significantly, it allows us to easily incorporate various structured preconditioning matrices into SVGD updates, including both Hessian and Fisher information matrices, as part of the generalized matrix-valued positive definite kernels.
We develop theoretical results that shed insight on the optimal design of matrix-valued kernels, and also propose simple and fast practical procedures. We empirically evaluate both Newton-type and Fisher-type extensions of SVGD on various practical benchmarks, including Bayesian neural network regression and sentence classification, on which our methods show significant improvement over vanilla SVGD and other baseline approaches.

Notation and Preliminaries  We use bold lower-case letters (e.g., x) for vectors in R^d, and bold upper-case letters (e.g., Q) for matrices. A symmetric function k : R^d × R^d → R is called a positive definite kernel if Σ_{ij} c_i k(x_i, x_j) c_j ≥ 0 for any {c_i} ⊂ R and {x_i} ⊂ R^d. Every positive definite kernel k(x, x′) is associated with a reproducing kernel Hilbert space (RKHS) H_k, which consists of the closure of functions of the form

f(x) = Σ_i c_i k(x, x_i),  ∀{c_i} ⊂ R, {x_i} ⊂ R^d,   (1)

for which the inner product and norm are defined by ⟨f, g⟩_{H_k} = Σ_{ij} c_i s_j k(x_i, x_j) and ‖f‖²_{H_k} = Σ_{ij} c_i c_j k(x_i, x_j), where we assume g(x) = Σ_i s_i k(x, x_i). Denote by H^d_k := H_k × · · · × H_k the vector-valued RKHS consisting of R^d-valued functions φ = [φ_1, . . . , φ_d]^⊤ with each φ_ℓ ∈ H_k. See, e.g., Berlinet & Thomas-Agnan (2011) for a more rigorous treatment. For notational convenience, we do not distinguish distributions on R^d from their density functions.

2 Stein Variational Gradient Descent (SVGD)

We introduce the basic derivation of Stein variational gradient descent (SVGD), which provides the foundation for our new generalization. See Liu & Wang (2016, 2018); Liu (2017) for more details.

Let p(x) be a positive and continuously differentiable probability density function on R^d. Our goal is to find a set of points (a.k.a. particles) {x_i}_{i=1}^n ⊂ R^d to approximate p, such that the empirical distribution q(x) = Σ_i δ(x − x_i)/n of the particles weakly converges to p when n is large. Here δ(·) denotes the Dirac delta function.

SVGD achieves this by starting from a set of initial particles and iteratively updating them with a deterministic transformation of the form

x_i ← x_i + ε φ*_k(x_i),  ∀i = 1, . . . , n,
φ*_k = arg max_{φ ∈ B_k} { −(d/dε) KL(q_{[εφ]} ‖ p) |_{ε=0} },   (2)

where ε is a small step size, φ*_k : R^d → R^d is an optimal transform function chosen to maximize the decreasing rate of the KL divergence between the distribution of the particles and the target p, q_{[εφ]} denotes the distribution of the updated particles x′ = x + εφ(x) with x ∼ q, and B_k is the unit ball of the RKHS H^d_k := H_k × · · · × H_k associated with a positive definite kernel k(x, x′), that is,

B_k = { φ ∈ H^d_k : ‖φ‖_{H^d_k} ≤ 1 }.   (3)

Liu & Wang (2016) showed that the objective in (2) can be expressed as a linear functional of φ,

−(d/dε) KL(q_{[εφ]} ‖ p) |_{ε=0} = E_{x∼q}[P^⊤φ(x)],  with  P^⊤φ(x) = ∇_x log p(x)^⊤ φ(x) + ∇_x^⊤ φ(x),   (4)

where P is a differential operator called the Stein operator; here we formally view P and the derivative operator ∇_x as R^d-valued column vectors, so that P^⊤φ and ∇_x^⊤φ are viewed as inner products, e.g., ∇_x^⊤φ = Σ_{ℓ=1}^d ∇_{x_ℓ} φ_ℓ, with x_ℓ and φ_ℓ being the ℓ-th coordinates of x and φ, respectively. With (4), it is shown in Liu & Wang (2016) that the solution of (2) is

φ*_k(·) ∝ E_{x∼q}[P k(x, ·)] = E_{x∼q}[∇_x log p(x) k(x, ·) + ∇_x k(x, ·)].   (5)

Intuitively, φ*_k provides the best update direction for the particles within the RKHS H^d_k. By taking q to be the empirical measure of the particles, i.e., q(x) = Σ_{i=1}^n δ(x − x_i)/n, and repeatedly applying this update to the particles, we obtain the SVGD algorithm defined by equations (2) and (5).
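To make the update explicit, the following is a minimal NumPy sketch of one vanilla SVGD iteration based on (5), using a Gaussian RBF kernel with a median-trick bandwidth in the spirit of Liu & Wang (2016). The function names, the particular bandwidth variant, and the small numerical constants are our own illustrative choices, not the authors' reference implementation.

```python
import numpy as np

def rbf_kernel(X, h=None):
    """Gaussian RBF kernel k(x, x') = exp(-||x - x'||^2 / h) on a set of particles X (n x d)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # (n, n)
    if h is None:
        # One common variant of the median trick for the bandwidth.
        h = np.median(sq_dists) / np.log(X.shape[0] + 1) + 1e-8
    K = np.exp(-sq_dists / h)
    # grad_K[i, j] = grad_{x_j} k(x_j, x_i) = (2/h) (x_i - x_j) k(x_i, x_j)
    grad_K = 2.0 / h * (X[:, None, :] - X[None, :, :]) * K[:, :, None]
    return K, grad_K

def svgd_step(X, score, stepsize=1e-2):
    """One SVGD update: x_i <- x_i + eps * mean_j [k(x_j, x_i) s(x_j) + grad_{x_j} k(x_j, x_i)]."""
    K, grad_K = rbf_kernel(X)
    S = score(X)                                    # (n, d) array of grad_x log p(x_j)
    phi = (K @ S + grad_K.sum(axis=1)) / X.shape[0]
    return X + stepsize * phi
```

Here `score(X)` returns the gradients ∇_x log p(x_j) stacked row-wise; repeatedly calling `svgd_step` implements the iterative update in (2).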
3 SVGD with Matrix-valued Kernels

Our goal is to extend SVGD to allow efficient incorporation of preconditioning information for better optimization. We achieve this by providing a generalization of SVGD that leverages more general matrix-valued kernels to flexibly incorporate preconditioning information. The key observation is that standard SVGD searches for the optimal φ in H^d_k = H_k × · · · × H_k, a product of d copies of an RKHS of scalar-valued functions, which does not allow us to encode potential correlations between different coordinates of φ. This limitation can be addressed by replacing H^d_k with a more general RKHS of vector-valued functions (called a vector-valued RKHS), which uses more flexible matrix-valued positive definite kernels to specify rich correlation structures between coordinates.

In this section, we first introduce the background on vector-valued RKHSs with matrix-valued kernels in Section 3.1, and then propose and discuss our generalization of SVGD using matrix-valued kernels in Sections 3.2-3.3.

3.1 Vector-Valued RKHS with Matrix-Valued Kernels

We now introduce the background on matrix-valued positive definite kernels, which provide the most general framework for specifying vector-valued RKHSs. We focus on the intuition and key ideas, and refer the reader to Alvarez et al. (2012); Carmeli et al. (2006) for a complete mathematical treatment.

Recall that a standard real-valued RKHS H_k consists of the closure of the linear span of its kernel function k(·, x), as shown in (1). A vector-valued RKHS can be defined in a similar way, but consists of the closure of the linear span of a matrix-valued kernel function:

f(x) = Σ_i K(x, x_i) c_i,   (6)

for any {c_i} ⊂ R^d and {x_i} ⊂ R^d, where K : R^d × R^d → R^{d×d} is now a matrix-valued kernel function, and the c_i are vector-valued weights. Similar to the scalar case, we can define an inner product ⟨f, g⟩_{H_K} = Σ_{ij} c_i^⊤ K(x_i, x_j) s_j, where we assume g(x) = Σ_i K(x, x_i) s_i, and hence a norm ‖f‖²_{H_K} = Σ_{ij} c_i^⊤ K(x_i, x_j) c_j. In order for the inner product and norm to be well defined, the matrix-valued kernel K is required to be symmetric, in that K(x, x′) = K(x′, x)^⊤, and positive definite, in that Σ_{ij} c_i^⊤ K(x_i, x_j) c_j ≥ 0 for any {x_i} ⊂ R^d and {c_i} ⊂ R^d. Mathematically, one can show that the closure of the set of functions in (6), equipped with the inner product defined above, defines an RKHS that we denote by H_K. It is reproducing because it satisfies the following reproducing property, which generalizes the scalar-valued version: for any f ∈ H_K and any c ∈ R^d, we have

f(x)^⊤ c = ⟨f(·), K(·, x) c⟩_{H_K},   (7)

where it is necessary to introduce c because the inner product of two functions must be a scalar. A simple example of a matrix-valued kernel is K(x, x′) = k(x, x′)I, where I is the d × d identity matrix. Its related RKHS is H_K = H_k × · · · × H_k = H^d_k, as used in the original SVGD.
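As a concrete illustration of (6) and the positive definiteness requirement, the short NumPy snippet below builds a separable matrix-valued kernel K(x, x′) = k(x, x′)A with a fixed symmetric positive definite matrix A (choosing A = I recovers H^d_k), evaluates a member of H_K, and checks that the block Gram matrix is positive semi-definite. The particular kernel and the matrix A are our own toy choices for illustration only.

```python
import numpy as np

def scalar_rbf(x, y, h=1.0):
    return np.exp(-np.sum((x - y) ** 2) / h)

def matrix_kernel(x, y, A, h=1.0):
    # K(x, y) = k(x, y) * A with A symmetric positive definite;
    # A = I recovers the kernel k(x, y) I used by vanilla SVGD.
    return scalar_rbf(x, y, h) * A

rng = np.random.default_rng(0)
d, m = 3, 5
A = np.eye(d) + 0.5 * np.ones((d, d))           # a fixed SPD "output" matrix
xs = rng.normal(size=(m, d))                    # points x_1, ..., x_m
cs = rng.normal(size=(m, d))                    # vector-valued weights c_i

# A member of the vector-valued RKHS H_K, as in (6): f(x) = sum_i K(x, x_i) c_i.
f = lambda x: sum(matrix_kernel(x, xi, A) @ ci for xi, ci in zip(xs, cs))

# Positive definiteness: the (m*d) x (m*d) block Gram matrix is positive semi-definite.
G = np.block([[matrix_kernel(xi, xj, A) for xj in xs] for xi in xs])
print(f(xs[0]).shape, np.all(np.linalg.eigvalsh(G) > -1e-8))
```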
3.2 SVGD with Matrix-Valued Kernels

It is now natural to leverage matrix-valued kernels to obtain a generalization of SVGD (see Algorithm 1). The idea is simple: we now optimize φ in the unit ball of a general vector-valued RKHS H_K with a matrix-valued kernel K(x, x′):

φ*_K = arg max_{φ ∈ H_K} { −(d/dε) KL(q_{[εφ]} ‖ p)|_{ε=0} },  s.t. ‖φ‖_{H_K} ≤ 1.   (8)

This yields a simple closed-form solution similar to (5).

Theorem 1. Let K(x, x′) be a matrix-valued positive definite kernel that is continuously differentiable in x and x′. The optimal φ in (8) is

φ*_K(·) ∝ E_{x∼q}[K(·, x) P] = E_{x∼q}[K(·, x) ∇_x log p(x) + K(·, x) ∇_x],   (9)

where the Stein operator P and the derivative operator ∇_x are again formally viewed as R^d-valued column vectors, and K(·, x)P and K(·, x)∇_x are interpreted by the matrix multiplication rule. Therefore, K(·, x)P is an R^d-valued column vector whose ℓ-th element is

(K(·, x)P)_ℓ = Σ_{m=1}^d ( K_{ℓm}(·, x) ∇_{x_m} log p(x) + ∇_{x_m} K_{ℓm}(·, x) ),   (10)

where K_{ℓm}(x, x′) denotes the (ℓ, m)-th element of the matrix K(x, x′) and x_m the m-th element of x.

Algorithm 1  Stein Variational Gradient Descent with Matrix-valued Kernels (Matrix SVGD)
Input: A (possibly unnormalized) differentiable density function p(x) on R^d; a matrix-valued positive definite kernel K(x, x′); step size ε.
Goal: Find a set of particles {x_i}_{i=1}^n to represent the distribution p.
Initialize a set of particles {x_i}_{i=1}^n, e.g., by drawing from some simple distribution.
repeat
  x_i ← x_i + (ε/n) Σ_{j=1}^n [ K(x_i, x_j) ∇_{x_j} log p(x_j) + K(x_i, x_j) ∇_{x_j} ],  ∀i = 1, . . . , n,
  where K(·, x)∇_x is formally defined as the product of the matrix K(·, x) and the vector ∇_x, whose ℓ-th element is (K(·, x)∇_x)_ℓ = Σ_{m=1}^d ∇_{x_m} K_{ℓm}(·, x); see also (10).
until convergence

Similar to the case of standard SVGD, recursively applying the optimal transform φ*_K to the particles yields the general procedure shown in Algorithm 1, which we call matrix SVGD. Parallel to vanilla SVGD, the matrix SVGD direction in (9) consists of two parts that account for optimization and diversity, respectively: the first part is a weighted average of the gradient ∇_x log p(x), multiplied by the matrix-valued kernel K(·, x); the second part consists of the gradient of the matrix-valued kernel K, which, as in standard SVGD, serves as a repulsive force that keeps the particles away from each other so as to reflect the uncertainty captured in the distribution p.

Matrix SVGD includes various previous variants of SVGD as special cases. Vanilla SVGD corresponds to the case K(x, x′) = k(x, x′)I, with I the d × d identity matrix; the gradient-free SVGD of Han & Liu (2018) can be viewed as the case K(x, x′) = k(x, x′)w(x)w(x′)I, where w(x) is an importance weight function; and the graphical SVGD of Wang et al. (2018); Zhuo et al. (2018) corresponds to a diagonal matrix-valued kernel K(x, x′) = diag[{k_ℓ(x, x′)}_{ℓ=1}^d], where each k_ℓ(x, x′) is a local scalar-valued kernel associated with the ℓ-th coordinate x_ℓ of x.

3.3 Matrix-Valued Kernels and Change of Variables

It is well known that preconditioned gradient descent can be interpreted as applying standard gradient descent to a reparameterization of the variables. For example, let y = Q^{1/2}x, where Q is a positive definite matrix, so that log p(x) can be viewed as a function of y via log p(Q^{−1/2}y). Applying gradient descent on y and transforming the update back to x yields the preconditioned gradient update x ← x + εQ^{−1}∇_x log p(x). We now extend this idea to SVGD, for which matrix-valued kernels show up naturally as a consequence of the change of variables. This justifies the use of matrix-valued kernels and provides guidance on their practical choice.

We start with a basic result on how matrix-valued kernels behave under a change of variables (see Paulsen & Raghupathi (2016)).

Lemma 2. Assume H_0 is an RKHS with a matrix-valued kernel K_0 : R^d × R^d → R^{d×d}. Let H be the set of functions of the form φ(x) = M(x)φ_0(t(x)), ∀φ_0 ∈ H_0, where M : R^d → R^{d×d} is a fixed matrix-valued function with M(x) invertible for all x, and t : R^d → R^d is a fixed continuously differentiable one-to-one transform on R^d. For any φ, φ′ ∈ H, we can identify unique φ_0, φ′_0 ∈ H_0 such that φ(x) = M(x)φ_0(t(x)) and φ′(x) = M(x)φ′_0(t(x)). Define the inner product on H via ⟨φ, φ′⟩_H = ⟨φ_0, φ′_0⟩_{H_0}. Then H is also a vector-valued RKHS, whose matrix-valued kernel is

K(x, x′) = M(x) K_0(t(x), t(x′)) M(x′)^⊤.
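A direct, if naive, implementation of the particle update in Algorithm 1 only needs the matrix-valued kernel and the "divergence" term in (10). The sketch below takes both as user-supplied callables; the O(n²) double loop and the helper for the identity-matrix kernel (which recovers vanilla SVGD) are our own illustrative choices under those assumptions.

```python
import numpy as np

def matrix_svgd_step(X, score, K, div_K, stepsize=1e-2):
    """One step of matrix SVGD (Algorithm 1), written naively for clarity.

    X      : (n, d) particles
    score  : X -> (n, d) array of grad_x log p(x_j)
    K      : (xi, xj) -> (d, d) matrix-valued kernel K(xi, xj)
    div_K  : (xi, xj) -> (d,) vector whose l-th entry is sum_m d/dx_{j,m} K_{l,m}(xi, xj),
             i.e. the "K(., x) grad_x" term in (9)-(10)
    """
    n, _ = X.shape
    S = score(X)
    phi = np.zeros_like(X)
    for i in range(n):
        for j in range(n):
            phi[i] += K(X[i], X[j]) @ S[j] + div_K(X[i], X[j])
    return X + stepsize * phi / n

def identity_rbf_kernel(h=1.0):
    """K(x, x') = k(x, x') I with an RBF k; plugging this in recovers vanilla SVGD."""
    k = lambda a, b: np.exp(-np.sum((a - b) ** 2) / h)
    K = lambda a, b: k(a, b) * np.eye(len(a))
    # sum_m d/db_m [k(a, b) I_{l,m}] = d/db_l k(a, b) = (2/h) (a_l - b_l) k(a, b)
    div_K = lambda a, b: 2.0 / h * (a - b) * k(a, b)
    return K, div_K
```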
We now present a key result, which characterizes how the kernel changes when we apply an invertible variable transform along the SVGD trajectory.

Theorem 3. i) Let p and q be two distributions, and let p_0 and q_0 be the distributions of x′ = t(x) when x is drawn from p and q, respectively, where t is a continuously differentiable one-to-one map on R^d. Assume p is a continuously differentiable density with Stein operator P, and let P_0 be the Stein operator of p_0. We have

E_{x∼q_0}[P_0^⊤ φ_0(x)] = E_{x∼q}[P^⊤ φ(x)],  with  φ(x) := ∇t(x)^{−1} φ_0(t(x)),   (11)

where ∇t is the Jacobian matrix of t.

ii) Therefore, in the asymptotics of an infinitesimal step size (ε → 0+), running SVGD with kernel K_0 on p_0 is equivalent to running SVGD on p with kernel

K(x, x′) = ∇t(x)^{−1} K_0(t(x), t(x′)) ∇t(x′)^{−⊤},

in the sense that the trajectories of the two SVGD procedures can be mapped to each other by the one-to-one map t (and its inverse).

3.4 Practical Choice of Matrix-Valued Kernels

Theorem 3 suggests a conceptual procedure for constructing matrix-valued kernels that incorporate desirable preconditioning information: construct a one-to-one map t so that the distribution p_0 of x′ = t(x) is easy to approximate with a simpler kernel K_0(x, x′), which can be a standard scalar-valued kernel or a simple diagonal kernel. Practical choices of t often amount to rotating x by either the Hessian matrix or the Fisher information matrix, which allows us to incorporate this information into SVGD. In the sequel, we first illustrate this idea in the simple Gaussian case and then discuss practical approaches for non-Gaussian cases.

Constant Preconditioning Matrices  Consider the simple case when p is multivariate Gaussian, e.g., log p(x) = −(1/2) x^⊤Qx + const, where Q is a positive definite matrix. In this case, the distribution p_0 of the transformed variable x′ = t(x) = Q^{1/2}x is the standard Gaussian distribution, which can be well approximated with a simpler kernel K_0(x, x′); this can be chosen to be the standard RBF kernel suggested in Liu & Wang (2016), the graphical kernel suggested in Wang et al. (2018), or the linear kernels suggested in Liu & Wang (2018). Theorem 3 then suggests using a kernel of the form

K_Q(x, x′) := Q^{−1/2} K_0(Q^{1/2}x, Q^{1/2}x′) Q^{−1/2},   (12)

in which Q is applied on both the input and the output side. As an example, taking K_0 to be the scalar-valued Gaussian RBF kernel (times the identity) gives

K_Q(x, x′) = Q^{−1} exp( −(1/2h) ‖x − x′‖²_Q ),   (13)

where ‖x − x′‖²_Q := (x − x′)^⊤ Q (x − x′) and h is a bandwidth parameter. Define K_{0,Q}(x, x′) := K_0(Q^{1/2}x, Q^{1/2}x′). One can show that the SVGD direction of the kernel in (12) equals

φ*_{K_Q}(·) = Q^{−1} E_{x∼q}[∇_x log p(x) K_{0,Q}(·, x) + ∇_x K_{0,Q}(·, x)] = Q^{−1} φ*_{K_{0,Q}}(·),   (14)

which is the SVGD direction of the kernel K_{0,Q}(x, x′), linearly transformed by the matrix Q^{−1}.

In practice, when p is non-Gaussian, we can construct Q by averaging over the particles. For example, denoting by H(x) = −∇²_x log p(x) the negative Hessian matrix at x, we can construct Q by

Q = (1/n) Σ_{i=1}^n H(x_i),   (15)

where {x_i}_{i=1}^n are the particles from the previous iteration. We may replace H with the Fisher information matrix to obtain a natural-gradient-like variant of SVGD.
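The following sketch puts (12)-(15) together: it averages (damped) negative Hessians over the current particles to form Q, runs a scalar-kernel SVGD step in the rotated coordinates y = Q^{1/2}x, and preconditions the result by Q^{−1} as in (14). The damping constant and the bandwidth heuristic are our own choices for numerical stability, not prescribed by the paper.

```python
import numpy as np
from scipy.linalg import sqrtm

def constant_precond_svgd_step(X, score, neg_hessian, stepsize=1e-2, h=None):
    """Matrix SVGD step with the constant-preconditioner kernel (12)-(13).

    neg_hessian : x -> (d, d) array, an estimate of -grad^2_x log p(x) (or a Fisher matrix).
    """
    n, d = X.shape
    # (15): average the curvature over the current particles; damp it so Q stays invertible.
    Q = np.mean([neg_hessian(x) for x in X], axis=0) + 1e-3 * np.eye(d)
    Q_sqrt = np.real(sqrtm(Q))
    Q_inv = np.linalg.inv(Q)

    # Scalar-kernel SVGD quantities in the rotated coordinates y = Q^{1/2} x.
    Y = X @ Q_sqrt
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)    # ||x_i - x_j||_Q^2
    if h is None:
        h = np.median(sq) / np.log(n + 1) + 1e-8
    K = np.exp(-sq / (2.0 * h))
    grad_K = (Y[:, None, :] - Y[None, :, :]) / h * K[:, :, None]  # grad_{y_j} k(y_j, y_i)

    # phi_{K_{0,Q}} in the original coordinates, then precondition by Q^{-1} as in (14).
    phi0 = (K @ score(X) + grad_K.sum(axis=1) @ Q_sqrt) / n
    return X + stepsize * phi0 @ Q_inv
```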
Point-wise Preconditioning  A constant preconditioning matrix cannot reflect different curvature or geometric information at different points. A simple heuristic to address this limitation is to replace the constant matrix Q with a point-wise matrix function Q(x); this motivates a kernel of the form

K(x, x′) = Q^{−1/2}(x) K_0(Q^{1/2}(x)x, Q^{1/2}(x′)x′) Q^{−1/2}(x′).

Unfortunately, this choice may yield expensive computation and a difficult implementation in practice, because the SVGD update involves taking the gradient of the kernel K(x, x′), which requires differentiating through the matrix-valued function Q(x). When Q(x) equals the Hessian matrix, for example, this involves taking third-order derivatives of log p(x), yielding an unnatural algorithm.

Mixture Preconditioning  We instead propose a more practical approach that achieves point-wise preconditioning with a much simpler algorithm. The idea is to use a weighted combination of several constant preconditioning matrices. This involves a set of anchor points {z_ℓ}_{ℓ=1}^m ⊂ R^d, each of which is associated with a preconditioning matrix Q_ℓ = Q(z_ℓ) (e.g., its Hessian or Fisher information matrix). In practice, the anchor points {z_ℓ}_{ℓ=1}^m can conveniently be set to be the same as the particles {x_i}_{i=1}^n. We then construct a kernel by

K(x, x′) = Σ_{ℓ=1}^m K_{Q_ℓ}(x, x′) w_ℓ(x) w_ℓ(x′),   (16)

where K_{Q_ℓ}(x, x′) is defined as in (12) or (13), and w_ℓ(x) is a positive scalar-valued function that decides the contribution of kernel K_{Q_ℓ} at point x. Here w_ℓ(x) should be viewed as a mixture probability, and hence should satisfy Σ_{ℓ=1}^m w_ℓ(x) = 1 for all x. In our empirical studies, we take w_ℓ(x) to be the Gaussian mixture probability defined by the anchor points:

w_ℓ(x) = N(x; z_ℓ, Q_ℓ^{−1}) / Σ_{ℓ′=1}^m N(x; z_{ℓ′}, Q_{ℓ′}^{−1}),   (17)

where N(x; z_ℓ, Q_ℓ^{−1}) denotes the Gaussian density with mean z_ℓ and covariance Q_ℓ^{−1}, whose normalization constant is Z_ℓ = (2π)^{d/2} det(Q_ℓ)^{−1/2}. In this way, each point x is mostly influenced by the anchor point closest to it, which allows us to apply different preconditioning at different points. Importantly, the SVGD update direction associated with the kernel in (16) has a simple and easy-to-implement form:

φ*_K(·) = Σ_{ℓ=1}^m w_ℓ(·) E_{x∼q}[ (w_ℓ(x) K_{Q_ℓ}(·, x)) P ] = Σ_{ℓ=1}^m w_ℓ(·) φ*_{w_ℓ K_{Q_ℓ}}(·),   (18)

which is a weighted sum of SVGD directions with constant preconditioning matrices (but with the asymmetric kernel w_ℓ(x) K_{Q_ℓ}(·, x)).

A Remark on Stein Variational Newton (SVN)  Detommaso et al. (2018) provided a Newton-like variant of SVGD. It is motivated by an intractable functional Newton framework, and arrives at a practical algorithm through a series of approximation steps. The update of SVN is

x_i ← x_i + ε H_i^{−1} φ*_k(x_i),  ∀i = 1, . . . , n,   (19)

where φ*_k(·) is the standard SVGD gradient, and H_i is a Hessian-like matrix associated with particle x_i, defined by

H_i = E_{x∼q}[ H(x) k(x, x_i)² + (∇_{x_i} k(x, x_i))^{⊗2} ],

where H(x) = −∇²_x log p(x) and w^{⊗2} := ww^⊤. Due to the approximations introduced in its derivation, SVN does not correspond to a standard functional gradient flow like SVGD (unless H_i = Q for all i, in which case it reduces to applying a constant preconditioning matrix to SVGD as in (14)). SVN can be heuristically viewed as a hard variant of (18), which assigns each particle its own preconditioning matrix with probability one, although the mathematical forms do not match precisely. On the other hand, it is useful to note that the set of fixed points of the SVN update (19) is identical to that of the standard SVGD update with φ*_k(·), provided that all H_i are positive definite. This is because at a fixed point of (19) we have H_i^{−1} φ*_k(x_i) = 0 for i = 1, . . . , n, which is equivalent to φ*_k(x_i) = 0, ∀i, when all the H_i are positive definite. Therefore, SVN can be justified as an alternative fixed-point iteration that targets the same set of fixed points as standard SVGD.
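For completeness, here is a rough NumPy sketch of one particle update under the mixture kernel (16), with the Gaussian mixture weights (17) and the RBF-based kernel (13); following the paper, the anchors can simply be taken to be the particles themselves. The bandwidth, any damping of the Q_ℓ, and all function names are our own choices, and the gradient of the mixture weights is written out explicitly rather than obtained by automatic differentiation.

```python
import numpy as np

def mixture_svgd_step(X, score, Qs, Z, h=1.0, stepsize=1e-2):
    """Matrix SVGD step with the mixture kernel (16) and update (18), a sketch.

    Qs : list of m SPD preconditioners Q_l (d x d), one per anchor point in Z (m x d).
    """
    n, d = X.shape
    m = len(Qs)
    S = score(X)                                          # (n, d) scores grad log p(x_j)
    Q_invs = [np.linalg.inv(Q) for Q in Qs]

    # Mixture weights w_l(x_j) and their gradients, from (17).
    logN = np.zeros((n, m))
    for l, (z, Q) in enumerate(zip(Z, Qs)):
        diff = X - z
        logN[:, l] = 0.5 * np.linalg.slogdet(Q)[1] - 0.5 * np.einsum('nd,de,ne->n', diff, Q, diff)
    W = np.exp(logN - logN.max(axis=1, keepdims=True))
    W /= W.sum(axis=1, keepdims=True)                     # (n, m), rows sum to 1
    gradlogN = np.stack([-(X - z) @ Q for z, Q in zip(Z, Qs)], axis=1)          # (n, m, d)
    gradW = W[..., None] * (gradlogN - np.einsum('nm,nmd->nd', W, gradlogN)[:, None, :])

    phi = np.zeros_like(X)
    for l in range(m):
        diff = X[:, None, :] - X[None, :, :]              # x_i - x_j
        k = np.exp(-0.5 / h * np.einsum('ijd,de,ije->ij', diff, Qs[l], diff))
        wk = W[None, :, l] * k                            # w_l(x_j) k_l(x_i, x_j)
        drift = (wk @ S) @ Q_invs[l]                      # w_l K_{Q_l} grad log p
        repulse = np.einsum('ij,ijd->id', wk, diff) / h   # w_l * div_{x_j} K_{Q_l}
        wgrad = (k @ gradW[:, l, :]) @ Q_invs[l]          # K_{Q_l} grad_{x_j} w_l
        phi += W[:, l, None] * (drift + repulse + wgrad)
    return X + stepsize * phi / n
```

Setting m = 1 (so that w_1 ≡ 1 and its gradient vanishes) recovers the constant-preconditioning update sketched above.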
Figure 1: Panels (a)-(e) show the particles obtained by the various methods at the 30th iteration. Panel (f) plots the log MMD (Gretton et al., 2012) vs. training iteration, starting from the 10th iteration. We use the standard RBF kernel for evaluating MMD.

4 Experiments

We demonstrate the effectiveness of matrix SVGD on various practical tasks. We start with a toy example and then proceed to more challenging tasks involving logistic regression, neural networks, and recurrent neural networks. For our method, we take the preconditioning matrices to be either Hessian or Fisher information matrices, depending on the application. For large-scale Fisher matrices in (recurrent) neural networks, we leverage the Kronecker-factored (KFAC) approximation of Martens & Grosse (2015); Martens et al. (2018) to enable efficient computation. We use the RBF kernel for vanilla SVGD. The kernel K_0(x, x′) in matrix SVGD (see (12) and (13)) is also taken to be a Gaussian RBF kernel. Following Liu & Wang (2016), we choose the bandwidth of the Gaussian RBF kernels using the standard median trick, and use Adagrad (Duchi et al., 2011) for the step size. Our code is available at https://github.com/dilinwang820/matrix_svgd.

The algorithms we test are summarized here:
- Vanilla SVGD, using the code of Liu & Wang (2016);
- Matrix-SVGD (average), using the constant-preconditioning-matrix kernel in (13), with Q taken to be the average of the Hessian or Fisher matrices of the particles (e.g., (15));
- Matrix-SVGD (mixture), using the mixture-preconditioning kernel in (16), where we pick the anchor points to be the particles themselves, that is, {z_ℓ}_{ℓ=1}^m = {x_i}_{i=1}^n;
- Stein variational Newton (SVN), based on the implementation of Detommaso et al. (2018);
- Preconditioned stochastic gradient Langevin dynamics (pSGLD) (Li et al., 2016), a variant of SGLD that uses a diagonal approximation of the Fisher information as the preconditioning matrix.

4.1 Two-Dimensional Toy Examples

Settings  We start by illustrating our method on a Gaussian mixture toy model (Figure 1), with exact Hessian matrices for preconditioning. For a fair comparison, we exhaustively search for the best learning rate for all algorithms. We use 50 particles in all cases, and the same initialization (with the same random seeds) for all methods.

Results  Figure 1 shows the results on the 2D toy examples; Appendix B provides more visualizations and results on additional examples. Methods that use Hessian information generally converge faster than vanilla SVGD, and Matrix-SVGD (mixture) yields the best performance.

4.2 Bayesian Logistic Regression

Settings  We consider the Bayesian logistic regression model for binary classification. Let D = {(x_j, y_j)}_{j=1}^N be a dataset with feature vectors x_j and binary labels y_j ∈ {0, 1}. The distribution of interest is p(θ | D) ∝ p(D | θ) p_0(θ), with

p(D | θ) = Π_{j=1}^N [ y_j σ(θ^⊤x_j) + (1 − y_j) σ(−θ^⊤x_j) ],

where σ(z) := 1/(1 + exp(−z)), and p_0(θ) is the prior distribution, which we set to be the standard normal N(θ; 0, I). The goal is to approximate the posterior distribution p(θ | D) with a set of particles {θ_i}_{i=1}^n, and then use them to predict the class labels of testing data points.
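For the logistic regression experiment, the quantities needed by the preconditioned updates are the per-particle score and a Fisher information matrix. The sketch below computes both on a mini-batch; rescaling the mini-batch terms by N/|B| and adding the prior's identity contribution to the Fisher matrix are our own (standard but unstated in the paper) choices.

```python
import numpy as np

def logistic_score_and_fisher(theta, X, y, N):
    """Mini-batch score and Fisher matrix for Bayesian logistic regression with a
    standard normal prior; a sketch with our own scaling conventions.

    theta : (d,) particle,  X : (b, d) mini-batch features,  y : (b,) labels in {0, 1},
    N     : total dataset size (used to rescale the mini-batch likelihood terms).
    """
    p = 1.0 / (1.0 + np.exp(-X @ theta))                  # sigma(theta^T x_j)
    scale = N / X.shape[0]
    score = -theta + scale * X.T @ (y - p)                # grad_theta log p(theta | D)
    fisher = np.eye(len(theta)) + scale * (X * (p * (1 - p))[:, None]).T @ X
    return score, fisher
```

The resulting per-particle Fisher matrices can be averaged over particles as in (15) for Matrix-SVGD (average), or attached to each anchor point for Matrix-SVGD (mixture).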
Figure 2: (a)-(b) Results of Bayesian logistic regression on the Covtype dataset. (c)-(d) Average test RMSE and log-likelihood vs. training batches on the Protein dataset for Bayesian neural network regression.

Test RMSE
| Dataset  | pSGLD        | Vanilla SVGD | Matrix-SVGD (average) | Matrix-SVGD (mixture) |
| Boston   | 2.699 ± 0.155 | 2.785 ± 0.169 | 2.898 ± 0.184 | 2.717 ± 0.166 |
| Concrete | 5.053 ± 0.124 | 5.027 ± 0.116 | 4.869 ± 0.124 | 4.721 ± 0.111 |
| Energy   | 0.985 ± 0.024 | 0.889 ± 0.024 | 0.795 ± 0.025 | 0.868 ± 0.025 |
| Kin8nm   | 0.091 ± 0.001 | 0.093 ± 0.001 | 0.092 ± 0.001 | 0.090 ± 0.001 |
| Naval    | 0.002 ± 0.000 | 0.004 ± 0.000 | 0.001 ± 0.000 | 0.000 ± 0.000 |
| Combined | 4.042 ± 0.034 | 4.088 ± 0.033 | 4.056 ± 0.033 | 4.029 ± 0.033 |
| Wine     | 0.641 ± 0.009 | 0.645 ± 0.009 | 0.637 ± 0.008 | 0.637 ± 0.009 |
| Protein  | 4.300 ± 0.018 | 4.186 ± 0.017 | 3.997 ± 0.018 | 3.852 ± 0.014 |
| Year     | 8.630 ± 0.007 | 8.686 ± 0.010 | 8.637 ± 0.005 | 8.594 ± 0.009 |

Test Log-Likelihood
| Dataset  | pSGLD         | Vanilla SVGD  | Matrix-SVGD (average) | Matrix-SVGD (mixture) |
| Boston   | -2.847 ± 0.182 | -2.706 ± 0.158 | -2.669 ± 0.141 | -2.861 ± 0.207 |
| Concrete | -3.206 ± 0.056 | -3.064 ± 0.034 | -3.150 ± 0.054 | -3.207 ± 0.071 |
| Energy   | -1.395 ± 0.029 | -1.315 ± 0.020 | -1.135 ± 0.026 | -1.249 ± 0.036 |
| Kin8nm   |  0.973 ± 0.010 |  0.964 ± 0.012 |  0.956 ± 0.011 |  0.975 ± 0.011 |
| Naval    |  4.535 ± 0.093 |  4.312 ± 0.087 |  5.383 ± 0.081 |  5.639 ± 0.048 |
| Combined | -2.821 ± 0.009 | -2.832 ± 0.009 | -2.824 ± 0.009 | -2.817 ± 0.009 |
| Wine     | -0.984 ± 0.016 | -0.997 ± 0.019 | -0.980 ± 0.016 | -0.988 ± 0.018 |
| Protein  | -2.874 ± 0.004 | -2.846 ± 0.003 | -2.796 ± 0.004 | -2.755 ± 0.003 |
| Year     | -3.568 ± 0.002 | -3.577 ± 0.002 | -3.569 ± 0.001 | -3.561 ± 0.002 |

Table 1: Average test RMSE and log-likelihood on the test data for the UCI regression benchmarks.

We compare our methods with preconditioned stochastic gradient Langevin dynamics (pSGLD) (Li et al., 2016). Because pSGLD is a sequential algorithm, for a fair comparison we obtain its samples by running n parallel chains of pSGLD. The preconditioning matrix in both pSGLD and matrix SVGD is taken to be the Fisher information matrix. We consider the binary Covtype² dataset with 581,012 data points and 54 features. We partition the data into 70% for training, 10% for validation, and 20% for testing. We use the Adagrad optimizer with a mini-batch size of 256, and choose the best learning rate from [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0] for each method on the validation set. For all experiments and algorithms, we use n = 20 particles. Results are averaged over 20 random trials.

Results  Figure 2 (a) and (b) show the test accuracy and test log-likelihood of the different algorithms. Both Matrix-SVGD (average) and Matrix-SVGD (mixture) converge much faster than vanilla SVGD and pSGLD, reaching an accuracy of 0.75 in fewer than 500 iterations.

4.3 Neural Network Regression

Settings  We apply matrix SVGD to Bayesian neural network regression on UCI datasets. For all experiments, we use a two-layer neural network with 50 hidden units and ReLU activation functions, and assign isotropic Gaussian priors to the network weights. All datasets³ are randomly partitioned into 90% for training and 10% for testing. All results are averaged over 20 random trials, except for Protein and Year, on which 5 random trials are performed. We use n = 10 particles for all methods. We use the Adam optimizer with a mini-batch size of 100; for the large Year dataset, we set the mini-batch size to 1000. We use the Fisher information matrix with the Kronecker-factored (KFAC) approximation for preconditioning.
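To make the Bayesian neural network setup concrete, the sketch below computes the log-posterior gradient of a two-layer ReLU regression network for a single particle whose weights are flattened into one vector. For brevity it fixes the observation noise and omits the KFAC Fisher approximation used in the paper; the parameter packing, the fixed noise level, and the mini-batch rescaling are our own simplifications.

```python
import numpy as np

def bnn_log_posterior_grad(theta, Xb, yb, N, n_hidden=50, sigma=0.1, prior_var=1.0):
    """Score of a two-layer ReLU regression network for one particle; a rough sketch
    that fixes the observation noise sigma (the paper's exact noise/prior treatment
    may differ). theta packs W1 (d x h), b1 (h,), w2 (h,), b2 () in that order."""
    d = Xb.shape[1]
    i = 0
    W1 = theta[i:i + d * n_hidden].reshape(d, n_hidden); i += d * n_hidden
    b1 = theta[i:i + n_hidden]; i += n_hidden
    w2 = theta[i:i + n_hidden]; i += n_hidden
    b2 = theta[i]

    A = Xb @ W1 + b1                        # pre-activations
    H = np.maximum(A, 0.0)                  # ReLU features
    pred = H @ w2 + b2
    err = (yb - pred) / sigma ** 2          # d log-likelihood / d pred
    scale = N / Xb.shape[0]                 # mini-batch rescaling

    # Backpropagate the Gaussian log-likelihood by hand.
    g_w2 = scale * H.T @ err
    g_b2 = scale * err.sum()
    dA = np.outer(err, w2) * (A > 0)
    g_W1 = scale * Xb.T @ dA
    g_b1 = scale * dA.sum(axis=0)

    grad = np.concatenate([g_W1.ravel(), g_b1, g_w2, [g_b2]])
    return grad - theta / prior_var         # add the isotropic Gaussian prior term
```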
Results  Table 1 shows the performance in terms of test RMSE and test log-likelihood. Both Matrix-SVGD (average) and Matrix-SVGD (mixture), which use second-order information, achieve better performance than vanilla SVGD. Matrix-SVGD (mixture) yields the best performance for both test RMSE and test log-likelihood in most cases. Figure 2 (c)-(d) show that both variants of Matrix-SVGD converge much faster than vanilla SVGD and pSGLD on the Protein dataset.

²https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
³https://archive.ics.uci.edu/ml/datasets.php

4.4 Sentence Classification with Recurrent Neural Networks (RNN)

Settings  We consider the sentence classification task on four datasets: MR (Pang & Lee, 2005), CR (Hu & Liu, 2004), SUBJ (Pang & Lee, 2004), and MPQA (Wiebe et al., 2005).

| Method                | MR    | CR    | SUBJ | MPQA  |
| SGLD                  | 20.52 | 18.65 | 7.66 | 11.24 |
| pSGLD                 | 19.75 | 17.50 | 6.99 | 10.80 |
| Vanilla SVGD          | 19.73 | 18.07 | 6.67 | 10.58 |
| Matrix-SVGD (average) | 19.22 | 17.29 | 6.76 | 10.79 |
| Matrix-SVGD (mixture) | 19.09 | 17.13 | 6.59 | 10.71 |

Table 2: Sentence classification errors measured on four benchmarks.

We use a recurrent neural network (RNN) based model, p(y | x) = softmax(w_y^⊤ h_RNN(x, v)), where x is the input sentence, y is a discrete-valued label of the sentence, and w_y is a weight coefficient associated with label class y; h_RNN(x, v) is an RNN encoder with parameters v, implemented as a one-layer bidirectional GRU (Cho et al., 2014) with 50 hidden units. We apply matrix SVGD to infer the posterior of w = {w_y : ∀y}, while updating the RNN weights v using standard stochastic gradient descent. In all experiments, we use the pre-processed text data provided by Gan et al. (2016). For all datasets, we conduct 10-fold cross-validation for evaluation. We use n = 10 particles for all methods. For training, we use a mini-batch size of 50 and run all algorithms for 20 epochs with early stopping. We use the Fisher information matrix for preconditioning.

Results  Table 2 shows the test classification errors. Matrix-SVGD (mixture) generally performs the best among all algorithms.

5 Conclusion

We present a generalization of SVGD that leverages general matrix-valued positive definite kernels, which allows us to flexibly incorporate various preconditioning matrices, including Hessian and Fisher information matrices, to improve exploration of the probability landscape. We evaluate our practical algorithms on a range of real-world tasks and demonstrate their efficiency compared to a variety of existing methods.

Acknowledgements

This work is supported in part by NSF CRII 1830161 and NSF CAREER 1846421. We would like to acknowledge Google Cloud and Amazon Web Services (AWS) for their support.
References

Alvarez, M. A., Rosasco, L., Lawrence, N. D., et al. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195-266, 2012.

Berlinet, A. and Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, 2011.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.

Carmeli, C., De Vito, E., and Toigo, A. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Analysis and Applications, 4(04):377-408, 2006.

Chen, W. Y., Barp, A., Briol, F.-X., Gorham, J., Girolami, M., Mackey, L., Oates, C., et al. Stein point Markov chain Monte Carlo. In International Conference on Machine Learning (ICML), 2019.

Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

Detommaso, G., Cui, T., Marzouk, Y., Spantini, A., and Scheichl, R. A Stein variational Newton method. In Advances in Neural Information Processing Systems, pp. 9187-9197, 2018.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.

Gan, Z., Li, C., Chen, C., Pu, Y., Su, Q., and Carin, L. Scalable Bayesian learning of recurrent neural networks for language modeling. In ACL, 2016.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723-773, 2012.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.

Han, J. and Liu, Q. Stein variational gradient descent without gradient. In International Conference on Machine Learning, 2018.

Hoffman, M. D. and Gelman, A. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593-1623, 2014.

Hu, M. and Liu, B. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168-177. ACM, 2004.

Kim, T., Yoon, J., Dia, O., Kim, S., Bengio, Y., and Ahn, S. Bayesian model-agnostic meta-learning. arXiv preprint arXiv:1806.03836, 2018.

Li, C., Chen, C., Carlson, D. E., and Carin, L. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In AAAI, 2016.

Liu, C. and Zhu, J. Riemannian Stein variational gradient descent for Bayesian inference. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Liu, Q. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pp. 3115-3123, 2017.

Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378-2386, 2016.

Liu, Q. and Wang, D. Stein variational gradient descent as moment matching. In Advances in Neural Information Processing Systems, pp. 8867-8876, 2018.

Martens, J. and Grosse, R. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408-2417, 2015.

Martens, J., Ba, J., and Johnson, M. Kronecker-factored curvature approximations for recurrent neural networks. In ICLR, 2018.

Neal, R. M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2011.

Pang, B. and Lee, L. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, pp. 271. Association for Computational Linguistics, 2004.

Pang, B. and Lee, L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 115-124. Association for Computational Linguistics, 2005.
Paulsen, V. I. and Raghupathi, M. An Introduction to the Theory of Reproducing Kernel Hilbert Spaces, volume 152. Cambridge University Press, 2016.

Pu, Y., Gan, Z., Henao, R., Li, C., Han, S., and Carin, L. VAE learning via Stein variational gradient descent. In Advances in Neural Information Processing Systems, pp. 4236-4245, 2017.

Wainwright, M. J., Jordan, M. I., et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.

Wang, D. and Liu, Q. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.

Wang, D., Zeng, Z., and Liu, Q. Stein variational message passing for continuous graphical models. In International Conference on Machine Learning, pp. 5206-5214, 2018.

Wiebe, J., Wilson, T., and Cardie, C. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165-210, 2005.

Zhuo, J., Liu, C., Shi, J., Zhu, J., Chen, N., and Zhang, B. Message passing Stein variational gradient descent. In International Conference on Machine Learning, pp. 6013-6022, 2018.