# Message Passing Stein Variational Gradient Descent

Jingwei Zhuo, Chang Liu, Jiaxin Shi, Jun Zhu, Ning Chen, Bo Zhang
Dept. of Comp. Sci. & Tech., BNRist Center, State Key Lab for Intell. Tech. & Sys., THBI Lab, Tsinghua University, Beijing, 100084, China. Correspondence to: Jingwei Zhuo, Jun Zhu.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018.

Abstract

Stein variational gradient descent (SVGD) is a recently proposed particle-based Bayesian inference method, which has attracted a lot of interest due to its remarkable approximation ability and particle efficiency compared to traditional variational inference and Markov chain Monte Carlo methods. However, we observe that particles of SVGD tend to collapse to modes of the target distribution, and this particle degeneracy phenomenon becomes more severe in higher dimensions. Our theoretical analysis shows that there is a negative correlation between the dimensionality and the repulsive force of SVGD, which is responsible for this phenomenon. We propose Message Passing SVGD (MP-SVGD) to solve this problem. By leveraging the conditional independence structure of probabilistic graphical models (PGMs), MP-SVGD converts the original high-dimensional global inference problem into a set of local ones over the Markov blanket with lower dimensions. Experimental results show its advantage of preventing vanishing repulsive force in high-dimensional space over SVGD, and its particle efficiency and approximation flexibility over other inference methods on graphical models.

1. Introduction

Stein variational gradient descent (SVGD) (Liu & Wang, 2016) is a recently proposed inference method. To approximate an intractable but differentiable target distribution, it iteratively moves a set of particles along the optimal gradient direction in a vector-valued reproducing kernel Hilbert space (RKHS) towards minimizing the KL divergence. SVGD does not confine the approximation within parametric families as commonly done in traditional variational inference (VI) methods. Besides, SVGD is more particle efficient than traditional Markov chain Monte Carlo (MCMC) methods: it generates diverse particles due to the deterministic repulsive force induced by kernels instead of Monte Carlo randomness. These benefits make SVGD an appealing method that has gained a lot of interest (Pu et al., 2017; Haarnoja et al., 2017; Liu et al., 2017; Feng et al., 2017).

As a kernel-based method, the performance of SVGD relies on the choice of kernels and the corresponding RKHS. Previous work uses an isotropic vector-valued RKHS with a kernel defined by some distance metric (e.g., the Euclidean distance) over all the dimensions. Examples include the RBF kernel (Liu & Wang, 2016) and the IMQ kernel (Gorham & Mackey, 2017). However, as discussed in Aggarwal et al. (2001) and Ramdas et al. (2015), distance metrics and the corresponding kernels suffer from the curse of dimensionality. Thus, a natural question is: is the performance of SVGD also affected by the dimensionality? We observe that the dimensionality negatively affects the performance of SVGD: its particles tend to collapse to modes and this phenomenon becomes more severe in higher dimensions.
To understand this phenomenon, we analyze the impact of dimensionality on the repulsive force, which is critical for SVGD to work as an inference method that minimizes the KL divergence. Through theoretical analysis and experimental verification, we attribute the phenomenon, at least partially, to a negative correlation between the repulsive force and the dimensionality that holds under certain assumptions on the variational distribution. Our analysis takes an initial step towards understanding the non-asymptotic behavior of SVGD with a finite number of particles, which is important since inferring high-dimensional distributions with limited computational and storage resources is common in practice.

We propose Message Passing SVGD (MP-SVGD) to solve this problem when the target distribution is compactly described by a probabilistic graphical model (PGM), so that the conditional independence structure can be leveraged. MP-SVGD converts the original high-dimensional inference problem into a set of local ones with lower dimensions according to a decomposition of the KL divergence, and solves each local problem iteratively in an RKHS with a local kernel defined over the Markov blanket. Experimental results on both synthetic and real-world settings demonstrate the advantages of MP-SVGD over SVGD and other inference methods on graphical models.

Related work. The idea of converting a global inference problem into several local ones is not new. Traditional methods such as (loopy) belief propagation (BP) (Pearl, 1988), expectation propagation (EP) (Minka, 2001) and variational message passing (VMP) (Winn & Bishop, 2005) all share this spirit. However, VMP makes strong mean-field and conjugate exponential-family assumptions, EP requires an exponential-family approximation, and loopy BP does not guarantee $q$ to be a globally valid distribution, as it relaxes the solution of the marginals to lie in an outer bound of the marginal polytope (Wainwright & Jordan, 2008). Moreover, loopy BP requires further approximations of the messages to handle complex potentials, which restricts its expressive power. For example, Nonparametric BP (NBP) (Sudderth et al., 2003) approximates the messages with mixtures of Gaussians; Particle BP (PBP) (Ihler & McAllester, 2009) approximates the messages by importance sampling with either the local potential or the estimated beliefs as proposals; and Expectation Particle BP (EPBP) (Lienart et al., 2015) extends PBP with adaptive proposals produced by EP. Another drawback of loopy BP and its variants is that, except in some special cases where the beliefs are tractable (e.g., Gaussian BP), numerical integration is required when using the beliefs in subsequent tasks such as evaluating expectations of test functions. On the other hand, MCMC methods like Gibbs sampling avoid these problems since expectations can be estimated directly from samples. However, Gibbs sampling can only be used when the conditional distributions can be sampled efficiently (e.g., Martens & Sutskever (2010)). Compared to the aforementioned methods, MP-SVGD is more appealing since it requires neither tractable conditional distributions nor restrictions on the potentials, which makes it suitable as a general-purpose inference tool for graphical models with differentiable densities. Finally, we note that the idea of improving SVGD over graphical models by leveraging the conditional independence property was developed concurrently and independently by Wang et al. (2017).
The difference between their work and ours lies in the derivation of the method and the implications that are explored. Wang et al. (2017) also observed the particle degeneracy phenomenon of SVGD and proposed a similar method called Graphical SVGD by introducing graph-structured kernels and the corresponding Kernelized Stein Discrepancy (KSD). In contrast, MP-SVGD is derived from a decomposition of the KL divergence. Moreover, we develop a theoretical explanation for the particle degeneracy phenomenon by analyzing the relation between the dimensionality and the repulsive force.

2. Preliminaries

Given an intractable distribution $p(x)$ with $x = [x_1, \ldots, x_D] \in \mathcal{X} \subseteq \mathbb{R}^D$, variational inference aims to find a tractable distribution $q(x)$ supported on $\mathcal{X}$ to approximate $p(x)$ by minimizing some divergence, e.g., the (exclusive) KL divergence $\mathrm{KL}(q\,\|\,p)$. Instead of imposing a parametric assumption on $q(x)$, Stein variational gradient descent (SVGD) (Liu & Wang, 2016) constructs $q(x)$ from some initial distribution $q_0(x)$ via a sequence of density transformations induced by the transformation of the random variable $T(x) = x + \epsilon\phi(x)$, where $\epsilon$ is the step size and $\phi(\cdot): \mathcal{X} \to \mathbb{R}^D$ denotes the transformation direction. To be tractable and flexible, $\phi$ is restricted to a vector-valued reproducing kernel Hilbert space (RKHS) $\mathcal{H}^D = \mathcal{H}_0 \times \cdots \times \mathcal{H}_0$, where $\mathcal{H}_0$ is the scalar-valued RKHS of a kernel $k(\cdot, \cdot)$ that is chosen to be positive definite and in the Stein class of $p$ (Liu et al., 2016). Examples include the RBF kernel $k(x, y) = \exp\!\big(-\|x-y\|_2^2/(2h)\big)$ (Liu & Wang, 2016) and the IMQ kernel $k(x, y) = 1/\sqrt{1 + \|x-y\|_2^2/(2h)}$ (Gorham & Mackey, 2017), where the bandwidth $h$ is commonly chosen according to the median heuristic (Schölkopf & Smola, 2001)¹.

¹ $h = \mathrm{med}^2$, where $\mathrm{med}$ is the median of the pairwise distances $\|x - y\|_2$, $x, y \sim q$.

Now, let $q_{[T]}$ denote the density of the transformed random variable $T(x) = x + \epsilon\phi(x)$, where $x \sim q$ and $\epsilon$ is small enough so that $T$ is invertible. Under this notation, we have

$$\min_{\|\phi\|_{\mathcal{H}^D}\le 1} \nabla_\epsilon \mathrm{KL}(q_{[T]}\,\|\,p)\big|_{\epsilon=0} = -\max_{\|\phi\|_{\mathcal{H}^D}\le 1} \mathbb{E}_{x\sim q}[\mathcal{A}_p\phi(x)], \qquad (1)$$

where $\mathcal{A}_p$ is the Stein operator and $\mathcal{A}_p\phi(x) = \phi(x)^\top \nabla_x \log p(x) + \mathrm{trace}(\nabla_x\phi(x))$. As shown in Liu et al. (2016) and Chwialkowski et al. (2016), the right-hand side of Eq. (1) has a closed-form solution $\phi^*/\|\phi^*\|_{\mathcal{H}^D}$, where

$$\phi^*(x) = \mathbb{E}_{y\sim q}\big[k(x, y)\,\nabla_y \log p(y) + \nabla_y k(x, y)\big]. \qquad (2)$$

$\phi^*(x)$ consists of two parts: the kernel-smoothed gradient $G(x; p, q) = \mathbb{E}_{y\sim q}[k(x, y)\,\nabla_y \log p(y)]$ and the repulsive force $R(x; q) = \mathbb{E}_{y\sim q}[\nabla_y k(x, y)]$. By applying the transformation $x \leftarrow x + \epsilon\phi^*(x)$ iteratively, $q_{[T]}$ decreases the KL divergence along the steepest direction in $\mathcal{H}^D$. The iteration ends when $\phi^*(x) \equiv 0$, at which point $T$ reduces to the identity mapping. This condition is equivalent to $q = p$ when $k(x, y)$ is strictly positive definite in a proper sense (Liu et al., 2016; Chwialkowski et al., 2016).

In practice, a set of particles $\{x^{(i)}\}_{i=1}^M$ is used to represent $q(x)$ by the empirical distribution $\hat q_M(x) = \frac{1}{M}\sum_{i=1}^M \delta_{x^{(i)}}(x)$, where $\delta$ is the Dirac delta function. These particles are updated iteratively via $x^{(i)} \leftarrow x^{(i)} + \epsilon\,\hat\phi^*(x^{(i)})$, where

$$\hat\phi^*(x) = \mathbb{E}_{y\sim \hat q_M}\big[k(x, y)\,\nabla_y \log p(y) + \nabla_y k(x, y)\big]. \qquad (3)$$

When $M = 1$, the update rule becomes $x^{(1)} \leftarrow x^{(1)} + \epsilon\,\nabla_{x^{(1)}} \log p(x^{(1)})$, which corresponds to the gradient method for finding a mode of $p(x)$.
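The update above fits in a few lines of NumPy. The following is a minimal sketch of our own (not the authors' code), assuming an RBF kernel with the median-heuristic bandwidth of footnote 1; `grad_log_p` is a user-supplied callback returning the score $\nabla_x \log p(x)$ row-wise, and the helper names are ours.

```python
import numpy as np

def median_bandwidth(X):
    # Footnote 1: h = med**2, where med is the median pairwise Euclidean distance.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    med = np.median(np.sqrt(d2[np.triu_indices(len(X), k=1)]))
    return med ** 2 + 1e-12

def svgd_step(X, grad_log_p, eps=0.1):
    # One SVGD update x_i <- x_i + eps * phi_hat(x_i), Eq. (3), with the RBF kernel.
    M = len(X)
    h = median_bandwidth(X)
    diff = X[:, None, :] - X[None, :, :]                  # diff[i, j] = x_i - x_j
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * h))   # K[i, j] = k(x_i, x_j)
    drive = K @ grad_log_p(X) / M                         # kernel-smoothed gradient G
    repulse = np.einsum('ij,ijd->id', K, diff) / (h * M)  # repulsive force R
    return X + eps * (drive + repulse)

# Toy usage: approximate p(x) = N(0, I) in D = 10 dimensions; grad log p(x) = -x.
rng = np.random.default_rng(0)
X = 5.0 * rng.standard_normal((100, 10))                  # particles initialised from N(0, 25 I)
for _ in range(1000):
    X = svgd_step(X, lambda x: -x, eps=0.1)
print("dimension-averaged marginal variance:", X.var(axis=0).mean())
```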
3. Towards Understanding the Impact of Dimensionality on SVGD

Kernel-based methods suffer from the curse of dimensionality. For example, Ramdas et al. (2015) demonstrate that the power of nonparametric hypothesis testing using the Maximum Mean Discrepancy (MMD) drops polynomially with increasing dimensions. It is reasonable to suspect that SVGD suffers from similar problems. In fact, as shown in the upper row of Fig. 1, even for $p(x) = N(x|0, I)$, the performance of SVGD is unsatisfactory: though it correctly estimates the mean of $p(x)$, it underestimates the marginal variance, and this problem becomes more severe in higher dimensions. In other words, SVGD suffers from particle degeneracy in high dimensions: particles become less diverse and tend to collapse to modes of $p(x)$. In this section, we take an initial step toward understanding this by analyzing² the repulsive force $R(x; q)$.

² All the derivation details can be found in the supplemental materials. We consider only the RBF kernel $k(x, y) = \exp\!\big(-\|x-y\|_2^2/(2h)\big)$. The IMQ kernel shares similar properties and the corresponding results can be found in the supplemental materials as well.

Figure 1: Results for inferring $p(x) = N(x|0, I)$ using SVGD with the RBF kernel, where particles are initialized from $N(x|0, 25I)$; the horizontal axis of every panel is the dimensionality $D$ (from 1 to 100). The top two panels show the dimension-averaged marginal variance $\frac{1}{D}\sum_{d=1}^D \mathrm{Var}_{\hat q_M}(x_d)$ and mean $\frac{1}{D}\sum_{d=1}^D \mathbb{E}_{\hat q_M}[x_d]$, respectively; the bottom two panels show the particle-averaged magnitude of the repulsive force (PAMRF) $\frac{1}{M}\sum_{i=1}^M \|R(x^{(i)}; \hat q_M)\|$ and of the kernel-smoothed gradient (PAKSG) $\frac{1}{M}\sum_{i=1}^M \|G(x^{(i)}; p, \hat q_M)\|$, respectively, at both the beginning (dotted; B) and the end (solid; E) of the iterations, with different numbers of particles $M = 50$, $100$ and $200$.

First we highlight the importance of $R(x; q)$. Referring to Eq. (2), we have $\phi^*(x) = G(x; p, q) + R(x; q)$, and we can show that the kernel-smoothed gradient $G(x; p, q)$ corresponds to the steepest direction for maximizing $\mathbb{E}_{x\sim q}[\log p(x)]$, i.e.,

$$\frac{G(x; p, q)}{\|G(x; p, q)\|_{\mathcal{H}^D}} = \operatorname*{argmax}_{\|\phi\|_{\mathcal{H}^D}\le 1} \nabla_\epsilon\, \mathbb{E}_{z\sim q_{[T]}}[\log p(z)]\big|_{\epsilon=0}, \qquad (4)$$

where $z = T(x) = x + \epsilon\phi(x)$. The convergence condition $G(x; p, q) \equiv 0$ corresponds to $\nabla_y \log p(y) = 0$ wherever $q(y) \neq 0$, i.e., the optimal $q(x)$ collapses to modes of $p(x)$. This implies that without $R(x; q)$, $G(x; p, q)$ alone corresponds to the gradient method for finding modes of $p(x)$. So $R(x; q)$ is critical for SVGD to work as an inference algorithm that minimizes the KL divergence.

However, with a kernel measuring the global similarity in $\mathcal{X} \subseteq \mathbb{R}^D$ (e.g., the RBF kernel), the repulsive force becomes

$$R(x; q) = \mathbb{E}_{y\sim q}\left[\exp\left(-\frac{\|x-y\|_2^2}{2h}\right)\frac{x-y}{h}\right].$$

Unlike $G(x; p, q)$, in which the bandwidth $h$ only appears as a denominator of $\|x-y\|_2^2$ and can be chosen using the median heuristic, the bandwidth in $R(x; q)$ also appears as a denominator of $x - y$. As a result, finding a proper $h$ for $R(x; q)$ is hard, and the magnitude of $R(x; q)$ is bounded as $\|R(x; q)\|_2 \le \frac{2}{e}\,\mathbb{E}_{y\sim q}\big[1/\|x-y\|_2\big]$ for any $h > 0$. Intuitively, when $\|x-y\|_2/\|x-y\|_2^2 \ll 1$ for most regions of $q$, $R(x; q)$ is small, making the SVGD dynamics greatly dependent on $G(x; p, q)$, especially in the beginning stage where $q$ does not match $p$ and $G(x; p, q)$ is large. Besides, though the theoretical convergence condition that $\phi^*(x) \equiv 0$ iff $q = p$ still holds, the vanishing repulsive force weakens it in practice by reducing the difference between $G(x; p, q) \equiv 0$ and $\phi^*(x) \equiv 0$.
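As a quick numerical illustration of this vanishing-force effect (a sketch of our own, not an experiment from the paper), the code below estimates $\|R(x; \hat q_M)\|_2$ under the median-heuristic RBF kernel for standard-normal particles in increasing dimensions; the printed norms shrink as $D$ grows.

```python
import numpy as np

def repulsive_norm(Y, x, h):
    # ||R(x; q_hat_M)||_2 with R(x; q_hat_M) = (1/M) * sum_j k(x, y_j) (x - y_j) / h.
    diff = x[None, :] - Y
    k = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * h))
    return np.linalg.norm((k[:, None] * diff).mean(axis=0) / h)

rng = np.random.default_rng(0)
M = 200
for D in (1, 10, 50, 100, 500):
    Y = rng.standard_normal((M, D))                            # particles from q = N(0, I)
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    h = np.median(np.sqrt(d2[np.triu_indices(M, k=1)])) ** 2   # median-heuristic bandwidth
    x = rng.standard_normal(D)                                 # a query point drawn from q
    print(f"D = {D:4d}   ||R(x; q_hat_M)||_2 = {repulsive_norm(Y, x, h):.3e}")
```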
These characteristics cause problems in practice, when $q$ is approximated by a set of particles $\{x^{(i)}\}_{i=1}^M$: the empirical convergence condition $\hat\phi^*(x^{(i)}) = 0$ for all $i \in \{1, \ldots, M\}$ does not guarantee³ that $\{x^{(i)}\}_{i=1}^M$ is a good approximation of $p$, and the $G(x; p, q)$-dominant dynamics results in collapsing particles.

³ An extreme case is as follows: when $x^{(i)} = x^*$ with $x^* = \operatorname{argmax}_x \log p(x)$ (i.e., the MAP) for every $i \in \{1, \ldots, M\}$, the empirical convergence condition is satisfied.

Now, a natural question is: for which $q$ does this intuition hold? One example is a Gaussian $q$, as summarized in the following proposition:

Proposition 1. Given the RBF kernel $k(x, y)$ and $q(y) = N(y|\mu, \Sigma)$, the repulsive force satisfies

$$\|R(x; q)\|_2 \le \frac{\|x - \mu\|_2}{\lambda_{\min}(\Sigma)\left(\frac{D}{2} + 1\right)\left(1 + \frac{2}{D}\right)^{D/2}}, \qquad (5)$$

where $\lambda_{\min}(\Sigma)$ is the smallest eigenvalue of $\Sigma$. By using $\lim_{x\to 0}(1 + x)^{1/x} = e$, we have $\|R(x; q)\|_2 \lesssim \|x - \mu\|_2/(\lambda_{\min}(\Sigma)\, D)$.

Proposition 1 indicates that the upper bound of $\|R(x; q)\|_2$ negatively correlates with $D$. In practice, since $R(x; \hat q_M)$ is an unbiased estimate of $R(x; q)$, we can also bound $\|R(x; \hat q_M)\|_2 \lesssim \|x - \mu\|_2/(\lambda_{\min}(\Sigma)\, D)$. Apart from the Gaussian distribution, we can prove that such a negative correlation exists for $R(x; \hat q_M)$ in a more general case:

Proposition 2. Let $k(x, y)$ be an RBF kernel. Suppose $q(y)$ is supported on a bounded set $\mathcal{X}$ satisfying $\|y\| \le C$ for $y \in \mathcal{X}$, and $\mathrm{Var}(y_d | y_1, \ldots, y_{d-1}) \ge C_0$ almost surely for any $1 \le d \le D$. Let $\{x^{(i)}\}_{i=1}^M$ be a set of samples of $q$ and $\hat q_M$ the corresponding empirical distribution. Then, for any $x$ with $\|x\| \le C$ and any $\alpha, \delta \in (0, 1)$, there exists $D_0 > 0$ such that for any $D > D_0$,

$$\|R(x; \hat q_M)\|_2 \le \frac{2}{e}\, D^{-\alpha} \qquad (6)$$

holds with probability at least $1 - \delta$.

In Proposition 2, the bounded support assumption is relatively mild: examples include distributions defined on images, in which the pixel intensity lies in a bounded interval. Requiring that the conditional variance be larger than some constant reflects that the stochasticity of each dimension is not eliminated by knowing the values of the other dimensions, which is a rather strong assumption. However, as evaluated in the experiments, the negative correlation exists for $q$ even when such assumptions do not hold. Thus Proposition 2 may be improved with weaker assumptions.

Given these intuitions, we would like to explain the particle degeneracy phenomenon in Fig. 1. As shown in the bottom row, there is a negative correlation between $\|R(x; \hat q_M)\|$ and $D$, at both the beginning and the end of the iterations. In the beginning stage, $\|G(x; p, \hat q_M)\|$ stays almost unchanged while $\|R(x; \hat q_M)\|$ negatively correlates with $D$. This implies that the SVGD dynamics becomes more $G(x; p, \hat q_M)$-dominant with larger $D$ at the beginning. At convergence, $\hat\phi^*(x^{(i)}) = 0$, i.e., $G(x^{(i)}; p, \hat q_M) = -R(x^{(i)}; \hat q_M)$ and hence $\|G(x^{(i)}; p, \hat q_M)\| = \|R(x^{(i)}; \hat q_M)\|$. In this case, we find an interesting phenomenon: $\|R(x; \hat q_M)\|$ tends to be constant for $M = 50$, yet the marginal variance still decreases with increasing dimensions. A possible explanation is that, assuming $q$ is Gaussian with $\Sigma = \sigma^2 I$, the variance satisfies $\sigma^2 = \lambda_{\min}(\Sigma) \lesssim \|x^{(i)} - \mu\|/(\|R(x^{(i)}; \hat q_M)\|\, D)$ by Proposition 1. When $\|R(x^{(i)}; \hat q_M)\|$ is almost constant (and $\|x^{(i)} - \mu\|$ does not increase faster than $D$), $\sigma^2$ decreases as $D$ increases.
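As a rough numerical check of this explanation (our own sketch; the constant factors of Proposition 1 are dropped), the code below Monte-Carlo estimates $R(x; q)$ for $q = N(0, \sigma^2 I)$ using a bandwidth close to the median-heuristic value, and prints the ratio $\|x - \mu\|_2/(\sigma^2 D\, \|R(x; q)\|_2)$, which stays roughly of order one while $\|R(x; q)\|_2$ itself decays with $D$.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, M = 1.0, 5000
for D in (5, 20, 80, 320):
    Y = np.sqrt(sigma2) * rng.standard_normal((M, D))  # samples from q = N(0, sigma^2 I)
    x = np.sqrt(sigma2) * rng.standard_normal(D)       # a typical point under q
    h = 2.0 * sigma2 * D                               # roughly the median-heuristic bandwidth here
    diff = x[None, :] - Y
    k = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * h))
    R = (k[:, None] * diff).mean(axis=0) / h           # Monte-Carlo estimate of R(x; q)
    nR = np.linalg.norm(R)
    ratio = np.linalg.norm(x) / (sigma2 * D * nR)      # ||x - mu|| / (sigma^2 * D * ||R||)
    print(f"D = {D:3d}   ||R|| = {nR:.2e}   ratio = {ratio:.2f}")
```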
4. Message Passing SVGD

As discussed in Section 3, the unsatisfying behavior of SVGD comes from the negative correlation between the dimensionality and the repulsive force. Though the high-dimensional nature of $p(x)$ is inevitable in practice, this problem can be alleviated for $p(x)$ with conditional independence structure, which is commonly described by probabilistic graphical models (PGMs). Based on this idea, we propose Message Passing SVGD, which converts the original high-dimensional inference problem into a set of local inference problems with lower dimensions. More specifically, we assume $p(x)$ can be factorized⁴ as $p(x) \propto \prod_{F\in\mathcal{F}} \psi_F(x_F)$, where each factor $F \subseteq \{1, \ldots, D\}$ denotes an index set and $x_F = [x_d]_{d\in F}$. The Markov blanket $\Gamma_d = \bigcup\{F : F \ni d\} \setminus \{d\}$ contains the neighboring nodes of $d$, such that $p(x_d|x_{-d}) = p(x_d|x_{\Gamma_d})$.

⁴ Such a $p(x)$ is usually described using a factor graph, which unifies both directed and undirected graphical models. We refer the readers to Koller & Friedman (2009) for details.

4.1. A Decomposition of the KL Divergence

Our method relies on the key observation that we can decompose $\mathrm{KL}(q\,\|\,p)$ as

$$\mathrm{KL}(q\,\|\,p) = \mathrm{KL}\big(q(x_d|x_{-d})\,q(x_{-d}) \,\big\|\, p(x_d|x_{\Gamma_d})\,q(x_{-d})\big) + \mathrm{KL}\big(q(x_{-d}) \,\big\|\, p(x_{-d})\big), \qquad (7)$$

where $-d = \{1, \ldots, D\} \setminus \{d\}$ denotes the index set excluding $d$. Eq. (7) provides another perspective on minimizing $\mathrm{KL}(q\,\|\,p)$: instead of solving a global problem that minimizes $\mathrm{KL}(q\,\|\,p)$ over $q(x)$, we can iteratively solve a set of local problems that minimize the localized divergence over $q(x_d|x_{-d})$ while keeping $q(x_{-d})$ fixed, i.e.,

$$\operatorname*{argmin}_{q(x_d|x_{-d})}\; \mathrm{KL}\big(q(x_d|x_{-d})\,q(x_{-d}) \,\big\|\, p(x_d|x_{\Gamma_d})\,q(x_{-d})\big). \qquad (8)$$

This idea resembles EP, which also performs local minimizations iteratively, however for a localized version of the inclusive KL divergence $\mathrm{KL}(p\,\|\,q)$ (Minka, 2001). Another difference is that each local step in EP does not guarantee minimizing a global divergence (Minka, 2005), while solving Problem (8) iteratively corresponds to minimizing the original $\mathrm{KL}(q\,\|\,p)$ due to the decomposition⁵ in Eq. (7).

⁵ In fact, when both $q$ and $p$ are differentiable, we can show that each localized divergence equals zero iff $q = p$, as detailed in the supplemental material.

Eq. (7) requires decomposing $q(x)$ as $q(x_d|x_{-d})\,q(x_{-d})$ for each $d$, which makes it of little use for VI methods with a parametric $q(x)$, except in some special cases (e.g., $q(x)$ is Gaussian or fully factorized as $q(x) = \prod_{d=1}^D q(x_d)$). However, this decomposition is very suitable for transformation-based methods like SVGD. Consider the transformation $z = T(x) = [x_1, \ldots, T_d(x_d), \ldots, x_D]$ for $x \sim q$, where only the $d$th dimension is transformed and the other dimensions are kept unchanged; then $q_{[T]}(z_{-d}) = q(z_{-d})$. In other words, minimizing $\mathrm{KL}(q_{[T]}\,\|\,p)$ over $T_d$ is equivalent to minimizing $\mathrm{KL}\big(q_{[T_d]}(x_d|x_{-d})\,q(x_{-d}) \,\big\|\, p(x_d|x_{\Gamma_d})\,q(x_{-d})\big)$. Thus SVGD can be applied to Problem (8) directly, by following $T_d: x_d \mapsto x_d + \epsilon\phi_d(x)$, where $\phi_d \in \mathcal{H}_0$ is associated with the global kernel $k: \mathcal{X}\times\mathcal{X} \to \mathbb{R}$, and solving $\min_{\|\phi_d\|_{\mathcal{H}_0}\le 1} \nabla_\epsilon \mathrm{KL}(q_{[T]}\,\|\,p)\big|_{\epsilon=0}$. This produces a coordinate-wise version of SVGD. However, the high-dimensional problem still exists due to the global kernel.

To reduce the dimensionality, we further restrict the transformation to depend only on $x_d$ and its Markov blanket, i.e., $T_d: x_d \mapsto x_d + \epsilon\phi_d(x_{S_d})$ with $\phi_d \in \mathcal{H}_d$, where $S_d = \{d\} \cup \Gamma_d$. Here $\mathcal{H}_d$ is the RKHS induced by a kernel $k_d: \mathcal{X}_{S_d}\times\mathcal{X}_{S_d} \to \mathbb{R}$, where $\mathcal{X}_{S_d} = \{x_{S_d} : x \in \mathcal{X}\}$. By doing so, we have the following proposition⁶:

⁶ Proof can be found in the supplemental material.

Proposition 3. Let $z = T(x) = [x_1, \ldots, T_d(x_d), \ldots, x_D]$ with $T_d: x_d \mapsto x_d + \epsilon\phi_d(x_{S_d})$ and $\phi_d \in \mathcal{H}_d$, where $\mathcal{H}_d$ is associated with the local kernel $k_d: \mathcal{X}_{S_d}\times\mathcal{X}_{S_d} \to \mathbb{R}$. Then we have

$$\nabla_\epsilon \mathrm{KL}(q_{[T]}\,\|\,p) = \nabla_\epsilon \mathrm{KL}\big(q_{[T_d]}(z_d|z_{\Gamma_d})\,q(z_{\Gamma_d}) \,\big\|\, p(z_d|z_{\Gamma_d})\,q(z_{\Gamma_d})\big), \qquad (9)$$

and the solution of $\min_{\|\phi_d\|_{\mathcal{H}_d}\le 1} \nabla_\epsilon \mathrm{KL}(q_{[T]}\,\|\,p)\big|_{\epsilon=0}$ is $\phi_d^*/\|\phi_d^*\|_{\mathcal{H}_d}$, where

$$\phi_d^*(x_{S_d}) = \mathbb{E}_{y_{S_d}\sim q}\big[k_d(x_{S_d}, y_{S_d})\,\nabla_{y_d}\log p(y_d|y_{\Gamma_d}) + \nabla_{y_d} k_d(x_{S_d}, y_{S_d})\big]. \qquad (10)$$

As shown in Eq. (10), computing $\phi_d^*(x_{S_d})$ only requires $x_{S_d} \in \mathcal{X}_{S_d}$ instead of $x \in \mathcal{X}$, which reduces the dimension from $D$ to $|S_d|$. This alleviates the vanishing repulsive force problem, especially when $p(x)$ has a highly sparse structure such that $|S_d| \ll D$ (e.g., pairwise Markov random fields), as verified in the experiments on both synthetic and real-world problems.
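To make the index sets concrete, here is a small helper of our own (the function name `markov_blankets` and the 3 × 3 grid example are ours, not from the paper) that extracts $\Gamma_d$ and $S_d = \{d\}\cup\Gamma_d$ from a factor list of a pairwise MRF; it also illustrates how $|S_d| \ll D$ for sparse structures.

```python
from itertools import product

def markov_blankets(factors, D):
    # Gamma_d = union of all factors containing d, minus d; S_d = {d} | Gamma_d.
    gamma = {d: set() for d in range(D)}
    for F in factors:                       # each factor is a tuple of node indices
        for d in F:
            gamma[d].update(set(F) - {d})
    return {d: ({d} | gamma[d], gamma[d]) for d in range(D)}

rows = cols = 3
node = lambda r, c: r * cols + c
factors = [(node(r, c),) for r, c in product(range(rows), range(cols))]                  # node factors
factors += [(node(r, c), node(r, c + 1)) for r in range(rows) for c in range(cols - 1)]  # horizontal edges
factors += [(node(r, c), node(r + 1, c)) for r in range(rows - 1) for c in range(cols)]  # vertical edges

S, Gamma = markov_blankets(factors, rows * cols)[node(1, 1)]
print(sorted(S), sorted(Gamma))   # centre node: |S_d| = 5 and |Gamma_d| = 4, while D = 9
```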
Similar to the original SVGD, the convergence condition $\phi_d^*(x_{S_d}) \equiv 0$ holds iff $q(x_d|x_{\Gamma_d})\,q(x_{\Gamma_d}) \equiv p(x_d|x_{\Gamma_d})\,q(x_{\Gamma_d})$, i.e., $q(x_d|x_{\Gamma_d}) \equiv p(x_d|x_{\Gamma_d})$, when $k_d(x_{S_d}, y_{S_d})$ is strictly positive definite in a proper sense (Liu et al., 2016; Chwialkowski et al., 2016). In other words, to reduce the dimension, we pay the price that $q$ is only conditionally consistent with $p$. Such a relaxation of $q$ also appears in traditional methods like loopy BP and its variants, in which only marginal consistency is guaranteed (Wainwright & Jordan, 2008).

4.2. Markov Blanket based Kernels

Now, the remaining question is the choice of the local kernel $k_d: \mathcal{X}_{S_d}\times\mathcal{X}_{S_d} \to \mathbb{R}$. We can simply define $k_d(x_{S_d}, y_{S_d}) = f\big(\|x_{S_d} - y_{S_d}\|_2^2/(2h_{S_d})\big)$ with $f(z) = \exp(-z)$ (i.e., the RBF kernel), under the implicit assumption that all nodes in the Markov blanket contribute equally to node $d$. We call such a $k_d$ the Single-Kernel. However, as the factorization $p(x) \propto \prod_{F\in\mathcal{F}}\psi_F(x_F)$ is known, we can also define the Multi-Kernel

$$k_d(x_{S_d}, y_{S_d}) = \frac{1}{K}\sum_{F \ni d} f\big(\|x_F - y_F\|_2^2/(2h_F)\big),$$

where $K$ is the number of factors containing $d$, to reflect the assumption that nodes in different factors may contribute in different ways. By doing so, the repulsive force becomes $R_d(x; q) = \frac{1}{K}\sum_{F \ni d}\mathbb{E}_{y_F\sim q}\big[\nabla_{y_d} f\big(\|x_F - y_F\|_2^2/(2h_F)\big)\big]$, and we further reduce the dimension for $R_d(x; q)$ from $|S_d|$ to $\max\{|F| : F \ni d\}$.

4.3. Final Algorithm

Similar to SVGD, we use a set of particles $\{x^{(i)}\}_{i=1}^M$ to approximate $q$; the procedure is summarized in Algorithm 1. According to the choice of kernel, we abbreviate the method as MP-SVGD-s for the Single-Kernel and MP-SVGD-m for the Multi-Kernel.

Algorithm 1 Message Passing SVGD
  Input: a differentiable target distribution $p(x)$ whose $d$th conditional distribution is $p(x_d|x_{\Gamma_d}) = p(x_d|x_{-d})$, and a set of initial particles $\{x^{(i)}\}_{i=1}^M$.
  Output: a set of particles $\{x^{(i)}\}_{i=1}^M$ as an approximation of $p(x)$.
  for iteration $t$ do
    for $d \in \{1, \ldots, D\}$ do
      $x_d^{(i)} \leftarrow x_d^{(i)} + \epsilon\,\hat\phi_d^*(x_{S_d}^{(i)})$, where $\epsilon$ is the step size and
      $\hat\phi_d^*(x_{S_d}) = \mathbb{E}_{y_{S_d}\sim \hat q_M}\big[k_d(x_{S_d}, y_{S_d})\,\nabla_{y_d}\log p(y_d|y_{\Gamma_d}) + \nabla_{y_d} k_d(x_{S_d}, y_{S_d})\big]$
    end for
  end for

Figure 2: Message passing procedure for node $d$.

Updating the particles acts in a message passing manner, as shown in Figure 2: node $d$ receives particles (messages) from its neighbors (i.e., $\{x_{\Gamma_d}^{(i)}\}_{i=1}^M$), updates its own particles $\{x_d^{(i)}\}_{i=1}^M$, and sends them to its neighbors. Unlike loopy BP, each node sends the same message to all of its neighbors. This resembles VMP, where the messages from a parent to its children in a directed graph are also identical (Winn & Bishop, 2005).
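The following is a condensed NumPy sketch of the per-node update in Algorithm 1 with the Single-Kernel, written under our own conventions (not the authors' code): `blankets[d]` lists $\Gamma_d$, the callback `grad_log_cond(d, Z)` stands in for $\nabla_{x_d}\log p(x_d|x_{\Gamma_d})$ evaluated at the particles restricted to $S_d$, and nodes are swept sequentially, though a parallel sweep would work just as well.

```python
import numpy as np

def median_h(Z):
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.median(np.sqrt(d2[np.triu_indices(len(Z), k=1)])) ** 2 + 1e-12

def mp_svgd_step(X, blankets, grad_log_cond, eps=0.05):
    """One sweep of the update in Algorithm 1 with the Single-Kernel.

    X: (M, D) particles; blankets[d]: list of the nodes in Gamma_d;
    grad_log_cond(d, Z): (M,) values of d/dx_d log p(x_d | x_{Gamma_d}) at the
    particles Z = X[:, [d] + blankets[d]] (column 0 of Z holds x_d).
    """
    M, D = X.shape
    X = X.copy()
    for d in range(D):                                       # nodes updated sequentially here
        Z = X[:, [d] + list(blankets[d])]                    # particles restricted to S_d
        h = median_h(Z)
        diff = Z[:, None, :] - Z[None, :, :]
        K = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * h))  # Single-Kernel k_d on S_d
        drive = K @ grad_log_cond(d, Z) / M                  # kernel-smoothed conditional score
        repulse = (K * diff[:, :, 0]).sum(axis=1) / (h * M)  # d/dy_d of k_d; y_d is column 0
        X[:, d] += eps * (drive + repulse)
    return X
```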
5. Experiments

In this section, we experimentally verify our analysis and compare the performance of MP-SVGD with other inference methods on both synthetic and real-world examples. We use the RBF kernel with the bandwidth chosen by the median heuristic in all experiments.

5.1. Synthetic Markov Random Fields

We follow the settings of Lienart et al. (2015) and focus on a pairwise MRF on a 2D grid, $p(x) \propto \prod_{d\in V}\psi_d(x_d)\prod_{(d,t)\in E}\psi_{dt}(x_d, x_t)$, with the random variable of each node taking values in $\mathbb{R}$. The node and edge potentials are chosen such that $p(x)$ and its marginals are multimodal, non-Gaussian and skewed:

$$\psi_d(x_d) = \alpha_1 N(x_d - y_d\,|\,{-2}, 1) + \alpha_2 G(x_d - y_d\,|\,2, 1.3), \qquad \psi_{dt}(x_d, x_t) = L(x_d - x_t\,|\,0, 2), \qquad (11)$$

where $y_d$ denotes the observed value at node $d$ and is initialized randomly as $y_d \sim \alpha_1 N(y_d - 2\,|\,{-2}, 1) + \alpha_2 G(y_d - 2\,|\,2, 1.3)$, and $N(x|\mu,\sigma^2) \propto \exp\!\big(-(x-\mu)^2/(2\sigma^2)\big)$, $G(x|\mu,\beta) \propto \exp\!\big(-\big((x-\mu)/\beta + \exp(-(x-\mu)/\beta)\big)\big)$ and $L(x|\mu,\beta) \propto \exp\!\big(-|x-\mu|/\beta\big)$ denote the densities of the Gaussian, Gumbel and Laplace distributions, respectively. The parameters $\alpha_1$ and $\alpha_2$ are set to 0.6 and 0.4. We consider a 10 × 10 grid except in Fig. 5, whose grid size ranges from 2 × 2 to 10 × 10. All experimental results are averaged over 10 runs with random initializations.

Since $p(x)$ is intractable, we recover the ground truth from samples drawn by an adaptive version of Hamiltonian Monte Carlo (HMC) (Neal, 2011; Shi et al., 2017). We run 100 chains in parallel with 40,000 samples for each chain after a burn-in of 10,000, i.e., 4 million samples in total.

We compare MP-SVGD with SVGD, HMC, EP and EPBP (Lienart et al., 2015). Although widely used for MRFs, Gibbs sampling does not suit this task since the conditional distribution $p(x_d|x_{\Gamma_d})$ cannot be sampled directly. So we use uniformly randomly chosen particles from the 4 million ground-truth samples as a strong baseline⁷ and refer to this method as HMC. For EP, we use Gaussian factors, and the moment matching step is done by numerical integration due to the non-Gaussian nature of $p(x)$. EPBP is a variant of BP and the original state-of-the-art method on this task. It uses weighted samples to estimate the messages, while the other methods (except EP) use unweighted samples to approximate $p(x)$ directly.

⁷ This baseline is strong in the sense that when all the 4 million samples are used, the corresponding approximation error would be zero.

Figure 3: A qualitative comparison of inference methods with 100 particles (except EP) for the marginal densities of three randomly selected nodes. Density curves of SVGD, MP-SVGD and HMC are estimated by a kernel density estimator with RBF kernels. For EPBP, the curve is drawn by normalizing its beliefs over the fixed interval [-5, 10) with bin size 0.01. Methods shown: HMC, SVGD, MP-SVGD-s, MP-SVGD-m, EPBP, EP and the ground truth.

Fig. 3 provides a qualitative comparison of these methods by visualizing the estimated marginal densities of individual nodes. As shown, SVGD estimates the marginals poorly in all cases. For both unimodal and bimodal cases, the SVGD curve exhibits a sharp, peaked behavior, indicating the collapsing trend of its particles. It is interesting to note that in the rightmost panel, more SVGD particles gather around the lower mode. One possible reason is that the lower marginal mode may correspond to a larger mode of the joint distribution, into which SVGD particles tend to collapse. EP provides quite a crude approximation as expected, due to the mismatch between its Gaussian assumption and the non-Gaussian nature of $p(x)$. The other methods perform similarly.

Figure 4: A quantitative comparison of inference methods with varying numbers of particles (M from 10 to 300). Performance is measured by the MSE of the estimated expectation $\mathbb{E}_{x\sim\hat q_M}[f(x)]$ for test functions $f(x) = x$, $x^2$, $1/(1+\exp(\omega\odot x + b))$ and $\cos(\omega\odot x + b)$, arranged from left to right, where $\odot$ denotes the element-wise product. Results are averaged over 10 random draws of $\omega$ and $b$, where $\omega, b \in \mathbb{R}^{100}$ with $\omega_d \sim N(0, 1)$ and $b_d \sim \mathrm{Uniform}[0, 2\pi]$, $d \in \{1, \ldots, 100\}$. Methods compared: HMC, SVGD, MP-SVGD-s, MP-SVGD-m and EPBP.

Fig. 4 provides a quantitative comparison⁸ of particle efficiency, measured by the mean squared error (MSE) of estimated expectations with different particle sizes $M$. For EPBP, $M$ denotes the number of node particles used to approximate the messages. We observe that SVGD achieves the highest MSE among all methods, reflecting its particle degeneracy. Compared to HMC and EPBP, MP-SVGD achieves comparable and even better results. Besides, MP-SVGD-m achieves lower MSE than MP-SVGD-s, reflecting the benefit of MP-SVGD-m of leveraging more structural information in designing the local kernels.

⁸ We omit the EP results for clarity; the figure including EP results can be found in the supplementary materials.

Figure 5: PAMRF $\frac{1}{M}\sum_{i=1}^M \|R(x^{(i)}; \hat q_M)\|_r$ for converged $\{x^{(i)}\}_{i=1}^M$ with $r = \infty$ (left) and $r = 2$ (right). The grid size ranges from 2 × 2 to 10 × 10 (D = 4, 16, 49, 100). The number in brackets denotes the number of particles (100, 200 or 300) for SVGD, MP-SVGD-s and MP-SVGD-m.

Fig. 5 compares MP-SVGD with SVGD in terms of the particle-averaged magnitude of the repulsive force (PAMRF), $\frac{1}{M}\sum_{i=1}^M \|R(x^{(i)}; \hat q_M)\|_r$, at the end of the iterations for various dimensions. As expected, the SVGD PAMRF negatively correlates with the dimensionality $D$ while the MP-SVGD PAMRF does not. Besides, both exhibit roughly a log-linear relationship with the dimensionality. This is consistent with Proposition 2, in which $\|R(x^{(i)}; \hat q_M)\|$ is upper bounded by $D^{-\alpha}$ for some constant $\alpha$. This also reflects the power of MP-SVGD in preventing the repulsive force from becoming too small by reducing the dimensionality with local kernels. Finally, we observe that the MP-SVGD-m PAMRF is higher than the MP-SVGD-s PAMRF, which agrees with our analysis of the Single-Kernel and the Multi-Kernel: for the pairwise MRF, the dimension for the Single-Kernel is at most 5 (the Markov blanket plus the node itself), while the dimension for the Multi-Kernel is at most 2 (an edge), and a lower dimensionality corresponds to a higher repulsive force.
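For completeness, the conditional score $\nabla_{x_d}\log p(x_d|x_{\Gamma_d})$ that MP-SVGD needs for this MRF can be written down directly from Eq. (11). The sketch below is our reading of it (not the paper's code): we use the normalized Gaussian and Gumbel densities so that $\alpha_1, \alpha_2$ act as mixture weights, and the Laplace edge term contributes $-\mathrm{sign}(x_d - x_t)/2$ almost everywhere.

```python
import numpy as np

ALPHA1, ALPHA2 = 0.6, 0.4

def gauss(v, mu, s2):
    # Normalised N(v | mu, s2) and its derivative with respect to v.
    p = np.exp(-(v - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    return p, p * (-(v - mu) / s2)

def gumbel(v, mu, b):
    # Normalised Gumbel density G(v | mu, b) and its derivative with respect to v.
    z = (v - mu) / b
    p = np.exp(-(z + np.exp(-z))) / b
    return p, p * (-(1 - np.exp(-z)) / b)

def grad_log_cond(d, Z, y):
    """d/dx_d log p(x_d | x_{Gamma_d}) for the MRF of Eq. (11).

    Z: (M, |S_d|) particles restricted to S_d, with column 0 holding x_d;
    y: observations, y[d] being the observation at node d.
    """
    xd = Z[:, 0]
    pn, dn = gauss(xd - y[d], -2.0, 1.0)
    pg, dg = gumbel(xd - y[d], 2.0, 1.3)
    score = (ALPHA1 * dn + ALPHA2 * dg) / (ALPHA1 * pn + ALPHA2 * pg)  # node potential term
    for col in range(1, Z.shape[1]):                     # Laplace edge potentials, beta = 2
        score -= np.sign(xd - Z[:, col]) / 2.0           # d/dx_d of -|x_d - x_t| / 2
    return score

# Plug into the MP-SVGD sketch above via, e.g., lambda d, Z: grad_log_cond(d, Z, y).
```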
5.2. Image Denoising

Our last experiment verifies the power of MP-SVGD in a real-world application. Following Schmidt et al. (2010), we formulate image denoising as finding the posterior mean⁹ of $p(x|y) \propto p(y|x)p(x)$, where the likelihood $p(y|x) = N(y|x, \sigma_n^2 I)$ encodes the observation model $y = x + n$ for some unknown natural image $x$ corrupted by Gaussian noise $n$ with noise level $\sigma_n$. The prior $p(x)$ encodes the statistics of natural images and is a Fields-of-Experts (FoE) (Roth & Black, 2009) MRF:

$$p(x) \propto \exp\left(-\frac{\epsilon\|x\|_2^2}{2}\right)\prod_{F\in\mathcal{F}}\prod_{i=1}^N \phi(J_i^\top x_F;\, \alpha_i), \qquad (12)$$

where $\{J_i\}_{i=1}^N$ is a bank of linear filters and the expert function $\phi(J_i^\top x_F;\, \alpha_i) = \sum_{j=1}^J \alpha_{ij}\, N(J_i^\top x_F\,|\,0, \sigma_i^2/s_j)$ is a Gaussian scale mixture (Woodford et al., 2009). We focus on the pairwise MRF where $\mathcal{F}$ indexes all the edge factors, $J_i = [1, -1]^\top$, $N = 1$ and $J = 15$. All the parameters (i.e., $\epsilon$, $J_i$, $\sigma_i$ and $s_j$) are pre-learned and the details can be found in Schmidt et al. (2010).

⁹ It is called the Bayesian minimum mean squared error estimate (MMSE) in the original paper.
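For intuition, the posterior score $\nabla_x \log p(x|y)$ for such a pairwise FoE model can be sketched as below. This is an illustration under our own assumptions, not the paper's implementation: the GSM parameters `ALPHAS` and `TAUS` (with $\tau_j = \sigma_i^2/s_j$) and the ridge weight `EPS_REG` are placeholders rather than the pre-learned values of Schmidt et al. (2010), and the filter $J = [1, -1]$ is applied as horizontal and vertical pixel differences.

```python
import numpy as np

# Illustrative placeholders; in the paper these parameters are pre-learned
# (Schmidt et al., 2010), so treat the numbers below as stand-ins only.
ALPHAS = np.array([0.5, 0.3, 0.2])      # mixture weights alpha_{ij}
TAUS = np.array([4.0, 50.0, 600.0])     # variances tau_j = sigma_i^2 / s_j
EPS_REG, SIGMA_N = 1e-3, 10.0           # ridge weight epsilon, noise level sigma_n

def dlog_expert(v):
    # d/dv log phi(v) for the GSM expert phi(v) = sum_j alpha_j N(v | 0, tau_j).
    comps = ALPHAS * np.exp(-v[..., None] ** 2 / (2 * TAUS)) / np.sqrt(2 * np.pi * TAUS)
    return (comps * (-v[..., None] / TAUS)).sum(-1) / comps.sum(-1)

def grad_log_posterior(x, y):
    # grad_x log p(x | y) for the pairwise FoE model of Eq. (12); x, y: (H, W) images.
    g = (y - x) / SIGMA_N ** 2 - EPS_REG * x          # Gaussian likelihood + ridge prior term
    dh = dlog_expert(x[:, :-1] - x[:, 1:])            # J^T x_F = x_a - x_b, horizontal edges
    dv = dlog_expert(x[:-1, :] - x[1:, :])            # vertical edges
    g[:, :-1] += dh
    g[:, 1:] -= dh
    g[:-1, :] += dv
    g[1:, :] -= dv
    return g

# Example: one score evaluation on a random 32 x 32 "image" with intensities in [0, 255].
rng = np.random.default_rng(0)
x0 = 255.0 * rng.random((32, 32))
y0 = x0 + SIGMA_N * rng.standard_normal((32, 32))
print(grad_log_posterior(x0, y0).shape)
```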
We compare SVGD and MP-SVGD¹⁰ with Gibbs sampling with auxiliary variables (Aux. Gibbs), the state-of-the-art method reported in the original paper. The recovered image is obtained by averaging all the particles, and its quality is evaluated using the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) (Wang et al., 2004). Higher PSNR/SSIM generally means better image quality.

¹⁰ As MP-SVGD-m is shown to be better than MP-SVGD-s, we only use MP-SVGD-m here; unless otherwise noted, MP-SVGD stands for MP-SVGD-m.

Figure 6: Denoising results for Lena (256 × 256 pixels, $\sigma_n = 10$) using 50 particles. The numbers in brackets are PSNR and SSIM: Noisy (28.16, 0.678), MAP (31.05, 0.867), SVGD (31.89, 0.890), MP-SVGD (33.20, 0.911), Aux. Gibbs (32.95, 0.901). Upper row: the full-size image; bottom row: 50 × 50 patches.

Fig. 6 shows example results for Lena, a benchmark image for comparing denoising methods. Despite the difference in PSNR and SSIM, the image recovered by SVGD lacks texture details (especially in the region near the nose of Lena shown in the 50 × 50 patches) and resembles the image reconstructed by MAP. Table 1 compares these methods quantitatively. As expected, MP-SVGD achieves the best result given the same number of particles. We also find that Aux. Gibbs requires about 200 particles for $\sigma_n = 10$, and 400 to 800 particles for $\sigma_n = 20$, to achieve performance similar to MP-SVGD.

Table 1: Denoising results for 10 test images (Lan et al., 2006) from the BSD dataset (Martin et al., 2001). The first two rows are cited from Schmidt et al. (2010) while the other rows are based on our implementation. M denotes the number of particles.

| Inference | avg. PSNR ($\sigma_n = 10$) | avg. PSNR ($\sigma_n = 20$) | avg. SSIM ($\sigma_n = 10$) | avg. SSIM ($\sigma_n = 20$) |
|---|---|---|---|---|
| MAP | 30.27 | 26.48 | 0.855 | 0.720 |
| Aux. Gibbs | 32.09 | 28.32 | 0.904 | 0.808 |
| Aux. Gibbs (M = 50) | 31.87 | 28.05 | 0.898 | 0.795 |
| Aux. Gibbs (M = 100) | 31.98 | 28.17 | 0.901 | 0.801 |
| SVGD (M = 50) | 31.58 | 27.86 | 0.894 | 0.766 |
| SVGD (M = 100) | 31.65 | 27.90 | 0.896 | 0.767 |
| MP-SVGD (M = 50) | 32.09 | 28.21 | 0.905 | 0.808 |
| MP-SVGD (M = 100) | 32.12 | 28.27 | 0.906 | 0.809 |

Figure 7: PAMRF $\frac{1}{M}\sum_{i=1}^M \|R(x^{(i)}; \hat q_M)\|_r$ for converged $\{x^{(i)}\}_{i=1}^M$ with $r = \infty$ (left) and $r = 2$ (right) over rescaled Lena ranging from 8 × 8 to 256 × 256 pixels. M = 50 particles are used.

From Fig. 7 we observe a negative correlation between the repulsive force and the dimensionality for SVGD, and a non-negative correlation for MP-SVGD, which verifies our analysis again.

6. Conclusions and Future Work

In this paper, we analyze the particle degeneracy phenomenon of SVGD in high dimensions and attribute it to the negative correlation between the repulsive force and the dimensionality. To solve this problem, we propose message passing SVGD (MP-SVGD), which converts the original problem into several local inference problems with lower dimensions. Experiments on both synthetic and real-world applications show the effectiveness of MP-SVGD.

For future work, we would like to settle the analysis of the repulsive force and its impact on the SVGD dynamics completely and formally. We also want to apply MP-SVGD to more complex real-world applications such as pose estimation (Pacheco et al., 2014). Besides, investigating robust kernels with non-decreasing repulsive force is also an interesting direction.

Acknowledgements

We thank the anonymous reviewers for their insightful comments and suggestions. This work was supported by NSFC Projects (Nos. 61620106010, 61621136008, 61332007), Beijing NSF Project (No. L172037), the Tiangong Institute for Intelligent Computing, the NVIDIA NVAIL Program, Siemens and Intel.

References
Aggarwal, C. C., Hinneburg, A., and Keim, D. A. On the surprising behavior of distance metrics in high dimensional spaces. In Proceedings of the International Conference on Database Theory, 2001.

Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. In Proceedings of the International Conference on Machine Learning, 2016.

Feng, Y., Wang, D., and Liu, Q. Learning to draw samples with amortized Stein variational gradient descent. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2017.

Gorham, J. and Mackey, L. Measuring sample quality with kernels. In Proceedings of the International Conference on Machine Learning, 2017.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In Proceedings of the International Conference on Machine Learning, 2017.

Ihler, A. T. and McAllester, D. A. Particle belief propagation. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2009.

Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Lan, X., Roth, S., Huttenlocher, D., and Black, M. J. Efficient belief propagation with learned higher-order Markov random fields. In European Conference on Computer Vision, 2006.

Lienart, T., Teh, Y. W., and Doucet, A. Expectation particle belief propagation. In Advances in Neural Information Processing Systems, 2015.

Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, 2016.

Liu, Q., Lee, J., and Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of the International Conference on Machine Learning, 2016.

Liu, Y., Ramachandran, P., Liu, Q., and Peng, J. Stein variational policy gradient. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2017.

Martens, J. and Sutskever, I. Parallelizable sampling of Markov random fields. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2010.

Martin, D., Fowlkes, C., Tal, D., and Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In IEEE International Conference on Computer Vision, 2001.

Minka, T. Expectation propagation for approximate Bayesian inference. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2001.

Minka, T. Divergence measures and message passing. Technical Report MSR-TR-2005-173, Microsoft Research, 2005.

Neal, R. M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.

Pacheco, J., Zuffi, S., Black, M., and Sudderth, E. Preserving modes and messages via diverse particle selection. In Proceedings of the International Conference on Machine Learning, 2014.

Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

Pu, Y., Gan, Z., Henao, R., Li, C., Han, S., and Carin, L. VAE learning via Stein variational gradient descent. In Advances in Neural Information Processing Systems, 2017.

Ramdas, A., Reddi, S. J., Póczos, B., Singh, A., and Wasserman, L. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.

Roth, S. and Black, M. J. Fields of experts. International Journal of Computer Vision, 2009.

Schmidt, U., Gao, Q., and Roth, S. A generative perspective on MRFs in low-level vision. In Computer Vision and Pattern Recognition, 2010.
Schölkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

Shi, J., Chen, J., Zhu, J., Sun, S., Luo, Y., Gu, Y., and Zhou, Y. ZhuSuan: A library for Bayesian deep learning. arXiv preprint arXiv:1709.05870, 2017.

Sudderth, E., Ihler, A., Freeman, W., and Willsky, A. Nonparametric belief propagation. In Computer Vision and Pattern Recognition, 2003.

Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 2008.

Wang, D., Zeng, Z., and Liu, Q. Structured Stein variational inference for continuous graphical models. arXiv preprint arXiv:1711.07168, 2017.

Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 2004.

Winn, J. and Bishop, C. M. Variational message passing. Journal of Machine Learning Research, 2005.

Woodford, O. J., Rother, C., and Kolmogorov, V. A global perspective on MAP inference for low-level vision. In IEEE International Conference on Computer Vision, 2009.