# stein_variational_gradient_descent_without_gradient__b0ca85db.pdf Stein Variational Gradient Descent Without Gradient Jun Han 1 Qiang Liu 2 Stein variational gradient decent (SVGD) has been shown to be a powerful approximate inference algorithm for complex distributions. However, the standard SVGD requires calculating the gradient of the target density and cannot be applied when the gradient is unavailable. In this work, we develop a gradient-free variant of SVGD (GF-SVGD), which replaces the true gradient with a surrogate gradient, and corrects the induced bias by re-weighting the gradients in a proper form. We show that our GF-SVGD can be viewed as the standard SVGD with a special choice of kernel, and hence directly inherits the theoretical properties of SVGD. We shed insights on the empirical choice of the surrogate gradient and propose an annealed GF-SVGD that leverages the idea of simulated annealing to improve the performance on high dimensional complex distributions. Empirical studies show that our method consistently outperforms a number of recent advanced gradient-free MCMC methods. 1. Introduction Approximate inference of complex distributions is a longstanding fundamental computational task in machine learning and statistics. Traditional methods are based on either Markov chain Monte Carlo (MCMC) (Neal et al., 2011; Hoffman & Gelman, 2014), or variational inference (VI) (Blei et al., 2017; Zhang et al., 2017), both of which have their own inherent pros and cons: MCMC is theoretically sound and asymptotically consistent, but is often slow to converge in practice; VI is practically faster but has been known to lack theoretical consistency guarantees. Stein variational gradient descent (SVGD) (Liu & Wang, 2016) is recently developed to integrate the advantages of 1Computer Science, Dartmouth College 2Computer Science, The University of Texas at Austin. Correspondence to: Jun , Qiang . Proceedings of the 35 th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the authors. MCMC and VI. By leveraging a new type of functional gradient descent of KL divergence on the space of distributions, SVGD directly fits a set of non-parametric particles to the distribution of interest, without introducing additional parametric forms. This makes SVGD both theoretically sound (Liu, 2017), practically fast and sample-efficient thanks to its optimization-based formulation. Unfortunately, gradient information of the target distribution is not always available in practice. In some cases, the distribution of interest is only available as a black-box density function and the gradient cannot be calculated analytically; in other cases, it may be computationally too expensive to calculate the gradient (Beaumont, 2003; Andrieu & Roberts, 2009; Filippone & Girolami, 2014). The goal of this paper is to extend SVGD to gradient-free settings. Our Results We propose a simple gradient-free variant of SVGD that replaces the true gradient with a surrogate gradient and uses a form of importance weighting to correct the bias introduced by the surrogate gradient. To give a very quick overview of our method, recall that the standard SVGD approximates a given distribution p on Rd with a set of particles {xi}n i=1 iteratively updated by xi xi + ϵ n xi, where j=1 [ log p(xj)k(xj, xi) + xjk(xj, xi)], (1) where k(x, x ) is any positive definite kernel; the term with gradient log p drives the particles to the high probability regions of p, and the term with k(x, x ) acts as a repulsive force to keep the particles away from each other to quantify the uncertainty. Our gradient-free SVGD is based on the following generalization of the SVGD update rule: j=1 wj log ρ(xj)k(xj, xi) + xjk(xj, xi) , (2) which replaces the true gradient log p with a surrogate gradient log ρ of an arbitrary auxiliary distribution ρ(x), and then uses an importance weight wj := ρ(xj)/p(xj) to correct the bias introduced by the surrogate. Perhaps surprisingly, we show that the new update can be derived as a standard SVGD update by using an importance weighted Gradient-Free SVGD kernel w(x)w(x )k(x, x ), and hence immediately inherits the theoretical proprieties of SVGD; for example, particles updated by (2) can be viewed as a gradient flow of KL divergence similar to the original SVGD (Liu, 2017). The performance of the gradient-free update (2) critically depends on the choice of ρ. We study this problem empirically and show that it works efficiently, sometimes even better than the original SVGD, if we take ρ to be an over-dispersed estimate of p that covers the support of p well. We further improve the algorithm by combining the gradient-free update (2) with simulated annealing, which we show consistently outperforms a number of other advanced gradient-free Monte Carlo methods, including a gradient-free variant of annealed importance sampling (Neal, 2001), kernel adaptive Metropolis-Hastings (KAMH) (Sejdinovic et al., 2014) and kernel Hamiltonian Monte Carlo (KHMC) (Strathmann et al., 2015). Discussion and Related Works Almost all gradient-free sampling methods employ some auxiliary (or proposal) distributions that are different, but sufficiently close to the target distribution, followed with some mechanisms to correct the bias introduced by the surrogate distribution. There have been a few number of bias-correction mechanisms underlying most of the gradient-free methods, including importance sampling, rejection sampling and the Metropolis-Hastings rejection trick. The state-of-the-art gradient-free sampling methods are often based on adaptive improvement of the proposals when using these tricks, this includes adaptive importance sampling and rejection sampling (Gilks & Wild, 1992; Capp e et al., 2008; Cotter et al., 2015; Han & Liu, 2017), and adaptive MCMC methods (e.g., Sejdinovic et al., 2014; Strathmann et al., 2015). Our method is significantly different from these gradientfree sampling algorithms aforementioned in principle, with a number of key advantages. Instead of correcting the bias by either re-weighting or rejecting the samples from the proposal, which unavoidably reduces the effective number of usable particles, our method re-weights the SVGD gradient and steers the update direction of the particles in a way that compensates the discrepancy between the target and surrogate distribution, without directly reducing the effective number of usable particles. In addition, while the traditional importance sampling and rejection sampling methods require the proposals to be simple enough to draw samples from, our update does not require to draw samples from the surrogate ρ. We can set ρ to be arbitrarily complex as long as we can calculate ρ(x) and its gradient. In fact, ρ(x) does not even have to be a normalized probability, sidestepping the difficult problem of calculating the normalization constant. Outline The paper is organized as follows. Section 2 in- troduces the background of SVGD. Section 3 introduces our novel gradient-free SVGD algorithm, and Section 4 further proposes annealed gradient-free SVGD that combines the advantage of simulated annealing to achieve better performance for complex distributions. We present empirical results in Section 5 and conclude the paper in Section 6. 2. Stein Variational Gradient Descent Stein variational gradient descent (SVGD) (Liu & Wang, 2016) is a nonparametric variational inference algorithm that iteratively transports a set of particles to approximate a given target distribution by performing a type of functional gradient descent on the KL divergence. We give a quick overview of its main idea in this section. Let p(x) be a positive density function on Rd which we want to approximate with a set of particles {xi}n i=1. SVGD starts with a set of initial particles {xi}n i=1, and updates the particles iteratively by xi xi + ϵφ(xi), i = 1, . . . , n, (3) where ϵ is a step size, and φ: Rd Rd is a velocity field which should be chosen to drive the particle distribution closer to the target. Assume the distribution of the particles at the current iteration is q, and q[ϵφ] is the distribution of the updated particles x = x + ϵφ(x). The optimal choice of φ can be framed into the following optimization problem: φ = arg max φ F dϵKL(q[ϵφ] || p) ϵ=0 where F is the set of candidate velocity fields, and φ is chosen in F to maximize the decreasing rate on the KL divergence between the particle distribution and the target. In SVGD, F is chosen to be the unit ball of a vectorvalued reproducing kernel Hilbert space (RKHS) H = H0 H0, where H0 is a RKHS formed by scalarvalued functions associated with a positive definite kernel k(x, x ), that is, F = {φ H: ||φ||H 1}. This choice of F allows us to consider velocity fields in infinite dimensional function spaces while still obtaining computationally tractable solution. A key step towards solving (4) is to observe that the objective function in (4) is a simple linear functional of φ that connects to Stein operator (Oates et al., 2017; Gorham & Mackey, 2015; Liu & Wang, 2016; Gorham & Mackey, 2017; Chen et al., 2018), dϵKL(q[ϵφ] || p) ϵ=0 = Ex q[A p φ(x)], (5) with A p φ(x) = x log p(x) φ(x) + x φ(x), (6) Gradient-Free SVGD where Ap is a linear operator called Stein operator and is formally viewed as a column vector similar to the gradient operator x. The Stein operator Ap is connected to Stein s identity which shows that the RHS of (5) is zero if p = q: Ex p[A p φ(x)] = 0. (7) This corresponds to d dϵKL(q[ϵφ] || p) ϵ=0 = 0 since there is no way to further decease the KL divergence when p = q. Eq. (7) is a simple result of integration by parts assuming the value of p(x)φ(x) vanishes on the boundary of the integration domain. Therefore, the optimization in (4) reduces to DF(q||p) def = max φ F Ex q A p φ(x) , (8) where DF(q || p) is the kernelized Stein discrepancy (KSD) defined in Liu et al. (2016); Chwialkowski et al. (2016). Observing that (8) is simple in that it is a linear functional optimization on a unit ball of a Hilbert space, Liu & Wang (2016) showed that (4) has a simple closed-form solution: φ (x ) Ex q[Apk(x, x )], (9) where Ap is applied to variable x, and D2 F(q || p) = Ex,x q[κp(x, x )], (10) where κp(x, x ) := (A p) (Apk(x, x )) and A p denotes the Stein operator applied on variable x . Here κp(x, x ) can be calculated explicitly in Theorem 3.6 of Liu et al. (2016). The Stein variational gradient direction φ provides a theoretically optimal direction that drives the particles towards the target p as fast as possible. In practice, SVGD approximates q using a set of particles, yielding update rule (1). 3. Gradient-Free SVGD The standard SVGD requires the gradient of the target p and cannot be applied when the gradient is unavailable. In this section, we propose a gradient-free variant of SVGD which replaces the true gradient with a surrogate gradient and corrects the bias introduced using an importance weight. We start with introducing a gradient-free variant of Stein s identity and Stein discrepancy. Gradient-Free Stein s Identity and Stein Discrepancy We can generalize Stein s identity to make it depend on a surrogate gradient x log ρ of an arbitrary auxiliary distribution ρ, instead of the true gradient x log p. The idea is to use importance weights to transform Stein s identity of ρ into an identity regarding p: recall the Stein s identity of ρ: Ex ρ[A ρ φ(x)] = 0. It can be easily seen that it is equivalent to the following importance weighted Stein s identity: p(x)A ρ φ(x) = 0, (11) which is already gradient free since it depends on p only through the value of p(x), not the gradient. (11) holds for an arbitrary auxiliary distribution ρ which satisfies ρ(x)/p(x) < for any x. Based on identity (11), it is straightforward to define an importance weighted Stein discrepancy DF,ρ(q || p) = max φ F p(x)A ρ φ(x) , (12) which is gradient-free if ρ does not depend on the gradient of p. Obviously, this includes the standard Stein discrepancy in Section 2 as special cases: if ρ = p, then DF,ρ(q || p) = DF(q || p), reducing to the original definition in (8), while if ρ = q, then DF,ρ(q || p) = DF(p || q), which switches the order of p and q. It may appear that DF,ρ(q || p) strictly generalizes the definition (8) of Stein discrepancy. One of our key observations, however, shows that this is not the case. Instead, DF,ρ(q || p) can also be viewed as a special case of DF(q || p), by replacing F in (8) with w F := {w(x)φ(x): φ F}, where w(x) = ρ(x)/p(x). Theorem 1. Let p, ρ be positive differentiable densities and w(x) = ρ(x)/p(x). We have w(x)A ρ φ(x) = A p w(x)φ(x) . (13) Therefore, DF,ρ(q || p) in (12) is equivalent to DF,ρ(q || p) = max φ F Ex q[A p w(x)φ(x) ] (14) = max φ w F Ex q[A p φ(x)] (15) = Dw F(q || p). Proof: The proof can be found in the appendix A. Identity (13) is interesting because it is gradient-free (in terms of x log p) from the left hand side, but gradientdependent from the right hand side; this is because the x log p term in Ap is cancelled out when applying Ap on the density ratio w(x) = ρ(x)/p(x). It is possible to possible to further extend our method to take ρ(x) and w(x) to be general matrix-valued functions, in which case the operator A p (w(x)φ(x)) is called diffusion Stein operator in Gorham et al. (2016), corresponding to various forms of Langevin diffusion when taking special values of w(x). We leave it as future work to explore ρ. Gradient-Free SVGD Algorithm 1 Gradient-Free SVGD (GF-SVGD) Input: Target distribution p(x); Surrogate ρ(x) and its score function sρ(x) := x log ρ. Goal: Find particles {xi}n i=1 to approximate p. Initialize particles {x0 i }n i=1 from any distribution q. for iteration t do xt+1 i xt i + xt i, i = 1, . . . , n, where xt i = ϵt,i j=1 w(xt j) sρ(xt j)k(xt j, xt i) + xjk(xt j, xt i) , where w(x) := ρ(x)/p(x), Zt = Pn j=1 w(xt j), and ϵt,i is a step size. end for Gradient-Free SVGD Theorem 1 suggests that by simply multiplying φ with an importance weight w(x) (or replacing F with w F), one can transform Stein operator Ap to operator Aρ, which depends on log ρ instead of log p (gradient-free). This idea can be directly applied to derive a gradient-free extension of SVGD, by updating the particles using velocity fields of form w(x)φ(x) from space w F: x x + ϵw(x)φ (x), (16) where φ maximzies the decrease rate of KL divergence, φ =arg max φ H Eq[A p (w(x)φ(x))], s.t. ||φ||H 1 . (17) Similar to (8), we can derive a closed-form solution for (17) when H is RKHS. To do this, it is sufficient to recall that if H is an RKHS with kernel k(x, x ), then w H is also an RKHS, with an importance weighted kernel (Berlinet & Thomas-Agnan, 2011) k(x, x ) = w(x)w(x )k(x, x ). (18) Theorem 2. When H is an RKHS with kernel k(x, x ), the optimal solution of (14) is φ /||φ ||H, where φ ( ) = Ex q[Ap(w(x)k(x, ))] (19) = Ex q[w(x)Aρk(x, )], (20) where the Stein operator Aρ is applied to variable x, Aρk(x, ) = x log ρ(x)k(x, )+ xk(x, ). Correspondingly, the optimal decrease rate of KL divergence in (17) equals the square of DF,ρ(q || p), which equals DF,ρ(q || p) = (Ex,x q[w(x)w(x )κρ(x, x )]) 1 2 , (21) where κρ(x, x ) = (A ρ) (Aρk(x, x )) and A ρ is the Stein operator applied on variable x . Proof: The proof can be found in the appendix B. The form in (21) allows us to estimate DF,ρ(q || p) empirically either using U-statistics or V-statistics when q is observed through an i.i.d. sample, with the advantage of being gradient-free. Therefore, it can be directly applied to construct gradient-free methods for goodness-of-fit tests (Liu et al., 2016; Chwialkowski et al., 2016) and black-box importance sampling (Liu & Lee, 2017) when the gradient of p is unavailable. We leave this to future works. Using the gradient-free form of φ in (20), we can readily derive a gradient-free SVGD update xi xi + xi, with j=1 w(xj)Aρk(xj, xi), (22) where the operator Aρ is applied on variable xj, and we set Z = n, viewed as a normalization constant, and ϵi = ϵw(xi), viewed as the step size of particle xi . Since ϵi/Z is a scalar, we can change it in practice without altering the set of fixed points of the update. In practice, because the variability of the importance weight w(xi) can be very large, making the updating speed of different particles significantly different, we find it is empirically better to determine ϵi directly using off-the-shelf step size schemes such as Adam (Kingma & Ba, 2015). In practice, we also replace Z = n with a self-normalization factor Z = Pn j=1 w(xj) (see Algorithm 1). We find this makes tuning step sizes become easier in practice, and more importantly, avoids to calculate the normalization constant of either p(x) or ρ(x). This sidesteps the critically challenging problem of calculating the normalization constant and allows us to essentially choose ρ(x) to be an arbitrary positive differentiable function once we can calculate its value and gradient. Choice of the Auxiliary Distribution ρ(x) Obviously, the performance of gradient-free SVGD (GF-SVGD) critically depends on the choice of the auxiliary distribution ρ(x). Theoretically, gradient-free SVGD is just a standard SVGD with the importance weighted kernel k(x, x ). Therefore, the optimal choice of ρ is essentially the problem of choosing an optimal kernel for SVGD, which, unfortunately, is a difficult, unsolved problem. In this work, we take a simple heuristic that sets ρ(x) to approximate p(x). This is based on the justification that if the original k(x, x ) has been chosen to be optimal or reasonably well , we should take ρ(x) p(x) so that k(x, x ) is close to k(x, x ) and GF-SVGD will have similar performance as the original SVGD. In this way, the problem of choosing the optimal auxiliary distribution ρ(x) and the optimal kernel k(x, x ) is separated, and different kernel selection methods can be directly plugged into the algorithm. In practice, we find that ρ(x) p(x) serves a reasonable Gradient-Free SVGD Algorithm 2 Annealed SVGD (A-SVGD) Inputs: p(x), distribution path {pt}T t=1 with p T = p. Initialize particles {x0 i }n i=1 from any distribution. for iteration t = 0, , T 1 do Update the particles to get {xt+1 i }n i=1 by running the typical SVGD with pt+1 as the target for m steps. end for Output: {x T i }n i=1 as an approximation of p. Remark: m = 1 is sufficient when T is large. heuristic when using Gaussian RBF kernel k(x, x ). Interestingly, our empirical observation shows that a widely spread ρ(x) tends to give better and more stable results than peaky ρ(x). In particular, Figure 1 in the experiment section shows that in the case when both p(x) and ρ(x) are Gaussian and RBF kernel is used, the best performance is achieved when the variance of ρ(x) is larger than the variance of p. In fact, the gradient-free SVGD update (22) still makes sense even when ρ(x) = 1, corresponding to an improper distribution with infinite variance: 1 p(xj) xjk(xj, xi). (23) This update is interestingly simple; it has only a repulsive force and relies on an inverse probability 1/p(x) to adjust the particles towards the target p(x). We should observe that it is as general as the GF-SVGD update (22) (and hence the standard SVGD update (1)), because if we replace k(x, x ) with ρ(x)ρ(x )k(x, x ), (23) reduces back to (22). All it matters is the choice of the kernel function. With a typical kernel such as RBF kernel, we empirically find that the particles by the update (23) can estimate the mean parameter reasonably well (although not optimally), but tend to overestimate the variance because the repulsive force dominates; see Figure 1. 4. Annealed Gradient-Free SVGD In practice, it may be difficult to directly find ρ(x) that closely approximates the target p, causing the importance weights to have undesirably large variance and deteriorate the performance. In this section, we introduce an annealed GF-SVGD algorithm that overcomes the difficulty of choosing ρ and improves the performance by iteratively approximating a sequence of distributions which interpolate the target distribution with a simple initial distribution. In the sequel, we first introduce the annealed version of the basic SVGD and then its combination with GF-SVGD. Annealed SVGD (A-SVGD) is a simple combination of SVGD and simulated annealing, and has been discussed by Liu et al. (2017) in the setting of reinforcement learning. Let p0(x) be a simple initial distribution. We define a path Algorithm 3 Annealed GF-SVGD (AGF-SVGD) Input: Target distribution p(x); initial distribution p0(x); intermediate distributions {pt}T t=1. Goal: Particles {xi}n i=1 to approximate p(x). Initialize particles {x0 i }n i=1 drawn from p0. for iteration t = 0, , T 1 do xt+1 i xt i + xt i, i = 1, . . . , n, where xt i = ϵt,i j=1 wt j sρ j,t+1k(xt j, xt i) + xjk(xt j, xt i) , where sρ j,t+1 = x log ρt+1(xt j) and ρt+1 is defined in (24); wt j = ρt+1(xt j)/pt+1(xt j), Zt = Pn j=1 wt j. end for Output: {x T i }n i=1 to approximate p. of distributions that interpolate between p0(x) and p(x): pt(x) p0(x)1 αtp(x)αt, where 0 = α0 < α1 < < αT = 1 is a set of temperatures. Annealed SVGD starts from a set of particle {x0 i }n i=1 drawn from p0, and at the t-th iteration, updates the particles so that {xt+1 i }n i=1 approximates the intermediate distribution pt+1 by running m steps of SVGD with pt+1 as the target. See Algorithm 2. In practice, m = 1 is sufficient when T is large. It is useful to consider the special case when p0 = const, and hence pt(x) p(x)αt, yielding an annealed SVGD update of form j=1 [ x log p(xj)k(xj, xi)+ 1 αt xjk(xj, xi)], where the repulsive force is weighted by the inverse temperature 1/αt. As αt increases from 0 to 1, the algorithm starts with a large repulsive force and gradually decreases it to match the temperature of the distribution of interest. This procedure is similar to the typical simulated annealing, but enforces the diversity of the particles using the deterministic repulsive force, instead of random noise. Annealed Gradient-Free SVGD (AGF-SVGD) is the gradient-free version of annealed SVGD which replaces the SVGD update with an GF-SVGD update. Specifically, at the t-th iteration when we want to update the particles to match pt+1, we use a GF-SVGD update with auxiliary distribution ρt+1 pt+1, which we construct by using a simple kernel curve estimation j=1 pt+1(xt j)kρ(xt j, x), (24) where kρ is a smoothing kernel (which does not have to be positive definite). Although there are other ways to approximate pt+1, this simple heuristic is computationally Gradient-Free SVGD log10 (<;=<) 0 2 4 log10 (<;=<) 0 2 4 log10 (<;=<) 0 2 4 IS MC GF-SVGD SVGD (a) MMD (b) Mean (c) Variance (d) MMD Figure 1. Results of GF-SVGD on 2D multivariate Gaussian distribution as we change the mean µρ and variance σρ of the surrogate ρ(x). We can see that the best performance is achieved by matching the mean of ρ and the target p (µρ = µ), and making σρ slightly larger than the variance σ of p (e.g., log 10(σρ/σ) 0.5 or σρ 3σ). (d) uses the same setting with ||µρ µ||2 = 8, but also adds the result of exact Monte Carlo sampling, gradient-based SVGD, and importance sampling (IS) whose proposal is ρ, the same as the auxiliary distribution used by GF-SVGD shown in the red curve. We use n = 100 particles in this plot. fast, and the usage of smoothing kernel makes ρt+1 an overdispersed estimate which we show perform well in practice (see Figure 1). Note that here ρt+1 is constructed to fit smooth curve pt+1, which leverages the function values of pt+1(x) and is insensitive to the actual distribution of the current particles {xt j}. It would be less robust to construct ρt+1 as a density estimator of distribution pt+1 because the actual distribution of the particles may deviate from what we expect in practice. The procedure is organized in Algorithm 3. Combining the idea of simulated annealing with gradient-free SVGD makes it easier to construct an initial surrogate distribution and estimate a good auxiliary distribution at each iteration, decreasing the variance of the importance weights. We find that it significantly improves the performance over the basic GF-SVGD for complex target distributions. 5. Empirical Results We test our proposed algorithms on both synthetic and realworld examples. We start with testing our methods on simple multivariate Gaussian and Gaussian mixture models, developing insights on the optimal choice of the auxiliary distribution. We then test AGF-SVGD on Gaussian-Bernoulli restricted Boltzmann machine (RBM) and compare it with advanced gradient-free MCMC such as KAMH (Sejdinovic et al., 2014) and KHMC (Strathmann et al., 2015). Finally, we apply our algorithm to Gaussian process classification on real-world datasets. We use RBF kernel k(x, x ) = exp( x x 2/h) for the updates of our proposed algorithms and the kernel approximation in (24); the bandwidth h is taken to be h=med2/(2 log(n + 1)) where med is the median of the current n particles. When maximum mean discrepancy (MMD) (Gretton et al., 2012) is applied to evaluate the sample quality, RBF kernel is used and the bandwidth is chosen based on the median distance of the exact samples so that all methods use the same bandwidth for a fair comparison. Adam optimizer (Kingma & Ba, 2015) is applied to our proposed algorithms for accelerating convergence. 5.1. Simple Gaussian Distributions We test our basic GF-SVGD in Algorithm 1 on a simple 2D multivariate Gaussian distribution to develop insights on the optimal choice of ρ. We set a Gaussian target p(x) = N(x; µ, σI) with fixed µ = (0, 0) and σ = 2.0, and an auxiliary distribution ρ(x) = N(x; µρ, σρI) where we vary the value of µρ and σρ in Figure 1. The performance is evaluated based on MMD between GF-SVGD particles and the exact samples from p (Figure 1(a)), and the mean square error (MSE) of estimating µ and σ (Figure 1(b)-(c)). Figure 1 suggests a smaller difference in mean µρ and µ generally gives better results, but the equal variance σρ = σ does not achieve the best performance. Instead, it seems that σρ 3σ gives the best result in this particular case. This suggests by choosing ρ to be a proper distribution that well covers the probability mass of p, it is possible to even outperform the gradient-based SVGD which uses ρ = p. Interestingly, even when we take σρ = , corresponding to the simple update in (23) with ρ = 1, the algorithm still performs reasonably well (although not optimally) in terms of MMD and mean estimation (Figure 1(a)-(b)). It does perform worse on the variance estimation (Figure 1(c)), and we observe that this seems to be because the repulsive force domains when σρ is large (e.g., when σρ = , only the repulsive term is left as shown in (23)), and it causes the particles to be overly diverse, yielding an over-estimation of the variance. This is interesting because we have found that the standard SVGD with RBF kernel tends to underestimate the variance, and a hybrid of them may be developed to give a more calibrated variance estimation. Gradient-Free SVGD 1000 2000 3000 iterations 10 100 1000 sample size 10 100 1000 sample size 10 100 1000 sample size (a) Convergence (b) Mean (c) Variance (d) MMD Figure 2. Results on GMM with 10 random mixture components and 25 dimensions. (a): the convergence of MMD with fixed sample size of n = 200. (b)-(c): the MSE vs. sample size when estimating the mean and variance using the particles returned by different algorithms at convergence. (d): the MMD between the particles of different methods and the true distribution p. In (b, c, d), 3000 iterations are used. For GF-AIS, the sample size n represents the number of parallel chains, and the performance is evaluated using the weighted average of the particles at the final iteration with their importance weights given by AIS. In Figure 1(d), we add additional comparisons with exact Monte Carlo (MC) which directly draws sample from p, and standard importance sampling (IS) with ρ as proposal. We find that GF-SVGD provides much better results than the standard IS strategy with any ρ. In addition, GF-SVGD can even outperform the exact MC and the standard SVGD when auxiliary distribution ρ is chosen properly (e.g., σρ 3σ). 5.2. Gaussian Mixture Models (GMM) We test GF-SVGD and AGF-SVGD on a 25-dimensional GMM with 10 randomly generated mixture components, p(x) = 1 10 P10 i=1 N(x; µi, I), with each element of µi is drawn from Uniform([ 1, 1]). The auxiliary distribution ρ(x) is a multivariate Gaussian ρ(x) = N(x; µρ, σρI), with fixed σρ = 4 and each element of µρ drawn from Uniform([ 1, 1]). For AGF-SVGD, we set its initial distribution p0 to equal the ρ above in GF-SVGD. Figure 2(a) shows the convergence of MMD vs. the number of iterations of different algorithms with a particle size of n = 200, and Figure 2(b)-(d) shows the converged performance as the sample size n varies. It is not surprising to see that that standard SVGD converges fastest since it uses the full gradient information of the target p. A-SVGD converges slightly slower in the beginning but catches up later; this is because that it uses increasingly more gradient information from p. GF-SVGD performs significantly worse, which is expected because it does not leverage the gradient information. However, it is encouraging that annealed GF-SVGD, which also leverages no gradient information, performs much better than GF-SVGD, only slightly worse than the gradient-based SVGD and A-SVGD. For comparison, we also tested a gradient-free variant of annealed importance sampling (GF-AIS) (Neal, 2001) with a transition probability constructed by Metropolis-adjusted Langevin dynamics, in which we use the same temperature scheme as our AGF-SVGD, and the same surrogate gradient x log ρt defined in (24). GF-AIS returns a set of particles with importance weights, so we use weighted averages when evaluating the MMD and the mean/variance estimation. This version of GF-AIS is highly comparable to our AGF-SVGD since both of them use the same annealing scheme and surrogate gradient. However, Figure 2 shows that AGFSVGD still significantly outperforms GF-AIS. 5.3. Gauss-Bernoulli Restricted Boltzmann Machine We further compare AGF-SVGD with two recent baselines on Gauss-Bernoulli RBM, defined by h exp(x Bh + c 1 x + c 2 h 1 2 x 2 2), (25) where x Rd and h { 1}d is a binary latent variable. By marginalizing the hidden variable h, we can see that p(x) is a special GMM with 2d components. In our experiments, we draw the parameters c1 and c2 from standard Gaussian and select each element of B randomly from { 0.5} with equal probabilities. We set the dimension d of x to be 20 and the dimension d of h to be 10 so that p(x) is a 20dimensional GMM with 210 components, for which it is still feasible to draw exact samples by brute-force for the purpose of evaluation. For AGF-SVGD, we set the initial distribution to be p0(x) = N(x; µ, σI), with µ drawn from Uniform([1, 2]) and σ = 3. We compare our AGF-SVGD with two recent gradient-free methods: KAMH (Sejdinovic et al., 2014) and KHMC (Strathmann et al., 2015). Both methods are advanced MCMC methods that adaptively improves the transition proposals based on kernel-based approximation from the history of Markov chains. For a fair comparison with SVGD, we run n parallel chains of KAMH and KHMC and take the last samples of n chains for estimation. In addition, we find that both KAMH and KHMC require a relatively long burn-in phase before the adaptive proposal becomes useful. In our experiments, we use 10,000 burn-in steps for both KAMH and KHMC, and Gradient-Free SVGD 500 1000 1500 2000 iterations 10 100 1000 sample size KAMH KHMC AGF-SVGD 10 100 1000 sample size 10 100 1000 sample size (a) Convergence (b) Mean (c) Variance (d) MMD Figure 3. Gauss-Bernoulli RBM with d = 20 and d = 10. (a): the convergence of MMD with n = 100 for all the algorithms. The evaluations of MMD of KAMH and KHMC in (a) starts from the burn-in steps of the typical MH algorithm. (b)-(c): the MSE vs. sample size when estimating the mean and variance using the particles returned by different algorithms at 2000 iterations. (d): the MMD between the particles of different algorithms and the true distribution p at 2000 iterations. exclude the computation time of burn-in when comparing the convergence speed with GF-SVGD in Figure 3; this gives KAMH and KHMC much advantage for comparison, and the practical computation speed of KAMH and KHMC is much slower than our AGF-SVGD. From Figure 3 (a), we can see that our AGF-SVGD converges fastest to the target p, even when we exclude the 10,000 burn-in steps in KAMH and KHMC. Figure 3 (b, c, d) shows that our AGF-SVGD performs the best in terms of the accuracy of estimating the mean, variance and MMD. 5.4. Gaussian Process Classification We apply our AGF-SVGD to sample hyper-parameters from marginal posteriors of Gaussian process (GP) binary classification. Consider a classification of predicting binary label y { 1} from feature z. We assume y is generated by a latent Gaussian process f(z), p(y|z) = 1/(1 + exp( yf(z))) and f is drawn from a GP prior f GP(0, kf,θ), where kf,θ is the GP kernel indexed by a hyperparameter θ. In particular, we assume kf,θ(z, z ) = exp( 1 2||(z z )./ exp(θ)||2), where ./ denotes the element-wise division and θ is a vector of the same size as z. Given a dataset Y = {yi} and Z = {zi}, we are interested in drawing samples from the posterior distribution p(θ|Z, Y ). Note that the joint posterior of (θ, f) is p(θ, f|Z, Y ) = p(Y |f, Z)p(f|θ)p(θ). Since it is intractable to exactly calculate the marginal posterior of θ, we approximate it by ˆp(θ|Z, Y ) := p(θ) 1 p(Y |f i, Z)p(f i|θ) q(f i|θ) , (26) where {f i}m i=1 is drawn from a proposal distribution q(f | θ), which is constructed by an expectation propagationbased approximation of p(f|θ, Z, Y ) following Filippone & Girolami (2014). We run multiple standard Metropolis-Hastings chains to obtain ground truth samples from p(θ | D), following 1000 2000 3000 iterations KAMH KHMC AGF-SVGD 1000 3000 5000 iterations (a) Glass dataset (b) SUSY dataset Figure 4. Sampling from the marginal posteriors on GP classification for Glass dataset (a) and SUSY dataset (b). We use a sample size of n = 200 for all methods. the procedures in section 5.1 of Sejdinovic et al. (2014) and Appendix D.3 of Strathmann et al. (2015). We test the algorithms on Glass dataset and SUSY dataset in Figure 4 from UCI repository (Asuncion & Newman, 2007) for which the dimension of θ is d = 9 and d = 18, respectively. We initialize our algorithm with draws from p0(x) = N(x; µ, σI) where σ = 3 and each element of µ is drawn from Uniform([ 1, 1]). For KAMH and KHMC, we again run n parallel chains and initialize them with an initial burn-in period of 6000 steps which is not taken into account in evaluation. Figure 4 shows that AGF-SVGD again converges faster than KAMH and KHMC, even without the additional burn-in period. 6. Conclusion We derive a gradient-free extension of Stein s identity and Stein discrepancy and propose a novel gradient-free sampling algorithm. Future direction includes theoretical investigation of optimal choice of the auxiliary proposal with which we may leverage the gradient of the target to further improve the sample efficiency over the standard SVGD. We are also interested in exploring the application of the gradient-free KSD on the goodness-of-fit tests and blackbox importance sampling. Gradient-Free SVGD Acknowledgement This work is supported in part by NSF CRII 1565796. Andrieu, Christophe and Roberts, Gareth O. The pseudomarginal approach for efficient Monte Carlo computations. The Annals of Statistics, pp. 697 725, 2009. Asuncion, Arthur and Newman, David. UCI machine learning repository, 2007. Beaumont, Mark A. Estimation of population growth or decline in genetically monitored populations. Genetics, 2003. Berlinet, Alain and Thomas-Agnan, Christine. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011. Blei, David M, Kucukelbir, Alp, and Mc Auliffe, Jon D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 2017. Capp e, Olivier, Douc, Randal, Guillin, Arnaud, Marin, Jean Michel, and Robert, Christian P. Adaptive importance sampling in general mixture classes. Statistics and Computing, 18(4):447 459, 2008. Chen, Wilson Ye, Mackey, Lester, Gorham, Jackson, Briol, Franc ois-Xavier, and Oates, Chris J. Stein points. ar Xiv:1803.10161, 2018. Chwialkowski, Kacper, Strathmann, Heiko, and Gretton, Arthur. A kernel test of goodness of fit. In ICML, 2016. Cotter, Colin, Cotter, Simon, and Russell, Paul. Parallel adaptive importance sampling. ar Xiv:1508.01132, 2015. Filippone, Maurizio and Girolami, Mark. Pseudo-marginal Bayesian inference for gaussian processes. In IEEE, 2014. Gilks, Walter R and Wild, Pascal. Adaptive rejection sampling for Gibbs sampling. Applied Statistics, 1992. Gorham, Jack, Duncan, Andrew B, Vollmer, Sebastian J, and Mackey, Lester. Measuring sample quality with diffusions. ar Xiv:1611.06972, 2016. Gorham, Jackson and Mackey, Lester. Measuring sample quality with stein s method. In Advances in Neural Information Processing Systems, pp. 226 234, 2015. Gorham, Jackson and Mackey, Lester. Measuring sample quality with kernels. ar Xiv:1703.01717, 2017. Gretton, Arthur, Borgwardt, Karsten M, Rasch, Malte J, Sch olkopf, Bernhard, and Smola, Alexander. A kernel two-sample test. JMLR, 2012. Han, Jun and Liu, Qiang. Stein variational adaptive importance sampling. In Uncertainty in Artificial Intelligence, 2017. Hoffman, Matthew D and Gelman, Andrew. The no-u-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. JMLR, 15(1):1593 1623, 2014. Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. ICLR, 2015. Liu, Qiang. Stein variational gradient descent as gradient flow. In NIPS, 2017. Liu, Qiang and Lee, Jason D. Black-box importance sampling. AISTATS, 2017. Liu, Qiang and Wang, Dilin. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In NIPS, pp. 2370 2378, 2016. Liu, Qiang, Lee, Jason D, and Jordan, Michael I. A kernelized Stein discrepancy for goodness-of-fit tests. In ICML, 2016. Liu, Yang, Ramachandran, Prajit, Liu, Qiang, and Peng, Jian. Stein variational policy gradient. In In UAI, 2017. Neal, Radford M. Annealed importance sampling. Statistics and computing, 11(2):125 139, 2001. Neal, Radford M et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011. Oates, Chris J, Girolami, Mark, and Chopin, Nicolas. Control functionals for monte carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2017. Sejdinovic, Dino, Strathmann, Heiko, Garcia, Maria Lomeli, Andrieu, Christophe, and Gretton, Arthur. Kernel adaptive Metropolis-Hastings. In ICML, 2014. Strathmann, Heiko, Sejdinovic, Dino, Livingstone, Samuel, Szabo, Zoltan, and Gretton, Arthur. Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families. In NIPS, pp. 955 963, 2015. Zhang, Cheng, Butepage, Judith, Kjellstrom, Hedvig, and Mandt, Stephan. Advances in variational inference. ar Xiv:1711.05597, 2017.