# Gradient-Free Kernel Stein Discrepancy

Matthew A. Fisher¹ and Chris J. Oates¹,² (¹Newcastle University, UK; ²Alan Turing Institute, UK)

Stein discrepancies have emerged as a powerful statistical tool, being applied to fundamental statistical problems including parameter inference, goodness-of-fit testing, and sampling. The canonical Stein discrepancies require the derivatives of a statistical model to be computed, and in return provide theoretical guarantees of convergence detection and control. However, for complex statistical models, the stable numerical computation of derivatives can require bespoke algorithmic development and render Stein discrepancies impractical. This paper focuses on posterior approximation using Stein discrepancies, and introduces a collection of non-canonical Stein discrepancies that are gradient-free, meaning that derivatives of the statistical model are not required. Sufficient conditions for convergence detection and control are established, and applications to sampling and variational inference are presented.

1 Introduction

Stein discrepancies were introduced in Gorham and Mackey [2015] as a way to measure the quality of an empirical approximation to a continuous statistical model involving an intractable normalisation constant. Rooted in Stein's method [Stein, 1972], the idea is to consider empirical averages of a large collection of test functions, each of which is known to integrate to zero under the statistical model. To date, test functions have been constructed by combining derivatives of the statistical model with reproducing kernels [Chwialkowski et al., 2016, Liu et al., 2016, Gorham and Mackey, 2017, Gong et al., 2021a,b], random features [Huggins and Mackey, 2018], diffusion coefficients and functions with bounded derivatives [Gorham et al., 2019], neural networks [Grathwohl et al., 2020], and polynomials [Chopin and Ducrocq, 2021]. The resulting discrepancies have been shown to be powerful statistical tools, with diverse applications including parameter inference [Barp et al., 2019, Matsubara et al., 2022], goodness-of-fit testing [Jitkrittum et al., 2017, Fernandez et al., 2020], and sampling [Liu and Lee, 2017, Chen et al., 2018, 2019, Riabiz et al., 2022, Hodgkinson et al., 2020, Fisher et al., 2021]. However, one of the main drawbacks of these existing works is the requirement that derivatives both exist and can be computed. The use of non-differentiable statistical models is somewhat limited, but includes, for example, Bayesian analyses where Laplace priors are used [Park and Casella, 2008, Ročková and George, 2018]. Much more common is the situation where derivatives exist but cannot easily be computed. In particular, for statistical models involving parametric differential equations, stably computing the sensitivities (i.e. the derivatives of the solution with respect to the parameters) often requires different, more computationally intensive numerical methods [Cockayne and Duncan, 2021]. For large-scale partial differential equation models, as used in finite element simulation, computation of sensitivities can increase simulation times by several orders of magnitude, if it is practical at all.
The motivation and focus of this paper is on computational methods for posterior approximation, and to this end we propose a collection of non-canonical Stein discrepancies that are gradient-free, meaning that computation of the derivatives of the statistical model is not required. Gradient-free Stein operators were introduced in Han and Liu [2018] in the context of Stein variational gradient descent [Liu and Wang, 2016], but the theoretical properties of the corresponding discrepancy have yet to be investigated. General classes of Stein discrepancies were analysed in Huggins and Mackey [2018], Gorham et al. [2019], but their main results do not cover the gradient-free Stein discrepancies developed in this work, for reasons that will be explained. The combination of gradient-free Stein operators and reproducing kernels is studied in detail, to obtain discrepancies that can be explicitly computed. The usefulness of these discrepancies depends crucially on their ability to detect the convergence and non-convergence of sequences of probability measures to the posterior target, and in both directions positive results are established.

Outline Gradient-free KSD (GF-KSD) is proposed and theoretically analysed in Section 2. The proposed discrepancy involves certain degrees of freedom, including a probability density denoted q in the sequel, and strategies for specifying these degrees of freedom are empirically assessed in Section 3. Two applications are then explored in detail: Stein importance sampling (Section 4.1) and Stein variational inference (Section 4.2). Conclusions are drawn in Section 5.

This section contains our core methodological (Section 2.1) and theoretical (Section 2.2) development. The following notation will be used:

Real Analytic Notation For a twice differentiable function f : R^d → R, let ∂_i f denote the partial derivative of f with respect to its ith argument, let ∇f denote the gradient vector with entries ∂_i f, and let ∇²f denote the Hessian matrix with entries ∂_i ∂_j f. For a sufficiently regular bivariate function f : R^d × R^d → R, let (∂_i ⊗ ∂_j)f indicate the application of ∂_i to the first argument of f, followed by the application of ∂_j to the second argument. (For derivatives of other orders, the same tensor notation will be used.)

Probabilistic Notation Let P(R^d) denote the set of probability distributions on R^d. Let δ(x) ∈ P(R^d) denote an atomic distribution located at x ∈ R^d. For π, π₀ ∈ P(R^d), let π ≪ π₀ indicate that π is absolutely continuous with respect to π₀. For π ∈ P(R^d) and (π_n)_{n∈N} ⊂ P(R^d), write π_n →_d π to indicate weak convergence of π_n to π. The symbols p and q are reserved for probability density functions on R^d, while π is reserved for a generic element of P(R^d). For convenience, the symbols p and q will also be used to refer to the probability distributions that these densities represent.

2.1 Gradient-Free Kernel Stein Discrepancy

The aim of this section is to explain how a GF-KSD can be constructed. Let p ∈ P(R^d) be a target distribution of interest. Our starting point is a gradient-free Stein operator, introduced in Han and Liu [2018] in the context of Stein variational gradient descent [Liu and Wang, 2016]:

Definition 1 (Gradient-Free Stein Operator). For p, q ∈ P(R^d) with q ≪ p and ∇log q well-defined, the gradient-free Stein operator is defined as

S_{p,q} h := (q/p) (∇·h + ⟨h, ∇log q⟩),

acting on differentiable functions h : R^d → R^d.
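To make Definition 1 concrete, the following sketch (not taken from the authors' codebase; all function names are illustrative) evaluates S_{p,q}h pointwise for a user-supplied test function h, using finite differences for the divergence so that no derivatives of the target are ever needed. The small Monte Carlo check at the end illustrates that such evaluations average to approximately zero under p, which is the vanishing-integral property formalised in Proposition 1 below; note that unnormalised densities only rescale the operator, so the check is unaffected by missing constants.

```python
# Illustrative sketch of the gradient-free Stein operator of Definition 1,
# (S_{p,q} h)(x) = (q(x)/p(x)) * (div h(x) + <h(x), grad log q(x)>).
import numpy as np

def gradient_free_stein_operator(h, log_p_tilde, log_q, grad_log_q, x, eps=1e-5):
    """Evaluate (S_{p,q} h)(x) for h : R^d -> R^d at a point x in R^d.

    log_p_tilde may be an unnormalised log density of p; missing normalisation
    constants only rescale the output by a constant factor.
    """
    d = x.shape[0]
    # Divergence of h via central finite differences (a stand-in for exact derivatives).
    div_h = 0.0
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        div_h += (h(x + e)[i] - h(x - e)[i]) / (2 * eps)
    ratio = np.exp(log_q(x) - log_p_tilde(x))      # q(x) / p~(x)
    return ratio * (div_h + h(x) @ grad_log_q(x))

if __name__ == "__main__":
    # Example: q = N(0, I), p~ an unnormalised N((0.5, 0.5), I), h(x) = tanh(x).
    rng = np.random.default_rng(0)
    log_q = lambda x: -0.5 * x @ x
    grad_log_q = lambda x: -x
    log_p_tilde = lambda x: -0.5 * (x - 0.5) @ (x - 0.5)
    h = np.tanh
    vals = [gradient_free_stein_operator(h, log_p_tilde, log_q, grad_log_q, x)
            for x in rng.normal(0.5, 1.0, size=(20000, 2))]   # samples from p
    print(np.mean(vals))   # close to zero, consistent with Proposition 1 below
```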
The Langevin Stein operator of Gorham and Mackey [2015] is recovered when p = q, but when q ≠ p the dependence on the derivatives of p is removed. The operator S_{p,q} can still be recognised as a diffusion Stein operator, being related to the infinitesimal generator of a diffusion process that leaves p invariant; however, it falls outside the scope of the theoretical analysis of Huggins and Mackey [2018], Gorham et al. [2019], for reasons explained in Remark 2. It can also be viewed as a non-standard instance of the density method of Diaconis et al. [2004]; see Section 2 of Anastasiou et al. [2023]. The inclusion of q introduces an additional degree of freedom, specific choices for which are discussed in Section 3.

Remark 1. The ratio q/p in Definition 1 could be viewed as an importance weight, but the construction does not fully correspond to importance sampling due to the ∇log q term, which is q-dependent.

The Stein operator nomenclature derives from the vanishing integral property in Proposition 1 below, which is central to Stein's method [Stein, 1972]:

Proposition 1. In the setting of Definition 1, assume that ‖x‖^{d−1} q(x) → 0 as ‖x‖ → ∞ and ∫ ‖∇log q‖ dq < ∞. Then, for any function h : R^d → R^d whose first derivatives exist and are bounded, it holds that ∫ S_{p,q}h dp = 0.

All proofs are contained in Appendix A. From Proposition 1, the expectation of S_{p,q}h with respect to π ∈ P(R^d) will be zero when π and p are equal; conversely, the value of such an expectation can be used to quantify the extent to which π and p are different. Consideration of multiple test functions increases the number and nature of the differences between π and p that may be detected. A discrepancy is obtained by specifying which test functions h are considered, and then taking a supremum over the expectations associated to this set. For computational convenience we take h to be contained in the unit ball of a reproducing kernel Hilbert space, as described next. For a symmetric positive definite function k : R^d × R^d → R, called a kernel, denote the associated reproducing kernel Hilbert space as H(k). Let H(k)^d denote the Cartesian product of d copies of H(k), equipped with the inner product ⟨h, g⟩_{H(k)^d} := Σ_{i=1}^d ⟨h_i, g_i⟩_{H(k)}.

Proposition 2. Let π ∈ P(R^d). In the setting of Definition 1, assume there is an α > 1 such that ∫ (q/p)^α dπ < ∞ and ∫ ‖∇log q‖^{α/(α−1)} dπ < ∞. Let k : R^d × R^d → R be a continuously differentiable kernel such that both k and its first derivatives x ↦ (∂_i ⊗ ∂_i)k(x, x), i = 1, …, d, are bounded. Then S_{p,q} is a bounded linear operator from H(k)^d to L¹(π).

For discrete distributions π, supported on a finite subset of R^d, the moment conditions in Proposition 2 are automatically satisfied. For general distributions π on R^d, the exponent α can be taken arbitrarily close to 1 to enable the more stringent moment condition ∫ (q/p)^α dπ < ∞ to hold. An immediate consequence of Proposition 2 is that Definition 2 below is well-defined.

Definition 2 (Gradient-Free Kernel Stein Discrepancy). For p, q, k and π satisfying the preconditions of Proposition 2, the gradient-free KSD is defined as

D_{p,q}(π) := sup { ∫ S_{p,q}h dπ : ‖h‖_{H(k)^d} ≤ 1 }.   (1)

The GF-KSD coincides with the canonical kernel Stein discrepancy (KSD) when p = q, and is thus strictly more general. Note that D_{p,q}(π) is precisely the operator norm of the linear functional h ↦ ∫ S_{p,q}h dπ, which exists due to Proposition 2.
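For a discrete distribution π the supremum in Equation (1) admits a closed form, derived in Proposition 5 and Corollary 1 of Appendix A. The sketch below (ours, with illustrative function names) evaluates that closed form for a weighted point set, using the inverse multi-quadric kernel introduced in Equation (2) below; only ∇log q and the ratio q/p̃ are required, so an unnormalised target p̃ suffices.

```python
# Sketch of the closed-form evaluation of D_{p,q}(pi)^2 (cf. Corollary 1, Appendix A)
# for pi = sum_i w_i delta(x_i) and the inverse multi-quadric kernel.
import numpy as np

def gf_ksd_squared(X, w, log_ratio_q_p, grad_log_q, sigma=1.0, beta=0.5):
    """X has shape (n, d); w are the weights of the discrete measure pi.

    log_ratio_q_p(X) = log q(x_i) - log p~(x_i), row-wise; grad_log_q(X) has shape (n, d).
    """
    n, d = X.shape
    r = np.exp(log_ratio_q_p(X))                 # q(x_i) / p~(x_i)
    G = grad_log_q(X)                            # grad log q(x_i)
    diff = X[:, None, :] - X[None, :, :]         # x_i - x_j
    sq = np.sum(diff ** 2, axis=-1)              # ||x_i - x_j||^2
    base = sigma ** 2 + sq
    # <grad log q(x_i) - grad log q(x_j), x_i - x_j>
    cross = np.einsum('id,ijd->ij', G, diff) - np.einsum('jd,ijd->ij', G, diff)
    kq = (-4 * beta * (beta + 1) * sq / base ** (beta + 2)
          + 2 * beta * (d + cross) / base ** (beta + 1)
          + (G @ G.T) / base ** beta)            # Stein kernel k_q(x_i, x_j)
    W = np.outer(w * r, w * r)
    return float(np.sum(W * kq))

if __name__ == "__main__":
    # Example: q = N(0, I_2), unnormalised target p~ proportional to N((0.3, 0.3), I_2).
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 2))
    w = np.full(200, 1 / 200)
    log_ratio = lambda X: -0.5 * np.sum(X ** 2, 1) + 0.5 * np.sum((X - 0.3) ** 2, 1)
    grad_log_q = lambda X: -X
    print(gf_ksd_squared(X, w, log_ratio, grad_log_q))
```

The cost of this evaluation is O(n² d), and replacing p by an unnormalised p̃ rescales the value by a constant, which is immaterial when comparing candidate approximations of a fixed target.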
Most common kernels satisfy the assumptions of Proposition 2, and a particularly important example is the inverse multi-quadric kernel

k(x, y) = (σ² + ‖x − y‖²)^{−β},  σ ∈ (0, ∞),  β ∈ (0, 1),   (2)

which has bounded derivatives of all orders; see Lemma 4 of Fisher et al. [2021]. The use of reproducing kernels ensures that GF-KSD can be explicitly computed; see Proposition 5 and Corollary 1 in Appendix A. A general spectral characterisation of GF-KSD is provided in Proposition 6 of the supplement, inspired by the recent work of Wynne et al. [2022]. Note that the scale of Equation (6) does not matter when one is interested in the relative performance of different π ∈ P(R^d) as approximations of a fixed target p ∈ P(R^d). In this sense, GF-KSD may be employed with an unnormalised density p̃ in place of p, where p = p̃/Z and Z is a normalisation constant. This feature makes GF-KSD applicable to problems of posterior approximation, and will be exploited for both sampling and variational inference in Section 4. On the other hand, GF-KSD is not applicable to problems in which the target distribution p_θ involves a parameter θ, such as estimation and composite hypothesis testing, since then the normalisation term Z_θ cannot be treated as constant.

Going beyond the discrepancies discussed in Section 1, several non-canonical discrepancies have recently been proposed based on Stein's method, for example sliced [Gong et al., 2021a,b], stochastic [Gorham et al., 2020], and conditional [Singhal et al., 2019] Stein discrepancies (in all cases derivatives of p are required). However, only a subset of these discrepancies have been shown to enjoy important guarantees of convergence detection and control. Convergence control is critical for the posterior approximation task considered in this work, since it guarantees that minimisation of Stein discrepancy will produce a consistent approximation of the posterior target. The aim of the next section is to establish such guarantees for GF-KSD.

2.2 Convergence Detection and Control

The canonical KSD benefits from theoretical guarantees of convergence detection and control [Gorham and Mackey, 2017]. The aim of this section is to establish analogous guarantees for GF-KSD. To set the scene, for a Lipschitz function f : R^d → R^d, denote its Lipschitz constant L(f) := sup_{x≠y} ‖f(x) − f(y)‖ / ‖x − y‖. Then, for measurable g : R^d → R, we denote the tilted Wasserstein distance as

W₁(π, p; g) := sup_{L(f)≤1} ‖ ∫ f g dπ − ∫ f g dp ‖,

whenever this expression is well-defined [Huggins and Mackey, 2018]. Note that the standard 1-Wasserstein distance W₁(π, p) is recovered when g = 1. There is no dominance relation between W₁(·, ·; g) for different g; the topologies they induce are different. The tilted Wasserstein distance induces a much weaker topology than, for example, divergences such as Kullback-Leibler or Hellinger, since it does not require absolute continuity of measures.

Theorem 1 (Convergence Detection). Let p, q ∈ P(R^d) with q ≪ p, ∇log q Lipschitz and ∫ ‖∇log q‖² dq < ∞. Assume there is an α > 1 such that the sequence (π_n)_{n∈N} ⊂ P(R^d) satisfies ∫ (q/p)^α dπ_n ∈ (0, ∞), ∫ ‖∇log q‖^{α/(α−1)} dπ_n < ∞, ∫ ‖∇log q‖^{α/(α−1)} (q/p) dπ_n < ∞, and ∫ f (q/p) dπ_n < ∞ with f(x) = ‖x‖, for each n ∈ N. Let k be a kernel such that each of k, (∂_i ⊗ ∂_i)k and (∂_i ∂_j ⊗ ∂_i ∂_j)k exist, are continuous, and are bounded, for i, j ∈ {1, …, d}. Then W₁(π_n, p; q/p) → 0 implies D_{p,q}(π_n) → 0.

Thus the convergence of π_n to p, in the sense of the tilted Wasserstein distance with g = q/p, is detected by the GF-KSD D_{p,q}.
Note that the conditions on (π_n)_{n∈N} in Theorem 1 are automatically satisfied when each π_n has finite support. Despite being a natural generalisation of KSD, GF-KSD does not in general provide weak convergence control in the equivalent theoretical context. Indeed, Gorham and Mackey [2017] established positive results on convergence control for distributions in Q(R^d), the set of probability distributions on R^d with positive density function q : R^d → (0, ∞) for which ∇log q is Lipschitz and q is distantly dissipative, meaning that

lim inf_{r→∞} inf { −⟨∇log q(x) − ∇log q(y), x − y⟩ / ‖x − y‖² : ‖x − y‖ = r } > 0.

Under the equivalent assumptions, convergence control fails for GF-KSD in general:

Proposition 3 (Convergence Control Fails in General). Let k be a (non-identically zero) radial kernel of the form k(x, y) = ϕ(x − y) for some twice continuously differentiable ϕ : R^d → R, for which the preconditions of Proposition 2 are satisfied. Then there exist p ∈ P(R^d), q ∈ Q(R^d) and a sequence (π_n)_{n∈N} ⊂ P(R^d), also satisfying the preconditions of Proposition 2, such that D_{p,q}(π_n) → 0 and yet π_n does not converge weakly to p.

The purpose of Proposition 3 is to highlight that GF-KSD is not a trivial extension of canonical KSD; it requires a bespoke treatment. This is provided in Theorem 2, next. Indeed, to ensure that GF-KSD provides convergence control, an additional condition on the tails of q is required:

Theorem 2 (Convergence Control). Let p ∈ P(R^d), q ∈ Q(R^d) be such that p is continuous and inf_{x∈R^d} q(x)/p(x) > 0. Assume there is an α > 1 such that the sequence (π_n)_{n∈N} ⊂ P(R^d) satisfies ∫ (q/p)^α dπ_n < ∞ and ∫ ‖∇log q‖^{α/(α−1)} (q/p) dπ_n < ∞, for each n ∈ N. Let k be the inverse multi-quadric kernel in Equation (2). Then D_{p,q}(π_n) → 0 implies π_n →_d p.

The proof of Theorem 2 is based on carefully re-casting GF-KSD between π and p as a canonical KSD between q and a transformed distribution π̃ (see Proposition 7 in the supplement), then appealing to the analysis of Gorham and Mackey [2017]. Compared to the analysis of KSD in Gorham and Mackey [2017], the distant dissipativity condition now appears on q (a degree of freedom), rather than on p (a distribution determined by the task at hand), offering a realistic opportunity for this condition to be verified. For example, one could take a Gaussian measure q that dominates the target distribution p for use in GF-KSD. Nevertheless, the conditions of Theorem 2 rule out distributions p that are heavy-tailed. Suitable choices for q are considered in Section 3.

Figure 1: Empirical assessment of gradient-free kernel Stein discrepancy. (a) Test sequences (π_n)_{n∈N}, defined in Appendix C.1. The first column displays sequences (solid) that converge to the distributional target p (black), while the second column displays sequences (dashed) which converge instead to a fixed Gaussian target; the panel legends distinguish outward, inward and oblique convergence and non-convergence. (b) Performance of gradient-free kernel Stein discrepancy, plotted against the sequence index n, when the approaches to selecting q described in the main text (Prior, Laplace, GMM, KDE) are employed. The colour and style of each curve in (b) indicates which of the sequences in (a) is being considered. [Here we fixed the kernel parameters σ = 1 and β = 1/2.]
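The counterexample constructed in the proof of Proposition 3 (Appendix A) takes q = N(0, 1), p = N(0, σ_p²) with σ_p > 1, and π_n = δ(n). For a point mass at x, the closed form of Appendix A reduces to D_{p,q}(δ(x))² = (q(x)/p(x))² k_q(x, x), which can be checked numerically with the short illustrative script below (ours, not from the paper; k_q(x, x) is evaluated for the inverse multi-quadric kernel with σ = 1):

```python
# Numerical illustration of Proposition 3: D_{p,q}(delta(n)) -> 0 although
# delta(n) does not converge weakly to p.  Here q = N(0,1), p = N(0, sigma_p^2).
import numpy as np

sigma_p, beta, d = 2.0, 0.5, 1
for n in [1, 2, 5, 10, 20]:
    x = float(n)
    ratio = sigma_p * np.exp(-0.5 * x**2 * (1 - 1 / sigma_p**2))   # q(x) / p(x)
    kq_xx = 2 * beta * d + x**2     # k_q(x, x) for the IMQ kernel, sigma = 1
    print(n, ratio**2 * kq_xx)      # D_{p,q}(delta(n))^2 decays to zero
```

The decay occurs because the light tail of q suppresses the Stein kernel faster than ‖∇log q‖² grows, which is exactly the pathology that the extra tail condition inf q/p > 0 in Theorem 2 rules out.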
Remark 2 (Related Work). Convergence control was established for discrepancies based on general classes of Stein operator in earlier work, but the required assumptions are too stringent when applied in our context. In particular, to use Huggins and Mackey [2018] it is required that the gradient ∇log(q/p) is bounded (to see this, take A = q/p in Theorem 3.2 of Huggins and Mackey [2018]), while to use Gorham et al. [2019] it is required that q/p is Lipschitz (to see this, take m = (q/p)I in Theorem 7 and Proposition 8 of Gorham et al. [2019]). In our context, where q must be specified in ignorance of p, such conditions, which require that q is almost as light as p in the tail, cannot be guaranteed to hold. The present paper therefore instead contributes novel analysis for the regime where q may be appreciably heavier than p in the tail.

This completes our theoretical assessment, but the practical performance of GF-KSD remains to be assessed. Suitable choices for both q and the kernel parameters σ and β are proposed and investigated in Section 3, and practical demonstrations of GF-KSD are presented in Section 4.

3 Implementation Detail

The purpose of this section is to empirically explore the effect of varying q, σ and β, aiming to arrive at reasonable default settings. In the absence of an application-specific optimality criterion, we aim to select values that perform well (in a sense to be specified) over a range of scenarios that may be encountered. Here, to assess performance, several sequences (π_n)_{n∈N} are considered, some of which converge to a specified limit p and the rest of which converge to an alternative Gaussian target; see Figure 1(a). An effective discrepancy should clearly indicate which of these sequences are convergent and which are not. On this basis, recommendations for q are considered in Section 3.1, and recommendations for σ and β in Section 3.2. Of course, we cannot expect default settings to perform universally well, so in Section 3.3 we highlight scenarios where our defaults may fail. Python code to reproduce the experiments reported below can be downloaded at [blinded].

3.1 Choice of q

In what follows we cast q as an approximation of p, aiming to inherit the desirable performance of canonical KSD for which q and p are equal. The task to which the discrepancy is being applied will, in practice, constrain the nature and form of the distributions q that can be implemented. For expository purposes (only), the following qualitatively distinct approaches to choosing q are considered:

Prior In Bayesian settings where p is a posterior distribution, selecting q to be the prior distribution ensures that the condition inf_{x∈R^d} q(x)/p(x) > 0 is satisfied.

Laplace If the target p can be differentiated, albeit at a possibly high computational cost, it may be practical to construct a Laplace approximation q to the target [Gelman et al., 2013].

GMM One could take q to be a Gaussian mixture model fitted to approximate samples from p, representing a more flexible alternative to Laplace.

KDE Additional flexibility can be obtained by employing a kernel density estimator as a nonparametric alternative to GMM.

Of course, there is a circularity to GMM and KDE which renders these methods impractical in general. The specific details of how each of the q were constructed are contained in the code that accompanies this paper, but the resulting q are displayed as insets in Figure 1(b). The performance of GF-KSD with these different choices of q is also displayed in Figure 1(b). It was observed that all four choices of q produced a discrepancy that could detect convergence of π_n to p, though the detection of convergence was less clear for Prior due to slower convergence of the discrepancy to 0 as n was increased.
On the other hand, all approaches were able to clearly detect non-convergence to the target. That Laplace performed comparably with GMM and KDE was surprising, given that the target p is not well-approximated by a single Gaussian component. These results are for a specific choice of target p, but in Appendix C.5.1 a range of p are considered and similar conclusions are obtained. Section 3.3 explores how challenging p must be before the convergence detection and control properties associated to Laplace fail.

3.2 Choice of σ and β

For the investigation in Section 3.1 the parameters of the inverse multi-quadric kernel (2) were fixed to σ = 1 and β = 1/2, the latter being the midpoint of the permitted range β ∈ (0, 1). In general, care in the selection of these parameters may be required. The parameter σ captures the scale of the data, and thus standardisation of the data may be employed to arrive at σ = 1 as a natural default. In this paper (with the exception of Section 4.2) the standardisation x ↦ C^{−1/2}x was performed, where C is the covariance matrix of the approximating distribution q being used. In Appendix C.5.2 we reproduce the investigation of Section 3.1 using a range of values for σ and β; these results indicate that the performance of GF-KSD is remarkably insensitive to perturbations around (σ, β) = (1, 1/2).

3.3 Avoidance of Failure Modes

GF-KSD is not a silver bullet, and there are a number of specific failure modes that care may be required to avoid. The four main failure modes are illustrated in Figure 2. These are as follows: (a) q is substantially heavier than p in a tail; (b) q is substantially lighter than p in a tail; (c) the dimension d is too high; (d) p has well-separated high-probability regions. Under both (a) and (c), convergence detection can fail, either because theoretical conditions are violated or because the terms π_n must be extremely close to p before convergence begins to be detected. Under (b), the values of GF-KSD at small n can mislead. Point (d) is a well-known pathology of all score-based methods; see Wenliang and Kanagawa [2021], Liu et al. [2023]. These four failure modes inform our recommended usage of GF-KSD, summarised next.

Summary of Recommendations Based on the investigation just reported, the default settings we recommend are Laplace with σ = 1 (post-standardisation) and β = 1/2. Although not universally applicable, Laplace does not require samples from p, and has no settings that must be user-specified. Thus we recommend the use of Laplace in situations where a Laplace approximation can be justified and computed. If Laplace is not applicable, then one may attempt to construct an approximation q using techniques available for the task at hand (e.g. in a Bayesian setting, one may obtain q via variational inference, or via inference based on an approximate likelihood). These recommended settings will be used for the application presented in Section 4.1, next.

Figure 2: Failure modes: (a) q is substantially heavier than p in a tail; (b) q is substantially lighter than p in a tail; (c) the dimension d is too high; (d) p has separated high-probability regions. [For each of (a), (b) and (d), the colour and style of the curves refers to the same sense of convergence or non-convergence (outward/inward/oblique) presented in Figure 1, and we plot the logarithm of the gradient-free kernel Stein discrepancy as a function of the index n of the sequence (π_n)_{n∈N}. For (c) we consider convergent sequences (π_n)_{n∈N} of distributions on R^d, with d ∈ {1, 10, 100}.]
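The Laplace default recommended above, and used for the application in Section 4.1 next, can be constructed generically along the following lines. This is a sketch using scipy, not the paper's implementation (the paper's own construction for the example of Section 4.1 relied on derivatives gathered along an optimisation path; see Appendix C.3), and all names are illustrative.

```python
# Generic sketch of the 'Laplace' choice of q: maximise log p~, estimate the Hessian
# of -log p~ at the mode by central finite differences, and return q = N(mode, H^{-1}).
import numpy as np
from scipy.optimize import minimize

def laplace_approximation(log_p_tilde, x0, eps=1e-4):
    """Return (mean, covariance) of a Laplace approximation to exp(log_p_tilde)."""
    res = minimize(lambda x: -log_p_tilde(x), x0, method="L-BFGS-B")
    mode = res.x
    d = mode.size
    H = np.zeros((d, d))                          # Hessian of -log p~ at the mode
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (-log_p_tilde(mode + e_i + e_j) + log_p_tilde(mode + e_i - e_j)
                       + log_p_tilde(mode - e_i + e_j) - log_p_tilde(mode - e_i - e_j)
                       ) / (4 * eps ** 2)
    return mode, np.linalg.inv(H)

def gaussian_score(mean, cov):
    """grad log q for q = N(mean, cov), as required by the GF-KSD formulae."""
    prec = np.linalg.inv(cov)
    return lambda X: -(np.atleast_2d(X) - mean) @ prec
```

The covariance returned here is also a natural choice for the matrix C in the standardisation of Section 3.2.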
4 Applications

To demonstrate potential uses of GF-KSD, two applications to posterior approximation are now presented: Stein importance sampling (Section 4.1) and Stein variational inference using measure transport (Section 4.2). In each case, we extend the applicability of existing algorithms to statistical models for which certain derivatives of p are either expensive or non-existent.

4.1 Gradient-Free Stein Importance Sampling

Stein importance sampling [Liu and Lee, 2017, Hodgkinson et al., 2020] operates by first sampling independently from a tractable approximation of the target p and then correcting the bias in the samples so-obtained. To date, applications of Stein importance sampling have been limited to instances where the statistical model p can be differentiated; our contribution is to remove this requirement. In what follows we analyse Stein importance sampling in which independent samples (x_n)_{n∈N} are generated from the same approximating distribution q that is employed within GF-KSD:

Theorem 3 (Gradient-Free Stein Importance Sampling). Let p ∈ P(R^d), q ∈ Q(R^d) be such that p is continuous and inf_{x∈R^d} q(x)/p(x) > 0. Suppose that ∫ exp{γ‖∇log q‖²} dq < ∞ for some γ > 0. Let k be the inverse multi-quadric kernel in Equation (2). Let (x_n)_{n∈N} be independent samples from q. To the sample, assign optimal weights

w* ∈ argmin { D_{p,q}( Σ_{i=1}^n w_i δ(x_i) ) : 0 ≤ w_1, …, w_n, w_1 + ⋯ + w_n = 1 }.

Then π_n := Σ_{i=1}^n w*_i δ(x_i) satisfies π_n →_d p almost surely as n → ∞.

The proof builds on earlier work in Riabiz et al. [2022]. Note that the optimal weights w* can be computed without the normalisation constant of p, by solving a constrained quadratic programme at cost O(n³). As an illustration, we implemented gradient-free Stein importance sampling to approximate a posterior arising from a discretely observed Lotka-Volterra model

u̇(t) = α u(t) − β u(t) v(t),  v̇(t) = −γ v(t) + δ u(t) v(t),  (u(0), v(0)) = (u₀, v₀),

with independent log-normal observations with covariance matrix diag(σ₁², σ₂²). The parameters to be inferred are {α, β, γ, δ, u₀, v₀, σ₁, σ₂} and therefore d = 8. The data analysed are due to Hewitt [1921], and full details are contained in Appendix C.3. The direct application of Stein importance sampling to this task requires the numerical calculation of sensitivities of the differential equation at each of the n samples that are to be re-weighted. Aside from simple cases where automatic differentiation or adjoint methods can be used, the stable computation of sensitivities can form a major computational bottleneck; see Riabiz et al. [2022].

Figure 3: Gradient-Free Stein Importance Sampling: (a) The lower triangular panels display n = 20 independent (biased) samples from the Laplace approximation q, while the upper triangular panels display the same number of re-weighted samples obtained using gradient-free Stein importance sampling. [Samples are shown in blue, with their size proportional to the square of their weight, to aid visualisation. The shaded background indicates the high probability regions of p, the target.] (b) The approximation quality, as a function of the number n of samples from q, is measured as the energy distance between the approximation and the target. [The solid line corresponds to the output of gradient-free Stein importance sampling, while the dashed line corresponds to the output of self-normalised importance sampling. Standard error regions are shaded.]
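The optimisation in Theorem 3 is a simplex-constrained quadratic programme in the Gram matrix K_{ij} = (q/p̃)(x_i)(q/p̃)(x_j) k_q(x_i, x_j). The sketch below (ours; illustrative names; scipy's SLSQP is used purely as a stand-in for a dedicated quadratic programming solver) shows one way such weights could be computed.

```python
# Sketch of gradient-free Stein importance sampling (Theorem 3): build the GF-KSD Gram
# matrix with the inverse multi-quadric kernel, then minimise w^T K w over the simplex.
import numpy as np
from scipy.optimize import minimize

def stein_importance_weights(X, log_ratio_q_p, grad_log_q, sigma=1.0, beta=0.5):
    n, d = X.shape
    r = np.exp(log_ratio_q_p(X))                 # q(x_i) / p~(x_i)
    G = grad_log_q(X)
    diff = X[:, None, :] - X[None, :, :]
    sq = np.sum(diff ** 2, axis=-1)
    base = sigma ** 2 + sq
    cross = np.einsum('id,ijd->ij', G, diff) - np.einsum('jd,ijd->ij', G, diff)
    kq = (-4 * beta * (beta + 1) * sq / base ** (beta + 2)
          + 2 * beta * (d + cross) / base ** (beta + 1)
          + (G @ G.T) / base ** beta)
    K = np.outer(r, r) * kq                      # so that D_{p,q}(pi_w)^2 = w^T K w
    obj = lambda w: w @ K @ w
    jac = lambda w: 2 * K @ w
    cons = [{"type": "eq", "fun": lambda w: np.sum(w) - 1}]
    res = minimize(obj, np.full(n, 1 / n), jac=jac, bounds=[(0, None)] * n,
                   constraints=cons, method="SLSQP")
    return res.x
```

With q taken to be a Laplace approximation and X drawn independently from q, the weighted empirical measure Σ_i w_i δ(x_i) is the estimator analysed in Theorem 3; the O(n³) cost quoted in the text is the cost of solving this quadratic programme.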
In contrast, our approach required a fixed number of gradient computations to construct a Laplace approximation (in this case, 49 first-order derivatives were computed, of which 48 were on the optimisation path and 1 was used to construct a finite-difference approximation to the Hessian), independent of the number n of samples required; see Appendix C.3 for detail. In the lower triangular portion of Figure 3(a), biased samples from the Laplace approximation q (≠ p) are displayed, while in the upper triangular portion the same samples are re-weighted using gradient-free Stein importance sampling to form a consistent approximation of p. A visual reduction in bias and improvement in approximation quality can be observed. As a baseline against which to assess the quality of our approximation, we consider self-normalised importance sampling, i.e. the approximation with weights w̃ such that w̃_i ∝ p(x_i)/q(x_i). Figure 3(b) reports the accuracy of the approximations to p as quantified using the energy distance [Cramér, 1928]. These results indicate that the approximations produced using gradient-free Stein importance sampling improve on those constructed using self-normalised importance sampling. This may be explained by the fact that the optimal weights w* attempt to mitigate both the bias due to q ≠ p and the Monte Carlo error, while the weights w̃ only address the bias due to q ≠ p, and do not attempt to mitigate error due to the randomness of Monte Carlo. Additional experiments in Appendix C.5.3 confirm that gradient-free Stein importance sampling achieves comparable performance with gradient-based Stein importance sampling in regimes where the Laplace approximation can be justified.

Although our recommended default settings for GF-KSD were successful in this example, an interesting theoretical question would be to characterise an optimal choice of q in this context. This appears to be a challenging problem, but we hope to address it in future work. In addition, although we focused on Stein importance sampling, our methodology offers the possibility to construct gradient-free versions of other related algorithms, including the Stein points algorithms of Chen et al. [2018, 2019], and the Stein thinning algorithm of Riabiz et al. [2022].
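For completeness, the two ingredients of the comparison in Figure 3(b) can be sketched as follows; this is illustrative code of our own (the paper's evaluation scripts are not reproduced here), showing stabilised self-normalised importance weights and an empirical energy distance between a weighted approximation and reference draws from the target.

```python
# Baseline weights and evaluation metric used in the comparison of Figure 3(b).
import numpy as np

def snis_weights(log_p_tilde_vals, log_q_vals):
    """Self-normalised importance weights, w_i proportional to p~(x_i)/q(x_i)."""
    logw = log_p_tilde_vals - log_q_vals
    w = np.exp(logw - np.max(logw))        # stabilised exponentiation
    return w / np.sum(w)

def energy_distance(X, w, Y):
    """Energy distance between sum_i w_i delta(x_i) and the empirical measure of Y."""
    v = np.full(len(Y), 1 / len(Y))
    pair = lambda A, B: np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return 2 * w @ pair(X, Y) @ v - w @ pair(X, X) @ w - v @ pair(Y, Y) @ v
```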
4.2 Stein Variational Inference Without Second-Order Gradient

Stein discrepancy was proposed as a variational objective in Ranganath et al. [2016] and has demonstrated comparable performance with the traditional Kullback-Leibler objective in certain application areas, whilst abolishing the requirement that the variational family is absolutely continuous with respect to the statistical model [Fisher et al., 2021]. This offers an exciting, as yet largely unexplored, opportunity to construct flexible variational families outside the conventional setting of normalising flows (which are constrained to be diffeomorphisms of R^d). However, gradient-based stochastic optimisation of the canonical Stein discrepancy objective function means that second-order derivatives of the statistical model are required. GF-KSD reduces the order of derivatives that are required from second-order to first-order, and in many cases (such as when differential equations appear in the statistical model) this will correspond to a considerable reduction in computational cost.

In what follows we fix a reference distribution R ∈ P(R^d) and a parametric class of maps T^θ : R^d → R^d, θ ∈ R^p, and consider the variational family (π_θ)_{θ∈R^p} ⊂ P(R^d) whose elements π_θ := T^θ_# R are the pushforwards of R through the maps T^θ, θ ∈ R^p. The aim is to use stochastic optimisation to minimise θ ↦ D_{p,q}(π_θ) (or, equivalently, any strictly increasing transformation thereof). For this purpose a low-cost unbiased estimate of the gradient of the objective function is required; the details and sufficient theoretical conditions are contained in Appendix B. As an interesting methodological extension, which departs from the usual setting of stochastic optimisation, here we consider interlacing stochastic optimisation over θ with the selection of q, leveraging the current value θ_m on the optimisation path to provide a natural candidate π_{θ_m} for q in this context (if the density of π_{θ_m} is not available, for example because T^θ is a complicated mapping, it can be consistently estimated using independent samples from T^{θ_m}_# R). Thus, for example, a vanilla stochastic gradient descent routine becomes

θ_{m+1} = θ_m − ϵ ∇_θ D_{p,π_{θ_m}}(π_θ)² |_{θ=θ_m}

for some learning rate ϵ > 0. (In this iterative setting, to ensure the variational objective remains fixed, we do not perform the standardisation of the data described in Section 3.2.)

To assess the performance of GF-KSD in this context, we re-instantiated an experiment from Fisher et al. [2021]. The results, in Figure 4, concern the approximation of banana and sinusoidal distributions in dimension d = 2, and were obtained using the reference distribution R = N(0, 2I) and taking T^θ to be the inverse autoregressive flow of Kingma et al. [2016]. These are both toy problems, which do not themselves motivate our methodological development, but do enable us to have an explicit ground truth against which to benchmark performance. Full experimental detail is contained in Appendix C.4; we highlight that gradient clipping was used, both to avoid extreme values of q/p encountered on the optimisation path, and to accelerate the optimisation itself [Zhang et al., 2019]. The rightmost panel of Figure 4 depicts the result of performing variational inference with the standard KSD objective functional. It is interesting to observe that GF-KSD leads to similar performance in both examples, with the caveat that stochastic optimisation was more prone to occasional failure when GF-KSD was used. The development of a robust optimisation technique in this context requires care and a detailed empirical assessment, and is left as a promising avenue for further research.

Figure 4: Stein Variational Inference Without Second-Order Gradient: The top row concerns approximation of a distributional target p that is banana-shaped, while the bottom row concerns a sinusoidal target. The first four columns depict the variational approximation π_{θ_m} to p constructed using gradient descent applied to gradient-free kernel Stein discrepancy (i.e. first-order derivatives of p required) along the stochastic optimisation sample path (m ∈ {0, 2×10³, 10⁴, 2×10⁴}), while the final column reports the corresponding approximation (m = 2×10⁴) constructed using standard kernel Stein discrepancy (i.e. second-order derivatives of p required).

5 Conclusion

In this paper GF-KSD was proposed and studied. Theoretical and empirical results support the use of GF-KSD in settings where an initial approximation to the distributional target can readily be constructed, and where the distributional target itself does not contain distant high-probability regions, but poor performance can occur outside this context.
Nevertheless, for many statistical analyses the principal challenge is the cost of evaluating the statistical model and its derivatives, rather than the complexity of the target itself, and in these settings the proposed discrepancy has the potential to be usefully employed. The focus of this work was on posterior approximation, with illustrative applications to sampling and variational inference being presented. A natural extension to this work would involve a systematic empirical assessment of the performance of GF-KSD across a broad range of applied contexts. However, we note that GF-KSD is not applicable to problems such as estimation and composite hypothesis testing, where the target pθ ranges over a parametric model class, and alternative strategies will be required to circumvent gradient computation in that context. Acknowledgements MAF was supported by EP/W522387/1. CJO was supported by EP/W019590/1. The authors are grateful to Franc ois-Xavier Briol, Jon Cockayne, Jeremias Knoblauch, Lester Mackey, Marina Riabiz, Rob Salomone, Leah South, and George Wynne for insightful comments on an early draft of the manuscript, as well as to the Reviewers at Neur IPS. A. Anastasiou, A. Barp, F.-X. Briol, B. Ebner, R. E. Gaunt, F. Ghaderinezhad, J. Gorham, A. Gretton, C. Ley, Q. Liu, L. Mackey, C. J. Oates, G. Reinert, and Y. Swan. Stein s method meets statistics: A review of some recent developments. Statistical Science, 38(1):120 139, 2023. A. Barp, F.-X. Briol, A. Duncan, M. Girolami, and L. Mackey. Minimum Stein discrepancy estimators. In Proceedings of the 33rd Conference on Neural Information Processing Systems, volume 32, pages 12964 12976, 2019. W. Y. Chen, L. Mackey, J. Gorham, F.-X. Briol, and C. J. Oates. Stein points. In Proceedings of the 35th International Conference on Machine Learning, pages 844 853. PMLR, 2018. W. Y. Chen, A. Barp, F.-X. Briol, J. Gorham, M. Girolami, L. Mackey, and C. J. Oates. Stein point Markov chain Monte Carlo. In Proceedings of the 36th International Conference on Machine Learning, pages 1011 1021. PMLR, 2019. N. Chopin and G. Ducrocq. Fast compression of MCMC output. Entropy, 23(8):1017, 2021. K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In Proceedings of the 33rd International Conference on Machine Learning, pages 2606 2615. PMLR, 2016. J. Cockayne and A. Duncan. Probabilistic gradients for fast calibration of differential equation models. SIAM/ASA Journal on Uncertainty Quantification, 9(4):1643 1672, 2021. H. Cram er. On the composition of elementary errors. Scandinavian Actuarial Journal, (1):141 180, 1928. P. Diaconis, C. Stein, S. Holmes, and G. Reinert. Use of exchangeable pairs in the analysis of simulations. In Stein s Method, pages 1 25. Institute of Mathematical Statistics, 2004. T. Fernandez, N. Rivera, W. Xu, and A. Gretton. Kernelized Stein discrepancy tests of goodnessof-fit for time-to-event data. In Proceedings of the 37th International Conference on Machine Learning, pages 3112 3122. PMLR, 2020. M. Fisher, T. Nolan, M. Graham, D. Prangle, and C. J. Oates. Measure transport with kernel Stein discrepancy. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, pages 1054 1062. PMLR, 2021. (Here we refer to the error-corrected version ar Xiv:2010.11779.). A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. CRC Press, 2013. W. Gong, Y. Li, and J. M. Hern andez-Lobato. Sliced kernelized Stein discrepancy. 
In Proceedings of the 9th International Conference on Learning Representations, 2021a. W. Gong, K. Zhang, Y. Li, and J. M. Hern andez-Lobato. Active slices for sliced Stein discrepancy. In Proceedings of the 38th International Conference on Machine Learning, pages 3766 3776, 2021b. J. Gorham and L. Mackey. Measuring sample quality with Stein s method. In Proceedings of the 29th Conference on Neural Information Processing Systems, pages 226 234, 2015. J. Gorham and L. Mackey. Measuring sample quality with kernels. In Proceedings of the 34th International Conference on Machine Learning, pages 1292 1301. PMLR, 2017. J. Gorham, A. B. Duncan, S. J. Vollmer, and L. Mackey. Measuring sample quality with diffusions. The Annals of Applied Probability, 29(5):2884 2928, 2019. J. Gorham, A. Raj, and L. Mackey. Stochastic Stein discrepancies. In Proceedings of the 34th Conference on Neural Information Processing Systems, 2020. W. Grathwohl, K.-C. Wang, J.-H. Jacobsen, D. Duvenaud, and R. Zemel. Learning the Stein discrepancy for training and evaluating energy-based models without sampling. In Proceedings of the 37th International Conference on Machine Learning, pages 3732 3747. PMLR, 2020. J. Han and Q. Liu. Stein variational gradient descent without gradient. In Proceedings of the 35th International Conference on Machine Learning, pages 1900 1908. PMLR, 2018. C. G. Hewitt. The Conservation of the Wild Life of Canada. Charles Scribner s Sons, 1921. L. Hodgkinson, R. Salomone, and F. Roosta. The reproducing Stein kernel approach for post-hoc corrected sampling. ar Xiv:2001.09266, 2020. J. H. Huggins and L. Mackey. Random feature Stein discrepancies. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 1903 1913, 2018. W. Jitkrittum, W. Xu, Z. Szab o, K. Fukumizu, and A. Gretton. A linear-time kernel goodness-of-fit test. In Proceedings of the 31st Conference on Neural Information Processing Systems. NIPS Foundation, 2017. D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, 2015. D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Proceedings of the 30th Conference on Neural Information Processing Systems, 2016. D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(3):503 528, 1989. Q. Liu and J. Lee. Black-box importance sampling. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 952 961. PMLR, 2017. Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Proceedings of the 30th Conference on Neural Information Processing Systems, 2016. Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of the 33rd International Conference on Machine Learning, pages 276 284. PMLR, 2016. X. Liu, A. Duncan, and A. Gandy. Using perturbation to improve goodness-of-fit tests based on kernelized Stein discrepancy. In Proceedings of the 40th International Conference on Machine Learning, 2023. T. Matsubara, J. Knoblauch, F.-X. Briol, and C. J. Oates. Robust generalised Bayesian inference for intractable likelihoods. Journal of the Royal Statistical Society, Series B, 84(3):997 1022, 2022. B. O Donoghue, E. Chu, N. Parikh, and S. Boyd. 
Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3): 1042 1068, June 2016. T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103 (482):681 686, 2008. D. Prangle and C. Viscardi. Distilling importance sampling for likelihood free inference. Journal of Computational and Graphical Statistics, 2023. To appear. R. Ranganath, D. Tran, J. Altosaar, and D. Blei. Operator variational inference. In Proceedings of the 30th Conference on Neural Information Processing Systems, pages 496 504, 2016. M. Riabiz, W. Chen, J. Cockayne, P. Swietach, S. A. Niederer, L. Mackey, and C. J. Oates. Optimal thinning of MCMC output. Journal of the Royal Statistical Society, Series B, 84(4):1059 1081, 2022. V. Roˇckov a and E. I. George. The Spike-and-Slab LASSO. Journal of the American Statistical Association, 113(521):431 444, 2018. G. Schwarz. Estimating the Dimension of a Model. The Annals of Statistics, 6(2):461 464, 1978. B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall, 1986. R. Singhal, X. Han, S. Lahlou, and R. Ranganath. Kernelized complete conditional Stein discrepancy. ar Xiv:1904.04478, 2019. Stan Development Team. RStan: the R interface to Stan, 2022. URL https://mc-stan.org/. R package version 2.21.5. C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, volume 6.2, pages 583 603. University of California Press, 1972. I. Steinwart and A. Christmann. Support Vector Machines. Springer Science & Business Media, 2008. L. K. Wenliang and H. Kanagawa. Blindness of score-based methods to isolated components and mixing proportions. In Proceedings of the Neur IPS workshop Your Model is Wrong , 2021. G. Wynne, M. Kasprzak, and A. B. Duncan. A spectral representation of kernel Stein discrepancy with application to goodness-of-fit tests for measures on infinite dimensional Hilbert spaces. ar Xiv:2206.04552, 2022. J. Zhang, T. He, S. Sra, and A. Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In International Conference on Learning Representations, 2019. These appendices supplement the paper Gradient Free Kernel Stein Discrepancy. The proofs for theoretical results stated in the main text are contained in Appendix A. Theoretical analysis of stochastic gradient estimators for use in the context of Stein Variational Inference is contained in Appendix B, as advertised in Section 4.2 of the main text. All additional details related to empirical assessment are contained in Appendix C. A Proof of Results in the Main Text This appendix contains proofs for all novel theoretical results reported in the main text. Proof of Proposition 1. First we show that the integral in the statement is well-defined. Since h and its first derivatives are bounded, we can define the constants C0 := sup{ h(x) : x Rd} and C1 := sup{| h(x)| : x Rd}. Then Z |Sp,qh| dp = Z q p[ h + h log q] dp = Z | h + h log q| dq Z log q dq < , so indeed the integral is well-defined. Now, let Br = {x Rd : x r}, Sr = {x Rd : x = r}, and t(r) := sup{q(x) : x Sr}, so that by assumption rd 1t(r) 0 as r . Let 1B(x) = 1 if x B and 0 if x / B. 
Then R Sp,qh dp = limr R 1Br Sp,qh dp and, from the divergence theorem, Z 1Br Sp,qh dp = Z 1Br q p[ h + h log q] dp Br [q h + h q] dx Sr qh n dx t(r) 2πd/2 Γ(d/2)rd 1 r 0, as required. Proof of Proposition 2. Since k and its first derivatives are bounded, we can set Ck 0 := sup x Rd k(x, x), Ck 1 := sup x Rd i=1 ( i i)k(x, x). From Cauchy Schwarz and the reproducing property, for any f H(k) it holds that |f(x)| = | f, k( , x) H(k)| f H(k) k( , x) H(k) = f H(k) p k( , x), k( , x) H(k) = f H(k) p k(x, x). Furthermore, using the fact that k is continuously differentiable, it can be shown that | if(x)| f H(k) p ( i i)k(x, x); see Corollary 4.36 of Steinwart and Christmann [2008]. As a consequence, for all h H(k)d we have that, for all x Rd, i=1 k(x, x) hi 2 H(k) = Ck 0 h H(k)d i=1 xihi(x) ( i i)k(x, x) hi H(k) Ck 1 h H(k)d. To use analogous notation as in the proof of Proposition 1, set C0 := Ck 0 h H(k)d and C1 := Ck 1 h H(k)d. Then, using H older s inequality and the fact that (a + b)β 2β 1(aβ + bβ) with β = α/(α 1), we have Sp,qh L1(π) = Z |Sp,qh| dπ = Z q p[ h + h log q] dπ α Z ( h + h log q) α α 1 dπ α 1 α α 1 1 + C Z log q α α 1 dπ α 1 2 1 α h H(k)d Z q α (Ck 1 ) α α 1 + (Ck 0 ) α α 1 Z log q α α 1 dπ α 1 as required. To prove obtain explicit computable formulae for the GF-KSD, two intermediate results are required: Proposition 4. Let k and π satisfy the preconditions of Proposition 2. Then the function Rd x 7 f(x) := q(x) p(x)[ xk(x, ) + k(x, ) log q(x)] (3) takes values in H(k)d, is Bochner π-integrable and, thus, ξ := R f dπ H(k)d. Proof. Since k has continuous first derivatives x 7 ( i i)k(x, x), Lemma 4.34 of Steinwart and Christmann [2008] gives that ( i 1)k(x, ) H(k), and, thus f H(k)d. Furthermore, f : Rd H(k)d is Bochner π-integrable since Z f(x) H(k)d dπ(x) = Z q(x) p(x) xk(x, ) + k(x, ) log q(x) H(k)ddπ(x) α Z xk(x, ) + k(x, ) log q(x) α α 1 H(k)d dπ(x) α 1 α (Ck 1 ) α α 1 + (Ck 0 ) α α 1 Z log q α α 1 dπ α 1 where we have employed the same Ck 0 and Ck 1 notation as used in the proof of Proposition 2. Thus, from the definition of the Bochner integral, ξ = R f dπ exists and is an element of H(k)d. Proposition 5. Let k and π satisfy the preconditions of Proposition 2. Then Dp,q(π)2 = ZZ q(x) p(x) q(y) p(y)kq(x, y) dπ(x)dπ(y) (4) where kq(x, y) = x yk(x, y) + xk(x, y), y log q(y) + yk(x, y), x log q(x) + k(x, y) x log q(x), y log q(y) . (5) Proof. Let f be as in Equation (3). From Proposition 4, ξ = R f dπ H(k)d. Moreover, since f is Bochner π-integrable and Tf = h, f H(k)d is a continuous linear functional on H(k)d, from basic properties of Bochner integrals we have Tξ = T R f dπ = R Tf dπ. In particular, h, ξ H(k)d = h, Z q(x) p(x) [ xk(x, ) + k(x, ) log q(x)] dπ(x) p(x) x h, k(x, ) H(k)d + h, k(x, ) H(k)d log q(x) dπ(x) p(x) [ h(x) + h(x) log q(x)] dπ(x) = Z Sp,qh dπ(x) which shows that ξ is the Riesz representer of the bounded linear functional h 7 R Sp,qh dπ on H(k)d. It follows from Cauchy Schwarz that the (squared) operator norm of this functional is Dp,q(π)2 = ξ 2 H(k)d = ξ, ξ H(k)d = Z q(x) p(x) [ xk(x, ) + k(x, ) log q(x)] dπ(x), Z q(y) p(y) [ yk(y, ) + k(y, ) log q(y)] dπ(y) p(x) q(y) p(y)kq(x, y) dπ(x)dπ(y) as claimed. For concreteness, we instantiate Proposition 5 in the specific case of the inverse multi-quadric kernel, since this is the kernel that we recommend in Section 2.2: Corollary 1 (Explicit Form). 
For p, q, and π satisfying the preconditions of Proposition 2, and k the inverse multi-quadric kernel in Equation (2), we have that Dp,q(π)2 = ZZ q(x)q(y) 4β(β + 1) x y 2 (σ2 + x y 2)β+2 + 2β d + log q(x) log q(y), x y (σ2 + x y 2)1+β + log q(x), log q(y) (σ2 + x y 2)β dπ(x)dπ(y) (6) Proof of Corollary 1. First we compute derivatives of the kernel k in Equation (2): xk(x, y) = 2β (σ2 + x y 2)β+1 (x y) yk(x, y) = 2β (σ2 + x y 2)β+1 (x y) x yk(x, y) = 4β(β + 1) x y 2 (σ2 + x y 2)β+2 + 2βd (σ2 + x y 2)β+1 Letting u(x) := log q(x) form a convenient shorthand, we have that kq(x, y) := x yk(x, y) + xk(x, y), u(y) + yk(x, y), u(x) + k(x, y) u(x), u(y) = 4β(β + 1) x y 2 (σ2 + x y 2)β+2 + 2β " d + u(x) u(y), x y (σ2 + x y 2)1+β + u(x), u(y) (σ2 + x y 2)β which, combined with Proposition 5, gives the result. In addition to the results in the main text, here we present a spectral characterisation of GF-KSD. The following result was inspired by an impressive recent contribution to the literature on kernel Stein discrepancy due to Wynne et al. [2022], and our (informal) proof is based on an essentially identical argument: Proposition 6 (Spectral Characterisation). Consider a positive definite isotropic kernel k, and recall that Bochner s theorem guarantees k(x, y) = R e i s,x y dµ(s) for some µ P(Rd). Then, under regularity conditions that we leave implicit, Dp,q(π)2 = Z n e i s,x q(x) ise i s,x q(x) o dπ(x) C dµ(s). (7) The Fourier transform c q of q is defined as R e i s,x q(x) dx, and a basic property of the Fourier transform is that the transform of a derivative can be computed using the expression is R e i s,x q(x) dx. This implies that the inner integral in (7) vanishes when π and p are equal. Thus we can interpret GF-KSD as a quantification of the uniformity of dπ/dp, with a weighting function based on the Fourier derivative identity with regard to q. Proof of Proposition 6. From direct calculation, and assuming derivatives and integrals can be interchanged, we have that xk(x, y) = Z ise i s,x y dµ(s), (8) yk(x, y) = Z ise i s,x y dµ(s), (9) x yk(x, y) = Z s 2e i s,x y dµ(s). (10) η(x, s) = 1 p(x) n e i s,x q(x) ise i s,x q(x) o = q(x) n e i s,x log q(x) ise i s,x o and note through direct calculation and Equations (8) to (10) that Z η(x, s) η(y, s) dµ(s) p(x) q(y) p(y) Z s 2 + is log q(x) is log q(y) + log q(x) log q(y) e i s,x y dµ(s) p(x) q(y) p(y) x yk(x, y) + yk(x, y) log q(x) + xk(x, y) log q(y) + k(x, y) log q(x) log q(y) p(x) q(y) p(y)kq(x, y). Thus, integrating with respect to π, and assuming that we may interchange the order of integrals, we have that Z Z η(x, s) dπ(x) C dµ(s) = Z ZZ η(x, s) η(y, s) dπ(x)dπ(y) dµ(s) = ZZ Z η(x, s) η(y, s) dµ(s) dπ(x)dπ(y) p(x) q(y) p(y)kq(x, y) dπ(x)dπ(y) = Dp,q(π)2, where the final equality is Proposition 5. This establishes the result. To prove Theorem 1, two intermediate results are required: Proposition 7. For an element π P(Rd), assume Z := R (q/p) dπ (0, ). Assume that k and π satisfy the preconditions of Proposition 2, and that R log q α/(α 1)(q/p) dπ < . Let π := (qπ)/(p Z). Then π P(Rd) and Dp,q(π) = ZDq,q( π). Proof. The assumption Z (0, ) implies that π P(Rd). Furthermore, the assumption R log q α/(α 1)(q/p) dπ < implies that R log q α/(α 1) d π < . Thus the assumptions of Proposition 2 are satisfied for both π and π, and thus both Dp,q(π) and Dq,q( π) are welldefined. 
Now, with ξ as in Proposition 4, notice that Dp,q(π) = ξ H(k)s = p(x) [ k(x, ) + k(x, ) log q(x)] dπ(x) H(k)d Z [ k(x, ) + k(x, ) log q(x)] Zd π(x) H(k)d = ZDq,q( π), as claimed. Proposition 8. Let f : Rd [0, ) and π P(Rd). Then R f α dπ > 0 R f dπ > 0, for all α (0, ). Proof. From the definition of the Lebesgue integral, we have that R f α dπ = sup{ R s dπ : s a simple function with 0 s f α} > 0. Thus there exists a simple function s = Pm i=1 si1Si with 0 s f α and R s dπ > 0. Here the si R and the measurable sets Si Rd are disjoint. In particular, it must be the case that at least one of the coefficients si is positive; without loss of generality suppose s1 > 0. Then s := s1/α 1 1S1 is a simple function with 0 s f and R s dπ > 0. It follows that R f dπ = sup{ R s dπ : s a simple function with 0 s f} > 0. Proof of Theorem 1. Since R (q/p)α dπn (0, ) and q/p 0, from Proposition 8 we have, for each n, that Zn := R q/p dπn > 0. Thus the assumptions of Proposition 7 are satisfied by k and each πn, which guarantees that Dp,q(πn) = Zn Dq,q( πn) where πn := (qπn)/(p Zn) P(Rd). Now, since W1(πn, p; q/p) 0, taking f = 1 we obtain Zn = R fq/p dπn R fq/p dp = 1. In addition, note that W1( πn, q) = sup L(f) 1 Z f d πn Z f dq = sup L(f) 1 Z f(0) + [f(x) f(0)] d πn(x) Z f(0) + [f(x) f(0)] dq(x) = sup L(f) 1 Z [f(x) f(0)] d πn(x) Z [f(x) f(0)] dq(x) = sup L(f) 1 f(0)=0 Z f d πn Z f dq Thus, from the triangle inequality, we obtain the bound W1( πn, q) = sup L(f) 1 f(0)=0 Z f d πn Z f dq = sup L(f) 1 f(0)=0 Z fq p Zn dπn Z fq sup L(f) 1 f(0)=0 Z fq p Zn dπn Z fq + sup L(f) 1 f(0)=0 sup L(f) 1 f(0)=0 as n . For ( ), since q/p 0, the supremum is realised by f(x) = x and ( ) = Z x q(x) p(x) dπn(x) < . Thus we have established that W1( πn, q) 0. Since log q is Lipschitz with R log q 2 dq < and k has continuous and bounded second derivatives, the standard kernel Stein discrepancy has 1-Wasserstein convergence detection [Proposition 9 of Gorham and Mackey, 2017], meaning that W1( πn, q) 0 implies that Dq,q( πn) 0 and thus, since Zn 1, Dp,q(πn) 0. This completes the proof. To prove Proposition 3, an intermediate result is required: Proposition 9. Let k(x, y) = ϕ(x y) be a kernel with ϕ twice differentiable and let q P(Rd) with log q well-defined. Then kq(x, x) = ϕ(0) + ϕ(0) log q(x) 2, where = and kq was defined in Proposition 5. Proof. First, note that we must have ϕ(0) = 0, else the symmetry property of k would be violated. Now, xk(x, y) = ( ϕ)(x y), yk(x, y) = ( ϕ)(x y) and x yk(x, y) = ϕ(x y). Thus xk(x, y)|y=x = yk(x, y)|x=y = 0 and x yk(x, y)|x=y = ϕ(0). Plugging these expressions into Equation (5) yields the result. Proof of Proposition 3. Let kq be defined as in Proposition 5. From Cauchy Schwarz, we have that kq(x, y) p kq(y, y), and plugging this into Proposition 5 we obtain the bound Dp,q(π) Z q(x) kq(x, x) dπ(x) (11) For a radial kernel k(x, y) = ϕ(x y) with ϕ twice differentiable, we have ϕ(0) > 0 (else k must be the zero kernel, since by Cauchy Schwarz |k(x, y)| p k(y, y) = ϕ(0) for all x, y Rd), and kq(x, x) = ϕ(0) + ϕ(0) log q(x) 2 (from Proposition 9). Plugging this expression into Equation (11) and applying Jensen s inequality gives that Dp,q(π)2 Z q(x)2 p(x)2 ϕ(0) + ϕ(0) log q(x) 2 dπ(x). Now we may pick a choice of p, q and (πn)n N (πn d p) for which this bound can be made arbitrarily small. One example is q = N(0, 1), p = N(0, σ2) (any fixed σ > 1), for which we have Dp,q(π)2 Z σ2 exp( γ x 2) ϕ(0) + ϕ(0) x 2 dπ(x) where γ = 1 σ 2 > 0. 
Then it is clear that, for example, the sequence πn = δ(ne1) (where e1 = [1, 0, . . . , 0] ) satisfies the assumptions of Proposition 2 and, for this choice, Dp,q(πn)2 σ2 exp( γn2) ϕ(0) + ϕ(0)n2 0 and yet πn d p, as claimed. Proof of Theorem 2. Since infx Rd q(x)/p(x) > 0, for each n we have Zn := R q/p dπn > 0 and, furthermore, the assumption R log q α/(α 1)(q/p) dπn < implies that R log q α/(α 1) dπn < . Thus the assumptions of Proposition 7 are satisfied by k and each πn, and thus we have Dp,q(πn) = Zn Dq,q( πn) where πn := (qπn)/(p Zn) P(Rd). From assumption, Zn infx Rd q(x)/p(x) is bounded away from 0. Thus if Dp,q(πn) 0 then Dq,q( πn) 0. Furthermore, since q Q(Rd) and the inverse multi-quadric kernel k is used, the standard kernel Stein discrepancy has convergence control, meaning that Dq,q( πn) 0 implies πn d q [Theorem 8 of Gorham and Mackey, 2017]. It therefore suffices to show that πn d q implies πn d p. From the Portmanteau theorem, πn d p is equivalent to R g dπn R g dp for all functions g which are continuous and bounded. Thus, for an arbitrary continuous and bounded function g, consider f = gp/q, which is also continuous and bounded. Then, since πn d q, we have (again from the Portmanteau theorem) that Z 1 n R g dπn = R f d πn R f dq = R g dp. Furthermore, the specific choice g = 1 shows that Z 1 n 1, and thus R g dπn R g dp in general. Since g was arbitrary, we have established that πn d p, completing the proof. To prove Theorem 3, an intermediate result is required: Proposition 10. Let Q P(Rd) and let kq : Rd Rd R be a reproducing kernel with R kq(x, ) dq = 0 for all x Rd. Let (xn)n N be a sequence of random variables independently sampled from q and assume that R exp{γkq(x, x)} dq(x) < for some γ > 0. Then almost surely as n . Proof. This is Lemma 4 in Riabiz et al. [2022], specialised to the case where samples are independent and identically distributed. Although not identical to the statement in Riabiz et al. [2022], one obtains this result by following an identical argument and noting that the expectation of kq(xi, xj) is identically 0 when i = j (due to independence of xi and xj), so that bounds on these terms are not required. Proof of Theorem 3. Since πn has finite support, all conditions of Theorem 2 are satisfied. Thus it is sufficient to show that almost surely Dp,q(πn) 0. To this end, we follow Theorem 3 of Riabiz et al. [2022] and introduce the classical importance weights wi = p(xi)/q(xi), which are well-defined since q > 0. The normalised weights wi = wi/Wn, Wn := Pn j=1 wj satisfy 0 w1, . . . , wn and w1 + + wn = 1, and thus the optimality of w , together with the integral form of the GF-KSD in Equation (4), gives that i=1 w i δ(xi) i=1 wiδ(xi) j=1 kq(xi, xj) From the strong law of large numbers, almost surely n 1Wn R p q dq = 1. Thus it suffices to show that the final term in Equation (12) converges almost surely to 0. To achieve this, we can check the conditions of Proposition 10 are satisfied. Since q and h( ) = Sp,qk(x, ) satisfy the conditions of Proposition 1, the condition R kq(x, ) dq = 0 is satisfied. Let ϕ(z) = (1 + z 2) β so that k(x, y) = ϕ(x y) is the inverse multi-quadric kernel. Note that ϕ(0) = 1 and ϕ(0) = d. Then, from Proposition 9, we have that kq(x, x) = ϕ(0) + ϕ(0) log q 2 = d + log q 2. Then Z exp{γkq(x, x)} dq(x) = exp{γd} Z exp{γ log q 2} dq < , which establishes that the conditions of Proposition 10 are satisfied and completes the proof. 
B Stein Variational Inference Without Second-Order Gradient

This section contains sufficient conditions for unbiased stochastic gradient estimators to exist in the context of Stein variational inference; see Section 4.2 of the main text. The main result that we prove is as follows:

Proposition 11 (Stochastic Gradients). Let $p, q, R \in \mathcal{P}(\mathbb{R}^d)$ and $T^\theta : \mathbb{R}^d \to \mathbb{R}^d$ for each $\theta \in \mathbb{R}^p$. Let $\theta \mapsto \nabla_\theta T^\theta(x)$ be bounded. Assume that for each $\vartheta \in \mathbb{R}^p$ there is an open neighbourhood $N_\vartheta \subseteq \mathbb{R}^p$ such that
$$ \int \sup_{\theta \in N_\vartheta} \left( \frac{q(T^\theta(x))}{p(T^\theta(x))} \right)^2 \mathrm{d}R(x) < \infty, \qquad \int \sup_{\theta \in N_\vartheta} \frac{q(T^\theta(x))}{p(T^\theta(x))} \|\nabla \log r(T^\theta(x))\| \,\mathrm{d}R(x) < \infty, \qquad \int \sup_{\theta \in N_\vartheta} \frac{q(T^\theta(x))}{p(T^\theta(x))} \|\nabla^2 \log r(T^\theta(x))\| \,\mathrm{d}R(x) < \infty, $$
for each of $r \in \{p, q\}$. Let $k$ be the inverse multi-quadric kernel in Equation (2) and let $u(x,y)$ denote the integrand in Equation (6). Then
$$ \nabla_\theta D_{p,q}(\pi_\theta)^2 = \mathbb{E}\left[ \frac{1}{n(n-1)} \sum_{i \neq j} \nabla_\theta u(T^\theta(x_i), T^\theta(x_j)) \right], $$
where the expectation is taken with respect to independent samples $x_1, \ldots, x_n \sim R$.

The role of Proposition 11 is to demonstrate how an unbiased gradient estimator may be constructed, whose computation requires first-order derivatives of $p$ only, and whose cost is $O(n^2)$. Although the assumption that $\theta \mapsto \nabla_\theta T^\theta(x)$ is bounded may seem strong, it can typically be satisfied by reparametrisation of $\theta \in \mathbb{R}^p$. To prove Proposition 11, we exploit the following general result due to Fisher et al. [2021]:

Proposition 12. Let $R \in \mathcal{P}(\mathbb{R}^d)$. Let $\Theta \subseteq \mathbb{R}^p$ be an open set and let $u : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ and $T^\theta : \mathbb{R}^d \to \mathbb{R}^d$, $\theta \in \Theta$, be functions such that, for all $\vartheta \in \Theta$,
(A1) $\iint |u(T^\vartheta(x), T^\vartheta(y))| \,\mathrm{d}R(x)\,\mathrm{d}R(y) < \infty$;
(A2) there exists an open neighbourhood $N_\vartheta \subseteq \Theta$ of $\vartheta$ such that $\iint \sup_{\theta \in N_\vartheta} \|\nabla_\theta u(T^\theta(x), T^\theta(y))\| \,\mathrm{d}R(x)\,\mathrm{d}R(y) < \infty$.
Then $F(\theta) := \iint u(T^\theta(x), T^\theta(y)) \,\mathrm{d}R(x)\,\mathrm{d}R(y)$ is well-defined for all $\theta \in \Theta$ and
$$ \nabla F(\theta) = \mathbb{E}\left[ \frac{1}{n(n-1)} \sum_{i \neq j} \nabla_\theta u(T^\theta(x_i), T^\theta(x_j)) \right], $$
where the expectation is taken with respect to independent samples $x_1, \ldots, x_n \sim R$.

Proof. This is Proposition 1 in Fisher et al. [2021].

Proof of Proposition 11. In what follows we aim to verify that the conditions of Proposition 12 hold for the integrand of Equation (6), namely $u(x,y) = \frac{q(x)q(y)}{p(x)p(y)} k_q(x,y)$, with $k_q$ as defined in Equation (5).

(A1): From the first line in the proof of Proposition 4, the functions $\mathcal{S}_{p,q}k(x,\cdot)$, $x \in \mathbb{R}^d$, are in $\mathcal{H}(k)^d$. Since $(x,y) \mapsto u(x,y) = \langle \mathcal{S}_{p,q}k(x,\cdot), \mathcal{S}_{p,q}k(y,\cdot) \rangle_{\mathcal{H}(k)^d}$ is positive semi-definite, from Cauchy–Schwarz, $|u(x,y)| \le \sqrt{u(x,x)}\sqrt{u(y,y)}$. Thus
$$ \iint |u(T^\theta(x), T^\theta(y))| \,\mathrm{d}R(x)\,\mathrm{d}R(y) = \iint |u(x,y)| \,\mathrm{d}T^\theta_\# R(x)\,\mathrm{d}T^\theta_\# R(y) \le \left( \int \sqrt{u(x,x)} \,\mathrm{d}T^\theta_\# R(x) \right)^2 . $$
Since $k(x,y) = \phi(x-y)$, we have from Proposition 9 that $u(x,x) = \frac{q(x)^2}{p(x)^2} \big[ -\Delta\phi(0) + \phi(0)\|\nabla \log q(x)\|^2 \big]$, so that
$$ \int \sqrt{u(x,x)} \,\mathrm{d}T^\theta_\# R(x) \le \sqrt{ \int \frac{q^2}{p^2} \,\mathrm{d}T^\theta_\# R } \, \sqrt{ \int \big| -\Delta\phi(0) + \phi(0)\|\nabla \log q\|^2 \big| \,\mathrm{d}T^\theta_\# R }, $$
which is finite by assumption.

(A2): Fix $x, y \in \mathbb{R}^d$ and let $R_x(\theta) := q(T^\theta(x))/p(T^\theta(x))$. From repeated application of the product rule of differentiation, we have that
$$ \nabla_\theta u(T^\theta(x), T^\theta(y)) = \underbrace{k_q(T^\theta(x), T^\theta(y)) \, \nabla_\theta \big[ R_x(\theta) R_y(\theta) \big]}_{(*)} + \underbrace{R_x(\theta) R_y(\theta) \, \nabla_\theta k_q(T^\theta(x), T^\theta(y))}_{(**)} . $$
Let $b_p(x) := \nabla \log p(x)$, $b_q(x) := \nabla \log q(x)$, $b(x) := b_q(x) - b_p(x)$, and $[\nabla_\theta T^\theta(x)]_{i,j} = (\partial/\partial\theta_i) T^\theta_j(x)$. In what follows, we employ a matrix norm on $\mathbb{R}^{p \times d}$ which is consistent with the Euclidean norm on $\mathbb{R}^d$, meaning that $\|\nabla_\theta T^\theta(x) \, b(T^\theta(x))\| \le \|\nabla_\theta T^\theta(x)\| \, \|b(T^\theta(x))\|$ for each $\theta \in \Theta$ and $x \in \mathbb{R}^d$. Considering the first term $(*)$, further applications of the chain rule yield that
$$ \nabla_\theta \big[ R_x(\theta) R_y(\theta) \big] = R_x(\theta) R_y(\theta) \big[ \nabla_\theta T^\theta(x) \, b(T^\theta(x)) + \nabla_\theta T^\theta(y) \, b(T^\theta(y)) \big], $$
and from the triangle inequality we obtain a bound
$$ \big\| \nabla_\theta \big[ R_x(\theta) R_y(\theta) \big] \big\| \le R_x(\theta) R_y(\theta) \big[ \|\nabla_\theta T^\theta(x)\| \, \|b(T^\theta(x))\| + \|\nabla_\theta T^\theta(y)\| \, \|b(T^\theta(y))\| \big]. $$
Let $\lesssim$ denote inequality up to an implicit multiplicative constant. Since we assumed that $\|\nabla_\theta T^\theta(x)\|$ is bounded, and the inverse multi-quadric kernel $k$ is bounded, we obtain that
$$ \|(*)\| \lesssim R_x(\theta) R_y(\theta) \big[ \|b(T^\theta(x))\| + \|b(T^\theta(y))\| \big]. $$
Similarly, from Equation (5), and using also the fact that the inverse multi-quadric kernel $k$ has derivatives of all orders [Lemma 4 of Fisher et al., 2021], we obtain a bound
$$ \big\| \nabla_\theta k_q(T^\theta(x), T^\theta(y)) \big\| \lesssim \big[ 1 + \|b_q(T^\theta(x))\| + \|\nabla b_q(T^\theta(x))\| \big]\big[ 1 + \|b_q(T^\theta(y))\| \big] + \big[ 1 + \|b_q(T^\theta(y))\| + \|\nabla b_q(T^\theta(y))\| \big]\big[ 1 + \|b_q(T^\theta(x))\| \big], $$
which we multiply by $R_x(\theta) R_y(\theta)$ to obtain a bound on $(**)$. Thus we have an overall bound
$$ \big\| \nabla_\theta u(T^\theta(x), T^\theta(y)) \big\| \lesssim R_x(\theta) R_y(\theta) \Big\{ \big[ 1 + \|b_q(T^\theta(x))\| + \|\nabla b_q(T^\theta(x))\| \big]\big[ 1 + \|b_q(T^\theta(y))\| \big] + \big[ 1 + \|b_q(T^\theta(y))\| + \|\nabla b_q(T^\theta(y))\| \big]\big[ 1 + \|b_q(T^\theta(x))\| \big] \Big\} . $$
Substituting this bound into $\iint \sup_{\theta \in N_\vartheta} \|\nabla_\theta u(T^\theta(x), T^\theta(y))\| \,\mathrm{d}R(x)\,\mathrm{d}R(y)$, and factoring terms into products of single integrals, we obtain an explicit bound on this double integral in terms of the following quantities (where $r \in \{p,q\}$):
$$ \int \sup_{\theta \in N_\vartheta} R_x(\theta) \,\mathrm{d}R(x), \qquad \int \sup_{\theta \in N_\vartheta} R_x(\theta) \|b_r(T^\theta(x))\| \,\mathrm{d}R(x), \qquad \int \sup_{\theta \in N_\vartheta} R_x(\theta) \|\nabla b_r(T^\theta(x))\| \,\mathrm{d}R(x), $$
which we have assumed exist. Thus the conditions of Proposition 12 hold, and the result immediately follows.

C Experimental Details

These appendices contain the additional empirical results referred to in Section 3, together with full details required to reproduce the experiments described in Section 4 of the main text.

C.1 Detection of Convergence and Non-Convergence

This appendix contains full details for the convergence plots of Figure 1. In Figure 1, we considered the target distribution
$$ p(x) = \sum_{i=1}^{3} w_i \, \mathcal{N}(x; \mu_i, \sigma_i^2), $$
where $\mathcal{N}(x; \mu, \sigma^2)$ is the univariate Gaussian density with mean $\mu$ and variance $\sigma^2$. The parameter choices used were $(w_1, w_2, w_3) = (0.375, 0.5625, 0.0625)$, $(\mu_1, \mu_2, \mu_3) = (-0.4, 0.3, 0.06)$ and $(\sigma_1^2, \sigma_2^2, \sigma_3^2) = (0.2, 0.2, 0.9)$. The approximating sequences considered were location-scale sequences of the form $L^n_\# u$, where $L^n(x) = a_n + b_n x$ for some $(a_n)_{n\in\mathbb{N}}$ and $(b_n)_{n\in\mathbb{N}}$ and $u \in \mathcal{P}(\mathbb{R})$. For the converging sequences we set $u = p$, and for the non-converging sequences we set $u = \mathcal{N}(0, 0.5)$. We considered three different choices of $(a_n)_{n\in\mathbb{N}}$ and $(b_n)_{n\in\mathbb{N}}$, one for each colour; the sequences used are shown in Figure S1. The specification of our choices of $q$ is as follows:

Prior: We took $q = \mathcal{N}(0, 0.75^2)$.

Laplace: The Laplace approximation computed was $q = \mathcal{N}(0.3, 0.2041^2)$.

GMM: The Gaussian mixture model was computed using 100 samples from the target $p$. The number of components used was 2, since this value minimised the Bayesian information criterion [Schwarz, 1978].

KDE: The kernel density estimate was computed using 100 samples from the target $p$. We utilised a Gaussian kernel $k(x,y) = \exp(-(x-y)^2/\ell^2)$, with the lengthscale (bandwidth) parameter $\ell$ determined by Silverman's rule of thumb [Silverman, 1986].

Figure S1: The sequences $a_n$ and $b_n$ used in the location-scale sequences. (a) The sequences $a_n$ and $b_n$ used in the location-scale sequences in Figure 1. (b) The sequences $a_n$ and $b_n$ used in the location-scale sequences in Figure 2. In each case, the colour of each curve indicates which of the sequences $(\pi_n)_{n\in\mathbb{N}}$ it corresponds to.

The values of GF-KSD reported in Figure 1 were computed using a quasi Monte Carlo approximation to the integral (6), utilising a low-discrepancy sequence of length 300. The low-discrepancy sequences were obtained by first specifying a uniform grid over $[0,1]$ and then performing the inverse CDF transform for each member of the sequence $\pi_n$; a code sketch of this construction is given below.

C.2 Avoidance of Failure Modes

This appendix contains full details of the experiment reported in Section 3.3 and an explanation of the failure mode reported in Figure 2a. The sequences considered are displayed in Figure S2.
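The quasi Monte Carlo construction shared by this appendix and Appendix C.1 (a uniform grid on $[0,1]$ pushed through the inverse CDF of the base distribution and then through $L^n$) can be illustrated as follows. This is a minimal sketch under our own assumptions: the mixture parameters are as read from the text (with the first mean taken as $-0.4$), the grid resolution and the $(a_n, b_n)$ values are illustrative, and `qmc_points` is our own helper, not the authors' code.

```python
import numpy as np
from scipy.stats import norm

# Target p: the three-component Gaussian mixture of Appendix C.1.  For the converging
# sequences u = p, so the inverse CDF is obtained numerically; for the non-converging
# sequences u = N(0, 0.5) and norm.ppf can be used directly.
w_mix = np.array([0.375, 0.5625, 0.0625])
mu_mix = np.array([-0.4, 0.3, 0.06])
sd_mix = np.sqrt(np.array([0.2, 0.2, 0.9]))

def p_cdf(x):
    return np.sum(w_mix * norm.cdf(np.asarray(x)[:, None], mu_mix, sd_mix), axis=1)

# Numerical inverse CDF of the mixture, via interpolation on a fine grid.
grid_x = np.linspace(-6.0, 6.0, 20001)
grid_F = p_cdf(grid_x)

def qmc_points(a_n, b_n, m=300):
    # Low-discrepancy representation of pi_n = L^n_# p: a uniform grid on [0, 1] is
    # pushed through the inverse CDF of p and then through L^n(x) = a_n + b_n * x.
    u = (np.arange(m) + 0.5) / m
    return a_n + b_n * np.interp(u, grid_F, grid_x)

x_n = qmc_points(a_n=0.1, b_n=1.2)   # illustrative (a_n, b_n); the paper's choices are in Figure S1
print(x_n[:5])
```

Equal weights $1/m$ are then attached to these points and GF-KSD is evaluated through its quadratic form, as in the sketch given at the end of Appendix A.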
Each sequence was a location-scale sequence of the form $L^n_\# u$, where $L^n(x) = a_n + b_n x$ for some $(a_n)_{n\in\mathbb{N}}$ and $(b_n)_{n\in\mathbb{N}}$ and $u \in \mathcal{P}(\mathbb{R})$. For the converging sequences, we set $u = p$. The settings of each failure mode were as follows:

Failure mode (a) [Figure 2a]: We took $p$ as the target used in Figure 1 and detailed in Appendix C.1, and took $q = \mathcal{N}(0, 1.5^2)$. The $a_n$ and $b_n$ sequences used are displayed in Figure S1a. The values of GF-KSD reported were computed using a quasi Monte Carlo approximation with a low-discrepancy sequence of length 300, obtained by first specifying a uniform grid over $[0,1]$ and then performing the inverse CDF transform for each member of the sequence $\pi_n$.

Failure mode (b) [Figure 2b]: We took $p = \mathcal{N}(0, 1)$ and $q = \mathcal{N}(-0.7, 0.1^2)$. The $a_n$ and $b_n$ sequences used are displayed in Figure S1b. The values of GF-KSD were computed using the same quasi Monte Carlo approximation as in failure mode (a).

Failure mode (c) [Figure 2c]: In each dimension $d$ considered, we took $p = \mathcal{N}(0, I)$ and $q = \mathcal{N}(0, 1.1 I)$. The $a_n$ and $b_n$ sequences used are displayed in Figure S1b (note that for $d > 1$, we still considered location-scale sequences of the form $L^n(x) = a_n + b_n x$). The values of GF-KSD reported were computed using a quasi Monte Carlo approximation based on a Sobol sequence of length 1,024 in each dimension $d$.

Figure S2: Test sequences $(\pi_n)_{n\in\mathbb{N}}$ used in Figure 2. The colour and style of each sequence indicates which of the curves in Figure 2 is being considered. In the second row from the top, the sequence used when $d = 1$ in Figure 2c is shown in the final column.

Failure mode (d) [Figure 2d]: We took $p = q$, with $p(x) = 0.5\,\mathcal{N}(x; -1, 0.1^2) + 0.5\,\mathcal{N}(x; 1, 0.1^2)$, where $\mathcal{N}(x; \mu, \sigma^2)$ is the univariate Gaussian density with mean $\mu$ and variance $\sigma^2$. For the non-converging sequences we took $u = \mathcal{N}(1, 0.1^2)$; in all cases the $a_n$ and $b_n$ sequences are those displayed in Figure S1b. The values of GF-KSD were computed using the same quasi Monte Carlo approximation as in failure mode (a).

In Figure S3, we provide an account of the degradation of convergence detection between $q$ = Prior, considered in Figure 1, and $q = \mathcal{N}(0, 1.5^2)$ of failure mode (a). In Figure S3a, it can be seen that the values of the integrals $\int (q/p)^2 \,\mathrm{d}\pi_n$ are finite for each element of the pink sequence $\pi_n$. However, in Figure S3b, it can be seen that the values of the integrals $\int (q/p)^2 \,\mathrm{d}\pi_n$ are infinite for the last members of the sequence $\pi_n$, thus violating a condition of Theorem 1.

C.3 Gradient-Free Stein Importance Sampling

This appendix contains full details for the experiment reported in Section 4.1. We considered the following Lotka–Volterra dynamical system:
$$ \dot{u}(t) = \alpha^* u(t) - \beta^* u(t)v(t), \qquad \dot{v}(t) = -\gamma^* v(t) + \delta^* u(t)v(t), \qquad (u(0), v(0)) = (u_0^*, v_0^*). $$

Figure S3: Explanation of Failure Mode (a).
(a) Values of $\int (q/p)^2 \,\mathrm{d}\pi_n$ for $q$ = Prior and for the converging pink sequence displayed in Figure 1. (b) Values of $\int (q/p)^2 \,\mathrm{d}\pi_n$ for $q = \mathcal{N}(0, 1.5^2)$ and for the converging pink sequence displayed in the first column and first row of Figure S2. (c) Values of $q^2/p$ for choices of $q$ ranging from $\mathcal{N}(0, 1.1^2)$ to $\mathcal{N}(0, 1.5^2)$.

Using 21 observations $u_1, \ldots, u_{21}$ and $v_1, \ldots, v_{21}$ over times $t_1 < \cdots < t_{21}$, we considered the probability model
$$ u_i \sim \text{Log-normal}(\log u(t_i), (\sigma_1^*)^2), \qquad v_i \sim \text{Log-normal}(\log v(t_i), (\sigma_2^*)^2). $$
In order to satisfy positivity constraints, we performed inference on the logarithms of the parameters,
$$ (\alpha, \beta, \gamma, \delta, u_0, v_0, \sigma_1, \sigma_2) = (\log\alpha^*, \log\beta^*, \log\gamma^*, \log\delta^*, \log u_0^*, \log v_0^*, \log\sigma_1^*, \log\sigma_2^*). $$
We took the following independent priors on the constrained parameters:
$$ \alpha^* \sim \text{Log-normal}(\log 0.7, 0.6^2), \quad \beta^* \sim \text{Log-normal}(\log 0.02, 0.3^2), \quad \gamma^* \sim \text{Log-normal}(\log 0.7, 0.6^2), \quad \delta^* \sim \text{Log-normal}(\log 0.02, 0.3^2), $$
$$ u_0^* \sim \text{Log-normal}(\log 10, 1), \quad v_0^* \sim \text{Log-normal}(\log 10, 1), \quad \sigma_1^* \sim \text{Log-normal}(\log 0.25, 0.02^2), \quad \sigma_2^* \sim \text{Log-normal}(\log 0.25, 0.02^2). $$
In order to obtain independent samples from the posterior for comparison, we utilised Stan [Stan Development Team, 2022] to obtain 8,000 posterior samples using four Markov chain Monte Carlo chains. Each chain was initialised at the prior mode. The data analysed are due to Hewitt [1921] and can be seen, along with a posterior predictive check, in Figure S4. The Laplace approximation was obtained using 48 iterations of the L-BFGS optimisation algorithm [Liu and Nocedal, 1989], initialised at the prior mode. The Hessian approximation was obtained using Stan's default numeric differentiation of the gradient. Finally, the quadratic programme defining the optimal weights of gradient-free Stein importance sampling (refer to Theorem 3) was solved using the splitting conic solver of O'Donoghue et al. [2016].

Figure S4: Posterior predictive check for the Lotka–Volterra model (pelts, in thousands, over the years 1900–1920; data and prediction shown). The shaded blue region indicates the 50% interquartile range of the posterior samples.

C.4 Stein Variational Inference Without Second-Order Gradient

This appendix contains full details for the experiment reported in Section 4.2. We considered the following bivariate densities:
$$ p_1(x, y) := \mathcal{N}(x; 0, \eta_1^2)\, \mathcal{N}(y; \sin(ax), \eta_2^2), \qquad p_2(x, y) := \mathcal{N}(x; 0, \sigma_1^2)\, \mathcal{N}(y; b x^2, \sigma_2^2), $$
where $\mathcal{N}(x; \mu, \sigma^2)$ is the univariate Gaussian density with mean $\mu$ and variance $\sigma^2$. The parameter choices for the sinusoidal experiment $p_1$ were $\eta_1^2 = 1.3^2$, $\eta_2^2 = 0.09^2$ and $a = 1.2$. The parameter choices for the banana experiment $p_2$ were $\sigma_1^2 = 1$, $\sigma_2^2 = 0.2^2$ and $b = 0.5$.

The development of a robust stochastic optimisation routine for measure transport with GF-KSD is beyond the scope of this work; in what follows we simply report one strategy that was successfully used in the setting of the application reported in the main text. This strategy was based on tempering of the distributional target $p$, to reduce a possibly rather challenging variational optimisation problem into a sequence of easier problems to be solved. Specifically, we considered tempered distributions $p_m \in \mathcal{P}(\mathbb{R}^d)$ with log-density
$$ \log p_m(x) = \epsilon_m \log p_0(x) + (1 - \epsilon_m) \log p(x), $$
where $(\epsilon_m)_{m\in\mathbb{N}} \subset [0,1]$ is the tempering sequence and $p_0 \in \mathcal{P}(\mathbb{R}^d)$ is fixed. In this case $p_0$ was taken to be $\mathcal{N}(0, 2I)$ in both the banana and sinusoidal experiments.

Figure S5: Tempering sequence $(\epsilon_m)_{m\in\mathbb{N}}$ used in each of the variational inference experiments.
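The tempered construction just defined can be written down directly; the following minimal Python sketch uses the banana density $p_2$ from this appendix as the target, takes $2I$ to be the covariance of $p_0$, and employs a linear schedule that is purely illustrative (it is not the sequence shown in Figure S5).

```python
import numpy as np

def log_p_tempered(x, log_p0, log_p, eps_m):
    # Tempered log-density: log p_m(x) = eps_m * log p0(x) + (1 - eps_m) * log p(x),
    # up to an additive normalising constant.
    return eps_m * log_p0(x) + (1.0 - eps_m) * log_p(x)

# Illustrative ingredients: p0 = N(0, 2I), with 2I read as the covariance, and p the
# banana density p_2 with sigma_1^2 = 1, sigma_2^2 = 0.2^2 and b = 0.5 (constants dropped).
log_p0 = lambda x: -0.25 * np.sum(x ** 2, axis=-1)
log_p = lambda x: -0.5 * x[..., 0] ** 2 - 0.5 * (x[..., 1] - 0.5 * x[..., 0] ** 2) ** 2 / 0.2 ** 2

eps = np.linspace(1.0, 0.0, 20000)   # a linear schedule, purely illustrative (cf. Figure S5)
x = np.array([[0.5, 0.3]])
print(log_p_tempered(x, log_p0, log_p, eps[0]),    # early iterations: close to p0
      log_p_tempered(x, log_p0, log_p, eps[-1]))   # late iterations: the target p
```

At early iterations the tempered target is dominated by the broad reference $p_0$, and as $\epsilon_m \to 0$ it coincides with $p$, which is what renders each intermediate optimisation problem easier than the original one.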
Then, at iteration $m$ of stochastic optimisation, we considered the variational objective function $\pi \mapsto \log D_{p_m, q}(\pi)$, where $q = \pi_{\theta_m}$, as explained in the main text. Tempering has previously been applied in the context of normalising flows in Prangle and Viscardi [2023]. The tempering sequence $(\epsilon_m)_{m\in\mathbb{N}}$ used for each of the experiments is displayed in Figure S5. For each experiment, the stochastic optimisation routine used was Adam [Kingma and Ba, 2015] with learning rate 0.001. Due to exploding gradients caused by the $q/p$ term in GF-KSD, we utilised gradient clipping in each of the variational inference experiments, with the maximum 2-norm value taken to be 30. In both the banana and sinusoidal experiments, the parametric class of transport maps $T^\theta$ was the inverse autoregressive flow of Kingma et al. [2016]. In the banana experiment, the dimensionality of the hidden units in the underlying autoregressive neural network was taken to be 20; in the sinusoidal experiment, it was taken to be 30. For the comparison with standard kernel Stein discrepancy, the same parametric class $T^\theta$ and the same initialisations of $\theta$ were used.

Figure S6: The $\pi_0$ and tempering sequences used in the additional convergence detection experiments. The colour of each curve indicates which of the sequences in Figure S7 and Figure S8 it corresponds with. (a) The $\pi_0$ choices of each tempered sequence of distributions. (b) The tempering sequences $(\epsilon_m)_{m\in\mathbb{N}}$ considered.

C.5 Additional Experiments

Appendix C.5.1 explores the impact of $p$ on the conclusions drawn in the main text. Appendix C.5.2 investigates the sensitivity of the proposed discrepancy to the choice of the parameters $\sigma$ and $\beta$ that appear in the kernel. Appendix C.5.3 compares the performance of GF-KSD importance sampling, KSD importance sampling and self-normalised importance sampling.

C.5.1 Exploring the Effect of p

In this section we investigate the robustness of the convergence detection described in Figure 1 subject to different choices of the target $p$. We consider two further choices of $p$:
$$ p_1(x) = \sum_{i=1}^{4} c_i \, \mathcal{N}(x; \mu_i, \sigma_i^2), \qquad p_2(x) = \sum_{i=1}^{4} d_i \, \text{Student-T}(x; \nu, m_i, s_i), $$
where $\mathcal{N}(x; \mu, \sigma^2)$ is the univariate Gaussian density with mean $\mu$ and variance $\sigma^2$, and $\text{Student-T}(x; \nu, m, s)$ is the univariate Student-T density with degrees of freedom $\nu$, location parameter $m$ and scale parameter $s$. The parameter choices for $p_1$ were $(c_1, c_2, c_3, c_4) = (0.3125, 0.3125, 0.3125, 0.0625)$, $(\mu_1, \mu_2, \mu_3, \mu_4) = (-0.3, 0, 0.3, 0)$ and $(\sigma_1^2, \sigma_2^2, \sigma_3^2, \sigma_4^2) = (0.1^2, 0.05^2, 0.1^2, 1)$. The parameter choices for $p_2$ were $\nu = 10$, $(d_1, d_2, d_3, d_4) = (0.1, 0.2, 0.3, 0.4)$, $(m_1, m_2, m_3, m_4) = (-0.4, 0.2, 0, 0.3)$ and $(s_1, s_2, s_3, s_4) = (0.05, 0.1, 0.1, 0.3)$.

Instead of the location-scale sequences of Figure 1, we considered tempered sequences of the form
$$ \log \pi_n(x) = \epsilon_n \log \pi_0(x) + (1 - \epsilon_n) \log u(x). $$
For the converging sequences we set $u$ to be the target (either $u = p_1$ or $u = p_2$), and we set $u = \mathcal{N}(x; 0, 0.4^2)$ for each of the non-converging sequences. The different sequences vary in the choice of $\pi_0$ and the tempering sequence $(\epsilon_n)_{n\in\mathbb{N}}$. These choices are displayed in Figure S6 and are taken to be the same for both of the targets considered. The specification of our choices of $q$ is as follows:

Prior: For $p_1$, we took $q = \mathcal{N}(0, 0.5^2)$. For $p_2$, we took $q = \text{Student-T}(10, 0, 0.5)$.

Laplace: For $p_1$, the Laplace approximation computed was $q = \mathcal{N}(0, 0.051^2)$.
For $p_2$, the Laplace approximation computed was $q = \mathcal{N}(0, 0.125^2)$.

GMM: For both targets, the Gaussian mixture model was computed using 100 samples from the target. In both cases, the number of components used was 3, since this value minimised the Bayesian information criterion [Schwarz, 1978].

KDE: For both targets, the kernel density estimate was computed using 100 samples from the target. In both cases, we utilised a Gaussian kernel $k(x,y) = \exp(-(x-y)^2/\ell^2)$, with the lengthscale (bandwidth) parameter $\ell$ determined by Silverman's rule of thumb [Silverman, 1986].

Results for $p_1$ are displayed in Figure S7 and results for $p_2$ are displayed in Figure S8. It can be seen that, for both target distributions and all of the sequences considered, GF-KSD correctly detects convergence. For both targets and for $q$ = Laplace, it can be seen that GF-KSD exhibits the same behaviour as Failure Mode (b), displayed in Figure 2b. The values of GF-KSD reported in Figure S7 and Figure S8 were computed using a quasi Monte Carlo approximation to the integral (6), utilising a low-discrepancy sequence of length 300. Due to the lack of an easily computable inverse CDF, we performed an importance sampling estimate of GF-KSD as follows:
$$ D_{p,q}(\pi)^2 = \iint (\mathcal{S}_{p,q} \otimes \mathcal{S}_{p,q}) k(x,y) \,\mathrm{d}\pi(x)\,\mathrm{d}\pi(y) = \iint (\mathcal{S}_{p,q} \otimes \mathcal{S}_{p,q}) k(x,y) \, \frac{\pi(x)\pi(y)}{w(x)w(y)} \,\mathrm{d}w(x)\,\mathrm{d}w(y), $$
where $w$ is the proposal distribution. For each element of a sequence $\pi_n$, we used a Gaussian proposal $w_n$ of the form
$$ \log w_n(x) = \epsilon_n \log \pi_0(x) + (1 - \epsilon_n) \log \mathcal{N}(x; 0, 0.4^2). $$
Since $\pi_0$ is Gaussian for each sequence, this construction ensures that each $w_n$ is both Gaussian and a good proposal distribution for $\pi_n$. The low-discrepancy sequences were then obtained by first specifying a uniform grid over $[0,1]$ and then performing an inverse CDF transformation using $w_n$.

C.5.2 Exploring the Effect of σ and β

In this section we investigate the effect on convergence detection of changing the parameters $\sigma$ and $\beta$ in the inverse multi-quadric kernel (2). Utilising the same test sequences and choices of $q$ used in Figure 1, we plot the values of GF-KSD in Figure S9. It can be seen that convergence detection is robust to the values of $\sigma$ and $\beta$.

C.5.3 GF-KSD vs. KSD Importance Sampling

In this section we investigate the performance of gradient-free Stein importance sampling, standard Stein importance sampling, and self-normalised importance sampling, as the distribution $q$ varies in quality as an approximation to $p$. We consider two different regimes:
1. $p = \mathcal{N}(0, I)$ and $q = \mathcal{N}(0, \lambda I)$ for $0.7 \le \lambda \le 1.3$;
2. $p = \mathcal{N}(0, I)$ and $q = \mathcal{N}(c\mathbf{1}, I)$ for $-0.6 \le c \le 0.6$, where $\mathbf{1} = (1, \ldots, 1)^\top$.
In both cases, we consider the performance of each approach for varying dimension $d$ and number of samples $n$. Results are reported in Figure S10 and Figure S11 for each regime, respectively. The quadratic programme defining the optimal weights of gradient-free Stein importance sampling and Stein importance sampling (refer to Theorem 3) was solved using the splitting conic solver of O'Donoghue et al. [2016].

Figure S7: Additional empirical assessment of gradient-free kernel Stein discrepancy using the target $p_1$ defined in Appendix C.5.1. (a) Test sequences $(\pi_n)_{n\in\mathbb{N}}$, defined in Appendix C.5.1, grouped into inward, outward and oblique converging and non-converging sequences.
The first column displays sequences (solid) that converge to the distributional target $p$ (black), while the second column displays sequences (dashed) which converge instead to a fixed Gaussian target. (b) Performance of gradient-free kernel Stein discrepancy when different approaches to selecting $q$ are employed. The colour and style of each curve in (b) indicates which of the sequences in (a) is being considered. [Here we fixed the kernel parameters $\sigma = 1$ and $\beta = 1/2$.]

Figure S8: Additional empirical assessment of gradient-free kernel Stein discrepancy using the target $p_2$ defined in Appendix C.5.1. (a) Test sequences $(\pi_n)_{n\in\mathbb{N}}$, defined in Appendix C.5.1. The first column displays sequences (solid) that converge to the distributional target $p$ (black), while the second column displays sequences (dashed) which converge instead to a fixed Gaussian target. (b) Performance of gradient-free kernel Stein discrepancy when different approaches to selecting $q$ are employed. The colour and style of each curve in (b) indicates which of the sequences in (a) is being considered. [Here we fixed the kernel parameters $\sigma = 1$ and $\beta = 1/2$.]

Figure S9: Comparison of different values of $\sigma$ and $\beta$ in the inverse multi-quadric kernel, for $(\sigma, \beta) \in \{(0.5, 0.5), (1, 0.5), (2, 0.5), (1, 0.1), (1, 0.9)\}$ and for $q \in \{$Prior, Laplace, GMM, KDE$\}$. Here the vertical axis displays the logarithm of the gradient-free kernel Stein discrepancy. The colour and style of each curve indicates which of the sequences in Figure 1 is being considered.

Figure S10: Comparison of the performance of importance sampling methodologies (GF-KSD, KSD and self-normalised) for varying dimension $d \in \{1, 5, 20, 50\}$ and number of sample points $n$, under the regime $q = \mathcal{N}(0, \lambda I)$. The approximation quality is quantified as the logarithm of the energy distance (ED).

Figure S11: Comparison of the performance of importance sampling methodologies (GF-KSD, KSD and self-normalised) for varying dimension $d \in \{1, 5, 20, 50\}$ and number of sample points $n$, under the regime $q = \mathcal{N}(c\mathbf{1}, I)$. The approximation quality is quantified as the logarithm of the energy distance (ED).
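Figures S10 and S11 quantify approximation quality through the logarithm of the energy distance between a weighted point set and the target. The following minimal sketch indicates one way such a value can be computed; it uses the common squared-energy-distance form for a weighted sample against a reference sample, and the helper name `energy_distance`, the densities, the weights and the sample sizes are illustrative assumptions of ours rather than the paper's implementation.

```python
import numpy as np

def energy_distance(X, w, Y):
    # Squared energy distance between the weighted empirical measure sum_i w_i delta(x_i)
    # and the empirical measure of a reference sample Y (rows of X and Y are points):
    #   2 E||X - Y|| - E||X - X'|| - E||Y - Y'||.
    dXY = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    dXX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dYY = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    return 2.0 * w @ dXY.mean(axis=1) - w @ dXX @ w - dYY.mean()

# Illustration: points from q = N(0, 1.1^2 I) with self-normalised weights p/q,
# compared against a reference sample from the target p = N(0, I), in d = 5.
rng = np.random.default_rng(1)
d, n = 5, 200
X = rng.normal(scale=1.1, size=(n, d))
w = np.exp(-0.5 * np.sum(X ** 2, axis=1) * (1.0 - 1.0 / 1.1 ** 2))   # proportional to p/q
w = w / w.sum()
Y = rng.normal(size=(2000, d))
print(np.log(energy_distance(X, w, Y)))
```

Lower values of the logged energy distance correspond to better approximations of $p$, which is the sense in which the three weighting schemes are compared in Figures S10 and S11.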