Stein Π-Importance Sampling

Congye Wang¹, Wilson Ye Chen², Heishiro Kanagawa¹, Chris J. Oates¹
¹ Newcastle University, UK   ² University of Sydney, Australia

Abstract

Stein discrepancies have emerged as a powerful tool for retrospective improvement of Markov chain Monte Carlo output. However, the question of how to design Markov chains that are well-suited to such post-processing has yet to be addressed. This paper studies Stein importance sampling, in which weights are assigned to the states visited by a Π-invariant Markov chain to obtain a consistent approximation of P, the intended target. Surprisingly, the optimal choice of Π is not identical to the target P; we therefore propose an explicit construction for Π based on a novel variational argument. Explicit conditions for convergence of Stein Π-Importance Sampling are established. For 70% of tasks in the PosteriorDB benchmark, a significant improvement over the analogous post-processing of P-invariant Markov chains is reported.

1 Introduction

Stein discrepancies are a class of statistical divergences that can be computed without access to a normalisation constant. Originally conceived as a tool to measure the performance of sampling methods (Gorham and Mackey, 2015), these discrepancies have since found wide-ranging statistical applications (see the review of Anastasiou et al., 2023). Our focus here is the use of Stein discrepancies for retrospective improvement of Markov chain Monte Carlo (MCMC), for which two main techniques have been proposed: (i) Stein importance sampling (Liu and Lee, 2017; Hodgkinson et al., 2020), and (ii) Stein thinning (Riabiz et al., 2022). In Stein importance sampling (also called black-box importance sampling), the samples are assigned weights such that a Stein discrepancy between the weighted empirical measure and the target P is minimised. Stein thinning constructs a sparse approximation to this optimally weighted measure at a lower computational and storage cost. Together, these techniques provide a powerful set of post-processing tools for MCMC, with subsequent authors proposing a range of generalisations and extensions (Teymur et al., 2021; Chopin and Ducrocq, 2021; Hawkins et al., 2022; Fisher and Oates, 2023; Bénard et al., 2023).

The consistency of these algorithms has been established in the setting of approximate, Π-invariant MCMC, motivated by challenging inference problems where only approximate sampling can be performed. In these settings, Π is implicitly an approximation to P that is as accurate as possible subject to the computational budget. However, the critical question of how to design Markov chains that are well-suited to such post-processing has yet to be addressed. This paper provides a solution, in the form of a specific construction for Π derived from a novel variational argument. Surprisingly, we are able to demonstrate a substantial improvement using the proposed Π, compared to the case where Π and P are equal.

The paper proceeds as follows: Section 2 presents an abstract formulation of the task and reviews existing results for optimally-weighted empirical measures. Section 3 derives our proposed choice of Π and establishes that Stein post-processing of samples from a Π-invariant Metropolis-adjusted Langevin algorithm (MALA) provides a consistent approximation of P. The approach is stress-tested using the recently released PosteriorDB suite of benchmark tasks in Section 4, before concluding with a discussion in Section 5.
2 Background

To properly contextualise our discussion we start with an abstract mathematical description of the task. Let P be a probability measure on a measurable space $\mathcal{X}$, and let $\mathcal{P}(\mathcal{X})$ be the set of all probability measures on $\mathcal{X}$. Let $D_P : \mathcal{P}(\mathcal{X}) \to [0,\infty]$ be a statistical divergence for measuring the quality of an approximation Q to P, meaning that $D_P(Q) = 0$ if and only if $Q = P$. In this work we consider approximations whose support is contained in a finite set $\{x_1,\dots,x_n\} \subset \mathcal{X}$, and in particular we consider optimal approximations of the form
$$P_n^\star := \sum_{i=1}^n w_i^\star \delta(x_i), \qquad w^\star \in \operatorname*{arg\,min}_{w \geq 0,\; \mathbf{1}^\top w = 1} D_P\left( \sum_{i=1}^n w_i \delta(x_i) \right).$$
In what follows we restrict attention to statistical divergences for which such approximations can be shown to exist and be well-defined. The question that we then ask is: which states $\{x_1,\dots,x_n\}$ minimise the approximation error $D_P(P_n^\star)$? Before specialising to Stein discrepancies, it is helpful to review existing results for some standard statistical divergences $D_P$.

2.1 Wasserstein Divergence

Optimal quantisation focuses on the r-Wasserstein ($r \geq 1$) family of statistical divergences $D_P(Q) = \inf_{\gamma \in \Gamma(P,Q)} \int \|x-y\|^r \, \mathrm{d}\gamma(x,y)$, where $\Gamma(P,Q)$ denotes the set of all couplings of $P, Q \in \mathcal{P}(\mathbb{R}^d)$ (a coupling $\gamma \in \Gamma(P,Q)$ is a distribution $\gamma \in \mathcal{P}(\mathbb{R}^d \times \mathbb{R}^d)$ whose marginal distributions are P and Q), and the divergence is finite whenever P and Q have finite r-th moment. Assuming the states $\{x_1,\dots,x_n\}$ are distinct, the corresponding optimal weights are $w_i^\star = P(A_i)$, where $A_i = \{x \in \mathbb{R}^d : \|x - x_i\| = \min_{j=1,\dots,n} \|x - x_j\|\}$ is the Voronoi neighbourhood of $x_i$ in $\mathbb{R}^d$. Optimal states achieve the minimal quantisation error for P,
$$e_{n,r}(P) = \inf_{x_1,\dots,x_n \in \mathbb{R}^d} D_P\left( \sum_{i=1}^n w_i^\star \delta(x_i) \right),$$
the smallest value of the divergence among optimally-weighted distributions supported on at most n states. Though the dependence of optimal states on n and P can be complicated, we can broaden our perspective to consider asymptotically optimal states, whose asymptotic properties can be precisely characterised. To this end, for $A \subset \mathbb{R}^d$, let $\mathcal{U}(A)$ denote the uniform distribution on A, and define the universal constant $C_r([0,1]^d) = \inf_{n \geq 1} n^{r/d} e_{n,r}(\mathcal{U}([0,1]^d))$. Suppose that P admits a density p on $\mathbb{R}^d$. Then the r-th quantisation coefficient of P on $\mathbb{R}^d$, defined as
$$C_r(P) = C_r([0,1]^d) \left( \int p(x)^{d/(d+r)} \, \mathrm{d}x \right)^{(d+r)/d},$$
plays a central role in the classical theory of quantisation, being the rate constant in the asymptotic convergence of the minimal quantisation error, $\lim_{n\to\infty} n^{r/d} e_{n,r}(P) = C_r(P)$; see Theorem 6.2 of Graf and Luschgy (2007). This suggests a natural definition: a collection $\{x_1,\dots,x_n\}$ is called asymptotically optimal if
$$\lim_{n\to\infty} n^{r/d} D_P\left( \sum_{i=1}^n w_i^\star \delta(x_i) \right) = C_r(P),$$
which amounts to $P_n^\star$ asymptotically attaining the minimal quantisation error $e_{n,r}(P)$. The main result here is that, if $\{x_1,\dots,x_n\}$ are asymptotically optimal, then $\frac{1}{n}\sum_{i=1}^n \delta(x_i) \to \Pi_r$, where convergence is in distribution and $\Pi_r$ is the distribution whose density is $\pi_r(x) \propto p(x)^{d/(d+r)}$; see Theorem 7.5 of Graf and Luschgy (2007). This provides us with a key insight: optimal states are over-dispersed with respect to the intended distributional target. The extent of the over-dispersion depends both on r, a parameter of the statistical divergence, and the dimension d of the space on which distributions are defined.
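To make the over-dispersion concrete, the following worked example (ours, not from the original text) takes P to be a standard Gaussian target on $\mathbb{R}^d$:
$$\pi_r(x) \;\propto\; p(x)^{d/(d+r)} \;\propto\; \exp\!\left( -\frac{d}{d+r}\,\frac{\|x\|^2}{2} \right), \qquad \text{so that} \qquad \Pi_r = \mathcal{N}\!\left(0,\ \tfrac{d+r}{d} I\right).$$
Each marginal variance is inflated by a factor $(d+r)/d$, which is largest in low dimension (a factor of 3 when $d = 1$, $r = 2$) and tends to 1 as $d \to \infty$; this anticipates the observation below that the benefit of over-dispersion diminishes in high dimension.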
The r-Wasserstein divergence is, unfortunately, not well-suited for use in the motivating Bayesian context. In particular, computing the optimal weights $w_i^\star = P(A_i)$ requires knowledge of P, which is typically not available when P is implicitly defined via an intractable normalisation constant. On the other hand, the optimal sampling distribution $\Pi_r$ is explicit and can be sampled (for example using MCMC); for discussion of random quantisers in this context see Graf and Luschgy (2007, Chapter 9), Cohort (2004, p. 126) and Sonnleitner (2022, Section 4.5). The simple form of $\Pi_r$ is a feature of the classical approach to quantisation that we will attempt to mimic in the sequel.

2.2 Kernel Discrepancies

The theory of quantisation using kernels is less well-developed. A kernel is a measurable, symmetric, positive-definite function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. From the Moore–Aronszajn theorem, there is a unique Hilbert space $\mathcal{H}(k)$ for which k is a reproducing kernel, meaning that $k(\cdot,x) \in \mathcal{H}(k)$ for all $x \in \mathcal{X}$ and $\langle f, k(\cdot,x) \rangle_{\mathcal{H}(k)} = f(x)$ for all $f \in \mathcal{H}(k)$ and all $x \in \mathcal{X}$. Assuming that $\mathcal{H}(k) \subset L^1(P)$, we can define the weak (or Pettis) integral
$$\mu_P(\cdot) = \int k(\cdot, x) \, \mathrm{d}P(x), \qquad (1)$$
called the kernel mean embedding of P in $\mathcal{H}(k)$. The kernel discrepancy is then defined as the norm of the difference between kernel mean embeddings,
$$D_P(Q) = \|\mu_Q - \mu_P\|_{\mathcal{H}(k)} = \sqrt{ \iint k(x,y) \, \mathrm{d}(Q-P)(x) \, \mathrm{d}(Q-P)(y) }, \qquad (2)$$
where, to be consistent with our earlier notation, we adopt the convention that $D_P(Q)$ is infinite whenever $\mathcal{H}(k) \not\subset L^1(Q)$. The second equality in (2) follows immediately from the stated properties of a reproducing kernel. To satisfy the requirement of a statistical divergence, we assume that the kernel k is characteristic, meaning that $\mu_P = \mu_Q$ if and only if $P = Q$.

In this setting, the properties of optimal states are necessarily dependent on the choice of kernel k, and are in general not well-understood. Indeed, given distinct states $\{x_1,\dots,x_n\}$, the corresponding optimal weights $w^\star = (w_1^\star,\dots,w_n^\star)$ are the solution to the linearly-constrained quadratic program
$$\operatorname*{arg\,min}_{w \in \mathbb{R}^n} \; w^\top K w - 2 z^\top w \quad \text{s.t.} \quad w \geq 0, \; \mathbf{1}^\top w = 1, \qquad (3)$$
where $K_{i,j} = k(x_i, x_j)$ and $z_i = \mu_P(x_i)$. This program does not admit a closed-form solution, but can be numerically solved. To the best of our knowledge, the only theoretical analysis of approximations based on (3) is due to Hayakawa et al. (2022), who established rates for the convergence of $P_n^\star$ to P in the case where states are independently sampled from P. The question of an optimal sampling distribution was not considered in that work.
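For concreteness, a minimal sketch of how the program (3) can be solved numerically is given below; it is our own illustration (not the implementation used for the experiments, which relied on dedicated QP solvers as described in Section 3.3) and assumes the kernel matrix K and embedding vector z have already been computed. In the Stein-kernel setting introduced below, z = 0.

```python
import numpy as np
from scipy.optimize import minimize

def optimal_weights(K, z=None):
    """Solve the linearly-constrained quadratic program (3):
    minimise w^T K w - 2 z^T w  subject to  w >= 0 and sum(w) = 1.
    K is the n x n kernel matrix; z collects the kernel mean embedding
    evaluations mu_P(x_i) (z = 0 when k is a Stein kernel)."""
    n = K.shape[0]
    z = np.zeros(n) if z is None else z
    objective = lambda w: w @ K @ w - 2.0 * z @ w
    gradient = lambda w: 2.0 * (K @ w - z)
    w0 = np.full(n, 1.0 / n)                        # start from uniform weights
    result = minimize(
        objective, w0, jac=gradient, method="SLSQP",
        bounds=[(0.0, 1.0)] * n,                    # enforces w_i >= 0
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    return result.x
```

For large n a purpose-built quadratic programming solver is preferable to a general-purpose optimiser, but the constrained structure of the problem is identical.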
Although few results are available concerning (3), relaxations of this program have been well-studied. The simplest relaxation of (3) is to remove both the positivity ($w \geq 0$) and normalisation ($\mathbf{1}^\top w = 1$) constraints, in which case the optimal weights have the explicit representation $w = K^{-1} z$. The analysis of optimal states in this context has developed under the dual strands of kernel cubature and Bayesian cubature, where it has been theoretically or empirically demonstrated that (i) if states are randomly sampled, the optimal sampling distribution will be n-dependent (Bach, 2017) and over-dispersed with respect to the distributional target (Briol et al., 2017), and (ii) space-filling designs are asymptotically optimal for typical stationary kernels on bounded domains $\mathcal{X} \subset \mathbb{R}^d$ (Briol et al., 2019). Analysis of optimal states on unbounded domains appears to be more difficult; see e.g. Karvonen et al. (2021). Relaxation of either the positivity or normalisation constraint results in approximations that behave similarly to kernel cubature (see, respectively, Ehler et al., 2019; Karvonen et al., 2018). However, relaxation of either constraint can result in the failure of $P_n^\star$ to be an element of $\mathcal{P}(\mathcal{X})$, limiting the relevance of these results to the posterior approximation task.

Despite relatively little being known about the character of optimal states in this context, kernel discrepancy is widely used. The application of kernel discrepancies to an implicitly defined distributional target, such as a posterior distribution in a Bayesian analysis, is made possible by the use of a Stein kernel; a P-dependent kernel $k = k_P$ for which $\mu_P(x) = 0$ for all $x \in \mathcal{X}$ (Oates et al., 2017). The associated kernel discrepancy
$$D_P(Q) = \|\mu_Q\|_{\mathcal{H}(k_P)} = \sqrt{ \iint k_P(x,y) \, \mathrm{d}Q(x) \, \mathrm{d}Q(y) } \qquad (4)$$
is called a kernel Stein discrepancy (KSD) (Chwialkowski et al., 2016; Liu et al., 2016; Gorham and Mackey, 2017), and this will be a key tool in our methodological development. The corresponding optimally weighted approximation $P_n^\star$ is the Stein importance sampling method of Liu and Lee (2017). To retain clarity of presentation in the main text, we defer all details on the construction of Stein kernels to Appendix A.

2.3 Sparse Approximation

If the number n of states is large, computation of optimal weights can become impractical. This has motivated a range of sparse approximation techniques, which aim to iteratively construct an approximation of the form $P_{n,m} = \frac{1}{m} \sum_{i=1}^m \delta(y_i)$, where each $y_i$ is an element of $\{x_1,\dots,x_n\}$. The canonical example is the greedy algorithm which, at iteration j, selects a state
$$y_j \in \operatorname*{arg\,min}_{y \in \{x_1,\dots,x_n\}} D_P\left( \frac{1}{j}\left[ \delta(y) + \sum_{i=1}^{j-1} \delta(y_i) \right] \right) \qquad (5)$$
for which the statistical divergence is minimised; a sketch is provided at the end of this section. In the context of kernel discrepancy, the greedy algorithm (5) has computational cost $O(m^2 n)$, which compares favourably with the cost of solving (3) when $m \ll n$. (It is difficult to quantify the complexity of numerically solving (3), since this will depend on details of the solver and the tolerance that are used; on the other hand, if we ignore the non-negativity and normalisation constraints, then the computational cost of solving the resulting n-dimensional linear system of equations is $O(n^3)$.) Furthermore, under appropriate assumptions, the sparse approximation converges to the optimally weighted approximation; $D_P(P_{n,m}) \to D_P(P_n^\star)$ as $m \to \infty$ with n fixed. See Teymur et al. (2021) for full details, where non-myopic and mini-batch extensions of the greedy algorithm are also considered. The greedy algorithm can be viewed as a regularised version of the Frank–Wolfe algorithm (also called herding, or the conditional gradient method), for which a similar asymptotic result can be shown to hold (Chen et al., 2010; Bach et al., 2012; Chen et al., 2018). Related work includes Dwivedi and Mackey (2021, 2022); Shetty et al. (2022); Hayakawa et al. (2022). Since in what follows we aim to retrospectively improve MCMC output, where it is not unusual to encounter $n \approx 10^4$–$10^6$, sparse approximation will be important.

This completes our overview of background material. In what follows we seek to mimic classical quantisation by deriving a choice for Π that is straightforward to sample using MCMC and is appropriately over-dispersed relative to P. This should be achieved while remaining in the framework of kernel discrepancies, so that optimal weights can be explicitly computed, and coupled with a sparse approximation that has low computational and storage cost.
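The sketch below (our own illustration) makes the greedy construction (5) concrete for the case of a Stein kernel: expanding the squared discrepancy shows that, since $\mu_P = 0$, the update reduces to selecting the candidate y minimising $\tfrac{1}{2} k_P(y,y) + \sum_{j<i} k_P(y, y_j)$, which is the update that appears in Algorithm 3 later. The callable stein_kernel is an assumption here; it is taken to return the matrix of pairwise kernel evaluations.

```python
import numpy as np

def greedy_sparse_approximation(X, stein_kernel, m):
    """Greedy algorithm (5) specialised to a Stein kernel k_P.
    X is an (n, d) array of candidate states; stein_kernel(A, B) is assumed
    to return the matrix [k_P(a, b)] over rows a of A and b of B.
    Returns indices of the m selected states (selection with replacement)."""
    n = X.shape[0]
    k_diag = np.array([stein_kernel(X[i:i + 1], X[i:i + 1])[0, 0] for i in range(n)])
    running = np.zeros(n)            # running[i] accumulates sum_j k_P(x_i, y_j)
    selected = []
    for _ in range(m):
        i_star = int(np.argmin(0.5 * k_diag + running))
        selected.append(i_star)
        running += stein_kernel(X, X[i_star:i_star + 1])[:, 0]
    return np.array(selected)
```

Maintaining the running sums means that each of the m iterations requires only one pass over the n candidates, rather than recomputing all earlier terms.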
3 Methodology

The methods that we consider first sample states $\{x_1,\dots,x_n\}$ using Π-invariant MCMC, then post-process these states using kernel discrepancies (Section 2.2) and sparse approximation (Section 2.3), to obtain an approximation to the target P. A variational argument, which we present in Section 3.1, provides a suitable n-independent choice for Π (which agrees with our intuition from Section 2.1 that Π should be, in an appropriate sense, over-dispersed with respect to P). Sufficient conditions for strong consistency of the approximation are established in Section 3.3.

3.1 Selecting Π

Here we present a heuristic argument for a particular choice of Π; rigorous theoretical support for Stein Π-Importance Sampling is then provided in Section 3.3. Our setting is that of Section 2.2, and the following will additionally be assumed:

Assumption 1. It is assumed that
(A1) $C_1^2 := \inf_{x \in \mathcal{X}} k(x,x) > 0$;
(A2) $C_2 := \int \sqrt{k(x,x)} \, \mathrm{d}P(x) < \infty$.

Note that (A2) implies that $\mathcal{H}(k) \subset L^1(P)$, and thus (1) is in fact a strong (or Bochner) integral.

A direct analysis of the optimal states associated to the optimal weights $w^\star$ appears to be challenging, due to the fact that the components of $w^\star$ are strongly inter-dependent. Our solution here is to instead consider optimal states associated with weights that, while not optimal, can be expected to perform much better than alternatives, with the advantage that their components are only weakly dependent. Specifically, we will assume that P is absolutely continuous with respect to Π (denoted $P \ll \Pi$), and study convergence of self-normalised importance sampling (SNIS), i.e. the approximation
$$P_n = \sum_{i=1}^n w_i \delta(x_i), \qquad w_i \propto \frac{\mathrm{d}P}{\mathrm{d}\Pi}(x_i),$$
where $x_1,\dots,x_n \sim \Pi$ are independent. Since $w \geq 0$ and $\mathbf{1}^\top w = 1$, from the optimality of $w^\star$ under these constraints we have that $D_P(P_n^\star) \leq D_P(P_n)$. It is emphasised that the SNIS weights are a theoretical device only, and will not be used for computation; indeed, we can demonstrate that the SNIS weights w perform substantially worse than $w^\star$ in general.

The analysis of the SNIS weights w is tractable when viewed as approximation of the kernel mean embedding $\mu_P$ in the Hilbert space $\mathcal{H}(k)$. Indeed, recall that $D_P(P_n) = \|\xi_n/\sqrt{n}\|_{\mathcal{H}(k)}$, where $\xi_n = \sqrt{n}(\mu_{P_n} - \mu_P)$. Then, following Section 2.3.1 of Agapiou et al. (2017), we observe that
$$\xi_n = \sqrt{n}\left( \sum_{i=1}^n w_i k(\cdot,x_i) - \mu_P \right) = \frac{ \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{\mathrm{d}P}{\mathrm{d}\Pi}(x_i) \left[ k(\cdot,x_i) - \mu_P \right] }{ \frac{1}{n} \sum_{i=1}^n \frac{\mathrm{d}P}{\mathrm{d}\Pi}(x_i) }. \qquad (6)$$
The idea is to seek Π for which the asymptotic variance of $\xi_n$ is small. Supposing that
$$\int \frac{\mathrm{d}P}{\mathrm{d}\Pi}(x)^2 \, \mathrm{d}\Pi(x) < \infty, \qquad \text{(S1)}$$
from the weak law of large numbers the denominator in (6) converges in probability to 1. Further supposing that
$$\int \left\| \frac{\mathrm{d}P}{\mathrm{d}\Pi}(x) \left[ k(\cdot,x) - \mu_P \right] \right\|_{\mathcal{H}(k)}^2 \, \mathrm{d}\Pi(x) < \infty, \qquad \text{(S2)}$$
from the Hilbert space central limit theorem the numerator in (6) converges in distribution to a Gaussian,
$$\frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{\mathrm{d}P}{\mathrm{d}\Pi}(x_i) \left[ k(\cdot,x_i) - \mu_P \right] \stackrel{d}{\to} \mathcal{N}(0, C),$$
where $C : \mathcal{H}(k) \to \mathcal{H}(k)$ is the covariance operator defined via
$$\langle f, C g \rangle_{\mathcal{H}(k)} = \int \left\langle f,\, \tfrac{\mathrm{d}P}{\mathrm{d}\Pi}(x)\left[ k(\cdot,x) - \mu_P \right] \right\rangle_{\mathcal{H}(k)} \left\langle \tfrac{\mathrm{d}P}{\mathrm{d}\Pi}(x)\left[ k(\cdot,x) - \mu_P \right],\, g \right\rangle_{\mathcal{H}(k)} \mathrm{d}\Pi(x);$$
see Section 10.1 of Ledoux and Talagrand (1991). Thus, from Slutsky's lemma applied to (6), we conclude that $\xi_n \stackrel{d}{\to} \mathcal{N}(0, C)$. Recalling that $n D_P(P_n)^2 = \|\xi_n\|_{\mathcal{H}(k)}^2$, and noting that the mean square of the limiting Gaussian random variable is $\mathrm{tr}(C)$, a natural idea is to select the sampling distribution Π such that $\mathrm{tr}(C)$ is minimised.

Fortunately, the trace of C can be explicitly computed. It simplifies presentation to restrict attention to a Stein kernel $k = k_P$, for which $\mu_P = 0$, giving $\mathrm{tr}(C) = \int (\mathrm{d}P/\mathrm{d}\Pi)(x)^2 \, k_P(x) \, \mathrm{d}\Pi(x)$, where for convenience we have let $k_P(x) := k_P(x,x)$. Assuming that P and Π admit densities p and π on $\mathcal{X} = \mathbb{R}^d$, the variational problem we wish to solve is
$$\operatorname*{arg\,min}_{\pi \in \mathcal{Q}} \int \frac{p(x)^2}{\pi(x)}\, k_P(x) \, \mathrm{d}x \quad \text{s.t.} \quad \int \pi(x) \, \mathrm{d}x = 1, \qquad (7)$$
where $\mathcal{Q}$ is the set of positive measures on $\mathbb{R}^d$ for which (S1) and (S2) are satisfied.
To solve this problem, we first relax the constraints (S1)–(S2) and solve the relaxed problem using the Euler–Lagrange equations: introducing a Lagrange multiplier λ for the normalisation constraint, stationarity of $\int p(x)^2 k_P(x)/\pi(x)\,\mathrm{d}x + \lambda(\int \pi(x)\,\mathrm{d}x - 1)$ in π requires $-p(x)^2 k_P(x)/\pi(x)^2 + \lambda = 0$ pointwise, which yields
$$\pi(x) \propto p(x) \sqrt{k_P(x)}. \qquad (8)$$
Note that the normalisation constant of π is the constant $C_2$ from Assumption 1, whose existence we assumed. Then we verify that (S1)–(S2) in fact hold for this choice of Π. Indeed, for (S1),
$$\int \frac{\mathrm{d}P}{\mathrm{d}\Pi}(x)^2 \, \mathrm{d}\Pi(x) = C_2 \int \frac{1}{\sqrt{k_P(x)}} \, \mathrm{d}P(x) \leq \frac{C_2}{C_1} < \infty,$$
while for (S2),
$$\int \frac{\mathrm{d}P}{\mathrm{d}\Pi}(x)^2 \, k_P(x) \, \mathrm{d}\Pi(x) = C_2 \int \sqrt{k_P(x)} \, \mathrm{d}P(x) = C_2^2 < \infty,$$
which shows that we have indeed solved (7). The sampling distribution Π we have obtained is characterised up to a normalisation constant in (8), so, just like P, we can sample from Π using techniques such as MCMC. It is interesting to note that Π is also optimal for standard importance sampling (i.e. without self-normalisation); see Lemma 1 of Adachi et al. (2022). The Stein kernel $k_P$ determines the extent to which Π differs from P, as we illustrate next.

Figure 1: Illustrating our choice of Π in 1D. (a) The univariate target P (black), and our choice of Π based on the Langevin Stein kernel (purple), the KGM1 Stein kernel (green), and the KGM3 Stein kernel (blue). (b) The mean kernel Stein discrepancy (KSD) for Stein Π-Importance Sampling using the Stein kernels from (a); in each case, KSD was computed using the same Stein kernel used to construct Π. Solid lines indicate the baseline case of sampling from P, while dashed lines indicate sampling from Π. (The experiment was repeated 100 times and standard error bars are plotted.)

3.2 Illustration

For illustration, consider the univariate target P (black curve) in Figure 1.1, a 3-component Gaussian mixture model. Our recommended choice of Π in (8) is shown for both the Langevin Stein kernel (purple curve) and the KGM$_s$ Stein kernels with $s \in \{1, 3\}$ (green and blue curves). The Stein discrepancy corresponding to the Langevin Stein kernel provides control over weak convergence (i.e. convergence of integrals of functions that are continuous and bounded), while the KGM$_s$ Stein kernel provides additional control over the convergence of polynomial moments up to order s; full details about the construction of Stein kernels are contained in Appendix A. The Langevin and KGM1 Stein kernels have $k_P(x) \asymp x^2$, while the KGM3 Stein kernel has $k_P(x) \asymp x^6$, in each case as $|x| \to \infty$, and thus greater over-dispersion results from use of the KGM3 Stein kernel. This over-dispersion is less pronounced in higher dimensions (the same holds for classical quantisation; c.f. Section 2.1); see Appendix D.1.

To illustrate the performance of Stein Π-Importance Sampling, we generated a sequence $(x_n)_{n \in \mathbb{N}}$ of independent samples from Π. For each $n \in \{1,\dots,100\}$, the samples $\{x_1,\dots,x_n\}$ were assigned optimal weights $w^\star$ by solving (3), and the associated KSD was computed. As a baseline, we performed the same calculation using independent samples from P. Figure 1.2 indicates that, for both Stein kernels, substantial improvement results from the use of samples from Π compared to the use of samples from P. Interestingly, the KGM3 Stein kernel demonstrated a larger improvement compared to the Langevin Stein kernel, suggesting that the choice of Π may be more critical in settings where KSD enjoys a stronger form of convergence control.
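The construction behind Figure 1.1 can be sketched in a few lines. The code below is our own illustration: the mixture parameters are arbitrary (the paper does not report the ones used in Figure 1), a numerical score is used in place of an analytic one, and the constants c1, c2 in $k_P(x) = c_1 + c_2 \|\nabla \log p(x)\|^2$ (the closed form for the Langevin Stein kernel, stated below and in Appendix C) are set to 1 purely for illustration.

```python
import numpy as np
from scipy.stats import norm

# A 3-component Gaussian mixture standing in for the target of Figure 1
# (illustrative parameters only).
weights, means, sds = [0.3, 0.5, 0.2], [-3.0, 0.0, 4.0], [1.0, 0.8, 1.2]

def p(x):
    return sum(w * norm.pdf(x, m, s) for w, m, s in zip(weights, means, sds))

def grad_log_p(x, h=1e-5):
    # numerical score, for brevity; an analytic score would normally be used
    return (np.log(p(x + h)) - np.log(p(x - h))) / (2 * h)

def pi_unnormalised(x, c1=1.0, c2=1.0):
    # pi(x) ∝ p(x) sqrt(k_P(x)), with k_P(x) = c1 + c2 * (d/dx log p(x))^2
    # for a Langevin-type Stein kernel; the constants depend on the kernel.
    return p(x) * np.sqrt(c1 + c2 * grad_log_p(x) ** 2)

grid = np.linspace(-8.0, 8.0, 400)
pi_vals = pi_unnormalised(grid)
pi_vals /= np.trapz(pi_vals, grid)   # normalise on the grid, e.g. for plotting
```

The tilted density places extra mass where the score is large, i.e. where p varies most rapidly, which is the qualitative behaviour visible in Figures 1 and 2.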
To illustrate a posterior approximation task, consider a simple regression model $y_i = f_i(x) + \epsilon_i$ with $f_i(x) = x_1(1 + t_i x_2)$, $t_i = i/5$, $i = 1,\dots,10$, and $\epsilon_i$ independent $\mathcal{N}(0, 1)$. The parameter $x = (x_1, x_2)$ was assigned a prior $\mathcal{N}(0, I)$, and data were simulated using $x = (0, 0)$. The posterior distribution P is depicted in the leftmost panel of Figure 2, while our choice of Π corresponding to the Langevin (centre left), KGM3 (centre right) and Riemann Stein kernels (right) are also displayed. For the Langevin and KGM3 kernels, the associated Π target their mass toward regions where P varies the most. The reason for this behaviour is clearly seen for the Langevin Stein kernel, since $k_P(x) = c_1 + c_2 \|\nabla \log p(x)\|^2$ for some $c_1, c_2 > 0$; see Appendix C for detail. The Riemann Stein kernel can be viewed as a preconditioned form of the Langevin Stein kernel which takes into account the geometric structure of P; see Appendix A for full detail. (The use of geometric information may be beneficial, in the sense that the associated diffusion process may mix more rapidly, and rapid mixing leads to sharper bounds from the perspective of convergence control (Gorham et al., 2019); however, the Riemann Stein kernel is associated with a prohibitive computational cost, and is included here only for academic interest.) Results in Figure S2 demonstrate that Stein Π-Importance Sampling improves upon the default Stein importance sampling method (i.e. with Π and P equal) for all choices of kernel. An additional illustration involving a GARCH model with d = 4 parameters is presented in Appendix D.4, where the effect of varying the order s of the KGM Stein kernel is explored.

Figure 2: Illustrating our choice of Π in 2D. The bivariate target P (left), together with our choice of Π based on the Langevin Stein kernel (centre left), the KGM3 Stein kernel (centre right), and the Riemann Stein kernel (right).

3.3 Theoretical Guarantees

The aim of this section is to establish when post-processing of Π-invariant MCMC produces a strongly consistent approximation of P, for our recommended choice of Π in (8). Our analysis focuses on MALA (Roberts and Stramer, 2002), leveraging the recent work of Durmus and Moulines (2022) to present explicit and verifiable conditions on P for our results to hold. In fact, we consider the more general preconditioned form of MALA, where the symmetric positive definite preconditioner matrix M is to be specified. Our results also allow for (optional) sparse approximation, to circumvent direct solution of (3) (c.f. Section 2.3). The resulting algorithms, which we call Stein Π-Importance Sampling (SΠIS-MALA) and Stein Π-Thinning (SΠT-MALA), are quite straightforward and are contained, respectively, in Algorithms 2 and 3. The linearly-constrained quadratic programme in Algorithm 2 was solved using the Python v3.10.4 packages qpsolvers v3.4.0 and proxsuite v0.3.7. While it is difficult to analyse the computational complexity associated with these methods, we believe they are at worst $O(n^3)$.

Algorithm 1: Π-Invariant Metropolis-Adjusted Langevin Algorithm (MALA)
Require: $x_0$ (initial state), $\epsilon$ (step size), M (preconditioner matrix), n (chain length), $k_P$ (Stein kernel)
1: for $i = 1,\dots,n$ do
2:   $x' \leftarrow \nu(x_{i-1}) + \sqrt{2\epsilon}\, M^{-1/2} Z_i$, where $\nu(x) := x + \epsilon M^{-1} \nabla \log p(x) + \tfrac{\epsilon}{2} M^{-1} \nabla \log k_P(x)$ and $Z_i \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, I)$
3:   $L \leftarrow \log \frac{p(x')}{p(x_{i-1})} + \tfrac{1}{2} \log \frac{k_P(x')}{k_P(x_{i-1})} - \tfrac{1}{4\epsilon} \|x_{i-1} - \nu(x')\|_{M^{-1}}^2 + \tfrac{1}{4\epsilon} \|x' - \nu(x_{i-1})\|_{M^{-1}}^2$
4:   if $\log(U_i) < L$ then $x_i \leftarrow x'$; else $x_i \leftarrow x_{i-1}$; end if, where $U_i \overset{\mathrm{iid}}{\sim} \mathcal{U}([0,1])$
5: end for

Algorithm 2: Stein Π-Importance Sampling (SΠIS-MALA)
Require: $\{x_1,\dots,x_n\}$ from Algorithm 1, $k_P$ (Stein kernel)
1: $w^\star \leftarrow \operatorname*{arg\,min}_{w \in \mathbb{R}^n} \{ \langle w, K_P w \rangle : w \geq 0, \, \mathbf{1}^\top w = 1 \}$, where $[K_P]_{i,j} = k_P(x_i, x_j)$
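As a concrete illustration, a minimal Python sketch of Algorithm 1 is given below. It is our own illustration, not code from the paper: the callables log_p, grad_log_p, log_k_diag and grad_log_k_diag (for $\log p$, $\nabla \log p$, $\log k_P(x,x)$ and its gradient) are assumed to be supplied by the user, and the warm-up adaptation of M described in Appendix D.3 is omitted.

```python
import numpy as np

def pi_invariant_mala(x0, log_p, grad_log_p, log_k_diag, grad_log_k_diag,
                      eps, M, n, rng=None):
    """Sketch of Algorithm 1: preconditioned MALA targeting pi(x) ∝ p(x) sqrt(k_P(x))."""
    rng = np.random.default_rng() if rng is None else rng
    M_inv = np.linalg.inv(M)
    noise_chol = np.linalg.cholesky(M_inv)      # noise_chol @ z has covariance M^{-1}

    def log_target(x):                          # log pi, up to an additive constant
        return log_p(x) + 0.5 * log_k_diag(x)

    def drift(x):                               # nu(x) in Algorithm 1
        return x + eps * M_inv @ grad_log_p(x) + 0.5 * eps * M_inv @ grad_log_k_diag(x)

    def log_proposal(x_from, x_to):             # log q(x_to | x_from), up to a constant
        z = x_to - drift(x_from)
        return -0.25 / eps * z @ M @ z

    chain = [np.asarray(x0, dtype=float)]
    for _ in range(n):
        x = chain[-1]
        x_prop = drift(x) + np.sqrt(2.0 * eps) * noise_chol @ rng.standard_normal(x.size)
        log_accept = (log_target(x_prop) - log_target(x)
                      + log_proposal(x_prop, x) - log_proposal(x, x_prop))
        chain.append(x_prop if np.log(rng.uniform()) < log_accept else x)
    return np.array(chain[1:])
```

The acceptance ratio reproduces line 3 of Algorithm 1, since $\|z\|_{M^{-1}}^2 = z^\top M z$ in the notation introduced below.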
Algorithm 3: Stein Π-Thinning (SΠT-MALA)
Require: $\{x_1,\dots,x_n\}$ from Algorithm 1, m (number of samples to retain), $k_P$ (Stein kernel)
1: for $i = 1,\dots,m$ do
2:   $y_i \leftarrow \operatorname*{arg\,min}_{y \in \{x_1,\dots,x_n\}} \tfrac{1}{2} k_P(y) + \sum_{j=1}^{i-1} k_P(y, y_j)$
3: end for

Let $A \succeq B$ indicate that $A - B$ is a positive semi-definite matrix, for $A, B \in \mathbb{R}^{d \times d}$. For a symmetric positive definite matrix A, let $\|z\|_A := \sqrt{z^\top A^{-1} z}$ for $z \in \mathbb{R}^d$. Let $C^s(\mathbb{R}^d)$ denote the set of s-times continuously differentiable real-valued functions on $\mathbb{R}^d$.

Theorem 1 (Strong consistency of SΠIS- and SΠT-MALA). Let Assumption 1 hold, where $k = k_P$ is a Stein kernel, and let $D_P : \mathcal{P}(\mathcal{X}) \to [0,\infty]$ denote the associated KSD. Assume also that
(A1) $\log p \in C^2(\mathbb{R}^d)$ with $\sup_{x \in \mathbb{R}^d} \|\nabla^2 \log p(x)\| < \infty$;
(A2) there exist $b_1 > 0$ and $B_1 \geq 0$ such that $\nabla^2 \log p(x) \preceq -b_1 I$ for all $\|x\| \geq B_1$;
(A3) $k_P \in C^2(\mathbb{R}^d)$;
(A4) there exist $0 < b_2 < 2 b_1 C_1^2$ and $B_2 \geq 0$ such that $\nabla^2 k_P(x) \preceq b_2 I$ for all $\|x\| \geq B_2$.
Let $P_n^\star = \sum_{i=1}^n w_i^\star \delta(x_i)$ be the result of running Algorithm 2 and let $P_{n,m} = \frac{1}{m} \sum_{i=1}^m \delta(y_i)$ be the result of running Algorithm 3, where $m \leq n$ and $m = \Omega((\log n)^\delta)$ for some $\delta > 2$. Then there exists $\epsilon_0 > 0$ such that, for all step sizes $\epsilon \in (0, \epsilon_0)$ and all initial states $x_0 \in \mathbb{R}^d$,
$$D_P(P_n^\star) \to 0, \qquad D_P(P_{n,m}) \to 0$$
almost surely as $m, n \to \infty$.

The proof is in Appendix B. Compared to earlier authors, such as Chen et al. (2019) and Riabiz et al. (2022), whose results required high-level assumptions on the convergence of MCMC for which explicit sufficient conditions had yet to be derived, a major novelty here is that our assumptions are explicit and can often be verified (see also Hodgkinson et al., 2020). (A2) is strong log-concavity of P when $B_1 = 0$, while for $B_1 > 0$ this condition is slightly stronger than the related distant dissipativity condition assumed in earlier work (Gorham and Mackey, 2017; Riabiz et al., 2022). (A4) holds for the Langevin Stein kernel (i.e. weak convergence control) and for the KGM1 Stein kernel (i.e. weak convergence control plus control over first moments), but not for the higher-order KGM Stein kernels. Extending our proof strategy to the higher-order KGM Stein kernels would require further research into the convergence properties of MALA, and this is expected to be difficult.

4 Benchmarking on PosteriorDB

The area of Bayesian computation has historically lacked a common set of benchmark problems, with classical examples being insufficiently difficult and case studies being hand-picked (Chopin and Ridgway, 2017). To introduce objectivity into our assessment, we exploited the recently released PosteriorDB benchmark (Magnusson et al., 2022). This project is an attempt toward standardised benchmarking, consisting of a collection of posteriors to be numerically approximated. Here, we systematically compared the performance of SΠIS-MALA against the default Stein importance sampling algorithm (i.e. Π = P; denoted SIS-MALA), and also against unprocessed P-invariant MALA (i.e. uniform weights), reporting results across the breadth of PosteriorDB. The test problems in PosteriorDB are defined in the Stan probabilistic programming language, and so BridgeStan (Roualdes et al., 2023) was used to directly access posterior densities and their gradients as required. For all instances of MALA, an adaptive algorithm was used to learn a suitable preconditioner matrix M during the warm-up period; see Appendix D.3. All experiments that we report can be reproduced using code available at https://github.com/congyewang/Stein-Pi-Importance-Sampling.

Results are reported in Table 1 for $n = 3 \times 10^3$ samples from MALA. These focus on the Langevin Stein kernel, for which our theory holds, and the KGM3 Stein kernel, for which it does not.
There was a significant improvement of SΠIS-MALA over SIS-MALA in 73% of test problems for the Langevin Stein kernel and in 65% of test problems for the KGM3 Stein kernel. Compared to unprocessed MALA, a significant improvement occurred in 100% and 97% of cases, respectively for each kernel. However, the extent of improvement decreased when the dimension d of the target increased, supporting the intuition that we set out earlier and in Appendix D.1.

Table 1: Benchmarking on PosteriorDB. Here we compared raw output from MALA with the post-processed output provided by the default Stein importance sampling method of Liu and Lee (2017) (SIS-MALA) and the proposed Stein Π-Importance Sampling method (SΠIS-MALA). Here d = dim(P) and the number of MALA samples was $n = 3 \times 10^3$. The Langevin and KGM3 Stein kernels were used for SIS-MALA and SΠIS-MALA and the associated KSDs are reported. Ten replicates were computed and statistically significant improvement is highlighted in bold.

| Task | d | MALA (Langevin) | SIS-MALA (Langevin) | SΠIS-MALA (Langevin) | MALA (KGM3) | SIS-MALA (KGM3) | SΠIS-MALA (KGM3) |
|---|---|---|---|---|---|---|---|
| earnings-earn_height | 3 | 1.41 | 0.0674 | 0.0332 | 5.33 | 0.656 | 0.181 |
| gp_pois_regr-gp_regr | 3 | 0.298 | 0.0436 | 0.0373 | 1.22 | 0.385 | 0.223 |
| kidiq-kidscore_momhs | 3 | 1.04 | 0.109 | 0.0941 | 4.66 | 0.848 | 0.476 |
| kidiq-kidscore_momiq | 3 | 5.03 | 0.516 | 0.358 | 25.3 | 4.86 | 1.55 |
| mesquite-logmesquite_logvolume | 3 | 1.10 | 0.179 | 0.156 | 4.97 | 1.70 | 0.844 |
| arma-arma11 | 4 | 4.47 | 1.09 | 1.01 | 26.0 | 8.91 | 6.03 |
| earnings-logearn_logheight_male | 4 | 9.46 | 1.96 | 1.59 | 53.9 | 15.4 | 8.65 |
| garch-garch11 | 4 | 0.543 | 0.159 | 0.130 | 4.70 | 1.16 | 1.01 |
| kidiq-kidscore_momhsiq | 4 | 5.21 | 0.982 | 0.897 | 29.3 | 7.25 | 5.05 |
| earnings-logearn_interaction_z | 5 | 3.09 | 1.36 | 1.33 | 19.3 | 10.4 | 8.94 |
| kidiq-kidscore_interaction | 5 | 7.74 | 1.65 | 1.79 | 47.8 | 13.2 | 10.1 |
| kidiq_with_mom_work-kidscore_interaction_c | 5 | 1.35 | 0.659 | 0.711 | 7.92 | 4.05 | 4.17 |
| kidiq_with_mom_work-kidscore_interaction_c2 | 5 | 1.38 | 0.689 | 0.699 | 8.09 | 4.24 | 4.25 |
| kidiq_with_mom_work-kidscore_interaction_z | 5 | 1.11 | 0.500 | 0.499 | 6.62 | 2.63 | 3.25 |
| kidiq_with_mom_work-kidscore_mom_work | 5 | 1.07 | 0.507 | 0.545 | 6.70 | 2.63 | 3.04 |
| low_dim_gauss_mix-low_dim_gauss_mix | 5 | 5.51 | 1.87 | 1.76 | 37.5 | 14.7 | 11.3 |
| mesquite-logmesquite_logva | 5 | 1.83 | 0.821 | 0.818 | 12.6 | 5.73 | 5.59 |
| hmm_example-hmm_example | 6 | 1.99 | 0.578 | 0.523 | 11.6 | 4.13 | 3.40 |
| sblrc-blr | 6 | 479 | 154 | 134 | 3300 | 1100 | 854 |
| sblri-blr | 6 | 201 | 66.7 | 60.3 | 1340 | 514 | 595 |
| arK-arK | 7 | 6.87 | 3.39 | 3.16 | 60.4 | 26.4 | 23.0 |
| mesquite-logmesquite_logvash | 7 | 1.89 | 1.18 | 1.23 | 15.5 | 8.88 | 10.1 |
| bball_drive_event_0-hmm_drive_0 | 8 | 1.15 | 0.679 | 0.698 | 8.55 | 4.72 | 3.99 |
| bball_drive_event_1-hmm_drive_1 | 8 | 42.9 | 11.9 | 12.4 | 285 | 85.6 | 67.8 |
| hudson_lynx_hare-lotka_volterra | 8 | 4.62 | 2.29 | 2.15 | 47.4 | 18.8 | 18.9 |
| mesquite-logmesquite | 8 | 1.46 | 1.00 | 1.06 | 13.3 | 8.28 | 9.14 |
| mesquite-logmesquite_logvas | 8 | 2.02 | 1.31 | 1.35 | 19.2 | 10.8 | 12.2 |
| mesquite-mesquite | 8 | 0.429 | 0.268 | 0.235 | 3.71 | 2.17 | 2.42 |
| eight_schools-eight_schools_centered | 10 | 0.526 | 0.100 | 0.182 | 7.53 | 2.15 | 215 |
| eight_schools-eight_schools_noncentered | 10 | 0.210 | 0.137 | 0.137 | 43.6 | 28.7 | 27.5 |
| nes1972-nes | 10 | 6.16 | 3.89 | 3.45 | 72.9 | 36.2 | 34.4 |
| nes1976-nes | 10 | 6.67 | 3.86 | 3.53 | 77.5 | 35.5 | 34.4 |
| nes1980-nes | 10 | 4.34 | 2.68 | 2.57 | 49.8 | 25.4 | 25.7 |
| nes1984-nes | 10 | 6.18 | 3.75 | 3.43 | 71.3 | 34.9 | 33.6 |
| nes1988-nes | 10 | 7.40 | 3.70 | 3.27 | 81.4 | 34.6 | 32.4 |
| nes1992-nes | 10 | 7.52 | 4.32 | 3.84 | 89.1 | 39.7 | 37.3 |
| nes1996-nes | 10 | 6.44 | 3.87 | 3.53 | 74.1 | 36.4 | 34.3 |
| nes2000-nes | 10 | 3.35 | 2.22 | 2.20 | 38.6 | 21.3 | 22.8 |
| diamonds-diamonds | 26 | 196 | 157 | 143 | 5120 | 2990 | 2620 |
| mcycle_gp-accel_gp | 66 | 11.3 | 8.25 | 9.79 | 960 | 623 | 815 |
An in-depth breakdown of results, including varying the number n of samples that were used, and the performance of SΠT-MALA, can be found in Appendices D.5 and D.6.

If P and its gradients are cheap to evaluate, the computational cost of MALA is lower than that of SIS-MALA, and one could run more iterations of MALA for an equivalent computational cost. But for more complex P, the computational cost of all algorithms will be gated by the number of times P and its gradients need to be evaluated, making the direct comparison in Table 1 meaningful. Further, if we aim for a compressed representation of P, then some form of post-processing of MALA output would be required, which would then entail an additional computational cost.

Our focus is on the development of algorithms for minimisation of KSDs; the properties of KSDs themselves are out of scope for this work (the interested reader is referred to Gorham and Mackey, 2017; Barp et al., 2022b; Kanagawa et al., 2022). Nonetheless, there is much interest in better understanding the properties of KSDs, and we therefore also report the performance of SΠIS-MALA in terms of 1-Wasserstein divergence in Appendix D.7. The main contrast between these results and the results in Table 1 is that, being score-based, KSDs suffer from the blindness to mixing proportions phenomenon which has previously been documented in Wenliang and Kanagawa (2021); Koehler et al. (2022); Liu et al. (2023). Caution should therefore be taken when using algorithms based on Stein discrepancies in the context of posterior distributions with multiple high-probability regions that are spatially separated. This is also a failure mode for MCMC algorithms such as MALA, and yet there are still many problems for which MALA has been successfully used.

The alternative choice $\Pi_1$, with $\pi_1(x) \propto p(x)^{d/(d+1)}$, which provides a generic form of over-dispersion and is optimal for approximation in 1-Wasserstein divergence (c.f. Section 2.1), was also considered. Results in Appendix D.8 indicate that, while $\Pi_1$ yields an improvement compared to the baseline of using P itself, $\Pi_1$ may be less effective than our proposed Π when P is skewed.

5 Discussion

This paper presented Stein Π-Importance Sampling: an algorithm that is simple to implement, admits an end-to-end theoretical treatment, and achieves a significant improvement over existing post-processing methods based on KSD. On the negative side, second-order derivatives of the statistical model are required, and we are ultimately bound to the performance of the KSD on which Stein Π-Importance Sampling is based. Our analysis focused on MALA, but there is in principle no barrier to deriving sufficient conditions for consistent approximation that are applicable to other sampling algorithms, such as the unadjusted Langevin algorithm. Of course, it remains to be seen whether SIS-MALA or any of its variants will stand the test of time compared to continued development in MCMC methodology, but we believe this line of research merits further investigation. For models for which access to second-order derivatives is impractical, our methodology and theoretical analysis are directly applicable to gradient-free KSD (Fisher and Oates, 2023), and this would be an interesting direction for future work. Similarly, alternatives to KSD that are better suited to high-dimensional P could be considered, such as the sliced KSD of Gong et al. (2021a,b).

Acknowledgements

CW was supported by the China Scholarship Council. HK and CJO were supported by EP/W019590/1.
The authors are grateful to François-Xavier Briol for feedback on an earlier draft of the manuscript, and to the anonymous Reviewers for their input. Adachi, M., Hayakawa, S., Jørgensen, M., Oberhauser, H., and Osborne, M. A. (2022). Fast Bayesian inference with batch Bayesian quadrature via kernel recombination. In Proceedings of the 35th Conference on Neural Information Processing Systems. Agapiou, S., Papaspiliopoulos, O., Sanz-Alonso, D., and Stuart, A. M. (2017). Importance sampling: Intrinsic dimension and computational cost. Statistical Science, 32(3):405 431. Anastasiou, A., Barp, A., Briol, F.-X., Ebner, B., Gaunt, R. E., Ghaderinezhad, F., Gorham, J., Gretton, A., Ley, C., Liu, Q., Mackey, L., Oates, C. J., Reinert, G., and Swan, Y. (2023). Stein s method meets statistics: A review of some recent developments. Statistical Science, (38):120 139. Bach, F. (2017). On the equivalence between kernel quadrature rules and random feature expansions. The Journal of Machine Learning Research, 18(1):714 751. Bach, F., Lacoste-Julien, S., and Obozinski, G. (2012). On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Conference on Machine Learning. Barbour, A. D. (1988). Stein s method and poisson process convergence. Journal of Applied Probability, 25(A):175 184. Barbour, A. D. (1990). Stein s method for diffusion approximations. Probability Theory and Related Fields, 84(3):297 322. Barp, A., Oates, C. J., Porcu, E., and Girolami, M. (2022a). A Riemann Stein kernel method. Bernoulli, 28(4):2181 2208. Barp, A., Simon-Gabriel, C.-J., Girolami, M., and Mackey, L. (2022b). Targeted separation and convergence with kernel discrepancies. ar Xiv preprint ar Xiv:2209.12835. Bénard, C., Staber, B., and Da Veiga, S. (2023). Kernel stein discrepancy thinning: A theoretical perspective of pathologies and a practical fix with regularization. ar Xiv preprint ar Xiv:2301.13528. Briol, F.-X., Oates, C. J., Cockayne, J., Chen, W. Y., and Girolami, M. (2017). On the sampling problem for kernel quadrature. In Proceedings of the 34th International Conference on Machine Learning, pages 586 595. Briol, F.-X., Oates, C. J., Girolami, M., Osborne, M. A., and Sejdinovic, D. (2019). Probabilistic integration: A role in statistical computation (with discussion and rejoinder). Statistical Science, 34(1):1 22. Carmeli, C., De Vito, E., and Toigo, A. (2006). Vector valued reproducing kernel Hilbert spaces of integrable functions and mercer theorem. Analysis and Applications, 4(04):377 408. Chen, W. Y., Barp, A., Briol, F.-X., Gorham, J., Girolami, M., Mackey, L., and Oates, C. J. (2019). Stein point Markov chain Monte Carlo. In Proceedings of the 36th International Conference on Machine Learning, pages 1011 1021. Chen, W. Y., Mackey, L., Gorham, J., Briol, F.-X., and Oates, C. J. (2018). Stein points. In Proceedings of the 35th International Conference on Machine Learning, pages 844 853. Chen, Y., Welling, M., and Smola, A. (2010). Super-samples from kernel herding. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pages 109 116. Chopin, N. and Ducrocq, G. (2021). Fast compression of MCMC output. Entropy, 23(8):1017. Chopin, N. and Ridgway, J. (2017). Leave pima indians alone: Binary regression as a benchmark for Bayesian computation. Statistical Science, 32(1):64 87. Chwialkowski, K., Strathmann, H., and Gretton, A. (2016). A kernel test of goodness of fit. 
In Proceedings of the 33rd International Conference on Machine Learning, pages 2606 2615. Cohort, P. (2004). Limit theorems for random normalized distortion. The Annals of Applied Probability, 14(1):118 143. Durmus, A. and Moulines, É. (2022). On the geometric convergence for MALA under verifiable conditions. ar Xiv preprint ar Xiv:2201.01951. Dwivedi, R. and Mackey, L. (2021). Kernel thinning. In Proceedings of 34th Conference on Learning Theory, pages 1753 1753. Dwivedi, R. and Mackey, L. (2022). Generalized kernel thinning. In Proceedings of the 10th International Conference on Learning Representations. Ehler, M., Gräf, M., and Oates, C. J. (2019). Optimal Monte Carlo integration on closed manifolds. Statistics and Computing, 29(6):1203 1214. Fisher, M. A. and Oates, C. J. (2023). Gradient-free kernel Stein discrepancy. In Proceedings of the 37th Conference on Neural Information Processing Systems. Girolami, M. and Calderhead, B. (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123 214. Gong, W., Li, Y., and Hernández-Lobato, J. M. (2021a). Sliced kernelized Stein discrepancy. In Proceedings of the 9th International Conference on Learning Representations. Gong, W., Zhang, K., Li, Y., and Hernández-Lobato, J. M. (2021b). Active slices for sliced Stein discrepancy. ar Xiv:2102.03159. Gorham, J., Duncan, A. B., Vollmer, S. J., and Mackey, L. (2019). Measuring sample quality with diffusions. The Annals of Applied Probability, 29(5):2884 2928. Gorham, J. and Mackey, L. (2015). Measuring sample quality with Stein s method. In Proceedings of the 29th Conference on Neural Information Processing Systems, pages 226 234. Gorham, J. and Mackey, L. (2017). Measuring sample quality with kernels. In Proceedings of the 34th International Conference on Machine Learning, pages 1292 1301. Gotze, F. (1991). On the rate of convergence in the multivariate CLT. The Annals of Probability, pages 724 739. Graf, S. and Luschgy, H. (2007). Foundations of Quantization for Probability Distributions. Springer. Hawkins, C., Koppel, A., and Zhang, Z. (2022). Online, informative MCMC thinning with kernelized Stein discrepancy. ar Xiv preprint ar Xiv:2201.07130. Hayakawa, S., Oberhauser, H., and Lyons, T. (2022). Positively weighted kernel quadrature via subsampling. In Proceedings of the 35th Conference on Neural Information Processing Systems. Hodgkinson, L., Salomone, R., and Roosta, F. (2020). The reproducing Stein kernel approach for post-hoc corrected sampling. ar Xiv preprint ar Xiv:2001.09266. Kanagawa, H., Gretton, A., and Mackey, L. (2022). Controlling moments with kernel Stein discrepancies. ar Xiv preprint ar Xiv:2211.05408 (v1). Karvonen, T., Oates, C. J., and Girolami, M. (2021). Integration in reproducing kernel Hilbert spaces of Gaussian kernels. Mathematics of Computation, 90(331):2209 2233. Karvonen, T., Oates, C. J., and Sarkka, S. (2018). A Bayes Sard cubature method. In Proceedings of the 32nd Conference on Neural Information Processing Systems. Kent, J. (1978). Time-reversible diffusions. Advances in Applied Probability, 10(4):819 835. Koehler, F., Heckett, A., and Risteski, A. (2022). Statistical efficiency of score matching: The view from isoperimetry. In Proceedings of the 36th Conference on Neural Information Processing Systems. Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer Science & Business Media. Liu, Q. and Lee, J. 
(2017). Black-box importance sampling. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 952 961. Liu, Q., Lee, J., and Jordan, M. (2016). A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of the 33rd International Conference on Machine Learning, pages 276 284. Liu, X., Duncan, A., and Gandy, A. (2023). Using perturbation to improve goodness-of-fit tests based on kernelized Stein discrepancy. In Proceedings of the 40th International Conference on Machine Learning. Magnusson, M., Bürkner, P., and Vehtari, A. (2022). Posterior DB: A set of posteriors for Bayesian inference and probabilistic programming. https://github.com/stan-dev/posteriordb. Meyn, S. P. and Tweedie, R. L. (2012). Markov Chains and Stochastic Stability. Springer Science & Business Media. Oates, C. J., Girolami, M., and Chopin, N. (2017). Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695 718. Riabiz, M., Chen, W., Cockayne, J., Swietach, P., Niederer, S. A., Mackey, L., and Oates, C. J. (2022). Optimal thinning of MCMC output. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(4):1059 1081. Roberts, G. O. and Rosenthal, J. S. (1998). Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255 268. Roberts, G. O. and Stramer, O. (2002). Langevin diffusions and Metropolis Hastings algorithms. Methodology and Computing in Applied Probability, 4:337 357. Roualdes, E., Ward, B., Axen, S., and Carpenter, B. (2023). Bridge Stan: Efficient in-memory access to Stan programs through Python, Julia, and R. https://github.com/roualdes/bridgestan. Shetty, A., Dwivedi, R., and Mackey, L. (2022). Distribution compression in near-linear time. In Proceedings of the 10th International Conference on Learning Representations. Sonnleitner, M. (2022). The Power of Random Information for Numerical Approximation and Integration. Ph D thesis, University of Passau. Teymur, O., Gorham, J., Riabiz, M., and Oates, C. J. (2021). Optimal quantisation of probability measures using maximum mean discrepancy. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, pages 1027 1035. Wenliang, L. K. and Kanagawa, H. (2021). Blindness of score-based methods to isolated components and mixing proportions. In Proceedings of Your Model is Wrong @ the 35th Conference on Neural Information Processing Systems. These appendices contains supporting material for the paper Stein Π-Importance Sampling. The mathematical background on Stein kernels is contained in Appendix A. The proof of Theorem 1 is contained in Appendix B. For implementation of Stein Π-Importance Sampling without the aid of automatic differentiation, various explicit derivatives are required; the relevant calculations can be found in Appendix C. The empirical protocols and additional empirical results are presented in Appendix D. A Mathematical Background This appendix contains mathematical background on reproducing kernels and Stein kernels, as used in the main text. Appendix A.1 introduces matrix-valued reproducing kernels, while Appendix A.2 specialises to Stein kernels by application of a Stein operator to a matrix-valued kernel. A selection of useful Stein kernels are presented in Appendix A.3. 
A.1 Matrix-Valued Reproducing Kernels

A matrix-valued kernel is a function $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^{d \times d}$ that is both
1. symmetric, $K(x, y) = K(y, x)^\top$ for all $x, y \in \mathbb{R}^d$, and
2. positive semi-definite, $\sum_{i=1}^n \sum_{j=1}^n \langle c_i, K(x_i, x_j) c_j \rangle \geq 0$ for all $x_1,\dots,x_n \in \mathbb{R}^d$ and $c_1,\dots,c_n \in \mathbb{R}^d$.
Let $K_x = K(\cdot, x)$. For vector-valued functions $g, g' : \mathbb{R}^d \to \mathbb{R}^d$, defined by $g = \sum_{i=1}^n K_{x_i} c_i$ and $g' = \sum_{j=1}^m K_{x'_j} c'_j$, define an inner product
$$\langle g, g' \rangle_{\mathcal{H}(K)} = \sum_{i=1}^n \sum_{j=1}^m \langle c_i, K(x_i, x'_j) c'_j \rangle. \qquad (9)$$
There is a unique Hilbert space of such vector-valued functions associated to K, denoted $\mathcal{H}(K)$; see Proposition 2.1 of Carmeli et al. (2006). This space is characterised as $\mathcal{H}(K) = \overline{\mathrm{span}}\{K_x c : x, c \in \mathbb{R}^d\}$, where the closure is taken with respect to the inner product in (9). It can be shown that $\mathcal{H}(K)$ is in fact a reproducing kernel Hilbert space (RKHS) which satisfies the reproducing property $\langle g, K_x c \rangle_{\mathcal{H}(K)} = \langle g(x), c \rangle$ for all $g \in \mathcal{H}(K)$ and $x, c \in \mathbb{R}^d$. Matrix-valued kernels are the natural starting point for construction of KSDs, as described next.

A.2 Stein Kernels

A general construction for Stein kernels is to first identify a matrix-valued RKHS $\mathcal{H}(K)$ and an operator $\mathcal{S}_P : \mathcal{H}(K) \to L^1(P)$ for which $\int \mathcal{S}_P h \, \mathrm{d}P = 0$ for all $h \in \mathcal{H}(K)$. Such an operator will be called a Stein operator. The collection $\{\mathcal{S}_P h : h \in \mathcal{H}(K)\}$ inherits the structure of an RKHS, whose reproducing kernel
$$k_P(x, y) = \langle \mathcal{S}_P K_x, \mathcal{S}_P K_y \rangle_{\mathcal{H}(K)} \qquad (10)$$
is a Stein kernel, meaning that $\mu_P = 0$ where $\mu_P$ is the kernel mean embedding from (1); see Barp et al. (2022b). Explicit calculations for the Stein kernels considered in this work can be found in Appendix C.

For univariate distributions, Barbour (1988) proposed to obtain Stein operators from infinitesimal generators of P-invariant continuous-time Markov processes; see also Barbour (1990); Gotze (1991). The approach was extended to multivariate distributions in Gorham and Mackey (2015). The starting point is the P-invariant Itô diffusion
$$\mathrm{d}X_t = \frac{1}{2} \frac{1}{p(X_t)} \nabla \cdot [p(X_t) M(X_t)] \, \mathrm{d}t + M(X_t)^{1/2} \, \mathrm{d}W_t, \qquad (11)$$
where p is the density of P, assumed to be positive, $M : \mathbb{R}^d \to \mathbb{R}^{d \times d}$ is a symmetric matrix called the diffusion matrix, and $W_t$ is a standard Wiener process (Kent, 1978; Roberts and Stramer, 2002). Here the notation $[\nabla \cdot A]_i = \nabla \cdot (A_{i,\cdot})$ indicates the divergence operator applied to each row of the matrix $A(x) \in \mathbb{R}^{d \times d}$. The infinitesimal generator is
$$(\mathcal{A}_P u)(x) = \frac{1}{2} \frac{1}{p(x)} \nabla \cdot [p(x) M(x) \nabla u(x)].$$
Substituting $h(x)$ for $\frac{1}{2} \nabla u(x)$, we obtain a Stein operator
$$(\mathcal{S}_P h)(x) = \frac{1}{p(x)} \nabla \cdot [p(x) M(x) h(x)], \qquad (12)$$
called the diffusion Stein operator (Gorham et al., 2019). This is indeed a Stein operator, since under mild integrability conditions on K, the divergence theorem gives that $\int \mathcal{S}_P h \, \mathrm{d}P = 0$ for all $h \in \mathcal{H}(K)$; for full details and a proof see Barp et al. (2022b).

A.3 Selecting a Stein Kernel

There are several choices for a Stein kernel, and which we should use depends on what form of convergence we hope to control (Gorham and Mackey, 2017; Gorham et al., 2019; Hodgkinson et al., 2020; Barp et al., 2022b; Kanagawa et al., 2022). Appendix A.3.1 describes the Langevin Stein kernel for weak convergence control, Appendix A.3.2 describes the KGM Stein kernels for additional control over moments, and Appendix A.3.3 presents the Riemann Stein kernel, whose convergence properties have to date been less well-studied. All of the kernels that we consider have length-scale parameters that need to be specified, and some also have location parameters to be specified. As a reasonably automatic default we define
$$x^\ast \in \operatorname*{arg\,max}_x p(x), \qquad \Sigma^{-1} = -\nabla^2 \log p(x^\ast)$$
as a location and a matrix of characteristic length scales for P that will be used throughout. These values can typically be obtained using gradient-based optimisation, which is usually cheaper to perform compared to full approximation of P. It is assumed that $-\nabla^2 \log p(x^\ast)$ is positive definite in the sequel.
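As an illustration (ours, not from the original text), the default $x^\ast$ and Σ can be obtained along the following lines, assuming callables log_p and grad_log_p are available; a finite-difference Hessian of the score is used here for brevity, although automatic differentiation would serve equally well.

```python
import numpy as np
from scipy.optimize import minimize

def default_location_and_scale(log_p, grad_log_p, x_init, h=1e-4):
    """Default x* = argmax p(x) and Sigma = (-grad^2 log p(x*))^{-1}, obtained by
    gradient-based optimisation and a central finite-difference Hessian of the score."""
    result = minimize(lambda x: -log_p(x), x_init,
                      jac=lambda x: -grad_log_p(x), method="L-BFGS-B")
    x_star = result.x
    d = x_star.size
    hessian = np.zeros((d, d))
    for j in range(d):                      # column j of grad^2 log p(x*)
        e = np.zeros(d); e[j] = h
        hessian[:, j] = (grad_log_p(x_star + e) - grad_log_p(x_star - e)) / (2 * h)
    hessian = 0.5 * (hessian + hessian.T)   # symmetrise
    sigma = np.linalg.inv(-hessian)
    return x_star, sigma
```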
A.3.1 Weak Convergence Control with Langevin Stein Kernels

The first kernel we consider, which we called the Langevin Stein kernel in the main text, was introduced by Gorham and Mackey (2017). This Stein kernel was developed for the purpose of controlling the weak convergence of a sequence $(Q_n)_{n \in \mathbb{N}} \subset \mathcal{P}(\mathbb{R}^d)$ to P. Recall that a sequence $(Q_n)_{n \in \mathbb{N}}$ is said to converge weakly (or in distribution) to P if $\int f \, \mathrm{d}Q_n \to \int f \, \mathrm{d}P$ for all continuous bounded functions $f : \mathbb{R}^d \to \mathbb{R}$; this convergence is denoted $Q_n \stackrel{d}{\to} P$ in shorthand. The problem considered in Gorham and Mackey (2017) was how to select a combination of matrix-valued kernel K (and, implicitly, a diffusion matrix M) such that the Stein kernel $k_P$ in (10) generates a KSD $D_P(Q)$ in (4) for which $D_P(Q_n) \to 0$ implies $Q_n \stackrel{d}{\to} P$. Their solution was to combine the inverse multi-quadric kernel with an identity diffusion matrix,
$$K(x, y) = (1 + \|x - y\|_\Sigma^2)^{-\beta} I, \qquad M(x) = I,$$
for $\beta \in (0, 1)$. Provided that P has a density p for which $\nabla \log p(x)$ is Lipschitz, and that P is distantly dissipative (see Definition 4 of Gorham and Mackey, 2017), the associated KSD enjoys weak convergence control. Technically, the results in Gorham and Mackey (2017) apply only when $\Sigma = I$, but Theorem 4 in Chen et al. (2019) demonstrated that they hold also for any positive definite Σ. Following the recommendation of several previous authors, including Chen et al. (2018, 2019); Riabiz et al. (2022), we take $\beta = \frac{1}{2}$ throughout.

A.3.2 Moment Convergence Control with KGM Stein Kernels

Despite its many elegant properties, weak convergence can be insufficient for applications where we are interested in integrals $\int f \, \mathrm{d}P$ for which the integrand $f : \mathbb{R}^d \to \mathbb{R}$ is unbounded. In particular, this is the case for moments of the form $f(x) = x_1^{\alpha_1} \cdots x_d^{\alpha_d}$, $0 \neq \alpha \in \mathbb{N}_0^d$. In such situations, we may seek also the stronger property of moment convergence control. The development of KSDs for moment convergence control was recently considered by Kanagawa et al. (2022), and we referred to their construction as the KGM Stein kernels in the main text. (For convenience, we have adopted the initials of the authors in naming the KGM Stein kernel.) A sequence $(Q_n)_{n \in \mathbb{N}} \subset \mathcal{P}(\mathbb{R}^d)$ is said to converge to P in the s-th order moment if $\int \|x\|^s \, \mathrm{d}Q_n(x) \to \int \|x\|^s \, \mathrm{d}P(x)$. To establish convergence of moments, we need an additional condition on top of weak convergence control: uniform integrability control. A sequence of measures $(Q_n)_{n \in \mathbb{N}}$ is said to have uniformly integrable s-th moments if for any $\varepsilon > 0$, we can take $r > 0$ such that
$$\int_{\|x\| > r} \|x\|^s \, \mathrm{d}Q_n(x) < \varepsilon \quad \text{for all } n \in \mathbb{N}.$$
This condition essentially states that the tail decay of the measures is well-controlled (so that each has a convergent moment). The KSD convergence $D_P(Q_n) \to 0$ implies uniform integrability if for any $\varepsilon > 0$, we can take $r_\varepsilon > 0$ and $f_\varepsilon \in \mathcal{H}(K)$ such that
$$\mathcal{S}_P f_\varepsilon(x) \geq \|x\|^s \mathbb{1}\{\|x\| > r_\varepsilon\} - \varepsilon \quad \text{for all } x \in \mathbb{R}^d, \qquad (13)$$
i.e., the Stein-modified RKHS can approximate the (norm-weighted) indicator function arbitrarily well. Such a function $f_\varepsilon$ can be explicitly constructed (while not guaranteed to be a member of the RKHS). Specifically, the choice $f_\varepsilon = (1 - \iota_\varepsilon) g$ satisfies (13) under an appropriate dissipativity condition, where $\iota_\varepsilon$ is a differentiable indicator function vanishing outside a ball, and $g(x) = x / \sqrt{1 + \|x\|^2}$. This motivated Kanagawa et al.
(2022) to introduce the sth order KGM Stein kernel, which is based on the matrix-valued kernel and diffusion matrix K(x, y) = [ϕ( x y Σ) + κlin(x, y)] I, M(x) = (1 + x x 2 Σ) s 1 where (x, y) 7 ϕ( x y Σ) is a C1 0 universal kernel (see Barp et al., 2022b, Theorem 4.8). For comparability of our results, we take ϕ to be the inverse multi-quadric ϕ(r) = (1 + r2) 1/2, and κlin(x, y) = 1 + (x x ) Σ 1(y x ) p 1 + x x 2 Σ p 1 + y x 2 Σ . Here the normalised linear kernel κlin ensures g H(K), while the C1 0 universal kernel ϕ allows approximation of SP ιεg; see Kanagawa et al. (2022). A.3.3 Exploiting Geometry with Riemann Langevin Stein Kernels For academic interest only, here we describe the Riemann Stein kernel that featured in Figure 2 of the main text. This Stein kernel is motivated by the analysis of Gorham et al. (2019), who argued that the use of rapidly mixing Itô diffusions in Stein operators can lead to sharper convergence control. The Riemann Stein kernel is based on the class of so-called Riemannian diffusions considered in Girolami and Calderhead (2011), who proposed to take the diffusion matrix M in (11) to be M = (Iprior + IFisher) 1, the inverse of the Fisher information matrix, IFisher, regularised using the Hessian of the negative log-prior, Iprior. For the two-dimensional illustration in Section 3.2, this leads to the diffusion matrix i=1 [ fi(x)][ fi(x)] ! 1 where we recall that yi = fi(x) + ϵi, where the ϵi are independent with ϵi N(0, 1), and the prior is x N(0, 1). For the presented experiment we paired the above diffusion matrix with the inverse multi-quadric kernel K(x, y) = (1 + x y 2 Σ) β for β = 1 2. The Riemann Stein kernel extends naturally to distributions P defined on Riemannian manifolds X; see Barp et al. (2022a) and Example 1 of Hodgkinson et al. (2020). Unfortunately, the Riemann Stein kernel is prohibitively expensive in most real applications, since each evaluation of M requires a full scan through the size-n dataset. The computational complexity of Stein Π-Thinning with the Riemann Stein kernel is therefore O(m2n2), which is unfavourable compared to the O(m2n) complexity in the case where the Stein kernel is not data-dependent. Furthermore, the convergence control properties of the Riemann Stein kernel have yet to be established. For these reasons we included the Riemann Stein kernel for illustration only; further groundwork will be required before the Riemann-Stein kernel can be practically used. B Proof of Theorem 1 This appendix is devoted to the proof of Theorem 1. The proof is based on the recent work of Durmus and Moulines (2022), on the geometric convergence of MALA, and on the analysis of sparse (greedy) approximation of kernel discrepancies performed in Riabiz et al. (2022); these existing results are recalled in Appendix B.1. An additional technical result on preconditioned MALA is contained in Appendix B.2. The proof of Theorem 1 itself is contained in Appendix B.3. B.1 Auxiliary Results To precisely describe the results on which our analysis is based, we first need to introduce some notation and terminology. Let V : X [1, ) and, for a function f : X R and a measure µ on X, let f V := sup x X V (x) , µ V := sup f V 1 Recall that a Q-invariant Markov chain (xi)i N X with nth step transition kernel Qn is V -uniformly ergodic (see Theorem 16.0.1 of Meyn and Tweedie, 2012) if and only if R [0, ), ρ (0, 1) such that Qn(x, ) Q V Rρn V (x) (14) for all initial states x X and all n N. 
Although MALA (Algorithm 1) is classical (Roberts and Stramer, 2002), until recently explicit sufficient conditions for ergodicity of MALA had not been obtained. The first result we will need is due Durmus and Moulines (2022), who presented the first explicit conditions for V -uniform convergence of MALA. It applies only to standard MALA, meaning that the preconditioning matrix M appearing in Algorithm 1 is the identity matrix. The extension of this result to preconditioned MALA will be handled in Appendix B.2. Theorem 2. Let Q P(Rd) admit a density, q, such that (DM1) there exists x0 with log q(x0) = 0 (DM2) q is twice continuously differentiable with supx Rd 2 log q(x x0) < (DM3) there exists b > 0 and B 0 such that 2 log q(x x0) b I for all x x0 B. Then there exists ϵ0 > 0 such that for all step sizes ϵ (0, ϵ0), standard Q-invariant MALA (i.e. with M = I) is V -uniformly ergodic for V (x) = exp b 16 x x0 2 . Proof. This is Theorem 1 of Durmus and Moulines (2022). The next result that we will need establishes consistency of the greedy algorithm applied to samples from a Markov chain that is Q-invariant. Theorem 3. Let P, Q P(X) with P Q. Let k P : X X R be a Stein kernel and let DP : X X [0, ] denote the associated KSD. Consider a Q-invariant, time-homogeneous Markov chain (xi)i N X such that (R+1) (xi)i N is V -uniformly ergodic, such that V (x) d P (R+2) supi N E h d P d Q(xi) p k P (xi)V (xi) i < (R+3) there exists γ > 0 such that supi N E h exp n γ max 1, d P d Q(xi)2 k P (xi) oi < . Let Pn,m be the result of running the greedy algorithm in (5). If m n and log(n) = O(mγ/2) for some γ < 1, then DP (Pn,m) 0 almost surely as m, n . Proof. This is Theorem 3 of Riabiz et al. (2022). B.2 Preconditioned MALA In addition to the auxiliary results in Appendix B.1, which concern standard MALA (i.e. with M = I), we require an elementary fact about MALA, namely that preconditioned MALA is equivalent to standard MALA under a linear transformation of the state variable. Recall that the M-preconditioned MALA algorithm is a Metropolis Hastings algorithm whose proposal is the Euler Maruyama discretisation of the Itô diffusion (11). Proposition 1. Let M(x) M for a symmetric positive definite and position-independent matrix M Rd d. Let Q P(Rd) admit a probability density function (PDF) q for which the Q-invariant diffusion (Xt)t 0, given by setting p = q in (11), is well-defined. Then under the change of variables Yt := M 1/2Xt, 2( log q)(Yt)dt + d Wt, (15) where q(x) q(M 1/2x) for all x Rd. Proof. From the chain rule, ( log q)(y) = y log q(M 1/2y) = M 1/2( log q)(M 1/2y), and thus, substituting Yt = M 1/2Xt, (15) is equal to d Xt = M 1/2 1 2M 1/2( log q)(M 1/2M 1/2Xt) + d Wt 2M 1( log q)(Xt) + M 1/2d Wt, which is identical to (11) in the case where M(x) = M is constant. Let Q and Q be the distributions referred to in Proposition 1, whose PDFs are respectively q(x) and q(x) q(M 1/2x). Proposition 1 then implies that the M-preconditioned MALA algorithm applied to Q (i.e. Algorithm 1 for Π = Q) is equivalent to the standard MALA algorithm (i.e. M = I) applied to Q. This fact allows us to generalise the result of Theorem 2 as follows: Corollary 1. Consider a symmetric positive definite matrix M Rd d. Assume that conditions (DM1-3) in Theorem 2 are satisfied. Then there exists ϵ 0 > 0 and b > 0 such that for all step sizes ϵ (0, ϵ 0), the M-preconditioned Q-invariant MALA is V -uniformly ergodic for V (x) = exp b 16 x x0 2 . Proof. 
From Theorem 2 and Proposition 1, the result follows if we can establish (DM1-3) for $\tilde{Q}$, since M-preconditioned MALA is equivalent to standard MALA applied to $\tilde{Q}$. For a matrix $A \in \mathbb{R}^{d \times d}$, let $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ respectively denote the minimum and maximum eigenvalues of $A$. For (DM1) we set $y_0 = M^{1/2}x_0$ and observe that $(\nabla \log \tilde{q})(y_0) = M^{-1/2}(\nabla \log q)(x_0) = 0$. For (DM2) we have that
$$\sup_{y \in \mathbb{R}^d} \|\nabla^2 (\log \tilde{q})(y - y_0)\| = \sup_{y \in \mathbb{R}^d} \|M^{-1/2}(\nabla^2 \log q)(M^{-1/2}(y - y_0))M^{-1/2}\| \le \lambda_{\min}(M)^{-1} \sup_{x \in \mathbb{R}^d} \|(\nabla^2 \log q)(x - x_0)\| < \infty.$$
For (DM3) we have that
$$(\nabla^2 \log \tilde{q})(y - y_0) = M^{-1/2}(\nabla^2 \log q)(M^{-1/2}(y - y_0))M^{-1/2} = M^{-1/2}(\nabla^2 \log q)(x - x_0)M^{-1/2} \preceq M^{-1/2}(-b\,\mathrm{I})M^{-1/2} = -b\,M^{-1} \preceq -\tilde{b}\,\mathrm{I},$$
where $\tilde{b} = b\,\lambda_{\max}(M)^{-1}$, which holds for all $\|x - x_0\| \ge B$, and in particular for all $\|y - y_0\| \ge \tilde{B}$ where $\tilde{B} = B\,\lambda_{\max}(M)^{1/2}$. Thus (DM1-3) are established for $\tilde{Q}$.

Remark 1. The choice $M = \Sigma^{-1}$, which sets the preconditioner matrix $M$ equal to the inverse of the length scale matrix $\Sigma$ used in the specification of the kernel $K$ (c.f. Appendix A.3), leads to the elegant interpretation that Stein Π-Importance Sampling applied to M-preconditioned MALA is equivalent to Stein Π-Importance Sampling applied to standard MALA (i.e. with $M = \mathrm{I}$) for the whitened target $\tilde{P}$ with PDF $\tilde{p}(x) \propto p(M^{-1/2}x)$. For our experiments, however, the preconditioner matrix $M$ was learned during a warm-up phase of MALA, since in general the curvature of $P$ (captured by $\Sigma$) and the curvature of $\Pi$ (captured by $M^{-1}$) may be different.

B.3 Proof of Theorem 1

The route to establishing Theorem 1 has three parts. First, we establish (DM1-3) of Theorem 2 with $Q = \Pi$, to deduce from Corollary 1 that Π-invariant M-preconditioned MALA is $V$-uniformly ergodic. This in turn enables us to establish conditions (R+1-3) of Theorem 3, again for $Q = \Pi$, from which the strong consistency $D_P(P_{n,m}) \overset{\mathrm{a.s.}}{\to} 0$ of SΠT-MALA is established. Finally, we note that $0 \le D_P(P_n^*) \le D_P(P_{n,m})$, since the support of $P_{n,m}$ is contained in the support of $P_n^*$, and the latter is optimally weighted, whence also the strong consistency of SΠIS-MALA.

Establish (DM1-3): First we establish (DM1-3) for $Q = \Pi$. Fix $x_0 \in \mathbb{R}^d$. For (DM2), first recall that the range of $k_P$ is $[C_1^2, \infty)$, where $C_1 > 0$, from Assumption 1. Since $\log(\cdot)$ has bounded second derivatives on $[C_1^2, \infty)$, there is a constant $C > 0$ such that, for all $x \in \mathbb{R}^d$, $\|\nabla^2 \log k_P(x)\| \le C\,\|\nabla^2 k_P(x)\|$. Thus, using compactness of the set $\{x : \|x - x_0\| \le B_2\}$,
$$\sup_{x \in \mathbb{R}^d} \|\nabla^2 \log k_P(x)\| \le C \max\Big\{\underbrace{\sup_{\|x - x_0\| \le B_2} \|\nabla^2 k_P(x)\|}_{< \infty \text{ by (A3)}},\ \underbrace{\sup_{\|x - x_0\| \ge B_2} \|\nabla^2 k_P(x)\|}_{\to\, 0}\Big\} < \infty,$$
as required. A similar argument establishes (DM3). For (DM1), from (18) we have $\lim_{\|x\| \to \infty} \pi(x) = 0$, and since $\pi$ is a continuously differentiable density there must exist an $x_0$ at which $\pi$ is maximised, so that $\nabla \log \pi(x_0) = 0$. Thus we have established (DM1-3) for $Q = \Pi$ and we may conclude from Corollary 1 that there is an $\epsilon_0' > 0$ and $b' > 0$ such that, for all $\epsilon \in (0, \epsilon_0')$, the Π-invariant M-preconditioned MALA chain $(x_i)_{i \in \mathbb{N}}$ is $V$-uniformly ergodic for $V(x) = C_2 \exp\left(\frac{b'}{16}\|x - x_0\|^2\right)$ (since if a Markov chain is $V$-uniformly ergodic, then it is also $CV$-uniformly ergodic).

Establish (R+1-3): The aim is now to establish conditions (R+1-3) of Theorem 3 for $Q = \Pi$. By construction $\frac{\mathrm{d}P}{\mathrm{d}\Pi}(x) = C_2/\sqrt{k_P(x)} \le C_2/C_1 < \infty$, where $C_1$ and $C_2$ were defined in Assumption 1, so that $P \ll \Pi$. It has already been established that $(x_i)_{i \in \mathbb{N}}$ is $V$-uniformly ergodic, and further $V(x) = C_2 \exp\left(\frac{b'}{16}\|x - x_0\|^2\right) \ge C_2 = \frac{\mathrm{d}P}{\mathrm{d}\Pi}(x)\sqrt{k_P(x)}$ for all $x$, which establishes (R+1). Let $R$ and $\rho$ denote constants for which the $V$-uniform ergodicity property (14) is satisfied. From $V$-uniform ergodicity, the integral $\int V\,\mathrm{d}\Pi$ exists and
$$\mathbb{E}\left[\tfrac{\mathrm{d}P}{\mathrm{d}\Pi}(x_i)\sqrt{k_P(x_i)}\,V(x_i)\right] = C_2\,\mathbb{E}[V(x_i)], \qquad \left|\mathbb{E}[V(x_i)] - \int V\,\mathrm{d}\Pi\right| \le R\rho^i V(x_0) \to 0,$$
so that $\sup_{i \in \mathbb{N}} \mathbb{E}[V(x_i)] < \infty$, which establishes (R+2). Fix $\gamma > 0$.
By construction $\frac{\mathrm{d}P}{\mathrm{d}\Pi} \le C_2/C_1$, and thus
$$\exp\left\{\gamma \max\left(1, \tfrac{\mathrm{d}P}{\mathrm{d}\Pi}(x)^2\right) k_P(x)\right\} \le \exp\{\tilde{\gamma}\,k_P(x)\}, \qquad \tilde{\gamma} := \max(1, C_2/C_1)^2\,\gamma.$$
Since we have assumed that $k_P$ is continuous with, from (A4),
$$b_3 := \limsup_{\|x\| \to \infty} \frac{k_P(x)}{\|x\|^2} < \infty,$$
we may take $\gamma$ such that $\tilde{\gamma}\,b_3 < b'/16$, so that $\|x \mapsto \exp\{\tilde{\gamma}\,k_P(x)\}\|_V < \infty$ and in particular
$$\left|\mathbb{E}\left[\exp\{\tilde{\gamma}\,k_P(x_i)\}\right] - \int \exp\{\tilde{\gamma}\,k_P(x)\}\,\mathrm{d}\Pi(x)\right| \le \|x \mapsto \exp\{\tilde{\gamma}\,k_P(x)\}\|_V\, R\rho^i V(x_0) \to 0,$$
which establishes (R+3). Thus we have established (R+1-3) for $Q = \Pi$, so from Theorem 3 we have strong consistency of SΠT-MALA (i.e. $D_P(P_{n,m}) \overset{\mathrm{a.s.}}{\to} 0$) provided that $m \le n$ with $\log(n) = O(m^{\gamma/2})$ for some $\gamma < 1$. The latter condition is equivalent to $m = \Omega((\log n)^\delta)$ for some $\delta > 2$, which is the form used in the statement of Theorem 1. Since $0 \le D_P(P_n^*) \le D_P(P_{n,m})$, the strong consistency of SΠIS-MALA is also established.

C Explicit Calculation of Stein Kernels

This appendix contains explicit calculations for the Langevin Stein and KGM Stein kernels $k_P$, which are sufficient to implement Stein Π-Importance Sampling and Stein Π-Thinning. These calculations can also be performed using automatic differentiation, but comparison to the analytic expressions is an important step in validation of computer code. To proceed, we observe that the diffusion Stein operator $S_P$ in (12) applied to a matrix-valued kernel $K$ is equivalent to the Langevin Stein operator applied to the kernel $C(x, y) = M(x) K(x, y) M(y)^\top$. In the case of the Langevin Stein and KGM Stein kernels we have $K(x, y) = \kappa(x, y)\,\mathrm{I}$ for some $\kappa(x, y)$ and $M(x) = (1 + \|x - x^*\|_\Sigma^2)^{(s-1)/2}\,\mathrm{I}$ for some $s \in \{0, 1, 2, \ldots\}$. Thus $C(x, y) = c(x, y)\,\mathrm{I}$, where
$$c(x, y) := (1 + \|x - x^*\|_\Sigma^2)^{(s-1)/2}(1 + \|y - x^*\|_\Sigma^2)^{(s-1)/2}\,\kappa(x, y),$$
and
$$k_P(x, y) = \nabla_x \cdot \nabla_y c(x, y) + [\nabla_x c(x, y)]^\top[\nabla_y \log p(y)] + [\nabla_y c(x, y)]^\top[\nabla_x \log p(x)] + c(x, y)\,[\nabla_x \log p(x)]^\top[\nabla_y \log p(y)],$$
following the calculations in Oates et al. (2017). To evaluate the terms in this formula we start by differentiating $c(x, y)$, to obtain
$$\nabla_x c(x, y) = (1 + \|x - x^*\|_\Sigma^2)^{(s-1)/2}(1 + \|y - x^*\|_\Sigma^2)^{(s-1)/2}\left[\frac{(s-1)\,\kappa(x, y)\,\Sigma^{-1}(x - x^*)}{1 + \|x - x^*\|_\Sigma^2} + \nabla_x \kappa(x, y)\right],$$
$$\nabla_y c(x, y) = (1 + \|x - x^*\|_\Sigma^2)^{(s-1)/2}(1 + \|y - x^*\|_\Sigma^2)^{(s-1)/2}\left[\frac{(s-1)\,\kappa(x, y)\,\Sigma^{-1}(y - x^*)}{1 + \|y - x^*\|_\Sigma^2} + \nabla_y \kappa(x, y)\right],$$
$$\nabla_x \cdot \nabla_y c(x, y) = (1 + \|x - x^*\|_\Sigma^2)^{(s-1)/2}(1 + \|y - x^*\|_\Sigma^2)^{(s-1)/2}\left[\frac{(s-1)^2\,\kappa(x, y)\,(x - x^*)^\top\Sigma^{-2}(y - x^*)}{(1 + \|x - x^*\|_\Sigma^2)(1 + \|y - x^*\|_\Sigma^2)} + \frac{(s-1)\,(y - x^*)^\top\Sigma^{-1}\nabla_x \kappa(x, y)}{1 + \|y - x^*\|_\Sigma^2} + \frac{(s-1)\,(x - x^*)^\top\Sigma^{-1}\nabla_y \kappa(x, y)}{1 + \|x - x^*\|_\Sigma^2} + \nabla_x \cdot \nabla_y \kappa(x, y)\right].$$
These expressions involve gradients of $\kappa(x, y)$, and explicit formulae for these are presented for the choice of $\kappa(x, y)$ corresponding to the Langevin Stein kernel in Appendix C.1, and to the KGM Stein kernel in Appendix C.2. To implement Stein Π-Thinning we require access to both $k_P(x)$ and $\nabla k_P(x)$, the latter for use in the proposal distribution and acceptance probability in MALA. These quantities will now be calculated. In what follows we assume that $\kappa(x, y)$ is continuously differentiable, so that partial derivatives with respect to $x$ and $y$ can be interchanged. Then
$$c_0(x) := c(x, x) = (1 + \|x - x^*\|_\Sigma^2)^{s-1}\,\kappa(x, x),$$
$$c_1(x) := \nabla_x c(x, y)|_{y \leftarrow x} = (1 + \|x - x^*\|_\Sigma^2)^{s-1}\left[\frac{(s-1)\,\kappa(x, x)\,\Sigma^{-1}(x - x^*)}{1 + \|x - x^*\|_\Sigma^2} + \nabla_x \kappa(x, y)|_{y \leftarrow x}\right],$$
$$c_2(x) := \nabla_x \cdot \nabla_y c(x, y)|_{y \leftarrow x} = (1 + \|x - x^*\|_\Sigma^2)^{s-1}\left[\frac{(s-1)^2\,\kappa(x, x)\,(x - x^*)^\top\Sigma^{-2}(x - x^*)}{(1 + \|x - x^*\|_\Sigma^2)^2} + \frac{2(s-1)\,(x - x^*)^\top\Sigma^{-1}\nabla_x \kappa(x, y)|_{y \leftarrow x}}{1 + \|x - x^*\|_\Sigma^2} + \nabla_x \cdot \nabla_y \kappa(x, y)|_{y \leftarrow x}\right],$$
$$k_P(x) := k_P(x, x) = c_2(x) + 2\,c_1(x)^\top \nabla_x \log p(x) + c_0(x)\,\|\nabla_x \log p(x)\|^2. \tag{19}$$
Let $[\nabla_x c_1(x)]_{i,j} = \partial_{x_i}[c_1(x)]_j$ and $[\nabla_x^2 \log p(x)]_{i,j} = \partial_{x_i}\partial_{x_j}\log p(x)$. Now we can differentiate (19) to get
$$\nabla_x k_P(x) = \nabla_x c_2(x) + 2[\nabla_x c_1(x)][\nabla_x \log p(x)] + 2[\nabla_x^2 \log p(x)]\,c_1(x) + [\nabla_x c_0(x)]\,\|\nabla_x \log p(x)\|^2 + 2\,c_0(x)[\nabla_x^2 \log p(x)][\nabla_x \log p(x)]. \tag{20}$$
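The expressions (19) and (20) are straightforward to transcribe into code, and, as noted above, comparison against numerical differentiation is a useful validation step. The following minimal sketch is our own illustration rather than the implementation used for the experiments: it evaluates $k_P(x)$ via (19) and $\nabla_x k_P(x)$ via (20) for user-supplied $c_0$, $c_1$, $c_2$ and score function, and checks the gradient by central finite differences. The standard Gaussian target and the values of $\beta$ and $\Sigma$ are illustrative assumptions; the constant choices $c_0 \equiv 1$, $c_1 \equiv 0$, $c_2 \equiv 2\beta\,\mathrm{tr}(\Sigma^{-1})$ correspond to the Langevin Stein kernel derived in Appendix C.1 below.

```python
import numpy as np

def k_P(x, grad_log_p, c0, c1, c2):
    """Diagonal Stein kernel k_P(x) = k_P(x, x), via equation (19)."""
    s = grad_log_p(x)                        # score function at x
    return c2(x) + 2.0 * c1(x) @ s + c0(x) * (s @ s)

def grad_k_P(x, grad_log_p, hess_log_p, c0, c1, c2, dc0, dc1, dc2):
    """Gradient of the diagonal Stein kernel, via equation (20)."""
    s = grad_log_p(x)                        # \nabla_x log p(x)
    H = hess_log_p(x)                        # \nabla_x^2 log p(x)
    return (dc2(x) + 2.0 * dc1(x) @ s + 2.0 * H @ c1(x)
            + dc0(x) * (s @ s) + 2.0 * c0(x) * H @ s)

# Illustration: standard Gaussian target with the constant choices
# c0 = 1, c1 = 0, c2 = 2*beta*tr(Sigma^{-1}); beta and Sigma are assumptions.
d, beta = 3, 0.5
Sigma_inv = np.eye(d)
c0  = lambda x: 1.0
c1  = lambda x: np.zeros(d)
c2  = lambda x: 2.0 * beta * np.trace(Sigma_inv)
dc0 = lambda x: np.zeros(d)
dc1 = lambda x: np.zeros((d, d))
dc2 = lambda x: np.zeros(d)
grad_log_p = lambda x: -x                    # score of N(0, I)
hess_log_p = lambda x: -np.eye(d)

x = np.array([0.5, -1.0, 2.0])
print(k_P(x, grad_log_p, c0, c1, c2))        # equals 2*beta*d + ||x||^2 here

# Central finite-difference check of (20).
eps, g = 1e-6, np.zeros(d)
for i in range(d):
    e = np.zeros(d); e[i] = eps
    g[i] = (k_P(x + e, grad_log_p, c0, c1, c2)
            - k_P(x - e, grad_log_p, c0, c1, c2)) / (2 * eps)
print(np.allclose(g, grad_k_P(x, grad_log_p, hess_log_p,
                              c0, c1, c2, dc0, dc1, dc2), atol=1e-5))
```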
In what follows we also derive explicit formulae for $c_0(x)$, $c_1(x)$ and $c_2(x)$, and hence for $\nabla_x c_0(x)$, $\nabla_x c_1(x)$ and $\nabla_x c_2(x)$, for the case of the Langevin Stein kernel in Appendix C.1, and the KGM Stein kernel in Appendix C.2.

C.1 Explicit Formulae for the Langevin Stein Kernel

The Langevin Stein kernel from Appendix A.3.1 corresponds to the choice $s = 1$ and $\kappa(x, y)$ the inverse multi-quadric kernel, so that
$$\kappa(x, y) = (1 + \|x - y\|_\Sigma^2)^{-\beta},$$
$$\nabla_x \kappa(x, y) = -2\beta(1 + \|x - y\|_\Sigma^2)^{-\beta-1}\,\Sigma^{-1}(x - y),$$
$$\nabla_y \kappa(x, y) = 2\beta(1 + \|x - y\|_\Sigma^2)^{-\beta-1}\,\Sigma^{-1}(x - y),$$
$$\nabla_x \cdot \nabla_y \kappa(x, y) = -4\beta(\beta + 1)(1 + \|x - y\|_\Sigma^2)^{-\beta-2}(x - y)^\top\Sigma^{-2}(x - y) + 2\beta\,\mathrm{tr}(\Sigma^{-1})(1 + \|x - y\|_\Sigma^2)^{-\beta-1}.$$
Evaluating on the diagonal:
$$\kappa(x, x) = 1, \qquad \nabla_x \kappa(x, y)|_{y \leftarrow x} = \nabla_y \kappa(x, y)|_{y \leftarrow x} = 0, \qquad \nabla_x \cdot \nabla_y \kappa(x, y)|_{y \leftarrow x} = 2\beta\,\mathrm{tr}(\Sigma^{-1}),$$
so that $c_0(x) = 1$, $c_1(x) = 0$, $c_2(x) = 2\beta\,\mathrm{tr}(\Sigma^{-1})$. Differentiating these formulae, $\nabla_x c_0(x) = 0$, $\nabla_x c_1(x) = 0$, $\nabla_x c_2(x) = 0$.

C.2 Explicit Formulae for the KGM Stein Kernel

The KGM kernel of order $s$ from Appendix A.3.2 corresponds to the choice
$$\kappa(x, y) = (1 + \|x - y\|_\Sigma^2)^{-\beta} + \frac{1 + (x - x^*)^\top\Sigma^{-1}(y - x^*)}{(1 + \|x - x^*\|_\Sigma^2)^{s/2}(1 + \|y - x^*\|_\Sigma^2)^{s/2}},$$
for which we have
$$\nabla_x \kappa(x, y) = -2\beta(1 + \|x - y\|_\Sigma^2)^{-\beta-1}\,\Sigma^{-1}(x - y) + \frac{\Sigma^{-1}(y - x^*) - s\,[1 + (x - x^*)^\top\Sigma^{-1}(y - x^*)]\,\Sigma^{-1}(x - x^*)\,(1 + \|x - x^*\|_\Sigma^2)^{-1}}{(1 + \|x - x^*\|_\Sigma^2)^{s/2}(1 + \|y - x^*\|_\Sigma^2)^{s/2}},$$
$$\nabla_y \kappa(x, y) = 2\beta(1 + \|x - y\|_\Sigma^2)^{-\beta-1}\,\Sigma^{-1}(x - y) + \frac{\Sigma^{-1}(x - x^*) - s\,[1 + (x - x^*)^\top\Sigma^{-1}(y - x^*)]\,\Sigma^{-1}(y - x^*)\,(1 + \|y - x^*\|_\Sigma^2)^{-1}}{(1 + \|x - x^*\|_\Sigma^2)^{s/2}(1 + \|y - x^*\|_\Sigma^2)^{s/2}},$$
$$\nabla_x \cdot \nabla_y \kappa(x, y) = -4\beta(\beta + 1)(1 + \|x - y\|_\Sigma^2)^{-\beta-2}(x - y)^\top\Sigma^{-2}(x - y) + 2\beta\,\mathrm{tr}(\Sigma^{-1})(1 + \|x - y\|_\Sigma^2)^{-\beta-1} + \frac{\mathrm{tr}(\Sigma^{-1}) - \frac{s\,(x - x^*)^\top\Sigma^{-2}(x - x^*)}{1 + \|x - x^*\|_\Sigma^2} - \frac{s\,(y - x^*)^\top\Sigma^{-2}(y - x^*)}{1 + \|y - x^*\|_\Sigma^2} + \frac{s^2\,[1 + (x - x^*)^\top\Sigma^{-1}(y - x^*)]\,(x - x^*)^\top\Sigma^{-2}(y - x^*)}{(1 + \|x - x^*\|_\Sigma^2)(1 + \|y - x^*\|_\Sigma^2)}}{(1 + \|x - x^*\|_\Sigma^2)^{s/2}(1 + \|y - x^*\|_\Sigma^2)^{s/2}}.$$
Evaluating on the diagonal:
$$\kappa(x, x) = 1 + (1 + \|x - x^*\|_\Sigma^2)^{-s+1},$$
$$\nabla_x \kappa(x, y)|_{y \leftarrow x} = \nabla_y \kappa(x, y)|_{y \leftarrow x} = -(s-1)\,\Sigma^{-1}(x - x^*)\,(1 + \|x - x^*\|_\Sigma^2)^{-s},$$
$$\nabla_x \cdot \nabla_y \kappa(x, y)|_{y \leftarrow x} = 2\beta\,\mathrm{tr}(\Sigma^{-1}) + \mathrm{tr}(\Sigma^{-1})(1 + \|x - x^*\|_\Sigma^2)^{-s} + s(s-2)(1 + \|x - x^*\|_\Sigma^2)^{-s-1}(x - x^*)^\top\Sigma^{-2}(x - x^*),$$
so that
$$c_0(x) = 1 + (1 + \|x - x^*\|_\Sigma^2)^{s-1},$$
$$c_1(x) = (s-1)(1 + \|x - x^*\|_\Sigma^2)^{s-2}\,\Sigma^{-1}(x - x^*),$$
$$c_2(x) = \frac{[(s-1)^2(1 + \|x - x^*\|_\Sigma^2)^{s-1} - 1]\,(x - x^*)^\top\Sigma^{-2}(x - x^*)}{(1 + \|x - x^*\|_\Sigma^2)^2} + \frac{\mathrm{tr}(\Sigma^{-1})\,[1 + 2\beta(1 + \|x - x^*\|_\Sigma^2)^{s}]}{1 + \|x - x^*\|_\Sigma^2}.$$
Differentiating these formulae:
$$\nabla_x c_0(x) = 2(s-1)(1 + \|x - x^*\|_\Sigma^2)^{s-2}\,\Sigma^{-1}(x - x^*),$$
$$\nabla_x c_1(x) = 2(s-1)(s-2)(1 + \|x - x^*\|_\Sigma^2)^{s-3}\,[\Sigma^{-1}(x - x^*)][\Sigma^{-1}(x - x^*)]^\top + (s-1)(1 + \|x - x^*\|_\Sigma^2)^{s-2}\,\Sigma^{-1},$$
$$\nabla_x c_2(x) = 2(s-1)^2(s-3)(1 + \|x - x^*\|_\Sigma^2)^{s-4}\,[(x - x^*)^\top\Sigma^{-2}(x - x^*)]\,\Sigma^{-1}(x - x^*) + 2(s-1)^2(1 + \|x - x^*\|_\Sigma^2)^{s-3}\,\Sigma^{-2}(x - x^*) + 4\beta\,\mathrm{tr}(\Sigma^{-1})(s-1)(1 + \|x - x^*\|_\Sigma^2)^{s-2}\,\Sigma^{-1}(x - x^*) - 2(1 + \|x - x^*\|_\Sigma^2)^{-2}\,[\Sigma^{-2}(x - x^*) + \mathrm{tr}(\Sigma^{-1})\,\Sigma^{-1}(x - x^*)] + 4(1 + \|x - x^*\|_\Sigma^2)^{-3}\,[(x - x^*)^\top\Sigma^{-2}(x - x^*)]\,\Sigma^{-1}(x - x^*).$$
These complete the analytic calculations necessary to compute the Stein kernel $k_P$ and its gradient.

D Empirical Assessment

This appendix contains full details of the empirical protocols that were employed and the additional empirical results described in the main text. Appendix D.1 discusses the effect of dimension on our proposed Π. Additional illustrative results from Section 3.2 are contained in Appendix D.2. The full details for how MALA was implemented are contained in Appendix D.3. An additional illustration using a generalised autoregressive conditional heteroskedasticity (GARCH) model is presented in Appendix D.4. The full results for SΠIS-MALA are contained in Appendix D.5, and in Appendix D.6 the convergence of the sparse approximation provided by SΠT-MALA to the optimally weighted approximation is investigated. The performance of KSDs is quantified using the 1-Wasserstein divergence in Appendix D.7. Finally, an investigation involving a skewed target is contained in Appendix D.8.

D.1 The Effect of Dimension on Π

The improvement of Stein Π-Importance Sampling over the default Stein importance sampling algorithm (i.e. Π = P) can be expected to diminish as the dimension $d$ of the target P is increased.
To see this, consider the Langevin Stein kernel
$$k_P(x) = c_1 + c_2\,\|\nabla \log p(x)\|_\Sigma^2 \tag{21}$$
for some $c_1, c_2 > 0$; see Appendix C. Taking $P = \mathcal{N}(0, \mathrm{I}_{d \times d})$, for which the length scale matrix $\Sigma$ appearing in Appendix A.3 is $\Sigma = \mathrm{I}_{d \times d}$, we obtain $k_P(x) = c_1 + c_2\|x\|^2$. However, the sampling distribution Π defined in (8) depends on $k_P$ only up to an unspecified normalisation constant; we may therefore equally consider the asymptotic behaviour of $\tilde{k}_P(x) := k_P(x)/d$. Let $X \sim P$. Then $\mathbb{E}[\tilde{k}_P(X)] = (c_1 + c_2 d)/d$, which converges to the $d$-independent constant $c_2$, and
$$\left\|\tilde{k}_P - \mathbb{E}[\tilde{k}_P(X)]\right\|_{L^2(P)}^2 = \int \left(\frac{k_P(x) - (c_1 + c_2 d)}{d}\right)^2 \mathrm{d}P(x) = \frac{2c_2^2}{d} \to 0 \quad \text{as } d \to \infty.$$
This shows that $\tilde{k}_P$ converges to a constant function in $L^2(P)$, and thus for typical values of $x$ in the effective support of P,
$$\pi(x) \propto p(x)\sqrt{k_P(x)} \approx (\mathrm{const}) \times p(x),$$
so that $\Pi \to P$ in the $d \to \infty$ limit. This intuition is borne out in simulations involving both the Langevin Stein kernel (as just discussed) and also the KGM3 Stein kernel. Indeed, Figure S1 shows that as the dimension $d$ is increased, the marginal distributions of Π become increasingly similar to those of P.

Figure S1: The effect of dimension on Π. Here P was taken to be the standard Gaussian distribution $\mathcal{N}(0, \mathrm{I}_{d \times d})$ in $\mathbb{R}^d$ and the proposed distribution Π was computed. The marginal distribution of the first component of Π is plotted for $d \in \{1, 2, 10\}$, for both (a) the Langevin Stein kernel and (b) the KGM3 Stein kernel.

Figure S2: Assessing the performance of the sampling distributions Π shown in Figure 2. The mean kernel Stein discrepancy (KSD) for computation performed using the Langevin Stein kernel (purple), the KGM3 Stein kernel (blue), and the Riemann Stein kernel (red); in each case, KSD was computed using the same Stein kernel used to construct Π. Solid lines indicate the baseline case of sampling from P, while dashed lines indicate the proposed approach of sampling from Π. (The experiment was repeated 10 times and standard error bars are plotted.)

D.2 2D Illustration from the Main Text

Section 3.2 of the main text contained a 2-dimensional illustration of Stein Π-Importance Sampling and presented the distributions Π corresponding to different choices of Stein kernel. Here, in Figure S2, we present the mean KSDs for Stein Π-Importance Sampling performed using the Langevin Stein kernel (purple), the KGM3 Stein kernel (blue), and the Riemann Stein kernel (red), corresponding to the sampling distributions Π displayed in Figure 2 of the main text. For this experiment, exact sampling from both P and Π was performed using a fine grid on which all probabilities were calculated and appropriately normalised. Results are in broad agreement with the 1-dimensional illustration contained in the main text, in the sense that in all cases Stein Π-Importance Sampling provides a significant improvement over the default Stein importance sampling method with Π equal to P.

Algorithm 4: Adaptive MALA
Require: $x_{0,0}$ (initial state), $\epsilon_0$ (initial step size), $M_0$ (initial preconditioner matrix), $\{n_i\}_{i=0}^{h-1}$ (epoch lengths), $\{\alpha_i\}_{i=1}^{h-1}$ (learning schedule), $h$ (number of epochs), $k_P$ (Stein kernel)
1: $\{x_{0,1}, \ldots, x_{0,n_0}\} \leftarrow \mathrm{MALA}(x_{0,0}, \epsilon_0, M_0, n_0, k_P)$
2: for $i = 1, \ldots, h-1$ do
3: $\quad x_{i,0} \leftarrow x_{i-1,n_{i-1}}$
4: $\quad \rho_{i-1} \leftarrow \frac{1}{n_{i-1}} \sum_{j=1}^{n_{i-1}} \mathbb{1}[x_{i-1,j} \neq x_{i-1,j-1}]$ (average acceptance rate from epoch $i-1$)
5: $\quad \epsilon_i \leftarrow \epsilon_{i-1} \exp(\rho_{i-1} - 0.57)$ (update step size)
6: $\quad M_i \leftarrow \alpha_i M_{i-1} + (1 - \alpha_i)\,\mathrm{cov}(\{x_{i-1,1}, \ldots, x_{i-1,n_{i-1}}\})$ (update preconditioner matrix)
7: $\quad \{x_{i,1}, \ldots, x_{i,n_i}\} \leftarrow \mathrm{MALA}(x_{i,0}, \epsilon_i, M_i, n_i, k_P)$
8: end for
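The adaptation in Algorithm 4 is simple to reproduce around any implementation of MALA. The sketch below is our own illustration and not the code used for the experiments: the preconditioned proposal adopts one common convention, in which the proposal has mean $x + \tfrac{\epsilon^2}{2} M \nabla \log \pi(x)$ and covariance $\epsilon^2 M$, since the exact parametrisation of Algorithm 1 is not reproduced in this appendix; the toy Gaussian target, epoch lengths and seed are likewise assumptions made for the purpose of illustration. To target Π one would supply $\log \pi = \log p + \tfrac{1}{2}\log k_P$ (up to an additive constant) and its gradient.

```python
import numpy as np

def mala(x0, eps, M, n, log_pi, grad_log_pi, rng):
    """One epoch of preconditioned MALA targeting log_pi.

    Proposal: x' = x + (eps^2 / 2) M grad_log_pi(x) + eps L z, with M = L L^T.
    This is one standard convention; Algorithm 1's exact form is not assumed.
    """
    d = len(x0)
    L = np.linalg.cholesky(M)
    Minv = np.linalg.inv(M)

    def log_q(y, x):  # log proposal density q(y | x), up to a constant
        m = x + 0.5 * eps**2 * M @ grad_log_pi(x)
        r = y - m
        return -0.5 * r @ Minv @ r / eps**2

    xs, accept = np.empty((n, d)), 0
    x = np.array(x0, float)
    lp = log_pi(x)
    for i in range(n):
        prop = x + 0.5 * eps**2 * M @ grad_log_pi(x) + eps * L @ rng.standard_normal(d)
        lp_prop = log_pi(prop)
        log_alpha = lp_prop - lp + log_q(x, prop) - log_q(prop, x)
        if np.log(rng.uniform()) < log_alpha:
            x, lp = prop, lp_prop
            accept += 1
        xs[i] = x
    return xs, accept / n

def adaptive_mala(x0, eps0, M0, epochs, alphas, log_pi, grad_log_pi, seed=0):
    """Sketch of Algorithm 4: adapt step size and preconditioner between epochs."""
    rng = np.random.default_rng(seed)
    x, eps, M = np.array(x0, float), eps0, np.array(M0, float)
    for i, n_i in enumerate(epochs):
        xs, acc_rate = mala(x, eps, M, n_i, log_pi, grad_log_pi, rng)
        x = xs[-1]
        if i + 1 < len(epochs):                      # adapt before the next epoch
            eps = eps * np.exp(acc_rate - 0.57)      # line 5 of Algorithm 4
            M = alphas[i] * M + (1 - alphas[i]) * np.cov(xs.T)  # line 6
    return xs                                        # samples from the final epoch

# Toy usage: target N(0, I_2); epoch lengths are purely illustrative.
log_pi = lambda x: -0.5 * x @ x
grad_log_pi = lambda x: -x
samples = adaptive_mala(np.zeros(2), 1.0, np.eye(2), [500] * 3 + [2000],
                        [0.3] * 3, log_pi, grad_log_pi)
```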
D.3 Implementation of MALA

For implementation of MALA in Algorithm 4 we are required to specify a step size $\epsilon$ and a preconditioner matrix $M$. In general, suitable values for both of these parameters will be problem-dependent. Standard practice is to perform some form of manual or automated tuning to arrive at parameter values for which the average acceptance rate is close to 0.57, motivated by the asymptotic analysis of Roberts and Rosenthal (1998). Adaptive MCMC algorithms, which seek to optimise the parameters of MCMC algorithms such as MALA during the warm-up period, provide an appealing solution, and this was the approach taken in this work. The adaptive MALA algorithm which we used is contained in Algorithm 4, where we have let $\mathrm{MALA}(x, \epsilon, M, n, k_P)$ denote the output from the preconditioned MALA described in Algorithm 1, with initial state $x$, step size $\epsilon$, preconditioner matrix $M$, and chain length $n$. In Algorithm 4, we use $\mathrm{cov}(\cdot)$ to denote the sample covariance matrix. The algorithm monitors the average acceptance rate and adapts the step size so as to push the acceptance rate towards the 0.57 target: the step size is decreased when the acceptance rate falls below the target, and increased when it is above. For the preconditioner matrix, the sample covariance matrix of samples obtained from the penultimate tuning run of MALA is used. For all experiments that we report using MALA, we set $\epsilon_0 = 1$, $M_0 = \mathrm{I}_d$, $h = 10$, and $\alpha_1 = \cdots = \alpha_9 = 0.3$. The warm-up epoch lengths were $n_0 = \cdots = n_8 = 1{,}000$ and the final epoch length was $n_9 = 10^5$. The samples $\{x_{h-1,1}, \ldots, x_{h-1,n_{h-1}}\}$ from the final epoch are returned, and these constituted the output from MALA for our experimental assessment. To sample from P instead of Π, we used Algorithm 4 with $k_P(x)$ formally set equal to 1 for all $x \in \mathbb{R}^d$, which recovers $\Pi = P$ as the target.

D.4 Illustration on a GARCH Model

This appendix contains an additional illustrative experiment, concerning a GARCH model that is a particular instance of a model from the Posterior DB database discussed in Section 4. The purpose of this illustration is to facilitate an empirical investigation in a slightly higher dimension ($d = 4$) and to explore the effect of changing the order $s$ of the KGM Stein kernel defined in Appendix A.3.2. First we describe the GARCH model that was used. These models are widely used in econometrics to describe time series data $\{y_t\}_{t=1}^n$ in settings where the volatility process is assumed to be time-varying (but stationary). In particular, we consider the GARCH(1,1) model
$$y_t = \phi_1 + a_t, \qquad a_t = \sigma_t \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, 1), \qquad \sigma_t^2 = \phi_2 + \phi_3 a_{t-1}^2 + \phi_4 \sigma_{t-1}^2,$$
where $(\phi_1, \phi_2, \phi_3, \phi_4)$, with $\phi_2 > 0$, $\phi_3 > 0$, $\phi_4 > 0$ and $\phi_3 + \phi_4 < 1$, are the model parameters, constrained to a subset of $\mathbb{R}^4$. For ease of sampling, a change of variables $\tau : (\phi_1, \phi_2, \phi_3, \phi_4) \mapsto \theta$ is performed in such a way that the parameter $\theta \in \mathbb{R}^4$ is unconstrained. Assuming an improper flat prior on $\theta$, the log-posterior density for $\theta$ is given, up to an additive constant $C$, by
$$\log p(\theta \mid y_1, \ldots, y_n) + C = -\sum_{t}\left[\frac{1}{2}\log \sigma_t^2 + \frac{(y_t - \phi_1)^2}{2\sigma_t^2}\right] + \log |J_{\tau^{-1}}(\theta)|,$$
where $|J_{\tau^{-1}}(\theta)|$ is the Jacobian determinant of $\tau^{-1}$.

Figure S3: Illustrating the shape of Π based on the KGM$_s$ Stein kernel for a GARCH(1,1) model, controlling convergence of moments up to order $s \in \{2, 3, 4\}$. The marginal density functions of each distribution were approximated using one million samples obtained using MCMC.
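The likelihood recursion in the GARCH(1,1) model above is straightforward to implement. The sketch below is our own illustration rather than the Stan implementation used for the benchmark: it evaluates the Gaussian log-likelihood in the constrained parametrisation, the unconstrained transformation $\tau$ and its Jacobian are omitted, the initialisation of $\sigma_1^2$ is an assumption on our part (the stationary variance is used by default), and the simulated data are purely synthetic.

```python
import numpy as np

def garch11_log_likelihood(phi, y, sigma2_init=None):
    """Gaussian log-likelihood of the GARCH(1,1) model of Appendix D.4.

    phi = (phi1, phi2, phi3, phi4), with phi2, phi3, phi4 > 0 and
    phi3 + phi4 < 1.  The initialisation of sigma_1^2 is not specified in
    the text; the stationary variance is used here as a default assumption.
    """
    phi1, phi2, phi3, phi4 = phi
    a = y - phi1                                    # a_t = y_t - phi1
    if sigma2_init is None:
        sigma2_init = phi2 / (1.0 - phi3 - phi4)    # stationary variance
    sigma2 = np.empty_like(y)
    sigma2[0] = sigma2_init
    for t in range(1, len(y)):
        sigma2[t] = phi2 + phi3 * a[t - 1] ** 2 + phi4 * sigma2[t - 1]
    return -0.5 * np.sum(np.log(2 * np.pi * sigma2) + a ** 2 / sigma2)

# Toy usage with simulated data (not the Posterior DB dataset).
rng = np.random.default_rng(0)
phi1, phi2, phi3, phi4 = 5.0, 1.4, 0.5, 0.3
T = 200
y = np.empty(T)
s2 = phi2 / (1 - phi3 - phi4)                       # start at stationary variance
a_prev = 0.0
for t in range(T):
    if t > 0:
        s2 = phi2 + phi3 * a_prev ** 2 + phi4 * s2
    a_prev = np.sqrt(s2) * rng.standard_normal()
    y[t] = phi1 + a_prev
print(garch11_log_likelihood((phi1, phi2, phi3, phi4), y))
```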
For this illustration, real data were provided within the model description of Posterior DB, for which the estimated maximum a posteriori parameter is $\hat{\phi} = (5.04, 1.36, 0.53, 0.31)$. The marginal distributions of Π corresponding to the KGM Stein kernels of orders $s \in \{2, 3, 4\}$ are compared to the marginals of P in Figure S3. It can be seen that higher orders $s$ correspond to greater over-dispersion of Π; this makes intuitive sense, since larger $s$ corresponds to a more stringent KSD (controlling the convergence of moments up to order $s$), which places greater emphasis on how the tails of P are approximated. Further, for the final skewed marginal of P, we note that the distribution Π exaggerates the skew, placing more of its mass in the tail on the positively skewed side. Further discussion of skewed targets is contained in Appendix D.8.

D.5 Stein Π-Importance Sampling for Posterior DB

To introduce objectivity into our assessment, we exploited the Posterior DB benchmark (Magnusson et al., 2022). This ongoing project is an attempt toward standardised benchmarking, consisting of a collection of posteriors to be numerically approximated. The test problems in Posterior DB are defined in the Stan probabilistic programming language, and so Bridge Stan (Roualdes et al., 2023) was used to directly access posterior densities and their gradients as required. The ambition of Posterior DB is to provide an extensive set of benchmark tasks; at the time we conducted our research, Posterior DB was at Version 0.4.0 and contained 149 models, of which 47 came equipped with a gold-standard sample of size $n = 10^3$, generated from a long run of Hamiltonian Monte Carlo (the No-U-Turn sampler in Stan). Of these 47 models, a subset of 40 were found to be compatible with Bridge Stan, which was at Version 1.0.2 at the time this research was performed. The version of Stan that we used was Stanc3 Version 2.31.0 (Unix). Thus we used a total of 40 test problems for our empirical assessment. For each test problem, a total of 10 replicate experiments were performed and standard errors were computed. A sampling method was defined as being significantly better for approximation of a given target, compared to all other methods considered, if it had lower mean KSD and its standard error bar did not overlap with the standard error bar of any other method. Table 1 in the main text summarises the performance of SΠIS-MALA, fixing the number of samples to be $n = 3 \times 10^3$. In this appendix, full empirical results are provided. For sampling from MALA, we used the adaptive algorithm described in Appendix D.3 with a final epoch of length $n_{\max} = 10^5$. Then, whenever a set of $n \le n_{\max}$ consecutive samples from MALA was required for our experimental assessment, it was obtained by selecting at random a consecutive sequence of length $n$ from the total chain of length $10^5$. This ensures that the performance of unprocessed MALA that we report is not negatively affected by burn-in, in so far as this is practical to control. Full results are presented in Figure S4. These results broadly support the interpretation that SΠIS-MALA usually outperforms SIS-MALA, or otherwise both methods provide a similar level of performance, for the sufficiently large sample sizes $n$ considered. The sample size threshold at which SΠIS-MALA outperforms SIS-MALA appears to be dimension-dependent.
A notable exception is panel 29 of Figure S4, a $d = 10$ dimensional task for which SΠIS-MALA provided a substantially worse approximation in KSD for the range of values of $n$ considered.

Figure S4: Benchmarking on Posterior DB. Here we compared raw output from MALA (dotted lines) with the post-processed output provided by the default Stein importance sampling method of Liu and Lee (2017) (SIS-MALA; solid lines) and the proposed Stein Π-Importance Sampling method (SΠIS-MALA; dashed lines). The Langevin (purple) and KGM3 Stein kernels (blue) were used for SIS-MALA and SΠIS-MALA, and the associated KSDs are reported as the number $n$ of iterations of MALA is varied. Ten replicates were computed and standard errors were plotted. The name of each model is shown in the title of the corresponding panel, and the dimension $d$ of the parameter vector is given in parentheses. [Langevin Stein kernel: MALA, SIS-MALA, SΠIS-MALA. KGM3 Stein kernel: MALA, SIS-MALA, SΠIS-MALA.] (The figure contains forty panels, one per test problem, with $n$ ranging over $[10^1, 10^3]$; the model names and dimensions coincide with those listed in Table 2.)

D.6 Stein Π-Thinning for Posterior DB

The results presented in the main text concerned $n = 3 \times 10^3$ samples from MALA, which is near the limit at which the optimal weights $w^*$ can be computed in a few seconds on a laptop PC. For larger values of $n$, sparse approximation methods are likely to be required. In the main text we presented Stein Π-Thinning, which employs a greedy optimisation perspective to obtain a sparse approximation to the optimal weights at cost $O(m^2 n)$, where $m$ is the number of greedy iterations performed. Explicit and verifiable conditions for the strong consistency of the resulting SΠT-MALA algorithm were established in Section 3.3. The purpose of this appendix is to empirically explore the convergence of SΠT-MALA using the Posterior DB test bed.
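For reference, both the optimal weights $w^*$ and the greedy selection underpinning SΠT-MALA admit short implementations once the Stein kernel matrix $K_{ij} = k_P(x_i, x_j)$ has been assembled, since the KSD of a weighted empirical measure $\sum_i w_i \delta(x_i)$ is $\sqrt{w^\top K w}$. The sketch below is our own illustration and not the authors' code: the weights are approximated by exponentiated-gradient (mirror descent) updates, which is merely one simple solver for the simplex-constrained quadratic programme (the solver used for the experiments is not described in this appendix), while the greedy routine implements the selection-with-replacement rule of Riabiz et al. (2022). The one-dimensional Gaussian example and the inverse multi-quadric parameters are illustrative assumptions.

```python
import numpy as np

def stein_importance_weights(K, n_iter=5000, step=None):
    """Approximate w* = argmin {w^T K w : w >= 0, sum(w) = 1} by mirror descent."""
    n = K.shape[0]
    w = np.full(n, 1.0 / n)
    step = 1.0 / (2.0 * np.abs(K).max()) if step is None else step
    for _ in range(n_iter):
        w = w * np.exp(-step * 2.0 * K @ w)   # exponentiated-gradient update
        w /= w.sum()
    return w

def stein_thinning(K, m):
    """Greedy selection of m indices; selections may repeat (Riabiz et al., 2022)."""
    running = np.zeros(K.shape[0])            # running[i] = sum over selected j of K[j, i]
    selected = np.empty(m, dtype=int)
    for j in range(m):
        idx = int(np.argmin(0.5 * np.diag(K) + running))
        selected[j] = idx
        running += K[idx]
    return selected

# Toy example: Langevin Stein kernel matrix for P = N(0, 1), IMQ kernel with
# beta = 1/2 and Sigma = 1, evaluated on a short i.i.d. 'chain' x.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
beta, score = 0.5, -x                          # score of N(0, 1) is -x
diff = x[:, None] - x[None, :]
base = 1.0 + diff ** 2
kappa = base ** (-beta)
dk_dx = -2 * beta * base ** (-beta - 1) * diff
dk_dy = -dk_dx
dk_dxdy = -4 * beta * (beta + 1) * base ** (-beta - 2) * diff ** 2 \
          + 2 * beta * base ** (-beta - 1)
K = dk_dxdy + dk_dx * score[None, :] + dk_dy * score[:, None] \
    + kappa * score[:, None] * score[None, :]

w = stein_importance_weights(K)                # Stein importance sampling weights
idx = stein_thinning(K, m=20)                  # greedy Stein thinning indices
print(float(w @ K @ w))                        # squared KSD of the weighted measure
```

In practice the assembly and storage of the $n \times n$ kernel matrix dominate the cost, which is consistent with the motivation for the sparse alternative discussed above.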
In the experiments we report, the number of MALA samples was fixed to $n = 10^3$ and the number of greedy iterations was varied from $m = 1$ to $m = 10^3$. The results, in Figure S5, indicate that for most models in Posterior DB the minimum value of KSD is approximately attained for values of $m$ from around $n/2$ upward, representing a modest but practically significant reduction in computational cost compared to SΠIS-MALA. This agrees with the qualitative findings reported in the original Stein thinning paper of Riabiz et al. (2022).

Figure S5: Benchmarking on Posterior DB. Here we investigate the convergence of the sparse approximation provided by the proposed Stein Π-Thinning method (SΠT-MALA). The Langevin (purple) and KGM3 Stein kernels (blue) were used for SΠT-MALA, and the associated KSDs are reported as the number $m$ of iterations of Stein thinning is varied. Ten replicates were computed and standard errors were plotted. The name of each model is shown in the title of the corresponding panel, and the dimension $d$ of the parameter vector is given in parentheses. [Langevin Stein kernel: SΠT-MALA. KGM3 Stein kernel: SΠT-MALA.] (The figure contains forty panels, one per test problem, with $m$ ranging over $[10^0, 10^3]$; the model names and dimensions coincide with those listed in Table 2.)

D.7 Performance of Stein Discrepancies

A detailed study of the properties of Stein discrepancies was out of scope for this work. Nonetheless, there is much interest in better understanding the properties of KSDs, and in this appendix the performance of SΠIS-MALA in terms of the 1-Wasserstein divergence is reported. This was made possible since Posterior DB supplies a set of posterior samples obtained from a long run of Hamiltonian Monte Carlo (the No-U-Turn sampler in Stan), which we treat as a gold standard. Full results are presented in Figure S6 and Table 2.
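For reference, the 1-Wasserstein divergence between a (possibly weighted) approximation and a gold-standard sample can be computed exactly by solving a discrete optimal transport problem. The sketch below is our own illustration and not necessarily the procedure used to produce Figure S6 and Table 2; it assumes the POT (Python Optimal Transport) package, and the Gaussian toy data and Dirichlet weights are placeholders.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (an assumption; any exact EMD solver would do)

def wasserstein1(X, Y, wX=None, wY=None):
    """Exact 1-Wasserstein distance between two weighted samples.

    X: (n, d) array, Y: (m, d) array; wX, wY are probability vectors
    (uniform if omitted).  With the Euclidean ground metric the optimal
    transport cost equals the 1-Wasserstein divergence of Section 2.1 (r = 1).
    """
    n, m = X.shape[0], Y.shape[0]
    wX = np.full(n, 1.0 / n) if wX is None else wX
    wY = np.full(m, 1.0 / m) if wY is None else wY
    C = ot.dist(X, Y, metric="euclidean")   # pairwise Euclidean costs
    return ot.emd2(wX, wY, C)               # exact earth mover's distance

# Toy usage: weighted sample against a 'gold standard' sample.
rng = np.random.default_rng(2)
gold = rng.normal(size=(1000, 2))
approx = rng.normal(loc=0.2, size=(500, 2))
w = rng.dirichlet(np.ones(500))             # e.g. Stein importance weights
print(wasserstein1(approx, gold, wX=w))
```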
Broadly speaking, in most cases the minimisation of KSD seems to be associated with minimisation of the 1-Wasserstein distance, and in particular a significant improvement of SΠIS-MALA over SIS-MALA is reported for 63% of tasks in the Posterior DB benchmark. However, there are some scenarios for which minimisation of KSD is loosely, if at all, related to minimisation of the 1-Wasserstein divergence. In these cases, we attribute this performance to a combination of two factors. First, the blindness to mixing proportions phenomenon, described in Wenliang and Kanagawa (2021), Koehler et al. (2022) and Liu et al. (2023), which is a pathology of KSDs in general. Second, the Langevin Stein kernel cannot be expected to control convergence in 1-Wasserstein, since convergence in 1-Wasserstein is equivalent to weak convergence plus convergence of the first moment. Focusing therefore on the KGM3 Stein kernel only, it is encouraging to note that SΠIS-MALA outperforms SIS-MALA on 83% of tasks in Posterior DB in the 1-Wasserstein metric, as shown in Table 2. However, it is interesting to observe that MALA itself performed well in the 1-Wasserstein sense across the Posterior DB test bed. The development of improved Stein discrepancies is an active area of research, and we emphasise that the methodology developed in this work can be applied to any KSD, including KSDs that may be developed in the future with better or more direct control over standard notions of convergence (such as 1-Wasserstein).

Figure S6: Performance of Stein discrepancies on Posterior DB. Here we compared raw output from MALA (dotted lines) with the post-processed output provided by the default Stein importance sampling method of Liu and Lee (2017) (SIS-MALA; solid lines) and the proposed Stein Π-Importance Sampling method (SΠIS-MALA; dashed lines). The Langevin (purple) and KGM3 Stein kernels (blue) were used for SIS-MALA and SΠIS-MALA, and the 1-Wasserstein divergence is reported as the number $n$ of iterations of MALA is varied. Ten replicates were computed and standard errors were plotted. The name of each model is shown in the title of the corresponding panel, and the dimension $d$ of the parameter vector is given in parentheses. [Legend: Raw MALA. Langevin Stein kernel: SIS-MALA, SΠIS-MALA. KGM3 Stein kernel: SIS-MALA, SΠIS-MALA.]
(The figure contains forty panels, one per test problem, with $n$ ranging over $[10^1, 10^3]$; the model names and dimensions coincide with those listed in Table 2.)

D.8 Investigation for a Skewed Target

This final appendix contrasts the 1-Wasserstein optimal sampling distribution $\Pi_1$ (c.f. Section 2.1) with the choice of Π that we recommended in (8). In particular, we focus on the KGM3 Stein kernel under a heavily skewed P, for which $\Pi_1$ and Π can be markedly different. For this investigation a bivariate skew-normal target was constructed, whose density is given by
$$p(x_1, x_2) = 4\,\phi(x_1)\Phi(6x_1)\,\phi(x_2)\Phi(-3x_2),$$
with $\phi$ and $\Phi$ respectively denoting the density and distribution functions of a standard Gaussian. The density $p$ of P, together with the marginal densities of $\Pi_1$ and Π, are plotted in Figure S7. It can be seen that, while both $\Pi_1$ and Π are over-dispersed with respect to P, our recommended Π assigns proportionally more mass to the tail that is positively skewed. The performance of Stein Π-Importance Sampling based on $\Pi_1$ and Π is compared in Figure S8. Though both choices lead to an improvement relative to the Stein importance sampling algorithm with Π = P, the use of Π leads to a significant further reduction (on average) in KSD compared to $\Pi_1$. Based on our investigations, this finding seems general; the use of $\Pi_1$ does not realise the full potential of Stein Π-Importance Sampling when the target is skewed.
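The skew-normal target used in this appendix is inexpensive to evaluate, so, as in the two-dimensional illustration of Appendix D.2, it can be handled exactly on a fine grid. The short sketch below is our own illustration; the grid range and resolution are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

def skew_normal_density(x1, x2):
    """Bivariate skew-normal target of Appendix D.8 (already normalised)."""
    return 4.0 * norm.pdf(x1) * norm.cdf(6.0 * x1) * norm.pdf(x2) * norm.cdf(-3.0 * x2)

# Grid-based evaluation, the same device used for exact sampling in Appendix D.2.
grid = np.linspace(-6.0, 6.0, 601)
X1, X2 = np.meshgrid(grid, grid, indexing="ij")
p = skew_normal_density(X1, X2)
dx = grid[1] - grid[0]
print(p.sum() * dx * dx)                  # close to 1: the density is normalised
marginal_x1 = p.sum(axis=1) * dx          # marginal density of x1 on the grid
marginal_x2 = p.sum(axis=0) * dx          # marginal density of x2 on the grid
```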
(Langevin Stein kernel: result columns 3-5; KGM3 Stein kernel: result columns 6-8.)

Task | d | MALA | SIS-MALA | SΠIS-MALA | MALA | SIS-MALA | SΠIS-MALA
earnings-earn_height | 3 | 6420.0 | 7230.0 | 9440.0 | 6420.0 | 7280.0 | 7390.0
gp_pois_regr-gp_regr | 3 | 0.0779 | 0.0758 | 0.0724 | 0.0779 | 0.0865 | 0.0778
kidiq-kidscore_momhs | 3 | 0.282 | 0.581 | 0.508 | 0.282 | 0.950 | 0.604
kidiq-kidscore_momiq | 3 | 0.826 | 2.19 | 2.01 | 0.826 | 2.53 | 1.82
mesquite-logmesquite_logvolume | 3 | 0.0236 | 0.0253 | 0.0238 | 0.0236 | 0.0311 | 0.0234
arma-arma11 | 4 | 0.0127 | 0.0132 | 0.0131 | 0.0127 | 0.0144 | 0.0133
earnings-logearn_logheight_male | 4 | 1.46 | 1.75 | 1.34 | 1.46 | 1.56 | 0.725
garch-garch11 | 4 | 0.260 | 0.241 | 0.243 | 0.260 | 0.365 | 0.263
kidiq-kidscore_momhsiq | 4 | 1.86 | 2.39 | 1.72 | 1.86 | 2.51 | 1.54
earnings-logearn_interaction_z | 5 | 0.0245 | 0.0240 | 0.0240 | 0.0245 | 0.0252 | 0.0251
kidiq-kidscore_interaction | 5 | 13.9 | 14.5 | 15.3 | 13.9 | 14.4 | 20.4
kidiq_with_mom_work-kidscore_interaction_c | 5 | 0.289 | 0.251 | 0.237 | 0.289 | 0.584 | 0.371
kidiq_with_mom_work-kidscore_interaction_c2 | 5 | 0.258 | 0.269 | 0.252 | 0.258 | 0.638 | 0.430
kidiq_with_mom_work-kidscore_interaction_z | 5 | 0.914 | 0.904 | 0.889 | 0.914 | 1.46 | 1.17
kidiq_with_mom_work-kidscore_mom_work | 5 | 1.01 | 1.05 | 1.06 | 1.01 | 1.74 | 1.43
low_dim_gauss_mix-low_dim_gauss_mix | 5 | 0.0215 | 0.0206 | 0.0207 | 0.0215 | 0.0214 | 0.0212
mesquite-logmesquite_logva | 5 | 0.0715 | 0.0672 | 0.0681 | 0.0715 | 0.0833 | 0.0775
hmm_example-hmm_example | 6 | 0.0708 | 0.102 | 0.105 | 0.0708 | 0.161 | 0.12
sblrc-blr | 6 | 0.0613 | 0.0614 | 0.0581 | 0.0613 | 0.0613 | 0.0605
sblri-blr | 6 | 0.0729 | 0.0733 | 0.0843 | 0.0729 | 0.0722 | 0.0926
arK-arK | 7 | 0.0544 | 0.0533 | 0.0525 | 0.0544 | 0.0618 | 0.0557
mesquite-logmesquite_logvash | 7 | 0.165 | 0.158 | 0.161 | 0.165 | 0.180 | 0.171
bball_drive_event_0-hmm_drive_0 | 8 | 0.203 | 0.183 | 0.186 | 0.203 | 0.242 | 0.216
bball_drive_event_1-hmm_drive_1 | 8 | 0.812 | 0.825 | 0.811 | 0.812 | 0.814 | 0.754
hudson_lynx_hare-lotka_volterra | 8 | 0.113 | 0.112 | 0.111 | 0.113 | 0.135 | 0.120
mesquite-logmesquite | 8 | 0.212 | 0.204 | 0.209 | 0.212 | 0.214 | 0.212
mesquite-logmesquite_logvas | 8 | 0.197 | 0.192 | 0.196 | 0.197 | 0.208 | 0.203
mesquite-mesquite | 8 | 123.0 | 122.0 | 114.0 | 123.0 | 123.0 | 121.0
eight_schools-eight_schools_centered | 10 | 9.17 | 9.50 | 8.14 | 9.17 | 7.00 | 10.3
eight_schools-eight_schools_noncentered | 10 | 16.8 | 16.7 | 16.7 | 16.8 | 16.7 | 16.7
nes1972-nes | 10 | 0.193 | 0.189 | 0.172 | 0.193 | 0.198 | 0.172
nes1976-nes | 10 | 0.190 | 0.189 | 0.179 | 0.190 | 0.202 | 0.177
nes1980-nes | 10 | 0.229 | 0.228 | 0.222 | 0.229 | 0.256 | 0.233
nes1984-nes | 10 | 0.202 | 0.198 | 0.197 | 0.202 | 0.210 | 0.184
nes1988-nes | 10 | 0.212 | 0.210 | 0.185 | 0.212 | 0.214 | 0.185
nes1992-nes | 10 | 0.172 | 0.169 | 0.159 | 0.172 | 0.182 | 0.168
nes1996-nes | 10 | 0.191 | 0.187 | 0.173 | 0.191 | 0.199 | 0.179
nes2000-nes | 10 | 0.275 | 0.273 | 0.274 | 0.275 | 0.306 | 0.284
diamonds-diamonds | 26 | 0.648 | 0.647 | 0.643 | 0.648 | 0.648 | 0.658
mcycle_gp-accel_gp | 66 | 15.5 | 15.5 | 15.6 | 15.5 | 15.5 | 15.8

Table 2: Benchmarking on Posterior DB. Here we compared raw output from MALA with the post-processed output provided by the default Stein importance sampling method of Liu and Lee (2017) (SIS-MALA) and the proposed Stein Π-Importance Sampling method (SΠIS-MALA). Here $d = \dim(P)$ and the number of MALA samples was $n = 3 \times 10^3$. The Langevin and KGM3 Stein kernels were used for SIS-MALA and SΠIS-MALA, and the associated 1-Wasserstein distances are reported. Ten replicates were computed and statistically significant improvement is highlighted in bold.

Figure S7: Comparing the proposed distribution Π (KGM3; based on the KGM3 Stein kernel) to $\Pi_1$ (1Wass.; the optimal choice for 1-Wasserstein quantisation from Section 2.1) for a bivariate skew-normal target ($d = 2$). The marginal density functions of each distribution were approximated using $10^6$ samples from MCMC.
Figure S8: Comparing the performance of using the proposed distribution Π (KGM3; based on the KGM3 Stein kernel) to Π1 (1Wass.; the optimal choice for 1-Wasserstein quantisation from Section 2.1) for a bivariate skew-normal target (d = 2). The mean kernel Stein discrepancy (KSD) for Stein Π-Importance Sampling was estimated; in each case, the KSD based on the KGM3 Stein kernel was computed. Solid lines indicate the baseline case of sampling from P, while dashed lines indicate sampling from Π. (The experiment was repeated 10 times and standard error bars are plotted.)