Stein Point Markov Chain Monte Carlo

Wilson Ye Chen*1  Alessandro Barp*2,3  François-Xavier Briol4,3  Jackson Gorham5  Mark Girolami4,3  Lester Mackey*6  Chris J. Oates7,3

*Equal contribution. 1Institute of Statistical Mathematics, 2Imperial College London, 3Alan Turing Institute, 4University of Cambridge, 5Open Door, 6Microsoft Research, 7Newcastle University. Correspondence to: Lester Mackey, Chris J. Oates.

An important task in machine learning and statistics is the approximation of a probability measure by an empirical measure supported on a discrete point set. Stein Points are a class of algorithms for this task, which proceed by sequentially minimising a Stein discrepancy between the empirical measure and the target and, hence, require the solution of a non-convex optimisation problem to obtain each new point. This paper removes the need to solve this optimisation problem by, instead, selecting each new point based on a Markov chain sample path. This significantly reduces the computational cost of Stein Points and leads to a suite of algorithms that are straightforward to implement. The new algorithms are illustrated on a set of challenging Bayesian inference problems, and rigorous theoretical guarantees of consistency are established.

1. Introduction
The task that we consider in this paper is to approximate a Borel probability measure P on an open and convex set X ⊆ R^d, d ∈ N, with an empirical measure P̂ supported on a discrete point set {x_i}_{i=1}^n ⊂ X. To limit scope we restrict attention to uniformly-weighted empirical measures, P̂ = (1/n) Σ_{i=1}^n δ_{x_i}, where δ_x is a Dirac measure at x. The quantisation (Graf & Luschgy, 2007) of P by P̂ is an important task in computational statistics and machine learning. For example, quantisation facilitates the approximation of integrals ∫_X f dP of measurable functions f : X → R using cubature rules f ↦ (1/n) Σ_{i=1}^n f(x_i). More generally, quantisation underlies a broad spectrum
of algorithms for uncertainty quantification that must operate subject to a finite computational budget.

[Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).]

Motivated by applications in Bayesian statistics, our focus is on the situation where P admits a density p with respect to the Lebesgue measure on X, but this density can only be evaluated up to an (unknown) normalisation constant. Specifically, we assume that p = p̃/C, where p̃ is an un-normalised density and C > 0, such that both p̃ and ∇log p̃, where ∇ = (∂_{x_1}, …, ∂_{x_d}), can be (pointwise) evaluated at finite computational cost. A popular approach to this task is Markov chain Monte Carlo (MCMC; Robert & Casella, 2004), where the sample path of an ergodic Markov chain with invariant distribution P constitutes a point set {x_i}_{i=1}^n. MCMC algorithms exploit a range of techniques to construct Markov transition kernels which leave P invariant, based (in general) on pointwise evaluation of p̃ (Metropolis et al., 1953) or (sometimes) on pointwise evaluation of ∇log p̃ and higher-order derivative information (Girolami & Calderhead, 2011). In a favourable situation, the MCMC output will be approximately independent draws from P. However, in this case the {x_i}_{i=1}^n will typically not be a low-discrepancy point set (Dick & Pillichshammer, 2010) and as such the quantisation of P performed by MCMC will be sub-optimal. In recent years several attempts have been made to develop improved algorithms for quantisation in the Bayesian statistical context as an alternative to MCMC:

Minimum Energy Designs (MED) In (Roshan Joseph et al., 2015; Joseph et al., 2018) it was proposed to obtain a point set {x_i}_{i=1}^n by using a numerical optimisation method to approximately minimise an energy functional E_{p̃}({x_i}_{i=1}^n) that depends on P only through p̃ rather than through p itself.
Though appealing in its simplicity, MED has yet to receive a theoretical treatment that accounts for the imperfect performance of the numerical optimisation method.

Support Points The method of (Mak & Joseph, 2018) first generates a large MCMC output {x̃_i}_{i=1}^N and from this a subset {x_i}_{i=1}^n is selected in such a way that a low-discrepancy point set is obtained. (This can be contrasted with classical thinning, in which an arithmetic subsequence of the MCMC output is selected.) At present, a theoretical analysis that accounts for the possible poor performance of the MCMC method has not yet been announced.

Transport Maps and QMC The method of (Parno, 2015) aims to learn a transport map T : X → X such that the pushforward measure T_# Q corresponds to P, where Q is a distribution for which quantisation by a point set {x̃_i}_{i=1}^n is easily performed, for instance using quasi-Monte Carlo (QMC) (Dick & Pillichshammer, 2010). Then quantisation of P is provided by the point set {T(x̃_i)}_{i=1}^n. The flexibility in the construction of a transport map allows several algorithms to be envisaged, but an end-to-end theoretical treatment is not available at present.

Stein Variational Gradient Descent (SVGD) A popular methodology due to (Liu & Wang, 2016) aims to take an arbitrary initial point set {x_i^0}_{i=1}^n and to construct a discrete-time dynamical system x_i^t = g_{p̃}(x_1^{t−1}, …, x_n^{t−1}), indexed by time t and dependent on p̃, such that lim_{t→∞} {x_i^t}_{i=1}^n provides a quantisation of P. This can be viewed as a discretisation of a particular gradient flow that has P as a fixed point (Liu, 2017). However, a generally applicable theoretical analysis of the SVGD method itself is not available (note that a compactness assumption on X was required in Liu, 2017). Note also that, unlike the other methods discussed in this section, SVGD does not readily admit an extensible construction; that is, the number n of points must be a priori fixed.
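As a concrete illustration of the SVGD dynamical system, the following minimal sketch implements the standard update of (Liu & Wang, 2016) for an RBF kernel and a univariate Gaussian target; the kernel, bandwidth, step size and target here are illustrative placeholders, not the settings used later in this paper.

```python
import numpy as np

def svgd_step(X, score, h=1.0, eps=0.1):
    """One SVGD update: x_i <- x_i + eps * phi(x_i), where
    phi(x) = (1/n) sum_j [ k(x_j, x) score(x_j) + grad_{x_j} k(x_j, x) ]."""
    n, d = X.shape
    diff = X[:, None, :] - X[None, :, :]            # diff[j, i] = x_j - x_i
    K = np.exp(-np.sum(diff**2, axis=-1) / h)       # RBF kernel k(x_j, x_i)
    grad_K = -(2.0 / h) * diff * K[:, :, None]      # grad_{x_j} k(x_j, x_i)
    phi = (K @ score(X) + grad_K.sum(axis=0)) / n   # attraction + repulsion
    return X + eps * phi

# Example: transport 20 badly-initialised particles towards N(0, 1)
rng = np.random.default_rng(0)
X = rng.uniform(3.0, 5.0, size=(20, 1))
for _ in range(1000):
    X = svgd_step(X, lambda X: -X, h=1.0, eps=0.05)
```

The first (attraction) term drives particles towards high-density regions of P, while the kernel-gradient (repulsion) term keeps them spread out; note the update requires n to be fixed in advance, which is the extensibility limitation mentioned above.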
Stein Points (SP) The authors of (Chen et al., 2018b) proposed to select a point set {x_i}_{i=1}^n that approximately minimises a kernel Stein discrepancy (KSD; Liu et al., 2016; Chwialkowski et al., 2016; Gorham & Mackey, 2017) between the empirical measure and the target P. The KSD can be exactly computed with a finite number of pointwise evaluations of ∇log p̃ and, for the (non-convex) minimisation, a variety of numerical optimisation methods can be applied. In contrast to the other methods just discussed, SP does admit an end-to-end theoretical treatment when a grid search procedure is used as the numerical optimisation method (Thms. 1 & 2 in Chen et al., 2018b).

An empirical comparison of several of the above methods on a selection of problems arising in computational statistics was presented in (Chen et al., 2018b). The conclusion of that work was that MED and SP provided broadly similar performance-per-computational-cost at the quantisation task, where performance was measured by the Wasserstein distance to the target and computational cost was measured by the total number of evaluations of either p̃ or its gradient. In some situations, SVGD provided superior quantisation to MED and SP, but this was achieved at a substantially higher computational cost. At the same time, it was observed that all algorithms considered provided improved quantisation compared to MCMC, but at a computational cost that was substantially higher than the corresponding cost of MCMC.

Figure 1. Illustration of Monte Carlo points (MC; left) and Stein Point Markov chain Monte Carlo (SP-MCMC; right) on a Gaussian mixture target P. SP-MCMC provides better space-filling properties than MC.

In this paper, we propose Stein Point Markov chain Monte Carlo (SP-MCMC), aiming to provide strong performance at the quantisation task (see Fig. 1) but at substantially reduced computational cost compared to the original SP method.
Our contributions are summarised as follows:

- The global optimisation subroutine in SP, whose computational cost was exponential in dimension d, is replaced by a form of local search based on MCMC. This allows us to make use of efficient transition kernels for exploration of X, which in turn improves performance in higher dimensions and reduces the overall computational cost.

- Our construction requires a new Markov chain to be initialised each time a point x_n is added; however, the initial distribution of the chain does not need to coincide with P. This enables us to develop an efficient criterion for initialisation of the Markov chains, based on the introduced notion of the most influential point in {x_i}_{i=1}^{n−1}, as quantified by KSD. This turns our sequence of local searches into a global-like search, and also leads to automatic mode-hopping behaviour when P is a multi-modal target.

- The consistency of SP-MCMC is established under a V-uniform ergodicity condition on the Markov kernel.

- SP-MCMC is shown, empirically, to outperform MCMC, MED, SVGD and SP when applied to posterior computation in the Bayesian statistical context.

The paper is structured as follows: In Section 2 we review the central notions of Stein's method and KSD, as well as recalling the original SP method. The novel methodology is presented in Section 3. This is assessed experimentally in Section 4 and theoretically in Section 5. Conclusions are drawn in Section 6.

2. Background
In Section 2.1 we recall the construction of KSD; then, in Section 2.2, the SP method of (Chen et al., 2018b), which is based on minimisation of KSD, is discussed.

2.1. Discrepancy and Stein's Method
A discrepancy is a notion of how well an empirical measure, based on a point set {x_i}_{i=1}^n ⊂ X, approximates a target P.
One popular form of discrepancy is the integral probability metric (IPM) (Müller, 1997), which is based on a set F of functions on X and is defined as

D_{F,P}({x_i}_{i=1}^n) := sup_{f∈F} | (1/n) Σ_{i=1}^n f(x_i) − ∫_X f dP |.   (1)

The set F is required to be measure-determining in order for the IPM to be a genuine metric. Certain sets F lead to familiar notions, such as the Wasserstein distance, but direct computation of an IPM will generically require exact integration against P; a demand that is not met in the Bayesian context. In order to construct an IPM that can be computed in the Bayesian context, (Gorham & Mackey, 2015) proposed the notion of a Stein discrepancy, based on Stein's method (Stein, 1972). This consists of finding an operator A, called a Stein operator, and a function class G, called a Stein class, which satisfy the Stein identity ∫_X Ag dP = 0 for all g ∈ G. Taking F = AG, the image of G under A, in (1) leads directly to the Stein discrepancy

D_{AG,P}({x_i}_{i=1}^n) = sup_{g∈G} | (1/n) Σ_{i=1}^n Ag(x_i) |.   (2)

A particular choice of A and G was studied in (Gorham & Mackey, 2015) with the property that exact computation can be performed based only on pointwise evaluation of ∇log p̃. The computation of this graph Stein discrepancy reduced to solving d independent linear programs in parallel, each with O(n) variables and constraints. To eliminate the reliance on a linear program solver, (Liu et al., 2016; Chwialkowski et al., 2016; Gorham & Mackey, 2017) proposed kernel Stein discrepancies: alternative Stein discrepancies (2) with embarrassingly parallel, closed-form values. For the remainder we assume that p > 0 on X. The canonical KSD is obtained by taking the Stein operator A to be the Langevin operator Ag := (1/p) ∇·(p g) and the Stein class G = B(K^d) to be the unit ball of a space of vector-valued functions, formed as a d-dimensional Cartesian product of scalar-valued reproducing kernel Hilbert spaces K (RKHS) (Berlinet & Thomas-Agnan, 2004).
(Throughout we use ∇· to denote divergence and ⟨·,·⟩ to denote the Euclidean inner product.) Recall that an RKHS K is a Hilbert space of functions with inner product ⟨·,·⟩_K and induced norm ‖·‖_K, for which there is a function k : X × X → R, called a kernel, such that for all x ∈ X the evaluation functional can be written f(x) = ⟨f, k(·, x)⟩_K for all f ∈ K. It is assumed that the mixed derivatives ∂²k(x, y)/∂x_i∂y_j, and all lower-order derivatives, are continuous and uniformly bounded. For X bounded, with piecewise smooth boundary denoted ∂X, outward normal denoted n and surface element denoted dσ(x), the conditions

∮_{∂X} k(x, x′) p(x) n(x) dσ(x) = 0,   ∮_{∂X} ⟨∇_x k(x, x′), n(x)⟩ p(x) dσ(x) = 0

are sufficient for the Stein identity to hold; c.f. Lemma 1 in (Oates et al., 2017). For X = R^d, a sufficient condition is ∫_X ‖∇log p(x)‖² dP(x) < ∞; c.f. Prop. 1 of (Gorham & Mackey, 2017). The image AG = B(K₀) is the unit ball of another RKHS, denoted K₀, whose kernel is (Oates et al., 2017)

k₀(x, x′) = ∇_x·∇_{x′} k(x, x′) + ⟨∇_x k(x, x′), ∇_{x′} log p(x′)⟩ + ⟨∇_{x′} k(x, x′), ∇_x log p(x)⟩ + k(x, x′) ⟨∇_x log p(x), ∇_{x′} log p(x′)⟩.   (3)

In this case, (2) corresponds to a maximum mean discrepancy (MMD; Gretton et al., 2006) in the RKHS K₀ and thus can be explicitly computed. The Stein identity implies that ∫_X k₀(x, ·) dP(x) ≡ 0. Thus we denote the KSD between the empirical measure (1/n) Σ_{i=1}^n δ_{x_i} and the target P (in a small abuse of notation) as

D_{K₀,P}({x_i}_{i=1}^n) := √( (1/n²) Σ_{i,j=1}^n k₀(x_i, x_j) ).   (4)

Under regularity assumptions (Gorham & Mackey, 2017; Chen et al., 2018b; Huggins & Mackey, 2018), the KSD controls classical weak convergence of the empirical measure to the target. This motivates selecting the {x_i}_{i=1}^n to minimise the KSD, and to this end we now recall the SP method of (Chen et al., 2018b).

2.2. Stein Points
The Stein Point (SP) method due to (Chen et al., 2018b) selects points {x_i}_{i=1}^n to approximately minimise D_{K₀,P}({x_i}_{i=1}^n). This is, of course, a challenging non-convex and multivariate problem in general. For this reason, two sequential strategies were proposed.
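Since the Stein kernel (3) and the KSD (4) are available in closed form once ∇log p can be evaluated, they are straightforward to implement. The following minimal sketch, assuming an IMQ base kernel with unit length-scale and a standard Gaussian target (illustrative placeholders, not the tuned settings of the later experiments), computes (3) and (4) directly:

```python
import numpy as np

def imq_stein_kernel(x, y, score_x, score_y, beta=-0.5):
    # IMQ base kernel k(x, y) = (1 + ||x - y||^2)^beta and its derivatives
    r = x - y
    s = 1.0 + r @ r
    k = s ** beta
    grad_x = 2 * beta * s ** (beta - 1) * r        # grad_x k
    grad_y = -grad_x                               # grad_y k
    # divergence term: grad_x . grad_y k
    div = -2 * beta * len(x) * s ** (beta - 1) \
          - 4 * beta * (beta - 1) * s ** (beta - 2) * (r @ r)
    # Langevin Stein kernel, Eq. (3)
    return (div + grad_x @ score_y + grad_y @ score_x
            + k * (score_x @ score_y))

def ksd(points, score):
    # Kernel Stein discrepancy, Eq. (4): sqrt of the average of k0 over all pairs
    n = len(points)
    scores = [score(x) for x in points]
    total = sum(imq_stein_kernel(points[i], points[j], scores[i], scores[j])
                for i in range(n) for j in range(n))
    return np.sqrt(total) / n
```

For instance, a point set concentrated far from the mode of P yields a larger KSD than a set that quantises P well, which is exactly the signal that SP-style algorithms greedily minimise.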
The first, called Greedy SP, was based on greedy minimisation of KSD, whilst the second, called Herding SP, was based on Frank-Wolfe minimisation of KSD. In each case, at iteration j ∈ {1, …, n} of the algorithm, the points {x_i}_{i=1}^{j−1} have already been selected and a global search method is used to select the next point x_j ∈ X. To limit scope we restrict the discussion below to Greedy SP, as this has stronger theoretical guarantees and has been shown empirically to outperform Herding SP. The convergence of Greedy SP was established in Theorem 2 of (Chen et al., 2018b) when k₀ is a P-sub-exponential kernel (Def. 1 of Chen et al., 2018b). More precisely, assume that, for some pre-specified tolerance δ > 0, the selected point sequence satisfies

∀j ∈ {1, …, n}:  D_{K₀,P}({x_i}_{i=1}^j)² ≤ δ/j² + inf_{x∈X} D_{K₀,P}({x_i}_{i=1}^{j−1} ∪ {x})².

Then it was shown that there exist c₁, c₂ > 0 such that

D_{K₀,P}({x_i}_{i=1}^n) ≤ e^{π/2} √( (c₁ + c₂ log n)/n ),   (5)

so that KSD is asymptotically minimised. However, a significant limitation of the SP method is that it requires a global (non-convex) minimisation problem over X to be (approximately) solved in order to select each next point. In practice, the global search at iteration j can be facilitated by a grid search over X, but this procedure entails a computational cost that is exponential in the dimension d of X, and even in modest dimension this becomes impractical. The main contribution of the present paper is to re-visit the SP method and to study its behaviour when the global search is replaced with a local search, facilitated by an MCMC method. To proceed, two main challenges must be addressed: First, an appropriate local optimisation procedure must be developed. Second, the theoretical convergence of the modified algorithm must be established. In the next section we address the first challenge by presenting our novel methodological development.

3. Methodology
In Section 3.1 we present the novel SP-MCMC method.
Then in Section 3.2 we describe how the kernel k can be pre-conditioned to improve performance in SP-MCMC.

3.1. SP-MCMC
In this paper, we propose to replace the global minimisation at iteration j of the SP method of (Chen et al., 2018b) with a local search based on a P-invariant Markov chain of length m_j, where the sequence (m_j)_{j∈N} is to be specified. The proposed SP-MCMC method proceeds as follows:

1. Fix an initial point x₁ ∈ X.
2. For j = 2, …, n:
   i. Select an index i* ∈ {1, …, j−1} according to some criterion crit({x_i}_{i=1}^{j−1}), to be defined.
   ii. Run a P-invariant Markov chain, initialised at x_{i*}, for m_j iterations, and denote the realised sample path as (y_{j,l})_{l=1}^{m_j}.
   iii. Set x_j = y_{j,l*}, where l* ∈ {1, …, m_j} minimises D_{K₀,P}({x_i}_{i=1}^{j−1} ∪ {y_{j,l}}).

It remains to specify the sequence (m_j)_{j∈N} and the criterion crit. Precise statements about the effect of these choices on convergence are reserved for the theoretical treatment in Section 5. For the criterion crit, three different approaches are considered:

LAST selects the point last added: i* := j − 1.
RAND selects i* uniformly at random in {1, …, j−1}.
INFL selects i* to be the index of a most influential point in {x_i}_{i=1}^{j−1}. Specifically, we call x_{i*} a most influential point if removing it from our point set creates the greatest increase in KSD; i.e. i* maximises D_{K₀,P}({x_i}_{i=1}^{j−1} \ {x_{i*}}).

SP-MCMC overcomes the main limitation facing the original SP method: the global search is avoided. Indeed, the cost of simulating m_j steps of a P-invariant Markov chain will typically be just a fraction of the cost of implementing a global search method. The number of iterations m_j acts as a lever to trade off approximation quality against computational cost, with larger m_j leading on average to an empirical measure with lower KSD. The precise relationship is elucidated in Section 5.

Remark 1 (KSD has low overhead).
A large number of modern MCMC methods, such as the Metropolis-adjusted Langevin algorithm (MALA) and Hamiltonian Monte Carlo, exploit evaluations of ∇log p̃ to construct a P-invariant Markov transition kernel (Barp et al., 2018a). If such an MCMC method is used, the gradient information ∇log p̃(x_i) is computed during the course of the MCMC and can be recycled in the subsequent computation of KSD.

Remark 2 (Automatic mode-hopping). Although the Markov chain is used only for a local search, the initialisation criteria RAND and INFL offer the opportunity to jump to any point in the set {x_i}_{i=1}^{j−1} and thus can facilitate global exploration of the state space X. The INFL criterion, in particular, favours areas of X that are under-represented in the point set and thus, for a multi-modal target P, one can expect mode-hopping from near an over-represented mode to near an under-represented mode of P.

Remark 3 (Removal of bad points). A natural extension of the SP-MCMC method allows for the possibility of removing a bad point from the current point set. That is, at iteration j we may decide, according to some probabilistic or deterministic schedule, to remove a point x_{i*} for which D_{K₀,P}({x_i}_{i=1}^{j−1} \ {x_{i*}}) is minimised. This extension was also investigated, and results are reserved for Section A.6.5.

Remark 4 (Sequence vs set). If the number n of points is pre-specified, then after the nth point is selected one can attempt to further improve the point set by applying (e.g.) coordinate descent to the KSD, interpreted as a function D_{K₀,P} : X^n → [0, ∞); see (Chen et al., 2018b). To limit scope, this was not considered.

3.2.
Pre-conditioned Kernels for SP-MCMC
The original analysis of (Gorham & Mackey, 2017) focussed on the inverse multiquadric (IMQ) kernel

k(x, x′) = (1 + λ^{−2} ‖x − x′‖₂²)^β

for some length-scale parameter λ > 0 and exponent β ∈ (−1, 0); alternative kernels were considered in (Chen et al., 2018b), but the IMQ kernel was observed to lead to the best empirical approximations, as quantified objectively by the Wasserstein distance between the empirical measure and the target. Thus, in this paper we focus on the IMQ kernel. However, in order to improve the performance of the algorithm, we propose to allow for pre-conditioning of the kernel; that is, we consider

k(x, x′) = (1 + ‖Λ^{−1/2}(x − x′)‖₂²)^β   (6)

for some symmetric positive definite matrix Λ. The use of pre-conditioned kernels was recently proposed in the context of SVGD in (Detommaso et al., 2018), where Λ^{−1} was taken to be an approximation to the expected Hessian −∫_X ∇_x ∇_x log p(x) dP(x) of the negative log-target. Note that the matrix Λ can also form part of an MCMC transition kernel, such as the pre-conditioner matrix in MALA (Girolami & Calderhead, 2011). Sufficient conditions under which a pre-conditioned kernel ensures that KSD controls classical weak convergence of the empirical measure to the target are established in Section 5.

4. Experimental Results
In this section our attention turns to the empirical performance of SP-MCMC. The experimental protocol is explained in Section 4.1 and specific experiments are described in Sections 4.2, 4.3 and 4.4.

4.1. Experimental Protocol
To limit scope, we present a comparison of SP-MCMC to the original SP method, as well as to MCMC, MED and SVGD. All experiments involving SP-MCMC, SP or SVGD in this paper were based on the IMQ kernel in (6) with β = −1/2. The pre-conditioner matrix Λ was taken either to be a sample-based approximation to the covariance matrix of P (Secs. 4.2 and 4.3), generated by running a short MCMC, or Λ ∝ I (Sec.
4.4); however, in each experiment Λ was fixed across all methods being compared. The Markov chains used for SP-MCMC and MCMC in this work employed either a random walk Metropolis (RWM) or a MALA transition kernel, described in Appendix A.5. Our implementations of MED and SVGD are described in Appendix A.6.1. Three experiments of increasing sophistication were considered.¹

¹Code to reproduce all experiments can be downloaded at https://github.com/wilson-ye-chen/sp-mcmc.

Figure 2. Gaussian mixture experiment in dimension d = 2. Columns (left to right): MCMC, SP-MCMC with LAST, SP-MCMC with RAND, SP-MCMC with INFL. Top row: Point sets of size n = 1000 produced by MCMC and SP-MCMC. (Point colour indicates the mode to which they are closest.) Second row: Trace plot of log D_{K₀,P}({x_i}_{i=1}^j) as j is varied from 1 to n. Third row: Trace plots of the sequence (x_i)_{i=1}^n, projected onto the first coordinate. Bottom row: Distribution of the squared jump distance ‖x_j − x_{j−1}‖₂² (green) compared to the quantities ‖y_{j,m_j} − y_{j,1}‖₂² associated with the Markov chains (orange) used during the course of each method.

First, in Section 4.2 we consider a simple Gaussian mixture target in order to explore SP-MCMC and investigate sensitivity to the degrees of freedom in this new method. Second, in Section 4.3 we revisit one of the experiments in (Chen et al., 2018b), in order to directly compare against SP, MCMC, MED and SVGD. Third, in Section 4.4 we consider a more challenging application to Bayesian parameter inference in an ordinary differential equation (ODE) model.

4.2. Gaussian Mixture Model
For exposition we let σ² = 0.5 and consider a d = 2 dimensional Gaussian mixture model P = ½ N(−1, σ²I_{d×d}) + ½ N(1, σ²I_{d×d}) with modes at 1 = [1, 1]^⊤ and −1. The performance of MCMC was compared to SP-MCMC for each of the criteria LAST, RAND, INFL. Note that in this section we do not address computational
cost; this is examined in Secs. 4.3 and 4.4. For SP-MCMC the sequence (m_j)_{j∈N} was set as m_j = 5. Results are presented in Fig. 2 with n = 1000. The point sets produced by SP-MCMC with LAST and INFL (top row) were observed to provide a better quantisation of the target P compared to MCMC, as captured by the KSD of the empirical measure to the target (second row). RAND did not distribute points evenly between modes and, as a result, KSD was observed to plateau in the range of n displayed. For MCMC, the proposal step-size h > 0 was optimised according to the recommendations in (Roberts & Rosenthal, 2001), but nevertheless the chain was observed to jump between the two components of P only infrequently (third row, colour-coded). In contrast, after an initial period where both modes are populated, SP-MCMC under the INFL criterion was seen to frequently jump between components of P. Finally, we note that under INFL the typical squared jump distance ‖x_j − x_{j−1}‖₂² was greater than the analogous quantities ‖y_{j,m_j} − y_{j,1}‖₂² for the underlying Markov chains that were used (bottom row), despite the latter being optimised according to the recommendations of (Roberts & Rosenthal, 2001), which supports the view that more frequent mode-hopping is a property of the INFL method. Based on the findings of this experiment, we focus only on LAST and INFL in the sequel. The extension where bad points are removed, described in Remark 3, was explored in supplemental Section A.6.5.

4.3. IGARCH Model
Next our attention turns to whether SP-MCMC improves over the original SP method and how it compares to existing methods such as MED and SVGD when computational cost is taken into account. To this end we consider an identical experiment to (Chen et al., 2018b), based on Bayesian inference for a classical integrated generalised autoregressive conditional heteroskedasticity (IGARCH) model. The IGARCH model (Taylor, 2011)

y_t = σ_t ϵ_t,   ϵ_t i.i.d.
N(0, 1),   σ_t² = θ₁ + θ₂ y_{t−1}² + (1 − θ₂) σ_{t−1}²

describes a financial time series (y_t) with time-varying volatility (σ_t). The model is parametrised by θ = (θ₁, θ₂), with θ₁ > 0 and 0 < θ₂ < 1, and Bayesian inference for θ is considered, based on data y = (y_t) that represent 2,000 daily percentage returns of the S&P 500 stock index (from December 6, 2005 to November 14, 2013). Following (Chen et al., 2018b), an improper uniform prior was placed on θ. The domain X = R⁺ × (0, 1) has a non-empty boundary ∂X and, for this example, the posterior P places negligible mass near ∂X. This ensures that the boundary conditions described in Sec. 2.1 hold essentially to machine precision, as argued in (Chen et al., 2018b).

Figure 3. IGARCH experiment. The new SP-MCMC method was compared against the original SP method of (Chen et al., 2018b), as well as against MCMC, MED (Roshan Joseph et al., 2015) and SVGD (Liu & Wang, 2016). The implementation of all existing methods is described in Appendix A.6. Each method produced an empirical measure (1/n) Σ_{i=1}^n δ_{x_i} whose distance to the target P was quantified by the energy distance E_P. The computational cost was quantified by the number n_eval of times either p̃ or its gradient were evaluated.

For objectivity, the energy distance E_P (Székely & Rizzo, 2004; Baringhaus & Franz, 2004) was used to assess closeness of all empirical measures to the target.² SP-MCMC was implemented with m_j = 5 for all j. In addition to SP-MCMC, the methods SP, MED, SVGD and standard MCMC were also considered, with implementation described in Appendix A.6. All methods produced a point set of size n = 1000. The results, presented in Fig. 3, are indexed by the computational cost of running each method, which is a count of the total number n_eval of times either p̃ or ∇log p̃ were evaluated.
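For concreteness, the IGARCH likelihood (and hence, under the flat prior, log p̃) can be evaluated by a single pass over the data. The sketch below is illustrative only: the initial-volatility convention σ₁² and the simulated data standing in for the S&P 500 series are assumptions of this example, not details taken from the paper.

```python
import numpy as np

def igarch_log_lik(theta, y, sigma2_init=1.0):
    # IGARCH(1,1): y_t = sigma_t * eps_t,  eps_t ~ N(0, 1),
    # sigma_t^2 = theta1 + theta2 * y_{t-1}^2 + (1 - theta2) * sigma_{t-1}^2
    theta1, theta2 = theta
    if theta1 <= 0 or not (0 < theta2 < 1):
        return -np.inf                    # outside the domain X = R+ x (0, 1)
    ll, sigma2 = 0.0, sigma2_init         # convention for sigma_1^2 (assumption)
    for t in range(len(y)):
        if t > 0:
            sigma2 = theta1 + theta2 * y[t-1]**2 + (1 - theta2) * sigma2
        ll += -0.5 * (np.log(2 * np.pi * sigma2) + y[t]**2 / sigma2)
    return ll   # flat prior: log p~(theta) equals the log-likelihood
```

Under the improper uniform prior, this function is exactly the un-normalised log-posterior that an RWM or MALA chain would target in this experiment.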
It can be seen that SP-MCMC offers improved performance over the original SP method for fixed computational cost, and in turn over both MED and SVGD in this experiment. Typical point sets produced by each method are displayed in Fig. S1. The performance of the pre-conditioned kernel on this task was investigated in Appendix A.6.4.

²The energy distance E_P is equivalent to MMD based on the conditionally positive definite kernel k(x, y) = −‖x − y‖₂ (Sejdinovic et al., 2013). It was computed using a high-quality empirical approximation of P obtained from a large MCMC output.

4.4. System of Coupled ODEs
Our final example is more challenging and offers an opportunity to explore the limitations of SP-MCMC in higher dimensions. The context is an indirectly observed ODE

y_i = g(u(t_i)) + ϵ_i,   ϵ_i i.i.d. N(0, σ²I),   u̇(t) = f_θ(t, u),   u(0) = u₀,

and, in particular, Bayesian inference for the parameter θ in the gradient field. Here y_i ∈ R^p, u(t) ∈ R^q and θ ∈ R^d for p, q, d ∈ N.

Figure 4. ODE experiment, d-dimensional. The new SP-MCMC method was compared against the original SP method of (Chen et al., 2018b), as well as against standard MCMC, MED (Roshan Joseph et al., 2015) and SVGD (Liu & Wang, 2016). Each method produced an empirical measure (1/n) Σ_{i=1}^n δ_{x_i} whose distance to the target P was quantified by the kernel Stein discrepancy (KSD). The computational cost was quantified by the number n_eval of times either p̃ or its gradient were evaluated.

For our experiment, f_θ and g comprised two instantiations of the Goodwin oscillator (Goodwin, 1965), one low-dimensional with (q, d) = (2, 4) and one higher-dimensional with (q, d) = (8, 10). In both cases p = 2, σ = 0.1 and 40 measurements were observed at uniformly-spaced time points in [41, 80].
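To illustrate the cost structure, each evaluation of log p̃(θ) in such a model entails a numerical ODE solve. A minimal sketch follows, with a fixed-step RK4 integrator and a toy damped linear oscillator standing in for the Goodwin dynamics f_θ; the true f_θ and observation function g are specified in Appendix A.6.6, so everything concrete below is an illustrative assumption.

```python
import numpy as np

def rk4(f, u0, ts, h=0.05):
    # Fixed-step RK4; returns the state at each requested time in ts
    t, u, out = 0.0, np.asarray(u0, float), []
    for t_target in ts:
        while t < t_target - 1e-12:
            step = min(h, t_target - t)
            k1 = f(t, u)
            k2 = f(t + step/2, u + step/2 * k1)
            k3 = f(t + step/2, u + step/2 * k2)
            k4 = f(t + step, u + step * k3)
            u = u + step/6 * (k1 + 2*k2 + 2*k3 + k4)
            t += step
        out.append(u.copy())
    return np.array(out)

def ode_log_lik(theta, ts, y_obs, sigma=0.1):
    # Toy stand-in for f_theta: a damped linear oscillator (assumption)
    a, b = theta
    f = lambda t, u: np.array([u[1], -a * u[0] - b * u[1]])
    g = lambda u: u                     # observe the full state (assumption)
    u = rk4(f, u0=[1.0, 0.0], ts=ts)
    resid = y_obs - np.array([g(ui) for ui in u])
    # Gaussian observation model y_i = g(u(t_i)) + eps_i, eps_i ~ N(0, sigma^2 I)
    return -0.5 * np.sum(resid**2) / sigma**2 \
           - resid.size * np.log(sigma * np.sqrt(2 * np.pi))
```

Because every likelihood call pays for an integration of the ODE, the number n_eval of (gradient-)density evaluations is the natural unit of computational cost in this experiment.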
The Goodwin oscillator does not permit a closed-form solution, meaning that each evaluation of the likelihood function requires the numerical integration of the ODE at a non-negligible computational cost. SP-MCMC was implemented with the INFL criterion and m_j = 10 (d = 4), m_j = 20 (d = 10). Full details of the ODE and settings for MED and SVGD are provided in Appendix A.6.6. In this experiment, KSD was used to assess closeness of all empirical measures to the target.³ Naturally, SP and SP-MCMC are favoured by this choice of assessment criterion, as these methods are designed to directly minimise KSD. Therefore our main focus here is on the comparison between SP and SP-MCMC. All methods produced a point set of size n = 1000. Results are shown in Fig. 4a (low-dimensional) and Fig. 4b (high-dimensional). Note how the gain in performance of SP-MCMC over SP is more substantial when d = 10 compared to when d = 4, supporting our earlier intuition for the advantage of local optimisation using a Markov kernel.

³The more challenging nature of this experiment meant accurate computation of the energy distance was precluded, due to the fact that a sufficiently high-quality empirical approximation of P could not be obtained.

5. Theoretical Results
Let Ω be a probability space on which the collection of random variables Y_{j,l} : Ω → X, representing the lth state of the Markov chain run at the jth iteration of SP-MCMC, are defined. Each of the three algorithms that we consider corresponds to a different initialisation of these Markov chains, and we use E to denote expectation over randomness in the Y_{j,l}. For example, the algorithm called LAST would set Y_{j,1}(ω) = x_{j−1}. It is emphasised that the results of this section hold for any choice of function crit that takes values in X.
As a stepping-stone toward our main result, we first extend the theoretical analysis of the original SP method to the case where the global search is replaced by a Monte Carlo search based on m_i independent draws from P at iteration i of the SP method.

Theorem 1 (i.i.d. SP-MCMC Convergence). Suppose that the kernel k₀ satisfies ∫_X k₀(x, ·) dP(x) ≡ 0 and E_{Z∼P}[e^{γ k₀(Z,Z)}] < ∞ for some γ > 0. Let (m_j)_{j=1}^n ⊂ N be a fixed sequence, and consider idealised Markov chains with Y_{j,l} i.i.d. ∼ P for all 1 ≤ l ≤ m_j, j ∈ N. Let {x_i}_{i=1}^n denote the output of SP-MCMC. Then, writing a ∧ b = min{a, b}, there exists C > 0 such that

E[ D_{K₀,P}({x_i}_{i=1}^n)² ] ≤ (C/n) Σ_{i=1}^n [ log(n ∧ m_i) ∧ sup_{x∈X} k₀(x, x) ] / (n ∧ m_i).

The constant C depends on k₀ and P, and the proof in Appendix A.1 makes this dependence explicit. It follows that SP-MCMC with independent sampling from P is consistent whenever each m_j grows with n. When m_j = m for all j we obtain

E[ D_{K₀,P}({x_i}_{i=1}^n)² ] ≤ C [ log(n ∧ m) ∧ sup_{x∈X} k₀(x, x) ] / (n ∧ m),

and by choosing m = n we recover the rate (5) of the original SP algorithm, which optimises over all of X (Chen et al., 2018b). For bounded kernels, the result improves over the O(1/n + 1/√m) independent-sampling kernel herding rate established in (Lacoste-Julien et al., 2015, App. B). Thm. 1 more generally accommodates unbounded kernels at the cost of a log(n ∧ m) factor.

The role of Thm. 1 is limited to providing a stepping stone to Thm. 2, as it is not practical to obtain exact samples from P in general. To state our result in the general case, restrict attention to X = R^d, consider a function V : X → [1, ∞) and define the associated operators

‖f‖_V := sup_{x∈X} |f(x)| / V(x),   ‖µ‖_V := sup_{‖f‖_V ≤ 1} | ∫ f dµ |,

respectively on functions f : X → R and on signed measures µ on X. A Markov chain (Y_i)_{i∈N} ⊂ X with nth-step transition kernel Pⁿ is called V-uniformly ergodic (Meyn & Tweedie, 2012, Chap. 16) if there exist R ∈ [0, ∞), ρ ∈ (0, 1) such that ‖Pⁿ(y, ·) − P‖_V ≤ R V(y) ρⁿ for all initial states y ∈ X and all n ∈ N.
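As a concrete example of a transition kernel to which these ergodicity conditions will be applied, the following is a minimal MALA implementation (standard construction; the Gaussian target in the usage below is a placeholder, not one of the paper's experiments):

```python
import numpy as np

def mala_step(x, log_p, grad_log_p, h, rng):
    # Proposal: y = x + (h/2) * grad log p(x) + sqrt(h) * xi,  xi ~ N(0, I)
    y = x + 0.5 * h * grad_log_p(x) + np.sqrt(h) * rng.standard_normal(x.shape)
    # log q(a -> b) for the Gaussian proposal density (up to a constant)
    log_q = lambda a, b: -np.sum((b - a - 0.5 * h * grad_log_p(a))**2) / (2 * h)
    # Metropolis-Hastings acceptance: alpha(x, y) = min(1, p(y)q(y,x) / (p(x)q(x,y)))
    log_alpha = log_p(y) + log_q(y, x) - log_p(x) - log_q(x, y)
    return (y, True) if np.log(rng.uniform()) < log_alpha else (x, False)

# Usage: sample from a standard normal target
rng = np.random.default_rng(1)
log_p = lambda x: -0.5 * np.sum(x**2)
grad_log_p = lambda x: -x
x, trace = np.array([5.0]), []
for _ in range(5000):
    x, _ = mala_step(x, log_p, grad_log_p, h=0.5, rng=rng)
    trace.append(x[0])
```

In SP-MCMC, step ii of the algorithm runs m_j such transitions from the point chosen by crit, and step iii selects the state on the realised path that minimises the KSD.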
The proof of the following is provided in Appendix A.2:

Theorem 2 (SP-MCMC Convergence). Suppose $\int_X k_0(x, \cdot)\,\mathrm{d}P(x) \equiv 0$ with $\mathbb{E}_{Z \sim P}[e^{\gamma k_0(Z,Z)}] < \infty$ for some $\gamma > 0$. For a sequence $(m_j)_{j=1}^n \subset \mathbb{N}$, let $\{x_i\}_{i=1}^n$ denote the output of SP-MCMC, based on time-homogeneous reversible Markov chains $(Y_{j,l})_{l=1}^{m_j}$, $j \in \mathbb{N}$, generated using the same V-uniformly ergodic transition kernel. Define $V_{\pm}(s) := \sup_{x : k_0(x,x) \le s^2} k_0(x,x)^{1/2} V(x)^{\pm 1}$ and $S_i = \sqrt{2 \log(n \wedge m_i)/\gamma}$. Then there exists $C > 0$ such that

$$\mathbb{E}\big[D_{K_0,P}(\{x_i\}_{i=1}^n)^2\big] \;\le\; \frac{C}{n} \sum_{i=1}^n \left( \frac{S_i^2}{n} + V_+(S_i)\,V_-(S_i)\,\rho^{m_i} \right).$$

We give an example of verifying the preconditions of Thm. 2 for MALA. Let $\mathcal{P}$ denote the set of distantly dissipative⁴ distributions with $\nabla \log p$ Lipschitz on $X = \mathbb{R}^d$. Let $C_b^{(r,r)}$ be the set of functions $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ with $(x, y) \mapsto \nabla_x^l \nabla_y^l k(x, y)$ continuous and uniformly bounded for $l \in \{0, \ldots, r\}$. Let $q(x, y)$ be a density for the proposal distribution of MALA, and let $\alpha(x, y)$ denote the acceptance probability for moving from x to y, given that y has been proposed. Let $A(x) = \{y \in X : \alpha(x, y) = 1\}$ denote the region where proposals are always accepted and let $R(x) = X \setminus A(x)$. Let $I(x) := \{y : \|y\|_2 \le \|x\|_2\}$. MALA is said to be inwardly convergent (Roberts & Tweedie, 1996, Sec. 4) if

$$\lim_{\|x\|_2 \to \infty} \int_{A(x) \,\triangle\, I(x)} q(x, y)\,\mathrm{d}y = 0 \qquad (7)$$

where $A \,\triangle\, B$ denotes the symmetric set difference $(A \cup B) \setminus (A \cap B)$. The proof of the following is provided in Appendix A.3:

⁴The target P is said to be distantly dissipative (Eberle, 2016; Gorham et al., 2019) if $\kappa_0 := \liminf_{r \to \infty} \kappa(r) > 0$ for $\kappa(r) = \inf\big\{ -2 \, \langle \nabla \log p(x) - \nabla \log p(y), \, x - y \rangle / \|x - y\|_2^2 \; : \; \|x - y\|_2 = r \big\}$.

Theorem 3 (SP-MALA Convergence). Suppose $k_0$ has the form (3), based on a kernel $k \in C_b^{(1,1)}$ and a target $P \in \mathcal{P}$ such that $\int_X k_0(x, \cdot)\,\mathrm{d}P(x) \equiv 0$. Let $(m_j)_{j=1}^n \subset \mathbb{N}$ be a fixed sequence and let $\{x_i\}_{i=1}^n$ denote the output of SP-MCMC, based on Markov chains $(Y_{j,l})_{l=1}^{m_j}$, $j \in \mathbb{N}$, generated using a MALA transition kernel with step size h sufficiently small. Assume P is such that MALA is inwardly convergent.
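For concreteness, the MALA kernel appearing in Thm. 3 can be sketched as follows (a minimal implementation of the standard Gaussian-proposal form of MALA; the function names are ours). As with KSD, only $\log \tilde{p}$ and $\nabla \log \tilde{p}$ are needed, since the unknown constant C cancels in the acceptance probability $\alpha(x, y)$:

```python
import numpy as np

rng = np.random.default_rng(2)

def mala_chain(x0, m, grad_logp, logp, h=0.5):
    """Run m steps of the Metropolis-adjusted Langevin algorithm (MALA).
    Proposal: y = x + (h/2) * grad log p(x) + sqrt(h) * xi, xi ~ N(0, I);
    accept with probability alpha(x, y) = min{1, p(y)q(y,x) / (p(x)q(x,y))}."""
    def log_q(x, y):
        # log-density (up to a constant) of proposing y from state x
        mu = x + 0.5 * h * grad_logp(x)
        return -float(np.sum((y - mu) ** 2)) / (2 * h)
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(m):
        x = xs[-1]
        y = x + 0.5 * h * grad_logp(x) + np.sqrt(h) * rng.standard_normal(x.shape)
        log_alpha = logp(y) + log_q(y, x) - logp(x) - log_q(x, y)
        xs.append(y if np.log(rng.uniform()) < log_alpha else x)
    return np.array(xs)
```

In SP-MCMC, a chain of this form of length $m_j$ is run at iteration j and the greedy KSD criterion is then minimised over its states, rather than over all of X.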
Then MALA is V-uniformly ergodic for $V(x) = 1 + \|x\|_2$ and there exists $C > 0$ such that

$$\mathbb{E}\big[D_{K_0,P}(\{x_i\}_{i=1}^n)^2\big] \;\le\; \frac{C}{n} \sum_{i=1}^n \frac{\log(n \wedge m_i)}{n}.$$

Our final result, proved in Appendix A.4, establishes that the pre-conditioned kernel proposed in Sec. 3.2 can control weak convergence to P when the pre-conditioner $\Lambda$ is symmetric positive definite (denoted $\Lambda \succ 0$). It is a generalisation of Thm. 8 of Gorham & Mackey (2017), who treated the special case $\Lambda = I$:

Theorem 4 (Pre-conditioned IMQ KSD Controls Convergence). Suppose $k_0$ is a Stein kernel (3) for a target $P \in \mathcal{P}$ and a pre-conditioned IMQ base kernel (6) with $\beta \in (-1, 0)$ and $\Lambda \succ 0$. If $D_{K_0,P}(\{x_i\}_{i=1}^n) \to 0$ then $\frac{1}{n} \sum_{i=1}^n \delta_{x_i}$ converges weakly to P.

6. Conclusion

This paper proposed fundamental improvements to the SP method of Chen et al. (2018b), establishing, in particular, that the global search used to select each point can be replaced with a finite-length sample path from an MCMC method. The convergence of the proposed SP-MCMC method was established, with an explicit bound provided on the KSD in terms of the V-uniform ergodicity of the Markov transition kernel. Potential extensions to our SP-MCMC method include the use of fast approximate Markov kernels for P (such as the unadjusted Langevin algorithm; see Appendix A.3), fast approximations to KSD (Jitkrittum et al., 2017; Huggins & Mackey, 2018), exploitation of conditional independence structure in P (Wang et al., 2018; Zhuo et al., 2018) and extension to a general Riemannian manifold X (Liu & Zhu, 2018; Barp et al., 2018b). One could also attempt to use our MCMC optimisation approach to accelerate related algorithms such as kernel herding (Chen et al., 2010; Bach et al., 2012; Lacoste-Julien et al., 2015). Other recent approaches to quantisation in the Bayesian context include (Futami et al., 2018; Hu et al., 2018; Frogner & Poggio, 2018; Zhang et al., 2018; Chen et al., 2018a; Li et al., 2019), and an assessment of the relative performance of these methods would be of interest.
However, we note that these approaches are not accompanied by the same level of theoretical guarantees as those we have established here.

Acknowledgements

The authors are grateful to the reviewers for their critical feedback on the manuscript. WYC was supported by the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers. AB was supported by a Roth Scholarship from the Department of Mathematics at Imperial College London, UK. WYC, AB, FXB, MG and CJO were supported by the Lloyd's Register Foundation programme on data-centric engineering at the Alan Turing Institute, UK. MG was supported by the EPSRC grants [EP/P020720/1, EP/R018413/1, EP/R034710/1, EP/R004889/1] and a Royal Academy of Engineering Research Chair.

References

Abramowitz, M. and Stegun, I. A. Handbook of Mathematical Functions. Dover, 1972.

Bach, F., Lacoste-Julien, S., and Obozinski, G. On the equivalence between herding and conditional gradient algorithms. In Proceedings of the International Conference on Machine Learning, pp. 1355–1362, 2012.

Baringhaus, L. and Franz, C. On a new multivariate two-sample test. Journal of Multivariate Analysis, 88(1):190–206, 2004.

Barp, A., Briol, F.-X., Kennedy, A. D., and Girolami, M. Geometry and dynamics for Markov chain Monte Carlo. Annual Review of Statistics and Its Application, 5:451–471, 2018a.

Barp, A., Oates, C., Porcu, E., and Girolami, M. A Riemannian Stein kernel method. arXiv:1810.04946, 2018b.

Berlinet, A. and Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, New York, 2004.

Calderhead, B. and Girolami, M. Estimating Bayes factors via thermodynamic integration and population MCMC. Computational Statistics & Data Analysis, 53(12):4028–4045, 2009.

Chen, C., Zhang, R., Wang, W., Li, B., and Chen, L. A unified particle-optimization framework for scalable Bayesian sampling.
In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence, 2018a.

Chen, W. Y., Mackey, L., Gorham, J., Briol, F.-X., and Oates, C. J. Stein points. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 843–852. PMLR, 2018b.

Chen, Y., Welling, M., and Smola, A. Super-samples from kernel herding. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2010.

Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. In Proceedings of the 33rd International Conference on Machine Learning, pp. 2606–2615, 2016.

De Marchi, S., Schaback, R., and Wendland, H. Near-optimal data-independent point locations for radial basis function interpolation. Advances in Computational Mathematics, 23(3):317–330, 2005.

Detommaso, G., Cui, T., Spantini, A., Marzouk, Y., and Scheichl, R. A Stein variational Newton method. In Advances in Neural Information Processing Systems 31, pp. 9187–9197, 2018.

Dick, J. and Pillichshammer, F. Digital Nets and Sequences: Discrepancy Theory and Quasi-Monte Carlo Integration. Cambridge University Press, 2010.

Eberle, A. Reflection couplings and contraction rates for diffusions. Probability Theory and Related Fields, 166(3-4):851–886, 2016.

Freund, R. M., Grigas, P., and Mazumder, R. An extended Frank–Wolfe method with in-face directions, and its application to low-rank matrix completion. SIAM Journal on Optimization, 27(1):319–346, 2017.

Frogner, C. and Poggio, T. Approximate inference with Wasserstein gradient flows. arXiv:1806.04542, 2018.

Futami, F., Cui, Z., Sato, I., and Sugiyama, M. Frank-Wolfe Stein sampling. arXiv:1805.07912, 2018.

Girolami, M. and Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 73(2):123–214, 2011.

Goodwin, B. C. Oscillatory behavior in enzymatic control processes. Advances in Enzyme Regulation, 3:318–356, 1965.

Gorham, J. and Mackey, L.
Measuring sample quality with Stein's method. In Advances in Neural Information Processing Systems, pp. 226–234, 2015.

Gorham, J. and Mackey, L. Measuring sample quality with kernels. In Proceedings of the 34th International Conference on Machine Learning, pp. 1292–1301, 2017.

Gorham, J., Duncan, A., Mackey, L., and Vollmer, S. Measuring sample quality with diffusions. Annals of Applied Probability, 2019. In press.

Graf, S. and Luschgy, H. Foundations of Quantization for Probability Distributions. Springer, 2007.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. J. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems, pp. 513–520, 2006.

Hu, T., Chen, Z., Sun, H., Bai, J., Ye, M., and Cheng, G. Stein neural sampler. arXiv:1810.03545, 2018.

Huggins, J. and Mackey, L. Random feature Stein discrepancies. In Advances in Neural Information Processing Systems 31, pp. 1903–1913, 2018.

Jitkrittum, W., Xu, W., Szabo, Z., Fukumizu, K., and Gretton, A. A linear-time kernel goodness-of-fit test. In Advances in Neural Information Processing Systems, pp. 261–270, 2017.

Joseph, V. R., Dasgupta, T., Tuo, R., and Wu, C. Sequential exploration of complex surfaces using minimum energy designs. Technometrics, 57(1):64–74, 2015.

Joseph, V. R., Wang, D., Gu, L., Lyu, S., and Tuo, R. Deterministic sampling of expensive posteriors using minimum energy designs. Technometrics, 2018. To appear.

Lacoste-Julien, S. and Jaggi, M. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, pp. 496–504, 2015.

Lacoste-Julien, S., Lindsten, F., and Bach, F. Sequential kernel herding: Frank-Wolfe optimization for particle filtering. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pp. 544–552, 2015.

Li, L., Li, Y., Liu, J., Liu, Z., and Lu, J.
A stochastic version of Stein variational gradient descent for efficient sampling. arXiv:1902.03394, 2019.

Liu, C. and Zhu, J. Riemannian Stein variational gradient descent for Bayesian inference. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Liu, Q. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pp. 3118–3126, 2017.

Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378–2386, 2016.

Liu, Q. and Wang, D. Stein variational gradient descent as moment matching. In Advances in Neural Information Processing Systems, pp. 8868–8877, 2018.

Liu, Q., Lee, J. D., and Jordan, M. I. A kernelized Stein discrepancy for goodness-of-fit tests and model evaluation. In Proceedings of the 33rd International Conference on Machine Learning, pp. 276–284, 2016.

Lu, J., Lu, Y., and Nolen, J. Scaling limit of the Stein variational gradient descent: The mean field regime. SIAM Journal on Mathematical Analysis, 2018. To appear.

Mak, S. and Joseph, V. R. Support points. Annals of Statistics, 46(6A):2562–2592, 2018.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

Meyn, S. and Tweedie, R. Markov Chains and Stochastic Stability. Springer Science & Business Media, 2012.

Müller, A. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.

Nelder, J. and Mead, R. A simplex method for function minimization. The Computer Journal, 7(4):308–313, 1965.

Oates, C. J., Papamarkou, T., and Girolami, M. The controlled thermodynamic integral for Bayesian model evidence evaluation. Journal of the American Statistical Association, 111(514):634–645, 2016.

Oates, C. J., Girolami, M., and Chopin, N.
Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B, 79(3):695–718, 2017.

Parno, M. D. Transport maps for accelerated Bayesian computation. PhD thesis, Massachusetts Institute of Technology, 2015.

Robert, C. and Casella, G. Monte Carlo Statistical Methods. Springer, 2004.

Roberts, G. O. and Rosenthal, J. S. Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16(4):351–367, 2001.

Roberts, G. O. and Tweedie, R. L. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.

Roshan Joseph, V., Dasgupta, T., Tuo, R., and Jeff Wu, C. F. Sequential exploration of complex surfaces using minimum energy designs. Technometrics, 57(1):64–74, 2015.

Santin, G. and Haasdonk, B. Convergence rate of the data-independent P-greedy algorithm in kernel-based approximation. Dolomites Research Notes on Approximation, 10, 2017.

Sejdinovic, D., Sriperumbudur, B., Gretton, A., and Fukumizu, K. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics, pp. 2263–2291, 2013.

Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, pp. 583–602. University of California Press, 1972.

Székely, G. and Rizzo, M. Testing for equal distributions in high dimension. InterStat, 5(16.10):1249–1272, 2004.

Taylor, S. J. Asset Price Dynamics, Volatility, and Prediction. Princeton University Press, 2011.

Wang, D., Zeng, Z., and Liu, Q. Stein variational message passing for continuous graphical models. In Proceedings of the 35th International Conference on Machine Learning, PMLR 80, pp. 5219–5227, 2018.

Zhang, R., Chen, C., Li, C., and Carin, L. Policy optimization as Wasserstein gradient flows.
In Proceedings of the 35th International Conference on Machine Learning, pp. 5737–5746, 2018.

Zhuo, J., Liu, C., Shi, J., Zhu, J., Chen, N., and Zhang, B. Message passing Stein variational gradient descent. In Proceedings of the 35th International Conference on Machine Learning, PMLR 80, pp. 6013–6022, 2018.