# Stein Points

Wilson Ye Chen¹  Lester Mackey²  Jackson Gorham³  François-Xavier Briol⁴ ⁵ ⁶  Chris J. Oates⁷ ⁶

¹School of Mathematical and Physical Sciences, University of Technology Sydney, Australia. ²Microsoft Research New England, USA. ³Opendoor Labs, Inc., USA. ⁴Department of Statistics, University of Warwick, UK. ⁵Department of Mathematics, Imperial College London, UK. ⁶Alan Turing Institute, UK. ⁷School of Mathematics, Statistics and Physics, Newcastle University, UK. Correspondence to: Wilson Ye Chen, Lester Mackey.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

An important task in computational statistics and machine learning is to approximate a posterior distribution $p(x)$ with an empirical measure supported on a set of representative points $\{x_i\}_{i=1}^n$. This paper focuses on methods where the selection of points is essentially deterministic, with an emphasis on achieving accurate approximation when $n$ is small. To this end, we present Stein Points. The idea is to exploit either a greedy or a conditional gradient method to iteratively minimise a kernel Stein discrepancy between the empirical measure and $p(x)$. Our empirical results demonstrate that Stein Points enable accurate approximation of the posterior at modest computational cost. In addition, theoretical results are provided to establish convergence of the method.

1. Introduction

This paper is motivated by approximation of a Borel distribution $P$, defined on a topological space $\mathcal{X}$, with deterministic point sets or sequences $\{x_i\}_{i=1}^n \subset \mathcal{X}$ for $n \in \mathbb{N}$, such that

$\frac{1}{n} \sum_{i=1}^n h(x_i) \to \int h \, \mathrm{d}P \quad (1)$

as $n \to \infty$ for all functions $h : \mathcal{X} \to \mathbb{R}$ in a specified set $\mathcal{H}$. Throughout it will be assumed that $P$ admits a density $p$, with respect to a reference measure, available in a form that is un-normalised (i.e., we know $q(x)$ in closed form, where $p(x) = q(x)/C$ for some $C > 0$). Such problems occur in Bayesian statistics, where $P$ represents a posterior distribution and the integral represents a posterior expectation of interest. Markov chain Monte Carlo (MCMC) methods are extensively used for this task but suffer (in terms of accuracy) from clustering of the points $\{x_i\}_{i=1}^n$ when $n$ is small. This observation motivates us to instead consider a range of goal-oriented discrete approximation methods that are designed with un-normalised densities in mind.

The problem of discrete approximation of a distribution, given its normalised density, has been considered in detail, and relevant methods include quasi-Monte Carlo (QMC) (Dick & Pillichshammer, 2010), kernel herding (Chen et al., 2010; Lacoste-Julien et al., 2015), support points (Mak & Joseph, 2016; 2017), transport maps (Marzouk et al., 2016), and minimum energy methods (Johnson et al., 1990). On the other hand, the question of how to proceed with un-normalised densities has been primarily answered with increasingly sophisticated MCMC. At the same time, recent work has led to theoretically justified measures of sample quality in the case of an un-normalised target. In (Gorham & Mackey, 2015; Mackey & Gorham, 2016) it was shown that Stein's method can be used to construct discrepancy measures that control weak convergence of an empirical measure to a target. This was later extended in (Gorham & Mackey, 2017) to encompass a family of discrepancy measures indexed by a reproducing kernel. In the latter case, the discrepancy measure can be recognised as a maximum mean discrepancy (Smola et al., 2007).
As such, one can consider discrete approximation as an optimisation problem in a Hilbert space and attempt to optimise this objective with either a greedy or a conditional gradient method. The resulting method, Stein Points, and its variants are proposed and studied in this work.

Our Contribution  This paper makes the following contributions:

- Two algorithms are proposed for minimisation of the kernel Stein discrepancy (KSD; Chwialkowski et al., 2016; Liu et al., 2016; Gorham & Mackey, 2017): a greedy algorithm and a conditional gradient method. In each case, a convergence result of the form in Eqn. 1 is established.
- Novel kernels are proposed for the KSD, and we prove that, with these kernels, the KSD controls weak convergence of the empirical measure to the target. In other words, the test functions $h$ for which our results hold constitute a rich set $\mathcal{H}$.

Outline  The paper proceeds as follows. In Section 2 we provide background, and in Section 3 we present the approximation methods that will be studied. Section 4 applies these methods to both simulated and real approximation problems and provides an extensive empirical comparison. All technical material is contained in Section 5, where we derive novel theoretical results for the methods we propose. Finally, we summarise our findings in Section 6.

2. Background

Throughout this section it will be assumed that $\mathcal{X}$ is a metric space, and we let $\mathcal{P}(\mathcal{X})$ denote the collection of Borel distributions on $\mathcal{X}$. In this context, weak convergence of the empirical measure to $P$ corresponds to taking the set $\mathcal{H}$ in Eqn. 1 to be the set $\mathcal{H}_{\mathrm{CB}}$ of functions which are continuous and bounded. In this work we also consider sets $\mathcal{H}$ that correspond to stronger modes of convergence in $\mathcal{P}(\mathcal{X})$. First, in Section 2.1, we recall how discrepancy measures are constructed. Then we recall the use of Stein's method in this context in Section 2.2. Formulae for KSD are presented in Section 2.3.

2.1. Discrepancy Measures

A discrepancy is a quantification of how well the points $\{x_i\}_{i=1}^n$ cover the domain $\mathcal{X}$ with respect to the distribution $P$. This framework will be developed below in reproducing kernel Hilbert spaces (RKHS; Hickernell, 1998), but the general theory of discrepancy can be found in (Dick & Pillichshammer, 2010). Note that we focus on unweighted point sets for ease of presentation, but our discussions and results generalise straightforwardly to point sets that are weighted.

Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be the reproducing kernel of an RKHS $\mathcal{K}$ of functions $\mathcal{X} \to \mathbb{R}$. That is, $\mathcal{K}$ is a Hilbert space of functions with inner product $\langle \cdot, \cdot \rangle_{\mathcal{K}}$ and induced norm $\|\cdot\|_{\mathcal{K}}$ such that, for all $x \in \mathcal{X}$, $k(x,\cdot) \in \mathcal{K}$ and $f(x) = \langle f, k(x,\cdot) \rangle_{\mathcal{K}}$ whenever $f \in \mathcal{K}$. The Cauchy-Schwarz inequality in $\mathcal{K}$ gives that

$\left| \frac{1}{n} \sum_{i=1}^n f(x_i) - \int f \, \mathrm{d}P \right| \le \|f\|_{\mathcal{K}} \, D_{\mathcal{K},P}(\{x_i\}_{i=1}^n)$

where the final term

$D_{\mathcal{K},P}(\{x_i\}_{i=1}^n) = \left\| \frac{1}{n} \sum_{i=1}^n k(x_i,\cdot) - \int k(x,\cdot) \, \mathrm{d}P(x) \right\|_{\mathcal{K}}$

is the canonical discrepancy measure for the RKHS. The Bochner integral $k_P = \int k(x,\cdot) \, \mathrm{d}P(x) \in \mathcal{K}$ is known as the mean embedding of $P$ into $\mathcal{K}$ (Smola et al., 2007). Thus, if $\mathcal{H} = B(\mathcal{K}) = \{ f \in \mathcal{K} : \|f\|_{\mathcal{K}} \le 1 \}$ is the unit ball in $\mathcal{K}$, then $D_{\mathcal{K},P}(\{x_i\}_{i=1}^n) \to 0$ implies the convergence result in Eqn. 1. The RKHS framework is now standard for QMC analysis (Dick & Pillichshammer, 2010). Its popularity derives from the fact that, when both $k_P$ and $k_{P,P} = \int k_P \, \mathrm{d}P$ are explicit, the canonical discrepancy measure is also explicit:

$D_{\mathcal{K},P}(\{x_i\}_{i=1}^n) = \sqrt{ \, k_{P,P} - \frac{2}{n} \sum_{i=1}^n k_P(x_i) + \frac{1}{n^2} \sum_{i,j=1}^n k(x_i,x_j) \, }. \quad (2)$

Table 1 in (Briol et al., 2015) collates pairs $(k,P)$ for which $k_P$ and $k_{P,P}$ are explicit.
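The quantities in Eqn. 2 translate directly into code. Below is a minimal sketch, assuming the mean embedding $k_P$ and the constant $k_{P,P}$ are supplied in closed form (as collated in Table 1 of Briol et al., 2015); the function and argument names are illustrative and are not taken from the paper.

```python
import numpy as np

def canonical_discrepancy(X, k, k_P, k_PP):
    """Eqn. (2): canonical RKHS discrepancy of a point set.

    X    : (n, d) array of points
    k    : callable (x, y) -> float, the reproducing kernel
    k_P  : callable x -> float, the mean embedding of P evaluated at x
    k_PP : float, the double integral of k against P
    """
    n = X.shape[0]
    gram = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    sq = k_PP - 2.0 * np.mean([k_P(x) for x in X]) + gram.sum() / n ** 2
    return np.sqrt(max(sq, 0.0))  # clip tiny negative values from round-off
```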
If $P$ is a posterior distribution, so that $p$ has an unknown normalisation constant, it is unclear how the terms $k_P$ and $k_{P,P}$ can be computed in closed form, and so similarly for the discrepancy $D_{\mathcal{K},P}$. This has so far prevented QMC and related methods, such as kernel herding (Chen et al., 2010), from being used to compute posterior integrals. A solution to this problem can be found in Stein's method, presented next.

2.2. Kernel Stein Discrepancy

The method of Stein (1972) was introduced as an analytical tool for establishing convergence in distribution of random variables, but its potential for generating and analysing computable discrepancies was developed in (Gorham & Mackey, 2015). In what follows, we recall the kernelised version of the Stein discrepancy, first presented for an optimally-weighted point set in Section 2.3.3 of (Oates et al., 2017b) and later generalised to an arbitrarily-weighted point set in (Chwialkowski et al., 2016; Liu et al., 2016; Gorham & Mackey, 2017).

Suppose that $\mathcal{X}$ carries the structure of a smooth manifold, and consider a linear differential operator $\mathcal{T}_P$ on $\mathcal{X}$, together with a set $\mathcal{F}$ of sufficiently differentiable functions, with the following property:

$\int \mathcal{T}_P[f] \, \mathrm{d}P = 0 \quad \forall f \in \mathcal{F}. \quad (3)$

Then $\mathcal{T}_P$ is called a Stein operator and $\mathcal{F}$ a Stein set. In the kernelised version of Stein's method, the set $\mathcal{F}$ is either an RKHS $\mathcal{K}$ with reproducing kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, or the product $\mathcal{K}^d$, which contains vector-valued functions $f = (f_1,\dots,f_d)$ with $f_j \in \mathcal{K}$ and is equipped with a norm¹ $\|f\|_{\mathcal{K}^d} = (\sum_{j=1}^d \|f_j\|_{\mathcal{K}}^2)^{1/2}$. For the case $\mathcal{F} = \mathcal{K}$, the image of $\mathcal{K}$ under a Stein operator $\mathcal{T}_P$ is denoted $\mathcal{K}_0 = \mathcal{T}_P \mathcal{K}$. The notation can be justified since, under appropriate regularity assumptions, the set $\mathcal{T}_P \mathcal{K}$ admits structure from the reproducing kernel $k_0(x,x') = \mathcal{T}_P \mathcal{T}_P' k(x,x')$ (Oates et al., 2017b). Here $\mathcal{T}_P'$ is the adjoint of the operator $\mathcal{T}_P$ and acts on the second argument $x'$ of the kernel. If instead $\mathcal{F} = \mathcal{K}^d$, then we suppose that $\mathcal{T}_P f = \sum_{j=1}^d \mathcal{T}_{P,j} f_j$, so that the set $\mathcal{K}_0 = \mathcal{T}_P \mathcal{K}^d$ admits structure from the reproducing kernel $k_0(x,x') = \sum_{j=1}^d \mathcal{T}_{P,j} \mathcal{T}_{P,j}' k(x,x')$. In either case, we will call the reproducing kernel $k_0$ of $\mathcal{K}_0$ a Stein reproducing kernel.

¹For what follows, any vector norm can be used to combine the component norms $\|f_j\|_{\mathcal{K}}$ (Gorham & Mackey, 2017, Prop. 3).

Stein reproducing kernels possess the useful property that $k_{0,P} = \int k_0(x,\cdot) \, \mathrm{d}P = 0$ and $k_{0,P,P} = \int k_{0,P} \, \mathrm{d}P = 0$, so in particular both are explicit. Thus, if $k_0$ is a Stein reproducing kernel, then Eqn. 2 can be simplified:

$D_{\mathcal{K}_0,P}(\{x_i\}_{i=1}^n) = \sqrt{ \frac{1}{n^2} \sum_{i,j=1}^n k_0(x_i,x_j) }. \quad (4)$

We call this quantity a kernel Stein discrepancy (KSD). Next, we exhibit some differential operators for which Eqn. 3 is satisfied and Eqn. 4 can be computed.

2.3. Stein Operators and Their Reproducing Kernels

The divergence theorem can be used to construct Stein operators on a manifold. For $P$ supported on $\mathcal{X} = \mathbb{R}^d$, (Oates et al., 2017b; Gorham & Mackey, 2015; Chwialkowski et al., 2016; Liu et al., 2016; Gorham & Mackey, 2017) considered the Langevin Stein operator

$\mathcal{T}_P f = \frac{1}{p} \nabla \cdot (p f) \quad (5)$

where $\nabla \cdot$ is the usual divergence operator and $f \in \mathcal{K}^d$. Thus, for the Langevin Stein operator, we obtain the Stein reproducing kernel

$k_0(x,x') = \nabla_x \cdot \nabla_{x'} k(x,x') + \nabla_x k(x,x') \cdot \nabla_{x'} \log p(x') + \nabla_{x'} k(x,x') \cdot \nabla_x \log p(x) + k(x,x') \, \nabla_x \log p(x) \cdot \nabla_{x'} \log p(x'). \quad (6)$

To evaluate this kernel, the normalisation constant for $p$ is not required. Other Stein operators for the Euclidean case were developed in (Gorham et al., 2016).
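For a concrete instance of Eqns. 4 and 6, the sketch below evaluates the Langevin Stein reproducing kernel built from the IMQ base kernel $k(x,x') = (\alpha + \|x-x'\|_2^2)^{\beta}$, whose derivatives are available analytically, and then the KSD of a point set. Only the score function $\nabla \log p$ is required, so the normalisation constant of $p$ never enters. This is an illustrative sketch under the stated kernel choice, not the authors' implementation.

```python
import numpy as np

def imq_stein_kernel(x, y, score, alpha=1.0, beta=-0.5):
    """Eqn. (6) with IMQ base kernel; x, y are 1-d arrays, score(x) = grad log p(x)."""
    u = x - y
    r2 = float(u @ u)
    s = alpha + r2                                  # alpha + ||x - y||^2
    d = x.size
    k = s ** beta                                   # base kernel value
    grad_x = 2.0 * beta * s ** (beta - 1.0) * u     # nabla_x k
    grad_y = -grad_x                                # nabla_{x'} k
    # nabla_x . nabla_{x'} k for the IMQ kernel (trace of the mixed second derivative)
    div = -4.0 * beta * (beta - 1.0) * s ** (beta - 2.0) * r2 \
          - 2.0 * beta * d * s ** (beta - 1.0)
    sx, sy = score(x), score(y)
    return div + grad_x @ sy + grad_y @ sx + k * (sx @ sy)

def ksd(points, k0):
    """Eqn. (4): kernel Stein discrepancy of a list/array of points."""
    n = len(points)
    total = sum(k0(points[i], points[j]) for i in range(n) for j in range(n))
    return np.sqrt(max(total, 0.0)) / n
```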
For $P$ supported on a closed Riemannian manifold $\mathcal{X}$, (Oates et al., 2017a; Liu & Zhu, 2017) proposed the second-order Stein operator

$\mathcal{T}_P f = \frac{1}{p} \nabla \cdot (p \, \nabla f)$

where $\nabla$ and $\nabla \cdot$ are, respectively, the gradient and divergence operators on the manifold and $f \in \mathcal{K}$. Other Stein operators for the general case are proposed in the supplement of (Oates et al., 2017a).

The theoretical results in (Gorham & Mackey, 2017) established that certain combinations of Stein operator $\mathcal{T}_P$ and base kernel $k$ ensure that KSD controls weak convergence; that is, $D_{\mathcal{K}_0,P}(\{x_i\}_{i=1}^n) \to 0$ implies that Eqn. 1 holds with $\mathcal{H} = \mathcal{H}_{\mathrm{CB}}$. This important result motivates our next contribution, where numerical optimisation methods are used to select points $\{x_i\}_{i=1}^n$ that approximately minimise KSD. Theoretical analysis of the proposed methods is reserved for Section 5.

3. Methods

In this paper, two algorithms to select points $\{x_i\}_{i=1}^n$ are studied in detail. The first of these is a greedy algorithm, which at each iteration attempts to minimise the KSD, whilst the second is a conditional gradient algorithm, known as herding, which also targets the KSD. In Sections 3.1 and 3.2 the two algorithms are described, whilst in Section 3.3 some alternative approaches are briefly discussed.

3.1. Greedy Algorithm

The simplest algorithm that we consider follows a greedy strategy, whereby the first point $x_1$ is taken to be a global maximum of $p$ (an operation which does not require the normalisation constant) and each subsequent point $x_n$ is taken to be a global minimum of $D_{\mathcal{K}_0,P}(\{x_i\}_{i=1}^n)$, with the KSD viewed as a function of $x_n$ holding $\{x_i\}_{i=1}^{n-1}$ fixed. Equivalently, at iteration $n > 1$ of the greedy algorithm, we select

$x_n \in \operatorname*{arg\,min}_{x \in \mathcal{X}} \; \frac{k_0(x,x)}{2} + \sum_{i=1}^{n-1} k_0(x_i,x). \quad (7)$

Note that each iteration of the algorithm requires the solution of a global optimisation problem over $\mathcal{X}$; in practice we employed a numerical optimisation method, and this choice is discussed in detail in connection with the empirical results in Section 4 and the theoretical results in Section 5. If a user has a budget of at most $n$ points, the greedy algorithm can be run for $n$ iterations and thereafter improved using (block) coordinate descent on the KSD objective to update an existing point $x_i$ instead of introducing a new point. The cost of each update is equal to the cost of adding the $n$-th greedy Stein Point. This budget-constrained variant of the method will be called Stein Greedy-n in the sequel (see Section B.1.3 for more details).
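As a minimal illustration of the greedy step in Eqn. 7, the sketch below performs the inner minimisation over a finite candidate set; in the paper this role is played by a numerical optimiser (Nelder-Mead, Monte Carlo search, or grid search; see Section 4.1). The function name and the candidate-set mechanism are illustrative simplifications rather than the authors' code.

```python
import numpy as np

def next_stein_point_greedy(existing, candidates, k0):
    """Approximate one iteration of Eqn. (7): return the candidate x minimising
    k0(x, x)/2 + sum_i k0(x_i, x), given the points selected so far.
    (For n = 1 the paper instead takes x_1 to be a global maximum of p.)"""
    def objective(x):
        return 0.5 * k0(x, x) + sum(k0(xi, x) for xi in existing)
    scores = [objective(x) for x in candidates]
    return candidates[int(np.argmin(scores))]
```

In practice the candidates could, for example, be drawn from the Gaussian mixture proposal described in Section 4.1 (the Monte Carlo optimiser), with the best candidate optionally refined by a local search.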
3.2. Herding Algorithm

The definition of discrepancy in Section 2.1 suggests that the selection of $\{x_i\}_{i=1}^n$ can be elegantly formulated as a single global optimisation problem over $\mathcal{K}_0$. Let $\mathcal{M}(\mathcal{K}_0)$ be the marginal polytope of $\mathcal{K}_0$, i.e. the convex hull of the set $\{k_0(x,\cdot)\}_{x \in \mathcal{X}}$ (Wainwright & Jordan, 2008). The mean embedding $Q \mapsto k_Q$, as a map $\mathcal{P}(\mathcal{X}) \to \mathcal{M}(\mathcal{K})$, is injective whenever the kernel $k$ is universal and $\mathcal{X}$ is compact (Smola et al., 2007), so that in this case $k_Q$ fully characterises $Q$. Results in a similar direction for Stein reproducing kernels were established in Chwialkowski et al. (2016, Theorem 2.1) and Liu et al. (2016, Proposition 3.3). Thus, as $P$ is mapped to $0$ under the embedding, we are motivated to consider non-trivial solutions to

$\operatorname*{arg\,min}_{f \in \mathcal{M}(\mathcal{K}_0)} J(f), \qquad J(f) = \frac{1}{2} \|f\|_{\mathcal{K}_0}^2. \quad (8)$

As might be expected, the objective function is closely related to KSD; for $f(\cdot) = \frac{1}{n} \sum_{i=1}^n k_0(x_i,\cdot)$ we have $J(f) = \frac{1}{2} D_{\mathcal{K}_0,P}(\{x_i\}_{i=1}^n)^2$. An iterative algorithm, called kernel herding, was proposed in (Chen et al., 2010) to solve problems in the form of Eqn. 8. This was later shown to be equivalent to a conditional gradient algorithm, the Frank-Wolfe algorithm, in (Bach et al., 2012). The canonical Frank-Wolfe algorithm, which results in an unweighted point set (as opposed to a more general weighted point set; Bach et al., 2012), is presented next.

The first point $x_1$ is again taken to be a global maximum of $p$; this corresponds to an element $f_1 = k_0(x_1,\cdot) \in \mathcal{M}(\mathcal{K}_0)$. Then, at iteration $n > 1$, the convex combination $f_n = \frac{n-1}{n} f_{n-1} + \frac{1}{n} \bar{f}_n \in \mathcal{M}(\mathcal{K}_0)$ is constructed, where the element $\bar{f}_n$ encodes a direction of steepest descent:

$\bar{f}_n \in \operatorname*{arg\,min}_{f \in \mathcal{M}(\mathcal{K}_0)} \langle f, DJ(f_{n-1}) \rangle_{\mathcal{K}_0},$

where $DJ(f)$ is the representer of the Fréchet derivative of $J$ at $f$. Given that minimisation of a linear objective over a convex set can be restricted to the boundary of that set, it follows that $\bar{f}_n = k_0(x_n,\cdot)$ for some $x_n \in \mathcal{X}$. Thus, at iteration $n > 1$ of the proposed algorithm, we select

$x_n \in \operatorname*{arg\,min}_{x \in \mathcal{X}} \sum_{i=1}^{n-1} k_0(x_i,x) \quad (9)$

to obtain $f_n(\cdot) = \frac{1}{n} \sum_{i=1}^n k_0(x_i,\cdot)$, the embedding of the empirical distribution of $\{x_i\}_{i=1}^n$. As in the standard kernel herding algorithm of (Chen et al., 2010), each iteration in practice requires the solution of a global optimisation problem over $\mathcal{X}$.

Compared to Eqn. 7, the greedy algorithm is seen to be a regularised version of herding, with regulariser $\frac{1}{2} k_0(x,x)$. The two algorithms coincide if $k_0(x,x)$ is independent of $x$; however, this is typically not true for a Stein reproducing kernel. The computational cost of either method is $O(n^2)$; thus we anticipate applications in which evaluation of $p(x)$ (and its gradient) constitutes the principal computational bottleneck. The performance of both algorithms is studied empirically in Section 4 and theoretically in Section 5. In a similar manner to Stein Greedy-n, a budget-constrained variant of the above method can be considered, which we call Stein Herding-n in the sequel.
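For comparison with the greedy sketch above, a single herding iteration (Eqn. 9) differs only in dropping the $k_0(x,x)/2$ regulariser. The candidate-set search is again a stand-in for the global optimiser and is not the authors' code.

```python
import numpy as np

def next_stein_point_herding(existing, candidates, k0):
    """Approximate one iteration of Eqn. (9): return the candidate x minimising
    sum_i k0(x_i, x). As in the greedy method, x_1 is taken to be a mode of p."""
    scores = [sum(k0(xi, x) for xi in existing) for x in candidates]
    return candidates[int(np.argmin(scores))]
```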
3.3. Other Algorithms

The output of either of our algorithms will be called Stein Points. These are extensible point sequences $S_n = (x_i)_{i=1}^n$, meaning that $S_n$ can be incrementally extended as $S_n = (S_{n-1}, x_n)$, as required. Another recently proposed extensible method is the (sequential) minimum energy design (MED) of (Joseph et al., 2015; 2017), used here as a benchmark. For some problems the number of points $n$ will be fixed in advance and the aim will instead be to select a single optimal point set $\{x_i\}_{i=1}^n$. This alternative problem demands different methodologies, and a promising method in this direction is Stein variational gradient descent (SVGD-n; Liu & Wang, 2016; Liu, 2017). A natural point-set analogue of our approach would be to optimise KSD for $n$ fixed. This approach was considered for other discrepancy measures in (Oettershagen, 2017), where the Newton method was used. We instead employ our budget-constrained algorithms Stein Greedy-n and Stein Herding-n for this use case.

4. Empirical Results

In this section, the proposed greedy and herding algorithms are empirically assessed and compared. In Section 4.2 a Gaussian mixture problem is studied in detail, whilst in Sections 4.3 and 4.4, respectively, the methods are applied to approximate the parameter posterior in a non-parametric regression model and an IGARCH model. First, in Section 4.1, we provide details on the experimental protocol.

4.1. Experimental Protocol

Here we describe the parameters and settings that were varied in the experiments that are presented.

Stein Operator  To limit scope, we focus on the case $\mathcal{X} = \mathbb{R}^d$ and always take $\mathcal{T}_P$ to be the Langevin Stein operator in Eqn. 5.

Choice of Kernel  For the kernel $k$ in Eqn. 6 we considered one standard choice, the inverse multi-quadric (IMQ) kernel, together with two novel alternatives:

(k1) (IMQ) $k_1(x,x') = (\alpha + \|x - x'\|_2^2)^{\beta}$
(k2) (inverse log) $k_2(x,x') = (\alpha + \log(1 + \|x - x'\|_2^2))^{-1}$
(k3) (IMQ score) $k_3(x,x') = (\alpha + \|\nabla \log p(x) - \nabla \log p(x')\|_2^2)^{\beta}$.

In all cases $\alpha > 0$ and $\beta \in (-1,0)$. To further limit scope, in what follows we considered a finite number of judiciously selected configurations for $\alpha, \beta$, though in principle these could be optimised as in (Jitkrittum et al., 2017). The best set of parameter values was selected for each algorithm and each target distribution, where the possible values were $\alpha \in \{0.1\eta, 0.5\eta, \eta, 2\eta, 4\eta, 8\eta\}$ and $\beta \in \{-0.1, -0.3, -0.5, -0.7, -0.9\}$, with $\eta > 0$ problem-dependent (see the Supplement). The IMQ kernel, together with the Langevin Stein operator, was proven in Gorham & Mackey (2017, Theorem 8) to provide a KSD that controls weak convergence. Similar results for the novel kernels k2 and k3 are established in Section 5.
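The three base kernels can be written down directly; the sketch below is a plain transcription of (k1)-(k3) with $\alpha > 0$, $\beta \in (-1,0)$, and `score` denoting $\nabla \log p$ (which does not require the normalisation constant). The function names are illustrative only.

```python
import numpy as np

def k1_imq(x, y, alpha, beta):
    """(k1) inverse multi-quadric kernel."""
    return (alpha + np.sum((x - y) ** 2)) ** beta

def k2_inverse_log(x, y, alpha):
    """(k2) inverse log kernel."""
    return 1.0 / (alpha + np.log1p(np.sum((x - y) ** 2)))

def k3_imq_score(x, y, alpha, beta, score):
    """(k3) IMQ score kernel, applied to the score vectors grad log p."""
    diff = score(x) - score(y)
    return (alpha + np.sum(diff ** 2)) ** beta
```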
Numerical Optimisation Method  Any optimisation procedure could be used to (approximately) solve the global optimisation problem embedded in each iteration of the proposed algorithms. In our experiments, we considered the following numerical methods, for which full details appear in Section B.2.

1. Nelder-Mead (NM): At iteration $n$, parallel runs of Nelder-Mead were employed, initialised at draws from a Gaussian mixture proposal centred on the current point set, $\Pi = \frac{1}{n-1} \sum_{i=1}^{n-1} \mathcal{N}(x_i, \lambda I)$, with problem-specific $\lambda > 0$.
2. Monte Carlo (MC): The optimisation problem at iteration $n$ was solved over a sample of points drawn from the same Gaussian mixture proposal $\Pi$.
3. Grid search (GS): Through brute force, the optimisation problem at iteration $n$ was solved over a regular grid of width $n^{-1/2}$. This required $O(n^{d/2})$ points; if required, the domain was first truncated with a large bounding box.

Performance Assessment  To obtain a reasonably objective assessment, we focused on the 1-Wasserstein distance between the empirical measure and $P$:

$W_P(\{x_i\}_{i=1}^n) = \sup_{h \in \mathcal{H}_{\mathrm{Lip}}} \left| \frac{1}{n} \sum_{i=1}^n h(x_i) - \int h \, \mathrm{d}P \right|,$

where $\mathcal{H}_{\mathrm{Lip}}$ is the set of all functions $h : \mathcal{X} \to \mathbb{R}$ with Lipschitz constant $\mathrm{Lip}(h) \le 1$. By replacing $P$ with the empirical measure $P_N = \frac{1}{N} \sum_{i=1}^N \delta_{y_i}$ for $y_i$ drawn i.i.d. from $P$, the expected error from using $W_{P_N}(\{x_i\}_{i=1}^n)$ in lieu of $W_P(\{x_i\}_{i=1}^n)$ converges at an $N^{-1/2} \log N$ rate for $d = 2$ and an $N^{-1/d}$ rate for $d > 2$ (Fournier & Guillin, 2015). By employing $L_1$-spanners, the approximation $W_{P_N}(\{x_i\}_{i=1}^n)$ can be computed in $O((n+N)^2 \log^{2d-1}(n+N))$ time (Gudmundsson et al., 2007). For all reported results, the $\{y_i\}_{i=1}^N$ were obtained by brute-force Monte Carlo methods applied to $P$, with $N$ sufficiently large that the approximation error can be neglected. The computational cost associated with any given method was quantified as the total number $n_{\mathrm{eval}}$ of times either the log-density $\log p$ or its gradient $\nabla \log p$ was evaluated. This can be justified since, in most applications, the parameter-to-data map dominates the computational cost associated with the likelihood.

Benchmarks  Two existing methods were used as benchmarks:

1. The MED method of (Joseph et al., 2015; 2017) relies on numerical optimisation methods to minimise an energy measure $E_{\delta,P}(\{x_i\}_{i=1}^n)$, adapted to $P$. This measure has one tuning parameter $\delta \in [1,\infty)$. See Section B.1.1 of the Supplement for full details.
2. The SVGD method of (Liu & Wang, 2016; Liu, 2017) performs a version of gradient descent on the Kullback-Leibler divergence, described in Section B.1.2 of the Supplement.

To avoid confounding of the empirical results by incomparable algorithm parameters, (1) the collection of numerical optimisation methods used for KSD was also used for MED, and (2) the same collection of kernels k1, ..., k3 was considered for SVGD as was used for KSD. Note that, apart from standard Monte Carlo, none of the methods considered in these experiments is re-parametrisation invariant.

4.2. Gaussian Mixture Test

For our first test, we considered a Gaussian mixture model $P = \frac{1}{2} \mathcal{N}(\mu_1, \Sigma_1) + \frac{1}{2} \mathcal{N}(\mu_2, \Sigma_2)$ defined on $\mathcal{X} = \mathbb{R}^2$. Full settings for each of the methods considered are detailed in Section C.1 of the Supplement. Typical point sets are displayed over the contours of $P$ for $\mu_1 = (-1.5, 0)$, $\mu_2 = (1.5, 0)$, $\Sigma_1 = \Sigma_2 = I$ in Figure 1. Additionally, point sets for the n-point budget-constrained algorithms Stein Greedy-n and Stein Herding-n are presented in Figure 6 in the Supplement.

Figure 1: Typical point sets obtained in the Gaussian mixture test, for Monte Carlo, Stein Greedy, Stein Herding and MED. [Here the left border of each sub-plot is aligned to the exact value of log n_eval spent to obtain each point set.]

For each of the methods shown in Figures 1 and 6, tuning parameters were varied and the overall performance was captured in Figure 2. It was observed that for (a-c) the choice of numerical optimisation method was the most influential tuning parameter, with the simpler Monte Carlo-based method being most successful. The kernels k1 and k2 were seen to perform well, but in (a,b,d) the kernel k3 was sometimes seen to fail. A subjectively-selected exemplar was extracted for each method, and these best results are overlaid in Figure 3. The total number of points was limited to n = 100.

Figure 2: Results for the Gaussian mixture test; panels include (a) Stein Points (Greedy) and (b) Stein Points (Herding). [Here n = 100. x-axis: log of the number n_eval of model evaluations that were used. y-axis: log of the Wasserstein distance W_P({x_i}) obtained. Kernel parameters α, β were optimised according to W_P in all cases, with sensitivities reported in Fig. 7 of the Supplement.]

Figure 3: Combined results for the Gaussian mixture test, for Monte Carlo, Stein Greedy-100, Stein Herding-100, MED and SVGD-100. [Here n = 100. x-axis: log of the number n_eval of model evaluations that were used. y-axis: log of the Wasserstein distance W_P({x_i}) obtained. Tuning parameters were selected to minimise W_P, as described in the main text. The dashed line indicates the point at which n Stein Points have been generated; block coordinate descent is performed thereafter to satisfy the n-point budget constraint.]

In terms of our proposed methods, two qualitative regimes were observed: (i) for a low computational budget, log n_eval ≤ 7, the standard Monte Carlo method performed best; (ii) for a larger computational budget, log n_eval > 7, greedy Stein Points were not out-performed. Note that KSD and SVGD are based on the log-target and its gradient, whilst for MED the target $p(x)$ itself is required. As a result, numerical instabilities were sometimes encountered with MED.
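For reference, the Gaussian mixture target of this subsection has a simple closed-form score, which is all that the Langevin Stein kernel requires. The sketch below uses the settings of Figure 1 and drops normalising constants (which the KSD never needs); it is an illustration rather than the experimental code.

```python
import numpy as np

mu1, mu2 = np.array([-1.5, 0.0]), np.array([1.5, 0.0])  # settings used for Figure 1

def log_p(x):
    """Un-normalised log density of P = 0.5 N(mu1, I) + 0.5 N(mu2, I)."""
    a = -0.5 * np.sum((x - mu1) ** 2)
    b = -0.5 * np.sum((x - mu2) ** 2)
    return np.logaddexp(a, b)

def score(x):
    """grad log p(x): a responsibility-weighted average of the component scores."""
    a = -0.5 * np.sum((x - mu1) ** 2)
    b = -0.5 * np.sum((x - mu2) ** 2)
    w1 = 1.0 / (1.0 + np.exp(b - a))  # posterior weight of component 1
    return w1 * (mu1 - x) + (1.0 - w1) * (mu2 - x)
```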
Next, we turned our attention to two important posterior approximation problems that occur in the real world.

4.3. Gaussian Process Regression Model

The Gaussian process (GP) model is a popular choice for uncertainty quantification in the non-parametric regression context (Rasmussen & Williams, 2006). The data $D = \{(x_i, y_i)\}_{i=1}^n$ that we considered are from a light detection and ranging (LIDAR) experiment (Ruppert et al., 2003). They consist of 221 realisations of an independent scalar variable $x_i$ (distances travelled before the light is reflected back to its source) and a dependent scalar variable $y_i$ (log-ratios of received light from two laser sources); these were modelled as $y_i = g(x_i) + \epsilon_i$, for $\epsilon_i$ i.i.d. $\mathcal{N}(0, \sigma^2)$ and a known value of $\sigma$. The unknown regression function $g$ is modelled as a centred GP with covariance function $\mathrm{cov}(x, x') = \theta_1 \exp(-\theta_2 (x - x')^2)$. The hyper-parameters $\theta_1, \theta_2 > 0$ determine the suitability of the GP model, but appropriate values will be unknown in general. In this experiment we re-parametrised $\varphi_i = \log \theta_i$ and placed a standard multivariate Cauchy prior on $\varphi = (\varphi_1, \varphi_2)$, defined on $\mathcal{X} = \mathbb{R}^2$. The task is thus to approximate the conditional distribution $p(\varphi \mid D)$.

This problem is motivated by the computation of the posterior predictive marginal distribution $p(y^* \mid x^*, D)$ for a new input $x^*$, which is defined as the integral $\int p(y^* \mid x^*, \varphi, D) \, p(\varphi \mid D) \, \mathrm{d}\varphi$. Note that the density $p(\varphi \mid D)$ can be differentiated, and an explicit formula is provided in Rasmussen & Williams (2006, Eqn. 5.9).

For each class of method, the best tuning parameters were selected and these are presented on the same plot in Figure 4a. In addition, typical point sets provided by each method are presented in Figures 8 and 9 in the Supplement. MED was not included because the method exhibited severe numerical instability on this task, as discussed earlier. Results indicated three qualitative regimes where, respectively, Monte Carlo, greedy Stein Points and SVGD provided the best performance for fixed cost.

4.4. IGARCH Model

The integrated generalised autoregressive conditional heteroskedasticity (IGARCH) model is widely used to describe financial time series $(y_t)$ with time-varying volatility $(\sigma_t)$ (Taylor, 2011). The model is as follows:

$y_t = \sigma_t \epsilon_t, \quad \epsilon_t \text{ i.i.d. } \mathcal{N}(0,1), \quad \sigma_t^2 = \theta_1 + \theta_2 y_{t-1}^2 + (1 - \theta_2) \sigma_{t-1}^2$

with parameters $\theta = (\theta_1, \theta_2)$, $\theta_1 > 0$ and $0 < \theta_2 < 1$. The data $y = (y_t)$ that we considered were 2,000 daily percentage returns of the S&P 500 stock index (from December 6, 2005 to November 14, 2013), and an improper uniform prior was placed on $\theta$. Thus the task was to approximate the posterior $p(\theta \mid y)$. Note that, whilst the domain $\mathcal{X} = \mathbb{R}_+ \times (0,1)$ has a boundary $\partial\mathcal{X}$, for these data the posterior density is negligible on $\partial\mathcal{X}$. This ensures that Eqn. 3 holds essentially to machine precision; see also the discussion in Oates et al. (2018, Section 3.2). For the IGARCH model, gradients $\nabla_\theta \log p(\theta \mid y)$ can be obtained as the solution of a recursive system of equations for $\partial \sigma_t^2 / \partial \theta$.

As before, the best performing method in each class was selected and these are presented on the same plot in Figure 4b. In addition, typical point sets provided by each method are presented in Figures 12 and 13 in the Supplement. (Numerical instability again prevented results for MED from being obtained.) Results were consistent with the Gaussian mixture experiment, favouring either Monte Carlo or greedy Stein Points depending on the computational budget.
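To make the IGARCH target concrete, the sketch below evaluates the log posterior via the volatility recursion under the improper uniform prior. The initialisation of $\sigma_1^2$ (here the sample variance of $y$) is an assumption made for illustration, as the paper does not state it; gradients could be obtained by differentiating the same recursion, or by automatic differentiation.

```python
import numpy as np

def igarch_log_posterior(theta, y, sigma2_init=None):
    """Log posterior (up to an additive constant) of the IGARCH model.

    theta = (theta1, theta2) with theta1 > 0 and 0 < theta2 < 1;
    y is the array of observed returns.
    """
    theta1, theta2 = theta
    if theta1 <= 0 or not (0 < theta2 < 1):
        return -np.inf  # improper uniform prior on the admissible domain
    sigma2 = np.var(y) if sigma2_init is None else sigma2_init  # assumed initialisation
    logp = 0.0
    for t in range(1, len(y)):
        sigma2 = theta1 + theta2 * y[t - 1] ** 2 + (1 - theta2) * sigma2
        logp += -0.5 * np.log(2 * np.pi * sigma2) - 0.5 * y[t] ** 2 / sigma2
    return logp
```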
5. Theoretical Results

In this section we establish two important forms of theoretical guarantee: (1) discrepancy control, i.e., $D_{\mathcal{K}_0,P}(\{x_i\}_{i=1}^n) \to 0$ as $n \to \infty$ for our extensible Stein Point sequences, and (2) distributional convergence control, i.e., for our kernel choices and appropriate choices of target, $D_{\mathcal{K}_0,P}(\{x_i\}_{i=1}^n) \to 0$ implies that the empirical distribution $\frac{1}{n} \sum_{i=1}^n \delta_{x_i}$ converges in distribution to $P$.

5.1. Discrepancy Control

Earlier work has shown that, when a kernel is uniformly bounded (i.e., $\sup_{x \in \mathcal{X}} k_0(x,x) \le R^2$), the greedy and kernel herding algorithms decrease the associated discrepancy $D_{\mathcal{K}_0,P}$ at an $O(n^{-1/2})$ rate (Lacoste-Julien et al., 2015; Jones, 1992). We extend these results to cover all growing, $P$-sub-exponential kernels.

Definition 1 ($P$-sub-exponential reproducing kernel). We say a reproducing kernel $k_0$ is $P$-sub-exponential if $\mathbb{P}_{Z \sim P}[k_0(Z,Z) \ge t] \le c_1 e^{-c_2 t}$ for some constants $c_1, c_2 > 0$ and all $t \ge 0$.

Notably, any uniformly bounded reproducing kernel is $P$-sub-exponential, and, when $P$ is a sub-Gaussian distribution, any kernel with at most quadratic growth (i.e., $k_0(x,x) = O(\|x\|_2^2)$) is also $P$-sub-exponential.

Our first result, proved in Section A.1.1, shows that if we truncate the search domain suitably in each step, Stein Herding decreases the discrepancy at an $O(\sqrt{\log(n)/n})$ rate. This result holds even if each point $x_i$ is selected suboptimally with error $\delta/2$. This extra degree of freedom allows a user to conduct a grid search, or a search over appropriately generated random points, on each step (see, e.g., Lacoste-Julien et al., 2015) and still obtain a rate of convergence.

Theorem 1 (Stein Herding Convergence). Suppose $k_0$ with $k_{0,P} = 0$ is a $P$-sub-exponential reproducing kernel. Then there exist constants $c_1, c_2 > 0$, depending only on $k_0$ and $P$, such that any point sequence $\{x_i\}_{i=1}^n$ satisfying

$\sum_{i=1}^{j-1} k_0(x_i, x_j) \le \frac{\delta}{2} + \min_{x \in \mathcal{X} : \, k_0(x,x) \le R_j^2} \sum_{i=1}^{j-1} k_0(x_i, x),$

with $k_0(x_j, x_j) \le R_j^2$ and $R_j^2 \in [2\log(j)/c_2, \, 2\log(n)/c_2]$ for each $1 \le j \le n$, also satisfies $D_{\mathcal{K}_0,P}(\{x_i\}_{i=1}^n) \le e^{\pi/2} \cdots$

Our next result, proved in Section A.1.2, shows that Stein Greedy decreases the discrepancy at an $O(\sqrt{\log(n)/n})$ rate whether we choose to truncate ($R_j < \infty$) or not ($R_j = \infty$). This highlights an advantage of the Stein Greedy algorithm over Stein Herding: the extra $k_0(x,x)/2$ term acts as a regulariser, ensuring that no truncation is necessary. The result also accommodates points $x_i$ selected suboptimally with error $\delta/2$.

Theorem 2 (Stein Greedy Convergence). Suppose $k_0$ with $k_{0,P} = 0$ is a $P$-sub-exponential reproducing kernel. Then there exist constants $c_1, c_2 > 0$, depending only on $k_0$ and $P$, such that any point sequence $\{x_i\}_{i=1}^n$ satisfying

$\frac{k_0(x_j,x_j)}{2} + \sum_{i=1}^{j-1} k_0(x_i, x_j) \le \frac{\delta}{2} + \min_{x \in \mathcal{X} : \, k_0(x,x) \le R_j^2} \left[ \frac{k_0(x,x)}{2} + \sum_{i=1}^{j-1} k_0(x_i, x) \right],$

with $2\log(j)/c_2 \le R_j^2$ for each $1 \le j \le n$, also satisfies $D_{\mathcal{K}_0,P}(\{x_i\}_{i=1}^n) \le e^{\pi/2} \cdots$

5.2. Distributional Convergence Control

To present our final results, we overload notation to define the KSD associated with any probability measure $\mu$: $D_{\mathcal{K}_0,P}(\mu) = \sqrt{\mathbb{E}_{(Z,Z') \sim \mu \times \mu}[k_0(Z,Z')]}$.

Figure 4: Combined results for the (a) Gaussian process test and (b) IGARCH test, for MCMC, Stein Greedy-100, Stein Herding-100 and SVGD-100. [Here n = 100. x-axis: log of the number n_eval of model evaluations that were used. y-axis: log of the Wasserstein distance W_P({x_i}) obtained. Tuning parameters were selected to minimise W_P, as described in the main text. The dashed line indicates the point at which n Stein Points have been generated; block coordinate descent is performed thereafter to satisfy the n-point budget constraint.]
Our original definition of $D_{\mathcal{K}_0,P}$ (Eqn. 4) for a point set $\{x_i\}_{i=1}^n$ is recovered when $\mu$ is the empirical measure $\frac{1}{n} \sum_{i=1}^n \delta_{x_i}$. We also write $\mu_m \Rightarrow P$ to indicate that a sequence of probability measures $(\mu_m)_{m=1}^{\infty}$ converges in distribution to $P$.

Gorham & Mackey (2017, Thm. 8) showed that KSDs with the IMQ base kernel (k1) and the Langevin Stein operator $\mathcal{T}_P$ control distributional convergence whenever $P$ belongs to the set $\mathcal{P}$ of distantly dissipative distributions (i.e., $\langle \nabla \log p(x) - \nabla \log p(y), x - y \rangle \le -\kappa \|x - y\|_2^2 + C$ for some $C \ge 0$, $\kappa > 0$) with Lipschitz $\nabla \log p$. Surprisingly, Gaussian, Matérn, and other kernels with light tails do not satisfy this property (Gorham & Mackey, 2017, Thm. 6). Our next theorem establishes distributional convergence control for our newly introduced log inverse kernel (k2).

Theorem 3 (Log Inverse KSD Controls Convergence). Suppose $P \in \mathcal{P}$. Consider a Stein reproducing kernel $k_0 = \mathcal{T}_P \mathcal{T}_P' k_2$ with Langevin operator $\mathcal{T}_P$ and base kernel $k_2(x,x') = (\alpha + \log(1 + \|x - x'\|_2^2))^{\beta}$ for $\alpha > 0$ and $\beta < 0$. If $D_{\mathcal{K}_0,P}(\mu_m) \to 0$, then $\mu_m \Rightarrow P$.

Our final theorem, proved in Section A.3, guarantees distributional convergence control for the new IMQ score kernel (k3) under the additional assumption that $\log p$ is strictly concave.

Theorem 4 (IMQ Score KSD Controls Convergence). Suppose $P \in \mathcal{P}$ has strictly concave log density. Consider a Stein reproducing kernel $k_0 = \mathcal{T}_P \mathcal{T}_P' k_3$ with Langevin operator $\mathcal{T}_P$ and base kernel $k_3(x,x') = (c^2 + \|\nabla \log p(x) - \nabla \log p(x')\|_2^2)^{\beta}$ for $c > 0$ and $\beta \in (-1,0)$. If $D_{\mathcal{K}_0,P}(\mu_m) \to 0$, then $\mu_m \Rightarrow P$.

6. Conclusion

This paper proposed and studied Stein Points, extensible point sequences rooted in minimisation of a KSD, building on the recent theoretical work of (Gorham & Mackey, 2017). Although we focused on KSD to limit scope, our methods could in fact be applied to any computable Stein discrepancy, even those not based on reproducing kernels (see, e.g., Gorham & Mackey, 2015; Gorham et al., 2016). Stein Points provide an interesting counterpoint to other recent work focussing on point sequences (Joseph et al., 2015; 2017) and point sets (Liu & Wang, 2016; Liu, 2017). Moreover, when $\mathcal{X}$ is a finite set $\{y_i\}_{i=1}^N$ (e.g., an inexpensive initial point set generated by MCMC), Stein Points provide a compact and convergent approximation to the optimally weighted probability measure $\sum_{i=1}^N w_i \delta_{y_i}$ with minimum KSD (see Section B.3 for more details).

Theoretical results were provided which guarantee the asymptotic correctness of our methods. However, we were only able to establish an $O(\sqrt{\log(n)/n})$ rate, which leaves a theoretical gap relative to the faster convergence that was sometimes observed empirically. Relatedly, the $O(n^2)$ computational cost could be reduced to $O(n)$ by using finite-dimensional kernels (see, e.g., Jitkrittum et al., 2017), but the associated distributional convergence control results must first be developed. Our experiments were relatively comprehensive, but we did not consider other Stein operators, nor higher-dimensional or non-Euclidean manifolds $\mathcal{X}$. Related methods not considered in this work include those based on optimal transport (Marzouk et al., 2016) and self-avoiding particle-based samplers (Robert & Mengersen, 2003). The comparison against these methods is left for future work.

Acknowledgements

WYC was supported by the ARC Centre of Excellence for Mathematical and Statistical Frontiers.
FXB was supported by EPSRC [EP/L016710/1, EP/R018413/1]. CJO was supported by the Lloyd's Register Foundation programme on data-centric engineering at the Alan Turing Institute, UK. This material was based upon work partially supported by the National Science Foundation under Grant DMS-1127914 to the Statistical and Applied Mathematical Sciences Institute. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

Bach, F., Lacoste-Julien, S., and Obozinski, G. On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Conference on Machine Learning, pp. 1355–1362, 2012.

Baker, J. Integration of radial functions. Mathematics Magazine, 72(5):392–395, 1999.

Briol, F.-X., Oates, C., Girolami, M., Osborne, M., and Sejdinovic, D. Probabilistic integration: A role in statistical computation? arXiv:1512.00933, 2015.

Chen, Y., Welling, M., and Smola, A. Super-samples from kernel herding. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, 2010.

Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, pp. 2606–2615, 2016.

Dick, J. and Pillichshammer, F. Digital Nets and Sequences: Discrepancy Theory and Quasi-Monte Carlo Integration. Cambridge University Press, 2010.

Fournier, N. and Guillin, A. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707–738, 2015.

Gorham, J. and Mackey, L. Measuring sample quality with Stein's method. In Advances in Neural Information Processing Systems, pp. 226–234, 2015.

Gorham, J. and Mackey, L. Measuring sample quality with kernels. In Proceedings of the 34th International Conference on Machine Learning, pp. 1292–1301, 2017.

Gorham, J., Duncan, A., Vollmer, S., and Mackey, L. Measuring sample quality with diffusions. arXiv:1611.06972, 2016.

Gudmundsson, J., Klein, O., Knauer, C., and Smid, M. Small Manhattan networks and algorithmic applications for the earth mover's distance. In Proceedings of the 23rd European Workshop on Computational Geometry, pp. 174–177, 2007.

Hickernell, F. A generalized discrepancy and quadrature error bound. Mathematics of Computation, 67(221):299–322, 1998.

Jitkrittum, W., Xu, W., Szabo, Z., Fukumizu, K., and Gretton, A. A linear-time kernel goodness-of-fit test. In Advances in Neural Information Processing Systems, pp. 261–270, 2017.

Johnson, M., Moore, L., and Ylvisaker, D. Minimax and maximin distance designs. Journal of Statistical Planning and Inference, 26(2):131–148, 1990.

Jones, L. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Annals of Statistics, 20(1):608–613, 1992.

Joseph, V., Dasgupta, T., Tuo, R., and Wu, C. Sequential exploration of complex surfaces using minimum energy designs. Technometrics, 57(1):64–74, 2015.

Joseph, V., Wang, D., Gu, L., Lv, S., and Tuo, R. Deterministic sampling of expensive posteriors using minimum energy designs. arXiv:1712.08929, 2017.

Joshi, K. Introduction to General Topology. New Age International, 1983.

Lacoste-Julien, S., Lindsten, F., and Bach, F. Sequential kernel herding: Frank-Wolfe optimization for particle filtering. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pp. 544–552, 2015.
Liu, C. and Zhu, J. Riemannian Stein variational gradient descent for Bayesian inference. arXiv:1711.11216, 2017.

Liu, Q. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pp. 3118–3126, 2017.

Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2370–2378, 2016.

Liu, Q., Lee, J., and Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of the 33rd International Conference on Machine Learning, pp. 276–284, 2016.

Mackey, L. and Gorham, J. Multivariate Stein factors for a class of strongly log-concave distributions. Electronic Communications in Probability, 21(56), 2016.

Mak, S. and Joseph, V. R. Support points. arXiv:1609.01811, 2016.

Mak, S. and Joseph, V. R. Projected support points, with application to optimal MCMC reduction. arXiv:1708.06897, 2017.

Marzouk, Y., Moselhy, T., Parno, M., and Spantini, A. Sampling via measure transport: An introduction. In Handbook of Uncertainty Quantification. Springer, 2016.

Nelder, J. and Mead, R. A simplex method for function minimization. The Computer Journal, 7(4):308–313, 1965.

Oates, C., Barp, A., and Girolami, M. Posterior integration on a Riemannian manifold. arXiv:1712.01793, 2017a.

Oates, C., Girolami, M., and Chopin, N. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B, 79(3):695–718, 2017b.

Oates, C., Cockayne, J., Briol, F.-X., and Girolami, M. Convergence rates for a class of estimators based on Stein's identity. Bernoulli, 2018. To appear.

Oettershagen, J. Construction of Optimal Cubature Algorithms with Applications to Econometrics and Uncertainty Quantification. PhD thesis, University of Bonn, 2017.

Rasmussen, C. and Williams, C. Gaussian Processes for Machine Learning. MIT Press, 2006.

Robert, C. and Mengersen, K. IID sampling with self-avoiding particle filters: The pinball sampler. In Bernardo, J., Bayarri, M., Berger, J., Dawid, A., Heckerman, D., Smith, A., and West, M. (eds.), Bayesian Statistics, volume 7. Oxford University Press, 2003.

Ruppert, D., Wand, M., and Carroll, R. Semiparametric Regression. Cambridge Series in Statistical and Probabilistic Mathematics, Number 12, 2003.

Sejdinovic, D., Sriperumbudur, B., Gretton, A., and Fukumizu, K. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics, 41(5):2263–2291, 2013.

Smola, A., Gretton, A., Song, L., and Schölkopf, B. A Hilbert space embedding for distributions. In Proceedings of the 18th International Conference on Algorithmic Learning Theory, pp. 13–31, 2007.

Spivak, M. Calculus on Manifolds: A Modern Approach to Classical Theorems of Advanced Calculus. Westview Press, 1965.

Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory. The Regents of the University of California, 1972.

Steinwart, I. and Christmann, A. Support Vector Machines. Springer Science & Business Media, 2008.

Taylor, S. Asset Price Dynamics, Volatility, and Prediction. Princeton University Press, 2011.
Wainwright, M. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. 2017. URL https://www.stat.berkeley.edu/~mjwain/stat210b/Chap2_TailBounds_Jan22_2015.pdf.

Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

Wendland, H. Scattered Data Approximation. Cambridge University Press, 2004.