Published as a conference paper at ICLR 2022

GENERALIZED KERNEL THINNING

Raaz Dwivedi¹, Lester Mackey²
¹ Department of Computer Science, Harvard University and Department of EECS, MIT
² Microsoft Research New England
raaz@mit.edu, lmackey@microsoft.com

The kernel thinning (KT) algorithm of Dwivedi and Mackey (2021) compresses a probability distribution more effectively than independent sampling by targeting a reproducing kernel Hilbert space (RKHS) and leveraging a less smooth square-root kernel. Here we provide four improvements. First, we show that KT applied directly to the target RKHS yields tighter, dimension-free guarantees for any kernel, any distribution, and any fixed function in the RKHS. Second, we show that, for analytic kernels like Gaussian, inverse multiquadric, and sinc, target KT admits maximum mean discrepancy (MMD) guarantees comparable to or better than those of square-root KT without making explicit use of a square-root kernel. Third, we prove that KT with a fractional power kernel yields better-than-Monte-Carlo MMD guarantees for non-smooth kernels, like Laplace and Matérn, that do not have square-roots. Fourth, we establish that KT applied to a sum of the target and power kernels (a procedure we call KT+) simultaneously inherits the improved MMD guarantees of power KT and the tighter individual function guarantees of target KT. In our experiments with target KT and KT+, we witness significant improvements in integration error even in 100 dimensions and when compressing challenging differential equation posteriors.

1 INTRODUCTION

A core task in probabilistic inference is learning a compact representation of a probability distribution $\mathbb{P}$. This problem is usually solved by sampling points $x_1, \dots, x_n$ independently from $\mathbb{P}$ or, if direct sampling is intractable, generating $n$ points from a Markov chain converging to $\mathbb{P}$. The benefit of these approaches is that they provide asymptotically exact sample estimates $\mathbb{P}_{\mathrm{in}} f \triangleq \frac{1}{n} \sum_{i=1}^n f(x_i)$ for intractable expectations $\mathbb{P} f \triangleq \mathbb{E}_{X \sim \mathbb{P}}[f(X)]$. However, they also suffer from a serious drawback: the learned representations are unnecessarily large, requiring $n$ points to achieve $|\mathbb{P} f - \mathbb{P}_{\mathrm{in}} f| = \Theta(n^{-1/2})$ integration error. These inefficient representations quickly become prohibitive for expensive downstream tasks and function evaluations: for example, in computational cardiology, each function evaluation $f(x_i)$ initiates a heart or tissue simulation that consumes 1000s of CPU hours (Niederer et al., 2011; Augustin et al., 2016; Strocchi et al., 2020). To reduce the downstream computational burden, a standard practice is to thin the initial sample by discarding every $t$-th sample point (Owen, 2017). Unfortunately, standard thinning often results in a substantial loss of accuracy: for example, thinning an i.i.d. or fast-mixing Markov chain sample from $n$ points to $\sqrt{n}$ points increases integration error from $\Theta(n^{-1/2})$ to $\Theta(n^{-1/4})$.

The recent kernel thinning (KT) algorithm of Dwivedi & Mackey (2021) addresses this issue by producing thinned coresets with better-than-i.i.d. integration error in a reproducing kernel Hilbert space (RKHS, Berlinet & Thomas-Agnan, 2011). Given a target kernel¹ $k$ and a suitable sequence of input points $\mathcal{S}_{\mathrm{in}} = (x_i)_{i=1}^n$ approximating $\mathbb{P}$, KT returns a subsequence $\mathcal{S}_{\mathrm{out}}$ of $\sqrt{n}$ points with better-than-i.i.d.
maximum mean discrepancy (MMD, Gretton et al., 2012),²

$$\mathrm{MMD}_k(\mathbb{P}, \mathbb{P}_{\mathrm{out}}) \triangleq \sup_{\|f\|_k \le 1} |\mathbb{P} f - \mathbb{P}_{\mathrm{out}} f| \quad \text{for} \quad \mathbb{P}_{\mathrm{out}} \triangleq \tfrac{1}{\sqrt{n}} \sum_{x \in \mathcal{S}_{\mathrm{out}}} \delta_x, \qquad (1)$$

¹ A kernel $k$ is any function that yields positive semi-definite matrices $(k(z_i, z_j))_{i,j=1}^{l}$ for all inputs $(z_i)_{i=1}^{l}$.
² MMD is a metric for characteristic $k$, like those in Tab. 1, and controls integration error for all bounded continuous $f$ when $k$ determines convergence, like each $k$ in Tab. 1 except SINC (Simon-Gabriel et al., 2020).

where $\|\cdot\|_k$ denotes the norm for the RKHS $\mathcal{H}$ associated with $k$. That is, the KT output admits $o(n^{-1/4})$ worst-case integration error across the unit ball of $\mathcal{H}$. KT achieves its improvement with high probability using non-uniform randomness and a less smooth square-root kernel $k_{\mathrm{rt}}$ satisfying

$$k(x, y) = \int_{\mathbb{R}^d} k_{\mathrm{rt}}(x, z)\, k_{\mathrm{rt}}(z, y)\, dz. \qquad (2)$$

When the input points are sampled i.i.d. or from a fast-mixing Markov chain on $\mathbb{R}^d$, Dwivedi & Mackey prove that the KT output has, with high probability, $O_d(n^{-1/2} \sqrt{\log n})$-$\mathrm{MMD}_k$ error for $\mathbb{P}$ and $k_{\mathrm{rt}}$ with bounded support, $O_d(n^{-1/2} (\log^{d+1} n \, \log \log n)^{1/2})$-$\mathrm{MMD}_k$ error for $\mathbb{P}$ and $k_{\mathrm{rt}}$ with light tails, and $O_d(n^{\frac{1}{2\rho} - \frac{1}{2}} \sqrt{\log n \, \log \log n})$-$\mathrm{MMD}_k$ error for $\mathbb{P}$ and $k_{\mathrm{rt}}^2$ with $\rho > 2d$ moments. Meanwhile, an i.i.d. coreset of the same size suffers $\Omega(n^{-1/4})$ $\mathrm{MMD}_k$. We refer to the original KT algorithm as ROOT KT hereafter.

Our contributions In this work, we offer four improvements over the original KT algorithm. First, we show in Sec. 2.1 that a generalization of KT that uses only the target kernel $k$ provides a tighter $O(n^{-1/2} \sqrt{\log n})$ integration error guarantee for each function $f$ in the RKHS. This TARGET KT guarantee (a) applies to any kernel $k$ on any domain (even kernels that do not admit a square-root and kernels defined on non-Euclidean spaces), (b) applies to any target distribution $\mathbb{P}$ (even heavy-tailed $\mathbb{P}$ not covered by ROOT KT guarantees), and (c) is dimension-free, eliminating the exponential dimension dependence and $(\log n)^{d/2}$ factors of prior ROOT KT guarantees. Second, we prove in Sec. 2.2 that, for analytic kernels, like Gaussian, inverse multiquadric (IMQ), and sinc, TARGET KT admits MMD guarantees comparable to or better than those of Dwivedi & Mackey (2021) without making explicit use of a square-root kernel. Third, we establish in Sec. 3 that generalized KT with a fractional $\alpha$-power kernel $k_\alpha$ yields improved MMD guarantees for kernels that do not admit a square-root, like Laplace and non-smooth Matérn. Fourth, we show in Sec. 3 that, remarkably, applying generalized KT to a sum of $k$ and $k_\alpha$ (a procedure we call kernel thinning+ (KT+)) simultaneously inherits the improved MMD of POWER KT and the dimension-free individual function guarantees of TARGET KT. In Sec. 4, we use our new tools to generate substantially compressed representations of both i.i.d. samples in dimensions $d = 2$ through $100$ and Markov chain Monte Carlo samples targeting challenging differential equation posteriors. In line with our theory, we find that TARGET KT and KT+ significantly improve both single function integration error and MMD, even for kernels without fast-decaying square-roots.

- GAUSS($\sigma$), $\sigma > 0$: $\exp(-\|z\|_2^2 / (2\sigma^2))$
- LAPLACE($\sigma$), $\sigma > 0$: $\exp(-\|z\|_2 / \sigma)$
- MATÉRN($\nu, \gamma$), $\nu > \frac{d}{2}$, $\gamma > 0$: $c_{\nu - \frac{d}{2}}\, (\gamma \|z\|_2)^{\nu - \frac{d}{2}}\, K_{\nu - \frac{d}{2}}(\gamma \|z\|_2)$
- IMQ($\nu, \gamma$), $\nu > 0$, $\gamma > 0$: $(1 + \|z\|_2^2 / \gamma^2)^{-\nu}$
- SINC($\theta$), $\theta \neq 0$: $\prod_{j=1}^d \frac{\sin(\theta z_j)}{\theta z_j}$
- B-SPLINE($2\beta + 1, \gamma$), $\beta \in \mathbb{N}$: $B_{2\beta+2}^{-d} \prod_{j=1}^d h_\beta(\gamma z_j)$

Table 1: Common kernels $k(x, y)$ on $\mathbb{R}^d$ with $z = x - y$. In each case, $\|k\|_\infty = 1$. Here, $c_a \triangleq \frac{2^{1-a}}{\Gamma(a)}$, $K_a$ is the modified Bessel function of the third kind of order $a$ (Wendland, 2004, Def. 5.10), $h_\beta$ is the recursive convolution of $2\beta + 2$ copies of $\mathbb{1}_{[-\frac{1}{2}, \frac{1}{2}]}$, and $B_\beta \triangleq \frac{1}{(\beta - 1)!} \sum_{j=0}^{\lfloor \beta/2 \rfloor} (-1)^j \binom{\beta}{j} (\tfrac{\beta}{2} - j)^{\beta - 1}$.
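To make the Table 1 definitions concrete, the following minimal NumPy sketch evaluates several of these kernels for a pair of points. It is an illustration only: the function names and default parameters are not part of any released package, and the Laplace parameterization follows the $\exp(-\|z\|_2/\sigma)$ form reconstructed above.

```python
import numpy as np

def gauss_kernel(x, y, sigma=1.0):
    # GAUSS(sigma): exp(-||x - y||_2^2 / (2 sigma^2))
    z = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(z, z) / (2 * sigma**2))

def laplace_kernel(x, y, sigma=1.0):
    # LAPLACE(sigma): exp(-||x - y||_2 / sigma)
    z = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.linalg.norm(z) / sigma)

def imq_kernel(x, y, nu=0.5, gamma=1.0):
    # IMQ(nu, gamma): (1 + ||x - y||_2^2 / gamma^2)^(-nu)
    z = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return (1 + np.dot(z, z) / gamma**2) ** (-nu)

def sinc_kernel(x, y, theta=1.0):
    # SINC(theta): prod_j sin(theta z_j) / (theta z_j), with the j-th factor equal to 1 when z_j = 0
    z = theta * (np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(np.prod(np.sinc(z / np.pi)))  # np.sinc(t) = sin(pi t) / (pi t)
```

Each of these kernels satisfies $k(x, x) = 1$, consistent with the normalization $\|k\|_\infty = 1$ in Table 1.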
Related work For bounded $k$, both i.i.d. samples (Tolstikhin et al., 2017, Prop. A.1) and thinned geometrically ergodic Markov chains (Dwivedi & Mackey, 2021, Prop. 1) deliver $\sqrt{n}$ points with $O(n^{-1/4})$ MMD with high probability. The online Haar strategy of Dwivedi et al. (2019) and low-discrepancy quasi-Monte Carlo methods (see, e.g., Hickernell, 1998; Novak & Wozniakowski, 2010; Dick et al., 2013) provide improved $O_d(n^{-1/2} \log^d n)$ MMD guarantees but are tailored specifically to the uniform distribution on $[0, 1]^d$. Alternative coreset constructions for more general $\mathbb{P}$ include kernel herding (Chen et al., 2010), discrepancy herding (Harvey & Samadi, 2014), super-sampling with a reservoir (Paige et al., 2016), support points convex-concave procedures (Mak & Joseph, 2018), greedy sign selection (Karnin & Liberty, 2019, Sec. 3.1), Stein point MCMC (Chen et al., 2019), and Stein thinning (Riabiz et al., 2020a). While some admit better-than-i.i.d. MMD guarantees for finite-dimensional kernels on $\mathbb{R}^d$ (Chen et al., 2010; Harvey & Samadi, 2014), none apart from KT are known to provide better-than-i.i.d. MMD or integration error for the infinite-dimensional kernels covered in this work. The lower bounds of Phillips & Tai (2020, Thm. 3.1) and Tolstikhin et al. (2017, Thm. 1) respectively establish that any procedure outputting $\sqrt{n}$-sized coresets and any procedure estimating $\mathbb{P}$ based only on $n$ i.i.d. sample points must incur $\Omega(n^{-1/2})$ MMD in the worst case. Our guarantees in Sec. 2 match these lower bounds up to logarithmic factors.

Notation We define the norm $\|k\|_\infty \triangleq \sup_{x,y} |k(x, y)|$ and the shorthand $[n] \triangleq \{1, \dots, n\}$, $\mathbb{R}_+ \triangleq \{x \in \mathbb{R} : x \ge 0\}$, $\mathbb{N}_0 \triangleq \mathbb{N} \cup \{0\}$, $\mathcal{B}_k \triangleq \{f \in \mathcal{H} : \|f\|_k \le 1\}$, and $\mathbb{B}_2(r) \triangleq \{y \in \mathbb{R}^d : \|y\|_2 \le r\}$. We write $a \lesssim b$ and $a \gtrsim b$ to mean $a = O(b)$ and $a = \Omega(b)$, use $\lesssim_d$ when masking constants dependent on $d$, and write $a = O_p(b)$ to mean $a/b$ is bounded in probability. For any distribution $\mathbb{Q}$ and point sequences $\mathcal{S}, \mathcal{S}'$ with empirical distributions $\mathbb{Q}_n, \mathbb{Q}'_n$, we define $\mathrm{MMD}_k(\mathbb{Q}, \mathcal{S}) \triangleq \mathrm{MMD}_k(\mathbb{Q}, \mathbb{Q}_n)$ and $\mathrm{MMD}_k(\mathcal{S}, \mathcal{S}') \triangleq \mathrm{MMD}_k(\mathbb{Q}_n, \mathbb{Q}'_n)$.

2 GENERALIZED KERNEL THINNING

Our generalized kernel thinning algorithm (Alg. 1) for compressing an input point sequence $\mathcal{S}_{\mathrm{in}} = (x_i)_{i=1}^n$ proceeds in two steps, KT-SPLIT and KT-SWAP, detailed in App. A. First, given a thinning parameter $m$ and an auxiliary kernel $k_{\mathrm{split}}$, KT-SPLIT divides the input sequence into $2^m$ candidate coresets of size $\lfloor n/2^m \rfloor$ using non-uniform randomness. Next, given a target kernel $k$, KT-SWAP selects a candidate coreset with smallest $\mathrm{MMD}_k$ to $\mathcal{S}_{\mathrm{in}}$ and iteratively improves that coreset by exchanging coreset points for input points whenever the swap leads to reduced $\mathrm{MMD}_k$. When $k_{\mathrm{split}}$ is a square-root kernel $k_{\mathrm{rt}}$ (2) of $k$, generalized KT recovers the original ROOT KT algorithm of Dwivedi & Mackey. In this section, we establish performance guarantees for more general $k_{\mathrm{split}}$ with special emphasis on the practical choice $k_{\mathrm{split}} = k$. Like ROOT KT, for any $m$, generalized KT has time complexity dominated by $O(n^2)$ evaluations of $k_{\mathrm{split}}$ and $k$ and $O(n \min(d, n))$ space complexity from storing either $\mathcal{S}_{\mathrm{in}}$ or the kernel matrices $(k_{\mathrm{split}}(x_i, x_j))_{i,j=1}^n$ and $(k(x_i, x_j))_{i,j=1}^n$.
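As an illustration of the KT-SWAP step described above, the following minimal NumPy sketch selects the candidate coreset with smallest $\mathrm{MMD}_k$ to $\mathcal{S}_{\mathrm{in}}$ and then greedily swaps coreset points for input points whenever a swap reduces $\mathrm{MMD}_k$. It recomputes the MMD from scratch for clarity (a practical implementation, e.g., the goodpoints package, uses incremental updates), and the function names are illustrative rather than an official API.

```python
import numpy as np

def mmd_sq(K, coreset_idx):
    """Squared MMD_k between the full input S_in and the coreset indexed by coreset_idx,
    computed from the n x n kernel matrix K = (k(x_i, x_j))."""
    Kcc = K[np.ix_(coreset_idx, coreset_idx)]
    Kic = K[:, coreset_idx]
    return K.mean() - 2 * Kic.mean() + Kcc.mean()

def kt_swap(K, candidates):
    """Pick the candidate coreset (a list of integer index arrays) closest to S_in in MMD_k,
    then refine it by exchanging coreset points for input points whenever that lowers MMD_k."""
    best = min(candidates, key=lambda idx: mmd_sq(K, idx)).copy()
    n = K.shape[0]
    for i in range(len(best)):
        current = mmd_sq(K, best)
        for z in range(n):  # try replacing the i-th coreset point with input point z
            trial = best.copy()
            trial[i] = z
            err = mmd_sq(K, trial)
            if err < current:
                best, current = trial, err
    return best
```

In the paper's Alg. 1b, the candidate list additionally includes a baseline coreset (e.g., standard thinning), so the refined output is never worse than that baseline in $\mathrm{MMD}_k$.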
Algorithm 1: Generalized Kernel Thinning Return coreset of size n/2m with small MMDk Input: split kernel ksplit, target kernel k, point sequence Sin = (xi)n i=1, thinning parameter m N, probabilities (δi) n/2 i=1 (S(m,ℓ))2m ℓ=1 KT-SPLIT (ksplit, Sin, m, (δi) n/2 i=1 ) // Split Sin into 2m candidate coresets of size n SKT KT-SWAP (k, Sin, (S(m,ℓ))2m ℓ=1) // Select best coreset and iteratively refine return coreset SKT of size n/2m 2.1 SINGLE FUNCTION GUARANTEES FOR KT-SPLIT We begin by analyzing the quality of the KT-SPLIT coresets. Our first main result, proved in App. B, bounds the KT-SPLIT integration error for any fixed function in the RKHS Hsplit generated by ksplit. Theorem 1 (Single function guarantees for KT-SPLIT) Consider KT-SPLIT (Alg. 1a) with oblivious3 Sin and (δi)n/2 i=1 and δ mini δi. If n 2m N, then, for any fixed f Hsplit, index ℓ [2m], and scalar δ (0, 1), the output coreset S(m,ℓ) with P(ℓ) split 1 n/2m P x S(m,ℓ) δx satisfies |Pinf P(ℓ) splitf| f ksplit σm q δ ) for σm 2 ksplit ,in log( 6m with probability at least psg 1 δ Pm j=1 2j 1 i=1 δi. Here, ksplit ,in maxx Sin ksplit(x,x). Remark 1 (Guarantees for known and oblivious stopping times) By Dwivedi & Mackey (2021, App. D), the success probability psg is at least 1 δ if we set δ = δ 2 and δi = δ n for a stopping time n known a priori or δi = mδ 2m+2(i+1) log2(i+1) for an arbitrary oblivious stopping time n. When compressing heavily from n to n points, Thm. 1 and Rem. 1 guarantee O(n 1 2 log n) integration error with high probability for any fixed function f Hsplit. This represents a near-quadratic 3Throughout, oblivious indicates that a sequence is generated independently of any randomness in KT. Published as a conference paper at ICLR 2022 improvement over the Ω(n 1 4 ) integration error of n i.i.d. points. Moreover, this guarantee applies to any kernel defined on any space including unbounded kernels on unbounded domains (e.g., energy distance (Sejdinovic et al., 2013) and Stein kernels (Oates et al., 2017; Chwialkowski et al., 2016; Liu et al., 2016; Gorham & Mackey, 2017)); kernels with slowly decaying square roots (e.g., sinc kernels); and non-smooth kernels without square roots (e.g., Laplace, Mat ern with γ ( d 2, d]), and the compactly supported kernels of Wendland (2004) with s < 1 2(d+1)). In contrast, the MMD guarantees of Dwivedi & Mackey covered only bounded, smooth k on Rd with bounded, Lipschitz, and rapidly-decaying square-roots. In addition, for k = 1 on Rd, the MMD bounds of Dwivedi & Mackey feature exponential dimension dependence of the form cd or (log n)d/2 while the Thm. 1 guarantee is dimension-free and hence practically relevant even when d is large relative to n. Thm. 1 also guarantees better-than-i.i.d. integration error for any target distribution with |Pf Pinf| = o(n 1 4 ). In contrast, the MMD improvements of Dwivedi & Mackey (2021, cf. Tab. 2) applied only to P with at least 2d moments. Finally, when KT-SPLIT is applied with a squareroot kernel ksplit = krt, Thm. 1 still yields integration error bounds for f H, as H Hsplit. However, relative to target KT-SPLIT guarantees with ksplit = k, the error bounds are inflated by a multiplicative factor of q k ,in f krt f k . In App. H, we show that this inflation factor is at least 1 for each kernel explicitly analyzed in Dwivedi & Mackey (2021) and grows exponentially in dimension for Gaussian and Mat ern kernels, unlike the dimension-free target KT-SPLIT bounds. Finally, if we run KT-SPLIT with the perturbed kernel k split defined in Cor. 
1, then we simultaneously obtain O(n 1 2 log n) integration error for f Hsplit, near-i.i.d. O(n 1 4 log n) integration error for arbitrary bounded f outside of Hsplit, and intermediate, better-than-i.i.d. o(n 1 4 ) integration error for smoother f outside of Hsplit (by interpolation). We prove this guarantee in App. C. Corollary 1 (Guarantees for functions outside of Hsplit) Consider extending each input point xi with the standard basis vector ei Rn and running KT-SPLIT (Alg. 1a) on S in = (xi, ei)n i=1 with k split((x, w), (y, v)) = ksplit(x,y) ksplit + w, v for w, v, Rn. Under the notation and assumptions of Thm. 1, for any fixed index ℓ [2m], scalar δ (0, 1), and f defined on Sin, we have, with probability at least psg, |Pinf P(ℓ) splitf| min(p n 2m f ,in, p ksplit f ksplit) 2m δ ) log( 8m 2.2 MMD GUARANTEE FOR TARGET KT Our second main result bounds the MMDk (1) the worst-case integration error across the unit ball of H for generalized KT applied to the target kernel, i.e., ksplit = k. The proof of this result in App. D is based on Thm. 1 and an appropriate covering number for the unit ball Bk of the k RKHS. Definition 1 (k covering number) For a set A and scalar ε > 0, we define the k covering number Nk(A, ε) with Mk(A, ε) log Nk(A, ε) as the minimum cardinality of a set C Bk satisfying h C{g Bk : supx A |h(x) g(x)| ε}. (4) Theorem 2 (MMD guarantee for TARGET KT) Consider generalized KT (Alg. 1) with ksplit = k, oblivious Sin and (δi) n/2 i=1 , and δ mini δi. If n 2m N, then for any δ (0, 1), the output coreset SKT is of size n 2m and satisfies MMDk(Sin, SKT) inf ε (0,1), Sin A 2ε + 2m 8 3 k ,in log( 6m 2mδ ) log( 4 δ ) + Mk(A, ε) (5) with probability at least psg, where k ,in and psg were defined in Thm. 1. When compressing heavily from n to n points, Thm. 2 and Rem. 1 with ε = q A = B2(Rin) for Rin maxx Sin x 2 guarantee MMDk(Sin, SKT) δ k ,in log n n Mk(B2(Rin), q Published as a conference paper at ICLR 2022 with high probability. Thus we immediately obtain an MMD guarantee for any kernel k with a covering number bound. Furthermore, we readily obtain a comparable guarantee for P since MMDk(P, SKT) MMDk(P, Sin)+MMDk(Sin, SKT). Any of a variety of existing algorithms can be used to generate an input point sequence Sin with MMDk(P, Sin) no larger than the compression bound (6), including i.i.d. sampling (Tolstikhin et al., 2017, Thm. A.1), geometric MCMC (Dwivedi & Mackey, 2021, Prop. 1), kernel herding (Lacoste-Julien et al., 2015, Thm. G.1), Stein points (Chen et al., 2018, Thm. 2), Stein point MCMC (Chen et al., 2019, Thm. 1), greedy sign selection (Karnin & Liberty, 2019, Sec. 3.1), and Stein thinning (Riabiz et al., 2020a, Thm. 1). 2.3 CONSEQUENCES OF THM. 2 Tab. 2 summarizes the MMD guarantees of Thm. 2 under common growth conditions on the log covering number Mk and the input point radius RSin maxx Sin x 2. In Props. 2 and 3 of App. J, we show that analytic kernels, like Gaussian, inverse multiquadric (IMQ), and sinc, have LOGGROWTH Mk (i.e., Mk(B2(r), ε) d rd logω( 1 ε)) while finitely differentiable kernels (like Mat ern and B-spline) have POLYGROWTH Mk (i.e., Mk(B2(r), ε) d rdε ω). Our conditions on RSin arise from four forms of target distribution tail decay: (1) COMPACT (RSin d 1), (2) SUBGAUSS (RSin d log n), (3) SUBEXP (RSin d log n), and (4) HEAVYTAIL (RSin d n1/ρ). The first setting arises with a compactly supported P (e.g., on the unit cube [0, 1]d), and the other three settings arise in expectation and with high probability when Sin is generated i.i.d. 
from P with sub-Gaussian tails, sub-exponential tails, or ρ moments respectively. Substituting these conditions into (6) yields the eight entries of Tab. 2. We find that, for LOGGROWTH Mk, TARGET KT MMD is within log factors of the Ω(n 1/2) lower bounds of Sec. 1 for light-tailed P and is o(n 1/4) (i.e., better than i.i.d.) for any distribution with ρ > 4d moments. Meanwhile, for POLYGROWTH Mk, TARGET KT MMD is o(n 1/4) whenever ω < 1 2 for lighttailed P or whenever P has ρ > 2d/( 1 2 ω) moments. Rin d 1 COMPACT P Rin d log n SUBGAUSS P Rin d log n Rin d n1/ρ HEAVYTAIL P Mk(B2(r), ε) d rd logω( 1 ε) LOGGROWTH Mk (log n)d+ω+1 (log n)2d+ω+1 Mk(B2(r), ε) d rdε ω POLYGROWTH Mk q log n n1 ω q (log n)2d+1 log n n1 ω 2d/ρ Table 2: MMD guarantees for TARGET KT under Mk (4) growth and P tail decay. We report the MMDk(Sin, SKT) bound (6) for target KT with n input points and n output points, up to constants depending on d and k ,in. Here Rin maxx Sin x 2. Next, for each of the popular convergence-determining kernels of Tab. 1, we compare the ROOT KT MMD guarantees of Dwivedi & Mackey (2021) with the TARGET KT guarantees of Thm. 2 combined with covering number bounds derived in Apps. J and K. We see in Tab. 3 that Thm. 2 provides better-than-i.i.d. and better-than-ROOT KT guarantees for kernels with slowly decaying or non-existent square-roots (e.g., IMQ with ν < d 2, sinc, and B-spline) and nearly matches known ROOT KT guarantees for analytic kernels like Gauss and IMQ with ν d 2, even though TARGET KT makes no explicit use of a square-root kernel. See App. K for the proofs related to Tab. 3. 3 KERNEL THINNING+ We next introduce and analyze two new generalized KT variants: (i) POWER KT which leverages a power kernel kα that interpolates between k and krt to improve upon the MMD guarantees of target KT even when krt is not available and (ii) KT+ which uses a sum of k and kα to retain both the improved MMD guarantee of kα and the superior single function guarantees of k. Power kernel thinning First, we generalize the square-root kernel (2) definition for shift-invariant k using the order 0 generalized Fourier transform (GFT, Wendland, 2004, Def. 8.9) bf of f : Rd R. Published as a conference paper at ICLR 2022 Kernel k TARGET KT ROOT KT KT+ GAUSS(σ) (log n) 3d (log n) d 4 + 1 (log n) d 4 + 1 LAPLACE(σ) n 1 4 N/A ( cn(log n)1+2d(1 α) MAT ERN(ν, γ): ν ( d 4 N/A ( cn(log n)1+2d(1 α) MAT ERN(ν, γ): ν > d min(n 1 4 , (log n) d+1 2 n(ν d)/(2ν d) ) (log n) d+1 (log n) d+1 IMQ(ν, γ): ν < d 4 (log n)d+1 IMQ(ν, γ): ν d (log n) d+1 (log n) d+1 SINC(θ) (log n)2 B-SPLINE(2β + 1, γ): β 2N log n n2β/(2β+1) N/A B-SPLINE(2β + 1, γ): β 2N0 + 1 log n n2β/(2β+1) Table 3: MMDk(Sin, SKT) guarantees for commonly used kernels. For n input and n output points, we report the MMD bounds of Thm. 2 for TARGET KT, of Dwivedi & Mackey (2021, Thm. 1) for ROOT KT, and of Thm. 4 for KT+ (with α = 1 2 wherever feasible). We assume a SUBGAUSS P for the GAUSS kernel, a COMPACT P for the B-SPLINE kernel, and a SUBEXP P for all other k (see Tab. 2 for a definition of each P class). Here, cn log log n, δi = δ n, δ = δ 2, and error is reported up to constants depending on (k, d, δ, α). The KT+ guarantee for LAPLACE applies with α > d d+1 and for MAT ERN with α> d 2ν . The TARGET KT guarantee for MAT ERN with ν > 3d/2 assumes ν d/2 N to simplify the presentation (see (53) for the general case). The best rate is highlighted in blue. Definition 2 (α-power kernel) Define k1 k. 
We say a kernel k 1 2-power kernel for k if k(x, y) = (2π) d/2 R 2 (x, z)k 1 2 (z, y)dz. For α ( 1 2, 1), a kernel kα(x, y)=κα(x y) on Rd is an α-power kernel for k(x, y)=κ(x y) if c κα = bκα. By design, k 1 2 matches krt (2) up to an immaterial constant rescaling. Given a power kernel kα we define POWER KT as generalized KT with ksplit = kα. Our next result (with proof in App. E) provides an MMD guarantee for POWER KT. Theorem 3 (MMD guarantee for POWER KT) Consider generalized KT (Alg. 1) with ksplit = kα for some α [ 1 2, 1], oblivious sequences Sin and (δi) n/2 i=1 , and δ mini δi. If n 2m N, then for any δ (0, 1), the output coreset SKT is of size n 2m and satisfies MMDk(Sin, SKT) 2m 2α (2 f Mα)1 1 d 2max f Mα 1 with probability at least psg (defined in Thm. 1). The parameters f Mα and Rmax are defined in App. E and satisfy f Mα = Od( log n) and Rmax = Od(1) for compactly supported P and kα and f Mα = Od( log n log log n) and Rmax = Od(log n) for subexponential P and kα, when δ = δ Thm. 3 reproduces the ROOT KT guarantee of Dwivedi & Mackey (2021, Thm. 1) when α = 1 2 and more generally accommodates any power kernel via an MMD interpolation result (Prop. 1) that may be of independent interest. This generalization is especially valuable for less-smooth kernels like LAPLACE and MAT ERN(ν, γ) with ν ( d 2, d] that have no square-root kernel. Our TARGET KT MMD guarantees are no better than i.i.d. for these kernels, but, as shown in App. K, these kernels have MAT ERN kernels as α-power kernels, which yield o(n 1 4 ) MMD in conjunction with Thm. 3. Kernel thinning+ Our final KT variant, kernel thinning+, runs KT-SPLIT with a scaled sum of the target and power kernels, k k/ k + kα/ kα .4 Remarkably, this choice simultaneously provides the improved MMD guarantees of Thm. 3 and the dimension-free single function guarantees of Thm. 1 (see App. F for the proof). 4When Sin is known in advance, one can alternatively choose k k/ k ,in + kα/ kα ,in. Published as a conference paper at ICLR 2022 Root KT (Gauss) Target KT (Gauss) KT+ (Laplace) KT+ (Bspline) 23 25 27 Coreset size n MMDk( , out) iid: n 0.24 Root KT: n 0.54 Target KT: n 0.54 23 25 27 Coreset size n iid: n 0.24 KT+: n 0.49 23 25 27 Coreset size n iid: n 0.24 KT+: n 0.35 23 25 27 Coreset size n iid: n 0.24 KT+: n 0.51 Figure 1: Generalized kernel thinning (KT) vs i.i.d. sampling for an 8-component mixture of Gaussians target P. For kernels k without fast-decaying square-roots, KT+ offers visible and quantifiable improvements over i.i.d. sampling. For Gaussian k, TARGET KT closely mimics ROOT KT. Theorem 4 (Single function & MMD guarantees for KT+) Consider generalized KT (Alg. 1) with ksplit = k , oblivious Sin and (δi) n/2 i=1 , δ mini δi, and n 2m N. For any fixed function f H, index ℓ [2m], and scalar δ (0, 1), the KT-SPLIT coreset S(m,ℓ) satisfies |Pinf P(ℓ) splitf| 2m 2mδ ) log( 2 with probability at least psg (for psg and P(ℓ) split defined in Thm. 1). Moreover, MMDk(Sin, SKT) min 2 Mtarget KT(k), 2 1 2α Mpower KT(kα) (9) with probability at least psg, where Mtarget KT(k) denotes the right hand side of (5) with k ,in replaced by k , and Mpower KT(kα) denotes the right hand side of (7). As shown in Tab. 3, KT+ provides better-than-i.i.d. MMD guarantees for every kernel in Tab. 1 even the Laplace, non-smooth Mat ern, and odd B-spline kernels neglected by prior analyses while matching or improving upon the guarantees of TARGET KT and ROOT KT in each case. 
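As a small illustration of the KT+ construction, the sketch below forms the split kernel $k^\dagger \triangleq k/\|k\|_\infty + k_\alpha/\|k_\alpha\|_\infty$ from a target kernel and a chosen power kernel supplied as callables. The helper name is hypothetical (not part of any released code), and the sup-norms default to 1, as for the Table 1 kernels.

```python
def kt_plus_split_kernel(k, k_alpha, k_sup=1.0, k_alpha_sup=1.0):
    """Return the KT+ split kernel k_dagger(x, y) = k(x, y)/||k||_inf + k_alpha(x, y)/||k_alpha||_inf.

    k, k_alpha : callables taking (x, y) and returning a scalar kernel value
    k_sup, k_alpha_sup : the sup-norms ||k||_inf and ||k_alpha||_inf (both 1 for the Table 1 kernels)
    """
    def k_dagger(x, y):
        return k(x, y) / k_sup + k_alpha(x, y) / k_alpha_sup
    return k_dagger
```

One would then run generalized KT with $k_{\mathrm{split}} = k^\dagger$ and swap kernel $k$, for instance pairing a Laplace target kernel with a Matérn $\alpha$-power kernel as discussed in App. K.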
4 EXPERIMENTS

Dwivedi & Mackey (2021) illustrated the MMD benefits of ROOT KT over i.i.d. sampling and standard MCMC thinning with a series of vignettes focused on the Gaussian kernel. We revisit those vignettes with the broader range of kernels covered by generalized KT and demonstrate significant improvements in both MMD and single-function integration error. We focus on coresets of size $\sqrt{n}$ produced from $n$ inputs with $\delta_i = \frac{1}{2n}$, let $\mathbb{P}_{\mathrm{out}}$ denote the empirical distribution of each output coreset, and report mean error ($\pm 1$ standard error) over 10 independent replicates of each experiment.

Target distributions and kernel bandwidths We consider three classes of target distributions on $\mathbb{R}^d$: (i) mixture of Gaussians $\mathbb{P} = \frac{1}{M} \sum_{j=1}^M \mathcal{N}(\mu_j, I_2)$ with $M$ component means $\mu_j \in \mathbb{R}^2$ defined in App. I, (ii) Gaussian $\mathbb{P} = \mathcal{N}(0, I_d)$, and (iii) the posteriors of four distinct coupled ordinary differential equation models: the Goodwin (1965) model of oscillatory enzymatic control ($d = 4$), the Lotka (1925) model of oscillatory predator-prey evolution ($d = 4$), the Hinch et al. (2004) model of calcium signalling in cardiac cells ($d = 38$), and a tempered Hinch posterior. For settings (i) and (ii), we use an i.i.d. input sequence $\mathcal{S}_{\mathrm{in}}$ from $\mathbb{P}$ and kernel bandwidths $\sigma = 1/\gamma = 2d$. For setting (iii), we use MCMC input sequences $\mathcal{S}_{\mathrm{in}}$ from 12 posterior inference experiments of Riabiz et al. (2020a) and set the bandwidths $\sigma = 1/\gamma$ as specified by Dwivedi & Mackey (2021, Sec. K.2).

[Figure 2 panels: for $d = 2, 10, 20, 50, 100$, plots of $\mathrm{MMD}_k(\mathbb{P}, \mathbb{P}_{\mathrm{out}})$, $|(\mathbb{P} - \mathbb{P}_{\mathrm{out}}) x_1|$, $|(\mathbb{P} - \mathbb{P}_{\mathrm{out}}) k(X', \cdot)|$, and $|(\mathbb{P}_{\mathrm{in}} - \mathbb{P}_{\mathrm{out}}) f_{\mathrm{CIF}}|$ against coreset size $\sqrt{n}$ for i.i.d. sampling, ROOT KT, and TARGET KT, with fitted decay rates shown in each legend.]

Figure 2: MMD and single-function integration error for Gaussian $k$ and standard Gaussian $\mathbb{P}$ in $\mathbb{R}^d$. Without using a square-root kernel, TARGET KT matches the MMD performance of ROOT KT and improves upon i.i.d. MMD and single-function integration error, even in $d = 100$ dimensions.

Function testbed To evaluate the ability of generalized KT to improve integration both inside and outside of $\mathcal{H}$, we evaluate integration error for (a) a random element of the target kernel RKHS ($f(x) = k(X', x)$, described in App.
I), (b) moments ($f(x) = x_1$ and $f(x) = x_1^2$), and (c) a standard numerical integration benchmark test function from the continuous integrand family (CIF, Genz, 1984), $f_{\mathrm{CIF}}(x) = \exp(-\frac{1}{d} \sum_{j=1}^d |x_j - u_j|)$ for $u_j$ drawn i.i.d. and uniformly from $[0, 1]$.

Generalized KT coresets For an 8-component mixture of Gaussians target $\mathbb{P}$, the top row of Fig. 1 highlights the visual differences between i.i.d. coresets and coresets generated using generalized KT. We consider ROOT KT with GAUSS $k$, TARGET KT with GAUSS $k$, KT+ ($\alpha = 0.7$) with LAPLACE $k$, KT+ ($\alpha = \frac{1}{2}$) with IMQ $k$ ($\nu = 0.5$), and KT+ ($\alpha = \frac{2}{3}$) with B-SPLINE(5) $k$, and note that the B-SPLINE(5) and LAPLACE $k$ do not admit square-root kernels. In each case, even for small $n$, generalized KT provides a more even distribution of points across components with fewer within-component gaps and clumps. Moreover, as suggested by our theory, TARGET KT and ROOT KT coresets for GAUSS $k$ have similar quality despite TARGET KT making no explicit use of a square-root kernel. The MMD error plots in the bottom row of Fig. 1 provide a similar conclusion quantitatively, where we observe that for both variants of KT, the MMD error decays as $n^{-1/2}$, a significant improvement over the $n^{-1/4}$ rate of i.i.d. sampling. We also observe that the empirical MMD decay rates are in close agreement with the rates guaranteed by our theory in Tab. 3 ($n^{-1/2}$ for GAUSS, B-SPLINE, and IMQ and $n^{-\frac{1}{4\alpha}} = n^{-0.36}$ for LAPLACE). We provide additional visualizations and results in Figs. 4 and 5 of App. I, including MMD errors for $M = 4$ and $M = 6$ component mixture targets. The conclusions remain consistent with those drawn from Fig. 1.

TARGET KT vs. i.i.d. sampling For Gaussian $\mathbb{P}$ and Gaussian $k$, Fig. 2 quantifies the improvements in distributional approximation obtained when using TARGET KT in place of a more typical i.i.d. summary. Remarkably, TARGET KT significantly improves the rate of decay and order of magnitude of mean $\mathrm{MMD}_k(\mathbb{P}, \mathbb{P}_{\mathrm{out}})$, even in $d = 100$ dimensions with as few as 4 output points. Moreover, in line with our theory, TARGET KT MMD closely tracks that of ROOT KT without using $k_{\mathrm{rt}}$. Finally, TARGET KT delivers improved single-function integration error, both for functions in the RKHS (like $k(X', \cdot)$) and for those outside (like the first moment and the CIF benchmark function), even with large $d$ and relatively small sample sizes.

KT+ vs. standard MCMC thinning For the MCMC targets, we measure error with respect to the input distribution $\mathbb{P}_{\mathrm{in}}$ (consistent with our guarantees), as exact integration under each posterior $\mathbb{P}$ is intractable. We employ KT+ ($\alpha = 0.81$) with LAPLACE $k$ for Goodwin and Lotka-Volterra and KT+ ($\alpha = 0.5$) with IMQ $k$ ($\nu = 0.5$) for Hinch. Notably, neither kernel has a square-root with fast-decaying tails. In Fig. 3, we evaluate thinning results from one chain targeting each of the Goodwin, Lotka-Volterra, and Hinch posteriors and observe that KT+ uniformly improves upon the MMD error of standard thinning (ST), even when ST exhibits better-than-i.i.d. accuracy. Furthermore, KT+ provides significantly smaller integration error for functions inside of the RKHS (like $k(X', \cdot)$) and outside of the RKHS (like the first and second moments and the benchmark CIF function) in nearly every setting. See Fig. 6 of App. I for plots of the other 9 MCMC settings.
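The MMD curves reported in this section can be computed from kernel evaluations alone. The following minimal sketch uses the Gauss kernel of Tab. 1 with an illustrative bandwidth choice; the function names are not from any released package.

```python
import numpy as np

def gauss_gram(X, Y, sigma):
    # Gram matrix (k(x_i, y_j)) for the GAUSS(sigma) kernel of Table 1
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))

def mmd(X_in, X_out, sigma):
    # MMD_k(S_in, S_out) between the empirical distributions of the two point sets, cf. eq. (1)
    Kii = gauss_gram(X_in, X_in, sigma)
    Kio = gauss_gram(X_in, X_out, sigma)
    Koo = gauss_gram(X_out, X_out, sigma)
    return np.sqrt(max(Kii.mean() - 2 * Kio.mean() + Koo.mean(), 0.0))

# Example: standard thinning of an i.i.d. Gaussian sample down to sqrt(n) points
rng = np.random.default_rng(0)
d, n = 2, 1024
X_in = rng.standard_normal((n, d))
coreset = X_in[:: int(np.sqrt(n))]          # keep every sqrt(n)-th point
print(mmd(X_in, coreset, sigma=np.sqrt(2.0 * d)))  # illustrative bandwidth
```

Replacing the standard-thinning coreset by a generalized KT coreset and repeating this computation across sample sizes yields the MMD curves plotted in Figs. 1 to 3.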
[Figure 3 panels: for one chain targeting each of the Goodwin (ADA-RW), Lotka-Volterra (ADA-RW), and Hinch posteriors, plots of $\mathrm{MMD}_k(\mathbb{P}_{\mathrm{in}}, \mathbb{P}_{\mathrm{out}})$ and of the integration errors for the first and second moments, $k(X', \cdot)$, and $f_{\mathrm{CIF}}$ against coreset size $\sqrt{n}$ for standard thinning and KT+, with fitted decay rates shown in each legend.]

Figure 3: Kernel thinning+ (KT+) vs. standard MCMC thinning (ST). For kernels without fast-decaying square-roots, KT+ improves MMD and integration error decay rates in each posterior inference task.

5 DISCUSSION AND CONCLUSIONS

In this work, we introduced three new generalizations of the ROOT KT algorithm of Dwivedi & Mackey (2021) with broader applicability and strengthened guarantees for generating compact representations of a probability distribution. TARGET KT-SPLIT provides $\sqrt{n}$-point summaries with $O(\sqrt{\log n / n})$ integration error guarantees for any kernel, any target distribution, and any function in the RKHS; POWER KT yields improved better-than-i.i.d. MMD guarantees even when a square-root kernel is unavailable; and KT+ simultaneously inherits the guarantees of TARGET KT and POWER KT. While we have focused on unweighted coreset quality, we highlight that the same MMD guarantees extend to any improved reweighting of the coreset points. For example, for downstream tasks that support weights, one can optimally reweight $\mathbb{P}_{\mathrm{out}}$ to approximate $\mathbb{P}_{\mathrm{in}}$ in $O(n^{3/2})$ time by directly minimizing $\mathrm{MMD}_k$. Finally, one can combine generalized KT with the COMPRESS++ meta-algorithm of Shetty et al. (2022) to obtain coresets of comparable quality in near-linear time.

REPRODUCIBILITY STATEMENT

See App. I for supplementary experimental details and results and the goodpoints Python package https://github.com/microsoft/goodpoints for Python code reproducing all experiments.

ACKNOWLEDGMENTS

We thank Lucas Janson and Boaz Barak for their valuable feedback on this work. RD acknowledges the support by the National Science Foundation under Grant No. DMS-2023528 for the Foundations of Data Science Institute (FODSI).

REFERENCES

Christoph M Augustin, Aurel Neic, Manfred Liebmann, Anton J Prassl, Steven A Niederer, Gundolf Haase, and Gernot Plank. Anatomically accurate high resolution modeling of human whole heart electromechanics: A strongly scalable algebraic multigrid solver method for nonlinear deformation. Journal of Computational Physics, 305:622-646, 2016. (Cited on page 1.)

Necdet Batir. Bounds for the gamma function. Results in Mathematics, 72(1):865-874, 2017. doi: 10.1007/s00025-017-0698-0. URL https://doi.org/10.1007/s00025-017-0698-0. (Cited on page 26.)

Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011. (Cited on pages 1, 15, 17, 18, and 26.)

Wilson Ye Chen, Lester Mackey, Jackson Gorham, François-Xavier Briol, and Chris J. Oates. Stein points. In Proceedings of the 35th International Conference on Machine Learning, 2018. (Cited on page 5.)

Wilson Ye Chen, Alessandro Barp, François-Xavier Briol, Jackson Gorham, Mark Girolami, Lester Mackey, and Chris Oates.
Stein point Markov chain Monte Carlo. In International Conference on Machine Learning, pp. 1011 1021. PMLR, 2019. (Cited on pages 2 and 5.) Yutian Chen, Max Welling, and Alex Smola. Super-samples from kernel herding. In Proceedings of the Twenty Sixth Conference on Uncertainty in Artificial Intelligence, UAI 10, pp. 109 116, Arlington, Virginia, USA, 2010. AUAI Press. ISBN 9780974903965. (Cited on page 2.) Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A kernel test of goodness of fit. In International conference on machine learning, pp. 2606 2615. PMLR, 2016. (Cited on page 4.) Josef Dick, Frances Y Kuo, and Ian H Sloan. High-dimensional integration: The quasi-Monte Carlo way. Acta Numerica, 22:133 288, 2013. (Cited on page 2.) Raaz Dwivedi and Lester Mackey. Kernel thinning. ar Xiv preprint ar Xiv:2105.05842, 2021. (Cited on pages 1, 2, 3, 4, 5, 6, 7, 8, 9, 14, 15, 16, 17, 20, 21, 28, 29, and 30.) Raaz Dwivedi, Ohad N Feldheim, Ori Gurel-Gurevich, and Aaditya Ramdas. The power of online thinning in reducing discrepancy. Probability Theory and Related Fields, 174(1):103 131, 2019. (Cited on page 2.) Alan Genz. Testing multidimensional integration routines. In Proc. of international conference on Tools, methods and languages for scientific and engineering computation, pp. 81 94, 1984. (Cited on page 8.) Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123 214, 2011. (Cited on page 21.) Brian C Goodwin. Oscillatory behavior in enzymatic control process. Advances in Enzyme Regulation, 3: 318 356, 1965. (Cited on pages 8 and 21.) Jackson Gorham and Lester Mackey. Measuring sample quality with kernels. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1292 1301. JMLR. org, 2017. (Cited on page 4.) Published as a conference paper at ICLR 2022 Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Sch olkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(25):723 773, 2012. (Cited on pages 1 and 18.) Heikki Haario, Eero Saksman, and Johanna Tamminen. Adaptive proposal distribution for random walk Metropolis algorithm. Computational Statistics, 14(3):375 395, 1999. (Cited on page 21.) Nick Harvey and Samira Samadi. Near-optimal herding. In Conference on Learning Theory, pp. 1165 1182, 2014. (Cited on page 2.) Fred Hickernell. A generalized discrepancy and quadrature error bound. Mathematics of computation, 67(221): 299 322, 1998. (Cited on page 2.) Robert Hinch, JL Greenstein, AJ Tanskanen, L Xu, and RL Winslow. A simplified local control model of calcium-induced calcium release in cardiac ventricular myocytes. Biophysical journal, 87(6):3723 3736, 2004. (Cited on pages 8 and 21.) Zohar Karnin and Edo Liberty. Discrepancy, coresets, and sketches in machine learning. In Conference on Learning Theory, pp. 1975 1993. PMLR, 2019. (Cited on pages 2 and 5.) Simon Lacoste-Julien, Fredrik Lindsten, and Francis Bach. Sequential kernel herding: Frank-Wolfe optimization for particle filtering. In Artificial Intelligence and Statistics, pp. 544 552. PMLR, 2015. (Cited on page 5.) Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In Proc. of 33rd ICML, volume 48 of ICML, pp. 276 284, 2016. (Cited on page 4.) Alfred James Lotka. Elements of physical biology. Williams & Wilkins, 1925. (Cited on pages 8 and 21.) 
Simon Mak and V Roshan Joseph. Support points. The Annals of Statistics, 46(6A):2562 2592, 2018. (Cited on page 2.) Whitney K. Newey and Daniel Mc Fadden. Chapter 36: Large sample estimation and hypothesis testing. In Handbook of Econometrics, volume 4, pp. 2111 2245. Elsevier, 1994. doi: https://doi.org/10.1016/ S1573-4412(05)80005-4. URL https://www.sciencedirect.com/science/article/pii/ S1573441205800054. (Cited on page 27.) Steven A Niederer, Lawrence Mitchell, Nicolas Smith, and Gernot Plank. Simulating human cardiac electrophysiology on clinical time-scales. Frontiers in Physiology, 2:14, 2011. (Cited on page 1.) E Novak and H Wozniakowski. Tractability of multivariate problems, volume ii: Standard information for functionals, european math. Soc. Publ. House, Z urich, 3, 2010. (Cited on page 2.) Chris J Oates, Mark Girolami, and Nicolas Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695 718, 2017. (Cited on page 4.) Art B Owen. Statistically efficient thinning of a Markov chain sampler. Journal of Computational and Graphical Statistics, 26(3):738 744, 2017. (Cited on page 1.) Brooks Paige, Dino Sejdinovic, and Frank Wood. Super-sampling with a reservoir. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pp. 567 576, 2016. (Cited on page 2.) Jeff M Phillips and Wai Ming Tai. Near-optimal coresets of kernel density estimates. Discrete & Computational Geometry, 63(4):867 887, 2020. (Cited on page 3.) Marina Riabiz, Wilson Chen, Jon Cockayne, Pawel Swietach, Steven A Niederer, Lester Mackey, and Chris Oates. Optimal thinning of MCMC output. ar Xiv preprint ar Xiv:2005.03952, 2020a. (Cited on pages 2, 5, 8, and 21.) Marina Riabiz, Wilson Ye Chen, Jon Cockayne, Pawel Swietach, Steven A. Niederer, Lester Mackey, and Chris J. Oates. Replication Data for: Optimal Thinning of MCMC Output, 2020b. URL https://doi. org/10.7910/DVN/MDKNWM. Accessed on Mar 23, 2021. (Cited on page 21.) Gareth O Roberts and Richard L Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341 363, 1996. (Cited on page 21.) Alessandro Rudi, Ulysse Marteau-Ferey, and Francis Bach. Finding global minima via kernel approximations. ar Xiv preprint ar Xiv:2012.11978, 2020. (Cited on pages 24 and 25.) Published as a conference paper at ICLR 2022 Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. Equivalence of distance-based and rkhs-based statistics in hypothesis testing. The Annals of Statistics, pp. 2263 2291, 2013. (Cited on page 4.) Abhishek Shetty, Raaz Dwivedi, and Lester Mackey. Distribution compression in near-linear time. In International Conference on Learning Representations, 2022. (Cited on page 9.) Carl-Johann Simon-Gabriel, Alessandro Barp, and Lester Mackey. Metrizing weak convergence with maximum mean discrepancies. ar Xiv preprint ar Xiv:2006.09268, 2020. (Cited on page 1.) Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008. (Cited on pages 22, 25, and 27.) Ingo Steinwart and Simon Fischer. A closer look at covering number bounds for Gaussian kernels. Journal of Complexity, 62:101513, 2021. (Cited on pages 25, 27, and 28.) 
Marina Strocchi, Matthias AF Gsell, Christoph M Augustin, Orod Razeghi, Caroline H Roney, Anton J Prassl, Edward J Vigmond, Jonathan M Behar, Justin S Gould, Christopher A Rinaldi, Martin J Bishop, Gernot Plank, and Steven A Niederer. Simulating ventricular systolic motion in a four-chamber heart model with spatially varying robin boundary conditions to model the effect of the pericardium. Journal of Biomechanics, 101:109645, 2020. (Cited on page 1.) Hong-Wei Sun and Ding-Xuan Zhou. Reproducing kernel Hilbert spaces associated with analytic translationinvariant Mercer kernels. Journal of Fourier Analysis and Applications, 14(1):89 101, 2008. (Cited on pages 22, 27, and 28.) Ilya Tolstikhin, Bharath K Sriperumbudur, and Krikamol Muandet. Minimax estimation of kernel mean embeddings. The Journal of Machine Learning Research, 18(1):3002 3048, 2017. (Cited on pages 2, 3, 5, and 29.) Vito Volterra. Variazioni e fluttuazioni del numero d individui in specie animali conviventi. 1926. (Cited on page 21.) Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019. (Cited on pages 14, 26, and 28.) Holger Wendland. Scattered data approximation, volume 17. Cambridge university press, 2004. (Cited on pages 2, 4, 5, 18, 19, and 27.) Haizhang Zhang and Liang Zhao. On the inclusion relation of reproducing kernel hilbert spaces. Analysis and Applications, 11(02):1350014, 2013. (Cited on pages 18, 19, and 20.) Ding-Xuan Zhou. Capacity of reproducing kernel spaces in learning theory. IEEE Transactions on Information Theory, 49(7):1743 1752, 2003. (Cited on page 28.) Published as a conference paper at ICLR 2022 A Details of KT-SPLIT and KT-SWAP 13 B Proof of Thm. 1: Single function guarantees for KT-SPLIT 14 C Proof of Cor. 1: Guarantees for functions outside of Hsplit 14 D Proof of Thm. 2: MMD guarantee for TARGET KT 16 E Proof of Thm. 3: MMD guarantee for POWER KT 16 F Proof of Thm. 4: Single function & MMD guarantees for KT+ 17 G Proof of Prop. 1: An interpolation result for MMD 18 H Sub-optimality of single function guarantees with root KT 19 I Additional experimental results 20 J Upper bounds on RKHS covering numbers 22 K Proof of Tab. 3 results 28 A DETAILS OF KT-SPLIT AND KT-SWAP Algorithm 1a: KT-SPLIT Divide points into candidate coresets of size n/2m Input: kernel ksplit, point sequence Sin = (xi)n i=1, thinning parameter m N, probabilities (δi) n 2 i=1 S(j,ℓ) {} for 0 j m and 1 ℓ 2j // Empty coresets: S(j,ℓ) has size i 2j after round i σj,ℓ 0 for 1 j m and 1 ℓ 2j 1 // Swapping parameters for i = 1, . . . , n/2 do S(0,1).append(xi); S(0,1).append(x2i) // Every 2j rounds, add one point from parent coreset S(j 1,ℓ) to each child coreset S(j,2ℓ 1), S(j,2ℓ) for (j = 1; j m and i/2j 1 N; j = j + 1) do for ℓ= 1, . . . 
, 2j 1 do (S, S ) (S(j 1,ℓ), S(j,2ℓ 1)); (x, x) get last two points(S) // Compute swapping threshold a a, σj,ℓ get swap params(σj,ℓ, b, δ|S|/2 2j 1 m ) for b2=ksplit(x, x)+ksplit( x, x) 2ksplit(x, x) // Assign one point to each child after probabilistic swapping α ksplit( x, x) ksplit(x, x)+Σy S(ksplit(y, x) ksplit(y, x)) 2Σz S (ksplit(z, x) ksplit(z, x)) (x, x) ( x, x) with probability min(1, 1 S(j,2ℓ 1).append(x); S(j,2ℓ).append( x) end end end return (S(m,ℓ))2m ℓ=1, candidate coresets of size n/2m function get swap params(σ, b, δ): 2 log(2/δ), b2) σ σ+b2(1+(b2 2a)σ2/a2)+ return (a, σ) Published as a conference paper at ICLR 2022 Algorithm 1b: KT-SWAP Identify and refine the best candidate coreset Input: kernel k, point sequence Sin = (xi)n i=1, candidate coresets (S(m,ℓ))2m ℓ=1 S(m,0) baseline thinning(Sin, size = n/2m ) // Compare to baseline (e.g., standard thinning) SKT S(m,ℓ ) for ℓ argminℓ {0,1,...,2m} MMDk(Sin, S(m,ℓ)) // Select best candidate coreset // Swap out each point in SKT for best alternative in Sin for i = 1, . . . , n/2m do SKT[i] argminz Sin MMDk(Sin, SKT with SKT[i] = z) end return SKT, refined coreset of size n/2m B PROOF OF THM. 1: SINGLE FUNCTION GUARANTEES FOR KT-SPLIT The proof is identical for each index ℓ, so, without loss of generality, we prove the result for the case ℓ= 1. Define f Wm W1,m = Pinksplit P(1) out ksplit = 1 n P x Sin ksplit(x, ) 1 n/2m P x S(m,1) ksplit(x, ). Next, we use the results about an intermediate algorithm, kernel halving (Dwivedi & Mackey, 2021, Alg. 3) that was introduced for the analysis of kernel thinning. Using the arguments from Dwivedi & Mackey (2021, Sec. 5.2), we conclude that KT-SPLIT with ksplit and thinning parameter m, is equivalent to repeated kernel halving with kernel ksplit for m rounds (with no Failure in any rounds of kernel halving). On this event of equivalence, denoted by Eequi, Dwivedi & Mackey (2021, Eqns. (50, 51)) imply that the function f Wm Hsplit is equal in distribution to another random function Wm, where Wm is unconditionally sub-Gaussian with parameter ksplit log( 6m E[exp( Wm, f ksplit)] exp( 1 2σ2 m f 2 ksplit) for all f Hsplit, (10) where we note that the analysis of Dwivedi & Mackey (2021) remains unaffected when we replace ksplit by ksplit ,in in all the arguments. Applying the sub-Gaussian Hoeffding inequality (Wainwright, 2019, Prop. 2.5) along with (10), we obtain that P[ Wm, f ksplit > t] 2 exp( 1 2t2/(σ2 m f 2 ksplit)) δ for t σm f ksplit Call this event Esg. As noted above, conditional to the event Eequi, we also have Wm d= f Wm = Wm, f ksplit d= Pinf P(1) out f, where d= denotes equality in distribution. Furthermore, Dwivedi & Mackey (2021, Eqn. 48) implies that P(Eequi) 1 Pm j=1 2j 1 Putting the pieces together, we have P[|Pinf P(1) out f| t] P(Eequi Ec sg) P(Eequi) P(Esg) 1 Pm j=1 2j 1 i=1 δi δ = psg, as claimed. The proof is now complete. C PROOF OF COR. 1: GUARANTEES FOR FUNCTIONS OUTSIDE OF HSPLIT Fix any index ℓ [2m], scalar δ (0, 1), and f defined on Sin, and consider the associated vector g Rn with gi = f(xi) for each i [n]. We define two extended functions by appending the domain by Rn as follows: For any w Rn, define f1((x, w)) = f(x) and f2((x, w)) = g, w Published as a conference paper at ICLR 2022 (the Euclidean inner product). Then we note that these functions replicate the values of f on Sin, since f1((xi, w)) = f(xi) for arbitrary w Rn, and f2((xi, ei)) = g, ei = gi = f(xi), where ei denotes the i-th basis vector in Rn. 
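The empirical decay rates quoted in the figure legends are obtained exactly as described in the common settings above: an ordinary least squares fit of log mean error on log coreset size, with the slope reported as the rate. A minimal sketch with synthetic inputs purely for illustration:

```python
import numpy as np

def empirical_decay_rate(coreset_sizes, mean_errors):
    """Slope of the OLS fit of log(mean error) on log(coreset size).

    A fitted slope of -0.25 corresponds to a reported decay rate of n^{-0.25}."""
    slope, _intercept = np.polyfit(np.log(coreset_sizes), np.log(mean_errors), deg=1)
    return slope

# Illustration with synthetic errors decaying roughly as n^{-1/2}
sizes = np.array([4, 8, 16, 32, 64, 128], dtype=float)
noise = np.exp(0.05 * np.random.default_rng(1).standard_normal(sizes.size))
errors = 0.9 * sizes**-0.5 * noise
print(f"empirical decay rate: n^{empirical_decay_rate(sizes, errors):.2f}")
```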
Thus we can write Pinf P(ℓ) splitf = P inf1 P (ℓ) splitf1 = P inf2 P (ℓ) splitf2 (11) for the extended empirical distributions P in = 1 n Pn i=1 δxi,ei and P (ℓ) split, defined analogously. Notably, any function of the form f((x, w)) = g, w belongs to the RKHS of k split with f k split g 2 (12) by Berlinet & Thomas-Agnan (2011, Thm. 5). By the repeated halving interpretation of kernel thinning (Dwivedi & Mackey, 2021, Sec. 5.2), on an event E of probability at least psg + δ we may write P inf2 P (ℓ) splitf2 = Pm j=1 Wj, f2 k split = Pm j=1 Wj, f2,j k split where Wj denotes suitable random functions in the RKHS of k split, and each f2,j((x, w)) = g(j), w for g(j) Rn a sparsification of g with at most n 2j 1 non-zero entries that satisfy E[exp( Wj, f2,j k split) | Wj 1] exp( σ2 j 2 f2,j 2 k split) (12) exp( σ2 j 2 g(j) 2 2) exp( σ2 j 2 n 2j 1 f 2 ,in) for W0 0 and σ2 j = 4( 2j 1 n )2 k split ,in log( 4m since by definition k split ,in 2. Hence, by sub-Gaussian additivity (see, e.g., Dwivedi & Mackey, 2021, Lem. 8), Pinf2 P(ℓ) splitf2 is eσ2 sub-Gaussian with n f 2 ,in Pm j=1 2j log( 4m 2jδ ) (i) = 4 n f 2 ,in 2 (2m 1) log( 4m δ ) ((2m 1)(m 1) + m) log 2 n f 2 ,in 2 (2m 1) log( 4m 2 2mδ ) m log 2 n f 2 ,in log( 8m on the event E, where step (i) makes use of the following expressions: Pm j=1 2j = 2(2m 1) and Pm j=1 j2j = 2((m 1)(2m 1) + m). Moreover, when f Hsplit, we additionally have f1 in the RKHS of k split with f1 k split f ksplit p as argued in the proof of (23). The proof of Thm. 1 then implies that P inf1 P (ℓ) splitf1 is eσ1 sub Gaussian with eσ1 fa k split 2 k split ,in log( 6m n f ksplit p 8 3 log( 6m 2mδ ), (14) on the very same event E. Recalling (11) and putting the pieces together with the definitions (13) and (14), we conclude that on the event E, the random variable Pinf P(ℓ) splitf is eσ sub-Gaussian for eσ min(eσ1, eσ2) (13),(14) min p n 2m f ,in, f ksplit p The advertised high-probability bound (3) now follows from the eσ sub-Gaussianity on E exactly as in the proof of Thm. 1. Published as a conference paper at ICLR 2022 D PROOF OF THM. 2: MMD GUARANTEE FOR TARGET KT First, we note that by design, KT-SWAP ensures MMDk(Sin, SKT) MMDk(Sin, S(m,1)), where S(m,1) denotes the first coreset returned by KT-SPLIT. Thus it suffices to show that MMDk(Sin, S(m,1)) is bounded by the term stated on the right hand side of (5). Let P(1) out 1 n/2m P x S(m,1) δx. By design of KT-SPLIT, supp(P(1) out ) supp(Pin). Recall the set A is such that supp(Pin) A. Proof of (5) Let C Ck,ε(A) denote the cover of minimum cardinality satisfying (4). Fix any f Bk. By the triangle inequality and the covering property (4) of C, we have (Pin P(1) out )f infg C (Pin P(1) out )(f g) + (Pin P(1) out )(g) infg C|Pin(f g)| + P(1) out (f g) + supg C (Pin P(1) out )(g) infg C 2 supx A |f(x) g(x)| + supg C (Pin P(1) out )(g) 2ε + supg C (Pin P(1) out )(g) . (15) Applying Thm. 1, we have (Pin P(1) out )(g) 2m 8 3 k ,in log( 4 with probability at least 1 δ Pm j=1 2j 1 i=1 δi = psg δ . A standard union bound then yields that supg C (Pin P(1) out )(g) 2m n supg C g k q 8 3 k ,in log( 4 δ ) log |C| + log( 4 probability at least psg δ . Since f Bk was arbitrary, and C Bk and thus supg C g k 1, we therefore have MMDk(Sin, S(m,1)) = sup f k 1 (Pin P(1) out )f (15) 2ε+supg C (Pin P(1) out )(g) δ ) log |C| + log( 4 with probability at least psg δ as claimed. E PROOF OF THM. 
3: MMD GUARANTEE FOR POWER KT Definition of f Mα and Rmax Define the kα tail radii, R kα,n min{r : τkα(r) kα n }, where τkα(R) (supx R y 2 R k2 α(x, x y)dy) 1 2 , Rkα,n min{r :sup x y 2 r|kα(x, y)| kα and the Sin tail radii RSin maxx Sin x 2, and RSin,kα,n min RSin, n1+ 1 d Rkα,n + n 1 d kα /Lkα . (18) Furthermore, define the inflation factor Mkα(n,m, d,δ,δ ,R) 37 q d log(2 + 2 Lkα kα Rkα,n + R ) i , where Lkα denotes a Lipschitz constant satisfying |kα(x, y) kα(x, z)| Lkα y z 2 for all x, y, z Rd. With the notations in place, we can define the quantities appearing in Thm. 3: f Mα Mkα(n,m, d,δ ,δ ,RSin,kα,n) and Rmax max(RSin, R kα,n/2m). (19) The scaling of these two parameters depends on the tail behavior of kα and the growth of the radii RSin (which in turn would typically depend on the tail behavior of P). The scaling of f Mα and Rmax stated in Thm. 3 under the compactly supported or subexponential tail conditions follows directly from Dwivedi & Mackey (2021, Tab. 2, App. I). Published as a conference paper at ICLR 2022 Proof of Thm. 3 The KT-SWAP step ensures that MMDk(Sin, SαKT) MMDk(Sin, S(m,1) α ), where S(m,1) α denotes the first coreset output by KT-SPLIT with ksplit = kα. Next, we state a key interpolation result for MMDk that relates it to the MMD of its power kernels (Def. 2) (see App. G for the proof). Proposition 1 (An interpolation result for MMD) Consider a shift-invariant kernel k that admits valid α and 2α-power kernels kα and k2α respectively for some α [ 1 2, 1]. Then for any two discrete measures P and Q supported on finitely many points, we have MMDk(P, Q) (MMDkα(P, Q))2 1 α (MMDk2α(P, Q)) 1 α 1. (20) Given Prop. 1, it remains to establish suitable upper bounds on MMDs of kα and k2α. To this end, first we note that for any reproducing kernel k and any two distributions P and Q, H older s inequality implies that MMD2 k(P, Q) = (P Q)k 2 k = (P Q)(P Q)k P Q 1 (P Q)k 2 (P Q)k . Now, let Pin and P(m,1) α denote the empirical distributions of Sin and S(m,1) α . Now applying Dwivedi & Mackey (2021, Thm. 4(b)), we find that MMDkα(Sin, S(m,1) α ) q 2 (Pin P(m,1) α )kα ,in q n kα ,inf Mkα (21) with probability psg δ , where f Mkα was defined in (19). We note that while Dwivedi & Mackey (2021, Thm. 4(b)) uses kα in their bounds, we can replace it by kα ,in, and verifying that all the steps of the proof continue to be valid (noting that kα ,in is deterministic given Sin). Furthermore, Dwivedi & Mackey (2021, Thm. 4(b)) yields that MMDk2α(Sin, S(m,1) α ) 2m d 2max f Mα with probability psg δ , where we have once again replaced the term kα with kα ,in for the same reasons as stated above. We note that the two bounds (21) and (22) apply under the same high probability event as noted in Dwivedi & Mackey (2021, proof of Thm. 1, eqn. (18)). Putting together the pieces, we find that MMDk(Sin, S(m,1) α ) (20) (MMDkα(Sin, S(m,1) α )2 1 α (MMDk2α(Sin, S(m,1) α )) 1 α 1 (21,22) h 2 2m n kα ,inf Mα i1 1 d 2max f Mα 2α (2 f Mα)1 1 d 2max f Mα as claimed. The proof is now complete. F PROOF OF THM. 4: SINGLE FUNCTION & MMD GUARANTEES FOR KT+ Proof of (8) First, we note that the RKHS H of k is contained in the RKHS H of k Berlinet & Thomas-Agnan (2011, Thm. 5). Now, applying Thm. 1 with ksplit = k for any fixed function f H H and δ (0, 1), we obtain that Pinf P(ℓ) splitf f k 2 k ,in log( 6m 2mδ ) log( 2 (ii) f k 2m 3 k log( 6m 2mδ ) log( 2 Published as a conference paper at ICLR 2022 with probability at least psg. 
Here step (i) follows from the inequality k 2, and step (ii) follows from the inequality f k p k f k, which in turn follows from the standard facts that f λk (iii) = f k λ , and f λk+kα (iv) f λk for λ > 0, f H, (23) see, e.g., Zhang & Zhao (2013, Proof of Prop. 2.5) for a proof of step (iii), Berlinet & Thomas Agnan (2011, Thm. 5) for step (iv). The proof for the bound (8) is now complete. Proof of (9) Repeating the proof of Thm. 2 with the bound (16) replaced by (8) yields that MMDk(Sin, SKT+) infε,Sin A 2ε+ 2m 3 k log( 6m 2mδ ) log( 4 δ )+Mk(A, ε) 2 Mtarget KT(k) (24) with probability at least psg. Let us denote this event by E1. To establish the other bound, first we note that KT-SWAP step ensures that MMDk(Sin, SKT+) MMDk(Sin, S(m,1) KT+ ), (25) where S(m,1) KT+ denotes the first coreset output by KT-SPLIT with ksplit = k . We can now repeat the proof of Thm. 3, using the sub-Gaussian tail bound (8), and with a minor substitution, namely, kα ,in replaced by 2 kα . Putting it together with (25), we conclude that MMDk(Sin, SKT+) 2m n 2 kα ,in 1 2α (2f Mα)1 1 d 2max f Mα = 2 1 2α Mpower KT(kα), (26) with probability at least psg. Let us denote this event by E2. Note that the quantities on the right hand side of the bounds (24) and (26) are deterministic given Sin, and thus can be computed apriori. Consequently, we apply the high probability bound only for one of the two events E1 or E2 depending on which of the two quantities (deterministically) attains the minimum. Thus, the bound (9) holds with probability at least psg as claimed. G PROOF OF PROP. 1: AN INTERPOLATION RESULT FOR MMD For two arbitrary distributions P and Q, and any reproducing kernel k, Gretton et al. (2012, Lem. 4) yields that MMD2 k(P, Q) = (P Q)k 2 k. (27) Let F denote the generalized Fourier transform (GFT) operator (Wendland (2004, Def. 8.9)). Since k(x, y) = κ(x y), Wendland (2004, Thm. 10.21) yields that f 2 k = 1 (2π)d/2 R Rd (F(f)(ω))2 F(κ)(ω) dω, for f H. (28) Let bκ F(κ), and consider a discrete measure D = Pn i=1 wiδxi supported on finitely many points, and let Dk(x) P wik(x, xi) = P wiκ(x xi). Now using the linearity of the GFT operator F, we find that for any ω Rd, F(Dk)(ω) = F(Pn i=1 wiκ( xi)) = Pn i=1 wi F(κ( xi) = (Pn i=1 wie ω,xi ) bκ(ω) = b D(ω)bκ(ω) (29) where we used the time-shifting property of GFT that F(κ( xi))(ω) = e ω,xi bκ(ω) (proven for completeness in Lem. 1), and used the shorthand b D(ω) (Pn i=1 wie ω,xi ) in the last step. Published as a conference paper at ICLR 2022 Putting together (27) to (29) with D = P Q, we find that MMD2 k(P, Q) = 1 (2π)d/2 R Rd b D2(ω)bκ(ω)dω (30) = 1 (2π)d/2 R Rd b D2(ω)bκα(ω)(bκα(ω)) 1 α = 1 (2π)d/2 R Rd b D2(ω )bκα(ω )dω R Rd b D2(ω)bκα(ω) R Rd b D2(ω )bκα(ω )dω (bκα(ω)) 1 α (i) 1 (2π)d/2 R Rd b D2(ω )bκα(ω )dω R Rd b D2(ω)bκα(ω) R Rd b D2(ω )bκα(ω )dω bκα(ω)dω 1 α = 1 (2π)d/2 R Rd b D2(ω )bκα(ω )dω 2 1 Rd b D2(ω)bκ2α(ω) = 1 (2π)d/2 R Rd b D2(ω )bκα(ω )dω 2 1 α 1 (2π)d/2 R Rd b D2(ω)bκ2α(ω) (ii) = (MMD2 kα(P, Q))2 1 α (MMD2 k2α(P, Q)) 1 α 1, where step (i) makes use of Jensen s inequality and the fact that the function t 7 t 1 α α for t 0 is concave for α [ 1 2, 1], and step (ii) follows by applying (30) for kernels kα and k2α and noting that by definition F(kα) = bκα, and F(k2α) = bκ2α. Noting MMD is a non-negative quantity, and taking square-root establishes the claim (20). Lemma 1 (Shifting property of the generalized Fourier transform) If bκ denotes the generalized Fourier transform (GFT) (Wendland, 2004, Def. 
8.9) of the function κ : Rd R, then e ,xi bκ denotes the GFT of the shifted function κ( xi), for any xi Rd. Proof Note that by definition of the GFT bκ (Wendland, 2004, Def. 8.9), we have R κ(x)bγ(x)dx = R bκ(ω)γ(ω)dω, (31) for all suitable Schwartz functions γ (Wendland, 2004, Def. 5.17), where bγ denotes the Fourier transform (Wendland, 2004, Def. 5.15) of γ since GFT and FT coincide for these functions (as noted in the discussion after Wendland (2004, Def. 8.9)). Thus to prove the lemma, we need to verify that R κ(x xi)bγ(x)dx = R e ω,xi bκ(ω)γ(ω)dω, (32) for all suitable Schwartz functions γ. Starting with the right hand side of the display (32), we have R e ω,xi bκ(ω)γ(ω)dω = R bκ(ω)(e ω,xi γ(ω))dω (i) = R κ(x)bγ(x + xi)dx (ii) = R κ(z xi)bγ(z)dz, where step (i) follows from the shifting property of the FT (Wendland, 2004, Thm. 5.16(4)), and the fact that the GFT condition (31) holds for the shifted function γ( + xi) function as well since it is still a Schwartz function (recall that bγ is the FT), and step (ii) follows from a change of variable. We have thus established (32), and the proof is complete. H SUB-OPTIMALITY OF SINGLE FUNCTION GUARANTEES WITH ROOT KT Define ekrt as the scaled version of krt, i.e., ekrt krt/ krt that is bounded by 1. Then Zhang & Zhao (2013, Proof of Prop. 2.3) implies that krt f ekrt. (33) And thus we also have Hrt = e Hrt where Hrt and e Hrt respectively denote the RKHSs of krt and ekrt. Next, we note that for any two kernels k1 and k2 with corresponding RKHSs H1 and H2 with H1 H2, in the convention of Zhang & Zhao (2013, Lem. 2.2, Prop. 2.3), we have f k2 f k1 β(H1, H2) p λ(H1, H2) for f H. (34) Published as a conference paper at ICLR 2022 Consequently, we have p maxx Sin krt(x, x) f krt (33) = f ekrt λ(H, e Hrt), (35) where in the last step, we have applied the bound (34) with (k1, H1) (k, H) and (k2, H2) (ekrt, ekrt) since H Hrt = ekrt. Next, we use (35) to the kernels studied in Dwivedi & Mackey (2021) where we note that all the kernels in that work were scaled to ensure k = 1 and in fact satisfied k(x, x) = 1. Conse- quently, the multiplicative factor stated in the discussion after Thm. 1, namely, q k ,in f krt be bounded by q λ(H, e Hrt) given the arguments above. For k = Gauss(σ) kernels, Zhang & Zhao (2013, Prop. 3.5(1)) yields that λ(H, e Hrt) = 2d/2. For k = B-spline(2β + 1, γ) with β 2N + 1, Zhang & Zhao (2013, Prop. 3.5(1)) yields that λ(H, e Hrt) = 1. For k =Mat ern(ν, γ) with ν > d, some algebra along with Zhang & Zhao (2013, Prop 3.1) yields that λ(H, e Hrt) = Γ(ν)Γ((ν d)/2) Γ(ν d/2)Γ(ν/2) 1. I ADDITIONAL EXPERIMENTAL RESULTS This section provides additional experimental details and results deferred from Sec. 4. Common settings and error computation To obtain an output coreset of size n 1 2 with n input points, we (a) take every n 1 2 -th point for standard thinning (ST) and (b) run KT with m = 1 2 log2 n using an ST coreset as the base coreset in KT-SWAP. For Gaussian and Mo G target we use i.i.d. points as input, and for MCMC targets we use an ST coreset after burn-in as the input (see App. I for more details). We compute errors with respect to P whenever available in closed form and otherwise use Pin. For each input sample size n 24, 26, . . . , 214 with δi = 1 2n, we report the mean MMD or function integration error 1 standard error across 10 independent replications of the experiment (the standard errors are too small to be visible in all experiments). 
We also plot the ordinary least squares fit of log mean error against log coreset size, with the slope of the fit reported as the empirical decay rate; e.g., for an OLS fit with slope -0.25, we display a decay rate of n^{-0.25}.

Details of test functions We note the following: (a) For Gaussian targets, the error for the CIF function with i.i.d. input is measured across the sample mean over the n input points and the output points obtained by standard thinning of the input sequence, since Pf_CIF does not admit a closed form. (b) To define the function f : x -> k(X*, x), we first draw a sample X ~ P, independent of the input, and then set X* = 2X. For the MCMC targets, we instead draw X* uniformly from the held-out points not used as input for KT. For each target, the sample is drawn exactly once and then fixed across all sample sizes and repetitions.

I.1 MIXTURE OF GAUSSIANS EXPERIMENTS
Our mixture-of-Gaussians target is given by P = (1/M) ∑_{j=1}^{M} N(µ_j, I_d) for M ∈ {4, 6, 8}, where µ1 = [-3, -3]^T, µ2 = [-3, 3]^T, µ3 = [3, -3]^T, µ4 = [3, 3]^T, µ5 = [0, 6]^T, µ6 = [-6, 0]^T, µ7 = [6, 0]^T, and µ8 = [0, -6]^T. Two independent replicates of Fig. 1 can be found in Fig. 4. Finally, we display mean MMD (±1 standard error across ten independent experiment replicates) as a function of coreset size in Fig. 5 for the M = 4 and M = 6 component MoG targets. The conclusions from Fig. 5 are identical to those from the bottom row of Fig. 1: TARGET KT and ROOT KT provide similar MMD errors with GAUSS k, and all variants of KT provide a significant improvement over i.i.d. sampling, both in magnitude and in decay rate with input size. Moreover, the observed decay rates for KT+ closely match the rates guaranteed by our theory in Tab. 3.

[Figure 4 panels: coresets returned by ROOT KT (GAUSS), TARGET KT (GAUSS), KT+ (LAPLACE), and KT+ (B-SPLINE), shown for two independent replicates.]
Figure 4: Generalized kernel thinning (KT) and i.i.d. coresets for various kernels k (in parentheses) and an 8-component mixture-of-Gaussians target P with equidensity contours underlaid. These plots are independent replicates of Fig. 1. See Sec. 4 for more details.

I.2 MCMC EXPERIMENTS
Our set-up for the MCMC experiments closely follows that of Dwivedi & Mackey (2021). For complete details on the targets and sampling algorithms we refer the reader to Riabiz et al. (2020a, Sec. 4).

Goodwin and Lotka-Volterra experiments From Riabiz et al. (2020b), we use the output of four distinct MCMC procedures targeting each of two d = 4-dimensional posterior distributions P: (1) a posterior over the parameters of the Goodwin model of oscillatory enzymatic control (Goodwin, 1965) and (2) a posterior over the parameters of the Lotka-Volterra model of oscillatory predator-prey evolution (Lotka, 1925; Volterra, 1926). For each of these targets, Riabiz et al. (2020b) provide 2 × 10^6 sample points from each of the following four MCMC algorithms: Gaussian random walk (RW), adaptive Gaussian random walk (adaRW, Haario et al., 1999), Metropolis-adjusted Langevin algorithm (MALA, Roberts & Tweedie, 1996), and preconditioned MALA (pMALA, Girolami & Calderhead, 2011).

Hinch experiments Riabiz et al. (2020b) also provide the output of two independent Gaussian random walk MCMC chains targeting each of two d = 38-dimensional posterior distributions P: (1) a posterior over the parameters of the Hinch model of calcium signalling in cardiac cells (Hinch et al., 2004) and (2) a tempered version of the same posterior, as defined by Riabiz et al. (2020a, App. S5.4).
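To make the common experimental protocol above concrete, the following self-contained Python sketch draws i.i.d. points from a small mixture-of-Gaussians target, forms a coreset of size sqrt(n) by standard thinning (keeping the final point, mirroring the convention described in the next paragraph), and reports the empirical decay rate as the slope of an OLS fit of log mean MMD against log coreset size. It is a minimal illustration of the bookkeeping only: no KT step is implemented, the error is measured against the empirical input distribution Pin rather than P, and all function names and parameter values (mog_sample, standard_thin, the bandwidth sigma = 1, the grid of sample sizes) are ours rather than part of the released experiments.

import numpy as np

def mog_sample(rng, n, mus):
    # Draw n i.i.d. points from the equal-weight mixture P = (1/M) sum_j N(mu_j, I_d).
    idx = rng.integers(len(mus), size=n)
    return mus[idx] + rng.standard_normal((n, mus.shape[1]))

def gauss_kernel(X, Y, sigma=1.0):
    # GAUSS(sigma) kernel of Tab. 1: k(x, y) = exp(-||x - y||_2^2 / (2 sigma^2)).
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def mmd(X, Y, sigma=1.0):
    # MMD_k between the empirical distributions of the rows of X and Y.
    val = (gauss_kernel(X, X, sigma).mean() + gauss_kernel(Y, Y, sigma).mean()
           - 2.0 * gauss_kernel(X, Y, sigma).mean())
    return np.sqrt(max(val, 0.0))

def standard_thin(X, out_size):
    # Keep every t-th point, retaining the final sample point.
    n = len(X)
    return X[np.sort(n - 1 - (n // out_size) * np.arange(out_size))]

rng = np.random.default_rng(0)
mus = np.array([[-3., -3.], [-3., 3.], [3., -3.], [3., 3.]])  # 4-component toy target
sizes, errors = [], []
for n in [2 ** p for p in range(4, 11, 2)]:          # input sizes n = 2^4, ..., 2^10
    out_size = int(np.sqrt(n))
    reps = []
    for _ in range(10):                              # 10 independent replications
        X_in = mog_sample(rng, n, mus)
        reps.append(mmd(X_in, standard_thin(X_in, out_size)))
    sizes.append(out_size)
    errors.append(np.mean(reps))

# Empirical decay rate: slope of the OLS fit of log mean error vs. log coreset size.
slope = np.polyfit(np.log(sizes), np.log(errors), 1)[0]
print(f"standard thinning: empirical decay rate n^{slope:.2f}")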
Burn-in and standard thinning We discard the initial burn-in points of each chain using the maximum burn-in period reported in Riabiz et al. (2020a, Tabs. S4 & S6, App. S5.4). Furthermore, we normalize each Hinch chain by subtracting the post-burn-in sample mean and dividing each coordinate by its post-burn-in sample standard deviation. To obtain an input sequence Sin of length n to be fed into a thinning algorithm, we downsample the remaining even-indexed points using standard thinning (odd-indexed points are held out). When applying standard thinning to any Markov chain output, we adopt the convention of keeping the final sample point. The selected burn-in periods for the Goodwin task were 820,000 for RW; 824,000 for adaRW; 1,615,000 for MALA; and 1,475,000 for pMALA. The respective numbers for the Lotka-Volterra task were 1,512,000 for RW; 1,797,000 for adaRW; 1,573,000 for MALA; and 1,251,000 for pMALA.

[Figure 5 panels: MMDk(P, Pout) vs. coreset size for the M = 4 (top) and M = 6 (bottom) MoG targets under GAUSS, IMQ, LAPLACE, and B-SPLINE k; the observed decay rates are roughly n^{-0.25} to n^{-0.27} for i.i.d. coresets, n^{-0.49} to n^{-0.54} for ROOT KT, TARGET KT, and KT+ with IMQ or B-SPLINE k, and n^{-0.35} for KT+ with LAPLACE k.]
Figure 5: Kernel thinning versus i.i.d. sampling. For the mixture-of-Gaussians targets P with M ∈ {4, 6} components and the kernel choices of Sec. 4, TARGET KT with GAUSS k provides MMDk(P, Pout) error comparable to ROOT KT, and both provide an n^{-1/2} decay rate, improving significantly over the n^{-1/4} decay rate of i.i.d. sampling. For the other kernels, KT+ provides a decay rate close to n^{-1/2} for IMQ and B-SPLINE k and n^{-0.35} for LAPLACE k, in excellent agreement with the MMD guarantees provided by our theory. See Sec. 4 for further discussion.

Additional remarks on Fig. 3 When a Markov chain is fast mixing (as in the Goodwin and Lotka-Volterra examples), we expect standard thinning to have Ω(n^{-1/4}) error. However, when the chain is slow mixing, standard thinning can enjoy a faster rate of decay due to a certain degeneracy of the chain that leads it to lie close to a one-dimensional curve. In the Hinch figures, we observe these better-than-i.i.d. rates of decay for standard thinning, but, remarkably, KT+ still offers improvements in both MMD and integration error. Moreover, in this setting, every additional point discarded via improved compression translates into thousands of CPU hours saved in downstream heart-model simulations.

[Figure 6 panels: MMDk(Pin, Pout) and single-function errors |(Pin - Pout)f| for f ∈ {x1, x2, k(X*, ·), f_CIF} vs. coreset size for KT+ on the Goodwin, Lotka-Volterra, and Hinch posteriors (including MALA, pMALA, and tempered Hinch chains); the observed KT+ decay rates range from roughly n^{-0.22} to n^{-0.85} across tasks and metrics.]
Figure 6: Kernel thinning+ (KT+) vs. standard MCMC thinning (ST). For kernels without fast-decaying square-roots, KT+ improves MMD and integration error decay rates in each posterior inference task.

J UPPER BOUNDS ON RKHS COVERING NUMBERS
In this section, we state several covering number bounds for RKHSs associated with both generic and specific kernels. We then use these bounds with Thm. 2 (or Tab. 2) to establish MMD guarantees for the output of generalized kernel thinning, as summarized in Tab. 3. We first state covering number bounds for the RKHSs of generic kernels that are either (a) analytic or (b) finitely many times differentiable. These results follow essentially from Sun & Zhou (2008) and Steinwart & Christmann (2008), but we provide a proof in App. J.2 for completeness.
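Before stating these bounds, it may help to recall the elementary Euclidean picture underlying them: the quantity N2(A, ·) appearing in the analytic-kernel bound below, and the sup-norm covering number N_inf(B2(r), 1) controlled in Lem. 5, simply count how many small balls are needed to cover a set. The following Python sketch (ours, for intuition only; it covers a finite sample of the ball rather than the ball itself, so its output is only a rough proxy for the covering number) builds such a cover greedily.

import numpy as np

def greedy_sup_norm_cover(points, eps):
    # Greedily build an eps-cover of a finite point cloud in the sup norm: take the
    # first uncovered point as a new center, mark every point within L_inf distance
    # eps of it as covered, and repeat until no point remains uncovered.
    centers = []
    uncovered = np.ones(len(points), dtype=bool)
    while uncovered.any():
        c = points[int(np.argmax(uncovered))]        # first uncovered point
        centers.append(c)
        uncovered &= np.abs(points - c).max(axis=1) > eps
    return np.array(centers)

# Rough numerical proxy for N_inf(B_2(r), 1), the quantity bounded in Lem. 5: cover a
# dense uniform sample of the Euclidean ball B_2(r) in R^d by unit sup-norm balls and
# count how many centers the greedy procedure uses.
rng = np.random.default_rng(0)
d, r, m = 2, 4.0, 20000
directions = rng.standard_normal((m, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
radii = r * rng.random((m, 1)) ** (1.0 / d)          # uniform radii within B_2(r)
sample = directions * radii
print(len(greedy_sup_norm_cover(sample, 1.0)), "unit sup-norm balls cover the sample")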
Proposition 2 (Covering numbers for analytic and differentiable kernels) The following results hold true.

(a) Analytic kernels: Suppose that k(x, y) = κ(‖x - y‖_2^2) for κ : R_+ -> R real-analytic with convergence radius Rκ, that is,
(1/j!) κ_+^{(j)}(0) ≤ Cκ (2/Rκ)^j for all j ∈ N_0 (36)
for some constant Cκ, where κ_+^{(j)} denotes the right-sided j-th derivative of κ. Then for any set A ⊆ R^d and any ε ∈ (0, 1/2), we have
Mk(A, ε) ≤ N2(A, r*/2) · (4 log(1/ε) + 2 + 4 log(16 Cκ + 1))^{d+1}, (37)
where r* ≜ Rκ D_A / (Rκ + D_A^2) and D_A ≜ max_{x,y ∈ A} ‖x - y‖_2. (38)

(b) Differentiable kernels: Suppose that, for X ⊆ R^d, the kernel k : X × X -> R is s-times continuously differentiable, i.e., all partial derivatives ∂^{α,α} k : X × X -> R exist and are continuous for all multi-indices α ∈ N_0^d with |α| ≤ s. Then, for any closed Euclidean ball B̄_2(r) contained in X and any ε > 0, we have
Mk(B̄_2(r), ε) ≤ c_{s,d,k} · r^d · (1/ε)^{d/s}, (39)
for some constant c_{s,d,k} that depends only on s, d, and k.

Next, we state explicit covering number bounds for several popular kernels; see App. J.3 for the proof.

Proposition 3 (Covering numbers for specific kernels) The following statements hold true.
(a) When k = GAUSS(σ), we have Mk(B2(r), ε) CGauss,d log(4/ε) log log(4/ε) d log(1/ε) ( 1 when r 1 2rσ)d otherwise, (40) where CGAUSS,d 4e+d d e d 4.3679 for d = 1 0.05 d4ee d for d 2 30 for all d 1. (41) (b) When k = MAT ERN(ν, γ), ν d 2 + 1, then for some constant CMAT ERN,ν,γ,d, we have Mk(B2(r), ε) CMAT ERN,ν,γ,d rd (1/ε)d/ ν d (c) When k = IMQ(ν, γ), we have Mk(B2(r), ε) (1 + 4r er )d (4 log(1/ε) + 2 + CIMQ,ν,γ)d+1, (43) where CIMQ,ν,γ 4 log 16 (2ν+1)ν+1 γ2ν +1 , and er min γ 2d, p γ2 + 4r2 2r . (44) (d) When k = SINC(θ), then for ε (0, 1 2), we have Mk([ r, r]d, ε) d (1 + 4r erθ ) (4 log(d/ε) + 2 + 4 log 17)2, (45) where erθ min 12 θ2 + 4r2 2r . (46) (e) When k = B-SPLINE(2β + 1, γ), then for some universal constant CB-SPLINE, we have 2]d, ε) d max(γ, 1) CB-SPLINE (d/ε) J.1 AUXILIARY RESULTS ABOUT RKHS AND EUCLIDEAN COVERING NUMBERS In this section, we collect several results regarding the covering numbers of Euclidean and RKHS spaces that come in handy for our proofs. These results can also be of independent interest. We start by defining the notion of restricted kernel and its unit ball (Rudi et al. (2020, Prop. 8)). For X Rd, let |X denotes the restriction operator. That is, for any function f : Rd R, we have f|X : X R such that f|A(x) = f(x) for x X. Published as a conference paper at ICLR 2022 Definition 3 (Restricted kernel and its RKHS) Consider a kernel k defined on Rd Rd with the corresponding RKHS H, any set X Rd. The restricted kernel k|X is defined as k|X : X X R such that k|X (x, y) k|X X (x, y) = k(x, y) for all x, y X, and H|X denotes its RKHS. For f H|X , the restricted RKHS norm is defined as follows: f k|X = infh H h k such that h|X = f. Furthermore, we use Bk|X {f H|X : f k|X 1} to denote the unit ball of the RKHS corresponding to this restricted kernel. In this notation, the unit ball of unrestricted kernel satisfies Bk Bk|Rd. Now, recall the RKHS covering number definition from Def. 1. In the sequel, we also use the covering number of the restricted kernel defined as follows: N k(X, ε) = Nk|X (X, ε), (48) that is N k(X, ε) denotes the minimum cardinality over all possible covers C Bk|X that satisfy h C g Bk|X :supx X |h(x) g(x)| ε . With this notation in place, we now state a result that relates the covering numbers N (48) and N Def. 1. Lemma 2 (Relation between restricted and unrestricted RKHS covering numbers) We have Nk,ε(X) N k,ε(X) Proof Rudi et al. (2020, Prop. 8(d,f)) imply that there exists a bounded linear extension operator E : H|X H with operator norm bounded by 1, which when combined with Steinwart & Christmann (2008, eqns. (A.38), (A.39)) yields the claim. Next, we state results that relate RKHS covering numbers for a change of domain for a shift-invariant kernel. We use B (x; r) y Rd : x y r to denote the r radius ball in Rd defined by the metric induced by a norm . Definition 4 (Euclidean covering numbers) Given a set X Rd, a norm , and a scalar ε > 0, we use N (X, ε) to denote the ε-covering number of X with respect to -norm. That is, N (X, ε) denotes the minimum cardinality over all possible covers C X that satisfy X z CB (z; ε). When = q for some q [1, ], we use the shorthand Nq N q. Lemma 3 (Relation between RKHS covering numbers on different domains) Given a shiftinvariant kernel k, a norm on Rd, and any set X Rd, we have N k(X, ε) h N k(B , ε) i N (X,1) . Proof Let C X denote the cover of minimum cardinality such that z C B (z, 1). 
We then have N k(X, ε) (i) Q z C N k(B (z, 1), ε) (ii) Q z C N k(B , ε) h N k(B , ε) i|C| , where step (i) follows by applying Steinwart & Fischer (2021, Lem. 3.11),5 and step (ii) follows by applying Steinwart & Fischer (2021, Lem. 3.10). The claim follows by noting that C denotes a cover of minimum cardinality, and hence by definition |C| = N (X, 1). 5Steinwart & Fischer (2021, Lem. 3.11) is stated for disjoint partition of X in two sets, but the argument can be repeated for any finite cover of X. Published as a conference paper at ICLR 2022 Lemma 4 (Covering number for product kernel) Given X R and a reproducing kernel κ : X X R, consider the product kernel k κ d : X 2d R defined as k(x, y) = Qd i=1 κ(xi, yi) for x, y X d X X . . . X | {z } d times Rd. Then the covering numbers of the two kernels are related as follows: N k(X d, ε) h N κ(X, ε/(d κ 2 )) id . (49) Proof Let H denote the RKHS corresponding to κ. Then the RKHS corresponding to the kernel k is given by the tensor product Hk H H . . . H Berlinet & Thomas-Agnan (2011, Sec. 4.6), i.e., for any f Hk, there exists f1, f2, . . . , fd H such that f(x) = Qd i=1 fi(xi) for all x X d. (50) Let Cκ(X, ε) Bκ denote an ε-cover of Bκ in L -norm (Def. 1). Then for each fi H, we have efi Cκ(X, ε) such that supz X fi(z) efi(z) ε. (51) Now, we claim that for every f Bk, there exists g Ck (Cκ(X, ε)) d such that supx X d|f(x) g(x)| dε κ which immediately implies the claimed bound (49) on the covering number. We now prove the claim (52). For any fixed f Hk, let fi, efi denote the functions satisfying (50) and (51) respectively. Then, we prove our claim (52) with g = Qd i=1 efi Ck. Using the convention Q0 k=1 efk(xk) = 1, we find that |f(x) g(x)| = Qd i=1 fi(xi) Qd i=1 efi(xi) Pd i=1 fi(xi) efi(xi) Qd j=i+1 fj(xj) Qi 1 k=1 efk(xk) (51) dε suph Bκ h d 1 dε κ where in the last step we have used the following argument: supz X h(x) = supz X h, κ(z, )κ h κ p κ for any h Bκ. The proof is now complete. Lemma 5 (Relation between Euclidean covering numbers) We have N (B2(r), 1) 1 πd h (1 + 2r 2πe id for all d 1. Proof We apply Wainwright (2019, Lem. 5.7) with B = B2(r) and B = B (1) to conclude that N (B2(r), 1) Vol(2B2(r)+B (1)) Vol(B (1)) Vol(B2(2r + 2 +1) (2r + where Vol(X) denotes the d-dimensional Euclidean volume of X Rd, and Γ(a) denotes the Gamma function. Next, we apply the following bounds on the Gamma function from Batir (2017, Thm. 2.2): Γ(b + 1) (b/e)b 2πb for any b 1, and Γ(b + 1) (b/e)b e2b for any b 1.1. Thus, we have N (B2(r), 1) πd/2 2e )d/2 (2r + πd h (1 + 2r as claimed, and we are done. Published as a conference paper at ICLR 2022 J.2 PROOF OF PROP. 2: COVERING NUMBERS FOR ANALYTIC AND DIFFERENTIABLE KERNELS First we apply Lem. 2 so that it remains to establish the stated bounds simply on log N k(X, ε). Proof of bound (37) in part (a) The bound (37) for the real-analytic kernel is a restatement of Sun & Zhou (2008, Thm. 2) in our notation (in particular, after making the following substitutions in their notation: R 1, C0 Cκ, r Rκ, X A, er r , η ε, D D2 A, n d). Proof of bound (39) for part (b): Under these assumptions, Steinwart & Christmann (2008, Thm. 6.26) states that the i-th dyadic entropy number Steinwart & Christmann (2008, Def. 6.20) of the identity inclusion mapping from H| B2(r) to L B2(r) is bounded by c s,d,k rsi s/d for some constant c s,d,k independent of ε and r. Given this bound on the entropy number, and applying Steinwart & Christmann (2008, Lem. 
6.21), we conclude that the log-covering number log N k( B2(r), ε) is bounded by ln 4 (c s,d,krs/ε)d/s = cs,d,krd (1/ε)d/s as claimed. J.3 PROOF OF PROP. 3: COVERING NUMBERS FOR SPECIFIC KERNELS First we apply Lem. 2 so that it remains to establish the stated bounds in each part on the corresponding log Nk. Proof for GAUSS kernel: Part (a) The bound (40) for the Gaussian kernel follows directly from Steinwart & Fischer (2021, Eqn. 11) along with the discussion stated just before it. Furthermore, the bound (41) for CGauss,d are established in Steinwart & Fischer (2021, Eqn. 6), and in the discussion around it. Proof for MAT ERN kernel: Part (b) We claim that MAT ERN(ν, γ) is ν d 2 -times continuously differentiable which immediately implies the bound (42) using Prop. 2(b). To prove the differentiability, we use Fourier transform of Mat ern kernels. For k = MAT ERN(ν, γ), let κ : Rd R denote the function such that noting that k(x, y) = κ(x y). Then using the Fourier transform of κ from Wendland (2004, Thm 8.15), and noting that κ is real-valued, we can write k(x, y) = ck,d R cos(ω (x y))(γ2 + ω 2 2) νdω for some constant ck,d depending only on the kernel parameter, and d (due to the normalization of the kernel, and the Fourier transform convention). Next, for any multi-index a Nd 0, we have a,a cos(ω (x y))(γ2 + ω 2 2) ν Qd j=1 ω2aj j (γ2 + ω 2 2) ν ω 2 Pd j=1 aj 2 (γ2+ ω 2 2)ν , where a,a denotes the partial derivative of order a. Moreover, we have R ω 2 Pd j=1 aj 2 (γ2+ ω 2 2)ν dω = cd R r>0 rd 1 r 2 Pd j=1 aj (γ2+r2)ν dr cd R r>0 rd 1+2 Pd j=1 aj 2ν (i) < , where step (i) holds whenever d 1 + 2 Pd j=1 aj 2ν < 1 Pd j=1 aj < ν d Then applying Newey & Mc Fadden (1994, Lemma 3.6), we conclude that for all multi-indices a such that Pd j=1 aj ν d 2 , the partial derivative a,ak exists and is given by ck,d R a,a cos(ω (x y))(γ2 + ω 2 2) νdω, and we are done. Published as a conference paper at ICLR 2022 Proof for IMQ kernel: Part (c) The bounds (43) and (44) follow from Sun & Zhou (2008, Ex. 3), and noting that N2(B2(r), er/2) is bounded by (1 + 4r er )d (Wainwright, 2019, Lem. 5.7). Proof for SINC kernel: Part (d) For k = SINC(θ), we can write k(x, y) = Qd i=1 κθ(xi yi) for κθ : R R defined as κθ(t) = sin(θt) θt (i) = sin(|θt|) |θt| , where step (i) follows from the fact that t 7 sin t/t is an even function. Thus, we can apply Lem. 4. Given the bound (49), and noting that κθ = 1, it suffices to establish the univariate version of the bound (45), namely, Mk([ r, r], ε) (1 + 4r erθ ) (4 log(1/ε) + 2 + 4 log 17)2. To do so, we claim that univariate SINC kernel is an analytic kernel that satisfies the condition (36) of Prop. 2(a) with κ(t) = SINC(θ t), Rκ = 12 θ2 , and Cκ = 1; and thus applying the bounds (37) and (38) from Prop. 2(a) with A = Bd 2(r) yields the claimed bound (45) and (46). To verify the condition (36) with the stated parameters, we note that κ(t) = SINC(θ t P j=0 1 (2j+1)! (θ t)2j+1 = P j=0 1 (2j+1)! (θ = P j=0 1 (2j+1)! θ2j tj which implies κ(j) + (0) = 1 (2j+1)! θ2jj! (2/Rκ)jj! for Rκ 2 θ2 infj 1((2j + 1)!)1/j = 12 and we are done. Proof for B-SPLINE kernel: Part (e) For k = B-SPLINE(2β + 1, γ), we can write k(x, y) = Qd i=1 κβ,γ((xi yi)) for κβ : R R defined as κβ,γ(t) = B 1 2β+2 2β+2 1[ 1 2 ](γ t), and thus we can apply Lem. 4. Given the bound (49), and noting that κβ,γ 1 (Dwivedi & Mackey (2021, Eqn. 107)), it suffices to establish the univariate version of the bound (47). 
Abusing notation and using κβ,γ to denote the univariate B-SPLINE(2β + 1, γ) kernel, we find that log N κβ,γ([ 1 2], ε) (i) N1([0, γ], 1) log N κβ,1([ 1 (ii) max(γ, 1) CB-SPLINE (1/ε) where step (i) follows from Steinwart & Fischer (2021, Thm. 2.4, Sec. 3.3), and for step (ii) we use the fact that the unit-covering number of [0, γ] is bounded by max(γ, 1), and apply the covering number bound for the univariate B-SPLINEkernel from Zhou (2003, Ex. 4) (by substituting m = 2β + 2 in their notation) along with the fact that log N κβ,1([ 1 2], ε) = log N κβ,1([0, 1], ε) since κβ is shift-invariant. K PROOF OF TAB. 3 RESULTS In Tab. 3, the stated results for all the entries in the TARGET KT column follow directly by substituting the covering number bounds from Prop. 3 in the corresponding entry along with the stated radii growth conditions for the target P. (We substitute m = 1 2 log2 n since we thin to n output size.) For the KT+ column, the stated result follows by either taking the minimum of the first two columns (whenever the ROOT KT guarantee applies) or using the POWER KT guarantee. First we remark how to always ensure a rate of at least O(n 1 4 ) even when the guarantee from our theorems are larger, using a suitable baseline procedure and then proceed with our proofs. Remark 2 (Improvement over baseline thinning) First we note that the KT-SWAP step ensures that, deterministically, MMDk(Sin, SKT) MMDk(Sin, Sbase) and MMDk(P, SKT) 2 MMDk(P, Sin) + MMDk(P, Sbase) for Sbase a baseline thinned coreset of size n 2m and any target P. For example if the input and baseline coresets are drawn i.i.d. and k is bounded, then Published as a conference paper at ICLR 2022 MMDk(Sin, SKT) and MMDk(P, SKT) are O( p 2m/n) with high probability (Tolstikhin et al., 2017, Thm. A.1), even if the guarantee of Thm. 2 is larger. As a consequence, in all well-defined KT variants, we can guarantee a rate of n 1 4 for MMDk(Sin, SKT) when the output size is n simply by using baseline as i.i.d. thinning in the KT-SWAP step. GAUSS kernel The TARGET KT guarantee follows by substituting the covering number bound for the Gaussian kernel from Prop. 3(a) in (6), and the ROOT KT guarantee follows directly from Dwivedi & Mackey (2021, Tab. 2). Putting the guarantees for the ROOT KT and TARGET KT together (and taking the minimum of the two) yields the guarantee for KT+. IMQ kernel The TARGET KT guarantee follows by putting together the covering bound Prop. 3(c) and the MMD bounds (6). For the ROOT KT guarantee, we use a square-root dominating kernel ekrt IMQ(ν , γ ) Dwivedi & Mackey (2021, Def.2) as suggested by Dwivedi & Mackey (2021). Dwivedi & Mackey (2021, Eqn.(117)) shows that ekrt is always defined for appropriate choices of ν , γ . The best ROOT KT guarantees are obtained by choosing largest possible ν (to allow the most rapid decay of tails), and Dwivedi & Mackey (2021, Eqn.(117)) implies with ν < d 2, the best possible parameter satisfies ν d 2. For this parameter, some algebra shows that max(R ekrt,n Rekrt,n) d,ν,γ n1/2ν, leading to a guarantee worse than n 1 4 , so that the guarantee degenerates to n 1 4 using Rem. 2 for ROOT KT. When ν d 2, we can use a MAT ERN kernel as a square-root dominating kernel from Dwivedi & Mackey (2021, Prop. 3), and then applying the bounds for the kernel radii (17), and the inflation factor (19) for a generic Mat ern kernel from Dwivedi & Mackey (2021, Tab. 3) leads to the entry for the ROOT KT stated in Tab. 2. The guarantee for KT+ follows by taking the minimum of the two. 
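Before turning to the remaining kernels, we give a minimal Python sketch of the selection logic behind Rem. 2. It implements only the initialization of KT-SWAP, choosing among candidate coresets (represented as index sets into Sin) and a baseline coreset by comparing MMDk(Sin, ·); the greedy swap refinement of Dwivedi & Mackey (2021) is omitted, and the function names are ours.

import numpy as np

def mmd_sq_to_input(K, idx, K_in_mean):
    # Squared MMD_k(S_in, S) for a candidate coreset S given by row indices idx into
    # the input kernel matrix K = (k(x_i, x_j)) over S_in; K_in_mean = K.mean().
    idx = np.asarray(idx)
    return K_in_mean + K[np.ix_(idx, idx)].mean() - 2.0 * K[idx, :].mean()

def best_candidate(K, candidate_index_sets, baseline_index_set):
    # Selection step underlying Rem. 2: among the KT-SPLIT candidate coresets and a
    # baseline coreset (e.g., i.i.d. or standard thinning), keep whichever is closest
    # to S_in in MMD_k.  Since the baseline is always in the pool, the selected coreset
    # deterministically satisfies MMD_k(S_in, S_KT) <= MMD_k(S_in, S_base).
    K_in_mean = K.mean()
    pool = list(candidate_index_sets) + [baseline_index_set]
    scores = [mmd_sq_to_input(K, idx, K_in_mean) for idx in pool]
    return pool[int(np.argmin(scores))]

Any subsequent swap stage can only decrease MMDk(Sin, ·) further, so the n^{-1/4} fallback rate of Rem. 2 is preserved by construction whenever the baseline is an i.i.d. thinned coreset.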
MAT ERN kernel For TARGET KT, substituting the covering number bound from Prop. 3(b) in (6) with R = log n yields the MMD bound of order r log n (log n)d n2 ν d which is better than n 1 4 only when ν > 3d/2, and simplified to the entry in the Tab. 3 when we assume ν d 2 is an integer. When ν 3d/2, we can simply use baseline as i.i.d. thinning to obtain an order n 1 4 MMD error as in Rem. 2. The ROOT KT (and thereby KT+) guarantees for ν > d follow from Dwivedi & Mackey (2021, Tab. 2). 2, d], we use POWER KT with a suitable α to establish the KT+ guarantee. For MAT ERN(ν, γ) kernel, the α-power kernel is given by MAT ERN(αν, γ) if αν > d 2 (a proof of this follows from Def. 2 and Dwivedi & Mackey (2021, Eqns (71-72))). Since LAPLACE(σ) = MAT ERN( d+1 2 , σ 1), we conclude that its α-power kernel is defined for α > d d+1. And using the various tail radii (17), and the inflation factor (19) for a generic Mat ern kernel from Dwivedi & Mackey (2021, Tab. 3), we conclude that f Mα d,kα,δ log n log log n, and max(R kα,n Rkα,n) d,kα log n, so that Rmax = Od,kα(log n) (18) for SUBEXP P setting. Thus for this case, the MMD guarantee for n thinning with POWER KT (tracking only scaling with n) is 2α (2 f Mα)1 1 d 2max f Mα d,kα,δ ( 1 n) 1 2α ( cn log n)1 1 2α ((log n) d 2 + 1 2 cn) 1 α 1 = ( cn(log n)1+2d(1 α) where cn = log log n; and we thus obtain the corresponding entry (for KT+) stated in Tab. 3. SINC kernel The guarantee for TARGET KT follows directly from substituting the covering number bounds from Prop. 3(d) in (6). Published as a conference paper at ICLR 2022 For the ROOT KT guarantee, we note that the square-root kernel construction of Dwivedi & Mackey (2021, Prop.2) implies that SINC(θ) itself is a square-root of SINC(θ) since the Fourier transform of SINC is a rectangle function on a bounded domain. However, the tail of the SINC kernel does not decay fast enough for the guarantee of Dwivedi & Mackey (2021, Thm. 1) to improve beyond the n 1 4 bound of Dwivedi & Mackey (2021, Rem. 2) obtained when running ROOT KT with i.i.d. baseline thinning. In this case, TARGET KT and KT+ are identical since krt = k. B-SPLINE kernel The guarantee for TARGET KT follows directly from substituting the covering number bounds from Prop. 3(d) in (6). For B-SPLINE(2β + 1, γ) kernel, using arguments similar to that in Dwivedi & Mackey (2021, Tab.4), we conclude that (up to a constant scaling) the α-power kernel is defined to be B-SPLINE(A+ 1, γ) whenever A 2αβ +2α 2 2N0. For odd β we can always take α = 1 2 and B-SPLINE(β + 1, γ) is a valid (up to a constant scaling) square-root kernel (Dwivedi & Mackey, 2021). For even β, we have to choose α p+1 2, 1) by taking p N suitably, and the smallest suitable choice 2 = β 2 N, which is feasible as long as β 2. And, thus B-SPLINE(β + 1, γ) is a suitable kα for B-SPLINE(2β + 1) for even β 2 with α = β+2 2β+2 ( 1 2, 1). Whenever the α-power kernel is defined, we can then apply the various tail radii (17), and the inflation factor (19) for the power B-SPLINE kernel from Dwivedi & Mackey (2021, Tab. 3) to obtain the MMD rates for POWER KT from Dwivedi & Mackey (2021, Tab. 2) (which remains the same as ROOT KT upto factors depending on α and β). The guarantee for KT+ follows by taking the minimum MMD error for TARGET KT and ROOT KT for even β, and α-POWER KT for odd β.
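Finally, the power-kernel bookkeeping used above for the Matérn family reduces to simple arithmetic, which the following Python helper (our own illustration, not part of the paper's code) records: for MATERN(ν, γ) on R^d an α-power kernel exists whenever αν > d/2, so a square-root kernel (α = 1/2) requires ν > d, and since LAPLACE(σ) = MATERN((d+1)/2, σ^{-1}) the Laplace kernel admits α-power kernels exactly for α > d/(d+1) and no square-root for any d ≥ 1.

from fractions import Fraction

def matern_power_alpha_range(nu, d):
    # For k = MATERN(nu, gamma) on R^d, the alpha-power kernel is MATERN(alpha*nu, gamma)
    # whenever alpha * nu > d/2, i.e., alpha > d/(2 nu); within this family a square-root
    # kernel (alpha = 1/2) therefore exists only when nu > d.  Valid alpha lie in (lo, 1].
    lo = max(Fraction(d, 2) / Fraction(nu), Fraction(1, 2))
    return lo, Fraction(1)

def laplace_power_alpha_range(d):
    # LAPLACE(sigma) = MATERN((d+1)/2, 1/sigma), so alpha-power kernels exist exactly
    # for alpha > d/(d+1); in particular no square-root kernel exists for any d >= 1.
    return matern_power_alpha_range(Fraction(d + 1, 2), d)

for d in (1, 2, 4, 10, 100):
    lo, hi = laplace_power_alpha_range(d)
    print(f"d = {d:3d}: LAPLACE alpha-power kernels exist for alpha in ({lo}, {hi}]")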