# Kernel Thinning

Journal of Machine Learning Research 25 (2024) 1-77. Submitted 11/21; Revised 3/24; Published 4/24.

Raaz Dwivedi, dwivedi@cornell.edu, Cornell Tech
Lester Mackey, lmackey@microsoft.com, Microsoft Research New England

Editor: Ingo Steinwart

© 2024 Raaz Dwivedi and Lester Mackey. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/21-1334.html.

Abstract

We introduce kernel thinning, a new procedure for compressing a distribution P more effectively than i.i.d. sampling or standard thinning. Given a suitable reproducing kernel $k_\star$ and $O(n^2)$ time, kernel thinning compresses an $n$-point approximation to P into a $\sqrt{n}$-point approximation with comparable worst-case integration error across the associated reproducing kernel Hilbert space. The maximum discrepancy in integration error is $O_d(n^{-1/2}\sqrt{\log n})$ in probability for compactly supported P and $O_d(n^{-1/2}(\log n)^{(d+1)/2}\sqrt{\log\log n})$ for sub-exponential P on $\mathbb{R}^d$. In contrast, an equal-sized i.i.d. sample from P suffers $\Omega(n^{-1/4})$ integration error. Our sub-exponential guarantees resemble the classical quasi-Monte Carlo error rates for uniform P on $[0,1]^d$ but apply to general distributions on $\mathbb{R}^d$ and a wide range of common kernels. Moreover, the same construction delivers near-optimal $L^\infty$ coresets in $O(n^2)$ time. We use our results to derive explicit non-asymptotic maximum mean discrepancy bounds for Gaussian, Matérn, and B-spline kernels and present two vignettes illustrating the practical benefits of kernel thinning over i.i.d. sampling and standard Markov chain Monte Carlo thinning, in dimensions d = 2 through 100.

Keywords: coresets, distribution compression, Markov chain Monte Carlo, maximum mean discrepancy, reproducing kernel Hilbert space, thinning

1. Introduction

Monte Carlo and Markov chain Monte Carlo (MCMC) methods (Brooks et al., 2011) are commonly used to approximate intractable target expectations $Pf \triangleq \mathbb{E}_{X\sim P}[f(X)]$ of P-integrable functions $f$ with asymptotically exact averages $P_n f \triangleq \frac{1}{n}\sum_{i=1}^n f(x_i)$ based on points $(x_i)_{i=1}^n$ generated from a Markov chain. A standard practice, intended to minimize the expense of downstream function evaluation, is to thin the Markov chain output down to a smaller size $n_{\mathrm{out}}$ by keeping only every $(n/n_{\mathrm{out}})$-th sample point (Owen, 2017). We call this approach standard thinning, and such sample compression is critical in fields like computational cardiology, in which each function evaluation triggers an organ or tissue simulation consuming thousands of CPU hours (Niederer et al., 2011; Augustin et al., 2016; Strocchi et al., 2020). Unfortunately, standard thinning also leads to a significant reduction in accuracy. For example, thinning one's chain down to $n_{\mathrm{out}} = \sqrt{n}$ sample points increases integration error from $O(n^{-1/2})$ in probability to $\Omega(n^{-1/4})$ by the Markov chain central limit theorem (Roberts and Rosenthal, 2004, Prop. 29). Our primary contribution is a more effective thinning strategy, which provides $o_p(n^{-1/4})$ integration error when $\sqrt{n}$ points are returned.

1.1 Thinned MMD coresets

We focus on integration error in a reproducing kernel Hilbert space (RKHS, Steinwart and Christmann, 2008, Def. 4.18) of bounded, measurable functions with a target kernel $k_\star : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ for $d \in \mathbb{N}$ and RKHS norm $\|\cdot\|_{k_\star}$.

Assumption 1 (RKHS of bounded, measurable functions). The RKHS $\mathcal{H}_k$ of a kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ contains only bounded measurable functions.
Equivalently, k is bounded with k(x, ) measurable for all x Rd (Steinwart and Christmann, 2008, Lems. 4.23, 4.24).1 The worst-case integration error over the RKHS unit ball is given by the kernel maximum mean discrepancy (MMD, Gretton et al., 2012). Definition 1 (Maximum mean discrepancy (Gretton et al., 2012)). For a kernel k satisfying Assump. 1, we define the kernel maximum mean discrepancy, MMDk(µ, ν) sup f Hk: f k 1 |µf νf| for all probability measures µ, ν on Rd. (1) For sequences of points S and S in Rd with empirical distributions Q and Q , we overload this notation to write MMDk(P, S) MMDk(P, Q) and MMDk(S, S ) MMDk(Q, Q ). Given k satisfying Assump. 1, a target distribution P on Rd, and a sequence of Rdvalued points (xi)n i=1 generated to approximate P, our aim is to identify a thinned MMD coreset, a shorter subsequence that continues to approximate P well in MMDk . Definition 2 (MMD coreset). We call a sequence of nout points in Rd with empirical measure Q an (nout, ε)-MMD coreset for (k , P) if MMDk (P, Q) ε. Notably, when the initial sequence is drawn i.i.d. or from a fast-mixing Markov chain targeting P, standard thinning down to size nout = n 1 2 yields an order (n 1 2 , n 1 4 )-MMD coreset in probability (see Prop. 1). A benchmark for improvement is provided by the online Haar strategy of Dwivedi et al. (2019), which generates an (n 1 2 , Od(n 1 2 log2d n))- MMD coreset in probability from 2n 1 2 i.i.d. sample points when P is specifically the uniform distribution on the unit cube [0, 1]d.2 Our goal is to develop thinned coresets of improved quality for any target P with sufficiently fast tail decay. 1.2 Our contributions To this end, we introduce kernel thinning (Alg. 1), a new, practical solution to the thinned MMD coreset problem that takes as input an (n, Op(n 1 2 ))-MMD coreset and outputs an (n 1 2 , op(n 1 4 ))-MMD coreset for a wide-range of (k , P). Kernel thinning uses non-uniform randomness and evaluations of a less smooth square-root kernel krt (see Def. 5) to partition the input into subsets of comparable quality and then greedily refines the best of these subsets using k . Our primary contributions include: 1. Throughout, we use k for statements involving a generic kernel that is potentially distinct from the target kernel k . 2. Dwivedi et al. (2019) specifically control the star discrepancy, a quantity which in turn upper bounds a Sobolev space MMD called the L2 discrepancy (Hickernell, 1998; Novak and Wozniakowski, 2010). Kernel Thinning 1. Better-than-i.i.d. MMD coresets: Given n input points sampled i.i.d. or from a fast-mixing Markov chain, kernel thinning yields, in probability, an (n 1 2 , Od(n 1 2 log n))- MMD coreset for P and krt with bounded support, an (n 1 2 , Od(n 1 logd+1 n log log n))- MMD coreset for P and krt with light tails, and an (n 1 2 , Od(n 1 2ρ log n log log n))- MMD coreset for P and k2 rt with ρ > 2d moments (Thm. 1 and Cor. 1). For compactly supported or light-tailed P and krt, these results compare favorably with known Ωd(n 1 2 ) lower bounds (see Sec. 8.1). Our guarantees extend to more general input point sequences, including deterministic sequences based on quadrature or kernel herding (Chen et al., 2010), and give rise to explicit, non-asymptotic error bounds for a wide variety of popular kernels including Gaussian, Mat ern, and B-spline kernels. While (n 1 2 , Od(n 1 2 n))-MMD coresets have been developed for specific (k , P) pairings like the uniform distribution on [0, 1]d and an L2 discrepancy kernel k (see Sec. 
8.1), to the best of our knowledge, no prior (n 1 2 , op(n 1 4 ))-MMD coreset constructions were known for the range of P and k studied in this work. 2. MMD error from square-root L error: To derive our MMD guarantees for kernel thinning, we first establish an important link between MMD coresets for k and L coresets for krt. Definition 3 (L coreset). For any kernel k satisfying Assump. 1, probability measure µ on Rd, and z Rd, let µk(z) EX µ[k(X, z)]. We call a sequence of nout points in Rd with empirical measure Q an (nout, ε)-L coreset for (k, P) if Pk Qk ε. Thm. 2 and Cor. 2 show that any L coreset for (krt, P) is also an MMD coreset for (k , P) with quality depending on the tail decay of krt and P. 3. Online vector balancing in Hilbert spaces: As a building block for constructing high-quality coresets, we introduce and analyze a Hilbert space generalization of the self-balancing walk of Alweiss et al. (2021) to partition a sequence of functions (like (krt(xi, ))n i=1) into nearly equal halves. Our analysis of this self-balancing Hilbert walk (SBHW, Alg. 3) in Thm. 3 may be of independent interest for solving the online vector balancing problem of Spencer (1977) in Hilbert spaces (Cor. 4). 4. Efficient, near-optimal L coresets: We then design a symmetrized version of SBHW for RKHSes kernel halving that delivers 2-thinned coresets with small L error (Alg. 2 and Thm. 4). The first stage of kernel thinning, kt-split, recursively applies kernel halving to krt to obtain near-minimax-optimal L coresets in O(n2) time with O(n min(d, n)) space (Cors. 5 and 6). After describing our kernel and input point requirements in Sec. 2, we detail the kernel thinning and kernel halving algorithms in Sec. 3. Sec. 4 houses our main MMD guarantees, both for kernel thinning and for generic L square-root kernel coresets. We introduce and analyze the self-balancing Hilbert walk in Sec. 5 and present our main L guarantees for kernel halving and kt-split in Sec. 6. Sec. 7 complements our theoretical contributions with two vignettes illustrating the practical benefits of kernel thinning over (a) i.i.d. sampling in dimensions d = 2 through 100 and (b) standard MCMC thinning across twelve Dwivedi and Mackey experiments targeting challenging differential equation posterior distributions. We conclude with a discussion of our results, related work, and future directions in Sec. 8 and defer all proofs to the appendices. Notation We define the shorthand [n] {1, . . . , n} for n N, a b min(a, b) for a, b R, R+ {x R : x 0}, and B(x; r) y Rd | x y 2 < r for r R. We use Vol(B) to denote the volume of a compact set B Rd. We let Ac denote the complement of a set A Rd and IA(x) = 1 if x A and 0 otherwise. We use Pr(E) to denote the probability of an event E. For real-valued kernels k and functions f on Rd, we make frequent use of the norms k = supx,y Rd |k(x, y)| and f = supx Rd |f(x)|. For x > 0, we use Γ(x) = R 0 tx 1e t dt to denote the Gamma function (with Γ(n) = (n 1)! for n N). For two sequences of real numbers (an)n N and (bn)n N, we say that an is of order bn and write an = O(bn) or an bn to denote that an cbn for all n N and some constant c > 0. We write an = Ω(bn) if bn = O(an) and an = Θ(bn) when a = Ω(bn) and a = O(bn). Moreover, we use an = Od(bn), an d bn, an = Ωd(bn), an d bn to indicate dependency of underlying universal constant on d. We say an = o(bn) if limn an/bn = 0. 
For a sequence of real-valued random variables (Xn)n N, we write Xn = OP (an) or Xn = O(an) in probability, when Xn an is stochastically bounded, i.e., for all δ > 0, there exists finite cδ and nδ such that Pr(|Xn an | > cδ) < δ, for all n > nδ. We write Xn = ΩP (an) if 1/Xn = OP (1/bn) and Xn = op(an) when Xn an 0 in probability, i.e., for all ε > 0, Pr(|Xn an | ε) 0. We write order (n, ε)-MMD (or L ) coreset to mean an (n, O(ϵ))-MMD (or L ) coreset and append in probability to mean an (n, Op(ϵ))-MMD (or L ) coreset. 2. Input Point and Kernel Requirements Given a target distribution P on Rd, a kernel k satisfying Assump. 1, and a sequence of Rd-valued input points Sn =(xi)n i=1 generated either randomly or deterministically, our goal is to identify a better-than-i.i.d. thinned MMD coreset, that is, a subsequence Sout of size n 1 2 satisfying MMDk (P, Sout) = op(n 1 4 ). When drawing asymptotic conclusions, we will view d as fixed and Sn as a prefix of an infinite sequence of points S (xi) i=1. 2.1 Input point requirements Our algorithms are designed to return high quality MMD coresets for the input Sn. To translate these into high quality coresets for the target P, it suffices, by the triangle inequality, for the input points to have quality MMDk (P, Sn) = Op(n 1 2 ). As we discuss in Sec. 8.1, input sequences generated by i.i.d. sampling, kernel herding (Chen et al., 2010), Stein Point MCMC (Chen et al., 2019), and greedy sign selection (Karnin and Liberty, 2019) all satisfy this property. Moreover, we prove in App. B that an analogous guarantee holds for the iterates of a fast-mixing Markov chain. Proposition 1 (MMD guarantee for MCMC). Consider a homogeneous ϕ-irreducible geometrically ergodic Markov chain (Gallegos-Herrada et al., 2023, Thm. 1xi) with initial state x0, subsequent iterates S , and stationary distribution P. If k satisfies Assump. 1, then there exists a P-almost everywhere finite function c : Rd (0, ] such that, for any given n N and δ (0, 1), MMDk (P, Sn) q c(x0) k log(e/δ) n with probability 1 δ given x0. Kernel Thinning The input radius, RSn maxx Sn x 2, (2) will also play an important role in our results. In particular, the growth rate of this radius as a function of n impacts the growth rate of our MMD bounds. Our next definition assigns familiar names to the most commonly encountered growth rates. Definition 4 (Input radius growth rates). We say the point sequence S with prefixes Sn for n N is Compact if RSn = Od(1), Sub Gauss if RSn = Od( log n), Sub Exp if RSn = Od(log n), and Heavy Tail(ρ) with ρ > 0 if RSn = Od(n1/ρ). These growth rates are exactly those which arise with probability 1 when an input sequence is generated identically from P with corresponding tail behavior or from a fastmixing Markov chain targeting P. Our proof of this result is given in App. C. Proposition 2 (Almost sure radius growth). Consider either (i) points S sampled identically (but not necessarily independently) from P with x0 independent or (ii) a homogeneous ϕ-irreducible geometrically ergodic Markov chain with initial state x0, subsequent iterates S , and stationary distribution P. Then the following statements hold true for any nonnegative c and ρ and P-almost every x0. (a) If P is compactly supported, then, with probability 1 conditional on x0, S is Compact. (b) If EX P[ec X 2 2] < , then, with probability 1 conditional on x0, S is Sub Gauss. (c) If EX P[ec X 2] < , then, with probability 1 conditional on x0, S is Sub Exp. 
(d) If EX P[ X ρ 2] < , then, with probability 1 conditional on x0, S is Heavy Tail(ρ). Finally, we will also require S to be oblivious, that is, generated independently of any randomness in the thinning algorithm. To capture this assumption, we treat S as fixed and deterministic hereafter. This treatment is without loss of generality since our results hold conditional on the observed values of (xi) i=1 when the points are random and oblivious. 2.2 Kernel requirements We use the terms reproducing kernel and kernel interchangeably to indicate that k is symmetric and positive definite, i.e., that the kernel matrix (k (zi, zj))l i,j=1 is symmetric and positive semidefinite for any evaluation points (zi)l i=1 in Rd. In addition to k , our algorithm takes as input a square-root kernel for k . Definition 5 (Square-root kernel). We say a kernel krt : Rd Rd R is a square-root kernel for k : Rd Rd R if krt(x, ) is square integrable for all x Rd with k (x, y) = R Rd krt(x, z)krt(y, z)dz for all x, y Rd. (3) We highlight that a square-root kernel need not be unique and that its existence is an indication of a certain degree of smoothness in the target kernel k . One convenient tool for deriving square-root kernels is the notion of a spectral density. Dwivedi and Mackey k (x, y) = κ(x y) Name of kernel κ(z) Expression for bκ(ω) Fourier transform krt Square-root kernel σ > 0 Gaussian(σ) : exp z 2 2 2σ2 σd exp σ2 ω 2 2 2 2 4 Gaussian σ ν > d, γ > 0 Mat ern(ν, γ) : cν d 2 (γ z 2)ν d 2 (γ z 2) ϕd,ν,γ (γ2 + ω 2 2) ν Aν,γ,d Mat ern(ν β 2N + 1 B-spline(2β + 1) : S2β+2,d j=1 2β+2I[ 1 2 ](zj) S 2β+2,d sin2β+2( ωj e Sβ,d B-spline(β) Table 1: Square-root kernels krt for common target kernels k . Each k satisfies k = 1, and the parameter range ensures the existence of krt. Above, ℓdenotes recursive convolution with ℓfunction copies, Ka denotes the modified Bessel function of the third kind (Wendland, 2004, Def. 5.10), cb 21 b Γ(b) , ϕd,ν,γ = cν d/2 cν γ2ν d, Aν,γ,d 1 Γ(ν) Γ(ν d/2) Γ((ν d)/2) Γ(ν/2) , S 2β+2,d S2β+2,d ( 4β+1 2π )d, and e Sβ,d S2β+2,d Sβ+1,d where Sβ,d is defined in (89). See App. N for our derivation. Definition 6 (Shift invariance and spectral density). We call a kernel of the form k(x, y) = κ(x y) for κ : Rd R shift-invariant and say k has spectral density bκ if κ is the Fourier transform of a finite measure with Lebesgue density bκ, i.e., κ(z) = 1 (2π)d/2 R e i ω,z bκ(ω)dω. As we show in Apps. N and Q, many familiar kernels admit spectral densities, including Gaussian, Mat ern, B-spline, inverse multiquadric, sech, and Wendland s compactly supported kernels. Moreover, by Bochner s theorem (Bochner, 1933; Wendland, 2004, Thm. 6.6) and the Fourier inversion theorem (Wendland, 2004, Cor. 5.24), any continuous k (x, y) = κ(x y) with absolutely integrable κ has a spectral density equal to the Fourier transform of κ. Our next result (proved in App. D) derives a square-root kernel for any shift-invariant k with a square-root integrable spectral density. Proposition 3 (Shift-invariant square-root kernels). If a kernel k (x, y)=κ(x y) admits a spectral density (Def. 6) bκ with R p bκ(ω)dω< , then krt(x, y)= κrt(x y) (2π)d/4 is a square-root kernel of k for κrt the Fourier transform of Tab. 1 gives several examples of common kernels satisfying the conditions of Prop. 3 along with their associated square-root kernels. For example, if k is Gaussian with bandwidth σ, then a rescaled Gaussian kernel with bandwidth σ 2 is a valid choice for krt. 
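As a quick numerical check of the square-root relation (3) for this Gaussian example: in $d = 1$, the rescaled kernel is $k_{rt}(x,y) = (2/(\pi\sigma^2))^{1/4}\exp(-(x-y)^2/\sigma^2)$, i.e., a Gaussian with bandwidth $\sigma/\sqrt{2}$ whose normalizing constant follows from the Gaussian convolution identity (cf. Tab. 1), and the convolution integral can be discretized on a wide grid. A minimal sketch (our own illustration, not part of the reference implementation):

```python
import numpy as np

sigma = 1.3
k = lambda x, y: np.exp(-(x - y) ** 2 / (2 * sigma ** 2))          # Gaussian(sigma) target kernel
c_rt = (2 / (np.pi * sigma ** 2)) ** 0.25                          # normalizing constant for d = 1
k_rt = lambda x, y: c_rt * np.exp(-(x - y) ** 2 / sigma ** 2)      # rescaled Gaussian(sigma / sqrt(2))

# Check k(x, y) = integral of k_rt(x, z) * k_rt(y, z) dz via a Riemann sum;
# the sum is extremely accurate here because the integrand decays rapidly.
z = np.linspace(-15.0, 15.0, 60001)
dz = z[1] - z[0]
for x, y in [(0.0, 0.0), (0.3, -1.1), (2.0, 2.5)]:
    lhs = k(x, y)
    rhs = np.sum(k_rt(x, z) * k_rt(y, z)) * dz
    assert abs(lhs - rhs) < 1e-6, (lhs, rhs)
```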
For simplicity, our results in the sequel assume the use of an exact square-root kernel $k_{rt}$, but, as we detail in App. Q, it suffices to use the square-root of any kernel that dominates $k_\star$ in the positive-definite order (see Def. 8). For example, we show in Prop. 4 of App. Q that a standard Matérn kernel is a suitable square-root dominating kernel for any sufficiently smooth shift-invariant $k_\star$ with absolutely integrable $\kappa$. In Tab. 5 of App. Q, we also derive convenient tailored square-root dominating kernels for inverse multiquadric, sech, and Wendland's compactly supported kernels.

Finally, we define several kernel growth and decay properties that will be explicitly assumed in some of our results.

Assumption 2 (Lipschitz kernel). The kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ admits a Lipschitz constant
$$L_k \triangleq \sup_{x,y,z} \frac{|k(x,y) - k(x,z)|}{\|y - z\|_2} < \infty. \quad (4)$$

Assumption 3 (Kernel tail decay). The kernel $k$ satisfies Assump. 1 and, for each $\varepsilon > 0$,
$$\max\Bigl\{\, \inf\{r : \sup_{x,y : \|x-y\|_2 \ge r} |k(x,y)| \le \varepsilon\},\ \inf\{r : \tau_k(r) \le \varepsilon\} \Bigr\} < \infty, \quad \text{where} \quad \tau_k(r) \triangleq \Bigl(\sup_x \int_{\|y\|_2 \ge r} k^2(x, x - y)\,dy\Bigr)^{1/2} \text{ for } r \ge 0. \quad (5)$$

The following definition gives familiar names to commonly encountered tail decay rates.

Definition 7 (Kernel tail decay rate). For a kernel $k$ satisfying Assump. 3, define
$$R_{k,n} \triangleq \inf\bigl\{r : \sup_{x,y:\|x-y\|_2 \ge r} |k(x,y)| \le \tfrac{\|k\|_\infty}{n}\bigr\}, \quad R'_{k,n} \triangleq \inf\bigl\{r : \tau_k(r) \le \tfrac{\|k\|_\infty}{\sqrt{n}}\bigr\}, \quad \text{and} \quad \mathcal{R}_{k,n} \triangleq \max\{R_{k,n}, R'_{k,n}\} \quad (6)$$
for $\tau_k$ defined in (5). We say $k$ is (a) Compact if $\mathcal{R}_{k,n} = O_d(1)$, (b) Sub Gauss if $\mathcal{R}_{k,n} = O_d(\sqrt{\log n})$, (c) Sub Exp if $\mathcal{R}_{k,n} = O_d(\log n)$, (d) Heavy Tail(ρ) for $\rho > 0$ if $\mathcal{R}_{k,n} = O_d(n^{1/\rho})$, and (e) Sub Poly if $\log \mathcal{R}_{k,n} = O_d(\log n)$. Notably, any Compact, Sub Gauss, Sub Exp, or Heavy Tail(ρ) $k$ is also Sub Poly.

Remark 1 ($k_{rt}$ tail decay implies $k_\star$ boundedness). If $k_{rt}$ satisfying Assump. 3 is a square-root kernel of $k_\star$, then there exists a finite $r$ for which $k_\star(x, x) = \int k_{rt}^2(x, x - y)\,dy \le \|k_{rt}\|_\infty^2\,\mathrm{Vol}(B(0; r)) + \tau_{k_{rt}}^2(r) < \infty$.

Popular examples of Compact, Sub Gauss, and Sub Exp $k$ are B-spline, Gaussian, and Matérn kernels respectively (see Tab. 2). Moreover, one can directly verify that an inverse multiquadric $k(x,y) = (\gamma^2 + \|x - y\|_2^2)^{-\nu}$ with $\gamma > 0$ and $\nu > \frac{d}{4}$ is Heavy Tail($2\nu \wedge (4\nu - d)$).

3. Kernel Thinning

Our solution to the thinned coreset problem is kernel thinning, described in Alg. 1. Given a thinning parameter $m \in \mathbb{N}$, kernel thinning proceeds in two stages: kt-split and kt-swap.

Algorithm 1: Kernel Thinning (return a coreset of size $\lfloor n/2^m \rfloor$ with small $\mathrm{MMD}_{k_\star}$)
  Input: kernels $(k_\star, k_{rt})$, input points $S_n = (x_i)_{i=1}^n$, thinning parameter $m \in \mathbb{N}$, probabilities $(\delta_i)_{i=1}^{\lfloor n/2 \rfloor}$
  $(S^{(m,\ell)})_{\ell=1}^{2^m} \leftarrow$ kt-split$(k_{rt}, S_n, m, (\delta_i)_{i=1}^{\lfloor n/2 \rfloor})$   // Split $S_n$ into $2^m$ candidate coresets of size $\lfloor n/2^m \rfloor$
  $S_{\mathrm{KT}} \leftarrow$ kt-swap$(k_\star, S_n, (S^{(m,\ell)})_{\ell=1}^{2^m})$   // Select best coreset and iteratively refine
  return coreset $S_{\mathrm{KT}}$ of size $\lfloor n/2^m \rfloor$

Figure 1: Overview of kt-split. (Left) kt-split recursively partitions its input $S_n$ into $2^m$ balanced coresets $S^{(m,\ell)}$ of size $\lfloor n/2^m \rfloor$. (Right) In Sec. 6, we interpret each coreset $S^{(m,\ell)}$ as the output of repeated kernel halving: on each halving round, remaining points are paired, and one point from each pair is selected using non-uniform randomness.
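To make the recursion of Fig. 1 concrete before the formal statements of Algs. 1a and 2, here is a minimal Python sketch of a single kt-split branch: repeated kernel halving with the probabilistic swapping rule and get_swap_params threshold update spelled out below. The function names, the uniform allocation of the failure probabilities across rounds, and the omission of kt-swap are our simplifications, not the paper's reference implementation (see https://github.com/microsoft/goodpoints for that).

```python
import numpy as np

def get_swap_params(sigma, b, delta):
    # Adaptive threshold a and sub-Gaussian parameter update used by Algs. 1a and 2.
    a = max(b * sigma * np.sqrt(2 * np.log(2 / delta)), b ** 2)
    if a == 0.0:                       # degenerate pair (duplicate points); our guard
        return 0.0, sigma
    sigma2 = sigma ** 2 + b ** 2 * max(1 + (b ** 2 - 2 * a) * sigma ** 2 / a ** 2, 0.0)
    return a, np.sqrt(sigma2)

def kernel_halve(X, kernel, deltas, rng):
    """One round of kernel halving (Alg. 2): split the rows of X into two
    balanced coresets of equal size; returns two index lists (S1, S2)."""
    S1, S2 = [], []
    sigma = 0.0
    for i in range(len(X) // 2):
        x, xp = 2 * i, 2 * i + 1                       # indices of the next input pair
        b2 = kernel(X[x], X[x]) + kernel(X[xp], X[xp]) - 2 * kernel(X[x], X[xp])
        a, sigma = get_swap_params(sigma, np.sqrt(max(b2, 0.0)), deltas[i])
        # alpha_i = <psi_{i-1}, k(x,.) - k(x',.)> via the kernel-sum formula of Alg. 2
        diff = np.array([kernel(X[j], X[x]) - kernel(X[j], X[xp]) for j in range(2 * i)])
        alpha = diff.sum() - 2 * sum(diff[j] for j in S1)
        # Probabilistic swapping: bias toward the assignment leaving the halves more balanced.
        swap_prob = 0.5 if a == 0.0 else min(1.0, max(0.5 * (1 - alpha / a), 0.0))
        if rng.random() <= swap_prob:
            x, xp = xp, x
        S1.append(x)
        S2.append(xp)
    return S1, S2

def repeated_halving(X, kernel, m, delta_total=0.5, seed=0):
    """Repeated kernel halving (Fig. 1 / Thm. 4(b)): m rounds shrink n points to ~n/2^m.
    This traces one branch of the kt-split tree; delta_total is split uniformly (our choice)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(len(X))
    for _ in range(m):
        pairs = max(len(idx) // 2, 1)
        deltas = np.full(pairs, delta_total / (m * pairs))
        S1, _ = kernel_halve(X[idx], kernel, deltas, rng)
        idx = idx[S1]
    return idx

# Usage: Gaussian square-root kernel (normalizing constant dropped, since Alg. 1
# is invariant to rescaling either kernel by a positive multiplier).
krt = lambda x, y, s2=1.0: np.exp(-np.sum((x - y) ** 2) / s2)
X = np.random.default_rng(1).standard_normal((1024, 2))
coreset_idx = repeated_halving(X, krt, m=5)            # 32 of the 1024 input points
```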
Algorithm 1a: kt-split Divide points into candidate coresets of size n/2m Input: kernel krt, point sequence Sn = (xi)n i=1, thinning parameter m N, probabilities (δi) n/2 i=1 S(j,ℓ) {} for 0 j m and 1 ℓ 2j // Empty coresets: S(j,ℓ) has size i 2j 1 after i rounds σj,ℓ 0 for 1 j m and 1 ℓ 2j 1 // Swapping parameters for i = 1, . . . , n/2 do S(0,1).append(x2i 1); S(0,1).append(x2i) // Every 2j 1 rounds add point from parent coreset S(j 1,ℓ) to each child S(j,2ℓ 1), S(j,2ℓ) for (j = 1; j m and i/2j 1 N; j = j + 1) do for ℓ= 1, . . . , 2j 1 do (S, S ) (S(j 1,ℓ), S(j,2ℓ 1)); (x, x ) get last two points(S) // Compute swapping threshold a a, σj,ℓ get swap params(σj,ℓ, b, δ|S|/2 2j 1 m ) for b2 =krt(x, x)+krt(x , x ) 2krt(x, x ) // Assign one point to each child after probabilistic swapping α krt(x , x ) krt(x, x) + Σy S(krt(y, x) krt(y, x )) 2Σz S (krt(z, x) krt(z, x )) (x, x ) (x , x) with probability min(1, 1 S(j,2ℓ 1).append(x); S(j,2ℓ).append(x ) end end end return (S(m,ℓ))2m ℓ=1, candidate coresets of size n/2m function get swap params(σ, b, δ): 2 log(2/δ), b2) σ2 σ2+b2(1+(b2 2a)σ2/a2)+ return (a, σ) Kernel Thinning KT-SPLIT The first stage, kt-split, is an initialization stage that partitions the input sequence Sn = (xi)n i=1 into 2m balanced candidate coresets, each of size n 2m .3 As depicted in Fig. 1, this partitioning is carried out recursively in m rounds, first dividing the input sequence in half, then halving those halves into quarters, and so on until coresets of size n 2m are produced. The details of kt-split can appear a bit complicated as, in practice, all m halving rounds are carried out concurrently in an online manner. However, under the hood, each candidate coreset is generated by recursively applying a new simple subroutine called kernel halving. Kernel halving Kernel halving (KH, Alg. 2) is a simple randomized procedure for dividing an input sequence Sn into two balanced, equal-sized coresets S(1) and S(2) using a kernel k. KH begins with two empty coresets S(1) and S(2) and adds points (x, x ) from the input sequence two at a time, assigning one point from each pair to each coreset. To encourage balance between the coresets during generation, KH effectively computes which assignment of (x, x ) leads to a smaller MMDk(S(1), S(2)) and then favors that assignment using non-uniform randomness. More precisely, on the i-th step with (x, x ) = (x2i 1, x2i), KH computes the imbalance contrast αi = i2 MMD2 k(S(1) {x}, S(2) {x }) MMD2 k(S(1) {x }, S(2) {x}) and then adds x to S(1) and x to S(2) or x to S(1) and x to S(2) with probability biased toward the more balanced outcome. We refer to this step as probabilistic swapping in the algorithm statements. The exact value of this probability depends on a swapping threshold ai that is produced automatically each round based on the user-supplied inputs (δi) n/2 i=1 . In Sec. 4.1, we will learn how to set these inputs to achieve better balance than standard thinning or uniform subsampling, and in Sec. 6.1 we will discuss the role of ai and its generation procedure get swap params in achieving this balance. Notably, in the context of kt-split, KH is run specifically with the square-root kernel krt rather than the target kernel k . This choice enables us to take advantage of the strong L balance properties established for KH in Sec. 6 and the close connection between square-root L error and target MMD error revealed in Sec. 4.2. KT-SWAP The second stage, kt-swap, refines the candidate coresets produced by ktsplit in three steps. 
First, kt-swap adds a baseline coreset of size n 2m to the candidate list (for example, one produced by standard thinning or uniform subsampling) to ensure that the kt-swap output is never worse than that of the baseline. Next, it selects the candidate coreset closest to Sn in terms of MMDk . Finally, it refines the selected coreset by replacing each coreset point in turn with the best alternative in Sn, as measured by MMDk (Sn, ). This stage serves to greedily improve upon the MMD of the initial kt-split candidates, and, when computable, MMDk to the target distribution P can be substituted for the surrogate MMDk (Sn, ) throughout. Complexity For any m, the time complexity of kernel thinning is dominated by O(n2) kernel evaluations, while the space complexity is O(n min(d, n)), achieved by storing the smaller of the input sequence (xi)n i=1 and the kernel matrix (krt(xi, xj))n i,j=1. In addition, 3. When 2m does not evenly divide n, the final n 2m n 2m points are discarded. Dwivedi and Mackey Algorithm 2: Kernel Halving Input: kernel k, point sequence Sn = (xi)n i=1, probability sequence (δi) n/2 i=1 S(1), S(2) {}; ψ0 0 H // Initialize empty coresets: S(1), S(2) have size i after round i σ0 0 // Swapping parameter for i = 1, 2, . . . , n/2 do // Construct kernel difference function using next two points (x, x ) (x2i 1, x2i); fi k(x2i 1, ) k(x2i, ); ηi 1 // Compute swapping threshold ai ai, σi get swap params(σi 1, b, δi) with b2 = fi 2 k =k(x, x)+k(x , x ) 2k(x, x ) // Compute RKHS inner product ψi 1, fi k, which has a simple form αi P2i 2 j=1 (k(xj, x) k(xj, x )) 2 P z S(1)(k(z, x) k(z, x )) // Assign one point to each coreset after probabilistic swapping (x, x ) (x , x) and ηi 1 with probability min(1, 1 ai )+) S(1).append(x); S(2).append(x ); ψi ψi 1 + ηifi // ψi = P x S(2)k(x , ) P x S(1)k(x, ) end return S(1), coreset of size n/2 Algorithm 1b: kt-swap Identify and refine the best candidate coreset Input: kernel k , point sequence Sn = (xi)n i=1, candidate coresets (S(m,ℓ))2m ℓ=1 S(m,0) baseline coreset(Sn, size= n/2m ) // Compare to baseline (e.g., standard thinning) SKT S(m,ℓ ) for ℓ argminℓ {0,1,...,2m} MMDk (Sn, S(m,ℓ)) // Select best coreset // Swap out each point in SKT for best alternative in Sn for i = 1, . . . , n/2m do SKT[i] argminz Sn MMDk (Sn, SKT with SKT[i] = z) end return SKT, refined coreset of size n/2m scaling either k or krt by a positive multiplier has no impact on Alg. 1, so the kernels need only be specified up to arbitrary rescalings. 4. MMD Guarantees We are now prepared to present our main MMD guarantees. 4.1 MMD guarantees for kernel thinning Our first main result, proved in App. E, bounds the MMD of a kernel thinning coreset in terms of the input (2) and kernel (6) radii, the combined radii RSn,k,n min RSn, n1+ 1 d Rk,n + n 1 d k Lk and R Sn,k,n max RSn, R k,n , (7) Kernel Thinning and the kernel thinning inflation factor Mk(n,m,d,δ,δ ,R) 2 I( n d log 2+ 2Lk k (Rk,n+R) i , (8) defined for any kernel k satisfying Assumps. 2 and 3, n, m, d N, δ (0, 6m 2m ], δ (0, 1], and R 0. Theorem 1 (MMD guarantee for kernel thinning). Consider kernel thinning (Alg. 1) with k satisfying Assump. 1, krt a square-root kernel of k , δ mini δi, and nout n 2m for m log2 n . If krt satisfies Assumps. 2 and 3, then, for any fixed δ (0, 1), we have MMDk (Sn, SKT) krt 2+1) (R Sn,krt,nout) d 2 Mkrt(n, m,d,δ ,δ ,RSn,krt,n) , (9) with probability at least 1 δ Pm j=1 2j 1 m P2m jnout i=1 δi. Remark 2 (Guarantee for target P). 
A guarantee for any target distribution P follows directly from the triangle inequality, MMDk (P, SKT) MMDk (P, Sn)+MMDk (Sn, SKT). Remark 3 (Comparison with baseline thinning). The kt-swap step ensures that, deterministically, MMDk (Sn, SKT) MMDk (Sn, Sbase) for Sbase a baseline thinned coreset of size nout. Therefore, we additionally have MMDk (P, SKT) 2 MMDk (P, Sn) + MMDk (P, Sbase). Remark 4 (Finite-time and anytime guarantees). To obtain a success probability of at least 1 δ with δ = δ 2, it suffices to choose δi = δ n when the input size n is known in advance and δi = mδ 2m+2(i+1) log2(i+1) when the input size n is not known in advance (but is chosen independently of the randomness used in kernel thinning). In either case, δ 6m 2m is a valid argument to Mkrt (8). See App. F for our proof. Our next corollary, proved in App. G, translates Thm. 1 into specific rates of MMD decay depending on the radius growth of S and the tail decay of krt. Corollary 1 (MMD rates for kernel thinning). Under the notation and assumptions of Thm. 1, consider a sequence of kernel thinning runs (Alg. 1), indexed by n N, with m = 1 2 log2 n , log(1/δ ) = O(log n), and Pm j=1 2j 1 m P n/2j i=1 δi = o(1) as n . If S and krt respectively satisfy one of the radius growth (Def. 4) and tail decay (Def. 7) conditions in the table below, then MMDk (Sn, SKT) = Od(εMMD,n) in probability where εMMD,n is the corresponding table entry. εMMD,n RSn d 1 Compact S RSn d log n Sub Gauss S RSn d log n Sub Exp S RSn d n1/ρ Heavy Tail(ρ) S R krt,n d 1 Compact krt q n (log n) d+2 n (log n) d+1 R krt,n d log n Sub Gauss krt (log n) d+2 n (log n) d+2 n (log n) d+1 R krt,n d log n Sub Exp krt (log n) d+1 n (log n) d+1 n (log n) d+1 R krt,n d n1/ρ Heavy Tail(ρ ) krt log n n1 d/ρ log n n1 d/ρ log n n1 d/ρ log n n1 d/(ρ ρ ) Dwivedi and Mackey kernel krt Square-root krt (Rkrt,n, R krt,n) 2 log2 n, d, δ 4 Gaussian σ 4 (σ log n, σ d + log n) δ )+d log log n+ R Aν,γ,d Mat ern( ν 2, γ) ν( γ2 2π(a 1)) d 4 γ 1(a+log n+E)) γ 1(log n+a log(1+a)), q δ )+d log log n+B+γR ] e Sβ,d B-spline(β) cd β ( 1 δ )+d log(dβ+ Table 2: Explicit bounds on Thm. 1 quantities for common kernels. Here, Aν,γ,d, and e Sβ,d are as in Tab. 1, a 1 2(ν d)(> 1), B a log(1+a), E d log( γ )+log( (ν 2)ν 3 (2(a 1))2a 1d d 2 +1 ), 3, and cβ < 1 for β > 1 (see (120)). See App. O for our derivation. Remark 5 (Probability parameters). The condition ( ) Pm j=1 2j 1 m P n/2j i=1 δi =o(1) is satisfied when δ1 = = δ n n). Hence, both log(1/δ ) = O(log n) and ( ) are satisfied when, for example, δ1 = =δ n 2 = 1 n log log n. Cor. 1 shows that kernel thinning returns an (n 1 2 , Od(n 1 2 log n))-MMD coreset in probability when S and krt are compactly supported. For fixed d, this guarantee significantly improves upon the baseline Ωp(n 1 4 ) rates of i.i.d. sampling and standard MCMC thinning and matches the minimax lower bounds of Sec. 8.1 up to a log n term and constants depending on d. For example, when S is drawn i.i.d. from P, kernel thinning is nearly minimax optimal amongst all distributional approximations (even weighted coresets and non-coreset approximations) that depend on P only through n i.i.d. input points (Tolstikhin et al., 2017, Thms. 1 and 6). More generally, when S and krt are Sub Gauss, Sub Exp, or Heavy Tail(ρ) with ρ > 2d, Cor. 1 shows that the kernel thinning provides an MMD error of Od(n 1 (log n)d/2+1 log log n), (log n)d+1 log log n)), and Od(n 1 2 n d 2ρ log n) in probability with output coresets of size n. 
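For reference, the quantity $\mathrm{MMD}_{k_\star}(S_n, S)$ bounded in Thm. 1 (the same quantity kt-swap minimizes over candidates and the vignettes of Sec. 7 report) is computable directly from kernel matrices via Def. 1. A minimal sketch, with a Gaussian kernel standing in for $k_\star$:

```python
import numpy as np

def mmd(X, Y, kernel):
    """MMD_k between the empirical distributions of the rows of X and of Y:
    MMD^2 = mean k(X, X) + mean k(Y, Y) - 2 mean k(X, Y)."""
    kxx = np.array([[kernel(x, xp) for xp in X] for x in X]).mean()
    kyy = np.array([[kernel(y, yp) for yp in Y] for y in Y]).mean()
    kxy = np.array([[kernel(x, y) for y in Y] for x in X]).mean()
    return np.sqrt(max(kxx + kyy - 2 * kxy, 0.0))

k_star = lambda x, y, s2=2.0: np.exp(-np.sum((x - y) ** 2) / (2 * s2))
rng = np.random.default_rng(0)
Sn = rng.standard_normal((256, 2))
print(mmd(Sn, Sn[::16], k_star))   # MMD_k(Sn, standard-thinning coreset of size 16)
```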
In each case, we find that kernel thinning significantly improves upon an Ωp(n 1 4 ) baseline when n is sufficiently large relative to d and, by Rem. 3, is never significantly worse than the baseline when n is small. Our Sub Exp guarantees also resemble the classical quasi-Monte Carlo guarantees for the uniform distribution on [0, 1]d (see Sec. 8.1) but allow for non-uniform and unbounded target distributions P. Thm. 1 also allows us to derive more precise, explicit error bounds for specific kernels. For example, for the popular Gaussian, Mat ern, and B-spline kernels, Tab. 2 provides explicit bounds on each kernel-dependent quantity in Thm. 1: krt , the kernel radii (Rkrt,n, R krt, n), and the inflation factor Mkrt. 4.2 MMD coresets from square-root L coresets Thm. 1 builds on a second key result, proved in App. H, that bounds MMD error for k in terms of L error for the square-root kernel krt. Theorem 2 (MMD guarantee for square-root L approximations). Suppose k satisfying Assump. 1 has a square-root kernel krt satisfying Assump. 3. Then for any distributions µ Kernel Thinning and ν on Rd and scalars r, a, b 0 with a + b = 1, MMDk (µ, ν) edr d 2 µkrt νkrt + 2τkrt(ar)+ 2 k 1 2 max{τµ(br),τν(br)}, (10) where ed (πd/2/Γ(d/2 + 1)) 1 2 decreases super-exponentially in d and τµ(r) µ(Bc(0, r)). Importantly, Thm. 2 implies that any L coreset for the square-root kernel krt, even one not produced by kernel thinning, is also an MMD coreset for the target kernel k with MMD error depending on the tail decay of (krt, µ, ν). Cor. 2 summarizes the implications of Thm. 2 for common classes of tail decay. See App. I for the proof with explicit constants. Corollary 2 (MMD error from square-root L error). Under the setting and assumptions of Thm. 2, define the L error ε µkrt νkrt and the tail decay function eτ(r) τkrt(r)+ k 1/2 max{τµ(r), τν(r)} for r 0. Then the following implications hold for any nonnegative c and ρ. Tail Decay Compact Sub Gauss Sub Exp Heavy Tail(ρ) eτ(r) I(r c) e cr2 e cr r ρ MMDk (µ, ν) d ε ε(log 1 ε) d 4 ε(log 1 Remark 6 (Tail decay from finite moments). By Markov s inequality (Durrett, 2019, Thm. 1.6.4), eτ has (i) Compact decay when krt is Compact and µ and ν have compact support; (ii) Sub Gauss decay when krt is Sub Gauss and EX µ[ec X 2 2], EX ν[ec X 2 2] < ; (iii) Sub Exp decay when krt is Sub Exp and EX µ[ec X 2], EX ν[ec X 2] < ; and (iv) Heavy Tail(ρ) decay when krt is Heavy Tail(ρ) and EX µ[ X ρ 2], EX ν[ X ρ 2] < . Cor. 2 highlights that MMD quality for (k , µ) is of the same order as L quality for (krt, µ) when krt, µ, and the approximation ν have compact support. MMD quality then degrades naturally as the tail behavior worsens. In Secs. 5 and 6, we show that, with high probability, kt-split provides a high-quality L coreset for (krt, Sn) and hence, by Cor. 2, also provides a high-quality MMD coreset for (k , Sn). 5. Self-balancing Hilbert Walk To exploit the L -MMD connection revealed in Thm. 2, we now turn our attention to constructing high-quality thinned L coresets. Our strategy relies on a new Hilbert space generalization of the self-balancing walk of Alweiss et al. (2021). We dedicate this section to defining and analyzing this self-balancing Hilbert walk, and we detail its connection to kernel thinning in Sec. 6. Alweiss et al. (2021, Thm. 
1.2) introduced a randomized algorithm called the selfbalancing walk that takes as input a streaming sequence of Euclidean vectors xi Rd with xi 2 1 and outputs a online sequence of random assignments ηi { 1, 1} satisfying Pn i=1 ηixi p log(d/δ) log(n/δ) with probability at least 1 δ. (11) Dwivedi and Mackey Algorithm 3: Self-balancing Hilbert Walk Input: sequence of functions (fi)n i=1 in a Hilbert space H, threshold sequence (ai)n i=1 ψ0 0 H for i = 1, 2, . . . , n do αi ψi 1, fi H // Compute Hilbert space inner product if |αi| > ai: ψi ψi 1 fi αi/ai else: ηi 1 with probability 1 2(1 αi/ai) and ηi 1 otherwise ψi ψi 1 + ηifi end return ψn, combination of signed input functions Since our ultimate aim is to combine kernel functions, we define a suitable Hilbert space generalization in Alg. 3. Given a streaming sequence of functions fi in an arbitrary Hilbert space H with a norm H, this self-balancing Hilbert walk (SBHW) outputs a streaming sequence of signed function combinations ψi satisfying the following desirable properties established in App. J. Theorem 3 (Self-balancing Hilbert walk properties). Consider the self-balancing Hilbert walk (Alg. 3) with each fi H and ai > 0 and define the sub-Gaussian constants σ2 0 0 and σ2 i σ2 i 1 + fi 2 H 1 + σ2 i 1 a2 i ( fi 2 H 2ai) + i 1. (12) The following properties hold true. (i) Functional sub-Gaussianity: For each i [n], ψi is σi sub-Gaussian: E[exp( ψi, u H)] exp σ2 i u 2 H 2 for all u H. (13) (ii) Signed sum representation: If ai σi 1 fi H p 2 log(2/δi) for δi (0, 1], then, with probability at least 1 Pn i=1 δi, |αi| ai, i [n], and ψn = Pn i=1 ηifi. (14) (iii) Exact halving via symmetrization: If ai σi 1 fi H p 2 log(2/δi) for δi (0, 1] and each fi = g2i 1 g2i for g1, . . . , g2n H, then, with prob. at least 1 Pn i=1 δi, |αi| ai, i [n], and 1 2nψn = 1 2n P2n i=1 gi 1 n P i I gi for I = {2i + ηi 1 2 : i [n]}. (iv) Pointwise sub-Gaussianity in RKHS: If H is the RKHS of a kernel k : X X R, then, for each i [n] and x X, ψi(x) is σi p k(x, x) sub-Gaussian: E[exp(ψi(x))] exp σ2 i k(x,x) Kernel Thinning (v) Sub-Gaussian constant bound: Fix any q [0, 1), and suppose 1 2 fi 2 H ai for all i [n]. If fi 2 H 1+q ai fi 2 H 1 q whenever both σ2 i 1 < a2 i 2ai fi 2 H and fi H > 0, then σ2 i maxj [i] fj 2 H 1 q2 for all i [n]. (15) (vi) Adaptive thresholding: If ai = max(ciσi 1 fi H, fi 2 H) for ci 0, then σ2 n maxi [n] fi 2 H 4 (c + 1/c )2 for c maxi [n] ci. Remark 7. The kernel k in Property (iv) can be arbitrary and need not be bounded. Property (i) ensures that the functions ψi produced by Alg. 3 are mean zero and unlikely to be large in any particular direction u. Property (ii) builds on this functional sub-Gaussianity to ensure that ψn is precisely a sum of the signed input functions fi with high probability. The two properties together imply that, with high probability and an appropriate setting of ai, Alg. 3 partitions the input functions fi into two groups such that the function sums are nearly balanced across the two groups. Property (iii) uses the signed sum representation to construct a two-thinned coreset for any input function sequence (gi)2n i=1. This is achieved by offering the consecutive function differences g2i 1 g2i as the inputs fi to Alg. 3. Property (iv) highlights that functional sub-Gaussianity also implies sub-Gaussianity of the function values ψi(x) whenever the Hilbert space H is an RKHS. 
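Before turning to Properties (v) and (vi), we note that Alg. 3 is short enough to state as runnable code. The sketch below (our own minimal rendering) represents each $f_i$ only through pairwise inner products $\langle f_i, f_j\rangle_{\mathcal{H}}$ and takes the thresholds $a_i$ as given; the crude fixed thresholds in the toy usage are our choice, whereas kernel halving uses the adaptive choice analyzed below.

```python
import numpy as np

def self_balancing_hilbert_walk(gram, thresholds, rng):
    """Alg. 3 with f_i represented implicitly: gram[i, j] = <f_i, f_j>_H.
    Returns coefficients c with psi_n = sum_i c[i] * f_i."""
    n = len(thresholds)
    c = np.zeros(n)                        # psi_0 = 0
    for i in range(n):
        alpha = gram[i, :i] @ c[:i]        # alpha_i = <psi_{i-1}, f_i>_H
        a = thresholds[i]                  # assumed positive, as in Thm. 3
        if abs(alpha) > a:
            c[i] = -alpha / a              # psi_i = psi_{i-1} - f_i * alpha_i / a_i
        else:
            eta = 1.0 if rng.random() <= 0.5 * (1 - alpha / a) else -1.0
            c[i] = eta                     # psi_i = psi_{i-1} + eta_i * f_i
    return c

# Toy usage: 100 random vectors in R^5 viewed as elements of H = R^5.
rng = np.random.default_rng(0)
F = rng.standard_normal((100, 5))
G = F @ F.T                                # Gram matrix of inner products
a = 2 * np.sqrt(2 * np.log(2 * 100)) * np.linalg.norm(F, axis=1)   # crude fixed thresholds
coef = self_balancing_hilbert_walk(G, a, rng)
print(np.linalg.norm(F.T @ coef))          # norm of the signed combination stays small
```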
Finally, Properties (v) and (vi) provide explicit bounds on the sub-Gaussian constants σi when adaptive settings of the thresholds ai are employed. In Sec. 6, we will connect the SBHW to kernel halving and use Properties (iii), (iv), and (vi) together to show that kernel halving and hence also kt-split coresets have provably small L kernel error. Specifically, we will boost the pointwise sub-Gaussianity (Property (iv)) of the output function ψn into a high probability bound for ψn = supx |ψn(x)| = supx | ψn, k(x, ) | by constructing a finite cover for (ψn(x))x Rd based on the decay and smoothness of the kernel k. Comparison with i.i.d. signs A simple alternative to Alg. 3 is to assign signs uniformly at random to each vector, that is, to output ψ n = Pn i=1 η ifi with independent Rademacher η i { 1}. Since the minimal squared sub-Gaussian constant of a sum of independent weighted Rademachers is equal to its variance (Buldygin and Kozachenko, 1980, Lem. 5 & Ex. 1), the minimal squared sub-Gaussian constant of ψ n satisfies σ2 IID = supu H Var( ψ n, u H)/ u 2 H = supu H Pn i=1 fi, u 2 H/ u 2 H. In the best case, all fi are orthogonal and bounded and σ2 IID = maxi [n] fi 2 H does not grow with n; the reader can check that Alg. 3 also reduces to i.i.d. signing in this case. However, in the worst case, all fi are equal, and σ2 IID = n maxi [n] fi 2 H. In contrast, if we choose ai as in Property (vi) with ci = p 2 log(2n/δ), then the SBHW output ψn = Pn i=1 ηifi with probability 1 δ by Property (ii) and has squared sub-Gaussian constant σ2 SBHW = O(log(2n/δ) maxi [n] fi 2 H) in every case by Property (vi). This drop from Ω(n) to O(log n) represents an exponential improvement in worst-case balance over employing i.i.d. signs. Dwivedi and Mackey We can attribute this gain to the carefully chosen updates of Alg. 3. Notice that, on round i, the function ψi 1 is updated only in the fi direction, so it suffices to examine the evolution of ψi 1, fi H. We show in App. J.1 that this evolution takes the form ψi, fi H = (1 fi 2 H/ai) ψi 1, fi H + εi fi 2 H where εi I[|αi| ai](ηi + αi/ai) is mean-zero and 1-sub-Gaussian given ψi 1. In other words, whenever ai fi 2 H (as recommended in Property (vi)), Alg. 3 first shrinks the magnitude of ψi 1 in the fi direction before adding a sub-Gaussian variable in this direction. This targeted shrinkage is absent in the i.i.d. signing update, ψ i, fi H = ψ i 1, fi H + η i fi 2 H, which simply adds a sub-Gaussian variable in the fi direction, and allows the SBHW to maintain a substantially smaller sub-Gaussian constant. General recipe for exact halving The symmetrization construction introduced in Property (iii) can be used to convert any vector balancing algorithm (i.e., any algorithm which assigns 1 signs to a sequence of vectors) into an exact halving algorithm (i.e., one which assigns 1 to exactly half of the points) simply by running the algorithm on paired vector differences. We will use this property in the sequel to painlessly construct coresets of an exact target size. Comparison with the self-balancing walk of Alweiss et al. (2021) In the Euclidean setting with H = Rd, constant thresholds ai = 30 log(n/δ), and ψi 1, fi H the usual Euclidean dot product, Alg. 3 recovers a slight variant of the Euclidean self-balancing walk of Alweiss et al. (2021, Proof of Thm. 1.2). The original algorithm differs only superficially by terminating with failure whenever |αi| > ai. 
We allow the walk to continue with the update ψi ψi 1 fi αi/ai, as it streamlines our sub-Gaussianity analysis and avoids the reliance on distributional symmetry present in Sec. 2.1 of Alweiss et al. (2021). We show in App. R that Thm. 3 recovers the guarantee (11) of Alweiss et al. (2021, Thm. 1.2) with improved constants and a less conservative setting of ai. 6. L Guarantees In this section, we derive near-optimal L coreset guarantees for kernel halving (Alg. 2) and kt-split by relating the two algorithms to the self-balancing Hilbert walk (Alg. 3). 6.1 L guarantees for kernel halving To make the connection between KH and SBHW more apparent, we have translated each line of Alg. 2 into the notation of Alg. 3. In this notation, we see that Alg. 2 forms signed combinations ψi of paired kernel differences fi = k(x2i 1, ) k(x2i, ); that the inner product αi = ψi 1, fi k has a simple explicit form in terms of kernel evaluations; and that, under the event E = {|αi| ai for all i = 1, . . . , n/2 }, the function ψ n/2 of Alg. 2 exactly matches the output of SBHW. Indeed, the function get swap params serves to compute the sub Gaussian constants σi exactly as defined in Thm. 3 and to adaptively select the thresholds ai exactly as recommended in Thm. 3(iii) and (vi): ai = max( fi Hσi 1 p 2 log(2/δi), fi 2 H). Kernel Thinning This choice of ai simultaneously ensures that the KH-SBHW equivalence event E occurs with high probability by Thm. 3(iii) and that the SBHW sub-Gaussian constant remains small by Thm. 3(vi). Hence, we can invoke the pointwise SBHW sub-Gaussianity revealed in Thm. 3(iv) to control the KH L coreset error with high probability. Theorem 4 (L guarantees for kernel halving). Let SKH(k, S, ) denote the output of kernel halving (Alg. 2) with kernel k satisfying Assumps. 2 and 3, input point sequence S, and probability sequence . Let Pn 1 n Pn i=1δxi, and recall the definitions of Mk (8) and RSn,k,n (7). The following statements hold for any m N with m log2 n and δ (0, 1). (a) Kernel halving yields a 2-thinned L coreset: The output S(1) SKH(k, Sn, (δi) n 2 i=1) has size nout = n 2 with Q(1) KH 1 nout P x S(1)δx satisfying Pnk Q(1) KHk k nout Mk(n, 1, d, δ , δ , RSn,k,n) (16) with probability at least 1 δ Pnout i=1 δi for δ mini nout δi. (b) Repeated kernel halving yields a 2m-thinned L coreset: For each j > 1, let S(j) SKH(k, S(j 1), (2j 1 m δi) n/2j i=1 ) be the output of kernel halving recursively applied for j rounds. Then S(m) has size nout = n 2m with Q(m) KH 1 nout P x S(m)δx satisfying Pnk Q(m) KH k k nout Mk(n, m, d, δ , δ , RSn,k,n) (17) with probability at least 1 δ Pm j=1 2j 1 m P2m jnout i=1 δi for δ mini 2mnout δi. Thm. 4, proved in App. K, shows that L error for KH scales simply as the kernel thinning inflation factor Mk (8) divided by the size of the output. Our next corollary, an immediate consequence of Thm. 4(b) and the definition of Mk (8), translates these bounds into rates of decay depending on the radius growth of S and the tail decay of k. Corollary 3 (L rates for kernel halving). Under the notation and assumptions of Thm. 4, consider a sequence of repeated kernel halving runs (Alg. 2), indexed by n N, with m = 1 2 log2 n rounds, log(1/δ ) = O(log n), and Pm j=1 2j 1 m P n/2j i=1 δi = o(1) as n . If S and k respectively satisfy one of the radius growth (Def. 4) and tail decay (Def. 7) conditions in the table below, then Pnk Q(m) KH k = O(ε ,n) in probability where ε ,n is the corresponding table entry and all hidden constants are independent of the dimension d. 
Type of k Type of S Compact with Lk(Rk,n RSn) k =O(1) Compact with RSn =O( Sub Poly with log(Lk Rk,n k )=O(log n) Arbitrary Remark 8 (Example settings for KH L rates). Rem. 5 specifies settings of (δi) n/2 i=1 that meet the conditions of Cor. 3. For any radial kernel k(x, y) = κ( x y 2/σ) with LLipschitz κ : R R, bandwidth σ > 0, and k = 1, we have Lk k L σ , Rk,n σκ (1/n), Dwivedi and Mackey and therefore Lk Rk,n k L κ (1/n) where L and κ (u) sup{r : κ(r) u} are independent of d and σ. Hence Gaussian, Mat ern (with ν > d 2 + 1), and IMQ kernels satisfy the Sub Poly requirements of Cor. 3 with any choice of bandwidth and satisfy the Compact requirements when restricted to a compact domain with σ = d. See App. L for our proof. Near-optimal L coresets For any bounded, radial k satisfying mild decay and smoothness conditions, Phillips and Tai (2020, Thm. 3.1) proved that any procedure outputting a coreset of size n 1 2 must suffer Ω(min( 4 )) L error with constant probability for some Pn. Hence, the repeated KH quality guarantees from Cor. 3 are within a log n factor of minimax optimality for this kernel family which includes Gaussian, Mat ern, Wendland, and IMQ kernels. Online vector balancing in an RKHS In the online vector balancing problem of Spencer (1977), one must assign signs ηi to Euclidean vectors fi in an online fashion while keeping the norm of the signed sum Pn i=1 ηifi as small as possible. As an immediate consequence of Thm. 4(a), we find that kernel halving solves an RKHS generalization of the online vector balancing problem. Corollary 4 (Online vector balancing in an RKHS). Consider a sequence of kernel halving runs (Alg. 2), indexed by n N, with log(1/δ ) = O(log n) and P n/2j i=1 δi = o(1) as n . If the kernel k is Sub Poly (Def. 7) with log(Lk Rk,n k ) = O(log n), then P2n i=1 ϵik(xi, ) = d log n) in probability for the generated signs ϵi η i/2 ( 1)i. Proof By Thm. 4(a) and the definition of Mk (8), Pn i=1 ηifi =O( d log n) in probability. Noting that ηifi =ηi(k(x2i, ) k(x2i 1, ))=ϵ2i 1k(x2i 1, ) + ϵ2ik(x2i, ), the claim follows. Cor. 4 applies to any fixed input point sequence S and to a broad range of kernels including, by Rem. 8, Gaussian, Mat ern, and IMQ kernels, as well as B-spline kernel since (122) and (123) imply that log(Lk Rk,n k ) = O(log d) = O(log n) for B-spline k. 6.2 L and MMD guarantees for KT-SPLIT We finally extend our L and MMD guarantees to the kt-split step of kernel thinning by observing that each candidate coreset generated by kt-split is the product of repeated kernel halving (Alg. 2) with krt as the chosen kernel k. Hence, Thm. 4(a) applies equally to the coreset S(1,1) returned by kt-split with m = 1, and, when m > 1, Thm. 4(b) applies to the coreset S(m,1) produced by kt-split. Combining these L bounds with our L to MMD conversion theorem (Thm. 2) yields the following corollary proved in App. M. Corollary 5 (L and MMD guarantees for kt-split). Consider kt-split with krt satisfying Assumps. 2 and 3, δ mini δi, and Pn 1 n Pn i=1δxi. The guarantees of Thm. 4 with k = krt hold if Q(m,1) KT 1 nout P x S(m,1)δx is substituted for Q(m) KH , and the guarantees of Thm. 1 hold if the output coreset S(m,1) is substituted for SKT, Cor. 5 ensures that, with high probability, at least one kt-split candidate is a highquality L coreset for (krt, Pn) and hence also a high-quality MMD coreset for (k , Pn). Kernel Thinning The proof of Thm. 1 in App. 
E establishes the same MMD guarantee for kernel thinning by noting that the subsequent kt-swap step directly minimizes $\mathrm{MMD}_{k_\star}$ to $P_n$ and hence can only improve or maintain this MMD quality. Finally, exactly as in the proofs of Cors. 1 and 3, we can deduce both $L^\infty$ and MMD rate bounds for kt-split coresets.

Corollary 6 ($L^\infty$ and MMD rates for kt-split). Under the notation and assumptions of Cor. 5, consider a sequence of kt-split runs, indexed by $n \in \mathbb{N}$, with $m = \lfloor\frac12 \log_2 n\rfloor$ rounds, $\log(1/\min_i \delta_i) = O(\log n)$, and $\sum_{j=1}^m 2^{j-1-m} \sum_{i=1}^{\lceil n/2^j \rceil} \delta_i = o(1)$ as $n \to \infty$. The guarantees of Cor. 3 with $k = k_{rt}$ hold if $Q^{(m,1)}_{\mathrm{KT}}$ is substituted for $Q^{(m)}_{\mathrm{KH}}$, and the guarantees of Cor. 1 hold if $S^{(m,1)}$ is substituted for $S_{\mathrm{KT}}$.

7. Vignettes

We complement our primary methodological and theoretical development with two vignettes illustrating the promise of kernel thinning for improving upon (a) i.i.d. sampling in dimensions d = 2 through 100 and (b) standard MCMC thinning when targeting challenging differential equation posteriors. See App. P for supplementary details and https://github.com/microsoft/goodpoints for a Python implementation of kernel thinning and code replicating each vignette.

7.1 Common settings

Throughout, we adopt a Gaussian(σ) target kernel $k_\star(x, y) = \exp(-\frac{1}{2\sigma^2}\|x - y\|_2^2)$ and the corresponding square-root kernel $k_{rt}$ from Tab. 1. To output a coreset of size $\sqrt{n}$ from $n$ input points, we (a) take every $\sqrt{n}$-th point for standard thinning and (b) run kernel thinning (KT) with $m = \lfloor\frac12 \log_2 n\rfloor$ using a standard thinning coreset as the base coreset in kt-swap. For each input sample size $n \in \{2^4, 2^6, \ldots, 2^{14}, 2^{16}\}$ with $\delta_i = \frac{1}{2n}$, we report the mean coreset error $\mathrm{MMD}_{k_\star}(P, S)$ ± 1 standard error across 10 independent replications of the experiment (the standard errors are too small to be visible in all experiments). We additionally regress the log mean MMD onto the log input size using ordinary least squares and display both the best linear fit and an empirical decay rate based on the slope of that fit; e.g., for a slope of −0.25, we report an empirical decay rate of $n^{-0.25}$ for the mean MMD.

7.2 Kernel thinning versus i.i.d. sampling

We first illustrate the benefits of kernel thinning over i.i.d. sampling from a target P. We generate each input sequence $S_n$ i.i.d. from P, use squared kernel bandwidth $\sigma^2 = 2d$, and consider both Gaussian targets $P = \mathcal{N}(0, I_d)$ for $d \in \{2, 4, 10, 100\}$ and mixture-of-Gaussians (MoG) targets $P = \frac{1}{M}\sum_{j=1}^M \mathcal{N}(\mu_j, I_2)$ with $M \in \{4, 6, 8\}$ component locations $\mu_j \in \mathbb{R}^2$ defined in App. P.1.

Fig. 2 highlights the visible differences between the KT and i.i.d. coresets for the MoG targets. Even for small sample sizes, the KT coresets achieve better stratification across components with less clumping and fewer gaps within components, suggestive of a better approximation to P.

Figure 2: Kernel thinning (KT) and i.i.d. coresets of sizes 8, 16, and 32 for 4- and 8-component mixture of Gaussian targets with equidensity contours of the target underlaid. See Sec. 7.2 for more details.

Indeed, when we examine MMD error as a function of coreset size in Fig. 3, we observe that kernel thinning provides a significant improvement across all settings. For the Gaussian d = 2 target and each MoG target, the KT MMD error scales as $n^{-1/2}$, a quadratic improvement over the $\Omega_p(n^{-1/4})$ MMD error of i.i.d. sampling.
Notably, we would not expect to see an exact empirical rate of $n^{-1/2}$ for larger d and small n due to the logarithmic factors in our MMD bounds. However, even for small sample sizes and high dimensions, we observe in Fig. 3(b) that KT significantly improves both the magnitude and the decay rate of MMD relative to i.i.d. sampling.

Figure 3: Kernel thinning versus i.i.d. sampling. For (a) mixture of Gaussian targets with $M \in \{4, 6, 8\}$ components and (b) standard multivariate Gaussian targets in dimension $d \in \{2, 4, 10, 100\}$, kernel thinning (KT) reduces $\mathrm{MMD}_{k_\star}(P, S)$ significantly more quickly than the standard $n^{-1/4}$ rate for $\sqrt{n}$ i.i.d. points, even in high dimensions. For reference, decay rates of $\sqrt{\log n / n}$ and $\log n / \sqrt{n}$ are plotted in each of the figures in panel (b).

7.3 Kernel thinning versus standard MCMC thinning

Next, we illustrate the benefits of kernel thinning over standard Markov chain Monte Carlo (MCMC) thinning on twelve posterior inference experiments conducted by Riabiz et al. (2021). We briefly describe each experiment here and refer the reader to Riabiz et al. (2021, Sec. 4) for more details.

Goodwin and Lotka-Volterra experiments. From Riabiz et al. (2020), we obtain the output of four distinct MCMC procedures targeting each of two d = 4-dimensional posterior distributions P: (1) a posterior over the parameters of the Goodwin model of oscillatory enzymatic control (Goodwin, 1965) and (2) a posterior over the parameters of the Lotka-Volterra model of oscillatory predator-prey evolution (Lotka, 1925; Volterra, 1926). For each target, Riabiz et al. (2020) provide $2 \times 10^6$ sample points from each of four MCMC algorithms: Gaussian random walk (RW), adaptive Gaussian random walk (ada-RW, Haario et al., 1999), Metropolis-adjusted Langevin algorithm (MALA, Roberts and Tweedie, 1996), and pre-conditioned MALA (pMALA, Girolami and Calderhead, 2011).

Hinch experiments. From Riabiz et al. (2020), we also obtain the output of two independent Gaussian random walk MCMC chains for each of two d = 38-dimensional posterior distributions P: (1) a posterior over the parameters of the Hinch model of calcium signalling in cardiac cells (Hinch et al., 2004) and (2) a tempered version of the same posterior, as defined by Riabiz et al. (2021, App. S5.4). In computational cardiology, the calcium signalling model represents one component of a heart simulator, and one aims to propagate uncertainty in the signalling model through the whole heart simulation, an operation which requires thousands of CPU hours per sample point (Riabiz et al., 2021).
To form an input sequence Sn of length n for coreset construction, we downsample the remaining points using standard thinning. Since exact computation of MMDk (P, S) is intractable for these posterior targets, we report MMDk (Sn, S) the error that is controlled directly in our theoretical results for these experiments. We select the kernel bandwidth σ using the popular median heuristic (see, e.g., Garreau et al., 2017). Additional details can be found in App. P.3. Results Fig. 4 compares the mean MMDk (Sn, S) error of the generated kernel thinning and standard thinning coresets. In each of the twelve experiments, KT significantly improves both the rate of decay and the order of magnitude of mean MMD, in line with the guarantees Dwivedi and Mackey 22 24 26 28 Coreset size n 22 24 26 28 Coreset size n Goodwin ADA-RW 22 24 26 28 Coreset size n Goodwin MALA 22 24 26 28 Coreset size n Goodwin p MALA 22 24 26 28 Coreset size n Lotka-Volterra RW 22 24 26 28 Coreset size n Lotka-Volterra ADA-RW 22 24 26 28 Coreset size n Lotka-Volterra MALA 22 24 26 28 Coreset size n Lotka-Volterra p MALA 22 24 26 28 Coreset size n 22 24 26 28 Coreset size n 2 1 Hinch RW 2 22 24 26 28 Coreset size n Tempered Hinch RW 1 22 24 26 28 Coreset size n Tempered Hinch RW 2 Figure 4: Kernel thinning versus standard MCMC thinning. Kernel thinning (KT) significantly improves both the rate of decay and the order of magnitude of mean MMDk (Sn, S) in each posterior inference task, including eight tasks with 4-dimensional targets (Goodwin and Lotka-Volterra) and four tasks with 38-dimensional targets (Hinch). See Sec. 7.3 for more details. of Thm. 1. Notably, in the d = 38-dimensional Hinch experiments, standard thinning already improves upon the n 1 4 rate of i.i.d. subsampling but is outpaced by KT which consistently provides further improvements. 8. Discussion We introduced kernel thinning (Alg. 1), a new, practical solution to the thinned MMD coreset problem that, given O(n2) time and O(n min(d, n)) storage, improves upon the integration error of i.i.d. sampling and standard MCMC thinning. To achieve this we first showed that any L coreset for a square-root kernel krt also provides an MMD coreset for its associated target kernel k (Thm. 2). We next introduced and analyzed a selfbalancing Hilbert walk for solving the online vector balancing problem in Hilbert spaces (Alg. 3 and Thm. 3). We then designed a symmetrized version of SBHW for RKHSes kernel halving that delivers 2-thinned coresets with small L error (Alg. 2 and Thm. 4). Our online algorithm, kt-split, recursively applies kernel halving to a square-root kernel to obtain near-optimal L coresets in O(n2) time with O(n min(d, n)) space (Cors. 3 and 5). Kernel thinning then combines kt-split with a greedy refinement step (kt-swap) to yield Kernel Thinning coresets with better-than-i.i.d. MMD for a broad range of kernels and target distributions (Thm. 1 and Cor. 1). While our analysis is restricted to kernels that admit square-root dominating kernels, Dwivedi and Mackey (2022) recently generalized the KT algorithm and analysis to support arbitrary kernels. Separately, Shetty et al. (2022) have developed a distribution compression meta-algorithm, Compress++, which reduces the runtime of KT to near-linear O(n log3 n) time with MMD error that is worse by at most a factor of 4. Hence, KT-Compress++ can be practically deployed even for very large input sizes. 
8.1 Related work on MMD coresets

While (√n, o_p(n^{-1/4}))-MMD coresets have been developed for specific (P, k⋆) pairings, like the uniform distribution on the unit cube paired with a Sobolev kernel k⋆, to the best of our knowledge, no prior (√n, o_p(n^{-1/4}))-MMD coreset constructions were known for the range of P and k⋆ studied in this work. For comparison, we review here both lower bounds and prior strategies for generating coresets with small MMD.

Lower bounds  For any bounded and radial (i.e., k⋆(x, y) = κ(∥x − y∥_2^2)) kernel satisfying mild decay and smoothness conditions, Phillips and Tai (2020, Thm. 3.1) showed that any procedure outputting coresets of size √n must suffer Ω(min(√(d/n), n^{-1/4})) MMD_{k⋆} for some (discrete) target distribution P. This lower bound applies, for example, to Matérn kernels and to infinitely smooth Gaussian kernels. For any continuous and shift-invariant (i.e., k⋆(x, y) = κ(x − y)) kernel taking on at least two values, Tolstikhin et al. (2017, Thm. 1) showed that any estimator (including non-coreset estimators) based only on n i.i.d. draws from P must suffer C_{k⋆} n^{-1/2} MMD_{k⋆} with probability at least 1/4 for some discrete target P and a constant C_{k⋆} depending only on k⋆. If, in addition, k⋆ is characteristic (i.e., MMD_{k⋆}(µ, ν) = 0 only if µ = ν), then Tolstikhin et al. (2017, Thm. 6) establish the same lower bound for some continuous target P with infinitely differentiable density. These last two lower bounds hold, for example, for Gaussian, Matérn, and B-spline kernels and apply in particular to any thinning algorithm that compresses n i.i.d. sample points without additional knowledge of P. For light-tailed P and krt, the kernel thinning guarantees of Thm. 1 match each of these lower bounds up to factors of log n and constants depending on d.

Order (√n, n^{-1/4})-MMD coresets for general target P  By Prop. A.1 of Tolstikhin et al. (2017), an i.i.d. sample from P yields an order (√n, n^{-1/4})-MMD coreset in probability. Chen et al. (2010) showed that kernel herding with a finite-dimensional kernel (like the linear k⋆(x, y) = ⟨x, y⟩) finds an (√n, (C_{P,k⋆,d} n)^{-1/2})-MMD coreset for an inexplicit parameter C_{P,k⋆,d}. However, Bach et al. (2012) showed that their analysis does not apply to any infinite-dimensional kernel (like the Gaussian, Matérn, and B-spline kernels studied in this work), as C_{P,k⋆,d} would necessarily equal 0. The best known rate for kernel herding with bounded infinite-dimensional kernels (Lacoste-Julien et al., 2015, Thm. G.1) guarantees an order (√n, n^{-1/4})-MMD coreset, matching the i.i.d. guarantee. For bounded kernels, the same guarantee is available for Stein Point MCMC (Chen et al., 2019, Thm. 1), which greedily minimizes MMD over random draws from P (footnote 4), and for a variant of the greedy sign selection algorithm described in Karnin and Liberty (2019, Sec. 3.1) (footnote 5). Slightly inferior guarantees were established for Stein points (Chen et al., 2018, Thm. 1) and Stein thinning (Riabiz et al., 2021, Thm. 1), both of which accommodate unbounded kernels as well.

Finite-dimensional kernels  Harvey and Samadi (2014) construct (√n, O_d(n^{-1/2} log^{2.5} n))-MMD coresets for finite-dimensional linear kernels on R^d but do not address infinite-dimensional kernels.

4. To bound MMD_{k⋆} using Chen et al. (2019, Thm. 1), choose k_0(x, y) = k⋆(x, y) − Pk⋆(x) − Pk⋆(y) + PPk⋆.
5. The statement of Karnin and Liberty (2019, Thm. 24) bounds a different quantity, but the proof bounds MMD_{k⋆}.
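To make the centering in footnote 4 concrete, the kernel k_0(x, y) = k⋆(x, y) − Pk⋆(x) − Pk⋆(y) + PPk⋆ can be approximated by replacing the intractable kernel means Pk⋆ and PPk⋆ with empirical averages over a sample from P. The sketch below is a hedged illustration of that empirical centering under a Gaussian kernel assumption, not the construction used by Chen et al. (2019).

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-cdist(x, y, "sqeuclidean") / (2 * sigma**2))

def centered_kernel_matrix(x, p_sample, sigma=1.0):
    """Empirical version of k0(x, y) = k(x, y) - Pk(x) - Pk(y) + PPk,
    with the kernel means Pk and PPk estimated from p_sample."""
    k_xx = gaussian_kernel(x, x, sigma)                        # k(x_i, x_j)
    k_xp = gaussian_kernel(x, p_sample, sigma).mean(axis=1)    # approx Pk(x_i)
    k_pp = gaussian_kernel(p_sample, p_sample, sigma).mean()   # approx PPk
    return k_xx - k_xp[:, None] - k_xp[None, :] + k_pp

rng = np.random.default_rng(2)
p_sample = rng.normal(size=(500, 2))   # stand-in sample from P
x = rng.normal(size=(10, 2))
K0 = centered_kernel_matrix(x, p_sample)
print(K0.shape)  # (10, 10)
```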
Uniform distribution on [0, 1]^d  The explicit low-discrepancy quasi-Monte Carlo (QMC) construction of Chen and Skriganov (2002) provides a (√n, O_d(n^{-1/2} log^{(d-1)/2} n))-MMD coreset for an L2 discrepancy kernel when P is the uniform distribution on the unit cube [0, 1]^d. For the same target, the online Haar strategy of Dwivedi et al. (2019) yields an (√n, O_d(n^{-1/2} log^{2d} n))-MMD coreset in probability. Dwivedi et al. (2019) also conjecture that a greedy variant of their Haar strategy would provide an improved (√n, O_d(n^{-1/2} log^d n))-MMD coreset. These constructions satisfy our quality criteria but are tailored specifically to the uniform distribution on the unit cube.

Unknown coreset quality  On compact manifolds, optimal coresets of size √n minimize the weighted Riesz energy (a form of relative MMD with a weighted Riesz kernel) at known rates (Borodachov et al., 2014); however, practical minimum Riesz energy (Borodachov et al., 2014) and minimum energy design (Joseph et al., 2015, 2019) constructions have not been analyzed. When k⋆ is nonnegative and the kernel matrix (k⋆(x_i, x_j))_{i,j=1}^n satisfies a strong diagonal dominance condition, Kim et al. (2016, Cor. 3, Thm. 6) show that greedy optimization of MMD_{k⋆} yields an MMD-critic coreset Ŝ of size √n with MMD^2_{k⋆}(Pn, Ŝ) bounded in terms of (1 − 1/e), Pn Pn k⋆, and MMD⋆ ≜ min_{|S|=√n} MMD_{k⋆}(Pn, S). In the usual case when Pn Pn k⋆ = Ω(1), this error bound does not decay to 0 with n. Paige et al. (2016) analyze the impact of approximating a kernel in super-sampling with a reservoir but do not analyze the quality of the constructed MMD coreset. For the conditionally positive definite energy distance kernel, Mak and Joseph (2018) establish that an optimal coreset of size √n has o(n^{-1/4}) MMD but do not provide a construction; in addition, Mak and Joseph (2018) propose two support points convex-concave procedures for constructing MMD coresets but do not establish their optimality and do not analyze their quality.

8.2 Related work on weighted MMD coresets

Coresets satisfy a number of valuable constraints that are critical for some downstream applications: exact approximation of constants, automatic preservation of convex integrand constraints, compatibility with unweighted downstream tasks, easy visualization, straightforward sampling, and increased numerical stability against errors in integral evaluations (Karvonen et al., 2019). Nevertheless, some applications also support weighted coreset approximations of P of the form Σ_{i=1}^n w_i δ_{x_i} for weights w_i ∈ R that need not be equal, need not be nonnegative, and need not sum to 1. Notably, weighted coresets that depend on P only through an i.i.d. sample of size n are subject to the same Ω(n^{-1/2}) MMD lower bounds of Tolstikhin et al. (2017) described in Sec. 8.1. Any constructions that violate these bounds do so only by exploiting additional information about P (for example, exact knowledge of Pk⋆) that is not generally available and not required for our kernel thinning guarantees. Moreover, while weighted coresets need not provide satisfactory solutions to the unweighted coreset problem studied in this work, kernel thinning coreset points can be converted into an optimally weighted coreset of no worse quality by explicitly minimizing MMD_{k⋆}(Pn, Σ_{i} w_i δ_{x_i}) or, if computable, MMD_{k⋆}(P, Σ_{i} w_i δ_{x_i}) over the weights w_i in O(n^{3/2}) time; a minimal sketch of this reweighting appears below. With this context, we now review known weighted MMD coreset guarantees.
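Because the squared MMD between Pn and a weighted coreset is a convex quadratic in the weights, the optimal reweighting mentioned above reduces to solving a small linear system. The sketch below illustrates this under the assumption of a Gaussian kernel; it is not the authors' implementation, and a small ridge term is added for numerical stability.

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-cdist(x, y, "sqeuclidean") / (2 * sigma**2))

def optimal_weights(coreset, full_sample, sigma=1.0, ridge=1e-8):
    """Weights minimizing MMD_k(P_n, sum_i w_i delta_{x_i}) over w in R^m.

    The squared MMD equals w' K w - 2 w' z + const with
    K_ij = k(x_i, x_j) and z_i = mean_j k(x_i, y_j), so the
    unconstrained minimizer solves K w = z (ridge added for stability).
    """
    K = gaussian_kernel(coreset, coreset, sigma)
    z = gaussian_kernel(coreset, full_sample, sigma).mean(axis=1)
    m = len(coreset)
    return np.linalg.solve(K + ridge * np.eye(m), z)

rng = np.random.default_rng(3)
full = rng.normal(size=(1024, 2))   # stand-in for the input sample P_n
core = full[::32]                   # a 32-point coreset
w = optimal_weights(core, full)
print(w.sum())                      # need not equal 1, per the text
```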
We highlight that only one of the weighted (√n, o(n^{-1/4}))-MMD guarantees covers the unbounded distributions addressed in this work and that the single unbounded guarantee relies on a restrictive uniformly bounded eigenfunction assumption that is typically not satisfied. In other words, our analysis establishes MMD improvements for practical (k⋆, P) pairings not covered by prior weighted analyses.

P with bounded support  If the target P has bounded density and bounded, regular support and k⋆ is a Gaussian or Matérn kernel, then Bayesian quadrature (O'Hagan, 1991) and Bayes-Sard cubature (Karvonen et al., 2018) with quasi-uniform unisolvent point sets yield weighted (√n, o(n^{-1/4}))-MMD coresets by Wendland (2004, Thm. 11.22 and Cor. 11.33). If P has bounded support and k⋆ has more than d continuous derivatives, then the P-greedy algorithm (De Marchi et al., 2005) also yields weighted (√n, o(n^{-1/4}))-MMD coresets by Santin and Haasdonk (2017, Thm. 4.1). For (k⋆, P) pairs with compact support and sufficiently rapid eigenvalue decay, approximate continuous volume sampling kernel quadrature (Belhadji et al., 2020) using the Gibbs sampler of Rezaei and Gharan (2019) yields weighted coresets with o(n^{-1/4}) root mean squared MMD.

Finite-dimensional kernels with compactly supported P  For compactly supported P, Briol et al. (2015, Thm. 1) and Bach et al. (2012, Prop. 1) show that Frank-Wolfe Bayesian quadrature and weighted variants of kernel herding respectively yield weighted (√n, o(n^{-1/4}))-MMD coresets for continuous finite-dimensional kernels, but, by Bach et al. (2012, Prop. 2), these analyses do not extend to infinite-dimensional kernels, like the Gaussian, Matérn, and B-spline kernels studied in this work.

Eigenfunction restrictions  For (k⋆, P) pairs with known Mercer eigenfunctions, Belhadji et al. (2019) bound the expected squared MMD of determinantal point process (DPP) kernel quadrature in terms of kernel eigenvalue decay and provide explicit rates for univariate Gaussian P and uniform P on [0, 1]. Their construction makes explicit use of the kernel eigenfunctions, which are not available for most (k⋆, P) pairings. For (k⋆, P) pairs with Pk⋆ = 0, uniformly bounded eigenfunctions, and rapidly decaying eigenvalues, Liu and Lee (2017, App. B.2) prove that black-box importance sampling generates probability-weighted coresets with o(n^{-1/4}) root mean squared MMD but do not provide any examples verifying their assumptions. The uniformly bounded eigenfunction condition is considered particularly difficult to check (Steinwart and Scovel, 2012), does not hold for Gaussian kernels with Gaussian P (Minh, 2010, Thm. 1), and need not hold even for infinitely smooth univariate kernels on [0, 1] (Zhou, 2002, Ex. 1).

Unknown coreset quality  Huszár and Duvenaud (2012, Prop. 2) bound the MMD error of weighted sequential Bayesian quadrature coresets using weak submodularity, but this bound does not decay to zero with n. Khanna and Mahoney (2019, Thm. 2) prove that weighted kernel herding yields a weighted (√n, exp(−√n/κ_n))-MMD coreset. However, the κ_n term in Khanna and Mahoney (2019, Thm. 3, Assum. 2) is at least as large as the condition number of an n × n kernel matrix, which for typical kernels (including the Gaussian and Matérn kernels) is Ω(√n) (Koltchinskii and Giné, 2000; El Karoui, 2010); the resulting MMD error bound therefore does not decay with n. The ProtoGreedy and ProtoDash algorithms of Gurumoorthy et al. (2019, Thms.
IV.3 and IV.5) yield nonnegative weighted coresets Ŝ of size √n satisfying MMD^2_{k⋆}(Pn, Ŝ) ≤ MMD⋆^2 + (Pn Pn k⋆ − MMD⋆^2) e^{−λ√n}, where MMD⋆ is the optimal MMD error to Pn for a nonnegatively weighted coreset of size √n. However, careful inspection reveals that λ√n ≤ 1 for any kernel and any n. Hence, in the usual case in which Pn Pn k⋆ = Ω(1), this error bound does not decay to 0 with n. Campbell and Broderick (2019, Thm. 4.4) prove that Hilbert coresets via Frank-Wolfe with n input points yield weighted order (√n, ν_n^{√n})-MMD coresets for some ν_n < 1 but do not analyze the dependence of ν_n on n.

Non-MMD guarantees  For P with continuously differentiable Lebesgue density and k⋆ a bounded Langevin Stein kernel with Pk⋆ = 0, Thm. 2 of Oates et al. (2017) does not bound MMD but does prove that a randomized control functionals weighted coreset satisfies √(E[(Ef − Σ_i w_i f(x_i))^2]) ≤ C_{P,k⋆,d,f} n^{−7/24} for each f in the RKHS of k⋆ and an unspecified C_{P,k⋆,d,f}. This bound is asymptotically better than the Ω(n^{-1/4}) guarantee for unweighted i.i.d. coresets but worse than the unweighted kernel thinning guarantees of Thm. 1. On compact domains, Thm. 1 of Oates et al. (2019) establishes improved rates for the same weighted coreset when both P and k⋆ are sufficiently smooth. Bardenet and Hardy (2020) establish an n^{−1/4−1/(4d)} asymptotic decay of Ef − Σ_i w_i f(x_i) for DPP kernel quadrature with P on [−1, 1]^d and each f in the RKHS of a particular kernel.

8.3 Related work on L∞ coresets

A number of alternative strategies are available for constructing coresets with L∞ guarantees. For example, for any bounded krt, Cauchy-Schwarz and the reproducing property imply that ∥(P − Pn)krt∥∞ = sup_{z∈R^d} |⟨krt(z, ·), Pkrt − Pnkrt⟩_{krt}| ≤ √(∥krt∥∞) MMD_{krt}(P, Pn), so that all of the order (√n, n^{-1/4})-MMD coreset constructions discussed in Sec. 8.1 also yield order (√n, n^{-1/4})-L∞ coresets. However, none of those constructions is known to provide a (√n, o(n^{-1/4}))-L∞ coreset. A series of breakthroughs due to Joshi et al. (2011); Phillips (2013); Phillips and Tai (2018, 2020); Tai (2020) has led to a sequence of increasingly compressed (√n, o(n^{-1/4}))-L∞ coreset constructions, with the best known guarantees currently due to Phillips and Tai (2020) and Tai (2020). Given n input points, Phillips and Tai (2020) developed an offline, polynomial-time construction to find an (√n, O_p(n^{-1/2}√(d log n)))-L∞ coreset for Lipschitz kernels exhibiting suitable decay, while Tai (2020) developed an offline construction for Gaussian kernels that runs in Ω(d^{5d}) time and yields an (√n, O_p(2^d n^{-1/2} log(d log n)))-L∞ coreset. More details on these constructions, based on the Gram-Schmidt walk of Bansal et al. (2018), can be found in App. S. Notably, the Phillips and Tai (hereafter, PT) guarantee is tighter than that of Thm. 4 by a factor of log log n for sub-Gaussian kernels and input points and log n for heavy-tailed kernels and input points. Similarly, the Tai guarantee provides an improvement when n is doubly-exponential in the dimension, that is, when d log n = Ω(2^d). Moreover, by Thm. 2, we may apply the PT and Tai constructions to a square-root kernel krt to obtain comparable MMD guarantees for the target kernel k⋆ with high probability. However, kernel thinning has a number of practical advantages that lead us to recommend it.
First, with n input points and using standard matrix multiplication, the PT and Tai constructions have Ω(n^4) computational complexity and Ω(n^2) storage costs, a substantial increase over the O(n^2) running time and O(n min(d, n)) storage of kernel thinning. Second, kt-split is an online algorithm, while the PT and Tai constructions require the entire set of input points to be available a priori. Finally, each halving round of kt-split splits the sample size exactly in half, allowing the user to run all m halving rounds simultaneously; the PT and Tai constructions require a rebalancing step after each round, forcing the halving rounds to be conducted sequentially.

8.4 Future directions

Several other opportunities for future development recommend themselves. First, since our results cover any target P with at least 2d moments (even discrete and other non-smooth targets), a natural question is whether tighter error bounds with better sample complexities are available when P is also known to have a smooth Lebesgue density. Second, the MMD to L∞ reduction in Thm. 2 applies also to weighted L∞ coresets, and, in applications in which weighted point sets are supported, we would expect either quality or compression improvements from employing non-uniform weights (see, e.g., Turner et al., 2021).

Table of contents

1 Introduction 1
2 Input Point and Kernel Requirements 4
3 Kernel Thinning 7
4 MMD Guarantees 10
5 Self-balancing Hilbert Walk 13
6 L∞ Guarantees 16
7 Vignettes 19
8 Discussion 22
A Appendix Notation 28
B Proof of Prop. 1: MMD guarantee for MCMC 29
C Proof of Prop. 2: Almost sure radius growth 30
D Proof of Prop. 3: Shift-invariant square-root kernels 33
E Proof of Thm. 1: MMD guarantee for kernel thinning 34
F Proof of Rem. 4: Finite-time and anytime guarantees 34
G Proof of Cor. 1: MMD rates for kernel thinning 34
H Proof of Thm. 2: MMD guarantee for square-root L∞ approximations 35
I Proof of Cor. 2: MMD error from square-root L∞ error 37
J Proof of Thm. 3: Self-balancing Hilbert walk properties 37
K Proof of Thm. 4: L∞ guarantees for kernel halving 41
L Proof of Rem. 8: Example settings for KH L∞ rates 53
M Proof of Cor. 5: L∞ and MMD guarantees for kt-split 53
N Derivation of Tab. 1: Square-root kernels krt for common target kernels k⋆ 54
O Derivation of Tab. 2: Explicit bounds on Thm. 1 quantities for common kernels 56
P Supplementary Details for Vignettes of Sec. 7 64
Q Kernel Thinning with Square-root Dominating Kernels 66
R Online Vector Balancing in Euclidean Space 69
S L∞ Coresets of Phillips and Tai (2020) and Tai (2020) 70

Appendix A. Appendix Notation

For each p ≥ 1, we define Lp as the set of measurable g : R^d → R with ∥g∥_{Lp} ≜ (∫ |g(x)|^p dx)^{1/p} < ∞ and C^p as the set of g : R^d → R for which all partial derivatives of order p exist and are continuous. For a kernel k : R^d × R^d → R, we also write k ∈ L2,∞ to indicate that k is measurable with finite

∥k∥_{L2,∞} ≜ sup_{x∈R^d} (∫ k^2(x, y) dy)^{1/2} = sup_{x∈R^d} ∥k(x, ·)∥_{L2}. (18)

Throughout, we follow the unitary angular frequency convention of Wendland (2004, Def. 5.15) and define the Fourier transform F(f) of an integrable complex function f : R^d → C via

F(f)(ω) ≜ (2π)^{−d/2} ∫_{R^d} f(x) e^{−i⟨x,ω⟩} dx for all ω ∈ R^d. (19)

Appendix B. Proof of Prop. 1: MMD guarantee for MCMC

By Douc et al. (2018, Lem. 9.3.9, Cor. 9.2.16), a homogeneous ϕ-irreducible geometrically ergodic Markov chain with stationary distribution P is also aperiodic with a unique stationary distribution (footnote 6). Since (xi)_{i=0}^∞ are the iterates of such a chain, there exist, by Gallegos-Herrada et al. (2023, Thm.
1xi), constants ρ (0, 1) and τ < and a measurable P-almost everywhere finite function V : Rd [1, ] satisfying PV < and measurable h: |h(x)| V (x) 1, x Rd |E[h(xi) | x0 = x] Ph| τV (x)ρi, for all x Rd and i N. (20) Since V is finite P-almost everywhere, we will choose c(x) = V (x) = to ensure that our claim is (vacuously) true whenever V (x0) = . Hereafter, suppose V (x0) < . Since the Markov chain is irreducible and aperiodic with a unique stationary distribution P, Assump. H1 of Havet et al. (2020) is satisfied. Hence, by an application of Havet et al. (2020, Prop. 2.1) with V = V ρg and ζ = 2/ρg for sufficiently large g N, there exists a set C(x0) Rd that contains x0 and satisfies Assumps. H2 and H3 of Havet et al. (2020). Now fix any y1, . . . , yn, z1, . . . , zn Rd. We invoke the definition of MMD (1), the triangle inequality, the reproducing property of an RKHS (Steinwart and Christmann, 2008, Def. 4.18), and Cauchy-Schwarz in turn to deduce a bounded differences property for MMD: n Pn i=1 δyi) MMDk (P, 1 n Pn i=1 δzi) = sup f k 1 |Pf 1 n Pn i=1 f(yi)| sup f k 1 |Pf 1 n Pn i=1 f(zi)| sup f k 1 | 1 n Pn i=1 f(yi) f(zi)| = sup f k 1 1 n Pn i=1 | k (yi, ) k (zi, ), f k | sup f k 1 1 n Pn i=1 k (yi, ) k (zi, ) k f k = sup f k 1 1 n Pn i=1 p k (yi, yi) + k (zi, zi) 2k (yi, zi) f k 2 1 2 Pn i=1 I[yi = zi]. Since x0 belongs to a set C(x0) satisfying Assumps. H2 and H3 of Havet et al. (2020), Mc Diarmid s inequality for geometrically ergodic Markov chains (Havet et al., 2020, Thm. 3.1) implies that, with probability at least 1 δ conditional on x0, MMDk (P, Pn) E[MMDk (P, Pn) | x0] + p c1(x0) k log(1/δ)/n where c1(x0) is a finite value depending only on the transition probabilities of the chain and the set C(x0). Now, define the P centered kernel k P(x, y) = k (x, y) Pk (x) Pk (y) + PPk . To bound the expectation, we will use a slight modification of Lem. 3 of Riabiz et al. (2021). The original lemma used the assumption of V -uniform ergodicity (Meyn and Tweedie, 2012, Defn. (16.0.1)) and the assumption V (x) p k P(x, x) solely to argue that, for some R > 0, |E[f(xi) | x0 = x] Pf| RV (x)ρi for all x Rd and f Hk P with f k P = 1. 6. In Havet et al. (2020); Douc et al. (2018, Def. 9.2.1) the term irreducible is synonymous with ϕ-irreducible as defined by Gallegos-Herrada et al. (2023, Sec. 2). Dwivedi and Mackey In our case, since k P is bounded and any f Hk P with f k P = 1 satisfies |f(x)| = | k P(x, ), f k P| k P(x, ) k P f k P = p k P(x, x) p k P for all x Rd by the reproducing property and Cauchy-Schwarz, the geometric ergodicity property (27) implies the analogous bound |E[f(xi) | x0 = x] Pf| τ p k P V (x)ρi for all x Rd and f Hk P with f k P = 1. Hence, the conclusions of Riabiz et al. (2021, Lem. 3) with R = τ p k P hold under our assumptions. Jensen s inequality and the conclusion of Lem. 3 of Riabiz et al. (2021) now yield the sure bound E[ MMDk (P, Pn) | x0]2 E[MMDk (P, Pn)2 | x0] n2 Pn i=1 k P(xi, xi) + 1 n2 Pn i=1 P j =i k P(xi, xj) | x0] n k P (1 + 2τρ 1 ρ 1 n Pn 1 i=1 E[V (xi) | x0]) 4 n k (1 + 2τρ 1 ρ 1 n Pn 1 i=1 E[V (xi) | x0]). Now, define c2(x0) supn N 1 n Pn 1 i=1 E[V (xi) | x0]. Since V (x0) < , the geometric ergodicity property (27) and the fact that PV < imply c2(x0) PV + V (x0) supn N 1 n Pn 1 i=1 ρi PV + V (x0)ρ < . Taking c(x0) = 2 max(c1(x0), 4c2(x0)(1 + 2τρ 1 ρ)) completes the proof. Appendix C. Proof of Prop. 2: Almost sure radius growth We prove this result for the identically distributed case in App. C.1 and for the Markov chain case in App. C.2. 
C.1 Radius growth for identically distributed sequences Suppose x0 and S are drawn identically from P. Claim (a) is true by definition. To establish the remaining claims, we use the following more general result proved in App. C.1.1. Lemma 5 (Growth rate for identically distributed sequence). Consider a sequence of identically distributed random variables (Yi) i=1 on R and a measurable function ψ : R R with an increasing inverse function ψ 1. If ψ(Y1) 0 almost surely, then Pr(maxi n Yi > ψ 1(n) for some n N) E(ψ(Y1)). Consequently, if E(ψ(Y1)) < , then, for any δ (0, 1], Pr(maxi n Yi ψ 1(n E[ψ(Y1)] δ ) for all n N) 1 δ. Fix any δ (0, 1]. Thm. 5 with Yi = xi 2 implies that, with probability 1 δ, RSn ψ 1(n E[ψ( x1 2)] δ ) for all n N for ψ 1(r) = log r c in case (b), ψ 1(r) = log r c in case (c), and ψ 1(r) = r1/ρ in case (d). As a result we have, with probability 1 δ, RSn = Od( log n) in case (b), RSn = Od(log n) in case (c), and RSn = Od(n1/ρ) in case (d). Since δ is arbitrary, these orders hold with probability 1 as claimed. Kernel Thinning C.1.1 Proof of Thm. 5: Growth rate for identically distributed sequence We make use of three lemmas. The first rewrites maximum exceedance events in terms of individual variable exceedance events when thresholds are nondecreasing. Lemma 6 (Exceedance equivalence). For any real-valued (ai) i=1 and nondecreasing (bi) i=1, maxi n ai > bn for some n N ai > bi for some i N. (21) Proof The part follows immediately. To prove the part, suppose maxi n ai > bn for some n . Then there exists an i n with ai > bn bi since (bi) i=1 is nondecreasing. The second bounds the probability of growth rate violation for any sequence of random variables in terms of a sum of exceedance probabilities. Lemma 7 (Growth rate for arbitrary sequence). For any sequence of random variables (Yi) i=1 on R and a nondecreasing real-valued sequence (bi) i=1, we have Pr(maxi n Yi > bn for some n N) = Pr(Yi > bi for some i N) P i=1 Pr(Yi > bi). Proof The result follows from immediately from Thm. 6 and the union bound. The third lemma bounds a sum of exceedance probabilities whenever the random variables are identically distributed and nonnegative. Lemma 8 (Bounding exceedances with expectations). If the random variables (Zi) i=1 are identically distributed and almost surely nonnegative, then P i=1 Pr(Zi > i) E(Z1). (22) Proof Since Z1 is almost surely nonnegative, we have Z1 = R Z1 0 dt = R 0 I(Z1 > t)dt almost surely. Tonelli s theorem (Mukherjea, 1972, Thm. 1) therefore implies that E(Z1) = E( R 0 I(Z1 > t)dt) = R 0 Pr(Z1 > t)dt R 0 Pr(Z1 > t )dt = P i=1 Pr(Z1 > i) = P i=1 Pr(Zi > i) where the final inequality uses the identically distributed assumption. Since ψ(Y1) 0 almost surely and ψ 1 is increasing, we invoke Thm. 8 with Zi = ψ(Yi), the invertibility of ψ, and Thm. 7 with bi = ψ 1(i) in turn to conclude E(ψ(Y1)) (26) P i=1 Pr(ψ(Yi) > i) = P i=1 Pr(Yi > ψ 1(i)) (21) Pr(maxi n Yi > ψ 1(n) for some n N). Dwivedi and Mackey C.2 Radius growth for MCMC Now suppose x0 and S are the iterates of a homogeneous ϕ-irreducible geometrically ergodic Markov chain with initial state x0, subsequent iterates S , and stationary distribution P. Our claims will follow from the following more detailed result proved in App. C.2.1. Lemma 9 (Growth rate for MCMC). Consider a homogeneous ϕ-irreducible geometrically ergodic Markov chain with initial state x0, subsequent iterates (xi) i=1, and stationary distribution P. 
There exist constants ρ (0, 1) and τ < and a measurable P-almost everywhere finite function V : Rd [1, ] such that, for any index j N, measurable function g : Rd R, measurable nonnegative function ψ on R with increasing inverse function ψ 1, and X P, Pr(maxi n g(xi) > ψ 1(n) for some n N | x0) (23) E(ψ(g(X))) + τρj+1 1 ρ V (x0) + Pj i=1 Pr(g(xi) > ψ 1(i) | x0). Now suppose V (x0) < , and fix any measurable nonnegative ψ on R with increasing ψ 1 and any measurable g : Rd R. If E(ψ(g(X))) < , then, for any δ (0, 1], there exists a constant cδ,ψ g(x0) (0, ) such that Pr(maxi n g(xi) ψ 1(cδ,ψ g(x0)n) for all n N | x0) 1 δ. (24) Moreover, if P is compactly supported, then for any δ (0, 1], there exists a constant cδ(x0) (0, ) such that Pr(supi N xi 2 cδ(x0) | x0) 1 δ. (25) Instantiate the function V from Thm. 9, and suppose that V (x0) < , an event that holds for P-almost every x0. Claims (b) to (d) then follow by invoking the time-uniform tail bound (24) with g = 2 and proceeding as in App. C.1. Finally, claim (a) follows from the bound (25), which establishes xi 2 = Od(1) with probability 1 conditional on x0. C.2.1 Proof of Thm. 9: Growth rate for MCMC The proof closely parallels that of Thm. 5 except that we substitute the following estimate for Thm. 8. Lemma 10 (Bounding MCMC exceedances with expectations). Consider a homogeneous ϕ-irreducible geometrically ergodic Markov chain with initial state x0, subsequent iterates (xi) i=1, and stationary distribution P. There exist constants ρ (0, 1) and τ < and a measurable P-almost everywhere finite function V : Rd [1, ] such that, for any index j N, measurable nonnegative function f on Rd, and X P, P i=j P(f(xi) > i | x0) E(f(X)) + τρj 1 ρV (x0). (26) Proof By Douc et al. (2018, Lem. 9.3.9), a homogeneous ϕ-irreducible geometrically ergodic Markov chain with stationary distribution P is also aperiodic. By Gallegos-Herrada et al. Kernel Thinning (2023, Thm. 1xi), there exist constants ρ (0, 1) and τ < and a measurable P-almost everywhere finite function V : Rd [1, ] satisfying sup h:|h(x)| V (x), x Rd |E[h(xi) | x0 = x] Ph| τV (x)ρi for all x Rd. (27) Applying this result to the functions hi(x) I[f(x) > i], we find that P i=j P(f(xi) > i | x0) P i=j P(f(X) > i) + P i=j τV (x0)ρi E[f(X)] + τρj where the final inequality uses Thm. 8. Fix any V , ρ, and τ satisfying the conclusions of Thm. 10, any measurable g : Rd R, and any measurable nonnegative ψ on R with increasing ψ 1. The first claim (23) follows by applying Thm. 7 with bi = ψ 1(i), the assumed invertiblity and strict monotonicity of ψ 1, and Thm. 10 with f = ψ g in turn to find that P(maxi n g(xi) > ψ 1(n) for some n | x0) P i=1 P(g(xi) > ψ 1(i) | x0) = P i=1 P(ψ(g(xi)) > i | x0) Pj i=1 P(ψ(g(xi)) > i | x0) + E(ψ(g(X))) + τρj+1 1 ρ V (x0). Now suppose V (x0) < , fix any δ (0, 1], and let j be the smallest positive index with τρj+1 1 ρ V (x0) < δ 3. Since each ψ(g(xi)) is a tight random variable given x0, we can additionally choose a constant cδ,ψ g(x0) (0, ) satisfying Pj i=1 P(ψ(g(xi))/cδ,ψ g(x0) > i | x0) < δ 3 and E(ψ(g(X)))/cδ,ψ g(x0) < δ 3. The claim (24) now follows by applying the initial result (23) to the function ψ/cδ,ψ g(x0). To establish the final claim (25), suppose that P is compactly supported. Since P has compact support and each xi 2 is a tight random variable, there exists a constant cδ (0, ) satisfying Pr X P( X 2 > cδ) = 0 and Pj i=1 P(ψ( xi 2) > cδ | x0) < δ 2. 
The union bound and the geometric ergodicity property (27) applied to the function h(x) = I( x 2 > cδ) with Ph = 0 now imply Pr(supi N xi 2 > cδ | x0) P i=1 Pr( xi 2 > cδ | x0) Pj i=1 Pr( xi 2 > cδ | x0) + τV (x0) P i=j+1 ρi = Pj i=1 Pr( xi 2 > cδ | x0) + τρj+1 1 ρ V (x0) < δ Appendix D. Proof of Prop. 3: Shift-invariant square-root kernels Bochner s theorem (Bochner, 1933; Wendland, 2004, Thm. 6.6) implies that krt is a kernel since κrt is the Fourier transform of a finite Borel measure with Lebesgue density bκ. Moreover, as krt(x, ) = 1 (2π)d/4 F(e i ,x bκ) and e i ,x bκ is integrable and square integrable, the Plancherel-Parseval identity (Wendland, 2004, Proof of Thm. 5.23) implies that R Rd krt(x, z)krt(y, z)dz = R Rd 1 (2π)d/4 e i ω,x p bκ(ω) 1 (2π)d/4 ei ω,y p = 1 (2π)d/2 R Rd e i ω,x y bκ(ω)dω = k (x, y) confirming that krt is a square-root kernel of k . Dwivedi and Mackey Appendix E. Proof of Thm. 1: MMD guarantee for kernel thinning By design, kt-swap ensures MMDk (Sn, SKT) MMDk (Sn, S(m,1)), where S(m,1) denotes the first coreset returned by kt-split. Next, applying Cor. 5, in particular, the bound (85) yields the desired claim. Appendix F. Proof of Rem. 4: Finite-time and anytime guarantees We prove the three claims one by one. Finite time guarantee For the case with known n, the claim follows simply by noting that Pm j=1 2j 1 m P2m j n/2m i=1 δ n = Pm j=1 2m 2n Pm j=1 2m Any time guarantee When the input size n is not known in advance but is chosen independently of the randomness in kernel thinning, we first note that P i=1 1 (i+1) log2(i+1) (i) 2, and Pm j=1 2j = 2m+1 2 2m+1. (28) where step (i) can be verified using mathematical programming software. Therefore, for any n N, with δi = mδ 2m+2(i+1) log2(i+1), we have Pm j=1 2j 1 m P2m j n/2m i=1 δi Pm j=1 2j 1 m P i=1 δi = Pm j=1 2j 1 m P i=1 mδ 2m+2(i+1) log2(i+1) = δ 2m+3 Pm j=1 2j P i=1 1 (i+1) log2(i+1) (28) δ 2m+3 2m+1 2 δ Upper bound on δ The probability lower bound in Thm. 1 is 1 δ Pm j=1 2j 1 m P2m j n/2m i=1 δi, which is non-negative only if 1 Pm j=1 2j 1 m P2m j n/2m i=1 δi 2m which holds only if δ 2 2m n/2m 6m 2m since m [1, log2 n]. The claim follows. Appendix G. Proof of Cor. 1: MMD rates for kernel thinning Repeating arguments similar to those deriving (125) in App. N, we find that for the advertised choices of m and for any fixed δ such that δ = δ 2 and log(1/δ ) = O(log(n/δ)), the Kernel Thinning RHS of the bound (9) on MMD from Thm. 1 can be simplified as follows: MMDk (Sn, SKT) c max(R krt,n,RSn)2 δ) + log 2 + Lkrt(Rkrt,n+RSn) = cδ,d krt max(R krt,n, RSn) d log(1 + max(Rkrt,n, RSn)) + log(1 + Lkrt krt ), for some universal constants c, c where to simplify the expressions, we have used the fact that RSn,krt,n RSn (7). Noting that (29) holds with probability at least 1 δ, Cor. 1 now follows from plugging the assumed growth rate bounds into the estimate (29), and treating krt and Lkrt krt as some constant while n grows. Appendix H. Proof of Thm. 2: MMD guarantee for square-root L approximations Our proof will use the following two lemmas proved in Apps. H.1 and H.2 respectively. Lemma 11 (Square-root representation of MMD). For k satisfying Assump. 1 with squareroot kernel krt L2, we have, for any distributions µ and ν on Rd, MMDk (µ, ν) = supg L2: g L2 1 R g(y)(µkrt(y) νkrt(y))dy . Lemma 12 (L bound on L2 kernel error). Consider any kernel k L2, satisfying Assump. 1, distributions µ, ν on Rd, and function g L2 with g L2 1. 
For any r, a, b 0 with a + b = 1, R g(y)(µk(y) νk(y))dy µk νk Vol 1 2 (r) + 2τk(ar) + 2 k L2, max{τµ(br), τν(br)}, where Vol(r) πd/Γ(d/2 + 1)rd denotes the volume of the Euclidean ball B(0; r). We first note that, by the square-root kernel definition (Def. 5), krt(x, ) L2 = p k (x, x) for each x Rd. Since k is bounded, we therefore have krt L2, with krt L2, = p k . The result now follows by invoking Thms. 11 and 12 with k = krt. H.1 Proof of Thm. 11: Square-root representation of MMD Let Hk represent the RKHS of k . By Saitoh (1999, Thms. 1 and 2) and the definition (3) of krt, for any f Hk , there exists a function g L2 such that f k = g L2 and f(x) = R g(y)krt(x, y)dy for all x Rd, (30) and, for any g L2, there exists an f Hk such that (30) holds. Note that the integral in (30) is well defined for each x since g L2 and krt L2, . Hence, we have MMDk (µ, ν) = supf Hk : f k 1|µf νf| (i) = supg L2: g L2 1 R R g(y)krt(y, x)dydµ(x) R R g(y)krt(y, x)dydν(x) (ii) = supg L2: g L2 1 R g(y)µkrt(y)dy R g(y)νkrt(y)dy . Dwivedi and Mackey where step (i) follows from (30), and we can swap the order of integration to obtain step (ii) using Fubini s theorem along with the following fact justified by H older s inequality: R R |g(y)krt(y, x)|dyd µ(x) g L2 krt L2, R d µ(x) < for any distribution. µ (31) H.2 Proof of Thm. 12: L bound on L2 kernel error Fix any r 0, introduce the shorthand B(r) = B(0; r), and define the restrictions gr(x) = g(x)IB(r)(x), kr(x, z) k(x, z) IB(r)(z), and k(c) r k kr, so that µk = µkr + µk(c) r . We first note that, by Cauchy-Schwarz, gr L1 L2 with gr L1 gr L2 p Vol(r) g L2 p Vol(r) (32) and that, exactly as in (31), R R |g(y)k(y, x)|dyd µ(x) < for any distribution µ so that each of the integrals to follow is well defined. We now apply the triangle inequality and H older s inequality to obtain R g(y)(µk(y) νk(y))dy = R g(y)(µkr(y) νkr(y))dy + R g(y)(µk(c) r (y) νk(c) r (y))dy R gr(y)(µkr(y) νkr(y))dy + R y 2 r g(y)(µk(c) r (y) νk(c) r (y))dy gr L1 µk νk + R y 2 r g(y)(µk(y) νk(y))dy . (33) Next, we bound the second term in (33). For any x, y Rd with y 2 r and scalars a, b [0, 1] such that a + b = 1, either x y 2 ar or x 2 br. Hence, R y 2 r g(y)(µk(y) νk(y))dy = R y 2 r g(y) R x Rd k(x, y)(dµ(x) dν(x))dy x y 2 ar g(y)k(x, y)(dµ(x) dν(x))dy x 2 br g(y)k(x, y)(dµ(x) dν(x))dy =: T1 + T2. Note that both T1, T2 < since g L2, krt L2, and µ, ν are probability measures. We now bound the terms T1 and T2 separately in (34a) and (34b) below. These bounds, together with the estimates (32) and (33), yield our claim. Bounding T1 Substituting x y = z, we have z 2 ar|g(x z)k(x, x z)||dµ(x) dν(x)dz| z 2 ar|g(x z)k(x, x z)|dz|dµ(x) dν(x)| (i) R g(x ) L2 supx R z 2 ar k2(x , x z)dz 1/2 |dµ(x) dν(x)| (ii) = R g L2τk(ar)|dµ(x) dν(x)| (iii) 2 g L2τk(ar), (34a) where step (i) follows from Cauchy-Schwarz, step (ii) from the definition (5) of τk, and step (iii) from the fact R |dµ(x) dν(x)| 2. Kernel Thinning Bounding T2 We have y 2 r|g(y)k(x, y)|dy |dµ(x) dν(x)| x 2 br g L2 supx k(x , ) L2|dµ(x) dν(x)| (iv) 2 g L2 k L2, max{τµ(br), τν(br)}. (34b) where step (iv) follows from the definitions (5) and (18) of (τµ, τν) and k L2, . Appendix I. Proof of Cor. 2: MMD error from square-root L error Let e d 2 d 2 ed for ed defined in Thm. 2. Notice that the bound (10) for the choice of a = b = 1 2 can be rewritten as MMDk (P, Q) infr(e dr d 2 ε + 2eτ(r)). (35) Then the claims of Cor. 2 follow by optimizing the RHS of (35) over the choice of r depending on the tail decay of eτ. 
Throughout the proofs c, c , ρ denote the (exactly same) constants underlying the assumed tail decay of eτ. Proof for Compact part Choosing r c , we obtain that MMDk (P, Q) infr(e dr d 2 ε + 2c I(r c )) ε e d(c ) d 2 . Proof for Sub Gauss part Choosing r = q 1 c log(1 8cc de dε), we obtain that MMDk (P, Q) infr e dr d 2 ε + 2ce c r2 ε e d log(1 8cc Proof for Sub Exp part Choosing r = 1 c log(1 4cc de dε), we obtain that MMDk (P, Q) infr e dr d 2 ε + 2ce c r ε e d log(1 4cc Proof for Heavy Tail(ρ) part Choosing r = ( 4cρ de dε) 2 d+2ρ , we obtain that MMDk (P, Q) infr e dr d 2 ε + 2cr ρ (ε e d) 2ρ d+2ρ (4cρ d ) d d+2ρ (1 + 2ρ Appendix J. Proof of Thm. 3: Self-balancing Hilbert walk properties We prove each property from Thm. 3 one by one. J.1 Property (i): Functional sub-Gaussianity We prove the functional sub-Gaussianity claim (13) by induction on the iteration i {0, . . . , n}. Our proof uses the following lemma proved in App. J.7, which supplies a convenient decomposition for the self-balancing Hilbert walk iterates. Dwivedi and Mackey Lemma 13 (Alternate representation of ψi). For each i [n], the iterate ψi of the selfbalancing Hilbert walk (Alg. 3) satisfies ψi, u H = D ψi 1, u fi fi,u H H + εi fi, u H for all u H (36) for the random variable εi I[|αi| ai](ηi + αi/ai) which satisfies E[εi|ψi 1] = 0, εi [ 2, 2], and E[etεi|ψi 1] et2/2 for all t R. (37) Now we proceed with our induction argument. Base case For i = 1, noting that ψ0 = 0, we have E[exp( ψ1, u H)] (36) = E[exp(ε1 f1, u H)|] (37) exp 1 2 f1, u 2 H exp 1 2 f1 2 H u 2 H , where the last step follows from Cauchy-Schwarz, and thus ψ1 is sub-Gaussian with parameter σ1 = f1 H as desired. Inductive step Fix any i [n] with i 2 and assume that the functional sub-Gaussianity claim (13) holds for ψi 1 with σi 1. We have E[exp( ψi, u H)] = E[E[exp( ψi, u H)|ψi 1]] (36) = E h exp D ψi 1, u fi fi,u H E[exp(εi fi, u H)|ψi 1] i (37) E h exp D ψi 1, u fi fi,u H exp 1 2 fi, u 2 H i = exp 1 2 fi, u 2 H E h exp D ψi 1, u fi fi,u H (i) exp 1 2 fi, u 2 H + σ2 i 1 2 u fi fi,u H where step (i) follows from the induction hypothesis. Simplifying the exponent in the display (38) using Cauchy-Schwarz and the definition (12) of σi, we have 1 2 fi, u 2 H + σ2 i 1 2 u fi fi,u H 2 fi, u 2 H + σ2 i 1 2 u 2 H + fi, u 2 H fi 2 H a2 i 2 2 u 2 H + fi, u 2 H 1 2 + σ2 i 1 fi 2 H 2a2 i σ2 i 1 ai 2 u 2 H + fi, u 2 H 1 2 + σ2 i 1 fi 2 H 2a2 i σ2 i 1 ai σ2 i 1 + fi 2 H 1 + σ2 i 1 a2 i ( fi 2 H 2ai) = σ2 i 2 u 2 H. Kernel Thinning J.2 Property (ii): Signed sum representation Since Alg. 3 adds fi to ψi 1 whenever |αi| = | ψi 1, fi H| ai, by the union bound, it suffices to lower bound the probability of this event by 1 δi for each i. The following lemma establishes this bound using the functional sub-Gaussianity (13) of each ψi 1. Lemma 14 (Self-balancing Hilbert walk success probability). The self-balancing Hilbert walk (Alg. 3) with threshold ai σi 1 fi H p 2 log(2/δi) for δi (0, 1] satisfies Pr(Ei) 1 δi for Ei = {| ψi 1, fi H| ai}. Proof Instantiate the notation of Thm. 3. The sub-Gaussian Hoeffding inequality (Wainwright, 2019, Prop. 2.5), the functional sub-Gaussianity of ψi 1 (13), and the choice of ai imply that Pr(Ec i ) = Pr(| ψi 1, fi H| > ai) 2 exp( a2 i /(2σ2 i 1 fi 2 H)) 2 exp( log(2/δi)) = δi. J.3 Property (iii): Exact halving via symmetrization Whenever the signed sum representation (14) holds, we have ψn = Pn i=1 ηifi = Pn i=1(ηig2i 1 ηig2i) = P2n i=1 gi 2 P i I gi where the last step follows from the definition of I. 
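For intuition, the decomposition in Lemma 13 suggests the following minimal Euclidean sketch of a self-balancing walk: each input vector is added with a data-dependent random sign chosen so that ε_i = η_i + α_i/a_i is mean-zero, and a correction is applied on the event |α_i| > a_i. The fixed threshold used here is a simplification of the adaptive thresholds a_i of Alg. 3 and the recursion (12), so this is an illustration of the mechanism rather than the algorithm itself.

```python
import numpy as np

def self_balancing_walk(vectors, threshold, rng):
    """Toy Euclidean analogue of a self-balancing walk.

    vectors: (n, d) array of inputs f_1, ..., f_n.
    threshold: fixed stand-in for the adaptive thresholds a_i of Alg. 3.
    Returns the running sum psi and the chosen signs eta (0 marks a
    thresholding failure, where a correction is applied instead).
    """
    d = vectors.shape[1]
    psi = np.zeros(d)
    signs = np.zeros(len(vectors), dtype=int)
    for i, f in enumerate(vectors):
        alpha = psi @ f
        if abs(alpha) <= threshold:
            # mean-zero sign: P(eta = +1) = (1 - alpha/threshold) / 2
            eta = 1 if rng.random() < 0.5 * (1 - alpha / threshold) else -1
            psi = psi + eta * f
            signs[i] = eta
        else:
            # failure branch of Lemma 13: psi <- psi - f * alpha / a_i
            psi = psi - f * alpha / threshold
    return psi, signs

rng = np.random.default_rng(4)
fs = rng.normal(size=(1000, 5))
psi, signs = self_balancing_walk(fs, threshold=50.0, rng=rng)
# The signed sum stays far smaller than the worst-case sum of 1000 vector norms.
print(np.linalg.norm(psi))
```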
J.4 Property (iv): Pointwise sub-Gaussianity in RKHS The reproducing property of the kernel k and the established functional sub-Gaussianity (13) yield E[exp(ψi(x))] = E[exp( ψi, k (x, ) H)] exp σ2 i k (x, ) 2 H 2 = exp σ2 i k (x,x) J.5 Property (v): Sub-Gaussian constant bound We establish the bound (15) for all i {0, . . . , n} by induction on the iteration i. Base case The claim (15) holds for the base case, i = 0, since σ0 = 0. Inductive step Fix any i [n] and assume that the claim (15) holds for σi 1. If either fi H = 0 or σ2 i 1 a2 i 2ai fi 2 H , then σ2 i = σ2 i 1 by the definition (12) of σi and the assumption that fi 2 H 2 ai, completing the inductive step. If, alternatively, σ2 i 1 < a2 i 2ai fi 2 H and fi H > 0, then our assumptions fi 2 H 1+q ai fi 2 H 1 q imply that ( fi 2 H ai 1)2 q2. Hence, by the definition (12) of σi and the inductive Dwivedi and Mackey hypothesis, σ2 i = σ2 i 1 + fi 2 H 1 + σ2 i 1 a2 i ( fi 2 H 2ai) = fi 2 H + σ2 i 1( fi 2 H ai 1)2 (1 q2) fi 2 H 1 q2 + q2 maxj [i 1] fj 2 H 1 q2 maxj [i] fj 2 H 1 q2 , completing the inductive step. J.6 Property (vi): Adaptive thresholding Define c 1 = max(c , 1) and let q = (c 1)2 1 (c 1)2+1 [0, 1) so that 1 1 q = 1+(c 1)2 2 and 1 1 q2 = (c 1+1/c 1)2 4 (c +1/c )2 since 1 = argminc 0 c + 1/c. By assumption, ai fi 2 H fi 2 H 1+q for all i [n]. Now suppose that σ2 i 1 < a2 i 2ai fi 2 H and fi H > 0. If ai fi 2 H, then ai fi 2 H 1 q . If, alternatively, ai ciσi 1 fi H, then 2 fi 2 H(1 + c2 i ) fi 2 H 1 q . The conclusion now follows from the sub-Gaussian constant bound (15). J.7 Proof of Thm. 13: Alternate representation of ψi Alg. 3 and our definition εi I[|αi| ai](ηi + αi/ai) give ψi = ψi 1 fi αi/ai + I[|αi| ai](fi αi/ai + ηifi) = ψi 1 fi fi,ψi 1 H Taking an inner product with u H now yields the equality (36). By construction, εi [cmin, cmax] [ 2, 2] for cmin = max( 2, min(0, 1 + αi/ai)) and cmax = min(2, max(0, 1 + αi/ai)) by construction. Moreover, E[εi | ψi 1, |αi| > ai] = 0 E[εi | ψi 1, |αi| ai] = 1 + αi so that E[εi | ψi 1] = 0 as claimed. The conditional sub-Gaussianity claim E[etεi | ψi 1] et2/2 for all t R, now follows from Hoeffding s lemma (Hoeffding, 1963, (4.16)) since εi is bounded with cmax cmin 2 and mean-zero conditional on ψi 1. Kernel Thinning Appendix K. Proof of Thm. 4: L guarantees for kernel halving We start by showing that after m rounds of kernel halving the output size is nout = n 2m (also see Footnote 3). We prove this by induction. The base case of m = 1 can be directly verified. Let nj |S(j)| denote the output size after j rounds of kernel halving and assume that the hypothesis is true for round j, i.e., nj = n 2j . Then for round j + 1, we have nj+1 = nj/2 = n/2j 2 (i) = n 2j+1 , where step (i) follows from the fact (Graham et al., 1994, Eqn. (3.11)) that ℓ 2 for any integer ℓ. Our desired claim follows. Next, define nin = 2mnout = 2m n 2m . In Apps. K.1 and K.2 we will show that the respective claims (16) and (17) hold whenever n = nin, that is, whenever 2m divides n evenly. Now suppose that n = nin. Since the output of m rounds of kernel halving depends only on the first nin points, the evenly divisible case of App. K.1 implies that, for part (a), Pnink Q(1) KHk k nout Mk(nin,1,d,δ ,δ ,RSnin,k,nin) for nin = 2 n 2 = 2nout, (39) with probability at least 1 δ Pnin/2 i=1 δi, and App. K.1 implies that, for part (b), Pnink Q(m) KH k k nout Mk(nin,m,d,δ ,δ ,RSnin,k,nin) for nin =2m n 2m =2mnout, (40) with probability at least 1 δ Pm j=1 2j 1 m Pnin/2j i=1 δi. 
Next, to recover the bounds (16) and (17), we use the following deterministic inequalities: Pnk Pnink 2(n nin) k n 2(2m 1) k nout , (41) RSnin,k,nin RSn,k,n, and (42) Mk(nin,m,d,δ,δ ,RSnin,k,nin)+2 (i) Mk(nin,m, d,δ,δ ,RSn,k,n)+2 (ii) Mk(n,m,d,δ,δ ,RSn,k,n), (43) where (42) follows directly from the definition (7), and step (i) follows from (42) and the fact that Mk(n,m,d,δ,δ ,R) (8) is non-decreasing in R, and step (ii) follows since Mk(n,m,d,δ,δ ,R) is non-decreasing in n and nin = n. For part (a) we conclude, by (39), Pnk Q(1) KHk Pnk Pnink + Pnink Q(1) KHk nout Mk(nin, 1, d, δ , δ , RSnin,k,nin) nout Mk(n, 1, d, δ, δ , RSn,k,n). For part (b), we conclude, by (40), Pnk Q(m) KH k Pnk Pnink + Pnink Q(m) KH k nout Mk(nin, 1, d, δ , δ , RSnin,k,nin) nout Mk(n, m, d, δ, δ , RSn,k,n). Dwivedi and Mackey K.1 Proof of part (a): Kernel halving yields a 2-thinned L coreset As noted earlier, we prove this part assuming n is even. Consider a self-balancing Hilbert walk (Alg. 3) with inputs (fi)n/2 i=1 and (ai)n/2 i=1, where fi = k(x2i 1, ) k(x2i, ) and ai = max( fi k σi 1 p 2 log(2/δi), fi 2 k ), and σi was defined in (12). Property (iii) of Thm. 3 implies that for a self-balancing walk with the choices summarized above, the event Ehalf = {|αi| ai for all i [n/2]} satisfies Pr(Ehalf) 1 Pn/2 i=1 δi. (44) Consider kernel halving coupled with the above instantiation of self-balancing Hilbert walk. Due to the equivalence with kernel halving on the event Ehalf, we conclude that the output ψn/2 of self-balancing Hilbert walk matches with that of the kernel halving, and satisfies 1 nψn/2 = 1 n Pn i=1 k(xi, ) 1 n/2 P x S(1) k(x, ) = Pnk Q(1) KHk, (45) on the event Ehalf. Furthermore, on the event Ehalf, we can also write that the kernel halving coreset satisfies S(1) = (xi)i I for I = {2i ηi 1 2 : i [n/2]} and ηi defined in Alg. 3. Finally, applying property (vi) of Thm. 3 for σn/2 with ci = p 2 log(2/δi), we obtain that σ2 n/2 maxi n/2 fi 2 k 4 2 log( 2 δ )(1 + 1 2 log(2/δ ))2 (i) 4 k log( 2 where in step (i), we use the fact that fi = k(x2i 1, ) k(x2i, ) satisfies fi 2 k = k(x2i 1, x2i 1) + k(x2i, x2i) 2k(x2i 1, x2i) 4 k . Next, we split the proof in two parts: Case (I) When RSn,k,n < RSn, and Case (II) when RSn,k,n = RSn, where RSn,k,n was defined in (7). We prove the results for these two cases in Apps. K.1.1 and K.1.2 respectively. In the sequel, we make use of the following tail quantity of the kernel: τ k(R ) sup{|k(x, y)| : x y 2 R }. (47) K.1.1 Proof for case (I): When RSn,k,n < RSn By definition (7), for this case, RSn,k,n = n1+ 1 d Rk,n + n 1 d k On the event Ehalf, the following lemma provides a high probability bound on ψn/2 in terms of the kernel parameters, the sub-Gaussianity parameter σn/2, and the size of the cover (Wainwright, 2019, Def. 5.1) of a neighborhood of the input points (xi)n i=1. Lemma 15 (A direct covering bound on ψn/2 ). Fix R r > 0 and δ > 0, and suppose Cn(r, R) is a set of minimum cardinality satisfying Sn i=1 B(xi, R) S z Cn(r,R) B(z, r). (49) If k satisfies Assumps. 1 and 2, then, conditional on the event Ehalf (45), the event E ψn/2 max nτ k(R), n Lkr + σn/2 p 2 k log(2|Cn(r, R)|/δ ) , (50) occurs with probability at least 1 δ , i.e., Pr(E |Ehalf) 1 δ , where τ k was defined in (47). Kernel Thinning Thm. 15 succeeds in controlling ψn/2(x) for all x Rd since either x lies far from every input point xi so that each k(xi, x) in the expansion (45) is small or x lies near some xi, in which case ψn/2(x) is well approximated by ψn/2(z) for z Cn(r, R). 
The proof inspired by the covering argument of Phillips and Tai (2020, Lem. 2.1) and using the pointwise sub-Gaussianity property of Thm. 3 over the finite cover Cn can be found in App. K.3. Now we put together the pieces to prove Thm. 4. First, (Wainwright, 2019, Lem. 5.7) implies that |C1(r, R)| (1 + 2R/r)d (i.e., any ball of radius R in Rd can be covered by (1 + 2R/r)d balls of radius r). Thus for an arbitrary R, we can conclude that |Cn(r, R)| n(1 + 2R/r)d = (n1/d + 2n1/d R/r)d. (51) Second we fix R and r such that nτ k(R) = k and n Lkr = k , so that R r n Rk,n Lk k (c.f. (6) and (47)). Substituting these choices of radii in the bound (50) of Thm. 15, we find that conditional to Ehalf E , we have ψn/2 max nτ k(R), n Lkr + σn/2 p 2 k log(2|Cn(r, R)|/δ ) δ ) h log( 2 δ ) + d log n 1 d + 2Lk k n1+ 1 d Rk,n i (52) δ ) h log( 2 δ ) + d log 2Lk k (Rk,n + RSn,k,n) i (8) k 2Mk(n, 1, d, δ , δ , RSn,k,n). (53) Putting (45) and (53) together, we conclude Pr Pnk Q(1) KHk > 1 n/2Mk(n, 1, d, δ , δ , RSn,k,n) Pr((Ehalf E )c) = Pr(Ec half Ec ) = Pr(Ec half) + Pr(Ehalf Ec ) = Pr(Ec half) + Pr(Ec |Ehalf) δ + Pn/2 i=1δi, where the last step follows from (44) and Thm. 15. The claim now follows. K.1.2 Proof for case (II): When RSn,k,n = RSn In this case, we split the proof for bounding ψn/2 into two lemmas. First, we relate the ψn/2 in terms of the tail behavior of k and the supremum of differences for ψn/2 between any pair points on a Euclidean ball (see App. K.4 for the proof): Lemma 16 (A basic bound on ψn/2 ). If k satisfies Assump. 1, then, conditional on the event Ehalf (45), for any fixed R = R + RSn with R > 0 and any fixed δ (0, 1), e E = ψn/2 max nτ k(R ), σn/2 k δ )+ sup x,x B(0,R) |ψn/2(x) ψn/2(x )| (54) occurs with probability at least 1 δ 2 , i.e., Pr(e E |Ehalf) 1 δ 2 , where τ k was defined in (47). Dwivedi and Mackey Next, to control the supremum term on the RHS of the display (54), we establish a high probability bound in the next lemma. Its proof in App. K.5 proceeds by showing that ψn/2 is an Orlicz process with a suitable metric and then applying standard concentration arguments for such processes. Lemma 17 (A high probability bound on supremum of ψn/2 differences). If k satisfies Assumps. 1 and 2, then, for any fixed R > 0, δ (0, 1), the event Esup sup x,x B(0,R) |ψn/2(x) ψn/2(x )| 8DR d log 2 + Lk R occurs with probability at least 1 δ 2 , where DR q 1 2 min 1, q We now turn to the rest of the proof for Thm. 4. We apply both Thms. 16 and 17 with R = RSn + Rk,n (6) and (7). For this R, we have R = Rk,n in Thm. 16 and hence nτ k(R ) k . Now, condition on the event Ehalf e E Esup. Then, we have n Pnk Q(1) KHk (45) = ψn/2 (54) max( k , σn/2 k δ ) + supx,x B(0,R) |ψn/2(x) ψn/2(x )|) (55) max k , σn/2 k d log(2 + Lk(RSn+Rk,n) (i) max( k , 32σn/2 k d log 2 + Lk(RSn+Rk,n) (46) k 64 q d log 2 + Lk(RSn+Rk,n) (ii) k 2 Mk(n, 1, d, δ , δ , RSn,k,n), (57) where step (i) follows from the fact that DR q 1 2 , and in step (ii) we have used the working assumption for this case, i.e., RSn,k,n = RSn. As a result, Pr( Pnk Q(1) KHk > 2 n Mk(n,1,d,δ ,δ ,RSn,k,n) Pr((Ehalf e E Esup)c) = Pr(Ec half e Ec Ec sup) = Pr(Ec half Ec sup) + Pr((Ec half Ec sup)c e Ec ) = Pr(Ec half Ec sup) + Pr(Ehalf Esup e Ec ) Pr(Ec half Ec sup) + Pr(Ehalf e Ec ) Pr(Ec half) + Pr(Ec sup) + Pr(e Ec |Ehalf) Pn/2 i=1δi + δ where the last step follows from (44) and Thms. 16 and 17. The desired claim follows. 
Kernel Thinning K.2 Proof of part (b): Repeated kernel halving yields a 2m-thinned L coreset As noted earlier, we prove this part assuming n/2m N. The proof in this section follows by applying the arguments from the previous section, separately for each round and then invoking the sub-Gaussianity of a weighted sum of the output functions from each round. Note that coreset S(j 1) is independent of the randomness for the j-th round of kernel havling and thus can be treated as fixed for that round. When running kernel halving with the input S(j 1), let fi,j, ai,j, ψi,j, αi,j, ηi,j denote the analog of the quantities fi, ai, ψi, αi, ηi defined in Alg. 2. Like in part (a), we would couple the j-th round of kernel halving with an instantiation of self-balancing Hilbert walk with inputs (fi,j, ai,j)n/2j i=1 and final output ψn/2j,j, where ai,j = max( fi,j kσi 1,j q 2jδi ), fi,j 2 k) and we define σi,j in a recursive manner as in (12) with fi,j, ai,j, σi,j taking the role of fi, ai, σi respectively. With this set-up, first we apply property (i) of Thm. 3 which implies that given S(j 1), the function ψn/2j,j is σn/2j,j sub-Gaussian, where σ2 n/2j,j 4 k log( 4m 2jδ j ) with δ j = min(δi)n/2j i=1 , (58) using an argument similar to (46) with property (vi) of Thm. 3. Next, we note that for j-th round, the event E(j) half = |αi,j| ai,j for all i [n/2j], satisfies P(E(j) half) 1 Pn/2j i=1 δi 2j 1 and this event serves as the analog the equivalence event Ehalf from part (a) for the j-th round of kernel halving. Thus on the event E(j) half, we can write that the kernel halving coreset S(j) = (xi)i Ij for Ij = {2i ηi,j 1 2 : i [n/2j]} and ηi,j as defined while running Alg. 3, so that n ψn/2j,j = 1 n/2j 1 P x S(j 1) k(x, ) 1 n/2j P x S(j) k(x, ). (60) Now conditional on the event m j=1E(j) half, we conclude that the output of all m kernel halving rounds are equivalent to the output of m different self-balancing Hilbert walks (each with exact two-thinning in each round). Putting the pieces together, we conclude that on the event m j=1E(j) half, we have Pnk Q(m) KH k = 1 n P x Sn k(x, ) 1 n/2m P x S(m) k(x, ) (i) = Pm j=1 1 n/2j 1 P x S(j 1) k(x, ) 1 n/2j P x S(j) k(x, ) = Pm j=1 2j 1 n ψn/2j,j. (61) where we abuse notation S(0) Sn in step (i) for simplicity of expressions. Next, we use the following basic fact: Lemma 18 (Sub-Gaussian additivity). For a sequence of random variables (Zj)m j=1 such that Zj is a σj sub-Gaussian variable conditional on (Z1, . . . , Zj 1), the random variable Z = Pm j=1 θj Zj is (Pm j=1 θ2 jσ2 j )1/2 sub-Gaussian. Dwivedi and Mackey Proof We will prove the result for Zs = Ps j=1 θj Zj by induction on s m. The result holds for the base case of s = 0 as Zs = 0 is 0 sub-Gaussian. For the inductive case, suppose the result holds for s. Then we may apply the tower property, our conditional sub-Gaussianity assumption, and our inductive hypothesis in turn to conclude E[e Ps+1 j=1 θj Zj] = E[e Ps+1 j=1 θj Zj E[eθs+1Zs+1 | Z1:s]] E[e Ps j=1 θj Zj]e θ2 s+1σ2 s+1 2 = e Ps+1 j=1 θ2 j σ2 j 2 . Hence, Zs+1 is q Ps+1 j=1 θ2 jσ2 j sub-Gaussian, and the proof is complete. Applying Thm. 
18 to the output sequence (ψn/2j,j)m j=1 for the m rounds of self-balancing Hilbert walks, we conclude that, the random variable Wm Pm j=1 2j 1 n ψn/2j,j (62) is sub-Gaussian with parameter 2mδ ) (i) q k Pm j=1 4j n2 log( 4m 2jδ ) (ii) q Pm j=1(2j 1 n )2σ2 n/2j,j, (63) conditional to the input Sn, where step (ii) follows since k Pm j=1 4j log( 4m 2jδ ) (iii) 4 k Pm j=1 4j 1 log( 4m 2jδ ) (58) Pm j=1 4j 1σ2 n/2j,j, and step (iii) follows from the fact that δ = min(δ j )m j=1 (58). Now to prove step (i), we note that S1 Pm j=1 4j = 4 34m, and S2 Pm j=1 j4j = 4m which in turn implies that Pm j=1 4j log( 4m 2jδ )=S1 log(4m δ ) log 2 S2 = S1 log(4m δ ) 4m log 2 34m log 2m = 4 34m log( 6m thereby establishing step (i). Next, analogous to the proof of part (a), we split the proof in two parts: Case (I) When RSn,k,n < RSn, in which case, we proceed with a direct covering argument to bound Wm using a lemma analogous to Thm. 15; and Case (II) when RSn,k,n = RSn, in which case, we proceed with a metric-entropy based argument to bound Wm using two lemmas that are analogous Thms. 16 and 17. K.2.1 Proof for case (I): When RSn,k,n < RSn Recall that for this case, RSn,k,n is given by (48). The next lemma, a straightforward extension of Thm. 15, provides a high probability control on Wm . Its proof is omitted for brevity. Kernel Thinning Lemma 19 (A direct covering bound on Wm ). Fix R r > 0 and δ > 0, and recall the definition (49) of Cn(r, R). If k satisfies Assumps. 1 and 2, then, conditional on the event m j=1E(j) half (60), the event E(m) Wm max 2mτ k(R), 2m Lkr + σWm p 2 k log(2|Cn(r, R)|/δ ) , (64) occurs with probability at least 1 δ , i.e., Pr(E(m) | m j=1 E(j) half) 1 δ . Next, we repeat arguments similar to those used earlier around the display (52) and (53). Fix R and r such that τ k(R) = k /n and Lkr = k /n, so that R r n Rk,n Lk k (c.f. (6) and (47)). Substituting these choices of radii in the bound (64) of Thm. 19, we find that conditional to m j=1E(j) half E(m) , we have Pnk Q(m) KH k (61) = Pm j=1 2j 1 (64) max 2mτ k(R), 2m Lkr + σWm p 2 k log(2|Cn(r, R)|/δ ) n k (1 + 2 q 2mδ ) h log( 2 δ ) + d log n 1 d + 2Lk k n1+ 1 n k (1 + 2 q 2mδ ) h log( 2 δ ) + d log 2Lk k (Rk,n + RSn,k,n) i ) n Mk(n, m, d, δ , δ , RSn,k,n), where in step (i), we have used the bounds (51) and (63). Putting the pieces together, we conclude Pr Pnk Q(m) KH k > k 2m n Mk(n, m, d, δ , δ , RSn,k,n) Pr(( m j=1E(j) half E(m) )c) = Pr( m j=1(E(j) half)c (E(m) )c) = Pr( m j=1(E(j) half)c) + Pr( m j=1E(j) half (E(m) )c) j=1 Pr((E(j) half)c) + Pr((E(m) )c| m j=1 (E(j) half)) Pm j=1 Pn/2j i=1 δi 2j 1 where the last step follows from (59) and Thm. 19. The claim follows. K.2.2 Proof for case (II): When RSn,k,n = RSn In this case, we makes use of two lemmas. Their proofs (omitted for brevity) can be derived essentially by replacing ψn/2 and σn/2 with Wm (62) and σWm (63) respectively, and repeating the proof arguments from Thms. 16 and 17. Dwivedi and Mackey Lemma 20 (A basic bound on Wm ). If k satisfies Assump. 1, then, conditional on the event m j=1E(j) half (60), for any fixed R = R + RSn with R > 0 and any fixed δ (0, 1), e E(m) = Wm max 2mτ k(R ), σWm k δ )+ sup x,x B(0,R) |Wm(x) Wm(x )| , occurs with probability at least 1 δ 2 , i.e., Pr(e E(m) | m j=1 E(j) half) 1 δ Lemma 21 (A high probability bound on supremum of Wm differences). If k satisfies Assumps. 
1 and 2, then, for any fixed R > 0, δ (0, 1), given Sn, the event E(m) sup sup x,x B(0,R) |Wm(x) Wm(x )| 8DR d log 2 + Lk R occurs with probability at least 1 δ /2, where DR q 1 2 min 1, q Mimicking the arguments like those in display (56) and (57), with R = RSn + Rk,n, and R = Rk,n, we find that conditional on the event e E(m) E(m) sup m j=1 E(j) half, Pnk Q(m) KH k (61) = Pm j=1 2j 1 n k , 32σWm k d log 2 + Lk(RSn+Rk,n) d log 2 + Lk(RSn+Rk,n) n Mk(n, m, d, δ , δ , RSn), which does not happed with probability at most P(( m j=1E(m) half e E(m) E(m) sup )c) Pm j=1 Pr((E(j) half)c) + Pr((E(m) sup )c) + Pr((e E(m) )c| m j=1 (E(j) half)) Pm j=1 P n/2j i=1 δi 2j 1 where the last step follows from (59) and Thms. 20 and 21 as claimed. The proof is now complete. K.3 Proof of Thm. 15: A direct covering bound on ψn/2 We claim that conditional on the event Ehalf (45), we deterministically have ψn/2 max nτ k(R), n Lkr + maxz Cn(r,R) |ψn/2(z)| , (65) and the event maxz Cn(r,R) ψn/2(z) σn/2 p 2 k log(2|Cn(r, R)|/δ ) (66) occurs with probability at least 1 δ . Putting these two claims together yields the lemma. We now prove these two claims separately. Kernel Thinning K.3.1 Proof of (65) Note that on the event Ehalf (45) we have ψn/2 = n(Pnk e Qn/2k) = Pn i=1 ηik(xi, ) (67) for some ηi { 1, 1}. Now fix any x Rd, and introduce the shorthand Cn = Cn(r, R). The result follows by considering two cases. (Case 1) x Sn i=1 B(xi, R) In this case, we have x xi 2 R for all i [n] and therefore, representation (67) yields that ψn/2(x) = |Pn i=1 ηik(xi, x)| Pn i=1|ηi||k(xi, x)| = Pn i=1|k(xi, x)| nτ k(R), by Cauchy-Schwarz s inequality and the definition (47) of τ k. (Case 2) x Sn i=1 B(xi, R) By the definition (49) of our cover Cn, there exists z Cn such that x z 2 r. Therefore, on the event Ehalf, using representation (67), we find that ψn/2(x) |Pn i=1 ηik(xi, x)| = |Pn i=1 ηi(k(xi, x) k(xi, z) + k(xi, z))| |Pn i=1 ηi(k(xi, x) k(xi, z))| + |Pn i=1 ηik(xi, z)| Pn i=1|k(xi, x) k(xi, z)| + ψn/2(z) n Lk(r) + supz Cn ψn/2(z ) by Cauchy-Schwarz s inequality. K.3.2 Proof of (66) Introduce the shorthand Cn = Cn(r, R). Then applying the union bound, the pointwise sub Gaussian property (iv) of Thm. 3, and the sub-Gaussian Hoeffding inequality (Wainwright, 2019, Prop. 2.5), we find that Pr(maxz Cn |ψn/2(z)| > t |) P z Cn 2 exp( t2/(2σ2 n/2k(z, z))) 2|Cn| exp( t2/(2σ2 n/2 k )) = δ for t σn/2 p 2 k log(2|Cn|/δ ), as claimed. K.4 Proof of Thm. 16: A basic bound on ψn/2 For any R > 0, we deterministically have ψn/2 max(supx B(0,R) ψn/2(x) , supx Bc(0,R) ψn/2(x) ) (68) Since R = R + RSn, for any x Bc(0, R), we have x xi 2 R for all i [n]. Thus conditional on the event Ehalf, applying property (ii) from Thm. 3, we find that ψn/2(x) = |Pn i=1 ηik(xi, x)| Pn i=1|ηi||k(xi, x)| = Pn i=1|k(xi, x)| nτ k(R ), Dwivedi and Mackey by Cauchy-Schwarz s inequality and the definition (47) of τ k. For the first term on the RHS of (68), we have supx B(0,R) ψn/2(x) ψn/2(0) + supx B(0,R) ψn/2(x) ψn/2(0) ψn/2(0) + supx,x B(0,R) ψn/2(x) ψn/2(x ) . Now, the sub-Gaussianity of ψn/2 (property (iv) of Thm. 3) with x = 0, and the sub Gaussian Hoeffding inequality (Wainwright, 2019, Prop. 2.5) imply that ψn/2(0) σn/2 p 2k(0, 0) log(4/δ) σn/2 q 2 k log(4/δ ), with probability at least 1 δ /2. Putting the pieces together completes the proof. K.5 Proof of Thm. 17: A high probability bound on supremum of ψn/2 differences The proof proceeds by using concentration arguments for Orlicz processes (Wainwright, 2019, Def. 5.5). 
Given a set T Rd, a random process {Zx, x T} is called an Orlicz Ψ2-process with respect to the metric ρ if Zx Zx Ψ2 ρ(x, x ) for all x, x T, where for any random variable Z, its Orlicz Ψ2-norm is defined as Z Ψ2 = inf λ > 0 : E[exp(Z2/λ2)] 2 . Our next result (see App. K.6 for the proof) establishes that ψn/2 is an Orlicz process with respect to a suitable metric. (For clarity, we use B2 to denote the Euclidean ball.) Lemma 22 (ψn/2 is an Orlicz Ψ2-process). If k satisfies Assumps. 1 and 2, then, given any fixed R > 0, the random process ψn/2(x), x B2(0; R) is an Orlicz Ψ2-process with respect to the metric ρ defined in (70), i.e., ψn/2(x) ψn/2(x ) Ψ2 ρ(x, x ) for all x, x B2(0; R), (69) where the metric ρ is defined as ρ(x, x ) = q 8 3 σn/2 min p 2Lk x x 2, 2 p Given Thm. 22, we can invoke high probability bounds for Orlicz processes. For the Orlicz process ψn/2(x), x T , given any fixed δ (0, 1], Wainwright (2019, Thm 5.36) implies that supx,x T ψn/2(x) ψn/2(x ) 8 JT,ρ(D) + D p log(4/δ ) , (71) with probability at least 1 δ /2, where D supx,x T ρ(x, x ) denotes the ρ-diameter of T, and the quantity JT,ρ(D) is defined as JT,ρ(D) R D 0 p log(1 + NT,ρ(u))du. (72) Kernel Thinning Here NT,ρ(u) denotes the u-covering number of T with respect to metric ρ, namely the cardinality of the smallest cover CT,ρ(u) Rd such that T z CT,ρ(u)Bρ(z; u) where Bρ(z; u) x Rd : ρ(z, x) u . (73) To avoid confusion, let B2 denote the Euclidean ball (metric induced by 2). Then in our setting, we have T = B2(0; R) and hence DR supx,x B2(0;R) ρ(x, x ) = min q , with βR = 32 3 σ2 n/2Lk R. (74) Some algebra establishes that this definition of DR is identical to that specified in Thm. 17. Next, we derive bounds for NT,ρ and JT,ρ (see App. K.7 for the proof). Lemma 23 (Bounds on NT,ρ and JT,ρ). If k satisfies Assumps. 1 and 2, then, for T = B2(0; R) with R > 0 and ρ defined in (70), we have NT,ρ(u) 1 + βR/u2 d, and (75) d DR h 3 + q 2 log(βR/D2 R) i , (76) where DR and βR are defined in (74). Doing some algebra, we find that βR D2 R = max 2, Lk R Moreover, note that 3 + p 2 log(2 + a) 6 p log(2 + a) for all a 0, so that the bound (76) can be simplified to log(2 + Lk R Substituting the bound (78) in (71) yields that the event Esup defined in (55) holds with probability at least 1 δ /2, as claimed. K.6 Proof of Thm. 22: ψn/2 is an Orlicz Ψ2-process The proof of this lemma follows from the sub-Gaussianity of ψn/2 established in Thm. 3. Introduce the shorthand Y ψn/2(x) ψn/2(x ). Applying property (i) of Thm. 3 with u = k(x, ) k(x , ) along with the reproducing property of the kernel k, we find that for any t R, E[exp(t Y )] = E[exp(t ψn/2, k(x, ) k(x , ) 2t2σ2 n/2 k(x, ) k(x , ) 2 k ). That is, the random variable Y is sub-Gaussian with parameter σY σn/2 k(x, ) k(x , ) k . Next, Wainwright (2019, Thm 2.6 (iv)) yields that E[exp( 3Y 2 8σ2 Y )] 2, which in turn implies that Y Ψ2 q 8 3σY . Moreover, we have k(x, ) k(x , ) 2 k = k(x, x) k(x, x ) + k(x , x ) k(x , x) min(2Lk x x 2, 4 k ), using the Lipschitz continuity of k and the definition of k (4). Putting the pieces together along with the definition (70) of ρ, we find that Y Ψ2 ρ(x, x ) thereby yielding the claim (69). Dwivedi and Mackey K.7 Proof of Thm. 23: Bounds on NT,ρ and JT,ρ We use the relation of ρ to the Euclidean norm 2 to establish the bound (75). The definitions (70) and (73) imply that α ) Bρ(z; u) for α 16 3 σ2 n/2Lk, since ρ(x, x ) p α x x 2. 
Consequently, any u2/α-cover C of T in the Euclidean ( 2) metric automatically yields a u-cover of T in ρ metric as we note that B2(0; R) z CB2(z; u2 α ) z CBρ(z; u). Consequently, the smallest cover CT,ρ(u) would not be larger than than the smallest cover CT, 2(u2/α), or equivalently: NT,ρ(u) NT, 2(u2 α )) (i) 1 + 2Rα/u2 d, where inequality (i) follows from Wainwright (2019, Lem 5.7) using the fact that T = B2(0; R). Noting that βR = 2Rα yields the bound (75). We now use the bound (75) on the covering number to establish the bound (76). Proof of bound on JT,ρ Applying the definition (72) and the bound (75), we find that JT,ρ(DR) R DR 0 log(1 + (1 + βR/u2)d)du 2 R βR/D2 R s 3/2p log(1 + (1 + s)d)ds, (79) where step (i) follows from a change of variable s βR/u2. Note that log(1 + a) log a + 1/a, and b for any a, b 0. Applying these inequalities, we find that log(1 + (1 + s)d) d log(2 + s) d(log s + 1 s + 1 1+s) d(log s + 2 for s [βR/D2 R, ) (77) [2, ). Consequently, R βR/D2 R s 3/2p log(1 + (1 + s)d)ds d R βR/D2 R 2 s2 ds. (80) Next, we note that R βR/D2 R 2D2 R βR , and (81) R βR/D2 R s3/2 ds = 2 q 2 log s) s= s=βR/D2 R (i) 2DR βR log(βR/D2 R) + 1 log(βR/D2 R) where step (i) follows from the following bound on the incomplete Gamma function: 2, a) = R a 1 te tdt 1 a R a e tdt = a 1 2 e a for any a > 0. Kernel Thinning Putting the bounds (79) to (82) together, we find that d DR DR βR + log(βR/D2 R) + 1/ q log(βR/D2 R) . Note that by definition (74) DR βR 1 2. Using this observation and the fact that 1 2 log 2 3 yields the claimed bound (76). Appendix L. Proof of Rem. 8: Example settings for KH L rates Note that when the kernel is restricted to a compact domain, we have Rk,n O(σ) so that in this case, Lk(Rk,n RSn) k O(L). Next, we establish the bounds on L and κ (1/n) for each of the kernels, which immediately imply the respective claims both when supported on Rd and on a compact domain with σ = For a Gaussian kernel, we have κ(r) = e r2/2, which when put together with (98) and (99) implies that L = p 2/e, and κ (1/n) = 2 log n. For a Mat ern kernel, we have κ(r) = cbrb Kb(r) for b = ν d 2, for suitable constant cb, and function Kb; see Tab. 1. Moreover, (107) and (108) imply that L = C1 C2+b C1 C2+1 for universal constants C1, C2, and κ (1/n) O(max(b log b, log n)) = O(log n). For a IMQ kernels, we have κ(r) = 1 (γ2+r2)ν , so that L supr κ (r) = supr 2νr (γ2+r2)ν+1 = 2νγ γ2(ν+1) 1 1+2ν (2ν+1 and κ (1/n) = p n1/ν γ2 n1/2ν. Appendix M. Proof of Cor. 5: L and MMD guarantees for kt-split First applying Thm. 4(b) with k = krt yields that Pnkrt Q(m,1) KT krt krt nout Mkrt(n, m, d, δ , δ , RSn,krt,n) (83) with probability at least 1 δ Pm j=1 2j 1 m P2m j n/2m i=1 δi. Call this event E. Dwivedi and Mackey Let r = R Sn,krt,nout +ε denote a shorthand notation, where ε > 0 is an arbitrary scalar. Applying Thm. 2 with r = 2r and a = b = 1 2, we find that on event E, MMDk (Sn, S(m,1)) = MMDk (Pn, Q(m,1) KT ) Pnkrt Q(m,1) KT krt edrd/2 + 2τkrt(r ) + 2 k 1 2 max(τPn(r ), τQ(m,1) KT (r )) (i) = Pnkrt Q(m,1) KT krt r 2 +1)(r ) d 2 + 2 krt nout Mkrt(n, m, d, δ , δ , RSn,krt,n) r 2 +1)(r ) d 2 + 2 krt 2 +1)(r ) d 2 Mkrt(n, m, d, δ , δ , RSn,krt,n) (84) where step (i) uses the following arguments: (a) edrd/2 = r 2 +1)(r ) d 2 ; (b) Q(m,1) KT is supported on a subset of points from Sn and hence max{τPn(r ), τQ(m,1) KT (r )} = τPn(r ) = 0 for any r > RSn; and (c) τkrt(r ) krt nout for any r > R krt,nout. 
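As a concrete sanity check on this recipe (a minimal sketch of ours, not part of the original derivation), the snippet below verifies numerically in $d = 1$ that convolving the Gaussian square-root kernel with itself recovers the target Gaussian kernel, $\int k_{\mathrm{rt}}(x, z)\,k_{\mathrm{rt}}(y, z)\,dz = k(x, y)$. It assumes the conventions $k(x, y) = \exp(-(x-y)^2/(2\sigma^2))$ and $k_{\mathrm{rt}}(x, y) = (2/(\pi\sigma^2))^{1/4}\exp(-(x-y)^2/\sigma^2)$, consistent with the Gaussian calculations of App. O.2.

```python
# Minimal 1-d check (assumed Gaussian conventions; see the lead-in) that the
# square-root kernel reproduces the target kernel under convolution:
#     int k_rt(x, z) * k_rt(y, z) dz  ==  k(x, y).
import numpy as np

sigma = 1.3

def k(x, y):
    return np.exp(-(x - y) ** 2 / (2.0 * sigma**2))

c_rt = (2.0 / (np.pi * sigma**2)) ** 0.25  # normalization of the square-root kernel

def k_rt(x, y):
    return c_rt * np.exp(-(x - y) ** 2 / sigma**2)

# Riemann-sum approximation of the convolution integral on a wide, fine grid.
z, dz = np.linspace(-40.0, 40.0, 400_001, retstep=True)
for x, y in [(0.0, 0.0), (-1.0, 2.0), (0.5, 3.5)]:
    integral = np.sum(k_rt(x, z) * k_rt(y, z)) * dz
    print(f"x={x:+.1f}, y={y:+.1f}: integral={integral:.6f}, k(x, y)={k(x, y):.6f}")
```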
Noting that (84) applies simultaneously for all ε > 0 under event E, taking the limit ε 0 yields that MMDk (Sn, S(m,1)) krt 2 +1)(R Sn,krt,nout) d 2 Mkrt(n, m, d, δ , δ , RSn,krt,n) with probability at least 1 δ Pm j=1 2j 1 m P2m j n/2m i=1 δi, as claimed. Appendix N. Derivation of Tab. 1: Square-root kernels krt for common target kernels k In this appendix, we derive the results stated in Tab. 1. N.1 General proof strategy Let F denote the Fourier operator (19). In Tab. 3, we state the continuous κ such that k (x, y) = κ(x y) in the first column, its Fourier transform (19) bκ in the second column, the square-root Fourier transform in the third column, and the square-root kernel in the fourth column, given by krt(x, y) = 1 (2π)d/4 κrt(x y) with κrt = F( bκ). Prop. 3 along with expressions in Tab. 3 directly establishes the validity of the square-root kernels for the Gaussian and (scaled) B-spline kernels. For completeness, we also illustrate the remaining calculus for the B-spline kernels in App. N.3. We do a similar calculation in App. N.2 for the Mat ern kernel for better exposition of the involved expressions. Kernel Thinning for κ(z) Expression transform of κ: bκ(ω) Expression for Fourier transform: p bκ(ω) Square-root Fourier bκ)(z) Expression for j=1 2β+2I[ 1 2 ](zj) 4β+1 sin2β+2(ωj/2) sinβ+1(ωj/2) j=1 β+1I[ 1 Table 3: Fourier transforms of kernels k (x, y) = κ(x y) and square-root kernels krt(x, y) = κrt(x y) from Tab. 1. Here F denotes the Fourier operator (19), and ℓdenotes the convolution operator applied ℓ 1 times, with the convention 1f f and 2f = f f for (f g)(x) = R f(y)g(x y)dy. Each Fourier transform is derived from Sriperumbudur et al. (2010, Tab. 2). N.2 Deriving krt for the Mat ern kernel For k =Mat ern(ν, γ) from Tab. 1, we have k (x, y) = ϕd,ν,γ Φν,γ(x y) where Φν,γ(z) cν z 2 2 (γ z 2), (86) ϕd,ν,γ cν d/2 cν γ2ν d and cν 21 ν and Ka denotes the modified Bessel function of third kind of order a (Wendland, 2004, Def. 5.10). That is, for Mat ern kernel with k (x, y) = κ(x y), the function κ is given by κ = ϕd,ν,γΦν,γ. Now applying (Wendland, 2004, Thm. is 8.15), we find that F(Φν,γ) = 1 (γ2+ ω 2 2)ν = F( p F(Φν,γ)) = 1 (2π)d/4 Φν/2,γ, where in the last step we also use the facts that Φν,γ is an even function and p for all ν > d/2. Thus, by Prop. 3, a valid square-root kernel krt(x, y) = κrt(x y) is defined via κrt = pϕd,ν,γ 1 (2π)d/4 Φν/2,γ, = krt = Aν,γ,d Mat ern(ν/2, γ), (87) where Aν,γ,d 1 Γ(ν) Γ(ν d/2) Γ((ν d)/2) Γ(ν/2) . (88) N.3 Deriving krt for the B-Spline kernel For positive integers β, d, define the constants Bβ 1 (β 1)! P β/2 j=0 ( 1)j β j (β 2 j)β 1, Sβ,d B d β , and e Sβ,d S2β+2,d Sβ+1,d . (89) Define the function χβ : Rd R as follows: χβ(z) Sβ,d Qd i=1 fβ(zi). (90) Dwivedi and Mackey Then for kernel k = B-spline(2β + 1), we have k (x, y) = χ2β+2(x y). (91) The second row of Tab. 3 implies that 2π) d 4 F p S2β+2,d Sβ+1,d χβ+1 (89) = e Sβ,d χβ+1, where we also use the fact that χβ is an even function. Putting the pieces together with Prop. 3, we conclude that krt = e Sβ,d B-spline(β), (92) (with κrt = e Sβ,dχβ+1) is a valid square-root kernel for k = B-spline(2β + 1). Appendix O. Derivation of Tab. 2: Explicit bounds on Thm. 1 quantities for common kernels We start by collecting some common tools in App. O.1 that we later use for our derivations for the Gaussian, Mat ern and B-spline kernels in Apps. O.2 to O.4 respectively. Finally, in App. 
O.5, we put the pieces together to derive explicit MMD rates, as a function of d, n, δ and kernel parameters for the three kernels, and summarize the rates in Tab. 4. (This table is complementary to the generic results stated in Cor. 1.) O.1 Common tools for our derivations We collect some simplified expressions, and techniques that come handy in our derivations to follow. Simplified bounds on Mkrt From (8), we have 2 log2 n, d, δ n, δ , R) r δ ) + d log Lkrt krt (Rkrt,n + R) i = Mkrt(n, 1 2 log2 n, d, δ d log n log Lkrt krt (Rkrt,n + R) . (93) Thus, given (93), to get a bound on Mkrt, we need to derive bounds on (Lkrt, krt , Rkrt,n) for various kernels to obtain the desired guarantees on MMD and L coresets. Bounds on Gamma function Our proofs make use of the bounds from Batir (2017, Thm 2.2) on the Gamma function: Γ(b + 1) (b/e)b 2πb for any b 1, and Γ(b + 1) (b/e)b e2b for any b 1.1. (94) General tools for bounding the Lipschitz constant To bound the Lipschitz constant Lkrt, the following two observations come in handy. For a radial kernel krt(x, y) = eκrt( x y 2) with eκrt : R+ R, we note sup x,y,z: y z 2 r|krt(x, y) krt(x, z)| supa>0,b r|eκrt(a) eκrt(a + b)| eκ rt r = Lkrt eκ rt . (95) Kernel Thinning For a translation-invariant kernel krt(x, y) = κrt(x y) with κrt : Rd R, we use the bound sup x,y,z: y z 2 r|krt(x, y) krt(x, z)| = supz ,z , z z 2 r|κrt(z ) κrt(z )| supz Rd κrt(z ) 2r, = Lkrt supz Rd κrt(z ) 2 d supz Rd κrt(z ) , (96) where step (i) follows from Cauchy-Schwarz s inequality (and is handy when coordinate wise control on κrt is easier to derive). We later apply the inequality (95) for the Gaussian and Mat ern kernels, and (96) for the B-spline kernel. O.2 Proofs for Gaussian kernel For the kernel k = Gauss(σ) and its square-root kernel krt = 2 2), we claim the following bounds on various quantities 4 and k = 1, (97) 2/e σ = Lkrt/ krt (97) = 1 σ p Rkrt,n = σ log n, (99) R krt, n = O(σ d + log n). (100) Substituting these expressions in (93), we find that MGauss krt (n, 1 2 log2 n, d, δ 2n, δ , R) q δ + d log log n + R as claimed in Tab. 2. Proof of claims (97) to (100) The claim (97) follows directly from the definition of krt. The bound (98) on Lkrt follows from the fact d dre r2/σ2 = 2/e σ and invoking the relation (95). Next, recalling the definition (6) and noting that sup x,y: x y 2 r|krt(x, y)| 2 implies the bound (99) on Rkrt,n. Next, we have τ 2 krt(R) = R 2 exp( 2 z 2 2/σ2)dz = Pr X N(0,σ2/4 Id)( X 2 R) (i) e R2/σ2, (101) 2d, where step (i) follows from the standard tail bound for a chi-squared random variable Y with k degree of freedom (Laurent and Massart, 2000, Lem. 1): Pr(Y k > 2 kt + 2t) e t, wherein we substitute k = d, Y = 4 σ2 X 2 2, t = dα with α 2, and R2 = σ2t. Using the tail bound (101), the bound (97) on krt and the definition (6) of R krt, n, we find that R krt, n = O(σ max( q d)) = O(σ d + log n). yielding the claim (100). Dwivedi and Mackey O.3 Proofs for Mat ern kernel Recall the notations from (86) to (88) for the Mat ern kernel related quantities. Let Kb denote the modified Bessel function of the third kind with order b (Wendland, 2004, Def. 5.10). Let a ν d 2 . Then, the kernel k = Mat ern(ν, γ) and its square-root kernel krt = Aν,γ,d Mat ern(a, γ) satisfy k (x, y) = eκν d/2(γ x y 2), and (102) krt(x, y) = Aν,γ,deκa(γ x y 2), (103) where eκb(r) cbrb Kb(r), and cb = 21 b Γ(b) for b > 0. 
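As a quick numerical check on these definitions (our sketch, not part of the proof; it assumes SciPy's modified Bessel function `kv`), the snippet below evaluates the normalized Matérn profile $\tilde{\kappa}_b(r) = \frac{2^{1-b}}{\Gamma(b)}\, r^b K_b(r)$ and confirms that it stays below one with its supremum attained in the $r \to 0$ limit, in line with the claims $\|\tilde{\kappa}_b\|_\infty = 1$ and $\|k_{\mathrm{rt}}\|_\infty = A_{\nu,\gamma,d}$ established below in (112) and (105).

```python
# Numerical check (not from the paper) that the normalized Matern profile
#     ekappa_b(r) = (2**(1 - b) / Gamma(b)) * r**b * K_b(r)
# is bounded by 1 and approaches 1 as r -> 0, for several smoothness orders b.
import numpy as np
from scipy.special import gamma, kv  # kv(b, r) = modified Bessel function K_b(r)

def ekappa(b, r):
    r = np.asarray(r, dtype=float)
    return (2.0 ** (1.0 - b) / gamma(b)) * r**b * kv(b, r)

r = np.geomspace(1e-6, 50.0, 200_000)
for b in [0.75, 1.5, 2.5, 5.0]:
    vals = ekappa(b, r)
    print(f"b={b}: max over grid = {vals.max():.6f} (attained near r = 0)")
```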
(104) We claim the following bounds on various quantities assuming a 2.2, and d 2:7 krt = Aν,γ,d and k = 1, (105) Aν,γ,d 5ν( γ2 2π(a 1)) d 4 , (106) Lkrt Aν,γ,d C1γ a+C2 = Lkrt krt (105) C1γ a+C2 , (107) γ O(max(log n a log(1 + a), C2a log(1 + a))), (108) R krt, n = 1 γ O(a+log n+d log( γ )+log( (ν 2)ν 3 (2(a 1))2a 1d d 2 +1 )) (109) We prove these claims in App. O.3.1 through App. O.3.5. Putting these bounds together with (93) yields that MMat ern krt (n, 1 2 log2 n, d, δ δ ) h log 1 δ + d log 1 1+a (log n+γR) i if a = o(log n) q δ + d log( a log(1 + a)+γR) if a = Ω(log n) δ + d log(log n+B +γR) with B = a log(1 + a), as claimed in Tab. 2. O.3.1 Set-up for proofs of Mat ern kernel quantities Before proceeding to the proofs of the claims (105) to (109), we collect some handy inequalities. Applying Wendland (2004, Lem. 5.13, 5.14), we have eκa(γr) min 1, 2πca(γr)a 1 2γr for r > 0. (110) For a large enough r, we also establish the following bound: eκa(γr) min 1, 4ca(γr)a 1e γr/2 for γr 2(a 1), a 1. (111) 7. When a (1, 2.2), and d [1, 2), analogous bounds with slightly different constants follow from employing the Gamma function upper bound of Batir (2017, Thm 2.3) in place of our upper bound (94). For brevity, we omit these derivations. Kernel Thinning Proof of (111) Noting the definition (104) of κa, it suffices to show the following bound: re r/2 for r/2 a 1, where Ka is the modified Bessel function of the third kind. Using Wendland (2004, Def. 5.10), we have 2 R 0 e r cosh t cosh(at)dt (i) R 0 e r 2 eteatdt (ii) = R r/2 e s(2s sds (iii) 4 re r/2 where step (i) uses the following inequalities: cosh t 1 2(et + e t) 1 2et and cosh(at) eat for a > 0, t > 0, step (ii) follows from a change of variable s r 2et, and finally step (iii) uses the following bound on the incomplete Gamma function obtained by substituting B = 2 and A = a in Borwein and Chan (2009, Eq. 1.5): R r ta 1e tdt 2ra 1e r for r a 1. The proof is now complete. O.3.2 Proof of the bound (105) on krt and k We claim that eκb = 1 for all b > 0. (112) where was defined in (104). Putting the equality (112) with (102), (103), and (110) immediately implies the bounds (105) on the krt and k . To prove (112), we follow the steps from the proof of Wendland (2004, Lem. 5.14). Using Wendland (2004, Def. 5.10), for b > 0 we have 2 R e r cosh tebtdt = 1 2 (et+e t)ebtdt = k b 1 2 R 0 e r/2(u/k+k/u)ub 1du where the last step follows by substituting u = ket. Setting k = r/2, we find that rb Kb(r) = 2b 1 R 0 e ue r2/(4u)ub 1du 2b 1Γ(b), where we achieve equality in the last step when we take the limit r 0. Noting that eκb(r) = 21 b Γ(b) rb Kb(r) (104) yields the claim (112). O.3.3 Proof of the bound (106) on Aν,γ,d Using the definition (88) we have Aν,γ,d = 1 4 A ν,γ,d, where A ν,γ,d = q Γ(ν) Γ(ν d/2) Γ((ν d)/2) 4 (2 e) d 2 ν 1 ν 2 ν 2 ν 1 ν 2 (ν 1) 1 4 ν d 2 ν d (ν d 2) 1 2 ( d 4 (2 e) d 2 e 2 (ν 1) 1 4 ( e) d 2 +1 (ν d 2)/(ν d (ν d 2) d 4 (2π) 3 4 (ν 1) 1 4 (ν d 2 1) 3 4 ν d 2 Dwivedi and Mackey As a result, we have 4πγ2 d/4 A ν,γ,d 5ν( γ2 π(ν d 2)) d 4 = 5ν( γ2 2π(a 1)) d 4 , as claimed. O.3.4 Proof of the bound (107) on Lkrt To derive a bound on the Lipschitz constant Lkrt, we bound the derivative of eκa. Using (ra Ka(r)) = ra Ka 1(r) (Wendland, 2004, Eq. 5.2), we find that (eκa(γr)) = ca d dr((γr)a Ka(γr)) = ca ca 1 γ2reκa 1(γr). Using (110), we have ca ca 1 γ2reκa 1(γr) caγ ca 1 min(γr, 2πca 1(γr)a 1 2 e γr+ (a 1)2 2γr ) for r > 0. And (111) implies that ca ca 1 γ2reκa 1(γr) caγ ca 1 min γr, 4ca 1(γr)a 1e γr/2 for r 2(a 2), a 2. 
Next, we make use of the following observations: The function tbe t/2 achieves its maximum at t = 2b. Then for any function f : R+ R+ such that f(t) t for all t 0 and f(t) min(t, Ctbe t/2) for t > t with t < 2b, we have supt 0 f(t) min(2b, C(2b)be b). As a result, we conclude that supr 0 ca ca 1 γ2reκa 1(γr) caγ ca 1 min 2(a 1), 4ca 1(2(a 1))a 1e a+1 . (113) Substituting the value of ca 1 and the bound (94), we can bound the second argument inside the min on the RHS of (113) (for a 3) as follows: 4ca 1(2(a 1))a 1e a+1 4 22 a 1 2π(a 2)( e a 2)a 22a 1(a 1)a 1e1 a (114) 2π(1 + 1 a 2)a 2 a 2 + 1 a 2 for a 3. When a [2, 3), one can directly show that 4ca 1(2(a 1))a 1e a+1 8/e. Putting the pieces together, and noting that ca ca 1 = 1 max(2(a 1),1), we find that supr 0 ca ca 1 γ2reκa 1(γr) γ max(2(a 1),1) min 2(a 1), 8 e + 8 π a 2 C1γ a+C2 for any a 2. And hence Lkrt Aν,γ,d supr 0 |κ a(γr)| Aν,γ,d C1γ a+C2 = Lkrt krt (105) C1γ a+C2 , as claimed. Kernel Thinning O.3.5 Proof of the bound (108) on Rkrt,n Using arguments similar to those used to obtain (114), we find that maxγr 2(a 1) eκa(γr) 4ca(2(a 1))a 1e (a 1) q Next, note that 2γr e γr/2 for γr a log(1 + a), (115) where (115) follows from standard algebra. Thus, we have caeκa(γr) ca exp( γr/2) for γr a log(1 + a) (94) = Rkrt,n 1 γ max(log n a log(1 + a), C1a log(1 + a)), as claimed. O.3.6 Proof of the bound (109) on R krt, n Let Vd = πd/2/Γ(d/2 + 1) denote the volume of unit Euclidean ball in Rd. Using (103), we have 1 A2 ν,γ,d τ 2 krt(R) = R z 2 R eκ2 a(γ z 2)dz = d Vd R r>R rd 1eκ2 a(γr)dr (110) 2πc2 ad Vd R r>R rd 1(γr)2a 1 exp( 2γr + a2 = 2πc2 ad Vdγ1 d R r>R(γr)ν 2 exp( 2γr + a2 γr)dr (116) where we have also used the fact that 2a + d = ν. Next, we note that exp( 2γr + a2 2γr) for γr For any integer b > 1, noticing that f(t) = 1 Γ(b)tb 1e t is the density function of an Erlang distribution with shape parameter b and rate parameter 1, and using the expression for its complementary cumulative distribution function, we find that R r tb 1e tdt = Γ(b) Pb 1 i=0 ri i! e r (118) Thus, for R > 2 γ a, we have r>R(γr)ν 2 exp( 2γr + a2 γr)dr (117) R r>R(γr)ν 2 exp( 3 3 ν 1 R 3γR/2 tν 2e tdt (118) = 1 γ 2 3 ν 1Γ(ν 1)e 3 2 γR Pν 1 i=0 (3γR/2)i γ Γ(ν 1)e 3 γ Γ(ν 1)e 1 2 γR. (119) Dwivedi and Mackey Putting the bounds (116) and (119) together for R ν d 2γ , we find that τ 2 krt(R) A2 ν,γ,d γ d 2π 22 2a Γ2(a) π d 2 Γ( d 2 +1) Γ(ν 1) exp( 1 Using (94), we have Γ2(a) π d 2 Γ( d 2 +1) Γ(ν 1) 2π 22 2ae2(a 1) 2π(a 1)(a 1)2(a 1) πd/2(2e)d/2 πd dd/2 e ν 2(ν 2 = 2 πe2a 2+d/2+1 ν(2π)d/2(2a 2) (2a 1)d d/2 1(ν 2)ν 3/2 = 2 e π(2eπ)d/2(ν d 2) (ν d 1)d d/2 1(ν 2)ν 3/2 = 2 e π(2eπ)d/2(2(a 1)) (2a 1)d d/2 1(ν 2)ν 3/2. Putting the pieces together, and solving for τkrt(R) krt / n (105) = Aν,γ,d/ n, we find that R krt, n (6) can be bounded as 2, log n+log( 2 e π)+d log( γ )+log( (ν 2)ν 3 (2(a 1))2a 1d d 2 +1 )) γ (a+log n+d log( γ )+log( (ν 2)ν 3 (2(a 1))2a 1d d 2 +1 ), as claimed. O.4 Proofs for B-spline kernel Recall the notation from (89) to (92). Then, the kernel k = B-spline(2β + 1) and its square-root kernel krt = e Sβ,d B-spline(β) satisfy krt = e Sβ,d (i) = cd β where cβ Bβ+1 3 when β = 1 (iii) 1 when β > 1 , (120) k = 1, (121) d(β + 1)/2, and R krt, n d(β + 1)/2. (123) While claims (i) and (ii) follow directly from the definitions in the display (89), claim (iii) can be verified numerically, e.g., using Sci Py (Virtanen et al., 2020). From numerical verification, we also find that the constant cβ in (120) is decreasing with β. See App. 
O.4.1 for the proofs of the remaining claims. Finally, substituting various quantities from (120), (122), and (123) in (93), we find that MB-spline krt (n, 1 2 log2 n, d, δ 2n, δ , R) r δ) + d log(dβ + Kernel Thinning O.4.1 Proofs of the bounds on B-spline kernel quantities We start with some basic set-up. Consider the (unnormalized) univariate B-splines fβ : R [0, 1] with fβ(a) = βI[ 1 2 ](a) (i) = 1 (β 1)! Pβ j=0( 1)j β j (a + β where step (i) follows from Schumaker (2007, Eqn 4.46, 4.47, 4.59, p.135, 138). Noting that fβ is an even function with a unique global maxima at 0 (see Schumaker (2007, Ch 4.) for more details), we find that fβ = fβ(0) = 1 (β 1)! P β/2 j=0 ( 1)j β j (β 2 j)β 1 (89) = Bβ. (124) Bounds on k and krt Recalling (90) and (91), we find that k = χ2β+2 = S2β+2,d f2β+2 d = S2β+2,d Bd 2β+2 = 1, thereby establishing (121). Bounds on Lkrt We have f β+1(a) = R fβ(b) d 2 ](a b)dy = fβ(a 1 2) fβ(a + 1 and hence f β+1 = supa |fβ(a 1 2)| fβ since fβ is non-negative. Putting the pieces together, we have Lkrt krt = 1 e Sβ,d Lkrt supz χβ+1(z) 2 d Sβ+1,d f β+1 fβ+1 d 1 d B d β+1 Bβ Bd 1 β+1 where step (i) can be verified numerically. Bounds on Rkrt,n, and R krt, n Using the property of convolution, we find that fβ+1(a) = 2(β + 1). Hence κrt(z) = 0 for z > (β + 1)/2 and applying the definitions (6), we find that d(β + 1)/2 and R krt, n as claimed in (123). O.5 Explicit MMD rates for common kernels Putting the quantities from Tab. 2 together with Thm. 1 (with δ = δ, and δ treated as a constant) yields the MMD rates summarized in Tab. 4. For completeness, we illustrate a key simplification that can readily yield the results stated in Tab. 4. Define ζkrt,Sn as follows: d max(R krt, n, RSn)2, Dwivedi and Mackey Kernel k MMDk (Sn, SKT) Gaussian(σ) Cd 1 q d(log n+( RSn σ )2)] d 2 log( log n+ RSn Mat ern(ν, γ) Cd 2 n h 1 d(a 1)(a + log2 n+d2 log2( γ )+G2 ν,d+γ2R2 Sn) i d 2 log(log n+a+γRSn) B-spline(2β + 1) Cd 3 n [β2 + R2 Sn d ] d 2 log(β + RSn Table 4: Explicit kernel thinning MMD guarantees for common kernels. Here, a 1 d), Gν,d log( (ν 2)ν 3 (2(a 1))2a 1d d 2 +1 ) and each Ci denotes a universal constant. See App. O.5 for more details on deriving these bounds from Thm. 1. where RSn, and R krt, n were defined in (6) and (7) respectively. Then applying Thm. 1, and substituting the simplified bound (93) in (9), we find that MMDk (Sn, SKT) 2 krt n + krt n (Bζkrt,Sn) d 4 d 1 δ)+d log Lkrt(Rkrt,n+R) (i) δ 2 krt n + krt (Bζkrt,Sn) d 4 d 1 4 r n log Lkrt(Rkrt,n+RSn) with probability at least 1 2δ, where B 8eπ is a universal constant, and in step (i) we use the following bound: For any r = αd, we have cdr d 2 = (4π) d 4 2 +1)) 1 2 (αd) d 4 (ii) (8eπα) d 4 (πd) 1 4 (Bα) d 4 d 1 4 where B 8eπ, and step (ii) follows (for d 2) from the Gamma function bounds (94). Now the results in Tab. 4 follow by simply substituting the various quantities from Tab. 2 in (125). Appendix P. Supplementary Details for Vignettes of Sec. 7 This section provides supplementary details for the vignettes of Sec. 7. P.1 Mixture of Gaussians target Our mixture of Gaussians target is given by P = 1 M PM j=1 N(µj, Id) for M {4, 6, 8} where µ1 = [ 3, 3] , µ2 = [ 3, 3] , µ3 = [ 3, 3] , µ4 = [3, 3] , µ5 = [0, 6] , µ6 = [ 6, 0] , µ7 = [6, 0] , µ8 = [0, 6] . P.2 Empirical decay rates In Fig. 5, we provide results for regressing the true error rate of q n and log n n to the input size n (on the log-log scale as in all of mean MMD plots) for various ranges of coreset sizes. 
Notably, the observed empirical rates are significantly slower than $n^{-1/2}$ for the smallest coreset range (the same coreset range used in the panels of Fig. 3) but approach $n^{-1/2}$ as the coreset size increases. Hence, the observed empirical rates of Fig. 3 are consistent with $\log^c(n)/\sqrt{n}$ rates of decay.

[Figure 5 panels: ordinary least squares fits of log(true error) against log(size $n$) over three ranges of $n$ (roughly $2^2$ to $2^8$, $2^8$ to $2^{14}$, and $2^{15}$ to $2^{23}$), with one row of panels for each of the two true-error models (a) and (b) named in the caption.]

Figure 5: Empirical decay rates when the true error equals $\log^c(n)/\sqrt{n}$. We display the ordinary least squares fits for regressing the log of the true error onto the log of the input size $n$ when the true error equals either (a) $\sqrt{\log n}/\sqrt{n}$ or (b) $\log n/\sqrt{n}$. As the input range increases, the fitted decay rate tends towards $n^{-1/2}$ despite appearing significantly slower for smaller input ranges.

P.3 MCMC vignette details

For complete details on the targets and sampling algorithms, we refer the reader to Riabiz et al. (2021, Sec. 4). When applying standard thinning to any Markov chain output, we adopt the convention of keeping the final sample point. For all experiments, we used only the odd indices of the post-burn-in sample points when thinning to form $\mathcal{S}_n$. The selected burn-in periods for the Goodwin task were 820,000 for RW; 824,000 for adaRW; 1,615,000 for MALA; and 1,475,000 for pMALA. The respective numbers for the Lotka-Volterra task were 1,512,000 for RW; 1,797,000 for adaRW; 1,573,000 for MALA; and 1,251,000 for pMALA. For the Hinch experiments, we discard the first $10^6$ points as burn-in following Riabiz et al. (2021, App. S5.4). For all $n$, the parameter $\sigma$ for the Gaussian kernel is set to the median distance between all pairs of 47 points obtained by standard thinning of the post-burn-in odd indices. The resulting values for the Goodwin chains were 0.02 for RW; 0.0201 for adaRW; 0.0171 for MALA; and 0.0205 for pMALA. The respective numbers for the Lotka-Volterra task were 0.0274 for RW; 0.0283 for adaRW; 0.023 for MALA; and 0.0288 for pMALA. Finally, the numbers for the Hinch task were 8.0676 for RW; 8.3189 for adaRW; 8.621 for MALA; and 8.6649 for pMALA.

Appendix Q. Kernel Thinning with Square-root Dominating Kernels

As alluded to in Sec. 2, it is not necessary to identify an exact square-root kernel $k_{\mathrm{rt}}$ to run kernel thinning. Rather, it suffices to identify any square-root dominating kernel $\tilde{k}_{\mathrm{rt}}$ defined as follows.

Definition 8 (Square-root dominating kernel). We say a reproducing kernel $\tilde{k}_{\mathrm{rt}} : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a square-root dominating kernel for a reproducing kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ if $\tilde{k}_{\mathrm{rt}}(x, \cdot)$ is square-integrable for all $x \in \mathbb{R}^d$ and either of the following equivalent conditions holds. (a) The RKHS of $k$ belongs to the RKHS of $\tilde{k}(x, y) \triangleq \int_{\mathbb{R}^d} \tilde{k}_{\mathrm{rt}}(x, z)\, \tilde{k}_{\mathrm{rt}}(y, z)\, dz$. (b) The function $\tilde{k} - c\, k$ is positive definite for some $c > 0$.

Remark 9 (Controlling $\mathrm{MMD}_{k}$). A square-root dominating kernel $\tilde{k}_{\mathrm{rt}}$ is a suitable surrogate for $k_{\mathrm{rt}}$ as, by Zhang and Zhao (2013, Lem. 2.2, Prop. 2.3), $\mathrm{MMD}_{k}(\mu, \nu) \le \sqrt{1/c}\, \mathrm{MMD}_{\tilde{k}}(\mu, \nu)$ for $c$ the constant appearing in Def. 8(b) and all distributions $\mu$ and $\nu$.

Def. 8 and Rem. 9 enable us to use convenient dominating surrogates whenever an exact square-root kernel is inconvenient to derive or deploy. For example, our next result, proved in App. Q.1, shows that a standard Matérn kernel is a square-root dominating kernel for every sufficiently smooth, shift-invariant, and absolutely integrable $k$.

Proposition 4 (Dominating smooth kernels).
If k (x, y) = κ(x y) and κ L1 C2ν for ν > d, then, for any γ > 0, the Mat ern(ν 2, γ) kernel of Tab. 1 is a square-root dominating kernel for k . Checking the square-root dominating condition is also particularly simple for any pair of continuous shift-invariant kernels as the next result, proved in App. Q.2, shows. Proposition 5 (Dominating shift-invariant kernels). Suppose that the real-valued kernels k (x, y) = κ(x y) and ekrt(x, y) = eκrt(x y) have respective spectral densities (Def. 6) bκ and bκrt. If bκrt L2, then ekrt is a square-root dominating kernel for k if and only if ess sup ω Rd: bκ(ω)>0 bκ(ω) bκrt(ω)2 < . (126) In Tab. 5, we use Prop. 5 to derive convenient tailored square-root dominating kernels ekrt for standard inverse multiquadric kernels, hyberbolic secant (sech) kernels, and Wendland s compactly supported kernels. In each case, we can identify a square-root dominating kernel from the same family. Kernel Thinning k (x, y) = κ(x y) Name of kernel κ(z) Expression for kernel ekrt(x, y) = eκrt(x y) Name of square-root dominating ν > 0, γ > 0 Inverse Multiquadric(ν, γ) (γ2 + z 2 2) ν (ν , γ ) Cν,γ,d; see (131) Inverse Multiquadric(ν , γ ) a > 0 Sech(a) j=1 sech p π 2(d + 1) Wendland(s) ϕd,s( z 2); see (127) s N0, s 1 4(2s 1 d) Wendland(s ) Table 5: Square-root dominating kernels ekrt for common kernels k (see Def. 8). Here N0 denotes the non-negative integers. See App. Q.3 for our derivation. Expressions for Wendland kernels The Wendland(s) kernel is a compactly-supported radial kernel ϕd,s( x y 2) on Rd where ϕd,s : R+ R is a truncated minimal-degree polynomial with 2s continuous derivatives. We collect here the expressions for ϕd,s for s = 0, 1, 2 and refer the readers to Wendland (2004, Ch. 9, Thm. 9.13, Tab. 9.1) for more general s. Let (r)+ = max(0, r) and ℓ d/2 + 3, then we have ϕd,0(r) = (1 r) d/2 +1 + , ϕd,1(r) = (1 r) d/2 +2 + [( d/2 + 2)r + 1] (127) ϕd,2(r) = (1 r) d/2 +3 + [(ℓ2 + 4ℓ+ 3)r2 + (3ℓ+ 6)r + 3]. Q.1 Proof of Prop. 4: Dominating smooth kernels Since κ L1, F(κ) is bounded by the Babenko-Beckner inequality (Beckner, 1975) and nonnegative by Bochner s theorem (Bochner, 1933; Wendland, 2004, Thm. 6.6). Moreover, since κ C2ν, Sun (1993, Thm. 4.1) implies that R ω 2ν 2 F(κ)(ω)dω < . By Wendland (2004, Theorem 8.15), the Mat ern(ν, γ) kernel ek(x, y) Φν,γ(x y) for Φν,γ continuous with F(Φν,γ)(ω) = (γ2 + ω 2 2) ν. Since we have established that R F(κ)(ω)2 F(Φν,γ)(ω)dω= R (γ2 + ω 2 2)νF(κ)(ω)2dω F(κ) R (γ2 + ω 2 2)νF(κ)(ω)dω< , Wendland (2004, Thm. 10.12) implies that κ belongs to Hek and hence that Hk Hek. Finally, by App. N.1, Mat ern(ν 2, γ) is a valid square-root dominating kernel for ek and therefore for k . Q.2 Proof of Prop. 5: Dominating shift-invariant kernels Def. 6 implies that k (x, y) = 1 (2π)d/2 R e i ω,x y bκ(ω)dω and ekrt(x, y) = 1 (2π)d/2 R e i ω,x y bκrt(ω)dω, Moreover, since krt(x, ) = F(e i ,x bκrt) for e i ,x bκrt L1 L2, the Plancherel-Parseval identity (Wendland, 2004, Proof of Thm. 5.23) implies that Rd ekrt(x, z)ekrt(y, z)dz = R Rd e i ω,x bκrt(ω) ei ω,y bκrt(ω)dω Rd e i ω,x y bκrt(ω)2dω, Dwivedi and Mackey and Bochner s theorem (Bochner, 1933; Wendland, 2004, Thm. 6.6) implies that ek is a kernel. Finally, Prop. 3.1 of Zhang and Zhao (2013) now implies that the RKHS of k belongs to the RKHS of ek if and only if the ratio condition (126) holds. Q.3 Derivation of Tab. 5: Square-root dominating kernels ekrt for common kernels k Thanks to Prop. 5, to establish the validity of the square root dominating kernels stated in Tab. 
5, it suffices to verify that bκrt L2 and bκ d,κ,eκrt bκ2 rt, (128) where the functions κ and bκrt are the spectral densities of κ and eκrt. Inverse multiquadric Consider the positive definite function Φν,γ (86) underlying the Mat ern kernel, which is continuous on Rd\{0}. When κ(z) = (γ2 + z 2 2) ν and eκrt(z) = (γ 2 + z 2 2) ν , Wendland (2004, Theorem 8.15) implies that bκ(ω) = Φν,γ(ω) and Φ2 ν ,γ (ω) = bκrt(ω)2. Let a(ω) d,ν,γ b(ω) denote asymptotic equivalence up to a constant depending on d, ν, γ. Then, by (DLMF, Eqs. 10.25.3 & 10.30.2), we have Φν,γ(ω) d,ν,γ ω ν d 2 2 e γ ω 2 as ω 2 , (129) Φν,γ(ω) d,ν,γ ω (d 2ν)+ 2 as ω 2 0. (130) Hence, bκrt L2 whenever ν > d 4. Moreover, applying (129), we find that for ω 2 , Φν,γ(ω) Φ2 ν ,γ (ω) d,ν,γ,ν ,γ ω ν d 2 2 e γ ω 2 ω 2ν +d+1 2 e2γ ω 2 2 e(2γ γ) ω 2. If 2γ γ < 0, this expression is bounded for any value of ν . If 2γ γ = 0, this expression is bounded when ν ν 4. Applying (130), we find that for ω 2 0, Φν,γ(ω) Φ2 ν ,γ (ω) d,ν,γ,ν ,γ ω (d 2ν)+ 2 ω 2(d 2ν )+ 2 = ω 2(d 2ν )+ (d 2ν)+ 2 , 2, this expression is finite for any value of ν . If ν < d 2, this expression is finite when ν ν 4. Hence, our condition (128) is verified whenever (ν , γ ) belongs to the set Cν,γ,d (ν , γ ) : (1) ν > d 2, and (131) Kernel Thinning Sech Define κa(z) Qd j=1 sech p π 2 azj , and suppose κ = κa and eκrt = κ2a. Huggins and Mackey (2018, Ex. 3.2) yields that bκ(ω) = F(κa)(ω) = 1 ad Qd j=1 sech(p π a ) = a d κ1/a(ω), and (132) bκrt(ω) = F(eκrt)(ω) = (2a) dκ1/(2a)(ω). (133) Since sech2(b/2) = 4 eb+e b+2 > 4 eb+e b = 2 sech(b), we have κ1/a(ω) 2 d κ1/(2a)(ω)2. (134) Putting the pieces together, we further have bκ(ω) (132) = a d κ1/a(ω) (134) (2a) d κ1/(2a)(ω)2 (133) = (2a)d bκrt(ω)2. Since κ1/(2a) L2, we have verified the condition (128). Wendland When κ(z) = ϕd,s( z 2) and eκrt(z) = ϕd,s ( z 2), Wendland (2004, Thm. 10.35) implies that bκ(ω) d,s,s 1 (1+ ω 2 2)d+2s+1 1 (1+ ω 2 2)2d+4s +2 d,s,s bκrt(ω)2 d,s,s 1 (1+ ω 2 2)2d+4s +2 , with s N0, s (2s 1 d) 4 , thereby establishing the condition (128). Appendix R. Online Vector Balancing in Euclidean Space Using Thm. 3, we recover the online vector balancing result of Alweiss et al. (2021, Thm. 1.2) with improved constants and a less conservative setting of the thresholds ai. Note that we capture the usual obliviousness assumption by treating the sequence (fi)n i=1 as a fixed, deterministic input to Alg. 3. Corollary 7 (Online vector balancing in Euclidean space). If H = Rd equipped with the Euclidean dot product, each fi 2 1, and each ai = 1 2 + log(4n/δ), then, with probability at least 1 δ, the self-balancing Hilbert walk (Alg. 3) returns ψn satisfying ψn = Pn i=1 ηifi p 2 log(4d/δ) log(4n/δ). Proof Instantiate the notation of Thm. 3. With our choice of ai, the signed sum representation property (ii) implies that ψn = Pn i=1 ηifi with probability at least 1 δ/2. Moreover, the union bound, the functional sub-Gaussianity property (i), and the sub-Gaussian Hoeffding inequality (Wainwright, 2019, Prop. 2.5) now imply Pr( ψn > t | Fn) Pd j=1 P(| ψn, ej | > t | Fn) 2d exp( t2/(2σ2 n)) = δ/2 2 log(4d/δ). We claim that σ2 n αn maxj n max rj, fj 2 H with rj a2 j 2aj fj 2 H . (135) Dwivedi and Mackey Assuming this claim as given at the moment, we finish the proof. Since fj H = fj 2 1 for each j and 1 2 +log(4/δ) log(4/δ) is increasing in δ (0, 1], σ2 n 1 maxj n a2 j 2aj fj 2 H 1 maxj n a2 j 2aj 1 = log(4n/δ) ( 1 2 +log(4n/δ))2 2(log(4n/δ))2 log(4n/δ) ( 1 2 +log(4))2 2(log(4))2 log(4n/δ). 
The advertised result now follows from the union bound. We will now establish our earlier claim (135) using induction. For the base case, we have σ2 1 = f1 2 H α1 from the definition of σ2 1 (12). Now, for the inductive case, assume that σ2 i 1 αi 1 for some i > 1, and define the function g(x) = x + fi 2 H(1 x/ri)+ for x 0. Using the definition of σ2 i (12), we find σ2 i = σ2 i 1 + fi 2 H 1 + σ2 i 1 a2 i ( fi 2 H 2ai) + = σ2 i 1 + fi 2 H(1 σ2 i 1/ri)+ = g(σ2 i 1). Since αi 1 αi, our inductive assumption implies σ2 i 1 αi and hence σ2 i = g(σ2 i 1) maxx [0,αi] g(x) = max(maxx [0,ri] g(x), maxx (ri,αi] g(x)). Next, we observe that g is continuous and that the derivative of g is 1 fi 2 H/ri for x < ri and 1 for x > ri. Consequently, whenever fi 2 H < ri, the function g is increasing so that maxx [0,αi] g(x) g(αi). On the other hand, when fi 2 H ri, the function g is non-increasing in [0, ri] and increasing on (ri, ) so that maxx [0,ri] g(x) g(0) and maxx (ri,αi] g(αi). Moreover, g(αi) = αi since αi ri. Therefore, we always have σ2 i maxx [0,αi] g(x) max(g(0), g(αi)) = max( fi 2 H, αi) = αi. Since i > 1 was arbitrary, the desired claim (135) follows by induction. Appendix S. L Coresets of Phillips and Tai (2020) and Tai (2020) Here we provide more details on the L coreset construction of Phillips and Tai (2020) and Tai (2020) discussed in Sec. 8.3. Given input points (xi)n i=1, the Phillips-Tai (PT) construction forms the matrix K = (k (xi, xj))n i,j=1 of pairwise kernel evaluations, finds a matrix square-root V Rn n satisfying K = V V , and augments V with a row of ones (to encourage near-halving in the next step). Then, the PT construction runs the Gram-Schmidt (GS) walk of Bansal et al. (2018) to identify approximately half of the columns of V as coreset members and rebalances the coreset until exactly half of the input points belong to the coreset. The GS walk and rebalancing steps are recursively repeated Ω(log(n)) times to obtain an (n 1 2 , Op( 2 log n))-L coreset. The low-dimensional Gaussian kernel construction of Tai (2020) first partitions the input points into balls of radius 2 log n and then applies the PT construction separately to each ball. The result is an (n 1 2 , Op(2dn 1 log(d log n)))-L coreset with an additional superexponential Ω(d5d) running time dependence. Kernel Thinning Acknowledgments RD acknowledges support by National Science Foundation under Grant No. DMS2023528 for the Foundations of Data Science Institute (FODSI). The authors thank Fran cois-Xavier Briol, Lucas Janson, Lingxiao Li, Chris Oates, Art Owen, and the anonymous reviewers for their valuable feedback on this work and Jeffrey Rosenthal for helpful discussions surrounding geometric ergodicity. Part of this work was done when RD was interning at Microsoft Research New England. Ryan Alweiss, Yang P Liu, and Mehtaab Sawhney. Discrepancy minimization via a selfbalancing walk. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 14 20, 2021. Christoph M Augustin, Aurel Neic, Manfred Liebmann, Anton J Prassl, Steven A Niederer, Gundolf Haase, and Gernot Plank. Anatomically accurate high resolution modeling of human whole heart electromechanics: A strongly scalable algebraic multigrid solver method for nonlinear deformation. Journal of computational physics, 305:622 646, 2016. Francis Bach, Simon Lacoste-Julien, and Guillaume Obozinski. On the equivalence between herding and conditional gradient algorithms. 
In Proceedings of the 29th International Coference on International Conference on Machine Learning, ICML 12, page 1355 1362, Madison, WI, USA, 2012. Omnipress. ISBN 9781450312851. Nikhil Bansal, Daniel Dadush, Shashwat Garg, and Shachar Lovett. The Gram-Schmidt walk: A cure for the Banaszczyk blues. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 587 597, 2018. R emi Bardenet and Adrien Hardy. Monte Carlo with determinantal point processes. The Annals of Applied Probability, 30(1):368 417, 2020. Necdet Batir. Bounds for the gamma function. Results in Mathematics, 72(1): 865 874, 2017. doi: 10.1007/s00025-017-0698-0. URL https://doi.org/10.1007/ s00025-017-0698-0. William Beckner. Inequalities in Fourier analysis. Annals of Mathematics, pages 159 182, 1975. Ayoub Belhadji, R emi Bardenet, and Pierre Chainais. Kernel quadrature with DPPs. In Advances in Neural Information Processing Systems, volume 32, pages 12927 12937. Curran Associates, Inc., 2019. Ayoub Belhadji, R emi Bardenet, and Pierre Chainais. Kernel interpolation with continuous volume sampling. In International Conference on Machine Learning, pages 725 735. PMLR, 2020. Salomon Bochner. Monotone funktionen, stieltjessche integrale und harmonische analyse. Mathematische Annalen, 108(1):378 410, 1933. Dwivedi and Mackey Sergiy V Borodachov, Douglas P Hardin, and Edward B Saff. Low complexity methods for discretizing manifolds via Riesz energy minimization. Foundations of Computational Mathematics, 14(6):1173 1208, 2014. Jonathan M Borwein and O-Yeat Chan. Uniform bounds for the complementary incomplete gamma function. Mathematical Inequalities and Applications, 12:115 121, 2009. Fran cois-Xavier Briol, Chris J. Oates, Mark Girolami, and Michael A. Osborne. Frank Wolfe Bayesian Quadrature: Probabilistic integration with theoretical guarantees. In Advances in Neural Information Processing Systems, pages 1162 1170, 2015. Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of Markov chain Monte Carlo. CRC press, 2011. V. V. Buldygin and Yu. V. Kozachenko. Sub-gaussian random variables. Ukrainian Mathematical Journal, 32(6):483 489, 1980. doi: 10.1007/BF01087176. URL https: //doi.org/10.1007/BF01087176. Trevor Campbell and Tamara Broderick. Automated scalable Bayesian inference via Hilbert coresets. The Journal of Machine Learning Research, 20(1):551 588, 2019. W WL Chen and MM Skriganov. Explicit constructions in the classical mean squares problem in irregularities of point distribution. Journal fur die Reine und Angewandte Mathematik, (545):67 95, 2002. W. Y. Chen, L. Mackey, J. Gorham, F-X. Briol, and C. J. Oates. Stein points. In Proceedings of the 35th International Conference on Machine Learning, 2018. Wilson Ye Chen, Alessandro Barp, Fran cois-Xavier Briol, Jackson Gorham, Mark Girolami, Lester Mackey, and Chris Oates. Stein point Markov chain Monte Carlo. In International Conference on Machine Learning, pages 1011 1021. PMLR, 2019. Yutian Chen, Max Welling, and Alex Smola. Super-samples from kernel herding. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI 10, page 109 116, Arlington, Virginia, USA, 2010. AUAI Press. ISBN 9780974903965. Stefano De Marchi, Robert Schaback, and Holger Wendland. Near-optimal dataindependent point locations for radial basis function interpolation. Advances in Computational Mathematics, 23(3):317 330, 2005. DLMF. NIST Digital Library of Mathematical Functions. 
http://dlmf.nist.gov/, Release 1.1.1 of 2021-03-15. F. W. J. Olver, A. B. Olde Daalhuis, D. W. Lozier, B. I. Schneider, R. F. Boisvert, C. W. Clark, B. R. Miller, B. V. Saunders, H. S. Cohl, and M. A. Mc Clain, eds. Randal Douc, Eric Moulines, Pierre Priouret, and Philippe Soulier. Markov chains. Springer, 2018. Rick Durrett. Probability: Theory and Examples. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 5 edition, 2019. doi: 10.1017/ 9781108591034. Kernel Thinning Raaz Dwivedi and Lester Mackey. Generalized kernel thinning. In International Conference on Learning Representations, 2022. Raaz Dwivedi, Ohad N Feldheim, Ori Gurel-Gurevich, and Aaditya Ramdas. The power of online thinning in reducing discrepancy. Probability Theory and Related Fields, 174(1): 103 131, 2019. Noureddine El Karoui. The spectrum of kernel random matrices. The Annals of Statistics, 38(1):1 50, 2010. M. A. Gallegos-Herrada, D. Ledvinka, and J. S. Rosenthal. Equivalences of geometric ergodicity of markov chains, 2023. Damien Garreau, Wittawat Jitkrittum, and Motonobu Kanagawa. Large sample analysis of the median heuristic. ar Xiv preprint ar Xiv:1707.07269, 2017. Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123 214, 2011. Brian C Goodwin. Oscillatory behavior in enzymatic control process. Advances in Enzyme Regulation, 3:318 356, 1965. Ronald L. Graham, Donald Ervin Knuth, and Oren Patashnik. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley, Reading, MA, Second edition, 1994. URL https://www.csie.ntu.edu.tw/~r97002/temp/Concrete%20Mathematics%202e. pdf. Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Sch olkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(25): 723 773, 2012. Karthik S Gurumoorthy, Amit Dhurandhar, Guillermo Cecchi, and Charu Aggarwal. Efficient data representation by selecting prototypes with importance weights. In 2019 IEEE International Conference on Data Mining (ICDM), pages 260 269. IEEE, 2019. Heikki Haario, Eero Saksman, and Johanna Tamminen. Adaptive proposal distribution for random walk Metropolis algorithm. Computational Statistics, 14(3):375 395, 1999. Nick Harvey and Samira Samadi. Near-optimal herding. In Conference on Learning Theory, pages 1165 1182, 2014. Antoine Havet, Matthieu Lerasle, Eric Moulines, and Elodie Vernet. A quantitative Mc Diarmid s inequality for geometrically ergodic Markov chains. Electronic Communications in Probability, 25(none):1 11, 2020. doi: 10.1214/20-ECP286. URL https: //doi.org/10.1214/20-ECP286. Fred Hickernell. A generalized discrepancy and quadrature error bound. Mathematics of computation, 67(221):299 322, 1998. Dwivedi and Mackey Robert Hinch, JL Greenstein, AJ Tanskanen, L Xu, and RL Winslow. A simplified local control model of calcium-induced calcium release in cardiac ventricular myocytes. Biophysical journal, 87(6):3723 3736, 2004. Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13 30, 1963. Jonathan Huggins and Lester Mackey. Random feature Stein discrepancies. In Advances in Neural Information Processing Systems, pages 1899 1909, 2018. Ferenc Husz ar and David Kristjanson Duvenaud. Optimally-weighted herding is bayesian quadrature. Ar Xiv, abs/1408.2049, 2012. 
V Roshan Joseph, Tirthankar Dasgupta, Rui Tuo, and CF Jeff Wu. Sequential exploration of complex surfaces using minimum energy designs. Technometrics, 57(1):64 74, 2015. V Roshan Joseph, Dianpeng Wang, Li Gu, Shiji Lyu, and Rui Tuo. Deterministic sampling of expensive posteriors using minimum energy designs. Technometrics, 2019. Sarang Joshi, Raj Varma Kommaraji, Jeff M Phillips, and Suresh Venkatasubramanian. Comparing distributions and shapes using the kernel distance. In Proceedings of the twenty-seventh annual symposium on Computational geometry, pages 47 56, 2011. Zohar Karnin and Edo Liberty. Discrepancy, coresets, and sketches in machine learning. In Conference on Learning Theory, pages 1975 1993. PMLR, 2019. Toni Karvonen, Chris J Oates, and Simo Sarkka. A Bayes-Sard cubature method. In Advances in Neural Information Processing Systems, pages 5882 5893, 2018. Toni Karvonen, Motonobu Kanagawa, and Simo S arkk a. On the positivity and magnitudes of Bayesian quadrature weights. Statistics and Computing, 29(6):1317 1333, 2019. Rajiv Khanna and Michael W Mahoney. On linear convergence of weighted kernel herding. ar Xiv preprint ar Xiv:1907.08410, 2019. Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. Examples are not enough, learn to criticize! Criticism for interpretability. Advances in neural information processing systems, 29, 2016. Vladimir Koltchinskii and Evarist Gin e. Random matrix approximation of spectra of integral operators. Bernoulli, pages 113 167, 2000. Simon Lacoste-Julien, Fredrik Lindsten, and Francis Bach. Sequential kernel herding: Frank-Wolfe optimization for particle filtering. In Artificial Intelligence and Statistics, pages 544 552. PMLR, 2015. Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302 1338, 2000. Qiang Liu and Jason Lee. Black-box importance sampling. In Artificial Intelligence and Statistics, pages 952 961. PMLR, 2017. Kernel Thinning Alfred James Lotka. Elements of physical biology. Williams & Wilkins, 1925. Simon Mak and V Roshan Joseph. Support points. The Annals of Statistics, 46(6A): 2562 2592, 2018. Sean P Meyn and Richard L Tweedie. Markov chains and stochastic stability. Springer Science & Business Media, 2012. Ha Quang Minh. Some properties of gaussian reproducing kernel hilbert spaces and their implications for function approximation and learning theory. Constructive Approximation, 32(2):307 338, 2010. Arunava Mukherjea. A remark on tonelli s theorem on integration in product spaces. Pacific Journal of Mathematics, 42(1):177 185, 1972. Steven A Niederer, Lawrence Mitchell, Nicolas Smith, and Gernot Plank. Simulating human cardiac electrophysiology on clinical time-scales. Frontiers in Physiology, 2:14, 2011. E Novak and H Wozniakowski. Tractability of multivariate problems, volume ii: Standard information for functionals, european math. Soc. Publ. House, Z urich, 3, 2010. Chris J Oates, Mark Girolami, and Nicolas Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695 718, 2017. Chris J Oates, Jon Cockayne, Fran cois-Xavier Briol, and Mark Girolami. Convergence rates for a class of estimators based on Stein s method. Bernoulli, 25(2):1141 1159, 2019. Anthony O Hagan. Bayes Hermite quadrature. Journal of statistical planning and inference, 29(3):245 260, 1991. Art B Owen. Statistically efficient thinning of a Markov chain sampler. 
Journal of Computational and Graphical Statistics, 26(3):738 744, 2017. Brooks Paige, Dino Sejdinovic, and Frank Wood. Super-sampling with a reservoir. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pages 567 576, 2016. Jeff M Phillips. ε-samples for kernels. In Proceedings of the twenty-fourth annual ACMSIAM symposium on Discrete algorithms, pages 1622 1632. SIAM, 2013. Jeff M Phillips and Wai Ming Tai. Improved coresets for kernel density estimates. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2718 2727. SIAM, 2018. Jeff M Phillips and Wai Ming Tai. Near-optimal coresets of kernel density estimates. Discrete & Computational Geometry, 63(4):867 887, 2020. Alireza Rezaei and Shayan Oveis Gharan. A polynomial time mcmc method for sampling from continuous determinantal point processes. In International Conference on Machine Learning, pages 5438 5447. PMLR, 2019. Dwivedi and Mackey Marina Riabiz, Wilson Ye Chen, Jon Cockayne, Pawel Swietach, Steven A. Niederer, Lester Mackey, and Chris J. Oates. Replication Data for: Optimal Thinning of MCMC Output, 2020. URL https://doi.org/10.7910/DVN/MDKNWM. Accessed on Mar 23, 2021. Marina Riabiz, Wilson Chen, Jon Cockayne, Pawel Swietach, Steven A Niederer, Lester Mackey, and Chris Oates. Optimal thinning of MCMC output. To appear: Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2021. Gareth O Roberts and Jeffrey S Rosenthal. General state space Markov chains and MCMC algorithms. Probability surveys, 1:20 71, 2004. Gareth O Roberts and Richard L Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341 363, 1996. Saburou Saitoh. Applications of the general theory of reproducing kernels. In Reproducing Kernels and Their Applications, pages 165 188. Springer, 1999. Gabriele Santin and Bernard Haasdonk. Convergence rate of the data-independent p-greedy algorithm in kernel-based approximation. Dolomites Research Notes on Approximation, 10(Special Issue), 2017. Larry Schumaker. Spline functions: basic theory. Cambridge University Press, 2007. Abhishek Shetty, Raaz Dwivedi, and Lester Mackey. Distribution compression in near-linear time. In International Conference on Learning Representations, 2022. Joel Spencer. Balancing games. Journal of Combinatorial Theory, Series B, 23(1):68 74, 1977. Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Sch olkopf, and Gert RG Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(Apr):1517 1561, 2010. Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008. Ingo Steinwart and Clint Scovel. Mercer s theorem on general domains: On the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35(3):363 417, 2012. Marina Strocchi, Matthias AF Gsell, Christoph M Augustin, Orod Razeghi, Caroline H Roney, Anton J Prassl, Edward J Vigmond, Jonathan M Behar, Justin S Gould, Christopher A Rinaldi, Martin J Bishop, Gernot Plank, and Steven A Niederer. Simulating ventricular systolic motion in a four-chamber heart model with spatially varying robin boundary conditions to model the effect of the pericardium. Journal of Biomechanics, 101:109645, 2020. Xingping Sun. Conditionally positive definite functions and their application to multivariate interpolations. Journal of approximation theory, 74(2):159 180, 1993. 
Wai Ming Tai. New nearly-optimal coreset for kernel density estimation. arXiv preprint arXiv:2007.08031, 2020.
Ilya Tolstikhin, Bharath K Sriperumbudur, and Krikamol Muandet. Minimax estimation of kernel mean embeddings. The Journal of Machine Learning Research, 18(1):3002-3048, 2017.
Paxton Turner, Jingbo Liu, and Philippe Rigollet. A statistical perspective on coreset density estimation. In International Conference on Artificial Intelligence and Statistics, pages 2512-2520. PMLR, 2021.
Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 2020. doi: https://doi.org/10.1038/s41592-019-0686-2.
Vito Volterra. Variazioni e fluttuazioni del numero d'individui in specie animali conviventi. 1926.
Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
Holger Wendland. Scattered data approximation, volume 17. Cambridge University Press, 2004.
Haizhang Zhang and Liang Zhao. On the inclusion relation of reproducing kernel Hilbert spaces. Analysis and Applications, 11(02):1350014, 2013.
Ding-Xuan Zhou. The covering number in learning theory. Journal of Complexity, 18(3):739-767, 2002.