# Debiased Distribution Compression

Lingxiao Li¹, Raaz Dwivedi², Lester Mackey³

¹MIT CSAIL  ²Cornell Tech  ³Microsoft Research New England. Correspondence to: Lingxiao Li, Raaz Dwivedi, Lester Mackey.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract. Modern compression methods can summarize a target distribution $\mathbb{P}$ more succinctly than i.i.d. sampling but require access to a low-bias input sequence like a Markov chain converging quickly to $\mathbb{P}$. We introduce a new suite of compression methods suitable for compression with biased input sequences. Given $n$ points targeting the wrong distribution and quadratic time, Stein Kernel Thinning (SKT) returns $\sqrt{n}$ equal-weighted points with $\widetilde{O}(n^{-1/2})$ maximum mean discrepancy (MMD) to $\mathbb{P}$. For larger-scale compression tasks, Low-rank SKT achieves the same feat in sub-quadratic time using an adaptive low-rank debiasing procedure that may be of independent interest. For downstream tasks that support simplex or constant-preserving weights, Stein Recombination and Stein Cholesky achieve even greater parsimony, matching the guarantees of SKT with as few as poly-log(n) weighted points. Underlying these advances are new guarantees for the quality of simplex-weighted coresets, the spectral decay of kernel matrices, and the covering numbers of Stein kernel Hilbert spaces. In our experiments, our techniques provide succinct and accurate posterior summaries while overcoming biases due to burn-in, approximate Markov chain Monte Carlo, and tempering.

1. Introduction

Distribution compression is the problem of summarizing a target probability distribution $\mathbb{P}$ with a small set of representative points. Such compact summaries are particularly valuable for tasks that incur substantial downstream computation costs per summary point, like organ and tissue modeling in which each simulation consumes thousands of CPU hours (Niederer et al., 2011). Remarkably, modern compression methods can summarize a distribution more succinctly than i.i.d. sampling. For example, kernel thinning (KT) (Dwivedi and Mackey, 2021; 2022), Compress++ (Shetty et al., 2022), recombination (Hayakawa et al., 2023), and randomly pivoted Cholesky (Epperly and Moreno, 2024) all provide $\widetilde{O}(1/m)$ approximation error using $m$ points, a significant improvement over the $\Omega(1/\sqrt{m})$ approximation provided by i.i.d. sampling from $\mathbb{P}$. However, each of these constructions relies on access to an accurate input sequence, like an i.i.d. sample from $\mathbb{P}$ or a Markov chain converging quickly to $\mathbb{P}$.

Much more commonly, one only has access to $n$ biased sample points approximating a distribution $\mathbb{Q} \neq \mathbb{P}$. Such biases are a common occurrence in Markov chain Monte Carlo (MCMC)-based inference due to tempering (where one targets a less peaked and more dispersed distribution to achieve faster convergence, Gramacy et al., 2010), burn-in (where the initial state of a Markov chain biases the distribution of chain iterates, Cowles and Carlin, 1996), or approximate MCMC (where one runs a cheaper approximate Markov chain to avoid the prohibitive costs of an exact MCMC algorithm, e.g., Ahn et al., 2012). The Stein thinning (ST) method of Riabiz et al. (2022) was developed to provide accurate compression even when the input sample sequence provides a poor approximation to the target. ST operates by greedily thinning the input sample to minimize the maximum mean discrepancy (MMD, Gretton et al., 2012) to $\mathbb{P}$.
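To make the greedy selection rule concrete, here is a minimal NumPy sketch of one round-by-round form of Stein thinning, assuming a user-supplied mean-zero Stein kernel `kp(A, B)` that returns the kernel matrix between two point sets. It illustrates only the selection criterion and is not the authors' optimized JAX implementation (which uses sufficient statistics; see Alg. D.1).

```python
import numpy as np

def stein_thinning(X, kp, m):
    """Greedy Stein thinning sketch (illustrative, not the paper's optimized code).

    X  : (n, d) array of candidate points.
    kp : callable, kp(A, B) -> matrix of a mean-zero (Stein) kernel between point sets.
    m  : number of points to select (repeats allowed).
    """
    n = X.shape[0]
    # Diagonal kp(x_i, x_i) for every candidate point.
    diag = np.array([kp(X[i:i + 1], X[i:i + 1])[0, 0] for i in range(n)])
    running = np.zeros(n)   # running[i] = sum of kp(x_j, x_i) over points selected so far
    selected = []
    for _ in range(m):
        # Greedy ST criterion: the i-dependent part of the squared MMD after adding x_i
        # is kp(x_i, x_i) + 2 * running[i], so we minimize kp(x_i, x_i)/2 + running[i].
        i = int(np.argmin(0.5 * diag + running))
        selected.append(i)
        running += kp(X, X[i:i + 1])[:, 0]
    return np.array(selected)
```

A coreset of size $m$ then assigns weight $1/m$ to each selected index; because points may be selected repeatedly, the equal-weighted output can place extra mass on regions that the biased input under-represents relative to $\mathbb{P}$.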
However, ST is only known to provide an $O(1/\sqrt{m})$ approximation to $\mathbb{P}$; this guarantee is no better than that of i.i.d. sampling and a far cry from the $\widetilde{O}(1/m)$ error achieved with unbiased coreset constructions. In this work, we address this deficit by developing new, efficient coreset constructions that provably yield better-than-i.i.d. error even when the input sample is biased. For $\mathbb{P}$ on $\mathbb{R}^d$, our primary contributions are fourfold and summarized in Tab. 1. First, for the task of equal-weighted compression, we introduce Stein Kernel Thinning (SKT, Alg. 1), a strategy that combines the greedy bias correction properties of ST with the unbiased compression of KT to produce $\sqrt{n}$ summary points with error $\widetilde{O}(n^{-1/2})$ in $O(n^2)$ time. In contrast, ST would require $\Omega(n)$ points to guarantee this error. Second, for larger-scale compression problems, we propose Low-rank SKT (Alg. 3), a strategy that combines the scalable summarization of Compress++ with a new low-rank debiasing procedure (Alg. 2) to match the SKT guarantees in sub-quadratic $o(n^2)$ time. Third, for the task of simplex-weighted compression, in which summary points are accompanied by weights in the simplex, we propose greedy and low-rank Stein Recombination (Alg. 5) constructions that match the guarantees of SKT with as few as poly-log(n) points. Finally, for the task of constant-preserving compression, in which summary points are accompanied by real-valued weights summing to 1, we introduce greedy and low-rank Stein Cholesky (Alg. 7) constructions that again match the guarantees of SKT using as few as poly-log(n) points. Underlying these advances are new guarantees for the quality of simplex-weighted coresets (Thms. 1 and 2), the spectral decay of kernel matrices (Cor. B.1), and the covering numbers of Stein kernel Hilbert spaces (Prop. 1) that may be of independent interest. In Sec. 5, we employ our new procedures to produce compact summaries of complex target distributions given input points biased by burn-in, approximate MCMC, or tempering.

Table 1: Methods for debiased distribution compression. For each method, we report the smallest coreset size $m$ and running time, up to logarithmic factors, sufficient to guarantee $\widetilde{O}(n^{-1/2})$ $\mathrm{MMD}_{k_P}$ to $\mathbb{P}$ given a LOGGROWTH kernel $k_P$ and $n$ slow-growing input points $S_n = (x_i)_{i=1}^n$ from a fast-mixing Markov chain targeting $\mathbb{Q}$ with tails no lighter than $\mathbb{P}$ (see Thm. 1 and Def. 3). For generic slow-growing $S_n$, identical guarantees hold for the excess $\mathrm{MMD}_{k_P}$ (2) relative to the best simplex reweighting of $S_n$.

| Method | Compression type | Coreset size $m$ | Runtime | Source |
| --- | --- | --- | --- | --- |
| Stein Thinning (Riabiz et al., 2022) | equal-weighted | $n$ | $d_{k_P} n^2$ | App. D.1 |
| Stein Kernel Thinning, Greedy (Alg. 1) | equal-weighted | $\sqrt{n}$ | $d_{k_P} n^2$ | Thm. 3 |
| Stein Kernel Thinning, Low-rank (Alg. 3) | equal-weighted | $\sqrt{n}$ | $d_{k_P} n^{1.5}$ | Thm. 5 |
| Stein Recombination, Greedy / Low-rank (Alg. 5) | simplex-weighted | poly-log($n$) | $d_{k_P} n^2$ / $d_{k_P} n + n^{1.5}$ | Thm. 6 |
| Stein Cholesky, Greedy / Low-rank (Alg. 7) | constant-preserving | poly-log($n$) | $d_{k_P} n^2$ / $d_{k_P} n + n^{1.5}$ | Thm. 7 |

Notation. We assume Borel-measurable sets and functions and define $[n] \triangleq \{1, \ldots, n\}$, $\Delta_{n-1} \triangleq \{w \in \mathbb{R}^n : w \geq 0,\ \mathbf{1}^\top w = 1\}$, $\|x\|_0 \triangleq |\{i : x_i \neq 0\}|$, and $\|x\|_p^p \triangleq \sum_i |x_i|^p$ for $x \in \mathbb{R}^d$ and $p \geq 1$. For $x \in \mathbb{R}^d$, $\delta_x$ denotes the delta measure at $x$. We let $\mathcal{H}_k$ denote the reproducing kernel Hilbert space (RKHS) of a kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ (Aronszajn, 1950) and $\|f\|_k$ denote the RKHS norm of $f \in \mathcal{H}_k$. For a measure $\mu$ and separately $\mu$-integrable $k$ and $f$, we write $\mu f \triangleq \int f(x)\, d\mu(x)$ and $\mu k(x) \triangleq \int k(x, y)\, d\mu(y)$. The divergence of a differentiable matrix-valued function $A$ is $(\nabla_x \cdot A(x))_j \triangleq \sum_i \partial_{x_i} A_{ij}(x)$.
For random variables (Xn)n N, we say Xn = O(f(n, δ)) holds with probability 1 δ if Pr(Xn Cf(n, δ)) 1 δ for a constant C independent of (n, δ) and all n sufficiently large. When using this notation, we view all algorithm parameters except δ as functions of n. For A Rn n and v Rn, diag(A) and diag(v) are n n diagonal matrices with Aii and vi respectively as the i-th diagonal entry. 2. Debiased Distribution Compression Throughout, we aim to summarize a fixed target distribution P on Rd using a sequence Sn (xi)n i=1 of potentially biased candidate points in Rd.1 Correcting for unknown biases in Sn requires some auxiliary knowledge of P. For us, this knowledge comes in the form of a kernel function k P with known expectation under P. Without loss of generality, we can take this kernel mean to be identically zero.2 Assumption 1 (Mean-zero kernel). For some p 1/2, Ex P[k P(x, x)p] < and Pk P 0. Given a target compression size m, our goal is to output an weight vector w Rn with w 0 m, 1 n w = 1, and o(m 1/2) (better-than-i.i.d.) maximum mean discrepancy (MMD) to P: MMDk P(Pn i=1wiδxi, P) q Pn i,j=1wiwjk P(xi, xj). We consider three standard compression tasks with w 0 m. In equal-weighted compression one selects m possibly repeated points from Sn and assigns each a weight of 1 m; because of repeats, the induced weight vector 1Our coreset constructions will in fact apply to any sample space, but our analysis will focus on Rd. 2For Pk P 0, the kernel k P (x, y) = k P(x, y) Pk P(x) Pk P(y) + PPk P satisfies Pk P 0 and MMDk P = MMDk P. Debiased Distribution Compression over Sn satisfies w n 1 ( N0 m )n. In simplex-weighted compression we allow any w n 1, and in constantpreserving compression we simply enforce 1 n w = 1. All three coreset types exactly preserve constants and hence satisfy standard coherence constraints. When making big-O statements, we will treat Sn as the prefix of an infinite sequence S (xi)i N. We also write k P(Sn[J], Sn[J]) [k P(xi, xj)]i,j J for the principal kernel submatrix with indices J [n]. 2.1. Kernel assumptions Many practical Stein kernel constructions are available for generating mean-zero kernels for a target P (Chwialkowski et al., 2016; Liu et al., 2016; Gorham and Mackey, 2017; Gorham et al., 2019; Barp et al., 2019; Yang et al., 2018; Afzali and Muthukumarana, 2023). We will use the most prominent of these Stein kernels as a running example: Definition 1 (Stein kernel). Given a differentiable base kernel k and a symmetric positive semidefinite matrix M, the Stein kernel kp : Rd Rd R for P with positive differentiable Lebesgue density p is defined as kp(x, y) 1 p(x)p(y) x y (p(x)Mk(x, y)p(y)). While our algorithms apply to any mean zero kernel, including the aforementioned Stein kernels, the Sobolev kernels used in quasi-Monte Carlo (Kuo, 2003), and the popular centered Gaussian kernel (Chen et al., 2010; Lacoste Julien et al., 2015), our guarantees adapt to the underlying smoothness of the kernel. Our next definition and assumption make this precise. Definition 2 (Covering number). For a kernel k : Rd Rd R with Bk {f Hk : f k 1}, a set A Rd, and ε > 0, the covering number Nk(A, ε) is the minimum cardinality of all sets C Bk satisfying h C{g Bk : supx A |h(x) g(x)| ε}. Assumption (α, β)-kernel. For some Cd > 0, all r > 0 and ε (0, 1), and B2(r) {x Rd : x 2 r}, a kernel k is either POLYGROWTH(α, β), i.e., log Nk(B2(r), ε) Cd(1/ε)α(r + 1)β, with α < 2 or LOGGROWTH(α, β), i.e., log Nk(B2(r), ε) Cd log(e/ε)α(r + 1)β. In Cor. 
B.1 we show that the eigenvalues of kernel matrices with POLYGROWTH and LOGGROWTH kernels have polynomial and exponential decay respectively. Dwivedi and Mackey (2022, Prop. 2) showed that all sufficiently differentiable kernels satisfy the POLYGROWTH condition and that bounded radially analytic kernels are LOGGROWTH. Our next result, proved in App. B.2, shows that a Stein kernel kp can inherit the growth properties of its base kernel even if kp is itself unbounded and non-smooth. Proposition 1 (Stein kernel growth rates). A Stein kernel kp with sup x 2 r log p(x) 2 = O(rdℓ) for dℓ 0 is (a) LOGGROWTH(d+1, 2d+δ) for any δ > 0 if the base kernel k is radially analytic (Def. B.3) and (b) POLYGROWTH( d s 1, (1+ dℓ s )d) if the base kernel k is s-times continuously differentiable (Def. B.2) for s>1. Notably, the popular Gaussian (Ex. B.1) and inverse multiquadric (Ex. B.2) base kernels satisfy the LOGGROWTH preconditions, while Mat ern, B-spline, sinc, sech, and Wendland s compactly supported kernels satisfy the POLYGROWTH precondition (Dwivedi and Mackey, 2022, Prop. 3). To our knowledge, Prop. 1 and Cor. B.1 provide the first covering number bounds and eigenvalue decay rates for the typically unbounded Stein kernels kp. 2.2. Input point desiderata Our primary desideratum for the input points is that they can be debiased into an accurate estimate of P. Indeed, our high-level strategy for debiased compression is to first use k P to debias the input points into a more accurate approximation of P and then compress that approximation into a more succinct representation. Fortunately, even when the input Sn targets a distribution Q = P, effective debiasing is often achievable via simplex reweighting, i.e., by solving the convex optimization problem w OPT arg minw n 1 Pn i,j=1 wiwjk P(xi, xj) (1) with MMDOPT MMDk P(Pn i=1w OPTiδxi, P). For example, Hodgkinson et al. (2020, Thm. 1b) showed that simplex reweighting can correct for biases due to offtarget i.i.d. or MCMC sampling. Our next result (proved in App. C.2) significantly relaxes their conditions. Theorem 1 (Debiasing to i.i.d. quality via simplex reweighting). Consider a kernel k P satisfying Assum. 1 with Hk P separable, and suppose (xi) i=1 are the iterates of a homogeneous ϕ-irreducible geometrically ergodic Markov chain (Gallegos-Herrada et al., 2023, Thm. 1) with stationary distribution Q and initial distribution absolutely continuous with respect to P. If Ex P[ d P d Q(x)2q 1k P(x, x)q] < for some q > 1 then MMDOPT = O(n 1/2) in probability. Remark 1. Hk P is separable whenever k P is continuous (Steinwart and Christmann, 2008, Lem. 4.33). Since n points sampled i.i.d. from P have Θ(n 1/2) root mean squared MMD (see Prop. C.1), Thm. 1 shows that Debiased Distribution Compression a debiased off-target sample can be as accurate as a direct sample from P. Moreover, Thm. 1 applies to many practical examples. The simplest example of a geometrically ergodic chain is i.i.d. sampling from Q, but geometric ergodicity has also been established for a variety of popular Markov chains including random walk Metropolis (Roberts and Tweedie, 1996, Thm. 3.2), independent Metropolis Hastings (Atchad e and Perron, 2007, Thm. 2.2), the unadjusted Langevin algorithm (Durmus and Eric Moulines, 2017, Prop. 8), the Metropolis-adjusted Langevin algorithm (Durmus and Moulines, 2022, Thm. 1), Hamiltonian Monte Carlo (Durmus et al., 2020, Thm. 10 and Thm. 11), stochastic gradient Langevin dynamics (Li et al., 2023, Thm. 2.1), and the Gibbs sampler (Johnson, 2009). 
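For $M = I$ and a twice-differentiable base kernel, Definition 1 expands into the familiar Langevin Stein kernel form $k_p(x,y) = \nabla_x \cdot \nabla_y k(x,y) + \nabla_x k(x,y) \cdot \nabla_y \log p(y) + \nabla_y k(x,y) \cdot \nabla_x \log p(x) + k(x,y)\, \nabla_x \log p(x) \cdot \nabla_y \log p(y)$, which depends on $p$ only through its score $\nabla \log p$. The sketch below evaluates this expansion for an IMQ base kernel like the one used in Sec. 5; it is an illustrative NumPy implementation under the simplifying assumption $M = I$, not the authors' code, and `score` is a user-supplied score function.

```python
import numpy as np

def imq_stein_kernel(x, y, score, sigma=1.0):
    """Langevin Stein kernel for the IMQ base kernel k(x,y) = (1 + ||x-y||^2/sigma^2)^(-1/2),
    with M = I (a simplifying assumption).

    x, y  : (d,) arrays; score : callable returning grad log p at a point, shape (d,).
    """
    d = x.shape[0]
    r = x - y
    c = 1.0 + (r @ r) / sigma**2               # 1 + ||x - y||^2 / sigma^2
    k = c ** -0.5                              # IMQ base kernel value
    grad_x_k = -(r / sigma**2) * c ** -1.5     # gradient of k in x
    grad_y_k = -grad_x_k                       # gradient of k in y
    div_xy_k = d / sigma**2 * c ** -1.5 - 3.0 * (r @ r) / sigma**4 * c ** -2.5
    sx, sy = score(x), score(y)
    return div_xy_k + grad_x_k @ sy + grad_y_k @ sx + k * (sx @ sy)

# Example with a standard Gaussian target, whose score is -x:
# imq_stein_kernel(np.zeros(3), np.ones(3), score=lambda z: -z)
```

Because only $\nabla \log p$ enters, the kernel can be evaluated without normalizing constants, which is what makes it usable for debiasing against an unnormalized posterior density.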
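Given any such mean-zero kernel, the MMD of a weighted coreset to $\mathbb{P}$ reduces to a quadratic form in the weights, $\mathrm{MMD}_{k_P}(\sum_i w_i \delta_{x_i}, \mathbb{P}) = \sqrt{w^\top K w}$ with $K = k_P(S_n, S_n)$, so every objective in the paper can be evaluated without direct access to $\mathbb{P}$. A small illustrative helper (not code from the paper):

```python
import numpy as np

def mmd_to_target(w, K):
    """MMD of sum_i w_i * delta_{x_i} to P, assuming K[i, j] = k_P(x_i, x_j)
    for a kernel with zero mean under P (so MMD^2 = w^T K w)."""
    return np.sqrt(max(float(w @ K @ w), 0.0))  # clip tiny negatives from round-off
```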
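The debiasing program (1) is then a convex quadratic minimization over the simplex. The paper solves a low-rank surrogate with accelerated entropic mirror descent (Sec. 3); the sketch below is a plain, unaccelerated entropic mirror descent (exponentiated-gradient) loop on the full kernel matrix, offered only as a sanity-check baseline under the assumption that $K$ fits in memory.

```python
import numpy as np

def simplex_reweight(K, n_steps=500, step=None):
    """Entropic mirror descent sketch for w_OPT = argmin_{w in simplex} w^T K w.

    K : (n, n) mean-zero kernel matrix k_P(S_n, S_n).
    Returns simplex weights approximately minimizing the squared MMD to P.
    """
    n = K.shape[0]
    w = np.full(n, 1.0 / n)
    if step is None:
        step = 1.0 / max(np.abs(K).max(), 1e-12)   # crude step-size heuristic
    for _ in range(n_steps):
        grad = 2.0 * (K @ w)        # gradient of w^T K w
        grad -= grad.min()          # constant shift for stability; no effect after normalization
        w = w * np.exp(-step * grad)
        w /= w.sum()                # renormalize onto the simplex
    return w
```

Each update with the full matrix costs $O(n^2)$; the low-rank-plus-diagonal surrogate of Sec. 3 reduces this per-step cost to $O(nr)$.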
Moreover, for Q absolutely continuous with respect to P, the importance weight d P d Q is typically bounded or slowly growing when the tails of Q are not much lighter than those of P. Remarkably, under more stringent conditions, Thm. 2 (proved in App. C.3) shows that simplex reweighting can decrease MMD to P at an even-faster-than-i.i.d. rate. Theorem 2 (Better-than-i.i.d. debiasing via simplex reweighting). Consider a kernel k P satisfying Assum. 1 with p = 2 and points (xi) i=1 drawn i.i.d. from a distribution Q with d P d Q bounded. If E[k P(x1, x1)q] < for some q > 3, then E[MMD2 OPT] = o(n 1). The work of Liu and Lee (2017, Thm. 3.3) also established o(n 1/2) MMD error for simplex reweighting but only under a uniformly bounded eigenfunctions assumption that is often violated (Minh, 2010, Thm. 1, Zhou, 2002, Ex. 1) and difficult to verify (Steinwart and Scovel, 2012). We highlight that our remaining results make no particular assumption about the input points but rather upper bound the excess MMD MMDk P(w) MMDk P(P i [n]wiδxi, P) (2) of a candidate weighting w in terms of the input point radius Rn maxi [n] xi 2 1 and kernel radius k P n maxi [n] k P(xi, xi). While these results apply to any input points, we consider the following running example of slow-growing input points throughout the paper. Definition 3 (Slow-growing input points). We say Sn is γslow-growing if Rn = O((log n)γ) for some γ 0 and k P n = e O(1). Notably, Sn is 1-slow-growing with probability 1 when k P(x, x) is polynomially bounded by x 2 and the input points are drawn from a homogeneous ϕ-irreducible geometrically ergodic Markov chain with a sub-exponential target Q, i.e., E[ec x 2] < for some c > 0 (Dwivedi and Mackey, 2021, Prop. 2). For a Stein kernel kp (Def. 1), by Prop. B.3, kp(x, x) is polynomially bounded by x 2 if k(x, x), x yk(x, x) 2, and log p(x) 2 are all polynomially bounded by x 2. Moreover, log p(x) 2 is automatically polynomially bounded by x 2 when log p is Lipschitz or, more generally, pseudo-Lipschitz (Erdogdu et al., 2018, Eq. (2.5)). 2.3. Debiased compression via Stein Kernel Thinning Off-the-shelf solvers based on mirror descent and Frank Wolfe can solve the convex debiasing program (1) in O(n3) time by generating weights with O(n 1/2 k P n) excess MMD (Liu and Lee, 2017). We instead employ a more efficient, greedy debiasing strategy based on Stein thinning (ST). After n rounds, ST outputs an equal-weighted coreset of size n with O(n 1/2 k P n) excess MMD (Riabiz et al., 2022, Thm. 1). Moreover, while the original implementation of Riabiz et al. (2022) has cubic runtime, our implementation (Alg. D.1) based on sufficient statistics improves the runtime to O(n2dk P) where dk P denotes the runtime of a single kernel evaluation.3 The equal-weighted output of ST serves as the perfect input for the kernel thinning (KT) algorithm which compresses an equal-weighted sample of size n into a coreset of any target size m n in O(n2dk P) time. We adapt the target KT algorithm slightly to target MMD error to P and to include a baseline ST coreset of size m in the KT-SWAP step (see Alg. D.3). Combining the two routines we obtain Stein Kernel Thinning (SKT), our first solution for equalweighted debiased distribution compression: Algorithm 1 Stein Kernel Thinning (SKT) Input: mean-zero kernel k P, points Sn, output size m, KT failure probability δ n m 2 log2 n m w Stein Thinning(k P, Sn, n ) w SKT Kernel Thinning(k P, Sn, n , w, m, δ) Return: w SKT n 1 ( N0 m )n hence w SKT 0 m Our next result, proved in App. 
D.3, shows that SKT yields better-than-i.i.d. excess MMD whenever the radii (Rn and k P n) and kernel covering number exhibit slow growth. Theorem 3 (MMD guarantee for SKT). Given a kernel k P satisfying Assums. 1 and (α, β)-kernel, Stein Kernel Thinning (Alg. 1) outputs w SKT in O(n2dk P) time satisfying MMDk P(w SKT)=O k P nℓδ log n Rβ n Gα m min(m, n) with probability at least 1 δ, where ℓδ log2( e ( 1+log m LOGGROWTH(α, β), m POLYGROWTH(α, β). 3Often, dk P = Θ(d) as in the case of Stein kernels (App. I.1). Debiased Distribution Compression Example 1. Under the assumptions of Thm. 3 with γ-slowgrowing input points (Def. 3), LOGGROWTH k P, and a coreset size m n, SKT with high probability delivers e O(m 1) excess MMD, a nearly minimax optimal rate for equal-weighted coresets (Phillips and Tai, 2020, Thm. 3.1) and a significant improvement over the Ω(m 1/2) error rate of i.i.d. sampling. Remark 2. When m< n, we can uniformly subsample or, in the case of MCMC inputs, standard thin (i.e., keep only every n m2 -th point of) the input sequence down to size m2 before running SKT to reduce runtime from O(n2) to O(m4) while incurring only O(m 1) excess error. A similar remark applies to LSKT algorithm introduced in Sec. 3. 3. Accelerated Debiased Compression To enable larger-scale debiased compression, we next introduce a sub-quadratic-time version of SKT built via a new low-rank debiasing scheme and the near-linear-time compression algorithm of Shetty et al. (2022). 3.1. Fast bias correction via low-rank approximation At a high level, our approach to accelerated debiasing involves four components. First, we form a rank-r approximation FF of the kernel matrix K = k P(Sn, Sn) in O(nrdk P+nr2) time using a weighted extension (Weighted RPCholesky, Alg. F.1) of the randomly pivoted Cholesky algorithm of Chen et al. (2022, Alg. 2.1). Second, we correct the diagonal to form K = FF + diag(K FF ). Third, we solve the reweighting problem (1) with K substituted for K using T iterations of accelerated entropic mirror descent (AMD, Wang et al., 2023, Alg. 14 with ϕ(w) = P i wi log wi). The acceleration ensures O(1/T 2) suboptimality after T iterations, and each iteration takes only O(nr) time thanks to the low-rank plus diagonal approximation. Finally, we repeat this three-step procedure Q times, each time using the weights outputted by the prior round to update the low-rank approximation b K. On these subsequent adaptive rounds, Weighted RPCholesky approximates the leading subspace of a weighted kernel matrix diag( w) for w n 1 before undoing the row and column reweighting. Since each round s weights are closer to optimal, this adaptive updating has the effect of upweighting more relevant subspaces for subsequent debiasing. For added sparsity, we prune the weights outputted by the prior round using stratified residual resampling (Resample, Alg. E.3, Douc and Capp e, 2005). Our complete Low-rank Debiasing (LD) scheme, summarized in Alg. 2, enjoys o(n2) runtime whenever r = o(n1/2), T = O(n1/2), and Q = O(1). Moreover, our next result, proved in App. F.1, shows that LD provides i.i.d.-level precision whenever T n, Q = Algorithm 2 Low-rank Debiasing (LD) Input: mean-zero kernel k P, points Sn = (xi)n i=1, rank r, AMD steps T, adaptive rounds Q w(0) ( 1 n, . . . 
, 1 for q = 1 to Q do w Resample(w(q 1), n) I, F Weighted RPCholesky(k P, Sn, w, r) K FF + diag(k P(Sn, Sn)) diag(FF ) w(q) AMD(K , T, w, AGG = 1q>1) if (w(q)) K w(q) > w K w then w(q) w end for Return: w LD w(Q) n 1 O(1), and r grows appropriately with the input radius and kernel covering number. Assumption (α,β)-params. The kernel k P satisfies Assums. 1 and (α, β)-kernel, the output size and rank m, r ( Cd Rβ n+1 log 2 + 2 log 2)2, the AMD step count T n, and the adaptive round count Q=O(1).4 Theorem 4 (Debiasing guarantee for LD). Under Assum. (α,β)-params, Low-rank Debiasing (Alg. 2) takes O((dk P +r+T)nr) time to output w LD satisfying MMDk P(w LD)=O q k P n max(log n,1/δ) with probability at least 1 δ, for any δ (0, 1) and Hn,r defined in (46) that satisfies O r( R2β n r ) 1 α POLYGROWTH(α, β), O r exp( 0.83 r 2.39 LOGGROWTH(α, β). Example 2. Under the assumptions of Thm. 4 with γ-slowgrowing input points (Def. 3), LOGGROWTH k P, T = Θ( n), and r = (log n)2(α+βγ)+ϵ for any ϵ > 0, LD delivers e O(n 1/2) excess MMD with high probability in e O(n1.5) time. 3.2. Fast debiased compression via Low-rank Stein KT To achieve debiased compression in sub-quadratic time, we next propose Low-rank SKT (Alg. 3). LSKT debiases the input using LD, converts the LD output into an equal-weighted coreset using Resample, and finally combines KT with the divide-and-conquer Compress++ framework (Shetty et al., 2022) to compress n equal-weighted points into n in near-linear time. Our next result (proved in App. F) shows that LSKT can provide better-than-i.i.d. excess MMD in o(n2) time. 4To unify the presentation of our results, Assum. (α,β)- params constrains all common algorithm input parameters with the understanding that the conditions are enforced only when the input is relevant to a given algorithm. Debiased Distribution Compression Algorithm 3 Low-rank Stein Kernel Thinning (LSKT) Input: mean-zero kernel k P, points Sn = (xi)n i=1, rank r, AGM steps T, adaptive rounds Q, oversampling parameter g, failure prob. δ w Low-rank Debiasing(k P, Sn, r, T, Q) n 4 log4 n , m n output size n m < 2 n w Resample(w, n ) w LSKT KT-Compress++(k P, Sn, n , w, g, δ 3) Return: w LSKT n 1 ( N0 m )n hence w LSKT 0 m Theorem 5 (MMD guarantee for LSKT). Under Assum. (α,β)-params, Low-rank SKT (Alg. 3) with g [log2 log(n + 1) + 3.1, log4( n/ log n)] and δ (0, 1) outputs w LSKT in O((dk P +r +T)nr +dk Pn1.5) time satisfying, with probability at least 1 δ, MMDk P(w LSKT) k P n max(1/δ, ℓδ(log n)nγβGα n) for Gm, Hn,r as in Thms. 3 and 5. Example 3. Under the assumptions of Thm. 5 with γslow-growing input points (Def. 3), LOGGROWTH k P, T = Θ( n), and r = (log n)2(α+βγ)+ϵ for any ϵ > 0, LSKT delivers, with high probability, e O(n 1/2) excess MMD in e O(n1.5) time using a coreset of size m [ n, 2 n). 4. Weighted Debiased Compression The prior sections developed debiased equal-weighted coresets with better-than-i.i.d. compression guarantees. Equal-weighted coresets are typically chosen for their compatibility with unweighted downstream tasks and easy visualization. For downstream tasks that support weights, we next match the equal-weighted guarantees with significantly smaller simplex-weighted or constant-preserving coresets. 4.1. 
Simplex-weighted coresets via Stein Recombination Simplex-weighted coresets automatically preserve the constraints of convex integrands, support straightforward direct sampling, and, when compared with schemes involving negative weights, offer improved numerical stability in the presence of integral evaluations errors (Karvonen et al., 2019). Inspired by the coreset constructions of Hayakawa et al. (2022; 2023), we first introduce a simplexweighted compression algorithm, Recombination Thinning (RT, Alg. 4), suitable for summarizing a debiased input sequence. To produce a coreset given input weights w n 1, RT first prunes small weights using Resample and then uses Weighted RPCholesky to identify m 1 test vectors that capture most of the variability in the weighted Algorithm 4 Recombination Thinning (RT) Input: mean-zero kernel k P, points Sn = (xi)n i=1, weights w n 1, output size m w Resample(w, n) I, F Weighted RPCholesky(k P, Sn, w, m 1) w Recombination([F, 1n] , w) [F, 1n] Rm n F w = F w , w n 1 , and w 0 m w KT-Swap-LS(k P, Sn, w , SPLX); J {i: w i > 0} w [J] argminw |J| 1 w k P(Sn[J], Sn[J])w use any O(|J|3) quadratic programming solver Return: w RT w n 1 with w RT 0 m kernel matrix. Next, Recombination (Alg. G.1) (Tchernychova, 2016, Alg. 1) identifies a sparse simplex vector w with w 0 m that exactly matches the inner product of its input with each of the test vectors. Then, we run KT-Swap-LS (Alg. G.2), a new, line-search version of KTSWAP (Dwivedi and Mackey, 2021, Alg. 1b) that greedily improves MMD to P while maintaining both the sparsity and simplex constraint of its input. Finally, we optimize the weights of the remaining support points using any cubictime quadratic programming solver. In Prop. G.1 we show that RT runs in time O((dk P + m)nm + m3 log n) and nearly preserves the MMD of its input whenever m grows appropriately with the kernel covering number. Combining RT with Stein Thinning or Lowrank Debiasing in Alg. 5, we obtain Stein Recombination (SR) and Low-rank SR (LSR), our approaches to debiased simplex-weighted compression. Remarkably, SR and LSR can match the MMD error rates established for SKT and LSKT using substantially fewer coreset points, as our next result (proved in App. G.2) shows. Algorithm 5 (Low-rank) Stein Recombination (SR / LSR) Input: mean-zero kernel k P, points Sn, output size m, rank r, AGM steps T, adaptive rounds Q Low-rank Debiasing(k P, Sn, r, T, Q) if low-rank Stein Thinning(k P, Sn) otherwise w SR Recombination Thinning(k P, Sn, w, m) Return: w SR n 1 with w SR 0 m Theorem 6 (MMD guarantee for SR/LSR). Under Assum. (α,β)-params, Stein Recombination (Alg. 5) takes O(dk Pn2+(dk P+m)nm+m3 log n) to output w SR, and Lowrank SR takes O((dk P+r+T)nr+(dk P+m)nm+m3 log n) time to output w LSR. Moreover, for any δ (0, 1) and Hn,r as in Thm. 4, each of the following bounds holds (sepa- Debiased Distribution Compression rately) with probability at least 1 δ: MMDk P(w SR)=O q k P n(log n 1 δ ) n + n Hn,m MMDk P(w LSR)=O q k P n(log n 1 δ ) n + n(Hn,m+Hn,r) Example 4. Instantiate the assumptions of Thm. 6 with γslow-growing input points (Def. 3), LOGGROWTH k P, and a heavily compressed coreset size m = (log n)2(α+βγ)+ϵ for any ϵ > 0. Then SR delivers e O(n 1/2) excess MMD with high probability in O(n2) time, and LSR with r = m and T = Θ( n) achieves the same in e O(n1.5) time. 4.2. 
Constant-preserving coresets via Stein Cholesky For applications supporting negative weights, constantpreserving coresets offer the possibility of higher accuracy at the cost of potential numerical instability and poorer robustness to errors in integral evaluation (Karvonen et al., 2019). We introduce a constant-preserving compression algorithm, Cholesky Thinning (CT, Alg. 6), suitable for summarizing a debiased input sequence. CT first applies Weighted RPCholesky to a constant-regularized kernel k P(x, y) + c to select an initial coreset and then uses a combination of KT-Swap-LS and closed-form optimal constant-preserving reweighting to greedily refine the support and weights. The regularized kernel ensures that the coreset output by Weighted RPCholesky is of high quality when paired with the best constant-preserving weights, and our CT standalone analysis (Prop. H.1) improves upon the runtime and error guarantees of RT. In Alg. 7, we combine CT with Stein Thinning or Low-rank Debiasing to obtain Stein Cholesky (SC) and Low-rank SC (LSC), our approaches to debiased constant-preserving compression. Our MMD guarantees for SC and LSC (proved in App. H.2) improve upon the rates of Thm. 6. Algorithm 6 Cholesky Thinning (CT) Input: mean-zero kernel k P, points Sn = (xi)n i=1, weights w n 1, output size m c AVERAGE(Largest m entries of (k P(xi, xi))n i=1) I, F Weighted RPCholesky(k P+c, Sn, w, m); w 0n w[I] argminw R|I|:P i w i=1 w k P(Sn[I], Sn[I])w w KT-Swap-LS(k P, Sn, w, CP); I {i : wi = 0} w[I] argminw R|I|:P i w i=1 w k P(Sn[I], Sn[I])w Return: w CT w Rn with w CT 0 m, 1 n w CT = 1 Theorem 7 (MMD guarantee for SC / LSC). Under Assum. (α,β)-params, Stein Cholesky (Alg. 7) takes O(dk Pn2+(dk P+m)nm+m3) time to output w SC, and Lowrank SC takes O((dk P+r+T)nr+(dk P+m)nm+m3) time to output w LSC. Moreover, for any δ (0, 1), with probability at least 1 δ, each of the following bounds hold: Algorithm 7 (Low-rank) Stein Cholesky (SC / LSC) Input: mean-zero kernel k P, points Sn, output size m, rank r, AGM steps T, adaptive rounds Q Low-rank Debiasing(k P, Sn, r, T, Q) if low-rank Stein Thinning(k P, Sn) otherwise w SC Cholesky Thinning(k P, Sn, w, m) Return: w SC Rn with w SC 0 m and 1 n w SC = 1 MMDk P(w SC) = 2 MMDOPT k P n log n MMDk P(w LSC) = 2 MMDOPT k P n(log n 1/δ) for Hn,r as in Thm. 4 and m m+log 2 2 m log 2 + 1. Example 5. Instantiate the assumptions of Thm. 7 with γslow-growing input points (Def. 3), LOGGROWTH k P, and a heavily compressed coreset size m = (log n)2(α+βγ)+ϵ for any ϵ > 0. Then SC delivers e O(n 1/2) excess MMD with high probability in O(n2) time, and LSC with r = m and T = Θ( n) achieves the same in e O(n1.5) time. Remark 3. While we present our results for a target precision of 1/ n, a coarser target precision of 1/ n0 for n0 < n can be achieved more quickly by random subsampling/standard thinning the input sequence down to size n0 before running SR, LSR, SC, or LSC. 5. Experiments We next evaluate the practical utility of our procedures when faced with three common sources of bias: (1) burn-in, (2) approximate MCMC, and (3) tempering. In all experiments, we use a Stein kernel kp with an inverse multiquadric (IMQ) base kernel k(x, y) = (1 + x y 2 M/σ2) 1/2 for σ equal to the median pairwise M distance amongst 1000 points standard thinned from the input. To vary output MMD precision, we first standard thin the input to size n0 {210, 212, 214, 216, 218, 220} before applying any method, as discussed in Rems. 2 and 3. 
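The preprocessing just described (standard thinning to size $n_0$ and the median-heuristic bandwidth for the IMQ base kernel) can be implemented in a few lines. The sketch below uses $M = I$ for readability, whereas the experiments use an $M$-weighted distance; it is an illustration, not the paper's code.

```python
import numpy as np

def standard_thin(X, m):
    """Keep roughly m equally spaced points (every (n // m)-th point)."""
    n = X.shape[0]
    return X[np.arange(0, n, max(n // m, 1))[:m]]

def median_heuristic_bandwidth(X, n_ref=1000):
    """sigma = median pairwise Euclidean distance among n_ref standard-thinned points
    (the paper uses the M-weighted distance; M = I here for simplicity)."""
    Xs = standard_thin(X, min(n_ref, X.shape[0]))
    diffs = Xs[:, None, :] - Xs[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    iu = np.triu_indices(len(Xs), k=1)   # off-diagonal pairs only
    return float(np.median(dists[iu]))
```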
For low-rank or weighted coreset methods, we show results for $m = r = n^\tau$. When comparing weighted coresets, we optimally reweight every coreset. We report the median over 5 independent runs for all error metrics. We implement our algorithms in JAX (Bradbury et al., 2018) and refer the reader to App. I for additional experiment details (including a runtime comparison in Tab. I.1). Our open-source code is available as part of the GoodPoints Python library at https://github.com/microsoft/goodpoints.

Correcting for burn-in. The initial iterates of a Markov chain are biased by its starting point and need not accurately reflect the target distribution $\mathbb{P}$. Classical burn-in corrections use convergence diagnostics to detect and discard these iterates but typically require running multiple independent Markov chains (Cowles et al., 1999). Alternatively, our proposed debiased compression methods can be used to correct for burn-in given just a single chain. We test this claim using an experimental setup from Riabiz et al. (2022, Sec. 4.1) and the 6-chain burn-in oracle diagnostic of Vats and Knudson (2021).⁵ We aim to compress a posterior $\mathbb{P}$ over the parameters in the Goodwin model of oscillatory enzymatic control ($d = 4$) using $n = 2 \times 10^6$ points from a preconditioned Metropolis-adjusted Langevin algorithm (P-MALA) chain. We repeat this experiment with three alternative MCMC algorithms in App. I.3. Our primary metric is $\mathrm{MMD}_{k_P}$ to $\mathbb{P}$ with $M = I$, but, for external validation, we also measure the energy distance (Riabiz et al., 2022, Eq. 11) to an auxiliary MCMC chain of length $n$.

[Figure 1 panels: Equal-Weighted, Simplex-Weighted, and Constant-Preserving; each plots error (Energy Distance) against Coreset Size m for Stein Thinning, Stein Kernel Thinning / Recombination / Cholesky, their low-rank variants (τ ∈ {0.4, 0.5}), and burn-in-oracle baselines.]

Figure 1: Correcting for burn-in. Left: Before selecting coresets (orange), the burn-in oracle uses 6 independent Markov chains to discard burn-in (red) while LSKT identifies the same high-density region (blue) with 1 chain. Right: Using only one chain, our methods consistently outperform the Stein and standard thinning baselines and match the 6-chain oracle.

Trajectory plots of the first two coordinates (Fig. 1, left) highlight the substantial burn-in period for the Goodwin chain and the ability of LSKT to mimic the 6-chain burn-in oracle using only a single chain. In Fig. 1 (right), for both the MMD metric and the auxiliary energy distance, our proposed methods consistently outperform Stein thinning and match the quality of 6-chain burn-in removal paired with unbiased compression. The spike in baseline energy distance for the constant-preserving task can be attributed to the selection of overly large weight values due to poor matrix conditioning; the simplex-weighted task does not suffer from this issue due to its regularizing nonnegativity constraint.

⁵The Vats and Knudson diagnostic is an improvement on the popular Gelman and Rubin (1992) diagnostic (GRD). The GRD was designed to diagnose MCMC convergence rather than burn-in but is commonly used to identify and discard burn-in points automatically through packages like coda (Plummer et al., 2006).

Correcting for approximate MCMC. In posterior inference, MCMC algorithms typically require iterating over every datapoint to draw each new sample point. When datasets are large, approximating MCMC using datapoint mini-batches can reduce sampling time at the cost of persistent bias and an unknown stationary distribution that prohibits debiasing via importance sampling. Our proposed methods can correct for these biases during compression by computing full-dataset scores on a small subset of $n_0$ standard-thinned points. To evaluate this protocol, we compress a Bayesian logistic regression posterior conditioned on the Forest Covtype dataset ($d = 54$) using $n = 2^{24}$ approximate MCMC points from the stochastic gradient Fisher scoring sampler (Ahn et al., 2012) with batch size 32. Following Wang et al. (2024), we set $M = -\nabla^2 \log p(x_{\mathrm{mode}})$ at the sample mode $x_{\mathrm{mode}}$ and use $2^{20}$ surrogate ground-truth points from the No-U-Turn Sampler (Hoffman and Gelman, 2014) to evaluate energy distance. We find that our proposals improve upon standard thinning and Stein thinning for each compression task, not just in the optimized MMD metric (Fig. 2, top) but also in the auxiliary energy distance (Fig. 2, middle) and when measuring integration error for the mean (Fig. I.4).

[Figure 2 panels: Equal-Weighted, Simplex-Weighted, and Constant-Preserving columns for the approximate MCMC and tempering experiments, plotting error metrics (including Energy Distance) against Coreset Size m for Standard Thinning, Stein Thinning, and our greedy and low-rank constructions (τ ∈ {0.4, 0.5}).]

Figure 2: Correcting for approximate MCMC (top) and tempering (bottom). For posterior inference over the parameters of Bayesian logistic regression (d = 54, top) and a cardiac calcium signaling model (d = 38, bottom), our concise coreset constructions correct for approximate MCMC and tempering biases without need for explicit importance sampling.

Correcting for tempering. Tempering, targeting a less-peaked and more dispersed distribution $\mathbb{Q}$, is a popular technique to improve the speed of MCMC convergence. One can correct for the sample bias using importance sampling, but this requires knowledge of the tempered density and can introduce substantial variance (Gramacy et al., 2010). Alternatively, one can use the constructions of this work to correct for tempering during compression; this requires no importance weighting and no knowledge of $\mathbb{Q}$.
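The auxiliary validation metric reported in Figs. 1 and 2 is an energy distance between each coreset and a surrogate ground-truth sample. A standard weighted form of that metric is sketched below; we do not claim it reproduces the exact normalization of Riabiz et al. (2022, Eq. 11).

```python
import numpy as np

def pairwise_dists(A, B):
    """Euclidean distance matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.sqrt(np.maximum(sq, 0.0))

def energy_distance(X, w, Y):
    """Energy distance between the weighted coreset (X, w) and an equal-weighted
    reference sample Y: 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||."""
    v = np.full(len(Y), 1.0 / len(Y))
    return (2.0 * w @ pairwise_dists(X, Y) @ v
            - w @ pairwise_dists(X, X) @ w
            - v @ pairwise_dists(Y, Y) @ v)
```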
To test this proposal, we compress the cardiac calcium signaling model posterior ($d = 38$) of Riabiz et al. (2022, Sec. 4.3) with $M = I$ and $n = 3 \times 10^6$ tempered points from a Gaussian random walk Metropolis-Hastings chain. As discussed by Riabiz et al., compression is essential in this setting as the ultimate aim is to propagate posterior uncertainty through a human heart simulator, a feat which requires over 1000 CPU hours for each summary point retained. Our methods perform on par with Stein thinning for equal-weighted compression and yield substantial gains over Stein (and standard) thinning for the two weighted compression tasks.

6. Conclusions and Future Work

We have introduced and analyzed a suite of new procedures for compressing a biased input sequence into an accurate summary of a target distribution. For equal-weighted compression, Stein kernel thinning delivers $\sqrt{n}$ points with $\widetilde{O}(n^{-1/2})$ MMD in $O(n^2)$ time, and low-rank SKT can improve this running time to $\widetilde{O}(n^{3/2})$. For simplex-weighted and constant-preserving compression, Stein recombination and Stein Cholesky provide enhanced parsimony, matching these guarantees with as few as poly-log(n) points. Recent work has identified some limitations of score-based discrepancies, like Stein kernel MMDs, and developed modified objectives that are more sensitive to the relative density of isolated modes (Liu et al., 2023; Bénard et al., 2024). A valuable next step would be to extend our constructions to provide compression guarantees for these modified discrepancy measures. Other opportunities for future work include marrying the better-than-i.i.d. guarantees of this work with the non-myopic compression of Teymur et al. (2021), the control-variate compression of Chopin and Ducrocq (2021), and the online compression of Koppel et al. (2024).

Acknowledgments

We thank Marina Riabiz for making the Markov chain data used in Sec. 5 available and Jeffrey Rosenthal for helpful discussions concerning the geometric ergodicity of Markov chains. Lingxiao Li acknowledges the generous support of Army Research Office grants W911NF2010168 and W911NF2110293, of Air Force Office of Scientific Research award FA9550-19-1-031, of National Science Foundation grant CHS-1955697, from the CSAIL Systems that Learn program, from the MIT-IBM Watson AI Laboratory, from the Toyota-CSAIL Joint Research Center, from a gift from Adobe Systems, and from a Google Research Scholar award.

Impact Statement

This paper presents work with the aim of advancing the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

References

E. Afzali and S. Muthukumarana. Gradient-free kernel conditional Stein discrepancy goodness of fit testing. Machine Learning with Applications, 12:100463, 2023. ISSN 2666-8270. doi: 10.1016/j.mlwa.2023.100463. URL https://www.sciencedirect.com/science/article/pii/S2666827023000166. (Cited on page 3.)

S. Ahn, A. Korattikara, and M. Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In J. Langford and J. Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML-12), ICML '12, pages 1591–1598, New York, NY, USA, July 2012. Omnipress. ISBN 978-1-4503-1285-1. (Cited on pages 1, 8, and 55.)

N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950. (Cited on page 2.)

Y. F. Atchadé and F. Perron.
On the geometric ergodicity of metropolis-hastings algorithms. Statistics, 41(1):77 84, 2007. (Cited on page 4.) A. Barp, F.-X. Briol, A. Duncan, M. Girolami, and L. Mackey. Minimum stein discrepancy estimators. Advances in Neural Information Processing Systems, 32, 2019. (Cited on page 3.) A. Barp, C.-J. Simon-Gabriel, M. Girolami, and L. Mackey. Targeted separation and convergence with kernel discrepancies. ar Xiv preprint ar Xiv:2209.12835, 2022. (Cited on page 19.) C. B enard, B. Staber, and S. Da Veiga. Kernel Stein discrepancy thinning: A theoretical perspective of pathologies and a practical fix with regularization. Advances in Neural Information Processing Systems, 36, 2024. (Cited on page 9.) P. Billingsley. Convergence of probability measures. John Wiley & Sons, 2013. (Cited on page 26.) J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. Vander Plas, S. Wanderman-Milne, and Q. Zhang. JAX: composable transformations of Python+Num Py programs, 2018. URL http://github.com/google/jax. (Cited on pages 7 and 54.) R. C. Bradley. Basic Properties of Strong Mixing Conditions. A Survey and Some Open Questions. Probability Surveys, 2(none):107 144, 2005. doi: 10.1214/ 154957805100000104. URL https://doi.org/10.1214/ 154957805100000104. (Cited on page 25.) C. Carmeli, E. De Vito, and A. Toigo. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Analysis and Applications, 4(04): 377 408, 2006. (Cited on pages 18 and 19.) T. K. Chandra. De la vall ee poussin s theorem, uniform integrability, tightness and moments. Statistics & Probability Letters, 107:136 141, 2015. (Cited on page 26.) Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 109 116, 2010. (Cited on page 3.) Y. Chen, E. N. Epperly, J. A. Tropp, and R. J. Webber. Randomly pivoted Cholesky: Practical approximation of a kernel matrix with few entry evaluations. ar Xiv preprint ar Xiv:2207.06503, 2022. (Cited on pages 5, 40, 42, and 51.) N. Chopin and G. Ducrocq. Fast compression of MCMC output. Entropy, 23(8):1017, 2021. (Cited on page 9.) K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2606 2615, New York, New York, USA, 20 22 Jun 2016. PMLR. URL https://proceedings.mlr.press/v48/ chwialkowski16.html. (Cited on page 3.) M. K. Cowles and B. P. Carlin. Markov chain Monte Carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association, 91(434):883 904, 1996. (Cited on page 1.) Debiased Distribution Compression M. K. Cowles, G. O. Roberts, and J. S. Rosenthal. Possible biases induced by mcmc convergence diagnostics. Journal of Statistical Computation and Simulation, 64 (1):87 104, 1999. (Cited on page 8.) A. Dax. Low-rank positive approximants of symmetric matrices. Advances in Linear Algebra & Matrix Theory, 4 (03):172, 2014. (Cited on page 51.) R. Douc and O. Capp e. Comparison of resampling schemes for particle filtering. In ISPA 2005. Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis, 2005., pages 64 69. IEEE, 2005. (Cited on pages 5 and 37.) R. Douc, E. Moulines, P. Priouret, and P. Soulier. Markov chains, volume 1. Springer, 2018. 
(Cited on page 25.) A. Durmus and E. Moulines. On the geometric convergence for MALA under verifiable conditions. ar Xiv preprint ar Xiv:2201.01951, 2022. (Cited on page 4.) A. Durmus, E. Moulines, and E. Saksman. Irreducibility and geometric ergodicity of Hamiltonian Monte Carlo. The Annals of Statistics, 48(6):3545 3564, 2020. (Cited on page 4.) A. Durmus and Eric Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551 1587, 2017. ISSN 10505164. URL http://www.jstor.org/stable/ 26361410. (Cited on page 4.) R. Dwivedi and L. Mackey. Kernel thinning. ar Xiv preprint ar Xiv:2105.05842, 2021. (Cited on pages 1, 4, 6, and 34.) R. Dwivedi and L. Mackey. Generalized kernel thinning. In International Conference on Learning Representations, 2022. (Cited on pages 1, 3, 21, 34, and 35.) E. Epperly and E. Moreno. Kernel quadrature with randomly pivoted Cholesky. Advances in Neural Information Processing Systems, 36, 2024. (Cited on pages 1 and 51.) M. A. Erdogdu, L. Mackey, and O. Shamir. Global nonconvex optimization with discretized diffusions. Advances in Neural Information Processing Systems, 31, 2018. (Cited on page 4.) M. A. Gallegos-Herrada, D. Ledvinka, and J. S. Rosenthal. Equivalences of geometric ergodicity of Markov chains. Journal of Theoretical Probability, pages 1 27, 2023. (Cited on pages 3 and 25.) A. Gelman and D. B. Rubin. Inference from iterative simulation using multiple sequences. Statistical science, 7 (4):457 472, 1992. (Cited on page 8.) B. Ghojogh, A. Ghodsi, F. Karray, and M. Crowley. KKT conditions, first-order and second-order optimization, and distributed optimization: tutorial and survey. ar Xiv preprint ar Xiv:2110.01858, 2021. (Cited on page 52.) J. Gorham and L. Mackey. Measuring sample quality with kernels. In International Conference on Machine Learning, pages 1292 1301. PMLR, 2017. (Cited on page 3.) J. Gorham, A. B. Duncan, S. J. Vollmer, and L. Mackey. Measuring sample quality with diffusions. The Annals of Applied Probability, 29(5):2884 2928, 2019. (Cited on page 3.) R. Gramacy, R. Samworth, and R. King. Importance tempering. Statistics and Computing, 20:1 7, 2010. (Cited on pages 1 and 9.) A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Sch olkopf, and A. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723 773, 2012. (Cited on page 1.) S. Hayakawa, H. Oberhauser, and T. Lyons. Positively weighted kernel quadrature via subsampling. Advances in Neural Information Processing Systems, 35:6886 6900, 2022. (Cited on page 6.) S. Hayakawa, H. Oberhauser, and T. Lyons. Samplingbased Nystr om approximation and kernel quadrature. In Proceedings of the 40th International Conference on Machine Learning, ICML 23. JMLR.org, 2023. (Cited on pages 1 and 6.) L. Hodgkinson, R. Salomone, and F. Roosta. The reproducing Stein kernel approach for post-hoc corrected sampling. ar Xiv preprint ar Xiv:2001.09266, 2020. (Cited on page 3.) M. D. Hoffman and A. Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res., 15(1):1593 1623, 2014. (Cited on pages 8 and 55.) A. A. Johnson. Geometric ergodicity of Gibbs samplers. university of minnesota, 2009. (Cited on page 4.) T. Karvonen, M. Kanagawa, and S. S arkk a. On the positivity and magnitudes of Bayesian quadrature weights. Statistics and Computing, 29:1317 1333, 2019. (Cited on pages 6 and 7.) A. Koppel, J. Eappen, S. Bhatt, C. Hawkins, and S. Ganesh. 
Online MCMC thinning with kernelized Stein discrepancy. SIAM J. Math. Data Sci., 6(1):51 75, 2024. doi: 10.1137/22M1510108. URL https://doi.org/10. 1137/22m1510108. (Cited on page 9.) Debiased Distribution Compression F. Kuo. Component-by-component constructions achieve the optimal rate of convergence for multivariate integration in weighted Korobov and Sobolev spaces. Journal of Complexity, 19(3):301 320, 2003. ISSN 0885064X. doi: https://doi.org/10.1016/S0885-064X(03) 00006-2. URL https://www.sciencedirect.com/science/ article/pii/S0885064X03000062. Oberwolfach Special Issue. (Cited on page 3.) S. Lacoste-Julien, F. Lindsten, and F. Bach. Sequential kernel herding: Frank-Wolfe optimization for particle filtering. In Artificial Intelligence and Statistics, pages 544 552. PMLR, 2015. (Cited on page 3.) L. Li, J.-G. Liu, and Y. Wang. Geometric ergodicity of SGLD via reflection coupling. ar Xiv preprint ar Xiv:2301.06769, 2023. (Cited on page 4.) Q. Liu and J. Lee. Black-box importance sampling. In Artificial Intelligence and Statistics, pages 952 961. PMLR, 2017. (Cited on pages 4, 26, 28, and 55.) Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In International conference on machine learning, pages 276 284. PMLR, 2016. (Cited on page 3.) X. Liu, A. B. Duncan, and A. Gandy. Using perturbation to improve goodness-of-fit tests based on kernelized Stein discrepancy. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 21527 21547. PMLR, 23 29 Jul 2023. (Cited on page 9.) F. Merlev ede, M. Peligrad, and S. Utev. Sharp conditions for the CLT of linear processes in a Hilbert space. Journal of Theoretical Probability, 10(3):681 693, 1997. (Cited on page 25.) S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Springer Science & Business Media, 2012. (Cited on pages 25 and 26.) H. Q. Minh. Some properties of Gaussian reproducing kernel Hilbert spaces and their implications for function approximation and learning theory. Constructive Approximation, 32:307 338, 2010. (Cited on page 4.) S. Niederer, L. Mitchell, N. Smith, and G. Plank. Simulating human cardiac electrophysiology on clinical timescales. Frontiers in physiology, 2:14, 2011. (Cited on page 1.) V. I. Paulsen and M. Raghupathi. An introduction to the theory of reproducing kernel Hilbert spaces, volume 152. Cambridge university press, 2016. (Cited on pages 15, 23, and 25.) D. Phan, N. Pradhan, and M. Jankowiak. Composable effects for flexible and accelerated probabilistic programming in numpyro. ar Xiv preprint ar Xiv:1912.11554, 2019. (Cited on page 55.) J. M. Phillips and W. M. Tai. Near-optimal coresets of kernel density estimates. Discrete & Computational Geometry, 63:867 887, 2020. (Cited on page 5.) I. Pinelis. Exact lower and upper bounds on the incomplete gamma function. ar Xiv preprint ar Xiv:2005.06384, 2020. (Cited on page 43.) Y. Pitcan. A note on concentration inequalities for Ustatistics. ar Xiv preprint ar Xiv:1712.06160, 2017. (Cited on page 31.) M. Plummer, N. Best, K. Cowles, and K. Vines. Coda: Convergence diagnosis and output analysis for mcmc. R News, 6(1):7 11, 2006. URL https://journal.r-project. org/archive/. (Cited on page 8.) Q. Qin. Geometric ergodicity of trans-dimensional Markov chain Monte Carlo algorithms. ar Xiv preprint ar Xiv:2308.00139, 2023. (Cited on page 25.) M. Riabiz, W. Y. 
Chen, J. Cockayne, P. Swietach, S. A. Niederer, L. Mackey, and C. J. Oates. Replication Data for: Optimal Thinning of MCMC Output, 2020. URL https://doi.org/10.7910/DVN/MDKNWM. Accessed on Mar 23, 2021. (Cited on page 55.) M. Riabiz, W. Y. Chen, J. Cockayne, P. Swietach, S. A. Niederer, L. Mackey, and C. J. Oates. Optimal thinning of MCMC output. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(4):1059 1081, 2022. (Cited on pages 1, 2, 4, 8, 9, 33, 36, and 54.) G. O. Roberts and R. L. Tweedie. Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika, 83(1):95 110, 1996. (Cited on page 4.) A. Shetty, R. Dwivedi, and L. Mackey. Distribution compression in near-linear time. In International Conference on Learning Representations, 2022. (Cited on pages 1, 5, 34, 40, 45, 46, 47, and 54.) I. Steinwart and A. Christmann. Support vector machines. Springer Science & Business Media, 2008. (Cited on pages 3, 15, 16, 17, 19, 27, and 29.) I. Steinwart and C. Scovel. Mercer s theorem on general domains: On the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35:363 417, 2012. (Cited on pages 4, 14, and 15.) Debiased Distribution Compression H.-W. Sun and D.-X. Zhou. Reproducing kernel Hilbert spaces associated with analytic translation-invariant Mercer kernels. Journal of Fourier Analysis and Applications, 14(1):89 101, 2008. (Cited on pages 21 and 23.) M. Tchernychova. Caratheodory cubature measures. Ph D thesis, University of Oxford, 2016. (Cited on pages 6, 48, and 49.) O. Teymur, J. Gorham, M. Riabiz, and C. Oates. Optimal quantisation of probability measures using maximum mean discrepancy. In International Conference on Artificial Intelligence and Statistics, pages 1027 1035. PMLR, 2021. (Cited on page 9.) D. Vats and C. Knudson. Revisiting the Gelman Rubin diagnostic. Statistical Science, 36(4):518 529, 2021. (Cited on page 8.) M. J. Wainwright. High-dimensional statistics: A nonasymptotic viewpoint, volume 48. Cambridge university press, 2019. (Cited on pages 23, 29, and 31.) C. Wang, Y. Chen, H. Kanagawa, and C. J. Oates. Stein Πimportance sampling. Advances in Neural Information Processing Systems, 36, 2024. (Cited on page 8.) J.-K. Wang, J. Abernethy, and K. Y. Levy. No-regret dynamics in the Fenchel game: A unified framework for algorithmic convex optimization. Mathematical Programming, pages 1 66, 2023. (Cited on pages 5, 40, 41, and 43.) J. Wellner. Weak convergence and empirical processes: With applications to statistics. Springer Science & Business Media, 2013. (Cited on page 25.) J. Yang, Q. Liu, V. Rao, and J. Neville. Goodness-of-fit testing for discrete distributions via Stein discrepancy. In International Conference on Machine Learning, pages 5561 5570. PMLR, 2018. (Cited on page 3.) F. Zhang. The Schur complement and its applications, volume 4. Springer Science & Business Media, 2006. (Cited on page 51.) D.-X. Zhou. The covering number in learning theory. Journal of Complexity, 18(3):739 767, 2002. (Cited on page 4.) Debiased Distribution Compression Appendix Contents A. Appendix Notation For the point sequence Sn = (xi)i [n], we use Sn 1 n P i [n] δxi to denote the empirical distribution. For a weight vector w Rn, we define the support supp(w) {i [n] : wi = 0} and the signed measure Sw n P i [n] wiδxi. For a matrix K Rn n and w n 1, we define the weighted matrix Kw diag( w)K diag( w). For positive semidefinite (PSD) matrices (A, B), we use A B (resp. 
A B) to mean A B (resp. B A) is PSD. For a symmetric PSD (SPSD) matrix M, we let M 1/2 denote a symmetric matrix square root satisfying M = M 1/2M 1/2. For A Rn m, we denote A p supx =0 Ax p x p . We will use 1E to denote the indicator function for an event E. Notation used only in a specific section will be introduced therein. B. Spectral Analysis of Kernel Matrices The goal of this section is to develop spectral bounds for kernel matrices. From covering numbers of kernels to eigenvalues of kernel matrices In App. B.1, we transfer the bounds on covering numbers from the definition of POLYGROWTH or LOGGROWTH kernels to bounds on the eigenvalues of the kernel matrices. This sets the theoretical foundation for the algorithms in later sections as their error guarantees rely on the fast decay of eigenvalues of kernel matrices. Covering numbers and eigenvalues of Stein kernels In App. B.2, we show that Stein kernels are POLYGROWTH (resp. LOGGROWTH) provided that their base kernels are differentiable (resp. radially analytic). Putting the results together from App. B.1, we obtain spectral bounds for a wide range of Stein kernels. Notation For a normed space E, we use E to denote its norm, BE(p, r) {x E : x p E r} to denote the closed ball of radius r centered at p in E with the shorthand BE(r) BE(0, r) and BE BE(1). When E is an RKHS with kernel k, for brevity we use k in place of E in the subscript. Let F(X, Y) denote the space of functions from X to Y, and B(E, F) denote the space of bounded linear functions between normed spaces E, F. For a set A, we use ℓ (A) to denote the space of bounded R-valued functions on A equipped with the sup-norm f ,A supx A |f(x)|. We use E , F to denote the inclusion map. e use λℓ(T) to denote the ℓ-th largest eigenvalue of an operator T. B.1. From covering numbers of kernels to eigenvalues of kernel matrices We first introduce the general Mercer representation theorem from Steinwart and Scovel (2012), which shows the existence of a discrete spectrum of the integral operator associated with a continuous square-integrable kernel. The theorem also provides a series expansion of the kernel, i.e., the Mercer representation, in terms of the eigenvalues and eigenfunctions. Lemma B.1 (General Mercer representation (Steinwart and Scovel, 2012)). Consider a kernel k : Rd Rd R that is jointly continuous in both inputs and a probability measure µ such that R k(x, x)dµ(x) < . Then the following holds. (a) The inclusion Hk , L2(µ) is a compact operator, i.e., Bk is a compact subset of L2(µ). In particular, this inclusion is continuous. (b) The Hilbert-space adjoint of the inclusion Hk , L2(µ) is the compact operator Sk,µ : L2(µ) Hk defined as Sk,µf R k( , x)f(x)dµ(x). (3) We also have S k,µ Hk , L2(µ). Hence the operator Tk,µ S k,µSk,µ : L2(µ) L2(µ) (4) is also compact. Debiased Distribution Compression (c) There exist {λℓ} ℓ=1 with λ1 λ2 0 and {ϕℓ} ℓ=1 Hk such that {ϕℓ} ℓ=1 is an orthonormal system in L2(µ) and {λℓ} ℓ=1 (resp. {ϕℓ} ℓ=1) consists of the eigenvalues (resp. eigenfunctions) of Tk,µ with eigendecomposition, for f L2(µ), Tk,µf = P ℓ=1 λℓ f, ϕℓ L2(µ)ϕℓ with convergence in L2(µ). (d) We have the following series expansion k(x, x ) = P ℓ=1 λiϕℓ(x)ϕℓ(x ), (5) where the series convergence is absolute and uniform in x, x on all A A supp µ supp µ. Proof of Lem. B.1. Part (a) and (b) follow respectively from Steinwart and Scovel (2012, Lem. 2.3 and 2.2). Part (c) follows from part (a) and Steinwart and Scovel (2012, Lem. 2.12). 
Finally, part (d) follows from Steinwart and Scovel (2012, Cor. 3.5). We will use the following lemma regarding the restriction of covering numbers. Lemma B.2 (Covering number is preserved in restriction). For a kernel k : Rd Rd R and a set A Rd, we have Nk(A, ϵ) = Nk|A(A, ϵ), for k|A, the restricted kernel of k to A (Paulsen and Raghupathi, 2016, Sec. 5.4). Proof of Lem. B.2. It suffices to show that a (k, A, ϵ) cover can be converted to a cover of (k|A, A, ϵ) of the same cardinality and vice versa. Let C Bk|A be a (k|A, A, ϵ) cover. For any f C, we have f k|A = inf n f k : f Hk, f|A = f o 1 (Paulsen and Raghupathi, 2016, Corollary 5.8). Moreover, the infimum is attained by some f Hk such that f k = f k|A 1 and f|A = f. Now form C = { f : f C}. For any h Bk, there exists f C such that h|A f ,A ϵ = = h f ,A ϵ, so C is a (k|A, A, ϵ) cover. For the other direction, let C Bk be a (k, A, ϵ) cover. Define C = { f|A : f C} Hk|A. Since f|A k|A f k, we have C Bk|A. For any h Bk A], again by Paulsen and Raghupathi (2016, Corollary 5.8), there exists h Hk such that h k = h k|A 1, so there exists f C such that h f ,A ϵ = h f|A ,A ϵ, Hence C is a (k, A, ϵ) cover. The goal for the rest of this section is to transfer the bounds of the covering number in the definition of a POLYGROWTH or LOGGROWTH kernel from Assum. (α, β)-kernel to bounds on entropy numbers (Steinwart and Christmann, 2008, Def. 6.20) that are closely related to eigenvalues of the integral operator (4). Definition B.1 (Entropy number of a bounded linear map). For a bounded linear operator S : E F between normed spaces E, F, for ℓ N, the ℓ-th entropy number of S is defined as eℓ(S) inf n ϵ > 0 : s1, . . . , s2ℓ 1 S(BE) such that S(BE) S2ℓ 1 i=1 BF (si, ϵ) o . The following lemma shows the relation between covering numbers and entropy numbers. Lemma B.3 (Relation between covering number and entropy number). Suppose a kernel k is jointly continuous and A Rd is bounded. Then for any ϵ > 0, e log2 Nk(A,ϵ) +1(Hk|A , ℓ (A)) ϵ. Debiased Distribution Compression Proof of Lem. B.3. First, the assumption implies k|A is a bounded kernel, so by Steinwart and Christmann (2008, Lemma 4.23), the inclusion Hk|A , ℓ (A) is continuous. By the definition of Nk|A(A, ϵ), by adding arbitrary elements into the cover if necessary, there exists a (k|A, A, ϵ) cover of Bk|A of cardinality 2 log2(Nk|A(A,ϵ)) Nk|A(A, ϵ). Hence e log2 Nk|A(A,ϵ) +1(Hk|A , ℓ (A)) ϵ. The claim follows since Nk|A(A, ϵ) = Nk(A, ϵ) by Lem. B.2. Proposition B.1 (ℓ -entropy number bound for POLYGROWTH or LOGGROWTH k). Suppose a kernel k satisfies Assum. (α, β)-kernel. Let Cd > 0 denote the constant that appears in the Assum. (α, β)-kernel. Define Lk(r) Cd log 2rβ. (6) Then for any r > 0 and ℓ N that satisfies ℓ> Lk(r + 1) + 1, we have eℓ(Hk|B2(r) , ℓ (B2(r))) α if k is POLYGROWTH(α, β), and exp 1 ℓ 1 Lk(r+1) 1 α if k is LOGGROWTH(α, β). Proof of Prop. B.1. By Lem. B.3 and the fact that eℓis monotonically decreasing in ℓby definition, if ℓ log2 Nk(B2(r), ϵ) + 1 for some ϵ > 0, then eℓ(Hk|B2(r) , ℓ (B2(r))) e log2 Nk(B2(r),ϵ) +1(Hk|B2(r) , ℓ (B2(r))) ϵ. (7) For the POLYGROWTH case, by its definition, the condition ℓ log2 Nk(B2(r), ϵ) + 1 is met if ϵ (0, 1) and ℓ Cd log 2(1/ϵ)α(r + 1)β + 1 ϵ Lk(r+1) Hence (7) holds with ϵ = Lk(r+1) α , as long as ϵ (0, 1), so ℓneeds to satisfy 1 > Lk(r+1) α ℓ> Lk(r + 1) + 1. 
Similarly, for the LOGGROWTH case, the condition ℓ log2 Nk(B2(r), ϵ) + 1 is met if ϵ (0, 1) and ℓ Cd log 2(log(1/ϵ) + 1)α(r + 1)β + 1 ϵ exp 1 ℓ 1 Lk(r+1) 1 Hence (7) holds with ϵ = exp 1 ℓ 1 Lk(r+1) 1 α , as long as ϵ (0, 1), so ℓneeds to satisfy 1 > exp 1 ℓ 1 Lk(r+1) 1 α ℓ> Lk(r + 1) + 1. Next, we show that we can transfer bounds on entropy numbers to obtain bounds for the eigenvalues of kernel matrices, which will become handy when we develop sub-quadratic-time algorithms in Sec. 3. We rely on the following lemma, which summarizes the relevant facts from Steinwart and Christmann (2008, Appendix A). Lemma B.4 (Eigenvalue is bounded by entropy number). Let k be a jointly continuous kernel and P be a distribution such that Ex P[k(x, x)] < , and recall that λℓ( ) denotes the ℓ-th largest eigenvalue of a linear operator. Then, for all ℓ N, λℓ(Tk,P) 4e2 ℓ(Hk , L2(P)). Debiased Distribution Compression Proof of Lem. B.4. For any bounded linear operator S : H1 H2 between Hilbert spaces H1 and H2, we have aℓ(S) 2eℓ(S), where aℓis the ℓ-th approximation number defined in Steinwart and Christmann (2008, (A.29)). Recall the operator S k,P = Hk , L2(P) from (3), which is compact (in particular bounded) by Lem. B.1(a). Thus sℓ(S k,P) = aℓ(S k,P) 2eℓ(S k,P), where the first equality follows from the paragraph below Steinwart and Christmann (2008, (A.29))) and sℓis ℓ-th singular number of an operator (Steinwart and Christmann, 2008, (A.25)). Then using the identities mentioned under Steinwart and Christmann (2008, (A.25)) and Steinwart and Christmann (2008, (A.27)) and that all operators involved are compact by Lem. B.1(b), we have λℓ(Tk,P) = λℓ(S k,PSk,P) = sℓ(S k,PSk,P) = s2 ℓ(S k,P) 4e2 ℓ(S k,P). The previous lemma allows us to bound eigenvalues of kernel matrices by ℓ -entropy numbers. Proposition B.2 (Eigenvalue of kernel matrix is bounded by ℓ -entropy number). Let k be a jointly continuous kernel. Define K k(Sn, Sn) for the sequence of points Sn = (x1, . . . , xn) Rd. For any w n 1, recall the notation Sw n = P i [n] wiδxi, Kw = diag( w)K diag( w), and Rn = 1 + supi [n] xi 2. Then for all ℓ N, λℓ(Kw) (i) = λℓ(Tk,Sw n ) (ii) 4e2 ℓ(Hk|B2(Rn 1) , ℓ (B2(Rn 1))). (8) Proof of Prop. B.2. Without loss of generality, we assume wi > 0 for all i [n], since otherwise, we can consider a smaller set of points by removing the ones with zero weights. Proof of equality (i) from display (8) Note that L2(Sw n ) is isometric to Rn. Let K k(Sn, Sn) denote the kernel matrix. The action of Tk,Sw n is given by, for i [n], Tk,Sw n f(xi) = P j [n] wjk(xi, xj)f(xj), so in matrix form, Tk,Sw n f = K diag(w)f, and hence Tk,Sw n = K diag(w). If λℓis an eigenvalue of K diag(w) with eigenvector vℓ, then K diag(w)vℓ= λℓvℓ diag( w)K diag(w)vℓ= λℓdiag( w)vℓ diag( w)K diag( w)(diag( w)vℓ) = λℓdiag( w)vℓ, where we used wi > 0 for all i [n]. Hence the eigenspectrum of Tk,Sw n agrees with that of diag( w)K diag( w). Proof of bound (ii) from display (8) By Lem. B.4, we have λℓ(Tk,Sw n ) 4e2 ℓ(Hk|B2(Rn 1) , L2(Sw n )). Finally, using Def. B.1, we have eℓ(Hk|B2(Rn 1) , L2(Sw n )) eℓ(Hk|B2(Rn 1) , ℓ (B2(Rn 1))) because Sw n is supported in B2(Rn 1) and the fact that L2(P) for any P. Combining the tools developed so far, we have the following corollary for bounding the eigenvalues of POLYGROWTH and LOGGROWTH kernel matrices. Corollary B.1 (Eigenvalue bound for POLYGROWTH or LOGGROWTH kernel matrix). Suppose a kernel k satisfies Assum. (α, β)-kernel. Let Sn = (x1, . . . , xn) Rd be a sequence of points. 
For any w n 1, using the notation Lk from (6), for any ℓ> Lk(Rn) + 1, we have α POLYGROWTH(α, β) and 4 exp 2 2 ℓ 1 Lk(Rn) 1 LOGGROWTH(α, β). Proof of Cor. B.1. The claim follows by applying Prop. B.2 and Prop. B.1. Debiased Distribution Compression B.2. Covering numbers and eigenvalues of Stein kernels The goal of this section is to show that a Stein kernel kp satisfies Assum. (α, β)-kernel provided that the base kernel is sufficiently smooth and to derive the parameters α, β for POLYGROWTH and LOGGROWTH cases. Such results when put together with results from App. B.1 imply spectral decay rates for Stein kernel matrices. For a Stein kernel kp with preconditioning matrix M, we define Sp(r) max 1, sup x 2 r M 1/2 log p(x) 2 We start by noting a useful alternative expression for a Stein kernel where we only need access to the density via the score log p. Proposition B.3 (Alternative expression for Stein kernel). The Stein kernel kp has the following alternative form: kp(x, y) = log p(x), M log p(y) k(x, y) + log p(x), M yk(x, y) + log p(y), M xk(x, y) + tr(M x yk(x, y)), (10) where x yk(x, y) denotes the d d matrix ( xi yjk(x, y))i,j [d]. Proof of Prop. B.3. We compute ( x (p(x)Mk(x, y)p(y)))j = P i [d] Mij ( xip(x)k(x, y)p(y) + p(x) xik(x, y)p(y)) . y x (p(x)Mk(x, y)p(y)) = P i,j [d] Mij xip(x) yjp(y)k(x, y) + xip(x) yjk(x, y)p(y) i,j [d] Mij p(x) yjp(y) xik(x, y) + p(x) xi yjk(x, y)p(y) . The four terms in the last equation correspond to the four terms in (10). In what follows, we will make use of a matrix-valued kernel K : Rd Rd Rd d which generates an RKHS HK of vector-valued functions. See Carmeli et al. (2006) for an introduction to vector-valued RKHS theory. Our next goal is to build a Hilbert-space isometry between the direct sum Hilbert space Hk d and Hkp to represent functions in Hkp using functions from Hk. Lemma B.5 (Preconditioned matrix-valued RKHS from a scalar kernel). Let k : Rd Rd R be kernel and Hk be the corresponding RKHS. Let M Rd d be an SPSD matrix. Consider the map ι : Hk d F(Rd, Rd) defined by (f1, . . . , fd) 7 [x 7 M 1/2(f1(x), . . . , fd(x))], where Hk d is the direct-sum Hilbert space of d copies of Hk Then ι is a Hilbert-space isometry onto a vector-valued RKHS HK with matrix-valued reproducing kernel given by K(x, y) = k(x, y)M. Proof of Lem. B.5. Define γ : Rd F(Rd, Hk d) via γ(x)(α) k(x, )M 1/2α. γ(x)(α) Hk d k(x, ) k M 1/2 2 α 2 , so γ(x) is bounded. Since γ(x) is also linear, we have γ(x) B(Rd, Hk d). Let γ(x) : Hk d Rd denote the Hilbert-space adjoint of γ(x). Then for any (f1, . . . , fd) Hk d, α Rd, we have γ(x) (f1, . . . , fd), α = (f1, . . . , fd), γ(x)(α) Hk d = (f1, . . . , fd), k(x, )M 1/2α Hk d = (f1(x), . . . , fd(x)), M 1/2α = M 1/2(f1(x), . . . , fd(x)), α . Debiased Distribution Compression Hence γ(x) (f1, . . . , fd) = M 1/2(f1(x), . . . , fd(x)), so ι(f1, . . . , fd)(x) = γ(x) (f1, . . . , fd). By Carmeli et al. (2006, Proposition 2.4), we see that ι is a partial isometry from Hk d onto a vector-valued RKHS space withv reproducing kernel K(x, y) = γ(x) γ(y) : Rd Rd. For α Rd, previous calculation implies γ(x) γ(y)(α) = γ(x) (k(y, )M 1/2α) = M 1/2k(y, x)M 1/2α = k(x, y)M. Lemma B.6 (Stein operator is an isometry). Consider a Stein kernel kp with base kernel k and preconditioning matrix M. Then, the Stein operator Sp defined by Sp(v) 1 p (pv) is an isometry from HK with K k M to Hkp. Proof. This follows from Barp et al. (2022, Theorem 2.6) applied to K. The previous two lemmas show that Sp ι is a Hilbert space isometry from Hk d to Hkp. 
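To make the score-based form (10) concrete, the following NumPy sketch (an illustration, not the authors' released implementation) evaluates the Stein kernel matrix for a Gaussian base kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) with M = I_d; the function name, the bandwidth sigma, and the standard-Gaussian example score are choices made here purely for illustration.

```python
import numpy as np

def gaussian_base_stein_kernel(X, score, sigma=1.0):
    """Stein kernel matrix of (10) with M = I_d and Gaussian base kernel
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d = X.shape[1]
    S = score(X)                                    # row i is grad log p(x_i)
    diff = X[:, None, :] - X[None, :, :]            # diff[i, j] = x_i - x_j
    sq = np.sum(diff ** 2, axis=-1)
    K = np.exp(-sq / (2 * sigma ** 2))              # base kernel matrix
    term1 = (S @ S.T) * K                           # <s(x), s(y)> k(x, y)
    cross = np.einsum('id,ijd->ij', S, diff)        # <s(x_i), x_i - x_j>
    term2 = (cross + cross.T) / sigma ** 2 * K      # <s(x), grad_y k> + <s(y), grad_x k>
    term3 = K * (d / sigma ** 2 - sq / sigma ** 4)  # tr(grad_x grad_y k)
    return term1 + term2 + term3

# Example: biased draws summarizing a standard Gaussian target (score(x) = -x)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) + 0.5
Kp = gaussian_base_stein_kernel(X, score=lambda X: -X)
w = np.full(len(X), 1 / len(X))
print("MMD_{k_P}(S_n, P) =", np.sqrt(w @ Kp @ w))   # k_P is mean-zero under P
```

Because k_P is mean-zero under P, the MMD of any weighting of S_n is read off from this matrix as sqrt(w^T K w), which is how the quantity appears throughout the sequel.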
Note that Sp(v) = log p, h + h. Hence, we immediately have Hkp = n log p, M 1/2f + (M 1/2f) : f = (f1, . . . , fd) Hk do . (11) We next build a divergence RKHS which is one of the summands used to form Hkp. Lemma B.7 (Divergence RKHS). Let k : Rd Rd R be a continuously differentiable kernel. Let M be an SPSD matrix. Define 2 (Mk) : Rd Rd R via 2 (Mk)(x, y) y x (Mk(x, y)) = tr(M x yk(x, y)). (12) Then 2 (Mk) is a kernel, and its RKHS H 2 (Mk) has the following explicit form H 2 (Mk) = HK = n (M 1/2f) : f = (f1, . . . , fd) Hk do , (13) where K = Mk. Moreover, : HK H 2 (Mk) is an isometry. Proof of Lem. B.7. First of all, by Steinwart and Christmann (2008, Corollary 4.36), every f Hk is continuously differentiable, so xif exists. By Lem. B.5, HK is well-defined and the right equality in (13) holds. Define γ : Rd F(R, HK) via γ(x)(t) t Pd i=1 xi K(x, )ei, where ei Rd is the ith standard basis vector in Rd; by Barp et al. (2022, Lemma C.8) we have xi K(x, )ei HK. Note that γ(x)(t) K = |t| Pd i=1 xi K(x, )ei K , so γ(x) B(R, HK). The adjoint γ(x) B(HK, R) must satisfy, for any h HK, tγ(x) h = h, γ(x)(t) K = D h, t Pd i=1 xi K(x, )ei E where we used the fact (Barp et al., 2022, Lemma C.8) that, for c Rd, h HK, xi K(x, )c, h = c xih(x). So we find γ(x) (h) = h(x). By Carmeli et al. (2006, Proposition 2.4), the map A : HK F(Rd, R) defined by A(h)(x) = γ (x)(h) = h(x), i.e., A = , is a partial isometry from HK to an RKHS H K with reproducing kernel γ(x) γ(y) = Pd i=1 xi K(x, )ei (y) = y x K(x, y) = 2 (Mk)(x, y). The following lemma shows that we can project a covering of Bk consisting of arbitrary functions to a covering using functions only in Bk while inflating the covering radius by at most 2. Debiased Distribution Compression Lemma B.8 (Projection of coverings into RKHS balls). Let k be a kernel, A Rd be a set, and ϵ > 0. Let C be a set of functions such that for any f Bk, there exists g C such that f g ,A ϵ. Then Nk(A, 2ϵ) |C|. Proof. We will build a (k, A, 2ϵ) covering C as follows. For any h C, if there exists h Bk with h h ,A ϵ, then we include h in C . By construction, |C | |C|. Then, for any f Bk, by assumption, there exists g C such that f g ,A ϵ. By construction, there exists g C such that g g ,A ϵ. Thus f g ,A f g ,A + g g ,A 2ϵ. Hence C is a (k, A, 2ϵ) covering. We are now ready to bound the covering numbers of kp by those of k and 2 (Mk). Our key insight towards this end is that any element in Hkp can be decomposed as a sum of functions originated from Hk and a function from the divergence RKHS H 2 (Mk). Lemma B.9 (Upper bounding covering number of Stein kernel with that of its base kernel). Let kp be a Stein kernel with density p and preconditioning matrix M. For any A Rd, ϵ1, ϵ2 > 0, Nkp(A, ϵ) Nk(A, ϵ1)d N 2 (Mk)(A, ϵ2), dϵ1 supx A M 1/2 log p(x) + ϵ2). Proof of Lem. B.9. Let Ck be a (k, A, ϵ1) covering and C 2 (Mk) be a ( 2 (Mk), A, ϵ2) covering. Define b M 1/2 log p. Form C n b, f + g : f = ( f1, . . . , fd) (Ck)d, g C 2 (Mk) o F(Rd, R). Then |C| |Ck|d C 2 (Mk) . Let K k M. For any h Bkp, by (11), we can find f = (f1, . . . , fd) Hk d with fi Hk such that h = Sp ι(f) = log p, M 1/2f + (M 1/2f) = b, f + (M 1/2f). Since ι and Sp are isometries, we have f BHk d. Since, for each i, fi k q Pd j=1 fj 2 k = f Hk d 1, we have fi Bk. By Lem. B.7, : HK H 2 (Mk) is also an isometry, so (M 1/2f) B 2 (Mk). Thus there exist fi Ck for each i and g C 2 (Mk) such that fi fi ,A ϵ1, (M 1/2f) g ,A ϵ2. h(x) b, f + g C. 
Then for x A, h(x) h(x) = b(x), f(x) f(x) + (M 1/2f(x)) g(x) b(x) q Pd i=1(fi(x) fi(x))2 + (M 1/2f(x)) g(x) dϵ1 b(x) + ϵ2. Hence h h ,A dϵ1 supx A b(x) + ϵ2 ϵ3. Note that C that we constructed is not necessarily contained in Bkp. By Lem. B.8, we can get a (kp, A, 2ϵ3) covering and we are done. Debiased Distribution Compression Corollary B.2 (Log-covering number bound for Stein kernel). Let kp be a Stein kernel and A Rd. For any r > 0, ϵ > 0, log Nkp(A, ϵ) d log Nk A, ϵ 4 + log N 2 (Mk) A, ϵ where Sp is defined in (9). Proof. This is direct from Lem. B.9 with ϵ1 = ϵ 4 d Sp(r), ϵ2 = ϵ B.2.1. CASE OF DIFFERENTIABLE BASE KERNEL Definition B.2 (s-times continuously differentiable kernel). A kernel k is s-times continuously differentiable for s N if all partial derivatives α,αk exist and are continuous for all multi-indices α Nd 0 with |α| s. Proposition B.4 (Covering number bound for kp with differentiable base kernel). Suppose kp is a Stein kernel with an s-times continuously differentiable base kernel k for s 2. Then there exist a constant Cd > 0 depending only on (s, d, k, M) such that for any r > 0, ϵ (0, 1), log Nkp(B2(r), ϵ) Cdrd Sd/s p (r)(1/ϵ)d/(s 1). Proof of Prop. B.4. Since k is s-times continuously differentiable, the divergence kernel 2 (Mk) is (s 1)-times continuously differentiable. By Dwivedi and Mackey (2022, Proposition 2(b)), there exists constants c1, c2 depending only on (s, d, k, M) such that, for any r > 0, ϵ1, ϵ2 > 0, log Nk (B2(r), ϵ1) c1rd(1/ϵ)d/s, log N 2 (Mk) (B2(r), ϵ2) c2rd(1/ϵ)d/(s 1). By Cor. B.2 with A = B2(r), we have, for any r > 0 and ϵ (0, 1), log Nkp(B2(r), ϵ) c1drd(4 d Sp(r))d/s(1/ϵ)d/ϵ + c2rd(4/ϵ)d/(s 1) Cdrd Sd/s p (r)(1/ϵ)d/(s 1) for some Cd > 0 depending only on (s, d, k, M). B.2.2. CASE OF RADIALLY ANALYTIC BASE KERNEL For a symmetric positive definite M Rd d, we define, for x Rd, Definition B.3 (Radially analytic kernel). A kernel k is radially analytic if k satisfies k(x, y) = κ( x y 2 M) for a symmetric positive definite matrix M Rd d and a function κ : R 0 R real-analytic everywhere with convergence radius Rκ > 0 such that there exists a constant Cκ > 0 for which 1 j!κ(j) + (0) Cκ(2/Rκ)j, for all j N0, (14) where κ(j) + indicates the j-th right-sided derivative of κ. Example B.1 (Gaussian kernel). Consider the Gaussian kernel k(x, y) = κ( x y 2 M) with κ(t) = e t 2σ2 where σ > 0. Note the exponential function is real-analytic everywhere, and so is κ. Since κ(t) = P j=0 ( t/2σ2)j j , we find 1 j!κ(j)(0) = j(2σ2)j . Hence (14) holds with Cκ = 1 and Rκ = 2 infj 0(j(2σ2)j)1/j = 4σ2. Example B.2 (IMQ kernel). Consider the inverse multiquadric kernel k(x, y) = κ( x y 2 M) with κ(t) = (c2 + t) β where c, β > 0. By Sun and Zhou (2008, Example 3), κ is real-analytic everywhere with Cκ = c 2β(2β + 1)β+1 and Rκ = c2. Debiased Distribution Compression A simple calculation yields the following lemma. Proposition B.5 (Expression for kp with a radially analytic base kernel). Suppose a Stein kernel kp has a symmetric positive definite preconditioning matrix and a base kernel k(x, y) = κ( x y 2 M) where κ is twice-differentiable. Then kp(x, y) = log p(x), M log p(y) κ( x y 2 M) 2κ +( x y 2 M) x y, log p(x) log p(y) 4κ +( x y 2 M) x y 2 M 2dκ +( x y 2 M). In particular, kp(x, x) = M 1/2 log p(x) 2 2 κ(0) 2dκ +(0). Proof of Prop. B.5. 
From k(x, y) = κ( x y 2 M) = κ((x y) M 1(x y)), we compute, using (12), yk(x, y) = 2κ +( x y 2 M)M 1(x y) x yk(x, y) = 2κ +( x y 2 M)M 1 4κ +( x y 2 M)M 1(x y)((x y)M 1) 2 (Mk)(x, y) = tr(M x yk(x, y)) = 4κ +( x y 2 M) x y 2 M 2dκ +( x y 2 M). (16) The form (15) then follows from applying Prop. B.3. We next show that the divergence kernel 2 (Mk) is radially analytic given that k is. Lemma B.10 (Convergence radius of divergence kernel). Let k be a radially analytic kernel with the corresponding realanalytic function κ, convergence radius Rκ with constant Cκ, and a symmetric positive definite matrix M. Then 2 (Mk)(x, y) = κ 2 (Mk)( x y 2 M), where κ 2 (Mk) : R 0 R is the real-analytic function defined as κ 2 (Mk)(t) 4κ +(t)t 2dκ +(t). Moreover, κ 2 (Mk) has convergence radius with constant Rκ 2 (Mk) = Rκ 4d+8, Cκ 2 (Mk) = 4d Cκ/Rκ. Proof of Lem. B.10. The first statement regarding the form of κ 2 (Mk) directly follows from (16). Next, iterative differentiation yields, for j N0, κ(j) 2 (Mk)(t) = (2d + 4j)κ(j+1) + (t) 4κ(j+2) + (t)t. j!κ(j) 2 (Mk)(0) = 2d+4j j! κ(j+1) + (0) = (2d+4j)(j+1) (j+1)! κ(j+1) + (0) (2d + 4j)(j + 1)Cκ(2/Rκ)j+1. (17) j!κ(j) 2 (Mk)(0) (2Cκ/Rκ) ((2d + 4j)(j + 1))1/j2/Rκ j (2Cκ/Rκ) ((2(2d + 4) 2/Rκ)j . where we used the fact that ((2d+4j)(j +1))1/j is decreasing in j. For j = 0, (17) is just 2d Cκ 2/Rκ. Hence κ 2 (Mk) is analytic with Cκ 2 (Mk) = 4d Cκ/Rκ and Rκ 2 (Mk) = Rκ 4d+8. Debiased Distribution Compression We will use the following lemma repeatedly. Lemma B.11 (Covering number of radially analytic kernel with M-metric). Let k0 be a radially analytic kernel with k0(x, y) = κ( x y 2 2). For any symmetric positive definite M Rd d, consider the radially analytic kernel k(x, y) κ( x y 2 M). Then for any A Rd and ϵ > 0, we have Nk(M 1/2(A), ϵ) = Nk0(A, ϵ). In particular, for any r > 0, Nk(B2(r), ϵ) Nk0(B2(r M 1/2 2), ϵ). Proof. Note that k(x, y) = k0(M 1/2x, M 1/2y). By Paulsen and Raghupathi (2016, Theorem 5.7), Hk = {f M 1/2 : f Hk0}, and moreover Bk = {f M 1 : f Bk0}. Let C0 be a (k0, A, ϵ) covering. Form C = {h M 1/2 : h C0} Bk. For any element f M 1/2 Bk where f Bk0, there exists h C0 such that f h ,A ϵ. Thus f M 1/2 h M 1/2 ,M 1/2(A) = f h ,A ϵ. Thus Nk(M 1/2(A), ϵ) Nk0(A, ϵ). By considering M 1 in place of M, we get our desired equality. For the second statement, by letting A = M 1/2B2(r), we have Nk(B2(r), ϵ) = Nk0(M 1/2B2(r), ϵ) Nk0(B2(r M 1/2 2), ϵ), where we use the fact that M 1/2B2(r) B2(r M 1/2 2). In the next lemma, we rephrase the result from Sun and Zhou (2008, Theorem 2) for bounding the covering number of a radially analytic kernel. Lemma B.12 (Covering number bound for radially analytic kernel). Let k be a radially analytic kernel with k(x, y) = κ( x y 2 2). Then, there exist a polynomial P(r) of degree 2d and a constant C depending only on (κ, d) such that for any r > 0, ϵ (0, 1/2), log Nk(B2(r), ϵ) P(r)(log(1/ϵ) + C)d+1. Proof of Lem. B.12. Let Rκ, Cκ denote the constants for κ as in (14). By and Sun and Zhou (2008, Theorem 2) with R = 1, D = 2r, and Lem. B.2, for ϵ (0, 1/2), we have log Nk(B2(r), ϵ) N2(B2(r), r /2) 4 log(1/ϵ) + 2 + 4 log 16 Cκ + 1 d+1 , where r = min( Rκ Rκ + (2r)2 2r), and N2(B2(r), r /2) is the covering number of B2(r) as a subset of Rd, which can be further bounded by (Wainwright, 2019, (5.8)) N2(B2(r), r /2) 1 + 4r Rκ + (2r)2 2r, then r Rκ+(2r)2 2r = r( Rκ+(2r)2+2r) Rκ r( Rκ+4r) Rκ which is a quadratic polynomial in r. Hence for a constant C > 0 and a polynomial P(r) of degree 2d that depend only on (κ, d), we have the claim. 
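As a complement to the Gaussian case, the closed form (15) specializes readily to the IMQ base kernel of Example B.2. The sketch below (a minimal illustration assuming M = I_d, a standard-Gaussian target score, and IMQ parameters c, beta chosen here for concreteness) also inspects the eigenvalues of the weighted kernel matrix K_w = diag(sqrt(w)) K diag(sqrt(w)), whose rapid decay is what Cor. B.1 predicts for a LOGGROWTH kernel.

```python
import numpy as np

def imq_stein_kernel(X, score, c=1.0, beta=0.5):
    """Stein kernel of Prop. B.5 with M = I_d and IMQ base kernel
    kappa(t) = (c^2 + t)^(-beta), t = ||x - y||^2 (Example B.2)."""
    d = X.shape[1]
    S = score(X)                                     # row i is grad log p(x_i)
    diff = X[:, None, :] - X[None, :, :]
    t = np.sum(diff ** 2, axis=-1)
    kappa = (c ** 2 + t) ** (-beta)
    dkappa = -beta * (c ** 2 + t) ** (-beta - 1)                # kappa'
    d2kappa = beta * (beta + 1) * (c ** 2 + t) ** (-beta - 2)   # kappa''
    cross = np.einsum('id,ijd->ij', S, diff)         # <s(x_i), x_i - x_j>
    return ((S @ S.T) * kappa                        # <s(x), s(y)> kappa(t)
            - 2 * dkappa * (cross + cross.T)         # -2 kappa'(t) <x - y, s(x) - s(y)>
            - 4 * d2kappa * t - 2 * d * dkappa)      # tr(grad_x grad_y k)

# Spectral sanity check in the spirit of Cor. B.1
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))                        # standard Gaussian target, score(x) = -x
Kp = imq_stein_kernel(X, score=lambda X: -X)
w = np.full(len(X), 1 / len(X))
Kw = np.sqrt(w)[:, None] * Kp * np.sqrt(w)[None, :]  # diag(sqrt(w)) K diag(sqrt(w))
print(np.sort(np.linalg.eigvalsh(Kw))[::-1][:20])    # rapid eigenvalue decay expected
```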
Proposition B.6 (Covering number bound for kp with radially analytic base kernel). Suppose kp is a Stein kernel with a preconditioning matrix M and a radially analytic base kernel k based on a real-analytic function κ. Then there exist a constant C > 0 and a polynomial P(r) of degree 2d depending only on (κ, d, M) such that for any r > 0, ϵ (0, 1), log Nkp(B2(r), ϵ) log Sp(r) ϵ + C d+1 P(r). (18) Debiased Distribution Compression Proof of Prop. B.6. Recall k(x, y) = κ( x y 2 M). Consider k0(x, y) κ( x y 2 2). For ϵ1 (0, 1/2), by Lem. B.11, we have log Nk(B2(r/ M 1/2 2), ϵ1) log Nk0(B2(r), ϵ1/2). Thus by Lem. B.12, there exists a polynomial Pk(r) of degree 2d and a constant Ck depending only on (κ, d, M) such that log Nk(B2(r), ϵ1) Pk(r)(log(1/ϵ1) + Ck)d+1 Similarly, for ϵ2 (0, 1/2), by Lem. B.10 and Lem. B.12, we have, for a constant C 2 (Mk) > 0 and a polynomial P 2 (Mk)(r) of degree 2d that depend only on (κ, d, M), log N 2 (Mk)(B2(r), ϵ2) P 2 (Mk)(r)(log(1/ϵ2) + C 2 (Mk))d+1. For a given ϵ (0, 1), let ϵ1 = ϵ 4 d Sp(r) and ϵ2 = ϵ 4. Then since Sp 1, we have ϵ1, ϵ2 (0, 1/2). By Cor. B.2 with A = B2(r), we obtain, for a constants C > 0 and a polynomial P(r) of degree 2d that depend only on (κ, d, M), log Nkp(B2(r), ϵ) P(r)(log(1/ϵ) + log Sp(r) + C)d+1. Hence (18) is shown. When log Sp(r) grows polynomially in r, we apply Prop. B.6 to immediately obtain the following. Corollary B.3. Under the assumption of Prop. B.6, suppose Sp(r) = O(poly(r)). Then for any δ > 0, there exists Cd > 0 such that log Nkp(B2(r), ϵ) Cd log(e/ϵ)d+1(r + 1)2d+δ. Proof of Cor. B.3. This immediately follows from Prop. B.6 by using δ > 0 to absorb the log Sp(r) = O(rδ) term. B.2.3. PROOF OF PROP. 1: STEIN KERNEL GROWTH RATES This follows from Prop. B.4 and Cor. B.3, and by noticing that if sup x 2 r log p(x) 2 is bounded by a degree dℓ polynomial, then so is Sp(r) = sup x 2 r M 1/2 log p(x) 2 M 1/2 2 sup x 2 r log p(x) 2 . C. Analysis of Debiasing Benchmarks We start with a result on the MMD quality of i.i.d. sample points in App. C.1, followed by the proofs of our i.i.d.-like and better-than-i.i.d. guarantees for debiasing from Thms. 1 and 2 in Apps. C.2 and C.3 respectively. C.1. MMD of unbiased i.i.d. sample points We start by showing that sequence of n points sampled i.i.d. from P achieves Θ(n 1) squared MMDk P to P in expectation. Proposition C.1 (MMD of unbiased i.i.d. sample points). Let k P be a kernel satisfying Assum. 1 with p 1. Let Sn = (xi)i [n] be n i.i.d. samples from P. Then E[MMDk P(Sn, P)2] = Ex P[k P(x,x)] Proof of Prop. C.1. We compute E[MMDk P(Sn, P)2] = E[P i,j [n] 1 n2 k P(xi, xj)] = 1 n2 P i,j [n] E[k P(xi, xj)] = 1 n E[k P(x1, x1)], where we used the fact that k P is mean-zero with respect to P and the independence of xi, xj for i = j. Debiased Distribution Compression C.2. Proof of Thm. 1: Debiasing to i.i.d. quality via simplex reweighting We make use of the self-normalized importance sampling weights w SNIS j = d P d Q(xj)/ P i [n] d P d Q(xi) for j [n] in our proofs. Notice that (w SNIS 1 , . . . , w SNIS n ) n 1 and hence MMDOPT MMDk P(w SNIS i δxi, P) = Pn i=1 d P d Q (xi)k P(xi, ) k P Pn i=1 d P d Q (xi) = 1 n Pn i=1 d P d Q (xi)k P(xi, ) k P 1 n Pn i=1 d P d Q (xi) . Introduce the bounded in probability notation Xn = Op(gn) to mean Pr(|Xn/gn| > Cϵ) ϵ for all n Nϵ for any ϵ > 0. Then we claim that under the conditions assumed in Thm. 1, n Pn i=1 d P d Q(xi)k P(xi, ) k P = Op(n 1 2 ) and 1 n Pn i=1 d P d Q(xi) 1 almost surely, (19) so that by Slutsky s theorem (Wellner, 2013, Ex. 
1.4.7), we have MMDOPT = Op(n 1 2 ) as desired. We prove the claims in (19) in two main steps: (a) first, we construct a weighted RKHS and then (b) establish a central limit theorem (CLT) that allows us to conclude both claims from (19) Constructing a weighted and separable RKHS Define the kernel k Q(x, y) d P d Q(x)k P(x, y) d P d Q(y) with Hilbert space Hk Q = d P d QHk P and the elements ξi k Q(xi, ) = d P d Q(xi)k P(xi, ) d P d Q( ) Hk Q for each i N. By Paulsen and Raghupathi (2016, Prop. 5.20), any element in Hk Q is represented by d P d Qf for some f Hk P and moreover, f 7 d P d Qf preserves inner product between the two RKHSs, i.e., f, g k P = d P d Qg k Q for f, g Hk P, which in turn implies f k P = d P d Qf k Q. As a result, we also have that n Pn i=1 d P d Q(xi)k P(xi, ) k P = 1 n Pn i=1 d P d Q(xi)k P(xi, ) d P d Q( ) k Q = 1 n Pn i=1 ξi k Q. (20) Since Hk P is separable, there exists a dense countable subset (fn)n N Hk P. For any d P d Qf Hk Q, there exists {nk}k N such that limk fnk f k P = 0. Since d P d Qf k Q = d P d Q(fnk f) k Q = fnk f k P due to inner-product preservation, we thus have limk d P d Qf k Q = limk fnk f k P0, so ( d P d Qfn)n N is dense in Hk Q, showing that Hk Q is separable. Harris recurrence of the chain (xi)i N Let µ1 denote the distribution of x1. Since S = (xi) i=1 is a homogeneous ϕirreducible geometrically ergodic Markov chain with stationary distribution Q, it is also positive (Meyn and Tweedie, 2012, Ch. 10) by definition and aperiodic by Douc et al. (2018, Lem. 9.3.9). Moreover, since S is ϕ-irreducible, aperiodic, and geometrically ergodic in the sense of Gallegos-Herrada et al. (2023, Thm. 1) and µ1 is absolutely continuous with respect to P, we will assume, without loss of generality, that S is Harris recurrent (Meyn and Tweedie, 2012, Ch. 9), since, by Qin (2023, Lem. 9), S is equal to a geometrically ergodic Harris chain with probability 1. CLT for 1 n Pn i=1 ξi We now show that 1 n Pn i=1 ξi converges to a Gaussian random element taking values in Hk Q. We separate the proof in two parts: first when the initial distribution µ1 = Q and next when µ1 = Q. Case 1: µ1 = Q When µ1 = Q, S is a strictly stationary chain. By Bradley (2005, Thm. 3.7 and (1.11)), since S is geometrically ergodic, its strong mixing coefficients ( αi)i N satisfy αi Cρi for some C > 0 and ρ [0, 1) and all i N. Since each ξi is a measurable function of xi, the strong mixing coefficients (αi)i N of (ξi)i N satisfy αi αi Cρi for each i N. Consequently, P i N i2/δαi < for δ = 2q 2 > 0. Note that we also have Ez Q[ k Q(z, ) 2+δ k Q ] = Ez Q[k Q(z, z)q] = Ez Q[ d P d Q(z)2qk P(z, z)q] = Ex P[ d P d Q(x)2q 1k P(x, x)q] < , Exi Q[ξi] = Exi P[k P(xi, )] = 0 and that Hk Q is separable. Since S is a strictly stationary chain, we conclude that (ξi)i N is a strictly stationary centered sequence of Hk Q-valued random variables satisfying the conditions needed to invoke Merlev ede et al. (1997, Cor. 1), and hence Pn i=1 ξi/ n converges in distribution to a Gaussian random element taking values in Hk Q. Case 2: µ1 = Q Since S is positive Harris and Pn i=1 ξi/ n satisfies a CLT for the initial distribution µ1 = Q, Meyn and Tweedie (2012, Prop. 17.1.6) implies that Pn i=1 ξi/ n also satisfies the same CLT for any initial distribution µ1. 
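As a practical aside, the self-normalized importance-sampling benchmark w^SNIS that anchors this argument is simple to instantiate whenever the density ratio dP/dQ can be evaluated up to a constant. The following minimal sketch (the kernel matrix Kp and the log-ratios are placeholders to be supplied by the user) uses the identity MMD^2_{k_P}(S_n^w, P) = w^T K w for a P-mean-zero kernel.

```python
import numpy as np

def snis_weights(log_ratio):
    """Self-normalized importance-sampling weights w_j proportional to
    (dP/dQ)(x_j); an unnormalized ratio suffices since the constant cancels."""
    log_ratio = np.asarray(log_ratio, dtype=float)
    r = np.exp(log_ratio - np.max(log_ratio))   # stabilized exponentiation
    return r / r.sum()

def snis_mmd(Kp, log_ratio):
    """MMD_{k_P}(S_n^w, P) of the SNIS reweighting, using MMD^2 = w^T K w
    for a P-mean-zero kernel matrix Kp = k_P(S_n, S_n)."""
    w = snis_weights(log_ratio)
    return np.sqrt(w @ Kp @ w)
```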
Debiased Distribution Compression Putting the pieces together for (19) Since, for any initial distribution for x1, the sequence (Pn i=1 ξi/ n)n N converges in distribution and that Hk Q is separable and (by virtue of being a Hilbert space) complete, Prokhorov s theorem (Billingsley, 2013, Thm. 5.2) implies that the sequence is also tight, i.e., Pn i=1 ξi k Q/ n = Op(1). Consequently, n Pn i=1 d P d Q(xi)k P(xi, ) k P (20) = 1 n Pn i=1 ξi k Q = 1 n Pn i=1 ξi n k Q = Op(n 1 as desired for the first claim in (19). Moreover, the strong law of large numbers for positive Harris chains (Meyn and Tweedie, 2012, Thm. 17.0.1(i)) implies that 1 n P i [n] d P d Q(xi) converges almost surely to Ez Q[ d P d Q(z)] = 1 as desired for the second claim in (19). C.3. Proof of Thm. 2: Better-than-i.i.d. debiasing via simplex reweighting We start with Thm. C.1, proved in App. C.4, that bounds MMDOPT in terms of the eigenvalues of the integral operator of the kernel k P. Our proof makes use of a weight construction from Liu and Lee (2017, Theorem 3.2), but is a non-trivial generalization of their proof as we no longer assume uniform bounds on the eigenfunctions, and instead leverage truncated variations of Bernstein s inequality (Lems. C.2 and C.3) to establish suitable concentration bounds. Theorem C.1 (Debiasing via i.i.d. simplex reweighting). Consider a kernel k P satisfying Assum. 1 with p = 2. Let (λℓ) ℓ=1 be the decreasing sequence of eigenvalues of Tk P,P defined in (4). Let Sn be a sequence of n 2N i.i.d. random variables with law Q such that P is absolutely continuous with respect to Q and d P d Q M for some M > 0. Futhermore, assume there exist constants δn, Bn > 0 such that Pr ( k P n > Bn) < δn. Then for all L N such that λL > 0, we have E[MMD2 k P(Sw OPT n , P)] 8M n Ex P[k2 P(x,x)] λL + P ℓ>L λℓ + ϵn E[k2 P(x1, x1)], (21) ϵ2 n n exp 3n 16MBn/λL + 2 exp n 16M 2 + 2 exp n 64M 2(Ex P[k P(x,x)]+Bn/12)/λL Given Thm. C.1, Thm. 2 follows, i.e., we have E[MMD2 k P(Sw OPT n , P)] = o(n 1), as long as we can show (i) E[k2 P(x1, x1)] < , which in turn holds when q > 3 as assumed in Thm. 2, and (ii) find sequences (Bn) n=1, (δn) n=1, and (Ln) n=1 such that Pr( k P n > Bn) < δn for all n and the following conditions are met: (a) Ex P[k2 P(x,x)] λLn = o(n); (b) Bn λLn = O(nβ), for some β < 1; ℓ>Ln λℓ= o(1); (d) δn = o(n 2). We now proceed to establish these conditions under the assumptions of Thm. 2. Condition (d) By the de La Vall ee Poussin Theorem (Chandra, 2015, Thm. 1.3) applied to the Q-integrable function x 7 k P(x, x)q (which is a uniformly integrable family with one function), there exists a convex increasing function G such that limt G(t) t = and E[G(k P(x1, x1)q)] < . Thus, Pr k P(x1, x1) > n3/q = Pr k P(x1, x1)q > n3 = Pr G(k P(x1, x1))q > G(n3) E[G(k P(x1,x1))q] G(n3) = o(n 3), where the last step uses limt G(t) t = . Hence by the union bound, Pr k P n > n3/q = Pr maxi [n] k P(xi, xi) > n3/q n Pr k P(x1, x1) > n3/q = o(n 2). Hence if we set Bn = nτ for τ 3/q < 1, there exists (δn) n=1 such that δn = o(n 2). This fulfills (d) and that Pr( k P n > Bn) < δn. Debiased Distribution Compression To prove remaining conditions, without loss of generality, we assume that λℓ> 0 for all ℓ N, since otherwise we can choose Ln to be, for all n, the largest ℓsuch that λℓ> 0. Then P ℓ>Ln λLn = 0 and all other conditions are met. Condition (c) If Ln , then (c) is fulfilled since P ℓλℓ< , which follows from Lem. 
B.1(d) and that P ℓλℓ= P ℓ=1 λi Ex P[ϕℓ(x)2] = P ℓ=1 λi Ex P[ϕℓ(x)2] = Ex P[P ℓ=1 λiϕℓ(x)2] = Ex P[k P(x, x)] < . Conditions (a) and (b) Note that the condition (a) is subsumed by (b) since Ex P[k2 P(x, x)] < . It remains to choose (Ln) n=1 to satisfy (b) such that limn Ln = . Define Ln max{ℓ N : λℓ n τ 1 2 }. Then Ln is well-defined for n ( 1 λ1 ) 2 1 τ , since for such n we have λ1 n τ 1 2 . Hence for n ( 1 λ1 ) 2 1 τ , we have so (b) is satisfied with β = τ+1 2 < 1. Since τ < 1, Ln is non-decreasing in n and n τ 1 2 decreases to 0. Since each λℓ> 0, we therefore have limn Ln = . C.4. Proof of Thm. C.1: Debiasing via i.i.d. simplex reweighting We will slowly build up towards proving Thm. C.1. First notice Ex P[k2 P(x, x)] < implies Ex P[k P(x, x)] < , so Lem. B.1 holds. Fix any L N satisfying λL > 0. Since n is even, we can define D0 [n/2] and D1 [n] \ D0. We will use SD0 and SD1 to denote the subsets of Sn with indices in D0 and D1 respectively. Let (ϕℓ) ℓ=1 Hk P be eigenfunctions corresponding to the eigenvalues (λℓ) ℓ=1 by Lem. B.1(c), so that (ϕℓ) ℓ=1 is an orthonormal system of L2(P). We start with a useful lemma. Lemma C.1 (Hk P consists of mean-zero functions). Let k P be a kernel satisfying Assum. 1. Then for any f Hk P, we have Pf = 0. Proof. Fix f Hk P. By Steinwart and Christmann (2008, Thm 4.26), f is P integrable. Consider the linear operator I that maps f 7 Pf. Since |I(f)| = |Pf| P|f| = R | f, k P(x, ) k P|d P R f k P p k P(x, x)d P = f k P Ex P[k P(x, x) 1 2 ]. Hence I is a continuous linear operator, so by the Riez representation theorem (Steinwart and Christmann, 2008, Thm. A.5.12), there exists g Hk P such that I(h) = h, g k P for any h Hk P. By Steinwart and Christmann (2008, Thm. 4.21), the set Hpre Pn i=1 αik P( , xi) : n N, (αi)i [n] R, (xi)i [n] Rd is dense in Hk P. Note that Hpre consists of mean zero functions under P by linearity. So there exists fn converging to f in Hk P where each fn has Pfn = I(fn) = fn, g k P = 0. Since limn | f, g k P fn, g k P| = limn | f fn, g k P| limn f fn k P g k P = 0, we have Pf = f, g k P = 0. In particular, the assumption Ex P[k2 P(x, x)] < of Thm. C.1 implies Ex P[k P(x, x) 1 2 ] < , so Lem. C.1 holds. Step 1. Build control variate weights Fix any L 1 and h Hk P, and let ˆh D0 denote the eigen-expansion truncated approximation of h based on D0, ˆh D0(x) PL ℓ=1 ˆβℓ,0ϕℓ(x) for ˆβℓ,0 2 i D0 h(xi)ϕℓ(xi)ξ(xi). E[ˆβℓ,0] = E 2 n P i D0 h(xi)ϕℓ(xi)ξ(xi) = h, ϕℓ L2(P). (23) Debiased Distribution Compression Next, define the control variate ξ(xi)(h(xi) ˆh D0(xi)) . (24) which satisfies E[ ˆ Z0[h]] = Ex P h h(x) PL ℓ=1 E[ˆβℓ,0]ϕℓ(x) i = 0, (25) since functions in Hk P have mean 0 with respect to P (Lem. C.1). Similarly, we define ˆ Z1[h] by swapping D0 and D1. Then we form ˆZ[h] ˆ Z0[h]+ ˆ Z1[h] 2 . We can rewrite ˆZ[h] as a quadrature rule over Sn (Liu and Lee, 2017, Lemma B.6) i [n] wih(xi), (26) where wi is defined as (whose randomness depends on the randomness in Sn) wi 1 nξ(xi) 2 n2 P j D1 ξ(xi)ξ(xj) ΦL(xi), ΦL(xj) , i D0, 1 nξ(xi) 2 n2 P j D0 ξ(xi)ξ(xj) ΦL(xi), ΦL(xj) , i D1, (27) and ΦL(x) (ϕ1(x), . . . , ϕL(x)). Step 2. Show E[MMD2 k P(Sw n , P)] = o(n 1) We first bound the variance of the control variate ˆZ0[h] for h = ϕℓ for ℓ N. Let us fix ℓ N. 
From (24), we compute E[ ˆZ0[h]2] = 4 n2 E P i D1 ξ(xi)(h(xi) ˆh D0(xi)) 2 = 4 n2 E h P i D1 ξ(xi)2(h(xi) ˆh D0(xi))2i n E[Ex Q[ξ(x)2(h(x) ˆh D0(x))2|SD0]] n E[Ex P[ξ(x)(h(x) ˆh D0(x))2|SD0]] n E[Ex P[(h(x) ˆh D0(x))2|SD0]], where in the second equality, the cross terms are zero due to the independence of points xi and the equality (25). By the definition of ˆh D0, we compute Ex P[(h(x) ˆh D0(x))2|SD0] = Ex P ℓ L ˆβℓ,0ϕℓ(x) 2 SD0 = Ex P h ϕ2 ℓ (x) + P ℓ L ˆβ2 ℓ,0ϕ2 ℓ(x) 2ϕℓ (x) P ℓ L ˆβℓ,0ϕℓ(x) SD0 i ℓ L ˆβ2 ℓ,0 2 P ℓ L ˆβℓ,01ℓ =ℓ ℓ L ˆβ2 ℓ,0 2ˆβℓ ,01ℓ L, where we use the fact that (ϕℓ) ℓ=1 is an orthonormal system in L2(P). By (23) with h = ϕℓ , we have E[ˆβℓ ,0] = 1. On the other hand, we can bound, again using the orthonormality of (ϕℓ) ℓ=1, E[ˆβ2 ℓ,0] = E h 2 i D0 ϕℓ(xi)ϕℓ (xi)ξ(xi) 2i = 4 n2 E P i D0(ϕℓ(xi)ϕℓ (xi)ξ(xi))2 2M n Ex P[(ϕℓ(x)ϕℓ (x))2]. Thus for all ℓ N, E[ ˆZ0[ϕℓ ]2] 2M ℓ L Ex P[(ϕℓ(x)ϕℓ (x))2] 21ℓ L 2M ℓ L Ex P[(ϕℓ(x)ϕℓ (x))2] + 1ℓ >L . Since ˆZ[h] = ˆ Z0[h]+ ˆ Z1[h] 2 and ( a+b 2 for a, b R, and, by symmetry, E[ ˆZ0[h]2] = E[ ˆZ1[h]2], we have E[ ˆZ[ϕℓ ]2] 2M ℓ L Ex P[(ϕℓ(x)ϕℓ (x))2] + 1ℓ >L . (28) Debiased Distribution Compression Now we have E[MMD2 k P(Sw n , P)] = E h P i,j [n] wiwjk P(xi, xj) i = E h P i,j [n] wiwj P ℓ =1 λℓ ϕℓ (xi)ϕℓ (xj) i = E h P ℓ =1 P i,j [n] wiwjλℓ ϕℓ (xi)ϕℓ (xj) i = E P ℓ =1 λℓ P i [n] wiϕℓ (xi) 2 = P ℓ =1 λℓ E P i [n] wiϕℓ (xi) 2 = P ℓ =1 λℓ E[ ˆZ[ϕℓ ]2], where the second and third equalities are due to the absolute convergence of the Mercer series (Lem. B.1(d)), the fourth equality follows from Tonelli s theorem (Steinwart and Christmann, 2008, Thm. A.3.10), and the last step is due to (26). Plugging in (28), we have E[MMD2 k P(Sw n , P)] 2M ℓ L λℓ Ex P[(ϕℓ(x)ϕℓ (x))2] + P Since the eigenvalues are nonnegative and non-increasing, we can write, by (5), k2 P(x, x) = P ℓ=1 λℓϕℓ(x)2 2 P ℓ =1 P ℓ L λℓ λℓ(ϕℓ(x)ϕℓ (x))2 λL P ℓ =1 P ℓ L λℓ (ϕℓ(x)ϕℓ (x))2. Thus by Tonelli s theorem (Steinwart and Christmann, 2008, Thm. A.3.10), P ℓ =1 P ℓ L λℓ Ex P[(ϕℓ(x)ϕℓ (x))2] = Ex P h P ℓ =1 P ℓ L λℓ (ϕℓ(x)ϕℓ (x))2i Ex P[k2 P(x,x)] λL . Finally, we have E[MMD2 k P(Sw n , P)] 2M n Ex P[k2 P(x,x)] λL + P ℓ>L λℓ . (29) Step 3. Meet the non-negative constraint We now show that the weights (27) are nonnegative and sum close to one with high probability. For i D0, we have nξ(xi) (1 Ti) for Ti 2 n P j D1 ξ(xj) ΦL(xi), ΦL(xj) . Our first goal is to derive an upper bound for Pr (mini D0 wi < 0). Define the event En { k P n Bn} , (30) so Pr(Ec n) < δn by the assumption on k P n. Then Pr mini [n] wi < 0, En = Pr maxi [n] Ti > 1, En n Pr(T11En > 1), (31) where we applied the union bound and used the fact that Ti1En has the same law for different i. To further bound Pr(T11En > 1), we will use the following lemma. Lemma C.2 (Truncated Bernstein inequality). Let X1, . . . , Xn be i.i.d. random variables with E[X1] = 0 and E[X2 1] < . For any B > 0, t > 0, i [n] Xi1Xi B > t exp nt2 2(E[X2 1]+ Bt Proof of Lem. C.2. Fix any B > 0 and t > 0 and define, for each i [n], Yi Xi1Xi B. Then Yi B, E[Yi] = E[Xi1Xi B] E[Xi1Xi B] + E[Xi1Xi>B] = E[Xi] = 0, and E[Y 2 i ] = E[X2 i 1Xi B] E[X2 i ] = E[X2 1]. Now we can invoke the non-positivity of E[Yi], the one-sided Bernstein inequality (Wainwright, 2019, Prop. 
2.14), and the relation E[Y 2 i ] E[X2 1] to conclude that i [n] Yi > t Pr 1 n P i [n] (Yi E[Yi]) > t exp nt2 i [n] E[Y 2 i ]+ Bt 2(E[X2 1]+ Bt Debiased Distribution Compression For j D1, define Xj ξ(xj) ΦL(x1), ΦL(xj) and note that E[Xj|x1] = Ex Q[ξ(x) ΦL(x1), ΦL(x) |x1] = Ex P[ ΦL(x1), ΦL(x) |x1] = 0 E[X2 j |x1] = E[ξ(xj)2 ΦL(x1), ΦL(xj) 2|x1] MEx P[ ΦL(x1), ΦL(x) 2|x1] = MEx P h P ℓ,ℓ L ϕℓ(x1)ϕℓ (x1)ϕℓ(x)ϕℓ (x) x1 i ℓ,ℓ L ϕℓ(x1)ϕℓ (x1)Ex P [ϕℓ(x)ϕℓ (x)] = M ΦL(x1) 2 2 . Since λ1 λ2 0, for any x Rd, we can bound ΦL(x) 2 2 via ΦL(x) 2 2 = P ℓ L ϕℓ(x)2 ℓ L λℓϕℓ(x)2 P ℓ=1 λℓϕℓ(x)2 λL = k P(x,x) where we applied Lem. B.1(d) for the last equality. Thus |Xj| M ΦL(x1) 2 ΦL(xj) 2 M ΦL(x1) 2 q so if we let B q Bn λL M ΦL(x1) 2 , then En = n supi [n] k P(xi, xi) Bn o T j D1{|Xj| B}. Since T1 = 2 j D1 Xj, we have inclusions of events {T11En > 1} = {T1 > 1} En n 2 n P j D1 Xj1Xj B > 1 o . Thus Lem. C.2 with t = 1 and conditioned on x1 implies Pr (T11En > 1|x1) Pr 2 n P j D1 Xj1Xj B > 1 x1 n 4(M ΦL(x1) 2 2+ q Bn λL M ΦL(x1) 2/3) On event {k P(x1, x1) Bn}, by (32), we have Pr (T11En > 1|x1) 1k P(x1,x1) Bn exp n 16 On the other hand, {k P(x1, x1) > Bn} / En, so Pr (T11En > 1|x1) 1k P(x1,x1)>Bn = 0 Pr(T11En > 1) = E[Pr(T11En > 1|x1)] exp n 16 Combining the last inequality with (31), we have: Pr mini [n] wi < 0, En n exp n 16 Debiased Distribution Compression Step 4. Meet the sum-to-one constraint i D0 wi = P i D0 1 nξ(xi) 1 2 j D1 ξ(xj) ΦL(xi), ΦL(xj) . We now derive a bound for Pr(S < 1/2 t/2) for t (0, 1). Let i D0 ξ(xi), S2 2 j D1 ξ(xi)ξ(xj) ΦL(xi), ΦL(xj) , so S = S1 + S2. Note that E[S1] = 1/2 and E[S2] = 0 since D0 and D1 are disjoint. Let En be the same event defined as in (30). For t1 (0, t/2) to be determined later and t2 t/2 t1, we have, by the union bound Pr(S < 1/2 t/2, En) Pr(S1 < 1/2 t1, En) + Pr(S2 < t2, En). By Hoeffding s inequality and the assumption ξ(x) M, we have Pr(S1 < 1/2 t1, En) Pr 2 n P 2 1/2 < t1 exp 2(n/2)t2 1 (M/2)2 = exp 4nt2 1 M 2 . (34) To give a concentration bound for Pr(S2 < t2, En), we will use the following lemma. Lemma C.3 (U-statistic Bernstein s inequality). Let h : X X R be a function bounded above by b > 0. Assume n 2N and let x1, . . . , xn be i.i.d. random variables taking values in X. Denote mh E[h(x1, x2)] and σ2 h Var[h(x1, x2)]. Let D0 = [n/2] and D1 = [n] \ [n/2]. Define U 1 (n/2)2 P j D1 h(xi, xj). Pr(U mh > t) exp nt2 Proof of Lem. C.3. We adapt the proof from Pitcan (2017, Section 3) as follows. Let k n/2. Define V : X n R as V (x1, . . . , xn) 1 i [k] h(xi, xi+k). Then note that k! P σ perm(k) Vσ, Vσ V (xσ1, . . . , xσk), where perm(k) is the set of all permutations of [k]; this is because every h(xi, xj) term for i D0, j D1 will appear in the summation an equal number of times. For a fixed σ perm(k), the random variable V (xσ1, . . . , xσk, xk+1, . . . , xn) is a sum of k i.i.d. terms h(xσi, xi+k). Denote V = V (x1, . . . , xn). For any s > 0, we have, by independence, E[es(V mh)] = E h exp s k P i [k](h(xi, xi+[k]) mh) i k(h(x1, x2) mh) k By the one-sided Bernstein s lemma Wainwright (2019, Prop. 2.14) applied to h(x1,x2) k which is upper bounded by b variance σ2 h k2 , we have E h exp s h(x1,x2) mh k i exp s2σ2 h/2 k(k bs for s [0, 3k/b). Next, by Markov s inequality and Jensen s inequality, Pr(U mh > t) = Pr es(U mh) > est E[es(U mh)]e st = E h exp 1 (n/2)! P σ perm(n/2) s(Vσ mh) i e st E h 1 (n/2)! P σ perm(n/2) exp(s(Vσ mh)) i e st = E[es(V mh)]e st. 
Debiased Distribution Compression Pr(U mh > t) exp s2σ2 h 2(k bs Now, we get the desired bound if we pick s = k2t kσ2 h+ ktb 3 [0, 3k/b) and simplify. h(x, x ) ξ(x)ξ(x ) ΦL(x), ΦL(x ) h(x, x ) h(x, x )1h(x,x ) M 2 Bn Pr(S2 < t2, En) = Pr 1 (n/2)2 P j D1 h(xi, xj) > 2t2, En Pr 1 (n/2)2 P j D1 h(xi, xj) > 2t2 , (35) where the last inequality used the fact that, for i D0, j D1, En {max(k P(xi, xi), k P(xj, xj)) Bn} n h(xi, xj) M 2 Bn using (32). We further compute m h = E[ h(x1, x2)] E[h(x1, x2)] = E[ξ(x1)ξ(x2) ΦL(x1), ΦL(x2) ] = P ℓ L E[ξ(x1)ξ(x2)ϕℓ(x1)ϕℓ(x2)] ℓ L(Ex P[ϕℓ(x)])2 = 0, σ2 h = Var[ h(x1, x2)] E[ h(x1, x2)2] E[h(x1, x2)2] = E h (ξ(x1)ξ(x2) ΦL(x1), ΦL(x2) )2i M 2E(x,x ) P P[ ΦL(x), ΦL(x ) 2] = M 2E(x,x ) P P h P ℓ,ℓ L ϕℓ(x)ϕℓ (x)ϕℓ(x )ϕℓ (x ) i ℓ,ℓ L(E[ϕℓ(x)ϕℓ (x)])2 = LM 2. Since Ex P[k P(x, x)] = P ℓλℓ LλL, we have L k P 2 L2(P) λL , so that σ2 h M 2 k P 2 L2(P) λL . Applying Lem. C.3 to h, which is bounded by M 2 Bn λL and using the fact that m h 0, we have Pr 1 (n/2)2 P j D1 h(xi, xj) > 2t2 Pr 1 (n/2)2 P j D1 h(xi, xj) m h > 2t2 M2 k P 2 L2(P) λL +2M 2 Bn Thus combining (34), (35), (36), we get Pr(S < 1/2 t/2, En) exp 4nt2 1 M 2 + exp nt2 2 M2 k P 2 L2(P) λL +2M 2 Bn Debiased Distribution Compression Finally, by symmetry and the union bound, for t (0, 1), t (0, t/2) and t2 = t/2 t1, we have i [n] wi < 1 t, En Pr P i D0 wi < 1/2 t/2, En + Pr P i D1 wi < 1/2 t/2, En = 2 Pr(S < 1/2 t/2, En) exp 4nt2 1 M 2 + exp nt2 2 M2 k P 2 L2(P) λL +2M 2 Bn Step 5. Putting it all together Define the event Fn = n mini [n] wi 0, P Then, by the union bound, Pr(F c n) Pr mini [n] wi < 0, En + Pr P i [n] wi < 1 2, En + Pr(Ec n). Applying (33) and (37) to bound the last expression with t = 1/2, t1 = t2 = 1/8, we have Pr(F c n) ϵ2 n for ϵn defined in (22). On the event Fn, if we define w+ n 1 via then w+ i = αwi for i [n] and α 1 P i [n] wi 2. Let w n 1 be the weight defined by w1 = 1 and wi = 0 for i > 1. Since w OPT is the best simplex weight, we have MMD2 k P(Sw OPT n , P) min(MMD2 k P(Sw+ n , P), MMD2 k P(S w n , P)). Hence E MMD2 k P(Sw OPT n , P) = E MMD2 k P(Sw OPT n , P)1Fn + E MMD2 k P(Sw OPT n , P)1F c n E h MMD2 k P(Sw+ n , P)1Fn i + E MMD2 k P(S w n , P)1F c n . For the first term, we have the bound E h MMD2 k P(Sw+ n , P)1Fn i = E h P i,j [n] w+ i w+ j k P(xi, xj)1Fn i = E h α2 P i,j [n] wiwjk P(xi, xj)1Fn i i,j [n] wiwjk P(xi, xj) i 8M n Ex P[k2 P(x,x)] λL + P where we applied (29) for the last inequality. For the second term, by the Cauchy-Schwartz inequality, E MMD2 k P(S w n , P)1F c n p i,j [n] k P(xi, xj) wi wj 2 E[k P(x1, x1)2]. Putting everything together we obtain (21). D. Stein Kernel Thinning In this section, we detail our Stein thinning implementation in App. D.1, our kernel thinning implementation and analysis in App. D.2, and our proof of Thm. 3 in App. D.3. D.1. Stein Thinning with sufficient statistics For an input point set of size n, the original implementation of Stein Thinning of Riabiz et al. (2022) takes O(nm2) time to output a coreset of size m. In Alg. D.1, we show that this runtime can be improved to O(nm) using sufficient statistics. The idea is to maintain a vector g Rn such that g = 2k P(Sn, Sn)w where w is the weight representing the current coreset. 
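A minimal NumPy sketch of this sufficient-statistic update follows; for clarity it takes the precomputed kernel matrix as input, whereas Alg. D.1 below evaluates the columns k_P(S_n, x_j) on demand.

```python
import numpy as np

def stein_thinning(Kp, m):
    """Greedy Stein Thinning maintaining the sufficient statistic g = 2 K w,
    mirroring Alg. D.1; Kp is the n x n Stein kernel matrix k_P(S_n, S_n)."""
    n = Kp.shape[0]
    diag = np.diag(Kp)
    w = np.zeros(n)
    j = int(np.argmin(diag))                 # first point minimizes k_P(x, x)
    w[j] = 1.0
    g = 2.0 * Kp[:, j]                       # g = 2 k_P(S_n, S_n) w
    for t in range(1, m):
        # adding x_j changes (t+1)^2 MMD^2 by t * g_j + k_P(x_j, x_j) + const
        j = int(np.argmin(t * g + diag))
        w *= t / (t + 1)
        w[j] += 1 / (t + 1)
        g = t / (t + 1) * g + 2 / (t + 1) * Kp[:, j]
    return w                                 # equal-weighted coreset as a simplex weight
```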
Debiased Distribution Compression Algorithm D.1 Stein Thinning (ST) with sufficient statistics Input: kernel k P with zero-mean under P, input points Sn = (xi)i [n], output size m w 0 Rn j argmini [n] k P(xi, xi) wj 1 g 2k P(Sn, xj) maintain sufficient statistics g = 2k P(Sn, Sn)w for t = 1 to m 1 do j argmini [n]{tgi + k P(xi, xi)} w t t+1w + 1 t+1ej g t t+1g + 2 t+1k P(Sn, xj) end for Return: w Algorithm D.2 Kernel Thinning (KT) (adapted from Dwivedi and Mackey (2022, Alg. 1)) Input: kernel k P with zero-mean under P, input points Sn = (xi)i [n], multiplicity n with log2 n m N, weight w n 1 ( N0 n )n, output size m with n m 2N, failure probability δ S index sequence where k [n] appears n wk times t log2 n m N (I(ℓ))ℓ [2t] KT-SPLIT(k P, Sn[S], t, δ/n ) KT-SPLIT is from Dwivedi and Mackey (2022, Algorithm 1a) and we set δi = δ for all i I(ℓ) S[I(ℓ)] for each ℓ [2t] I KT-Swap(k P, Sn, (I(ℓ))ℓ [2t]) w KT simplex weight corresponding to I wi = number of occurrences of i in I |I| Return: w KT n 1 ( N0 m )n Hence w KT 0 m D.2. Kernel Thinning targeting P Our Kernel Thinning implementation is detailed in Alg. D.2. Since we are able to directly compute MMDk P(Sw n , P), we use KT-Swap (Alg. D.3) in place of the standard KT-SWAP subroutine (Dwivedi and Mackey, 2022, Algorithm 1b) to choose candidate points to swap in so as to greedily minimize MMDk P(Sw n , P). To facilitate our subsequent SKT analysis, we restate the guarantees of KT-SPLIT (Dwivedi and Mackey, 2022, Theorem 2) in the sub-Gaussian format of (Shetty et al., 2022, Definition 3). Lemma D.1 (Sub-Gaussian guarantee for KT-SPLIT). Let Sn be a sequence of n points and k a kernel. For any δ (0, 1) and m N such that log2 n m N, consider the KT-SPLIT algorithm (Dwivedi and Mackey, 2022, Algorithm 1a) with ksplit = k, thinning parameter t = log2 n m, and δi = δ n to compress Sn to 2t coresets {S(i) out}i [2t] where each S(i) out has m points. Denote the signed measure ϕ(i) 1 x S(i) out δx. Then for each i [2t], on an event E(i) equi with P(E(i) equi) 1 δ 2, ϕ(i) = ϕ(i) for a random signed measure ϕ(i)6 such that, for any δ (0, 1), Pr ϕ(i)k Hk an,m + vn,m q 8 3 k n log 6(log2 n m )m δ log(4Nk(B2(Rn), m 1)) , 8 3 k n log 6(log2 n m )m δ . 6This is the signed measure returned by repeated applications of self-balancing Hilbert walk (SBHW) (Dwivedi and Mackey, 2021, Algorithm 3). Although SBHW returns an element of Hk, by tracing the algorithm, the returned output is equivalent to a signed measure via the correspondence P i [n] cik(xi, ) P i [n] ciδxi. The usage of signed measures is consistent with Shetty et al. (2022). Debiased Distribution Compression Algorithm D.3 KT-Swap (modified Dwivedi and Mackey (2022, Alg. 1b) to minimize MMD to P) Input: kernel k P with zero-mean under P, input points Sn = (xi)i [n], candidate coreset indices (I(ℓ))ℓ [L] m I(0) all candidate coresets are of the same size I I(ℓ ) for ℓ argminℓ [L] MMDk P(Sn[I(ℓ)], P) select the best KT-SPLIT coreset IST index sequence of Stein Thinning(k P, Sn, m) add Stein thinning baseline C = {I, IST} shortlisted candidates for I C do g 0 Rn maintain sufficient statistics g = P j [m] k P(x Ij, Sn) Kdiag (k P(xi, xi))i [n] for j = 1 to m do g g + k P(x Ij, Sn) end for for j = 1 to m do = 2(g k P(x Ij, Sn)) + Kdiag this is the change in MMD2 k P(Sn[I], P) if we were to replace Ij k argmini [n] i g = g k P(x Ij, Sn) + k P(xk, Sn) Ij k end for end for Return: I = argmin I C MMDk P(Sn[I], P) Proof of Lem. D.1. Fix i [2t], δ (0, 1) and n, m N such that t = log2 n m N. Define ϕ ϕ(i). 
By the proof of Dwivedi and Mackey (2022, Thms. 1 and 2), there exists an event Eequi with Pr Ec equi δ 2 such that, on this event, ϕ = ϕ where ϕ is a signed measure such that, for any δ (0, 1), with probability at least 1 δ , ϕk Hk infϵ (0,1),A:Sn A 2ϵ + 2t 8 3 k n log 6tn δ + log Nk(A, ϵ) . Note that on Eequi, ϕk Hk = ϕk Hk. We choose A = B2(Rn) and ϵ = 2t n = m 1, so that, with probability at least 1 δ , using the fact that b for a, b 0, 8 3 k n log 6tn δ + log Nk(B2(Rn), m 1) (38) 8 3 k n log 6tn log 4Nk(B2(Rn), m 1) i an,m + vn,m q for an,m, vn,m in Lem. D.1. Corollary D.1 (MMD guarantee for KT-SPLIT). Let S be an infinite sequence of points in Rd and k a kernel. For any δ (0, 1) and n, m N such that log2 n m N, consider the KT-SPLIT algorithm (Dwivedi and Mackey, 2022, Algorithm 1a) with parameters ksplit = k and δi = δ n to compress Sn to 2t coresets {S(i) out}i [2t] where t = log2 n m, each with m points. Then for any i [2t], with probability at least 1 δ, MMDk(Sn, S(i) out) 1 8 3 k n log 6(log2 n m )m δ log Nk(B2(Rn), m 1) + log 8 Debiased Distribution Compression Proof of Cor. D.1. Fix i [2t]. By taking δ = δ 2 in (38), we obtain (39). This occurs with probability Pr MMDk(Sn, S(i) out) < an,m + vn,m q =1 Pr MMDk(Sn, S(i) out) an,m + vn,m q 1 Pr E(i) equi, MMDk(Sn, S(i) out) an,m + vn,m q δ Pr E(i) equi 1 Pr ϕ(i)k Hk an,m + vn,m q δ Pr E(i) equi D.3. Proof of Thm. 3: MMD guarantee for SKT Thm. 3 follows directly from Assum. (α, β)-kernel and the following result for a kernel with generic covering number. Theorem D.1. Let k P be a kernel satisfying Assum. 1. Let S be an infinite sequence of points. Then for a prefix sequence Sn of n points, m [n], and n m2 log2 n m , SKT outputs w SKT in O(n2dk P) time that satisfies, with probability at least 1 δ, MMDk P(w SKT) r 1+log n n k P n + 1 8 3 k n log 6(log2 n m )m δ log Nk(B2(Rn), m 1) + log 8 Proof of Thm. D.1. The runtime of SKT comes from the fact that all of Stein Thinning (with output size n), KT-SPLIT, and KT-Swap take O(dk Pn2) time. By Riabiz et al. (2022, Theorem 1), Stein Thinning (which is a deterministic algorithm) from n points to n points has the following guarantee MMD2 k P(Sw n , P) MMD2 k P(Sw OPT n , P) + 1+log n where we denote the output weight of Stein Thinning as w . Using b for a, b 0, we have MMDk P(Sw n , P) MMDk P(Sw OPT n , P) + r 1+log n Fix δ (0, 1). By Cor. D.1 with k = k P and t = log2 n m, with probability at least 1 δ, we have, for any i [2t], MMDk P(Sw n , S(i) out) 1 8 3 k n log 6(log2 n m )m δ log Nk(B2(Rn), m 1) + log 8 where S(i) out is the i-th coreset output by KT-SPLIT. Since KT-Swap can only decrease the MMD to P, we have, by the triangle inequality of MMDk P, MMDk P(Sw SKT n , P) MMDk P(S(1) out , P) MMDk P(S(1) out , Sw n ) + MMDk P(Sw n , P), which gives the desired bound. Thm. 3 now follows from Thm. D.1, the kernel growth definitions in Assum. (α, β)-kernel, n n 2n, and that log2( n Debiased Distribution Compression E. Resampling of Simplex Weights Integral to many of our algorithms is a resampling procedure that turns a simplex-weighted point set of size n into an equal-weighted point set of size m while incurring at most O(1/ m) MMD error. The motivation for wanting an equalweighted point set is two-fold: First, in LSKT, we need to provide an equal-weighted point set to KT-Compress++, but the output of LD is a simplex weight. Secondly, we can exploit the fact that non-zero weights are bounded away from zero in equal-weighted point sets to provide a tighter analysis of Weighted RPCholesky. 
While i.i.d. resampling also achieves the O(1/ m) goal, we choose Resample (Alg. E.3), a stratified residual resampling algorithm (Douc and Capp e, 2005, Sec. 3.2, 3.3). In this section, we derive an MMD bound for Resample and show that it is better in expectation than using i.i.d. resampling or residual resampling alone. Let Dinv w be the inverse of the cumulative distribution function of the multinomial distribution with weight w, i.e., Dinv w (u) min n i [n] : u Pi j=1 wj o . Algorithm E.1 i.i.d. resampling Input: Weights w n 1, output size m w 0 Rn for j = 1 to m do Draw Uj Uniform([0, 1)) Ij Dinv w (Uj) w Ij w Ij + 1 m end for Return: w n 1 ( N0 Algorithm E.2 Residual resampling Input: Weights w n 1, output size m w i mwi m , i [n] r m P i [n] mwi N r , i [n] for j = 1 to r do Draw Uj Uniform([0, 1)) Ij Dinv η (Uj) w Ij w Ij + 1 m end for Return: w n 1 ( N0 Algorithm E.3 Stratified residual resampling (Resample) Input: Weights w n 1, output size m w i mwi m , i [n] r m P i [n] mwi N r , i [n] for j = 1 to r do Draw Uj Uniform([ j r )) Ij Dinv η (Uj) w Ij w Ij + 1 m end for Return: w n 1 ( N0 Debiased Distribution Compression Proposition E.1 (MMD guarantee of resampling algorithms). Consider any kernel k, points Sn = (x1, . . . , xn) Rd, and a weight vector w n 1. (a) Using the notation from Alg. E.1, let X, X be independent random variables with law Sw n . Then, the output weight vector wi.i.d. w of Alg. E.1 satisfies E[MMD2 k(Swi.i.d. n , Sw n )] = Ek(X,X) Ek(X,X ) (b) Using the notation from Alg. E.2, let R, R be independent random variables with law Sη n. Then, the output weight vector wresid w of Alg. E.2 satisfies E[MMD2 k(Swresid n , Sw n )] = r(Ek(R,R) Ek(R,R )) (c) Using the notation from Alg. E.3, let Rj x Ij and R j be an independent copy of Rj. Let R be an independent random variable with law Sη n. Then, the output weight vector wsr w of Alg. E.3 satisfies E[MMD2 k(Swsr n , Sw n )] = r Ek(R,R) P j [r] Ek(Rj,R j) m2 . (42) Proof of Prop. E.1(a). Let Xi x Ii. As random signed measures, we have Sw n Sw n = 1 i [m] δXi P i [n] wiδxi. MMD2 k(Sw n , Sw n ) = ((Sw n Sw n ) (Sw n Sw n ))k i,i [m] k(Xi, Xi ) 2 i [m],i [n] wi k(Xi, xi ) + P i,i [n] wiwi k(xi, xi ). Since each Xi is distributed to Sw n and Xi and Xi are independent for i = i , taking expectation, we have E[MMD2 k(Sw n , Sw n )] = 1 m Ek(X, X) + m 1 m Ek(X, X ) 2Ek(X, X ) + Ek(X, X ). This gives the bound (40). Proof of Prop. E.1(b). Let Rj x Ij. As random signed measures, we have Sw n Sw n = P j [r] δRj P i [n] wiδxi j [r] δRj P i [n] wi mwi j [r] δRj r i [n] ηiδxi. MMD2 k(Sw n , Sw n ) = ((Sw n Sw n ) (Sw n Sw n ))k j,j [r] k(Rj, Rj ) 2r j [r],i [n] ηik(Rj, xi) + r2 i,i [n] ηiηjk(xi, xj). (43) Since each Rj is distributed to Sη n and Rj and Rj are independent for j = j , taking expectation, we have E[MMD2 k(Sw n , Sw n )] = r m2 Ek(R, R) + r(r 1) m2 Ek(R, R ) 2r2 m2 Ek(R, R ) + r2 m2 Ek(R, R ). This gives the bound (41). Debiased Distribution Compression Proof of Prop. E.1(c). We repeat the same steps from the previous part of the proof to get (43). In the case of (c), Rj s are not identically distributed so the analysis is different. Let R be an independent copy of R. Taking expectation of (43), we have m2E[MMD2 k(Sw n , Sw n )] = P j [r] Ek(Rj, Rj) + P j [r] P j [r]\{j} Ek(Rj, Rj ) 2r P j [r] Ek(Rj, R) + r2Ek(R, R ). j [r] Ek(Rj, Rj) = P r ) k(x Dinv η (u), x Dinv η (u))du = r R 1 0 k(x Dinv η (u), x Dinv η (u))du = r Ek(R, R), where we used the fact that x Dinv η (U) D= R for U Uniform([0, 1]). 
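For reference, a NumPy sketch of the stratified residual resampler of Alg. E.3 is given below (the inverse CDF D_eta^inv is realized with searchsorted; the Dirichlet example weights are purely illustrative).

```python
import numpy as np

def stratified_residual_resample(w, m, rng=None):
    """Stratified residual resampling (Alg. E.3): keep floor(m * w_i) copies of
    each point deterministically, then fill the r remaining slots by stratified
    sampling from the residual distribution eta."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(w, dtype=float)
    counts = np.floor(m * w).astype(int)                    # deterministic part
    r = m - counts.sum()                                    # number of residual draws
    w_out = counts / m
    if r > 0:
        eta = (m * w - counts) / r                          # residual simplex weights
        cdf = np.cumsum(eta)
        u = (np.arange(r) + rng.random(r)) / r              # one uniform per stratum
        idx = np.minimum(np.searchsorted(cdf, u), len(w) - 1)   # D_eta^inv(u_j)
        np.add.at(w_out, idx, 1 / m)
    return w_out                                            # entries are multiples of 1/m

# Example: reduce a simplex weighting of n = 1000 points to m = 100 equal-weighted points
rng = np.random.default_rng(2)
w = rng.dirichlet(np.ones(1000))
w_eq = stratified_residual_resample(w, m=100, rng=rng)
print(np.count_nonzero(w_eq), w_eq.sum())
```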
Similarly, we deduce P j [r] P j [r]\{j} Ek(Rj, Rj ) = P j [r] P j [r] Ek(Rj, R j ) Ek(Rj, R j) j [r](r Ek(Rj, R ) Ek(Rj, R j)) = r2Ek(R, R ) P j [r] Ek(Rj, R j), j [r] Ek(Rj, R) = r Ek(R, R ). Combining terms, we get m2E MMD2 k(Sw n , Sw n ) = r Ek(R, R) + r2Ek(R, R ) P j [r] Ek(Rj, R j) 2r2Ek(R, R ) + r2Ek(R, R ) = r Ek(R, R) P j [r] Ek(Rj, R j), which yields the desired bound (42). The next proposition shows that stratifying the residuals always improves upon using i.i.d. sampling or residual resampling alone. We need the following convexity lemma. Lemma E.1 (Convexity of squared MMD). Let k be a kernel. Let Sn = (x1, . . . , xn) be an arbitrary set of points. The function Ek : Rn R defined by Ek(w) Sw n k 2 Hk = P i,j [n] wiwjk(xi, xj) is convex on Rn. Proof of Lem. E.1. Since k is a kernel, the Hessian 2Ek = 2k(Sn, Sn) is PSD, and hence Ek is convex. Proposition E.2 (Stratified residual resampling improves MMD). Under the assumptions of Prop. E.1, we have E[MMD2 k(Swi.i.d. n , Sw n )] E[MMD2 k(Swresid n , Sw n )] E[MMD2 k(Swsr n , Sw n )]. Proof of Prop. E.2. Let K k(Sn, Sn). To show the first inequality, note that since η = mw mw r , by Prop. E.1, E[MMD2 k(Swresid n , Sw n )] = r(Ek(R,R) Ek(R,R )) i [n] Kiiηi P i,j [n] Kijηiηj) m2 i [n] Kii wi mwi E[MMD2 k(Swi.i.d. n , Sw n )] E[MMD2 k(Swresid n , Sw n )] i [n] Kii mwi i [n] Kiiξi + θη Kη w Kw , Debiased Distribution Compression where we let ξ m m r mw m. Note that w = θη + (1 θ)ξ. By Lem. E.1 and Jensen s inequality, we have w Kw = Ek(w) θEk(η) + (1 θ)Ek(ξ) = θη Kη + (1 θ)ξ Kξ θη Kη + (1 θ) P i [n] Kiiξi, where the last inequality follows from Prop. E.1(a) with w = ξ and the fact that MMD is nonnegative. Hence we have shown E[MMD2 k(Swi.i.d. n , Sw n )] E[MMD2 k(Swresid n , Sw n )] 0, as desired. For the second inequality, by Prop. E.1, we compute E[MMD2 k(Swresid n , Sw n )] E[MMD2 k(Swsr n , Sw n )] = r m2 1 r P j [r] Ek(Rj, R j) Ek(R, R ) . Ek(R, R ) = R [0,1) k(x Dinv η (u), x Dinv η (v))dudv = Ek (Dinv η )#Uniform[0, 1) , where we used T#µ to denote the pushforward measure of µ by T. Similarly, j [r] Ek(Rj, R j) = 1 r ) k(x Dinv η (u), x Dinv η (v))dudv j [r] Ek (Dinv η )#Uniform j Ek (Dinv η )#Uniform[0, 1) = Ek(R, R ), where in the last inequality we applied Jensen s inequality since Ek is convex by Lem. E.1. Hence we have shown E[MMD2 k(Swresid n , Sw n )] E[MMD2 k(Swsr n , Sw n )] 0 and the proof is complete. F. Accelerated Debiased Compression In this section, we provide supplementary algorithmic details and deferred analyses for LSKT (Alg. 3). In Weighted RPCholesky (Alg. F.1), we provide details for the weighted extension of Chen et al. (2022, Alg. 2.1) that is used extensively in our algorithms. The details of AMD (Wang et al., 2023, Alg. 14) are provided in Alg. F.2. In App. F.1, we give the proof of Thm. 4 for the MMD error guarantee of LD (Alg. 2). In App. F.2, we provide details on KT-Compress++ modified from Compress++ (Shetty et al., 2022) to minimize MMD to P. Finally, Thm. 5 is proved in App. F.3. Algorithm F.1 Weighted Randomly Pivoted Cholesky (Weighted RPCholesky) (extension of Chen et al. (2022, Alg. 
2.1)) Input: kernel k, points Sn = (xi)n i=1, simplex weights w n 1, rank r k(i, j) k(xi, xj) wi wj reweighted kernel matrix function F 0n r, S {}, d ( k(i, i))i [n] for i = 1 to r do Sample s d/ P j [n] dj S S {s} g k(:, s) F(:, 1 : i 1)F(s, 1 : i 1) F(:, i) g/ gs d d F(:, i)2 F(:, i)2 denotes a vector with entry-wise squared values of F(:, i) d max(d, 0) numerical stability fix, helpful in practice end for F diag((1/ wi)i [n])F undo weighting; treat 1/ wi = 0 if wi = 0 Return: S [n] with |S| = r and F Rn r Debiased Distribution Compression Algorithm F.2 Accelerated Entropic Mirror Descent (AMD) (modification of Wang et al. (2023, Alg. 14)) Input: kernel matrix K Rn n, number of steps T, initial weight w0 n 1, aggressive flag AGG η 1 8w 0 diag(K) if AGG else 1 8 maxi [n] Kii v0 w0 for t = 1 to T do βt 2 t+1 zt (1 βt)wt 1 + βtvt 1 g 2tηKzt this is γt f(zt) in Wang et al. (2023, Alg. 14) for f(w) = w Kw vt vt 1 exp( g) component-wise exponentiation and multiplication vt vt/ vt 1 vt = argminw n 1 g, w + Dϕ vt 1(w) for ϕ(w) = P i [n] wi log wi wt (1 βt)wt 1 + βtvt end for Return: w T n 1 F.1. Proof of Thm. 4: Debiasing guarantee for LD We start with a useful lemma that bounds w (K ˆK)w by tr K ˆK for any simplex weights w. Lemma F.1. For any PSD matrix A Rn n and w n 1, we have w Aw tr(Aw) maxi [n] Aii λ1(A), where λ1(A) denotes the largest eigenvalue of A. Proof of Lem. F.1. Note that w Aw = w diag( w)A diag( w) w = w Aw w. The condition that w n 1 implies w 2 = 1, so that w Aw w λ1(Aw) tr(Aw). To see tr(Aw) max i [n]Aii, note that tr(Aw) = P i [n] Aiiwi maxi [n] Aii since w n 1.. Since λ1(A) = supx: x 2=1 x Ax, if we let i argmini [n] Aii, then the simplex weight with 1 on the i -th entry has two-norm 1, so we see that maxi [n] Aii λ1(A). Our next lemma bounds the suboptimality of surrogate optimization of a low-rank plus diagonal approximation of K. Lemma F.2 (Suboptimality of surrogate optimization). Let k P be a kernel satisfying Assum. 1. Let Sn = (x1, . . . , xn) Rd be a sequence of points. Define K k P(Sn, Sn) Rn n. Suppose b K Rn n is another PSD matrix such that K b K. Define D diag(K b K), the diagonal part of K b K, and form K b K + D. Let w argminw n 1 w K w . Then for any w n 1, MMD2 k P(Sw n , P) MMD2 k P(Sw OPT n , P) + tr (K b K)w + maxi [n](K b K)ii + (w K w w K w ). (44) Proof of Lem. F.2. Since K = K + (K b K) D by construction, we have w Kw = w K w + w (K b K)w w Dw w K w + w (K b K)w = (w K w w K w ) + w K w + w (K b K)w (w K w w K w ) + w K w + tr (K b K)w , Debiased Distribution Compression where we used the fact that D 0 and Lem. F.1. Next, by the definition of w , we have w K w (w OPT) K w OPT = (w OPT) (K K)w OPT + (w OPT) Kw OPT = (w OPT) (D (K b K))w OPT + (w OPT) Kw OPT (w OPT) Dw OPT + (w OPT) Kw OPT maxi [n](K b K)ii + (w OPT) Kw OPT, where we used the fact K b K in the penultimate step and Lem. F.1 in the last step. Hence we have shown our claim. Lem. F.2 shows that to control MMD2 k P(Sw n , P), it suffices to separately control the approximation error in terms of tr K ˆK and the optimization error (w K w w K w ). The next result establishes that using Weighted RPC- holesky, we can obtain polynomial and exponential decay bounds for tr K ˆK in expectation depending on the kernel growth of k P. Proposition F.1 (Approximation error of Weighted RPCholesky). Let k be a kernel satisfying Assum. (α, β)-kernel. Let S be an infinite sequence of points in Rd. For any w n 1, let F be the low-rank approximation factor output by Weighted RPCholesky(k, Sn, w, r). 
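Prop. F.1 below quantifies the approximation error of this routine; for reference, here is a minimal NumPy sketch of Alg. F.1 under the assumption that kernel(X, Y) returns the Gram matrix between the rows of X and Y (this callable interface and the variable names are ours, and the sketch ignores the degenerate case of a vanishing residual diagonal).

```python
import numpy as np

def weighted_rpcholesky(kernel, X, w, r, rng):
    # Alg. F.1: run RPCholesky on the reweighted matrix A_ij = k(x_i, x_j) sqrt(w_i w_j),
    # then undo the weighting so that F F^T approximates k(S_n, S_n) itself.
    n = X.shape[0]
    sqrt_w = np.sqrt(w)
    F = np.zeros((n, r))
    pivots = []
    diag_k = np.array([kernel(X[i:i + 1], X[i:i + 1])[0, 0] for i in range(n)])
    d = diag_k * w                                   # residual diagonal of A
    for i in range(r):
        s = rng.choice(n, p=d / d.sum())             # sample pivot proportional to residual diagonal
        pivots.append(s)
        col = kernel(X, X[s:s + 1])[:, 0] * sqrt_w * sqrt_w[s]   # column A[:, s]
        g = col - F[:, :i] @ F[s, :i]
        F[:, i] = g / np.sqrt(g[s])
        d = np.maximum(d - F[:, i] ** 2, 0.0)        # numerical-stability clip, as in Alg. F.1
    inv_sqrt_w = np.divide(1.0, sqrt_w, out=np.zeros_like(sqrt_w), where=sqrt_w > 0)
    return np.array(pivots), F * inv_sqrt_w[:, None]
```

In LD, this routine is called with the Stein kernel k_P and the current (resampled) simplex weights.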
Define K k(Sn, Sn). If r ( Cd Rβ n+1 log 2 + log 2)2 1 log 2, then, with the expectation taken over the randomness in Weighted RPCholesky, E tr (K FF )w Hn,r, (45) where Hn,r is defined as ( 8 Pn ℓ=U(r)( Lk(Rn) ℓ ) 2 α POLYGROWTH(α, β), 8 Pn ℓ=U(r) exp 1 ( ℓ Lk(Rn)) 1 α LOGGROWTH(α, β), (46) for Lk defined in (6) and r+ 1 log 2 log 2 1 log 2 Moreover, Hn,r satisfies the bounds in Thm. 4. Proof of Prop. F.1. Recall the notation Lk(Rn) = Cd Rβ n log 2 from (6). Define q U(r) so that q is the biggest integer for which r 2q + q2 log 2. The lower bound assumption of r is chosen such that q > Lk(Rn) > 0. By Chen et al. (2022, Theorem 3.1) with ϵ = 1, we have E tr (K FF )w 2 Pn ℓ=q+1 λℓ(Kw). (48) Since q > Lk(Rn), we can apply Cor. B.1 to bound λℓ(Kw) for ℓ q + 1 and obtain (45) since Hn,r (46) is constructed to match the bounds when applying Cor. B.1 to (48). It remains to justify the bounds for Hn,r in Thm. 4. If k is POLYGROWTH(α, β), by Assum. (α, β)-kernel we have α < 2. Hence Hn,r = 8 Pn ℓ=q Lk(Rn) α 8Lk(Rn) 2 α R q 1 ℓ 2 α dℓ= 8Lk(Rn) 2 α (q 1)1 2 α = O r( R2β n r ) 1 α , where we used the fact that R q 1 ℓ 2 α dℓ= (q 1)1 2 α for α < 2, Lk(Rn) = O(Rβ n), and q = Θ( r). If k is LOGGROWTH(α, β), then Hn,r = 8 Pn ℓ=q exp 1 ℓ Lk(Rn) 1 α = 8e Pn ℓ=q cℓ1/α 8e R ℓ=q 1 cℓ1/α, Debiased Distribution Compression where c exp Lk(Rn) 1/α (0, 1). Defining m log c > 0 and q = q 1, we have R x=q cx1/αdx = R x=q exp mx1/α dx = αq (mq 1/α) αΓ(α, mq 1/α) = αm αΓ(α, mq 1/α), (49) where Γ(α, x) R x tα 1e tdt is the incomplete gamma function. Since α > 0, by Pinelis (2020, Thm. 1.1), we have Γ(α, mq 1/α) (mq 1/α+b)α (mq 1/α)α αb e mq 1/α, where b is a known constant depending only on α. By the equivalence of norms on R2, there exists Cα > 0 such that (x + y)α Cα(xα + yα) for any x, y > 0. Hence Γ(α, mq 1/α) (mq 1/α+b)α αb e mq 1/α Cα(mαq +bα) αb e mq 1/α. Hence from (49) we deduce P ℓ=q cℓ1/α Cα(q b 1 + bα 1m α)e mq 1/α. (50) Since m = log c = Lk(Rn) 1/α, we can bound the exponent by mq 1/α = (Lk(Rn) 1q )1/α = ( q log 2 Cd Rβ n )1/α ( 0.83 r 2.39 Cd Rβ n )1/α, where we used the fact that q log 2 = (q 1) log 2 ( r+ 1 log 2 log 2 1 log 2 2) log 2 0.83 r 2.39. On the other hand, since q = q 1 L(Rn) = m α, we can absorb the bα 1m 1 term in (50) into q and finally obtain the bounds for Hn,r in Thm. 4. The last piece of our analysis involves bounding the optimization error (w K w w K w ) in (44). Lemma F.3 (AMD guarantee for debiasing). Let K Rn n be an SPSD matrix. Let f(w) w Kw. Then the final iterate x T of Nesterov s 1-memory method (Wang et al., 2023, Algorithm 14) after T steps with objective function f(w), norm = 1, distance-generating function ϕ(x) = Pn i=1 xi log xi, and initial point w0 = ( 1 n, . . . , 1 n) n 1 satisfies f(w T ) f(w OPT) 16 log n maxi [n] Kii where w OPT argminx Rn f(x). Proof of Lem. F.3. We apply Wang et al. (2023, Theorem 14). Hence it remains to determine the smoothness constant L > 0 such that, for all x, y n 1, f(x) f(y) L x y 1 , and an upper bound for the Bregman divergence Dϕ w0(w OPT) = Pn i=1 w OPTi log w OPTi (w0)i = Pn i=1 w OPTi log nw OPTi. To determine L, note f(w) = 2Kw, so we have, for any x, y n 1, f(x) f(y) = 2 K(x y) = 2 maxi [n] |Ki,:(x y)| 2 maxi [n] Ki,: x y 1 = 2 maxi [n] Kii x y 1 = 2 maxi [n] Kii x y 1 , where we used the fact that the largest entry in an SPSD matrix appears on its diagonal. Thus we can take the smoothness constant to be L = 2 maxi [n] Kii. 
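For reference alongside Lem. F.3, here is a minimal NumPy transcription of the AMD iteration of Alg. F.2 (the argument names are ours); the proof of Lem. F.3 resumes below with the Bregman-divergence bound.

```python
import numpy as np

def amd(K, T, w0, aggressive=False):
    # Accelerated entropic mirror descent on f(w) = w^T K w over the simplex (Alg. F.2).
    w0 = np.asarray(w0, dtype=float)
    eta = 1.0 / (8.0 * (w0 @ np.diag(K))) if aggressive else 1.0 / (8.0 * np.max(np.diag(K)))
    w, v = w0.copy(), w0.copy()
    for t in range(1, T + 1):
        beta = 2.0 / (t + 1)
        z = (1 - beta) * w + beta * v
        g = 2 * t * eta * (K @ z)     # gamma_t * grad f(z_t) with grad f(z) = 2 K z
        v = v * np.exp(-g)            # entropic mirror step
        v = v / v.sum()
        w = (1 - beta) * w + beta * v
    return w
```

Note that only matrix-vector products with K are needed, so when K is replaced by a low-rank-plus-diagonal surrogate each step costs O(nr).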
To bound Dϕ w0(w OPT), note that by Jensen s inequality, Dϕ w0(w) = Pn i=1 wi log nwi log Pn i=1 nw2 i = log n + log w 2 2 log n, where we used the fact that w 2 2 w 1 = 1 for w n 1. Debiased Distribution Compression With these tools in hand, we turn to the proof of Thm. 4. For the runtime of LD, it follows from the fact that Weighted RPCholesky takes O((dk P + r)nr) time and one step of AMD takes O(nr) time. The error analysis is different for the first adaptive iteration and the ensuing adaptive iterations. Roughly speaking, we will show that the first adaptive iteration brings the MMD gap to the desired level, while the ensuing iterations do not introduce an excessive amount of error. Step 1. Bound MMDk P(w(1)) Let K k P(Sn, Sn) and F denote the low-rank approximation factor generated by Weighted RPCholesky. Denote b K FF . Then K = b K + diag(K b K). First, note that since w(0) = ( 1 n, . . . , 1 n), Resample returns w = w(0) with probability one. By Lem. F.2, we have, using b for a, b 0 repeatedly and Lem. F.1 that tr (K b K)w λ1(K b K) and maxi [n](K b K)ii λ1(K b K), MMDk P(Sw(1) n , P) MMDk P(Sw OPT n , P) + q 2λ1(K b K) + q w(1) K w(1) w K w MMDk P(Sw OPT n , P) + q 2λ1(K b K) + q 16 log n k P n where we applied Lem. F.1 and Lem. F.3 in the last inequality. Fix δ (0, 1). By Markov s inequality, we have λ1(K b K) > q E[λ1(K b K)] δ This means that with probability at least 1 δ, we have MMDk P(Sw(1) n , P) MMDk P(Sw OPT n , P) + q 2E[λ1(K b K)] δ + q 16 log n k P n Note that the lower bound condition on r in Assum. (α,β)-params implies the lower bound condition in Prop. F.1. Hence, by Prop. F.1 with w = ( 1 n, . . . , 1 n) and using the identity λ1(K b K) tr K b K while noting that a factor of n appears, we have MMDk P(Sw(1) n , P) MMDk P(Sw OPT n , P) + q 16 k P n log n Step 2. Bound the error of the remaining iterations Fix δ > 0. The previous step shows that, with probability at least 1 δ MMDk P(Sw(1) n , P) MMDk P(Sw OPT n , P) + q 16 k P n log n Fix q > 1, and let w be the resampled weight defined in the q-th iteration in Alg. 2. Without loss of generality, we assume wi > 0 for all i > 0, since if wi = 0 then index i is irrelevant for the rest of the algorithm. Thus, thanks to Resample, we have wi 1/n for all i [n]. Let a/b denote the entry-wise division between two vectors. As in the previous step of the proof, we let K k P(Sn, Sn), F be the low-rank factor output by Weighted RPCholesky(k P, Sn, w, r), and b K = FF . For any w n 1, recall the notation Kw diag( w)K diag( w). Then we have w) diag(K w)(w/ w) b K diag( w)) + diag( w)(K b K) diag( = w b Kw + (w/ w)(K b K) diag( w b Kw + maxi [n](1/ wi) tr diag( w)(K b K) diag( w b Kw + n tr (K b K) w . (51) K = b K + diag(K b K) = K + ( b K K) + diag(K b K). Debiased Distribution Compression Since K b K, we have w(q) b Kw(q) w(q) K w(q) w K w, (52) where the last inequality follows from the if conditioning at the end of Alg. 2. In addition, w K w = w (K + ( b K K) + diag(K b K)) w w K w + w diag(K b K) w w diag((K b K) w) w K w + tr (K b K) w , where we used the fact that K b K and w 2 = 1. Plugging the previous inequality into (52) and then into (51) with w = w(q), we get w(q) Kw(q) w K w + (n + 1) tr (K b K) w . (53) Taking square-root on both sides using b for a, b 0 and the triangle inequality, we get MMDk P(Sw(q) n , P) MMDk P(S w n , P) + r (n + 1) tr (K b K) w MMDk P(Sw(q 1) n , P) + MMDk P(Sw(q 1) n , S w n ) + r (n + 1) tr (K b K) w . 
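To fix ideas, the following sketch assembles the subroutines referenced in this argument (uniform initialization, Resample, Weighted RPCholesky, AMD, and the accept-only-if-improved check) into a debiasing loop in the spirit of LD (Alg. 2). It reflects our reading of the steps used in this proof rather than a verbatim transcription of Alg. 2, reuses the residual_resample, weighted_rpcholesky, and amd sketches above, and forms the surrogate densely for clarity even though the actual algorithm exploits its low-rank-plus-diagonal structure for O(nr) matvecs.

```python
import numpy as np

def low_rank_debias(kernel, X, r, Q, T, rng):
    # One plausible arrangement of the steps used in this proof: start from uniform
    # weights, then alternate resampling, low-rank surrogate construction, and AMD.
    n = X.shape[0]
    k_diag = np.array([kernel(X[i:i + 1], X[i:i + 1])[0, 0] for i in range(n)])
    w = np.full(n, 1.0 / n)
    for q in range(1, Q + 1):
        w_tilde = w if q == 1 else residual_resample(w, n, rng, stratified=True)
        _, F = weighted_rpcholesky(kernel, X, w_tilde, r, rng)
        K_hat = F @ F.T
        # Surrogate K' = K_hat + diag(K - K_hat), built densely here for clarity only.
        K_prime = K_hat + np.diag(k_diag - np.diag(K_hat))
        w_new = amd(K_prime, T, w_tilde, aggressive=(q > 1))
        # Accept the new weights only if the surrogate objective does not increase.
        w = w_new if w_new @ K_prime @ w_new <= w_tilde @ K_prime @ w_tilde else w_tilde
    return w
```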
By Markov s inequality, we have MMDk P(Sw(q 1) n , S w n ) > 4QE h MMD2 k P(Sw(q 1) n ,S w n ) i tr (K b K) w > q 4QE[tr((K b K) w)] δ By Prop. E.1(c), we have E h MMD2 k P(Sw(q 1) n , S w n ) i = E h E h MMD2 k P(Sw(q 1) n , S w n ) w(q 1)ii k P n Thus by the union bound, with probability at least 1 δ 2Q, using Prop. F.1 (recall low-rank approximation b K is obtained using w), we have MMDk P(Sw(q) n , P) MMDk P(Sw(q 1) n , P) + q 4Q(n+1)Hn,r Finally, applying union bound and summing up the bounds for q = 1, . . . , Q, we get, with probability at least 1 δ, MMDk P(w(q)) q 16 k P n log n T 2 + (Q 1) q 4Q(n+1)Hn,r This matches the stated asymptotic bound in Thm. 4. F.2. Thinning with KT-Compress++ For compression with target distribution P, we modify the original KT-Compress++ algorithm of (Shetty et al., 2022, Ex. 6): in HALVE and THIN of Compress++, we use KT-SPLIT with kernel k P without KT-SWAP (so our version of Compress++ outputs 2g coresets, each of size n), followed by KT-Swap to obtain a size n coreset. We call the resulting thinning algorithm KT-Compress++. We show in Lem. F.4 and Cor. F.1 that KT-Compress++ satisfies an MMD guarantee similar to that of quadratic-time kernel thinning. Debiased Distribution Compression Algorithm F.3 KT-Compress++ (modified Shetty et al. (2022, Alg. 2) to minimize MMD to P) Input: kernel k P with zero-mean under P, input points Sn = (xi)i [n], multiplicity n with n 4N, weight w n 1 ( N0 n )n, thinning parameter g, failure probability δ S index sequence where k [n] appears n wk times (I(ℓ))ℓ [2g] Compress++(g, Sn[S]) Shetty et al. (2022, Ex. 6) with KT substituted with KT-SPLIT in HALVE and THIN. I(ℓ) S[I(ℓ)] for each ℓ [2g] I KT-Swap(k P, Sn, (I(ℓ))ℓ [2g]) w C++ simplex weights corresponding to I wi = number of occurrences of i in I |I| Return: w C++ n 1 ( N0 n)n Hence w C++ 0 n Lemma F.4 (Sub-gaussian guarantee for Compress++). Let Sn be a sequence of n points with n 4N. For any δ (0, 1) and integer g log2 log(n + 1) + 3.1 , consider the Compress++ algorithm (Shetty et al., 2022, Algorithm 2) with thinning parameter g, halving algorithm HALVE(k) symmetrized7(KT-SPLIT(k, , 1, n2 k 4n2g(g+(βn+1)2g)δ)) for an input of nk 2g+1+k n points and βn log2 n n0 , and with thinning algorithm THIN KT-SPLIT(k, , g, g g+(βn+1)2g δ). Then this instantiation of Compress++ compresses Sn to 2g coresets (S(i) out)i [2g] of n points each. Denote the signed measure ϕ(i) 1 n P x Sn δx 1 n P x S(i) out δx. Then for each i [2g], on an event E(i) equi with Pr E(i) equi 1 δ ϕ(i) = ϕ(i) for a random signed measure ϕ(i) such that, for any δ (0, 1), Pr ϕ(i)k Hk a n 1 + q 8 3 k n log 6 n(g+( log2 n log 4Nk B2(Rn), n 1/2 ! Proof of Lem. F.4. This proof is similar to the one for Shetty et al. (2022, Ex. 6) but with explicit constant tracking and is self-contained, invoking only Shetty et al. (2022, Thm. 4) which gives MMD guarantees for Compress++ given the sub-Gaussian parameters of HALVE and THIN. Recall that nk is the number of input points for the halving subroutine at recursion level k in Compress++, and βn is the total number of recursion levels. Let SC denote the output of COMPRESS (Shetty et al., 2022, Alg. 1) of size 2g n. Fix δ, δ (0, 1). Suppose we use HALVE(k) symmetrized(KT-SPLIT(k, , 1, γkδ)) for an input of nk points for γk to be determined. Suppose we use THIN KT-SPLIT(k, , g, γ δ) for γ to be determined; this is the kernel thinning stage that thins 2g n points to 2g coresets, each with n points. 
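As a small concrete aid for the first step of Alg. F.3, which builds the index sequence S from the rational weights, here is a minimal NumPy sketch assuming each n'·w_k is an integer, as the input format requires (the function name is ours).

```python
import numpy as np

def expand_by_multiplicity(w, n_prime):
    # Index sequence in which each k appears n' * w_k times (first step of Alg. F.3).
    counts = np.rint(n_prime * np.asarray(w)).astype(int)
    return np.repeat(np.arange(len(w)), counts)
```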
Since the analysis is the same for all coresets, we will fix an arbitrary coreset without superscript in the notation. By Lem. D.1, with notation t log 1 δ , there exist events Ek,j, ET, and random signed measures ϕk,j, ϕk,j, ϕT, ϕT for 0 k βn and j [4k] such that (a) Pr Ec k,j γkδ 2 and Pr(Ec T) γ δ (b) 1Ek,jϕk,j = 1Ek,j ϕk,j and 1ETϕT = 1ET ϕT, (c) We have Pr ϕk,jk Hk ank + vnk t { ϕk ,j }k >k,j 1, { ϕk ,j }k ,j k,j 1, { ϕk ,j }k ,j 1. Thus by (51) with w = w , K = k P(Sn, Sn), w Kw w FF w + n tr (K FF ) w , where K = k P(Sn, Sn). By construction of w using Recombination, we have F w = F w . Since K FF , we have w Kw w FF w + n tr (K FF ) w w K w + n tr (K FF ) w . We recognize the right-hand side is precisely the right-hand side of (53) aside from having a multiplier of n instead of n + 1 in front of the trace and that F is rank m 1. Now applying (54) with Q = 1 2, w(q) = w , w(q 1) = w, r = m 1, and noticing that KT-Swap-LS and the quadratic-programming solve at the end cannot decrease the objective, we obtain (60) with probability at least 1 δ. Note that the lower bound of m in Assum. (α,β)-params makes r = m 1 satisfy the lower bound for r in Prop. F.1. Debiased Distribution Compression Algorithm G.1 Recombination (rephrasing of Tchernychova (2016, Alg. 1) that takes O(m3 log n) time) Input: matrix A Rm n with m < n and one row of A all positive, a nonnegative vector x0 Rn 0. function Find BFS(A, x0) The requirement of A and x0 are the same as the input. This subroutine takes O(n3) time. x x0 U, S, V SVD(A) any O(n3)-time SVD algorithm that gives USV = A V (V )m+1:n V R(n m) n so that the null space of A is spanned by the rows of V for i = 1 to n m do v Vi k argminj [n]:vj>0 xj vj This must succeed because Av = 0 and A has an all-positive row, so one of the coordinates of v must be positive. x x xk vk v This zeros out the k-th coordinate of x while still ensuring x is nonnegative. for j = i + 1 to n m do vk v {Vj}n m j=i+1 remain independent and have 0 on the k-th coordinate. end for end for return: x Rn 0 such that Ax = Ax0 and x 0 m. end function x x0 while x 0 > 2m do Divide {i [n] : xi > 0} into 2m index blocks I1, . . . , I2m, each of size at most j x 0 Ai A:,Iix Ii Rm, i [2m] Form ˆA to be the m 2m matrix with columns Ai Hence, one row of A contains all positive entries. ˆx Find FBS( ˆA, 12m) ˆx 0 n and ˆAˆx = P Aiˆxi = P Ai = P A:,Iix Ii = Ax. for i = 1 to 2m do x Ii ˆxi x Ii if ˆxi > 0 else 0 end for After the update, the support of x shrinks by 2 while it maintains that Ax = Ax0. end while if x 0 m + 1 then I {i [n] : xi > 0} x I = Find BFS(A:,I, x I) end if Return: x Rn 0 such that Ax = Ax0 and x 0 m. G.2. Proof of Thm. 6: MMD guarantee for SR/LSR The claimed runtime follows from the runtime of Stein Thinning (Alg. D.1) or LD (Thm. 4) plus the runtime of RT (Prop. G.1). Note the lower bound for m in Assum. (α,β)-params implies the lower bound condition in Prop. G.1. For the case of SR, we proceed as in the proof of Thm. 3 and use Prop. G.1. For the case of LSR, we proceed as in the proof of Thm. 5 and use Thm. 4 and Prop. G.1. H. Constant-Preserving Debiased Compression In this section, we provide deferred analyses for CT and SC/LSC. H.1. MMD guarantee for CT Proposition H.1 (CT guarantee). Under Assums. 
1 and (α, β)-kernel, given w n 1 and m ( Cd Rβ n+1 log 2 + 2 log 2)2 1 log 2, CT outputs w CT Rn with 1 n w CT = 1 and w CT 0 m in O((dk P + m)nm + m3) time such that, for any Debiased Distribution Compression Algorithm G.2 KT-Swap with Linear Search (KT-Swap-LS) Input: kernel k P with zero-mean under P, input points Sn = (xi)i [n], weights w n 1, fmt {SPLX, CP} S {i [n] : wi = 0} Maintain two sufficient statistics: g = Kw and D = w Kw. function Add(g, D, i, t) g g + tk P(Sn, xi) D D + 2tgi + t2k P(xi, xi) return: (g, D) end function function Scale(g, D, α) g αg D α2D return: (g, D) end function Kdiag k P(Sn, Sn) g 0 Rn D 0 for i in S do (g, D) Add(g, D, i, wi) end for for i in S do if wi = 1 then continue; We cannot swap i out if P j =i wj = 0! First zero out wi. (g, D) Add(g, D, i, wi) (g, D) Scale(g, D, 1 1 wi ) wi = 0 Next perform line search to add back a point. α = (D g)./(D 2g + Kdiag); αi = argmint MMD2 k P(Stei+(1 t)w n , P) = argmint(1 t)2D + 2t(1 t)g + t2Kdiag if fmt = SPLX then α = clip(α, 0, 1); Clipping α to [0, 1]. This corresponds to argmint [0,1] MMD2 k P(Stei+(1 t)w n , P). end if D (1 α)2D + 2α(1 α)g + α2Kdiag multiplications are element-wise k argmini D i (g, D) Scale(g, D, 1 αk) (g, D) Add(g, D, k, αk) end for Return: w n 1 δ (0, 1), with probability 1 δ, MMDk P(Sw CT n , P) 2 MMDk P(Sw n , P) + q where Hn,m is defined in (46) and m m + log 2 2 m log 2 + 1. Proof of Prop. H.1. The runtime follows from the O((dk P +m)nm) runtime of Weighted RPCholesky, the O(nm) runtime of KT-Swap-LS, and the O(m3) runtime of matrix inversion in solving the two minimization problems using (64). To improve the clarity of notation, we will use w to denote the input weight w to CT. For index sequences I, J [n] and a kernel k, we use k(I, J) to indicate the matrix k(Sn[I], Sn[J]) = [k(xi, xj)]i I,j J, and similarly for a function f : Rn R, we use f(I) to denote the vector (f(xi))i I. Recall the regularized kernel is kc k P + c. Suppose for now that c > 0 is an arbitrary constant. Let I denote the indices Debiased Distribution Compression output by Weighted RPCholesky in CT. Let wc argminw:supp(w) I MMD2 kc(Sw n , Sw n ). Note that wc is not a probability vector and may not sum to 1. Step 1. Bound MMD2 kc(Swc n , Sw n ) in terms of Weighted RPCholesky approximation error We start by using an argument similar to that of Epperly and Moreno (2024, Prop. 3) to exploit the optimality condition of wc. Since argminw:supp(w) I MMD2 kc(Sw n , Sw n ) = argminw:supp(w) I w I kc(I, I)w I 2w kc(Sn, I)w I, by optimality, wc satisfies, kc(I, I)wc I = Sw n kc(I). We comment that the index sequence I returned by Weighted RPCholesky makes kc(I, I) invertible with probability 1: by the Guttman rank additivity formula of Schur complement (Zhang, 2006, Eq. (6.0.4)), each iteration of Weighted RPCholesky chooses a pivot with a non-zero diagonal and thus increases the rank of the low-rank approximation matrix, which is spanned by the columns of pivots, by 1. Hence Swc n kc( ) = kc( , Sn)wc = kc( , I)wc I = kc( , I)kc(I, I) 1kc(I, I)wc I = kc( , I)kc(I, I) 1Sw n kc(I) = Sw n kc I( ), where kc I(x, y) kc(x, I)kc(I, I) 1kc(I, y). Then MMD2 kc(Swc n , Sw n ) = Sw n kc Swc n kc 2 kc = Sw n kc Sw n kc I 2 kc = w (kc kc I)(Sn, Sn)w . Recall the index set I consists of the m pivots selected by Weighted RPCholesky on the input matrix K c kc(Sn, Sn)w . b K c kc I(Sn, Sn)w . Thus, by Lem. F.1, MMD2 kc(Swc n , Sw n ) = w (kc kc I)(Sn, Sn)w = w (K c b K c ) w λ1(K c b K c ) tr K c b K c . Step 2. 
Bound tr K c b K c using the trace bound of the unregularized kernel Let JAKr denote the best rank-r approximation of an SPSD matrix A Rn n in the sense that JAKr argmin X Rn n X=X A X 0 rank(X) r tr(A X). (61) By the Eckart-Young-Mirsky theorem applied to symmetric matrices (Dax, 2014, Theorem 19), the solution to (61) is given by r-truncated eigenvalue decomposition of A, so that tr(A JAKr) = Pn ℓ=r+1 λℓ(A). Let q U(m) where U is defined in (47), so that by Chen et al. (2022, Thm. 3.1) with ϵ = 1, we have E h tr K c b K c i 2 tr K c JK c Kq . Debiased Distribution Compression We know one specific rank-q approximation of K c : e K c JK Kq 1 + diag( w )c1n1 n diag( which satisfies K c e K c = K + diag( w )c1n1 n diag( w ) e K c = K JK Kq 1. Thus by the variational definition in (61), we have tr K c JK c Kq tr K c e K c = tr K JK Kq 1 = Pn ℓ=q λℓ(K ). Note the last bound does not depend on c. The tail sum of eigenvalues in the last expression is the same (up to a constant multiplier) as the one in (48) except for an off-by-1 difference in the summation index. A simple calculation shows that for m m + log 2 2 m log 2 + 1, we have U(m ) = U(m) 1. Another simple calculation shows that m ( Cd Rβ n+1 log 2 + 2 log 2)2 1 log 2 implies that m satisfies the lower bound requirement of r in Prop. F.1. Thus, arguing as in the proof that follows (48), we get E h tr K c b K c i Hn,m . Thus so far we have shown E[MMD2 kc(Swc n , Sw n )] E h tr K c b K c i Hn,m . By Markov s inequality, with probability at least 1 δ, we have MMDkc(Swc n , Sw n ) q Recall that MMDk(µ, ν) = (µ ν)k k for signed measures µ, ν. By the triangle inequality, we have MMDkc(Swc n , P) MMDkc(Swc n , Sw n ) + MMDkc(Sw n , P) = MMDkc(Swc n , Sw n ) + MMDk P(Sw n , P), where we used that fact that P i [n] w i = 1 to get the identity MMDkc(Sw , P) = MMDk P(Sw , P). Hence, with probability at least 1 δ, MMDkc(Swc n , P) MMDk P(Sw n , P) + q Step 3. Incorporating sum-to-one constraint We now turn wc into a constant-preserving weight while not inflating the MMD by much. Define w1 argminw:supp(w) I,P i [n] wi=1 MMD2 k P(Sw n , P). (63) Note w1 is the weight right before KT-Swap-LS step in CT. Let KI = k P(I, I). Let 1I denote the |I|-dimensional all-one veector. The Karush-Kuhm-Tucker condition (Ghojogh et al., 2021, Sec. 4.7) applied to (63) implies that, the solution w1 is a stationary point of the Lagrangian function L(w I, λ) w I KIw I + λ(1 I w I 1). Then w IL(w1 I, λ) = 0 implies 2KIw1 I λ1I = 0, so w1 I = λK 1 I 1I 2 . The Lagrangian multiplier λ is determined by the constraint 1 I w I = 1, so we find w1 I = K 1 I 1I 1 I K 1 I 1I . (64) Debiased Distribution Compression wc,P argminw:supp(w) I MMD2 kc(Sw n , P). Since wc,P is optimized to minimize MMDkc to P on the same support as wc, we have MMDkc(Swc,P n , P) MMDkc(Swc n , P). The optimality condition for wc,P is (KI + c1I1 I )w c1I = 0, and hence by the Sherman Morrison formula, wc,P I = (KI + c1I1 I ) 1c1I = K 1 I c K 1 I 1I1 I K 1 I 1+c1 I K 1 I 1I c1I = K 1 I 1I 1/c+1 I K 1 I 1I . Let ρc 1 I K 1 I 1I 1/c+1 I K 1 I 1I , so that wc,P I = ρcw1 I. In particular, w1 and wc,P are scalar multiples of one another. To relate MMDk P(Sw1 n , P) and MMDkc(Swc,P n , P), note that MMD2 k P(Sw1 n , P) = w1 I KIw1 I = wc,P I KIwc,P I ρ2c = wc,P I (KI+c1I1 I )wc,P I c(1 I wc I )2 = MMD2 kc(Swc,P n ,P)+2c1 I wc I c c(1 I wc I )2 ρ2c = MMD2 kc(Swc,P n ,P) c(ρc 1)2 So far the argument does not depend on any particular choice of c > 0. Let us now discuss how to choose c. 
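Before doing so, we note that the constrained minimizer (64) is a single linear solve on the selected pivots; a minimal NumPy sketch follows (the function name is ours, and in practice one might add a small jitter to K_I if it is ill-conditioned).

```python
import numpy as np

def constant_preserving_weights(K_I):
    # w^1_I = K_I^{-1} 1_I / (1_I^T K_I^{-1} 1_I): minimizes w_I^T K_I w_I subject to
    # sum(w_I) = 1, as in (64); entries may be negative but always sum to one.
    ones = np.ones(K_I.shape[0])
    z = np.linalg.solve(K_I, ones)
    return z / (ones @ z)
```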
Note that

1_I^⊤ K_I^{-1} 1_I = m (1_I/√m)^⊤ K_I^{-1} (1_I/√m) ≥ m λ_m(K_I^{-1}) = m / λ_1(K_I) ≥ m / tr(K_I) ≥ m / Σ_{i∈[m]} diag(K)↓_i,

where diag(K)↓ denotes the diagonal entries of K = k_P(S_n, S_n) sorted in descending order. Thus

ρ_c = 1 / (1/(c 1_I^⊤ K_I^{-1} 1_I) + 1) ≥ 1 / (Σ_{i∈[m]} diag(K)↓_i / (mc) + 1).

Hence we can choose c to make sure ρ_c is bounded from below by a positive value. Recall that in CT we take c = Σ_{i∈[m]} diag(K)↓_i / m, so that ρ_c ≥ 1/2 and

MMD²_{k_P}(S^{w^1}_n, P) = (MMD²_{k_c}(S^{w^{c,P}}_n, P) − c(ρ_c − 1)²) / ρ_c² ≤ 4 MMD²_{k_c}(S^{w^{c,P}}_n, P).

Hence, by (62) and the fact that KT-Swap-LS and the final reweighting in CT only improve the MMD, we have, with probability at least 1 − δ,

MMD_{k_P}(S^{w_CT}_n, P) ≤ MMD_{k_P}(S^{w^1}_n, P) ≤ 2 MMD_{k_c}(S^{w^{c,P}}_n, P) ≤ 2 MMD_{k_c}(S^{w^c}_n, P) ≤ 2 MMD_{k_P}(S^{w}_n, P) + 2 √(H_{n,m′}/δ),

where we used (62) in the last inequality.

H.2. Proof of Thm. 7: MMD guarantee for SC/LSC

The claimed runtime follows from the runtime of Stein Thinning (Alg. D.1) or LD (Thm. 4) plus the runtime of CT (Prop. H.1). Note that the lower bound for m in Assum. (α,β)-params implies the lower bound condition in Prop. H.1. For the case of SC, we proceed as in the proof of Thm. 3 and use Prop. H.1. For the case of LSC, we proceed as in the proof of Thm. 5 by invoking Thm. 4 and Prop. H.1.

I. Implementation and Experimental Details

In this section, we collect experimental details that were deferred from Sec. 5.

I.1. O(d)-time Stein kernel evaluation

In this section, we show that for S_n = (x_i)_{i∈[n]}, each Stein kernel evaluation k_p(x_i, x_j) for a radially analytic base kernel (Def. B.3) can be done in O(d) time after computing certain sufficient statistics in O(nd² + d³) time. Let M ∈ R^{d×d} be a positive definite preconditioning matrix for k_p, and let L be the Cholesky factor of M, computable in O(d³) time, so that M = LL^⊤. From the expression (15), we can achieve O(d)-time evaluation if we can compute ‖x − y‖²_M and M ∇log p(x) in O(d) time. For M ∇log p(x), we simply precompute M ∇log p(x_i) for all i ∈ [n]. For ‖x − y‖²_M, we have

‖x − y‖²_M = (x − y)^⊤ M^{-1} (x − y) = (x − y)^⊤ (LL^⊤)^{-1} (x − y) = ‖L^{-1}(x − y)‖²_2.

Hence it suffices to precompute L^{-1} x_i for all i ∈ [n], and L^{-1} itself can be precomputed in O(d³) time.
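To illustrate this bookkeeping, here is a minimal NumPy sketch of the precomputation (the score callable returning the n×d matrix of ∇log p(x_i) and the variable names are ours; the base-kernel formula (15) itself is not reproduced here).

```python
import numpy as np
from scipy.linalg import solve_triangular

def precompute_stein_stats(X, score, M):
    # O(n d^2 + d^3) preprocessing: Cholesky factor of M, whitened points L^{-1} x_i,
    # and the preconditioned scores M * grad log p(x_i).
    L = np.linalg.cholesky(M)                          # M = L L^T
    X_white = solve_triangular(L, X.T, lower=True).T   # row i holds L^{-1} x_i
    M_scores = score(X) @ M                            # row i holds M grad log p(x_i) (M symmetric)
    return L, X_white, M_scores

def m_dist_sq(xw_i, xw_j):
    # ||x_i - x_j||_M^2 = ||L^{-1}(x_i - x_j)||_2^2, an O(d) operation per pair.
    diff = xw_i - xw_j
    return diff @ diff
```

Each Stein kernel evaluation k_p(x_i, x_j) then combines m_dist_sq(X_white[i], X_white[j]) with the precomputed rows M_scores[i], M_scores[j] through the formula in (15).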
I.2. Default parameters for algorithms

For LD, we always use Q = 3. To ensure that the guarantees of Lem. F.3 and Thm. 4 hold while achieving fast convergence in practice, we take the step size of AMD to be 1/(8‖k_P‖_n) in the first adaptive round and 1/(8 Σ_{i∈[n]} w^{(q−1)}_i k_P(x_i, x_i)) in subsequent adaptive rounds. We use T = 7√n₀ for AMD in all experiments. We implemented our modified versions of Kernel Thinning and KT-Compress++ in JAX (Bradbury et al., 2018) so that certain subroutines can achieve a speedup using just-in-time compilation and the parallel computation power of GPUs. For Compress++, we use g = 4 in all experiments, as in Shetty et al. (2022). For both Kernel Thinning and KT-Compress++, we choose δ = 1/2, as in the goodpoints library. Each experiment was run with a single NVIDIA RTX 6000 GPU and an AMD EPYC 7513 32-core CPU.

I.3. Correcting for burn-in details

We use the four MCMC chains provided by Riabiz et al. (2022), which include both the sample points and their scores. The reference chain used to compute the energy distance is the same one used by Riabiz et al. (2022) and was kindly provided by the authors. Tab. I.1 collects the runtimes for the burn-in correction experiments. Figs. I.1, I.2, and I.3 display the results of the burn-in correction experiment of Sec. 5 repeated with three other MCMC algorithms: MALA without preconditioning, random walk (RW), and adaptive random walk (ADA-RW). The results of P-MALA from Sec. 5 are also included for completeness. For all four chains, our methods reliably achieve higher-quality coresets than the baseline methods.

n0   | ST     | LD (0.5) | LD (0.4) | KT     | KT-Compress++ | RT (0.5) | RT (0.4) | CT (0.5) | CT (0.4)
2^14 | 2.50   | 13.22    | 12.88    | 7.31   | 3.49          | 0.79     | 0.60     | 2.06     | 1.96
2^16 | 8.48   | 16.15    | 15.82    | 20.77  | 5.90          | 2.59     | 1.68     | 3.66     | 3.04
2^18 | 111.06 | 32.14    | 20.60    | 193.03 | 11.73         | 11.16    | 2.63     | 6.48     | 3.67
2^20 | -      | 314.67   | 131.31   | -      | 35.99         | 113.71   | 11.06    | 51.14    | 8.42

Table I.1: Breakdown of runtime (in seconds) for the burn-in correction experiment (d = 4) of Sec. 5. n0 is the input size after standard thinning from the length n = 2×10^6 chain (Rem. 2). Each runtime is the median of 3 runs. KT and KT-Compress++ output m = √n0 equal-weighted points. RT and CT respectively output m = n0^τ points with simplex or constant-preserving weights for the τ shown in parentheses. In addition, LD, RT, and CT use rank n0^τ. ST and KT took longer than 30 minutes for n0 = 2^20, and hence their numbers are not reported.

[Figure I.1: four panels of energy distance versus coreset size m, one per MCMC algorithm, comparing Stein Thinning, Stein Kernel Thinning, Low-rank SKT (τ = 0.4, 0.5), Burn-in Oracle + Standard Thinning, and Burn-in Oracle + Compress++.]
Figure I.1: Correcting for burn-in with equal-weighted compression. For each of four MCMC algorithms and using only one chain, our methods consistently outperform the Stein and standard thinning baselines and match the 6-chain oracle.

I.4. Correcting for approximate MCMC details

Surrogate ground truth. Following Liu and Lee (2017), we took the first 10,000 data points and generated 2^20 surrogate ground truth sample points using NUTS (Hoffman and Gelman, 2014) for the evaluation. To generate the surrogate ground truth using NUTS, we used numpyro (Phan et al., 2019). It took 12 hours to generate the surrogate ground truth points using the GPU implementation, and we estimate it would have taken 200 hours using the CPU implementation.

SGFS. For SGFS, we used batch size 32 and the step-size schedule η/(1 + t)^0.55, where t is the step count and η is the initial step size. We chose η from {10.0, 5.0, 1.0, 0.5, 0.1, 0.05, 0.01}, found that η = 1.0 gave the best standard thinning MMD for a coreset of size m = 2^10, and hence fixed η = 1.0 in all experiments. We used the version of SGFS (Ahn et al., 2012, SGFS-f) that involves inversion of d×d matrices; we found that the faster version (SGFS-d), which inverts only the diagonal, resulted in significantly worse mixing. We implemented SGFS in numpy and ran it on the CPU.

Runtime. The SGFS chain of length 2^24 took approximately 2 hours to generate using the CPU. Remarkably, all of our low-rank methods finish within 10 minutes for n0 = 2^20, which is orders of magnitude faster than the time taken to generate the NUTS surrogate ground truth.
Additional results. In Fig. I.4, we plot the posterior-mean mean-squared error (MSE) for each compression method in the approximate MCMC experiment of Sec. 5.

I.5. Correcting for tempering details

In the data release of Riabiz et al. (2020), we noticed that there were 349 sample points for which the provided scores were NaNs, so we removed those points at the recommendation of the authors.

[Figure I.2: four panels of energy distance versus coreset size m, one per MCMC algorithm, comparing Stein Thinning, Stein Recombination (τ = 0.4, 0.5), Low-rank SR (τ = 0.4, 0.5), Burn-in Oracle + Standard Thinning, and Burn-in Oracle + RT.]
Figure I.2: Correcting for burn-in with simplex-weighted compression. For each of four MCMC algorithms and using only one chain, our methods consistently outperform the Stein and standard thinning baselines and match the 6-chain oracle.

[Figure I.3: four panels of energy distance versus coreset size m, one per MCMC algorithm, comparing Stein Thinning, Stein Cholesky (τ = 0.4, 0.5), Low-rank SC (τ = 0.4, 0.5), Burn-in Oracle + Standard Thinning, and Burn-in Oracle + CT.]
Figure I.3: Correcting for burn-in with constant-preserving compression. For each of four MCMC algorithms and using only one chain, our methods consistently outperform the Stein and standard thinning baselines and match the 6-chain oracle.

[Figure I.4: three panels of posterior-mean MSE versus coreset size m for equal-weighted, simplex-weighted, and constant-preserving compression under approximate MCMC, comparing Standard Thinning, Stein Thinning, and our corresponding greedy and low-rank methods (τ = 0.4, 0.5).]
Figure I.4: Posterior mean mean-squared error (MSE) for the approximate MCMC compression experiment of Sec. 5. The MSE is computed as ‖Ê_P Z − Σ_{i∈[n0]} w_i x_i‖²_M / d, where Ê_P Z is the mean of the surrogate ground truth NUTS sample.
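For completeness, here is a minimal NumPy sketch of this metric, assuming the preconditioned-norm convention ‖v‖²_M = v^⊤ M^{-1} v of App. I.1, with mean_ref standing for the surrogate ground truth mean Ê_P Z (the function and argument names are ours).

```python
import numpy as np

def posterior_mean_mse(xs, w, mean_ref, M):
    # ||E_hat_P Z - sum_i w_i x_i||_M^2 / d with ||v||_M^2 = v^T M^{-1} v.
    d = xs.shape[1]
    diff = mean_ref - w @ xs   # (d,): gap between reference mean and weighted coreset mean
    return float(diff @ np.linalg.solve(M, diff)) / d
```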