# DISTRIBUTION COMPRESSION IN NEAR-LINEAR TIME

Abhishek Shetty¹, Raaz Dwivedi², Lester Mackey³
¹ Department of EECS, UC Berkeley; ² Department of Computer Science, Harvard University and Department of EECS, MIT; ³ Microsoft Research New England
SHETTY@BERKELEY.EDU, RAAZ@MIT.EDU, LMACKEY@MICROSOFT.COM

In distribution compression, one aims to accurately summarize a probability distribution P using a small number of representative points. Near-optimal thinning procedures achieve this goal by sampling n points from a Markov chain and identifying √n points with Õ(1/√n) discrepancy to P. Unfortunately, these algorithms suffer from quadratic or super-quadratic runtime in the sample size n. To address this deficiency, we introduce Compress++, a simple meta-procedure for speeding up any thinning algorithm while suffering at most a factor of 4 in error. When combined with the quadratic-time kernel halving and kernel thinning algorithms of Dwivedi and Mackey (2021), Compress++ delivers √n points with Õ(√(log n / n)) integration error and better-than-Monte-Carlo maximum mean discrepancy in O(n log³ n) time and O(√n log² n) space. Moreover, Compress++ enjoys the same near-linear runtime given any quadratic-time input and reduces the runtime of super-quadratic algorithms by a square-root factor. In our benchmarks with high-dimensional Monte Carlo samples and Markov chains targeting challenging differential equation posteriors, Compress++ matches or nearly matches the accuracy of its input algorithm in orders of magnitude less time.

1 INTRODUCTION

Distribution compression, the construction of a concise summary of a probability distribution, is at the heart of many learning and inference tasks. For example, in Monte Carlo integration and Bayesian inference, n representative points are sampled i.i.d. or from a Markov chain to approximate expectations and quantify uncertainty under an intractable (posterior) distribution (Robert & Casella, 1999). However, these standard sampling strategies are not especially concise. For instance, the Monte Carlo estimate P_in f ≜ (1/n) Σ_{i=1}^n f(x_i) of an unknown expectation Pf ≜ E_{X∼P}[f(X)] based on n i.i.d. points has Θ(n^{-1/2}) integration error |Pf − P_in f|, requiring 10,000 points for 1% relative error and 10^6 points for 0.1% error. Such bloated sample representations preclude downstream applications with critically expensive function evaluations like computational cardiology, where a 1000-CPU-hour tissue or organ simulation is required for each sample point (Niederer et al., 2011; Augustin et al., 2016; Strocchi et al., 2020). To restore the feasibility of such critically expensive tasks, it is common to thin down the initial point sequence to produce a much smaller coreset. The standard thinning approach, selecting every t-th point (Owen, 2017), is simple but often leads to a substantial increase in error: e.g., standard thinning of n points from a fast-mixing Markov chain yields Ω(n^{-1/4}) error when √n points are returned. Recently, Dwivedi & Mackey (2021) introduced a more effective alternative, kernel thinning (KT), that provides near-optimal Õ_d(n^{-1/2}) error when compressing n points in R^d down to size √n. While practical for moderate sample sizes, the runtime of this algorithm scales quadratically with the input size n, making its execution prohibitive for very large n. Our goal is to significantly improve the runtime of such compression algorithms while providing comparable error guarantees.
Problem setup Given a sequence Sin of n input points summarizing a target distribution P, our aim is to identify a high-quality coreset Sout of size √n in time nearly linear in n. We measure coreset quality via its integration error |Pf − P_Sout f| ≜ |Pf − (1/|Sout|) Σ_{x∈Sout} f(x)| for functions f in the reproducing kernel Hilbert space (RKHS) H_k induced by a given kernel k (Berlinet & Thomas-Agnan, 2011). We consider both single-function error and the kernel maximum mean discrepancy (MMD, Gretton et al., 2012), the worst-case integration error over the unit RKHS norm ball:

    MMD_k(P, P_Sout) ≜ sup_{‖f‖_k ≤ 1} |Pf − P_Sout f|.   (1)

Our contributions We introduce a new, simple meta-procedure COMPRESS++ that significantly speeds up a generic thinning algorithm while simultaneously inheriting the error guarantees of its input up to a factor of 4. A direct application of COMPRESS++ to KT improves its quadratic Θ(n²) runtime to near-linear O(n log³ n) time while preserving its error guarantees. Since the Õ_d(n^{-1/2}) KT MMD guarantees of Dwivedi & Mackey (2021) match the Ω(n^{-1/2}) minimax lower bounds of Tolstikhin et al. (2017); Phillips & Tai (2020) up to factors of log n and constants depending on d, KT-COMPRESS++ also provides near-optimal MMD compression for a wide range of k and P. Moreover, the practical gains from applying COMPRESS++ are substantial: KT thins 65,000 points in 10 dimensions in 20m, while KT-COMPRESS++ needs only 1.5m; KT takes more than a day to thin 250,000 points in 100 dimensions, while KT-COMPRESS++ takes less than 1hr (a 32× speed-up). For larger n, the speed-ups are even greater due to the order n / log³ n reduction in runtime. COMPRESS++ can also be directly combined with any thinning algorithm, even those that have suboptimal guarantees but often perform well in practice, like kernel herding (Chen et al., 2010), MMD-critic (Kim et al., 2016), and Stein thinning (Riabiz et al., 2020a), all of which run in Ω(n²) time. As a demonstration, we combine COMPRESS++ with the popular kernel herding algorithm and observe 45× speed-ups when compressing 250,000 input points. In all of our experiments, COMPRESS++ leads to minimal loss in accuracy and, surprisingly, even improves upon herding accuracy for lower-dimensional problems. Most related to our work are the merge-reduce algorithms of Matousek (1995); Chazelle & Matousek (1996); Phillips (2008), which also speed up input thinning algorithms while controlling approximation error. In our setting, merge-reduce runs in time Ω(n^{1.5}) given an n²-time input and in time Ω(n^{(τ+1)/2}) for slower n^τ-time inputs (see, e.g., Phillips, 2008, Thm. 3.1). In contrast, COMPRESS++ runs in near-linear O(n log³ n) time for any n²-time input and in O(n^{τ/2} log^τ n) time for slower n^τ-time inputs. After providing formal definitions in Sec. 2, we introduce and analyze COMPRESS++ and its primary subroutine COMPRESS in Secs. 3 and 4, demonstrate the empirical benefits of COMPRESS++ in Sec. 5, and present conclusions and opportunities for future work in Sec. 6.

Notation We let P_S denote the empirical distribution of a point sequence S. For the output coreset S_ALG of an algorithm ALG with input coreset Sin, we use the simpler notation P_ALG ≜ P_{S_ALG} and P_in ≜ P_Sin. We extend our MMD definition to point sequences (S1, S2) via MMD_k(S1, S2) ≜ MMD_k(P_S1, P_S2) and MMD_k(P, S1) ≜ MMD_k(P, P_S1). We use a ≲ b to mean a = O(b), a ≳ b to mean a = Ω(b), a = Θ(b) to mean both a = O(b) and a = Ω(b), and log to denote the natural logarithm.
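To make the quality metric (1) concrete, the following is a minimal numerical sketch of the empirical kernel MMD between an input sequence and a candidate coreset under a Gaussian kernel. The function names are illustrative and are not part of the paper's goodpoints package; the computation simply expands the squared MMD into three kernel averages.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma_sq):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma_sq))."""
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq_dists / (2 * sigma_sq))

def mmd(S1, S2, sigma_sq=1.0):
    """Kernel MMD between the empirical distributions of point sets S1 and S2."""
    K11 = gaussian_kernel(S1, S1, sigma_sq)
    K22 = gaussian_kernel(S2, S2, sigma_sq)
    K12 = gaussian_kernel(S1, S2, sigma_sq)
    mmd_sq = K11.mean() + K22.mean() - 2 * K12.mean()
    return np.sqrt(max(mmd_sq, 0.0))

# Example: how well does a size-sqrt(n) coreset summarize the n input points?
rng = np.random.default_rng(0)
Sin = rng.standard_normal((1024, 2))                  # n = 1024 input points in d = 2
Sout = Sin[rng.choice(1024, 32, replace=False)]       # a size-32 coreset (here: i.i.d. subsample)
print(mmd(Sin, Sout, sigma_sq=2 * 2))                 # sigma^2 = 2d, as in Sec. 5
```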
2 THINNING AND HALVING ALGORITHMS

We begin by defining the thinning and halving algorithms that our meta-procedures take as input.

Definition 1 (Thinning and halving algorithms) A thinning algorithm ALG takes as input a point sequence Sin of length n and returns a (possibly random) point sequence S_ALG of length n_out. We say ALG is α_n-thinning if n_out = n/α_n and root-thinning if α_n = √n. Moreover, we call ALG a halving algorithm if S_ALG always contains exactly n/2 of the input points.

Many thinning algorithms offer high-probability bounds on the integration error |P_Sin f − P_ALG f|. We capture such bounds abstractly using the following definition of a sub-Gaussian thinning algorithm.

Definition 2 (Sub-Gaussian thinning algorithm) For a function f, we call a thinning algorithm ALG f-sub-Gaussian with parameter ν and write ALG ∈ G_f(ν) if
    E[exp(λ (P_Sin f − P_ALG f)) | Sin] ≤ exp(λ² ν²(n) / 2)   for all λ ∈ R.
Def. 2 is equivalent to a sub-Gaussian tail bound for the integration error, and, by Boucheron et al. (2013, Section 2.3), if ALG ∈ G_f(ν) then E[P_ALG f | Sin] = P_Sin f and, for all δ ∈ (0, 1),
    |P_Sin f − P_ALG f| ≤ ν(n) √(2 log(2/δ))   with probability at least 1 − δ given Sin.
Hence the integration error of ALG is dominated by the sub-Gaussian parameter ν(n).

Example 1 (KT-SPLIT) Given a kernel k and n input points Sin, the KT-SPLIT(δ) algorithm¹ of Dwivedi & Mackey (2022; 2021, Alg. 1a) takes Θ(n²) kernel evaluations to output a coreset of size n_out with better-than-i.i.d. integration error. Specifically, Dwivedi & Mackey (2022, Thm. 1) prove that, on an event with probability at least 1 − δ/2, KT-SPLIT(δ) ∈ G_f(ν) with
    ν(n) = (2/n_out) √(log(6 n_out log₂(n/n_out) / δ) ‖k‖_∞)   for all f with ‖f‖_k = 1.   (2)

Many algorithms also offer high-probability bounds on the kernel MMD (1), the worst-case integration error across the unit ball of the RKHS. We again capture these bounds abstractly using the following definition of a k-sub-Gaussian thinning algorithm.

Definition 3 (k-sub-Gaussian thinning algorithm) For a kernel k, we call a thinning algorithm ALG k-sub-Gaussian with parameter v and shift a and write ALG ∈ G_k(v, a) if
    P[MMD_k(Sin, S_ALG) ≥ a_n + v_n √t | Sin] ≤ e^{−t}   for all t ≥ 0.   (3)
We also call ε_{k,ALG}(n) ≜ max(v_n, a_n) the k-sub-Gaussian error of ALG.

Example 2 (Kernel thinning) Given a kernel k and n input points Sin, the generalized kernel thinning (KT(δ)) algorithm¹ of Dwivedi & Mackey (2022; 2021, Alg. 1) takes Θ(n²) kernel evaluations to output a coreset of size n_out with near-optimal MMD error. In particular, by leveraging an appropriate auxiliary kernel k_split, Dwivedi & Mackey (2022, Thms. 2-4) establish that, on an event with probability at least 1 − δ/2, KT(δ) ∈ G_k(a, v) with
    a_n = (C_a / n_out) √(‖k_split‖_∞)   and   v_n = (C_v / n_out) √(‖k_split‖_∞ log(6 n_out log₂(n/n_out) / δ)) · M_{Sin, k_split},   (4)
where ‖k_split‖_∞ ≜ sup_x k_split(x, x), C_a and C_v are explicit constants, and M_{Sin, k_split} ≥ 1 is non-decreasing in n and varies based on the tails of k_split and the radius of the ball containing Sin.

3 COMPRESS

The core subroutine of COMPRESS++ is a new meta-procedure called COMPRESS that, given a halving algorithm HALVE, an oversampling parameter g, and n input points, outputs a thinned coreset of size 2^g √n. The COMPRESS algorithm (Alg. 1) is very simple to implement: first, divide the input points into four subsequences of size n/4 (in any manner the user chooses); second, recursively call COMPRESS on each subsequence to produce four coresets of size 2^{g−1} √n; finally, call HALVE on the concatenation of those coresets to produce the final output of size 2^g √n.
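The recursion above translates directly into code. Below is a minimal sketch of Alg. 1 under the assumption that the input size is 4^{g+j} for some integer j and that `halve` is any user-supplied halving routine; a uniformly random halver is used purely as a placeholder. This is not the paper's implementation, which instead instantiates HALVE with (symmetrized) kernel thinning (see Ex. 4).

```python
import numpy as np

def compress(points, halve, g):
    """Return a coreset of size 2**g * sqrt(n) for an input of size n = 4**(g+j).

    points: (n, d) array with n a power of 4 and n >= 4**g
    halve:  callable mapping an (m, d) array to an (m // 2, d) array of its rows
    g:      oversampling parameter
    """
    n = len(points)
    if n == 4**g:              # base case: nothing left to compress
        return points
    # Partition the input into four (arbitrary) subsequences of size n / 4
    quarters = np.split(points, 4)
    # Recurse: each call returns a coreset of size 2**g * sqrt(n / 4) = 2**(g-1) * sqrt(n)
    coresets = [compress(q, halve, g) for q in quarters]
    # Concatenation has size 2**(g+1) * sqrt(n); halve it down to 2**g * sqrt(n)
    return halve(np.concatenate(coresets))

def random_halve(points, rng=np.random.default_rng(0)):
    """Placeholder halving algorithm: keep a uniformly random half of the rows."""
    idx = rng.permutation(len(points))[: len(points) // 2]
    return points[np.sort(idx)]

out = compress(np.random.default_rng(1).standard_normal((4**6, 3)), random_halve, g=2)
print(out.shape)   # (2**2 * sqrt(4**6), 3) = (256, 3)
```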
As we show in App. H, COMPRESS can also be implemented in a streaming fashion to consume at most O(4^g √n) memory.

3.1 INTEGRATION ERROR AND RUNTIME GUARANTEES FOR COMPRESS

Our first result relates the runtime and single-function integration error of COMPRESS to the runtime and error of HALVE. We measure integration error for each function f probabilistically in terms of the sub-Gaussian parameter ν of Def. 2 and measure runtime by the number of dominant operations performed by HALVE (e.g., the number of kernel evaluations performed by kernel thinning).

Theorem 1 (Runtime and integration error of COMPRESS) If HALVE has runtime r_H(n) for inputs of size n, then COMPRESS has runtime
    r_C(n) = Σ_{i=0}^{β_n} 4^i r_H(ℓ_n 2^{−i}),   (5)
where ℓ_n ≜ 2^{g+1} √n (twice the output size of COMPRESS) and β_n ≜ log₂(n/ℓ_n) = log₄ n − g − 1. Furthermore, if, for some function f, HALVE ∈ G_f(ν_H), then COMPRESS ∈ G_f(ν_C) with
    ν_C²(n) = Σ_{i=0}^{β_n} 4^{−i} ν_H²(ℓ_n 2^{−i}).   (6)

¹The δ argument of KT-SPLIT(δ) or KT(δ) indicates that each parameter δ_i = δ/ℓ in Dwivedi & Mackey (2022, Alg. 1a), where ℓ is the size of the input point sequence compressed by KT-SPLIT(δ) or KT(δ).

Algorithm 1: COMPRESS
  Input: halving algorithm HALVE, oversampling parameter g, point sequence Sin of size n
  if n = 4^g then return Sin
  Partition Sin into four arbitrary subsequences {S_i}_{i=1}^4, each of size n/4
  for i = 1, 2, 3, 4 do
      S̃_i ← COMPRESS(S_i, HALVE, g)   // return coresets of size 2^g √(n/4)
  end
  S̃ ← CONCATENATE(S̃_1, S̃_2, S̃_3, S̃_4)   // coreset of size 2 · 2^g √n
  return HALVE(S̃)   // coreset of size 2^g √n

As we prove in App. B, the runtime guarantee (5) is immediate once we unroll the COMPRESS recursion and identify that COMPRESS makes 4^i calls to HALVE with input size ℓ_n 2^{−i}. The error guarantee (6) is more subtle: here, COMPRESS benefits significantly from random cancellations among the conditionally independent and mean-zero HALVE errors. Without these properties, the errors from each HALVE call could compound without cancellation, leading to a significant degradation in COMPRESS quality. Let us now unpack the most important implications of Thm. 1.

Remark 1 (Near-linear runtime and quadratic speed-ups for COMPRESS) Thm. 1 implies that a quadratic-time HALVE with r_H(n) = n² yields a near-linear-time COMPRESS with r_C(n) ≤ 4^{g+1} n (log₄(n) − g). If HALVE instead has super-quadratic runtime r_H(n) = n^τ, COMPRESS enjoys a quadratic speed-up: r_C(n) ≤ c'_τ n^{τ/2} for c'_τ ≜ 2^{τ(g+2)} / (2^τ − 4). More generally, whenever HALVE has superlinear runtime r_H(n) = n^τ ρ(n) for some τ ≥ 1 and non-decreasing ρ, COMPRESS satisfies
    r_C(n) ≤ { c_τ n (log₄(n) − g) ρ(ℓ_n)   for τ ≤ 2;   c'_τ n^{τ/2} ρ(ℓ_n)   for τ > 2 },   where c_τ ≜ 4^{(τ−1)(g+1)}.

Remark 2 (COMPRESS inflates sub-Gaussian error by at most √(log₄ n)) Thm. 1 also implies
    ν_C(n) ≤ √(β_n + 1) ν_H(ℓ_n) = √(log₄ n − g) ν_H(ℓ_n)
in the usual case that n ↦ n ν_H(n) is non-decreasing in n. Hence the sub-Gaussian error of COMPRESS is at most a √(log₄ n) factor larger than that of halving an input of size ℓ_n. This is an especially strong benchmark, as ℓ_n is twice the output size of COMPRESS, and thinning from n to ℓ_n/2 points should incur at least as much approximation error as halving from ℓ_n to ℓ_n/2 points.
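As a sanity check on Thm. 1 and Rem. 1, the short snippet below (our illustration, not code from the paper) evaluates the runtime recursion (5) for a quadratic-time HALVE and compares it with the closed-form bound 4^{g+1} n (log₄ n − g); for r_H(m) = m² the two coincide up to floating-point error and sit far below the n² cost of thinning directly.

```python
import math

def compress_runtime(n, g, r_halve):
    """Evaluate the COMPRESS runtime recursion (5): sum_i 4^i * r_H(l_n / 2^i)."""
    l_n = 2**(g + 1) * math.isqrt(n)            # l_n = 2^(g+1) sqrt(n), twice the output size
    beta_n = round(math.log(n, 4)) - g - 1      # beta_n = log_4(n) - g - 1
    return sum(4**i * r_halve(l_n / 2**i) for i in range(beta_n + 1))

def r_quadratic(m):                              # a quadratic-time HALVE cost model
    return m**2

n, g = 4**10, 2                                  # n = 1,048,576 input points
exact = compress_runtime(n, g, r_quadratic)
bound = 4**(g + 1) * n * (math.log(n, 4) - g)    # Rem. 1 closed form
print(exact, bound)                              # both ~5.4e8, versus n^2 ~ 1.1e12
```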
Example 3 (KT-SPLIT-COMPRESS) Consider running COMPRESS with, for each HALVE input of size ℓ, HALVE = KT-SPLIT(ℓ²δ / (n 4^{g+1} (β_n+1))) from Ex. 1. Since KT-SPLIT runs in time Θ(n²), COMPRESS runs in near-linear O(n log n) time by Rem. 1. In addition, as we detail in App. F.1, on an event of probability at least 1 − δ/2, every HALVE call invoked by COMPRESS is f-sub-Gaussian with
    ν_H(ℓ) = (4/ℓ) √(log(12 n 4^g (β_n+1) / (ℓδ)) ‖k‖_∞)   for all f with ‖f‖_k = 1.   (7)
Hence, Rem. 2 implies that COMPRESS is f-sub-Gaussian on the same event with ν_C(n) ≤ √(log₄ n − g) ν_H(ℓ_n), a guarantee within √(log₄ n) of the original KT-SPLIT(δ) error (2).

3.2 MMD GUARANTEES FOR COMPRESS

Next, we bound the MMD error of COMPRESS in terms of the MMD error of HALVE. Recall that MMD_k (1) represents the worst-case integration error across the unit ball of the RKHS of k. The proof of the following result, based on the concentration of subexponential matrix martingales, is provided in App. C.

Theorem 2 (MMD guarantees for COMPRESS) Suppose HALVE ∈ G_k(a, v) for n ↦ n a_n and n ↦ n v_n non-decreasing and E[P_HALVE k | Sin] = P_in k. Then COMPRESS ∈ G_k(ã, ṽ) with
    ṽ_n ≜ 4 (a_{ℓ_n} + v_{ℓ_n}) √(2 (log₄ n − g))   and   ã_n ≜ ṽ_n √(log(n+1)),   (8)
where ℓ_n = 2^{g+1} √n as in Thm. 1.

Remark 3 (Symmetrization) We can convert any halving algorithm into one that satisfies the unbiasedness condition E[P_HALVE k | Sin] = P_in k without impacting integration error by symmetrization, i.e., by returning either the outputted half or its complement with equal probability.

Remark 4 (COMPRESS inflates MMD guarantee by at most 10 log(n+1)) Thm. 2 implies that the k-sub-Gaussian error of COMPRESS is always at most 10 log(n+1) times that of HALVE with input size ℓ_n = 2^{g+1} √n since
    ε_{k,COMPRESS}(n) = max(ã_n, ṽ_n) ≤ 10 log(n+1) max(a_{ℓ_n}, v_{ℓ_n}) = 10 log(n+1) ε_{k,HALVE}(ℓ_n),
where the first equality uses Def. 3 and the inequality uses (8). As in Rem. 2, HALVE applied to an input of size ℓ_n is a particularly strong benchmark, as thinning from n to ℓ_n/2 points should incur at least as much MMD error as halving from ℓ_n to ℓ_n/2 points.

Example 4 (KT-COMPRESS) Consider running COMPRESS with, for each HALVE input of size ℓ, HALVE = KT(ℓ²δ / (n 4^{g+1} (β_n+1))) from Ex. 2 after symmetrizing as in Rem. 3. Since KT has Θ(n²) runtime, COMPRESS yields near-linear O(n log n) runtime by Rem. 1. Moreover, as we detail in App. F.2, using the notation of Ex. 2, on an event E of probability at least 1 − δ/2, every HALVE call invoked by COMPRESS is k-sub-Gaussian with
    a_ℓ = (2 C_a / ℓ) √(‖k‖_∞)   and   v_ℓ = (2 C_v / ℓ) √(‖k‖_∞ log(12 n 4^g (β_n+1) / (ℓδ))) · M_{Sin, k}.
Thus, Rem. 4 implies that, on E, KT-COMPRESS has k-sub-Gaussian error ε_{k,COMPRESS}(n) ≤ 10 log(n+1) ε_{k,HALVE}(ℓ_n), a guarantee within 10 log(n+1) of the original KT(δ) MMD error (4).

4 COMPRESS++

To offset any excess error due to COMPRESS while maintaining its near-linear runtime, we next introduce COMPRESS++ (Alg. 2), a simple two-stage meta-procedure for faster root-thinning. COMPRESS++ takes as input an oversampling parameter g, a halving algorithm HALVE, and a 2^g-thinning algorithm THIN (see Def. 1). In our applications, HALVE and THIN are derived from the same base algorithm (e.g., from KT with different thinning factors), but this is not required. COMPRESS++ first runs the faster but slightly more erroneous COMPRESS(HALVE, g) algorithm to produce an intermediate coreset of size 2^g √n. Next, the slower but more accurate THIN algorithm is run on the greatly compressed intermediate coreset to produce a final output of size √n. In the sequel, we demonstrate how to set g to offset the error inflation due to COMPRESS while maintaining its fast runtime.

Algorithm 2: COMPRESS++
  Input: oversampling parameter g, halving alg. HALVE, 2^g-thinning alg. THIN, point sequence Sin of size n
  S_C ← COMPRESS(HALVE, g, Sin)   // coreset of size 2^g √n
  S_C++ ← THIN(S_C)   // coreset of size √n
  return S_C++
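The two-stage structure of Alg. 2 is equally short in code. Here is a minimal sketch reusing the `compress` and `random_halve` sketches above, with plain standard thinning as a stand-in for the 2^g-thinning stage; in the paper's applications THIN would instead be kernel thinning or kernel herding.

```python
import numpy as np

def compress_plusplus(points, halve, thin, g):
    """Two-stage root thinning (Alg. 2): COMPRESS to 2**g * sqrt(n) points, then THIN by 2**g."""
    intermediate = compress(points, halve, g)   # coreset of size 2**g * sqrt(n)
    return thin(intermediate)                   # final coreset of size sqrt(n)

def standard_thin(points, g=2):
    """Placeholder 2**g-thinning algorithm: keep every 2**g-th point."""
    return points[:: 2**g]

points = np.random.default_rng(2).standard_normal((4**6, 3))
coreset = compress_plusplus(points, random_halve, standard_thin, g=2)
print(coreset.shape)   # (sqrt(4**6), 3) = (64, 3)
```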
4.1 INTEGRATION ERROR AND RUNTIME GUARANTEES FOR COMPRESS++

The following result, proved in App. D, relates the runtime and single-function integration error of COMPRESS++ to the runtime and error of HALVE and THIN.

Theorem 3 (Runtime and integration error of COMPRESS++) If HALVE and THIN have runtimes r_H(n) and r_T(n) respectively for inputs of size n, then COMPRESS++ has runtime
    r_C++(n) = r_C(n) + r_T(ℓ_n/2),   where r_C(n) = Σ_{i=0}^{β_n} 4^i r_H(ℓ_n 2^{−i}) as in (5),   (9)
ℓ_n = 2^{g+1} √n, and β_n = log₄ n − g − 1 as in Thm. 1. Furthermore, if for some function f, HALVE ∈ G_f(ν_H) and THIN ∈ G_f(ν_T), then COMPRESS++ ∈ G_f(ν_C++) with
    ν_C++²(n) = ν_C²(n) + ν_T²(ℓ_n/2),   where ν_C²(n) = Σ_{i=0}^{β_n} 4^{−i} ν_H²(ℓ_n 2^{−i}) as in (6).

Remark 5 (Near-linear runtime and near-quadratic speed-ups for COMPRESS++) When HALVE and THIN have quadratic runtimes with max(r_H(n), r_T(n)) = n², Thm. 3 and Rem. 1 yield that r_C++(n) ≤ 4^{g+1} n (log₄(n) − g) + 4^g n. Hence, COMPRESS++ maintains a near-linear runtime
    r_C++(n) = O(n (log₄ n)^{c+1})   whenever   4^g = O((log₄ n)^c).   (10)
If HALVE and THIN instead have super-quadratic runtimes with max(r_H(n), r_T(n)) = n^τ, then by Rem. 1 we have r_C++(n) ≤ (4^τ / (2^τ − 4) + 1) 2^{gτ} n^{τ/2}, so that COMPRESS++ provides a near-quadratic speed-up r_C++(n) = O(n^{τ/2} (log₄ n)^{cτ/2}) whenever 4^g = O((log₄ n)^c).

Remark 6 (COMPRESS++ inflates sub-Gaussian error by at most √2) In the usual case that n ↦ n ν_H(n) is non-decreasing in n, Thm. 3 and Rem. 2 imply that
    ν_C++²(n) ≤ (log₄ n − g) ν_H²(ℓ_n) + ν_T²(ℓ_n/2) = ν_T²(ℓ_n/2) [1 + ((log₄ n − g)/4^g) (ζ_H(ℓ_n)/ζ_T(ℓ_n/2))²],
where we have introduced the rescaled quantities ζ_H(ℓ_n) ≜ (ℓ_n/2) ν_H(ℓ_n) and ζ_T(ℓ_n/2) ≜ √n ν_T(ℓ_n/2). Therefore, COMPRESS++ satisfies
    ν_C++(n) ≤ √2 ν_T(ℓ_n/2)   whenever   g ≥ log₄ log₄ n + log₂(ζ_H(ℓ_n)/ζ_T(ℓ_n/2)).   (11)
That is, whenever COMPRESS++ is run with an oversampling parameter g satisfying (11), its sub-Gaussian error is never more than √2 times the second-stage THIN error. Here, THIN represents a strong baseline for comparison, as thinning from ℓ_n/2 to √n points should incur at least as much error as thinning from n to √n points. As we illustrate in the next example, when THIN and HALVE are derived from the same thinning algorithm, the ratio ζ_H(ℓ_n)/ζ_T(ℓ_n/2) is typically bounded by a constant C, so that the choice g = ⌈log₄ log₄ n + log₂ C⌉ suffices to simultaneously obtain the √2 relative error guarantee (11) of Rem. 6 and the substantial speed-ups (10) of Rem. 5.

Example 5 (KT-SPLIT-COMPRESS++) In the notation of Ex. 1, consider running COMPRESS++ with HALVE = KT-SPLIT(ℓ²δ / (4 n 2^g (g + 2^g(β_n+1)))) when applied to an input of size ℓ and THIN = KT-SPLIT(gδ / (g + 2^g(β_n+1))). As detailed in App. F.3, on an event of probability at least 1 − δ/2, all COMPRESS++ invocations of HALVE and THIN are simultaneously f-sub-Gaussian with parameters satisfying
    ζ_H(ℓ_n) = ζ_T(ℓ_n/2) = 2 √(log(6 √n (g + 2^g(β_n+1)) / δ) ‖k‖_∞),   so that   ζ_H(ℓ_n)/ζ_T(ℓ_n/2) = 1,   for all f with ‖f‖_k = 1.   (12)
Since KT-SPLIT runs in Θ(n²) time, Rems. 5 and 6 imply that KT-SPLIT-COMPRESS++ with g = ⌈log₄ log₄ n⌉ runs in near-linear O(n log² n) time and inflates sub-Gaussian error by at most √2.

4.2 MMD GUARANTEES FOR COMPRESS++

Next, we bound the MMD error of COMPRESS++ in terms of the MMD error of HALVE and THIN. The proof of the following result can be found in App. E.

Theorem 4 (MMD guarantees for COMPRESS++) If THIN ∈ G_k(a', v'), HALVE ∈ G_k(a, v) for n ↦ n a_n and n ↦ n v_n non-decreasing, and E[P_HALVE k | Sin] = P_in k, then COMPRESS++ ∈ G_k(â, v̂) with
    v̂_n ≜ ṽ_n + v'_{ℓ_n/2}   and   â_n ≜ ã_n + a'_{ℓ_n/2} + v̂_n √(log 2)
for ṽ_n and ã_n defined in Thm. 2 and ℓ_n = 2^{g+1} √n as in Thm. 1.
Remark 7 (COMPRESS++ inflates MMD guarantee by at most 4) Thm. 4 implies that the COMPRESS++ k-sub-Gaussian error ε_{k,COMPRESS++}(n) = max(â_n, v̂_n) satisfies
    ε_{k,COMPRESS++}(n) ≤ (10 log(n+1) ε_{k,HALVE}(ℓ_n) + ε_{k,THIN}(ℓ_n/2)) (1 + √(log 2)) = ε_{k,THIN}(ℓ_n/2) ((10 log(n+1)/2^g) ζ̃_H(ℓ_n)/ζ̃_T(ℓ_n/2) + 1) (1 + √(log 2)),
where we have introduced the rescaled quantities ζ̃_H(ℓ_n) ≜ (ℓ_n/2) ε_{k,HALVE}(ℓ_n) and ζ̃_T(ℓ_n/2) ≜ √n ε_{k,THIN}(ℓ_n/2). Therefore, COMPRESS++ satisfies
    ε_{k,COMPRESS++}(n) ≤ 4 ε_{k,THIN}(ℓ_n/2)   whenever   g ≥ log₂ log(n+1) + log₂(8.5 ζ̃_H(ℓ_n)/ζ̃_T(ℓ_n/2)).   (13)
In other words, relative to a strong baseline of thinning from ℓ_n/2 to √n points, COMPRESS++ inflates k-sub-Gaussian error by at most a factor of 4 whenever g satisfies (13). For example, when the ratio ζ̃_H(ℓ_n)/ζ̃_T(ℓ_n/2) is bounded by C, it suffices to choose g = ⌈log₂ log(n+1) + log₂(8.5 C)⌉.

Example 6 (KT-COMPRESS++) In the notation of Ex. 2 and Rem. 3, consider running COMPRESS++ with HALVE = symmetrized KT(ℓ²δ / (4 n 2^g (g + 2^g(β_n+1)))) when applied to an input of size ℓ and THIN = KT(gδ / (g + 2^g(β_n+1))). As we detail in App. F.4, on an event of probability at least 1 − δ/2, all COMPRESS++ invocations of HALVE and THIN are simultaneously k-sub-Gaussian with
    ζ̃_H(ℓ_n) = ζ̃_T(ℓ_n/2) ≲ √(‖k‖_∞ log(6 √n (g + 2^g(β_n+1)) / δ)) · M_{Sin, k},   so that   ζ̃_H(ℓ_n)/ζ̃_T(ℓ_n/2) = 1.
As KT runs in Θ(n²) time, Rems. 5 and 7 imply that KT-COMPRESS++ with g = ⌈log₂ log n + 3.1⌉ runs in near-linear O(n log³ n) time and inflates k-sub-Gaussian error by at most 4.

5 EXPERIMENTS

We now turn to an empirical evaluation of the speed-ups and error of COMPRESS++. We begin by describing the thinning algorithms, compression tasks, evaluation metrics, and kernels used in our experiments. Supplementary experimental details and results can be found in App. G.

Thinning algorithms Each experiment compares a high-accuracy, quadratic-time thinning algorithm, either target kernel thinning (Dwivedi & Mackey, 2022) or kernel herding (Chen et al., 2010), with our near-linear-time COMPRESS and COMPRESS++ variants that use the same input algorithm for HALVE and THIN. In each case, we perform root thinning, compressing n input points down to √n points, so that COMPRESS is run with g = 0. For COMPRESS++, we use g = 4 throughout to satisfy the small relative error criterion (11) in all experiments. When halving, we restrict each input algorithm to return distinct points and symmetrize the output as discussed in Rem. 3.

Compressing i.i.d. summaries To demonstrate the advantages of COMPRESS++ over equal-sized i.i.d. summaries, we compress input point sequences Sin drawn i.i.d. from either (a) Gaussian targets P = N(0, I_d) with d ∈ {2, 4, 10, 100} or (b) M-component mixture-of-Gaussian targets P = (1/M) Σ_{j=1}^M N(µ_j, I_2) with M ∈ {4, 6, 8, 32} and component means µ_j ∈ R² defined in App. G.

Compressing MCMC summaries To demonstrate the advantages of COMPRESS++ over standard MCMC thinning, we also compress input point sequences Sin generated by a variety of popular MCMC algorithms (denoted by RW, ADA-RW, MALA, and pMALA) targeting four challenging Bayesian posterior distributions P. In particular, we adopt the four posterior targets of Riabiz et al. (2020a) based on the Goodwin (1965) model of oscillatory enzymatic control (d = 4), the Lotka (1925); Volterra (1926) model of oscillatory predator-prey evolution (d = 4), the Hinch et al. (2004) model of calcium signalling in cardiac cells (d = 38), and a tempered Hinch model posterior (d = 38). Notably, for the Hinch experiments, each summary point discarded via an accurate thinning procedure saves 1000s of downstream CPU hours by avoiding an additional critically expensive whole-heart simulation (Riabiz et al., 2020a). See App. G for MCMC algorithm and target details.
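The experiments above fix g = 4 across all runs. As an illustration only, and under the assumption that the HALVE/THIN error ratio is bounded by C ≈ 1 (as in Exs. 5 and 6), the snippet below shows how one could instead pick the smallest integer g satisfying the relative-error criterion (11) for a given input size n.

```python
import math

def oversampling_parameter(n, C=1.0):
    """Smallest non-negative integer g satisfying criterion (11),
    g >= log4(log4(n)) + log2(C), when zeta_H / zeta_T is bounded by C."""
    return max(0, math.ceil(math.log(math.log(n, 4), 4) + math.log2(C)))

for n in [4**6, 4**8, 4**10, 4**12]:
    g = oversampling_parameter(n)
    # input size, chosen g, and intermediate coreset size 2^g sqrt(n)
    print(n, g, 2**g * math.isqrt(n))
```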
Kernel settings Throughout we use a Gaussian kernel k(x, y) = exp(−‖x − y‖₂² / (2σ²)) with σ² as specified by Dwivedi & Mackey (2021, Sec. K.2) for the MCMC targets and σ² = 2d otherwise.

Evaluation metrics For each thinning procedure we report mean runtime across 3 runs and mean MMD error across 10 independent runs ±1 standard error (the error bars are often too small to be visible). All runtimes were measured on a single core of an Intel Xeon CPU. For the i.i.d. targets, we report MMD_k(P, P_out), which can be computed exactly in closed form. For the MCMC targets, we report the thinning error MMD_k(P_in, P_out) analyzed directly by our theory (Thms. 2 and 4).

[Figure 1: paired panels of MMD versus input size n and single-core runtime versus n for d = 2, 4, 10, 100; legends report fitted MMD decay rates, e.g., KT-Comp: n^{-0.47} and KT-Comp++: n^{-0.50} at d = 2 down to n^{-0.31} at d = 100, with analogous rates for Herd, Herd-Comp, and Herd-Comp++; runtime panels mark the 8hr and 1day levels.] Figure 1: For Gaussian targets P in R^d, KT-COMPRESS++ and Herd-COMPRESS++ improve upon the MMD of i.i.d. sampling (ST), closely track the error of their respective quadratic-time input algorithms KT and kernel herding (Herd), and substantially reduce the runtime.

Kernel thinning results We first apply COMPRESS++ to the near-optimal KT algorithm to obtain comparable summaries at a fraction of the cost. Figs. 1 and 2 reveal that, in line with our guarantees, KT-COMPRESS++ matches or nearly matches the MMD error of KT in all experiments while also substantially reducing runtime. For example, KT thins 65,000 points in 10 dimensions in 20m, while KT-COMPRESS++ needs only 1.5m; KT takes more than a day to thin 250,000 points in 100 dimensions, while KT-COMPRESS++ takes less than an hour (a 32× speed-up). For reference we also display the error of standard thinning (ST) to highlight that KT-COMPRESS++ significantly improves approximation quality relative to the standard practice of i.i.d. summarization or standard MCMC thinning. See Fig. 4 in App. G.1 for analogous results with mixture-of-Gaussian targets.
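For the Gaussian targets above, the closed-form MMD_k(P, P_out) reported in our i.i.d. evaluations can be evaluated as follows for P = N(0, I_d) and the Gaussian kernel. The sketch is our own derivation from standard Gaussian integral identities (stated in the docstring), not code from the paper; mixture targets admit an analogous formula.

```python
import numpy as np

def mmd_to_standard_gaussian(S, sigma_sq):
    """MMD_k(P, P_S) for P = N(0, I_d), k(x, y) = exp(-||x - y||^2 / (2 sigma_sq)),
    and the empirical distribution of a coreset S (an (m, d) array).

    Uses E_{X,Y~P} k(X, Y) = (s/(s+2))^(d/2) and
         E_{X~P} k(X, y)   = (s/(s+1))^(d/2) exp(-||y||^2 / (2(s+1))), with s = sigma_sq.
    """
    m, d = S.shape
    s = sigma_sq
    term_pp = (s / (s + 2)) ** (d / 2)
    term_ps = (s / (s + 1)) ** (d / 2) * np.exp(-np.sum(S**2, axis=1) / (2 * (s + 1)))
    sq_dists = np.sum(S**2, 1)[:, None] + np.sum(S**2, 1)[None, :] - 2 * S @ S.T
    term_ss = np.exp(-sq_dists / (2 * s))
    mmd_sq = term_pp - 2 * term_ps.mean() + term_ss.mean()
    return np.sqrt(max(mmd_sq, 0.0))

d = 4
S = np.random.default_rng(3).standard_normal((64, d))   # an i.i.d. coreset of size 64
print(mmd_to_standard_gaussian(S, sigma_sq=2 * d))       # sigma^2 = 2d as in Sec. 5
```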
Kernel herding results A strength of COMPRESS++ is that it can be applied to any thinning algorithm, including those with suboptimal or unknown performance guarantees that often perform well in practice. In such cases, Rems. 4 and 6 still ensure that COMPRESS++ error is never much larger than that of the input algorithm. As an illustration, we apply COMPRESS++ to the popular quadratic-time kernel herding algorithm (Herd). Fig. 1 shows that Herd-COMPRESS++ matches or nearly matches the MMD error of Herd in all experiments while also substantially reducing runtime. For example, Herd requires more than 11 hours to compress 250,000 points in 100 dimensions, while Herd-COMPRESS++ takes only 14 minutes (a 45× speed-up). Moreover, surprisingly, Herd-COMPRESS++ is consistently more accurate than the original kernel herding algorithm for lower-dimensional problems. See Fig. 4 in App. G.1 for comparable results with mixture-of-Gaussian P.

Visualizing coresets For a 32-component mixture-of-Gaussians target, Fig. 3 visualizes the coresets produced by i.i.d. sampling, KT, kernel herding, and their COMPRESS++ variants. The COMPRESS++ coresets closely resemble those of their input algorithms and, compared with i.i.d. sampling, yield visibly improved stratification across the mixture components.

[Figure 2: twelve panels of MMD versus input size n for the Goodwin, Lotka-Volterra, Hinch, and tempered Hinch posteriors under the RW, ADA-RW, MALA, and pMALA chains; legends report fitted decay rates between roughly n^{-0.43} and n^{-0.56} for KT-Comp and KT-Comp++.] Figure 2: Given MCMC sequences summarizing challenging differential equation posteriors P, KT-COMPRESS++ consistently improves upon the MMD of standard thinning (ST) and matches or nearly matches the error of its quadratic-time input algorithm KT.

[Figure 3: scatter plots of the coresets returned by i.i.d. sampling, KT-Comp++, KT, Herd-Comp++, and Herd.] Figure 3: Coresets of size 32 (top) or 64 (bottom) with equidensity contours of the target underlaid.

6 DISCUSSION AND CONCLUSIONS

We introduced a new general meta-procedure, COMPRESS++, for speeding up thinning algorithms while preserving their error guarantees up to a factor of 4. When combined with the quadratic-time KT-SPLIT and kernel thinning algorithms of Dwivedi & Mackey (2021; 2022), the result is near-optimal distribution compression in near-linear time. Moreover, the same simple approach can be combined with any slow thinning algorithm to obtain comparable summaries in a fraction of the time. Two open questions recommend themselves for future investigation. First, why does Herd-COMPRESS++ improve upon the original kernel herding algorithm in lower dimensions, and can this improvement be extended to higher dimensions and to other algorithms? Second, is it possible to thin significantly faster than COMPRESS++ without significantly sacrificing approximation error? Lower bounds tracing out the computational-statistical trade-offs in distribution compression would provide a precise benchmark for optimality and point to any remaining opportunities for improvement.

REPRODUCIBILITY STATEMENT

See the goodpoints Python package for Python implementations of all methods in this paper and https://github.com/microsoft/goodpoints for code reproducing each experiment.

ACKNOWLEDGMENTS

We thank Carles Domingo-Enrich for alerting us that an outdated proof of Thm. 2 was previously included in the appendix. RD acknowledges the support by the National Science Foundation under Grant No.
DMS-2023528 for the Foundations of Data Science Institute (FODSI). Part of this work was done when AS was interning at Microsoft Research New England. Christoph M Augustin, Aurel Neic, Manfred Liebmann, Anton J Prassl, Steven A Niederer, Gundolf Haase, and Gernot Plank. Anatomically accurate high resolution modeling of human whole heart electromechanics: A strongly scalable algebraic multigrid solver method for nonlinear deformation. Journal of computational physics, 305:622 646, 2016. (Cited on page 1.) Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011. (Cited on page 2.) S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford, 2013. ISBN 9780199535255. URL https://books.google.com/books? id=ko Nq WRluh P0C. (Cited on pages 2, 12, 15, and 19.) Bernard Chazelle and Jiri Matousek. On linear-time deterministic algorithms for optimization problems in fixed dimension. Journal of Algorithms, 21(3):579 597, 1996. (Cited on page 2.) Yutian Chen, Max Welling, and Alex Smola. Super-samples from kernel herding. In Proceedings of the Twenty Sixth Conference on Uncertainty in Artificial Intelligence, UAI 10, pp. 109 116, Arlington, Virginia, USA, 2010. AUAI Press. ISBN 9780974903965. (Cited on pages 2 and 7.) Raaz Dwivedi and Lester Mackey. Kernel thinning. ar Xiv preprint ar Xiv:2105.05842, 2021. (Cited on pages 1, 2, 3, 7, 9, 23, and 24.) Raaz Dwivedi and Lester Mackey. Generalized kernel thinning. In International Conference on Learning Representations, 2022. (Cited on pages 3, 7, 9, 20, 21, and 22.) Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123 214, 2011. (Cited on page 23.) Brian C Goodwin. Oscillatory behavior in enzymatic control process. Advances in Enzyme Regulation, 3: 318 356, 1965. (Cited on page 7.) Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Sch olkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(25):723 773, 2012. (Cited on page 2.) Heikki Haario, Eero Saksman, and Johanna Tamminen. Adaptive proposal distribution for random walk Metropolis algorithm. Computational Statistics, 14(3):375 395, 1999. (Cited on page 23.) Robert Hinch, JL Greenstein, AJ Tanskanen, L Xu, and RL Winslow. A simplified local control model of calcium-induced calcium release in cardiac ventricular myocytes. Biophysical journal, 87(6):3723 3736, 2004. (Cited on page 7.) Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. Examples are not enough, learn to criticize! criticism for interpretability. Advances in neural information processing systems, 29, 2016. (Cited on page 2.) Alfred James Lotka. Elements of physical biology. Williams & Wilkins, 1925. (Cited on page 7.) Jiri Matousek. Approximations and optimal geometric divide-and-conquer. Journal of Computer and System Sciences, 50(2):203 208, 1995. (Cited on page 2.) Steven A Niederer, Lawrence Mitchell, Nicolas Smith, and Gernot Plank. Simulating human cardiac electrophysiology on clinical time-scales. Frontiers in Physiology, 2:14, 2011. (Cited on page 1.) Art B Owen. Statistically efficient thinning of a Markov chain sampler. Journal of Computational and Graphical Statistics, 26(3):738 744, 2017. (Cited on page 1.) Jeff M Phillips. Algorithms for ε-approximations of terrains. 
In International Colloquium on Automata, Languages, and Programming, pp. 447 458. Springer, 2008. (Cited on page 2.) Jeff M Phillips and Wai Ming Tai. Near-optimal coresets of kernel density estimates. Discrete & Computational Geometry, 63(4):867 887, 2020. (Cited on page 2.) Marina Riabiz, Wilson Chen, Jon Cockayne, Pawel Swietach, Steven A Niederer, Lester Mackey, and Chris Oates. Optimal thinning of MCMC output. ar Xiv preprint ar Xiv:2005.03952, 2020a. (Cited on pages 2, 7, and 23.) Marina Riabiz, Wilson Ye Chen, Jon Cockayne, Pawel Swietach, Steven A. Niederer, Lester Mackey, and Chris J. Oates. Replication Data for: Optimal Thinning of MCMC Output, 2020b. URL https://doi. org/10.7910/DVN/MDKNWM. Accessed on Mar 23, 2021. (Cited on page 23.) Christian P Robert and George Casella. Monte Carlo integration. In Monte Carlo statistical methods, pp. 71 138. Springer, 1999. (Cited on page 1.) Gareth O Roberts and Richard L Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341 363, 1996. (Cited on page 23.) Marina Strocchi, Matthias AF Gsell, Christoph M Augustin, Orod Razeghi, Caroline H Roney, Anton J Prassl, Edward J Vigmond, Jonathan M Behar, Justin S Gould, Christopher A Rinaldi, Martin J Bishop, Gernot Plank, and Steven A Niederer. Simulating ventricular systolic motion in a four-chamber heart model with spatially varying robin boundary conditions to model the effect of the pericardium. Journal of Biomechanics, 101:109645, 2020. (Cited on page 1.) Ilya Tolstikhin, Bharath K Sriperumbudur, and Krikamol Muandet. Minimax estimation of kernel mean embeddings. The Journal of Machine Learning Research, 18(1):3002 3048, 2017. (Cited on page 2.) Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389 434, 2012. doi: 10.1007/s10208-011-9099-z. URL https://doi.org/10.1007/ s10208-011-9099-z. (Cited on pages 15 and 18.) Vito Volterra. Variazioni e fluttuazioni del numero d individui in specie animali conviventi. 1926. (Cited on page 7.) A Additional Definitions and Notation 12 B Proof of Thm. 1: Runtime and integration error of COMPRESS 13 C Proof of Thm. 2: MMD guarantees for COMPRESS 13 D Proof of Thm. 3: Runtime and integration error of COMPRESS++ 19 E Proof of Thm. 4: MMD guarantees for COMPRESS++ 19 F Proofs of Exs. 3, 4, 5, and 6 20 G Supplementary Details for Experiments 22 H Streaming Version of COMPRESS 23 A ADDITIONAL DEFINITIONS AND NOTATION This section provides additional definitions and notation used throughout the appendices. We associate with each algorithm ALG and input Sin the measure difference ϕALG(Sin) PSin PSALG = 1 x Sin δx 1 nout P x SALG δx (14) that characterizes how well the output empirical distribution approximates the input. We will often write ϕALG instead of ϕALG(Sin) for brevity if Sin is clear from the context. We also make use of the following standard definition of a sub-Gaussian random variable (see, e.g., Boucheron et al., 2013, Sec. 2.3). Definition 4 (Sub-Gaussian random variable) We say that a random variable G is sub-Gaussian with parameter ν and write G G(ν) if E exp(λ G) exp λ2ν2 2 for all λ R. Given Def. 4, it follows that ALG Gf(ν) for a function f as in Def. 2 if and only if the random variable ϕALG(f) PSinf PSALGf is sub-Gaussian with parameter ν conditional on the input Sin. In our proofs, it is often more convenient to work with an unnormalized measure discrepancy ψALG(Sin) n ϕALG(Sin) (14) = P x Sin δx n nout P SALG δx. 
(15) By definition (15), we have the following useful equivalence: ψALG(f) n ϕALG(f) G(σALG) ϕALG(f) G(νALG) for σALG =n νALG. (16) The following standard lemma establishes that the sub-Gaussian property is closed under scaling and summation. Lemma 1 (Summation and scaling preserve sub-Gaussianity) Suppose G1 G(σ1). Then, for all β R, we have β G1 G(βσ1). Furthermore, if G1 is F-measurable and G2 G(σ2) given F, then G1 + G2 G( p σ2 1 + σ2 2). Proof Fix any β R. Since G1 G(σ1), for each λ R, E exp(λ β G1) exp λ2(βσ1)2 so that βG1 G(βσ1) as advertised. Furthermore, if G1 is F-measurable and G2 G(σ2) given F, then, for each λ R, E h exp λ (G1 + G2) i = E exp(λ G1 + λ G2) = E h exp(λ G1) E exp(λ G2) | F i exp λ2σ2 2 2 E h exp λ f(G2) i = exp λ2σ2 1 2 exp λ2σ2 2 2 = exp λ2(σ2 1+σ2 2) 2 so that G1 + G2 G( p σ2 1 + σ2 2) as claimed. B PROOF OF THM. 1: RUNTIME AND INTEGRATION ERROR OF COMPRESS First, we bound the running time of COMPRESS. By definition, COMPRESS makes four recursive calls to COMPRESS on inputs of size n/4. Then, HALVE is run on an input of size 2g+1 n. Thus, r C satisfies the recursion r C(n) = 4r C n 4 + r H( n2g+1). Since r C(4g) = 0, we may unroll the recursion to find that r C(n) = Pβn i=0 4ir H(2g+1 as claimed in (5). Next, we bound the sub-Gaussian error for a fixed function f. In the measure discrepancy (15) notation of App. A we have ψC(Sin) = P4 i=1 ψC(Si) + n2 g 1ψH( e S) (17) where Si and e S are defined as in Alg. 1. Unrolling this recursion, we find that running COMPRESS on an input of size n with oversampling parameter g leads to applying HALVE on 4i coresets of size ni = 2g+1 i n for 0 i βn. Denoting these HALVE inputs by (Sin i,j)j [4i], we have ψC(Sin) = n2 g 1 Pβn i=0 P4i j=1 2 iψH(Sin i,j). (18) Now define σH(n) = nνH(n). Since ψH(Sin i,j)(f) are σH(ni) sub-Gaussian given (Sin i ,j )i >i,j 1 and (Sin i,j )j j, Lem. 1 implies that ψC(Sin)(f) is σC sub-Gaussian given Sin for σ2 C(n) = n4 g 1 Pβn i=0 σ2 H(ni). Recalling the relation (16) between σ and ν from App. A, we conclude that ν2 C(n) = Pβn i=0 4 iν2 H(ni). as claimed in (6). C PROOF OF THM. 2: MMD GUARANTEES FOR COMPRESS Our proof proceeds in several steps. To control the MMD (1), we will control the Hilbert norm of the measure discrepancy of COMPRESS (15), which we first write as a weighted sum of measure discrepancies from different (conditionally independent) runs of HALVE. To effectively leverage the MMD tail bound assumption for this weighted sum, we reduce the problem to establishing a concentration inequality for the operator norm of an associated matrix. We carry out this plan in four steps summarized below. First, in App. C.1 we express the MMD associated with each HALVE measure discrepancy as the Euclidean norm of a suitable vector (Lem. 2). Second, in App. C.2 we define a matrix dilation operator for a vector that allows us to control vector norms using matrix spectral norms (Lem. 3). Third, in App. C.3 we prove and apply a sub-Gaussian matrix Freedman concentration inequality (Lem. 4) to control the MMD error for the COMPRESS output, which in turn requires us to establish moment bounds for these matrices by leveraging tail bounds for the MMD error (Lem. 5). Finally, we put together the pieces in App. C.4 to complete the proof. We now begin our formal argument. We will make use of the unrolled representation (17) for the COMPRESS measure discrepancy ψC(Sin) in terms of the HALVE inputs (Sin k,j)j [4k] of size nk = 2g+1 k n for 0 k log4 n g 1. 
For brevity, we will use the shorthand ψC ψC(Sin), ψH k,j ψH(Sin k,j), and ψT ψT(SC) hereafter. C.1 REDUCING MMD TO VECTOR EUCLIDEAN NORM Number the elements of Sin as (x1, . . . , xn), define the n n kernel matrix K (k(xi, xj))n i,j=1, and let K 1 2 denote a matrix square-root such that K = K 1 2 K 1 2 (which exists since K is a positive semidefinite matrix for any kernel k). Next, let Sout k,j denote the output sequence corresponding to ψH k,j (i.e., running HALVE on Sin k,j), and let {ei}n i=1 denote the canonical basis of Rn. The next lemma (with proof in App. C.5) relates the Hilbert norms to Euclidean norms of carefully constructed vectors. Lemma 2 (MMD as a vector norm) Define the vectors uk,j K 1 2 Pn i=1 ei 1(xi Sin k,j) 2 1(xi Sout k,j ) , and u C Plog4 n g 1 k=0 P4k j=1 wk,juk,j, (19) where wk,j n 2g+1+k . Then, we have n2 MMD2 k(Sin, SC) = u C 2 2, and (20) E[uk,j|(uk ,j : j [4k ], k > k)] = 0 for k = 0, . . . , log4 n g 2, (21) and uk,j for j [4k] are conditionally independent given (uk ,j : j [4k ], k > k). Applying (20), we effectively reduce the task of controlling the MMD errors to controlling the Euclidean norm of suitably defined vectors. Next, we reduce the problem to controlling the spectral norm of a suitable matrix. C.2 REDUCING VECTOR EUCLIDEAN NORM TO MATRIX SPECTRAL NORM To this end, we define a symmetric dilation matrix operator: given a vector u Rn, define the matrix Mu as Mu 0 u u 0n n R(n+1) (n+1). (22) It is straightforward to see that u 7 Mu is a linear map. In addition, the matrix Mu also satisfies a few important properties (established in App. C.6) that we use in our proofs. Lemma 3 (Properties of the dilation operator) For any u Rn, the matrix Mu (22) satisfies Mu op (a) = u 2 (b) = λmax(Mu), and Mq u (c) u q 2In+1 for all q N. (23) Define the shorthand Mk,j Mwk,juk,j (defined in Lem. 2). Applying Lems. 2 and 3, we find that n MMDk(Sin, SC) (20) = u C 2 (23) = λmax(Mu C) (i) = λmax(Plog4 n g 1 k=0 P4k j=1 Mk,j), (24) where step (i) follows from the linearity of the dilation operator. Thus to control the MMD error, it suffices to control the maximum eigenvalue of the sum of matrices appearing in (24). C.3 CONTROLLING THE SPECTRAL NORM VIA A SUB-GAUSSIAN MATRIX FREEDMAN INEQUALITY To control the maximum eigenvalue of the matrix Mu C, we make use of (24) and the following sub Gaussian generalization of the matrix Freedman inequality of Tropp (2012, Thm. 7.1). The proof of Lem. 4 can be found in App. C.7. For two matrices A and B of the same size, we write A B if B A is positive semidefinite. Lemma 4 (Sub-Gaussian matrix Freedman inequality) Consider a sequence (Yi)N i=1 of selfadjoint random matrices in Rm m and a fixed sequence of scalars (Ri)N i=1 satisfying E h Yi| Yj i 1 j=1 i (A) = 0 and E h Yq i | Yj i 1 j=1 2)!Rq i I, for all i [N] and q 2N. (25) Define the variance parameter σ2 PN i=1 R2 i . Then, P[λmax(PN i=1 Yi) σ p 8(t + log m)] e t for all t > 0, and equivalently P[λmax(PN i=1 Yi) σ p 8 log(m/δ)] 1 δ for all δ (0, 1]. To apply Lem. 4 with the matrices Mk,j, we need to establish the zero-mean and moment bound conditions for suitable Rk,j in (25). C.3.1 VERIFYING THE ZERO MEAN CONDITION (25)(A) FOR Mk,j To this end, first we note that the conditional independence and zero-mean property of ψH k,j implies that the random vectors uk,j and the matrices Mk,j also satisfy a similar property, and in particular that E Mk,j | Mk ,j : k > k, j [4k ] = 0 for j [4k], k {0, 1, . . . , log4 n g 1}. 
C.3.2 ESTABLISHING MOMENT BOUND CONDITIONS (25)(B) FOR Mk,j IN TERMS OF Rk,j VIA MMD TAIL BOUNDS FOR HALVE To establish the moment bounds on Mk,j, note that Lems. 2 and 3 imply that Mq k,j = Mq wk,juk,j (23) wk,juk,j q 2 In+1 (20) = wq k,j uk,j q 2 In+1 (27) where wk,j was defined in Lem. 2. Thus it suffices to establish the moment bounds on uk,j q 2. To this end, we first state a lemma that converts tail bounds to moment bounds. See App. C.8 for the proof inspired by Boucheron et al. (2013, Thm. 2.3). Lemma 5 (Tail bounds imply moment bounds) For a non-negative random variable Z, t] e t for all t 0 = E[Zq] (2a+2v)q( q 2)! for all q 2N. To obtain a moment bound for uk,j 2, we first state some notation. For each n, define the quantities a n nan, v n nvn (28) where an and vn are the parameters such that HALVE Gk(an, vn) on inputs of size n. Using an argument similar to Lem. 2, we have uk,j 2 = nk,j MMDk(Sin k,j, Sout k,j ) for nk,j = |Sin k,j| = n2g+1 k. Thereby, using the Gk assumption on HALVE implies that P[ uk,j 2 a ℓ k + v ℓ k t | (uk ,j : j [4k ], k > k)] e t for all t 0, (29) ℓ k nk,j = n2g+1 k (30) and, notably, ℓn = ℓ 0. Combining the bound (29) with Lem. 5 yields that E[ uk,j q 2 | (uk ,j : j [4k ], k > k)] ( q 2)!(2a ℓ k + 2v ℓ k)q, (31) for all q 2N, where ℓk is defined in (29). Now, putting together (27) and (31), and using the conditional independence of Mk,j, we obtain the following control on the q-th moments of Mk,j for q 2N: E Mq k,j Mk ,j , k > k, j [4k ] (27) wq k,j E uk,j q 2 n uk ,j , k > k, j [4k ] o In+1 (31) wq k,j (2a ℓ k + 2v ℓ k)q( q 2)!Rq k,j In+1 where Rk,j 2wk,j(a ℓ k + v ℓ k) (32) where ℓk is defined in (30). In summary, the computation above establishes the condition (B) from the display (25) for the matrices Mk,j in terms of the sequence Rk,j defined in (32). C.4 PUTTING THE PIECES TOGETHER FOR PROVING THM. 2 log4 n g 2(a n2g+1 + v n2g+1) (33) Now, putting (26) and (32) together, we conclude that with a suitable ordering of the indices (k, j), the assumptions of Lem. 4 are satisfied by the random matrices Mk,j, j [4k], k {0, 1, . . . , log4 n g 1} with the sequence Rk,j . Now, since ℓ k = n2g+1 k (29) is decreasing in k, wk,j = ℓ k 4g+1 (as defined in Lem. 2), and a n and v n (28) are assumed non-decreasing in n, we find that n2 eσ2 (33) = n2(log4 n g)(2(a n2g+1 + v n2g+1))2 (29) = (log4 n g) n 4g+1 (2(a ℓ 0 + v ℓ 0))2 Plog4 n g 1 k=0 n 4g+1 (2(a ℓ k + v ℓ k))2 = Plog4 n g 1 k=0 P4k j=1 n 4g+1+k (2(a ℓ k + v ℓ k))2 = Plog4 n g 1 k=0 P4k j=1(2wk,j(a ℓ k + v ℓ k))2 (32) = Plog4 n g 1 k=0 P4k j=1 R2 k,j. Finally, applying (24) and invoking Lem. 4 with σ neσ and m n + 1, we conclude that P[MMD(Sin, SC) eσ p 8(log(n + 1) + t)] (24) = P[λmax(Plog4 n g 1 k=0 P4k j=1 Mk,j) neσ p 8(log(n + 1) + t)] e t for all t > 0, which in turn implies P[MMD(Sin, SC) ean + evn t] e t for t 0, since the parameters evn, ean (8) satisfy evn (8)= 4(aℓn +vℓn) p 2(log4 n g) (33) = eσ 8, and ean (8)= evn p log(n+1) = eσ p 8 log(n + 1). Comparing with Def. 3, Thm. 2 follows. C.5 PROOF OF LEM. 2: MMD AS A VECTOR NORM Let vk,j Pn i=1 ei 1(xi Sin k,j) 2 1(xi Sout k,j ) . By the reproducing property of k we have ψH k,j(k) 2 k = P x Sin k,j k(x, ) 2 P x Sout k,j k(x, ) 2 x Sin k,j,y Sin k,j k(x, y) 2 P x Sout k,j ,y Sin k,j k(x, y) + P x Sout k,j ,y Sout k,j k(x, y) = v k,j Kvk,j (19) = uk,j 2 2. (34) Using (18), (19), and (22), and mimicking the derivation above (34), we can also conclude that ψC(k) 2 k = u C 2 2. 
Additionally, we note that MMDk(Sin, SC) = sup f k=1 1 n f, ψC(k) Finally the conditional independence and zero mean property (21) follows from (18) by noting that conditioned on (Sin k ,j )k >k,j 1, the sets (Sin k,j)j 1 are independent. C.6 PROOF OF LEM. 3: PROPERTIES OF THE DILATION OPERATOR For claim (a) in the display (23), we have u 2 2 0 n 0n uu ! (i) u 2 2In+1 = Mu op (ii) = u 2, where step (i) follows from the standard fact that uu u 2 2In and step (ii) from the facts M2 uee1 = u 2 2ee1 for ee1 the first canonical basis vector of Rn+1 and Mu 2 op = M2 u op. Claim (b) follows directly by verifying that the vector v = [1, u u 2 ] is an eigenvector of Mu with eigenvalue u 2. Finally, claim (c) follows directly from the claim (a) and the fact that Mq u op = Mu q op for all integers q 1. C.7 PROOF OF LEM. 4: SUB-GAUSSIAN MATRIX FREEDMAN INEQUALITY We first note the following two lemmas about the tail bounds and symmetrized moment generating functions (MGFs) for matrix valued random variables (see Apps. C.9 and C.10 respectively for the proofs of Lems. 6 and 7). Lemma 6 (Sub-Gaussian matrix tail bounds) Let Xk Rm m k 1 be a sequence of selfadjoint matrices adapted to a filtration Fk, and let Ak Rm m k 1 be a sequence of determin- istic self-adjoint matrices. Define the variance parameter σ2 P k Ak op. If, for a Rademacher random variable ε independent of (Xk, Fk)k 1, we have log E exp(2εθXk)|Fk 1 2θ2Ak for all θ R, (35) then we also have k Xk t) i me t2/(8σ2) for all t 0. Lemma 7 (Symmetrized sub-Gaussian matrix MGF) For a fixed scalar R, let X be a selfadjoint matrix satisfying EX = 0 and EXq ( q 2)!Rq I for q 2N. (36) If ε is a Rademacher random variable independent of X, then E exp(2εθX) exp 2θ2R2I for all θ R. The assumed conditions (25) allow us to apply Lem. 7 conditional on (Yi)i 0, since P[X+ > u] = P[X > u] = P[Z > a + u] for any u > 0, we apply the tail bounds on Z to control the moments of X+. In particular, we have E Xq + ] (i) = q R 0 uq 1P[X+ > u]du (ii) = q R 0 (v t)q 1P[X+ > v (iii) qvq R 0 tq/2 1e tdt (iv) = qvqΓ( q where we have applied (i) integration by parts, (ii) the substitution u = v t, and (iii) the assumed tail bound for Z. Since Z = X + a, the convexity of the function t 7 tq for q 1, and Jensen s inequality imply that for each q 2N, we have EZq 2q 1(aq + E|X|q) 2q 1(2aq + EXq +) (2a)q + 2q 1qvqΓ( q = (2a)q + 2q 1qvq( q (2a + 2v)q( q 2)! where the last step follows since xq + yq (x + y)q for all q N and x, y 0. The proof is now complete. C.9 PROOF OF LEM. 6: SUB-GAUSSIAN MATRIX TAIL BOUNDS The proof of this result is identical to that of Tropp (2012, Proof of Thm. 7.1) as the same steps are justified under our weaker assumption (35). Specifically, applying the arguments from Tropp (2012, Proof of Thm. 7.1), we find that E tr exp(Pn k=1 θXk) E tr exp Pn 1 k=1 θXk + log E exp(2εθXn)|Fn 1 (35) E tr exp Pn 1 k=1 θXk + 2θ2An (i) tr exp 2θ2 Pn k=1 Ak (ii) m exp 2θ2σ2 , (37) where step (i) follows by iterating the arguments over k = n 1, . . . , 1 and step (ii) from the standard fact that tr(exp(A)) m exp(A) op = m exp( A op) for an m m self-adjoint matrix A. Next, applying the matrix Laplace transform method Tropp (2012, Prop. 3.1), for all t > 0, we have k Xk t) i infθ>0 n e θt E tr exp(Pn k=1 θXk) o (37) m infθ>0 n e θt e2θ2σ2o = me t2/(8σ2), where the last step follows from the choice θ = t 4σ2 . The proof is now complete. C.10 PROOF OF LEM. 7: SYMMETRIZED SUB-GAUSSIAN MATRIX MGF E[exp(2εθX)] = I + P q=1 2qθq q! E[εq Xq] (i) = I + P k=1 22kθ2k (2k)! 
E[X2k] (ii) I + P k=1 22kθ2k k!R2k (iii) I + P k=1 (2θ2R2)k k! I = exp(2θ2R2I), where step (i) uses the facts that (a) E[εq] = 1(q 2N) and (b) E[εq Xq] = E[εq]E[Xq] since ε is independent of X, step (ii) follows from the assumed condition (36), and step (iii) from the fact that 2kk! (2k)! 1 k! (Boucheron et al., 2013, Proof of Thm. 2.1). D PROOF OF THM. 3: RUNTIME AND INTEGRATION ERROR OF COMPRESS++ First, the runtime bound (9) follows directly by adding the runtime of COMPRESS(HALVE, g) as given by (5) in Thm. 1 and the runtime of THIN. Recalling the notation (14) and (15) from App. A and noting the definition of the point sequences SC and SC++ in Alg. 2, we obtain the following relationship between the different discrepancy vectors: ϕC(Sin) = 1 x Sin δx 1 2g n P ϕT(SC) = 1 2g n P x SC δx 1 n P x SC++ δx, and ϕC++(Sin) = 1 x Sin δx 1 n P = ϕC(Sin) + ϕT(SC). Noting the Gf property of HALVE and applying Thm. 1, we find that ϕC(Sin)(f) is sub-Gaussian with parameter νC(n) defined in (6). Furthermore, by assumption on THIN, given SC, the variable ϕT(SC)(f) is νC( ℓn 2 ) sub-Gaussian. The claim now follows directly from Lem. 1. E PROOF OF THM. 4: MMD GUARANTEES FOR COMPRESS++ Noting that MMD is a metric, and applying triangle inequality, we have MMDk(Sin, SC++) MMDk(Sin, SC) + MMDk(SC, SC++). Since SC++ is the output of THIN(2g) with SC as the input, applying the MMD tail bound assumption (38) with |SC| = n2g substituted in place of n, we find that P h MMD(SC, SC++) a 2g n+v 2g n t i e t for all t 0. Recall that ℓn/2 = 2g n. Next, we apply Thm. 2 with HALVE to conclude that P[MMDk(Sin, SC) ean + evn t] e t for all t 0. Thus, we have P h MMDk(Sin, SC++) a ℓn/2 + ean + (v ℓn/2 + evn) t i 2 e t for all t 0, which in turn implies that P h MMDk(Sin, SC++) a ℓn/2 + ean + (v ℓn/2 + evn) log 2 + (v ℓn/2 + evn) t i e t for all t 0, thereby yielding the claimed result. F PROOFS OF EXS. 3, 4, 5, AND 6 We begin by defining the notions of sub-Gaussianity and k-sub-Gaussianity on an event. Definition 5 (Sub-Gaussian on an event) We say that a random variable G is sub-Gaussian on an event E with parameter σ if E[1[E] exp(λ G)] exp( λ2σ2 2 ) for all λ R. Definition 6 (k-sub-Gaussian on an event) For a kernel k, we call a thinning algorithm ALG ksub-Gaussian on an event E with parameter v and shift a if P[E, MMDk(Sin, SALG) an + vn t | Sin] e t for all t 0. (38) We will also make regular use of the unrolled representation (17) for the COMPRESS measure discrepancy ψC(Sin) in terms of the HALVE inputs (Sin k,j)j [4k] of size nk = 2g+1 k n for 0 k βn. (39) For brevity, we will use the shorthand ψC ψC(Sin), ψH k,j ψH(Sin k,j), and ψT ψT(SC) hereafter. F.1 PROOF OF EX. 3: KT-SPLIT-COMPRESS For HALVE = KT-SPLIT( ℓ2 n4g+1(βn+1)δ) when applied to an input of size ℓ, the proof of Thm. 1 in Dwivedi & Mackey (2022) identifies a sequence of events Ek,j and random signed measures ψk,j such that, for each 0 k βn, j [4k], and f with f k = 1, (a) P[Ec k,j] (i) n2 k n4g+1(βn+1) δ 2 (ii) = 1 2 δ 4k(βn+1), (b) 1[Ek,j]ψH k,j = 1[Ek,j] ψk,j, and (c) ψk,j(f) is nk νH(nk) sub-Gaussian (7) given ( ψk ,j )k >k,j 1 and ( ψk,j )j k,j 1, ( ψk,j )j k,j 1, ( ψk,j )j k,j 1 and ( ψk,j )j k,j 1, ( ψk,j )j k,j 1, ( ψk,j )j 16) for j = 1, 2, . . . , 32. Here we also provide additional results with mixture of Gaussian targets given by P = 1 M PM j=1 N(µj, Id) for M {4, 6, 8}. The mean locations for these are given by µ1 = [ 3, 3] , µ2 = [ 3, 3] , µ3 = [ 3, 3] , µ4 = [3, 3] , µ5 = [0, 6] , µ6 = [ 6, 0] , µ7 = [6, 0] , µ8 = [0, 6] . Fig. 
4 plots the MMD errors of KT and herding experiments for the mixture of Gaussians targets with 4, 6 and 8 centers, and notice again that COMPRESS++ provides a competitive performance to the original algorithm, in fact suprisingly, improves upon herding. 45 47 49 Input size n KT-Comp: n 0.46 KT-Comp++: n 0.52 45 47 49 Input size n KT-Comp: n 0.46 KT-Comp++: n 0.51 45 47 49 Input size n KT-Comp: n 0.45 KT-Comp++: n 0.52 45 47 49 Input size n Herd-Comp: n 0.44 Herd-Comp++: n 0.51 Herd: n 0.49 45 47 49 Input size n Herd-Comp: n 0.43 Herd-Comp++: n 0.51 Herd: n 0.48 45 47 49 Input size n Herd-Comp: n 0.44 Herd-Comp++: n 0.51 Herd: n 0.48 Figure 4: For M-component mixture of Gaussian targets, KT-COMPRESS++ and Herd-COMPRESS++ improve upon the MMD of i.i.d. sampling (ST) and closely track or improve upon the error of their quadratic-time input algorithms, KT and kernel herding (Herd). See App. G.1 for more details. G.2 DETAILS OF MCMC TARGETS Our set-up for the MCMC experiments is identical to that of Dwivedi & Mackey (2021, Sec. 6), except that we use all post-burn-in points to generate our Goodwin and Lotka-Volterra input point sequences Sin instead of only the odd indices. In particular, we use the MCMC output of Riabiz et al. (2020b) described in (Riabiz et al., 2020a, Sec. 4) and perform thinning experiments after discarding the burn-in points. To generate an input Sin of size n for a thinning algorithm, we downsample the post-burn-in points using standard thinning. For Hinch, we additionally do coordinate-wise normalization by subtracting the sample mean and dividing by sample standard deviation of the post-burn-in-points. In Sec. 5, RW and ADA-RW respectively refer to Gaussian random walk and adaptive Gaussian random walk Metropolis algorithms (Haario et al., 1999) and MALA and p MALA respectively refer to the Metropolis-adjusted Langevin algorithm (Roberts & Tweedie, 1996) and pre-conditioned MALA (Girolami & Calderhead, 2011). For Hinch experiments, RW 1 and RW 2 refer to two independent runs of Gaussian random walk, and Tempered denotes the runs targeting a tempered Hinch posterior. For more details on the set-up, we refer the reader to Dwivedi & Mackey (2021, Sec. 6.3, App. J.2). H STREAMING VERSION OF COMPRESS COMPRESS can be efficiently implemented in a streaming fashion (Alg. 3) by viewing the recursive steps in Alg. 1 as different levels of processing, with the bottom level denoting the input points and the top level denoting the output points. The streaming variant of the algorithm efficiently maintains memory at several levels and processes inputs in batches of size 4g+1. At any level i (with i = 0 denoting the level of the input points), whenever there are 2i4g+1 points, the algorithm runs HALVE on the points in this level, appends the output of size 2i 14g+1 to the points at level i+1, and empties the memory at level i (and thereby level i never stores more than 2i4g+1 points). In this fashion, just after processing n = 4k+g+1 points, the highest level is k + 1, which contains a compressed coreset of size 2k 14g+1 = 2k+g+12g = n2g (outputted by running HALVE at level k for the first time), which is the desired size for the output of COMPRESS. Algorithm 3: COMPRESS (Streaming) Outputs stream of coresets of size 2g n for n = 4k+g+1 and k N Input: halving algorithm HALVE, oversampling parameter g, stream of input points x1, x2, . . . S0 {} // Initialize empty level 0 coreset for t = 1, 2, . . . 
, do S0 S0 (xj)t 4g+1 j=1+(t 1) 4g+1 // Process input in batches of size 4g+1 if t == 4j for j N then Sj+1 {} // Initialize level j + 1 coreset after processing 4j+g+1 input points end for i = 0, . . . , log4 t + 1 do if |Si| == 2i4g+1 then S HALVE(Si) // Halve level i coreset to size 2i 14g+1 Si+1 Si+1 S // Update level i + 1 coreset: has size {1, 2, 3, 4} 2i 14g+1 Si {} // Empty coreset at level i end end if t == 4j for j N then output Sj+1 // Coreset of size n2g with n t4g+1 and t = 4j for j N end end Our next result analyzes the space complexity of the streaming variant (Alg. 3) of COMPRESS. The intuition for gains in memory requirements is very similar to that for running time, as we now maintain (and run HALVE) on subsets of points with size much smaller than the input sequence. We count the number of data points stored as our measure of memory. Proposition 1 (COMPRESS Streaming Memory Bound) Let HALVE store s H(n) data points on inputs of size n. Then, after completing iteration t, the streaming implementation of COMPRESS (Alg. 3) has used at most s C(t) = 4g+3 t + s H(2g+1 t) data points of memory. Proof At time t, we would like to estimate the space usage of the algorithm. At the ith level of memory, we can have at most 2i+24g data points. Since we are maintaining a data set of size at most t4g at time t, there are at most log t 2 levels. Thus, the maximum number of points stored at time t is bounded by P0.5 log t i=0 2i+24g 4g+3 Furthermore, at any time up to time t, we have run HALVE on a point sequence of size at most t2g+1 which requires storing at most s H( t2g+1) additional points. Example 7 (KT-COMPRESS and KT-COMPRESS++) First consider the streaming variant of COMPRESS with HALVE = symmetrized KT( ℓ 2n ℓn δ) for HALVE inputs of size ℓas in Ex. 4. Since s KT(n) n (Dwivedi & Mackey, 2021, Sec. 3), Prop. 1 implies that s C(n) 4g+4 n. Next consider COMPRESS++ with the streaming variant of COMPRESS, with HALVE = symmetrized KT( ℓ 2n ℓn+2g nδ) when applied to an input of size ℓ, and THIN = KT( g n 2g+gδ) as in Ex. 6. The space complexity s C++(n) = s C(n)+s KT(ℓn)=4g+4 n + ℓn 4g+5 n. Setting g as in Ex. 6, we get s C++(n) = O( n log2 n).
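To complement Alg. 3, here is a hedged Python sketch of the same level structure: input arrives in batches of size 4^{g+1}, each full level is halved and promoted, and a coreset of size 2^g √n is emitted whenever n = 4^{j+g+1} points have been consumed. The `halve` callable, the batching interface, and all names are our own illustrations, not the goodpoints API; a uniformly random halver is included only so the example runs.

```python
import numpy as np

def random_halve(points, rng=np.random.default_rng(0)):
    """Placeholder halving algorithm: keep a uniformly random half of the rows."""
    idx = rng.permutation(len(points))[: len(points) // 2]
    return points[np.sort(idx)]

def streaming_compress(stream, halve, g):
    """Streaming COMPRESS sketch (Alg. 3): consume input in batches of size 4**(g+1) and
    yield a coreset of size 2**g * sqrt(n) whenever n = 4**(j+g+1) points have arrived.
    Level i never holds more than 2**i * 4**(g+1) points, so memory stays O(4**g * sqrt(n))."""
    batch_size = 4 ** (g + 1)
    levels = [[]]                            # levels[i] = current level-i coreset
    next_output = 1                          # next t at which to emit a coreset (t = 4**j)
    for t, batch in enumerate(stream, 1):    # each batch: (4**(g+1), d) array
        levels[0].extend(batch)
        i = 0
        while len(levels[i]) == 2**i * batch_size:   # level i is full: halve and promote
            halved = halve(np.asarray(levels[i]))    # size 2**(i-1) * 4**(g+1)
            if i + 1 == len(levels):
                levels.append([])
            levels[i + 1].extend(halved)
            levels[i] = []
            i += 1
        if t == next_output:
            yield np.asarray(levels[-1])     # coreset of size 2**g * sqrt(t * 4**(g+1))
            next_output *= 4

# Usage: feed 16 batches of 4**(g+1) = 64 points each with g = 2.
rng = np.random.default_rng(4)
batches = (rng.standard_normal((4**3, 2)) for _ in range(16))
for coreset in streaming_compress(batches, random_halve, g=2):
    print(len(coreset))   # 32, 64, 128 (coresets for n = 64, 256, 1024 input points)
```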