# lowrank_thinning__354ad7c8.pdf

Low-Rank Thinning

Annabelle Michael Carrell 1 Albert Gong 2 Abhishek Shetty 3 Raaz Dwivedi 2 Lester Mackey 4

The goal in thinning is to summarize a dataset using a small set of representative points. Remarkably, sub-Gaussian thinning algorithms like Kernel Halving and Compress can match the quality of uniform subsampling while substantially reducing the number of summary points. However, existing guarantees cover only a restricted range of distributions and kernel-based quality measures and suffer from pessimistic dimension dependence. To address these deficiencies, we introduce a new low-rank analysis of sub-Gaussian thinning that applies to any distribution and any kernel, guaranteeing high-quality compression whenever the kernel or data matrix is approximately low-rank. To demonstrate the broad applicability of the techniques, we design practical sub-Gaussian thinning approaches that improve upon the best known guarantees for approximating attention in transformers, accelerating stochastic gradient training through reordering, and distinguishing distributions in near-linear time.

1. Introduction

This work is about thinning, finding a small set of representative points to accurately summarize a larger dataset. Here, we use the term dataset liberally to refer to any collection of points, be they experimental observations (as in Sec. 6), stochastic gradients (as in Sec. 5), or key-value pairs (as in Sec. 4). State-of-the-art thinning techniques provably improve upon uniform subsampling but only for restricted classes of kernel-based quality measures and with pessimistic dependence on the data dimension (see, e.g.,

1University of Cambridge 2Cornell Tech 3MIT 4Microsoft Research New England. Correspondence to: Annabelle Carrell <ac2411@cam.ac.uk>, Albert Gong <agong@cs.cornell.edu>, Abhishek Shetty <ashetty1995@gmail.com>, Raaz Dwivedi <dwivedi@cornell.edu>, Lester Mackey <lmackey@microsoft.com>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Harvey & Samadi, 2014; Phillips & Tai, 2020; Alweiss et al., 2021; Dwivedi & Mackey, 2024; 2022; Shetty et al., 2022; Li et al., 2024). We introduce a new analysis for sub-Gaussian thinning algorithms that applies to any kernel and shows that one can efficiently identify a better-thanuniform set of representative points whenever the kernel or data matrix is nearly low-rank. This opens the door to a variety of impactful applications including approximate dotproduct attention in transformers, accelerating model training through gradient reordering, and distinguishing distributions with deep kernels in near-linear time.

In Sec. 2, we introduce our formal definition of thinning, two kernel-based measures of thinning quality, and a suite of candidate thinning algorithms. Sec. 3 presents our main theorem relating thinning quality to low-rank properties of a dataset and its induced kernel matrix. In Sec. 4, we translate the problem of approximating attention into a thinning problem and develop a practical solution, Thinformer, with state-of-the-art quality guarantees. In Sec. 5, we develop a thinned stochastic gradient reordering rule that provably accelerates training and bridges the theory-practice gaps left open by prior work. Finally, in Sec. 6 we derive new and improved power guarantees for practical thinned hypothesis tests that distinguish distributions in near-linear time. Each section also includes its own discussion of related work.

Notation. For each n N and a, b R, we define [n] {1, . . . , n}, a b min(a, b), and a b max(a, b). We let A op, A max, and A 2, respectively represent the maximum singular value, absolute entry, and row Euclidean norm of a matrix A and let λr(K) denote the rth largest eigenvalue of a suitable matrix K. We also define the Euclidean norm balls Bm {u Rm : u 2 1} and Bm(R) R Bm for each m N and R > 0. For an event E and an integrable random variable X, we define EE[X] E[X 1[E]]. We write an e O(bn) to mean an bn polylog(n).

2. Sub-Gaussian Thinning

We begin with a formal definition of our problem setting. Consider a fixed collection of nin input points Xin belonging to a potentially larger universe of datapoints

Low-Rank Thinning

Table 1: Examples of (K, ν, δ)-sub-Gaussian thinning algorithms. For input size nin, output size nout nin, and K max = 1 we report each sub-Gaussian parameter ν and runtime up to constants independent of (nin, nout, δ, K).

Prop. B.1 SUBSAMPLING

Prop. B.2 KH(δ)

Prop. B.5 KH-COMPRESS(δ)

Prop. B.6 GS-THIN

Prop. B.10 GS-COMPRESS

parameter ν Sub-Gaussian 1 nout

log(nout/δ)

log(nout) log(nout/δ)

nout 1 nout

Runtime nout n2 in n2 out n3 in n3 out

X {x1, . . . , xn}. The aim of a thinning algorithm is to select nout points from Xin that together accurately summarize Xin. This is formalized by the following definition.

Definition 1 (Thinning algorithms). A thinning algorithm ALG takes as input Xin and returns a possibly random subset Xout of size nout. We denote the input and output empirical distributions by Pin 1 nin P

x Xin δx and Pout 1 nout P

x Xout δx and define the induced probability vectors pin, pout n 1 over the indices [n] by

pin,i = 1[xi Xin]

nin and pout,i = 1[xi Xout]

nout for all i [n].

When X Rd, we use X [x1, . . . , xn] Rn d to denote the input point matrix so that

Ex Pin[x] = X pin and Ex Pout[x] = X pout.

We will make use of two common measures of summarization quality.

Definition 2 (Kernel MMD and max seminorm). Given two distributions µ, µ and a reproducing kernel k (Steinwart & Christmann, 2008, Def. 4.18), the associated kernel maximum mean discrepancy (MMD) is the worst-case difference in means for functions in the unit ball Bk of the associated reproducing kernel Hilbert space:

MMDk(µ, µ) supf Bk |Ex µf(x) Ex µf(x)|.

When µ = Pin and µ = Pout as in Def. 1 and K (k(xi, xj))n i,j=1 Rn n denotes the induced kernel matrix, then the MMD can be expressed as a Mahalanobis distance between pin and pout:

MMDk(Pin, Pout) = p

(pin pout) K(pin pout)

MMDK(pin, pout).

For any indices I [n], we further define the kernel max seminorm (KMS)

K(pin pout) I maxi I |e i K(pin pout)|. (1)

Notably, when the input points lie in Rd and k(xi, xj) is the linear kernel xi, xj (so that K = XX ), MMD measures the Euclidean discrepancy in datapoint means between the input and output distributions:

MMDK(pin, pout) = X pin X pout 2.

A common strategy for bounding the error of a thinning algorithm is to establish its sub-Gaussianity.

Definition 3 (Sub-Gaussian thinning algorithm). We write ALG Gν,δ(K) and say ALG is (K, ν, δ)-sub Gaussian, if ALG is a thinning algorithm, K is a symmetric positive semidefinite (SPSD) matrix, ν > 0, δ [0, 1), and there exists an event E with probability at least 1 δ/2 such that, the input and output probability vectors satisfy

EE[exp( u, K(pin pout) )] exp ν2

2 u Ku , u Rn.

Here, the sub-Gaussian parameter ν controls the summarization quality of the thinning algorithm, and we see from Tab. 1 that a variety of practical thinning algorithms are (K, ν, δ)-sub-Gaussian for varying levels of ν.

2.1. Examples of sub-Gaussian thinning algorithms

Perhaps the simplest sub-Gaussian thinning algorithm is uniform subsampling: by Prop. B.1, selecting nout points from Xin uniformly at random (without replacement) is (K, ν, 0)-sub-Gaussian with ν = p

K max/ nout. Unfortunately, uniform subsampling suffers from relatively poor summarization quality. As we prove in App. B.1.1, its root-mean-squared MMD and KMS are both Ω(1/ nout) meaning that nout = 10000 points are needed to achieve 1% relative error.

Proposition 1 (Quality of uniform subsampling). For any I [n], a uniformly subsampled thinning satisfies

E[MMD2 K(pin, pout)] = 1 nout nin nout

nin 1 CK and

E[ K(pin pout) 2 I] 1 nout nin nout

nin 1 maxi I CKeie i K

for any SPSD K with CK Pn i=1 pin,i Kii p in Kpin.

Low-Rank Thinning

Fortunately, uniform subsampling is not the only sub Gaussian thinning algorithm available. For example, the Kernel Halving (KH(δ)) algorithm of Dwivedi & Mackey (2024) provides a substantially smaller sub-Gaussian parameter, ν = O( p

log(nout/δ)/nout), in n2 in time by selecting one out of every two points and biasing selection toward the point that yields smaller approximation error. Similarly, the KH-COMPRESS(δ) algorithm of Shetty et al. (2022, Ex. 3) delivers ν = O( p

log(nout) log(nout/δ)/nout) in only n2 out time by halving and concatenating coresets of geometrically increasing size until the target output size is met. We derive simplified versions of these algorithms with identical sub-Gaussian constants in Apps. B.2 and B.5 and a linear-kernel variant (LKH(δ)) with nind runtime in App. B.3. To round out our set of examples, we show in App. B.6.1 that two new thinning algorithms based on the Gram-Schmidt walk of Bansal et al. (2018) yield even smaller ν at the cost of increased runtime. We call these algorithms Gram-Schmidt Thinning (GS-THIN) and GSCOMPRESS.

3. Low-rank Sub-Gaussian Thinning

One might hope that the improved sub-Gaussian constants of Tab. 1 would also translate into improved quality metrics. Our main result, proved in App. C, shows that this is indeed the case whenever the inputs are approximately low-rank.

Theorem 1 (Low-rank sub-Gaussian thinning). Fix any δ (0, 1), r n, and I [n]. If ALG Gν,δ(K), then the following bounds hold individually with probability at least 1 δ/2 δ :

MMD2 K(pin, pout) ν2 e2r + e log( 1

nout 1 nin ) and (2)

K(pin pout) I νDI q

2 log( 2|I|

Here, λj denotes the j-th largest eigenvalue of K, λn+1 0, and DI maxi I Kii.

Suppose that, in addition, X Rd and |Kil Kjl| LK xi xj 2 for some LK > 0 and all i, j I and l supp(pin). Then, with probability at least 1 δ/2 δ ,

K(pin pout) I νDI p

2 log(4/δ )(1 + 32

2 3 rank(XI) log( 3e2RILK D2 I (RILK)) (4)

for RI maxi I xi 2 and XI [xi] i I.

Let us unpack the three components of this result. First, Thm. 1 provides a high-probability O(ν p

log(|I|)) bound (3) on the KMS for any kernel and any sub-Gaussian thinning algorithm on any space. In particular, the non-uniform

algorithms of Tab. 1 all enjoy O(log(nout) p

log(|I|)/nout) KMS, a significant improvement over the Ω(1/ nout) KMS of uniform subsampling. The bound (3) follows from the sub-Gaussianity of the thinning algorithm (Def. 3) and the union bound over ei for each i I.

For datapoints in Rd, Thm. 1 also provides a refined O(ν p

rank(XI) log(RILK)) bound (4) on KMS. For bounded data, this trades an explicit dependence on the number of query points |I| for a rank factor that is never larger (and sometimes significantly smaller) than d. This refinement follows from a more elaborate chaining argument that frames (e i K(pin pout))i I as a sub-Gaussian process (Lem. C.5) and uses the Lipschitzness of K to control its entropy integral. We will make use of these results when approximating dot-product attention in Sec. 4.

Notably, Thm. 3.1 of Phillips & Tai (2020) implies that any thinning algorithm must incur at least Ω(

d/nout) KMS error for some dataset in Rd and many common kernels. Meanwhile, our Tab. 1 and Thm. 1 imply that GS-THIN has ν = O(1/nout) and hence KMS O(

d/nout). Taken together, these results imply that no algorithm can have sub-Gaussian constant ν = o(1/nout) and that GS-THIN enjoys minimax rate-optimal KMS and a minimax rateoptimal sub-Gaussian constant.

Finally, Thm. 1 provides an O(ν r + p

λr+1/nout) highprobability bound on kernel MMD, where the approximate rank parameter r can be freely optimized. We establish this result by projecting K1/2(pin pout) onto the first r eigenvectors of K, bounding the residual error in terms of λr+1, and bounding the projection magnitude with high probability using the sub-Gaussianity of ALG and a union bound over a finite cover of a Euclidean ball in Rr.

When K = (k(xi, xj))n i,j=1 is generated by a finite-rank kernel k, like a linear kernel xi, xj , a polynomial kernel (1 + xi, xj )p, or a random Fourier feature kernel (Rahimi & Recht, 2007), this guarantee becomes O(ν) and improves upon uniform subsampling whenever ν = o(1/ nout). In this case, the non-uniform algorithms of Tab. 1 all enjoy O(log(nout)/nout) MMD, a significant improvement over the Ω(1/ nout) MMD of uniform subsampling. We will revisit this finite-rank setting when studying stochastic gradient acceleration strategies in Sec. 5.

More generally, Thm. 1 guarantees improved MMD even for full-rank K, provided that the eigenvalues of K decay sufficiently rapidly. For example, optimizing over the approximate rank parameter r yields an O(ν logp/2(nout)) bound under exponential eigenvalue decay λr+1 = O(ne cr1/p) and an O(ν

1 2(p+1) ) bound under polynomial eigenvalue decay λr+1 = O(n/rp). Fortunately, some of the most commonly-used kernels generate kernel matrices with rapid eigenvalue decay.

Low-Rank Thinning

For example, the popular Gaussian kernel on Rd,

GAUSS(η) : k(x, y) = exp( η x y 2 2) for η > 0, (5)

generates K = (k(xi, xj))n i,j=1 satisfying

2e r1/d log dr1/d

4e2ηR2 for (2e)d r < n (6)

whenever X Bd(R) (Altschuler et al., 2019, Thm. 3). Combined with Thm. 1, this fact immediately yields an MMD guarantee for each algorithm in Tab. 1. We present a representative guarantee for KH(δ). Corollary 1 (Gaussian MMD of KH). If Xin Bd(R) for R > 0, then KH(δ) with k = GAUSS(η), and n = nin delivers

MMD2 K(pin, pout)

O log(nout/δ)

n2 out log(nout) (R2η)

d d + log( 1

with probability at least 1 δ/2 δ .

The proof in App. D provides a fully explicit and easily computed bound on the Gaussian MMD. Under the same assumptions, the distinct analysis of Dwivedi & Mackey (2022, Thm. 2, Prop. 3) provides a squared MMD bound of size Θ log(nout/δ)

n2 out ( logd+1(nout)Rdηd/2

(log log(nout))d + log( 1

δ )) . Notably, Cor. 1 improves upon this best known KH(δ) guarantee whenever the datapoint radius R = O(log nout), a property that holds almost surely for any bounded, sub-Gaussian, or subexponential data sequence (see Dwivedi & Mackey, 2024, Prop. 2).

Altschuler et al. (2019, Thm. 4) additionally showed that Gaussian kernel matrix eigenvalues satisfy

λr+1 ne cr2/(5d ) for 1 r < n (7)

for a constant c independent of X when X belongs to a smooth compact manifold of dimension d < d. In this case, our low-rank analysis yields adaptive MMD guarantees that scale with the potentially much smaller intrinsic dimension d . We use Thm. 1 to prove the first such intrinsic-dimension guarantee for KH(δ) in App. E. Corollary 2 (Intrinsic Gaussian MMD of KH). If Xin lies on a smooth manifold Ω Bd of dimension d < d (Assump. E.1), then KH(δ) with k = GAUSS(η), and n = nin delivers

MMD2 K(pin, pout) O log( nout

δ ) n2 out ( log(nout)

with probability at least 1 δ

2 δ for c independent of Xin.

In Sec. 6, we will use Cors. 1 and 2 to establish new guarantees for distinguishing distributions in near-linear time.

With our core theory in hand, we now turn our attention to a series of impactful applications.

4. Approximating Attention

We will first use our analysis to accelerate attention approximation in transformers. Dot-product attention lies at the heart of the transformer neural network architecture that has revolutionized natural language processing, computer vision, and speech recognition over the last decade (Vaswani et al., 2017; Dosovitskiy et al., 2021; Dong et al., 2018). Given a collection of query, key, and value vectors (qi, ki, vi)n i=1 each in Rd, dot-product attention computes the softmax matrix

T ATTENTION((qi)n i=1, (kj, vj)n j=1) D 1AV (8)

for Aij exp( qi,kj

d ), D = diag(A1n), and Vij vij.

While attention has enjoyed unprecedented success in capturing long-range dependencies amongst datapoints, its computation is expensive, requiring Θ(d n2) time to construct and multiply the matrix A. This quadratic-time bottleneck has inspired a plethora of practical approximate attention mechanisms (e.g., Kitaev et al., 2020; Choromanski et al., 2021; Chen et al., 2021), but, to our knowledge, only two guarantee accurate reconstruction of the softmax matrix T (Zandieh et al., 2023; Han et al., 2024).1 In this section, we design a new fast attention approximation based on sub-Gaussian thinning and derive guarantees that improve upon the prior art.

4.1. Thinning attention in theory

Algorithm 1: Thinformer

Input: Queries, keys, and values (qi, ki, vi)n i=1 in Rd, nout // Define key-value attention kernel

katt(( k, v), ( k , v )) exp k, k v, v

// Thin augmented key-value pairs using katt vmax max i [n] vi ; ( ki, vi)n i=1 (ki/d 1 4 , (vi, vmax))n i=1

Xout KH-COMPRESS(0.5)(Xin = ( ki, vi)n i=1, katt, nout)

// Return exact attention on selected key-value subset

return b T ATTENTION (qi)n i=1, {(k, v) : ( k, v) Xout}

Alg. 1 summarizes our new Thinformer module. At its heart is a new key-value attention kernel katt that mimics the special structure of the softmax matrix T. Alg. 1 uses the attention kernel and a high-quality thinning algorithm, KH-COMPRESS(0.5), to subselect key-value pairs and then computes exact attention (8) for the key-value subset. In total, this requires only O(d n2 out) time to run KH-COMPRESS(0.5) and O(d n nout) time to compute

1A third remarkable work (Alman & Song, 2024) establishes upper and lower bounds for attention approximation but without a practical implementation.

Low-Rank Thinning

Table 2: Practical approximations with guarantees. For each approximation b T Rn d to the softmax matrix T (8), we report, up to a constant factor, the best worst-case error guarantee for b T T max given O(d n1+a) running time and γ-bounded (9) queries and keys. Here, the ratio V op/ V 2, lies in [1, n] and τ = 0.173 + o(1).

Approximation Guarantee

d log(n V max) log n

KDEformer n2γ+ τ

Hyper Attention n 17γ

3 (log n) 1 6 na/6 V op

ATTENTION with n queries and nout key-value pairs. In contrast, computing the exact softmax matrix T with standard matrix multiplication requires Θ(d n2) time. Our next result, proved in App. F, shows that Alg. 1 also admits a strong quality guarantee for approximating T.

Theorem 2 (Quality of Thinformer). With probability at least 1

2, Thinformer (Alg. 1) yields

b T T max c exp( 2R2

log2(nout) log(8nout log2 nin nout )

(d + 1) log(3e2( R2

d + 2) V max) + p

log(8)(4 + 128

3 ) and R maxi [n] max( ki 2, qi 2).

To put this result into context, let us compare with the existing guarantees for practical attention approximation, summarized in Tab. 2. Under the γ-boundedness assumption,

maxi [n] max( ki 2 2, qi 2 2) γ

d log n, (9)

the KDEformer approximation (Zandieh et al., 2023, Cor. 3.6) with τ = 0.173 + o(1), the Hyper Attention approximation (Han et al., 2024, Thm. 1) with no masking, and the Thinformer approximation (Thm. 2) guarantee the b T T max bounds of Tab. 2 with O(dn1+a) runtime and probability at least 1

2. The Thinformer guarantee exhibits four improvements over its predecessors. First, it establishes a significantly faster error decay rate (n a versus n a/2 or n a/6) for a given subquadratic runtime n1+a. Second, it reduces the dependence on the error inflation factor γ. Third, like the Hyper Attention guarantee, it eliminates all dependence on the KDEformer penalty parameter τ. Finally, it reduces dependence on the value matrix by a factor of V op V 2, [1, n].

Put otherwise, with bounded V 2, , b Tthin can provide consistent (i.e., b Tthin T max 0 as n ) subquadratic estimation whenever γ is bounded away from 1/2

and guarantee, for example, O( 1 n) error in e O(dn 3 2 +2γ)

time. In contrast, the b Tkde and b Thyp bounds require quadratic runtime to guarantee O( 1 n) error in the best case ( V op = O(1)) and cannot guarantee consistent subquadratic estimation in the worst case ( V op = Ω( n)).

4.2. Thinning attention in practice

To gauge the practical effectiveness of Alg. 1, we recreate the benchmark Tokens-To-Token Vision Transformer (T2T-Vi T) and Big GAN image generation experiments of Zandieh et al. (2023). In the T2T-Vi T experiment, attention approximations are scored on their Image Net classification accuracy and computational expense when used as drop-in replacements for the two most expensive attention layers in a pretrained T2T-Vi T neural network (Yuan et al., 2021). In the Big GAN experiment, approximations are scored on their computational expense and two popular measures of image generation quality, the Frechet Inception Distance (FID, Heusel et al., 2017) and Inception Score (IS, Salimans et al., 2016). Using the exact implementations and settings provided by Zandieh et al. (2023), we benchmark our Py Torch implementation of Thinformer against exact attention and four leading attention approximations: Performer (Choromanski et al., 2021), Reformer (Kitaev et al., 2020), Scatter Brain (Chen et al., 2021), and KDEformer.

In Tab. 3, we find that Thinformer (g = 2) provides the highest Top-1 accuracy on the Image Net 2012 validation set (Russakovsky et al., 2015), while running faster than all of the alternatives. In Tab. 4, Thinformer (g = 2) yields better FID and IS than all of the alternatives while running significantly faster than exact, KDEformer, Reformer, and Scatter Brain. Performer runs faster still but at the expense of substantially worse FID and IS. The final attention call of Thinformer can also be combined with optimized attention implementations like Flash Attention (Dao et al., 2022; Dao, 2024) to further reduce the time and memory footprint. We provide Py Torch code replicating this experiment at https://github.com/microsoft/ thinformer and supplementary experiment details in App. L.1.

5. Faster SGD Training

We now turn to a second application, accelerating training through gradient reordering. To train a machine learning model parameterized by w Rd, a standard approach is to minimize the empirical risk f(w) 1

n Pn i=1 fi(w) using stochastic gradient descent (SGD) updates,

n = wk+ i 1

n α fπk(i)(wk+ i 1

for each epoch k [K] and datapoint i [n]. Here, α > 0 is a step size, each fi is a datapoint-specific loss function,

Low-Rank Thinning

Table 3: Quality of T2T-Vi T attention approximations on Image Net. We report mean Top-1 accuracy 1 standard deviation across five random seeds and mean forward pass runtime 1 standard deviation across 50 batches of 64 images.

Attention Algorithm Top-1 Accuracy (%) Layer 1 Runtime (ms) Layer 2 Runtime (ms)

Exact 82.55 0.00 18.48 0.12 1.40 0.01 Performer 80.56 0.30 2.54 0.01 0.60 0.01 Reformer 81.47 0.06 7.84 0.03 1.53 0.01 KDEformer 82.00 0.07 5.39 0.03 2.28 0.03 Scatterbrain 82.05 0.08 6.86 0.02 1.55 0.03 Thinformer (Ours) 82.18 0.05 2.06 0.01 0.54 0.00

Table 4: Quality of Big GAN attention approximations for image generation. We report Frechet Inception Distance (FID) with the Image Net validation set, Inception Scores (IS), and mean forward pass runtime 1 standard deviation across 10 batches of 32 images. A lower FID or higher IS indicates better image generation quality.

Attention Algorithm FID ( ) IS ( ) Runtime (ms)

Exact 32.18 58.37 4.21 5.79 0.02 Performer 33.58 38.07 3.43 2.29 0.01 Reformer 72.23 19.14 2.09 12.14 0.02 KDEformer 30.70 56.91 4.16 6.74 0.46 Scatter Brain 38.47 36.84 2.90 3.12 0.02 Thinformer (Ours) 30.54 57.12 3.96 2.69 0.01

and πk is a permutation of [n] representing the order in which datapoints are processed in the k-th epoch.

Algorithm 2: Thinned Reordering

Input: Stochastic gradients (xk i fπk(i)(wk+ i 1

n ))n i=1, prior ordering πk, thinning algorithm ALG

// Select half of points using linear kernel X k out ALG(Xin = (xk i )n i=1, nout = n

2 , k(x, y) = x, y ) Π []; Π [] // Initialize empty start and end lists for i = 1, . . . , n do

Π.append(πk(i)) if xk i X k out else Π .prepend(πk(i)) end return πk+1 = concatenate(Π, Π )

Typically, one selects the orderings πk uniformly at random, but recent work has demonstrated faster convergence using non-uniform, adaptively selected orderings. Specifically, Lu et al. (2022); Cooper et al. (2023) show that any sufficiently accurate thinning algorithm can be efficiently transformed into a reordering rule that improves the convergence rate of SGD by a substantial e O(n 1) factor. Their approach, distilled in Alg. 2, uses an elegant construction of Harvey & Samadi (2014, Thm. 10) to translate a high-quality thinning of stochastic gradients into a higherquality reordering. However, these prior studies leave two

problems unaddressed.

First, while the established convergence rates of Lu et al. (2022) nearly match the minimax lower bounds for permuted SGD algorithms (Cha et al., 2023, Thm. 4.5), a multiplicative gap of size Θ(d) remains in the worst case. This led Cha et al. (2023) to declare, It is an open problem whether there exists a permutation-based SGD algorithm that gives a dimension-free upper bound while maintaining the same dependency on other factors.

Second, Lu et al. (2022) carry out their analysis using the self-balancing walk (SBW) thinning algorithm of Alweiss et al. (2021) but find its overhead to be too high in practice. Hence, in all experiments they instead employ a greedy thinning algorithm that often works well in practice but is not covered by their analysis.

5.1. Bridging the dimension gap

To address the first problem, we derive a new guarantee for SGD with LKH reordering that replaces the typical Θ(d) penalty with a soft notion of rank.

Definition 4 (ϵ-rank). The ϵ-rank, rankϵ(X), of a matrix X is the number of singular values greater than ϵ.

Theorem 3 (LKH-SGD convergence). Suppose that, for

Low-Rank Thinning

0 10 20 30 40 50 Epochs

Full Train Loss

0 10 20 30 40 50 Epochs

Test Accuracy

0 200 400 600 800 1000 1200 Seconds

Full Train Loss

CD-Gra B: SBW RR LKH-SGD CD-Gra B: Greedy

0 200 400 600 800 1000 1200 Seconds

Test Accuracy

Figure 1: Train and test convergence trajectories for mortgage classification with reordered SGD variants. We display mean values 1 standard deviation across 5 random seeds. See Sec. 5.2 for more details.

all i [n] and w, v Rd, the losses f and fi satisfy

fi(w) f(w) 2 2 σ2 (bounded noise),

fi(w) fi(v) 2 L w v 2 (smoothness), and

f(w) f 1 2µ f(w) 2 2 (PL) for f infv Rd f(v).

Then, with probability at least 1

2, SGD (10) with LKH( 1

2K ) reordering (Alg. 2) and step size α given in App. G satisfies

f(w K) f e O( r n2K2 ) for

r maxk [K] rankϵk([xk 1, . . . , xk n]), xk 1

n Pn i=1 xk i ,

and ϵk max i [n]

9e log(4Kn log(en/2)) log(4Kn) xk i xk 2 n .

The proof of Thm. 3 in App. G simply uses Thm. 1 to bound the thinning quality of LKH( 1

2K ) and then adapts the prior SGD analysis of Cooper et al. (2023). Notably, the standard practice of random reshuffling, i.e., SGD with uniform reordering, can only guarantee a significantly slower Ω( 1 n K2 ) rate under these assumptions (Rajput et al., 2020, Thm. 2), while Lu et al. (2022, Thm. 4) implies a similar but dimension-dependent e O( d n2K2 ) rate for SBW reordering. Thm. 3 matches the minimax lower bound of Cha et al. (2023, Thm. 4.5) up to the ϵ-rank parameter and shows that dimension dependence can be avoided when the gradient update matrices [xk 1, . . . , xk n] are low-rank, or, more generally, ϵ = O( log(Kn) n )-approximable by low-rank matrices.

5.2. Bridging the theory-practice gap

Two criticisms levied by Lu et al. (2022) against the SBW algorithm were the need to estimate the maximum Euclidean norm of any possible gradient vector in advance and the need to tune its free hyperparameter. LKH( 1

2K ) has neither of these drawbacks as it automatically adapts

to the scale of each input and has no hyperparameters to tune. Moreover, with a linear kernel, LKH( 1

2K ) can be run online in O(nd) time. Hence, LKH( 1

2K ) is a promising substitute for the greedy thinning of Lu et al. (2022); Cooper et al. (2023). Indeed, when we recreate the Home Mortgage Disclosure Act logistic regression experiment of Cooper et al. (2023) with a single worker (Fig. 1), we find that LKH-SGD strongly outperforms the standard practice of random reshuffling (RR) and the theoretically justified but overly conservative CD-Gra B: SBW variant. In addition, LKH-SGD matches the state-of-the-art test accuracy of CD-Gra B: Greedy and lags only slightly in terms of training convergence.

The accelerated convergence rate of LKH-SGD over the standard slow SGD rate of RR provides a direct verification of the Thm. 3 guarantee, and we further verify in Fig. L.1 that the singular values of the gradient update matrices drop off steeply, resulting in relatively small ϵk-ranks (see Fig. L.1). See https://github.com/microsoft/ khsgd for Py Torch code replicating this experiment and App. L.2 for supplementary experiment details.

6. Cheap Two-Sample Testing

Our final application is two-sample testing, determining whether two datasets are drawn from the same underlying distribution. We observe independent samples X (xi)m i=1 and Y (yj)n j=1 from the unknown distributions P and Q respectively, and we seek to accept or reject the null hypothesis that P = Q. Standard kernel MMD tests tackle this task by computing the empirical MMD

MMDk(Pin, Qin) for Pin, Qin 1

for an appropriate kernel k and rejecting the null hypothesis whenever MMDk(Pin, Qin) is sufficiently large (Gretton et al., 2012). Such tests are prized both for their broad

Low-Rank Thinning

applicability and for their high discriminating power, that is, their probability of rejecting the null when P = Q. A standard way to summarize the power properties of a test is through its detectable separation rate.

Definition 5 (Detectable separation rate). We say a twosample test has detectable separation rate ϵk,m,n if, for any detection probability 1 β (0, 1), there exists a constant ck,β > 0 such that the test has power at least 1 β of rejecting the null whenever MMDk(P, Q) ck,β ϵk,m,n.

Standard MMD tests can detect distributional differences on the order of ϵk,m,n = 1

min(m,n) (Gretton et al., 2012,

Cor. 9), and this detectable separation rate is known to be the best possible for MMD tests (Domingo-Enrich et al., 2023, Prop. 2) and minimax optimal for translation invariant kernels (Kim & Schrab, 2023, Thm. 8). However, standard MMD tests also suffer from the Θ((m + n)2) time burden of computing the empirical MMD. Recently, Domingo-Enrich et al. (2023) showed that one can improve scalability while preserving power by compressing Pin and Qin using a high-quality thinning algorithm. However, their analysis applies only to a restricted class of distributions and kernels and exhibits a pessimistic dimension dependence on Rd. Here, we offer a new analysis of their Compress Then Test approach that applies to any bounded kernel on any domain and, as an application, develop the first non-asymptotic power guarantees for testing with learned deep neural network kernels.

6.1. Low-rank analysis of Compress Then Test

Algorithm 3: Compress Then Test (CTT) Input: Samples (X, Y), # coresets s, compression level g, kernel k, failure probability δ, # replicates B, level α

Partition X into sm = sm m+n equal-sized bins (X (i))sm i=1 Partition Y into sn = sn m+n equal-sized bins (Y(i))sn i=1

// Identify coreset of size nout = 2gq

s for each bin

for i = 1, . . . , sm do P(i) out KT-COMPRESS(δ)(X (i), g, k) for i = 1, . . . , sn do Q(i) out KT-COMPRESS(δ)(Y(i), g, k)

// Compute CORESETMMD test statistic MB+1 MMDk( 1

sm Psm i=1 P(i) out, 1 sn Psn i=1 Q(i) out) (11)

// Simulate null by randomly permuting the s coresets B times for b = 1, . . . , B do

(P(i) out,b)sm i=1, (Q(i) out,b)sn i=1 PERMUTE((P(i) out)sm i=1, (Q(i) out)sn i=1)

sm Psm i=1 P(i) out,b, 1 sn Psn i=1 Q(i) out,b) end

// Threshold test statistic R position of MB+1 in an increasing ordering of (Mb)B+1 b=1 with ties broken uniformly at random return Reject with prob. min(1, max(0, R (1 α)(B + 1)))

Alg. 3 details the Compress Then Test (CTT) approach of Domingo-Enrich et al. (2023, Alg. 1). Given a coreset count s 2, a compression level g 0, and a nominal level α (0, 1), CTT divides X and Y into datapoint bins of size nin m+n

s , thins each bin down to size nout 2g nin using KT-COMPRESS(δ) (a refinement of KH-COMPRESS(δ) detailed in App. H), and uses the thinned coresets to cheaply approximate MMDk(Pin, Qin) and permuted versions thereof. Domingo-Enrich et al. (2023, (8)) showed that the total runtime of CTT is dominated by

O(4g(m + n)(s + log4 m+n

kernel evaluations, yielding a near-linear O((m + n) logc(m+n)) time algorithm whenever s = O(log4(m+ n)) and g c log4 log(m + n). Moreover, Prop. 1 of Domingo-Enrich et al. (2023) ensures that CTT has probability at most α of falsely rejecting the null hypothesis.

Our next, complementary result shows that CTT also matches the detectable separation rate of standard MMD tests up to an inflation factor Rk/2g depending on the compression level g.

Theorem 4 (Low-rank analysis of CTT power). Suppose the parameters of CTT (Alg. 3) satisfy m n,

γ ), and δ = min( eβ 6 , ( eβ 2 )1/ α(B+1) α 30es)

for eβ β 1+β/2 and γ α 4e( β 4 )1/ α(B+1) . Then CTT has detectable separation rate (Def. 5)

ϵk,m,n = (1 + Rk/2g)/ m,

where R2 k denotes the (1 eβ 20sn )-th quantile of

b R2 k log( m+n

min r 2nout

eβ ) + (λr+1(K) + λr+1(K ))nout .

for K (k(xi, xj))m i,j=1, K (k(yi, yj))n i,j=1, and k supx,y supp(P+Q)|k(x, y)|.

The proof in App. I combines the low-rank sub-Gaussian error bounds of Thm. 1 with the generic compressed power analysis of Domingo-Enrich et al. (2023, App. B.1) to yield power guarantees for bounded kernels on any domain. Notably, when rank(K) and rank(K ) are bounded or, more generally, polylog(n) one can choose the compression level g = Θ(log4 log(m + n)) to exactly match the optimal quadratic-time detectable separation rates with a near-linear time CTT test. Moreover, the inflation factors remain well-controlled whenever the induced kernel matrices exhibit rapid eigenvalue decay.

Low-Rank Thinning

As a concrete example, consider the learned deep neural network kernel of Liu et al. (2020),

kdeep(x, y) [(1 ϵ)κ(ϕ(x), ϕ(y)) + ϵ]q(x, y), (13)

where ϕ : Rd Rdembd is a pretrained neural network, q and κ are GAUSS(η) kernels (5) on Rd and Rdembd respectively, and ϵ (0, 1). This deep kernel generates full-rank kernel matrices (Liu et al., 2020, Prop. 5) but induces exponential eigenvalue decay due to its decomposition as a mixture of Gaussian kernels. Hence, as we show in App. J, CTT with kdeep, g = Θ(log4 log(m + n)), and sub-Gaussian inputs matches the detection quality of a quadratic-time MMD test in near-linear time. Corollary 3 (Power of deep kernel CTT). Instantiate the assumptions of Thm. 4 with k = kdeep (13). If the inputs (ϕ(x1), x1, ϕ(y1), y1) are sub-Gaussian, that is,

E[ec (ϕ(x1),x1,ϕ(y1),y1) 2 2] < (14)

for some c > 0, then CTT satisfies the conclusions of Thm. 4 with d dembd + d and

Rkdeep = O(log d

Moreover, when the input and neural features lie on smooth compact manifolds (as, e.g., in Zhu et al., 2018), the error inflation of CTT adapts to the smaller intrinsic manifold dimension, enabling an improved trade-off between runtime and detection power. See App. K for our proof. Corollary 4 (Power of deep manifold kernel CTT). Under the assumptions of Cor. 3, if x1, y1, (x1, ϕ(x1)), and (y1, ϕ(y1)) belong to smooth compact manifolds (Assump. E.1) with dimension d < d then CTT satisfies the conclusions of Thm. 4 with

Rkdeep = O(log 5d

Cors. 3 and 4 follow from explicitly bounding the eigenvalues of the generated deep kernel matrices as in (6) and (7). By Kim & Schrab (2023, Thm. 8), the separation rate of Thm. 4 is minimax optimal up to the inflation factor Rk/2g

and, hence, those of Cors. 3 and 4 are minimax optimal up to log factors.

One could alternatively bound the compression error of KT-COMPRESS(δ) using the covering number approach of Dwivedi & Mackey (2022, Thm. 2, Prop. 3). In the setting of Cor. 3, the argument of App. J combined with this distinct analysis would yield an alternative error inflation factor Rkdeep/2g with worse dimension dependence,

Rkdeep = Θ(log 3d

and without known adaptivity to an intrinsic manifold dimension.

0 10 20 30 40 50 Time per test (ms)

Power (1 - Type II Error)

CTT W-Block Subsampling Level 0.05

Figure 2: Time-power trade-off curves for detecting Higgs bosons with deep kernel MMD tests. We plot mean values 1 standard error across 1000 independent trials with level α = 0.05 and B = 100 permutations.

6.2. Powerful deep kernel testing in near-linear time

To evaluate the practical utility of deep kernel CTT, we follow the Higgs mixture experiment of Domingo-Enrich et al. (2023, Sec. 5) and use the deep kernel training procedure of Liu et al. (2020, Tab. 1). Here, the aim is to distinguish a Higgs boson signal process P from a background process Q given m = n = 16384 observations, d = 2 particle-detector features, and a five-layer fully-connected neural network ϕ with softplus activations and embedding dimension dembd = 20.

Fig. 2 compares the time-power trade-off curves induced by three fast kernel testing approaches to this problem: SUBSAMPLING, a standard wild-bootstrap MMD test (Chwialkowski et al., 2014) that simply evaluates empirical MMDkdeep using nout = mout uniformly subsampled points; W-BLOCK, a wild-bootstrap test that averages n

B subsampled squared MMDkdeep estimates based on nout = mout = B points (Zaremba et al., 2013); and CTT with s = 32 bins and varying g. We find that the CTT curve dominates that of the alternative methods and matches the power of an exact MMD test (SUBSAMPLING with nout = n) in a fraction of the time. The improvements of CTT over the standard power-runtime trade-off of SUBSAMPLING provides a direct verification of the Thm. 4 guarantee, and we additionally verify in Fig. L.2 that the empirical inflation factor b Rkdeep = O(log5(n)) in this setting due to approximate low-rankness. See https://github.com/ microsoft/deepctt for Py Torch code replicating this experiment and App. L.3 for supplementary experiment details.

Low-Rank Thinning

Impact Statement

This work introduced a new analysis of thinning algorithms that adapts to low-rank structures. We exploited this adaptivity to design fast algorithms with strong quality guarantees for three key applications in machine learning: dotproduct attention in Transformers, stochastic gradient training in optimization, and deep kernel testing for distinguishing distributions. More broadly, our techniques provide a general framework for reducing computational resource use in machine learning. Such tools have the potential to reduce energy costs and environmental harms from model training, inference, and evaluation and to improve accessibility in resource-constrained settings, all while provably maintaining high quality.

Acknowledgments We thank Insu Han, A. Feder Cooper, and Wentao Guo for their assistance with their code bases and datasets.

Alman, J. and Song, Z. Fast attention requires bounded entries. Advances in Neural Information Processing Systems, 36, 2024. (Cited on page 4.)

Altschuler, J., Bach, F., Rudi, A., and Niles-Weed, J. Massively scalable sinkhorn distances via the nystr om method. Advances in Neural Information Processing Systems, 32, 2019. (Cited on pages 4, 29, and 30.)

Alweiss, R., Liu, Y. P., and Sawhney, M. Discrepancy minimization via a self-balancing walk. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pp. 14 20, 2021. (Cited on pages 1 and 6.)

Bansal, N., Dadush, D., Garg, S., and Lovett, S. The gramschmidt walk: a cure for the banaszczyk blues. In Proceedings of the 50th annual acm sigact symposium on theory of computing, pp. 587 597, 2018. (Cited on pages 3, 19, and 22.)

Cha, J., Lee, J., and Yun, C. Tighter lower bounds for shuffling sgd: Random permutations and beyond. In International Conference on Machine Learning, pp. 3855 3912. PMLR, 2023. (Cited on pages 6 and 7.)

Chen, B., Dao, T., Winsor, E., Song, Z., Rudra, A., and R e, C. Scatterbrain: unifying sparse and low-rank attention approximation. In Proceedings of the 35th International Conference on Neural Information Processing Systems, Neur IPS 21, Red Hook, NY, USA, 2021. Curran Associates Inc. ISBN 9781713845393. (Cited on pages 4 and 5.)

Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, L., Belanger, D. B., Colwell, L. J., and Weller, A. Rethinking attention with performers. In International Conference on Learning Representations, 2021. URL https: //openreview.net/forum?id=Ua6zuk0WRH. (Cited on pages 4 and 5.)

Chwialkowski, K. P., Sejdinovic, D., and Gretton, A. A wild bootstrap for degenerate kernel tests. Advances in Neural Information Processing Systems, 27, 2014. (Cited on page 9.)

Cooper, A. F., Guo, W., Pham, K., Yuan, T., Ruan, C. F., Lu, Y., and De Sa, C. Coordinating distributed example orders for provably accelerated training. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. (Cited on pages 6, 7, 32, 33, and 37.)

Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=m Zn2Xyh9Ec. (Cited on page 5.)

Dao, T., Fu, D., Ermon, S., Rudra, A., and R e, C. Flashattention: Fast and memory-efficient exact attention with ioawareness. Advances in Neural Information Processing Systems, 35:16344 16359, 2022. (Cited on page 5.)

Domingo-Enrich, C., Dwivedi, R., and Mackey, L. Compress then test: Powerful kernel testing in near-linear time. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research. PMLR, 25 27 Apr 2023. (Cited on pages 8, 9, and 34.)

Dong, L., Xu, S., and Xu, B. Speech-transformer: A norecurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884 5888, 2018. doi: 10.1109/ICASSP.2018.8462506. (Cited on page 4.)

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum? id=Yicb Fd NTTy. (Cited on page 4.)

Dwivedi, R. and Mackey, L. Generalized kernel thinning. In International Conference on Learning Representations, 2022. (Cited on pages 1, 4, and 9.)

Dwivedi, R. and Mackey, L. Kernel thinning. Journal of Machine Learning Research, 25(152):1 77, 2024. (Cited on pages 1, 3, 4, 15, 16, 17, 18, 34, and 36.)

Gretton, A., Borgwardt, K. M., Rasch, M. J., Sch olkopf, B., and Smola, A. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723 773, 2012. (Cited on pages 7 and 8.)

Han, I., Jayaram, R., Karbasi, A., Mirrokni, V., Woodruff, D., and Zandieh, A. Hyperattention: Long-context attention in nearlinear time. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview. net/forum?id=Eh0Od2BJIM. (Cited on pages 4 and 5.)

Harshaw, C., S avje, F., Spielman, D. A., and Zhang, P. Balancing covariates in randomized experiments with the gram schmidt walk design. Journal of the American Statistical Association, pp. 1 13, 2024. (Cited on page 22.)

Harvey, N. and Samadi, S. Near-Optimal Herding. In Proceedings of The 27th Conference on Learning Theory, volume 35, pp. 1165 1182, 2014. (Cited on pages 1, 6, and 32.)

Low-Rank Thinning

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017. (Cited on page 5.)

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13 30, 1963. doi: 10.1080/01621459.1963. 10500830. (Cited on page 15.)

Horn, R. A. and Johnson, C. R. Matrix Analysis. Cambridge University Press, 1985. (Cited on page 36.)

Kim, I. and Schrab, A. Differentially private permutation tests: Applications to kernel methods. ar Xiv preprint ar Xiv:2310.19043, 2023. (Cited on pages 8 and 9.)

Kitaev, N., Kaiser, L., and Levskaya, A. Reformer: The efficient transformer. In International Conference on Learning Representations, 2020. URL https://openreview.net/ forum?id=rkg NKk Htv B. (Cited on pages 4 and 5.)

Li, L., Dwivedi, R., and Mackey, L. Debiased distribution compression. In Proceedings of the 41st International Conference on Machine Learning, volume 203 of Proceedings of Machine Learning Research. PMLR, 21 27 Jul 2024. (Cited on page 1.)

Liu, F., Xu, W., Lu, J., Zhang, G., Gretton, A., and Sutherland, D. J. Learning deep kernels for non-parametric two-sample tests. In International conference on machine learning, pp. 6316 6326. PMLR, 2020. (Cited on pages 9 and 38.)

Lu, Y., Guo, W., and De Sa, C. M. Grab: Finding provably better data permutations than random reshuffling. Advances in Neural Information Processing Systems, 35:8969 8981, 2022. (Cited on pages 6 and 7.)

Markov, A. On certain applications of algebraic continued fractions. Unpublished Ph. D. thesis, St Petersburg, 1884. (Cited on pages 26 and 27.)

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019. (Cited on page 36.)

Phillips, J. M. and Tai, W. M. Near-optimal coresets of kernel density estimates. Discrete & Computational Geometry, 63 (4):867 887, 2020. (Cited on pages 1 and 3.)

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, 20, 2007. (Cited on page 3.)

Rajput, S., Gupta, A., and Papailiopoulos, D. Closing the convergence gap of sgd without replacement. In International Conference on Machine Learning, pp. 7964 7973. PMLR, 2020. (Cited on page 7.)

Rudin, W. Functional Analysis. International series in pure and applied mathematics. Mc Graw-Hill, 1991. ISBN 9780070542365. URL https://books.google.com/ books?id=Sh_v AAAAMAAJ. (Cited on page 14.)

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. Image Net Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211 252, 2015. doi: 10.1007/ s11263-015-0816-y. (Cited on page 5.)

Saadetoglu, M. and Dinsev, S. M. Inverses and determinants of n n block matrices. Mathematics, 11(17), 2023. ISSN 2227-7390. doi: 10.3390/math11173784. URL https:// www.mdpi.com/2227-7390/11/17/3784. (Cited on page 23.)

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016. (Cited on page 5.)

Sherman, J. and Morrison, W. J. Adjustment of an Inverse Matrix Corresponding to a Change in One Element of a Given Matrix. The Annals of Mathematical Statistics, 21(1):124 127, 1950. doi: 10.1214/aoms/1177729893. URL https: //doi.org/10.1214/aoms/1177729893. (Cited on page 23.)

Shetty, A., Dwivedi, R., and Mackey, L. Distribution compression in near-linear time. In International Conference on Learning Representations, 2022. (Cited on pages 1, 3, 18, 19, 24, 34, and 35.)

Steinwart, I. and Christmann, A. Support Vector Machines. Springer Publishing Company, Incorporated, 1st edition, 2008. ISBN 0387772413. (Cited on page 2.)

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 17, pp. 6000 6010, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964. (Cited on page 4.)

Wainwright, M. J. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019. (Cited on pages 25, 28, and 29.)

Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z.-H., Tay, F. E., Feng, J., and Yan, S. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 558 567, 2021. (Cited on page 5.)

Zandieh, A., Han, I., Daliri, M., and Karbasi, A. Kdeformer: Accelerating transformers via kernel density estimation. In International Conference on Machine Learning, pp. 40605 40623. PMLR, 2023. (Cited on pages 4, 5, and 37.)

Zaremba, W., Gretton, A., and Blaschko, M. B-test: A nonparametric, low variance kernel two-sample test. Advances in Neural Information Processing Systems, 26, 2013. (Cited on page 9.)

Zhu, W., Qiu, Q., Huang, J., Calderbank, R., Sapiro, G., and Daubechies, I. Ldmnet: Low dimensional manifold regularized neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2743 2751, 2018. (Cited on page 9.)

Low-Rank Thinning

Appendix Contents

A Appendix Notation and Definitions 13

B Proof of Tab. 1: Sub-Gaussian Thinning Examples 14

B.1 SUBSAMPLING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

B.2 KH(δ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

B.3 LKH(δ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

B.4 RKH(δ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

B.5 KH-COMPRESS(δ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

B.6 GS-THIN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

B.7 GS-COMPRESS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

C Proof of Thm. 1: Low-rank sub-Gaussian thinning 24

C.1 Proof of MMD bound (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

C.2 Proof of kernel max seminorm bound (3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

C.3 Proof of Lipschitz kernel max seminorm bound (4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

C.4 Proof of Lem. C.4: Bounded-H older sub-Gaussian process . . . . . . . . . . . . . . . . . . . . . . . . 29

D Proof of Cor. 1: Gaussian MMD of KH 29

E Proof of Cor. 2: Intrinsic Gaussian MMD of KH 30

F Proof of Thm. 2: Quality of Thinformer 30

F.1 Proof of Lem. F.1: Decomposing attention approximation error . . . . . . . . . . . . . . . . . . . . . 31

F.2 Proof of Lem. F.2: KMS bound on attention approximation error . . . . . . . . . . . . . . . . . . . . 31

F.3 Proof of Lem. F.3: Thinned attention problem parameters . . . . . . . . . . . . . . . . . . . . . . . . 31

G Proof of Thm. 3: LKH-SGD convergence 32

H KT-COMPRESS(δ) 34

I Proof of Thm. 4: Low-rank analysis of CTT power 34

I.1 Proof of Thm. I.1: Low-rank analysis of CTT power, detailed . . . . . . . . . . . . . . . . . . . . . . 34

I.2 Proof of Lem. I.1: (K, ν, δ)-sub-Gaussian thinning algorithms are k-sub-Gaussian . . . . . . . . . . 35

J Proof of Cor. 3: Power of deep kernel CTT 36

K Proof of Cor. 4: Power of deep manifold kernel CTT 36

L Supplementary Experiment Details 36

Low-Rank Thinning

L.1 Approximating attention experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

L.2 Faster SGD training experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

L.3 Cheap two-sample testing experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

A. Appendix Notation and Definitions

We often use the shorthand (a)+ max(a, 0) as well as the shorthand k(X, X) to represent the matrix (k(xi, xj))n i,j=1. In addition, for each kernel k, we let Hk and k represent the associated reproducing kernel Hilbert space (RKHS) and RKHS norm, so that Bk = {f Hk : f k 1} and define

(Pin Pout)k 1 nin P

x Xin k(x, ) 1 nout P

x Xout k(x, ).

We also relate our definition of a sub-Gaussian thinning algorithm (Def. 3) to several useful notions of sub-Gaussianity.

Definition A.1 (Sub-Gaussian vector). We say that a random vector w Rn is (K, ν)-sub-Gaussian on an event E if K is SPSD and ν > 0 satisfies

EE exp(u Kw) exp( ν2

2 u Ku) for all u Rn. (15)

If, in addition, the event has probability 1, we say that w is (K, ν)-sub-Gaussian.

Notably, a thinning algorithm is (K, ν, δ)-sub-Gaussian if and only if its associated vector pin pout is (K, ν)-sub-Gaussian on an event E of probability at least 1 δ/2.

Definition A.2 (Sub-Gaussian function). For a kernel k, we say that a random function ϕ Hk is (k, ν)-sub-Gaussian on an event E if ν > 0 satisfies

EE[exp( f, ϕ k)] exp( ν2

2 f 2 k) for all f Hk. (16)

If, in addition, the event has probability 1, we say that ϕ is (k, ν)-sub-Gaussian.

Our next two lemmas show that for finitely-supported signed measures like Pin Pout, this notion of functional sub Gaussianity is equivalent to the prior notion of vector sub-Gaussianity, allowing us to use the two notions interchangeably. Hereafter, we say that k generates a SPSD matrix K if k(X, X) = K.

Lemma A.1 (Functional sub-Gaussianity implies vector sub-Gaussianity). In the notation of Def. 3, if (Pin Pout)k is (k, ν)-sub-Gaussian on an event E and k generates K, then the vector pin pout is (K, ν)-sub-Gaussian on E.

Proof. Suppose (Pin Pout)k is (k, ν)-sub-Gaussian on an event E, fix a vector u Rn, and define the function

fu Pn i=1 uik( , xi) Hk.

By the reproducing property,

u K(pin pout) = fu, (Pin Pout)k k and fu 2 k = u Ku. (17)

Invoking the representations (17) and the functional sub-Gaussianity condition (16) we therefore obtain

EE exp(u K(pin pout) = EE[exp( fu, (Pin Pout)k k)] exp( fu 2 k ν2

2 ) = exp(u Ku ν2

so that pin pout is (K, ν)-sub-Gaussian on the event E as claimed.

Lemma A.2 (Vector sub-Gaussianity implies functional sub-Gaussianity). In the notation of Def. 3, if pin pout is (K, ν)-sub-Gaussian on an event E and k generates K, then (Pin Pout)k is (k, ν)-sub-Gaussian on E.

Proof. Suppose pin pout is (K, ν)-sub-Gaussian on an event E, fix a function f Hk, and consider the set

L n fu Pn i=1 uik( , xi) : u Rno .

Low-Rank Thinning

Since L is a closed linear subspace of Hk, we can decompose f as f = fu + f , where u Rn and f is orthogonal to L (Rudin, 1991, Theorem 12.4), so that

f 2 k = fu 2 k + f 2 k and fu 2 k = u Ku. (18)

Invoking the orthogonality of f and (Pin Pout)k L, the reproducing property representations (17), and the vector sub-Gaussianity condition (15), we find that

EE[exp( f, (Pin Pout)k k)] = EE[exp( fu + f , (Pin Pout)k k)] = EE exp(u K(pin pout) )

exp(u Ku ν2

2 ) (18) exp( f 2 k ν2

so that (Pin Pout)k is (k, ν)-sub-Gaussian on the event E as claimed.

We end our discussion about the versions of sub-Gaussianity considered above by presenting the standard fact about the additivity of sub-Gaussianity parameters under summation of independent sub-Gaussian random vectors, adapted to our setting.

Lemma A.3 (Vector sub-Gaussian additivity). Suppose that, for each j [m], j Rn is (K, νj) on an event Ej given 1:(j 1) ( 1, . . . , j 1) and E j 1 Tj 1 i=1 Ei. Then Pm j=1 j is (K, (Pm j=1 ν2 j )1/2)-sub-Gaussian on E m.

Proof. Let E s = Ts j=1 Ej for each s [m]. We prove the result for Zs = Ps i=1 j by induction on s [m]. The result holds for the base case of s = 1 by assumption. For the inductive case, suppose the result holds for s [m 1]. Fixing u Rn, we may apply the tower property, our conditional sub-Gaussianity assumption, and our inductive hypothesis in turn to conclude

E h exp( u, K Ps+1 j=1 j )1[E s+1] i = E h exp( u, K Ps j=1 j )1[E s]E[exp( u, s+1 )1[Es+1] | 1:s, E s] i

E h exp( u, K Ps j=1 j )1[E s] i exp ν2 s+1

2 u Ku exp Ps+1 j=1 ν2 j 2 u Ku .

Hence, Zs+1 is (K, (Ps+1 j=1 ν2 j )1/2)-sub-Gaussian on E s+1, and the proof is complete.

B. Proof of Tab. 1: Sub-Gaussian Thinning Examples

This section provides supplementary details for each of the sub-Gaussian thinning algorithms of Tab. 1.

B.1. SUBSAMPLING

B.1.1. PROOF OF PROP. 1: QUALITY OF UNIFORM SUBSAMPLING

We begin by computing the first and second moments of pout: E[pout] = pin and

E[poutp out] = 1 nout diag(pin) + nin(nout 1)

nout(nin 1)(pinp in 1 nin diag(pin)) = 1 nout ( nin nout

nin 1 ) diag(pin) + nin(nout 1)

nout(nin 1)pinp in.

E[MMD2 K(pin, pout)] = p in Kpin 2p in KE[pout] + E[p out Kpout] = tr(KE[poutp out]) p in Kpin = 1 nout ( nin nout

nin 1 )(tr(K diag(pin)) p in Kpin) = 1 nout ( nin nout

nin 1 )CK. (19)

To derive the second advertised result, we note that

E[ K(pin pout) 2 I] maxi I E[(e i K(pin pout))2] = maxi I E[MMD2 Keie i K(pin, pout)]

and invoke the initial result (19) to conclude.

Low-Rank Thinning

B.1.2. SUB-GAUSSIANITY OF SUBSAMPLING

Proposition B.1 (Sub-Gaussianity of uniform subsampling). For any SPSD K Rn n, uniform subsampling (without replacement) is a (K, ν, 0)-sub-Gaussian thinning algorithm with

K max nout .

Proof. Fix any vector u Rn, and let J1, . . . , Jnout be the random indices in [n] selected by uniform subsampling. Since u K(pin pout) = 1 nout Pnout i=1 u K(pin e Ji) is an average of mean-centered scalars drawn without replacement and satisfying

e Ji Ke Ji p

u Ku with probability 1

by Cauchy-Schwarz, Thm. 4 and equations (1.8) and (4.16) of Hoeffding (1963) imply that

E[exp(u K(pin pout))] exp( K max

2nout u Ku).

Algorithm B.1: KH(δ): Kernel Halving with simplified swapping thresholds and failure probability δ/2

Input: point sequence Xin = (xi)nin i=1 with even nin, kernel k

S(1), S(2) {}; eψ0 0 Hk // Initialize empty coresets: S(1), S(2) have size i after round i bmax,i 0 // Max function norm so far for i = 1, 2, . . . , nin/2 do

// Construct kernel difference function using next two points (x, x ) (x2i 1, x2i); fi k(x2i 1, ) k(x2i, ); ηi 1

// Compute swapping threshold ai b2 i = fi 2 k =k(x, x)+k(x , x ) 2k(x, x ); bmax,i = max(bi, bmax,i 1) ai bibmax,i( 1

2 + log(2nin/δ))

// Compute RKHS inner product D eψi 1, fi E

k, which has a simple form

αi P2i 2 j=1 (k(xj, x) k(xj, x )) 2 P

z S(1)(k(z, x) k(z, x ))

// Assign one point to each coreset after probabilistic swapping (x, x ) (x , x) and ηi 1 with probability min(1, 1

S(1).append(x); S(2).append(x ); eψi eψi 1 + ηifi // eψi = P

x S(2)k(x , ) P

x S(1)k(x, ) end return Xout S(1), coreset of size nout = nin/2

In this section, we analyze KH(δ) (Alg. B.1), a variant of the Kernel Halving algorithm (Dwivedi & Mackey, 2024, Alg. 2) with simplified swapping thresholds. Prop. B.2, proved in App. B.2.1, establishes the sub-Gaussianity of KH(δ) and its intermediate iterates.

Proposition B.2 (Sub-Gaussianity of KH(δ)). Suppose nin 2. In the notation of Alg. B.1, on a common event E of probability at least 1 δ/2, for all i [nin/2], 1

2i eψi is (k, νi)-sub-Gaussian with

νi = bmax,i

log(2nin/δ)

log(2nin/δ)

2i maxj [i] MMDk(δx2j 1, δx2j)

log(2nin/δ)

2i maxj [i] MMDk(δx2j 1, δx2j)

log(2nin/δ)

2i 2 min(maxx Xin p

k(x, x), maxx Xin MMDk(δx, Pin)).

Prop. B.2 and the triangle inequality imply that (Pin Pout)k = 1 nin eψnin/2 is (k, ν)-sub-Gaussian on E with

ν = bmax,nin/2

log(2nin/δ)

log(2nin/δ)

nin 2 min(maxx Xin p

k(x, x), maxx Xin MMDk(δx, Pin)).

Low-Rank Thinning

By Lem. A.1, we thus have that the KH(δ) output pin pout is (K, ν)-sub-Gaussian on E for K generated by k and that KH(δ) Gν,δ(K).

B.2.1. PROOF OF PROP. B.2: SUB-GAUSSIANITY OF KH(δ)

We begin by studying the sub-Gaussian properties of a related algorithm, the self-balancing Hilbert walk (SBHW) of Dwivedi & Mackey (2024, Alg. 3). By Dwivedi & Mackey (2024, Thm. 3(i)), when the SBHW is run on the RKHS Hk with the same fi and ai sequences employed in KH(δ), the output ψi of each round is (k, σi)-sub-Gaussian for

σ2 0 0 and σ2 i σ2 i 1 + fi 2 k 1 + σ2 i 1 a2 i ( fi 2 k 2ai)

+ i 1. (20)

The following lemma bounds the growth of the sub-Gaussian constants σi in terms of the swapping thresholds ai. Lemma B.1 (Growth of SBHW sub-Gaussian constants). For each i, the SBHW sub-Gaussian constants (20) satisfy

σ2 i ci for ci maxj [i] max(b2 j, rj) and ri a2 i 2ai b2 i a2 i 2ai bibmax,i .

Proof. We will prove the result by induction on i.

Base case. σ2 1 = b2 1 c1 as desired.

Inductive case. Suppose σ2 i 1 ci 1. Then σ2 i = g(σ2 i 1) for g(x) = x + b2 i (1 x/ri)+. Note that the slope of g is 1 b2 i /ri for x < ri and 1 for x > ri. If 1 b2 i /ri 0, then g is increasing and its maximum value over [0, ci] is at ci. If, on the other hand, 1 b2 i /ri < 0, then g first decreases and then increases so its maximum value over [0, ci] is either at 0 or at ci. Since ci max(ri, ci 1), σ2 i max(g(0), g(ci)) = max(b2 i , ci) = ci. The proof is complete.

Invoking Lem. B.1, the assumption nin 2, and the fact that δ 7

1 2 +log(2/δ)

log(2/δ) is increasing on (0, 1], we find that

σ2 i b2 max,i log(2nin/δ) ( 1

2 +log(2nin/δ))2

2(log(2nin/δ))2 b2 max,i log(2nin/δ) ( 1

2 +log(4))2

2(log(4))2 b2 max,i log(2nin/δ). (21)

The first inequality in (21) and the definition (20) further imply that

ai = bibmax,i( 1

2 + log(2nin/δ)) σibi p

2 log(2nin/δ) σi 1bi p

2 log(2nin/δ).

Hence, by Dwivedi & Mackey (2024, Thm. 3(iii)), for each i [nin/2], the vector eψi of KH(δ) coincides with the vector ψi of SBHW on a common event E of probability at least 1 δ/2. Therefore, each 1

2i eψi is (k, 1

2iσi)-sub-Gaussian on E, implying the result.

B.3. LKH(δ)

In this section, we analyze LKH(δ) (Alg. B.2), the Kernel Halving algorithm of (Dwivedi & Mackey, 2024, Alg. 2) with a linear kernel, k(x, y) = x, y , on Rd and failure probability δ/2. Notably, Alg. B.2 can be carried out in only O(nd) time thanks to the linear kernel structure. Prop. B.3, proved in App. B.3.1, establishes the sub-Gaussianity of LKH(δ) and its intermediate iterates. Proposition B.3 (Sub-Gaussianity of LKH(δ)). Suppose nin 2. In the notation of Alg. B.2, on a common event E of probability at least 1 δ/2, for all i [nin/2], 1

2i eψi is (k, νi)-sub-Gaussian with k(x, y) = x, y and

log(2nin(log(nin/2)+1)/δ)

2i maxj [i] x2j 1 x2j 2

log(2nin(log(nin/2)+1)/δ)

2i 2 min(maxx Xin p

x 2, maxx Xin x x 2) for x = 1 nin P x Xin δx.

Prop. B.3 and the triangle inequality imply that (Pin Pout)k = 1 nin ψnin/2 is (k, ν)-sub-Gaussian on E with

log(2nin(log(nin/2)+1)/δ)

nin maxj [nin/2] x2j 1 x2j 2

log(2nin(log(nin/2)+1)/δ)

nin 2 min(maxx Xin p

x 2, maxx Xin x x 2) for x = 1 nin P

Low-Rank Thinning

Algorithm B.2: LKH(δ): Kernel Halving with linear kernel and failure probability δ/2

Input: point sequence Xin = (xi)nin i=1 with even nin and xi Rd

S(1), S(2) {}; ψ0 0 Rd // Initialize empty coresets: S(1), S(2) have size i after round i σ0 0 // Keep track of sub-Gaussian constant for i = 1, 2, . . . , nin/2 do

// Consider two points (x, x ) (x2i 1, x2i); ηi 1

// Compute swapping threshold ai b2 i = x x , x x ; δi = δ 2i(log(nin/2)+1) (ai, σi) get swap params(σi 1, bi, δi)

// Compute inner product αi ψi 1, x x

// Assign one point to each coreset after probabilistic swapping (x, x ) (x , x) and ηi 1 with probability min(1, 1

S(1).append(x); S(2).append(x ); eψi eψi 1 + ηifi end return Xout S(1), coreset of size nout = nin/2

function get swap params(σ, b, δ):

2 log(2/δ), b2) σ2 σ2+b2(1+(b2 2a)σ2/a2)+ return (a, σ);

By Lem. A.1, we thus have that the LKH(δ) output pin pout is (K, ν)-sub-Gaussian on E for K generated by k and that LKH(δ) Gν,δ(K).

B.3.1. PROOF OF PROP. B.3: SUB-GAUSSIANITY OF LKH(δ)

We begin by studying the sub-Gaussian properties of a related algorithm, the self-balancing Hilbert walk (SBHW) of Dwivedi & Mackey (2024, Alg. 3). By Dwivedi & Mackey (2024, Thm. 3(i)), when the SBHW is run on the RKHS Hk with the same fi and ai sequences employed in LKH(δ), the output ψi of each round is (k, σi)-sub-Gaussian. Moreover, since

ai σi 1bi p

2 log(2/δi) for each i [nin/2],

Dwivedi & Mackey (2024, Thm. 3(iii)) implies that, for each i [nin/2], the vector eψi of LKH(δ) coincides with the vector ψi of SBHW on a common event E of probability at least 1 δ/2. Therefore, each 1

2i eψi is (k, 1

2iσi)-sub-Gaussian on E. Finally, Dwivedi & Mackey (2024, (46)) shows that σi νi for each i [nin/2], yielding the result.

B.4. RKH(δ)

Algorithm B.3: RKH(δ): Repeated KH(δ)

Input: point sequence Xin = (xi)nin i=1, kernel k, output size nout nin/2N

// Repeatedly divide coreset size in half m log2(nin/nout) for ℓ= 1, 2, . . . , m do Xin KH(δ/m)(Xin, k) ; return Xout Xin, coreset of size nout = nin/2m

In this section, we analyze repeated KH(δ) (RKH(δ), Alg. B.3), a variant of the KT-SPLIT algorithm (Dwivedi & Mackey, 2024, Alg. 1a) with simplified swapping thresholds. Our next result, proved in App. B.4.1, establishes the sub-Gaussianity of RKH(δ).

Low-Rank Thinning

Algorithm B.4: KH-COMPRESS(δ): Compress with KH halving and failure probability δ

Input: point sequence Xin = (xi)nin i=1, kernel k, nout nin 2N

g log2(nout/ nin) // identify compression level

function compress(S):

if |S| = 4g then return S Partition S into four arbitrary subsequences {Si}4 i=1 each of size |S|/4 for i = 1, 2, 3, 4 do

e Si compress(Si) // return coresets of size 2g q

4 end e S CONCATENATE( e S1, e S2, e S3, e S4); ℓ 2 2g p

|S| // coreset of size ℓ

nin4g+1(log4 nin g)δ ( e S, k) // coreset of size 2gp

return compress(Xin) // coreset of size nout = 2g nin

Proposition B.4 (Sub-Gaussianity of RKH(δ)). If nout nin/2N then RKH(δ) (Alg. B.3) is (k, ν)-sub-Gaussian with

log( 6nout log2(nin/nout)

δ ) min(maxx Xin p

k(x, x), maxx Xin MMDk(δx, Pin))

on an event E of probability at least 1 δ/2.

By Lem. A.1, we thus have that the RKH(δ) output pin pout is (K, ν)-sub-Gaussian on E for K generated by k and that

RKH(δ) Gν,δ(K). Finally, ν = O(

log(nout/δ)

nout ) when nout nin.

B.4.1. PROOF OF PROP. B.4: SUB-GAUSSIANITY OF RKH(δ)

Let c = 2 min(maxx Xin p

k(x, x), maxx Xin MMDk(δx, Pin)), and, for each ℓ [m], let eψ(ℓ) represent the vector eψnin/2ℓproduced at the end of the ℓ-th call to KH(δ). By the proof of Prop. B.2 and the union bound, on an event E of

probability at least 1 δ/2, ( eψ(ℓ))ℓ [m] = (ψ(ℓ))ℓ [m], where each 2ℓ 1

nin ψ(ℓ) is (k, ν(ℓ))-sub-Gaussian given (ψ(j))j [ℓ 1] for

log(2ninm/(2ℓ 1δ))

Hence, on E, the weighted sum

(Pin Pout)k = P

nin eψ(ℓ) = P

ℓ [m](ν(ℓ))2)-sub-Gaussian by Dwivedi & Mackey (2024, Lem. 14). Finally, by Dwivedi & Mackey (2024,

Eq. (63)), q P

ℓ [m](ν(ℓ))2 ν.

B.5. KH-COMPRESS(δ)

In this section, we analyze KH-COMPRESS(δ) (Alg. B.4), a variant of the KT-SPLIT-COMPRESS algorithm (Shetty et al., 2022, Ex. 3) with simplified swapping thresholds. Proposition B.5 (Sub-Gaussianity of KH-COMPRESS(δ)). If nout nin 2N then KH-COMPRESS(δ) (Alg. B.4) is (k, ν)-sub-Gaussian with

log2(nout) log( 4nout log2(nin/nout)

δ ) maxx Xin p

on an event E of probability at least 1 δ/2.

Proof. Since the original Kernel Halving algorithm of Dwivedi & Mackey (2024, Alg. 2) is equal to the KT-SPLIT algorithm of Dwivedi & Mackey (2024, Alg. 1a) with m = 1 halving round, KH-COMPRESS(δ) is simply the KT-SPLITCOMPRESS algorithm of (Shetty et al., 2022, Ex. 3) with KH(δ) of Alg. B.1 substituted for KT-SPLIT(δ, m = 1). The

Low-Rank Thinning

result now follows immediately from the KH(δ) sub-Gaussian constant of Prop. B.2 and the argument of Shetty et al. (2022, Rem. 2, Ex. 3).

Prop. B.5 and Lem. A.1 imply that KH-COMPRESS(δ) Gν,δ(K) for any kernel k that generates K. In addition, ν =

log(nout) log(nout/δ)

nout ) when nout nin. Furthermore, Shetty et al. (2022, Rem. 1) implies that KH-COMPRESS(δ) has a runtime less than 4g+1nin(log4(nin) g) = 4n2 out log2(nin/nout) = O(n2 out) when nout nin.

B.6. GS-THIN

The section introduces and analyzes the Gram-Schmidt Thinning algorithm (GS-THIN, Alg. B.5). GS-THIN repeatedly divides an input sequence in half using, GS-HALVE (Alg. B.6), a symmetrized and kernelized version of the Gram Schmidt (GS) Walk of Bansal et al. (2018). We will present two different implementations of GS-HALVE: a quartic-time implementation (Alg. B.6) based on the GS Walk description of Bansal et al. (2018) and a cubic-time implementation based on local updates to the matrix inverse (Alg. B.7). While both the algorithms lead to the same output given the same source of randomness, we present the original implementation2 for conceptual clarity and the optimized implementation for improved runtime. Throughout, for a matrix Q and vector u, we use the notation QI J and u I to represent the submatrix (Qij)i I,j J and subvector (ui)i I.

Algorithm B.5: GS-THIN: Gram-Schmidt Thinning

Input: point sequence Xin = (xi)nin i=1, kernel k, output size nout nin/2N, HALVE {GS-HALVE, GS-HALVE-CUBIC}

// Repeatedly divide coreset size in half m log2(nin/nout) for ℓ= 1, 2, . . . , m do Xin HALVE(Xin, k) ; return Xout Xin, coreset of size nout = nin/2m

Our first result, proved in App. B.6.1, shows that GS-THIN is a sub-Gaussian thinning algorithm. Proposition B.6 (GS-THIN sub-Gaussianity). For K generated by k, GS-THIN (Alg. B.5) is a (K, ν, 0)-sub-Gaussian thinning algorithm with parameter

nout . (22)

Our second result, proved in App. B.6.2, shows that GS-THIN with the GS-HALVE implementation has O(n4 in) runtime. Proposition B.7 (Runtime of GS-THIN with GS-HALVE). The runtime of GS-THIN with implementation GS-HALVE (Alg. B.6) is O(n4 in).

Our third result, proved in App. B.6.3, establishes the equivalence between GS-HALVE and GS-HALVE-CUBIC. More precisely, we show that the sequence of partial assignment vectors generated by kernel gs walk( ) of Alg. B.6 and kernel gs walk cubic( ) of Alg. B.7 are identical given identical inputs, an invertible induced kernel matrix, and an identical source of randomness. Proposition B.8 (Agreement of GS-HALVE and GS-HALVE-CUBIC). Let z1, z2, . . . be the fractional assignment sequence generated by kernel gs walk((xi)nin i=1) in Alg. B.6 and z 1, z 2, . . . be the fractional assignment sequence generated by kernel gs walk cubic((xi)nin i=1) in Alg. B.7 with an identical source of randomness. If the pairwise difference matrix

Q (k(x2i 1, x2j 1) + k(x2i, x2j) k(x2i 1, x2j) k(x2i, x2j 1))i,j [nin/2]

is positive definite, then zt = z t for all t.

Our fourth result, proved in App. B.6.4, shows that GS-THIN with the GS-HALVE-CUBIC implementation has O(n3 in) runtime. Proposition B.9 (Runtime of GS-THIN with GS-HALVE-CUBIC). The runtime of GS-THIN with implementation GSHALVE-CUBIC (Alg. B.7) is O(n3 in).

2 Towards making this equivalence clear, Alg. B.6 has been expressed with the same variables that Alg. B.7 uses. Alg. B.6 can be slightly simplified if it were to be considered independently.

Low-Rank Thinning

Algorithm B.6: GS-HALVE: Gram-Schmidt Halving

Input: point sequence Xin = (xi)nin i=1 with even nin, kernel k

Xout {} // Initialize empty coreset // Select one point to keep from each consecutive pair using kernelized GS Walk z kernel gs walk(Xin) for i = 1, . . . , nin/2 do

if zi = 1 then

Xout.append(x2i 1) else

Xout.append(x2i) end end return Xout, coreset of size nin/2

function kernel gs walk((xi)nin i=1): t 1; zt (0, 0, . . . , 0) Rnin/2 // Initialize fractional assignment vector A [nin/2] // Initialize set of active coordinates p A // Select a pivot uniformly at random while zt / { 1}nin/2 do

A A \ min {i [nin/2] : |zti| = 1} \ ([nin/2] \ A)

// Update set of active coordinates by removing smallest index set to 1 if p / A then

p Unif(A ) // Select a new pivot from A uniformly at random else

p p end // Compute step direction in which to update fractional assignment vector ut argminu Rnin/2 u Qu subject to up = 1 and ui = 0 for all i / A , where Q R(nin/2) (nin/2) has entries Qij k(x2i 1, x2j 1) + k(x2i, x2j) k(x2i 1, x2j) k(x2i, x2j 1)

δ+ |max | and δ |min |, where = n δ R : zt + δut [ 1, +1]nin/2o // Select candidate step sizes

δt δ+ with probability δ /(δ+ + δ ); otherwise δt δ // Choose step size and sign at random zt+1 zt + δtut // Update fractional assignments t t + 1; A A ; p p

end return zt, sign vector in { 1}nin/2

Low-Rank Thinning

Algorithm B.7: GS-HALVE-CUBIC: Gram-Schmidt Halving with cubic runtime

Input: point sequence Xin = (xi)nin i=1 with even nin, kernel k with positive definite k(Xin, Xin)

Xout {} // Initialize empty coreset // Select one point to keep from each consecutive pair using kernelized GS Walk z kernel gs walk cubic(Xin) for i = 1, . . . , nin/2 do

if zi = 1 then

Xout.append(x2i 1) else

Xout.append(x2i) end end return Xout, coreset of size nin/2

function kernel gs walk cubic((xi)nin i=1): t 1; zt (0, 0, . . . , 0) Rnin/2 // Initialize fractional assignment vector A [nin/2] // Initialize set of active coordinates p A // Select pivot uniformly at random Q (k(x2i 1, x2j 1) + k(x2i, x2j) k(x2i 1, x2j) k(x2i, x2j 1))nin/2 i,j=1 // Form paired difference kernel matrix C (QA\{p} A\{p}) 1

while zt / { 1}nin/2 do

A A \ min {i [nin/2] : |zti| = 1} \ ([nin/2] \ A)

// Update set of active coordinates by removing smallest index set to 1 if p / A then

p Unif(A ) // Select a new pivot from A uniformly at random else

p p end A1 A \ {p} A2 A \ {p }. i A1\A2 // Choose i as the (unique) index that was removed from the active coordinates // Compute (QA2 A2) 1 using block matrix inversion and the Sherman-Morrison formula D CA2 A2

C D DQA2 {i}Q{i} A2 D

Qii+Q{i} A2 DQA2 {i} // Compute step direction in which to update fractional assignment vector Compute ut as (ut)A2 = CQA2 {p } , utp = 1, and uti = 0 for i / A

δ+ |max | and δ |min |, where = n δ R : zt + δut [ 1, +1]nin/2o // Select candidate step sizes

δt δ+ with probability δ /(δ+ + δ ); otherwise δt δ // Choose step size and sign at random zt+1 zt + δtut // Update fractional assignments t t + 1; A A ; p p

end return zt, sign vector in { 1}nin/2

Low-Rank Thinning

B.6.1. PROOF OF PROP. B.6: GS-THIN SUB-GAUSSIANITY

Our first lemma bounds the sub-Gaussian constant of GS-HALVE (Alg. B.6).

Lemma B.2 (GS-HALVE sub-Gaussianity). In the notation of Def. 1, consider the input and output vectors pin, pout Rn

of GS-HALVE (Alg. B.6) for X Xin with |X| = n nin. If K = k(X, X), then pin pout is (K, ν)-sub-Gaussian with

ν 2 K 1/2 max nin = K 1/2 max nout .

Proof. Since K is SPSD, there exists a matrix Φ Rn d such that K = ΦΦ . Let B Rd (nin/2) be the matrix with entries

Bj,i Φ2i 1,j Φ2i,j for i [nin/2] and j [d].

Note that, for each i [nin/2], P

j [d] B2 j,i = K2i 1,2i 1 + K2i,2i K2i 1,2i K2i,2i 1 4 K max.

Hence, by Harshaw et al. (2024, Thm. 6.6), 1 nin Bz is (I, ν)-sub-Gaussian where I is the identity matrix in Rd d.

Now fix any u Rd. Since 1 nin Bz = Φ (pin pout) by construction,

E exp u K(pin pout) E h exp Φ u, 1

nin Bz i exp ν2

2 Φ u 2 2 = exp ν2

Now, for ℓ [m], let pℓ Rn denote the output probability vector produced by the ℓ-th call to GS-HALVE. Defining p0 pin and pout pm, we have

pin pout = Pm i=1 i, for i pi 1 pi for i [m].

By Lem. B.2, each pi 1 pi is (K, 2 K 1/2 max nin/2i 1 )-sub-Gaussian conditional on ( 1, . . . , i 1). Applying Lem. A.3 to the sequence ( j)m j=1, we find that pin pout is (K, ν)-sub-Gaussian with parameter

ν = Pm j=1 4 K max (nin/2j 1)2 1/2 = 2 K 1/2 max nin

Pm j=1 4j 1/2 K 1/2 max nin

Simplifying the above using the fact that nout = nin/2m yields our desired result (22).

B.6.2. PROOF OF PROP. B.7: RUNTIME OF GS-THIN WITH GS-HALVE

We essentially reproduce the argument from Bansal et al. (2018) for the runtime of the GS-HALVE algorithm in our kernelized context.

The main computational cost of GS-HALVE is the execution of the kernel gs walk( ) subroutine in Alg. B.6. The number of iterations in while loop for zt is at most nin/2. This is due to the fact that in each iteration, at least one new variable is set to { 1}. Further, in each iteration, the main computational cost is the computation of

ut argminu Rnin/2 u Qu

under the constraints that up = 1 and ui = 0 for all i / A. Since this can be implemented in O(n3 in) time using standard convex optimization techniques, GS-HALVE has total runtime

for an input sequence of size ℓand a constant C independent of ℓ. Now, note that GS-THIN calls GS-HALVE iteratively on inputs of size nin2 i for i = 0, 1, . . . , m 1 where m = log2(nin/nout). Thus, GS-THIN has runtime Pm 1 i=0 r H(nin/2i) Pm 1 i=0 C(nin/2i)4 = O(n4 in).

Low-Rank Thinning

B.6.3. PROOF OF PROP. B.8: AGREEMENT OF GS-HALVE AND GS-HALVE-CUBIC

We want to reason that any round of partial coloring leads to the same output across the two algorithms. Fix any fractional assignment update round. Recall that A1 = A\{p} and A2 = A \ {p }. These represent the active set coordinates without the pivot before and after the update respectively.

The main difference between Algs. B.6 and B.7 is in the computation of the step direction ut, which is the solution of the program

ut argminu Rn u Qu subject to up = 1 and ui = 0 for all i / A .

ut has a closed form with entries

(ut)A2 = (QA2 A2) 1 QA2 {p }.

Note that the invertibility of QA2 A2 follows from the positive-definiteness of Q, as, for any w R|A2|,

w QA2 A2w = w Q w > 0

for a second vector w with w A2 = w and all other entries equal to zero. Therefore, to compute ut, it suffices to keep track of the inverse of QA2 A2 as A across iterations.

Let i be the unique element in A1\A2. Writing QA1 A1 in block form, we have

QA1 A1 = QA2 A2 QA2 {i} Q{i} A2 Qii

By block matrix inversion (see, e.g., Saadetoglu & Dinsev, 2023, Thm. 2), the leading size |A2| |A2| principal submatrix of (QA1 A1) 1 equals

D QA2 A2 QA2 {i}Q{i} A2

Thus, by the Sherman-Morrison formula (Sherman & Morrison, 1950),

(QA2 A2) 1 = D 1 + QA2 {i}Q{i} A2

1 = D DQA2 {i}Q{i} A2D Qii+Q{i} A2DQA2 {i} . (23)

Hence, if we already have access to a matrix C = (QA1 A1) 1, we can compute D by dropping the row and column of C corresponding to i and then compute (QA2 A2) 1 using (23). Since in Alg. B.7 we begin by explicitly computing the inverse of QA A , the update step in Alg. B.7 maintains the required inverse and thus its partial assignment updates match those of Alg. B.6.

B.6.4. PROOF OF PROP. B.9: RUNTIME OF GS-THIN WITH GS-HALVE-CUBIC

We begin by establishing the runtime of kernel gs walk cubic( ). Lemma B.3 (Running time of kernel gs walk cubic( ) ). The routine kernel gs walk cubic( ) runs in O(ℓ3) time given a point sequence of size ℓ.

Proof. First, the initialization of C costs O(ℓ3) time using standard matrix inversion algorithms. Second, the number of iterations in the while loop is at most ℓ/2 since, in each iteration, at least one new variable is assigned a permanent sign in { 1}. In each while loop iteration, the main computational costs are the update of C and the computation of the step direction ut, both of which cost O(ℓ2) time using standard matrix-vector multiplication. Hence, together, all while loop iterations cost O(ℓ3) time.

Given the above lemma, we have that GS-HALVE-CUBIC, on input of size ℓ, has a running time

for some C independent of ℓ. When used in GS-THIN this yields the runtime Pm 1 i=0 r H(nin/2i) Pm 1 i=0 C(nin/2i)3 = O(n3 in).

Low-Rank Thinning

Algorithm B.8: GS-COMPRESS: Compress with GS-HALVE-CUBIC halving

Input: point sequence Xin = (xi)nin i=1, kernel k, nout nin 2N

g log2(nout/ nin) // identify compression level

function compress(S):

if |S| = 4g then return S Partition S into four arbitrary subsequences {Si}4 i=1 each of size |S|/4 for i = 1, 2, 3, 4 do

e Si compress(Si) // return coresets of size 2g q

4 end e S CONCATENATE( e S1, e S2, e S3, e S4); ℓ 2 2g p

|S| // coreset of size ℓ return GS-HALVE-CUBIC( e S, k) // coreset of size 2gp

return compress(Xin) // coreset of size nout = 2g nin

B.7. GS-COMPRESS

This section introduces and analyzes the new GS-COMPRESS algorithm (Alg. B.8) which combines the COMPRESS metaalgorithm of Shetty et al. (2022) with the GS-HALVE-CUBIC halving algorithm (Alg. B.7). The following result bounds the sub-Gaussian constant and runtime of GS-COMPRESS.

Proposition B.10 (GS-COMPRESS sub-Gaussianity and runtime). If K is generated by k, then GS-COMPRESS is (K, ν, 0)-sub-Gaussian with

log2(nout) K max.

Moreover, GS-COMPRESS has an O(n3 out) runtime.

Proof. By Lem. B.2 and Prop. B.8, GS-HALVE-CUBIC is (K, νH(ℓ))-sub-Gaussian for an input point sequence of size ℓ and νH(ℓ) = 2 p

K max/ℓ. Hence, by Lem. A.2, GS-HALVE-CUBIC is also νH(ℓ) f-sub-Gaussian in the sense of Shetty et al. (2022, Def. 2) for each f Hk. By Shetty et al. (2022, Rmk. 2), GS-COMPRESS is therefore f-sub-Gaussian with parameter

log2(nin/nout)νH(2nout) p

log2(nout) K 1/2 max nout

for each f Hk. Hence, Lem. A.1 implies that GS-COMPRESS is a (K, ν, 0)-sub-Gaussian thinning algorithm.

Furthermore, Shetty et al. (2022, Thm. 1) implies that GS-COMPRESS has a runtime of

Plog2(nin/(2nout)) i=0 4i r H(2nout2 i).

where the GS-HALVE-CUBIC runtime r H(ℓ) Cℓ3 for C independent of the input size ℓby Lem. B.3. Therefore, the GS-COMPRESS runtime is bounded by

Plog2(nin/(2nout)) i=0 4i (2nout)32 3i = O(n3 out).

Remark 1 (COMPRESS with GS-HALVE). If the GS-HALVE implementation were used in place of GS-HALVE-CUBIC, parallel reasoning would yield an O(n4 out) runtime for GS-COMPRESS.

C. Proof of Thm. 1: Low-rank sub-Gaussian thinning

We establish the MMD bound (2) in App. C.1, the first kernel max seminorm bound (3) in App. C.2, and the Lipschitz kernel max seminorm bound (4) in App. C.3. Throughout, we use the notation PE(E ) P(E, E ) for events (E, E ).

Low-Rank Thinning

C.1. Proof of MMD bound (2)

Without loss of generality, we suppose that r rank(K). Let VΛV be an eigendecomposition of K with orthonormal V Rn n and diagonal Λ = diag(λ1, , λn) Rn n. Let Vr represent the first r columns of V, and let V r represent the last n r columns of V. Introduce the shorthand

w pin pout Rn and Φ VΛ1/2V Rn n. (24)

We can directly verify that

VV = V V = I, VV = Vr V r + V r V r, and K = ΦΦ . (25)

Using the above equalities, we decompose the squared MMD into two components,

MMD2 K(pin, pout) = w Kw = w ΦΦ w = w ΦVV Φ w = w ΦVr V r Φ w + w ΦV r V rΦ w

= V r Φ w 2 2 + V rΦ w 2 2. (26)

In Apps. C.1.1 and C.1.2 respectively, we will establish the bounds

P( V r Φ w 2 2 eν2(er + log(1/δ )) 1 δ/2 δ and (27)

P( V rΦ w 2 2 λr+1( 1

n)) = 1, (28)

which when combined with (26) yield the advertised claim (2) on the squared MMD.

C.1.1. PROOF OF (27): BOUNDING V r Φ w 2 2

Our first lemma bounds the Euclidean norm of a vector in terms of a finite number of inner products.

Lemma C.1 (Euclidean norm cover). For any v Rr and ε (0, 1),

v 2 1 1 ε maxu Cε,r u, v (29)

for a set Cε,r contained in the ball Br with |Cε,r| (1 + 2/ε)r.

Proof. Fix any ε (0, 1), and let Cε,r be a set of minimum cardinality satisfying

Cε,r Br and supu Br minu Cε,r u u 2 ε.

By Wainwright (2019, Lem. 5.2), |Cε,r| (1 + 2/ε)r. Now we invoke the variational representation of 2 and the Cauchy-Schwarz inequality to conclude that

v 2 = supu Br u, v = supu Br minu Cε,r[ u u , v + u , v ]

supu Br minu Cε,r u u 2 v 2 + maxu Cε,r u , v

ε v 2 + maxu Cε,r u , v .

Rearranging terms yields the claimed bound (29).

Our next lemma uses this covering estimate to bound the exponential moments of V r Φ w 2.

Lemma C.2 (Norm sub-Gaussianity). For any ε > 0 and any t > 0,

EE[exp(t V r Φ w 2)] (1 + 2

ε)r exp( ν2t2 2(1 ϵ)2 ).

Proof. Fix any t > 0. Since x 7 exp(tx) is increasing, Lem. C.1 implies that

EE[exp(t V r Φ w 2)] EE[exp(t 1 1 ε maxu Cε,r u, V r Φ w )]

= EE[maxu Cε,r exp( t 1 ε Vru, Φ w )]

u Cε,r EE[exp( t 1 ε Vru, Φ w )]

Low-Rank Thinning

for a subset Cε,r with |Cε,r| (1 + 2

ε)r and u 2 1 for each u Cε,r.

Now fix any u Cε,r and let Λr = diag(λ1, . . . , λr). Using (24) and (25), we have

Vr = Φ VrΛ 1/2 r and therefore

Vru, Φ w = Φ VrΛ 1/2 r u, Φ w = VrΛ 1/2 r u, Kw .

In addition, we have

(VrΛ 1/2 r u) K(VrΛ 1/2 r u) = u Λ 1/2 r V r VΛV VrΛ 1/2 r u = u u.

Next, we can invoke our sub-Gaussianity assumption (Def. 3) to conclude that

EE[exp( t 1 ε Vru, Φ w )] = EE[exp( t 1 ε VrΛ 1/2 r u, Kw )] exp( ν2t2 2(1 ε)2 VrΛ 1/2 r u, KVrΛ 1/2 r u )

exp( ν2t2 2(1 ε)2 u 2 2).

Since u 2 1 and |Cε,r| (1 + 2

ε)r, the advertised result now follows.

By Markov s inequality (Markov, 1884) and Lem. C.2, for any α > 0,

P( V r Φ w 2 > α) = PE( V r Φ w 2 > α) + P( V r Φ w 2 > α, Ec)

PE( V r Φ w 2 > α) + P(Ec)

inft>0 EE[exp(t V r Φ w 2)]/ exp(tα) + δ/2

ε)r inft>0 exp( ν2t2 2(1 ε)2 tα) + δ/2

ε)r exp( (1 ε)2α2

2ν2 ) + δ/2.

Next, we have

ε)r exp( (1 ε)2α2

2ν2 ) δ if α ν

δ ) + r log(1 + 2

Since this bound holds for any ε, choosing ε = 1 p

2/e, we find that

V r Φ w 2 2 eν2h r log(1 + 2/(1 p

2/e)) + log(1/δ ) i eν2[er + log(1/δ )]

with probability at least 1 δ/2 δ as claimed.

C.1.2. PROOF OF (28): BOUNDING V rΦ w 2 2

w 2 2 = p inpin + p outpout 2p inpout = nin

n2 in + nout

n2 out 2nout ninnout = 1 nout 1 nin , (30)

we have, for Λ r diag(λr+1, , λn) and λmax the maximum eigenvalue of a SPSD matrix,

V rΦ w 2 2 = w V rΛ r V rw λmax(V rΛ r V r) w 2 2 (30) = λr+1( 1

nout 1 nin ).

C.2. Proof of kernel max seminorm bound (3)

We begin by establishing a general bound on the maximum discrepancy between input and output expectations over a collection of test functions admitting a finite cover.

Lemma C.3 (Discrepancy cover bound). Fix any kernel k, subset F Hk, and scalars ε 0 and δ (0, 1). Define

a supf F f k and BF {f Hk : f k a},

Low-Rank Thinning

and let Cϵ,F be a set of minimum cardinality satisfying

Cϵ,F BF and supf F minf Cϵ,F maxx Xin |f(x) f (x)| ε. (31)

If (Pin Pout)k is (k, ν)-sub-Gaussian on an event E (Def. A.2), then, on E,

Pin Pout F supf F(Pin Pout)f 2ϵ + νa p

2 log(|Cϵ,F|/δ ) with probability at least 1 δ .

Proof. The triangle inequality and the covering property (31) together imply that, with probability 1,

(Pin Pout)f minf Cϵ,F (Pin Pout)f + |(Pin Pout)(f f )|

Pin Pout Cϵ,F + minf Cϵ,F |Pin(f f )| + |Pout(f f )|

Pin Pout Cϵ,F + 2 minf Cϵ,F maxx Xin |f(x) f (x)|

Pin Pout Cϵ,F + 2ε (32)

for each f F. Since s 7 ets is increasing, the bound (32), the assumed sub-Gaussianity (Def. A.2), and the fact that Cϵ,F belongs to BF imply that

EE[exp(t Pin Pout F)] e2tεEE[exp(t Pin Pout Cϵ,F )]

f Cϵ,F e2tεEE[exp(t(Pin Pout)f )]

f Cϵ,F exp( t2ν2 f 2 k 2 + 2tϵ) |Cϵ,F| exp( t2ν2a2

Now, by Markov s inequality (Markov, 1884), for any α > 0,

PE(supf F(Pin Pout)f > α + 2ϵ) inft>0 EE[exp(t Pin Pout F)]/ exp(t(α + 2ϵ))

|Cϵ,F| inft>0 exp( t2ν2a2

2 tα) = |Cϵ,F| exp( α2

Finally, choosing α = νa p

2 log(|Cϵ,F|/δ ) yields the desired claim.

Now fix any ϵ 0, δ (0, 1), and kernel k that generates K, and consider the subset F = { k(xi, ) : i I}. Since K(pin pout) I = Pin Pout F and supf F f k = DI, Lem. C.3 implies that, on the event E,

K(pin pout) I 2ϵ + νDI p

2 log(|Cϵ,F|/δ ) with probability at least 1 δ .

Since P(Ec) δ/2 and |F| 2|Z|, we use the estimate |C0,F| 2|I| with ϵ = 0 to obtain the advertised bound (3).

C.3. Proof of Lipschitz kernel max seminorm bound (4)

Introduce the query point set Z {xi : i I}, fix any δ (0, 1) and z0 Z, and define the symmetrized seminorm

(Pin Pout)k Z,Z supz,z Z |(Pin Pout)k(z) (Pin Pout)k(z )|.

By the triangle inequality and the derivation of App. C.2, we have, on the event E,

K(pin pout) I (Pin Pout)k Z,Z + |(Pin Pout)k(z0)|

(Pin Pout)k Z,Z + ν p

k(z0, z0) p

2 log(4/δ ) with probability at least 1 δ /2. (33)

Since P(Ec) δ/2, it only remains to upper bound (Pin Pout)k Z,Z on E with probability at least 1 δ /2.

To this end, we first establish that ((Pin Pout)k(z))z Z is a sub-Gaussian process on E with respect to a particular bounded-H older metric ρ.

Definition C.1 (Sub-Gaussian process on an event). We say an indexed collection of random variables (Xθ)θ Θ is a sub-Gaussian process with respect to ρ on an event E if ρ is a metric on Θ and

EE h exp (Xθ X θ)2

ρ(θ,θ )2 i 2 for all θ, θ Θ.

Low-Rank Thinning

Lemma C.4 (Bounded-H older sub-Gaussian process). Consider a kernel k on X = Rd satisfying |k(z, x) k(z , x)| Lk z z 2 for all z, z Z X and x Xin. If (Pin Pout)k is (k, ν)-sub-Gaussian on an event E (Def. A.2), then ((Pin Pout)k(z))z Z is a sub-Gaussian process on E with respect to the metric

ρ(z, z ) ν p

8/3 min(2 supz Z p

2Lk z z 2). (34)

The proof of Lem. C.4 can be found in App. C.4. Our next lemma, a slight modification of Wainwright (2019, Thm. 5.36), bounds the suprema of symmetrized sub-Gaussian processes on an event in terms of covering numbers.

Lemma C.5 (Sub-Gaussian process tails). Suppose (Xθ)θ Θ is a sub-Gaussian process with respect to ρ on an event E, and define the diameter diam(Θ, ρ) supθ,θ Θ ρ(θ, θ ), the covering number

N(u; Θ, ρ) min{|Cu| : Cu Θ, maxθ Θ minθ Cu ρ(θ, θ ) u} for all u > 0,

and the entropy integral J (Θ, ρ) R diam(Θ,ρ) 0 p

log(1 + N(u; Θ, ρ)) du. Then,

PE(supθ,θ Θ |Xθ Xθ | 8(J (Θ, ρ) + t)) 2 exp( t2/ diam(Θ, ρ)2) for all t > 0.

Proof. Since p

log(1 + xy) p

log((1 + x)(1 + y)) p

log(1 + x) + p

log(1 + y) for all x, y > 0, the proof is identical to that of Wainwright (2019, Thm. 5.36) with c1 = 8 and (EE, PE) substituted for (E, P).

Our final lemma bounds the diameter, covering numbers, and entropy integral of Z using the metric ρ.

Lemma C.6 (Covering properties of bounded-H older metric). Consider the bounded-H older metric ρ (34) for a kernel k on X = Rd and a finite set Z X. If Z is a matrix with one row corresponding to each element of Z, r = rank(Z), and R = maxz Z z 2, then, in the notation of Lem. C.5,

N(u; Z, ρ) (1 + c2/u2)r for c ν q

3 RLk and all u > 0, (35)

diam(Z, ρ) D min(c, ν q

k(z, z)), and (36)

J (Z, ρ) D q

Proof. The diameter bound (36) follows directly from the definition of ρ (34) and the fact maxz,z Z z z 2 2R.

To establish the covering number bound (35), we let UΣV be a compact singular value decomposition of Z so that

V Rd r, Z = ZVV , and maxz Z V z 2 = maxz Z z 2 = R.

Fix any ϵ > 0, and let C and Cext be a sets of minimum cardinality satisfying

C Br(R), maxv Br(R) minv C v v 2 ϵ2/2,

Cext Bd(R), and maxz Z minz Cext z z 2 ϵ2/2. (37)

Since V z Br(R) for each z Z and Vv Bd for each v Br, we have

maxz Z minv C Vv z 2 = maxz Z minv C V(v V z) 2 = maxz Z minv C v V z 2 ϵ2/2,

so that VC satisfies the criteria of (37). Since |VC| |C| (1 + 4R/ϵ2)r by Wainwright (2019, Lem. 5.2), we must also have |Cext| (1 + 4R/ϵ2)r.

Now, since Cext has minimum cardinality amongst sets satisfying (37), for each z Cext, there is some z Z satisfying z z 2 ϵ2/2 (or else z would be superfluous). Hence, there exists a set Cint Z satisfying

|Cint| |Cext| (1 + 4R/ϵ2)r and maxz Z minz Cint z z 2 ϵ2.

Low-Rank Thinning

Moreover, by our metric definition (34),

maxz Z minz Cint ρ(z, z ) c 2

R maxz Z minz Cint p

Hence, for u = cϵ 2

R, N(u; Z, ρ) |Cint| (1 + c2/u2)r. Since ϵ > 0 was arbitrary, we have established (35).

Finally, we bound the entropy integral using the inequality 1 c2/u2 for u [0, D], the concavity of the square-root function, and Jensen s inequality:

J (Z, ρ) R D 0 p

log(1 + (1 + c2/u2)r) du R D 0 p

log((3c2/u2)r) du = R D 0

1 D R D 0 2r log(

3c/u) du = D q

Together, Lems. C.4, C.5, and C.6 imply that, in the notation of Lem. C.6,

(Pin Pout)k Z,Z 8D q

3ec/D) + 8D p

on E with probability at least 1 δ /2. Combining this bound with the inequality (33) yields the result.

C.4. Proof of Lem. C.4: Bounded-H older sub-Gaussian process

Define Xz = (Pin Pout)k(z) for each z Z, and fix any z, z Z. Our sub-Gaussianity assumption implies

EE[exp(λ(Xz Xz )] exp( ν2λ2

2 k(z, ) k(z , ) 2 k) for all λ R.

Moreover, by our Lipschitz assumption,

k(z, ) k(z , ) 2 k = k(z, z) k(z, z ) + k(z , z ) k(z , z) min(4 maxz Z k(z, z), 2Lk z z 2).

Finally, Lem. C.7 shows that EE[exp( (Xz Xz )2

ρ(z,z )2 ] 2 so that (Xz)z Z is a sub-Gaussian process on E with respect to ρ.

Lemma C.7 (Squared exponential moment bound). If EE[exp(λX)] exp( ν2λ2

2 ) for all λ R, then EE[exp( 3X2

Proof. The proof is identical to that in Wainwright (2019, Sec. 2.4) with EE substituted for E.

D. Proof of Cor. 1: Gaussian MMD of KH

Cor. 1 follows immediately from the following explicit, non-asymptotic bound. Corollary D.1 (Detailed Gaussian MMD of KH). If Xin Bd(R) for R > 0, then KH(δ) with k = GAUSS(η), n = nin (2e)d, and b 1

MMD2 K(pin, pout) 1 n2 out log( 4nout

δ ) h e2 max n 2e

d log(ninnoutb) d, ( R2ηe34

d )do + e log( 1

δ ) i + 1 noutb( 1

nout 1 nin )

with probability at least 1 δ/2 δ .

Proof. Consider the approximate rank parameter

d log(ninnoutb) d, (R2ηe34/d)do .

The assumption nin (2e)d and the fact that b 1/(2dnout) ensure that log(ninnoutb) d + log(noutb/2d) d and therefore that r (2e)d. Hence, by Altschuler et al. (2019, Thm. 3), the (r + 1)-th eigenvalue of K satisfies

λr +1 nin exp d

d log(ninnoutb), (R2ηe34/d) log d max{ 2e

d log(ninnoutb),(R2ηe34/d)}

nin exp{ log(ninnoutb) log(e)} nin 1 ninnoutb = 1 noutb.

Low-Rank Thinning

Since K max = 1 and KH(δ) Gν(K) with ν defined in Prop. B.2, the result now follows from Thm. 1.

E. Proof of Cor. 2: Intrinsic Gaussian MMD of KH

Assumption E.1 (d -manifold with Q-smooth atlas (Altschuler et al., 2019, Assum. 1)). Let Ω Rd be a smooth compact manifold without boundary of dimension d < d. Let (Ψj, Uj)j [T ] for T N be an atlas for Ω, where (Uj)j are open sets covering Ωand Ψj : Uj 7 Bd (rj) are smooth maps with smooth inverses, mapping Uj bijectively to Bd (rj). Assume that there exists Q > 0 such that supu Bd (rj) DαΨ 1 j (u) Q|α| for all α Nd and j [T],

where |α| Pd

j=1 αj and Dα = |α|

uα1 1 ... u αd d for α Nd .

Cor. 2 follows immediately from the following more detailed result.

Corollary E.1 (Detailed Intrinsic Gaussian MMD of KH). Suppose Xin lies on a manifold Ω Bd satisfying Assump. E.1. Then KH(δ) with k = GAUSS(η) and n = nin delivers

MMD2 K(pin, pout) 1 n2 out log( 4nout

c5d /2 log 5d

2 (ninnout) + e log( 1

δ ) + 1 nout ( 1

nout 1 nin )

with probability at least 1 δ

2 δ for c independent of Xin.

Proof. Altschuler et al. (2019, Thm. 4) showed that the (r+1)-th eigenvalue of K satisfies (7) for a constant c independent of X = Xin. Since K max = 1 and KH(δ) Gν(K) with ν defined in Prop. B.2, the result now follows from Thm. 1 with r = (log(ninnout)/c)5d /2.

F. Proof of Thm. 2: Quality of Thinformer

Throughout we will make use of the convenient representation

b T = b D 1 b AV for Iout {i [n] : ( ki, vi) Xout}, b A n nout (exp( qi,kj

d )1[j Iout])n i,j=1, and b D b A1n. (38)

Our proof makes use of three lemmas. The first, proved in App. F.1, bounds the approximation error for the attention matrix T in terms of the approximation error for AV and A1n.

Lemma F.1 (Decomposing attention approximation error). In the notation of Alg. 1 and (38),

b D 1 b AV D 1AV max min ( 1

n D) 1 max, ( 1

n b D) 1 max ( 1

n b AV AV max + 1

n A1n b A1n V max).

The second, proved in App. F.2, bounds the approximation error for AV and A1n in terms of the KMS (1) for a specific choice of attention kernel matrix.

Lemma F.2 (KMS bound on attention approximation error). Instantiate the notation of Alg. 1 and (38) and define the query set

X {xi+nj ( qi, ed+1 j ) : i [n], j [d + 1]} where qi qi/d 1 4

and ed+1 j is the j-th standard basis vector in Rd+1. If Katt katt(X, X) for X X Xin, then

n ( b A A)V max, 1

n ( b A A)1n V max = Katt(pin pout) I for I [n(d + 1)].

Our third lemma, proved in App. F.3, bounds the size of key parameters of the thinned attention problem.

Lemma F.3 (Thinned attention problem parameters). Instantiate the notation of Lem. F.2, and define R

Low-Rank Thinning

maxi [n] max( qi 2, ki 2). Then, for all i, j I and l supp(pin),

n D) 1 max exp( R2

d), maxx Xin p

katt(x, x) exp( R2

V 2 2, + V 2max,

RI maxi I xi 2 q

d + 1, DI maxi I p

Katt,ii exp( R2

rank(XI) d + 1 for XI [xi] i I, and

|Katt,il Katt,jl| LKatt xi xj 2 for LKatt exp( R2

d + 2 V max.

Now instantiate the notation of Lem. F.2, and define the coefficient

2 3 (d + 1) log(3e2( R2

d + 2) V max) + p

2 log(8)(1 + 32

Together, Lem. F.3, the KMS quality bound of Thm. 1, and the KH-COMPRESS(0.5) sub-Gaussian constant ν of Prop. B.5 imply that, with probability at least 1

Katt(pin pout) I c 2

V 2 2, + V 2max

log2(nout) log(8nout log2 nin nout )

Hence, by Lems. F.1 and F.2, with probability at least 1

b D 1 b AV D 1AV max c

V 2 2, + V 2max

log2(nout) log(8nout log2 nin nout )

log2(nout) log(8nout log2 nin nout )

F.1. Proof of Lem. F.1: Decomposing attention approximation error

By the triangle inequality, we have

b D 1 b AV D 1AV max b D 1 b AV b D 1AV max + b D 1AV D 1AV max.

We bound the first term on the right-hand side using the submultiplicativity of the max norm under diagonal rescaling:

b D 1 b AV b D 1AV max b D 1 max b AV AV max = ( 1

n b D) 1 max 1

n b AV AV max.

To bound the second term we use the same submultiplicativity property and the fact that each entry of D 1AV is the average of values in V:

b D 1AV D 1AV max = b D 1(D b D)D 1AV max b D 1 max D b D max D 1AV max

n b D) 1 max 1

n A1n b A1n V max.

An identical argument reversing the roles of (D, A) and ( b D, b A) yields the second bound.

F.2. Proof of Lem. F.2: KMS bound on attention approximation error

Define the augmented value matrix e V = [V, V max1n] Rd+1. By the definition of Katt and b A,

Katt(pin pout) I = maxi [n],j [d+1] | P

ℓ [n] Aiℓe Vℓj(pin pout)ℓ| = 1

n (A b A) e Ved j = 1

n (A b A) e V max.

F.3. Proof of Lem. F.3: Thinned attention problem parameters

First, by the Cauchy-Schwarz inequality and the nonnegativity of D = A1n we have

n D) 1 max = 1 mini [n] 1 n P

j [n] Aij 1

mini [n],j [n] exp( qi,kj

mini [n],j [n] exp( qi 2 kj 2

d ) exp( R2

Low-Rank Thinning

Second, the maxx Xin p

katt(x, x) inequality follows as

katt(( ki, vi), ( ki, vi)) = exp( ki 2 2

d )( vi 2 2 + V 2 max) exp( R2

d)( V 2 2, + V 2 max).

Third, the RI inequality follows as

( qi, ed+1 j ) 2 = p

qi 2 2 + 1 q

d + 1 for all i [n], j [d + 1].

Fourth, the DI inequality follows as

maxi I Katt,ii = maxi [n] exp( qi 2 2

d ) exp( R2

Fifth, the rank inequality follows as xi Rd+1 for i I. Finally, the Lipschitz inequality follows as, for any i, k, l [n] and j, m [d + 1],

| exp( qi,kl

d ) ed+1 j , vl exp( qk,kl

d ) ed+1 m , vl |

d )| vlj vlm| + | exp( qi,kl

d ) exp( qk,kl

exp( qi 2 kl 2

d ) ed+1 j ed+1 m 2 | vlj vlm|

2 + exp( max( qi 2, qk 2) kl 2

d )| qi qk,kl

d) ed+1 j ed+1 m 2 | vlj vlm|

2 + exp( R2

d) qi qk 2R

d) ed+1 j ed+1 m 2

2 V max + exp( R2

d) qi qk 2R

d + 2 V max ( qi, ed+1 j ) ( qk, ed+1 m ) 2

by the triangle inequality, multiple applications of Cauchy-Schwarz, and the mean-value theorem applied to x 7 ex.

G. Proof of Thm. 3: LKH-SGD convergence

Our proof makes use of three intermediate results. The first, inspired by Harvey & Samadi (2014, Thm. 10) and Cooper et al. (2023, Lem. 1), relates the quality of the ordering produced by Alg. 2 to the quality of the thinning.

Lemma G.1 (Quality of thinned reordering). The output of thinned reordering (Alg. 2) satisfies

maxj [n] Pj i=1 xk πk+1(π 1 k (i)) 2 1

2 maxj [n] Pj i=1 xk i 2 + 1

2 maxj [n] Pj i=1 ηk i xk i 2 + Pn i=1 xk i 2

where π 1 k is the inverse permutation of πk and ηk i 2(1 xk i X k out 1).

Proof. Fix any j argmaxj [n] Pj i=1 xk πk+1(π 1 k (i)) 2. If j n/2, then

i=1 xk πk+1(π 1 k (i)) 2 2 maxj [n] Pj i=1 1 ηk i = 1 xk i 2 maxj [n] Pj i=1 xk i 2 + maxj [n] Pj i=1 ηk i xk i 2

by the triangle inequality. Similarly, if j > n/2, then,

i=1 xk πk+1(π 1 k (i)) 2 Pn i=1 xk i 2) 2 P

i>j xk πk+1(π 1 k (i)) 2 2 maxj [n] Pj i=1 1 ηk i = 1 xk i 2

maxj [n] Pj i=1 xk i 2 + maxj [n] Pj i=1 ηk i xk i 2.

The second, a mild adaptation of Cooper et al. (2023, Thms. 2 and 3), bounds the convergence rate of SGD with thinned reordering in terms of the thinning quality.

Low-Rank Thinning

Theorem G.1 (Convergence of SGD with thinned reordering). Suppose that, for all i [n] and w, v Rd,

fi(w) f(w) 2 2 σ2 and fi(w) fi(v) 2 L w v 2

and that SGD (10) with thinned reordering (Alg. 2) satisfies the prefix discrepancy bound

maxj [n] Pj i=1 ηk i xk i 2 2 A maxi [n] xk i xk 2 for ηk i 2(1 xk i X k out 1), xk 1

n Pn i=1 xk i , (39)

and each epoch k [K]. Then the step size setting

α = min 1 16L(2n+ A), 4F1 42L2σ2 A2n K+18L2n3σ2 1/3 with F1 f(w1) f and f infv Rd f(v)

yields the convergence bound

1 K PK k=1 f(wk) 2 9(F1Lσ A)2/3

(n K)2/3 + (72F1Lσ)2/3+64F1L(2+ A/(n)) K .

If, in addition, f satisfies the µ-Polyak-Łojasiewicz (PL) condition,

µ(f(w) f ) 1

2 f(w) 2 2 for all w Rd,

and the number of epochs satisfies

µ32L(2 + A/n) W for W W0(K2n2C3) and C3 (F1+σ2/L)µ2

224L2σ2 A2 ,

where W0 denotes the Lambert W function, then the step size setting α = 2 W Knµ yields the convergence bound

f(w K) f 1 (n K)2 (F1+L2σ2) W C3 + 112L2σ2 A2 W 2 µ3 .

Proof. The proof is identical to that of Cooper et al. (2023, Thms. 2 and 3) with m = 1 worker once each instance of is replaced with 2, each instance of L2, is replaced with L, each instance of T is replaced with K, and Lem. G.1 is substituted for Cooper et al. (2023, Lem. 1).

The final result uses Thm. 1 to bound the prefix discrepancy of LKH(δ). Lemma G.2 (LKH(δ) prefix discrepancy). Fix any epoch k [K]. With probability at least 1 δ

2 δ , thinned reordering (Alg. 2) with LKH(δ) satisfies the prefix discrepancy bound (39) with

log( 2n(log(n/2)+1)

δ ) e2 rankϵk(Xk) + 10e log( n

for Xk [xk 1, . . . , xk n] , ϵk maxi [n]

9e log(2n log(en/2)/δ) log(n/δ ) xk i xk 2 n , and xk 1

n Pn i=1 xk i .

Proof. Define X k = {xk 1, . . . , xk n}, c = 2 maxi [n] xk i xk 2, and r = rankϵk(Xk). For any j [n], we can write

Pj i=1 ηk i xk i 2 = Pj i=1 xk i Pj i=1 1 xk i X k out,j xk i 2 = 2j (Xk) (pj in pj out) 2 = 2j MMDXk(Xk) (pj in, pj out)

where pj in and pj out are the empirical distributions over X k in,j = (xk i )j i=1 and X k out,j = {xk i X k out : i [j]}.

Since LKH(δ) is an online algorithm that assigns signs (ηk i , ηk i+1 = 1 ηk i ) to the points (xk i , xk i+1) sequentially, we can view X k out,j as the output of LKH(δ) applied to X k in,j with nout = j

2 and the linear kernel k(x, y) = x, y for each j [n]. Therefore, we may invoke the established LKH(δ) sub-Gaussian constants νj of Prop. B.3, Thm. 1, the union bound, and the definition of ϵ-rank (Def. 4) to deduce that

maxj [n] Pj i=1 ηk i xk i 2 2 maxj [n] 4j2ν2 j e2r + e log( n

δ ) + σr+1(Xk)2 4j2

with probability at least 1 δ

Thm. 3 now follows directly from Thm. G.1 and Lem. G.2 applied to LKH( 1

2K ) with δ = 1 4K and a union bound over epochs.

Low-Rank Thinning

H. KT-COMPRESS(δ)

We describe the thinning algorithm KT-COMPRESS(δ) used in Alg. 3. We use KH(δ) for every halving round except for the last round, which thins a point sequence of size 2nout to nout. For this final halving round we use KH-REFINE(δ) (Alg. H.1) derived from the KT-SWAP algorithm of Dwivedi & Mackey (2024, Alg. 1b). The refinement stage of Alg. H.1 greedily improves the MMD of the initial KH(δ) output. Hence, MMDk(Xin, Xout) MMDk(Xin, S(1)) with probability 1.

Algorithm H.1: KH-REFINE(δ): KH(δ) with greedy refinement (Dwivedi & Mackey, 2024, Alg. 1a)

Input: point sequence Xin = (xi)nin i=1, kernel k, input size nin 2N

S KH(δ)(Xin, k); nout nin/2

// Swap out each point in Xout for the best alternative in Xin Xout S.copy() for x S do

Xout Xout \ {x} {argminx Xin MMDk(Pin, Pout + 1 nout (δx δx))} end return Xout, refined coreset of size nin/2

I. Proof of Thm. 4: Low-rank analysis of CTT power

Thm. 4 follows from the following more detailed statement, proved in App. I.1, as

R2 K(nin, eβ 20sn , g) + R2 K (nin, eβ 20sn , g) = O(b R2 k).

Theorem I.1 (Low-rank analysis of CTT power, detailed). Under the assumptions of Thm. 4 with nin m+n

s , CTT (Alg. 3) rejects with probability at least 1 β whenever c MMDk(P, Q)/ p

log(1/γ) exceeds

2ceβ/(20s) k 1 2 m + Rk(P,nin, e β 20sm ,g)+Rk(Q,nin, e β 20sn ,g) 2g m .

Here, c > 0 is a universal constant, cδ 2 + q

δ ), and Rk(P, nin, δ, g) and Rk(Q, nin, δ, g) respectively denote the

2)-th quantiles of RK(nin, δ, g) and RK (nin, δ, g), where

R2 e K(nin, δ, g) 256(log4 nin g 1)( p

log(nin + 1) + p

log(2/δ))2 (40)

e log( 6 2g nin(log4 nin g)

log( 3nin(log4 nin g 1)

+ minr 2g+1 nin

e2r log 6 2g nin(log4 nin g)

λr+1( e K) 2g 1 nin

I.1. Proof of Thm. I.1: Low-rank analysis of CTT power, detailed

Recall the following definition from Shetty et al. (2022, Def. 3).

Definition I.1 (k-sub-Gaussian thinning algorithm). We say a thinning algorithm ALG (satisfying Def. 1) is k-sub Gaussian on an event E with shift a and parameter v if

PE(MMDk(Pin, Pout) a + v

t | Xin) e t for all t 0.

Fix e K {K, K }. To conclude our power result, it suffices, by Domingo-Enrich et al. (2023, Rmk. 2, App. B.1) and the failure probability setting of Domingo-Enrich et al. (2023, Lem. 11), to establish that

R2 e K(nin, δ, g) = 256(log4 nin g 1)(C e K(δ, 2g+1 nin) + M e K(δ, 2g+1 nin) q

log( 3nin(log4 nin g 1)

log(nin + 1) + p

log(2/δ))2, (41)

Low-Rank Thinning

for any scalars C e K(δ, 2g+1 nin) and M e K(δ, 2g+1 nin) satisfying the property that, on an event of probability at least 1 δ/2, every call to HALVE KH( ℓ2 nin4g+1(log4 nin g)δ) with input size ℓand output size ℓ/2 is k-sub-Gaussian (Def. I.1) with shift aℓ,nin, e K and parameter vℓ,nin, e K satisfying

aℓ,nin, e K = Cf K(δ,ℓ)

ℓ/2 and vℓ,nin, e K = Mf K(δ,ℓ)

log( 12nin4g(log4 nin g)

Substituting M e K(δ, 2g+1 nin) = (2g nin)v2g+1 nin,nin, e K h log( 12nin4g(log4 nin g)

2g+1 ninδ ) i 1

2 and C e K(δ, 2g+1 nin) =

(2g nin)a2g+1 nin,nin, e K into (41), we obtain the sufficient condition

R2 e K(nin, δ, g) = 256(log4 nin g 1) (2g nin)2 ( p

log(nin + 1) + p

a2g+1 nin,nin, e K + v2g+1 nin,nin, e K h log( 12nin4g(log4 nin g)

2g+1 ninδ ) i 1

log( 3nin(log4 nin g 1)

δ ) 2 . (42)

We now identify suitable aℓ,nin, e K and vℓ,nin, e K with the aid of the following lemma, proved in App. I.2.

Lemma I.1 ((K, ν, δ)-sub-Gaussian thinning algorithms are k-sub-Gaussian). Suppose ALG is a (K, ν, δ)-sub Gaussian thinning algorithm, satisfying Def. 3 with an event E of probability at least 1 δ/2. Then ALG is k-sub-Gaussian (Def. I.1) on E with shift anout,nin,K and parameter vnout,nin,K defined as

anout,nin,K ν e + minr nin n ν

nout 1 nin ) o and vnout,nin,K ν e.

By Prop. B.4 and Lem. A.1, KH( ℓ2 nin4g+1(log4 nin g)δ) with input size ℓand output size ℓ/2 is a ( e K, ν, ℓ2 nin4g+1(log4 nin g)δ)- sub-Gaussian thinning algorithm with

log 6(ℓ/2) log2(ℓ/(ℓ/2))

δ nin4g+1(log4 nin g)

ℓ2 e K max = 2 (ℓ/2)

log 12nin4g(log4 nin g)

ℓδ e K max.

By Lem. I.1, on an event of probability at least 1 ℓ2 2nin4g+1(log4 nin g)δ, KH( ℓ2 nin4g+1(log4 nin g)δ) with input size ℓand output size ℓ/2 is a k-sub-Gaussian thinning algorithm with shift aℓ,nin, e K and parameter vℓ,nin, e K defined as

aℓ,nin, e K = 2 (ℓ/2)

log 12nin4g(log4 nin g)

ℓδ e K max e log 2

log 12nin4g(log4 nin g)

λr+1( e K)( 1

ℓ) and (43)

vℓ,nin, e K = 2 (ℓ/2)

log 12nin4g(log4 nin g)

ℓδ e K max e. (44)

Moreover, by the union bound, as detailed in Shetty et al. (2022, App. F.1), every HALVE call made by KT-COMPRESS is simultaneously k-sub-Gaussian with these input-size-dependent parameters on a common event of probability at least 1 δ

2. Substituting (43) and (44) with ℓ= 2g+1 nin into (42), we obtain our error inflation factor expression (40), completing the proof.

I.2. Proof of Lem. I.1: (K, ν, δ)-sub-Gaussian thinning algorithms are k-sub-Gaussian

Fix any t 0, and let δ = e t. By our sub-Gaussian assumption, Thm. 1 implies that, as advertised,

e t PE MMD2 K(pin, pout) minr nin ν2 e2r + et + λr+1(K)( 1

nout 1 nin )

= PE MMDK(pin, pout) minr nin q

ν2[e2r + et] + λr+1(K)( 1

nout 1 nin )

PE MMDK(pin, pout) ν e

t + minr nin ν

nout 1 nin ) .

Low-Rank Thinning

J. Proof of Cor. 3: Power of deep kernel CTT

Define the radius

R maxy Y X (ϕ(y), y) 2,

the augmented vectors Y {(ϕ(y), y)}y Y, and the augmented kernel

q ((ϕ(x), x), (ϕ(y), y)) κ(ϕ(x), ϕ(y))q(x, y) = exp( η (ϕ(x), x) (ϕ(y), y) 2 2).

Since the deep kernel (13) takes the form

kdeep(x, y) = (1 ϵ)q ((ϕ(x), x), (ϕ(y), y)) + ϵq(x, y)

we also have

Kdeep kdeep(Y, Y) = (1 ϵ)Q + ϵQ for Q q (Y , Y ) and Q q(Y, Y).

Hence, by Weyl s inequality (Horn & Johnson, 1985, Thm. 4.3.1) and the Gaussian kernel matrix eigenvalue bound (6),

λ2r+1(Kdeep) (1 ϵ)λr+1(Q ) + ϵλr+1(Q) ne d

2e r1/d log d r1/d

for (2e)d r < n.

Parallel reasoning and the assumption m n yield the same bound for λ2r+1(kdeep(X, X)) and (2e)d r < m. Now consider the approximate rank parameter

d log(nnoutb) d , (R 2ηe34/d )d o for b 1

Then, for n (2e)d , we have, exactly as in App. D,

λ2r +1(Kdeep) + λ2r +1(kdeep(X, X)) 2 noutb

and therefore

eβ ) max n 2e

d log(nnoutb) d /2, (R 2ηe34/d )d /2o .

Our final step is to bound the quantile of the sole remaining data-dependent term, R . Since the inputs are c-sub-Gaussian (14), Lem. 1 of Dwivedi & Mackey (2024) with ψ 1(r) = log r c implies that the 1 eβ 20sn quantile of R is O q

yielding the result.

K. Proof of Cor. 4: Power of deep manifold kernel CTT

Our reasoning is identical to that in App. J with the manifold Gaussian kernel matrix eigenvalue bound (7) now substituted for the Euclidean ball bound (6) and the approximate rank setting r = (log(nnout)/c)5d /2 substituted for (45).

L. Supplementary Experiment Details

L.1. Approximating attention experiments

The experiment of Tab. 3 was carried out using Python 3.12.9, Py Torch 2.8.0.dev20250407+cu128 (Paszke et al., 2019), and an Ubuntu 22.04.5 LTS server with an AMD EPYC 7V13 64-Core Processor, 220 GB RAM, and a single NVIDIA A100 GPU (80 GB memory, CUDA 12.8, driver version 570.124.04). For reference, attention layer 1 has (n, d) = (3136, 64) and attention layer 2 has (n, d) = (784, 64). For each of the first 50 Image Net 2012 validation set batches of size 64, we measured the time required to complete a forward pass through each the approximate softmax matrix (8) layer using CUDA events following 10 warm-up batches to initialize the GPU. Tab. L.1 provides the hyperparameter settings for each attention approximation in Tab. 3.

Low-Rank Thinning

The experiment of Tab. 4 was carried out using Python 3.12.9, Py Torch 2.6.0, and an Ubuntu 22.04.5 LTS server with an Intel(R) Xeon(R) Gold 5218 CPU Processor, 100 GB RAM, and a single NVIDIA A6000 GPU (48 GB memory, CUDA 12.1, driver version 530.30.02). For reference, the Big GAN model contains a single attention layer with 4096 queries in R64, 1024 keys in R64, and 1024 values in R256. We measured the time required to complete a forward pass through the approximate softmax matrix (8) layer using CUDA events following 10 warm-up batches to initialize the GPU. Tab. L.2 provides the hyperparameter settings for each attention approximation in Tab. 4.

The settings and implementations for all methods other than Thinformer were provided by Zandieh et al. (2023), and our experiment code builds on their open-source repository https://github.com/majid-daliri/kdeformer.

Table L.1: Configurations for the attention approximation methods of Tab. 3.

Attention Algorithm Layer 1 Configuration Layer 2 Configuration

Performer num_features=49 num_features=12

Reformer bucket_size=49 bucket_size=12 n_hashes=2 n_hashes=2

Scatter Brain local_context=49 local_context=12 num_features=48 num_features=6

KDEformer sample_size=64 sample_size=56 bucket_size=32 bucket_size=32 Thinformer (Ours) g=2 g=4

Table L.2: Configurations for the attention approximation methods of Tab. 4.

Attention Algorithm Layer Configuration

Performer num_features=128

Reformer bucket_size=64 n_hashes=8

Scatter Brain local_context=32 num_features=128

KDEformer sample_size=128 bucket_size=64 Thinformer (Ours) g=2

L.2. Faster SGD training experiments

The experiment of Sec. 5.2 was carried out using Python 3.10, Py Torch 2.0.1, a Rocky Linux 8.9 server with 64 CPU cores (Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz), and a NVIDIA A100 GPU (40 GB memory, CUDA 12.4, driver version 550.54.15).

Technically, the CD-Gra B: SBW algorithm requires an a priori upper bound on the maximum Euclidean norm bmax of any stochastic gradient that it will encounter. To conduct our experiment, we first estimate bmax by calculating the maximum gradient Euclidean encountered across 10 epochs of running SGD with LKH( 1

2K ) reordering. One would typically not choose to carry out such a two-step procedure in practice, but the experiment serves to demonstrate that the CD-Gra B: SBW leads to overly conservative performance even if reasonable upper bound is known in advance.

The settings and implementation for both random reshuffling (RR) and CD-Gra B: Greedy were those used in the original logistic regression on mortgage application experiment of Cooper et al. (2023). Our experiment code builds on the opensource CD-Gra B repository https://github.com/Garl Guo/CD-Gra B. As in Cooper et al. (2023), optimization was carried out with a learning rate of α = 0.01, datapoints were loaded in batches of size 16, and stochastic gradients were reordered for each datapoint individually.

Low-Rank Thinning

0 5 10 15 20 ϵk-rank of SGD Matrix Xk

Number of Occurrences

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 Singular Value Index

Singular Value (log scale)

Figure L.1: Approximate low-rank structure of stochastic gradient matrices. Left. For each epoch k of the LKH( 1

2K ) experiment of Sec. 5.2, we record the ϵk-rank of the stochastic gradient matrix Xk [xk 1, . . . , xk n] Rn d (see Def. 4 and Thm. 3). Notably, the ϵk-ranks are significantly smaller than the ambient dimension d = 19. Right. We display the singular values of the first stochastic gradient matrix, X1. The singular values drop off steeply, resulting in relatively small ϵk-ranks.

L.3. Cheap two-sample testing experiment

The experiment of Sec. 6.2 was carried out using Python 3.12.9, Py Torch 2.6.0+cu124, and an Ubuntu 20.04.6 LTS server with an AMD EPYC 9554 64-Core Processor, 100 GB RAM, and a single NVIDIA H100 GPU (96 GB memory, CUDA 12.2, driver version 535.154.05). Each test is run with replication count B = 100, nominal level α = 0.05, and failure probability δ = 0.5. The neural network ϕ was trained exactly as in Liu et al. (2020) (with learning rate 5 10 5 and batch size equal to the full training sample size), and runtime measurements exclude the time required to train ϕ. Our experiment code builds on the open-source deep kernel testing (https://github.com/fengliu90/DK-for-TST) and Compress Then Test (https://github.com/microsoft/goodpoints) repositories.

43 44 45 46 47

Sample Size n

Rkdeep / log5 (n)

Figure L.2: Slow-growing error inflation. In the experimental setting of Sec. 6.2, the error inflation factor b Rkdeep enjoys O(log5(n)) growth as the ratio b Rkdeep/ log5(n) decreases with n. Here, b R2 kdeep is defined by (12) with n = m, nin 2n

nout 2g nin for constants β = 0.05, s = 32, and g = 4.