# Supervised Kernel Thinning

Albert Gong, Kyuseong Choi, Raaz Dwivedi
Cornell Tech, Cornell University
{agong, kc728, dwivedi}@cornell.edu

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

The kernel thinning (KT) algorithm of [10] provides a better-than-i.i.d. compression of a generic set of points. By generating high-fidelity coresets of size significantly smaller than the input points, KT is known to speed up unsupervised tasks like Monte Carlo integration, uncertainty quantification, and non-parametric hypothesis testing, with minimal loss in statistical accuracy. In this work, we generalize the KT algorithm to speed up supervised learning problems involving kernel methods. Specifically, we combine two classical algorithms, Nadaraya-Watson (NW) regression (a.k.a. kernel smoothing) and kernel ridge regression (KRR), with KT to provide a quadratic speed-up in both training and inference times. We show how distribution compression with KT in each setting reduces to constructing an appropriate kernel, and introduce the Kernel-Thinned NW and Kernel-Thinned KRR estimators. We prove that KT-based regression estimators enjoy significantly superior computational efficiency over the full-data estimators and improved statistical efficiency over i.i.d. subsampling of the training data. En route, we also provide a novel multiplicative error guarantee for compressing with KT. We validate our design choices with both simulations and real data experiments.

1 Introduction

In supervised learning, the goal of coreset methods is to find a representative set of points on which to perform model training and inference. On the other hand, coreset methods in unsupervised learning have the goal of finding a representative set of points that can then be utilized for a broad class of downstream tasks, from integration [10, 9] to non-parametric hypothesis testing [8]. This work aims to bridge these two research threads. Leveraging recent advancements in compression for the unsupervised setting, we tackle the problem of non-parametric regression (formally defined in Sec. 2). Given a dataset of $n$ i.i.d. samples $(x_i, y_i)_{i=1}^n$, we want to learn a function $f$ such that $f(x_i) \approx y_i$. The set of allowable functions is determined by the kernel function, which is a powerful building block for capturing complex, non-linear relationships. Due to their strong performance in practice and their amenability to closed-form analysis, non-parametric regression methods based on kernels (a.k.a. kernel methods) have become a popular choice for a wide range of supervised learning tasks [13, 20, 24].

There are two popular approaches to non-parametric kernel regression. The first, and perhaps more classical, approach is kernel smoothing, also referred to as Nadaraya-Watson (NW) regression. The NW estimator at a point $x$ is effectively a smoothing of the labels $y_i$ for which $x_i$ is close to $x$; the weights are computed using the kernel function (see Sec. 2 for formal definitions). Importantly, the NW estimator takes $\Theta(n)$ pre-processing time (to simply store the data) and $\Theta(n)$ inference time for each test point $x$ ($n$ kernel evaluations and $n$ simple operations). Another popular approach is kernel ridge regression (KRR), which solves a non-parametric least-squares problem subject to the regression function lying in the reproducing kernel Hilbert space (RKHS) of a specified reproducing kernel function.
Remarkably, KRR admits a closed-form solution via inverting the associated kernel matrix, and takes $O(n^3)$ training time and $\Theta(n)$ inference time for each test point $x$. Our goal is to overcome the computational bottlenecks of kernel methods, while retaining their favorable statistical properties.

Previous attempts at using coreset methods include the work of Boutsidis et al. [4], Zheng and Phillips [32], and Phillips [18], which depend on a projection-type compression in the spirit of the celebrated Johnson-Lindenstrauss lemma, a metric-preserving projection result; consequently, their accuracy and runtime depend unfavorably on the desired statistical error rate. Kpotufe [14] proposes an algorithm to reduce the query time of the NW estimator to $O(\log n)$, but the algorithm requires super-linear preprocessing time. Other lines of work exploit the structure of kernels more directly, especially in the KRR literature. A slew of techniques from numerical analysis have been developed, including work on Nyström subsampling by El Alaoui and Mahoney [11], Avron et al. [1], and Díaz et al. [7]. Camoriano et al. [5] and Rudi et al. [22] combine early stopping with Nyström subsampling. Though more distant from our approach, we also note the random-features approach of Rahimi and Recht [21], the divide-and-conquer approach of Zhang et al. [31], and the block coordinate descent approach of Tu et al. [28].

Our contributions. In this work, we show how coreset methods can be used to speed up both training and inference in non-parametric regression for a large class of function classes/kernels. At the heart of these algorithms is a general procedure called kernel thinning [10, 9], which provides a worst-case bound on integration error (suited to problems in its original context of unsupervised learning and MCMC simulations). In Sec. 3, we introduce a meta-algorithm that recovers our two thinned non-parametric regression methods, one based on NW and one based on KRR. We introduce the kernel-thinned Nadaraya-Watson estimator (KT-NW) and the kernel-thinned kernel ridge regression estimator (KT-KRR). We show that KT-NW requires $O(n \log^3 n)$ time during training and $O(\sqrt{n})$ time at inference, while achieving a mean squared error (MSE) rate of $n^{-\beta/(\beta+d)}$ (Thm. 1), a strict improvement over uniform subsampling of the original input points. We show that KT-KRR requires $O(n^{3/2})$ time during training and $O(\sqrt{n})$ time during inference, while achieving a near-minimax-optimal rate of $\frac{m \log n}{n}$ when the kernel has finite rank $m$ (Thm. 2). We show how our KT-KRR guarantees can also be extended to the infinite-dimensional setting (Thm. 3). In Sec. 5, we apply our proposed methods to both simulated and real-world data. In line with our theory, KT-NW and KT-KRR outperform standard thinning baselines in terms of accuracy while retaining favorable runtimes.

2 Problem setup

We now formally describe the non-parametric regression problem. Let $x_1, \ldots, x_n$ be i.i.d. samples from the data distribution $\mathbb{P}$ (with density $p$) over the domain $\mathcal{X} \subseteq \mathbb{R}^d$, and let $w_1, \ldots, w_n$ be i.i.d. samples from $N(0,1)$. In the sequel, we use $\|\cdot\|$ to denote the Euclidean norm unless otherwise stated. Then define the response variables $y_1, \ldots, y_n$ by the following data-generating process:
$$y_i \triangleq f^\star(x_i) + v_i \quad \text{for } i = 1, 2, \ldots, n, \qquad (1)$$
where $f^\star : \mathcal{X} \to \mathcal{Y} \subseteq \mathbb{R}$ is the regression function and $v_i \triangleq \sigma w_i$ for some noise level $\sigma > 0$. Our task is to build an estimate for $f^\star$ given the $n$ observed points, denoted by $\mathcal{S}_{\mathrm{in}} \triangleq ((x_1, y_1), \ldots, (x_n, y_n))$.
Nadaraya-Watson (NW) estimator. A classical approach to estimate the function $f^\star$ is kernel smoothing, where one estimates the function value at a point $z$ using a weighted average of the observed outcomes. The weight for outcome $y_i$ depends on how close $x_i$ is to the point $z$; let $\kappa : \mathbb{R} \to \mathbb{R}$ denote this weighting function, so that the weight for $x_i$ is proportional to $\kappa(\|x_i - z\|/h)$ for some bandwidth parameter $h > 0$. Let $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ denote the shift-invariant kernel defined as $k(x_1, x_2) = \kappa(\|x_1 - x_2\|/h)$. Then this smoothing estimator, also known as the Nadaraya-Watson (NW) estimator, can be expressed as
$$\widehat f(\cdot) \triangleq \frac{\sum_{(x,y)\in\mathcal{S}_{\mathrm{in}}} k(\cdot, x)\, y}{\sum_{(x,y)\in\mathcal{S}_{\mathrm{in}}} k(\cdot, x)} \qquad (2)$$
whenever the denominator in the above display is non-zero. In the case the denominator in (2) is zero, we can make a default choice, which for simplicity here we choose as zero. We refer to the estimator (2) as the FULL-NW estimator hereafter. One can easily note that FULL-NW requires $O(n)$ storage for the input points and $O(n)$ kernel queries for inference at each point.

Kernel ridge regression (KRR) estimator. Another popular approach to estimate $f^\star$ is that of non-parametric (regularized) least squares. The solution in this approach, often called kernel ridge regression (KRR), is obtained by solving a least-squares objective where the fitted function is posited to lie in the RKHS $\mathcal{H}$ of a reproducing kernel $k$, and a regularization term is added to the objective to avoid overfitting.¹ Overall, the KRR estimate is the solution to the following regularized least-squares objective, where $\lambda > 0$ denotes a regularization hyperparameter:
$$\min_{f \in \mathcal{H}} \; \mathcal{L}_{\mathcal{S}_{\mathrm{in}}}(f) + \lambda \|f\|_k^2, \quad \text{where} \quad \mathcal{L}_{\mathcal{S}_{\mathrm{in}}}(f) \triangleq \frac{1}{n}\sum_{(x,y)\in\mathcal{S}_{\mathrm{in}}} (f(x) - y)^2. \qquad (3)$$
Like NW, an advantage of KRR is the existence of a closed-form solution
$$\widehat f_{\mathrm{full},\lambda}(\cdot) \triangleq \sum_{i=1}^{n} \alpha_i\, k(\cdot, x_i), \quad \text{where} \qquad (4)$$
$$\alpha \triangleq (K + n\lambda I_n)^{-1}\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^n \quad \text{and} \quad K \triangleq [k(x_i, x_j)]_{i,j=1}^n \in \mathbb{R}^{n\times n}. \qquad (5)$$
¹We note that while the KRR approach (3) does require $k$ to be reproducing, the NW approach (2) in full generality is valid even when $k$ is not a valid reproducing kernel.

Notably, the estimate $\widehat f_{\mathrm{full},\lambda}$, which we refer to as the FULL-KRR estimator, can also be seen as yet another instance of a weighted average of the observed outcomes. While the NW estimator imposes that the weights across the points sum to 1 (and are also non-negative whenever $k$ is), KRR allows for generic weights that need not be positive (even when $k$ is) and need not sum to 1. We note that naively solving for $\widehat f_{\mathrm{full},\lambda}$ requires $O(n^2)$ kernel evaluations to compute the kernel matrix, $O(n^3)$ time to compute a matrix inverse, and $O(n)$ kernel queries for inference at each point. One of our primary goals in this work is to tackle this high computational cost of FULL-KRR.
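To make the computational profile of the two full-data estimators concrete, here is a minimal NumPy sketch of the NW estimator (2) and the closed-form KRR solution (4)-(5). The Gaussian base kernel and the helper names are our own illustrative choices, not part of the paper's code; the comments flag the costs discussed above.

```python
import numpy as np

def gaussian_kernel(A, B, h=1.0):
    # Pairwise k(x1, x2) = exp(-||x1 - x2||^2 / (2 h^2)) for rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h ** 2))

def full_nw_predict(X_train, y_train, X_test, h=1.0):
    # NW estimator (2): kernel-weighted average of the labels,
    # with the default value 0 when the denominator vanishes.
    K = gaussian_kernel(X_test, X_train, h)      # Theta(n) kernel queries per test point
    num, den = K @ y_train, K.sum(axis=1)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def full_krr_fit(X_train, y_train, h=1.0, lam=1e-3):
    # KRR coefficients (5): alpha = (K + n*lam*I)^{-1} y, an O(n^3) solve.
    n = len(X_train)
    K = gaussian_kernel(X_train, X_train, h)
    return np.linalg.solve(K + n * lam * np.eye(n), y_train)

def full_krr_predict(alpha, X_train, X_test, h=1.0):
    # KRR prediction (4): sum_i alpha_i k(x, x_i), O(n) kernel queries per test point.
    return gaussian_kernel(X_test, X_train, h) @ alpha
```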
3 Speeding up non-parametric regression

We begin with a general approach to speed up regression by thinning the input datasets. While computationally superior, a generic thinning approach suffers from a loss of statistical accuracy, motivating the need for a strategic thinning approach. To that end, we briefly review kernel thinning and finally introduce our supervised kernel thinning approach.

3.1 Thinned regression estimators: Computational and statistical tradeoffs

Our generic approach comprises two main steps. First, we compress the input data by choosing a coreset $\mathcal{S}_{\mathrm{out}} \subseteq \mathcal{S}_{\mathrm{in}}$ of size $n_{\mathrm{out}} \triangleq |\mathcal{S}_{\mathrm{out}}|$. Second, we apply our off-the-shelf non-parametric regression methods from Sec. 2 to the compressed data. By setting $n_{\mathrm{out}} \ll n$, we can obtain notable speed-ups over the FULL versions of NW and KRR. Before we introduce the thinned versions of NW and KRR, let us define the following notation. Given an input sequence $\mathcal{S}_{\mathrm{in}}$ and output sequence $\mathcal{S}_{\mathrm{out}}$, define the empirical probability measures
$$\mathbb{P}_{\mathrm{in}} \triangleq \frac{1}{n}\sum_{(x,y)\in\mathcal{S}_{\mathrm{in}}} \delta_{(x,y)} \quad\text{and}\quad \mathbb{Q}_{\mathrm{out}} \triangleq \frac{1}{n_{\mathrm{out}}}\sum_{(x,y)\in\mathcal{S}_{\mathrm{out}}} \delta_{(x,y)}. \qquad (6)$$

Thinned NW estimator. The thinned NW estimator is the analog of FULL-NW except that $\mathcal{S}_{\mathrm{in}}$ is replaced by $\mathcal{S}_{\mathrm{out}}$ in (2), so that the thinned-NW estimator is given by
$$\widehat f_{\mathcal{S}_{\mathrm{out}}}(\cdot) \triangleq \frac{\sum_{(x,y)\in\mathcal{S}_{\mathrm{out}}} k(\cdot, x)\, y}{\sum_{(x,y)\in\mathcal{S}_{\mathrm{out}}} k(\cdot, x)} = \frac{\mathbb{Q}_{\mathrm{out}}(y\, k(\cdot, x))}{\mathbb{Q}_{\mathrm{out}}(k(\cdot, x))} \qquad (7)$$
whenever the denominator in the display is non-zero, and 0 otherwise. When compared to the FULL-NW estimator, we can easily deduce the computational advantage of this estimator: more efficient $O(n_{\mathrm{out}})$ storage as well as faster $O(n_{\mathrm{out}})$ computation for inference at each point.

Thinned KRR estimator. Similarly, we can define the thinned KRR estimator as
$$\widehat f_{\mathcal{S}_{\mathrm{out}},\lambda'}(\cdot) = \sum_{i=1}^{n_{\mathrm{out}}} \alpha'_i\, k(\cdot, x'_i), \quad\text{where} \qquad (8)$$
$$\alpha' \triangleq (K' + n_{\mathrm{out}}\lambda' I_{n_{\mathrm{out}}})^{-1}\begin{pmatrix} y'_1 \\ \vdots \\ y'_{n_{\mathrm{out}}} \end{pmatrix} \in \mathbb{R}^{n_{\mathrm{out}}} \quad\text{and}\quad K' \triangleq [k(x'_i, x'_j)]_{i,j=1}^{n_{\mathrm{out}}} \in \mathbb{R}^{n_{\mathrm{out}}\times n_{\mathrm{out}}},$$
given some regularization parameter $\lambda' > 0$. When compared to FULL-KRR, $\widehat f_{\mathcal{S}_{\mathrm{out}},\lambda'}$ has training time $O(n_{\mathrm{out}}^3)$ and prediction time $O(n_{\mathrm{out}})$.

A baseline approach is standard thinning, whereby we let $\mathcal{S}_{\mathrm{out}}$ be an i.i.d. sample of $n_{\mathrm{out}} = \sqrt{n}$ points from $\mathcal{S}_{\mathrm{in}}$. For NW, let us call the resulting $\widehat f_{\mathcal{S}_{\mathrm{out}}}$ (7) the standard-thinned Nadaraya-Watson (ST-NW) estimator. When $n_{\mathrm{out}} = \sqrt{n}$, ST-NW achieves an excess risk rate of $O(n^{-\beta/(2\beta+d)})$ compared to the FULL-NW rate of $O(n^{-2\beta/(2\beta+d)})$. For KRR, let us call the resulting $\widehat f_{\mathcal{S}_{\mathrm{out}},\lambda'}$ (8) the standard-thinned KRR (ST-KRR) estimator. When $n_{\mathrm{out}} = \sqrt{n}$, ST-KRR achieves an excess risk rate of $O(\frac{m}{n_{\mathrm{out}}}) = O(\frac{m}{\sqrt{n}})$ compared to the FULL-KRR rate of $O(\frac{m}{n})$. Our goal is to provide these computational benefits without trading off statistical error. Moreover, we may be able to do better by leveraging the underlying geometry of the input points and summarizing the input distribution more succinctly than i.i.d. sampling.

3.2 Background on kernel thinning

A subroutine central to our approach is kernel thinning (KT) from Dwivedi and Mackey [10, Alg. 1]. We use a variant called KT-COMPRESS++ from Shetty et al. [23, Ex. 6] (see full details in App. A), which provides similar approximation quality as the original KT algorithm of Dwivedi and Mackey [10, Alg. 1] while reducing the runtime from $O(n^2)$ to $O(n \log^3 n)$.² Given an input kernel $k_{\mathrm{ALG}}$ and input points $\mathcal{S}_{\mathrm{in}}$, KT-COMPRESS++ outputs a coreset $\mathcal{S}_{\mathrm{KT}} \subseteq \mathcal{S}_{\mathrm{in}}$ with size $n_{\mathrm{out}} \triangleq \sqrt{n} \ll n$. In this work, we leverage two guarantees of KT-COMPRESS++. Informally, $\mathcal{S}_{\mathrm{KT}}$ satisfies (with high probability):
$$\text{(L}_\infty\text{ bound)} \quad \big\|(\mathbb{P}_{\mathrm{in}} - \mathbb{Q}_{\mathrm{out}})\,k_{\mathrm{ALG}}\big\|_\infty \;\le\; C_1\, \frac{\sqrt{d\,\log n_{\mathrm{out}}\,\log n}}{n_{\mathrm{out}}}, \qquad (9)$$
$$\text{(MMD bound)} \quad \sup_{\|h\|_{k_{\mathrm{ALG}}}\le 1} \big|(\mathbb{P}_{\mathrm{in}} - \mathbb{Q}_{\mathrm{out}})\,h\big| \;\le\; C_2\, \frac{\sqrt{\log n_{\mathrm{out}}\,\log \mathcal{N}_{k_{\mathrm{ALG}}}(B_2(R_{\mathrm{in}}), 1/n_{\mathrm{out}})}}{n_{\mathrm{out}}}, \qquad (10)$$
where $C_1, C_2 > 0$ are constants that depend on the properties of the input kernel $k_{\mathrm{ALG}}$ and the chosen failure probability of KT-COMPRESS++, $R_{\mathrm{in}}$ characterizes the radius of $\{x_i\}_{i=1}^n$, and $\mathcal{N}_{k_{\mathrm{ALG}}}(B_2(R_{\mathrm{in}}), 1/n_{\mathrm{out}})$ denotes the kernel covering number of $\mathcal{H}(k_{\mathrm{ALG}})$ over the ball $B_2(R_{\mathrm{in}}) \subset \mathbb{R}^d$ at a specified tolerance (see Sec. 4.2 for formal definitions). At its highest level, KT provides good approximation of function averages. The bound (9) (formally stated in Lem. 1) controls the worst-case point-wise error, and is near-minimax optimal by Phillips and Tai [19, Thm. 3.1]. In the sequel, we leverage this type of result to derive generalization bounds for the kernel smoothing problem. The bound (10) (formally stated in Lem. 2) controls the integration error of functions in $\mathcal{H}(k_{\mathrm{ALG}})$ and is near-minimax optimal by Tolstikhin et al. [26, Thm. 1, 6]. In the sequel, we leverage this type of result to derive generalization bounds for the KRR problem.

²In the sequel, we use KT and KT-COMPRESS++ interchangeably since the underlying algorithm (kernel halving [10, Alg. 1a]) and associated approximation guarantees are the same up to small constant factors.
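The following sketch shows the generic thin-then-fit recipe of Sec. 3.1: pick a coreset of size $n_{\mathrm{out}} = \lceil\sqrt{n}\rceil$ and rerun the Sec. 2 estimators on it. Here i.i.d. subsampling (the ST baseline) stands in for KT-COMPRESS++, whose actual implementation lives in the authors' repository; the Gaussian kernel and function names are our own.

```python
import numpy as np

def gauss(A, B, h):
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / (2 * h ** 2))

def standard_thin(X, y, n_out, rng=None):
    # ST baseline: an i.i.d. subsample of n_out points, used here as a
    # stand-in for the KT-COMPRESS++ coreset that KT-NW / KT-KRR would use.
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(X), size=n_out, replace=False)
    return X[idx], y[idx]

def thinned_nw_predict(X_out, y_out, X_test, h):
    # Thinned NW (7): the full-data estimator (2) with S_in replaced by S_out.
    K = gauss(X_test, X_out, h)
    num, den = K @ y_out, K.sum(axis=1)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def thinned_krr_fit(X_out, y_out, h, lam):
    # Thinned KRR (8): an O(n_out^3) ridge solve on the coreset only.
    K = gauss(X_out, X_out, h)
    return np.linalg.solve(K + len(X_out) * lam * np.eye(len(X_out)), y_out)

# Thin-then-fit with n_out = ceil(sqrt(n)):
# X_out, y_out = standard_thin(X, y, int(np.ceil(np.sqrt(len(X)))))
# alpha = thinned_krr_fit(X_out, y_out, h=1.0, lam=1e-3)
```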
3.3 Supervised kernel thinning

We show how the approximation results from kernel thinning can be extended to the regression setting. We construct two meta-kernels, the Nadaraya-Watson meta-kernel $k_{\mathrm{NW}}$ and the ridge-regression meta-kernel $k_{\mathrm{RR}}$, which take in a base kernel $k$ (defined over $\mathcal{X}$ only) and return a new kernel (defined over $\mathcal{X} \times \mathcal{Y}$). When running KT, we set this new kernel as $k_{\mathrm{ALG}}$.

3.3.1 Kernel-thinned Nadaraya-Watson regression (KT-NW)

A tempting choice of kernel for KT-NW is the base kernel $k$ itself. That is, we can thin the input points using the kernel
$$k_{\mathrm{ALG}}((x_1, y_1), (x_2, y_2)) \triangleq k(x_1, x_2). \qquad (11)$$
This choice is sub-optimal since it ignores any information in the response variable $y$. For our supervised learning set-up, perhaps another intuitive choice would be to use KT with
$$k_{\mathrm{ALG}}((x_1, y_1), (x_2, y_2)) \triangleq k((x_1, y_1), (x_2, y_2)), \qquad (12)$$
where $(x, y)$ denotes the concatenation of $x$ and $y$. While this helps improve performance, there remains a better option, as we illustrate next.

In fact, a simple but critical observation immediately reveals a superior choice of the kernel to be used in KT for the NW estimator. We can directly observe that the NW estimator is a ratio of the averages of two functions, $f_{\mathrm{numer}}(x, y)(\cdot) \triangleq k(x, \cdot)\, y$ and $f_{\mathrm{denom}}(x, y)(\cdot) \triangleq k(x, \cdot)$, over the empirical distribution $\mathbb{P}_{\mathrm{in}}$ (6). Recall that KT provides a good approximation of sample means of functions in an RKHS, so it suffices to specify a correct choice of the RKHS (or equivalently the correct choice of the reproducing kernel). We can verify that $f_{\mathrm{denom}}$ lies in the RKHS associated with the kernel $k(x_1, x_2)$ and $f_{\mathrm{numer}}$ lies in the RKHS associated with the kernel $k(x_1, x_2)\, y_1 y_2$. This motivates our definition of the Nadaraya-Watson kernel:
$$k_{\mathrm{NW}}((x_1, y_1), (x_2, y_2)) \triangleq k(x_1, x_2) + k(x_1, x_2)\, y_1 y_2, \qquad (13)$$
since then we do have $f_{\mathrm{denom}}, f_{\mathrm{numer}} \in \mathcal{H}(k_{\mathrm{NW}})$. Intuitively, thinning with $k_{\mathrm{NW}}$ should simultaneously provide good approximation of the averages of $f_{\mathrm{denom}}$ and $f_{\mathrm{numer}}$ over $\mathbb{P}_{\mathrm{in}}$ (see the formal argument in Sec. 4.1). When $\mathcal{S}_{\mathrm{out}} = \text{KT-COMPRESS++}(\mathcal{S}_{\mathrm{in}}, k_{\mathrm{NW}}, \delta)$, we call the resulting solution to (7) the kernel-thinned Nadaraya-Watson (KT-NW) estimator, denoted by $\widehat f_{\mathrm{KT}}$. As we show in Fig. 1(a), this theoretically principled choice does provide practical benefits in MSE performance across sample sizes.

[Figure 1: MSE vs. choice of kernels (MSE against input sample size $n$). (a) NW ablation with Wendland(0) base kernel, comparing $k(x_1,x_2)$, $k((x_1,y_1),(x_2,y_2))$, and $k_{\mathrm{NW}}((x_1,y_1),(x_2,y_2))$; (b) KRR ablation with Gaussian base kernel, comparing $k(x_1,x_2)$, $k((x_1,y_1),(x_2,y_2))$, and $k_{\mathrm{RR}}((x_1,y_1),(x_2,y_2))$. For exact settings and further discussion see Sec. 5.1.]

3.3.2 Kernel-thinned kernel ridge regression (KT-KRR)

While for the NW estimator the closed-form expression was a ratio of averages, the KRR estimate (4) cannot be expressed as an easy function of averages. However, notice that $\mathcal{L}_{\mathcal{S}_{\mathrm{in}}}$ in (3) is an average of the function $\ell_f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ defined as $\ell_f(x, y) \triangleq f^2(x) - 2f(x)y + y^2$ for $f \in \mathcal{H}(k)$. Thus, there may be hope of deriving a KT-powered KRR estimator by thinning $\mathcal{L}_{\mathcal{S}_{\mathrm{in}}}$ with the appropriate kernel. Assuming $f \in \mathcal{H}(k)$, we can verify that $f^2$ lies in the RKHS associated with the kernel $k^2(x_1, x_2)$ and that $2f(x)y$ lies in the RKHS associated with the kernel $k(x_1, x_2)\, y_1 y_2$. We now define the ridge regression kernel by
$$k_{\mathrm{RR}}((x_1, y_1), (x_2, y_2)) \triangleq k^2(x_1, x_2) + k(x_1, x_2)\, y_1 y_2, \qquad (14)$$
and we can verify that $f^2(x) - 2f(x)y$ lies in the RKHS $\mathcal{H}(k_{\mathrm{RR}})$.³ When $\mathcal{S}_{\mathrm{out}} \triangleq \text{KT-COMPRESS++}(\mathcal{S}_{\mathrm{in}}, k_{\mathrm{RR}}, \delta)$, we call the resulting solution to (8) the kernel-thinned KRR (KT-KRR) estimator with regularization parameter $\lambda' > 0$, denoted $\widehat f_{\mathrm{KT},\lambda'}$. We note that the kernel $k_{\mathrm{RR}}$ also appears in [12, Lem. 4], except our subsequent analysis comes with generalization bounds for the KT-KRR estimator. Like for NW, in Fig. 1(b) we compare the KRR MSE across several kernel choices and conclude that the choice (14) is indeed superior to the base kernel $k$ (11) and the concatenated kernel (12).

³One might expect the ridge regression kernel to include a term that accounts for $y^2$. However, the generalization bounds turn out to be essentially the same regardless of whether we include this term when defining $k_{\mathrm{RR}}$.
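A minimal sketch of the two meta-kernels, together with the concatenated baseline (12): given a pairwise base kernel on the covariates, $k_{\mathrm{NW}}$ (13) adds the label-weighted copy $k(x_1,x_2)\,y_1 y_2$, and $k_{\mathrm{RR}}$ (14) does the same on top of the squared base kernel. These matrices are what one would hand to KT-COMPRESS++ as $k_{\mathrm{ALG}}$; the Gaussian base kernel and helper names are illustrative assumptions.

```python
import numpy as np

def gaussian_base(A, B, h=1.0):
    # Pairwise Gaussian base kernel on the covariates.
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / (2 * h ** 2))

def k_concat(X1, y1, X2, y2, h=1.0):
    # Baseline (12): the base kernel applied to the concatenated pairs (x, y).
    Z1 = np.hstack([X1, y1[:, None]])
    Z2 = np.hstack([X2, y2[:, None]])
    return gaussian_base(Z1, Z2, h)

def k_nw(X1, y1, X2, y2, h=1.0):
    # NW meta-kernel (13): k(x1, x2) + k(x1, x2) * y1 * y2 = k(x1, x2) * (1 + y1*y2).
    K = gaussian_base(X1, X2, h)
    return K * (1.0 + np.outer(y1, y2))

def k_rr(X1, y1, X2, y2, h=1.0):
    # RR meta-kernel (14): k^2(x1, x2) + k(x1, x2) * y1 * y2.
    K = gaussian_base(X1, X2, h)
    return K ** 2 + K * np.outer(y1, y2)
```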
4 Main results

We derive generalization bounds for our two proposed estimators. In particular, we bound the mean squared error (MSE) defined by $\|\widehat f - f^\star\|_2^2 = \mathbb{E}_X[(\widehat f(X) - f^\star(X))^2]$. Our first assumption is that of a well-behaved density on the covariate space. This assumption mainly simplifies our analysis of Nadaraya-Watson and kernel ridge regression, but can in principle be relaxed.

Assumption 1 (Compact support). Suppose that $\mathcal{X} \subseteq B_2(R_{\mathrm{in}}) \subset \mathbb{R}^d$ for some $R_{\mathrm{in}} > 0$ and that the density $p$ satisfies $0 < p_{\min} \le p(x) \le p_{\max}$ for all $x \in \mathcal{X}$.

For the analysis of the NW estimator, we define function complexity in terms of Hölder smoothness, following prior work [27].

Definition 1. For $L > 0$ and $\beta \in (0, 1]$, a function $f : \mathcal{X} \to \mathbb{R}$ is $(\beta, L)$-Hölder if $|f(x_1) - f(x_2)| \le L \|x_1 - x_2\|^\beta$ for all $x_1, x_2 \in \mathcal{X}$.

Our next assumption is on the kernel. Whereas the NW estimator typically does not require a reproducing kernel, our KT-NW estimator requires $k$ to be reproducing to allow for a valid analysis.

Assumption 2 (Shift-invariant kernel). $k$ is a reproducing kernel (i.e., symmetric and positive semidefinite) defined by $k(x_1, x_2) \triangleq \kappa(\|x_1 - x_2\|/h)$, where $h > 0$ and $\kappa : \mathbb{R} \to \mathbb{R}$ is bounded by 1, $L_\kappa$-Lipschitz, square-integrable, and satisfies
$$\log\big(L_\kappa\, \kappa^{\dagger}(1/n)\big) = O(\log n), \quad \text{where } \kappa^{\dagger}(u) \triangleq \sup\{r : \kappa(r) \ge u\}. \qquad (15)$$
Additionally, there must exist constants $c_1, c_2 > 0$ such that, for all $j = 1, 2, \ldots$,
$$2^{j\beta}\, \sup_{x \in \mathcal{X}} \int_{(2^j - \frac{1}{2})h \le \|z\| \le (2^j + \frac{1}{2})h} k(x, x - z)\, dz \;\le\; c_1\, \sup_{x \in \mathcal{X}} \int_{\|z\| \le \frac{1}{2}h} k(x, x - z)\, dz \quad\text{and} \qquad (16)$$
$$\big(2^j + \tfrac{1}{2}\big)^d\, 2^{j\beta}\, \kappa(2^{j-1} - 1) \;\le\; c_2 \int_0^{1/2} \kappa(u)\, u^{d-1}\, du. \qquad (17)$$

When $k$ is a compact kernel, such as Wendland, Sinc, and B-spline, Assum. 2 is easily satisfied. In App. G, we prove that the Gaussian and Matérn (with $\nu > d/2 + 1$) kernels also satisfy Assum. 2. We now present our main result for the KT-NW estimator. See App. B for the proof.

Theorem 1 (KT-NW). Suppose that Assum. 1 and 2 hold and that $f^\star \in \Sigma(\beta, L_f)$, the class of $(\beta, L_f)$-Hölder functions, with $\beta \in (0, 1]$. Then for any fixed $\delta \in (0, 1]$, the KT-NW estimator (7) with $n_{\mathrm{out}} = \sqrt{n}$ and bandwidth $h = n^{-\frac{1}{2\beta+2d}}$ satisfies
$$\|\widehat f_{\mathrm{KT}} - f^\star\|_2^2 \;\le\; C\, n^{-\frac{\beta}{\beta+d}} \log^2 n \qquad (18)$$
with probability at least $1 - \delta$, for some positive constant $C$ that does not depend on $n$.

Tsybakov [27] and Belkin et al. [3] show that FULL-NW achieves a rate of $O(n^{-\frac{2\beta}{2\beta+d}})$, which is minimax optimal for the $(\beta, L)$-Hölder function class. Compared to the ST-NW rate of $n^{-\frac{\beta}{2\beta+d}}$, KT-NW achieves strictly better rates for all $\beta > 0$ and $d > 0$, while retaining ST-NW's fast query time of $O(\sqrt{n})$. Note that our method KT-NW has a training time of $O(n \log^3 n)$, which is not much more than simply storing the input points.
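For concreteness, take a Lipschitz regression function ($\beta = 1$) in dimension $d = 1$; plugging into the rates above (and ignoring constants) gives
$$\underbrace{n^{-2/3}}_{\text{FULL-NW}} \;\le\; \underbrace{n^{-1/2}\log^2 n}_{\text{KT-NW (Thm.~1)}} \;\le\; \underbrace{n^{-1/3}}_{\text{ST-NW}},$$
so KT-NW recovers a substantial portion of the exponent gap between $\sqrt{n}$-subsampling and the full-data estimator, while keeping the $O(\sqrt{n})$ query time and near-linear $O(n\log^3 n)$ training time.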
We present our main result for KT-KRR using finite-rank kernels. This class of RKHSs includes linear functions and polynomial function classes.

Theorem 2 (KT-KRR for finite-dimensional RKHS). Assume $f^\star \in \mathcal{H}(k)$, Assum. 1 is satisfied, and $k$ has rank $m \in \mathbb{N}$. Let $\widehat f_{\mathrm{KT},\lambda'}$ denote the KT-KRR estimator with regularization parameter $\lambda' = O\big(\frac{m \log n_{\mathrm{out}}}{\min(n,\, n_{\mathrm{out}}^2)}\big)$. Then with probability at least $1 - 2\delta - 2e^{-\|f^\star\|_k^2 / (c_1(\|f^\star\|_k^2 + \sigma^2))}$, the following holds:
$$\|\widehat f_{\mathrm{KT},\lambda'} - f^\star\|_2^2 \;\le\; \frac{C\, m \log n_{\mathrm{out}}}{\min(n,\, n_{\mathrm{out}}^2)}\, \big[\|f^\star\|_k + 1\big]^2 \qquad (19)$$
for some constant $C$ that does not depend on $n$ or $n_{\mathrm{out}}$.

See App. C for the proof. Under the same assumptions, Wainwright [29, Ex. 13.19] showed that the FULL-KRR estimator $\widehat f_{\mathrm{full},\lambda}$ achieves the minimax-optimal rate of $O(m/n)$ in $O(n^3)$ runtime. When $n_{\mathrm{out}} = \sqrt{n}\log^c n$, the KT-KRR error rate from Thm. 2 matches this minimax rate in $\widetilde O(n^{3/2})$ time, a (near) quadratic improvement over FULL-KRR. On the other hand, standard-thinning KRR with a similar-sized output achieves a quadratically worse MSE of order $\frac{m}{\sqrt{n}}$.

Our method and theory also extend to the setting of infinite-dimensional kernels. To formalize this, we first introduce the notion of a kernel covering number.

Definition 2 (Covering number). For a kernel $k : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ with $\mathbb{B}_k \triangleq \{f \in \mathcal{H} : \|f\|_{\mathcal{H}} \le 1\}$, a set $A \subseteq \mathcal{Z}$, and $\epsilon > 0$, the covering number $\mathcal{N}_k(A, \epsilon)$ is the minimum cardinality of all sets $\mathcal{C} \subset \mathbb{B}_k$ satisfying $\mathbb{B}_k \subseteq \bigcup_{h \in \mathcal{C}} \{g \in \mathbb{B}_k : \sup_{x \in A} |h(x) - g(x)| \le \epsilon\}$.

We consider two general classes of kernels.

Assumption 3. For some $C_d > 0$, all $r > 0$ and $\epsilon \in (0, 1)$, and $B_2(r) = \{x \in \mathbb{R}^d : \|x\|_2 \le r\}$, a kernel $k$ is LOGGROWTH$(\alpha, \beta)$ when $\log \mathcal{N}_k(B_2(r), \epsilon) \le C_d \log(e/\epsilon)^\alpha (r + 1)^\beta$ with $\alpha, \beta > 0$, and POLYGROWTH$(\alpha, \beta)$ when $\log \mathcal{N}_k(B_2(r), \epsilon) \le C_d (1/\epsilon)^\alpha (r + 1)^\beta$ with $\alpha < 2$.

We highlight that the definitions above cover several popular kernels: LOGGROWTH kernels include finite-rank kernels and analytic kernels, like the Gaussian, inverse multiquadric (IMQ), and sinc kernels [9, Prop. 2], while POLYGROWTH kernels include finitely-many-times continuously differentiable kernels, like the Matérn and B-spline kernels [9, Prop. 3]. For clarity, here we present our guarantee for LOGGROWTH kernels and defer the other case to App. E.

Theorem 3 (KT-KRR guarantee for infinite-dimensional RKHS). Suppose Assum. 1 is satisfied and $k$ is LOGGROWTH$(\alpha, \beta)$ (Assum. 3). Then $\widehat f_{\mathrm{KT},\lambda'}$ with $\lambda' = O(1/n_{\mathrm{out}})$ satisfies the following bound with probability at least $1 - 2\delta - 2e^{-\|f^\star\|_k^2 \log^\alpha n / (c_1(\|f^\star\|_k^2 + \sigma^2))}$:
$$\|\widehat f_{\mathrm{KT},\lambda'} - f^\star\|_2^2 \;\le\; C\Big[\frac{\log^\alpha n}{n} + \frac{\log^\alpha n_{\mathrm{out}}}{n_{\mathrm{out}}}\Big]\, \big[\|f^\star\|_k + 1\big]^2 \qquad (20)$$
for some constant $C$ that does not depend on $n$ or $n_{\mathrm{out}}$.

See App. E for the proof. When $n_{\mathrm{out}} = \sqrt{n}$, ST-KRR achieves an excess risk rate of $n^{-1/2}\log^\alpha n$ for $k$ satisfying LOGGROWTH$(\alpha, \beta)$. While KT-KRR does not achieve a strictly better excess-risk rate bound than ST-KRR, we see that in practice KT-KRR still obtains an empirical advantage. Obtaining a sharper error rate for the infinite-dimensional kernel setting is an exciting avenue for future work.

5 Experimental results

We now present experiments on simulated and real-world data. On real-world data, we compare our KT-KRR estimator with several state-of-the-art KRR methods, including Nyström-subsampling-based methods and KRR preconditioning methods. All our experiments were run on a machine with 8 CPU cores and 100 GB RAM. Our code can be found at https://github.com/ag2435/npr.

5.1 Simulation studies

[Figure 2: Simulated data.]

We begin with some simulation experiments. For simplicity, let $\mathcal{X} = \mathbb{R}$ and $\mathbb{P} = \mathrm{Unif}[-\sqrt{3}, \sqrt{3}]$ so that $\mathrm{Var}[X] = 1$. We set
$$f^\star(x) = 8 \sin(8\pi x)\exp(x) \quad\text{and}\quad \sigma = 1 \qquad (21)$$
and follow (1) to generate $(y_i)_{i=1}^n$ (see Fig. 2). We let the input sample size $n$ vary over $\{2^8, 2^{10}, 2^{12}, 2^{14}\}$ and set the output coreset size to $n_{\mathrm{out}} = \sqrt{n}$ in all cases.
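The simulation setup (21) is straightforward to reproduce; the sketch below draws $n$ points from the model above. The constants match the text as reconstructed, while the seed and the function name are our own.

```python
import numpy as np

def simulate(n, sigma=1.0, seed=0):
    # Data model (1) with regression function (21):
    # f*(x) = 8 sin(8*pi*x) * exp(x), X ~ Unif[-sqrt(3), sqrt(3)], noise ~ N(0, sigma^2).
    rng = np.random.default_rng(seed)
    x = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n)
    f_star = 8 * np.sin(8 * np.pi * x) * np.exp(x)
    y = f_star + sigma * rng.standard_normal(n)
    return x[:, None], y
```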
For NW, we use the Wendland(0) kernel defined by
$$k(x_1, x_2) \triangleq \Big(1 - \frac{\|x_1 - x_2\|_2}{h}\Big)_+ \quad\text{for } h > 0. \qquad (22)$$
For KRR, we use the Gaussian kernel defined by
$$k(x_1, x_2) \triangleq \exp\Big(-\frac{\|x_1 - x_2\|_2^2}{2h^2}\Big) \quad\text{for } h > 0. \qquad (23)$$
We select the bandwidth $h$ and the regularization parameter $\lambda$ (for KRR) using grid search. Specifically, we use a held-out validation set of size $10^4$ and run each parameter configuration 100 times to estimate the validation MSE, since KT-KRR and ST-KRR are random.

Ablation study. In Fig. 1, we compare thinning with our proposed meta-kernel $k_{\mathrm{ALG}} = k_{\mathrm{NW}}$ to thinning with the baseline meta-kernels (11) and (12). For our particular regression function (21), thinning with (11) outperforms thinning with (12). We hypothesize that the latter kernel is not robust to the scaling of the response variables. By inspecting (22), we see that $\|(x_1, y_1) - (x_2, y_2)\|_2$ is heavily determined by the $y_i$ values when they are large compared to the values of $x_i$, as is the case on the right side of Fig. 2 (when $X > 0$). Since $\mathbb{P}$ is a uniform distribution, thinning with (11) evenly subsamples points along the input domain $\mathcal{X}$, even though accurately learning the left side of Fig. 2 (when $X < 0$) is not needed for effective prediction since it is primarily noise. Validating our theory from Thm. 1, the best performance is obtained when thinning with $k_{\mathrm{NW}}$ (13), which avoids evenly subsampling points along the input domain and correctly exploits the dependence between $X$ and $Y$. In Fig. 1(b), we perform a similar ablation for KRR. Again we observe that thinning with $k_{\mathrm{ALG}}((x_1, y_1), (x_2, y_2)) = k(x_1, x_2)$ outperforms thinning with $k_{\mathrm{ALG}}((x_1, y_1), (x_2, y_2)) = k((x_1, y_1), (x_2, y_2))$, while thinning with $k_{\mathrm{ALG}} = k_{\mathrm{RR}}$ achieves the best performance.

Comparison with FULL, ST, RPCHOLESKY. In Fig. 3(a), we compare the MSE of KT-NW to FULL-NW, ST-NW (a.k.a. "Subsample"), and RPCHOLESKY-NW across four values of $n$. This last method uses the pivot points from RPCHOLESKY as the output coreset $\mathcal{S}_{\mathrm{out}}$. At all $n$ we evaluated, KT-NW achieves lower MSE than ST-NW and RPCHOLESKY-NW. FULL-NW achieves the lowest MSE across the board, but it suffers from significantly worse runtimes, especially at test time. Owing to its $O(n \log^3 n)$ runtime, KT-NW is significantly faster than RPCHOLESKY-NW for training and nearly matches ST-NW in both training and testing time. We hypothesize that RPCHOLESKY, while it provides a good low-rank approximation of the kernel matrix, is not designed to preserve averages. In Fig. 3(b), we compare the MSE of KT-KRR to FULL-KRR, ST-KRR (a.k.a. "Subsample"), and the RPCHOLESKY-KRR method from Chen et al. [6, Sec. 4.2.2], which uses RPCHOLESKY to select landmark points for the restricted KRR problem. We observe that KT-KRR achieves lower MSE than ST-KRR, but higher MSE than RPCHOLESKY-KRR and FULL-KRR. In Fig. 3(b), we also observe that KT-KRR is orders of magnitude faster than FULL-KRR across the board, with runtime comparable to ST-KRR and RPCHOLESKY-KRR in both training and testing.

[Figure 3: MSE and runtime comparison on simulated data. (a) Nadaraya-Watson estimator with Wendland(0) kernel (22); (b) kernel ridge regression estimator with Gaussian kernel (23). Each point plots the mean and standard deviation across 100 trials (after parameter grid search), comparing Full, Subsample, RPCholesky, and KT (Ours).]
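For reference, here is a minimal sketch of the two kernels (22)-(23) and of the validation-based grid search described above; the grid values and helper names are illustrative assumptions, and the paper additionally averages each random configuration over 100 trials.

```python
import numpy as np

def wendland0(X1, X2, h):
    # Wendland(0) kernel (22): (1 - ||x1 - x2||_2 / h)_+
    d = np.sqrt(((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1))
    return np.maximum(1.0 - d / h, 0.0)

def gaussian(X1, X2, h):
    # Gaussian kernel (23): exp(-||x1 - x2||_2^2 / (2 h^2))
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h ** 2))

def grid_search_krr(X_tr, y_tr, X_val, y_val, hs, lams):
    # Pick (h, lambda) minimizing validation MSE of the KRR fit.
    best = (None, None, np.inf)
    for h in hs:
        K, K_val = gaussian(X_tr, X_tr, h), gaussian(X_val, X_tr, h)
        for lam in lams:
            alpha = np.linalg.solve(K + len(X_tr) * lam * np.eye(len(X_tr)), y_tr)
            mse = np.mean((K_val @ alpha - y_val) ** 2)
            if mse < best[2]:
                best = (h, lam, mse)
    return best
```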
5.2 Real data experiments

We now present experiments on real-world data using two popular datasets: the California Housing regression dataset from Pace and Barry [17] (https://scikit-learn.org/1.5/datasets/real_world.html#california-housing-dataset; BSD-3-Clause license) and the SUSY binary classification dataset from Baldi et al. [2] (https://archive.ics.uci.edu/dataset/279/susy; CC-BY-4.0 license).

California Housing dataset ($d = 8$, $N = 2 \times 10^4$). Tab. 1(a) compares the test MSE, train times, and test times. We normalize the input features by subtracting the mean and dividing by the standard deviation, and use an 80-20 train-test split. For all methods, we use the Gaussian kernel (23) with bandwidth $h = 10$. We use $\lambda = \lambda' = 10^{-3}$ for FULL-KRR, ST-KRR, and KT-KRR, and $\lambda = 10^{-5}$ for RPCHOLESKY-KRR. On this dataset, KT-KRR lies between ST-KRR and RPCHOLESKY-KRR in terms of test MSE. When $n_{\mathrm{out}} = \sqrt{n}$, RPCHOLESKY pivot selection takes $O(n^2)$ time by Chen et al. [6, Alg. 2], compared to KT-COMPRESS++, which compresses the input points in only $O(n \log^3 n)$ time. This difference in big-O runtime is reflected in our empirical results, where KT-KRR takes 0.0153 s versus 0.3237 s for RPCHOLESKY-KRR.

SUSY dataset ($d = 18$, $N = 5 \times 10^6$). Tab. 1(b) compares our proposed method KT-KRR (with $h = 10$, $\lambda' = 10^{-1}$) to several large-scale kernel methods, namely RPCholesky preconditioning [7], FALKON [22], and Conjugate Gradient (all with $h = 10$, $\lambda = 10^{-3}$), in terms of test classification error and training times. For the baseline methods, we use the Matlab implementation provided by Díaz et al. [7] (https://github.com/eepperly/Robust-randomized-preconditioning-for-kernel-ridge-regression). In our experiment, we use $4 \times 10^6$ points for training and the remaining $10^6$ points for testing. As is common practice for classification tasks, we use the Laplace kernel defined by $k(x_1, x_2) \triangleq \exp(-\|x_1 - x_2\|_2/h)$. All parameters are chosen with cross-validation. We observe that KT-KRR achieves test error between ST-KRR and RPCHOLESKY preconditioning, with training time almost half that of RPCHOLESKY preconditioning. Notably, our Cython implementation of KT-COMPRESS++ thinned the four million training samples in only 1.7 seconds on a single CPU core, with further speed-ups to be gained from parallelizing on a GPU in the future.

Table 1: Accuracy and runtime comparison on real-world data. Each cell reports mean ± standard error across 100 trials.

(a) California Housing regression dataset.

| Method | MSE (%) | Training time (s) | Prediction time (s) |
|---|---|---|---|
| Full | 0.4137 | 11.1095 | 0.7024 |
| ST-KRR | 0.5736 ± 0.0018 | 0.0018 ± 0.0005 | 0.0092 ± 0.0006 |
| RPCHOLESKY | 0.3503 ± 0.0001 | 0.3237 ± 0.0094 | 0.0060 ± 0.0008 |
| KT-KRR (Ours) | 0.5580 ± 0.0015 | 0.0153 ± 0.0013 | 0.0083 ± 0.0003 |

(b) SUSY dataset.

| Method | Test Error (%) | Training Time (s) |
|---|---|---|
| RPCholesky | 19.99 ± 0.00 | 3.46 ± 0.03 |
| FALKON | 19.99 ± 0.00 | 5.06 ± 0.02 |
| CG | 20.35 ± 0.00 | 6.16 ± 0.03 |
| ST-KRR | 22.71 ± 0.30 | 0.09 ± 0.00 |
| KT-KRR (Ours) | 22.00 ± 0.21 | 1.79 ± 0.00 |
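For reference, a minimal sketch of the California Housing preprocessing described above; loading the dataset through scikit-learn and the random seed are our assumptions, and the paper's full pipeline lives in the linked repository.

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Standardize the 8 input features and use an 80-20 train-test split, as in the text.
data = fetch_california_housing()
X = (data.data - data.data.mean(axis=0)) / data.data.std(axis=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, data.target, test_size=0.2, random_state=0)

h, lam = 10.0, 1e-3                       # Gaussian bandwidth and ridge parameter from the text
n_out = int(np.ceil(np.sqrt(len(X_tr))))  # coreset size used by the ST-KRR / KT-KRR runs
```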
6 Conclusions

In this work, we introduce a meta-algorithm for speeding up two estimators from non-parametric regression, namely the Nadaraya-Watson and kernel ridge regression estimators. Our method inherits the favorable computational efficiency of the underlying kernel thinning algorithm and stands to benefit from further advancements in unsupervised learning compression methods. The KT guarantees provided in this work apply only when $f^\star \in \mathcal{H}(k)$ for some base kernel $k$. In practice, choosing a good kernel $k$ is indeed a challenge common to all prior work. Our framework is friendly to recent developments in kernel selection that handle this problem: Dwivedi and Mackey [9, Cor. 1] provide integration-error guarantees for KT when $f^\star \notin \mathcal{H}(k)$. Moreover, there are recent results on finding the best kernel (e.g., for hypothesis testing [8, Sec. 4.2]). Radhakrishnan et al. [20] introduce the Recursive Feature Machine, which uses a parameterized kernel $k_M(x_1, x_2) \triangleq \exp(-(x_1 - x_2)^\top M (x_1 - x_2)/(2h^2))$, and propose an efficient method to learn the matrix parameter $M$ via the average gradient outer product estimator. An exciting future direction would be to combine these parameterized (or "learned") kernels with our proposed KT methods for non-parametric regression.

7 Acknowledgements

AG is supported with funding from the New York-Presbyterian Hospital.

References

[1] Haim Avron, Kenneth L. Clarkson, and David P. Woodruff. Faster kernel ridge regression using sketching and preconditioning. SIAM Journal on Matrix Analysis and Applications, 38(4):1116-1138, 2017. (Cited on page 2.)
[2] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5(1):4308, 2014. (Cited on page 9.)
[3] Mikhail Belkin, Alexander Rakhlin, and Alexandre B. Tsybakov. Does data interpolation contradict statistical optimality? In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1611-1619. PMLR, 2019. (Cited on pages 7 and 21.)
[4] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near-optimal coresets for least-squares regression. IEEE Transactions on Information Theory, 59(10):6880-6892, 2013. (Cited on page 2.)
[5] Raffaello Camoriano, Tomás Angles, Alessandro Rudi, and Lorenzo Rosasco. NYTRO: When subsampling meets early stopping. In Artificial Intelligence and Statistics, pages 1403-1411. PMLR, 2016. (Cited on page 2.)
[6] Yifan Chen, Ethan N. Epperly, Joel A. Tropp, and Robert J. Webber. Randomly pivoted Cholesky: Practical approximation of a kernel matrix with few entry evaluations. arXiv preprint arXiv:2207.06503, 2022. (Cited on pages 8 and 9.)
[7] Mateo Díaz, Ethan N. Epperly, Zachary Frangella, Joel A. Tropp, and Robert J. Webber. Robust, randomized preconditioning for kernel ridge regression. arXiv preprint arXiv:2304.12465, 2023. (Cited on pages 2 and 9.)
[8] Carles Domingo-Enrich, Raaz Dwivedi, and Lester Mackey. Compress then test: Powerful kernel testing in near-linear time. arXiv preprint arXiv:2301.05974, 2023. (Cited on pages 1 and 10.)
[9] Raaz Dwivedi and Lester Mackey. Generalized kernel thinning. In International Conference on Learning Representations, 2022. (Cited on pages 1, 2, 7, 10, 14, and 15.)
[10] Raaz Dwivedi and Lester Mackey. Kernel thinning. Journal of Machine Learning Research, 25(152):1-77, 2024. (Cited on pages 1, 2, 4, 13, 14, 18, 24, 32, 48, and 49.)
[11] Ahmed El Alaoui and Michael W. Mahoney. Fast randomized kernel methods with statistical guarantees. stat, 1050:2, 2014. (Cited on page 2.)
[12] Steffen Grünewälder. Compressed empirical measures (in finite dimensions). arXiv preprint arXiv:2204.08847, 2022. (Cited on pages 6 and 46.)
[13] Ming-Yueh Huang and Shu Yang. Robust inference of conditional average treatment effects using dimension reduction. Statistica Sinica, 32(Suppl):547, 2022. (Cited on page 1.)
[14] Samory Kpotufe. Fast, smooth and adaptive regression in metric spaces. Advances in Neural Information Processing Systems, 22, 2009. (Cited on page 2.)
[15] Beatrice Laurent and Pascal Massart.
Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302-1338, 2000. (Cited on page 48.)
[16] Lingxiao Li, Raaz Dwivedi, and Lester Mackey. Debiased distribution compression. arXiv preprint arXiv:2404.12290, 2024. (Cited on pages 37 and 39.)
[17] R. Kelley Pace and Ronald Barry. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):291-297, 1997. (Cited on page 9.)
[18] Jeff M. Phillips. Coresets and sketches. In Handbook of Discrete and Computational Geometry, pages 1269-1288. Chapman and Hall/CRC, 2017. (Cited on page 2.)
[19] Jeff M. Phillips and Wai Ming Tai. Improved coresets for kernel density estimates. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2718-2727. SIAM, 2018. (Cited on page 4.)
[20] Adityanarayanan Radhakrishnan, Daniel Beaglehole, Parthe Pandit, and Mikhail Belkin. Feature learning in neural networks and kernel machines that recursively learn features. arXiv preprint arXiv:2212.13881, 2022. (Cited on pages 1 and 10.)
[21] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, 20, 2007. (Cited on page 2.)
[22] Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. FALKON: An optimal large scale kernel method. Advances in Neural Information Processing Systems, 30, 2017. (Cited on pages 2 and 9.)
[23] Abhishek Shetty, Raaz Dwivedi, and Lester Mackey. Distribution compression in near-linear time. In International Conference on Learning Representations, 2022. (Cited on pages 4, 13, 14, and 15.)
[24] Rahul Singh, Liyuan Xu, and Arthur Gretton. Sequential kernel embedding for mediated and time-varying dose response curves. arXiv preprint arXiv:2111.03950, 2021. (Cited on page 1.)
[25] Ingo Steinwart and Simon Fischer. A closer look at covering number bounds for Gaussian kernels. Journal of Complexity, 62:101513, 2021. (Cited on page 29.)
[26] Ilya Tolstikhin, Bharath K. Sriperumbudur, and Krikamol Muandet. Minimax estimation of kernel mean embeddings. Journal of Machine Learning Research, 18(86):1-47, 2017. (Cited on page 4.)
[27] Alexandre B. Tsybakov. Nonparametric estimators. In Introduction to Nonparametric Estimation, pages 1-76, 2009. (Cited on pages 6 and 7.)
[28] Stephen Tu, Rebecca Roelofs, Shivaram Venkataraman, and Benjamin Recht. Large scale kernel learning using block coordinate descent. arXiv preprint arXiv:1602.05310, 2016. (Cited on page 2.)
[29] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019. (Cited on pages 7, 21, 22, 23, 28, 32, 33, 34, 36, 37, 40, 41, 42, 44, 45, and 46.)
[30] Holger Wendland. Scattered Data Approximation, volume 17. Cambridge University Press, 2004. (Cited on page 48.)
[31] Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. Journal of Machine Learning Research, 16(1):3299-3340, 2015. (Cited on page 2.)
[32] Yan Zheng and Jeff M. Phillips. Coresets for kernel regression. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 645-654, 2017. (Cited on page 2.)

A Background on KT-COMPRESS++

This section details the KT-COMPRESS++ algorithm of Shetty et al. [23, Ex. 6]. In a nutshell, KT-COMPRESS (Alg.
2) takes as input a point sequence of size n, a compression level g, a (reproducing) kernel functions k ALG, and a failure probability δ. KT-COMPRESS++ first runs the KT-COMPRESS(g) algorithm of Shetty et al. [23, Ex. 4] to produce an intermediate coreset of size 2g n. Next, the KT algorithm is run on the intermediate coreset to produce a final output of size n. KT-COMPRESS proceeds by calling the recursive procedure COMPRESS, which uses KT with kernels k ALG as an intermediate halving algorithm. The KT algorithm itself consists of two subroutines: (1) KT-SPLIT (Alg. 3a), which splits a given input point sequence into two equal halves with small approximation error in the k ALG reproducing kernel Hilbert space and (2) KT-SWAP (Alg. 3b), which selects the best approximation amongst the KT-SPLIT coresets and a baseline coreset (that simply selects every other point in the sequence) and then iteratively refines the selected coreset by swapping out each element in turn for the non-coreset point that most improves MMDk ALG error. As in Shetty et al. [23, Rem. 3], we symmetrize the output of KT by returning either the KT coreset or its complement with equal probability. Following Shetty et al. [23, Ex. 6], we always default to g = log2 log n + 3.1 so that KT-COMPRESS++ has an overall runtime of O(n log3 n). For the sake of simplicity, we drop any dependence on g in the main paper. Algorithm 1: KT-COMPRESS++ Identify coreset of size n Input: point sequence Sin of size n, compression level g, kernel k ALG, failure probability δ SC KT-COMPRESS(g, Sin, δ) // coreset of size 2g n SC++ KT(SC, k ALG, g g+2g(βn+1)δ) // coreset of size n return SC++ Algorithm 2: KT-COMPRESS Identify coreset of size 2g n Input: point sequence Sin of size n, compression level g, kernel k ALG, failure probability δ return COMPRESS(Sin, g, k ALG, δ n4g+1(log4 n g)) function COMPRESS(S, g, k ALG, δ): if |S| = 4g then return S Partition S into four arbitrary subsequences {Si}4 i=1 each of size n/4 for i = 1, 2, 3, 4 do e Si COMPRESS(Si, g, k ALG, δ) // run COMPRESS recursively to return coresets of size 2g q 4 end e S CONCATENATE( e S1, e S2, e S3, e S4) // combine the coresets to obtain a coreset of size 2 2g p |S| return KT( e S, k ALG, | e S|2δ) // halve the coreset to size 2gp |S| via symmetrized kernel thinning function KT(S, k ALG, δ): // Identify kernel thinning coreset containing |S|/2 input points SKT KT-SWAP(k ALG, KT-SPLIT(k ALG, S, δ)) return SKT with probability 1 2 and the complementary coreset S \ SKT otherwise Define the event EKT,δ {KT-COMPRESS++ succeeds}. (24) Dwivedi and Mackey [10, Thm. 1, Rmk. 4] show that P(EKT,δ) 1 δ. We restate [10, Thm. 4] in our notation: Lemma 1 (L guarantee for KT-COMPRESS++). Let Z Rd and consider a reproducing kernel k ALG : Z Z R. Assume n/nout 2N. Let SKT KT-COMPRESS++(Sin, g, k ALG, δ) and Algorithm 3a: KT-SPLIT Divide points into candidate coresets of size n/2 Input: kernel ksplit, point sequence Sin = (xi)n i=1, failure probability δ S(1), S(2) {} // Initialize empty coresets: S(1), S(2) have size i after round i σ 0 // Initialize swapping parameter for i = 1, 2, . . . 
, n/2 do // Consider two points at a time (x, x ) (x2i 1, x2i) // Compute swapping threshold ai ai, σ get_swap_params(σ, b, δ n) with b2 =ksplit(x, x)+ksplit(x , x ) 2ksplit(x, x ) // Assign one point to each coreset after probabilistic swapping θ P2i 2 j=1 (ksplit(xj, x) ksplit(xj, x )) 2 P z S(1)(ksplit(z, x) ksplit(z, x )) (x, x ) (x , x) with probability min(1, 1 S(1).append(x); S(2).append(x ) end return (S(1), S(2)), candidate coresets of size n/2 function get_swap_params(σ, b, δ): 2 log(2/δ), b2) σ2 σ2+b2(1+(b2 2ai)σ2/a2 i )+ return (ai, σ) Algorithm 3b: KT-SWAP Identify and refine the best candidate coreset Input: kernel k ALG, point sequence Sin = (xi)n i=1, candidate coresets (S(1), S(2)) S(0) baseline_coreset(Sin, size= n/2 ) // Compare to baseline (e.g., standard thinning) SKT S(ℓ ) for ℓ argminℓ {0,1,2} MMDk ALG(Sin, S(ℓ)) // Select best coreset // Swap out each point in SKT for best alternative in Sin while ensuring no point is repeated in SKT for i = 1, . . . , n/2 do SKT[i] argminz {SKT[i]} (Sin\SKT) MMDk ALG(Sin, SKT with SKT[i] = z) end return SKT, refined coreset of size n/2 define Pin 1 z Sin δz and Qout 1 nout P z SKT δz. Then on event EKT,δ, the following bound holds: (Pin Qout)k ALG c k ALG ,in nout Mk ALG(n, nout, d, δ, R), where (25) Mk ALG(n, nout, d, δ, R) r log nout log2(n/nout) d log 1 + Lk ALG k ALG (Rk ALG,n + R) , Lk ALG supz1,z2,z3 Z |k ALG(z1,z2) k ALG(z1,z3)| z2 z3 2 , and (27) Rk ALG,n inf r : sup z1,z2 Z z1 z2 2 r |k ALG(z1, z2)| k ALG for some universal positive constant c. Proof. The claim follows by replacing k [10, Thm. 4] with k ALG, replacing the sub-Gaussian constant of KT with that of KT-COMPRESS++ in [23, Ex. 5], and replacing k ALG with k ALG ,in supz Sin k ALG(z, z) throughout. We restate [9, Thm. 2] in our notation: Lemma 2 (MMD guarantee for KT-COMPRESS++). Let Z Rd and consider a reproducing kernel k ALG : Z Z R. Assume n/nout 2N. Let SKT KT-COMPRESS++(k ALG, g)(Sin) and define Pin 1 z Sin δz and Qout 1 nout P z SKT δz. Then on event EKT,δ, the following bound holds: suph H(k ALG): h k ALG 1 |(Pin Qout)h| infϵ (0,1) Sin A 2ϵ + 2 k ALG 1/2 ,in nout Wk ALG(n, nout, δ, A, ϵ) where Wk ALG(n, nout, δ, R, ϵ) c r log nout log(n/nout) δ + log Nk ALG(A, ϵ) . (29) for some universal postive constant c. Proof. The claim follows from replacing k in [9, Thm. 2] with k ALG and replacing the sub-Gaussian constant of KT with that of KT-COMPRESS++ in [23, Ex. 5]. B Proof of Thm. 1: KT-NW Our primary goal is to bound ESin[( bf KT(x0) f (x0))2] for a fixed x0 X. Once we have this bound, bounding bf KT f 2 2 is as straightforward as integrating over x0 X. Consider the following decomposition: bf KT(x0) f (x0) 2 = ESin bf KT(x0) bf(x0) + bf(x0) f (x0) 2 bf KT(x0) bf(x0) 2 (30) bf(x0) f (x0) 2 . (31) Define the random variables ηi 1 n Xi x0 h 1 o for i = 1, 2, . . . , n. Also define the event E {Pn i=1 ηi > 0}. (32) Since Xi are i.i.d. samples from P, it follows that ηi are i.i.d. Bernoulli random variables with parameter p P(ηi = 1) c0pminhd, (33) where c0 > 0 depends only on d and κ (see Assum. 2). Denote the denominator terms in bf and bf KT by n Pn i=1 k( , xi) and bp KT( ) 1 nout Pnout j=1 k( , x i), (34) respectively, and the numerator terms in bf and bf KT by n Pn i=1 k( , xi)yi and b AKT( ) 1 nout Pnout j=1 k( , x i)y i, (35) respectively. We now consider two cases depending on the event E. Case I: Suppose event Ec is satisfied. It follows from (34) that bp(x0) = 0, in which case bf(x0) = 0. 
Since Sout Sin, it necessarily follows that bp KT(x0) = 0 and bf KT(x0) = 0. Thus, we can bound (30) and (31) by bf KT(x0) bf(x0) 2 I[Ec] = 0 and bf(x0) f (x0) 2 I[Ec] = ESin h (0 f (x0))2 I[Ec] i (f )2(x0)P(Ec) (f )2(x0)(1 p)n (f )2(x0) exp Cnhd for some positive constant C that does not depend on n. Note that these are low-order terms compared to the rest of the calculations, so we may ignore them in the final bound. Case II: Otherwise, we may assume event E is satisfied. Let us first bound (30). On event EKT,δ (24), we claim that bf KT(x0) bf(x0) 2 I[E] Cd log2 n nh2d whenever p = ω( q d n log n). (36) We defer the proof to App. B.1. Letting X (X1, . . . , Xn) and Y (Y1, . . . , Yn) denote the x and y components of Sin, respectively, we can further decompose (31) by bf(x0) f (x0) 2 I[E] = EX bf(x0) EY |X h bf(x0) i 2 I[E] EY |X h bf(x0) i f (x0) 2 I[E] , where the first RHS term corresponds to the variance and the second RHS term corresponds to the bias. We claim that bf(x0) EY |X h bf(x0) i 2 I[E] σ2 ξ n exp Cnhd + C nhd and (37) EY |X h bf(x0) i f (x0) 2 I[E] C L2 fh2β log2 n, (38) for some constant C > 0 that does not depend on either n or h. We defer the proofs to App. B.2 and B.3. Combining (36) to (38), we have bf KT(x0) f (x0) 2 I[E] Cd log2 n nh2d | {z } KT bound + 2σ2 ξ ne Cnhd + C nhd | {z } Variance bound + 2CL2 fh2β log2 n | {z } Bias bound Note that hd 1, so the Cd log2 n nh2d term dominates the C nhd term. Thus, the optimal choice of bandwidth h comes from balancing C nh2d 2L2 fh2β = h = cn 1 2β+2d . (40) Finally, we must verify our growth rate assumption on p in (36) is satisfied. Since β > 0, we have p (33) c0pminhd (40) = c 0n d 2β+2d = limn d n log n = . Plugging (40) into (39) yields the advertised bound (18). B.1 Proof of claim (36) We first provide a generic result for approximating the numerator and denominator terms defined in (34) and (35). Lemma 3 (Simultaneous L bound using KT-COMPRESS++ with k NW). Suppose k satisfies Assum. 2. Given Sin, the following bounds hold on the event EKT,δ: bp bp KT cp q d n(log n + log(1/δ)) (41) b A b AKT cp q d n(log n + log(1/δ)), (42) where ca, cp > 0 are constants that do not depend on d or n. See App. B.1.1 for the proof. In the sequel, we will simply treat the log(1/δ) term as a constant, meaning the log n terms dominate in the expressions. With this lemma in hand, let us prove the claim (36). Define the following events: A {bp KT(x0) = 0} B {bp KT(x0) = 0} C n bp(x0) p On event E, consider the following decomposition: bf KT(x0) bf(x0) 2 I[EKT,δ] = ESin bf KT(x0) bf(x0) 2 I[EKT,δ Cc] (43) bf KT(x0) bf(x0) 2 I[EKT,δ A C] (44) bf KT(x0) bf(x0) 2 I[EKT,δ B C] . (45) Bounding (43). Note that almost surely, we have | bf(x0)| Ymax and | bf KT(x0)| Ymax. Thus, we have bf KT(x0) bf(x0) 2 I[Cc] 4Y 2 max P n bp(x0) < np (i) = 4Y 2 max P Pn i=1 ηi np < np (ii) c0 exp c1nhd , where (i) follows from subtracting np from both sides of the probability statement and (ii) follows from concentration of Bernoulli random variables (see App. B.2). Bounding (44). Note that on event EKT,δ C, we have bp KT(x0) bp(x0) bp bp KT (36) c1p (33) c2pminhd > 0. (46) where step (i) follows from applying (41) and substituting bp p 2. Hence the events EKT,δ and A C are mutually exclusive with probability 1, thereby yielding bf KT(x0) bf(x0) 2 I[EKT,δ A C] = 0. Bounding (45). 
On the event B C, we have bf KT(x0) = b AKT(x0) bp KT(x0) and bf(x0) = b A(x0) bp(x0) , which yields bf KT(x0) bf(x0) = b AKT bp KT b A(x0) bp(x0) = b AKT(x0) bp(x0) b A(x0) bp KT(x0) bp(x0) bp KT(x0) = ( b AKT(x0) b A(x0)) bp(x0)+ b A(x0) (bp(x0) bp KT(x0)) bp(x0) bp KT(x0) | b AKT(x0) b A(x0)| bp(x0)+ b A(x0) |bp(x0) bp KT(x0)| bp(x0) bp KT(x0) We can invoke (41) and (42) to bound |bp(x0) bp KT(x0)| and | b AKT(x0) b A(x0)| respectively. Thus, we have bf KT(x0) bf(x0) 2 I[EKT,δ B C] ca d n log n bp(x0)+cp d n log n b A(x0) bp(x0) bp KT(x0) ca bp KT(x0) 2 + b A(x0) bp(x0) 2 cp bp KT(x0) 2 (i) 2d log2 n n h c2 a+Y 2 maxc2 p bp KT(x0) i2 (ii) Cd log2 n for some positive constant C that does not depend on n, where step (i) uses the fact that b A(x0) bp(x0) Ymax and step (ii) uses the lower bound on bp KT(x0) from (46). Combining (43) to (45), we have bf KT(x0) bf(x0) 2 I[EKT,δ] c0 exp c1nhd + Cd log n Note that the second term dominates so that we may drop the first term with slight change to the value of the constant C in the bound (36). B.1.1 Proof of Lem. 3: Simultaneous L bound using KT-COMPRESS++ with k NW We first decompose k NW as k NW((x1, y1), (x2, y2)) = k1((x1, y1), (x2, y2)) + k2((x1, y1), (x2, y2)), where k1((x1, y1), (x2, y2)) k(x1, x2) and (47) k2((x1, y1), (x2, y2)) k(x1, x2) y1y2. (48) and note that H(k NW) = H(k1) H(k2). (49) This fact will be useful later for proving simultaneous L approximation guarantees for b A and bp. Given that k satisfies Assum. 2, we want to show that k NW defined by (13) satisfies the Lipschitz and tail decay properties, so that we may apply Lem. 1. Note that k NW = k (1 + Y 2 max). (50) We claim that kernel k NW satisfies Lk NW Lk + Ymax( k + Lk Ymax) and (51) Rk NW,n Rk,n + 2Ymax (52) By [10, Rmk. 8], we have h and Rk,n hκ (1/n), (53) where κ is defined by (15). Applying (50) and (51), we have Lk NW k NW Lk+Ymax( k +Lk Ymax) k (1+Y 2 max) Lk k + 1 Ymax + Lk k (53) 2Lκ h + 1 Ymax . Finally, we have Lk NW Rk NW,n hκ (1/n) + 2Ymax = 2Lκκ (1/n) + 4LκYmax h + hκ (1/n) 4 max{1, LκYmax} κ (1/n) h 4 max{1, LκYmax} c nα, (54) where the last inequality follows from Assum. 2 for some universal positive constant c . Since Assum. 1 is satisfied, R is constant. Applying (54) to Mk NW(n, nout, d, δ, R) as defined by (26), we have the bound Mk NW(n, nout, d, δ, R) c q δ + 5 d log n i for some positive constant c . Substituting this into (25), we have (Pin Qout)k NW c1 k (1+Y 2 max) nout δ + 5 d log n i c2 k (1+Y 2 max) nout d( log n + p log(1/δ))2, for some positive constants c1, c2. By definition, (Pin Qout)k NW = supz Z (Pin Qout)k NW, k NW( , z) k NW. Define k1 and k2 by (47) and (48), respectively, and note that k1( , z), k2( , z) H(k NW) for all z Z. We want to show that supz Z (Pin Qout)k NW, k1( , z) k NW c2 k (1+Y 2 max) nout d( log n + p log(1/δ))2 and supz Z (Pin Qout)k NW, k2( , z) k NW c2 k (1+Y 2 max) nout d( log n + p log(1/δ))2, which would imply (41) and (42) (after simplifying all terms besides n, d, and δ). The first inequality follows from replacing all occurrences of the test function k NW( , (x, y)) in the proof of Lem. 1 with the function k1( , x) and noting that k NW( , (xi, yi)), k1( , (x, y)) k NW = k1( , xi), k1( , x) k1 from the fact that H(k NW) = H(k1) H(k2) (49). The second inequality follows from replacing all occurrences of the test function k NW( , (x, y)) in the proof of Lem. 
1 with the function k2( , x) and noting that k NW( , (xi, yi)), k2( , (x, y)) k NW = k2( , (xi, yi)), k2( , (x, y)) k2, again from the fact that H(k NW) = H(k1) H(k2) (49). Proof of claim (51). We leverage the fact that the Lipschitz constants defined by (27) satisfies the following additivity property. Letting Z = Sin, we have Lk NW = supz1,z2,z3 Z |k NW(z1,z2) k NW(z1,z3)| z2 z3 2 supz1,z2,z3 Z |k1(z1,z2) k1(z1,z3)| z2 z3 2 + supz1,z2,z3 Z |k2(z1,z2) k2(z1,z3)| z2 z3 2 = Lk1 + Lk2. We proceed to bound Lk1 and Lk2 separately. Note that Applying the definition (27) to Lk2, we have Lk2 = supz1=(x1,y1) z2=(x2,y2) z3=(x3,y3) |k(x1,x2)y1y2 k(x1,x3)y1y3| x2 x3 2+ y2 y3 2 = supz1=(x1,y1) z2=(x2,y2) z3=(x3,y3) |y1| |k(x1,x2)y2 k(x1,x2)y3+k(x1,x2)y3 k(x1,x3)y3| x2 x3 2+ y2 y3 2 = supz1=(x1,y1) z2=(x2,y2) z3=(x3,y3) |y1| |k(x1,x2)(y2 y3)+(k(x1,x2) k(x1,x3))y3| x2 x3 2+ y2 y3 2 supz1=(x1,y1) z2=(x2,y2) z3=(x3,y3) |y1| |k(x1,x2)(y2 y3)| x2 x3 2+ y2 y3 2 + supz1=(x1,y1) z2=(x2,y2) z3=(x3,y3) |y1| |(k(x1,x2) k(x1,x3))y3| x2 x3 2+ y2 y3 2 Ymax k + Lk Y 2 max. Putting together the pieces yields the claimed bound. Proof of claim (52). We aim to show that Rk NW,n is not much larger than Rk,n. Note that k NW can be rewritten as k NW((x1, y1), (x2, y2)) = (1 + y1y2)k(x1, x2). We define the sets Γ r : sup x1,x2: x1 x2 2 r|k(x1, x2)| k r : sup (x1,y1),(x2,y2): x1 x2 2+ y1 y2 2 (r )2|k(x1, x2) y1y2| k noting that Rk,n = inf Γ and Rk NW,n = inf Γ by definition (28). Suppose r Γ. Then for any (x1, y1), (x2, y2) such that x1 x2 2 + y1 y2 2 r2 + 4Y 2 max, it must follow that x1 x2 2 r2 (since y1 y2 2 4Y 2 max by triangle inequality). Since r satisfies (55), it must follow that |k(x1, x2) (y1y2 + 1)| |k(x1, x2)|(Y 2 max + 1) k n (Y 2 max + 1) (50) = k NW r2 + 4Y 2 max Γ , where recall Γ is defined by (56). Thus, we have Rk,n + 4Y 2 max Rk,n + 2Ymax as desired. B.2 Proof of claim (37) Suppose event E (32) is satisfied. Define the shorthand for the variance: σ2(x0; X) EY |X bf(x0) EY |X h bf(x0) i 2 . Conditioned on X1 = x1, X2 = x2, . . . , Xn = xn, we have EY |X h bf(x0) i = EY1|X1,...,Yn|Xn h Pn i=1 Yik(Xi,x0) Pn i=1 k(Xi,x0) i = Pn i=1 f (Xi)k(Xi,x0) Pn i=1 k(Xi,x0) , (57) where we have used the fact that E[Y | X = ] = f ( ) by assumption (1). Note that on event E, we have σ2(x0; X) (57) = EY |X Pn i=1 Yik(Xi,x0) Pn i=1 k(Xi,x0) Pn i=1 f (Xi)k(Xi,x0) Pn i=1 k(Xi,x0) 2 Pn i=1 vik(Xi,x0) Pn i=1 k(Xi,x0) 2 = Var[v1] Pn i=1 k(Xi,x0)2 ( Pn i=1 k(Xi,x0)) 2 , where recall v1, . . . , vn are i.i.d. random variables with Var[vi] = σ2 by (1). Taking the expectation w.r.t. X1, . . . , Xn and leveraging symmetry, we have EX σ2(x0; X) I[E] = nσ2 σ2 X, where σ2 X EX k2(X1,x0) ( Pn i=1 k(Xi,x0)) 2 σ2 X can be bounded by k2(X1,x0) ( Pn i=1 k(Xi,x0)) 2 I h Pn i=1 ηi np 2 i + 2 np 2 EX k2(X1, x0) (i) P Pn i=1 ηi np 2 + 2 np 2 R X k2(x1, x0)p(dx1), where step (i) follows from the fact that k2(X1,x0) ( Pn i=1 k(Xi,x0)) 2 1. Using Bernstein s inequality [29, Prop. 2.14], the first term can be bounded by P Pn i=1 ηi np 2 = P Pn i=1 ηi np np exp n (np)2 2(np(1 p)+np/3) o exp c1nhd for some universal positive constant c. Applying the fact that p is bounded by Assum. 1 and κ is square-integrable by Assum. 2, we can bound the second term by 2 np 2 R X k2(x1, x0)p(dx1) (i) 2 np 2 pmaxhd R (ii) 4 (np)2 c1hd c2 n2hd , for some positive constant c2 that does not depend on either h or n. Substituting these expressions into (58) yields a bound on EX σ2(x0; X) I[E] as desired. 
B.3 Proof of claim (38) Define the following shorthand for the bias: b(x0; X) EY |X h bf(x0) i f (x0). We state a more detailed version of the claim. Lemma 4 (Bias of Nadaraya-Watson). Suppose Assum. 1 and 2 are satisfied and the event E (32) holds. If f Σ(β, Lf) for β (0, 1], Lf > 0, then with high probability, the following statements hold true uniformly for all x0 X: (I.a) If k is compactly supported, then b(x0; X) Lfhβ. (I.b) Otherwise, if k is non-compactly supported, we have b(x0; X) c Lfhβ log n for some positive constant c that does not depend on n. For completeness, we first state the proof of claim (I.a) from Belkin et al. [3, Lem. 2]. On the event E (32), we have b(x0; X) (57) = Pn i=1(f (Xi) f (x0))k(Xi,x0) Pn i=1 k(Xi,x0) Pn i=1 Lf Xi βk(Xi,x0) Pn i=1 k(Xi,x0) where step (i) follows from our assumption that f Σ(β, Lf) and step (ii) follows from our assumption that k is compactly supported, so k(x, x0) = 0 whenever x x0 > h. We now proceed to proving claim (I.b). Step I: Decomposing the bias. For any x0 X, define the random variables n Pn i=1 k(Xi, x0)I[ Xi x0 h] and (59) n Pn i=1 k(Xi, x0)I 2j 1h < Xi x0 2jh for j = 1, 2, . . . , T, (60) log h = O(log n) (61) when p X( ) has compact support under Assum. 1 and h is defined by (40). We call these random variables the kernel empirical means. We can directly verify that PT j=0 A0(x0) = 1 n Pn i=1 k(Xi, x0) for all x0 X. With these definitions in hand, we may rewrite b(x0; X) in terms of A0(x0), A1(x0), . . . , AT (x0) as follows: 1 n Pn i=1 k(Xi,x0)(f (Xi) f (x0)) 1 n Pn i=1 k(Xi,x0) 1 n Pn i=1 k(Xi,x0)I[ Xi x0 h](f (Xi) f (x0)) A0(x0)+PT j=1 Aj(x0) PT j=1 1 n Pn i=1 k(Xi,x0)I[2j 1h< Xi x0 2jh](f (Xi) f (x0)) A0(x0)+PT j=1 Aj(x0) 1 n Pn i=1 k(Xi,x0)I[ Xi x0 h]Lf Xi x0 β A0(x0)+PT j=1 Aj(x0) PT j=1 1 n Pn i=1 k(Xi,x0)I[2j 1h< Xi x0 2jh]Lf Xi x0 β A0(x0)+PT j=1 Aj(x0) 1 n Pn i=1 k(Xi,x0)I[ Xi x0 h]Lf hβ A0(x0)+PT j=1 Aj(x0) (62) PT j=1 1 n Pn i=1 k(Xi,x0)I[2j 1h< Xi x0 2jh]Lf (2jh)β A0(x0)+PT j=1 Aj(x0) , (63) where step (i) follows from the fact that f (Xi) f (x0) Lf Xi x0 β by Def. 1 and step (ii) follows from the fact that Xi x0 2jh under the indicator functions. Applying (59) to (62) and (60) to (63), we have b(x0; X) A0(x0)Lf hβ A0(x0)+PT j=1 Aj(x0) + PT j=1 Aj(x0)Lf (2jh)β A0(x0)+PT j=1 Aj(x0) Lfhβ PT i=0 Ai(x0)2iβ PT j=0 Aj(x0) For any z X, define the following shorthands: a0(z) EX[k(X, z)I[ X z h]] and (65) aj(z) EX k(X, x)I 2j 1h < X z 2jh for j = 1, 2, . . . , T. (66) We can directly verify that EX[A0(z)] = a0(z) and EX[Aj(z)] = aj(z) for all j = 1, 2, . . . , T. Under Assum. 1, ai(z) can be computed directly from the decay properties of k as follows: x z h k(x, z)p X(dx) pmin R x z h k(x, z)dx 2j 1< x z 2jh k(x, z)p X(dx) pmax R 2j 1< x z 2jh k(x, z)dx for j = 1, 2, . . . , T. Remark 1. Note that for fixed x0 X, Ai(x0), Hoeffding s inequality implies that Ai(x0) concentrates around ai(x0) for all i = 0, 1, . . . , T with high probability. Assuming that k decays sufficiently rapidly (so that aj(x0)2jβ = O(a0(x0)) for all j [T]), we have b(x0; X) Lfhβ PT i=1 O(1) = O(Lfhβ log n). It remains to show that this bound holds uniformly over all x0 X with high probability. Step II: Uniform concentration of empirical kernel means. Fix ϵ (0, h 2 ] and let C0 be a ϵ-cover of X w.r.t. the Euclidean norm (see [29, Def. 5.1] for the detailed definition). We will optimize the choice of ϵ at the end of the proof. By [29, Lem. 
5.7], we have |C0| (1 + 2R We now introduce a slightly modified version of (65) and (66) defined as aϵ 0(z) EX[k(X, z)I[ X z h ϵ]] and aϵ j(z) EX k(X, z)I 2j 1h ϵ < X z 2jh + ϵ for j = 1, 2, . . . , T, respectively. Using the fact that ϵ h/2, Assum. 1 and 2, and a change of variables, we have aϵ 0(z) pmin R 2 k(x, z)dx = pminhdωd 1 R 1 2 0 κ(u)ud 1du, and aϵ j(z) pmax R 2 )h< x z (2j+ 1 2 )h k(x, x 0)dx = pmaxhdωd 1 R 2j+ 1 2 κ(u)ud 1du for j [T], where ωd 1 2πd/2/Γ(d/2) denotes the surface area of the unit sphere in Rd. Moreover, we consider the following concentration events: n Pn i=1 k(Xi, z)I[ Xi z h ϵ] aϵ 0(z) 0 and (68) n Pn i=1 k(Xi, z)I 2j 1h ϵ < Xi z 2jh + ϵ aϵ j(z) + j for j [T], (69) where 0, 1, . . . , T are positive scalars to be chosen later. For any scalars 0 < a < b and z X, further denote the empirical mass contained within the ball or annulus centered at point z by n Pn i=1 I[ Xi z a] and Massz(a, b) 1 n Pn i=1 I[a < Xi z b]. We now introduce a lemma that bounds the empirical kernel mean restricted to each ball and annuli: Lemma 5 (Uniform concentration of empirical kernel means). Suppose event T z C0 TT i=0 Aϵ i(z) is satisfied. Then for every x0 X, there exists x 0 C0 such that x0 x 0 ϵ and the following bounds hold simultaneously: A0(x0) aϵ 0(x 0) 0 Massx 0(h ϵ) Lkϵ and (70) Aj(x0) aϵ j(x 0) + j + Massx 0(2j 1h ϵ, 2jh + ϵ) κ( 2j 1h 2ϵ h ) for j [T]. (71) Here, Lk and κ( ) are properties of k from Assum. 2. See App. B.3.1 for the proof. This result suggests that we can control the error between the empirical kernel mean, Ai(x0), and its (shifted) population counterpart, aϵ j(x 0), by choosing ϵ to be small and leveraging the decay property of κ. To bound the Massx 0(h ϵ) and Massx 0(2j 1h ϵ, 2jh + ϵ) terms, we define the following events: M0(x 0) Massx 0(h ϵ) EX Massx 0(h ϵ) + t0 , and (72) Mj(x 0) Massx 0(2j 1h ϵ, 2jh + ϵ) EX Massx 0(2j 1h ϵ, 2jh + ϵ) + tj for j [T], (73) where t0, t1, . . . , t T are positive scalars to be chosen later. For any scalars 0 < a < b, we may apply Assum. 1 to obtain EX Massx 0(a) pmax Vol(a) and EX Massx 0(a, b) pmax Vol(b), (74) where Vol(r) πd/Γ(d/2 + 1)rd denotes the volume of the Euclidean ball B2(r) Rd. Step III: Putting things together. For any z C0, note that k(Xi, z)I[ Xi z h ϵ] [0, k ] and k(Xi, z)I 2j 1h ϵ < Xi z 2jh + ϵ [0, k ] and therefore these random variables are sub-Gaussian with parameter σ = 1 2 k . Thus, we may bound (68) and (69) using Hoeffding s bound [29, Prop. 2.5], obtaining P(Aϵ i(z)c) exp n 2n 2 0 k 2 o for all i = 0, 1, . . . , T and z C0, (75) Again by Hoeffding s bound [29, Prop. 2.5], we have P(Mi(z)c) exp 2nt2 i for all i = 0, 1, . . . , T and z C0. Next, it remains to choose the values of ϵ, 0, . . . , T and t0, . . . , t T . Consider 2 pminhdωd 1 R 1 2 0 κ(u)ud 1du, (76) j = 2 jβ pmaxhdωd 1 R 1 2 0 κ(u)ud 1du for j = 1, 2, . . . , T, (77) 2pmax Vol(h ϵ), and (78) 2pmax Vol(2jh + ϵ) for j = 1, 2, . . . , T. (79) Combining (70), (72), (74), (76), and (78), we have on the event E0 T z C0(Aϵ 0(z) M0(z)) the following bound: 2pminhdωd 1 R 1 2 0 κ(u)ud 1du pmax πd 2 +1)(h ϵ)d Lkϵ hdωd 1 1 2pmin R 1 2 0 κ(u)ud 1du pmax πd 2 +1)ωd 1 Lkϵ 4pminhdωd 1 R 1 2 0 κ(u)ud 1du, where step (i) follows from setting ϵ = min 1 4 h pmin R 1 2 0 κ(u)ud 1du i pmax πd 2 +1)ωd 1 Lk 1 , h Similarly, for each j = 1, 2, . . . , T, we may combine (71), (73), (74), (77), and (79) to obtain a bound on Aj(x0). 
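As an aside, the dyadic decomposition can be made concrete numerically. The sketch below is illustrative only (Gaussian kernel, uniform design, and the bandwidth choice h of order n^{-1/(2 beta + 2d)} are placeholders); it computes the kernel empirical means A_0(x0), ..., A_T(x0) from (59)-(60) and checks that each ratio 2^{j beta} A_j(x0) / A_0(x0) stays bounded by a modest constant, which is the O(1)-per-annulus property that yields the O(L_f h^beta log n) bias bound.

```python
# Illustrative sketch (not from the paper's code release): compute the kernel empirical means
# A_0, ..., A_T over the ball and dyadic annuli from (59)-(60) for a Gaussian kernel, and check
# that each term 2^{j*beta} A_j / A_0 stays O(1), the property behind the O(L_f h^beta log n) bias.
import numpy as np

rng = np.random.default_rng(1)
n, d, beta = 50_000, 2, 1.0
h = n ** (-1.0 / (2 * beta + 2 * d))            # bandwidth h ~ n^{-1/(2 beta + 2 d)} as in (40)
X = rng.uniform(-1.0, 1.0, size=(n, d))
x0 = np.zeros(d)

dist = np.linalg.norm(X - x0, axis=1)
kvals = np.exp(-dist ** 2 / (2 * h ** 2))
T = int(np.ceil(np.log2(np.sqrt(d) / h)))       # number of annuli needed to cover the domain

A = [np.mean(kvals * (dist <= h))]              # A_0: ball of radius h
for j in range(1, T + 1):
    in_annulus = (dist > 2 ** (j - 1) * h) & (dist <= 2 ** j * h)
    A.append(np.mean(kvals * in_annulus))       # A_j: j-th dyadic annulus

ratios = [2 ** (j * beta) * A[j] / A[0] for j in range(T + 1)]
print("2^{j beta} A_j / A_0 for j = 0, ..., T:")
print(np.round(ratios, 4))
```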
Hence, on the Ej T z C0(Aϵ j(z) Mj(z)), we have Aj(x0) pmaxhdωd 1 R 2j+ 1 2 κ(u)ud 1du + 2 jβ R 1 2 0 κ(u)ud 1du + pmax πd 2 +1)(2jh + ϵ)d κ( 2j 1h 2ϵ (i) pmaxhdωd 1 R 2j+ 1 2 κ(u)ud 1du + 2 jβ R 1 2 0 κ(u)ud 1du + πd(2j+ 1 2 +1)ωd 1 κ(2j 1 1) where step (i) employs the fact that ϵ h/2. Note that for j = 1, 2, . . . , T: A0(x0) 4pmax 2jβ R 2j + 1 2 κ(u)ud 1du+ R 1 2 0 κ(u)ud 1du+ πd(2j + 1 2 +1)ωd 1 κ(2j 1 1) 2 0 κ(u)ud 1du (81) where the last inequality comes from the fact that 2jβ R 2j + 1 2 κ(u)ud 1du 2 0 κ(u)ud 1du = O(1) and 2 +1)ωd 1 κ(2j 1 1) 2 0 κ(u)ud 1du = O(1) by Assum. 2. Substituting this bound into (64), we have b(x0; X) (64) Lfhβ PT i=0 Ai(x0)2jβ PT j=0 Aj(x0) Lfhβ PT i=0 Ai(x0)2jβ A0(x0) (81) LfhβTC (61) O(Lfhβ log n) as desired. Finally, it remains to control the probability that this bound on b(x0; X) holds. Applying union-bound, we have P(Ec 0 . . . Ec T ) (67),(75) (1 + 2R ϵ )d PT i=0 exp n 2n 2 i k 2 o + exp 2nt2 i ϵ )d PT j=0 exp c1nh2d4 jβ + exp c2nhjd ϵ )d(T + 1) exp c1nh2d4 T β + exp c2nh T d , for some universal positive constants c1, c2 that do not depend on n. Note that by (80), we have (1 + 2R ϵ )d = O((1 + 1 h)d). Setting h = cn 1 2β+2d as in (40) and recalling that T = O(log n), we have P(Ec 0 . . . Ec T ) 1 for sufficiently large n. This completes the proof of Lem. 4. B.3.1 Proof of Lem. 5: Uniform concentration of empirical kernel means Fix any x0 X. By definition of C0, there exists x 0 C0 such that x0 x 0 ϵ. Consider any x B2(h ϵ; x 0)4. By triangle inequality, x x0 x x 0 + x 0 x0 (h ϵ) + ϵ. Therefore, we have B2(h ϵ; x 0) B2(h; x0). (82) We now introduce the following lemma, which is a finite-sample version of [10, Lem. 8]: 4We use the notation B2(r; x) to denote the Euclidean ball of radius r centered at x Rd. Lemma 6 (L bound on L2 kernel error (finite sample)). Consider any bounded kernel k L2, , points x0, x 0 X, and function g L2(Pn). For any r, a, b 0 with a+b = 1, points x1, . . . , xn X, and set A X, we have 1 n Pn i=1 g(xi)(k(xi, x0) k(xi, x 0))1A(xi) g A,L2(Pn) h k( , x0) k( , x 0) A, Mass 1 2 A(r) + 2τk,A(ar) + 2 k A,L2, I[ x0 x 0 br] i . f A,L2(Pn) 1 n Pn i=1 f 2(xi)1A(xi) 1 f A, maxi [n]|f(xi)1A(xi)| Mass A(r) 1 n Pn i=1 I[ xi x 0 r, xi A] τA,k(r) supx {x0,x 0} 1 n Pn i=1 k2(xi, x )I[ xi x r, xi A] 1 k A,L2, supx {x0,x 0} k( , x ) A,L2(Pn). See App. B.3.2 for the proof. We now use this lemma to prove each of the claims in Lem. 5. Proof of claim (70). We consider the following basic decomposition: 1 n Pn i=1 k(Xi, x0)I[ Xi x0 h] (83) n Pn i=1 k(Xi, x 0)I[ Xi x 0 h ϵ] (84) n Pn i=1 k(Xi, x 0)I[ Xi x 0 h ϵ] 1 n Pn i=1 k(Xi, x0)I[ Xi x0 h] . (85) Note that the term (84) can be directly bounded using the definition of (68). Next, we bound the term (85). Define g 1 and A B2(h ϵ; x 0), so that g A,L2(Pn) = Mass 1 2 x 0(h ϵ). By property (82), we have I[ Xi x 0 h ϵ] I[ Xi x0 h] for all i = 1, 2, . . . , n. (86) Observe that 1 n Pn i=1 g(Xi)(k(Xi, x0) k(xi, x 0))1A(xi) (87) n Pn i=1(k(Xi, x 0) k(Xi, x0))I[ Xi x 0 h ϵ] n Pn i=1 k(Xi, x 0)I[ Xi x 0 h ϵ] 1 n Pn i=1 k(Xi, x0)I[ Xi x 0 h ϵ] (86) 1 n Pn i=1 k(Xi, x 0)I[ Xi x 0 h ϵ] 1 n Pn i=1 k(Xi, x0)I[ Xi x0 h], To upper bound (87), we now apply Lem. 6 with r = h, a = 1 2, and b = 1 2 to obtain 1 n Pn i=1 g(xi)(k(xi, x0) k(xi, x 0))1A(xi) 1 2 x 0(h ϵ) h k( , x0) k( , x 0) A, Mass 1 2 A(h) + 2τk,A( h 2 ) + 2 k A,L2, I x0 x 0 > h 1 2 x 0(h ϵ) h Lkϵ Mass 1 2 x 0(h ϵ) i , (88) where the last inequality uses the fact that k( , x0) k( , x 0) A, Lk x0 x 0 Lkϵ from the Lipschitz property of k (see Assum. 
2) and the fact that 1 2 A(h) = Mass 1 2 x 0(h ϵ), τk,A( h 2 ) = 0, and xi x0 ϵ h Substituting the bounds (68), (87), and (88) into the decomposition (83) yields the desired claim. Proof of claim (71). We consider the basic decomposition: 1 n Pn i=1 k(Xi, x0)I 2j 1h ϵ < Xi x0 2jh + ϵ (89) i=1 k(Xi, x 0)I 2j 1h ϵ < Xi x 0 2jh + ϵ (90) n Pn i=1 k(Xi, x 0)I 2j 1h < Xi x0 2jh n Pn i=1 k(Xi, x 0)I 2j 1h ϵ < Xi x 0 2jh + ϵ ) (91) Note that the term (90) can be directly bounded using the definition of (69). Next, we bound the term (91). By triangle inequality, we have x x 0 x x0 + x0 x 0 2jh + ϵ, for all x B2(2jh; x0), which implies that B2(2jh; x0) B2(2jh+ϵ; x 0). Moreover, we can verify that B2(2j 1h ϵ; x 0) B2(2j 1h; x0). Hence, I 2j 1h < Xi x0 2jh I 2j 1h ϵ < Xi x 0 2jh + ϵ for all i = 1, 2, . . . , n. (92) g 1 and A B2(2jh + ϵ; x 0) \ B2(2j 1h ϵ; x 0), (93) so that g A,L2(Pn) = Mass 1 2 x 0(2j 1h ϵ, 2jh + ϵ). Hence, (91) can be bounded by 1 n Pn i=1 k(Xi, x 0)I 2j 1h < Xi x0 2jh 1 n Pn i=1 k(Xi, x 0)I 2j 1h ϵ < Xi x 0 2jh + ϵ (92) 1 n Pn i=1 k(Xi, x 0)I 2j 1h ϵ < Xi x 0 2jh + ϵ n Pn i=1 k(Xi, x 0)I 2j 1h ϵ < Xi x 0 2jh + ϵ (93) 1 n Pn i=1 g(Xi)(k(Xi, x 0) k(Xi, x 0))1A(xi) 1 2 x 0(2j 1h ϵ, 2jh + ϵ) k( , x0) k( , x 0) A, Mass 1 2 A(2(2jh + ϵ)) + 2τk,A(2jh + ϵ) + 2 k A,L2, I x0 x 0 2jh + ϵ (94) where the last inequality follows from applying Lem. 6 with r = 2(2jh + ϵ) and a = b = 1 2. By triangle inequality, we have Xi x0 Xi x 0 x 0 x0 (i) (2j 1h ϵ) ϵ 2j 1h 2ϵ, where step (i) follows from the definition of A (93). Since κ is monotonically decreasing by Assum. 2, we have κ( Xi x0 h ) κ( 2j 1h 2ϵ h ). Thus, we have k( , x0) k( , x 0) A, max{ k( , x0) A, , k( , x 0) A, } max n κ( 2j 1h 2ϵ h ), κ( 2j 1h ϵ (i) κ( 2j 1h 2ϵ where step (i) again uses the fact that κ is monotonically decreasing. Moreover, note that Mass A(2(2jh + ϵ)) = Massx 0(2j 1h ϵ, 2jh + ϵ), τA,k(2jh + ϵ) = 0, and x0 x 0 ϵ < 2jh + ϵ. Substituting the bounds (69), (94), and (95) into the decomposition (89) yields the desired claim. B.3.2 Proof of Lem. 6: L bound on L2 kernel error (finite sample) Define the restrictions gr(x) = g(x)1B2(r;x 0)(x), g(c) r = g gr, kr(x, z) k(x, z) 1B2(r;x 0)(z), and k(c) r k kr, so that k(z, x0) = kr(z, x0) + k(c) r (z, x0) and k(z, x 0) = kr(z, x 0) + k(c) r (z, x 0). We now apply the triangle inequality and Hölder s inequality to obtain 1 n Pn i=1 g(xi)(k(xi, x0) k(xi, x 0))1A(xi) n Pn i=1 gr(xi)(kr(xi, x0) kr(xi, x 0))1A(xi) + 1 n Pn i=1 g(c) r (xi)(k(c) r (xi, x0) k(c) r (xi, x 0))1A(xi) n Pn i=1 gr(xi)(kr(xi, x0) kr(xi, x 0))1A(xi) n Pn i=1 g(c) r (xi)(k(c) r (xi, x0) k(c) r (xi, x 0))1A(xi) gr 1A L1(Pn) (k( , x0) k( , x 0))1A( ) (96) n Pn i=1 g(c) r (xi)(k(xi, x0) k(xi, x 0))1A(xi) , (97) where L1(Pn) is defined by f L1(Pn) 1 n Pn i=1|f(xi)|. To bound the term (96), we apply Cauchy-Schwarz to gr L1(Pn) L2(Pn) to obtain gr 1A L1(Pn) gr 1A L2(Pn) p Mass A(r) g A,L2(Pn) p Next, we bound the term (97). For any x {x0, x 0} and xi satisfying xi x 0 r and scalars a, b [0, 1] such that a + b = 1, either x xi ar or x x 0 > br. Hence, 1 n Pn i=1 g(c) r (k(xi, x0) k(xi, x 0)1A(xi) n Pn i=1 I[ xi x0 r]g(c) r (xi) P x {x0,x 0} k(xi, x )1A(xi)(δx0(x ) δx 0(x )) x {x0,x 0} I[ xi x ar]g(c) r (xi)k(xi, x )1A(xi)(δx0(x ) δx 0(x )) | {z } x {x0,x 0} I[ x x0 > br]g(c) r (xi)k(xi, x )1A(xi)(δx0(x ) δx 0(x )) | {z } where we define δx0(z) I[z = x0] and δx 0(z) I[z = x 0]. Bounding (98). 
Exchanging the order of the summations and applying Cauchy-Schwarz, we have x {x0,x 0} 1 n Pn i=1 I[ x xi ar]g(c) r (xi)k(x , xi)1A(xi)(δx0(x ) δx 0(x )) n Pn i=1 I[ xi x ar]g(c) r (xi)k(x , xi)1A(xi)(δx0(x ) δx 0(x )) 1 n Pn i=1 I[ xi x ar]g(c) r (xi)21A(xi) 1 n Pn i=1 I[ xi x ar]k2(xi, x )1A(xi) 1 2 δx0(x ) δx 0(x ) 1 n Pn i=1 I[ xi x ar]g(c) r (xi)21A(xi) 1 supx {x0,x 0} 1 n Pn i=1 I[ xi x ar]k2(xi, x )1A(xi) 1 2 δx0(x ) δx 0(x ) x {x0,x 0} g 1A L2(Pn) τk(ar) δx0(x ) δx 0(x ) = 2 g 1A L2(Pn) τk(ar) Bounding (99). Rearranging the terms and applying Cauchy-Schwarz, we have x {x0,x 0} I[ x x 0 > br] 1 n Pn i=1 g(c) r (xi)k(xi, x )1A(xi)(δx0(x ) δx 0(x )) x {x0,x 0} I[ x x 0 > br] 1 n Pn i=1 g(c) r (xi)k(xi, x )1A(xi) δx0(x ) δx 0(x ) x {x0,x 0} I[ x x 0 > br] g 1A L2(Pn) supx {x0,x 0} k( , x ) 1A L2(Pn) δx0(x ) δx 0(x ) = g 1A L2(Pn) supx {x0,x 0} k( , x ) 1A L2(Pn) I[ x0 x 0 > br]. C Proof of Thm. 2: KT-KRR for finite-dimensional RKHS We rely on the localized Gaussian/Rademacher analysis of KRR from prior work [29]. Define the Gaussian critical radius εn > 0 to be the smallest positive solution to the inequality b Gn(ε; BH(3)) R 2σε2, where b Gn(ε; F) Ew sup f F: f n ε n Pn i=1 wif(xi) # BH(3) is the k-ball of radius 3 and wi i.i.d. N(0, 1). Assumption 4. Assume that f k Bk(R) and bf KT,λ Bk(c R), for some constant c > 0 . Note that for any g Bk(c R), we have g supx X g, k( , x) k (i) supx X g k k( , x) k g k p where step (i) follows from Cauchy-Schwarz. Thus, the function class Bk(c R) is B-uniformly bounded. Now define the Rademacher critical radius δn > 0 to be the smallest positive solution to the inequality Rn(δ; H) δ2, where Rn(δ; F) Ex,ν sup f F: f 2 δ n Pn i=1 νif(xi) # and νi = 1 each with probability 1/2. Finally, we use the following shorthand to control the KT approximation error term, ηn,k a2 nout (2 + Wk(n, nout, δ, Rin, a nout )), where (102) Rin maxx Sin x 2 and a k ,in + Y 2 max (103) and Wk is an inflaction factor defined in (29) that scales with the covering number Nk (see Def. 2). With these definitions in place, we are ready to state a detailed version of Thm. 2: Theorem 4 (KT-KRR for finite-dimensional RKHS, detailed). Suppose the kernel operator associated with k and P has eigenvalues µ1 . . . µm > 0 (by Mercer s theorem). Define Cm 1/µm. Let εn and δn denote the solutions to (101) and (100), respectively. Further assume 5 nδ2 n > log(4 log(1/δn)). (104) Let bf KT,λ denote the KT-KRR estimator with regularization parameter λ 2ξ2 n where ξn εn δn 4 Cm( f k + 1)ηn,k. Then with probability at least 1 2δ 2e nδ2 n c1(b2+σ2) , we have bf KT,λ f 2 2 c ξ2 n + λ f 2 k + cδ2 n. (105) where recall δ is the success probability of KT-COMPRESS++ (24). See App. D for the proof. We set λ = 2ξ2 n, so that (105) becomes bf KT,λ f 2 2 3cξ2 n f 2 k + cδ2 n. (106) It remains to bound the quantities εn (100), δn (101), and ηn,k (102). We claim that 5Note that when k is finite-rank, this condition is automatically satisfied. m log nout log(1/δ) nout . (109) for some universal positive constants c0, c1, c2. Now set R = f k δ = e 1/R4. Thus, we have ξn c ( σ f k b 4 Cm f k ) m n nout . for some universal positive constant c . Substituting this into (106) leads to the advertised bound (19). Proof of claim (107). For finite rank kernels, ˆµj = 0 for j > m. Thus, we have q 2 n q Pn j=1 min{ε2, ˆµj} = q mε2. From the critical radius condition (100), we want q 4σε2, so we may set εn σ Proof of claim (108). 
By similar logic as above, we have q 2 n q Pn j=1 min{δ2, µj} = q From the critical radius condition (101), we want q bδ2, so we may set δn b Proof of claim (109). Consider the linear operator T : H Rm that maps a function to the coefficients in the vector space spanned by {ϕi}m i=1. Note that Since the image of T has dimension m, we have rank(T) m. Moreover, k µ1 R2 in. Now we can invoke [25, Eq. 14] with ϵ = a/nout to obtain Nk(Bd 2(Rin), a/nout) N(T, a/nout) (1 + µ1R2 innout/a)m. Taking the log on both sides and substituting this bound into (102), we have ηn,k = a2 nout (2 + Wk(n, nout, δ, Rin, a nout )) a2 nout (2 + r log nout log(n/nout) δ + log Nk(Bd 2(Rin), a nout ) i ) a2 nout (2 + r log nout log(n/nout) δ + m log 1 + 2 T nout m log nout log(1/δ) nout for some positive constant c that doesn t depend on m, nout, δ. D Proof of Thm. 4: KT-KRR for finite-dimensional RKHS, detailed We rescale our observation model (1) by f k, so that the noise variance is (σ/ f k)2 and our new regression function satisfies f k = 1. Our final prediction error should then be multiplied by f 2 k to recover a result for the original problem. For simplicity, denote eσ = σ/ f k. For notational convenience, define an event E = bf KT,λ f 2 2 c(ξ2 n + λ ) , and our goal is to show that E occurs with high-probability in terms of P, the probability regarding all the randomness. For that end, we introduce several events that are used throughout, Econc supg H g n g 2 δn and Elower n bf KT,λ f 2 > δn o , (110) where δn is defined in (101) and H is the RKHS generated by k hence star-shaped. Further, we introduce two technical events AKT(u), BKT defined in (123) and (136) respectively, which are proven to occur with small probability, and define a shorthand Egood Ac KT(ξn) Bc KT Econc EKT,δ. Equipped with these shorthands, observe the following inequality, P(E) = P(E Elower) + P(E Ec lower) P(E Elower) + P(Ec lower) (111) where the second inequality is because Ec lower Ec lower E due to the assumption λ 2ξ2 n 2δ2 n. If we are able to show the set inclusion {Egood Elower} {E Elower} and that P(Ec good) is small, we are able refine (111) to the following P(E) P(Egood Elower) + P(Ec lower) 1 P(Ec good) P(Ec lower) + P(Elower) = 1 P(Ec good), where the last quantity 1 P(Ec good) would be large. To complete this proof strategy, we claim the set inclusion {Egood Elower} {E Elower} (112) to hold and prove it in App. D.1 and further claim P(Ec good) c δ + e c nδ2 n/(B2 H eσ2) (113) which verify in App. D.2. Putting the pieces together, claims (112) and (113) collectively implies P(E) 1 c δ + e c nδ2 n/(B2 H eσ2) as desired. D.1 Proof of claim (112) There are several intermediary steps we take to show the set inclusion of interest (112). We introduce the shorthand b KT bf KT,λ f . By invoking Propositions and basic inequalities to come, we successively show the following chain of set inclusions Egood Elower Egood Elower { b KT 2 nout c(ξ2 n + λ )} (114) Egood Elower { b KT 2 n c(ξ2 n + λ )} (115) Egood Elower E (116) Elower E. (117) Note that step (117) is achieved trivially by dropping Egood. Further note that (115) is the crucial intermediary step after which we may apply uniform concentration across n independent samples. Proof of (115) leverages on the Proposition to come (Prop. 1) that allows n and nout to be exchangeable for finite rank kernels. 
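Before recovering steps (114)-(117), it may help to keep the estimator being analyzed in view. The sketch below shows the computational pattern of Thm. 4: compress the n training pairs to a coreset of size n_out of order sqrt(n) and solve KRR on the coreset in O(n_out^3) = O(n^{3/2}) time, with O(sqrt(n)) kernel evaluations per test point. It is illustrative only: uniform subsampling stands in for KT-COMPRESS++, and the regularization level is a placeholder for lambda = 2 xi_n^2, so the statistical guarantee of KT-KRR is not reproduced here.

```python
# Illustrative sketch (not from the paper's code release) of the computational pattern in Thm. 4:
# compress n pairs to a coreset of size n_out ~ sqrt(n), then solve KRR on the coreset.
# CAUTION: uniform subsampling stands in for KT-COMPRESS++ here, and lam is a placeholder for
# 2*xi_n^2, so this shows the runtime pattern only, not the KT-KRR statistical guarantee.
import numpy as np

def gauss_kernel(A, B, h=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h ** 2))

rng = np.random.default_rng(2)
n, d, sigma = 4096, 1, 0.5
X = rng.uniform(-1, 1, size=(n, d))
f_star = lambda x: np.sin(4 * x[:, 0])
y = f_star(X) + sigma * rng.standard_normal(n)

n_out = int(np.sqrt(n))                          # coreset of size sqrt(n)
idx = rng.choice(n, size=n_out, replace=False)   # stand-in for KT-COMPRESS++
Xc, yc = X[idx], y[idx]

lam = 1.0 / n_out                                # placeholder regularization
K = gauss_kernel(Xc, Xc)
alpha = np.linalg.solve(K + n_out * lam * np.eye(n_out), yc)   # O(n_out^3) = O(n^{3/2}) solve

Xtest = rng.uniform(-1, 1, size=(2000, d))
pred = gauss_kernel(Xtest, Xc) @ alpha           # O(n_out) = O(sqrt(n)) kernel evals per test point
print(f"coreset size {n_out}, test MSE {np.mean((pred - f_star(Xtest)) ** 2):.4f}")
```

The only operation that touches all n points is drawing and storing the sample; every kernel computation involves only the coreset.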
Recovering step (114) Since bf KT,λ and f are optimal and feasible, respectively for the central optimization problem of interest minf H(k) 1 nout Pnout i=1 (y i f(x i))2 + λ f 2 k, we have the basic inequality 1 nout Pnout i=1 y i bf KT,λ (x i) 2 + λ bf KT,λ 2 k 1 nout Pnout i=1 (y i f (x i))2 + λ f 2 k, (118) With some algebra , may refine (118) to 1 2 b KT 2 nout 1 nout Pnout i=1 v i b KT(x i) + λ n f 2 k bf KT,λ 2 k o . (119) where b KT = bf KT,λ f .Suppose that b KT nout < ξn, then we trivially recover (114) by adding λ > 0. Thus, we assume that b KT nout ξn. Under the assumption b KT nout ξn, which is without loss of generality, we utilize the basic inequality (119) and control its stochastic component 1 nout Pnout i=1 v i b KT(x i) , with a careful case work to follow, which is technical by nature. Case where bf KT,λ k 2: Under such case, we introduce a technical event AKT(u) g F \ B2(δn) { g nout u} such that 1 nout Pnout i=1 v ig(x i) 3 g noutu , for any star-shaped function class F H. Since f k = 1, triangle inequality implies b KT k bf KT,λ k + f k 3. Moreover, on the event Elower ( Egood), we have b KT 2 > δn. Thus, we may apply b KT to the event Ac KT(ξn) with F = BH(3) (i.e., the H-ball of radius 3) to attain 1 nout Pnout i=1 v i b KT(x i) c0ξn b KT nout on the event Ac KT(ξn) Elower. (120) Upper bounding the stochastic component of the basic inequality (119) by (120) and dropping the bf KT,λ 2 k term in (119), we have 1 2 b KT 2 nout c0ξn b KT nout + λ . As a last step under the case bf KT,λ k 2, apply the quadratic formula (specifically, if a, b 0 and x2 ax b 0, then x a2 + b) to obtain b KT 2 nout 4c2 0ξ2 n + 2λ . Case where bf KT,λ k > 2: Under such case, by assumption we have bf KT,λ k > 2 > 1 f k. Thus, we may derive the following f k bf KT,λ k < 0 and f k + bf KT,λ k > 1, which further implies the following inequality f 2 k bf KT,λ 2 k = { f k bf KT,λ k}{ f k + bf KT,λ k} f k bf KT,λ k. (121) Further writing bf KT,λ = f + b KT and noting that b KT k f k bf KT,λ k holds through triangle inequality, we may further refine (121) as f k bf KT,λ k 2 f k b KT k 2 b KT k, so that the basic inequality in (119) reduces to 1 2 b KT 2 nout 1 nout Pnout i=1 v i b KT(x i) + λ {2 b KT k}. (122) We again introduce a technical event that controls the stochastic component of (122), which is BKT g F \ B2(δn) { g k 1} : (123) 1 nout Pnout i=1 v ig(x i) > 4ξn g nout + 2ξ2 n g k + 1 for a star-shaped function class F H. By triangle inequality, we have b KT k bf KT,λ k f k > 1, and on event Elower ( Egood), we have b KT 2 > δn. Thus, we may apply g = b KT to the event Bc KT, and the resulting refined basic inequality is 1 2 b KT 2 nout 4ξn b KT nout + (2ξ2 n λ ) b KT k + 2λ 4ξn b KT nout + 2λ on the event Bc KT Elower where the second inequality is due to the assumption that λ 2ξ2 n. We apply the quadratic formula (specifically, if a, b 0 and x2 ax b 0, then x a2 + b) to obtain b KT 2 nout 4c2 0ξ2 n + 2λ . Putting the pieces together, we have shown b KT 2 nout c(ξ2 n + λ ) on the event Ac KT(ξn) Bc KT Elower, which is sufficient to recover (114). Recovering step (115) We now upgrade events { b KT 2 nout c(ξ2 n + λ )} = { b KT 2 n c (ξ2 n + λ )} by exploiting the events Econc EKT,δ (subset of Egood) that were otherwise not used when recovering (114). For this end, the following result is a crucial ingredient, which shows that nout and n are essentially exchangeable with high-probability, Proposition 1 (Multiplicative guarantee for KT-COMPRESS++ with k RR). 
Let Cm 1/µm and suppose δn satisfies (104). Then on event EKT,δ Econc, where EKT,δ and Econc are defined in (24) and (110) respectively, we have (1 4Cm ηn,k) g n g nout (1 + 4Cm ηn,k) g n (124) uniformly over all g H such that g 2 > δn. See App. D.3 for the proof. An immediate consequence of Prop. 1 is that { b KT 2 nout c(ξ2 n + λ )} = { b KT 2 n c (ξ2 n + λ )} on the event EKT,δ Econc Elower,, which is sufficient to recover (115). Recovering step (116) Our last step is to show { b KT 2 n c (ξ2 n + λ )} = { b KT 2 2 c (ξ2 n + λ )}. Such result can be immediately shown on the event Econc by observing that b KT H, by our assumption that f H and by the definition bf KT,λ argminf H(k) Lnout(f) + λ f 2 k. D.2 Proof of claim (113) It suffices to show the appropriate bounds for the following four probability terms P(AKT(ξn)), P(BKT), P(Ec conc), P(Ec KT,δ). Fix the shorthand BH k 2 R2 < . We know from [10] that P(Ec KT,δ|Sin) δ and then we may apply [29, Thm. 14.1] to obtain a high probability statement, P(Ec conc) e c nδ2 n/B2 H. (125) Now we present two Lemmas that bound the P( | Sin) probability of events AKT(ξn) and BKT, Lemma 7 (Controlling bad event when bf KT,λ k 2). Suppose u ξn. Then for some constant c > 0, P(AKT(u) | Sin) δ + e cnδ2 n/B2 H + e cnu2/eσ2 (126) where eσ = σ/ f k. See App. D.4 for the proof. Note that by plugging in ξn into (126) results in a probability that depends on Sin (as ξn depends on Sin). By invoking the definition of ξn, we may further refine the probability bound of AKT(ξn) by P(AKT(ξn) | Sin) δ + e cnδ2 n/B2 H + e cnδ2 n/eσ2 (127) Lemma 8 (Controlling bad event when bf KT,λ k > 2). For some constants c, c > 0, P(BKT | Sin) δ + e cnδ2 n/B2 H + ce nξ2 n/(c eσ2) (128) where eσ = σ/ f k. See App. D.5 for the proof. It is notable that ξn in the probability bound of (128) contains a term εn defined in (100) that is a function of Sin. Invoking the definition of ξn, we observe the probability upper bound (128) can be refined to P(BKT | Sin) δ + e cnδ2 n/B2 H + ce nδ2 n/(c eσ2), (129) which does not depend on Sin. Putting the pieces together, we have the following probability bound for some constants c, c > 0, P(Ec good | Sin) P(AKT(ξn) | Sin) + P(BKT | Sin) + P(Ec conc | Sin) + P(Ec KT,δ | Sin) (125)(127)(129) c δ + e c nδ2 n/(B2 H eσ2) thereby implying P(Ec good) c δ + e c nδ2 n/(B2 H eσ2) for some constant c . D.3 Proof of Prop. 1: Multiplicative guarantee for KT-COMPRESS++ with k RR Fix g H. Denote g, h = R g(x)h(x)dx as the inner product in the L2 sense. By Mercer s theorem [29, Cor. 12.26], the k-norm of g has a basis expansion g 2 k = Pm i=1 g, ϕi 2/λi so that g 2 k Pm i=1 g, ϕi 2/λm = Cm g 2 2 since Cm = 1/λm. (130) The assumption g 2 δn implies that on the event Econc (110), we have 1 2δn g 2 1 2δn g n (131) Moreover, g must be a non-zero function. Note that g2 H(k RR) (see App. F.3). Thus, we may apply Lem. 14 to f1 = f2 = g and a = 1, b = 0 to obtain g 2 n g 2 nout = 1 n Pn i=1 g2(xi) 1 nout Pnout i=1 g2(x i) g 2 k ηn,k. (132) The LHS can be expanded as g 2 n g 2 nout = g n g nout g n + g nout | {z } >0 by (131) Thus, we may rearrange (132) and combine with (130) to obtain g n g nout Cm g 2 2 g n+ g nout ηn,k. (133) On event Econc, we have 2δn + g n)2 (i) δ2 n 4 + δn g n + g 2 n. (134) Thus, we have g 2 2 g n+ g nout (134) δ2 n 4| g n+ g nout| + δn g n g n+ g nout + g 2 n g n+ g nout (131) δ2 n 2δn + δn + g n g n g n+ g nout 3 (131) 3 g n + g n = 4 g n. (135) Using (135) to refine (133), we have on event EKT,δ Econc: g n g nout 4Cm g n ηn,k. 
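The quantity controlled by Prop. 1, the gap between the full-sample empirical norm of g and its coreset norm, can be inspected numerically. In the sketch below (illustrative only), a uniform subsample stands in for the KT-COMPRESS++ coreset, so the printed ratios merely visualize what (124) controls; they do not certify the 4 sqrt(C_m eta_{n,k}) rate.

```python
# Illustrative sketch (not from the paper's code release): compare the full-sample empirical norm
# ||g||_n with the coreset norm ||g||_{n_out} for a fixed smooth g. A uniform subsample stands in
# for the KT-COMPRESS++ coreset, so this only visualizes the quantity that (124) controls.
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
x = rng.uniform(-1, 1, size=n)
g = lambda t: np.sin(3 * t) + 0.5 * t            # a fixed test function

norm_n = np.sqrt(np.mean(g(x) ** 2))             # ||g||_n over all n points
for n_out in (32, 100, 316, 1000):
    idx = rng.choice(n, size=n_out, replace=False)
    norm_out = np.sqrt(np.mean(g(x[idx]) ** 2))  # ||g||_{n_out} over the coreset
    print(f"n_out = {n_out:5d}   ||g||_nout / ||g||_n = {norm_out / norm_n:.4f}")
```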
With some algebra, this implies with probability at least 1 δ exp( c nδ2 n/B2 F): (1 4Cm ηn,k) g n g nout (1 + 4Cm ηn,k) g n uniformly over all non-zero g H such that g 2 > δn. D.4 Proof of Lem. 7: Controlling bad event when bf KT,λ k 2 Recall EKT,δ and Econc defined by (24) and (110). Also recall that EKT,δ Econc combined with the assumption g 2 δn invokes the event (124). Our aim is to show AKT(u) EKT,δ Econc {Zn(2u) 2u2}, where Zn(t) sup g F: g n t n Pn i=1 wig(xi) so that we have a probability bound P(AKT(u)) P(Ec KT,δ) + P(Ec conc) + P(Zn(2u) 2u2). The first RHS term can be bounded by δ (see (24)). The second RHS term can bounded by (125). The third term can be bounded by P(Zn(2u) 2u2) = P Zn(u) u2/2 + u2/2 (i) P Zn(u) uεn/2 + u2/2 (ii) e nu2 where (i) follows from our assumption that u εn and (ii) follows from applying generic concentration bounds on Zn(u) (see [29, Thm. 2.26, Eq. 13.66]). Putting together the pieces yields our desired probability bound (126). Proof of claim (136). Consider the event AKT(u) EKT,δ Econc. The norm equivalence established on the event EKT,δ Econc in Prop. 1 is an important ingredient throughout. Let g H be the function that satisfies three conditions: g 2 δn , g nout u, and 1 nout Pnout i=1 v ig(x i) 3 g noutu. Define the normalized function eg = u g/ g nout so that it satisfies eg nout = u and also 1 nout Pnout i=1 v ieg(x i) 3u2. (137) By triangle inequality, the LHS of (137) can be further upper bounded by 1 nout Pnout i=1 v ieg(x i) 1 n Pn i=1 vieg(xi) + u g nout n Pn i=1 vig(xi) 1 nout Pnout i=1 v ig(x i) . Recall the chosen g satisies g nout u. Observe that vig(xi) (1)= (yi f (xi))g(xi) = f (xi)g(xi) + yig(xi), so we may apply Lem. 14 with f1 = f , f2 = g and a = 1, b = 1. Thus, on the event EKT,δ Econc, we have 1 n Pn i=1 vig(xi) 1 nout Pnout i=1 v ig(x i) g k( f k + 1) ηn,k. (139) Thus, we may rearrange (138) and combine with (137) and (139) to obtain 1 n Pn i=1 vieg(xi) 3u2 u g nout g k( f k + 1) ηn,k g k g nout = g k g n g n g nout . We tackle each term in turn. First, g k Cm. Since we assume g 2 δn, we have g 2 g n δn/2+ g n (131) 2 on event Econc; and g n g nout (144) 2 on event Econc EKT,δ. Taken together, g k g nout 4 Cm on event Econc EKT,δ. (140) Cm( f k + 1)ηn,k by assumption, we have therefore found eg with norm eg nout = u satisfying 1 n Pn i=1 vieg(xi) 3u2 u2 = 2u2. We may further show that eg n = u g nout g n u g n g nout u 2 on event Econc EKT,δ, where the last inequality follows from the fact that g 2 δn and by applying (144). So we observe 2u2 1 n Pn i=1 vieg(xi) sup eg n 2u 1 n Pn i=1 vieg(xi) = Zn(2u) D.5 Proof of Lem. 8: Controlling bad event when bf KT,λ k > 2 Our aim is to show for any g H with g k 1, 1 nout Pnout i=1 v ig(x i) 2ξn g nout + 2ξ2 n g k + 1 16 g 2 nout with high probability. Note that it is sufficient to prove our aim for g H with g k = 1 by proving only for g with g k = 1, then for any h H with h k 1, we may plug g = h/ h k into 1 nout Pnout i=1 v ig(x i) 2ξn g nout + 2ξ2 n + 1 16 g 2 nout (141) to recover the aim of interest. So without loss of generality, we show (141) for all g such that g H and g k = 1. Let BKT denote the event where (141) is violated, i.e. there exists g H with g k = 1 so that 1 nout Pnout i=1 v ig(x i) > 3ξn g nout + 2ξ2 n + 1 4 g 2 nout. (142) We prove the following set inclusion, BKT EKT,δ Econc g H s.t. g k = 1 and 1 n Pn i=1 vig(xi) > 2εn g n + 2ε2 n + 1 where we know the RHS event of (143) has probability bounded by ce nξ2 n/(c eσ2) which is proven in [29, Lem. 13.23]. 
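The suprema appearing in this argument, such as Z_n(t), are localized Gaussian complexities of an RKHS ball. Dropping the localization on the empirical norm for simplicity, the supremum over the unit RKHS ball admits the closed form sup over ||g||_H <= 1 of (1/n) sum_i w_i g(x_i) = (1/n) sqrt(w' K w), by the reproducing property and Cauchy-Schwarz. The sketch below (illustrative only; kernel and sample size are placeholders) estimates its expectation by Monte Carlo and compares it with the Jensen bound sqrt(tr K)/n.

```python
# Illustrative sketch (not from the paper's code release): the supremum over the unit RKHS ball
# (localization dropped) equals sqrt(w' K w)/n; estimate its expectation by Monte Carlo and compare
# with the Jensen bound sqrt(tr K)/n. Kernel and sample size are placeholders.
import numpy as np

rng = np.random.default_rng(4)
n, h = 300, 0.3
X = rng.uniform(-1, 1, size=(n, 1))
K = np.exp(-((X - X.T) ** 2) / (2 * h ** 2))     # Gaussian kernel matrix

reps = 5_000
sups = np.empty(reps)
for r in range(reps):
    w = rng.standard_normal(n)                   # i.i.d. N(0,1) weights
    sups[r] = np.sqrt(w @ K @ w) / n             # exact supremum over {g : ||g||_H <= 1}

print(f"Monte Carlo E[sup]       : {sups.mean():.4f}")
print(f"Jensen bound sqrt(trK)/n : {np.sqrt(np.trace(K)) / n:.4f}")
```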
So the set inclusion (143) implies a bound over the event BKT, P(BKT) P(Ec KT,δ) + P(Ec conc) + ce nξ2 n/(c eσ2), where P(Ec KT,δ) δ by (24) and P(Ec conc) by (125). Choose g so that g k = 1 and (142) holds. Condition g k = 1 as well as the condition (130) resulting from a finite rank kernel k implies δn 1 g k Cm g 2. Invoke Prop. 1 for the choice of g that satisfies g 2 δn/ Cm δn, so that on the event EKT,δ Econc, we have the following norm equivalence, 1 2 g n g nout 3 2 g n for any n such that Cm ηn,k 1/18. (144) Then we have the following chain of inequalities, which holds on event Econc EKT,δ 1 n Pn i=1 vig(xi) (i) 1 nout Pnout i=1 v ig(x i) 1 n Pn i=1 vig(xi) 1 nout Pnout i=1 v ig(x i) (ii) 3ξn g nout + 2ξ2 n + 1 4 g 2 nout g k( f k + 1) ηn,k (140) 3ξn g nout + 2ξ2 n + 1 4 g 2 nout 4 Cm g nout( f k + 1) ηn,k 2ξn 2 Cm( f k + 1) ηn,k) g n + 2ξ2 n + 1 16 g 2 n, (145) where step (i) follows from triangle inequality and step (ii) follows from our assumption (142) to bound the first term and our approximation guarantee (139) to bound the second term. By definition of ξn, we have 3 2ξn 2 Cm( f k + 1) ηn,k 2ξn. Using this to refine (145), we have 1 n Pn i=1 vig(xi) 2ξn g n + 2ξ2 n + 1 which directly implies the inclusion (143) as desired. E Proof of Thm. 3: KT-KRR guarantee for infinite-dimensional RKHS We state a more detailed version of the theorem: Theorem 5 (KT-KRR guarantee for infinite-dimensional RKHS, detailed). Assume f H(k) and Assum. 1 is satisfied. If k is LOGGROWTH(α, β), then for some constant c (depending on d, α, β), bf KT,λ with λ = O(1/nout) satisfies bf KT,λ f 2 2 c logα n n + logα nout [ f k + 1]2 (146) with probability at least 1 2δ 2e nδ2 n c1( f 2 k+σ2) . If k is POLYGROWTH(α, β) with α (0, 2), then for some constant c (depending on d, α, β), bf KT,λ with λ = O(n 2 α 2 out ) satisfies bf KT,λ f 2 2 c f 2 2+α k n 2 2+α + [ f k + 1]2n 2 α 2 out log nout + c b 4 2+α n 2 2+α (147) with probability at least 1 2δ 2e nδ2 n c1( f 2 k+σ2) . E.1 Generic KT-KRR guarantee We state a generic result for infinite-dimensional RKHS that only depends on the Rademacher and Gaussian critical radii as well as the KT approximation term, all introduced in App. C. Theorem 6 (KT-KRR). Let f H(k) and Assum. 1 is satisfied. Let δn, εn denote the solutions to (100), (101), respectively. Denote bf KT,λ with regularization parameter λ 2ηn,k, where ηn,k is defined by (102). Then with probability at least 1 2δ 2e nδ2 n c( f 2 k+σ2) c1e c2 n f 2 kε2 n σ2 , we have bf KT,λ f 2 2 Ufull + UKT, where Ufull c ε2 n + λ [ f k + 1]2 + cδ2 n and UKT c ηn,k [ f k + 1]2. See App. F for the proof. The term Ufull follows from the excess risk bound of FULL-KRR bffull,λ. The term UKT follows from our KT approximation. Clearly, the best rates are achieved when we choose λ = 2ηn,k. E.2 Proof of explicit rates The strategy for each setting is as follows: 1. Bound the Gaussian critical radius (101) using [29, Cor. 13.18], which reduces to finding ε > 0 satisfying the inequality q 2 n q Pn j=1 min{ε2, ˆµj} βε2, where β f k and ˆµ1 ˆµ2 . . . ˆµn 0 are the eigenvalues of the normalized kernel matrix K/n, where K is defined by (5). 2. Bound the Rademacher critical radius (100) using [29, Cor. 14.5], which reduces to solving the inequality q 2 n q P j=1 min{δ2, µj} δ2 where (µj) j=1 are the eigenvalues of the k according to Mercer s theorem [29, Thm. 12.20] and b is the uniform bound on the function class. 3. Bound ηn,k (102) using the covering number bound N(Bd 2(Rin), ϵ) from Assum. 3. 
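As an illustration of step 1 above (a sketch only, not the paper's code; the constant beta = ||f*||_k/(4 sigma) is a placeholder for the exact constant in [29, Cor. 13.18]), the critical inequality can be solved numerically from the eigenvalues of the normalized kernel matrix K/n. Bisection applies because the left-hand side divided by epsilon is non-increasing in epsilon, so once the inequality holds it continues to hold for larger epsilon.

```python
# Illustrative sketch (not from the paper's code release) of step 1: solve the critical inequality
# sqrt(2/n) * sqrt(sum_j min(eps^2, mu_hat_j)) <= beta * eps^2 by bisection, with mu_hat_j the
# eigenvalues of K/n. The constant beta = ||f*||_k / (4 sigma) is a placeholder for the exact
# constant in [29, Cor. 13.18].
import numpy as np

rng = np.random.default_rng(5)
n, h = 1000, 0.3
X = rng.uniform(-1, 1, size=(n, 1))
K = np.exp(-((X - X.T) ** 2) / (2 * h ** 2))
mu_hat = np.clip(np.linalg.eigvalsh(K / n), 0.0, None)   # empirical eigenvalues of K/n

def lhs(eps):
    return np.sqrt(2.0 / n) * np.sqrt(np.sum(np.minimum(eps ** 2, mu_hat)))

beta = 1.0 / (4 * 0.5)            # placeholder: ||f*||_k = 1, sigma = 0.5
lo, hi = 1e-6, 10.0               # inequality fails at lo, holds at hi
for _ in range(80):               # bisection: lhs(eps)/eps is non-increasing, so one crossing
    mid = 0.5 * (lo + hi)
    if lhs(mid) <= beta * mid ** 2:
        hi = mid
    else:
        lo = mid
print(f"empirical critical radius eps_n ~ {hi:.4f}")
```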
In the sequel, we make use of the following notation. Let Rn 1 + supx Sin xi 2 (103) = 1 + Rin and Lk(r) Cd log 2rβ according to [16, Eq. 6], where Cd is the constant that appears in Assum. 3. E.2.1 Proof of (147) We begin by solving (148). Lemma 9 (Critical Gaussian radius for POLYGROWTH kernels). Suppose Assum. 1 is satisfied and k is POLYGROWTH with α < 2 as defined by Assum. 3. Then the Gaussian critical radius satisfies ε2 n 2c f k/4σ 4 2+α 2 αLk(Rn)(1 + 32α 2 α) 2 2+α n 2 2+α . (150) Proof. [16, Cor. B.1] implies that ˆµj 4 Lk(Rn) α for all j > Lk(Rn) + 1 Let k be the smallest integer such that k > Lk(Rn) + 1 and 4 Lk(Rn) By Assum. 1, Rn is a constant, so the first inequality is easily satisfied for large enough n k 2 αLk(Rn)ε α + 1. (151) 2 n q Pn j=1 min{ε2, ˆµj} 2 n kε2 + Pn j=k+1 4 Lk(Rn) kε2 + 4Lk(Rn)2/α 2/α 1 k1 2/α 2 αLk(Rn)ε2 α + 4 22 αLk(Rn) 2/α 1 ε2 α, where step (i) follows from the approximation Pn 1 j=k 4 Lk(Rn) α 4Lk(Rn)2/α R k t 2/αdt = 4Lk(Rn)2/α 1 2/α 1k1 2 To solve (148), it suffices to solve 2 αLk(Rn)(1 + 16 2/α 1)ε2 α βε2 β 2 2+α 2 αLk(Rn)(1 + 32α 2 α) 1 2+α n 1 2+α . Since εn is the smallest such solution to (148) by definition, we have (150) as desired. We proceed to solve (149). Lemma 10. Suppose Assum. 1 is satisfied and k is POLYGROWTH with α < 2 as defined by Assum. 3. Then the Rademacher critical radius satisfies ε2 n b 4 2+α 2 αLk(Rn)(1 + 32α 2 α) 2 2+α n 2 2+α . Proof. Thus, we can solve the following inequality q 2 n q Pn j=1 min{δ2, ˆµj} 1 Following the same logic as in the proof of Lem. 9 but with β = 1/b yields the desired bound. Finally, it remains to bound (102). We have ηn,k = a2 nout (2 + Wk(n, nout, δ, Rin, a nout )) a2 nout (2 + r log nout log(n/nout) δ + log Nk(Bd 2(Rin), a nout ) i ) a2 nout (2 + r log nout log(n/nout) δ + Cd nout a α(Rin + 1)β) log nout log(n/nout) Cd (Rin+1)β 2 1 out a2 2 + r log nout log(n/nout) Cd (Rin+1)β for some universal positive constant c. In summary, there exists positive constants c0, c1, c2 such that ε2 n c0 σ f k 4 2+α n 2 2+α δ2 n c1b 4 2+α n 2 2+α ηn,k c2a2n 2 α 2 out log nout Setting λ = c2a2n 2 α 2 out log nout, we have bf KT,λ f 2 2 c ε2 n + λ + ηn,k [ f k + 1]2 + c δ2 n 2 2+α k n 2 2+α + [ f k + 1]2n 2 α 2 out log nout + c b 4 2+α n 2 2+α . E.2.2 Proof of (146) We begin by solving (148). Lemma 11 (Critical Gaussian radius for LOGGROWTH kernels). Under Assum. 1 and LOGGROWTH version of Assum. 3, Gaussian critical radius satisfies n Lk(Rn)C α for some constant C α that only depends on α. where we ignore log-log factors. Proof. [16, Cor. B.1] implies that ˆµj 4 exp 2 2 j 1 Lk(Rn) 1 α for all j > Lk(Rn) + 1 Let k be the smallest integer such that k > Lk(Rn) + 1 and 4 exp 2 2 j 1 Lk(Rn) 1 By Assum. 1, Rn is a constant, so the first inequality is easily satisfied for large enough n k Lk(Rn) log 2e ε α + 1. (152) Thus, k = Lk(Rn) log 2e ε α + 1 . Then 2 n q Pn j=1 min{ε2, ˆµj} 2 n kε2 + Pn j=k+1 4 exp 2 2 j 1 Lk(Rn) 1 Consider the following approximation: Pn 1 ℓ=k 4 exp 2 2 ℓ Lk(Rn) 1 α 4e2 R k e 2t Lk(Rn) 1/α dt = R k 1 ct1/αdt, where c exp( (Lk(Rn)/2) 1/α) (0, 1). Defining m log c > 0 and k k 1, we have R k ct1/αdt Cα(k b 1 + bα 1m α)e mk 1/α, (154) by Li et al. [16, Eq. 50], where Cα > 0 is a constant satisfying (x + y)α Cα(xα + yα) for any x, y > 0 and b is a known constant depending only on α. Plugging in k = Lk(Rn) log 2e ε α , we can bound the exponential by e mk 1/α e m Lk(Rn)1/α log( 2e ε m Lk(Rn)1/α . Note that we can simplify the exponent by m Lk(Rn)1/α = (Lk(Rn)/2) 1/αLk(Rn)1/α = 21/α. 
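The sum-to-integral step used in this proof, bounding a tail sum of a decreasing sequence by the corresponding integral, can be checked numerically. The sketch below is illustrative only: the decay parameters are placeholders, not the constants of Lem. 11.

```python
# Illustrative sketch (not from the paper's code release): numerical check of the sum-to-integral
# bound sum_{j=k}^{N} phi(j) <= int_{k-1}^{N} phi(t) dt for a decreasing phi mimicking exponential
# eigenvalue decay. The decay parameters are placeholders, not the constants of Lem. 11.
import numpy as np

alpha, c = 1.5, 0.2
k, N = 5, 400
phi = lambda t: c ** (t ** (1.0 / alpha))        # decreasing in t for c in (0, 1)

tail_sum = sum(phi(j) for j in range(k, N + 1))

grid = np.linspace(k - 1, N, 200_001)            # fine trapezoid rule for the integral
vals = phi(grid)
tail_int = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(grid))

print("sum_{j=%d}^{%d} phi(j)  = %.6e" % (k, N, tail_sum))
print("int_{%d}^{%d} phi(t) dt = %.6e  (upper bound, since phi is decreasing)" % (k - 1, N, tail_int))
```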
Note that k = k 1 L(Rn) = 2m α. Thus, we can absorb the bα 1m α term in (154) into k and obtain the following bound Pn 1 ℓ=k 4 exp 2 2 ℓ Lk(Rn) 1 where C α depends only on α. Plugging this bound into (153), we have 2 n q Pn j=1 min{ε2, ˆµj} 2 n kε2 + C αk ε Lk(Rn) log 2e ε α(ε2 + C α ε Lk(Rn) log 2e ε αC αε21/(1 α) for some constant C α that only depends on α and universal positive constant c. To solve (148), it suffices to solve Lk(Rn) log 2e ε αC αε21/(1 α) βε2, which is implied by the looser bound n Lk(Rn)C α ε2 log 2e The solution to (148) (up to log-log factors) is ε log(2e β n)α/2 β2 Lk(Rn)C α. We proceed to solve (149). Lemma 12 (Critical Gaussian radius for LOGGROWTH kernels). Under Assum. 1 and LOGGROWTH version of Assum. 3, the Rademacher critical radius satisfies δ2 n b2 log( 2e n Lk(Rn)C α. Proof. Following the same logic as in the proof of Lem. 11 but with β = 1/b yields the desired bound. Finally, it remains to bound (102). We have ηn,k = a2 nout (2 + Wk(n, nout, δ, Rin, a nout )) a2 nout (2 + r log nout log(n/nout) δ + log Nk(Bd 2(Rin), a nout ) i ) a2 nout (2 + r log nout log(n/nout) δ + Cd log( enout a )α(Rin + 1)β ) for some universal positive constant c. In summary, there exists universal positive constants c0, c1, c2 such that n δ2 n c1b2 log( 2e n ηn,k c2 a nout log( enout a )α/2Rβ/2 in . Setting λ = 2c2 a nout log( enout a )α/2, we have bf KT,λ f 2 2 c ε2 n + λ + ηn,k [ f k + 1]2 + c δ2 n c log(2e f k n + [ f k + 1]2 c nout log( enout a )α/2 + c b2 log( 2e F Proof of Thm. 6: KT-KRR Our first goal is to bound the in-sample prediction error. We relate bf KT,λ f 2 n to bffull,λ f 2 n, where the latter quantity has well known properties from standard analyses of the KRR estimator (refer to [29]). Note that regularization parameter λ of KT based estimator bf KT,λ is independently chosen from the regularization parameter λ of the estimator based on original samples bffull,λ. For bffull,λ, we choose the regularization parameter λ = 2ε2 n, (155) which is known to yield optimal L2 error rates. Define the main event of interest, E { bf KT,λ f 2 2 c(ε2 n + δ2 n + λ + ηn,k)[ f k + 1]2}. Our goal is to show E occurs with high probability. For that end, we introduce several additional events that are used throughout this proof. For some constant c >, define the event of an appealing in-sample prediction error of bf KT,λ , EKT,n(t) n bf KT,λ f 2 n c t2 + λ + ηn,k ( f k + 1)2o for t εn. where ηn,k is defined in (102). Recall EKT,δ is the event where KT-COMPRESS++ succeeds as defined by (24). Further as f and bf KT,λ are both in {f H : f k R}, we may deduce that all the functions under consideration satisfies f k f k k R where k < . Accordingly, we define a uniform concentration event, E conc {supf F f 2 2 f 2 n f 2 2/2 + δ2 n/2} where F = {f H : f 2 k R}. (156) Event (156) is analogous to the event Econc previously defined in (110) when dealing with finite rank kernels. We first show that EKT,n(εn δn) E conc E. (157) Notice that almost surely we have bf KT,λ f 2 k R, thereby implying bf KT,λ f 2 2 2 bf KT,λ f 2 n + δ2 n on the event E conc. (158) Next invoking the event EKT,n(εn δn) along with (158), we have bf KT,λ f 2 2 2c[(εn δn)2 + λ + ηn,k] ( f k + 1)2 + δ2 n c(ε2 n + δ2 n + λ + ηn,k)[ f k + 1]2. which recovers the event of E. The remaining task is to show E is of high-probability, which amounts to showing events EKT,n(t) and E conc are of high-probability by reflecting on (157). From [29, Thm. 
14.1], we may immediately derive P(E conc) 1 c1e c2 nδ2 n k 2 R2 for some constants c1, c2 > 0. We further claim that P(EKT,n(t) | Sin) 1 δ e nt2 c0σ2 c1e c2 n f 2 kt2 for some constants c0, c1, c2 > 0. Proof of claim (159) is deferred to App. F.1. Plugging in t = εn δn into (159), and invoking inequality εn δn δn so as to decouple the dependence on Sin, we have P(EKT,n(εn δn) | Sin) 1 δ e nδ2 n c0σ2 c1e c2 f 2 knδ2 n σ2 which further implies P(EKT,n(εn δn)) 1 δ e nδ2 n c0σ2 c1e c2 f 2 knδ2 n σ2 . Putting the pieces together, for some constants c0, c1 > 0, we have P(E) 1 δ c0e c1 nδ2 n σ2 (σ2/ f 2 k) ( k 2 R2) . (160) Overall, (157) and (160) collectively yields the desired result. F.1 Proof of claim (159) To prove claim (159), we introduce two new intermediary and technical events. For some positive constant c0, define the event 6 when in-sample prediction error of bffull,λ is appealing Efull,n(t) n bffull,λ f 2 n 3c0 f 2 kt2o for t εn. (161) The second intermediary event, denoted as Eb KT(t), is the intersection of (169) and (170), which we do not elaborate here due to its technical nature event Eb KT(t) plays an analogous role to Ac KT Bc KT defined in (123) and (136) respectively. Our goal here is two-folds: first is to show {Efull,n(t) EKT,δ Eb KT(t)} = EKT,n(t) and second is to prove the following bound P Efull,n(t) EKT,δ Eb KT(t) | Sin 1 δ e nt2 c0σ2 c1e c2 n f 2 kt2 from which (159) follows. Note that Wainwright [29, Thm. 13.17] show P(Efull,n(t)) 1 c1e c2 n f 2 kt2 for some constants c1, c2 > 0 and that P(EKT,δ | Sin) 1 δ. So it remains to bound the probability of event Eb KT(t), which we show below. Given f, define the following quantities n Pn i=1(f 2(xi) 2f(xi)yi) + 1 n Pn i=1 y2 i and Lnout(f) 1 nout Pnout i=1 (f 2(x i) 2f(x i)y i) + 1 n Pn i=1 y2 i . In the sequel, we repeatedly make use of the following fact: on event EKT,δ defined in (24), we have |Ln(f) Lnout(f)| f 2 k + 2 ηn,k for all non-zero f H. (162) The claim of (162) is deferred to the end of this section. Given f, we can show with some algebra that n Pn i=1(f(xi) yi)2 = f f 2 n 2 n Pn i=1 ξ2 i , (163) where Z (f(x1) f (x1), . . . , f(xn) f (xn)) and ξ (ξ1, . . . , ξn) are vectors in Rn. Define the shorthands b KT bf KT,λ f and b full bffull,λ f . In the sequel, we use the following shorthands: Zfull (b full(x1), . . . , b full(xn)) and ZKT (b KT(x1), . . . , b KT(xn)). Now for the main argument to bound bf KT,λ f 2 n. When b KT n < t, we immediately have b KT 2 n < t2, which implies (159). Thus, we may assume that b KT n t. Note that b KT 2 n (163) = Ln( bf KT,λ ) + 2 n ZKT , ξ 1 n Pn i=1 ξ2 i = Ln( bffull,λ) + h Ln( bf KT,λ ) Ln( bffull,λ) i + 2 n ZKT , ξ 1 n Pn i=1 ξ2 i . Given the optimality of bffull,λ on the objective (3), we have Ln( bffull,λ) 1 n Pn i=1 ξ2 i + λ n f 2 k bffull,λ 2 k o 1 n Pn i=1 ξ2 i + λ f 2 k, 6Since the input points in Sin are fixed, the randomness in bffull,λ originates entirely from the randomness of the noise variables ξ. where the last inequality follows trivially from dropping the bffull,λ 2 k term. Thus, b KT 2 n = 1 n Pn i=1 ξ2 i + λ f 2 k + h Ln( bf KT,λ ) Ln( bffull,λ) i + 2 n ZKT , ξ 1 n Pn i=1 ξ2 i n ZKT , ξ + λ f 2 k + h Ln( bf KT,λ ) Ln( bffull,λ) i . (164) Using standard arguments to bound the term 2 n ZKT , ξ , we claim that on the event Eb KT, we have b KT 2 n ct2( f k + 1)2 + c h Ln( bf KT,λ ) Ln( bffull,λ) i (165) for some positive constants c, c , and that P(Eb KT | Sin) 1 e nt2 2σ2 . We defer the proof of claim (165) to the end of this section. 
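The basic inequality for the full-data estimator invoked above, namely that optimality of bf_full,lambda and feasibility of f* give L_n(bf_full,lambda) + lambda ||bf_full,lambda||_k^2 <= (1/n) sum_i xi_i^2 + lambda ||f*||_k^2, can be verified numerically when f* is constructed with a known RKHS norm. The sketch below is illustrative only (Gaussian kernel, placeholder bandwidth and regularization).

```python
# Illustrative sketch (not from the paper's code release): numerical check of the basic inequality
# L_n(f_hat) + lam*||f_hat||_k^2 <= (1/n) sum_i xi_i^2 + lam*||f*||_k^2, with f* built as a finite
# kernel expansion so that both RKHS norms are available in closed form. Constants are placeholders.
import numpy as np

def gauss(A, B, h=0.4):
    return np.exp(-((A[:, None, 0] - B[None, :, 0]) ** 2) / (2 * h ** 2))

rng = np.random.default_rng(6)
n, sigma, lam = 400, 0.3, 1e-2
Z = rng.uniform(-1, 1, size=(5, 1))               # centers of f* = sum_j c_j k(., z_j)
c = rng.standard_normal(5)
X = rng.uniform(-1, 1, size=(n, 1))
xi = sigma * rng.standard_normal(n)
y = gauss(X, Z) @ c + xi

K = gauss(X, X)
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)              # full-data KRR solution
lhs = np.mean((y - K @ alpha) ** 2) + lam * (alpha @ K @ alpha)  # ||f_hat||_k^2 = alpha' K alpha
rhs = np.mean(xi ** 2) + lam * (c @ gauss(Z, Z) @ c)             # ||f*||_k^2 = c' K_zz c
print(f"LHS = {lhs:.5f}  <=  RHS = {rhs:.5f}  ->  {lhs <= rhs}")
```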
Now we bound the stochastic term h Ln( bf KT,λ ) Ln( bffull,λ) i in (165) first observe the following decomposition: Ln( bf KT,λ ) Ln( bffull,λ) = Ln( bf KT,λ ) Lnout( bf KT,λ ) + Lnout( bf KT,λ ) Ln( bffull,λ) . On the event EKT,δ (24), the first term in the display can be bounded by Ln( bf KT,λ ) Lnout( bf KT,λ ) (162) ( bf KT,λ 2 k + 2) ηn,k. Note that bf KT,λ is the solution to the following optimization problem, minf H(k) Lnout(f) + λ f 2 k, so the second term in the display can be bounded by the following basic inequality Lnout( bf KT,λ ) + λ bf KT,λ 2 k Lnout( bffull,λ) + λ bffull,λ 2 k so that Lnout( bf KT,λ ) Lnout( bffull,λ) λ n bffull,λ 2 k bf KT,λ 2 k o . Thus, on event EKT,δ, we have Ln( bf KT,λ ) Ln( bffull,λ) ( bf KT,λ 2 k + 2) ηn,k + λ n bffull,λ 2 k bf KT,λ 2 k o = 2ηn,k + λ bffull,λ 2 k + {ηn,k λ } bf KT,λ 2 k (i) 2ηn,k + λ bffull,λ 2 k (ii) λ ( bffull,λ 2 k + 1) where steps (i) and (ii) both follow from the fact that λ 2ηn,k (see assumptions in Thm. 6). To bound bffull,λ 2 k, we use the following lemma: Lemma 13 (RKHS norm of bffull,λ). On event Efull,n (161), we have the following bound bffull,λ 2 k c0( f k + 1)2 (166) for some constant c0 > 0. See App. F.2 for the proof. Putting things together, we have Ln( bf KT,λ ) Ln( bffull,λ) cλ ( f k + 1)2 for some constant c substituting this bound into (165) yields bf KT,λ f 2 n ct2( f k + 1)2 + c λ ( f k + 1)2, for some constants c, c , which directly implies (159), i.e. implying {Efull,n(t) EKT,δ Eb KT(t)} = EKT,n(t). Proof of claim (162). Given f, define the function ℓ f : X Y R, where ℓ f(x, y) f 2(x) 2y f(x) (167) and note that Ln(f) Lnout(f) = 1 n Pn i=1 ℓ f(xi, yi) 1 nout Pnout i=1 ℓ f(x i, y i). We first prove a generic technical lemma: Lemma 14 (KT-COMPRESS++ approximation bound using k RR). Suppose f1, f2 H(k) and a, b R. Then the function ℓf1,f2 : X Y R, where ℓf1,f2(x, y) a f1(x)f2(x) + b yf1(x) (168) lies in the RKHS H(k RR). Moreover, on event EKT,δ, we have Pinℓf1,f2 Qoutℓf1,f2 (|a| f1 k f2 k + |b| f2 k) ηn,k. uniformly for all non-zero f1, f2 H(k). See App. F.3 for the proof. Applying the lemma with f1 f, f2 g and a = 1, b = 2, we have Pinℓ f Qoutℓ f ( f 2 k + 2) ηn,k, which combined with the observation (168) yields the desired claim. Proof of claim (165). Case I: First suppose that b KT k 1. Recall that b KT n t εn by assumption. Thus, we may apply [29, Lem. 13.12] to obtain 1 n ZKT , ξ 2 b KT nt w.p. at least 1 e nt2 Plugging the above bound into (164), we have with probability at least 1 e nt2 b KT 2 n 4 b KT nt + λ f 2 k + h Ln( bf KT,λ ) Ln( bffull,λ) i . We can solve for b KT n using the quadratic formula. Specifically, if a, b 0 and x2 ax b 0, b. Thus, we have with probability at least 1 e nε2 n 2σ2 : b λ f 2 k + h Ln( bf KT,λ ) Ln( bffull,λ) i . Using the fact that (a + b)2 2a2 + 2b, we have with probability at least 1 e nt2 bf KT,λ f 2 n 32t2 + 2λ f 2 k + 2 h Ln( bf KT,λ ) Ln( bffull,λ) i (155) ct2( f k + 1)2 + 2 h Ln( bf KT,λ ) Ln( bffull,λ) i Case II: Otherwise, we may assume that b KT k > 1. Now we apply [29, Thm. 13.23] to obtain 1 n ZKT , ξ 2t b KT n + 2t2 b KT k + 1 16 b KT 2 n w.p. at least 1 c1e nt2 c2σ2 , (170) for some universal positive constants c1, c2. Plugging the above bound into (164) and collecting terms, we have with probability at least 1 c1e nt2 7 8 b KT 2 n 4t b KT n + 4t2 b KT k + λ f 2 k + h Ln( bf KT,λ ) Ln( bffull,λ) i . Solving for b KT n using the quadratic formula, we have with probability at least 1 c1e nt2 7 t2 b KT 2 k + 8 7λ f 2 k + 8 7 h Ln( bf KT,λ ) Ln( bffull,λ) i . 
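For completeness, here is the elementary "quadratic formula" fact used repeatedly in these arguments, together with the inequality applied in the next step; this is a standard derivation, stated only to fill in the algebra.

```latex
% Elementary fact: if x, a, b >= 0 and x^2 <= a x + b, then x <= a + sqrt(b), hence x^2 <= 2a^2 + 2b.
\begin{align*}
x^2 \le a x + b
  &\;\Longrightarrow\; x \le \tfrac{1}{2}\bigl(a + \sqrt{a^2 + 4b}\bigr)
     && \text{($x$ is below the positive root of $t^2 - at - b$)} \\
  &\;\Longrightarrow\; x \le \tfrac{1}{2}\bigl(a + a + 2\sqrt{b}\bigr) = a + \sqrt{b}
     && \text{(since $\sqrt{a^2 + 4b} \le a + 2\sqrt{b}$)} \\
  &\;\Longrightarrow\; x^2 \le \bigl(a + \sqrt{b}\bigr)^2 \le 2a^2 + 2b
     && \text{(using $(u + v)^2 \le 2u^2 + 2v^2$)}.
\end{align*}
```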
Using the fact that (a + b)2 2a2 + 2b, we have with probability at least 1 c1e nt2 bf KT,λ f 2 n 42t2 + 10t2 b KT 2 k + 2.3λ f 2 k + 2.3 h Ln( bf KT,λ ) Ln( bffull,λ) i (i) 42t2 + 10t2 b KT 2 k + 4.6t2 f 2 k + 2.3 h Ln( bf KT,λ ) Ln( bffull,λ) i (155),(166) c3t2( f k + 1)2 + c4 h Ln( bf KT,λ ) Ln( bffull,λ) i for some positive constants c3, c4, where step (i) follows from that fact that λ = 2ε2 n by (155). F.2 Proof of Lem. 13: RKHS norm of bffull,λ Given the optimality of bffull,λ on the objective (3), we have the following basic inequality Ln( bffull,λ) + λ bffull,λ 2 k 1 n Pn i=1 ξ2 i + λ f 2 k = bffull,λ 2 k f 2 k + 1 λ 1 n Pn i=1 ξ2 i Ln( bffull,λ) . Since bffull,λ f 2 n 0, we also have the trivial lower bound Ln( bffull,λ) (163) = bffull,λ f 2 n 2 n Zfull, ξ + 1 n Pn i=1 ξ2 i 2 n Zfull, ξ + 1 n Pn i=1 ξ2 i . bffull,λ 2 k f 2 k + 1 n Zfull, ξ (171) and it remains to bound 2 n Zfull, ξ . Case I: First, suppose that b full k > 1. Then we may apply [29, Lem. 13.23] to obtain 1 n Zfull, ξ 2εn b full n + 2ε2 n b full k + 1 16 b full 2 n w.p. at least 1 c1e nε2 n c2σ2 . Combining this bound with (171), we have with probability at least 1 c1e nε2 n c2σ2 : bffull,λ 2 k f 2 k + 2ε2 n λ b full k + 2 λ 2εn b full n + 1 16 b full 2 n (i) f 2 k + 2ε2 n λ ( bffull,λ k + f k) + 2 λ 2εn b full n + 1 16 b full 2 n (155) = f 2 k + bffull,λ k + f k + 2 λ 2εn b full n + 1 16 b full 2 n , where step (i) follows from triangle inequality. Solving for bffull,λ k using the quadratic formula, we have bffull,λ 2 k 2 + f 2 k + f k + 2 λ 2εn b full n + 1 16 b full 2 n . On the event Efull,n (161), we have b full n c f kεn for some positive constant c, which implies the claimed bound (166) after some algebra. Case II(a): Otherwise, assume b full k 1 and b full n εn. Applying [29, Thm. 2.26] to the function sup g k 1 g n εn n Pn i=1 ξig(xi) , we have 1 n Zfull, ξ ε2 n 2 w.p. at least 1 e nε2 n 8σ2 Combining this bound with (171), we obtain bffull,λ 2 k f 2 k + 1 λε2 n (155) = f 2 k + 1 which immediately implies the claimed bound (166). Case II(b): Finally, assume b full k 1 and b full n > εn. Applying [29, Lem. 13.12] with u = εn, we have 1 n Zfull, ξ 2ε2 w.p. at least 1 e nε2 n 2σ2 . Combining this bound with (171), we obtain bffull,λ 2 k f 2 k + 4 λε2 n (155) = f 2 k + 2, which immediately implies the claimed bound (166). F.3 Proof of Lem. 14: KT-COMPRESS++ approximation bound using k RR By Grünewälder [12, Lem. 4], ℓf1,f2 lies in the RKHS H(k RR), which is a direct sum of two RKHS: H(k RR) = H(k1) H(k2), where k1, k2 : Z Z R are the kernels defined by k1((x1, y1), (x2, y2)) k2(x1, x2) and k2((x1, y1), (x2, y2)) k(x1, x2) y1y2. Applying Lem. 2 with Z = X Y, k ALG = k RR, and ϵ = ( k 1/2 ,in+Ymax)2 yields the following bound on event EKT,δ (24): suph H(k RR): h k RR 1 |(Pin Qout)h| 2ϵ + k RR 1/2 ,in nout Wk RR(n, nout, δ, Rin, ϵ ). We claim that k RR 1/2 ,in k ,in + Y 2 max and (172) log N k RR(Sin, ϵ ) c log Nk(Sin, k 1/2 ,in+Ymax nout ), (173) for some positive constant c, where N k RR is the cardinality of the cover of B k RR n ℓ f/ ℓ f k RR : f H(k) o for ℓ f defined by (167). Proof of the claims (172) and (173) are deferred to the end of this section. By definition of Wk RR, we have Wk RR(n, nout, δ, Rin, ϵ ) (29) c Wk(n, nout, δ, Rin, k 1/2 ,in+Ymax nout ) W k. On event EKT,δ, we have suph H(k RR): h k RR 1 |(Pin Qout)h| 2( k 1/2 ,in+Ymax)2 nout + k ,in+Y 2 max nout W k (i) = k ,in+Y 2 max nout [2 + W k], (174) where step (i) follows from the fact that ( k 1/2 ,in + Ymax)2 2( k ,in + Y 2 max). 
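The direct-sum decomposition above suggests the concrete form k_RR((x1, y1), (x2, y2)) = k(x1, x2)^2 + k(x1, x2) y1 y2 on Z = X x Y; each summand is a Schur product of positive semi-definite kernels, hence so is k_RR. The sketch below (illustrative only; Gaussian base kernel and synthetic responses) builds its Gram matrix on an augmented sample and checks positive semi-definiteness numerically.

```python
# Illustrative sketch (not from the paper's code release): build the Gram matrix of the
# response-augmented kernel k_RR((x1,y1),(x2,y2)) = k(x1,x2)^2 + k(x1,x2)*y1*y2 suggested by the
# direct-sum decomposition above, and check positive semi-definiteness numerically.
import numpy as np

rng = np.random.default_rng(7)
m, h = 200, 0.5
X = rng.uniform(-1, 1, size=(m, 1))
y = np.sin(3 * X[:, 0]) + 0.3 * rng.standard_normal(m)   # synthetic responses

Kx = np.exp(-((X - X.T) ** 2) / (2 * h ** 2))            # base kernel k on X
K_rr = Kx ** 2 + Kx * np.outer(y, y)                     # k1 + k2: Schur products of PSD kernels

min_eig = np.linalg.eigvalsh(K_rr).min()
print(f"smallest eigenvalue of the k_RR Gram matrix: {min_eig:.3e} (nonnegative up to round-off)")
```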
Since f1, f2 are non-zero, we have ℓf1,f2 k > 0. Thus, the function h ℓf/ ℓf k RR H(k RR) is well-defined and satisfies h k RR = 1. Applying (174), we obtain |Pinh Qouth| k ,in+Y 2 max nout (2 + W k) on event EKT,δ. Multiplying both sides by ℓf k RR and noting that ℓf,g 2 k RR = a f1f2 2 \ H H + b f2 , 1 R 2 H R a2 f1 2 k f2 2 k + b2 f2 2 k (|a| f1 k f2 k + |b| f2 k)2, we have on event EKT,δ, Pinℓf1,f2 Qoutℓf1,f2 (|a| f1 k f2 k + |b| f2 k) k ,in+Y 2 max nout (2 + W k), which directly implies the bound (162) after applying the shorthand (102). Proof of (172) Define Ymax supy (Sin)y y. We have k ALG ,in = sup(x1,y1),(x2,y2) Sin k(x1, x2)2 + k(x1, x2) y1y2 + (y1y2)2 supx1,x2 (Sin)x k(x1, x2)2 + supx1,x2 (Sin)x k(x1, x2) supy1,y2 (Sin)y y1y2 + supy1,y2 (Sin)y(y1y2)2 = k 2 ,in + k ,in Y 2 max + Y 4 max k ,in + Y 2 max 2 . Proof of (173) Since H(k RR) is a direct sum, we have log N k RR(Sin, ϵ ) log N k1(Sin, ϵ /2) + log N k2(Sin, ϵ /2), (175) where N k1 and log N k2 are the covering numbers of B k1 f 2/ f 2 k1 : f H(k) and B k2 {f , y R/ f , y R k2 : f , y R H(k2)}, respectively. log N k1(Sin, ϵ ) 2 log Nk(Sin, ϵ /(2 k 1/2 )) 2 log Nk(Sin, (1 + Ymax k 1/2 ,in ) k 1/2 ,in+Ymax 2 log Nk(Sin, k 1/2 ,in+Ymax Define a kernel on R by k R(y1, y2) y1y2. When supy (Sin)y|y| Ymax, we have Nk R([ Ymax, Ymax], ϵ) = O(Y 2 max/ϵ) for ϵ > 0. Similarly, note that log N k2(Sin, ϵ ) log Nk(Sin, ϵ /( k 1/2 + k R 1/2 )) + log Nk R(Sin, ϵ /( k 1/2 ,in + k R 1/2 )) log Nk(Sin, k 1/2 ,in+Ymax nout ) + log Y 2 max( k 1/2 +Ymax) nout Substituting the above two log-covering number expressions into (175) yields log Nk ALG(Sin, ϵ ) 3 log Nk(Sin, k 1/2 ,in+Ymax nout ) + log Y 2 max( k 1/2 ,in+Ymax) nout c log Nk(Sin, k 1/2 ,in+Ymax for some universal positive constant c. G Non-compact kernels satisfying Assum. 2 The boundedness, Lipschitz assumption, square-integrability, and (15) follow from [10, App. O] and [10, Rmk. 8]. Thus, it only remains to verify (16) and (17) for each kernel. G.1 Gaussian For the kernel by k(x1, x2) exp( x1 x2 2 2h2 ), we have κ(u) exp( u2/2). Verifying (16). For any j 1 and x X, we have R 2 )h,(2j+ 1 2 )h] k(x, x z)dz R 2 )h k(x, x z)dz = 2πh2 d/2PX N(0,h2Id) X 2 (2j 1 1 = 2πh2 d/2PX N(0,Id) X 2 2 (2j 1 1 = 2πh2 d/2PX N(0,Id) X 2 2 d (2j 1 1 2)2 d . (176) By [15, Lem. 1], we have PX N(0,Id) h X 2 2 d > 2 dt + 2t i e t for any t 0. (177) Define t 2d and R 2 t. We can directly verify that R2 d 2 dt + 2t. Thus, we may further upper bound (176) by PX N(0,Id) X 2 2 d (2j 1 1 2)2 d PX N(0,Id) h X 2 2 d 2 (177) e R2/4 e (2j 2 1 whenever R 2 2d, or equivalently, j log2(1 + 4 2d). Under this regime, we have 2 )h,(2j+ 1 2 )h] k(x, x z)dz 2jβ(2πh2)d/2e (2j 1 1 for some positive constant C that does not depend on j or n as desired. Verifying (17). Fixing j 1, we have 2)d2jβκ(2j 1 1) 2j(d+β)+d exp( 1 2(2j 1 1)2) exp (log 2)(j(d + β) + d) (2j 1 1)2) exp maxj [T ] (log 2)(j(d + β) + d) (2j 1 1)2) Note that (log 2)(j(d + β) + d) (2j 1 1)2 is concave over j > 1 and maximized at j = log(1 + p 1 + 2(β + d)) . Treating R 1 2 0 κ(u)ud 1du as a constant, there exists c2 such that (17) is satisfied for all positive integers j. We consider the kernel k(x1, x2) cν d 2 ( x1 x2 2 2 ( x1 x2 2 h ) with ν > d, where Ka denotes the modified Bessel function of the third kind [30, Def. 5.10], and cb 21 b Γ(b) . We have k(x1, x2) = κ( x1 x2 2 2 ( x1 x2 2 eκb(u) cbub Kb(u). Verifying (16). Fix j {0, 1, . . . , T}. 
Applying Jensen s inequality, we have 2 )h,(2j+ 1 2 )h] k(x, x z)dz supx X R 2 )h,(2j+ 1 2 )h] k2(x, x z)dx 1 2 )h k2(x, x z)dx 1 = τk((2j 1 1 where τk( ) is the kernel tail decay defined by [10, Assum. 3]. We may use the same logic as in [10, App. O.3.6] but ignoring the Aν,γ,d term and making the substitutions, a ν d h, and Γ(ν 1) Γ(2ν 1) to obtain τ 2 k(R) hd 2π 22 2ν d 2 ) π d 2 Γ( d 2 +1) Γ(2ν 1) exp( R 2h) for R 2ν d Hence, we have 2jβ supx X R 2 )h,(2j+ 1 2 )h] k(x, x z)dz 2jβCh,d,ν exp( 1 2)) = O(1), whenever (2j 1 1 2 h j 1 + log( 2ν d 2) for some constant Ch,d,ν that does not depend on j. Verifying (17). Following similar logic as in [10, App. O.3.1], we have eκa(u) min 1, 2πcaua 1 exp u 2 for u 2(a 1). Fixing j 1, we have (2j + 1)d2jβκ(2j 1 1) 2j(d+β)+d min n 1, 2 (2j 1 1)ν d 2(2j 1 1)) o . Note that when ν d 2 1 < 1, the RHS can be rewritten as Cβ,d(2j 1)d+β exp( 1 22j 1) for some constant Cβ,d that doesn t depend on j. When ν d 2 1 1, the RHS can be upper-bounded by C β,d(2j 1)j(ν+β+ d 2 1) exp( 1 22j 1) for some constant C β,d that doesn t depend on j. Observe that the function tbe t/2 attains its maximum at t = 2b. Hence, the RHS can be bounded by a constant that does not depend on n as desired. Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: Abstract and introduction give clear outlines on our contributions, and we present our contributions in the main text accordingly. We have also included pointers in the introduction that would link to the referred main text containing specific contributions. Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: In Sec. 6, we discuss limitations as well as future work to address these limitations. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. 
For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: We have clarified the assumptions and models required for all the theorems and corollaries provided in the main text and appendix. Also we provide a complete proof in the appendix for all the stated results. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: We describe details of the experiments in Sec. 5 and provide links to all code and data. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. 
releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:
(a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
(b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
(c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
(d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We provide a link to our GitHub repository containing all code in Sec. 5.
Guidelines:
- The answer NA means that the paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: We provide train-test splits and hyperparameters in the main paper.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: In all figures, we plot error bars representing the standard deviation across 100 trials. In all tables, we report the mean +/- standard error across 100 trials.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We indicate the computer resources used for running all experiments in Sec. 5.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: We have reviewed the NeurIPS Code of Ethics and our paper conforms with it.
Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: Our work reduces the computational costs of classical methods and is applied to standard datasets. Thus, it has no outsized societal impact.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: We do not release models or data as part of this paper.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: We include URLs and licenses for the baseline code and datasets used in Sec. 5.2.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: We release our code with documentation at https://github.com/ag2435/npr under a BSD 3-Clause license.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: We do not have any studies or results regarding crowdsourcing experiments or human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: We do not have any studies or results involving study participants.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.