Journal of Machine Learning Research 26 (2025) 1-30    Submitted 2/24; Published 3/25

On the Approximation of Kernel functions

Paul Dommel    paul.dommel@math.tu-chemnitz.de
Faculty of Mathematics, University of Technology Chemnitz, 09126 Chemnitz, Germany

Alois Pichler    alois.pichler@math.tu-chemnitz.de
Faculty of Mathematics, University of Technology Chemnitz, 09126 Chemnitz, Germany

Editor: Gabor Lugosi

Abstract

Various methods in statistical learning build on kernels considered in reproducing kernel Hilbert spaces. In applications, the kernel is often selected based on characteristics of the problem and the data. This kernel is then employed to infer response variables at points where no explanatory data were observed. The data considered here are located in compact sets in higher dimensions, and the paper addresses approximations of the kernel itself. The new approach considers Taylor series approximations of radial kernel functions. For the Gauss kernel on the unit cube, the paper establishes an upper bound on the associated eigenfunctions, which grows only polynomially with respect to the index. The novel approach substantiates smaller regularization parameters than considered in the literature, overall leading to better approximations. This improvement confirms low rank approximation methods such as the Nyström method.

Keywords: statistical learning, kernel methods, reproducing kernel Hilbert spaces, Nyström method

©2025 Paul Dommel and Alois Pichler. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-0270.html.

1. Introduction

This paper contributes to statistical methods building on reproducing kernel Hilbert spaces. These methods have become popular in statistical learning, in inference and in support vector machines due to the kernel trick. They constitute powerful tools in different scientific areas such as geostatistics (cf. Honarkhah and Caers 2010), stochastic optimization (cf. Dommel and Pichler 2023, Park et al. 2022), digit recognition (cf. Schölkopf 1997), computer vision (cf. Zhang et al. 2007) and bioinformatics (cf. Schölkopf et al. 2004).

The approach presented here approximates the kernel function by elements in the range of the associated Hilbert-Schmidt integral operator. We choose these elements so that the Taylor series expansion of the approximation matches the initial coefficients of the kernel. The method applies to general radial kernels with variable bandwidth. Special emphasis is given to the Gaussian kernel, which is of major importance in practical applications.

Fundamental in statistical approximation is the regularization parameter. Standard results suggest regularization parameters decreasing as O(1/n), where n is the sample size. This paper, in contrast, justifies significantly smaller regularization parameters, often close to machine precision. This leads to enhanced approximation quality.

An additional consequence of our approach, not sufficiently addressed in the literature yet, concerns the magnitudes of the eigenfunctions of the related Hilbert-Schmidt operator. We demonstrate that this magnitude grows only polynomially in the index. Exploiting this crucial property, we present an interpolation inequality, which allows bounding the uniform error by the much weaker L2-error.
The approximation of the kernel function builds on the function $w^x_m(\cdot)$, smallest in $L^2([0,1])$-norm, satisfying the moment constraints

    $\int_0^1 z^\ell\, w^x_m(z)\, dz = x^\ell$,   $\ell = 0, \dots, m-1$.    (1.1)

We provide the explicit solution, which is a polynomial with coefficients involving the Hilbert matrix.

The results have consequences for low rank kernel methods. For these methods, they ensure a stable approximation quality by building only on a few supporting points. Prominent examples of these methods include the Nyström algorithm and kernel principal component analysis.

Related literature. Our results address the approximation of randomly located kernel functions. This is of particular importance for low rank kernel methods, which build on an approximation of the kernel matrix itself. The Nyström method, introduced in Williams and Seeger (2000), is a prominent example of this technique. Drineas and Mahoney (2005) analyze the error of the matrix approximation in the Nyström method and relate it to the best approximating matrix, while Bach (2013) studies the precision of predictions directly. The excellent work of Rudi et al. (2015) relates the rank of the approximating matrix to the approximation of kernel functions. The result is then employed to develop a low rank regression approach, which is significantly cheaper in computations than kernel ridge regression while maintaining stable prediction accuracy, cf. Rudi et al. (2017). Kernel principal component analysis builds on these results as well, cf. Sterge and Sriperumbudur (2021). The approximation of kernel functions is the core research question of this paper, for which we present new bounds.

The second main result of this work addresses the eigensystem of Mercer's decomposition associated with the Gaussian kernel. Shi et al. (2009) address this issue for an unbounded domain, building on the normal distribution as underlying design measure. The authors provide an explicit expression of the eigenvalues and eigenfunctions involving the Hermite polynomials on an unbounded domain. On compact domains, Diaconis et al. (2008) consider the eigensystem for the Laplacian kernel. The analysis of the divide-and-conquer approach relies on properties of the eigenfunctions as well, cf. Zhang et al. (2013), but that paper builds on unverified assumptions. The analysis of the regression error in different norms (cf. Fischer and Steinwart 2020 and Steinwart et al. 2009) can be related to bounded eigenfunctions as well. This paper presents explicit bounds on the maximum value the eigenfunctions may attain.

Outline of the paper. Section 2 addresses polynomial approximations of kernel functions. Section 3 addresses the Gaussian kernel specifically and presents bounds on the eigenfunctions on bounded domains. This section contains the main results, which are considered in Section 4 in applications. Section 5 concludes the paper.

2. The minimal moment function

Reproducing kernel Hilbert spaces (RKHS) build on a kernel function, denoted k. The approximations considered here build on the point evaluation function $k_x(\cdot) := k(x, \cdot)$. This section takes up the minimal moment function $w^x_m$ from (1.1), for which the image $L_k w^x_m$ is a suitable estimate of $k_x$. We first derive the function $w^x_m$ for the simple design space X = [0, 1], and then extend it to the d-dimensional case with some non-uniform design measure P.
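Before the formal derivation, the construction announced above can be sketched numerically: solve the Hilbert system $H_m \alpha^x = (1, x, \dots, x^{m-1})^\top$ with $(H_m)_{ij} = 1/(i+j-1)$ and set $w^x_m(z) = \sum_{i=1}^m \alpha^x_i z^{i-1}$, as established in Theorem 1 below. The following snippet is an illustration only (the order m and the point x are arbitrary choices; the code is not part of the paper); it verifies the moment constraints (1.1) and the identity $\|w^1_m\|^2_{L^2} = m^2$, derived in (2.3) below, by exact polynomial integration.

```python
# Illustrative sketch (not from the paper): construct the minimal moment
# function w_m^x on [0, 1] and verify its defining properties.
# The order m and the point x below are arbitrary example values.
import numpy as np
from numpy.polynomial import Polynomial

def minimal_moment_function(x, m):
    """Coefficients alpha solving H_m alpha = (1, x, ..., x^(m-1)),
    so that w_m^x(z) = sum_i alpha[i] * z**i (cf. Theorem 1 below)."""
    i = np.arange(1, m + 1)
    H = 1.0 / (i[:, None] + i[None, :] - 1.0)       # Hilbert matrix H_m
    return np.linalg.solve(H, x ** np.arange(m))

m, x = 6, 0.3
w = Polynomial(minimal_moment_function(x, m))       # w_m^x as a polynomial

# moment constraints (1.1): int_0^1 z^l w_m^x(z) dz = x^l for l = 0, ..., m-1
for ell in range(m):
    lhs = (w * Polynomial.basis(ell)).integ()(1.0)  # exact integral over [0, 1]
    print(f"l = {ell}:  integral = {lhs:10.6f}   x^l = {x**ell:10.6f}")

# squared L2([0,1])-norm of w_m^1; it equals m^2, cf. (2.3)
w1 = Polynomial(minimal_moment_function(1.0, m))
print(f"||w_m^1||^2 = {(w1 * w1).integ()(1.0):.6f}   m^2 = {m * m}")
```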
Using the moment property (1.1), we derive an error bound for the approximation quality of kx( ) by Lkwx m, which is based on the Taylor coefficients of the kernel. 2.1 The minimal moment function The central element of this paper is the function with smallest L2-norm, satisfying the moment properties (1.1). There are infinitely many functions fulfilling the condition (1.1). We refer to the function with smallest L2-norm as the minimal moment function, where the inner product is f, g L2 := R X f(z)g(z)p(z) dz with the density p of the underlying measure P. Throughout this section, we consider the design space X = [0, 1] equipped with the uniform measure P U[0, 1]. In what follows we provide an explicit representation of the minimal moment function. We demonstrate that it is a polynomial of degree m 1, with coefficients originating from a Hilbert matrix. Theorem 1 (Explicit minimal moment function). For x [0, 1] fixed, the optimization problem ϑ := min w 2 L2 : Z 1 0 zℓw(z)dz = xℓ, ℓ= 0, 1, . . . , m 1 (2.1) has the unique solution i=1 αx,izi 1, z (0, 1), (2.2) where αx satisfies the equations Hmαx = x for the Hilbert matrix Hm := 1 i+j 1 n,n i=1,j=1 and the vector x := (1, x, . . . , xm 1) Rm. Proof The Lagrangian L of (2.1) is L(w, µ) = w(z), w(z) L2 + i=1 µi zi 1, w(z) where w L2 and µ = (µ1, . . . , µm) Rm are Lagrange multipliers. The first order condition reads ( w L)(w µ, µ) = 2w µ + i=1 µizi 1 = 0, Dommel and Pichler which is equivalent to w µ(z) = 1 2 Pm i=1 µizi 1. For µ = (µ1, . . . , µm) fixed, the Lagrangian function L is convex and w µ thus a minimizer. Hence, the Lagrangian dual function is g(µ) := min w L2 L(w, µ) = L(w µ, µ) = w µ, w µ i=1 µi zi 1, w µ depending only on the multipliers µ. As zi 1, zj 1 L2 = 1 i+j 1, we further have that L2 xi 1 = 1 i=1 µ i 1 j + i 1 xj 1 = 0 by setting µ = 2H 1 m x. It follows that g(µ ) = w µ , w µ as w µ is feasible in (2.1). This implies strong duality as well as the optimality of w µ = wx m, which is the assertion. A minimal moment function of particular interest is the optimizer of (2.1) associated with the point x = 1. In contrast to the general case x [0, 1], the L2-norm of w1 m can be computed explicitly. Indeed, it holds that 0 w1 m(z)2dz = Z 1 j=1 α1,iα1,jzi+j 2dz j=1 α1,iα1,j 1 i + j 1 i=1 α1,i1 = α 1 e = e H 1 m e = m2, (2.3) as Pn i,j=1(H 1 m )i,j = m2. In what follows we bound the norm of the remaining moment functions. For that we relate wx m with a specific linear transformation of w1 m. Theorem 2 (Upper bound of the weight function). The weight function wx m satisfies the bound wx m 2 L2 m2 (2.4) for any x [0, 1]. Proof Note first that, for x = 0, Z 1 0 w0 m(z)2dz = e 1 H 1 m e1 = H 1 m where e1 = (1, 0, . . . .0) is the first vector in the canonical basis. To verify the bound for the remaining points x (0, 1) we relate wx m with an auxiliary function wx m for which we On the Approximation of Kernel functions are able to calculate the norm more easily. Setting gx(z) := w1 m( z x)1[0,x](z) we define the auxiliary function ( gx(z) if z x, g1 x(1 z) if z > x, with x (0, 1). To relate wx m with wx m, we demonstrate first that wx m satisfies the moment constraints (1.1) and bound its norm afterwards. It holds that 0 zhgx(z)dz = Z x 0 zh w1 m z 0 (yx)hw1 m(y)dy = xxh (2.5) for the first part of the integral. We further have that 0 zhg1 x(1 z)dz = Z 0 1 x (1 y)hg1 x(y)dz = Z 1 x 0 (1 y)hg1 x(y)dy after changing the variables. 
By the binomial theorem, 0 (1 y)hg1 x(y)dy = ( 1)h p Z 1 x 0 yh pg1 x(y)dy ( 1)h p(1 x)(1 x)h p = (1 x)(1 (1 x))h = (1 x)xh, as R 1 x 0 yh pg1 x(y)dy = (1 x)(1 x)h p by (2.5). Connecting both identities, we have that Z 1 0 zh wx m(z)dz = Z x 0 zhgx(z)dz + Z 1 x 0 (1 y)hg1 x(y)dy = xh, and thus the moment property (1.1) of wx m. It is now evident by (2.1) that wx m L2 is an upper bound of wx m L2 . Employing the same substitutions as above, we get that 0 wx m(z)2dz = Z x 0 gx(z)2dz + Z 1 x g1 x(1 z)2dz 0 w1 m(y)2dy + (1 x) Z 1 0 w1 m(y)2dz 0 w1 m(y)2dy = m2 by (2.3), concluding the proof. The squared norm of the moments functions at the boundary points is m2. However, the norm of the minimal moment functions associated with the interior points x (0, 1) might be significantly smaller. Dommel and Pichler 2.2 Extensions and error estimates The results of the preceding Section 2.1 crucially rely on the proposed setting, i.e., the design space [0, 1] equipped with the uniform distribution. These assumptions are quite restrictive and need to be relaxed for situations of practical application. To this end we investigate a more general setting throughout this section. We consider the multivariate case where X = [0, 1]d, with some underlying design measure P. This measure has a strictly positive density with > C > sup z [0,1]d p(z) inf z [0,1]d p(z) > c > 0, giving rise to the inner product f, g L2 = Z [0,1]d f(z)g(z)p(z)dz. In what follows we specify the minimal moment functions for this more general setting. Building on the univariate moment property of the functions (2.2), we demonstrate that their product satisfies a multivariate version of (1.1). The next proposition reveals the precise statement. Proposition 1 (Upper bound of the weight function in higher dimensions). Let x = (x1, . . . , xd) [0, 1]d and consider the function W x m(z1, . . . , zd) := i=1 wxi m(zi) p(z1, . . . , zn) 1, (2.6) where wxi m are the functions defined in (2.2). The function W x m satisfies the general moment property Z y z 2 2 ℓ W x m(z)p(z)dz = y x 2 2 ℓ (2.7) for all integers ℓ m 2 . Its norm is bounded by W x m 2 L2 cpm2d with cp = supz [0,1]d p(z) 1. Proof The moment property (2.7) follows from y z 2 2 ℓ W x m(z)p(z)dz = Z 1 i=1 (yi zi)2 !ℓ d Y i=1 wxi m(zi)dz1 . . . dzd ℓ h1, . . . , hd 0 (yi zi)2hiwxi m(zi)dzi ℓ h1, . . . , hd i=1 (yi xi)2hi = y x 2 2 ℓ , as R 1 0 zℓi i wxi m(zi) = xℓi i holds for all integers ℓi m 1. This is the first assertion. On the Approximation of Kernel functions For the second assertion note that wxi m 2 L2 m2 holds by (2.4). Hence, we get that Z [0,1]d(W x m(z))2p(z)dz = Z 1 i=1 wxi m(zi) !2 p(z1, . . . , zn) 1dz1 . . . dzd sup z [0,1]d 0 (wxi m(zi))2dzi = cpm2d, which concludes the proof. Remark 1. The function W x m( ) might not be the norm minimal function satisfying (2.7). Indeed, the product structure of (2.6) leads to an exponentially (with respect to the dimension) increasing norm of W x m. However, the construction of the norm minimal moment function satisfying (2.7) might require a significantly more involved representation, which is out of the scope of this paper. Continuing the general setting considered above, we now provide the first error estimate. To this end we consider a radial kernel k(x, y) = φ( x y 2) (2.8) as well as the corresponding integral operator Lk : L2(X, P) L2(X, P) defined as (Lkf)(y) = Z X k(z, y)f(z)p(z)dz. (2.9) Here, φ is a smooth function with Taylor series expansion ℓ! xℓ. 
(2.10) Building on the moment property (2.7), we utilize that the ℓth moment of kx Lk W x m vanishes. The subsequent theorem reveals the precise bound. Theorem 3 (Uniform bound in d-dimensions). With the function W x m defined in (2.6), the error estimate sup x [0,1]d (Lk W x m)(y) kx (1 + c 1/2 p md) ℓ! dℓ, (2.11) holds true. Here, denotes the floor function. Proof Employing the series representation (2.10) of φ, we have the decomposition (Lk W x m)(y) = Z X φ y z 2 2 W x m(z)p(z)dz ℓ! ( y z 2 2)ℓW x m(z)p(z)dz ℓ! ( y z 2 2)ℓW x m(z)p(z)dz. Dommel and Pichler For the first part, it follows from the moment property (2.7) of W x m that ℓ! ( y z 2 2)ℓW x m(z)p(z)dz = m 1 y x 2 2 ℓ , as the 2 m 1 2 m 1. Hence, we have that |k(y, x) (Lk W x m)(y)| k(y, x) m 1 ℓ! ( y x 2 2)ℓ+ Z ℓ! ( y z 2 2)ℓW x m(z)p(z)dz ℓ! ( y x 2 2)ℓ+ Z ℓ! ( y z 2 2)2ℓW x m(z)p(z)dz ℓ! dℓ W x m L2 ℓ! dℓ+ c 1/2 p md X The assertion follows, as the result holds for each x [0, 1]d. The statement of Theorem 3 above applies for general translation invariant kernels of the shape k(x, y) = φ( x y 2). To further specify the associated error bound, one needs to include knowledge about the decay of the Taylor coefficients of φ. To this end we now consider a fixed kernel for which these coefficients and their behavior are known. 3. The Gaussian kernel The following results build on reproducing kernel Hilbert spaces (RKHS). For these spaces, point evaluations are continuous linear functionals, and this property is the decisive characteristic of RKHS. Each kernel function addressed above is associated with the corresponding space Hk, | k , for which f| k(x, ) k = f(x) whenever f Hk. For a detailed review of these spaces we refer to Berlinet and Thomas-Agnan (2004) or Steinwart and Christmann (2008). This section addresses the most popular kernel in machine learning, the Gaussian kernel k(x, y) := e σ x y 2 2 = φ σ x y 2 2 , (3.1) where σ > 0 is a width parameter. We approximate the point evaluation function in the range of the Hilbert Schmidt operator and analyze its error with respect to different norms. Building on these estimates, we derive essential properties of the problem inf w Hk λ w 2 L2 + Lkw kx 2 k, (3.2) which will be used in what follows. We employ these results to establish polynomial bounds on the magnitude of the eigenfunctions of the Gaussian kernel. On the Approximation of Kernel functions 3.1 Approximation of the point evaluation function In this section we investigate the approximation of the point evaluation function kx for the Gaussian kernel. We relate the function kx with an approximation from the image of the corresponding integral operator Lk and provide bounds with respect to the infinity norm as well as the norm of the RKHS. The following theorem provides the result for the uniform norm first. Theorem 4 (Uniform approximation of kx in ). Let k be the d-dimensional Gaussian kernel with width parameter σ. Setting cσ := max {1, 2eσd}, cp = supz [0,1]d p(z) 1 and C(σ, m) := 1 1 σed the uniform bound sup x [0,1]d Lk W x m kx (1 + c 1/2 p md) C(σ, m) jm holds for W x m defined in (2.6) for m > cσ + 1. Specifically, for m(s) := 3cσs + 2, we have that sup x [0,1]d Lk W x m(td) kx 3(t d) 3t d (3.3) whenever t max nln(3)+(d 1) ln(2)+ 1 2 ln(cp)+d ln(3cσd) 2d ln(3) , 1 o . Proof We defer the proof to Appendix A. The uniform bound (3.3) relates Lk W x m and kx for all points x [0, 1]d. However, the uniform norm is slightly too weak when studying the objective (3.2), which relies on the approximation in RKHS norm. 
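Before turning to the RKHS norm, the uniform approximation just established can be illustrated numerically. The sketch below is an illustration only (dimension d = 1, uniform design measure; the bandwidth, the point x and the grids are our choices, not taken from the paper): it computes $(L_k w^x_m)(y) = \int_0^1 k(z, y)\, w^x_m(z)\, dz$ by quadrature for the Gaussian kernel and reports the uniform distance to $k_x$ as the order m grows.

```python
# Illustrative sketch (not from the paper): for the Gaussian kernel in dimension
# d = 1 with the uniform design measure, compare (L_k w_m^x)(y), computed by
# midpoint quadrature, with the point evaluation k_x(y) = exp(-sigma*(y-x)^2).
# The bandwidth sigma, the point x and the grids are arbitrary example values.
import numpy as np
from numpy.polynomial import Polynomial

def minimal_moment_coefficients(x, m):
    i = np.arange(1, m + 1)
    H = 1.0 / (i[:, None] + i[None, :] - 1.0)           # Hilbert matrix H_m
    return np.linalg.solve(H, x ** np.arange(m))

sigma, x = 1.0, 0.4
N = 4000
z = (np.arange(N) + 0.5) / N                            # midpoint quadrature nodes
y = np.linspace(0.0, 1.0, 401)                          # evaluation points
k_x = np.exp(-sigma * (y - x) ** 2)                     # target function k_x

for m in (2, 4, 6, 8):
    w = Polynomial(minimal_moment_coefficients(x, m))(z)
    K = np.exp(-sigma * (y[:, None] - z[None, :]) ** 2)  # k(y, z)
    Lkw = (K * w[None, :]).mean(axis=1)                  # midpoint rule for (L_k w_m^x)(y)
    err = np.max(np.abs(Lkw - k_x))
    print(f"m = {m}:  sup_y |(L_k w_m^x)(y) - k_x(y)| = {err:.2e}")
```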
To overcome this issue we extend the result obtained and establish a dedicated bound, similar to (3.3), but with respect to the stronger norm k. The next proposition reveals the desired bound. Proposition 2 (Uniform approximation of kx in the norm k). Let m = m(s) = 3cσs + 2 and t c := max{c0, c1, c2, 1}, where c0 = ln(3) + (d 1) ln(2) + 1 2 ln(cp) + d ln(3cσd) 2d ln(3) . c1 = (2d 1) ln(2) + d ln(3cσ) + 1 2 ln(cp) d + 1, c2 = (2d 1) ln(2) + 1 2 ln(cp) d . In the setting of Theorem (4), the uniform bound in k-norm, sup x [0,1]d Lk W x m(td) kx 2 k 9(td) 2td, (3.4) holds true. Dommel and Pichler Proof Let x [0, 1]d and observe from (3.3) and the reverse triangle inequality that (Lk W x m)(x) k(x, x) Lk W x m kx k(x, x) 3(td) 3td, (3.5) whenever m and t are chosen appropriately (see Theorem 4). Hence, setting m = 3cσtd + 2, it follows from (3.5) that Lk W x m kx 2 k = Lk W x m, Lk W x m k 2 Lk W x m, kx k + k(x, x) = Lk W x m, Lk W x m k 2 (Lk W x m) (x) + k(x, x) Lk W x m, W x m L2 (Lk W x m) (x) + 3(td) 3td, as Lk W x m, kx k = (Lk W x m) (x) holds by the reproducing property. Furthermore, we have that Lk W x m, W x m L2 = Lk W x m kx, W x m L2 + kx, W x m L2 = Lk W x m kx, W x m L2 + (Lk W x m) (x). Employing the bound Lk W x m kx, W x m L2 6(dt) 2dt from Lemma A.7 in the appendix and combining the estimates above, we get Lkwx m kx 2 k 6(td) 2td + 3(td) 3td 9(td) 2td, and thus the assertion. Remark 2. The bound (3.4) depends only on the magnitude of the product td. Substituting s = td gives the more convenient bound sup x [0,1]d Lk W x m kx 2 k 9s 2s, (3.6) where m = m(s) = 3cσs + 2 and s d c. 3.2 Implications for the weight function Building on the bound (3.4) of Section 3.1, we now examine the optimal weight function wx λ solving the optimization problem inf w L2 λ w 2 L2 + Lkw kx 2 k , cf. (3.2). This optimal solution wx λ , and its norm, determine the approximation quality of the point evaluation kx in the range of Lk. Moreover, they relate the continuous operator Lk with discrete versions, which we address and discuss in the following sections. By Mercer s theorem, the kernel k enjoys the representation k(x, y) = P ℓ=1 µℓϕℓ(x)ϕℓ(y), where µℓare the eigenvalues (in non-increasing order) and ϕℓthe eigenfunctions, ℓ= 1, 2, . . . , of the operator Lk introduced in (2.9). The explicit representation wx λ(y) = (λ + Lk) 1Lkkx (y) = µℓ λ + µℓ ϕℓ(x)ϕℓ(y) On the Approximation of Kernel functions of the optimal solution involves the regularization parameter λ and the components of Mercer s decomposition of the kernel k (cf. Cucker and Zhou 2007, Proposition 8.6). The next theorem provides a bound on the magnitude of the worst case norm, more specifically, a bound for supx [0,1]d wx λ L2. Theorem 5 (Uniform bound of the weight function). Let k be the Gaussian kernel. The weight function wx λ satisfies the bound sup x [0,1]d wx λ 2 L2 9cp (3cσs(λ) + 2)2d + 1 (3.7) with s(λ) := max 1 9 , d c, e and cσ, c from Proposition 2. Proof The function wx λ minimizes (3.2) and thus sup x [0,1]d λ wx λ 2 L2 sup x [0,1]d λ W x m 2 L2 + Lk W x m kx 2 k sup x [0,1]d λcpm2d + Lk W x m kx 2 k , (3.8) as W x m 2 L2 cpm2d (see Theorem 1). Choosing s(λ) := max 1 9 , d c, e and m = 3cσ s(λ) + 2 we derive from (3.6) that Lk W x m kx 2 k 9s 2s λ (as s e and s 1 9 ), and thus sup x [0,1]d λ wx λ 2 L2 sup x [0,1]d λcpm2d + Lk W x m kx 2 k λcp(3cσs(λ) + 2)2d + λ by (3.8). Multiplying with λ 1 gives the assertion. The bound (3.7) characterizes the asymptotic growth of the norm wx λ L2. 
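This slow growth is visible in a simple numerical experiment. The sketch below is an illustration only (the Nyström-type discretization, the grid size and the bandwidth are our choices, not part of the paper): it approximates Mercer's eigensystem of $L_k$ on [0, 1] with the uniform measure and evaluates $\|w^x_\lambda\|^2_{L^2} = \sum_\ell \big(\mu_\ell/(\lambda+\mu_\ell)\big)^2 \varphi_\ell(x)^2$, which follows from the representation of $w^x_\lambda$ above, on a grid of points x.

```python
# Illustrative sketch (not from the paper): approximate the Mercer eigensystem
# of L_k for the Gaussian kernel on [0, 1] (uniform measure) by a Nystrom-type
# discretization and evaluate ||w_lambda^x||^2 on a grid of points x.
# Grid size and bandwidth are arbitrary example values.
import numpy as np

n, sigma = 800, 2.0
t = (np.arange(n) + 0.5) / n                        # quadrature nodes in [0, 1]
K = np.exp(-sigma * (t[:, None] - t[None, :]) ** 2)

mu, U = np.linalg.eigh(K / n)                       # (K/n) v = mu v approximates L_k
mu, U = mu[::-1], U[:, ::-1]                        # eigenvalues in non-increasing order
mu = np.clip(mu, 0.0, None)                         # clip tiny negative rounding noise
phi = np.sqrt(n) * U                                # phi[:, l] ~ eigenfunction phi_l with unit L2 norm

for lam in (1e-2, 1e-4, 1e-6, 1e-8, 1e-10):
    coeff = (mu / (lam + mu)) ** 2
    norm_sq = phi ** 2 @ coeff                      # ||w_lambda^x||^2 for every grid point x
    print(f"lambda = {lam:.0e}:  sup_x ||w_lambda^x||_L2 ~ {np.sqrt(norm_sq.max()):.3f}")
```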
Indeed, letting λ 0, the bound (3.7) implies that wx λ L2 grows at most as ln(λ 1) d. The subsequent sections heavily exploit this asymptotic behavior. 3.3 Eigenvalues and eigenfunctions This section studies the elements in Mercer s decomposition of the kernel, k(x, y) = P ℓ=1 µℓϕℓ(x)ϕℓ(y), specifically for the Gaussian kernel. We first relate the maximal value of any eigenfunction with its associated eigenvalue, demonstrating that ϕℓ is bounded by ln(µ 1 ℓ) d. This bound is significantly sharper than the standard bound maxx X ϕℓ(x) k(x, x) 1/2µ 1/2 ℓ derived from the Cauchy Schwartz inequality, ϕ, kx k ϕ k k(x, x). We next describe the decay of the eigenvalues µℓand derive a bound on ϕℓ , which turns out to be quadratic in ℓ. The results of this section enable us to infer convergence in uniform norm from convergence in L2. Generally, this approach leads to faster convergence rates compared to results, which are derived from convergence in the norm k (see Section 4.2). Our approach builds on the following observation. For the regularization λ = µℓ, the series wx λ 2 L2 = Dommel and Pichler involves the term 1 4ϕ2 ℓ(x). To bound the maximum of |ϕℓ|, it is sufficient to assess supx [0,1]d wx λ 2 L2, which is bounded by the inequality (3.7). The following result summarizes these relations. Theorem 6. For the eigenfunctions of the Gaussian kernel, the inequality sup x [0,1]d |ϕℓ(x)| 6 cp (3cσs(µℓ) + 2)d + 2 (3.9) holds for every ℓ N, where s(µℓ) = max 1 9 , e, d c is as in Theorem 5. Proof For ℓ N, it holds that 1 4 |ϕℓ(x)|2 = µℓ µℓ+ µℓ 2 ϕh(x)2 = wx λ 2 L2 , with λ = µℓ. Taking the maximum on both sides, we deduce from the latter inequality and (3.7) that 1 4 ϕℓ 2 sup x [0,1]d wx λ 2 L2 9cp (3cσs(λ ) + 2)2d + 1 = 9cp (3cσs(µℓ) + 2)2d + 1. Reformulating this inequality as well as using the subadditivity of the square root gives the assertion. The bound (3.9) relates the eigenfunctions and the eigenvalues of the operator Lk. In what follows, we analyze the decay of (µℓ) ℓ=1 to get a more concrete characterization of the bound in (3.9). The next lemma provides a lower bound of the eigenvalues (µℓ) ℓ=1. Lemma 1 (Maximal decay of eigenvalues). Set pmin := infx X p(x) and pmax := supx X p(x). For the eigenvalues of the Gaussian kernel it holds that µℓ p2 min p2max C(d, σ)e cσ,d(ℓ+d) 2 d , ℓ= 1, 2, . . . , (3.10) where the constants cd,σ and C(d, σ) depend on the dimension d and the bandwidth σ. Proof See Appendix B. The subsequent theorem combines the inequalities (3.9) and (3.10) and provides an explicit bound on the maximal absolute value of the eigenfunctions (ϕℓ) ℓ=1. Theorem 7 (Eigenfunctions). The eigenfunctions of the d-dimensional Gaussian kernel satisfy the bound max x [0,1]d |ϕℓ(x)| 6 cp max n h(ℓ), (3cσe + 2)d , (3cσ(d c) + 2)do + 2 (3.11) h(ℓ) := 3cσ 2 ln p2 min p2max C(d, σ) + 1 2cd,σ(ℓ+ d) 2 d and the constants from the preceding Lemma 1. On the Approximation of Kernel functions Proof For the assertion note first that s(λ) = max 1 , e, d c = max 1 + , e, d c , where x+ := max {x, 0} is the positive part of x. Hence, employing (3.10) in (3.9), we have the bound |ϕℓ(x)| (3.12) 6 cp (3cσs(µℓ) + 2)d + 2 p2 min C(d, σ)e cσ,d(ℓ+d) 2 d 2 ln p2 min p2max C(d, σ) + 1 2cd,σ(ℓ+ d) 2 d + , e, d c + 2 d Note now that x 7 xd is an increasing function on [0, ) and thus max {a, b}d = max{ad, bd}, provided that a, b 0. Using this property in (3.13) provides the assertion (3.11). Remark 3. The right-hand side of (3.11) deserves additional attention. 
While h(ℓ) grows with ℓincreasing, the other terms do not depend on ℓ. For sufficiently large ℓ, the inequality (3.11) thus reads max x [0,1]d |ϕℓ(x)| 6c 1/2 p h(ℓ) + 2. The function h itself grows at most as a polynomial with degree 2. That is, there exists a constant b > 0 with quadratic bound max x [0,1]d |ϕℓ(x)| b ℓ2, ℓ= 1, 2, . . . . (3.14) To the best of our knowledge, this is the first non-exponential bound of ϕ ℓs absolute maximum. The following section addresses various consequences of this main result. Remark 4. The approach chosen in this Section 3 is not limited to the Gaussian kernel. All steps extend to general kernels of the form φ( x y 2) (cf. (2.8)), provided that a lower bound on the decay of its eigenvalues is available. 4. Main results This section connects the results of the preceding sections for general statistical learning settings. We present improved concentration bound inequalities, an interpolation inequality relating to uniform convergence, and the Nyström method. Dommel and Pichler We adopt the well-established notation (cf. Rudi et al. 2015) and introduce N (λ) := sup x [0,1]d wx λ 2 L2 + λ 1 Lkwx λ kx 2 k . (4.1) As established above in (3.8), it holds that N (λ) = O ln(λ 1)2d for λ 0. Building on the results of Section 3, we provide the results for the Gaussian kernel. 4.1 Concentration bounds Standard kernel ridge regression minimizes the regularized squared error at the sample points. To infer the related error at other locations, one needs to connect the discrete setting with the continuous setting from Section 3. This is commonly done by relating the discrete operator LD k : C(X) Hk with (LD k f)(y) := 1 i=1 f(xi)k(y, xi) with its continuous version Lk, as demonstrated in Caponnetto and De Vito (2006) as well as in Fischer and Steinwart (2020). However, the underlying concentration results in these references generally require regularization sequences not faster than λn = O(1/n), which is too restrictive for many tasks including the analysis of low rank kernel methods (cf. Zhang et al. 2013, Bach 2013). This section addresses this issue. We provide the following result, which allows significantly higher flexibility in choosing the regularization sequence λn. Specifically, the regularizing sequence may be chosen to decay faster than any polynomial. The following proposition collects the detailed result for the concentration inequality. Proposition 3. Assume that λ satisfies 4 3τ g(λ)N (λ) 2τ g(λ)N (λ) g(λ) = ln 2e(λ + µ1)N(λ) and τ > 0. Then the inequality (Lk + λ) 1/2(λ + LD k ) 1(Lk + λ) 1/2 Hk Hk 2 (4.3) holds with probability at least 1 2e τ. Proof For the proof we refer to Appendix C. The condition (4.2) deserves some additional commentary. The term N (λ) is asymptotically bounded by O ln(λ 1)2d , and the function g(λ) does not grow faster than O ln(λ 1) . The On the Approximation of Kernel functions condition (4.2) is asymptotically satisfied, provided that the regularization parameter λ does not decay faster than c exp n 1 2d+1 . This is a significant improvement compared to the common regularization choice n 1. The conclusion of Proposition 3 can be given more general, not necessarily involving the Gaussian kernel. Moreover, the bound (3.7) of N (λ) only relies on the decay of the Taylor coefficients associated with the kernel. The same method thus may be applied to derive concentration inequalities for any other radial kernel functions. 
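To illustrate the scales involved, the snippet below compares the regularization schedule $\exp(-n^{1/(2d+1)})$, which Proposition 3 admits asymptotically, with the common choice 1/n in dimension d = 1. The constant c and the constants entering condition (4.2) are omitted, so the numbers are indicative only and not taken from the paper.

```python
# Illustrative arithmetic (not from the paper): admissible regularization
# schedule exp(-n^(1/(2d+1))) versus the common choice 1/n, for d = 1.
# The multiplicative constant c from the text is omitted.
import numpy as np

d = 1
for n in (10**2, 10**4, 10**6):
    lam_admissible = np.exp(-float(n) ** (1.0 / (2 * d + 1)))
    lam_standard = 1.0 / n
    print(f"n = {n:>7d}:  exp(-n^(1/(2d+1))) = {lam_admissible:.2e}   1/n = {lam_standard:.2e}")
```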
4.2 Interpolation inequality This section establishes results on the uniform approximation quality for smooth functions. The following interpolation inequality ensures that the uniform norm of smooth functions is comparable its norm in L2, although this norm is much weaker in general. We measure smoothness with reference to the norm s introduced below. Theorem 8 (Interpolation of norm). For f = P ℓ=1 cℓϕℓ L2 with f 2 s := P ℓ=1 c2 ℓ ℓ 2s < it holds that f π s 2 , (4.4) where s > 3 and maxx X |ϕℓ(x)| b ℓ2 for all ℓ= 1, 2, . . . (cf. (3.14)). Proof Building on the bound maxx X |ϕℓ(x)| bℓ2 we have with the Cauchy Schwarz inequality that ℓ=1 cℓϕℓ(x) b where ζ( ) is the Riemann zeta function and p [1, 2s]. Employing Hölder s inequality, it follows for the first term that v u u u t p)( p p 1 ) ℓ Choosing p = s 3, we finally have that Dommel and Pichler by involving Euler s famous formula ζ(2) = π2 6 . This is the assertion. Convergence in L2 implies convergence in L : Specifically, consider a sequence of functions fn with slowly increasing norm fn s and limit f s < . Then, for fn f L2 0 sufficiently fast, it follows from Theorem 8 that fn f 0. Remark 5. Fischer and Steinwart (2020) study convergence in an interpolation space between Hk and L2, which the authors call power spaces. The norm considered in Fischer and Steinwart (2020), however, is stronger than the norm s considered above. 4.3 Nyström method The main result in Rudi et al. (2015, Theorem 1) relates the function N (λ) to the Nyström method. Their paper ensures that the Nyström method does not require more than O N (λ) ln λ 1 supporting points. The results established in Section 3 above provides explicit access to the function N (λ), so that their main result can be refined and enhanced to the following form. Theorem 9 (Cf. Rudi et al. 2015, Theorem 1). Let E(f) := RR X R f(x) y 2ρ(dx, dy) be the common error function. Given the assumptions of Theorem 1 in (Rudi et al., 2015), it holds that the Nyström-approximation ˆfλ,m with regression parameter λ satisfies E( ˆfλ,m) E(f H) q2n 2ν+1 2nu+γ+1 for at least m (67 5 9cp (3cσs(λ) + 2)2d + 1 ln λ 1 (4.5) supporting points. The reference provides the constant q and ν explicitly. Proof Invoking the bound (3.7) for N (λ), the result is immediate from Rudi et al. (2015, Theorem 1). The adapter result (4.5) gives an explicit selection criterion for the critical number of support points in the Nyström method. The novel approach of this paper considers an explicit approximation of the kernel function in the range of the associated integral operator. To this end we provide an explicit weight function by matching the initial Taylor coefficients of the kernel. The approach has numerous consequences in theory and in applications. We provide bounds for the eigenfunctions, which grow only quadratic in the enumeration index. An interpolation inequality provided relates convergence in uniform norm and the weaker convergence in L2 for smooth functions. The methods established justify smaller regression parameters for regression problems, which is of particular importance for low-rank approximation techniques. On the Approximation of Kernel functions Francis Bach. Sharp analysis of low-rank kernel matrix approximations. In Proceedings of the 26th Annual Conference on Learning Theory, volume 30 of Proceedings of Machine Learning Research, pages 185 209, 2013. doi:10.48550/ar Xiv.1208.2015. Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. 
Springer US, 2004. doi:10.1007/978-1-4419-9096-9.
A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331-368, 2006. doi:10.1007/s10208-006-0196-8.
Felipe Cucker and Ding Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 2007. doi:10.1017/CBO9780511618796.
Persi Diaconis, Sharad Goel, and Susan Holmes. Horseshoes in multidimensional scaling and local kernel methods. The Annals of Applied Statistics, 2(3):777-807, 2008. doi:10.1214/08-AOAS165.
Benedikt Diederichs and Armin Iske. Improved estimates for condition numbers of radial basis function interpolation matrices. Journal of Approximation Theory, 238:38-51, 2019. doi:10.1016/j.jat.2017.10.004.
Paul Dommel and Alois Pichler. Dynamic programming for data independent decision sets. Journal of Convex Analysis, 30(3):897-916, 2023. doi:10.48550/arXiv.2102.07464.
Petros Drineas and Michael Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153-2175, 2005.
William Feller. An Introduction to Probability Theory and Its Applications, volume 1. Wiley, January 1968. doi:10.1017/S0020268100037586.
Simon Fischer and Ingo Steinwart. Sobolev norm learning rates for regularized least-squares algorithms. Journal of Machine Learning Research, 21(1), 2020. doi:10.48550/arXiv.1702.07254.
Mehrdad Honarkhah and Jef Caers. Stochastic simulation of patterns using distance-based pattern modeling. Mathematical Geosciences, 42(5):487-517, 2010. doi:10.1007/s11004-010-9276-7.
Hyuk Park, Zhuangzhuang Jia, and Grani A. Hanasusanto. Data-driven stochastic dual dynamic programming: Performance guarantees and regularization schemes. 2022. URL https://optimization-online.org/?p=21376.
Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. doi:10.48550/arXiv.1507.04717.
Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. Falkon: An optimal large scale kernel method. In Advances in Neural Information Processing Systems, 2017. URL https://api.semanticscholar.org/CorpusID:25900554.
Bernhard Schölkopf. Support Vector Learning. 1997. URL https://pure.mpg.de/rest/items/item_1794215/component/file_3214422/content.
Bernhard Schölkopf, Koji Tsuda, and Jean-Philippe Vert. Kernel Methods in Computational Biology. The MIT Press, 2004. doi:10.7551/mitpress/4057.001.0001.
John Shawe-Taylor, Chris Williams, Nello Cristianini, and Jaz Kandola. On the Eigenspectrum of the Gram Matrix and Its Relationship to the Operator Eigenspectrum, pages 12-12. Springer Berlin Heidelberg, 2002. doi:10.1007/3-540-36182-0_3.
Tao Shi, Mikhail Belkin, and Bin Yu. Data spectroscopy: Eigenspaces of convolution operators and clustering. The Annals of Statistics, 37(6B):3960-3984, 2009. doi:10.1214/09-AOS700.
Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008. doi:10.1007/978-0-387-77242-4.
Ingo Steinwart, Don R. Hush, and Clint Scovel. Optimal rates for regularized least squares regression. In Proceedings of the 22nd Annual Conference on Learning Theory, pages 79-93, 2009.
Nicholas Sterge and Bharath K. Sriperumbudur. Statistical optimality and computational efficiency of Nyström kernel PCA.
Journal of Machine Learning Research, 23:337:1 337:32, 2021. doi:10.48550/ar Xiv.2105.08875. Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, volume 13. MIT Press, 2000. URL https://proceedings.neurips.cc/paper_files/paper/2000/file/ 19de10adbaa1b2ee13f77f679fa1483a-Paper.pdf. J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, 73(2):213 238, 2007. doi:10.1007/s11263-006-9794-4. Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression. In Proceedings of the 26th Annual Conference on Learning Theory, volume 30, pages 592 617, Princeton, NJ, USA, 2013. URL https://proceedings.mlr.press/v30/ Zhang13.html. On the Approximation of Kernel functions Appendix A. Gaussian approximation This section provides the proof of the formula (3.3), which is of fundamental importance for every other bound provided in Section 3. The proof of the bound (3.3) builds on the error estimate (2.11), which involves the Taylor remainder of the exponential series. To this end we utilize the formula X of the truncated geometric series as well as Stirling s approximation n < n!, (A.2) which relates the factorial with the exponential function. For future reference convenience of the reader, we restate Theorem 4. Theorem 10. Let k be the d-dimensional Gaussian kernel with width parameter σ. Setting cσ := max {1, 2eσd}, cp = supz [0,1]d p(z) 1 and C(σ, m) := 1 1 σed the uniform bound sup x [0,1]d Lk W x m kx (1 + c 1/2 p md) C(σ, m) jm holds for W x m defined in (2.6) for m > cσ + 1. Specifically, for m(s) := 3cσs + 2, we have that sup x [0,1]d Lk W x m(td) kx 3(t d) 3t d (A.3) whenever t max nln(3)+(d 1) ln(2)+ 1 2 ln(cp)+d ln(3cσd) 2d ln(3) , 1 o . Proof Employing the inequality (2.11) involving the Taylor coefficients, we have that sup x [0,1]d Lk W x m kx (1 + c 1/2 p md) ℓ! dℓ (1 + c 1/2 p md) ℓ! dℓ, (A.4) 2 + 1. Further, invoking Stirling s approximation (A.2) for the right-hand side of (A.4), it follows that (1 + c 1/2 p md) 2 σℓ(ℓ!) 1 dℓ (1 + c 1/2 p md) Dommel and Pichler Now assume that m satisfies the inequality 1 < m 2 (σed) 1, then we have from identity (A.1) of the truncated geometric series that sup x [0,1]d Lk W x m kx (1 + c 1/2 p md) (1 + c 1/2 p md) 1 1 σed = (1 + c 1/2 p md)C(σ, m) jm which is the first assertion. For the second assertion let m = m(td) = 6cσtd + 2 and observe for the constant C(σ, m) that C(σ, m) = 1 1 σed 2 1 1 σed 3(σed)td 1 1 1 3td 3 holds for every t 1. Furthermore, employing m(td), as well as C(σ, m) 3 2, and arguing as above, we get for (A.6) that (1 + c 1/2 p md)C(σ, m) jm 2(1 + c 1/2 p m(t)d) 3σed2t 2(1 + c 1/2 p (3cσtd + 2)d) (3td) 3dt 2 (3td) 3dt + 3 22d 1c 1/2 p (3cσtd)d (3td) 3dt + 3 222d 1 (3td) 3dt , where we utilize (a + b)d 2d 1(ad + bd) for the last inequality. The latter is bounded by 3 222d 1 (3td) 3dt (td) 3dt as well as the middle term by 2d 1c 1/2 p (3cσtd)d3 3td(td) 3td 2(eln(3)+(d 1) ln(2)+ln(c 1/2 p )+d ln(3cσtd) ln(3)3td)(td) 3td 2(eln(3)+(d 1) ln(2)+ln(c 1/2 p )+d ln(3cσd) ln(3)2td)(dt) 3td On the Approximation of Kernel functions whenever t ln(3)+(d 1) ln(2)+ln(c 1/2 p )+d ln(3 max{cσ,1}d) 2d ln(3) . 
Hence, choosing ( ln(3) + (d 1) ln(2) + ln(c 1/2 p ) + d ln(3 max {cσ, 1} d) 2d ln(3) , 1 we have that (1 + c 1/2 p md)C(σ, m) jm 2 (3dt) 3dt + 3 2(dt) 3dt 3(dt) 3dt, which is the assertion. Building on the bound (3.3) in the uniform norm, we establish a bound in the RKHS norm in Proposition 2. To this end the following technical Lemma is of crucial importance. Lemma 2. Given the assumptions of Theorem 4, the bound sup x [0,1]d Lk W x m kx, W x m L2 6(td) 2td (A.7) holds whenever t max {c0, c1, c2} with constants c1 = (2d 1) ln(2) + d ln(3cσ) + 1 2 ln(cp) d + 1 and c2 = (2d 1) ln(2) + 1 2 ln(cp) d . Proof Employing the Cauchy Schwarz inequality, we have from (3.3) that Lk W x m kx, W x m 2 Lk W x m kx L2 W x m L2 3(td) 3tdc 1/2 p md whenever m (and t) are chosen with respect to the constraints Theorem 4. Involving m = m(td) we further observe that 3(td) 3tdc 1/2 p (3cσtd + 2)d 2d 13(td) 3tdc 1/2 p (3cσtd)d + 22d 13(td) 3tdc 1/2 p . (A.8) For the second term in (A.8) it follows that 22d 13(td) 3tdc 1/2 p = 22d 1(td) tdc 1/2 p 3(td) 2td 3(td) 2td whenever t (2d 1) ln(2)+ 1 2 ln(cp) d := c2. Reformulating the first term in (A.8) to 2d 13(td) 3tdc 1/2 p (3cσtd)d = 3e td ln(td)+(2d 1) ln(2)+ 1 2 ln(cp)+d ln(3cσtd)(td) 2td we get for the exponent that td ln(td) + (d 1) ln(2) + 1 2 ln(cp) + d ln(3cσtd) = d(t 1) ln(td) + (d 1) ln(2) + 1 2 ln(cp) + d ln(3cσ) d(t 1) ln(2) + (d 1) ln(2) + 1 2 ln(cp) + d ln(3cσ). Dommel and Pichler t (2d 1) ln(2) + d ln(3cσ) + 1 2 ln(cp) d + 1 =: c1 it follows that 2d 13(td) 3tdc 1/2 p (3cσtd)d 3(td) 2td. Combining the estimates of the terms in (A.8), we have for t max {c1, c2} that Lk W x m kx, W x m 2 6(td) 2td and thus the assertion. Appendix B. Decay of eigenvalues This section provides a lower bound on the eigenvalues (µℓ) ℓ=1 of the operator Lk associated with the Gaussian kernel. We address the univariate case first, which is then extended to the multivariate case. As a starting point we consider the most elementary setting, i.e., X = [0, 1], equipped with the uniform measure P = U[0, 1]. The following lemma provides the precise eigenvalue bound. Lemma 3 (Maximal decay of eigenvalues). Let k be the Gaussian kernel, X = [0, 1] as well as P = U[0, 1]. For every ℓ N it holds that ℓC(σ)e aσ(ℓ 1)2, (B.1) where aσ = 8 4π2 16σ and C(σ) is a constant depending on the width parameter σ. Proof Let x1, . . . , xℓbe independent random variables following the uniform measure U[0, 1]. Let K = (k(xi, xj))ℓ i,j=1 be the Gramian matrix and invoking Shawe-Taylor et al. (2002, Proposition A), it follows that ℓE λmin(K), where λmin(K) is the smallest eigenvalue of the matrix K and the expectation is with respect to the samples. Further, employing the result of Diederichs and Iske (2019, Example 2.6), we get with M := mini,j ℓ,i =j |xi xj| and (D.4) in the auxiliary Lemma 9 below (Appendix D) that E λmin(K) E M 1 C(σ)e 4π2 16M2σ C(σ) E e 4π2 16M2σ C(σ)e aσ(ℓ 1)2, as M < 1, and where C(σ) = C C(σ) with C from as in (D.4). This is the assertion. Extending the univariate case to the multivariate setting builds on the product structure of the Gaussian kernel, that is, on Qd i=1 e σ(xi yi)2 = e σ Pn i=1(xi yi)2. Indeed, provided that the underlying measure is U[0, 1]d, the spectrum of the corresponding operator Lk is ( d Y i=1 µℓi : ℓ1, . . . , ℓd N On the Approximation of Kernel functions where µℓi is the ℓith eigenvalue in the univariate setting. This is immediate, as every eigenfunction in the multivariate case is a product of elementary eigenfunctions, Qd i=1 ϕℓi(xi). 
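The product structure of the spectrum can also be checked numerically. The sketch below is an illustration only (the Nyström-type discretization, grid size and bandwidth are our choices, not part of the paper): on a tensor grid, the discretized two-dimensional Gaussian kernel matrix is exactly the Kronecker product of the univariate kernel matrices, so its eigenvalues coincide with sorted products of univariate eigenvalues, mirroring the continuous statement above.

```python
# Illustrative sketch (not from the paper): on a tensor grid the discretized
# two-dimensional Gaussian kernel matrix is the Kronecker product of the
# univariate ones, so its eigenvalues are products of univariate eigenvalues.
# Grid size and bandwidth are arbitrary example values.
import numpy as np
from itertools import product

n, sigma = 40, 2.0
t = (np.arange(n) + 0.5) / n
K1 = np.exp(-sigma * (t[:, None] - t[None, :]) ** 2)
mu1 = np.sort(np.linalg.eigvalsh(K1 / n))[::-1]            # univariate eigenvalues

# all pairwise products of the leading univariate eigenvalues, sorted
mu_products = np.sort([a * b for a, b in product(mu1[:20], repeat=2)])[::-1]

# direct discretization of the two-dimensional kernel on the tensor grid
X, Y = np.meshgrid(t, t, indexing="ij")
pts = np.column_stack([X.ravel(), Y.ravel()])              # n^2 points in [0, 1]^2
dist_sq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
mu2 = np.sort(np.linalg.eigvalsh(np.exp(-sigma * dist_sq) / n**2))[::-1]

for l in range(8):
    print(f"l = {l}:  product of univariate eigenvalues = {mu_products[l]:.6f}"
          f"   2-d eigenvalue = {mu2[l]:.6f}")
```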
We may assume the multivariate eigenvalues µ(d) ℓ arranged in non-increasing order such that i=1 µℓi : ℓi N and µ(d) 1 µ(d) 2 . . . . The subsequent auxiliary and combinatorial lemmata utilize the structure of the spectrum to infer the maximal decay rate of the sequence µ(d) ℓ. The first is a general combinatorial result, with which we assess the eigenvalue decay in the second. Lemma 4. It holds that X i1+ +id n, ij N (d 1)d 1 d (B.3) Proof We proof the assertion by employing on the stars and bars formula i1+ +id=i 1 = i 1 d 1 where ij 1 are positive integers (cf. Feller 1968, p. 38) and i N. We have that i1+ +id n 1 = i1+ +id=i 1 = where we utilize that i 1 d 1 = (i 1)(i 2) (i d + 1) (d 1)(d 2) 1 = i 1 d 1 = i 1 Furthermore, employing nd in (B.4), it follows that d 1 d = 1 (d 1)d 1 i=1 (i 1)d 1 d = 1 (d 1)d 1 k=0 kd 1 d (n 1)d d(d 1)d 1 d and thus the assertion. Dommel and Pichler Lemma 5. Let µℓ e ρℓ2 for every ℓ N. Then the sequence µ(d) ℓ from (B.2) satisfies µ(d) ℓ C exp( cρ(ℓ+ d) 2 d ) (B.5) for all ℓ N and d 2 for some c > 0 and C > 0. Proof We first determine the number of combinations of µi1, . . . , µid for which the product Qd k=1 µik is larger than exp( ρ(λ + 2)2) for some threshold parameter λ > 0. It holds that Qd i=1 µℓi e ρ i2 1 ... ρ i2 d e ρ(i1+ +id)2 e ρ(λ+2)2, provided that i1 + + id λ + 2. From (B.3) we deduce that i1+ +id λ+2 1 λ + 1 d d(d 1)(d 1) d µ(d) ℓ e ρ(λ+2)2 for all ℓ λ+1 d d(d 1)(d 1) d. Choosing λ := d(d 1)d 1(ℓ+ d) 1/d and employing (a + b)2 2a2 + 2b2, we get that µ(d) ℓ e ρ (d(d 1)d 1(ℓ+d) 1/d +2 2 e ρ23e ρ2d2/d(d 1) 2(d 1) d (ℓ+d)2/d. The assertion follows by setting C := exp( 8ρ) and c := 2d 2/d(d 1) 2(d 1) d . The bound (B.5) already implies (3.10), provided that the underlying measure is uniform, i.e, the Lebesgue measure. The following lemma extends the assertion for more general probability measures. Lemma 6. Let X = [0, 1]d and consider the operators Lk : L2(X, λ) L2(X, λ) and Lp k : L2(X, P) L2(X, P) with (Lkf)(y) = Z X k(x, y)f(x)dx (LP k f)(y) = Z X k(x, y)f(x)p(x)dx. Here, λ is the Lebesgue (uniform) measure and P is a probability measure, satisfying the condition 0 < pmin := infx X p(x) pmax := supx X p(x) < . The eigenvalues (µ(d) ℓ)ℓof Lk ((µ(d) ℓ,p)ℓof Lp k, resp.), satisfy the relation µ(d) ℓ,p p2 min p2max µ(d) ℓ (B.6) for all ℓ N. On the Approximation of Kernel functions Proof By the Courant Fischer Weyl min-max principle it holds that µ(d) ℓ,p = max dim(Sk)=k min x Sk x L2(P )=1 = max dim(Sk)=k min x Sk x L2(λ) x L2(P) Lp k x x L2(λ) , x x L2(λ) L2(P) max dim(Sk)=k min x Sk x L2(λ)=1 p 2 max Lp kx, x It follows further that µ(d) ℓ,p max dim(Sk)=k min x Sk x L2(λ)=1 p 2 max Lp kx, x max dim(Sk)=k min x Sk x L2(λ)=1 p 2 maxp2 min Lkx, x L2(λ) = p2 min p2max µ(d) ℓ, as 0 p(x)p(y) p2 min. Hence, the assertion. We now combine the results of the preceding lemmata to bound the eigenvalues of the d-dimensional Gaussian kernel for general measures. Lemma 7 (Maximal decay of eigenvalues). For every ℓ N it holds that µ(d) ℓ p2 min p2max C(d, σ)e cσ,d(ℓ+d) 2 d , (B.7) where cd,σ and C(d, σ) are constants depending on the dimension d and the bandwidth σ. Proof We show the assertion only for the uniform measure U[0, 1]d, as the result for more general design measures follows immediately by (B.6). For the eigenvalues µℓin the univariate setting, we have from (B.1) that ℓC(σ)e aσ(ℓ 1)2 C(σ)e aσℓ2 (B.8) where aσ is chosen such that aσℓ2 ln(ℓ) + aσ(ℓ 1)2, ℓ= 1, 2, . . . . Combining (B.8) with (B.5) (cf. 
Lemma 5), it follows that µ(d) ℓ C(σ)d C exp( aσc(ℓ+ d) 2 d ) as C(σ) appears in every factor of the product Qd i=1 µℓi. Setting cσ,d := ac reveals the assertion. Dommel and Pichler Appendix C. Concentration This section provides a proof of the operator bound (4.3). To this end, we restate the following concentration bound for random operators on Hilbert spaces, which we then subsequently. Proposition 4 (See Fischer and Steinwart 2020, Theorem A.3). Let (Ω, B, P) be a probability space, H a separable Hilbert space, and ξ : Ω L2(H) be a random variable with values in the set of self-adjoint Hilbert-Schmidt operators. Furthermore, let the operator norm be uniformly, i.e., ξ H H B P-a.s. and V be a self-adjoint positive semi-definite trace class operator with EP ξ2 V , i.e. V EP ξ2 is positive semi-definite. Then, for g(V ) := ln(2e tr(V ) V 1 H H), τ 1 and n 1, the following concentration inequality is satisfied: i=1 ξi EP ξ H H 4τ B g(V ) 2τ V H H g(V ) 2e τ. (C.1) To demonstrate the desired bound (4.3), we show first that (Lk + λ) 1/2 Lk (Lk + λ) 1/2 and (Lk + λ) 1/2 LD k (Lk + λ) 1/2 are close in operator norm. To this end we rephrase the operator LD k in terms of simple operators. Letting x1, . . . , xn P, independently distributed with respect to the design measure, and defining the operators Tz : Hk Hk with (Tzf)(y) = f(z)k(y, z) (C.2) gives the representation which fits the setting of Proposition 4. With this we have the subsequent concentration bound. Proposition 5. For N (λ) as in (4.1) it holds (Lk + λ) 1/2 (Lk LD k ) (Lk + λ) 1/2 Hk Hk 4τN (λ)g(λ) 2τN (λ)g(λ) with probability at least 1 2e τ, where g(λ) := ln 2eλ + µ1 µ1 N(λ) (C.4) Proof For x P and Tx as in (C.2) we consider the operator-valued random variable ξ(ω) := (Lk + λ) 1/2Tx(ω)(Lk + λ) 1/2. We have from E(Txf)(y) = Ex f(x)k(y, x) = Z X f(x)k(y, x)p(x)dx = (Lkf)(y) i=1 ξi = (Lk + λ) 1/2LD k (Lk + λ) 1/2 and E ξ = (Lk + λ) 1/2Lk(Lk + λ) 1/2, On the Approximation of Kernel functions and thus the setting of Proposition 4. It is thus sufficient to show that ξ satisfies the requirements of Proposition 4, provided that there is an appropriate constant B as well as dominating operator V . We bound the norm ξ Hk Hk first. It holds that ξf 2 k = (Lk + λ) 1/2kx(ω)((Lk + λ) 1/2f)(x(ω)) 2 = (Lk + λ) 1/2kx(ω) 2 k (Lk + λ) 1/2f, kx(ω) 2 k f 2 k N 2 (λ) and thus ξ Hk Hk N (λ) P-almost surely. This discloses the constant B = N (λ). To bound the second moment, note first that ξ is a positive definite operator. Hence, we have that E ξ2 ξ Hk Hk E ξ = ξ Hk Hk (Lk + λ) 1Lk N (λ)(Lk + λ) 1Lk =: V by employing the bound for ξ Hk Hk. The operator norm of V is bounded by V Hk Hk = N (λ)(Lk + λ) 1Lk Hk Hk = N (λ) µ1 λ + µ1 N (λ) g(V ) = ln 2e tr(V ) V 1 Hk Hk 2e N (λ)N(λ) 1 N (λ) µ1 λ+µ1 = ln 2eλ + µ1 corresponding to g(λ) in (C.4). The desired inequality (C.3) follows from Proposition 4. Building on the bound (C.3) above, the assertion of Proposition 3 follows from the subsequent considerations. Note first the operator identity (I (Lk + λ) 1/2(Lk LD k )(Lk + λ) 1/2) 1 = (Lk + λ) 1/2(LD k + λ) 1(Lk + λ) 1/2, from which we conclude that (Lk + λ) 1/2(LD k + λ) 1(Lk + λ) 1/2 Hk Hk = (I (Lk + λ) 1/2(Lk LD k )(Lk + λ) 1/2) 1 Hk Hk 1 1 (Lk + λ) 1/2(Lk LD k )(Lk + λ) 1/2 Hk Hk by involving the Neumann series. Assuming the condition (4.2), we have by (C.3) that (Lk + λ) 1/2(Lk LD k )(Lk + λ) 1/2 Hk Hk 4 3τ g(λ)N (λ) 2τ g(λ)N (λ) with probability at least 1 2e τ. 
Therefore, the bound (Lk + λ) 1/2(LD k + λ) 1(Lk + λ) 1/2 Hk Hk 1 1 (Lk + λ) 1/2(Lk LD k )(Lk + λ) 1/2 Hk Hk 1 1 1 holds also with probability at least 1 2e τ. This is the desired inequality (4.3). Dommel and Pichler Appendix D. Auxiliary lemmata The following two lemmata provide a crucial element for the proof of Lemma 3. That is, a bound on the expectation E e 1 M2 , where M := min i,j=1,...,n i =j |Ui Uj| is the minimal gap between n independently chosen uniforms. The first lemma provides the density of M explicitly, the other bounds the associated expectation. Lemma 8. Let U1, . . . , Un U[0, 1] be independent uniforms. The random variable M has the density ( n(n 1)(1 (n 1)m)n 1 for m h 0, 1 n 1 i 0 else . (D.1) Proof Let U1, . . . , Un be independent uniforms on [0, 1] and denote the corresponding minimal absolute difference by M := mini,j=1,...,n,i =j |Ui Uj|. Note here that M 1 n 1, and M = 1 n 1 if all U1, . . . , Un are equidistant. Therefore, let m [0, 1 n 1] and observe that P(M > m) = n!P(M > m, U1 U2 Un) (D.2) as there are n! possible rearrangements of the random variables U1, . . . , Un. For the latter probability we have that P(M > m, U1 U2 . . . Un) = P(U1 U2 m, U2 U3 m, . . . , Un 1 Un m, U1 . . . Un) where λ (Um) is the Lebesgue measure of the set Um = {(u1, . . . , un) [0, 1]n : u1 u2 m, . . . , un 1 un m, u1 u2 un} . We next present a measure persevering bijection between Um and Ym := {(y1, . . . , yn) [0, 1 (n 1)m]n : y1 y2 yn} . To this end, define T : Um Ym with Tu = u (0, m, 2m, . . . , (n 1)m). For u = (u1, . . . , un) Um and y = Tu = u (0, m, 2m, . . . , (n 1)m) it is evident that yi 0 as well as y1 yn. Furthermore, it holds that ui 1 m (n i) as the distance between the ui and ui+1 is at least m, for every i = 1, . . . , n. With this we have the inequality yi = ui (i 1)m 1 m (n i) (i 1)m = 1 (n 1)m and therefore y Ym. Conversely, let y Ym and set u = y+(0, m, 2m, . . . , (n 1)m) = T 1y. It is again immediate that ui 0 as well as ui = yi + (i 1)m 1 (n 1)m + (i 1)m 1. On the Approximation of Kernel functions From y1 yn we further get that ui+1 m = yi+1 + i m m = yi+1 + (i 1)m yi + (i 1)m = ui for all i = 1, . . . , n 1 and thus u Um. Hence, T is a bijection, from which we conclude that λ(Um) = λ(Ym). The latter measure is λ(Ym) = λ ({(y1, . . . , yn) [0, 1 (n 1)m]n : y1 y2 yn}) n!λ ({(y1, . . . , yn) [0, 1 (n 1)m]n}) = 1 n!(1 (n 1)m)n. (D.3) Combining (D.2) with (D.3), we get that P(M > m) = n! 1 n!(1 (n 1)m)n = (1 (n 1)m)n and hence the density p(m) = d dm(1 P(M > m)) = n(n 1)(1 (n 1)m)n 1. This is the assertion. Lemma 9. Let U1, . . . , Un U[0, 1] be independent uniforms and M the minimum gap as in Lemma 8. For any c > 0 it holds that E e c M 2 4e c 2 a2 Ce 8c(n 1)2, (D.4) with a = min{1 3 , 1 2(n 1)}. Proof We employ the density (D.1) of M. As 1 n 1 1 2(n 1) and e c M 2 0, we have that E e c M 2 = Z 1 n 1 m2 n(n 1) 1 (n 1)m n 1dm > Z 1 2(n 1) m2 n(n 1)(1 (n 1)m)n 1dm e (n 1) Z 1 2(n 1) To bound the latter integral term, note that whenever x 3c 2 3 , as Dommel and Pichler is satisfied for all x 3c 2 3 . Thus, choosing a = min{1 3 , 1 2(n 1)} we get that x2 dx = h 4e c 2 which is the assertion.
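As a numerical sanity check of the gap distribution derived in Lemma 8 above, the following Monte Carlo experiment (an illustration, not part of the paper) compares the empirical survival function of the minimal gap M of n independent uniforms with the closed form $P(M > m) = (1-(n-1)m)^n$. The sample sizes and evaluation points are arbitrary choices.

```python
# Monte Carlo check (not from the paper) of the gap distribution in Lemma 8:
# for n independent uniforms on [0, 1], the minimal gap M satisfies
# P(M > m) = (1 - (n-1)*m)^n for m in [0, 1/(n-1)].  Sample sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 8, 200_000
U = np.sort(rng.random((trials, n)), axis=1)
M = np.diff(U, axis=1).min(axis=1)                 # minimal gap in each trial

for m in (0.01, 0.03, 0.05, 0.10):
    empirical = float((M > m).mean())
    exact = (1.0 - (n - 1) * m) ** n
    print(f"m = {m:.2f}:  empirical P(M > m) = {empirical:.4f}   (1-(n-1)m)^n = {exact:.4f}")
```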