# quasimonte_carlo_features_for_kernel_approximation__a784b03d.pdf

Quasi-Monte Carlo Features for Kernel Approximation

Zhen Huang 1 Jiajin Sun 1 Yian Huang 1

Random features (Rahimi & Recht, 2007), based on Monte Carlo (MC) method, is one of the most popular approximation techniques to accelerate kernel methods. We show for a class of kernels, including Gaussian kernels, quasi-Monte Carlo (QMC) methods can be used in place of MC to improve the approximation error from OP (1/

M) to O(1/M) (up to logarithmic factors), for estimating both the kernel function itself and the associated integral operator, where M is the number of features being used. Furthermore, we demonstrate the advantage of QMC features in the case of kernel ridge regression, where theoretically, fewer random features suffice to guarantee the same convergence rate of the excess risk. In practice, the QMC kernel approximation approach is easily implementable and shows superior performance, as supported by the empirical evidence provided in the paper.

1. Introduction

Kernel methods offer a mathematically well-founded and practically powerful nonparametric modeling framework for a wide range of problems in machine learning (Wahba, 1990; Sch olkopf & Smola, 2002; Cucker & Smale, 2002). Random features (Rahimi & Recht, 2007) is one of the most popular approximation techniques to accelerate kernel methods. Its idea proceeds as follows: first suppose a kernel function K : X X R (where X is a subset of Rd) has an integral representation:

K(x, x ) = Z

Ω ψ(x, ω)ψ(x , ω)dπ(ω), (1)

where π is a probability measure over some space Ωand ψ( , ) is a function on X Ω. Note that an integral representation in the form of (1) exists under very mild conditions

1Department of Statistics, Columbia University, New York, NY 10027, USA. Correspondence to: Zhen Huang <zh2395@columbia.edu>.

Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

(see e.g., Proposition A.8 in Appendix). Explicit examples include any shift invariant kernel K(x, x ) = h(x x ), for which Bochner s theorem (Bochner, 1933) implies the existence of a finite non-negative symmetric Borel measure µ on Rd such that

h(x x ) = Z

Rd e i(x x ) ωdµ(ω)

1 π cos x ω + b cos (x ) ω + b db dµ(ω).

(2) Shift invariant kernels cover many popular kernels such as

1. Gaussian kernel e σ(x x ) 2 2/2: µ N(0, σ2Id).

2. Laplacian kernel e γ(x x ) 1: µ has Lebesgue density Qd i=1 1 πγ(1+(ωi/γ)2) (Cauchy distribution).

3. Cauchy kernel Qd i=1 1 1+(xi x i)2/λ2 : µ has Lebesgue

2 e λ ω 1 (Laplace distribution).

Given the kernel function has integral representation (1), K(x, x ) can be approximated by

KM(x, x ) = 1

i=1 ψ(x, ωi)ψ(x , ωi), (3)

with ω1, . . . , ωM i.i.d. from π. Note that (3) is an inner product on RM. This reduces the computational complexity of the kernel ridge regression (O(n3) in time; O(n2) in space) to that of the ordinary ridge regression on RM

(O(n M 2 + M 3) in time; O(n M) in space), if M n.

For kernel ridge regression, suppose (X, Y ) X R follows a distribution PXY with marginal distributions PX and PY . Given the kernel function K, the integral operator L : L2(PX) L2(PX) is defined as:

Lf(x) := EX PX [K(X, x)f(X)] . (4)

If the true regression function in the kernel ridge regression belongs to the range of Lr (here r [1/2, 1] can be viewed as a complexity or smoothness parameter), then Rudi & Rosasco (2017) shows that M n 2r 2r+1 (up to logarithmic factors) random features can ensure the same convergence rate of the excess risk as the exact kernel ridge regression.

QMC Features

Halton Sequence Random Points (iid)

Figure 1. Left: the first 25 points of the two-dimensional Halton sequence. Right: 25 i.i.d. random points from Unif[0, 1]2.

Our contributions: In this paper, we show that compared with the random error bound from Monte Carlo (MC) method: |K(x, x ) KM(x, x )| = OP (1/

M) (Dick et al., 2013), quasi-Monte Carlo (QMC) method uses a deterministic sequence ω1, . . . , ωM to achieve a deterministic error bound |K(x, x ) KM(x, x )| = O(loga M/M) for some integer a under some conditions (Theorems 2.2 and 2.5) this convergence rate is much faster and non-random and such an improvement holds true for a class of kernels including the Gaussian kernel and Cauchy kernel mentioned above. The improved approximation to the kernel also leads to improved approximations of the integral operator and the spectrum of the kernel matrix (see Propositions 2.6 and 2.7). Further, we demonstrate the usefulness of QMC features in the application of kernel ridge regression, by showing that with QMC method, M n 1 2r+1 is enough to guarantee the same convergence rate of the excess risk as exact kernel ridge regression (Theorem 2.2). This is an enormous reduction from MC based random features (which require M n 2r 2r+1 ) when r > 1/2 (i.e., the true regression function has a smoothness condition beyond simply lying in the reproducing kernel Hilbert space1, a.k.a. RKHS, associated with the kernel K; see Section 3 for more details). In practice, the QMC kernel approximation approach is easily implementable and demonstrates superior performance, as supported by empirical evidence provided in the paper.

1.1. Quasi-Monte Carlo Method

QMC is a powerful tool in numerical integration. Its primary focus is to approximate integrals over the unit cube with respect to the uniform measure. In order to approximate R

[0,1]d f(x)dx with a sum 1 M PM i=1 f(xi), MC uses i.i.d. random {xi}M i=1, while QMC uses some well-chosen

1The RKHS (Aronszajn, 1950) H is a space of function over X consisting of span{K(x, ) : x X} and their limits, equipped with an inner product given by K(x, ), K(x , ) H = K(x, x ).

deterministic {xi}M i=1 that are spread out more uniformly in a certain sense. In this section, we will cover some background that is necessary for subsequent discussions. More details can be found in textbooks such as Niederreiter (1992) and Owen (2023). We first introduce an important inequality in QMC:

Theorem 1.1 (Koksma-Hlawka inequality, Hlawka, 1961). Suppose f : [0, 1]d R has bounded variation in the sense of Hardy and Krause VHK(f).2 Then for any x1, . . . , x M [0, 1]d, we have

[0,1]d f(x)dx 1

VHK(f)D ({xi}M i=1),

where D ({xi}M i=1) is the star discrepancy3 of the point set {xi}M i=1.

QMC is useful thanks to the existence of some lowdiscrepancy sequences. One notable example is the Halton sequence h1, h2, . . . which satisfies

D ({hi}M i=1) CH(d)(log M)d/M (5)

for some CH(d) > 0 that depends on d, and all M 2 (Halton, 1964; Atanassov, 2004). This is a substantial improvement from random sampling, whose star discrepancy is of order OP (M 1/2). It may be seen from Figure 1 that points from Halton sequence appear more uniform than i.i.d. points from the uniform distribution. In practice, Halton sequence can be easily generated from an elegant formula and is directly accessible in major computational softwares. Compared with other QMC sequences, Halton sequence has a distinct feature: it avoids the boundary of the unit cube (Owen, 2006). This characteristic makes it particularly useful in approximating a class of shift-invariant kernels, including the Gaussian kernel and Cauchy kernel (see Section 2.1 for details). Hence, the Halton sequence will be the primary choice of QMC sequence in this paper.

Other low-discrepancy QMC sequences include Sobol sequence (Sobol , 1967) and Faure sequence (Faure, 1982), which were combined by Niederreiter (1987) to formulate the concepts of digital nets and sequences. Note that digital sequences also satisfy (5), but with CH(d) replaced by a different constant (Niederreiter, 1992, Theorem 4.17). The lead

2In one-dimension, Hardy-Krause variation coincides with the usual total variation. In general dimensions, VHK(f; [0, 1]d) = P

I {1,...,d},I = R

u I uj=1,j / I

du I, provided that f has all the related derivatives. For definition in general situation, see e.g., Niederreiter (1992); Owen (2005). 3The star discrepancy of the point set {xi}M i=1 is defined as

D ({xi}M i=1) := supt [0,1]d Vol(Jt) |{i {1,...,M}:xi Jt}|

M , where Jt := [0, t1) [0, t2) [0, td) and Vol(Jt) := Qd i=1 ti is the volume.

QMC Features

constant on the dominating term (log M)d/M (for digital sequences) used to be much smaller than that for the Halton sequences for large d, but that was changed by Atanassov (2004) who considerably sharpened the bounds for Halton squence (the constant was shown to converge to 0 as d ). It is conjectured that the O((log M)d/M) rate for star discrepancy decay is optimal for infinite sequences, and Schmidt (1972) proved this in the case d = 1. For d > 1, the question remains open; a lower bound (log M) d 2 /M was provided by Roth (1954), which was slightly improved by Baker (1999).

When applying QMC to kernel approximation, a negative result was found by Avron et al. (2016) that the integral representation (2) from Bochner s theorem, when written as an integral over the unit cube, has infinite variation. Consequently, the Koksma-Hlawka inequality (Theorem 1.1) cannot provide a meaningful bound.

To overcome this difficulty, we show that for a class of shift invariant kernels including the Gaussian kernel, even though the integrand has infinite variation, the singularity is mild, so the approximation error can still be well controlled. Our result relies on the geometry of Halton sequence that it avoids the boundary of the unit cube (Owen, 2006). In addition to shift invariant kernels, We also provide examples of nonshift invariant kernels which have integral representation (1) with the integrand having bounded variation. Our results continue to hold true for such non-shift invariant kernels.

1.2. Related Literature

Kernel methods provide a mathematically rigorous nonparametric modeling approach that finds applications across a broad spectrum of machine learning (Fukumizu et al., 2004; Belkin et al., 2006; Fukumizu et al., 2009; Sriperumbudur et al., 2011; Gretton et al., 2012; Fukumizu et al., 2013; Klebanov et al., 2020; Huang et al., 2022). Despite being remarkably effective in small and medium size problems with certain optimal statistical results (Kimeldorf & Wahba, 1970; Sch olkopf et al., 2001; Caponnetto & De Vito, 2007), exact kernel methods become infeasible for large scale problems due to its time and memory requirements (Rudi & Rosasco, 2017). To overcome this difficulty, various approximation techniques have been proposed (Smola, 2000; Williams & Seeger, 2000; Rahimi & Recht, 2007). One notable approach is random features (RF) (Rahimi & Recht, 2007) which have been well-understood theoretically (Sutherland & Schneider, 2015; Sriperumbudur & Szab o, 2015; Choromanski et al., 2018; Jacot et al., 2020; Lanthaler & Nelsen, 2023). In particular, RF demonstrates nice generalization properties, achieving the same rate of prediction accuracy as the exact kernel ridge regression estimator but at a much lower computational cost (Rudi & Rosasco, 2017; Li et al., 2019; Mei et al., 2022; Liu et al., 2022). In this

paper, we show that with QMC, the computational cost can be further reduced.

QMC is effective for numerical integration, which was born in the 1950s and 1960s (Korobov, 1963) from the successful attempt to achieve a faster convergence rate than MC. Contemporary reviews of QMC can be found in the books of Niederreiter (1992); Leobacher & Pillichshammer (2014) and articles of Owen (2005); Dick et al. (2013).

Efforts have been made to employ QMC to random features: Yang et al. (2014) and Avron et al. (2016) considered shift-invariant kernels, and found that Koksma-Hlawka inequality cannot be applied due to the integrand having unbounded variation; they instead proposed a framework based on a modified discrepancy measure called box discrepancy. QMC points on the high-dimensional sphere were adopted in Lyu (2017), which can be used to approximate shift and rotation invariant kernels and the arc-cosine kernels. Applying QMC to the graph kernels and graph random features was considered in Reid et al. (2023). Nevertheless, none of the above proposals provides a deterministic error bound of the kernel approximation in finite samples, nor do they show whether QMC could attain comparable performance as the exact kernel computation in learning problems, while reducing computational costs. We aim to answer these questions in this paper.

1.3. Organization

In Section 2, we describe how QMC can be used to approximate a kernel function, and provide a deterministic approximation error bound (of the same order as that in the QMC literature) in finite samples. In Section 3, we show that the same error rate in kernel ridge regression can be achieved with much lower computational costs (compared with the exact computation and MC based random features). Simulation evidence will be given in Section 4. Proofs of our results, further discussions, additional simulation studies and real data examples are provided in Appendices A-D.

2. Approximate Kernel Functions with QMC

In this section, we show how QMC can be used to approximate kernel functions, and provide deterministic error bounds for the approximation. In Section 2.1, we consider shift invariant kernels; in Section 2.2, we consider non-shift invariant kernels. Approximation error bounds for the associated integral operator and the spectrum of the kernel matrix will be given in Section 2.3.

2.1. Shift Invariant Kernels

Recall that we are considering kernels on a space X Rd. We assume that the µ from Bochner s theorem (2) is a

QMC Features

probability measure with independent components4, with the i-th component having cumulative distribution function Φi(t) (i = 1, 2, . . . , d). Let Φ(t) := (Φ1(t), . . . , Φd(t)) , and Φ 1(t) := (Φ 1 1 (t), . . . , Φ 1 d (t)) , where Φ 1 i (t) denotes the inverse function of the monotone function Φi(t). By a change of variable, (2) reduces to

K(x, x ) = h(x x ) = Z

[0,1]d+1 2 cos x Φ 1(t) + 2πb cos (x ) Φ 1(t) + 2πb dbdt.

(6) Therefore, the integral representation (1) holds with ω = (t, b) following Unif[0, 1]d+1 and ψ(x, ω) =

2 cos x Φ 1(t) + 2πb . We propose to set ω1, . . . , ωM as the first M points in the Halton sequence, and define the approximate kernel function KM( , ) as in (3).

As t approaches the boundary of [0, 1]d, the integrand in (6) oscillates back and forth and has unbounded variation. We therefore need a condition to characterize the situation where the singularity is mild so that K can still be wellapproximated by KM:

QMC Condition 1. K( , ) is shift invariant with Φi defined as above (i = 1, . . . , d) satisfying d

dtΦ 1 i (t) Ci min(t,1 t) for some constant Ci > 0 and all t (0, 1). X is compact.

QMC Condition 1 helps control the derivatives of the integrand in (6) as t approaches the boundary of [0, 1]d. Two important kernels that satisfy QMC Condition 1 are given by the proposition below: (see Appendix A.1 for a proof) Proposition 2.1. The Gaussian kernel and Cauchy kernel over a compact domain satisfy QMC Condition 1.

Gaussian kernel and Cauchy kernel over a compact domain are examples of universal kernels (Micchelli et al., 2006), i.e., the function class associated with the kernel can approximate (uniformly) any continuous function arbitrarily well. This property makes them particularly useful in machine learning applications such as kernel ridge regression, where an unknown regression function needs to be estimated from data. Laplacian kernel, although being universal, unfortunately does not satisfy QMC Condition 1. The following theorem (proved in Appendix B.1) shows that if QMC Condition 1 is satisfied, then the approximation error of KM to K is of order 1/M, up to logarithmic factors. Theorem 2.2. Suppose K( , ) satisfies QMC Condition 1. Then there exists a constant C > 0 (depending on X Rd

and K) such that for any x, x X and M 2,

|KM(x, x ) K(x, x )| C(log M)2d+1

This error rate is significantly better than that of the MC-

4This assumption has been adopted in previous works (e.g., Avron et al., 2016) and is satisfied for many common kernels.

based random features, which is of order OP (1/

M). Remark 2.3 (On the proof of Theorem 2.2). The general idea is to study the singularity of the integrand (6) near the boundary of the unit cube, which will be mild if QMC Condition 1 holds true. The classical Koksma-Hlawka inequality (Theorem 1.1) can then be applied to a large sub-cube within the unit cube. The fact that Halton sequence avoids the boundary of the unit cube (Owen, 2006) is useful, which ensures that the first M points of the Halton sequence do not lie too close to the boundary. Remark 2.4 (On the constant C). The exact expression of the error bound can be found in our proof (in Appendix B.1). In particular, the constant multiplied to the dominating term (log M)2d+1

M is CH(d + 1)22d+1B, where CH(d + 1) is the constant from the Halton sequence (5), and B = 4π maxu {1,...,d} max|u| x,y X { x y , x + y } Q i u Ci , with the convention that (max{ })0 = 1.

2.2. Non-Shift Invariant Kernels

For non-shift invariant kernels, Bochner s theorem is no longer applicable. Consequently, whether K( , ) possesses an integration representation in the form of (1) needs to be considered on a case-by-case basis. In this section, we will provide a collection of non-shift invariant kernels which have representation (1), and QMC can be applied.

Motivated by the Koksma-Hlawka inequality (Theorem 1.1), we introduce the following general condition: QMC Condition 2. Suppose there exists a function ψ : X [0, 1]p R such that

K(x, x ) = Z

[0,1]p ψ(x, ω)ψ(x , ω)dω,

and for any x, x X, g(ω) = ψ(x, ω)ψ(x , ω) is of bounded Hardy-Krause variation VHK(g) C0, for some C0 > 0.

Note that if all derivatives of g are well-bounded, then VHK(g) can be bounded; see footnote 2. When QMC Condition 2 is satisfied, we set ω1, . . . , ωM as the first M points in the Halton sequence, and define the approximate kernel function KM( , ) as in (3). A direct application of the Koksma-Hlawka inequality (Theorem 1.1) yields the following error bound: Theorem 2.5. Suppose K( , ) satisfies QMC Condition 2. Let CH(p) be the Halton sequence constant as in (5). For any x, x X and M 2, we have

|KM(x, x ) K(x, x )| CH(p) (log M)p

One may notice that the Halton sequence is not essential here for Theorem 2.5; other QMC sequences can also be used, though the constant in front of (log M)p/M may vary.

QMC Features

In the following, we present some kernels for which QMC Condition 2 is satisfied, and thereby Theorem 2.5 is valid. Example 1 (Min kernel). For u, v [0, 1],

K(u, v) = min{u, v} = Z 1

0 1t<u1t<vdt.

1t<u1t<v = 1t<min{u,v} is of bounded variation 1. Min kernel is the covariance kernel of the Brownian motion, and is also the distance kernel in one-dimension a famous example of a characteristic, but not universal kernel (Sejdinovic et al., 2013). High-dimensional min kernel is also available, by taking the product of univariate min kernels. Example 2 (Brownian bridge). For u, v [0, 1],

K(u, v) = min{u, v} uv = Z 1

0 (1t<u u)(1t<v v)dt.

The integrand (1t<u u)(1t<v v) has variation bounded by 3. The Brownian bridge kernel has been used to analyze average-case errors in numerical problems (Ritter, 2000). Example 3 (Iterative kernel). Suppose K1( , ) is a continuous kernel on [0, 1]d, and µ is a positive integrable function on [0, 1]d. The iterative kernel (Courant & Hilbert, 1953, Section III.5.3) of K1 is defined as

K2(x, z) := Z

[0,1]d K1(x, t)K1(z, t)µ(t)dt.

If the Hardy-Krause variation of the integrand fx,z(t) := K1(x, t)K1(z, t)µ(t) is bounded by some C0 > 0 for all x, z [0, 1]d, then QMC Condition 2 is satisfied. A sufficient condition for the existence of such a C0 is that there exists a constant C0 > 0 such that for all u {1, . . . , d} and x, z, t [0, 1]d, |(Q

i u ti )fx,z(t)| C0. If K1 has Mercer series (w.r.t. µ): K1(x, z) = P i=1 λiei(x)ei(z), then it can be shown that K2(x, z) = P i=1 λ2 i ei(x)ei(z) (which is smoother than the original K1). See Appendix A.2.1 for more detailed discussions. Example 4 (Natural cubic spline). For u, v [0, 1],

K(u, v) = Z 1

0 (u t ut)(v t vt)dt

1 6u(1 v)(1 u2 (1 v)2), 0 u v 1

1 6v(1 u)(1 v2 (1 u)2), 0 v u 1 .

The integrand above has variation bounded by 4. For any fixed value of v, it is a natural cubic spline that interpolates zero at u = 0 and u = 1. This kernel is also the iterative kernel of Brownian bridge with µ(t) 1. Example 5 (Product kernel). Suppose for i = 1, . . . , d, Ki(u, v) = R 1 0 ψi(u, t)ψi(v, t)dt satisfies QMC Condition 2 with |ψi(u, t)| κi for some κi and all u, t. Then

i=1 Ki(ui, vi) = Z

[0,1]d ψ(u, t)ψ(v, t)dt,

where ψ(x, t) = Qd i=1 ψi(xi, ti), satisfies QMC Condition 2. See Appendix A.2.2 for the detailed proof.

Example 5 allows us to construct high-dimensional kernels that satisfy QMC Condition 2 with Examples 1-4.

2.3. Approximate Integral Operator and Kernel Matrix

When the kernel function K( , ) is well approximated by KM( , ), the associated integral operator and the kernel matrix can also be well approximated. Recall the integral operator L defined in (4). Define its approximation LM : L2(PX) L2(PX) as

LMf(x) := EX PX [KM(X, x)f(X)] .

The following result (proved in Appendix B.2) shows that the error in estimating the kernel function propagates to that in estimating the integral operator:

Proposition 2.6. Suppose two kernels K and KM satisfy

sup x,x X |K(x, x ) KM(x, x )| C (log M)a

for some positive constants C and a. Then we have

LM L C (log M)a

where denotes the operator norm.

Proposition 2.6 will be useful in showing the superior performance of QMC features (compared with MC based random features) in kernel ridge regression (see Section 3).

Another advantage of using QMC features concerns the spectral approximation of the kernel matrix:

Proposition 2.7 (Spectrum approximation). Suppose two kernels K and KM satisfy

sup x,x X |K(x, x ) KM(x, x )| C (log M)a

for some constants C, a > 0. Let K := [K(xi, xj)](i,j) Rn n, KM := [KM(xi, xj)](i,j) Rn n, and λ, > 0. When M (log M)a Cn

(1 )(K+λIn) KM +λIn (1+ )(K+λIn). (7)

Compare with (MC based) random features that require M to be of order n 2λ to achieve a -spectral approximation (7) with high probability (Avron et al., 2017, Theorem 7), QMC features only require M of order n λ (ignoring logarithmic factors). Moreover, (7) holds true with probability 1. Such a spectral bound is useful in analyzing the statistical performance of some downstream learning tasks (see e.g., Musco & Musco, 2017, Appendix E).

QMC Features

3. Application in Kernel Ridge Regression

In this section, we will show how QMC features introduced in Section 2 can be used to further accelerate the computation of kernel ridge regression (compared with MC random features), while still maintaining the same statistical accuracy. We will first give a brief review on the kernel ridge regression with random features, and then provide the theoretical results for our method.

3.1. Brief Review on Kernel Ridge and Random Feature

Brief Review on Kernel Ridge Regression. Consider the usual setting where we have n i.i.d. samples (xi, yi)n i=1 drawn from a joint distribution PXY on X R. The goal is to learn the regression function f (X) = E[Y |X] which minimizes the expected risk E(f) = E[Y f(X)]2. Given a kernel K, denote the reproducing kernel Hilbert space associated with K by H. The kernel ridge regression (KRR) defines a penalized estimator for the above learning problem

ˆfλ := arg min f H

i=1 (yi f(xi))2 + λ f 2 H

for some λ > 0, and has the explicit solution

i=1 ˆαi K(xi, x), ˆα = (K + nλIn) 1y (9)

where K = [K(xi, xj)](i,j) Rn n, In is the n n identity matrix, y = (y1, . . . , yn) , and ˆα = (ˆα1, . . . , ˆαn) . Statistically, the KRR estimator (8) is minimax rate optimal for the sqaure loss E( ) over H (Sch olkopf & Smola, 2002; Caponnetto & De Vito, 2007). Computationally, the time and space complexities of KRR are of order O(n3) and O(n2), respectively, which could be costly when n is large.

Brief Review on KRR with Random Features. Recall from Section 1 that when the kernel function K has an integral representation K(x, x ) = R

Ωψ(x, ω)ψ(x , ω)dπ(ω) as in (1), one could use a Monte Carlo integration KM(x, x ) = ϕM(x) ϕM(x ) = 1 M PM i=1 ψ(x, ωi)ψ(x , ωi) as in (3) to approximate K, where the ωi s are i.i.d. sampled from π, and ϕM(x) := M 1/2(ψ(x, ω1), . . . , ψ(x, ωM)) . Substituting KM = [KM(xi, xj)](i,j) for K in the KRR estimator (9) gives the random feature kernel ridge regression (RF-KRR) estimator (Rudi & Rosasco, 2017; Avron et al., 2017). The time and space complexities of RF-KRR are O(n M 2+M 3) and O(n M), respectively, which indicates a reduction in computational cost compared to (9) if M n. Furthermore, this computational gain does not bring an additional cost in statistical error: Rudi & Rosasco (2017) shows that RF-KRR with M n 2r 2r+1 (up to logarithmic factors) guarantees the same error rate as the exact KRR, where r [ 1

2, 1] is a measure of complexity to be rigorously defined in Section 3.3 below.

3.2. Kernel Ridge Regression with QMC Features

The nice properties of RF-KRR rely on the fact that KM approximates K with an OP (M 1/2) error rate when sampling ψ(x, ωi) with i.i.d. ωi. Enlightened by the even better error O(M 1) QMC approximation (up to logarithmic factors) of KM to K as shown in Theorems 2.2 and 2.5, we naturally consider the quasi-Monte Carlo feature kernel ridge regression (QMCF-KRR) estimator by employing KM in lieu of K in the KRR estimator (9), where KM is the QMC approximation to K. Through algebraic transformations (Bach, 2017), we have the explicit formula of the QMCF-KRR estimator given by

ˆfλ,M(x) = ϕM(x) ˆΦ M ˆΦM + nλIM 1 ˆΦ My (10)

where ˆΦM := (ϕM(x1), . . . ,ϕM(xn)) Rn M, with ϕM(x) still defined as M 1/2(ψ(x, ω1), . . . , ψ(x, ωM))

but ωi s are now generated from a QMC sequence.

From (10) it is clear that, the time and space complexities of QMCF-KRR, like those of the RF-KRR estimator, are O(n M 2 + M 3) and O(n M), respectively. It remains to answer the question: how large should M be to guarantee good statistical accuracy? In the next subsection, we show that to achieve the same error rate as RF-KRR and the exact KRR, QMCF-KRR requires only M n 1 2r+1 (up to logarithmic factors) number of random features, which further reduces the computational cost compared with RFKRR while maintaining the same statistical accuracy.

3.3. Theoretical Results for QMCF-KRR

Under the settings of Kernel Ridge Regression in subsection 3.1, for ˆfλ,M as defined by (10), we will establish the statistical excess error rate in terms of E(f) = E[Y f(X)]2

over the class H. To formally state our result, we postulate the following regularity conditions:

KRR Condition 1. (i) K(x, x ) is continuous and has the integral representation (1), in which |ψ(x, ω)| κ for some constant κ > 0. Assume X has full support on X, and ω 7 ψ( , ω), as a map from Ωto L2(PX), is continuous.

(ii) π in (1) is the uniform distribution over [0, 1]p for some p 1, and quasi-Monte Carlo method is used for approximating the kernel as in (3), from which we have

sup x,x X |K(x, x ) KM(x, x )| C loga M

for some positive constants C and a (see Theorems 2.2, 2.5).

KRR Condition 2. The distribution of Y satisfies a Bernstein condition: there exist positive constants σ and D such that E[|Y |k | X] 1

2k!σ2Dk 2 for all k 2.

KRR Condition 3. There exists r [1/2, 1] such that f H = Lrg for some g L2(PX), where f H solves

QMC Features

minf H E(f), and L is the integral operator defined in (4). Let R := max{ g L2(PX), 1} be a positive constant. Remark 3.1 (On the conditions). In KRR Condition 1(i), the continuity of K and the full support of X are standard assumptions for a Mercer kernel, which implies that H is essentially ran L1/2 (see Theorem A.7 in the Appendix). The continuity of ω 7 ψ( , ω) L2(PX) is weaker than the continuity of ψ(x, ω) (which was assumed in Rudi & Rosasco 2017); it includes min kernel where ψ(x, ω) = 1ω<x is not continuous in the usual sense. For the boundedness assumption of ψ, we note that all examples in Section 2 have bounded feature functions. KRR Condition 1(ii), as shown in Theorems 2.2 and 2.5, is satisfied under the case of either QMC Condition 1 or 2. KRR Condition 2 is a moderate and usual condition on the tail of the response distribution, which holds when the conditional distribution of Y is sub-exponential. KRR Condition 3 is widely-used in the kernel machine literature (Smale & Zhou, 2003; Caponnetto & De Vito, 2007), whose implications manifest in two folds. First, it assumes the existence of a minimizer f H of the loss E( ) in the class H. Second, it further assumes that f H lies in the range of Lr. Here r [1/2, 1] is interpreted as a measure of complexity of f H, and can be intuitively understood as a smoothness parameter: the larger r is, the smoother f H is, as L is a convolution. In particular, r = 1/2 represents the basic case, equivalent to assuming only that f H exists in H (see Theorem A.7 in the Appendix).

Thoerem 3.2 below (proved in Appendix C) establishes the statistical error rate of the proposed QMCF-KRR estimator (10). Theorem 3.2. Assume KRR Conditions 1, 2, 3. Let λ = Cn 1 2r+1 (0, e 1], and ˆfλ,M be defined as in (10). Then M = loga(1/λ)

λ = n 1 2r+1 loga(n 1 2r+1 / C)/ C is enough to guarantee that, for any δ (0, 1], there exists n0 (of order (log 1

2r ), such that when n n0, with probability at least 1 δ, the excess risk

E( ˆfλ,M) inf f H E(f) C1n 2r 2r+1 log2 6

where C1 is a constant depending only on κ, σ, D, R, r, C, C and a (see Appendix C.2 for the exact expression of C1). Remark 3.3 (Implications of Theorem 3.2). Our error bound (11) achieves the same error rate as in exact KRR (Caponnetto & De Vito, 2007, Theorem 1) and RF-KRR (Rudi & Rosasco, 2017, Theorem 2). However, our QMC approach is computationally more efficient in smoother or less complexity cases: RF-KRR (Rudi & Rosasco, 2017, Theorem 2) requires M n 2r 2r+1 log 108κ2n

δ many random features to achieve an excess risk of C1n 2r 2r+1 log2 18

δ , while our QMCF-KRR method needs only M = n 1 2r+1 loga(n 1 2r+1 / C)/ C many

0.0 0.5 1.0 0.0

0.0 0.5 1.0 0.0

QMC (M = 25)

0.0 0.5 1.0 0.0

MC (M = 25)

0.0 0.5 1.0 0.0

Gaussian Kernel

0.0 0.5 1.0 0.0

QMC (M = 25)

0.0 0.5 1.0 0.0

MC (M = 25)

Figure 2. Min kernel and Gaussian kernel over [0, 1]2, and their approximated versions using QMC and MC based random features.

features. Recall that r [1/2, 1]. The improvement in the required number of features is reflected in two aspects. Firstly, when r is greater than 1/2, QMCF-KRR allows for a substantial reduction in the rate of M, which indicates a sharp diminution in computational costs. For the ease of understanding, when ignoring the constants and logarithmic terms, QMCF-KRR requires M n 1 2r+1 , which is of smaller order than the n 2r 2r+1 number of features needed by RF-KRR to achieve the same statistical error rate. Secondly, as QMC generates a non-random sequence of ωi s, the choice of M in QMCF-KRR does not depend on the small probability δ (as opposed to RF-KRR), which facilitates a more straightforward selection of M in practical applications with theoretical guarantees.

4. Simulations

In this section, we demonstrate the usefulness of QMC features in kernel approximation and kernel ridge regression through simulations. Additional simulation studies and real data examples will be referred to Appendix D. Halton sequence implemented in the Sci Py package in Python (Virtanen et al., 2020) is used.

4.1. Visualization of Kernel Approximation

We consider the (non-shift invariant) min kernel min{x, x } and the (shift invariant) Gaussian kernel exp( |x x |2). For the min kernel, we use the integral representation as in Example 1. For Gaussian kernel, we use the integral representation (6) from Bochner s theorem. The values of the kernel functions over [0, 1]2 are shown in the left column of Figure 2, with the approximated kernels from QMC and MC shown in the middle column and the right column, respectively. Here the MC plot is produced using

QMC Features

one realization of M i.i.d. random features. Figure 2 is in grayscale: brighter pixels correspond to larger values. It can be seen that with M = 25 features, QMC method has already provided a reasonably good approximation to the true kernel function. Whereas for MC, the same number of random features are clearly not enough to well approximate the kernel.

4.2. Simulations on Kernel Ridge Regression

In this subsection, we compare the performances of RFKRR and QMCF-KRR. It has been shown in Theorem 3.2 that the QMC approach is guaranteed to be more efficient than the MC approach in smoother cases, where r [1/2, 1] is large. Therefore, we will present simulations for cases where r = 1 to support our findings here. In fact, we observe that even in cases where r = 1/2, the QMC approach can still have superior performance compared with MC. These additional simulations are given in Appendix D.1.

The training and test data are generated from Y = f(X) + ε, where f is the regression function, X Unif[0, 1]d, and ε N(0, 1). We consider two choices of kernels: (i) the min kernel K(x, x ) = Qd i=1 min(xi, x i), and (ii) the Gaussian kernel K(x, x ) = exp( 1

2σ2 x x 2 2), with the bandwidth σ set as the median of X X (computed numerically), where X, X i.i.d. Unif[0, 1]d.

By definition, a function in ran Lr for r = 1 has the form f(x) = R K(x, z)g(z)d PX(z) for some g L2(PX). For the min kernel, we set g(z) = (Qd i=1 zi) 1 d as the geometric mean function, which leads to f(x) =

( d d+1)d Qd j=1 xj d 2d+1x 2+ 1

d j , and for the Gaussian

kernel, we set g(z) = exp( 1

2σ2 z 2 2), which yields f(x) = σ2d exp( 1

2σ2 x 2 2) Qd j=1 exp(xj/σ2) 1

xj . To approximately control the signal-noise-ratio, we set the regression function f(x) = C f f(x) for some constant C f such that Ef(X) = 5. The kernel ridge regularization parameter is set as λ = 0.25n 1 2r+1 .

In Figures 3 and 4, we plot the test mean square error (MSE) against the number of random features, for exact KRR, RFKRR and QMCF-KRR. For each combination of kernel and d, 106 test data points are first generated and held fixed. We consider 1000 realizations of training samples of size 104. For each of the realization, we fit a kernel ridge regression and compute its test error (i.e., MSE on the test set). The solid lines in Figures 3, 4 are obtained by averaging over the 1000 realizations. We also provide confidence bands using the 25% and 75% error quantiles from the 1000 realizations. For the MC method, the randomness comes from re-generating the training set and the MC random features. Whereas for the QMC method, it only comes from the training set re-generation as the QMC features are deterministic.

4 6 8 10 log2(M)

4 6 8 10 log2(M)

4 6 8 10 log2(M)

4 6 8 10 log2(M)

Figure 3. Gaussian Kernel (r = 1): the test MSE against the number of random features for exact KRR, RF-KRR and QMCFKRR.

4 6 8 10 log2(M)

4 6 8 10 log2(M)

4 6 8 10 log2(M)

4 6 8 10 log2(M)

Figure 4. Min Kernel (r = 1): the test MSE against the number of random features, for exact KRR, RF-KRR and QMCF-KRR.

It can be seen from Figures 3 and 4 that for all combinations of kernel and d, QMC features exhibit superior performance in kernel ridge regression: it achieves the same generalization error as MC random features with much fewer random features (smaller M), and converges to the error of the exact KRR at a much faster rate. Note that the confidence bands of QMC method are almost negligible, showing very stable performance compared with the MC method, and in many cases, their confidence bands do not intersect. Similar phenomena are observed even when r = 0.5. We refer readers to Appendix D.1 for more details.

Superior performance of QMC is also observed in real data, where RF-KRR is often seen to have a wide confidence band. For some real-world examples, see Appendix D.2.

QMC Features

Note that the reported plots (Figures 3, 4) are for d [1, 4], which are all low-dimensional, as we observe that QMC features are most effective when the dimension of the ambient space X is low, which aligns with the practical discovery that the best use case for QMC often arises when the integrand can be well approximated by a sum of functions involving only a small number of its input variables (Owen, 2023; Adcock & Brugiapaglia, 2022). Ongoing research has been studying the high-dimensional situations where QMC methods will be successful (Dick et al., 2013), with a famous empirical finding by Paskov & Taub (1995) that showed integrands from finance in 360 dimensions could be well integrated by QMC. However, for the examples above, we do not observe ideal performance of the QMC features in high-dimensional cases where d > 10 (see Appendix D.3). In practice, a dimension reduction prior to applying the methodology may be helpful. Exploring the effectiveness of QMC features in high dimensions could be an interesting research direction to pursue in the future.

Acknowledgements

The authors would like to thank Samory Kpotufe, Bodhisattva Sen, Liu Sifan, and the anonymous reviewers for their careful reading of the paper and helpful suggestions and comments.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There might be potential societal consequences of our work, none of which we feel need to be specifically highlighted here.

Adcock, B. and Brugiapaglia, S. Is monte carlo a bad sampling strategy for learning smooth functions in high dimensions? ar Xiv preprint ar Xiv:2208.09045, 2022.

Aronszajn, N. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337 404, 1950.

Atanassov, E. On the discrepancy of the halton sequences. Mathematica Balkanica, 18:15 32, 2004.

Aubin, J.-P. Applied Functional Analysis, volume 47. John Wiley & Sons, 2011.

Avron, H., Sindhwani, V., Yang, J., and Mahoney, M. W. Quasi-monte carlo feature maps for shift-invariant kernels. Journal of Machine Learning Research, 17(1):4096 4133, 2016.

Avron, H., Kapralov, M., Musco, C., Musco, C., Velingker, A., and Zandieh, A. Random fourier features for kernel

ridge regression: Approximation bounds and statistical guarantees. In International Conference on Machine Learning, pp. 253 262. PMLR, 2017.

Bach, F. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(1):714 751, 2017.

Baker, R. C. On irregularities of distribution. II. J. London Math. Soc. (2), 59(1):50 64, 1999.

Belkin, M., Niyogi, P., and Sindhwani, V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(11), 2006.

Bochner, S. Monotone funktionen, stieltjessche integrale und harmonische analyse. Mathematische Annalen, 108 (1):378 410, 1933.

Caponnetto, A. and De Vito, E. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331 368, 2007.

Chamakh, L., Gobet, E., and Szab o, Z. Orlicz random fourier features. The Journal of Machine Learning Research, 21(1):5739 5775, 2020.

Choromanski, K., Rowland, M., Sarlos, T., Sindhwani, V., Turner, R., and Weller, A. The geometry of random features. In International Conference on Artificial Intelligence and Statistics, pp. 1 9. PMLR, 2018.

Cohn, D. L. Measure Theory. Springer, New York, second edition, 2013.

Courant, R. and Hilbert, D. Methods of Mathematical Physics. Vol. I. Interscience Publishers, Inc., New York, N.Y., 1953.

Cucker, F. and Smale, S. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1 49, 2002.

De Vito, E., Caponnetto, A., and Rosasco, L. Model selection for regularized least-squares algorithm in learning theory. Foundations of Computational Mathematics, 5: 59 85, 2005.

Dick, J., Kuo, F. Y., and Sloan, I. H. High-dimensional integration: the quasi-monte carlo way. Acta Numerica, 22:133 288, 2013.

Faure, H. Discr epance de suites associ ees a un syst eme de num eration (en dimension s). Acta Arithmetica, 41(4): 337 351, 1982.

QMC Features

Fukumizu, K., Bach, F. R., and Jordan, M. I. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5 (Jan):73 99, 2004.

Fukumizu, K., Bach, F. R., and Jordan, M. I. Kernel dimension reduction in regression. Ann. Statist., 37(4): 1871 1905, 2009.

Fukumizu, K., Song, L., and Gretton, A. Kernel Bayes rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14(1):3753 3783, 2013.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Sch olkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13(1):723 773, 2012.

Halton, J. H. Algorithm 247: Radical-inverse quasi-random point sequence. Communications of the ACM, 7(12): 701 702, 1964.

Hein, M. and Bousquet, O. Kernels, Associated Structures and Generalizations. Technical Report of the Max Planck Institute for Biological Cybernetics 127, Max Planck Institute for Biological Cybernetics, T ubingen, Germany, 2004.

Hlawka, E. Funktionen von beschr ankter variatiou in der theorie der gleichverteilung. Annali di Matematica Pura ed Applicata, 54(1):325 333, 1961.

Huang, Z., Deb, N., and Sen, B. Kernel partial correlation coefficient a measure of conditional dependence. Journal of Machine Learning Research, 23(216):1 58, 2022.

Jacot, A., Simsek, B., Spadaro, F., Hongler, C., and Gabriel, F. Implicit regularization of random feature models. In International Conference on Machine Learning, pp. 4631 4640. PMLR, 2020.

Kimeldorf, G. S. and Wahba, G. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann. Math. Statist., 41:495 502, 1970.

Klebanov, I., Schuster, I., and Sullivan, T. J. A rigorous theory of conditional mean embeddings. SIAM J. Math. Data Sci., 2(3):583 606, 2020.

Korobov, N. M. Number-Theoretic Methods in Approximate Analysis. Fizmatgiz, Moscow, 1963.

Lanthaler, S. and Nelsen, N. H. Error bounds for learning with vector-valued random features. In Advances in Neural Information Processing Systems, volume 36, pp. 71834 71861, 2023.

Leobacher, G. and Pillichshammer, F. Introduction to Quasi Monte Carlo Integration and Applications. Compact Textbooks in Mathematics. Birkh auser/Springer, Cham, 2014.

Li, Z., Ton, J.-F., Oglic, D., and Sejdinovic, D. Towards a unified analysis of random Fourier features. In International Conference on Machine Learning, pp. 3905 3914. PMLR, 2019.

Liao, Z., Couillet, R., and Mahoney, M. W. A random matrix analysis of random fourier features: beyond the gaussian kernel, a precise phase transition, and the corresponding double descent. In Advances in Neural Information Processing Systems, volume 33, pp. 13939 13950, 2020.

Liu, F., Huang, X., Chen, Y., and Suykens, J. A. K. Random features for kernel approximation: A survey on algorithms, theory, and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):7128 7148, 2022.

Lyu, Y. Spherical structured feature maps for kernel approximation. In International Conference on Machine Learning, pp. 2256 2264. PMLR, 2017.

Mei, S., Misiakiewicz, T., and Montanari, A. Generalization error of random feature and kernel methods: Hypercontractivity and kernel matrix concentration. Applied and Computational Harmonic Analysis, 59:3 84, 2022. Special Issue on Harmonic Analysis and Machine Learning.

Micchelli, C. A., Xu, Y., and Zhang, H. Universal kernels. Journal of Machine Learning Research, 7(12), 2006.

Musco, C. and Musco, C. Recursive sampling for the nystrom method. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

Niederreiter, H. Point sets and sequences with small discrepancy. Monatsh. Math., 104(4):273 337, 1987.

Niederreiter, H. Random Number Generation and Quasi Monte Carlo Methods, volume 63 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, PA, 1992.

Owen, A. B. Multidimensional variation for quasi-monte carlo. In Contemporary Multivariate Analysis And Design Of Experiments: In Celebration of Professor Kai-Tai Fang s 65th Birthday, pp. 49 74. World Scientific, 2005.

Owen, A. B. Halton sequences avoid the origin. SIAM Review, 48(3):487 503, 2006.

Owen, A. B. Practical Quasi-Monte Carlo Integration. https://artowen.su.domains/mc/ practicalqmc.pdf, 2023.

QMC Features

Pace, R. K. and Barry, R. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):291 297, 1997.

Paskov, S. H. and Taub, J. F. Faster valuation of financial derivatives. Journal of Portfolio Management, 22(1):113, 1995.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, volume 20, 2007.

Reid, I., Choromanski, K., and Weller, A. Quasi-monte carlo graph random features. ar Xiv preprint ar Xiv:2305.12470, 2023.

Ritter, K. Average-Case Analysis of Numerical Problems, volume 1733 of Lecture Notes in Mathematics. Springer Verlag, Berlin, 2000.

Roth, K. F. On irregularities of distribution. Mathematika, 1:73 79, 1954.

Rudi, A. and Rosasco, L. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, volume 30, 2017.

Rynne, B. P. and Youngson, M. A. Linear Functional Analysis. Springer Undergraduate Mathematics Series. Springer-Verlag London, Ltd., London, second edition, 2008.

Schmidt, W. Irregularities of distribution, vii. Acta Arithmetica, 21(1):45 50, 1972.

Sch olkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT press, 2002.

Sch olkopf, B., Herbrich, R., and Smola, A. J. A generalized representer theorem. In International Conference on Computational Learning Theory, pp. 416 426. Springer, 2001.

Sejdinovic, D., Sriperumbudur, B., Gretton, A., and Fukumizu, K. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Statist., 41(5):2263 2291, 2013.

Smale, S. and Zhou, D.-X. Estimating the approximation error in learning theory. Analysis and Applications, 1(01): 17 41, 2003.

Smola, A. J. Sparse greedy matrix approximation for machine learning. In International Conference on Machine Learning. Morgan Kaufmann, 2000.

Sobol , I. M. On the distribution of points in a cube and the approximate evaluation of integrals. Zhurnal Vychislitel noi Matematiki i Matematicheskoi Fiziki, 7(4): 784 802, 1967.

Sriperumbudur, B. and Sterge, N. Approximate kernel pca using random features: Computational vs. statistical tradeoff. The Annals of Statistics, 50(5):2713 2736, 2022.

Sriperumbudur, B. and Szab o, Z. Optimal rates for random fourier features. In Advances in Neural Information Processing Systems, volume 28, 2015.

Sriperumbudur, B. K., Fukumizu, K., and Lanckriet, G. R. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12 (7), 2011.

Steinwart, I. and Christmann, A. Support Vector Machines. Information Science and Statistics. Springer, New York, 2008.

Steinwart, I. and Scovel, C. Mercer s theorem on general domains: on the interaction between measures, kernels, and RKHSs. Constr. Approx., 35(3):363 417, 2012.

Sutherland, D. J. and Schneider, J. On the error of random fourier features. In Conference on Uncertainty in Artificial Intelligence, pp. 862 871, 2015.

Szab o, Z. and Sriperumbudur, B. On kernel derivative approximation with random fourier features. In International Conference on Artificial Intelligence and Statistics, pp. 827 836. PMLR, 2019.

Ullah, E., Mianjy, P., Marinov, T. V., and Arora, R. Streaming kernel PCA with O( n) random features. In Advances in Neural Information Processing Systems, volume 31, 2018.

Uzilov, A. V., Keegan, J. M., and Mathews, D. H. Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinformatics, 7:1 30, 2006.

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C. J., Polat, I., Feng, Y., Moore, E. W., Vander Plas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, F., van Mulbregt, P., and Sci Py 1.0 Contributors. Sci Py 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261 272, 2020.

Wahba, G. Spline Models for Observational Data. SIAM, 1990.

Williams, C. and Seeger, M. Using the Nystr om method to speed up kernel machines. In Advances in Neural Information Processing Systems, volume 13. MIT Press, 2000.

QMC Features

Yang, J., Sindhwani, V., Avron, H., and Mahoney, M. Quasimonte carlo feature maps for shift-invariant kernels. In International Conference on Machine Learning, pp. 485 493. PMLR, 2014.

QMC Features

A. Some Further Discussions

In this section, we elaborate on some parts of the main text that were initially deferred so as not to impede the flow of the paper.

A.1. Kernels Satisfying QMC Condition 1

Here we provide a proof for Proposition 2.1 (i.e., Gaussian kernel and Cauchy kernel over a compact domain satisfy QMC Condition 1).

A.1.1. GAUSSIAN KERNEL

To show that Gaussian kernel over a compact domain satisfies QMC Condition 1, we first show the following lemma:

Lemma A.1. If Φ is the distribution function of the standard normal distribution, then there exists C > 0 such that

d dtΦ 1(t) C min(t, 1 t) 1, for all t (0, 1). (A.1)

Proof. Note that d

dtΦ 1(t) = 1 ϕ(Φ 1(t)) = 1 ϕ(Φ 1(1 t)), where ϕ is the density function of the standard normal distribution. Therefore, by the symmetry about t = 1/2, we only need to consider t (0, 1/2] in (A.1), which reduces to

1 ϕ(Φ 1(t)) Ct 1.

Let x = Φ 1(t) ( , 0]. The above inequality is equivalent to

Φ(x) Cϕ(x).

If x 1, then

2π 1 |x|e x2

If 1 < x 0, then Φ(x)

Hence, by taking C = max 1, Φ(0)

ϕ( 1) , the inequality (A.1) holds.

For a Gaussian kernel with a general bandwidth, the cumulative distribution function Φi(t) (as defined in QMC Condition 1) is given by Φi(t) = Φ(t/σ) for some σ > 0. Therefore, Φ 1 i (t) = σΦ 1(t). By Lemma A.1, d

dtΦ 1 i (t) Ci min(t,1 t) for all t (0, 1), where Ci = σC.

A.1.2. CAUCHY KERNEL

For Cauchy kernel K(x, x ) = Qd i=1 1 1+λ2(xi x i)2 , Φi(t) (as defined in QMC Condition 1) is the cumulative distribution

function of the symmetrized exponential distribution (a.k.a. Laplace distribution) with Lebesgue density ϕi(x) = λ

2 e λ|x|. Therefore, Φi(x) = 1

2eλx1x 0 + 1 1

2e λx 1x>0, and Φ 1 i (t) = λ 1(log(2t)1t 1/2 log(2(1 t))1t 1/2). By taking the derivative w.r.t. t, we have

d dtΦ 1 i (t) = λ 1 1

t 1t 1/2 + 1 1 t1t 1/2

2λ 1 min(t, 1 t) 1.

Therefore, Cauchy kernel over a compact domain satisfies QMC Condition 1.

QMC Features

A.2. Kernels Satisfying QMC Condition 2

A.2.1. ITERATIVE KERNEL

Suppose K1( , ) is a continuous kernel on [0, 1]d, and µ is a positive integrable function on [0, 1]d. The iterative kernel (Courant & Hilbert, 1953, Section III.5.3) of K1 is defined as

K2(x, z) := Z

[0,1]d K1(x, t)K1(z, t)µ(t)dt.

QMC Condition 2 holds if the Hardy-Krause variation of fx,z(t) := K1(x, t)K1(z, t)µ(t) is bounded by some C0 > 0 for all x, z [0, 1]d. A sufficient condition for the existence of such a C0 is that there exists a constant C0 > 0 such that for all u {1, . . . , d} and x, z, t [0, 1]d, |(Q

i u ti )fx,z(t)| C0, due to the fact that (Niederreiter, 1992, Section 2.2):

VHK(f; [0, 1]d) = X

I {1,...,d},I =

Note that µ can be viewed as a Lebesgue density function and thus induces a strictly positive and finite Borel measure on [0, 1]d, with the measure of a set B defined as R

B µ(x)dx. Recall the integral operator L : L2(µ) L2(µ) defined as:

[0,1]d K1(x, t)f(t)µ(t)dt.

Mercer s theorem (Steinwart & Scovel, 2012) implies that there exists a continuous orthonormal basis {ei} of L2(µ) consisting of eigenfunctions of the integral operator L, with corresponding nonnegative eigenvalues {λi} satisfying P

i λi < + . Moreover, K1(x, y) = X

i λiei(x)ei(y),

where the convergence is absolute and uniform.

Observe that

K2(x, z) = Z

[0,1]d K1(x, t)K1(z, t)µ(t)dt = Z

[0,1]d K1(x, t) X

i λiei(t)ei(z)µ(t)dt.

Since supx,t [0,1]d |K1(x, t)| < by the continuity of K1, and the convergence of Mercer s series is absolute and uniform, we can exchange the summation and integration to obtain

K2(x, z) = X

[0,1]d K1(x, t)λiei(t)ei(z)µ(t)dt

i λiei(z) Z

[0,1]d K1(x, t)ei(t)µ(t)dt

i λiei(z)Lei(x)

i λ2 i ei(z)ei(x).

We can see that the integral operators of K1 and K2 share the same set of eigenfunctions, while the eigenvalues of K2 (i.e., {λ2 i }) decay faster than those of K1 (i.e., {λi}).

A.2.2. PRODUCT KERNEL

Here we give a proof for Example 5 in Section 2.2: if for i = 1, . . . , d, Ki(u, v) = R 1 0 ψi(u, t)ψi(v, t)dt satisfies QMC Condition 2 and |ψi(u, t)| κi for some κi and all u, t, then

i=1 Ki(ui, vi) = Z

[0,1]d ψ(u, t)ψ(v, t)dt,

QMC Features

where ψ(x, t) = Qd i=1 ψi(xi, ti), satisfies QMC Condition 2.

Definition A.2 (Vitali Variation). Consider a rectangle [a, b] := [a1, b1] [ad, bd] Rd and a function f : [a, b] R. Define hk(f, x) := f(x1, . . . , xk + hk, . . . , xd) f(x1, . . . , xk, . . . , xd),

and, recursively, h1h2...hk(f, x) := hk h1...hk 1, x .

Consider the collection Πk of finite ordered families πk of points t1 k < t2 k < . . . < t Nk+1 k [ak, bk]. Denote hi k = ti+1 k ti k. The Vitali variation of f is defined as

V[a,b](f) := sup (π1,...,πd) Π1 Πd

hi1 1 h id n f, (xi1 1 , . . . , xid n ) .

Note that on R1, Vitali variation coincides with the usual notion of total variation. If fi : [ai, bi] R has total variation V[ai,bi](fi), then from Definition A.2 we deduce that f(x) = Qd i=1 fi(xi) has Vitali variation V[a,b](f) = Qd i=1 V[ai,bi](fi). With this fact, we can prove the following proposition.

Proposition A.3. Consider a rectangle [a, b] := [a1, b1] [ad, bd] Rd, and fi : [ai, bi] R has total variation V[ai,bi](fi) Ci and |fi(bi)| κi. Then the Hardy-Krause variation of f(x) := Qd i=1 fi(xi) is bounded by C = P

u {1,...,d},u = Q

Proof. By Owen (2005, Definition 2), the Hardy-Krause variation can be written in terms of Vitali variation:

VHK(f; [a, b]) = X

u {1,...,d},u = V[au,bu]f(xu; b u) X

u {1,...,d},u =

Here, u denotes the complement {1, . . . , d}\u; xu denotes the sub-vector from x by taking all xj with j u; xu : b u denotes a vector y Rd with yj = xj for j u and yj = bj for j / u; and when b u is held fixed, f(xu : b u) is a function of xu, and this function is denoted by f(xu; b u).

By applying Proposition A.3 to fi(t) = ψi(u, t)ψi(v, t) and [a, b] = [0, 1]d, we see that the product kernel in Example 5 satisfies QMC Condition 2.

A.3. A Review of Concepts from Functional Analysis

We start with a brief review of some notions from functional analysis that will be important for subsequent discussions; see e.g., Rynne & Youngson (2008); Aubin (2011) for a detailed study of these concepts. By a bounded linear operator A from a Hilbert space F to a Hilbert space G we mean a linear map A : F G such that, for some L 0, Av G L v F, for all v F. The smallest L such that the previous inequality holds is called the operator norm of A, denoted by A . There is a unique bounded operator A : G F, called the adjoint of A, such that u, Av G = A u, v F, for all u G, v F. Let ran A := {Av : v F} denote the range of the operator A and ker A := {v F : Av = 0} denote the kernel of A. A bounded linear operator A from a Hilbert space H to itself is nonnegative if u, Au H 0 for all u H. We say that A is self-adjoint if A = A. Suppose that H is separable with orthonormal basis {ei}i 1. Then the trace of a non-negative A is defined as tr(A) := P i Aei, ei H. For a nonnegative operator A, if tr(A) < , then A is said to be a trace-class operator. A compact operator from a Hilbert space F to another Hilbert space G is a linear operator L such that the image under L of any bounded subset of F is a relatively compact subset (has compact closure) of G. Such an operator is necessarily a bounded operator, and thus is continuous. For a bounded linear operator L from a Hilbert space F to another Hilbert space G, L is compact if and only if L is compact, and they are also equivalent to LL being compact or L L being compact. For every compact self-adjoint operator L from a Hilbert space H to itself, the spectral theorem states that there exists an orthonormal basis {xi} of H consisting of eigenvectors of L, i.e., Lxi = λixi where nonzero λi s are at most countable and converge to 0. If furthermore L is nonnegative, then Lr with r > 0 can be defined by Lrxi = λr i xi for all i.

Let F and G be separable Hilbert spaces. Let {fi}i I to be an orthonormal basis for F, and let {gj}j J be an orthonormal basis for G; here I and J are indexing sets being either finite or countably infinite.

QMC Features

Definition A.4 (Hilbert-Schmidt operators). The Hilbert-Schmidt norm of a compact operator L : G F is defined to be

L 2 HS := X

j J Lgj 2 F.

The operator L is Hilbert-Schmidt when this norm is finite. The Hilbert-Schmidt operators mapping from G to F form a Hilbert space, written HS(G, F), with inner product

L, M HS := X

j J Lgj, Mgj F, (A.2)

which is independent of the orthonormal basis chosen. Here L : G F and M : G F are two Hilbert-Schmidt operators.

A.4. A Review of Mercer Kernels

By a kernel function K : X X R we mean a symmetric and nonnegative definite function such that K(x, ) is a (real-valued) measurable function on X, for all x X. Denote by H the reproducing kernel Hilbert space associated with K, and , H the associated inner product . Then H is a Hilbert space of real-valued functions on X such that, for any f H, we have f(x) = f, K(x, ) H, for all x X; this is usually referred to as the reproducing property of the kernel K. Let µ be a measure on X.

Definition A.5 (Mercer kernel). We call a kernel K : X X R a Mercer kernel (w.r.t. µ) if it is continuous and R

X K(x, x)dµ(x) < .

If X is a separable space and K is continuous, then H is separable; see e.g., Hein & Bousquet (2004, Theorem 7), Steinwart & Christmann (2008, Lemma 4.33). Therefore, a Mercer kernel on X Rd automatically induces a separable RKHS.

Definition A.6 (Inclusion operator). Suppose K is a Mercer kernel. Define the inclusion operator I : H L2(µ) by identifying a function in H as a function in L2(µ), i.e., I(f) = f L2(µ). Note that If 2 L2(µ) = R

X f 2(x)dµ(x) R

X f 2 H K(x, ) 2 Hdµ(x) = R

X K(x, x)dµ(x) f 2 H, which implies I is a bounded linear operator. Its adjoint operator I : L2(µ) H is given by:

I g(x) = I g, K(x, ) H = g, IK(x, ) L2(µ) = Z

X K(x, t)g(t)dµ(t).

An important property of the inclusion operator I is that it is injective as long as µ is supported on the entire X. To see this, suppose f H with If = 0. The continuity of K implies that f is continuous. If a continuous function has L2(µ) norm being 0, then it must be identically 0. Hence, the injectivity is shown.

It follows from a direct verification that the integral operator defined in (4) is L = II . Let L1/2 be the unique self-adjoint nonnegative square root operator of L (see Appendix A.3 for the definition). Then L1/2 induces an isometry from L2(µ) to I(H):

Theorem A.7 (Isometry induced by L1/2). Suppose K is a Mercer kernel (w.r.t. µ) and µ has full support on X. Then L1/2

induces an isometry from (ker L1/2) to I(H), where I(H) is equipped with the inner product I(f), I(g) := f, g H for f, g H.

Proof. Let I be the inclusion operator defined above, and {en}n 1 be an orthonormal basis of H. Observe that

tr(I I) = X

n en, I Ien H = X

n Ien, Ien L2(µ) = X

Z e2 n(x)dµ(x) = X

Z K(x, ), en 2 Hdµ(x)

= Z K(x, ) 2 Hdµ(x) = Z K(x, x)dµ(x) < .

Hence, I I is trace-class and therefore a compact operator. As a result, I, I , II are all compact operators. By the spectral theorem of the compact self-adjoint operator on a Hilbert space, there exists an orthonormal basis {fi}i 1 of L2(µ) which are eigenvectors of II . Assume that {λi}i 1 are the corresponding eigenvalues, i.e., II fi = λifi for all i. Since I is injective (by the fact that K is a Mercer kernel and µ has full support), we have (ran I ) = ker I = {0}. Therefore,

QMC Features

the closure of ran I is the entire H. Moreover, I fi, I fj H = fi, II fj L2(µ) = fi, λjfj L2(µ) = 0 for i = j, and I fi, I fi H = fi, II fi L2(µ) = fi, λifi L2(µ) = λi. Hence, {gi = I fi/ λi : λi > 0} is an orthonormal basis of H.

We first establish the bijection between (ker L1/2) and I(H). Any function in (ker L1/2) can be uniquely written as P

i aifi with P

i a2 i < , where fi corresponds to a positive eigenvalue λi > 0. This function is mapped by L1/2 to P

i λiaifi. On the other hand, any function in H can be uniquely written as P

i aigi with P

i a2 i < , which gets mapped to P

i λiaifi by I. Therefore, a bijection between (ker L1/2) and H is established, by mapping P

i aifi to P i aigi. Note that for any {ai}i 1, {bi}i 1 satisfying P i a2 i < and P

i b2 i < , we have

This shows that the bijection preserves inner products, and is therefore an isometry.

A direct consequence of Theorem A.7 is the existence of an integral representation in the form of (1):

Proposition A.8 (Existence of integral representation). Suppose K is a Mercer kernel and µ has full support on X. Then there exists ψ : X X R such that

K(x, x ) = Z

X ψ(x, ω)ψ(x , ω)dµ(ω).

Proof. From Theorem A.7, there exists an isometry between H and a subspace of L2(µ). Suppose K(x, ) is mapped to ψ(x, ) L2(µ) under this isometry. Then

K(x, x ) = K(x, ), K(x , ) H = ψ(x, ), ψ(x , ) L2(µ) = Z

X ψ(x, ω)ψ(x , ω)dµ(ω).

B. Proofs of the Results in Section 2

The main idea of the proof of Theorem 2.2 is to find a low variation function f that coincides with the integrand f in (6) on a large set , and apply the classical Koksma-Hlawka inequality (Theorem 1.1) to f. QMC Condition 1 is used to control both VHK( f) and the behavior the integrand outside the large set . The fact that Halton sequence avoids the boundary of the unit cube (Owen, 2006) will be useful, which allows us to claim f = f on the first n points of the Halton sequence.

B.1. Proof of Theorem 2.2

We first introduce some notations: for a non-empty set u {1, . . . , d + 1} and a function f on Rd+1, uf(x) represents (Q

j u / xj)f(x). Let u denote the complement {1, . . . , d + 1}\u, and xu denote the sub-vector from x by taking all xj with j u. For x, z [0, 1]d+1, xu : z u denotes a vector y Rd+1 with yj = xj for j u and yj = zj for j / u. When z u is held fixed, then f(xu : z u) is a function of xu, and this function is denoted by f(xu; z u).

Observe that the integrand in (6) can be re-written as

f(t, b) = cos (x x ) Φ 1(t) cos (x + x ) Φ 1(t) + 4πb .

Let D = maxx,y X,i {1,...,d}{|xi yi|, |xi + yi|}. Then for any non-empty set u {1, . . . , d + 1} and (t, b) (0, 1)d+1,

| uf(t, b)| 4πD|u\{d+1}| Y

d dti Φ 1 i (ti).

Theorem 2.2 is a direct consequence of the following lemma:

QMC Features

Lemma B.1. Suppose f : (0, 1)d+1 R satisfies | uf(x)| B Q

i u\{d+1} 1 min(xi,1 xi) for some B > 0 and any non-empty set u {1, . . . , d + 1}. Let h1, . . . , hn be the first n points of the (d + 1)-dimensional Halton sequence. There exists a constant C(d) depending only on d such that for all n 2,

[0,1]d+1 f(x)dx 1

BC(d)(log n)2d+1

Proof of Lemma B.1. The main idea is to find a low variation function fn that coincides with f on a large set Kn, so that the classical Koksma-Hlawka inequality (Theorem 1.1) can be applied to fn. Outside Kn, f can be controlled with the given bounds on the derivatives. Note that if h1, . . . , hn Kn, then

[0,1]d+1 f(x)dx 1

[0,1]d+1 |f(x) fn(x)|dx + D ({hi}n i=1)VHK( fn) + 1

i=1 | fn(hi) fn(hi)|

[0,1]d+1 |f(x) fn(x)|dx + D ({hi}n i=1)VHK( fn).

(B.1) We will construct fn, Kn, and bound each term on the right-hand side above.

Let c = (1/2, . . . , 1/2) Rd+1. We can write f(x) as an integral

f(x) = f(c) + X

[cu,xu] uf(zu : c u)dzu.

For εn (0, 1/2) and Kn = [εn, 1 εn]d+1, we use the same low variation extension fn of f as in Owen (2006, Equation 2.9):

fn(x) = f(c) + X

[cu,xu] 1zu:c u Kn uf(zu : c u)dzu.

Then fn coincides with f on Kn. By definition (see e.g., Owen, 2005, Definition 2), the Hardy-Krause variation of fn over [0, 1]d+1 is:

VHK( fn; [0, 1]d+1) = X

u = V[0u,1u] fn(xu; 1 u), (B.2)

where V[a,b]( ) is the Vitali variation. Note that the Vitali variation satisfies the bound (see e.g., Owen, 2005, Proposition 13)

V[a,b](g) Z

[a,b] | {1,...,p}g(x)|dx, for g( ) defined on a hyperrectangle [a, b] Rp. (B.3)

Plugging the bound (B.3) into the definition (B.2), we obtain

VHK( fn; [0, 1]d+1) X

[εn,1 εn]|u| | uf(xu : c u)|dxu

[εn,1 εn]|u| B Y

j u\{d+1} min(xj, 1 xj) 1dxu

u = ((1 2εn) 1d+1 u + 1d+1/ u) Y

j u\{d+1} 2(log(1/2) log εn)

j u\{d+1} 2(log(1/2) log εn).

QMC Features

We may simplify the above sum by considering whether d + 1 u: X

j u\{d+1} 2(log(1/2) log εn)

u = ,u {1,...,d}

j u 2(log(1/2) log εn) + X

u {1,...,d+1},d+1 u

j u\{d+1} 2(log(1/2) log εn)

j=1 (1 + 2 log(1/2) 2 log εn)

j=1 (1 + 2 log(1/2) 2 log εn)

j=1 (1 + 2 log(1/2) 2 log εn) = 2(1 + 2 log(1/2) 2 log εn)d.

Therefore, we obtain the following bound for VHK( fn; [0, 1]d+1):

VHK( fn; [0, 1]d+1) 2B(1 2 log 2 2 log εn)d. (B.4)

The star discrepancy of the (d + 1)-dimensional Halton sequence satisfies

D ({hi}n i=1) CH(d + 1) (log n)d+1

n , for n 2, (B.5)

where CH(d + 1) is a constant depending on the dimension d + 1 (see e.g., Niederreiter, 1992, Theorem 3.6 and Atanassov, 2004). Hence, it remains to bound the term R

[0,1]d+1 |f(x) fn(x)|dx in (B.1). Observe that for x (0, 1)d+1,

|f(x) fn(x)| X

[cu,xu] 1zu:c u / Kn| uf(zu : c u)|dzu

[cu,xu] 1zu:c u / Kn B Y

j u\{d+1} min(zj, 1 zj) 1dzu

j u\{d+1} min(zj, 1 zj) 1dzu

By considering whether d + 1 u, we have

j u\{d+1} min(zj, 1 zj) 1dzu

u = ,u {1,...,d}

j u\{d+1} min(zj, 1 zj) 1dzu

u {1,...,d+1},d+1 u

j u\{d+1} min(zj, 1 zj) 1dzu

u = ,u {1,...,d}

j u | log(1/2) log(min(xj, 1 xj))|

u {1,...,d+1},d+1 u |1/2 xd+1| Y

j u\{d+1} | log(1/2) log(min(xj, 1 xj))|

1 + | log(1/2) log(min(xj, 1 xj))|

+ |1/2 xd+1|

1 + | log(1/2) log(min(xj, 1 xj))|

QMC Features

1 + | log(1/2) log(min(xj, 1 xj))| .

Therefore, |f(x) fn(x)| 3B

2 Qd j=1 1 + | log(1/2) log(min(xj, 1 xj))| . Recall that fn coincides with f on Kn = [εn, 1 εn]d+1. Hence, we have Z

(0,1/2)d+1 |f(x) fn(x)|dx 3B

(0,1/2)d+1\[εn,1/2]d+1

j=1 (1 + log(1/2) log(xj))dx

0 (1 + log(1/2) log(xj))dxj

k {1,...,d}\{j} (1 + log(1/2) log(xk))dx j

0 (1 + log(1/2) log(xk))dxk

j=1 ((2 log 2)εn εn log εn) 1

4 2 + (2 log 2)d d log εn .

The same argument can be applied to all 2d+1 subcubes of (0, 1)d+1, which yields Z

(0,1)d+1 |f(x) fn(x)|dx 3 2d 1Bεn 2 + (2 log 2)d d log εn . (B.6)

Finally, we use the fact that Halton sequence avoids the boundary of the unit cube: by choosing εn = 1 n(n+1) Qd j=1 p 1 j , h1, . . . , hn belong to Kn = [εn, 1 εn]d+1 (Owen, 2006, Theorem 3.1). Now, we can apply the above bounds on R

[0,1]d+1 |f(x) fn(x)|dx (B.6), D ({hi}n i=1) (B.5), and VHK( fn) (B.4) back to (B.1) to obtain:

[0,1]d+1 f(x)dx 1

[0,1]d+1 |f(x) fn(x)|dx + D ({hi}n i=1)VHK( fn)

3 2d 1Bεn 2 + (2 log 2)d d log εn + CH(d + 1) (log n)d+1

n 2B(1 2 log 2 2 log εn)d

BC(d)(log n)2d+1

for some C(d) > 0 and all n 2.

Note that the coefficient of the dominating term (log n)2d+1

n is CH(d + 1)22d+1B.

B.2. Proof of Proposition 2.6

For any g L2(PX), we have

(LM L)g 2 L2(PX) =

Z (KM( , x) K( , x))g(x)d PX(x)

= Z (KM(z, x) K(z, x))(KM(z, x ) K(z, x ))g(x)g(x )d PX(x)d PX(x )d PX(z)

Z |g(x)g(x )|d PX(x)d PX(x )

M 2 g 2 L2(PX).

Therefore, by the definition of operator norm, LM L = supg L2(PX) (LM L)g L2(PX)

g L2(PX) C loga M

QMC Features

B.3. Proof of Proposition 2.7

Note that the -spectral approximation (7) is equivalent to

K (K + λIn) KM K + (K + λIn).

Let K + λIn = V Σ2V be the eigen-decomposition of K + λIn. By multiplying by Σ 1V on the left and V Σ 1 on the right, it suffices to show that Σ 1V(KM K)V Σ 1 2 ,

where 2 denotes the matrix L2 norm, or the spectral norm. Since the eigenvalues of K + λIn is lower bounded by λ, we have that Σ 1 2 1

λ. Together with the facts that (i) UAV 2 = A 2 for any orthogonal matrices U, V , and (ii) the spectral norm is upper bounded by the Frobenius norm, we have that

Σ 1V(KM K)V Σ 1 2 1

λ V(KM K)V 2

n2C2(log M)2a

C. Proofs of the Results in Section 3

The goal of this section is to prove Theorem 3.2. We will first define some useful operators and clarify some notions about the excess risk with the inclusion operator notations in subsection C.1, and then provide the proof in subsection C.2.

C.1. Preliminaries

Some Useful Operators. We first introduce some useful operators between Hilbert spaces (Rudi & Rosasco, 2017).

Definition C.1. Recall ϕM(x) = M 1/2(ψ(x, ω1), . . . , ψ(x, ωM)) and the approximated kernel KM(x, x ) = ϕM(x) ϕM(x ). Define:

SM : RM L2(PX), (SMβ)( ) = ϕM( ) β.

S M : L2(PX) RM, (S Mg)i = 1

X ψ(x, ωi)g(x)d PX(x).

LM : L2(PX) L2(PX), (LMg)( ) = R

X KM( , z)g(z)d PX(z).

CM : RM RM, CM = R

X ϕM(x)ϕM(x) d PX(x).

ˆCM : RM RM, ˆCM = 1

n Pn i=1 ϕM(xi)ϕM(xi) .

ˆSM : RM Rn, ˆSM = 1 n(ϕM(x1), . . . ,ϕM(xn)) .

ˆS M : Rn RM, ˆS M = 1 n(ϕM(x1), . . . ,ϕM(xn)).

It follows from direct verification that S M is the adjoint of SM, ˆS M is the adjoint of ˆSM, LM = SMS M, CM = S MSM, and ˆCM = ˆS M ˆSM. Let CK = I I where I is the inclusion operator defined in Appendix A.4. Recall (from Appendix A.4) that L = II . From Caponnetto & De Vito (2007) and Rudi & Rosasco (2017), we have that L, CK, LM, CM, ˆCM are trace class operators, and I, SM, ˆSM are compact operators.

Excess Risk. Recall the excess risk: E( ˆfλ,M) inff H E(f) defined in (11), where E(f) = E[Y f(X)]2. In order to more clearly distinguish between the inner products , H and , L2(PX), we identify f H as If L2(PX) when the associated inner product is , L2(PX), where I is the inclusion operator (introduced in Appendix A.4). With these notations, the excess risk is (more rigorously) written as E( ˆfλ,M) inff H E(If).

QMC Features

Recall that f (X) = E[Y |X]. A variance-bias decomposition of the risk gives:

E(f) = E[(Y f(X))2] = E[(Y f (X) + f (X) f(X))2]

= E[(Y f (X))2] + E[(f (X) f(X))2]. (C.1)

Therefore, the existence of an f H H that minimizes E(If) (KRR Condition 3) is equivalent to the existence of an If H in I(H) being closest to f in L2(PX) distance. In other words, let P : L2(PX) L2(PX) be the projection operator from L2(PX) to the closure of I(H) in L2(PX); if KRR Condition 3 holds, we have If H = Pf = Lrg.

Hence, from (C.1) we can re-write the excess risk as follows (the last equality follows from Pythagorean theorem and the fact that ˆfλ,M I(H)):

E( ˆfλ,M) inf f H E(If) = E( ˆfλ,M) E(If H)

= f ˆfλ,M 2 L2(PX) f Pf 2 L2(PX) = ˆfλ,M Pf 2 L2(PX). (C.2)

C.2. Proof of Theorem 3.2

For any operator T on some space S, define Tλ := T + λIS, where IS denotes the identity operator on S. For S = RM, IS is the M M identity matrix IM. The regression function ˆfλ,M(x) defined in (10) can then be written as

ˆfλ,M = SM( ˆCM + λIM) 1 ˆS My/ n = SM ˆC 1 M,λ ˆS My/ n, where y = (y1, . . . , yn) . (C.3)

Using the operators introduced in the previous subsection, we define

f = I(CK + λIH) 1I Pf = IC 1 K,λI Pf and (C.4)

f M = SM(CM + λIM) 1S MPf = SMC 1 M,λS MPf . (C.5)

Here, ˆfλ,M can be viewed as the empirical version of f M, and f M can be viewed as the random feature approximation of f. With triangle inequality, we decompose the excess risk (C.2) into ˆfλ,M Pf L2(PX) E1 + E2 + E3, where

E1 := f Pf L2(PX),

E2 := f M f L2(PX),

E3 := ˆfλ,M f M L2(PX).

In the above decomposition (C.6), E1, E2, E3 may be understood as follows:

E1 is the bias introduced by the ridge regression penalty, which also appears in the analysis of the exact kernel ridge regression (De Vito et al., 2005), and is not affected by the application of random feature approximation.

E2 is the random feature computational error arising from approximating the kernel K with KM. It is worth noting that it is in this term that our error rate has substantial improvement compared to the bound for the Monte Carlo random features (Rudi & Rosasco, 2017).

E3 is the variance coming from n i.i.d. draws from the population. While E3 exhibits an error rate comparable to that in the Monte Carlo random feature case, our analysis differs in several aspects. In particular, results that hold almost surely may need to be replaced by results that hold surely, and some bounds also have tighter and non-random versions under the QMC setting.

In the following, we will bound these three terms respectively. Throughout the paper, norms without subscripts are considered operator norms unless otherwise specified.

Bound for E1: The following expressions

f = LL 1 λ Pf , f M = LML 1 M,λPf

QMC Features

follow from the definitions of f, f M in (C.4), (C.5), together with the identity that: for a bounded linear operator A : H1 H2,

A (AA + c IH2) 1 = (A A + c IH1) 1A . (C.7)

From KRR Condition 3, we have that Pf = Lrg with g L2(PX) R. By the identity L(L + λIL2(PX)) 1 = IL2(PX) λ(L + λIL2(PX)) 1, we have

E1 = LL 1 λ Pf Pf L2(PX) = (LL 1 λ IL2(PX))Pf L2(PX) = λL 1 λ Pf L2(PX) = λL 1 λ Lrg L2(PX) λ L 1+r λ L r λ Lr g L2(PX) λ λ 1+r 1 R

Bound for E2: Denote a general identity operator whose associated space is unspecified by I. By the algebraic identities A(A + λI) 1 = I λ(A + λI) 1 and A 1 B 1 = A 1(B A)B 1, we have that

(LL 1 λ LML 1 M,λ)Pf = λ(L 1 M,λ L 1 λ )Pf = λL 1 M,λ(L LM)L 1 λ Pf .

From KRR Condition 3, we have that Pf = Lrg with g L2(PX) R. From KRR Condition 1 and Proposition 2.6, we have that L LM C loga M

M . Therefore,

E2 = (LL 1 λ LML 1 M,λ)Pf L2(PX) = λL 1 M,λ(L LM)L 1 λ Lrg L2(PX)

λ L 1 M,λ L LM L 1+r λ L r λ Lr g L2(PX)

λ λ 1 C loga M

M λ 1+r 1 R

= Rλ 1+r C loga M

Bound for E3: Recall that ˆfλ,M = SM ˆC 1 M,λ ˆS My/ n and f M = SMC 1 M,λS MPf . With triangle inequality, we decompose E3 into E3 E31 + E32 + E33, where

E31 = SM ˆC 1 M,λ( ˆS My/ n S Mf ) L2(PX), (C.8)

E32 = SM ˆC 1 M,λS M(IL2(PX) P)f L2(PX), (C.9)

E33 = SM( ˆC 1 M,λ C 1 M,λ)S MPf L2(PX). (C.10)

We will show that E32 = 0, and when 19κ2

2δ λ min{ L C loga M

M , 1/e} and n max{405κ2, 67κ2 log 3κ2

2δ } (the constants a, κ and C are from KRR Condition 1), with probability at least 1 δ,

8σ2κ2 log 6

C (1 + a)a + 1

4Rκ2r+1 log 6

δ + 2Rκ2r r

nλ . (C.12)

The proofs of the bounds for E31, E32 and E33 are given in Appendix C.3.

Combining the bounds: Combining the bounds for E1, E2 and E3 above, we have that when 19κ2

QMC Features

min{ L C loga M

M , 1/e} and n max{405κ2, 67κ2 log 3κ2

2δ }, with probability at least 1 δ:

E( ˆfλ,M) inf f H E(f)

1/2 = ˆfλ,M Pf L2(PX) E1 + E2 + E3

Rλr + Rλ 1+r C loga M

8σ2κ2 log 6

C (1 + a)a + 1

4Rκ2r+1 log 6

δ + 2Rκ2r r

As M = loga(1/λ)

λ and λ (0, 1/e], we can derive that C loga M

λM C (1 + a)a; see (C.16). Thus we may further bound the right-hand side by:

E( ˆfλ,M) inf f H E(f)

δ + 3Rκ2r q

C (1 + a)a + 1 2κ log 6

+ R (C (1 + a)a + 1) λr.

Let λ = Cn 1 2r+1 . When n C2 max n (log 1

2r , 1 o for a large enough constant C2 depending on κ, C, a, L , C and

r (but not on δ), all the above conditions on n and λ are satisfied. Since 1 q

δ for δ 1, the above bound

can be further upper bounded: E( ˆfλ,M) inff H E(f) C1n 2r 2r+1 log2 6

2κσ + 3Rκ2r(2κ + 1) q

C (1 + a)a + 1 1 p

C + R (C (1 + a)a + 1) Cr #2

C.3. Proof of Theorem 3.2: Proof of Bounds for E31, E32 and E33

We prove the bound (C.11) in Appendix C.3.1, E32 = 0 in Appendix C.3.2, and the bound (C.12) in Appendix C.3.3. Some supporting technical lemmas are presented in Appendix C.3.4.

C.3.1. PROOF OF BOUND FOR E31

Recall that is the operator norm; in particular, for a vector v RM, v is its vector ℓ2-norm. From the expression of E31 in (C.8), we have E31 b1 A, where b1 = SM ˆC 1 M,λC1/2 M,λ , and A = C 1/2 M,λ ( ˆS My/ n S Mf ) . We bound these two terms respectively.

Bounding b1: We have

b2 1 = SM ˆC 1 M,λC1/2 M,λ 2

= C1/2 M,λ ˆC 1 M,λS MSM ˆC 1 M,λC1/2 M,λ

= C1/2 M,λ ˆC 1 M,λCM ˆC 1 M,λC1/2 M,λ

C1/2 M,λ ˆC 1/2 M,λ ˆC 1/2 M,λ C1/2 M C1/2 M ˆC 1/2 M,λ ˆC 1/2 M,λ C1/2 M,λ

= ˆC 1/2 M,λ C1/2 M,λ 2 ˆC 1/2 M,λ C1/2 M 2.

Note that ˆC 1/2 M,λ C1/2 M ˆC 1/2 M,λ C1/2 M,λ C 1/2 M,λ C1/2 M ˆC 1/2 M,λ C1/2 M,λ . Therefore, we have that b2 1 ˆC 1/2 M,λ C1/2 M,λ 4

and thus b1 ˆC 1/2 M,λ C1/2 M,λ 2.

QMC Features

A direct application of Proposition C.5 (with A = ˆCM, B = CM) gives us b1 1 1 λmax(C 1/2 M,λ (CM ˆ CM)C 1/2 M,λ ).

In order to bound 1 1 λmax(C 1/2 M,λ (CM ˆ CM)C 1/2 M,λ ), we verify the conditions needed to apply Proposition C.4. Re-

call that ϕM(x) = M 1/2(ψ(x, ω1), . . . , ψ(x, ωM)) , CM = E[ϕM(X) ϕM(X)] (from Definition C.1), and ˆCM = 1 n Pn i=1[ϕM(xi) ϕM(xi)]. For v := ϕM(X), we have v, C 1 M,λv v 2 C 1 M,λ κ2

λ . Hence, F (λ)

in Proposition C.4 can be taken as κ2

λ . From Proposition C.4, for any τ > 0, when 19κ2

2τ λ CM and n max{405κ2, 67κ2 log κ2

2τ }, we have

λmax(C 1/2 M,λ (CM ˆCM)C 1/2 M,λ ) 1

with probability at least 1 τ. Note that

CM = S MSM = SMS M = LM L L LM L C loga M

Summarizing the above conditions, when 19κ2

2τ λ L C loga M

M and n max{405κ2, 67κ2 log κ2

2τ }, we have b1 1 1 1/3 = 3/2 with probability at least 1 τ.

Bounding A: By the definition of ˆS M (see Definition C.1), we have that ˆS My/ n = n 1 Pn i=1 ϕM(xi)yi. Therefore,

C 1/2 M,λ ( ˆS My/ n S Mf ) = 1

i=1 (C 1/2 M,λ ϕM(xi)yi C 1/2 M,λ S Mf ).

To bound the above sum by concentration inequalities, we apply Proposition C.2 with zi := C 1/2 M,λ ϕM(xi)yi and µ :=

C 1/2 M,λ S Mf RM, whose conditions are verified as follows. First, we verify that Ezi = µ:

Ezi = C 1/2 M,λ E[ϕM(xi)yi] = C 1/2 M,λ E[ϕM(xi)E[yi | xi]] = C 1/2 M,λ E[ϕM(xi)f (xi)] = C 1/2 M,λ S Mf = µ.

Next, we bound the k-th moment (k 2) of zi using KRR Condition 2:

E zi k = E[ C 1/2 M,λ ϕM(xi)yi k] = E[ C 1/2 M,λ ϕM(xi) k |yi|k] = E[ C 1/2 M,λ ϕM(xi) k E[|yi|k | xi]]

E[ C 1/2 M,λ ϕM(xi) k] 1

Recall that ϕM(x) = M 1/2(ψ(x, ω1), . . . , ψ(x, ωM)) and ψ is bounded by κ (KRR Condition 1). Hence,

C 1/2 M,λ ϕM(xi) 2 1

λ ϕM(xi) 2 κ2

E zi k E[ C 1/2 M,λ ϕM(xi) k] 1

By Jensen s inequality, triangle inequality, and generalized H older s inequality, we have

E zi µ k = E zi Ezi+1 k E zi zi+1 k

E[( zi + zi+1 )k] E[2k 1( zi k + zi+1 k)] = 2k E zi k.

With our bound for E zi k in (C.13),

E zi µ k 2k E zi k 1

QMC Features

Now, we are ready to apply Proposition C.2. With probability at least 1 τ, we have that

A = C 1/2 M,λ ( ˆS My/ n S Mf ) =

8σ2κ2 log 2

Combining the bound b1 3/2 with the above bound on A, by setting τ to be δ/3, we have that (C.11) holds with probability at least 1 2δ/3 when 19κ2

2δ λ L C loga M

M and n max{405κ2, 67κ2 log 3κ2

C.3.2. PROOF OF E32 = 0

Let ψω denote the function ψ( , ω). Observe that for f, g L2(PX), we have

f, Lg L2(PX) = Z f(x)K(x, z)g(z)d PX(x)d PX(z)

= Z f(x)ψ(x, ω)ψ(z, ω)g(z)d PX(x)d PX(z)dπ(ω)

= Z f, ψω L2(PX) g, ψω L2(PX)dπ(ω).

We will use the isometric isomorphism between the Hilbert tensor product space L2(PX) L2(PX) and the space of Hilbert-Schmidt operators from L2(PX) to itself (see e.g., Aubin 2011, Section 12). In particular, this isomorphism Φ : H1 H2 HS(H2, H1) is given by f, Φ(a)g H1 = a, f g H1 H2 for a H1 H2, f H1, g H2; and if a = f g for f H1 and g H2, then a, f g H1 H2 = f, f H1 g, g H2.

Therefore, with this isomorphism, we can write

f, Lg L2(PX) = Z f, ψω L2(PX) g, ψω L2(PX)dπ(ω)

= Z f, (ψω ψω)g L2(PX)dπ(ω)

= f, Z ψω ψωdπ(ω)g

Note that the last equality follows from the fact that if X is Bochner integrable, then Eφ(X) = φ(EX) for any bounded linear functional φ (see e.g., Cohn, 2013, Appendix E). The HS norm of R ψω ψωdπ(ω) is finite since ψ(x, ω) is bounded (KRR Condition 1). Hence, we have that

L = Z ψω ψωdπ(ω). (C.14)

Since P is the projection operator onto the closure of I(H), which is ran L1/2 by Theorem A.7, we have that (IL2(PX) P)L1/2 = 0. Let {ei}i 1 be an orthonormal basis of L2(PX). Note that IL2(PX) P is self-adjoint. Hence, we have

0 = tr (IL2(PX) P)L(IL2(PX) P) = X

i (IL2(PX) P)L(IL2(PX) P)ei, ei L2(PX)

i L(IL2(PX) P)ei, (IL2(PX) P)ei L2(PX)

Z ψω ψωdπ(ω)(IL2(PX) P)ei, (IL2(PX) P)ei

Z ψω, (IL2(PX) P)ei 2 L2(PX) dπ(ω) = Z (IL2(PX) P)ψω 2 L2(PX)dπ(ω).

Hence, (IL2(PX) P)ψω = 0 for Lebesgue almost every ω [0, 1]p. Since ω 7 ψω L2(PX) is continuous (KRR

QMC Features

Condition 1), we have that (IL2(PX) P)ψω = 0 for all ω [0, 1]p. Consequently, for any β RM, we have

β, S M(IL2(PX) P)f RM = 1

i=1 βi (IL2(PX) P)ψωi, f L2(PX) = 0.

This shows that E32 = SM ˆC 1 M,λS M(IL2(PX) P)f L2(PX) = 0.

C.3.3. PROOF OF BOUND FOR E33

From KRR Condition 3, we have Pf = If H = Lrg, where g L2(PX). From the definition of E33 in (C.10), together with the equality A 1 B 1 = A 1(B A)B 1, we have

E33 = SM( ˆC 1 M,λ C 1 M,λ)S MPf L2(PX)

= SM ˆC 1 M,λ(CM ˆCM)C 1 M,λS MPf L2(PX)

= SM ˆC 1 M,λC1/2 M,λC 1/2 M,λ (CM ˆCM)C 1 M,λS ML1/2 M,λL 1/2 M,λ Lrg L2(PX)

SM ˆC 1 M,λC1/2 M,λ C 1/2 M,λ (CM ˆCM) C 1 M,λS ML1/2 M,λ L 1/2 M,λ L1/2 Lr 1/2 g L2(PX).

From KRR Condition 3, we have g L2(PX) R. Note that

L1/2 2 = sup f L2(PX)=1 L1/2f 2 L2(PX) = sup f L2(PX)=1 Lf, f L2(PX)

= sup f L2(PX)=1

Z ψω ψωdπ(ω)f, f

= sup f L2(PX)=1

Z ψω, f 2 L2(PX)dπ(ω)

sup f L2(PX)=1

Z ψω 2 L2(PX) f 2 L2(PX)dπ(ω) κ2,

where the second line follows from (C.14), and the last inequality is based on the boundedness of ψ (KRR Condition 1). This implies that all eigenvalues of L are in [0, κ2]. Therefore, we have Lr 1/2 κ2r 1 for r 1/2.

From (C.7) we have SM(S MSM + λIM) 2 = (SMS M + λIL2(PX)) 2SM. Thus,

C 1 M,λS ML1/2 M,λ 2 = L1/2 M,λSMC 2 M,λS ML1/2 M,λ = L1/2 M,λSM(S MSM + λIM) 2S ML1/2 M,λ

= L1/2 M,λ(SMS M + λIL2(PX)) 2SMS ML1/2 M,λ

= L1/2 M,λ(LM + λIL2(PX)) 2LML1/2 M,λ

= L 3/2 M,λ LML1/2 M,λ = LML 3/2 M,λ L1/2 M,λ = LML 1 M,λ 1.

Here, we use the fact that LM and L 3/2 M,λ commute, as they share the same set of eigen-vectors.

Combining the bounds above, we have that

E33 SM ˆC 1 M,λC1/2 M,λ C 1/2 M,λ (CM ˆCM) 1 L 1/2 M,λ L1/2 κ2r 1 R = Rκ2r 1b1b2B, (C.15)

where b2 = L 1/2 M,λ L1/2 , B = C 1/2 M,λ (CM ˆCM) , and SM ˆC 1 M,λC1/2 M,λ is the term b1 already defined and bounded in Appendix C.3.1. In what follows we bound b2 and B respectively.

QMC Features

Bounding b2: From KRR Condition 1 and Proposition 2.6, we have that L LM C loga M

M . Therefore,

b2 = L 1/2 M,λ L1/2 = q

L 1/2 M,λ LL 1/2 M,λ

L 1/2 M,λ (L LM)L 1/2 M,λ + L 1/2 M,λ LML 1/2 M,λ

L 1/2 M,λ (L LM)L 1/2 M,λ + 1

L 1/2 M,λ L LM L 1/2 M,λ + 1

Recall that M = loga(1/λ)

λ . We have

λM = C loga loga(1/λ)

loga(1/λ) = C (log(1/λ) + log(loga(1/λ)))a

loga(1/λ) (C.16)

= C 1 + alog log(1/λ)

a C (1 + a)a , (C.17)

where we have used a loose bound that log log(1/λ)

log(1/λ) 1 for λ (0, 1/e]. In conclusion, when λ (0, 1/e], we have

C(1 + a)a + 1. Note that this is a deterministic rather than a probabilistic bound.

Bounding B: To bound B = C 1/2 M,λ (CM ˆCM) , we apply Proposition C.3 with v = z = ϕM(X) and vi = zi = ϕM(Xi). We first verify its conditions: v = z κ by |ψ(x, ω)| κ from KRR Condition 1; for v = ϕM(X), Q = Ev v = CM, we have (CM + λI) 1/2v 2 v 2λ 1 κ2λ 1. Hence we can take F (λ) = κ2λ 1. By definition, N(λ) = tr (Q + λI) 1Q = tr (Q + λI) 1Ev v = E[v (Q + λI) 1v]

ess supv H (Q + λI) 1/2v 2 F (λ).

Thus, by Proposition C.3, for any τ > 0, with probability at least 1 τ,

B = C 1/2 M,λ (CM ˆCM) C 1/2 M,λ (CM ˆCM) HS 4κ2 log 2

Combining the bound on b1 in Appendix C.3.1 and b2 p

C(1 + a)a + 1 with the above bound on B and setting τ to be δ/3, we have from (C.15) that

E33 Rκ2r 1b1b2B 3

C (1 + a)a + 1

4Rκ2r+1 log 6

δ + 2Rκ2r r

with probability 1 2δ/3, which proves the statement of (C.12).

Moreover, combining the bounds on b1, b2, A and B, we have that (C.11) and (C.12) hold simultaneously with probability at least 1 δ when 19κ2

2δ λ min{ L C loga M

M , 1/e} and n max{405κ2, 67κ2 log 3κ2

C.3.4. SUPPLEMENTARY TECHNICAL LEMMAS

Proposition C.2 (Bernstein s inequality, Rudi & Rosasco 2017, Appendix B, Proposition 2). Let z1, . . . , zn be a sequence of i.i.d. random vectors on a separable Hilbert space H. Assume µ = Ezi exists and there exist σ, D 0 such that

E zi µ k H 1

2k!σ2Dk 2, for all k 2, i {1, . . . , n}.

QMC Features

Then for any δ (0, 1], with probability at least 1 δ,

Proposition C.3 (Rudi & Rosasco 2017, Appendix D, Proposition 5). Let H, K be two separable Hilbert spaces and (v1, z1), . . . , (vn, zn) H K be n i.i.d. random vectors such that v H κ and z K κ almost surely for some constant κ > 0. Let Q = Ev v, T = Ev z, and Tn = 1

n Pn i=1 vi zi. Define

F (λ) := ess supv H (Q + λI) 1/2v 2, N(λ) := tr (Q + λI) 1Q .

For any 0 < λ < Q and any τ > 0, with probability at least 1 τ, the following holds

(Q + λI) 1/2(T Tn) HS 4 q

F (λ)κ log 2

4κ2 N(λ) log 2

Proposition C.4 (Rudi & Rosasco 2017, Appendix D, Proposition 6). Let v1, . . . , vn be n i.i.d. random vectors on a separable Hilbert space H such that Q = Ev v is trace-class, and for any λ > 0, there exists a constant F (λ) < such that v, (Q + λI) 1v F (λ) almost surely. Let Qn = 1

n Pn i=1 vi vi and take 0 < λ Q . Then for any δ 0, the following holds with probability at least 1 2δ:

(Q + λI) 1/2(Q Qn)(Q + λI) 1/2 2β(1 + F (λ))

where β = log 4 tr Q

λδ . Moreover, with the same probability,

λmax (Q + λI) 1/2(Q Qn)(Q + λI) 1/2 2β

Consequently, if v κ almost surely, when 19κ2

4δ λ Q and n max n 405κ2, 67κ2 log κ2

2δ o , with probability at least 1 δ, we have

λmax (Q + λI) 1/2(Q Qn)(Q + λI) 1/2 1/3.

Proposition C.5 (Rudi & Rosasco 2017, Appendix D, Proposition 8). Let H be a separable Hilbert space, A, B be two bounded self-adjoint positive linear operators on H, and λ > 0. Then

(A + λI) 1/2B1/2 (A + λI) 1/2(B + λI)1/2 (1 β) 1/2,

where β = λmax (B + λI) 1/2(B A)(B + λI) 1/2 satisfies β λmax(B) λmax(B)+λ < 1.

D. Additional Simulation Studies and Real Data Examples

D.1. Additional Simulation Studies

Here, we present results corresponding to the r = 0.5 case as mentioned in Section 4.2. Similar to what we did before, the training and test data are generated from Y = f(X) + ε, where f is the regression function, X Unif[0, 1]d, and ε N(0, 1). We consider two choices of kernels: (i) the min kernel K(x, x ) = Qd i=1 min(xi, x i), and (ii) the Gaussian kernel K(x, x ) = exp( 1

2σ2 x x 2 2), with the bandwidth σ set as the median of X X (computed numerically), where X, X i.i.d. Unif[0, 1]d.

From Theorem A.7, when r = 0.5, ran Lr is essentially H. Therefore, for the min kernel, we set f(x) = 2K(1d, x) K( 3

41d, x)+2K( 1

21d, x) K( 1

41d, x), and for the Gaussian kernel, we set f(x) = K( 1

31d, x)+K( 2

31d, x). In both cases, we have f ran Lr. To approximately control the signal-noise-ratio, we set the regression function as f(x) = C f f(x)

for some constant C f such that Ef(X) = 5. The kernel ridge regularization constant is set as λ = 0.25n 1 2r+1 .

QMC Features

4 6 8 10 log2(M)

4 6 8 10 log2(M)

4 6 8 10 log2(M)

4 6 8 10 log2(M)

Figure 5. Gaussian Kernel (r = 0.5): the test MSE against the number of random features for KRR, RF-KRR and QMCF-KRR.

4 6 8 10 log2(M)

4 6 8 10 log2(M)

4 6 8 10 log2(M)

4 6 8 10 log2(M)

Figure 6. Min Kernel (r = 0.5): the test MSE against the number of random features, for KRR, RF-KRR and QMCF-KRR.

103 2 102 3 102 4 102 6 102

103 2 102 3 102 4 102 6 102

103 2 102 3 102 4 102 6 102

Figure 7. The test MSE against the number of random features for the Cadata data set.

We plot the test MSE against the number of random features, for exact KRR, RF-KRR and QMCF-KRR in Figures 5 and 6. For each combination of kernel and d, 106 test data are first generated and held fixed. We consider 1000 realizations of training samples of size 104. For each of the realization, we fit a kernel ridge regression and compute its test error. The MSE (solid lines) in Figure 3 and 4 are obtained by averaging over the 1000 realizations. We also provide confidence bands using the 25% and 75% error quantiles from the 1000 realizations.

It is observed that, empirically, the case with r = 0.5 case exhibits patterns similar to those of the r = 1 case in Section 4.2, which demonstrates the superior performance of QMC features.

D.2. Real Data

We consider the following two real-world examples in this subsection.

Cadata (Pace & Barry, 1997): In this data set (n = 20640, d = 6), the response is the log of the median house price, and the predictors are median income, housing median age, total rooms, total bedrooms, population, and households. We first perform a random train-test split, allocating 25% of the data to the test set. The response is normalized to have mean 0, and all predictors are normalized to have mean 0 and variance 1, using statistics from the training set. Gaussian kernel is used with the bandwidth empirically set as the median of pairwise distance within the training set.

Cod-rna (Uzilov et al., 2006): This benchmark dataset (n = 59535 (train) + 271617 (test), d = 8) was developed for

QMC Features

103 2 102 3 102 4 102 6 102

103 2 102 3 102 4 102 6 102

103 2 102 3 102 4 102 6 102

Figure 8. The test MSE against the number of random features for the Cod-rna data set.

detecting non-coding RNAs. The response is binary with imbalanced class sizes: in the training set there are 39690 -1 s and 19845 1 s; in the test set there are 181078 -1 s and 90539 1 s. Eight predictors are available: G total value computed by the Dynalign algorithm (Uzilov et al., 2006), the length of shorter sequence, and respective A , U , C frequencies of sequence 1 and sequence 2. We normalize all predictors to have mean 0 and variance 1, using statistics from the training set. The Gaussian kernel is used with the bandwidth empirically set as the median of pairwise distance within 10000 random samples from the training set.

Note that our result (Theorem 3.2) depends on λ of order between n 1/2 and n 1/3. Thus, for the above two data sets (Cadata: n 1/2 0.007, n 1/3 0.037; Cod-rna: n 1/2 0.004, n 1/3 0.026) we consider three choices of λ that are approximately within this range: λ = 0.0001, 0.001, 0.01. We let M vary from 200 to 1000. Given the randomness of the MC approach, we repeated the RF-KRR process 1000 times with 1000 different seeds; subsequently, we plotted the mean test MSE along with the 95% confidence band (for RF-KRR). The results are shown in Figures 7 and 8.

We observe that for all choices of λ: (i) As M increases, the MSE of QMCF-KRR decreases quickly and converges to that of the exact KRR. (ii) When M is not too small (M > 200 for Cadata; M > 400 for Cod-rna), QMC outperforms the average performance of MC. (iii) RF-KRR has a very wide confidence band. In terms of providing a high probability MSE upper bound, QMCF-KRR significantly outperforms RF-KRR. For example, when λ = 0.01, the high probability (97.5% quantile) MSE upper bound for RF-KRR with M = 1000 is comparable to that of QMCF-KRR with M 200.

D.3. Simulation Results in High Dimensions

We have seen superior performance of QMC features compared with classical random features in low-dimensional settings (in Section 4, Appendix D.1, and Appendix D.2). In this subsection, we present simulation results of RF-KRR and QMCF-KRR in higher dimensions. These results show that the performance of QMC features may be less satisfactory as the dimension increases.

We will follow the same simulation setting as in Section 4.2 and Appendix D.1. We consider both r = 1 and r = 1/2. The training and test data are generated from Y = f(X) + ε, where f is the regression function, X Unif[0, 1]d, and ε N(0, 1). We consider the Gaussian kernel K(x, x ) = exp( 1

2σ2 x x 2 2), with the bandwidth σ set as the median of X X (computed numerically), where X, X i.i.d. Unif[0, 1]d.

For r = 1, we use the same regression function f(x) as in Section 4.2. For r = 0.5, we use the same regression function f(x) as in Appendix D.1. The kernel ridge regularization parameter is set as λ = 0.25n 1 2r+1 .

In Figures 9 and 10, we plot the test MSE against the number of random features, for exact KRR, RF-KRR and QMCF-KRR. For each combination of r and d, 106 test data are first generated and held fixed. We consider 1000 realizations of training samples of size 104. For each of the realization, we fit a kernel ridge regression and compute its test error (i.e., MSE on the test set). The solid lines in Figures 9, 10 are obtained by averaging over the 1000 realizations. We also provide confidence bands using the 25% and 75% error quantiles from the 1000 realizations. For the MC method, the randomness comes from re-generating the training set and the MC random features. Whereas for the QMC method, it only comes from the training

QMC Features

4 6 8 10 log2(M)

4 6 8 10 log2(M)

4 6 8 10 log2(M)

4 6 8 10 log2(M)

Figure 9. Negative results in higher dimensions: the test MSE against the number of random features for exact KRR, RF-KRR and QMCF-KRR. Gaussian Kernel is used with r = 1.

4 6 8 10 log2(M)

4 6 8 10 log2(M)

4 6 8 10 log2(M)

4 6 8 10 log2(M)

Figure 10. Negative results in higher dimensions: the test MSE against the number of random features, for exact KRR, RF-KRR and QMCF-KRR. Gaussian Kernel is used with r = 0.5.

set re-generation as the QMC features are deterministic.

It can be seen from Figures 9 and 10 that for both r = 1 and r = 0.5, QMC features may not outperform classical random features as the dimension increases, which aligns with the practical observation that the best use case for QMC often arises when the integrand can be well approximated by a sum of functions involving only a small number of its input variables (Owen, 2023; Adcock & Brugiapaglia, 2022). As the dimension increases, the improvement of QMC over MC may be less significant or even worse, as seen in Figures 9 and 10 for d = 30, 50 and 100. Nevertheless, for suitably large M, it can be seen that the MSE of QMCF-KRR still decreases quickly, approaching the performance of the exact KRR.