# Sampling-based Nyström Approximation and Kernel Quadrature

Satoshi Hayakawa¹, Harald Oberhauser¹, Terry Lyons¹

¹Mathematical Institute, University of Oxford, Oxford, United Kingdom. Correspondence to: Satoshi Hayakawa.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract. We analyze the Nyström approximation of a positive definite kernel associated with a probability measure. We first prove an improved error bound for the conventional Nyström approximation with i.i.d. sampling and singular value decomposition in the continuous regime; the proof techniques are borrowed from statistical learning theory. We further introduce a refined selection of subspaces in Nyström approximation with theoretical guarantees that is applicable to non-i.i.d. landmark points. Finally, we discuss their application to convex kernel quadrature and give novel theoretical guarantees as well as numerical observations.

1. Introduction

Kernel methods form a prominent part of modern machine learning tools. However, making kernel methods scalable to large datasets is an ongoing challenge. The main bottleneck is that the kernel Gram matrix scales quadratically in the number of data points. For large-scale problems the number of matrix entries can easily be of the order of hundreds of thousands or millions, so that even storing the full Gram matrix can become too costly. Several approaches have been developed to deal with this; among the most prominent are random Fourier features and the Nyström method. In this article, we revisit and generalize the Nyström method and provide new error estimates. Consequences are theoretical guarantees for kernel quadrature and improvements on the standard Nyström method that go beyond uniform subsampling of data points.

Nyström Approximation. The main idea of the Nyström method is to replace the original kernel $k$ by another kernel $k_{\mathrm{app}}$ that is constructed by random projection of the elements in the (in general infinite-dimensional) RKHS associated with $k$ into a low-dimensional RKHS. A consequence of this is that the Gram matrix of $k_{\mathrm{app}}$ is a low-rank approximation of the original Gram matrix. Concretely, let $\mu$ denote a probability measure on a (Hausdorff) space $\mathcal{X}$ and $k$ a kernel on $\mathcal{X}$; then the standard Nyström approximation uses the random kernel
$$k^Z(x, y) := k(x, Z)\,k(Z, Z)^{+}\,k(Z, y), \qquad (1)$$
where $Z = (z_i)_{i=1}^{\ell}$ is an $\ell$-point subset of $\mathcal{X}$, usually taken i.i.d. from $\mu$ (Drineas et al., 2005; Kumar et al., 2012).

Further s-rank Approximation. While less common, the following rank-reduced version is of our interest:
$$k_{\mathrm{app}}(x, y) = k^Z_s(x, y) := k(x, Z)\,k(Z, Z)^{+}_{s}\,k(Z, y), \qquad (2)$$
where $k(Z, Z)^{+}_{s}$ is the Moore-Penrose pseudo-inverse of the best rank-$s$ approximation of the Gram matrix $k(Z, Z) = (k(z_i, z_j))_{i,j=1}^{\ell}$ with $s \le \ell$. Note that $k^Z_{\ell} = k^Z$. Our motivation for this rank reduction comes from kernel-based numerical integration. Indeed, if we are given a rank-$s$ kernel $k_{\mathrm{app}}$ and a probability measure $\mu$, by Tchakaloff's theorem there is a discrete probability measure $\nu$ supported on at most $s + 1$ points satisfying $\int_{\mathcal{X}} f \,\mathrm{d}\mu = \int_{\mathcal{X}} f \,\mathrm{d}\nu$ for all $f \in \mathcal{H}_{k_{\mathrm{app}}}$, where $\mathcal{H}_{k_{\mathrm{app}}}$ is the finite-dimensional RKHS associated with the kernel $k_{\mathrm{app}}$. Such a measure $\nu$ works as a kernel quadrature rule if $k_{\mathrm{app}}$ well approximates the original kernel $k$, and the rank $s$ directly affects the number of (possibly expensive) function evaluations we need to estimate each integral.
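To make the construction in (1) and (2) concrete, the following is a minimal NumPy sketch of the standard Nyström kernel $k^Z$ and its rank-$s$ truncation $k^Z_s$. It is not the authors' code: the Gaussian kernel, the bandwidth, and all function names are illustrative choices, and the rank-$s$ pseudo-inverse $k(Z,Z)^+_s$ is taken from an eigendecomposition of the symmetric Gram matrix, which plays the role of the SVD here.

```python
# Minimal sketch of the Nystrom approximations (1)-(2); illustrative, not the paper's code.
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """An illustrative positive definite kernel; any kernel k(X, Y) -> |X| x |Y| matrix works."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def nystrom_kernel(k, Z, s=None, tol=1e-12):
    """Return k_app(x, y) = k(x, Z) k(Z, Z)^+_s k(Z, y) as a callable.
    With s=None this is k^Z in (1); otherwise it is the rank-s version k^Z_s in (2)."""
    lam, U = np.linalg.eigh(k(Z, Z))          # eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]            # reorder so lambda_1 >= ... >= lambda_ell
    if s is not None:
        lam, U = lam[:s], U[:, :s]            # best rank-s part of k(Z, Z)
    inv = np.where(lam > tol, 1.0 / np.maximum(lam, tol), 0.0)   # pseudo-inverse of the truncation
    M = (U * inv) @ U.T
    return lambda X, Y: k(X, Z) @ M @ k(Z, Y)

# Example: landmarks Z drawn i.i.d. from mu (here mu is the standard normal on R^2).
rng = np.random.default_rng(0)
Z = rng.standard_normal((50, 2))
k = gaussian_kernel
k_app = nystrom_kernel(k, Z, s=10)

# Monte Carlo estimate of the diagonal error mu(sqrt(k - k_app)) (criterion (3) below),
# using a fresh i.i.d. sample from mu.
X = rng.standard_normal((2000, 2))
diag_gap = np.diag(k(X, X)) - np.diag(k_app(X, X))
print(np.mean(np.sqrt(np.clip(diag_gap, 0.0, None))))
```

The rank $s$ trades off the dimension of the approximating RKHS against this diagonal error, which is exactly the trade-off governing the number of nodes of the kernel quadrature rules discussed below.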
The primary error criterion in this paper is
$$\int_{\mathcal{X}} \sqrt{k(x, x) - k_{\mathrm{app}}(x, x)}\;\mathrm{d}\mu(x), \qquad (3)$$
which arises from the error estimate in kernel/Bayesian quadrature (Hayakawa et al., 2022; Adachi et al., 2022).

Contribution. Our first theoretical result is that the expectation of (3) is of the order $O\big(\sqrt{\sum_{i>s}\sigma_i} + \mathrm{polylog}(\ell)/\ell\big)$ when the eigenvalues $(\sigma_i)_{i=1}^{\infty}$ of the kernel integral operator induced by $(k, \mu)$ enjoy exponential convergence (the expectation is taken over the sample $Z$). Key to the proof of this bound is the use of concepts from statistical learning theory, in particular the (local) Rademacher complexity. This error estimate is far better than the bound $O(\text{spectral term} + s^{1/2}/\ell^{1/4})$ that follows from the existing high-probability estimate of $\int_{\mathcal{X}} \big(k(x, x) - k^Z_s(x, x)\big)\,\mathrm{d}\mu(x)$ in Hayakawa et al. (2022, Corollary 4). By combining our new bound with known kernel quadrature estimates, this explains the strong empirical performance of the random kernel quadrature of Hayakawa et al. (2022); previously the theoretical bounds were not even better than Monte Carlo in terms of $\ell$.

Our second contribution is the use of approximations $k_{\mathrm{app}}$ other than $k^Z_s$, with better bounds on (3), for a general class of landmark points $Z$ rather than just an i.i.d. sample from $\mu$. This generalization allows the use of other sets $Z$ in (2) to achieve better overall performance; e.g. sampling $Z$ from determinantal point processes (DPPs) on $\mathcal{X}$ is known to be advantageous in applications. To construct and provide theoretical guarantees for such improved Nyström constructions, we revisit and generalize a method that was proposed in Santin & Schaback (2016) and give further theoretical guarantees applicable to kernel quadrature rules. The following is the list of low-rank approximations presented in the paper:

- $k^Z$ and $k^Z_s$: the usual Nyström approximations using landmark points $Z$; see (1) and (2).
- $k^Z_{s,\mu}$: the rank-$s$ truncated Mercer decomposition of the kernel $k^Z$ with respect to the measure $\mu$; see (11).
- $k^Z_{s,X}$: a version of $k^Z_{s,\mu}$ with $\mu$ given by the empirical measure $\frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}$ of the set $X = (x_i)_{i=1}^{N}$. This coincides with $k^Z_s$ when $X = Z$; see (6).

See Table 1 for a summary of our quantitative results.

Table 1: Main quantitative results. Individual bounds are available in Remark 1, Theorem 2, and Proposition 4. For the explanation of each kernel, see the end of the Contribution section. Remarks on the notation: (a) $\sigma_i$ is the $i$-th eigenvalue of the integral operator $\mathcal{K}: L^2(\mu) \to L^2(\mu)$, $g \mapsto \int_{\mathcal{X}} k(\cdot, x) g(x)\,\mathrm{d}\mu(x)$. (b) $\mu_X$ denotes the equally weighted empirical measure $\frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}$ given by $X = (x_i)_{i=1}^{N}$. (c) $\mu(\cdot)$ and $\mu_X(\cdot)$ denote the integrals over the diagonal; see (4).

| Quantity | Bound | Assumption |
| --- | --- | --- |
| $\mathbb{E}[\mu(k - k^Z_s)]$ | $O\big(\sum_{i>s}\sigma_i + (\log\ell)^{2d+1}/\ell\big)$ | $Z \sim \mu$ i.i.d., $k$ bounded, $\sigma_i \lesssim \exp(-\beta i^{1/d})$ |
| $\mu(\sqrt{k^Z - k^Z_{s,\mu}})^2 \le \mu(k^Z - k^Z_{s,\mu})$ | $\le \sum_{i>s}\sigma_i$ | $Z$: fixed |
| $\mathbb{E}[\mu_X(\sqrt{k^Z - k^Z_{s,X}})]^2 \le \mathbb{E}[\mu_X(k^Z - k^Z_{s,X})]$ | $\le \sum_{i>s}\sigma_i$ | $Z$: fixed, $X \sim \mu$ i.i.d. |

Outline. Section 2 discusses the existing literature and introduces some notation. Section 3 contains our first main result, namely the analysis of $k^Z_s$ for an i.i.d. $Z$; Appendix A provides the necessary background from statistical learning theory. In Section 4, we then treat a general $Z$ to give refined low-rank approximations together with theoretical guarantees, rather than the conventional $k^Z_s$.
In Section 5, we discuss how our bounds yield new theories and methods for the recent random kernel quadrature construction, which enables us to explain its empirical performance as well as to build some strong candidates whose performance is assessed by numerical experiments. All omitted proofs are given in Appendix B.

2. Related Literature and Notation

To simplify the notation, we write
$$\nu(f) := \int_{\mathcal{X}} f(x)\,\mathrm{d}\nu(x), \qquad \nu(h) := \int_{\mathcal{X}} h(x, x)\,\mathrm{d}\nu(x) \qquad (4)$$
for any functions $f : \mathcal{X} \to \mathbb{R}$, $h : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and any (probability) measure $\nu$ on $\mathcal{X}$, whenever the integrals are well-defined. In this notation, the aim of this paper is to bound $\mu(\sqrt{k - k_{\mathrm{app}}})$ or $\mu(k - k_{\mathrm{app}})$ for a class of low-rank approximations $k_{\mathrm{app}}$. Also, $A^+$ denotes the Moore-Penrose pseudo-inverse of a matrix $A$.

Approximation of the Gram Matrix. The standard use of the Nyström method in ML is to replace the Gram matrix $k(X, X)$ for a set $X = (x_i)_{i=1}^{N}$ by the low-rank matrix $k^Z(X, X)$, where $k^Z$ is defined as in (1). A well-developed literature studies the case when $Z = (z_i)_{i=1}^{\ell}$ is uniformly and independently sampled from $X$; see Drineas et al. (2005); Kumar et al. (2012); Yang et al. (2012); Jin et al. (2013); Li et al. (2015). Further, the cases of leverage-based sampling (Gittens & Mahoney, 2016), DPPs (Li et al., 2016), and kernel K-means samples (Oglic & Gärtner, 2017) have received attention. Moreover, two variants of the standard Nyström method have been studied: the first replaces the Moore-Penrose inverse of $k(Z, Z)$ in (1) with the pseudo-inverse of the best rank-$s$ approximation of $k(Z, Z)$ as in (2) via SVD (Drineas et al., 2005; Kumar et al., 2012; Li et al., 2015); the second uses the best rank-$s$ approximation of $k^Z(X, X)$, see (Tropp et al., 2017; Wang et al., 2019).
Related is the article Gauthier (2021) that studies the interactions of several Hilbert Schmidt spaces of (integral) operators given by a Nystr om approximation/projection of a kernel-measure pair as in the present paper; further, Chatalic et al. (2022) considers a lowrank approximation of an empirical kernel mean embedding by using a Nystr om-based projection. The leverage-based sampling studied in Gittens & Mahoney (2016) has continuous counterparts. One with a slight modification is in the kernel literature (Bach, 2017), while the exact counterpart can be found in a context from approximation theory (Cohen & Migliorati, 2017) under the name of optimallyweighted sampling, which essentially proposes sampling from s 1 Ps i=1 e2 i (x) dµ(x). The Power Function. Finally, the square root of the diagonal term p k(x, x) k Z(x, x) or its generalization is known as the power function in the literature on kernelbased interpolation (De Marchi, 2003; Santin & Haasdonk, 2017; Karvonen et al., 2021). There the primary interest is its L (uniform) norm, rather than the L1(µ) norm, µ( p k kapp), or the L2(µ) norm, µ(k kapp), that appear in kernel quadrature estimates and error estimates of the Nystr om/Mercer type decompositions. Kernel Quadrature. The literature on kernel quadrature includes herding (Chen et al., 2010; Bach et al., 2012; Husz ar & Duvenaud, 2012; Tsuji et al., 2022), weighted/correlated sampling (Bach, 2017; Belhadji et al., 2019; 2020; Belhadji, 2021), a subsampling method called thinning (Dwivedi & Mackey, 2021; 2022; Shetty et al., 2022) and a positively weighted kernel quadrature (Hayakawa et al., 2022) that motivated our work. We refer to Hayakawa et al. (2022, Table 1) for comparison of existing algorithms in terms of their convergence guarantees and computational complexities. 3. Analyzing k Z s for i.i.d. Z via Statistical Learning Theory Let Z = (zi)ℓ i=1 X and k Z s be the s-dimensional kernel given by k Z s (x, y) = k(x, Z)k(Z, Z)+ s k(Z, y) as in the usual Nystr om approximation. Throughout the paper, suppose we are provided the singular value decomposition of the matrix k(Z, Z) = U diag(λ1, . . . , λℓ)U with an orthogonal matrix U = [u1, . . . , uℓ] and λ1 λℓ 0. Note that k Z s (x, y) = i=1 1{λi>0} 1 λ i(u i k(Z, x))(u i k(Z, y)) (6) is actually a truncated Mercer decomposition of k Z with regard to the measure µZ = 1 ℓ Pℓ i=1 δzi, since u i k(Z, ), u j k(Z, ) ℓu i k(Z, Z)k(Z, Z)uj = λiλj This fact is at the heart of our analysis: k Z s is optimal srank approximation for the measure µZ, and the statistical learning theory connects estimates in empirical measure and the original measure. Let us denote by PZ,s : Hk Hk the linear operator given by k( , x) 7 k Z s ( , x) for all x X. We shall also simply write PZ = PZ,ℓ. Lemma 1. PZ,s is an orthogonal projection in H. This projection is related the quantity of interest, in that k Z s (x, x) = k( , x), PZ,sk( , x) Hk = PZ,sk( , x) 2 Hk. Thus, we have k(x, x) k Z s (x, x) = P Z,sk( , x) 2 Hk by Sampling-based Nystr om Approximation and Kernel Quadrature using P Z,s, the orthogonal complement of PZ,s. So we are now interested in estimating the integral µ( p k k Z s ) = R X P Z,sk( , x) Hk dµ(x) from the viewpoint of the projection operator. We first estimate its empirical counterpart µZ( p k k Z s ) = 1 ℓ Pℓ i=1 P Z,sk( , zi) Hk, where µZ = 1 ℓ Pℓ i=1 δzi is the empirical measure. Indeed, we have the following identity regarding µZ(k k Z s ): Lemma 2. 
For any ℓ-point sample Z X, we have k k Z s )2 µZ(k k Z s ) = 1 where λ1 λℓare eigenvalues of k(Z, Z). When Z is given by an i.i.d. sampling, the decay of eigenvalues λi enjoys the rapid decay given by σi in the following sense: Lemma 3. Let Z = (zi)ℓ i=1 be an ℓ-point independent sample from µ. Then, for the eigenvalues λ1 λℓof k(Z, Z), we have For a general random orthogonal projection operator, we can prove the following bound by using arguments in statistical learning theory (Section A): Theorem 1. Let Z = (zi)ℓ i=1 be an ℓ-point independent sample from µ and P be a random orthogonal projection in Hk possibly depending on Z. For any integer m 1, we have the following bound: X Pk( , x) Hk dµ(x) E i=1 Pk( , zi) Hk i>m σi + kmax 80m2 log(1 + 2ℓ) where the expectation is taken regarding the draws of Z. Recall that µ( p k k Z s ) = R X P Z,sk( , x) Hk dµ(x). By combining this theorem when P = P Z,s and Lemma 2 & 3, we can obtain the following: Corollary 1. Let Z = (zi)ℓ i=1 be an ℓ-point independent sample from µ. Then, for any integer m 1, we have k k Z s ) 2 s X i>s σi + 4 s X 80m2 log(1 + 2ℓ) Remark 1. When σj e βi1/d with a constant β > 0 and a positive integer d (typical for d-dimensional Gaussian kernel, see, e.g., Adachi et al., 2022, Section A.2), by taking m (log ℓ)d, we have a bound k k Z s ) = O i>s σi + (log ℓ)2d+1 for ℓ 3; see Appendix B.6 for the proof. Since k k Z s kmax p k k Z s , the same estimate applies to E[µ( p k k Z s )]. These also lead to an (s + 1)-point randomized convex kernel quadrature Qs+1 with the same order of E[wce(Qs+1)]. See Section 5 for details. 4. A Refined Low-rank Approximation with General Z The process of obtaining a good approximation kapp of k using k Z can be decomposed into two parts: k kapp = k k Z | {z } A + k Z kapp | {z } B In the previous section, we have analyzed the case Z is i.i.d. and kapp = k Z s . However, we can consider more general Z, and indeed we actually have a better way to select a subspace (i.e., kapp) from the finite-rank kernel k Z rather than just using k Z s . 4.1. Part A: Estimating the Error of k Z for General Z This part is relatively well-studied. Indeed, µ(k k Z) = R X (k(x, x) k Z(x, x)) dµ(x) for some non-i.i.d. Z can be bounded by using the results of weighted kernel quadrature. For example, Belhadji et al. (2019) consider the worst-case error for the weighted integral X f(x)g(x) dµ(x) i=1 wif(zi) (7) for any f Hk 1 and a fixed g L2(µ) with Z = (zi)ℓ i=1 following a certain DPP. Now consider the optimal worstcase error in the above approximation for the fixed point configuration Z: inf wi sup f Hk 1 i=1 wif(zi) X k( , x)g(x) dµ(x) i=1 wik( , zi) i=1 wik( , zi) Hk = P Z Kg Hk. (8) By using this, we can prove the following estimate: Sampling-based Nystr om Approximation and Kernel Quadrature Proposition 1. For any finite subset Z X and any integer m 0, we have i=1 P Z Kei 2 Hk i=1 P Z Kei 2 Hk + X where (σi, ei) i=1 are the eigenpairs of K. The papers Belhadji et al. (2019; 2020); Belhadji (2021) give bounds on the worst-case error of the weighted kernel quadrature (8) when Z is given by some correlated sampling, whereas Bach (2017) gives another bound when Z is given by an optimized weighted sampling rather than sampling from µ. By using (8) and Proposition 1, we can import their bounds on weighted kernel quadrature with non-i.i.d. Z to the estimate of µ(k k Z) = R X P Z k( , x) 2 Hk dµ(x). Here, we just give one such example: Corollary 2. 
Let Z = (zi)ℓ i=1 be taken from a DPP given by the projection kernel p(x, y) = Pℓ i=1 ei(x)ei(y) with a reference measure µ, i.e., P(Z A) = 1 ℓ! R A det p(Z, Z) dµ ℓ(Z) for any Borel set A X d. Then, for any integer m 0, we have E µ(k k Z) X i>m σi + 4m X where the expectation is taken regarding the draws of Z. In any case, by using those non-i.i.d. points, we can obtain a better Z in the sense that R X (k(x, x) k Z(x, x)) dµ(x) attains a sharper upper bound than the bound given in the previous section for an ℓ-point i.i.d. sample from µ. However, for a general Z, it is not necessary sensible to execute the SVD of k(Z, Z) and get k Z s accordingly, as a SVD of k(Z, Z) corresponds to approximating µ by the empirical measure 1 ℓ Pℓ i=1 δzi (indeed, this observation is the key to the results in the previous section). Thus, for points Z not given by i.i.d. sampling, there should exist a better choice of kapp than k Z s . We discuss this in the following section. 4.2. Part B: Mercer Decomposition of k Z Instead of using k Z s , we propose to compute the Mercer decomposition of k Z with respect to µ and truncate it to get k Z s,µ, which is defined in the following. This is doable if we have knowledge of hµ(x, y) := R X k(x, t)k(t, y) dµ(t), since k Z is a finite-dimensional kernel. We can prove the following: Lemma 4. We have hµ(x, y) = P i=1 σ2 i ei(x)ei(y). We now discuss how hµ can be used to derive the Mercer decomposition of k Z. Note that this can be regarded as a generalization of Santin & Schaback (2016, Section 6). Let KZ : L2(µ) L2(µ) be the integral operator given by g 7 R X k Z( , x)g(x) dµ(x). For functions of the form f = a k(Z, ) and g = b k(Z, ) with a, b Rℓ, we have f, g L2(µ) = Z X a k(Z, x)k(x, Z)b dµ(x) = a hµ(Z, Z)b. (9) So, if we write hµ(Z, Z) = H H by using an H Rℓ ℓ (since hµ(Z, Z) is positive semi-definite), an element f = a k(Z, ) L2(µ) is non-zero if and only if Ha = 0. Furthermore, we have X k( , Z)k(Z, Z)+k(Z, x)k(x, Z)a dµ(x) = k( , Z)k(Z, Z)+hµ(Z, Z)a = k(Z, Z)+hµ(Z, Z)a k(Z, ). (10) Thus, f is a nontrivial eigenfunction of KZ, if Ha = 0 and a is an eigenvector of k(Z, Z)+hµ(Z, Z). It is equivalent to c = Ha being an eigenvector of Hk(Z, Z)+H . Let us decompose this matrix by SVD as Hk(Z, Z)+H = V diag(κ1, . . . , κℓ)V , where the V = [v1, . . . , vℓ] Rℓ ℓis an orthogonal matrix and κ1 κℓ 0. Then, we have Hk(Z, Z)+H = i=1 κiviv i . Let us consider fi = (H+vi) k(Z, ) = v i (H+) k(Z, ) for i = 1, . . . , ℓas candidates of eigenfunctions of KZ. We can actually prove the following: Lemma 5. The set {fi | i 1, κi > 0} forms an orthornomal subset of L2(µ) whose elements are eigenfunctions of KZ. Let us define k Z µ (x, y) := Pℓ i=1 κifi(x)fi(y); note that this is computable. From the above lemma, this expression is a natural candidate for Mercer decomposition of k Z. We can prove that it actually coincides with k Z(x, y) µ-almost everywhere, and so the decomposition is independent of the choice of H up to µ-null sets: Proposition 2. There exists a measurable set A X depending on Z with µ(A) = 1 such that k Z(x, y) = k Z µ (x, y) holds for all x, y A. Moreover, we can take A = X if ker hµ(Z, Z) ker k(Z, Z). Now we just define k Z s,µ for s ℓas follows: k Z s,µ(x, y) := i=1 κifi(x)fi(y). (11) Theorem 2. We have µ(k Z µ k Z s,µ) Pℓ i=s+1 σi for any Z = (zi)ℓ i=1 X. Sampling-based Nystr om Approximation and Kernel Quadrature Proof. The left-hand side is equal to Pℓ i=s+1 κi from Lemma 5 and the definition of the kernels. Thus, it suffices to prove κi σi for each i. 
It directly follows from the min-max principle (or Weyl s inequality) as k k Z µ is positive definite on an A X with µ(A) = 1 from Proposition 2. Remark 2. The choice of the matrix H with H H = hµ(Z, Z) does not affect the theory but might affect the numerical errors. We have used the matrix square-root hµ(Z, Z)1/2, i.e., the symmetric and positive semi-definite matrix H with H2 = hµ(Z, Z), throughout the experiments in Section 5, so that we just need to take the pseudo-inverse of positive semi-definite matrices. Approximate Mercer Decomposition. When we have no access to the function hµ, we can just approximate it by using an empirical measure. For a X = (xj)M j=1 X, denote by h X the function given by replacing µ in hµ with the empirical measure with points X: h X(x, y) = 1 j=1 k(x, xj)k(xj, y) = 1 M k(x, X)k(X, y). We can actually replace every hµ by h X in the above construction to define k Z X and k Z s,X. This approximation is already mentioned by Santin & Schaback (2016) without theoretical guarantee. Another remark is that, when restricted on the set X, it is equivalent to the best s-rank approximation of k Z(X, X) in the Gram-matrix case (Tropp et al., 2017; Wang et al., 2019), since the L2-norm for the uniform measure on X just corresponds to the ℓ2-norm in R|X|. Note that we have k Z X(X, X) = k Z(X, X) from Proposition 2 in the discrete case. As we have ker h X(Z, Z) = ker k(Z, X)k(X, Z) = ker k(X, Z), we additionally obtain the following sufficient condition from Proposition 2. Proposition 3. k Z X(x, y) = k Z(x, y) holds for all x, y X. Moreover, if ker k(X, Z) ker k(Z, Z), then we have k Z X = k Z over the whole X. In particular, we have k Z X = k Z whenever Z X. These (at least µ-a.s.) equalities given in Proposition 2 & 3 are necessary for the applications to kernel quadrature, since we need k kapp to be positive definite for exploiting the existing guarantees such as Theorem 3 in the next section. Although checking k Z X = k Z is not an easy task, from the first part of Proposition 3, k Z s,X satisfies the following estimate in terms of the empirical measure µX. Proposition 4. Let Z X be a fixed subset and X be an M-point independent sample from µ. Then, we have E µX(k Z k Z s,X) = E µX(k Z X k Z s,X) X where the expectation is taken regarding the draws of X. We can also give a bound of the resulting error µ(k Z k Z s,X) again by using the arguments from learning theory, but under an additional assumption as stated in the following. Nevertheless, Proposition 4 is already sufficient for our application in kernel quadrature; see Theorem 4. Proposition 5. Under the same setting as in Proposition 4, if ker k(X, Z) ker k(Z, Z) holds almost surely for the draws of X, we have k Z k Z s,X) i 2 s X i>M σi + 4 s X 80m2 log(1 + 2M) for any integer m 1. Remark 3. The assumption ker k(X, Z) ker k(Z, Z) seems to be very hard to check in practice. An example with this property is (X, k, µ) such that X = RD with D, M > ℓ, the kernel k is just the Euclidean inner product on RD, and µ is given by a Gaussian distribution with a nonsingular covariance matrix. This said, we have some ways to avoid this issue in practice. One way is to use X Z instead of X so that the condition automatically holds. Then, the above order of estimate should still hold when ℓ M, though it complicates the analysis. Another way is effective when we use k Z X for constructing a kernel quadrature from an empirical measure given by X itself; see the next section for details. 5. 
Application to Kernel Quadrature Let us give error bounds for kernel quadrature as a consequence of the previous sections. We are mainly concerned with the kernel quadrature of the form (7) without weight, i.e., the case when g = 1 for efficiently discretizing the probability measure µ. Given an n-point quadrature rule Qn : f 7 Pn i=1 wif(xi) with weights wi R and points xi X, the worst-case error of Qn with respect to the RKHS Hk and the target measure µ is defined as wce(Qn; Hk, µ) := sup f Hk 1 |Qn(f) µ(f)|. Note that it is equal to MMDk(Qn, µ), the maximum mean discrepancy (with k) between Qn regarded as a (signed) measure and µ (Gretton et al., 2006). We call Qn convex if it is a probability measure, i.e., wi 0 and Pn i=1 wi = 1. Suppose we are given an s-rank kernel approximation kapp(x, y) = Ps i=1 ciφi(x)φi(y) with ci 0 and k kapp being positive definite (µ-almost surely). The following is taken from Hayakawa et al. (2022, Theorem 6 & 8). Sampling-based Nystr om Approximation and Kernel Quadrature Theorem 3. If an n-point convex quadrature Qn satisfies Qn(φi) = µ(φi) for 1 i s and Qn( p k kapp) µ( p k kapp), then we have wce(Qn; Hk, µ) 2µ( p Moreover, such a quadrature Qn exists with n = s + 1. Although there is a randomized algorithm for constructing the Qn stated in the above theorem (Hayakawa et al., 2022, Algorithm 2 with modification), it has two issues; it requires exact values of µ(φi) (and µ( p k kapp)) and its computational complexity has no useful upper bound unless we have additional assumptions such as well-behaved moments of test functions φi (Hayakawa et al., 2023a) or structure like a product kernel with a product measure (Hayakawa et al., 2023b). This said, we can deduce updated convergence results for outputs of the algorithm as in Remark 1. 5.1. Kernel Recombination Instead of considering an exact quadrature, what we do in practice in this low-rank approach is matching the integrals against a large empirical measure (see also Adachi et al., 2022, Section 6), say µY = 1 N PN i=1 δyi with Y = (yi)N i=1. If we have ( Qn(φi) = µY (φi), 1 i s, Qn( p k kapp) µY ( p k kapp), (12) then, from Theorem 3 with a target measure µY and the triangle inequality of MMD, we have wce(Qn; Hk, µ) MMDk(Qn, µY ) + MMDk(µY , µ) k kapp) + MMDk(µY , µ). (13) Indeed, such a quadrature Qn with n = s + 1 and points given by a subset of Y can be constructed via an algorithm called recombination (Litterer & Lyons, 2012; Tchernychova, 2016; Cosentino et al., 2020; Hayakawa et al., 2022). Existing approaches of this kernel recombination have then been using an approximation kapp typically given by k Z s whose randomness is independent from the sample Y , but it is not a necessary requirement as long as we can expect an efficient bound of µY ( p k kapp) in some sense. Another small but novel observation is that k kapp being positive definite is only required on the sample Y in deriving the estimate (13); not over the support of µ in contrast to Theorem 3. These observations circumvent the issues mentioned in Remark 3 when using kapp = k Z Y (k Z s,X with X = Y ). Let us now denote the kernel recombination in a form of function as Qn = KQuad(kapp, Y ), where the output Qn is an n-point convex quadrature satisfying n = s + 1 and (12); note that the constraint is slightly different from what is given in Hayakawa et al. (2022, Algorithm 1), but we can achieve (12) by replacing k1,diag with p k1,diag in the cited algorithm. 
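To make this pipeline concrete, here is a minimal NumPy sketch, not the authors' implementation, of how the truncated kernel $k^Z_{s,Y}$ from Section 4.2 can be assembled from a landmark set $Z$ and a large sample $Y$: it forms $h_Y(Z,Z) = \frac{1}{N}k(Z,Y)k(Y,Z)$, takes the symmetric square root $H = h_Y(Z,Z)^{1/2}$ as in Remark 2, eigendecomposes $H k(Z,Z)^+ H^\top$, and returns the leading eigenvalues $\kappa_i$ together with the eigenfunctions $f_i = (H^+ v_i)^\top k(Z,\cdot)$ evaluated on $Y$. All function names are illustrative, and the recombination solver KQuad itself is not included; the returned quantities are what it would consume, the $f_i$ playing the role of the test functions $\varphi_i$ in (12).

```python
# Minimal sketch of assembling k^Z_{s,Y} (Section 4.2 with mu replaced by the empirical
# measure of Y); illustrative only, and the recombination step KQuad is not included.
import numpy as np

def sqrtm_psd(A):
    """Symmetric square root of a positive semi-definite matrix (the choice of H in Remark 2)."""
    w, V = np.linalg.eigh(A)
    w = np.clip(w, 0.0, None)
    return (V * np.sqrt(w)) @ V.T

def pinv_psd(A, tol=1e-12):
    """Moore-Penrose pseudo-inverse of a symmetric PSD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    inv = np.where(w > tol, 1.0 / np.maximum(w, tol), 0.0)
    return (V * inv) @ V.T

def truncated_mercer_of_kZ(k, Z, Y, s):
    """Return (kappa, F): kappa holds the top-s eigenvalues and F[:, i] = f_i evaluated on Y,
    so that k^Z_{s,Y}(y, y') is represented as sum_i kappa_i f_i(y) f_i(y')."""
    N = Y.shape[0]
    KzY = k(Z, Y)                                    # ell x N
    H = sqrtm_psd(KzY @ KzY.T / N)                   # H with H^T H = h_Y(Z, Z)
    M = H @ pinv_psd(k(Z, Z)) @ H.T                  # H k(Z, Z)^+ H^T
    kappa, V = np.linalg.eigh(M)                     # ascending eigenvalues
    kappa, V = kappa[::-1][:s], V[:, ::-1][:, :s]    # keep the top-s eigenpairs (kappa_i, v_i)
    F = KzY.T @ (pinv_psd(H) @ V)                    # f_i(y) = (H^+ v_i)^T k(Z, y) for y in Y
    return np.clip(kappa, 0.0, None), F
```

A recombination step would then look for an $(s+1)$-point convex quadrature supported on $Y$ that matches the $\mu_Y$-integrals of the columns of F, as in (12).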
We can now prove the performance of low-rank approximations given in the previous section. Indeed, k Z s,Y and k Z s,µ have the following same estimate. Theorem 4. Let Z X be a fixed subset and Y be an Npoint independent sample from µ. Then, a random convex quadrature Qn = KQuad(k Z s,Y , Y ) satisfies E[wce(Qn; Hk, µ)] k k Z) + 2 s X i>s σi + rck,µ where ck,µ := µ(k) RR X X k(x, y) dµ(x) dµ(y) and the expectation is taken regarding the draws of Y . The estimate (14) holds also for Qn = KQuad(k Z s,µ, Y ). 5.2. Numerical Examples In this section, we compare the numerical performance of k Z s,Y and k Z s,µ for kernel quadrature with the conventional Nystr om approximation for a non-i.i.d. Z in the setting that we can explicitly compute the worst-case error. Periodic Sobolev Spaces. The class of RKHS we use is called periodic Sobolev spaces of functions on X = [0, 1] (a.k.a. Korobov spaces), and given by the following kernel for a positive integer r: kr(x, y) = 1 + ( 1)r 1(2π)2r (2r)! B2r(|x y|), where B2r is the 2r-th Bernoulli polynomial (Wahba, 1990; Bach, 2017). We consider the case µ being the uniform measure, where the eigenfunctions of the integral operator K are known to be 1, 2 cos(2πm ), 2 sin(2πm ) with eigenvalues respectively 1, m 2r, m 2r for each positive integer m. This RKHS is commonly used for measuring the performance of kernel quadrature methods (Kanagawa et al., 2016; Bach, 2017; Belhadji et al., 2019; Hayakawa et al., 2022). We also consider its products: k d r (x, y) = Qd i=1 kr(xi, yi) and µ being the uniform measure on the hypercube X = [0, 1]d. By considering the eigenvalues, we can see that hµ = k d 2r for each kernel k d r from Remark 4. Experiments. In the experiments for the kernel k d r , we compared the worst-case error of n-point kernel quadrature rules given by Qn = KQuad(kapp, Y ) with kapp = k H s , k Z s , k Z s,Y , k Z s,µ (s = n 1) under the following setting: Y is an N-point independent sample from µ with N = n2 (Figure 1) or N = n3 (Figure 2). Sampling-based Nystr om Approximation and Kernel Quadrature H is the uniform grid {i/n | i = 1, . . . , n} (d = 1) or the Halton sequence with Owen scrambling (Halton, 1960; Owen, 2017) (d 2). Z is the union of H and another 20n-point independent sample from ν d, where ν is the 1-dimensional (2, 5)-Beta distribution, whose density is proportional to x(1 x)4 for x [0, 1]. We additionally compared Monte Carlo : uniform weights 1/n with i.i.d. sample (xi)n i=1 from µ, Uniform Grid (d = 1): points in H with uniform weights 1/n (known to be optimal for each n), and Halton (d 2): points in an independent copy of H with uniform weights 1/n. The aim of this experiment was to see if the proposed methods (k Z s,Y and k Z s,µ) can actually recover a good subspace of the RKHS given by k Z with Z not summarizing µ. To do so, we mixed H (a good summary of µ) and an i.i.d. sample from ν to determine Z. Figure 1 shows the results for (d, r) = (1, 1), (2, 1), (3, 3) with N = n2 and n = 4, 8, 16, 32, 64, 128. From Figure 1(a, b), we can see that our methods indeed recover (and perform slightly better than) the rate of k H from a contaminated sample Z. In Figure 1(c), the four low-rank methods all perform equally well, and it seems that the dominating error is given by the term caused by MMDk(µY , µ). Figure 2 shows the results for (d, r) = (1, 2) with N = n2 or N = n3 and n = 4, 8, 16, 32, 64. 
In this case, we can see that $k^Z_{s,Y}$ or $k^Z_{s,\mu}$ eventually suffers from numerical instability, which is also reported by Santin & Schaback (2016). Since their error inflation is not completely hidden even in the case $N = n^2$, unlike the previous experiments, one possible reason for the instability is that taking the pseudo-inverse of $k(Z, Z)$ or $h_\mu(Z, Z)^{1/2}$ in the algorithm becomes highly unstable when the spectral decay is fast. Although these approximations have preferable guarantees in theory, their numerical error seems to harm the overall efficiency, and this issue needs to be addressed, e.g., by circumventing the use of the pseudo-inverse in future work.

[Figure 1: Experiments in periodic Sobolev spaces with reproducing kernel $k^d_r$; panels (a) $d = 1$, $r = 1$, (b) $d = 2$, $r = 1$, (c) $d = 3$, $r = 3$. Average of $\log_{10}(\mathrm{wce}(Q_n; \mathcal{H}_k, \mu)^2)$ over 20 samples plotted against $\log_{10} n$ with standard deviation; comparison methods include Monte Carlo and, for $d = 1$, Uniform Grid.]

[Figure 2: Experiments in $k_2$ with $N = n^2, n^3$ for recombination algorithms. Average of $\log_{10}(\mathrm{wce}(Q_n; \mathcal{H}_k, \mu)^2)$ over 20 samples plotted with standard deviation.]

Remark 4. Unlike the kernel quadrature with $k^Z_{s,\mu}$ or $k^Z_{s,Y}$, that with $k^Z_s$ does not suffer from a similar numerical instability despite the use of $k(Z, Z)^+_s$. This phenomenon can be explained by the nature of Hayakawa et al. (2022, Algorithm 1); it only requires (stable) test functions $\varphi_i = u_i^\top k(Z, \cdot)$ ($i = 1, \ldots, s$) for its equality constraints, where $u_i$ is the $i$-th eigenvector of $k(Z, Z)$, while the (possibly unstable) diagonal term $k^Z_s(x, x)$ appears in the inequality constraint, which can empirically be omitted (Hayakawa et al., 2022, Section E.2).

Computational Complexity. Letting $\ell$ and $N$ (both larger than $s$) respectively be the cardinalities of $Z$ and $Y$, we can express the computational cost of $\mathrm{KQuad}(k_{\mathrm{app}}, Y)$ with $k_{\mathrm{app}} = k^Z_s, k^Z_{s,Y}, k^Z_{s,\mu}$ as follows:

- Using $k^Z_s$ takes $O(s\ell N + s\ell^2 + s^3 \log(N/s))$, but this can be reduced to $O(\ell N + s\ell^2 + s^3 \log(N/s))$ by omitting the (empirically unnecessary) inequality constraint (Hayakawa et al., 2022, Remark 2).
- Using $k^Z_{s,Y}$ takes $O(\ell^3 + \ell^2 N + s^3 \log(N/s))$, where $O(\ell^3)$ and $O(\ell^2 N)$ respectively come from computing $k(Z, Z)^+$ and $h_Y(Z, Z)$.
- Using $k^Z_{s,\mu}$ takes $O(\ell^3 + s\ell N + s^3 \log(N/s))$ (if $h_\mu$ is available), where $O(\ell^3)$ comes from computing $k(Z, Z)^+$.

For example, in the case of Figure 1(c) with $n = 128$, the average time per trial was respectively 26.5, 226, and 216 seconds for $k^Z_s$, $k^Z_{s,Y}$, $k^Z_{s,\mu}$, while it was 52.6, 57.8, and 41.2 seconds for the case of Figure 2(b) with $n = 64$.¹

6. Concluding Remarks

In this paper, we have studied the performance of several Nyström-type approximations $k_{\mathrm{app}}$ of a positive definite kernel $k$ associated with a probability measure $\mu$, in terms of the error $\mu(\sqrt{k - k_{\mathrm{app}}})$. We first improved the bounds for $k^Z_s$, the conventional Nyström approximation based on an i.i.d. $Z$ and the use of SVD, by leveraging results in statistical learning theory. We then went beyond the i.i.d.
setting and considered general Z including DPPs; we further introduced two competitors of k Z s , i.e., k Z s,µ and k Z s,X, which are given by directly computing the Mercer decomposition of the finite-rank kernel k Z against the measure µ and the empirical measure µX, respectively. Finally, we used our results to improve the theoretical guarantees for convex kernel quadrature Hayakawa et al. (2022), and provided numerical results to illustrate the difference between the conventional k Z s and the newly proposed k Z s,µ and k Z s,X. Despite its nice theoretical properties, a limitation of our second contribution, i.e., the proposed kernel approximations, is that they involve the computation of a pseudo-inverse, which can be numerically unstable when there is a rapid spectral decay. This point should be addressed in future work, but one promising approach in the context of kernel quadrature is to conceptually learn from the stability of k Z s mentioned in Remark 4; if we see the construction of the low-rank kernel as optimization of the vectors ui for which functions u i k(Z, ) well approximate Hk Z in terms of L2(µ) metric, we can possibly leverage the stability of convex optimization for instance. Acknowledgements The authors would like to thank Ken ichiro Tanaka and the anonymous reviewers for helpful comments. This work was supported in part by the EPSRC [grant number EP/S026347/1], in part by The Alan Turing Institute under the EPSRC grant EP/N510129/1, the Data Centric Engineering Programme (under the Lloyd s Register Foundation grant G0095), the Defence and Security Programme (funded by the UK Government) and the Office for National Statistics & The Alan Turing Institute (strategic partnership) and in part by the Hong Kong Innovation and Technology Commission (Inno HK Project CIMDA). 1All the experiments were conducted on a Mac Book Pro with Apple M1 Max chip and 32GB unified memory. Code is available at the nystrom folder in https://github.com/ satoshi-hayakawa/kernel-quadrature. Sampling-based Nystr om Approximation and Kernel Quadrature Adachi, M., Hayakawa, S., Jørgensen, M., Oberhauser, H., and Osborne, M. A. Fast Bayesian inference with batch Bayesian quadrature via kernel recombination. In Advances in Neural Information Processing Systems, volume 35, pp. 16533 16547, 2022. Bach, F. On the equivalence between kernel quadrature rules and random feature expansions. The Journal of Machine Learning Research, 18(1):714 751, 2017. Bach, F., Lacoste-Julien, S., and Obozinski, G. On the equivalence between herding and conditional gradient algorithms. In International Conference on Machine Learning, pp. 1355 1362, 2012. Belhadji, A. An analysis of Ermakov Zolotukhin quadrature using kernels. In Advances in Neural Information Processing Systems, volume 34, 2021. Belhadji, A., Bardenet, R., and Chainais, P. Kernel quadrature with DPPs. In Advances in Neural Information Processing Systems, volume 32, pp. 12907 12917, 2019. Belhadji, A., Bardenet, R., and Chainais, P. Kernel interpolation with continuous volume sampling. In International Conference on Machine Learning, pp. 725 735. PMLR, 2020. Chatalic, A., Schreuder, N., Rosasco, L., and Rudi, A. Nystr om kernel mean embeddings. In International Conference on Machine Learning, pp. 3006 3024. PMLR, 2022. Chen, Y., Welling, M., and Smola, A. Super-samples from kernel herding. In Conference on Uncertainty in Artificial Intelligence, pp. 109 116, 2010. Cohen, A. and Migliorati, G. Optimal weighted leastsquares methods. 
The SMAI journal of computational mathematics, 3:181 203, 2017. Cosentino, F., Oberhauser, H., and Abate, A. A randomized algorithm to reduce the support of discrete measures. In Advances in Neural Information Processing Systems, volume 33, pp. 15100 15110, 2020. De Marchi, S. On optimal center locations for radial basis function interpolation: computational aspects. Rendiconti del Seminario Matematico, 61(3):343 358, 2003. Drineas, P., Mahoney, M. W., and Cristianini, N. On the Nystr om method for approximating a Gram matrix for improved kernel-based learning. The Journal of Machine Learning Research, 6(12):2153 2175, 2005. Dwivedi, R. and Mackey, L. Kernel thinning. In Conference on Learning Theory, pp. 1753 1753. PMLR, 2021. Dwivedi, R. and Mackey, L. Generalized kernel thinning. In International Conference on Learning Representations, 2022. Gauthier, B. Nystr om approximation and reproducing kernels: embeddings, projections and squared-kernel discrepancy. preprint, 2021. URL https://hal. archives-ouvertes.fr/hal-03207443. Gin e, E. and Koltchinskii, V. Concentration inequalities and asymptotic results for ratio type empirical processes. The Annals of Probability, 34(3):1143 1216, 2006. Gittens, A. and Mahoney, M. W. Revisiting the Nystr om method for improved large-scale machine learning. The Journal of Machine Learning Research, 17(1):3977 4041, 2016. Gretton, A., Borgwardt, K., Rasch, M., Sch olkopf, B., and Smola, A. A kernel method for the two-sample-problem. In Sch olkopf, B., Platt, J., and Hoffman, T. (eds.), Advances in Neural Information Processing Systems, volume 19. MIT Press, 2006. Gy orfi, L., Kohler, M., Krzyzak, A., and Walk, H. A distribution-free theory of nonparametric regression. Springer, 2006. Halton, J. H. On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals. Numerische Mathematik, 2(1):84 90, 1960. Hayakawa, S. and Suzuki, T. On the minimax optimality and superiority of deep neural network learning over sparse parameter spaces. Neural Networks, 123:343 361, 2020. Hayakawa, S., Oberhauser, H., and Lyons, T. Positively weighted kernel quadrature via subsampling. In Advances in Neural Information Processing Systems, volume 35, pp. 6886 6900, 2022. Hayakawa, S., Lyons, T., and Oberhauser, H. Estimating the probability that a given vector is in the convex hull of a random sample. Probability Theory and Related Fields, 185:705 746, 2023a. Hayakawa, S., Oberhauser, H., and Lyons, T. Hypercontractivity meets random convex hulls: analysis of randomized multivariate cubatures. Proceedings of the Royal Society A, 479(2273):20220725, 2023b. Husz ar, F. and Duvenaud, D. Optimally-weighted herding is Bayesian quadrature. In Conference on Uncertainty in Artificial Intelligence, pp. 377 386, 2012. Jin, R., Yang, T., Mahdavi, M., Li, Y.-F., and Zhou, Z.-H. Improved bounds for the Nystr om method with application to kernel classification. IEEE Transactions on Information Theory, 59(10):6939 6949, 2013. Sampling-based Nystr om Approximation and Kernel Quadrature Kanagawa, M., Sriperumbudur, B. K., and Fukumizu, K. Convergence guarantees for kernel-based quadrature rules in misspecified settings. In Advances in Neural Information Processing Systems, volume 29, pp. 3296 3304, 2016. Karvonen, T., S arkk a, S., and Tanaka, K. Kernel-based interpolation at approximate Fekete points. Numerical Algorithms, 87(1):445 468, 2021. Koltchinskii, V. Local rademacher complexities and oracle inequalities in risk minimization. 
The Annals of Statistics, 34(6):2593 2656, 2006. Kumar, S., Mohri, M., and Talwalkar, A. Sampling methods for the Nystr om method. The Journal of Machine Learning Research, 13(1):981 1006, 2012. Li, C., Jegelka, S., and Sra, S. Fast DPP sampling for Nystr om with application to kernel methods. In International Conference on Machine Learning, pp. 2061 2070. PMLR, 2016. Li, M., Bi, W., Kwok, J. T., and Lu, B.-L. Large-scale Nystr om kernel matrix approximation using randomized SVD. IEEE Transactions on Neural Networks and Learning Systems, 1(26):152 164, 2015. Litterer, C. and Lyons, T. High order recombination and an application to cubature on Wiener space. The Annals of Applied Probability, 22(4):1301 1327, 2012. Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. Second edition, 2018. Oglic, D. and G artner, T. Nystr om method with kernel k-means++ samples as landmarks. In International Conference on Machine Learning, pp. 2652 2660. PMLR, 2017. Owen, A. B. A randomized Halton algorithm in R. ar Xiv preprint ar Xiv:1706.02808, 2017. Penrose, R. A generalized inverse for matrices. Mathematical Proceedings of the Cambridge Philosophical Society, 51(3):406 413, 1955. Santin, G. and Haasdonk, B. Convergence rate of the dataindependent P-greedy algorithm in kernel-based approximation. Dolomites Research Notes on Approximation, 10, 2017. Santin, G. and Schaback, R. Approximation of eigenfunctions in kernel-based spaces. Advances in Computational Mathematics, 42(4):973 993, 2016. Schmidt-Hieber, J. Nonparametric regression using deep neural networks with Re LU activation function. The Annals of Statistics, 48(4):1875 1897, 2020. Shetty, A., Dwivedi, R., and Mackey, L. Distribution compression in near-linear time. In International Conference on Learning Representations, 2022. Shinozaki, N., Sibuya, M., and Tanabe, K. Numerical algorithms for the Moore Penrose inverse of a matrix: iterative methods. Annals of the Institute of Statistical Mathematics, 24(1):621 629, 1972. Srebro, N. and Sridharan, K. Note on refined Dudley integral covering number bound. Unpublished results, 2010. URL https://www.cs.cornell. edu/ sridharan/dudley.pdf. Srebro, N., Sridharan, K., and Tewari, A. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, volume 23, 2010. Steinwart, I. and Scovel, C. Mercer s theorem on general domains: On the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35(3):363 417, 2012. Tchernychova, M. Carath eodory cubature measures. Ph D thesis, University of Oxford, 2016. Tropp, J. A., Yurtsever, A., Udell, M., and Cevher, V. Fixedrank approximation of a positive-semidefinite matrix from streaming data. In Advances in Neural Information Processing Systems, volume 30, 2017. Tsuji, K., Tanaka, K., and Pokutta, S. Pairwise conditional gradients without swap steps and sparser kernel herding. In International Conference on Machine Learning, pp. 21864 21883. PMLR, 2022. Wahba, G. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, 1990. Wainwright, M. J. High-dimensional statistics: A nonasymptotic viewpoint, volume 48. Cambridge University Press, 2019. Wang, S., Gittens, A., and Mahoney, M. W. Scalable kernel k-means clustering with Nystr om approximation: relative-error bounds. The Journal of Machine Learning Research, 20(1):431 479, 2019. Yang, T., Li, Y.-F., Mahdavi, M., Jin, R., and Zhou, Z.-H. 
Nystr om method vs random Fourier features: A theoretical and empirical comparison. In Advances in Neural Information Processing Systems, volume 25, 2012. Sampling-based Nystr om Approximation and Kernel Quadrature A. Tools from statistical learning theory In this section, F always denotes a class of functions from X to R, i.e., F RX . Let us define the Rademacher complexity of F with respect to the sample Z = (zi)ℓ i=1 X as follows (e.g., Mohri et al., 2018, Definition 3.1): j=1 sjf(zj) where the conditional expectation is taken with regard to the Rademacher variables, i.e., i.i.d. variables sj uniform in { 1}. The following is a version of the uniform law of large numbers, though we only use the one side of the inequality. Proposition 6 (Mohri et al., 2018, Theorem 3.3). Let Z be an ℓ-point independent sample from µ. If there is a B > 0 such that f B for every f F, then with probability at least 1 δ, we have sup f F (µ(f) µZ(f)) 2E[RZ(F)] + For a pseudo metric d on F, we denote the ε-convering number of F by N(F, d; ε). Namely, N(F, d; ε) is the infimum of positive integers N such that there exist f1, . . . , f N F satisfying min1 i N d(fi, g) ε for all g F. Let us define a pseudo-metric d Z(f, g) := q 1 ℓ Pℓ j=1(f(zi) g(zi))2. The following assertion is a version of Dudley s integral entropy bound (Srebro et al., 2010, Lemma A.3; see Srebro & Sridharan (2010) for a correction of the constant). Proposition 7 (Dudley integral). For any ℓ-point sample Z = (zi)ℓ i=1 X, we have log N(F, d Z; ε) dε. The following is a straightforward modification of Schmidt-Hieber (2020, Lemma 4) tailored to our setting. It originates from an analysis of empirical risk minimizers, and this kind of technique has also been known in earlier work under the name of local Rademacher complexities (Gy orfi et al., 2006; Koltchinskii, 2006; Gin e & Koltchinskii, 2006). Proposition 8. Let F L (µ) be a set of functions with f 0 and f L (µ) F for all f F, where F > 0 is a constant. If ˆf is a random function in F possibly depending on Z, then, for every ε > 0, we have E h µ( ˆf) i 2E h µZ( ˆf) i + F 9 log N + 64 + 5ε, where N := max{3, N(F, L1(µ); ε)}. Proof. The proof here essentially follows the original proof, where we re-compute the constants as the condition is slightly different; see also Hayakawa & Suzuki (2020, Theorem 2.6) and its remark. Let Z = (z 1, . . . z ℓ) be an independent copy of Z. Let Fε be an ε-covering of F in L1(µ) with the cardinality N and f be a random element of Fε such that µ(| ˆf f |) ε. Then, we have E h µZ( ˆf) i E h µ( ˆf) i = i=1 ( ˆf(zi) ˆf(z i)) i=1 (f (zi) f (z i)) Define T := maxf Fε Pℓ i=1(f(zi) f(z i))/r(f), where we let r(f) := max{c p ℓ 1 log N, p µ(f)} for each f Fε with a constant c > 0 fixed afterwards. Thus, we obtain i=1 (f (zi) f (z i)) 2E r(f )2 + 1 2ℓ2 E T 2 . (16) Sampling-based Nystr om Approximation and Kernel Quadrature The first term can evaluated as E r(f )2 c2 log N ℓ + E[µ(f )] c2 log N ℓ + E[µ( ˆf)] + ε. (17) For the second term, we first have " f(zi) f(z i) r(f) i=1 E f(zi)2 + f(z i)2 Since we have |f(zi) f(z i)|/r(f) 2F/r(f) 2F ℓ c log N uniformly for f Fε, Bernstein s inequality combined with the union bound yields P T 2 t = P T ℓt 3c log N ) 2N exp 3c log N for t 9c2ℓlog N. Therefore, we have 0 P T 2 t dt 9c2ℓlog N + Z 9c2ℓlog N 2N exp 3c log N = 9c2ℓlog N + 4N 8Fℓ+ 64F 2ℓ 9c2 log N exp 9c2 log N Let us now set c = p 8F/9 so that 9c2 = 8F. Then, we obtain E T 2 8Fℓlog N + 64Fℓsince N 3 by assumption. 
By combining it with (15) (17), we finally obtain E h µZ( ˆf) i E h µ( ˆf) i 1 2E h µ( ˆf) i + ( 40 9 F log N + 32F) from which the desired inequality readily follows. B.1. Properties of the pseudo-inverse For a matrix A Rm n, its Moore Penrose pseudo-inverse A+ (Penrose, 1955) is defined as the unique matrix X Rn m that satisfies AXA = A, XAX = X, (AX) = AX, (XA) = XA. It also satisfies that A+A is the orthogonal projection onto the orthogonal complement of ker A (the range of A ), while AA+ is the orthogonal projection onto the range of A (Penrose, 1955; Shinozaki et al., 1972). We use these general properties of A+ throughout Section B. See e.g. Drineas et al. (2005) for the concrete construction of such a matrix. B.2. Proof of Lemma 1 Proof. Recall that we have the SVD k(Z, Z) = U diag(λ1, . . . , λℓ)U with an orthogonal matrix U = [u1, . . . , uℓ]. and λ1 λℓ 0. By using this notation, we have k Z s (x, y) = X 1 λ j(u j k(Z, x))(u j k(Z, y)). (18) If we denote by Qj : Hk Hk the projection onto span{u j k(Z, )}, we have (u j k(Z, x))(u j k(Z, y)) = u j k(Z, ), k( , x) Hk u j k(Z, ), k( , y) Hk = u j k(Z, ) 2 Hk Qjk( , x), Qjk( , y) Hk = λj Qjk( , x), Qjk( , y) Hk , (19) Sampling-based Nystr om Approximation and Kernel Quadrature where the last inequality follows from u i k(Z, ), u j k(Z, ) Hk = u i k(Z, Z)uj = δijλj. Now let e PZ,s be the orthogonal projection onto span{u j k(Z, )}s j=1 in Hk. We prove e PZ,s = PZ,s. From the orthogonality of {u j k(Z, )}s j=1 we have e PZ,s = Ps j=1 Qj and k( , x), k Z s ( , y) Hk = k Z s (x, y) = j=1 Qjk( , x), Qjk( , y) Hk = D e PZ,sk( , x), e PZ,sk( , y) E Hk = D k( , x), e PZ,sk( , y) E for all x, y X. In particular, k Z s ( , y) = e PZ,sk( , y), so we have e PZ,s = PZ,s. B.3. Proof of Lemma 2 Proof. The inequality follows from Cauchy Schwarz. Let us prove the equality. We use the notation Qj from the proof of Lemma 1. We first obtain PZk( , zi) = k( , zi) for i = 1, . . . , ℓ, since PZ is a projection onto span{k( , zi)}ℓ i=1. Thus, we have P Z,sk( , zi) = (PZ PZ,s)k( , zi) = (Qs+1 + + Qℓ)k( , zi), and so i=1 P Z,sk( , zi) 2 Hk = 1 s+1 j ℓ λj>0 1 λj (u j k(Z, zi))2 by using (19). Since k(Z, Z) = U diag(λ1, . . . , λℓ)U = Pℓ i=1 λiuiu i , we can explicitly calculate u j k(Z, zi) = u j i=1 λiuiu i 1j = λju i 1j, where 1j Rℓis the vector with 1 in the j-th coordinate and 0 in the other coordinates. As U is an ℓ ℓorthogonal matrix, we actually have Pℓ i=1(u i 1j)2 = 1 for each j = 1, . . . , ℓ. s+1 j ℓ λj>0 1 λj (u j k(Z, zi))2 = 1 j=s+1 λj(u i 1j)2 = 1 j=s+1 λj, (20) and the proof is complete. B.4. Proof of Lemma 3 Proof. From the min-max principle, we have λj = min Vj 1 Rℓ dim Vj 1 j 1 max xj V j 1, xj 2=1 x j k(Z, Z)xj, (21) where Vj 1 is a linear subspace of Rℓ. Recall the Mercer expansion k(x, y) = P i=1 σiei(x)ei(y). By letting ej(Z) = (ej(z1), . . . , ej(zℓ)) Rℓ, we can write k(Z, Z) = P i=1 σiei(Z)ei(Z) . We assume that this equality holds in the following. We especially write the remainder term as ks+1(Z, Z) := k(Z, Z) Ps i=1 σiei(Z)ei(Z) Consider taking Vs = span{e1(Z), . . . , es(Z)} and xj argmax x V j 1, x 2=1 x k(Z, Z)x, Vj = span(Vj 1 {xj}) for j = s + 1, . . . , ℓin (21). Then, λ j := x j k(Z, Z)x satisfies λj λ j, and so we have j=s+1 λ k = j=s+1 x j k(Z, Z)xj = j=s+1 x j ks+1(Z, Z)xj, Sampling-based Nystr om Approximation and Kernel Quadrature where we have used that x j ei(Z) = 0 for any i s < j in the last inequality. By taking some {x1, . . . , xs} Rℓ, we can make {x1, . . . 
, xℓ} a orthonormal basis of Rℓ, so we obtain j=s+1 x j ks+1(Z, Z)xj j=1 x j ks+1(Z, Z)xj = tr ks+1(Z, Z). Therefore, we have 1 ℓ ℓtr ks+1(Z, Z) = 1 i=1 ks+1(zi, zi), and we obtain the desired inequality in expectation since E[ks+1(zi, zi)] = P j=s+1 σj. B.5. Proof of Theorem 1 We first prove the following generic proposition by exploiting the ingredients given in Section A. Proposition 9. Let Q be an arbitrary deterministic m-dimensional orthogonal projection in Hk. Then, for any random orthogonal projection P possibly depending on Z, we have µ( PQk( , x) Hk) µZ( PQk( , x) Hk) + with probability at least 1 δ. Furthermore, with regard to the expectation, we also have E[µ( PQk( , x) Hk)] 2E[µZ( PQk( , x) Hk)] + kmax 80m2 log(1 + 2ℓ) 9 + 69 . (23) Proof. Let {v1, . . . , vm} be an orthonormal basis of QHk. Let also {ui}i I and {ui}i J be respectively an orthonormal basis of PHk and (PHk) , so {ui}i I J is an orthonormal basis of Hk. Let us compute PQk( , x) 2 Hk. Since we have PQk( , x) = P j=1 vj, k( , x) Hk vj j=1 vj(x)Pvj = X j=1 vj(x) ui, vj Hk ui (where we can exchange the summation as they converge in Hk), we obtain PQk( , x) 2 Hk = X j=1 vj(x) ui, vj Hk = AP,Qvx 2 ℓ2(I) = v x A P,QAP,Qvx, where vx = (v1(x), . . . , vm(x)) Rm and AP,Q is a linear operator Rm ℓ2(I) given by a = (a1, . . . , am) 7 (Pm j=1 ui, vj Hk aj)i I, and A P,Q : ℓ2(I) Rm is its dual (defined by the property a, A P,Qb Rm = AP,Qa, b ℓ2(I)), which can be understood as the transpose of AP,Q. Note that A P,QAP,Q can be regarded as an m m matrix and we have (A P,QAP,Q)j,h = X i I ui, vj Hk ui, vh Hk = Pvj, Pvh Hk . We can also define BP,Q = AP ,Q by replacing P with P . Then we have (A P,QAP,Q)j,h + (B P,QBP,Q)j,h = Pvj, Pvh Hk + P vj, P vh Hk = vj, vh Hk = δjh, so A P,QAP,Q is an m m positive semi-definite matrix with A P,QAP,Q Im. It thus suffices to consider a uniform estimate of µ( p v x Svx) µZ( p v x Svx) with a positive semi-definite matrix S Im. This S can be written as S = U U by using a U Rm m with U 2 1, so we shall solve the following problem: Sampling-based Nystr om Approximation and Kernel Quadrature Find a uniform upper bound of µ( Uvx 2) µZ( Uvx 2) for any matrix U Rm m with U 2 1. Now we can reduce our problem to a routine work of bounding the covering number of the function class F := {f U := x 7 Uvx 2 | U U}, where U := {U Rm m | U 2 1}. For any x X, we have j=1 vj(x)2 = Qk( , x) 2 Hk k( , x) 2 Hk = k(x, x). If Uδ is a δ-covering of U, then {f U}U Uδ gives a δ kmax-covering. Indeed, for any U, V U with U V 2 δ, we have d Z(f U, f V )2 = 1 i=1 ( Uvzi 2 V vzi 2)2 1 i=1 (U V )vzi 2 2 δ2 1 i=1 vzi 2 2 δ2kmax. Here, we have the covering number bound log N(U, 2; δ) m2 log 1 + 2 δ for δ 1 (and 0 for δ 1) as U can be seen as a unit ball of Rm2 in a certain norm (Wainwright, 2019, Example 5.8), so log N(F, d Z; ε) m2 log(1 + 2 kmax/ε) for ε kmax. Therefore, from Proposition 7, we have m2 log 1 + 2 kmax dt 18m kmax where we have used the estimate 1 + log 1 + 2 Since we also have a bound f U U 2 kmax, we can use Proposition 6 to obtain µ( PQk( , x) Hk) µZ( PQk( , x) Hk) sup f F (µZ(f) µ(f)) with probability at least 1 δ. So we have proven (22). We next prove (23) by using Proposition 8. We have the same bound for log N(F, L1(µ); ε) from the same argument as above, and so we especially get log N F, L1(µ); kmax m2 log(1 + 2ℓ). As f L (µ) kmax =: F holds for all f F, we can now apply Proposition 8 with ε = F/ℓto obtain the desired conclusion. 
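As a small numerical companion to the tools of Appendix A, the empirical Rademacher complexity defined there can be estimated by Monte Carlo over the Rademacher signs. The sketch below is illustrative only: it uses a finite, hypothetical function class (coordinate projections evaluated on a Gaussian sample), not the class appearing in the proofs above.

```python
# Monte Carlo estimate of the empirical Rademacher complexity
#   R_Z(F) = E_s[ sup_{f in F} (1/l) sum_j s_j f(z_j) ]
# for a finite function class; illustrative, not tied to the class used in the proofs.
import numpy as np

def empirical_rademacher(values, n_draws=2000, seed=0):
    """values: array of shape (|F|, l) with values[i, j] = f_i(z_j)."""
    rng = np.random.default_rng(seed)
    n_funcs, l = values.shape
    total = 0.0
    for _ in range(n_draws):
        signs = rng.choice([-1.0, 1.0], size=l)     # Rademacher variables s_j
        total += np.max(values @ signs) / l         # sup over the finite class
    return total / n_draws

# Example: F = coordinate projections on R^5, Z an i.i.d. standard normal sample of size l = 200.
rng = np.random.default_rng(1)
Z = rng.standard_normal((200, 5))
print(empirical_rademacher(Z.T))                    # shrinks roughly like O(l^{-1/2}) as l grows
```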
We next prove the following proposition that includes the desired assertion by using Proposition 9. Proposition 10. Let Z = (zi)ℓ i=1 be an ℓ-point independent sample from µ. Let P be a random orthogonal projection in Hk possibly depending on Z. For any integer m 1, with probability at least 1 δ, we have X Pk( , x) Hk dµ(x) 1 i=1 Pk( , zi) Hk + Sampling-based Nystr om Approximation and Kernel Quadrature Furthermore, in expectation, we have the following bound: X Pk( , x) Hk dµ(x) E i=1 Pk( , zi) Hk 80m2 log(1 + 2ℓ) 9 + 69 + 4 s X j>m σj. (24) Proof. Note that we use the fact that for any projection operator P Pf f frequently within the proof. For an ℓ-point sample Z = (z1, . . . , zℓ) X, let us denote µZ be the mapping f 7 1 ℓ Pℓ i=1 f(zi). If we have f , f L1(µ) with f f, we can generally obtain µ(f) µZ(f) = (µ(f) µ(f )) + (µ(f ) µZ(f )) + (µZ(f ) µZ(f)) µ(f f ) + (µ(f ) µZ(f )). (25) We here use f(x) = Pk( , x) Hk and f (x) = PPmk( , x) Hk PP mk( , x) Hk for an m, where Pm is the projection operator onto span{e1, . . . , em} in Hk and P m is its orthogonal complement. In this case, µ(f f ) can easily be estimated by Cauchy Schwarz as follows: µ(f f ) µ(2 PP mk( , x) Hk) 2µ( P mk( , x) Hk) µ( P mk( , x) 2 Hk) = 2 s X j>m σj, (26) where we have used the fact P mk( , x) 2 Hk = k( , x) 2 Hk Pmk( , x) 2 Hk = k(x, x) i=1 σiei(x)2 = i=m+1 σiei(x)2. We also bound µ(f ) µZ(f ) by µ(f ) µZ(f ) µ( PPmk( , x) Hk) µZ( PPmk( , x) Hk) + µZ( P mk( , x) Hk), (27) where we have used the second inequality in (26) for µZ. The last term µZ( P mk( , x) Hk) above is estimated either in expectation or in high probability as follows: E µZ( P mk( , x) Hk) s X µZ( P mk( , x) Hk) s X δ with probability at least 1 δ. (28) The latter follows from a simple calculation of Hoeffing s inequality. Thus, it suffices to derive a bound for µ( PPmk( , x) Hk) µZ( PPmk( , x) Hk) or its expectation; we do it by letting Q = Pm and ˆf = f in Proposition 9. By combining (just summing up) the inequalities (25) (28), and (22), we obtain the desired inequality in high probability. For the result in expectation, we first combine the inequalities (25) (28), and (23) to get the bound E[µ(f)] E[µZ(f)] E[µZ( PPmk( , x) Hk)] + kmax 80m2 log(1 + 2ℓ) 9 + 69 + 3 s X (recall f(x) = Pk( , x) Hk). Since we can also estimate E[µZ( PPmk( , x) Hk)] as E[µZ( PPmk( , x) Hk)] E[µZ( Pk( , x) Hk)] + E µZ( PP mk( , x) Hk) E[µZ( Pk( , x) Hk)] + s X we obtain the desired conclusion. Sampling-based Nystr om Approximation and Kernel Quadrature B.6. Proof of Remark 1 Proof. We assume ℓ 3 here. Let F(x) := β 1x1 1/d exp( βx1/d). If d 2, its derivative is F (x) = exp( βx1/d) 1 1/d β x 1/d exp( βx1/d) = 1 1 1/d β x 1/d exp( βx1/d). Thus, if x (log ℓ)d/βd, we have F (x) d exp( βx1/d). This inequality is still true if d = 1. By taking m = (2 log ℓ)d/βd , we obtain X 2(log ℓ)d/βd exp( βx1/d) dx d F(2(log ℓ)d/βd) = 2d 1d βd (log ℓ)d 1 Therefore, this choice of m satisfies s X i>m σi = O (log ℓ)(d 1)/2 , m2 = O (log ℓ)2d . Combining these with the inequality in Corollary 1 gives the desired estimate. B.7. Proof of Proposition 1 Proof. We basically just compute the trace of the operator P Z K. Indeed, we have Z X P Z k( , x) 2 Hk = Z X (k(x, x) k Z(x, x)) dµ(x), (29) and, from (5), we also have the following identity: Z X k(x, x) dµ(x) = i=1 ei, Kei L2(µ) . 
B.7. Proof of Proposition 1

Proof. We basically just compute the trace of the operator $P_Z^\perp K$. Indeed, we have
$$\int_{\mathcal{X}}\|P_Z^\perp k(\cdot,x)\|_{\mathcal{H}_k}^2\,\mathrm{d}\mu(x) = \int_{\mathcal{X}}\bigl(k(x,x) - k^Z(x,x)\bigr)\,\mathrm{d}\mu(x), \tag{29}$$
and, from (5), we also have the following identity:
$$\int_{\mathcal{X}}k(x,x)\,\mathrm{d}\mu(x) = \sum_{i=1}^{\infty}\langle e_i, Ke_i\rangle_{L^2(\mu)}. \tag{30}$$
For $k^Z$, as we can write $k^Z(x,y) = \sum_{i=1}^{\ell}g_i(x)g_i(y)$ by using $g_i\in L^2(\mu)$ (see, e.g., (18)), we also have
$$\int_{\mathcal{X}}k^Z(x,x)\,\mathrm{d}\mu(x) = \sum_{i\in I}\langle e_i, K_Ze_i\rangle_{L^2(\mu)} = \sum_{i=1}^{\infty}\langle e_i, K_Ze_i\rangle_{L^2(\mu)}, \tag{31}$$
where $K_Z : L^2(\mu)\to L^2(\mu)$ is the integral operator given by $g\mapsto\int_{\mathcal{X}}k^Z(\cdot,x)g(x)\,\mathrm{d}\mu(x)$, and $(e_i)_{i\in I}$ is an orthonormal basis of $L^2(\mu)$ including $(e_i)_{i=1}^{\infty}$. The second equality follows from the fact that $K - K_Z$ is a positive semi-definite operator, since $k - k^Z$ is a positive definite kernel, and so we have $0 \le \langle e_i, K_Ze_i\rangle_{L^2(\mu)} \le \langle e_i, Ke_i\rangle_{L^2(\mu)} = 0$ for any $i\in I\setminus\mathbb{Z}_{>0}$.

For this integral operator, since we have $k^Z(\cdot,x) = P_Zk(\cdot,x)$, we can prove
$$K_Zg = \int_{\mathcal{X}}P_Zk(\cdot,x)g(x)\,\mathrm{d}\mu(x) = P_Z\int_{\mathcal{X}}k(\cdot,x)g(x)\,\mathrm{d}\mu(x) = P_ZKg$$
for any $g\in L^2(\mu)$ under the well-definedness of $K$. Thus, from (29)–(31), we have
$$\int_{\mathcal{X}}\|P_Z^\perp k(\cdot,x)\|_{\mathcal{H}_k}^2\,\mathrm{d}\mu(x) = \sum_{i=1}^{\infty}\langle e_i, (K - K_Z)e_i\rangle_{L^2(\mu)} = \sum_{i=1}^{\infty}\langle e_i, P_Z^\perp Ke_i\rangle_{L^2(\mu)}. \tag{32}$$
For general $f\in\mathcal{H}_k$ and $g\in L^2(\mu)$, we can prove
$$\langle f, Kg\rangle_{\mathcal{H}_k} = \Bigl\langle f, \int_{\mathcal{X}}k(\cdot,x)g(x)\,\mathrm{d}\mu(x)\Bigr\rangle_{\mathcal{H}_k} = \int_{\mathcal{X}}\langle f, k(\cdot,x)\rangle_{\mathcal{H}_k}\,g(x)\,\mathrm{d}\mu(x) = \langle f, g\rangle_{L^2(\mu)},$$
so that in particular $\langle g, P_Z^\perp Kg\rangle_{L^2(\mu)} = \langle Kg, P_Z^\perp Kg\rangle_{\mathcal{H}_k} = \|P_Z^\perp Kg\|_{\mathcal{H}_k}^2$. By letting $g = e_i$ in the above equation, we can deduce the desired equality from (32). For the inequality, use the bound $\|P_Z^\perp Ke_i\|_{\mathcal{H}_k}^2 \le \|Ke_i\|_{\mathcal{H}_k}^2 = \|\sigma_ie_i\|_{\mathcal{H}_k}^2 = \sigma_i\,\|\sqrt{\sigma_i}\,e_i\|_{\mathcal{H}_k}^2 = \sigma_i$ for each $i>m$.

B.8. Proof of Corollary 2

Proof. From Proposition 1 and (8), it suffices to prove, for an arbitrary $g\in L^2(\mu)$, that
$$\|P_Z^\perp Kg\|_{\mathcal{H}_k}^2 = \inf_{w\in\mathbb{R}^\ell}\;\sup_{\|f\|_{\mathcal{H}_k}\le1}\;\biggl(\int_{\mathcal{X}}f(x)g(x)\,\mathrm{d}\mu(x) - \sum_{i=1}^{\ell}w_if(z_i)\biggr)^2.$$
It is indeed an immediate consequence of Belhadji (2021, Theorem 4).

B.9. Proof of Lemma 4

Proof. Given the Mercer decomposition $k(x,y) = \sum_{i=1}^{\infty}\sigma_ie_i(x)e_i(y)$, we can compute
$$h_\mu(x,y) = \int_{\mathcal{X}}k(x,t)k(t,y)\,\mathrm{d}\mu(t) = \sum_{i,j=1}^{\infty}\sigma_i\sigma_je_i(x)e_j(y)\int_{\mathcal{X}}e_i(t)e_j(t)\,\mathrm{d}\mu(t) = \sum_{i,j=1}^{\infty}\delta_{ij}\sigma_i\sigma_je_i(x)e_j(y) = \sum_{i=1}^{\infty}\sigma_i^2e_i(x)e_i(y),$$
where we have used the fact that $(e_i)_{i=1}^{\infty}$ is an orthonormal set in $L^2(\mu)$.

B.10. Proof of Lemma 5

Proof. From (9), we have
$$\langle f_i, f_j\rangle_{L^2(\mu)} = v_i^\top(H^+)^\top H^\top HH^+v_j = (HH^+v_i)^\top(HH^+v_j). \tag{33}$$
Here, note that $\{v_i : \kappa_i>0\}\subset(\ker H^\top)^\perp$, as we have, for any $v\in\ker H^\top$,
$$0 = v^\top Hk(Z,Z)^+H^\top v = \sum_{i=1}^{\ell}\kappa_i\,v^\top v_iv_i^\top v = \sum_{i=1}^{\ell}\kappa_i(v^\top v_i)^2.$$
Therefore, $HH^+v_i = v_i$ if $\kappa_i>0$, since $HH^+$ is the orthogonal projection onto $(\ker H^\top)^\perp$, and so $\{f_i : \kappa_i>0\}$ is orthonormal from (33). We can also see that $f_i = (H^+v_i)^\top k(Z,\cdot)$ is an eigenfunction of $K_Z$ from the remark below (10) and $HH^+v_i = v_i$.

B.11. Proof of Proposition 2

Proof. We rewrite $k_\mu^Z$ in terms of another summation as follows:
$$k_\mu^Z(x,y) := \sum_{i=1}^{\ell}\kappa_if_i(x)f_i(y) = k(x,Z)H^+\Bigl(\sum_{i=1}^{\ell}\kappa_iv_iv_i^\top\Bigr)(H^\top)^+k(Z,y) = k(x,Z)H^+Hk(Z,Z)^+H^\top(H^\top)^+k(Z,y) = \sum_{i:\,\lambda_i>0}\frac{1}{\lambda_i}\bigl(u_i^\top H^\top(H^+)^\top k(Z,x)\bigr)\bigl(k(y,Z)H^+Hu_i\bigr), \tag{34}$$
where $(\lambda_i, u_i)$ are the eigenpairs of $k(Z,Z)$. Recall also that we have
$$k^Z(x,y) = k(x,Z)k(Z,Z)^+k(Z,y) = \sum_{i:\,\lambda_i>0}\frac{1}{\lambda_i}\,u_i^\top k(Z,x)\,k(y,Z)\,u_i. \tag{35}$$
From (34) and this, it suffices to prove $u^\top k(Z,\cdot) = u^\top H^\top(H^+)^\top k(Z,\cdot)$ in $L^2(\mu)$ for any $u\in\mathbb{R}^\ell$. Indeed, we have
$$\begin{aligned}
\int_{\mathcal{X}}\bigl(u^\top k(Z,x) - u^\top H^\top(H^+)^\top k(Z,x)\bigr)^2\,\mathrm{d}\mu(x)
&= u^\top\bigl(I_\ell - H^\top(H^+)^\top\bigr)\int_{\mathcal{X}}k(Z,x)k(x,Z)\,\mathrm{d}\mu(x)\,\bigl(I_\ell - H^+H\bigr)u\\
&= u^\top\bigl(I_\ell - H^\top(H^+)^\top\bigr)H^\top H\bigl(I_\ell - H^+H\bigr)u = 0,
\end{aligned}$$
since $H^\top(H^+)^\top H^\top = H^\top$ and $HH^+H = H$ hold ($I_\ell$ is the identity matrix). Thus, we obtain the desired assertion.

Finally, we prove that $k_\mu^Z$ and $k^Z$ coincide when $\ker h_\mu(Z,Z)\subset\ker k(Z,Z)$. From (34) and (35), it suffices to prove $H^+Hu_i = u_i$ for indices $i$ with $\lambda_i>0$. Note that $H^+H$ is the orthogonal projection onto the orthogonal complement of $\ker H = \ker H^\top H = \ker h_\mu(Z,Z)$, by a general property of the pseudo-inverse. Since $u_i$ is an eigenvector of $k(Z,Z)$ with a positive eigenvalue $\lambda_i$, it is orthogonal to any $v\in\ker k(Z,Z)$ (as $u_i^\top v = \lambda_i^{-1}u_i^\top k(Z,Z)v = 0$). Therefore, if we have $\ker h_\mu(Z,Z)\subset\ker k(Z,Z)$, then $u_i$ is also orthogonal to $\ker h_\mu(Z,Z)$, and so $H^+Hu_i = u_i$ as desired.
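The algebra in Lemmas 4–5 and Proposition 2 is easy to exercise numerically when $\mu$ is itself an empirical measure, since then $h_\mu(Z,Z) = \frac{1}{M}k(Z,X)k(X,Z)$ is available in closed form. The sketch below builds the measure-adapted kernel $k_\mu^Z$ from the eigenpairs of $Hk(Z,Z)^+H^\top$ and checks that it agrees with the plain Nyström kernel $k^Z$ on the support of $\mu$, as Proposition 2 asserts. The Laplace kernel, the sample sizes, and helper names such as `k_Z_mu` are our own choices for illustration, not the paper's code.

```python
# Sketch: the measure-adapted Nystrom kernel k^Z_mu, instantiated for an empirical
# measure mu = mu_X, where h_mu(Z,Z) = (1/M) k(Z,X) k(X,Z) holds exactly.
import numpy as np

rng = np.random.default_rng(1)

def k(a, b):
    """Laplace kernel Gram matrix k(a, b) (chosen for its benign conditioning)."""
    return np.exp(-np.abs(a[:, None] - b[None, :]))

M, ell = 400, 15
X = rng.normal(size=M)                  # sample representing mu
Z = rng.normal(size=ell)                # landmark points

kZZ_pinv = np.linalg.pinv(k(Z, Z))
H = k(X, Z) / np.sqrt(M)                # H^T H = (1/M) k(Z,X) k(X,Z) = h_mu(Z,Z)

A = H @ kZZ_pinv @ H.T                  # H k(Z,Z)^+ H^T, eigenpairs (kappa_i, v_i)
kappa, V = np.linalg.eigh(A)
kappa, V = kappa[::-1], V[:, ::-1]      # sort eigenvalues in decreasing order

def k_Z(x, y):
    """Plain Nystrom kernel k^Z(x,y) = k(x,Z) k(Z,Z)^+ k(Z,y)."""
    return k(x, Z) @ kZZ_pinv @ k(Z, y)

def k_Z_mu(x, y, s=None):
    """Measure-adapted (optionally rank-s truncated) kernel sum_i kappa_i f_i(x) f_i(y)."""
    s = len(kappa) if s is None else s
    G = np.linalg.pinv(H) @ V[:, :s]    # columns are H^+ v_i, so f_i = (H^+ v_i)^T k(Z, .)
    fx, fy = k(x, Z) @ G, k(y, Z) @ G
    return (fx * kappa[:s]) @ fy.T

# Proposition 2 with mu = mu_X: k^Z_mu coincides with k^Z on the support of mu,
# so the printed discrepancy should be tiny (zero up to numerical error).
print("max |k^Z - k^Z_mu| on X x X:", np.max(np.abs(k_Z(X, X) - k_Z_mu(X, X))))
```

Passing a small value of `s` to `k_Z_mu` gives the rank-reduced kernel $k_{s,\mu}^Z$ used later in the quadrature arguments.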
B.12. Proof of Proposition 4

First, we give a proof of a folklore property of products of positive semi-definite matrices.

Lemma 6. Let $\ell, m, n$ be positive integers and let $A, B\in\mathbb{R}^{n\times n}$ be (symmetric) positive semi-definite matrices. Assume $B = C^\top C = D^\top D$ for real matrices $C\in\mathbb{R}^{m\times n}$ and $D\in\mathbb{R}^{\ell\times n}$. Then, $CAC^\top$ and $DAD^\top$ have the same set of nonzero eigenvalues with the same multiplicities (in terms of real eigenvectors).

Proof. For a real square matrix $M\in\mathbb{R}^{j\times j}$ and a real number $\lambda$, let us define $S_\lambda(M) := \{v\in\mathbb{R}^j \mid Mv = \lambda v\}$, the real eigenspace of $M$ corresponding to $\lambda$. We shall prove that there is a bijection between $S_\lambda(AB)$ and $S_\lambda(CAC^\top)$ for each real $\lambda\neq0$ (and the same holds for $S_\lambda(DAD^\top)$ by symmetry). Once we establish this, we see that each $\lambda\neq0$ has the same multiplicity as an eigenvalue of $CAC^\top$ and $DAD^\top$ (the multiplicity can be zero; in that case $\lambda$ is not an eigenvalue), and the desired assertion follows.

Let us fix $\lambda\neq0$. If $v\in S_\lambda(AB)$, we have $CAC^\top(Cv) = CABv = \lambda(Cv)$, so $Cv\in S_\lambda(CAC^\top)$. We also have $Cv\neq Cv'$ for another element $v'(\neq v)\in S_\lambda(AB)$, since $AC^\top(Cv - Cv') = AB(v - v') = \lambda(v - v')\neq0$. Thus, matrix multiplication by $C$ is an injective map from $S_\lambda(AB)$ to $S_\lambda(CAC^\top)$. Let us finally prove that $S_\lambda(AB)\ni v\mapsto Cv\in S_\lambda(CAC^\top)$ is surjective. Let $u\in S_\lambda(CAC^\top)$. Then, $u = \lambda^{-1}(\lambda u) = \lambda^{-1}CAC^\top u = C(\lambda^{-1}AC^\top u)$, so we can write $u = Cv$ for $v = \lambda^{-1}AC^\top u$. It remains to prove $v\in S_\lambda(AB)$, but we can see this as follows:
$$ABv = \frac{1}{\lambda}A(C^\top C)AC^\top u = \frac{1}{\lambda}AC^\top(CAC^\top u) = \frac{1}{\lambda}AC^\top(\lambda u) = AC^\top u = \lambda v.$$
Therefore, we have a bijection between $S_\lambda(AB)$ and $S_\lambda(CAC^\top)$, and we are done.

Recall that $\mu(k_\mu^Z - k_{s,\mu}^Z) \le \sum_{i=s+1}^{\ell}\kappa_i$ holds for the eigenvalues $\kappa_1\ge\cdots\ge\kappa_\ell\ge0$ of $H_\mu k(Z,Z)^+H_\mu^\top$ with $H_\mu^\top H_\mu = h_\mu(Z,Z)$ (this immediately follows from the definitions of $k_\mu^Z$ and $k_{s,\mu}^Z$, and the fact that the $f_i$ are $L^2(\mu)$-orthonormal). By replacing $\mu$ with $\mu_X$, we have $\mu_X(k_X^Z - k_{s,X}^Z) \le \sum_{i=s+1}^{\ell}\kappa_i^X$ for the eigenvalues $\kappa_1^X\ge\cdots\ge\kappa_\ell^X\ge0$ of $H_Xk(Z,Z)^+H_X^\top$, where $H_X^\top H_X = h_X(Z,Z) = \frac{1}{M}k(Z,X)k(X,Z)$. By using the lemma, we can see that the $\kappa_i^X$ are actually the same as the eigenvalues of $\frac{1}{M}k(X,Z)k(Z,Z)^+k(Z,X) = \frac{1}{M}k^Z(X,X)$. As $k - k^Z$ is a positive definite kernel, $k(X,X) - k^Z(X,X)$ is a positive semi-definite matrix, so the $i$-th largest eigenvalue of $k^Z(X,X)$ is bounded by the $i$-th largest eigenvalue of $k(X,X)$ (Weyl's inequality). Now, let $\lambda_1^X\ge\lambda_2^X\ge\cdots\ge0$ be the eigenvalues of $k(X,X)$. From the above argument, we have
$$\mu_X\bigl(k_X^Z - k_{s,X}^Z\bigr) \;\le\; \sum_{i=s+1}^{\ell}\kappa_i^X \;\le\; \frac{1}{M}\sum_{i=s+1}^{\ell}\lambda_i^X \;\le\; \frac{1}{M}\sum_{i>s}\lambda_i^X.$$
Notice that we can apply Lemma 3 with $X$ instead of $Z$, and obtain $\mathbb{E}\bigl[\mu_X\bigl(k_X^Z - k_{s,X}^Z\bigr)\bigr] \le \sum_{i>s}\sigma_i$ as desired.
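Lemma 6 is easy to sanity-check numerically. The following sketch factors one positive semi-definite matrix $B$ as $C^\top C$ and $D^\top D$ with different row counts and confirms that $CAC^\top$ and $DAD^\top$ share their nonzero spectrum. All matrices are random stand-ins of our own choosing; only the factorisation structure from the lemma is used.

```python
# Quick numerical check of Lemma 6: if B = C^T C = D^T D, then C A C^T and
# D A D^T share the same nonzero eigenvalues (with multiplicity).
import numpy as np

rng = np.random.default_rng(2)
n, m, ell = 6, 9, 7

G = rng.normal(size=(n, n))
A = G @ G.T                                   # PSD matrix A

F = rng.normal(size=(4, n))                   # rank-4 factor, so B = F^T F is PSD
B = F.T @ F

# Two different factorisations B = C^T C = D^T D with different row counts.
Q1, _ = np.linalg.qr(rng.normal(size=(m, 4)))   # m x 4 with orthonormal columns
Q2, _ = np.linalg.qr(rng.normal(size=(ell, 4))) # ell x 4 with orthonormal columns
C = Q1 @ F                                    # C in R^{m x n}
D = Q2 @ F                                    # D in R^{ell x n}
assert np.allclose(C.T @ C, B) and np.allclose(D.T @ D, B)

ev = lambda M: np.sort(np.linalg.eigvalsh(M))[::-1]
print(np.round(ev(C @ A @ C.T)[:6], 6))
print(np.round(ev(D @ A @ D.T)[:6], 6))       # the nonzero eigenvalues coincide
```

In the proof of Proposition 4, the lemma is used exactly this way, with $A = k(Z,Z)^+$, $B = h_X(Z,Z)$, and the two factorisations given by $H_X$ and $k(X,Z)/\sqrt{M}$.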
B.13. Proof of Proposition 5

Proof. Fix a sample $X$ with $\ker k(X,Z)\subset\ker k(Z,Z)$ and let us use the same notation as in the case of $\mu$, i.e., $H^\top H = h_X(Z,Z) = \frac{1}{M}k(Z,X)k(X,Z)$; $Hk(Z,Z)^+H^\top = V\operatorname{diag}(\kappa_1,\dots,\kappa_\ell)V^\top$ with $\kappa_1\ge\cdots\ge\kappa_\ell\ge0$ and $V$ orthogonal; $f_i = (H^+v_i)^\top k(Z,\cdot)$ and $k_X^Z(x,y) = \sum_{i=1}^{\ell}\kappa_if_i(x)f_i(y)$. In this case, from the same argument as in the last paragraph of the proof of Proposition 2, $H^+H$ is the identity map over $(\ker h_X(Z,Z))^\perp = (\ker k(X,Z))^\perp \supset (\ker k(Z,Z))^\perp$. By considering the SVD of $k(Z,Z)$, we see that $(\ker k(Z,Z))^\perp$ is exactly the linear subspace of $\mathbb{R}^\ell$ spanned by the eigenvectors of $k(Z,Z)$ with nonzero eigenvalues, which is equal to $\{k(Z,Z)v \mid v\in\mathbb{R}^\ell\} = \{k(Z,Z)^+v \mid v\in\mathbb{R}^\ell\}$. In particular, we have $H^+Hk(Z,Z)^+ = k(Z,Z)^+$.

We now prove that $\{\sqrt{\kappa_i}\,f_i \mid i\ge1,\ \kappa_i>0\}$ actually forms an orthonormal set in $\mathcal{H}_k$. Indeed, if $\kappa_i, \kappa_j>0$, we have
$$\begin{aligned}
\bigl\langle\sqrt{\kappa_i}\,f_i, \sqrt{\kappa_j}\,f_j\bigr\rangle_{\mathcal{H}_k}
&= \sqrt{\kappa_i\kappa_j}\;v_i^\top(H^+)^\top k(Z,Z)H^+v_j\\
&= \frac{1}{\sqrt{\kappa_i\kappa_j}}\,v_i^\top Hk(Z,Z)^+H^\top(H^+)^\top k(Z,Z)H^+Hk(Z,Z)^+H^\top v_j\\
&= \frac{1}{\sqrt{\kappa_i\kappa_j}}\,v_i^\top Hk(Z,Z)^+k(Z,Z)k(Z,Z)^+H^\top v_j
= \frac{1}{\sqrt{\kappa_i\kappa_j}}\,v_i^\top Hk(Z,Z)^+H^\top v_j = \delta_{ij},
\end{aligned}$$
where we have used the fact that $v_i$ and $v_j$ are eigenvectors of $Hk(Z,Z)^+H^\top$ with eigenvalues $\kappa_i$ and $\kappa_j$, respectively.

Let $P : \mathcal{H}_k\to\mathcal{H}_k$ be the orthogonal projection onto $\operatorname{span}\{\sqrt{\kappa_i}\,f_i \mid i>s,\ \kappa_i>0\}$. Then, we have
$$Pk(\cdot,x) = \sum_{i=s+1}^{\ell}\bigl\langle\sqrt{\kappa_i}\,f_i, k(\cdot,x)\bigr\rangle_{\mathcal{H}_k}\,\sqrt{\kappa_i}\,f_i = \sum_{i=s+1}^{\ell}\sqrt{\kappa_i}\,f_i(x)\,\sqrt{\kappa_i}\,f_i,$$
and so $\|Pk(\cdot,x)\|_{\mathcal{H}_k}^2 = \sum_{i=s+1}^{\ell}\kappa_if_i(x)^2 = k^Z(x,x) - k_{s,X}^Z(x,x)$. Note that the projection $P$ is a random operator depending on the sample $X$. Now, we can use Theorem 1 with the empirical measure given by $X$ instead of $Z$ to obtain
$$\mathbb{E}\Bigl[\mu\Bigl(\sqrt{k^Z - k_{s,X}^Z}\Bigr)\Bigr] \;\le\; 2\,\mathbb{E}\Bigl[\mu_X\Bigl(\sqrt{k_X^Z - k_{s,X}^Z}\Bigr)\Bigr] + 4\sqrt{\sum_{i>m}\sigma_i} + \sqrt{\frac{k_{\max}}{M}}\Bigl(\frac{80m^2\log(1+2M)}{9} + 69\Bigr) \tag{36}$$
for any integer $m\ge1$, where we have used $\|Pk(\cdot,x)\|_{\mathcal{H}_k} = \sqrt{k^Z(x,x) - k_{s,X}^Z(x,x)} = \sqrt{k_X^Z(x,x) - k_{s,X}^Z(x,x)}$ almost surely. From Proposition 4, we have
$$\mathbb{E}\Bigl[\mu_X\Bigl(\sqrt{k_X^Z - k_{s,X}^Z}\Bigr)\Bigr]^2 \;\le\; \mathbb{E}\Bigl[\mu_X\Bigl(\sqrt{k_X^Z - k_{s,X}^Z}\Bigr)^2\Bigr] \;\le\; \mathbb{E}\bigl[\mu_X\bigl(k_X^Z - k_{s,X}^Z\bigr)\bigr] \;\le\; \sum_{i>s}\sigma_i,$$
and combining it with (36) leads to the desired conclusion.

B.14. Proof of Theorem 4

Proof. We first prove the result for $Q_n = \mathrm{KQuad}(k_{s,Y}^Z, Y)$. Since $k(x,x) - k_{s,Y}^Z(x,x) = \bigl(k(x,x) - k^Z(x,x)\bigr) + \bigl(k_Y^Z(x,x) - k_{s,Y}^Z(x,x)\bigr)$ for $x\in Y$ from Proposition 3, we have
$$\mu_Y\Bigl(\sqrt{k - k_{s,Y}^Z}\Bigr) \;\le\; \mu_Y\Bigl(\sqrt{k - k^Z}\Bigr) + \mu_Y\Bigl(\sqrt{k_Y^Z - k_{s,Y}^Z}\Bigr).$$
From Proposition 4, by taking the expectation with regard to $Y$, we have
$$\mathbb{E}\Bigl[\mu_Y\Bigl(\sqrt{k_Y^Z - k_{s,Y}^Z}\Bigr)\Bigr] \;\le\; \sqrt{\mathbb{E}\bigl[\mu_Y\bigl(k_Y^Z - k_{s,Y}^Z\bigr)\bigr]} \;\le\; \sqrt{\sum_{i>s}\sigma_i},$$
and so we obtain
$$\mathbb{E}\Bigl[\mu_Y\Bigl(\sqrt{k - k_{s,Y}^Z}\Bigr)\Bigr] \;\le\; \mu\Bigl(\sqrt{k - k^Z}\Bigr) + \sqrt{\sum_{i>s}\sigma_i}.$$
By combining this with (13), it is now sufficient to show $\mathbb{E}[\mathrm{MMD}_k(\mu_Y,\mu)] \le \sqrt{c_{k,\mu}/N}$, but this actually follows from the identity $\mathbb{E}\bigl[\mathrm{MMD}_k(\mu_Y,\mu)^2\bigr] = c_{k,\mu}/N$, which can be shown by a straightforward calculation (see, e.g., Hayakawa et al., 2022, Proof of Theorem 7).

In the case of $Q_n = \mathrm{KQuad}(k_{s,\mu}^Z, Y)$, we instead have the decomposition
$$\mu_Y\Bigl(\sqrt{k - k_{s,\mu}^Z}\Bigr) \;\le\; \mu_Y\Bigl(\sqrt{k - k^Z}\Bigr) + \mu_Y\Bigl(\sqrt{k_\mu^Z - k_{s,\mu}^Z}\Bigr);$$
Theorem 2 yields the desired estimate in expectation.
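The identity $\mathbb{E}[\mathrm{MMD}_k(\mu_Y,\mu)^2] = c_{k,\mu}/N$ invoked at the end of the proof is cheap to verify by simulation whenever the integrals of $k$ under $\mu$ are available in closed form. The sketch below does this for the Gaussian kernel and $\mu = N(0,1)$, with $c_{k,\mu} = \int k(x,x)\,\mathrm{d}\mu(x) - \iint k(x,y)\,\mathrm{d}\mu(x)\mathrm{d}\mu(y)$ (the variance-type constant that makes the identity hold); the kernel choice and sample sizes are our own illustrative assumptions.

```python
# Monte Carlo check of the identity E[MMD_k(mu_Y, mu)^2] = c_{k,mu} / N used at
# the end of the proof of Theorem 4, for k(x,y) = exp(-(x-y)^2/2) and mu = N(0,1).
import numpy as np

rng = np.random.default_rng(3)

def kernel(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

def kmean(y):
    """Closed form of int k(y, x) dmu(x) for mu = N(0,1)."""
    return np.exp(-y ** 2 / 4.0) / np.sqrt(2.0)

mu_kk = 1.0 / np.sqrt(3.0)                 # double integral of k under mu x mu
c_k_mu = 1.0 - mu_kk                       # c_{k,mu} = int k(x,x) dmu - int int k dmu dmu

N, reps = 50, 20_000
mmd2 = np.empty(reps)
for r in range(reps):
    Y = rng.normal(size=N)                 # i.i.d. sample defining mu_Y
    # MMD(mu_Y, mu)^2 as a V-statistic: (1/N^2) sum k(y_i,y_j) - 2/N sum int k(y_i,.) dmu + mu_kk
    mmd2[r] = kernel(Y, Y).mean() - 2.0 * kmean(Y).mean() + mu_kk

print(f"Monte Carlo  E[MMD^2] = {mmd2.mean():.5f}")
print(f"Theory     c_k_mu / N = {c_k_mu / N:.5f}")
```

The two printed numbers agree up to Monte Carlo error, illustrating why the $N^{-1/2}$ term appears in the final quadrature bound.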