# Kernel Interpolation With Continuous Volume Sampling

Ayoub Belhadji, Rémi Bardenet, Pierre Chainais

Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, Villeneuve d'Ascq, France. Correspondence to: Ayoub Belhadji.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

A fundamental task in kernel methods is to pick nodes and weights, so as to approximate a given function from an RKHS by the weighted sum of kernel translates located at the nodes. This is the crux of kernel quadrature or kernel interpolation from discrete samples. Furthermore, RKHSs offer a convenient mathematical and computational framework, connecting the discrete and continuous worlds. We introduce and analyse continuous volume sampling (VS), the continuous counterpart, for choosing node locations, of a discrete distribution introduced in (Deshpande & Vempala, 2006). Our contribution is theoretical: we prove almost optimal bounds for interpolation and quadrature under VS. While similar bounds already exist for some specific RKHSs using ad hoc node constructions, VS offers bounds that apply to any Mercer kernel and depend only on the spectrum of the associated integration operator. We emphasize that, unlike previous randomized approaches that rely on regularized leverage scores or determinantal point processes, evaluating the pdf of VS only requires pointwise evaluations of the kernel. VS is thus naturally amenable to MCMC samplers.

## 1. Introduction

Kernel approximation is a recurrent task in machine learning (Hastie, Tibshirani, and Friedman, 2009)[Chapter 5], signal processing (Unser, 2000) and numerical quadrature (Larkin, 1972). Expressed in its general form, we are given a reproducing kernel Hilbert space F (RKHS; Berlinet & Thomas-Agnan, 2011) of functions over X, with a symmetric kernel k : X × X → R+, and an element µ : X → R of F. We ask for conditions on a design x = (x_1, ..., x_N) ∈ X^N, and on the corresponding weights w_1, ..., w_N, such that the RKHS norm

‖µ − ∑_{i=1}^{N} w_i k(x_i, ·)‖_F  (1)

is small. In other words, µ should be well reconstructed in F by the weighted sum. Measuring the error in RKHS norm has a computational advantage. Indeed, minimizing (1) boils down to minimizing a quadratic form; and given a design x such that Det K(x) = Det(k(x_i, x_j)) > 0, Equation (1) has a unique set of minimizing weights, namely ŵ = K(x)^{-1} µ(x), where µ(x) ∈ R^N contains the evaluations of µ at the N design nodes x_i. Whenever the weights are chosen to be ŵ, the sum in (1) takes the same values as µ at the nodes x_i, and the optimal value of (1) is thus called the interpolation error; otherwise we speak of the approximation error. Note that when the kernel k is bounded, guarantees in RKHS norm translate into guarantees in the supremum norm.

In this work, we propose and analyze interpolation based on a random design drawn from a distribution called continuous volume sampling, which favors designs x with a large value of Det K(x). After introducing this new distribution, we prove non-asymptotic guarantees on the interpolation error which depend on the spectrum of the kernel k. Previous kernel-based randomized designs, both i.i.d. (Bach, 2017) and repulsive (Belhadji et al., 2019), can be hard to compute in practice since they require access to the Mercer decomposition of k.
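To make the optimal weights concrete, here is a minimal numerical sketch, not taken from the paper: the Gaussian kernel and the particular target µ (a finite combination of kernel translates, so that its RKHS norm is available in closed form) are illustrative choices of ours. The sketch solves ŵ = K(x)^{-1} µ(x) and checks that the weighted sum of kernel translates matches µ at the nodes.

```python
import numpy as np

def gauss_kernel(x, y, ell=0.2):
    """Gaussian RKHS kernel on the real line (illustrative choice)."""
    return np.exp(-(x[:, None] - y[None, :])**2 / (2 * ell**2))

rng = np.random.default_rng(0)
nodes = rng.uniform(0, 1, size=8)            # a design x = (x_1, ..., x_N)
K = gauss_kernel(nodes, nodes)               # Gram matrix K(x)

# A target mu in the RKHS: mu = sum_j c_j k(z_j, .), so that mu(x_i) is computable.
z = np.array([0.2, 0.5, 0.9]); c = np.array([1.0, -0.5, 2.0])
mu_at_nodes = gauss_kernel(nodes, z) @ c     # vector mu(x) in R^N

w_hat = np.linalg.solve(K, mu_at_nodes)      # optimal weights K(x)^{-1} mu(x)

# Interpolation property: the weighted sum of kernel translates matches mu at the nodes.
print(np.allclose(K @ w_hat, mu_at_nodes))   # True

# Squared interpolation error in RKHS norm: ||mu||_F^2 - mu(x)^T K(x)^{-1} mu(x).
mu_sq_norm = c @ gauss_kernel(z, z) @ c
print(mu_sq_norm - mu_at_nodes @ w_hat)
```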
We show here that continuous volume sampling enjoys similar error bounds, as well as additional interpretable geometric properties, while having a joint density that can be evaluated as soon as one can evaluate the RKHS kernel k. In particular, this opens the possibility of Markov chain Monte Carlo samplers (Rezaei & Gharan, 2019). Volume sampling was originally introduced on a finite domain (Deshpande et al., 2006), where it has been used in matrix subsampling for linear regression and low-rank approximation (Derezinski & Warmuth, 2017; Belhadji et al., 2018). Like (Belhadji et al., 2019), our work connects the discrete problem of subsampling columns from a matrix and the continuous problem of interpolating functions in an RKHS.

The rest of the article is organized as follows. Section 2 reviews kernel-based interpolation. In Section 3, we define continuous volume sampling and relate it to projection determinantal point processes, as used by (Belhadji et al., 2019). Section 4 contains our main results, while Section 5 contains sketches of all proofs, with pointers to the supplementary material for missing details. Section 6 numerically illustrates our main result. We conclude in Section 7, discussing some consequences beyond kernel interpolation.

Notation and assumptions. We assume that X is equipped with a Borel measure dω, and that the support of dω is X. Let L²(dω) be the Hilbert space of square-integrable, real-valued functions on X, with inner product ⟨·, ·⟩_dω and associated norm ‖·‖_dω.

Assumption A: ∫_X k(x, x) dω(x) < +∞.

Under Assumption A, define the integration operator

Σf = ∫_X k(·, y) f(y) dω(y),  f ∈ L²(dω).  (2)

By construction, Σ is self-adjoint, positive semi-definite, and trace-class (Simon, 2005). For m ∈ N*, with N* = N \ {0}, denote by e_m the m-th eigenfunction of Σ, normalized so that ‖e_m‖_dω = 1, and by σ_m the corresponding eigenvalue. Assumption A implies that the embedding operator I_F : F → L²(dω) is compact; moreover, since dω has full support in X, I_F is injective (Steinwart & Christmann, 2008). This implies a Mercer-type decomposition of k,

k(x, y) = ∑_{m∈N*} σ_m e_m(x) e_m(y),  (3)

where the convergence is pointwise (Steinwart & Scovel, 2012). The eigenvalues (σ_m) are assumed to be non-increasing. Moreover, for m ∈ N*, we write e^F_m = √σ_m e_m. Since I_F is injective, (e^F_m)_{m∈N*} is an orthonormal basis of F (Steinwart & Scovel, 2012). Unless explicitly stated, we assume that F is dense in L²(dω), so that (e_m)_{m∈N*} is an orthonormal basis of L²(dω). For more intuition, under these assumptions, f ∈ F if and only if ∑_m σ_m^{-1} ⟨f, e_m⟩²_{L²(dω)} converges; and we denote, for r ≥ 0,

‖Σ^{-r} f‖²_F = ∑_m ⟨f, e^F_m⟩²_F / σ_m^{2r}.

For x ∈ X^N, we define K(x) := (k(x_i, x_j))_{i,j∈[N]}. If Det K(x) > 0, the subspace T(x) = Span(k(x_i, ·))_{i∈[N]} is of dimension N; we denote by Π_{T(x)} the ⟨·, ·⟩_F-orthogonal projection onto T(x). For N ∈ N*, we will often sum over the sets

U^m_N = {U ⊂ N*, |U| = N, m ∉ U},  (4)
U_N = {U ⊂ N*, |U| = N}.  (5)

Finally, define the approximation error

E(µ; x, w) = ‖µ − ∑_{i∈[N]} w_i k(x_i, ·)‖_F,  (6)

where [N] = {1, ..., N}. If Det K(x) > 0, let ŵ = K(x)^{-1} µ(x) and define the interpolation error

E(µ; x) = ‖µ − ∑_{i∈[N]} ŵ_i k(x_i, ·)‖_F  (7)
        = ‖µ − Π_{T(x)} µ‖_F.  (8)
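As a small illustration of definitions (6) to (8), the following sketch (our own; the kernel, nodes and target are arbitrary choices and not prescribed by the paper) evaluates the squared approximation error for generic weights via ‖µ‖²_F − 2wᵀµ(x) + wᵀK(x)w, and checks that the optimal weights ŵ of (7) do better than, e.g., uniform weights.

```python
import numpy as np

def kern(x, y, ell=0.3):
    """An illustrative bounded kernel (Gaussian); any Mercer kernel would do."""
    return np.exp(-(x[:, None] - y[None, :])**2 / (2 * ell**2))

rng = np.random.default_rng(1)
nodes = np.sort(rng.uniform(0, 1, 6))
K = kern(nodes, nodes)

# Target mu = sum_j c_j k(z_j, .), so RKHS norms are computable in closed form.
z, c = np.linspace(0.1, 0.9, 4), np.array([0.5, -1.0, 2.0, 0.3])
mu_x = kern(nodes, z) @ c                       # mu evaluated at the nodes
mu_sq = c @ kern(z, z) @ c                      # ||mu||_F^2

def sq_error(w):
    """E(mu; x, w)^2 = ||mu||_F^2 - 2 w^T mu(x) + w^T K(x) w."""
    return mu_sq - 2 * w @ mu_x + w @ K @ w

w_unif = np.full(len(nodes), 1.0 / len(nodes))  # arbitrary (uniform) weights
w_hat = np.linalg.solve(K, mu_x)                # optimal weights of (7)

print(sq_error(w_unif), sq_error(w_hat))        # the interpolation error is the smaller one
```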
## 2. Related Work

This section reviews some results on kernel interpolation to better situate our contributions. The literature on this topic is prolific and cannot be covered in detail here. In particular, we start by reviewing results on optimal kernel quadrature, a particular case of kernel interpolation.

### 2.1. Interpolation for optimal kernel quadrature

Given g ∈ L²(dω), kernel quadrature deals with approximating integrals of the form

∫_X f(x) g(x) dω(x) ≈ ∑_{i∈[N]} w_i f(x_i),  f ∈ F,  (9)

where the weights w_i do not depend on f. In principle, it is easy to control the integration error, i.e., the absolute value of the difference between the l.h.s. and r.h.s. of (9). Indeed,

|∫_X f(x) g(x) dω(x) − ∑_{i∈[N]} w_i f(x_i)| ≤ ‖f‖_F E(µ_g; x, w),  (10)

where µ_g = ∫_X g(x) k(x, ·) dω(x) = Σg is the so-called embedding of g in the RKHS F (when g is constant, µ_g is classically called the mean element of the measure dω; Smola et al., 2007). An upper bound on the approximation error of µ_g thus implies an upper bound on the integration error that is uniform over any bounded subset of F. This observation sparked intense research on the kernel approximation of embeddings µ_g. Among kernel approximation results, we pay particular attention to interpolation, i.e., approximation with optimal weights. In the sequel, we call optimal kernel quadrature the quadrature based on the optimal weights ŵ minimizing (1) for a given set of nodes.

(Bojanov, 1981) proved that, for g ≡ 1, the interpolation of µ_g using the uniform grid over X = [0, 1] has an error in O(N^{-2s}) if F is the periodic Sobolev space of order s, and that no set of nodes can achieve a faster rate. A similar rate was proved for non-constant g (Novak et al., 2015), even though it is only asymptotically optimal in that case. In the quasi-Monte Carlo (QMC) literature, several designs were investigated for X = [0, 1]^d, g ≡ 1, and spaces F that may not even be Hilbert spaces; see (Dick & Pillichshammer, 2010). In this context, the term QMC quadrature rule means a low-discrepancy sequence, loosely speaking a well-spread set of nodes, along with uniform weights w_i = 1/N. If F is a Korobov space of order s ≥ 1, the Halton sequence of nodes (Halton, 1964) leads to E(µ_g; x, (1/N))² in O(log(N)^{2d} N^{-2}), and higher-order digital nets converge faster, as O(log(N)^{2sd} N^{-2s}) (Dick & Pillichshammer, 2014)[Theorem 5]. These rates are naturally inherited if the uniform weights are replaced by the respective optimal weights ŵ, as observed by (Briol et al., 2019). In particular, (Briol et al., 2019) emphasize that the bound for higher-order digital nets attains the optimal rate in this RKHS. For optimal kernel quadrature based on Halton sequences, this inheritance argument does not explain the fast O(log(N)^{2sd} N^{-2s}) rates observed empirically by (Oettershagen, 2017).

Besides the hypercube, optimal kernel quadrature has been considered on the hypersphere equipped with the uniform measure (Ehler et al., 2019), or on R^d equipped with the Gaussian measure (Karvonen & Särkkä, 2019). In these works, the design construction is ad hoc for the space X, and g is usually assumed to be constant. Another approach is offered by the optimization algorithms that we review in Section 2.3. Before that, we clarify the subtle difference between optimal kernel quadrature and kernel interpolation.

### 2.2. Kernel interpolation beyond embeddings

Besides the approximation of the embeddings µ_g discussed in Section 2.1, theoretical guarantees for the kernel interpolation of a general µ ∈ F are sought per se. The Shannon reconstruction formula for bandlimited signals (Shannon, 1948) is implicitly an interpolation formula for the sinc kernel. The RKHS approach to sampling in signal processing was introduced in (Yao, 1967) for the Hilbert space of bandlimited signals; see also (Nashed & Walter, 1991) for generalizations.
Remarkably, in those RKHSs, every µ ∈ F is an embedding µ_g for some g ∈ L²(dω): k is a projection kernel of infinite rank. In general, for a trace-class kernel, the subspace spanned by the embeddings µ_g is strictly included in F. More precisely, every µ_g satisfies ‖Σ^{-1/2} µ_g‖_F = ‖Σ^{1/2} g‖_F = ‖g‖_{L²(dω)} < +∞. This condition is more restrictive than what is required for a generic µ to belong to F, namely ‖µ‖_F < +∞, so that kernel interpolation is more general than optimal kernel quadrature. The proposed approach can deal with any µ ∈ F.

Scattered data approximation (Wendland, 2004) is another field where quantitative error bounds for kernel interpolation on X ⊂ R^d are investigated; see (Schaback & Wendland, 2006) for a modern review. In a few words, these bounds typically depend on quantities such as the fill-in distance φ(x) = sup_{y∈X} min_{i∈[N]} ‖y − x_i‖_2, so that the interpolation error converges to zero as N → ∞ if φ(x) goes to zero. Any node set can be considered, as long as φ(x) is small. Using these techniques, (Oates & Girolami, 2016) proposed another application of kernel interpolation: the construction of functional control variates in Monte Carlo integration. Finally, note that the application of these techniques is restricted to compact domains: the fill-in distance is infinite if X is not compact, even for well-spread sets of nodes.

### 2.3. Optimization algorithms

Optimization approaches offer a variety of algorithms for the design of the interpolation nodes. (De Marchi, 2003) and (De Marchi et al., 2005) proposed greedily maximizing the so-called power function

p(y; x) = (k(y, y) − k_x(y)ᵀ K(x)^{-1} k_x(y))^{1/2},  (11)

where k_x(y) = (k(y, x_i))_{i∈[N]} and x is the current set of nodes. This algorithm leads to an interpolation error that goes to zero with N for a kernel of class C² (De Marchi et al., 2005). Later, (Santin & Haasdonk, 2017) proved better convergence rates for smoother kernels. Again, these results assume that the domain X is compact. Other greedy algorithms were proposed in the context of Bayesian quadrature (BQ), such as sequential BQ (Huszár & Duvenaud, 2012) or Frank-Wolfe BQ (Briol et al., 2015). These algorithms sequentially minimize E(µ_g; x) for a fixed g ∈ L²(dω). The nodes are thus adapted to one particular µ_g by construction. In general, each step of these greedy algorithms requires solving a non-convex problem with many local minima (Oettershagen, 2017)[Chapter 5]. In practice, costly approximations must be employed, such as local search over a random grid (Lacoste-Julien et al., 2015).

An alternative approach, which is closely related to our contribution and has attracted a lot of recent interest, is to observe that the squared power function (11) can be upper bounded by the inverse of Det K(x) (Schaback, 2005; Tanaka, 2019). Designs that maximize Det K(x) are called Fekete points; see e.g. (Bos & Maier, 2002; Bos & De Marchi, 2011). (Tanaka, 2019) proposed to approximate Det K(x) using the Mercer decomposition of k, followed by a rounding of the solution of a D-experimental design problem, yet without a theoretical analysis of the interpolation error. (Karvonen et al., 2019) proved that, for the uni-dimensional Gaussian kernel, the approximate objective function of (Tanaka, 2019) is actually convex. Moreover, (Karvonen et al., 2019) analyze the resulting interpolation error; see also Section 4.2. Finally, we emphasize that these algorithms require the knowledge of a Mercer-type decomposition of k, so that they cannot be implemented for any kernel; moreover, the approximate objective function may be non-convex in general.
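A minimal sketch of the greedy power-function construction just described (often called P-greedy); the Gaussian kernel, the candidate grid, and the choice of first node are illustrative simplifications of ours, not the exact setups of the cited works.

```python
import numpy as np

def kern(x, y, ell=0.2):
    return np.exp(-(x[:, None] - y[None, :])**2 / (2 * ell**2))

def p_greedy(candidates, n_nodes):
    """Greedily add the candidate maximizing the squared power function
    p(y; x)^2 = k(y, y) - k_x(y)^T K(x)^{-1} k_x(y)."""
    k_diag = kern(candidates, candidates).diagonal()   # k(y, y) on the grid
    nodes = [candidates[len(candidates) // 2]]         # arbitrary first node
    for _ in range(n_nodes - 1):
        X = np.array(nodes)
        K_inv = np.linalg.inv(kern(X, X))
        kxc = kern(X, candidates)                      # shape (len(nodes), n_candidates)
        power2 = k_diag - np.einsum("ij,ik,kj->j", kxc, K_inv, kxc)
        nodes.append(candidates[np.argmax(power2)])
    return np.array(nodes)

grid = np.linspace(0, 1, 500)
print(p_greedy(grid, 8))    # the selected nodes spread out over [0, 1]
```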
### 2.4. Random designs

In this section, we survey random node designs with uniform-in-g approximation guarantees for the embeddings µ_g in the RKHS norm. (Bach, 2017) studied the quadrature resulting from sampling i.i.d. nodes (x_j) from some proposal distribution q. He proved that when the proposal is chosen to be

q_λ(x) ∝ ∑_{m∈N*} (σ_m / (σ_m + λ)) e_m(x)²,  (12)

with λ > 0, and the number of points N satisfies N ≥ 5 d_λ log(16 d_λ / δ) with d_λ = Tr Σ(Σ + λI)^{-1}, then with probability larger than 1 − δ,

sup_{‖g‖_dω ≤ 1} inf_{‖w‖² ≤ 4/N} ‖µ_g − ∑_{j∈[N]} w_j q_λ(x_j)^{-1/2} k(x_j, ·)‖²_F ≤ 4λ.  (13)

The bound in (13) gives a control on the approximation error of µ_g by the subspace spanned by the k(x_j, ·), and this control is uniform over g in the unit ball of L²(dω). Note that, for a fixed value of λ, the upper bound in (13) guarantees that the approximation error is smaller than 4λ. It does not, however, guarantee that the error goes to zero as N increases, since it appears that λ should decrease as N increases. This coupling of N and λ, combined with the condition N ≳ d_λ log d_λ, makes it intricate to derive a convergence rate from (13). Moreover, the optimal density q_λ is only implicitly available in general, through the series in (12), which makes sampling and pointwise evaluation difficult in practice.

(Belhadji et al., 2019) proposed a related kernel-based quadrature, but using nodes sampled from a repulsive joint distribution called a projection determinantal point process (DPP); see (Hough et al., 2006) and our Section 3. In particular, the repulsion is characterized by the first eigenfunctions (e_n)_{n∈[N]} of the integration operator Σ. The weights ŵ are again chosen by minimizing the residual error (1), which gives the uniform bound

E_DPP sup_{‖g‖_dω ≤ 1} E(µ_g; x)² ≤ 2(N² r_N + o(N² r_N)),  (14)

where r_N = ∑_{m≥N+1} σ_m. This result can be improved by further restricting g to be an eigenfunction of Σ, leading to

E_DPP sup_{g∈{e_n; n≥1}} E(µ_g; x)² ≤ 2(N r_N + o(N r_N)).  (15)

Now, for smooth kernels, such as the Gaussian kernel or the Sobolev kernel with a large regularity parameter, the upper bounds in (14) and (15) do converge to 0 as N goes to +∞. Furthermore, sampling from the recommended projection DPP can be implemented easily, although it still requires the knowledge of the Mercer decomposition of k, unlike the method that we introduce here in Section 3. Since the bounds in (14) and (15) are uniform in g, they also apply to interpolation. One downside of the analysis in (Belhadji et al., 2019) is that these upper bounds are rather pessimistic: experimental results suggest faster rates, in O(σ_N). If one could prove these rates, then kernel quadrature or interpolation using DPPs would reach known lower bounds, which we now quickly survey.

### 2.5. Lower bounds

When investigating upper bounds for kernel interpolation errors, it is useful to remember existing lower bounds, so as to evaluate the tightness of one's results. In particular, N-width theory (Pinkus, 2012) implies lower bounds for kernel interpolation errors, which once again show the importance of the spectrum of Σ. The N-width of S = {µ_g = Σg : ‖g‖_{L²(dω)} ≤ 1} with respect to the couple (L²(dω), F) (Pinkus, 2012, Chapter 1.7) is defined as the square root of

d_N(S)² = inf_{Y⊂F, dim Y = N} sup_{‖g‖_dω ≤ 1} inf_{y∈Y} ‖Σg − y‖²_F.

In interpolation, we do use a subspace Y ⊂ F spanned by N independent functions k(x_i, ·), so that

sup_{‖g‖_dω ≤ 1} E(Σg; x)² ≥ d_N(S)².  (16)

Applying (Pinkus, 2012, Theorem 2.2, Chapter 4) to the adjoint of the embedding operator I_F (Steinwart & Scovel, 2012)[Lemma 2.2], it follows that d_N(S)² = σ_{N+1}.
One may object that some QMC sequences seem to breach this lower bound. For example, in the Korobov space (d = 2, s ≥ 1), σ_{N+1} = O(log(N)^{2s} N^{-2s}) (Bach, 2017), while the interpolation of µ_g with g ≡ 1 using a Fibonacci lattice leads to an error in O(log(N) N^{-2s}) = o(σ_{N+1}) (Bilyk et al., 2012)[Theorem 4]. But this is the rate for one particular µ_g, and it cannot be achieved uniformly in g.

## 3. Volume Sampling and DPPs

In this section, we introduce a repulsive distribution that we call continuous volume sampling (VS) and compare it to projection determinantal point processes (DPPs; Hough et al., 2006). Both continuous VS and projection DPPs are parametrized by a reference measure dω and a repulsion kernel K : X × X → R+.

### 3.1. Continuous volume sampling

Definition 1 (Continuous volume sampling). Let N ∈ N* and x = {x_1, ..., x_N} ⊂ X. We say that x follows the volume sampling distribution if (x_1, ..., x_N) is a random variable of X^N, the law of which is absolutely continuous with respect to ⊗_{i∈[N]} dω(x_i), with density

f_VS(x_1, ..., x_N) ∝ Det K(x).  (17)

Two remarks are in order. First, under Assumption A, the density f_VS in (17) is well defined, i.e., Det K(x) has a finite integral. Indeed, Hadamard's inequality yields

∫_{X^N} Det K(x) ∏_{i∈[N]} dω(x_i) ≤ ∫_{X^N} ∏_{i∈[N]} k(x_i, x_i) ∏_{i∈[N]} dω(x_i) = (∫_X k(x, x) dω(x))^N < +∞.

Second, the determinant in (17) is invariant to permutations, so that continuous volume sampling can indeed be seen as defining a random set x = {x_1, ..., x_N}. In the following, we denote, for any symmetric and continuous kernel k̃ satisfying Assumption A,

Z_N(k̃) := ∫_{X^N} Det K̃(x) ∏_{i∈[N]} dω(x_i).  (18)

### 3.2. Continuous volume sampling as a mixture of DPPs

Definition 1 could be mistaken for the definition of a determinantal point process (DPP; Macchi, 1975). However, the cardinality of a DPP sample is a sum of Bernoulli random variables (Hough et al., 2006), while volume sampling is supported on subsets of X with cardinality exactly equal to N. This property is convenient for approximation tasks where the number of nodes N is fixed. While it is not a DPP, volume sampling is actually a mixture of DPPs.

Proposition 2. For U ⊂ N*, define the projection kernel

K_U(x, y) = ∑_{u∈U} e_u(x) e_u(y).  (19)

For N ∈ N*, we have

f_VS(x_1, ..., x_N) ∝ ∑_{U∈U_N} (∏_{u∈U} σ_u) Det(K_U(x_i, x_j))_{(i,j)∈[N]×[N]},  (20)

and the normalization constant is equal to

Z_N(k) = N! ∑_{U∈U_N} ∏_{u∈U} σ_u.  (21)

The proof of this proposition is given in Appendix 2.1. Observe that for every U ∈ U_N,

(x_1, ..., x_N) ↦ (1/N!) Det(K_U(x_i, x_j))_{(i,j)∈[N]×[N]}  (22)

defines a well-normalized probability distribution on X^N, called the projection DPP associated to the marginal kernel K_U (Hough et al., 2006). Among all DPPs, only projection DPPs have a deterministic cardinality, equal to the rank of K_U (Hough et al., 2006). Interestingly, the largest weight in the mixture (20) corresponds to the projection DPP with marginal kernel K_{[N]} proposed in (Belhadji et al., 2019) for kernel quadrature. The following lemma gives an upper bound on this maximal weight using the eigenvalues of Σ.

Lemma 3. For N ∈ N*, define

δ_N = (∏_{u∈[N]} σ_u) / (∑_{U∈U_N} ∏_{u∈U} σ_u).  (23)

Then for all N ∈ N*, δ_N ≤ σ_N / r_N. In particular, if the spectrum of k decreases polynomially, then δ_N = O(1/N), so that, as N grows, volume sampling becomes increasingly different from the projection DPP of (Belhadji et al., 2019). In contrast, if the spectrum decays exponentially, then δ_N = O(1).
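Coming back to Definition 1: evaluating its unnormalized density only requires pointwise kernel evaluations. A minimal sketch (the Gaussian kernel and the two test designs are illustrative choices of ours); the log-determinant makes the repulsion visible, since nearly coincident nodes drive Det K(x) to zero.

```python
import numpy as np

def kern(x, y, ell=0.25):
    """Any pointwise-evaluable Mercer kernel works; Gaussian is an illustrative choice."""
    return np.exp(-(x[:, None] - y[None, :])**2 / (2 * ell**2))

def log_vs_density(x):
    """Unnormalized log-density of continuous volume sampling: log Det K(x).
    slogdet is used for numerical stability; -inf signals a degenerate design."""
    sign, logdet = np.linalg.slogdet(kern(x, x))
    return logdet if sign > 0 else -np.inf

rng = np.random.default_rng(2)
clumped = rng.uniform(0.45, 0.55, 6)     # nearly coincident nodes
spread = np.linspace(0.05, 0.95, 6)      # well-spread nodes
print(log_vs_density(clumped), log_vs_density(spread))  # the spread design is far more likely
```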
### 3.3. Sampling algorithms

A projection DPP can be sampled exactly as long as one can evaluate the corresponding projection kernel K (Hough et al., 2006). For kernel quadrature (Belhadji et al., 2019), evaluating K requires the knowledge of the Mercer decomposition of the RKHS kernel k. The algorithm of (Hough et al., 2006) implements the chain rule for projection DPPs, and each conditional is sampled using rejection sampling. This suggests using the mixture in Proposition 2 to sample from the volume sampling distribution. Again, such an algorithm requires explicit knowledge of the Mercer decomposition of the kernel, or at least a decomposition onto an orthonormal basis of F as in (Karvonen et al., 2019). This is a strong requirement that is undesirable in practice.

The fact that the joint pdf (17) only requires evaluating k pointwise suggests that volume sampling is fully kernelized, in the sense that a sampling algorithm should be able to bypass the need for a kernel decomposition, thus making the method very widely applicable. One could proceed by rejection sampling, yet the acceptance ratio would likely scale poorly with N. A workaround is to use an MCMC sampler similar to what was proposed in (Rezaei & Gharan, 2019). This MCMC algorithm is a Gibbs sampler: given a state x = {x_1, ..., x_N}, remove a node x_n chosen uniformly at random and add a node y with probability proportional to Det K(x'), where x' = {x_1, ..., x_{n-1}, y, x_{n+1}, ..., x_N}; the initial state of the Markov chain follows a distribution with density p_0 defined on X^N with respect to ∏_{n∈[N]} dω(x_n). For this Markov chain, the authors were able to derive bounds on the mixing time

τ_{P_0}(η) = min{t : ‖P_t − P_VS‖_TV ≤ η},

where ‖·‖_TV is the total variation distance, P_t is the distribution of the Markov chain after t steps, and P_VS is the distribution of continuous volume sampling. The authors proposed to initialize the Markov chain with a sequential algorithm and proved that the mixing time scales as O(N^5 log(N)). They also proved bounds on the expected number of rejections, which shows the feasibility of the implementation of the Gibbs steps. This sequential algorithm can be implemented in a fully kernelized way, without the need for the Mercer decomposition of k. We leave the investigation of the efficiency of such an MCMC approach to volume sampling for future work.

## 4. Main Results

In this section, we give a theoretical analysis of kernel interpolation at nodes that follow the continuous volume sampling distribution. We state our main result in Section 4.1, a uniform-in-g upper bound on E_VS ‖µ_g − Π_{T(x)} µ_g‖²_F. We give an upper bound for a general µ ∈ F in Section 4.2.

### 4.1. The interpolation error for embeddings µ_g

The main theorem of this article decomposes the expected error for an embedding µ_g in terms of the expected errors ε_m(N) for eigenfunctions of the kernel.

Theorem 4. Let g = ∑_{m∈N*} g_m e_m satisfy ‖g‖_dω ≤ 1. Then, under Assumption A,

E_VS ‖µ_g − Π_{T(x)} µ_g‖²_F = ∑_{m∈N*} g_m² ε_m(N),  (24)

where ε_m(N) = σ_m (∑_{U∈U^m_N} ∏_{u∈U} σ_u) / (∑_{U∈U_N} ∏_{u∈U} σ_u). In particular, the sequence (ε_m(N))_{m∈N*} is non-increasing and

sup_{‖g‖_dω ≤ 1} E_VS ‖µ_g − Π_{T(x)} µ_g‖²_F ≤ sup_{m∈N*} ε_m(N) = ε_1(N).  (25)

Moreover,

ε_1(N) ≤ σ_N (1 + β_N),  (26)

where β_N = min_{M∈[2:N]} [(N − M + 1) σ_N]^{-1} ∑_{m>M} σ_m.

In other words, under continuous volume sampling, ε_1(N) is a uniform upper bound on the expected squared interpolation error of any embedding µ_g such that ‖g‖_dω ≤ 1. We shall see in Section 5.1 that ε_m(N) = E_VS ‖µ_{e_m} − Π_{T(x)} µ_{e_m}‖²_F. Now, for N_0 ∈ N*, a simple counting argument yields, for m ≥ N_0, ε_m(N) ≤ σ_{N_0}. Actually, for m ≥ N_0, ‖µ_{e_m}‖²_F ≤ σ_{N_0}, independently of the nodes.
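The closed form of ε_m(N) only involves elementary symmetric polynomials of the spectrum, so Theorem 4 can be explored numerically. A small sketch (the geometric spectrum and the truncation level are illustrative choices of ours, and the expression of β_N is the one displayed above):

```python
import numpy as np

def elem_sym(vals, N):
    """Elementary symmetric polynomials e_0, ..., e_N of the entries of vals."""
    e = np.zeros(N + 1)
    e[0] = 1.0
    for v in vals:
        e[1:] = e[1:] + v * e[:-1]   # numpy evaluates the right-hand side before assigning
    return e

def eps_m(sigma, m, N):
    """eps_m(N) = sigma_m * sum_{U in U_N^m} prod sigma_u / sum_{U in U_N} prod sigma_u."""
    num = elem_sym(np.delete(sigma, m - 1), N)[N]    # size-N subsets avoiding m
    den = elem_sym(sigma, N)[N]                      # all size-N subsets
    return sigma[m - 1] * num / den

sigma = 0.7 ** np.arange(1, 301)                     # sigma_m = 0.7^m, truncated at 300 terms
N = 10
eps = [eps_m(sigma, m, N) for m in range(1, 16)]
print(np.all(np.diff(eps) <= 1e-12))                 # (eps_m(N))_m is non-increasing (up to round-off)

beta_N = min(sigma[M:].sum() / ((N - M + 1) * sigma[N - 1]) for M in range(2, N + 1))
print(eps[0], sigma[N - 1] * (1 + beta_N))           # eps_1(N) lies below sigma_N (1 + beta_N), cf. (26)
```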
Inequality (26) is less trivial and makes the continuous volume sampling distribution worthy of interest: the upper bound goes to 0 as N → +∞, below the initial error σ_{N_0}. Moreover, the convergence rate is O(σ_N), matching the lower bound of Section 2.5 whenever the sequence (β_N)_{N∈N*} is bounded. In the following proposition, we prove that this is the case as soon as the spectrum decreases polynomially (e.g., Sobolev spaces of finite smoothness) or exponentially (e.g., the Gaussian kernel).

Proposition 5. If σ_m = m^{-2s} with s > 1/2, then

for all N ∈ N*,  β_N ≤ 1 + 1/(2s − 1).  (27)

If σ_m = α^m, with α ∈ [0, 1[, then

for all N ∈ N*,  β_N ≤ α/(1 − α).  (28)

In both cases, the proof uses the fact that

β_N ≤ [(N − M_N + 1) σ_N]^{-1} ∑_{m>M_N} σ_m,  (29)

for a well-designed sequence M_N. For example, if σ_m = m^{-2s}, we take M_N = N/c with c > 1; if σ_m = α^m, we take M_N = N. We give a detailed proof in the supplementary material. For a general kernel, if an asymptotic equivalent of σ_N is known (Widom, 1963; 1964), it should be possible to give an explicit construction of M_N. Indeed,

β_N ≤ σ_{M_N}/σ_N + [(N − M_N + 1) σ_N]^{-1} ∑_{m≥N+1} σ_m,  (30)

and M_N should be chosen to control both terms in the r.h.s.

Figure 1 illustrates the upper bound of Theorem 4 and the constant of Proposition 5 in the case of the periodic Sobolev space of order s = 3. We observe that E_VS E(µ_{e_m}; x)² respects the upper bound: it starts from the initial error level σ_m and decreases according to the upper bound for N ≥ m.

### 4.2. The interpolation error of any element of F

Theorem 4 dealt with the interpolation of an embedding µ_g of some function g ∈ L²(dω). We now give a bound on the interpolation error for any µ ∈ F. We need the following assumption, which is relatively weak; see Proposition 5 and the discussion that follows.

Assumption B: There exists B > 0 such that β_N ≤ B for all N ∈ N*.

Theorem 6. Let µ ∈ F. Assume that ‖Σ^{-r} µ‖_F < +∞ for some r ∈ [0, 1/2]. Then, under Assumption B,

E_VS E(µ; x)² ≤ (2 + B) σ_N^{2r} ‖Σ^{-r} µ‖²_F = O(σ_N^{2r}).

In other words, the expected interpolation error depends on the smoothness parameter r. For r = 1/2, we exactly recover the rate of Theorem 4. In contrast, for r < 1/2, the rate O(σ_N^{2r}) is slower. For r = 0, our bound is constant in N. Note that assuming more smoothness (r > 1/2) does not seem to improve on the rate O(σ_N).

Let us comment on this bound in two classical cases. First, consider the uni-dimensional Sobolev space of order s. Assumption B is satisfied by Proposition 5, and the squared error scales as O(N^{-4sr}). Moreover, for this family of RKHSs, ‖Σ^{-r} ·‖_F can be seen as the norm of the Sobolev space of order (2r + 1)s, and we recover a result of (Schaback & Wendland, 2006)[Theorem 7.8] for quasi-uniform designs. By using the norm of the RKHS F of rougher functions, we upper bound the interpolation error of µ belonging to the smoother RKHS Σ^r F. Second, we emphasize again that our result is agnostic to the choice of the kernel, as long as Assumption B holds. In particular, Theorem 6 applies to the Gaussian kernel: the rate O(σ_N^{2r}) is slower, yet still exponential. Finally, recall that for f ∈ F,

|f(x)|² = |⟨f, k(x, ·)⟩_F|² ≤ ‖f‖²_F k(x, x),  (31)

so that bounds on the RKHS norm imply bounds on the uniform norm if the kernel k is bounded. Therefore, for r ∈ [0, 1/2], our result improves on the rate O(N² σ_N^{2r}) of approximate Fekete points (Karvonen et al., 2019).
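Assumption B is easy to check numerically for a given spectrum, using the expression of β_N given after Theorem 4. A short sketch, mirroring the two spectra of Proposition 5 (the truncation level is our choice):

```python
import numpy as np

def beta(sigma, N):
    """beta_N = min over M in [2:N] of sum_{m > M} sigma_m / ((N - M + 1) sigma_N)."""
    return min(sigma[M:].sum() / ((N - M + 1) * sigma[N - 1]) for M in range(2, N + 1))

m = np.arange(1, 200_001, dtype=float)    # truncated spectrum, 2e5 terms (our choice)
poly = 1.0 / m**4                         # sigma_m = m^{-2s} with s = 2
expo = 0.5 ** m                           # sigma_m = alpha^m with alpha = 0.5

for N in (5, 10, 20, 40):
    print(N, beta(poly, N), beta(expo, N))   # both sequences stay bounded in N
```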
### 4.3. Asymptotic unbiasedness of kernel quadrature

As explained in Section 2.1, kernel interpolation is widely used for the design of quadratures. In that setting, one more advantage of continuous volume sampling is the consistency of its estimator. This is the purpose of the following result.

Theorem 7. Let f ∈ F and g ∈ L²(dω). Then the bias

B_N(f, g) := E_VS [∑_{i∈[N]} ŵ_i f(x_i)] − ∫_X f(x) g(x) dω(x)

satisfies

B_N(f, g) = −∑_{n∈N*} ⟨f, e_n⟩_dω ⟨g, e_n⟩_dω (1 − E_VS τ^F_n(x)).

Moreover, B_N(f, g) → 0 as N → +∞.

Compared to the upper bound on the integration error given by (10), the bias term in Theorem 7 takes into account the interaction between f and g. For example, if ⟨f, e_n⟩_dω ⟨g, e_n⟩_dω = 0 for all n ∈ N*, the quadrature is unbiased for every N. Theorem 7 is a generalization of a known property of regression estimators based on volume sampling in the discrete setting (Ben-Tal & Teboulle, 1990; Derezinski & Warmuth, 2017).

Figure 1. The value of E_VS E(µ_{e_m}; x)² for m ∈ {1, 2, 3, 4, 5} for the periodic Sobolev space (s = 3, d = 1), compared to the theoretical upper bound (UB) (1 + B)σ_N of Theorem 4.

## 5. Sketch of the Proofs

The proof of Theorem 4 decomposes into three steps. First, in Section 5.1, we write E(µ_g; x)² as a function of the squared interpolation errors E(µ_{e_m}; x)² of the embeddings µ_{e_m}. Then, in Section 5.2, we give closed formulas for E_VS E(µ_{e_m}; x)² in terms of the eigenvalues of Σ. Finally, the inequality (26) is proved using an upper bound on a ratio of symmetric polynomials (Guruswami & Sinop, 2012). The details are given in Appendix 2.4.3. Finally, the proofs of Theorem 6 and Theorem 7 are straightforward consequences of Theorem 4; the details are given in Appendix 2.9 and Appendix 2.10.

### 5.1. Decomposing the interpolation error

Let x ∈ X^N be such that Det K(x) > 0. For m_1, m_2 ∈ N*, let the cross-leverage score between m_1 and m_2 associated to x be

τ^F_{m_1,m_2}(x) = e^F_{m_1}(x)ᵀ K(x)^{-1} e^F_{m_2}(x).  (32)

When m_1 = m_2 = m, we speak of the m-th leverage score associated to x, and simply write τ^F_m(x). (Our definition is consistent with the leverage scores used in matrix subsampling (Drineas et al., 2006): loosely speaking, τ^F_m(x) is the leverage score of the m-th column of the semi-infinite matrix (e^F_n(x_i))_{(i,n)∈[N]×N*}.) By Lemma S6, the m-th leverage score is related to the interpolation error of the m-th eigenfunction e^F_m. Indeed,

‖e^F_m − Π_{T(x)} e^F_m‖²_F = 1 − τ^F_m(x) ∈ [0, 1].  (33)

Similarly, for the cross-leverage score,

⟨Π_{T(x)} e^F_{m_1}, Π_{T(x)} e^F_{m_2}⟩_F = τ^F_{m_1,m_2}(x) ∈ [−1, 1].  (34)

For g ∈ L²(dω), the interpolation error of the embedding µ_g can be expressed using the (cross-)leverage scores.

Lemma 8. If Det K(x) > 0, then

E(µ_g; x)² = ∑_{m∈N*} g_m² σ_m (1 − τ^F_m(x)) − ∑_{m_1≠m_2} g_{m_1} g_{m_2} √(σ_{m_1} σ_{m_2}) τ^F_{m_1,m_2}(x).  (35)

In particular, with probability one, a design sampled from the continuous volume sampling distribution of Definition 1 satisfies (35). Furthermore, we shall see that the expected value of the (cross-)leverage scores has a simple expression.

### 5.2. Explicit formulas for expected leverage scores

Proposition 9 expresses expected leverage scores in terms of the spectrum of the integration operator.

Proposition 9. For m ∈ N*,

E_VS τ^F_m(x) = 1 − (∑_{U∈U^m_N} ∏_{u∈U} σ_u) / (∑_{U∈U_N} ∏_{u∈U} σ_u).  (36)

Moreover, for m_1, m_2 ∈ N* such that m_1 ≠ m_2, we have

E_VS τ^F_{m_1,m_2}(x) = 0.  (37)

In Appendix 2.4, we combine Lemma 8 with Proposition 9; this concludes the proof of Theorem 4, by Beppo Levi's monotone convergence theorem.
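The decomposition (35) can be checked numerically on a truncated Mercer model; the following sketch (the eigenvalues, the Fourier eigenfunctions, the truncation level, and the random design are all illustrative choices of ours) compares Lemma 8 with the direct formula ‖µ_g‖²_F − µ_g(x)ᵀ K(x)^{-1} µ_g(x).

```python
import numpy as np

s, M = 2, 400                                     # smoothness and truncation (our choices)
sig = np.array([1.0] + [1 / n**(2 * s) for n in range(1, M) for _ in (0, 1)])

def eig(x):
    """Rows: e_1, e_2, e_3, ... (a Fourier basis) evaluated at the points x."""
    cols = [np.ones_like(x)]
    for n in range(1, M):
        cols += [np.sqrt(2) * np.cos(2 * np.pi * n * x), np.sqrt(2) * np.sin(2 * np.pi * n * x)]
    return np.array(cols)                          # shape (2M - 1, len(x))

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 7)
E = eig(x)                                         # e_m(x_i)
EF = np.sqrt(sig)[:, None] * E                     # e^F_m(x_i) = sqrt(sigma_m) e_m(x_i)
K = EF.T @ EF                                      # truncated Mercer sum for K(x)
Kinv = np.linalg.inv(K)

tau = np.einsum("mi,ij,mj->m", EF, Kinv, EF)       # leverage scores tau_m^F(x)
cross = EF @ Kinv @ EF.T                           # cross-leverage scores (32)

g = rng.normal(size=2 * M - 1); g /= np.linalg.norm(g)   # coefficients of g on (e_m)
mu_x = (g * sig) @ E                               # mu_g(x_i), with mu_g = sum_m g_m sigma_m e_m
direct = np.sum(g**2 * sig) - mu_x @ Kinv @ mu_x   # ||mu_g||_F^2 - mu_g(x)^T K^{-1} mu_g(x)

sqs = g * np.sqrt(sig)
lemma8 = np.sum(sqs**2 * (1 - tau)) - (sqs @ cross @ sqs - np.sum(sqs**2 * tau))
print(np.allclose(direct, lemma8))                 # True: both expressions agree
```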
It remains to prove Proposition 9. Again, we proceed in two steps. First, Proposition 10 yields a characterization of E_VS τ^F_m(x) and E_VS τ^F_{m_1,m_2}(x) in terms of the spectra of three perturbed versions of the integration operator Σ. Second, we give explicit forms of these spectra in Proposition 11 below. The idea is to express E_VS τ^F_m(x) using the normalization constant (21) of a perturbation of the kernel k. The same goes for E_VS τ^F_{m_1,m_2}(x). Let t ∈ R+ and let Σ_t, Σ^+_t and Σ^-_t be the integration operators on L²(dω) respectively associated with the kernels

k_t(x, y) = k(x, y) + t e^F_m(x) e^F_m(y),  (38)
k^+_t(x, y) = k(x, y) + t (e^F_{m_1}(x) + e^F_{m_2}(x))(e^F_{m_1}(y) + e^F_{m_2}(y)),  (39)
k^-_t(x, y) = k(x, y) + t (e^F_{m_1}(x) − e^F_{m_2}(x))(e^F_{m_1}(y) − e^F_{m_2}(y)).  (40)

(We drop the dependencies on m, m_1 and m_2 from the notation for simplicity.) By Assumption A, and by the fact that (e_m)_{m∈N*} is an orthonormal basis of L²(dω), all three kernels also have integrable diagonals (see Assumption A). In particular, they define RKHSs that can be embedded in L²(dω). Moreover, recalling the definition (18) of the normalization constant Z_N of volume sampling, the following quantities are finite:

φ_m(t) = (1/N!) Z_N(k_t),  φ^+_{m_1,m_2}(t) = (1/N!) Z_N(k^+_t),  and  φ^-_{m_1,m_2}(t) = (1/N!) Z_N(k^-_t).  (41)

Remember that, by Proposition 2,

φ_m(t) = ∑_{U∈U_N} ∏_{u∈U} σ̃_u(t),  (42)

where {σ̃_u(t), u ∈ N*} is the set of eigenvalues of Σ_t. Similar identities hold for φ^+_{m_1,m_2}(t) and φ^-_{m_1,m_2}(t), with the eigenvalues of Σ^+_t and Σ^-_t respectively. (For a given value of t, the eigenvalues σ̃_u(t) are not necessarily decreasing in u; we give explicit formulas for them in Proposition 11, and the ordering satisfied at t = 0 is not necessarily preserved for t > 0. This does not change anything in the argument, since these eigenvalues only appear in quantities such as φ_m(t) that are invariant under permutations of the eigenvalues.)

Proposition 10. The functions φ_m, φ^+_{m_1,m_2} and φ^-_{m_1,m_2} are right differentiable at zero. Furthermore,

E_VS τ^F_m(x) = (N!/Z_N(k)) φ'_m(0⁺),
E_VS τ^F_{m_1,m_2}(x) = (N!/(4 Z_N(k))) [(φ^+_{m_1,m_2})'(0⁺) − (φ^-_{m_1,m_2})'(0⁺)].

The details of the proof are postponed to Appendix 2.7. We complete this proposition with a description of the spectra of the operators Σ_t, Σ^+_t and Σ^-_t in terms of the spectrum of Σ.

Proposition 11. The eigenvalues of Σ_t write

σ̃_u(t) = σ_u if u ≠ m,  and  σ̃_m(t) = (1 + t) σ_m.  (43)

Moreover, the eigenvalues of Σ^+_t and Σ^-_t satisfy

{σ̃^+_u(t), u ∈ N*} = {σ̃^-_u(t), u ∈ N*}.  (44)

The proof is based on the observation that the perturbations in (38), (39), and (40) only affect a principal subspace of dimension 1 or 2; see Appendix 2.6. Combining the characterization of E_VS τ^F_m(x) and E_VS τ^F_{m_1,m_2}(x) given in Proposition 10 with Proposition 11, we prove Proposition 9; see the details in Appendix 2.8.

## 6. A Numerical Simulation

To illustrate Theorem 4, we let X = [0, 1] and dω be the uniform measure on X. Let the RKHS kernel be (Berlinet & Thomas-Agnan, 2011)

k_s(x, y) = 1 + 2 ∑_{m≥1} (1/m^{2s}) cos(2πm(x − y)),  (45)

so that F = F_s is the Sobolev space of order s on [0, 1]. The Mercer decomposition (3) of k_s is such that σ_1 = 1, e_1 ≡ 1 is constant and, for n ≥ 1,

σ_{2n} = σ_{2n+1} = 1/n^{2s},  e_{2n}(x) = √2 cos(2πnx),  e_{2n+1}(x) = √2 sin(2πnx).  (46)

Note that k_s can be expressed in closed form using Bernoulli polynomials (Wahba, 1990). In Theorem 4, (24) decomposes the expected interpolation error of any µ_g in terms of the interpolation errors ε_m(N) of the µ_{e_m}. Therefore, it is sufficient to numerically check the values of the ε_m(N). As an illustration, we consider g ∈ {e_1, e_5, e_7} in (24), so that µ_{e_m} = Σ e_m = σ_m e_m, with m ∈ {1, 5, 7}.
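The next paragraph describes the experiment behind Figure 2. As a rough indication of how such empirical estimates can be produced, here is a condensed sketch; it replaces the sampler of (Rezaei & Gharan, 2019) by a plain Metropolis-within-Gibbs move targeting f_VS, and it truncates the Mercer series of k_s, both simplifications being ours.

```python
import numpy as np

s, M, N = 1, 50, 10                      # smoothness, series truncation, number of nodes (our choices)
sigma = np.array([1.0] + [1 / n**(2 * s) for n in range(1, M) for _ in (0, 1)])

def feats(x):
    """Fourier eigenfunctions of (46) evaluated at the points x (one row per e_m)."""
    out = [np.ones_like(x)]
    for n in range(1, M):
        out += [np.sqrt(2) * np.cos(2 * np.pi * n * x), np.sqrt(2) * np.sin(2 * np.pi * n * x)]
    return np.array(out)

def gram(x):
    E = feats(x)
    return (sigma[:, None] * E).T @ E    # truncated Mercer sum for K(x)

def log_det(x):
    sign, val = np.linalg.slogdet(gram(x))
    return val if sign > 0 else -np.inf

def sample_vs(n_steps, rng):
    """Metropolis-within-Gibbs targeting f_VS proportional to det K(x) on [0,1]^N (dω uniform)."""
    x = rng.uniform(0, 1, N)             # crude initialization; a sequential init would be better
    cur = log_det(x)
    for _ in range(n_steps):
        i = rng.integers(N)
        prop = x.copy(); prop[i] = rng.uniform()
        new = log_det(prop)
        if np.log(rng.uniform()) < new - cur:
            x, cur = prop, new
    return x

def sq_interp_error(x, m):
    """E(mu_{e_m}; x)^2 = sigma_m - mu(x)^T K(x)^{-1} mu(x), with mu = sigma_m e_m."""
    mu_x = sigma[m - 1] * feats(x)[m - 1]
    return sigma[m - 1] - mu_x @ np.linalg.solve(gram(x), mu_x)

rng = np.random.default_rng(0)
samples = [sample_vs(500, rng) for _ in range(50)]
for m in (1, 5, 7):
    print(m, np.mean([sq_interp_error(x, m) for x in samples]))   # empirical estimate of eps_m(N)
```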
We use the Gibbs sampler proposed by (Rezaei & Gharan, 2019) to approximate continuous volume sampling. We consider various numbers of points N ∈ [2, 20]. Figure 2 shows log-log plots of the theoretical (th) value of E_VS ‖µ_g − Π_{T(x)} µ_g‖²_F compared to its empirical (emp) counterpart, vs. N, for s ∈ {1, 2}. For each N, the estimate is an average over 50 independent samples, each sample resulting from 500 individual Gibbs iterations. For both values of the smoothness parameter s, we observe a close fit of the estimate to the actual expected error.

Figure 2. The empirical estimate of E_VS E(µ_{e_m}; x)² for m ∈ {1, 5, 7}, compared to its expression (24), in the case of the periodic Sobolev space. The smoothness is s = 1 (top), s = 2 (bottom).

## 7. Discussion

We deal with interpolation in RKHSs using random nodes and optimal weights. This problem is intimately related to kernel quadrature, though interpolation is more general. We introduced continuous volume sampling (VS), a repulsive point process that is a mixture of DPPs, although not a DPP itself. VS comes with a set of advantages. First, interpretable bounds on the interpolation error can be derived under minimalistic assumptions. Our bounds are close to optimal, since they share the same decay rate as known lower bounds. Moreover, we provide explicit evaluations of the constants appearing in our bounds for some particular RKHSs (e.g., Sobolev, Gaussian). Second, while the eigendecomposition of the integration operator plays an important role in the analysis, the definition of the density of volume sampling only involves kernel evaluations. In that sense, VS is a fully kernelized approach. Unlike previous work on random designs, this may permit sampling without knowing the Mercer decomposition of the kernel (Rezaei & Gharan, 2019), as demonstrated in Section 6. Investigating efficient samplers and their impact on the bounds is deferred to future work; the current paper is a theoretical motivation for further methodological research.

We conclude with a few general remarks. In the context of interpolation, (Karvonen et al., 2019) propose to use approximate Fekete points. In comparison, our analysis yields a sharper upper bound while circumventing the analysis of the Lebesgue constant, a central quantity in interpolation theory. Yet, it would be interesting to analyze the distribution of the Lebesgue constant under continuous volume sampling, to provide an assessment of the numerical stability of our approach. A related extension of our work would be the analysis of interpolation under regularization, as in (Bach, 2017). Finally, while optimal kernel quadrature may be the main application of our results, Section 4.2 hints that more generic elements than embeddings can also be well approximated using VS. This connects to kernel quadrature in misspecified settings (Kanagawa et al., 2016), and to experimental design problems for Gaussian process approximations (Wynne et al., 2020).

## Acknowledgements

AB acknowledges support from ANR grant BOB (ANR-16-CE23-0003), Région Hauts-de-France, and Centrale Lille. RB acknowledges support from ERC grant BLACKJACK (ERC-2019-STG-851866).

## References

Bach, F. On the equivalence between kernel quadrature rules and random feature expansions.
The Journal of Machine Learning Research, 18(1):714-751, 2017.

Belhadji, A., Bardenet, R., and Chainais, P. A determinantal point process for column subset selection. arXiv preprint arXiv:1812.09771, 2018.

Belhadji, A., Bardenet, R., and Chainais, P. Kernel quadrature with DPPs. In Advances in Neural Information Processing Systems 32, pp. 12907-12917, 2019.

Ben-Tal, A. and Teboulle, M. A geometric property of the least squares solution of linear equations. Linear Algebra and its Applications, 139:165-170, 1990.

Berlinet, A. and Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, 2011.

Bilyk, D., Temlyakov, V. N., and Yu, R. The L2 discrepancy of two-dimensional lattices. In Recent Advances in Harmonic Analysis and Applications, pp. 63-77. Springer, 2012.

Bojanov, B. D. Uniqueness of the optimal nodes of quadrature formulae. Mathematics of Computation, 36(154):525-546, 1981.

Bos, L. P. and De Marchi, S. On optimal points for interpolation by univariate exponential functions. Dolomites Research Notes on Approximation, 4(1), 2011.

Bos, L. P. and Maier, U. On the asymptotics of Fekete-type points for univariate radial basis interpolation. Journal of Approximation Theory, 119(2):252-270, 2002.

Briol, F. X., Oates, C., Girolami, M., and Osborne, M. A. Frank-Wolfe Bayesian quadrature: Probabilistic integration with theoretical guarantees. In Advances in Neural Information Processing Systems, pp. 1162-1170, 2015.

Briol, F. X., Oates, C. J., Girolami, M., Osborne, M. A., Sejdinovic, D., et al. Probabilistic integration: A role in statistical computation? Statistical Science, 34(1):1-22, 2019.

De Marchi, S. On optimal center locations for radial basis function interpolation: computational aspects. Rend. Splines Radial Basis Functions and Applications, 61(3):343-358, 2003.

De Marchi, S., Schaback, R., and Wendland, H. Near-optimal data-independent point locations for radial basis function interpolation. Advances in Computational Mathematics, 23(3):317-330, 2005.

Derezinski, M. and Warmuth, M. K. Unbiased estimates for linear regression via volume sampling. In Advances in Neural Information Processing Systems, pp. 3084-3093, 2017.

Deshpande, A. and Vempala, S. Adaptive sampling and fast low-rank matrix approximation. In Proceedings of the 9th International Conference on Approximation Algorithms for Combinatorial Optimization Problems, and 10th International Conference on Randomization and Computation, APPROX'06/RANDOM'06, pp. 292-303. Springer-Verlag, 2006.

Deshpande, A., Rademacher, L., Vempala, S., and Wang, G. Matrix approximation and projective clustering via volume sampling. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA'06. Society for Industrial and Applied Mathematics, 2006.

Dick, J. and Pillichshammer, F. Digital Nets and Sequences: Discrepancy Theory and Quasi-Monte Carlo Integration. Cambridge University Press, 2010.

Dick, J. and Pillichshammer, F. Discrepancy theory and quasi-Monte Carlo integration. In A Panorama of Discrepancy Theory, pp. 539-619. Springer, 2014.

Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Sampling algorithms for ℓ2 regression and applications. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1127-1136. Society for Industrial and Applied Mathematics, 2006.

Ehler, M., Gräf, M., and Oates, C. J. Optimal Monte Carlo integration on closed manifolds. Statistics and Computing, 29(6):1203-1214, 2019.

Guruswami, V.
and Sinop, A. K. Optimal column-based low-rank matrix reconstruction. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1207-1214. SIAM, 2012.

Halton, J. H. Algorithm 247: Radical-inverse quasi-random point sequence. Communications of the ACM, 7(12):701-702, 1964.

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media, 2009.

Hough, J. B., Krishnapur, M., Peres, Y., and Virág, B. Determinantal processes and independence. Probability Surveys, 3:206-229, 2006.

Huszár, F. and Duvenaud, D. Optimally-weighted herding is Bayesian quadrature. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, UAI'12, pp. 377-386. AUAI Press, 2012.

Kanagawa, M., Sriperumbudur, B. K., and Fukumizu, K. Convergence guarantees for kernel-based quadrature rules in misspecified settings. In Advances in Neural Information Processing Systems, pp. 3288-3296, 2016.

Karvonen, T. and Särkkä, S. Gaussian kernel quadrature at scaled Gauss-Hermite nodes. BIT Numerical Mathematics, pp. 1-26, 2019.

Karvonen, T., Särkkä, S., and Tanaka, K. Kernel-based interpolation at approximate Fekete points. arXiv preprint arXiv:1912.07316, 2019.

Lacoste-Julien, S., Lindsten, F., and Bach, F. Sequential kernel herding: Frank-Wolfe optimization for particle filtering. arXiv preprint arXiv:1501.02056, 2015.

Larkin, F. M. Gaussian measure in Hilbert space and applications in numerical analysis. The Rocky Mountain Journal of Mathematics, pp. 379-421, 1972.

Macchi, O. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7:83-122, 1975.

Nashed, M. Z. and Walter, G. G. General sampling theorems for functions in reproducing kernel Hilbert spaces. Mathematics of Control, Signals and Systems, 4(4):363, 1991.

Novak, E., Ullrich, M., and Woźniakowski, H. Complexity of oscillatory integration for univariate Sobolev spaces. Journal of Complexity, 31(1):15-41, 2015.

Oates, C. and Girolami, M. Control functionals for quasi-Monte Carlo integration. In Gretton, A. and Robert, C. C. (eds.), Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pp. 56-65, Cadiz, Spain, 09-11 May 2016. PMLR.

Oettershagen, J. Construction of optimal cubature algorithms with applications to econometrics and uncertainty quantification. PhD thesis, University of Bonn, 2017.

Pinkus, A. N-widths in Approximation Theory, volume 7. Springer Science & Business Media, 2012.

Rezaei, A. and Gharan, S. O. A polynomial time MCMC method for sampling from continuous determinantal point processes. In International Conference on Machine Learning, pp. 5438-5447, 2019.

Santin, G. and Haasdonk, B. Convergence rate of the data-independent P-greedy algorithm in kernel-based approximation. Dolomites Research Notes on Approximation, 10 (Special Issue), 2017.

Schaback, R. Multivariate interpolation by polynomials and radial basis functions. Constructive Approximation, 21(3):293-317, 2005.

Schaback, R. and Wendland, H. Kernel techniques: from machine learning to meshless methods. Acta Numerica, 15:543-639, 2006.

Shannon, C. E. A mathematical theory of communication. Bell System Technical Journal, 27(3):379-423, 1948.

Simon, B. Trace Ideals and Their Applications. American Mathematical Society, 2005.

Smola, A., Gretton, A., Song, L., and Schölkopf, B.
A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, pp. 13-31. Springer, 2007.

Steinwart, I. and Christmann, A. Support Vector Machines. Springer Publishing Company, Incorporated, 1st edition, 2008. ISBN 0387772413.

Steinwart, I. and Scovel, C. Mercer's theorem on general domains: on the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35(3):363-417, 2012.

Tanaka, K. Generation of point sets by convex optimization for interpolation in reproducing kernel Hilbert spaces. Numerical Algorithms, pp. 1-31, 2019.

Unser, M. Sampling - 50 years after Shannon. Proceedings of the IEEE, 88(4):569-587, 2000.

Wahba, G. Spline Models for Observational Data, volume 59. SIAM, 1990.

Wendland, H. Scattered Data Approximation. Cambridge University Press, 2004.

Widom, H. Asymptotic behavior of the eigenvalues of certain integral equations. I. Transactions of the American Mathematical Society, 109(2):278-295, 1963.

Widom, H. Asymptotic behavior of the eigenvalues of certain integral equations. II. Archive for Rational Mechanics and Analysis, 17(3):215-229, 1964.

Wynne, G., Briol, F. X., and Girolami, M. Convergence guarantees for Gaussian process approximations under several observation models. arXiv preprint arXiv:2001.10818, 2020.

Yao, K. Applications of reproducing kernel Hilbert spaces - bandlimited signal models. Information and Control, 11(4):429-444, 1967.