# Kernel Quadrature with Randomly Pivoted Cholesky

Ethan N. Epperly and Elvira Moreno
Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125
{eepperly,emoreno2}@caltech.edu

37th Conference on Neural Information Processing Systems (NeurIPS 2023). ENE acknowledges support by NSF FRG 1952777, the Carver Mead New Horizons Fund, and the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Department of Energy Computational Science Graduate Fellowship under Award Number DE-SC0021110. EM was supported in part by Air Force Office of Scientific Research grant FA9550-22-1-0225.

This paper presents new quadrature rules for functions in a reproducing kernel Hilbert space using nodes drawn by a sampling algorithm known as randomly pivoted Cholesky. The resulting computational procedure compares favorably to previous kernel quadrature methods, which either achieve low accuracy or require solving a computationally challenging sampling problem. Theoretical and numerical results show that randomly pivoted Cholesky is fast and achieves comparable quadrature error rates to more computationally expensive quadrature schemes based on continuous volume sampling, thinning, and recombination. Randomly pivoted Cholesky is easily adapted to complicated geometries with arbitrary kernels, unlocking new potential for kernel quadrature.

## 1 Introduction

Quadrature is one of the fundamental problems in computational mathematics, with applications in Bayesian statistics [35], probabilistic ODE solvers [27], reinforcement learning [32], and model-based machine learning [30]. The task is to approximate an integral of a function f by a weighted sum of f's values at judiciously chosen quadrature points s_1, ..., s_n:

$$\int_X f(x)\,g(x)\,\mathrm{d}\mu(x) \;\approx\; \sum_{i=1}^n w_i\, f(s_i). \tag{1}$$

Here, and throughout, X denotes a topological space equipped with a Borel measure µ, and g ∈ L²(µ) denotes a square-integrable function. The goal of kernel quadrature is to select quadrature weights w_1, ..., w_n ∈ ℝ and nodes s_1, ..., s_n ∈ X which minimize the error in the approximation (1) for all f drawn from a reproducing kernel Hilbert space (RKHS) H of candidate functions. The ideal kernel quadrature scheme would satisfy three properties:

1. Spectral accuracy. The error of the approximation (1) decreases at a rate governed by the eigenvalues of the reproducing kernel k of H, with rapidly decaying eigenvalues guaranteeing rapidly decaying quadrature error.
2. Efficiency. The nodes s_1, ..., s_n and weights w_1, ..., w_n can be computed by an algorithm which is efficient in both theory and practice.

The first two goals may be more easily achieved if one has access to a Mercer decomposition

$$k(x,y) = \sum_{i=1}^{\infty} \lambda_i\, e_i(x)\, e_i(y), \tag{2}$$

where e_1, e_2, ... form an orthonormal basis of L²(µ) and the eigenvalues λ_1 ≥ λ_2 ≥ ··· are decreasing. Fortunately, the Mercer decomposition is known analytically for many RKHSs H on simple sets X such as boxes [0, 1]^d. However, such a decomposition is hard to compute for general kernels k and domains X, leading to the third of our desiderata:

3. Mercer-free. The quadrature scheme can be efficiently implemented without access to an explicit Mercer decomposition (2).

Despite significant progress in the probabilistic and kernel quadrature literature (see, e.g., [3–8, 17, 18, 21–23, 26, 41]), the search for a kernel quadrature scheme meeting all three criteria remains ongoing.
Contributions. We present a new kernel quadrature method based on the randomly pivoted Cholesky (RPCholesky) sampling algorithm that achieves all three of these goals. Our main contributions are:

1. Generalizing RPCholesky sampling (introduced in [10] for finite kernel matrices) to the continuum setting and demonstrating its effectiveness for kernel quadrature.
2. Establishing theoretical results (see theorem 1) which show that RPCholesky kernel quadrature achieves near-optimal quadrature error rates.
3. Developing efficient rejection sampling implementations (see algorithms 2 and 4) of RPCholesky in the continuous setting, allowing kernel quadrature to be applied to general spaces X, measures µ, and kernels k with ease.

The remainder of this introduction sketches our proposal for RPCholesky kernel quadrature. A comparison with existing kernel quadrature approaches appears in section 2.

Randomly pivoted Cholesky. Let H be an RKHS with a kernel k that is integrable on the diagonal:

$$\int_X k(x,x)\,\mathrm{d}\mu(x) < +\infty. \tag{3}$$

RPCholesky uses the kernel diagonal k(x, x) as a sampling distribution to pick quadrature nodes. The first node s_1 is chosen to be a sample from the distribution k(x, x) dµ(x), properly normalized:

$$s_1 \sim \frac{k(x,x)\,\mathrm{d}\mu(x)}{\int_X k(x,x)\,\mathrm{d}\mu(x)}.$$

Having selected s_1, we remove its influence on the kernel, updating the entire kernel function:

$$k'(x,y) = k(x,y) - \frac{k(x,s_1)\,k(s_1,y)}{k(s_1,s_1)}. \tag{4}$$

Linear algebraically, the update (4) can be interpreted as Gaussian elimination, eliminating row and column s_1 from the infinite matrix k. Probabilistically, if we interpret k as the covariance function of a Gaussian process, the update (4) represents conditioning on the value of the process at s_1. We use the updated kernel k′ to select the next quadrature node,

$$s_2 \sim \frac{k'(x,x)\,\mathrm{d}\mu(x)}{\int_X k'(x,x)\,\mathrm{d}\mu(x)},$$

whose influence is then subtracted off k′ as in (4). RPCholesky continues along these lines until n nodes have been selected. The resulting algorithm is shown in algorithm 1. Having chosen the nodes s_1, ..., s_n, our choice of weights w_1, ..., w_n is standard and is discussed in section 3.2.

Algorithm 1 RPCholesky: unoptimized implementation
Input: Kernel k : X × X → ℝ and number of quadrature points n
Output: Quadrature points s_1, ..., s_n
1: for i = 1, 2, ..., n do
2:   Sample s_i from the probability measure k(x, x) dµ(x) / ∫_X k(x, x) dµ(x)
3:   Update kernel k ← k − k(·, s_i) k(s_i, s_i)^{-1} k(s_i, ·)
4: end for

RPCholesky sampling is more flexible than many kernel quadrature methods, easily adapting to general spaces X, measures µ, and kernels k. To demonstrate this flexibility, we apply RPCholesky to the region X in fig. 1a equipped with the Matérn 5/2-kernel with bandwidth 2 and the measure dµ(x, y) = (x² + y²) dx dy.

Figure 1: RPCholesky kernel quadrature on an oddly shaped region. Left: Black dots show 20 points selected by RPCholesky on a crescent-shaped region. Shading shows kernel values k((x, y), (x, y)) after removing the influence of the selected points. Right: Mean relative error for RPCholesky kernel quadrature, iid kernel quadrature, and Monte Carlo quadrature for f(x, y) = sin(x) exp(y) over 100 trials. Shaded regions show 10%/90% quantiles.

A set of n = 20 quadrature nodes produced by RPCholesky sampling (using algorithm 2) is shown in fig. 1a. The shading shows the diagonal of the kernel after updating for the selected nodes. We see that near the selected nodes, the updated kernel is very small, meaning that future steps of the algorithm will avoid choosing nodes in those regions. Nodes far from any currently selected node have a much larger kernel value, making them more likely to be chosen in future RPCholesky iterations.
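To make the procedure concrete, the following sketch runs algorithm 1 on a finite discretization of (X, µ): a candidate grid with µ-weights stands in for the continuous domain, and the residual diagonal is tracked with a Cholesky-style factor rather than by storing the updated kernel. The function and argument names (`rpcholesky_discretized`, `kernel`, `points`, `mu_weights`) are illustrative, not taken from the paper's code.

```python
import numpy as np

def rpcholesky_discretized(kernel, points, mu_weights, n, seed=None):
    """Minimal sketch of algorithm 1 on a finite discretization of (X, mu).

    kernel(x, y) evaluates k at a pair of points; points is a list of M
    candidate locations and mu_weights their mu-masses. This discretized
    version is for illustration only; the paper's continuous sampler is
    algorithm 2.
    """
    rng = np.random.default_rng(seed)
    M = len(points)
    diag = np.array([kernel(points[j], points[j]) for j in range(M)])
    F = np.zeros((n, M))                  # rows of a Cholesky-style factor
    selected = []
    for i in range(n):
        # Step 2: sample proportionally to the residual diagonal times mu.
        probs = np.clip(diag * mu_weights, 0.0, None)
        j = rng.choice(M, p=probs / probs.sum())
        selected.append(j)
        # Step 3: eliminate the chosen point, as in the update (4).
        col = np.array([kernel(points[m], points[j]) for m in range(M)])
        col -= F[:i].T @ F[:i, j]
        F[i] = col / np.sqrt(col[j])
        diag -= F[i] ** 2
    return [points[j] for j in selected], selected
```

On a sufficiently fine grid, the distribution of the returned nodes approximates the continuous RPCholesky distribution; algorithms 2 and 4 below avoid the discretization altogether.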
The quadrature error for f(x, y) = sin(x) exp(y), g ≡ 1, and different numbers n of quadrature nodes for RPCholesky kernel quadrature, kernel quadrature with nodes drawn iid from µ/µ(X), and Monte Carlo quadrature is shown in fig. 1b. Other spectrally accurate kernel quadrature methods would be difficult to implement in this setting because they would require an explicit Mercer decomposition (projection DPPs and leverage scores) or an expensive sampling procedure (continuous volume sampling). A comparison of RPCholesky with more kernel quadrature methods on a benchmark problem is provided in fig. 2.

## 2 Related work

Here, we overview past work on kernel quadrature and discuss the history of RPCholesky sampling.

### 2.1 Kernel quadrature

The goal of kernel quadrature is to provide a systematic means for designing quadrature rules on RKHSs. Relevant points of comparison are Monte Carlo [36] and quasi-Monte Carlo [14] methods, which have O(1/√n) and O(polylog(n)/n) convergence rates, respectively. The literature on probabilistic and kernel approaches to quadrature is vast, so any short summary is necessarily incomplete. Here, we briefly summarize some of the most prominent kernel quadrature methods, highlighting limitations that we address with our RPCholesky approach. We refer the reader to [22, Tab. 1] for a helpful comparison of many of the below-discussed methods.

Herding. Kernel herding schemes [4, 11, 26, 41] iteratively select quadrature nodes using a greedy approach. These methods face two limitations. First, they require the solution of a (typically) nonconvex global optimization problem at every step, which may be computationally costly. Second, these methods exhibit slow O(1/n) quadrature error rates [11, Prop. 4] (or even slower [16]). Optimally weighted herding is known as sequential Bayesian quadrature [26].

Thinning. Thinning methods [17, 18, 38] try to select n good quadrature nodes from a larger set of N ≫ n candidate nodes drawn from a simple distribution. While these methods have other desirable theoretical properties, they do not benefit from spectral accuracy.

Leverage scores and projection DPPs. With optimal weights (section 3.2), quadrature with nodes sampled using a projection determinantal point process (DPP) [7] (see also [5, 6, 21]) or iid from the (ridge) leverage score distribution [3] achieves spectral accuracy. However, known efficient sampling algorithms require access to the Mercer decomposition (2), limiting the applicability of these schemes to simple spaces X, measures µ, and kernels k, where the decomposition is known analytically.

Continuous volume sampling. Continuous volume sampling is the continuous analog of n-DPP sampling [28], providing quadrature nodes that achieve spectral accuracy [8, Prop. 5]. Unfortunately, continuous volume sampling is computationally challenging. In the finite setting, the best-known algorithm [9] for exact n-DPP sampling requires costly O(|X| n^6.5 + n^9.5) operations. Inexact samplers based on Markov chain Monte Carlo (MCMC) (e.g., [1, 2, 34]) may be more competitive, but the best-known samplers in the continuous setting still require an expensive O(n^5 log n) MCMC steps [34, Thms. 1.3–1.4].
RPCholesky sampling achieves similar theoretical guarantees to continuous volume sampling (see theorem 1) and can be sampled efficiently and exactly (algorithm 2).

Recombination and convex weights. The paper [22] (see also [23]) proposes two ideas for kernel quadrature when µ is a probability measure and g ≡ 1. First, they suggest using a recombination algorithm (e.g., [29]) to subselect good quadrature nodes from N ≫ n candidate nodes sampled iid from µ. All of the variants of their method either fail to achieve spectral accuracy or require an explicit Mercer decomposition [22, Tab. 1]. Second, they propose choosing weights (w_1, ..., w_n) that are convex combination coefficients. This choice makes the quadrature scheme robust against misspecification of the RKHS H ∋ f, among other benefits. It may be worth investigating a combination of RPCholesky quadrature nodes with convex weights in future work.

### 2.2 Randomly pivoted Cholesky

RPCholesky was proposed, implemented, and analyzed in [10] for the task of approximating an M × M positive-semidefinite matrix A. The algorithm is algebraically equivalent to applying an earlier algorithm known as adaptive sampling [12, 13] to A^{1/2}. Despite their similarities, RPCholesky and adaptive sampling are different algorithms: to produce a rank-n approximation to an M × M matrix, RPCholesky requires O(n²M) operations, while adaptive sampling requires a much larger O(nM²) operations. See [10, §4.1] for further discussion of RPCholesky's history. In this paper, we introduce a continuous extension of RPCholesky and analyze its effectiveness for kernel quadrature.

## 3 Theoretical results

In this section, we prove our main theoretical result for RPCholesky kernel quadrature. We first establish the mathematical setting (section 3.1) and introduce kernel quadrature (section 3.2). Then, we present our main theorem (section 3.3) and discuss its proof (section 3.4).

### 3.1 Mathematical setting and notation

Let µ be a Borel measure supported on a topological space X and let H be an RKHS on X with continuous kernel k : X × X → ℝ. We assume that x ↦ k(x, x) is integrable (3) and that H is dense in L²(µ). These assumptions imply that H is compactly embedded in L²(µ), the Mercer decomposition (2) converges pointwise, and the Mercer eigenfunctions form an orthonormal basis of L²(µ) and an orthogonal basis of H [39, Thm. 3.1]. Define the integral operator

$$Tf := \int_X k(\cdot, y)\, f(y)\,\mathrm{d}\mu(y). \tag{5}$$

Viewed as an operator T : L²(µ) → L²(µ), T is self-adjoint, positive semidefinite, and trace-class.

One final piece of notation: for a function δ : X → ℝ and an ordered (multi)set S = {s_1, ..., s_n}, we let δ(S) be the column vector with ith entry δ(s_i). Similarly, for a bivariate function h : X × X → ℝ, h(·, S) (resp. h(S, ·)) denotes the row (resp. column) vector-valued function with ith entry h(·, s_i) (resp. h(s_i, ·)), and h(S, S) denotes the matrix with ij entry h(s_i, s_j).

### 3.2 Kernel quadrature

Following earlier works (e.g., [3, 7]), let us describe the kernel quadrature problem more precisely. Given a function g ∈ L²(µ), we seek quadrature weights w = (w_1, ..., w_n) ∈ ℝⁿ and nodes S = {s_1, ..., s_n} ⊆ X that minimize the maximum quadrature error over all f ∈ H with ‖f‖_H ≤ 1:

$$\operatorname{Err}(S, w; g) := \max_{\|f\|_{\mathcal{H}} \le 1} \left| \int_X f(x)\,g(x)\,\mathrm{d}\mu(x) - \sum_{i=1}^n w_i\, f(s_i) \right|. \tag{6}$$

A short derivation ([3, Eq. (7)] or appendix C) yields the simple formula

$$\operatorname{Err}(S, w; g) = \Big\| Tg - \sum_{i=1}^n w_i\, k(\cdot, s_i) \Big\|_{\mathcal{H}}. \tag{7}$$

This equation reveals that the quadrature rule minimizing Err(S, w; g) is the least-squares approximation of Tg ∈ H by a linear combination of kernel function evaluations Σ_{i=1}^n w_i k(·, s_i).
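Formula (7) also yields an expression for the error that is convenient for numerical evaluation (it is one way to arrive at the closed form cited in section 5; cf. [22, Eq. (14)]). Expanding the squared H-norm with the reproducing property ⟨Tg, k(·, s_i)⟩_H = (Tg)(s_i) and the identity ⟨Tg, Tg⟩_H = ⟨g, Tg⟩_{L²(µ)} gives

$$\operatorname{Err}^2(S, w; g) = \langle g, Tg \rangle_{L^2(\mu)} - 2 \sum_{i=1}^n w_i\, (Tg)(s_i) + \sum_{i,j=1}^n w_i\, w_j\, k(s_i, s_j).$$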
If we fix the quadrature nodes S = {s_1, ..., s_n}, the optimal weights w* = (w*_1, ..., w*_n) ∈ ℝⁿ minimizing Err(S, w; g) are the solution of the linear system of equations

$$k(S,S)\, w^{\star} = Tg(S). \tag{8}$$

We use (8) to select the weights throughout this paper, and denote the error with optimal weights by

$$\operatorname{Err}(S; g) := \operatorname{Err}(S, w^{\star}; g). \tag{9}$$

### 3.3 Error bounds for randomly pivoted Cholesky kernel quadrature

Our main result for RPCholesky kernel quadrature is as follows:

Theorem 1. Let S = {s_1, ..., s_n} be generated by the RPCholesky sampling algorithm. For any function g ∈ L²(µ), nonnegative integer r ≥ 0, and real number δ > 0, we have

$$\mathbb{E}\operatorname{Err}^2(S; g) \;\le\; \delta\, \|g\|_{L^2(\mu)}^2 \sum_{i=r+1}^{\infty} \lambda_i \qquad \text{provided that} \qquad n \;\ge\; r \log\!\left(\frac{2\lambda_1}{\delta \sum_{i=r+1}^{\infty} \lambda_i}\right) + \frac{2}{\delta}. \tag{10}$$

To illustrate this result, consider an RKHS with eigenvalues λ_i = O(i^{-2s}) for s > 1/2. An example is the periodic Sobolev space H^s_per([0, 1]) (see section 5). By setting δ = 1/r, we see that E Err²(S; g) ≤ O(r^{-2s}) ‖g‖²_{L²(µ)} for RPCholesky with n = Ω(r log r). The optimal scheme requires n = Θ(r) nodes to achieve this bound [8, Prop. 5 & §2.5]. Thus, to achieve an error of O(r^{-2s}), RPCholesky requires just logarithmically more nodes than the optimal quadrature scheme for such spaces. This compares favorably to continuous volume sampling, which achieves the slightly better optimal rate but is much more difficult to sample.

Having established that RPCholesky achieves nearly optimal error rates for interesting RKHSs, we make some general comments on theorem 1. First, observe that the error depends on the sum Σ_{i=r+1}^∞ λ_i of the tail eigenvalues. This tail sum is characteristic of spectrally accurate kernel quadrature schemes [7, 8]. The more distinctive feature in (10) is the logarithmic factor. Fortunately, for double-precision computation, the achievable accuracy is bounded by the machine precision 2^{-52}, so this logarithmic factor is effectively a modest constant:

$$\log\!\left(\frac{2\lambda_1}{\delta \sum_{i=r+1}^{\infty} \lambda_i}\right) \le \log(2^{52}) < 37.$$

### 3.4 Connection to Nyström approximation and idea of proof

We briefly outline the proof of theorem 1. Following previous works (e.g., [3, 23]), we first utilize the connection between kernel quadrature and the Nyström approximation [31, 42] of the kernel k.

Definition 2. For nodes S = {s_1, ..., s_n} ⊆ X, the Nyström approximation to the kernel k is

$$\hat{k}_S(x,y) = k(x,S)\, k(S,S)^{\dagger}\, k(S,y). \tag{11}$$

Here, † denotes the Moore–Penrose pseudoinverse. The Nyström approximate integral operator is

$$T_S f := \int_X \hat{k}_S(\cdot, x)\, f(x)\,\mathrm{d}\mu(x).$$

This definition leads to a formula for the quadrature rule Σ_{i=1}^n w*_i k(·, s_i) with optimal weights (8).

Proposition 3. Fix nodes S = {s_1, ..., s_n}. With the optimal weights w* ∈ ℝⁿ (8), we have

$$\sum_{i=1}^n w_i^{\star}\, k(\cdot, s_i) = T_S g. \tag{12}$$

Consequently, Err²(S; g) = ‖(T − T_S) g‖²_H ≤ ⟨g, (T − T_S) g⟩_{L²(µ)}.

To finish the proof of theorem 1, we develop and solve a recurrence for an upper bound on the largest eigenvalue of E[T − T_S]. See appendix A for details and for the proof of proposition 3.

## 4 Efficient randomly pivoted Cholesky by rejection sampling

To efficiently perform RPCholesky sampling in the continuous setting, we can use rejection sampling. (A similar rejection sampling idea is used in the MCMC continuous volume sampler of [34].) Assume for simplicity that the measure µ is normalized so that

$$\int_X k(x,x)\,\mathrm{d}\mu(x) = \operatorname{Tr} T = 1.$$

We assume two forms of access to the kernel k (a minimal interface sketch follows below):

1. Entry evaluations. For any x, y ∈ X, we can evaluate k(x, y).
2. Sampling the diagonal. We can produce samples from the measure k(x, x) dµ(x).
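As a concrete (and deliberately simple) illustration of these two access primitives, the sketch below bundles entry evaluation with a sampler for the diagonal measure k(x, x) dµ(x). It uses a Gaussian kernel on X = [0, 1] with µ = UNIF[0, 1], for which k(x, x) ≡ 1, so sampling the diagonal measure reduces to sampling µ itself. The class and method names are illustrative choices, not an interface defined in the paper.

```python
import numpy as np

class DiagonalAccessKernel:
    """Bundles the two access primitives assumed above:
    (1) entry evaluations k(x, y), and (2) samples from k(x, x) dmu(x).

    Example kernel: Gaussian on X = [0, 1] with mu = UNIF[0, 1]. Since
    k(x, x) = 1, the diagonal measure equals mu and Tr T = 1 as assumed.
    """

    def __init__(self, bandwidth=0.2, seed=None):
        self.bandwidth = bandwidth
        self.rng = np.random.default_rng(seed)

    def eval(self, x, y):
        # Entry evaluation k(x, y) = exp(-(x - y)^2 / (2 * bandwidth^2)),
        # vectorized over arrays of points.
        diff = np.subtract.outer(np.atleast_1d(x), np.atleast_1d(y))
        return np.exp(-diff**2 / (2 * self.bandwidth**2))

    def sample_diagonal(self, size=1):
        # Proposal draws from k(x, x) dmu(x); here this is just UNIF[0, 1].
        return self.rng.uniform(0.0, 1.0, size=size)
```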
To sample from the RPCholesky distribution, we use k(x, x) dµ(x) as a proposal distribution and accept a proposal s with probability

$$1 - \frac{\hat{k}_S(s,s)}{k(s,s)}.$$

(Recall k̂_S from (11).) The resulting algorithm for RPCholesky sampling is shown in algorithm 2.

Algorithm 2 RPCholesky with rejection sampling
Input: Kernel k : X × X → ℝ and number of quadrature points n
Output: Quadrature points s_1, ..., s_n
1: Initialize L ← 0_{n×n}, i ← 1, S ← ∅  ▷ L stores a Cholesky decomposition of k(S, S)
2: while i ≤ n do
3:   Sample s_i from the probability measure k(x, x) dµ(x)
4:   (d, c) ← RESIDUALKERNEL(s_i, S, L, k)  ▷ Helper subroutine, algorithm 3
5:   Draw a uniform random variable U ∼ UNIF[0, 1]
6:   if U < d/k(s_i, s_i) then  ▷ Accept with probability d/k(s_i, s_i)
7:     L(i, 1:i) ← [c⊤, √d]
8:     Induct s_i into the selected set: S ← S ∪ {s_i}, i ← i + 1
9:   end if
10: end while

Algorithm 3 Helper subroutine to evaluate the residual kernel
Input: Point s ∈ X, set S ⊆ X, Cholesky factor L of k(S, S), and kernel k
Output: d = k(s, s) − k̂_S(s, s) and c = L^{-1} k(S, s)
1: procedure RESIDUALKERNEL(s, S, L, k)
2:   c ← L^{-1} k(S, s)
3:   d ← k(s, s) − ‖c‖²  ▷ d = k(s, s) − k̂_S(s, s)
4:   return (d, c)
5: end procedure

Theorem 4. Algorithm 2 produces exact RPCholesky samples. Let η_i denote the trace error of the best rank-i approximation to T:

$$\eta_i := \sum_{j=i+1}^{\infty} \lambda_j.$$

The expected runtime of algorithm 2 is at most O(Σ_{i=1}^n i²/η_{i−1}) ≤ O(n³/η_{n−1}).

This result demonstrates that algorithm 2 suffers from the curse of smoothness: the faster the eigenvalues λ_1, λ_2, ... decrease, the smaller η_{n−1} will be and, consequently, the slower algorithm 2 will be in expectation. While this curse is an unfortunate limitation, it is also a common one. The curse of smoothness affects all known Mercer-free, spectrally accurate kernel quadrature schemes. In fact, the situation is worse for other algorithms. The CVS sampler of [34], for example, requires as many as O(n⁵ log n) MCMC steps, each of which has cost O(n²/η_{n−1}). According to current analysis, algorithm 2 is O(n⁴ log n) times faster than the CVS sampler of [34].

Notwithstanding the curse of smoothness, algorithm 2 is useful in practice. The algorithm works under minimal assumptions and can reach useful accuracies η = 10^{-5} while sampling n = 1000 nodes on a laptop. Algorithm 2 may also be interesting for RPCholesky sampling from a large finite set with |X| ≈ 10⁹ points, since its runtime has no explicit dependence on the size of the space X.

To achieve higher accuracies and blunt the curse of smoothness, we can improve algorithm 2 with optimization. Indeed, the optimal acceptance probability for algorithm 2 would be

$$p(s_i; \alpha) = \frac{1}{\alpha} \cdot \frac{k(s_i, s_i) - \hat{k}_S(s_i, s_i)}{k(s_i, s_i)}, \qquad \text{where} \qquad \alpha = \max_{x \in X} \frac{k(x, x) - \hat{k}_S(x, x)}{k(x, x)}.$$

This suggests the following scheme: initialize with α = 1 and run the algorithm with acceptance probability p(s_i; α). If we perform many loop iterations without an acceptance, we then recompute the optimal α by solving an optimization problem. The resulting procedure is shown in algorithm 4.

Algorithm 4 RPCholesky with optimized rejection sampling
Input: Kernel k : X × X → ℝ and number of quadrature points n
Output: Quadrature points s_1, ..., s_n
1: Initialize L ← 0_{n×n}, i ← 1, S ← ∅, trials ← 0, α ← 1
2: while i ≤ n do
3:   trials ← trials + 1
4:   Sample s_i from the probability measure k(x, x) dµ(x)
5:   (d, c) ← RESIDUALKERNEL(s_i, S, L, k)  ▷ Helper subroutine, algorithm 3
6:   Draw a uniform random variable U ∼ UNIF[0, 1]
7:   if U < (1/α) · d/k(s_i, s_i) then  ▷ Accept with probability (1/α) · d/k(s_i, s_i)
8:     L(i, 1:i) ← [c⊤, √d]
9:     S ← S ∪ {s_i}, i ← i + 1, trials ← 0
10:  end if
11:  if trials ≥ TRIALS_MAX then  ▷ We use TRIALS_MAX ∈ [25, 1000]
12:    α ← max_{x∈X} d(x)/k(x, x), with d(x) from RESIDUALKERNEL(x, S, L, k)
13:    trials ← 0
14:  end if
15: end while

If the optimization problem for α is solved to global optimality, then algorithm 4 produces exact RPCholesky samples. The downside of algorithm 4 is the need to solve a (typically) nonconvex global optimization problem to compute the optimal α. Fortunately, in our experiments (section 5), only a small number of optimization problems (≤ 10) are needed to produce a sample of n = 1000 nodes.
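The following sketch implements algorithms 2 and 3 against the illustrative access interface above (fixed acceptance probability, i.e., without the adaptive α of algorithm 4). It is a simplified reading of the pseudocode under the normalization Tr T = 1, not the authors' implementation.

```python
import numpy as np

def rpcholesky_rejection(kernel, n, seed=None):
    """Sketch of algorithms 2-3: RPCholesky by rejection sampling.

    `kernel` exposes eval(x, y) and sample_diagonal(), as in the
    illustrative DiagonalAccessKernel above.
    """
    rng = np.random.default_rng(seed)
    S = []                                  # selected nodes
    L = np.zeros((n, n))                    # Cholesky factor of k(S, S)
    i = 0
    while i < n:
        s = kernel.sample_diagonal()[0]     # proposal from k(x, x) dmu(x)
        ks = kernel.eval(s, s).item()
        # RESIDUALKERNEL: c = L^{-1} k(S, s), d = k(s, s) - k_S(s, s).
        if i == 0:
            c, d = np.zeros(0), ks
        else:
            c = np.linalg.solve(L[:i, :i], kernel.eval(S, s).ravel())
            d = ks - c @ c
        if rng.uniform() < d / ks:          # accept w.p. (k - k_S)(s, s) / k(s, s)
            L[i, :i], L[i, i] = c, np.sqrt(max(d, 0.0))
            S.append(s)
            i += 1
    return np.array(S)

# Usage with the illustrative kernel above:
# nodes = rpcholesky_rejection(DiagonalAccessKernel(seed=0), n=50, seed=1)
```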
In the setting where the optimization problem is tractable, the speedup of algorithm 4 can be immense. To produce n = 200 RPCholesky samples for the space [0, 1]³ with the kernel (13), algorithm 4 requires just 0.19 seconds, compared to 7.47 seconds for algorithm 2.

## 5 Comparison of methods on a benchmark example

Having demonstrated the versatility of RPCholesky for general spaces X, measures µ, and kernels k in fig. 1, we now present a benchmark example from the kernel quadrature literature to compare RPCholesky with other methods. Consider the periodic Sobolev space H^s_per([0, 1]) with kernel

$$k(x,y) = 1 + 2 \sum_{m=1}^{\infty} m^{-2s} \cos(2\pi m (x-y)) = 1 + \frac{(-1)^{s-1} (2\pi)^{2s}}{(2s)!}\, B_{2s}(\{x - y\}),$$

where B_{2s} is the (2s)th Bernoulli polynomial and {·} reports the fractional part of a real number [3, p. 7]. We also consider [0, 1]³ equipped with the product kernel

$$k^{\otimes 3}((x_1, x_2, x_3), (y_1, y_2, y_3)) = k(x_1, y_1)\, k(x_2, y_2)\, k(x_3, y_3). \tag{13}$$

We quantify the performance of nodes S and weights w using Err(S, w; g), which can be computed in closed form [22, Eq. (14)]. We set µ := UNIF[0, 1]^d and g ≡ 1. We compare the following schemes:

Monte Carlo, IID kernel quadrature. Nodes s_1, ..., s_n drawn iid from µ, with uniform weights w_i = 1/n (Monte Carlo) and optimal weights (8) (IID).
Thinning. Nodes s_1, ..., s_n thinned from n² iid samples from µ using kernel thinning [17, 18] with the COMPRESS++ algorithm [38], with optimal weights (8).
Continuous volume sampling (CVS). Nodes s_1, ..., s_n drawn from the continuous volume sampling distribution [8, Def. 1] by Markov chain Monte Carlo, with optimal weights (8).
RPCholesky. Nodes s_1, ..., s_n sampled by RPCholesky using algorithm 4, with optimal weights (8).
Positively weighted kernel quadrature (PWKQ). Nodes and weights computed by the Nyström+empirical+opt method as described on [22, p. 9].

See appendix D for more details about our numerical experiments. Experiments were run on a MacBook Pro with a 2.4 GHz 8-core Intel Core i9 CPU and 64 GB of 2667 MHz DDR4 RAM. Our code is available at https://github.com/eepperly/RPCholesky-Kernel-Quadrature.

Figure 2: Benchmark example: Sobolev space. Mean quadrature error Err(S, w; g) (left) and sampling time (right) for different methods (100 trials), for s = 3, d = 1 (top, panel a) and s = d = 3 (bottom, panel b). Shaded regions show 10%/90% quantiles.

Errors for the different methods are shown in fig. 2 (left panels). We see that RPCholesky consistently performs among the best methods at every value of n, numerically achieving the rate of convergence of the fastest method for each problem. Particularly striking is the result that RPCholesky sampling's performance is either very close to or better than that of continuous volume sampling, despite the latter's slightly stronger theoretical properties. Figure 2 (right panels) shows the sampling times for each of the nonuniform sampling methods. RPCholesky sampling is the fastest by far. To sample 128 nodes for s = 3 and d = 1, RPCholesky was 17× faster than continuous volume sampling, 52× faster than thinning, and 170× faster than PWKQ.
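As an illustration of how the benchmark error can be evaluated, the sketch below implements the periodic Sobolev kernel via the Bernoulli-polynomial formula (hard-coding B_2, B_4, B_6, which cover s = 1, 2, 3) together with the optimal weights (8) and their closed-form error for µ = UNIF[0, 1] and g ≡ 1. For this kernel, Tg ≡ 1 and ⟨g, Tg⟩_{L²(µ)} = 1, so Err²(S, w; 1) = 1 − 2 Σ_i w_i + w⊤ k(S, S) w. This is an illustrative re-implementation, not the paper's benchmark code.

```python
import numpy as np
from math import factorial, pi

def sobolev_kernel(x, y, s=1):
    """Periodic Sobolev kernel on [0, 1] via the Bernoulli-polynomial formula;
    B_2, B_4, B_6 are hard-coded, covering s = 1, 2, 3."""
    t = np.mod(np.subtract.outer(np.atleast_1d(x), np.atleast_1d(y)), 1.0)
    bernoulli = {1: t**2 - t + 1/6,
                 2: t**4 - 2*t**3 + t**2 - 1/30,
                 3: t**6 - 3*t**5 + 2.5*t**4 - 0.5*t**2 + 1/42}[s]
    return 1 + (-1)**(s - 1) * (2*pi)**(2*s) / factorial(2*s) * bernoulli

def optimal_weights_and_error(nodes, s=1):
    """Optimal weights (8) and Err(S, w; g) for mu = UNIF[0, 1], g = 1.
    Here Tg(S) is the all-ones vector, so Err^2 = 1 - 2 sum(w) + w' K w."""
    K = sobolev_kernel(nodes, nodes, s)
    w = np.linalg.solve(K, np.ones(len(nodes)))
    err2 = 1 - 2*w.sum() + w @ K @ w
    return w, np.sqrt(max(err2, 0.0))

# Usage: worst-case error of 32 iid nodes in H^3_per([0, 1]).
nodes = np.random.default_rng(0).uniform(size=32)
w, err = optimal_weights_and_error(nodes, s=3)
print(err)
```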
To summarize our numerical evaluation (comprising fig. 2 and fig. 1 from the introduction), we find that RPCholesky is among the most accurate kernel quadrature methods tested. To us, the strongest benefits of RPCholesky (supported by these experiments) are the method's speed and flexibility. These virtues make it possible to apply spectrally accurate kernel quadrature in scenarios where it would have been computationally intractable before.

## 6 Application: Analyzing large chemistry datasets

While our focus thus far has been on infinite domains X, the kernel quadrature formalism can also be applied to a finite set X of data points. Let µ = UNIF(X) be the uniform distribution on X and set g ≡ 1. Here, the task is to exhibit n nodes S = {s_1, ..., s_n} ⊆ X and weights w ∈ ℝⁿ such that the average of every smooth function f : X → ℝ over the whole dataset is well-approximated by a sum:

$$\operatorname{mean}(f) := \frac{1}{|X|} \sum_{x \in X} f(x) \;\approx\; \sum_{i=1}^n w_i\, f(s_i). \tag{14}$$

Here is an application to chemistry. Let X denote a large set of compounds of interest, and let f : X → ℝ denote a target chemical property such as specific heat capacity. We are interested in computing mean(f). Unfortunately, evaluating f on each x ∈ X requires an expensive density functional theory computation, so it can be prohibitively expensive to evaluate f on every x ∈ X. Fortunately, we can obtain fast approximations to mean(f) using (14), which only requires evaluating f on a much smaller set of size |S| ≪ |X|.

Our experimental setup is as follows. For X, we use 2 × 10⁴ randomly selected points from the QM9 dataset [33, 37, 40]. We represent each compound as a vector in ℝ^1500 using the many-body tensor representation [25] computed with the DScribe package [24]. We choose k to be a Gaussian kernel with bandwidth chosen by the median heuristic [20] on a further subsample of 10³ random points. We omit continuous volume sampling, thinning, and positively weighted kernel quadrature because of their computational cost. In their place, we add the greedy Nyström method [19] with optimal weights (8).

Figure 3: Application: chemistry. Worst-case quadrature error (left) and mean relative error for estimation of the average value of the isotropic polarizability function f(x) (right) for different methods (100 trials). Shaded regions show 10%/90% quantiles.

Results are shown in fig. 3. Figure 3a shows the worst-case quadrature error (9). For this metric, IID and randomly pivoted Cholesky are the definitive winners, with randomly pivoted Cholesky being slightly better. Figure 3b shows the mean relative error for the kernel quadrature estimates of the mean isotropic polarizability of the compounds in X. For sufficiently large n, randomly pivoted Cholesky has the smallest error, beating IID and Monte Carlo by a factor of three at n = 512.
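For the finite-X setting of this section, the weight computation (8) simplifies because (Tg)(s_i) is just the average of a kernel row: (Tg)(s_i) = (1/|X|) Σ_{x∈X} k(s_i, x). The sketch below uses illustrative names; the nodes could be chosen by any of the samplers above, for example RPCholesky applied to the finite kernel matrix.

```python
import numpy as np

def dataset_mean_estimate(K, nodes_idx, f_at_nodes):
    """Sketch of the estimator (14) for a finite dataset X with mu = UNIF(X).

    K is the kernel matrix of the dataset (only the rows K[nodes_idx, :]
    are actually needed), nodes_idx indexes the selected nodes s_1..s_n,
    and f_at_nodes holds the n values f(s_i).
    """
    K_S = K[nodes_idx, :]                    # n x |X| block of kernel entries
    Tg_S = K_S.mean(axis=1)                  # (Tg)(s_i) = mean_x k(s_i, x)
    K_SS = K_S[:, nodes_idx]                 # k(S, S)
    w = np.linalg.solve(K_SS, Tg_S)          # optimal weights (8)
    return w @ f_at_nodes                    # estimate of mean(f) over X
```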
## 7 Conclusions, Limitations, and Possibilities for Future Work

In this article, we developed continuous RPCholesky sampling for kernel quadrature. Theorem 1 demonstrates that RPCholesky kernel quadrature achieves near-optimal error rates. Numerical results (fig. 2) hint that its practical performance might be even better than that suggested by our theoretical analysis and fully comparable with kernel quadrature based on the much more computationally expensive continuous volume sampling distribution. RPCholesky supports performant rejection sampling algorithms (algorithms 2 and 4), which facilitate implementation for general spaces X, measures µ, and kernels k with ease.

We highlight three limitations of RPCholesky kernel quadrature that would be worth addressing in future work. First, given the comparable performance of RPCholesky and continuous volume sampling in practice, it would be desirable to prove stronger error bounds for RPCholesky sampling or to find counterexamples which demonstrate a separation between the methods. Second, it would be worth developing improved sampling algorithms for RPCholesky which avoid the need for global optimization steps. Third, all known spectrally accurate kernel quadrature methods require integrals of the form ∫_X k(x, y) g(y) dµ(y), which may not be available; RPCholesky kernel quadrature requires them for the computation of the weights (8). Developing spectrally accurate kernel quadrature schemes that avoid such integrals remains a major open problem for the field.

Acknowledgements. We thank Lester Mackey, Eliza O'Reilly, Joel Tropp, and Robert Webber for helpful discussions and feedback.

Disclaimer. This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

## References

[1] Nima Anari, Yang P. Liu, and Thuy-Duong Vuong. Optimal sublinear sampling of spanning trees and determinantal point processes via average-case entropic independence. In 63rd Annual Symposium on Foundations of Computer Science, pages 123–134. IEEE, October 2022.
[2] Nima Anari, Shayan Oveis Gharan, and Alireza Rezaei. Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In Proceedings of the 29th Conference on Learning Theory, pages 103–115. PMLR, June 2016.
[3] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(1):714–751, January 2017.
[4] Francis Bach, Simon Lacoste-Julien, and Guillaume Obozinski. On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Conference on Machine Learning, March 2012.
[5] Rémi Bardenet and Adrien Hardy. Monte Carlo with determinantal point processes. The Annals of Applied Probability, 30(1):368–417, February 2020.
[6] Ayoub Belhadji. An analysis of Ermakov–Zolotukhin quadrature using kernels. In Advances in Neural Information Processing Systems, volume 34, pages 27278–27289. Curran Associates, Inc., 2021.
[7] Ayoub Belhadji, Rémi Bardenet, and Pierre Chainais. Kernel quadrature with DPPs. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[8] Ayoub Belhadji, Rémi Bardenet, and Pierre Chainais. Kernel interpolation with continuous volume sampling. In Proceedings of the 37th International Conference on Machine Learning, pages 725–735. PMLR, November 2020.
[9] Daniele Calandriello, Michal Derezinski, and Michal Valko. Sampling from a k-DPP without looking at all items. In Advances in Neural Information Processing Systems, volume 33, pages 6889–6899. Curran Associates, Inc., 2020.
[10] Yifan Chen, Ethan N. Epperly, Joel A. Tropp, and Robert J. Webber. Randomly pivoted Cholesky: Practical approximation of a kernel matrix with few entry evaluations, February 2023. arXiv:2207.06503.
[11] Yutian Chen, Max Welling, and Alex Smola. Super-samples from kernel herding. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 109–116. AUAI Press, 2010.
[12] Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix approximation and projective clustering via volume sampling. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1117–1126. ACM, 2006.
[13] Amit Deshpande and Santosh Vempala. Adaptive sampling and fast low-rank matrix approximation. In Josep Díaz, Klaus Jansen, José D. P. Rolim, and Uri Zwick, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 292–303. Springer Berlin Heidelberg, 2006.
[14] Josef Dick and Friedrich Pillichshammer. Quasi-Monte Carlo integration, discrepancy and reproducing kernel Hilbert spaces, pages 16–45. Cambridge University Press, 2010.
[15] Tobin A. Driscoll, Nicholas Hale, and Lloyd N. Trefethen. Chebfun Guide, 2014.
[16] J. C. Dunn. Convergence rates for conditional gradient sequences generated by implicit step length rules. SIAM Journal on Control and Optimization, 18(5):473–487, 1980.
[17] Raaz Dwivedi and Lester Mackey. Kernel thinning. In Proceedings of the Thirty-Fourth Conference on Learning Theory, pages 1753–1753. PMLR, August 2021.
[18] Raaz Dwivedi and Lester Mackey. Generalized kernel thinning. In Tenth International Conference on Learning Representations, April 2022.
[19] Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2(Dec):243–264, 2001.
[20] Damien Garreau, Wittawat Jitkrittum, and Motonobu Kanagawa. Large sample analysis of the median heuristic, October 2018. arXiv:1707.07269.
[21] Guillaume Gautier, Rémi Bardenet, and Michal Valko. On two ways to use determinantal point processes for Monte Carlo integration. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[22] Satoshi Hayakawa, Harald Oberhauser, and Terry Lyons. Positively weighted kernel quadrature via subsampling. In Advances in Neural Information Processing Systems, volume 35, pages 6886–6900. Curran Associates, Inc., 2022.
[23] Satoshi Hayakawa, Harald Oberhauser, and Terry Lyons. Sampling-based Nyström approximation and kernel quadrature, January 2023. arXiv:2301.09517.
[24] Lauri Himanen, Marc O. J. Jäger, Eiaki V. Morooka, Filippo Federici Canova, Yashasvi S. Ranawat, David Z. Gao, Patrick Rinke, and Adam S. Foster. DScribe: Library of descriptors for machine learning in materials science. Computer Physics Communications, 247:106949, February 2020.
[25] Haoyan Huo and Matthias Rupp. Unified representation of molecules and crystals for machine learning. Machine Learning: Science and Technology, 3:045017, November 2022.
[26] Ferenc Huszár and David Duvenaud. Optimally-weighted herding is Bayesian quadrature. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pages 377–386. AUAI Press, 2012.
[27] Hans Kersting and Philipp Hennig. Active uncertainty calibration in Bayesian ODE solvers. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pages 309–318. AUAI Press, 2016.
[28] Alex Kulesza and Ben Taskar. k-DPPs: Fixed-size determinantal point processes. In Proceedings of the 28th International Conference on Machine Learning, pages 1193–1200. Omnipress, January 2011.
[29] C. Litterer and T. Lyons. High order recombination and an application to cubature on Wiener space. The Annals of Applied Probability, 22(4):1301–1327, August 2012.
[30] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
[31] E. J. Nyström. Über die praktische Auflösung von Integralgleichungen mit Anwendungen auf Randwertaufgaben. Acta Mathematica, 54:185–204, 1930.
[32] Supratik Paul, Konstantinos Chatzilygeroudis, Kamil Ciosek, Jean-Baptiste Mouret, Michael A. Osborne, and Shimon Whiteson. Alternating optimisation and quadrature for robust control. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), pages 3925–3933. AAAI Press, February 2018.
[33] Raghunathan Ramakrishnan, Pavlo O. Dral, Matthias Rupp, and O. Anatole von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1(1):140022, December 2014.
[34] Alireza Rezaei and Shayan Oveis Gharan. A polynomial time MCMC method for sampling from continuous determinantal point processes. In Proceedings of the 36th International Conference on Machine Learning, pages 5438–5447. PMLR, May 2019.
[35] Christian P. Robert. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer Texts in Statistics. Springer, second edition, 2007.
[36] Christian P. Robert and George Casella. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer, second edition, 2004.
[37] Lars Ruddigkeit, Ruud van Deursen, Lorenz C. Blum, and Jean-Louis Reymond. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling, 52(11):2864–2875, November 2012.
[38] Abhishek Shetty, Raaz Dwivedi, and Lester Mackey. Distribution compression in near-linear time. In Tenth International Conference on Learning Representations, April 2022.
[39] Ingo Steinwart and Clint Scovel. Mercer's theorem on general domains: On the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35(3):363–417, June 2012.
[40] Annika Stuke, Milica Todorović, Matthias Rupp, Christian Kunkel, Kunal Ghosh, Lauri Himanen, and Patrick Rinke. Chemical diversity in molecular orbital energy predictions with kernel ridge regression. The Journal of Chemical Physics, 150(20):204121, May 2019.
[41] Kazuma Tsuji, Ken'ichiro Tanaka, and Sebastian Pokutta. Pairwise conditional gradients without swap steps and sparser kernel herding. In Proceedings of the International Conference on Machine Learning, 2022.
[42] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Proceedings of the 13th International Conference on Neural Information Processing Systems, pages 661–667. MIT Press, January 2000.

## A Proof of theorem 1

We start off by proving proposition 3.

Proof of proposition 3. Equation (12) follows from a direct calculation:

$$\sum_{i=1}^n w_i^{\star}\, k(\cdot, s_i) = k(\cdot, S)\, w^{\star} = \int_X k(\cdot, S)\, k(S,S)^{-1} k(S,x)\, g(x)\,\mathrm{d}\mu(x) = T_S g.$$
To establish the second claim, use the fact [3, §2.1] that T^{-1/2} : H → L²(µ) is an isometry and invoke the definition (9) of Err(S; g):

$$\operatorname{Err}^2(S; g) = \|(T - T_S)\, g\|_{\mathcal{H}}^2 = \|T^{-1/2} (T - T_S)\, g\|_{L^2(\mu)}^2 = \langle g, (T - T_S)\, T^{-1} (T - T_S)\, g \rangle_{L^2(\mu)} \le \langle g, (T - T_S)\, g \rangle_{L^2(\mu)}.$$

The last inequality follows because T − T_S and T_S = T − (T − T_S) are both positive semidefinite.

For the rest of our proof of theorem 1, we employ the following notation. Let ⪯ denote the Loewner order for bounded linear operators on L²(µ); that is, A ⪯ B if B − A is positive semidefinite (psd). Let k_i denote the residual k_i := k − k̂_{{s_1,...,s_i}} of the approximate kernel associated with the first i nodes. (See proposition 8 below.) The associated integral operator is

$$T_i f := \int_X k_i(\cdot, x)\, f(x)\,\mathrm{d}\mu(x).$$

To analyze RPCholesky, we begin by analyzing a single step of the procedure. The expected value of the residual kernel after one step, k_1, is

$$\mathbb{E}\, k_1(x,y) = k(x,y) - \frac{\int_X k(x,s)\, k(s,y)\,\mathrm{d}\mu(s)}{\int_X k(z,z)\,\mathrm{d}\mu(z)}.$$

Consequently, the expected value of the integral operator T_1 is

$$\Phi(T) := \mathbb{E}[T_1] = T - \frac{T^2}{\operatorname{Tr} T}. \tag{15}$$

The map Φ enjoys several useful properties [10, Lem. 3.2].

Proposition 5. The map Φ is positive, monotone, and concave with respect to the ⪯ order on the set of trace-class psd operators on L²(µ). That is, for trace-class psd operators A, B,

$$0 \preceq A \preceq B \implies 0 \preceq \Phi(A) \preceq \Phi(B) \tag{16}$$

and, for θ ∈ [0, 1],

$$\theta\,\Phi(A) + (1-\theta)\,\Phi(B) \preceq \Phi(\theta A + (1-\theta) B). \tag{17}$$

Proof. We reproduce the proof from [10, Lem. 3.2] for completeness.

Positivity. Assume A ⪰ 0. Observe that Id ⪰ A / Tr A, where Id denotes the identity operator. Thus

$$\Phi(A) = A^{1/2} \Big( \mathrm{Id} - \frac{A}{\operatorname{Tr} A} \Big) A^{1/2} \succeq 0.$$

This establishes that Φ is positive: A ⪰ 0 implies Φ(A) ⪰ 0.

Concavity. For θ ∈ [0, 1], denote θ̄ := 1 − θ. Then

$$\Phi(\theta A + \bar\theta B) - \theta\,\Phi(A) - \bar\theta\,\Phi(B) = \frac{\theta \bar\theta}{\theta \operatorname{Tr} A + \bar\theta \operatorname{Tr} B} \left( \sqrt{\tfrac{\operatorname{Tr} B}{\operatorname{Tr} A}}\, A - \sqrt{\tfrac{\operatorname{Tr} A}{\operatorname{Tr} B}}\, B \right)^2 \succeq 0.$$

This establishes the concavity claim (17).

Monotonicity. Suppose 0 ⪯ A ⪯ B. Observe that the map Φ is positively homogeneous: for θ > 0, Φ(θA) = θΦ(A). Thus,

$$\Phi(B) = \Phi(A + (B - A)) = 2\,\Phi\!\left( \frac{A + (B-A)}{2} \right).$$

Invoke concavity (17) to obtain

$$\Phi(B) = 2\,\Phi\!\left( \frac{A + (B-A)}{2} \right) \succeq \Phi(A) + \Phi(B - A) \succeq \Phi(A).$$

In the last inequality, we use the positivity Φ(B − A) ⪰ 0. This completes the proof of (16).

We can use the map Φ to bound the residual integral operator T_i:

Proposition 6. For i ≥ 0, we have the bound

$$\mathbb{E}[T_i] \preceq \Phi^i(T), \tag{18}$$

where Φ^i denotes the i-fold composition of Φ.

Proof. The ith step of RPCholesky applies one step of RPCholesky to k_{i−1}. Thus,

$$\mathbb{E}[T_i] = \mathbb{E}\big[\mathbb{E}[T_i \mid s_1, \ldots, s_{i-1}]\big] = \mathbb{E}[\Phi(T_{i-1})].$$

Now, we can apply concavity (17) to obtain E[T_i] = E[Φ(T_{i−1})] ⪯ Φ(E[T_{i−1}]). Applying this result twice and using monotonicity (16) yields

$$\mathbb{E}[T_i] \preceq \Phi(\mathbb{E}[T_{i-1}]) \quad\text{and}\quad \mathbb{E}[T_{i-1}] \preceq \Phi(\mathbb{E}[T_{i-2}]) \implies \mathbb{E}[T_i] \preceq \Phi\big(\Phi(\mathbb{E}[T_{i-2}])\big).$$

Iterating this argument, we obtain the desired conclusion (18).

We now have the tools to prove theorem 1. We actually prove the following slight refinement:

Theorem 7. Let S = {s_1, ..., s_n} be generated by the RPCholesky sampling algorithm. For any function g ∈ L²(µ), nonnegative integer r ≥ 0, and real numbers a, ε > 0, we have

$$\mathbb{E}\operatorname{Err}^2(S; g) \le \Big( a + \varepsilon \sum_{i=r+1}^{\infty} \lambda_i \Big) \|g\|_{L^2(\mu)}^2 \qquad \text{provided that} \qquad n \ge r \log\frac{\lambda_1}{a} + \frac{1}{\varepsilon}. \tag{19}$$

Theorem 1 immediately follows from theorem 7 by taking ε = δ/2 and a = (δ/2) Σ_{i=r+1}^∞ λ_i. We now prove theorem 7.

Proof of theorem 7. In the standing notation, T − T_S corresponds to T_n. Thus, by proposition 6, we have

$$\mathbb{E}\,\langle g, (T - T_S)\, g \rangle_{L^2(\mu)} = \langle g, \mathbb{E}[T_n]\, g \rangle_{L^2(\mu)} \le \langle g, \Phi^n(T)\, g \rangle_{L^2(\mu)} \le \|g\|_{L^2(\mu)}^2\, \lambda_{\max}(\Phi^n(T)).$$

Thus, it is sufficient to show that

$$\lambda_{\max}(\Phi^n(T)) \le a + \varepsilon \sum_{i=r+1}^{\infty} \lambda_i \tag{20}$$

holds under the condition on n in (19). Using the definition (15) of Φ, we see that the eigenvalues λ_1^{(i)} ≥ λ_2^{(i)} ≥ ··· of Φ^i(T) are given by the following recurrence:

$$\lambda_j^{(i)} = \lambda_j^{(i-1)} - \frac{\big(\lambda_j^{(i-1)}\big)^2}{\sum_{\ell=1}^{\infty} \lambda_\ell^{(i-1)}} \qquad \text{for } i, j = 1, 2, \ldots, \tag{21}$$

with initial condition λ_j^{(0)} = λ_j for all j.
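The recurrence (21) is easy to explore numerically. The sketch below iterates it for a synthetic polynomially decaying spectrum λ_j = j^{-2s} (truncated at N terms purely for this illustration) and compares the resulting top eigenvalue of Φ^n(T) against a tail sum of the kind that appears in theorem 1.

```python
import numpy as np

# Iterate the eigenvalue recurrence (21): lam_j <- lam_j - lam_j^2 / sum(lam).
# The spectrum lam_j = j^{-2s}, truncated at N terms, is a synthetic example.
s, N, n_steps, r = 1.0, 100_000, 200, 20
j = np.arange(1, N + 1, dtype=float)
lam = j ** (-2 * s)
tail = lam[r:].sum()                      # sum_{i > r} lambda_i (truncated)

for _ in range(n_steps):
    lam = lam - lam**2 / lam.sum()        # one application of Phi to the spectrum

print("lambda_max(Phi^n(T)) ~", lam[0])   # decays toward bounds of the form (20)
print("tail sum_{i>r} lambda_i ~", tail)
```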
From this formula, we immediately conclude that each λ_j^{(i)} is nonnegative and nonincreasing in i. In addition, the ordinal ranking λ_1^{(i)} ≥ λ_2^{(i)} ≥ ··· is preserved for each i. Indeed, if λ_{j+1}^{(i−1)} − λ_j^{(i−1)} ≤ 0, then

$$\lambda_{j+1}^{(i)} - \lambda_j^{(i)} = \lambda_{j+1}^{(i-1)} - \lambda_j^{(i-1)} - \frac{\big(\lambda_{j+1}^{(i-1)}\big)^2 - \big(\lambda_j^{(i-1)}\big)^2}{\sum_{\ell=1}^{\infty} \lambda_\ell^{(i-1)}} = \Big( \lambda_{j+1}^{(i-1)} - \lambda_j^{(i-1)} \Big) \left( 1 - \frac{\lambda_{j+1}^{(i-1)} + \lambda_j^{(i-1)}}{\sum_{\ell=1}^{\infty} \lambda_\ell^{(i-1)}} \right) \le 0,$$

and the claim follows by induction on i.

To bound λ_max(Φ^i(T)) = λ_1^{(i)}, we bound the sum of eigenvalues appearing in the recurrence (21):

$$\sum_{\ell=1}^{\infty} \lambda_\ell^{(i-1)} = \lambda_1^{(i-1)} + \cdots + \lambda_r^{(i-1)} + \sum_{\ell=r+1}^{\infty} \lambda_\ell^{(i-1)} \le r\,\lambda_1^{(i-1)} + \sum_{\ell=r+1}^{\infty} \lambda_\ell. \tag{22}$$

To bound the first r terms, we used the ordering of the eigenvalues λ_1^{(i−1)} ≥ ··· ≥ λ_r^{(i−1)}. To bound the tail, we used the fact that for each ℓ ∈ {r+1, r+2, ...}, the eigenvalue λ_ℓ^{(i−1)} is nonincreasing in i and thus bounded by λ_ℓ^{(0)} = λ_ℓ. Substituting (22) in (21), we obtain

$$\lambda_1^{(i)} \le \lambda_1^{(i-1)} - \frac{\big(\lambda_1^{(i-1)}\big)^2}{r\,\lambda_1^{(i-1)} + \sum_{\ell=r+1}^{\infty} \lambda_\ell} \qquad \text{for } i = 1, 2, \ldots. \tag{23}$$

To bound the solution to the recurrence (23), we pass to continuous time. Specifically, for any t ∈ ℤ₊, λ_1^{(t)} is bounded by the process x(t) solving the initial value problem

$$\frac{\mathrm{d}}{\mathrm{d}t} x(t) = -\frac{x(t)^2}{r\,x(t) + \sum_{\ell=r+1}^{\infty} \lambda_\ell}, \qquad x(0) = \lambda_1.$$

The bound λ_1^{(t)} ≤ x(t) holds for all t ∈ ℤ₊ because x(t) is decreasing in t and x ↦ x²/(rx + Σ_{ℓ=r+1}^∞ λ_ℓ) is increasing for x ∈ (0, ∞). We are interested in the time t = t* at which x(t*) = a + ε Σ_{ℓ=r+1}^∞ λ_ℓ, which we obtain by separation of variables:

$$t^{\star} = \int_{a + \varepsilon \sum_{\ell=r+1}^{\infty} \lambda_\ell}^{\lambda_1} \frac{r x + \sum_{\ell=r+1}^{\infty} \lambda_\ell}{x^2}\,\mathrm{d}x = \left[ r \log x - \frac{\sum_{\ell=r+1}^{\infty} \lambda_\ell}{x} \right]_{x = a + \varepsilon \sum_{\ell=r+1}^{\infty} \lambda_\ell}^{x = \lambda_1} = r \log\frac{\lambda_1}{a + \varepsilon \sum_{\ell=r+1}^{\infty} \lambda_\ell} + \sum_{\ell=r+1}^{\infty} \lambda_\ell \left( \frac{1}{a + \varepsilon \sum_{\ell=r+1}^{\infty} \lambda_\ell} - \frac{1}{\lambda_1} \right) \le r \log\frac{\lambda_1}{a} + \frac{1}{\varepsilon}.$$

Since x is decreasing, the following holds for n satisfying the condition in (19):

$$\lambda_{\max}(\Phi^n(T)) = \lambda_1^{(n)} \le x(n) \le x(t^{\star}) = a + \varepsilon \sum_{\ell=r+1}^{\infty} \lambda_\ell.$$

This shows (20), completing the proof.

## B Proof of theorem 4

To see why algorithm 2 produces samples from the RPCholesky sampling distribution, observe that at each effective iteration i the matrix L(1:i−1, 1:i−1) is a Cholesky factor of k(S, S). Therefore, d is equal to the residual kernel k_S^res := k − k̂_S evaluated at s_i:

$$d = k_S^{\mathrm{res}}(s_i, s_i).$$

At each effective iteration i, the candidate point s_i is sampled from k(x, x) dµ(x) and accepted with probability k_S^res(s_i, s_i)/k(s_i, s_i). Therefore, conditional on acceptance and the previous steps of the algorithm, s_i is sampled from the probability measure

$$\frac{1}{\mathbb{P}\{\text{Acceptance} \mid S\}}\; k(x,x)\,\mathrm{d}\mu(x)\; \frac{k_S^{\mathrm{res}}(x,x)}{k(x,x)} = \frac{k_S^{\mathrm{res}}(x,x)\,\mathrm{d}\mu(x)}{\mathbb{P}\{\text{Acceptance} \mid S\}}.$$

Since this is a probability measure which must integrate to one, the probability of acceptance is

$$\mathbb{P}\{\text{Acceptance} \mid S\} = \int_X k_S^{\mathrm{res}}(x,x)\,\mathrm{d}\mu(x).$$

Therefore, conditional on acceptance and the previous steps of the algorithm, s_i is sampled from the probability measure k_S^res(x, x) dµ(x) / ∫_X k_S^res(x, x) dµ(x). This corresponds exactly with the RPCholesky sampling procedure described in algorithm 1 (see proposition 8 below).

Now, we bound the runtime. For each iteration i, the expected acceptance probability is bounded from below by the smallest possible trace error achievable by a rank-(i−1) approximation:

$$\mathbb{P}\{\text{Acceptance}\} = \mathbb{E}\big[\mathbb{P}\{\text{Acceptance} \mid S\}\big] = \mathbb{E} \int_X k_S^{\mathrm{res}}(x,x)\,\mathrm{d}\mu(x) \ge \eta_{i-1}.$$

Therefore, the expected number of rejections at step i is bounded by the expectation of a geometric random variable with parameter η_{i−1}, i.e., by η_{i−1}^{-1}. Since each loop iteration takes O(i²) operations (attributable to line 4) and at most η_{i−1}^{-1} expected iterations are required to accept at step i, the total expected runtime is at most O(Σ_{i=1}^n i²/η_{i−1}) ≤ O(n³/η_{n−1}).

## C Derivation of (7)

For completeness, we provide a derivation of (7). This result is standard; see, e.g., [3, Eq. (7)].
Recall our definition (6) of Err(S, w; g):

$$\operatorname{Err}(S, w; g) = \max_{\|f\|_{\mathcal{H}} \le 1} \left| \int_X f(x)\,g(x)\,\mathrm{d}\mu(x) - \sum_{i=1}^n w_i\, f(s_i) \right|.$$

Use the reproducing property of k and linearity to write

$$\operatorname{Err}(S, w; g) = \max_{\|f\|_{\mathcal{H}} \le 1} \left| \int_X \langle f, k(\cdot,x) \rangle_{\mathcal{H}}\, g(x)\,\mathrm{d}\mu(x) - \sum_{i=1}^n w_i\, \langle f, k(\cdot,s_i) \rangle_{\mathcal{H}} \right|.$$

Now apply linearity of ⟨·, ·⟩_H to obtain

$$\operatorname{Err}(S, w; g) = \max_{\|f\|_{\mathcal{H}} \le 1} \left| \Big\langle f,\; \int_X k(\cdot,x)\, g(x)\,\mathrm{d}\mu(x) - \sum_{i=1}^n w_i\, k(\cdot,s_i) \Big\rangle_{\mathcal{H}} \right|.$$

Use the dual characterization of the RKHS norm ‖·‖_H and the definition (5) of T to conclude

$$\operatorname{Err}(S, w; g) = \Big\| Tg - \sum_{i=1}^n w_i\, k(\cdot, s_i) \Big\|_{\mathcal{H}}.$$

## D Details for numerical experiments

In this section, we provide additional details about our numerical experiments.

Continuous volume sampling. We are unaware of a provably correct n-DPP/continuous volume sampling algorithm that is sufficiently efficient to facilitate numerical testing and comparisons. The best-known sampler [34] for the continuous setting is inexact and requires O(n⁵ log n) MCMC steps, which would be prohibitive for us. To produce samples from the continuous volume sampling distribution for this paper, we use a continuous analog of the MCMC algorithm [2]. Given a set S = {s_1, ..., s_n} and assuming µ is a probability measure, one MCMC step proceeds as follows (a code sketch appears at the end of this appendix):

1. Sample s_new ∼ µ and s_old ∼ UNIF(S).
2. Define S′ = S ∪ {s_new} \ {s_old}.
3. With probability

$$\frac{1}{2} \min\left\{ 1, \frac{\det k(S', S')}{\det k(S, S)} \right\},$$

accept and set S ← S′. Otherwise, leave S unchanged.

In our experiments, we initialize with s_1, ..., s_n drawn iid from µ and run for 10n MCMC steps. Running for just 10n MCMC steps is aggressive, even in the finite setting. Indeed, the theoretical analysis of [2, Thm. 2, Eq. (1.2)] for µ = UNIF{1, 2, ..., |X|} proves the sampler mixes to stationarity in at most O(n²|X|) MCMC steps when properly initialized. Even with just 10n MCMC steps, continuous volume sampling is dramatically more expensive than RPCholesky in our experiments for n ≥ 64.

Finally, we notice that the performance of continuous volume sampling degrades to that of iid sampling for sufficiently large n. We believe that the cause of this is numerical issues in evaluating the determinants of the kernel matrices k(S, S). This suggests that, in addition to being faster, RPCholesky may be more numerically robust than continuous volume sampling.

Thinning and positively weighted kernel quadrature. Our kernel quadrature experiments with thinning and PWKQ are based on code from the authors of [22], available online at https://github.com/satoshi-hayakawa/kernel-quadrature, which uses the goodpoints package (https://github.com/microsoft/goodpoints) by the authors of [17, 18, 38] to perform thinning. We use g = 4, δ = 0.5, and four bins for the COMPRESS++ algorithm.

Miscellaneous. Our code is available online at https://github.com/eepperly/RPCholesky-Kernel-Quadrature. To evaluate the exact integrals ∫_X k(x, y) g(y) dµ(y) for fig. 1, we use the 2D functionality in Chebfun [15]. To compute the optimal weights (8), we add a small multiple of the identity to regularize the system:

$$w^{\star,\mathrm{reg}} = \big( k(S,S) + 10\,\varepsilon_{\mathrm{mach}}\, \operatorname{trace}(k(S,S))\, \mathrm{I} \big)^{-1}\, Tg(S).$$

Here, ε_mach = 2^{-52} is the double-precision machine epsilon.
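For reference, here is a sketch of one exchange step of the continuous-volume-sampling chain described above. The helper names (`kernel`, `sample_mu`) are illustrative; log-determinants are used in place of raw determinants purely as a numerical-robustness tweak, in view of the determinant issues noted above.

```python
import numpy as np

def cvs_mcmc_step(S, kernel, sample_mu, rng):
    """One exchange step of the continuous volume sampling MCMC chain.

    S: array of current nodes; kernel(A, B): kernel matrix between point
    sets; sample_mu(): one draw from mu. Returns the (possibly unchanged) S.
    """
    s_new = sample_mu()                       # proposal node from mu
    j = rng.integers(len(S))                  # s_old chosen uniformly from S
    S_prop = S.copy()
    S_prop[j] = s_new                         # S' = S + {s_new} - {s_old}
    # Acceptance probability (1/2) * min{1, det k(S',S') / det k(S,S)}.
    sign_p, logdet_p = np.linalg.slogdet(kernel(S_prop, S_prop))
    sign_c, logdet_c = np.linalg.slogdet(kernel(S, S))
    ratio = np.exp(logdet_p - logdet_c) if min(sign_p, sign_c) > 0 else 0.0
    if rng.uniform() < 0.5 * min(1.0, ratio):
        return S_prop
    return S
```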
## E Randomly pivoted Cholesky and Nyström approximation

In this section, we describe the connection between the RPCholesky algorithm and Nyström approximation. We hope that this explanation provides context that will be helpful in understanding the RPCholesky algorithm and the proofs of theorems 1 and 4.

To make the following discussion clearer, we begin by rewriting the basic RPCholesky algorithm (see algorithm 1) as algorithm 5, which avoids overwriting the kernel k at each step.

Algorithm 5 RPCholesky: unoptimized implementation without overwriting
Input: Kernel k : X × X → ℝ and number of quadrature points n
Output: Quadrature points s_1, ..., s_n and Nyström approximation k̂_n = k̂_{{s_1,...,s_n}} to k
1: Initialize k_0 ← k, k̂_0 ← 0
2: for i = 1, 2, ..., n do
3:   Sample s_i from the probability measure k_{i−1}(x, x) dµ(x) / ∫_X k_{i−1}(x, x) dµ(x)
4:   Update Nyström approximation: k̂_i ← k̂_{i−1} + k_{i−1}(·, s_i) k_{i−1}(s_i, s_i)^{-1} k_{i−1}(s_i, ·)
5:   Update kernel: k_i ← k_{i−1} − k_{i−1}(·, s_i) k_{i−1}(s_i, s_i)^{-1} k_{i−1}(s_i, ·)
6: end for

From this rewriting, we see that in addition to selecting quadrature nodes s_1, ..., s_n, the RPCholesky algorithm builds an approximation k̂_n to the kernel k. Indeed, one can verify that the approximation k̂_n produced by this algorithm is precisely the Nyström approximation (definition 2).

Proposition 8. The output k̂_n of algorithm 5 is equal to the Nyström approximation k̂_{{s_1,...,s_n}}.

Proof. Proceed by induction on n. The base case n = 0 is immediate. Fix n and let S := {s_1, ..., s_n}, s′ := s_{n+1}, and S′ := {s_1, ..., s_n, s′}. Write the Nyström approximation k̂_{S′} as

$$\hat{k}_{S'} = \begin{bmatrix} k(\cdot,S) & k(\cdot,s') \end{bmatrix} \begin{bmatrix} k(S,S) & k(S,s') \\ k(s',S) & k(s',s') \end{bmatrix}^{-1} \begin{bmatrix} k(S,\cdot) \\ k(s',\cdot) \end{bmatrix}.$$

Compute a block LDL factorization of the matrix in the middle of this expression:

$$\begin{bmatrix} k(S,S) & k(S,s') \\ k(s',S) & k(s',s') \end{bmatrix} = \begin{bmatrix} \mathrm{I} & 0 \\ k(s',S)\,k(S,S)^{-1} & 1 \end{bmatrix} \begin{bmatrix} k(S,S) & 0 \\ 0 & k_n(s',s') \end{bmatrix} \begin{bmatrix} \mathrm{I} & k(S,S)^{-1} k(S,s') \\ 0 & 1 \end{bmatrix}.$$

Here, we've used the induction hypothesis

$$k_n(s',s') = k(s',s') - \hat{k}_S(s',s') = k(s',s') - k(s',S)\, k(S,S)^{-1} k(S,s'). \tag{24}$$

Inverting the factorization and substituting, we find

$$\hat{k}_{S'} = \begin{bmatrix} k(\cdot,S) & k(\cdot,s') \end{bmatrix} \begin{bmatrix} \mathrm{I} & -k(S,S)^{-1} k(S,s') \\ 0 & 1 \end{bmatrix} \begin{bmatrix} k(S,S)^{-1} & 0 \\ 0 & k_n(s',s')^{-1} \end{bmatrix} \begin{bmatrix} \mathrm{I} & 0 \\ -k(s',S)\,k(S,S)^{-1} & 1 \end{bmatrix} \begin{bmatrix} k(S,\cdot) \\ k(s',\cdot) \end{bmatrix}.$$

Multiplying the outer triangular factors into the adjacent blocks yields

$$\hat{k}_{S'} = \begin{bmatrix} k(\cdot,S) & k_n(\cdot,s') \end{bmatrix} \begin{bmatrix} k(S,S)^{-1} & 0 \\ 0 & k_n(s',s')^{-1} \end{bmatrix} \begin{bmatrix} k(S,\cdot) \\ k_n(s',\cdot) \end{bmatrix} = \hat{k}_n + k_n(\cdot,s')\, k_n(s',s')^{-1}\, k_n(s',\cdot) = \hat{k}_{n+1}. \tag{25}$$

For (25), we used the induction hypothesis (24) again. Having established that k̂_{S′} = k̂_{n+1}, the proposition is proven.
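Proposition 8 is also easy to verify numerically in the finite setting: run the incremental updates of algorithm 5 on a random positive-semidefinite matrix and compare the result with the Nyström formula (11). The sketch below is an illustration, not part of the paper's code.

```python
import numpy as np

# Check proposition 8 on a random finite psd "kernel": the incremental
# update of algorithm 5 reproduces the Nystrom approximation (11).
rng = np.random.default_rng(0)
M, n = 50, 8
G = rng.standard_normal((M, M))
K = G @ G.T                                   # kernel matrix on X = {1, ..., M}

k_res, k_hat, pivots = K.copy(), np.zeros_like(K), []
for _ in range(n):
    d = np.clip(np.diag(k_res), 0.0, None)
    s = rng.choice(M, p=d / d.sum())          # line 3: sample the residual diagonal
    pivots.append(s)
    update = np.outer(k_res[:, s], k_res[s, :]) / k_res[s, s]
    k_hat += update                           # line 4: grow the Nystrom approximation
    k_res -= update                           # line 5: eliminate the pivot

S = np.array(pivots)
nystrom = K[:, S] @ np.linalg.pinv(K[np.ix_(S, S)]) @ K[S, :]
print(np.max(np.abs(k_hat - nystrom)))        # agreement up to rounding error
```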