# Kernel quadrature with DPPs

Ayoub Belhadji, Rémi Bardenet, Pierre Chainais
Univ. Lille, CNRS, Centrale Lille, UMR 9189 - CRIStAL, Villeneuve d'Ascq, France
{ayoub.belhadji, remi.bardenet, pierre.chainais}@univ-lille.fr

We study quadrature rules for functions from an RKHS, using nodes sampled from a determinantal point process (DPP). DPPs are parametrized by a kernel, and we use a truncated and saturated version of the RKHS kernel. This link between the two kernels, along with DPP machinery, leads to relatively tight bounds on the quadrature error that depend on the spectrum of the RKHS kernel. Finally, we experimentally compare DPPs to existing kernel-based quadratures such as herding, Bayesian quadrature, or leverage score sampling. Numerical results confirm the interest of DPPs, and even suggest faster rates than our bounds in particular cases.

1 Introduction

Numerical integration [11] is an important tool for Bayesian methods [38] and model-based machine learning [32]. Formally, numerical integration consists in approximating

$$\int_X f(x)\, g(x)\, \mathrm{d}\omega(x) \;\approx\; \sum_{j\in[N]} w_j f(x_j), \qquad (1)$$

where X is a topological space, dω is a Borel probability measure on X, g is a square-integrable function, and f is a function belonging to a space to be made precise. In the quadrature formula (1), the N points $x_1,\dots,x_N \in X$ are called the quadrature nodes, and $w_1,\dots,w_N$ the corresponding weights. The accuracy of a quadrature rule is assessed by the quadrature error, i.e., the absolute difference between the left-hand side and the right-hand side of (1). Classical Monte Carlo algorithms, like importance sampling or Markov chain Monte Carlo [39], pick the nodes as either independent samples or a sample from a Markov chain on X, and all achieve a root mean square quadrature error in $O(1/\sqrt{N})$. Quasi-Monte Carlo quadrature [12] is based on deterministic, low-discrepancy sequences of nodes, and typical error rates for $X = \mathbb{R}^d$ are $O(\log^d N / N)$. Recently, kernels have been used to derive quadrature rules such as herding [2, 9], Bayesian quadrature [19, 35], sophisticated control variates [28, 33], and leverage-score quadrature [1], under the assumption that f lies in an RKHS. The main theoretical advantage is that the resulting error rates are faster than classical Monte Carlo and adapt to the smoothness of f.

In this paper, we propose a new quadrature rule for functions in a given RKHS. Our nearest scientific neighbour is [1], but instead of sampling nodes independently, we leverage dependence and use a repulsive distribution called a projection determinantal point process (DPP), while the weights are obtained through a simple quadratic optimization problem. DPPs were originally introduced by [29] as probabilistic models for beams of fermions in quantum optics. Since then, DPPs have been thoroughly studied in random matrix theory [21], and have more recently been adopted in machine learning [26] and Monte Carlo methods [3]. In practice, a projection DPP is defined through a reference measure dω and a repulsion kernel K. In our approach, the repulsion kernel is a modification of the underlying RKHS kernel. This ensures that sampling is tractable and, as we shall see, that the expected value of the quadrature error is controlled by the decay of the eigenvalues of the integral operator associated to the measure dω.
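For intuition on (1), here is a minimal sketch, entirely our own, of the classical Monte Carlo baseline: i.i.d. nodes from dω with uniform weights, whose root mean square error decays as $O(1/\sqrt{N})$. The test integrand is arbitrary.

```python
# A minimal illustration of the quadrature formula (1) with i.i.d. Monte Carlo
# nodes on X = [0, 1], domega uniform, g = 1; the test integrand is arbitrary.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x) ** 2        # exact integral over [0, 1] is 1/2
exact = 0.5

for N in [10, 100, 1000, 10000]:
    sq_errors = []
    for _ in range(200):                        # average over repetitions
        x = rng.random(N)                       # i.i.d. nodes from domega
        w = np.full(N, 1.0 / N)                 # uniform Monte Carlo weights
        sq_errors.append((w @ f(x) - exact) ** 2)
    rmse = np.sqrt(np.mean(sq_errors))
    print(f"N = {N:5d}  RMSE = {rmse:.5f}  sqrt(N) * RMSE = {np.sqrt(N) * rmse:.3f}")
```

The last column stays roughly constant, which reflects the $O(1/\sqrt{N})$ rate mentioned above.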
Note that quadratures based on projection DPPs have already been studied in the literature: implicitly in [22, Corollary 2.3] in the simple case where X = [0, 1] and dω is the uniform measure, and in [3] for $[0,1]^d$ and more general measures. In the latter case, the quadrature error is asymptotically of order $N^{-1/2-1/(2d)}$ [3], with f essentially $C^1$. In the current paper, we leverage the smoothness of the integrand to improve the convergence rate of the quadrature in general spaces X.

This article is organized as follows. Section 2 reviews kernel-based quadrature. In Section 3, we recall some basic properties of projection DPPs. Section 4 is devoted to the exposition of our main result, along with a sketch of proof. We give precise pointers to the supplementary material for missing details. Finally, in Section 5 we illustrate our result and compare to related work using numerical simulations, for the uniform measure in d = 1 and 2, and the Gaussian measure on $\mathbb{R}$.

Notation. Let X be a topological space equipped with a Borel measure dω, and assume that the support of dω is X. Let $L_2(\mathrm{d}\omega)$ be the Hilbert space of square-integrable, real-valued functions defined on X, with the usual inner product denoted by $\langle\cdot,\cdot\rangle_{\mathrm{d}\omega}$ and the associated norm by $\|\cdot\|_{\mathrm{d}\omega}$. Let $k : X\times X \to \mathbb{R}_+$ be a symmetric and continuous function such that, for any finite set of points in X, the matrix of pairwise kernel evaluations is positive semi-definite. Denote by F the associated reproducing kernel Hilbert space (RKHS) of real-valued functions [5]. We assume that $x\mapsto k(x,x)$ is integrable with respect to the measure dω, so that $F \subset L_2(\mathrm{d}\omega)$. Define the integral operator

$$\Sigma f = \int_X k(\cdot, y)\, f(y)\, \mathrm{d}\omega(y), \qquad f\in L_2(\mathrm{d}\omega). \qquad (2)$$

By construction, Σ is self-adjoint, positive semi-definite, and trace-class [40]. For $m\in\mathbb{N}^*$, denote by $e_m$ the m-th eigenfunction of Σ, normalized so that $\|e_m\|_{\mathrm{d}\omega}=1$, and by $\sigma_m$ the corresponding eigenvalue. The integrability of the diagonal $x\mapsto k(x,x)$ implies that F is compactly embedded in $L_2(\mathrm{d}\omega)$, that is, the identity map $I_F : F\to L_2(\mathrm{d}\omega)$ is compact; moreover, since dω has full support in X, $I_F$ is injective [44]. This implies a Mercer-type decomposition of k,

$$k(x, y) = \sum_{m\in\mathbb{N}^*} \sigma_m\, e_m(x)\, e_m(y), \qquad (3)$$

where $\mathbb{N}^* = \mathbb{N}\setminus\{0\}$ and the convergence is pointwise [45]. Moreover, for $m\in\mathbb{N}^*$, we write $e^F_m = \sqrt{\sigma_m}\, e_m$. Since $I_F$ is injective [45], $(e^F_m)_{m\in\mathbb{N}^*}$ is an orthonormal basis of F. Unless explicitly stated, we assume that F is dense in $L_2(\mathrm{d}\omega)$, so that $(e_m)_{m\in\mathbb{N}^*}$ is an orthonormal basis of $L_2(\mathrm{d}\omega)$. For more intuition: under these assumptions, $f\in F$ if and only if $\sum_m \sigma_m^{-1}\langle f, e_m\rangle_{\mathrm{d}\omega}^2$ converges.

2 Related work on kernel-based quadrature

When the integrand f belongs to the RKHS F of kernel k [10], the quadrature error reads [41]

$$\int_X f(x)\, g(x)\, \mathrm{d}\omega(x) - \sum_{j\in[N]} w_j f(x_j) \;=\; \Big\langle f,\; \mu_g - \sum_{j\in[N]} w_j k(x_j,\cdot)\Big\rangle_F \;\le\; \|f\|_F\, \Big\|\mu_g - \sum_{j\in[N]} w_j k(x_j,\cdot)\Big\|_F, \qquad (4)$$

where $\mu_g = \int_X g(x)\, k(x,\cdot)\,\mathrm{d}\omega(x)$ is the so-called mean element [13, 31]. A tight approximation of the mean element by a linear combination of the functions $k(x_j,\cdot)$ thus guarantees a low quadrature error. The approaches described in this section differ by their choice of nodes and weights.

2.1 Bayesian quadrature and the design of nodes

Bayesian quadrature [27] initially considered a fixed set of nodes and put a Gaussian process prior on the integrand f. The weights were then chosen to minimize the posterior variance of the integral of f. If the kernel of the Gaussian process is chosen to be k, this amounts to minimizing the right-hand side of (4). The case of the Gaussian reference measure was later investigated in detail [35], while parametric integrands were considered in [30].
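For a fixed set of nodes, minimizing the right-hand side of (4) has the closed-form solution $\hat w = K(x)^{-1}\big(\mu_g(x_j)\big)_{j\in[N]}$ discussed in Section 4.1. Below is a minimal sketch, with names of our own choosing, that assumes the periodic Sobolev kernel with s = 1 of Section 5.1 (closed form $1+\pi^2(t^2-t+1/6)$ with $t=\{x-y\}$), equispaced nodes, and $g\equiv 1$, for which $\mu_g\equiv 1$ and $\|\mu_g\|_F^2=1$.

```python
# A minimal sketch of the optimal weights for a fixed set of nodes, i.e. the
# minimizer of the right-hand side of (4): w_hat = K(x)^{-1} (mu_g(x_j))_j, see
# Section 4.1. We assume the periodic Sobolev kernel with s = 1 of Section 5.1
# and g = 1, so that mu_g = 1 and ||mu_g||_F^2 = 1; names are our own.
import numpy as np

def k1(x, y):
    """Closed form of the s = 1 periodic Sobolev kernel on [0, 1]."""
    t = np.mod(x[:, None] - y[None, :], 1.0)        # fractional part of x - y
    return 1.0 + np.pi ** 2 * (t ** 2 - t + 1.0 / 6.0)

N = 20
nodes = (np.arange(N) + 0.5) / N                    # a fixed (here equispaced) design
K = k1(nodes, nodes)                                # Gram matrix K(x)
mu_at_nodes = np.ones(N)                            # mu_g(x_j) = 1 for g = 1
w_hat = np.linalg.solve(K, mu_at_nodes)             # optimal weights for these nodes

# Optimal value of the RHS of (4): ||mu_g||_F^2 - mu_g(x)^T K(x)^{-1} mu_g(x).
err2 = 1.0 - mu_at_nodes @ w_hat
print(f"sum of weights = {w_hat.sum():.4f}, squared approximation error = {err2:.2e}")
```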
Rates of convergence were provided in [8] for specific kernels on compact spaces, under a fill-in condition [47] that encapsulates the requirement that the nodes progressively fill up the (compact) space. Finding the weights that optimize the right-hand side of (4) for a fixed set of nodes is a relatively simple task, see Section 4.1 below, the cost of which can even be reduced using symmetries of the set of nodes [20, 24]. Jointly optimizing over the nodes and the weights, however, is only possible in specific cases [6, 23]. In general, this corresponds to a non-convex problem with many local minima [16, 34]. While [36] proposed to sample i.i.d. nodes from the reference measure dω, greedy minimization approaches have also been proposed [19, 34]. In particular, kernel herding [9] corresponds to uniform weights and greedily minimizing the right-hand side of (4). This leads to a fast rate in O(1/N), but only when the integrand is in a finite-dimensional RKHS. Kernel herding and similar forms of sequential Bayesian quadrature are actually linked to the Frank-Wolfe algorithm [2, 7, 19]. Besides the difficulty of proving fast convergence rates, these greedy approaches still require heuristics in practice.

2.2 Leverage-score quadrature

In [1], the author proposed to sample the nodes $(x_j)$ i.i.d. from some proposal distribution q, and then pick weights $\hat w$ in (1) that solve the optimization problem

$$\min_{w\in\mathbb{R}^N}\; \Big\|\mu_g - \sum_{j\in[N]} \frac{w_j}{q(x_j)^{1/2}}\, k(x_j,\cdot)\Big\|_F^2 + \lambda N \|w\|_2^2, \qquad (5)$$

for some regularization parameter λ > 0. Proposition 1 gives a bound on the resulting approximation error of the mean element for a specific choice of proposal pdf, namely the leverage scores

$$q^*_\lambda(x) \;\propto\; \big\langle k(x,\cdot),\; \Sigma^{-1/2}(\Sigma + \lambda I_{L_2(\mathrm{d}\omega)})^{-1}\Sigma^{-1/2}\, k(x,\cdot)\big\rangle_{L_2(\mathrm{d}\omega)} \;=\; \sum_m \frac{\sigma_m}{\sigma_m + \lambda}\, e_m(x)^2. \qquad (6)$$

Proposition 1 (Proposition 2 in [1]). Let $\delta\in[0,1]$ and $d_\lambda = \operatorname{Tr}\,\Sigma(\Sigma+\lambda I)^{-1}$. Assume that $N \ge 5\, d_\lambda \log(16\, d_\lambda/\delta)$. Then

$$\mathbb{P}\left(\; \sup_{\|g\|_{\mathrm{d}\omega}\le 1}\; \inf_{\|w\|_2^2 \le 4/N}\; \Big\|\mu_g - \sum_{j\in[N]} \frac{w_j}{q^*_\lambda(x_j)^{1/2}}\, k(x_j,\cdot)\Big\|_F^2 \le 4\lambda \right) \;\ge\; 1-\delta. \qquad (7)$$

In other words, Proposition 1 gives a uniform control on the error of approximating $\mu_g$ by the subspace spanned by the $k(x_j,\cdot)$, for g belonging to the unit ball of $L_2(\mathrm{d}\omega)$, where the $(x_j)$ are sampled i.i.d. from $q^*_\lambda$. The required number of nodes is $O(d_\lambda\log d_\lambda)$ for a given approximation error λ. However, for fixed λ, the approximation error in Proposition 1 does not go to zero as N increases. One theoretical workaround is to make λ = λ(N) decrease with N. However, the coupling of N and λ through $d_\lambda$ makes it very intricate to derive a convergence rate from Proposition 1. Moreover, the optimal density $q^*_\lambda$ is in general only available through the series in (6), which makes sampling and evaluation difficult. Finally, we note that Proposition 1 highlights the fundamental role played by the spectral decomposition of the operator Σ in designing and analyzing kernel quadrature rules.

3 Projection determinantal point processes

Let $N\in\mathbb{N}^*$ and $(\psi_n)_{n\in[N]}$ be an orthonormal family of $L_2(\mathrm{d}\omega)$, and assume for simplicity that $X\subset\mathbb{R}^d$ and that dω has density ω with respect to the Lebesgue measure. Define the repulsion kernel

$$K(x, y) = \sum_{n\in[N]} \psi_n(x)\,\psi_n(y), \qquad (8)$$

not to be mistaken for the RKHS kernel k. One can show [18, Lemma 21] that

$$\frac{1}{N!}\, \operatorname{Det}\big(K(x_i, x_j)\big)_{i,j\in[N]}\, \prod_{i\in[N]} \omega(x_i) \qquad (9)$$

is a probability density over $X^N$. When $x_1,\dots,x_N$ have distribution (9), the set $x = \{x_1,\dots,x_N\}$ is said to be a projection DPP with reference measure dω and kernel K (in the finite case, more common in ML, projection DPPs are also called elementary DPPs [26]). Note that K is a positive definite kernel, so that the determinant in (9) is non-negative. Equation (9) is key to understanding DPPs.
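As a concrete illustration of (9), here is a minimal sketch with an orthonormal Fourier family on [0, 1] as our own illustrative choice of $(\psi_n)$, and the uniform reference measure; it shows in particular that the density vanishes when two points collide.

```python
# A minimal sketch of the projection DPP density (9) on X = [0, 1] with the
# uniform reference measure (omega = 1); the orthonormal Fourier family used
# for (psi_n) is our own illustrative choice.
import numpy as np
from math import factorial

N = 5  # number of points = number of basis functions

def fourier_features(x, n_feats):
    """Matrix with entries psi_n(x_i), an orthonormal family of L2(uniform[0,1])."""
    cols, m = [np.ones_like(x)], 1
    while len(cols) < n_feats:
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * m * x))
        if len(cols) < n_feats:
            cols.append(np.sqrt(2) * np.sin(2 * np.pi * m * x))
        m += 1
    return np.stack(cols, axis=1)

def dpp_density(x):
    """Density (9) at x = (x_1, ..., x_N); here omega = 1 on [0, 1]."""
    Psi = fourier_features(x, N)           # N x N matrix with rows psi(x_i)
    K = Psi @ Psi.T                        # matrix (K(x_i, x_j)) for the kernel (8)
    return np.linalg.det(K) / factorial(N)

rng = np.random.default_rng(2)
x = rng.random(N)
print("density at an i.i.d. draw:    ", dpp_density(x))
x[1] = x[0]                                # repeated point: repulsion kills the density
print("density with a repeated point:", dpp_density(x))
```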
First, loosely speaking, the probability of seeing a point of x in an infinitesimal volume around $x_1$ is $K(x_1,x_1)\,\omega(x_1)\,\mathrm{d}x_1$. Note that when d = 1 and $(\psi_n)$ are the family of orthonormal polynomials with respect to dω, this marginal probability is related to the optimal proposal $q^*_\lambda$ of Section 2.2; see Appendix E.2. Second, the probability of simultaneously seeing a point of x in an infinitesimal volume around $x_1$ and one around $x_2$ is

$$\Big[K(x_1,x_1)\,K(x_2,x_2) - K(x_1,x_2)^2\Big]\, \omega(x_1)\,\omega(x_2)\, \mathrm{d}x_1\mathrm{d}x_2 \;\le\; \big[K(x_1,x_1)\,\omega(x_1)\,\mathrm{d}x_1\big]\, \big[K(x_2,x_2)\,\omega(x_2)\,\mathrm{d}x_2\big].$$

The probability of co-occurrence is thus always smaller than that of a Poisson process with the same intensity. In this sense, a projection DPP with symmetric kernel is a repulsive distribution, and K encodes its repulsiveness.

One advantage of DPPs is that they can be sampled exactly. Because of the orthonormality of $(\psi_n)$, one can write the chain rule for (9); see [18]. Sampling each conditional in turn, using e.g. rejection sampling [39], then yields an exact sampling algorithm (a naive numerical sketch is given at the end of Section 4.1). Rejection sampling aside, the cost of this algorithm is cubic in N without further assumptions on the kernel. Simplifying assumptions can take many forms. In particular, when d = 1 and ω is a Gaussian, gamma [14], or beta [25] pdf, and $(\psi_n)$ are the orthonormal polynomials with respect to ω, the corresponding DPP can be sampled by tridiagonalizing a matrix with independent entries, which brings the cost down to $O(N^2)$ and bypasses the need for rejection sampling. For further information on DPPs, see [21, 43].

4 Kernel quadrature with projection DPPs

We follow in the footsteps of [1], see Section 2.2, but use a projection DPP rather than independent sampling to obtain the nodes. In a nutshell, we consider nodes $(x_j)_{j\in[N]}$ drawn from the projection DPP with reference measure dω and repulsion kernel

$$K(x, y) = \sum_{n\in[N]} e_n(x)\, e_n(y), \qquad (10)$$

where we recall that $(e_n)$ are the normalized eigenfunctions of the integral operator Σ. The weights w are obtained by solving the optimization problem

$$\min_{w\in\mathbb{R}^N}\; \|\mu_g - \Phi w\|_F^2, \qquad (11)$$

where

$$\Phi : (w_j)_{j\in[N]} \mapsto \sum_{j\in[N]} w_j\, k(x_j,\cdot) \qquad (12)$$

is the reconstruction operator (Φ depends on the nodes $x_j$, although our notation does not reflect it, for simplicity). In Section 4.1, we prove that (11) almost surely has a unique solution $\hat w$ and state our main result, an upper bound on the expected approximation error $\|\mu_g - \Phi\hat w\|_F^2$ under the proposed projection DPP. Section 4.2 gives a sketch of the proof of this bound.

4.1 Main result

Assuming that the nodes $(x_j)_{j\in[N]}$ are known, we first need to solve the optimization problem (11), which relates to problem (5) without regularization (λ = 0). Let $x = (x_1,\dots,x_N)\in X^N$; then

$$\|\mu_g - \Phi w\|_F^2 = \|\mu_g\|_F^2 - 2\, w^\top \big(\mu_g(x_j)\big)_{j\in[N]} + w^\top K(x)\, w, \qquad (13)$$

where $K(x) = \big(k(x_i, x_j)\big)_{i,j\in[N]}$. The right-hand side of (13) is quadratic in w, so that the optimization problem (11) admits a unique solution $\hat w$ if and only if K(x) is invertible. In this case, the solution is given by $\hat w = K(x)^{-1}\big(\mu_g(x_j)\big)_{j\in[N]}$. A sufficient condition for the invertibility of K(x) is given in the following proposition.

Proposition 2. Assume that the matrix $E(x) = \big(e_i(x_j)\big)_{i,j\in[N]}$ is invertible. Then K(x) is invertible.

The proof of Proposition 2 is given in Appendix D.1. Since the pdf (9) of the projection DPP with kernel (10) is proportional to $\operatorname{Det}^2 E(x)$, the following corollary immediately follows.

Corollary 1. Let $x = \{x_1,\dots,x_N\}$ be a projection DPP with reference measure dω and kernel (10). Then K(x) is a.s. invertible, so that (11) a.s. has the unique solution $\hat w = K(x)^{-1}\big(\mu_g(x_j)\big)_{j\in[N]}$.
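Putting the pieces together, here is a naive, non-optimized sketch of the resulting quadrature for the periodic Sobolev kernel with s = 1 of Section 5.1, whose normalized eigenfunctions are the Fourier functions $1, \sqrt{2}\cos(2\pi m x), \sqrt{2}\sin(2\pi m x)$ ordered by decreasing eigenvalue. Nodes are drawn from the projection DPP with kernel (10) via the chain rule and rejection sampling, and the weights are $\hat w = K(x)^{-1}\big(\mu_g(x_j)\big)_j$ with $g\equiv 1$. The rejection bound and all names are our own choices.

```python
# A naive sketch of the proposed quadrature for the periodic Sobolev kernel with
# s = 1 (Section 5.1): sample the projection DPP with kernel (10) by the chain
# rule + rejection sampling, then solve (11). Our own illustrative code.
import numpy as np

def eigenfunctions(x, N):
    """e_n(x), n = 1..N: Fourier functions ordered by decreasing eigenvalue."""
    cols, m = [np.ones_like(x)], 1
    while len(cols) < N:
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * m * x))
        if len(cols) < N:
            cols.append(np.sqrt(2) * np.sin(2 * np.pi * m * x))
        m += 1
    return np.stack(cols, axis=1)

def sample_dpp_nodes(N, rng, max_trials=100_000):
    """Chain-rule sampler for the projection DPP with kernel (10), omega uniform."""
    basis, nodes = np.zeros((0, N)), []
    M = 2.0 * N                                   # crude upper bound on K(x, x)
    for _ in range(N):
        for _ in range(max_trials):
            x = rng.random()                      # proposal from the reference measure
            phi = eigenfunctions(np.array([x]), N)[0]
            resid = phi - basis.T @ (basis @ phi) # part orthogonal to previous points
            if rng.random() * M < resid @ resid:  # accept w.p. ||resid||^2 / M
                nodes.append(x)
                basis = np.vstack([basis, resid / np.linalg.norm(resid)])
                break
        else:
            raise RuntimeError("too many rejections")
    return np.array(nodes)

def k1(x, y):
    """Closed form of the s = 1 periodic Sobolev kernel on [0, 1]."""
    t = np.mod(x[:, None] - y[None, :], 1.0)
    return 1.0 + np.pi ** 2 * (t ** 2 - t + 1.0 / 6.0)

rng = np.random.default_rng(0)
N = 20
x = sample_dpp_nodes(N, rng)
w_hat = np.linalg.solve(k1(x, x), np.ones(N))     # hat w = K(x)^{-1} (mu_g(x_j)), g = 1
err2 = 1.0 - np.ones(N) @ w_hat                   # ||mu_g - Phi hat w||_F^2 for g = 1
print(f"squared approximation error with DPP nodes: {err2:.2e}")
```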
We now give our main result, which uses nodes $(x_j)_{j\in[N]}$ drawn from a well-chosen projection DPP.

Theorem 1. Let $x = \{x_1,\dots,x_N\}$ be a projection DPP with reference measure dω and kernel (10). Let $\hat w$ be the unique solution to (11) and define

$$\|g\|_{\mathrm{d}\omega,1} = \sum_{n\in[N]} \big|\langle e_n, g\rangle_{\mathrm{d}\omega}\big|.$$

Assume that $\|g\|_{\mathrm{d}\omega}\le 1$ and define $r_N = \sum_{m\ge N+1}\sigma_m$. Then

$$\mathbb{E}_{\mathrm{DPP}}\, \|\mu_g - \Phi\hat w\|_F^2 \;\le\; 2\,\sigma_{N+1} + 2\,\|g\|_{\mathrm{d}\omega,1}^2\, N\, r_N. \qquad (14)$$

In particular, if $N r_N = o(1)$, then the right-hand side of (14) is of order $N r_N$.

For example, take X = [0, 1], dω the uniform measure on X, and F the s-Sobolev space; then $\sigma_m = m^{-2s}$ [5]. Now, if s > 1, the expected worst-case quadrature error is bounded by $N r_N = O(N^{2-2s}) = o(1)$. Another example is the case of the Gaussian measure on $X = \mathbb{R}$, with the Gaussian kernel. In this case $\sigma_m = \beta\alpha^m$ with $0<\alpha<1$ and $\beta>0$ [37], so that $N r_N = N\beta\,\frac{\alpha^{N+1}}{1-\alpha} = o(1)$.

We have assumed that F is dense in $L_2(\mathrm{d}\omega)$, but Theorem 1 is also valid when F is finite-dimensional. In this case, denote $N_0 = \dim F$. Then, for $n > N_0$, $\sigma_n = 0$ and $r_{N_0} = 0$, so that (14) implies

$$\|\mu_g - \Phi\hat w\|_F^2 = 0 \quad \text{a.s.} \qquad (15)$$

This compares favourably with herding, for instance, which comes with a rate in O(1/N) for the quadrature based on herding with uniform weights [2, 9].

The constant $\|g\|_{\mathrm{d}\omega,1}$ in (14) is the $\ell_1$ norm of the coefficients of the projection of g onto $\operatorname{Span}(e_n)_{n\in[N]}$ in $L_2(\mathrm{d}\omega)$. For example, for $g = e_n$, $\|g\|_{\mathrm{d}\omega,1} = 1$ if $n\in[N]$ and $\|g\|_{\mathrm{d}\omega,1} = 0$ if $n\ge N+1$. In the worst case, $\|g\|_{\mathrm{d}\omega,1} \le \sqrt{N}$. Thus, we can obtain a bound that is uniform over $\|g\|_{\mathrm{d}\omega}\le 1$, in the spirit of Proposition 1, but with a supplementary factor N in the upper bound in (14).

4.2 Bounding the approximation error under the DPP

In this section, we give the skeleton of the proof of Theorem 1, referring to the appendices for technical details. The proof is in two steps. First, we give an upper bound on the approximation error $\|\mu_g - \Phi\hat w\|_F^2$ that involves the maximal principal angle between two subspaces of F, namely $E^F_N = \operatorname{Span}(e^F_n)_{n\in[N]}$ and $T(x) = \operatorname{Span}(k(x_j,\cdot))_{j\in[N]}$. DPPs allow closed-form expressions for the expectation of trigonometric functions of such angles; see [4], and Appendix E.1 for the geometric intuition behind the proof. The second step thus consists in developing the expectation of the bound under the DPP.

4.2.1 Bounding the approximation error using principal angles

Let $x = (x_1,\dots,x_N)\in X^N$ be such that $\operatorname{Det} E(x) \ne 0$. By Proposition 2, K(x) is non-singular and $\dim T(x) = N$. The optimal approximation error writes

$$\|\mu_g - \Phi\hat w\|_F^2 = \|\mu_g - \Pi_{T(x)}\mu_g\|_F^2, \qquad (16)$$

where $\Pi_{T(x)} = \Phi(\Phi^*\Phi)^{-1}\Phi^*$ is the orthogonal projection onto T(x), with $\Phi^*$ the adjoint of Φ (for $\mu\in F$, $\Phi^*\mu = (\mu(x_j))_{j\in[N]}$; $\Phi^*\Phi$ is an operator from $\mathbb{R}^N$ to $\mathbb{R}^N$ that can be identified with K(x)). In other words, (16) equates the approximation error to $\|\Pi_{T(x)^\perp}\mu_g\|_F^2$, where $\Pi_{T(x)^\perp}$ is the orthogonal projection onto $T(x)^\perp$. Now we have the following lemma.

Lemma 1. Assume that $\|g\|_{\mathrm{d}\omega}\le 1$. Then $\|\Sigma^{-1/2}\mu_g\|_F \le 1$ and

$$\|\Pi_{T(x)^\perp}\mu_g\|_F^2 \;\le\; 2\Big(\sigma_{N+1} + \|g\|_{\mathrm{d}\omega,1}^2\, \max_{n\in[N]}\, \sigma_n\, \|\Pi_{T(x)^\perp}\, e^F_n\|_F^2\Big). \qquad (17)$$

Now, to upper bound the right-hand side of (17), we note that $\sigma_n\, \|\Pi_{T(x)^\perp} e^F_n\|_F^2$ is the product of two terms: $\sigma_n$ is a decreasing function of n, while $\|\Pi_{T(x)^\perp} e^F_n\|_F^2$ is the interpolation error of the eigenfunction $e^F_n$, measured in the $\|\cdot\|_F$ norm. We can bound the latter interpolation error uniformly in $n\in[N]$ using the geometric notion of maximal principal angle between T(x) and $E^F_N = \operatorname{Span}(e^F_n)_{n\in[N]}$. This maximal principal angle is defined through its cosine

$$\cos^2\theta_N\big(T(x), E^F_N\big) \;=\; \inf_{\substack{u\in T(x)\\ \|u\|_F=1}}\; \sup_{\substack{v\in E^F_N\\ \|v\|_F=1}}\; \langle u, v\rangle_F^2. \qquad (18)$$
Similarly, we can define the N principal angles $\theta_n(T(x), E^F_N)\in[0, \pi/2]$, $n\in[N]$, between the subspaces $E^F_N$ and T(x). These angles quantify the relative position of the two subspaces; see Appendix C.3 for more details about principal angles. Now, we have the following lemma.

Lemma 2. Let $x = (x_1,\dots,x_N)\in X^N$ be such that $\operatorname{Det} E(x)\ne 0$. Then

$$\max_{n\in[N]}\, \|\Pi_{T(x)^\perp}\, e^F_n\|_F^2 \;\le\; \frac{1}{\cos^2\theta_N(T(x), E^F_N)} - 1 \;\le\; \prod_{n\in[N]} \frac{1}{\cos^2\theta_n(T(x), E^F_N)} - 1. \qquad (19)$$

To sum up, we have so far bounded the approximation error by the geometric quantity in the right-hand side of (19). Where projection DPPs shine is in taking expectations of such geometric quantities.

4.2.2 Taking the expectation under the DPP

The analysis in Section 4.2.1 is valid whenever $\operatorname{Det} E(x)\ne 0$. As seen in Corollary 1, this condition is satisfied almost surely when x is drawn from the projection DPP of Theorem 1. Furthermore, the expectation of the right-hand side of (19) can be written in terms of the eigenvalues of the kernel k.

Proposition 3. Let x be a projection DPP with reference measure dω and kernel (10). Then

$$\mathbb{E}_{\mathrm{DPP}}\, \prod_{n\in[N]} \frac{1}{\cos^2\theta_n(T(x), E^F_N)} \;\le\; 1 + \frac{N\, r_N}{\sigma_N}. \qquad (20)$$

The bound of Proposition 3, once reported in Lemma 2 and Lemma 1, already yields Theorem 1 in the special case where $\sigma_1 = \dots = \sigma_N$. This seems a very restrictive condition, but Proposition 4 below shows that we can always reduce the analysis to that case. In fact, let the kernel $\tilde k$ be defined by

$$\tilde k(x, y) = \sum_{n\in[N]} \sigma_1\, e_n(x)\, e_n(y) + \sum_{n\ge N+1} \sigma_n\, e_n(x)\, e_n(y) = \sum_{n\in\mathbb{N}^*} \tilde\sigma_n\, e_n(x)\, e_n(y), \qquad (21)$$

where $\tilde\sigma_n = \sigma_1$ for $n\in[N]$ and $\tilde\sigma_n = \sigma_n$ otherwise, and let $\tilde F$ be the corresponding RKHS. Then one has the following inequality.

Proposition 4. Let $\tilde T(x) = \operatorname{Span}\big(\tilde k(x_j,\cdot)\big)_{j\in[N]}$ and $\Pi_{\tilde T(x)}$ the orthogonal projection onto $\tilde T(x)$ in $(\tilde F, \langle\cdot,\cdot\rangle_{\tilde F})$. Then

$$\forall n\in[N], \qquad \sigma_n\, \|\Pi_{T(x)^\perp}\, e^F_n\|_F^2 \;\le\; \sigma_1\, \|\Pi_{\tilde T(x)^\perp}\, e^{\tilde F}_n\|_{\tilde F}^2. \qquad (22)$$

Simply put, capping the first eigenvalues of k yields a new kernel $\tilde k$ that captures the interaction between the terms $\sigma_n$ and $\|\Pi_{T(x)^\perp} e^F_n\|_F^2$, so that we only have to deal with the term $\|\Pi_{\tilde T(x)^\perp} e^{\tilde F}_n\|_{\tilde F}^2$. Combining Proposition 3 with Proposition 4 applied to the kernel $\tilde k$ yields Theorem 1.

4.3 Discussion

We have arbitrarily introduced a product in the right-hand side of (19), which is a rather loose majorization. Our motivation is that the expected value of this symmetric quantity is tractable under the DPP. Getting rid of the product could make the bound much tighter. Intuitively, taking the upper bound in (20) to the power 1/N results in a term in $O(r_N)$ for the RKHS F. Improving the bound in (20) would require a de-symmetrization, comparing the maximum of the $1/\cos^2\theta_\ell(T(x), E^F_N)$ to their geometric mean. An easier route than de-symmetrization could be to replace the product in (19) by a sum, but this is beyond the scope of this article.

In comparison with [1], we emphasize that the dependence of our bound on the eigenvalues of the kernel k, via $r_N$, is explicit. This is in contrast with Proposition 1, which depends on the eigenvalues of Σ through the degrees of freedom $d_\lambda$, so that the necessary number of samples N diverges as λ → 0. On the contrary, our quadrature requires a finite number of points for λ = 0. It would be interesting to extend the analysis of our quadrature to the regime λ > 0.

5 Numerical simulations

5.1 The periodic Sobolev space and the Korobov space

Let dω be the uniform measure on X = [0, 1], and let the RKHS kernel be [5]

$$k_s(x, y) = 1 + \sum_{m\ge 1} \frac{1}{m^{2s}}\, \cos\big(2\pi m (x - y)\big),$$

so that $F = F_s$ is the Sobolev space of order s on [0, 1]. Note that $k_s$ can be expressed in closed form using Bernoulli polynomials [46]; a sketch for s = 1 is given below. We take g ≡ 1 in (1), so that the mean element $\mu_g \equiv 1$.
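The closed form rests on the Fourier expansion of Bernoulli polynomials; for s = 1 and the normalization of the display above, $\sum_{m\ge 1} \cos(2\pi m t)/m^2 = \pi^2(t^2 - t + 1/6)$ for $t\in[0,1]$. The following minimal check, our own, compares the closed form with the truncated series.

```python
# A minimal numerical check of the closed form of k_s for s = 1, under the
# normalization of the display above: k_1(x, y) = 1 + pi^2 * B_2({x - y}),
# where B_2(t) = t^2 - t + 1/6 is the second Bernoulli polynomial.
import numpy as np

def k1_closed_form(x, y):
    t = np.mod(x - y, 1.0)                          # fractional part {x - y}
    return 1.0 + np.pi ** 2 * (t ** 2 - t + 1.0 / 6.0)

def k1_series(x, y, m_max=200_000):
    m = np.arange(1, m_max + 1)
    return 1.0 + np.sum(np.cos(2 * np.pi * m * (x - y)) / m ** 2)

x, y = 0.3, 0.85
print(k1_closed_form(x, y), k1_series(x, y))        # the two values should agree
```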
We compare the following algorithms: (i) the quadrature rule DPPKQ that we propose in Theorem 1; (ii) the quadrature rule DPPUQ, based on the same projection DPP but with uniform weights, implicitly studied in [22]; (iii) the kernel quadrature rule (5) of [1], which we denote LVSQ for leverage-score quadrature, with regularization parameter λ ∈ {0, 0.1, 0.2} (note that the optimal proposal is $q^*_\lambda \equiv 1$ here); (iv) herding with uniform weights [2, 9]; (v) sequential Bayesian quadrature (SBQ) [19], with regularization to avoid numerical instability; and (vi) Bayesian quadrature on the uniform grid (UGBQ). We take N ∈ [5, 50].

Figures 1a and 1b show log-log plots of the worst-case quadrature error versus N, averaged over 50 samples for each point, for s ∈ {1, 3}. We observe that the approximation errors of the first four quadratures all converge to 0, with different rates. Both UGBQ and DPPKQ converge to 0 at a rate of $O(N^{-2s})$, which indicates that our $O(N^{2-2s})$ bound in Theorem 1 is not tight in the Sobolev case. Meanwhile, the rate of DPPUQ is $O(N^{-2})$ for both values of s: it does not adapt to the regularity of the integrands. This corresponds to the CLT proven in [22]. LVSQ without regularization converges to 0 slightly slower than $O(N^{-2s})$. Increasing λ further slows down convergence. Herding converges at an empirical rate of $O(N^{-2})$, which is faster than the $O(N^{-1})$ rate predicted by the theoretical analysis in [2, 9]. SBQ is the only one that seems to plateau for s = 3, although it consistently has the best performance for low N. Overall, in the Sobolev case, DPPKQ and UGBQ have the best convergence rate; UGBQ, known to be optimal in this case [6], has a better constant.

Now, for a multidimensional example, consider the "Korobov" kernel $k_{s,d}$ defined on $[0,1]^d$ by

$$k_{s,d}(x, y) = \prod_{i\in[d]} k_s(x_i, y_i), \qquad x, y\in[0,1]^d. \qquad (23)$$

We still take g ≡ 1 in (1), so that $\mu_g\equiv 1$. We compare (i) our DPPKQ; (ii) LVSQ without regularization (λ = 0); (iii) the kernel quadrature UGBQ based on the uniform grid; (iv) the kernel quadrature SGBQ based on the sparse grid from [42]; and (v) the kernel quadrature HaltonBQ based on the Halton sequence [15]. We take N ∈ [5, 1000] and s = 1. The results are shown in Figure 1c. This time, UGBQ suffers from the dimension, with a rate in $O(N^{-2s/d})$, while DPPKQ, HaltonBQ and LVSQ (λ = 0) all perform similarly well. They scale as $O((\log N)^{2s(d-1)} N^{-2s})$, which is a tight upper bound on $\sigma_{N+1}$; see [1] and Appendix B. SGBQ seems to lag slightly behind, with a rate of $O((\log N)^{2(s+1)(d-1)} N^{-2s})$ [17, 42].

5.2 The Gaussian kernel

We now consider dω to be the Gaussian measure on $X = \mathbb{R}$, along with the RKHS kernel $k_\gamma(x, y) = \exp\big(-(x-y)^2/(2\gamma^2)\big)$, and again g ≡ 1. Figure 1d compares the empirical performance of DPPKQ to the theoretical bound of Theorem 1, herding, crude Monte Carlo with i.i.d. sampling from dω, and sequential Bayesian quadrature, where we again average over 50 samples. We take N ∈ [5, 50].

[Figure 1: Squared error vs. number of nodes N for various kernels. Panels: (a) Sobolev space, d = 1, s = 1; (b) Sobolev space, d = 1, s = 3; (c) Korobov space, d = 2, s = 1; (d) Gaussian kernel, d = 1. Each panel plots the squared error against N for DPPKQ, DPPUQ, herding, SBQ, LVSQ, UGBQ, SGBQ, HaltonBQ, Monte Carlo and the theoretical bound, as relevant.]
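For reference, the following minimal sketch, ours, evaluates the quantity $N r_N$ that drives the bound of Theorem 1 for the two eigenvalue profiles discussed above, $\sigma_m = m^{-2s}$ (Sobolev) and $\sigma_m = \beta\alpha^m$ (Gaussian kernel); constants in (14) are ignored and the values of α, β, s are illustrative.

```python
# N * r_N, the quantity driving the bound of Theorem 1, for the two eigenvalue
# profiles discussed in the text: sigma_m = m^{-2s} and sigma_m = beta * alpha^m.
# Constants in (14) are ignored; alpha, beta, s are illustrative values.
import numpy as np

def N_rN_sobolev(N, s, m_max=10**6):
    m = np.arange(N + 1, m_max + 1)
    return N * np.sum(m ** (-2.0 * s))                  # truncated tail sum

def N_rN_gaussian(N, alpha=0.5, beta=1.0):
    return N * beta * alpha ** (N + 1) / (1 - alpha)    # geometric tail, closed form

for N in [5, 10, 20, 40]:
    print(f"N = {N:2d}   Sobolev (s = 3): {N_rN_sobolev(N, 3):.2e}   "
          f"Gaussian: {N_rN_gaussian(N):.2e}")
```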
Note that, this time, only the y-axis is on a log scale, for better display, and that LVSQ is not plotted since we do not know how to sample from $q^*_\lambda$ in (6) in this case. We observe that the approximation error of DPPKQ converges to 0 as $O(\alpha^N)$, while the discussion below Theorem 1 led us to expect a slightly slower $O(N\alpha^N)$. Herding improves slightly upon Monte Carlo, which converges as $O(N^{-1})$. As in the Sobolev case, the convergence of sequential Bayesian quadrature plateaus, even though it has the smallest error for small N. We also conclude that DPPKQ is a close runner-up to SBQ, and definitely takes the lead for large enough N.

6 Conclusion

In this article, we proposed a quadrature rule for functions living in an RKHS. The nodes are drawn from a DPP tailored to the RKHS kernel, while the weights are the solution to a tractable, non-regularized optimization problem. We proved that the expected value of the squared worst-case error is bounded by a quantity that depends on the eigenvalues of the integral operator associated to the RKHS kernel, thus preserving the natural feel and the generality of the bounds for kernel quadrature [1]. Key intermediate quantities further have clear geometric interpretations in the ambient RKHS. Experimental comparisons suggest that DPP quadrature compares favourably with existing kernel-based quadratures. In specific cases where an optimal quadrature is known, such as the uniform grid for 1D periodic Sobolev spaces, DPPKQ seems to achieve the optimal convergence rate. However, our generic error bound does not reflect this optimality in the Sobolev case, and must thus be sharpened. We have discussed room for improvement in our proofs. Further work should also address exact sampling algorithms, which do not yet exist when the spectral decomposition of the integral operator is not known. Approximate algorithms would also suffice, as long as the error bound is preserved.

Acknowledgments

We acknowledge support from ANR grant BoB (ANR-16-CE23-0003) and région Hauts-de-France. We also thank Adrien Hardy and the reviewers for their detailed and insightful comments.

References

[1] F. Bach. On the equivalence between kernel quadrature rules and random feature expansions. The Journal of Machine Learning Research, 18(1):714–751, 2017.
[2] F. Bach, S. Lacoste-Julien, and G. Obozinski. On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Conference on Machine Learning, ICML'12, pages 1355–1362, 2012.
[3] R. Bardenet and A. Hardy. Monte Carlo with determinantal point processes. arXiv:1605.00361, 2016.
[4] A. Belhadji, R. Bardenet, and P. Chainais. A determinantal point process for column subset selection. arXiv:1812.09771, 2018.
[5] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, 2011.
[6] B. Bojanov. Uniqueness of the optimal nodes of quadrature formulae. Mathematics of Computation, 36(154):525–546, 1981.
[7] F. Briol, C. Oates, M. Girolami, and M. Osborne. Frank-Wolfe Bayesian quadrature: Probabilistic integration with theoretical guarantees. In Advances in Neural Information Processing Systems, pages 1162–1170, 2015.
[8] F. Briol, C. Oates, M. Girolami, M. Osborne, and D. Sejdinovic. Probabilistic integration: A role in statistical computation? Statistical Science, 34(1):1–22, 2019.
[9] Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding.
In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI'10, pages 109–116. AUAI Press, 2010.
[10] N. Cristianini and J. Shawe-Taylor. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[11] P. J. Davis and P. Rabinowitz. Methods of Numerical Integration. Courier Corporation, 2nd edition, 2007.
[12] J. Dick and F. Pillichshammer. Digital Nets and Sequences: Discrepancy Theory and Quasi-Monte Carlo Integration. Cambridge University Press, 2010.
[13] J. Dick and F. Pillichshammer. Discrepancy theory and quasi-Monte Carlo integration. In A Panorama of Discrepancy Theory, pages 539–619. Springer, 2014.
[14] I. Dumitriu and A. Edelman. Matrix models for beta ensembles. Journal of Mathematical Physics, 43(11):5830–5847, 2002.
[15] J. Halton. Algorithm 247: Radical-inverse quasi-random point sequence. Communications of the ACM, 7(12):701–702, 1964.
[16] A. Hinrichs and J. Oettershagen. Optimal point sets for quasi-Monte Carlo integration of bivariate periodic functions with bounded mixed derivatives. In Monte Carlo and Quasi-Monte Carlo Methods, pages 385–405. Springer, 2016.
[17] M. Holtz. Sparse Grid Quadrature in High Dimensions with Applications in Finance and Insurance. PhD thesis, University of Bonn, 2008.
[18] J. B. Hough, M. Krishnapur, Y. Peres, and B. Virág. Determinantal processes and independence. Probability Surveys, 2006.
[19] F. Huszár and D. Duvenaud. Optimally-weighted herding is Bayesian quadrature. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, UAI'12, pages 377–386. AUAI Press, 2012.
[20] R. Jagadeeswaran and F. Hickernell. Fast automatic Bayesian cubature using lattice sampling. arXiv:1809.09803, 2018.
[21] K. Johansson. Random matrices and determinantal processes. In Mathematical Statistical Physics, Session LXXXIII: Lecture Notes of the Les Houches Summer School 2005, pages 1–56.
[22] K. Johansson. On random matrices from the compact classical groups. Annals of Mathematics, pages 519–545, 1997.
[23] T. Karvonen and S. Särkkä. Classical quadrature rules via Gaussian processes. In 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE, 2017.
[24] T. Karvonen, S. Särkkä, and C. Oates. Symmetry exploits for Bayesian cubature methods. arXiv:1809.10227, 2018.
[25] R. Killip and I. Nenciu. Matrix models for circular ensembles. International Mathematics Research Notices, 2004(50):2665, 2004.
[26] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2-3):123–286, 2012.
[27] F. Larkin. Gaussian measure in Hilbert space and applications in numerical analysis. The Rocky Mountain Journal of Mathematics, pages 379–421, 1972.
[28] Q. Liu and J. D. Lee. Black-box importance sampling. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
[29] O. Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7:83–122, 1975.
[30] T. Minka. Deriving quadrature rules from Gaussian processes. Technical report, Statistics Department, Carnegie Mellon University, 2000.
[31] K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.
[32] K. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[33] C. J. Oates, M. Girolami, and N. Chopin.
Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695–718, 2017.
[34] J. Oettershagen. Construction of Optimal Cubature Algorithms with Applications to Econometrics and Uncertainty Quantification. PhD thesis, University of Bonn, 2017.
[35] A. O'Hagan. Bayes-Hermite quadrature. Journal of Statistical Planning and Inference, 29(3):245–260, 1991.
[36] C. Rasmussen and Z. Ghahramani. Bayesian Monte Carlo. In Advances in Neural Information Processing Systems 15, pages 489–496. MIT Press, Cambridge, MA, USA, 2003.
[37] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, USA, 2006.
[38] C. P. Robert. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer Science & Business Media, 2007.
[39] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2004.
[40] B. Simon. Trace Ideals and Their Applications. American Mathematical Society, 2005.
[41] A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, pages 13–31. Springer, 2007.
[42] S. Smolyak. Quadrature and interpolation formulas for tensor products of certain classes of functions. In Doklady Akademii Nauk, volume 148, pages 1042–1045. Russian Academy of Sciences, 1963.
[43] A. Soshnikov. Determinantal random point fields. Russian Mathematical Surveys, 55:923–975, 2000.
[44] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 1st edition, 2008.
[45] I. Steinwart and C. Scovel. Mercer's theorem on general domains: on the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35(3):363–417, 2012.
[46] G. Wahba. Spline Models for Observational Data, volume 59. SIAM, 1990.
[47] H. Wendland. Scattered Data Approximation, volume 17. Cambridge University Press, 2004.