Journal of Machine Learning Research 24 (2023) 1-57. Submitted 7/21; Revised 9/22; Published 1/23.

Kernel-Matrix Determinant Estimates from Stopped Cholesky Decomposition

Simon Bartels, bartels@di.ku.dk, University of Copenhagen, Universitetsparken 1, 2100 København, Denmark
Wouter Boomsma, wb@di.ku.dk, University of Copenhagen, Universitetsparken 1, 2100 København, Denmark
Jes Frellsen, jefr@dtu.dk, Technical University of Denmark, Richard Petersens Plads, 2800 Kgs. Lyngby, Denmark
Damien Garreau, damien.garreau@unice.fr, Université Côte d'Azur, Inria, CNRS, LJAD, Parc Valrose, 06108 Nice Cedex 2, France

Editor: Isabelle Guyon

©2023 Simon Bartels, Wouter Boomsma, Jes Frellsen and Damien Garreau. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-0781.html.

Abstract

Algorithms involving Gaussian processes or determinantal point processes typically require computing the determinant of a kernel matrix. Frequently, the latter is computed from the Cholesky decomposition, an algorithm of cubic complexity in the size of the matrix. We show that, under mild assumptions, it is possible to estimate the determinant from only a sub-matrix, with a probabilistic guarantee on the relative error. We present an augmentation of the Cholesky decomposition that stops under certain conditions before processing the whole matrix. Experiments demonstrate that this can save a considerable amount of time while rarely exceeding an overhead of more than 5% when not stopping early. More generally, we present a probabilistic stopping strategy for the approximation of a sum of known length where addends are revealed sequentially. We do not assume independence between addends, only that they are bounded from below and decrease in conditional expectation.

Keywords: Gaussian Processes, Optimal Stopping, Kernel Methods, Kriging

1. Introduction

Gaussian processes are a popular probabilistic model in the machine learning community, and a core element of many other methods such as Bayesian optimization (Močkus, 1975), Bayesian quadrature (Diaconis, 1988), probabilistic numerics (Hennig et al., 2015) or the Automatic Statistician (Steinruecken et al., 2019).

[Figure 1: log-determinant versus number of processed datapoints; curves: upper bound, lower bound, estimate (middle), 10% relative error, exact.]

Figure 1: Estimating the log-determinant of a kernel matrix (Section 3). While sequentially processing the dataset, our approach is to monitor upper and lower bounds on the log-determinant, and to stop the computation when they are sufficiently close. In this example, less than half of the dataset is sufficient for the estimate to be within one order of magnitude (10% relative error) of the exact solution. The main strength of our approach is that unprocessed datapoints remain untouched, so we achieve sub-linear complexity. This speed gain is obtained in exchange for a certain, controllable failure probability of the bounds, assuming that datapoints are i.i.d.

Typically, inference with a Gaussian process requires the computation of a Cholesky decomposition of a kernel matrix. For most datasets, this is computationally feasible despite the cubic worst-case complexity of the Cholesky decomposition in the number of samples. Nevertheless, when this computation has to be performed often, e.g., to optimize kernel parameters, the computational cost of this decomposition becomes paramount. When a kernel's parameters do not fit well with the data, our observation is that the log-determinant of the kernel matrix can often be predicted from a subset.
This situation frequently occurs, in particular, at the beginning of the kernel-parameter optimization process. In the following, we will demonstrate that it is possible (i) to recognize this situation while computing the Cholesky decomposition, and (ii) to stop the computation prematurely, which can save a considerable amount of time. When we are not in a situation that justifies stopping the computation early, we propose to simply continue the computation of the log-determinant until the end. Thus, the additional computational cost of our method is just that of keeping track of some simple numerical indicators. The main benefit of our method is that it provides an almost-free lunch, since the overhead when not stopping early is relatively small (on average less than five percent). To make this idea practical, we modified the OpenBLAS (Wang et al., 2013) implementation and made our code¹ available.

More generally, we will see that our optional stopping strategy can be used to estimate a sum of random variables that are decreasing in expectation. In this general setting, we prove that our stopped Cholesky decomposition returns an estimate of a desired relative error r with respect to the full computation, with probability 1 − δ, where δ is a user-defined probability threshold. Under the assumption that the dataset inputs are independent and identically distributed, the probability threshold δ refers to the distribution of the inputs. For a given level of accuracy that is satisfactory for the problem at hand, the user can then pick a level of confidence in the result and obtain a gain in computational cost with provable guarantees. The level of confidence is the only parameter of our method.

1. https://github.com/SimonBartels/pac_kernel_matrix_determinant_estimation

2. Problem Setup, Related Work and Background

2.1 Problem setup

Given σ² ∈ ℝ₊, a set of inputs x1, . . .
, x_N ∈ X and a kernel function k : X × X → ℝ, we define the kernel matrix A := K_N + σ²I_N, where

\[
K_N := \begin{pmatrix}
k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_N) \\
k(x_2, x_1) & k(x_2, x_2) & & \vdots \\
\vdots & & \ddots & \vdots \\
k(x_N, x_1) & \cdots & \cdots & k(x_N, x_N)
\end{pmatrix}.
\]

The main focus of this article is the efficient computation of log det(A), which is typically achieved via Cholesky decomposition of A, if N is not too large. That is, find the unique lower-triangular matrix L ∈ ℝ^{N×N} satisfying LLᵀ = A. Given the Cholesky decomposition of A, one subsequently computes the log-determinant using the formula

\[
\log\det(A) = 2 \sum_{n=1}^{N} \log L_{nn}.
\]

2.2 Related work

Approximation methods for the log-determinant have been studied extensively, often for the more general case of symmetric and positive definite matrices (Skilling, 1989; Seeger, 2000; Dorn and Enßlin, 2015; Ubaru et al., 2017; Fitzsimons et al., 2017a,b; Saibaba et al., 2017; Boutsidis et al., 2017; Dong et al., 2017; Gardner et al., 2018). All of the aforementioned methods are conceptually similar in that they rely on stochastic trace estimators: the kernel matrix is multiplied with random (probe) vectors and the inner products of the results are used to construct an estimate for the log-determinant. The theoretical performance analysis of these methods often requires knowledge of, or an upper bound on, expensive-to-compute quantities such as the largest eigenvalue, the condition number or eigenvalue gaps of A (Ubaru et al., 2017; Boutsidis et al., 2017; Saibaba et al., 2017; Gardner et al., 2018). An advantage of our approach is that we only require knowledge of the largest diagonal entry of A and a lower bound on the smallest eigenvalue, which is given by σ². Most related to our work are Ubaru et al. (2017); Boutsidis et al. (2017); Gardner et al. (2018), in the sense that, for a desired relative error and confidence, they prove how to set the parameters of their algorithms accordingly. However, a noteworthy distinction to our work is the choice of the probability measure which the desired confidence refers to.
In our case, this probability measure is the law of the inputs x_i. For the stochastic trace estimators, the confidence refers to the source of randomness of the probe vectors. For the problems we consider in our experiments in Section 5, none of the theorems by Boutsidis et al. (2017); Ubaru et al. (2017); Gardner et al. (2018) that guarantee relative error are applicable. Lemma 8 by Boutsidis et al. (2017) assumes that all eigenvalues are bounded from above by 1. This assumption can be established by dividing A by trace[A], but this would no longer provide a relative approximation error guarantee on log det(A). Theorem 4.1 by Ubaru et al. (2017) is not applicable, since the logs of the eigenvalues of the kernel matrix can be of different sign. Theorem 2 by Gardner et al. (2018) is a consequence of Theorem 4.1 by Ubaru et al. (2017) and therefore also not applicable. Gardner et al. (2018) recommend certain default parameter values, though we observed experimentally that this configuration yields estimates whose relative errors are more often than not worse than 0.1 and may vary over two orders of magnitude (see Fig. 12 in Appendix E). We therefore did not compare our approach to their method. To nevertheless allow the reader to assess the difficulty of the numerical problems considered in Section 5, we compare our method to the pivoted Cholesky decomposition of Harbrecht et al. (2012) (see Section 5.3).

Most related to our Theorem 2 is the work by Mnih et al. (2008) and references therein. They propose an algorithm called EBStop that returns an estimate of the mean of a sum of i.i.d. random variables. Theorem 2 is more general and assumes only a (non-strict) decrease in conditional expectation. Their approach is in a sense more sophisticated, as they also monitor the empirical variance of the addends, which is also an interesting future direction for our approach.
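For concreteness, the exact computation from Section 2.1 — form the kernel matrix, take its Cholesky factor, and sum the log-diagonal — can be sketched as follows. This is only an illustrative sketch: the RBF kernel, its parameters, and the synthetic inputs are our own choices for the demo, not taken from the paper's experiments.

```python
import numpy as np

def rbf_kernel(X, theta=1.0, ell=1.0):
    # k(x, z) = theta * exp(-||x - z||^2 / (2 * ell^2)); any PSD kernel works here.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return theta * np.exp(-sq_dists / (2.0 * ell**2))

def log_det_kernel_matrix(X, sigma2=1e-3):
    # A = K_N + sigma^2 I_N is symmetric positive definite, so the Cholesky
    # factor L with L L^T = A exists and is unique.
    A = rbf_kernel(X) + sigma2 * np.eye(len(X))
    L = np.linalg.cholesky(A)               # O(N^3)
    return 2.0 * np.log(np.diag(L)).sum()   # log det(A) = 2 * sum_n log L_nn
```

This is the baseline whose cubic cost the stopped decomposition of Section 3 avoids whenever early stopping triggers.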
2.3 Cholesky decomposition

In the following, we will focus on an implementation of the Cholesky decomposition that proceeds row-wise over the elements of the matrix, Algorithm 1. As opposed to a column-wise or submatrix implementation, the number of floating-point operations increases with each iteration of the outer loop (George et al., 1986). Hence, this version can benefit the most from early stopping. Algorithm 1 is useful to express and motivate our idea. Exploiting blocking and parallel computation resources requires some modifications, which we describe in Appendix A. Note that computing L_jj requires access only to the first j datapoints x_1, . . . , x_j.

3. Stopped Cholesky Decomposition

This section is a high-level description of our algorithm. The formal proof of our claims is deferred to Section 4 and the supplementary material. The main idea of the algorithm is the following: each time a new diagonal element of the Cholesky decomposition is computed, we compute an upper bound and a lower bound on log det(A). If the two bounds are sufficiently close to each other and sufficiently far away from zero, a certain relative error can be guaranteed. We first introduce the bounds used by our algorithm, and then define more precisely what we mean by close. Denote by n the number of diagonal elements that have been computed so far. Define

\[
D_n := 2 \sum_{j=1}^{n} \log L_{jj},
\]

which is the log-determinant of the principal sub-matrix of A up to index n, and observe that log det(A) = D_N = D_n + 2 Σ_{j=n+1}^{N} log L_jj for any n. We will use

Algorithm 1: Augmented row-wise Cholesky decomposition with optional stopping. Highlighted are our modifications to the original algorithm. See Algorithm 3 on page 19 for a practical implementation.
Input A: matrix for which to compute the log-determinant
Input N: matrix size
Input r: desired relative error
Input δ: desired confidence
Input σ²: lower bound on min_j A_jj
Input C₊: upper bound on log(max_j A_jj)
1: D ← 0, c_δ ← (C₊ − log σ²) H_N⁻¹(δ/2)    ▷ constant described in Section 3
2: for j = 1, . . . , N do
3:   for i = 1, . . . , j − 1 do
4:     for k = 1, . . . , i − 1 do
5:       A_ji ← A_ji − A_jk A_ik
6:     end for
7:     A_ji ← A_ji / A_ii    ▷ now A_ji = L_ji
8:   end for
9:   for k = 1, . . . , j − 1 do
10:     A_jj ← A_jj − A_jk A_jk
11:   end for
12:   A_jj ← √(A_jj)    ▷ now A_jj = L_jj
13:   D ← D + 2 log(A_jj)    ▷ track log-determinant of the submatrix up to index j
14:   D̂ ← EvaluateConditionsAndEstimator(r, N, j, D, σ², C₊, c_δ)
15:   if D̂ ≠ 0 then
16:     return D̂    ▷ bounds are close enough, return estimate for log-determinant
17:   end if
18: end for
19: return D    ▷ now the lower-triangular part of A contains L

in the following that, for kernel matrices, one can write (Lemma 5)

\[
L_{nn}^2 = k(x_n, x_n) + \sigma^2 - k_n^\top \left( K_{n-1} + \sigma^2 I_{n-1} \right)^{-1} k_n, \tag{1}
\]

where k_n := [k(x_n, x_1), . . . , k(x_n, x_{n−1})]ᵀ ∈ ℝ^{n−1}; that is, the squared diagonal elements of the Cholesky correspond to the posterior variance of a Gaussian process given observations disturbed by Gaussian noise (see Rasmussen and Williams (2006, p. 16)).

Our lower bound L_n is deterministic. It is simply the sum of the logs of the elements computed so far plus a linear extrapolation in σ². That is,

\[
L_n := D_n + (N - n) \log \sigma^2,
\]

which uses that Eq. (1) is bounded from below by σ². On the other hand, the upper bound is probabilistic. We show in Section 4 how we can achieve control of the failure probability. The key observation is that the diagonal elements of the Cholesky decrease in (conditional) expectation, under the assumption that x_1, . . . , x_N are independent and identically distributed. (This assumption is not always fulfilled, e.g., when the inputs are sorted. However, in practice, the assumption can be considered established after a random shuffle of the dataset.)
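The identity in Eq. (1) is easy to check numerically against a direct Cholesky factorization: the square of the n-th diagonal element equals the Schur complement of the leading block of A, which is the GP posterior variance at x_n. The sketch below uses an illustrative RBF kernel and synthetic inputs (both are assumptions for the demo; Lemma 5 holds for any kernel matrix of the stated form).

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma2 = 40, 1e-3
X = rng.standard_normal((N, 3))

# Illustrative RBF kernel with theta = 1 and lengthscale 1.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)
A = K + sigma2 * np.eye(N)
L = np.linalg.cholesky(A)

n = 25  # 0-based row index; corresponds to index n = 26 in the paper's notation
k_n = K[n, :n]  # [k(x_n, x_1), ..., k(x_n, x_{n-1})]
posterior_var = K[n, n] + sigma2 - k_n @ np.linalg.solve(K[:n, :n] + sigma2 * np.eye(n), k_n)
assert np.isclose(L[n, n] ** 2, posterior_var)  # Eq. (1)
```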
With increasing n, this variance can only decrease (see Rasmussen and Williams (2006, Question 2.9.4) or the proof of Theorem 7). Thus, the mean of the L_jj computed so far is likely to be an overestimate of the expected value of L_{n+1,n+1}. Therefore, we use as an upper bound the sum of the elements computed so far, plus a linear extrapolation of their mean:

\[
\tilde{U}_n := D_n + (N - n)\,\frac{D_n + c_\delta}{n} + c_\delta,
\]

where c_δ := (log(σ² + max_{j∈{1,...,N}} k(x_j, x_j)) − log σ²) H_N⁻¹(δ/2), and H_N(δ) is defined later in Eq. (9). We defer the motivation of c_δ to Section 4. The constant H_N⁻¹(δ) can be found with negligible overhead, for example using the SciPy call minimize_scalar(F, bounds=[0, 1], method='bounded') where F(x) := (H_N(x) − δ)². A deterministic upper bound on log det(A) is

\[
\bar{U}_n := D_n + (N - n) \log\left( \sigma^2 + \max_{j \in \{1,\dots,N\}} k(x_j, x_j) \right),
\]

which is a consequence of Lemma 6. To make sure that our bound is never worse than this deterministic bound, we set

\[
U_n := \min(\tilde{U}_n, \bar{U}_n). \tag{2}
\]

Fig. 1 shows a visualization of our bounds using an RBF kernel (Eq. (13) with θ = 1 and ℓ = exp(−1)) for ten random permutations of the TAMILNADU dataset (see Table 1). Now we are nearly ready to write our algorithm. The only missing piece is to decide whether U_n and L_n are close enough. Suppose we believe that log det(A) ∈ [L_n, U_n]; then D̂_n := ½(U_n + L_n) is a natural estimate. We will show (Lemma 15) that it is possible to guarantee the relative error

\[
\frac{\left| \log\det(A) - \hat{D}_n \right|}{\left| \log\det(A) \right|} \le r \tag{3}
\]

when log det(A) cannot be zero, and

\[
\frac{U_n - L_n}{2 \min(|U_n|, |L_n|)} \le r. \tag{4}
\]

To exclude log det(A) = 0, we check in addition that

\[
\operatorname{sign}(L_n) = \operatorname{sign}(U_n) \ne 0. \tag{5}
\]

Algorithm 2 describes the above elaborations in pseudo code. Algorithm 1 shows our modifications to a Cholesky decomposition algorithm, with new statements highlighted. Importantly, the computation of the bounds and the checks are inexpensive in comparison to an outer-loop iteration of the Cholesky decomposition. Figs. 2 and 3 show the left-hand side of Eq.
(4) for two examples. Note that when X is bounded and the kernel is differentiable, with a sufficient amount of data the upper bound gets arbitrarily close to the lower bound.

Algorithm 2: EvaluateConditionsAndEstimator. At a given step, this routine computes the lower and upper bounds, and proceeds to check if they are close enough.
Input r: desired relative error
Input N: matrix size
Input n: size of the processed subset
Input D_n: sum of the log diagonal elements up to index n
Input σ²: lower bound on min_j A_jj
Input C₊: upper bound on log(max_j A_jj)
Input c_δ = (C₊ − log σ²) H_N⁻¹(δ/2): constant described in Section 3
1: L_n ← D_n + (N − n) log σ²    ▷ lower bound
2: U_n ← D_n + min((N − n)(D_n + c_δ)/n + c_δ, (N − n)C₊)    ▷ upper bound
3: if sign(U_n) = sign(L_n) ≠ 0 and U_n − L_n < 2r min(|U_n|, |L_n|) then
4:   return ½(U_n + L_n)    ▷ bounds are close enough, return estimate
5: end if
6: return 0    ▷ bounds are not close enough

4. Theoretical Justification

We now turn to the theoretical analysis of our algorithm. Our main goal in this section is to explain how the expressions of the lower and upper bounds are obtained. Note that we consider, in fact, a more general problem: stopping the computation of a sum of random variables that decrease in expectation. To the best of our knowledge, this is the first result obtained in this setting, where the addends are not independent and identically distributed (the x_i are not the addends). Theorem 2 states that the stopping condition described in the following is a solution to this problem, and Theorem 4 states that Theorem 2 can be applied to estimate determinants of kernel matrices.

[Figure 2, left panel: relative error bound versus number of processed datapoints on TAMILNADU (D = 53), for the methods probabilistic, pivoting, diagonal precision, 5% overhead. Right panel: t/mean(t_default) versus mean time; mean time of the default Cholesky: 1850 s.]

Figure 2: An early stopping scenario. We compute the log-determinant of a kernel matrix using the RBF kernel (with θ = 1, ℓ = exp(−1) in Eq.
(13)) for ten random permutations of the TAMILNADU dataset (see Table 1), using our algorithm and the Cholesky decomposition with pivoting (see Section 5.3). The y-axis represents the relative error each algorithm can guarantee for (left panel) a number of (fully) processed datapoints or (right panel) a given computation time. Even for such a short length-scale ℓ and a desired relative error r = 0.1, our algorithm touches only half the dataset before stopping with a confidence of 90% (δ = 0.1) with respect to the data distribution. Left panel: the solid, green lines show the relative error our algorithm can guarantee. At the same time, these lines display our stopping condition Eq. (4). The variance between repetitions is so small that only one line is visible to the eye. The horizontal lines mark the mean relative error corresponding to an absolute approximation error on the diagonal elements (denoted with d, see Eq. (15)), which is the pivoted Cholesky's stopping criterion. The singularity in the beginning of the stopping condition stems from the denominator crossing 0, which demonstrates the necessity of the second stopping condition Eq. (5). The reason for the slope changes are switches from the deterministic bound Ū_n to Ũ_n and back in Eq. (2). Right panel: for each repetition, the fraction of both algorithms' CPU time over the mean time of the default Cholesky. Since S steps of the approximate Cholesky with pivoting cost O(NS²) operations, it stops earlier in the left panel, but our algorithm, scaling as O(S³), is faster in practice.

[Figure 3, left panel: relative error bound versus number of processed datapoints on BANK (D = 51), for the methods probabilistic, pivoting, diagonal precision, 5% overhead. Right panel: t/mean(t_default) versus mean time; mean time of the default Cholesky: 94 s.]

Figure 3: A disadvantageous scenario. We compute the log-determinant of a kernel matrix using the OU kernel (θ = 1, ℓ = exp(1) in Eq.
(14)) on the BANK dataset for ten random permutations. Left panel: same setup as in Fig. 2. On this dataset, even using a long lengthscale requires processing more than 90% of the data to achieve a relative error r of 0.1. Right panel: same setup as in Fig. 2. When our algorithm does not stop early, that is, when it returns the result of the default Cholesky, the overhead is on average less than 5%. The Cholesky with pivoting, on the other hand, may require more than 150% of the time of the default Cholesky. The extreme difference in absolute runtime between this figure and Fig. 2 is investigated in Section 5.5.

4.1 Notation

Since we are considering an optional stopping problem, we use the terminology of stochastic processes. This section is a quick reminder of the most important concepts; we refer to Grimmett and Stirzaker (2001) and Davidson (1994) for a more thorough introduction. For a monotonically increasing function f : ℝ → ℝ and δ ∈ ℝ, define f⁻¹(δ) := arg sup_{ε∈ℝ} {f(ε) ≤ δ}. A filtration is a sequence (F_j)_{j∈ℕ} of increasing σ-algebras, i.e., F_j ⊆ F_{j+1} for all j ∈ ℕ. For random variables X_1, . . . , X_N, we denote by σ(X_1, . . . , X_N) the σ-algebra generated by (X_1, . . . , X_N). A sequence of random variables (X_j)_{j∈ℕ} is called adapted to a filtration if X_j is F_j-measurable for all j ∈ ℕ. A random variable τ is called a stopping time (w.r.t. a filtration) if it takes values in ℕ and {τ = j} ∈ F_j for all j ∈ ℕ.

4.2 Problem Setting

Let (Ω, F, ℙ) be a probability space and (F_j)_{j∈{1,...,N}} be a filtration. Furthermore, let (f_j)_{j∈{1,...,N}} ⊂ [C₋, C₊] be a sequence of random variables such that, for j ∈ {1, . . . , N − 1}, f_j is F_j-measurable and the conditional expectation is decreasing, formally:

\[
\mathbb{E}[f_{j+1} \mid \mathcal{F}_j] \le \mathbb{E}[f_j \mid \mathcal{F}_{j-1}], \tag{$\star$}
\]

with F₀ := {∅, Ω}.
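The generic strategy analyzed in the remainder of this section can be sketched as a routine over sequentially revealed addends. In this sketch, c_δ is treated as a given input (in the paper it equals (C₊ − C₋)H_N⁻¹(δ/2), Eq. (9)); the kernel-matrix case is recovered with f_j = 2 log L_jj and C₋ = log σ².

```python
import numpy as np

def stopped_sum(addends, N, r, c_minus, c_plus, c_delta):
    """Estimate D_N = f_1 + ... + f_N from a prefix, following Eqs. (6)-(8).

    `addends` is an iterator yielding f_1, f_2, ...; each f_j is assumed to lie
    in [c_minus, c_plus] and to decrease in conditional expectation. c_delta is
    treated as given here (Eq. (9) defines it via H_N).
    Returns (estimate, number of addends consumed).
    """
    D = 0.0
    for n in range(1, N + 1):
        D += next(addends)
        if n == N:
            break  # the stopping time checks only indices n < N
        L = D + (N - n) * c_minus                              # Eq. (6), deterministic
        U = D + min(c_delta + (N - n) * (D + c_delta) / n,     # Eq. (7), probabilistic
                    (N - n) * c_plus)
        if np.sign(U) == np.sign(L) != 0 and U - L <= 2 * r * min(abs(U), abs(L)):
            return (U + L) / 2.0, n                            # Eq. (8)
    return D, N
```

As a sanity check: if every addend equals a constant c < 0 and c_δ = 0, the extrapolated upper and lower bounds coincide after the first addend, so the routine stops at n = 1 with the exact sum Nc.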
For this sequence, we want to estimate its sum D_N := Σ_{j=1}^{N} f_j. Given a desired upper bound on the relative error r ∈ (0, 1) and a probability of failure δ ∈ (0, 1), our goal is to devise a strategy that, being presented sequentially with f_1, f_2, . . ., decides in each step whether to continue or to stop, and, if stopping, provides an estimator D̂_τ such that its relative error is less than r with probability 1 − δ. Formally, the goal is to devise a stopping time τ and an estimator D̂_τ such that

\[
\mathbb{P}\left( \left| D_N - \hat{D}_\tau \right| > r \left| D_N \right| \right) \le \delta.
\]

Remark 1: A trivial solution is to define τ := N and D̂_τ := D_N, which simply consists in doing the whole computation.

4.3 Stopping Condition

We now define precisely the quantities introduced in Section 3: the lower bounds L_n and the upper bounds U_n. Recall that the lower bounds L_n are deterministic, whereas D_N ≤ U_n holds only with a certain probability. The stopping time τ will monitor these bounds and stop if they are large in magnitude (away from zero) and close enough that the relative error cannot exceed the desired tolerance r ∈ (0, 1). As in Section 3, set

\[
L_n := D_n + (N - n) C_-, \tag{6}
\]

\[
U_n := D_n + \min\left( c_\delta + (N - n)\,\frac{D_n + c_\delta}{n},\; (N - n) C_+ \right), \tag{7}
\]

\[
\hat{D}_n := \tfrac{1}{2}(L_n + U_n), \tag{8}
\]

where c_δ := (C₊ − C₋) H_N⁻¹(δ/2) and H_N(x) := 1{x ≤ N} … (9). The function H_N is derived from a theorem by Fan et al. (2012), on which our proofs rely. Fig. 4 shows visualizations of that function.

Figure 4: Visualization of the function H_N from Eq. (9). With growing dataset size N, the error guarding constant c_δ grows too, yet at a lower rate. For our experiments in Section 5, we consider datasets of roughly 50000 datapoints and a failure tolerance of δ := 0.1/2, which corresponds to a constant of approximately 547.3.

Finally, we define the stopping time as

\[
\tau := \min\left( \{N\} \cup \{ n < N \text{ s.t. } C^s_n \text{ and } C^p_n \text{ hold} \} \right), \tag{10}
\]

where C^s_n is the sign condition

\[
C^s_n \text{ true if } \operatorname{sign}(U_n) = \operatorname{sign}(L_n) \ne 0, \tag{11}
\]

and C^p_n is the relative error condition

\[
C^p_n \text{ true if } \frac{U_n - L_n}{2 \min(|U_n|, |L_n|)} \le r. \tag{12}
\]

Note that the quantities in the stopping conditions are all F_n-measurable; thus τ is indeed a stopping time. We can now state our main result.

Theorem 2: Assume that D_N is a sum of random variables decreasing conditionally in expectation as in Section 4.2. Then, for any r, δ ∈ (0, 1), the relative error of the estimator D̂_τ defined by Eqs. (6), (7) and (10) to (12) is bounded by r with probability at least 1 − δ, formally:

\[
\mathbb{P}\left( \left| D_N - \hat{D}_\tau \right| > r \left| D_N \right| \right) \le \delta.
\]

Intuitively, Theorem 2 guarantees that stopping early in the computation makes sense for any given r and δ. The less precision is required (corresponding to larger r), the easier the second stopping condition in Eq. (12) can be satisfied. The less confidence is necessary (corresponding to larger δ), the smaller the term c_δ in Eq. (7), which also increases the chances of satisfying Eq. (12) earlier. On the other hand, when r = 0, Eq. (12) can only be true if upper and lower bounds coincide. The latter can only be the case if c_δ = 0 (which would require δ = 2) and D_n = nC₋. This means: if we were to desire absolute precision, the theorem would recommend computing the full sum.

The proof of Theorem 2 and the proof of the following lemma are part of the supplementary material. Let us give a sketch of the proof. The design of the stopping condition is based on the following Lemma 3.

Lemma 3: Let D ∈ [L, U], and assume sign(L) = sign(U) ≠ 0. Then

\[
\frac{|D - (U + L)/2|}{|D|} \le \frac{U - L}{2 \min(|L|, |U|)}.
\]

The proof of Theorem 2 first bounds ℙ(|D_N − D̂_τ| > r|D_N|) by ℙ(|D_N − D̂_τ| > r|D_N|, D_N ≤ U_τ) + ℙ(D_N > U_τ). Using Lemma 3 and the stopping conditions, we can show that the probability of the first addend is 0. Finally, we bound ℙ(D_N > U_τ) by applying the Hoeffding-type inequality for martingales of Fan et al. (2012) twice.
Once, to show that D_N is probably not much larger than its expected value, and a second time, to show that U_τ is probably not much smaller.

4.4 Application to Kernel-Matrix Determinant Estimation

We now specialize Theorem 2 to the situation at hand.

Theorem 4: Assume x_1, . . . , x_N ∈ X are independent and identically distributed. Denote with ℙ the law of the x_1, . . . , x_N and with C the Cholesky decomposition of A. Define the probability space (X, σ(x_1, . . . , x_N), ℙ) and the canonical filtration F_j := σ(x_1, . . . , x_j) for j = 1, . . . , N. Further, define f_j := 2 log C_jj, C₋ := log σ², and assume there exists a constant C₊ such that

\[
\max_{j=1,\dots,N} \log\left( k(x_j, x_j) + \sigma^2 \right) \le C_+
\]

almost surely. Then, using the definitions of Theorem 2,

\[
\mathbb{P}\left( \frac{\left| \log\det(A) - \hat{D}_\tau \right|}{\left| \log\det(A) \right|} > r \right) \le \delta.
\]

As stated before, the i.i.d. assumption is not too stringent. Finding the deterministic upper bound C₊ is also a given in most use-cases, for example when X is bounded, or when the kernel is normalized or stationary. For instance, C₊ = log(θ + σ²) in the case of the RBF and OU kernels in Eqs. (13) and (14), respectively. The proof of Theorem 4 is part of the supplementary material. Essentially, to apply Theorem 2 to the estimation of kernel-matrix determinants, one has to show that the summands are decreasing in expectation. As stated before, the key observation is that the (squared) diagonal elements of the Cholesky correspond to the posterior variance of a Gaussian process given observations disturbed by independent Gaussian noise. With each observation, the posterior variance can only decrease, which in turn allows us to show that the diagonal elements of the Cholesky decrease conditionally in expectation.

5. Experiments

5.1 Intuition on the stopping strategy

One application of our implementation is to probe bad kernel parameters quickly.
For example, consider the case of a kernel matrix generated from a radial basis function (RBF) kernel,

\[
k_{\mathrm{RBF}}(x, z) := \theta \exp\left( -\frac{\| x - z \|^2}{2\ell^2} \right), \tag{13}
\]

with a lengthscale ℓ far too large with respect to the data. In that case, the squared diagonal elements of the Cholesky quickly come close to σ², which implies that upper and lower bounds become close enough to stop the computation early. We examine this hypothesis for the RBF kernel on different datasets, increasing the length scale exponentially. Furthermore, to also explore the limitations of our approach, we run the same experiments for the Ornstein-Uhlenbeck (OU) kernel,

\[
k_{\mathrm{OU}}(x, z) := \theta \exp\left( -\frac{\| x - z \|}{\ell} \right). \tag{14}
\]

Both kernels are (in the limit) members of the Matérn class of covariance functions (Rasmussen and Williams, 2006, p. 85). Whereas samples from a Gaussian process with RBF covariance are the smoothest in this class, samples from the OU kernel are the roughest. It is therefore not a surprise that our approach is less successful when using the OU kernel. Hence, one advantage of our approach is that one can qualitatively distinguish settings in which stopping occurs earlier or later. Furthermore, the overhead compared to the exact computation is negligible when not stopping.

The other parameters, σ² and θ, influence stopping primarily via the error guarding constant c_δ ∝ log(σ² + θ) − log σ² in Eq. (7) (see Theorem 4). A smaller σ² results in a

Key       | N     | D  | Source & URL
BANK      | 45211 | 51 | Moro et al. (2014); Bank+Marketing
METRO     | 48204 | 66 | no citation request; Metro+Interstate+Traffic+Volume
PM2.5     | 43824 | 79 | Liang et al. (2015); Beijing+PM2.5+Data
PROTEIN   | 45730 | 9  | no citation request; Physicochemical+Properties+of+Protein+Tertiary+Structure
PUMADYN   | 8192  | 32 | Snelson and Ghahramani (2006); www.cs.toronto.edu/~delve/data/pumadyn/desc.html
TAMILNADU | 45781 | 53 | no citation request; Tamilnadu+Electricity+Board+Hourly+Readings

Table 1: Overview of all datasets used for the experiments in Section 5.
Key refers to the title we gave a dataset in this article. N refers to the number of instances (training and testing) and D refers to the dimensionality after one-hot encoding. The URL is a suffix for http://archive.ics.uci.edu/ml/datasets/. The reference in Source acknowledges a citation request, if any.

smaller deterministic lower bound L_n and a larger c_δ, both factors potentially preventing early stopping. On the other hand, we will see that a smaller σ² also implies smaller addends log L_jj, such that the exact log-determinant is further away from zero, which allows faster stopping. In our case, where σ² = 10⁻³, θ = 1, δ = 0.1, and dataset size N ≈ 50000, one can infer with Fig. 4 that c_δ is approximately 7 · 600 = 4200, which is negligible in comparison to the determinants in the following, which are of order 10⁷.

5.2 Experiment setup

From the UCI machine learning repository (Dua and Graff, 2019), we took all multivariate datasets in matrix format with 40000 to 50000 instances without missing values. Furthermore, we included the frequently used PUMADYN dataset (Snelson and Ghahramani, 2006) as a small-scale example of only 8000 instances. Categorical variables were one-hot encoded and each dataset was then standardized. Table 1 provides an overview of all the datasets that we use. All large-scale experiments (≥ 40000 datapoints) were executed on machines running Ubuntu 18.04 with 32 gigabytes of RAM and two Intel Xeon E5-2670 v2 CPUs. The experiments for the PUMADYN dataset were run on a laptop running Ubuntu 20.04 with 16 gigabytes of RAM and an Intel i7-8665U CPU, to demonstrate the usefulness of our approach on more standard hardware. We remark again that we do not use Algorithm 1 but Algorithm 3 in Appendix A, which is a more practical implementation capable of exploiting blocking and parallelization.
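The constant H_N⁻¹(δ/2) entering c_δ (used above to read off c_δ ≈ 4200 from Fig. 4) can be computed numerically as suggested in Section 3, by a one-dimensional bounded minimization. In the sketch below, the quadratic H is only a stand-in for the paper's H_N of Eq. (9); any monotone function on [0, 1] can be plugged in.

```python
from scipy.optimize import minimize_scalar

def h_inverse(H, delta):
    # Find x in [0, 1] with H(x) = delta by minimizing F(x) = (H(x) - delta)^2,
    # as suggested in Section 3. H is assumed monotone on [0, 1].
    res = minimize_scalar(lambda x: (H(x) - delta) ** 2,
                          bounds=(0.0, 1.0), method='bounded')
    return res.x
```

For instance, with the stand-in H(x) = x², h_inverse(H, 0.25) returns a value close to 0.5.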
5.3 Baseline

As described in Section 2.2, none of the referenced stochastic trace estimators can guarantee a desired relative error, and they are therefore not included in the following comparison. The performance of the algorithm by Gardner et al. (2018) when using the RBF kernel is shown in Fig. 12. As baseline, we compare against the Cholesky decomposition with full pivoting (Harbrecht et al., 2012). Denote with L_n the N × n matrix for the intermediate approximation of the Cholesky decomposition in step n. In each step n, the algorithm keeps track of the approximation error of all remaining diagonal elements i, that is, how much K_ii differs from [L_n L_nᵀ]_ii, and processes the element inducing the most error next. The algorithm stops when an absolute error tolerance on the diagonal elements, denoted with d, can be guaranteed, that is, when

\[
d \ge \max_i \left| K_{ii} - [L_n L_n^\top]_{ii} \right|. \tag{15}
\]

Note that S iterations of the pivoted Cholesky require O(NS²) operations, whereas Algorithm 1 scales as O(S³). For this algorithm, we can set³

\[
U^P_n := D_n + \sum_{j=n+1}^{N} \log\left( K_{jj} - [L_n L_n^\top]_{jj} \right) \tag{16}
\]

and L^P_n := L_n, and apply the same stopping strategy, which allows comparing this algorithm with our proposed approach. In the next paragraph, we describe how to compare both algorithms without modifying the Fortran implementation of the Cholesky with pivoting.

5.4 Parameters and performance metric

We set σ² := 0.001 and θ := 1, and increased the lengthscale as ℓ := exp(i) for i = −1, . . . , 3. In the appendix, we report results where we set ℓ := 1 and θ := 1, and vary the observation noise as σ² := exp(i) for i = −4, . . . , 0. The Cholesky decomposition with full pivoting takes as input parameter a desired absolute error on the diagonal elements d (instead of a relative error on the log-determinant, see Eq. (15)). We ran this algorithm for d ∈ {0.001, 0.005, 0.01, 0.05, 0.1, 0.5}.
After the pivoted Cholesky stopped, we computed the relative error on the log-determinant that this algorithm could guarantee in that step. Then we ran our algorithm, trying to achieve the same error for $\delta = 0.1$. Occasionally, the desired relative error is larger than 1. In that case, $\hat{D}_N := 0$ is an estimator satisfying this requirement, which would allow stopping before even starting. However, we did not check for this condition, to instead observe when the algorithm would stop in such situations. We repeated each configuration for ten random permutations of the dataset. We measured the performance of our method in terms of the CPU time $t$ relative to the average CPU time used for the default Cholesky, $t_{\text{default}}$: $m := t / \text{mean}(t_{\text{default}})$. Thus, small values of $m$ are better.

3. One can show that $K_{jj} - [L_n L_n^\top]_{jj}$ is the posterior variance of a GP in $x_j$ conditioned on the first $n$ datapoints. The claim follows from Rasmussen and Williams (2006, Question 2.9.4).

5.5 Results

As an example, Fig. 5 shows our results for the PM2.5 dataset. For all other datasets, similar figures (Figs. 6 to 9 and 11) can be found in Appendix E. In all experiments, the returned estimate of our modified Cholesky decomposition indeed had the desired error. For the easy cases, our algorithm needs less than 10% of the average time of the default Cholesky. Here, by easy cases we mean settings where the relative error can be larger than 0.1 while using an RBF kernel with $\ell \geq \exp(1)$ (there is one exception: the BANK dataset with $\ell = \exp(1)$). The Cholesky decomposition with pivoting also saves time in these settings, yet less. The difference between the algorithms becomes more apparent the harder the problem. Except for three cases, which we elaborate on below, our algorithm never needs longer than 105% of the time of the default Cholesky. In contrast, the Cholesky with pivoting may take more than twice as long.
In three cases our approach crosses the 105% mark: using an RBF kernel with $\ell = 1$ on PM2.5 and METRO, and with $\ell = \exp(-1)$ on METRO. In these scenarios, the kernel matrix contains many extremely small entries of less than $10^{-65}$. Floating-point multiplication is not a constant-time operation, and we observed that a large number of such entries significantly prolongs the runtime of our experiments. This is the reason why, for $\ell = \exp(-1)$, the run time of the default Cholesky can be up to ten times longer than for larger length-scales. Our row-wise implementation of the Cholesky decomposition suffers more from this phenomenon than the original OpenBLAS version. One can circumvent this problem by eliminating such small entries or by increasing the block-size in Algorithm 3. However, we deliberately did not apply these strategies, to showcase possible downsides of this implementation. Furthermore, note that the absolute overhead is less than 30s for these three cases, and that the effect becomes more negligible the longer the absolute running time. Importantly, the additional run-time does not stem from checking our stopping conditions, which can be seen in Fig. 17, where we also compare to the original BLAS implementation endowed with our stopping condition. The take-away is that the row-wise Cholesky implementation allows higher speed gains but may sometimes induce more than 5% overhead. In future work, we aim to investigate middle grounds between these two Cholesky implementations. For the experiments where we varied $\sigma^2$, the results are similar. The plots can be found in Appendix E.3. The smaller $\sigma^2$, the further the exact log-determinant is from zero, which allows stopping earlier. An additional problem with larger $\sigma^2$ is that our deterministic lower bound is too conservative. This can be seen in Appendix E.5, where we showcase the development of the bounds for both experiments. That a better lower bound is conceivable is demonstrated by Artemev et al. (2021); Bartels et al.
(2023).

6. Conclusion

6.1 Summary

We presented a stopping strategy for the Cholesky decomposition that allows obtaining estimates for the log-determinant of a kernel matrix of desired error $r$ before completing the decomposition. The stopping strategy has only one parameter: a failure probability $\delta$. We showed that the returned estimate indeed has the desired error with probability $1 - \delta$, under the mild assumptions that the dataset inputs are independent and identically distributed and a boundedness assumption that is met if the kernel or the domain is bounded.

Figure 5: Relative execution times to compute the log-determinant using RBF, Eq. (13) (left panel), and OU, Eq. (14) (right panel), kernels on the PM2.5 dataset for $\theta = 1$, $\log \ell = -1, \ldots, 3$ and $\delta = 0.1$ for ten repetitions. The number next to 1 on the y-axis displays the absolute execution time of the default Cholesky. The solid, horizontal, orange line visualizes the 105% mark. The x-axis displays a desired absolute error on the diagonal elements $d$ (top) and the average corresponding desired relative error $r$ (bottom) on the log-determinant. The longer the length-scale, the earlier it is possible to stop and the higher the speed-up. The speed-up is generally higher for the RBF than for the OU kernel. Even though the Cholesky decomposition with pivoting needs to compute fewer diagonal elements (see Figs. 2 and 3) compared with our method, it is slower in practice and may even take more than twice as long as simply running the default Cholesky decomposition. Our method, on the other hand, is faster, and when approximation is hard, the overhead is negligible. One exception can be seen for the RBF kernel with a length-scale of $\ell = 1$; the reason for this exception is described in Section 5.5.
We demonstrated that there exist settings in which it is possible to save considerable amounts of time when stopping the Cholesky decomposition before completion. Importantly, when not stopping early, the induced overhead is less than five percent on average. As part of their concluding remarks, Chalupka et al. (2013) wrote that "...the results presented above point to the very simple Subset of Data method (or the Hybrid variant) as being the leading contender. We hope this will act as a rallying cry to those working on GP approximations to beat this dumb method." In essence, the presented idea makes a virtue of necessity. Algorithm 2 can be viewed as an estimate for how much data is necessary to identify a kernel model for a particular dataset distribution. The claim that kernel machines do not scale well with large datasets becomes brittle when the overall dataset size matters little.

6.2 Future work

Early stopping for lower error values $r$, closer to numerical precision, would be desirable. One way to achieve this goal could be to find a less conservative, probabilistic lower bound on the log-determinant. A direction to investigate is concentration inequalities for self-bounding functions (Boucheron et al., 2013, p. 60). Some concentration inequalities for self-bounding functions allow reasoning about the probability of the function falling below its expectation. One can show that the log-determinant of a kernel matrix is such a function (Bartels, 2020, Lemma 33, p. 94). In the long run, we hope to lift our experiments to hyper-parameter optimization for Gaussian processes. For that, our analysis needs to be extended to the term $y^\top A^{-1} y$. This analysis is similar, but not trivial (Bartels et al., 2023). Furthermore, we will then require an extension of our approach to control the quality of the gradient approximation, since it can be quite different from the accuracy of the function approximation.
Acknowledgments

This work was funded in part by the Novo Nordisk Foundation through the Center for Basic Machine Learning Research in Life Science (NNF20OC0062606). It also received funding from the Danish Ministry of Education and Science, Digital Pilot Hub and Skylab Digital. Damien Garreau acknowledges the support of the French government, through the NIM-ML project (ANR-21-CE23-0005-01). Simon is grateful for patient listening and fruitful discussions to Gabriele Abbati, Philipp Hennig, Motonobu Kanagawa, Hans Kersting, Jonas Kübler, Simon Lacoste-Julien, Krikamol Muandet, Alexander Neitz, Giambattista Parascandolo, Michaël Perrot, Carl Rasmussen, Luca Rendsburg, Maja Rudolph, Michael Schober, Sebastian Weichwald and Inna Zeitler.

Algorithm 3 Blocked and recursive formulation of Algorithm 1. Highlighted are our modifications to the original algorithm.
Input A: matrix for which to compute the log-determinant
Input N: matrix size
Input b: block size (number of elements to process in parallel)
Input r: desired relative error
Input δ: desired confidence
Input σ²: lower bound on min_j A_jj
Input C₊: upper bound on log(max_j A_jj)
1: D ← 0, c_δ ← (C₊ − log σ²) H_N⁻¹(δ/2)   ▷ constant described in Section 3
2: A_{1:b,1:b} ← chol(A_{1:b,1:b})
3: D ← D + 2 Σ_{l=1}^{b} log A_{ll}   ▷ track log-determinant of the submatrix up to index b
4: D̂ ← EvaluateConditionsAndEstimator(N, b, D, σ², C₊, c_δ)
5: if D̂ ≠ 0 then
6:     return D̂
7: end if
8: i ← b + 1, j ← min(i + b, N)
9: while i < N do
10:    A_{i:j,1:i} ← A_{i:j,1:i} A_{1:i,1:i}^{−⊤}
11:    A_{i:j,i:j} ← A_{i:j,i:j} − A_{i:j,1:i} A_{i:j,1:i}^{⊤}
12:    A_{i:j,i:j} ← chol(A_{i:j,i:j})
13:    D ← D + 2 Σ_{l=i}^{j} log A_{ll}   ▷ track log-determinant of the submatrix up to index j
14:    D̂ ← EvaluateConditionsAndEstimator(N, j, D, σ², C₊, c_δ)
15:    if D̂ ≠ 0 then
16:        return D̂   ▷ bounds are close enough; return estimate for log-determinant
17:    end if
18:    i ← i + b, j ← min(i + b, N)
19: end while
20: return D   ▷ now the lower-triangular part of A contains L
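In Python, the structure of Algorithm 3 can be sketched as follows. Here `check` stands in for Evaluate Conditions And Estimator, and all names are illustrative rather than the OpenBLAS interface:

```python
import numpy as np

def stopped_blocked_cholesky(A, b, check):
    """Blocked, in-place Cholesky in the spirit of Algorithm 3 (a sketch).
    After each block, check(j, D) receives the number of processed rows j and
    the log-determinant D of the leading j x j submatrix; returning a
    non-None value stops the decomposition early and forwards that estimate."""
    N = A.shape[0]
    D, i = 0.0, 0
    while i < N:
        j = min(i + b, N)
        if i > 0:
            L11 = np.tril(A[:i, :i])                          # already-factorized part
            A[i:j, :i] = np.linalg.solve(L11, A[i:j, :i].T).T # line 10 of Algorithm 3
            A[i:j, i:j] -= A[i:j, :i] @ A[i:j, :i].T          # line 11: Schur complement
        A[i:j, i:j] = np.linalg.cholesky(A[i:j, i:j])         # line 12
        D += 2.0 * np.log(np.diag(A)[i:j]).sum()              # line 13
        est = check(j, D)
        if est is not None:
            return est        # bounds close enough: return the estimate
        i = j
    return D                  # completed: D is the exact log-determinant
```

Because `check` only sees the running log-determinant of the leading submatrix, the stopping conditions can be evaluated without ever touching the unprocessed part of A, which is what yields the sub-linear behaviour when stopping early.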
Appendix A. A practical implementation of the Cholesky decomposition with stopping

Algorithm 3 is a blocked and recursive version of Algorithm 1. Our OpenBLAS implementation uses the above algorithm with a block size of b := #CPUS · BLOCK_SIZE, where BLOCK_SIZE is the internal OpenBLAS block size. Furthermore, the call to chol is a call to the default OpenBLAS Cholesky. Algorithm 3 is easy to employ in or on top of any library.

Appendix B. Proof of Theorem 4

Proof By Lemma 8, $\log\det(A) = \sum_{j=1}^{N} 2 \log L_{jj}$, and one can see that the problem already has the right form for Theorem 2 in the main paper. To apply the theorem, we need to show that, for all $j = 1, \ldots, N$, the $L_{jj}$ are functions of $x_1, \ldots, x_j$ (Lemma 5), that $f_j := 2 \log L_{jj} \in [C_-, C_+]$ (Lemma 6), and that $\mathrm{E}[f_{j+1} \mid F_j] \leq \mathrm{E}[f_j \mid F_{j-1}]$ (Lemma 7).

We are going to show that the $j$-th diagonal element of the Cholesky is bounded and can be computed from $x_1, \ldots, x_j$ only. Then, we conclude that the elements must decrease in (conditional) expectation. To prove the following lemmata, define

$$k_n(x) := [k(x, x_1), \ldots, k(x, x_n)]^\top \in \mathbb{R}^n, \qquad k_{n+1} := k_n(x_{n+1}) \in \mathbb{R}^n,$$

and

$$v_n := k(x_n, x_n) + \sigma^2 - k_n^\top (K_{n-1} + \sigma^2 I_{n-1})^{-1} k_n .$$

The first term, $k_n(x)$, denotes the covariance between an arbitrary input $x$ and the first $n$ datapoints from the dataset. In particular, this definition will be used in the proof of Lemma 7, which states the decrease in expectation. The term $v_n$ is the posterior variance of a Gaussian process $f$ conditioned on observations $y \in \mathbb{R}^{n-1}$, perturbed by Gaussian white noise:⁴ $p(y \mid f) = \mathcal{N}(y; f, \sigma^2 I)$. Lemma 5 establishes a link between $v_n$ and the $n$-th diagonal element of the Cholesky, which is then used in the proof of Lemma 7.

Lemma 5 (Link between the Cholesky and Gaussian process regression) Denote with $L_N$ the Cholesky decomposition of $A := K_N + \sigma^2 I_N$, so that $L_N L_N^\top = A$. The $n$-th diagonal element of $L_N$, squared, is equivalent to $v_n$: $[L_N]_{nn}^2 = v_n$.
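Before the proof, Lemma 5 is easy to verify numerically. The following sketch uses an arbitrary RBF kernel, sizes, and index, all of which are purely illustrative:

```python
import numpy as np

# Numerical check of Lemma 5: the squared n-th diagonal element of the
# Cholesky factor of K_N + sigma^2 I_N equals the posterior variance v_n of a
# GP at x_n, conditioned on the preceding datapoints (0-based index here).
rng = np.random.default_rng(0)
N, sigma2 = 30, 1e-2
X = rng.standard_normal((N, 2))
K = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))   # RBF kernel matrix
L = np.linalg.cholesky(K + sigma2 * np.eye(N))
n = 17                                       # an arbitrary index to check
kn = K[n, :n]                                # covariance with earlier points
vn = K[n, n] + sigma2 - kn @ np.linalg.solve(K[:n, :n] + sigma2 * np.eye(n), kn)
assert np.isclose(L[n, n] ** 2, vn)
```

The identity is just the Schur complement of the leading $n \times n$ block of $K_N + \sigma^2 I_N$, which is exactly what the proof below establishes by induction.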
Proof By a slight abuse of notation, let us define $L_1 := \sqrt{k(x_1, x_1) + \sigma^2}$, and

$$L_N := \begin{pmatrix} L_{N-1} & 0 \\ k_N^\top L_{N-1}^{-\top} & \sqrt{v_N} \end{pmatrix} .$$

We will show that the lower-triangular matrix $L_N$ satisfies $L_N L_N^\top = K_N + \sigma^2 I_N$. Since the Cholesky decomposition is unique (Golub and Van Loan, 2013, Theorem 4.2.7), $L_N$ must be the Cholesky decomposition of $K_N + \sigma^2 I_N$. Furthermore, by definition of $L_N$, $[L_N]_{NN}^2 = v_N$. The statement then follows by the recursive definition of $L_N$.

We want to show that $L_N L_N^\top = K_N + \sigma^2 I_N$. The proof follows by induction. For the base case, note that $L_1 L_1^\top = k(x_1, x_1) + \sigma^2 = K_1 + \sigma^2 I_1$. For the induction step, let us assume that the proposition holds up to $N-1$, that is, $L_{N-1} L_{N-1}^\top = K_{N-1} + \sigma^2 I_{N-1}$. Then, by definition of $L_N$,

$$L_N L_N^\top = \begin{pmatrix} L_{N-1} & 0 \\ k_N^\top L_{N-1}^{-\top} & \sqrt{v_N} \end{pmatrix} \begin{pmatrix} L_{N-1}^\top & L_{N-1}^{-1} k_N \\ 0 & \sqrt{v_N} \end{pmatrix} = \begin{pmatrix} L_{N-1} L_{N-1}^\top & L_{N-1} L_{N-1}^{-1} k_N \\ k_N^\top L_{N-1}^{-\top} L_{N-1}^\top & k_N^\top L_{N-1}^{-\top} L_{N-1}^{-1} k_N + v_N \end{pmatrix}$$
$$= \begin{pmatrix} K_{N-1} + \sigma^2 I_{N-1} & k_N \\ k_N^\top & k_N^\top (K_{N-1} + \sigma^2 I)^{-1} k_N + v_N \end{pmatrix} = \begin{pmatrix} K_{N-1} + \sigma^2 I_{N-1} & k_N \\ k_N^\top & k(x_N, x_N) + \sigma^2 \end{pmatrix} = K_N + \sigma^2 I_N .$$

4. See for example Rasmussen and Williams (2006, p. 16).

Lemma 6 (Bounding the $f_j$s) Denote by $L_N$ the Cholesky decomposition of $K_N + \sigma^2 I_N$. Define $C_- := \log \sigma^2$ and take $C_+ \geq \max_{j=1,\ldots,N} \log\left( k(x_j, x_j) + \sigma^2 \right)$. Then, for all $j \in \{1, \ldots, N\}$, $C_- \leq f_j \leq C_+$ a.s.

Proof By Lemma 5, $L_{nn}^2 = k(x_n, x_n) + \sigma^2 - k_n^\top (K_{n-1} + \sigma^2 I_{n-1})^{-1} k_n$. The term $k_n^\top (K_{n-1} + \sigma^2 I_{n-1})^{-1} k_n$ is always nonnegative, since $(K_{n-1} + \sigma^2 I_{n-1})^{-1}$ is a symmetric positive definite matrix. Hence, $k(x_n, x_n) + \sigma^2$ is an upper bound on $L_{nn}^2$. On the other hand, since $k$ is a kernel, $k(x_n, x_n) - k_n^\top (K_{n-1} + \sigma^2 I_{n-1})^{-1} k_n$ cannot be negative, and $\sigma^2$ is therefore a lower bound on $L_{nn}^2$. Since both values are positive and the logarithm is an increasing function on the positive real axis, the proof is complete.

Equipped with the link between the diagonal elements of the Cholesky and Gaussian process regression stated in Lemma 5, we can now show that the diagonal elements of the Cholesky must decrease in (conditional) expectation, when treating the $x_1, \ldots$
, $x_N$ as random variables. This follows intuitively from the fact that the posterior variance of a Gaussian process at a fixed location $x$ can only decrease with more observations.

Lemma 7 (The $f_j$s are decreasing in expectation) Assume $x_1, \ldots, x_N \in X$ are independent and identically distributed. Denote with $P$ the law of the $x_1, \ldots, x_N$ and with $L$ the Cholesky decomposition of $A$. Define the probability space $(X, \sigma(x_1, \ldots, x_N), P)$ and the canonical filtration $F_j := \sigma(x_1, \ldots, x_j)$ for $j = 1, \ldots, N$. Then the $f_j$ decrease in conditional expectation, that is,

$$\mathrm{E}[f_{j+1} \mid \sigma(x_1, \ldots, x_j)] \leq \mathrm{E}[f_j \mid \sigma(x_1, \ldots, x_{j-1})] .$$

Proof Denote with $Q_j(dx) := P(dx \mid x_1, \ldots, x_j)$ the regular conditional probability. Define the shorthand $q_j(x) := k_j(x)^\top (K_j + \sigma^2 I)^{-1} k_j(x)$. We will show later in the proof, in Eqs. (17) and (18), that $q_j(x) = q_{j-1}(x) + r_j(x)$ where $r_j(x) \geq 0$. Taking this for granted for now, we can show the claim as follows.

$$\mathrm{E}[f_{j+1} \mid \sigma(x_1, \ldots, x_j)] = \mathrm{E}[\log L_{j+1,j+1}^2 \mid \sigma(x_1, \ldots, x_j)] \qquad \text{(definition of } f_j\text{)}$$
$$= \int \log\left( k(x, x) + \sigma^2 - k_j(x)^\top (K_j + \sigma^2 I)^{-1} k_j(x) \right) Q_j(dx) \qquad \text{(property of conditional expectation)}$$
$$= \int \log\left( k(x, x) + \sigma^2 - q_j(x) \right) Q_j(dx) \qquad \text{(definition of } q_j(x)\text{)}$$
$$= \int \log\left( k(x, x) + \sigma^2 - q_{j-1}(x) - r_j(x) \right) Q_j(dx) \qquad \text{(using Eq. (17))}$$
$$\leq \int \log\left( k(x, x) + \sigma^2 - q_{j-1}(x) \right) Q_j(dx) \qquad \text{(using Eq. (18) and monotonicity of the logarithm)}$$
$$= \int \log\left( k(x, x) + \sigma^2 - q_{j-1}(x) \right) Q_{j-1}(dx) \qquad \text{(with Fubini's theorem)}$$
$$= \mathrm{E}[\log L_{jj}^2 \mid \sigma(x_1, \ldots, x_{j-1})] \qquad \text{(property of conditional expectation)}$$
$$= \mathrm{E}[f_j \mid \sigma(x_1, \ldots, x_{j-1})] \qquad \text{(definition of } f_j\text{)}$$

It remains to show that $q_j(x) = q_{j-1}(x) + r_j(x)$ where $r_j(x) \geq 0$. For readability, we define $v_x := (K_{j-1} + \sigma^2 I)^{-1} k_{j-1}(x)$ and $c := v_j^{-1}$. First note that, using block-matrix inversion, we can write

$$(K_j + \sigma^2 I_j)^{-1} = \begin{pmatrix} (K_{j-1} + \sigma^2 I_{j-1})^{-1} + v_{x_j} c\, v_{x_j}^\top & -v_{x_j} c \\ -v_{x_j}^\top c & c \end{pmatrix} .$$

Using the above observation, we can transform $q_j(x)$.
$$q_j(x) = \begin{pmatrix} k_{j-1}(x) \\ k(x_j, x) \end{pmatrix}^\top \begin{pmatrix} (K_{j-1} + \sigma^2 I)^{-1} + v_{x_j} c\, v_{x_j}^\top & -v_{x_j} c \\ -v_{x_j}^\top c & c \end{pmatrix} \begin{pmatrix} k_{j-1}(x) \\ k(x_j, x) \end{pmatrix} \qquad \text{(definition of } q_j(x) \text{ and the above observation)}$$
$$= \begin{pmatrix} k_{j-1}(x) \\ k(x, x_j) \end{pmatrix}^\top \begin{pmatrix} v_x + v_{x_j} c\, v_{x_j}^\top k_{j-1}(x) - v_{x_j} c\, k(x, x_j) \\ -v_{x_j}^\top k_{j-1}(x)\, c + c\, k(x, x_j) \end{pmatrix} \qquad \text{(evaluating the right-hand-side matrix-vector product)}$$
$$= k_{j-1}(x)^\top v_x + c \left( v_{x_j}^\top k_{j-1}(x) \right)^2 - 2 v_{x_j}^\top k_{j-1}(x)\, c\, k(x, x_j) + c\, k(x, x_j)^2 \qquad \text{(evaluating the vector product)}$$
$$= k_{j-1}(x)^\top v_x + c \left( k(x, x_j) - v_{x_j}^\top k_{j-1}(x) \right)^2 \qquad \text{(rearranging terms into a quadratic)}$$
$$= q_{j-1}(x) + c \left( k(x, x_j) - v_{x_j}^\top k_{j-1}(x) \right)^2 \qquad \text{(definition of } q_{j-1}(x)\text{)}$$

This shows that

$$q_j(x) = q_{j-1}(x) + r_j(x) , \text{ where} \qquad (17)$$
$$r_j(x) := c \left( k(x, x_j) - v_{x_j}^\top k_{j-1}(x) \right)^2 \geq 0 . \qquad (18)$$

The claim of Lemma 8 can, for example, be found in Rasmussen and Williams (2006, p. 203).

Lemma 8 (Computing the log-determinant from the Cholesky decomposition) Denote with $L$ the Cholesky decomposition of a symmetric and positive definite matrix $A$. Then

$$\log \det(A) = 2 \sum_{j=1}^{N} \log L_{jj} .$$

Proof

$$\log |A| = \log |L L^\top| \qquad \text{(using } A = L L^\top\text{)}$$
$$= \log(|L| \cdot |L^\top|) \qquad \text{(property of the determinant)}$$
$$= \log(|L|^2) \qquad \text{(transposition does not affect the determinant)}$$
$$= 2 \log \prod_{j=1}^{N} L_{jj} \qquad \text{(property of triangular matrices)}$$
$$= 2 \sum_{j=1}^{N} \log L_{jj} \qquad \text{(property of the logarithm)}$$

Appendix C. Background Material for the Proof of Theorem 2

Before we state the proof of Theorem 2, we provide here the tools that we are going to use. Our main tool will be the following theorem by Fan et al. (2012). Essentially, this is a generalization of Hoeffding's inequality to martingales. It states that, for a sum of random variables that decrease in (conditional) expectation, the probability of exceeding a certain threshold is low when, at the same time, the (conditional) variance is bounded by another constant. Importantly, this probability holds simultaneously for all partial sums starting at $1$ and ending at $n$, for $n = 1$ to $n = N$.

Theorem 9 (Hoeffding's inequality for supermartingales (Fan et al., 2012)) Assume that $(\xi_j, F_j)_{j=1,\ldots,N}$ are supermartingale differences satisfying $\xi_j \leq 1$.
Then, for any $x \geq 0$ and $v > 0$,

$$P\left( \sum_{j=1}^{n} \xi_j \geq x \text{ and } \sum_{j=1}^{n} \mathrm{V}[\xi_j \mid F_{j-1}] \leq v \text{ for some } n \in [1, N] \right) \leq H_N(x, v), \quad \text{where} \quad H_N(x, v) := \mathbb{1}\{x \leq N\} \left\{ \left( \frac{v}{x+v} \right)^{x+v} \left( \frac{N}{N-x} \right)^{N-x} \right\}^{\frac{N}{N+v}} .$$

The following theorem will give us the upper bound on the conditional variance necessary for Theorem 9. The theorem below applies to empirical variance estimates, but the remark thereafter shows that it provides a bound on the true variance as well.

Theorem 10 (Popoviciu's inequality (Popoviciu, 1935; Sharma et al., 2010)) For a sequence of real numbers $x_1, \ldots, x_n \in [m, M]$, define $\mu := \frac{1}{n} \sum_{j=1}^{n} x_j$ and $\sigma^2 := \frac{1}{n} \sum_{j=1}^{n} (x_j - \mu)^2$. Then $\sigma^2 \leq \frac{1}{4}(M - m)^2$.

Remark 11 Theorem 10 can be used to obtain a bound on the conditional variance as well. Let $x_1, \ldots, x_n \sim P(\cdot \mid F)$ be independent. Then,

$$\mathrm{V}[X \mid F] = \mathrm{E}[(X - \mathrm{E}[X \mid F])^2 \mid F] \qquad \text{(definition of conditional variance)}$$
$$= \frac{n}{n-1} \mathrm{E}[\sigma^2 \mid F] \qquad \text{(using Bessel's correction)}$$
$$\leq \frac{n}{4(n-1)} (M - m)^2 \qquad \text{(by Theorem 10)}$$

which holds for all $n \in \mathbb{N}$. Hence, $\mathrm{V}[X \mid F] \leq \frac{1}{4}(M - m)^2$.

The martingale differences that we will be analyzing have random indices from our stopping time. Doob's Optional Sampling Theorem (see for example Grimmett and Stirzaker, 2001, p. 489) and the remark below provide us with the mathematical justification.

Theorem 12 (Doob's Optional Sampling Theorem) Let $(X_j, F_j)_{j \in \mathbb{N}}$ be a submartingale and $\tau_1 \leq \tau_2 \leq \ldots$ be a sequence of stopping times such that $P(\tau_j \leq n_j) = 1$ for some deterministic real sequence $n_j$. Then the stopped process $(X_{\tau_j}, F_{\tau_j})_{j \in \mathbb{N}}$ is also a submartingale.

Remark 13 By exchanging $X_j$ for $-X_j$, the theorem can be shown to hold for supermartingales as well.

Corollary 14 (Stopped submartingale differences) Let $(\xi_j, F_j)_{j \in \mathbb{N}}$ be a submartingale difference and let $\tau$ be a stopping time. Then the stopped process $(\xi_{\min(j,\tau)}, F_{\min(j,\tau)})_{j \in \mathbb{N}}$ is also a submartingale difference.

Proof Define $X_l := \sum_{j=1}^{l} \xi_j$ and observe that this defines a submartingale. By Theorem 12, $(X_{\min(j,\tau)}, F_{\min(j,\tau)})_{j \in \mathbb{N}}$ is a submartingale.
Then $X_{\min(j,\tau)} - X_{\min(j-1,\tau)}$ is again a submartingale difference.

Appendix D. Proof of Theorem 2

The proof can be split into two parts. Lemma 16 shows, by using the stopping conditions, that if the bound holds, the relative error of the estimator is indeed less than $r$ with probability 1. The second part is to show that $P(U_\tau < D_N) \leq \delta$, which is the purpose of Lemma 17 and which makes use of the assumption stated in Eq. ( ).

Proof

$$P\left( \frac{|D_N - \hat{D}_\tau|}{|D_N|} > r \right) = P\left( \frac{|D_N - \hat{D}_\tau|}{|D_N|} > r, D_N \leq U_\tau \right) + P\left( \frac{|D_N - \hat{D}_\tau|}{|D_N|} > r, D_N > U_\tau \right)$$
$$\leq P\left( \frac{|D_N - \hat{D}_\tau|}{|D_N|} > r, D_N \leq U_\tau \right) + P(D_N > U_\tau) \qquad \text{(upper-bounding the joint by the marginal)}$$
$$\leq 0 + \delta \qquad \text{(by Lemma 16 and Lemma 17)}$$

The following lemma gives an upper bound on the relative error of an estimator in terms of upper and lower bounds for the quantity of interest. The bound is minimized if the estimator is chosen to be the average of upper and lower bound. The lemma can also be found in Mnih (2008) but has been developed independently.

Lemma 15 (Bounding the relative error) Let $D, \hat{D} \in [L, U]$, and assume $\mathrm{sign}(L) = \mathrm{sign}(U) \neq 0$. Then the relative error of the estimator $\hat{D}$ can be bounded as

$$\frac{|D - \hat{D}|}{|D|} \leq \frac{\max(U - \hat{D}, \hat{D} - L)}{\min(|L|, |U|)} .$$

Proof First observe that if $D > \hat{D}$ then $|D - \hat{D}| = D - \hat{D} \leq U - \hat{D}$. If $D \leq \hat{D}$, then $|D - \hat{D}| = \hat{D} - D \leq \hat{D} - L$. Hence, $|D - \hat{D}| \leq \max(U - \hat{D}, \hat{D} - L)$.

Case $L > 0$: In this case $|D| = D \geq L = |L|$, and we obtain for the relative error

$$\frac{|D - \hat{D}|}{|D|} \leq \frac{\max(U - \hat{D}, \hat{D} - L)}{|L|} .$$

Case $U < 0$: In that case $|L| \geq |D| \geq |U|$, and the relative error can be bounded as

$$\frac{|D - \hat{D}|}{|D|} \leq \frac{\max(U - \hat{D}, \hat{D} - L)}{|U|} .$$

Since we assumed $\mathrm{sign}(L) = \mathrm{sign}(U) \neq 0$, these were all the cases that required consideration. Combining all observations yields

$$\frac{|D - \hat{D}|}{|D|} \leq \max(U - \hat{D}, \hat{D} - L) \max\left( \frac{1}{|L|}, \frac{1}{|U|} \right) = \frac{\max(U - \hat{D}, \hat{D} - L)}{\min(|U|, |L|)} .$$

Lemma 16 (Controlling the relative error when $D_N \leq U_\tau$) With the definitions of Section 4.2, the probability that the relative error of the estimator is larger than some $r > 0$ and at the same time the bound holds, is zero.
Formally,

$$P\left( \frac{|D_N - \hat{D}_\tau|}{|D_N|} > r, \; D_N \leq U_\tau \right) = 0 .$$

Proof As a preliminary observation, note that (by definition)

$$D_N = D_n + \sum_{j=n+1}^{N} f_j \quad \text{for all } n = 0, \ldots, N \qquad \text{(definition of } D_n\text{)}$$
$$\geq D_n + \sum_{j=n+1}^{N} C_- \quad \text{for all } n = 0, \ldots, N \qquad \text{(since } f_j \in [C_-, C_+]\text{)}$$
$$= L_n \quad \text{for all } n = 0, \ldots, N \qquad (19)$$

(using the definition of $L_n$), and hence, for all $n = 0, \ldots, N$, $L_n$ is an almost sure lower bound on $D_N$. Furthermore,

$$P\left( \frac{|D_N - \hat{D}_\tau|}{|D_N|} > r, D_N \leq U_\tau \right) = P\left( \frac{|D_N - \hat{D}_\tau|}{|D_N|} > r, D_N \leq U_\tau, \tau < N \right) + P\left( \frac{|D_N - \hat{D}_\tau|}{|D_N|} > r, D_N \leq U_\tau, \tau = N \right) .$$

Recall that $\hat{D}_\tau = \frac{1}{2}(L_\tau + U_\tau)$. In case $\tau = N$, we have that $U_N = L_N = D_N$, and hence, $P(|D_N - \hat{D}_\tau|/|D_N| > r, D_N \leq U_\tau, \tau = N) = 0$. For brevity, define the event $A := \{D_N \leq U_\tau, \tau < N\}$, that is, the upper bound holds and the stopping conditions are fulfilled at a time before $N$. Then

$$P\left( \frac{|D_N - \hat{D}_\tau|}{|D_N|} > r, D_N \leq U_\tau, \tau < N \right) = P\left( \frac{|D_N - \hat{D}_\tau|}{|D_N|} > r, A \right) \qquad \text{(definition of } A\text{)}$$
$$= P\left( \frac{|D_N - \hat{D}_\tau|}{|D_N|} > r, L_\tau \leq D_N, A \right) \qquad \text{(since } L_\tau \text{ is an almost sure lower bound on } D_N \text{ by Eq. (19))}$$
$$\leq P\left( \frac{\max(U_\tau - \hat{D}_\tau, \hat{D}_\tau - L_\tau)}{\min(|L_\tau|, |U_\tau|)} > r, L_\tau \leq D_N, A \right) \qquad \text{(by Lemma 15, using the first condition of } \tau\text{, Eq. (11))}$$
$$= P\left( \frac{U_\tau - L_\tau}{2 \min(|L_\tau|, |U_\tau|)} > r, L_\tau \leq D_N, A \right) \qquad \text{(definition of } \hat{D}_\tau\text{)}$$
$$= 0 \qquad \text{(by the second condition of } \tau\text{, Eq. (12))}$$

Lemma 17 (Upper bound control) With the definitions of Section 4.2, the probability that the upper bound fails is less than $\delta$. Formally, $P(D_N > U_\tau) \leq \delta$.

Proof The following parts of the proof rely on Theorem 9 by Fan et al. (2012). To apply Theorem 9, define $Z'_j := f_j - \mathrm{E}[f_j \mid F_{j-1}]$ and $Z_j := Z'_{\tau+j}$. For brevity, we define

$$\varepsilon := (C_+ - C_-) H_N^{-1}(\delta/2), \qquad \varepsilon_n := \varepsilon \left( \frac{1}{N-n} + \frac{1}{n} \right), \qquad \hat{\mu}_n := \frac{D_n}{n} + \varepsilon_n,$$

such that we can write

$$U_n = D_n + (N-n) \min(\hat{\mu}_n, C_+) . \qquad (20)$$

Then

$$P(D_N > U_\tau) = P\left( D_\tau + \sum_{j=\tau+1}^{N} f_j > D_\tau + (N-\tau)\min(\hat{\mu}_\tau, C_+) \right) \qquad \text{(using the definition of } D_n \text{ and Eq.}
(20))
$$= P\left( \sum_{j=\tau+1}^{N} f_j > (N-\tau)\min(\hat{\mu}_\tau, C_+) \right) \qquad \text{(simplifying)}$$
$$= P\left( \sum_{j=\tau+1}^{N} f_j > (N-\tau)\hat{\mu}_\tau \text{ or } \sum_{j=\tau+1}^{N} f_j > (N-\tau)C_+ \right) \qquad \text{(exchanging min for logical or)}$$
$$= P\left( \sum_{j=\tau+1}^{N} f_j > (N-\tau)\hat{\mu}_\tau \right) \qquad \text{(since } f_j \leq C_+\text{)}$$
$$= P\left( \sum_{j=1}^{N-\tau} \left[ Z_j + \mathrm{E}[f_{\tau+j} \mid F_{\tau+j-1}] \right] > (N-\tau)\hat{\mu}_\tau \right) \qquad \text{(definition of } Z_j\text{)}$$
$$\leq P\left( \sum_{j=1}^{N-\tau} \left[ Z_j + \mathrm{E}[f_{\tau+j} \mid F_{\tau+j-1}] \right] > (N-\tau)\hat{\mu}_\tau, \; \sum_{j=\tau+1}^{N} \mathrm{E}[f_j \mid F_{j-1}] \leq \frac{N-\tau}{\tau}(D_\tau + \varepsilon) \right) + P\left( \sum_{j=\tau+1}^{N} \mathrm{E}[f_j \mid F_{j-1}] > \frac{N-\tau}{\tau}(D_\tau + \varepsilon) \right) \qquad (21)$$

(sum rule and upper-bounding the joint by the marginal). Consider the first addend in Eq. (21).

$$P\left( \sum_{j=1}^{N-\tau} \left[ Z_j + \mathrm{E}[f_{\tau+j} \mid F_{\tau+j-1}] \right] > (N-\tau)\hat{\mu}_\tau, \; \sum_{j=\tau+1}^{N} \mathrm{E}[f_j \mid F_{j-1}] \leq \frac{N-\tau}{\tau}(D_\tau + \varepsilon) \right)$$
$$\leq P\left( \sum_{j=1}^{N-\tau} Z_j + \frac{N-\tau}{\tau}(D_\tau + \varepsilon) > (N-\tau)\hat{\mu}_\tau \right) \qquad \text{(combining the two events)}$$
$$= P\left( \sum_{j=1}^{N-\tau} Z_j + \frac{N-\tau}{\tau}(D_\tau + \varepsilon) > \frac{N-\tau}{\tau} D_\tau + \varepsilon\left( \frac{N-\tau}{\tau} + 1 \right) \right) \qquad \text{(definition of } \hat{\mu}_\tau \text{ and } \varepsilon_n\text{)}$$
$$= P\left( \sum_{j=1}^{N-\tau} Z_j > \varepsilon \right) \qquad \text{(simplifying)}$$
$$= P\left( \sum_{j=1}^{N-\tau} \frac{Z_j}{C_+ - C_-} > H_N^{-1}(\delta/2) \right) \qquad \text{(definition of } \varepsilon \text{ and dividing by } C_+ - C_-\text{)}$$
$$\leq P\left( \sum_{j=1}^{n} \frac{Z_j}{C_+ - C_-} > H_N^{-1}(\delta/2) \text{ for some } n \in \{1, \ldots, N\} \right) \qquad \text{(enlarging the event)}$$

We are now ready to use Theorem 9. Since $(Z'_j, F_j)_{j \in \{1,\ldots,N\}}$ is a martingale difference, $(Z_{\min(j,N)}, F_{\min(\tau+j,N)})$ is a martingale difference as well (Corollary 14). Further note that the random variables $\frac{Z_j}{C_+ - C_-}$ are bounded from above by 1. Hence, there is only one ingredient missing to apply Theorem 9, which is a bound on the conditional variance. To this end, we use Popoviciu's inequality. The latter is applicable since the $\frac{Z_j}{C_+ - C_-}$ are also bounded from below by $-1$.

$$P\left( \sum_{j=1}^{n} \frac{Z_j}{C_+ - C_-} > H_N^{-1}(\delta/2) \text{ for some } n \in \{1, \ldots, N\} \right)$$
$$= P\left( \sum_{j=1}^{n} \frac{Z_j}{C_+ - C_-} > H_N^{-1}(\delta/2), \; \sum_{j=1}^{n} \mathrm{V}\left[ \frac{Z_j}{C_+ - C_-} \,\middle|\, F_{j-1} \right] \leq N \text{ for some } n \in \{1, \ldots, N\} \right) \qquad \text{(by Popoviciu's inequality (Theorem 10))}$$
$$\leq H_N(H_N^{-1}(\delta/2), N) \qquad \text{(by Theorem 9, where } H_N \text{ is defined in that theorem)}$$
$$= H_N(H_N^{-1}(\delta/2)) \leq \delta/2 . \qquad (22) \qquad \text{(definition of } H_N\text{)}$$

Now we will take care of the second addend in Eq. (21), using the assumption that the $f_j$ decrease in expectation, Eq. ( ). We will again apply Theorem 9.

$$P\left( \sum_{j=\tau+1}^{N} \mathrm{E}[f_j \mid F_{j-1}] > \frac{N-\tau}{\tau}(D_\tau + \varepsilon) \right) \leq P\left( \sum_{j=\tau+1}^{N} \mathrm{E}[f_{\tau+1} \mid F_\tau] > \frac{N-\tau}{\tau}(D_\tau + \varepsilon) \right) \qquad \text{(using Eq.}
( ))
$$= P\left( \tau\, \mathrm{E}[f_{\tau+1} \mid F_\tau] > D_\tau + \varepsilon \right) \qquad \text{(dividing by } N-\tau \text{ and multiplying by } \tau\text{)}$$
$$= P\left( \sum_{j=1}^{\tau} \left( \mathrm{E}[f_{\tau+1} \mid F_\tau] - f_j \right) > \varepsilon \right) \qquad \text{(definition of } D_\tau\text{)}$$
$$\leq P\left( \sum_{j=1}^{\tau} \left( \mathrm{E}[f_{j+1} \mid F_j] - f_j \right) > \varepsilon \right) \qquad \text{(using again Eq. ( ))}$$
$$= P\left( \sum_{j=1}^{\tau} \frac{\mathrm{E}[f_{j+1} \mid F_j] - f_j}{C_+ - C_-} > H_N^{-1}(\delta/2) \right) \qquad \text{(definition of } \varepsilon \text{ and dividing by } C_+ - C_-\text{)}$$
$$\leq P\left( \sum_{j=1}^{\tau} \frac{-Z'_j}{C_+ - C_-} > H_N^{-1}(\delta/2) \right) \qquad \text{(using Eq. ( ) and the definition of } Z'_j\text{)}$$

Changing the sign does not change the martingale difference property, and hence $(-Z'_j, F_j)_{j \in \{1,\ldots,N\}}$ is a martingale difference as well. We can apply the same argument as in Eq. (22):

$$P\left( \sum_{j=1}^{n} \frac{-Z'_j}{C_+ - C_-} > H_N^{-1}(\delta/2) \text{ for some } n \in \{1, \ldots, N\} \right) \leq \delta/2 \qquad \text{(using the same argument as in Eq. (22))}$$

Appendix E. Results

This section contains the complete results from the experiments described in Section 5. In Appendix E.1 we show the results for varying length-scale $\ell$ in Figs. 6 to 11. Considering the same datasets, Fig. 12 in Appendix E.2 shows the relative error when using the method of Gardner et al. (2018) with default parameters for the RBF kernel. Appendix E.3 shows the results for fixed length-scale and varying observation noise in Figs. 13 to 17. Appendix E.5 shows with Figs. 23 to 27 how varying the noise affects the evolution of the bounds.

E.1 Varying length-scale

Figure 6: Relative execution times to compute the log-determinant using RBF, Eq. (13) (left panel), and OU, Eq. (14) (right panel), kernels on the PROTEIN dataset for $\theta = 1$, $\log \ell = -1, \ldots, 3$ and $\delta = 0.1$ for ten repetitions. The number next to 1 on the y-axis displays the absolute execution time of the default Cholesky. The solid, horizontal, orange line visualizes the 105% mark. The x-axis displays a desired absolute precision on the diagonal elements $d$ (top) and the average corresponding desired relative precision $r$ (bottom) on the log-determinant.

Figure 7: Same as Figure 6, but for the TAMILNADU dataset.

Figure 8: Same as Figure 6, but for the BANK dataset.

Figure 9: Same as Figure 6, but for the METRO dataset.
Figure 10: Same as Figure 6, but for the PM2.5 dataset.

Figure 11: Same as Figure 6, but for the PUMADYN dataset.

E.2 Stochastic Trace Estimator

Figure 12: The need for theoretical guarantees. Of the related work described in Section 2.2, only Gardner et al. (2018) provide publicly accessible code. The figure shows the achieved relative error $r$, Eq. (3), when using default parameters, for the RBF kernel, Eq. (13), with $\theta := 1$ and different length-scales $\ell$, on all our considered datasets (see Table 1). The relative error is more often than not worse than 0.1 and can differ over two orders of magnitude.
Theorem 2 in Gardner et al. (2018) which could describe how to set the parameters of their method to achieve a desired precision is not applicable in this setting (see Section 2.2). Bartels, Boomsma, Frellsen and Garreau E.3 Varying observation noise probabilistic probabilistic (column-wise) pivoting default 5% overhead RBF Eq. (13) OU Eq. (14) σ2 = 0.0001 σ2 = 0.0001 Figure 13: Relative execution times to compute the log-determinant using RBF (left panel) and OU (right panel) kernels on the PROTEIN dataset for θ = 1, ℓ= 1, log10 σ2 = 4, . . . , 0 and δ = 0.1 for ten repetitions. The number next to one on the y-axis displays the absolute execution times of the default Cholesky. The solid, horizontal, orange line ( ) visualizes the 105% mark. The x-axis displays a desired absolute precision on the diagonal elements d (top) and the average corresponding desired relative precision r (bottom) on the log-determinant. Kernel-Matrix Determinant Estimates from stopped Cholesky Decomposition probabilistic probabilistic (column-wise) pivoting default 5% overhead RBF Eq. (13) OU Eq. (14) σ2 = 0.0001 σ2 = 0.0001 Figure 14: Relative execution times to compute the log-determinant using RBF (left panel) and OU (right panel) kernels on the TAMILNADU dataset for θ = 1, ℓ= 1, log10 σ2 = 4, . . . , 0 and δ = 0.1 for ten repetitions. The number next to one on the y-axis displays the absolute execution times of the default Cholesky. The solid, horizontal, orange line ( ) visualizes the 105% mark. The x-axis displays a desired absolute precision on the diagonal elements d (top) and the average corresponding desired relative precision r (bottom) on the log-determinant. Bartels, Boomsma, Frellsen and Garreau probabilistic probabilistic (column-wise) pivoting default 5% overhead RBF Eq. (13) OU Eq. 
Figure 15: Relative execution times to compute the log-determinant using the RBF (left panel, Eq. (13)) and OU (right panel, Eq. (14)) kernels on the METRO dataset for θ = 1, ℓ = 1, log₁₀ σ² = −4, . . . , 0 and δ = 0.1, over ten repetitions. The number next to one on the y-axis gives the absolute execution time of the default Cholesky decomposition. The solid, horizontal, orange line marks the 105% level. The x-axis shows the desired absolute precision d on the diagonal elements (top) and the corresponding average desired relative precision r on the log-determinant (bottom).

Figure 16: Relative execution times to compute the log-determinant using the RBF (left panel, Eq. (13)) and OU (right panel, Eq. (14)) kernels on the BANK dataset for θ = 1, ℓ = 1, log₁₀ σ² = −4, . . . , 0 and δ = 0.1, over ten repetitions. The number next to one on the y-axis gives the absolute execution time of the default Cholesky decomposition. The solid, horizontal, orange line marks the 105% level. The x-axis shows the desired absolute precision d on the diagonal elements (top) and the corresponding average desired relative precision r on the log-determinant (bottom).

Figure 17: Relative execution times to compute the log-determinant using the RBF (left panel, Eq. (13)) and OU (right panel, Eq. (14)) kernels on the PM2.5 dataset for θ = 1, ℓ = 1, log₁₀ σ² = −4, . . . , 0 and δ = 0.1, over ten repetitions. The number next to one on the y-axis gives the absolute execution time of the default Cholesky decomposition. The solid, horizontal, orange line marks the 105% level.
The x-axis shows the desired absolute precision d on the diagonal elements (top) and the corresponding average desired relative precision r on the log-determinant (bottom).

E.4 Bound progression for varying lengthscale

Figure 18: Progression of the bounds from Eqs. (6), (7) and (16) over the number of processed datapoints, using the RBF (left panel, Eq. (13)) and OU (right panel, Eq. (14)) kernels on the PROTEIN dataset for σ² = 10⁻³, θ = 1, log ℓ = 1, . . . , 3 and δ = 0.1, over ten repetitions. The variance between repetitions is so small that only one line is visible to the eye. For the Cholesky decomposition with pivoting, continuous data is not available. The stopping points correspond to the ones described in Section 5.4.

Figure 19: Progression of the bounds from Eqs. (6), (7) and (16) over the number of processed datapoints, using the RBF (left panel, Eq. (13)) and OU (right panel, Eq. (14)) kernels on the TAMILNADU dataset for σ² = 10⁻³, θ = 1, log ℓ = 1, . . . , 3 and δ = 0.1, over ten repetitions. The variance between repetitions is so small that only one line is visible to the eye. For the Cholesky decomposition with pivoting, continuous data is not available. The stopping points correspond to the ones described in Section 5.4.

Figure 20: Progression of the bounds from Eqs. (6), (7) and (16) over the number of processed datapoints, using the RBF (left panel, Eq. (13)) and OU (right panel, Eq. (14)) kernels on the BANK dataset for σ² = 10⁻³, θ = 1, log ℓ = 1, . . . , 3 and δ = 0.1, over ten repetitions.
The variance between repetitions is so small that only one line is visible to the eye. For the Cholesky decomposition with pivoting, continuous data is not available. The stopping points correspond to the ones described in Section 5.4.

Figure 21: Progression of the bounds from Eqs. (6), (7) and (16) over the number of processed datapoints, using the RBF (left panel, Eq. (13)) and OU (right panel, Eq. (14)) kernels on the METRO dataset for σ² = 10⁻³, θ = 1, log ℓ = 1, . . . , 3 and δ = 0.1, over ten repetitions. The variance between repetitions is so small that only one line is visible to the eye. For the Cholesky decomposition with pivoting, continuous data is not available. The stopping points correspond to the ones described in Section 5.4.

Figure 22: Progression of the bounds from Eqs. (6), (7) and (16) over the number of processed datapoints, using the RBF (left panel, Eq. (13)) and OU (right panel, Eq. (14)) kernels on the PM2.5 dataset for σ² = 10⁻³, θ = 1, log ℓ = 1, . . . , 3 and δ = 0.1, over ten repetitions. The variance between repetitions is so small that only one line is visible to the eye. For the Cholesky decomposition with pivoting, continuous data is not available. The stopping points correspond to the ones described in Section 5.4.

E.5 Bound progression for varying observation noise

Figure 23: Progression of the bounds from Eqs. (6), (7) and (16) over the number of processed datapoints, using the RBF (left panel, Eq. (13)) and OU (right panel, Eq. (14)) kernels on the PROTEIN dataset for log₁₀ σ² = −4, . . . , 0, θ = 1, log ℓ = 0 and δ = 0.1, over ten repetitions. The variance between repetitions is so small that only one line is visible to the eye. For the Cholesky decomposition with pivoting, continuous data is not available. The stopping points correspond to the ones described in Section 5.4.

Figure 24: Progression of the bounds from Eqs. (6), (7) and (16) over the number of processed datapoints, using the RBF (left panel, Eq. (13)) and OU (right panel, Eq. (14)) kernels on the TAMILNADU dataset for log₁₀ σ² = −4, . . . , 0, θ = 1, log ℓ = 0 and δ = 0.1, over ten repetitions. The variance between repetitions is so small that only one line is visible to the eye. For the Cholesky decomposition with pivoting, continuous data is not available. The stopping points correspond to the ones described in Section 5.4.

Figure 25: Progression of the bounds from Eqs. (6), (7) and (16) over the number of processed datapoints, using the RBF (left panel, Eq. (13)) and OU (right panel, Eq. (14)) kernels on the PM2.5 dataset for log₁₀ σ² = −4, . . . , 0, θ = 1, log ℓ = 0 and δ = 0.1, over ten repetitions. The variance between repetitions is so small that only one line is visible to the eye. For the Cholesky decomposition with pivoting, continuous data is not available. The stopping points correspond to the ones described in Section 5.4.
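The bounds of Eqs. (6), (7) and (16) are not restated in this appendix. As a hedged illustration of the mechanism these figures trace, the sketch below tracks much cruder, fully deterministic bounds while an unpivoted Cholesky decomposition processes one datapoint at a time: after m points the log-determinant decomposes as the running sum S_m = 2 Σ_{j≤m} log L_jj plus the log-determinant of the remaining Schur complement, which is squeezed between (n−m) log σ² and (n−m) log(θ+σ²). The function name and all constants are hypothetical; this is not the paper's estimator.

```python
import numpy as np

def logdet_bounds_progression(K, sigma2, theta):
    """Deterministic running bounds on log det(K + sigma2*I).

    The Schur complement S remaining after m processed points satisfies
    sigma2*I <= S (since K is PSD) and diag(S) <= theta + sigma2 (for a
    kernel with k(x, x) = theta), so by Hadamard's inequality
        S_m + (n-m)*log(sigma2) <= log det <= S_m + (n-m)*log(theta+sigma2),
    where S_m = 2 * sum_{j<=m} log L_jj. Much weaker than the paper's
    probabilistic bounds, but converging to the exact value at m = n.
    """
    A = K + sigma2 * np.eye(K.shape[0])
    n = A.shape[0]
    L = np.zeros_like(A)
    S = 0.0
    lower, upper = [], []
    for m in range(n):
        # standard (unpivoted) Cholesky step for column m
        L[m, m] = np.sqrt(A[m, m] - L[m, :m] @ L[m, :m])
        L[m + 1:, m] = (A[m + 1:, m] - L[m + 1:, :m] @ L[m, :m]) / L[m, m]
        S += 2.0 * np.log(L[m, m])
        lower.append(S + (n - m - 1) * np.log(sigma2))
        upper.append(S + (n - m - 1) * np.log(theta + sigma2))
    return np.array(lower), np.array(upper)
```

Plotting `lower` and `upper` against the number of processed datapoints reproduces the qualitative shape of the figures: two curves that tighten as more columns are processed and meet at the exact log-determinant.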
Figure 26: Progression of the bounds from Eqs. (6), (7) and (16) over the number of processed datapoints, using the RBF (left panel, Eq. (13)) and OU (right panel, Eq. (14)) kernels on the METRO dataset for log₁₀ σ² = −4, . . . , 0, θ = 1, log ℓ = 0 and δ = 0.1, over ten repetitions. The variance between repetitions is so small that only one line is visible to the eye. For the Cholesky decomposition with pivoting, continuous data is not available. The stopping points correspond to the ones described in Section 5.4.

Figure 27: Progression of the bounds from Eqs. (6), (7) and (16) over the number of processed datapoints, using the RBF (left panel, Eq. (13)) and OU (right panel, Eq. (14)) kernels on the BANK dataset for log₁₀ σ² = −4, . . . , 0, θ = 1, log ℓ = 0 and δ = 0.1, over ten repetitions. The variance between repetitions is so small that only one line is visible to the eye. For the Cholesky decomposition with pivoting, continuous data is not available. The stopping points correspond to the ones described in Section 5.4.

References

Artem Artemev, David R. Burt, and Mark van der Wilk. Tighter bounds on the log marginal likelihood of Gaussian process regression using conjugate gradients. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 362–372, 2021.

Simon Bartels. Probabilistic Linear Algebra. PhD thesis, University of Tübingen, 2020.
Simon Bartels, Kristoffer Stensbo-Smidt, Pablo Moreno-Muñoz, Wouter Boomsma, Jes Frellsen, and Søren Hauberg. Adaptive Cholesky Gaussian processes. In Proceedings of Artificial Intelligence and Statistics (AISTATS), to appear, 2023. http://arxiv.org/abs/2202.10769.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 1st edition, 2013.

Christos Boutsidis, Petros Drineas, Prabhanjan Kambadur, Eugenia-Maria Kontopoulou, and Anastasios Zouzias. A randomized algorithm for approximating the log determinant of a symmetric positive definite matrix. Linear Algebra and its Applications, 533:95–117, 2017.

Krzysztof Chalupka, Christopher K. I. Williams, and Iain Murray. A framework for evaluating approximation methods for Gaussian process regression. Journal of Machine Learning Research, 14(1):333–350, 2013.

James Davidson. Stochastic Limit Theory: An Introduction for Econometricians. Oxford University Press, 1994.

P. Diaconis. Bayesian numerical analysis. Statistical Decision Theory and Related Topics, IV(1):163–175, 1988.

Kun Dong, David Eriksson, Hannes Nickisch, David Bindel, and Andrew G. Wilson. Scalable log determinants for Gaussian process kernel learning. In Advances in Neural Information Processing Systems, pages 6330–6340, 2017.

Sebastian Dorn and Torsten A. Enßlin. Stochastic determination of matrix determinants. Physical Review E, 92:013302, 2015.

Dheeru Dua and Casey Graff. UCI Machine Learning Repository, 2019. URL http://archive.ics.uci.edu/ml.

Xiequan Fan, Ion Grama, and Quansheng Liu. Hoeffding's inequality for supermartingales. Stochastic Processes and their Applications, 122(10):3545–3559, 2012.

Jack Fitzsimons, Kurt Cutajar, Michael Osborne, Stephen Roberts, and Maurizio Filippone. Bayesian inference of log determinants. In Gal Elidan, Kristian Kersting, and Alexander T.
Ihler, editors, Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, August 11-15, 2017, Sydney, Australia, 2017a.

Jack Fitzsimons, Diego Granziol, Kurt Cutajar, Michael Osborne, Maurizio Filippone, and Stephen Roberts. Entropic trace estimates for log determinants. In Michelangelo Ceci, Jaakko Hollmén, Ljupčo Todorovski, Celine Vens, and Sašo Džeroski, editors, Machine Learning and Knowledge Discovery in Databases, pages 323–338, 2017b.

Jacob R. Gardner, Geoff Pleiss, David Bindel, Kilian Q. Weinberger, and Andrew G. Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Advances in Neural Information Processing Systems, 2018.

Alan George, Michael T. Heath, and Joseph Liu. Parallel Cholesky factorization on a shared-memory multiprocessor. Linear Algebra and its Applications, 77:165–187, 1986.

Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, 4th edition, 2013.

Geoffrey Grimmett and David Stirzaker. Probability and Random Processes. Oxford University Press, 3rd edition, 2001.

Helmut Harbrecht, Michael Peters, and Reinhold Schneider. On the low-rank approximation by the pivoted Cholesky decomposition. Applied Numerical Mathematics, 62(4):428–440, 2012.

P. Hennig, M. A. Osborne, and M. Girolami. Probabilistic numerics and uncertainty in computations. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 471(2179), 2015.

Xuan Liang, Tao Zou, Bin Guo, Shuo Li, Haozhe Zhang, Shuyi Zhang, Hui Huang, and Song Xi Chen. Assessing Beijing's PM2.5 pollution: severity, weather impact, APEC and winter heating. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 471(2182):20150257, 2015.

Volodymyr Mnih. Efficient stopping rules. Master's thesis, University of Alberta, Canada, 2008.

Volodymyr Mnih, Csaba Szepesvári, and Jean-Yves Audibert.
Empirical Bernstein stopping. In Proceedings of the 25th International Conference on Machine Learning, pages 672–679, 2008.

Jonas Močkus. On Bayesian methods for seeking the extremum. In Gury I. Marchuk, editor, Optimization Techniques IFIP Technical Conference, volume 27 of Lecture Notes in Computer Science, pages 400–404, 1975.

Sérgio Moro, Paulo Cortez, and Paulo Rita. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62:22–31, 2014.

Tiberiu Popoviciu. Sur les équations algébriques ayant toutes leurs racines réelles. Mathematica, 9:129–145, 1935.

Carl E. Rasmussen and Christopher K. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

Arvind K. Saibaba, Alen Alexanderian, and Ilse C. F. Ipsen. Randomized matrix-free trace and log-determinant estimators. Numerische Mathematik, 137(2):353–395, 2017.

Matthias Seeger. Skilling techniques for Bayesian analysis. 2000.

Rajesh Sharma, Madhu Gupta, and Girish Kapoor. Some better bounds on the variance with applications. Journal of Mathematical Inequalities, 4:355–363, 2010.

John Skilling. The Eigenvalues of Mega-dimensional Matrices, pages 455–466. 1989.

Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1257–1264, 2006.

Christian Steinruecken, Emma Smith, David Janz, James Lloyd, and Zoubin Ghahramani. The Automatic Statistician. In Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren, editors, Automated Machine Learning, Series on Challenges in Machine Learning, 2019.

Shashanka Ubaru, Jie Chen, and Yousef Saad. Fast estimation of tr(f(A)) via stochastic Lanczos quadrature. SIAM Journal on Matrix Analysis and Applications, 38(4):1075–1099, 2017.

Qian Wang, Xianyi Zhang, Yunquan Zhang, and Qing Yi. AUGEM: Automatically generate high performance dense linear algebra kernels on x86 CPUs.
In SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 1–12, 2013.