# biasfree_scalable_gaussian_processes_via_randomized_truncations__91326108.pdf

Bias-Free Scalable Gaussian Processes via Randomized Truncations

Andres Potapczynski * 1 Luhuan Wu * 2 Dan Biderman * 1 Geoff Pleiss 1 John P. Cunningham 1 2

Scalable Gaussian Process methods are computationally attractive, yet introduce modeling biases that require rigorous study. This paper analyzes two common techniques: early truncated conjugate gradients (CG) and random Fourier features (RFF). We ﬁnd that both methods introduce a systematic bias on the learned hyperparameters: CG tends to underﬁt while RFF tends to overﬁt. We address these issues using randomized truncation estimators that eliminate bias in exchange for increased variance. In the case of RFF, we show that the bias-to-variance conversion is indeed a trade-off: the additional variance proves detrimental to optimization. However, in the case of CG, our unbiased learning procedure meaningfully outperforms its biased counterpart with minimal additional computation. Our code is available at https://github.com/ cunningham-lab/RTGPS.

1. Introduction

Gaussian Processes (GP) are popular and expressive nonparametric models, and considerable effort has gone into alleviating their cubic runtime complexity. Notable successes include inducing point methods (e.g. Snelson & Ghahramani, 2006; Titsias, 2009; Hensman et al., 2013), ﬁnite-basis expansions (e.g. Rahimi & Recht, 2008; Mutn y & Krause, 2018; Wilson et al., 2020; Loper et al., 2020), nearest neighbor truncations (e.g. Datta et al., 2016; Katzfuss et al., 2021), and iterative numerical methods (e.g. Cunningham et al., 2008; Cutajar et al., 2016; Gardner et al., 2018). Common to these techniques is the classic speed-bias tradeoff: coarser GP approximations afford faster but more biased solutions that in turn affect both the model s predictions and learned hyperparameters. While a few papers analyze the

*Equal contribution (randomly ordered). 1Zuckerman Institute, Columbia University 2Statistics Department, Columbia University. Correspondence to: Andres Potapczynski <ap3635@columbia.edu>, Luhuan Wu <lw2827@columbia.edu>, Dan Biderman <db3236@columbia.edu>.

Proceedings of the 38 th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

bias of inducing point methods (Bauer et al., 2016; Burt et al., 2019), the biases of other approximation techniques, and their subsequent impact on learned GP models, have not been rigorously studied.

Here we scrutinize the biases of two popular techniques random Fourier features (RFF) (Rahimi & Recht, 2008) and conjugate gradients (CG) (e.g. Cunningham et al., 2008; Cutajar et al., 2016; Gardner et al., 2018). These methods are notable due to their popularity and because they allow dynamic control of the speed-bias tradeoff: at any model evaluation, the user can adjust the number of CG iterations or RFF features to a desired level of approximation accuracy. In practice, it is common to truncate these methods to a ﬁxed number of iterations/features that is deemed adequate. However, such truncation will stop short of an exact (machine precision) solution and potentially lead to biased optimization outcomes.

We provide a novel theoretical analysis of the biases resulting from RFF and CG on the GP log marginal likelihood objective. Speciﬁcally, we prove that CG is biased towards hyperparameters that underﬁt the data, while RFF is biased towards overﬁtting. In addition to yielding suboptimal hyperparameters, these biases hurt posterior predictions, regardless of the inference method used at test-time. Perhaps surprisingly, this effect is not subtle, as we will demonstrate. Our analysis suggests there is value in debiasing GP learning with CG and RFF. To do so, we turn to recent work that shows the merits of exchanging the speed-bias tradeoff for a speed-variance tradeoff (Beatson & Adams, 2019; Chen et al., 2019; Luo et al., 2020; Oktay et al., 2020). These works all introduce a randomization procedure that reweights elements of a fast truncated estimator, eliminating its bias at the cost of increasing its variance.

We thus develop bias-free versions of GP learning with CG and RFF using randomized truncation estimators. In short, we randomly truncate the number of CG iterations and RFF features, while reweighting intermediate solutions to maintain unbiasedness. Our variant of CG uses the Russian Roulette estimator (Kahn, 1955), while our variant of RFF uses the Single Sample estimator of Lyne et al. (2015). We believe our RR-CG and SS-RFF methods to be the ﬁrst to produce unbiased estimators of the GP log marginal likelihood with < O(N 3) computation.

Bias-Free Scalable Gaussian Processes via Randomized Truncations

Finally, through extensive empirical evaluation, we ﬁnd our methods and their biased counterparts indeed constitute a bias-variance tradeoff. Both RR-CG and SS-RFF are unbiased, recovering nearly the same optimum as the exact GP method, while GP trained with CG and RFF often converge to solutions with worse likelihood. We note that bias elimination is not always practical. For SS-RFF, the optimization is slow, due to the large auxiliary variance needed to counteract the slowly decaying bias of RFF. On the other hand, RR-CG incurs a minimal variance penalty, likely due to the favorable convergence properties of CG. In a wide range of benchmark datasets, RR-CG demonstrates similar or better predictive performance compared to CG using the same expected computational time.

To summarize, this work offers three main contributions:

theoretical analysis of the bias of CGand RFF-based GP approximation methods ( 3) RR-CG and SS-RFF: bias-free versions of these popular GP approximation methods ( 4) results demonstrating the value of our RR-CG and SSRFF methods ( 3 and 5).

2. Background

We consider observed data D = {(xi, yi)}N i=1 for xi Rd

and yi R, and the standard GP model:

f( ) GP(µ( ), k( , )),

yi = f(xi) + ϵi, ϵi N 0, σ2

where k( , ) is the covariance kernel, µ( ) is set to zero without loss of generality, and hyperparameters are collected into the vector θ, which is optimized as:

θ = arg minθ L(θ)

L(θ) = log p(y|X; θ)

2 log | b KXX| | {z } model complexity

+ y b K 1 XXy | {z } data ﬁt

+N log 2π (1)

where b KXX RN N is the Gram matrix of all data points with diagonal observational noise:

b KXX[i, j] = k(xi, xj) + σ2Ii=j.

Following standard practice, we optimize θ with gradients:

tr n b K 1 XX b KXX

θ o y b K 1 XX b KXX

θ b K 1 XXy .

Three terms thus dominate the computational complexity: y b K 1 XXy, log | b KXX|, and tr{ b K 1 XX b KXX

θ }. The common

approach to computing this triad is the Cholesky factorization, requiring O(N 3) time and O(N 2) space.

Extensive literature has accelerated the inference and hyperparameter learning of GP. Two very popular strategies are using conjugate gradients (Cunningham et al., 2008; Cutajar et al., 2016; Gardner et al., 2018; Wang et al., 2019) to approximate the linear solves in Eq. (2), and random Fourier features (Rahimi & Recht, 2008), which constructs a randomized ﬁnite-basis approximation of the kernel.

2.1. Conjugate Gradients

To apply conjugate gradients to GP learning, we begin by replacing the gradient in Eq. (2) with a stochastic estimate (Cutajar et al., 2016; Gardner et al., 2018):

z b K 1 XX b KXX

θ z y b K 1 XX b KXX

θ b K 1 XXy ,

where z is a random variable such that E[z] = 0 and E[zz T ] = I. Note that the ﬁrst term constitutes a stochastic estimate of the trace term (Hutchinson, 1989). Thus, stochastic optimization of Eq. (1) can be reduced to computing the linear solves b K 1 XXy and b K 1 XXz.

Conjugate gradients (CG) (Hestenes et al., 1952) is an iterative algorithm for solving positive deﬁnite linear systems A 1b. It consists of a three-term recurrence, where each new term requires only a matrix-vector multiplication with A. More formally, each CG iteration computes a new term of the following summation:

A 1b = PN i=1 γidi, (4)

where the γi are coefﬁcients and the di are conjugate search directions (Golub & Van Loan, 2012). N iterations of CG produce all N summation terms and recover the exact solution. In practice, exact convergence may require more than N iterations due to inaccuracies of ﬂoating point arithmetic. However, the summation converges exponentially, and so J N iterations may sufﬁce to achieve high accuracy.

CG is an appealing method for computing b K 1 XXy and b K 1 XXz due to its computational complexity and its potential for GPU-accelerated matrix products. J iterations takes at most O(JN 2) time and O(N) space if the matrix-vector products are performed in a map-reduce fashion (Wang et al., 2019). However, ill-conditioned kernel matrices hinder the convergence rate (Cutajar et al., 2016), and so the Jth CG iteration may yield an inaccurate approximation of Eq. (3).

2.2. Random Fourier Features

Rahimi & Recht (2008) introduce a randomized ﬁnite-basis approximation to stationary kernels:

k(x, x ) = k(x x ) φ(x) φ(x ) (5)

Bias-Free Scalable Gaussian Processes via Randomized Truncations

where φ(x) RJ and J N. The RFF approximation relies on Bochner s theorem (Bochner et al., 1959): letting τ = x x , all stationary kernels k(τ) on Rd can be exactly expressed as the Fourier dual of a nonnegative measure P(ω): k(τ) = R P(ω) exp(iωτ) dω. A Monte Carlo approximation of this Fourier transform yields:

j=1 exp(iωjτ), ωj P(ω),

which simpliﬁes to a ﬁnite-basis approximation:

KXX [φ(x1) . . . φ(xn)][φ(x1) . . . φ(xn)] ,

φ(x) = [cos ω i x , sin ω i x ]J/2 i=1, ωi P(ω).

For many common kernels, P(ω) can be computed in closedform (e.g. RBF kernels have zero-mean Gaussian spectral densities). The approximated log likelihood can be computed in O(J3+N) time and O(JN) space using the Woodbury inversion lemma and the matrix determinant lemma, respectively. The number of random features J/2 is a user choice, with typical values between 100-1000. More features lead to more accurate kernel approximations.

2.3. Unbiased Randomized Truncation

We will now brieﬂy introduce Randomized Truncation Estimators, which are the primary tool we use to unbias the CG and RFF log marginal likelihood estimates. At a high level, assume that we wish to estimate some quantity ψ that can be expressed as a (potentially-inﬁnite) series:

ψ = PH j=1 j, H N { }.

Here and in the following sections, j can either be random or deterministic. To avoid the expensive evaluation of the full summation, a randomized truncation estimator chooses a random term J {1, . . . , H} with probability mass function P(J) = P(J = J) after which to truncate computation. In the following, we introduce two means of deriving unbiased estimators by upweighting the summation terms.

The Russian Roulette estimator (Kahn, 1955) obtains an unbiased estimator ψJ by truncating the sum after J P(J) terms and dividing the surviving terms by their survival probabilities:

IJ j P(J j)

and, E[ ψJ] = PH i=1 j = ψ. (See appendix for further derivation.) The choice of P(J) determines both the computational efﬁciency and the variance of ψJ. A thin-tailed P(J) will often truncate sums after only a few terms (J H).

However, tail events (J H) are upweighted inversely to their low survival probability, and so thin-tailed truncation distributions may lead to high variance.

The Single Sample estimator (Lyne et al., 2015) implements an alternative reweighting scheme. After drawing J P(J), it computes a single summation term J, which it upweights by 1/P(J):

IJ=j P(J = j)

This procedure is unbiased, and it amounts to estimating ψ using a single importance-weighted sample from P(J) (see appendix). Again, P(J) controls the speed/variance trade-off. We refer the reader to (Beatson & Adams, 2019) for a detailed comparison of these two estimators. We emphasize that both estimators remain unbiased even if j is a random variable, as long as it is independent from the random truncation integer J.

3. GP Learning with CG and RFF is Biased

Here we prove that early truncated CG and RFF provide biased approximations to the terms comprising the GP log marginal likelihood (Eq. 1). We also derive the bias decay rates for each method. We then empirically demonstrate these biases and show they affect the hyperparameters learned through optimization. Remarkably, we ﬁnd that the above biases are highly systematic: CG-based GP learning favors underﬁtting hyperparameters while RFF-based learning favors overﬁtting hyperparameters.

3.1. CG Biases GP Towards Underﬁtting

In the GP literature, CG has often been considered an exact method for computing the log marginal likelihood (Cunningham et al., 2008; Cutajar et al., 2016), as the iterations are only truncated after reaching a pre-speciﬁed residual error threshold (e.g. 10 10). However, as CG is applied to ever-larger kernel matrices it is common to truncate the CG iterations before reaching this convergence (Wang et al., 2019). While this accelerates the hyperparameter learning process, the resulting solves and gradients can no longer be considered exact. In what follows, we show that the earlytruncated CG optimization objective is not only approximate but also systematically biased towards underﬁtting.

To analyze the early-truncation bias, we adopt the analysis of Gardner et al. (2018) that recovers the GP log marginal likelihood (Eq. 1) from the stochastic gradient s CG estimates of b K 1 XXy and b K 1 XXz (Eq. 3). Recall the two terms in the log marginal likelihood are the data ﬁt term y b K 1 XXy and the model complexity term log | b KXX|. The ﬁrst term falls directly out of the CG estimate of b K 1 XXy, while a

Bias-Free Scalable Gaussian Processes via Randomized Truncations

Figure 1. CG (left) systematically overestimates log | b KXX| and underestimates y b K 1 XXy whereas RFF (right) does the opposite. The dashed orange line shows the exact values computed by Cholesky. Our unbiased methods, RR-CG (left) and SS-RFF (right), recover the true log | b KXX| and y b K 1 XXy values. For these two methods, the x-axis indicates the expected number of iterations/features.

Figure 2. Kernel lengthscale values learned by optimizing (biased) CG and RFF log marginal likelihood approximations. CG overestimates the optimal kernel lengthscale whereas RFF underestimates it. We plot the divergence (in log-ratio scale) between the learned and true lengthscales as a function of the number of CG iterations (left) and of the number of RFF samples (right).

stochastic estimate of log | b KXX| can be obtained through the byproducts of CG s computation for b K 1 XXz. Gardner et al. (2018) show that the CG coefﬁcients in Eq. (4) can be manipulated to produce a partial tridiagonalization: T(J) z = Q(J) z b KXXQ(J) z , where T(J) z RJ J is tridiagonal. T(J) z can compute the Stochastic Lanczos Quadrature estimate of b KXX (Ubaru et al., 2017; Dong et al., 2017):

log | b KXX| = E h z T (log b KXX)z i

z 2e 1 log T(J) z e1, (8)

where log( ) is the matrix logarithm and e1 is the ﬁrst row of the identity matrix. The following theorem analyzes the bias of these y b K 1 XXy and log | b KXX| estimates:

Theorem 1. Let u J and v J be the estimates of y b K 1 XXy and log | b KXX| respectively after J iterations of CG; i.e.:

u J = y PJ i=1γidi , v J = z 2e 1 log T(J) z e1.

If J < N, CG underestimates the inverse quadratic term and overestimates the log determinant in expectation:

u J y b K 1 XXy, Ez[v J] log | b KXX|. (9)

The biases of both terms decay at a rate of O(C 2J), where C is a constant that depends on the conditioning of b KXX.

Proof sketch. The direction of the biases can be proved using a connection between CG and numeric quadrature. u J and v J are exactly equal to the J-point Gauss quadrature approximation of y b K 1 XXy and z 2e 1 (log T(N) z )e1 represented as Riemann-Stieltjes integrals. The sign of the CG approximation bias follows from the standard Gauss quadrature error bound, which is negative for y b K 1 XXy and positive for log | b KXX|. The convergence rates are from standard bounds on CG (Golub & Van Loan, 2012) and the analysis of Ubaru et al. (2017). See appendix for a full proof.

Fig. 1 conﬁrms our theoretical analysis and demonstrates the systematic biases of CG. We plot the log marginal likelihood terms for a subset of the Pole Tele UCI dataset, varying the number of CG iterations (J) used to produce the estimates. Compared against the exact terms (computed with Cholesky), we see an overestimation of log | b KXX| and an underestimation of y b K 1 XXy. These biases are most prominent when using few CG iterations.

We turn to study the effect of CG learning on hyperparameters. Since the log marginal likelihood is a nonconvex function of θ, it is not possible to directly prove how the bias affects θ. Nevertheless, we know intuitively that underestimating y b K 1 XXy de-prioritizes model ﬁt while overestimating log | b KXX| over-penalizes model complexity. Thus, the learned hyperparameters will likely underﬁt the data. Underﬁtting may manifest in an overestimation of the learned lengthscale ℓ, as low values of ℓincrease the ﬂexibility and the complexity of the model. This hypothesis is empirically conﬁrmed in Fig. 2 (left panel). We train a GP regression model on a toy dataset: y = x sin(5πx) + ε

Bias-Free Scalable Gaussian Processes via Randomized Truncations

and ε N(0, 0.01). We ﬁx all hyperparameters other than the lengthscale, which is learned using both CG-based optimization and (exact) Cholesky-based optimization. The overestimation of ℓdecays with the number of CG iterations.

3.2. RFF Biases GP Towards Overﬁtting

Previous work has studied the accuracy of RFF s approximation to the entries of the Gram matrix k(x, x ) φ(x) φ(x ) (Rahimi & Recht, 2008; Sutherland & Schneider, 2015). However, to the best of our knowledge there has been little analysis of nonlinear functions of this approximate Gram matrix, such as y b K 1 XXy and log | b KXX| appearing in the GP objective. Interestingly, we ﬁnd that RFF systematically biases these terms:

Theorem 2. Let e KJ be the RFF approximation with J/2 random features. In expectation, e KJ overestimates the inverse quadratic and underestimates the log determinant:

EP(ω) h y e K 1 J y i y b K 1 XXy (10)

EP(ω) h log | e KJ| i log | b KXX|. (11)

The biases of both terms decay at a rate of O(1/J).

Proof sketch. The direction of the biases is a straightforward application of Jensen s inequality, noting that b K 1 XX is a convex function and log | b KXX| is a concave function. The magnitude of the bias is derived from a second-order Taylor expansion that closely resembles the analysis of Nowozin (2018). See appendix for full proof.

Again, Fig. 1 conﬁrms the systematic biases of RFF, which decay at a rate proportional to the number of features, as predicted by Thm. 2. Hence, RFF should affect the learned hyperparameters in a manner opposite to CG. Overestimating y b K 1 XXy emphasizes data ﬁtting while underestimating log | b KXX| reduces the model complexity penalty, overall resulting in overﬁtting behavior. Following the intuition presented in Sec. 3.1, we expect the lengthscale to be underestimated, as empirically conﬁrmed by Fig. 2 (right panel). The ﬁgure also illustrates the slow decay of the RFF bias.

4. Bias-free Scalable Gaussian Processes

We debias the estimates of both the GP training objective in Eq. (1) and its gradient in Eq. (2) (as approximated by CG and RFF) using unbiased randomized truncation estimators. To see how such estimators apply to GP hyperparameter learning, we note that both CG and RFF recover the true log marginal likelihood (or an unbiased estimate thereof) in their limits:

Observation 1. CG recovers the exact log marginal likeli-

hood in expectation in at most N iterations:

y b K 1 XXy = y PN j=1γjdj , (12)

log | b KXX| = Ez h z 2e 1 (log T(N) z )e1 i . (13)

By the law of large numbers, RFF converges almost surely to the exact log marginal likelihood as the number of random features goes to inﬁnity:

y b K 1 XXy = lim J y e K 1 J y, (14)

log | b KXX| = lim J log | e KJ|. (15)

To maintain the scalability of CG and RFFs while eliminating bias, we express the log marginal likelihood terms in Eqs. (12) to (15) as summations amenable to randomized truncation. We then apply the Russian Roulette and Single Sample estimators of Sec. 2.3 to avoid computing all summation terms while obtaining the same result in expectation.

4.1. Russian Roulette-Truncated CG (RR-CG)

The stochastic gradient in Eq. (3) requires performing two solves: b K 1 XXy and b K 1 XXz. Using the summation formulation of CG (Eq. 4), we can write these two solves as series:

b K 1 XXy = PN i=1 γidi, b K 1 XXz = PN i=1 γ id i,

where each CG iteration computes a new term of the summation. By applying the Russian Roulette estimator from Eq. (6), we obtain the following unbiased estimates:

b K 1 XXy PJ j=1(γjdj)/P(J j), J P(J)

b K 1 XXz PJ

j=1(γ jd j)/P(J j). J P(J ), (16)

These unbiased solves produce an unbiased optimization gradient in Eq. (3); we refer to this approach as Russian Roulette CG (RR-CG). With the appropriate truncation distribution P(J), this estimate affords the same computational complexity of standard CG without its bias.

We must compute two independent estimates of b K 1 XXy with different J P(J) in order for the y b K 1 XX b KXX

θ b K 1 XXy term in Eq. (3) to be unbiased. Thus, the unbiased gradient requires 3 calls to RR-CG, as opposed to the 2 CG calls needed for the biased gradient. Nevertheless, RR-CG has the same O(JN 2) complexity as standard CG and the additional solve can be computed in parallel with the others.

We can also use the Russian Roulette estimator to compute the log marginal likelihood itself, though this is not strictly necessary for gradient-based optimization. (See appendix.)

Choosing the truncation distribution. Since the Russian Roulette estimator is unbiased for any choice of P (J),

Bias-Free Scalable Gaussian Processes via Randomized Truncations

we wish to choose a truncation distribution that balances computational cost and variance1. Beatson & Adams (2019) propose the relative optimization efﬁciency (ROE) metric, which is the ratio of the expected improvement of taking an optimization step with our gradient estimate to its computational cost. A critical requirement of the ROE analysis is the expected rate of decay of our approximations in terms of the number of CG iterations J. We summarize our estimates and choices of distribution as follows:

Theorem 3. The approximation to log | b KXX| and to y b K 1 XXy using RR-CG decays at a rate of O(C 2J). Therefore the truncation distribution that maximizes the ROE is P (J) C 2J, where C is a constant that depends on the conditioning of b KXX. The expected computation and variance of P (J) is ﬁnite.

Proof sketch. Beatson & Adams (2019) show that the truncation distribution that maximizes the ROE is proportional to the rate of decay of our approximation divided by its computational cost. The error of CG decays as O(C 2J), and the cost of each summation term is constant with respect to J. In practice, we vary the exponential decaying rate to control the expectation and variance of P(J). To further reduce the variance of our estimator, we set a minimum number of CG iterations to be computed, as in (Luo et al., 2020). See appendix for full proof.

In practice, however, we do not have access to C since computing the conditioning of b KXX is impractical. Yet, we can change the base of the exponential to e and add a temperature parameter λ to rescale the function and control the rate of decay of the truncation distribution as a sensible alternative. Thus, we follow a more general exponential decay distribution:

P(J) e λJ, J = Jmin, , N (17)

where Jmin is the minimum truncation number. By varying the values of λ and Jmin, we can control the expectation and standard deviation of P(J). In practice we found that having the standard deviation value between 10 and 20 achieves stable GP learning process, which can be obtained by tuning λ between 0.05 and 0.1 for sufﬁciently large datasets (e.g. N 500). We also noticed that the method is not sensitive to these choices of hyperparameters and that they work well across all the experiments. The expected truncation number can be further tuned by varying Jmin. We emphasize that these choices impact the speed-variance tradeoff. By setting a larger Jmin we decrease the speed by requiring more baseline computations but also decrease the variance (since the minimum approximations have the largest deviations from the ground truth).

1Solely minimizing variance is not appealing, as this is achieved by P (J) = IJ=N which has no computational savings.

Toy problem. In Fig. 1 we plot the empirical mean of the RR-CG estimator using 104 samples from an exponential truncation distribution. We ﬁnd that RR-CG produces unbiased estimates of the y b K 1 XXy and log | b KXX| terms that are indistinguishable from the exact values computed with Cholesky. Reducing the expected truncation iteration E(J) (x-axis) increases the standard error of empirical means, demonstrating the speed-variance trade-off.

4.2. Single Sample-Truncated RFF (SS-RFF)

Denoting e Kj as the kernel matrix estimated by j random Fourier features, we can write log | b KXX| as the following telescoping series:

log | b KXX| = log | e K1| +

log | e Kj| log | e Kj 1|

+ log | b KXX| log | e KN/2 1|

= log | e K1| + PN/2 j=2 j, (19)

where j is deﬁned as log | e Kj| log | e Kj 1| for all j < N/2, and N/2 is deﬁned as log | b KXX| log | e KN/2 1|. Note that each j is now a random variable, since it depends on the random Fourier frequencies ω. Crucially, we only include N/2 terms in the series so that no term requires more than O(N 3) computation in expectation. (For any j > N/2, e Kj is a full-rank matrix and thus is as computationally expensive as the true b KXX matrix.) We construct a similar telescoping series for y b K 1 XXy.

As with Eq. (16), we approximate the series in Eq. (19) with a randomized truncation estimator, this time using the Single Sample estimator (7):

log | b KXX| log | e K1| + J/P (J) . (20)

where J is drawn from the truncation distribution P(J) with support over {2, 3, . . . N/2}. Note that the Single Sample estimate requires computing 3 log determinants (log | e K1|, log | e KJ 1|, and log | e KJ|) for a total of O(J3 +NJ2) computations and O(NJ) memory. This is asymptotically the same requirement as standard RFF. The Russian Roulette estimator, on the other hand, incurs a computational cost of O(NJ3 + J4) as it requires computing (log | e K1| through log | e KJ|) which quickly becomes impractical for large J.

A similar Single Sample estimator constructs an unbiased estimate of y b K 1 XXy. Backpropagating through these Single Sample estimates produces unbiased estimates of the log marginal likelihood gradient in Eq. (2).

Choosing the truncation distribution. For the Single Sample estimator we do not have to optimize the ROE since

Bias-Free Scalable Gaussian Processes via Randomized Truncations

minimizing the variance of this estimator does not result in a degenerate distribution.

Theorem 4. The truncation distribution that minimizes the variance of the SS-RFF estimators for log | b KXX| and y b K 1 XXy is P (J) 1/J. The expected variance and computation of P (J) is ﬁnite.

Proof sketch. The minimum variance distribution can be found by solving a constrained optimization problem. In practice, we can further decrease the variance of our estimator by ﬁxing a minimum value of RFF features to be used in Eq. (19) and by increasing the step size (c N) between the elements at each j = log | e Kc J| log | e Kc(J 1)|. See appendix for full proof.

For the experiments we started with 500 features and also tried various step sizes c {1, 10, 100}. The variance of the estimator decreases as we increase c since the probability weights will decrease in magnitude. Yet, despite of using the optimal truncation distribution, setting a high number of features and taking long steps c = 100, the variance of the estimator still requires taking several steps before converging, making SS-RFF computationally impractical.

Toy problem. Similar to RR-CG, in Fig. 1 we plot the empirical mean of the SS-RFF estimator using 104 samples. We ﬁnd that SS-RFF produces unbiased estimates of the y b K 1 XXy and log | b KXX| terms. However, these estimates have a higher variance when compared to the estimates of RR-CG. Reducing the expected truncation iteration E(J) (x-axis) increases the standard error of the empirical means, demonstrating the speed-variance trade-off.

4.3. Analysis of the Bias-free Methods

Randomized truncations and conjugate gradients have existed for many decades (Hestenes et al., 1952; Kahn, 1955), but have rarely been used in conjunction. Filippone & Engler (2015) proposed a method closely related to our RRCG which performs randomized early-truncation of CG iterations to obtain unbiased posterior samples of the GP covariance parameters. We differ by tackling the GP hyperparameter learning problem: we provide the ﬁrst theoretical analysis of the biases incurred by both CG and RFF, and proceed to tailor unbiased estimators for each method.

To some extent, randomized truncation methods are antithetical to the original intention of CG: producing deterministic and nearly exact solves. For large-scale applications, where early truncation is necessary for computational tractability, the ability to trade bias for variance is beneﬁcial. This fact is especially true in the context of GP learning, where the bias of early truncation is systematic and cannot be simply explained away as numerical imprecision.

Randomized truncation estimates are often used to estimate

Figure 3. Optimization landscape of a GP with two hyperparameters. The SS-RFF and RR-CG models converge to similar hyperparameter values that are nearly optimal, while the RFF and CG models converge to suboptimal solutions. In addition, the stochastic effect of the randomized truncation is visible in the trajectories of RR-CG and SS-RFF. Moreover, CG (and RR-CG) models truncate after 20 iterations (in expectation); RFF (and SS-RFF) models use 700 features (in expectation).

inﬁnite series, where it is challenging to design truncation distributions with ﬁnite expected computation and/or variance. We avoid such issues since CG and the telescoping RFF summations are both ﬁnite.

First, we show that our bias-free methods recover nearly the same hyperparameters as exact methods (i.e. Choleskybased optimization), whereas models that use CG and RFF converge to suboptimal hyperparameters. Since RR-CG and SS-RFF eliminate bias at the cost of increased variance, we then demonstrate the optimization convergence rate and draw conclusions on our methods applicability. Finally, we compare models optimized with RR-CG against a host of approximate GP methods across a wide range of UCI datasets (Asuncion & Newman, 2007). All experiments are implemented in GPy Torch (Gardner et al., 2018)

Optimization trajectories of bias-free GP. Fig. 3 displays the optimization landscape the log marginal likelihood of the Pole Tele dataset as a function of (RBF) kernel lengthscale ℓand noise σ2. As expected, an exact GP (optimized using Cholesky, Fig. 3 upper left) recovers the optimum. Notably, the GP trained with standard CG and RFF converges to suboptimal hyperparameters (upper right/lower right). RR-CG and SS-RFF models (trained with 20 iterations and 700 features in expectation, respectively) successfully eliminate this bias, and recover nearly

Bias-Free Scalable Gaussian Processes via Randomized Truncations

Figure 4. The GP optimization objective for models trained with RFF and SS-RFF. (Pole Tele dataset, RBF kernel, Adam optimizer.) RFF models converge to sub-optimal log marginal likelihoods. SSRFF models converge to (near) optimum values, yet require more than 100 as many optimization steps.

Figure 5. The GP optimization objective for models trained with CG and RR-CG. (Bike dataset, RBF kernel, Adam optimizer.) RR-CG models converge to optimal solutions, while the (biased) CG models diverge. Increasing the expected truncation of RR-CG only slightly improves optimization convergence; models converge in < 100 steps of Adam.

the same parameters as the exact model (upper center/lower left). These plots also show the speed-variance tradeoff of randomized truncation. SS-RFF and RR-CG have noisy optimization trajectories due to auxiliary truncation variance.

Convergence of GP hyperparameter optimization. Figs. 4 and 5 plot the exact GP log marginal likelihood of the parameters learned by each method during optimization. Each trajectory corresponds to a RBF-kernel GP trained on the Pole Tele dataset (Fig. 4) and the Bike dataset (Fig. 5).

Fig. 4 shows that RFF models converge to solutions with worse log likelihoods, and more RFF features slow the rate of optimization. Additionally, we see the cost of the auxiliary variance needed to debias SS-RFF: while SS-RFF models achieve better optima than their biased counterparts, they take 2-3 orders of magnitude longer to converge, despite using a truncation distribution that minimizes variance. We thus conclude that SS-RFF has too much variance to be practical for GP hyperparameter learning.

Fig. 5 on the other hand shows that RR-CG is minimally affected by its auxiliary variance. The GP trained with RR-CG converges in roughly 100 iterations, nearly matching Cholesky-based optimization. Decreasing the expected truncation value from E[J] = 40 to 20 slightly slows this convergence. We note that the bias induced by standard CG can be especially detrimental to GP learning. On this dataset, the biased models deviate from their Cholesky counterparts and eventually diverge away from the optimum.

Predictive performance of bias-free GP. Lastly, we compare the predictive performance of GPs that use RR-CG, CG, and Cholesky for hyperparameter optimization. We emphasize that the RR-CG and CG methods only make use of early truncation approximations during training. At test time, we compute the predictive posterior by running CG to a tolerance of 10 4, which we believe can be considered exact for practical intents and purposes. Additionally, we include four other (biased) scalable GP approximations methods as baselines: RFF, Stochastic Variational Gaussian Processes (SVGP) (Hensman et al., 2013), generalized Product of Expert Gaussian Processes (POE) (Cao & Fleet, 2014; Deisenroth & Ng, 2015), and stochastic gradientbased Gaussian Processes (sg GP) (Chen et al., 2020). We note that the RFF, SVGP, and sg GP methods introduce both bias and variance, as these methods rely on randomization and approximation.

We use CG with J = 100 iterations, and RR-CG with E[J] = 100 expected iterations; both methods use the preconditioner of Gardner et al. (2018). All RFF models use 700 random features. For SVGP, we use 1,024 inducing points and minibatches of size 1,024 as in (Wang et al., 2019). The POE models are comprised of GP experts that are each trained on 1,024 data point subsets. For sg GP, the subsampled datasets are constructed by selecting a random point x, y and its 15 nearest neighbors as in (Chen et al., 2020). Each dataset is randomly split to 64% training, 16% validation and 20% testing sets. All kernels are RBF with a separate lengthscale per dimension. See appendix for more details.

We report prediction accuracy (RMSE) and negative log likelihood (NLL) in Fig. 6 (see appendix for full tables on predictive performance and training time). We make two key observations: (i) RR-CG meaningfully debiases CG. When the bias of CG is not detrimental to optimization (e.g. CG with 100 iterations is close to convergence for the Elevators dataset), RR-CG has similar performance. However, when the CG bias is more signiﬁcant (e.g. the KEGG dataset), the bias-free RR-CG improves the GP predictive RMSE and NLL. We also include a ﬁgure displaying the predictive performance of RR-CG and CG with increasing number of (expected) CG iterations in appendix. (ii) RR-CG recovers the same optimum as the ground-truth method

Bias-Free Scalable Gaussian Processes via Randomized Truncations

Figure 6. Root-mean-square-error (RMSE) and negative log likelihood (NLL) of GP trained with CG (light purple), RR-CG (dark purple) and various approximate methods (grey). Dashed red lines indicates Cholesky-based GP performance (when applicable). Results are averaged over 3 dataset splits. Missing RFF and sg GP results correspond to (very high) outlier NLL / RMSE values. In almost all experiments, GP learning with RR-CG achieves similar or better performance compared to that with CG at the same computational cost.

(i.e. Cholesky) does, as indicated by the red-dashed line in Fig. 6. This result provides additional evidence that RR-CG achieves unbiased optimization.

While RR-CG obtains the lowest RMSE on all but 2 datasets, we note that the (biased) GP approximations sometimes achieve lower NLL. For example, SVGP has a lower NLL than that of RR-CG on the Bike dataset, despite having a higher RMSE. We emphasize that this is not a failing of RR-CG inference. The SVGP NLL is even better than that of the exact (Cholesky) GP, suggesting a potential model misspeciﬁcation for this particular dataset. Since SVGP overestimates the observational noise σ2 (Bauer et al., 2016), it may obtain a better NLL when outliers are abundant. Though we cannot compare against the Cholesky posterior on larger datasets, we hypothesize that the NLL/RMSE discrepancy on these datasets is due to a similar modeling issue.

6. Conclusion

We prove that CG and RFF introduce systematic biases to the GP log marginal likelihood objective: CG-based training will favor underﬁtting models, while RFF-based training will promote overﬁtting. Modifying these methods with randomized truncation converts these biases into variance, enabling unbiased stochastic optimization. Our results show that this bias-to-variance exchange indeed constitutes a trade-off. The convergence of SS-RFF is impractically slow, likely due to the truncation variance needed to eliminate RFF s slowly-decaying bias. However, for CG-based

training, we ﬁnd that variance is almost always preferable to bias. Models trained with RR-CG achieve better performance than those trained with standard CG, and tend to recover the hyperparameters learned with exact methods. Though models trained with CG do not always exhibit noticeable bias, RR-CG s negligible computational overhead is justiﬁable to counteract cases where the bias is signiﬁcant.

We reported experiments with at most 300K observations for our methods and baselines, which is substantial for GPs. We emphasize that RR-CG can be extended to datasets with over one million data points as in Wang et al. (2019). However, the computational cost is much higher, requiring multiple GPUs for training and testing.

We note that the RR-CG algorithm is not limited to GP applications. Future work should explore applying RR-CG to other optimization problems with large-scale solves.

Acknowledgements

This work was supported by the Simons Foundation, Mc Knight Foundation, the Grossman Center, and the Gatsby Charitable Trust.

Asuncion, A. and Newman, D. Uci machine learning repository, 2007.

Bauer, M., van der Wilk, M., and Rasmussen, C. E. Understanding probabilistic sparse gaussian process approxi-

Bias-Free Scalable Gaussian Processes via Randomized Truncations

mations. In Advances in Neural Information Processing Systems, 2016.

Beatson, A. and Adams, R. P. Efﬁcient optimization of loops and limits with randomized telescoping sums. In International Conference on Machine Learning, 2019.

Bochner, S. et al. Lectures on Fourier integrals, volume 42. Princeton University Press, 1959.

Burt, D., Rasmussen, C. E., and Van Der Wilk, M. Rates of convergence for sparse variational gaussian process regression. In International Conference on Machine Learning, pp. 862 871, 2019.

Cao, Y. and Fleet, D. J. Generalized product of experts for automatic and principled fusion of gaussian process predictions. ar Xiv preprint ar Xiv:1410.7827, 2014.

Chen, H., Zheng, L., Al Kontar, R., and Raskutti, G. Stochastic gradient descent in correlated settings: A study on gaussian processes. Advances in Neural Information Processing Systems, 33, 2020.

Chen, R. T. Q., Behrmann, J., Duvenaud, D., and Jacobsen, J.-H. Residual ﬂows for invertible generative modeling. Advances in Neural Information Processing Systems, 2019.

Cunningham, J. P., Shenoy, K. V., and Sahani, M. Fast gaussian process methods for point process intensity estimation. In International Conference on Machine learning, pp. 192 199, 2008.

Cutajar, K., Osborne, M. A., Cunningham, J. P., and Filippone, M. Preconditioning kernel matrices. In International Conference on Machine Learning, 2016.

Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. Hierarchical nearest-neighbor gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111(514):800 812, 2016.

Deisenroth, M. and Ng, J. W. Distributed gaussian processes. In International Conference on Machine Learning, pp. 1481 1490. PMLR, 2015.

Dong, K., Eriksson, D., Nickisch, H., Bindel, D., and Wilson, A. G. Scalable log determinants for gaussian process kernel learning. In Advances in Neural Information Processing Systems, pp. 6327 6337, 2017.

Filippone, M. and Engler, R. Enabling scalable stochastic gradient-based inference for gaussian processes by employing the unbiased linear system solver (ulisse). In International Conference on Machine Learning, pp. 1015 1024. PMLR, 2015.

Gardner, J., Pleiss, G., Weinberger, K. Q., Bindel, D., and Wilson, A. G. Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. In Advances in Neural Information Processing Systems, pp. 7576 7586, 2018.

Golub, G. H. and Van Loan, C. F. Matrix Computations. The Johns Hopkins University Press, 4th Edition, 2012.

Hensman, J., Fusi, N., and Lawrence, N. D. Gaussian processes for big data. In Uncertainty in Artiﬁcial Intelligence, 2013.

Hestenes, M. R., Stiefel, E., et al. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(1), 1952.

Hutchinson, M. F. A stochastic estimator of the trace of the inﬂuence matrix for laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18(3):1059 1076, 1989.

Kahn, H. Use of different monte carlo sampling techniques. Rand Corporation, 1955.

Katzfuss, M., Guinness, J., et al. A general framework for vecchia approximations of gaussian processes. Statistical Science, 36(1):124 141, 2021.

Loper, J., Blei, D., Cunningham, J. P., and Paninski, L. General linear-time inference for gaussian processes on one dimension. ar Xiv preprint ar Xiv:2003.05554, 2020.

Luo, Y., Beatson, A., Norouzi, M., Zhu, J., Duvenaud, D., Adams, R. P., and Chen, R. T. Q. Sumo: Unbiased estimation of log marginal probability for latent variable models. In International Conference on Learned Representations, 2020.

Lyne, A.-M., Girolami, M., Atchadé, Y., Strathmann, H., Simpson, D., et al. On russian roulette estimates for bayesian inference with doubly-intractable likelihoods. Statistical science, 30(4):443 467, 2015.

Mutn y, M. and Krause, A. Efﬁcient high dimensional bayesian optimization with additivity and quadrature fourier features. Advances in Neural Information Processing Systems, pp. 9005 9016, 2018.

Nowozin, S. Debiasing evidence approximations: On importance-weighted autoencoders and jackknife variational inference. In International Conference on Learned Representations, 2018.

Oktay, D., Mc Greivy, N., Aduol, J., Beatson, A., and Adams, R. P. Randomized automatic differentiation. ar Xiv preprint ar Xiv:2007.10412, 2020.

Bias-Free Scalable Gaussian Processes via Randomized Truncations

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177 1184, 2008.

Snelson, E. and Ghahramani, Z. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, 2006.

Sutherland, D. J. and Schneider, J. G. On the error of random fourier feaures. Conference on Uncertainty in Artiﬁcial Intelligence, 2015.

Titsias, M. K. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artiﬁcial Intelligence and Statistics, pp. 567 574, 2009.

Ubaru, S., Chen, J., and Saad, Y. Fast estimation of tr(f(a)) via stochastic lanczos quadrature. SIAM Journal on Matrix Analysis and Applications, 38(4):1075 1099, 2017.

Wang, K., Pleiss, G., Gardner, J., Tyree, S., Weinberger, K. Q., and Wilson, A. G. Exact gaussian processes on a million data points. In Advances in Neural Information Processing Systems, pp. 14648 14659, 2019.

Wilson, J., Borovitskiy, V., Terenin, A., Mostowsky, P., and Deisenroth, M. Efﬁciently sampling functions from gaussian process posteriors. In International Conference on Machine Learning, pp. 10292 10302, 2020.