# Log-Linear-Time Gaussian Processes Using Binary Tree Kernels

Michael K. Cohen, University of Oxford, michael.cohen@eng.ox.ac.uk
Samuel Daulton, University of Oxford, Meta, sdaulton@meta.com
Michael A. Osborne, University of Oxford, mosb@robots.ox.ac.uk

Abstract

Gaussian processes (GPs) produce good probabilistic models of functions, but most GP kernels require $O((n+m)n^2)$ time, where $n$ is the number of data points and $m$ the number of predictive locations. We present a new kernel that allows for Gaussian process regression in $O((n+m)\log(n+m))$ time. Our binary tree kernel places all data points on the leaves of a binary tree, with the kernel depending only on the depth of the deepest common ancestor. We can store the resulting kernel matrix in $O(n)$ space in $O(n \log n)$ time, as a sum of sparse rank-one matrices, and approximately invert the kernel matrix in $O(n)$ time. Sparse GP methods also offer linear run time, but they predict less well than higher-dimensional kernels. On a classic suite of regression tasks, we compare our kernel against Matérn, sparse, and sparse variational kernels. The binary tree GP assigns the highest likelihood to the test data on a plurality of datasets, usually achieves lower mean squared error than the sparse methods, and often ties or beats the Matérn GP. On large datasets, the binary tree GP is fastest, and much faster than a Matérn GP.

1 Introduction

Gaussian processes (GPs) can be used to perform regression with high-quality uncertainty estimates, but they are slow. Naïvely, GP regression requires $O(n^3 + n^2 m)$ computation time and $O(n^2)$ computation space when predicting at $m$ locations given $n$ data points [28]. A kernel matrix of size $n \times n$ must be inverted (or Cholesky decomposed), and then $m$ matrix-vector multiplications must be done with that inverse matrix (or $m$ linear solves with the Cholesky factors). A few methods that we will discuss later achieve $O(n^2 m)$ time complexity [25, 30].

With special kernels, GP regression can be faster and use less space. Inducing point methods, using $z$ inducing points, allow regression to be done in $O(z^2(n+m))$ time and in $O(z^2 + zn)$ space [21, 22, 23, 13]. We will discuss the details of these inducing point kernels later, but they are kernels in their own right, not just approximations to other kernels. Unfortunately, these kernels are low dimensional (having a $z$-dimensional Hilbert space), which limits the expressivity of the GP model.

We present a new kernel, the binary tree kernel, that also allows for GP regression in $O(n+m)$ space and $O((n+m)\log(n+m))$ time (both model fitting and prediction). The time and space complexity of our method is also linear in the depth of the binary tree, which is naïvely linear in the dimension of the data, although in practice we can increase the depth sublinearly. Training some kernel parameters takes time quadratic in the depth of the tree. The dimensionality of the binary tree kernel is exponential in the depth of the tree, making it much more expressive than an inducing points kernel. Whereas for an inducing points kernel the runtime is quadratic in the dimension of the Hilbert space, for the binary tree kernel it is only logarithmic: an exponential speedup.

A simple depiction of our kernel, which we define precisely in Section 3, is shown in Figure 1. First, we create a procedure for placing all data points on the leaves of a binary tree.
Given the binary tree, the kernel between two points depends only on the depth of the deepest common ancestor. Because very different tree structures are possible for the data, we can easily form an ensemble of diverse GP regression models. Figure 2 depicts a schematic sample from a binary tree kernel. Note how the posterior mean is piecewise flat, but the pieces can be small.

Figure 1: A binary tree kernel with four data points. In this example, $k(x_1, x_1) = 1$, $k(x_1, x_2) = 0$, $k(x_1, x_3) = 0.8$, and $k(x_1, x_4) = 0.3$.

On a standard suite of benchmark regression tasks [25], we show that our kernel usually achieves better negative log likelihood (NLL) than state-of-the-art sparse methods and conjugate-gradient-based exact methods, at lower computational cost in the big-data regime.

There are not many limitations to using our kernel. The main limitation is that other kernels sometimes capture the relationships in the data better. We do not have a good procedure for understanding when data has more Matérn character or more binary tree character (except through running both and comparing training NLL). But given that the binary tree kernel usually outperforms the Matérn, we'll tentatively say the best first guess is that a new dataset has more binary tree character. One concrete limitation for some applications, like Bayesian optimization, is that the posterior mean is piecewise-flat, so gradient-based heuristics for finding extrema would not work.

In contexts where a piecewise-flat posterior mean is suitable, we struggle to see when one would prefer a sparse or sparse variational GP to a binary tree kernel. The most thorough approach would be to run both and see which has a better training NLL, but if you had to pick one, the binary tree GP seems to be better performing and comparably fast. If minimizing mean squared error is the objective, the Matérn kernel seems to do slightly better than the binary tree. If the dataset is small, and one needs a very fast prediction, a Matérn kernel may be the best option. But otherwise, if one cares about well-calibrated predictions, these initial results we present tentatively suggest using a binary tree kernel over the widely used Matérn kernel. The log-linear time and linear space complexity of the binary tree GP, with performance exceeding that of a standard kernel, could profoundly expand the viability of GP regression to larger datasets.

2 Preliminaries

Our problem setting is regression. Given a function $f : \mathcal{X} \to \mathbb{R}$, for some arbitrary set $\mathcal{X}$, we would like to predict $f(x)$ for various $x \in \mathcal{X}$. What we have are observations of $f(x)$ for various (other) $x \in \mathcal{X}$. Let $X \in \mathcal{X}^n$ be an $n$-tuple of elements of $\mathcal{X}$, and let $y \in \mathbb{R}^n$ be an $n$-tuple of real numbers, such that $y_i \sim f(X_i) + \mathcal{N}(0, \lambda)$, for $\lambda \in \mathbb{R}_{\geq 0}$. $X$ and $y$ comprise our training data. With an $m$-tuple of test locations $X' \in \mathcal{X}^m$, let $y' \in \mathbb{R}^m$, with $y'_i \sim f(X'_i) + \mathcal{N}(0, \lambda)$; $y'$ is the ground truth for the target locations. Given training data, we would like to produce a distribution over $\mathbb{R}$ for each target location $X'_i$, such that it assigns high marginal probability to the unknown $y'_i$. Alternatively, we sometimes would like to produce point estimates $\hat{y}'_i$ in order to minimize the squared error $(\hat{y}'_i - y'_i)^2$.

A GP prior over functions is defined by a mean function $m : \mathcal{X} \to \mathbb{R}$ and a kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. The expected function value at a point $x$ is defined to be $m(x)$, and the covariance of the function values at two points $x_1$ and $x_2$ is defined to be $k(x_1, x_2)$.
Let $K_{XX} \in \mathbb{R}^{n \times n}$ be the matrix of kernel values $(K_{XX})_{ij} = k(X_i, X_j)$, and let $m_X \in \mathbb{R}^n$ be the vector of mean values $(m_X)_i = m(X_i)$. For a GP to be well-defined, the kernel must be such that $K_{XX}$ is positive semidefinite for any $X \in \mathcal{X}^n$. For a point $x \in \mathcal{X}$, let $K_{Xx} \in \mathbb{R}^n$ be the vector of kernel values $(K_{Xx})_i = k(X_i, x)$, and let $K_{xX} = K_{Xx}^\top$. Let $\lambda \geq 0$ be the variance of observation noise. Let $\mu_x$ and $\sigma^2_x$ be the mean and variance of our posterior predictive distribution at $x$. Then, with $K^{\lambda\mathrm{inv}}_{XX} = (K_{XX} + \lambda I)^{-1}$,

$$\mu_x := (y - m_X)^\top K^{\lambda\mathrm{inv}}_{XX} K_{Xx} + m(x) \quad (1)$$
$$\sigma^2_x := k(x, x) - K_{xX} K^{\lambda\mathrm{inv}}_{XX} K_{Xx} + \lambda. \quad (2)$$

See Williams and Rasmussen [28] for a derivation. We compute Equations 1 and 2 for all $x \in X'$.

Figure 2: A schematic diagram of a function sampled from a binary tree kernel. The function is over the interval [0, 1], and points on the interval are placed onto the leaves of a depth-4 binary tree according to the first 4 bits of their binary expansion. The sampled function is in black. Purple represents the sample if the tree had depth 3, green depth 2, orange depth 1, and red depth 0.

3 Binary tree kernel

We now introduce the binary tree kernel. First, we encode our data points as binary strings. So we have $\mathcal{X} = \mathbb{B}^q$, where $\mathbb{B} = \{0, 1\}$ and $q \in \mathbb{N}$. If $\mathcal{X} = \mathbb{R}^d$, we must map $\mathbb{R}^d \to \mathbb{B}^q$. First, we rescale all points (training points and test points) to lie within the box $[0, 1]^d$. (If we have a stream of test points, and one lands outside of the box $[0, 1]^d$, we can either set $K_{xX}$ to 0 for that point or we rescale and retrain in $O(n \log n)$ time.) Then, for each $x \in [0, 1]^d$, for each dimension, we take the binary expansion up to some precision $p$, and for those $dp$ bits, we permute them using some fixed permutation. We call this permutation the bit order, and it is the same for all $x \in [0, 1]^d$. Note that now $q = dp$. See Figure 3 for an example. We optimize the bit order during training, and we can also form an ensemble of GPs using different bit orders.

Figure 3: Function from $[0, 1]^2 \to \mathbb{B}^8$.

For $x \in \mathbb{B}^q$, let $x^{:i}$ be the first $i$ bits of $x$. $[[\text{expression}]]$ evaluates to 1 if the expression is true, and otherwise 0. We now define the kernel:

Definition 1 (Binary Tree Kernel). Given a weight vector $w \in \mathbb{R}^q$, with $w \geq 0$ and $\|w\|_1 = 1$,
$$k_w(x_1, x_2) = \sum_{i=1}^{q} w_i \, [[x_1^{:i} = x_2^{:i}]].$$

So the more leading bits shared by $x_1$ and $x_2$, the larger the covariance between the function values. Consider, for example, points $x_1$ and $x_4$ from Figure 1, where $x_1$ is (left, left, right) and $x_4$ is (left, right, right); they share only the first leading bit. We train the weight vector $w$ to maximize the likelihood of the training data.

Proposition 1 (Positive Semidefiniteness). For $X \in \mathcal{X}^n$, for $k = k_w$, $K_{XX} \succeq 0$.

Proof. Let $s \in \bigcup_{i=1}^{q} \mathbb{B}^i$ be a binary string, and let $|s|$ be the length of $s$. Let $X_{[s]} \in \mathbb{R}^n$ with $(X_{[s]})_j = [[X_j^{:|s|} = s]]$. Each $X_{[s]} X_{[s]}^\top$ is clearly positive semidefinite. Finally, $K_{XX} = \sum_{i=1}^{q} \sum_{s \in \mathbb{B}^i} w_i X_{[s]} X_{[s]}^\top$, and recall $w_i \geq 0$, so $K_{XX} \succeq 0$.

4 Sparse rank one sum representation

In order to do GP regression in $O(n)$ space and $O(n \log n)$ time, we develop a Sparse Rank One Sum representation of linear operators (SROS). This was developed separately from the very similar Hierarchical matrices [1], which we discuss below. In SROS form, linear transformation of a vector can be done in $O(n)$ time instead of $O(n^2)$. We will store our kernel matrix and inverse kernel matrix in SROS form. The proof of Proposition 1 exemplifies representing a matrix as the sum of sparse rank one matrices. Note that each $X_{[s]}$ is sparse; if $q$ is large, most $X_{[s]}$'s are the zero vector.
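To make Definition 1 and the rank-one decomposition in the proof of Proposition 1 concrete, here is a minimal, dense NumPy sketch (the function names are ours, not from the paper's code). It builds the Gram matrix directly from the kernel definition and again as a sum of sparse rank-one matrices, then checks that the two agree and are positive semidefinite. This is an $O(n^2 q)$ illustration only, not the fast representation developed below.

```python
import numpy as np

def btk(b1, b2, w):
    """Definition 1: k_w(x1, x2) = sum_i w_i * [[first i bits of x1 and x2 agree]]."""
    return float(np.dot(w, np.cumprod(b1 == b2)))

def btk_gram_rank_one_sum(B, w):
    """K_XX as a sum of sparse rank-one matrices, as in the proof of Proposition 1."""
    n, q = B.shape
    K = np.zeros((n, n))
    for i in range(q):                                # prefix length i + 1
        prefixes = [tuple(row[: i + 1]) for row in B]
        for s in set(prefixes):                       # one indicator vector X_[s] per prefix s
            X_s = np.array([pref == s for pref in prefixes], dtype=float)
            K += w[i] * np.outer(X_s, X_s)
    return K

rng = np.random.default_rng(0)
B = rng.integers(0, 2, size=(6, 5))                   # 6 points, q = 5 bits each
w = np.full(5, 1 / 5)                                 # uniform weights, ||w||_1 = 1
K_direct = np.array([[btk(bi, bj, w) for bj in B] for bi in B])
assert np.allclose(K_direct, btk_gram_rank_one_sum(B, w))
assert np.all(np.linalg.eigvalsh(K_direct) > -1e-9)   # positive semidefinite (Proposition 1)
```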
We now show how to interpret an SROS representation of an $n \times n$ matrix. Let $[n] = \{1, 2, \dots, n\}$. For $r \in \mathbb{N}$, let $L : [r]^n \times [r]^n \times \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^{n \times n}$ construct a linear operator from four vectors.

Definition 2 (Linear Operator from Simple SROS Representation). Let $p, p' \in [r]^n$, and let $u, u' \in \mathbb{R}^n$. For $l \in [r]$, let $u^{p=l} \in \mathbb{R}^n$ be the vector where $u^{p=l}_j = u_j [[p_j = l]]$, likewise for $u'$ and $p'$. Then:
$$L(p, p', u, u') \mapsto \sum_{l=1}^{r} u^{p=l} \, \big((u')^{p'=l}\big)^\top.$$

Figure 4: A matrix in standard form constructed from a matrix in SROS form. The large square depicts the matrix $L(p, p', u, u') \in \mathbb{R}^{5 \times 5}$ with elements colored by value. See $u^{p=0}$ for a color legend.

We depict Definition 2 in Figure 4. $p$ and $p'$ represent partitions over $n$ elements: all elements with the same integer value in the vector $p$ belong to the same part. Note that $r$, the number of parts in the partition, need not exceed $n$, the number of elements being partitioned. If $p = p'$ (which is almost always the case for us) and the elements of $p$, $u$, and $u'$ were shuffled so that all elements in the same part were next to each other, then $L(p, p', u, u')$ would be block diagonal. Note that $L(p, p', u, u')$ is not necessarily low rank. If $p$ is the finest possible partition, and $p = p'$, then $L(p, p', u, u')$ is diagonal. SROS matrices can be thought of as a generalization of two types of matrix that are famously amenable to fast computation: rank one matrices (all points in the same part) and diagonal matrices (each point in its own part). We now extend the definition of $L$ to allow for multiple $p$, $p'$, $u$, and $u'$ vectors.

Definition 3 (Linear Operator from SROS Representation). Let $L : [r]^{n \times q} \times [r]^{n \times q} \times \mathbb{R}^{n \times q} \times \mathbb{R}^{n \times q} \to \mathbb{R}^{n \times n}$. Let $P, P' \in [r]^{n \times q}$, and let $U, U' \in \mathbb{R}^{n \times q}$. Let $P_{:,i}$, $U_{:,i}$, etc. be the $i$th columns of the respective arrays. Then:
$$L(P, P', U, U') \mapsto \sum_{i=1}^{q} L(P_{:,i}, P'_{:,i}, U_{:,i}, U'_{:,i}).$$

Algorithm 1 performs linear transformation of a vector using the SROS representation in $O(nq)$ time.

Algorithm 1 Linear Transformation with SROS Linear Operator. This can be vectorized on a Graphical Processing Unit (GPU), using e.g. torch.Tensor.index_add_ for Line 5 and non-slice indexing for Line 6 [19]. Slight restructuring allows vectorization over $[q]$ as well.
Require: $P, P' \in [r]^{n \times q}$, $U, U' \in \mathbb{R}^{n \times q}$, $x \in \mathbb{R}^n$
Ensure: $y = L(P, P', U, U')\, x$
1: $y \leftarrow 0 \in \mathbb{R}^n$
2: for $i \in [q]$ do  // $O(nq)$ time
3:   $p, p', u, u' \leftarrow P_{:,i}, P'_{:,i}, U_{:,i}, U'_{:,i}$
4:   $z \leftarrow 0 \in \mathbb{R}^r$  // $z_l$ will store the dot product $((u')^{p'=l})^\top x^{p'=l}$
5:   for $j \in [n]$ do $z_{p'_j} \leftarrow z_{p'_j} + u'_j x_j$  // $O(n)$ time
6:   for $j \in [n]$ do $y_j \leftarrow y_j + z_{p_j} u_j$  // $O(n)$ time
return $y$

We now discuss how to approximately invert a certain kind of symmetric SROS matrix, but our methods could be extended to asymmetric matrices. First, we define a partial ordering over partitions. For two partitions $p, p'$, we say $p' \preceq p$ if $p'$ is finer than or equal to $p$; that is, $p'_j = p'_{j'} \implies p_j = p_{j'}$. Using that partial ordering, a symmetric SROS matrix can be approximately inverted efficiently if for all $1 \leq i, i' \leq q$, $P_{:,i} \preceq P_{:,i'}$ or $P_{:,i'} \preceq P_{:,i}$. As the reader may have recognized, our kernel matrix $K_{XX}$ can be written as an SROS matrix with this property.

We will write symmetric SROS matrices in a slightly more convenient form. All $(u')^{p=l}$ must be a constant times $u^{p=l}$. We will store these constants in an array $C$. Let $L(P, C, U)$ be shorthand for $L(P, P, U, C \odot U)$, where $\odot$ denotes element-wise multiplication. For $L(P, C, U)$ to be symmetric, it must be the case that $P_{ji} = P_{j'i} \implies C_{ji} = C_{j'i}$. Then, all elements of $U$ corresponding to a given $u^{p=l}$ are multiplied by the same constant.
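As a concrete check of Definitions 2 and 3 and Algorithm 1, the following PyTorch sketch (ours, with our own function names; a rough illustration, not the released implementation) computes the SROS matvec with torch.Tensor.index_add_, as suggested above, and compares it against a dense construction of $L(P, P', U, U')$. Partition labels are zero-based here.

```python
import torch

def sros_matvec(P, P2, U, U2, x):
    """y = L(P, P', U, U') x  (Algorithm 1), in O(nq) time."""
    n, q = P.shape
    r = int(max(P.max(), P2.max())) + 1
    y = torch.zeros(n, dtype=x.dtype)
    for i in range(q):
        p, p2, u, u2 = P[:, i], P2[:, i], U[:, i], U2[:, i]
        z = torch.zeros(r, dtype=x.dtype)
        z.index_add_(0, p2, u2 * x)      # z_l = sum_{j: p'_j = l} u'_j x_j
        y += z[p] * u                    # y_j += z_{p_j} u_j
    return y

def sros_dense(P, P2, U, U2):
    """Dense L(P, P', U, U') for testing, O(n^2 q)."""
    n, q = P.shape
    A = torch.zeros(n, n)
    for i in range(q):
        same = P[:, i][:, None] == P2[:, i][None, :]   # [[P_ji = P'_ki]]
        A += same * (U[:, i][:, None] * U2[:, i][None, :])
    return A

n, q, r = 6, 3, 4
P = torch.randint(r, (n, q)); U = torch.randn(n, q)
x = torch.randn(n)
assert torch.allclose(sros_matvec(P, P, U, U, x), sros_dense(P, P, U, U) @ x, atol=1e-5)
```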
We now present an algorithm for calculating $(L(P, C, U) + \lambda I)^{-1}$, for $\lambda \neq 0$, which is an approximate inversion of $L(P, C, U)$. We have not yet analyzed numerical sensitivity for $\lambda \to 0$, but we conjecture that all floating point numbers involved need to be stored to at least $\log_2(1/\lambda)$ bits. Without loss of generality, let $\lambda = 1$, and note $(L(P, C, U) + \lambda I)^{-1} = \lambda^{-1}(L(P, \lambda^{-1}C, U) + I)^{-1}$. By assumption, all columns of $P$ are comparable with respect to the partial ordering above, so we can reorder the columns of $P$ such that $P_{:,j} \preceq P_{:,i}$ for $i < j$ (coarsest partitions first). The key identity that we use to develop our fast inversion algorithm is the Sherman–Morrison formula:
$$(A + cuu^\top)^{-1} = A^{-1} - \frac{A^{-1} u u^\top A^{-1}}{c^{-1} + u^\top A^{-1} u}. \quad (3)$$
Starting with $A = I$, we add the sparse rank one matrices iteratively, from the finest partition to the coarsest one, updating $A^{-1}$ as we go. We represent $(L(P, C, U) + I)^{-1}$ in the form $I + L(P, C', U')$, so we write an algorithm that returns $C'$ and $U'$. We can also quickly calculate $\log|L(P, C, U) + I|$ at the same time, using the matrix determinant lemma: $|A + cuu^\top| = (1 + c\, u^\top A^{-1} u)\,|A|$.

Theorem 1 (Fast Inversion). For $P \in [r]^{n \times q}$ and $C, U \in \mathbb{R}^{n \times q}$, if $P_{:,i}$ is coarser than (or equal to) $P_{:,j}$ for $i < j$, then there exist $C', U' \in \mathbb{R}^{n \times q}$ such that $(L(P, C, U) + I)^{-1} = I + L(P, C', U')$. There exists an algorithm for computing $C'$ and $U'$ that takes $O(nq^2)$ time.

Proof. For $X \in \mathbb{R}^{n \times q}$, let $X_{:,i+1:q} \in \mathbb{R}^{n \times (q-i)}$ be columns $i+1$ through $q$ of matrix $X$ (inclusive). Let $A_i = I + L(P_{:,i+1:q}, C_{:,i+1:q}, U_{:,i+1:q})$, so $A_q = I$. Now suppose $A_i^{-1}$ can be written as $I + L(P_{:,i+1:q}, C'_{:,i+1:q}, U'_{:,i+1:q})$ for some $C'$ and $U'$. For the base case of $i = q$, this holds trivially. We show it also holds for $i - 1$, and that we can compute the new columns $C'_{:,i}$ and $U'_{:,i}$ in $O(n(q-i))$ time.

Let $p = P_{:,i}$, $u = U_{:,i}$, and $c = C_{:,i}$. Consider $u^{p=l}$, where each element is zero unless the corresponding element of $p$ equals $l$. What do we know about the product $A_i^{-1} u^{p=l}$ (as seen in Equation 3)? Because the columns of $P$ go from coarser partitions to finer ones, all of the vectors generating the sparse rank one components of $L(P_{:,i+1:q}, C'_{:,i+1:q}, U'_{:,i+1:q})$ are from partitions that are equal to or finer than $p$. Thus, they are either zero everywhere $u^{p=l}$ is zero, or zero everywhere $u^{p=l}$ is nonzero. Vectors $v$ of the second kind can be ignored, as $cvv^\top u^{p=l} = 0$. Thus, when multiplying $L(P_{:,i+1:q}, C'_{:,i+1:q}, U'_{:,i+1:q})$ by $u^{p=l}$, the only relevant vectors are filled with zeros except where the corresponding element of $p$ equals $l$, so we can discard all other rows of $P_{:,i+1:q}$, $C'_{:,i+1:q}$, and $U'_{:,i+1:q}$. Suppose there are $n_l$ elements of $p$ that equal $l$. Then $L(P_{:,i+1:q}, C'_{:,i+1:q}, U'_{:,i+1:q})\, u^{p=l}$ involves $n_l$ rows, and can be computed in $O(n_l(q-i))$ time. Moreover, this product, which we'll call $(u')^{p=l}$, is only nonzero where the corresponding element of $p$ equals $l$, so it has the same sparsity pattern as $u^{p=l}$. The other component of $A_i^{-1}$ is the identity matrix, and $I u^{p=l}$ clearly has the same sparsity as $u^{p=l}$. Thus, returning to Equation 3, when we add $u^{p=l}(u^{p=l})^\top$ (scaled by the corresponding constant in $c$) to $A_i$, we update $A_i^{-1}$ with an outer product of vectors whose sparsity pattern is the same as that of $u^{p=l}$.

$A_i^{-1}$ need not be updated with each $u^{p=l}$ one at a time. For $l \neq l'$, $u^{p=l}$ and $u^{p=l'}$ are nonzero at separate indices, so $u^{p=l}$ and $(u')^{p=l'}$ are nonzero at separate indices, so the extra component of $A_i^{-1}$ that appears after the $u^{p=l'}$ update is irrelevant to the $u^{p=l}$ update, because $(u^{p=l})^\top (u')^{p=l'} = 0$. Since the $u^{p=l}$ update takes $O(n_l(q-i))$ time, all of them together take $O(\sum_l n_l(q-i))$ time, which equals $O(n(q-i))$ time. Calculating an element of $c'$ only involves computing the denominator in Equation 3, using a matrix-vector product already computed.
So we can write $C'_{:,i:q}$ and $U'_{:,i:q}$ by adding a preceding column to $C'_{:,i+1:q}$ and $U'_{:,i+1:q}$, using the same partition $p$, and it takes $O(n(q-i))$ time. Following the induction down to $i = 0$, we have $(L(P, C, U) + I)^{-1} = I + L(P, C', U')$, and a total time of $O(nq^2)$.

Algorithm 2 also performs approximate inversion, which we prove in Appendix A. It differs slightly from the algorithm in the proof, but can take full advantage of a GPU speedup. In the setting where all columns of $U$ are identical, observe that in Lines 10 and 11 the same computation is repeated for all $k \in [i]$. Indeed, in this setting, this block of code can be modified to run in $O(n)$ time rather than $O(ni)$, making the whole algorithm run in $O(nq)$ time, as shown in Proposition 4 in Appendix A.

Algorithm 2 Inverse and determinant of I + SROS Linear Operator. Lines 5 through 11 can all be easily vectorized on a GPU. Lines 5 and 10 require torch.Tensor.index_add_ or equivalent, and Lines 6 and 11 require non-slice indexing, which are not quite as fast as some GPU operations.
Require: $P \in [r]^{n \times q}$, $C, U \in \mathbb{R}^{n \times q}$
Ensure: $I + L(P, C', U') = (I + L(P, C, U))^{-1}$; $x = \log|I + L(P, C, U)|$
1: $x, C', U' \leftarrow 0, \; 0 \in \mathbb{R}^{n \times q}, \; U$
2: for $i \in (q, q-1, \dots, 1)$ do  // $O(nq^2)$ time
3:   $p, c, u, u' \leftarrow P_{:,i}, C_{:,i}, U_{:,i}, U'_{:,i}$
4:   $z \leftarrow 0 \in \mathbb{R}^r$  // $z_l$ will store $c^{(l)}((u')^{p=l})^\top u^{p=l}$, where $c^{(l)} = c_k$ if $p_k = l$
5:   for $j \in [n]$ do $z_{p_j} \leftarrow z_{p_j} + c_j u'_j u_j$  // $O(n)$ time
6:   for $j \in [n]$ do $C'_{ji} \leftarrow -c_j/(1 + z_{p_j})$  // $O(n)$ time
7:   for $l \in [r]$ do $x \leftarrow x + \log(1 + z_l)$  // $O(n)$ time because $r \leq n$
8:   if $i > 1$ then
9:     $y \leftarrow 0 \in \mathbb{R}^{r \times (i-1)}$  // $O(ni)$ time
10:    for $j, k \in [n] \times [i-1]$ do $y_{p_j k} \leftarrow y_{p_j k} + u'_j U_{jk}$  // $O(ni)$ time
11:    for $j, k \in [n] \times [i-1]$ do $U'_{jk} \leftarrow U'_{jk} + C'_{ji}\, u'_j\, y_{p_j k}$  // $O(ni)$ time
return $C', U', x$

A Hierarchical matrix is a matrix which is either represented as a low-rank matrix or as a $2 \times 2$ block matrix of Hierarchical matrices [1]. In our SROS format, many of the sparse rank one matrices overlap, whereas in a Hierarchical matrix, the low-rank matrices do not overlap, and converting an SROS matrix into a Hierarchical matrix would typically be inefficient. Hierarchical matrices admit approximate inversion in $O(na^2 \log^2 n)$ time, where $a$ is the maximum rank of the component submatrices [11]. However, this is not an approximation in a technical sense, as there is no error bound. At many successive steps in the algorithm, a rank-$2a$ matrix is approximated by a rank-$a$ matrix [10]; to our knowledge there is no analysis of how resulting errors might cascade. After converting an SROS matrix to hierarchical form, this rough inversion would take $O(nq^2 \log^2 n)$ time.

5 Binary tree Gaussian process

We now show that our kernel matrix $K_{XX}$ can be written in SROS form, with $P$ containing successively finer partitions. Thus, $K_{XX}$ can be approximately inverted quickly, for use in Equations 1 and 2. Next, we'll show that we can efficiently optimize the log likelihood of the training data by tuning the weight vector $w$ along with the bit order. The log likelihood can be calculated in $O(nq \log n)$ time and then the gradient w.r.t. $w$ in $O(nq^2)$ time.

Recall from the proof of Proposition 1: $K_{XX} = \sum_{i=1}^{q} \sum_{s \in \mathbb{B}^i} w_i X_{[s]} X_{[s]}^\top$, where $X_{[s]} \in \mathbb{R}^n$ with $(X_{[s]})_j = [[X_j^{:|s|} = s]]$. So we will set $P_{:,i}$, $C_{:,i}$, and $U_{:,i}$ so that $L(P_{:,i}, C_{:,i}, U_{:,i}) = \sum_{s \in \mathbb{B}^i} w_i X_{[s]} X_{[s]}^\top$. Let $P_{:,i}$ partition the set of points $X$ so that points are in the same part if their first $i$ bits match. Now, requiring the first $i+1$ bits to match is a stricter criterion than requiring the first $i$ bits to match, so the $P_{:,i}$ grow successively finer.
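Because these partitions are nested, the Sherman–Morrison sweep behind Theorem 1 and Algorithm 2 applies. The sketch below (ours) is a dense, $O(n^2 q)$ illustration of that idea, not the $O(nq^2)$ algorithm: it adds the sparse rank-one components from the finest column to the coarsest and checks the result against a direct inverse.

```python
import numpy as np

def invert_nested_sros(P, C, U):
    """(I + L(P, C, U))^{-1} via Sherman-Morrison updates, finest column first.

    P: (n, q) partition labels, column 0 coarsest ... column q-1 finest.
    C, U: (n, q). Dense illustration of Theorem 1, not the fast algorithm.
    """
    n, q = P.shape
    A_inv = np.eye(n)
    for i in range(q - 1, -1, -1):          # finest partition first
        for l in np.unique(P[:, i]):
            mask = P[:, i] == l
            u = np.where(mask, U[:, i], 0.0)
            c = C[mask, i][0]               # constant within a part (symmetry)
            Au = A_inv @ u
            A_inv -= np.outer(Au, Au) / (1.0 / c + u @ Au)   # Equation 3
    return A_inv

# Nested partitions, as produced by the binary tree kernel (coarse -> fine).
P = np.array([[0, 0], [0, 0], [0, 1], [1, 2], [1, 2]])
U = np.ones((5, 2))
C = np.full((5, 2), 0.5)
L = sum(
    np.outer(np.where(P[:, i] == l, U[:, i], 0.0), np.where(P[:, i] == l, C[:, i] * U[:, i], 0.0))
    for i in range(2) for l in np.unique(P[:, i])
)
assert np.allclose(invert_nested_sros(P, C, U), np.linalg.inv(np.eye(5) + L))
```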
For any part of the partition where the first $i$ bits of the constituent points equal the bitstring $s$, the corresponding sparse rank one component of $K_{XX}$ is $w_i X_{[s]} X_{[s]}^\top$. So let $U_{:,i} = \mathbf{1}_n$, and let $C_{:,i} = w_i \mathbf{1}_n$.

Proposition 2 (SROS Form Kernel). $K_{XX} = L(P, C, U)$, as defined above.

This follows immediately from the definitions. To compute these partitions $P_{:,i}$, we sort $X$, which is a set of bit strings, and then we can easily compute which points have the same first $i$ bits. This all takes $O(nq \log n)$ time. Now note that $U_{:,i} = U_{:,i'}$ for all $i, i'$, so $(K_{XX} + \lambda I)^{-1}$ and $|K_{XX} + \lambda I|$ can be computed in $O(nq)$ time, rather than $O(nq^2)$.

The training negative log likelihood of a GP is that of the corresponding multivariate Gaussian on the training data. So:
$$\mathrm{NLL}(w) = \frac{1}{2}\Big( y^\top (K_{XX}(w) + \lambda I)^{-1} y + \log|K_{XX}(w) + \lambda I| + n \log(2\pi) \Big).$$
This can be computed in $O(nq)$ time, since matrix-vector multiplication takes $O(nq)$ time for a matrix in SROS form. So if the bit order is unchanged, an optimization step can be done in $O(nq)$ time, and if the data needs to be resorted, then in $O(nq \log n)$ time. On the largest dataset we tested (House Electric), with $n \approx 1.3$ million and $q = 88$, sorting the data and computing $P$ takes about 0.96 seconds on a GPU, and then calculating the negative log likelihood takes about another 1.08 seconds. We show in Appendix B how to compute $\nabla_w \mathrm{NLL}$ in $O(nq^2)$ time.

To optimize the bit order and weight vector at the same time, we represent both with a single parameter vector $\theta \in \mathbb{R}^q_{+}$, with $\|\theta\|_1 = 1$. To get the bit order from $\theta$, we start with a default bit order and permute the bit order according to a permutation that would sort $\theta$ in descending order. To get the weight vector, we sort $\theta$ in descending order, add a 0 at the end, and compute the differences between adjacent elements. When there are ties in the elements of $\theta$, the choice of bit order does not affect the negative log likelihood (or the kernel at all) because the relevant associated weight is 0. The negative log likelihood is continuous with respect to $\theta$, and when all values of $\theta$ are unique, it is differentiable with respect to $\theta$. Letting $\theta = e^\phi / \|e^\phi\|_1$, we minimize the loss w.r.t. $\phi$ using BFGS [8].

Algorithm 3 GP Regression with a binary tree kernel.
Require: $X \in \mathbb{B}^{n \times q}$, $y \in \mathbb{R}^n$, $X' \in \mathbb{B}^{m \times q}$, $w \in \mathbb{R}^q$, $\lambda \in \mathbb{R}_+$
Ensure: $\mu_{X'}$ and $\sigma^2_{X'}$ are the predictive means and variances at $X'$, and nll is the training negative log likelihood.
1: $\bar X \leftarrow X \oplus X'$  // concatenate the training and test points
2: $\bar X^{\uparrow}, \mathrm{perm} \leftarrow \mathrm{Sort}(\bar X)$  // the rows of $\bar X$ are sorted lexically from leading bit to trailing bit. $O((n+m)q\log(n+m))$ time
3: for $j, i \in [n+m] \times [q]$ do $\bar P^{\uparrow}_{ji} \leftarrow$ number of unique rows in $\bar X^{\uparrow}_{1:j,\,1:i}$
4:   // $M_{1:j,1:i}$ is the first $j$ rows and $i$ columns of $M$. $O((n+m)q)$ time
5: $\bar P \leftarrow \mathrm{perm}^{-1}(\bar P^{\uparrow})$  // this unsorts the input. $O((n+m)q)$ time
6: $P, P' \leftarrow \bar P_{1:n}, \bar P_{n+1:n+m}$
7: $U, U', \bar U \leftarrow \mathbf{1}_{n \times q}, \mathbf{1}_{m \times q}, \mathbf{1}_{(n+m) \times q}$
8: $C, \bar C \leftarrow \mathbf{1}_n w^\top, \mathbf{1}_{n+m} w^\top$
9: $C^{-1}_\lambda, U^{-1}, \mathrm{logdet}_\lambda \leftarrow \mathrm{Invert}(P, \lambda^{-1}C, U)$  // uses Algorithm 2; speedup to $O(nq)$ time because columns of $U$ are identical
10: $C^{-1}, \mathrm{logdet} \leftarrow \lambda^{-1} C^{-1}_\lambda, \; \mathrm{logdet}_\lambda + n \log(\lambda)$
11: $z \leftarrow \mathrm{LinTransform}(P, P, U^{-1}, C^{-1} \odot U^{-1}, y) + \lambda^{-1} y$  // uses Algorithm 1 to compute the Woodbury vector. $O(nq)$ time
12: $\mu_{X'} \leftarrow \mathrm{LinTransform}(P', P, U', C \odot U, z)$  // $O((n+m)q)$ time
13: $\mathrm{nll} \leftarrow (y^\top z + \mathrm{logdet} + n \log(2\pi))/2$
14: $C^{\mathrm{prec}}, U^{\mathrm{prec}} \leftarrow \mathrm{Invert}(\bar P, \lambda^{-1}\bar C, \bar U)$  // $O((n+m)q)$ time
15: $C^{\mathrm{prec}}, U^{\mathrm{prec}} \leftarrow C^{\mathrm{prec}}_{n+1:n+m}, U^{\mathrm{prec}}_{n+1:n+m}$
16: $C^{\mathrm{cov}}, U^{\mathrm{cov}} \leftarrow \mathrm{Invert}(P', C^{\mathrm{prec}}, U^{\mathrm{prec}})$  // $O(mq^2)$ time; extra factor of $q$ because the columns of $U^{\mathrm{prec}}$ are not identical
17: $\sigma^2_{X'} \leftarrow \lambda\big(\mathbf{1}_m + \mathrm{SumEachRow}(C^{\mathrm{cov}} \odot U^{\mathrm{cov}} \odot U^{\mathrm{cov}})\big)$  // $O(mq)$ time
18: return $\mu_{X'}, \sigma^2_{X'}, \mathrm{nll}$
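Lines 2–5 of Algorithm 3 compute the nested partitions by sorting the bit strings and counting distinct prefixes. Here is a small NumPy sketch of that step (ours; labels are zero-based here, whereas Algorithm 3 counts unique rows starting from 1).

```python
import numpy as np

def bit_partitions(B):
    """Columns P[:, i] label points by their first i+1 bits (Section 5).

    B: (n, q) array of bits. Returns P of shape (n, q) with integer part labels,
    computed by lexicographic sorting plus cumulative 'new prefix' counts.
    """
    n, q = B.shape
    order = np.lexsort(B[:, ::-1].T)          # sort rows lexically, leading bit most significant
    Bs = B[order]
    P_sorted = np.zeros((n, q), dtype=np.int64)
    for i in range(q):
        # A new part starts whenever the (i+1)-bit prefix changes between adjacent sorted rows.
        changed = np.any(Bs[1:, : i + 1] != Bs[:-1, : i + 1], axis=1)
        P_sorted[:, i] = np.concatenate(([0], np.cumsum(changed)))
    P = np.empty_like(P_sorted)
    P[order] = P_sorted                       # unsort back to the original row order
    return P

B = np.array([[0, 1, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1]])
print(bit_partitions(B))
# Points 0 and 3 share every prefix, so they get the same label in every column;
# point 1 agrees with them on the first two bits and splits off at the third;
# point 2 is alone from the first bit on.
```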
To calculate the predictive mean at a list of predictive locations $X'$, we first multiply $y$ by $(K_{XX} + \lambda I)^{-1}$, and then we multiply that vector by $K_{X'X}$. We obtain both $K_{XX}$ and $K_{XX'}$ in SROS form as follows. Let $\bar X = X \oplus X'$ be the concatenation of the two tuples, now an $(n+m)$-tuple. Writing $K_{\bar X \bar X} = L(\bar P, \bar C, \bar U)$, the arrays on the r.h.s. can be computed in $O((n+m)q\log(n+m))$ time. Then, with $P$, $C$, and $U$ being the first $n$ rows of $\bar P$, $\bar C$, $\bar U$, we have $K_{XX} = L(P, C, U)$. And letting $P''$ and $U''$ be the last $m$ rows, $K_{XX'} = L(P, P'', C \odot U, U'')$. Thus, the predictive mean $\mu_x$ from Equation 1 can be computed at $m$ locations in $O((n+m)q\log(n+m))$ time.

The predictive covariance matrix, which extends the predictive variance from Equation 2, is calculated as $\Sigma_{X'} = K_{X'X'} + \lambda I_m - K_{X'X}(K_{XX} + \lambda I_n)^{-1} K_{XX'} = (K_{\bar X \bar X} + \lambda I_{m+n}) / K_{XX}$, where $/$ denotes the Schur complement. From a property of block matrix inversion, the last $m$ columns of the last $m$ rows of $(K_{\bar X \bar X} + \lambda I)^{-1}$ equal $((K_{\bar X \bar X} + \lambda I_{m+n})/K_{XX})^{-1}$. So we get the predictive precision matrix in $O((n+m)q\log(n+m))$ time by inverting $K_{\bar X \bar X} + \lambda I$ and taking the bottom right $m \times m$ block. Then, we get the predictive covariance matrix by inverting that. This takes $O(mq^2)$ time, since it does not have the property of all the columns of $U$ being equal. If we only want the diagonal elements of an SROS matrix (the independent predictive variances in this case), we can simply sum the rows of $C \odot U \odot U$ in $O(mq)$ time. Thus, in total, computing the independent predictive variances requires $O((n+m)q\log(n+m) + mq^2)$ time. See Algorithm 3.

6 Related Work

All existing kernels of which we are aware for linear time GP regression on unstructured data involve inducing points (related to the Nyström approximation [27]) or inducing frequencies. For a given set of inducing points $Z$, for some base kernel $k$, the inducing point kernel (in its most basic form) is the following, although subtle variants exist: $k_Z(x, x') = K_{xZ} K_{ZZ}^{-1} K_{Zx'}$ [21]. Sparse Gaussian Process Regression (SGPR) involves selecting $Z$, and then using $k_Z$ (or a variant). Notably, $K^Z_{XX} = K_{XZ} K_{ZZ}^{-1} K_{ZX}$ is low rank, providing computational efficiency. The predictive mean and covariance have compact form, with observational noise $\lambda$: $\mu_x(Z) = K_{xZ}(\lambda K_{ZZ} + K_{ZX} K_{XZ})^{-1} K_{ZX}\, y$ and $\sigma^2_{xx'}(Z) = K_{xZ}(K_{ZZ} + \lambda^{-1} K_{ZX} K_{XZ})^{-1} K_{Zx'}$.

Titsias's [24] sparse variational kernel is also low rank and uses inducing points. The sparse variational GP (SVGP) is constructed as the solution to a variational inference problem. It depends on inducing points $Z$, data points $X$, and observed function values $y$. We have focused on Gaussian processes with 0 mean, but the SVGP method uses a nonzero prior mean along with a kernel: $m_{\mathrm{SVGP}}(x) = \mu_x(Z)$ and $k_{\mathrm{SVGP}}(x, x') = k(x, x') - K_{xZ} K_{ZZ}^{-1} K_{Zx'} + \sigma^2_{xx'}(Z)$. Given the dependence on $X$ and $y$, this is not a true probability distribution over function space. The variational problem underlying this kernel also provides guidance in how to select the inducing points $Z$. For further discussion of the kernel underlying the SVGP method, see Wild et al. [26].

An inducing point kernel with $z$ inducing points produces a $z$-dimensional reproducing kernel Hilbert space (RKHS). The dimensionality of the RKHS relates to the expressivity of the kernel. Whereas an inducing point method buys a $z$-dimensional RKHS for the price of $O(z^2 n)$ time and $O(zn)$ space, the binary tree kernel produces a $2^q$-dimensional RKHS in $O(qn)$ time and space: an exponential improvement. (Observe that we can find $2^q$ linearly independent functions of the form $k(\cdot, x)$, one for each of the $2^q$ leaves $x$ might belong to.)
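To illustrate that dimensionality gap: an inducing point kernel's Gram matrix has rank at most $z$, however many data points are used. A quick NumPy check (ours), using a squared-exponential base kernel as a stand-in for the base kernel $k$:

```python
import numpy as np

def rbf(A, B, ls=1.0):
    """Squared-exponential base kernel."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(3)
X = rng.random((200, 2))            # n = 200 data points
Z = rng.random((10, 2))             # z = 10 inducing points
Kxz, Kzz = rbf(X, Z), rbf(Z, Z)
K_ind = Kxz @ np.linalg.solve(Kzz + 1e-9 * np.eye(10), Kxz.T)   # K^Z_XX = K_XZ K_ZZ^{-1} K_ZX

# The inducing point Gram matrix has rank at most z (a z-dimensional RKHS),
# no matter how many data points there are.
print(np.linalg.matrix_rank(K_ind, tol=1e-8))   # <= 10
```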
Wilson and Nickisch [29] develop a method for speeding up inducing point methods significantly, especially in low-dimensional settings. Lázaro-Gredilla et al. [16] propose an inducing frequencies kernel: given a set of $m$ inducing vectors $s_i$, $k(x, x') = \frac{1}{m}\sum_{i=1}^{m} \cos(2\pi s_i^\top (x - x'))$. Dutordoir et al. [6] propose an inducing frequencies kernel for low dimensional data, in which $k(x, x')$ is a special function of $x^\top x'$. On one-dimensional data, filtering/smoothing methods perform Bayesian inference over functions in $O(n)$ time [14, 12].

A few non-$O(n)$ methods bear mentioning. We are not the first to consider a kernel over points on the leaves of a tree [18, 17] or on the leaves of multiple trees [7], but their methods take $O(n^3)$ time. On certain kinds of structured data, Toeplitz solvers achieve $O(n^2)$ time complexity [30]. Cutajar et al.'s [3], Wang et al.'s [25], and others' use of a conjugate gradients solver to replace inversion/factorization has unclear time complexity between $O(n^2)$ and $O(n^3)$, depending on the kernel matrix spectrum.¹

7 Experiments

In Table 1, we compare our binary tree kernel and a binary tree ensemble (see Appendix C for details on the ensemble) against three baseline methods: exact GP regression using a Matérn kernel, sparse Gaussian process regression (SGPR) [23], and a stochastic variational Gaussian process (SVGP) [13]. We evaluate our method on the same open-access UCI datasets [4] as Wang et al. [25], using their same training, validation, and test partitions, and we compare against the baseline results they report. For the binary tree (BT) kernels, we use $p = \min(8, \lfloor 150/d \rfloor + 1)$, and recall $q = pd$. We set $\lambda = 1/n$. We train the bit order and weights to minimize training NLL. For the binary tree ensemble (BTE), we use 20 kernels. For the Matérn kernel, we use Blackbox Matrix-Matrix multiplication (BBMM) [9], which uses the conjugate gradients method to calculate matrix-vector products with $(K_{XX} + \lambda I)^{-1}$. SGPR uses 512 inducing points and SVGP uses 1,024 inducing points. We report the mean and two standard errors across 3 replications with different dataset splits. For further experimental details, see Appendix C.

BTE achieves the best NLL on 6/12 datasets, and the best RMSE on 5/12 datasets (including some ties). Out of the 4 largest datasets, BT/BTE is fastest on 3. The run times are plotted in Figure 5.

¹The conjugate gradients method takes $O(n^2 \sqrt{\kappa})$ time, where $\kappa$ is the condition number of the kernel matrix. Poggio et al. [20] say claims about the condition number of a random matrix $A$ should also apply to kernel matrices with random data. If they mean a Wishart random matrix (which it should be if, e.g., $k(x, x') = x^\top x'$), that would be the square of the condition number of the corresponding Gaussian random matrix, which grows as $O(n)$ [2]. Putting it all together, we get $O(n^3)$ for conjugate gradients. We don't know how quickly preconditioning can reduce the condition number.

Figure 5: Run times given dataset size. For BT, the trendline is calculated controlling for $\log(q)$ with affine regression, and then setting $q = 150$. The slope w.r.t. $\log(q)$ is $2.82 \pm 1.06$. Theoretically, all slopes are too low except for that of SVGP, presumably because of overhead in the small-data regime.

The code is available at https://github.com/mkc1000/btgp and https://tinyurl.com/btgp-colab.
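For reference, the binary tree hyperparameters used above can be written as a one-line rule; the helper name is ours. On House Electric ($d = 11$) it reproduces the $q = 88$ quoted in Section 5.

```python
# Hyperparameters used in the experiments (Section 7): bits of precision per
# dimension, total number of bits (tree depth), and observation-noise variance.
def bt_hyperparams(n, d):
    p = min(8, 150 // d + 1)   # precision per dimension
    q = p * d                  # total bits q = pd
    lam = 1.0 / n              # noise variance lambda
    return p, q, lam

print(bt_hyperparams(n=1_311_539, d=11))   # HouseElec: p = 8, q = 88, lambda ~ 7.6e-7
```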
BT performs noticeably worse than BTE for test NLL on CTSlice due to over-fitting. There are enough degrees of freedom when optimizing the bit order ($d = 385$) that the BT kernel can over-fit to the training data. The ensemble over multiple bit orders is much more robust.

NLL:

| Dataset | n | d | BTE | BT | Matérn (BBMM) | SGPR | SVGP |
|---|---|---|---|---|---|---|---|
| PoleTele | 9,600 | 26 | 0.625 ± 0.035 | 0.490 ± 0.040 | 0.180 ± 0.036 | 0.094 ± 0.008 | 0.001 ± 0.008 |
| Elevators | 10,623 | 18 | 0.649 ± 0.032 | 0.646 ± 0.023 | 0.619 ± 0.054 | 0.580 ± 0.060 | 0.519 ± 0.022 |
| Bike | 11,122 | 17 | 0.708 ± 0.433 | 0.806 ± 0.273 | 0.119 ± 0.044 | 0.291 ± 0.032 | 0.272 ± 0.018 |
| Kin40K | 25,600 | 8 | 0.869 ± 0.004 | 0.881 ± 0.008 | 0.258 ± 0.084 | 0.087 ± 0.067 | 0.236 ± 0.077 |
| Protein | 29,267 | 9 | 0.781 ± 0.023 | 0.845 ± 0.026 | 1.018 ± 0.056 | 0.970 ± 0.010 | 1.035 ± 0.006 |
| KeggDir | 31,248 | 20 | 1.031 ± 0.020 | 1.029 ± 0.021 | 0.199 ± 0.381 | 1.123 ± 0.016 | 0.940 ± 0.020 |
| CTSlice | 34,240 | 385 | 2.527 ± 0.147 | 1.092 ± 0.147 | 0.894 ± 0.188 | 0.073 ± 0.097 | 1.422 ± 0.005 |
| KeggU | 40,708 | 27 | 0.667 ± 0.007 | 0.667 ± 0.007 | 0.419 ± 0.027 | 0.984 ± 0.012 | 0.666 ± 0.007 |
| 3DRoad | 278,319 | 3 | 0.251 ± 0.009 | 0.252 ± 0.006 | 0.909 ± 0.001 | 0.943 ± 0.002 | 0.697 ± 0.002 |
| Song | 329,820 | 90 | 1.330 ± 0.003 | 1.331 ± 0.003 | 1.206 ± 0.024 | 1.213 ± 0.003 | 1.417 ± 0.000 |
| Buzz | 373,280 | 77 | 1.198 ± 0.003 | 1.198 ± 0.003 | 0.267 ± 0.028 | 0.106 ± 0.008 | 0.224 ± 0.050 |
| HouseElec | 1,311,539 | 11 | 2.569 ± 0.006 | 2.492 ± 0.012 | 0.152 ± 0.001 | – | 1.010 ± 0.039 |

RMSE:

| Dataset | BTE | BT | Matérn (BBMM) | SGPR | SVGP |
|---|---|---|---|---|---|
| PoleTele | 0.154 ± 0.006 | 0.161 ± 0.004 | 0.151 ± 0.012 | 0.217 ± 0.002 | 0.215 ± 0.002 |
| Elevators | 0.478 ± 0.021 | 0.476 ± 0.018 | 0.394 ± 0.006 | 0.437 ± 0.018 | 0.399 ± 0.009 |
| Bike | 0.118 ± 0.057 | 0.103 ± 0.029 | 0.220 ± 0.002 | 0.362 ± 0.004 | 0.303 ± 0.004 |
| Kin40K | 0.580 ± 0.003 | 0.587 ± 0.006 | 0.099 ± 0.001 | 0.273 ± 0.025 | 0.268 ± 0.022 |
| Protein | 0.608 ± 0.008 | 0.623 ± 0.011 | 0.536 ± 0.012 | 0.656 ± 0.010 | 0.668 ± 0.005 |
| KeggDir | 0.086 ± 0.003 | 0.086 ± 0.003 | 0.086 ± 0.005 | 0.104 ± 0.003 | 0.096 ± 0.001 |
| CTSlice | 0.116 ± 0.009 | 0.132 ± 0.009 | 0.262 ± 0.448 | 0.218 ± 0.011 | 1.003 ± 0.005 |
| KeggU | 0.120 ± 0.001 | 0.121 ± 0.001 | 0.118 ± 0.000 | 0.130 ± 0.001 | 0.124 ± 0.002 |
| 3DRoad | 0.187 ± 0.002 | 0.186 ± 0.001 | 0.101 ± 0.007 | 0.661 ± 0.010 | 0.481 ± 0.002 |
| Song | 0.914 ± 0.003 | 0.916 ± 0.003 | 0.807 ± 0.024 | 0.803 ± 0.002 | 0.998 ± 0.000 |
| Buzz | 0.801 ± 0.002 | 0.801 ± 0.002 | 0.288 ± 0.018 | 0.300 ± 0.004 | 0.304 ± 0.012 |
| HouseElec | 0.029 ± 0.001 | 0.029 ± 0.001 | 0.055 ± 0.000 | – | 0.084 ± 0.005 |

Run time (minutes):

| Dataset | BT/BTE | Matérn (BBMM) | SGPR | SVGP |
|---|---|---|---|---|
| PoleTele | 5.16 ± 0.58 | 0.69 ± 0.018 | 1.16 ± 0.34 | 1.15 ± 0.068 |
| Elevators | 2.6 ± 0.19 | 0.68 ± 0.012 | 1.16 ± 0.38 | 1.27 ± 0.092 |
| Bike | 2.68 ± 0.15 | 0.69 ± 0.015 | 1.17 ± 0.38 | 1.28 ± 0.093 |
| Kin40K | 1.44 ± 0.028 | 0.71 ± 0.045 | 1.62 ± 0.96 | 3.26 ± 0.23 |
| Protein | 2.92 ± 0.2 | 0.8 ± 0.17 | 2.27 ± 0.9 | 3.31 ± 0.27 |
| KeggDir | 7.14 ± 0.39 | 0.85 ± 0.1 | 2.2 ± 1.09 | 3.8 ± 0.38 |
| CTSlice | 52.01 ± 0.92 | 3.32 ± 5.0 | 2.16 ± 0.99 | 3.87 ± 0.34 |
| KeggU | 7.46 ± 0.54 | 6.32 ± 0.41 | 2.22 ± 1.05 | 4.78 ± 0.4 |
| 3DRoad | 1.93 ± 0.12 | 126.37 ± 20.92 | 12.01 ± 5.51 | 34.09 ± 3.19 |
| Song | 31.87 ± 3.79 | 33.79 ± 10.45 | 7.89 ± 3.12 | 39.55 ± 3.08 |
| Buzz | 20.18 ± 6.66 | 571.15 ± 66.34 | 29.25 ± 18.33 | 46.35 ± 2.93 |
| HouseElec | 118.41 ± 3.93 | 575.64 ± 6.94 | – | 367.71 ± 4.7 |

Table 1: NLL (top), RMSE (middle), and run time in minutes (bottom) on regression datasets, using a single GPU (Tesla V100-SXM2-16GB for BT and BTE and Tesla V100-SXM2-32GB for the other methods). The asterisk indicates an estimate of the time from the reported training time on 8 GPUs, assuming linear speedup in number of GPUs and independent noise in training times per GPU. All columns except BT and BTE come from Wang et al. [25].

8 Discussion

We have proven that the binary tree kernel GP is scalable. Our empirical results suggest that it often outpredicts not just other scalable methods, but even the popular Matérn GP.
If the results in this paper replicate in other domains, it could obviate wide usage of classic GP kernels like the Matérn kernel, as well as inducing point kernels. Sometimes, our kernel fails to capture patterns in the data; some functions' values simply do not covary this way. But other kernels we tested seemed to fail like that even more. Our contributions to linear algebra and kernel design may significantly increase the size of data sets on which GPs can do state-of-the-art modelling.

References

[1] Mario Bebendorf. Hierarchical Matrices. Springer, 2008.
[2] Zizhong Chen and Jack J. Dongarra. Condition numbers of Gaussian random matrices. SIAM Journal on Matrix Analysis and Applications, 27(3):603–620, 2005.
[3] Kurt Cutajar, Michael Osborne, John Cunningham, and Maurizio Filippone. Preconditioning kernel matrices. In International Conference on Machine Learning, pages 2529–2538. PMLR, 2016.
[4] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
[5] Philippe Duchon, Philippe Flajolet, Guy Louchard, and Gilles Schaeffer. Boltzmann samplers for the random generation of combinatorial structures. Combinatorics, Probability and Computing, 13(4–5):577–625, July 2004. ISSN 0963-5483. doi: 10.1017/S0963548304006315.
[6] Vincent Dutordoir, Nicolas Durrande, and James Hensman. Sparse Gaussian processes with spherical harmonic features. In International Conference on Machine Learning, pages 2793–2802. PMLR, 2020.
[7] Dai Feng and Richard Baumgartner. Random forest (RF) kernel for regression, classification and survival. CoRR, abs/2009.00089, 2020. URL https://arxiv.org/abs/2009.00089.
[8] Roger Fletcher. Practical Methods of Optimization. John Wiley & Sons, New York, 2nd edition, 1987.
[9] Jacob Gardner, Geoff Pleiss, Kilian Q. Weinberger, David Bindel, and Andrew G. Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. Advances in Neural Information Processing Systems, 31, 2018.
[10] Wolfgang Hackbusch. A sparse matrix arithmetic based on H-matrices. Part I: Introduction to H-matrices. Computing, 62(2):89–108, 1999.
[11] Wolfgang Hackbusch, Boris N. Khoromskij, and Ronald Kriemann. Hierarchical matrices based on a weak admissibility criterion. Computing, 73(3):207–243, 2004.
[12] Jouni Hartikainen and Simo Särkkä. Kalman filtering and smoothing solutions to temporal Gaussian process regression models. In 2010 IEEE International Workshop on Machine Learning for Signal Processing, pages 379–384. IEEE, 2010.
[13] James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI'13, pages 282–290, Arlington, Virginia, USA, 2013. AUAI Press.
[14] R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 83(1):95–108, 1960.
[15] Sanyam Kapoor, Marc Finzi, Ke Alexander Wang, and Andrew Gordon Wilson. SKIing on simplices: Kernel interpolation on the permutohedral lattice for scalable Gaussian processes. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 5279–5289. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/kapoor21a.html.
[16] Miguel Lázaro-Gredilla, Joaquin Quinonero-Candela, Carl Edward Rasmussen, and Aníbal R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. The Journal of Machine Learning Research, 11:1865–1881, 2010.
[17] Julien-Charles Lévesque, Audrey Durand, Christian Gagné, and Robert Sabourin. Bayesian optimization for conditional hyperparameter spaces. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 286–293, 2017.
[18] Xingchen Ma and Matthew Blaschko. Additive tree-structured covariance function for conditional parameter spaces in Bayesian optimization. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 1015–1025. PMLR, 26–28 Aug 2020.
[19] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[20] Tomaso Poggio, Gil Kur, and Andrzej Banburski. Double descent in the condition number. arXiv preprint arXiv:1912.06190, 2019.
[21] Joaquin Quinonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959, 2005.
[22] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems, volume 18. MIT Press, 2005.
[23] Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In David van Dyk and Max Welling, editors, Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 567–574, Hilton Clearwater Beach Resort, Clearwater Beach, Florida, USA, 16–18 Apr 2009. PMLR. URL https://proceedings.mlr.press/v5/titsias09a.html.
[24] Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pages 567–574. PMLR, 2009.
[25] Ke Wang, Geoff Pleiss, Jacob Gardner, Stephen Tyree, Kilian Q. Weinberger, and Andrew Gordon Wilson. Exact Gaussian processes on a million data points. Advances in Neural Information Processing Systems, 32, 2019.
[26] Veit Wild, Motonobu Kanagawa, and Dino Sejdinovic. Connections and equivalences between the Nyström method and sparse variational Gaussian processes. arXiv preprint arXiv:2106.01121, 2021.
[27] Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In T. Leen, T. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13. MIT Press, 2000. URL https://proceedings.neurips.cc/paper/2000/file/19de10adbaa1b2ee13f77f679fa1483a-Paper.pdf.
[28] Christopher K. Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning, volume 2. MIT Press, Cambridge, MA, 2006.
[29] Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pages 1775–1784. PMLR, 2015.
[30] Yunong Zhang, William E. Leithead, and Douglas J. Leith. Time-series Gaussian process regression based on Toeplitz computation of $O(n^2)$ operations and $O(n)$-level storage. In Proceedings of the 44th IEEE Conference on Decision and Control, pages 3711–3716. IEEE, 2005.