# Scalable Gaussian Processes with Grid-Structured Eigenfunctions (GP-GRIEF)

Trefor W. Evans and Prasanth B. Nair, University of Toronto, Canada.
Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018.

**Abstract.** We introduce a kernel approximation strategy that enables computation of the Gaussian process log marginal likelihood and all hyperparameter derivatives in $\mathcal{O}(p)$ time. Our GRIEF kernel consists of $p$ eigenfunctions found using a Nyström approximation from a dense Cartesian product grid of inducing points. By exploiting algebraic properties of Kronecker and Khatri-Rao tensor products, the computational complexity of the training procedure can be made practically independent of the number of inducing points. This allows us to use arbitrarily many inducing points to achieve a globally accurate kernel approximation, even in high-dimensional problems. The fast likelihood evaluation enables type-I or type-II Bayesian inference on large-scale datasets. We benchmark our algorithms on real-world problems with up to two million training points and $10^{33}$ inducing points.

## 1. Introduction

Gaussian process (GP) modelling is a powerful Bayesian approach for classification and regression; however, it is restricted to modestly sized datasets since training and inference require $\mathcal{O}(n^3)$ time and $\mathcal{O}(n^2)$ storage, where $n$ is the number of training points (Rasmussen & Williams, 2006). This has motivated the development of approximate GP methods that use a set of $m$ ($\ll n$) inducing points to reduce time and memory requirements to $\mathcal{O}(m^2 n + m^3)$ and $\mathcal{O}(mn)$, respectively (Smola & Bartlett, 2000; Snelson & Ghahramani, 2006; Titsias, 2009; Peng & Qi, 2015). However, such techniques perform poorly if too few inducing points are used, and the computational savings are lost on complex datasets that require $m$ to be large.

Wilson & Nickisch (2015) exploited the structure of inducing points placed on a Cartesian product grid, allowing $m > n$ while dramatically reducing computational demands over an exact GP. This inducing point structure enables significant performance gains in low dimensions; however, time and storage complexities scale exponentially with the dataset dimensionality, rendering the technique intractable for general learning problems unless a dimensionality reduction procedure is applied. In the present work, a Cartesian product grid of inducing points is also considered; however, we show that these computational bottlenecks can be eliminated by identifying and exploiting further structure in the resulting matrices. The proposed approach leads to a highly scalable algorithm which we call GP-GRIEF (Gaussian Processes with Grid-Structured Eigenfunctions). After an initial setup cost of $\mathcal{O}(np^2 + dnp + dm^{3/d})$, GP-GRIEF requires only $\mathcal{O}(p)$ time and $\mathcal{O}(p)$ memory per log marginal likelihood evaluation, where $d$ denotes the dataset dimensionality and $p$ is the number of eigenfunctions that we describe next. We emphasize that our complexity is practically independent of $m$, which can generally be set arbitrarily high.

GP-GRIEF approximates an exact kernel as a finite sum of eigenfunctions, which we compute accurately using a Nyström approximation conditioned on a huge number of inducing points.
In other words, our model is sparse in the kernel eigenfunctions rather than in the number of inducing points, which can greatly exceed the size of the training set thanks to the structure we introduce. This is attractive because eigenfunctions are well known to produce the most compact representation among orthogonal basis functions. Although the eigenfunctions we use are approximate, we demonstrate convergence in the limit of large $m$. Additionally, our ability to fill out the input space with inducing points enables accurate global approximations of the eigenfunctions, even at test locations far from the training data. These basis functions also live in a reproducing kernel Hilbert space, unlike some other sparse GPs whose bases have a pre-specified form (e.g. Lázaro-Gredilla et al. (2010)). We summarize our main contributions below:

- We break the curse of dimensionality incurred by placing inducing points on a full Cartesian product grid. Typically, a grid of inducing points results in a computational complexity that scales exponentially in $d$; we reduce this complexity to linear in $d$ by exploiting algebraic properties of Kronecker and Khatri-Rao products.
- We practically eliminate the dependence of the computational complexity on the inducing point count $m$. This allows us to choose $m \gg n$ to provide a highly accurate kernel approximation, even at locations far from the training data.
- We show that the Nyström eigenfunction approximation becomes exact for large $m$, which is achievable thanks to the structure and algebra we introduce.
- We discuss applications of the developed algebra that enable the extension of structured kernel interpolation methods to high-dimensional problems. We also develop an efficient preconditioner for general kernel matrices.
- We discuss a flexible parametrization of the GRIEF kernel through a re-weighting of the kernel eigenfunctions. This admits computation of the log marginal likelihood, along with all $p+1$ hyperparameter derivatives, in $\mathcal{O}(p)$.
- Finally, we demonstrate type-I Bayesian inference on real-world datasets with up to 2 million training points and $m = 10^{33}$ inducing points.

We begin with a review of GPs in section 2, and we outline an eigenfunction kernel approximation in section 3. Section 4 demonstrates why we should use many inducing points and subsequently develops the algebra necessary to make $m \gg n$ efficient and stable. Section 5 outlines a kernel reparameterization that enables efficient type-I Bayesian inference, even for large datasets. We finish with numerical studies in section 6, demonstrating the performance of GP-GRIEF on real-world datasets.

## 2. Background on Gaussian Processes

We employ Gaussian processes (GPs) as non-parametric prior distributions over the latent function that generated the training dataset. It is assumed that the dataset is corrupted by independent Gaussian noise with variance $\sigma^2 \ge 0$ and that the latent function is drawn from a Gaussian process with zero mean and covariance determined by the kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. Considering a regression problem, the log marginal likelihood (LML) of the training targets $\mathbf{y} \in \mathbb{R}^n$ can be written as

$$\log \mathcal{P}(\mathbf{y}\,|\,\boldsymbol{\theta},\sigma^2,X) = -\tfrac{1}{2}\log\left|K_{X,X}+\sigma^2 I_n\right| - \tfrac{1}{2}\,\mathbf{y}^T\!\left(K_{X,X}+\sigma^2 I_n\right)^{-1}\mathbf{y} - \tfrac{n}{2}\log(2\pi), \tag{1}$$

where $X = \{\mathbf{x}_i \in \mathbb{R}^d\}_{i=1}^{n}$ is the set of $n$ training input locations, $[K_{A,B}]_{i,j} = k(\mathbf{a}_i, \mathbf{b}_j)$ so that $K_{X,X} \in \mathbb{R}^{n\times n}$ is the kernel covariance matrix evaluated on the training dataset, and we assume the kernel is parametrized by the hyperparameters $\boldsymbol{\theta}$.
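To make eq. (1) concrete, the following minimal sketch evaluates the exact LML with a Cholesky factorization. It assumes a squared-exponential kernel, and the helper names (`se_kernel`, `exact_lml`) are illustrative, not taken from the authors' code.

```python
import numpy as np

def se_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix K_{A,B} (an assumed base kernel)."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq_dists / lengthscale**2)

def exact_lml(X, y, noise_var, lengthscale=1.0):
    """O(n^3) evaluation of eq. (1) using a Cholesky factorization."""
    n = X.shape[0]
    K = se_kernel(X, X, lengthscale) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + sigma^2 I)^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                   # -0.5 * log|K + sigma^2 I|
            - 0.5 * n * np.log(2.0 * np.pi))
```

The $\mathcal{O}(n^3)$ factorization here is precisely the bottleneck that the GRIEF approximation is designed to remove.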
For type-II Bayesian inference, we select the hyperparameters $\{\sigma^2, \boldsymbol{\theta}\}$ that maximize the LML. After hyperparameter estimation, inference can be carried out at an untried point $\mathbf{x}_* \in \mathbb{R}^d$, giving the posterior distribution of the prediction $y_* \in \mathbb{R}$:

$$y_*\,|\,\boldsymbol{\theta},\sigma^2,X,\mathbf{x}_* \sim \mathcal{N}\!\left(\mathbb{E}[y_*],\,\mathbb{V}[y_*]\right),\quad
\mathbb{E}[y_*] = K_{\mathbf{x}_*,X}\!\left(K_{X,X}+\sigma^2 I_n\right)^{-1}\mathbf{y},\quad
\mathbb{V}[y_*] = K_{\mathbf{x}_*,\mathbf{x}_*} - K_{\mathbf{x}_*,X}\!\left(K_{X,X}+\sigma^2 I_n\right)^{-1}K_{X,\mathbf{x}_*}. \tag{2}$$

If we take a fully Bayesian (type-I) approach, we instead integrate out the hyperparameters under the hyperparameter posterior. This generally results in analytically intractable integrals which require techniques such as Markov chain Monte Carlo (MCMC) sampling or one of its variants (Neal, 1997).

## 3. Eigenfunction Kernel Approximation

We consider a compact representation of the GP prior using a truncated Mercer expansion of the kernel $k$. We use the first $p$ eigenfunctions, which we approximate numerically using a Nyström approximation (Peng & Qi, 2015):

$$\tilde{k}(\mathbf{x},\mathbf{z}) = \sum_{i=1}^{p}\underbrace{\lambda_i^{-1/2}K_{\mathbf{x},U}\mathbf{q}_i}_{\phi_i(\mathbf{x})}\;\underbrace{\lambda_i^{-1/2}K_{\mathbf{z},U}\mathbf{q}_i}_{\phi_i(\mathbf{z})} = K_{\mathbf{x},U}\,Q S_p^T\,\Lambda_p^{-1}\,S_p Q^T K_{U,\mathbf{z}} \approx k(\mathbf{x},\mathbf{z}), \tag{3}$$

where $U = \{\mathbf{u}_i \in \mathbb{R}^d\}_{i=1}^{m}$ is the set of $m$ inducing point locations; $\Lambda, Q \in \mathbb{R}^{m\times m}$ are diagonal and unitary matrices containing the eigenvalues and eigenvectors of $K_{U,U}$, respectively; $\lambda_i$ and $\mathbf{q}_i$ denote the $i$th largest eigenvalue of $K_{U,U}$ and its corresponding eigenvector; $S_p \in \mathbb{R}^{p\times m}$ is a sparse selection matrix whose $i$th row contains a single unit entry in the column corresponding to the index of the $i$th largest value on the diagonal of $\Lambda$; and we use the shorthand $\Lambda_p = S_p\Lambda S_p^T = \mathrm{diag}(\boldsymbol{\lambda}_p) \in \mathbb{R}^{p\times p}$ for the diagonal matrix containing the $p$ largest eigenvalues of $K_{U,U}$, sorted in descending order. Here $\phi_i(\mathbf{x})$ is the numerical approximation of the $i$th eigenfunction evaluated at the input $\mathbf{x}$, scaled by the square root of the $i$th eigenvalue. We only explicitly compute this scaled eigenfunction, for numerical stability, as we discuss later. Using the kernel $\tilde{k}$, the prior covariance matrix on the training set becomes

$$\tilde{K}_{X,X} = \underbrace{K_{X,U}Q S_p^T\Lambda_p^{-1/2}}_{\Phi}\;\underbrace{\Lambda_p^{-1/2}S_p Q^T K_{U,X}}_{\Phi^T} = \Phi\Phi^T, \tag{4}$$

where the columns of $\Phi \in \mathbb{R}^{n\times p}$ contain the $p$ scaled eigenfunctions of our kernel evaluated on the training set. Observe that if $U$ is randomly sampled from $X$, then $\tilde{K}_{X,X}$ is the same covariance matrix used in the Nyström method of Williams & Seeger (2001); however, since we have replaced the kernel and not just the covariance matrix, we recover a valid probabilistic model (Peng & Qi, 2015).

While $\tilde{K}_{X,X}$ has rank at most $p$, Peng & Qi (2015) show how a correction can be added to $\tilde{k}$ (eq. (3)) to give a full-rank covariance matrix (provided $k$ gives one as well), so that the resulting GP is non-degenerate. This correction can be written as $\delta(\mathbf{x}-\mathbf{z})\big(k(\mathbf{x},\mathbf{z}) - \tilde{k}(\mathbf{x},\mathbf{z})\big)$, where $\delta(\mathbf{a}) = 1$ if $\mathbf{a} = \mathbf{0}$ and $0$ otherwise. The correction term does not affect the computational complexity of GP training; however, we find it does not generally improve performance over the unmodified $\tilde{k}$, and we do not consider it further.

## 4. Grid-Structured Eigenfunctions (GRIEF)

Previous work employing Nyström approximations in kernel methods requires $m$ to be small (often $\ll n$) to yield computational benefits. As a result, the choice of inducing point locations $U$ has a great influence on the approximation accuracy, and many techniques have been proposed to choose $U$ effectively (Smola & Schölkopf, 2000; Drineas & Mahoney, 2005; Zhang et al., 2008; Belabbas & Wolfe, 2009; Kumar et al., 2012; Wang & Zhang, 2013; Gittens & Mahoney, 2013; Li et al., 2016; Musco & Musco, 2017).
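Before any grid structure is introduced, the following sketch illustrates the dense, small-$m$ Nyström feature construction of eqs. (3)-(4). It reuses the hypothetical `se_kernel` helper above; the function name and the `jitter` argument are illustrative choices, not the authors' API.

```python
import numpy as np

def nystrom_features(X, U, p, lengthscale=1.0, jitter=1e-10):
    """Scaled Nystrom eigenfunctions of eq. (3): phi_i(x) = lambda_i^{-1/2} K_{x,U} q_i,
    keeping the p largest eigenpairs of K_{U,U}, so that K_tilde = Phi @ Phi.T (eq. 4)."""
    K_UU = se_kernel(U, U, lengthscale) + jitter * np.eye(U.shape[0])  # keep eigenvalues positive
    lam, Q = np.linalg.eigh(K_UU)                # O(m^3); eigenvalues in ascending order
    top = np.argsort(lam)[::-1][:p]              # indices of the p largest (the role of S_p)
    K_XU = se_kernel(X, U, lengthscale)
    Phi = (K_XU @ Q[:, top]) / np.sqrt(lam[top]) # n x p matrix of scaled eigenfunctions
    return Phi
```

The rest of section 4 shows how the same features can be formed when $K_{U,U}$ is far too large to build or eigendecompose directly.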
In this work, we would instead like to use so many inducing points that carefully optimizing the distribution of $U$ becomes unnecessary; we will even consider $m \gg n$. The following result shows how an eigenfunction approximation is improved by using many inducing points.

**Theorem 1.** If the $i$th eigenvalue of $k$ is simple and non-zero and $U \supset X$, a Nyström approximation of the $i$th kernel eigenfunction converges in the limit of large $m$,

$$\mathbf{q}_i^{(n)} = \lim_{m\to\infty}\sqrt{\tfrac{m}{n}}\,\frac{1}{\lambda_i^{(m)}}\,K_{X,U}\,\mathbf{q}_i^{(m)}, \tag{5}$$

where $\lambda_i^{(m)} \in \mathbb{R}$ and $\mathbf{q}_i^{(m)} \in \mathbb{R}^m$ are the $i$th largest eigenvalue of $K_{U,U}$ and its corresponding eigenvector, and $\mathbf{q}_i^{(n)}$ is the kernel eigenfunction corresponding to the $i$th largest eigenvalue, evaluated on the set $X$.

*Proof.* We begin by constructing a Nyström approximation of the eigenfunction evaluated on $U$, using $X$ as inducing points. From theorem 3.5 of Baker (1977), as $m \to \infty$,

$$\mathbf{q}_i^{(m)} \to \sqrt{\tfrac{n}{m}}\,\frac{1}{\lambda_i^{(n)}}\,K_{U,X}\,\mathbf{q}_i^{(n)}, \tag{6}$$

where we assume that the $i$th eigenvalue of $k$ is simple and non-zero. Multiplying both sides by $K_{X,U}K_{U,U}^{-1}$,

$$K_{X,U}K_{U,U}^{-1}\mathbf{q}_i^{(m)} = \sqrt{\tfrac{n}{m}}\,\frac{1}{\lambda_i^{(n)}}\,K_{X,U}K_{U,U}^{-1}K_{U,X}\,\mathbf{q}_i^{(n)}. \tag{7}$$

Since $U \supset X$, $K_{X,U} = S_n K_{U,U}$ is a subset of the rows of $K_{U,U}$, where $S_n \in \mathbb{R}^{n\times m}$ is a selection matrix. We can therefore write $K_{X,U}K_{U,U}^{-1}K_{U,X} = S_n K_{U,U}K_{U,U}^{-1}K_{U,U}S_n^T = S_n K_{U,U}S_n^T = K_{X,X}$. Additionally, since the eigenvector $\mathbf{q}_i^{(m)}$ satisfies $K_{U,U}^{-1}\mathbf{q}_i^{(m)} = \frac{1}{\lambda_i^{(m)}}\mathbf{q}_i^{(m)}$, we get

$$\frac{1}{\lambda_i^{(m)}}K_{X,U}\,\mathbf{q}_i^{(m)} = \sqrt{\tfrac{n}{m}}\,\frac{1}{\lambda_i^{(n)}}\,K_{X,X}\,\mathbf{q}_i^{(n)}. \tag{8}$$

Noting that $K_{X,X}\mathbf{q}_i^{(n)} = \lambda_i^{(n)}\mathbf{q}_i^{(n)}$ completes the proof. ∎

For multiple (repeated) eigenvalues, it can similarly be shown that the $i$th approximated eigenfunction converges to lie within the linear span of the eigenfunctions corresponding to the $i$th eigenvalue of $k$ as $m\to\infty$.

We can use a large $m$ by distributing inducing points on a Cartesian tensor product grid (we want $U$ to be sampled from the same distribution as the training data; approximating the data distribution by placing $U$ on a grid is easy to do by various means as a quick preprocessing step). Saatçi (2011) demonstrated efficient GP inference when training points are distributed in this way by exploiting Kronecker matrix algebra. We assume this grid structure for our inducing points, i.e. $U$ forms a grid. If the covariance kernel satisfies the product correlation rule (as many popular multidimensional kernels do), i.e. $k(\mathbf{x},\mathbf{z}) = \prod_{i=1}^{d}k_i(x_i, z_i)$, then $K_{U,U} \in \mathbb{R}^{m\times m}$ inherits the Kronecker product form $K_{U,U} = \bigotimes_{i=1}^{d}K^{(i)}_{U,U}$, where $\otimes$ is the Kronecker product (Van Loan, 2000). Here $K^{(i)}_{U,U} \in \mathbb{R}^{\bar{m}\times\bar{m}}$ are one-dimensional kernel covariance matrices for a slice of the input-space grid along the $i$th dimension, and $\bar{m} = \sqrt[d]{m} = \mathcal{O}(10)$ is the number of inducing points we choose along each dimension of the full grid. Since the Kronecker product builds a large, expanded matrix from smaller ones, it is very advantageous to manipulate and store the small matrices without unpacking them or explicitly computing the Kronecker product. Exploiting this structure decreases the storage of $K_{U,U}$ from $\mathcal{O}(m^2)$ to $\mathcal{O}(dm^{2/d}) = \mathcal{O}(d\bar{m}^2)$, and the cost of matrix-vector products with $K_{U,U}$ from $\mathcal{O}(m^2)$ to $\mathcal{O}(dm^{(d+1)/d}) = \mathcal{O}(d\bar{m}^{d+1})$. Additionally, the cost of the eigendecomposition of $K_{U,U}$ decreases from $\mathcal{O}(m^3)$ to $\mathcal{O}(dm^{3/d}) = \mathcal{O}(d\bar{m}^3)$, and the eigenvector matrix $Q = \bigotimes_{i=1}^{d}Q^{(i)}$ and eigenvalue matrix $\Lambda = \mathrm{diag}\big(\bigotimes_{i=1}^{d}\boldsymbol{\lambda}^{(i)}\big)$ both inherit a Kronecker product structure, enabling matrix-vector products with $\tilde{K}_{X,X}$ in $\mathcal{O}(d\bar{m}^{d+1})$ operations (Van Loan, 2000; Saatçi, 2011).

In low dimensions, exploiting the Kronecker product structure of $K_{U,U} = \bigotimes_{i=1}^{d}K^{(i)}_{U,U}$ can be greatly advantageous; however, we can immediately see from the above complexities that the cost of matrix-vector products with $\tilde{K}_{X,X}$ increases exponentially in $d$ (we assume a conjugate gradient method would be employed for GP training, requiring matrix-vector products; alternative formulations would require columns of $Q = \bigotimes_{i=1}^{d}Q^{(i)}$ to be expanded, which similarly scales exponentially, $\mathcal{O}(\bar{m}^d)$).
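As a minimal sketch of the Kronecker algebra above (illustrative names, not the authors' implementation), the routine below applies $\bigotimes_{i=1}^{d} A^{(i)}$ to a vector using only the $d$ small $\bar m\times\bar m$ factors, at $\mathcal{O}(d\bar m^{d+1})$ cost; this exponential-in-$d$ cost is exactly what motivates the Khatri-Rao development that follows.

```python
import numpy as np

def kron_mvprod(As, x):
    """Return (A_1 kron A_2 kron ... kron A_d) @ x for square factors A_i,
    without ever forming the full m x m = m_bar^d x m_bar^d matrix."""
    y = np.asarray(x, dtype=float)
    for A in reversed(As):
        Y = y.reshape(-1, A.shape[1])   # expose the current dimension as the trailing axis
        y = (Y @ A.T).T.ravel()         # apply the small factor, then rotate that axis to the front
    return y

# Per-dimension eigendecompositions, e.g. lams, Qs = zip(*map(np.linalg.eigh, K_grid)),
# cost O(d * m_bar^3) in total and give Q and Lambda their Kronecker structure.
```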
The storage requirements similarly increase exponentially, since a vector of length $m = \bar{m}^d$ must be stored whenever a matrix-vector product is taken with $Q = \bigotimes_{i=1}^{d}Q^{(i)}$, and $K_{X,U}$ requires $\mathcal{O}(\bar{m}^d n)$ storage. This poor scaling poses a serious impediment to the application of this approach to high-dimensional datasets. We now show how to massively decrease the time and storage requirements, from exponential to linear in $d$, by identifying further matrix structure in our problem.

We begin by identifying structure in the exact cross-covariance between training (or test) points and inducing points. These matrices, e.g. $K_{X,U}$, admit a row-partitioned Khatri-Rao product structure as follows (Nickson et al., 2015):

$$K_{X,U} = \mathop{\ast}\limits_{i=1}^{d}K^{(i)}_{X,U} =
\begin{bmatrix}
K^{(1)}_{X,U}(1,:)\otimes K^{(2)}_{X,U}(1,:)\otimes\cdots\otimes K^{(d)}_{X,U}(1,:)\\
K^{(1)}_{X,U}(2,:)\otimes K^{(2)}_{X,U}(2,:)\otimes\cdots\otimes K^{(d)}_{X,U}(2,:)\\
\vdots\\
K^{(1)}_{X,U}(n,:)\otimes K^{(2)}_{X,U}(n,:)\otimes\cdots\otimes K^{(d)}_{X,U}(n,:)
\end{bmatrix}, \tag{9}$$

where $\ast$ is the Khatri-Rao product, whose computation gives a block Kronecker product matrix (Liu & Trenkler, 2008); we always state how the Khatri-Rao product blocks are partitioned. Since the $K^{(i)}_{X,U}$ are only of size $n\times\bar{m}$, the storage of $K_{X,U}$ decreases from exponential to linear in $d$: $\mathcal{O}(\bar{m}^d n) \to \mathcal{O}(dn\bar{m}) = \mathcal{O}(dn)$. We also observe that the selection matrix $S_p = \mathop{\ast}_{i=1}^{d}S_p^{(i)}$ can be written as a row-partitioned Khatri-Rao product matrix where each submatrix contains one non-zero per row. Further, by exploiting both Kronecker and Khatri-Rao matrix algebra, our main result below shows that $K_{X,U}QS_p^T$ can be computed in $\mathcal{O}(dnp)$ time, a substantial reduction over the naive cost of $\mathcal{O}(\bar{m}^d np)$.

**Theorem 2.** The product of a row-partitioned Khatri-Rao matrix $K_{X,U} = \mathop{\ast}_{i=1}^{d}K^{(i)}_{X,U} \in \mathbb{R}^{n\times\bar{m}^d}$, a Kronecker product matrix $Q = \bigotimes_{i=1}^{d}Q^{(i)} \in \mathbb{R}^{\bar{m}^d\times\bar{m}^d}$, and a column-partitioned Khatri-Rao matrix $S_p^T = \big(\mathop{\ast}_{i=1}^{d}S_p^{(i)}\big)^T \in \mathbb{R}^{\bar{m}^d\times p}$ can be computed as

$$K_{X,U}\,Q\,S_p^T = \bigodot_{i=1}^{d}K^{(i)}_{X,U}\,Q^{(i)}\,\big(S_p^{(i)}\big)^T, \tag{10}$$

where $\odot$ is the (element-wise) Hadamard product. This computation only requires products of the smaller matrices $K^{(i)}_{X,U} \in \mathbb{R}^{n\times\bar{m}}$, $Q^{(i)} \in \mathbb{R}^{\bar{m}\times\bar{m}}$ and $S_p^{(i)} \in \mathbb{R}^{p\times\bar{m}}$.

*Proof.* First, observe that $K_{X,U}Q = \mathop{\ast}_{i=1}^{d}K^{(i)}_{X,U}Q^{(i)} = \mathop{\ast}_{i=1}^{d}R^{(i)}$ is a row-partitioned Khatri-Rao product matrix by theorem 2 of Liu & Trenkler (2008). It remains to compute a matrix product of row- and column-partitioned Khatri-Rao matrices. Each element of this product is an inner product between two Kronecker product vectors, i.e.

$$\big[K_{X,U}QS_p^T\big]_{ij} = \Big(\bigotimes_{l=1}^{d}R^{(l)}(i,:)\Big)\Big(\bigotimes_{l=1}^{d}S_p^{(l)T}(:,j)\Big) = \prod_{l=1}^{d}R^{(l)}(i,:)\,S_p^{(l)T}(:,j).$$

Writing this in matrix form completes the proof. ∎

If all the sub-matrices were dense, $\mathcal{O}(dn\bar{m}p)$ time would be required to compute $K_{X,U}QS_p^T$; however, since $S_p$ is a sparse selection matrix with one non-zero per row, the computation requires just $\mathcal{O}(dn\max(p,\bar{m}c)) = \mathcal{O}(dnp)$ time. We achieve this by computing only the necessary columns of $K^{(i)}_{X,U}Q^{(i)}$ and avoiding redundant computations. The constant $c$ is the average number of non-zero columns in each of $\{S_p^{(i)}\}_{i=1}^{d}$, which is typically $\mathcal{O}(1)$ but may be $\bar{m}$ in the worst case.
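The following sketch implements the product in eq. (10) under the stated structure; the names (`khatri_rao_kron_select`, `idx`) are illustrative, and `idx[l, i]` is assumed to hold the column index of the single non-zero in row $l$ of $S_p^{(i)}$ (these indices come from the eigenvalue search described below).

```python
import numpy as np

def khatri_rao_kron_select(K_XU_list, Q_list, idx):
    """Theorem 2 (eq. 10): K_{X,U} Q S_p^T as a Hadamard product of d small n x p matrices.
    K_XU_list[i]: n x m_bar, Q_list[i]: m_bar x m_bar, idx: p x d integer array."""
    n, p = K_XU_list[0].shape[0], idx.shape[0]
    out = np.ones((n, p))
    for i, (K_i, Q_i) in enumerate(zip(K_XU_list, Q_list)):
        cols, inv = np.unique(idx[:, i], return_inverse=True)  # the c unique columns needed
        B_i = K_i @ Q_i[:, cols]   # compute only the necessary columns of K^(i) Q^(i)
        out *= B_i[:, inv]         # Hadamard-accumulate; O(dn max(p, m_bar c)) overall
    return out                      # equals K_{X,U} Q S_p^T
```

Because each factor in the product is only $n\times p$, neither the $\bar m^d$-column Khatri-Rao matrix nor the Kronecker matrix $Q$ is ever formed.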
Evidently the time complexity is effectively independent of the number of inducing points. Even in the rare worst case where $c = \bar{m}$ and $\bar{m}^2 > p$, the scaling is extremely weak: $\mathcal{O}(dnm^{2/d})$. What results is a kernel composed of basis eigenfunctions that are accurately approximated on a grid of inducing points using a Nyström approximation. Although $m$ increases exponentially in $d$ under this inducing point structure, the cost of GP training and inference is not affected. We call the resulting model GP-GRIEF (GP with GRId-structured Eigen Functions). Completing the computations required for GP training and inference requires only straightforward application of the matrix determinant and matrix inversion lemmas, which we demonstrate later in eq. (13).

**Eigenvalue Search.** What has not yet been addressed is how to form $S_p = \mathop{\ast}_{i=1}^{d}S_p^{(i)}$ and compute $\boldsymbol{\lambda}_p$ efficiently. This requires finding the index locations and values of the $p$ largest eigenvalues in a vector of length $m = \bar{m}^d$. In high dimensions this task is daunting, considering $m$ can easily exceed the number of atoms in the observable universe. Fortunately, the resulting vector of eigenvalues, $\mathrm{diag}(\Lambda) = \bigotimes_{i=1}^{d}\boldsymbol{\lambda}^{(i)}$, has a Kronecker product structure which we can exploit to develop a fast search algorithm that requires only $\mathcal{O}(d\bar{m}p)$ time. To do this, we compute a truncated Kronecker product expansion by keeping only the $p$ largest values after each sequential Kronecker product, so that only Kronecker products between length-$p$ and length-$\bar{m}$ vectors are ever formed. Algorithm 1 outlines a more numerically stable version of this search strategy that works with the log of the eigenvalues, and also shows how $S_p$ is computed.

**Algorithm 1.** Computes $S_p$ and $\log\boldsymbol{\lambda}_p$ (the log of the $p$ largest eigenvalues of $K_{U,U}$). We use zero-based array indexing; $\mathrm{mod}(\mathbf{a},b)$ computes $\mathbf{a}$ mod $b$, $|\mathbf{a}|$ is the length of $\mathbf{a}$, $\mathrm{sort}_b(\mathbf{a})$ returns the $\min(|\mathbf{a}|,b)$ largest elements of $\mathbf{a}$ in descending order together with their indices in $\mathbf{a}$, and $\lfloor\mathbf{a}\rfloor$ is the element-wise floor.

Input: $\{\boldsymbol{\lambda}^{(i)} \in \mathbb{R}^{\bar{m}}\}_{i=1}^{d}$. Output: $\{S_p^{(i)} \in \mathbb{R}^{p\times\bar{m}}\}_{i=1}^{d}$ and $\log\boldsymbol{\lambda}_p \in \mathbb{R}^{p}$.

1. $\log\boldsymbol{\lambda}_p,\ \mathrm{idxs} \leftarrow \mathrm{sort}_p\!\big(\log(\boldsymbol{\lambda}^{(1)})\big)$
2. for $i = 2$ to $d$ do
3. $\quad\log\boldsymbol{\lambda}_p,\ \mathrm{ord} \leftarrow \mathrm{sort}_p\!\big(\log\boldsymbol{\lambda}_p\otimes\mathbf{1}_{\bar{m}} + \mathbf{1}_{|\log\boldsymbol{\lambda}_p|}\otimes\log(\boldsymbol{\lambda}^{(i)})\big)$
4. $\quad\mathrm{idxs} \leftarrow \big[\,\mathrm{idxs}(\lfloor\mathrm{ord}/\bar{m}\rfloor,\,:),\ \mathrm{mod}(\mathrm{ord},\bar{m})\,\big]$
5. end for
6. $S_p^{(i)} \leftarrow I_{\bar{m}}\big(\mathrm{idxs}(:,\,i-1),\,:\big)$

**Computation in High Dimensions.** Direct use of theorem 2 may lead to finite-precision rounding inaccuracies and overflow errors in high dimensions because of the Hadamard product over $d$ matrices. We can write a more numerically stable version of this computation by taking the log of eq. (10), which turns the product over $d$ matrices into a sum:

$$K_{X,U}QS_p^T = \Big(\bigodot_{i=1}^{d}\mathrm{sign}\big(B^{(i)}\big)\Big)\odot\exp\!\Big(\sum_{i=1}^{d}\log\big|B^{(i)}\big|\Big),$$

where $B^{(i)} = K^{(i)}_{X,U}Q^{(i)}\big(S_p^{(i)}\big)^T$, and $\exp$, $\log$ and $|\cdot|$ are computed element-wise. While the sign matrix is still a Hadamard product of $d$ matrices, it contains only $\{-1, 0, 1\}$ and so is not susceptible to numerical issues; when the sign of an element is zero, we do not compute the log. Unfortunately, the $\exp$ computation can still lead to numerical issues; however, $\Phi$ suffers less because of the rescaling provided by the eigenvalues (i.e. elements of $\Phi$ are the quotient of possibly very large or small values). Since all eigenvalues are positive, we can stably compute $\Phi$ as

$$\Phi = K_{X,U}QS_p^T\Lambda_p^{-1/2} = \Big(\bigodot_{i=1}^{d}\mathrm{sign}\big(B^{(i)}\big)\Big)\odot\exp\!\Big(\sum_{i=1}^{d}\log\big|B^{(i)}\big| - \tfrac{1}{2}\mathbf{1}_n\log\boldsymbol{\lambda}_p^T\Big),$$

where $\log\boldsymbol{\lambda}_p$ is computed by algorithm 1.
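A minimal NumPy sketch of algorithm 1 follows (illustrative function name, zero-based indices as in the pseudocode): the top-$p$ eigenvalues of the Kronecker product are found by truncating to the $p$ largest values after each sequential product of log-eigenvalue vectors, and the returned index array plays the role of the `idx` assumed in the earlier sketch of theorem 2.

```python
import numpy as np

def top_p_kron_eigs(lams, p):
    """lams: list of d length-m_bar eigenvalue vectors of the K^(i)_{U,U}.
    Returns (log_lam_p, idx), where idx[l, i] is the per-dimension index of the
    l-th largest eigenvalue of K_{U,U}. Cost: O(d * m_bar * p)."""
    log_lam_p = np.log(lams[0])
    order = np.argsort(log_lam_p)[::-1][:p]
    log_lam_p, idx = log_lam_p[order], order[:, None]
    for lam_i in lams[1:]:
        m_i = len(lam_i)
        # all pairwise sums of the current top values with the next dimension's log-eigenvalues
        sums = (log_lam_p[:, None] + np.log(lam_i)[None, :]).ravel()
        order = np.argsort(sums)[::-1][:p]            # keep only the p largest so far
        log_lam_p = sums[order]
        idx = np.hstack([idx[order // m_i], (order % m_i)[:, None]])
    return log_lam_p, idx
```

Since every eigenvalue of a positive-definite $K^{(i)}_{U,U}$ is positive, discarding values outside the running top $p$ after each product cannot remove any member of the final top $p$.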
### 4.1. Preconditioning Applications

As an aside, we note that $\big(\tilde{K}_{X,X}+\sigma^2 I_n\big)^{-1}$ can be applied as a preconditioner for the exact kernel matrix $K_{X,X}+\sigma^2 I_n$ in moderately sized problems where $\mathcal{O}(n^2)$ storage is not prohibitive. The use of $\tilde{K}_{X,X}$ for matrix preconditioning was explored with notable empirical success by Cutajar et al. (2016), where a subset of the training data was used as inducing points, giving $U \subset X$ and $m < n$. By theorem 1 we know that the Nyström approximation converges for large $m$, and we have shown that we can accommodate $m \gg n$ to provide an accurate low-rank kernel matrix approximation.

### 4.2. SKI Applications

As a further aside, we discuss how the developed algebra can be applied in a general kernel interpolation setting. Wilson & Nickisch (2015) introduced a kernel interpolation perspective that unifies inducing point methods, wherein the kernel is interpolated from a covariance matrix on the inducing points. For instance, the subset of regressors (SoR) method (Silverman, 1985; Quiñonero-Candela & Rasmussen, 2005) can be viewed as a zero-mean GP interpolant of the true kernel, while Wilson & Nickisch (2015) proposed a sparse approximate interpolant. We can denote the interpolated covariance matrix as $E K_{U,U} E^T$, where $E \in \mathbb{R}^{n\times m}$ is the interpolation matrix. In structured kernel interpolation (SKI) the inducing points form a grid, so that $K_{U,U}$ inherits a Kronecker product form. This can provide dramatic computational advantages; however, SKI suffers from the exponential scaling discussed earlier and so is recommended only for very low-dimensional problems, $d \le 5$ (Wilson et al., 2016). We observe that in the case of the GP interpolant (e.g. SoR), as well as the sparse interpolant suggested by Wilson & Nickisch (2015), the interpolation matrix $E$ inherits a row-partitioned Khatri-Rao structure. This enables direct use of theorem 2 to reduce the exponential scaling in $d$ to a linear scaling, allowing SKI to scale to high-dimensional problems. However, the time complexity would scale quadratically in $n$, unlike the proposed GRIEF methods.

## 5. Re-weighted Eigenfunction Kernel

We can approximately recover a wide class of kernels by modifying the weights associated with the kernel eigenfunctions (Buhmann, 2003). Here we consider this flexible kernel parametrization for the GRIEF kernel. Extending eq. (3), we can write the re-weighted GRIEF kernel as

$$\tilde{k}(\mathbf{x},\mathbf{z}) = \sum_{i=1}^{p}w_i\,\phi_i(\mathbf{x})\,\phi_i(\mathbf{z}), \tag{11}$$

where $W = \mathrm{diag}(\mathbf{w}) \in \mathbb{R}^{p\times p}$ and $\mathbf{w} = \{w_i > 0\}_{i=1}^{p}$ are the eigenfunction weights. The covariance matrix then becomes

$$\tilde{K}_{X,X} = \Phi W\Phi^T, \tag{12}$$

and the full set of hyperparameters is $\{\sigma^2, \mathbf{w}, \boldsymbol{\theta}\}$; however, we choose to fix $\boldsymbol{\theta}$ so that $\Phi$ remains constant. We can then compute the log marginal likelihood (LML) and all $p+1$ derivatives with respect to $\{\sigma^2, \mathbf{w}\}$ in $\mathcal{O}(p)$, independently of $n$. To do this, we first assume that $\mathbf{y}^T\mathbf{y} \in \mathbb{R}$, $\Phi^T\mathbf{y} = \mathbf{r} \in \mathbb{R}^{p}$, and $\Phi^T\Phi = A \in \mathbb{R}^{p\times p}$ are precomputed, which requires $\mathcal{O}(np^2 + dnp + d\bar{m}^3)$ time; this step only needs to be carried out once, before the LML iterations begin. Then, to compute the LML (eq. (1)), we use the matrix inversion and determinant lemmas to give

$$\mathbf{y}^T\big(\tilde{K}_{X,X}+\sigma^2 I_n\big)^{-1}\mathbf{y} = \sigma^{-2}\big(\mathbf{y}^T\mathbf{y} - \mathbf{r}^T P^{-1}\mathbf{r}\big),\qquad
\log\big|\tilde{K}_{X,X}+\sigma^2 I_n\big| = \log|P| + \sum_{i=1}^{p}\log w_i + (n-p)\log\sigma^2, \tag{13}$$

where $P = \sigma^2 W^{-1} + A \in \mathbb{R}^{p\times p}$. Using these relations, the LML can be computed in $\mathcal{O}(p^3)$ time.
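The following sketch (illustrative names, not the authors' API) evaluates the LML from the precomputed statistics $\mathbf{y}^T\mathbf{y}$, $\mathbf{r} = \Phi^T\mathbf{y}$ and $A = \Phi^T\Phi$ using eq. (13). With the orthogonalized basis discussed next, $A$ becomes the identity, $P$ is diagonal, and the same expressions cost only $\mathcal{O}(p)$.

```python
import numpy as np

def grief_lml(yTy, r, A, w, noise_var, n):
    """O(p^3) LML of eq. (13); every quantity handled here is p-dimensional, so no
    n x n (or even n x p) matrix is touched during LML iterations."""
    p = len(w)
    P = noise_var * np.diag(1.0 / w) + A            # P = sigma^2 W^{-1} + A
    L = np.linalg.cholesky(P)
    Pinv_r = np.linalg.solve(L.T, np.linalg.solve(L, r))
    quad = (yTy - r @ Pinv_r) / noise_var           # y^T (K~ + sigma^2 I)^{-1} y
    logdet = (2.0 * np.sum(np.log(np.diag(L)))      # log|P|
              + np.sum(np.log(w)) + (n - p) * np.log(noise_var))
    return -0.5 * (quad + logdet + n * np.log(2.0 * np.pi))
```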
The LML derivatives with respect to all hyperparameters can also be computed in $\mathcal{O}(p^3)$, as shown in the following expressions, which are derived in appendix A of the supplement:

$$\frac{\partial\,\mathrm{LML}}{\partial\mathbf{w}} = \frac{\big(\mathbf{r} - \sigma^{-2}AP^{-1}\mathbf{r}\big)^{\odot 2}}{2\sigma^4} - \frac{\mathrm{diag}(A) - \sigma^{-2}\big(A\odot P^{-1}A\big)^T\mathbf{1}_p}{2\sigma^2},\qquad
\frac{\partial\,\mathrm{LML}}{\partial\sigma^2} = \frac{\mathbf{y}^T\mathbf{y} - 2\sigma^{-2}\mathbf{r}^T P^{-1}\mathbf{r} + \sigma^{-4}\mathbf{r}^T P^{-1}AP^{-1}\mathbf{r}}{2\sigma^4} - \frac{n - \sigma^{-2}\,\mathrm{Tr}\big(P^{-1}A\big)}{2\sigma^2},$$

where $(\cdot)^{\odot 2}$ denotes the element-wise square and $\mathrm{diag}(A)$ extracts the diagonal of $A$ as a vector.

To further reduce the per-iteration computational complexity from $\mathcal{O}(p^3)$ to $\mathcal{O}(p)$, we apply a linear transformation to the basis functions to make them mutually orthogonal when evaluated on the training data. We can write the $i$th transformed basis function as $\tilde{\phi}_i(\mathbf{x}) = \sum_{j=1}^{p}\tilde{v}_{ji}\,\tilde{\Sigma}_{ii}^{-1}\,\phi_j(\mathbf{x})$, where $\tilde{\Sigma} \in \mathbb{R}^{\tilde{p}\times\tilde{p}}$ is a diagonal matrix containing the non-zero singular values of $\Phi$, $\tilde{V} \in \mathbb{R}^{p\times\tilde{p}}$ contains the corresponding right-singular vectors of $\Phi$, and $\tilde{p} \le \min(p, n)$. Using these transformed basis functions, both $A = I_{\tilde{p}}$ and $P = \sigma^2 W^{-1} + A$ become diagonal matrices, enabling evaluation of the LML and all derivatives in $\mathcal{O}(p)$ using eq. (13) and the derivative expressions above. The transformation requires the singular value decomposition of $\Phi$ before the LML iterations begin; however, this precomputation is no more expensive than those discussed previously, at $\mathcal{O}(n\tilde{p}^2)$.

Since the kernel is now heavily parametrized, maximizing the LML for type-II Bayesian inference is susceptible to overfitting. Instead, we may choose to take a fully Bayesian, type-I approach and integrate out the hyperparameters using hybrid MCMC sampling. This type-I approach requires far more LML evaluations than type-II (typically $\mathcal{O}(10^5)$); however, the fast $\mathcal{O}(p)$ (or $\mathcal{O}(p^3)$) evaluations make this tractable even for very large problems, since the cost per iteration is independent of the number of training points. In appendix B of the supplement we discuss further extensions of this type-I inference procedure.

## 6. Experimental Studies

**Two-Dimensional Visualization.** Figure 1 shows a comparison between GP-GRIEF and the variational free energy (VFE) inducing point approximation (Titsias, 2009) on a two-dimensional test problem with $n = 10$ training points generated by the function $f(\mathbf{x}) = \sin(x_1)\sin(x_2)$ and corrupted with $\mathcal{N}(0, 0.1)$ noise. For both models we use a squared-exponential base kernel, and we estimate the kernel lengthscale and noise variance $\sigma^2$ by maximizing the log marginal likelihood; VFE additionally selects its inducing point locations. In this study we do not consider optimizing the GRIEF weights $\mathbf{w}$. VFE with $m = 4$ inducing points achieves a root-mean-square error (RMSE) of 0.47 on the test set, whereas GP-GRIEF with the same number of basis functions, $p = 4$, achieves an RMSE of 0.34, identical to the reconstruction provided by a full GP using the exact kernel (technically, VFE gives an infinite basis function expansion through a correction term; however, we take $p = m = 4$). While GP-GRIEF uses a dense grid of $m = 25$ inducing points, it has a computational complexity equivalent to VFE. This demonstrates the reconstruction power of GP-GRIEF, even when very few eigenfunctions are considered.

**Kernel Reconstruction Accuracy.** We compare the kernel covariance reconstruction error of the GRIEF Nyström method to competing techniques in fig. 2. We sample 5000 points from $\mathcal{U}(\pm\sqrt{3})$ in $d = 100$ dimensions, randomly taking half for training and half for testing, and we consider a squared-exponential kernel. Given only the training set, we attempt to reconstruct the exact prior covariance matrices on the training set (fig. 2a) and on the joint train/test set (fig. 2b). This allows us to study the kernel reconstruction accuracy between points both within and beyond the training set.
In both studies, the proposed GRIEF Nyström method greatly outperforms a random Fourier features reconstruction (Rahimi & Recht, 2007) and a randomized Nyström reconstruction in which $m = p$ inducing points are uniformly sampled from the training set. We emphasize that both randomized Nyström and GRIEF Nyström have the same computational complexity for a given number of basis functions $p$, even though the GRIEF Nyström method uses $m = 10^{200}$ inducing points ($\bar{m} = 100$). In the joint train/test study of fig. 2b, we observe a larger gap between GRIEF Nyström and randomized Nyström than in fig. 2a. This is not surprising, since the goal of the randomized Nyström method (and indeed of all existing extensions to this technique) is to improve the accuracy of eigenfunctions evaluated on the training set, which does not guarantee the accuracy of the eigenfunctions evaluated at points outside this set. For example, if a test point is placed far from the training set, we expect a poor approximation from existing Nyström methods. Our GRIEF Nyström approach, by contrast, attempts to fill out the input space with inducing points everywhere, not just near training points, which yields an accurate approximation at test locations even when they are far from the training data. Comparatively, the random Fourier features technique samples from a distribution that is independent of the training data, so it is also expected to perform no worse on the joint set than on the training set; however, we observe that it provides an equally poor reconstruction on both sets. The black curves in fig. 2 show the exact eigendecomposition of the covariance matrices, which gives the optimal reconstruction accuracy achievable by a kernel approximation with a given number of basis functions. We observe that GP-GRIEF approaches this optimal accuracy in both studies, even though the test point distribution is not known at training time.

[Figure 1: Reconstruction using GP-GRIEF outperforms VFE. Both techniques use an equal number of basis functions and have the same computational complexity. Crosses denote training point positions and dots denote inducing point locations. Panels: (a) test data; (b) VFE, $m = 4$, RMSE 0.47; (c) GP-GRIEF, $m = 25$, $p = 4$, RMSE 0.34.]

[Figure 2: Covariance matrix reconstruction error, $\|\tilde{K} - K\|_2/\|K\|_2$ versus number of basis functions, comparing GRIEF Nyström ($m = 10^{200}$) against randomized Nyström with uniform sampling ($m = p$), random Fourier features, and the exact eigendecomposition, averaged over 10 samples; GP-GRIEF approaches the optimal reconstruction accuracy of the exact eigendecomposition. Panels: (a) training prior covariance error; (b) train/test joint prior covariance error.]

**UCI Regression Studies.** We next assess performance on real-world regression datasets from the UCI repository. Using the authors' code (https://github.com/treforevans/gp_grief), we report the mean and standard deviation of the RMSE from 10-fold cross validation (90% train, 10% test per fold; we use the folds from https://people.orie.cornell.edu/andrew/code). Also presented is the mean training time per fold on a machine with two E5-2680 v3 processors.
We use a squared-exponential kernel with automatic relevance determination (SE-ARD), and we compare our test errors to those reported by Yang et al. (2015) using type-II inference on the same train/test splits. Yang et al. (2015) used an exact GP with an SE-ARD kernel for datasets with $n < 2000$, and Fastfood expansions to approximate the SE-ARD kernel for the larger datasets ($n > 2000$). Before training the GP-GRIEF models, we initialize the base kernel hyperparameters $\boldsymbol{\theta}$ by maximizing the marginal likelihood of an exact GP constructed on $\min(n, 1000)$ points randomly selected from the dataset. We then train GP-GRIEF-II or GP-GRIEF-I, which differ not only in the use of type-II versus type-I inference but also in the kernel parametrizations used. GP-GRIEF-II uses the kernel from eq. (3), which is parametrized by the base kernel hyperparameters $\{\boldsymbol{\theta}, \sigma^2\}$; its reported training time includes log marginal likelihood maximization to estimate the hyperparameters, beginning from the initialized values. GP-GRIEF-I uses the kernel from eq. (11), which is parametrized by the basis function weights $\{\mathbf{w}, \sigma^2\}$, with the base kernel hyperparameters $\boldsymbol{\theta}$ fixed to the initialized values. We integrate out $\{\mathbf{w}, \sigma^2\}$ using Metropolis-adjusted Langevin dynamics MCMC, which uses gradient information (Girolami & Calderhead, 2011). The training time includes MCMC sampling, which we run for 10000 iterations. We use log-normal priors with {mode, variance} of $\{1, 100\}$ for $\mathbf{w}$ and $\{\sigma_0^2, 0.04\}$ for $\sigma^2$, where $\sigma_0^2$ is the initialized value. We begin sampling at the prior mode, burning the first 1000 samples and thinning every 50 thereafter. For datasets with $n > 10^6$ we use the $\mathcal{O}(p)$ LML computations described in section 5. For all studies we fix $\bar{m} = 10$, and we fix $p = 1000$ for GP-GRIEF-I. For GP-GRIEF-II, we make $p$ proportional to $n$ by rounding $n$ down to the nearest power of ten, or take 1000 if that is smaller, i.e. $p = \min(1000, 10^{\lfloor\log_{10}n\rfloor})$.
Table 1: Mean and standard deviation of test error, and average training time (including hyperparameter estimation or MCMC sampling), from 10-fold cross validation (90% train, 10% test per fold) on UCI regression datasets.

| Dataset | n | d | $m=\bar m^d$ | GP-GRIEF-II: p | Time (hrs) | RMSE | GP-GRIEF-I: p | Time (hrs) | RMSE | Yang et al. (2015): RMSE |
|---|---|---|---|---|---|---|---|---|---|---|
| challenger | 23 | 4 | $10^4$ | 10 | 0 | 0.554 ± 0.277 | 1000 | 0.178 | 0.519 ± 0.261 | 0.63 ± 0.26 |
| fertility | 100 | 9 | $10^9$ | 100 | 0.001 | 0.172 ± 0.055 | 1000 | 0.66 | 0.166 ± 0.051 | 0.21 ± 0.05 |
| slump | 103 | 7 | $10^7$ | 100 | 0 | 3.972 ± 1.891 | 1000 | 0.566 | 3.470 ± 1.712 | 4.72 ± 2.42 |
| automobile | 159 | 25 | $10^{25}$ | 100 | 0.007 | 0.145 ± 0.057 | 1000 | 0.604 | 0.111 ± 0.036 | 0.18 ± 0.07 |
| servo | 167 | 4 | $10^4$ | 100 | 0 | 0.280 ± 0.085 | 1000 | 0.265 | 0.268 ± 0.075 | 0.28 ± 0.09 |
| cancer | 194 | 33 | $10^{33}$ | 100 | 0.007 | 27.843 ± 3.910 | 1000 | 0.667 | 30.568 ± 3.340 | 35 ± 4 |
| hardware | 209 | 7 | $10^7$ | 100 | 0 | 0.408 ± 0.046 | 1000 | 0.637 | 0.402 ± 0.045 | 0.43 ± 0.04 |
| yacht | 308 | 6 | $10^6$ | 100 | 0.001 | 0.170 ± 0.083 | 1000 | 0.595 | 0.120 ± 0.070 | 0.16 ± 0.11 |
| autompg | 392 | 7 | $10^7$ | 100 | 0.001 | 2.607 ± 0.356 | 1000 | 0.594 | 2.563 ± 0.369 | 2.63 ± 0.38 |
| housing | 506 | 13 | $10^{13}$ | 100 | 0.004 | 3.212 ± 0.864 | 1000 | 0.62 | 2.887 ± 0.489 | 2.91 ± 0.54 |
| forest | 517 | 12 | $10^{12}$ | 100 | 0.001 | 1.386 ± 0.14 | 1000 | 0.621 | 1.384 ± 0.139 | 1.39 ± 0.16 |
| stock | 536 | 11 | $10^{11}$ | 100 | 0.001 | 0.005 ± 0.000 | 1000 | 0.567 | 0.005 ± 0.000 | 0.005 ± 0.001 |
| energy | 768 | 8 | $10^8$ | 100 | 0.002 | 0.49 ± 0.057 | 1000 | 0.622 | 0.461 ± 0.064 | 0.46 ± 0.07 |
| concrete | 1030 | 8 | $10^8$ | 1000 | 0.008 | 5.232 ± 0.723 | 1000 | 0.57 | 5.156 ± 0.766 | 4.95 ± 0.77 |
| solar | 1066 | 10 | $10^{10}$ | 1000 | 0.003 | 0.786 ± 0.198 | 1000 | 0.628 | 0.809 ± 0.193 | 0.83 ± 0.20 |
| wine | 1599 | 11 | $10^{11}$ | 1000 | 0.012 | 0.483 ± 0.052 | 1000 | 0.583 | 0.477 ± 0.047 | 0.47 ± 0.08 |
| skillcraft | 3338 | 19 | $10^{19}$ | 1000 | 0.011 | 0.248 ± 0.016 | 1000 | 0.573 | 0.248 ± 0.016 | 0.25 ± 0.02 |
| pumadyn | 8192 | 32 | $10^{32}$ | 1000 | 0.156 | 0.20 ± 0.00 | 1000 | 0.645 | 0.212 ± 0.004 | 0.20 ± 0.00 |
| elevators | 16599 | 18 | $10^{18}$ | 1000 | 0.283 | 0.091 ± 0.002 | 1000 | 0.664 | 0.097 ± 0.001 | 0.090 ± 0.001 |
| kin40k | 40000 | 8 | $10^8$ | 1000 | 0.38 | 0.206 ± 0.004 | 1000 | 0.649 | 0.206 ± 0.004 | 0.28 ± 0.01 |
| keggu | 63608 | 27 | $10^{27}$ | 1000 | 3.642 | 0.118 ± 0.003 | 1000 | 0.704 | 0.134 ± 0.005 | 0.12 ± 0.00 |
| 3droad | 434874 | 3 | $10^3$ | 1000 | 0.869 | 11.648 ± 0.281 | 1000 | 0.221 | 12.966 ± 0.077 | 10.91 ± 0.05 |
| electric | 2049280 | 11 | $10^{11}$ | 1000 | 8.019 | 0.064 ± 0.002 | 1000 | 0.418 | 0.058 ± 0.006 | 0.12 ± 0.12 |

It is firstly evident that both GP-GRIEF-I and GP-GRIEF-II outperform the exact GP presented by Yang et al. (2015) on nearly every small dataset ($n < 2000$). In particular, GP-GRIEF-I performs extremely well on these small datasets, as we would expect, since it uses a very flexible kernel and is robust to over-fitting as a result of the principled Bayesian approach employed. On larger datasets, where we expect the hyperparameter posterior to be more peaked, the type-II techniques begin to be competitive. On these larger datasets, both GP-GRIEF techniques show test errors comparable to Yang et al. (2015) on all datasets, but perform considerably better on kin40k and on the electric dataset with two million training points. With respect to time, we note that the GP-GRIEF-I model trains extremely rapidly considering that a fully Bayesian approach is taken; only 25 minutes were required for the two-million-point electric dataset, even though this size is prohibitive for most GP models taking a type-II empirical Bayes approach. The independence of the computational complexity from $m$ allows enormous numbers of inducing points to be used: we use $m = 10^{33}$ inducing points for the cancer dataset, which demonstrates the efficiency of the matrix algebra employed, since storing a double-precision vector of this length would require 8 quadrillion exabytes, far more than all combined hard-disk space on earth.
## 7. Conclusion

We have outlined our new technique, GP-GRIEF, along with promising initial results on large real-world datasets, where we demonstrated GP training and inference in $\mathcal{O}(p)$ time with $\mathcal{O}(p)$ storage. This fast training keeps type-I Bayesian inference computationally attractive even for very large datasets, as shown in our studies. We showed that our complexities are independent of $m$, allowing us to break the curse of dimensionality inherent to methods that manipulate distributions of points on a full Cartesian product grid. Asymptotic results were also presented to show why a large $m$ is important for an accurate global kernel approximation. Lastly, we considered the use of up to $10^{33}$ inducing points in our regression studies, demonstrating the efficiency of the matrix algebra employed. We also discussed how the developed algebra can be used in areas beyond the focus of the numerical studies, such as in a general kernel interpolation framework or in general kernel matrix preconditioning applications. It will be interesting to explore what other applications could exploit the developed matrix algebra techniques.

## Acknowledgements

Research funded by an NSERC Discovery Grant and the Canada Research Chairs program.

## References

Baker, C. T. H. The Numerical Treatment of Integral Equations. Oxford: Clarendon Press, 1977.
Belabbas, M.-A. and Wolfe, P. J. Spectral methods in machine learning: New strategies for very large datasets. Proceedings of the National Academy of Sciences of the USA, 106:369-374, 2009.
Bilionis, I. Predictive Science Lab: py-mcmc. https://github.com/PredictiveScienceLab/py-mcmc, December 2014.
Buhmann, M. D. Radial Basis Functions: Theory and Implementations. Cambridge University Press, 2003.
Cutajar, K., Osborne, M. A., Cunningham, J. P., and Filippone, M. Preconditioning kernel matrices. In International Conference on Machine Learning, pp. 2529-2538, 2016.
Drineas, P. and Mahoney, M. W. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153-2175, 2005.
Girolami, M. and Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123-214, 2011.
Gittens, A. and Mahoney, M. W. Revisiting the Nyström method for improved large-scale machine learning. In International Conference on Machine Learning, 2013.
GPy. GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy, since 2012.
Kumar, S., Mohri, M., and Talwalkar, A. Sampling methods for the Nyström method. Journal of Machine Learning Research, 13:981-1006, 2012.
Lázaro-Gredilla, M., Quiñonero-Candela, J., Rasmussen, C. E., and Figueiras-Vidal, A. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11(6):1865-1881, 2010.
Li, C., Jegelka, S., and Sra, S. Fast DPP sampling for Nyström with application to kernel methods. In International Conference on Machine Learning, 2016.
Liu, S. and Trenkler, G. Hadamard, Khatri-Rao, Kronecker and other matrix products. International Journal of Information and Systems Sciences, 4(1):160-177, 2008.
Musco, C. and Musco, C. Recursive sampling for the Nyström method. In Advances in Neural Information Processing Systems, 2017.
Neal, R. M. Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical Report 9702, University of Toronto, 1997.
Nickson, T., Gunter, T., Lloyd, C., Osborne, M. A., and Roberts, S. Blitzkriging: Kronecker-structured stochastic Gaussian processes. arXiv preprint arXiv:1510.07965, 2015.
Peng, H. and Qi, Y. EigenGP: Gaussian process models with adaptive eigenfunctions. In International Joint Conference on Artificial Intelligence, pp. 3763-3769, 2015.
Pinheiro, J. C. and Bates, D. M. Unconstrained parametrizations for variance-covariance matrices. Statistics and Computing, 6(3):289-296, 1996.
Quiñonero-Candela, J. and Rasmussen, C. E. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(12):1939-1959, 2005.
Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177-1184, 2007.
Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.
Saatçi, Y. Scalable Inference for Structured Gaussian Process Models. PhD thesis, University of Cambridge, 2011.
Silverman, B. W. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society B, 47(1):1-52, 1985.
Smola, A. J. and Bartlett, P. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems, pp. 619-625, 2000.
Smola, A. J. and Schölkopf, B. Sparse greedy matrix approximation for machine learning. In International Conference on Machine Learning, pp. 911-918, 2000.
Snelson, E. and Ghahramani, Z. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pp. 1257-1264, 2006.
Titsias, M. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pp. 567-574, 2009.
Van Loan, C. F. The ubiquitous Kronecker product. Journal of Computational and Applied Mathematics, 123(1):85-100, 2000.
Wang, S. and Zhang, Z. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. Journal of Machine Learning Research, 14:2729-2769, 2013.
Williams, C. K. I. and Seeger, M. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pp. 682-688, 2001.
Wilson, A. G. and Nickisch, H. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pp. 1775-1784, 2015.
Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. Deep kernel learning. In Artificial Intelligence and Statistics, pp. 370-378, 2016.
Yang, Z., Smola, A. J., Song, L., and Wilson, A. G. À la carte - learning fast kernels. In Artificial Intelligence and Statistics, pp. 1098-1106, 2015.
Zhang, K., Tsang, I. W., and Kwok, J. T. Improved Nyström low-rank approximation and error analysis. In International Conference on Machine Learning, pp. 1232-1239, 2008.