# gausslegendre_features_for_gaussian_process_regression__5801ee6c.pdf Journal of Machine Learning Research 23 (2022) 1-47 Submitted 1/21; Revised 11/21; Published 2/22 Gauss-Legendre Features for Gaussian Process Regression Paz Fink Shustin pazfink@mail.tau.ac.il Department of Applied Mathematics Tel Aviv University Tel Aviv, 69978, Israel Haim Avron haimav@tauex.tau.ac.il Department of Applied Mathematics Tel Aviv University Tel Aviv, 69978, Israel Editor: John Shawe-Taylor Gaussian processes provide a powerful probabilistic kernel learning framework, which allows learning high quality nonparametric regression models via methods such as Gaussian process regression. Nevertheless, the learning phase of Gaussian process regression requires massive computations which are not realistic for large datasets. In this paper, we present a Gauss-Legendre quadrature based approach for scaling up Gaussian process regression via a low rank approximation of the kernel matrix. We utilize the structure of the low rank approximation to achieve effective hyper parameter learning, training and prediction. Our method is very much inspired by the wellknown random Fourier features approach, which also builds low-rank approximations via numerical integration. However, our method is capable of generating high quality approximation to the kernel using an amount of features which is poly-logarithmic in the number of training points, while similar guarantees will require an amount that is at the very least linear in the number of training points when using random Fourier features. Furthermore, the structure of the low-rank approximation that our method builds is subtly different from the one generated by random Fourier features, and this enables much more efficient hyperparameter learning. The utility of our method for learning with low-dimensional datasets is demonstrated using numerical experiments. 1. Introduction Gaussian processes (GPs) (Williams and Rasmussen, 2006) provide a powerful probabilistic kernel learning framework, which allows learning high quality nonparametric regression models via methods such as Gaussian process regression (GPR). Indeed, GP based methods are widely used in machine learning and statistics. They have been applied to a wide variety of problems, such as data visualization, Bayesian optimization (Snoek et al., 2012), modeling dynamics and spatial data analysis (Stein, 1999). One of the key advantages of the GP formulation of kernel regression is that the marginal likelihood is a function of the kernel hyperparameters, and that it can be computed via a closed-form formula. By maximizing the marginal likelihood, one can learn the hyperparameters from the data, thereby tuning the method in a principled manner. However, learning GPs comes with a hefty computational price-tag. Given a training set of n points of dimension d, exact GPR requires solving a (usually dense) linear equation, and thus requires O(n3) FLOPs. Prediction costs O(nd) FLOPs per test point. Such costs are problematic for datasets with more than a few thousand points. The situation is even more severe if we consider the hyperparameter learning phase: here the cost is O(n3) FLOPs per hyperparameter in a learning iteration (assuming we use a first-order optimization method). Hyperparameter learning of exact kernel models on large-scale data is even more unrealistic than training such models. 2022 Paz Fink Shustin and Haim Avron. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-0030.html. Fink Shustin and Avron Given the ubiquity of GPs, it is unsurprising that there is a rich literature on scaling GP-based method, e.g. (Quiñonero-Candela and Rasmussen, 2005) and (Wilson and Nickisch, 2015). One attractive approach is to approximate the kernel matrix (also known as covariance matrix) Kθ as a sum of a diagonal matrix (often a multiple of the identity) and a low rank matrix (Stein, 2014): Kθ ZW(θ)Z + D(θ) (1) In the above, Kθ Rn n denotes the kernel matrix, where the subscript θ denotes the dependence of the kernel matrix on the hyperparameters θ (discussion of our notation appears in Section 2.1), Z has s n columns, and W(θ), D(θ) are diagonal matrices. The various steps of GPR can be much more efficiently conducted on a kernel whose kernel matrix has the structure of the righthand side of Eq. (1), e.g. training takes O(ns2) (see Section 4.2 for details on efficient GPR with low-rank approximations with an even more restricted structure in which D(θ) is a multiple of the identity). In the kernel learning literature, methods for forming a low rank approximation of kernels can be roughly split into two approaches: methods that use data-dependent basis functions, and methods that use independent basis functions. An example for the first kind is the Nyström method (Williams and Seeger, 2001). Such methods utilize the given training data, and thus may outperform methods that use independent basis functions, especially when there is a large gap in the eigenspectrum. However, data dependence can incur additional costs. For example, the Nyström method requires keeping some of the data as part of the model. Another class of methods for building low rank approximations of kernel matrices are methods that use independent basis functions, and thus approximate the kernel function directly. One such important and highly influential method is the random Fourier features approach suggested by Rahimi and Recht in 2007 (Rahimi and Recht, 2008). Following the publication of (Rahimi and Recht, 2008), there has been extensive research on random features, including works that attempt to improve the approximation quality of the method (e.g., (Sutherland and Schneider, 2015; Choromanski et al., 2018)), works that focused on using in random features to learn huge datasets (e.g. (Huang et al., 2014; Avron and Sindhwani, 2016)), and works that focused on theoretical analysis of random features (e.g. (Yang et al., 2012; Sriperumbudur and Szabó, 2015; Avron et al., 2017; Li et al., 2019)). The previous list is far from exhaustive. In the context of our work, worth mentioning is Avron et al. (Avron et al., 2017) which showed that if the kernel matrix of the approximate kernel spectrally approximates the kernel matrix of the true kernel then the excess risk when using kernel ridge regression with the approximate kernel is not much larger than the excess risk when using the true kernel. Random Fourier features, and random features methods in general, are based on writing the kernel function as an integral and then using numerical integration schemes in order to construct a low rank approximation of that function1. In random Fourier features, a shift-invariant kernel is rewritten as an integral via an application of Bochner s theorem, and Monte-Carlo integration is used to build the low rank approximation. The use of Quasi Monte-Carlo in lieu of Monte Carlo integration was explored Avron et al. (Avron et al., 2016). Bach explored the connection between random Fourier features and kernel quadrature rules in (Bach, 2017), however without providing any practically useful explicit mappings for kernels. Monte-Carlo and Quasi-Monte Carlo integration admit only a slow convergence rate. As a consequence, the number of features required for a spectral approximation when using Monte-Carlo or Quasi Monte-Carlo integration must be polynomial in a quality parameter of the spectral approximation. In this paper we argue that in the context of Gaussian process regression a stronger notion of spectral equivalence is required. The slow convergence rate of Monte-Carlo or Quasi Monte-Carlo integration implies that at best the number of features required for spectral equivalence is linear in the number of training points, which is obviously undesirable. 1. A rank k bivariate function f(x, y) is a function that can be written as f(x, y) = Pk j=1 σjφj(x)ψj(y) for some σ1, . . . , σk, φ1, . . . , φk, ψ1, . . . , ψk (Townsend and Trefethen, 2013). Gauss-Legendre Features for Gaussian Process Regression Random features approaches based on Monte-Carlo and Quasi Monte-Carlo suffer from another serious defect when it comes to GPR: the low rank approximation they build has the form Kθ Z(θ)Z(θ) + D(θ) (2) While for training and prediction, this structure works equally as well as the structure in Eq. (1), when it comes to hyperparameter learning this is no longer the case; see Section 4.3. One can construct faster converging low-rank approximations using numerical quadrature rules such as Gaussian quadrature. Dao et al. considered the use of Gaussian quadrature in the context of kernel learning (Dao et al., 2017). Gaussian quadrature rule is a method for numerically approximating weighted integrals (i.e., integrals of the form f(x)w(x)dx where w(x) 0 is a weight function) that is optimal in some formal sense. In the context of approximating kernel functions, the weight function w( ) is determined by the kernel function and the value of the hyperparameters. Once the weight function has been determined, in order to use a Gaussian quadrature the nodes and weights corresponding to that particular weight function must be computed. Efficient algorithms exist, but these algorithms require the computation of integrals as well. For a single kernel, that is when using a fixed value of the hyperparameters, and when using a fixed number of quadrature features, computing the nodes and weights is a one-time offline task. However, if the hyperparameters are not fixed, e.g. when they are set using hyperparameter learning, Gaussian quadrature becomes unrealistic. Furthermore, the fact that the nodes and weights change with the hyperparameters implies that we must use an approximation of the form of Eq. (2) and not of Eq. (1), which is less desirable. The connection between random Fourier features and quadrature rules was also explored by Munkhoeva et al. (Munkhoeva et al., 2018). One noticeable limitation of quadrature rules is the curse of dimensionality, since the scale of their computational requirements increases with dimension. Thus, this approach is more applicable for low-dimensional datasets, such as spatial data. Low rank approximations for kernels matrices have also been widely used in the statistics literature, and in particular the spatial statistics literature (Cressie and Johannesson, 2008; Eidsvik et al., 2012; Banerjee et al., 2008; Finley et al., 2009; Katzfuss and Cressie, 2012). Possible limitations of the low rank approximation approach in the context of spatial statistics have been noted in (Quiñonero-Candela and Rasmussen, 2005; Banerjee et al., 2008; Stein et al., 2007; Sang et al., 2011), and analyzed mathematically by Stein (Stein, 2014). The use of random features in the context of spatial statistics was explored by Ton et al. (Ton et al., 2018). In this paper, we propose a quadrature based low-rank approximation approach for efficient GPR involving a wide class of kernels which includes shift-invariant kernels (i.e., stationary covariance functions). Unlike previous literature which uses quadrature features in the context of GPR, our method forms an approximation of the form of Eq. (1), and so is able to efficiently perform hyperparameter learning in addition to training and prediction. Our method achieves this by using a fixed set of quadrature nodes and weights, and designing the approximation so that varying the hyperparameters corresponds to only changing the integrand. Specifically, our method uses a Gauss-Legendre quadrature. Gauss-Legendre quadrature is a Gaussian quadrature for the uniform weight function on a finite interval. Thus, the weight function does not change with the hyperparameters, and with it the quadrature nodes and weight stay fixed, whereas only the integrand varies. Changing only the integrand translates to a simplified parametric form for the approximate kernel matrix (Eq. (1)) which is more amenable to efficient computations. Our proposed method, which we call Gauss-Legendre Features, is described in Section 4. The Gauss-Legendre quadrature is designed to approximate integrals with an integration area which is a finite interval. However, for most widely-used kernels the integrand has infinite support. We address this issue by utilizing the fact that for such kernels the integrand decays quickly, so we can approximate the integral by truncating the integration area. The truncation cutoffis determined by a parameter of our method. Another parameter is the number of features (i.e., quadrature nodes) used in the approximation. In order to set these two parameters correctly, we need a method for assessing the quality of one kernel function approximation by another. To that end, we introduce Fink Shustin and Avron the notion of spectral equivalence, and argue that if one parameterized family of kernels is spectrally equivalent to another one, then that first family is a good surrogate for the second family in the context of GPR. These results are summarized in Section 3. We rigorously analyze how to set the truncation cutoffand the number of features to achieve spectral equivalence (these results are reported in Section 5). Here another advantage of using Gauss Legendre quadrature becomes evident: the Gauss-Legendre quadrature converges much faster than the Monte-Carlo or Quasi Monte-Carlo integration, so typically the number of features is sublinear in the training size. Indeed, for widely used kernels like the Gaussian kernel and the Matèrn kernel, the number of features required when using Gauss-Legendre features is poly-logarithmic in the training size (see Section 6). A sublinear number of features is also likely achievable using Gaussian quadrature (kernel learning using Gaussian quadrature is suggested by Dao et al. (Dao et al., 2017), however without proving spectral equivalence). Yet, as explained this is rather problematic for hyperparameter learning, and in general requires a large overhead for computing the quadrature nodes and weights. Finally, empirical results (Section 7) clearly demonstrate the superiority of our proposed method over classical random Fourier features when conducting Gaussian process regression on low-dimensional datasets. 2. Preliminaries 2.1 Notations and Basic Definitions We consider all vectors as column vectors, unless otherwise stated. For a vector x or a matrix A, the notation x or A denotes the Hermitian transpose. The n n identity matrix is denoted by In. A Hermitian matrix A is positive semidefinite (PSD) if x Ax 0 for every vector x. Also, for any Hermitian matrices A, B of the same size, the notation A B means that B A is PSD. We consider n pairs of training data (x1, y1), . . . , (xn, yn) X Y Rd R, where x denotes the input vector of dimension d and y denotes a scalar response. A kernel function (aka covariance function) is a function k : X X R which is positive definite, i.e. for every m N and x1, . . . , xm Rd, the matrix K Rm m defined by Kij = k(xi, xj) is PSD. The matrix K is known by various names: kernel matrix, Gram matrix, covariance matrix. Given a dataset x1, . . . , xn Rd, we will conveniently use X to denote the n-by-d matrix whose rows are x1, . . . , xn, and use K(X, X) to denote the kernel matrix corresponding to the kernel k with data X. For another kernel k we will use K(X, X) to denote the kernel matrix. In many cases we will deal with parameterized families of kernels {kθ}θ Θ, where θ represents the hyperparameters vector, and Θ is a set of possible parameters values. The kernel matrix corresponding to kθ is denoted by Kθ(X, X). We also group the responses y1, . . . , yn into a single vector y Rn. The Kullback Leibler divergence (abbreviated KL-divergence henceforth) is a well established metric for how much one distribution is different from a reference distribution. We denote the KLdivergence between two probability distributions P on Q by DKL (P, Q), and recall the following is a well established result2: if N1 = N(µ0, Σ0) and N2 = N(µ1, Σ1) are two multivariate normal distributions, we have DKL (N1, N2) = 1 2Tr Σ 1 1 Σ0 + 1 2 (µ1 µ0)T Σ 1 1 (µ1 µ0)+ 1 2 (log det Σ1 log det Σ0) n 2. The exact definition of the KL-divergence is not important, since we always use Eq. (3) when working with it. Gauss-Legendre Features for Gaussian Process Regression 2.2 Gaussian Process Regression Gaussian Process Regression (GPR) is a Bayesian nonparametric approach for regression. First, the following regression model is assumed: y = f(x) + ε, ε i.i.d N 0, σ2 n (σ2 n is a (hyper)parameter). Additionally, it is assumed that f is a Gaussian Process, f(x) GP (µ (x) , k(x, x )), where k is the kernel function. This means that for any set of data points x1, . . . , xm Rd the vector f Rm defined by fj = f(xj) (j = 1, . . . , m) is a Gaussian random vector with mean defined by µ(X) = [µ(x1) µ(xm)]T and the covariance matrix K(Z, Z). Throughout the paper we assume, for the sake of simplicity, that the mean function µ(x) is zero. This simplifies the formulas while not really restricting generality (a nonzero mean can be easily handled by shifting). Under these assumptions, y N(0, K(X, X) + σ2 n I). Under these priors, the expected predictive value for f(x) at a test x is (Williams and Rasmussen, 2006) f(x) K(x, X) K(X, X) + σ2 n In 1 y . Consequently, training is conducted by computing the vector α := K(X, X) + σ2 n In 1 y . From these formulas, we see that assuming that evaluating the kernel function takes O(d) operations and that α is computed using direct factorization, training takes O(n3) operations and prediction takes O(nd) operations. The previous description is for a fixed kernel k. Typically, the kernel kθ has hyperparameters which we represent throughout the paper by the vector θ. The hyperparameters are usually constrained to some possible set of hyperparameters values Θ, and thus defines a parameterized family of kernels {kθ}θ Θ. Hyperparameter learning refers to the process of determining the value of the hyperparameters directly from the training data, and is considered one of the important advantages of the GP framework. This is typically conducted by maximizing the log marginal likelihood: 2y T Kθ(X, X) + σ2 n In 1 y 1 2 log det(Kθ(X, X) + σ2 n In) n 2 log 2π (4) In order to maximize L(θ) using a first-order optimization method, it is required to compute its gradients. Using direct methods, computing the gradient takes O(n3|θ|) where |θ| represents the number of hyperparameters in θ. 2.3 Random Fourier Features Random Fourier Features (RFF) (Rahimi and Recht, 2008), is one of the most popular methods for constructing a low rank approximation of kernels and scaling up kernel methods. The method targets shift-invariant kernels, i.e. kernels of the form k(x, x ) = k0(x x ) for a positive definite function k0( ). RFF is motivated by a simple consequence of Bochner s Theorem: for every shift-invariant kernel for which k0(0) = σ2 f there is a probability measure µ and possibly a corresponding probability density function p( ), both on Rd, such that k(x, x ) = σ2 f Rd e 2πiηT(x x )dµ(η) = σ2 f Rd e 2πiηT(x x )p(η)dη Let us assume that the density p( ) exists. If one chooses η1, . . . , ηs randomly according to p( ), and defines ϕ(x) = 1 s e 2πiηT 1 x, . . . , e 2πiηT s x , then k(x, x ) = σ2 f Eη1,...ηs ϕ (x) ϕ (x ) . Fink Shustin and Avron Hence, an approximated kernel can be defined: k(RFF)(x, x ) := ϕ(x) ϕ(x ) = σ2 f s j=1 e 2πiηT j (x x ) . The kernel matrix corresponding to the approximate kernel is K (RFF)(X, X) = ZZ where Z Cn s to be the matrix whose mth row is ϕ(xm) . The low rank structure of K (RFF)(X, X) allows more efficient training (O(ns2)) and predictions (O(sd)), which are attractive if s n. The previous description is for a fixed kernel (and fixed hyperparameters). When using GPR with hyperparameter learning we are dealing parameterized family of kernels {kθ}θ Θ. This case has not been considered in Rahimi and Recht original work (Rahimi and Recht, 2008). We discuss it in Section 4.3. Note that, there exist variants of RFF. One example is the modified RFF method (Avron et al., 2017) which suggests to sample Z using a different sampling distribution from the one used in classical RFF. In this method, another probability density function q( ) whose support includes that of p( ) is considered, such that η1, . . . , ηs are chosen randomly according to p( )/q( ). Then, the approximate kernel is k(MRF)(x, x ) := σ2 f s p(ηj) q(ηj)e 2πiηT j (x x ) . We consider this variant in the experiments reported in Section 7. 2.4 Gauss-Legendre Quadrature Quadrature rules are a deterministic way to approximate integrals. A celebrated method for(mostly, one dimensional) numerical integration is Gaussian quadrature. The idea of Gaussian quadrature is to approximate integrals Ωf(η)dµ(η) P j {1,...,s}d wjf(ηj) where the number of nodes s yields exactness of the approximation for all polynomials of degree up to 2s 1. The nodes and weights depend on the measure dµ and the parameter s, and can be computed efficiently using a corresponding family of orthogonal polynomials (Hale and Townsend, 2013). Usually, the number of nodes s is chosen such that the quadrature error satisfies Ω f(η)dµ(η) X j {1,...,s}d wjf(ηj) for a certain , i.e., is maximized to obtain an error at most . Gaussian quadrature is very accurate, and thus beneficial in particular when dealing with well behaved functions (e.g., analytic functions or functions whose derivatives are smooth up to a certain degree (Trefethen, 2013)). Once the weight function is fixed, and the number of nodes (s) is chosen, the nodes and weights are fixed, and are computed only once. Gauss-Legendre quadrature is a type of Gaussian quadrature for approximating integrals of the form [ 1,1]d f(η)dη, i.e. the uniform weight function is considered for the domain [ 1, 1]d. The one dimensional weights of Gauss-Legendre quadrature are wj = 2 (1 η2 j )(L s(ηj))2 , j = 1, . . . , s where Ls is the s-th Legendre polynomial, and the nodes η1, . . . , ηs are simply the roots of Ls (Abramowitz and Stegun, 1972, p. 887). Higher dimension Gauss-Legendre quadarture are obtained by tensoring the one dimensional rule. Gauss-Legendre Features for Gaussian Process Regression In this paper, we consider kernels that can be written as an integral form (e.g., as in the RFF method described in the previous section). The integral is typically a product of a feature map that depends on the family of the kernel (e.g., shift-invariant kernel) and a weight function which depends on the kernel itself. Thus, it seems natural to approximate this integral using a Gaussian quadrature. Indeed, such an approximation was considered in (Dao et al., 2017; Munkhoeva et al., 2018). However, all the aforementioned works considered fixed kernel parameters, and did not consider varying the kernel parameters as is done in hyperparameter learning for GPR. When using a Gaussian quadarture based on the weight function related to the kernel at hand, the weight function itself depends on the hyperparameters. This means that the nodes and weights have to be recomputed in every iteration. Furthermore, it is unclear how in that case one will be able to compute the derivative of the marginal likelihood if gradient descent is used to maximize the marginal likelihood. Instead, in this work we decided to use Gauss-Legendre quandrature, truncating the integral and pushing the (truncated) weight function to be part of the integrand. We remark that Clenshaw Curtis (considered with uniform weight function) would probably have worked similarly well, but Gauss-Legendre is optimal for the uniform weight function, so we chose it. 3. Spectrally Equivalent Kernel Approximations Our strategy for scaling up GPR is based on approximating the kernel kθ by an approximate kernel kθ that is low-rank in some sense which will become apparent in the next section. This raises the question: how can we determine whether kθ indeed approximates kθ well? Avron et al. (Avron et al., 2017) suggested that in the context of kernel ridge regression, spectral approximations of the kernel matrices allow us to reason about how well one kernel is approximated by another. The argument in (Avron et al., 2017) is based on risk bounds for given fixed hyperparameters, and so is less appropriate for GPR where hyperparameter learning is common practice. In this section, we introduce the notion of spectral equivalence, a stronger form of spectral approximation, and connect it to hyperparameter learning in GPR. Assume a bounding set X Rd for the data, then given a dataset (x1, y1), . . . , (xn, yn) X R, the general assumption when using a kernel k is that y N(µ, K(X, X)) . If, however, we would have used the kernel k, then the assumption would have been y N(µ, K(X, X)) . Thus, a measure on how well k approximates k might be devised by measuring how much N(µ, K(X, X)) is different from N(µ, K(X, X)). The KL-divergence is a well-established measure on how different one probability distribution is from a reference distribution, so arguably, k approximates k well if the KL-divergence DKL N(µ, K(X, X)), N(µ, K(X, X)) is small. Indeed, the use of KL-divergence as such a measure was suggested in the literature on spatial data analysis (Banerjee et al., 2008; Sang and Huang, 2012; Stein, 2014). The notion of spectral equivalence, which we develop below, is a measure of how two matrices are close to one another. To connect it to the KL-divergence, which we use to measure how well k approximates k, we have the following lemma, which implies that if the covariance matrices of two multivariate distributions are close, then the KL-divergence is small. Lemma 1 Suppose that µ Rn and Σ0,, Σ1 Rn n are two symmetric positive definite matrices. Suppose that (1 n 1)Σ0 Σ1 (1 + n 1)Σ0 . (5) Then, DKL (N(µ, Σ0), N(µ, Σ1)) 1 + O(n 1) . Fink Shustin and Avron We first need the following Lemma. Lemma 2 Suppose that A and B are two symmetric positive definite matrices of order n n, such that (1 n 1)B A (1 + n 1)B . (6) Then, there exists γ1, . . . , γn [ n 1, n 1] such that log det A log det B = i=1 log(1 + γi) . Proof Let λ1, . . . , λn denote the sorted eigenvalues of B, and λ1, . . . , λn denote the sorted eigenvalues of A, so log det B = i=1 log λi, log det A = i=1 log λi . Eq. (6) implies that there exist γ1, . . . , γn [ n 1, n 1] such that λi = (1 + γi)λi. Hence, log det A = i=1 log(1 + γi)λi = log det B + i=1 log(1 + γi) and that completes the proof. Proof [Proof of Lemma 1]Since for two symmetric positive definite matrices A and B, A B implies B 1 A 1, Eq. (5) implies that 1 1 n + 1 Σ 1 0 = 1 1 + n 1 Σ 1 0 Σ 1 1 1 1 n 1 Σ 1 0 = 1 + 1 n 1 Multiplying by Σ1/2 0 on the right and left sides gives 1 1 n + 1 In Σ1/2 0 Σ 1 1 Σ1/2 0 1 + 1 n 1 i.e., the eigenvalues of Σ1/2 0 Σ 1 1 Σ1/2 0 are bounded in the interval [1 (n+1) 1, 1+(n 1) 1]. Thus, Tr Σ 1 1 Σ0 = Tr Σ 1 1 Σ1/2 0 Σ1/2 0 = Tr Σ1/2 0 Σ 1 1 Σ1/2 0 n 1 + 1 n 1 Also, from Lemma 2, there exist γ1, . . . , γn [ n 1, n 1] such that log det Σ1 log det Σ0 = i=1 log(1 + γi) . Gauss-Legendre Features for Gaussian Process Regression Using Eq. (3), we obtain DKL (N(µ, Σ0), N(µ, Σ1)) = 1 2Tr Σ 1 1 Σ0 + 1 2(log det Σ1 log det Σ0) n i=1 log (1 + γi) n 1 n 1 + log 1 + 1 2 + 1 2(n 1) + n Lemma 1 motivates the following definitions: Definition 3 We say that a n-by-n symmetric matrix A is spectrally equivalent to another n-by-n symmetric matrix B if (1 n 1)B A (1 + n 1)B. (7) Remark 4 Spectral equivalence is a strong type of spectral approximation. Indeed, for a 0, a matrix A is a -spectral approximation of another matrix B, if (1 )B A (1 + )B (Avron et al., 2017), so spectral equivalence is spectral approximation with = n 1. As the previous lemmas show, this is the level of approximation required to make the KL-divergence between the distributions defined by two covariance matrices to be small. Definition 5 Let n 1 be an integer, and X be a data domain. Two positive definite kernels k and k are n-spectrally equivalent on domain X if for every X with n rows in X, the kernel matrix K(X, X) is spectrally equivalent to the kernel matrix K(X, X). The last definition uses two specific kernels, k and k. In GPR, we usually use a parameterized family of kernels {kθ}θ Θ, where θ represents the hyperparameters, and Θ is a set of possible parameter values. We generally assume that Θ is bounded. Boundedness of Θ is necessary, since without it, it is possible to drive the kernel matrix to identity, thereby making it impossible to approximate it using a low rank matrix. We then approximate each kernel kθ by kθ, that is we use the parameterized family { kθ}θ Θ. We say that the parameterized family { kθ}θ Θ approximates the parameterized family {kθ}θ Θ well if for every θ Θ the kernel kθ approximates kθ well, as is captured by the following definition. Definition 6 Two parameterized families of positive definite kernels {kθ}θ Θ and { kθ}θ Θ are said to be n-spectrally equivalent on domain X if for every θ Θ, kθ and kθ are n-spectrally equivalent over X. In light of Lemma 1, if two parameterized families {kθ}θ Θ and { kθ}θ Θ are n-spectrally equivalent, then for any parameters θ and any dataset consisting of n data points, the distributions on the response assumed by the two GP models induced by these families are close in the sense that the KL-divergence is close to 1. 4. Gauss-Legendre Features In this section, we present our proposed method (Gauss-Legendre Features) and show how it can be used to perform efficient Gaussian process regression. Our method includes two important parameter vectors: U and s. In the next section, we show how these parameters can be set in order to obtain an approximation that is spectrally equivalent to the true kernel. Fink Shustin and Avron 4.1 Feature Map We begin by describing the Gauss-Legendre feature map. The proposed method builds feature maps for kernel families that can be written in the following form: kθ(x, x ) = σ2 f Rd ϕ(x, η)ϕ(x , η) p(η; θ0)dη + σ2 nγ(x x ) . (8) In the above, ( 1 z = 0 0 z = 0 , the function ϕ : X Rd C is such that for every x X the function ϕ(x, ) is even-symmetric (i.e., for every η Rd, ϕ(x, η) = ϕ(x, η) ), θ = [θ0, σ2 f, σ2 n], and for every θ0 the function p( ; θ0) is an even probability density on Rd. Note that θ0 can be a vector. Note that in Eq. (8) we included a ridge term σ2 nγ(x x ). Typically, the ridge term is omitted from the kernel but appears in various equations involving the kernel matrix due to the Gaussian noise assumption in the GPR model. While we could state our theory in the more traditional way of having the noise term outside of the kernel, the definitions and theorems statements will be somewhat more cumbersome. We found it more convenient to include σ2 n as part of the vector of the parameter set θ, and include the ridge term in the kernel definition. The resulting equations are the same. There are quite a few kernel families that adhere to this structure. For example, due to Bochner s theorem, a shift-invariant kernel with an additional noise level term can be written in the form kθ(x, x ) = σ2 f Rd e i(x x )Tηp(η; θ0)dη + σ2 nγ(x x ) . (9) So, we can use ϕ(x, η) = e ix Tη to cast shift invariant kernels in the form of Eq. (8). The underlying idea of Gauss-Legendre features is to first truncate the integral Eq. (9) to the domain QU = Qd k=1[ Uk, Uk], for some U = (U1, . . . , Ud)T, and then approximate the truncated integral using a tensorized Gauss-Legendre quadrature. Since the weight function is uniform for the Gauss-Legendre quadrature rule, it enables the method to fit a variety of kernel classes. The domain QU might depend on X and Θ, but not on the concrete dataset x1, . . . , xn. Let (χ(m) 1 , w(m) 1 ), . . . , (χ(m) m , w(m) m ) denote the nodes and weights of the m-point Gauss-Legendre. Assume we are given a list of quadrature size for each dimension: s = (s1, . . . , sd). Let s = Qd k=1 sk. The approximation then reads: kθ(x, x ) σ2 f QU ϕ(x, η)ϕ(x , η) p(η; θ0)dη + σ2 nγ(x x ) jd=1 wj1 jdp(ˆηj1 jd; θ0)ϕ(x, ˆηj1...jd)ϕ(x , ˆηj1 jd) + σ2 nγ(x x ) j=1 hj(θ0)ϕ(x, ηj)ϕ(x , ηj) + σ2 nγ(x x ) (10) η(s1) j1... η(sd) jd U1 χ(s1) j1 ... Ud χ(sd) jd Gauss-Legendre Features for Gaussian Process Regression k=1 Ukw(sk) jk . In Eq. (10), we assumed we have a bijective mapping a1, . . . , as between {1, . . . , s} and {1, . . . , s1} {1, . . . , sd} and then defined: ηj := ˆηaj hj(θ0) := wajp(ηj; θ0) . Finally, the parameterized family of approximate kernels is kθ(x, x ) := σ2 f j=1 hj(θ0)ϕ(x, ηj)ϕ(x , ηj) + σ2 nγ(x x ) . (11) Note that the conditions that p( ; θ0) is even and ϕ(x, ) is even-symmetric, coupled with the fact that the Gauss-Legendre quadrature is symmetric, ensures that kθ(x, x ) is always real. We stress that since the weights in Gauss-Legendre quadrature are fixed once U and s are chosen, we need to compute them only once. It is also worth noting that Clenshaw-Curtis quadrature probably would work similarly to Gauss-Legendre quadrature, when using the uniform density function (Trefethen, 2008, 2013). Nevertheless, since Gauss-Legendre quadrature is optimal for the uniform weight function, we decided to use it and not Clenshaw-Curtis. Consider the first term on the right hand side of Eq. (11). It is a bivariate function that can be written as a sum of s separable bivariate functions. Thus, we can informally view s as the rank of the decomposition, and if s is small, then this is a low-rank approximation. The parameterized family of approximate kernels { kθ}θ Θ is composed of kernels that can be written as a low-rank bivariate function plus a ridge term. In the next subsection, we show how to utilize this low-rank structure in order to efficiently perform Gaussian process regression. Of course, the crucial question is how do we choose U = (U1, . . . , Ud) and s = (s1, . . . , sd). We want to choose these parameters such that {kθ}θ Θ and { kθ}θ Θ are n-spectrally equivalent over the domain X, where n is the target dataset size (since GPR is nonparametric, the effective rank of the kernel matrix goes to infinity when n goes to infinity, so it is impossible to approximate the kernel matrix well with a matrix of fixed rank, i.e., with s fixed, as n goes to infinity). We discuss this question in the next section. In the reminder of this section, we discuss how to efficiently perform GPR using the approximate kernel family { kθ}θ Θ. 4.2 Efficient Gaussian Process Regression As a first and crucial step, we show how to write the kernel matrix of kθ as a low-rank matrix plus a ridge term. Given a dataset x1, . . . , xn X, let X, as usual, denote the matrix whose row j is x T j . Define the matrix Z Cn s, Zlj := ϕ(xl, ηj) . Note that Z depends on U and s, but not on the hyperparameters θ. Next, define W : Θ Rs s + , W(θ) := h1(θ0) ... hs(θ0) We now have Kθ(X, X) = σ2 f ZW(θ)Z + σ2 n In . Notice that dependence on θ is confined to the diagonal matrix W(θ). This will be very helpful in deriving efficient formulas for GPR. We now discuss each of the various stages of GPR separately. For simplicity, we assume that the GP prior has zero mean (µ = 0). Fink Shustin and Avron Training. Given y1, . . . , yn, training usually amounts to computing the vector α := Kθ(X, X) 1y where y = [y1, . . . , yn]T. However, in our case, in order to utilize the structure of Kθ(X, X), we instead compute: w := Z α = W(θ) 1(σ2 f Z Z + σ2 n W(θ) 1) 1Z y . In the above, the second equality is a simple consequence of the Woodbury matrix identity. Since W(θ) has positive diagonal, w can be computed using O(ns2) operations, discounting the cost of computing Z. Prediction. Given a test set x(t) 1 , . . . , x(t) t which are distinct from the training set, the predicted (mean) vector y(t) = [y(t) 1 , . . . , y(t) t ]T is defined by y(t) := Kθ(X(t), X)α. Let Z(t) Ct s, Z(t) lj := ϕ(x(t) l , ηj) . Then, we have Kθ(X(t), X) = σ2 f Z(t)W(θ)Z y(t) = Kθ(X(t), X)α = σ2 f Z(t)W(θ)Z α = σ2 f Z(t)W(θ)w . The predictive variance is then Y(t) = Kθ(X(t), X(t)) Kθ(X(t), X) Kθ(X, X) 1 Kθ(X(t), X) = σ2 f Z(t)W(θ)Z(t) σ2 f Z(t)W(θ)Z σ2 f ZW(θ)Z + σ2 n In 1 ZW(θ)Z(t) = σ2 f Z(t)W(θ) Is σ 2 n W(θ) 1 σ2 f Z Z + σ2 n W(θ) 1 1 Z ZW(θ) Z(t) = σ2 f Z(t) Is σ 2 n σ2 f Z Z + σ2 n W(θ) 1 1 Z Z W(θ)Z(t) = σ2 f Z(t) Is σ 2 n σ2 f Z Z + σ2 n W(θ) 1 1 Z Z W(θ)Z(t) Hence, once we have w (computed during training), we can compute y(t) using O(ts) operations, discounting the cost of compute Z(t), and Y(t) using O(s3 + ts2) operations. Hyperparameter Learning. Hyperparameter learning amounts to finding the hyperparameters θ which maximize the log marginal likelihood. To do so, we need to be to able to efficiently compute the log marginal likelihood and its gradient. It is well known that the likelihood is given by 2y T Kθ(X, X) 1y 1 2 log det Kθ(X, X) n 2 log 2π (12) and the derivatives are given by Kθ(X, X) 1 Kθ(X, X) 2y T Kθ(X, X) 1 Kθ(X, X) θi Kθ(X, X) 1y (13) where θi represents a hyperparameter in θ. Gauss-Legendre Features for Gaussian Process Regression Proposition 7 After an O(ns2) preprocessing step of computing Z Z, and discounting the cost of computing the partial derivatives of p with respect to the hyperparamters, the log marginal likelihood L(θ) and the gradient L(θ) can be computed in O(ns + s3 + s|θ|) arithmetic operations, where |θ| represents the number of hyperparameters in θ. Furthermore, the amount of memory storage required is O(s2). Proof First, let us consider the computation of the likelihood. For the first term in Eq. (12), note that α(θ) = Kθ(X, X) 1y = σ 2 n (y σ2 f ZW(θ)w(θ)) where w(θ) = W(θ) 1(σ2 f Z Z + σ2 n W(θ) 1) 1Z y. In the previous equations, we made the dependence of α and w on θ explicit. Obviously, once Z Z has been computed (an O(ns2) preprocessing step), we can compute both w(θ) and α(θ) in O(ns + s3). The first term in Eq. (12) is now equal to y Tα(θ)/2. For the second term in Eq. (12), using the matrix determinant lemma, we have log det Kθ(X, X) = n log σ2 n + s log σ2 f + log det W(θ) + log det(σ 2 f W(θ) 1 + σ 2 n Z Z) . Since W(θ) is diagonal, once Z Z has been computed, we can compute this term in O(s3) operations. Next, let us consider the computation of each derivative of the likelihood according to Eq. (13). The crucial observations are: σ2 f = ZW(θ)Z , Kθ(X, X) σ2n = In, Kθ(X, X) θi = σ2 f Z W(θ) (using θi to denote an hyperparameter in θ0) where using last equality amounts to computing partial derivative p(ηj; θ0)/ θi for j = 1, . . . , s. For the second term in Eq. (13) we have, α(θ)T Kθ(X, X) σ2 f α(θ) = w(θ)TW(θ)w(θ) α(θ)T Kθ(X, X) σ2n α(θ) = α(θ) 2 2 α(θ)T Kθ(X, X) θi α(θ) = σ2 fw(θ)T W(θ) so this term can be computed in O(s) operations once we compute w(θ) and α(θ) (which are computed during the computation of the likelihood). For the first term in Eq. (13), let F(θ) := σ2 f(σ2 f Z Z + σ2 n W(θ) 1) 1Z Z . Again, once Z Z has been computed, F(θ) can be computed in O(s3) operations. Now, using the Woodbury formula and cyclicality of the trace, we have Kθ(X, X) 1 Kθ(X, X) = σ 2 f Tr (F(θ)) (14) Kθ(X, X) 1 Kθ(X, X) = σ 2 n (n Tr (F(θ))) Kθ(X, X) 1 Kθ(X, X) = σ 2 n σ2 f Tr W(θ) θi Z Z σ 2 n σ2 f Tr W(θ) Fink Shustin and Avron (the calculations leading to these formulas appear in Appendix A). Thus, once Z Z, F(θ) and the diagonal of Z ZF(θ) have been computed, all these computations can be done in O(s|θ|). Note that the diagonal of Z ZF(θ) can be computed using O(s2) operations once we have Z Z and F(θ). In terms of memory storage, notice that once Z Z and Z y have been computed there is no longer any need for Z (which requires O(ns) in storage). However, in order to compute Z Z and Z y we do not need to form all of Z in memory, but rather can stream over the training set transforming every training point using ϕ and accumulating its contribution to Z Z and Z y. Thus, the dominant storage cost is for holding Z Z which is O(s2). The computations can be performed more stably by utilizing various matrix identities. We delegate the details to Appendix A. To have a computational advantage in the training and prediction steps we need s = o(n). However, since for most kernels computing the gradient of the likelihood requires O(n3|θ|), our method has a computational advantage in the hyperparameter learning phase even if s = Θ(n). 4.3 Comparison to Other Methods Our proposed method is very much inspired by the Random Fourier Features (RFF) method (Rahimi and Recht, 2008). Although originally defined only for shift-invariant kernels, the method can be easily generalized for kernels of the form of Eq. (8). We refer to the generalization as Monte-Carlo Features (MCF). RFF is a special case of MCF. In particular, given some fixed parameters θ, an MCF approximate kernel is kθ(x, x ) = σ2 f s j=1 ϕ(x, ηj)ϕ(x , ηj) + σ2 nγ(x x ) where η1, . . . , ηs are sampled from the density function p(η; θ0). As before, we include the ridge term σ2 nγ(x x ) in the kernel definition. Thus, the kernel matrix approximation is K (MCF) θ (X, X) = σ2 f Z(θ)Z(θ) + σ2 n In where Z(θ) Cn s, Z(θ)lj := ϕ(xl, ηj) . Note that, according to (Avron et al., 2017), to obtain K and K (MCF) θ that are n-spectrally equivalent, the number of features of RFF should be s = O(n3), in which case the approximation is more expensive than exact computation. Obviously, training and prediction can be efficiently executed by utilizing the identity plus low-rank structure of K (MCF) θ much in the same way as we have done for Gauss-Legendre features, and indeed this is the reason the method was developed (Rahimi and Recht, 2008). The above developments were for a fixed θ. However, it is less clear how to define a family of approximations for various θ, and perform hyperparameter learning. A key issue is that the kernel approximation should vary smoothly with θ, so obviously fresh η1, . . . , ηs cannot be sampled differently for every θ. It is outside the scope of this paper to consider how to use MCF to define parameterized families of kernel approximation suitable for hyperparameter learning. Nevertheless, since we wish to use RFF as a baseline for complexity comparisons and numerical experiments, we show how it is possible to use the specific case of RFF to form parameterized families of kernel approximation and perform hyperparameter learning for a restricted family of kernels that includes the Gaussian and Matèrn kernels. Specifically, we will consider a restricted class of shift-invariant whose kernel has the following specific form: kθ(x, x ) = σ2 fk0(L 1(x x )) + σ2 nγ(x x ) (15) Gauss-Legendre Features for Gaussian Process Regression where k0( ) is a positive definite function, and L is diagonal with positive entries L1, . . . , Ld which are part of parameter vector θ (i.e., θ = [L1, . . . Ld, σ2 f, σ2 n]). Note that the number of variables in θ0 is equal to the dimension d. Gaussian and Matèrn kernels are examples of such kernels. In this case we can write kθ in the form of Eq. (8), where ϕ(x, η) = e ix Tη. We further assume that p(η; L) is such that sampling a random vector η is the same as sampling from the distribution defined by p( ; Id) and scaling the vector by L 1. Thus, for such kernels we can sample η1, . . . , ηs once from p( ; Id) and view any change in θ0 = L as corresponding change in η1, . . . , ηs. Concretely, letting W = η1 . . . ηs , the feature matrix is Z(L) = 1 s exp i XL 1W and the kernel matrix is K (RFF) θ (X, X) = σ2 f Z(L)Z(L) + σ2 n In . However, the crucial point is that now Z(L) depends smoothly on L, so we can compute gradients. Formulas quite similar to the ones derived in the previous section can be derived (we omit most details), with the main difference being in taking the derivative of the kernel matrix with respect to the parameters in L. Here we have K (RFF) θ (X, X) Lk = σ2 f (ZZ )(L) Lk Z(L) + Z Z(L) Lk W exp i XL 1W = i X L 1 Lk XT exp i XL 1W = i WT L 1 Lk XT Z(L) = i X L 1 Lk W Z(L) . In terms of complexity, the main difference between Gauss-Legendre Features and RFF is that for the former the matrix Z Z stays constant when θ varies, and so the product can be computed once, while for the latter Z(L) Z(L) varies, and so changes every iteration. This adds an additional cost of O(ns2) operations for every gradient computation. Furthermore, we need to compute Z(L)/ Lk for k = 1, . . . ., d in each iteration, each costing O(nsd) operations, for a total of O(nsd2) operations. Furthermore, since Z(L) changes in each iteration, and Z(L) features in many of the equations, we cannot compute the matrix Z(L) Z(L) once and reduce storage costs to O(s2), and using Z(L) implicitly many times will incur a large overhead. Thus, for RFF the storage cost is O(ns). Table 1: Computational complexities (arithmetic operations) comparison between Gauss-Legendre Features and Random Fourier Features for kernels of the form of Eq. (15) (e.g., non-isotropic Gaussian an Matèrn kernels). In the table, n is the size of the training set, t is the size of the test set, s is the approximation rank, and I is the number of gradient computations for hyperparameter learning (e.g., number of gradient descent iterations). Gauss-Legendre Features Random Fourier Features Inference O(ns2) O(ns2) Prediction O(nt) O(nt) Hyperparameter O(ns2 + I(ns + s3 + sd)) O(I(ns2 + nsd2)) Fink Shustin and Avron A similar issue will likely arise when using features based on Gaussian quadrature, like Dao et al. suggested (Dao et al., 2017) (that paper does not discuss GPR hyperparameter learning). When θ changes, the distribution that defines Gaussian quadrature changes. Unlike Guass-Legendre features which use fixed nodes, for Gaussian quadrature features the quadrature nodes change with θ, which prevents the use of a fixed feature matrix Z(θ). Furthermore, when performing hyperparameter learning with Gaussian quadrature features, we need to not only compute the quadrature weights but also compute their derivatives. We note that the formulas for the modified RFF are similar, where Z(L)/ Lk and Z(L) / Lk will have an additional term that originates in pointwise multiplying the RFF matrix Z by [p(η1)/q(η1), . . . , p(ηs)/q(ηs)], as described in Section 2.3. However, this does not changes the costs and issues mentioned above. 5. Parameter Computation In order to complete the description of our method, we need to specify how to choose U and s. First, we show how to set U and s for a fixed θ Θ and n such that kθ and kθ are n-spectrally equivalent. We then consider how to set a fixed U and s such that the two families {kθ}θ Θ and { kθ}θ Θ are n-spectrally equivalent. 5.1 From Matrix Approximation to Integral Approximation For now (and until subsection 5.4) let us assume that θ is fixed. Our goal is to set U and s such that kθ and kθ are n-spectrally equivalent, i.e. for every dataset x1, . . . , xn X (1 n 1)Kθ(X, X) Kθ(X, X) (1 + n 1)Kθ(X, X) . In other words, we want to set U and s such that for every v Rn, (1 n 1)v TKθ(X, X)v v T Kθ(X, X)v (1 + n 1)v TKθ(X, X)v . (16) Henceforth, for conciseness, we drop X from the expressions, although the various expressions implicitly depend on X. Let ϕ(x1, η) ... ϕ(xn, η) v TKθv = σ2 f Rd |z(η) v|2p(η; θ0)dη + σ2 n v 2 2 v T Kθv = σ2 f j=1 hj(θ0)|z(ηj) v|2 + σ2 n v 2 2 . Since rescaling v rescales all the terms in the previous inequality, we can assume without loss of generality that v TKθv = 1. In that case, Eq. (16) is equivalent to Rd |z(η) v|2p(η; θ0)dη j=1 hj(θ0)|z(ηj) v|2 1 σ2 fn . (17) Thus, the nodes η1, . . . , ηs and weights h1(θ0) . . . , hs(θ0) function as a quadrature approximation. Gauss-Legendre Features for Gaussian Process Regression 5.2 Truncating the Integral As alluded earlier, we approach the quadrature approximation Eq. (17) by first truncating the integral and then using a Gauss-Legendre quadrature for the truncated integral. In other words, we write Rd |z(η) v|2p(η; θ0)dη j=1 hj(θ0)|z(ηj) v|2 Rd |z(η) v|2p(η; θ0)dη ˆ QU |z(η) v|2p(η; θ0)dη + QU |z(η) v|2p(η; θ0)dη j=1 hj(θ0)|z(ηj) v|2 where QU = Qd k=1[ Uk, Uk]. We set U such that the first term is smaller than σ 2 f n 1/2, and set each of the components in s to be large enough so that the second term is also smaller than σ 2 f n 1/2. Obviously, we want to set the components in U to be as small as possible, to limit the integration area. Having a smaller integration area allows us to use smaller values in s. The minimal values in U such that the first term is bounded by σ 2 f n 1/2 depends on how quickly p(η; θ0) decays as η : the faster the density decays, the smaller is the region where the function value has a significant contribution. Therefore, in our analysis, we distinguish between four classes of decay of p( ; θ0), and analyze each on its own. For a diagonal L, with positive entries L1, . . . , Ld on the diagonal (i.e., L = diag (L1, . . . , Ld)), let us define: p : Rd R+ measurable | ˆ Rd p(η)dη = 1, p(η) C 1 1 + L2 kη2 k P(r) C,L := p : Rd R+ measurable | ˆ Rd p(η)dη = 1, p(η) C 1 + Lη 2 2 r E(1) C,L := p : Rd R+ measurable | ˆ Rd p(η)dη = 1, p(η) C e Lη 1 E(2) C,L := p : Rd R+ measurable | ˆ Rd p(η)dη = 1, p(η) C e Lη 2 2 . The families PC,L and P(r) C,L include densities that decay at a polynomial rate or faster, while E(1) C,L includes densities that decay at exponential rate or faster, and E(2) C,L includes densities that decay at a square exponential rate or faster. Table 2 shows four well known kernels, their corresponding densities, and their decay class. The following proposition specifics how to set U based on the decay class and the maximum value of ϕ. Proposition 8 Suppose that, MR is such that for every η Rd and every x X we have |ϕ(x, η)| MR. Then, the following establishes a U(min) = (U (min) 1 , . . . , U (min) d ) such that if U U(min) (where we interpret the inequality as entrywise) then we have Rd |z(η) v|2p(η; θ0)dη ˆ QU |z(η) v|2p(η; θ0)dη 1 2σ2 fn (19) for every v such that v TKθv = 1. Fink Shustin and Avron Table 2: Example of decay classes for a few well known kernel function. In the table below, L is a diagonal matrix with non-negative diagonal entries. kθ(x, x ) p(η; θ0) Decay Class exp( L 1(x x ) 2 2//2) (2π) d/2 d Y k=1 ℓk exp( Lη 2 2) E(2) C,L, C = (2π) d/2 Qd k=1 ℓk 2d Qd k=1 ℓk ℓ2 k+(x x )2 k E(1) C,L, C = 1 exp( L 1(x x ) 1) π d d Y ℓk 1 + ℓ2 kη2 k Lk = ℓk PC,L, C = π d Qd k=1 ℓk Matèrn 21 ν x ) 2)ν Kν( Γ(ν + d/2) πd/2Γ(ν)(2ν)d/2 k=1 ℓk 1 + Lη 2 2 (ν+d/2) P(r) C,L r = ν + d/2 C = Γ(ν + d/2) Γ(ν)(2πν)d/2 1. If p( ; θ0) PC,L, U (min) k := 1 Lk cot Lk 4CM 2 Rσ2 f n2 2. If p( ; θ0) P(r) C,L and r > d/2: (a) for d = 2, set U (min) k := 1 Lk πCM 2 Rσ2 f n2 (r 1)σ2n L1L2 1 . (b) for any d = 2, let x > 0 be the solution to the equation πd/2Cn M 2 R 2d 2Γ d 2 (2r d)σ2n Qd k=1 Lk xd 2r 2F1 r d/2, r; r d/2 + 1; x 2 = 1 2σ2 fn . (20) (2F1 is the hypergeometric function). Then set U (min) k := x/Lk . We have U (min) k 1 πd/2CM 2 Rσ2 fn2 2 (2r d)σ2n Qd k=1 Lk 3. If p( ; θ0) E(1) C,L, U (min) k := 1 Lk ln 1 Lk 4CM 2 Rσ2 f n2 4. If p( ; θ0) E(2) C,L, U (min) k := 1 Lk 22 d CM 2 Rσ2 f n2 Remark 9 We recommend to set U to U(min) . For PC,L, E(1) C,L and E(2) C,L we give explicit formulas for U(min). For P(r) C,L, it is defined implicitly as the solution to a nonlinear equation. We recommend finding the solution numerically using root-finding methods. We also give an explicit upper bound for the value of U(min), which can be used if one wishes to avoid solving a non-linear equation, however, those upper bounds tend to be loose. Nevertheless, the upper bound is used later to derive asymptotic bounds on s. Gauss-Legendre Features for Gaussian Process Regression Proof First, since v TKθv = 1 , we have |z(η) v|2 = z(η) K 1/2 θ K1/2 θ v 2 (z(η) K 1 θ z(η)) (v TKθv) = z(η) K 1 θ z(η) (21) σ 2 n z(η) 2 2 nσ 2 n M 2 R where the first inequality is due to the Cauchy-Schwartz inequality, the second inequality follows from observing that the smallest eigenvalue of Kθ is bigger than or equal to σ2 n, and the last inequality is due to the fact that every entry in z(η) has absolute value that is smaller or equal to MR. The case of p PC,L: In this case, Rd |z(η) v|2p(η; θ0)dη ˆ QU |z(η) v|2p(η; θ0)dη = |η| U |z(η) v|2p(η)dη 2Cn M 2 R σ2n 1 1 + L2 kη2 k dη1 dηd = 2Cn M 2 R σ2n 1 1 + L2 kη2 k dηk = 2Cn M 2 R σ2n 2 arctan(Lk Uk) . So, in order for Eq. (19) to hold, we set U (min) k = 1 4CM 2 Rσ2 fn2 4CM 2 Rσ2 fn2 The case of p P(r) C,L: In this case, z(η) v 2 p(η)dη ˆ z(η) v 2 p(η)dη |η| U |z(η) v|2p(η)dη 2Cn M 2 R σ2 n 1 + L2 1η2 1 + . . . + L2 dη2 d r dη1 dηd = 2Cn M 2 R σ2 n Qd k=1 Lk L1U1 . . . ˆ 1 + η2 1 + . . . + η2 d r dη1 dηd 2Cn M 2 R σ2n Qd k=1 Lk 0 . . . ˆ π/2 j=1 sinj φd 1 j dφ1 dφd 2dtdθ = 2Cn M 2 R σ2n Qd k=1 Lk π 0 sinj φd 1 jdφd 1 j = πd/2Cn M 2 R 2d 2Γ d 2 σ2n Qd k=1 Lk (1 + t2)r dt = πd/2Cn M 2 R 2d 1Γ d 2 σ2n Qd k=1 Lk (1 + t)r dt Fink Shustin and Avron where m = min1 k d{Lk Uk}, and in the second inequality we use d-dimensional (d 2) spherical coordinates (Blumenson, 1960): η1 = t cos φ1 2 k d 2 : ηk = t cos φk ηd 1 = t sin θ ηd = t cos θ where η = t, 0 φj π/2, 0 θ < π/2, m t < , and the Jacobian is J = td 1 d 2 Y j=1 sinj φd 1 j . Also, in the fourth equality we use the following property of the beta function: Γ(x + y) = B (x, y) = 2 ˆ π/2 0 (sin φ)2x 1 (cos φ)2y 1 dφ with y = 1/2 and x = (j + 1)/2, for any j = 1, . . . , d 2, that is ˆ π/2 0 sinj φd 1 jdφd 1 j = Γ 1 which implies that 0 sinj φd 1 jdφd 1 j Now, we write the last integral of (22) in terms of incomplete beta function, as follows: ˆ (1 + t)r dt = t = t 1 t m2/(1+m2) td/2 1 (1 t)r d/2 1d t = ( 1)1+d/2 r ˆ 0 m 2 ur d/2 1(1 u) rdu = ( 1)d/2 r ˆ m 2 0 ur d/2 1(1 u) rdu 2r d 2F1 r d/2, r; r d/2 + 1; m 2 The expression in the last equality is the analytic continuation of the beta function B m 2 (r d/2, 1 r) (Olver et al., 2010, Sections 8.17, 15.4), i.e. B m 2 (r d/2, 1 r) = 2md 2r 2r d 2F1 r d/2, r; r d/2 + 1; m 2 . Gauss-Legendre Features for Gaussian Process Regression Therefore, we obtain Rd |z(η)α|2p(η)dη ˆ QU |z(η) α|2 p(η)dη πd/2Cn M 2 R 2d 2Γ d 2 (2r d)σ2n Qd k=1 Lk md 2r 2F1 r d/2, r; r d/2 + 1; m 2 . In order for Eq. (19) to hold, let x > 0 be the solution of the equation πd/2Cn M 2 R 2d 2Γ d 2 (2r d)σ2n Qd k=1 Lk xd 2r 2F1 r d/2, r; r d/2 + 1; x 2 = 1 2σ2 fn and then set U (min) k = x/Lk. Note that if we replace the expression after the second equality in (22) with the upper bound 2Cn M 2 R σ2n Qd k=1 Lk η2 1 + . . . + η2 d r dη1 dηd then, in order for Eq. (19)to hold, we can set U (min) k = 1 πd/2CM 2 Rσ2 fn2 2d 2(2r d)Γ d 2 σ2n Qd k=1 Lk Note also that for d = 2, we have the simplest case of spherical coordinates, which yields U (min) k = 1 v u u t r 1 s πCM 2 Rσ2 fn2 (r 1)σ2n L1L2 1 . Also, for d = 1 we do not need spherical coordinates. In that case, we have Rd |z(η)α|2p(η)dη ˆ QU |z(η) α|2 p(η)dη = 2n CM 2 R σ2n 1 (1 + L2 1η2)r dη = n CM 2 R σ2n L1 L2 1U 2 s 1/2 (1 + s)r ds = 2n CM 2 RL1 2r 1 U 1 2r (2r 1)σ2n L1 2F1 r 1/2, r; r + 1/2; L 2 1 U 2 . Now, U (min) that equates the last expression with 1/2σ2 fn is obtained by solving the same equation obtained for d 3, only with d = 1. Note that if we bound the first integral similarly to the bound in the case d 3, we obtain the same formula for U (min) only with d = 1. The case of p E(1) C,L: In this case, Rd |z(η) v|2p(η)dη ˆ QU |z(η) v|2 p(η)dη = |η| U |z(η) v|2p(η)dη 2Cn M 2 R σ2n U1 e L1η1dη1 ˆ Ud e Ldηddηd = 2Cn M 2 R σ2n Uk e Lkηkdηk = 2Cn M 2 R σ2n Fink Shustin and Avron So, in order for Eq. (19) to hold we set U (min) k = 1 4CM 2 Rσ2 fn2 The case of p E(2) C,L: In this case, Rd |z(η) v|2p(η)dη ˆ QU |z(η) v|2 p(η)dη = |η| U |z(η) v|2p(η)dη 2Cn M 2 R σ2n U1 e L2 1η2 1 2 dη1 ˆ Ud e L2 dη2 d 2 dηd = πd/2Cn M 2 R 2d/2 1σ2n πd/2Cn M 2 R 2d/2 1σ2n e L2 k U2 k 2 Lk where we used the bound erfc (x) e x2 for non-negative x, which follows from (Craig, 1991). In this paper it was shown that for any non-negative x erfc (x) = 2 and therefore erfc (x) = 2 sin2 θ dθ 2 2 0 e x2dθ = e x2. So, in order for Eq. (19) to hold we set U (min) k = 1 v u u u t2 ln 22 d/2CM 2 Rσ2 fn2 5.3 Approximating the Truncated Integral The nodes η1, . . . , ηs and their weights h1(θ0), . . . , hs(θ0) are simply rescaled multivariate Gauss Legendre quadrature nodes and weights3, and so Ps j=1 hj(θ0)|z(ηj) v|2 is a quadrature approximation of QU |z(η) v|2p(η; θ0)dη. In this section we derive a lower bound on s that guarantees that QU |z(η) v|2p(η; θ0)dη j=1 hj(θ0)|z(ηj) v|2 The bound on s depends on U. Together with Proposition 8, we completely specify how to build the quadrature approximation so that Eq. (17) holds. 3. Gauss-Legendre quadrature is defined for one dimensional integrals. In multivariate Gauss-Legendre quadrature we refer to the quadrature obtained by tensorizing the one-dimensional quadrature. Gauss-Legendre Features for Gaussian Process Regression 5.3.1 Decay of Chebyshev Coefficients for Multivariate Functions Our analysis relies on generalizations of existing decay bounds for Chebyshev expansions of analytic functions in one dimension to multivariate functions. In this subsection, we introduce these results. Classical decay bounds for Chebyshev expansions of analytic functions in one dimension are based on bounding the function values on the Bernstein ellipse (see (Mason and Handscomb, 2002, Section 1.4) for further details). For multivariate functions, a polyellipse is used instead. Definition 10 A Bernstein ellipse is an open region in the complex plane which bounded by an ellipse with foci 1. A Bernstein polyellipse in d-dimensions is a cartesian product of d Bernstein ellipses. Let ρ(U, β) := β/(2U) + p β2/(4U 2) + 1. Given U = (U1, . . . , Ud) > 0 and a singularity point β = (β1, . . . , βd) > 0, denote. EU,β := z Cd : zk + q < Ukρ(Uk, βk) k = 1, . . . , d . Note that E1,β U, where β U denotes entrywise division between β and U, is a Bernstein polyellipse, and that EU,β = U E1,β U. So, EU,β is a polyellipse with foci at Uk. For a multivariate analytic function f on [ 1, 1]d, the multivariate tensorised Chebyshev expansion is given by j1,...,jd=0 aj1...jd Tj1(x1) Tjd(xd) where the coefficients are given by aj1...jd = 2d m f (x1, . . . , xd) Tj1(x1) Tjd(xd) p 1 x2 d dx1 dxd where m := # {jk : jk = 0}. The following is a generalization of classical results for one dimension (Trefethen, 2013, Theorem 8.1, Theorem 8.2): Theorem 11 Let f be an analytic function on [ 1, 1]d and analytically continuable to E1,β U where it satisfies |f (x1, . . . , xd) | M for some M > 0. Then, for all j1, . . . , jd, |aj1...jd| 2d m M ρj1 1 ρjd d . Although Theorem 11 has essentially been proven in (Wang and Zhang, 2020), for completeness we include in Appendix B our proof of Theorem 11 which is based on a different technique. 5.3.2 Bounding the Integration Error We have the following result: Theorem 12 Given U = (U1, . . . , Ud) such that U1, . . . , Ud > 0, and β such that β1, . . . , βd > 0, let EU,β be the polyellipse such that in dimension j the foci is Uj and passes through iβj/2, and let ρ = (ρ1, . . . , ρd) with ρj := ρ(Uj, βj) (for j = 1, . . . , d) denote the sum of the semi-axes in each dimension. Assume that either p( ; θ0) PC,L or p( ; θ0) P(r) C,L where r > d/2, or p E(1) C,L or p E(2) C,L. Furthermore, assume that if z(η) = a(η) + ib(η), then for each 1 j n the functions Fink Shustin and Avron aj, bj are analytic on Rd. Finally, assume that p( ; θ0) has an analytic continuation ˆp( ; θ0) to EU,β. Let ˆz( ) denote the analytic continuation of z( ). Denote MR := sup η Rd z(η) MU,β := sup η EU,β ˆz(η) CU,β := sup η EU,β |ˆp(η; θ0)| . 1 d ln 22d+2M 2 U,βCU,βσ 2 n σ2 fn2 + ln Uk ln(ρk 1) 2 ln ρk + 1, k = 1, . . . , d, QU |z(η) v|2p(η; θ0)dη j=1 hj(θ0)|z(ηj) v|2 Note that in that case ln 22d+2M 2 U,βCU,βU d k σ 2 n σ2 fn2 d ln(ρk 1) Proof For conciseness, we drop θ0 from p throughout the proof. For convenience, we use the following form of the quadrature rule j=1 hj(θ0)|z(ηj) v|2 = jd=1 wj1 jdp(ηj1 jd)|z(ηj1 jd) v|2 as presented in Section 4. Denote fv(η) = |z(η) v|2. Also denote p(χ) := p(U1χ1, . . . , Udχd) and fv(χ) := fv(U1χ1, . . . , Udχd). Note that QU fv(η)p(η)dη jd=1 wj1...jdp(ηj1 jd)fv(ηj1...jd) [ 1,1]d fv(χ) p(χ)dχ jd=1 wj1...jd p(χj1...jd) fv(χj1...jd) where wj1...jd = Qd k=1 w(sk) jk . The sum in the right-hand side is a quadrature approximation of fv(χ) p(χ), which we analyze. To that end, we first bound the analytic continuation of fv(χ) p(χ) on E1,β U (where β U denotes entrywise division). For every η EU,β we have (similar to the derivation of Eq. (21)): | ˆfv(η)| = |ˆz(η) v|2 σ 2 n ˆz(η) 2 2 nσ 2 n M 2 U,β . Thus, | ˆfv(η)ˆp(η)| nσ 2 n M 2 U,βCU,β and | ˆ fv(χ)ˆ p(χ)| nσ 2 n M 2 U,βCU,β. We can now apply quadrature approximation bounds on fv(χ) p(χ) to bound the error [ 1,1]d fv(χ) p(χ)dχ jd=1 wj1...jd fv(χj1...jd) p(χj1...jd) Gauss-Legendre Features for Gaussian Process Regression Let fv(χ) p(χ) = k1,...,kd=0 ak1...kd Tk1(χ1) Tkd(χd) be the multivariate Chebyshev expansion of fv(χ) p(χ), and let P2s 1 be the truncated expansion: P2s 1(χ) := kd=0 ak1...kd Tk1(χ1) Tkd(χd) . Similarly to the strategy employed in (Ubaru et al., 2017) and (Trefethen, 2013, Theorem 19.3), we have [ 1,1]d fv(χ) p(χ)dχ ˆ [ 1,1]d P2s 1(χ)dχ+ P2s 1(χj1...jd) wj1...jd fv(χj1...jd) p(χj1...jd) wj1...jd fv(χ) p(χ) P2s 1(χ) dχ fv(χj1...jd) p(χj1...jd) P2s 1(χj1...jd) wj1...jd 1 . . . ˆ 1 kd=2sd ak1...kd Tk1(χ1) Tkd(χd)dχ1 dχd kd=2sd ak1...kd Tk1(χ(s1) j1 ) Tkd(χ(sd) jd ) kd=2sd |ak1...kd| 1 |Tkm(χm)| dχm + jd=1 w(s1) j1 w(sd) jd Tk1(χ(s1) j1 ) Tkd(χ(sd) jd ) kd=2sd |ak1...kd| j1=1 w(s1) j1 jd=1 w(sd) jd kd=2sd |ak1...kd| 2d + 2d 22d+1n M 2 U,βCU,β σ2n ρk1 1 ρkd d = 22d+1n M 2 U,βCU,β σ2n 1 ρ2sk 1 k (ρk 1) where we use the bound fv(χ) p(χ) n M 2 U,βCU,β In the first equality, we use the following equality [ 1,1]d P2s 1(χ)dχ = jd=1 wj1...jd P2s 1(χj1...jd) Fink Shustin and Avron which follows from the exactness of the Gauss-Legendre quadrature in one dimension: ˆ [ 1,1]d P2s 1(χ)dχ = ˆ kd=0 ak1...kd Tk1(χ1) Tkd(χd)dχ1 dχd kd=0 ak1...kd [ 1,1]d Tk1(χ1) Tkd(χd)dχ1 dχd kd=0 ak1...kd 1 Tk1(χ1)dχ1 1 Tkd(χd)dχd kd=0 ak1...kd j1=1 w(s1) j1 Tk1(χ1,j1) jd=1 w(sd) jd Tkd(χd,jd) jd=1 w(s1) j1 w(sd) jd kd=0 ak1...kd Tk1(χ1,j1) Tkd(χd,jd) jd=1 w(s1) j1 w(sd) jd P2s 1(χj1...jd) . QU |z(η) v|2 p(η)dη jd=1 wj1 jd|z(ηj) v|2 k=1 Uk 22d+1n M 2 U,βCU,β σ2n Uk ρ2sk 1 k (ρk 1) . Finally, bounding 22d+1n M 2 U,βCU,β σ2n Uk ρ2sk 1 k (ρk 1) 1 2σ2 fn 1/d for each k = 1, . . . , d , gives the bound from the theorem and the statement now follows immediately. The last theorem allows us to compute the required s based on U and the singularities in p( ; θ0). In Section 6, we show concrete examples for using Theorem 12 to bound s and U. The main step is finding the polyellipse parameters. If the analytic extension of p( ; θ0) has its first singularity at the pure imaginary value x0 = iβ for some β = (β1, . . . , βd)T such that 0 < βk < Uk, then p( ; θ0) has its first singularity at x0 = iβ U. Thus, we can choose the ellipses parameters to be ρk = βk 2Uk + r β2 k 4U 2 k + 1. Otherwise, we can choose βk = 2Uk, i.e., ρk = 1 + 5.4 Handling a Parameter Domain Θ Given a hyperparameter domain Θ, we want to set the parameters U = (U1, . . . , Ud) and s = (s1, . . . , sd) such that Eq. (17) hold for every θ Θ. This way, the parameterized family of positive definite kernel approximations { kθ}θ Θ given by Eq. (11) with these U and s is n-spectrally equivalent to {kθ}θ Θ on the data domain X. To do that, we need to find the worst-case (over θ Θ) parameters U and s. The following gives a general end-to-end statement. Theorem 13 Let kθ(x, x ) = σ2 f Rd ϕ(x, η)ϕ(x , η) p(η; θ0)dη + σ2 nγ(x, x ) Gauss-Legendre Features for Gaussian Process Regression be a parameterized family of kernels where θ Θ. Suppose that: 1. |ϕ(x, η)| MR for every x X and η Rd. 2. We set U supθ Θ U(min)(θ) where U(min)(θ) is the value set by Proposition 8 using parameters θ. 3. For every n points x1, . . . , xn X we set z(η) = [ϕ(x1, η), . . . , ϕ(xn, η)]T. If we write z(η) = a(η) + ib(η), then for each 1 j n the functions aj, bj are analytic on Rd. 4. β is such that p( ; θ0) has an analytic continuation ˆp( ; θ0) to EU,β for all θ Θ. Let ˆz( ) denote the analytic continuation of z( ). MU,β := sup η EU,β ˆz(η) CU,β := sup θ Θ, η EU,β |ˆp(η; θ0)| . 1 d ln 22d+2M 2 U,βCU,βσ 2 n σ2 fn2 + ln Uk ln(ρk 1) 2 ln ρk + 1, k = 1, . . . , d, the parameterized family of kernel approximations given by Eq. (11) is n-spectrally equivalent to {kθ}θ Θ on the data domain X. Furthermore, if we set U = U(min) and s according the last lower bound we have ln 22d+2M 2 U,βCU,βU d k σ 2 n σ2 fn2 d ln(ρk 1) To use this theorem, one needs to bound the decay of the density functions p( ; θ0) over Θ and calculate an upper bound on CU,β. In general, this might be hard, but luckily in most kernels display monotonicity in their hyperparameters that helps identify the worst case for θ over Θ. For example, for the one dimensional Gaussian kernel exp( ℓ 1(x x ) 2 2//2), the various parameters in the theorems monotonically increase as ℓ 0. In the next section we given concrete examples for using Theorem 13 for the kernels listed in Table 2. 6. Examples of Feature Maps for Kernels In this section, we show how to apply the theory presented in the previous section to design nspectrally equivalent kernel approximations for a few widely used kernel functions. Throughout this section, we assume that n is fixed, the data domain is X Rd, and the hyperparameter domain is Θ. Furthermore, we assume that we have a bounding box on the domain, i.e. X Qd k=1[ Ri/2, Ri/2] (obviously, such a bounding box can be easily computed from the input data). Let R = [R1, . . . , Rd]. 6.1 Gaussian Kernel Recall that the Gaussian kernel is kθ(x, x ) = σ2 f exp x x 2 2/2ℓ2 + σ2 nγ(x x ) Fink Shustin and Avron (θ = [ℓ, σ2 f, σ2 n]) where we added a scaling factor σ2 f and included the ridge term σ2 nγ(x x ) in the kernel definition. Note that for conciseness, we consider the isotropic version; the formulas can be modified for the anisotropic case. As discussed in Section 4, by setting ϕ(x, η) = e ix Tη , this kernel matches the form of Eq. (8). We assume that the hyperparameters are bounded as follows: Θ = {[ℓ, σ2 n, σ2 f] : ℓ ℓ0, σ2 n σ2 n0, σ2 f σ2 f0} where σ2 n0 > 0 (i.e. we have a ridge term; our method is not able to approximate the Gaussian kernel in the absence of a ridge term). The density is given by p(η; ℓ) = ℓd(2π) d/2 exp η 2 2ℓ2/2 . Therefore, for θ Θ , p( ; ℓ) E(2) C,L with C = ℓd 0(2π) d/2 and L = ℓ0 2Id. We also have |ϕ(x, η)| 1 for all x and η, so we set MR = 1. So, based on Proposition 8 we set v u u u t2 ln 22 dσ2 f0n2 In addition, p( ; ℓ) is analytic on Rd, and in particular it is analytically continuable to the polyellipse EU,β with βk = 2Uk, ρk = 1 + 2, as described in Theorem 12. Now we bound p( ; ℓ0) on the polyellipse as follows, for xk [ 2Uk], yk [ Uk, Uk]: |p(x + iy; ℓ0)| = ℓd 0(2π) d/2 exp k=1 (xk + iyk)2 ! ℓd 0(2π) d/2 exp k=1 x2 k y2 k + 2ixkyk = ℓd 0(2π) d/2 exp k=1 y2 k x2 k ℓd 0(2π) d/2 exp ℓd 0(2π) d/2 exp = ℓd 0(2π) d/2 exp ℓ2 0 U 2 2/2 Hence, CU,β = ℓd 0(2π) d/2 exp ℓ2 0 U 2 2/2 . Recall that z : Rd Cn is given by z(η)j = e iηTxj, it is easy to verify that for every v Rn the function |z( ) v|2 is an analytic function on Rd. In particular, it is analytically continuable to the polyellipse EU,β with βk = 2Uk, ρk = 1+ 2 as required by Theorem 12. Now we bound |z(η)j| for each j on its corresponding ellipse. For any x, y Rd : |z(x + iy)j| = |z(iy)j|, so we need to bound |z(iy)j| for y Qd k=1[ Uk, Uk]: |z(iy)j| = e i iy Txj = ey Txj e y 2 xj 2 e U 2 R 2/2 . Gauss-Legendre Features for Gaussian Process Regression Hence, |z(η)j| e U 2 R 2/2 =: MU,β. Now, one can apply Theorem 13 with the these parameters and obtain that for 1 d ln 22d+2π d/2σ 2 n0 σ2 f0n2 + ℓ2 0 2d U 2 2 + 1 d U 2 R 2 + 1 22 dσ2 f0n2 we have the desired bound. Since U 2 = O( ln n) (in particular, U 2 2 = O(ln n)) and assuming the bounding box R is fixed then sk = O(ln n) suffice and s = Qd k=1 sk = O((ln n)d) suffices. 6.2 Matèrn Kernel Recall that the Matèrn kernel is kθ(x, x ) = σ2 f 21 ν + σ2 nγ(x x ) (θ = [ℓ, σ2 f, σ2 n]) where we added a scaling factor σ2 f and an included the ridge term σ2 nγ(x x ) in the kernel definition. Note that for conciseness, we consider the isotropic version for a fixed ν; the formulas can be modified for the anisotropic case. As in the case of the Gaussian kernel, by setting ϕ(x, η) = e ix Tη , this kernel matches the form of Eq. (8). We assume that the hyperparameters are bounded as follows: Θ = {[ℓ, σ2 n, σ2 f] : ℓ ℓ0, σ2 n σ2 n0, σ2 f σ2 f0} where σ2 n0 > 0 (i.e. we have a ridge term; our method is not able to approximate the Matèrn kernel in the absence of a ridge term). The density is given by p(η; ℓ) = Γ(ν + d/2)ℓd Γ(ν) (2νπ)d/2 Therefore p P(r) C,L where we consider C = Γ(ν+d/2)ℓd 0 Γ(ν)(2νπ)d/2 , L = ℓ0 2ν Id and r = ν + d/2 > d/2, i.e., 2r d = 2ν. So we set U to be the numerical solution of Eq. (20) for d = 1, and we set Uk to be the numerical solution of Eq. (20) for d 2. In addition, p( ; ℓ) is analytic on Rd and it is analytically continuable to the polyellipse EU,β with βk = 2ν 4ℓ2 0d U 2 k + 1, as required by Theorem 12. Now we bound p( ; ℓ0) on the polyellipse as follows, for ν 2ℓ2 0d + U 2 k, q ν 2ℓ2 0d + U 2 k |p(x + iy; ℓ0)| = Γ(ν + d/2)ℓd 0 Γ(ν) (2νπ)d/2 1 + ℓ2 0 2ν k=1 (xk + iyk)2 Γ(ν + d/2)ℓd 0 Γ(ν) (2νπ)d/2 Γ(ν + d/2)ℓd 0 Γ(ν) (2νπ)d/2 = Γ(ν + d/2)ℓd 0 Γ(ν) (2νπ)d/2 Fink Shustin and Avron where the maximum value is obtained at the nearest points to the poles: yk = 2ν 2ℓ0d. Hence, CU,β := Γ(ν+d/2)ℓd 0 Γ(ν)(2νπ)d/2 3 Recall that z : Rd Cn is given by z(η)j = e iηTxj. Now we bound |z(η)j| for each j on its corresponding ellipse. For any x, y Rd : |z(x + iy)j| = |z(iy)j|, so we need to bound |z(iy)j| for y Qd k=1 [ βk/2, βk/2]: |z(iy)j| = e i iy Txj = ey Txj e y 2 xj 2 e β 2 R 2/4 Hence, |z(η)j| e β 2 R 2/4 =: MU,β. Now, for the asymptotics, consider the upper bound for U in Theorem 13. If one denotes γ = ln 22d+2π d/2 Γ(ν+d/2) 4 (ν+d/2) σ 2 n0 σ2 f0n2 and δ = ln σ 2 n0 σ2 f0n2 2d 1νB(ν, d , then Theorem 13 can be applied with the these parameters and obtain that (for 1 2d β 2 R 2 + γ d + δ 2ν ln βk/(2Uk) 1 + q β2 k/(4U2 k) + 1 2 ln βk/(2Uk) + q β2 k/(4U2 k) + 1 + 1 we have the desired bound. Assuming the bounding box R is fixed, sk = O(ln n) suffice and s = Qd k=1 sk = O((ln n)d) suffices. 6.3 Semigroup Kernels The previous two examples were of shift-invariant kernels, and the feature mapping ϕ was based on Bochner s theorem. In this section, we demonstrate the application of our theory to a different type of kernels: semigroup kernels (Yang et al., 2014). These type of kernels require a slight modification of our setup, which we briefly describe below, but adjusting theory itself is technical and we omit it. Semigroup kernels are well-suited for non-negative data, i.e. X Rd +, and require that the kernel value at x and x depends only on the sum x+x : k(x, x ) = k0(x+x ). One example of such kernel is the reciprocal semigroup kernel: kθ(x, x ) = σ2 f λ xk + x k + λ + σ2 nγ(x x ) (θ = [λ, σ2 f, σ2 n]) where we add a scaling factor σ2 f and an included the ridge term σ2 nγ(x x ) in the kernel definition. Berg et al. (Berg et al., 1984) showed that every semigroup kernel can be written in the following integral form, which is analogous to Eq. (9): kθ(x, x ) = σ2 f Rd + e ηT(x+x )p(η; λ)dη + σ2 nγ(x x ) where p( ; λ) is a probability density function which is supported only on Rd +. For the reciprocal semigroup kernel we have p(η; λ) = λe λ η 1 = λe λ Pd k=1 ηk, so p( ; λ) E(1) C,Lwith C = λ, L = λId, which is analytic on Rd +. Thus, we see that semigroup kernels can be represented as kθ(x, x ) = σ2 f Rd + ϕ(x, η)ϕ(x , η) p(η; θ0)dη + σ2 nγ(x x ) , which is almost the same as Eq. (8), except the integration area is Rd + instead of Rd. For semigroup kernels ϕ(x, η) = e ηTx. Gauss-Legendre Features for Gaussian Process Regression The construction of Gauss-Legendre features is quite similar to the integration area is Rd, except that we replace the assumption that X Qd k=1[ Rk/2, Rk/2] with X Qd k=1[0, Rk], the truncated integration area QU with HU := Qd k=1[0, Uk], and the integration nodes and weights are obtained by linearly transforming HU (instead of QU) to [ 1, 1]d with the transformation ηk = Uk χk+1 2 for k = 1, . . . , d. We omit the details of the construction, since they mostly repeat the construction described in Section 5. Now consider the reciprocal semigroup kernel. We assume that the hyperparameters are bounded as follows: Θ = {[λ, σ2 n, σ2 f] : λ λ0, σ2 n σ2 n0, σ2 f σ2 f0} where σ2 n0 > 0 (i.e., we have a ridge term). We set 2λ1 d 0 σ2 f0n2 In addition, since p( ; λ) is analytic on Rd +, and in particular it is analytically continuable to the polyellipse EU,β with βk = 2Uk, ρk = 1 + 2. Now we bound the analytic continuation of p( ; λ0) (which we also denote by p( ; λ0)) on the polyellipse as follows. For any x Rd, y Rd, |p(x + iy; λ0)| = λ0e λ0 Pd k=1 xk+iyk = |p(x; λ0)| , so we need to bound |p(x; λ0)| for xk 2 2 , Uk 1+ 2 2 ], yk [0, Uk]: |p(x; λ0)| = λ0e λ0 Pd k=1 xk λ0eλ0 Pd k=1 Uk Hence, CU,β := λ0eλ0 2 U 1. For semigroup kernels we use z : Rd + Rn defined by z(η)j = e ηTxj as the feature map. It can be seen that for every v Rn the function |z( ) v|2 is an analytic function on Rd +, and we also have |z(η)j| 1 = MR for η 0 and for all 1 j n. In particular, it is analytically continuable to the pollyellipse EU,β with βk = 2Uk, ρk = 1 + 2 . Now we bound |z(η)j|2, where here z denotes the analytic continuation. Notice that for any x Rd, y Rd, |z(x + iy)j| = |z(x)j| , so we need to bound |z(x)j| for each xk [Uk 1 2 2 , Uk 1+ 2 2 ], yk [0, Uk]: |z(x)j| = e x Txj = e Pd k=1 xk(xj)k e Pd k=1(Uk and each xj satisfies (xj)k Rk. Hence, |z(η)j| e 2 UTR =: MU,β. So we set 1 d ln 22d+2λ0σ 2 n σ2 fn2 + ln 2 1 2d U 1 + d UTR + ln ln 2λ1 d 0 σ2 f0n2 Since U 1 = O(ln n) and assuming the bounding box R is fixed, then sk = O(ln n) suffice and s = Qd k=1 sk = O((ln n)d) suffices. It can be seen that in most examples, the dependence of the number of features on d is exponential. 7. Numerical Experiments In this section, we report experiments evaluating the performance of our proposed quadrature based approach. Our goal is to show that indeed if U and s are set to be large enough, our method yields results that are essentially indistinguishable from using the exact kernel while offering faster hyperparameter learning, training and prediction. Clearly, from the theoretical results, our Fink Shustin and Avron method predominately applies to low-dimensional datasets (for example, such datasets are prevalent in spatial statistics), so we experiment with one dimensional and two dimensional datasets. We experiment both with the Gaussian kernel and the Matèrn kernel. In the graphs, we label our method as GLF-GPR (standing for Gauss-Legendre Features Gaussian Process Regression). We use the following methods as a benchmark: exact GPR (labeled in the graphs as Exact-GPR), GPR based on random Fourier features (labeled RFF-GPR), and GPR based on modified random Fourier features (labeled MRF-GPR). As a performance metric, we use the MSE error on a test set (as a function of number of features) and the time to learn the hyperparameters. Training and prediction time of both GLF-GPR and RFF-GPR are essentially the same for the same number of features, and both are faster than Exact-GPR if the number of features is smaller than the training set size. Thus, when it comes to training and prediction time, it is sufficient to explore the test error as a function of the number of features. However, hyperparameter learning time can vary considerably between GLF-GPR and RFF-GPR, so we compare this quantity directly. MRF-GPR behaves similarly to RFF-GPR and is shown for comparison in the synthetic data experiments. The various methods were implemented in MATLAB. Optimizing the hyperparameters was conducted using the MATLAB function fmincon after transforming the hyperparameters to a logarithmic scale. For each problem, we defined a hyperparameter domain, e.g. Θ = {[ℓ, σ2 n, σ2 f] : ℓ0 ℓ ℓ1, σ2 n0 σ2 n σ2 n1, σ2 f1 σ2 f σ2 f0} and we take the initial hyperparameters for the optimization to be [ℓ0, σ2 f0, σ2 n0]. Running times were measured on a machine with two 3.2GHz Intel(R) Xeon(R) Gold 6134 CPUs, each having 8 cores, and 256GB RAM. 7.1 Synthetic Data In this subsection, we report experiments on synthetically generated data. The data is generated by noisily sampling a predetermined function, i.e. samples are generated from the formula yi = f (xi) + τi where f is the true function and {τi} are i.i.d noise terms, distributed as normal variables with variance σ2 τ = 0.52 (for 1D) or σ2 τ = 0.32I2 (for 2D). In these experiments we use the isotropic Gaussian kernel. First, we consider a one dimensional function: f 1 (x) = sin(2x) + sin(6ex) (23) The function was sampled equidistantly on [ 1, 1] with n = 800 samples. The results are reported in Figure 1, where we show how GLF-GPR with the number of quadrature points s compared to RFF-GPR. In the top-right graph, we see that the log-likelihood of GLF-GPR merges with the log-likelihood of Exact-GPR for each of the hyperparameters, where the optimal hyperparameters are dashed in blue. RFF-GPR and MRF-GPR deviate considerably. This graph exemplifies that GLF-GPR can yield a good approximation to the exact log-likelihoods, while RFF-GPR and MRF-GPR yield poor approximations. We also see that GLF-GPR optimizes hyperparameters that are much closer to the exact values than RFF-GPR and MRF-FPR. The bottom-left plot shows the MSE error on the same test points. We see that the GLF-GPR error stabilizes on the Exact-GPR error even before the theoretical value of s. The bottom-right graph shows the runtime of the hyperparameter learning phase for different values of quadrature points s. GLF-GPR is clearly more efficient than Exact GPR and mostly more efficient than RFF-GPR and MRF-GPR. As expected, as s becomes larger, GLF-GPR learns the hyperparameters much faster than Exact-GPR and RFF-GPR. Furthermore, GLF-GPR achieves a low error rate with fewer features than RFF-GPR and MRF-GPR, and thus Gauss-Legendre Features for Gaussian Process Regression -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 -2 2 Test Data and Predictions True Function RFF-GPR MRF-GPR GLF-GPR Exact-GPR 0.2 0.3 0.4 0.5 0.6 0.7 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 RFF-GPR MRF-GPR GLF-GPR Exact-GPR 30 40 50 60 70 80 90 100 110 120 130 0.016 0.018 0.02 0.022 0.024 Test Error vs. s RFF-GPR MRF-GPR GLF-GPR Exact-GPR Theoretical s 20 40 60 80 100 120 140 Hyperparameter Learning Time Hyperparameter Learning Time vs. s RFF-GPR MRF-GPR GLF-GPR Exact-GPR Theoretical s Figure 1: Results for data generated using the function f 1 . Top-left: true function, input data, and prediction. Top-right: log-likelihood for various sections of the hyperparameter range. Bottom-Left: Test error as a function of the number of features. Bottom-Right: hyperparameter learning time. is able to do training, prediction, and hyperparameter learning much faster than RFF-GPR and MRF-GPR. Next, we consider a two dimensional function: f 2 (x1, x2) = (sin(x1) + sin(10ex1))(sin(x2) + sin(10ex2)) . (24) The function was sampled on an uniform grid on [ 1, 1] [ 1, 1] with n = 4096 samples. We consider ℓ1 = ℓ2 so U1 = U2 and s1 = s2, i.e., s = s2 1. The results are reported in Figure 2. Similar to the synthetic 1D experiment, in the top-right plot we see that GLF-GPR yields a good approximation to the exact log-likelihood. In the bottom-left plot we see that shows the GLF-GPR error stabilizes on error of Exact-GPR at a much smaller number of quadrature points than the theoretical s. 7.2 Natural Sound Modeling Next we consider the natural sound benchmark used in (Wilson and Nickisch, 2015) (without hyperparameter learning) and (Dong et al., 2017) (with hyperparameter learning). The data is shown in the top-left graph of Figure 3. The goal is to recover contiguous missing regions in a waveform with n = 59309 training points. The test consists of 691 samples. The Gaussian kernel is used for learning. Results are reported in Figure 3. In the bottom-left graph, we plot the test error as a function of the number of features. Initially, GLF-GPR produces poor results, but when s is large enough, the results are similar to the Exact-GPR (see also the top-right plot). We see that even when the number of quadrature points is smaller than the theoretical value required for spectral equivalence, GLF-GPR s error stabilizes on the Exact-GPR error. In contrast, RFF-GPR s error oscillates above Exact-GPR s error. In the bottom-right graph we see that the runtime of the hyperparameters learning phase is significantly smaller for GLF-GPR. Fink Shustin and Avron True Function -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 Exact-GPR Learned Function -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 RFF-GPR Learned Function -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 GLF-GPR Learned Function -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 MRF-GPR Learned Function -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 0.1 0.12 0.14 0.16 0.18 0.2 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 0.2 0.4 0.6 0.8 1 1.2 1.4 RFF-GPR MRF-GPR GLF-GPR Exact-GPR 600 800 1000 1200 1400 1600 1800 2000 Test Error vs. s RFF-GPR MRF-GPR GLF-GPR Exact-GPR Theoretical s 600 800 1000 1200 1400 1600 1800 2000 Hyperparameter Learning Time Hyperparameter Learning Time vs. s RFF-GPR MRF-GPR GLF-GPR Exact-GPR Theoretical s Figure 2: Results for data generated using the function f 2 . Top-left:true function, input data, and prediction. Top-right: log-likelihood for various sections of the hyperparameter range. Bottom-Left: Test error as a function of the number of features. Bottom-Right: hyperparameter learning time. 0 1 2 3 4 5 6 Time 104 1 2 3 4 5 Time 104 Test Data and Predictions Test Data RFF-GPR GLF-GPR Exact-GPR 0.5 1 1.5 2 Test Error vs. s RFF-GPR GLF-GPR Exact-GPR Theoretical s 0.5 1 1.5 2 Hyperparameter Learning Time Hyperparameter Learning Time vs. s RFF-GPR GLF-GPR Exact-GPR Figure 3: Results for the natural sound data. Top-left: full dataset (training and test). Top-right: test data and predictions, for the best smallest s we tested. Bottom-left: MSE of the test data for various sizes of s. The dashed vertical line is the value of the theoretical minimum s needed for spectral equivalence. Bottom-right: running time for various sizes of s. 7.3 Google Daily High Stock Price We consider a time series data of the daily high stock price of Google spanning 3797 days from 19th August 2004 to 19th September 2019. We set the data as x {1, . . . , 3797} and y = log(Stockhigh). Gauss-Legendre Features for Gaussian Process Regression The test is of size of 12% of the data, i.e., consists of 502 days. We use the Matèrn kernel with ν = 5/2. We note that the theoretical number of quadrature features s required for spectral equivalence is bigger than the number of training points. Possible reasons are: the hyperparameter σ2 n in these dataset is very small and that increases our bound in Eq. (21). This increases U which increases s. In addition, the weight function of the Matèrn kernel has a singularity point which leads to the ellipse parameter being pretty small. However, in practice, we see that the approximation convergences around the value s = 1550. 500 1000 1500 2000 2500 3000 3500 Days from 19/08/2004 Google Daily High Stocks 19/08/2004 - 19/09/2019 Test Data RFF-GPR GLF-GPR Exact-GPR 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 Days from 19/08/2004 Zoomed Google Daily High Stocks 19/08/2004 - 19/09/2019 600 800 1000 1200 1400 1600 1800 2000 2200 Test Error vs. s RFF-GPR GLF-GPR Exact-GPR 600 800 1000 1200 1400 1600 1800 2000 2200 100 Hyperparameter Learning Time Hyperparameter Learning Time vs. s RFF-GPR GLF-GPR Exact-GPR Figure 4: Results for the google stocks data. Top-left: test data and predictions. Top-right: zoomed test data and predictions for the best smallest s we tested. Bottom-left: MSE of the test data for various sizes of s. Bottom-right: running time for various sizes of s. Results are reported in Figure 4. From the top-right and top-left plots, we see that RFFGPR is producing a poor approximation while GLF-GPR approximation merges with Exact-GPR approximation. Also, from the bottom-left plot, we see that initially, GLF-GPR produces poor errors, but when s is large enough, the error stabilizes on the Exact-GPR error (see also the topright plot). 7.4 Spatial Temperature Anomaly for East Africa in 2016 Similar to (Ton et al., 2018), we consider MOD11A2 Land Surface Temperature (LST) 8-day composite 2D data of synoptic yearly mean for 2016 in the East Africa region. For the training set, we randomly sample 77404 LST locations and set x {(Longitude, Latitude)} and y = {temperature}.We examine the MSE errors on the remaining 6005 locations but use all 83409 data points to draw maps. We also use the anisotropic Matèrn kernel with ν = 1. Again, the theoretical number of quadrature features for spectral equivalence is bigger than the number of training points. However again in practice, we see that the approximation convergences with fewer features. Due to memory and time constraints, we were unable to use Exact-GPR, and RFF-GPR results are presented up to the computer s memory capacity. Results are reported in Figure 5. In the top plot, we see that GLF-GPR approximates the true function well, unlike RFF-GPR. Also, from the bottom-left plot we see that around s = 21025, the GLF-GPR error stabilizes while RFF-GPR error is still suboptimal. Fink Shustin and Avron 0.5 1 1.5 2 2.5 3 Test Error vs. s RFF-GPR GLF-GPR 0.5 1 1.5 2 2.5 3 Hyperparameter Learning Time Hyperparameter Learning Time vs. s RFF-GPR GLF-GPR Figure 5: Results for the east Africa data. Top: all true data, and predictions. Bottom-left: MSE of the test data for various sizes of s. Bottom-right: running time for various sizes of s. 8. Conclusions and Future Work In this paper, we proposed the use of Gauss-Legendre feature for large-scale Gaussian process regression. Our method is very much inspired by Random Fourier Features (Rahimi and Recht, 2008). However, our method replaces Monte-Carlo integration in RFF with a Gauss-Legendre quadrature of a truncated integral representation of the kernel function. With Gauss-Legendre quadrature our method is able to build spectrally equivalent kernel approximation with an number of features that is asymptotically poly-logarithmic in the training size. In contrast, with RFF the number of features for spectral equivalence must be at least linear. Sublinear number of features can also be obtained using a Gaussian quadrature (suggested by Dao et al. (Dao et al., 2017) in the context of kernel learning). However, this is problematic in the context of hyperparameter learning (see Section 4.3). RFF has a similar issue. In contrast, the use of Gauss-Legendre quadrature allows our method to keep the quadrature nodes and weights fixed, leading to simplified structural dependence of the kernel matrix on the hyperparameters which is more amenable to hyperparameter learning. Finally, we demonstrate the utility of our method on several real-world low-dimensional datasets. We mention a few possible directions for future research: Asymptotically, our method requires a number of features that is poly-logarithmic in the training size. Yet, for some moderately sized datasets, our theoretical results required a number of features larger than the number of training points. However, in practice, the number of Gauss-Legendre Features for Gaussian Process Regression features required for high quality results was much smaller than the bound. Closing this gap is an open problem. Our method is able to handle a rich family of kernels (see Eq. (8)) which includes stationary kernels and some non-stationary kernels (see Section 6.3). Extending our method to arbitrary non stationary kernels is an open problem. The number of features needed by our method is exponential in the dimension, i.e. we have not escaped from the curse of dimensionality. Future research directions are: replacing the tensorized multivariate quadrature with sparse grids, and using convolutional kernels. In doing so, we strive to avoid an exponential dependence on d. Acknowledgements This research was supported by BSF grant 2017698. Appendix A. Further Details on Hyperparameters Learning A.1 Derivation of Eq. (14) Recall that, F(θ) = σ2 f(σ2 f Z Z + σ2 n W(θ) 1) 1Z Z . First, from the Woodbury formula we obtain that Tr Kθ(X, X) 1 = Tr σ2 f ZW(θ)Z + σ2 n In 1 = σ 2 n Tr In σ2 f Z σ2 f Z Z + σ2 n W(θ) 1 1 Z = σ 2 n n σ 2 n Tr σ2 f Z σ2 f Z Z + σ2 n W(θ) 1 1 Z = σ 2 n n σ 2 n Tr σ2 f σ2 f Z Z + σ2 n W(θ) 1 1 Z Z = σ 2 n (n Tr (F(θ))) . Now, for the first term in Eq. (14) we have Kθ(X, X) 1 Kθ(X, X) = Tr Kθ(X, X) 1ZW(θ)Z = Tr Kθ(X, X) 1σ 2 f σ2 f ZW(θ)Z + σ2 n In σ2 n In = σ 2 f Tr In σ2 n Kθ(X, X) 1 = σ 2 f n σ 2 f σ2 n Tr Kθ(X, X) 1 = σ 2 f n σ 2 f σ2 n σ 2 n n σ 2 n Tr (F(θ)) = σ 2 f Tr (F(θ)) . For the next term in Eq. (14) we have Kθ(X, X) 1 Kθ(X, X) = Tr Kθ(X, X) 1 = σ 2 n (n Tr (F(θ))) Fink Shustin and Avron Finally, for the third term in Eq. (14) we have Kθ(X, X) 1 Kθ(X, X) = Tr σ2 f ZW(θ)Z + σ2 n In 1 σ2 f Z W(θ) = σ 2 n Tr σ2 f Z W(θ) θi Z σ4 f Z σ2 f Z Z + σ2 n W(θ) 1 1 Z Z W(θ) = σ 2 n σ2 f Tr Z W(θ) θi Z ZF(θ) W(θ) = σ 2 n σ2 f Tr Z W(θ) θi Z σ 2 n σ2 f Tr ZF(θ) W(θ) = σ 2 n σ2 f Tr W(θ) θi Z Z σ 2 n σ2 f Tr W(θ) where in the second equality we use the Woodbury formula. A.2 Efficient Gaussian Process Regression using QR Decomposition Here we present alternative formulas to the ones presented in Section 4.2 and based on QR decomposition instead of the normal equations. Such formulas are likely to be more numerically robust. Let Z = QZRZ be a thin QR decomposition of Z, i.e. QZ Cn s is such that Q ZQZ = In and RZ Cs s is an upper triangular matrix. We suggest to compute the QR decomposition of Z in lieu of computing Z Z, and keeping only RZ and Q Zy so still only O(s2) is needed. There is no asymptotic penalty in terms of arithmetic operation count since the decomposition can be computed in O(ns2) operations. However, Z Z tends to be ill-conditioned due to the squaring of the condition number of Z, so it is best to avoid computing it. Note that the QR decomposition is computed only once, and not per iteration. Let us consider a specific iteration, and for conciseness, we omit θ for the following formulas. Let " RZ σn σf W 1/2 We compute a thin QR decomposition A = QARA of A (O(s3) operations) and write where Q1, Q2 Cs s, i.e., Hence, RZ = Q1RA which implies that Z = QZQ1RA . Therefore, " Z σn σf W 1/2 = QZ 0 0 Is A = QZ 0 0 Is RA = QZQ1 Q2 Gauss-Legendre Features for Gaussian Process Regression which is a QR decomposition of " Z σn σf W 1/2 The crux is that given the QR decomposition of Z, we can compute the QR decomposition of B in O(s3) arithmetic operations instead of O(ns2). We now compute w = W 1 σ2 f Z Z + σ2 n W 1 1 Z y = σ 2 f W 1 " Z σn σf W 1/2 = σ 2 f W 1R 1 A [Q 1Q Z Q 2] y 0s 1 = σ 2 f W 1R 1 A Q 1Q Zy which implies that w can be computed in O(ns) operations (since W is diagonal). Similarly, we also have F(θ) = σ2 f σ2 f Z Z + σ2 n W(θ) 1 1 Z Z = σ2 fσ 2 f R 1 A Q 1Q ZZ = R 1 A Q 1Q ZQZRZ = R 1 A Q 1RZ = R 1 A Q 1Q1RA i.e., F(θ) can be computed in O(s3) operations. This allows us to compute the first two formulas in Eq. (14) in O(s) time. As for the third formula, we have (25). Since W(θ)/ θi is diagonal, we need to compute only the diagonal of Z Z and Z ZF(θ). The diagonal of Z Z is just the square norms of the columns of Z, and can be precomputed in O(ns). Furthermore, in some cases we know analytically the values of this norm. For example, for shift-invariant kernels we use ϕ(x, η) = e ix Tη so the squared norms of the columns of Z is equal to n. As for Z ZF(θ), we have: Z ZF(θ) = R A(Q 1Q1)2RA = R ZQ1Q 1RZ . Note that Q 1RZ has already been computed for F(θ). Since we only need the diagonal of Z ZF(θ), and this is an Hermitian matrix, the diagonal is just the squared norms of the columns of Q 1RZ. Thus, after O(s3) preprocessing, for every θi the first term in Eq. (13) can be computed in O(s) operations . As for the first identity in Eq. (13), we still have α = σ 2 n y σ2 f ZWw and Z α = w. So, the second identity can be computed as previously described using w, which is computed according to Eq. (26). Appendix B. Analysis of Multivariate Chebyshev Approximation in High Dimensions The following theorems are generalizations of similar one-dimensional theorems. All the proofs rely on ideas similar to the ones presented in (Mason, 1980, 1982; Mason and Handscomb, 2002) Fink Shustin and Avron and (Trefethen, 2013, Theorem 3.1, Theorem 8.1). Note that a similar generalization can found in (Trefethen, 2017; Wang and Zhang, 2020). For completeness, we present our own proof, which is based on a different technique. For convenience, denote: Eρ := z Cd : zk + q z2 k 1 < ρk k = 1, . . . , d as the polyellipse, where each Eρk is Bernstein ellipse with foci at 1 and the sum of of major and minor semiaxis lengths of the ellipse is ρk . Also denote Cr := z Cd : |zk| = rk k = 1, . . . , d as the polycircle centered at the origin, and simply denote C1 in the case where r1 = . . . = rd = 1. Finally, denote Ar,R := z Cd : 0 < rk < |zk| < Rk k = 1, . . . , d as the polyannulus centered at the origin. The following proposition appears in (Scheidemann, 2005) as Theorem 1.5.26. See also (Bochner and Martin, 1948, Pages 32, 90-91) for further details. Proposition 14 Let Ar,R be the polyannulus centered at the origin, and let f be an analytic complex function on Ar,R. Also, let rk < sk < Rk, k = 1, . . . , d. Then f has a multivariate Laurent expansion j1,...,jd= bj1...jdzj1 1 zjd d converging uniformly on Ar,R. The coefficients bj1...jd are given by bj1...jd = 1 f (z) zα+1 dz Note that the one dimensional Chebyshev polynomials in the complex plane are defined by Tj(z) = wj + w j z = w + w 1 In addition, for a function f that is analytic in the interior and on the boundary of Eρ in the complex plane, the complex Chebyshev series of f is P j=0 aj Tj(z) where j = 0 : aj = 2 π (ρ2j + ρ 2j) Eρ f(z)Tj(z) dz Eρ f(z)Tj(z) dz The last definition was introduced in (Mason and Handscomb, 2002), and we generalize it to multivariate functions. The multivariate complex tensorized Chebyshev series of a multivariate complex function f that is analytic in the polyellipse Eρ can be defined by j1,...,jd=0 aj1...jd Tj1(z1) Tjd(zd) . Gauss-Legendre Features for Gaussian Process Regression Proposition 15 Let f be a multivariate complex function that is analytic in the polyellipse Eρ, where ρ = (ρ1, . . . , ρd), ρ1, . . . , ρd > 0. Then, the coefficients of its multivariate complex tensorized Chebyshev series X j1,...,jd=0 aj1...jd Tj1(z1) Tjd(zd) (27) are given by aj1...jd = 2d m πd ρ2j1 1 + ρ 2j1 1 ρ2jd d + ρ 2jd d Eρ f (z1, . . . , zd) Tj1(z1) Tjd(zd) where m := #{jk : jk = 0}. We remark that this Chebyshev series converges to f uniformly, as (Mason, 1982, Theorem 9.1) claims. Proof We begin by mapping f(z) on the contour of Eρ into g(w) on Cρ. For k = 1, . . . , d, define zk = wk + w 1 k 2 such that g (w1, . . . , wd) = f (z1, . . . , zd) = f w1+w 1 1 2 , . . . , wd+w 1 d 2 . It follows that g (w1, . . . , wd) = g (wα1 1 , . . . , wαd d ) , α1, . . . , αd { 1, 1} . (28) The equation for each wk has two solutions We choose the solutions wk = zk + p z2 k 1 , so |wk| = ρk > 1 and thus the second solution for each k = 1, . . . , d is essentially w 1 k . These relations imply that g is analytic in the polyannulus between Cρ and Cρ 1. We also have for each k = 1, . . . , d Tjk(zk) = wjk k + w jk k 2 . Therefore, and since f is analytic in Eρ, we have g(w) = f(z) j1,...,jd=0 aj1...jd Tj1(z1) Tjd(zd) j1,...,jd=0 wj1 k + w j1 k wjd d + w jd d . That is, the series given in Eq. (27) can be written as the Laurent series of g. Thus, by Proposition 14, the coefficients are given by Cρ w 1 j1 1 w 1 jd d g (w1, . . . , wd) dw1 dwd . which implies aj1...jd = 1 Cρ w 1 j1 1 w 1 jd d g (w1, . . . , wd) dw1 dwd . (29) Fink Shustin and Avron The last integral can be written also as Cρ w 1 j1 1 w 1 jd d g (w1, . . . , wd) dw1 dwd = Cρ w 1 j1 1 w 1 jd d g (wα1 1 , . . . , wαd d ) dw1 dwd = (30) | w1|=ρα1 1 . . . | wd|=ρ αd d ew 1+α1j1 1 ew 1+αdjd d g ( ew1, . . . , ewd) d ew1 d ewd = Cρ ew 1+α1j1 1 ew 1+αdjd d g ( ew1, . . . , ewd) d ew1 d ewd where the first equality follows from Eq. (28), the second equality is changing of variables from wk to ewk = wαk k , αk { 1, 1}, and the last equality is due to Definition 14 which means the integral also can be considered on Cρ 1. Now we show that Cρ w 1 j1 1 w 1 jd d g (w1, . . . , wd) dw1 dwd = (31) ρ2jk k + ρ 2jk k ρ2jk k w jk k + ρ 2jk k wjk k g (w1, . . . , wd) dw1 dwd by induction on the number of changes of variables. The base case: apply the change of variables only for one of the variables. Without loss of generality, we show it for w1: Cρ w 1 j1 1 w 1 jd d g (w1, . . . , wd) dw1 dwd = ρ2j1 1 + ρ 2j1 1 ρ2j1 1 + ρ 2j1 1 Cρ w j1 1 w jd d g (w1, . . . , wd) dw1 dwd ρ2j1 1 + ρ 2j1 1 Cρ ρ2j1 1 w j1 1 w jd d g (w1, . . . , wd) dw1 dwd ρ2j1 1 + ρ 2j1 1 Cρ ρ 2j1 1 w j1 1 w jd d g (w1, . . . , wd) dw1 dwd ρ2j1 1 + ρ 2j1 1 Cρ ρ2j1 1 w j1 1 w jd d g (w1, . . . , wd) dw1 dwd ρ2j1 1 + ρ 2j1 1 Cρ ρ 2j1 1 wj1 1 w j2 2 w jd d g (w1, . . . , wd) dw1 dwd ρ2j1 1 + ρ 2j1 1 ρ2j1 1 w j1 1 + ρ 2j1 1 wj1 1 w j2 2 w jd d g (w1, . . . , wd) dw1 dwd where in the third equality we use Eq. (30) with α1 = 1. The inductive step: suppose that for 1 < n 1 < d changes of variables, the following holds: Cρ w 1 j1 1 w 1 jd d g (w1, . . . , wd) dw1 dwd = (32) ρ2jk k + ρ 2jk k ρ2jk k w jk k + ρ 2jk k wjk k w jn n w jd d g (w1, . . . , wd) dw1 dwd Gauss-Legendre Features for Gaussian Process Regression Then, we show that this is also true for n changes of variables: Cρ w 1 j1 1 w 1 jd d g (w1, . . . , wd) dw1 dwd = ρ2jn n + ρ 2jn n ρ2jn n + ρ 2jn n Cρ w j1 1 w jd d g (w1, . . . , wd) dw1 dwd ρ2jk k + ρ 2jk k ρ2jn n + ρ 2jn n ρ2jk k w jk k + ρ 2jk k wjk k ! w jn n w jd d g (w1, . . . , wd) dw1 dwd ρ2jk k + ρ 2jk k ρ2jk k w jk k + ρ 2jk k wjk k ! ρ2jn n w jn n w jd d g (w1, . . . , wd) dw1 dwd ρ2jk k + ρ 2jk k ρ2jk k w jk k + ρ 2jk k wjk k ! ρ 2jn n w jn n w jd d g (w1, . . . , wd) dw1 dwd ρ2jk k + ρ 2jk k ρ2jk k w jk k + ρ 2jk k wjk k ! ρ2jn n w jn n w jd d g (w1, . . . , wd) dw1 dwd ρ2jk k + ρ 2jk k ρ2jk k w jk k + ρ 2jk k wjk k ! ρ 2jn n wjn n w jd d g (w1, . . . , wd) dw1 dwd ρ2jk k + ρ 2jk k ρ2jk k w jk k + ρ 2jk k wjk k ! w jn+1 n+1 w jd d g (w1, . . . , wd) dw1 dwd where in the second equality we use Eq. (32), and in the fourth equality we use Eq. (30) with αn = 1. Therefore, by induction, for n = d we obtain Eq. (31), and by Eq. (29): aj1...jd = 1 ρ2jk k + ρ 2jk k ρ2jk k w jk k + ρ 2jk k wjk k g (w1, . . . , wd) dw1 dwd Now, for each k = 1, . . . , d we have dzk p Tjk(zk) = wkjk + wk jk 2 = ρ2jk k w jk k + ρ 2jk k wjk k 2 (34) where wk = ρkeiθk. Therefore, replacing g (w1, . . . , wd) by f (z1, . . . , zd) and recall that each wk on Cρk maps zk on Eρk, we obtain aj1...jd = 2d m πd ρ2j1 1 + ρ 2j1 1 ρ2jd d + ρ 2jd d Eρ f (z1, . . . , zd) Tj1(z1) Tjd(zd) We proceed to the main theorem: Theorem 16 Let f(x1, . . . , xd) be an analytic function in [ 1, 1]d and analytically continuable to the polyellipse Eρ where it satisfies |f(x1, . . . , xd)| M for some M > 0. Let Tj(x) := cos(j cos 1(x)) be the j degree one dimensional Chebyshev polynomial, and Eρ1, . . . , Eρd are the open Bernstein ellipses with major and minor semiaxis lengths correspondingly summing to ρ1, . . . , ρd > 1. Then: Fink Shustin and Avron 1. The multivariate (real) Chebyshev coefficients of f are given by aj1...jd := 2d m [ 1,1]d f(x1, . . . , xd)Tj1(x1) Tjd(xd) p 1 x2 d dx1 dxd where m := #{jk : jk = 0}. 2. The coefficients satisfy |aj1...jd| 2d m M ρj1 1 ρjd d . Some versions of this theorem appear in (Bochner and Martin, 1948, Pages 32, 94-95) and (Trefethen, 2017), however without an explicit bound. Proof As in the proof of Proposition 15, consider the analytic continuation f(z) on the contour of Eρ which we map into g(w) on Cρ, by defining for k = 1, . . . , d zk = wk + w 1 k 2 . Then, we saw that aj1...jd = 2d m πd ρ2j1 1 + ρ 2j1 1 ρ2jd d + ρ 2jd d Eρ f (z1, . . . , zd) Tj1(z1) Tjd(zd) In particular, since f(z1, . . . , zd) is a continuation of f(x1, . . . , xd) to the complex plane, replacing each zk with xk = Re(zk) for k = 1, . . . d gives aj1...jd = 2d m [ 1,1]d f (x1, . . . , xd) Tj1(x1) Tjd(xd) p 1 x2 d dx1 dxd . This completes the first part of the proof. For the second part of the proof, we use the bound on f representation in Eq. (29) for the coefficients to obtain: |aj1...jd| = 2 m Cρ w j1 1 1 w jd 1 d g (w1, . . . , wd) dw1 dwd Cρ |w1| j1 1 |wd| jd 1 dw1 dwd πdρj1+1 1 ρjd+1 d πdρj1+1 1 ρjd+1 d 2dπdρ1 ρd ρj1 1 ρjd d . Gauss-Legendre Features for Gaussian Process Regression M. Abramowitz and I. A. Stegun. Handbook of mathematical functions with formulas, graphs, and mathematical tables, volume 55. US Government printing office, 1972. H. Avron and V. Sindhwani. High-performance kernel machines with implicit distributed optimization and randomization. Technometrics, 58(3):341 349, 2016. doi: 10.1080/00401706. 2015.1111261. URL https://doi.org/10.1080/00401706.2015.1111261. H. Avron, V. Sindhwani, J. Yang, and M. W. Mahoney. Quasi-Monte Carlo feature maps for shiftinvariant kernels. Journal of Machine Learning Research, 17(1):4096 4133, 2016. H. Avron, M. Kapralov, C. Musco, C. Musco, A. Velingker, and A. Zandieh. Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 253 262. JMLR. org, 2017. F. Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(1):714 751, 2017. S. Banerjee, A. E. Gelfand, A. O. Finley, and H. Sang. Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70 (4):825 848, 2008. doi: 10.1111/j.1467-9868.2008.00663.x. URL https://rss.onlinelibrary. wiley.com/doi/abs/10.1111/j.1467-9868.2008.00663.x. C. Berg, Christensen, and P. Jens Peter Reus amd Ressel. Harmonic analysis on semigroups: theory of positive definite and related functions, volume 100. Springer, 1984. L. Blumenson. A derivation of n-dimensional spherical coordinates. The American Mathematical Monthly, 67(1):63 66, 1960. S. Bochner and W. Martin. Several complex variables. Princeton Univ Press, 1948. K. Choromanski, M. Rowland, T. Sarlos, V. Sindhwani, R. Turner, and A. Weller. The geometry of random features. In A. Storkey and F. Perez-Cruz, editors, Proceedings of the Twenty First International Conference on Artificial Intelligence and Statistics (AISTATS), volume 84 of Proceedings of Machine Learning Research, pages 1 9. PMLR, 09 11 Apr 2018. J. W. Craig. A new, simple and exact result for calculating the probability of error for twodimensional signal constellations. In Proc. IEEE Milcom, volume 91, pages 571 575, 1991. N. Cressie and G. Johannesson. Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):209 226, 2008. T. Dao, C. M. De Sa, and C. Ré. Gaussian quadrature for kernel features. In Advances in Neural Information Processing Systems (NIPS), pages 6107 6117, 2017. K. Dong, D. Eriksson, H. Nickisch, D. Bindel, and A. G. Wilson. Scalable log determinants for Gaussian process kernel learning. In Advances in Neural Information Processing Systems (NIPS), pages 6327 6337, 2017. J. Eidsvik, A. O. Finley, S. Banerjee, and H. Rue. Approximate Bayesian inference for large spatial datasets using predictive process models. Computational Statistics & Data Analysis, 56(6):1362 1380, 2012. A. O. Finley, H. Sang, S. Banerjee, and A. E. Gelfand. Improving the performance of predictive process modeling for large datasets. Computational statistics & data analysis, 53(8):2873 2884, 2009. Fink Shustin and Avron N. Hale and A. Townsend. Fast and accurate computation of gauss legendre and gauss jacobi quadrature nodes and weights. SIAM Journal on Scientific Computing, 35(2):A652 A674, 2013. P. Huang, H. Avron, T. N. Sainath, V. Sindhwani, and B. Ramabhadran. Kernel methods match deep neural networks on TIMIT. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 205 209, 2014. doi: 10.1109/ICASSP.2014.6853587. M. Katzfuss and N. Cressie. Bayesian hierarchical spatio-temporal smoothing for very large datasets. Environmetrics, 23(1):94 107, 2012. Z. Li, J.-F. Ton, D. Oglic, and D. Sejdinovic. Towards a unified analysis of random fourier features. In International Conference on Machine Learning, pages 3905 3914. PMLR, 2019. J. Mason. Near-best multivariate approximation by Fourier series, Chebyshev series and Chebyshev interpolation. Journal of Approximation Theory, 28(4):349 358, 1980. J. C. Mason. Minimal projections and near-best approximations by multivariate polynomial expansion and interpolation. In Multivariate Approximation Theory II, pages 241 254. Springer, 1982. J. C. Mason and D. C. Handscomb. Chebyshev polynomials. Chapman and Hall/CRC, 2002. M. Munkhoeva, Y. Kapushev, E. Burnaev, and I. Oseledets. Quadrature-based features for kernel approximation. In Advances in Neural Information Processing Systems (Neur IPS), pages 9147 9156, 2018. F. W. Olver, D. W. Lozier, R. F. Boisvert, and C. W. Clark. NIST handbook of mathematical functions hardback and CD-ROM. Cambridge university press, 2010. J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939 1959, 2005. A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), pages 1177 1184, 2008. H. Sang and J. Z. Huang. A full scale approximation of covariance functions for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(1):111 132, 2012. doi: 10.1111/j.1467-9868.2011.01007.x. URL https://rss.onlinelibrary.wiley.com/ doi/abs/10.1111/j.1467-9868.2011.01007.x. H. Sang, M. Jun, and J. Z. Huang. Covariance approximation for large multivariate spatial data sets with an application to multiple climate model errors. The Annals of Applied Statistics, pages 2519 2548, 2011. V. Scheidemann. Introduction to complex analysis in several variables. Springer, 2005. J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (NIPS), pages 2951 2959, 2012. B. Sriperumbudur and Z. Szabó. Optimal rates for random Fourier features. Advances in Neural Information Processing Systems (NIPS), 28:1144 1152, 2015. M. L. Stein. Interpolation of spatial data: Some Theory for Kriging. Springer Series in Statistics. Springer-Verlag, New York, 1999. ISBN 0-387-98629-4. doi: 10.1007/978-1-4612-1494-6. URL http://dx.doi.org/10.1007/978-1-4612-1494-6. Some theory for Kriging. Gauss-Legendre Features for Gaussian Process Regression M. L. Stein. Limitations on low rank approximations for covariance matrices of spatial data. Spatial Statistics, 8:1 19, 2014. M. L. Stein et al. Spatial variation of total column ozone on a global scale. The Annals of Applied Statistics, 1(1):191 210, 2007. D. J. Sutherland and J. Schneider. On the error of random Fourier features. pages 862 871, 2015. J.-F. Ton, S. Flaxman, D. Sejdinovic, and S. Bhatt. Spatial mapping with Gaussian processes and nonstationary Fourier features. Spatial Statistics, 28:59 78, 2018. A. Townsend and L. N. Trefethen. An extension of chebfun to two dimensions. SIAM Journal on Scientific Computing, 35(6):C495 C518, 2013. doi: 10.1137/130908002. URL https://doi.org/ 10.1137/130908002. L. Trefethen. Multivariate polynomial approximation in the hypercube. Proceedings of the American Mathematical Society, 145(11):4837 4844, 2017. L. N. Trefethen. Is gauss quadrature better than clenshaw curtis? SIAM review, 50(1):67 87, 2008. L. N. Trefethen. Approximation theory and approximation practice, volume 128. Siam, 2013. S. Ubaru, J. Chen, and Y. Saad. Fast estimation of tr(f(a)) via stochastic Lanczos quadrature. SIAM Journal on Matrix Analysis and Applications, 38(4):1075 1099, 2017. H. Wang and L. Zhang. Analysis of multivariate Gegenbauer approximation in the hypercube. Adv Comput Math, 46:53, 2020. C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In T. Leen, T. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems (NIPS), volume 13, pages 682 688. MIT Press, 2001. URL https://proceedings.neurips.cc/paper/ 2000/file/19de10adbaa1b2ee13f77f679fa1483a-Paper.pdf. C. K. Williams and C. E. Rasmussen. Gaussian processes for machine learning, volume 2. MIT Press Cambridge, MA, 2006. A. Wilson and H. Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISSGP). In International Conference on Machine Learning (ICML), pages 1775 1784, 2015. J. Yang, V. Sindhwani, Q. Fan, H. Avron, and M. Mahoney. Random Laplace feature maps for semigroup kernels on histograms. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 14, pages 971 978, Washington, DC, USA, 2014. IEEE Computer Society. ISBN 978-1-4799-5118-5. doi: 10.1109/CVPR.2014.129. URL https://doi. org/10.1109/CVPR.2014.129. T. Yang, Y.-F. Li, M. Mahdavi, R. Jin, and Z.-H. Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In Advances in Neural Information Processing Systems (NIPS), pages 476 484, 2012.