# conjugate_gradients_for_kernel_machines__893b2980.pdf Journal of Machine Learning Research 21 (2020) 1-42 Submitted 7/18; Revised 10/19; Published 2/20 Conjugate Gradients for Kernel Machines Simon Bartels sbartels@tue.mpg.de Philipp Hennig ph@tue.mpg.de Max Planck Institute for Intelligent Systems and University of T ubingen Maria-von-Linden-Str. 6, T ubingen, GERMANY Editor: Mohammad Emtiyaz Khan Regularized least-squares (kernel-ridge / Gaussian process) regression is a fundamental algorithm of statistics and machine learning. Because generic algorithms for the exact solution have cubic complexity in the number of datapoints, large datasets require to resort to approximations. In this work, the computation of the least-squares prediction is itself treated as a probabilistic inference problem. We propose a structured Gaussian regression model on the kernel function that uses projections of the kernel matrix to obtain a low-rank approximation of the kernel and the matrix. A central result is an enhanced way to use the method of conjugate gradients for the specific setting of least-squares regression as encountered in machine learning. Keywords: Gaussian processes, kernel methods, low-rank approximation, conjugate gradients, probabilistic numerics 1. Introduction Regularized least-squares is one of the fundamental algorithms in statistics and machine learning. Due to its importance it is known under a variety of names such as kernel ridge regression (Hoerl and Kennard, 1970), spline regression (e.g. Wahba (1990)), Kriging (e.g. Matheron (1973)) and Gaussian process (GP) regression (e.g. Rasmussen and Williams (2006)). The common principle is the estimation of a regression function from a reproducing kernel Hilbert space (RKHS) f : X R over some domain X that minimizes the regularized loss (Rasmussen and Williams, 2006, Eq. (6.19)) 2 f 2 k + 1 2σ2 i=1 (yi f(xi))2, where (xi, yi) X R, i = 1, . . . , N are observations, σ2 R+ is a regularization parameter, k is the corresponding kernel and k is the RKHS norm of f. The minimizer of this loss has a closed-form solution that coincides with the posterior mean of the Gaussian process p(f | X, y) = GP(f; f, c) under a zero-mean prior p(f) = GP(f; 0, k) and likelihood p(y | f(X)) = N(y; f(X), σ2I) (Kimeldorf and Wahba, 1970; Wahba, 1990; Rasmussen and Williams, 2006): f(x ) = k (K + σ2I) 1y, (1) c(x , x ) = k(x , x ) k (K + σ2I) 1k (2) c 2020 Simon Bartels and Philipp Hennig. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/18-473.html. Bartels and Hennig 3 2.5 2 1.5 1 0.5 0 0.5 1 1.5 2 2.5 3 2 exact CG KMCG Figure 1: Our algorithm KMCG in comparison to CG on a toy setup. The dataset consists of one hundred data-points where the targets are a draw from a zero-mean Gaussian process with squared exponential kernel (Eq. (18) with Λ = 0.25 and θf = 2). The thin, black line is the posterior mean of that Gaussian process (Eq. 1). The light-green line is the mean prediction produced by conjugate gradients after P = 7 steps and the dark-red line is the mean prediction of KMCG (where the number of inducing inputs M = N). where Kij = k(xi, xj), and k ,i = k(x , xi). For datasets up to about N 5 104 observations, the standard approach to solve Equations (1) and (2) is to compute a Cholesky decomposition (Benoit, 1924) of K + σ2I at a cubic cost O(N3). For larger datasets, a number of approximate algorithms have been proposed that yield an approximation ˆf to f in linear time (Zhu et al., 1998; Csat o and Opper, 2002; Snelson and Ghahramani, 2007; Walder et al., 2008; Rahimi and Recht, 2009; Titsias, 2009b; L azaro-Gredilla et al., 2010; Yan and Qi, 2010; Le et al., 2013; Solin and S arkk a, 2014; Wilson and Nickisch, 2015; Hensman et al., 2018). Comparative empirical studies like those of Chalupka et al. (2013) or Qui nonero-Candela and Rasmussen (2005) indicate that some of these methods can provide good approximations in a reasonable amount of time, although there is no conclusive best practice among these choices. Not included in the list above are iterative linear solvers, such as the method of conjugate gradients (CG) (Hestenes and Stiefel, 1952). These algorithms construct an approximate solution to systems of linear equations Ax = b using repeated matrix-vector multiplications (MVMs). In general, each MVM with K has quadratic costs O(N2) which is one reason why the machine learning community prefers the methods above. Furthermore, a linear solver needs to run again for new test inputs when computing the posterior uncertainty (Eq. 2) and Gaussian process regression often requires the evaluation of the log marginal likelihood: ln p(y) = 1 2y (K + σ2I) 1y + 1 2 ln |2π(K + σ2I)| 1. (3) Conjugate gradients can be used to estimate |K| (Filippone and Engler, 2015), yet also requiring several runs. Below, we present a way of using CG specifically tailored to Equations (1) to (3) which we dub kernel machine conjugate gradients (KMCG). Our approach follows the notion of probabilistic numerics (PN) (Hennig et al., 2015) which phrases approximation as inference. A common idea of PN formulations is to replace a deterministic yet intractable operation by Bayesian inference where, by design, prior and likelihood admit analytic estimation of the intractable solution. In our case, the intractable operation is the inversion of very large Conjugate Gradients for Kernel Machines matrices (i.e. of size N N such that N3 is intractable), and the design criterion for the prior is that the posterior mean over the matrix has to admit efficient inversion, which we achieve through the matrix inversion lemma. Instead of providing an approximation solely to the vector (K + σ2I) 1y, our approach uses the MVMs performed by CG to learn an approximation directly to the function k. The following section proposes a model-template that can be used to learn low-rank approximations to kernel functions. The subsequent section shows how conjugate gradients can be applied into that template. A discussion on how our approach relates to existing work is presented thereafter, in Section 3.4. To approximate Equations (1) to (3), we will approximate the kernel and, to this end, present a probabilistic estimation rule for k. The idea is to treat the kernel as unknown and to choose prior and likelihood such that the posterior mean k M is efficient to evaluate and yields a kernel of finite rank. Substituting for this finite-rank kernel in Equations (1) to (3) then allows to compute these expressions faster. The following sections describe finite-rank kernel, our prior, possible likelihoods and resulting posteriors. Fig. 2 shows a schematic summary of this section. 2.1. Finite-rank Kernel An M-rank approximation to a kernel is a factorization of the form k(x, z) φ(x) Σ 1φ(z) where φ(x) : X CM, φ denotes the conjugate transpose, and Σ is an M M Hermitian and positive definite matrix. Given such an expansion one can use the matrix-inversion, and matrix-determinant lemmata to approximate Equations (1) to (3) with the expressions below f(x ) φ(x ) ΦΦ + σ2Σ 1 Φy (4) c(x , z ) σ2φ(x ) ΦΦ + σ2Σ 1 φ(z ) (5) 2y Φ ΦΦ + σ2Σ 1 Φy 1 σ2 ΦΦ + Σ N 2 ln(2πσ2) (6) where φ(x )j = φj(x ) and Φij = φi(Xj). Typically M N and therefore the computational costs to evaluate Equations (4) to (6) reduce from O(N3) to O(NM2), i.e. linear in N. The dominant factor is the matrix-matrix product ΦΦ . An example for a finite-rank kernel that will become important later, is the Subset of Regressors (So R) approximation (Qui nonero-Candela and Rasmussen, 2005) k So R(x, z) = k(x, XU)k(XU, XU) 1k(XU, z) (7) where XU is a set of M so called inducing inputs. The method proposed in this work (KMCG) is related to So R. Readers familiar with So R will be aware of the associated flaws, and methods to remedy them (Qui nonero-Candela and Rasmussen, 2005; Titsias, 2009b). Figure 2: Schematic summary of our proposed kernel approximation method. ground-truth in R2 (a) The kernel k, here a squared exponential (Eq. 18) is assumed to be an unknown function. prior in R2 k0 ψ |k k0| ψ 1 sample (b) Section 2.2 describes a Kronecker-structured Gaussian process prior over the kernel. Above pictures show from left-to-right: prior mean (zero), prior standard deviation, the absolute error divided by the standard deviation minus one and a sample from this prior. likelihood on N2 (c) Observations of k stem from matrix-vector multiplications with the kernel matrix K (Section 2.3), sketched using random columns of the identity matrix. posterior in R2 k M ψM |k k M| ψM 1 sample (d) The posterior is again Gaussian (Section 2.4) and similar to Figure 2a the pictures show from left-to-right: mean, standard deviation, relative error and a sample. By design, the posterior mean k M is an approximation of finite rank which allows to efficiently solve the original least-squares problem (Section 2.1). Conjugate Gradients for Kernel Machines The Deterministic Training Conditional (DTC) approximation alleviates this issue by using the exact kernel for the prior uncertainty over the test inputs (Qui nonero-Candela and Rasmussen, 2005). In effect this is a substitution of Eq. (5) for Eq. (8) below. c(x , z ) k(x , x ) φ(x ) ΦΦ + σ2I 1 φ(z ) (8) We will apply the same substitution for our method KMCG. Another approach to this problem is taken by the FITC method (Qui nonero-Candela and Rasmussen, 2005). FITC can be obtained from So R by substituting the approximate diagonal elements k So R(x, x) for their exact counterpart k(x, x). For our method, we found that this correction slightly worsens performance. Consider a Gaussian process prior over bivariate functions k GP(k0, γψ) (9) where ψ : X2 X2 R is a covariance function over kernels and γ R+ is a scaling parameter. Since the posterior mean is meant to be a substitution for the exact kernel, this is an exchange of one least-squares problem for another. Without further assumptions, calculating the posterior over k is more expensive than computing the equations of interest (Equations 1 to 3). Efficient inference is rendered possible by imposing the following structure on ψ ψ(k(a, b), k(c, d)) := 1 2w(a, c)w(b, d) + 1 2w(a, d)w(b, c) (10) for a, b, c, d X and where w is a covariance function on the domain X. Consider the first addend. It states that the similarity between k(a, b) and k(c, d) depends on the similarity of a and c, and b and d a natural assumption for kernel matrices. The second addend is a symmetrization of the first. Observe that each addend is a product kernel of two pairs of inputs and recall that a product kernel produces Kronecker product matrices. The sum of the two products leads to covariance matrices that have a symmetric Kronecker product form, i.e. A, B RN N : ψ(A, B) = A B RN2 N2 (see Appendix A). This will allow a sufficiently efficient evaluation of the posterior. Fig. 2 visualizes the variance and shows samples from this prior for the toy setup from Fig. 1. This choice of prior offers a trade-offbetween efficient tractable inference and the desire to encode as much prior structural information about the kernel as possible. One desirable property to encode is symmetry, and indeed, matrix-valued functions sampled from this prior distribution are symmetric (c.f. Fig. 2 for examples, Appendix A.1 for formal proof). Kernel functions are also positive definite. Unfortunately, since the positive definite cone is not a linear sub-space of the vector-space of real matrices, this property can not be encoded in a Gaussian prior, in closed form.1 However, it is possible to guarantee positive-definiteness of the posterior mean point estimate through the specific choice of prior parameters k0 = 0 (proof in Proposition 11, p. 27). For this reason, we adopt this choice for the remainder. 1. e.g. Hennig (2015) discusses this problem and possible solutions. Bartels and Hennig There are other properties of certain kernels that would be desirable to encode, but which are not feasible within the chosen framework without sacrificing fast computability. For example, stationarity of the kernel can not be represented by a prior with Kronecker structure in the covariance since a and b (and symmetrically c and d) do not appear together as arguments to w. The question remains how to choose w. Recall that w should reflect the similarity between k(a, b) and k(c, d) which depends on the similarity of a and c, and b and d. To measure the relationship between inputs is exactly the purpose of the kernel k and we therefore set w := k for the remainder. Even if k fails to capture similarity between inputs, as choice for w it still captures the similarity between the kernel values. Furthermore, samples from the approximate kernel will be a function of w and lastly, this choice is convenient computationally as expressions simplify. 2.3. Likelihood Having specified a prior over k, we will now be concerned with how to obtain observations. Iterative solvers like conjugate gradients proceed by collecting a sequence of linear projections of the (kernel) matrix to be inverted, in the form of matrix-vector products. In fact, this general structure also describes the setting of non-adaptive approaches like inducing point methods, which can be interpreted as collecting multiplications of the kernel matrix with a set of pre-specified and sparse vectors (namely the unit selection vectors exui). We can use these matrix-vector products for learning a low-rank version of the kernel by introducing the linear operator T p :(X X)R RP 2, k 7 vec ZZ k(x, z)pi(x)pj(z) dx dz where i, j = 1...P, p = [p1, ..., p P ] are densities or distributions and vec (A) is a column vector created by stacking the rows of A. Example 1 (Matrix-vector multiplication) Define T p with j=1 sijδ(x xuj). (12) Then the evaluation of T pk reduces to a matrix vector product, that is mat (T pk) = S k(XU, XU)S where Sij = sij, XU = [xu1, ..., xu M] and mat ( ) transforms a P 2 vector into a P P matrix, s.t. mat (vec (A)) = A. The xuj can be datapoints or arbitrary elements of the domain X. The choice Sij := δij leads to the Subset of Regressors approximation (Proposition 1, p. 7). Conjugate Gradients for Kernel Machines Example 2 (Integrals with Eigenfunctions) Let φi i = 1, ..., P be orthogonal Eigenfunctions of k with respect to a density ν on X, i.e. Z k(x, z)φi(z)ν(z) dz = λiφi(x) Z φi(z)φj(z)ν(z) dz = δij where λi R and δij is the Kronecker indicator function (compare Rasmussen and Williams (2006, p. 96)). Then for pi(x) := φi(x)ν(x) the observations [mat (T pk)]ij = δijλi are spectral values of the kernel. In essence, this example shows another possibility to express prior knowledge over the kernel. This likelihood leads to the Projected Bayes Regressor (Trecate et al., 1999) (Proposition 2, p. 8), which is a historical, deterministic precursor to the more widely known random Fourier feature expansion of Rahimi and Recht (2008). 2.4. Posterior and Subsumed Approximation Methods The observation operator T p is a linear projection, and hence transforms the Gaussian prior into an also Gaussian posterior. Given the prior (Eq. 9) and any likelihood of the previous section, the posterior is Gaussian with: p(k | Y , T p) = N(k M, w M) k M = k0 + (T pψ) (T p(T pψ) ) 1(vec (Y ) T pk0) (13) ψM = ψ (T pψ) (T p(T pψ) ) 1T pψ The concrete posterior depends on the choice of T p. The following propositions presents approximation methods that have a view as GP inference with low-rank kernel and how they arise in our framework. Proposition 1 (Subset of Regressors) Consider the prior of Eq. (9) with k0 := 0 and w := k and the likelihood defined in Example 1 with sij = δij. Then the posterior mean k M is equivalent to that of So R: k M(x, z) = k So R = k(x, XU)k(XU, XU) 1k(XU, z) where XU are inducing inputs, not necessarily part of X. The proof is part of Appendix B. An example of this posterior distribution is shown in Figure 2. The related method, Fully Independent Conditional (FIC) has a very similar kernel, k FIC = k So R(x, z) + δ(x z)(k(x, x) k So R(x, x)). The structure indicates that this kernel should fit as well into our framework. One option that comes to mind, is to model the diagonal elements as certain, using a prior with mean k0(k(a, b)) := δ(a b)k(a, b) and Bartels and Hennig k M ψM |k k M| ψM 1 sample -2.0 0.0 2.0 Figure 3: progression of the posterior (Eq. 14) for KMCG on the toy example from Figure 1 for P = 2, 4 and 8 conjugate gradients steps. The columns show from left to right: mean, standard deviation, standardized error (white refers to perfect calibration, green to overconfidence and red to underconfidence) and a sample. covariance function ψ (a, b; c, d) := (1 δ(a b))ψ(a, b; c, d)(1 δ(c d)). The posterior mean, however, is in general not the FIC kernel, as for off-diagonal elements, the prediction differs to due to the certainty over the diagonal elements. Furthermore, the modification of the covariance function annuls the convenient algebraic properties of the associated covariance matrices and hence, this prior is dismissed as potential FIC competitor. Another strategy could be to add the diagonal elements as observations. However, this is not possible with the operator as defined in Eq. (11) as it requires the mapping to a finite-dimensional vector. Also restricting the observation to testand training points does not lead to FIC. It remains an open question whether FIC fits into our proposed kernel approximation scheme.2 Proposition 2 (Projected Bayes Regressor) Consider the prior of Eq. (9) with k0 := 0 and w := k and the likelihood defined in Example 2. Let λ1 to λP be the largest eigenvalues of the kernel k (w.r.t to the Mercer expansion) and assume the inputs x1, ..., x N are independent 2. We found that replacing heuristically the approximate diagonal for the exact diagonal does not improve performance. Conjugate Gradients for Kernel Machines and identical draws from ν. Then the posterior kernel k M leads to the Projected Bayes Regressor (Trecate et al., 1999). The proof is part of Appendix C. 3. Conjugate Gradients for Kernel Machines The previous section introduced a probabilistic estimation rule for the kernel k. This section presents another data-collection approach using conjugate gradients that leads to a new approximation algorithm: kernel machine conjugate gradients (KMCG). The interest to use conjugate gradients for kernel machines goes back to more than 25 years (Skilling, 1993) and is still continuing (Davies, 2015; Filippone and Engler, 2015). Albeit quadratic costs per step, CG has advantages over many of the approximation methods referenced in the introduction. CG has only one parameter, the desired precision, which is more natural than e.g. the number of inducing inputs for inducing point methods (Qui nonero Candela and Rasmussen, 2005). This means, the computational budget of CG is not fixed in advance but varies as necessary for the problem at hand. 3.1. Conjugate Gradients Conjugate gradients (Algorithm 1) is an iterative solver for linear equation systems Ax = b where A RN N is a real, symmetric and positive definite matrix (Hestenes and Stiefel, 1952). In theory, CG returns the exact solution x after N steps. In practice, CG is used as approximate solver and can provide good approximations to x in significantly less than N steps. The costs of running CG are dominated by a matrix-vector multiplication in each step which in general has complexity O(N2). The number of necessary steps depends on the eigenvalues λ1 > ... > λN of A. The following summary of the properties of CG is an excerpt from Nocedal and Wright (1999, Chapter 5.1). We use the notation x 2 Λ := x Λx for any symmetric and positive definite matrix Λ. The A-error of CG decreases in each step with xk+1 x 2 A λN k λ1 2 ||x0 x||2 A and one can show that if A has at most r distinct eigenvalues, then CG terminates after r steps with the exact solution. Thus, conjugate gradients is particularly advantageous if the eigenvalues of A are clustered or decay rapidly. 3.2. Kernel-machine Conjugate Gradients Our approach is to run conjugate gradients for P steps on a kernel matrix of size M and to treat the matrix multiplications (zi in Algorithm 1) as observations in the model presented in Section 2. Formally, the likelihood is defined similar to the So R likelihood (Example 1) albeit scaled. Definition 3 (Conjugate-gradients likelihood) Choose a subset XM of size M from X and denote as y M RM the vector that contains the corresponding entries of y. Run Bartels and Hennig Algorithm 1 Conjugate Gradients 1: procedure Conjugate Gradients(A, b, x0, ε) 2: r0 Ax0 b The initial residual ... 3: s0 r0 ... is the first search direction. 5: while ||ri||2 > ε do 6: zi Asi the most expensive step: O(N2) matrix-multiplication 7: αi r i ri s i zi optimal linesearch along si for φ(x) := x Ax 2x b 8: xi+1 xi + αisi update to the solution 9: ri+1 ri + αizi analogue update to the residual 10: si+1 ri+1 + r i+1ri+1 r i ri si Gram-Schmidt applied to the new residual 11: i i + 1 12: end while 13: return xi 14: end procedure conjugate gradients (Algorithm 1 on p. 10) with x0 := 0, A = k(XM, XM), b = y M and ε := 0.01||b||2. In Equation (11) set j=1 sjδ(x xj) where sj is the j-th entry of vector si in iteration i of Algorithm 1. Remark 4 KMCG uses only the CG search directions s1, ..., s P and not the solution ˆx. Other search directions ( e.g. from the Lanczos process) could also be used3. Using this likelihood, the resulting approximate kernel (Eq. (13)) and approximate Equations are (Proposition 9, p. 26): ˆk M(x , x ) = k(x , XM)S(S KMS) 1S k(XM, x ) (14) ˆf(x ) = k(x , XM)S(R R + σ2S KMS) 1R y (15) ˆc(x , x ) = k(x , x ) k(x , XM)S(S KMS) 1S k(XM, x ) + σ2k(x , XM)S R R + σ2S KMS 1 S k(XM, x ) ln ˆZ = 1 2σ2 (y y y R(R R + σ2S KMS) 1R y) (16) 2 ln |R R + σ2S KMS| 1 2(N P) ln σ2 + 1 where S := s1 ... s P , R := k(XM, X)S and P is the number of CG-iterations. We call this approximation kernel machine conjugate gradients (KMCG). 3. In exploratory experiments, we found conjugate gradients search directions to perform slightly better than Lanczos search directions. Conjugate Gradients for Kernel Machines Algorithm 2 Kernel Machine Conjugate Gradients 1: procedure KMCG(k, X, y, σ2, ε) 2: We assume (w.l.o.g.) that the inducing inputs are a subset of X, denoted by XM. 3: Let y M the be corresponding entries of y. 4: Conjugate Gradients(k(XM, XM), y M, ε) ignore solution ˆx 5: S [s1, ..., s P ] collect CG search directions 6: Z [z1, ..., z P ] Z = KMS 7: if M < N then 8: R k(X, XM)S 10: R Z When XM = X above matrix multiplication is not necessary. 12: L1 chol(S Z) precompute required Choleskies 13: L2 chol(σ2S Z + R R) 14: evaluate Eqs. (15) to (16) 15: end procedure 3.3. Properties Figure 3 shows how the approximation to the kernel progresses for the toy example from Figure 1. Computing the Cholesky of R R + σ2S KMS costs O(NMP). After that, evaluating the mean prediction is possible in O(M) and the variance in O(MP). In case P = M, KMCG reduces to So R since all occurrences of S in Equation (14) cancel and what remains is the So R kernel (Equation 7). If KM has a favorable distribution of eigenvalues such that conjugate gradients terminates in less than M steps (c.f. Section 3.1), KMCG can be used to speed up So R.4 In practice, this kind of advantage can only be expected to be beneficial when realized in low-level code. The level of efficiency of existing low-level linear algebra routines makes it challenging to evaluate this area. Recall that the computational complexity of CG for the solution of Eq. (9) in P iterations is O(N2P), that of inducing point methods with M inducing inputs is O(NM2), and KMCG running for P iterations on M inducing points has complexity O(NMP). The subsequent evaluation section is dedicated to the case M = N, i.e. using the whole data set which places KMCG in direct competition to plain conjugate gradients. 3.3.1. Relationship to the Nadaraya-Watson estimator Taking only one step (P = 1) implies S = y M and Equation (15) takes the following form m=1 k(xm, x )ym where α = y MKMy M σ2y MKMy M+y MKMKMy M . The equation bears resemblance to the Nadaraya Watson estimator (Bishop, 2006, p. 301f): a sum over all training targets weighted by the 4. The same applies to related methods such as DTC (Qui nonero-Candela and Rasmussen, 2005) and Titsias method (Titsias, 2009b). Bartels and Hennig similarity of the corresponding input to the test input. However, the scaling-factor α is different. Since conjugate gradients solves the linear system for the mean prediction, it is to be expected that this might incur a trade-offto the approximation of the variance. See Section 4 for an empirical evaluation of the quality of the variance estimate. 3.3.2. Uncertainty In addition to the posterior mean k M, the Gaussian formulation of the approximation problem also provides a posterior variance ψM. It is a natural question to which degree this object can be interpreted as a notion of uncertainty or, more specifically, as an estimate of the square error (k k M)2. This section provides an analysis of this covariance for KMCG, showing it to be an outer bound on the true error. Figure 3 visualizes this for the toy data set from Figure 1. Proposition 5 (relative error bound) The relative size of estimation error and error estimate is bounded from above by 2. (k(x, z) k M(x, z))2 ψM(k(x, z), k(x, z)) 2 Proof To simplify notation, define k x := k(x, X) and G := S(S KS) 1S . For KMCG posterior mean and variance evalute to (Appendix B): k M(x, z) = k x Gkz, ψM(k(x, z), k(x, z)) = 1/2 k(x, x)k(z, z) + k(x, z)2 k x Gkxk z Gkz (k x Gkz)2 = 1/2 k(x, x)k(z, z) + k(x, z)2 k M(x, x)k M(z, z) k M(x, z)2 . As a variance ψM(k(x, x), k(x, x)) is always larger than 0 which implies k(x, x) k M(x, x) for all x. This allows to lower bound ψM(k(x, z), k(x, z)) by 1 2k(x, z)2 1 2k M(x, z)2 leading to (k(x, z) k M(x, z))2 ψM(k(x, z), k(x, z)) 2(k(x, z) k M(x, z)2 k(x, z)2 k M(x, z)2 = 2 (k(x, z) k M(x, z)2 (k(x, z) k M(x, z))(k(x, z) + k M(x, z)) = 2|k(x, z) k M(x, z)| k(x, z) + k M(x, z) 2. Conjugate Gradients for Kernel Machines 3.4. Related Work In terms of using conjugate gradients for kernel machines there is related work by Filippone and Engler (2015). Their algorithm ULISSE is aimed at the estimation of the marginal likelihood p(θ | y) where θ are hyper-parameters of the kernel k. They use a randomized conjugate gradients to estimate gradients of the log-marginal likelihood (Eq. (3)) which in combination with Stochastic Gradient Langevin Dynamics (SGLD) (Welling and Teh, 2011) allows to sample from p(θ | y). Our work is complementary to ULISSE. While running CG the matrix multiplications the inference perspective in Section 2 can be used to build a low-rank approximation of the kernel matrix which can serve as preconditioner for the next SGLD step. Using the Kronecker product for efficient inference has been explored before for example in the KISS-GP framework (Wilson and Nickisch, 2015). The difference to this work is that Wilson and Nickisch (2015) factorize the kernel matrix K into a Kronecker-product where here it is the covariance matrix of the prior ψ(K, K) over the kernel that has Kronecker structure (cf. Eq. 9). A synergy between their and our approach is hard to imagine. However, the follow-up work by Pleiss et al. (2018) uses Lanczos iteration to build a low-rank approximation of a kernel matrix C for the variance prediction. Presumably, one could use instead KMCG. 4. Empirical Comparison of Conjugate Gradients and Kernel Machine Conjugate Gradients This section elaborates the conceptual differences between CG and KMCG and then compares both algorithms with numerical experiments. Consider Equation (1), restated below for convenience. f(x ) = k (K + σ2I) 1y (1) CG computes an approximation to (K +σ2I) 1y and uses the exact k . In contrast, KMCG computes an approximation to k and substitutes k as well. That the systematic replacement of the kernel can be of importance has been noted before by Rasmussen and Williams (2006, p. 177) when comparing So R and the Nystr om method (Williams and Seeger, 2001). The So R method approximates k with the kernel in Equation (7). In contrast Nystr om uses the exact k such that the predictive variance (Eq. 2) can become negative. They further observed that for large M, Nystr om and So R have a similar performance, yet for small M Nystr om performs poorly. We made the same observations for CG and KMCG in the following comparison on common regression problems. Classical conjugate gradients is used to solve the equations (K + σ2I)α = y. In contrast, since the goal of KMCG is to learn an approximation to the kernel, the algorithm runs conjugate gradients on Kα = y, i.e. without noise term. Both methods were evaluated in terms of the average relative error f(x ,k) ˆf(x ,k) Bartels and Hennig where x ,k is a test input not part of the training set. The text book version of conjugate gradients in Algorithm 1 is known to be numerically unstable5 (Golub and Van Loan, 2013, p. 635) and there exist different strategies to cope with this problem. We refer the interested reader to Golub and Van Loan (2013, p. 562f) and the references therein. To explore the potential of our method, we bypass this implementation issue using the slowest6 yet most stable solution: complete reorthogonalization7 (Golub and Van Loan, 2013, p. 564) and the explicit projection-method formulation (Saad, 2003, p. 135 Eq. (5.7)) to compute α. Therefore the following comparison will be conceptually, i.e. over the number of conjugate gradient steps. For completeness, Appendix E.1 contains results how KMCG performs in wall-clock time. Often the baseline methods converge faster since block-matrix multiplication is faster than looped matrix-vector multiplication. Baseline methods are the Fully Independent Training Conditional (FITC) approximation (Qui nonero-Candela and Rasmussen, 2005) and the Variational Free Energy (VFE) method (Titsias, 2009a), with inducing inputs randomly selected from the data set as recommended by Chalupka et al. (2013). The baseline runs were repeated 10 times and besides the average, each figure shows also the progressive minimum and maximum over all runs, to take into account for more elaborate inducing input selection schemes. In all our experiments, we used two popular stationary kernel functions: automatic relevance determination (ARD) Squared Exponential and ARD Mat ern 5/2 (Rasmussen and Williams, 2006, p. 83f, p. 106), k SE(d(x, z; Λ)) = θf exp 1 k52(d(x, z; Λ)) = θf where d = d(x, z; Λ) = ||x z||Λ and Λ is a diagonal matrix. All experiments were executed with Matlab R2019a on an Intel i7 CPU with 32 Gigabytes of RAM running Ubuntu 18.04. 4.1. Common Regression data sets The data sets chosen are small such that computation of the exact GP is still feasible. The origin and purpose of each data set can be found in Appendix D. Each data set has been shuffled and split into two sets, using one for training and the other for testing. For each data set, we optimized the kernel parameters running Carl Rasmussen s minimize function8 for 100 optimization-steps, where initially all kernel hyper-parameters are set to 1. Fig. 4 shows how the average relative error develops for the described setup9. The number of inducing inputs M was set to M = NP such that O-notation costs are equivalent to KMCG: Since KMCG uses multiplications with K for observations, the costs per CG-step 5. see additional results in Appendix E.3 6. Computing the exact solution is actually faster. 7. We experimented with selective reorthogonalization (Simon, 1984) but found it in our experiments to be slower than full reorthogonalization. 8. This method is part of the GPML toolbox (Rasmussen and Nickisch, 2010), see http://www. gaussianprocess.org/gpml/code/matlab/doc. 9. Since the Mat ern kernel experiments look very similar, these results are in Appendix E.2 Conjugate Gradients for Kernel Machines are O(N2P). The upper x-axis displays the number of conjugate gradients steps, the lower x-axis, the number of inducing inputs. During early iterations the performance of CG is not as reliable as KMCG and the latter also improves more consistently. In comparison to the baselines, KMCG often provides a worse approximation to start with but exhibits a faster convergence rate. In contrast to plain conjugate gradients, KMCG naturally provides estimates for variance (Eq. 2) and evidence (Eq. 3). Define the average relative errors εvar and εev analogously to Equation (17), respectively. Figure 6 and 5 show the average relative error of these estimates in comparison to the baselines. For all data sets one can observe that the approximation quality of KMCG for the evidence (Eq. (3)) is improving at first and then worsening. KMCG is better at approximating the quadratic form than the determinant. Therefore, the approximation often overshoots . The baselines clearly outperform KMCG in these experiments. A possible explanation is that the baselines provide a better overall-approximation to the kernel matrix: After P CG-steps, the KMCG kernel is of rank P whereas using M inducing inputs, the VFE kernel is of rank M (so is the FITC kernel, putting the diagonal correction aside). Since M = NP, the baselines can afford more inducing inputs M than KMCG can afford CG-steps P. Overall, when it comes to real-time, the baselines are preferable over KMCG. The picture changes when matrix-multiplication is less expensive than O(N2) which is investigated in the next section. 4.2. Grid-structured data sets In the previous section the baselines are the preferable estimators over KMCG. This changes when matrix-multiplication costs less than O(N2). For example when the kernel is a product kernel (such as squared exponential) and the data set has grid-structure, the cost for matrixmultiplication is almost linearly in the number of data-points (Wilson et al., 2014) such that the number of CG-steps KMCG can take, matches the number of baseline inducing inputs. 4.2.1. Artificial data sets The data sets considered in the following are artificial multi-dimensional grids.10 For the training set, along each axis G points are equally spaced in [ G/4, G/4] distorted by Gaussian noise N(0, 10 3). One hundred test inputs are uniformly distributed over the [ G/4, G/4] cube. Targets are drawn from a Gaussian process with squared exponential kernel (length scales and amplitude equal to 1). The number of inducing inputs had to be capped at 500 due to memory limitations. Figure 7 shows how the approximation error to mean, variance and likelihood term evolves, zoomed in on the first 100 steps. For reference, Fig. 7 also shows a 10 10 data set to give an idea how each method would evolve when investing more computational power would be feasible. In the appendix, Figure 10 shows the same comparison over time for the whole 500 steps, stopping KMCG when it becomes slower than the baselines. 10. Computing the exact solution is feasible exploiting the Kronecker structure of the kernel matrix which we use to evaluate the quality of the approximation methods. However, we may imagine datapoints missing, s.t. matrix-vector multiplication is fast but computing the exact solution is not. M = 289 354 409 ABALONE N = 2088, D = 8 M = 340 417 481 PRECIPITATION N = 2888, D = 2 M = 405 496 573 PUMADYN N = 4096, D = 32 M = 89 109 126 MPG N = 196, D = 7 CG KMCG VFE FITC M = 548 671 775 POLETELECOMM N = 7499, D = 26 M = 577 706 815 ELEVATORS N = 8299, D = 17 20 40 60 80 M = 525 643 742 AILERONS N = 6874, D = 39 20 40 60 80 M = 90 110 127 TOY N = 200, D = 1 20 40 60 80 CG-steps P CG-steps P CG-steps P Figure 4: progression of the relative error εf as a function of the number of iterations of CG and KMCG for different data sets using the squared-exponential kernel (Eq. 18). The bottom axis is the number of CG-steps, which is the same for all plots. Therefore, this axis is visible only in the last row. The top axis denotes the number of inducing inputs used by the baseline methods. The shaded area visualizes minimum and maximum over all baseline runs. A cross denotes the end of a crashed run. M = 289 354 409 ABALONE N = 2088, D = 8 M = 340 417 481 PRECIPITATION N = 2888, D = 2 M = 405 496 573 PUMADYN N = 4096, D = 32 M = 89 109 126 MPG N = 196, D = 7 FITC VFE KMCG M = 548 671 775 POLETELECOMM N = 7499, D = 26 M = 577 706 815 ELEVATORS N = 8299, D = 17 20 40 60 80 M = 525 643 742 AILERONS N = 6874, D = 39 20 40 60 80 M = 90 110 127 TOY N = 200, D = 1 20 40 60 80 CG-steps P CG-steps P CG-steps P Figure 5: progression of the relative error of the variance εvar as a function of the number of iterations of KMCG and baseline for different data sets using the squared-exponential kernel (Eq. 18). The bottom axis is the number of CG-steps, which is the same for all plots. Therefore, this axis is visible only in the last row. The top axis denotes the number of inducing inputs used by the baseline methods. The shaded area visualizes minimum and maximum over all baseline runs. A cross denotes the end of a crashed run. M = 289 354 409 ABALONE N = 2088, D = 8 M = 340 417 481 PRECIPITATION N = 2888, D = 2 M = 405 496 573 PUMADYN N = 4096, D = 32 M = 89 109 126 MPG N = 196, D = 7 FITC VFE KMCG M = 548 671 775 POLETELECOMM N = 7499, D = 26 M = 577 706 815 ELEVATORS N = 8299, D = 17 20 40 60 80 M = 525 643 742 AILERONS N = 6874, D = 39 20 40 60 80 M = 90 110 127 TOY N = 200, D = 1 20 40 60 80 CG-steps P CG-steps P CG-steps P Figure 6: progression of the relative error of the evidence εev as a function of the number of iterations of baseline and KMCG for different data sets using the squared-exponential kernel (Eq. 18). The bottom axis is the number of CG-steps, which is the same for all plots. Therefore, this axis is visible only in the last row. The top axis denotes the number of inducing inputs used by the baseline methods. The shaded area visualizes minimum and maximum over all baseline runs. A cross denotes the end of a crashed run. The small spikes in the plots where KMCG appears to be close to the solution correspond to changes of the estimate from too small to too large. Conjugate Gradients for Kernel Machines On these data sets KMCG dominates the baseline methods. After already one hundred CG-steps KMCG provides a useful approximation to the posterior mean whereas the baselines hardly show any progress. For the variance, the same computational effort is not enough. Though the baselines find better solutions, all methods essentially fail to arrive at a satisfactory solution of a relative error below one. The issue is that all methods overestimate the posterior variance by two orders of magnitude. The picture is similar for the evidence, albeit the approximations are closer to the truth and KMCG performs slightly better on average. 4.2.2. Natural Sound Modeling For a real-world example of a grid-structured data set, we repeat the Natural Sound Modeling experiment considered by Turner (2010); Wilson and Nickisch (2015) and Dong et al. (2017). Given the intensity of a sound signal recorded over time, the objective is to recover the signal in missing regions. All inputs (i.e. including missing) are equidistant and hence the kernel matrix (over all inputs) is Toeplitz for stationary kernel. The kernel matrix over the given inputs is not Toeplitz, which forbids to use this structure for exact inference. Nevertheless matrix-vector-multiplication can be performed in linear time. We use the squared-exponential kernel with the hyper-parameters used by Dong et al. (2017). Since the exact posterior is infeasible to compute, we report only the standardized mean squared-error: SMSE := 1 V[y] j=1 (y ,j ˆf(x ,j))2. To conform with the original experiment, we added for each baseline method a run the inducing inputs where chosen to be on a regular grid. The result of this run correspond to the minimum. Figure 8 confirms the observations from the previous section that KMCG arrives at satisfactory solutions faster than baseline, if matrix-vector multiplication is not an issue. 5. Conclusion We have presented a new approximate inference method for kernel machines that showed how linear solvers can be used in combination with low-rank kernel approximations. The approach is based on a probabilistic numerics viewpoint: the kernel k is treated as a latent quantity and a linear solver is used for collecting observations of k. By design, the resulting approximate kernel is of low rank and is plugged into the nonparametric least-squares problem. The approach is not restricted to least-squares problems but applicable in any scenario where the bottleneck is the inversion of a large kernel matrix, as e.g. GP classification. Our kernel machine conjugate gradients (KMCG), consistently outperforms plain conjugate gradients in numerical experiments. This does not change the fact that standard dense kernel least-squares problems are often more efficiently solved by inducing point methods. However, as demonstrated in Section 4.2, in the settings which allow fast multiplication with the kernel matrix, the new algorithm can improve upon the state of the art. 10 10 1000 1000 100 100 100 30 30 30 30 CG KMCG VFE FITC 20 40 60 80 20 40 60 80 20 40 60 80 20 40 60 80 CG-steps P CG-steps P CG-steps P CG-steps P Figure 7: comparison of baseline and KMCG on grid-structured data sets using the squared exponential kernel (Eq. 18). The shaded area visualizes minimum and maximum over all baseline runs. SOUND N = 59309, D = 1 20 40 60 80 0 5 10 15 20 SOUND N = 59309, D = 1 CG KMCG VFE FITC CG-steps P time in s Figure 8: comparison of KMCG and CG on the SOUND data set, using the squaredexponential kernel (Eq. 18). Left panel: progression of the standardized mean-squared error (SMSE) over the first 100 iterations of CG and KMCG. Right panel: the same comparison with unlimited steps in wall-clock time, cut-offwhen the slowest baseline (FITC with 500 inducing inputs) finishes. The shaded area visualizes minimum and maximum over all baseline runs. Bartels and Hennig Acknowledgments The authors thank Alexandra Gessner, Hans Kersting, Agustinus Kristiadi, Frederik Kunstner, Filip de Roos and Matthias Werner for helpful discussions and proof-reading. The authors gratefully acknowledge financial support by the Emmy Noether Programme of the German Research Union (DFG, Grant HE 7114/3-1); by the European Research Council through ERC St G Action 757275 / PANAMA; the DFG Cluster of Excellence Machine Learning - New Perspectives for Science , EXC 2064/1, project number 390727645; the German Federal Ministry of Education and Research (BMBF) through the T bingen AI Center (FKZ: 01IS18039A); and funds from the Ministry of Science, Research and Arts of the State of Baden-W urttemberg. Appendix A. Properties of the Symmetric Kronecker Product The Kronecker product and its symmetric version have been studied, among others, by Loan (2000) and Magnus and Neudecker (1980). The definitions used in this work slightly differ from the authors above and instead follow Hennig (2015). The Kronecker product for two arbitrary matrices A RN1 N2, B RN3 N4 is defined as [A B]ij,kl := Aik Bjl where i {1, ..., N1}, j {1, ..., N3}, k {1, ..., N2} and l {1, ..., N4}, and ij is not a product but a double-index. The following identities about Kronecker products and the vectorization operator can be found in Hennig and Kiefel (2013), and are restated here for the convenience of the reader: (A B) vec (C) = vec (ACB ) (K1) (A B)(C D) =(AC) (BD) (K2) (A B) 1 =A 1 B 1 (K3) (A B) =A B (K4) (A + B) C =A C + B C (K5) where11 A, B, C, D RN N, and A and B are assumed to be invertible. An appealing property of Kronecker-structured matrices is their interaction with vectorized matrices. For a square matrix A = a1 . . . a N RN N, the vectorization operator vec ( ) : RN N _ RN2 stacks the rows12 of A into one vector: , with [vec (A)](ij) = [A]ij 11. The conditions can be more general but for ease of exposition, we assume all matrices are square and of equal size. 12. Stacking the columns is equivalently possible and common. It is associated with a permutation in the definition of the Kronecker product, but the resulting inferences are equivalent. Conjugate Gradients for Kernel Machines and mat ( ) transforms an N2 vector into an N N matrix, s.t. mat (vec (A)) = A. A vector product of vectorized matrices corresponds to the trace of their product: vec (A) vec (B) = tr [AB ] . (V1) tr [AB ] = X i,j Aij B ji i,j Aij Bij = vec (A) vec (B) The symmetric Kronecker product for two square matrices A, B RN N of equal size is defined as A B := ΓN(A B)ΓN where [ΓN]ij,kl := 1/2δikδjl + 1/2δilδjk satisfies Γ vec (C) = 1/2 vec (C) + 1/2 vec (C ) for all square-matrices C RN N. Equivalently, one can write (A B)ij,kl = 1 4 (Aik Bjl + Ail Bjk + Bik Ajl + Bil Ajk) . The symmetric Kronecker product inherits some of the desirable properties of the Kronecker product. Some of the following identities can, again, be found in Hennig (2015), some are due to Loan (2000) and Magnus and Neudecker (1980) and some are novel. The proof gives exact credit. Proposition 6 Let V , W RN N be square matrices and A , B RN M be rectangular. W W = ΓN(W W ) (SK1) ΓM(A A) = (A A)ΓN (SK2) V W = W V (SK3) (A A)(W W )(B B) = (AW B) (AW B) (SK4) W W V V = (W + V ) (W V ) (SK5) (W W ) 1 = (W 1 W 1). (SK6) The interpretation of Eq. (SK6) requires some care: symmetric Kronecker product matrices are rank deficient. Eq. (SK6) is to be read in the sense that for symmetric Y RN N, i.e. Y = Y , X := mat (W 1 W 1) vec (Y ) satisfies vec (Y ) = (W W ) vec (X) and X is the unique symmetric solution. Bartels and Hennig Proof The proofs for Eqs. (SK1) and (SK2) can be found in Magnus and Neudecker (1999)[p. 46-50]. In the notation of Magnus and Neudecker (1999) Γ = Nn = Dn D+ n and K = 2Γ 2I. Eq. (SK1) is Theorem 13 (a). Eq. (SK2) follows from Theorem 9 (a). To show (W V ) = (V W ), let X RN N be an arbitrary matrix. (V W ) vec (X) = Γ(V W )Γ vec (X) 2Γ(V W ) vec (X + X ) 2Γ vec (V (X + X )W ) 4 vec (V (X + X )W + W (X + X )V ) 2Γ vec (W (X + X )V ) 2Γ(W V ) vec (X + X ) = Γ(W V )Γ vec (X) = (W V ) vec (X) To show Eq. (SK4), use (SK2). (A A)(W W )(B B) = (A A)Γ(W W )Γ(B B) = Γ(A A)(W W )(B B)Γ = Γ(AW B AW B)Γ = AW B AW B The proof of Eq. (SK5) uses (SK3). (A + B) (A B) = Γ(A + B) (A B)Γ = Γ(A A A B + B A B B)Γ = A A A B + B A B B = A A B A + B A B B It remains to prove Eq. (SK6). Assume Z satisfies (W W ) vec (Z) = vec (Y ) and Z = Z . Then, vec (Y ) = (W W ) vec (Z) = (W W )ΓN vec (Z) using Eq. (SK1) and Eq. (SK2) = (W W ) vec (Z) since Z = Z and hence, Z = (W W ) 1 vec (Y ). Using Eq. (K3) and again Eq. (SK1), Z = (W 1 W 1) vec (Y ) which is the definition of X. Conjugate Gradients for Kernel Machines A.1. Sampling from a Gaussian with Symmetric Kronecker Covariance matrix To sample matrices from the KMCG posterior (Eq. 13) the following proposition will be useful. Proposition 7 Let W , W M RN N be symmetric and positive semi-definite matrices s.t. W W M is symmetric positive-semidefinite as well. Further let vec (Y ) N(0, W W W M W M), denote with L+ the Cholesky of W + W M, with L the Cholesky of W W M and let vec (X) N(0, IN2), then Γ(L1 L2) vec (X) and vec (Y ) have the same distribution. Remark: This shows that Y is symmetric due to the Γ-operator. Proof As vec (X) is standard normal, Γ(L+ L ) vec (X) is distributed Gaussian with mean 0 and covariance matrix Γ(L+ L )(Γ(L+ L )) . Γ(L+ L ) [Γ(L+ L )] = Γ(L+ L )(L + L )Γ = (L+L +) (L L ) = (W + W M) (W W M) According to Equation (SK5): (W + W M) (W W M) = W W W M W M. Appendix B. Inducing Input Methods This section contains the proof of Proposition 1, introduced on page 7, and, for the readers convenience, restated below, along with the referenced equations. Proposition 8 (Subset of Regressors) Consider the prior of Eq. (9) with k0 := 0 and w := k and the likelihood defined in Example 1 with sij = δij. Then the posterior mean k M is equivalent to that of So R: k M(x, z) = k So R = k(x, XU)k(XU, XU) 1k(XU, z) where XU are inducing inputs, not necessarily part of X. The mentioned equations are k GP(k0, γψ), (9) ψ(k(a, b), k(c, d)) := 1 2w(a, c)w(b, d) + 1 2w(a, d)w(b, c), (10) T p :(X X)R RP 2, k 7 vec ZZ k(x, z)pi(x)pj(z) dx dz and pi(x) := j=1 sijδ(x xuj). (12) Proposition 1 follows from the more general Proposition 9, below. Bartels and Hennig Proposition 9 Consider the prior of Eq. (9) (without the restriction w = k) and the likelihood defined in Example 1. The posterior over k is p(k | Y = T pk) = N(k; k M, ψM) with posterior mean k M(a, b) = k0(a, b) + w(a, XU)S(S WMS) 1S KMS(S WMS) 1S w(XU, b) (20) w(a, XU)S(S WMS) 1S k0(XU, XU)S(S WMS) 1S w(XU, b) and posterior variance ψM(k(a,b), k(c, d)) = 1 2w(a, c)w(b, d) + 1 2w(a, d)w(b, c) (21) 2w(a, XU)S(S WMS) 1S w(XU, c)w(b, XU)S(S WMS) 1S w(XU, d) 2w(a, XU)S(S WMS) 1S w(XU, d)w(b, XU)S(S WMS) 1S w(XU, c) where WM = w(XU, XU). Proof The proof is tedious linear algebra. If prior and likelihood are Gaussian, so is the posterior with mean k M(a, b) = k0(a, b) (T pψ(k(a, b), )) (T p(T pw) ) 1 vec (Y S k0(XU, XU)S) , and variance ψM(k(a, b), k(c, d)) = ψ((a, b), (c, d)) (T pψ((a, b), ( , ))) (T p(T pψ) ) 1T pψ(c, d), ( , )). With Lemma 10 and Eq. (SK6), we can write (T pψ(k(a, b), )) (T p(T pw) ) 1 2 vec (S w(XU, a)w(b, XU) + w(XU, b)w(a, XU))S) (S WMS) 1 (S WMS) 1 2 vec (S WMS) 1S w(XU, a)w(b, XU) + w(XU, b)w(a, XU))S(S WMS) 1 and, using Eq. (V1), obtain for Eq. (20): k M(a, b) = k0(a, b) 2 tr (S WMS) 1S w(XU, a)w(b, XU)S(S WMS) 1(Y k0(XU, XU)) 2 tr (S WMS) 1S w(XU, b)w(a, XU)S(S WMS) 1(Y k0(XU, XU)) = k0(a, b) + w(a, XU)S(S WMS) 1S KMS(S WMS) 1S w(XU, b) w(a, XU)S(S WMS) 1S k0(XU, XU)S(S WMS) 1S w(XU, b) The derivation for Eq. (21) follows analogously. Conjugate Gradients for Kernel Machines Lemma 10 Let T p be defined by Eq. (12). T pw(k(a, b), ) = 1 2 vec (S (w(XU, a)w(b, XU) + w(XU, b)w(a, XU)) S) (22) T p(T pw( , )) = (S WMS) (S WMS) (23) Proof Denote with mat ( ) the complement of the vectorization operator, i.e. mat (vec (A)) = A. Define the matrix S RN M as Sij = sij and denote with Sl the l-th column of S. Also recall that by Eq. (10) ψ(k(a, b), k(x, z)) = 1 2(w(a, x)w(b, z) + w(a, z)w(b, x)). [mat (T p[ψ(k(a, b), k( , ))])]ij = ZZ ψ(k(a, b), k(x, z)) l=1 silδ(x ul) l=1 sjlδ(z ul) 2(w(a, x)w(b, z) + w(a, z)w(b, x)) l=1 silδ(x ul) l=1 sjlδ(z ul) l=1 Sim Sjl(w(a, um)w(b, ul) + w(a, ul)w(b, um)) 2 [S w(XU, a)w(b, XU)S + S w(XU, b)w(a, XU)S]ij 2 [S (w(XU, a)w(b, XU) + w(XU, b)w(a, XU))S]ij which shows Eq. (22) = [mat ((S S )Γ vec (w(XU, a)w(b, XU)))]ij = [mat ((S S )Γ(w(XU, a) w(XU, b)))]ij Repeating above derivations shows the second statement, Eq. (23): T p(T pψ) = (S S )Γ(w(XU, XU) w(XU, XU))Γ(S S) = (S S) (w(XU, XU) w(XU, XU))(S S) = (S WMS) (S WMS) Equation (SK4) Proposition 11 If k0 = 0, S has rank M, and k and w are positive definite kernel functions then the posterior mean in Eq. (20) is symmetric and positive semi-definite. Proof With k0 = 0 the expression for k M from Proposition 9 simplifies to k M(x, z) =w(x, XU)S(S WMS) 1S KMS(S WMS) 1S w(XU, z). Bartels and Hennig The function k M is symmetric since k is symmetric. The bivariate function k M is said to be positive (semi-)definite ifffor all n N and for all Z X, k M(Z, Z) is a positive (semi-)definite matrix. Since k(XU, XU) is a symmetric and positive definite (s.p.d.) matrix, so is S k(XU, XU)S for arbitrary S. The same argument holds for S WMS. Since S is rank M, (S WMS) 1 exists and the inverse of an s.p.d. matrix is s.p.d. as well. Therefore S(S WMS) 1S KMS(S WMS) 1S is symmetric and positive semi-definite. This completes the proof. Appendix C. Projected Bayes Regressor This section contains the proof of Proposition 2 restated below. Proposition 12 (Projected Bayes Regressor) Consider the prior of Eq. (9) with k0 := 0 and w := k and the likelihood defined in Example 2. Let λ1 to λP be the largest eigenvalues of the kernel k (w.r.t to the Mercer expansion) and assume the inputs x1, ..., x N are independent and identical draws from ν. Then the posterior kernel k M leads to the Projected Bayes Regressor (Trecate et al., 1999). Proof Given Lemma 13 below, all that remains is to substitute k M in Eq. (1) which evaluates to φ(x ) (ΦΦ + σ2Λ 1)Φ y. (24) Comparing b(x) in Definition 1 in Trecate et al. (1999) and Eq. (24) one observes that both are equivalent. Lemma 13 Let φi i = 1, ..., P be orthogonal Eigenfunctions of k with respect to a density ν on X, i.e. Z k(x, z)φi(z)ν(z) dz = λiφi(x) Z φi(z)φj(z)ν(z) dz = δij where λi R and δij is the Kronecker delta. Under the prior of Eq. (9) with k0 := 0 and w := k and the likelihood defined in Example 2 with pi(x) = φi(x)ν(x), the approximate kernel (Eq. (13)) evaluates to k M(x, z) = j=1 λjφj(x)φj(z) = φ(x) Λφ(z) where [φ(x)]i = φi(x) and Λij := δijλi. Conjugate Gradients for Kernel Machines Proof With a zero prior-mean, the posterior over k (Eq. (13)) simplifies to k M(x, z) = (T pψ(k(x, z), ) (T p(T pψ) ) 1T pk. Differing from the proof of Proposition 9 the observation operator T p (Eq. 11) is of the form: [mat (T pk)]ij = ZZ k(x, z)φi(x)φj(z) ν(dx)ν(dz) Z φi(z)φj(z) ν(dz) = λiδij = Λij. The observation operator T p applied to the covariance function w evaluates to: [mat (T pψ(k(a, b), k( , )))]ij = mat T p 2k(a, )k(b, ) + 1 2k(a, )k(b, ) ZZ k(a, x)k(b, z)φi(x)φj(z)ν(dx)ν(dz) ZZ k(a, x)k(b, z)φj(x)φi(z)ν(dx)ν(dz) 2λiλj(φi(a)φj(b) + φi(b)φj(a)) (25) 2[Λ(φ(a)φ(b) + φ(b)φ(a) )Λ]ij. Applying T p again, leads to [T p(T pψ) ]ij,gh = ZZ [T pψ(k(x, z), k( , )] ghφi(x)φj(z) ν(dx)ν(dz) using Equation (25) ZZ (φg(x)φh(z) + φg(z)φh(x))φi(x)φj(z) ν(dx)ν(d(z) Z (δigφh(z) + δihφg(z))φj(z) ν(d(z) 2λgλh(δigδjh + δihδjg) = [Λ Λ]ij,gh where the last equation follows from the definition of the symmetric Kronecker product. This implies for the posterior mean over the kernel: k M(a, b) = (T pψ(k(a, b), ) (T p(T pψ) ) 1T pk 2 vec (Λ(φ(a)φ(b) + φ(b)φ(a) )Λ) (Λ Λ) 1 vec (Λ) 2 vec ((φ(a)φ(b) + φ(b)φ(a) )) vec (Λ) applying Equation (V1): 2 tr [Λ(φ(a)φ(b) + φ(b)φ(a) )] = φ(a) Λφ(b). Appendix D. Benchmark data sets Table 1 describes the purposes and origins of standard benchmark data sets used for Gaussian process regression. More information on PRECIPITATION can be found at http: //www.image.ucar.edu/Data/US.monthly.met/. It appears that the data sets AILERONS, ELEVATORS and POLETELECOMM are no longer available under the link https://www. dcc.fc.up.pt/~ltorgo/Regression/datasets.html. However, all files are part of this submission. Appendix E. Additional Experiments and Results This section consists of figures showing the results of Section 4.1 for the Mat ern kernel, real-time experiments and experiments with the textbook version of conjugate gradients. E.1. Real-time Results This section shows the same results as in Section 4.1 but over training-time instead of CG-steps. All figures have been trimmed to the slowest baseline method. Fig. 9 shows how the relative error εf develops over time for the squared exponential kernel and Fig. 10 shows the same for experiments over grid-structured data sets from Section 4.2. For the x-axis values we took the median of all measurements and fitted a quadratic function to these. E.2. Mat ern Kernel Results The figures in this section show the results for the Mat ern 5/2 kernel (Eq. (19)) for the experiment setup described Section 4.1. Fig. 11 shows the results for the relative error εf, Fig. 12 and Fig. 13 the results for εvar and εev, respectively. Fig. 14 displays the relative error over time. ABALONE Nash et al. (1994); Waugh (1995); Dua and Graff(2019) age prediction of abalone from physical measurements https://archive.ics.uci.edu/ml/datasets/Abalone AILERONS Camachol (1998) control action prediction on the ailerons of an F16 aircraft ELEVATORS Camachol (1998) control action prediction on the elevators of an F16 aircraft MPG Quinlan (1993); Dua and Graff(2019) fuel consumption prediction in miles per gallon for different attributes of cars https://archive.ics.uci.edu/ml/datasets/auto+mpg POLETELECOMM Weiss and Indurkhya (1995) commercial telecommunication application no further information PRECIPITATION Vanhatalo and Vehtari (2008) US annual precipitation prediction for the year 1995 https://github.com/gpstuff-dev/gpstuff/blob/master/gp/demodata/USprec1.txt PUMADYN Snelson and Ghahramani (2006) acceleration prediction one of the arm links given angles, positions and velocities of other links of a Puma560 robot ftp://ftp.cs.toronto.edu/pub/neuron/delve/data/tarfiles/pumadyn-family/ pumadyn-32nm.tar.gz SOUND Turner (2010); Wilson and Nickisch (2015) sound intensity prediction of a signal recorded over time for missing regions https://github.com/kd383/GPML_SLD/blob/master/demo/sound/audio_data.mat TOY introduced in this work targets are a draw from a zero-mean Gaussian process with squared exponential kernel (Eq. (18) with Λ = 0.25 and θf = 2), inputs stem in equal parts from a Gaussian mixture (N(0, 1) + N(1, 0.1) + N( 0.5, 0.05)) and the uniform distribution over [0, 1] Table 1: descriptions and sources for all data sets considered in this work. ABALONE N = 2088, D = 8 0.2 0.4 0.6 0.8 1 PRECIPITATION N = 2888, D = 2 5 10 2 0.1 0.15 PUMADYN N = 4096, D = 32 0.2 0.4 0.6 0.8 1 MPG N = 196, D = 7 CG KMCG VFE FITC POLETELECOMM N = 7499, D = 26 ELEVATORS N = 8299, D = 17 0.2 0.4 0.6 0.8 1 AILERONS N = 6874, D = 39 TOY N = 200, D = 1 time in s time in s time in s Figure 9: progression of the relative error εf over training time for different data sets using the squared-exponential kernel (Eq. 18). The shaded area visualizes minimum and maximum over all baseline runs. A cross denotes the end of a crashed run. 10 10 1000 1000 100 100 100 30 30 30 30 CG KMCG VFE FITC time in s time in s time in s time in s Figure 10: progression of the relative error εf over training time for different data sets using the squared-exponential kernel (Eq. 18). The shaded area visualizes minimum and maximum over all baseline runs. A cross denotes the end of a crashed run. It may seem surprising that the runs on the 100 100 100 data set take more than twice as long. By chance, the data set contains more extreme values in the kernel matrix, i.e. smaller than 1e 50. Multiplication with these elements takes more time. M = 289 354 409 ABALONE N = 2088, D = 8 M = 340 417 481 PRECIPITATION N = 2888, D = 2 M = 405 496 573 PUMADYN N = 4096, D = 32 M = 89 109 126 MPG N = 196, D = 7 CG KMCG VFE FITC M = 548 671 775 POLETELECOMM N = 7499, D = 26 M = 577 706 815 ELEVATORS N = 8299, D = 17 20 40 60 80 M = 525 643 742 AILERONS N = 6874, D = 39 20 40 60 80 M = 90 110 127 TOY N = 200, D = 1 20 40 60 80 CG-steps P CG-steps P CG-steps P Figure 11: progression of the relative error εf as a function of the number of iterations of baseline and KMCG for different data sets using the Mat ern kernel (Eq. 19). The bottom axis is the number of CG-steps, which is the same for all plots. Therefore, this axis is visible only in the last row. The top axis denotes the number of inducing inputs used by the baseline methods. The shaded area visualizes minimum and maximum over all baseline runs. A cross denotes the end of a crashed run. M = 289 354 409 ABALONE N = 2088, D = 8 M = 340 417 481 PRECIPITATION N = 2888, D = 2 M = 405 496 573 PUMADYN N = 4096, D = 32 M = 89 109 126 MPG N = 196, D = 7 FITC VFE KMCG M = 548 671 775 POLETELECOMM N = 7499, D = 26 M = 577 706 815 ELEVATORS N = 8299, D = 17 20 40 60 80 M = 525 643 742 AILERONS N = 6874, D = 39 20 40 60 80 M = 90 110 127 TOY N = 200, D = 1 20 40 60 80 CG-steps P CG-steps P CG-steps P Figure 12: progression of the relative error of the variance εvar as a function of the number of iterations of baseline and KMCG for different data sets using the Mat ern kernel (Eq. 19). The bottom axis is the number of CG-steps, which is the same for all plots. Therefore, this axis is visible only in the last row. The top axis denotes the number of inducing inputs used by the baseline methods. The shaded area visualizes minimum and maximum over all baseline runs. A cross denotes the end of a crashed run. M = 289 354 409 ABALONE N = 2088, D = 8 M = 340 417 481 PRECIPITATION N = 2888, D = 2 M = 405 496 573 PUMADYN N = 4096, D = 32 M = 89 109 126 MPG N = 196, D = 7 FITC VFE KMCG M = 548 671 775 POLETELECOMM N = 7499, D = 26 M = 577 706 815 ELEVATORS N = 8299, D = 17 20 40 60 80 M = 525 643 742 AILERONS N = 6874, D = 39 20 40 60 80 M = 90 110 127 TOY N = 200, D = 1 20 40 60 80 CG-steps P CG-steps P CG-steps P Figure 13: progression of the relative error of the evidence εev as a function of the number of iterations of baseline and KMCG for different data sets using the Mat ern kernel (Eq. 19). The bottom axis is the number of CG-steps, which is the same for all plots. Therefore, this axis is visible only in the last row. The top axis denotes the number of inducing inputs used by the baseline methods. The shaded area visualizes minimum and maximum over all baseline runs. A cross denotes the end of a crashed run. ABALONE N = 2088, D = 8 PRECIPITATION N = 2888, D = 2 PUMADYN N = 4096, D = 32 0.2 0.4 0.6 0.8 1 MPG N = 196, D = 7 CG KMCG VFE FITC POLETELECOMM N = 7499, D = 26 0.5 1 1.5 2 ELEVATORS N = 8299, D = 17 5 10 2 0.1 0.15 AILERONS N = 6874, D = 39 TOY N = 200, D = 1 time in s time in s time in s Figure 14: progression of the relative error εf over training time for different data sets using the Mat ern kernel (Eq. 19). The shaded area visualizes minimum and maximum over all baseline runs. A cross denotes the end of a crashed run. Bartels and Hennig E.3. Instability of Textbook Conjugate Gradients The experiments in Section 4, where carried out by running conjugate gradients with full reorthogonalization. Fig. 15 demonstrates that for the problems under consideration, the textbook version of conjugate gradients is not sufficiently numerically stable.13 With vanilla conjugate gradients in the background, KMCG can run only for a couple of steps before the Cholesky decomposition of S KS fails. Benoit. Note sˆure une m ethode de r esolution des equations normales provenant de l application de la m ethode des moindres carr es a un syst eme d equations lin eaires en nombre inf erieure a celui des inconnues. Application de la m ethode a la r esolution d un syst eme d efini d equations lin eaires. (Proc ed e du Commandant Cholesky). Bulletin Geodesique, 7(1), 1924. Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. Rui Camachol. Inducing models of human control skills. In Claire N edellec and C eline Rouveirol, editors, Proceedings of the 10th European Conference on Machine Learning, 1998. Krzysztof Chalupka, Williams, Christopher K. I., and Iain Murray. A framework for evaluating approximation methods for Gaussian process regression. Journal of Machine Learning Research, 14(1), 2013. Lehel Csat o and Manfred Opper. Sparse on-line Gaussian processes. Neural Computation, 14(3), 2002. Alexander Davies. Effective implementation of Gaussian process regression for machine learning. Ph D thesis, University of Cambridge, 2015. Kun Dong, David Eriksson, Hannes Nickisch, David Bindel, and Andrew G Wilson. Scalable log determinants for Gaussian process kernel learning. In Advances in Neural Information Processing Systems, 2017. Dheeru Dua and Casey Graff. UCI machine learning repository, 2019. URL http://archive. ics.uci.edu/ml. Maurizio Filippone and Raphael Engler. Enabling scalable stochastic gradient-based inference for Gaussian processes by employing the Unbiased LInear System Solv Er (ULISSE). In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. Gene H. Golub and Charles F. Van Loan. Matrix computations. Johns Hopkins University Press, 4 edition, 2013. 13. Approved by the editor, this figure differs from the accepted revision, after correcting an error. ABALONE N = 2088, D = 8 PRECIPITATION N = 2888, D = 2 PUMADYN N = 4096, D = 32 MPG N = 196, D = 7 KMCG(CG) CG FOM KMCG(FOM) POLETELECOMM N = 7499, D = 26 ELEVATORS N = 8299, D = 17 20 40 60 80 AILERONS N = 6874, D = 39 20 40 60 80 TOY N = 200, D = 1 20 40 60 80 CG-steps P CG-steps P CG-steps P Figure 15: progression of the relative error εf over 100 CG-steps for different data sets using the squared exponential kernel (Eq. (18)), comparing CG and FOM. The shaded area visualizes minimum and maximum over all baseline runs. A cross denotes the end of a crashed run. Bartels and Hennig Philipp Hennig. Probabilistic interpretation of linear solvers. SIAM Journal on Optimization, 25(1), 2015. Philipp Hennig and Martin Kiefel. Quasi-Newton methods a new direction. Journal of Machine Learning Research, 14, March 2013. Philipp Hennig, Michael A. Osborne, and Mark Girolami. Probabilistic numerics and uncertainty in computations. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 2015. James Hensman, Nicolas Durrande, and Arno Solin. Variational Fourier features for Gaussian processes. Journal of Machine Learning Research, 18(151), 2018. Magnus R. Hestenes and Eduard Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6), 1952. Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 1970. George S. Kimeldorf and Grace Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics, 1970. Miguel L azaro-Gredilla, Joaquin Qui nonero-Candela, Carl E. Rasmussen, and An ıbal R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11, 2010. Quoc Le, Tamas Sarlos, and Alexander Smola. Fastfood - computing Hilbert space expansions in loglinear time. In Proceedings of the 28th International Conference on Machine Learning, 2013. Charles F. Van Loan. The ubiquitous Kronecker product. Journal of Computational and Applied Mathematics, 123(1), 2000. Numerical Analysis 2000. Vol. III: Linear Algebra. Jan R. Magnus and Heinz Neudecker. The elimination matrix: Some lemmas and applications. SIAM Journal on Algebraic Discrete Methods, 1(4), 1980. doi: 10.1137/0601049. Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley, second edition, 1999. Georges Matheron. The intrinsic random functions and their applications. Advances in applied probability, 1973. Warwick J. Nash, Tracy L. Sellers, Simon R. Talbot, Andrew J. Cawthorn, and Wes B. Ford. The population biology of abalone (haliotis species) in Tasmania. 1, blacklip abalone (h. rubra) from the north coast and the islands of bass strait. Technical Report 48, Sea Fisheries Division, Marine Research Laboratories - Taroona, Department of Primary Industry and Fisheries, Tasmania, 1994. URL https://trove.nla.gov.au/work/11326142. Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer Verlag, 1999. Conjugate Gradients for Kernel Machines GeoffPleiss, Jacob Gardner, Kilian Weinberger, and Andrew G. Wilson. Constant-time predictive distributions for Gaussian processes. In Proceedings of the 35th International Conference on Machine Learning, 2018. John R. Quinlan. Combining instance-based and model-based learning. In Proceedings of the 10th International Conference on International Conference on Machine Learning, 1993. Joaquin Qui nonero-Candela and Carl E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6, 2005. Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, 2008. Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems 23, 2009. Carl E. Rasmussen and Christopher K.I. Williams. Gaussian Processes for Machine Learning. MIT, 2006. Carl Edward Rasmussen and Hannes Nickisch. Gaussian processes for machine learning (gpml) toolbox. Journal of Machine Learning Research, 11, 2010. Yousef Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, second edition, 2003. Horst D. Simon. Analysis of the symmetric lanczos algorithm with reorthogonalization methods. Linear Algebra and its Applications, 61, 1984. John Skilling. Bayesian numerical analysis. In Jr W. T. Grandy and P. W. Milonni, editors, Physics and Probability: Essays in Honor of Edwin T. Jaynes. 1993. Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Sch olkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18. 2006. Edward Snelson and Zoubin Ghahramani. Local and global sparse Gaussian process approximations. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, 2007. Arno Solin and Simo S arkk a. Hilbert space methods for reduced-rank Gaussian process regression, 2014. URL https://arxiv.org/abs/1401.5508v1. Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, 2009a. Michalis K. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, 2009b. Bartels and Hennig Giancarlo F. Trecate, Christopher K. I. Williams, and Manfred Opper. Finite-dimensional approximation of Gaussian processes. In Advances in Neural Information Processing Systems 2, 1999. Richard E. Turner. Statistical Models for Natural Sounds. Ph D thesis, University College London, 2010. Jarno Vanhatalo and Aki Vehtari. Modelling local and global phenomena with sparse Gaussian processes. In David Mc Allester and Petri Myllym aki, editors, UAI 2008, Twenty Fourth Conference on Uncertainty in Artificial Intelligence, Helsinki, Finland, July 9-12, 2008, 2008. Grace Wahba. Spline models for observational data. Number 59 in CBMS-NSF Regional Conferences series in applied mathematics. SIAM, 1990. Christian Walder, Kwang I. Kim, and Bernhard Sch olkopf. Sparse multiscale Gaussian process regression. In Proceedings of the 25th International Conference on Machine Learning, 2008. Sam Waugh. Extending and benchmarking Cascade-Correlation. Ph D thesis, University of Tasmania, 1995. Sholom M. Weiss and Nitin Indurkhya. Rule-based machine learning methods for functional prediction. Journal of Artificial Intelligence Research, 3(1), 1995. Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on International Conference on Machine Learning, 2011. Christopher K. I. Williams and Matthias Seeger. Using the Nystr om method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, 2001. Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (kiss-gp). In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37, 2015. Andrew G. Wilson, Elad Gilboa, Arye Nehorai, and John P. Cunningham. Fast kernel learning for multidimensional pattern extrapolation. In Advances in Neural Information Processing Systems 27. 2014. Feng Yan and Yuan Qi. Sparse Gaussian process regression via l1 penalization. In Proceedings of the 27th International Conference on Machine Learning, 2010. Huaiyu Zhu, Christopher K. I. Williams, Richard J. Rohwer, and Michal Morciniec. Gaussian regression and optimal finite dimensional linear models. In C. M. Bishop, editor, Neural Networks and Machine Learning. 1998.