# Matérn Gaussian Processes on Riemannian Manifolds

Viacheslav Borovitskiy¹,⁴  Alexander Terenin²  Peter Mostowsky¹  Marc Peter Deisenroth³

¹St. Petersburg State University  ²Imperial College London  ³University College London  ⁴St. Petersburg Department of Steklov Mathematical Institute of Russian Academy of Sciences

Equal contribution. Correspondence to: VIACHESLAV.BOROVITSKIY@GMAIL.COM, A.TERENIN17@IMPERIAL.AC.UK, and PMOSTOWSKY@GMAIL.COM. Code available at HTTPS://GITHUB.COM/SPBU-MATH-CS/RIEMANNIAN-GAUSSIAN-PROCESSES and HTTPS://GITHUB.COM/ATERENIN/SPARSEGAUSSIANPROCESSES.JL.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

Gaussian processes are an effective model class for learning unknown functions, particularly in settings where accurately representing predictive uncertainty is of key importance. Motivated by applications in the physical sciences, the widely-used Matérn class of Gaussian processes has recently been generalized to model functions whose domains are Riemannian manifolds, by re-expressing said processes as solutions of stochastic partial differential equations. In this work, we propose techniques for computing the kernels of these processes on compact Riemannian manifolds via spectral theory of the Laplace–Beltrami operator in a fully constructive manner, thereby allowing them to be trained via standard scalable techniques such as inducing point methods. We also extend the generalization from the Matérn to the widely-used squared exponential Gaussian process. By allowing Riemannian Matérn Gaussian processes to be trained using well-understood techniques, our work enables their use in mini-batch, online, and non-conjugate settings, and makes them more accessible to machine learning practitioners.

1 Introduction

Gaussian processes (GPs) are a widely-used class of models for learning an unknown function within a Bayesian framework. They are particularly attractive for use within decision-making systems, e.g. in Bayesian optimization [36] and reinforcement learning [10, 11], where well-calibrated uncertainty is crucial for enabling the system to balance trade-offs, such as exploration and exploitation. A GP is specified through its mean and covariance kernel. The Matérn family is a widely-used class of kernels, often favored in Bayesian optimization due to its ability to specify smoothness of the GP by controlling differentiability of its sample paths. Throughout this work, we view the widely-used squared exponential kernel as a Matérn kernel with infinite smoothness.

Motivated by application areas such as robotics [4, 22] and climate science [5], recent work has sought to generalize a number of machine learning algorithms from the vector space to the manifold setting. This allows one to work with data that lives on spheres, cylinders, and tori, for example. To define such a GP, one needs to define a positive semi-definite kernel on those spaces. In the Riemannian setting, as a simple candidate generalization of the Matérn or squared exponential kernel, one can consider replacing the Euclidean distance in the formula with the Riemannian geodesic distance. Unfortunately, this approach leads to ill-defined kernels in many cases of interest [14]. An alternative approach was recently proposed by Lindgren et al. [25], who adopt a perspective introduced in the pioneering work of Whittle [42] and define a Matérn GP to be the solution of
a certain stochastic partial differential equation (SPDE) driven by white noise. This approach generalizes naturally to the Riemannian setting, but is cumbersome to work with in practice because it entails solving the SPDE numerically. In particular, setting up an accurate finite element solver can become an involved process, especially for certain smoothness values [2, 3]. This also prevents one from easily incorporating recent advances in scalable GPs, such as sparse inducing point methods [21, 41], into the framework. This in turn impedes one from easily employing mini-batch training, online training, non-Gaussian likelihoods, or incorporating GPs as differentiable components within larger models.

In this work, we extend Matérn GPs to the Riemannian setting in a fully constructive manner, by introducing Riemannian analogues of the standard technical tools one uses when working with GPs in Euclidean spaces. To achieve this, we first study the special case of the d-dimensional torus $\mathbb{T}^d$. Using ideas from abstract harmonic analysis, we view GPs on the torus as periodic GPs on $\mathbb{R}^d$, and derive expressions for the kernel and spectral measure of a Matérn GP in this case. Building on this intuition, we generalize the preceding ideas to general compact Riemannian manifolds without boundary. Using insights from harmonic analysis induced by the Laplace–Beltrami operator, we develop techniques for computing the kernel and generalized spectral measure of a Matérn GP in this setting. These expressions enable computations via standard GP approaches, such as Fourier feature or sparse variational methods, thereby allowing practitioners to easily deploy familiar techniques in the Riemannian setting. We conclude by showcasing how to employ the proposed techniques through a set of examples.

2 Gaussian processes

Let $X$ be a set, and let $f : X \to \mathbb{R}$ be a random function. We say that $f \sim \mathrm{GP}(\mu, k)$ if, for any $n$ and any finite set of points $\mathbf{x} \in X^n$, the random vector $\mathbf{f} = f(\mathbf{x})$ is multivariate Gaussian with prior mean vector $\boldsymbol{\mu} = \mu(\mathbf{x})$ and covariance matrix $\mathbf{K}_{\mathbf{xx}} = k(\mathbf{x}, \mathbf{x})$. We henceforth, without loss of generality, set the mean function to be zero. Given a set of training observations $(x_i, y_i)$, we let $y_i = f(x_i) + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$. Under the prior $f \sim \mathrm{GP}(0, k)$, the posterior distribution $f \mid \mathbf{y}$ is another GP, with mean and covariance

$$\mathbb{E}(f \mid \mathbf{y}) = \mathbf{K}_{(\cdot)\mathbf{x}}(\mathbf{K}_{\mathbf{xx}} + \sigma^2\mathbf{I})^{-1}\mathbf{y}, \qquad \mathrm{Cov}(f \mid \mathbf{y}) = \mathbf{K}_{(\cdot,\cdot)} - \mathbf{K}_{(\cdot)\mathbf{x}}(\mathbf{K}_{\mathbf{xx}} + \sigma^2\mathbf{I})^{-1}\mathbf{K}_{\mathbf{x}(\cdot)} \quad (1)$$

where $(\cdot)$ denotes an arbitrary set of test locations. The posterior can also be written

$$(f \mid \mathbf{y})(\cdot) = f(\cdot) + \mathbf{K}_{(\cdot)\mathbf{x}}(\mathbf{K}_{\mathbf{xx}} + \sigma^2\mathbf{I})^{-1}(\mathbf{y} - f(\mathbf{x}) - \boldsymbol{\varepsilon}) \quad (2)$$

where equality holds in distribution [43]. This expression allows one to sample from the posterior by first sampling from the prior, and then transforming the resulting draws into posterior samples.

On $X = \mathbb{R}^d$, one popular choice of kernel is the Matérn family with parameters $\sigma^2, \kappa, \nu$, defined as

$$k_\nu(x, x') = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\sqrt{2\nu}\,\frac{\|x - x'\|}{\kappa}\right)^{\nu} K_\nu\!\left(\sqrt{2\nu}\,\frac{\|x - x'\|}{\kappa}\right) \quad (3)$$

where $K_\nu$ is the modified Bessel function of the second kind [17]. The parameters of this kernel have a natural interpretation: $\sigma^2$ directly controls variability of the GP, $\kappa$ directly controls the degree of dependence between nearby data points, and $\nu$ directly controls mean-square differentiability of the GP [29]. As $\nu \to \infty$, the Matérn kernel converges to the widely-used squared exponential kernel

$$k_\infty(x, x') = \sigma^2 \exp\!\left(-\frac{\|x - x'\|^2}{2\kappa^2}\right) \quad (4)$$

which induces an infinitely mean-square differentiable GP. For a bivariate function $k : X \times X \to \mathbb{R}$ to be a kernel, it must be positive semi-definite, in the sense that for any $n$ and any $\mathbf{x} \in X^n$, the kernel matrix $\mathbf{K}_{\mathbf{xx}}$ is positive semi-definite.
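For reference, the following is a minimal numerical sketch of the Euclidean Matérn kernel (3), its squared exponential limit (4), and the exact posterior (1). It is our own illustrative code, not the authors' released implementation, and the helper name `matern_kernel` is ours.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import gamma, kv

def matern_kernel(x, y, sigma2=1.0, kappa=0.5, nu=1.5):
    """Matern kernel on R^d, equation (3); nu=np.inf gives the squared exponential (4)."""
    r = cdist(np.atleast_2d(x), np.atleast_2d(y))
    if np.isinf(nu):
        return sigma2 * np.exp(-r**2 / (2 * kappa**2))
    scaled = np.sqrt(2 * nu) * r / kappa
    K = sigma2 * (2 ** (1 - nu) / gamma(nu)) * scaled**nu * kv(nu, scaled)
    K[r == 0] = sigma2  # limit of the expression as r -> 0
    return K

# Exact GP regression posterior, equation (1); pathwise sampling (2) can be layered on top.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.standard_normal(20)
xs = np.linspace(-3, 3, 200)[:, None]
noise = 0.1**2

Kxx = matern_kernel(x, x) + noise * np.eye(len(x))
Ksx = matern_kernel(xs, x)
mean = Ksx @ np.linalg.solve(Kxx, y)
cov = matern_kernel(xs, xs) - Ksx @ np.linalg.solve(Kxx, Ksx.T)
```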
For $X = \mathbb{R}^d$, a translation-invariant kernel $k(x, x') = k(x - x')$ is called stationary, and can be characterized via Bochner's theorem. This result states that a translation-invariant bivariate function is positive definite if and only if it is the Fourier transform of a finite non-negative measure $\rho$, termed the spectral measure. This measure is an important technical tool for constructing kernels [29], and for practical approximations such as Fourier feature basis expansions [20, 28].

Figure 1: The Matérn kernel $k_{1/2}(x, \cdot)$, defined on a circle, sphere, and dragon. The point $x$ is marked with a red dot. The height of the solid line and the color, respectively, give the value of the kernel.

2.1 A no-go theorem for kernels on manifolds

We are interested in generalizing the Matérn family from the vector space setting to a compact Riemannian manifold $(M, g)$ such as the sphere or torus. One might hope to achieve this by replacing Euclidean norms with geodesic distances in (3) and (4). In the latter case, this amounts to defining

$$k(x, x') = \sigma^2 \exp\!\left(-\frac{d_g(x, x')^2}{2\kappa^2}\right) \quad (5)$$

where $d_g$ is the geodesic distance with respect to $g$ on $M$. Unfortunately, one can prove this is not generally a well-defined kernel.

Theorem 1. Let $(M, g)$ be a complete, smooth Riemannian manifold without boundary, with associated geodesic distance $d_g$. If the geodesic squared exponential kernel (5) is positive semi-definite for all $\kappa > 0$, then $M$ is isometric to a Euclidean space.

Proof. Feragen et al. [14, Theorem 2].

Since Euclidean space is not compact, this immediately implies that (5) is not a well-defined kernel on any compact Riemannian manifold without boundary. We therefore call (5) and its finite-smoothness analogues the naïve generalization. In spite of this issue, the naïve generalization is usually still positive semi-definite for some $\kappa$, and it has been used in a number of applied areas [22]. Feragen and Hauberg [13] proposed a number of open problems arising from these issues. In Section 3, we show that, on the torus, the naïve generalization is locally correct in a sense made precise in the sequel. We now turn to an alternative approach, which gives well-defined kernels in the general case.

2.2 Stochastic partial differential equations

Whittle [42] has shown that Matérn GPs on $X = \mathbb{R}^d$ satisfy the stochastic partial differential equation

$$\left(\frac{2\nu}{\kappa^2} - \Delta\right)^{\frac{\nu}{2} + \frac{d}{4}} f = \mathcal{W} \quad (6)$$

for $\nu < \infty$, where $\Delta$ is the Laplacian and $\mathcal{W}$ is Gaussian white noise re-normalized by a certain constant. One can show using the same argument that the limiting squared exponential GP satisfies

$$e^{-\frac{\kappa^2}{4}\Delta} f = \mathcal{W} \quad (7)$$

where $e^{\frac{\kappa^2}{4}\Delta}$ is the (rescaled) heat semigroup [12, 18]. This viewpoint on GPs has recently been reintroduced in the statistics literature by Lindgren et al. [25], and a number of authors, including Särkkä et al. [33] and Simpson et al. [35], have used it to develop computational techniques, notably in the popular INLA package [32]. One advantage of the SPDE definition is that generalizing it to the Riemannian setting is straightforward: one simply replaces $\Delta$ with the Laplace–Beltrami operator and $\mathcal{W}$ with the canonical white noise process with respect to the Riemannian volume measure. The kernels of these GPs, computed in the sequel, are illustrated in Figure 1. Unfortunately, the SPDE definition is somewhat non-constructive: it is not immediately clear how to compute the kernel, and even less clear how to generalize familiar tools to this setting.
In practice, this restricts one to working with PDE-theoretic discretization techniques, such as Galerkin finite element methods, the efficiency of which depends heavily on the smoothness of $f$, and which can require significant hand-tuning to ensure accuracy. It also precludes one from working in non-conjugate settings, such as classification, or from using recently-proposed techniques for scalable GPs via sparse inducing point methods [20, 21, 41], as these require one to either compute the kernel point-wise, or compute the spectral measure, or both.

Figure 2: The distances being considered in definitions (9) and (10): $|x - x'|$, $|x - x' \pm 1|$, $|x - x' \pm 2|$, and so on.

2.3 State of affairs and contribution

In this work, our aim is to generalize the standard theoretical tools available for GPs on $\mathbb{R}^d$ to the Riemannian setting. Our strategy is to first study the problem for the special case of a d-dimensional torus. Here, we provide expressions for the kernel of a Matérn GP in the sense of Whittle [42] via periodic summation, which yields a series whose first term is the naïve generalization. Building on this intuition, we develop a framework using Laplace–Beltrami eigenfunctions that allows us to provide expressions for the kernel and generalized spectral measure of a Matérn GP on a general compact Riemannian manifold without boundary. The framework is fully constructive and compatible with sparse GP techniques for scalable training.

A number of closely related ideas, beyond those described in the preceding sections, have been considered in various literatures. Solin and Särkkä [38] used ideas based on spectral theory of the Laplace–Beltrami operator to approximate stationary covariance functions on bounded domains of Euclidean spaces. These ideas were applied, for instance, to model ambient magnetic fields using Gaussian processes by Solin et al. [37]. An analog of the expression we provide in equation (18) for the Riemannian Matérn kernel was concurrently proposed as a practical GP model by Coveney et al. [8]; in this work, we derive said expression from the SPDE formulation of Matérn GPs. Finally, the Riemannian squared exponential kernel, also sometimes called the heat or diffusion kernel, has been studied by Gao et al. [15]. We connect these ideas with stochastic partial differential equations.

In this work, we concentrate on Gaussian processes $f : M \to \mathbb{R}$ whose domain is a Riemannian manifold. We do not study models $f : \mathbb{R} \to M$ where the range is a Riemannian manifold; this setting is explored by Mallasto and Feragen [27].

3 A first example: the d-dimensional torus

To begin our analysis and build intuition, we study the d-dimensional torus $\mathbb{T}^d$, which is defined as the product manifold $\mathbb{T}^d = S^1 \times \ldots \times S^1$, where $S^1$ denotes a unit circle.² Since functions on a circle can be thought of as periodic functions on $\mathbb{R}$, and similarly for $\mathbb{T}^d$ and $\mathbb{R}^d$, defining a kernel on a torus is equivalent to defining a periodic kernel. For a general function $f : \mathbb{R}^d \to \mathbb{R}$, one can transform it into a function $g : \mathbb{T}^d \to \mathbb{R}$ by periodic summation

$$g(x_1, \ldots, x_d) = \sum_{n \in \mathbb{Z}^d} f(x_1 + n_1, \ldots, x_d + n_d) \quad (8)$$

where $x_j \in [0, 1)$ is identified with the angle $2\pi x_j$ and the point $\exp(2\pi i x_j) \in S^1$. Define addition of two points in $S^1$ by the addition of said numbers modulo 1, and define addition in $\mathbb{T}^d$ component-wise.

²Note that $\mathbb{T}^2 = S^1 \times S^1$ is diffeomorphic but not isometric to the usual donut-shaped torus whose metric is induced by embedding in $\mathbb{R}^3$. This is important, because it is the Riemannian metric structure that gives rise to the Laplace–Beltrami operator, and hence to the generalized Matérn and squared exponential kernels. Diffeomorphisms do not necessarily preserve metric structure, so they may not preserve kernels.
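As a quick numerical illustration of (8), the sketch below (our own minimal code with hypothetical helper names, not taken from the paper's released repositories) builds a kernel on the circle $\mathbb{T}^1 = [0, 1)$ by periodically summing the Euclidean Matérn-1/2 kernel, and checks that the resulting Gram matrix is numerically positive semi-definite, which is exactly the property established in the next paragraph.

```python
import numpy as np

def matern12_r(r, sigma2=1.0, kappa=0.2):
    """Euclidean Matern-1/2 (exponential) kernel as a function of distance."""
    return sigma2 * np.exp(-np.abs(r) / kappa)

def periodic_summation_kernel(x, y, base=matern12_r, n_terms=25):
    """Kernel on the circle T^1 = [0, 1) obtained by summing the base kernel
    over integer shifts of the difference x - y, as in equation (8)."""
    d = x[:, None] - y[None, :]
    return sum(base(d + n) for n in range(-n_terms, n_terms + 1))

# The resulting Gram matrix is positive semi-definite up to floating-point error.
x = np.random.default_rng(0).uniform(0.0, 1.0, size=50)
K = periodic_summation_kernel(x, x)
print(np.linalg.eigvalsh(K).min())  # smallest eigenvalue is >= -1e-10
```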
Periodic summation preserves positive-definiteness, since it preserves positivity of the Fourier transform, which by Bochner's theorem is equivalent to positive-definiteness; see Schölkopf and Smola [34, Section 4.4.4] for a formal proof. This gives an easy way to construct positive-definite kernels on $\mathbb{T}^d$. In particular, we can generalize Matérn and squared exponential GPs from $\mathbb{R}^d$ to $\mathbb{T}^d$ by defining

$$k_\nu(x, x') = \frac{\sigma^2}{C_\nu} \sum_{n \in \mathbb{Z}^d} \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\sqrt{2\nu}\,\frac{\|x - x' + n\|}{\kappa}\right)^{\nu} K_\nu\!\left(\sqrt{2\nu}\,\frac{\|x - x' + n\|}{\kappa}\right) \quad (9)$$

where $C_{(\cdot)}$ is a constant given in Appendix B to ensure $k_{(\cdot)}(x, x) = \sigma^2$, and

$$k_\infty(x, x') = \frac{\sigma^2}{C_\infty} \sum_{n \in \mathbb{Z}^d} \exp\!\left(-\frac{\|x - x' + n\|^2}{2\kappa^2}\right) \quad (10)$$

respectively. We prove that these are the covariance kernels of the SPDEs introduced previously.

Proposition 2. The Matérn (squared exponential) kernel in (9) (resp. (10)) is the covariance kernel of the Matérn (resp. squared exponential) Gaussian process in the sense of Whittle [42].

Proof. Appendix C.

This result offers an intuitive explanation for why the naïve generalization based on the geodesic distance might fail to be positive semi-definite on non-Euclidean spaces for all length scales, yet work well for smaller length scales: on $\mathbb{T}^d$, it is locally correct in the sense that it is equal to the first term of the periodic summation (9). To obtain the full generalization, one needs to take into account not just geodesic paths, but geodesic-like paths which include loops around the space: a Matérn GP incorporates the global topological structure of its domain. For the circle, these paths are visualized in Figure 2. For spaces where this structure is even more elaborate, definitions based purely on geodesic distances may not suffice to ensure positive semi-definiteness or good numerical behavior. We conclude by presenting a number of practical formulas for Matérn kernels on the circle.

Example 3 (Circle). Take $M = S^1$. For $\nu = \infty$, the kernel and spectral measure are

$$k_\infty(x, x') = \frac{\sigma^2}{C_\infty}\, \vartheta_3\!\big(\pi(x - x'),\, e^{-2\pi^2\kappa^2}\big), \qquad \rho_\infty(n) = \frac{\sigma^2}{C_\infty}\, e^{-2\pi^2\kappa^2 n^2} \quad (11)$$

where $n \in \mathbb{Z}$, $\vartheta_3(\cdot, \cdot)$ is the third Jacobi theta function [1, equation 16.27.3], and $C_\infty = \vartheta_3(0, e^{-2\pi^2\kappa^2})$. This kernel is normalized to have variance $\sigma^2$.

Example 4 (Circle). Take $M = S^1$. For $\nu = 1/2$, the kernel and spectral measure are

$$k_{1/2}(x, x') = \frac{\sigma^2}{C_{1/2}} \cosh\!\left(\frac{|x - x'| - 1/2}{\kappa}\right), \qquad \rho_{1/2}(n) = \frac{2\sigma^2 \sinh\!\left(\frac{1}{2\kappa}\right)}{C_{1/2}\,\kappa} \left(\kappa^{-2} + 4\pi^2 n^2\right)^{-1} \quad (12)$$

where $C_{1/2} = \cosh\!\left(\frac{1}{2\kappa}\right)$. This kernel is normalized to have variance $\sigma^2$. A derivation and a more general formula, valid for $\nu = 1/2 + n$, $n \in \mathbb{N}$, can be found in Appendix B. Note that these spectral measures are discrete, as the Laplace–Beltrami operator has discrete spectrum. Finally, we give the Fourier feature approximation [20, 28] of the GP prior on $\mathbb{T}^1 = S^1$, which is

$$f(x) \approx \sum_{n=0}^{N} \sqrt{\rho_\nu(n)}\, \big(w_{n,1} \cos(2\pi n x) + w_{n,2} \sin(2\pi n x)\big), \qquad w_{n,j} \sim N(0, 1). \quad (13)$$

We have defined Matérn and squared exponential GPs on $\mathbb{T}^d$ and given expressions for the kernel, spectral measure, and Fourier features on $\mathbb{T}^1$. With sharpened intuition, we now study the general case.
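As a sanity check on these circle formulas, the short sketch below (ours, with hypothetical helper names) compares the closed-form Matérn-1/2 kernel of Example 4 against its truncated spectral series, and draws one approximate prior sample in the spirit of (13); folding a factor of 2 into the weights for $n \geq 1$ is our own convention for the two-sided spectral measure.

```python
import numpy as np

sigma2, kappa = 1.0, 0.2
C = np.cosh(1 / (2 * kappa))                         # normalizing constant C_{1/2}

def k12(x, y):
    """Closed-form Matern-1/2 kernel on the circle, as in equation (12)."""
    d = np.abs(x - y) % 1.0
    return sigma2 / C * np.cosh((d - 0.5) / kappa)

def rho12(n):
    """Spectral measure of the same kernel, as in equation (12)."""
    return (2 * sigma2 * np.sinh(1 / (2 * kappa)) / (C * kappa)) \
           / (kappa ** -2 + 4 * np.pi ** 2 * n ** 2)

# The truncated spectral series reproduces the closed form:
# k(d) = rho(0) + 2 * sum_{n>=1} rho(n) cos(2 pi n d).
d = np.linspace(0, 1, 101)
n = np.arange(1, 500)
series = rho12(0) + 2 * np.sum(rho12(n)[:, None] * np.cos(2 * np.pi * n[:, None] * d), axis=0)
print(np.max(np.abs(series - k12(d, 0.0))))          # small truncation error

# One Fourier-feature prior sample in the spirit of equation (13).
rng = np.random.default_rng(0)
w = rng.standard_normal((len(n), 2))
f = np.sqrt(rho12(0)) * rng.standard_normal() + np.sum(
    np.sqrt(2 * rho12(n))[:, None]
    * (w[:, :1] * np.cos(2 * np.pi * n[:, None] * d) + w[:, 1:] * np.sin(2 * np.pi * n[:, None] * d)),
    axis=0,
)
```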
4 Compact Riemannian manifolds

The arguments used in the preceding section are, at their core, based on ideas from abstract harmonic analysis connecting $\mathbb{R}^d$, $\mathbb{T}^d$, and $\mathbb{Z}^d$ as topological groups. This connection relies on the algebraic structure of groups, which does not exist on a general Riemannian manifold. As a result, different notions are needed to establish a suitable framework.

Figure 3: Examples of eigenfunctions of the Laplace–Beltrami operator on a circle, sphere, and dragon. For the circle, the value of the eigenfunction is given by the (signed) distance between the solid line and the dashed unit circle. For the sphere and dragon, the value of the eigenfunction is given by the color.

Let $(M, g)$ be a compact Riemannian manifold without boundary, and let $\Delta_g$ be the Laplace–Beltrami operator. Our aim is to compute the covariance kernel of the Gaussian processes solving the SPDEs (6) and (7) in this setting. Mathematically, this amounts to introducing an appropriate formalism so that one can calculate the desired expressions using spectral theory. We do this in a fully rigorous manner in Appendix D, while here we present the main ideas and results.

First, we discuss how the operators on the left-hand side of SPDEs (6) and (7) are defined. By compactness of $M$, $-\Delta_g$ admits a countable number of eigenvalues $\lambda_n$, which are non-negative and can be ordered to form a non-decreasing sequence with $\lambda_n \to \infty$ as $n \to \infty$. Moreover, the corresponding eigenfunctions form an orthonormal basis $\{f_n\}_{n \in \mathbb{Z}_+}$ of $L^2(M)$, and $-\Delta_g$ admits the representation

$$-\Delta_g f = \sum_{n=0}^{\infty} \lambda_n \langle f, f_n \rangle f_n \quad (14)$$

which is termed the Sturm–Liouville decomposition [6, 7]. This allows one to define the operators $\Phi(-\Delta_g)$ for a function $\Phi : [0, \infty) \to \mathbb{R}$, by replacing $\lambda_n$ with $\Phi(\lambda_n)$ in (14), and specifying appropriate function spaces as domain and range to ensure convergence of the series in a suitable sense. This idea is called functional calculus for the operator $-\Delta_g$. Using it, we define

$$\left(\frac{2\nu}{\kappa^2} - \Delta_g\right)^{\frac{\nu}{2} + \frac{d}{4}} f = \sum_{n=0}^{\infty} \left(\frac{2\nu}{\kappa^2} + \lambda_n\right)^{\frac{\nu}{2} + \frac{d}{4}} \langle f, f_n \rangle f_n \quad (15)$$

$$e^{-\frac{\kappa^2}{4}\Delta_g} f = \sum_{n=0}^{\infty} e^{\frac{\kappa^2}{4}\lambda_n} \langle f, f_n \rangle f_n. \quad (16)$$

Figure 3 illustrates the eigenfunctions $f_n$. Note that when $M = \mathbb{T}^d$, the orthonormal basis $\{f_n\}_{n \in \mathbb{Z}_+}$ consists of sines and cosines, and thus the corresponding functional calculus is defined in terms of standard Fourier series. This also agrees with the usual way of defining such operators in the Euclidean case using the Fourier transform.

Next, we proceed to define the remaining parts of the SPDEs. The theory of stochastic elliptic equations described in Lototsky and Rozovsky [26] gives an appropriate notion of white noise $\mathcal{W}$ for our setting, as well as a way to uniquely solve SPDEs of the form $Lf = \mathcal{W}$, where $L$ is a bounded linear bijection between a pair of Hilbert spaces. We show that the operators

$$\left(\frac{2\nu}{\kappa^2} - \Delta_g\right)^{\frac{\nu}{2} + \frac{d}{4}} : H^{\nu + d/2}(M) \to L^2(M), \qquad e^{-\frac{\kappa^2}{4}\Delta_g} : \mathcal{H}^{\kappa^2/2}(M) \to L^2(M) \quad (17)$$

are bounded and invertible, where $H^s(M)$ are appropriately defined Sobolev spaces on the manifold, and $\mathcal{H}^s(M)$ are the diffusion spaces studied by De Vito et al. [9]. We prove that the solutions of our SPDEs in the sense of Lototsky and Rozovsky [26] are Gaussian processes with kernels equal to the reproducing kernels of the spaces $H^{\nu + d/2}(M)$ and $\mathcal{H}^{\kappa^2/2}(M)$, which are given by De Vito et al. [9]. Summarizing, we get the following.

Theorem 5. Let $\lambda_n$ be the eigenvalues of $-\Delta_g$, and let $f_n$ be their respective eigenfunctions. The kernels of the Matérn and squared exponential GPs on $M$ in the sense of Whittle [42] are given by

$$k_\nu(x, x') = \frac{\sigma^2}{C_\nu} \sum_{n=0}^{\infty} \left(\frac{2\nu}{\kappa^2} + \lambda_n\right)^{-\nu - \frac{d}{2}} f_n(x) f_n(x') \quad (18)$$

$$k_\infty(x, x') = \frac{\sigma^2}{C_\infty} \sum_{n=0}^{\infty} e^{-\frac{\kappa^2}{2}\lambda_n} f_n(x) f_n(x') \quad (19)$$

where $C_{(\cdot)}$ are normalizing constants chosen so that the average variance³ over the manifold satisfies $\mathrm{vol}_g(M)^{-1} \int_M k_{(\cdot)}(x, x)\,\mathrm{d}x = \sigma^2$.

Proof. Appendix D.

³The marginal variance $k_{(\cdot)}(x, x)$ can depend on $x$, thus we normalize the kernel by the average variance.

Our attention now turns to the spectral measure. In the Euclidean case, the spectral measure, assuming sufficient regularity, is absolutely continuous: its Lebesgue density is given by the Fourier transform of the kernel.
In the case of $\mathbb{T}^d$, the spectral measure is discrete: its density with respect to the counting measure is given by the Fourier coefficients of the kernel. As in the case of the torus, for a compact Riemannian manifold the spectral measure is discrete: its density with respect to the counting measure is given by the generalized Fourier coefficients of the kernel with respect to the orthonormal basis $f_n(x) f_{n'}(x')$ of $L^2(M \times M)$. For Matérn and squared exponential GPs, these are

$$\rho_\nu(n) = \frac{\sigma^2}{C_\nu} \left(\frac{2\nu}{\kappa^2} + \lambda_n\right)^{-\nu - \frac{d}{2}}, \qquad \rho_\infty(n) = \frac{\sigma^2}{C_\infty}\, e^{-\frac{\kappa^2}{2}\lambda_n}. \quad (20)$$

This allows one to recover most tools used in the spectral theory of GPs. In particular, one can construct a regular Fourier feature approximation of the GPs by taking the top-$N$ eigenvalues and writing

$$f(x) \approx \sum_{n=0}^{N} \sqrt{\rho(n)}\, w_n f_n(x), \qquad w_n \sim N(0, 1). \quad (21)$$

Other kinds of Fourier feature approximations, such as random Fourier features, are also possible. We now illustrate an example in which these expressions simplify.

Example 6 (Sphere). Take $M = S^d$ to be the d-dimensional sphere. Then we have

$$k_\nu(x, x') = \sum_{n=0}^{\infty} c_{n,d}\, \rho_\nu(n)\, \mathcal{C}^{(d-1)/2}_n\!\big(\cos d_g(x, x')\big) \quad (22)$$

where $c_{n,d}$ are constants given in Appendix B, $\mathcal{C}^{(\cdot)}_n$ are the Gegenbauer polynomials, $d_g$ is the geodesic distance, and $\rho_\nu(n)$ can be expressed explicitly in terms of $\lambda_n = n(n + d - 1)$ using (20). See Appendix B for details on the corresponding Fourier feature approximation. A derivation with further details can be found in Appendix B. Similar expressions are available for many other manifolds where the Laplace–Beltrami eigenvalues and eigenfunctions are known.

4.1 Summary

We conclude by summarizing the presented method of computing the kernel of Riemannian Matérn Gaussian processes defined by SPDEs. The key steps are as follows; a minimal code sketch of this recipe is given at the end of this subsection.

1. Obtain the Laplace–Beltrami eigenpairs for the given manifold, either analytically or numerically. This step needs to be performed once in advance.

2. Approximate the kernel using a finite truncation of the infinite sums (18) or (19). This kernel approximation can be evaluated pointwise at any locations, fits straightforwardly into modern automatic differentiation frameworks, and is simple to work with.

The resulting truncation error will depend on the smoothness parameter $\nu$, the dimension $d$, and the eigenvalue growth rate, which is quantified by Weyl's law [44]. For $\nu < \infty$ convergence will be polynomial, and for $\nu = \infty$ it will be exponential. If $\sigma^2$ is trainable, the constant $C_\nu$ which normalizes the kernel by its average variance can generally be disregarded. If Fourier feature approximations of the prior are needed, for instance to apply the pathwise sampling technique of Wilson et al. [43], they are given by (21).
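The following sketch (our own illustrative code with hypothetical helper names, not the authors' released implementation) carries out the two summary steps on the circle, where the Laplace–Beltrami eigenpairs are known in closed form, and draws one Fourier-feature prior sample in the spirit of (21).

```python
import numpy as np

def matern_spectrum(lam, kappa=0.2, nu=1.5, d=1):
    """Spectral weights (2 nu / kappa^2 + lam)^(-nu - d/2) from (18) / (20);
    replace with np.exp(-kappa**2 / 2 * lam) for the squared exponential case (19)."""
    return (2 * nu / kappa ** 2 + np.asarray(lam)) ** (-nu - d / 2)

def truncated_kernel(x, y, eigvals, eigfuns, **spec_kwargs):
    """Finite truncation of equation (18); the normalizing constant C_nu is
    omitted and absorbed into a trainable sigma^2, as noted in Section 4.1."""
    Fx = np.stack([f(x) for f in eigfuns], axis=-1)          # shape (len(x), N)
    Fy = np.stack([f(y) for f in eigfuns], axis=-1)
    rho = matern_spectrum(eigvals, **spec_kwargs)
    return (Fx * rho) @ Fy.T                                 # sum_n rho(n) f_n(x) f_n(y)

# Step 1 on S^1, where the eigenpairs are analytic: eigenvalue (2 pi n)^2 with
# eigenfunctions sqrt(2) cos(2 pi n x) and sqrt(2) sin(2 pi n x) on [0, 1).
eigvals, eigfuns = [0.0], [lambda x: np.ones_like(x)]
for n in range(1, 101):
    eigvals += [(2 * np.pi * n) ** 2] * 2
    eigfuns += [lambda x, n=n: np.sqrt(2) * np.cos(2 * np.pi * n * x),
                lambda x, n=n: np.sqrt(2) * np.sin(2 * np.pi * n * x)]

# Step 2: truncated Matern-3/2 Gram matrix on S^1.
x = np.linspace(0.0, 1.0, 50)
K = truncated_kernel(x, x, eigvals, eigfuns)

# One Fourier-feature prior sample in the spirit of equation (21).
rho = matern_spectrum(eigvals)
w = np.random.default_rng(0).standard_normal(len(rho))
f = np.stack([fn(x) for fn in eigfuns], axis=-1) @ (np.sqrt(rho) * w)
```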
Figure 4: Visualization of the dynamical system's learned phase diagram, with position and momentum on the axes. Left: ground truth. Middle: posterior 95% intervals; we simulate 40 trajectories starting at the red dots, integrate the learned Hamilton's equations forward and backward in time until they approximately intersect other trajectories, and plot 95% intervals in phase space. Right: posterior samples for one trajectory; we simulate the trajectory beginning from the yellow dot, and plot mean and 95% intervals.

5 Illustrated Examples

Here we showcase two examples to illustrate the theory: dynamical system prediction and sample path visualization. We focus on simplified settings to present ideas in an easy-to-understand manner.

5.1 Dynamical system prediction

We illustrate how Riemannian squared exponential GPs can be used for predicting dynamical systems while respecting the underlying geometry of the configuration space the system is defined on. This is an important task in robotics, where GPs are often trained within a model-based reinforcement learning framework [10, 11]. Here, we consider a purely supervised setup, mimicking the model-learning inner loop of said framework.

For a prototype physical system, consider an ideal pendulum, whose configuration space is the circle $S^1$, and whose phase space is the cotangent bundle $T^*S^1$, which is isometric to the cylinder $S^1 \times \mathbb{R}$ equipped with the product metric. The equations of motion are given by Hamilton's equations, which are parameterized by the Hamiltonian $H : T^*S^1 \to \mathbb{R}$. To learn the equations of motion from observed data, we place a GP prior on the Hamiltonian, with covariance given by a squared exponential kernel on the cylinder, defined as a product kernel of squared exponential kernels on the circle and the real line. Following Hensman et al. [21], training proceeds using mini-batch stochastic variational inference with automatic relevance determination. The full setup is given in Appendix A.

To generate trajectories from the learned equations of motion, following Wilson et al. [43], we approximate the prior GP using Fourier features, and employ (2) to transform prior sample paths into posterior sample paths. We then generate trajectories by solving the learned Hamilton's equations numerically for each sample, which is straightforward because the approximate posterior is a basis function approximation and is therefore easily differentiated in the ordinary deterministic manner. Results can be seen in Figure 4. From these, we see that our GP learns the correct qualitative behavior of the equations of motion, mirroring the results of Deisenroth and Rasmussen [11].
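The product-kernel construction used here is straightforward to reproduce. Below is a brief sketch (our own code with hypothetical function names, assuming the circle is parameterized by $[0, 1)$) that combines the circle kernel (11), evaluated via its truncated spectral series, with the Euclidean kernel (4) to obtain a squared exponential kernel on the cylinder.

```python
import numpy as np

def se_circle(theta1, theta2, kappa=0.5, n_terms=200):
    """Squared exponential kernel on S^1 via its spectral series (cf. equation (11)),
    normalized to unit variance; angles are parameterized by [0, 1)."""
    n = np.arange(1, n_terms + 1)
    weights = np.exp(-2 * np.pi**2 * kappa**2 * n**2)
    d = theta1[:, None] - theta2[None, :]
    k = 1 + 2 * np.sum(weights[:, None, None] * np.cos(2 * np.pi * n[:, None, None] * d), axis=0)
    return k / (1 + 2 * np.sum(weights))

def se_line(p1, p2, kappa=1.0):
    """Standard squared exponential kernel on R, equation (4)."""
    return np.exp(-(p1[:, None] - p2[None, :])**2 / (2 * kappa**2))

def se_cylinder(x1, x2):
    """Product kernel on the cylinder S^1 x R = (angle, momentum), in the spirit of
    the Hamiltonian prior of Section 5.1 (a sketch, not the released code)."""
    return se_circle(x1[:, 0], x2[:, 0]) * se_line(x1[:, 1], x2[:, 1])

x = np.column_stack([np.random.default_rng(0).uniform(0, 1, 30),
                     np.random.default_rng(1).normal(0, 2, 30)])
K = se_cylinder(x, x)  # positive semi-definite Gram matrix on the cylinder
```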
5.2 Sample path visualization

To understand how complicated geometry affects posterior uncertainty estimates, and to illustrate the techniques on a general Riemannian manifold, we consider a posterior sample path visualization task. We take $M$ to be the dragon manifold from the Stanford 3D scanning repository, modified slightly to remove components not connected to the outer surface. We represent the manifold using a 202,490-triangle mesh and obtain 500 Laplace–Beltrami eigenpairs numerically using the Firedrake package [30]. For training data, we introduce a ground truth function by fixing a distinguished point at the end of the dragon's snout, and computing the sine of the geodesic distance from that point. We then observe this function at 52 points on the manifold chosen from the mesh's nodes, and train a Matérn GP regression model with smoothness $\nu = 3/2$ by maximizing the marginal likelihood with respect to the remaining kernel hyperparameters. By using the pathwise sampling expression (2), we obtain posterior samples defined on the entire mesh. Results can be seen in Figure 5.

Figure 5: Visualization of a Matérn Gaussian process posterior on the dragon: (a) ground truth, (b) posterior mean, (c) standard deviation, (d) one posterior sample. We plot the true function values, posterior mean, marginal posterior variance, and one posterior sample evaluated on the entire mesh. Here, black dots denote training locations, and color represents the value of the corresponding functions. Additional posterior samples can be seen in Appendix A.

Here, we see that posterior mean and uncertainty estimates match the manifold's shape seamlessly, decaying roughly in proportion with the geodesic distance in most regions. In particular, we see that the two sides of the dragon's snout have very different uncertainty values, despite close Euclidean proximity. This mimics the well-known swiss roll example of manifold learning [24, Section 6.1.1], and highlights the value of using a model which incorporates geometry.

6 Conclusion

In this work, we developed techniques for computing the kernel, spectral measure, and Fourier feature approximation of Matérn and squared exponential Gaussian processes on compact Riemannian manifolds, thereby constructively generalizing standard Gaussian process techniques to this setting. This was done by viewing the Gaussian processes as solutions of stochastic partial differential equations, and expressing the objects of interest in terms of Laplace–Beltrami eigenvalues and eigenfunctions. The theory was demonstrated on a set of simple examples: learning the equations of motion of an ideal pendulum, and sample path visualization for a Gaussian process defined on a dragon. This illustrates the theory in settings both where the Laplace–Beltrami eigenfunctions have a known analytic form, and where they need to be calculated numerically using a differential equation or graphics processing framework. Our work removes limitations of previous approaches, allowing Matérn and squared exponential Gaussian processes to be deployed in mini-batch, online, and non-conjugate settings using variational inference. We hope these contributions enable practitioners in robotics and other physical sciences to more easily incorporate geometry into their models.

Broader Impact

This is a purely theoretical paper. We develop technical tools that make Matérn Gaussian processes easier to work with in the Riemannian setting. This enables practitioners who are not experts in stochastic partial differential equations to model data that lives on spaces such as spheres and tori. We envision the impact of this work to be concentrated in the physical sciences, where spaces of this type occur naturally. Since the state spaces of most robotic arms are Riemannian manifolds, we expect these ideas to improve performance of model-based reinforcement learning by making it easier to incorporate geometric prior information into models. Since climate science is concerned with studying the globe, we also expect that our ideas can be used to model environmental phenomena, such as sea surface temperatures. By employing Gaussian processes for data assimilation and building them into larger frameworks, this could facilitate more accurate climate models compared to current methods. These impacts carry forward to potential generalizations of our work. We encourage practitioners to consider impacts on their respective disciplines that arise from incorporating geometry into models.

Acknowledgments and Disclosure of Funding

VB was supported by the St. Petersburg Department of Steklov Mathematical Institute of Russian Academy of Sciences and by the Ministry of Science and Higher Education of the Russian Federation, agreement No 075-15-2019-1620. PM was supported by the Ministry of Science and Higher Education of the Russian Federation, agreement No 075-15-2019-1619. VB and PM were supported by "Native towns", a social investment program of PJSC Gazprom Neft, and by the Department of Mathematics and Computer Science of St. Petersburg State University.
AT was supported by the Department of Mathematics at Imperial College London.

References

[1] M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. United States Department of Commerce, 1972. Cited on pages 5, 17.
[2] D. Bolin and K. Kirchner. The rational SPDE approach for Gaussian random fields with general smoothness. Journal of Computational and Graphical Statistics, 29(2):274–285, 2020. Cited on page 2.
[3] D. Bolin, K. Kirchner, and M. Kovács. Numerical solution of fractional elliptic stochastic PDEs with spatial white noise. IMA Journal of Numerical Analysis, 40(2):1051–1073, 2017. Cited on page 2.
[4] S. Calinon. Gaussians on Riemannian manifolds for robot learning and adaptive control. IEEE Robotics and Automation Magazine, 27(2):33–45, 2020. Cited on page 1.
[5] G. Camps-Valls, J. Verrelst, J. Munoz-Mari, V. Laparra, F. Mateo-Jimenez, and J. Gomez-Dans. A survey on Gaussian processes for earth-observation data analysis: A comprehensive investigation. IEEE Geoscience and Remote Sensing Magazine, 4(2):58–78, 2016. Cited on page 1.
[6] Y. Canzani. Analysis on Manifolds via the Laplacian. Harvard University, 2013. Cited on pages 6, 22.
[7] I. Chavel. Eigenvalues in Riemannian Geometry. Academic Press, 1984. Cited on pages 6, 22.
[8] S. Coveney, C. Corrado, C. H. Roney, D. O'Hare, S. E. Williams, M. D. O'Neill, S. A. Niederer, R. H. Clayton, J. E. Oakley, and R. D. Wilkinson. Gaussian process manifold interpolation for probabilistic atrial activation maps and uncertain conduction velocity. Philosophical Transactions of the Royal Society A, 378(2173):20190345, 2020. Cited on page 4.
[9] E. De Vito, N. Mücke, and L. Rosasco. Reproducing kernel Hilbert spaces on manifolds: Sobolev and diffusion spaces. Analysis and Applications, 19(3), 2020. Cited on pages 6, 19, 21, 24, 25.
[10] M. P. Deisenroth, D. Fox, and C. E. Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, 2015. Cited on pages 1, 8.
[11] M. P. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, pages 465–472, 2011. Cited on pages 1, 8, 15.
[12] L. C. Evans. Partial Differential Equations. American Mathematical Society, 2010. Cited on page 3.
[13] A. Feragen and S. Hauberg. Open problem: Kernel methods on manifolds and metric spaces. What is the probability of a positive definite geodesic exponential kernel? In Conference on Learning Theory, pages 1647–1650, 2016. Cited on page 3.
[14] A. Feragen, F. Lauze, and S. Hauberg. Geodesic exponential kernels: When curvature and linearity conflict. In Conference on Computer Vision and Pattern Recognition, pages 3032–3042, 2015. Cited on pages 1, 3.
[15] T. Gao, S. Z. Kovalsky, and I. Daubechies. Gaussian process landmarking on manifolds. SIAM Journal on Mathematics of Data Science, 1(1):208–236, 2019. Cited on page 4.
[16] C. S. Gordon. Survey of isospectral manifolds. In Handbook of Differential Geometry, Volume 1, pages 747–778. Elsevier, 2000. Cited on page 20.
[17] I. S. Gradshteyn and I. M. Ryzhik. Table of Integrals, Series, and Products. Academic Press, 7th edition, 2014. Cited on page 2.
[18] A. Grigoryan. Heat Kernel and Analysis on Manifolds. American Mathematical Society, 2009. Cited on page 3.
[19] J. Guinness and M. Fuentes. Isotropic covariance functions on spheres: some properties and modeling considerations. Journal of Multivariate Analysis, 143:143–152, 2016. Cited on page 17.
[20] J. Hensman, N. Durrande, and A. Solin. Variational Fourier features for Gaussian processes. Journal of Machine Learning Research, 18(151):1–52, 2017. Cited on pages 2, 4, 5.
[21] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, pages 282–290, 2013. Cited on pages 2, 4, 8, 16.
[22] N. Jaquier, L. Rozo, S. Calinon, and M. Bürger. Bayesian optimization meets Riemannian manifolds in robot learning. In Conference on Robot Learning, pages 233–246, 2019. Cited on pages 1, 3.
[23] S. Lang. Real and Functional Analysis. Springer, 1993. Cited on page 22.
[24] J. A. Lee and M. Verleysen. Nonlinear Dimensionality Reduction. Springer, 2007. Cited on page 9.
[25] F. Lindgren, H. Rue, and J. Lindström. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4):423–498, 2011. Cited on pages 1, 3.
[26] S. V. Lototsky and B. L. Rozovsky. Stochastic Partial Differential Equations. Springer, 2017. Cited on pages 6, 22, 23.
[27] A. Mallasto and A. Feragen. Wrapped Gaussian process regression on Riemannian manifolds. In Conference on Computer Vision and Pattern Recognition, pages 5580–5588, 2018. Cited on page 4.
[28] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008. Cited on pages 2, 5.
[29] C. E. Rasmussen and C. K. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. Cited on pages 2, 16, 21.
[30] F. Rathgeber, D. A. Ham, L. Mitchell, M. Lange, F. Luporini, A. T. McRae, G.-T. Bercea, G. R. Markall, and P. H. Kelly. Firedrake: automating the finite element method by composing abstractions. ACM Transactions on Mathematical Software, 43(3):1–27, 2016. Cited on pages 8, 13.
[31] M. Reed and B. Simon. Methods of Modern Mathematical Physics. Academic Press, 1980. Cited on page 22.
[32] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392, 2009. Cited on page 3.
[33] S. Särkkä, A. Solin, and J. Hartikainen. Spatiotemporal learning via infinite-dimensional Bayesian filtering and smoothing: a look at Gaussian process regression through Kalman filtering. IEEE Signal Processing Magazine, 30(4):51–61, 2013. Cited on page 3.
[34] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002. Cited on page 4.
[35] D. Simpson, F. Lindgren, and H. Rue. Think continuous: Markovian Gaussian models in spatial statistics. Spatial Statistics, 1:16–29, 2012. Cited on page 3.
[36] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012. Cited on page 1.
[37] A. Solin, M. Kok, N. Wahlström, T. B. Schön, and S. Särkkä. Modeling and interpolation of the ambient magnetic field by Gaussian processes. IEEE Transactions on Robotics, 34(4):1112–1127, 2018. Cited on page 4.
[38] A. Solin and S. Särkkä. Hilbert space methods for reduced-rank Gaussian process regression. Statistics and Computing, 30(2):419–446, 2020. Cited on page 4.
[39] E. M. Stein and G. Weiss. Introduction to Fourier Analysis on Euclidean Spaces. Princeton University Press, 2016. Cited on page 16.
[40] R. S. Strichartz. Analysis of the Laplacian on the complete Riemannian manifold. Journal of Functional Analysis, 52(1):48–79, 1983. Cited on page 22.
[41] M. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pages 567–574, 2009. Cited on pages 2, 4, 16.
[42] P. Whittle. Stochastic processes in several dimensions. Bulletin of the International Statistical Institute, 40(2):974–994, 1963. Cited on pages 1, 3–5, 7, 20.
[43] J. T. Wilson, V. Borovitskiy, A. Terenin, P. Mostowski, and M. P. Deisenroth. Efficiently sampling functions from Gaussian process posteriors. In International Conference on Machine Learning, pages 7470–7480, 2020. Cited on pages 2, 7, 8, 16.
[44] S. Zelditch. Eigenfunctions of the Laplacian on a Riemannian Manifold. American Mathematical Society, 2017. Cited on page 7.