# Implicit Manifold Gaussian Process Regression

Bernardo Fichera¹, Viacheslav Borovitskiy², Andreas Krause², Aude Billard¹
¹ EPFL, ² ETH Zürich

Gaussian process regression is widely used because of its ability to provide well-calibrated uncertainty estimates and handle small or sparse datasets. However, it struggles with high-dimensional data. One possible way to scale this technique to higher dimensions is to leverage the implicit low-dimensional manifold upon which the data actually lies, as postulated by the manifold hypothesis. Prior work ordinarily requires the manifold structure to be explicitly provided though, i.e. given by a mesh or known to be one of the well-known manifolds like the sphere. In contrast, in this paper we propose a Gaussian process regression technique capable of inferring implicit structure directly from data (labeled and unlabeled) in a fully differentiable way. For the resulting model, we discuss its convergence to the Matérn Gaussian process on the assumed manifold. Our technique scales up to hundreds of thousands of data points, and improves the predictive performance and calibration of the standard Gaussian process regression in some high-dimensional settings.

1 Introduction

Gaussian processes are among the most widely adopted models for learning unknown functions within the Bayesian framework. Their data efficiency and aptitude for uncertainty quantification make them appealing for modeling and decision-making applications in fields like robotics (Deisenroth and Rasmussen, 2011), geostatistics (Chilès and Delfiner, 2012), numerics (Hennig et al., 2015), etc. The most widely used Gaussian process models, like squared exponential and Matérn (Rasmussen and Williams, 2006), impose the simple assumption of differentiability of the unknown function while also respecting the geometry of $\mathbb{R}^d$ by virtue of being stationary or isotropic. Such simple assumptions make uncertainty estimates reliable, albeit too conservative at times. The same simplicity makes these models suffer from the curse of dimensionality. We hypothesize that it is still possible to leverage these simple priors for real-world high-dimensional problems, provided that they are adapted to the implicit low-dimensional submanifolds where the data actually lies, as illustrated by Figure 1.

Figure 1: Euclidean (standard Matérn-5/2 kernel) vs ours (implicit manifold) Gaussian process regression for data that lies on a dumbbell-shaped curve (1-dimensional manifold) assumed unknown. Panels: (a) Euclidean, prediction; (b) ours, prediction; (c) Euclidean, uncertainty; (d) ours, uncertainty. The data contains a small set of labeled points and a large set of unlabeled points. Our technique recognizes that the two lines in the middle are intrinsically far away from each other, giving a much better model on and near the manifold. Far away from the manifold it reverts to the Euclidean model. Code available at https://github.com/nash169/manifold-gp.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Recent works in machine learning generalized Matérn Gaussian processes for modeling functions on non-Euclidean domains such as manifolds or graphs (Azangulov et al., 2022; 2023; Borovitskiy et al., 2021; 2023; 2020). Crucially, this line of work assumes known geometry (e.g. a manifold or a graph) beforehand.
In this work we aim to widen the applicability of Gaussian processes for higher dimensional problems by automatically learning the implicit low-dimensional manifold upon which the data lies, the existence of which is suggested by the manifold hypothesis. We propose a new model which learns this structure and approximates the Matérn kernel on the implicit manifold. Our approach can operate in both supervised and semi-supervised settings, with the emphasis on the latter: uncovering the implicit manifold may require a lot of samples from it, however these samples need not be labeled, and unlabeled data is usually more abundant. Taking inspiration in the manifold learning results of Coifman and Lafon (2006), Dunson et al. (2021) and others we approximate the unknown manifold by an appropriately weighted nearest neighbor graph. Then we use graph Matérn kernels thereon as approximations to the manifold Matérn kernels of Borovitskiy et al. (2020), extending them to the vicinity of the manifold in the ambient Rd in an appropriate way. 1.1 Related Work and Contribution High-dimensional Gaussian process regression is an area of active research, primarily motivated by decision making applications like Bayesian optimization. There are three main directions in this area: (1) selecting a small subset of input dimensions, (2) learning a small number of new features by linearly projecting the inputs and (3) learning non-linear features. Our technique belongs to the third direction. Further details on the area can be found in the recent review by Binois and Wycoff (2022). A close relative of our technique in the literature is described in Dunson et al. (2022). It targets the low-dimensional setting where the inputs are densely sampled on the underlying surface. It is based on the heat (diffusion) kernels on graphs as in Kondor and Lafferty (2002) and uses the Nyström method to extend kernels to Rd, both of which may incur a high computational cost. Another close relative is the concurrent work by Peach et al. (2023) on modeling vector fields on implicit manifolds. We are targeting the high-dimensional setting. Here, larger datasets of partly labeled points are often needed to infer geometry. Because of this, we emphasize computational efficiency by leveraging sparse precision matrix structure of Matérn kernels (as opposed to the heat kernels) and use KNN for sparsifying the graph and accelerating the Nyström method. This results in linear computational complexity with respect to the number of data points. Furthermore, the model we propose is fully differentiable, which may be used to find both kernel and geometry hyperparameters by maximizing the marginal likelihood. Finally, to get reasonable predictions on the whole ambient space Rd, we combine the prediction of the geometric model with the prediction of a classical Euclidean Gaussian process, weighting these by the relative distance to the manifold. The geometric model is differentiable with respect to its kernel-, likelihoodand geometry-related hyperparameters, with gradient evaluation cost being linear with respect to the number of data points. After training, we can efficiently compute the predictive mean and kernel as well as sample the predictive model, providing the basic computational primitives needed for the downstream applications like Bayesian optimization. 
We evaluate our technique on a synthetic low-dimensional example and test it in a high-dimensional, large-dataset setting of predicting rotation angles of rotated MNIST images, improving over the standard Gaussian process regression.

2 Gaussian Processes

A Gaussian process $f \sim \mathrm{GP}(m, k)$ is a distribution over functions on a set $\mathcal{X}$. It is determined by the mean function $m(x) = \mathbb{E} f(x)$ and the covariance function (kernel) $k(x, x') = \mathrm{Cov}(f(x), f(x'))$. Given data $\mathbf{X}, \mathbf{y}$, where $\mathbf{X} = (x_1, \ldots, x_n)$ and $\mathbf{y} = (y_1, \ldots, y_n)$ with $x_i \in \mathcal{X} \subseteq \mathbb{R}^d$, $y_i \in \mathbb{R}$, one usually assumes $y_i = f(x_i) + \varepsilon_i$ where $\varepsilon_i \sim \mathrm{N}(0, \sigma_\varepsilon^2)$ is IID noise and $f \sim \mathrm{GP}(0, k)$ is some prior Gaussian process, whose mean is assumed to be zero in order to simplify notation. The posterior distribution $f \mid \mathbf{y}$ is then another Gaussian process $f \mid \mathbf{y} \sim \mathrm{GP}(\hat{m}, \hat{k})$ with (Rasmussen and Williams, 2006)

$$\hat{m}(\cdot) = \mathbf{K}_{(\cdot)\mathbf{X}} \left( \mathbf{K}_{\mathbf{X}\mathbf{X}} + \sigma_\varepsilon^2 \mathbf{I} \right)^{-1} \mathbf{y}, \qquad \hat{k}(\cdot, \cdot') = \mathbf{K}_{(\cdot, \cdot')} - \mathbf{K}_{(\cdot)\mathbf{X}} \left( \mathbf{K}_{\mathbf{X}\mathbf{X}} + \sigma_\varepsilon^2 \mathbf{I} \right)^{-1} \mathbf{K}_{\mathbf{X}(\cdot')}, \quad (1)$$

where the matrix $\mathbf{K}_{\mathbf{X}\mathbf{X}}$ has entries $k(x_i, x_j)$ and the vector $\mathbf{K}_{\mathbf{X}(\cdot)} = \mathbf{K}_{(\cdot)\mathbf{X}}^\top$ has components $k(x_i, \cdot)$. If needed, one can efficiently sample $f \mid \mathbf{y}$ using pathwise conditioning (Wilson et al., 2020; 2021).

Figure 2: Kernel values $k(\bullet, \cdot)$ and samples for the Matérn-3/2 Gaussian process on the sphere manifold $S^2$ and for the approximating Matérn-5/2 process on a geodesic polyhedron graph $P$. Panels: (a) kernel, sphere $S^2$; (b) kernel, graph $P$; (c) sample, sphere $S^2$; (d) sample, graph $P$.

Matérn Gaussian processes (including the limiting $\nu \to \infty$ case, squared exponential Gaussian processes) are the most popular family of models for $\mathcal{X} = \mathbb{R}^d$. These have zero mean and kernels

$$k_{\nu,\kappa,\sigma^2}(x, x') = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \sqrt{2\nu} \frac{\|x - x'\|}{\kappa} \right)^{\nu} K_\nu\!\left( \sqrt{2\nu} \frac{\|x - x'\|}{\kappa} \right), \quad (2)$$

where $K_\nu$ is the modified Bessel function of the second kind (Gradshteyn and Ryzhik, 2014) and $\nu, \kappa, \sigma^2$ are the hyperparameters responsible for smoothness, length scale and variance, respectively. We proceed to describe how Matérn processes can be generalized to inputs $x$ lying on explicitly given manifolds or graphs instead of the Euclidean space $\mathbb{R}^d$.

2.1 Matérn Gaussian Processes on Explicit Manifolds and Graphs

For a domain which is a Riemannian manifold, an obvious and natural idea for generalizing Matérn Gaussian processes could be to substitute the Euclidean distance $\|x - x'\|$ in Equation (2) with the geodesic distance. However, this approach results in ill-defined kernels that fail to be positive semi-definite (Feragen et al., 2015; Gneiting, 2013). Another direction for generalization is based on the stochastic partial differential equation (SPDE) characterization of Matérn processes first described by Whittle (1963): $f \sim \mathrm{GP}(0, k_{\nu,\kappa,\sigma^2})$ solves

$$\left( \frac{2\nu}{\kappa^2} - \Delta_{\mathbb{R}^d} \right)^{\frac{\nu}{2} + \frac{d}{4}} f = \mathcal{W}, \quad (3)$$

where $\Delta_{\mathbb{R}^d}$ is the standard Laplacian operator and $\mathcal{W}$ is the Gaussian white noise with variance proportional to $\sigma^2$. If taken to be the definition, this characterization can be easily extended to general Riemannian manifolds $\mathcal{X} = \mathcal{M}$ by substituting $\Delta_{\mathbb{R}^d}$ with the Laplace-Beltrami operator $\Delta_{\mathcal{M}}$, taking $d = \dim \mathcal{M}$ and substituting $\mathcal{W}$ with the appropriate generalization of the Gaussian white noise (Lindgren et al., 2011). Based on this idea, Borovitskiy et al. (2020) showed that on compact Riemannian manifolds, Matérn Gaussian processes are the zero-mean processes with kernels

$$k_{\nu,\kappa,\sigma^2}(x, x') = \frac{\sigma^2}{C_{\nu,\kappa}} \sum_{l=0}^{\infty} \left( \frac{2\nu}{\kappa^2} + \lambda_l \right)^{-\nu - d/2} f_l(x) f_l(x'), \quad (4)$$

where $\lambda_l, f_l$ are eigenvalues and eigenfunctions of the Laplace-Beltrami operator and $C_{\nu,\kappa}$ is the normalizing constant ensuring that $\frac{1}{\operatorname{vol}(\mathcal{X})} \int_{\mathcal{X}} k_{\nu,\kappa,\sigma^2}(x, x) \, \mathrm{d}x = \sigma^2$. This, alongside considerations from Azangulov et al. (2022), allows one to practically compute $k_{\nu,\kappa,\sigma^2}$ for many compact manifolds.
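For concreteness, before moving on to graphs, here is a minimal NumPy/SciPy sketch of Equations (1) and (2) in the Euclidean case; the helper names and toy data are illustrative only and not part of the paper's implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import gamma, kv

def matern_kernel(X1, X2, nu=2.5, kappa=1.0, sigma2=1.0):
    # Euclidean Matérn kernel, Equation (2)
    r = cdist(X1, X2)
    s = np.sqrt(2.0 * nu) * r / kappa
    with np.errstate(invalid="ignore"):
        K = sigma2 * (2.0 ** (1.0 - nu) / gamma(nu)) * (s ** nu) * kv(nu, s)
    K[r == 0.0] = sigma2  # kv is singular at 0; the limiting value is sigma2
    return K

def gp_posterior(X, y, X_new, noise_var=1e-2, **kern):
    # Posterior mean and covariance, Equation (1), with zero prior mean
    K_xx = matern_kernel(X, X, **kern) + noise_var * np.eye(len(X))
    K_nx = matern_kernel(X_new, X, **kern)
    K_nn = matern_kernel(X_new, X_new, **kern)
    mean = K_nx @ np.linalg.solve(K_xx, y)
    cov = K_nn - K_nx @ np.linalg.solve(K_xx, K_nx.T)
    return mean, cov

# toy usage: 1D regression with a Matérn-5/2 prior
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
mean, cov = gp_posterior(X, y, np.linspace(-3, 3, 100)[:, None])
```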
If the domain $\mathcal{X}$ is a weighted undirected graph $\mathbf{G}$, we can also use Equation (3) to define Matérn Gaussian processes on $\mathbf{G}$ (Borovitskiy et al., 2021). In this case, $\Delta_{\mathbb{R}^d}$ is substituted with the minus graph Laplacian $-\Delta_{\mathbf{G}}$ and $\mathcal{W} \sim \mathrm{N}(0, \sigma_{\mathcal{W}}^2 \mathbf{I})$ is a vector of IID Gaussians. Here, the SPDE transforms into a stochastic linear system, whose solution is of the same form as Equation (4) but with a finite sum instead of the infinite series, with $d = 0$ because there is no canonical notion of dimension for graphs, and with $\lambda_l, f_l$ being the eigenvalues and eigenvectors, regarded as functions on the node set, of the matrix $\Delta_{\mathbf{G}}$. These processes are illustrated in Figure 2.

3 Implicit Manifolds and Gaussian Processes on Them

Consider a dataset $\mathbf{X} = (x_1, \ldots, x_N)$, $x_i \in \mathbb{R}^d$, partially labeled with labels $y_1, \ldots, y_n \in \mathbb{R}$, $n \leq N$. Assume that $x_i$ are IID randomly sampled from a compact Riemannian submanifold $\mathcal{M} \subseteq \mathbb{R}^d$. As per Section 2.1, the manifold $\mathcal{M}$ is associated to a family of Matérn Gaussian processes tailored to its geometry. We do not assume to know $\mathcal{M}$, only the fact that it exists, hence the question is: how can we recover the kernels of the aforementioned geometry-aware processes from the observed dataset?

It is clear from Equation (4) that to recover $k_{\nu,\kappa,\sigma^2}$ we need to get the eigenpairs $\lambda_l, f_l$ of the Laplace-Beltrami operator on $\mathcal{M}$. Naturally, for a finite dataset this can only be done approximately. We proceed to discuss the relevant theory of Laplace-Beltrami eigenpair approximation.

3.1 Background on Approximating the Eigenpairs of the Laplace-Beltrami Operator

There exists a number of theoretical and empirical results on eigenpair approximation. Virtually all of them study approximating the implicit manifold by some kind of weighted undirected graph¹ with node set $\{x_1, \ldots, x_N\}$ and weights that are somehow determined by the Euclidean distances $\|x_i - x_j\|$. The eigenvalues of the graph Laplacian of this graph are supposed to approximate the eigenvalues of the Laplace-Beltrami operator, while the eigenvectors, regarded as functions on the node set, approximate the values of the eigenfunctions of the Laplace-Beltrami operator at $x_i \in \mathcal{M}$. To approximate eigenfunctions elsewhere, any sort of continuous (smooth) interpolation suffices.

There are three popular notions of graph Laplacian. Let us denote the adjacency matrix of the weighted graph by $\mathbf{A}$ and define $\mathbf{D}$ to be the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$. Then

$$\Delta_{\mathrm{un}} = \mathbf{D} - \mathbf{A} \;\;\text{(unnormalized)}, \qquad \Delta_{\mathrm{sym}} = \mathbf{I} - \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2} \;\;\text{(symmetric normalized)}, \qquad \Delta_{\mathrm{rw}} = \mathbf{I} - \mathbf{D}^{-1} \mathbf{A} \;\;\text{(random walk normalized)}. \quad (5)$$

The first two of these are symmetric positive semi-definite matrices; the third is, generally speaking, non-symmetric. However, from the point of view of linear operators, all of them can be considered symmetric (self-adjoint) positive semi-definite: the first two with respect to the standard Euclidean inner product $\langle \cdot, \cdot \rangle$, and the third one with respect to the modified inner product $\langle v, u \rangle_{\mathbf{D}} = \langle \mathbf{D} v, u \rangle$. Thus for each of them there exists an orthonormal basis of eigenvectors and the eigenvalues are non-negative.²

The most common way to define the graph is by setting $A_{ij} = \exp\!\left(-\|x_i - x_j\|^2 / (4\alpha^2)\right)$ for an $\alpha > 0$. If $x_i$ are IID samples from the uniform distribution on the manifold $\mathcal{M}$, then all of the graph Laplacians, each multiplied by an appropriate power of $\alpha$, converge to the Laplace-Beltrami operator, both pointwise (Hein et al., 2007) and spectrally (García Trillos et al., 2020), i.e.
in the sense of eigenpair convergence, at least at the node set of the graph.³ However, if the inputs $x_i$ are sampled non-uniformly, graph Laplacians, at best, converge to different continuous limits, none of which coincides with the Laplace-Beltrami operator (Hein et al., 2007).

Coifman and Lafon (2006) proposed a clever trick to handle non-uniformly sampled data $x_1, \ldots, x_N$. Starting with $\tilde{\mathbf{A}}$ and $\tilde{\mathbf{D}}$ defined in the same way as $\mathbf{A}$ and $\mathbf{D}$ before, they define $\mathbf{A} = \tilde{\mathbf{D}}^{-1} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1}$. Intuitively, this corresponds to normalizing by the kernel density estimator to cancel out the unknown density. The corresponding $\Delta_{\mathrm{rw}}$ then converges pointwise to the Laplace-Beltrami operator (Hein et al., 2007), though $\Delta_{\mathrm{un}}$ and $\Delta_{\mathrm{sym}}$ do not: they converge to different continuous limits. Dunson et al. (2021, Theorem 2) show that, under technical regularity assumptions, the eigenvalues $\lambda_l$ and renormalized eigenvectors of $\Delta_{\mathrm{rw}}$ converge to the respective eigenvalues and eigenfunctions of the Laplace-Beltrami operator, regardless of the sampling density of $x_i$.

Both in the simple case and in the sampling-density-independent case, the graphs and their respective Laplacians turn out to be dense, requiring a lot of memory to store and being inefficient to operate with. To make computations efficient, sparse graphs such as KNN graphs are much more preferable over dense graphs. Spectral convergence for KNN graphs is studied, for example, in Calder and Trillos (2022), for $A_{ij} = h(\|x_i - x_j\| / \alpha)$ with a compactly supported regular function $h$, and with the limit depending on the sampling density. Unfortunately, we are unaware of any spectral convergence results in the literature that hold for KNN graphs and are independent of the data sampling density.

¹ Other possibilities include linear (classical PCA) or quadratic (Pavutnitskiy et al., 2022) approximations. See the books by Lee and Verleysen (2007) and Ma and Fu (2011) for additional context.
² Naturally, for the random walk normalized Laplacian $\Delta_{\mathrm{rw}}$ the orthonormality is with respect to $\langle \cdot, \cdot \rangle_{\mathbf{D}}$.
³ García Trillos et al. (2020) do not explicitly study $\Delta_{\mathrm{sym}}$. Since $\Delta_{\mathrm{sym}}$ and $\Delta_{\mathrm{rw}}$ are similar matrices, they share eigenvalues, so eigenvalue convergence for $\Delta_{\mathrm{sym}}$ is trivial; the eigenvector convergence, however, is not.

3.2 Approximating Matérn Kernels on Manifolds

Here we incorporate various convergence results, including but not limited to the ones described in Section 3.1, proving that all spectral convergence results imply the convergence of graph Matérn kernels to the respective manifold Matérn kernels.

Proposition 1. Denote the eigenpairs by $\lambda_l, f_l$ for a graph Laplacian and by $\lambda_l^{\mathcal{M}}, f_l^{\mathcal{M}}$ for the Laplace-Beltrami operator. Fix $\delta > 0$. Assume that, with probability at least $1 - \delta$, for all $\varepsilon > 0$, for $\alpha$ small enough and for $N$ large enough we have $|\lambda_l - \lambda_l^{\mathcal{M}}| < \varepsilon$ and $|f_l(x_i) - f_l^{\mathcal{M}}(x_i)| < \varepsilon$. Then, with probability at least $1 - \delta$, we have $k^{N,\alpha,L}_{\nu,\kappa,\sigma^2}(x_i, x_j) \to k_{\nu,\kappa,\sigma^2}(x_i, x_j)$ as $\alpha \to 0$, $N, L \to \infty$, where

$$k^{N,\alpha,L}_{\nu,\kappa,\sigma^2}(x_i, x_j) = \frac{\sigma^2}{C_{\nu,\kappa}} \sum_{l=0}^{L-1} \left( \frac{2\nu}{\kappa^2} + \lambda_l \right)^{-\nu - \dim(\mathcal{M})/2} f_l(x_i) f_l(x_j). \quad (6)$$

Proof. First prove that the tail of the series in Equation (4) converges uniformly to zero, then combine this with eigenpair bounds. See details in Appendix A.

Remark. The convergence at $x_i \in \mathcal{M}$ can be lifted to pointwise convergence for all $x \in \mathcal{M}$ if eigenvectors are interpolated Lipschitz-continuously, simply because the eigenfunctions are smooth.

Inspired by this theory, we proceed to present the implicit manifold Gaussian process model.
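Before doing so, here is a small dense NumPy illustration of the constructions above: the three graph Laplacians of Equation (5) and the density-normalizing trick $\mathbf{A} = \tilde{\mathbf{D}}^{-1} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1}$. Everything is kept dense for clarity, and the helper names are ours.

```python
import numpy as np

def gaussian_adjacency(X, alpha):
    # \tilde{A}_ij = exp(-||x_i - x_j||^2 / (4 alpha^2)), dense for illustration
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (4.0 * alpha ** 2))

def density_normalized(A_tilde):
    # Coifman-Lafon normalization A = \tilde{D}^{-1} \tilde{A} \tilde{D}^{-1},
    # cancelling the unknown sampling density before forming the Laplacian.
    d_tilde = A_tilde.sum(axis=1)
    return A_tilde / np.outer(d_tilde, d_tilde)

def laplacians(A):
    # Equation (5): unnormalized, symmetric normalized, random walk normalized
    d = A.sum(axis=1)
    L_un = np.diag(d) - A
    L_sym = np.eye(len(A)) - np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)
    L_rw = np.eye(len(A)) - np.diag(1.0 / d) @ A
    return L_un, L_sym, L_rw

# toy check: eigenvectors of Delta_rw are orthonormal in the D-inner product
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
A = density_normalized(gaussian_adjacency(X, alpha=1.0))
_, L_sym, _ = laplacians(A)
d = A.sum(axis=1)
lam, f_sym = np.linalg.eigh(L_sym)          # dense eigensolve, small demo only
f = f_sym / np.sqrt(d)[:, None]             # f_rw = D^{-1/2} f_sym
assert np.allclose(f.T @ np.diag(d) @ f, np.eye(len(X)), atol=1e-6)
```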
3.3 Implicit Manifold Gaussian Process

Guided by the theory we described in the previous section, we are now ready to formulate the implicit manifold Gaussian process model. Given the dataset $x_1, \ldots, x_N \in \mathbb{R}^d$ and $y_1, \ldots, y_n \in \mathbb{R}$, we put

$$\mathbf{A} = \tilde{\mathbf{D}}^{-1} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1}, \qquad \tilde{A}_{ij} = S_K(x_i, x_j) \exp\!\left( -\frac{\|x_i - x_j\|^2}{4\alpha^2} \right), \qquad \tilde{D}_{ij} = \begin{cases} \sum_m \tilde{A}_{im} & i = j, \\ 0 & i \neq j. \end{cases} \quad (7)$$

Here $S_K(x_i, x_j) = 1$ if $x_i$ is one of the $K$ nearest neighbors of $x_j$ or vice versa, and $S_K(x_i, x_j) = 0$ otherwise; all matrices are of size $N \times N$ and depend on $\alpha$ and $K$ as hyperparameters. Thanks to the coefficient $S_K(x_i, x_j)$ that performs KNN sparsification, the matrix $\mathbf{A}$ is sparse when $K \ll N$.⁴

⁴ Note: $\tilde{A}_{ii} = 1$, as if the graph has loops. Assuming $\tilde{A}_{ii} = 0$ would lead to discontinuities at later stages.

Then we consider the operator $\Delta_{\mathrm{rw}} = \mathbf{I} - \mathbf{D}^{-1} \mathbf{A}$ defined by Equation (5), whose matrix is also sparse. Denoting its eigenvalues, ordered from the smallest to the largest, by $0 = \lambda_0 \leq \lambda_1 \leq \ldots \leq \lambda_{N-1}$, and its eigenvectors, orthonormal under the modified inner product $\langle \cdot, \cdot \rangle_{\mathbf{D}}$ and regarded as functions on the node set of the graph, by $f_0, f_1, \ldots, f_{N-1}$, we define the Matérn kernel on graph nodes $x_i$ by

$$k^{\mathbf{X}}_{\nu,\kappa,\sigma^2}(x_i, x_j) = \sigma^2 \sum_{l=0}^{L-1} \Phi_{\nu,\kappa}(\lambda_l) f_l(x_i) f_l(x_j), \qquad \Phi_{\nu,\kappa}(\lambda) = \left( \frac{2\nu}{\kappa^2} + \lambda \right)^{-\nu}, \quad (8)$$

where $L$ does not need to be equal to the actual number $N$ of eigenpairs. Doing so means truncating the high-frequency eigenvectors ($f_l$ for $l$ large), which always contribute less to the sum because they correspond to smaller values of $\Phi_{\nu,\kappa}(\lambda_l)$. This can massively reduce the computational costs.

By Proposition 1, Equation (8) approximates the manifold Matérn kernel with smoothness $\nu' = \nu - \dim(\mathcal{M})/2$. We adopt such a reparametrization because it does not require estimating the a priori unknown $\dim(\mathcal{M})$. This, however, makes the typical assumption of $\nu \in \{1/2, 3/2, 5/2\}$ inadequate. We chose a particular graph Laplacian normalization, namely the random walk normalized graph Laplacian $\Delta_{\mathrm{rw}}$, to approximate the true Laplace-Beltrami operator regardless of the potential non-uniform sampling of $x_1, \ldots, x_N$, based on the theoretical insights described in Section 3.1.

Figure 3: Different quantities connected to kernel extension: (a) extended eigenvector $f_3(\cdot)$; (b) extended kernel $k^{\mathbf{X}}_{\nu,\kappa,\sigma^2}(\bullet, \cdot)$; (c) the function $\gamma(\operatorname{dist}(\cdot, \mathcal{M}))$. Notice that the values on subfigures (a) and (b) are artificially restricted to the set $\operatorname{dist}(\cdot, \mathcal{M}) < 3\alpha$ to maintain numerical stability.

The kernel in Equation (8) is only defined on the set of nodes $x_i$; the next step is to extend it to the whole space $\mathbb{R}^d$. Extending kernels is usually a difficult problem because one has to worry about positive semi-definiteness. To work around it, we extend the features $f_l$. For this, we use the Nyström method: we allow the first argument of $S_K$ to be an arbitrary vector from $\mathbb{R}^d$, defining $S_K(x, x_j) = 1$ if $x_j$ is one of the $K$ nearest neighbors of $x$ among $x_1, \ldots, x_N$. This allows us to extend $\tilde{\mathbf{A}}$, $\tilde{\mathbf{D}}$, $\mathbf{A}$ and $\mathbf{D}$ as

$$\tilde{A}(x, x_j) = S_K(x, x_j) \exp\!\left( -\frac{\|x - x_j\|^2}{4\alpha^2} \right), \qquad \tilde{D}(x) = \sum_{j=1}^{N} \tilde{A}(x, x_j) = \sum_{x_j \in \mathrm{KNN}(x)} \tilde{A}(x, x_j), \quad (9)$$

$$A(x, x_j) = \frac{\tilde{A}(x, x_j)}{\tilde{D}(x)\, \tilde{D}(x_j)}, \qquad D(x) = \sum_{j=1}^{N} A(x, x_j) = \sum_{x_j \in \mathrm{KNN}(x)} A(x, x_j), \quad (10)$$

where $\mathrm{KNN}(x)$ is the set of the $K$ nearest neighbors of $x$ among $x_1, \ldots, x_N$. With this, we define

$$f_l(x) = \frac{1}{1 - \lambda_l} \frac{1}{D(x)} \sum_{j=1}^{N} A(x, x_j) f_l(x_j) = \frac{1}{1 - \lambda_l} \frac{1}{D(x)} \sum_{x_j \in \mathrm{KNN}(x)} A(x, x_j) f_l(x_j). \quad (11)$$

It is easy to check that this extension perfectly reproduces the values $f_l(x_j)$, simply because $A(x, x_j)$ coincides with $A_{ij}$ when $x = x_i$ and because $(f_l(x_1), \ldots, f_l(x_N))$ is an eigenvector of $\mathbf{D}^{-1}\mathbf{A}$ corresponding to the eigenvalue $1 - \lambda_l$. It also allows us to extend the kernel $k^{\mathbf{X}}_{\nu,\kappa,\sigma^2}$ as well, such that $k^{\mathbf{X}}_{\nu,\kappa,\sigma^2}(x, x')$ is defined for arbitrary $x, x' \in \mathbb{R}^d$ and Equation (8) still holds for the nodes $x_1, \ldots, x_N$.
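The construction of Equations (7)-(11) can be sketched end to end as follows. This is a simplified NumPy/SciPy/scikit-learn version with our own helper names; the paper's implementation uses FAISS, PyTorch and custom sparse operators. The eigenpairs `lam, f` of $\Delta_{\mathrm{rw}}$ are assumed given (see the eigensolver sketch in Section 4).

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import NearestNeighbors

def knn_graph(X, K, alpha):
    # Symmetrized, KNN-sparsified adjacency \tilde{A} of Equation (7) (with self-loops,
    # cf. footnote 4: the point itself is returned as its own first neighbor).
    nn = NearestNeighbors(n_neighbors=K + 1).fit(X)
    dist, idx = nn.kneighbors(X)
    rows = np.repeat(np.arange(len(X)), K + 1)
    vals = np.exp(-dist.ravel() ** 2 / (4.0 * alpha ** 2))
    A_t = sp.csr_matrix((vals, (rows, idx.ravel())), shape=(len(X), len(X)))
    return A_t.maximum(A_t.T)                    # S_K(x_i, x_j): neighbor "or vice versa"

def rw_laplacian(A_t):
    # A = \tilde{D}^{-1} \tilde{A} \tilde{D}^{-1}, then Delta_rw = I - D^{-1} A.
    d_t = np.asarray(A_t.sum(axis=1)).ravel()
    A = sp.diags(1.0 / d_t) @ A_t @ sp.diags(1.0 / d_t)
    d = np.asarray(A.sum(axis=1)).ravel()
    L_rw = sp.eye(A.shape[0]) - sp.diags(1.0 / d) @ A
    return A, d_t, d, L_rw

def truncated_matern(lam, f, nu, kappa, sigma2, L):
    # Kernel matrix on the graph nodes, Equation (8), truncated to L eigenpairs.
    phi = (2.0 * nu / kappa ** 2 + lam[:L]) ** (-nu)
    return sigma2 * (f[:, :L] * phi) @ f[:, :L].T

def nystrom_features(x, X, d_t, lam, f, K, alpha, L):
    # Extend the eigenvectors to a new point x via Equations (9)-(11).
    nn = NearestNeighbors(n_neighbors=K).fit(X)
    dist, idx = nn.kneighbors(x[None, :])
    a_t = np.exp(-dist.ravel() ** 2 / (4.0 * alpha ** 2))   # \tilde{A}(x, x_j), Eq. (9)
    a = a_t / (a_t.sum() * d_t[idx.ravel()])                 # A(x, x_j), Eq. (10)
    D_x = a.sum()                                            # D(x), Eq. (10)
    return (a @ f[idx.ravel(), :L]) / (D_x * (1.0 - lam[:L]))  # f_l(x), Eq. (11)
```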
We visualize an extended eigenvector and an extended kernel in Figures 3a and 3b. For $x$ far away from the nodes $x_j$ the values of $D(x)$ become very small, making the extension procedure numerically unstable. Furthermore, the geometric model is generally not so relevant far away from the manifold. Because of this, the final predictive model $f^{(p)} \sim \mathrm{GP}(m^{(p)}, k^{(p)})$ combines the geometric model $f^{(m)} \sim \mathrm{GP}(m^{(m)}, k^{(m)})$, the posterior under the kernel $k^{\mathbf{X}}_{\nu,\kappa,\sigma^2}$ we defined just above, with the standard Euclidean model $f^{(e)} \sim \mathrm{GP}(m^{(e)}, k^{(e)})$, the posterior under a standard Euclidean Gaussian process in $\mathbb{R}^d$, for instance with the squared exponential kernel:

$$f^{(p)}(x) = \gamma(x) f^{(m)}(x) + (1 - \gamma(x)) f^{(e)}(x), \qquad \gamma(x) = \exp\!\left( 1 - \frac{(3\alpha)^2}{(3\alpha)^2 - \operatorname{dist}(x, \mathcal{M})^2} \right), \quad (12)$$

where $\gamma$ is a bump function that is zero outside the $3\alpha$ neighborhood of the manifold, illustrated in Figure 3c. Here $\operatorname{dist}(x, \mathcal{M})$ can be computed as the distance from $x$ to its nearest neighbor node $x_j$.⁵

⁵ In practice it makes sense to compute $\operatorname{dist}(x, \mathcal{M})$ as the average of distances between $x$ and its $K$ nearest neighbors to smoothen the resulting $\gamma(x)$; this is computationally cheap given an efficient KNN implementation.

4 Efficient Training of the Implicit Manifold Gaussian Processes

Here we describe how to perform the implicit manifold Gaussian process regression efficiently, being able to handle hundreds of thousands of points, in both supervised and semi-supervised regimes. In all cases we need efficient (approximate) KNN to build a graph, extend the kernel beyond the nodes $x_i$ and combine the geometric model with the standard Euclidean one. For this we use FAISS (Johnson et al., 2019). The resulting sparse matrices, such as the Laplacian $\Delta_{\mathrm{rw}}$, we represent as black box functions capable of performing matrix-vector multiplications for any given input vector.

After hyperparameters are found (we will return to their search later), we need to compute the eigenpairs $\lambda_l, f_l$ of $\Delta_{\mathrm{rw}}$. For this we run the Lanczos algorithm (Meurant, 2006) to evaluate the eigenpairs $\lambda_l^{\mathrm{sym}}, f_l^{\mathrm{sym}}$ of the symmetric matrix $\Delta_{\mathrm{sym}}$, putting $\lambda_l = \lambda_l^{\mathrm{sym}}$ and $f_l = \mathbf{D}^{-1/2} f_l^{\mathrm{sym}}$, because the matrices $\Delta_{\mathrm{rw}}$ and $\Delta_{\mathrm{sym}}$ are similar, i.e. $\Delta_{\mathrm{rw}} = \mathbf{D}^{-1/2} \Delta_{\mathrm{sym}} \mathbf{D}^{1/2}$. Importantly, the Lanczos algorithm only relies on matrix-vector products with $\Delta_{\mathrm{sym}}$. We only compute a few hundred eigenpairs, asking Lanczos to provide twice or thrice as many and disregarding the rest.

When there is a lot of labeled data, we approximate the classical Euclidean (e.g. squared exponential) kernel using the random Fourier features approximation (Rahimi and Recht, 2007); this allows linear computational complexity scaling with respect to the number of data points. The number of Fourier features is taken to be equal to $L$, with the same $L$ as in Equation (8).

As already mentioned, we need to find hyperparameters $\hat\theta = (\hat\alpha, \hat\kappa, \hat\sigma^2, \hat\sigma_\varepsilon^2)$ that determine the graph, the Gaussian process prior and the noise variance that fit the observations $\mathbf{y}$ best. The $\nu$ parameter we assume manually fixed. To avoid nonsensical parameter values (a common difficulty, often occurring when the data is scarce), one might want to assume a prior $p(\theta)$ on $\theta$. Some specific choices of $p(\theta)$ are discussed in Appendix C. Then $\hat\theta$ is a maximum a posteriori (MAP) estimate:

$$\hat\theta = \arg\max_{\theta} \; \log p(\mathbf{y} \mid \theta, \mathbf{X}) + \log p(\theta). \quad (13)$$

To simplify hyperparameter initialization and align with the zero prior mean assumption, it makes sense to preprocess $y_i$ to be centered and normalized. To solve the optimization problem in Equation (13) we use restarted gradient descent.
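Before turning to the gradient computation, here is a sketch of the eigenpair computation described above ($\lambda_l = \lambda_l^{\mathrm{sym}}$, $f_l = \mathbf{D}^{-1/2} f_l^{\mathrm{sym}}$), using SciPy's sparse ARPACK-based eigensolver as a stand-in for the Lanczos routines the paper relies on; the helper names are ours.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def rw_eigenpairs(A, L):
    # Eigenpairs of Delta_rw = I - D^{-1} A via the similar symmetric matrix
    # Delta_sym = I - D^{-1/2} A D^{-1/2}: same eigenvalues, f_l = D^{-1/2} f_l^sym.
    d = np.asarray(A.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(d))
    L_sym = sp.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt
    # Smallest L eigenvalues; the solver only needs matrix-vector products with L_sym.
    # (Shift-invert, eigsh(..., sigma=...), is usually faster in practice.)
    lam, f_sym = eigsh(L_sym, k=L, which="SM")
    f = D_inv_sqrt @ f_sym        # orthonormal with respect to the D-inner product
    return lam, f
```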
Repeatedly evaluating the gradient of $\log p(\mathbf{y} \mid \theta, \mathbf{X})$ is the main computational bottleneck. The key idea for doing this efficiently (viable for integer values of $\nu$) is to reduce matrix-vector products with the Matérn kernel's precision matrix to iterated matrix-vector products with the Laplacian, which is sparse. First, we describe this in detail in the noiseless supervised setting, where the idea is most directly applicable.

4.1 Noiseless Supervised Learning

Here we assume that all inputs are labeled, i.e. $N = n$, and all observations are noiseless, i.e. $\sigma_\varepsilon^2 = 0$. Denoting by $\mathbf{P}_{\mathbf{XX}} = \mathbf{K}_{\mathbf{XX}}^{-1}$ the precision matrix, the log-likelihood $\log p(\mathbf{y} \mid \theta, \mathbf{X})$, up to a multiplicative constant and an additive constant irrelevant for optimization, is given by

$$\mathcal{L}(\theta) = -\log\det(\mathbf{K}_{\mathbf{XX}}) - \mathbf{y}^\top \mathbf{K}_{\mathbf{XX}}^{-1} \mathbf{y} = \log\det(\mathbf{P}_{\mathbf{XX}}) - \mathbf{y}^\top \mathbf{P}_{\mathbf{XX}} \mathbf{y}. \quad (14)$$

Its gradient may be given, and then subsequently approximated (Gardner et al., 2018), by

$$\frac{\partial \mathcal{L}}{\partial \theta} = \operatorname{tr}\!\left( \mathbf{P}_{\mathbf{XX}}^{-1} \frac{\partial \mathbf{P}_{\mathbf{XX}}}{\partial \theta} \right) - \mathbf{y}^\top \frac{\partial \mathbf{P}_{\mathbf{XX}}}{\partial \theta} \mathbf{y} \approx \mathbf{z}^\top \mathbf{P}_{\mathbf{XX}}^{-1} \frac{\partial \mathbf{P}_{\mathbf{XX}}}{\partial \theta} \mathbf{z} - \mathbf{y}^\top \frac{\partial \mathbf{P}_{\mathbf{XX}}}{\partial \theta} \mathbf{y}, \quad (15)$$

where $\mathbf{z}$ is a random vector consisting of IID variables that are either $1$ or $-1$ with probability $1/2$. The first term on the right-hand side is the stochastic estimate of the trace of Hutchinson (1989). Since the kernels from Section 3.3 coincide with graph Matérn kernels on the nodes $x_i$, we have

$$\mathbf{K}_{\mathbf{XX}} = \sigma^2 \sum_{l=0}^{N-1} \Phi_{\nu,\kappa}(\lambda_l)\, \mathbf{f}_l \mathbf{f}_l^\top, \qquad \Delta_{\mathrm{rw}} \mathbf{f}_l = \lambda_l \mathbf{f}_l, \qquad \mathbf{f}_l^\top \mathbf{D} \mathbf{f}_m = \delta_{lm}. \quad (16)$$

However, as the graph bandwidth $\alpha$ is one of the hyperparameters we optimize over, using Equation (16) would entail repeated eigenpair computations and differentiating through this procedure. Because of this, we use an alternative way to compute matrix-vector products $\mathbf{P}_{\mathbf{XX}} u$, detailed below.⁶

⁶ Though automatic differentiability could in principle work for iterative methods like the Lanczos algorithm, the amount of memory required for storing the gradients of the intermediate steps quickly becomes prohibitive.

Proposition 2. Assuming $\nu \in \mathbb{N}$, the precision matrix $\mathbf{P}_{\mathbf{XX}}$ of $k^{\mathbf{X}}_{\nu,\kappa,\sigma^2}(x_i, x_j)$ can be given by

$$\mathbf{P}_{\mathbf{XX}} = \sigma^{-2} C^{-1}_{\nu,\kappa}\, \mathbf{D} \underbrace{\left( \frac{2\nu}{\kappa^2} \mathbf{I} + \Delta_{\mathrm{rw}} \right) \cdots \left( \frac{2\nu}{\kappa^2} \mathbf{I} + \Delta_{\mathrm{rw}} \right)}_{\nu \text{ times}}. \quad (17)$$

Proof. See Appendix A.

Using Proposition 2 to evaluate matrix-vector products $\mathbf{P}_{\mathbf{XX}} u$ and conjugate gradients (Meurant, 2006) to compute $\mathbf{P}_{\mathbf{XX}}^{-1} \mathbf{z}$ using only matrix-vector products, we can efficiently evaluate the right-hand side of Equation (15), with linear costs with respect to $N$, assuming that the graph is sparse. Preconditioning (Wenger et al., 2022) can be used to further improve the efficiency of the solve.

4.2 Noiseless Semi-Supervised Learning

Here we assume that inputs are partly unlabeled, i.e. $N > n$, while observations are still noiseless, i.e. $\sigma_\varepsilon^2 = 0$. Denote by $\mathbf{Z}$ the labeled part of $\mathbf{X}$. Then the matrix $\mathbf{K}_{\mathbf{XX}}$ in the log-likelihood given by Equation (14) should be substituted with $\mathbf{K}_{\mathbf{ZZ}}$. However, while $\mathbf{P}_{\mathbf{XX}}$ can be represented using Equation (17), the precision $\mathbf{P}_{\mathbf{ZZ}} = \mathbf{K}_{\mathbf{ZZ}}^{-1}$ cannot. We thus compute it as the Schur complement:

$$\mathbf{P}_{\mathbf{ZZ}} = \mathbf{A} - \mathbf{B} \mathbf{D}^{-1} \mathbf{C}, \qquad \mathbf{P}_{\mathbf{XX}} = \begin{pmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \end{pmatrix}, \quad (18)$$

where the partitioning of $\mathbf{P}_{\mathbf{XX}}$ corresponds to partitioning $\mathbf{X}$ into $\mathbf{Z}$ and the rest. Evaluating a matrix-vector product $\mathbf{P}_{\mathbf{ZZ}} u$ requires a solve of the form $\mathbf{D}^{-1}(\mathbf{C} u)$. This solve can also be performed using conjugate gradients, keeping the computational complexity linear in $N$ but increasing the constants.

4.3 Handling Noisy Observations

Finally, we assume noisy observations, i.e. $\sigma_\varepsilon^2 > 0$. The inputs can be partially unlabeled, i.e. $N \geq n$. In this case, the matrix $\mathbf{K}_{\mathbf{XX}}$ in the log-likelihood given by Equation (14) should be substituted with $\mathbf{K}_{\mathbf{ZZ}} + \sigma_\varepsilon^2 \mathbf{I}$. To reduce this to the previously considered cases, we use the Taylor expansion

$$\left( \mathbf{K}_{\mathbf{ZZ}} + \sigma_\varepsilon^2 \mathbf{I} \right)^{-1} = \mathbf{K}_{\mathbf{ZZ}}^{-1} - \sigma_\varepsilon^2 \mathbf{K}_{\mathbf{ZZ}}^{-2} + \sigma_\varepsilon^4 \mathbf{K}_{\mathbf{ZZ}}^{-3} - \ldots = \mathbf{P}_{\mathbf{ZZ}} - \sigma_\varepsilon^2 \mathbf{P}_{\mathbf{ZZ}}^2 + \sigma_\varepsilon^4 \mathbf{P}_{\mathbf{ZZ}}^3 - \ldots \quad (19)$$

In practice, we only use the first two terms on the right-hand side as an approximation. This allows us to retain linear computational complexity scaling with respect to $N$ but increases the constants.
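Putting Sections 4.1 to 4.3 together, here is a minimal SciPy sketch of the two matrix-free primitives: the precision matrix-vector product of Proposition 2 and the Hutchinson/conjugate-gradient estimate of Equation (15). The optional normalization constant $C_{\nu,\kappa}$ is omitted, the derivative is taken with respect to $\kappa$ only, and the finite-difference derivative is merely a stand-in for the automatic differentiation used in the paper; helper names are ours.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def prec_matvec(u, A, nu, kappa, sigma2):
    # P_XX u via Proposition 2 (integer nu): sigma^-2 D (2nu/kappa^2 I + Delta_rw)^nu u,
    # implemented as nu iterated sparse matrix-vector products (C_{nu,kappa} omitted).
    d = np.asarray(A.sum(axis=1)).ravel()
    shift = 2.0 * nu / kappa ** 2
    v = u.copy()
    for _ in range(int(nu)):
        v = shift * v + (v - (A @ v) / d)   # (2nu/kappa^2 I + I - D^{-1} A) v
    return d * v / sigma2

def nll_gradient_estimate(y, A, nu, kappa, sigma2, dkappa=1e-4, seed=0):
    # d/d(kappa) of L = log det P - y^T P y, Equation (15), with a Hutchinson probe:
    #   dL/dkappa ~= z^T P^{-1} (dP/dkappa) z - y^T (dP/dkappa) y.
    rng = np.random.default_rng(seed)
    n = len(y)
    z = rng.choice([-1.0, 1.0], size=n)     # Rademacher vector
    P = LinearOperator((n, n), matvec=lambda u: prec_matvec(u, A, nu, kappa, sigma2))
    dP = lambda u: (prec_matvec(u, A, nu, kappa + dkappa, sigma2)
                    - prec_matvec(u, A, nu, kappa - dkappa, sigma2)) / (2.0 * dkappa)
    Pinv_z, info = cg(P, z)                 # solve using matrix-vector products only
    assert info == 0, "conjugate gradients did not converge"
    return Pinv_z @ dP(z) - y @ dP(y)
```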
4.4 Resulting Algorithm

Here we provide a concise summary of the implicit manifold Gaussian process regression algorithm.

Step 1: KNN index. Construct the KNN index on the points $x_1, \ldots, x_N$. This allows linear-time evaluation of any matrix-vector product with $\tilde{\mathbf{A}}$, and thus also with $\mathbf{A}$, $\Delta_{\mathrm{rw}}$, $\mathbf{P}_{\mathbf{XX}}$ for $\nu \in \mathbb{N}$, etc.

Step 2: hyperparameter optimization. Find the hyperparameters $\hat\theta$ that solve Equation (13). Assuming $\nu = \hat\nu \in \mathbb{N}$ is manually fixed, this relies only on matrix-vector products with $\Delta_{\mathrm{rw}}$.

Step 3: computing the eigenpairs. Fixing the graph bandwidth $\hat\alpha$ found in Step 2, compute the eigenpairs $\lambda_l, f_l$ corresponding to the $L$ smallest eigenvalues $\lambda_l$. For large $N$, use the Lanczos algorithm.

After the steps above are finished, Equations (8) and (11) define the geometric kernel $k^{\mathbf{X}}_{\hat\nu,\hat\kappa,\hat\sigma^2}(x, x')$ for arbitrary $x, x' \in \mathbb{R}^d$. Then the respective prior $\mathrm{GP}(0, k^{\mathbf{X}}_{\hat\nu,\hat\kappa,\hat\sigma^2})$ can be conditioned on the labeled data in the standard way, yielding the posterior $f^{(m)} \sim \mathrm{GP}(m^{(m)}, k^{(m)})$. To get sensible predictions far away from the data, the geometric model $f^{(m)}$ is convexly combined with an independently trained classical Gaussian process model, as given by Equation (12). The resulting predictive model is still a Gaussian process, a sum of two appropriately weighted independent Gaussian processes.

Remark. The number of neighbors $K$, the number of eigenpairs $L$ and the smoothness $\nu$ are assumed to be manually fixed parameters. Higher values of $K$ and $L$ improve the quality of approximation of the manifold kernel, which is often linked to better predictive performance, but require more computational resources. The parameter $\nu$ can be picked using cross validation or prior knowledge. Small integer values of $\nu$ reduce computational costs, but may be inadequate for higher dimensions of the assumed manifold due to the $\nu' = \nu - \dim(\mathcal{M})/2$ link with the manifold kernel smoothness $\nu'$.
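As a sketch of these last two steps (conditioning the truncated-spectrum prior on the labeled data, then blending with a Euclidean model via Equation (12)), one can work in the $L$-dimensional feature space of the kernel, which makes the conditioning a small linear solve. This is again a simplified illustration under our own naming, not the paper's implementation.

```python
import numpy as np

def geometric_posterior(f_nodes, lam, y, labeled_idx, nu, kappa, sigma2, noise_var):
    # Condition GP(0, k^X) on the labeled nodes using the features
    # phi_l(x) = sigma * Phi_{nu,kappa}(lambda_l)^{1/2} * f_l(x), so that k^X = phi phi^T.
    phi = np.sqrt(sigma2 * (2.0 * nu / kappa ** 2 + lam) ** (-nu))   # shape (L,)
    Phi_all = f_nodes * phi                                          # features at all nodes
    Phi_Z = Phi_all[labeled_idx]                                     # labeled rows only
    S = Phi_Z.T @ Phi_Z + noise_var * np.eye(len(lam))
    w_mean = np.linalg.solve(S, Phi_Z.T @ y)                         # posterior weight mean
    w_cov = noise_var * np.linalg.inv(S)                             # posterior weight cov
    mean = Phi_all @ w_mean
    var = np.einsum("il,lm,im->i", Phi_all, w_cov, Phi_all)
    return mean, var

def blend_with_euclidean(dist_to_M, alpha, m_geo, v_geo, m_euc, v_euc):
    # Equation (12): gamma-weighted convex combination of two independent posteriors.
    t = dist_to_M ** 2
    cutoff = (3.0 * alpha) ** 2
    gamma = np.where(t < cutoff,
                     np.exp(1.0 - cutoff / np.clip(cutoff - t, 1e-12, None)), 0.0)
    mean = gamma * m_geo + (1.0 - gamma) * m_euc
    var = gamma ** 2 * v_geo + (1.0 - gamma) ** 2 * v_euc  # independence of the two GPs
    return mean, var
```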
5 Experiments

We start in Section 5.1 by examining a simple synthetic example to gain intuition on how noise-sensitive the technique is. Then in Section 5.2 we consider real datasets, showing improvements in higher dimensions. More experiments, results, and additional discussion can be found in Appendix B.

Figure 4: The ground truth function on the dumbbell manifold and the predictions of the implicit manifold Gaussian process regression (IMGP) under different levels of noise. Panels: (a) ground truth; (b) noiseless, prediction; (c) low noise, prediction; (d) high noise, prediction.

5.1 Synthetic Examples

We consider a one-dimensional manifold resembling the shape of a dumbbell, which already appeared in Figures 1 and 3. The unknown function $f$ is defined by fixing a point $x'$ in the top left part of the dumbbell and computing $\sin(d(x', \cdot))$, where $d(\cdot, \cdot)$ denotes the geodesic (intrinsic) distance between a pair of points on the manifold. This function is illustrated in Figure 4a. To measure performance we primarily rely on the negative log-likelihood (NLL) on a dense mesh of test locations. We do this because such a metric is able to combine accuracy and calibration simultaneously. Additionally, we present the root mean square error (RMSE). We investigate a semi-supervised setting where the number of unlabeled points is large ($N - n = 1546$) and the number of labeled points is small ($n = 10$).

We contaminate the inputs with noise, putting $\mathbf{X} = \mathbf{X}_{\text{noiseless}} + \mathrm{N}(0, \sigma_X^2 \mathbf{I})$, and do the same with the outputs, putting $\mathbf{y} = f(\mathbf{X}) + \mathrm{N}(0, \sigma_y^2 \mathbf{I})$, for various values of $\sigma_X, \sigma_y > 0$. Specifically, we consider $\sigma_X = \sigma_y = \beta \in \{0, 0.01, 0.05\}$, which we refer to as the noiseless setting, the low noise setting and the high noise setting, respectively. The results for these are visualized in Figures 4b to 4d with performance metrics reported in Table 1. The implicit manifold Gaussian process regression is referred to as IMGP (we use $\nu = 1$) and it is compared with the standard Euclidean Matérn-5/2 Gaussian process. IMGP performs much better in the noiseless and the low noise settings. The high noise is enough to damage the calibration of IMGP, as it ties with the baseline model: NLL is slightly worse and RMSE is slightly better. In Appendix B.1 we show how performance depends on the fraction $n/N$ of labeled data points and on the truncation level $L$, and we discuss the choice of the $K$ parameter in KNN. Additionally, in Appendix B.2 we consider noise-sensitivity for a 2D manifold.

| Method | RMSE, β = 0 | RMSE, β = 0.01 | RMSE, β = 0.05 | NLL, β = 0 | NLL, β = 0.01 | NLL, β = 0.05 |
|---|---|---|---|---|---|---|
| Euclidean Matérn-5/2 | 0.98 | 0.99 ± 0.02 | 1.02 ± 0.03 | 2.17 | 2.09 ± 0.03 | 1.91 ± 0.10 |
| IMGP | 0.33 | 0.34 ± 0.02 | 1.00 ± 0.02 | 5.02 | 4.19 ± 0.1 | 1.91 ± 1.88 |

Table 1: Performance metrics for the dumbbell manifold with varying magnitude of noise β.

5.2 High Dimensional Datasets

For the high-dimensional setting, we considered predicting rotation angles for MNIST-based datasets. Additionally, we examined a high-dimensional dataset from the UCI ML Repository, CT slices.

5.2.1 Setup

Datasets. We consider two MNIST-based datasets. The first one is created by extracting a single image per digit from the complete MNIST dataset. By randomly rotating these 10 images we obtained N = 10000 training samples and 1000 testing samples. We call it Single Rotated MNIST (SR-MNIST). For the second dataset, we select 100 random samples from MNIST. By randomly rotating these, we generate N = 100 000 training samples, most of which will be unlabeled, and 10 000 testing samples. We call it Multiple Rotated MNIST (MR-MNIST). The last dataset, CT slices, has dimensionality d = 385; we split it to have N = 24075 training samples and 24075 testing samples. Dataset names can be complemented by the fraction of labeled samples, e.g. MR-MNIST-10% refers to n = 10%N.

| Method | MNIST: SR - 10% | MNIST: MR - 1% | MNIST: MR - 10% | CT slices: 5% | CT slices: 10% | CT slices: 25% |
|---|---|---|---|---|---|---|
| EGP | 0.54 ± 0.01 | 0.20 ± 0.01 | 0.43 ± 0.01 | 0.80 ± 0.02 | 0.96 ± 0.00 | 1.20 ± 0.09 |
| S-IMGP | 1.42 ± 0.01 | 2.24 ± 0.20 | 0.68 ± 0.08 | 0.47 ± 0.06 | 0.59 ± 0.08 | 0.08 ± 0.01 |
| SS-IMGP | 1.52 ± 0.01 | 0.59 ± 0.01 | 0.79 ± 0.00 | 26.1 ± 12.7 | 1.03 ± 0.09 | 0.72 ± 0.68 |
| S-IMGP (full) | - | - | - | 0.64 ± 0.83 | 0.88 ± 0.29 | 0.42 ± 0.10 |
| SS-IMGP (full) | - | - | - | 2.48 ± 0.08 | 2.35 ± 0.04 | 1.99 ± 0.04 |

Table 2: Negative log likelihood on test samples for real datasets. For RMSE see Tables 4 and 5.

Methods. We consider implicit manifold Gaussian processes in the supervised regime (S-IMGP) and in the semi-supervised regime (SS-IMGP). We compare them to the GPyTorch implementation of the Euclidean Matérn-5/2 Gaussian process. We refer to it as the Euclidean Gaussian Process (EGP).

Additional details. We run 100 iterations of hyperparameter optimization using Adam with a fixed learning rate of 0.01. For MNIST, we use IMGP with ν = 2; for CT slices, with ν = 3.

5.2.2 Results

Table 2 shows the negative log-likelihood metric for different datasets and methods on the test set. The respective RMSEs are presented in Appendix B.3. On SR-MNIST, IMGP outperforms EGP in both supervised and semi-supervised scenarios. MR-MNIST is more challenging.
In the supervised setting for n = 1%N, S-IMGP is incapable of inferring the underlying manifold structure, performing worse than EGP. However, SS-IMGP, with more data to infer manifold from, performs best. For n = 10%N, IMGP gets a better grip of the dataset s geometry, outperforming EGP in both regimes. For CT slices, regardless of n, both S-IMGP and SS-IMGP performed poorly. Looking for an explanation, we considered two modifications. First, we fixed the graph bandwidth ˆα found in the algorithm s Step 2 (cf. Section 4.4), and re-optimized the other hyperparameters κ, σ2, σ2 ε by maximizing the likelihood of the eigenpair-based model (truncated to L = 2000 eigenpairs) computed in the algorithm s Step 3. This resulted in limited improvement but did not change the big picture in fact, values for S-IMGP and SS-IMGP in Table 2 for CT slices already include this modification. Second, on top of this hyperparameter re-optimization, we tried computing the eigenpairs using torch.linalg.eigh instead of the Lanczos implementation in GPy Torch, taking the same number L = 2000 of eigenpairs. The resulting methods S-IMGP (full) and SS-IMGP (full) showed considerable improvement over the baseline, as shown in Table 2. This indicated an issue with the quality of eigenpairs derived from the Lanczos method which requires further investigation. We discuss this in Appendix B.3, together with the aforementioned hyperparameter re-optimization procedure. 6 Conclusion In this work, we propose the implicit manifold Gaussian process regression technique. It is able to use unlabeled data to improve predictions and uncertainty calibration by learning the implicit manifold upon which the data lies, being inspired by the convergence of graph Matérn Gaussian processes to their manifold counterparts. This helps building better probabilistic models in higher dimensional settings where the standard Euclidean Gaussian processes usually struggle. This is supported by our experiments in a synthetic low-dimensional setting and for high-dimensional datasets. Leveraging sparse structure of graph Matérn precision matrices and efficient approximate KNN, the technique is able to scale to large datasets of hundreds of thousands points, which is especially important in high dimension, where a large number of unlabeled points is often needed to learn the implicit manifold. The model is fully differentiable, making it possible to infer hyperparameters in the usual way. Limitations. The quality of the constructed graph significantly influences the technique s performance. When dealing with data from complex manifolds or exhibiting highly non-uniform density, simplistic KNN strategies might fail to capture the manifold structure due to their reliance on a single graph bandwidth. In such scenarios, larger values of parameters K and L, or in high dimensions, of parameter ν, may be beneficial but could substantially increase computational costs. Furthermore, larger datasets coupled with high parameter values can lead to numerical stability issues, for instance, in the Lanczos algorithm, calling for further improvements and research. Despite these challenges, our method shows promise for advancing probabilistic modeling in higher dimensions. Acknowledgements We express our gratitude to Alexander Shulzhenko (St. 
Petersburg University, mentored by VB), whose initial work on a similar problem and accessible source material at https://github.com/ Alexander Shulzhenko/Implicit-Manifold-Gaussian-Processes, while not included in the paper, sparked inspiration for this project. We thank Dr. Alexander Terenin (Cornell University) for generously making his Blender rendering scripts accessible to the public. These scripts, which aided us in creating Figure 2, can be found in his Ph.D. thesis repository at https://github.com/ aterenin/phdthesis. BF and AB acknowledge support by the European Research Council (ERC), Advanced Grant agreement No 741945, Skill Acquisition in Humans and Robots. VB acknowledges support by an ETH Zürich Postdoctoral Fellowship. I. Azangulov, A. Smolensky, A. Terenin, and V. Borovitskiy. Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the compact case. ar Xiv preprint ar Xiv:2208.14960, 2022. Cited on pages 2, 3. I. Azangulov, A. Smolensky, A. Terenin, and V. Borovitskiy. Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces II: non-compact symmetric spaces. ar Xiv preprint ar Xiv:2301.13088, 2023. Cited on page 2. M. Binois and N. Wycoff. A survey on high-dimensional Gaussian process modeling with application to Bayesian optimization. ACM Transactions on Evolutionary Learning and Optimization, 2(2):1 26, 2022. Cited on page 2. V. Borovitskiy, I. Azangulov, A. Terenin, P. Mostowsky, M. Deisenroth, and N. Durrande. Matérn Gaussian Processes on Graphs. In International Conference on Artificial Intelligence and Statistics, 2021. Cited on pages 2, 3. V. Borovitskiy, M. R. Karimi, V. R. Somnath, and A. Krause. Isotropic Gaussian Processes on Finite Spaces of Graphs. In International Conference on Artificial Intelligence and Statistics, 2023. Cited on page 2. V. Borovitskiy, A. Terenin, P. Mostowsky, and M. P. Deisenroth. Matérn Gaussian Processes on Riemannian Manifolds. In Advances in Neural Information Processing Systems, 2020. Cited on pages 2, 3, 14. J. Calder and N. G. Trillos. Improved spectral convergence rates for graph Laplacians on ε-graphs and k-NN graphs. Applied and Computational Harmonic Analysis, 60:123 175, 2022. Cited on page 4. J.-P. Chilès and P. Delfiner. Geostatistics: Modeling Spatial Uncertainty. John Wiley & Sons, 2012. Cited on page 1. R. R. Coifman and S. Lafon. Diffusion Maps. Applied and Computational Harmonic Analysis. Special Issue: Diffusion Maps and Wavelets, 21(1):5 30, 2006. Cited on pages 2, 4. E. De Vito, N. Mücke, and L. Rosasco. Reproducing kernel Hilbert spaces on manifolds: Sobolev and diffusion spaces. Analysis and Applications, 19(03):363 396, 2021. Cited on page 14. M. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, 2011. Cited on page 1. H. Donnelly. Eigenfunctions of the Laplacian on compact Riemannian manifolds. Asian Journal of Mathematics, 10(1):115 126, 2006. Cited on page 14. D. B. Dunson, H.-T. Wu, and N. Wu. Graph based Gaussian processes on restricted domains. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(2):414 439, 2022. Cited on page 2. D. B. Dunson, H.-T. Wu, and N. Wu. Spectral convergence of graph Laplacian and heat kernel reconstruction in L from random samples. Applied and Computational Harmonic Analysis, 55:282 336, 2021. Cited on pages 2, 4. A. Feragen, F. Lauze, and S. Hauberg. 
Geodesic exponential kernels: When curvature and linearity conflict. In Conference on Computer Vision and Pattern Recognition, 2015. Cited on page 3. N. García Trillos, M. Gerlach, M. Hein, and D. Slepˇcev. Error estimates for spectral convergence of the graph Laplacian on random geometric graphs toward the Laplace Beltrami operator. Foundations of Computational Mathematics, 20(4):827 887, 2020. Cited on page 4. J. Gardner, G. Pleiss, K. Q. Weinberger, D. Bindel, and A. G. Wilson. GPy Torch: Blackbox Matrix Matrix Gaussian Process Inference with GPU Acceleration. In Advances in Neural Information Processing Systems, 2018. Cited on pages 7, 18. T. Gneiting. Strictly and non-strictly positive definite functions on spheres. Bernoulli, 19(4):1327 1349, 2013. Cited on page 3. I. S. Gradshteyn and I. M. Ryzhik. Table of Integrals, Series, and Products. Academic Press, 7th edition, 2014. Cited on page 3. M. Hein, J.-Y. Audibert, and U. von Luxburg. Graph Laplacians and their Convergence on Random Neighborhood Graphs. Journal of Machine Learning Research, 8(48):1325 1368, 2007. Cited on page 4. P. Hennig, M. A. Osborne, and M. Girolami. Probabilistic numerics and uncertainty in computations. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 471(2179):20150142, 2015. Cited on page 1. M. F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18(3):1059 1076, 1989. Cited on page 7. J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535 547, 2019. Cited on page 6. R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete structures. In International Conference on Machine Learning, 2002. Cited on page 2. J. A. Lee and M. Verleysen. Nonlinear Dimensionality Reduction. Springer Science & Business Media, 2007. Cited on page 4. F. Lindgren, H. Rue, and J. Lindström. An Explicit Link between Gaussian Fields and Gaussian Markov Random Fields: The Stochastic Partial Differential Equation Approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4):423 498, 2011. Cited on page 3. Y. Ma and Y. Fu. Manifold Learning Theory and Applications. CRC press, 2011. Cited on page 4. G. Meurant. The Lanczos and conjugate gradient algorithms: from theory to finite precision computations. SIAM, 2006. Cited on pages 7, 8. F. Pavutnitskiy, S. O. Ivanov, E. Abramov, V. Borovitskiy, A. Klochkov, V. Vyalov, A. Zaikovskii, and A. Petiushko. Quadric hypersurface intersection for manifold learning in feature space. In International Conference on Artificial Intelligence and Statistics, 2022. Cited on page 4. R. L. Peach, M. Vinao-Carl, N. Grossman, M. David, E. Mallas, D. Sharp, P. A. Malhotra, P. Vandergheynst, and A. Gosztolai. Implicit Gaussian process representation of vector fields over arbitrary latent manifolds. ar Xiv preprint ar Xiv:2309.16746, 2023. Cited on page 2. A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 2007. Cited on page 7. C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2006. Cited on pages 1, 2. J. Wenger, G. Pleiss, P. Hennig, J. Cunningham, and J. Gardner. Preconditioning for scalable Gaussian process hyperparameter optimization. 
In International Conference on Machine Learning, pages 23751 23780. PMLR, 2022. Cited on page 8. P. Whittle. Stochastic Processes in Several Dimensions. Bulletin of the International Statistical Institute, 40:974 994, 1963. Cited on page 3. J. T. Wilson, V. Borovitskiy, A. Terenin, P. Mostowsky, and M. P. Deisenroth. Efficiently Sampling Functions from Gaussian Process Posteriors. In International Conference on Machine Learning, 2020. Cited on page 2. J. T. Wilson, V. Borovitskiy, A. Terenin, P. Mostowsky, and M. P. Deisenroth. Pathwise Conditioning of Gaussian Processes. Journal of Machine Learning Research, 22(105):1 47, 2021. Cited on page 2. Proposition 1. Denote the eigenpairs by λl, fl for a graph Laplacian and by λM l , f M l for the Laplace Beltrami operator. Fix δ > 0. Assume that, with probability at least 1 δ, for all ε > 0, for α small enough and for N large enough we have |λl λM l | < ε and |fl(xi) f M l (xi)| < ε. Then, with probability at least 1 δ, we have k N,α,L ν,κ,σ2(xi, xj) kν,κ,σ2(xi, xj) as α 0, N, L , where k N,α,L ν,κ,σ2(xi, xj) = σ2 ν dim(M)/2 fl(xi)fl(xj). (6) Proof. Fix small ε > 0. We will prove that for α small enough and N, L large enough we have |kν,κ,σ2 f (xi, xj) k N,α,L ν,κ,σ2 f (xi, xj)| < Cε for some C > 0 with probability at least 1 δ. Since the assumption holds on the same event of probability 1 δ for all ε, this directly translates to the convergence on the same event. In fact, a probabilistic narrative is nonessential for what we actually prove, and we do not use it below. To simplify notation, we replace PL 1 l=0 by PL l=0. First, for any L Z>0 define the truncated version k L ν,κ,σ2 f of the manifold kernel kν,κ,σ2 f by k L ν,κ,σ2 f (x, x ) = σ2 f Cν,κ ν dim(M)/2 f M l (x)f M l (x ). (20) The manifold Matérn kernels are the reproducing kernels of Sobolev spaces, if the latter are defined appropriately (Borovitskiy et al., 2020). These are Mercer kernels (De Vito et al., 2021), hence, by Mercer s theorem, k L ν,κ,σ2 f kν,κ,σ2 f uniformly on M, i.e. for L large enough we have k L ν,κ,σ2 f (xi, xj) kν,κ,σ2 f (xi, xj) sup x,x M k L ν,κ,σ2 f (x, x ) kν,κ,σ2 f (x, x ) < ε. (21) Now suppose that α is small enough and N is large enough so that λl λM l < ε , |fl(xi) f M l (xi)| < ε , ε = min 1, λM L d 1 4 , λM L d 1 for all l {1, .., L} and for all i {1, .., N} with probability at least 1 δ. Assuming the manifold is connected, by Donnelly (2006) we have |f M l | Cλ λM l d 1 4 for l > 0, where Cλ > 0 is a constant that depends on the geometry of the manifold. The case l = 0 is special because λM 0 = 0. Since f M 0 is a constant function, we have |f M l | Cλ max((λM l ) d 1 4 , 1) (23) for all l 0, where Cλ here is potentially different from the Cλ before. Assuming, without loss of generality, ε < 1, we have fl(xi)fl(xj) f M l (xi)f M l (xj) |fl(xi)| fl(xj) f M l (xj) (24) + f M l (xj) fl(xi) f M l (xi) (25) ε |fl(xi)| + f M l (xj) (26) ε fl(xi) f M l (xi) + f M l (xi) + f M l (xj) (27) ε ε + 2Cλ max((λM l ) d 1 4 , 1) (1 + 2Cλ)ε The function Φ(λ) = 2ν κ2 + λ ν dim(M)/2 is Lipschitz: |Φ(λ) Φ(λ )| CΦ|λ λ | where CΦ = sup λ 0 |Φ (λ)| = sup λ 0 (ν + dim(M)/2) 2ν κ2 + λ ν dim(M)/2 1 = (ν + dim(M)/2) 2ν ν dim(M)/2 1 . (29) Define an auxiliary kernel with manifold eigenvalues and graph eigenfunctions by k L ν,κ,σ2 f (x, x ) = σ2 f Cν,κ ν dim(M)/2 fl(x)fl(x ). (30) k L ν,κ,σ2 f (xi, xj) k L ν,κ,σ2 f (xi, xj) ν dim(M)/2 (1 + 2Cλ)ε Φ(0)(1 + 2Cλ)ε. 
(32) Also, noting that |fl(xi)| f M l (xi) fl(xi) + f M l (xi) ε + Cλ max((λM l ) d 1 4 , 1), write k L ν,κ,σ2 f (xi, xj) k N,α,L ν,κ,σ2 f (xi, xj) Φ(λM l ) Φ(λl) |fl(xi)||fl(xj)| (33) l=0 CΦ λM l λl |fl(xi)||fl(xj)| (34) l=0 2CΦε (ε )2 + C2 λ max((λM l ) d 1 2 , 1) (35) 2CΦ(1 + C2 λ)ε. (36) Finally, kν,κ,σ2 f (xi, xj) k N,α,L ν,κ,σ2 f (xi, xj) kν,κ,σ2 f (xi, xj) k L ν,κ,σ2 f (xi, xj) (37) + k L ν,κ,σ2 f (xi, xj) k L ν,κ,σ2 f (xi, xj) (38) + k L ν,κ,σ2 f (xi, xj) k N,α,L ν,κ,σ2 f (xi, xj) (39) ε + σ2 f Cν,κ Φ(0)(1 + 2Cλ) + 2CΦ(1 + C2 λ) ε. (40) This proves the claim. Proposition 2. Assuming ν N, the precision matrix PXX of k X ν,κ,σ2(xi, xj) can be given by C 1 ν,κ D 2ν κ2 I + rw . . . 2ν Proof. The covariance matrix KXX is given by l=0 Φ(λl)f lf l , Φ(λ) = 2ν κ2 + λ ν (41) where f l = D 1/2f sym l and f sym l are the orthonormal eigenvectors of the symmetric normalized Laplacian sym. Denote Ksym XX = σ2 l=0 Φ(λl)f sym l (f sym l ) (42) then (Ksym XX) 1 = σ 2 C 1 ν,κ PL 1 l=0 2ν κ2 + λl νf sym l (f sym l ) = σ 2 κ2 I + sym ν. We have l=0 Φ(λl)D 1/2f sym l (f sym l ) D 1/2 = D 1/2Ksym XXD 1/2. (43) (a) RMSE depending on L (b) NLL depending on L (c) RMSE depending on f (d) NLL depending on f Figure 5: Root Mean Square Error (RMSE) and Negative Log-Likelihood (NLL) for increasing number of eigenpairs L (left panels) and increasing fraction f = n%N of labeled points (right panels). The legend in (a) and (b) refers to the number of hyperparameter optimization iterations. On the other hand, σ 2 f C 1 ν,κ D 2ν ν = σ 2 f C 1 ν,κ D 2ν κ2 I + D 1/2 sym D1/2 ν (44) = σ 2 f C 1 ν,κ D1/2 2ν ν D1/2 = D1/2(Ksym XX) 1D1/2 (45) = K 1 XX = PXX. (46) B Additional Experimental Results and Details Here we provide additional results and details for synthetic examples and high-dimensional datasets. B.1 1D Dumbbell Manifold In this section, we analyze our method s sensitivity to the number of labeled points, spectrum truncation, and neighbor count in graph construction, offering deeper insights into its behavior. Eigenpairs Truncation. From a theoretical standpoint, utilizing the complete set of eigenpairs to construct the kernel should be beneficial. However, this assumption may not hold true in cases where the optimization problem is not fully converged. The trends of RMSE and NLL, as depicted in Figures 5a and 5b, illustrate the impact of increasing the number of eigenpairs for 100, 200, and 1000 iterations. In situations where hyperparameter optimization does not converge fully, the length scale parameter κ fails to reach sufficiently high values necessary to generate appropriate spectral density decay, which in turn would properly weigh higher frequency eigenpairs. In such scenarios, truncating the spectrum is similar to increasing the length scale, which might improve results in certain cases. In the 1D scenario, due to its simplicity, we opted for a modest number of eigenpairs, namely L = 50. Exceeding this count did not yield any discernible improvements. Dataset Size. We analyze the sensitivity to dataset size of our method (IMGP) against the Euclidean case (EGP) in the semi-supervised learning scenario. For fixed number of eigenpairs, L = 50, Figures 5c and 5d show performance metrics (RMSE and NLL) depending on the percentage n%N of labeled points. In a typical scenario where we have at our disposal fewer labeled points compared to the number of unlabeled points, our method outperforms EGP in both accuracy (RMSE) and uncertainty quantification (NLL). 
As n increases, EGP starts to match the performance of IMGP.

Number of neighbors. For the one-dimensional case considered in this section, only three neighbors are necessary to capture the essential features of the manifold's geometry. Choosing a number less than three would hinder the algorithm's ability to capture the underlying manifold structure. On the other hand, increasing the number of neighbors beyond this threshold does not affect the solution, provided that sufficient time is allocated for hyperparameter optimization to converge and the graph bandwidth becomes small enough to correct for all undesired edges in the graph structure (for instance in the central region of the dumbbell). In high-dimensional problems where the dimension of the underlying manifold is unknown, this suggests incrementally increasing the number of neighbors until the loss function stops improving, if it is computationally feasible to try multiple values of K.

B.2 2D Dragon Manifold

We consider another synthetic setting, a complex 2D manifold, depicted in Figure 6. The ground truth function is visualized in Figure 6a; it is the same as in the one-dimensional case, the sine of the geodesic distance to a point, which is located in the green area. In the semi-supervised learning scenario, Figures 6e and 6b offer a comparison of the posterior mean between the Euclidean GP and IMGP, while Figures 6f and 6g illustrate the posterior standard deviation for both models, quite similar to each other in this regime. Similar to what we did for the 1D compact manifold in Section 5.1, we evaluated the performance of IMGP under varying levels of sampling noise. Figures 6b to 6d display the IMGP predictions in the semi-supervised learning scenario as sampling noise increases. Similar to the 1D case, our approach consistently outperforms the standard Euclidean GP, as evidenced in Table 3.

Figure 6: (a) Ground truth function on the complex 2D manifold; (b)-(d) predictions of the implicit manifold Gaussian process regression (IMGP) for increasing levels of sampling noise β ∈ {0, 0.01, 0.05}; (e)-(f) Euclidean GP prediction and uncertainty; (g) IMGP uncertainty in the noiseless scenario.

| Method | RMSE, β = 0 | RMSE, β = 0.01 | RMSE, β = 0.05 | NLL, β = 0 | NLL, β = 0.01 | NLL, β = 0.05 |
|---|---|---|---|---|---|---|
| Euclidean Matérn-5/2 | 0.24 ± 0.02 | 0.24 ± 0.01 | 0.26 ± 0.00 | 0.85 ± 0.46 | 0.16 ± 1.58 | 1.26 ± 1.80 |
| IMGP | 0.12 ± 0.01 | 0.21 ± 0.01 | 0.22 ± 0.01 | 2.14 ± 0.13 | 1.51 ± 0.03 | 1.30 ± 0.09 |

Table 3: Results for a complex 2D manifold with varying magnitude of sampling noise.

Remark. We observed that for higher levels of sampling noise, the linear combination of posteriors, as described by Equation (12), significantly outperforms the single geometric model.

B.3 High Dimensional Datasets

Rotated MNIST. For IMGP, we use ν = 2. As discussed in Appendix B.1, the optimal number of eigenpairs L varies considerably depending on the convergence of the optimization problem. Given the limited number of iterations per run, we opted to fix the number of eigenpairs at 20%N and 2%N for SR-MNIST and MR-MNIST, respectively. Table 4 reports the complete results obtained for the rotated MNIST dataset, including both RMSE and NLL. Notably, these emphasize the importance of unlabeled points, as S-IMGP performs worse than both SS-IMGP and EGP on MR-MNIST-1%.

| Dataset | n/N | RMSE: S-IMGP | RMSE: SS-IMGP | RMSE: EGP | NLL: S-IMGP | NLL: SS-IMGP | NLL: EGP |
|---|---|---|---|---|---|---|---|
| SR-MNIST | 10% | 0.04 ± 0.01 | 0.01 ± 0.00 | 0.12 ± 0.01 | 1.42 ± 0.01 | 1.52 ± 0.01 | 0.54 ± 0.01 |
| MR-MNIST | 1% | 0.74 ± 0.02 | 0.43 ± 0.02 | 0.74 ± 0.02 | 2.24 ± 0.20 | 0.59 ± 0.01 | 0.20 ± 0.01 |
| MR-MNIST | 10% | 0.14 ± 0.05 | 0.03 ± 0.00 | 0.13 ± 0.00 | 0.68 ± 0.08 | 0.79 ± 0.00 | 0.43 ± 0.01 |

Table 4: Results for the rotated MNIST dataset.

CT slices. For IMGP, we use ν = 3. In Table 5 we report results for S-IMGP (full) and SS-IMGP (full).
These are based on the exact eigenpairs, computed by torch.linalg.eigh, as opposed to the standard Lanczos implementation we use by default. Additionally, these include the hyperparameter re-optimization step, as described in Section 5.2 and discussed below. Regarding RMSE, all three compared methods, S-IMGP (full), SS-IMGP (full) and EGP, exhibit similar performance, with a slight advantage for SS-IMGP (full) as the number of training points increases. When considering NLL, as previously observed with MNIST, SS-IMGP performs best in all settings. However, NLL decreases for larger values of n/N, indicating a probable overfit in this regime.

| n/N | RMSE: S-IMGP (full) | RMSE: SS-IMGP (full) | RMSE: EGP | NLL: S-IMGP (full) | NLL: SS-IMGP (full) | NLL: EGP |
|---|---|---|---|---|---|---|
| 5% | 0.27 ± 0.02 | 0.39 ± 0.04 | 0.24 ± 0.02 | 0.64 ± 0.83 | 2.48 ± 0.08 | 0.80 ± 0.02 |
| 10% | 0.22 ± 0.02 | 0.19 ± 0.01 | 0.16 ± 0.00 | 0.88 ± 0.29 | 2.35 ± 0.04 | 0.96 ± 0.00 |
| 25% | 0.14 ± 0.01 | 0.08 ± 0.02 | 0.09 ± 0.01 | 0.42 ± 0.10 | 1.99 ± 0.04 | 1.20 ± 0.09 |
| 50% | 0.08 ± 0.01 | 0.07 ± 0.01 | 0.08 ± 0.04 | 0.96 ± 0.11 | 2.04 ± 0.02 | 1.02 ± 0.04 |

Table 5: Results for Relative location of CT slices on axial axis (d = 385, N = 48150) from the UCI Machine Learning Repository.

Hyperparameter re-optimization. We observed this step to serve two important purposes: (1) fixing overly small values of the signal variance σ², potentially caused by the absence of covariance normalization in the optimization process and poor convergence; (2) adjusting the length scale parameter to take into account the loss of high-frequency components due to truncation.

Implementation. Currently, the implementation faces two significant limitations that might restrict its usability to high-memory hardware setups. Firstly, due to the absence of rich enough sparse matrix routines in PyTorch, we had to develop our own custom implementation of differentiable sparse operators. We kept to high-level routines, which forced us to strike a balance between performance and memory efficiency. In particular, for sparse matrix-vector product operations, our approach relies on highly optimized vectorized code, delivering high performance on GPU at the expense of increased memory allocation. Secondly, we faced challenges with sparse eigensolvers in the PyTorch ecosystem. Our attempts at using the PyTorch implementation of the LOBPCG algorithm, torch.lobpcg, yielded relatively poor results. We had a similar experience with the Lanczos implementation from GPyTorch (Gardner et al., 2018). In light of this, when extracting higher-frequency components was needed, we resorted to two alternative solutions: the PyTorch dense matrix eigendecomposition torch.linalg.eigh and the SciPy ARPACK wrapper scipy.sparse.linalg.eigsh. The first approach, while benefiting from GPU acceleration, can be infeasible because of limited GPU memory. The second approach, known for its memory efficiency due to its Krylov subspace-based nature, is constrained to CPU utilization, significantly impacting the algorithm's overall performance.

C Hyperparameter Priors and Initialization

Here we describe hyperparameter priors which might be of help when using implicit manifold Gaussian process regression.

Graph Bandwidth α. The graph Laplacian converges to the Laplace-Beltrami operator when α tends to zero, motivating smaller α. However, in a non-asymptotic setting it is impractical to have α overly small, as it will render the graph effectively disconnected and cause numerical instabilities. One appropriate prior could thus be a gamma distribution, whose right tail discourages high values of α and which, given an appropriate choice of the parameters, encourages α to align with the scale of pairwise distances between graph nodes.
Specifically, we choose the parameters of the gamma distribution so as to (1) match its mode with the median ($Q_2$) of the distances to the $K$-th nearest neighbors, because such an α would give reasonable weights in the KNN graph, and (2) place its standard 0.95 confidence interval to the right of a certain lower bound $\underline{\alpha}$, which we define further and which ensures the graph is numerically not disconnected. Let

$$\mathcal{D} = \left\{ \max_{x_i \in \mathrm{KNN}(x_j)} \|x_i - x_j\| \;:\; j = 1, \ldots, N \right\}. \quad (47)$$

The bandwidth's lower bound we compute as

$$\underline{\alpha} = \min_{d \in \mathcal{D}} \frac{d}{\sqrt{-4 \log \tau}}, \quad (48)$$

where τ is a user-defined parameter. Let η and β be the shape and rate parameters of the gamma distribution. In order to achieve (1), we define

$$\eta = \rho Q_2 + 1 \quad \text{and} \quad \beta = \rho, \quad (49)$$

where ρ is used to achieve (2) and is given by

$$\rho \geq \frac{4 Q_2}{(Q_2 - \underline{\alpha})^2}. \quad (50)$$

Considering the SR-MNIST dataset, Figure 7 shows the bandwidth prior distribution for the semi-supervised (labeled and unlabeled points) and the supervised (labeled points) scenarios.

Figure 7: The histogram of $\mathcal{D}$ and the prior for the bandwidth hyperparameter α: (a) prior for the semi-supervised case; (b) prior for the supervised case.

Signal Variance σ²_f. Assuming normalized y_i, the signal variance σ²_f should be close to 1. Because of this, a natural prior for σ² is the truncated normal (onto the set of positive reals σ² > 0), with mode at 1. The pre-truncation variance can be chosen, for example, to have 0 at 3 standard deviations away from the mode, i.e. it can be chosen to be 1/9. Note that setting a prior over the signal variance parameter requires evaluating the normalization constant C_{ν,κ}⁷ for the kernel at each hyperparameter optimization step, something that can be avoided otherwise.

Noise Variance σ²_ε. Choosing a prior for the noise variance σ²_ε is heavily problem-dependent. A truncated normal (onto the set σ²_ε > 0) with mode at 0 could be a reasonable option. The pre-truncation variance can be chosen, for example, to be 1, 1/4 or 1/9, placing the value 1, which is the variance of the normalized observations y_i, at 1, 2 or 3 standard deviations away from the mode.

⁷ The covariance matrix normalization constant $C_{\nu,\kappa}$ can be approximated as $C_{\nu,\kappa} \approx \frac{1}{M} \sum_{i=1}^{M} e_i^\top \mathbf{P}_{\mathbf{XX}}^{-1} e_i$, where M is relatively small, $e_i$ are random standard basis vectors and the solve is performed by running conjugate gradients. Differentiability of the model is preserved. Note, however, that we did not use this normalization in our tests since the performance improvement we observed was not enough to justify the additional computation overhead. This trade-off can vary considerably from case to case, which is why in our implementation the covariance normalization is optional.

Length scale. The interpretation and the scale of the length scale parameter is manifold-specific.
This makes it very difficult to come up with any reasonable prior. Because of this, we suggest actually leaving the length scale parameter free.

Parameter initialization. When doing MAP estimation, one can initialize parameters randomly, sampling them from the respective priors.
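Returning to the graph bandwidth prior above, a small sketch of computing the gamma prior parameters from Equations (47)-(50); the exact form used for the lower bound $\underline{\alpha}$ follows our reading of Equation (48) and is an assumption, and the helper names are ours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def bandwidth_gamma_prior(X, K, tau=0.01):
    # D: distance from each point to its K-th nearest neighbor, Equation (47)
    dist, _ = NearestNeighbors(n_neighbors=K + 1).fit(X).kneighbors(X)
    D = dist[:, -1]
    Q2 = np.median(D)                                    # mode target, criterion (1)
    alpha_lb = np.min(D) / np.sqrt(-4.0 * np.log(tau))   # assumed form of Equation (48)
    rho = 4.0 * Q2 / (Q2 - alpha_lb) ** 2                # Equation (50), taken with equality
    eta, beta = rho * Q2 + 1.0, rho                      # shape and rate, Equation (49)
    return eta, beta                                     # mode (eta - 1) / beta equals Q2
```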