# Bayesian Quadrature on Riemannian Data Manifolds

Christian Fröhlich (1), Alexandra Gessner* (1,2), Philipp Hennig (1,2), Bernhard Schölkopf (2), Georgios Arvanitidis* (2)

Abstract. Riemannian manifolds provide a principled way to model nonlinear geometric structure inherent in data. A Riemannian metric on said manifolds determines geometry-aware shortest paths and provides the means to define statistical models accordingly. However, these operations are typically computationally demanding. To ease this computational burden, we advocate probabilistic numerical methods for Riemannian statistics. In particular, we focus on Bayesian quadrature (BQ) to numerically compute integrals over normal laws on Riemannian manifolds learned from data. In this task, each function evaluation relies on the solution of an expensive initial value problem. We show that by leveraging both prior knowledge and an active exploration scheme, BQ significantly reduces the number of required evaluations and thus outperforms Monte Carlo methods on a wide range of integration problems. As a concrete application, we highlight the merits of adopting Riemannian geometry with our proposed framework on a nonlinear dataset from molecular dynamics.

*Equal supervision. (1) University of Tübingen, Germany. (2) Max Planck Institute for Intelligent Systems, Tübingen, Germany. Correspondence to: Christian Fröhlich. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

1. Introduction

The tacit assumption of a Euclidean geometry, implying that distances can be measured along straight lines, is inadequate when data follow a nonlinear trend, which is known as the manifold hypothesis. As a result, probability distributions based on a flat geometry may model the data poorly and fail to capture its underlying structure. Generalized distributions that account for curvature of the data space have been put forward to alleviate this issue. In particular, Pennec (2006) proposed an extension of the normal distribution to Riemannian manifolds such as the sphere.

Figure 1. A LAND on a protein trajectory manifold of the enzyme adenylate kinase. Each datum represents a conformation of the protein, i.e., a spatial arrangement of its atoms. The conformation corresponding to the LAND mean is visualized.

The key strategy to use such distributions on general data manifolds is to replace straight lines with continuous shortest paths, known as geodesics, which respect the nonlinear structure of the data. This is achieved by introducing a Riemannian metric in the data space that specifies how distance and volume are distorted locally. To this end, Arvanitidis et al. (2016) proposed a maximum likelihood estimation scheme based on a data-induced metric to learn the parameters of a Locally Adaptive Normal Distribution (LAND), illustrated in Fig. 1. However, it relies on a computationally expensive optimization task that entails the repeated numerical integration of the unnormalized density on the manifold, for which no closed-form solution exists. Hence we are interested in techniques to improve the efficiency of the numerical integration scheme.

Probabilistic numerics (Hennig et al., 2015; Cockayne et al., 2019) frames computations that are subject to noise and approximation error as active inference.
A probabilistic numerical routine treats the computer as a data source, acquiring evaluations according to a policy to decrease its uncertainty about the result. Amongst probabilistic numerical methods, we focus on Bayesian quadrature (BQ) to compute intractable integrals on data manifolds. Bayesian quadrature (O'Hagan, 1991; Briol et al., 2019) treats numerical integration as an inference problem by constructing posterior measures over integrals given observations, i.e., evaluations of the integrand. Besides providing sound uncertainty estimates, the probabilistic approach permits the inclusion of prior knowledge about properties of the function to be integrated, and leverages active learning schemes for node selection as well as transfer learning schemes, for example when multiple similar integrals have to be jointly estimated (Xi et al., 2018; Gessner et al., 2019). These properties make BQ especially relevant in settings where the integrand is expensive to evaluate, and make it a suitable tool for integration on Riemannian data manifolds.

Contributions. The uptake of Riemannian methods in machine learning is principally hindered by prohibitive computational costs. We here address a key aspect of this bottleneck by improving the efficiency of integration on data manifolds. We customize Bayesian quadrature to curved spaces by exploiting knowledge about their structure. To this end, we introduce a tailored acquisition function that minimizes the number of expensive computations by selecting informative directions (instead of single points) on the manifold. Adopting a probabilistic numerical integration scheme enables efficient exploration of the integrand's support under a suitable prior. We demonstrate the accuracy and performance of our approach on synthetic and real-world data manifolds, where we observe speedups by factors of up to 20. In these examples we focus on the LAND model, which provides a wide range of numerical integration problems of varying geometry and difficulty. We highlight molecular dynamics as a promising application area for Riemannian machine learning models. The results support the use of probabilistic numerical methods within Riemannian geometry to achieve significant speedup.

2. Riemannian Geometry

In our applied setting, we view R^D as a smooth manifold M with a changed notion of distance and volume measurement as compared to the Euclidean case. This view arises from the assumption that data have a general underlying nonlinear structure in R^D (see Fig. 2), and thus, the following discussion excludes manifolds with structure known a priori, e.g., spheres and tori. In our case, the tangent space T_µM at a point µ ∈ M is again R^D, but centered at µ. This is a vector space that allows us to represent points of the manifold as tangent vectors v ∈ R^D. Pictorially, a vector v ∈ T_µM is tangential to some curve passing through µ. Together, these vectors give a linearized view of the manifold with respect to a base point µ.

Figure 2. A protein trajectory manifold. A subset of the geodesics is shown with respect to a fixed point µ. The background is colored according to the volume element √|M| (Sec. 2.1) on a log scale; we omit a colorbar, since the values are not easily interpreted. Darker color indicates regions with small metric, to which shortest paths are attracted. For one geodesic, we show the downscaled tangent vector and the operations Exp_µ(v) = x and Log_µ(x) = v.

A Riemannian metric is a positive definite matrix M : R^D → R^{D×D}_+ that varies smoothly across the manifold. Therefore, we can define a
local inner product between tangent vectors v, w ∈ T_µM as ⟨v, w⟩_µ = ⟨v, M(µ) w⟩, where ⟨·, ·⟩ is the Euclidean inner product. This inner product makes the smooth manifold a Riemannian manifold (do Carmo, 1992; Lee, 2018).

A Riemannian metric locally scales the infinitesimal distances and volume. Consider a curve γ : [0, 1] → M with γ(0) = µ and γ(1) = x. The length of this curve on the Riemannian manifold M is computed as

$$L(\gamma) = \int_0^1 \sqrt{\big\langle \dot{\gamma}(t),\, M(\gamma(t))\, \dot{\gamma}(t) \big\rangle}\, dt, \qquad (1)$$

where γ̇(t) = d/dt γ(t) ∈ T_γ(t)M is the velocity of the curve. The γ that minimizes this functional is the shortest path between the points. To overcome the invariance of L under reparametrization of γ, shortest paths can equivalently be defined as minimizers of the energy functional. This enables application of the Euler–Lagrange equations, which result in a system of 2nd order nonlinear ordinary differential equations (ODEs). The shortest path is obtained by solving this system as a boundary value problem (BVP) with boundary conditions γ(0) = µ and γ(1) = x. Such a length-minimizing curve is known as a geodesic.

To perform computations on M we introduce two operators. The first is the logarithmic map Log_µ(x) = v, which represents a point x ∈ M as a tangent vector v ∈ T_µM. This can be seen as the initial velocity of the geodesic that reaches x at t = 1 with starting point µ. The inverse operator is the exponential map Exp_µ(t·v) = γ(t), which takes an initial velocity γ̇(0) = v ∈ T_µM and generates a unique geodesic with γ(0) = µ and γ(1) = x. Note that Log_µ(Exp_µ(v)) = v, and also, ‖Log_µ(x)‖_2 = ‖v‖_2 = L(γ). Computationally, the logarithmic map amounts to solving a BVP, whereas the exponential map corresponds to an initial value problem (IVP). For general data manifolds, analytic solutions of the geodesic equations do not exist, so we rely on specialized approximate numerical solvers for the BVPs; however, finding shortest paths still remains a computationally expensive problem (Hennig & Hauberg, 2014; Arvanitidis et al., 2019). In contrast, the exponential map as an IVP is an easier problem and solutions are significantly more efficient. We illustrate our applied manifold setting in Fig. 2, where we show geodesics starting from a point µ, as well as the Log and Exp operators between µ and a point x. Note that Figs. 1 to 4 are in correspondence, that is, they illustrate different aspects of the same setting. More details on Riemannian geometry and the ODE system are in A.1.

2.1. Constructing Riemannian Manifolds from Data

The Riemannian volume element or measure dM(x) = √|M(x)| dx represents the distorted infinitesimal standard Lebesgue measure dx. For a meaningful metric, this quantity is small near the data and increases as we move away from them. Intuitively, this metric behavior pulls shortest paths near the data. There are broadly two unsupervised approaches to learn such an adaptive metric from data.

Given a dataset x_{1:N} of N points in R^D, Arvanitidis et al. (2016) proposed a nonparametric metric to model nonlinear data trends as the inverse of a local diagonal covariance matrix with entries

$$M_{dd}(x) = \Big( \sum_{n=1}^{N} w_n(x)\, (x_{nd} - x_d)^2 + \rho \Big)^{-1}, \qquad (2)$$

where the weights w_n are obtained from an isotropic Gaussian kernel w_n(x) = exp(−‖x_n − x‖² / (2σ²)). The lengthscale σ determines the curvature of the manifold, i.e., how fast the metric changes. The hyperparameter ρ controls the value of the metric components that is reached far from the data, so the measure there is ρ^{−D/2}. Typically, ρ is set to a small scalar to encourage geodesics to follow the data trend. However, this metric does not scale to higher dimensions due to the curse of dimensionality (Bishop, 2006, Ch. 1.4).
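To make these two ingredients concrete, the following is a minimal, self-contained sketch (not the implementation used in the paper) of the kernel metric in Eq. (2) together with an exponential map computed as an IVP via an off-the-shelf ODE solver. The toy dataset, the hyperparameter values, and the finite-difference Christoffel symbols are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical toy data on a noisy circle; the real experiments use learned manifolds.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
X = np.stack([np.cos(angles), np.sin(angles)], axis=1) + 0.05 * rng.standard_normal((200, 2))

SIGMA, RHO = 0.25, 1e-3  # lengthscale and far-from-data floor (assumed values)

def metric_diag(x):
    """Diagonal of the nonparametric kernel metric M(x) from Eq. (2)."""
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * SIGMA ** 2))  # w_n(x)
    m_inv = w @ (X - x) ** 2 + RHO                                # local diagonal covariance + rho
    return 1.0 / m_inv                                            # M_dd(x)

def volume_element(x):
    """sqrt(|M(x)|): small near the data, approx. rho^{-D/2} far away."""
    return np.sqrt(np.prod(metric_diag(x)))

def geodesic_ode(t, state, eps=1e-4):
    """Geodesic equation x'' = -Gamma(x)[x', x'] for the diagonal metric,
    with Christoffel symbols obtained by finite differences."""
    D = len(state) // 2
    x, v = state[:D], state[D:]
    dM = np.zeros((D, D))                  # dM[k, j] = d M_kk / d x_j
    for j in range(D):
        e = np.zeros(D); e[j] = eps
        dM[:, j] = (metric_diag(x + e) - metric_diag(x - e)) / (2 * eps)
    M = metric_diag(x)
    acc = np.zeros(D)
    for k in range(D):
        s = 0.0
        for i in range(D):
            for j in range(D):
                gamma = 0.0
                if i == k: gamma += dM[k, j]
                if j == k: gamma += dM[k, i]
                if i == j: gamma -= dM[i, k]
                s += 0.5 / M[k] * gamma * v[i] * v[j]
        acc[k] = -s
    return np.concatenate([v, acc])

def exp_map(mu, v, t_end=1.0):
    """Exponential map Exp_mu(v): integrate the geodesic IVP from (mu, v)."""
    sol = solve_ivp(geodesic_ode, (0.0, t_end), np.concatenate([mu, v]), rtol=1e-6, atol=1e-8)
    return sol.y[: len(mu), -1]

mu, v = np.array([1.0, 0.0]), np.array([0.0, 0.8])
x = exp_map(mu, v)
print("Exp_mu(v) =", x, " volume element there:", volume_element(x))
```

In practice, dedicated geodesic solvers with careful error control (Arvanitidis et al., 2019) replace this naive construction; the sketch only illustrates why each evaluation of the integrand in Sec. 3 is relatively expensive.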
Another approach relies on generative models to capture the geometry of high-dimensional data in a low-dimensional latent space (Tosi et al., 2014; Arvanitidis et al., 2018). Let y_{1:N} ⊂ R^{D'} be a dataset with latent representations x_{1:N} ⊂ R^D and D' > D, such that y_n ≈ g(x_n), where g is a stochastic function with Jacobian J_g(x) ∈ R^{D'×D}. Then, the pullback metric M(x) = E[J_g(x)^T J_g(x)] is naturally induced in R^D, which enables the computation of lengths in the latent space that respect the geometry of the data manifold in R^{D'}. Even though this metric reduces the dimensionality of the problem and can be learned directly from the data by learning g, it is computationally expensive to use due to the Jacobian.

To mitigate this shortcoming, we propose a surrogate Riemannian metric. Consider a Variational Auto-Encoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) with encoder q_φ(x|y) = N(x | µ_φ(y), diag(σ²_φ(y))), decoder p_θ(y|x) = N(y | µ_θ(x), diag(σ²_θ(x))) and prior p(x) = N(x | 0, I_D), with deep neural networks as the functions that parametrize the distributions. Then, the aggregated posterior is

$$q_\phi(x) = \int_{\mathbb{R}^{D'}} q_\phi(x \mid y)\, p(y)\, dy \approx \frac{1}{N} \sum_{n=1}^{N} q_\phi(x \mid y_n), \qquad (3)$$

where the integral is approximated from the training data. This is a Gaussian mixture model that assigns non-zero density only near the latent codes of the data. Thus, motivated by Arvanitidis et al. (2020) we define a diagonal Riemannian metric in the latent space as

$$M(x) = \big( q_\phi(x) + \rho \big)^{-2}\, \mathbb{I}_D. \qquad (4)$$

This metric fulfills the desideratum of modeling the local behavior of the data in the latent space and it is more efficient than the pullback metric. The variance σ²_φ(·) of the components is typically small, so the metric adapts well to the data, which, however, may result in high curvature.

3. Gaussians on Riemannian Manifolds

Pennec (2006) theoretically derived the Riemannian normal distribution as the maximum entropy distribution on a manifold M, given a mean µ and covariance matrix Σ. The density¹ on M is

$$p(x) = \frac{1}{C(\mu, \Sigma)} \exp\!\Big( -\tfrac{1}{2} \big\langle \mathrm{Log}_\mu(x),\, \Sigma^{-1}\, \mathrm{Log}_\mu(x) \big\rangle \Big). \qquad (5)$$

This is reminiscent of the familiar Euclidean density, but with a Mahalanobis distance based on the nonlinear logarithmic maps. Analytic solutions for the normalization constant C can be given only for certain manifolds that are known a priori, like the sphere, since this requires analytic solutions for the logarithmic and exponential maps. Arvanitidis et al. (2016) extended this distribution to general data manifolds (see Fig. 1) where the Riemannian metric is learned as discussed in Sec. 2.1. In this case, M = R^D and T_µM = R^D, so Σ ∈ R^{D×D}_+. The normalization constant is computed using a naïve Monte Carlo scheme from

$$C(\mu, \Sigma) = \int_{\mathcal{M}} \exp\!\Big( -\tfrac{1}{2} \big\langle \mathrm{Log}_\mu(x),\, \Sigma^{-1}\, \mathrm{Log}_\mu(x) \big\rangle \Big)\, d\mathcal{M}(x) \;\propto\; \int_{T_\mu\mathcal{M}} g_\mu(v)\, \mathcal{N}(v;\, 0,\, \Sigma)\, dv, \qquad (6)$$

where g_µ(v) = √|M(Exp_µ(v))| gives the tangent space view on the volume element. An example of g_µ is depicted in Fig. 3.

¹ The covariance Σ_M on M is not equal to Γ^{-1}, but the covariance on T_µM is Σ = Γ^{-1}. Throughout the paper we take the second perspective. For additional technical details see A.1.

Figure 3. The function g_µ on the tangent space of the protein trajectory manifold. The origin 0 corresponds to the point µ on the manifold from which the exponential maps are computed. Contours of the integration measure N(v; 0, Σ) are in light gray. Logarithmic maps Log_µ(x_n) of the data are scattered in white. The background is colored according to the volume element on a log scale. The color scale is quantitatively unrelated to Fig. 2.
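For orientation, here is a minimal sketch of the naïve MC baseline for the tangent-space integral in Eq. (6): sample v ~ N(0, Σ) and average g_µ(v). The integrand below is a cheap hypothetical stand-in; in the actual setting every call to g_µ requires solving an exponential-map IVP, which is exactly what makes this estimator expensive.

```python
import numpy as np

def mc_normalization_constant(g, Sigma, n_samples=1000, seed=0):
    """Naive MC estimate of the tangent-space integral in Eq. (6):
    average g(v) over samples v ~ N(0, Sigma). In the LAND setting,
    g(v) = sqrt(|M(Exp_mu(v))|) and each call solves a geodesic IVP."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)
    V = rng.standard_normal((n_samples, Sigma.shape[0])) @ L.T   # v ~ N(0, Sigma)
    vals = np.array([g(v) for v in V])
    return vals.mean(), vals.std() / np.sqrt(n_samples)

# Hypothetical cheap stand-in for g_mu (the real one calls the geodesic solver).
g = lambda v: 1.0 + np.sum(v ** 2)
est, err = mc_normalization_constant(g, Sigma=np.diag([1.0, 0.5]))
print(f"estimate: {est:.3f} +- {err:.3f}")
```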
Instead of having to solve BVPs for the logarithmic maps, we now integrate on the Euclidean tangent space, where we solve significantly faster exponential maps (IVPs). The normalization constant is needed to estimate the maximum likelihood parameters µ and Σ. For this, we use gradient descent in an alternating fashion, keeping µ fixed while optimizing Σ and vice versa, as detailed in A.2. Note that C(µ, Σ) acts as a regularizer, keeping µ near the data manifold and penalizing an overestimated Σ. Moreover, the constant enables the definition of a mixture of LANDs. The MC estimator for this integral requires the evaluation of a large number of exponential maps and is ignorant of the known structure of the integrand. We replace MC by BQ to drastically reduce the number of these costly evaluations needed to retain accuracy. Our foremost goal is to speed up numerical integration on data manifolds, since exponential maps are, albeit faster than the BVPs, still relatively slow. The runtime of exponential maps depends on the employed metric (see Sec. 2.1) and on other factors such as curvature or curve length.

4. Bayesian Quadrature

Bayesian quadrature (BQ) is a probabilistic approach to integration that performs Bayesian inference on the value of the integral given function evaluations and prior assumptions on the integrand (e.g., smoothness). The probabilistic model enables a decision-theoretic approach to the selection of informative evaluation locations. BQ is thus inherently sample-efficient, making it an excellent choice in settings where function evaluations are costly, as is the case in Eq. (6), where evaluations of the integrand rely on exponential maps. The key strategy for the application of BQ in a manifold context is to move the integration to the Euclidean tangent space. We review vanilla BQ and then apply adaptive BQ variants to the integration problem in Eq. (6).

4.1. Vanilla BQ

Bayesian quadrature seeks to compute otherwise intractable integrals of the form

$$\int_{\mathbb{R}^D} f(v)\, \pi(v)\, dv, \qquad (7)$$

where f : R^D → R is a function that is typically expensive to evaluate and π(v) is a probability measure on R^D. A Gaussian process (GP) prior is assumed on the integrand to obtain a posterior distribution on the integral by conditioning the GP on function evaluations. Such a GP is a distribution over functions with the characterizing property that the joint distribution of a finite number of function values is multivariate normal (Rasmussen & Williams, 2006). It is parameterized by its mean function m(v) and covariance function (or kernel) k(v, v'), and we write f ~ GP(m, k). After observing data D at input locations V = v_{1:M} with evaluations f = f(v_{1:M}), the posterior is

$$m_D(v) = m(v) + k(v, V)\, k(V, V)^{-1} \big( \mathbf{f} - m(V) \big), \qquad k_D(v, v') = k(v, v') - k(v, V)\, k(V, V)^{-1} k(V, v').$$

Due to linearity of the integral operator, the random variable C representing our belief about the integral follows a univariate normal posterior after conditioning on observations, with posterior mean E[C | D] = ∫ m_D(v) π(v) dv and variance V[C | D] = ∫∫ k_D(v, v') π(v) π(v') dv dv'. For certain choices of m, k and π, these expressions have a closed-form solution (Briol et al., 2019).
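One such closed form arises for a zero-mean GP with an isotropic RBF kernel and a Gaussian measure π = N(0, Σ), which matches the structure of Eq. (6). The sketch below illustrates that case only; it is not the model used in the paper (which employs a Matérn-5/2 kernel and the WSABI warping of Secs. 4.2 and 4.4), and all names and values are assumptions for illustration.

```python
import numpy as np
from numpy.linalg import solve, det, inv

def rbf(A, B, ell):
    """Isotropic RBF kernel k(a, b) = exp(-||a - b||^2 / (2 ell^2))."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def vanilla_bq(V, f, Sigma, ell, jitter=1e-10):
    """Posterior mean/variance of C = int f(v) N(v; 0, Sigma) dv under a
    zero-mean GP prior with RBF kernel, conditioned on evaluations f = f(V)."""
    M, D = V.shape
    Lam = ell ** 2 * np.eye(D)
    K = rbf(V, V, ell) + jitter * np.eye(M)
    # Kernel mean z_i = int k(v_i, v) N(v; 0, Sigma) dv, closed form for the RBF kernel.
    A = Lam + Sigma
    z = det(np.eye(D) + solve(Lam, Sigma)) ** -0.5 \
        * np.exp(-0.5 * np.sum(V * (V @ inv(A).T), axis=1))
    # Prior variance of the integral: int int k(v, v') pi(v) pi(v') dv dv'.
    c0 = det(np.eye(D) + 2 * solve(Lam, Sigma)) ** -0.5
    alpha = solve(K, f)
    mean = z @ alpha
    var = c0 - z @ solve(K, z)
    return mean, max(var, 0.0)

# Tiny usage example on a cheap stand-in integrand (not the manifold integrand g_mu).
rng = np.random.default_rng(1)
Sigma = np.diag([1.0, 0.5])
V = 1.5 * rng.normal(size=(25, 2))
f = np.exp(-0.5 * np.sum(V ** 2, axis=1))
mean, var = vanilla_bq(V, f, Sigma, ell=0.8)
print(f"BQ estimate {mean:.4f} +- {np.sqrt(var):.4f}")
```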
4.2. Warped BQ

The integration task we consider is Eq. (6), where the integration measure on the tangent space T_µM is Gaussian of the form π(v) = N(v; 0, Σ). To encode the known positivity of the integrand g := g_µ(v) > 0, we model its square root by a Gaussian process. Let f = √(2(g − δ)), where δ > 0 is a small scalar and f ~ GP(m, k). Consequently, g = δ + ½ f² is guaranteed to be positive. Assume noise-free observations of f as f = √(2(g(V) − δ)). Inference is done in f-space and induces a non-Gaussian posterior on g. To overcome non-Gaussianity, Gunter et al. (2014) proposed an algorithm dubbed warped sequential active Bayesian integration (WSABI). WSABI approximates g by a GP either via a local linearization on the marginal of g at every location v (WSABI-L) or via moment matching (WSABI-M). The approximate mean and covariance function of g can be written in terms of the moments of f as

$$\tilde{m}_D(v) = \delta + \tfrac{1}{2} m_D(v)^2 + \tfrac{\eta}{2} k_D(v, v), \qquad \tilde{k}_D(v, v') = \tfrac{\eta}{2} k_D(v, v')^2 + m_D(v)\, k_D(v, v')\, m_D(v'),$$

where η = 0 for WSABI-L and η = 1 for WSABI-M. For suitable kernels, these expressions permit closed-form integrals for BQ. Chai & Garnett (2019) discuss these and other possible transformations. Kanagawa & Hennig (2019) showed that the algorithm is consistent for δ > 0.

In the present setting, WSABI offers three main advantages over vanilla BQ and MC: First, it encodes the prior knowledge that the integrand is positive everywhere. Second, for metrics learned from data, the volume element typically grows fast and takes on large values away from the data. This makes modeling g directly by a GP impractical, especially when the kernel encourages smoothness. The square-root transform alleviates this problem by reducing the dynamic range of f compared to that of g. Finally, WSABI yields an active learning scheme that adapts to the integrand in that the utility function explicitly depends on previous function values. Gunter et al. (2014) proposed uncertainty sampling, under which the next evaluation is placed at the location of largest uncertainty under the integration measure. The WSABI objective is the posterior variance of the unwarped GP scaled with the squared integration measure,

$$u(v) = \tilde{k}_D(v, v)\, \pi(v)^2, \qquad (10)$$

which we optimize to sequentially select the next tangent vector for evaluation of the exponential map and thus g.
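A minimal sketch of the WSABI-L construction (the η = 0 case of the formulas above): fit a zero-mean GP to f = √(2(g − δ)) and read off the linearized mean, marginal variance, and the uncertainty-sampling utility of Eq. (10). The RBF kernel, the stand-in integrand, and all values are illustrative assumptions, not the paper's code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbf(A, B, ell):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def wsabi_l_posterior(V, g_obs, V_test, Sigma, ell=0.5, delta=1e-6, jitter=1e-10):
    """WSABI-L style warped posterior for a positive integrand g.
    A zero-mean GP with RBF kernel models f = sqrt(2 (g - delta)); the
    linearized model for g has mean delta + 0.5 m^2 and variance m^2 * var_f."""
    f_obs = np.sqrt(2.0 * (g_obs - delta))
    K = rbf(V, V, ell) + jitter * np.eye(len(V))
    alpha = np.linalg.solve(K, f_obs)
    Ks = rbf(V_test, V, ell)
    m_f = Ks @ alpha                                                  # posterior mean of f
    v_f = np.maximum(1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1), 0.0)
    m_g = delta + 0.5 * m_f ** 2                                      # warped mean of g
    v_g = m_f ** 2 * v_f                                              # warped marginal variance of g
    # Uncertainty-sampling objective u(v) = var_g(v) * pi(v)^2, cf. Eq. (10).
    pi = multivariate_normal(mean=np.zeros(V.shape[1]), cov=Sigma).pdf(V_test)
    return m_g, v_g, v_g * pi ** 2

# Toy usage with a cheap positive stand-in for g_mu.
rng = np.random.default_rng(0)
g = lambda V: 1.0 + np.sum(V ** 2, axis=1)     # hypothetical stand-in integrand
V, V_test = rng.normal(size=(15, 2)), rng.normal(size=(5, 2))
m_g, v_g, util = wsabi_l_posterior(V, g(V), V_test, Sigma=np.eye(2))
print(np.round(m_g, 3), np.round(util, 5))
```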
4.3. Directional Cumulative Acquisition

In our manifold setting, the acquisition function is defined on the tangent space. Exponential maps yield intermediate steps on straight lines in the tangent space, as moving along a geodesic does not change the direction of the initial velocity, only the distance traveled. As a result, after solving Exp_µ(α·r) for a unit vector r, we get evaluations of the integrand at other locations βr, 0 < β < α, essentially for free. Because the IVP is solved already, only g has to be evaluated at the respective location, which is cheap once the exponential map is computed. This observation motivates rethinking the scheme for sequential design to select good initial directions instead of fixed velocities for the exponential map. We propose to select these initial directions such that the cumulative variance along the direction on the tangent space is maximized; multiple points with large marginal variance along this line are then selected to reduce the overall variance. The modified acquisition function, i.e., the cumulative variance along a straight line, can be written as

$$\tilde{u}(r) = \int u(\beta r)\, d\beta. \qquad (11)$$

The new acquisition policy arises from optimizing ũ for unit tangent vectors r. We call the acquisition function from Eq. (11) directional cumulative variance (DCV). While it does have a closed-form solution, that solution costs O(M^4) to evaluate in the number M of evaluations chosen for BQ. We resort to numerical integration to compute the objective and its gradient (A.3). This is feasible because these are multiple univariate integrals that can efficiently be estimated from the same evaluations. Since r is constrained to lie on the unit hypersphere, we employ a manifold optimization algorithm (A.3). Once an exponential map is computed, we use the standard WSABI objective to sample multiple informative points along the straight line α·r. For simplicity, we use DCV only in conjunction with WSABI-L. Optimizing this acquisition function is costly, as it requires posterior mean predictions and predictive gradients of the GP inside the integration. Furthermore, confining observations to lie along straight lines implies that BQ may cover less space given a fixed number of function evaluations. Therefore, DCV will be useful in settings where exponential maps come at a high computational cost. Fig. 4 compares posteriors arising from standard WSABI-L using uncertainty sampling (simply referred to as WSABI-L) and DCV.

Figure 4. Posterior over g_µ(v) N(v; 0, Σ). We compare WSABI-L using uncertainty sampling (left) and DCV (right). Observed locations are scattered in white. The linear color scale represents the posterior magnitude.
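To illustrate Eq. (11), the following sketch accumulates a pointwise utility along rays on the tangent space and picks the most informative unit direction. The utility u is a hypothetical stand-in for Eq. (10), the integration limits are an assumed cutoff, and random screening of unit vectors replaces the manifold optimizer used in the paper.

```python
import numpy as np

def dcv_objective(r, u, beta_max=3.0, n_grid=64):
    """Directional cumulative variance (cf. Eq. 11): integrate the pointwise
    utility u along the ray beta * r, beta in (0, beta_max], by the trapezoid rule."""
    betas = np.linspace(1e-3, beta_max, n_grid)
    vals = np.array([u(b * r) for b in betas])
    return np.trapz(vals, betas)

def best_direction(u, dim=2, n_candidates=256, seed=0):
    """Pick the unit direction with largest cumulative utility.
    (A cheap stand-in: screen random unit vectors instead of optimizing
    on the unit sphere with a manifold optimizer.)"""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(n_candidates, dim))
    R /= np.linalg.norm(R, axis=1, keepdims=True)
    scores = np.array([dcv_objective(r, u) for r in R])
    return R[np.argmax(scores)], scores.max()

# Hypothetical utility standing in for Eq. (10): most uncertainty sits near (1.5, 0).
u = lambda v: np.exp(-0.5 * np.sum((v - np.array([1.5, 0.0])) ** 2))
r_star, score = best_direction(u)
print("chosen direction:", np.round(r_star, 3), " cumulative utility:", round(score, 3))
```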
4.4. Choice of Integrand Model

We use a Matérn-5/2 kernel to model g_µ, which we found to be more robust than the squared exponential kernel (RBF). The Matérn model makes weaker assumptions on differentiability (Stein, 2012) and empirically proved more suitable in high-curvature settings in which the integrand shows large variability. Depending on the employed Riemannian metric, we can set the constant prior mean of g to the known volume element far from the data (Sec. 2.1). This amounts to the prior assumption that wherever there are no observations yet, the distance to the data is likely high.

Figure 5. Boxplot error comparison (log scale, shared y-axis) of BQ and MC on the whole LAND fit for different manifolds. For MC, we allocate the runtime of the mean slowest BQ method. Each box contains 16 independent runs.

Figure 6. Mean runtime comparison (for a single integration) of the BQ methods. Error bars indicate 95% confidence intervals w.r.t. the 16 runs on each LAND fit. The reported runtimes belong to the boxplots in Fig. 5.

4.5. Transfer Learning

The LAND optimization process requires the sequential computation of one integral per iteration as the parameters of the model get updated in an alternating manner. In general, elaborate schemes as in Xi et al. (2018) and Gessner et al. (2019) are available to estimate correlated integrals jointly. However, our integrand g_µ does not depend on the covariance Σ of the integration measure π. Therefore, exponential maps remain unaltered when only the covariance Σ changes from one iteration to the next while the mean µ remains fixed. BQ can thus reuse the observations from the previous iteration and only needs to collect a reduced number of new samples to account for the changed integration measure. This node reuse enables tremendous runtime savings.

5. Experiments

We test the methods (WSABI-L, WSABI-M, DCV) on both synthetic and real-world data manifolds. Our aim is to show that Bayesian quadrature is faster than the Monte Carlo baseline, yet retains high accuracy. The experiments focus on the LAND model to illustrate practical use cases of Riemannian statistics. Furthermore, the iterative optimization process yields a wide range of integration problems (Eq. 6) of varying difficulty. In total, our experiments comprise 43,920 BQ integrations.

For different manifolds, we conduct two kinds of experiments: First, we fit the LAND model and record all integration problems arising during the optimization procedure (A.2). This allows us to compare the competitors on the whole problem, where BQ can benefit from node reuse. As the ground truth, we use extensive MC sampling (S = 40,000). We fix the number of acquired samples for BQ and generate boxplots from the mean errors on the whole LAND fit for 16 independent runs (Fig. 5). Due to the alternating update of LAND parameters during optimization, either the integrand or the integration measure changes over consecutive iterations. We let WSABI-L and WSABI-M actively collect 80 samples in the former case and 10 samples in addition to the reused ones in the latter; for DCV, we fix 18 and 2 exponential maps, respectively, and acquire 6 points on each straight line. Integration cost for BQ is thus highly variable over iterations. Allocating a fixed runtime would not be sensible, as BQ benefits from collecting more information after updates to the mean, a time investment that is over-compensated in the more abundant and, due to node reuse, cheap covariance updates. We choose sample numbers so as to allow for sufficient exploration of the space with practical runtime. For MC, we allocate the runtime budget of the mean slowest BQ method on that particular problem in order to compare accuracy over runtime. Mean runtimes for single integrations, averaged over entire LAND fits, are shown in Fig. 6, and mean exponential map runtimes, as computed by MC, are reported in A.4. Secondly, we focus on the first integration problem of each LAND fit in detail and compare the convergence behavior of the different BQ methods and MC over wall clock runtime (Fig. 7). We use the kernel metric (Sec. 2.1) when not otherwise mentioned. All further technical details for reproducibility are in A.4. In the plot legends, we abbreviate WSABI-L/WSABI-M with W-L/W-M, respectively. Code is publicly available at github.com/froec/BQonRDM.

Figure 7. Comparison of BQ and MC errors against runtime (vertical log scale, shared legend and axes) for different manifolds, on the first integration problem of the respective LAND fit. Shaded regions indicate 95% confidence intervals w.r.t. the 30 independent runs (A.4).

5.1. Synthetic Experiments

Toy Data. We generated three toy data sets and fitted the LAND model with pre-determined component numbers. Here we focus on a circle manifold with 1000 data points, for which we compare the resulting LAND fit to the Euclidean Gaussian mixture model (GMM) in Fig. 8, and report results for the other datasets in A.4.

Figure 8. Model comparison on CIRCLE toy data.

Higher-dimensional Toy Data. With increasing number of dimensions, new challenges for metric learning and geodesic solvers appear.
With the simple kernel metric, almost all of the volume will be far from the data as the dimension increases, a phenomenon which we observe already in relatively low dimensions. Such metric behavior can lead to pathological integration problems, as the integrand may then become almost constant. In this experiment, we embed the circle toy data in higher dimensions by sampling random orthonormal matrices. After projecting the data, we add Gaussian noise ε_i ~ N(0, 0.01) and standardize. Here we show the result for d = 5 and report results for d = {3, 4} in A.4.

5.2. Real-World Experiments

MNIST. We trained a Variational Auto-Encoder on the first three digits of MNIST using the aggregated posterior metric (Sec. 2.1) and found that the LAND is able to distinguish the three clusters more clearly than a Euclidean Gaussian mixture model, see Fig. 9. The LAND favors regions of higher density, where the VAE has more training data. In this experiment, the gain in speed of BQ is even more pronounced, since exponential maps are slow due to high curvature. MC with 1000 samples achieves 2.78% mean error on the whole LAND fit with a total runtime of 6 hours and 56 minutes, whereas DCV (18/2 exponential maps) achieves 2.84% error within 21 minutes; a speedup by a factor of 20.

Figure 9. Model comparison on three-digit MNIST.

Molecular Dynamics. In molecular dynamics, biophysical systems are simulated on the atomic level. This approach is useful to understand the conformational changes of a protein, i.e., the structural changes it undergoes. A Riemannian model is appropriate in this setting, because not all atom coordinates represent physically realistic conformations. For instance, a protein clearly does not self-intersect. Adapting locally to the data by space distortion is thus critical for modeling. More specifically, the LAND model is relevant because clustering conformations and finding representative states are of scientific interest (see e.g., Papaleo et al., 2009; Wolf & Kirschner, 2013; Spellmon et al., 2015; Tribello & Gasparotto, 2019). The LAND can visualize the conformational landscape and generate realistic samples. Plausible interpolations (trajectories) between conformations may be conceived of as geodesics under the Riemannian metric.

We obtained multiple trajectories of the closed-to-open transition of the enzyme adenylate kinase (ADK) (Seyler et al., 2015). Each observation consists of the Cartesian (x, y, z) coordinates for each of the 3,341 atoms, yielding a 10,023-dimensional vector. As is common in the field, we used PCA to extract the essential dynamics (Amadei et al., 1993), which clearly exhibit manifold structure (Fig. 1). The first two eigenvectors already explain 65% of the total variance and suffice to capture the transition motion. We model the ADK manifold with high curvature and large measure far from the data to account for the knowledge that realistic trajectories lie closely together. This makes for a challenging integration problem, since most mass is near the data boundary due to extreme metric values. A single-component LAND yields a representative state for the transition between the closed and open conformation.

Figure 10. Comparison of the Euclidean Gaussian vs. LAND mean and eigenvectors on ADK data. Data is colored according to the radius of gyration, a measure indicating how open the protein is, providing a visual argument for the manifold hypothesis.

Figure 11. ADK in closed, LAND mean, and open state.
Whereas the Euclidean mean falls outside the data manifold, the LAND mean is reasonably situated. Plotting the eigenvectors of the covariance matrix makes it clear that the LAND captures the intrinsic dimensions of the data manifold (Fig. 10) and that the mean interpolates between the closed and open state (Fig. 11). Our aim here is to demonstrate that molecular dynamics is an exciting application area for Riemannian statistics and to sketch potential experiments, which are then for domain experts to design.

5.3. Interpretation

We find that BQ consistently outperforms MC in terms of speed. Even on high-curvature manifolds with volume elements spanning multiple orders of magnitude, such as MNIST and ADK, the GP succeeds in approximating the integrand well. Among the different BQ candidates, we cannot discern a clear winner, since their performance depends on the specific problem geometry and exponential map runtimes. DCV performs especially well when geodesic computations are costly, such as for MNIST. We note that geodesic solvers and metric learning face new challenges in higher dimensions, which merit further research effort.

6. Conclusion and Discussion

Riemannian statistics is the appropriate framework to model real data with nonlinear geometry. Yet, its wide adoption is hampered by the prohibitive cost of the numerical computations required to learn geometry from data and operate on manifolds. In this work, we have demonstrated, on the example of numerical integration, the great potential of probabilistic numerical methods (PNM) to reduce this computational burden. PNM adaptively select actions in a decision-theoretic manner and thus handle information with greater care than classic methods, e.g., Monte Carlo. Consequently, the deliberate choice of informative computations saves unnecessary operations on the manifold. We have extended Bayesian quadrature to Riemannian manifolds, where it outperforms Monte Carlo over a large number of integration problems owing to its increased sample efficiency. Beyond known active learning schemes for BQ, we have introduced a version of uncertainty sampling adapted to the manifold setting that allows us to further reduce the number of expensive geodesic evaluations needed to estimate the integral.

Numerical integration is just one of multiple numerical tasks in the context of statistics on Riemannian manifolds where PNM suggest promising improvements. The key operations on data manifolds are geodesic computations, i.e., solutions of ordinary differential equations. Geodesics have been viewed through the PN lens, e.g., by Hennig & Hauberg (2014), but still offer a margin for increasing the performance of statistical models such as the considered LAND. Once multiple PNM are established for Riemannian statistics, a natural next step is to have them operate in a concerted fashion. As data-driven Riemannian models rely on complex computation pipelines with multiple sources of epistemic and aleatory uncertainty, their robustness and efficiency can benefit from modeling and propagating uncertainty through the computations. All in all, we believe the coalition of geometry- and uncertainty-aware methods to be a fruitful endeavor, as these approaches are united by their common intention to respect structure in data and computation that is otherwise often neglected.
Acknowledgements

AG acknowledges funding by the European Research Council through ERC StG Action 757275 / PANAMA and support by the International Max Planck Research School for Intelligent Systems (IMPRS-IS). This work was supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039B, and by the Machine Learning Cluster of Excellence, EXC number 2064/1, Project number 390727645. The authors thank Nicholas Krämer, Agustinus Kristiadi, and Dmitry Kobak for useful discussions.

References

Amadei, A., Linssen, A. B., and Berendsen, H. J. Essential dynamics of proteins. Proteins: Structure, Function, and Bioinformatics, 17(4):412-425, 1993.

Arvanitidis, G., Hansen, L. K., and Hauberg, S. A Locally Adaptive Normal Distribution. In Advances in Neural Information Processing Systems 29, pp. 4251-4259, 2016.

Arvanitidis, G., Hansen, L. K., and Hauberg, S. Latent space oddity: on the curvature of deep generative models. In 6th International Conference on Learning Representations (ICLR 2018), 2018.

Arvanitidis, G., Hauberg, S., Hennig, P., and Schober, M. Fast and robust shortest paths on manifolds learned from data. In The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS 2019), volume 89 of Proceedings of Machine Learning Research, pp. 1506-1515. PMLR, 2019.

Arvanitidis, G., Hauberg, S., and Schölkopf, B. Geometrically enriched latent spaces. arXiv preprint, 2020.

Bishop, C. M. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, 2006.

Briol, F.-X., Oates, C. J., Girolami, M., Osborne, M. A., Sejdinovic, D., et al. Probabilistic integration: A role in statistical computation? Statistical Science, 34(1):1-22, 2019.

Chai, H. R. and Garnett, R. Improving quadrature for constrained integrands. In The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS 2019), volume 89 of Proceedings of Machine Learning Research, pp. 2751-2759. PMLR, 2019.

Cockayne, J., Oates, C., Sullivan, T. J., and Girolami, M. Bayesian probabilistic numerical methods. SIAM Review, 61(4):756-789, 2019. doi: 10.1137/17M1139357.

do Carmo, M. P. Riemannian Geometry. Birkhäuser, 1992.

Gessner, A., Gonzalez, J., and Mahsereci, M. Active multi-information source Bayesian quadrature. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI 2019), volume 115 of Proceedings of Machine Learning Research, pp. 712-721. AUAI Press, 2019.

Gunter, T., Osborne, M. A., Garnett, R., Hennig, P., and Roberts, S. J. Sampling for inference in probabilistic models with fast Bayesian quadrature. In Advances in Neural Information Processing Systems 27, pp. 2789-2797, 2014.

Hennig, P. and Hauberg, S.
Probabilistic solutions to differential equations and their application to Riemannian statistics. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (AISTATS 2014), volume 33 of JMLR Workshop and Conference Proceedings, pp. 347-355. JMLR.org, 2014.

Hennig, P., Osborne, M. A., and Girolami, M. Probabilistic numerics and uncertainty in computations. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 2015.

Kanagawa, M. and Hennig, P. Convergence guarantees for adaptive Bayesian quadrature methods. In Advances in Neural Information Processing Systems 32, pp. 6234-6245, 2019.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations (ICLR 2014), 2014.

Lee, J. Introduction to Riemannian Manifolds. Springer, 2018.

O'Hagan, A. Bayes-Hermite quadrature. Journal of Statistical Planning and Inference, 29(3):245-260, 1991.

Papaleo, E., Mereghetti, P., Fantucci, P., Grandori, R., and De Gioia, L. Free-energy landscape, principal component analysis, and structural clustering to identify representative conformations from molecular dynamics simulations: the Myoglobin case. Journal of Molecular Graphics and Modelling, 27(8):889-899, 2009.

Pennec, X. Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1):127-154, 2006.

Rasmussen, C. E. and Williams, C. K. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), volume 32 of JMLR Workshop and Conference Proceedings, pp. 1278-1286. JMLR.org, 2014.

Seyler, S. L., Kumar, A., Thorpe, M. F., and Beckstein, O. Path similarity analysis: A method for quantifying macromolecular pathways. PLOS Computational Biology, 11(10):1-37, 2015. doi: 10.1371/journal.pcbi.1004568. URL https://doi.org/10.1371/journal.pcbi.1004568.

Spellmon, N., Sun, X., Sirinupong, N., Edwards, B., Li, C., and Yang, Z. Molecular dynamics simulation reveals correlated inter-lobe motion in protein Lysine Methyltransferase SMYD2. PLOS ONE, 10(12):e0145758, 2015.

Stein, M. L. Interpolation of Spatial Data: Some Theory for Kriging. Springer Science & Business Media, 2012.

Tosi, A., Hauberg, S., Vellido, A., and Lawrence, N. D. Metrics for probabilistic geometries. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI 2014), pp. 800-808. AUAI Press, 2014.

Tribello, G. A. and Gasparotto, P. Using dimensionality reduction to analyze protein trajectories. Frontiers in Molecular Biosciences, 6:46, 2019. doi: 10.3389/fmolb.2019.00046.

Wolf, A. and Kirschner, K. N. Principal component and clustering analysis on molecular dynamics data of the ribosomal L11-23S subdomain.
Journal of Molecular Modeling, 19(2):539-549, 2013.

Xi, X., Briol, F.-X., and Girolami, M. Bayesian quadrature for multiple related integrals. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), volume 80 of Proceedings of Machine Learning Research, pp. 5369-5378. PMLR, 2018.