# Identifying latent distances with Finslerian geometry

Published in Transactions on Machine Learning Research (10/2023)

Alison Pouplin, alpu@dtu.dk, Technical University of Denmark
David Eklund, david.eklund@ri.se, Research Institutes of Sweden
Carl Henrik Ek, che29@cam.ac.uk, University of Cambridge
Søren Hauberg, sohau@dtu.dk, Technical University of Denmark

Reviewed on OpenReview: https://openreview.net/forum?id=Q2Gi0TUAdS

Riemannian geometry provides us with powerful tools to explore the latent space of generative models while preserving the underlying structure of the data. The latent space can be equipped with a Riemannian metric, pulled back from the data manifold. With this metric, we can systematically navigate the space relying on geodesics defined as the shortest curves between two points. Generative models are often stochastic, causing the data space, the Riemannian metric, and the geodesics to be stochastic as well. Stochastic objects are at best impractical, and at worst impossible, to manipulate. A common solution is to approximate the stochastic pullback metric by its expectation. But the geodesics derived from this expected Riemannian metric do not correspond to the expected length-minimising curves. In this work, we propose another metric whose geodesics explicitly minimise the expected length of the pullback metric. We show this metric defines a Finsler metric, and we compare it with the expected Riemannian metric. In high dimensions, we prove that both metrics converge to each other at a rate of $\mathcal{O}(1/D)$. This convergence implies that the established expected Riemannian metric is an accurate approximation of the theoretically more grounded Finsler metric. This provides justification for using the expected Riemannian metric for practical implementations.
1 Introduction

Generative models provide a convenient way to learn low-dimensional latent variables $z$ corresponding to data observations $x$ through a smooth function $f : \mathcal{Z} \subset \mathbb{R}^q \to \mathcal{X} \subset \mathbb{R}^D$, such that $x = f(z)$. Through this learnt manifold, one can generate new data or compare observations by interpolating or computing distances. However, doing so with the Euclidean distance in the latent space is misleading (Hauberg, 2018a), because the latent variables are not statistically identifiable. If our observations lie near a manifold (Fefferman et al., 2016), we want to equip our latent space with a metric that preserves distance measures on it. Figure 1 (left panel) illustrates the need for defining geometry-aware distances on manifolds.

Distances on a manifold can be precisely defined using a norm, which is a mathematical function that exhibits several desirable properties such as non-negativity, homogeneity, and the triangle inequality. In particular, a norm can be induced by an inner product (i.e., a quadratic function) that associates each pair of points on the manifold with a scalar value.

To derive the standard Riemannian interpretation of the latent space, we first compute the infinitesimal Euclidean norm according to the data space. Using the Taylor expansion, we have:

$$\|f(z + \Delta z) - f(z)\|_2^2 \approx \|f(z) + J(z)\Delta z - f(z)\|_2^2 = \Delta z^\top J(z)^\top J(z) \Delta z.$$

As a first approximation, the norm defined in the latent space locally preserves the Euclidean norm defined in the data space. The curvature of our data manifold is condensed in the Riemannian metric tensor $G_z = J(z)^\top J(z)$, which serves as a proxy to define the Riemannian metric: $g_z : (u, v) \mapsto u^\top G_z v$. In mathematical jargon, we say that the Riemannian manifold $(\mathcal{Z}, g)$ is obtained by pulling back the Euclidean metric through the map $f$.
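The Taylor argument above can be checked numerically. The following is a minimal sketch, assuming a deterministic map given by the standard spherical coordinates of the unit sphere (an illustrative choice, not necessarily the parametrisation of Figure 1). The names `f`, `jacobian`, `pullback_metric` and `quad_form` are hypothetical helpers, and the Jacobian is approximated by central finite differences.

```python
import math

def f(z):
    """Hypothetical decoder: standard spherical coordinates (theta, phi) -> unit sphere in R^3."""
    theta, phi = z
    return [math.cos(theta) * math.sin(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(phi)]

def jacobian(func, z, h=1e-6):
    """Central finite-difference Jacobian, returned as D rows of length q."""
    q = len(z)
    cols = []
    for j in range(q):
        zp, zm = list(z), list(z)
        zp[j] += h
        zm[j] -= h
        cols.append([(a - b) / (2 * h) for a, b in zip(func(zp), func(zm))])
    D = len(cols[0])
    return [[cols[j][i] for j in range(q)] for i in range(D)]

def pullback_metric(func, z):
    """Metric tensor G_z = J(z)^T J(z)."""
    J = jacobian(func, z)
    q = len(z)
    return [[sum(row[a] * row[b] for row in J) for b in range(q)] for a in range(q)]

def quad_form(G, u):
    """u^T G u."""
    n = len(u)
    return sum(u[a] * G[a][b] * u[b] for a in range(n) for b in range(n))

z = [0.3, 1.1]
dz = [1e-3, -2e-3]
G = pullback_metric(f, z)

# Squared Euclidean displacement in data space versus the latent quadratic form.
z_shifted = [z[0] + dz[0], z[1] + dz[1]]
lhs = sum((a - b) ** 2 for a, b in zip(f(z_shifted), f(z)))
rhs = quad_form(G, dz)
print(G, lhs, rhs)
```

For this particular map the metric tensor is diagonal, with entries $\sin^2\phi$ and $1$, so the two printed scalars agree up to third-order terms in the step size.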
Riemannian geometry enables the exploration of the latent space in precise geometric terms, and quantities of interest such as the length, the energy or the volume can be directly derived from the pullback metric. These geometric quantities are, by construction, invariant to reparametrizations of the latent space $\mathcal{Z}$, and are thus statistically identifiable (Hauberg, 2018a).

While this geometric framework exclusively handles deterministic objects, generative models are often stochastic. The learnt map $f$ that mathematically describes those models is then stochastic too. For example, in the celebrated Gaussian Process Latent Variable Model (GPLVM) (Lawrence, 2003), this function is a Gaussian process. The pullback metric is then stochastic, and standard differential geometric constructions no longer apply. Navigating the latent manifolds through geodesics (length-minimising curves) becomes practically infeasible. Previous research tried to circumvent this problem by approximating the stochastic pullback metric with the expected value of the Riemannian metric tensor, but the derived length is an unintuitive quantity that does not correspond to the expected length:

$$L(\gamma)\big|_{\mathbb{E}[G]} := \int \|\dot{\gamma}(t)\|_{\mathbb{E}[G]} \, dt \;\neq\; \mathbb{E}\big[L(\gamma)\big|_{G}\big] := \int \mathbb{E}\big[\|\dot{\gamma}(t)\|_{G}\big] \, dt.$$

Instead of taking the expectation of the metric tensor, which serves as a surrogate to define a norm, we propose to take the expectation of the norm directly. In this paper, we compare our expected norm with the norm induced by the commonly used expected metric tensor. The main findings are:

1. The expected norm defines a Finsler metric. Finsler geometry is a generalisation of Riemannian geometry.
2. For Gaussian processes, the stochastic norm obtained through the pullback metric follows a non-central Nakagami distribution, so our Finsler metric has a closed-form expression.
3.
In high dimensions, for Gaussian processes, our Finsler metric and the previously studied Riemannian metric converge to each other at a rate of $\mathcal{O}(1/D)$, with $D$ the dimension of the data space.

We conclude that the Riemannian metric serves as a good approximation of the Finsler metric, which is theoretically more grounded. This further justifies the use of the Riemannian metric in practice when exploring stochastic manifolds.

1.1 Outline of the paper

The paper explores the geometry of latent spaces learned by generative models, which encode a latent low-dimensional manifold that represents observed high-dimensional data. The latent manifold is denoted $\mathcal{Z} \subset \mathbb{R}^q$ and the data manifold is denoted $\mathcal{X} \subset \mathbb{R}^D$. Assuming the manifold hypothesis holds, we need to define an infinitesimal norm in the latent manifold to compute distances that respect the underlying geometry of the data. Such a norm can be constructed by pulling back the Euclidean distance through the smooth function that maps the latent manifold to the data manifold. This map, $f : \mathcal{Z} \to \mathcal{X}$, mathematically describes the decoder of a trained generative model. Since those models are often stochastic, we consider $f$ to be a stochastic process. This means that the pullback metric tensor, $G = J^\top J$, and its induced norm, $\|\cdot\|_G : u \mapsto \sqrt{u^\top G u}$, are also stochastic. In section 2, we mathematically define the notion of stochastic pullback metric and stochastic manifolds.

Figure 1: In this illustration, we show the 2-dimensional representation $(\theta, \phi)$ of the sphere parametrised in $\mathbb{R}^3$ with $f : (\theta, \phi) \mapsto (\cos(\theta)\sin(\phi), \sin(\theta)\cos(\phi), \sin(\phi))$. In practice, we do not have access to well-parametrised manifolds; instead, those manifolds are shaped by data points in $\mathbb{R}^D$. The data points are depicted as tiny dots for illustrative purposes. On the left figure, the map $f$ is deterministic, and on the right figure, we imagine that $f$ is stochastic.
When we pull back, through $f$, a circle from the sphere to the plane, the circle is deformed to an ellipse. Those ellipses, called indicatrices, are the fingerprints of the metric and they showcase the deformation of the plane. Left figure: In blue, the Euclidean distance is represented by the straight line in the plane. It is not identifiable and it does not represent a geodesic on the sphere. In orange, the length-minimising curve obtained through the pullback metric of the mapping $f$ leads to a great circle on the sphere. Following the great circle is the fastest way to go from one point to another. Effectively, the pullback metric leads to a geodesic, contrary to the Euclidean metric. Notice that, in the plane, the orange geodesic also follows the direction of the indicatrices, since it is length-minimising. Right figure: We imagine that a generative model maps the latent space to the data space using a stochastic process $f = \{f_1, f_2, \dots\}$. This leads to a stochastic pullback metric, stochastic indicatrices, and stochastic paths in the latent space.

To circumvent the challenges posed by this stochastic component, a deterministic approximation of the norm is needed. It can be defined by taking the expectation of the metric tensor. This norm, which we denote $\|\cdot\|_R : u \mapsto \sqrt{u^\top \mathbb{E}[G] u}$, has been studied before by Tosi et al. (2014), and is explained in section 2.2. In this paper, we propose instead to directly take the expectation of the stochastic norm. This expected norm, denoted $\|\cdot\|_F : u \mapsto \mathbb{E}\big[\sqrt{u^\top G u}\big]$, is introduced in section 2.4. The norm $\|\cdot\|_R$ is defined by a Riemannian metric, and we show that the norm $\|\cdot\|_F$ defines a Finsler metric. We explain the general difference between Finsler and Riemannian geometry in section 3. The aim of this paper is to compare those two norms. We first draw absolute bounds in section 3.2, and then relative bounds in section 3.3. We also investigate the relative difference of the norms when the dimension of the data space increases, in section 3.4.
Finally, we perform some experiments in Section 4 that illustrate the Riemannian and Finsler norms in the same latent space.

1.2 Related works

Riemannian geometry for machine learning. When navigating a learnt latent space, the geodesics obtained through the pullback metric not only follow geometrically coherent paths, they also remain invariant under different representations. While two runs of the same model will produce distinct latent manifolds, the geodesics connecting the same chosen points should have the same length. We say that the pullback metric effectively solves the identifiability problem (Hauberg, 2018a). This has led to the growing adoption of Riemannian geometry in machine learning applications. In robotics, Beik-Mohammadi et al. (2021) used the pullback metric from a variational autoencoder to safely navigate the space of motion patterns, and Scannell et al. (2021) used geodesics under the expected metric in a GPLVM to control a quadrotor robot. In protein modelling, Detlefsen et al. (2022) showed that the derived geodesics on the space of beta-lactamase follow their phylogenetic tree structure. In game content generation, González-Duque et al. (2022) generated more coherent and reliable game levels by interpolating data along the geodesics. Jørgensen & Hauberg (2021) used the expected Riemannian metric from observations of pairwise distances through a GPLVM to build a probabilistic dimensionality reduction model.

Finsler geometry in machine learning. Our work crucially relies on Finslerian geometry, which has been well-studied mathematically, but has only seen very limited use in machine learning and statistics. We point to two notable exceptions, which are quite distinct from our work. Lopez et al. (2021) use symmetric spaces to represent graphs and endow these with a Finsler metric to capture dissimilarity structure in the observational data. Ratliff et al.
(2021) discuss the role of differential geometry in motion planning for robotics. Along the way, they touch upon Finslerian geometry, but mostly as a neat tool to allow for generalizations. To the best of our knowledge, no prior work has investigated the links between stochastic and Finslerian geometry.

Strategies to deal with stochastic Riemannian geometry. Tosi et al. (2014) and Arvanitidis et al. (2018) introduced an approximation of the pullback metric by taking the expectation of the metric tensor. In those two cases, the map $f$ is respectively a trained Gaussian process and the decoder of a VAE. In this paper, the derivations only hold if $f$ is a smooth stochastic process (Definition 2.2), which is not the case for VAEs (footnote 1), and hence our results are not directly applicable to those models. In addition to the work of Tosi et al. (2014), a solution to circumvent the randomness of the metric tensor is to consider that the data follows a specific probability distribution. Instead of looking at the shortest path on the data manifold, Arvanitidis et al. (2021) borrow tools from information geometry and consider the straightest paths on the manifold whose elements are probability distributions.

2 Expectation on random manifolds

The metric pulled back by a stochastic mapping is, de facto, stochastic, and it defines a random manifold. Unfortunately, we are not yet equipped to derive geometric objects on a random manifold. Instead, we dodge this problem by seeking a deterministic approximation of this stochastic metric. As mentioned above, a common solution is to approximate such a metric by its expectation. In section 2.2, we study the expected Riemannian metric. The solution suggested by this paper is to take the expectation of the lengths instead of the expectation of the random metric itself. In section 2.4, we show that this new metric is not Riemannian but Finslerian (Proposition 2.2), and that it has a closed-form expression when the map $f$ is a Gaussian process (Proposition 2.3).
2.1 Random Riemannian geometry

The pullback metric is defined as a Riemannian metric if and only if the mapping $f$ is an immersion, which is a differentiable function whose derivatives are injective everywhere on the manifold (Lee, 2013, Proposition 13.9). A manifold equipped with a Riemannian metric is called a Riemannian manifold.

Definition 2.1. The pullback of the Euclidean metric through the immersion $f : \mathcal{Z} \to \mathcal{X}$ is a Riemannian metric. It is defined as the inner product $g_z : (T_z\mathcal{Z}, T_z\mathcal{Z}) \to \mathbb{R}_+ : (u, v) \mapsto u^\top G v$, at a specific point $z$ in the manifold $\mathcal{Z}$. Here $u$ and $v$ are vectors lying in the tangent plane $T_z\mathcal{Z}$ (i.e., the set of all tangent vectors) of the manifold, and $G = J^\top J$, with $J$ the Jacobian of $f$.

Since a Riemannian metric is an inner product, it induces a norm, $\|\cdot\|_G$. We can then define the curve length and curve energy on a manifold: $L_G(\gamma) = \int_0^1 \|\dot{\gamma}(t)\|_G \, dt$ and $E_G(\gamma) = \frac{1}{2}\int_0^1 \|\dot{\gamma}(t)\|_G^2 \, dt$, with $\gamma$ a curve defined on $\mathcal{Z}$ and $\dot{\gamma}$ its derivative. A locally length-minimising curve between two connecting points is a geodesic. To obtain a geodesic, we can minimise the curve length, but in practice minimising the curve energy is more efficient. On the manifold, we also define a volume measure in order to integrate probability functions: for $U \subset \mathcal{Z}$, $\int_{f(U)} h(x) \, dx = \int_U h(f(z)) V_R(z) \, dz$, with $V_R(z) = \sqrt{\det G_z}$ the volume measure.

Footnote 1: The decoder of a VAE, while it decodes to a Gaussian, cannot be considered as a differentiable stochastic process. One reason is the independence of the probability of the data: $p(x|z) = \prod_{i=1}^n p(x_i|z_i)$. Let us assume the opposite: the decoder is a Gaussian process. The covariance of the Gaussian process would be a diagonal matrix because of the independence of the probability of the data. The covariance would correspond to a Dirac distribution: $\mathrm{cov}(x_i, x_j) = \delta_{ij}$. However, a stochastic process is differentiable only if the covariance is differentiable, which is not the case for the Dirac distribution.
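The three functionals of this section (curve length, curve energy and volume measure) can be approximated by simple quadrature. The sketch below assumes, purely for illustration, the pullback metric of the unit sphere in standard spherical coordinates, $G = \mathrm{diag}(\sin^2\phi, 1)$. The helper names (`metric`, `length_and_energy`, `sphere_area`) and the two test curves are hypothetical and not taken from the paper.

```python
import math

def metric(z):
    """Pullback metric of the unit sphere in spherical coordinates (theta, phi).
    Illustrative choice: G = diag(sin(phi)^2, 1)."""
    _, phi = z
    return [[math.sin(phi) ** 2, 0.0], [0.0, 1.0]]

def sq_norm(G, u):
    """Squared norm u^T G u."""
    return sum(u[a] * G[a][b] * u[b] for a in range(2) for b in range(2))

def length_and_energy(curve, n=2000, h=1e-6):
    """Midpoint-rule estimates of L_G and E_G for a curve gamma: [0, 1] -> Z."""
    length = 0.0
    energy = 0.0
    dt = 1.0 / n
    for k in range(n):
        t = (k + 0.5) * dt
        before, after = curve(t - h), curve(t + h)
        velocity = [(b - a) / (2 * h) for a, b in zip(before, after)]
        s = sq_norm(metric(curve(t)), velocity)
        length += math.sqrt(s) * dt
        energy += 0.5 * s * dt
    return length, energy

def sphere_area(n=400):
    """Integrate the volume measure sqrt(det G) over [0, 2 pi] x [0, pi]."""
    dphi = math.pi / n
    total = 0.0
    for k in range(n):
        G = metric((0.0, (k + 0.5) * dphi))
        det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
        total += math.sqrt(det) * dphi
    return 2 * math.pi * total  # the integrand does not depend on theta

def equator(t):
    """Quarter of the equator (a geodesic of the sphere)."""
    return (t * math.pi / 2, math.pi / 2)

def parallel(t):
    """Quarter of the parallel at phi = pi / 4 (not a geodesic)."""
    return (t * math.pi / 2, math.pi / 4)

L_eq, E_eq = length_and_energy(equator)
L_par, E_par = length_and_energy(parallel)
area = sphere_area()
print(L_eq, E_eq, L_par, E_par, area)
```

The equator arc has length $\pi/2$ and energy $\pi^2/8$, and the integrated volume measure recovers the area $4\pi$ of the unit sphere, which gives a quick sanity check on the definitions.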
In addition, we consider the case where the immersion $f$ is a stochastic process. The outputs of our trained model, $x \in \mathcal{X}$, which represent our data, are random variables.

Definition 2.2. A stochastic process is a collection of random variables $\{X(t, \omega), t \in T\}$ indexed by an index set $T$ and defined on a sample space $\Omega$, which represents the set of all possible outcomes. An outcome in $\Omega$ is denoted by $\omega$, and a realisation of the stochastic process is the sequence $X(\cdot, \omega)$ that depends on the outcome $\omega$.

In this framework, our index set is our latent manifold, $T = \mathcal{Z}$, and our sample space $\Omega$ is defined as the set of the model evaluations. For every point $z \in \mathcal{Z}$, every time we execute our model, the output $x = f(z)$ is a random variable following a specific distribution. When the data $x$ follow a Gaussian distribution, the stochastic process is called a Gaussian process. A GP-LVM (Lawrence, 2003) is a model that learns how to map the data from a latent space to a data space through a Gaussian process.

When $f$ is a stochastic immersion, the metric tensor becomes a random matrix. In this paper, we call a manifold equipped with the stochastic pullback metric a random manifold, denoted $(\mathcal{Z}, g)$. As a consequence of the stochastic aspect of the metric, all the functionals are stochastic themselves, and they are no longer trivial to manipulate.

Definition 2.3. A random Riemannian metric tensor is a matrix-valued random field (i.e., a collection of matrix-valued random variables $\{G(z, \omega), z \in \mathcal{Z}\}$), whose realisation for a specific evaluation $\omega \in \Omega$ is a Riemannian metric tensor. A random Riemannian metric is a metric induced by a random Riemannian metric tensor: $g_z : (T_z\mathcal{Z}, T_z\mathcal{Z}) \to \mathbb{R}_+ : (u, v) \mapsto u^\top G v$. For the rest of the paper, the associated stochastic norm is denoted:

$$\|\cdot\|_G : T_z\mathcal{Z} \to \mathbb{R}_+ : u \mapsto \sqrt{g_z(u, u)} := \sqrt{u^\top G u}.$$

If this stochastic norm is induced by $f$ defined as a Gaussian process, then $\|\cdot\|_G$ follows a non-central Nakagami distribution. This is explained in the proof of Proposition 2.3.
2.2 Norm induced by the expected metric tensor

One way to approximate a random metric tensor is to take its expectation with respect to the collection of random metrics induced by the stochastic process. This has been introduced before by Tosi et al. (2014) for GP-LVMs.

Definition 2.4. Let $G$ be a stochastic Riemannian metric tensor on the manifold $\mathcal{Z}$. We refer to $\mathbb{E}[G]$ as the expected metric tensor. It induces a Riemannian metric and a norm on $\mathcal{Z}$. We denote the norm induced by the expected metric tensor as:

$$\|\cdot\|_R : T_z\mathcal{Z} \to \mathbb{R}_+ : u \mapsto \|u\|_{\mathbb{E}[G]} := \sqrt{u^\top \mathbb{E}[G] u}.$$

As for any Riemannian metric, we can define the following functionals: $L_R(\gamma) = \int_0^1 \sqrt{\dot{\gamma}(t)^\top \mathbb{E}[G] \dot{\gamma}(t)} \, dt$, $E_R(\gamma) = \int_0^1 \dot{\gamma}(t)^\top \mathbb{E}[G] \dot{\gamma}(t) \, dt = \mathbb{E}[E_R(\gamma)]$, and $V_R(z) = \sqrt{\det \mathbb{E}[G]}$.

2.3 Expected paths on random manifolds

Approximating the stochastic metric by its expectation seems a natural but also ad-hoc solution. If we want to explore a manifold, we might prefer to use a representative quantity, such as the lengths between data points. The expectation of the lengths can give us an idea about how, on average, two points are connected on a random manifold. The expected curve length and its corresponding curve energy on the random manifold $(\mathcal{Z}, g)$ are defined as:

$$L_F(\gamma) = \int_0^1 \mathbb{E}\Big[\sqrt{\dot{\gamma}(t)^\top G \dot{\gamma}(t)}\Big] \, dt = \mathbb{E}[L_G(\gamma)], \quad \text{and} \quad E_F(\gamma) = \int_0^1 \mathbb{E}\Big[\sqrt{\dot{\gamma}(t)^\top G \dot{\gamma}(t)}\Big]^2 \, dt.$$

One observation made by Eklund & Hauberg (2019) is that the length $L_R$ derived from the expected Riemannian metric is not equal to the expected curve length $L_F$, and their respective curve energies differ by a variance term:

$$E_R(\gamma) - E_F(\gamma) = \int_0^1 \Big( \dot{\gamma}(t)^\top \mathbb{E}[G] \dot{\gamma}(t) - \mathbb{E}\Big[\sqrt{\dot{\gamma}(t)^\top G \dot{\gamma}(t)}\Big]^2 \Big) \, dt = \int_0^1 \Big( \mathbb{E}\big[\|\dot{\gamma}(t)\|_G^2\big] - \mathbb{E}\big[\|\dot{\gamma}(t)\|_G\big]^2 \Big) \, dt = \int_0^1 \mathrm{Var}\big[\|\dot{\gamma}(t)\|_G\big] \, dt.$$

This term can be regarded as a regularisation term for the Riemannian curve energy: the curve energy $E_R$ is penalised when the curve goes through regions with high variance.
In practice, for a Gaussian process with a stationary kernel, this variance term is upper bounded by the posterior variance, which is relatively low next to the training points and high outside of the support of the data. Later, we will also see that the functionals agree in high dimensions, leading to the same geodesics (Section 3). Eklund & Hauberg (2019) also noted that these quantities are bounded by the number of dimensions:

Proposition 2.1. (Eklund & Hauberg, 2019) Let $f : \mathbb{R}^q \to \mathbb{R}^D$ be a stochastic process such that the sequence $\{f^1, f^2, \dots, f^D\}$ has uniformly bounded moments. There is then a constant $C$ such that:

2.4 Expected norm and Finsler geometry

Our work builds on the research of Eklund & Hauberg (2019). We are interested in approximating the stochastic norm instead of the metric tensor, and by doing so, the derived curve length and curve energy are the same ones studied by Eklund & Hauberg (2019). We go further, as we compare not only the curve lengths but also the deterministic norms obtained from the stochastic metric.

Definition 2.5. Let $G$ be a stochastic Riemannian metric tensor on the manifold $\mathcal{Z}$. It induces a stochastic norm, $\|\cdot\|_G$, on $\mathcal{Z}$. We denote the expected norm as:

$$\|\cdot\|_F : T_z\mathcal{Z} \to \mathbb{R}_+ : u \mapsto \mathbb{E}[\|u\|_G] := \mathbb{E}\big[\sqrt{u^\top G u}\big].$$

While it cannot be induced by an inner product, it is sufficiently convex to be defined as a Finsler metric.

Definition 2.6. Let $F : T\mathcal{Z} \to \mathbb{R}_+$ be a continuous non-negative function defined on the tangent bundle $T\mathcal{Z}$ of a differentiable manifold $\mathcal{Z}$. We say that $F$ is a Finsler metric if, for each point $z$ of $\mathcal{Z}$ and $v$ in $T_z\mathcal{Z}$, we have:
(1) Positive homogeneity: $\forall \lambda \in \mathbb{R}_+$, $F(\lambda v) = \lambda F(v)$.
(2) Smoothness: $F$ is a $C^\infty$ function on the slit tangent bundle $T\mathcal{Z} \setminus \{0\}$.
(3) Strong convexity criterion: the Hessian matrix $g_{ij}(v) = \frac{1}{2} \frac{\partial^2 F^2}{\partial v^i \partial v^j}(v)$ is positive definite for non-zero $v$.
A differentiable manifold equipped with a Finsler metric is called a Finsler manifold.
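A small numerical illustration may help to see why the expected norm of Definition 2.5 is not induced by an inner product. The sketch below uses a toy random metric that takes two fixed positive-definite values with equal probability (a hypothetical example, far simpler than a Gaussian process), and relies on the classical fact that a norm comes from an inner product exactly when it satisfies the parallelogram law. All function names are illustrative.

```python
import math

# Toy random metric: G equals G1 or G2 with probability 1/2 each (hypothetical example).
G1 = [[4.0, 0.0], [0.0, 1.0]]
G2 = [[1.0, 0.0], [0.0, 4.0]]

def quad(G, u):
    """u^T G u."""
    return sum(u[a] * G[a][b] * u[b] for a in range(2) for b in range(2))

def finsler(u):
    """Expected norm E[sqrt(u^T G u)] of Definition 2.5."""
    return 0.5 * (math.sqrt(quad(G1, u)) + math.sqrt(quad(G2, u)))

def riemannian(u):
    """Norm of the expected tensor, sqrt(u^T E[G] u), of Definition 2.4."""
    expected_G = [[0.5 * (G1[a][b] + G2[a][b]) for b in range(2)] for a in range(2)]
    return math.sqrt(quad(expected_G, u))

def parallelogram_defect(norm, u, v):
    """Vanishes for all u, v if and only if the norm comes from an inner product."""
    plus = [a + b for a, b in zip(u, v)]
    minus = [a - b for a, b in zip(u, v)]
    return norm(plus) ** 2 + norm(minus) ** 2 - 2 * norm(u) ** 2 - 2 * norm(v) ** 2

u, v = [1.0, 0.0], [0.0, 1.0]
defect_F = parallelogram_defect(finsler, u, v)      # non-zero: no inner product behind F
defect_R = parallelogram_defect(riemannian, u, v)   # zero up to rounding

# Positive homogeneity, property (1) of Definition 2.6.
scaled = finsler([3.0, 6.0])
reference = 3 * finsler([1.0, 2.0])
print(defect_F, defect_R, scaled, reference)
```

In this toy case the expected norm is still positively homogeneous, but its parallelogram defect is non-zero, whereas the norm of the expected tensor satisfies the law. The example also shows the expected norm lying below the Riemannian one along the coordinate axes.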
Finsler geometry can be seen as an extension of Riemannian geometry, since the requirements for defining a metric are less restrictive.

Proposition 2.2. Let $G$ be a stochastic Riemannian metric tensor. Then, the function $F_z : T_z\mathcal{Z} \to \mathbb{R} : u \mapsto \|u\|_F$ defines a Finsler metric, but it is not induced by a Riemannian metric.

Proof. If $F$ were induced by a Riemannian metric, then this metric would be defined as $f_z : \mathbb{R}^q \times \mathbb{R}^q \to \mathbb{R}_+ : (v_1, v_2) \mapsto \mathbb{E}\big[\sqrt{v_1^\top G v_2}\big]^2$. Since a Riemannian metric is an inner product, it should be symmetric, positive definite and bilinear. Here, $f_z$ is not bilinear, so $f_z$ is not a Riemannian metric. However, we can prove that $F_z : \mathbb{R}^q \to \mathbb{R} : v \mapsto \mathbb{E}\big[\sqrt{v^\top G v}\big]$ is positive, homogeneous, smooth and strongly convex, and so $F_z$ is a Finsler metric (Shen & Shen, 2016, Definition 2.1). For the full proof, see Section B.1.

So far, we have assumed that $f$ is an immersion and a stochastic process. If we consider $f$ to be a Gaussian process in particular, the Finsler norm can be rewritten in a closed-form expression.

Proposition 2.3. Let $f$ be a Gaussian process and $J$ its Jacobian, with $J \sim \mathcal{N}(\mathbb{E}[J], \Sigma)$. The Finsler norm can be written as:

$$F_z : T_z\mathcal{Z} \to \mathbb{R}_+ : v \mapsto \|v\|_F := \sqrt{2}\sqrt{v^\top \Sigma v}\, \frac{\Gamma\big(\frac{D}{2} + \frac{1}{2}\big)}{\Gamma\big(\frac{D}{2}\big)}\, {}_1F_1\Big(-\frac{1}{2}, \frac{D}{2}, -\frac{\omega}{2}\Big),$$

with ${}_1F_1$ the confluent hypergeometric function of the first kind and $\omega = (v^\top \Sigma v)^{-1}(v^\top \mathbb{E}[J]^\top \mathbb{E}[J] v)$.

Proof. We suppose that $f$ is a Gaussian process, and so is its Jacobian. $G$ follows a non-central Wishart distribution: $G = J^\top J \sim \mathcal{W}_q(D, \Sigma, \Sigma^{-1}\mathbb{E}[J]^\top \mathbb{E}[J])$. The quantity $v^\top G v$ is a scalar and also follows a non-central Wishart distribution: $v^\top G v \sim \mathcal{W}_1(D, \sigma, \omega)$, with $\sigma = v^\top \Sigma v$ and $\omega = (v^\top \Sigma v)^{-1}(v^\top \mathbb{E}[J]^\top \mathbb{E}[J] v)$ (Kent & Muirhead, 1984, Definition 10.3.1). The square root of a non-central Wishart distribution follows a non-central Nakagami distribution (Hauberg, 2018b). Then, by construction, the stochastic norm $\|\cdot\|_G$ follows a non-central Nakagami distribution. The expectation of this distribution is known, and it has a closed-form expression.
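As a hedged numerical companion to Proposition 2.3, the sketch below evaluates the expected norm in closed form and compares it with a Monte Carlo estimate. It assumes that the $D$ rows of $J$ are independent Gaussians sharing the covariance $\Sigma$, so that the entries of $Jv$ are independent with variance $\sigma = v^\top \Sigma v$. Under that assumption the expected norm is the mean of a scaled non-central chi variable, $\sqrt{2\sigma}\, \Gamma(\tfrac{D+1}{2}) / \Gamma(\tfrac{D}{2})\, {}_1F_1(-\tfrac12; \tfrac{D}{2}; -\tfrac{\omega}{2})$. This explicit form is our reading and should be checked against the paper's statement. The function names and the numerical values of `mean_Jv` and `sigma` are hypothetical.

```python
import math
import random

def hyp1f1_neg_half(b, x, max_terms=500):
    """1F1(-1/2; b; -x) for moderate x >= 0, using Kummer's transformation
    1F1(a; b; z) = exp(z) * 1F1(b - a; b; -z), which yields a series of positive terms."""
    a = b + 0.5
    term = 1.0
    total = 1.0
    for n in range(max_terms):
        term *= (a + n) / (b + n) * x / (n + 1)
        total += term
        if term < 1e-16 * total:
            break
    return math.exp(-x) * total

def finsler_norm(sigma, omega, D):
    """Closed-form expected norm under the row-independence assumption stated above."""
    log_gamma_ratio = math.lgamma((D + 1) / 2) - math.lgamma(D / 2)
    return math.sqrt(2 * sigma) * math.exp(log_gamma_ratio) * hyp1f1_neg_half(D / 2, omega / 2)

def riemannian_norm(sigma, omega, D):
    """sqrt(v^T E[G] v), using E[G] = E[J]^T E[J] + D * Sigma under the same assumption."""
    return math.sqrt(sigma * (D + omega))

def monte_carlo_norm(mean_Jv, sigma, n_samples=20000, seed=0):
    """Average of ||J v|| over samples of J v, whose entries are N(mean_i, sigma)."""
    rng = random.Random(seed)
    std = math.sqrt(sigma)
    total = 0.0
    for _ in range(n_samples):
        total += math.sqrt(sum((m + std * rng.gauss(0.0, 1.0)) ** 2 for m in mean_Jv))
    return total / n_samples

D = 5
mean_Jv = [0.5, -1.0, 0.0, 2.0, 1.0]   # hypothetical E[J] v
sigma = 0.7                             # hypothetical v^T Sigma v
omega = sum(m * m for m in mean_Jv) / sigma

F = finsler_norm(sigma, omega, D)
R = riemannian_norm(sigma, omega, D)
F_mc = monte_carlo_norm(mean_Jv, sigma)

# Relative gap (R - F) / R in the centred case omega = 0, for growing D.
gaps = [1 - finsler_norm(1.0, 0.0, d) / riemannian_norm(1.0, 0.0, d) for d in (10, 100, 1000)]
print(F, F_mc, R, gaps)
```

The series is evaluated through Kummer's transformation so that all terms are positive, which avoids cancellation for negative arguments; for very large $\omega$ a dedicated special-function library would be a safer choice. The last line illustrates the shrinking gap between the two norms as $D$ grows, in line with the convergence stated in the abstract.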
The confluent hypergeometric function of the first kind, also known as the Kummer function, is a special function that is defined as the solution of a specific second-order linear differential equation. The term $\omega$ arises from the non-central Wishart distribution. When $\omega$ is non-zero, the distribution of the Jacobian shifts away from the origin, and $\omega$ represents the magnitude and the direction of this shift, balanced by the correlation between the variables. In Section 3.4, to prove our results in high dimensions, we will assume that our manifold $\mathcal{Z}$ is bounded, and so is $\omega$.

3 Comparison of Riemannian and Finsler metrics

3.1 Theoretical comparison

In geometry, we need to define a norm to compute functionals, and the Riemannian metric is conveniently obtained by constructing an inner product. Because of its bilinearity, it greatly simplifies subsequent computations, but it is also restrictive. Relaxing this assumption and defining a metric as a more general norm (footnote 2) has been studied by Finsler (1918), who gave his name to this discipline. Finsler geometry is similar to Riemannian geometry without the bilinear assumption: most of the functionals (curve length and curve energy) are defined similarly to those obtained in Riemannian geometry. However, the volume measure is different, and there are at least two definitions of volume measure used in Finsler geometry: the Busemann-Hausdorff volume and the Holmes-Thomson volume measure (Wu, 2011).

Footnote 2: The norm actually needs to be strongly convex, but it is not necessarily symmetric. This means that, for a vector $v$, we can have a non-reversible Finsler metric: $F_x(v) \neq F_x(-v)$. Intuitively, this means that the path used to connect two points would be different depending on the starting point. This asymmetric property becomes valuable when studying the geometry of anisotropic media (Markvorsen, 2016), for example. In our case, our Finsler metric is reversible.

In this
paper, we decided to focus on the Busemann-Hausdorff definition (Definition 3.1), which is more intuitive and easier to derive. If the Finsler metric is a Riemannian metric, the definition of volume naturally coincides with the Riemannian volume measure.

Definition 3.1. For a given point $z$ on the manifold, we define the Finsler indicatrix as the set of vectors in the tangent space such that the Finsler metric is equal to one: $\{v \in T_z\mathcal{Z} \mid F_z(v) = 1\}$. We call $B^n(1)$ the Euclidean unit ball, and $\mathrm{vol}(\cdot)$ the standard Euclidean volume. In local coordinates $(e_1, \dots, e_d)$ on a Finsler manifold $\mathcal{M}$, the Busemann-Hausdorff volume form is defined as $dV_F = V_F(z)\, e_1 \wedge \dots \wedge e_d$, with:

$$V_F(z) = \frac{\mathrm{vol}(B^n(1))}{\mathrm{vol}(\{v \in T_z\mathcal{Z} \mid F_z(v) < 1\})}.$$

In the definition above, we introduce the notion of indicatrix. An indicatrix is a way to represent the distortion induced by the metric on a unit circle. If our metric is Euclidean, we will only have a linear transformation between the latent and the observational spaces, and the indicatrix would still be a circle. Because the Riemannian metric is quadratic, it will always generate an ellipse in the latent space. The Finsler indicatrix, however, would have a convex, even asymmetrical, shape. This difference can be observed in the indicatrix field represented in Figure 2: the Finsler indicatrices, in purple, can have an almost rectangular shape, while the Riemannian indicatrices, in orange, are ellipses.

Figure 2: Indicatrix field over the latent space of the pinwheel data (in grey) representing the Riemannian (in orange) and Finslerian (in purple) metrics (see Section C). (A) The indicatrices are computed over a grid in the latent space. (B) The indicatrices are computed along a geodesic: the Riemannian and Finslerian metrics coincide.

There are also a few observations to note in Figure 2.
First, in the area of low predictive variance (where data points lie in the latent space), the Finsler and Riemannian indicatrices are alike. This follows from the preceding comment that the metrics differ by a variance term. If our mapping $f$ were deterministic, both metrics would agree. Second, for every point, the Riemannian indicatrices are always contained in the Finslerian ones, illustrating Proposition 3.1 on our absolute bounds in the following section.

3.2 Absolute bounds on the Finsler metric

The Finsler norm is upper bounded by the Riemannian norm obtained from the expected metric tensor. It is also lower bounded:

Proposition 3.1. We define $\alpha = 2 \Big( \frac{\Gamma(\frac{D}{2} + \frac{1}{2})}{\Gamma(\frac{D}{2})} \Big)^2$. The Finsler norm $\|\cdot\|_F$ is bounded by two norms, $\|\cdot\|_{\alpha\Sigma}$ and $\|\cdot\|_R$, induced by two respective Riemannian metric tensors: the covariance tensor $\alpha\Sigma_z$ and the expected metric tensor $\mathbb{E}[G_z]$.

$$\forall (z, v) \in \mathcal{Z} \times T_z\mathcal{Z} : \quad \|v\|_{\alpha\Sigma} \leq \|v\|_F \leq \|v\|_R.$$

Proof. The full proof is detailed in Section B.2.1, and it can be summarised the following way. The upper bound $\|v\|_F \leq \|v\|_R$, also rewritten as $\mathbb{E}\big[\sqrt{v^\top G v}\big] \leq \sqrt{v^\top \mathbb{E}[G] v}$, is obtained by applying Jensen's inequality, knowing that the square root $x \mapsto \sqrt{x}$ is a concave function. The lower bound $\|v\|_{\alpha\Sigma} \leq \|v\|_F$, rewritten as $\sqrt{v^\top \alpha\Sigma v} \leq \mathbb{E}\big[\sqrt{v^\top G v}\big]$, is obtained using the closed-form expression of the Finsler function.

The result is illustrated in Figure 4 (lower right). Four metric tensors ($G_1, G_2, G_3, G_4$), each following a non-central Wishart distribution with a specific mean and covariance matrix, have been computed. For each of them, we have drawn the indicatrices ($\{v \in T_z\mathcal{Z} \mid \|v\| = 1\}$) induced by the norms $\|\cdot\|_F$, $\|\cdot\|_R$ and $\|\cdot\|_{\alpha\Sigma}$. As expected, the $\alpha\Sigma$-indicatrix contains the Finsler indicatrix, which itself contains the $R$-indicatrix.

By bounding the Finsler metric, we are able to bound their respective functionals:

Corollary 3.1.
The length, the energy and the Busemann-Hausdorff volume of the Finsler metric are bounded respectively by the Riemannian length, energy and volume of the covariance tensor $\alpha\Sigma$ (denoted $L_{\alpha\Sigma}$, $E_{\alpha\Sigma}$, $V_{\alpha\Sigma}$) and the expected metric $\mathbb{E}[G]$ (denoted $L_R$, $E_R$, $V_R$):

$$\forall z \in \mathcal{Z}, \quad L_{\alpha\Sigma}(z) \leq L_F(z) \leq L_R(z), \quad E_{\alpha\Sigma}(z) \leq E_F(z) \leq E_R(z), \quad V_{\alpha\Sigma}(z) \leq V_F(z) \leq V_R(z).$$

Proof. The full proof is detailed in Section B.2.1. From Proposition 3.1, we need to integrate each term of the inequality to obtain the length and the energy. The volume is less trivial, since we use the Busemann-Hausdorff definition for measuring $V_F$. We recast the problem in hyperspherical coordinates, and show that the Finsler indicatrix is still bounded.

3.3 Relative bounds on the Finsler metric

Proposition 3.2. Let $f$ be a stochastic immersion. $f$ induces the stochastic norm $\|\cdot\|_G$, defined in Section 2. The relative difference between the Finsler norm $\|\cdot\|_F$ and the Riemannian norm $\|\cdot\|_R$ is:

$$0 \leq \frac{\|v\|_R - \|v\|_F}{\|v\|_R} \leq \frac{\mathrm{Var}\big[\|v\|_G^2\big]}{2\, \mathbb{E}\big[\|v\|_G^2\big]^2}.$$

Proof. This proposition is a direct application of the sharpened Jensen's inequality (Liao & Berg, 2019).

The previous proposition is valid for any stochastic immersion. We can see that the metrics become equal when the ratio of the variance over the expectation shrinks to zero. This happens in two cases: when the variance converges to zero, which is similar to having a deterministic immersion, and when the number of dimensions increases. The latter case is investigated below for a Gaussian process (footnote 3).

Footnote 3: Interestingly, the term $\mathbb{E}[\nu]^2 / \mathrm{Var}[\nu]$, when $\nu$ follows a central Nakagami distribution, is called a shape parameter. It was introduced by Nakagami himself to study the intensity of fading in radio wave propagation (Nakagami, 1960). When $\nu := \sqrt{\xi}$ with $\xi \sim \mathcal{W}_1(\Sigma, D)$, then $m = D/2$. This is a particular result obtained from Proposition 3.3, when $\mathbb{E}[J] = 0$.

Proposition 3.3. Let $f$ be a Gaussian process.
We denote $\omega = (v^\top \Sigma v)^{-1}(v^\top \mathbb{E}[J]^\top \mathbb{E}[J] v)$, with $J$ the Jacobian of $f$, and $\Sigma$ the covariance matrix of $J$. The relative ratio between the Finsler norm $\|\cdot\|_F$ and the Riemannian norm $\|\cdot\|_R$ is:

$$0 \leq \frac{\|v\|_R - \|v\|_F}{\|v\|_R} \leq \frac{1}{D + \omega} + \frac{\omega}{(D + \omega)^2}.$$

Proof. $v^\top G v$ follows a one-dimensional non-central Wishart distribution: $v^\top G_z v \sim \mathcal{W}_1(D, \sigma, \omega)$, with $\sigma = v^\top \Sigma v$ and $\omega = (v^\top \Sigma v)^{-1}(v^\top \mathbb{E}[J]^\top \mathbb{E}[J] v)$. We use the theorem of the moments to obtain both the expectation and the variance, which leads us to the result.

As we have seen that the metrics are bounded, it is easy to show that the functionals derived from those metrics are also bounded:

Corollary 3.2. When $f$ is a Gaussian process, the relative ratio between the length, the energy and the volume of the Finsler norm (denoted $L_F$, $E_F$, $V_F$) and the Riemannian norm (denoted $L_R$, $E_R$, $V_R$) is:

$$0 \leq \frac{L_R(z) - L_F(z)}{L_R(z)} \leq \max_{v \in T_z\mathcal{Z}} \Big\{ \frac{1}{D + \omega} + \frac{\omega}{(D + \omega)^2} \Big\},$$

$$0 \leq \frac{E_R(z) - E_F(z)}{E_R(z)} \leq \max_{v \in T_z\mathcal{Z}} \Big\{ \frac{2}{D + \omega} + \frac{1 + 2\omega}{(D + \omega)^2} + \frac{2\omega}{(D + \omega)^3} + \frac{\omega^2}{(D + \omega)^4} \Big\},$$

$$0 \leq \frac{V_R(z) - V_F(z)}{V_R(z)} \leq 1 - \Big( 1 - \max_{v \in T_z\mathcal{Z}} \Big\{ \frac{1}{D + \omega} + \frac{\omega}{(D + \omega)^2} \Big\} \Big)^q.$$

Proof. We directly use Proposition 3.3. To obtain the inequalities with the lengths and the energies, we first multiply all the terms by the Riemannian metric, and we integrate every term. To obtain the inequality with the volume, similarly to Corollary 3.1, we place ourselves in hyperspherical coordinates and bound the radius of the Finsler indicatrix. The full proof is in Section B.2.2.

Figure 3: Difference of volume for data embedded in the latent space. (A) Riemannian volume measure, (B) Finslerian (Busemann-Hausdorff) volume measure, (C) Variance of the Gaussian process, (D) Ratio between the Riemannian and Finslerian volume: $(V_R(z) - V_F(z))/V_R(z)$. All heatmaps are computed in logarithmic scale.

In Figure 3, we can compare the volume measures obtained from the Riemannian and Finsler metrics, and in particular, their ratio in the top right image. When the metrics are computed next to the data points, in areas where the variance is very low, we can see that the ratio of the volume measures is of the order of magnitude $10^{-4}$.
Further away from the data points, the variance increases and so does the difference between the Riemannian and Finsler volume measures.

3.4 Results in high dimensions

Proposition 3.3 and Corollary 3.2 indicate that the metrics become similar when the dimension $D$ of the observational space increases. If we assume that the latent space is a bounded manifold, the metrics converge to each other at a rate of $O\left(\frac{1}{D}\right)$, as do their functionals.

We assume that the latent manifold is bounded. Then, we can deduce that (1) the term $\omega$, which represents the non-centrality of the data, does not grow faster than the number of dimensions (see Lemma B.2.2, in Section B.2.3) and (2) the metrics are finite.

Corollary 3.3. Let $f$ be a Gaussian process. In high dimensions, we have:

$$\frac{L_R(z) - L_F(z)}{L_R(z)} = O\left(\frac{1}{D}\right), \quad \frac{E_R(z) - E_F(z)}{E_R(z)} = O\left(\frac{1}{D}\right), \quad \text{and} \quad \frac{V_R(z) - V_F(z)}{V_R(z)} = O\left(\frac{q}{D}\right).$$

When $D$ converges toward infinity: $L_R \underset{+\infty}{\sim} L_F$, $E_R \underset{+\infty}{\sim} E_F$ and $V_R \underset{+\infty}{\sim} V_F$.

Proof. This result follows from Corollary 3.2, assuming the latent manifold is bounded. The full proof can be found in Section B.2.3.

Corollary 3.4. Let $f$ be a Gaussian process. In high dimensions, the relative ratio between the Finsler norm $\|\cdot\|_F$ and the Riemannian norm $\|\cdot\|_R$ is:

$$\frac{\|v\|_R - \|v\|_F}{\|v\|_R} = O\left(\frac{1}{D}\right).$$

And, when $D$ converges toward infinity: $\forall v \in T_z\mathcal{Z}$, $\|v\|_R \underset{+\infty}{\sim} \|v\|_F$.

Proof. Similarly, from Proposition 3.3, in a bounded manifold, both metrics converge to each other in high dimensions.

[Figure 4 panels: dim: 3, dim: 5, dim: 10, dim: 50. Legend: Riemannian indicatrix, Finslerian indicatrix, lower bound metric tensor indicatrix.]

Figure 4: Left: Ratio of volumes $(V_R - V_F)/V_R$ decreasing with respect to the number of dimensions. The results were obtained using a collection of matrices $\{G_i\}$ following a non-central Wishart distribution. Upper right: The Finsler and Riemannian indicatrices converge towards each other when increasing the number of dimensions.
Lower right: Illustration of the absolute bounds in Proposition 3.1 with the $\alpha\Sigma$-indicatrices, Riemannian indicatrices and Finsler indicatrices.

4 Experiments

We want to illustrate cases where these metrics differ in practice. For this, we use two synthetic datasets, consisting of a pinwheel and concentric circles mapped to a sphere, and four real-world datasets: a font dataset (Campbell & Kautz, 2014), a dataset representing single cells (Guo et al., 2010), MNIST (LeCun, 1998) and Fashion MNIST (Xiao et al., 2017). We trained a GPLVM with and without stochastic active sets (Moreno-Muñoz et al., 2022) to learn a latent manifold. From the learnt model, we can access the Riemannian and Finsler metrics, and minimise their respective curve energies to obtain the corresponding geodesics. All the code has been built using Stochman (Detlefsen et al., 2021), a Python library to efficiently compute geodesics on manifolds.

4.1 Experiments with synthetic data showing high variance

The synthetic data correspond to simple patterns, a pinwheel and concentric circles, that have been projected onto a sphere. In those examples, we plot curves that do not follow the data points and so go through regions of high variance. The background of the latent manifold represents the variance of the posterior distribution in logarithmic scale. For both cases, we show the 3-dimensional data space and the corresponding learnt latent space. The curves are mapped from the latent space to the data space using the forward pass of the GPLVM.

Figure 5: The Riemannian (purple), Finslerian (orange) and Euclidean (dotted gray) geodesics obtained by pulling back the metric through the Gaussian processes of a trained GPLVM.
The models, trained on the 3-dimensional synthetic data (Figures A.2, B.2), learnt their latent representations (Figures A.1, B.1).

We can see that the Riemannian geodesics tend to avoid areas of high variance in both cases: they are attracted to the data, at the cost of following longer paths in $\mathbb{R}^3$. The Finslerian geodesics, by contrast, are not perturbed by the variance term, and will explore regions without any data, following shorter paths in $\mathbb{R}^3$. The geodesics were computed on a discretised manifold, obtained from a 10x10 grid, using Stochman (Detlefsen et al., 2021). This approach proved to be more effective than minimising the energy along a spline, which was prone to getting trapped in local minima. The GPLVM has been coded in Pyro (Bingham et al., 2019). All the implementation details can be found in Appendix C.

4.2 Experiments with a font dataset and qPCR dataset

The font dataset (Campbell & Kautz, 2014) represents the contours of letters in various fonts, and the qPCR dataset (Guo et al., 2010) represents single-cell stages. We trained a GPLVM, also using Pyro (Bingham et al., 2019), to learn the latent space. From the optimised Gaussian process, we can access the Riemannian and Finsler metrics, and minimise their respective curve energies to obtain geodesics. As in the previous experiment, the background colour represents the variance of the posterior distribution in logarithmic scale. We can notice that the Riemannian and the Finsler geodesics agree with each other. This experiment also agrees with our finding that in high dimensions (the dimension of the single-cell data is 48, that of the font data is 256), both metrics converge to each other. We conclude that, in practice, the Riemannian metric is a good approximation of the Finsler metric, and Finsler geometry does not need to be used in high dimensions.
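To give a sense of scale for the dimensions quoted above, the upper bound of Proposition 3.3 can be evaluated directly. The following is a minimal sketch: the function name `relative_gap_bound` and the sample values of ω are assumptions chosen for illustration, not quantities reported in the experiments.

```python
def relative_gap_bound(D, omega):
    """Upper bound of Proposition 3.3 on (||v||_R - ||v||_F) / ||v||_R."""
    return 1.0 / (D + omega) + omega / (D + omega) ** 2


# Data dimensions quoted in the text: 48 (single-cell data) and 256 (font data);
# D = 3 matches the synthetic data. The omega values are hypothetical;
# omega = 0 corresponds to E[J] = 0.
for D in (3, 48, 256):
    for omega in (0.0, 10.0, 100.0):
        bound = relative_gap_bound(D, omega)
        print(f"D={D:4d}  omega={omega:6.1f}  bound={bound:.4f}")
```

Since the bound can be rewritten as $(D + 2\omega)/(D + \omega)^2$, it never exceeds $1/D$ for $\omega \ge 0$. Under this bound, the relative gap between the two norms is therefore at most about 2% for $D = 48$ and about 0.4% for $D = 256$, whatever the value of ω.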
Figure 6: The Riemannian (plain line) and Finslerian (dotted line) geodesics obtained by pulling back the metric through the GPLVM. The models trained on single-cell data (which cannot be visualised) and font data (Figure B.2) learnt their respective 2-dimensional latent representations (Figures A, B.1).

4.3 Experiments with MNIST and Fashion MNIST

GPLVMs are often hard to train, but they can be scaled effectively using variational inference and inducing points. Using stochastic active sets has been shown to give reliable uncertainty estimates, and the model also learns more meaningful latent representations, as shown in the original paper by Moreno-Muñoz et al. (2022). For those experiments, we used those models on two well-known benchmarks: MNIST and Fashion MNIST. In both experiments, we cannot see any difference between the Finslerian and the Riemannian geodesics. There are two reasons for this: the variance of the posterior learnt by the GPLVM varies little across the latent manifold, as can be seen in the background of the plots on the right, and the data is high dimensional. Again, in practice, the Riemannian metric is a good approximation of the Finslerian metric.

5 Discussion

Generative models are often used to reduce data dimension in order to better understand the mechanisms behind the data generating process. We consider the general setting where the mapping from latent variables to observations is driven by a smooth stochastic process, and the sample mappings span Riemannian manifolds. The Riemannian geometry machinery has already been used in the past to explore the latent space. In this paper, we have shown how curves and volumes can be identified by defining the length of a latent curve as its expected length measured in the observation space. This is a natural extension of classical differential geometric constructions to the stochastic realm.
Surprisingly, we have shown that this does not give rise to a Riemannian metric over the latent space, even if sample mappings do. Rather, the latent representation naturally becomes equipped with a Finsler metric, implying that stochastic manifolds, such as those spanned by Latent Variable Models (LVMs), are inherently more complex than their deterministic counterparts.

The Finslerian view of the latent representation gives us a suitable general solution to explore a random manifold, but it does not immediately translate into a practical computational tool. As Riemannian manifolds are better understood computationally than Finsler manifolds, we have raised the question: how good an approximation of the Finsler metric can be achieved by a Riemannian metric? The answer turns out to be: quite good. We have shown that as data dimension increases, the Finsler metric becomes increasingly Riemannian. Since LVMs are most commonly applied to high-dimensional data (as this is where dimensionality reduction carries value), we have justification for approximating the Finsler metric with a Riemannian metric such that computational tools become more easily available. In practice we find that geodesics under the Finsler and the Riemannian metric are nearly identical, except in regions of high uncertainty.

Figure 7: The Riemannian (purple) and the Finslerian (orange) geodesics are plotted. Upper figure: learnt latent space with the images (left) and data points (right) of Fashion MNIST. Lower figure: learnt latent space with the images (left) and data points (right) of MNIST.

Acknowledgments

The authors would like to thank Professor Steen Markvorsen for the initial discussions on Finsler geometry, and Dr. Cilie Feldager for her invaluable help in taming GP-LVMs. This work was funded in part by the Novo Nordisk Foundation through the Center for Basic Machine Learning Research in Life Science (NNF20OC0062606).
It also received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (757360). SH was supported in part by research grants (15334, 42062) from VILLUM FONDEN.

Code availability

The code used for the experiments in this study is available on GitHub at https://github.com/a-pouplin/latent_distances_finsler.

$\mathcal{Z}, \mathcal{X}$: Smooth differentiable latent ($\mathcal{Z}$) and data ($\mathcal{X}$) manifolds,
$f$: A stochastic immersion $f: \mathcal{Z} \subset \mathbb{R}^q \to \mathcal{X} \subset \mathbb{R}^D$,
$J$: Jacobian of the stochastic function $f$,
$G$: Stochastic metric tensor defined as the pullback metric through $f$: $G = J^\top J$,
$T_z\mathcal{Z}$: Tangent space of the manifold $\mathcal{Z}$ at a point $z$,
$\Sigma$: If $f$ is a Gaussian process, then $J \sim \prod_{i=1}^{D} \mathcal{N}(\mu_i, \Sigma)$,
$\|\cdot\|_G$: Stochastic induced norm: $\|v\|_G := \sqrt{v^\top G v}$,
$\|\cdot\|_R$: Riemannian induced norm: $\|v\|_R := \sqrt{v^\top \mathbb{E}[G] v} := \sqrt{g(v, v)}$,
$\|\cdot\|_F$: Finsler norm: $\|v\|_F := \mathbb{E}[\sqrt{v^\top G v}] = F(v)$,
$L_R, E_R, V_R$: Length, energy and volume obtained from the Riemannian induced norm $\|\cdot\|_R$,
$L_F, E_F, V_F$: Length, energy and Busemann-Hausdorff volume obtained from the Finsler norm $\|\cdot\|_F$.

Sumon Ahmed, Magnus Rattray, and Alexis Boukouvalas. GrandPrix: scaling up the Bayesian GPLVM for single-cell data. Bioinformatics, 35(1):47-54, 2019.

Georgios Arvanitidis, Lars Kai Hansen, and Søren Hauberg. Latent Space Oddity: on the Curvature of Deep Generative Models. In International Conference on Learning Representations (ICLR), 2018.

Georgios Arvanitidis, Miguel González-Duque, Alison Pouplin, Dimitris Kalatzis, and Søren Hauberg. Pulling back information geometry, 2021. URL https://arxiv.org/abs/2106.05367.

Hadi Beik-Mohammadi, Søren Hauberg, Georgios Arvanitidis, Gerhard Neumann, and Leonel Rozo. Learning Riemannian manifolds for geodesic motion skills. arXiv preprint arXiv:2106.04315, 2021.

Eli Bingham, Jonathan P Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D Goodman.
Pyro: Deep universal probabilistic programming. The Journal of Machine Learning Research, 20(1):973-978, 2019.

Neill D. F. Campbell and Jan Kautz. Learning a manifold of fonts. ACM Trans. Graph., 33(4), jul 2014. ISSN 0730-0301. doi: 10.1145/2601097.2601212. URL https://doi.org/10.1145/2601097.2601212.

Nicki S. Detlefsen, Alison Pouplin, Cilie W. Feldager, Cong Geng, Dimitris Kalatzis, Helene Hauschultz, Miguel González Duque, Frederik Warburg, Marco Miani, and Søren Hauberg. Stochman. GitHub. Note: https://github.com/MachineLearningLifeScience/stochman/, 2021.

Nicki Skafte Detlefsen, Søren Hauberg, and Wouter Boomsma. Learning meaningful representations of protein sequences. Nature Communications, 13(1):1914, 2022.

David Eklund and Søren Hauberg. Expected path length on random manifolds. arXiv preprint arXiv:1908.07377, 2019.

Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983-1049, 2016.

Paul Finsler. Ueber Kurven und Flächen in allgemeinen Räumen. Philos. Fak., Georg-August-Univ., 1918.

Miguel González-Duque, Rasmus Berg Palm, Søren Hauberg, and Sebastian Risi. Mario plays on a manifold: Generating functional content in latent space through differential geometry. In 2022 IEEE Conference on Games (CoG), pp. 385-392. IEEE, 2022.

Guoji Guo, Mikael Huss, Guo Qing Tong, Chaoyang Wang, Li Li Sun, Neil D. Clarke, and Paul Robson. Resolution of cell fate decisions revealed by single-cell gene expression analysis from zygote to blastocyst. Developmental Cell, 18(4):675-685, 2010. ISSN 1534-5807. doi: 10.1016/j.devcel.2010.02.012. URL https://www.sciencedirect.com/science/article/pii/S1534580710001103.

Søren Hauberg. Only Bayes should learn a manifold (on the estimation of differential geometric structure from data). arXiv preprint arXiv:1806.04994, 2018a.

Søren Hauberg.
The non-central Nakagami distribution. Technical report, Technical University of Denmark, 2018b. URL http://www2.compute.dtu.dk/~sohau/papers/nakagami2018/nakagami.pdf.

Martin Jørgensen and Søren Hauberg. Isometric Gaussian process latent variable model for dissimilarity data. In International Conference on Machine Learning, pp. 5127-5136. PMLR, 2021.

John T. Kent and R. J. Muirhead. Aspects of Multivariate Statistical Theory. The Statistician, 1984. ISSN 00390526. doi: 10.2307/2987858.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Neil Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. Advances in Neural Information Processing Systems, 16, 2003.

Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

John M Lee. Smooth manifolds. In Introduction to Smooth Manifolds, pp. 1-31. Springer, 2013.

J. G. Liao and Arthur Berg. Sharpening Jensen's inequality. The American Statistician, 73(3):278-281, 2019. doi: 10.1080/00031305.2017.1419145.

Federico Lopez, Beatrice Pozzetti, Steve Trettel, Michael Strube, and Anna Wienhard. Symmetric spaces for graph embeddings: A Finsler-Riemannian approach. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 7090-7101. PMLR, 18-24 Jul 2021. URL https://proceedings.mlr.press/v139/lopez21a.html.

Steen Markvorsen. A Finsler geodesic spray paradigm for wildfire spread modelling. Nonlinear Analysis: Real World Applications, 28:208-228, 2016.

Pablo Moreno-Muñoz, Cilie Feldager, and Søren Hauberg. Revisiting active sets for Gaussian process decoders. Advances in Neural Information Processing Systems, 35:6603-6614, 2022.

Minoru Nakagami. The m-distribution: a general formula of intensity distribution of rapid fading. In Statistical Methods in Radio Wave Propagation, pp. 3-36.
Elsevier, 1960.

Pyro. Gaussian process latent variable model, 2022. URL https://pyro.ai/examples/gplvm.html.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005. ISBN 026218253X.

Nathan D. Ratliff, Karl Van Wyk, Mandy Xie, Anqi Li, and Muhammad Asif Rana. Generalized nonlinear and Finsler geometry for robotics. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 10206-10212, 2021. doi: 10.1109/ICRA48506.2021.9561543.

Aidan Scannell, Carl Henrik Ek, and Arthur Richards. Trajectory optimisation in learned multimodal dynamical systems via latent-ODE collocation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 12745-12751. IEEE, 2021.

Yi-Bing Shen and Zhongmin Shen. Introduction to Modern Finsler Geometry. Co-published with HEP, 2016. doi: 10.1142/9726. URL https://www.worldscientific.com/doi/abs/10.1142/9726.

Alessandra Tosi, Søren Hauberg, Alfredo Vellido, and Neil D Lawrence. Metrics for probabilistic geometries. arXiv preprint arXiv:1411.7432, 2014.

J. G. Wendel. Note on the gamma function. The American Mathematical Monthly, 55(9):563-564, 1948. ISSN 00029890, 19300972. URL http://www.jstor.org/stable/2304460.

Bingye Wu. Volume form and its applications in Finsler geometry. Publicationes Mathematicae, 78, 04 2011. doi: 10.5486/PMD.2011.4998.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

A A primer on Geometry

The main purpose of the paper is to define and compare two legitimate metrics to compute the average length between random points.
Before going further, it is important to formally define the two metrics (Riemannian and Finsler metrics, respectively), which we do in Sections A.2 and A.3. They are both constructed on topological manifolds, the definition of which is recalled in Section A.1. We finally introduce the notion of random manifold in Section A.4, which is the last notion needed to frame our problem of interest: which metric should we use to compute the average distance on a random manifold?

A.1 Topological and differentiable manifolds

This section aims to define core concepts in differential geometry that will be used later to define Riemannian and Finsler manifolds. Recall that two topological spaces are called homeomorphic if there is a continuous bijection between them with continuous inverse.

Definition A.1. A $d$-dimensional topological manifold $M$ is a second-countable Hausdorff topological space such that every point has an open neighbourhood homeomorphic to an open subset of $\mathbb{R}^d$.

Let $M$ be a topological manifold. This means that for any $x \in M$ there is an open neighbourhood $U_x$ of $x$ and a homeomorphism $\varphi_{U_x}: U_x \to \mathbb{R}^d$ onto an open subset of $\mathbb{R}^d$. Suppose that $x, y \in M$ are such that $U_x \cap U_y \neq \emptyset$, let $U = U_x$, $V = U_y$ and consider the so-called coordinate change map $\varphi_V \circ \varphi_U^{-1}|_{\varphi_U(U \cap V)}: \varphi_U(U \cap V) \to \mathbb{R}^d$. We call $M$ together with an open cover $\{U_x\}_{x \in M}$ as above a differentiable or smooth manifold if the coordinate change maps are infinitely differentiable. Beyond these technical definitions, one can imagine a differentiable manifold as a well-behaved smooth surface that possesses locally all the topological properties of a Euclidean space. All the manifolds in this paper are assumed to be differentiable and connected manifolds.

Definition A.2. We also define, for a differentiable manifold $M$, the tangent space $T_xM$ as the set of all the tangent vectors at $x \in M$, and the tangent bundle $TM$ as the disjoint union of all the tangent spaces: $TM = \bigsqcup_{x \in M} T_xM$.
So far, we have only defined topological and differential properties of manifolds. In order to compute geometric quantities, we need to equip those with a metric that helps us derive useful quantities such as lengths, energies and volumes. A metric is a scalar-valued function that is defined for each point on the topological manifold and takes as inputs one or two vectors (depending on the type of metric) from the tangent space at the specific point. Such a function can either be defined as a scalar product between two vectors, which is the case of a Riemannian metric, or, in the case of a Finsler metric, similarly to the norm of a vector. We will formally define these metrics and highlight their differences in the following sections.

A.2 Riemannian manifolds

Definition A.3. Let $M$ be a manifold. A Riemannian metric is a map assigning at each point $x \in M$ a scalar product $G(\cdot, \cdot): T_xM \times T_xM \to \mathbb{R}$, with $G$ a positive definite bilinear map, which is smooth with respect to $x$. A smooth manifold equipped with a Riemannian metric is called a Riemannian manifold.

We usually express the metric as a symmetric positive definite matrix $G$, where we have for two vectors $u, v \in T_xM$: $G(u, v) = \langle u, v \rangle_G = u^\top G v$. We further define the induced norm: $\forall v \in T_xM$, $\|v\|_G = \sqrt{G(v, v)}$. The Riemannian metric here can either refer to the scalar product $G(\cdot, \cdot)$ itself, or to the associated metric tensor $G$.

Figure 8: $f$ is an immersion that maps a low dimensional manifold to a high dimensional manifold $M$. On $M$, a tangent plane $T_xM$ is drawn at $x$. The indicatrix of the Euclidean metric is plotted in blue. When this metric is pulled back through $f$, the low dimensional space is equipped with the pullback metric $g$, which is a Riemannian metric by definition. The vectors $df(u)$ and $df(v)$ are called the push-forwards of the vectors $u$ and $v$ through $f$.

Definition A.4.
We consider a curve $\gamma(t)$ and its derivative $\dot\gamma(t)$ on a Riemannian manifold $M$ equipped with the metric $g$. Then, we define the length of the curve:

$$L_G(\gamma) = \int \|\dot\gamma(t)\|_G \, dt = \int \sqrt{g_t(\dot\gamma(t), \dot\gamma(t))} \, dt,$$

where $g_t = g_{\gamma(t)}$. Locally length-minimising curves between two connecting points are called geodesics.

Definition A.5. The curve energy is defined as:

$$E_G(\gamma) = \int \|\dot\gamma(t)\|_G^2 \, dt = \int g_t(\dot\gamma(t), \dot\gamma(t)) \, dt.$$

There are two interesting properties to note about the length of a curve and the curve energy. First, the length is parametrisation invariant: for any bijective smooth function $\eta$ on the domain of $\gamma$ we have that $L_G(\gamma \circ \eta) = L_G(\gamma)$. We also say the Riemannian metric gives us intrinsic coordinates to compute the length. Secondly, for a given curve $\gamma$, we have: $L_G(\gamma)^2 \le 2E_G(\gamma)$. Because the length is invariant under reparametrisation, a solver that minimises it can find an infinite number of solutions. On the other hand, the curve energy is convex and will lead to a unique solution. Thus, to obtain a geodesic, instead of solving the corresponding ODEs, or directly minimising lengths, it is easier in practice to minimise the curve energy, as a minimal energy gives a minimal length.

The Riemannian metric also provides us with an infinitesimal volume element that relates our metric $G$ to an orthonormal basis, the same way the Jacobian determinant accommodates for a change of coordinates in the change of variables theorem.

Definition A.6. In local coordinates $(e^1, \cdots, e^d)$, the volume form of the Riemannian manifold $M$, equipped with the metric tensor $G$, is defined as: $dV_G = V_G(x)\, e^1 \wedge \cdots \wedge e^d$, with:

$$V_G(x) = \sqrt{|\det G|}.$$

Remark. The symbol $\wedge$ represents the wedge product and it is used to manipulate differential $k$-forms. Here, the basis vectors $(e^1, \cdots, e^d)$ form a $d$-dimensional parallelepiped $(e^1 \wedge \cdots \wedge e^d)$ with unit volume.
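The length and energy functionals above can be approximated on a discretised curve. The following is a minimal sketch in plain Python, not the Stochman implementation: the helper names `quad_form` and `length_and_energy`, the constant metric and the two test curves are all assumptions made for illustration. It uses the energy as written in Definition A.5 and evaluates the metric at segment midpoints.

```python
import math


def quad_form(G, v):
    """Return v^T G v for a small dense matrix G given as a list of rows."""
    n = len(v)
    return sum(v[i] * G[i][j] * v[j] for i in range(n) for j in range(n))


def length_and_energy(curve, metric, n=1000):
    """Approximate L = int_0^1 ||c'(t)||_G dt and E = int_0^1 ||c'(t)||_G^2 dt
    with finite differences on n segments; the metric is evaluated at midpoints."""
    dt = 1.0 / n
    length = energy = 0.0
    for k in range(n):
        a, b = curve(k * dt), curve((k + 1) * dt)
        mid = [(ai + bi) / 2.0 for ai, bi in zip(a, b)]
        vel = [(bi - ai) / dt for ai, bi in zip(a, b)]
        speed_sq = quad_form(metric(mid), vel)
        length += math.sqrt(speed_sq) * dt
        energy += speed_sq * dt
    return length, energy


# Hypothetical example: constant metric diag(4, 1) and a straight segment.
metric = lambda x: [[4.0, 0.0], [0.0, 1.0]]
line = lambda t: [t, t]              # constant-speed parametrisation
warped = lambda t: [t * t, t * t]    # same path, reparametrised by t -> t^2

L1, E1 = length_and_energy(line, metric)
L2, E2 = length_and_energy(warped, metric)
print("constant speed: length", L1, "energy", E1)
print("reparametrised: length", L2, "energy", E2)
```

Both parametrisations trace the same path and give the same length, while the energy is smaller for the constant-speed one. This is why minimising the energy, rather than the length, also pins down the parametrisation of the curve.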
Figure 9: Once the low dimensional manifold is equipped with a metric that captures the inherent structure of the high dimensional manifold, we can compute a geodesic $\gamma$ by minimising the energy functional between two points. The geodesic $f(\gamma)$ will be the shortest path between two points on the manifold $M$.

The volume measure $V_G$ can also be used to integrate functions over regions of the manifold, as we would do in the Euclidean space. It can also be linked to the density of the data: if the data points are uniformly distributed over the high-dimensional manifold, then in the low-dimensional manifold a low volume corresponds to a high density of data. It is a useful way to give more information about the distribution of the data.

A.3 Finsler manifolds

Finsler geometry is often described as an extension of Riemannian geometry, since the metric is defined in a more general way, lifting the quadratic constraint. In particular, the norm of a Riemannian metric is a Finsler metric, but the converse is not true.

Definition A.7. Let $F: TM \to \mathbb{R}_+$ be a continuous non-negative function defined on the tangent bundle $TM$ of a differentiable manifold $M$. We say that $F$ is a Finsler metric if, for each point $x$ of $M$ and $v$ in $T_xM$, we have:
1. Positive homogeneity: $\forall \lambda \in \mathbb{R}_+$, $F(\lambda v) = \lambda F(v)$.
2. Smoothness: $F$ is a $C^\infty$ function on the slit tangent bundle $TM \setminus \{0\}$.
3. Strong convexity criterion: the Hessian matrix $g_{ij}(v) = \frac{1}{2} \frac{\partial^2 F^2}{\partial v^i \partial v^j}(v)$ is positive definite for nonzero $v$.
A differentiable manifold $M$ equipped with a Finsler metric is called a Finsler manifold.

Here, it is worth noting that, for a given point in the manifold, the Finsler metric is defined with only one vector in the tangent space, while the Riemannian metric is defined with two vectors. Moreover, from the previous definition, we can deduce that the metric is:
1. Positive definite: for all $x \in M$ and $v \in T_xM$, $F(v) \ge 0$ and $F(v) = 0$ if and only if $v = 0$.
2.
Subadditive: $F(v + w) \le F(v) + F(w)$ for all $x \in M$ and $v, w \in T_xM$.

We say that $F$ is a Minkowski norm on each tangent space $T_xM$. Furthermore, if $F$ satisfies the reversibility property $F(v) = F(-v)$, it defines a norm on $T_xM$ in the usual sense. Similarly to Riemannian geometry, lengths, energies and volumes can be defined directly from the Finsler metric:

Definition A.8. We consider a curve $\gamma$ and its derivative $\dot\gamma$ on a Finsler manifold $M$ equipped with the metric $F$. We define the length of the curve as follows: $L_F(\gamma) = \int F(\dot\gamma(t)) \, dt$.

Figure 10: $f$ is an immersion that maps a low dimensional manifold to a high dimensional manifold $M$. On $M$, a tangent plane $T_xM$ is drawn at $x$. Compared to the Riemannian manifold, the Finsler indicatrix, which represents all the vectors $u \in T_xM$ such that $F_x(u) = 1$, is not necessarily an ellipse. It can be asymmetric if the metric is asymmetric itself. It is always convex.

Definition A.9. The curve energy is defined as: $E_F(\gamma) = \int F(\dot\gamma(t))^2 \, dt$.

Not only are the definitions strikingly similar, they also share the same properties. The curve length is also invariant under reparametrisation, and bounded above in terms of the curve energy. Computing geodesics on a manifold reduces to a variational optimisation problem. These propositions are proved in detail in Lemmas B.1.4 and B.1.5, in the appendix.

In Riemannian geometry, the volume measure defined by the metric is unique. In Finsler geometry, different definitions of the volume exist, and they all coincide with the Riemannian volume element when the metric is Riemannian. The most common choices of volume forms are the Busemann-Hausdorff measure and the Holmes-Thompson measure. According to Wu (2011), depending on the Finsler metric and the topological manifold, some choices seem more legitimate than others.
In this paper, we decided to focus only on the Busemann-Hausdorff volume, as its definition is the most commonly used and leads to easier derivations. We will later show that in high dimensions, our Finsler metric converges to a Riemannian metric, and thus the results obtained for the Busemann-Hausdorff volume measure are also valid for the Holmes-Thompson volume measure.

Definition A.10. For a given point $x$ on the manifold, we define the Finsler indicatrix as the set of vectors in the tangent space such that the Finsler metric is equal to one: $\{v \in T_xM \mid F(v) = 1\}$. We denote the Euclidean unit ball in $\mathbb{R}^d$ by $B^d(1)$ and for measurable subsets $S \subseteq \mathbb{R}^d$ we use $\mathrm{vol}(S)$ to denote the standard Euclidean volume of $S$. In local coordinates $(e^1, \cdots, e^d)$ on a Finsler manifold $M$, the Busemann-Hausdorff volume form is defined as $dV_F = V_F(x)\, e^1 \wedge \cdots \wedge e^d$, with:

$$V_F(x) = \frac{\mathrm{vol}(B^d(1))}{\mathrm{vol}(\{v \in T_xM \mid F(v) < 1\})}.$$

We can interpret the volume as the ratio between the volume of the Euclidean unit ball and the volume of the convex ball formed by the vectors whose Finsler norm is less than one. If the Finsler metric is replaced by a Riemannian metric, the indicatrix is an ellipsoid whose semi-axes are equal to the inverse of the square root of the metric's eigenvalues. The Finsler volume then reduces to the definition of the Riemannian volume.

A.4 Random manifolds

So far, we have only considered deterministic data points lying on a manifold. If we consider our data to be random variables, we will need to define the associated random metric and manifold.

[Figure 11 annotation: $\{f_t\} = \{f_1, f_2, f_3, \cdots\}$]

Figure 11: Usually the immersion $f$ would be deterministic. In the case of most generative models, where $f$ is described by a GP-LVM, or the decoder of a VAE, the immersion is stochastic. The pullback metric is stochastic de facto.
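As a small illustration of the Busemann-Hausdorff volume of Definition A.10 under a random metric, the sketch below works in dimension 2 and uses polar coordinates, in the spirit of the hyperspherical argument mentioned for the volume bounds. Everything specific here is an assumption made for the example: the helper names `bh_volume_2d` and `riemannian_norm`, the two matrices `G1` and `G2`, and the choice of a random metric taking each value with probability 1/2.

```python
import math


def bh_volume_2d(F, n=2000):
    """Busemann-Hausdorff volume element in dimension 2:
    vol(B^2(1)) / vol({v : F(v) < 1}).
    By positive homogeneity, the region {F < 1} has polar radius
    r(theta) = 1 / F(cos(theta), sin(theta)), so its area is the
    integral of r(theta)^2 / 2 over [0, 2*pi)."""
    h = 2.0 * math.pi / n
    area = 0.0
    for k in range(n):
        theta = k * h
        r = 1.0 / F((math.cos(theta), math.sin(theta)))
        area += 0.5 * r * r * h
    return math.pi / area


def riemannian_norm(G):
    """Return the norm v -> sqrt(v^T G v) for a 2x2 symmetric positive definite G."""
    return lambda v: math.sqrt(
        G[0][0] * v[0] * v[0] + 2.0 * G[0][1] * v[0] * v[1] + G[1][1] * v[1] * v[1]
    )


# Hypothetical random metric: G1 or G2, each with probability 1/2.
G1 = [[4.0, 1.0], [1.0, 3.0]]
G2 = [[1.0, 0.0], [0.0, 9.0]]
mean_G = [[(G1[i][j] + G2[i][j]) / 2.0 for j in range(2)] for i in range(2)]

n1, n2 = riemannian_norm(G1), riemannian_norm(G2)
finsler = lambda v: 0.5 * (n1(v) + n2(v))   # expected length E[sqrt(v^T G v)]
riemann = riemannian_norm(mean_G)            # sqrt(v^T E[G] v)

V_single = bh_volume_2d(n1)   # Riemannian case: should match sqrt(det G1)
V_F = bh_volume_2d(finsler)
V_R = bh_volume_2d(riemann)
print("single metric:", V_single, "sqrt(det G1):", math.sqrt(11.0))
print("Finsler volume:", V_F, "Riemannian volume:", V_R)
```

For a single Riemannian metric the numerical value agrees with $\sqrt{\det G}$, as stated after Definition A.10. For the two-valued random metric, the Finsler volume element does not exceed the Riemannian one, in line with the ordering $V_F \le V_R$ of the absolute bounds.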
As said previously, if we have a function $f: \mathbb{R}^q \to \mathbb{R}^D$ that parametrises a manifold, then we can construct a Riemannian metric $G = J_f^\top J_f$, with $J_f$ the Jacobian of the function $f$. In the previous cases, we assumed $f$ to be a deterministic function, and so the metric was deterministic too. We construct a stochastic Riemannian metric in the same way, with $f$ being a stochastic process. A stochastic process $f: \mathbb{R}^q \to \mathbb{R}^D$ is a random map in the sense that samples of the process are maps from $\mathbb{R}^q$ to $\mathbb{R}^D$ (the so-called sample paths of the process).

Definition A.11. A stochastic process $f: \mathbb{R}^q \to \mathbb{R}^D$ is smooth if the sample paths of $f$ are smooth. We call a smooth process $f$ a stochastic immersion if the Jacobian matrix of its sample paths has full rank everywhere. We can then define the stochastic Riemannian metric $G = J_f^\top J_f$.

The terms stochastic and random are used interchangeably. The definition of the stochastic immersion is fairly important, as it means that its Jacobian is full rank. Since the Jacobian is full rank, the random metric $G$ is positive definite, a necessary condition to define a Riemannian metric. Another definition of a stochastic Riemannian metric would be the following:

Definition A.12. A stochastic Riemannian metric on $\mathbb{R}^q$ is a matrix-valued random field on $\mathbb{R}^q$ whose sample paths are Riemannian metrics. A stochastic manifold is a differentiable manifold equipped with a stochastic Riemannian metric.

Any matrix drawn from this stochastic metric would be a proper Riemannian metric. When using the random Riemannian metric on two vectors $u, v \in T_xM$, $G(u, v) = u^\top G v$ is a random variable, but both $u$ and $v$ are deterministic vectors. From this definition, it follows that the length, the energy and the volume are also random variables.
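The construction above can be simulated directly. The following is a minimal Monte Carlo sketch, assuming a Jacobian with independent Gaussian rows as in the Gaussian process setting of Proposition 3.3. It estimates the Finsler norm $\mathbb{E}[\sqrt{v^\top G v}]$ and the Riemannian norm $\sqrt{v^\top \mathbb{E}[G] v}$ and compares their relative gap with the bound of that proposition. The helper names, the mean Jacobian, the covariance factor and the vector `v` are hypothetical values chosen for the example.

```python
import math
import random


def sample_jacobian(mean, chol, rng):
    """Draw J (D x q) with independent rows J_i ~ N(mean_i, Sigma), Sigma = chol chol^T."""
    q = len(chol)
    J = []
    for mu in mean:
        eps = [rng.gauss(0.0, 1.0) for _ in range(q)]
        J.append([mu[a] + sum(chol[a][b] * eps[b] for b in range(q)) for a in range(q)])
    return J


def monte_carlo_norms(mean, chol, v, n_samples=20000, seed=0):
    """Estimate the Finsler norm E[sqrt(v^T G v)] and the Riemannian norm sqrt(v^T E[G] v)."""
    rng = random.Random(seed)
    q = len(v)
    sum_len = sum_sq = 0.0
    for _ in range(n_samples):
        J = sample_jacobian(mean, chol, rng)
        # v^T G v with G = J^T J is the squared Euclidean norm of J v.
        sq = sum(sum(row[a] * v[a] for a in range(q)) ** 2 for row in J)
        sum_len += math.sqrt(sq)
        sum_sq += sq
    return sum_len / n_samples, math.sqrt(sum_sq / n_samples)


# Hypothetical setting: D = 3 output dimensions, q = 2 latent dimensions.
mean = [[1.0, 0.0], [0.0, 0.5], [0.5, 0.5]]   # E[J], one row per output dimension
chol = [[1.0, 0.0], [0.5, 0.8]]               # lower-triangular factor of Sigma
v = [1.0, -1.0]

D, q = len(mean), len(v)
Ltv = [sum(chol[a][b] * v[a] for a in range(q)) for b in range(q)]    # chol^T v
sigma = sum(x * x for x in Ltv)                                        # v^T Sigma v
omega = sum(sum(mu[a] * v[a] for a in range(q)) ** 2 for mu in mean) / sigma
bound = 1.0 / (D + omega) + omega / (D + omega) ** 2

norm_F, norm_R = monte_carlo_norms(mean, chol, v)
gap = (norm_R - norm_F) / norm_R
print("Finsler norm:", norm_F, "Riemannian norm:", norm_R)
print("closed-form Riemannian norm:", math.sqrt(sigma * (D + omega)))
print("relative gap:", gap, "bound:", bound)
```

The Riemannian norm has the closed form $\sqrt{\sigma (D + \omega)}$, which the sample estimate should approach. The estimated Finsler norm is never larger than the estimated Riemannian norm, since Jensen's inequality also holds for the empirical distribution of the samples.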
| Object | Riemann | Finsler |
|---|---|---|
| metric | $g: T_xM \times T_xM \to \mathbb{R}$ | $F: TM \to \mathbb{R}_+$ |
| length structure | $L_G(\gamma) = \int \sqrt{g_t(\dot\gamma(t), \dot\gamma(t))} \, dt$ | $L_F(\gamma) = \int F(\dot\gamma(t)) \, dt$ |
| energy structure | $E_G(\gamma) = \int g_t(\dot\gamma(t), \dot\gamma(t)) \, dt$ | $E_F(\gamma) = \int F(\dot\gamma(t))^2 \, dt$ |
| volume element | $V_G(x) = \sqrt{\lvert \det G \rvert}$ | $V_F(x) = \mathrm{vol}(B^d(1)) / \mathrm{vol}(\{v \in T_xM \mid F(x, v) < 1\})$ (Busemann-Hausdorff volume measure) |

Table 1: Comparison of Riemannian and Finsler metrics.

One of the main challenges of this paper is to find coherent notations while respecting the traditions of two geometric fields. In Riemannian geometry, we principally use a metric, noted $g_p: T_pM \times T_pM \to \mathbb{R}$, that is defined as an inner product and thus can induce a norm, but is not a norm. In Finsler geometry, we interchangeably call Finsler function, Finsler metric or Finsler norm the norm traditionally noted $F: TM \to \mathbb{R}_+$, with $F_p(u) := F(p, u)$ defined at a point $p \in M$ for a vector $u \in T_pM$. We will assume that all our metrics are always defined for a specific point $z$ (or $p$) on our manifold $\mathcal{Z}$ (or $M$), and so we will just drop this index. The following notations will be used:

Stochastic pullback metric tensor: $G = J_f^\top J_f$
Stochastic pullback metric: $g: (u, v) \mapsto u^\top G v$
Expected Riemannian metric: $g: (u, v) \mapsto u^\top \mathbb{E}[G] v$
Stochastic pullback induced norm: $\|\cdot\|_G: u \mapsto \sqrt{u^\top G u}$
Expected Riemannian induced norm: $\|\cdot\|_R: u \mapsto \sqrt{u^\top \mathbb{E}[G] u} := \sqrt{g(u, u)}$
Finsler metric: $\|\cdot\|_F: u \mapsto \mathbb{E}[\sqrt{u^\top G u}] := F(u)$

B.1 Finslerian geometry of the expected length

In this section, we will always let $f: \mathbb{R}^q \to \mathbb{R}^D$ be a stochastic immersion, $J_f$ its Jacobian, and $G = J_f^\top J_f$ a metric tensor. We will first prove that the function $F: TM \to \mathbb{R}: v \mapsto \mathbb{E}\left[\sqrt{v^\top G v}\right]$ is a Finsler metric. Then, for the specific case where $J_f$ follows a non-central normal distribution, the stochastic length $\sqrt{v^\top G v}$ follows a non-central Nakagami distribution, so the Finsler metric $F$, defined as its expectation, can be expressed in closed form.
To prove that the function $F$ is indeed a Finsler metric, we need to verify the criteria above. Among them, the strong convexity criterion is less trivial to prove than the others; it is detailed in Lemma B.1.3. Strong convexity means that the Hessian matrix $\frac{1}{2}\mathrm{Hess}(F(v)^2) = \frac{1}{2}\frac{\partial^2 F^2}{\partial v_i \partial v_j}(v)$ is strictly positive definite for any non-zero $v$. This matrix, when $F$ is a Finsler function, is also called the fundamental form and plays an important role in Finsler geometry. To prove the strong convexity criterion, we need the full expression of the fundamental form, detailed in Lemma B.1.1.

Lemma B.1.1. The Hessian matrix $\frac{1}{2}\mathrm{Hess}(F(v)^2)$ of the function $F(v) = \mathbb{E}\big[\sqrt{v^\top G v}\big]$ is given by
$$\tfrac{1}{2}\mathrm{Hess}(F(v)^2) = \mathbb{E}\big[(v^\top G v)^{\frac12}\big]\, \mathbb{E}\big[(v^\top G v)^{-\frac12} G - (v^\top G v)^{-\frac32} G v v^\top G\big] + \mathbb{E}\big[(v^\top G v)^{-\frac12} G v\big]\, \mathbb{E}\big[(v^\top G v)^{-\frac12} G v\big]^\top.$$

Proof. Let $G$ be a random positive definite symmetric matrix and define $g : \mathbb{R}^q \to \mathbb{R} : v \mapsto \sqrt{v^\top G v}$, where $v$ is considered a column vector. We would like to know the derivatives of $g$ with respect to $v$. We name $J_g$ and $H_g$ its Jacobian and Hessian matrix. Using the chain rule, we have:
$$J_g = (v^\top G v)^{-\frac12}\, v^\top G \quad\text{and}\quad H_g = (v^\top G v)^{-\frac12} G - (v^\top G v)^{-\frac32} G v v^\top G.$$

For the rest of the proof, we need derivatives and expectations to commute, which can be shown using the Fubini theorem. For $F : \mathbb{R}^q \to \mathbb{R} : v \mapsto \mathbb{E}\big[\sqrt{v^\top G v}\big] = \mathbb{E}[g(v)]$, we then have:
$$\mathrm{Hess}(F) = \mathbb{E}[H_g] = \mathbb{E}\big[(v^\top G v)^{-\frac12} G - (v^\top G v)^{-\frac32} G v v^\top G\big], \qquad \nabla F = \mathbb{E}[J_g]^\top = \mathbb{E}\big[(v^\top G v)^{-\frac12} G v\big].$$

We now consider the function $h : \mathbb{R}^q \to \mathbb{R} : v \mapsto \mathbb{E}\big[\sqrt{v^\top G v}\big]^2 = F(v)^2$. Using the chain rule and changing the order of expectation and derivatives, its Hessian is
$$H_h = 2F\,\mathrm{Hess}[F] + 2\nabla F \nabla F^\top = 2\,\mathbb{E}[g]\,\mathbb{E}[H_g] + 2\,\mathbb{E}[J_g]^\top \mathbb{E}[J_g].$$

Finally, replacing $J_g$ and $H_g$ by the expressions previously obtained, we conclude:
$$\tfrac{1}{2} H_h(x, v) = \mathbb{E}\big[(v^\top G v)^{\frac12}\big]\, \mathbb{E}\big[(v^\top G v)^{-\frac12} G - (v^\top G v)^{-\frac32} G v v^\top G\big] + \mathbb{E}\big[(v^\top G v)^{-\frac12} G v\big]\, \mathbb{E}\big[(v^\top G v)^{-\frac12} G v\big]^\top.$$

Remark.
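Before going further, it is important to note that $G = J_f^\top J_f$ is a random matrix that is positive definite: it is symmetric by definition and has full rank. The latter statement is justified by the assumption that the stochastic process $f : \mathbb{R}^q \to \mathbb{R}^D$ is an immersion, so $J_f$ has full rank.

Lemma B.1.2. The function $F(v) = \mathbb{E}\big[\sqrt{v^\top G v}\big]$ is:
1. positive homogeneous: $\forall \lambda \in \mathbb{R}_+$, $F(\lambda v) = \lambda F(v)$
2. smooth: $F(v)$ is a $C^\infty$ function on the slit tangent bundle $TM \setminus \{0\}$

Proof. 1) Let $\lambda \in \mathbb{R}$, then we have: $F(\lambda v) = \mathbb{E}\big[\sqrt{\lambda^2 v^\top G v}\big] = |\lambda|\, \mathbb{E}\big[\sqrt{v^\top G v}\big] = |\lambda| F(v)$.
2) The multivariate function $\mathbb{R}^q \setminus \{0\} \to \mathbb{R}_+^* : v \mapsto v^\top G v$ is $C^\infty$ and strictly positive, since $G = J_f^\top J_f$ is positive definite. The function $\mathbb{R}_+^* \to \mathbb{R}_+^* : x \mapsto \sqrt{x}$ is also $C^\infty$. Finally, $\mathbb{R}_+^* \to \mathbb{R}_+^* : x \mapsto \mathbb{E}[x]$ is by definition differentiable. By composition, $F(v)$ is a $C^\infty$ function on the slit tangent bundle $TM \setminus \{0\}$.

Lemma B.1.3. The function $F(v) = \mathbb{E}\big[\sqrt{v^\top G v}\big]$ satisfies the strong convexity criterion.

Proof. Proving that $F$ satisfies the strong convexity criterion is equivalent to showing that the Hessian matrix $H = \frac{1}{2}\mathrm{Hess}(F(v)^2)$ is strictly positive definite. Thus, we need to prove that $\forall w \in \mathbb{R}^q \setminus \{0\}$, $w^\top H w > 0$. According to Lemma B.1.1, because the expectation is a positive function, it is straightforward to see that $\forall w \in \mathbb{R}^q \setminus \{0\}$, $w^\top H w \geq 0$. The tricky part of this proof is to show that $w^\top H w > 0$. This can be obtained if one of the terms ($F\,\mathrm{Hess}(F)$ or $\nabla F \nabla F^\top$) is strictly positive.

First, let us decompose $H$ as the sum of matrices $H = F\,\mathrm{Hess}(F) + \nabla F \nabla F^\top$ (Lemma B.1.1), with:
$$F\,\mathrm{Hess}(F) = \mathbb{E}\big[(v^\top G v)^{\frac12}\big]\, \mathbb{E}\Big[(v^\top G v)^{-\frac32}\big((v^\top G v)\, G - G v (G v)^\top\big)\Big], \qquad \nabla F \nabla F^\top = \mathbb{E}\big[(v^\top G v)^{-\frac12} G v\big]\, \mathbb{E}\big[(v^\top G v)^{-\frac12} G v\big]^\top.$$

We will study two cases: when $w \in \mathrm{span}(v)$, and when $w \notin \mathrm{span}(v)$. We will always assume that $v \neq 0$, and so by definition $F(v) > 0$.

Let $w \in \mathrm{span}(v)$. We will show that $w^\top \nabla F \nabla F^\top w > 0$. We have $w = \alpha v$ with $\alpha \in \mathbb{R}$, and $\alpha \neq 0$ since $w \neq 0$. Because $F$ is 1-homogeneous, Euler's theorem gives $\nabla F(v)^\top v = F(v)$. Then $(\alpha v)^\top \nabla F \nabla F^\top (\alpha v) = \alpha^2 F(v)^2$, and $\alpha^2 F(v)^2 > 0$.

Let $w \notin \mathrm{span}(v)$. $F$ being a scalar function, we have $w^\top F\,\mathrm{Hess}[F]\, w = F\, w^\top \mathrm{Hess}[F]\, w$. We would like to show that $w^\top \mathrm{Hess}[F]\, w > 0$.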
The strategy is the following: if we prove that the kernel of $\mathrm{Hess}[F]$ is equal to $\mathrm{span}(v)$, then $w \notin \mathrm{span}(v)$ is equivalent to $w \notin \ker(\mathrm{Hess}[F])$, and we can conclude that $w^\top \mathrm{Hess}[F]\, w > 0$.

Let us prove that $\mathrm{span}(v) = \ker(\mathrm{Hess}(F))$. We know that $\mathrm{Hess}(F)\, v = 0$, since $F$ is 1-homogeneous, so we have $\mathrm{span}(v) \subset \ker(\mathrm{Hess}(F))$. To obtain the equality, we just need to prove that the dimension of the kernel is equal to 1. Let $z \in \mathrm{span}(G v)^\perp$, that is, $(G v)^\top z = 0$. We have $\dim(\mathrm{span}(G v)) = 1$, and thus $\dim(\mathrm{span}(G v)^\perp) = q - 1$. Furthermore, $z^\top \mathrm{Hess}[F]\, z = z^\top \mathbb{E}\big[G (v^\top G v)^{-\frac12}\big] z > 0$, so we can deduce that $\dim(\mathrm{im}(\mathrm{Hess}[F])) = q - 1$. Using the rank-nullity theorem, we conclude that $\dim(\ker(\mathrm{Hess}(F))) = q - \dim(\mathrm{im}(\mathrm{Hess}[F])) = 1$, which concludes this case.

In conclusion, $\forall w \in \mathbb{R}^q \setminus \{0\}$, $w^\top \frac{1}{2}\mathrm{Hess}(F(v)^2)\, w > 0$. The function $F$ satisfies the strong convexity criterion.

Proposition 2.2. Let $G$ be a stochastic Riemannian metric tensor. Then, the function $F_z : T_zZ \to \mathbb{R} : u \mapsto \|u\|_F$ defines a Finsler metric, but it is not induced by a Riemannian metric.

Proof. Suppose that $F$ were induced by a Riemannian metric: $F : \mathbb{R}^q \times \mathbb{R}^q \to \mathbb{R} : (v_1, v_2) \mapsto \mathbb{E}\big[\sqrt{v_1^\top G v_2}\big]$. If $F$ were a Riemannian metric, then it would be bilinear, which is clearly not the case. Thus, $F$ is not a Riemannian metric. According to Lemma B.1.2 and Lemma B.1.3, $F$ is a Finsler metric.

Proposition 2.3. Let $f$ be a Gaussian process and $J$ its Jacobian, with $J \sim \mathcal{N}(\mathbb{E}[J], \Sigma)$. The Finsler norm can be written as:
$$F_z : T_zZ \to \mathbb{R}_+ : v \mapsto \|v\|_F := \sqrt{2}\, \sqrt{v^\top \Sigma v}\; \frac{\Gamma\big(\frac{D}{2} + \frac{1}{2}\big)}{\Gamma\big(\frac{D}{2}\big)}\; {}_1F_1\Big(-\frac{1}{2}, \frac{D}{2}, -\frac{\omega}{2}\Big),$$
with ${}_1F_1$ the confluent hypergeometric function of the first kind and $\omega = (v^\top \Sigma v)^{-1} (v^\top \mathbb{E}[J]^\top \mathbb{E}[J] v)$.

Proof. The objective of the proof is to show that, if the Jacobian $J_f$ follows a non-central normal distribution, then, $\forall v \in \mathbb{R}^q$, the random length $\sqrt{v^\top J_f^\top J_f v}$ follows a non-central Nakagami distribution, whose expectation is the Finsler norm. This is a particular case of the derivation of moments of non-central Wishart distributions, previously shown and studied by Kent & Muirhead (1984); Hauberg (2018b).
By hypothesis, $J_f$ follows a non-central normal distribution: $J_f \sim \mathcal{N}(\mathbb{E}[J], I_D \otimes \Sigma)$. Then, $G = J_f^\top J_f$ follows a non-central Wishart distribution: $G \sim W_q(D, \Sigma, \Sigma^{-1} \mathbb{E}[J]^\top \mathbb{E}[J])$. According to (Kent & Muirhead, 1984, Theorem 10.3.5.), $v^\top G v$ also follows a non-central Wishart distribution: $v^\top G v \sim W_1(D, v^\top \Sigma v, \omega)$, with $\omega = (v^\top \Sigma v)^{-1} (v^\top \mathbb{E}[J]^\top \mathbb{E}[J] v)$.

To compute $\mathbb{E}\big[\sqrt{v^\top G v}\big]$, we look at the derivation of moments. (Kent & Muirhead, 1984, Theorem 10.3.7.) states that if $X \sim W_q(D, \Sigma, \Omega)$, with $q \leq D$, then
$$\mathbb{E}\big[(\det X)^k\big] = (\det \Sigma)^k\, 2^{qk}\, \frac{\Gamma_q\big(\frac{D}{2} + k\big)}{\Gamma_q\big(\frac{D}{2}\big)}\; {}_1F_1\Big(-k, \frac{D}{2}, -\frac{1}{2}\Omega\Big).$$
We directly apply the theorem to our case, knowing that $v^\top G v$ is a scalar term, so $\det(v^\top G v) = v^\top G v$, $q = 1$, and $k = \frac{1}{2}$.

Lemma B.1.4. The length of a curve using a Finsler metric is invariant under reparametrisation.

Proof. The proof is similar to the one obtained on a Riemannian manifold (Lee (2013), Proposition 13.25), where we make use of the homogeneity property of the Finsler metric. Let $(M, F)$ be a Finsler manifold and $\gamma : [a, b] \to M$ a piecewise smooth curve segment. We call $\tilde\gamma$ a reparametrisation of $\gamma$, such that $\tilde\gamma = \gamma \circ \varphi$ with $\varphi : [c, d] \to [a, b]$ a diffeomorphism. We want to show that $L_F(\gamma) = L_F(\tilde\gamma)$:
$$L_F(\tilde\gamma) = \int_c^d F(\dot{\tilde\gamma}(t))\, dt = \int_c^d F\Big(\frac{d}{dt}(\gamma \circ \varphi)(t)\Big)\, dt = \int_{\varphi^{-1}(a)}^{\varphi^{-1}(b)} |\dot\varphi(t)|\, F(\dot\gamma \circ \varphi(t))\, dt = \int_a^b F(\dot\gamma(t))\, dt = L_F(\gamma).$$

Lemma B.1.5. If a curve globally minimizes its energy on a Finsler manifold, then it also globally minimizes its length and the Finsler function $F$ of the velocity vector along the curve is constant.

Proof. The curve energy and the curve length are defined as $E_F(\gamma) = \int_0^1 F^2(\dot\gamma(t))\, dt$ and $L_F(\gamma) = \int_0^1 F(\dot\gamma(t))\, dt$, with $\gamma : [0, 1] \to \mathbb{R}^d$. Let us define two real-valued functions $f : \mathbb{R} \to \mathbb{R} : t \mapsto F(\dot\gamma(t))$ and $g : \mathbb{R} \to \mathbb{R} : t \mapsto 1$. Applying the Cauchy-Schwarz inequality, we directly obtain:
$$\Big(\int_0^1 F(\dot\gamma(t))\, dt\Big)^2 \leq \int_0^1 F(\dot\gamma(t))^2\, dt \int_0^1 1^2\, dt,$$
which means: $L_F(\gamma)^2 \leq E_F(\gamma)$.
The equality is obtained exactly when the functions $f$ and $g$ are proportional, hence when the Finsler function is constant along the curve.

B.2 Comparison of Riemannian and Finsler metrics

We have defined both a Riemannian metric ($g : (v_1, v_2) \mapsto v_1^\top \mathbb{E}[G] v_2$) and a Finsler metric ($F : (x, v) \mapsto \mathbb{E}\big[\sqrt{v^\top G v}\big]$), in the hope of computing the average length between two points on a random manifold created by the random field $f$, with $G = J_f^\top J_f$. The main idea of this section is to compare those two metrics and to see to what extent they differ in terms of length, energy and volume.

From now on, $f : \mathbb{R}^q \to \mathbb{R}^D$ will always be defined as a stochastic non-central Gaussian process. Its Jacobian $J_f$ also follows a non-central Gaussian distribution, $G = J_f^\top J_f$ a non-central Wishart distribution, and $\sqrt{v^\top G v}$ a non-central Nakagami distribution (Proposition 2.3). The Finsler metric $F : (x, v) \mapsto \mathbb{E}\big[\sqrt{v^\top G v}\big]$ can then be written in closed form.

In section B.2.1, we will see that the Finsler metric is upper and lower bounded by two Riemannian tensors (Proposition 3.1), and we can deduce an upper and lower bound for the length, the energy and the volume (Corollary 3.1). Then, in section B.2.2, we will show that the relative difference between the Finsler norm and the Riemannian induced norm is always positive and upper bounded by a term that is inversely proportional to the number of dimensions $D$ (Proposition 3.3). Similarly, we will deduce the same for the length, the energy and the volume (Corollary 3.2). From these last results, we can directly conclude in section B.2.3 that both metrics are equal in high dimensions (Corollary 3.4). A possible interpretation is that in high dimensions the data distribution obtained on those manifolds becomes more and more concentrated around the mean, reducing the variance term to zero. The manifold becoming deterministic, both metrics become equal.

Remark. Most of the following proofs are somewhat technical, as they rely on the closed form expression of the non-central Nakagami distribution.
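Once the main propositions are proved, obtaining the corollaries is straightforward. While we do not have a closed form expression of the indicatrix, we will show that it is a monotonic function which can be upper and lower bounded.

B.2.1 Bounds on the Finsler metric

Proposition 3.1. We define $\alpha = 2 \Big(\frac{\Gamma(\frac{D}{2} + \frac{1}{2})}{\Gamma(\frac{D}{2})}\Big)^2$. The Finsler norm $\|\cdot\|_F$ is bounded by two norms, $\|\cdot\|_{\alpha\Sigma}$ and $\|\cdot\|_R$, induced by the two respective Riemannian metric tensors: the covariance tensor $\alpha\Sigma_z$ and the expected metric tensor $\mathbb{E}[G_z]$.
$$\forall (z, v) \in Z \times T_zZ : \quad \|v\|_{\alpha\Sigma} \leq \|v\|_F \leq \|v\|_R.$$

Proof. Let us first recall that the Finsler function can be written as:
$$\|v\|_F := F(v) = \sqrt{2}\, \sqrt{v^\top \Sigma v}\; \frac{\Gamma\big(\frac{D}{2} + \frac{1}{2}\big)}{\Gamma\big(\frac{D}{2}\big)}\; {}_1F_1\Big(-\frac{1}{2}, \frac{D}{2}, -\frac{\omega}{2}\Big).$$
The confluent hypergeometric function is defined as ${}_1F_1(a, b, z) = \sum_{k=0}^{\infty} \frac{(a)_k}{(b)_k} \frac{z^k}{k!}$, with $(a)_k$ and $(b)_k$ being the Pochhammer symbols. Note that, despite their confusing notation, they are defined as rising factorials. By definition, we have $\frac{(a)_k}{(b)_k} = \frac{\Gamma(a + k)}{\Gamma(b + k)} \frac{\Gamma(b)}{\Gamma(a)}$. We can use the Kummer transformation to obtain ${}_1F_1(a, b, -z) = e^{-z}\, {}_1F_1(b - a, b, z)$. Replacing $a = -\frac{1}{2}$, $b = \frac{D}{2}$ and $z = \frac{1}{2}\omega$, we finally get:
$$F(v) = \sqrt{2 v^\top \Sigma v}\; e^{-z} \sum_{k=0}^{\infty} \frac{\Gamma\big(\frac{D}{2} + \frac{1}{2} + k\big)}{\Gamma\big(\frac{D}{2} + k\big)} \frac{z^k}{k!}.$$

1) Let us show that $\forall v \in T_xM : \sqrt{v^\top \alpha\Sigma v} \leq F(v)$, with $\alpha = 2 \Big(\frac{\Gamma(\frac{D}{2} + \frac{1}{2})}{\Gamma(\frac{D}{2})}\Big)^2$.
The Pochhammer symbol is defined as $(x)_k = x(x + 1) \dots (x + k - 1) = \frac{\Gamma(x + k)}{\Gamma(x)}$. For $x \in \mathbb{R}_+^*$, we have $(x + \frac{1}{2})_k \geq (x)_k$. Thus, $\frac{\Gamma(\frac{D}{2} + \frac{1}{2} + k)}{\Gamma(\frac{D}{2} + \frac{1}{2})} \geq \frac{\Gamma(\frac{D}{2} + k)}{\Gamma(\frac{D}{2})}$. The Gamma function being strictly positive on $\mathbb{R}_+$, we obtain:
$$\frac{\Gamma\big(\frac{D}{2} + \frac{1}{2} + k\big)}{\Gamma\big(\frac{D}{2} + k\big)} \geq \frac{\Gamma\big(\frac{D}{2} + \frac{1}{2}\big)}{\Gamma\big(\frac{D}{2}\big)}, \quad\text{and since } e^{-z}\sum_{k=0}^{\infty} \frac{z^k}{k!} = 1: \quad \sqrt{v^\top \alpha\Sigma v} \leq F(v).$$

2) Let us show that $\forall v \in T_xM : F(v) \leq \sqrt{v^\top \mathbb{E}[G] v}$.
Wendel (1948) proved: $\frac{\Gamma(x + y)}{\Gamma(x)} \leq x^y$, for $x > 0$ and $y \in [0, 1]$. With $x = \frac{D}{2} + k$ and $y = \frac{1}{2}$, we obtain $\frac{\Gamma(\frac{D}{2} + k + \frac{1}{2})}{\Gamma(\frac{D}{2} + k)} \leq \sqrt{\frac{D}{2} + k}$, which leads to:
$$F(v) \leq \sqrt{2 v^\top \Sigma v}\; e^{-z} \sum_{k=0}^{\infty} \sqrt{\frac{D}{2} + k}\; \frac{z^k}{k!}.$$
Furthermore, $\sum_{k=0}^{\infty} e^{-z} \frac{z^k}{k!} = 1$ and the function $x \mapsto \sqrt{\frac{D}{2} + x}$ is concave. Then, by Jensen's inequality:
$$e^{-z} \sum_{k=0}^{\infty} \sqrt{\frac{D}{2} + k}\; \frac{z^k}{k!} \leq \sqrt{\frac{D}{2} + e^{-z} \sum_{k=0}^{\infty} \frac{z^k}{k!}\, k}.$$
Knowing that $\sum_{k=0}^{\infty} \frac{z^k}{k!}\, k = z e^z$, we have $e^{-z} \sum_{k=0}^{\infty} \sqrt{\frac{D}{2} + k}\, \frac{z^k}{k!} \leq \sqrt{\frac{D}{2} + z}$. And with $z = \frac{\Omega}{2}$, we obtain $F(v) \leq \sqrt{v^\top \Sigma (D + \Omega) v}$. From (Kent & Muirhead, 1984, p. 442), the expectation of a non-central Wishart distribution ($G \sim W_q(D, \Sigma, \Omega)$) is $\mathbb{E}[G] = D\Sigma + \Sigma\Omega$. This finally leads to:
$$F(v) \leq \sqrt{v^\top \mathbb{E}[G] v}.$$

Remark.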
As a side note, the second part of the inequality, $F(v) \leq \sqrt{v^\top \mathbb{E}[G] v}$, can be obtained directly using Proposition 3.2.

Corollary 3.1. The length, the energy and the Busemann-Hausdorff volume of the Finsler metric are bounded respectively by the Riemannian length, energy and volume of the covariance tensor $\alpha\Sigma$ (noted $L_{\alpha\Sigma}, E_{\alpha\Sigma}, V_{\alpha\Sigma}$) and the expected metric $\mathbb{E}[G]$ (noted $L_R, E_R, V_R$):
$$\forall z \in Z, \quad L_{\alpha\Sigma}(z) \leq L_F(z) \leq L_R(z), \qquad E_{\alpha\Sigma}(z) \leq E_F(z) \leq E_R(z), \qquad V_{\alpha\Sigma}(z) \leq V_F(z) \leq V_R(z).$$

Proof. From Proposition 3.1, we have $\forall (x, v) \in M \times T_xM : \sqrt{h(v)} \leq F(v) \leq \sqrt{g(v)}$, with $h : v \mapsto v^\top \alpha\Sigma v$ and $g : v \mapsto v^\top \mathbb{E}[G] v$ Riemannian metrics. We also define the parametric curve: $\forall t \in \mathbb{R}$, $\gamma(t) = x$ and $\dot\gamma(t) = v$.

1) Let us show that $L_{\alpha\Sigma}(x) \leq L_F(x) \leq L_R(x)$. Because of the monotonicity of the Lebesgue integral, we directly have: $\int \sqrt{h(\dot\gamma(t))}\, dt \leq \int F(\dot\gamma(t))\, dt \leq \int \sqrt{g(\dot\gamma(t))}\, dt$.

2) Let us show that $E_{\alpha\Sigma}(x) \leq E_F(x) \leq E_R(x)$. Since all the functions are positive, we can raise them to the power two, and again, with the monotonicity of the Lebesgue integral, we have: $\int h(\dot\gamma(t))\, dt \leq \int F^2(\gamma(t), \dot\gamma(t))\, dt \leq \int g(\dot\gamma(t))\, dt$.

3) Let us show that $V_{\alpha\Sigma}(x) \leq V_F(x) \leq V_R(x)$. We write the vectors $v \in T_xM$ in hyperspherical coordinates: $v = r e$, with $r = \|v\|$ the radial distance and $e$ the angular coordinates. With $v = r e$, we have:
$$r\sqrt{h(e)} \leq r F(e) \leq r \sqrt{g(e)} \quad\Longrightarrow\quad \sqrt{h(e)}^{-1} \geq F(e)^{-1} \geq \sqrt{g(e)}^{-1}.$$
We want to identify an inequality between the volumes of the indicatrices, noted $\mathrm{vol}(I_h)$, $\mathrm{vol}(I_g)$, $\mathrm{vol}(I_F)$, formed by the functions $h$, $g$ and $F$. Let us define the radii $r_h$, $r_g$, $r_F$ such that $r_h \sqrt{h(e)} = r_g \sqrt{g(e)} = r_F F(e) = 1$. For every angular coordinate $e$, we obtain $r_h \geq r_F \geq r_g$. Intuitively, this means that the Finsler indicatrix will always be bounded by the indicatrices formed by $h$ and $g$.

The Busemann-Hausdorff volume of a function $f$ is defined as $\sigma_B(f) = \mathrm{vol}(B^n(1)) / \mathrm{vol}(I_f)$, with $\mathrm{vol}(B^n(1))$ the volume of the unit ball and $\mathrm{vol}(I_f)$ the volume of the indicatrix formed by $f$. The previous inequality and the definition of the Busemann-Hausdorff volume imply that:
$$\mathrm{vol}(I_h) \geq \mathrm{vol}(I_F) \geq \mathrm{vol}(I_g) \quad\Longrightarrow\quad \sigma_B(h) \leq \sigma_B(F) \leq \sigma_B(g).$$
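The functions $g$ and $h$ being Riemannian, we have $\sigma_B(h) = \sqrt{\det(\alpha\Sigma)}$ and $\sigma_B(g) = \sqrt{\det(\mathbb{E}[G])}$, which concludes the proof.

B.2.2 Relative bounds between the Finsler and the Riemannian metric

Proposition 3.2. Let $f$ be a stochastic immersion. $f$ induces the stochastic norm $\|\cdot\|_G$, defined in Section 2. The relative difference between the Finsler norm $\|\cdot\|_F$ and the Riemannian norm $\|\cdot\|_R$ is bounded as:
$$0 \leq \frac{\|v\|_R - \|v\|_F}{\|v\|_R} \leq \frac{\mathrm{Var}\big[\|v\|_G^2\big]}{2\, \mathbb{E}\big[\|v\|_G^2\big]^2}.$$

Proof. We directly use a sharpened version of Jensen's inequality obtained by Liao & Berg (2019): Let $X$ be a one-dimensional random variable with mean $\mu$ and $P(X \in (a, b)) = 1$, where $-\infty \leq a < b \leq +\infty$. Let $\varphi$ be a twice differentiable function on $(a, b)$. We further define:
$$h(x, \mu) = \frac{\varphi(x) - \varphi(\mu)}{(x - \mu)^2} - \frac{\varphi'(\mu)}{x - \mu}.$$
Then:
$$\inf_{x \in (a, b)}\{h(x, \mu)\}\, \mathrm{Var}[X] \leq \mathbb{E}[\varphi(X)] - \varphi(\mathbb{E}[X]) \leq \sup_{x \in (a, b)}\{h(x, \mu)\}\, \mathrm{Var}[X].$$
In our case, we choose $\varphi : z \mapsto \sqrt{z}$, with $z$ a one-dimensional random variable defined as $z = v^\top G v$, $a = 0$, $b = +\infty$ and $\mu = \mathbb{E}[z]$. We have $h(z, \mu) = (\sqrt{z} - \sqrt{\mu})(z - \mu)^{-2} - \big(2 (z - \mu) \sqrt{\mu}\big)^{-1}$. Because its first derivative $\varphi'$ is convex, the function $x \mapsto h(x, \mu)$ is monotonically increasing. Thus:
$$\inf_{z \in (0, +\infty)}\{h(z, \mu)\} = \lim_{z \to 0} h(z, \mu) = -\frac{\sqrt{\mu}}{2\mu^2} \quad\text{and}\quad \sup_{z \in (0, +\infty)}\{h(z, \mu)\} = \lim_{z \to +\infty} h(z, \mu) = 0.$$
It finally gives:
$$-\frac{\sqrt{\mu}}{2\mu^2}\, \mathrm{Var}[z] \leq \mathbb{E}[\sqrt{z}] - \sqrt{\mathbb{E}[z]} \leq 0.$$
Replacing $\|v\|_F := F(v) = \mathbb{E}[\sqrt{z}]$ and $\|v\|_R := \sqrt{\mathbb{E}[z]} = \sqrt{\mu}$, and dividing by $\sqrt{\mu}$, concludes the proof.

Lemma B.2.1. Let $z \sim W_1(D, \sigma, \omega)$ follow a one-dimensional non-central Wishart distribution. Then:
$$\frac{\mathrm{Var}[z]}{2\, \mathbb{E}[z]^2} = \frac{1}{D + \omega} + \frac{\omega}{(D + \omega)^2}.$$

Proof. (Kent & Muirhead, 1984, Theorem 10.3.7.) states that if $z \sim W_1(D, \sigma, \omega)$ then $\mathbb{E}[z^k] = \sigma^k 2^k \frac{\Gamma(\frac{D}{2} + k)}{\Gamma(\frac{D}{2})}\, {}_1F_1(-k, \frac{D}{2}, -\frac{1}{2}\omega)$. In particular, for $k = 1$ and $k = 2$, we have ${}_1F_1(-1, b, c) = 1 - \frac{c}{b}$ and ${}_1F_1(-2, b, c) = 1 - \frac{2c}{b} + \frac{c^2}{b(b + 1)}$. We also have $\frac{\Gamma(\frac{D}{2} + 1)}{\Gamma(\frac{D}{2})} = \frac{D}{2}$ and $\frac{\Gamma(\frac{D}{2} + 2)}{\Gamma(\frac{D}{2})} = \frac{D}{2}\big(\frac{D}{2} + 1\big)$, which leads to: $\mathbb{E}[z] = \sigma(D + \omega)$ and $\mathbb{E}[z^2] = \sigma^2\big(2\omega + 2(D + \omega) + (D + \omega)^2\big)$. Finally, we conclude:
$$\frac{\mathrm{Var}[z]}{\mathbb{E}[z]^2} = \frac{\mathbb{E}[z^2]}{\mathbb{E}[z]^2} - 1 = \frac{2\omega}{(D + \omega)^2} + \frac{2}{D + \omega}.$$

Proposition 3.3. Let $f$ be a Gaussian process. We note $\omega = (v^\top \Sigma v)^{-1} (v^\top \mathbb{E}[J]^\top \mathbb{E}[J] v)$, with $J$ the Jacobian of $f$, and $\Sigma$ the covariance matrix of $J$.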
The relative ratio between the Finsler norm $\|\cdot\|_F$ and the Riemannian norm $\|\cdot\|_R$ is bounded as:
$$0 \leq \frac{\|v\|_R - \|v\|_F}{\|v\|_R} \leq \frac{1}{D + \omega} + \frac{\omega}{(D + \omega)^2}.$$

Proof. The result is directly obtained using Proposition 3.2 and Lemma B.2.1.

Corollary 3.2. When $f$ is a Gaussian process, the relative ratio between the length, the energy and the volume of the Finsler norm (noted $L_F, E_F, V_F$) and the Riemannian norm (noted $L_R, E_R, V_R$) is bounded as:
$$0 \leq \frac{L_R(z) - L_F(z)}{L_R(z)} \leq \max_{v \in T_zZ}\Big\{\frac{1}{D + \omega} + \frac{\omega}{(D + \omega)^2}\Big\},$$
$$0 \leq \frac{E_R(z) - E_F(z)}{E_R(z)} \leq \max_{v \in T_zZ}\Big\{\frac{2}{D + \omega} + \frac{1 + 2\omega}{(D + \omega)^2} + \frac{2\omega}{(D + \omega)^3} + \frac{\omega^2}{(D + \omega)^4}\Big\},$$
$$0 \leq \frac{V_R(z) - V_F(z)}{V_R(z)} \leq 1 - \Big(1 - \max_{v \in T_zZ}\Big\{\frac{1}{D + \omega} + \frac{\omega}{(D + \omega)^2}\Big\}\Big)^q.$$

Proof. Let us call $M = \max_{v \in T_xM}\big\{\frac{\omega}{(D + \omega)^2} + \frac{1}{D + \omega}\big\}$. From Proposition 3.3, we have:
$$0 \leq \|v\|_R - \|v\|_F \leq M \|v\|_R, \quad\text{with } M = \max_{v \in T_xM}\Big\{\frac{1}{D + \omega} + \frac{\omega}{(D + \omega)^2}\Big\}.$$

1) By the monotonicity of the Lebesgue integral, we can directly integrate the previous norms along a curve $\gamma$, which immediately leads to: $0 \leq L_R(x) - L_F(x) \leq M L_R(x)$.

2) Since all the functions are positive, $0 \leq \|v\|_F \leq \|v\|_R \leq M\|v\|_R + \|v\|_F$ leads to $\|v\|_F^2 \leq \|v\|_R^2 \leq M^2 \|v\|_R^2 + 2M \|v\|_F \|v\|_R + \|v\|_F^2$. Replacing $\|v\|_F \leq \|v\|_R$ in the right-hand term gives $\|v\|_F^2 \leq \|v\|_R^2 \leq (M^2 + 2M)\|v\|_R^2 + \|v\|_F^2$, and finally $0 \leq \|v\|_R^2 - \|v\|_F^2 \leq (M^2 + 2M)\|v\|_R^2$. Again, by monotonicity of the Lebesgue integral, we directly obtain: $0 \leq E_R(x) - E_F(x) \leq (M^2 + 2M) E_R(x)$.

3) In order to compare the volume between the Finsler and the Riemannian metric, we need to compare the volumes of their indicatrices, noted $\mathrm{vol}(I_g)$ and $\mathrm{vol}(I_F)$ respectively. We write the vectors $v \in T_xM$ in hyperspherical coordinates, with $v = r e$, $r = \|v\|$ the radial distance and $e$ the angular coordinates. The volume element in dimension $q$ (the dimension of the latent space) can be written as $d^qV = r^{q-1}\, dr\, d\Phi$, with $\Phi$ defining the different angles. We note $r_F$ and $r_g$ the radial distances of the Finsler and Riemannian metrics such that $\|v\|_F = r_F \|e\|_F = 1$ and $\|v\|_R = r_g \|e\|_R = 1$, obtained for a specific angle $e$. Then
$$\mathrm{vol}(I_F) - \mathrm{vol}(I_g) = \int_\Phi \Big(\int_0^{r_F} r^{q-1}\, dr - \int_0^{r_g} r^{q-1}\, dr\Big) d\Phi = \int_\Phi \frac{r_F^q - r_g^q}{q}\, d\Phi,$$
and by definition: $\mathrm{vol}(I_F) = \int_\Phi (r_F^q / q)\, d\Phi$.
Furthermore, for a specific angle $e$, we have $r_g / r_F = F(e) / \sqrt{g(e)} \geq 1 - M$, from Proposition 3.3. We have:
$$0 \leq \frac{\mathrm{vol}(I_F) - \mathrm{vol}(I_g)}{\mathrm{vol}(I_F)} \leq 1 - (1 - M)^q,$$
and by the definition of the Busemann-Hausdorff volume, $\frac{V_R(x) - V_F(x)}{V_R(x)} = \frac{\mathrm{vol}(I_F) - \mathrm{vol}(I_g)}{\mathrm{vol}(I_F)}$, which concludes the proof.

B.2.3 Implications in High Dimensions

In this section, we want to show that the difference between the Finsler norm and the Riemannian induced norm, as well as their respective functionals, tends to zero at a rate of $O(\frac{1}{D})$. We need to be sure that $\omega$ does not grow faster than $D$, in other terms: $\omega = O(D)$. This can be obtained if we assume that every element of the expectation of the Jacobian is upper bounded ($\exists m \in \mathbb{R}_+^*$, $\forall i, j$, $\mathbb{E}[J_{ij}] \leq m$). This happens in at least two cases: (1) $\mathbb{E}[f]$ is Lipschitz continuous; or (2) $f$ is a Gaussian process and its covariance is upper bounded. The latter case happens when the process is defined over a bounded domain.

Lemma B.2.2. Our Finsler metric $v \mapsto \mathbb{E}\big[\sqrt{v^\top G v}\big]$ is defined with $v^\top G v \sim W_1(D, v^\top \Sigma v, \omega)$, and $\omega = (v^\top \Sigma v)^{-1} (v^\top \mathbb{E}[J]^\top \mathbb{E}[J] v)$. If the Finsler manifold is bounded, then $\omega \leq DM$, with $M \in \mathbb{R}_+$.

Proof. By definition, $\Sigma$ does not depend on $D$. We assume the manifold is bounded, which means that every element of the expected Jacobian is upper bounded: $\mathbb{E}[J]_{ij} \leq m$, with $m \in \mathbb{R}_+^*$. We call $\sigma = v^\top \Sigma v \in \mathbb{R}_+^*$. Then
$$\omega = \sigma^{-1} \sum_{k=1}^{D} \sum_{i, j = 1}^{q} v_i\, \mathbb{E}[J]_{ki}\, \mathbb{E}[J]_{kj}\, v_j \leq \sigma^{-1} \sum_{k=1}^{D} m^2 \|v\|^2 \leq DM,$$
with $M = \sigma^{-1} m^2 \|v\|^2 \in \mathbb{R}_+^*$, and $M$ does not depend on $D$.

Corollary 3.3. Let $f$ be a Gaussian process. In high dimensions, we have:
$$\frac{L_R(z) - L_F(z)}{L_R(z)} = O\Big(\frac{1}{D}\Big), \qquad \frac{E_R(z) - E_F(z)}{E_R(z)} = O\Big(\frac{1}{D}\Big), \qquad\text{and}\qquad \frac{V_R(z) - V_F(z)}{V_R(z)} = O\Big(\frac{q}{D}\Big).$$
When $D$ converges toward infinity: $L_R \underset{+\infty}{\sim} L_F$, $E_R \underset{+\infty}{\sim} E_F$ and $V_R \underset{+\infty}{\sim} V_F$.

Proof. The results in high dimensions are obtained directly from Corollary 3.2. We assume that our latent space is bounded; then, by Lemma B.2.2, we have $0 \leq \omega \leq MD$, with $M \in \mathbb{R}_+$.
For the length, we have:
$$\frac{L_R(x) - L_F(x)}{L_R(x)} \leq \max_{v \in T_xM}\Big\{\frac{1}{D + \omega} + \frac{\omega}{(D + \omega)^2}\Big\} \leq \frac{1 + M}{D}.$$
For the energy functional, we have:
$$\frac{E_R(x) - E_F(x)}{E_R(x)} \leq \max_{v \in T_xM}\Big\{\frac{2}{D + \omega} + \frac{1 + 2\omega}{(D + \omega)^2} + \frac{2\omega}{(D + \omega)^3} + \frac{\omega^2}{(D + \omega)^4}\Big\} \leq \frac{2(1 + M)}{D} + \frac{1 + 2M + M^2}{D^2},$$
$$\limsup_{D \to \infty} D\, \frac{E_R(x) - E_F(x)}{E_R(x)} \leq \limsup_{D \to \infty} \Big(2(1 + M) + \frac{M^2 + 2M + 1}{D}\Big) = 2(1 + M).$$
For the volume, we have:
$$\frac{V_R(x) - V_F(x)}{V_R(x)} \leq 1 - \Big(1 - \max_{v \in T_xM}\Big\{\frac{1}{D + \omega} + \frac{\omega}{(D + \omega)^2}\Big\}\Big)^q.$$
Using a Taylor series expansion, when $x \to 0$, we have $1 - (1 - x)^q = qx + O(x^2)$. Let us call $\varepsilon \ll 1$ the remainder factor, and rewrite the Taylor series:
$$\frac{V_R(x) - V_F(x)}{V_R(x)} \leq q\, \frac{1 + M}{D} + \varepsilon\, q\, \frac{1 + M}{D}, \qquad \frac{D}{q}\, \frac{V_R(x) - V_F(x)}{V_R(x)} \leq (1 + M)(1 + \varepsilon).$$
The difference between the functionals can converge to zero if they are similar in high dimensions, or if they all diverge to infinity. This latter case does not happen, as we assume the latent manifold to be bounded, so the metrics are finite, which concludes the proof.

Corollary 3.4. Let $f$ be a Gaussian process. In high dimensions, the relative ratio between the Finsler norm $\|\cdot\|_F$ and the Riemannian norm $\|\cdot\|_R$ is:
$$\frac{\|v\|_R - \|v\|_F}{\|v\|_R} = O\Big(\frac{1}{D}\Big).$$
And, when $D$ converges toward infinity: $\forall v \in T_zZ$, $\|v\|_R \underset{+\infty}{\sim} \|v\|_F$.

Proof. Similarly to Corollary 3.3, assuming that our latent space is bounded, from Lemma B.2.2 we have $0 \leq \omega \leq MD$. From Proposition 3.3, we deduce:
$$0 \leq \frac{\|v\|_R - \|v\|_F}{\|v\|_R} \leq \frac{1}{D + \omega} + \frac{\omega}{(D + \omega)^2} \leq \frac{1 + M}{D}.$$
In a bounded manifold, the metrics are finite. We can deduce that they converge to each other in high dimensions.

C Experiments

C.1 Datasets

C.1.1 Font data

The dataset represents 46 different fonts for each letter (upper and lower case), whose contour is parametrised by a spline (or two splines, depending on the letter used) obtained from at least 500 points (Campbell & Kautz, 2014). In our case, we choose to learn the manifold of the letter f. The dataset is composed of 46 different fonts, each letter being drawn by 1024 points. We reduce this number from 1024 to 256 by sampling one point every 4. The dimension of the observational space is then 256.

C.1.2 qPCR

The qPCR data, gathered from Guo et al.
(2010), was used to illustrate the training of a GPLVM in Pyro (Pyro, 2022) and is available at the Open Data Science repository (Ahmed et al., 2019). It consists of 437 single-cell qPCR data points for which the expression of 48 genes has been measured during 10 different cell stages. We then have 437 data points, 48 observations, and 10 classes. Before training the GP-LVM, the data is grouped by the capture time, as illustrated in the Pyro documentation.

C.1.3 Pinwheel on a sphere

A pinwheel in two dimensions is created and then projected onto a sphere using a stereographic projection. The final dataset is composed of 1000 points with their coordinates in three dimensions.

C.2 GP-LVM training

We learn our two-dimensional latent space by training a GP-LVM (Lawrence, 2003) with Pyro (Bingham et al., 2019). The Gaussian process used is a sparse GP, defined with a kernel (RBF or Matern) composed of a one-dimensional lengthscale and variance. The parameters are learnt with the Adam optimiser (Kingma & Ba, 2014). The number of steps and the initialisation of the latent space vary with the dataset.

| datasets | pinwheel | font data | qPCR |
|---|---|---|---|
| Number of data points | 500 | 46 | 437 |
| Number of observations | 3 | 256 | 48 |
| initialisation | PCA | PCA | custom |
| kernel | RBF | Matern52 | Matern52 |
| steps | 17000 | 5000 | 5000 |
| learning rate | 1e-3 | 1e-4 | 1e-4 |
| lengthscale | 0.24 | 0.88 | 0.15 |
| variance | 0.95 | 0.30 | 0.75 |
| noise | 1e-4 | 1e-3 | 1e-3 |

Table 2: Description of the datasets trained with a GP-LVM.

C.3 Computing indicatrices

An indicatrix of a function $g$ at a point $x$ is defined as the set $\{v \in T_xM \mid g_x(v) < 1\}$. In other terms, the indicatrix is the representation of a unit ball in our latent space. If we use a Euclidean metric, the indicatrix in our 2-dimensional latent space is a unit disk, as we need to solve: $v \in T_xM$, $\|v\| < 1$.
For a Riemannian metric, our indicatrix is necessarily an ellipse, whose semi-axes are the inverse square roots of the eigenvalues of the metric tensor $G$: $v \in T_xM$, $v^\top G v < 1$. For our Finsler metric, we do not have an analytical solution, and so it is difficult to predict the shape of the resulting convex set. In this paper, the indicatrices are drawn the following way: for a single point in our latent space, we compute the value of $v^\top G v$ and $F(x, v)$ for $v$ varying over the space. We then extract the contour where $v^\top G v$ and $F(x, v)$ are equal to 1. The area of the indicatrices is used in section C.4 to compute the volume measures.

C.4 Computing the volume forms

For the figures used in this paper, by default, the background of the latent space represents the volume measure of the expected Riemannian metric ($V_G = \sqrt{\det \mathbb{E}[G]}$) on a logarithmic scale. In Figure 3, the volume measure of the Finsler metric is also computed.

C.4.1 Finsler metric

To compute the volume measure of our Finsler metric, we choose the Busemann-Hausdorff definition, which is the ratio of the volume of a unit ball over the volume of the indicatrix: $V = \mathrm{vol}(B^n(1)) / \mathrm{vol}(\{v \in T_xM \mid F(x, v) < 1\})$. While our Finsler function has an analytical form, its expression does not allow us to directly solve the equation: $v \in T_xM$, $F(x, v) < 1$. Instead we approximate its indicatrix as described in C.3, using a contour plot and extracting the vertices of the contour path. We can then compute the area of the obtained polygon, and divide the volume of the unit ball, $\mathrm{vol}(B^2(1)) = \pi$, by this area. The volume measure can then be computed for each point over a grid (32 x 32 in Figure 3), and we interpolate all the other points. Note that this method can only be used when our latent space is of dimension 2.

C.4.2 Expected Riemannian metric

There are two ways to compute the volume measure of the expected Riemannian metric. One way is to directly use the metric tensor: $V_G = \sqrt{\det \mathbb{E}[G]}$.
Another one is to remember that any Riemannian metric is a Finsler metric, and thus the Busemann-Hausdorff definition also applies to our metric: $V_G = \mathrm{vol}(B^n(1)) / \mathrm{vol}(\{v \in T_xM \mid v^\top \mathbb{E}[G] v < 1\})$. Solving $v^\top \mathbb{E}[G] v < 1$ for $v \in T_xM$ is equivalent to computing the area of an ellipse.

For the first method, we can either sample the metric multiple times, which is computationally expensive, or use the fact that our metric tensor is a non-central Wishart matrix: $G = J^\top J \sim W_q(D, \Sigma, \Sigma^{-1} \mathbb{E}[J]^\top \mathbb{E}[J])$, with $\Sigma$ the covariance of the Jacobian $J$ and $D$ the dimension of the observational space. In this case, its expectation is $\mathbb{E}[G] = \mathbb{E}[J]^\top \mathbb{E}[J] + D\Sigma$. We can access the derivatives of the function $f$ (detailed in section D.2), and compute both quantities $\mathbb{E}[J]$ and $\Sigma$ needed to estimate the expected metric and its determinant. For the second method, we can compute the area of the ellipse in the same way we compute the Finsler volume measure.

C.5 Experiments when increasing the number of dimensions

In Figure 4, we computed the volume ratio and drew indicatrices while varying the number of dimensions, to illustrate that our Finsler metric and the expected Riemannian metric seem to converge when $D$ increases. The main issue with this experiment is to vary only one factor, the number of dimensions $D$, while keeping the other factors unchanged. This is difficult for two reasons: 1) Even with a very low-dimensional observational space, both metrics are already very similar to each other, so it would be difficult to illustrate a convergence while increasing the number of dimensions. 2) The function $f$ needs to be learnt again each time we increase the number of dimensions of the observational space, and the parameters of the Gaussian process will change too. Instead, we try to illustrate our results by empirically computing the stochastic metric tensor $G = J^\top J$, using its Jacobian $J \sim \mathcal{N}(\mathbb{E}[J], \Sigma)$, a $D \times q$ matrix.
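A rough sketch of this kind of empirical experiment is given below. It is not the code behind Figure 4: it compares the two norms along a single direction $v$ instead of computing volume ratios, it varies $D$ by keeping only the first $D$ rows of one large sampled Jacobian, and every numerical value (sizes, `mean_jac`, `cov`, `v`) is a hypothetical choice. The rows of $J$ are assumed independent with covariance $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
q, D_max, n_samples = 2, 64, 20_000                  # hypothetical sizes
mean_jac = rng.uniform(-1.0, 1.0, (D_max, q))        # hypothetical, bounded E[J]
cov = np.array([[0.5, 0.1], [0.1, 0.3]])             # hypothetical Sigma
v = np.array([1.0, 0.5])

chol = np.linalg.cholesky(cov)
J = mean_jac + rng.standard_normal((n_samples, D_max, q)) @ chol.T
# Keeping the first D rows of J keeps the first D entries of J v,
# so column D-1 of the cumulative sum holds v^T G v for a D x q Jacobian.
sq_norms = np.cumsum((J @ v) ** 2, axis=1)

dims = [2, 4, 8, 16, 32, 64]
s = v @ cov @ v                                      # v^T Sigma v
rel_diff, bound = [], []
for D in dims:
    z = sq_norms[:, D - 1]
    finsler = np.sqrt(z).mean()                      # estimate of E[sqrt(v^T G v)]
    riemann = np.sqrt(z.mean())                      # estimate of sqrt(v^T E[G] v)
    omega = np.sum((mean_jac[:D] @ v) ** 2) / s
    rel_diff.append((riemann - finsler) / riemann)
    bound.append(1 / (D + omega) + omega / (D + omega) ** 2)
    print(D, rel_diff[-1], bound[-1])
```

Under these assumptions, the estimated relative difference should stay below the bound of Proposition 3.3 and shrink as $D$ grows.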
The number of dimensions is modified by simply truncating the Jacobian $J$. In Figure 4, the volume ratio is computed for 12 Jacobians obtained with different random parameters $\mathbb{E}[J]$ and $\Sigma$. The Finsler and Riemannian indicatrices (lower right) are drawn for only one Jacobian selected randomly.

D Computations

D.1 Computing geodesics with Stochman and minimising the curve energy functionals

An essential task is to compute shortest paths, or geodesics, between data points in the latent space. Those shortest paths can be obtained in two ways: either by solving a corresponding system of ODEs, or by minimising the curve energy in the latent space. The former being computationally expensive, we favour the second approach, which consists in optimising a parameterised spline on the manifold. This method is already implemented in Stochman (Detlefsen et al., 2021), where we can easily optimise splines by redefining the curve energy function of a manifold class. We need two curve energy functionals: one for the expected Riemannian metric and one for the Finsler metric.

D.1.1 Curve energy for the Riemannian metric

We know that the stochastic metric tensor $G_t$ defined at a point $t$ follows a non-central Wishart distribution. Thus, we can compute its expectation $\mathbb{E}[G_t]$ knowing the Jacobian expectation and covariance: $\mathbb{E}[J_t]$ and $\Sigma$. Section D.2 explains how to compute those quantities. Assuming the spline is discretized into $N$ points, we can compute the curve energy with:
$$E_G(\gamma(t)) = \int_0^1 \dot\gamma(t)^\top \mathbb{E}[G_t]\, \dot\gamma(t)\, dt \approx \sum_{i=1}^{N} \dot\gamma_i^\top \big(\mathbb{E}[J_i]^\top \mathbb{E}[J_i] + D\Sigma_i\big) \dot\gamma_i.$$

D.1.2 Curve energy for the Finsler metric

In order to compute the curve energy $E(\gamma)$, we must first derive the expectation $\mathbb{E}[J_t]$ and covariance $\Sigma$ of the Jacobian of $f$, which should follow a normal distribution: $J_i = \partial f_i \sim \mathcal{N}(\mathbb{E}[J], \Sigma)$. We assume the points $f_i$ are independent samples with the same variance drawn from a normal distribution.
We can then compute the Finsler metric, the expectation of a non-central Nakagami distribution (see Proposition 2.3):
$$E(\gamma(t)) = \int_0^1 F(t, \dot\gamma(t))^2\, dt \approx \sum_{i=1}^{N} 2\, \dot\gamma_i^\top \Sigma_i \dot\gamma_i \Bigg(\frac{\Gamma\big(\frac{D}{2} + \frac{1}{2}\big)}{\Gamma\big(\frac{D}{2}\big)}\Bigg)^2 {}_1F_1\Big(-\frac{1}{2}, \frac{D}{2}, -\frac{\omega_i}{2}\Big)^2,$$
with ${}_1F_1$ the confluent hypergeometric function of the first kind, $\omega_i = (\dot\gamma_i^\top \Sigma_i \dot\gamma_i)^{-1} (\dot\gamma_i^\top \Omega_i \dot\gamma_i)$ and $\Omega_i = \Sigma_i^{-1} \mathbb{E}[J_i]^\top \mathbb{E}[J_i]$. This function has been implemented in PyTorch using the known gradients for the hypergeometric function: $\frac{\partial}{\partial x}\, {}_1F_1(a, b, x) = \frac{a}{b}\, {}_1F_1(a + 1, b + 1, x)$.

D.2 Accessing the posterior derivatives

We assume that the probabilistic mapping $f$ from the latent variables $X$ to the observational variables $Y$ follows a normal distribution. We would like to obtain the posterior kernel $\Sigma_*$ and expectation $\mu_*$ such that $p(\partial_t f \mid Y, X) \sim \mathcal{N}(\mu_*, \Sigma_*)$.

We make the hypothesis that the observed variables are modelled with a Gaussian noise $\epsilon$ whose variance is the same in every dimension. In particular, for the $n$th latent ($x$) and observed ($y$) variable in the $j$th dimension: $y_{n,j} = f_j(x_{n,:}) + \epsilon_n$. Thus, the output variables have the same variance, and the posterior kernel $\Sigma_*$ is then isotropic with respect to the output dimensions: $\Sigma_* = \sigma_*^2 I_D$.

There are two ways of obtaining the posterior variance and expectation:

- We use the Gaussian process to predict the derivative ($\partial_c f$) of the mapping function $f$, and we multiply the obtained posterior kernel by the curve derivative ($\partial_t c$), following the chain rule: $\frac{d f(c(t))}{dt} = \frac{\partial f}{\partial c} \frac{dc}{dt}$ (Section D.2.1).
- We discretize the derivative of the mapping function as the difference of this function evaluated at two close points. We use a linear operation to obtain the posterior variance and expectation: $\partial_t f(c(t)) \sim f(c(t_{i+1})) - f(c(t_i))$ (Section D.2.2).

D.2.1 Closed-form expressions

We assume that $f$ is a Gaussian process. Hence, because differentiation is a linear operation, the derivative of a Gaussian process is also a Gaussian process (Rasmussen & Williams, 2005).
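The data $Y \in \mathbb{R}^{N \times D}$ follows a normal distribution, so we can infer the partial derivative of one data point, $(J^\top)_{ji} = \frac{\partial y_i}{\partial x_j}$, with $i = 1 \dots D$ and $j = 1 \dots d$. The observations and the derivatives at a new point $x_*$ are jointly Gaussian, with covariance
$$\begin{bmatrix} K(x, x) & \partial K(x, x_*) \\ \partial K(x_*, x) & \partial^2 K(x_*, x_*) \end{bmatrix},$$
and $J_*$ can be predicted:
$$p(J_* \mid Y, X) = \prod_{i=1}^{D} \mathcal{N}(\mu_*, \Sigma_*), \qquad \mu_* = \partial K(x_*, x)\, K(x, x)^{-1} (y - \mu_y) + \partial\mu_y, \qquad \Sigma_* = \partial^2 K(x_*, x_*) - \partial K(x_*, x)\, K(x, x)^{-1}\, \partial K(x, x_*).$$
Finally, $\partial_t f$ is obtained:
$$p(\partial_t f(c(t)) \mid f(x), x) = \prod_{i=1}^{D} \mathcal{N}\big(\partial_t c^\top \mu_*,\; \partial_t c^\top \Sigma_*\, \partial_t c \cdot I_D\big).$$

D.2.2 Discretization

One can notice that $\partial_t f(c(t)) \sim f(c(t_{i+1})) - f(c(t_i))$. We know that $f(c(t_{i+1}))$ and $f(c(t_i))$ both follow a normal distribution:
$$\begin{bmatrix} f(c(t_i)) \\ f(c(t_{i+1})) \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mu_i \\ \mu_{i+1} \end{bmatrix}, \begin{bmatrix} \sigma^2_{ii} & \sigma^2_{i,i+1} \\ \sigma^2_{i+1,i} & \sigma^2_{i+1,i+1} \end{bmatrix}\right).$$
If $Y = A^\top X$ is a linear transformation of a multivariate Gaussian $X \sim \mathcal{N}(\mu, \sigma^2)$, then $Y$ is also a multivariate Gaussian with $Y \sim \mathcal{N}(A^\top \mu, A^\top \sigma^2 A)$. In our case, we choose $A^\top = [-1, 1]$. We have:
$$f(c(t_{i+1})) - f(c(t_i)) \sim \mathcal{N}(\mu_*, \sigma_*^2 I_D), \qquad \mu_* = \mu_{i+1} - \mu_i, \qquad \sigma_*^2 = \sigma^2_{ii} + \sigma^2_{i+1,i+1} - 2\sigma^2_{i,i+1}.$$

Figure 12: Illustration of the derivatives obtained with a trained GP on a simple parametrised function: both methods give the correct derivatives if enough points are sampled.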
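The discretization of Section D.2.2 can be sketched for a whole curve at once. In the minimal example below, the posterior over $f$ along the curve is replaced by a hypothetical mean `mu` and a hypothetical RBF covariance `K` for a single output dimension; neither comes from a trained model. The finite-difference operator is stored as the rows of a matrix `diff_op`, so the Gaussian of the differences has mean `diff_op @ mu` and covariance `diff_op @ K @ diff_op.T`.

```python
import numpy as np

def rbf(t1, t2, lengthscale=0.3):
    """Squared-exponential kernel, used here as a stand-in posterior covariance."""
    return np.exp(-0.5 * (t1[:, None] - t2[None, :]) ** 2 / lengthscale ** 2)

t = np.linspace(0.0, 1.0, 6)          # curve parameters t_1, ..., t_N
mu = np.sin(2 * np.pi * t)            # hypothetical posterior mean of f(c(t_i))
K = rbf(t, t)                         # hypothetical posterior covariance

# Row i of diff_op is [..., -1, 1, ...]: it maps f to f(c(t_{i+1})) - f(c(t_i)).
N = len(t)
diff_op = np.eye(N, k=1)[:-1] - np.eye(N)[:-1]

mu_diff = diff_op @ mu                # mu_{i+1} - mu_i
cov_diff = diff_op @ K @ diff_op.T    # covariance of all consecutive differences
var_diff = np.diag(cov_diff)          # sigma_ii + sigma_{i+1,i+1} - 2 sigma_{i,i+1}
print(mu_diff)
print(var_diff)
```

The diagonal of `cov_diff` reproduces the variance formula of Section D.2.2 for each pair of consecutive points, while the off-diagonal entries give the correlations between neighbouring differences, which the pairwise formula does not show.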