# intrinsic_persistent_homology_via_densitybased_metric_learning__ba19d900.pdf

Journal of Machine Learning Research 24 (2023) 1-42 Submitted 9/21; Revised 1/23; Published 1/23

Intrinsic Persistent Homology via Density-based Metric Learning

Ximena Fern andez xfernand@dm.uba.ar Department of Mathematics Swansea University, UK Departamento de Matem atica and IMAS-CONICET FCEN, Universidad de Buenos Aires, Argentina

Eugenio Borghini eborghini@dm.uba.ar Departamento de Matem atica and IMAS-CONICET FCEN, Universidad de Buenos Aires, Argentina

Gabriel Mindlin gabo@df.uba.ar IFIBA, CONICET and Departamento de F ısica FCEN, Universidad de Buenos Aires, Argentina

Pablo Groisman pgroisma@dm.uba.ar Departamento de Matem atica and IMAS-CONICET FCEN, Universidad de Buenos Aires, Argentina NYU-ECNU Institute of Mathematical Sciences NYU Shanghai

Editor: Aryeh Kontorovich

We address the problem of estimating topological features from data in high dimensional Euclidean spaces under the manifold assumption. Our approach is based on the computation of persistent homology of the space of data points endowed with a sample metric known as Fermat distance. We prove that such metric space converges almost surely to the manifold itself endowed with an intrinsic metric that accounts for both the geometry of the manifold and the density that produces the sample. This fact implies the convergence of the associated persistence diagrams. The use of this intrinsic distance when computing persistent homology presents advantageous properties such as robustness to the presence of outliers in the input data and less sensitiveness to the particular embedding of the underlying manifold in the ambient space. We use these ideas to propose and implement a method for pattern recognition and anomaly detection in time series, which is evaluated in applications to real data. Keywords: topological data analysis, persistent homology, manifold learning, distance learning, time series

1. Introduction

1.1 Motivation and Problem Statement.

It is a common situation in machine learning that the given data represents a possibly noisy ﬁnite sample of a geometric object embedded in a high dimensional Euclidean space.

c 2023 Ximena Fern andez, Eugenio Borghini, Gabriel Mindlin and Pablo Groisman.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-1044.html.

Fern andez, Borghini, Mindlin and Groisman

This is the case, for instance, in the analysis of time series arising from observations of a dynamical system, where a spatial representation of the data can be interpreted as a sample of a geometric structure the attractor encoding valuable information of the underlying system s behaviour. Under the manifold assumption, both the metric and the density of the sample play a central role in the process of reconstruction of topological properties of the underlying shape. From a theoretical point of view, the problem can be stated as follows. Let Xn be a set of n sample points with common density f supported on a smooth compact Riemannian manifold M embedded in RD. We are interested in recovering topological features of M from the sample Xn RD in a setting in which both M and f are assumed to be unknown. A standard approach to accomplish this task consists in applying to Xn a computational technique known as persistent homology, which allows to obtain qualitative information about connected components, cycles, voids and higher dimensional holes from the point cloud. Here, the sample Xn is considered as a metric space endowed with some computable distance, such as the Euclidean distance or an estimator of the inherited geodesic distance. Although the topological information carried by M remains the same when endowed with any Riemannian metric, the output of the application of persistent homology to Xn strongly depends on the particular distance function employed. In this article, we consider a computable estimator deﬁned over Xn of a certain Riemannian metric on M that takes into account the density f, which was called Fermat distance by Groisman et al. (2022). We show that the use of this density-based intrinsic metric in the computation of persistent homology can lead to results that overcome simultaneously certain weaknesses of standard approaches, such as the sensitivity to outliers and the dependence on the embedding of the sample in the ambient space. Persistent homology is a central technique in Topological Data Analysis (TDA) developed to infer the homology groups of a space by studying a sample Xn at all scales of resolution at the same time (see Edelsbrunner and Harer, 2008; Edelsbrunner et al., 2002; Boissonnat et al., 2018; Niyogi et al., 2008; Zomorodian and Carlsson, 2005). It has found applications in many ﬁelds, including neuroscience (Giusti et al., 2015), ﬁnance (Gidea and Katz, 2018), signal processing (Perea and Harer, 2015; Tralie and Perea, 2018; Perea, 2019), computational neural networks (Gabrielsson et al., 2020), virus evolution (Chan et al., 2013) and sensor networks (de Silva and Ghrist, 2007a). This method yields as output an object called persistence diagram associated to the sample. Under mild conditions, the homology groups of the underlying topological space can be read oﬀthe persistence diagram (see Edelsbrunner et al., 2002). Chazal et al. (2009, 2016) provided a general framework that allows to deﬁne persistence diagrams for inﬁnite metric spaces instead of just ﬁnite approximations (samples). Thus, one can view the persistence diagram associated to a sample of a space as an estimate of a limiting object, namely, the persistence diagram of the entire space. When the distinction is needed, we will call these diagrams sample persistence diagram and population persistence diagram respectively. Our main result states that, under reasonable conditions, there is convergence as metric spaces of the sample Xn endowed with a computable estimator of the Fermat distance towards the manifold M (equipped with the Fermat distance) in the sense of Gromov Hausdorﬀas the size n grows. When combined with the well-known stability theorem (Cohen-Steiner et al., 2007; Chazal et al., 2014, 2009), this approximation result as metric

Intrinsic Persistent Homology via Density-based Metric Learning

spaces allows to deduce the convergence of the corresponding persistence diagrams. For this purpose, the space of diagrams is naturally equipped with the bottleneck distance. Approximation results that include convergence rates and conﬁdence regions have been established when the metric of the target space is known; see e.g. (Fasy et al., 2014) where the Euclidean distance is considered for both the samples and the space and also (Chazal et al., 2015), where a general metric is used but assumed to be known in advance. Persistence diagrams are known to be sensitive to the presence of outliers (see Bendich et al., 2011; Chazal et al., 2011; Buchet et al., 2016; Anai et al., 2019). Chazal et al. (2011); Anai et al. (2019) proposed ﬁltrations of point clouds regarded as empirical measures in the ambient Euclidean space called DTM-ﬁltrations to achieve a robust computation of ambient persistent homology. This theory was later extended to general metric spaces by Buchet et al. (2016). On the other hand, intrinsic versions of the classical ˇCech and Vietoris Rips ﬁltrations were developed with the aim of capturing topological properties of manifolds sitting in an Euclidean space which are independent of the embedding. The approach exhibited in this article handles both diﬃculties at the same time. Indeed, we show that sample persistence diagrams computed using the estimator of the (intrinsic) Fermat distance are both robust to outliers for positive degree and display the correct homology of the manifold for a longer parameter interval as compared with the use of ambient Euclidean distance. We refer the reader to the video by Fernandez (2021b) for an introductory exposition of the contents of this article.

1.2 Contributions

Let (M, ρ) be a smooth d-dimensional Riemannian manifold embedded in RD with density f : M R>0 and a Riemannian density-based distance ρ (mainly, it will be the Fermat distance df,p deﬁned below). For p > 1, the population Fermat distance is deﬁned as

df,p(x, y) = inf γ

1 f(γt)(p 1)/d | γt|dt.

Here x, y M, | | denotes the Euclidean distance and the inﬁmum is taken over all piecewise smooth curves γ : I = [0, 1] M with γ(0) = x, and γ(1) = y. In the special case when f is uniform, the population Fermat distance reduces to (a multiple of) the inherited Riemannian distance d M from the ambient Euclidean space. When this is not the case, this distance takes into account the density, which may be advantageous in certain situations, like in the case of estimation of the topology of M from samples with presence of noise and outliers. This metric was also considered in the works by Hwang et al. (2016); Mckenzie and Damelin (2019); Sapienza et al. (2018); Groisman et al. (2022). Given a ﬁnite set of points Xn, the sample Fermat distance between x, y is deﬁned as

d Xn,p(x, y) = inf γ

i=0 |xi+1 xi|p

where the inﬁmum is taken over all paths γ = (x0, x1, . . . , xr+1) with x0 = x, xr+1 = y and {x1, x2, . . . , xr} Xn.

Fern andez, Borghini, Mindlin and Groisman

Our main result states the Gromov Hausdorﬀconvergence (a.s.) of the sample endowed with the sample Fermat distance, appropriately re-scaled, to (M, df,p). Theorem Let M be a smooth, closed d-dimensional Riemannian manifold embedded in RD. Let f : M R>0 be a smooth density function. Let Xn = {x1, x2, . . . , xn} M be a set of n independent sample points in M with common density f. Given p > 1, there exists a constant µ = µ(p, d) such that for every λ (p 1)/pd, 1/d and ε > 0 there exist θ > 0 satisfying

P d GH M, df,p , Xn, n(p 1)/d

µ d Xn,p > ε exp θn(1 λd)/(d+2p)

for n large enough, where d GH stands for the Gromov-Hausdorﬀdistance between metric spaces. As a consequence of this result and the stability theorem for persistence diagrams we deduce the following convergence result. Corollary Let ε > 0 and λ (p 1)/pd, 1/d . There exists a constant θ > 0 such that

P db dgm(Filt(M, df,p)), dgm(Filt(Xn, n(p 1)/d

µ d Xn,p)) > ε exp θn(1 λd)/(d+2p)

for n large enough. Here Filt( ) denotes either the Vietoris Rips or ˇCech ﬁltration, dgm( ) the associated persistence diagram and db is the bottleneck distance (see Section 3 for precise deﬁnitions). Since (M, df,p) is a Riemannian manifold, its population persistence diagram dgm(Filt(M, df,p)) displays the correct homology up to the convexity radius conv(M, df,p). In contrast, for (M, | |) this is guaranteed only up to the reach τM. It is easy to ﬁnd examples of manifolds in which conv(M, df,p) is much larger than τM. On the other hand, we prove that for a reasonable upper bound r on the ﬁltration parameter, dgm(Rips<r(Xn, d Xn,p)) is robust to outliers for homology degree greater than 0. Proposition Let Xn be a sample of M and let Y RD M be a ﬁnite set of outliers. Let δ = min n min y Y d E(y, Y {y}), d E(Xn, Y ) o , where d E denotes the Euclidean distance

between sets. For all k > 0 and p > 1,

dgmk(Rips<δp(Xn Y, d Xn Y,p)) = dgmk(Rips<δp(Xn, d Xn,p)),

where Rips<δp stands for the Rips ﬁltration up to parameter δp and dgmk for the persistent homology of degree k. The threshold δp is restrictive if it is below diam(Xn, d Xn,p). However, we will show that under a natural model for the outliers, δp > diam(Xn, d Xn,p) for large enough p.

1.3 Applications to Signal Analysis.

The study of time series specially, derived from dynamical systems through the inference of homology groups of a certain associated space called delay embedding was pioneered in the works by Perea (2019); Perea and Harer (2015). The construction of the delay embedding of a time series heavily depends on the dimension or number of independent variables

Intrinsic Persistent Homology via Density-based Metric Learning

of the underlying system, and the choice of a parameter called time delay. It often leads to analyse subspaces of a suﬃciently high dimensional Euclidean space, which makes the inference of topological information unstable. In ﬁrst place, by means of concrete examples involving the Lorenz attractor and noisy periodic signals, we show that the use of Fermat distance in this method can lead to a more robust inference of the delay embeddings topological features. The reason behind this is that the Fermat distance is less prone, compared to the Euclidean distance, to the eﬀect known as curse of dimensionality and less dependent on the particular embedding. We also describe a method to detect change-points in the time series through the study of the evolution in time of the persistence diagrams of the corresponding time-delay embeddings. This is applied to discover anomalies in electrocardiogram signals and diﬀerent patterns in the song of canaries corresponding to diﬀerent syllables. The code to replicate the computational examples and applications can be found at the repository by Fernandez (2021a).

1.4 Related Work

The sample Fermat distance was introduced independently in the articles by Sapienza et al. (2018); Mckenzie and Damelin (2019). The study of approximations of density based metric from samples was suggested by Vincent and Bengio (2003) and developed by Sajama and Orlitsky (2005). Cohen et al. (2015); Chu et al. (2020) analyzed a general family of metrics that includes the population Fermat distance and deeply studied the case p = 2 of sample Fermat distance, which was also called power weighted shortest distance by Mckenzie and Damelin (2019). Groisman et al. (2022) proved that it is possible to recover the population Fermat distance df,p for d-dimensional manifolds which are isometrically embedded (closures of) open sets of Rd in RD as the limit of the sample Fermat distance. In the related work by Hwang et al. (2016) it was shown that in the same context, a statistic that is similar to the sample Fermat distance but uses the inherited Riemannian distance d M between consecutive points in a path instead of the Euclidean one to measure its cost, also converges almost surely to the Fermat distance. We remark that this statistic cannot be computed from the sample since the inherited distance is not assumed to be known in advance. However, the results by Hwang et al. (2016) provides an essential and strong foundation on the basis of which our main result is built over. The problem of learning geodesic distances from samples for submanifolds of the Euclidean space, specially with the aim of reducing dimensionality and visualizing data, has a long history; see for instance (Tenenbaum et al., 2000; M emoli and Sapiro, 2005). On the other hand, the problem of estimating the persistence diagram of a submanifold of an Euclidean space from a sample has been studied by Fasy et al. (2014); Chazal et al. (2015), where the underlying metric is assumed to be known. In this setting, both Chazal et al. (2015) and Fasy et al. (2014) were able to prove the following satisfying result: the persistence diagrams computed using the sample converge almost surely (in the sense of bottleneck distance) to the persistence diagram of the desired metric space. Moreover, they gave exponentially small bounds in the size of the sample for the probability of the bottleneck distance between the corresponding persistence diagrams being larger than some positive number; see (Chazal et al., 2015, Corollary 3) and (Fasy et al., 2014, Lemma 4),

Fern andez, Borghini, Mindlin and Groisman

where in addition conﬁdence sets for persistence diagrams are provided. In a diﬀerent direction, the advantages of computing persistence diagrams of submanifolds of an Euclidean space using alternative metrics more speciﬁcally, metrics based on diﬀusion geometry and random walks were explored experimentally by Bendich et al. (2011).

1.5 Structure of the Paper

In Section 2 we prove our main result Theorem 7 regarding the Gromov Hausdorﬀconvergence of metric spaces using, respectively, the sample and the population Fermat distance. Section 3 includes an introduction to persistent homology and is devoted to the study of persistence diagrams of manifolds endowed with Fermat distance. We deduce in ﬁrst place the convergence of sample persistence diagrams to population persistence diagrams. Then, we show that by using these intrinsic metrics the topological features last longer in the persistence diagrams. Finally, we show that Fermat-based persistence diagrams are robust to the presence of outliers for positive homology degree. In Section 4 we present a method for pattern recognition in time series, which is applied to real data from electrocardiograms and songs of canaries. Appendix A contains the proofs of some technical results (Proposition 5 and Lemma 8), required as intermediate steps to prove Theorem 7.

2. Density-based Distance Learning

In this section we prove the main theorem of the article, which states that the sample Xn, considered as a metric space with the sample Fermat distance (appropriately re-scaled), converges almost surely to (M, df,p) in the sense of Gromov Hausdorﬀ. We begin by introducing the population Fermat distance for a smooth closed Riemannian manifold without boundary M of dimension d > 1 with Riemannian metric tensor g together with a positive C density function f : M R>0. For p > 1, consider the deformed metric tensor gp = f2(1 p)/dg given by a conformal transformation of the original metric g. Since f is smooth, gp is a Riemannian metric tensor. Thus, M has a metric space structure given by the geodesic distance with respect to gp, denoted by df,p.

Deﬁnition 1 (Hwang et al., 2016) For p > 1, the population Fermat distance between x, y M is deﬁned as

df,p(x, y) = inf γ

1 f(γt)(p 1)/d p

g( γt, γt)dt

where the inﬁmum is taken over all piecewise smooth curves γ : I M with γ0 = x, and γ1 = y.

Notice that geodesics in M with respect to the distance df,p are more likely to lie in regions with high values of f. The name Fermat distance comes from the analogy with optics, in which df,p is the optical distance as deﬁned by Fermat s principle when the refraction index is given by f (p 1)/d. Consider now a set Xn = {x1, x2, . . . , xn} M of n sample points in M with common density f. Suppose that M is embedded in RD and it is endowed with the standard inherited Riemannian metric. Our aim is to approximate df,p(x, y), assuming no knowledge about M

Intrinsic Persistent Homology via Density-based Metric Learning

and the Riemannian distance deﬁned on it. To achieve this, we will deﬁne an estimator for this distance over the sample. We denote by |x y| the Euclidean distance between points x, y M.

Deﬁnition 2 (Sapienza et al., 2018; Mckenzie and Damelin, 2019) For p > 1, the sample Fermat distance between x, y M is deﬁned as

d Xn,p(x, y) = inf γ

i=0 |xi+1 xi|p

where the inﬁmum is taken over all paths γ = (x0, x1, . . . , xr+1) of ﬁnite length with x0 = x, xr+1 = y and {x1, x2, . . . , xr} Xn.

Since p > 1, geodesics with respect to this distance are also likely to lie in regions with high density of points in Xn. This is due to the fact that paths with short edges are favored even if they have large total (Euclidean) length. We remark here that, for technical reasons, we adopt a slightly diﬀerent deﬁnition for the sample Fermat distance than the original one from Sapienza et al. (2018). Namely, in the original setting, only paths completely contained in Xn are considered, including the endpoints. Points that are not in the sample Xn are projected to the nearest point in Xn. In consequence, our sample Fermat distance here does not generally induce a pseudometric over M, but only a metric when restricted to Xn.

Example 1 (Eyeglasses) The eﬀect of taking diﬀerent values of p for the sample Fermat distance d Xn,p in the geometry of a manifold is illustrated below. Concretely, the eyeglasses curve in R2 uniformly sampled and perturbed with Gaussian noise is considered (see Figure 1). We compute the sample Fermat distance between each pair of points for a series of values of p > 1 and embed the sampled points in R2 in such a way that the Euclidean distance in the embedding reﬂects the Fermat distance, using the Multidimensional Scaling algorithm (MDS). As p becomes larger, the geometry of the data overcomes the bottleneck region and it deforms into a circle. We also compute the Isomap embedding in R2 posed by Bernstein et al. (2000). Recall that the Isomap embedding is the MDS projection with an estimator of the inherited Riemannian distance based in the k-NN graph as input distances (see Bernstein et al., 2000, Section 5). Due to the noise near the bottleneck region, some points that are far in the sense of the inherited Riemannian distance become close in the distance estimated from the k-NN graph. Note that Isomap embedding is highly sensitive to noise, while with Fermat distance the points lying in low density regions are mapped to points that are far from the rest of the sample. The larger the power p, the stronger this eﬀect. This feature allows Fermat distance to reconstruct the underlying topology of the manifold in the present case, even with noise, for a range of values of p.

Remark 3 (The role of p) The parameter p in the deﬁnition of the population Fermat distance df,p controls the density weight f (p 1)/d in the computation of geodesics. Whereas for p = 1 the optimal paths are obtained in classic geodesic paths, for large p they might signiﬁcantly diﬀer, being mostly restricted to areas of high density. In practice, the value of p in the sample Fermat distance d Xn,p quantiﬁes the balance between the embedding and the

Fern andez, Borghini, Mindlin and Groisman

Figure 1: Top: A sample with noise of 2000 points of the eyeglasses dataset and Isomap projection with k = 5 (similar results are obtained for all reasonable values of k). Points are coloured according to local density. Middle and bottom: MDS embedding in R2 using Fermat distance for diﬀerent values of p.

density of a given sample Xn when estimating the optimal paths (notice that it is equivalent to the Euclidean distance for p = 1). In general, there is a reasonable large although bounded interval of values of p for which the estimator d Xn,p allows to recover the intrinsic geometry of the sample Xn even in presence of noise (c.f. Example 1). A similar phenomena can be experimentally observed when it is used in clustering tasks, as shown in simulations by Sapienza et al. (2018) and Little et al. (2022).

Remark 4 (Dimensionality reduction) The estimation of Fermat distance on input data, when coupled with the MDS projection, produces a new method to achieve dimensionality reduction. This strategy is in analogy with the popular algorithm Isomap by Tenenbaum et al. (2000). It is known that Isomap suﬀers from topological instability in presence of noise, since it may construct erroneous connections (called short-circuits) in the k-NN graph that potentially impair its performance (see Balasubramanian and Schwartz (2002)). In contrast, since noise generally corresponds with regions of low density, noisy points are treated by our method almost as not being part of the manifold. These eﬀects increase with the value of p, and they might be advantageous for the inference of the right geometry of the data (c.f. Section 3.3).

Intrinsic Persistent Homology via Density-based Metric Learning

Our ﬁrst result, Proposition 5, shows that the sample Fermat distance converges to the population Fermat distance for closed (i.e. compact and without boundary) submanifolds of RD. A related result was previously proved by Groisman et al. (2022) for isometrically embedded (closures of) open sets of Rd. Here we extend the class of manifolds to any compact manifold without boundary embedded in RD. Moreover, Proposition 5 states a uniform convergence for any two points in the manifold not only pointwise, as stated by Groisman et al. (2022) . This feature is essential to study both the manifold and the sample endowed with the (population and sample respectively) Fermat distance as single objects (metric spaces) and to prove convergence in the sense of Gromov Hausdorﬀ. Let us ﬁx some notations and general hypotheses. Hereafter, M will denote a smooth ddimensional closed Riemannian submanifold of RD endowed with the inherited Riemannian distance d M. We will consider a set Xn M of n independent random points with common smooth density f : M R>0. We will denote by Mf and mf the maximum and minimum values attained by f on M, respectively. Observe that 0 < mf < Mf < . Finally, given p > 1 we set α = 1/(d + 2p).

Proposition 5 For every p > 1 and λ (p 1)/pd, 1/d , given ε > 0 there exist µ, θ > 0 such that

n(p 1)/dd Xn,p(x, y) µdf,p(x, y) > ε exp θn(1 λd)α

for n large enough. The supremum is taken over x, y M.

The constant µ from the statement is ﬁxed throughout this manuscript and depends only on p and d. It was originally deﬁned in (Howard and Newman, 1997, Lemma 3). The constant θ depends on ϵ, p, f and M. Proposition 5 is derived from a related result by Hwang et al. (2016), in which the authors establish the convergence of a sample statistic known as the power-weighted shortest path to the population Fermat distance. For p > 1 and points x, y M, the power-weighted shortest path between x, y is deﬁned as

LXn,p(x, y) = inf γ

i=0 d M(xi+1, xi)p (1)

where the inﬁmum is taken over all paths γ = (x0, . . . , xk+1) in Xn of ﬁnite length with x0 = x, xk+1 = y.

Theorem 6 (Hwang et al., 2016, Theorem 1) Let p > 1 and ε > 0. Suppose that (bn)n 1 is a sequence of positive real numbers such that log(n)

nbdn 0 as n goes to inﬁnity. Then, there exists a constant θ > 0 (which depends on ε) such that

sup x,y M d M(x,y) bn

n(p 1)/d LXn,p(x, y)

df,p(x, y) µ

exp( θ(nbd n)α)

for all suﬃciently large n, where the supremum is taken over x, y M with d M(x, y) bn.

Fern andez, Borghini, Mindlin and Groisman

As explained in the paragraph following Theorem 1 in (Hwang et al., 2016, p. 2793), the requirement that log(n)

nbdn 0 is necessary in order to obtain a nontrivial upper bound for the probability. Note that in Theorem 6, the convergence holds for the set of points x, y M with d M(x, y) greater than some sequence (bn). However, since we will be interested in studying the Gromov Hausdorﬀconvergence of the associated metric spaces (see (2) below), it is necessary to have uniform control of the convergence of the estimated distance for all points in the manifold. The uniform convergence is one of the main improvements upon Theorem 6 we show in Proposition 5. Also, notice that the proposed statistic LXn,p of df,p is based on the previous knowledge of the inherited Riemannian distance d M. In the general data analysis setting, only a sample of points in a Euclidean space is given. Under the assumption that points lie on an (unknown) manifold M, the goal is to ﬁnd an estimator of the intrinsic distance df,p that can be completely computed from the sample. In Proposition 5, we prove that sample Fermat distance d Xn,p is indeed a good estimator of df,p. Proposition 5 arises as a natural continuation of Theorem 6. The main idea of the proof is to show that any segment that is part of any shortest path with respect to d Xn,p will be arbitrarily small with high probability if n is large enough. This will allow us to deduce that the power-weighted distance is well approximated by the sample Fermat distance. We defer the proof to Appendix A. We will next estimate the Gromov Hausdorﬀdistance between the metric space Xn with an appropriate re-scaling of the sample Fermat distance d Xn,p and M endowed with the population Fermat distance df,p. Recall that the Gromov Hausdorﬀdistance d GH is a metric on the (isometry classes of) compact metric spaces that, roughly speaking, quantiﬁes how diﬃcult it is to match every point of a metric space (X, ρX) with some point of another space (Y, ρY). More formally, it is deﬁned as

d GH (X, ρX), (Y, ρY) := inf{d H(h1(X), h2(Y))}, (2)

where the inﬁmum is over all the isometric embeddings h1 : X W, h2 : Y W in a common metric space W and d H stands for the Hausdorﬀdistance. We will employ the following equivalent characterization of the Gromov-Hausdorﬀdistance, which is often more convenient:

d GH (X, ρX), (Y, ρY) = 1

2 inf R sup (x,y),(x ,y ) R |ρX(x, x ) ρY(y, y )|, (3)

where the inﬁmum is taken over subsets R X Y such that the projections πX(R) = X, πY(R) = Y. We are now ready to state our main theorem. For notational convenience, we set dn,p = n(p 1)/d

µ d Xn,p, the re-scaled sample Fermat distance on Xn.

Theorem 7 Let ε > 0 and λ (p 1)/pd, 1/d . There exists a constant θ > 0 such that

P d GH((M, df,p), (Xn, dn,p)) > ε exp θn(1 λd)α

for n large enough and α = 1/(d + 2p).

Intrinsic Persistent Homology via Density-based Metric Learning

Before presenting the proof of Theorem 7, we will need a preliminary lemma which asserts that, with high probability, no point of M is too far from the nearest point of the sample. The argument of this proof is standard, but we include it in Appendix A for the reader s convenience.

Lemma 8 For any κ > 0, the event sup x M d M(x, Xn) n(κ 1)/d

holds with probability at most exp( θnκ) for some constant θ > 0 if n is large enough.

We are now in position to prove Theorem 7. Proof [Theorem 7] In order to compute the Gromov Hausdorﬀdistance between (M, df,p) and (Xn, dn,p), we consider in (3) the relation

R = {(xi, xi): xi Xn} {(xy, y): y M, df,p(xy, y) = df,p(Xn, y)}.

By a simple application of the triangle inequality we get that

d GH (M, df,p), (Xn, dn,p) 1

sup x,y Xn |df,p(x, y) dn,p(x, y)| + 2 sup y M df,p(Xn, y)

Observe that the two terms on the right hand side of the previous inequality can be bounded above by Proposition 5 and Lemma 8 respectively. Given ε > 0, by (4) we have that

P d GH (M, df,p), (Xn, dn,p) > ε/2

sup x,y Xn |df,p(x, y) dn,p(x, y)| > ε/2

sup y M df,p(Xn, y) > ε/4

To bound the ﬁrst term, we apply Proposition 5 to get

sup x,y Xn |df,p(x, y) dn,p(x, y)| > ε/2

exp θn(1 λd)α .

for some positive constant θ and n suﬃciently large. As for the second term, notice that since df,p(x, y) m (p 1)/d f d M(x, y),

Lemma 8 implies

sup y M df,p(Xn, y) > n(α 1)/dm(p 1)/d f

for n large. The proof follows by noticing that the sequence n(α 1)/dm (p 1)/d f converges to 0 as n goes to inﬁnity.

Fern andez, Borghini, Mindlin and Groisman

Remark 9 (Rate of convergence) The rate of convergence in Theorem 7 is related to the ﬂuctuations of n p 1

d d Xn,p(x, y) around µdf,p(x, y) or, more coarsely, the variance of n p 1

d d Xn,p(x, y) (Damron and Wang (2016) provide strong evidence that the bias can be bounded by the variance). It is expected that this variance decreases as a power of n, i.e.

cn ζ Var n p 1

d d Xn,p(x, y) Cn ζ

for a dimension-dependent constant ζ = ζ(d) > 0. The precise value of ζ(d) is a still open problem in probability theory in the context of First Passage Percolation (Howard and Newman (2001); Auﬃnger et al. (2017)). For d = 1 it can be proved that ζ = 1. For d 2 it is widely believed (Auﬃnger et al., 2017) that the exponent should not depend on p and that for d = 2 we should have ζ(2) = 2/3. For d 3 it is not clear what the value of ζ(d) should be. If we write ζ(d) = 2(χ(d) 1)/d, it is expected that χ(d) should decrease with the dimension but there is not agreement on whether there exists some critical dimension dc such that χ(d) = 0 for d dc or even if we should have χ(d) 0 as d (Auﬃnger et al., 2017, Section 3). In Howard and Newman (2001) non-optimal rigorous bounds have been proven for Euclidean First Passage Percolation that in our context read

P d GH M, df,p , Xn, n(p 1)/d

µ d Xn,p > n 1

d+ε C1 exp ( C0nε)

for positive constants C0, C1 depending on ε > 0. This bound follows immediately in our case when M is the closure of a bounded open and convex set and f is constant on M. For the general case considered in this manuscript we expect to have similar bounds. Obtaining those bounds would be highly valuable, but its analysis is out of the scope of this paper. We refer the reader to Little et al. (2022) for a detailed discussion about the rate of convergence.

3. Fermat-based Persistent Homology

In this section we explore the use of Fermat distance as input in the computation of the persistence diagram associated to a sample of a manifold. We deduce the almost sure convergence of persistence diagrams of the sample Xn with the (re-scaled) sample Fermat distance towards the persistence diagram of (M, df,p). We also show that we expect to read the correct homology of M for a longer parameter interval in the diagram associated to the sample Xn computed with Fermat distance as compared with the use of Euclidean distance. Finally, we prove that Fermat-based persistence diagrams are robust to the presence of outliers for homology degree greater than 0.

3.1 Convergence of Persistence Diagrams

We start by brieﬂy recalling the main concepts and results in persistent homology theory and refer the reader to the works by Chazal et al. (2014, 2016) for a more thorough exposition. For the computation of the persistent homology of a point cloud, one imagines each point as a ball (that is, representing a small surrounding region) and builds a combinatorial model for the space connecting the points according to whether the corresponding regions intersect. More precisely, for every ﬁxed value of a parameter or scale that controls the size

Intrinsic Persistent Homology via Density-based Metric Learning

of the region that each point represents, one gets a simplicial complex (i.e., a higher dimensional analogue of a graph). This family of simplicial complexes, also known as a ﬁltration, is the input of the procedure to compute persistent homology. Indeed, the topological features of this family of complexes change as the scale parameter grows: diﬀerent connected components join in one, some loops are ﬁlled, new cavities appear, etc. By analyzing these transitions, we are able to assign a birth and a death value to each of these features, and the diﬀerence between them represents its persistence. The most persistent features represent topological signatures, whereas the shortest intervals may be considered as noise. The output of this procedure is summarized in an object called persistence diagram. We next give the formal deﬁnitions. Given a (possibly inﬁnite) metric space (X, ρ), a ﬁltration over the real numbers Filt(X, ρ) = (Filtϵ(X, ρ))ϵ R is a family of simplicial complexes with vertex set X such that Filtϵ(X) Filtϵ (X) whenever ϵ ϵ . For the purposes of this article, we are going to consider only some natural ﬁltrations that are strongly linked to the metric ρ. The ˇCech ﬁltration consists of a family of simplicial complexes (ˇCechϵ(X))ϵ R where a set of points [x0, . . . , xk] forms a k-simplex of ˇCechϵ(X) if the intersection of the k + 1 closed balls Bρ(xi, ϵ) is non empty. Equivalently, ˇCechϵ(X) is the nerve of the cover { Bρ(x, ϵ): x X}. The ˇCech complex is the most natural way to build a simplicial complex associated to a space, since in favourable cases, it allows to recover its homotopy type as a consequence of the Nerve Theorem (Hatcher, 2002, 4.G). However, the construction of the ˇCech complex is expensive from a computational point of view, since it requires to check for a large number of intersections. To circumvent this issue, one can instead consider the Vietoris Rips ﬁltration (Ripsϵ(X))ϵ R. The k-simplices of Ripsϵ(X) are sets [x0, . . . , xk] such that ρ(xi, xj) ϵ for all 0 i, j k. Equivalently, Ripsϵ(X) can be deﬁned as the ﬂag complex of ˇCechϵ(X) (that is, the clique complex of the 1-skeleton of ˇCechϵ(X)). If X is a subset of the Euclidean space RD, then one have ˇCechϵ(X) Rips2ϵ(X) ˇCech

2D/(D+1)ϵ(X); see e.g. Theorem

2.5. by de Silva and Ghrist (2007b). In this sense, the Rips complex is a computationally eﬃcient approximation of the ˇCech complex. Other ﬁltrations involving lower dimensional simplices, such as the Alpha ﬁltration by Edelsbrunner et al. (1983), can also be considered in our context. For any ﬁltration as above, it is clear that the topology of the complexes Filtϵ(X) will typically change as ϵ increases. This evolution is appropriately captured by considering the homology groups (over a ﬁeld k) of the nested family of simplicial complexes. One gets in this way a sequence of vector spaces (H (Filtϵ(X)))ϵ R, where the inclusions Filtϵ(X) Filtϵ (X) induce canonical linear maps H (Filtϵ(X)) H (Filtϵ (X)) in homology. Under some conditions, such as ﬁniteness of X (Edelsbrunner et al., 2002; Zomorodian and Carlsson, 2005), this sequence can be decomposed as a direct sum of intervals I[ϵb, ϵd] deﬁned as

0 0 0 0 0 k 1 1 k | {z } [ϵb,ϵd]

Every interval is determined by the birth and death parameters ϵb and ϵd respectively, and it can be interpreted as a topological feature of X with an associated lifetime ϵd ϵb (note that ϵd may be inﬁnite, in that case the feature has inﬁnite lifetime). The (multi)set of points (ϵb, ϵd) is called the persistence diagram of (X, ρ) and is denoted dgm(Filt(X, ρ))

Fern andez, Borghini, Mindlin and Groisman

(or simply dgm(Filt(X)) if ρ is clear from the context). Persistence diagrams are contained in the half (extended) plane above the diagonal = {(x, y): x = y}. For technical reasons, the diagonal is considered as part of every persistence diagram with inﬁnite multiplicity. Chazal et al. (2009, 2016, 2014) proved that, within a more abstract persistent framework, it is possible to extend the deﬁnition of persistence diagrams to some cases where the sequence might not be interval-decomposable. In particular, it is shown by Chazal et al. (2014) that if X is a compact metric space, for every value of ϵ at most a ﬁnite number of new topological features appear (even though the vector spaces (H (Filtϵ(X)))ϵ R may be inﬁnite-dimensional) and hence dgm(Filt(X)) is well-deﬁned. Notice also that all the deﬁnitions can be extended to ﬁltrations indexed over connected subsets of the real line.

Example 2 (Eyeglasses) We compute the persistence diagram associated to the Vietoris Rips ﬁltration of the sample points from Example 1, Figure 1. We compare the results obtained with diﬀerent distant choices: the Euclidean distance, the k-NN estimator of the inherited Riemannian distance for k = 4 and k = 5 and the sample Fermat distance for p = 2.5 and p = 3. We also considered a weighted Vietoris Rips ﬁltration derived by a DTM-function with parameters m = 0.01 and p = 1 (see Anai et al. (2019) and Remark 17). The homology of the eyeglasses curve has one generator of H0 and one generator of H1. However, it can be noticed that for either Euclidean and k-NN distance for k 5, the persistence diagram displays two salient generators for the ﬁrst homology group H1, which can be attributed to the small reach of the manifold. As it can be seen in Figure 2, smaller values of k fail to capture the geometry of the eyeglasses manifold. A similar situation is presented using the Vietoris Rips DTM-ﬁltration. Finally, for the Vietoris Rips ﬁltration using Fermat distance for diﬀerent choices of p, the diagrams show accurately only one persistent generator for H1. On the other hand, the number of noticeable connected components increases with p. This eﬀect is caused by the presence of noisy points in regions of extremely low density, becoming isolated points (or outliers) as p evolves (cf. Remark 16).

Since in our setup we usually only get an approximation of the metric space under consideration, we will be interested in comparing persistence diagrams built on top of different metric spaces. In this sense, the bottleneck distance is a frequently used quantity to measure the diﬀerence between two persistence diagrams. Given persistence diagrams dgm1 and dgm2, consider all perfect matchings M dgm1 dgm2 such that every point of dgm1 and dgm2 is paired exactly once in M. Note that points in dgm1 and dgm2 are allowed to be paired with points in the diagonal . The bottleneck distance db(dgm1, dgm2) is then deﬁned as the inﬁmum, over all such matchings M as before, of the largest ℓ -distance between matched pairs. That is,

db(dgm1, dgm2) = inf M max (x,y) M |x y| .

The stability theorem (Cohen-Steiner et al., 2007; Chazal et al., 2014) ensures continuity (more precisely, Lipschitz continuity) in the process of computing persistence diagrams for a metric space. This means that small perturbations in the original metric space (in the sense of Gromov Hausdorﬀ) will translate into an at most proportional perturbation in the

Intrinsic Persistent Homology via Density-based Metric Learning

Figure 2: Persistence diagrams (lifetime) associated to the eyeglasses point cloud with noise for diﬀerent ﬁltrations. Top: Vietoris Rips ﬁltration with Euclidean distance and k-NN distance for k = 4 and k = 5. Bottom: Vietoris Rips DTM-ﬁltration with parameters m = 0.01 and p = 1 and Vietoris Rips ﬁltration with Fermat distance for p = 2.5 and p = 3.

corresponding persistence diagram (in the sense of the bottleneck distance). Formally, it states that for any two precompact metric spaces X and Y

db dgm Filt(X, ρX) , dgm Filt(Y, ρY) 2d GH (X, ρX), (Y, ρY) . (5)

This fact is exploited by Fasy et al. (2014); Chazal et al. (2015) to establish the almost sure convergence (in the sense of bottleneck distance) of the persistence diagrams associated to samples of a compact metric space drawn according to a measure satisfying certain hypotheses to the persistence diagram of the space. In these works the distance function of the underlying metric space is assumed to be known, and it is inherited by the sample. We are able to obtain convergence of persistence diagrams in our context, in which only an estimator of the underlying metric is available. Concretely, given the metric spaces (M, df,p) and (Xn, dn,p), from the estimation of its Gromov Hausdorﬀdistance of Theorem 7 and the stability theorem (5) we deduce the following result.

Fern andez, Borghini, Mindlin and Groisman

Corollary 10 Let ε > 0 and λ (p 1)/pd, 1/d . There exists a constant θ > 0 such that

P db dgm(Filt(M, df,p)), dgm(Filt(Xn, dn,p)) > ε exp θn(1 λd)α

for n large enough and α = 1/(d + 2p).

3.2 Homology Inference

The content of Corollary 10 is that dgm(Filt(Xn, dn,p)) is (asymptotically) a good estimator of dgm(Filt(M, df,p)). On the other hand, if we were to employ the Euclidean distance | |, it follows from the results by Chazal et al. (2015) that the sample persistence diagrams dgm(Filt(Xn, | |)) converge to dgm(Filt(M, | |)) under reasonable hypotheses. We are therefore interested in comparing for how long we may expect to read the correct homology of M in each of the diagrams dgm(Filt(M, dn,p)) and dgm(Filt(M, | |)) in terms of two natural geometric measures associated to the manifold, namely, the reach and the convexity radius (see Hausmann, 1995; Latschev, 2001; Niyogi et al., 2008; Chazal and Lieutier, 2008). In this section we show that the homology of (M, df,p) can be recovered correctly from its persistence diagram up to the convexity radius conv(M, df,p), whereas for (M, | |) this is guaranteed only up to its reach τM. Notice that the reach of a submanifold of an Euclidean space depends strongly on the particular embedding, whereas the convexity radius is an intrinsic quantity linked to the geometry of the manifold. There are simple examples of manifolds in which this distinction is relevant to correctly recover its homology from a sample (see Examples 1 and 3). Recall that given X RD a closed subset, the medial axis Med(X) of X is deﬁned as

Med(X) := {y RD : d E(y, X) = |p y| for at least two diﬀerent points p RD},

where d E(y, X) = infx X |y x|. The reach τX of X, ﬁrst introduced by Federer (1959), is the minimum distance from X to Med(X), that is,

τX := inf x X d E(x, Med(X)).

Given a Riemannian manifold (N, g), we will say that a subset S N is geodesically convex if for every two points in S, there is a unique geodesic segment that connects them and it is completely contained in S. The convexity radius conv(N, x) at a point x N is the supremum over those r > 0 for which the (geodesic) ball B(x, r) is geodesically convex. The convexity radius conv(N) of the manifold N is deﬁned as

conv(N) := inf x N conv(N, x).

Proposition 11 Let M be a compact submanifold of RD. Then, we have the following homotopy equivalences:

ˇCechϵ(M, | |) M for ϵ < τM and Ripsϵ(M, | |) M for ϵ < 2 q

2D τM, and both bounds are optimal, in the sense that there exist examples for which the homotopy equivalence does not hold for larger values of ϵ.

Intrinsic Persistent Homology via Density-based Metric Learning

ˇCechϵ(M, df,p) M and Ripsϵ(M, df,p) M for ϵ < conv(M, df,p).

Moreover, if df,p coincides up to a constant with d M (i.e. f is uniform), we have the estimate

conv(M, df,p) = Vol(M, d M)(p 1)/dconv(M, d M) Vol(M, d M)(p 1)/d π

Proof The fact that ˇCechϵ(M, | |) is homotopy equivalent to M for ϵ < τM is an immediate consequence of the Nerve Theorem. The same result implies that ˇCechϵ(M, df,p) M for ϵ < conv(M, df,p), since geodesically convex sets are always contractible and the intersection of geodesically convex sets is again geodesically convex. Regarding the Vietoris Rips ﬁltration, the fact that the simplicial complex Ripsϵ(M, | |) is homotopy equivalent to M

for ϵ < 2 q

2D τM can be deduced from (Kim et al., 2020, Theorem 20). Finally, since df,p is a Riemannian distance on M, by Hausmann (1995) there is an explicit homotopy equivalence Ripsϵ(M, df,p) M for ϵ < conv(M, df,p) (see also Latschev, 2001). The optimality of the bound ϵ < τM for ˇCechϵ(M, | |) is clear (think of a unit sphere in RD), and indeed, typically the topology of ˇCechϵ(M, | |) changes when ϵ attains τM. A critical example for the Vietoris Rips complex is the standard 1-dimensional circle S1, and it can be derived from the main result of Adamaszek and Adams (2017), similarly as in (Kim et al., 2020, Example 24). The last assertion in the statement follows directly from the inequalities

conv(M, d M) min π 2 sup K , 1

2inj(M, d M)

(see Cheeger and Ebin, 1975, 5.14) and

inj(M, d M) πτM, K 1 τ 2 M

(see Aamari et al., 2019, Proposition A.1). Here inj(M, d M) is the injectivity radius of M and K is the sectional curvature.

Example 3 Consider a planar ellipse ER,ε with minor axis of length ε and major axis of length R ε. By letting R + and/or ε 0, we see that the convexity radius of a closed submanifold of R2 can be arbitrarily large while its reach can be arbitrarily small. A similar example can be constructed in RD, being M a d-dimensional ellipsoid for any d < D. The same phenomenon can be achieved by constructing diﬀerent eyeglasses curves with arbitrarily large length and constant reach, Figure 3. Its population persistence diagrams diﬀer as predicted by Theorem 11. The persistence diagram computed with the Euclidean distance captures the right homology only for ϵ less that the reach. In contrast, for the Fermat distance the correct homology is captured for radii as large as (a multiple of) the convexity radius, which can be made large enough by enlarging the bridge between the glasses.

Fern andez, Borghini, Mindlin and Groisman

Figure 3: Left: Eyeglasses curves, uniformly sampled (250 points). In both cases, the reach is 0.5. Below each curve, we plot a thickening of the samples with Euclidean balls of radius slightly greater than the reach. Right: Persistence diagrams (lifetime) associated to the Vietoris Rips ﬁltration for both the Euclidean distance and the re-scaled Fermat distance dn,p with p = 2. While H0 is correctly estimated in both cases by reading the persistence diagrams, the ones computed with the Euclidean distance displays two salient generators for the ﬁrst homology group H1, inaccurately suggesting two cycles. The second cycle s birth is at the level of twice the reach. For the (re-scaled) Fermat distance, the diagrams shows correctly only one persistent generator for H1.

3.3 Robustness to Outliers

Persistence diagrams are highly sensitive to outliers (see Bendich et al., 2011; Chazal et al., 2011; Buchet et al., 2016; Anai et al., 2019). We will see that the computation of persistence homology using Fermat distance is robust to the presence of outliers for positive degree. Concretely, given a sample Xn M and Y RD M a ﬁnite set of points in the complement of M in the ambient Euclidean space the outliers we prove that dgmk(Rips(Xn Y, d Xn Y,p)) coincides with dgmk(Rips(Xn, d Xn,p)) for k > 0 up to some reasonable ﬁltration parameter. First we need a deﬁnition.

Intrinsic Persistent Homology via Density-based Metric Learning

Deﬁnition 12 Given a ﬁnite set of points S RD, deﬁne the minimal spacing of S as

κ(S) = min x S d E(x, S {x}),

where d E denotes the Euclidean distance between sets.

Proposition 13 Let δ = min{κ(Y ), d E(Xn, Y )} and p > 1. Then, for every ϵ < δp

Ripsϵ(Xn Y, d Xn Y,p) = Ripsϵ(Xn, d Xn,p) Y.

In particular, for all k > 0

dgmk(Rips<δp(Xn Y, d Xn Y,p)) = dgmk(Rips<δp(Xn, d Xn,p)),

where Rips<δp(X, ρX) stands for Ripsϵ(X, ρX)

ϵ<δp, i.e., the Rips ﬁltration up to parameter δp of a metric space (X, ρX).

Proof Let us estimate the distance between two given points in Xn Y with respect to d Xn Y,p in terms of δ and d Xn,p. If x Xn and y Y ,

d Xn Y,p(x, y) d Xn Y,p(Xn, Y ) = d E(Xn, Y )p δp.

If y, y Y , d Xn Y,p(y, y ) d Xn Y,p(y, Y {y}) δp.

For the second inequality, notice that if y Y is such that d Xn Y,p(y, Y {y}) = d Xn Y,p(y, y) = len(γ), the geodesic γ between y and y either involves only points from Y or there exist some point x Xn in γ. In the ﬁrst case d Xn Y,p(y, y) κ(Y )p whereas in the second case d Xn Y,p(y, y) 2d E(Xn, Y )p. Given x, x Xn, let γ be a minimal path between x, x , so that d Xn Y,p(x, x ) = len(γ). If d Xn Y,p(x, x ) < ϵ, then γ only involves points in Xn since otherwise ϵ len(γ) 2d E(Xn, Y ) 2δp, which is a contradiction. Hence, d Xn Y,p(x, x ) = d Xn,p(x, x ).

We deﬁne now a geometric notion of outliers. Recall that given Xn RD, the ε-graph Gε(Xn) is the undirected graph with the points of Xn as vertices and an edge connecting xi and xj Xn whenever |xi xj| < ε.

Deﬁnition 14 Let Xn M be a sample of M RD and Y RD M be a ﬁnite set of points. Let ε := min{ε > 0 : Gε(Xn) is connected} and δ = min{κ(Y ), d E(Xn, Y )}. We say that Y are (geometric) outliers if δ > ε .

We show next that for this notion of outliers, the upper bound on the parameter for the Rips ﬁltration of Proposition 13 is not restrictive for suﬃciently large p. Indeed, let diamp(Xn) be the diameter of (Xn, d Xn,p). Note that for every ϵ diamp(Xn) the simplicial complex Ripsϵ(Xn, d Xn,p) equals the standard (n 1)-simplex n 1, with trivial topology (and hence persistence diagrams are not interesting for scales larger than this threshold). The next result states that provided that p is large enough, the persistence diagrams of (Xn, d Xn,p) and (Xn Y, d Xn Y,p) coincide up to the ﬁltration parameter diamp(Xn).

Fern andez, Borghini, Mindlin and Groisman

Corollary 15 Given Xn a sample of M and Y RD a ﬁnite set of outliers, then for all k > 0

dgmk(Rips<diamp(Xn)(Xn Y, d Xn Y,p)) = dgmk(Rips<diamp(Xn)(Xn, d Xn,p)).

for p > C log(n) with C = log(δ/ϵ ) 1.

Proof There is an upper bound diamp(Xn) nεp . Since Y are outliers, ε < δ . For

p > C log(n), δ ε

p > n and consequently, diamp(Xn) < δp. The result now follows from Proposition 13.

Remark 16 In general, the persistence diagram of (Xn Y, d Xn Y,p) for degree k = 0 does not coincide with the diagram of the metric space without outliers (Xn, d Xn,p). However, if Y is a set of geometric outliers, it is related to the corresponding persistence diagrams of Xn and Y through the following formula:

dgm0(Rips(Xn Y, d Xn Y,p)) = dgm< 0 (Rips(Xn, d Xn,p)) dgm0(Rips(Q, d Q)).

Here, dgm< denotes the bounded persistence intervals and Q = (Y Xn)/Xn is the quotient metric space endowed with the induced metric d Q.

Remark 17 (DTM) Filtrations classically used for the computation of persistent homology of Euclidean point clouds, such as the ˇCech or Vietoris Rips ﬁltrations, are very sensitive to the presence of outliers. That is, ˇCech (or Vietoris Rips) ﬁltrations computed on top of Xn and Xn Y might be very diﬀerent (its interleaving distance depends on d H(Xn, Xn Y ), see e.g. Chazal et al. (2014)). To overcome this limitation, Anai et al. (2019) introduced weighted ﬁltrations based on the notion of distance to measure (DTM). Given µ the empirical measure of Xn RD and m [0, 1) a parameter, the DTM-function over RD is deﬁned

as dµ,m(x) := q

1 m R m 0 δ2 µ,t(x)dt, where δµ,t(x) = inf{r 0: µ( B(x, r)) > t} and B(x, r) denotes the closed Euclidean ball with center x and radius r. Given a parameter p > 1, the weighted ball Bdµ,m(x, ϵ) with center x Xn and radius ϵ dµ,m(x) is the Euclidean

ball B(x, rx(ϵ)) with radius rx(ϵ) = (ϵp dp µ,m(x))1/p (if ϵ < dµ,m(x), it is empty). The ˇCech DTM-ﬁltration (V DTM m,p (Xn))ϵ>0 with parameters (m, p) is the weighted ˇCech ﬁltration constructed as the nerve of the cover {Bdµ,m(x, ϵ) : x Xn} for every ϵ > 0. A DTM-based version of a weighted Vietoris Rips ﬁltration can also be derived. DTM-ﬁltrations of Euclidean point clouds produce ﬁltrations (and hence, persistence diagrams) less sensitive to outliers, given that the (interleaving) distance between V DTM m,p (Xn) and V DTM m,p (Xn Y ) is upper bounded not only in terms of d H(Xn, Xn Y ) but also in terms of the Wasserstein distance between the measures µXn and µXn Y . However, if Xn is a sample of a manifold M, these ﬁltrations are still very sensitive to the particular embedding of the manifold in RD. This is consequence of the dependence of the DTM-function on the ambient space (see Example 4). Its (lack of) dependence on non-intrinsic properties has been investigated thereafter. In this direction, a generalization of DTM-ﬁltrations for general metric spaces (X, ρ) is considered in (Buchet et al., 2016).

Intrinsic Persistent Homology via Density-based Metric Learning

Figure 4: A (noisy) sample of 1500 points from the trefoil knot with outliers (red).

Example 4 (Trefoil) Consider the embedding of a topological circle S1 in R3 given by the trefoil knot. In particular, it is homeomorphic to S1 and its homology has just one generator in H0 (one connected component) and one generator in H1 (one 1-dimensional cycle). Given a (noisy) sample of 1500 points from the trefoil knot with 10 outliers, Figure 4, we compute its persistence diagram for diﬀerent choices of ﬁltrations and compare them with the case without the outliers, Figure 5. For the Vietoris Rips ﬁltration using Euclidean distance, the small reach of the embedding produces a persistence diagram with four persistent generators for H1 in both cases, with and without outliers (cf. Example 3). If we use k-NN distances, the presence of outliers aﬀects the accuracy of the topological features captured in the persistence diagram, which presents four salient generators for H1 instead of the single generator recovered from the sample without outliers. For the Vietoris Rips DTM-ﬁltration, we observe that the diagrams are comparable both in absence and presence of outliers. However, the dependence of the embedding of the construction is reﬂected in the incorrect number of generators for H1 with long persistence. Finally, the persistence diagram computed from the Vietoris Rips ﬁltration using Fermat distance remains unaﬀected in presence of outliers for degree 1 (Corollary 15), and it shows correctly a single salient generator of H1. For degree 0, the diagram is related to the diagram of the sample without the outliers and the diagram of the outliers themselves (cf. Remark 16).

3.4 Computational Complexity

Our proposed pipeline for the computation of Fermat-based persistent homology consists of the precomputation of Fermat distance in the input sample Xn, followed by the computation of persistent homology from the metric space (Xn, d Xn,p) described by the distance matrix. The computation of the matrix of pairwise sample Fermat distances between points in Xn has complexity O(n3). However, it can be reduced to O(n2 log2 n) with high probability by restricting the computation of shortest paths to the k-NN graph on top of Xn with k = O(log n) (see Section 2.3 in Groisman et al. (2022), Little et al. (2022); Chu et al. (2020)).

Fern andez, Borghini, Mindlin and Groisman

Figure 5: Persistence diagrams associated to the Vietoris-Rips ﬁltration of the sample of the trefoil knot using Euclidean distance, k-NN distance with k = 10, DTM weight and Fermat distance with p = 3 of the sample without outliers Xn (left) and the sample with outliers Xn Y (right) respectively. When Fermat distance is used, the persistence diagram of Xn Y for degree 1 equals the diagram of Xn (without outliers). For degree 0, it decomposes as the union of the subdiagram of ﬁnite intervals of Xn, dgm< 0 (Rips(Xn, d Xn,p)), and the diagram dgm0(Rips(Q, d Q)) of the quotient space Q = (Y Xn)/Xn.

Intrinsic Persistent Homology via Density-based Metric Learning

On the other hand, the standard algorithm used to compute persistent homology was ﬁrst introduced by Edelsbrunner et al. (2002) and it is based on the Gaussian reduction of the boundary matrix. Persistent homology for degree up to k depends on the (k+1)-skeleton of the ﬁltration and the worst case computational complexity is cubical in the number N of simplices of dimension at most k + 1 (Morozov, 2005; Otter et al., 2017). An alternative algorithm for the reduction of the boundary matrix, introduced by Milosavljevi c et al. (2011), has complexity O(Nω), with ω the matrix multiplication coeﬃcient. At present, the best bound for ω is 2.376 (Coppersmith and Winograd, 1987). In practice, computation of persistent homology has lower complexity. For Vietoris Rips ﬁltrations, the worst case complexity is for k-dimensional persistent homology is O n k+2 3 = O n3(k+2) with n the number of vertices of Xn. However, Giunti et al. (2022) proved that, for instance, the average complexity for the reduction of the boundary matrix of degree 1 is upper bounded by O(n5 log2(n)). Moreover, they showed that this upper bound seems to be not tight, since experimental simulations show that the average cost of the reduction of the 1-boundary matrix follows a curve of around O(n3.73). Overall, our proposed pipeline based on the precomputation of pairwise Fermat distance in Xn does not increase the complexity of the total persistent homology computation.

4. Applications to Signal Analysis

In this section we present a method for change-point detection and pattern recognition in time series through the analysis of topological features (see also Maleti c et al. (2016); Perea (2019); Perea and Harer (2015)). This method is illustrated by a series of experiments in both synthetic and real data. In the experiments, the use of Fermat distance (as opposed to Euclidean distance) is observed to lead to more robust inference of the topology of the underlying space. We remark that in these examples the data does not necessarily verify the i.i.d. assumption. Fermat and k-NN distances are computed using the library Fermat (Aristas, 2018), while Ripser by Bauer (2021) is employed for the computation of persistence diagrams associated to Vietoris Rips ﬁltrations. All the computations are over the ﬁeld k = Z2. The code for all the examples and experiments can be found in the repository by Fernandez (2021a).

4.1 Topological Analysis of Time Series.

Time-delay embeddings of scalar time-series data is a well-known technique to recover the underlying dynamics of a system. Takens (1981) theorem gives conditions under which a smooth attractor can be reconstructed from a generic observable function, with dimensional bounds related to those of the Whitney Embedding Theorem. It implies in particular that if X(t) is a real valued signal (which is assumed to be one of the coordinates of a ﬂow given by a system of diﬀerential equations), then the delay coordinate map

t 7 X(t), X(t + τ), X(t + 2τ) . . . , X(t + (D 1)τ)

is an embedding of an orbit. Here D is the embedding dimension and τ is the time delay. From a theoretical point of view, D is the number of variables of the original system. However, in practice the underlying equations describing the dynamical system are not

Fern andez, Borghini, Mindlin and Groisman

available. Thus, dynamics are often analyzed by studying the topology of their attractors; i.e., invariant subsets of the phase space towards which the system tends to evolve (Birman and Williams, 1983; Smale, 1967; Gilmore and Lefranc, 2002). If the attractor is a smooth manifold M of dimension d, under certain conditions Takens theorem implies that the delay embedding of the signal with D 2d + 1 is diﬀeomorphic to M. We describe now an approach based on intrinsic persistence diagrams to study geometry of attractors and pattern recognition in time series by means of the analysis of the time evolving topological organization of the embedded ﬂow. Let (x1, x2, . . . , xn) be a time series, i.e. a ﬁnite sample of a signal X : [0, T] R such that for evenly spaced points 0 = t1 < t2 < < tn = T, xi = X(ti) for all 1 i n. Given D and τ, compute the delay embedding of the time series

Xn = {(xi, xi+τ, xi+2τ, . . . , xi+(D 1)τ) : 1 i n (D 1)τ} RD.

Then, for p > 1, endow Xn with a metric space structure induced by the sample Fermat distance d Xn,p. The persistence diagram of the delay embedding (Xn, d Xn,p) quantiﬁes information about the homology of the attractor associated to the underlying dynamical system.

Example 5 (Reconstruction of Lorenz attractor) The parameters associated to the delay coordinate reconstruction for a time series can be determined following some heuristics (e.g. false nearest neighbors to determine the embedding dimension (Kennel et al., 1992)). However, in case of noisy data, the embedding dimension is often over-estimated and it may have a great impact on the phase space reconstruction. Indeed, in high dimensional spaces, any two points of a typical large set are at similar Euclidean distance (Aggarwal et al., 2001). This phenomenon is part of what is known as the curse of dimensionality. For this reason, the choice of an intrinsic distance is crucial to recover the right topological features of a space embedded in high dimension. Consider the strange attractor associated to the Lorenz (1963) system

x = σ(y x), y = x(ρ z) y, z = xy βz

when (σ, ρ, β) = (10, 28, 8/3). In Figure 6 we take a numerical integration ϕ(t, v0) of (6) with dt = 0.01, satisfying the initial condition ϕ(0, v0) = v0 with v0 = (1, 1, 1). We inspect the time series corresponding to the x-coordinate with additive Gaussian noise with variance 0.1, and recover topological information of the attractor from the delay embedding (see also Maleti c et al. (2016)). Notice that in this case, although the number of variables in the underlying system is 3, the dimension of the attractor is d = 2 so the embedding dimension estimated by Takens theorem is greater than or equal to 5. The persistence diagram of the delay embedding reconstruction is computed with time delay τ = 10 and embedding dimensions D = 3, 4 and 5, Figure 6. Here, a uniform downsampling from the original point cloud of 10000 points is computed, to obtain a new point cloud of 3400 points.

Intrinsic Persistent Homology via Density-based Metric Learning

Figure 6: From top to bottom: The x-coordinate time series with Gaussian noise (variance = 0.1) of the Lorenz attractor. The original trajectory and the delay embedding of the noisy x-coordinate time series with D = 3 and τ = 10. Persistence diagrams associated to the delay embedding computed with Euclidean and Fermat distances for embedding dimension D = 3, D = 4 and D = 5 and time delay τ = 10.

Fern andez, Borghini, Mindlin and Groisman

The Lorenz attractor is homotopy equivalent to the eight-space with two holes corresponding to the equilibrium points that the trajectory never reaches. As Figure 6 reveals, the use of Fermat distance leads to robustly capturing the intrinsic two 1-cycles for the diﬀerent embedding dimensions, while this is not the case for the Euclidean distance.

Example 6 (Periodicity) A periodic dynamic within a noisy system might be robustly captured using time-delay embeddings. Indeed, embeddings of periodic signals have the topology of a cycle. However, the general success of the reconstruction of the intrinsic cyclic geometry is highly dependent on the choice of the delay parameter τ (and the embedding dimension D). In practice, classic heuristics based on time-delayed mutual information (Fraser and Swinney, 1986) and false nearest neighbors (Kennel et al., 1992) are used, but they present high sensitiveness to noise. We show that the use of Fermat distance when recovering the intrinsic geometry of delay embeddings has stability properties with respect to the choice of τ. Consider the function f(t) = cos(t) + cos(3t) with additive Gaussian noise of variance 0.4. For a sample of 2000 points of the noisy signal in consideration at the interval [0, 100], the classic heuristic estimations of the optimal parameters outputs τ = 28 and D = 8 (here, the computations are preformed with the package Time Series from the software Giotto-tda by Tauzin et al. (2020)). However, the associated time-delay embedding presents low reach value and, hence, it is still hard to capture its homology with standard methods (see Figure 7). In general dynamics, the eﬀect of the choice of τ is reﬂected in changes in the embedding of the associated attractor in the ambient space. Although Takens theorem theoretically establishes diﬀeomorphic embeddings for diﬀerent choices of τ, in practice the accuracy of the reconstruction of the underlying manifold usually depends on the choice of τ. Crucially, persistence diagrams computed using Fermat distance are less dependent of extrinsic properties and hence, highly appropriate for the estimation of topological properties of the attractor (that are, indeed, independent of the embedding). To illustrate the stability properties with respect to the choice of the delay parameter, we computed the delay embedding of the noisy periodic signal of Figure 7 in R8 for a range of values of τ. We observe that, while the features displayed on the diagrams computed using Euclidean distance change with the embedding, the ones computed using Fermat distance are consistent: they all display a single generator for H1 (Figure 8). Here, p was set equal to 6, but similar results can be obtained for a range of values of p.

In order to identify changes in patterns of time series, we investigate the topological evolution in time of the delay embedding. For every sample time tj [0, T] (1 j n (D 1)τ), consider the delay embedding Xj of the restriction of the time series up to time tj, with the metric structure inherited from (Xn, d Xn,p). That is,

Xj := {(xi, xi+τ, xi+2τ, . . . , xi+(D 1)τ) : 1 i j} Xn.

If M[0, t] is the delay embedding of the restricted signal X|[0,t], the time evolving series of diagrams {dgm(Rips(Xi)) : 1 j n (D 1)τ} is a sample of an approximation of the curve t 7 dgm(Rips(M[0, t])), (7)

Intrinsic Persistent Homology via Density-based Metric Learning

Figure 7: Top: Periodic signal with noise, deﬁned as f(t) = cos(t) + cos(3t) with additive Gaussian noise of variance 0.4. Bottom left: Delay embedding (projection 3d to the ﬁrst coordinates) with the optimal values of the parameters, i.e D = 8, τ = 28, according to the canonical heuristics (embedding of the signal without noise in dark orange). Bottom right: Persistence diagrams (degree 1 only) of the embedding of the signal without and with noise, computed using the Euclidean distance and Fermat distance for p = 6.

where M[0, t] is considered a metric subspace of M = M[0, T] endowed with the population Fermat distance. Finally, compute

db dgm(Rips(Xi)), dgm(Rips(Xi 1))

ti ti 1 (8)

as an approximate the ﬁrst order derivative of (7). Shifts in patterns in the signal can be detected from the sample as peaks in the bottleneck distance between consecutive persistence diagrams. Some applications of this technique follow below.

Example 7 (Anomaly detection in ECG) The purpose of this example is to present a computational method of automated detection of abnormal heartbeats (arrhythmia) through the topological analysis of a delay embedding of ECG signals. We consider the record sel102

Fern andez, Borghini, Mindlin and Groisman

Figure 8: Top: Time-delay embeddings in R8 (projection 3d to the ﬁrst coordinates) for τ = 15, 25, 35, 45 of the signal f(t) = cos(t) + cos(3t) with additive Gaussian noise of variance 0.4 (cf. Fig. 7). Bottom: Persistence diagrams (degree 1 only) using Euclidean distance and Fermat distance (for p = 6, but similar outputs are obtained for a range of values of p).

of the QT Database from the freely-available repository of medical research data Physio Net MIT, Figure 9. Regular heartbeats are characterized by a periodic pattern (Lilly and School, 2016, Ch.4). The delay embedding in R3 of a normal ECG has hence a cyclic topology induced by the periodic behavior of the time series (see Perea, 2019; Emrani et al., 2014). However, every time that an irregular heartbeat occurs, a new cycle arises in the embedding. We compute the associated persistence diagram for a normal period and for a period that includes an anomalous heartbeat. All delay embeddings were computed with a stride of t = 2, obtaining point clouds of up to 3000 points from the original sample of size 6000. Persistent cycles in H1 in diagrams computed using Euclidean distance are not in correspondence with the periodicity pattern and the anomaly. Indeed, at the periodic interval [0, 4000] there are two salient generators for H1. On the contrary, by using Fermat distance, an initial cycle for

Intrinsic Persistent Homology via Density-based Metric Learning

Figure 9: Top: ECG signal (anomaly in blue). Middle: Bottleneck distance between consecutive persistence diagrams associated to time evolving embeddings of the ECG signal. Bottom: Delay embedding in R3 with τ = 15. The associated persistence diagrams at degree 1 using Euclidean distance and Fermat distance with p = 2 for the embedding of the signal in the periods of time [0, 4000] and [0, 6000].

the periodic pattern and a second cycle in the irregular period that accounts for the anomaly are distinctly detected (here, the choice of p = 2 is related to the weight we give to the density when computing Fermat distances; that is, we set p so that the exponent p 1

d equals 1, where d = 1 is the dimension of the curve). Moreover, the moment immediately following the occurrence of the anomaly can be detected using persistent homology of time evolving delay embeddings. Indeed, the estimator (8) of the ﬁrst derivative of the time evolving persistent diagrams features a prominent peak when the topology of the embedding changes. Lower peaks are also present as the result of the noisy real record.

Fern andez, Borghini, Mindlin and Groisman

Figure 10: Top: Record of the air sac pressure of canary during a song. Bottom: Delay embedding in R3 with time delay τ = 500 and its associated persistence diagram using Fermat distance with p = 1.5.

Example 8 (Pattern recognition in birdsongs) During song production, canaries use a set of air sac pressure gestures with characteristic shapes to generate diﬀerent patterns of sound (or syllables). Pressure patterns of diﬀerent syllables constitute a diverse set: they can be either almost harmonic oscillations, high frequency ﬂuctuations or oscillations presenting wiggles. The recognition of song syllables from the air sac pressure series is a well-studied problem in non-linear dynamical systems (Mindlin and Laje, 2006; Alonso et al., 2009).

We provide a topological method to detect the number of diﬀerent syllables in a canary song from the (noisy) record of the ﬂuctuations of its air sac pressure X(t), Figure 10 (data provided by the Laboratory of Dynamical Systems from the Department of Physics of the University of Buenos Aires). Given the time delay embedding of the time series X(t) with τ = 500 and D = 3, its associated persistence diagram computed using Fermat distance with p = 1.5 shows four prominent generators for the ﬁrst homology group, which are in correspondence with the four diﬀerent patterns observed in the time series (see Figure 11). Indeed, the embedding of each syllable is topologically a cycle (see Perea, 2019; Perea and Harer, 2015). However, this decomposition is not available beforehand so the study of the global topology of the embedding of the entire time series is necessary in order to analyze the complete song. Here, prior to the computation of the persistence diagram, we down-sampled the original time series at evenly spaced times with stride t = 100, obtaining a subsample of size 3000 from the original T 300000 points.

We can also detect the moments at which changes of syllables take place during the song. The estimator (8) of the ﬁrst derivative of the path of persistence diagrams associated to the

Intrinsic Persistent Homology via Density-based Metric Learning

Figure 11: Top: Bottleneck distance between consecutive persistence diagrams associated to time evolving embeddings (moving average curve with window of time 500). Peaks are related to changes in the pattern of the air sac pressure record of the canary song. Bottom: Delay embedding of each detected syllable.

time evolving delay embeddings presents peaks followed by an exponential decay each time a new pattern arises, Figure 11.

5. Conclusions and Future Work

We introduced the use of density-based asymptotically intrinsic distances in point clouds to reconstruct the homology of a manifold from a noisy sample. In most of the standard approaches, persistent homology computed from Euclidean samples of manifolds lacks of two relevant properties: robustness to outliers and independence of the embedding in the ambient space. Whereas each of these properties has been studied separately in previous works, we present a simple method that is able to achieve both at the same time. Our proposal is based on the use of Fermat distance when computing persistence diagrams of samples of manifolds. The key point is that, although this distance deforms the inherited geometry of the manifold, it produces intrinsic persistence diagrams that are more robust to outliers. Concretely, we provided rigorous proofs of convergence of the persistence diagrams of the associated metric spaces, robustness to a simple model of outliers and dependence of the persistence intervals on intrinsic (but not extrinsic) attributes of the underlying manifold. Furthermore, we showed experimentally that our technique is stable

Fern andez, Borghini, Mindlin and Groisman

under to a wider range of noisy situations, including real datasets. We intend to extend our results to more general models of outliers and noise in future works. Finally, a detailed comparison of our approach with other related methods, like DTM-ﬁltrations and the use of Euclidean distance and the intrinsic k-NN distance in the construction of Vietoris-Rips ﬁltrations, is also presented.

Acknowledgments

We are grateful to Luis Scoccola and Jeﬀrey Giansiracusa for many useful discussions and suggestions during the preparation of this article. We also acknowledge the anonymous reviewers and the associate editor for many helpful comments that greatly improved the manuscript. X. F. is a member of the Centre for Topological Data Analysis funded by the EPSRC grant EP/R018472/1. P. G. is partially supported by CONICET grant PIP 2021 11220200102825CO and UBACy T grant 20020190100293BA. G. M. is partially supported by PICT MAX PLANCK 4681 and PICT 00619.

Appendix A. Proof of Auxiliary Results

The purpose of this appendix is to present formal proofs of Proposition 5 and Lemma 8. Recall that M RD is a closed submanifold of dimension d D and Xn M is an i.i.d. sample of size n with common density f > 0. Given p > 1, we set α = 1/(d + 2p). Proposition 5 will be derived from Theorem 6 by Hwang et al. (2016). We start with a series of results to show that any segment that is part of any shortest path with respect to d Xn,p is arbitrarily small with high probability for n large enough. This will allow us to prove that the sample Fermat distance uniformly well-approximates the power-weighted distance (1).

Proposition 18 Given b > 0 and ε > 0, there exists θ > 0 such that

n(p 1)/dd Xn,p(x, y)

df,p(x, y) µ

for n large enough, where the supremum is taken over all x, y M with d M(x, y) b.

Proof Given ε > 0 and b > 0, by Theorem 6 there exists θ > 0 such that for every x, y M with d M(x, y) b,

n(p 1)/d LXn,p(x, y)

df,p(x, y) µ > ε (9)

with probability at most exp( θnα) (notice that here we set the sequence bn to be constantly b). Let x, y M and let γ = (x0, . . . , xk+1) be the shortest path between x, y with respect to LXn,p. That is,

LXn,p(x, y) =

i=0 d M(xi+1, xi)p.

Intrinsic Persistent Homology via Density-based Metric Learning

Since |xi+1 xi| d M(xi+1, xi),

LXn,p(x, y)

i=0 |xi+1 xi|p d Xn,p(x, y).

Thus, by (9), the inequality

n(p 1)/dd Xn,p(x, y)

df,p(x, y) µ > ε

holds with probability bounded by exp( θnα).

Corollary 19 Let b0 > 0. Let x, y M be such that they belong to some minimal path between points in M with respect to d Xn,p. Then,

P(|x y| > b0) exp( θnα)

for some constant θ > 0, provided n is large enough.

Proof Fix ε0 > 0. By Proposition 18, there exists a constant θ > 0 such that

sup u,v n(p 1)/dd Xn,p(u, v)

df,p(u, v) > µ + ε0

for all n suﬃciently large, where the supremum is taken over u, v M such that d M(u, v) b0. On the other hand, note that since M is compact the diameter diamp(M) of M with respect to the distance df,p is ﬁnite. Hence,

n(p 1)/d (µ + ε0) diamp(M)

n(p 1)/d (µ + ε0) bp 0

for all u, v M with d M(u, v) b0 and all n suﬃciently large. Suppose now that x, y M belong to some shortest path between points of M with respect to d Xn,p, say u and v, but that |x y| > b0. Then, clearly d Xn,p(u, v) |x y|p

and d M(u, v) > b0 (since otherwise d Xn,p(u, v) |u v|p < bp 0). We remark here that x and y do not necessarily belong to the sample Xn. From the previous computations, it follows that whenever n is large enough, with probability at least 1 exp( θnα),

|x y|p d Xn,p(u, v) df,p(u, v)

n(p 1)/d (µ + ε0) bp 0,

as we wanted to show.

Fern andez, Borghini, Mindlin and Groisman

Remark 20 (see Bernstein et al., 2000, Corollary 4 or Boissonnat et al., 2019, Lemma 3) Let (M, g) be a smooth compact Riemannian manifold embedded in RD. Given δ > 0, there exists ε > 0 such that for every x, y M with |x y| < ε,

d M(x, y) (1 + δ)|x y|.

We are now able to prove a new version of Theorem 6 in which the proposed estimator of df,p is the sample Fermat distance (rather than the power-weighted shortest path).

Proposition 21 Fix ε > 0 and a sequence of positive real numbers (bn)n 1 satisfying that log(n)

nbdn 0 when n . Then, for every p > 1, there exists θ > 0 such that

n(p 1)/dd Xn,p(x, y)

df,p(x, y) µ

exp θ(nbd n)α

for n large enough, where the supremum is taken over x, y M with d M(x, y) bn.

Proof Let δ > 0 be a small number to be ﬁxed later. The strategy of the proof consists of showing that, with probability exponentially high in (nbd n)α, LXn,p(x, y) and d Xn,p(x, y) coincide up to a factor of (1 + δ)p for all x, y M with d M(x, y) bn. Once that is established, the proof follows readily by applying Theorem 6. Notice in ﬁrst place that by Remark 20, there exists η > 0 such that d M(x, y) (1 + δ)|x y| whenever x, y M, |x y| < η. By Corollary 19, we may assume that |u v| < η for every u, v M belonging to a minimal path with probability exponentially high in nα. Let x, y M be two points with d M(x, y) bn. Since by our assumptions every segment in a shortest path from x to y with respect to d Xn,p has Euclidean length at most η, it is not diﬃcult to see that

d Xn,p(x, y) LXn,p(x, y) (1 + δ)pd Xn,p(x, y). (10)

Now, by Theorem 6, the probability that n(p 1)/d LXn,p(x, y)

df,p(x, y) µ

is exponentially high in (nbd n)α, provided n is large enough. We will check that for δ > 0 suﬃciently small, the desired inequality for d Xn,p follows if we assume that the event from (11) occurs. It is clear by (10) and (11) that

n(p 1)/dd Xn,p(x, y)

df,p(x, y) µ < ε

As for the other inequality, notice that

2 < (1 + δ)p n(p 1)/dd Xn,p(x, y)

df,p(x, y) µ

+ ((1 + δ)p 1)µ.

Intrinsic Persistent Homology via Density-based Metric Learning

Hence, for δ > 0 small enough we have

ε < n(p 1)/dd Xn,p(x, y)

df,p(x, y) µ

as desired.

Finally, we promote the convergence of the sample Fermat distance from Proposition 21 to a uniform convergence in probability (that is, for any pair of points x, y M regardless of the distance between them). Such uniform convergence may be accomplished by choosing a sequence (bn)n 1 which converges to 0 at an adequate rate. This step is instrumental in order to prove the Gromov Hausdorﬀconvergence of the sample metric spaces (Xn, dn,p) to (M, df,p) (see Theorem 10 and its proof). Proof [Proposition 5] Roughly, the strategy of the proof consists in bounding the quantity

|n(p 1)/dd Xn,p(x, y) µdf,p(x, y)|

splitting in two cases according to whether the distance d M(x, y) is greater than or smaller than some appropriately chosen sequence bn > 0. More precisely, we will set bn = n λ for some λ ((p 1)/pd, 1/d). Let ε > 0. Since λ < 1/d, clearly the sequence log(n)

n 1 converges to 0 as n goes to inﬁnity and hence, by Proposition 21 the bound n(p 1)/dd Xn,p(x, y)

df,p(x, y) µ

holds with probability at most exp( θ(nbd n)α) = exp( θn(1 λd)α) for some θ > 0 and all x, y M with d M(x, y) n λ provided n is large enough (here ε > 0 is a small number to be determined). Denote by diam(M) the diameter of M with respect to the distance d M. Since df,p(x, y) m (p 1)/d f d M(x, y) m (p 1)/d f diam(M), we see that the event

|n(p 1)/dd Xn,p(x, y) µdf,p(x, y)| > m (p 1)/d f diam(M)ε

also holds with probability bounded from above by exp( θn(1 λd)α) for the same θ > 0 as before, whenever d M(x, y) n λ. By setting ε = ε(m (p 1)/d f diam(M)) 1 we obtain the desired bound for x, y M with d M(x, y) n λ. For the remaining case, take x, y M satisfying d M(x, y) n λ and notice in ﬁrst place that

df,p(x, y) m (p 1)/d f d M(x, y) m (p 1)/d f n λ.

Hence, for n suﬃciently large, µdf,p(x, y) ε/2. On the other hand, since by deﬁnition of d Xn,p it is d Xn,p(x, y) |x y|p d M(x, y)p n λp,

we see that n(p 1)/dd Xn,p(x, y) n(p 1)/d λp. The hypothesis on λ implies that the exponent of n in the last inequality is negative and thus n(p 1)/dd Xn,p(x, y) ε/2 provided n is

Fern andez, Borghini, Mindlin and Groisman

large. Summing up, we conclude that there exists n0 such that for all x, y M with d M(x, y) n λ and n n0,

|n(p 1)/dd Xn,p(x, y) µdf,p(x, y)| ε,

which completes the proof of the proposition.

We turn now to the proof of Lemma 8, which follows ideas from Cuevas and Rodr ıguez Casal (2004) and (M emoli and Sapiro, 2005, Section 5).

Deﬁnition 22 (see Lee, 2018, Chapter 5) The injectivity radius inj(N) of a Riemannian manifold (N, g) is deﬁned as

inj(N) := inf x N inj(N, x),

where inj(N, x) is the largest radius for which the exponential map is a diﬀeomorphism.

Proof [Lemma 8] Since M is compact, its injectivity radius inj(M) is strictly positive. Then, by an inequality of Croke (see Croke, 1980, Proposition 14), there exists a constant c = c(d) > 0 such that every metric ball B in M of radius r < inj(M)

2 has volume at least c(d)rd. Since we can assume that κ < 1 without loss of generality, for all n suﬃciently large we have n(κ 1)/d < inj(M)

2 . From this point, we follow the strategy from the proof of (Cuevas and Rodr ıguez-Casal, 2004, Theorem 3). Let Pn be the maximum number of disjoint balls of radius n(κ 1)/d

4 contained in M this is known as packing number, see for example (Niyogi et al., 2008, Section 5) and take {B1, . . . , BPn} a set of disjoint balls of radius n(κ 1)/d

4 in M. It is clear then that

Pn Vol(M) min1 j Pn Vol(Bj) Vol(M)4d

for n so large that n(κ 1)/d < inj(M)

2 . Now, suppose that x M veriﬁes d M(x, Xn) > n(κ 1)/d. Since the balls 2B1, . . . , 2BPn cover M (where 2Bj stands for the ball with the same center as Bj but with twice the radius) the distance from x to some center of these balls is at most n(κ 1)/d

2 and thus there should be no point from the sample in some ball 2Bj. A simple computation reveals that the probability that some random variable xi Xn does not belong to 2Bj is at most 1 mf Vol(2Bj). By the independence of the random variables x1, . . . , xn, if n is large enough

i=1 {xi 2Bj}

1 mf Vol(2Bj) n 1 mfc(d)nκ 1 n.

We conclude that

P sup x M d M(x, Xn) n(κ 1)/d

i=1 {xi 2Bj}

(1 mfc(d)nκ 1)n Pn.

Since Pn grows at most like a polynomial in n, (1 mfc(d)nκ 1)n Pn exp( θnκ) for an appropriate θ > 0 and n big enough, as we wanted to show.

Intrinsic Persistent Homology via Density-based Metric Learning

Eddie Aamari, Jisu Kim, Fr ed eric Chazal, Bertrand Michel, Alessandro Rinaldo, and Larry Wasserman. Estimating the reach of a manifold. Electron. J. Stat., 13(1):1359 1399, 2019.

Micha l Adamaszek and Henry Adams. The Vietoris-Rips complexes of a circle. Paciﬁc J. Math., 290(1):1 40, 2017.

Charu C. Aggarwal, Alexander Hinneburg, and Daniel A. Keim. On the surprising behavior of distance metrics in high dimensional space. In International Conference on Database Theory, pages 420 434. Springer, 2001.

Leandro M. Alonso, Jorge A. Alliende, Franz Goller, and Gabriel B. Mindlin. Lowdimensional dynamical model for the diversity of pressure patterns used in canary song. Physical Review E, 79(4):041929, 2009.

Hirokazu Anai, Fr ed eric Chazal, Marc Glisse, Yuichi Ike, Hiroya Inakoshi, Rapha el Tinarrage, and Yuhei Umeda. DTM-based ﬁltrations. In 35th International Symposium on Computational Geometry, volume 129 of LIPIcs. Leibniz Int. Proc. Inform., pages Art. No. 58, 15. Schloss Dagstuhl. Leibniz-Zent. Inform., Wadern, 2019.

Aristas. Fermat package. http://www.aristas.com.ar/fermat/, 2018.

Antonio Auﬃnger, Michael Damron, and Jack Hanson. 50 years of First-passage Percolation, volume 68 of Univ. Lect. Ser. Providence, RI: American Mathematical Society (AMS), 2017. ISBN 978-1-4704-4183-8; 978-1-4704-4356-6.

Mukund Balasubramanian and Eric L. Schwartz. The Isomap algorithm and topological stability. Science, 295 5552:7, 2002.

Ulrich Bauer. Ripser: eﬃcient computation of Vietoris-Rips persistence barcodes. J. Appl. Comput. Topol., 5(3):391 423, 2021. ISSN 2367-1726. doi: 10.1007/s41468-021-00071-5. URL https://doi.org/10.1007/s41468-021-00071-5.

Paul Bendich, Taras Galkovskyi, and John Harer. Improving homology estimates with random walks. Inverse Problems, 27(12):124002, 14, 2011. ISSN 0266-5611.

Mira Bernstein, Vin De Silva, John C. Langford, and Joshua B. Tenenbaum. Graph approximations to geodesics on embedded manifolds, 2000.

Joan S. Birman and R. F. Williams. Knotted periodic orbits in dynamical systems. I. Lorenz s equations. Topology, 22(1):47 82, 1983.

Jean-Daniel Boissonnat, Fr ed eric Chazal, and Mariette Yvinec. Geometric and Topological Inference. Cambridge Texts in Applied Mathematics. Cambridge University Press, Cambridge, 2018.

Jean-Daniel Boissonnat, Andr e Lieutier, and Mathijs Wintraecken. The reach, metric distortion, geodesic convexity and the variation of tangent spaces. J. Appl. Comput. Topol., 3(1-2):29 58, 2019.

Fern andez, Borghini, Mindlin and Groisman

Micka el Buchet, Fr ed eric Chazal, Steve Y Oudot, and Donald R Sheehy. Eﬃcient and robust persistent homology for measures. Computational Geometry, 58:70 96, 2016.

Joseph Minhow Chan, Gunnar Carlsson, and Raul Rabadan. Topology of viral evolution. Proceedings of the National Academy of Sciences, 110(46):18566 18571, 2013.

Fr ed eric Chazal and Andr e Lieutier. Smooth manifold reconstruction from noisy and nonuniform approximation with guarantees. Comput. Geom., 40(2):156 170, 2008.

Fr ed eric Chazal, David Cohen-Steiner, Marc Glisse, Leonidas J Guibas, and Steve Y. Oudot. Proximity of persistence modules and their diagrams. In Proceedings of the Twenty-ﬁfth Annual Symposium on Computational Geometry, pages 237 246, 2009.

Fr ed eric Chazal, David Cohen-Steiner, and Quentin M erigot. Geometric inference for probability measures. Foundations of Computational Mathematics, 11(6):733 751, 2011.

Fr ed eric Chazal, Vin de Silva, and Steve Oudot. Persistence stability for geometric complexes. Geom. Dedicata, 173:193 214, 2014.

Fr ed eric Chazal, Marc Glisse, Catherine Labru ere, and Bertrand Michel. Convergence rates for persistence diagram estimation in topological data analysis. J. Mach. Learn. Res., 16: 3603 3635, 2015.

Fr ed eric Chazal, Vin de Silva, Marc Glisse, and Steve Oudot. The Structure and Stability of Persistence Modules. Springer Briefs in Mathematics. Springer, Cham, 2016.

JeﬀCheeger and David G. Ebin. Comparison Theorems in Riemannian Geometry. North Holland Publishing Co., Amsterdam-Oxford; American Elsevier Publishing Co., Inc., New York, 1975. North-Holland Mathematical Library, Vol. 9.

Timothy Chu, Gary L. Miller, and Donald R. Sheehy. Exact computation of a manifold metric, via Lipschitz embeddings and shortest paths on a graph. In Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, pages 411 425. SIAM, Philadelphia, PA, 2020.

Michael B. Cohen, Brittany Terese Fasy, Gary L. Miller, Amir Nayyeri, Donald R. Sheehy, and Ameya Velingker. Approximating nearest neighbor distances. In Proceedings of the Algorithms and Data Structures Symposium, 2015.

David Cohen-Steiner, Herbert Edelsbrunner, and John Harer. Stability of persistence diagrams. Discrete Comput. Geom., 37(1):103 120, 2007. ISSN 0179-5376.

Don Coppersmith and Shmuel Winograd. Matrix multiplication via arithmetic progressions. In Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, pages 1 6, 1987.

Christopher B. Croke. Some isoperimetric inequalities and eigenvalue estimates. Ann. Sci. Ecole Norm. Sup. (4), 13(4):419 435, 1980.

Antonio Cuevas and Alberto Rodr ıguez-Casal. On boundary estimation. Adv. in Appl. Probab., 36(2):340 354, 2004.

Intrinsic Persistent Homology via Density-based Metric Learning

Michael Damron and Xuan Wang. Entropy reduction in Euclidean ﬁrst-passage percolation. Electron. J. Probab., 21:Paper No. 65, 23, 2016.

Vin de Silva and R. Ghrist. Coverage in sensor networks via persistent homology. Algebraic & Geometric Topology, 7:339 358, 2007a.

Vin de Silva and Robert Ghrist. Coverage in sensor networks via persistent homology. Algebr. Geom. Topol., 7:339 358, 2007b.

Herbert Edelsbrunner and John Harer. Persistent homology a survey. In Surveys on Discrete and Computational Geometry, volume 453 of Contemp. Math., pages 257 282. Amer. Math. Soc., Providence, RI, 2008.

Herbert Edelsbrunner, David G. Kirkpatrick, and Raimund Seidel. On the shape of a set of points in the plane. IEEE Trans. Inform. Theory, 29(4):551 559, 1983.

Herbert Edelsbrunner, David Letscher, and Afra Zomorodian. Topological persistence and simpliﬁcation. Discrete Comput. Geom., 28(4):511 533, 2002. ISSN 0179-5376. Discrete and computational geometry and graph drawing (Columbia, SC, 2001).

Saba Emrani, Thanos Gentimis, and Hamid Krim. Persistent homology of delay embeddings and its application to wheeze detection. IEEE Signal Processing Letters, 21(4):459 463, 2014.

Brittany T. Fasy, Fabrizio Lecci, Alessandro Rinaldo, Larry Wasserman, Sivaraman Balakrishnan, and Aarti Singh. Conﬁdence sets for persistence diagrams. Ann. Statist., 42 (6):2301 2339, 2014.

Herbert Federer. Curvature measures. Trans. Amer. Math. Soc., 93:418 491, 1959.

Ximena Fernandez. Github repository: Intrinsic persistent homology. https://github. com/ximenafernandez/intrinsic PH, 2021a. Code for the computational examples.

Ximena Fernandez. Intrinsic persistent homology. https://www.youtube.com/watch?v= 1l P9ndi M60o, 2021b. Prepared in the context of the Tutorial-a-thon 2021, organized by the Applied Algebraic Topology Research Network.

Andrew M. Fraser and Harry L. Swinney. Independent coordinates for strange attractors from mutual information. Phys. Rev. A, 33:1134 1140, Feb 1986.

Rickard B. Gabrielsson, Bradley J. Nelson, Anjan Dwaraknath, and Primoz Skraba. A topology layer for machine learning. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artiﬁcial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 1553 1563. PMLR, 26 28 Aug 2020.

Marian Gidea and Yuri Katz. Topological data analysis of ﬁnancial time series: landscapes of crashes. Phys. A, 491:820 834, 2018.

Robert Gilmore and Marc Lefranc. The Topology of Chaos. Wiley-Interscience [John Wiley & Sons], New York, 2002. Alice in Stretch and Squeezeland.

Fern andez, Borghini, Mindlin and Groisman

Barbara Giunti, Guillaume Houry, and Michael Kerber. Average complexity of matrix reduction for clique ﬁltrations. ar Xiv preprint ar Xiv:2111.02125, 2022.

Chad Giusti, Eva Pastalkova, Carina Curto, and Vladimir Itskov. Clique topology reveals intrinsic geometric structure in neural correlations. Proceedings of the National Academy of Sciences, 112(44):13455 13460, 2015.

Pablo Groisman, Matthieu Jonckheere, and Facundo Sapienza. Nonhomogeneous Euclidean ﬁrst-passage percolation and distance learning. Bernoulli, 28(1):255 276, 2022.

Allen Hatcher. Algebraic Topology. Cambridge University Press, Cambridge, 2002.

Jean-Claude Hausmann. On the Vietoris-Rips complexes and a cohomology theory for metric spaces. In Prospects in Topology (Princeton, NJ, 1994), volume 138 of Ann. of Math. Stud., pages 175 188. Princeton Univ. Press, Princeton, NJ, 1995.

C. Douglas Howard and Charles M. Newman. Euclidean models of ﬁrst-passage percolation. Probab. Theory Related Fields, 108(2):153 170, 1997. ISSN 0178-8051.

C. Douglas Howard and Charles M. Newman. Geodesics and spanning trees for Euclidean ﬁrst-passage percolation. Ann. Probab., 29(2):577 623, 2001.

Sung Jin Hwang, Steven B. Damelin, and Alfred O. Hero, III. Shortest path through random points. Ann. Appl. Probab., 26(5):2791 2823, 2016.

Matthew B. Kennel, Reggie Brown, and Henry D. I. Abarbanel. Determining embedding dimension for phase-space reconstruction using a geometrical construction. Phys. Rev. A, 45:3403 3411, Mar 1992.

Jisu Kim, Jaehyeok Shin, Fr ed eric Chazal, Alessandro Rinaldo, and Larry Wasserman. Homotopy Reconstruction via the Cech Complex and the Vietoris-Rips Complex. In So CG 2020 - 36th International Symposium on Computational Geometry, Zurich, Switzerland, June 2020.

Janko Latschev. Vietoris-Rips complexes of metric spaces near a closed Riemannian manifold. Arch. Math. (Basel), 77(6):522 528, 2001.

John M. Lee. Introduction to Riemannian Manifolds, volume 176 of Graduate Texts in Mathematics. Springer, Cham, 2018.

Leonard S. Lilly and Harvard Medical School. Pathophysiology of Heart Disease: A Collaborative Project of Medical Students and Faculty. Wolters Kluwer, 2016.

Anna Little, Daniel Mc Kenzie, and James M. Murphy. Balancing geometry and density: path distances on high-dimensional data. SIAM J. Math. Data Sci., 4(1):72 99, 2022.

Edward N. Lorenz. Deterministic nonperiodic ﬂow. J. Atmospheric Sci., 20(2):130 141, 1963.

Slobodan Maleti c, Yi Zhao, and Milan Rajkovi c. Persistent topological features of dynamical systems. Chaos, 26(5):053105, 14, 2016.

Intrinsic Persistent Homology via Density-based Metric Learning

Daniel Mckenzie and Steven Damelin. Power Weighted Shortest Paths for Clustering Euclidean Data. Foundations of Data Science, 1(3):307, 2019.

Facundo M emoli and Guillermo Sapiro. Distance functions and geodesics on submanifolds of Rd and point clouds. SIAM J. Appl. Math., 65(4):1227 1260, 2005. ISSN 0036-1399.

Nikola Milosavljevi c, Dmitriy Morozov, and Primoz Skraba. Zigzag persistent homology in matrix multiplication time. In Proceedings of the Twenty-Seventh Annual Symposium on Computational Geometry, So CG 11, page 216 225, New York, NY, USA, 2011. Association for Computing Machinery. ISBN 9781450306829.

Gabriel B. Mindlin and Rodrigo Laje. The Physics of Birdsong. Springer Science & Business Media, 2006.

Laboratory for Computational Physiology MIT. Physionet databases. https://physionet. org/about/database/.

Dmitriy Morozov. Persistence algorithm takes cubic time in worst case. Bio Geometry News, Dept. Comput. Sci., Duke Univ, 2, 2005.

Partha Niyogi, Stephen Smale, and Shmuel Weinberger. Finding the homology of submanifolds with high conﬁdence from random samples. Discrete Comput. Geom., 39(1-3): 419 441, 2008.

Nina Otter, Mason A. Porter, Ulrike Tillmann, Peter Grindrod, and Heather A. Harrington. A roadmap for the computation of persistent homology. EPJ Data Science, 6:1 38, 2017.

Jose A. Perea. Topological times series analysis. Notices Amer. Math. Soc., 66(5):686 694, 2019.

Jose A. Perea and John Harer. Sliding windows and persistence: an application of topological methods to signal analysis. Found. Comput. Math., 15(3):799 838, 2015.

Sajama and Alon Orlitsky. Estimating and computing density based distance metrics. In Proceedings of the 22nd International Conference on Machine Learning, ICML 05, page 760 767. Association for Computing Machinery, 2005.

Facundo Sapienza, Pablo Groisman, and Matthieu Jonckheere. Weighted geodesic distance following Fermat s principle. In International Conference on Learning Representation, 2018.

Stephen Smale. Diﬀerentiable dynamical systems. Bull. Amer. Math. Soc., 73:747 817, 1967. ISSN 0002-9904.

Floris Takens. Detecting strange attractors in turbulence. In Dynamical Systems and Turbulence, Warwick 1980 (Coventry, 1979/1980), volume 898 of Lecture Notes in Math., pages 366 381. Springer, Berlin-New York, 1981.

Guillaume Tauzin, Umberto Lupo, Lewis Tunstall, Julian Burella P erez, Matteo Caorsi, Anibal Medina-Mardones, Alberto Dassatti, and Kathryn Hess. giotto-tda: A topological data analysis toolkit for machine learning and data exploration, 2020.

Fern andez, Borghini, Mindlin and Groisman

Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319 2323, 2000.

Christopher J. Tralie and Jose A. Perea. (Quasi)periodicity quantiﬁcation in video data, using topology. SIAM J. Imaging Sci., 11(2):1049 1077, 2018.

Pascal Vincent and Yoshua Bengio. Density-sensitive metrics and kernels. In Snowbird Learning Workshop, 2003.

Afra Zomorodian and Gunnar Carlsson. Computing persistent homology. Discrete Comput. Geom., 33(2):249 274, 2005. ISSN 0179-5376.