# Robust Persistence Diagrams using Reproducing Kernels

Siddharth Vishwanath (The Pennsylvania State University, suv87@psu.edu) · Kenji Fukumizu (The Institute of Statistical Mathematics, fukumizu@ism.ac.jp) · Satoshi Kuriki (The Institute of Statistical Mathematics, kuriki@ism.ac.jp) · Bharath Sriperumbudur (The Pennsylvania State University, bks18@psu.edu)

Authors arranged alphabetically. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

**Abstract.** Persistent homology has become an important tool for extracting geometric and topological features from data, whose multi-scale features are summarized in a persistence diagram. From a statistical perspective, however, persistence diagrams are very sensitive to perturbations in the input space. In this work, we develop a framework for constructing robust persistence diagrams from superlevel filtrations of robust density estimators constructed using reproducing kernels. Using an analogue of the influence function on the space of persistence diagrams, we establish that the proposed framework is less sensitive to outliers. The robust persistence diagrams are shown to be consistent estimators in bottleneck distance, with the convergence rate controlled by the smoothness of the kernel; this, in turn, allows us to construct uniform confidence bands in the space of persistence diagrams. Finally, we demonstrate the superiority of the proposed approach on benchmark datasets.

## 1 Introduction

Given a set of points $X_n = \{X_1, X_2, \dots, X_n\}$ observed from a probability distribution $P$ on an input space $\mathcal{X} \subseteq \mathbb{R}^d$, understanding the shape of $X_n$ sheds important insight on the low-dimensional geometric and topological features which underlie $P$, and this question has received increasing attention in the past few decades. To this end, Topological Data Analysis (TDA), with a special emphasis on persistent homology [20, 44], has become a mainstay for extracting shape information from data. In statistics and machine learning, persistent homology has facilitated the development of novel methodology (e.g., [8, 11, 14]), which has been widely used in a variety of applications dealing with massive, unconventional forms of data (e.g., [5, 22, 43]).

Informally speaking, persistent homology detects the presence of topological features across a range of resolutions by examining a nested sequence of spaces, typically referred to as a filtration. The filtration encodes the birth and death of topological features as the resolution varies, and is presented in the form of a concise representation: a persistence diagram or barcode. In the context of data analysis, there are two different methods for obtaining filtrations. The first is computed from the pairwise Euclidean distances of $X_n$, such as the Vietoris-Rips, Čech, and Alpha filtrations [20]. The second approach is based on choosing a function on $\mathcal{X}$ that reflects the density of $P$ (or its approximation based on $X_n$), and then constructing a filtration. While the two approaches explore the topological features governing $P$ in different ways, in essence, they generate similar insights.

Figure 1: (Left) $X_n$ is sampled from a circle with small perturbations to each point. The persistence diagram detects the presence of the loop, as guaranteed by the stability of persistence diagrams [12, 16]. (Right) $X_n$ is sampled from a circle but with just a few outliers. The resulting persistence diagram changes dramatically: the persistence of the main loop plummets, and other spurious loops appear, as elaborated in Section 2.
Despite obvious advantages, the adoption of persistent homology in mainstream statistical methodology is still limited. An important limitation, among others, in the statistical context is that the resulting persistent homology is highly sensitive to outliers. While the stability results of [12, 16] guarantee that small perturbations of all of $X_n$ induce only small changes in the resulting persistence diagrams, a more pathological issue arises when a small fraction of $X_n$ is subject to very large perturbations. Figure 1 illustrates how inference from persistence diagrams can change dramatically when $X_n$ is contaminated with only a few outliers. Another challenge is the mathematical difficulty of performing sensitivity analysis in a formal statistical context. Since the space of persistence diagrams has an unusual mathematical structure, it falls victim to issues such as non-uniqueness of Fréchet means and unbounded curvature of geodesics [18, 29, 36].

With this background, the central objective of this paper is to develop outlier-robust persistence diagrams, develop a framework for examining the sensitivity of the resulting persistence diagrams to noise, and establish statistical convergence guarantees. To the best of our knowledge, not much work has been carried out in this direction. Bendich et al. [4] construct persistence diagrams from Rips filtrations on $X_n$ by replacing the Euclidean distance with the diffusion distance, Brécheteau and Levrard [7] use a coreset of $X_n$ for computing persistence diagrams from the distance-to-measure, and Anai et al. [2] use weighted Rips filtrations on $X_n$ to construct more stable persistence diagrams. However, no sensitivity analysis of the resulting diagrams is carried out in [2, 4, 7] to demonstrate their robustness.

**Contributions.** The main contributions of this work are threefold. 1) We propose robust persistence diagrams constructed from filtrations induced by an RKHS-based robust KDE (kernel density estimator) [27] of the underlying density function of $P$ (Section 3). While the idea of inducing filtrations by an appropriate function has already been explored ([13, 21, 32] use the KDE, distance-to-measure (DTM) and kernel distance (KDist), respectively), we show the corresponding persistence diagrams to be less robust than our proposal. 2) In Section 4.1, we generalize the notions of influence function and gross error sensitivity, which are usually defined for normed spaces, to the space of persistence diagrams, which lacks a vector space structure. Using these generalized notions, we investigate the sensitivity of persistence diagrams constructed from filtrations induced by different functions (e.g., KDE, robust KDE, DTM) and demonstrate the robustness of the proposed method, both mathematically (Remark 4.3) and numerically (Section 5). 3) We establish the statistical consistency of the proposed robust persistence diagrams and provide uniform confidence bands by deriving exponential concentration bounds for the uniform deviation of the robust KDE (Section 4.2).

**Definitions and Notations.** For a metric space $\mathcal{X}$, the ball of radius $r$ centered at $x \in \mathcal{X}$ is denoted by $B_{\mathcal{X}}(x, r)$. $\mathcal{P}(\mathbb{R}^d)$ is the set of all Borel probability measures on $\mathbb{R}^d$, and $\mathcal{M}(\mathbb{R}^d)$ denotes the set of probability measures on $\mathbb{R}^d$ with compact support and tame density function (see Section 2). $\delta_x$ denotes the Dirac measure at $x$.
For bandwidth $\sigma > 0$, $\mathcal{H}_\sigma$ denotes a reproducing kernel Hilbert space (RKHS) with $K_\sigma : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ as its reproducing kernel. We denote by $\Phi_\sigma(x) = K_\sigma(\cdot, x) \in \mathcal{H}_\sigma$ the feature map associated with $K_\sigma$, which embeds $x \in \mathbb{R}^d$ into $\Phi_\sigma(x) \in \mathcal{H}_\sigma$. Throughout this paper, we assume that $K_\sigma$ is radial, i.e., $K_\sigma(x, y) = \sigma^{-d}\psi(\|x - y\|_2/\sigma)$ with $\psi(\|\cdot\|_2)$ being a pdf on $\mathbb{R}^d$, where $\|x\|_2^2 = \sum_{i=1}^d x_i^2$ for $x = (x_1, \dots, x_d) \in \mathbb{R}^d$. Some common examples include the Gaussian, Matérn and inverse multiquadric kernels. We denote $\|K_\sigma\|_\infty = \sup_{x,y \in \mathbb{R}^d} K_\sigma(x, y) = \sigma^{-d}\psi(0)$. Without loss of generality, we assume $\psi(0) = 1$. For $P \in \mathcal{P}(\mathbb{R}^d)$, $\mu_P = \int K_\sigma(\cdot, y)\, dP(y) \in \mathcal{H}_\sigma$ is called the mean embedding of $P$, and $\mathcal{D}_\sigma = \{\mu_P : P \in \mathcal{P}(\mathbb{R}^d)\}$ is the space of mean embeddings [30].

## 2 Persistent Homology: Preliminaries

We present the necessary background on persistent homology for completeness. See [9, 42] for a comprehensive introduction.

**Persistent Homology.** Let $\varphi : \mathcal{X} \to \mathbb{R}_{\geq 0}$ be a function on the metric space $(\mathcal{X}, d)$. At level $r > 0$, the sublevel set $\mathcal{X}_r = \varphi^{-1}([0, r]) = \{x \in \mathcal{X} : \varphi(x) \leq r\}$ encodes the topological information in $\mathcal{X}$. For $r < s$, the sublevel sets are nested, i.e., $\mathcal{X}_r \subseteq \mathcal{X}_s$. Thus $\{\mathcal{X}_r\}_{0 \leq r < \infty}$ is a nested sequence of topological spaces, called a filtration, denoted by $\mathrm{Sub}(\varphi)$, and $\varphi$ is called the filter function. As the level $r$ varies, the evolution of the topology is captured in the filtration. Roughly speaking, new cycles (i.e., connected components, loops, voids and their higher-order analogues) can appear, or existing cycles can merge. A new $k$-dimensional feature is said to be born at $b \in \mathbb{R}$ when a nontrivial $k$-cycle appears in $\mathcal{X}_b$. The same $k$-cycle dies at level $d > b$ when it disappears in all $\mathcal{X}_{d+\epsilon}$ for $\epsilon > 0$. Persistent homology is an algebraic module which tracks the persistence pairs $(b, d)$ of births $b$ and deaths $d$, with multiplicity $\mu$, across the entire filtration $\mathrm{Sub}(\varphi)$. Mutatis mutandis, a similar notion holds for the superlevel sets $\mathcal{X}^r = \varphi^{-1}([r, \infty))$, inducing the filtration $\mathrm{Sup}(\varphi)$. For $r < s$, the inclusion is reversed, i.e., $\mathcal{X}^s \subseteq \mathcal{X}^r$, and a cycle born at $b$ dies at a level $d < b$, resulting in the persistence pair $(d, b)$ instead. Figure 2 shows three connected components in the superlevel set for $r = 8$. The components are born as $r$ sweeps through the blue points, and die when $r$ approaches the red points. In practice, the filtrations are computed on a grid representation of the underlying space using cubical homology. We refer the reader to Appendix E for more details.

Figure 2: $\mathrm{Dgm}(\mathrm{Sup}(\varphi))$ for $\varphi : \mathbb{R} \to \mathbb{R}$. (Panels: the filter function $\varphi(x)$, its superlevel set for $r = 8$, and the 0th-persistence diagram.)

**Persistence Diagrams.** By collecting all persistence pairs, the persistent homology features are concisely represented as a persistence diagram $\mathrm{Dgm}(\mathrm{Sub}(\varphi)) = \{(b, d) \in \mathbb{R}^2 : 0 \leq b < d \leq \infty\}$. A similar definition carries over to $\mathrm{Dgm}(\mathrm{Sup}(\varphi))$, using $(d, b)$ instead. See Figure 2 for an illustration. When the context is clear, we drop the reference to the filtration and simply write $\mathrm{Dgm}(\varphi)$. The $k$th persistence diagram is the subset of $\mathrm{Dgm}(\varphi)$ corresponding to the $k$-dimensional features. The space of persistence diagrams is the set of locally finite multisets of points in $\Omega = \{(x, y) : 0 \leq x < y \leq \infty\}$, endowed with the family of $p$-Wasserstein metrics $W_p$ for $1 \leq p \leq \infty$. We refer the reader to [18, 19] for a thorough introduction. $W_\infty$ is commonly referred to as the bottleneck distance.
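To make the superlevel filtration of Figure 2 concrete before formalizing distances between diagrams, the following is a small illustrative sketch (ours, not part of the paper's implementation; `superlevel_pd0` and the trimodal example are hypothetical stand-ins) that computes the 0th-order superlevel persistence diagram of a function sampled on a one-dimensional grid, via union-find and the elder rule.

```python
import numpy as np

def superlevel_pd0(values):
    """0th-order persistence diagram of Sup(phi) for phi sampled on a 1-D grid.
    Returns (death, birth) pairs with birth >= death, as in Dgm(Sup(phi))."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)[::-1]          # sweep the level r from high to low
    parent, birth = {}, {}                    # union-find forest and birth levels

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]     # path compression
            i = parent[i]
        return i

    pairs = []
    for i in order:
        i = int(i)
        parent[i], birth[i] = i, float(values[i])   # a component is born at level phi(i)
        for j in (i - 1, i + 1):              # try to merge with already-active neighbours
            if j in parent:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                if birth[ri] < birth[rj]:     # elder rule: the younger component dies
                    young, old = ri, rj
                else:
                    young, old = rj, ri
                pairs.append((float(values[i]), birth[young]))   # (death, birth)
                parent[young] = old
    root = find(int(order[0]))                # the oldest component never dies
    pairs.append((float(values.min()), birth[root]))
    return pairs

# Example: a trimodal function as in Figure 2; prints one (death, birth) pair per peak
x = np.linspace(-4, 4, 801)
phi = 15 * np.exp(-(x + 2)**2) + 10 * np.exp(-x**2 / 0.5) + 8 * np.exp(-(x - 2.5)**2)
print(superlevel_pd0(phi))
```

Each merge pairs the current level (death) with the birth of the younger component, while the globally oldest component survives the whole sweep, mirroring the essential class in $\mathrm{Dgm}(\mathrm{Sup}(\varphi))$.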
**Definition 2.1.** Given two persistence diagrams $D_1$ and $D_2$, the bottleneck distance is given by
$$W_\infty(D_1, D_2) = \inf_{\gamma \in \Gamma} \sup_{p \in D_1} \|p - \gamma(p)\|_\infty,$$
where $\Gamma = \{\gamma : D_1 \to D_2\}$ is the set of all bijections from $D_1$ to $D_2$, including the diagonal $\Delta = \{(x, y) \in \mathbb{R}^2 : 0 \leq x = y \leq \infty\}$ with infinite multiplicity.

An assumption we make at the outset is that the filter function $f$ is tame. Tameness is a metric regularity condition which ensures that the number of points in a persistence diagram is finite and, in addition, that the number of nontrivial cycles sharing identical persistence pairings is also finite. Tame functions satisfy the celebrated stability property w.r.t. the bottleneck distance.

**Proposition 2.2** (Stability of Persistence Diagrams [12, 16]). Given two tame functions $f, g : \mathcal{X} \to \mathbb{R}$,
$$W_\infty(\mathrm{Dgm}(f), \mathrm{Dgm}(g)) \leq \|f - g\|_\infty.$$

The space of persistence diagrams is, in general, challenging to work with. However, the stability property provides a handle on the persistence space through the function space of filter functions.

## 3 Robust Persistence Diagrams

Given $X_n = \{X_1, X_2, \dots, X_n\} \subset \mathbb{R}^d$ drawn iid from a probability distribution $P \in \mathcal{M}(\mathbb{R}^d)$ with density $f$, the corresponding persistence diagram can be obtained by considering a filter function $\varphi_n : \mathbb{R}^d \to \mathbb{R}$, constructed from $X_n$ as an approximation to its population analogue, $\varphi_P : \mathbb{R}^d \to \mathbb{R}$, which carries the topological information of $P$. Commonly used $\varphi_P$ include (i) the kernelized density, $f_\sigma$, (ii) the kernel distance (KDist), $d^{K_\sigma}_P$, and (iii) the distance-to-measure (DTM), $d_{P,m}$, which are defined as
$$f_\sigma(x) = \int_{\mathcal{X}} K_\sigma(x, y)\, dP(y); \qquad d^{K_\sigma}_P(x) = \|\mu_{\delta_x} - \mu_P\|_{\mathcal{H}_\sigma}; \qquad d_{P,m}(x) = \Big(\frac{1}{m}\int_0^m F_x^{-1}(u)\, du\Big)^{1/2},$$
where $F_x(t) = P(\|X - x\|_2^2 \leq t)$ and $\sigma, m > 0$. For these $\varphi_P$, the corresponding empirical analogues, $\varphi_n$, are constructed by replacing $P$ with the empirical measure $P_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$. For example, the empirical analogue of $f_\sigma$ is the familiar kernel density estimator (KDE), $f^n_\sigma = \frac{1}{n}\sum_{i=1}^n K_\sigma(\cdot, X_i)$. While the KDE and KDist encode the shape and distribution of mass of $\mathrm{supp}(P)$ by approximating the density $f$ (sublevel sets of KDist are rescaled versions of superlevel sets of the KDE [13, 32]), the DTM, on the other hand, approximates the distance function to $\mathrm{supp}(P)$.

Since $\varphi_n$ is based on $P_n$, it is sensitive to outliers in $X_n$, which, in turn, affect the persistence diagrams (as illustrated in Figure 1). To this end, in this paper, we propose robust persistence diagrams constructed using superlevel filtrations of a robust density estimator of $f$, i.e., the filter function $\varphi_n$ is chosen to be a robust density estimator of $f$. Specifically, we use the robust KDE, $f^n_{\rho,\sigma}$, introduced by [27], as the filter function, which is defined as a solution to the following M-estimation problem:
$$f^n_{\rho,\sigma} = \arg\inf_{g \in \mathcal{G}} \int_{\mathcal{X}} \rho\big(\|\Phi_\sigma(y) - g\|_{\mathcal{H}_\sigma}\big)\, dP_n(y), \tag{1}$$
where $\rho : \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0}$ is a robust loss function, and $\mathcal{G} = \mathcal{H}_\sigma \cap \mathcal{D}_\sigma = \mathcal{D}_\sigma$ is the hypothesis class. Observe that when $\rho(z) = \frac{1}{2}z^2$, the unique solution to Eq. (1) is given by the KDE, $f^n_\sigma$. Therefore, a robust KDE is obtained by replacing the squared loss with a robust loss satisfying the following assumptions. These assumptions, which are similar to those of [27, 39], guarantee the existence and uniqueness (if $\rho$ is convex) of $f^n_{\rho,\sigma}$ [27], and are satisfied by most robust loss functions, including the Huber loss, $\rho(z) = \frac{1}{2}z^2\,\mathbb{1}\{z \leq 1\} + (z - \frac{1}{2})\,\mathbb{1}\{z > 1\}$, and the Charbonnier loss, $\rho(z) = \sqrt{1 + z^2} - 1$.

(A1) $\rho$ is strictly increasing and $M$-Lipschitz, with $\rho(0) = 0$.
(A2) $\rho'$ is continuous and bounded, with $\rho'(0) = 0$.
(A3) $\phi(x) = \rho'(x)/x$ is bounded, $L$-Lipschitz and continuous, with $\phi(0) < \infty$.
(A4) $\rho''$ exists, with $\rho'$ and $\phi$ nonincreasing.
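To make assumptions (A1)-(A4) concrete, here is a brief numerical sketch (our illustration; the helper names are hypothetical) of the Huber and Charbonnier losses quoted above, together with the induced weight function $\phi(z) = \rho'(z)/z$ that drives the KIRWLS algorithm of the next paragraph. For the squared loss, $\phi \equiv 1$, so no observation is ever down-weighted.

```python
import numpy as np

def huber_rho(z):          # rho(z) = z^2/2 on {z <= 1}, z - 1/2 on {z > 1}
    z = np.asarray(z, dtype=float)
    return np.where(z <= 1.0, 0.5 * z**2, z - 0.5)

def huber_phi(z):          # phi(z) = rho'(z)/z: equals 1 on {z <= 1}, 1/z beyond
    z = np.asarray(z, dtype=float)
    return np.where(z <= 1.0, 1.0, 1.0 / np.maximum(z, 1.0))

def charbonnier_rho(z):    # rho(z) = sqrt(1 + z^2) - 1
    return np.sqrt(1.0 + np.asarray(z, dtype=float)**2) - 1.0

def charbonnier_phi(z):    # phi(z) = (1 + z^2)^{-1/2}, bounded with phi(0) = 1
    return 1.0 / np.sqrt(1.0 + np.asarray(z, dtype=float)**2)

z = np.linspace(0.0, 10.0, 6)
print(huber_phi(z))        # nonincreasing and bounded, as (A3)-(A4) require
print(charbonnier_phi(z))  # likewise; for the squared loss phi = 1 identically
```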
Unlike for the squared loss, the solution $f^n_{\rho,\sigma}$ cannot be obtained in closed form. However, it can be shown to be the fixed point of an iterative procedure, referred to as the KIRWLS algorithm [27]. The KIRWLS algorithm starts with initial weights $\{w^{(0)}_i\}_{i=1}^n$ satisfying $\sum_{i=1}^n w^{(0)}_i = 1$, and generates the iterative sequence of estimators $\{f^{(k)}_{\rho,\sigma}\}_{k \in \mathbb{N}}$ as
$$f^{(k)}_{\rho,\sigma} = \sum_{i=1}^n w^{(k-1)}_i K_\sigma(\cdot, X_i); \qquad w^{(k)}_i = \frac{\phi\big(\|\Phi_\sigma(X_i) - f^{(k)}_{\rho,\sigma}\|_{\mathcal{H}_\sigma}\big)}{\sum_{j=1}^n \phi\big(\|\Phi_\sigma(X_j) - f^{(k)}_{\rho,\sigma}\|_{\mathcal{H}_\sigma}\big)}.$$
Intuitively, if $X_i$ is an outlier, then the corresponding weight $w_i$ is small (since $\phi$ is nonincreasing), and therefore less weight is given to the contribution of $X_i$ in the density estimator. Hence, the weights serve as a measure of inlyingness: the smaller (resp. larger) the weights, the less (resp. more) inlying the points. When $P_n$ is replaced by $P$, the solution of Eq. (1) is its population analogue, $f_{\rho,\sigma}$. Although $f_{\rho,\sigma}$ does not admit a closed-form solution, it can be shown [27] that there exists a non-negative real-valued function $w_\sigma$ satisfying $\int_{\mathbb{R}^d} w_\sigma(x)\, dP(x) = 1$ such that
$$f_{\rho,\sigma} = \int_{\mathbb{R}^d} K_\sigma(\cdot, x)\, w_\sigma(x)\, dP(x) = \int_{\mathbb{R}^d} \frac{\phi\big(\|\Phi_\sigma(x) - f_{\rho,\sigma}\|_{\mathcal{H}_\sigma}\big)}{\int_{\mathbb{R}^d} \phi\big(\|\Phi_\sigma(y) - f_{\rho,\sigma}\|_{\mathcal{H}_\sigma}\big)\, dP(y)}\, K_\sigma(\cdot, x)\, dP(x), \tag{2}$$
where $w_\sigma$ acts as a population analogue of the weights in the KIRWLS algorithm.

To summarize our proposal: the fixed point of the KIRWLS algorithm, which yields the robust density estimator $f^n_{\rho,\sigma}$, is used as the filter function to obtain a robust persistence diagram of $X_n$. On the computational front, note that $f^n_{\rho,\sigma}$ is computationally more complex than the KDE, $f^n_\sigma$, requiring $O(n\ell)$ computations compared to $O(n)$ for the latter, with $\ell$ being the number of iterations required to reach the fixed point of KIRWLS. However, once these filter functions are computed, the corresponding persistence diagrams have similar computational complexity, as both require computing superlevel sets, which, in turn, require function evaluations that scale as $O(n)$ for both $f^n_{\rho,\sigma}$ and $f^n_\sigma$.
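A minimal NumPy sketch of the KIRWLS iteration under the assumptions above (our illustration, not the authors' released code; the Gaussian-kernel Gram helper and the stopping rule based on the relative change in empirical risk, as used in Section 5, are our choices):

```python
import numpy as np

def gaussian_gram(X, sigma):
    """K_sigma(x, y) for the Gaussian kernel, as an n x n Gram matrix."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2)) / (2 * np.pi * sigma**2) ** (X.shape[1] / 2)

def kirwls(X, sigma, rho, phi, tol=1e-6, max_iter=200):
    """Kernelized iteratively re-weighted least squares (a sketch of [27]).
    Returns weights w defining the robust KDE f = sum_i w_i K_sigma(., X_i)."""
    n = len(X)
    K = gaussian_gram(X, sigma)
    w = np.full(n, 1.0 / n)                   # uniform initial weights (the KDE)
    risk_old = np.inf
    for _ in range(max_iter):
        # ||Phi(X_i) - f||_H^2 = K_ii - 2 (K w)_i + w' K w, by the kernel trick
        Kw = K @ w
        dist = np.sqrt(np.maximum(np.diag(K) - 2 * Kw + w @ Kw, 0.0))
        risk = rho(dist).mean()               # empirical risk (1/n) sum_i rho(dist_i)
        if np.isfinite(risk_old) and risk_old - risk < tol * risk_old:
            break                             # relative change in risk below tol
        risk_old = risk
        w = phi(dist)
        w = w / w.sum()                       # re-normalize the weights
    return w

# Example with the Charbonnier loss; 5 gross outliers among 100 points
rng = np.random.default_rng(0)
X = np.vstack([0.2 * rng.normal(size=(95, 2)), rng.uniform(-6, 6, size=(5, 2))])
w = kirwls(X, sigma=0.5,
           rho=lambda z: np.sqrt(1 + z**2) - 1,
           phi=lambda z: 1 / np.sqrt(1 + z**2))
print(w[-5:].mean() / w[:95].mean())          # outlier weights are markedly smaller
```

The returned weights define the robust KDE $f^n_{\rho,\sigma}(x) = \sum_i w_i K_\sigma(x, X_i)$; evaluating it on a grid and passing the superlevel filtration to a cubical-homology backend then yields the proposed robust persistence diagram.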
## 4 Theoretical Analysis of Robust Persistence Diagrams

In this section, we investigate the theoretical properties of the proposed robust persistence diagrams. First, in Section 4.1, we examine the sensitivity of persistence diagrams to outlying perturbations through the notion of the metric derivative, and compare the effect of different filter functions. Next, in Section 4.2, we establish consistency and convergence rates for the robust persistence diagram to its population analogue. These results allow us to construct uniform confidence bands for the robust persistence diagram. The proofs of the results are provided in Appendix A.

### 4.1 A measure of sensitivity of persistence diagrams to outliers

The influence function and gross error sensitivity are arguably the most popular tools in robust statistics for diagnosing the sensitivity of an estimator to a single adversarial contamination [23, 26]. Given a statistical functional $T : \mathcal{P}(\mathcal{X}) \to (V, \|\cdot\|_V)$, which takes an input probability measure $P \in \mathcal{P}(\mathcal{X})$ on the input space $\mathcal{X}$ and produces a statistic $P \mapsto T(P)$ in some normed space $(V, \|\cdot\|_V)$, the influence function of $x \in \mathcal{X}$ at $P$ is given by the Gâteaux derivative of $T$ at $P$, restricted to the space of signed Borel measures with zero expectation:
$$\mathrm{IF}(T; P, x) = \frac{\partial}{\partial \epsilon}\, T\big((1-\epsilon)P + \epsilon\delta_x\big)\Big|_{\epsilon=0} = \lim_{\epsilon \downarrow 0} \frac{T\big((1-\epsilon)P + \epsilon\delta_x\big) - T(P)}{\epsilon},$$
and the gross error sensitivity at $P$ is given by $\Gamma(T; P) = \sup_{x \in \mathcal{X}} \|\mathrm{IF}(T; P, x)\|_V$.

However, a persistence diagram (which is a statistical functional) does not take values in a normed space, and therefore the notion of influence function has to be generalized to metric spaces through the concept of a metric derivative: given a complete metric space $(\mathcal{X}, d_{\mathcal{X}})$ and a curve $s : [0, 1] \to \mathcal{X}$, the metric derivative at $\epsilon = 0$ is given by $|s'|(0) = \lim_{\epsilon \downarrow 0} \frac{1}{\epsilon}\, d_{\mathcal{X}}(s(0), s(\epsilon))$. Using this generalization, we have the following definition, which allows us to examine the influence an outlier has on the persistence diagram obtained from a filtration.

**Definition 4.1.** Given a probability measure $P \in \mathcal{P}(\mathbb{R}^d)$ and a filter function $\varphi_P$ depending on $P$, the persistence influence of a perturbation $x \in \mathbb{R}^d$ on $\mathrm{Dgm}(\varphi_P)$ is defined as
$$\Psi(\varphi_P; x) = \lim_{\epsilon \downarrow 0} \frac{1}{\epsilon}\, W_\infty\big(\mathrm{Dgm}(\varphi_{P^\epsilon_x}), \mathrm{Dgm}(\varphi_P)\big),$$
where $P^\epsilon_x = (1-\epsilon)P + \epsilon\delta_x$, and the gross influence is defined as $\Gamma(\varphi_P) = \sup_{x \in \mathbb{R}^d} \Psi(\varphi_P; x)$.

For $\epsilon > 0$, let $f^{\epsilon,x}_{\rho,\sigma}$ be the robust KDE associated with the probability measure $P^\epsilon_x$. The following result (proved in Appendix A.1) bounds the persistence influence for the persistence diagram induced by the filter function $f_{\rho,\sigma}$, the population analogue of the robust KDE.

**Theorem 4.2.** For a loss $\rho$ satisfying (A1)-(A3), and $\sigma > 0$, if $\lim_{\epsilon \downarrow 0} \frac{1}{\epsilon}\|f^{\epsilon,x}_{\rho,\sigma} - f_{\rho,\sigma}\|_{\mathcal{H}_\sigma}$ exists, then the persistence influence of $x \in \mathbb{R}^d$ on $\mathrm{Dgm}(f_{\rho,\sigma})$ satisfies
$$\Psi(f_{\rho,\sigma}; x) \leq \|K_\sigma\|_\infty^{1/2}\, \rho'\big(\|\Phi_\sigma(x) - f_{\rho,\sigma}\|_{\mathcal{H}_\sigma}\big) \left(\int_{\mathbb{R}^d} \zeta\big(\|\Phi_\sigma(y) - f_{\rho,\sigma}\|_{\mathcal{H}_\sigma}\big)\, dP(y)\right)^{-1}, \tag{3}$$
where $\zeta(z) = \phi(z) - z\phi'(z)$.

**Remark 4.3.** We make the following observations from Theorem 4.2.

(i) Choosing $\rho(z) = \frac{1}{2}z^2$ and noting that $\phi(z) = \rho''(z) = 1$, a similar analysis as in the proof of Theorem 4.2 yields a bound for the persistence influence of the KDE: $\Psi(f_\sigma; x) \leq \sigma^{-d/2}\,\|\Phi_\sigma(x) - f_\sigma\|_{\mathcal{H}_\sigma}$. On the other hand, for robust loss functions, the term in Eq. (3) involving $\rho'$ is bounded because of (A2), making them less sensitive to very large perturbations. In fact, for nonincreasing $\phi$, it can be shown (see Appendix C) that $\Psi(f_{\rho,\sigma}; x) \leq \sigma^{-d/2}\, w_\sigma(x)\, \|\Phi_\sigma(x) - f_{\rho,\sigma}\|_{\mathcal{H}_\sigma}$, where, in contrast to the KDE, the measure of inlyingness, $w_\sigma$, weighs down extreme outliers.

Figure 3: Points $X_n$ are sampled from a distribution $P$ with nontrivial 1st-order homological features, and outliers $Y_m$ are added at a distance $r$ from the support of $P$. (Left) The average $L_\infty$ distance between the density estimators computed using $X_n$ and $X_n \cup Y_m$ as $r$ increases. (Center) The average $W_\infty$ distance between the corresponding persistence diagrams for the 1st-order homological features. (Right) The $W_1$ distance (defined in Eq. E.1 in Appendix E) between the same persistence diagrams. The results show that the outliers $Y_m$ have little influence on the persistence diagrams from the robust KDEs. In contrast, as the outliers become more extreme (i.e., as $r$ increases), their influence on the persistence diagrams from the KDE becomes more prominent.

(ii) For the generalized Charbonnier loss (a robust loss function), given by $\rho(z) = (1 + z^2)^{\alpha/2} - 1$ for $1 \leq \alpha < 2$, the persistence influence satisfies
$$\Psi(f_{\rho,\sigma}; x) \lesssim \sigma^{-d/2}\, \big(1 + \|\Phi_\sigma(x) - f_{\rho,\sigma}\|^2_{\mathcal{H}_\sigma}\big)^{\frac{\alpha-1}{2}} \left(1 + \int_{\mathbb{R}^d} \|\Phi_\sigma(y) - f_{\rho,\sigma}\|^2_{\mathcal{H}_\sigma}\, dP(y)\right)^{\frac{2-\alpha}{2}}.$$
Note that for $\alpha = 1$, the bound on the persistence influence $\Psi(f_{\rho,\sigma}; x)$ does not depend on how extreme the outlier $x$ is. Similarly, for the Cauchy loss, given by $\rho(z) = \log(1 + z^2)$, we have
$$\Psi(f_{\rho,\sigma}; x) \lesssim \sigma^{-d/2}\left(1 + \int_{\mathbb{R}^d} \|\Phi_\sigma(y) - f_{\rho,\sigma}\|^2_{\mathcal{H}_\sigma}\, dP(y)\right).$$
This shows that for large perturbations, the gross error sensitivity for the Cauchy and Charbonnier losses is far more stable than that of the KDE. This behavior is also empirically illustrated in Figure 3; the experiment is detailed in Appendix C.

(iii) For the DTM function, it can be shown that
$$\Psi(d_{P,m}; x) \leq \frac{2}{m}\, \sup\left\{ f(x) - \int_{\mathbb{R}^d} f(y)\, dP(y) \;:\; \|f\|_{L^2(P)} \leq 1 \right\}. \tag{4}$$
While $d_{P,m}$ cannot be compared to either $f_\sigma$ or $f_{\rho,\sigma}$, as it captures topological information at a different scale, determined by $m$, we point out that when $\mathrm{supp}(P)$ is compact, $\Psi(d_{P,m}; x)$ is not guaranteed to be bounded, unlike $\Psi(f_{\rho,\sigma}; x)$. We refer the reader to Appendix C for more details.

It follows from Remark 4.3 that as $\sigma \to 0$, the persistence influence of both the KDE and the robust KDE behaves as $O(\sigma^{-d})$, showing that the robustness of robust persistence diagrams manifests only in cases where $\sigma > 0$. However, robustness alone has no bearing if the robust persistence diagram and the persistence diagram from the KDE are fundamentally different, i.e., if they estimate different quantities as $\sigma \to 0$. The following result (proved in Appendix A.2) shows that as $\sigma \to 0$, $\mathrm{Dgm}(f_{\rho,\sigma})$ recovers the same information as $\mathrm{Dgm}(f_\sigma)$, which is the same as $\mathrm{Dgm}(f)$, where $f$ is the density of $P$.

**Theorem 4.4.** For a strictly convex loss $\rho$ satisfying (A1)-(A4), and $\sigma > 0$, suppose $P \in \mathcal{M}(\mathbb{R}^d)$ with density $f$, and $f_{\rho,\sigma}$ is the robust KDE. Then $W_\infty(\mathrm{Dgm}(f_{\rho,\sigma}), \mathrm{Dgm}(f)) \to 0$ as $\sigma \to 0$.

Suppose $P = (1-\pi)P_0 + \pi Q$, where $P_0$ corresponds to the true signal which we are interested in studying, and $Q$ manifests as some ambient noise, with $0 < \pi < \frac{1}{2}$. In light of Theorem 4.4, by letting $\sigma \to 0$, along with the topological features of $P_0$ we also capture the topological features of $Q$, which may obfuscate any statistical inference made using the persistence diagrams. In a manner of speaking, choosing $\sigma > 0$ suppresses the noise in the resulting persistence diagrams, thereby making them more stable. On a similar note, the authors in [21] state that for a suitable bandwidth $\sigma > 0$, the level sets of $f_\sigma$ carry the same topological information as $\mathrm{supp}(P)$, even though some subtle details of $f$ may be omitted. In what follows, we consider the setting where robust persistence diagrams are constructed for a fixed $\sigma > 0$.
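Definition 4.1 also suggests a direct empirical diagnostic: fix a small $\epsilon$, contaminate the empirical measure, and compare the two diagrams in bottleneck distance. The sketch below (ours; it assumes the gudhi library for cubical persistence and bottleneck distances, uses the plain KDE filter for brevity, and mimics the experiment summarized in Figure 3). Swapping in the KIRWLS weights from Section 3 gives the robust counterpart.

```python
import numpy as np
import gudhi  # assumed dependency; any cubical-persistence backend would do

side = 64
axis = np.linspace(-3.0, 3.0, side)
grid = np.stack(np.meshgrid(axis, axis), axis=-1).reshape(-1, 2)

def kde_filter(points, weights, sigma=0.3):
    """Weighted Gaussian KDE on the grid; uniform weights give f^n_sigma."""
    d2 = ((grid[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return (np.exp(-d2 / (2 * sigma**2)) @ weights).reshape(side, side) / (2 * np.pi * sigma**2)

def superlevel_diagram(f_grid, dim=1):
    """Superlevel persistence of f via the sublevel filtration of -f."""
    cc = gudhi.CubicalComplex(top_dimensional_cells=-f_grid)
    cc.compute_persistence()
    d = cc.persistence_intervals_in_dimension(dim)
    return d[np.isfinite(d).all(axis=1)] if len(d) else d   # drop essential classes

def persistence_influence(sample, x, eps=1e-2):
    """Finite-eps estimate of Psi(phi_P; x) from Definition 4.1, for the KDE filter."""
    n = len(sample)
    f0 = kde_filter(sample, np.full(n, 1.0 / n))
    pts = np.vstack([sample, x])                        # P_eps = (1-eps) P_n + eps delta_x
    w = np.append(np.full(n, (1.0 - eps) / n), eps)
    f1 = kde_filter(pts, w)
    return gudhi.bottleneck_distance(superlevel_diagram(f0), superlevel_diagram(f1)) / eps

rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 200)
Xn = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(200, 2))
for r in (1.5, 2.0, 2.5):     # outlier pushed farther from the support, cf. Figure 3
    print(r, persistence_influence(Xn, np.array([r, r])))
```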
### 4.2 Statistical properties of robust persistence diagrams from samples

Suppose $\mathrm{Dgm}(f^n_{\rho,\sigma})$ is the robust persistence diagram obtained from the robust KDE on a sample $X_n$, and $\mathrm{Dgm}(f_{\rho,\sigma})$ is its population analogue obtained from $f_{\rho,\sigma}$. The following result (proved in Appendix A.3) establishes the consistency of $\mathrm{Dgm}(f^n_{\rho,\sigma})$ in the $W_\infty$ metric.

**Theorem 4.5.** For a convex loss $\rho$ satisfying (A1)-(A4), and fixed $\sigma > 0$, suppose $X_n$ is observed iid from a distribution $P \in \mathcal{M}(\mathbb{R}^d)$ with density $f$. Then $W_\infty(\mathrm{Dgm}(f^n_{\rho,\sigma}), \mathrm{Dgm}(f_{\rho,\sigma})) \xrightarrow{p} 0$ as $n \to \infty$.

We present the rate of the above convergence in Theorem 4.7, which depends on the smoothness of $\mathcal{H}_\sigma$. In a similar spirit to [21], this result paves the way for constructing uniform confidence bands. Before we present the result, we first introduce the notion of entropy numbers associated with an RKHS.

**Definition 4.6** (Entropy Number). Given a metric space $(T, d)$, the $n$th entropy number is defined as
$$e_n(T, d) = \inf\Big\{\epsilon > 0 : \exists\, t_1, t_2, \dots, t_{2^{n-1}} \in T \text{ such that } T \subseteq \bigcup_{i=1}^{2^{n-1}} B_d(t_i, \epsilon)\Big\}.$$
Further, if $(V, \|\cdot\|_V)$ and $(W, \|\cdot\|_W)$ are two normed spaces and $L : V \to W$ is a bounded linear operator, then $e_n(L) = e_n(L : V \to W) = e_n(L(B_V), \|\cdot\|_W)$, where $B_V$ is the unit ball in $V$.

Loosely speaking, entropy numbers are related to the eigenvalues of the integral operator associated with the kernel $K_\sigma$, and measure the capacity of the RKHS for approximating functions in $L_2(\mathbb{R}^d)$. In our context, the entropy numbers provide useful bounds on the covering numbers of sets in the hypothesis class $\mathcal{G}$. We refer the reader to [35] for more details. With this background, the following theorem (proved in Appendix A.4) provides a method for constructing uniform confidence bands for the persistence diagram constructed using the robust KDE on $X_n$.

**Theorem 4.7.** For a convex loss $\rho$ satisfying (A1)-(A4), and fixed $\sigma > 0$, suppose the kernel $K_\sigma$ satisfies $e_n(\mathrm{id} : \mathcal{H}_\sigma \to L_\infty(\mathcal{X})) \leq a_\sigma n^{-\frac{1}{2p}}$, where $a_\sigma > 1$, $0 < p < 1$ and $\mathcal{X} \subset \mathbb{R}^d$. Then, for a fixed confidence level $0 < \alpha < 1$,
$$\sup_{P \in \mathcal{M}(\mathcal{X})} P^n\left\{ W_\infty\big(\mathrm{Dgm}(f^n_{\rho,\sigma}), \mathrm{Dgm}(f_{\rho,\sigma})\big) > \frac{2M\|K_\sigma\|_\infty^{1/2}}{\mu}\Big(\xi(n, p) + \sqrt{\tfrac{2\log(1/\alpha)}{n}}\Big) \right\} \leq \alpha,$$
where $\xi(n, p)$ is given by
$$\xi(n, p) = \begin{cases} \gamma\, a_\sigma^p\, (1 - 2p)^{-1}\, n^{-1/2} & \text{if } 0 < p < 1/2, \\ \gamma C\, a_\sigma\, \log(n)\, n^{-1/2} & \text{if } p = 1/2, \\ \gamma\, a_\sigma^p\, (2p - 1)^{-1}\, n^{-1/(4p)} & \text{if } 1/2 < p < 1, \end{cases}$$
for fixed constants $\gamma > 12\log 2$, $C > 3\log(9 a_\sigma)$, and $\mu = 2\min\big\{\phi\big(2\|K_\sigma\|_\infty^{1/2}\big),\, \rho''\big(2\|K_\sigma\|_\infty^{1/2}\big)\big\}$.

**Remark 4.8.** We highlight some salient observations from Theorem 4.7.

(i) If $\mathrm{diam}(\mathcal{X}) = r$ and the kernel $K_\sigma$ is $m$-times differentiable, then from [35, Theorem 6.26] the entropy numbers associated with $K_\sigma$ satisfy $e_n(\mathrm{id} : \mathcal{H}_\sigma \to L_\infty(\mathcal{X})) \leq c\, r^m n^{-\frac{m}{d}}$. In light of Theorem 4.7, for $p = \frac{d}{2m}$, we can make two important observations. First, as the dimension of the input space $\mathcal{X}$ increases, the rate of convergence decreases, a direct consequence of the curse of dimensionality. Second, for a fixed input dimension, the parameter $p$ in Theorem 4.7 is inversely proportional to the smoothness of the kernel: as the smoothness of the kernel increases, the rate of convergence is faster, and we obtain sharper confidence bands. This makes a case for employing smoother kernels.

(ii) A similar result is obtained in [21, Lemma 8] for persistence diagrams from the KDE, with a convergence rate of $O_p(n^{-1/2})$, where the proof relies on a simple application of Hoeffding's inequality, unlike the more sophisticated tools that the proof of Theorem 4.7 warrants for the robust KDE.

Figure 4: (a) A realization of $X_n \cup Y_m$. (b) $\pi = 20\%$, $p = 4 \times 10^{-60}$; (c) $\pi = 30\%$, $p = 2 \times 10^{-72}$; (d) $\pi = 40\%$, $p = 2.5 \times 10^{-75}$: as the noise level $\pi$ increases, boxplots of $W_\infty(D_{\rho,\sigma}, D^\#_\sigma)$ (in blue) and $W_\infty(D_\sigma, D^\#_\sigma)$ (in red) show that the robust persistence diagram recovers the underlying signal better.

## 5 Experiments

We illustrate the performance of robust persistence diagrams in machine-learning applications through synthetic and real-world experiments (code available at https://github.com/sidv23/robust-PDs). In all the experiments, the kernel bandwidth $\sigma$ is chosen as the median distance of each $x_i \in X_n$ to its $k$th nearest neighbour, using the Gaussian kernel with the Hampel loss (a setting similar to [27]); we denote this bandwidth by $\sigma(k)$. Since the DTM is closely related to the $k$-NN density estimator [6], we choose the DTM smoothing parameter as $m(k) = k/n$. Additionally, the KIRWLS algorithm is run until the relative change in empirical risk falls below $10^{-6}$.
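One way to implement the tuning rules just described (our reading of them; the helper names are hypothetical):

```python
import numpy as np

def sigma_knn(X, k=5):
    """sigma(k): median over x_i in X_n of the distance to its k-th nearest neighbour."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    d.sort(axis=1)              # column 0 is the zero self-distance
    return np.median(d[:, k])   # k-th neighbour, self excluded

def m_knn(n, k=5):
    """DTM smoothing parameter m(k) = k/n, matching the k-NN connection of [6]."""
    return k / n
```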
**Runtime Analysis.** For $n = 1000$, $X_n$ is sampled from a torus inside $[0, 2]^3$. For each grid resolution $\alpha \in \{0.04, 0.06, 0.08, 0.10\}$, the robust persistence diagram $\mathrm{Dgm}(f^n_{\rho,\sigma})$ and the KDE persistence diagram $\mathrm{Dgm}(f^n_\sigma)$ are constructed from the superlevel filtration of cubical homology. The total time taken to compute the persistence diagrams is reported in Table 1. The results demonstrate that the computational bottleneck is the persistent-homology pipeline, not the KIRWLS iterations for $f^n_{\rho,\sigma}$.

Table 1: Runtime (in seconds) for computing $\mathrm{Dgm}(f^n_{\rho,\sigma})$ and $\mathrm{Dgm}(f^n_\sigma)$ at each grid resolution.

| Grid resolution | 0.04 | 0.06 | 0.08 | 0.10 |
|---|---|---|---|---|
| Average runtime for $\mathrm{Dgm}(f^n_{\rho,\sigma})$ | 76.7s | 17.1s | 6.7s | 3.5s |
| Average runtime for $\mathrm{Dgm}(f^n_\sigma)$ | 75.5s | 15.3s | 4.7s | 1.8s |

**Bottleneck Simulation.** The objective of this experiment is to assess how the robust KDE persistence diagram compares to the KDE persistence diagram in recovering the topological features of the underlying signal. $X_n$ is observed uniformly from two circles, and $Y_m$ is sampled uniformly from the enclosing square such that $m = 200$ and $m/n = \pi \in \{20\%, 30\%, 40\%\}$, as shown in Figure 4(a). For each noise level $\pi$, and for each of $N = 100$ realizations of $X_n$ and $Y_m$, the robust persistence diagram $D_{\rho,\sigma}$ and the KDE persistence diagram $D_\sigma$ are constructed from the noisy samples $X_n \cup Y_m$. In addition, we compute the KDE persistence diagram $D^\#_\sigma$ on $X_n$ alone, as a proxy for the target persistence diagram one would obtain in the absence of any contamination. The bandwidth $\sigma(k) > 0$ is chosen with $k = 5$. For each realization $i$, the bottleneck distances $U_i = W_\infty(D_{\rho,\sigma}, D^\#_\sigma)$ and $V_i = W_\infty(D_\sigma, D^\#_\sigma)$ are computed for the 1st-order homological features. The boxplots and $p$-values for the one-sided hypothesis test $H_0 : U - V = 0$ vs. $H_1 : U - V < 0$ are reported in Figures 4(b, c, d). The results demonstrate that the robust persistence diagram is noticeably better at recovering the true homological features, and in fact demonstrates superior performance when the noise levels are higher.

**Spectral Clustering using Persistent Homology.** We perform a variant of the six-class benchmark experiment from [1, Section 6.1]. The data comprise six different 3D objects: cube, circle, sphere, 3clusters, 3clustersIn3clusters, and torus. 25 point clouds are sampled from each object, with additive Gaussian noise (SD = 0.1) and ambient Matérn cluster noise. For each point cloud $X_n$, the robust persistence diagram $\mathrm{Dgm}(f^n_{\rho,\sigma})$ and the persistence diagram $\mathrm{Dgm}(d_{X_n})$, from the distance function, are constructed. Additionally, $\mathrm{Dgm}(d_{X_n})$ is transformed into the persistence image $\mathrm{Img}(d_{X_n}, h)$ for $h = 0.1$. Note that $\mathrm{Dgm}(f^n_{\rho,\sigma})$ is a robust diagram, while $\mathrm{Img}(d_{X_n}, h)$ is a stable vectorization of a non-robust diagram [1]. For each homological order in $\{H_0, H_1, H_2\}$, distance matrices $\{\Delta_0, \Delta_1, \Delta_2\}$ are computed, using the $W_p$ metric for $\mathrm{Dgm}(f_{\rho,\sigma})$ and the $L_p$ metric for $\mathrm{Img}(d_{X_n}, h)$, with $p \in \{1, 2, \infty\}$, and spectral clustering is performed on the resulting distance matrices. The quality of the clustering is assessed using the Rand index.
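A sketch of this clustering pipeline (ours; it assumes gudhi for bottleneck distances and scikit-learn for spectral clustering, and the median-scaled conversion of distances to affinities is our choice, not specified above):

```python
import numpy as np
import gudhi                                    # assumed: W_inf between diagrams
from sklearn.cluster import SpectralClustering

def bottleneck_matrix(diagrams):
    """Pairwise bottleneck (W_inf) distances between finite persistence diagrams."""
    n = len(diagrams)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = gudhi.bottleneck_distance(diagrams[i], diagrams[j])
    return D

def cluster_diagrams(diagrams, n_clusters=6):
    D = bottleneck_matrix(diagrams)
    A = np.exp(-D / np.median(D[D > 0]))        # distances -> affinities (our choice)
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                            random_state=0)
    return sc.fit_predict(A)
```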
The results, reported in Table 2, evidence the superiority of employing inherently robust persistence diagrams over a robust vectorization of an inherently noisy persistence diagram.

Table 2: Rand index for spectral clustering using distance matrices for $\mathrm{Dgm}(f_{\rho,\sigma})$ (the $W_p$ columns) and $\mathrm{Img}(d_{X_n}, h)$ (the $L_p$ columns).

| Distance matrix | $W_1$ | $W_2$ | $W_\infty$ | $L_1$ | $L_2$ | $L_\infty$ |
|---|---|---|---|---|---|---|
| $\Delta_0$ (from $H_0$) | 95.30% | 93.65% | 94.44% | 78.53% | 81.77% | 80.05% |
| $\Delta_1$ (from $H_1$) | 91.43% | 88.56% | 84.53% | 81.89% | 81.14% | 77.75% |
| $\Delta_2$ (from $H_2$) | 86.33% | 73.91% | 73.62% | 80.09% | 77.12% | 77.35% |
| $\Delta_{\max} = \max\{\Delta_0, \Delta_1, \Delta_2\}$ | 95.72% | 93.65% | 94.44% | 82.43% | 78.80% | 79.78% |

**MPEG7.** In this experiment, we examine the performance of persistence diagrams in a classification task on the MPEG7 dataset [28]. For simplicity, we only consider five classes: beetle, bone, spring, deer and horse. We first extract the boundary of the images using a Laplace convolution, and sample $X_n$ uniformly from the boundary of each image, adding uniform noise ($\pi = 15\%$) in the enclosing region. Persistence diagrams $\mathrm{Dgm}(f^n_\sigma)$ and $\mathrm{Dgm}(f^n_{\rho,\sigma})$ from the KDE and the robust KDE are constructed. In addition, owing to its ability to capture nuanced multi-scale features, we also construct $\mathrm{Dgm}(d_{n,m})$ from the DTM filtration. The smoothing parameters $\sigma(k)$ and $m(k)$ are chosen as earlier, with $k = 5$. The persistence diagrams are normalized to have maximum persistence $\max\{|d - b| : (b, d) \in \mathrm{Dgm}(\varphi)\} = 1$, and are then vectorized as persistence images, $\mathrm{Img}(f^n_\sigma, h)$, $\mathrm{Img}(f^n_{\rho,\sigma}, h)$, and $\mathrm{Img}(d_{n,m}, h)$, for various bandwidths $h$. A linear SVM classifier is then trained on the resulting persistence images. In the first experiment we consider only the first three classes, and in the second experiment we consider all five classes. The results for the classification error, shown in Figure 5, demonstrate the superiority of the proposed method. We refer the reader to Appendix D for additional experiments.

Figure 5: (a) $X_n$ is sampled from the image boundary of a bone, and uniform noise is added. (b) The resulting persistence diagram from the robust KDE; the diagram picks up the 1st-order features near the joints of the cartoon bone. (c, d) The misclassification error for the KDE, robust KDE and DTM as the persistence image bandwidth increases, for (c) the three-class classification and (d) the five-class classification.
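For completeness, a minimal sketch of the persistence-image vectorization of [1] used above (our illustration; the linear weighting $w(p) = p$ matches the normalization to unit maximum persistence, and the grid span and resolution are illustrative choices):

```python
import numpy as np

def persistence_image(diagram, h=0.1, res=20, span=(0.0, 1.0)):
    """Persistence image of [1]: each point (b, d) maps to (b, p) with p = d - b,
    contributes a Gaussian bump of bandwidth h weighted linearly by p, and the
    resulting surface is sampled on a res x res pixel grid over span x span."""
    lo, hi = span
    centers = np.linspace(lo, hi, res)
    bx, py = np.meshgrid(centers, centers)
    img = np.zeros((res, res))
    for b, d in np.asarray(diagram, dtype=float).reshape(-1, 2):
        if not np.isfinite(d):
            continue                        # skip essential classes
        p = d - b                           # persistence coordinate
        img += p * np.exp(-((bx - b) ** 2 + (py - p) ** 2) / (2 * h**2))
    return img / (2 * np.pi * h**2)

# A linear SVM can then be trained on the flattened images, e.g.
# sklearn.svm.LinearSVC().fit(np.stack([persistence_image(D).ravel() for D in dgms]), y)
```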
## 6 Conclusion & Discussion

In this paper, we proposed a statistically consistent robust persistence diagram using the RKHS-based robust KDE as the filter function. By generalizing the notion of influence function to the space of persistence diagrams, we mathematically and empirically demonstrated the robustness of the proposed method relative to persistence diagrams induced by other filter functions, such as the KDE. Through numerical experiments, we demonstrated the advantage of using robust persistence diagrams in machine-learning applications. We would like to highlight that most of the theoretical results of this paper crucially hinge on the loss function being convex. As a future direction, we would like to generalize the current results to non-convex loss functions, and to explore robust persistence diagrams induced by other types of robust density estimators, which could potentially yield even more robust persistence diagrams. Another important direction we intend to explore is enhancing the computational efficiency of the proposed approach using coresets, as in [7], and/or weighted Rips filtrations, as in [2]. We provide a brief discussion in Appendix E.

## Broader Impact

Over the last decade, Topological Data Analysis has become an important tool for extracting geometric and topological information from data, and its applications have been far-reaching. For example, it has been used successfully in the study of fragile X syndrome, to discover traumatic brain injuries, and has become an important tool in the study of protein structure. In astrophysics, it has aided the study of the cosmic microwave background, and the discovery of cosmic voids and filamentary structures in cosmological data. With a continual increase in its adoption in data analysis, it has become important to understand the limitations of using persistent homology in machine-learning applications. As real-world data is often contaminated with measurement errors and other forms of noise, in this work we examine the sensitivity of persistence diagrams to such noise, and provide methods to mitigate its effect, so as to enable reliable topological inference.

## Acknowledgments and Disclosure of Funding

The authors would like to thank the anonymous reviewers for their helpful comments and constructive feedback. Siddharth Vishwanath and Bharath Sriperumbudur are supported in part by NSF DMS CAREER Award 1945396. Kenji Fukumizu is supported in part by JST CREST Grant Number JPMJCR15D3, Japan. Satoshi Kuriki is partially supported by JSPS KAKENHI Grant Number JP16H02792, Japan.

## References

[1] H. Adams, T. Emerson, M. Kirby, R. Neville, C. Peterson, P. Shipman, S. Chepushtanova, E. Hanson, F. Motta, and L. Ziegelmeier. Persistence images: A stable vector representation of persistent homology. Journal of Machine Learning Research, 18(1):218-252, 2017.
[2] H. Anai, F. Chazal, M. Glisse, Y. Ike, H. Inakoshi, R. Tinarrage, and Y. Umeda. DTM-based filtrations. In SoCG 2019 - 35th International Symposium on Computational Geometry, 2019.
[3] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.
[4] P. Bendich, T. Galkovskyi, and J. Harer. Improving homology estimates with random walks. Inverse Problems, 27(12):124002, 2011.
[5] P. Bendich, J. S. Marron, E. Miller, A. Pieloch, and S. Skwerer. Persistent homology analysis of brain artery trees. The Annals of Applied Statistics, 10(1):198, 2016.
[6] G. Biau, F. Chazal, D. Cohen-Steiner, L. Devroye, and C. Rodriguez. A weighted k-nearest neighbor density estimate for geometric inference. Electronic Journal of Statistics, 5:204-237, 2011.
[7] C. Brécheteau and C. Levrard. The k-PDTM: a coreset for robust geometric inference. arXiv preprint arXiv:1801.10346, 2018.
[8] R. Brüel-Gabrielsson, V. Ganapathi-Subramanian, P. Skraba, and L. J. Guibas. Topology-aware surface reconstruction for point clouds. arXiv preprint arXiv:1811.12543, 2018.
[9] F. Chazal and B. Michel. An introduction to topological data analysis: Fundamental and practical aspects for data scientists. arXiv preprint arXiv:1710.04019, 2017.
[10] F. Chazal, D. Cohen-Steiner, and Q. Mérigot. Geometric inference for probability measures. Foundations of Computational Mathematics, 11(6):733-751, 2011.
[11] F. Chazal, L. J. Guibas, S. Y. Oudot, and P. Skraba. Persistence-based clustering in Riemannian manifolds. Journal of the ACM (JACM), 60(6):1-38, 2013.
[12] F. Chazal, V. De Silva, M. Glisse, and S. Oudot. The Structure and Stability of Persistence Modules. Springer, 2016.
[13] F. Chazal, B. Fasy, F. Lecci, B. Michel, A. Rinaldo, and L. Wasserman. Robust topological inference: Distance to a measure and kernel distance. Journal of Machine Learning Research, 18(1):5845-5884, 2017.
[14] C. Chen, X. Ni, Q. Bai, and Y. Wang. A topological regularizer for classifiers via persistent homology. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2573-2582, 2019.
[15] Y.-C. Chen. A tutorial on kernel density estimation and recent advances. Biostatistics & Epidemiology, 1(1):161-187, 2017.
[16] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer. Stability of persistence diagrams. Discrete & Computational Geometry, 37(1):103-120, 2007.
[17] G. Dal Maso. An Introduction to Γ-convergence, volume 8. Springer Science & Business Media, 2012.
[18] V. Divol and T. Lacombe. Understanding the topology and the geometry of the persistence diagram space via optimal partial transport. arXiv preprint arXiv:1901.03048, 2019.
[19] H. Edelsbrunner and J. Harer. Computational Topology: An Introduction. American Mathematical Society, 2010.
[20] H. Edelsbrunner, D. Letscher, and A. Zomorodian. Topological persistence and simplification. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pages 454-463. IEEE, 2000.
[21] B. T. Fasy, F. Lecci, A. Rinaldo, L. Wasserman, S. Balakrishnan, and A. Singh. Confidence sets for persistence diagrams. The Annals of Statistics, 42(6):2301-2339, 2014.
[22] M. Gameiro, Y. Hiraoka, S. Izumi, M. Kramar, K. Mischaikow, and V. Nanda. A topological measurement of protein compressibility. Japan Journal of Industrial and Applied Mathematics, 32(1):1-17, 2015.
[23] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel. Robust Statistics: The Approach Based on Influence Functions, volume 196. John Wiley & Sons, 2011.
[24] A. Hatcher. Algebraic Topology. Cambridge University Press, 2002.
[25] E. Hewitt and K. Ross. Abstract Harmonic Analysis: Volume I. Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 1979.
[26] P. J. Huber. Robust Statistics. Wiley Series in Probability and Statistics. Wiley, 2004.
[27] J. Kim and C. D. Scott. Robust kernel density estimation. Journal of Machine Learning Research, 13(Sep):2529-2565, 2012.
[28] L. J. Latecki, R. Lakamper, and T. Eckhardt. Shape descriptors for non-rigid shapes with a single closed contour. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2000), volume 1, pages 424-429. IEEE, 2000.
[29] Y. Mileyko, S. Mukherjee, and J. Harer. Probability measures on the space of persistence diagrams. Inverse Problems, 27(12):124007, 2011.
[30] K. Muandet, K. Fukumizu, B. K. Sriperumbudur, and B. Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1-141, 2017.
[31] R. Peyre. Comparison between W2 distance and H^{-1} norm, and localization of Wasserstein distance. ESAIM: Control, Optimisation and Calculus of Variations, 24(4):1489-1501, 2018.
[32] J. M. Phillips, B. Wang, and Y. Zheng. Geometric inference on kernel density estimates. In 31st International Symposium on Computational Geometry (SoCG 2015), 2015.
[33] N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, pages 2199-2207, 2010.
[34] B. Sriperumbudur. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839-1893, 2016.
[35] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
[36] K. Turner, Y. Mileyko, S. Mukherjee, and J. Harer. Fréchet means for distributions of persistence diagrams. Discrete & Computational Geometry, 52(1):44-70, 2014.
[37] A. Van Der Vaart and J. A. Wellner. Preservation theorems for Glivenko-Cantelli and uniform Glivenko-Cantelli classes. In High Dimensional Probability II, pages 115-133. Springer, 2000.
[38] A. W. Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.
[39] R. Vandermeulen and C. Scott. Consistency of robust kernel density estimators. In Conference on Learning Theory, pages 568-591, 2013.
[40] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
[41] C. Villani. Topics in Optimal Transportation. Number 58. American Mathematical Society, 2003.
[42] L. Wasserman. Topological data analysis. Annual Review of Statistics and Its Application, 5:501-532, 2018.
[43] X. Xu, J. Cisewski-Kehe, S. B. Green, and D. Nagai. Finding cosmic voids and filament loops using topological data analysis. Astronomy and Computing, 27:34-52, 2019.
[44] A. Zomorodian and G. Carlsson. Computing persistent homology. Discrete & Computational Geometry, 33(2):249-274, 2005.