# differentially_private_topological_data_analysis__89e3ad6c.pdf

Journal of Machine Learning Research 25 (2024) 1-42 Submitted 5/23; Revised 11/23; Published 5/24

Diﬀerentially Private Topological Data Analysis

Taegyu Kang kang426@purdue.edu Department of Statistics Purdue University West Lafayette, IN 47907, USA

Sehwan Kim sehwan_kim@hphci.harvard.edu Department of Population Medicine Harvard Medical School/Harvard Pilgrim Health Care Institute Boston, MA, 02215, USA

Jinwon Sohn sohn24@purdue.edu Department of Statistics Purdue University West Lafayette, IN 47907, USA

Jordan Awan jawan@purdue.edu Department of Statistics Purdue University West Lafayette, IN 47907, USA

Editor: Manuel Gomez-Rodriguez

This paper is the ﬁrst to attempt diﬀerentially private (DP) topological data analysis (TDA), producing near-optimal private persistence diagrams. We analyze the sensitivity of persistence diagrams in terms of the bottleneck distance, and we show that the commonly used Čech complex has sensitivity that does not decrease as the sample size n increases. This makes it challenging for the persistence diagrams of Čech complexes to be privatized. As an alternative, we show that the persistence diagram obtained by the L1-distance to measure (DTM) has sensitivity O(1/n). Based on the sensitivity analysis, we propose using the exponential mechanism whose utility function is deﬁned in terms of the bottleneck distance of the L1-DTM persistence diagrams. We also derive upper and lower bounds of the accuracy of our privacy mechanism; the obtained bounds indicate that the privacy error of our mechanism is near-optimal. We demonstrate the performance of our privatized persistence diagrams through simulations as well as on a real data set tracking human movement.

Keywords: Čech complex, Distance to a measure, Exponential mechanism, Persistence diagram, Persistent homology

*: Kang, Kim, and Sohn are co-ﬁrst authors and they contribute equally to this paper. : Corresponding author.

c 2024 Taegyu Kang, Sehwan Kim, Jinwon Sohn, and Jordan Awan.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-0585.html.

Kang, Kim, Sohn, and Awan

1. Introduction

Recent advances in technology make it possible to obtain data with such complicated structure that traditional data analysis methodologies cannot deal with them appropriately. To analyze such complex data, topological data analysis has been an indispensable tool in data science (Niyogi et al., 2011; Khasawneh and Munch, 2016; Wasserman, 2018; Dindin et al., 2020; Rieck et al., 2020). Essentially, topology is the most fundamental mathematical structure where the notion of nearness can be discussed, and its generality makes it an appropriate framework for discussing extremely complicated data which are not expected to have more equipped structures such as vector spaces, manifolds, and so on. Topological data analysis is a novel branch of data analysis which was invented to capture the topological structure of data, and it has been deeply studied for the last couple of decades. See Carlsson (2009) for a comprehensive overview. Especially, persistent homology, its ﬂagship method, has been extensively studied theoretically and applied to many diﬀerent disciplines such as medicine (Nicolau et al., 2011), biology (Mc Guirl et al., 2020), neuroscience (Xu et al., 2021; Caputi et al., 2021), astronomy (Xu et al., 2019), and machine learning (Hensel et al., 2021; Betthauser et al., 2022), to name a few.

At the same time, as bigger and more diverse data have become accessible, the issue of protecting private information of individuals in the data has also gained attention. Due to this concern, there is an increasing demand for privacy-protecting procedures with formal guarantees. Such a paradigm has accelerated the growing attention to a well-formulated framework of privacy protection in data science. Diﬀerential privacy (DP), introduced by Dwork et al. (2006), is the state-of-the-art framework that formally quantiﬁes the notion of privacy and its protection. Diﬀerential privacy requires that a privacy-protecting algorithm produces similar results for any two data sets, which diﬀer at only one data point. The exact deﬁnition of ϵ-DP will be introduced in the following section and we recommend Dwork and Roth (2014) for a comprehensive introduction to DP. Recently, DP has been a growing research topic in data science, tackling problems in deep learning (Shokri and Shmatikov, 2015; Abadi et al., 2016), functional data analysis (Hall et al., 2013; Mirshani et al., 2019), social networks (Karwa and Slavković, 2016; Karwa et al., 2017), as well as many others.

While the DP framework has been widely adapted to numerous methodologies in data science as mentioned above, its application to TDA has yet to be discussed. To the best of our knowledge, the only work involved with both DP and TDA is Hehir et al. (2022), which solely used persistence diagrams as a method of communicating the utility of a randomized response algorithm, and did not attempt to produce a private version of a TDA object. We believe that introducing the DP framework to TDA will be an emerging direction of research because many areas where TDA methods have been successfully utilized use data containing people s sensitive information. For example, Shnier et al. (2019) applied persistence diagrams to diﬀerentiate gene expressions in individuals with autism spectrum disorders from those in a control group. Furthermore, TDA methods are used in several other problems in medical domain and neuroscience, as mentioned above, such as brain connectivity (Caputi et al., 2021), breast cancer (Nicolau et al., 2011), and neurological disorder (Lee et al., 2011). Finally, TDA has recently been combined with other popular machine learning methods such as convolutional neural networks (Love et al., 2023), auto-encoders (Hofer et al., 2019), etc. Hence, introducing DP to TDA may have far-reaching inﬂuence in data science.

Differentially Private Topological Data Analysis

Our Contributions: This paper is concerned with how to introduce the concept of diﬀerential privacy (DP) into the framework of topological data analysis (TDA). Our key observation is that, to exploit currently available privacy mechanisms, one needs an outlierrobust TDA method. Such an observation agrees with a long-standing intuitive principle in diﬀerential privacy saying that the speciﬁc data of any one individual should not have a signiﬁcant eﬀect on the outcome of the analysis to achieve privacy protection; for instance, see Dwork and Lei (2009), Avella-Medina (2021). To illuminate the adaptation of this principle to TDA, we examine the sensitivity of the bottleneck distance of persistence diagrams, which is the most widely used presentation of persistent homology, obtained by two diﬀerent types of construction: persistence diagrams obtained from Čech complexes, which we show is not outlier-robust; and persistence diagrams obtained from the distance to a measure (DTM), which is outlier-robust. Our examination shows why persistence diagrams of Čech complexes are not readily privatized, and how persistence diagrams of the DTM can overcome such a diﬃculty. Moreover, we discuss how the magnitude of outlier-robustness aﬀects the rate of sensitivity of the bottleneck distance, and propose to use L1-DTM in order to achieve a minimal sensitivity. Based on the sensitivity analysis, we propose the ﬁrst diﬀerentially private mechanism for persistence diagrams that provides ϵ-diﬀerential privacy, using the exponential mechanism. We also establish upper and lower bounds for the accuracy error of our mechanism. The established bounds indicate that the privacy error of our mechanism is near-optimal. Our contributions can be summarized more speciﬁcally as follows:

We prove that the sensitivity of the persistence diagram of Čech complexes, deﬁned in terms of the bottleneck distance, does not diminish to zero as the sample size increases.

We propose using the persistence diagram of the distance to measure (DTM) as an alternative, and we prove that the Lp-DTM persistence diagram is guaranteed to have sensitivity, which is deﬁned in terms of the bottleneck distance, O(n 1/p). This leads us to use the L1-DTM persistence diagram that guarantees the sensitivity O(n 1).

We apply the exponential mechanism whose utility function is deﬁned in terms of the bottleneck distance of L1-DTM persistence diagrams in order to produce diﬀerentially privatized persistence diagrams. To the best of our knowledge, our algorithm is the ﬁrst attempt of developing a mechanism generating diﬀerentially privatized persistence diagrams. We also ﬁnd upper and lower bounds of the accuracy error of our mechanism.

We prove that any privacy mechanism applied to the L1-DTM persistence diagrams cannot have accuracy whose decay order is superior to the upper bound of the decay order of the privacy error corresponding to our mechanism. This result indicates that our mechanism may have optimal privacy error.

Organization: The remainder of the paper is organized as follows. In Section 2, we brieﬂy review the background and notation of TDA and DP. In Section 3, we ﬁrst examine the sensitivity of persistence diagrams constructed from the Čech complexes, as well as for an outlier-robust construction of persistence diagrams obtained from the DTM, introduced by Chazal et al. (2011). In Section 4, based on the sensitivity analysis given in Section 3, we employ the exponential mechanism to generate privatized persistence diagrams. We also derive upper and lower bounds of its accuracy. Simulation studies which implement our

Kang, Kim, Sohn, and Awan

algorithm are given in Section 5. In Section 6, we apply our algorithm to a real-world data set including information about the locations of three people walking in a building recorded on smartphones over time. All proofs as well as additional results of real data analyses are presented in the appendices.

2. Preliminaries

In this section, we introduce the persistence diagram, which is a statistic about the shape of the data, and look at bottleneck distance, a metric in the persistence diagram space as well as its stability. Also, we review the ϵ-diﬀerential privacy (ϵ-DP) and the exponential mechanism, which is one of the algorithms that satisﬁes ϵ-DP.

Notation Throughout the paper, for real numbers A and B which possibly depend on a parameter n Z+, we use the asymptotic notation A B or A = O(B) to denote the bound |A| CB for some absolute constant C > 0. If the constant C depends on some parameters, we will explicitly indicate them; for instance, A k,d B means the bound |A| Ck,d B with a constant Ck,d depending only on k and d. If A B and A B, we write denote it by A B. We also use the notation A = o(B) to denote the asymptotic result limn A/B = 0. In addition, for random variables X and Y which possibly depend on a parameter n Z+, we write X = Op(Y ) to mean that X/Y is bounded in probability and X = op(Y ) to mean that X/Y converges to zero in probability, which are standard notations in probability theory. Let (Xn) n=1 be a sequence of random variables. We write Xn = Op f(n) to mean that Xn = Op f(n) logk n for some k Z+. For a given metric space (X, d), Dn := Dn(X) := X n denotes the set of all n-tuples of elements in X for every n Z+.

2.1 Persistent Homology and Diagrams

Here, we brieﬂy introduce two methods of constructing persistent homology and corresponding persistence diagrams of data, which will show up in our main discussion. The former one is the persistent homology of Čech complex and the latter one is the persistent homology of the sub-level sets of a continuous function. We believe that an intuitive and illustrative description of persistent homology will suﬃce to understand the results of this paper. More detailed background knowledge about persistent homology along with some fundamental knowledge about simplicial homology is presented in Appendix A. For a deeper and comprehensive understanding for persistent homology, we refer the reader to the literature of persistent homology; for instance, Edelsbrunner and Harer (2008, 2009); Zomorodian and Carlsson (2005). For fundamental concepts about algebraic topology, we refer the reader to standard texts in algebraic topology such as Munkres (1984); Bredon (1997). Let D = {x1, . . . , xn} be a ﬁnite subset of a metric space (X, d). Let r > 0 be a positive real number. At every point xi, we place a ball B(xi; r) with radius r centered at xi. The persistence homology of the Čech complexes on D captures the evolution of the homological structure of the union n j=1B(xj; r) as r varies. For instance, the 0th homological feature represents the connected components of it and the 1st homological feature represents the loops in it. Figure 1 portrays how to construct the persistent homology of Čech complexes. As the radius r varies, some homological features show up and disappear, and such birth

Differentially Private Topological Data Analysis

Figure 1: Constructing of Čech complexes and its persistence diagram: The left three ﬁgures illustrate how Čech complexes on nine points supported on a circle are constructed. When r = 0.4, there are several connected components; but there is no loop. When r = 0.8, there exists a 1-dimensional loop that captures the shape of the circle. When r = 1.2, the loop disappears and there is only a single contractible connected component. The rightmost ﬁgure is the persistence diagram of the Čech complexes. Each black dot represents the birth-and-death times of each connected component and the red triangle represents the birth-and-death times of the loop.

and death of homological features are presented as multisets called persistence diagrams. More precisely, a qth persistence diagram of the Čech complexes on the data set D a multiset that consists of ﬁnitely many, say m, points (bi, di) satisfying 0 bi di for every i = 1, . . . , m; the presence of each point (bi, di) means that there exists a q-dimensional homological feature that shows up at radius bi and disappears at radius di. As for the other method, let f D : X R be a continuous function deﬁned on metric space (X, d), possibly depending on the given set D. For each r R, one can consider the sub-level set Lr := {x X : f D(x) r}. As we consider the evolution of the union of balls in the previous way of construction, we now consider the evolution of the sub-level sets Lr as r varies. Figure 2 illustrates such an evolution of a certain continuous function. In the ﬁgure, three connected components and one loop show up once and disappear at some time except for a single connected component. The birth-death pairs at each dimension can be presented as a persistence diagram, just as for the Čech complex. In general, let a ﬁltration of topological spaces {Ur}r R be given, where R is a linearly ordered set; that is, for any r1 and r2 in R satisfying r1 r2, Ur1 Ur2. Then, one can deﬁne the persistent homology and the corresponding persistence diagram of the sequence. In our ﬁrst example, each Ur is the union of balls with radius r (or, the simplicial complex obtained from the balls); in our second example, each Ur is the sub-level set Lr.

2.2 Stability of Persistence Diagrams in the Bottleneck Distance

A persistence diagram P = {(bi, di)}m i=1 is essentially a multiset of birth-death pairs bi and di, which satisfy bi di. There are numerous ways to vectorize a persistence diagram into an element in some vector space. One of the most popular ways is to represent each birth-death pair (b, d) by the Dirac measure δ(b,d) at (b, d), and represent the whole diagram P by the point measure Pm i=1 δ(bi,di) which is a measure on the set T := {(x, y) : 0

Kang, Kim, Sohn, and Awan

Figure 2: Filtration corresponding to the L2-DTM of a circle data set: The data set is supported on a circle, and the L2-DTM function of the data is visualized (for convenience, the function multiplied by 1 is presented). The persistence diagram constructed from the function is presented. The second and third columns present how the ﬁltration of the sublevel sets of the function evolves. As the ﬁltration evolves, three connected components show up at values 0.308, 0.367, and 0.397 respectively. One component dies at a value of 0.613, another one dies at 0.636. The last one lives until the end of evolution. At the right-most ﬁgure in the last row, what we mean by Not die is that, even though the birth-death pair of Point 1 shows up in the ﬁgure the actual death time of Point 1 is inﬁnity.

Differentially Private Topological Data Analysis

x y } (for a detailed description of such a way of vectorization, see Section 2 in Owada (2022)). By realizing a persistence diagram as a measure, it is possible to deﬁne the distance between two persistence diagrams by means of a distance between measures. One of the most popular choices is using the L Wasserstein distance of the measures, which is called the bottleneck distance. Speciﬁcally, let P, P be two persistence diagrams. Then the bottleneck distance between P and P is deﬁned as

d B(P, P ) := min g: P P max z P z g(z) ,

where P and P denote the persistence diagrams P and P along with the copies of all points on the diagonal respectively; g : P P ranges over all bijections between P and P . In words, d B(P, P ) is the minimax cost of pairing the birth-death points in one diagram to the other diagram one-by-one in terms of ℓ distance. When two diagrams contain diﬀerent numbers of birth-death points, then the remaining points in one diagram pair up with the points on the diagonal. A key property of the bottleneck distance is the following stability property (for more details, see Cohen-Steiner et al. (2007); Chazal et al. (2016a)). Suppose that Pq(D) and Pq(D ) are qth persistence diagrams constructed from the Čech complexes of two sets D and D in a metric space (X, d), then

d B(Pq(D), Pq(D )) d H(D, D ), (2.1)

d H(D, D ) := max sup x D inf y D d(x, y), sup y D inf x D d(x, y)

denotes the Hausdorﬀdistance between D and D . Analogously, if Pq(D) and Pq(D ) are obtained from the ﬁltrations of the sub-level sets of continuous tame functions f D and f D , respectively. Then d B(Pq(D), Pq(D ))) sup x X |f D(x) f D (x)|. (2.2)

The precise deﬁnition of tame functions is presented in Deﬁnition 24. For more comprehensive discussion, please refer to Cohen-Steiner et al. (2007). Intuitively, a R-valued function f is said to be tame if the homology of its sub-level sets changes at most ﬁnitely many times.

2.3 Diﬀerential Privacy

DP is a mathematical framework designed to quantify the privacy leakage of a proposed randomized algorithm (called a mechanism), introduced by Dwork et al. (2006). The ﬁrst step for measuring such privacy risk starts from specifying which databases are considered to diﬀer in one entry, which we refer to as adjacent databases. We say D and D are adjacent if d(D, D ) 1, for some metric d( , ) between databases. In this paper, we use Hamming distance H( , ), which counts the number of entries that diﬀer between D and D . A privacy mechanism M : Dn Y returns a random variable M(D) for any D Dn, and the privacy risk of the algorithm M can be evaluated by deﬁnition as follows:

Kang, Kim, Sohn, and Awan

Deﬁnition 1 (ϵ-Diﬀerential privacy (ϵ-DP): Dwork et al., 2006) Given ϵ 0, a privacy mechanism M on the output space Y satisﬁes ϵ-DP if

P(M(D) S) eϵP(M(D ) S), (2.2)

for every measurable set S Y and all D and D satisfying H(D, D ) 1.

This deﬁnition characterizes how much privacy leakage could occur via the privacy budget parameter ϵ when a single entity in D is not the same one in D . The smaller that ϵ is, the harder it is to distinguish the probability distributions of M(D) and M(D ), which accordingly makes it harder to identify whether data set D or D was used in the analysis by M (Wasserman and Zhou, 2010). For conceptual understanding, let us imagine that a data set D contains information of 100 people obtained by a survey. Let us call one person in D Person A, and let us assume that a ϵ-DP privacy mechanism M is employed so that a summary from M(D) is released, containing some useful information about D. Then, it is known that, for any other data set D that shares 99 people with D, except for Person A, the probability distributions of M(D) and M(D ) cannot be easily distinguished. Thus, it is diﬃcult to identify whether Person A is included in the actual data set D or not. Notice that Person A was chosen arbitrarily in D, so the privacy of each individual in D is protected. While any ϵ-DP mechanism preserves privacy, not all mechanisms ensure good performance with respect to an underlying utility. It is straightforward to imagine that sanitized statistics can devastate the performance of the utility due to excessive noises for privacy. In contrast, the exponential mechanism is a general technique that ensures high utility while controlling the privacy leakage within the budget ϵ.

Proposition 2 (Exponential mechanism: Mc Sherry and Talwar, 2007) Let n Z+ and let {u D : Y R : D Dn} be a collection of utility functions. Assume that the sensitivity (u) is ﬁnite:

(u) = sup H(D,D ) 1 sup y Y |u D(y) u D (y)| < , (2.3)

where the supremum is over all adjacent D and D and assume that R exp (u D(y)) dν(y) < for all D D where ν is a measure in Y. If satisﬁes (u) < , then the collection of mechanisms {M(D) : D D}, each of which has the probability density with respect to ν

p D(y) exp ϵ

2 u D(y) , (2.4)

satisﬁes ϵ-DP.

The exponential mechanism can be easily applied to a wide variety of problems having utility functions. One of the simplest examples is a count statistic. Let us deﬁne the count statistic count(D) of a data set D to be the number of data points in D having a certain property, and deﬁne a utility function u D(y) := |y count(D)|, which has sensitivity 1. This utility puts higher values when y is close to count(D). Many machine learning and statistical inference problems can be privately handled using the exponential mechanism

Differentially Private Topological Data Analysis

when it is possible to deﬁne appropriate utility functions such as empirical risk or likelihood functions (Huang and Kannan, 2012; Awan et al., 2019; Cummings et al., 2019; Lu et al., 2022).

Proposition 3 (Utility of the exponential mechanism: Dwork and Roth, 2014) Let OPTD = maxy Y u D(y) be the optimal value that can be achieved by the utility function over all outputs, given database D. Let Y be a random variable with the density given in (2.4). Then,

P u D(Y ) OPTD 2

ϵ log |Y| + t e t, (2.5)

for every t 0. Consequently,

u D(Y ) = OPTD + Op

The utility function in the exponential mechanism must be carefully chosen to ensure that the error rate given in Proposition 3 translates to optimal rates for the private output. For example, Awan et al. (2019) showed that when the utility function has a quadratic Taylor expansion at its maximum, the randomness for privacy in the exponential mechanism often gives rise to Op(1/ n) noise, which in general is of the same order as the non-private statistical estimation problems. On the other hand, Reimherr and Awan (2019) showed that for some utility functions which are locally approximated by the absolute value function, the randomness for privacy may be as low as Op(1/n). With the goal of producing diﬀerentially private persistence diagrams, we propose using exponential mechanism whose utility function is the negative bottleneck distance between the private and non-private persistence diagrams.

3. Sensitivity of Persistence Diagrams in the Bottleneck Distance

Most DP algorithms require quantifying how much the value of a statistic is changed by changing a single point in a given data. The largest possible amount of that change in the statistic is called the sensitivity of the statistic. In this study, we regard a persistence diagram constructed from a data set D as a statistic that estimates the homological structure of the space underlying the data, and we use the bottleneck distance to deﬁne a metric on the space of persistence diagrams. Hence, to apply a DP mechanism to persistence diagrams, our ﬁrst step should be estimating the sensitivity of persistence diagrams in terms of the bottleneck distance; namely, we are going to analyze how big the bottleneck distance d B(PD, PD ) can be, where the pair (D, D ) denotes a pair of adjacent data sets. Note PD and PD mean the persistence diagrams constructed from the data sets D and D respectively under a given way of constructing persistent homology. In diﬀerential privacy, to ensure consistent estimators, it is necessary that the sensitivity goes to 0 as the size of the data grows. Our key observation is that if a chosen way of constructing persistent homology is not outlier-robust, the sensitivity of the corresponding persistence diagrams may not tend to 0 even if the size of data, say n, grows. We demonstrate that the sensitivity of the persistence diagrams of Čech complexes cannot converge to 0 even if the size of data grows to inﬁnity. To overcome such an issue, we

Kang, Kim, Sohn, and Awan

propose using the notion called distance to a measure (DTM), which was thoroughly discussed by Chazal et al. (2018) to give birth to outlier-robust persistence diagrams. Moreover, among various versions of construction of DTM, we propose using L1-DTM which gives the smallest sensitivity. Before moving on to the main sensitivity analysis, we would like to clarify our terminology. In the introduction to this section, we have been using the word sensitivity to refer to two diﬀerent quantities: sensitivity of the bottleneck distance and the sensitivity of utility functions of the exponential mechanism. To avoid a confusion, going forward we refer to the sensitivity of the bottleneck distance of persistence diagrams as the base sensitivity, which is the terminology introduced in Awan and Wang (2022). The base sensitivity of the bottleneck distance of qth persistence diagrams from Čech complexes is denoted by Cech q and that from the L1-DTM is denoted by DTM q . The precise deﬁnition of them will be presented in each of the following subsections. Otherwise, going forward we will reserve the term sensitivity for the sensitivity of a given utility function of the exponential mechanism.

3.1 Sensitivity of the Persistence Diagrams of Čech Complexes

Let us illustrate how the construction of Čech complexes fails to have a decreasing base sensitivity. The situation of the following example is well illustrated in Figure 3. Note that Figure 3 draws ﬁgures by means of the Vietoris-Rips complex instead of the Čech complex. The Vietoris-Rips complex is a variant of the Čech complex which has a computational advantage than the Čech complex. In fact, the ﬁltration of Vietoris-Rips complexes has essentially the same information with that of Čech complexes. The deﬁnition of the Vietoris Rips complex and its relationship with the Čech complex is presented in Appendix A.3.

Example 1 Let D be a set of n points in R2 that is tightly clustered into exactly two clusters. Write x to denote the point located at the midpoint of the clusters, and take D to be the data set obtained by moving one point in D to x. Now, further imagine that n grows while the conﬁguration of the points in D and D remains the same, and derive the 0th dimensional persistence diagrams obtained from the Čech complexes of D and D . Then, the connected components in D collapse into the two clusters quickly, while the isolated point x D

produces an additional connected component that lives longer. Such a discrepancy between two persistence diagrams prohibits the bottleneck distance between them from going to 0. More precisely, the bottleneck distance between them remains as big as the distance of the point x from the clusters in D.

The following lemma establishes that this phenomenon is widespread. We denote the qth persistence diagram constructed from the Čech complexes on the data set D by PČech q (D).

Lemma 4 Let D = {x1, . . . , xn} be a subset of an Euclidean space Rd. Let {d1, . . . , dm} be the set of distinct ﬁnite death times in P Cech 0 (D) with 0 < d1 < < dm < . Let δ = dm dm 1 (if m = 1, let δ = d1). Then, it is possible to take a set D with H(D, D ) = 1 satisfying that

d B P Cech 0 (D), P Cech 0 (D ) min{δ, dm/2}.

Differentially Private Topological Data Analysis

-2 -1 0 1 2 3 x1

-2 -1 0 1 2 3 x1

-2 -1 0 1 2 3 x1

-2 -1 0 1 2 3 x1

0 1 2 3 Birth

0 1 2 3 Birth

0 1 2 3 Birth

0 1 2 3 Birth

Figure 3: Persistence diagrams on D and D : the red circles and the green triangles are the connected components and the loops respectively. The columns (i-1) and (i-2) correspond to the results with L1-DTM on D and D respectively, and the two diagrams have 0.042 bottleneck distance in terms of the connected components. The columns (ii-1) and (ii-2) are from the Vietoris-Rips complex and the distance between the two diagrams have 0.762.

Roughly, the lemma can be proved by constructing a data set D having an additional point at the middle of the most signiﬁcant connected components in the ﬁltration of Čech complexes of D, i.e., the connected components that die at time dm. The detailed proof is presented in Appendix B.1. From now on, we assume that all the data sets are supported in a bounded subset E of Rd

unless there is any additional speciﬁcation. We deﬁne the base sensitivity Cech q concerning Čech complexes:

Cech q := sup H(D,D ) 1 d B P Cech q (D), P Cech q (D ) .

Note that the stability theorem (2.1) implies the following upper bound of the base sensitivity:

Cech q diam E

for every non-negative integer q. Lemma 4 provides the matching lower bound of the base sensitivity for q = 0. Moreover such upper and lower bounds show that the sensitivity of the utility function v D deﬁned as

v D(P) := d B P Cech 0 (D), P , (3.1)

has sensitivity of constant order:

Kang, Kim, Sohn, and Awan

Theorem 5 Suppose that a given data-generating process is supported on a bounded subset E of a Euclidean space. Then, we have

Cech 0 diam E

Moreover, the utility function v D deﬁned in (3.1) satisﬁes

1 4diam E sup H(D,D ) 1 sup P |v D(P) v D (P)| diam E.

Theorem 5 shows that why it is challenging to develop a privacy mechanism for Čech complexes: Čech complexes are so sensitive, in terms of the bottleneck distance of their persistence diagrams, that the sensitivity of the utility function v D( ) remains constant regardless of the size n of the data set. This implies that the exponential mechanism using this utility function keeps adding a constant size of noise even if n gets bigger. This prevents the bottleneck distance from becoming small even in the case of huge n.

3.2 Sensitivity of the Persistence Diagrams of the DTM

The DTM, which was introduced by Chazal et al. (2011), provided a novel way to overcome the sensitivity to outliers. DTM proposes measuring how far each point is from the dense part of the support of the probability measure. By doing so, an outlier corresponds to a relatively large distance. Thus, when it comes to the ﬁltration of the sub-level sets of a DTM, the topological features produced by the outlier would occur in the late period of the ﬁltration, or it might not occur through the whole ﬁltration. More thorough discussions on the DTM can be found in Chazal et al. (2018); Anai et al. (2020); Oudot (2015). By virtue of the properties of the DTM, it is very likely that DTM-based persistence diagrams give rise to a much smaller sensitivity, so it may provide us with a suitable TDA statistic to build our privatized mechanism upon. In fact, we show that DTM-based persistence diagrams achieve sensitivity converging to 0 as n grows to inﬁnity, but the rate of decay depends on which class of DTM designs we use. Basically, a DTM is deﬁned to be a Lp norm of a certain function. The original version of DTM was deﬁned to be a L2-type quantity. We show that the L2-type DTM produces persistence diagrams whose base sensitivity is bounded by O(n 1/2), and we recognize that each Lp-DTM results in an analogous upper bound of the base sensitivity: O(n 1/p). From this observation, we focus on the L1-DTM that has the fastest decay rate in the base sensitivity. Furthermore, we also verify the base sensitivity of the persistence diagrams obtained from the L1-DTM is bounded below by n 1 up to a constant. In other words, our sensitivity analysis for L1-DTM is sharp up to constants. We present the deﬁnition of the general Lp-DTM and its empirical realization. The key property to obtain upper bounds of the persistence diagrams is the so-called Wasserstein stability of a DTM, which was extensively discussed in the past literature; for instance, see Chazal et al. (2016b). As a result of the Wasserstein stability, we deduce the upper bound of rate n 1/p for the Lp-DTM. The matching lower bound of rate n 1 for the L1-DTM is established by constructing a speciﬁc example that exactly gives the lower bound. All the proofs are presented in the appendix.

Differentially Private Topological Data Analysis

Deﬁnition 6 (Distance to a measure) Let µ be a probability measure and X be a random variable whose probability distribution is µ. For the given µ, 0 < m < 1, and p 1, the Lp distance to the measure µ at resolution m is deﬁned by

δ(p)(x) := δ(p) µ,m(x) := 1

G 1 x (u) pdu 1/p ,

where Gx(t) = P X x t . Here, denotes the ℓ2-norm in Euclidean spaces.

The hyperparameter m determines how much smoothing eﬀect will be employed, which is reminiscent of the role of the bandwidth in a kernel density estimation. A natural empirical approximation would be the following:

Deﬁnition 7 (Empirical version of the DTM) Let X1, . . . , Xn be i.i.d. samples obtained from a probability distribution µ and µn the empirical probability measure deﬁned on this sample, i.e.,

The empirical Lp-DTM to µ at resolution m, denoted by ˆδ(p), is deﬁned to be the Lp-DTM to µn at resolution m; namely,

ˆδ(p)(x) := δ(p) µn,m(x) = 1

Xi Nk(x) Xi x p 1/p ,

where k = mn and Nk(x) is the set containing the k nearest neighbors of x among X1, . . . , Xn. Here, the distance between data points is measured by the ℓ2-norm in Euclidean space.

The key quantitative property of the Lp-DTM, which is called its Wasserstein stability, is the following: let µ and ν be probability measures deﬁned on a common metric space, then sup x

δ(p) µ (x) δ(p) ν (x) 1 m1/p Wp(µ, ν), (3.1)

where Wp(µ, ν) denotes the p-Wasserstein distance between µ and ν. For more details, see Chazal et al. (2016b). Let D and D be adjacent data sets. Let PDTMp q (D) denote the qth persistence diagram constructed from the ﬁltration of sub-level sets of the Lp-DTM to the empirical distribution of the data set D. The base sensitivity of DTMp q concerning the DTM is

DTMp q := sup H(D,D )=1 d B PDTMp q (D), PDTMp q (D ) .

By virtue of the stability theorem (2.2) and the Wasserstein stability (3.1), the following upper bound of the base sensitivity DTMp q can be established by quantifying the p-Wasserstein distance between empirical distributions on adjacent data sets.

Kang, Kim, Sohn, and Awan

Theorem 8 (Sensitivity of the persistence diagrams via Lp-DTM) Let D and D be ﬁnite subsets of a bounded set E in Rd satisfying |D| = |D | = n and H(D, D ) = 1. Then, for every non-negative integer q,

DTMp q diam E m1/pn1/p .

In fact, as a result of the Wasserstein stability (3.1), the result of the theorem can be obtained by estimating the p-Wasserstein distance between the empirical distributions on D and D . The detailed proof is presented in Appendix B.1. According to Theorem 8, each Lp-DTM is guaranteed to have base sensitivity bounded above by O(n 1/p). In particular, such a guaranteed upper bound becomes smallest when p is taken to be 1:

d B PDTM1 q (D), PDTM1 q (D ) diam E

In fact, the upper bound of the L1-DTM is sharp up to constants.

Proposition 9 (Lower bound of the sensitivity of the L1-DTM) Suppose that m < 1/2. Then, for every positive integer n, there exists a pair of sets D and D that satisﬁes |D| = |D | = n, H(D, D ) = 1, and

d B PDTM1 0 (D), PDTM1 0 (D ) = diam E

where k = mn .

The proof can be obtained by constructing a pair of adjacent data sets D and D that achieve the proposed distance. In fact, the data sets illustrated in Figure 3 achieve it. For a detailed proof, see Appendix B.1. Now, we introduce the utility function that we use to design our privacy mechanism. Let Pers denote the space of persistence diagrams, equipped with the bottleneck distance. For any given data set D and any non-negative integer q, we deﬁne the function u(q) D : Pers R as follows: u(q) D (P) := d B P, PDTM1 q (D) .

Let ℓbe a chosen non-negative integer. Our utility function u D : Pers ℓ+1 R is deﬁned as follows:

u D(P0, . . . , Pℓ) :=

q=0 u(q) D (Pq). (3.2)

As a result of the upper and lower bounds for the base sensitivity, we can establish the following upper and lower bounds of the sensitivity:

Corollary 10 For a chosen ℓ 0, let the utility function u D be deﬁned as in (3.2). Then the following is satisﬁed:

2 mn sup H(D,D )=1 sup P Pers |u D(P) u D (P)| (ℓ+ 1)diam E

Differentially Private Topological Data Analysis

Remark 11 The additive nature of the utility function u D is what allows us to establish the upper and lower bounds in Corollary 10. Notice that the lower bound of the corollary is derived from the result of Proposition 9 which is only valid for the 0th persistence diagrams; but, the additive form of u D allows it to be a lower bound for the sensitivity of the entire utility function.

Remark 12 Note that while the lower bound of L1-DTM matches the rate of its upper bound, we do not at this time obtain such matching lower bounds of the other Lp-DTMs. Hence, it might be the case that the base sensitivity of the general Lp-DTM can be improved. For example, in the situation of Figure 3, we found empirical evidence that the bottleneck distance between L2-DTM persistence diagrams of D and D is also O(n 1).

4. Employment of the Exponential Mechanism with the L1-DTM

In this section, we describe how to implement the exponential mechanism in order to generate privatized persistence diagrams. Let D = {X1, . . . , Xn} be a data set that consists of i.i.d. samples having a common probability distribution µ. For brevity, we denote by Pq(D) for each q 0 the qth persistence diagram obtained from the L1-DTM to the empirical measure µn, which was denoted by PDTM1 q (D) in the previous section. And, we set P(D) := (P0(D), . . . , Pℓ(D))

for the given ℓ. Let Pq(µ) be the qth persistence diagram obtained from the L1-DTM δ(1) µ to the measure µ, and let us deﬁne P(µ) := (P0(µ), . . . , Pℓ(µ)) for the given ℓ. Also, let PDP = (P0,DP, . . . , Pℓ,DP) be a tuple of privatized persistence diagrams generated from our algorithm. We analyze the error of our privatized persistence diagrams from two diﬀerent points of view. First, we estimate d B(P(µ), PDP). This quantity represents the error of the privatized persistence diagrams from the persistence diagram of the original data-generating process. From a statistical perspective, P(µ) can be regarded as a parameter characterizing the true data-generating process. Hence, the quantity d B(P(µ), PDP) quantiﬁes the amount of error in estimating the parameter P(µ) by the privatized statistic PDP that is obtained by privatizing the actual statistic P(D). The second approach is to estimate the quantity d B(P(D), PDP) which quantiﬁes how much the privatization process distorts the original non-privatized statistic.

4.1 Generating Privatized Persistence Diagrams

The design of an exponential mechanism is formulated by specifying an output space Y and a utility function u D : Y R for each data set D. Since our target to be privatized is a persistence diagram, it would be a natural choice to take the space of all possible persistence diagrams, which we denoted by Pers, as the output space. The ﬁrst candidate for the utility function would be the function u D deﬁned in (3.2). However, its output space, Pers, contains persistence diagrams which have arbitrary many numbers of birthdeath pairs; that is, to sample persistence diagrams from the whole Pers is inevitably an inﬁnite-dimensional problem, which is technically diﬃcult in computation. To bypass such an issue, we pre-specify a hyperparameter M Z+, a positive integer, and only take care of the space Pers M of persistence diagrams having at most M birth-death pairs at each

Kang, Kim, Sohn, and Awan

dimension q. On each restricted space Pers M, for any given data set D, we re-deﬁne the function u(q) D : Pers M R as follows:

u(q) D (P) := d B(P, Pq(D)).

The utility function u D is also re-deﬁned in the same way as in (3.2) by using the re-deﬁned u(q) D s. Namely, the utility function u D : Pers M ℓ+1 R is deﬁned as

u D(P0, . . . , Pℓ) :=

q=0 u(q) D (Pq). (4.1)

Note that the upper and lower bounds established in Corollary 10 are still valid for the utility u D deﬁned in (4.1). Under the choice of the utility function u D in (4.1), the probability distribution from which privatized persistence diagrams are generated can be speciﬁed. Before describing the exponential mechanism, we introduce a discretized version of Pers M that will make the analysis in Section 4.2 convenient. Note that each persistence diagram in Pers M can be viewed as a family of at most M points in the upper-left triangle T := {(x, y) : 0 x y diam E}. Instead of using T directly, we discretize it with ﬁnitely many discrete points; for example, a set of equally-spaced ﬁnitely many points in T can be a such discretization. By discretizing the set T with N2 discrete points, a discretization of Pers M can be obtained; namely, the discretized version of Pers M is the family of sets having at most M points in the discretized version of T . Note that such a discretization of Pers M has cardinality at most N2M. For a given positive integer N, we deﬁne g Pers M,N to be the discretization of Pers M obtained by discretizing T with N2 equally spaced discrete points. Therefore, our exponential mechanism is indeed carried out on the space g Pers M,N. The space g Pers M,N is the actual output space where the private persistence diagrams generated by the following mechanism live.

Proposition 13 Let ℓ 0 be ﬁxed and the utility u D deﬁned in (4.1). Set p( ) to denote the probability density function characterized by the following expression:

p(PDP) exp ϵ

q=0 d B Pq(D), Pq,DP (4.2)

with respect to the uniform distribution on the set ( g Pers M,N)ℓ+1 as the base measure. In (4.2), is deﬁned as follows:

:= (ℓ+ 1)diam E

mn and PDP = (P0,DP, . . . , Pℓ,DP). Then, the exponential mechanism characterized by the density (4.2) satisﬁes ϵ-DP.

Remark 14 Note that the discretization is not necessary to establish Proposition 13, but is needed to derive the utility results in the following section, such as Proposition 15 and Theorem 17. It is possible that this discretization can be removed from our analysis using more sophisticated techniques.

Differentially Private Topological Data Analysis

4.2 Analysis of Privatized Persistence Diagrams

Let ℓ 0 be determined. Recall that we have restricted the output space of our privatized persistence diagram in terms of M points for each dimension, and that these fall on a discretized version of the continuous persistence diagram space. To incorporate these limitations into our consideration for the error quantiﬁcation, we deﬁne POPT to be the closest persistence diagram from P(D) = (P0(D), . . . , Pℓ(D)) that can be generated from the privacy algorithm. More precisely, for each q

Pq,OPT := argmin P Pers M d B(P, Pq(D)),

where P ranges over all persistence diagrams having at most M elements, and

POPT := (P0,OPT, . . . , Pℓ,OPT).

Similarly, the counterpart of POPT on the discrete space g Pers M,N is deﬁned as follows. For every q 0, e Pq,OPT := argmin

P g Pers M,N d B P, Pq(D)

and e POPT := ( e P0,OPT, . . . , e Pℓ,OPT).

Moreover, for any pair of (ℓ+ 1)-tuples of persistence diagrams P = (P0, . . . , Pℓ) and P = (P 0, . . . , P ℓ), we deﬁne

d B(P, P ) :=

q=0 d B(Pq, P q).

In general, in the literature on the exponential mechanism, there have been broad analyses with regard to the error-minimizing value in the space covered by the exponential mechanism. For instance, see Dwork and Roth (2014). One key result concerning Pq,OPT is summarized in Proposition 2.5. Consequently, the following estimate can be established. Recall that the discretized space g Pers M,N has been obtained by discretizing the upper-left triangle T with N2 equally-spaced discrete points.

Proposition 15 Let POPT be deﬁned in the above and PDP the private persistence diagram obtained from the exponential mechanism summarized in Section 4.1. Suppose that the upperleft triangle T is discretized into N2 equally spaced points. Then the following holds:

d B(POPT, PDP) = Op

(ℓ+ 1)2M log N

In particular, if we take N = n, it holds that

d B(POPT, PDP) = Op

(ℓ+ 1)2M log n

Kang, Kim, Sohn, and Awan

Remark 16 In fact, the exponential mechanism itself only directly guarantees that privatized diagrams are concentrated at the optimal diagram in the discretized space. More speciﬁcally, we have

d B( e POPT, PDP) = Op

(ℓ+ 1)2M log N

On the other hand, as long as we employ ﬁne enough discretization, it is trivial that the distance d B(POPT, e POPT) is negligible compared to the error (4.3). For instance, taking N = N(n) = n ensures that such an approximation error has order Op(n 1) and the term log N in (4.3) only adds log n amount of error. This guarantees the result in Proposition 15.

To take advantage of the above result, we can estimate each of the two types of errors as follows. d B(PDP, P(µ)) d B(PDP, POPT) + d B(POPT, P(µ)) (4.4)

and d B(PDP, P(D)) d B(PDP, POPT) + d B(POPT, P(D)). (4.5)

Hence, the remaining part is to estimate d B(POPT, P(µ)) and d B(POPT, P(D)), respectively. Before stating our main theorem in this section, we would like to summarize the terminology that we use to call each of the error terms we are concerned with. First of all, we call d B(P(D), P(µ)) the estimation error because P(D) can be viewed as a statistic estimating P(µ). The term d B(PDP, POPT) is called the privacy error, following the tradition in DP literature. On the other hand, we call the quantity d B(POPT, P(D)) the approximation error because POPT is the best approximation of P(D) in the space Pers M. In contrast with the previous two terms, a type of quantity of the form d B(POPT, P(µ)) has not been analyzed in the literature before to our knowledge. As we noted, P(µ) can be regarded as a population parameter describing the probability measure µ generating the data D. Concerning this perspective, we call the quantity d B(POPT, P(µ)) the constrained estimation error and call the corresponding quantity d B(PDP, P(µ)) the privacy-estimation error in order to allude that this quantity would be interpreted as the amount of error in estimating the population parameter P(µ) by the privatized statistic PDP. If we can choose M large enough, both terms d B(POPT, P(µ)) and d B(POPT, P(D)) can be estimated through the convergence of the empirical distribution on D to the measure µ in terms of the Wasserstein distance W1 (See Proposition 25). It turns out that both terms are bounded by Op((ℓ+ 1)n 1/d). Hence, by taking M = M(n) to be a slowly increasing sequence we can achieve such a bound without degrading the privacy error obtained in Proposition 15.

Theorem 17 (Upper bound for the privacy-estimation error) Set M(n) = log n and N(n) = n. Then, for all large enough n, the following estimate holds:

d B(PDP, P(µ)) d B(PDP, POPT) + d B(POPT, P(µ))

nϵ + (ℓ+ 1)

Differentially Private Topological Data Analysis

It is natural to wonder how sharp the obtained upper bounds are. As for the populationestimation error (and the estimation error), unfortunately, it is inevitable to get the rate n 1/d so long as the argument relies on the Wasserstein convergence of the empirical measure on D to the measure µ. This means that if a tighter rate is possible, it is required to use a diﬀerent approach in order to examine the birth-and-death of homological features of the sub-level sets of the DTM more precisely. In the literature of TDA, there are some approaches that examined such features of Čech complexes by employing some toolkits from geometry. For example, see Bobrowski and Adler (2014). Such approaches may hint how to analyze persistence diagrams of the DTM more precisely. As for the privacy error, we argue that it is sharp up to constants and logarithmic factors. Recall that POPT is deﬁned to be the persistence diagram in the range of our privacy algorithm which has the smallest distance from P(D). This deﬁnition lets us surmise that the distance d B(PDP, POPT) could be smaller than the distance d B(PDP, P(D)) in a considerable probability. This means that if we are able to ﬁnd a lower bound of d B(PDP, P(D)) matching the upper bound of d B(PDP, POPT), it underpins that our estimate could be sharp. In the following theorem, we prove that, under some mild conditions, there is no ϵ-DP mechanism whose privacy error with respect to the persistence diagrams from L1-DTM can be smaller than 1/(nϵ). For the following, we recall that P0(D) denotes the 0th persistence diagram obtained by the L1-DTM to the empirical measure on a given data set, as deﬁned before.

Theorem 18 Suppose that m < 1/2. Let n be a positive integer and M an arbitrary ϵ-DP mechanism that produces a privatized persistence diagram M(D) of a data set D. Assume that ϵ satisﬁes 1/n ϵ 1. Then it is not possible for M to achieve d B(P0(D), M(D)) = op 1

nϵ for every database D with |D| = n.

5. Simulation Study

In the following simulation studies and the real-world data analysis, we only consider the 0th and the 1st persistence diagrams; namely, the utility we use is given by taking ℓ= 1, i.e., we set u D : Pers M 2 R by

u D(P0, P1) := u(0) D (P0) + u(1) D (P1).

The purpose of such restriction is only to present our algorithm succinctly; the algorithm can readily be extended to take the higher-dimensional features into consideration. We produce the diﬀerentially private persistence diagrams and investigate the impact of the key hyper-parameters: the privacy budget ϵ and the sample size n, where the resolution of the DTM m is set to 0.2. For the exponential mechanism, the default parameters are ϵ = 1, m = 0.2, n = 4000, and M = 5. These hyper-parameters were chosen by preliminary simulations. To sample from the exponential mechanism, we use a Markov chain Monte Carlo algorithm, speciﬁed in Appendix C.1. We choose the last iterate out of T = 10000 Monte Carlo diagrams as the reporting privatized diagram to be used for analysis . The simulation is based on the example in Figure 3, which has two circles at the origin (1.5, 1.5) and ( 1.5, 1.5) whose radii are 1.5 and 1 respectively. Each circle consists of 200

The R code is available at https://github.com/jwsohn612/DPTDA.

Kang, Kim, Sohn, and Awan

0 1 2 3 Birth

epsilon=0.1

0 1 2 3 Birth

0 1 2 3 Birth

(a) Privatized persistence diagrams of diﬀerent ϵ where n = 4000.

0 1 2 3 Birth

0 1 2 3 Birth

0 1 2 3 Birth

(b) Privatized persistence diagrams of diﬀerent n where ϵ = 1.

Figure 4: Privatized diagrams over 500 replicated data sets as described in Section 5: 0dimensional connected components (orange) and 1-dimensional loops (green).

observations of uniformly generated points along the boundary of the circle. There is one more point in the middle of the circles for (i-2) and (ii-2). When inducing the Vietoris-Rips diagrams, the maximum ﬁltration scale is speciﬁed as 3. All analyses are based on 500 sampled data sets, and we apply the privacy mechanism for each replication.

Figure 4 illustrates the outputs of the exponential mechanism as ϵ and n are varied. To reﬂect the variation of diagrams, we consider 500 replicated data sets. Our exponential algorithm is independently applied to each data set, and the algorithm reports the ﬁnal diagram only. By repeating this procedure for all 500 replicates, we obtain the 500 reported private diagrams. Each panel in Figure 4 is drawn based on the 500 private diagrams that illustrate the shape of private diagrams distribution. As expected, the variability in the privatized persistence diagrams becomes smaller as either ϵ or n becomes larger.

The overall tendency in terms of the bottleneck distance is exhibited in Figure 5. Note that the x-axis is written in the log scale. Gray areas in the panel show 95% point-wise quantile intervals of the bottleneck distance between the non-private diagram and its private one. Figure 5 depicts that both ϵ and n in the log scale have an approximately linear relationship to the log-bottleneck distance as shaded areas decently contain the red dotted lines e.g., log d B(P(D), ) log n + c with some constant c. These results heuristically support that d B((P)DP , P(D)) = Op(1/(nϵ)) (considering ℓto be ﬁxed).

Differentially Private Topological Data Analysis

-2 -1 0 1 2 log of epsilon

log of bottleneck distance

6 7 8 9 log of sample size

log of bottleneck distance

Figure 5: The 95% quantile intervals of d B between P(D) and the corresponding private diagram over 500 replicates where ϵ (left) and n (right) increase respectively. Red dotted lines captured overall in the shaded areas have -1 slopes.

6. Real Data Analysis

In this section, we apply our mechanism to a real-world data set , which tracks the locations of three people walking around within a building, recorded on smartphones. We are going to call those people Walker A, B, and C. The 3-dimensional coordinates (x, y, z) of the location of each person were measured 20000 times over time so that the data set consists of 20000 location vectors (x, y, z) for each of Walker A, B, and C. We calculate the persistence diagram corresponding to each of the walkers and apply our mechanism in order to privatize it. We would like to remark that we are not concerned with the privacy of individual walkers, but we consider the privacy of an individual s time stamps when they change. If a particular walker s persistence diagram changes signiﬁcantly in accordance with a change of location at a certain timestamp, then the location information could be exposed to a risk of privacy leakage. Our privacy mechanism retains the topological structure of each walker s travel while protecting the information associated with each timestamp. To obtain a privatized diagram, we carry out 50000 iterations in the MCMC procedure in our mechanism; the persistence diagram obtained at the last iteration is proposed as the reporting privatized diagram. We set the size of the sampling space M = 5, the resolution of the L1-DTM m = 0.05, the privacy budget ϵ = 1. Figure 7 depicts the results of comparing the L1-DTM persistence diagram corresponding to Walker C and its privatized diagram. One can see the diagrams look quite similar. In fact, the bottleneck distances d B(P0(D), P0,DP) and d B(P1(D), P1,DP) are both 0.01, which supports that our mechanism achieves high accuracy. Note that points near the diagonal are not considered signiﬁcant, and do not substantially aﬀect the bottleneck distance. In the right plot of Figure 7, we see that the bottleneck distance converges to a region < .025, and that the Markov chain seems to have converged after 5000 iterations.

Data is provided at http://bertrand.michel.perso.math.cnrs.fr/Enseignements/TDA/Tuto-Part1. html

Kang, Kim, Sohn, and Awan

Figure 6: Scatter plots of the location information of Walker A,B, and C.

Figure 7: (Left) The true persistence diagram of Walker C; (Middle) a privatized persistence diagram of Walker C at the last iteration of a MCMC procedure, (Right) the bottleneck distances d B(P0(D), P0,DP) (Black) and d B(P1(D), P1,DP) (Red) of the true and a privatized diagram over MCMC iterations.

Differentially Private Topological Data Analysis

7. Discussion

In this paper, we propose the ﬁrst mechanism for producing a diﬀerentially private persistence diagram, while highlighting the role of outlier-robustness in the sensitivity analysis. Even though our results oﬀer signiﬁcant contributions to private TDA, as well as a general understanding of the robust TDA measures, there are still some important weaknesses of our work as well as directions for future work: As noted in other papers (e.g., Minami et al., 2016; Ganesh and Talwar, 2020; Seeman et al., 2021; Awan and Rao, 2023), MCMC implementations of the exponential mechanism can incur additional privacy costs, which should be incorporated into the analysis for more rigorous studies. The proposed techniques in the above papers could be applied to our instance of the exponential mechanism to ensure that the MCMC approximation is taken into account for end-to-end privacy protection. Another approach would be to directly sample the exponential mechanism on the discrete space of O(N2(ℓ+1)) points, but this could be computationally prohibitive with large N and ℓ. While in this paper we recommended choosing M, the number of components in the persistence diagram, to be either a suﬃciently large constant or an increasing function such as M = log n, one could also consider M to be an unknown quantity that also needs to be learned in a private manner. As one of the anonymous reviewers suggested, one may be able to develop a reversible jump MCMC algorithm to sample from the exponential mechanism in this case. Some challenges of this approach would be 1) developing a base measure over the inﬁnite-dimensional space Q M=1( g Pers M,N)ℓ+1, which ensures that the exponential mechanism results in a valid probability distribution, 2) determine a reversible jump rule that allows for conversion between the dimensions, and 3) perform a customized utility analysis of this new mechanism. We leave it to future researchers to explore this direction. On the side of TDA, we would like to mention that some other outlier-robust TDA methods could be discussed for privacy protection. For instance, a kernel distance which was also discussed by Chazal et al. (2018) may be a good candidate. Besides TDA, the overall framework of how we propose a privacy mechanism can be applied to any other statistics that take their values in a metric space. A utility function concerned with a metric space-valued statistic can be deﬁned similarly as we do with a persistence diagram and the bottleneck distance; this was already recognized in Dwork et al. (2006). However, the utility analysis for each diﬀerent problem requires unique understanding of the speciﬁc structure and properties of the statistic and metric space at hand. A theoretical limitation of our method is its scaling with the number of dimensions, denoted by ℓ+ 1, we consider. It is well known that the error in ϵ-DP mechanisms typically scales linearly with the dimension, and our instance of the exponential mechanism is no diﬀerent. Since this is a limitation of ϵ-DP, it can only be addressed by using a diﬀerent privacy framework. Future researchers may consider developing DP-TDA methods in the frameworks of approximate DP (Dwork and Roth, 2014), zero-concentrated DP (Bun and Steinke, 2016) or Gaussian DP (Dong et al., 2022), which often allow for the magnitude of the privacy noise to scale only in the square-root of the dimension. Another limitation of our work is that our utility analysis depends on an artiﬁcial discretization of the persistence diagram space. This limitation is caused by the use of Proposition 3, and could be potentially addressed by using more advanced techniques.

Kang, Kim, Sohn, and Awan

Finally, Dong et al. (2020) proposed an alternative quantity to sensitivity for the exponential mechanism, which may improve the ﬁnite-sample performance.

Acknowledgments

The authors are thankful to the anonymous reviewers for their helpful comments that significantly improved the presentation of this manuscript. Taegyu Kang s research was partially supported by the AFOSR grant FA9550-22-0238 at Purdue University. Sehwan Kim started working on this project while he was at Purdue University. Jordan Awan s research is supported in part by NSF grant SES-2150615.

Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan Mc Mahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with diﬀerential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308 318, 2016.

Hirokazu Anai, Frédéric Chazal, Marc Glisse, Yuichi Ike, Hiroya Inakoshi, Raphaël Tinarrage, and Yuhei Umeda. DTM-based ﬁltrations. In Topological Data Analysis, volume 15, pages 33 66. Springer, 2020.

Marco Avella-Medina. Privacy-preserving parametric inference: a case for robust statistics. Journal of the American Statistical Association, 116(534):969 983, 2021.

Jordan Awan and Vinayak Rao. Privacy-aware rejection sampling. Journal of Machine Learning Research, 24(74):1 32, 2023.

Jordan Awan and Yue Wang. Diﬀerentially private Kolmogorov-Smirnov-type tests. ar Xiv:2208.06236, 2022.

Jordan Awan, Ana Kenney, Matthew Reimherr, and Aleksandra Slavković. Beneﬁts and pitfalls of the exponential mechanism with applications to Hilbert spaces and functional pca. In International Conference on Machine Learning, pages 374 384. PMLR, 2019.

Leo Betthauser, Urszula Chajewska, Maurice Diesendruck, and Rohith Pesala. Discovering distribution shifts using latent space representations. ar Xiv:2202.02339, 2022.

Omer Bobrowski and Robert J. Adler. Distance functions, critical points, and the topology of random čech complexes. Homology, Homotopy and Applications, 16(2):311 344, 2014.

Glen E. Bredon. Topology and geometry. Springer-Verlag, New York, 1997.

Mark Bun and Thomas Steinke. Concentrated diﬀerential privacy: Simpliﬁcations, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635 658. Springer, 2016.

Luigi Caputi, Anna Pidnebesna, and Jaroslav Hlinka. Promises and pitfalls of topological data analysis for brain connectivity analysis. Neuro Image, 238:118245, 2021.

Differentially Private Topological Data Analysis

Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society, 46 (2):255 308, 2009.

Frédéric Chazal, David Cohen-Steiner, and Quentin Mérigot. Geometric inference for probability measures. Foundations of Computational Mathematics, 11:733 751, 2011.

Frédéric Chazal, Vin de Silva, Marc Glisse, and Steve Oudot. The structure and stability of persistence modules. Springer, 2016a.

Frédéric Chazal, Pascal Massart, and Bertrand Michel. Rates of convergence for robust geometric inference. Electronic Journal of Statistics, 10:2243 2286, 2016b.

Frédéric Chazal, Brittany Fasy, Fabrizio Lecci, Bertrand Michel, Alessandro Rinaldo, and Larry Wasserman. Robust topological inference: Distance to a measure and kernel distance. Journal of Machine Learning Research, 18:1 40, 2018.

David Cohen-Steiner, Herbert Edelsbrunner, and John Harer. Stability of persistence diagrams. Discrete and Computational Geometry, 37:103 120, 2007.

Rachel Cummings, Varun Gupta, Dhamma Kimpara, and Jamie Morgenstern. On the compatibility of privacy and fairness. In Adjunct publication of the 27th conference on user modeling, adaptation and personalization, pages 309 315, 2019.

Meryll Dindin, Yuhei Umeda, and Frederic Chazal. Topological data analysis for arrhythmia detection through modular neural networks. In Advances in Artiﬁcial Intelligence: 33rd Canadian Conference on Artiﬁcial Intelligence, Canadian AI 2020, Ottawa, ON, Canada, May 13 15, 2020, Proceedings 33, pages 177 188. Springer, 2020.

Jinshuo Dong, David Durfee, and Ryan Rogers. Optimal diﬀerential privacy composition for exponential mechanisms. In International Conference on Machine Learning, pages 2597 2606. PMLR, 2020.

Jinshuo Dong, Aaron Roth, and Weijie J Su. Gaussian diﬀerential privacy. Journal of the Royal Statistical Society Series B, 84(1):3 37, 2022.

Cynthia Dwork and Jing Lei. Diﬀerential privacy and robust statistics. In STOC 09 Proceedings of the 2009 ACM International Symposium on Theory of Computing, pages 371 380, 2009.

Cynthia Dwork and Aaron Roth. The Algorithmic Foundations of Diﬀerential Privacy. Now Publishers Inc., 2014.

Cynthia Dwork, Frank Mc Sherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265 284. Springer, 2006.

Herbert Edelsbrunner and John Harer. Persistent homology - a survey. In Surveys on discrete and computational geometry, Contemporary Mathematics. American Mathematical Society, 2008.

Kang, Kim, Sohn, and Awan

Herbert Edelsbrunner and John Harer. Computational Topology : An Introduction. American Mathematical Society, 2009.

Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(4):707 738, 2015.

Arun Ganesh and Kunal Talwar. Faster diﬀerentially private samplers via Rényi divergence analysis of discretized Langevin MCMC. Advances in Neural Information Processing Systems, 33:7222 7233, 2020.

Rob Hall, Alessandro Rinaldo, and Larry Wasserman. Diﬀerential privacy for functions and functional data. The Journal of Machine Learning Research, 14(1):703 727, 2013.

Jonathan Hehir, Siddharth Vishwanath, Aleksandra Slavković, and Xiaoyue Niu. Problems on random graphs under local diﬀerential privacy, 2022. Paper presented at JMM 2022.

Felix Hensel, Michael Moor, and Bastian Rieck. A survey of topological machine learning methods. Frontiers in Artiﬁcial Intelligence, 4:52, 2021.

Christoph Hofer, Roland Kwitt, Marc Niethammer, and Mandar Dixit. Connectivityoptimized representation learning via persistent homology. In International conference on machine learning, pages 2751 2760. PMLR, 2019.

Zhiyi Huang and Sampath Kannan. The exponential mechanism for social welfare: Private, truthful, and nearly optimal. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, pages 140 149. IEEE, 2012.

Vishesh Karwa and Aleksandra Slavković. Inference using noisy degrees: Diﬀerentially private β-model and synthetic graphs. The Annals of Statistics, 44(1):87 112, 2016.

Vishesh Karwa, Pavel N Krivitsky, and Aleksandra B Slavković. Sharing social network data: diﬀerentially private estimation of exponential family random-graph models. Journal of the Royal Statistical Society. Series C (Applied Statistics), pages 481 500, 2017.

Firas A Khasawneh and Elizabeth Munch. Chatter detection in turning using persistent homology. Mechanical Systems and Signal Processing, 70:527 541, 2016.

Hyekyoung Lee, Moo K Chung, Hyejin Kang, Bung-Nyun Kim, and Dong Soo Lee. Discriminative persistent homology of brain networks. In 2011 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, pages 841 844, 2011.

Ephy R. Love, Benjamin Filippenko, Vasileios Maroulas, and Gunnar Carlsson. Topological convolutional layers for deep learning. Journal of Machine Learning Research, 24:1 35, 2023.

Zhigang Lu, Hassan Jameel Asghar, Mohamed Ali Kaafar, Darren Webb, and Peter Dickinson. A diﬀerentially private framework for deep learning with convexiﬁed loss functions. IEEE Transactions on Information Forensics and Security, 17:2151 2165, 2022.

Differentially Private Topological Data Analysis

Melissa R. Mc Guirl, Alexandria Volkening, and Björn Sandstede. Topological data analysis of zebraﬁsh patterns. Proceedings of the National Academy of Sciences, 117(10):5113 5124, 2020.

Frank Mc Sherry and Kunal Talwar. Mechanism design via diﬀerential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS 07), pages 94 103, 2007.

Kentaro Minami, HItomi Arai, Issei Sato, and Hiroshi Nakagawa. Diﬀerential privacy without sensitivity. Advances in Neural Information Processing Systems, 29, 2016.

Ardalan Mirshani, Matthew Reimherr, and Aleksandra Slavković. Formal privacy for functional data with gaussian perturbations. In International Conference on Machine Learning, pages 4595 4604. PMLR, 2019.

James R. Munkres. Elements of algebraic topology. Addison-Wesley Publishing Company, 1984.

Monica Nicolau, Arnold J. Levine, and Gunnar Carlsson. Topology based data analysis identiﬁes a subgroup of breast cancers with a unique mutational proﬁle and excellent survival. Proceedings of the National Academy of Sciences, 108(17):7265 7270, 2011.

Partha Niyogi, Stephen Smale, and Shumuel Weinberger. A topological view of unsupervised learning from noisy data. SIAM Journal on Computing, 40(3):646 663, 2011.

Steve Y. Oudot. Persistence Theory: From Quiver Representations to Data Analysis. American Mathematical Society, 2015.

Takashi Owada. Convergence of persistence diagram in the sparse regime. The Annals of Applied Probability, 32(6):4706 4736, 2022.

Matthew Reimherr and Jordan Awan. Kng: The k-norm gradient mechanism. Advances in Neural Information Processing Systems, 32, 2019.

Bastian Rieck, Tristan Yates, Christian Bock, Karsten Borgwardt, Guy Wolf, Nicholas Turk Browne, and Smita Krishnaswamy. Uncovering the topology of time-varying fmri data using cubical persistence. Advances in neural information processing systems, 33:6900 6912, 2020.

Jeremy Seeman, Matthew Reimherr, and Aleksandra Slavković. Exact privacy guarantees for markov chain implementations of the exponential mechanism with artiﬁcial atoms. Advances in Neural Information Processing Systems, 34:13125 13136, 2021.

Daniel Shnier, Mircea Voineagu, and Irina Voineagu. Persistent homology analysis of brain transcriptome data in autism. Journal of the Royal Society Interface, 16, 2019.

Reza Shokri and Vitaly Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, pages 1310 1321, 2015.

Kang, Kim, Sohn, and Awan

Larry Wasserman. Topological data analysis. Annual Review of Statistics and Its Application, 5(1):501 532, 2018.

Larry Wasserman and Shuheng Zhou. A statistical framework for diﬀerential privacy. Journal of the American Statistical Association, 105(489):375 389, 2010. doi: 10.1198/jasa.2009. tm08651.

Xiaoqi Xu, Nicolas Drougard, and Raphaëlle N. Roy. Topological data analysis as a new tool for EEG processing. Frontiers in Neuroscience, 15, 2021.

Xin Xu, Jessi Cisewski-Kehe, Sheridan B. Green, and Daisuke Nagai. Finding cosmic voids and ﬁlament loops using topological data analysis. Astronomy and Computing, 27:34 52, 2019.

Afra Zomorodian and Gunnar Carlsson. Computing persistent homology. Discrete Computational Geometry, 33:249 274, 2005.

Appendix A. More on Persistent Homology

This part is devoted to providing more detailed background information about how to construct persistent homology and the corresponding persistence diagram. We start with the deﬁnition of simplicial complexes and simplicial homology, and then we introduce how to construct persistent homology.

A.1 Simplicial homology

Let us start with the deﬁnition of simplicial complexes. Most of the contents of this subsection is based on Munkres (1984).

Deﬁnition 19 (Simplicial complexes) An (abstract) simplicial complex is a collection K of ﬁnite non-empty sets, such that if σ is an element of K, so is every non-empty subset of σ.

Each element σ of a simplicial complex K is called a simplex of K. The dimension of the simplex σ is deﬁned to be |σ| 1, i.e., the number of elements in σ minus one. When σ is a q-dimensional simplex, we simply say that σ is a q-simplex. The dimension dim K of the simplicial complex K is deﬁned to be the maximum dimension of the simplices in K, i.e., dim K := max σ K dim σ.

If the set {dim σ : σ K} is not bounded, set dim K = . Each non-empty subset of σ is called a face of σ. Let K be a simplicial complex. For each simplex σ = {v0, . . . , vq} in K, one can consider ordered tuples of vertices in σ. Namely, for every permutation α on {0, . . . , q}, there exists an ordered tuple (vα(0), . . . , vα(1)). Such an ordered tuple is called a ordered simplex of σ. The collection of all ordered simplices of every simplex in K is called the ordered simplicial complex of K, and denoted by Kord.

Differentially Private Topological Data Analysis

Let Kord be an ordered simplicial complex of a simplicial complex K. Let vα = (vα(0), . . . , vα(q)) and vβ = (vβ(0), . . . , vβ(q)) be two ordered q-simplices of a common qsimplex σ = {v0, . . . , vq}. Declare vα vβ if and only if α and β have the same sign, i.e., α and β diﬀer only by even numbers of transpositions. Notice that this relation deﬁnes an equivalence relation on the set of ordered simplices of every simplex σ. Let [v0, . . . , vq] denote the equivalence class of the ordered simplex (v0, . . . , vq). Such an equivalence class is called an oriented q-simplex. Namely, every q-simplex with q 1 induces two oriented q-simplices. Let Kori denote the set of all oriented simplices of every simplex in K. When there is no confusion, we use the symbol σ to denote both a simplex and an oriented simplex. For every natural number q 0, set Kq ori be the set of all oriented q-simplices of K. Deﬁne Cq(K) be the set of all functions c : Kq ori Z satisfying the following.

c(σ) = c(σ ) if σ and σ are opposite orientations of the same simplex.

c(σ) = 0 for all but ﬁnitely many oridented q-simplices σ, i.e., each c is ﬁnitely supported.

One can equip a group structure on Cq(K) by deﬁning the group operation to be elementwise addition. Notice that Cq(K) is an abelian group with that group structure. Moreover, it is straightforward that Cq(K) is a free abelian group whose basis can be constructed by choosing exactly one oriented simplex for every simplex σ. One can represent every element c in Cq(K) by the ﬁnite Z-linear combinations of oriented q-simplices of K, i.e., each c can be written as

where k is ﬁnite, ni Z and σi Kq ori for all 1 i k. Each function c is called a q-chain of K and Cq(K) is called the group of oriented q-chains of K. We set Cq(K) = 0 if q < 0 or q > dim K. Now, we deﬁne the boundary operator of oriented chain complexes.

Deﬁnition 20 (Boundary operator) Let K be a simplicial complex. For every integer q, deﬁne q : Kori,q Cq 1(K)

by assigning

q : [v0, . . . , vq] 7

i=0 ( 1)i[v0, . . . , ˆvi, . . . , vq],

where [v0, . . . , ˆvi, . . . , vq] is the (q 1)-oriented simplex obtained by deleting vi from [v0, . . . , vq]. Since Cq(K) is a free abelian group, the map q can be extended into a unique group homomorphism q : Cq(K) Cq 1(K). This homomorphism is called the boundary operator.

The key property of the boundary operator is the following:

q 1 q = 0 for every q.

Kang, Kim, Sohn, and Awan

In other words, the sequence (Cq(K), q)q Z of abelian groups and group homomorphisms form a chain complex. This property can be rephrased as follows.

Im q 1 Ker q for every q,

where Ker and Im mean the kernel and the image of a homomorphism, respectively. Since the sequence of groups of oriented chain complexes form a chain complex, it is possible to deﬁne the homology groups of it. Moreover, the kernel Ker q is usually written as Zq(K) and each of its elements is called a q-cycle; and, the image Im q 1 is usually written as Bq(K) and each of its elements is called a q-boundary.

Deﬁnition 21 (Simplicial homology) Let K be a simplicial complex. For every integer q, the qth simplicial homology group is deﬁned to be the following quotient group:

Zq(K) Bq(K) = Ker( q : Cq(K) Cq 1(K))

Im( q 1 : Cq(K) Cq(K)) ,

and denoted by Hq(K)

Remark 22 Instead of constructing Cq(K) to be an abelian group, one can consider the free R-module on the same basis where R is a commutative ring. The boundary operator can be deﬁned by the same, and now it can be uniquely extended to be an R-module homomorphism q : Cq(K) Cq 1(K). The resulting sequence (Cq(K), q) of R-modules and R-module homomorphisms form a chain complex of R-modules, so the simplicial homology of it can be deﬁned by the same way; in this case, each homology group Hq(K) becomes an R-module as well. In such a case, we denote the qth simplicial homology module of K by Hq(K; R) and call it the qth simplicial homology of K with coeﬃcients in R.

A.2 Persistent homology

Let {Kr}r R be a collection of simplicial complexes satisfying Kr1 Kr2 if r1 r2. Such a collection is called a ﬁltration of simplicial complexes (parametrized by R). For each simplicial complex Kr in the ﬁltration, it is possible to construct the chain complex (Cq(Kr), q)q Z and the corresponding homology groups (Hq(Kr))q Z. In addition, each inclusion map ιr1,r2 : Kr1 Kr2 (r1 r2), induces a group homomorphism Cq(Kr1) Cq(Kr2), which is actually the inclusion map Cq(Kr1) , Cq(Kr2) for every integer q; and, all such homomorphisms (inclusions) commute with boundary operators, i.e., each inclusion induces a chain map between chain complexes of oriented chains. Thus, each inclusion ιr1,r2 induces a homomorphism ιq r1,r2 : Hq(Kr1) Hq(Kr2) between homology groups for every q. This produces a collection {Hq(Kr)}r R of simplicial homology groups accompanied with a group homomorphism ιq r1,r2 : Hq(Kr1) Hq(Kr2) for every q and every pair r1 r2. For each pair r1 r2 and each q, the image of ιq r1,r2 : Hq(Kr1) Hq(Kr2), denoted by Im ιq r1,r2, is called the qth persistent homology group that persists from r1 to r2. The rank of the group Im ιq r1,r2 is called the qth persistent Betti number that persists from r1 to r2 and denoted by βq r1,r2. Intuitively, the Betti number βq r1,r2 represents the number of independent q-cycles that were born before the parameter r1 and have not been dead until the parameter r2 in the ﬁltration. Furthermore, for each q-cycle in the ﬁltration, it is possible to consider

Differentially Private Topological Data Analysis

the parameter values at which the cycle shows up at ﬁrst (birth) and disappears (death), respectively. Let σ be a q-cycle that shows up in the ﬁltration at some point, i.e., σ is an element of Ker q(Kr) for some r. Then, it is possible to consider the birth and death times (parameter values) of it. By bringing together all birth-death pairs of all q-cycles in the ﬁltration, one can form a multiset of points of the form (b, d) with b d . That multiset is called the qth persistence diagram of the ﬁltration. The formal construction of the persistence diagram is involved with the structure theorem of ﬁnitely generated graded modules over a principal ideal domain, which is a theorem from abstract algebra. Please refer to Carlsson (2009) and Edelsbrunner and Harer (2008) for more formal and comprehensive discussion.

A.3 Some ways to construct a ﬁltration of simplicial complexes

Now, we introduce several ways to obtain a ﬁltration of simplicial complexes that play a role in the main discussion of this paper. The contents of this subsection can be found in Edelsbrunner and Harer (2009). Let D = {x1, . . . , xn} be a ﬁnite subset of a metric space (X, d). For every non-negative real number r 0, consider the ball B(xi; r) := {y X : d(y, xi) < r) centered at each i {1, . . . , n}. The Čech complex C(D; r) on D with radius r is the simplicial complex deﬁned as follows. A subset σ = {xi0, . . . , xiq} of D is a member of C(D; r) if and only if q j=0B(xij; r) = . Notice that C(D; r1) C(D; r2) for every pair r1 r2. Hence, the collection { C(D; r)}r 0 of Čech complexes forms a ﬁltration of simplicial complexes. There are several variants of the Čech complex. One of such variants is the Vietoris-Rips complex. The Vietoris-Rips complex VR(D; r) on D with radius r is deﬁned as follows. A subset σ = {xi0, . . . , xiq} of D is a member of VR(D; r) if and only if B(xijk; r) B(xijl; r) = for every k, l {0, . . . , q}, i.e., The balls B(xi0; r), . . . B(xiq; r) pairwise intersect with one another. It is also obvious that VR(D; r1) VR(D; r2) whenever r1 r2. Hence, the collection {VR(D; r)}r 0 of Vietoris-Rips complexes forms a ﬁltration of simplicial complexes. The following relationship between the Čech complex and the Vietoris-Rips complex indicates that, on a ﬁnite subset in an Euclidean space, the ﬁltration of Čech complexes and that of Vietoris-Rips complexes have essentially the same information.

Proposition 23 Let D = {x1, . . . , xn} be a ﬁnite subset of a Euclidean space equipped with the metric induced by the ℓ2-norm on it. Then, for every r 0, the following holds.

C(D; r) VR(D; r) C(D;

The last way of construction is obtained from a real-valued function deﬁned on a metric space. Let (X, d) be triangulable a metric space and f : X R be a real-valued continuous function. For each r R, consider the sub-level set Lr := f 1 ( , r] , which is a subset of X. Notice that Lr1 Lr2 whenever r1 r2. Moreover, since X is triangulable, all sub-level sets can be triangulized while respecting the inclusion relationships. Hence, the collection of such triangulizations of the collection {Lr}r R of sub-level sets produces a ﬁltration of simplicial complexes. Before closing this section, we introduce a certain condition on a continuous function f : X R that ensures that f does not behave too wildly.

Kang, Kim, Sohn, and Awan

Deﬁnition 24 (Tame functions) Let (X, d) be a triangulable metric space and f : X R a real-valued continuous function. Set Xr to be the triangulization of the sub-level set f 1 ( , r] . Let ιq r1,r2 : Hq(Xr1) Hq(Xr2) be the group homomorphism induced by the inclusion map ιr1,r2 : Xr1 Xr2 for every pair r1 r2. We call r R a homological critical value if there is no positive number ϵ > 0 for which ιq r ϵ,r+ϵ is an isomorphism for each dimension q. The function f is said to be tame if f produces only ﬁnitely many homological critical values and all homology groups of all sub-level sets of it have ﬁnite rank.

Appendix B. Proofs of the Main Results

In this part, we present the detailed proofs of the theorems in Section 3 and Section 4. Throughout this section, unless there is no further speciﬁcation, the symbol denotes the ℓ2-norm in the Euclidean space where the data points discussed in each proof live.

B.1 Proofs of the Results in Section 3

Proof [Proof of Lemma 4] Fix r > 0 so that dm 1 < r < dm, and let G(D; r) be the geometric graph with vertex set X and connecting threshold r; i.e., D is the vertex set of G(D; r), and each pair {xi, xj} of vertices is an edge of it if d(xi, xj) r. Since (0, dm) is an element of the diagram P Cech 0 (D), there are at least two connected components Y1 and Y2 in G(D; r) which satisfy

min x1 Y1,x2 Y2 d(x1, x2) = 2dm. (A.1.1)

Let Y1 and Y2 be such connected components, and let x1 Y1 and x2 Y2 be the points attaining the minimum, i.e., d(x1, x2) = dm. Set D to be the set obtained by adding one point, say z, at the mid-point of x1 and x2. It is obvious that the death time of Y1 (or equivalently, Y2) is cut in half. Notice that the death times of the other connected components in the ﬁltration of D cannot be bigger by adding the point z. Thus, we can write P Cech 0 (D ) = (0, d 1), . . . , (0, d t), (0, )} {(0, dm/2)}

with d j dm 1 for all j = 1, . . . , t. Here, the element (0, dm/2) has multiplicity at least 2.

To calculate the bottleneck distance between P Cech 0 (D) and P Cech 0 (D ) we have to consider all possible bijections between P Cech 0 (D) and P Cech 0 (D ). All such bijections can be classiﬁed into three categories. First, (0, dm) P Cech 0 (D) is associated with element (0, d j) P Cech 0 (D ) for some j {1, . . . , t}. Second, (0, dm) P Cech 0 (D) is associated with

(0, dm/2) P Cech 0 (D ). Third, (0, dm) P Cech 0 (D) is associated with a point in the diagonal line. In the ﬁrst case, the possible minimum distance concerning (0, dm) cannot be smaller than δ. In the second case, the distance between (0, dm) and (0, dm/2) is obviously dm/2. In the last case, the distance between (0, dm) and the diagonal line is dm/

2. Since the bottleneck distance is deﬁned by taking the minimum over all such bijections, the desired result follows.

Differentially Private Topological Data Analysis

Proof [Proof of Theorem 5] Suppose that n is even. Let a and b be two points in the set E with |a b| = diam E, and D consist of n/2 copies of a and n/2 copies of b. Let D be obtained by moving one of as to the mid-point of a and b, say c. Then, it is obvious that

P Cech 0 (D) = {(0, diam E/2) , (0, )}

and P Cech 0 (D ) = {(0, diam E/4) , (0, diam E/4) , (0, )} .

This proves that the bottleneck distance between these two diagrams is lower bounded by diam E/4, which implies the desired result. When n is odd, one can take D to have (n 1)/2 copies of a and (n + 1)/2 copies of b, and the result does not change. As for the second result, the proposed upper bound can be established by applying the reverse triangle inequality. To establish the lower bound, notice that for any pair of sets D and D , sup P |v D(P) v D (P)| v D P Cech 0 (D) v D P Cech 0 (D)

= d B P Cech 0 (D ), P Cech 0 (D) .

The supremum of the last expression over all adjacent pairs D and D is lower bounded by diam E/4 as a consequence of the ﬁrst result. This completes the proof.

Proof [Proof of Theorem 8] Let δ(p) D and δ(p) D be Lp-DTM to the empirical distributions of D and D , respectively. By the stability theorem (2.2) and the Wasserstein stability (3.1) of the DTM, we have

d B PDTMp q (D), PDTMp q (D ) δ(p) D δ(p) D 1 m1/p Wp (ˆµD, ˆµD ) , (A.1.2)

where ˆµD and ˆµD represent the empirical distributions on D and D , respectively. We are going to establish a upper bound of the right-hand side of the inequality (A.1.2). Assume that H(D, D ) = 1. Let x be the element that is in D but not in D , and z be the element that is in D but not in D. Let π be the coupling of ˆµD and ˆµD deﬁned as follows: For every y D, set

π(y, y) = 1

and π(x, z) = 1

It is straightforward to verify that π is indeed a coupling of ˆµD and ˆµD . With this π we have Z

(z1,z2) Rd Rd z1 z2 p dπ(z1, z2) = x z p 1

n diam E p 1

By the deﬁnition of the Wasserstein distance Wp, we have

Wp(ˆµD, ˆµD )p = inf ν

(z1,z2) Rd Rd z1 z2 p dν(z1, z2)

(z1,z2) Rd Rd z1 z2 p dπ(z1, z2)

Kang, Kim, Sohn, and Awan

where ν ranges over all couplings of Pn and P n. Therefore, we obtain the following:

Wp(ˆµD, ˆµD ) diam E

which implies the desired result.

Proof [Proof of Proposition 9] Let D be a data set whose points are split into 50% and 50% at two ends a, b of E respectively. More speciﬁcally, a b = diam E and the set D contains 2/n copies of a and 2/n copies of b. Let c be the mid-point of a and b; that is a c = b c = diam E/2. Construct D by moving one a in D to c; namely, D has n/2 1 copies of a, n/2 copies of b, and one c. Let δD be the L1-DTM to the empirical distribution on D with resolution m and δD likewise. Then, we have

0 if x = a, 0 if x = b, diam E/2 if x = c.

On the other hand,

0 if x = a, 0 if x = b,

2 if x = c.

Recall that k = mn . Notice that any point x on the line segment ab satisﬁes δD(x) δD(c) and δD (x) δD (c). Hence, the 0th persistence diagram PDTM1 0 (D) of D is obtained as follows: PDTM1 0 (D) = {(0, diam E/2), (0, )}.

Similarly, PDTM1 0 (D ) is obtained as follows:

PDTM1 0 (D ) = (0, (k 1) diam E/(2k)), (0, ) .

The bottleneck distance between the two diagrams above is calculated as follows:

d B(PDTM1 0 (D), PDTM1 0 (D )) = diam E

Proof [Proof of Corollary 10] The upper bound is obtained by applying the reverse triangle inequality. As for the lower bound, notice that

sup P |u D(P) u D (P)| |u D(PDTM1(D)) u D (PDTM1(D))|

= d B PDTM1(D ), PDTM1(D) d B PDTM1 0 (D ), PDTM1 0 (D) .

The last expression is bounded below by diam E/(2 mn ) as a result of Proposition 9.

Differentially Private Topological Data Analysis

B.2 Proofs of the Results in Section 4

Proof [Proof of Proposition 15] Notice that the inequality (2.5) gives

P |u D(PDP) u D( e POPT)| 2

ϵ (log | g Pers (ℓ+1) M,N | + t) e t,

for every t 0. The reverse triangle inequality yields

d B(PDP, e POPT) |u D(PDP) u D( e POPT)|.

Combining those two yields

d B(PDP, e POPT) = Op

ϵ (ℓ+ 1) log g Pers M,N .

Recall that is chosen to be = (ℓ+1)diam E/(mn) and | g Pers M,N| = NM. Thus, we have

d B(PDP, e POPT) = Op

(ℓ+ 1)2M log N

Now, recall that the upper-left triangle T is discretized by N2 = N2(n) = n2 equallyspaced points; the length of each spacing is bounded by Cdiam E/n for some constant C that only depends on the chosen Euclidean distance. Hence, with all large enough n, the error in approximating POPT by e POPT satisﬁes

d B(POPT, e POPT) = Op(n 1),

which completes the proof.

Theorem 17 can be proved by establishing the following result.

Proposition 25 Let M = M(n) be a non-decreasing sequence of positive integers satisfying M(n) |Pq(µ)| for all large enough n. Then, for every q 0, we have

d B(Pq,OPT, Pq(µ)) = Op(n 1/d).

Moreover, we also have d B(Pq,OPT, Pq(D)) = Op(n 1/d).

Proof [Proof of Proposition 25] With large enough n, we can assume that M |Pq(µ)|. In other words, Pq(µ) belongs to the space of persistence diagrams having at most M elements. Hence, by the deﬁnition of Pq,OPT,

d B(Pq(D), Pq,OPT) d B(Pq(D), Pq(µ)).

As for d B(Pq(µ), Pq,OPT), the triangle inequality gives

d B(Pq(µ), Pq,OPT) d B(Pq(µ), Pq(D)) + d B(Pq(D), Pq,OPT)

2 d B(Pq(µ), Pq(D)).

Kang, Kim, Sohn, and Awan

According to Theorem 2 in Fournier and Guillin (2015) along with the stability theorem (2.2) of the bottleneck distance and the Wasserstein stability (3.1) of the DTM, it is straightforward to deduce that

d B(Pq(µ), Pq(D)) = Op(n 1/d)

for every q 0.

Proof [Proof of Theorem 17] According to Proposition 25, we have

d B(Pq,OPT, Pq(µ)) = Op(n 1/d)

for every q 0. This estimate, together with the estimate given in Proposition 15, establishes the desired result.

The proof of Theorem 18 is achieved by establishing the following three lemmas. The ﬁrst lemma is rather technical.

Lemma 26 Set P = PDTM1 0 . Assume that the resolution m of the DTM is chosen to satisfy m 1/2. For any pair of positive integers K and n with 1 K n, there exists a pair of data sets Xn and Yn satisfying |Xn| = |Yn| = n, H(Xn, Yn) = K, and d B(P(Xn), P(Yn)) CK

n for some constant C independent of K and n, where H(Xn, Yn) denotes the Hamming distance between Xn and Yn.

Proof Recall that k = mn . The whole situation will be broken down into three cases: (i) 1 K min{k, n/2 k}, (ii) min{k, n/2 k} < K < max{k, n/2 k}, and (iii) K max{k, n/2 k}. First, let us assume that 1 K min{k, n/2 k}. Choose two points a and b satisfying a b = diam E; for instance, in the case of E = [0, 1]d, one may choose a = (0, . . . , 0) and b = (1, . . . , 1). Choose the data set Xn that consists of n/2 copies of a and n/2 copies of b (If n is odd, take (n 1)/2 copies of a and (n + 1)/2 copies of b; the results will be the same). On the other hand, choose the data set Yn constructed by moving K copies of a to the mid-point of a and b, say c, i.e., as multisets, Xn and Yn can be expressed as follows:

Xn = (a, n/2) , (b, n/2) and Yn = (a, n/2 K) , (c, K) , (b, n/2) .

Since M n/2 k, the point a still has more than k numbers of points in the data set Yn. Thus, we have

0 if x = a,

2 if x = c, 0 if x = b.

Let x(t) be the point in the line segment ac that divides ac into the ratio t : (1 t) with t [0, 1]. Then, we have

δYn(x(t)) =

2 if 0 t 1/2, (k K)t diam E/2+K(1 t) diam E/2

k if 1/2 < t 1.

Differentially Private Topological Data Analysis

Now, let us further decompose the situation into two cases: (i-1) (k 2K) 0 K k/2 and (i-2) (k 2K) < 0 K > k/2. In the case (i-1), δYn(x(t)) is increasing in t. Hence, P(Yn) is obtained to be P(Yn) = {(0, ), (0, δYn(c))}.

Notice that P(Xn) is obtained to be

P(Xn) = {(0, ), (0, diam E/2)}.

d B(P(Xn), P(Yn)) = diam E

2 δYn(c) = diam E

In the case (i-2), δYn((x(t)) decreases from t = 1/2 to t = 1. Thus, P(Yn) is given to be

P(Yn) = {(0, ), (δYn(c), diam E/4), (0, diam E/4)}.

the bottleneck distance between P(Xn) and P(Yn) can be derived by comparing the two diﬀerent scenarios. First case corresponds (0, diam /4) in P(Yn) to (0, diam E/2) in P(Xn). The distance obtained from this case must be greater than or equal to diam E/4. The other case corresponds (0, diam E/2) in P(Xn) to (δYn(c), diam E/4) in P(Yn). Consequently, (0, diam E/4) in P(Yn) must correspond to a point in the diagonal. Thus, the distance obtained in this case must be greater than or equal to diam E/(4

2). Therefore,

d B(P(Xn), P(Yn)) diam E

Now, let us turn our attention to the case (ii), which assumes that min{k, n/2 k} < K < max{k, n/2 k}. First, consider the case k < n/2 k, so that k < K < n/2 k. In this case, both a and c have at least k points, so

δYn(x) = 0 for all x = a, b, and c.

The above result gives us

P(Yn) = {(0, ), (0, diam /4), (0, diam E/4)}.

d B(P(Xn), P(Yn)) = diam E

Second, consider the case k > n/2 k, so that n/2 k < K < k. In this case, both a and c have less than k points. Thus,

2 if x = a,

2 if x = c, 0 if x = b

Kang, Kim, Sohn, and Awan

Using the similar argument we utilized in the case (i), it is possible to demonstrate that the desired result is true in this case too. Finally, let us consider the case (iii) where K max{k, n/2 k}. In this case, P(Yn) has at least one element (0, diam /4). Hence its bottleneck distance from P(Xn) is always greater than or equal to diam E/4. This completes the proof of the lemma.

The next two lemmas address the concept of DP in terms of a hypothesis testing framework. Lemma 27 is a well-known folklore result in the DP literature. It can be easily derived using the f-DP framework (Dong et al., 2022). We give a direct proof that does not require using f-DP.

Lemma 27 Let X and X be adjacent data sets, and M be any ϵ-DP mechanism. For a hypothesis test H0 : M(X) versus H1 : M(X ),

Type I error + Type II error 2 1 + eϵ .

Proof Call Y the probability space that M(X) lives in. Call µX the probability measures on Y for M(X). By Awan et al. (2019, Proposition 2.3), there exists a base measure ν, which dominates µX for all databases X. Call f X the density of µX with respect to ν, which by Awan et al. (2019, Proposition 2.3) satisﬁes f X eϵf X almost everywhere ν, for adjacent databases X and X . Let X and X be adjacent databases, and let φ : Y [0, 1] be a test. Then the type I and type II errors are I = Eφ(M(X)) and II = 1 Eφ(M(X )), respectively. Then

I = Eφ(M(X)) = Z φ(t)f X(t)dν e ϵ Z φ(t)f X (t)dν

= e ϵEφ(M(X ))

= e ϵ(1 II).

Repeating the argument using φ = 1 φ and swapping the roles of X and X , we get

II e ϵ(1 I).

Then, I + II e ϵ[2 (I + II)],

which implies that I + II 2 1+eϵ .

Lemma 28 Let (ϵn) n=1 be a sequence of positive numbers satisfying 1/n ϵn 1 for every n. Set Kn = 1/ϵn . For each n, For a given sequence of positive numbers ( n) n=1, let {(Xn, Yn)} n=1 be a sequence of ﬁnite data sets satisfying, for each n, H(Xn, Yn) = Kn and d B(P(Xn), P(Yn)) Kn n. Here, H denotes the Hamming distance between sets and P means an arbitrary persistence diagram. Then for any ϵ(n)-DP mechanism M that produces a privatized persistence diagram, it is not possible for both d B(PXn, M(Xn)) = op( n/ϵn) and d B(PYn, M(Yn)) = op( n/ϵn).

Differentially Private Topological Data Analysis

Proof For simplicity of notation, we will suppress the dependence of X, Y , ϵ, , and K on n. We will construct a hypothesis test for H0 : M(X) versus H1 : M(Y ). Note that since M is ϵ-DP for groups of size 1, it is Kϵ-DP for groups of size K (Dwork and Roth, 2014, Theorem 2.2). Deﬁne the sets SX and SY as follows:

SX = {P | d B(P, P(X)) < K /2}

SY = {P | d B(P, P(Y )) < K /2}

and deﬁne our test to be φ(M( )) = I(M( ) SY ), which is the indicator function on the event M( ) SY . Then

Type I error = P(M(X) SY ) P(M(X) SX)

Type II error = P(M(Y ) SY ).

As a result of Lemma 27, we have that

P(M(X) SX) + P(M(X ) SY ) 2 1 + ekϵ 2 1 + e,

since kϵ 1, which implies that either

P(d B(M(X), P(X)) K /2) 1 1 + e,

P(d B(M(Y ), P(Y )) K /2) 1 1 + e.

This rules out the possibility that both are op(K ) op( /ϵ).

Proof [Proof of Theorem 18] According to Lemma 26, it is guaranteed that the 0th persistence diagram of the L1-DTM ﬁltration PDTM1 0 satisﬁes the conditions stated in Lemma 28 with n = C mn for some constant C independent of n. For brevity, set P = PDTM1 0 . Then, for a sequence (Xn, Yn) of data sets satisfying the condition, Lemma 28 tells us that either

P(d B(M(Xn), P(Xn)) CK/n) 1 1 + e

P(d B(M(Yn), P(Yn)) CK/n) 1 1 + e

holds, which rules out the possibility that they are op(K/n) op(1/(nϵ)). This completes the proof.

Kang, Kim, Sohn, and Awan

Appendix C. Supplements of the Simulation and the Real Data Analysis

C.1 More Detailed Description of Our Algorithm

Here, the algorithm of our privacy mechanism, which is introduced in Section 4, is explained in detail. For a given data set D, Let S denote the maximum value of the L1DTM function on the data set and M a speciﬁed positive integer. To get an initial diagram P(0) DP = P(0) 0,DP, P(0) 1,DP , generate independently and identically distributed sample x1, . . . , x M, z1, . . . , z M from the uniform distribution on the closed interval [0, S], i.e., x1, . . . , xn, z1, . . . , z M i.i.d. Unif[0, S], symbolically. For each i {1, . . . , M}, set yi = xi + (1 xi)zi;

the diagram P(0) 0,DP is constructed as follows:

P(0) 0,DP = {(x1, y1), . . . , (x M, y M)}.

The other initial diagram P(0) 1,DP is generated in the same way independent of P(0) 0,DP. Notice that each initial diagram consists of M points uniformly distributed on the upper-left triangle {(x, y) : x, y [0, S] and y x}. For generated P(t) DP = (P(t) 0,DP, P(t) 1,DP), the next candidate P(t+1) DP by adding Gaussian noise to each element in each diagram in tth step. To be precise, write

P(t) 0,DP = {(x(t) 1 , y(t) 1 ), . . . , (x(t) M , y(t) M )}.

Generate i.i.d. sample Z(t) 1 , . . . , Z(t) M from the 2-dimensional Gaussian distribution with mean (0, 0) and covariance matrix σ2I2, where σ is a pre-speciﬁed positive number and I2 is the 2 by 2 identity matrix. Set

P 0 = {(x(t) 1 , y(t) 1 ) + Z(t) 1 , . . . , (x(t) M , y(t) M ) + Z(t) M }

P 1 is constructed in the same with independent of P 0, and set P = (P 0, P 1). Then, calculate the accept/reject probability in the Metropolis-Hastings sampler p deﬁned in terms of the bottleneck distance:

p = min 0, ϵ

2 u D(P(t) DP) u D(P ) ,

Generate U Unif(0, 1). If log U p, take P(t+1) DP = P ; otherwise, take P(t+1) DP = P(t) DP. This procedure is carried out repeatedly again, and the P(t) DP at the ﬁnal iteration is proposed as a privatized persistence diagram. The whole procedure is summarized in Algorithm 1.

C.2 Additional Results in the Real Data Analysis

The following illustration, Figure 8, depicts the accuracy of privatized persistence diagrams for Walker A and B. The procedure of implementing the mechanism is the same with that for Walker C described in Section 6. One can see that we also obtain quite accurate privatized diagrams in this case as well.

Differentially Private Topological Data Analysis

(a) Walker A

(b) Walker B

Figure 8: Figure 8-(a) presents the L1-DTM persistence diagram of the data of Walker A and its privatized diagram. Also, the change of the bottleneck distance between the true and privatized diagram over the MCMC iterations is depicted. At the ﬁnal iteration, we have d B(P0(D), P0,DP) = 0.01 and d B(P1(D), P1,DP) = 0.009. Figure 8-(b) presents the same kind of information about Walker B. Here, at the ﬁnal iteration, we obtain d B(P0(D), P0,DP) = 0.011 and d B(P1(D), P1,DP) = 0.009.

Kang, Kim, Sohn, and Awan

Algorithm 1 MCMC implementation of the exponential mechanism

Input: P0(D), P1(D), and a positive integer M Initialization: P(0) 0,DP, P(0) 1,DP Unif(Pers M) for i = 1, 2, . . . , do

P 0 = P(t 1) 0,DP + N(0, σ2I2), P 1 = P(t 1) 1,DP + N(0, σ2I2) P = (P 0, P 1) p = min 0, ϵ

2 u D(P(t 1)) u D(P )

U Unif(0, 1) if log U p then

P(t) 0,DP = P 0, P(t) 1,DP = P 1 else

P(t) 0,DP = P(t 1) 0,DP , P(t) 1,DP = P(t 1) 1,DP end if end for