# Learning Hyperbolic Representations of Topological Features

Published as a conference paper at ICLR 2021

Panagiotis Kyriakis, University of Southern California, Los Angeles, USA (pkyriaki@usc.edu)
Iordanis Fostiropoulos, University of Southern California, Los Angeles, USA (fostirop@usc.edu)
Paul Bogdan, University of Southern California, Los Angeles, USA (pbogdan@usc.edu)

Learning task-specific representations of persistence diagrams is an important problem in topological data analysis and machine learning. However, current methods are restricted in terms of their expressivity as they are focused on Euclidean representations. Persistence diagrams often contain features of infinite persistence (i.e., essential features), and Euclidean spaces shrink their importance relative to non-essential features because they cannot assign infinite distance to finite points. To deal with this issue, we propose a method to learn representations of persistence diagrams on hyperbolic spaces, more specifically on the Poincare ball. By representing features of infinite persistence infinitesimally close to the boundary of the ball, their distance to non-essential features approaches infinity, and thereby their relative importance is preserved. This is achieved without utilizing extremely high values for the learnable parameters, so the representation can be fed into downstream optimization methods and trained efficiently in an end-to-end fashion. We present experimental results on graph and image classification tasks and show that the performance of our method is on par with or exceeds the performance of other state-of-the-art methods.

1 INTRODUCTION

Persistent homology is a topological data analysis tool which tracks how topological features (e.g., connected components, cycles, cavities) appear and disappear as we analyze the data at different scales or in nested sequences of subspaces (1; 2). A nested sequence of subspaces is known as a filtration. As an informal example of a filtration, consider an image of variable brightness. As the brightness is increased, certain features (edges, texture) may become less or more prevalent. The birth of a topological feature refers to the "time" (i.e., the brightness value) when it appears in the filtration and the death refers to the "time" when it disappears. The lifespan of the feature is called its persistence. Persistent homology summarizes these topological characteristics in the form of a multiset called a persistence diagram, which is a highly robust and versatile descriptor of the data. Persistence diagrams enjoy the stability property, which ensures that the diagrams of two similar objects are similar (3). Additionally, under some assumptions, one can approximately reconstruct the input space from a diagram (which is known as solving the inverse problem) (4). However, despite their strengths, the space of persistence diagrams lacks structure, as basic operations, such as addition and scalar multiplication, are not well defined. The only imposed structure is induced by the Bottleneck and Wasserstein metrics, which are notoriously hard to compute, thereby preventing us from leveraging them for machine learning tasks.

Related Work. To address these issues, several vectorization methods have been proposed. Some of the earliest approaches are based on kernels, i.e., generalized products that turn persistence diagrams into elements of a Hilbert space. Kusano et al.
(5) propose a persistence weighted Gaussian kernel which allows them to explicitly control the effect of persistence. Alternatively, Carrière et al. (6) leverage the sliced Wasserstein distance to define a kernel that mimics the distance between diagrams. The approaches by Bubenik (7) based on persistent landscapes, by Reininghaus et al. (8) based on scale space theory and by Le et al. (9) based on the Fisher information metric are along the same line of work. The major drawback in utilizing kernel methods is that they suffer from scalability issues as the training scales poorly with the number of samples. Published as a conference paper at ICLR 2021 In another line of work, researchers have constructed finite-dimensional embeddings, i.e., transformations turning persistence diagrams into vectors in a Euclidean space. Adams et al. (10) map the diagrams to persistence images and discretize them to obtain the embedding vector. Carrière et al. (11) develop a stable vectorization method by computing pairwise distances between points in the persistence diagram. An approach based on interpreting the points in the diagram as roots of a complex polynomial is presented by Di Fabio (12). Adcock et al. (13) identify an algebra of polynomials on the diagram space that can be used as coordinates and the approach is extended by Kališnik in (14) to tropical functions which guarantee stability. The common drawback of these embeddings is that the representation is pre-defined, i.e., there exist no learnable parameters, therefore, it is agnostic to the specific learning task. This is clearly sub-optimal as the eminent success of deep learning has demonstrated that it is preferable to learn the representation. The more recent approaches aim at learning the representation of the persistence diagram in an end-to-end fashion. Hofer et al. (15) present the first input layer based on a parameterized family of Gaussian-like functionals, with the mean and variance learned during training. They extend their method in (16) allowing for a broader class of parameterized function families to be considered. It is quite common to have topological features of infinite persistence (1), i.e., features that never die. Such features are called essential and in practice are usually assigned a death time equal to the maximum filtration value. This may restrict their expressivity because it shrinks their importance relative to non-essential features. While we may be able to increase the scale sufficiently high and end up having only one trivial essential feature (i.e., the 0-th order persistent homology group that becomes a single connected component at a scale that is sufficiently large), the resulting persistence diagrams may not be the ones that best summarize the data in terms of performance on the underlying learning task. This is evident in the work by Hofer et al. (15) where the authors showed that essential features offer discriminative power. The work by Carrière et al. (17), which introduces a network input layer the encompasses several vectorization methods, emphasizes the importance of essential features and is the first one to introduce a deep learning method incorporating extended persistence as a way to deal with them. In this paper, we approach the issue of essential features from the geometric viewpoint. We are motivated by the recent success of hyperbolic geometry and the interest in extending machine learning models to hyperbolic spaces or general manifolds. 
We refer the reader to the review paper by Bronstein et al. (18) for an overview of geometric deep learning. Here, we review the most relevant and pivotal contributions in the field. Nickel et al. (19; 20) propose Poincaré and Lorentz embeddings for learning hierarchical representations of symbolic data and show that their representational capacity and generalization ability outperform Euclidean embeddings. Sala et al. (21) propose low-dimensional hyperbolic embeddings of hierarchical data and show competitive performance on WordNet. Ganea et al. (22) generalize neural networks to the hyperbolic space and show that hyperbolic sentence embeddings outperform their Euclidean counterparts on a range of tasks. Gulcehre et al. (23) introduce hyperbolic attention networks, which improve generalization on machine translation and graph learning while keeping a compact representation. In the context of graph representation learning, hyperbolic graph neural networks (24) and hyperbolic graph convolutional neural networks (25) have been developed and shown to lead to improvements on various benchmarks. However, despite this success of geometric deep learning, little work has been done in applying these methods to topological features, such as persistence diagrams.

The main contribution of this paper is to bridge the gap between topological data analysis and hyperbolic representation learning. We introduce a method to represent persistence diagrams on a hyperbolic space, more specifically on the Poincare ball. We define a learnable parameterization of the Poincare ball and leverage the vectorial structure of the tangent space to combine (in a manifold-preserving manner) the representations of individual points of the persistence diagram. Our method learns better task-specific representations than the state of the art because it does not shrink the relative importance of essential features. In fact, by allowing the representations of essential features to get infinitesimally close to the boundary of the Poincare ball, their distance to the representations of non-essential features approaches infinity, therefore preserving their relative importance. To the best of our knowledge, this is the first approach for learning representations of persistence diagrams in non-Euclidean spaces.

2 BACKGROUND

In this section, we provide a brief overview of persistent homology leading up to the definition of persistence diagrams. We refer the interested reader to the papers by Edelsbrunner et al. (1; 2) for a detailed overview of persistent homology. An overview of homology can be found in the Appendix.

Persistent Homology. Let $K$ be a simplicial complex. A filtration of $K$ is a nested sequence of subcomplexes that starts with the empty complex and ends with $K$,

$$\emptyset = K_0 \subseteq K_1 \subseteq \dots \subseteq K_d = K. \tag{1}$$

A typical way to construct a filtration is to consider sublevel sets of a real-valued function $f : K \to \mathbb{R}$. Let $a_1 < \dots < a_d$ be the sorted sequence of the values of $f(K)$. Then, we obtain a filtration by setting $K_0 = \emptyset$ and

$$K_i = f^{-1}\big((-\infty, a_i]\big) \quad \text{for } 1 \le i \le d. \tag{2}$$

We can apply simplicial homology to each of the subcomplexes of the filtration. When $0 \le i \le j \le d$, the inclusion $K_i \subseteq K_j$ induces a homomorphism

$$f_n^{i,j} : H_n(K_i) \to H_n(K_j) \tag{3}$$

on the simplicial homology groups for each homology dimension $n$. We call the image of $f_n^{i,j}$ an $n$-th persistent homology group; it consists of homology classes born before $i$ that are still alive at $j$.
A homology class $\alpha$ is born at $K_i$ if it is not in the image of the map induced by the inclusion $K_{i-1} \subseteq K_i$. Furthermore, if $\alpha$ is born at $K_i$, it dies entering $K_j$ if the image of the map induced by $K_{i-1} \subseteq K_{j-1}$ does not contain the image of $\alpha$ but the image of the map induced by $K_{i-1} \subseteq K_j$ does. The persistence of the homology class $\alpha$ is $j - i$. Since several classes may be born at the same $i$ and die at the same $j$, we can use inclusion-exclusion to determine the multiplicity of each pair $(i, j)$,

$$\mu_n^{i,j} = \beta_n^{i,j-1} - \beta_n^{i-1,j-1} - \beta_n^{i,j} + \beta_n^{i-1,j}, \tag{4}$$

where the $n$-th persistent Betti numbers $\beta_n^{i,j}$ are the ranks of the images of the $n$-th persistent homology groups, i.e., $\beta_n^{i,j} = \operatorname{rank}(\operatorname{im}(f_n^{i,j}))$, and capture the number of $n$-dimensional topological features that persist from $i$ to $j$. By setting $\mu_n^{i,\infty} = \beta_n^{i,d} - \beta_n^{i-1,d}$ we can account for features that still persist at the end of the filtration ($j = d$), which are known as essential features.

Persistence Diagrams. Persistence diagrams are multisets supported by the upper diagonal part of the real plane and capture the birth/death of topological features (i.e., homology classes) across the filtration.

Definition 2.1 (Persistence Diagram). Let $\Delta = \{x \in R_\Delta : \operatorname{mult}(x) = \infty\}$ be the multiset of the diagonal $R_\Delta = \{(x_1, x_2) \in \mathbb{R}^2 : x_1 = x_2\}$, where $\operatorname{mult}(\cdot)$ denotes the multiplicity function, and let $\mathbb{R}^2_\Delta = \{(x_1, x_2) \in \mathbb{R} \times (\mathbb{R} \cup \{\infty\}) : x_2 > x_1\}$. Also, let $n$ be a homology dimension and consider the sublevel set filtration induced by a function $f : K \to \mathbb{R}$ over the complex $K$. Then, a persistence diagram, $D_n(f)$, is a multiset of the form $D_n(f) = \{x : x \in \mathbb{R}^2_\Delta\} \cup \Delta$, constructed by inserting each point $(a_i, a_j)$ for $i < j$ with multiplicity $\mu_n^{i,j}$ (or $\mu_n^{i,\infty}$ if it is an essential feature). We denote the space of all persistence diagrams by $\mathcal{D}$.

Definition 2.2 (Wasserstein distance and stability). Let $D_n(f), E_n(g)$ be two persistence diagrams generated by the filtrations induced by the functions $f, g : K \to \mathbb{R}$, respectively. We define the Wasserstein distance

$$w^q_p\big(D_n(f), E_n(g)\big) = \inf_{\eta} \Big( \sum_{x \in D_n(f)} \big\| x - \eta(x) \big\|_q^p \Big)^{1/p}, \tag{5}$$

where $p, q \in \mathbb{N}$ and the infimum is taken over all bijections $\eta : D_n(f) \to E_n(g)$. The special case $p = \infty$ is known as the Bottleneck distance. The persistence diagrams are stable with respect to the Wasserstein distance if and only if $w^q_p(D_n(f), E_n(g)) \le \| f - g \|_\infty$.

Note that a bijection $\eta$ between persistence diagrams is guaranteed to exist because their cardinalities are equal, considering that, as per Def. 2.1, the points on the diagonal are added with infinite multiplicity. The strength of persistent homology stems from the above stability definition, which essentially states that the map taking a sublevel set function to its persistence diagram is Lipschitz continuous. This implies that if two objects are similar then their persistence diagrams are close.
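To make the background concrete, the following minimal sketch builds the sublevel-set filtration of Eq. (2) for a toy simplicial complex and reads off its persistence diagram. It uses the gudhi library purely for illustration (the appendix reports giotto-tda and scikit-tda as the toolkits actually used); the complex and its function values are made up for the example.

```python
# A minimal sketch (not the authors' code): the sublevel-set filtration of Eq. (2)
# for a toy complex, and its persistence diagram, including an essential feature.
import gudhi

# f-values on the vertices of a path graph 0 - 1 - 2 - 3
f = {0: 0.1, 1: 0.7, 2: 0.3, 3: 0.9}

st = gudhi.SimplexTree()
for v, fv in f.items():
    st.insert([v], filtration=fv)                   # vertex enters at f(v)
for u, v in [(0, 1), (1, 2), (2, 3)]:
    st.insert([u, v], filtration=max(f[u], f[v]))   # edge enters when both endpoints exist

# persistence() returns a list of (homology_dim, (birth, death)) pairs;
# essential features are reported with death = +inf.
for dim, (birth, death) in st.persistence():
    print(f"H{dim}: born {birth:.2f}, dies {death}")
# The connected component that never dies shows up as (0, (0.1, inf)): this is the
# kind of essential feature the paper wants to represent without clipping its death
# time to the maximal filtration value.
```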
Figure 1: Illustration of our method: initially, the points are transferred via the auxiliary transformation $\rho$ and the parameterization $\phi$ to the Poincare ball $\mathbb{B}$, where learnable parameters $\theta$ are added. Then, the logarithmic map is used for transforming the points to the tangent space $T_{x_0}\mathbb{B}$. Finally, the resulting vectors are added and transformed back to the manifold via the exponential map. Note that the persistence diagram is mapped to a single point on the Poincare ball (i.e., $\Phi(D, \theta) \in \mathbb{B}$).

3 PERSISTENT POINCARE REPRESENTATIONS

In this section, we introduce our method (Fig. 1) for learning representations of persistence diagrams on the Poincare ball. We refer the reader to the Appendix for some fundamental concepts of differential geometry.

The Poincare ball is an $m$-dimensional manifold $(\mathbb{B}, g^{\mathbb{B}}_x)$, where $\mathbb{B} = \{x \in \mathbb{R}^m : \|x\| < 1\}$ is the open unit ball. The space in which the ball is embedded is called the ambient space and is assumed to be equal to $\mathbb{R}^m$. The Poincare ball is conformal (i.e., angle-preserving) to the Euclidean space but it does not preserve distances. The metric tensor and distance function are as follows:

$$g^{\mathbb{B}}_x = \lambda_x^2\, g^{E}, \qquad \lambda_x = \frac{2}{1 - \|x\|^2}, \qquad d_{\mathbb{B}}(x, y) = \operatorname{arccosh}\Big( 1 + 2\,\frac{\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)} \Big), \tag{6}$$

where $g^{E} = \mathbf{I}_m$ is the Euclidean metric tensor. Eq. 6 highlights the benefit of using the Poincare ball for representing persistence diagrams. Contrary to Euclidean spaces, distances in the Poincare ball can approach infinity for finite points. This space is ideal for representing essential features appearing in persistence diagrams without squashing their importance relative to non-essential features. Informally, this is achieved by allowing the representations of the former to get infinitesimally close to the boundary, so that their distances to the latter approach infinity. Fig. 2 provides an illustration.

We gradually construct our representation through a composition of three individual transformations. The first step is to transfer the points to the ambient space (i.e., $\mathbb{R}^m$) of the Poincare ball. Let $D$ be a persistence diagram (we omit the sublevel set function $f$ and the homology dimension $n$ from the notation). We introduce the following auxiliary transformation

$$\rho : \mathbb{R}^2 \to \mathbb{R}^m. \tag{7}$$

This auxiliary transformation is essentially a high-dimensional embedding and may contain learnable parameters. Nonetheless, our main focus is to learn a hyperbolic representation and, therefore, we assume that $\rho$ is not learnable. Later in this section, we analyze conditions on $\rho$ that guarantee the stability and expressiveness of the hyperbolic representation.

The second step is to transform the embedded points from the ambient space to the Poincare ball. When referring to points on a manifold, it is important to define a coordinate system. A homeomorphism $\psi : \mathbb{B} \to \mathbb{R}^m$ is called a coordinate chart and gives the local coordinates on the manifold. The inverse map $\phi : \mathbb{R}^m \to \mathbb{B}$ is called a parameterization of $\mathbb{B}$ and gives the ambient coordinates. The main idea is to inject learnable parameters into this parameterization. The injected parameters could be any form of differentiable functional that preserves the homeomorphism property. Differentiability is needed so that our representation can be fed to downstream optimization methods. In our construction, we utilize a variant of the generalized spherical coordinates. Let $\theta \in \mathbb{R}^m$ be a vector of $m$ parameters. We define the learnable parameterization $\phi : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{B}$ as follows:

$$y_1 = \frac{2}{\pi}\arctan(\theta_1 r_1) \quad \text{and} \quad y_i = \theta_i + \arccos\Big(\frac{x_{i-1}}{r_{i-1}}\Big), \quad \text{for } i = 2, 3, \dots, m, \tag{8}$$

where $r_i^2 = x_m^2 + \dots + x_{i+1}^2 + x_i^2 + \epsilon$. The small positive constant $\epsilon$ is added to ensure that the denominator in Eq. 8 is not zero. Intuitively, Eq. 8 corresponds to scaling the radius of the point by a factor $\theta_1$ and rotating it by $\theta_i$ radians across the angular axes. The scaling and rotation parameters are learned during training. Note that the form of $y_1$ ensures that the representation belongs to the unit ball for all values of $\theta_1$. The coordinate chart is not explicitly used in our representation; it is provided in the Appendix for the sake of completeness.
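As a rough illustration of Eq. (8), the following NumPy sketch (not the authors' implementation; the 2/π scaling of the radial coordinate is our reading of the equation) maps an ambient point x and a parameter vector θ to the radius/angle coordinates y.

```python
# A minimal NumPy sketch of the learnable parameterization in Eq. (8): the radius is
# rescaled by theta[0] and squashed into (-1, 1) with arctan, and the angular
# coordinates are shifted by theta[1:].
import numpy as np

def poincare_parameterization(x, theta, eps=1e-6):
    """Map a point x in the ambient space R^m to the spherical-type coordinates
    y = (radius, angles) of a point of the open unit ball, as in Eq. (8)."""
    m = x.shape[0]
    # r_i^2 = x_m^2 + ... + x_i^2 + eps  (suffix sums of squares)
    r = np.sqrt(np.cumsum((x ** 2)[::-1])[::-1] + eps)
    y = np.empty(m)
    y[0] = (2.0 / np.pi) * np.arctan(theta[0] * r[0])    # radial part, in (-1, 1)
    ratios = np.clip(x[:-1] / r[:-1], -1.0, 1.0)         # x_{i-1} / r_{i-1}
    y[1:] = theta[1:] + np.arccos(ratios)                # rotated angular part
    return y

# Example: a 2D diagram point after a (fixed) auxiliary lift rho to R^5.
rho_x = np.array([0.3, 0.8, 0.1, 0.5, 0.2])   # hypothetical output of rho
theta = np.zeros(5)                            # learnable parameters (here all zero)
print(poincare_parameterization(rho_x, theta))
```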
The third step is to combine the representations of the individual points of the persistence diagram into a single point in the hyperbolic space. Typically, in Euclidean spaces, this is done by concatenating or adding the corresponding representations. However, in non-Euclidean spaces such operations are not manifold-preserving. Therefore, we transform the points from the manifold to the tangent space, combine the vectors via standard vectorial addition, and transform the resulting vector back to the manifold. This approach is based on the exponential and logarithmic maps

$$\exp_x : T_x\mathbb{B} \to \mathbb{B} \quad \text{and} \quad \log_x : \mathbb{B} \to T_x\mathbb{B}. \tag{9}$$

The exponential map allows us to transform a vector from the tangent space to the manifold and its inverse (i.e., the logarithmic map) from the manifold to the tangent space. For a general manifold, it is hard to find these maps as we need to solve for the minimal geodesic curve (see Appendix for more details). Fortunately, for the Poincare ball, they have the analytical expressions

$$\exp_x(v) = x \oplus \Big( \tanh\Big(\frac{\lambda_x \|v\|}{2}\Big)\,\frac{v}{\|v\|} \Big), \qquad \log_x(y) = \frac{2}{\lambda_x}\,\tanh^{-1}\big(\|{-x} \oplus y\|\big)\,\frac{-x \oplus y}{\|{-x} \oplus y\|}, \tag{10}$$

where $\oplus$ denotes the Möbius addition, which is a manifold-preserving operator (i.e., for any $x, y \in \mathbb{B}$ we have $x \oplus y \in \mathbb{B}$). Its analytical expression is given in the Appendix. The transformations given by these maps are norm-preserving: for example, the geodesic distance from $x$ to the transformed point $\exp_x(v)$ coincides with the metric norm $\|v\|_g$ induced by the metric tensor $g^{\mathbb{B}}_x$. This is an important property, as we need the distance between points (and therefore the relative importance of topological features) to be preserved when transforming to and from the tangent space. We now combine the aforementioned transformations and define the Poincare hyperbolic representation, followed by its stability theorem.

Definition 3.1 (Poincare Representation). Let $D \in \mathcal{D}$ be the persistence diagram to be represented in an $m$-dimensional Poincare ball $(\mathbb{B}, g^{\mathbb{B}}_x)$ embedded in $\mathbb{R}^m$ and let $x_0 \in \mathbb{B}$ be a given point. The representation of $D$ on the manifold $\mathbb{B}$ is defined as follows:

$$\Phi : \mathcal{D} \times \mathbb{R}^m \to \mathbb{B}, \qquad \Phi(D, \theta) = \exp_{x_0}\Big( \sum_{x \in D} \log_{x_0}\big( \phi(\rho(x)) \big) \Big), \tag{11}$$

where the exponential and logarithmic maps are given by Eq. 10 and the learnable parameterization and the auxiliary transformation by Eq. 8 and Eq. 7, respectively.

Theorem 1 (Stability of Hyperbolic Representation). Let $D, E$ be two persistence diagrams and consider an auxiliary transformation $\rho : \mathbb{R}^2 \to \mathbb{R}^m$ that is Lipschitz continuous w.r.t. the induced metric norm $\|\cdot\|_g$ and satisfies $\rho(x) = 0$ for all $x \in R_\Delta$. Additionally, assume that $x_0 = 0$. Then, the hyperbolic representation given by Eq. 11 is stable w.r.t. the Wasserstein distance when $p = 1$, i.e., there exists a constant $K > 0$ such that

$$d_{\mathbb{B}}\big(\Phi(D, \theta), \Phi(E, \theta)\big) \le K\, w^g_1(D, E), \tag{12}$$

where $d_{\mathbb{B}}$ is the geodesic distance and $w^g_1$ is the Wasserstein metric with the $q$-norm replaced by the induced norm $\|\cdot\|_g$ (i.e., the norm induced by the metric tensor $g$, see Appendix A.2).

Figure 2: Left: Example graph from the IMDB-BINARY dataset. Middle: Persistence diagrams extracted using the Vietoris-Rips filtration. The dashed line denotes features of infinite persistence, which are represented by points of maximal death value equal to 90 (i.e., by points of finite persistence). Right: Equivalent representation on the 2-dimensional Poincare ball. Features of infinite persistence are mapped infinitesimally close to the boundary, therefore their distance to finite persistence features approaches infinity.
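A self-contained NumPy sketch of the operations in Eqs. (10), (11) and (19) is given below, assuming curvature 1 and base point x0 = 0 as in Theorem 1; it mirrors, but is not, the released TensorFlow implementation.

```python
# Sketch of the Poincare-ball operations (Mobius addition, exp/log maps) and of the
# combination step of Eq. (11): pull points to the tangent space at x0, sum there,
# push the sum back to the ball.
import numpy as np

def mobius_add(x, y, eps=1e-9):
    """Mobius addition on the Poincare ball (Eq. 19)."""
    xy = np.dot(x, y)
    x2, y2 = np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * xy + y2) * x + (1 - x2) * y
    return num / (1 + 2 * xy + x2 * y2 + eps)

def lam(x):
    """Conformal factor lambda_x = 2 / (1 - ||x||^2)."""
    return 2.0 / (1.0 - np.dot(x, x))

def expmap(x, v, eps=1e-9):
    """Exponential map exp_x(v): tangent space at x -> ball (Eq. 10)."""
    nv = np.linalg.norm(v) + eps
    return mobius_add(x, np.tanh(lam(x) * nv / 2.0) * v / nv)

def logmap(x, y, eps=1e-9):
    """Logarithmic map log_x(y): ball -> tangent space at x (Eq. 10)."""
    w = mobius_add(-x, y)
    nw = np.linalg.norm(w) + eps
    return (2.0 / lam(x)) * np.arctanh(nw) * w / nw

def represent(diagram_points_on_ball, x0=None):
    """Eq. (11): combine per-point representations into one point of the ball."""
    x0 = np.zeros_like(diagram_points_on_ball[0]) if x0 is None else x0
    tangent_sum = sum(logmap(x0, p) for p in diagram_points_on_ball)
    return expmap(x0, tangent_sum)

# Toy usage: three already-embedded diagram points on a 3-dimensional ball.
pts = [np.array([0.1, 0.2, 0.0]), np.array([0.0, -0.3, 0.4]), np.array([0.9, 0.0, 0.0])]
print(represent(pts))
```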
The proof of Theorem 1 (given in the Appendix) results from a general stability theorem (3) and is on par with similar results for other vectorizations (10) or representations (15) of persistence diagrams. One subtle difference is that Theorem 1 uses the induced norm rather than the q-norm appearing in the Wasserstein distance. However, since the induced norm implicitly depends on the chosen point x0, which, per requirements of Theorem 1, is assumed to be equal to the origin, there is no substantial difference. The fact that we require the auxiliary transformation ρ to be zero on the diagonal is important to theoretically guarantee stability. Intuitively, this can be understood by recalling (Def. 2.1) that all (infinite) points on the diagonal are included in the persistence diagram. By mapping the diagonal to zero and taking x0 = 0, we ensure that the summation in Eq. 11 collapses to zero when summing over the diagonal. Finally, we note that the assumptions of Theorem 1 are not restrictive. In fact, we can easily find Lipschitz continuous transformations that are zero on the diagonal R , such as the exponential and rational transformations proposed by Hofer et al. (15). Additionally, we note that the assumptions of Theorem 1 do not prohibit us from choosing an "under-powered" or degenerate ρ. For example, ρ = 0 satisfies the assumptions and therefore leads to a stable representation. However, such representation is obviously not useful for learning tasks. An implicit requirement, that guarantees not only the stability but the expressiveness of the results representation, is that ρ does not result in any information loss. This requirement is satisfied by picking a ρ that it is injective, which, given that it is a higher dimensional embedding, is a condition easy to satisfy. In practice, we use a variant of the exponential transformation by Hofer et al. (15). The exact expression is given in the Appendix. 4 EXPERIMENTS We present experiments on diverse datasets focusing on persistence diagrams extracted from graphs and grey-scale images. The learning task is classification. Our representation acts as an input to a neural network and the parameters are learned end-to-end via standard gradient methods. The architecture as well as other training details are discussed in the Appendix. The code to reproduce our experiments is publicly available at https://github.com/pkyriakis/permanifold/. Ablation Study: To highlight to what extent our results are driven by the hyperbolic embedding, we perform an ablation study. In more detail, we consider three variants of our method: 1. Persistent Poincare (P-Poinc): This is the original method as presented in Sec. 3, 2. Persistent Hybrid (P-Hybrid): Same as P-Poinc with the Poincare ball replaced by the Euclidean space. This implies that the exponential and logarithmic maps (Eq. 10) reduce to the identity maps, i.e., expx(v) = x + v logx(y) = y x. The learnable parameterization is as in Eq. 8. 3. Persistent Euclidean (P-Eucl): Same as P-Hybrid with Eq. 8 replaced with simple addition of the learnable parameters, i.e., y = x + θ. Published as a conference paper at ICLR 2021 Baseline - Essential Features Separation: To highlight the benefit of a unified Poincare representation, we design a natural baseline that treats essential and non-essential features separately. In more detail, for each point (b, d) D, we calculate its persistence d b and then compute the histogram of the resulting persistence values. 
For essential features, we compute the histogram of their birth times. Then, we concatenate those histograms and feed them as input to the neural network (architecture described in the Appendix). We consider the case where the essential features are included (baseline w/ essential) and the case where they are discarded (baseline w/o essential).

Manifold Dimension and Projection Bases: Since our method essentially represents each persistence diagram on an $m$-dimensional Poincare ball, it may introduce substantial information compression when the number of points in the diagrams is not of the same order as $m$. A trivial approach to counteract this issue is to use a high value for $m$. However, we experimentally observed that a high manifold dimension does not give the optimal classification performance and adds a computational overhead in the construction of the computation graph. Empirically, the best approach is to keep $m$ at moderate values (in the range $m = 3$ to $m = 12$), replicate the representation $K$ times and concatenate the outputs. Each replica is called a projection base, and for their number we explored values dependent on the number of points in the persistence diagram. Persistence diagrams obtained from images tend to have substantially fewer points than diagrams obtained from graphs. Therefore, for images, we explored moderate values for $K$, i.e., 5-10, whereas for graphs we increased $K$ to the range 200-500. Essentially, we treat both $m$ and $K$ as hyper-parameters, explore their space following the aforementioned empirical rules and pick the optimal values via the validation dataset. As we increase $m$, it is usually prudent to decrease $K$ to maintain similar model capacity.

4.1 GRAPH CLASSIFICATION

In this experiment, we consider the problem of graph classification. We evaluate our approach using social graphs from (26). The REDDIT-BINARY dataset contains 1000 samples and the graphs correspond to online discussion threads. The task is to identify to which community a given graph belongs. The REDDIT-5K and REDDIT-12K datasets are larger variants of the former and contain 5K and 12K graphs from 5 and 11 subreddits, respectively. The task is to predict to which subreddit a given discussion graph belongs. The IMDB-BINARY dataset contains 1000 ego-networks of actors that have appeared together in any movie, and the task is to identify the genre (action or romance) to which an ego-graph belongs. Finally, the IMDB-MULTI dataset contains 1500 ego-networks belonging to 3 genres (action, romance, sci-fi). We train our model using 10-fold cross-validation with an 80/20 split.

Graphs are special cases of simplicial complexes, therefore defining filtrations is straightforward. We use two methods. The first one captures global topological properties and is based on shortest paths. In this case, the graph $G = (V, E)$ is lifted to a metric space $(V, d)$ using the shortest path distance $d : V \times V \to \mathbb{R}_{\ge 0}$ between two vertices, which can easily be proved to be a valid metric. Then, we define the Vietoris-Rips complex of $G$ as the filtered complex $VR_s(G)$ that contains a subset of $V$ as a simplex if all pairwise distances in that subset are less than or equal to $s$, or, formally,

$$VR_s(G) = \{(v_0, v_1, \dots, v_m) : d(v_i, v_j) \le s \ \ \forall i, j\}. \tag{13}$$

This approach essentially interprets the vertices of the graph as a point cloud in a metric space, with the distance between points given by the shortest path between the corresponding vertices. In the case of unweighted graphs, we assign unit weight to all edges.
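A possible implementation of this shortest-path Vietoris-Rips filtration is sketched below; it assumes the networkx and gudhi packages and a hypothetical helper name, so it should be read as an illustration rather than the released pipeline.

```python
# Sketch of the shortest-path Vietoris-Rips filtration of Eq. (13): lift the graph to a
# finite metric space via all-pairs shortest paths and hand the distance matrix to a
# Rips persistence solver.
import networkx as nx
import numpy as np
import gudhi

def shortest_path_rips_diagram(G, max_dim=1):
    nodes = list(G.nodes())
    n = len(nodes)
    # d(v_i, v_j): shortest-path distance; unweighted edges count as length 1.
    spl = dict(nx.all_pairs_shortest_path_length(G))
    dist = np.zeros((n, n))
    for i, u in enumerate(nodes):
        for j, v in enumerate(nodes):
            dist[i, j] = spl[u].get(v, n)   # disconnected pairs get a large finite value
    # gudhi accepts a lower-triangular distance matrix (row i holds distances to 0..i-1).
    lower = [[float(dist[i, j]) for j in range(i)] for i in range(n)]
    rips = gudhi.RipsComplex(distance_matrix=lower)
    st = rips.create_simplex_tree(max_dimension=max_dim + 1)
    return st.persistence()                 # list of (dim, (birth, death))

G = nx.karate_club_graph()
for dim, (b, d) in shortest_path_rips_diagram(G):
    print(dim, b, d)
```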
In Fig. 2 we show a sample persistence diagram extracted with the Vietoris-Rips filtration and the corresponding representation of the features on the Poincare ball.

The second method captures local topological properties and is based on vertex degree. Given a graph $G = (V, E)$, a simplicial complex can be defined as the union of the vertex and edge sets, i.e., $K = K_0 \cup K_1$, where $K_0 = \{v : v \in V\}$ and $K_1 = \{(u, v) : (u, v) \in E\}$. The sublevel function is defined as follows: $f(v) = \deg(v)$ for $v \in K_0$ and $f(u, v) = \max\{f(u), f(v)\}$ for $(u, v) \in K_1$, where $\deg(v)$ is the degree of vertex $v$. Then, the filtration is given by Eq. 2; a short sketch of this construction is given at the end of this subsection.

We compare our method against several state-of-the-art methods for graph classification. In particular, we compare against: (1) the Weisfeiler-Lehman (WL) graph kernel (27), (2) the Deep Graph Kernel (DGK) (26), (3) the Patchy-SAN (PSCN) (28), (4) the Weighted Persistence Image Kernel (WKPI) (29), (5) the Anonymous Walk Embeddings (AWE) (30), and (6) the Graph Isomorphism Network with Graph Filtration Learning (GIN-GFL) (31). Additionally, we compare against the PersLay input layer by Carrière et al. (17), which utilizes extended persistence as an alternative way to deal with essential features, and the Learnable Representation based on Gaussian-like structure elements (GLR) presented by Hofer et al. (15).

Table 1: Classification accuracy (mean ± std or min-max range, if available).

| Method | IMDB-M | IMDB-B | REDDIT-B | REDDIT-5K | REDDIT-12K |
|---|---|---|---|---|---|
| baseline w/o ess. | 38.43 ± 0.98 | 65.78 ± 1.25 | 67.33 ± 1.55 | 39.30 ± 1.12 | 28.35 ± 1.11 |
| baseline w/ ess. | 38.78 ± 0.85 | 66.56 ± 1.13 | 69.33 ± 0.75 | 38.45 ± 0.98 | 30.43 ± 0.75 |
| WL | 49.33 ± 4.75 | 73.40 ± 4.63 | 81.10 ± 1.90 | 49.44 ± 2.36 | 38.18 ± 1.31 |
| DGK | 44.5 ± 0.52 | 66.96 ± 0.56 | 78.04 ± 0.39 | 41.27 ± 0.18 | 32.22 ± 0.10 |
| PSCN | 45.23 ± 2.84 | 71.00 ± 2.29 | 86.30 ± 1.58 | 49.10 ± 0.70 | 41.32 ± 0.32 |
| WKPI | 49.5 ± 0.40 | 75.10 ± 1.10 | n/a | 59.50 ± 0.60 | 48.40 ± 0.50 |
| AWE | 51.54 ± 3.61 | 74.45 ± 5.83 | 87.89 ± 2.53 | 54.74 ± 1.91 | 39.20 ± 2.0 |
| GIN-GFL | 49.70 ± 2.9 | 74.50 ± 4.60 | 90.30 ± 2.60 | 55.70 ± 2.90 | n/a |
| PersLay | 48.8-52.2 | 71.2-72.6 | n/a | 55.6-56.5 | 47.7-49.1 |
| GLR | n/a | n/a | n/a | 54.5 | 44.5 |
| P-Eucl | 46.45 ± 4.03 | 67.54 ± 3.54 | 71.45 ± 2.98 | 43.15 ± 3.12 | 32.56 ± 3.68 |
| P-Hybrid | 47.87 ± 2.03 | 72.48 ± 4.57 | 72.87 ± 1.75 | 46.85 ± 2.17 | 36.42 ± 4.08 |
| P-Poinc | 57.31 ± 4.27 | 81.86 ± 4.26 | 79.78 ± 3.21 | 51.71 ± 3.01 | 42.16 ± 3.45 |

We run simulations for different manifold dimensions (ranging from m = 2 to m = 12) and pick the best one using the mean (across all 10 folds) cross-validation accuracy as criterion. We report the mean and standard deviation for the best m in Table 1. We observe that the performance of the P-Eucl method is poor, but the addition of the learnable parameterization (i.e., the P-Hybrid method) leads to a small but consistent improvement. The best performance is achieved by the fully hyperbolic representation (P-Poinc) and is on par with or exceeds the performance of the state of the art. This suggests that the hyperbolic representation is vital for obtaining good performance. Additionally, the baselines have poor performance irrespective of whether or not essential features have been included via the histogram of their birth times. This indicates that treating essential features separately and including them as additional inputs is not sufficient for good performance. These results support our initial motivation for representing persistence diagrams on the Poincare ball. Also, observe that our method outperforms all the competitor methods for the IMDB datasets, which have a small number of nodes and edges (approx. 10-20 nodes and 60-90 edges), while the REDDIT ones are bigger networks (approx. 400-500 nodes and 500-600 edges). This may indicate that the Poincare representation is best suited for smaller, less complex graphs.
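The sketch below, assuming the same networkx and gudhi packages as the earlier one, illustrates the degree-based sublevel filtration described in this subsection: vertices enter at f(v) = deg(v) and an edge enters at the larger degree of its endpoints, as in Eq. (2).

```python
# Degree-based sublevel filtration of a graph, as a gudhi SimplexTree.
import networkx as nx
import gudhi

def degree_sublevel_diagram(G):
    st = gudhi.SimplexTree()
    deg = dict(G.degree())
    for v in G.nodes():
        st.insert([v], filtration=float(deg[v]))                   # f(v) = deg(v)
    for u, v in G.edges():
        st.insert([u, v], filtration=float(max(deg[u], deg[v])))   # f(u,v) = max(f(u), f(v))
    return st.persistence()

print(degree_sublevel_diagram(nx.karate_club_graph()))
```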
4.2 IMAGE CLASSIFICATION

Even though it is well known that convolutional neural networks have achieved unprecedented success as feature extractors for images, we present an image classification case study using topological features as a proof of concept for our method. In this second experiment, we utilize two standardized datasets: MNIST, which contains images of handwritten digits, and Fashion-MNIST, which contains shape images of different types of garment (e.g., T-shirt, trouser, etc.). Each dataset contains a total of 70K (60K train, 10K validation, 10 folds) grey-scale images of size 28 x 28. Both datasets are balanced and categorized into 10 classes.

Contrary to graphs, images are not inherently equipped with the structure of a simplicial complex. In theory, we could construct Vietoris-Rips complexes, as in Eq. 13, by interpreting pixels as a point cloud. However, this is not the most natural representation of an image, which has a grid structure. Therefore, we exploit this structure by constructing cubical complexes, i.e., unions of cubes aligned on a 2D grid. As in the case of graphs, we use two methods. For the first method, called cubical filtration, we use the grey-scale image directly and represent each pixel as a 2-cube. Then, all of its faces are added to the complex $K$. We obtain a sublevel function by extending the grey-scale value $I(v)$ of a pixel $v$ to all cubes in $K$ as follows:

$$f(\sigma) = \min_{\sigma \text{ face of } \tau} I(\tau), \quad \sigma \in K. \tag{14}$$

Finally, a grey-scale filtration is obtained using Eq. 2. The second method is called height filtration and uses the binarized version of the original image. We define the height filtration by choosing a direction $v \in \mathbb{R}^2$ of unit norm and by assigning to each pixel $p$ of value 1 in the binarized image a new value equal to $\langle p, v \rangle$, i.e., the distance of pixel $p$ to the plane defined by $v$. This creates a new grey-scale image which is then fed to the aforementioned cubical filtration. We note that the height filtration deserves special attention in persistent homology because complexes to which it is applied can be approximately reconstructed from their persistence diagrams (4). In practice, we use direction vectors $v$ that are distributed uniformly across the unit circle. We choose 30 and 50 directions for MNIST and Fashion-MNIST, respectively.

We compare our method against three baselines that encompass all families of methods for handling persistence diagrams: (1) the Persistence Weighted Gaussian Kernel (PWGK) approach proposed by Kusano et al. (5), (2) the Persistence Images (PI) embedding developed by Adams et al. (10), and (3) the Learnable Representation based on Gaussian-like structure elements (GLR) by Hofer et al. (15). We run our simulations for manifold dimensions equal to m = 3 and m = 9 and report the best results in Table 2.

Table 2: Classification accuracy.

| Method | MNIST | F-MNIST |
|---|---|---|
| PWGK | 75.31 | 28.13 |
| PI | 94.39 | 31.26 |
| GLR | 93.31 | 56.85 |
| P-Eucl | 92.27 | 30.21 |
| P-Hybrid | 93.78 | 38.45 |
| P-Poinc | 95.91 | 72.28 |

Figure 3: Plotting the train loss (left) and the validation accuracy (middle) over 10 training epochs for the MNIST dataset using two different dimensions for the Poincare ball (m = 3 and m = 9).
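For illustration, the two image filtrations described above could be set up as follows with giotto-tda (one of the toolkits cited in the appendix); the handling of background pixels in the height filtration and the random stand-in image are our assumptions, not the authors' exact preprocessing.

```python
# Sketch of a grey-scale cubical filtration and a height filtration along a unit direction v.
import numpy as np
from gtda.homology import CubicalPersistence

def height_filtration(binary_img, v):
    """Assign <p, v> to foreground pixels; background pixels get a value above the maximum."""
    h, w = binary_img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    heights = xs * v[0] + ys * v[1]
    return np.where(binary_img > 0, heights, heights.max() + 1.0)

img = (np.random.rand(28, 28) > 0.5).astype(float)        # stand-in for a binarized MNIST digit
grey = img                                                  # input to the cubical filtration
height = height_filtration(img, v=np.array([1.0, 0.0]))    # one of the unit directions

cp = CubicalPersistence(homology_dimensions=(0, 1))
diagrams = cp.fit_transform(np.stack([grey, height]))       # one diagram per filtered image
print(diagrams.shape)
```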
Our method outperforms all other methods, in some cases by a considerable margin. We also study how the manifold dimension affects the train loss and validation accuracy. The results are shown in Fig. 3. Observe that for m = 9 the train loss decreases more rapidly and the validation accuracy increases more rapidly in the first few epochs. This suggests that a higher manifold dimension may be slightly better for representing persistence diagrams. Additionally, we observed that the Poincare representation tends to generalize better than its Euclidean counterpart. In fact, the validation accuracy of the P-Eucl method started decreasing after the first 2-3 epochs, whereas the P-Poinc method showed a saturation rather than a decrease. This was observed without modifying the dropout rate or any other hyper-parameters and is a strong empirical finding that demonstrates the superiority of Poincare representations. 5 CONCLUSION We presented the first, to the best of our knowledge, method for learning representations of persistence diagrams in the Poincare ball. Our main motivation for introducing such method is that persistence diagrams often contain topological features of infinite persistence (i.e., essential features) the representational capacity of which may be bottlenecked when representing them in Euclidean spaces. This stems from the fact that Euclidean spaces cannot assign infinite distance to finite points. The main benefit of using the Poincare space is that by allowing the representations of essential features to get infinitesimally close to the boundary of the ball their distance to non-essential features approaches infinity, therefore preserving their relative importance. Directions for future work include the learning of filtration and/or scale end-to-end as well as to investigate whether or not there exists some hyperbolic trend in distances appearing in persistence diagrams that justify the improved performance especially on small graphs. ACKNOWLEDGEMENTS The authors gratefully acknowledge the support by the National Science Foundation under the Career Award CPS/CNS-1453860, the NSF award under Grant Numbers CCF-1837131, MCB-1936775, CNS-1932620, and CMMI 1936624 and the DARPA Young Faculty Award and DARPA Director s Fellowship Award, under Grant Number N66001-17-1-4044, and a Northrop Grumman grant. The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied by the Defense Advanced Research Projects Agency, the Department of Defense or the National Science Foundation. Published as a conference paper at ICLR 2021 [1] Herbert Edelsbrunner and John Harer. Persistent homology a survey. Discrete Computational Geometry - DCG, 453, 01 2008. [2] Herbert Edelsbrunner and D. Morozov. Persistent homology: theory and practice. European Congress of Mathematics, pages 31 50, 01 2014. [3] David Cohen-Steiner, Herbert Edelsbrunner, and John Harer. Stability of persistence diagrams. Discrete & Computational Geometry, 37(1):103 120, 2007. [4] K. Turner, S. Mukherjee, and D. M. Boyer. Persistent homology transform for modeling shapes and surfaces. Information and Inference: A Journal of the IMA, 3(4):310 344, Dec 2014. [5] Genki Kusano, Kenji Fukumizu, and Yasuaki Hiraoka. Persistence weighted gaussian kernel for topological data analysis. 
In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML 16, page 2004 2013. JMLR.org, 2016. [6] Mathieu Carrière, Marco Cuturi, and Steve Oudot. Sliced wasserstein kernel for persistence diagrams. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML 17, page 664 673. JMLR.org, 2017. [7] Peter Bubenik. Statistical topological data analysis using persistence landscapes. Journal of Machine Learning Research, 16(3):77 102, 2015. [8] J. Reininghaus, S. Huber, U. Bauer, and R. Kwitt. A stable multi-scale kernel for topological machine learning. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4741 4748, 2015. [9] Tam Le and Makoto Yamada. Persistence fisher kernel: A riemannian manifold kernel for persistence diagrams. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 10007 10018. Curran Associates, Inc., 2018. [10] Henry Adams, Tegan Emerson, Michael Kirby, Rachel Neville, Chris Peterson, Patrick Shipman, Sofya Chepushtanova, Eric Hanson, Francis Motta, and Lori Ziegelmeier. Persistence images: A stable vector representation of persistent homology. Journal of Machine Learning Research, 18(8):1 35, 2017. [11] Frédéric Chazal, Brittany Terese Fasy, Fabrizio Lecci, Alessandro Rinaldo, and Larry Wasserman. Stochastic convergence of persistence landscapes and silhouettes. In Proceedings of the Thirtieth Annual Symposium on Computational Geometry, SOCG 14, page 474 483, New York, NY, USA, 2014. Association for Computing Machinery. [12] Barbara Di Fabio and Massimo Ferri. Comparing persistence diagrams through complex vectors. In Vittorio Murino and Enrico Puppo, editors, Image Analysis and Processing ICIAP 2015, pages 294 305, Cham, 2015. Springer International Publishing. [13] Aaron Adcock, Erik Carlsson, and Gunnar Carlsson. The ring of algebraic functions on persistence bar codes. Homology, Homotopy and Applications, 18, 04 2013. [14] Sara Kalisnik. Tropical coordinates on the space of persistence barcodes. Foundations of Computational Mathematics, 19(1):101 129, 2019. [15] Christoph Hofer, Roland Kwitt, Marc Niethammer, and Andreas Uhl. Deep learning with topological signatures. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 17, page 1633 1643, Red Hook, NY, USA, 2017. Curran Associates Inc. [16] Christoph D. Hofer, Roland Kwitt, and Marc Niethammer. Learning representations of persistence barcodes. Journal of Machine Learning Research, 20(126):1 45, 2019. Published as a conference paper at ICLR 2021 [17] Mathieu Carrière, Frédéric Chazal, Yuichi Ike, Théo Lacombe, Martin Royer, and Yuhei Umeda. Perslay: A neural network layer for persistence diagrams and new graph topological signatures. In Silvia Chiappa and Roberto Calandra, editors, The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy], volume 108 of Proceedings of Machine Learning Research, pages 2786 2796. PMLR, 2020. [18] M. M. Bronstein, J. Bruna, Y. Le Cun, A. Szlam, and P. Vandergheynst. Geometric deep learning: Going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18 42, 2017. [19] Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. 
Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6338 6347. Curran Associates, Inc., 2017. [20] Maximilian Nickel and Douwe Kiela. Learning continuous hierarchies in the lorentz model of hyperbolic geometry. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 3776 3785. PMLR, 2018. [21] Frederic Sala, Chris De Sa, Albert Gu, and Christopher Re. Representation tradeoffs for hyperbolic embeddings. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4460 4469, Stockholmsmässan, Stockholm Sweden, 10 15 Jul 2018. PMLR. [22] Octavian Ganea, Gary Becigneul, and Thomas Hofmann. Hyperbolic neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5345 5355. Curran Associates, Inc., 2018. [23] Caglar Gulcehre, Misha Denil, Mateusz Malinowski, Ali Razavi, Razvan Pascanu, Karl Moritz Hermann, Peter Battaglia, Victor Bapst, David Raposo, Adam Santoro, and Nando de Freitas. Hyperbolic attention networks. In International Conference on Learning Representations, 2019. [24] Qi Liu, Maximilian Nickel, and Douwe Kiela. Hyperbolic graph neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8230 8241. Curran Associates, Inc., 2019. [25] Ines Chami, Zhitao Ying, Christopher Ré, and Jure Leskovec. Hyperbolic graph convolutional neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 4868 4879. Curran Associates, Inc., 2019. [26] Pinar Yanardag and S.V.N. Vishwanathan. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 15, page 1365 1374, New York, NY, USA, 2015. Association for Computing Machinery. [27] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M. Borgwardt. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(77):2539 2561, 2011. [28] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML 16, page 2014 2023. JMLR.org, 2016. [29] Qi Zhao and Yusu Wang. Learning metrics for persistence-based summaries and applications for graph classification. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 9855 9866, 2019. Published as a conference paper at ICLR 2021 [30] Sergey Ivanov and Evgeny Burnaev. Anonymous walk embeddings. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2186 2195, Stockholmsmässan, Stockholm Sweden, 10 15 Jul 2018. PMLR. 
[31] Christoph Hofer, Florian Graf, Bastian Rieck, Marc Niethammer, and Roland Kwitt. Graph filtration learning. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4314 4323, Virtual, 13 18 Jul 2020. PMLR. [32] Manfredo Perdigao do Carmo. Riemannian geometry. Birkhäuser, 1992. Published as a conference paper at ICLR 2021 A HOMOLOGY AND DIFFERENTIAL GEOMETRY A.1 HOMOLOGY Homology is a general method for associating a chain of algebraic objects (e.g abelian groups) to other mathematical objects such as topological spaces. The construction begins with a chain group, Cn, whose elements are the n-chains, which for a given complex are formal sums of the n-dimensional cells. The boundary homomorphism, n : Cn Cn+1, maps each n-chain to the sum of the (n 1)-dimensional faces of its n-cell, which is a (n 1)-chain. Combining the chain groups and the boundary maps in the sequence, we get the chain complex: . . . n+2 Cn+1 n+1 Cn n . . . 2 C1 1 C0 0 0. (15) The kernels, Zn = ker( n), and the images, Bn = im( n+1), of the boundary homomorphisms are called cycle and boundary groups, respectively. A fundamental property of the boundary homomorphism is that its square is zero, n n+1 = 0, which implies that, for every n, the boundary forms a subgroup of the cycle, i.e., im( n+1) ker( n). Therefore, we can create their quotient group Hn := ker( n))/im( n+1) = Zn/Bn, (16) which is called n-th homology group and its elements are called homology classes. The rank, βn = rank(Hn), of this group is known as the n-th Betti number. Even though homology can be applied to any topological space, we focus on simplicial homology, i.e., homology groups generated by simplicial complexes. Let K be a simplicial complex and Kn its n-skeleton. The chain group Cn(K) is the free abelian group whose generators are the n-dimensional simplexes of K. For a simplex σ = [x0, xn] Kn, we define the boundary homomorphism as i=0 [x0, . . . , xi 1, xi+1, . . . , xn] (17) and linearly extend this to Cn(K), i.e., n(P σi) = P (σi). A.2 RIEMMANIAN MANIFOLDS An m-dimensional manifold can be seen as a generalization of a 2D surface and is a space that can be locally approximated by Rm. The space in which the manifold is embedded is called ambient space. We assume that the ambient space is the standard Euclidean space of dimension equal to the dimension of the manifold, i.e, Rm. For x M, the tangent space Tx M of M at x is the linear vector space spanned by the m linear independent vectors tangent to the curves passing through x. A Riemannian metric g = (gx)x M on M is a collection of inner products gx : Tx M Tx M R varying smoothly across M and allows us to define geometric notions such as angles and the length of a curve. A Riemannian manifold is a smooth manifold equipped with a Riemannian metric g. The Riemannian metric g gives rise to the metric norm g and induces global distances by integrating the length of a shortest path between two points x, y M d(x, y) = inf γ gγ(t)( γ(t), γ(t))dt, (18) where γ M ([0, 1], H) is such that γ(0) = x and γ(1) = y. The unique smooth path γ of minimal length between two points x and y is called a geodesic and the underlying shortest path length is called geodesic distance. Geodesics and can be seen as the generalization of straight-lines. For a point x M and a vector v Tx M, let γv be the geodesic starting from x with an initial tangent vector equal to v, i.e., γv(0) = x and γv(0) = v. 
The existence, uniqueness and differentiability of the exponential map are guaranteed by the classical existence and uniqueness theorem of ordinary differential equations and Gauss s Lemma of Riemannian geometry (32). A.3 POINCARE BALL We define the Möbius addition for any x, y on the Poincare ball: x y = (1 + 2 x, y + y 2)x + (1 x 2)y 1 + 2 x, y + x 2 y 2 (19) Published as a conference paper at ICLR 2021 The Möbius addition is a manifold preserving operator that allows us to add point belonging in the Poincare ball without leaving the space. Finally, we provide the coordinate chart ψ : M Rm, i.e., the inverse of the parameterization in Eq. 8, which is defined as follows: j=2 sin (yj) cos (yi), for i = 1, . . . , m 1 and xm = x1 j=2 sin (yj) sin (ym). (20) Note that we have omitted the learnable parameters. The coordinate chart is not explicitly needed for constructing the representation. However, it could be useful as an intermediate layer between our representation and standart deep neural network architecture. Its purpose would be to transfer the representation from the manifold to a Euclidean space which can be fed into a standart DNN without having to define its operations on the manifold. B THEORETICAL RESULTS Proof of Theorem 1. Let D, E be two persistence diagrams and η : D E bijection that achieves the infinum in Eq.5. Consider the subset D0 = D \ R , which is essentially the original diagram without the points in the diagonal. Then, we have the following sequence of inequalities d B(Φ(D, θ), Φ(E, θ)) (21) x D logx0 φ(ρ(x)) , expx0 X x E logx0 φ(ρ(x)) (22) x D logx0(φ(ρ(x))) X x E logx0(φ(ρ(x)))) x D0 logx0(φ(ρ(x))) X x D0 logx0(φ(ρ(η(x))) logx0(φ(ρ(x))) logx0(φ(ρ(η(x)))) g (25) x D0 φ(ρ(x)) φ(ρ(η(x))) g Ke Kl Kφ X x D0 ρ(x) ρ(η(x)) g (26) Ke Kl KφKρ X x D0 x η(x) g Ke Kl KφKρwg 1(D0, E) (27) where wg 1 is the Wasserstein distance with the q-norm replaced with the norm induced by the metric tensor of the Poincare ball. Eq. 23 follows from the smoothness of the exponential map (assuming that it has Lipschitz constant equal to Ke). Eq. 24 follows by substituting the bijection η and because, as per requirements of Theorem 1, the auxiliary transformation ρ is zero on the diagonal R and x0 = 0. Note that the requirements of Theorem 1, combined with Eq. 8, 10 and 19, imply that the summation over x R collapses to zero. Eq. 25 follows from the triangle inequality. Eq. 26 follows from the smoothness of the logarithmic map and the parameterization (assuming Kl and Kφ Lipschitz constants, respectively). Finally, Eq. 27 follows from the assumed, as per Theorem 1, smoothness of ρ and by using the definition of Wasserstein distance. This concludes the proof. C IMPLEMENTATION DETAILS Auxiliary Transformation: Even though several choices for the auxiliary transformation ρ : R2 Rm defined in Eq. 7 are possible, we present the one used in our implementation. Our choice for ρ is based on the exponential structure elements introduced by Hofer et al. (15). Formally, for i = 1, 2, . . . m, we define the following ρi(x1, x2) = ( e (x1 µ1,i)2 (ln (x1 x2) µ2,i)2 x1 = x2 0 x1 = x2. (28) Published as a conference paper at ICLR 2021 Then, the auxiliary transformation is ρ : (x1, x2) (ρi(x1, x2))m i=1, where (µ1,i, µ2,i)m i=1 are the means. In contrast to Hofer et al. (15), those means are not learned, but we rather fix them to pre-defined values and keep then constant during training. 
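The following is a sketch of one reading of Eq. (28); taking the logarithm of the persistence x2 - x1 (which is positive off the diagonal) is our assumption about the intended sign, and the fixed means are placeholders.

```python
# Exponential structure elements of Eq. (28): zero on the diagonal, Gaussian-like otherwise.
import numpy as np

def rho(point, mu1, mu2):
    """Map a diagram point (x1, x2) to R^m; zero on the diagonal, as Theorem 1 requires."""
    x1, x2 = point
    if x1 == x2:
        return np.zeros_like(mu1)
    return np.exp(-(x1 - mu1) ** 2 - (np.log(x2 - x1) - mu2) ** 2)

mu1 = np.linspace(0.0, 1.0, 5)      # fixed, pre-defined means (not learned)
mu2 = np.linspace(-2.0, 2.0, 5)
print(rho((0.2, 0.9), mu1, mu2))    # off-diagonal point -> nonzero embedding
print(rho((0.4, 0.4), mu1, mu2))    # diagonal point -> zero vector
```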
Additionally, we can easily prove that the above-defined ρ satisfies the stability conditions given by Theorem 1, i.e., zero on the diagonal and Lipschitz continuity. Finally, under the assumption of all means being unique, it is injective on R2 and, therefore, introduces no information loss. Network Architecture: We implemented all algorithms in Tensor Flow 2.2 using the TDA-Toolkit2 and the Scikit-TDA3 for extracting persistence diagrams and run all experiments on the Google Cloud AI Platform. To show the versatility of our approach, we used the same network architecture across all of our simulations and our proposed representation acts as an input. In more detail, for each individual filtration that we extract from the data (images or graphs), we create a different input layer. The input to this layer is the persistence diagrams of all homology dimension given by the corresponding filtration. Given the nature of the data (graphs and images), the homology dimensions with non-trivial persistence diagrams are only the first two (i.e., H0 and H1). Our input layer processes the persistence diagrams of each homology class independently and outputs the representations (i.e., Eq. 11) for each one of them. Following that, we concatenate and flatten the previous outputs and fed the resulting vector to a batch normalization layer. Then, we use two dense layers of 256 and 128 neurons, respectively, with a Rectified Linear (Re Lu) activation function. This is followed by a dropout layer to prevent overfitting and a final dense layer as an output. We consider two different implementations of the the above architecture. The first one is based on the hyperbolic neural networks introduced by Ganea et al. in (22). It re-defines key neural network operations in the Poincare ball and, therefore, allows us to directly utilize the output of our representation in the neural network. The second implementation uses the coordinate chart given by Eq. 20 to project the representation back to a Euclidean space and feed it to a standardized implementation of the aforementioned layers. This method is more convenient in practice as it maintains compatibility with existing backbone networks trained in Euclidean geometry. Hyperparameters and Training: To train the neural network, we use the Adam optimizer (β1 = 0.9, β2 = 0.999) with an initial learning rate of 0.001 and batch size equal to 64. We use a random uniform initializer in the interval [ 0.05, 0.05] for all learnable variables. For all graph datasets we trained the network for 100 epochs and halved the learning rate every 25 epochs. For the MNIST and Fashion-MNIST datasets we used 10 and 20 epochs, respectively, and no learning rate scheduler. We tune the dropout rate manually by starting from really low values and monitoring the validation set accuracy. In general, we noticed that the network did not tend to overfit, therefore, we kept the rate at low values (0-0.2) for all experiments. 2https://github.com/giotto-ai/giotto-tda 3https://github.com/scikit-tda/scikit-tda
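To make the training setup concrete, here is a Keras sketch of the downstream head and hyperparameters described above, assuming the hyperbolic representations have already been mapped back to Euclidean coordinates via the chart of Eq. (20) and concatenated into one vector per sample; the input dimension and number of classes are placeholders.

```python
# Sketch of the downstream classifier: batch norm, two dense ReLU layers, dropout, softmax,
# trained with Adam as reported above. Not the released implementation.
import tensorflow as tf

def build_head(input_dim, num_classes, dropout_rate=0.1):
    init = tf.keras.initializers.RandomUniform(-0.05, 0.05)
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(256, activation="relu", kernel_initializer=init),
        tf.keras.layers.Dense(128, activation="relu", kernel_initializer=init),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(num_classes, activation="softmax", kernel_initializer=init),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = build_head(input_dim=2 * 5 * 9, num_classes=10)   # e.g. 2 diagrams x K=5 bases x m=9
model.summary()
```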