# pointlevel_topological_representation_learning_on_point_clouds__20631734.pdf

Point-Level Topological Representation Learning on Point Clouds

Vincent P. Grande 1 Michael T. Schaub 1

Topological Data Analysis (TDA) allows us to extract powerful topological and higher-order information on the global shape of a data set or point cloud. Tools like Persistent Homology give a single complex description of the global structure of the point cloud. However, common machine learning applications like classification require point-level information and features. In this paper, we bridge this gap and propose a novel method to extract point-level topological features from complex point clouds using discrete variants of concepts from algebraic topology and differential geometry. We verify the effectiveness of these topological point features (TOPF) on both synthetic and real-world data and study their robustness under noise and heterogeneous sampling.

1. Introduction

In modern machine learning (Murphy, 2022), objects are described by feature vectors. However, the coordinates of a single vector can often only be understood in relation to the entire data set: if the value x is small, average, large, or even an outlier depends on the remaining data. In a low-dimensional case this issue can be addressed simply by normalising the data points according to the global mean and standard deviation or similar procedures. We can interpret this as the most straight-forward way to construct local features informed by the global structure of the data set.

In the case where not all data dimensions are equally relevant, or contain correlated and redundant information, we can apply (sparse) PCA to project the data points to a lowerdimensional space using information about the global structure of the point cloud. For even more complex data, we may first have to learn the encoded structure itself: indeed, a

1Department of Computer Science, RWTH Aachen University, Germany. Correspondence to: Vincent P. Grande <grande@cs.rwth-aachen.de>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

typical assumption underpinning many unsupervised learning methods is the so-called manifold hypothesis which posits that real world data can be described well via submanifolds of n-dimensional space (Ma & Fu, 2012; Fefferman et al., 2016). Using eigenvectors of some Laplacian, we can then obtain a coordinate system intrinsic to the point cloud (see e.g. (Shi & Malik, 2000; Belkin & Niyogi, 2003; Coifman & Lafon, 2006)). Common to all these above examples is the goal is to construct locally interpretable point-level features that encode globally meaningful positional information robust to local perturbations of the data.

Instead of focussing on the interpretation of individual points, topological data analysis (TDA), (Carlsson & Vejdemo-Johansson, 2021), follows a different approach.

TDA extracts a global description of the shape of data, which is typically considered in the form of a high-dimensional point cloud. This is done measuring topological features like persistent homology, which counts the number of generalised holes on multiple scales. Due to their flexibility and robustness these global topological features have been shown to contain relevant information in a broad range of application scenarios: In medicine, TDA has provided methods to analyse cancer progression (Lawson et al., 2019). In biology, persistent homology has been used to analyse knotted protein structures (Benjamin et al., 2023), and the spectrum of the Hodge Laplacian has been used for predicting protein behaviour (Wee et al., 2024).

This success of topological data analysis is a testament to the fact that relevant information is encoded in the global topological structure of point cloud data. Such higher-order topological information is however invisible to the tools of data analysis discussed above like PCA or diffusion maps. We are now faced by the problem that (i) important parts of the global structure of a complex point cloud can only be described by the language of applied topology, however (ii) most standard methods to obtain positional point-level information are not sensitive to the higher-order topology of the point cloud.

Contributions We introduce TOPF (Figure 1), a novel method to compute point-level topological features relating individual points to global topological structures of point clouds. TOPF (i) outperforms other methods and embeddings for clustering downstream tasks on topologically

Point-Level Topological Representation Learning on Point Clouds

4 Compute persistent

Pick significant

Weighted harmonic

representatives

Average over incident simplices

Topological Node Embeddings in

Biology, Machine Learning,

Computer Vision, etc.

Feature vectors

Feature vectors

Applications

Bars correspond to loops in the point cloud

Figure 1: Computing Topological Point Features (TOPF). Input. A point cloud X in n-dimensional space. Step 1. To extract global topological information, the persistent homology is computed on an α/VR-filtration. The most significant topological features F across all specified dimensions are selected. Step 2. k-homology generators associated to all features fi,k F are computed. For every feature, a simplicial complex is built at a step of the filtration where fi,k is alive. Step 3. The homology generators are projected to the harmonic space. Step 4. The vectors are normalised to obtain vectors ei k indexed over the k-simplices. For every point x and feature f F, we compute the mean of the entries of ei k corresponding to simplices containing x. The output is a |X| |F| matrix which can be used for downstream ML tasks. Optional. We weigh the simplicial complexes resulting in a topologically more faithful harmonic representative in Step 3.

structured data, returns (ii) meaningful representations, and is (iii) robust to moderate noise and heterogeneous sampling. Finally, we introduce the topological clustering benchmark suite, the first benchmark for topological clustering.

Related Work The intersection of topological data analysis, topological signal processing and geometry processing has many interesting related developments in the past few years. On the side of homology and TDA, the authors in (De Silva & Vejdemo-Johansson, 2009) and (Perea, 2020) use harmonic cohomology representatives to reparametrise point clouds based on circular coordinates. This implicitly assumes that the underlying structure of the point cloud is amenable to such a characterization.

In (Basu & Cox, 2022; Gurnari et al., 2023), the authors develop and use harmonic persistent homology for data analysis. However, among other differences their focus is not on providing robust topological point features. (Carrière et al., 2015) construct point features using Topology as well. However, their signatures only summarise the local topology of the neighbourhood of the point rather than the relation between the point and the global topological features. (Grande & Schaub, 2023a) uses the harmonic space of the Hodge Laplacians to cluster point clouds respecting topology, but is unstable against some form of noise, has no

possibility for features selection across scales and is computationally far more expensive than TOPF. An overview over the thriving field of topological data analysis can be found in (Wasserman, 2018; Munch, 2017). For a more in-depth review of related work, see Appendix A. Because there are different views on what constitutes representation learning, we note that the learnt representations of TOPF are the result of homology and linear algebra computations, rather than trained features of an autoencoder or other neural network.

Organisation of the paper In Section 2, we give an overview over the main ideas and concepts behind TOPF. In Section 3, we describe how to compute TOPF. Finally, we will apply TOPF on synthetic and real-world data in Section 4. Furthermore, Appendix A contains a brief history of topology and a detailed discussion of related work. Appendix B contains additional theoretical considerations. In Appendix C, we give a theoretical result guaranteeing the correctness of TOPF in an idealised setting. Appendix D describes the novel topological clustering benchmark suite, Appendix E contains details on the implementation, the choice of hyperparameters and a discussion of runtime complexity, Appendix F contains more details on the experiments, Appendix G gives a detailed treatment of feature selection, Appendix H discusses simplicial weights, and Appendix I discusses limitations in detail.

Point-Level Topological Representation Learning on Point Clouds

Code We provide TOPF as an easy-to-use python package with example notebooks at https://github.com/ vincent-grande/topf installable via pip.

2. Main Ideas of TOPF

A main goal of algebraic topology is to capture the shape of spaces. Techniques from topology describe globally meaningful structures that are indifferent to local perturbations and deformations. This robustness of topological features to local perturbations is particularly useful for the analysis of large-scale noisy datasets.

In this section we provide a broad overview over the most important concepts of topology and TDA for our context, prioritising intuition over technical formalities. For more details, the reader is referred to (Bredon et al., 1993) (algebraic topology) and (Munch, 2017) (TDA).

Simplicial Complexes Spaces in topology are continuous (connected), consist of infinitely many points, and often live in abstract space. Our input data sets however consist of finitely many points embedded in real space Rn. In order to bridge this gap and open up topology to computational methods, we need a notion of discretised topological spaces consisting of finitely many base points with finite description length. A simplicial complex is the simplest discrete model that can still approximate any topological space occuring in practice (Quillen, 1967):

Definition 2.1 (Simplicial complexes). A simplicial complex (SC) S consists of a set of vertices V and a set of finite non-empty subsets (simplices, S) of V closed under taking non-empty subsets, such that the union over all simplices S

σ S σ is V . We will often identify S with its set of simplicies S and denote by Sk the set of simplices σ S with |σ| = k + 1, called k-simplices. We say that S is n-dimensional, where n is the largest k such that Sk is non-empty. The k-skeleton of S contains the simplices of dimension at most k. If the vertices V lie in Rn, we call the convex hull in Rn of a simplex σ its geometric realisation |σ|. When doing this for every simplex of S, we call this the geometric realisation of S, |S| Rn.

We can construct an n-dimensional SC S in n + 1 steps: We start with a set of vertices V (S0). We then connect certain pairs of vertices with edges (S1). Afterwards, we fill in some fully connected triangles (S2), and so on.

Vietoris Rips and α-complexes We now need a way to construct a simplicial complex that approximates the topological structure inherent in our data set X Rn. Instead of having to pick a single scale, the Vietoris Rips (VR) filtration and the α-filtration take as input a point cloud and return a nested sequence of simplicial complexes indexed

by a scale parameter ε approximating the topology of the data across all possible scales.

Definition 2.2 (VR complex). Given a finite point cloud X in a metric space (M, d) and a non-negative real number ε R 0, the associated VR complex V Rε(X) is given by the vertex set X and the set of simplices S = {σ X | σ = , x, y σ : d(x, y) ε}.

A VR complex at ε consists of all simplices σ where all vertices x σ have distance of at most ε. For r r , we obtain the canonical inclusions ir,r (X): V Rr(X) , V Rr (X). For a simplex σ and a filtration F, we denote by its filtration value F(σ) the smallest ε such that σ Fε. The set of

VR complexes on X for all possible r R 0 together with the inclusions then form the VR filtration on X. For large point clouds, the VR filtration becomes computationally expensive due to its large number of simplices. In contrast, the α-filtration approximates the topology of a point cloud using far fewer simplices. Thus we will make use of αcomplexes in settings of ambient dimension lower than 4, where their construction is computationally feasible. For a complete discussion and definition of α-complexes, see Appendix B.

Boundary matrices We still need an algebraic representation of simplicial complexes that is capable of encoding the structure of the SC and enables extraction of the topological features: The boundary matrices Bk associated to an SC S store all structural information of the SC. The rows of Bk are indexed by the k-simplices of S and the columns are indexed by the (k + 1)-simplices.

Definition 2.3 (Boundary matrices). Let S be a simplicial complex and a total order on its vertices V . Then, the i-th face map in dimension n f n i : Sn Sn 1 is given by

f n i : {v0, v1, . . . , vn} 7 {v0, v1, . . . , bvi, . . . , vn}

with v0 v1 vn and bvi denoting the omission of vi. Now, the n-th boundary operator Bn : R[Sn+1] R[Sn] with R[Sn] being the real vector space over the basis Sn is given by Bn : σ 7

i=0 ( 1)if n+1 i (σ).

When lexicographically ordering the simplex basis, we can view Bn as a matrix. We call R[Sn] the space of n-chains. B0 is the vertex-edge incidence matrix of the associated graph consisting of the 0and 1-simplices of S and B1 is the edge-triangle incidence matrix of S

Betti Numbers and Persistent Homology We now turn to the notion of topological features and how to extract them. Homology is one of the main algebraic invariants to capture the shape of topological spaces and SC. The k-th homology module Hk(S) of an SC S with boundary operators Bk

Point-Level Topological Representation Learning on Point Clouds

Figure 2: PH sketch and TOPF pipeline applied to NALCN channelosome, a membrane protein (Kschonsak et al., 2022). Left: Bars represent life times of features (Grande & Schaub, 2023b). Centre left: Steps 1&2a, when computing persistent 1-homology, three classes are more prominent than the rest. Centre right: Step 2b: The selected homology generators. Right: Step 3: The projections of the generators into harmonic space are now each supported on one of the rings.

is defined as Hk(S) := ker Bk 1/ Im Bk. The generator or representative of a homology class is an element of the kernel ker Bk 1. In dimension 1, these are given by formal sums of 1-simplices forming closed loops in the SC. Importantly, the rank rk Hk(S) is called the k-th Betti number Bk of S. In dimension 0, B0 counts the number of connected components, B1 counts the number of loops around holes of the space, B2 counts the number of 3-dimensional voids with 2-dimensional boundary, and so on.

Given a filtration of SCs, we can track how the homology modules evolve as the simplicial complex grows. The mathematical formalisation, persistent homology, thus turns a point cloud via a simplicial filtration into an algebraic object summarising the topological features of the point cloud (Edelsbrunner & Harer, 2008).

The Hodge Laplacian and the Harmonic Space We need to relate the global characterisations of PH back to local properties of the point cloud. We will do so by using ideas and concepts from differential geometry and topology: The simplicial Hodge Laplacian is a discretisation and generalisation of the Hodge Laplace operator acting on differential forms of manifolds:

Definition 2.4 (Hodge Laplacian). Given a simplicial complex S with boundary operators Bk, we define the n-th Hodge Laplacian Ln : R[Sn] R[Sn] by setting

Ln := B n 1Bn 1 + Bn B n .

Theorem 2.5 (Hodge Decomposition (Lim, 2020; Schaub et al., 2021; Roddenberry et al., 2021)). For an SC S with boundary matrices (Bi) and Hodge Laplacians (Li), we have in every dimension k

R[Sk] = Im B k 1 | {z } gradient space

ker Lk | {z } harmonic space Im Bk | {z } curl space .

This, together with the fact that the k-th harmonic space is isomorphic to the k-th real-valued homology group

Algorithm 1 Topological Point Features (TOPF)

Input: Point cloud X Rn, maximum homology dimension d N, interpolation coeff. λ. 1. Compute persistent homology with generators in dim. k d. (Sec. 2: Betti Numbers & Persistent Homology) 2. Select set of significant features (bi, di, gi) with birth, death, and generator in F3 coordinates (See Step 2). 3. Embed gi into real space (1), and project into harmonic subspace (2) of SC at step dλ i b1 λ i or λdi + (1 λ)bi. 4. Normalise projections to ek i and compute F i k(x) := avgx σ(ek i l(σ)) for all points x X (3). Output: Features of x X

ker Lk = Hk(R) means that we can associate a unique harmonic representative to every homology class. Intuitively, this means that for every abstract global homology class of persistent homology at filtration step t from above we can now compute one unique harmonic representative in ker Lk that assigns every simplex a value based on how much it contributes to the homology class. Thus, the Hodge Laplacian is a gateway between the global topological features and the local properties of our SC. We discuss how this relates to the theory of differential forms and Hodge theory in the continuous case in Appendix B.2.

3. How to Compute Topological Point Features

In this section, we will combine the ideas and insights of the previous section to give a complete account of how to compute topological point features (TOPF). A pseudo-code version can be found in Algorithm 1 and an overview in Figure 1. We start with a finite point cloud X Rn.

Step 1: Computing the persistent homology First, we determine the most significant persistent homology classes which determine the shape of the point cloud. Doing this, we also extract the interesting scales of the data set. We will later use this to construct SCs to localise the global

Point-Level Topological Representation Learning on Point Clouds

homology features. Thus we first compute the persistent k-homology modules Pk including a set of homology representatives Rk of X using an α-filtration for n 3 and a VR filtration for n > 3. We use Z/3Z coefficients to be sensitive to simplex orientations. In case we have prior knowledge on the data set, we can choose a real number R R>0 and only compute the filtration up to the parameter R. In data sets like protein atom coordinates, this might be useful as we have prior knowledge on what constitutes the interesting scale, reducing computational complexity. See Figure 2 centre left for a PH diagram.

Step 2: Selecting the relevant topological features We now need to select the relevant persistent homology classes which carry the most important global information. The persistent homology Pk module in dimension k is given to us as a list of pairs of birth and death times (bk i , dk i ). We can assume these pairs are ordered in non-increasing order of the durations lk i = dk i bk i . We are interested in connected components, loops, cavities, etc. that persist over a long time, indicating that they are important for the shape of the point cloud. Distinguishing between the relevant and the irrelevant features is in general difficult and may depend on additional insights on the domain of application. In order to provide a heuristic which does not depend on any a-priori assumptions on the number of relevant features we pick the smallest quotient qk i := lk i+1/lk i > 0 as the point of cut-off Nk := arg mini qk i . The only underlying assumption of this approach is that the band of relevant features is separated from the noisy homological features by a drop in persistence. If this assumption is violated, the only possible way to do meaningful feature selection depends on application-specific domain knowledge. We found that our proposed heuristics work well across a large scale of applications. See Figure 2 left and centre for an illustration and Appendix G for more technical details and ways to improve and adapt the feature selection module of TOPF. We call the chosen k-homology classes with khomology generators f i k.

Step 3: Projecting the features into harmonic space and normalising In this step, we need to relate the global topology extracted in the previous step to the simplices which we will use to compute the local topological point features. Every selected feature f i k of the previous step comes with a birth time bi,k and a death time di,k. This means that the homology class f i k is present in every SC of the filtration between step (i.e. filtration value) bi,k and di,k and we could choose any of the SCs for the next step. Picking a small filtration value will lead to fewer simplices in the SC and thus to a very localised harmonic representative. Picking a large filtration value will lead to many simplices in the SC and thus to a very smooth and blurry harmonic representative with large support. Finding a middle ground

between these regimes returns optimal results. Given interpolation hyperparameter γ (0, 1), we will thus consider the simplicial complex Sti,k(X) at step ti,k := b1 γ i,k dγ i,k for k > 0 and at step ti,k := γdi,k for k = 0 of the simplicial filtration. At this point, the homology class f i k is still alive. We then consider the real vector space R[Sti,k k (X)] with formal basis consisting of the k-simplices of the SC Sti,k. From the persistent homology computation of the first step, we also obtain a generator of the feature f i k, consisting of a list Σi k of simplices ˆσj Sbi,k k and coefficients cj Z/3Z. We need to turn this formal sum of simplices with Z/3Z-coefficients into a vector in the real vector space R[Sti,k k (X)]: Let ι: Z/3Z be the map induced by the canonical inclusion of { 1, 0, 1} , R. We can now define an indicator vector ei k R[Sti,k k (X)] associated to the feature f i k (Cf. (De Silva & Vejdemo-Johansson, 2009)).

( ι(cj) ˆσj Σi k : σ = ˆσj 0 else . (1)

Empirically, ei k is a homology generator for real coefficients as well in the vast majority of cases, although our construction only guarantees for Bk 1ei k 0 mod 3. We discuss how to fix the rare case where this does not work in Appendix B.3 and now assume to work with a real homology representative ei k. While this homology representative lives in a real vector space, it is not unique, has a small support, and its value can differ largely even between close simplices. All of these problems can be solved by projecting the homology representative to the harmonic subspace ker Lk of R[Sti,k k (X)]. Rather than directly projecting ei k to the harmonic subspace, we make use of the Hodge decomposition theorem (Theorem 2.5) which allows us to solve computationally efficient least square problems:

ei k,curl := Bk arg min x R[Sk+1]

ei k ei k,grad Bkx 2

and get the harmonic representative ˆei k := ei k ei k,grad ei k,curl. (Cf. Figure 2 right for a visualisation.) Because homology representatives are gradient-free, ei k,grad = 0 and we only need to compute ei k,curl.

Step 4: Processing and aggregation at a point level In the previous step, we have computed a set of harmonic representatives of homology classes. Each harmonic representative assigns each simplex in the correspondings SCs a value. However, these simplices likely have no real-world meaning and the underlying simplicical complexes differ depending on the birth and death times of the homology classes. Hence in this step, we will collect the features on the point-level after performing some necessary preprocessing. Given a vector ˆei k indexed over simplices and a hyperparameter δ, we now construct ei k : Sti,k k (X) [0, 1]

Point-Level Topological Representation Learning on Point Clouds

ei k : σ 7 {|ˆei k(σ)|/(δ max σ S ti,k k (X) |ˆei k(σ )|), 1}

such that ˆei k is normalised to [0, 1], then the values of [0, δ] are mapped linearly to [0, 1] and everything above is sent to 1. We found empirically that a thresholding parameter of δ 0.07 works best across at the range of applications considered below. However, TOPF is not sensitive to small changes to δ because entries of ˆei k are concentrated around 0 (cf. Appendix E.1).

For every feature f i k in dimension k with processed simplicial feature vector ei k and simplicial complex Sti,k, we define the point-level feature map F k i : X R mapping from the initial point cloud X to R by setting

F k i : v 7

σk S ti,k k : v σk ei k(σk)

max(1, |{σk St k : v σk}|). (3)

For every point v, we can thus view the vector (F k i (v): f k i F) as a feature vector for v. We call this collection of features Topological Point Features (TOPF). (Cf. Figure 3 for an example).

Choosing Simplicial Weights The above discussed theory works analogously for weighted SCs. We discuss how TOPF employs weighted SCs to mitigate the influence of noise and heterogeneous sampling in Appendix H.

Theoretical Guarantees In the appendix in Theorem C.1 we give theoretical guarantees for when TOPF provably works on an idealised point cloud.

Homology, Cohomology, and Hodge Laplacian kernels Many methods of TDA employ cohomology instead of homology because of computational benefits. The TOPF would work with persistent cohomology instead of persistent homology, however the cohomology representatives will be associated to different harmonic representatives which we found to be less faithful of topological features. We will briefly elaborate on this relationship. Formally, the cochain space where cohomology lives is the dual of the chain space of homology. For finite-dimensional vector spaces with the standard inner product, there is a canonical isomorphism between the two spaces induced by sending a vector v to the linear extension of the map sending v to 1. Thus, we can identify the chains and cochains. We have the boundary maps Bi and the coboundary maps B i . By the universal coefficient theorem real-valued homology and cohomology are isomorphic, with

Hi = ker Bi/ Im Bi+1 = Hi = ker B i+1/ Im B i .

Thus, using the canonical isomorphism to the dual vector space, all elements of ker Bi are homology representatives and of ker B i+1 are cohomology representatives. The kernel of the Hodge Laplacian ker Li is the intersection of ker Bi and ker B i+1. Hence every element in ker Li = ker Bi ker B i+1 automatically is a homology and cohomology representative which is unique by matching dimensions. This gives us an explicit isomorphism between ker Li and the (co-)homology.

In step 3, we start with some arbitrary homology representative r in ker Bi. However, we want a harmonic representative h in ker Li = ker Bi ker B i+1 for the same homology class, i.e. with h = r c for c Im Bi+1 , the curl space quotiened out in homology. By the orthogonality of the Hodge decomposition, this c is given by the projection of r to the curl space .

Comparison with Topological Point Cloud Clustering Given a point cloud, TPCC ((Grande & Schaub, 2023a)) selects an ε-radius and builds the associated VR simplicial complex. It then computes the associated simplicial Hodge Laplacians and determines a basis of their 0-eigenvectors.

TPCC embeds the simplices bases on the eigenvector coordinates, performs subspace clustering on the embeddings and aggregates this information back to the point level to determine a final clustering. However, TPCC is (i) computationally expensive due to extensive eigenvector computations, (ii) depending on high-dimensional subspace clustering algorithms, which are prone to instabilities and errors, (iii) sensitive to the correct choice of hyperparameters, (iv) requiring the topological true features and noise to occur in different steps of the simplicial filtration, and it (v) solely focussed on clustering the points rather than extracting relevant point-level features. TOPF solves (i) by using homology representatives from persistent homology and projecting them into the harmonic space, which requires solving a computationally efficient least-squares problem instead of an eigenvector problem. Problem (ii) and (iv) are solved by selecting relevant persistent homology classes by the TOPF heuristic instead of considering and clustering the entire harmonic space of a given simplicial complex. Finally, challenge (iii) is partially addressed by the heuristics of step 3 and 4 and the general architecture being more robust to noise.

4. Experiments

In this section, we conduct experiments on real world and synthetic data, compare the clustering results with clustering by TPCC, other classical clustering algorithms, and other point features, and demonstrate the robustness of TOPF against noise. In Table 2, we perform an ablation study with respect to the harmonic projections of step 3 of TOPF.

Point-Level Topological Representation Learning on Point Clouds

Figure 3: TOPF on 3d real-world and synthetic point clouds. For every point, we highlight the largest corresponding topological feature, where colour stands for the different features and saturation for the value of the feature. (a): Atoms of mutated Cys123 of E. coli (Hidber et al., 2007). We added auxiliary points on the convex hull and considered 2-homology, to detect the protein pockets which are crucial for protein-environment interactions, cf. (Oda et al., 2024). (b): Atoms of NALCN channelosome (Kschonsak et al., 2022) display three distinct loops. (c): Points sampled in the state space of a Lorentz attractor. The two features correspond to the two lobes of the attractor. (d): Point cloud spaceship of our newly introduced topological clustering benchmark suite (See Appendix D). (e): Latent space of a VAE trained on image patches showing topological structure (See Figure 12 for details).

Topological Point Cloud Clustering Benchmark We introduce the topological clustering benchmark suite (Appendix D) and report running times and the accuracies of clustering based on TOPF and other methods and point embeddings, see Table 1. We see that spectral clustering on

TOPF vectors (listed as TOPF) outperforms all classical clustering algorithms on all but one dataset by a wide margin. We also see that TOPF closely matches the performance of the only other higher-order topological clustering algorithm, TPCC on two datasets with clear topological features, whereas TOPF outperforms TPCC on datasets with more complex structure. In addition, TOPF has a consistently lower running time with better scaling for the more complex datasets, while not requiring prior knowledge on the best topological scale. Furthermore, TOPF outperforms WSDesc (Li et al., 2022) pretrained on the 3DMatch dataset (Zeng et al., 2017) and DGCNN (Wang et al., 2019) pretrained on the Shape Net Part data set. This highlights that neural network architectures like DGCNN need specific applicationspecific training data for good performance that is simply not available in all cases. TOPF is an unsupervised method for extracting interpretable topological features founded in algebraic topology and differential geometry.

Figure 4: TOPF on a high-dimensional point cloud. We used TOPF on 6500 points sampled from the 24-dimensional conformation space of cyclooctane (Martin & Watson, 2011). We show the ISOMAP projection from 24 dimensions. Top: Three features automatically selected by TOPF. Bottom: TOPF-Clustering for 3 and 4 clusters. TOPF can correctly cluster similar points according to their topological function. Four clusters, TOPF can even identify the anomalous points (blue) violating the manifold structure.

The performance differences of TPCC and TOPF on different point clouds of the TCBS comes mainly down to how difficult and robust the topological structure encoded in the datasets is. In particular, 2Spheres2Circles and Sphere In Circle are directly sampled from unions of manifolds without any noise and sufficient sampling density and TPCC and TOPF perform similarly. In comparison, in Ellipses and Spaceship, the topological features have shorter life times and live at different scales. There is no single scale containing all features, and for most holes/voids, there is not even a scale whithout noisy holes with small persistence present in the filtration. TOPF can deal with these situations, whereas TPCC cannot. For the difference on Halved Circle, we posit that this is due to the better feature aggregation of TOPF. TPCC requires to perform subspace clustering on an harmonic edge embedding, which is both unstable and sensitive to parameters, which we posit TPCC has no information to choose correctly in this setting.

On 4Circles+Grid, TOPF performs worst out of the point clouds of TCBS. Figure 8 and Figure 9 show the ground truth and TOPFclustering on 4Circles+Grid. In short,

TOPF chooses suboptimal scales which are not well-enough connected. Figure 10 supports this interpretation, increasing the interpolation hyperparameter λ improves TOPF performance on this dataset. We have used a fixed set of hyperparameters for all experiments on the TCBS for transparency.

Feature Generation In Figure 3, we show qualitatively that TOPF constructs meaningful topological features on

Point-Level Topological Representation Learning on Point Clouds

Table 1: Quantitative performance comparison of clustering with TOPF and other features/clustering algorithms. Four 2d and three 3d data sets of the topological clustering benchmark suite (Appendix D, cf. Figure 8 for ground truth and Figure 9 for labels by TOPF). We ran each algorithm 20 times and list the mean adjusted rand index (ARI) with standard deviation σ and mean running time. We omit σ for algorithms with σ = 0. Spectral clustering on TOPF vectors consistently outperforms or almost matches the other algorithms while having significantly better run time than the second best performing algorithm TPCC. Spectral Clustering (SC), DBSCAN (applied on the points) are standard clustering algorithms, TOMATO is a topological clustering algorithm (Chazal et al., 2013), Geo clusters using 12-dimensional point geometric features extracted by pgeof. Weakly Supervised 3D Local Descriptor Learning for Point Cloud Registration (WSD (Li et al., 2022)) is pretrained on 3DMatch data and produces 32-dimensional feature vectors. DGCNN (Wang et al., 2019) is pretrained on Shape Net Part segmentation, with the parameters determined by a hyperparameter search. PAM+ISOMAP denotes k-medoids clustering on the ISOMAP embedding. We highlight all ARI scores within 0.05 of the best ARI score.

TOPF TPCC SC DBSCAN Ag C To MATo Geo PN WSD DGCNN PAM+ISOMAP

4spheres ARI 0.81 0.52 0.17 0.37 0.00 0.45 0.32 0.20 0.30 0.13 0.29 0.03 0.39 time (s) 13.9 23.3 0.2 <0.1 <0.1 <0.1 0.2 <0.1 <0.1 <0.1 0.2

Ellipses ARI 0.95 0.47 0.04 0.25 0.19 0.52 0.29 0.81 0.50 0.35 0.70 0.00 0.56 time (s) 13.5 14.4 0.1 <0.1 <0.1 <0.1 0.1 <0.1 <0.1 <0.1 <0.1

4Circles+Grid ARI 0.70 0.39 0.04 0.90 0.92 0.89 0.82 0.41 0.55 0.06 0.75 0 0.78 time (s) 13.7 28.5 0.5 <0.1 <0.1 <0.1 0.3 <0.1 <0.1 <0.1 0.3

Halved Circle ARI 0.71 0.18 0.12 0.24 0.00 0.20 0.16 0.08 0.36 0.00 0.39 0.07 0.13 time (s) 13.6 14.3 0.1 <0.1 <0.1 <0.1 0.1 <0.1 <0.1 <0.1 <0.1

2Spheres2Circle ARI 0.94 0.97 0.01 0.70 0.00 0.51 0.87 0.12 0.55 0.95 0.84 0.18 0.63 time (s) 20.7 1662.2 1.6 <0.1 0.3 <0.1 0.9 <0.1 <0.1 0.2 6.5

Spherein Circle ARI 0.97 0.98 <0.1 0.34 0.00 0.29 0.06 0.69 0.39 0.46 0.76 0.06 0.94 time (s) 14.5 8.0 <0.1 <0.1 <0.1 <0.1 0.08 <0.1 <0.1 <0.1 <0.1

Spaceship ARI 0.92 0.56 0.03 0.28 0.26 0.47 0.30 0.87 0.41 0.76 0.79 0.00 0.58 time (s) 15.5 341.8 16.7 <0.1 <0.1 <0.1 0.2 <0.1 <0.1 <0.1 0.1

mean ARI 0.86 0.58 0.44 0.16 0.48 0.40 0.45 0.44 0.39 0.64 0.58 time (s) 15.1 298.9 0.4 <0.1 <0.1 <0.1 0.3 <0.1 <0.1 <0.1 1.0

synthetic data and data sets from Biology and Physics, corresponding to for example rings and pockets in proteins or trajectories around different attractors in dynamical systems, (for individual heatmaps see Figure 14). In Figure 4, we show that TOPF works for the high-dimensional conformation space of cyclooctane.

Robustness Against Noise, Downsampling, and Sampling Heterogeneity We have evaluated the robustness of TOPF against Gaussian noise on the dataset introduced in (Grande & Schaub, 2023a) and compared the results against

TPCC, Spectral Clustering, Graph Spectral Clustering on the graph constructed by TPCC, and against k-means in Figure 6. We have also analysed the robustness of TOPF against the addition of outliers in Figure 6. We see that TOPF performs well in both cases, underlining our claim of robustness. Finally, in Figure 5, we show that TOPF performs well under sampling heterogeneity and under downsampling to sparse point clouds, while comparing TOPF to the other baselines. We posit that this is due to the robustness of α-complexes, the chosen simplicial weights, and the general robustness of features extracted by persistent homology. In our eyes, the above result showcases the strength of our approach incorporating ideas from topological data analysis and differential geometry, given that heterogeneous sampling and sparse point clouds constitute a major challenge for classical geometry learning.

Embedding Space of Variational Autoencoders and Highdimensional spaces Variational autoencoders (VAE) are unsupervised neural networks that learn to extract a lowdimensional embedding of a high-dimensional data set. We have trained 2 VAEs on 518 image patches which were sampled along a topological structure on a larger image with a latent space dimension of 3 and of 16. We show that running TOPF on the two latent spaces as well as directly on the 8748(!)-dimensional base space can recover the topological structure underlying the capturing process of the training set. We visualise the feature vectors on the latent space in Figure 3 (e) and Figure 13 and provide additional details in Figure 12 and Appendix F.

In Figure 4, we use TOPF on a high-dimensional input point cloud representing the conformation space of cyclooctane (Martin & Watson, 2011). We see in the ISOMAP projections of Figure 4 that TOPF extracts reasonable features representing the topology of the conformation space. This shows the feasability of TOPF on very high-dimensional spaces. Note that TOPF even recovers the anomalous points violating the manifold hypothesis, as done in (Stolz et al., 2020).

5. Discussion

Limitations TOPF can by design only produce meaningful output on point clouds with a topological structure

Point-Level Topological Representation Learning on Point Clouds

Downsampling

TOPF Spectral Clustering

KMeans OPTICS

DBSCAN Agglomerative Clustering Mean Shift WSDesc Geo

100 10 1 downsampling factor

Heterogeneous Sampling

Figure 5: Performance of TOPF Clustering under downsampling and heterogeneous sampling on 4spheres. Left: We perform random downsampling to test the performance of TOPF on sparse point clouds. TOPF maintains strong performance until 100 points, which proves that

TOPF performs well on very sparse point clouds. For experiments on all datasets of TCBS see Figure 15. Right: We investigate the performance under heterogeneous sampling. We divide the point cloud into halves cutting up topological structures and downsample one of the halves with a factor of up to 100. The experiments show that TOPF still outperforms the other methods under a sampling irregularity of 1 : 10, indicating a strong robustness against heterogeneous sampling. For experiments on more datasets of TCBS see Figure 16.

0 100 200 300 400 500 600 No. Outliers

0.0 0.2 0.4 0.6 0.8 1.0 Noise

Method TOPF untuned TOPF tuned

KMeans Spectral Clustering

TPCC Spectral on SC

Figure 6: Performance of TOPF Clustering with noise/outliers on Spherein Circle, 95% CI. The radii are 3 and 1, the mean distance to the closest neighbour is 0.22. Left: We add i.i.d. Gaussian noise to every point with standard deviation indicated by the noise parameter. Even when compared with TPCC on a data set specifically crafted for TPCC, TOPF requires significantly less information and delivers almost equal performance. Tuned for datasets with a high noise, the TOPF outperform TPCC. Right: We add outliers with same standard deviation as the point cloud. We then measure ARI restricted to original points. Compared with TPCC on a data set specifically crafted for TPCC, TOPF requires significantly less information and delivers matching to superior performance.

Table 2: Ablation study for the harmonic projection in step 3. We see that in the more complex datasets, performance without the harmonic projection drops significantly. Only 2Spheres2Circles and Spherein Circle are unions of simple manifolds without noise and thus TOPF can recover meaningful features even without harmonic projection.

Dataset ARI TOPF ARI TOPF without harmonic projection

4spheres 0.81 0.08 Ellipses 0.95 0.53 4Circles+Grid 0.70 0.11 Halved Circle 0.71 0.15

2Spheres2Circles 0.94 0.95 Spherein Circle 0.97 1.00 Spaceship 0.92 0.49

mean 0.86 0.47

quantifiable by persistent homology. In practice it is thus desirable to combine TOPF with some geometric or other point-level feature extractor. As TOPF relies on the computation of persistent homology, its runtime increases on very large point clouds, especially in higher dimensions where α-filtrations are computationally infeasible. However, subsampling, either randomly or using landmarks, usually preserves relevant topological features while improving run time (Perea, 2020). Finally, selection of the relevant features is a very hard problem. While our proposed heuristics work well across a variety of domains and application scenarios, only domainand problem-specific knowledge makes correct feature selection feasible.

Future Work The integration of higher-order TOPF features into ML pipelines that require point-level features potentially leads to many new interesting insights across the domains of biology, drug design, graph learning and computer vision. Furthermore, efficient computation of simplicial weights leading to the provably most faithful topological point features is an exciting open problem.

Conclusion We introduced point-level features TOPF founded on algebraic topology relating global structural features to local information. We gave theoretical guarantees for the correctness of their construction and evaluated them quantitatively and qualitatively on synthetic and realworld data sets. Finally, we introduced the novel topological clustering benchmark suite and showed that clustering using TOPF outperforms other available clustering methods and features extractors. In particular, we showed that TOPF performs well both on very sparse datasets and on datasets under heterogeneous sampling.

Point-Level Topological Representation Learning on Point Clouds

Acknowledgments

We thank the anonymous referees for the detailed discussions and helpful feedback.

VPG acknowledges funding by the German Research Council DFG within Research Training Group 2236 UNRAVEL. MTS acknowledges partial funding by the Ministry of Culture and Science (MKW) of the German State of North Rhine Westphalia ( NRW Rückkehrprogramm ) and the European Union (ERC, HIGH-HOPES, 101039827). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

Adams, H., Emerson, T., Kirby, M., Neville, R., Peterson, C., Shipman, P., Chepushtanova, S., Hanson, E., Motta, F., and Ziegelmeier, L. Persistence images: A stable vector representation of persistent homology. Journal of Machine Learning Research, 18(8):1 35, 2017.

Arafat, N. A., Basu, D., Gel, Y., and Chen, Y. When witnesses defend: A witness graph topological layer for adversarial graph learning. ar Xiv preprint ar Xiv:2409.14161, 2024.

Atiyah, M. K-theory. CRC press, 1989.

Basu, S. and Cox, N. Harmonic persistent homology. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pp. 1112 1123. IEEE, 2022.

Bauer, U. Ripser: efficient computation of vietoris rips persistence barcodes. Journal of Applied and Computational Topology, 5(3):391 423, 2021.

Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373 1396, 2003.

Benjamin, K., Mukta, L., Moryoussef, G., Uren, C., Harrington, H. A., Tillmann, U., and Barbensi, A. Homology of homologous knotted proteins. Journal of the Royal Society Interface, 20(201):20220727, 2023.

Bousfield, A. K. The localization of spaces with respect to homology. Topology, 14(2):133 150, 1975. doi: https: //doi.org/10.1016/0040-9383(75)90023-3.

Bredon, G., Ewing, J., Gehring, F., and Halmos, P. Topology and Geometry. Graduate Texts in Mathematics. Springer, New York, 1993.

Bubenik, P. et al. Statistical topological data analysis using persistence landscapes. J. Mach. Learn. Res., 16(1):77 102, 2015.

Carlsson, G. and Vejdemo-Johansson, M. Topological Data Analysis with Applications. Cambridge University Press, 2021.

Carlsson, G., Zomorodian, A., Collins, A., and Guibas, L. Persistence barcodes for shapes. In Proceedings of the 2004 Eurographics/ACM SIGGRAPH symposium on Geometry processing, pp. 124 135, 2004.

Carrière, M., Oudot, S. Y., and Ovsjanikov, M. Stable topological signatures for points on 3d shapes. In Computer graphics forum, volume 34, pp. 1 12. Wiley Online Library, 2015.

Carriere, M., Chazal, F., Glisse, M., Ike, Y., Kannan, H., and Umeda, Y. Optimizing persistent homology based functions. In International conference on machine learning, pp. 1294 1303. PMLR, 2021.

Carriere, M., Theveneau, M., and Lacombe, T. Diffeomorphic interpolation for efficient persistence-based topological optimization. ar Xiv preprint ar Xiv:2405.18820, 2024.

Chaudhry, C., Horwich, A. L., Brunger, A. T., and Adams, P. D. Exploring the structural dynamics of the e. coli chaperonin groel using translation-libration-screw crystallographic refinement of intermediate states. Journal of molecular biology, 342(1):229 245, 2004.

Chazal, F. and Michel, B. An introduction to topological data analysis: fundamental and practical aspects for data scientists. Frontiers in artificial intelligence, 4:108, 2021.

Chazal, F., Guibas, L. J., Oudot, S. Y., and Skraba, P. Persistence-based clustering in riemannian manifolds. J. ACM, 60(6), nov 2013.

Chen, Y.-C. and Meil a, M. The decomposition of the higher-order homology embedding constructed from the k-laplacian. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 15695 15709. Curran Associates, Inc., 2021.

Point-Level Topological Representation Learning on Point Clouds

Chen, Y.-C., Meil a, M., and Kevrekidis, I. G. Helmholtzian eigenmap: Topological feature discovery & edge flow learning from point cloud data. ar Xiv preprint ar Xiv:2103.07626, 2021.

Coifman, R. R. and Lafon, S. Diffusion maps. Applied and computational harmonic analysis, 21(1):5 30, 2006.

ˇCufar, M. and Virk, Ž. Fast computation of persistent homology representatives with involuted persistent homology. Foundations of Data Science, 5(4):466 479, 2023.

De Silva, V. and Vejdemo-Johansson, M. Persistent cohomology and circular coordinates. In Proceedings of the twenty-fifth annual symposium on Computational geometry, pp. 227 236, 2009.

De Silva, V., Morozov, D., and Vejdemo-Johansson, M. Dualities in persistent (co) homology. Inverse Problems, 27(12):124003, 2011.

Dedekind, R. Was sind und was sollen die Zahlen? Verlag Friedrich Vieweg und Sohn, Braunschweig, 1888.

Delaunay, B. et al. Sur la sphere vide. Izv. Akad. Nauk SSSR, Otdelenie Matematicheskii i Estestvennyka Nauk, 7(793-800):1 2, 1934.

Dodziuk, J. Finite-difference approach to the hodge theory of harmonic forms. American Journal of Mathematics, 98(1):79 104, 1976.

Ebli, S. and Spreemann, G. A notion of harmonic clustering in simplicial complexes. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pp. 1083 1090, 2019. doi: 10.1109/ ICMLA.2019.00182.

Edelsbrunner, H. Alpha shapes-a survey. In Tessellations in the sciences: Virtues, techniques and applications of geometric tilings. 2011.

Edelsbrunner, H. and Harer, J. Persistent homology a survey. Contemporary mathematics, 453(26):257 282, 2008.

Eilenberg, S. and Mac Lane, S. General theory of natural equivalences. Transactions of the American Mathematical Society, 58:231 294, 1945.

Fefferman, C., Mitter, S., and Narayanan, H. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983 1049, Oct 2016. doi: 10.1090/jams/852.

Fong, D. C.-L. and Saunders, M. Lsmr: An iterative algorithm for sparse least-squares problems. SIAM Journal on Scientific Computing, 33(5):2950 2971, 2011.

Grande, V. P. and Schaub, M. T. Topological point cloud clustering. In Proceedings of the 40th International Coference on Machine Learning, ICML 23, 2023a.

Grande, V. P. and Schaub, M. T. Non-isotropic persistent homology: Leveraging the metric dependency of ph. In Learning on Graphs Conference, pp. 17 1. PMLR, 2023b.

Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855 864, 2016.

Gurnari, D., Guzmán-Sáenz, A., Utro, F., Bose, A., Basu, S., and Parida, L. Probing omics data via harmonic persistent homology. ar Xiv preprint ar Xiv:2311.06357, 2023.

Hausdorff, F. Grundzüge einer theorie der geordneten mengen. Mathematische Annalen, 65:435 505, 1908.

Hidber, E., Brownie, E. R., Hayakawa, K., and Fraser, M. E. Participation of cys123α of escherichia coli succinyl-coa synthetase in catalysis. Acta Crystallographica Section D: Biological Crystallography, 63(8):876 884, 2007.

Hilbert, D. Grundlagen der Geometrie. Wissenschaft und Hypothese. B. G. Teubner, Leipzig, 1899.

Hiraoka, Y., Imoto, Y., Lacombe, T., Meehan, K., and Yachimura, T. Topological node2vec: Enhanced graph embedding via persistent homology. Journal of Machine Learning Research, 25(134):1 26, 2024.

Hu, S.-t. Homotopy theory. Academic press, 1959.

Kerber, M. and Edelsbrunner, H. 3d kinetic alpha complexes and their implementation. In 2013 Proceedings of the Fifteenth Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 70 77. SIAM, 2013.

Kerber, M. and Russold, F. Graphcode: Learning from multiparameter persistent homology using graph neural networks. ar Xiv preprint ar Xiv:2405.14302, 2024.

Kschonsak, M., Chua, H. C., Weidling, C., Chakouri, N., Noland, C. L., Schott, K., Chang, T., Tam, C., Patel, N., Arthur, C. P., Leitner, A., Ben-Johny, M., Ciferri, C., Pless, S. A., and Payandeh, J. Structural architecture of the human nalcn channelosome. Nature, 603(7899): 180 186, Mar 2022. doi: 10.1038/s41586-021-04313-5.

Lawson, P., Sholl, A. B., Brown, J. Q., Fasy, B. T., and Wenk, C. Persistent homology for the quantitative evaluation of architectural features in prostate cancer histology. Scientific reports, 9(1):1139, 2019.

Point-Level Topological Representation Learning on Point Clouds

Li, L., Fu, H., and Ovsjanikov, M. WSDesc: Weakly supervised 3d local descriptor learning for point cloud registration. IEEE Transactions on Visualization and Computer Graphics, pp. 1 1, 2022. doi: 10.1109/TVCG.2022. 3160005.

Lim, L.-H. Hodge laplacians on graphs. SIAM Review, 62 (3):685 715, 2020.

Lurie, J. Stable infinity categories. ar Xiv preprint math/0608228, 2006.

Ma, Y. and Fu, Y. Manifold learning theory and applications, volume 434. CRC press Boca Raton, 2012.

Martin, S. and Watson, J.-P. Non-manifold surface reconstruction from high-dimensional point cloud data. Computational Geometry, 44(8):427 441, 2011.

Mc Innes, L., Healy, J., and Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. ar Xiv preprint ar Xiv:1802.03426, 2018.

Mémoli, F. Gromov wasserstein distances and the metric approach to object matching. Foundations of computational mathematics, 11:417 487, 2011.

Mémoli, F., Wan, Z., and Wang, Y. Persistent laplacians: Properties, algorithms and implications. SIAM Journal on Mathematics of Data Science, 4(2):858 884, 2022.

Moor, M., Horn, M., Rieck, B., and Borgwardt, K. Topological autoencoders. In International conference on machine learning, pp. 7045 7054. PMLR, 2020.

Munch, E. A user s guide to topological data analysis. Journal of Learning Analytics, 4(2):47 61, 2017.

Murphy, K. P. Probabilistic Machine Learning: An introduction. MIT Press, 2022. URL probml.ai.

Noether, E. Ableitung der elementarteilertheorie aus der gruppentheorie, nachrichten der 27 januar 1925. Jahresbericht Deutschen Math. Verein.(2. Abteilung), 34:104, 1926.

Oda, H., Kida, M., Nakata, Y., and Kurihara, H. Novel definition and quantitative analysis of branch structure with topological data analysis. ar Xiv preprint ar Xiv:2402.07436, 2024.

Paik, T. and Park, J. Circular coordinates for density-robust analysis. ar Xiv preprint ar Xiv:2301.12742, 2023.

Perea, J. A. Sparse circular coordinates via principal Zbundles. In Baas, N. A., Carlsson, G. E., Quick, G., Szymik, M., and Thaule, M. (eds.), Topological Data Analysis, pp. 435 458, Cham, 2020. Springer International Publishing.

Perea, J. A., Scoccola, L., and Tralie, C. J. Dreimac: Dimensionality reduction with eilenberg-maclane coordinates. Journal of Open Source Software, 8(91): 5791, 2023. doi: 10.21105/joss.05791. URL https: //doi.org/10.21105/joss.05791.

Poincaré, H. Analysis situs. J. de l Ecole Poly., 1, 1895.

Quillen, D. G. Homotopical Algebra, volume 43 of Lecture Notes in Mathematics. Springer, Berlin, 1967.

Roddenberry, T. M., Glaze, N., and Segarra, S. Principled simplicial neural networks for trajectory prediction. In International Conference on Machine Learning, pp. 9020 9029. PMLR, 2021.

Schaub, M. T., Benson, A. R., Horn, P., Lippner, G., and Jadbabaie, A. Random walks on simplicial complexes and the normalized hodge 1-laplacian. SIAM Review, 62 (2):353 391, 2020.

Schaub, M. T., Zhu, Y., Seby, J.-B., Roddenberry, T. M., and Segarra, S. Signal processing on higher-order networks: Livin on the edge... and beyond. Signal Processing, 187: 108149, 2021. doi: https://doi.org/10.1016/j.sigpro.2021. 108149.

Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888 905, 2000.

Singh, G., Mémoli, F., and Carlsson, G. Topological methods for the analysis of high dimensional data sets and 3d object recognition. Eurographics Symposium on Point Based Graphics, pp. 10, 2007.

Stolz, B. J., Tanner, J., Harrington, H. A., and Nanda, V. Geometric anomaly detection in data. Proceedings of the National Academy of Sciences, 117(33):19664 19669, 2020.

Tenenbaum, J. B., Silva, V. d., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319 2323, 2000.

The GUDHI Project. GUDHI User and Reference Manual. GUDHI Editorial Board, 2015.

Turner, K., Mukherjee, S., and Boyer, D. M. Persistent homology transform for modeling shapes and surfaces. Information and Inference: A Journal of the IMA, 3(4): 310 344, 2014.

Wang, Y., Sun, Y., Liu, Z., Sarma, S. E., Bronstein, M. M., and Solomon, J. M. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (tog), 38 (5):1 12, 2019.

Point-Level Topological Representation Learning on Point Clouds

Wasserman, L. Topological data analysis. Annual Review of Statistics and Its Application, 5(1):501 532, 2018.

Wee, J., Chen, J., Xia, K., and Wei, G.-W. Integration of persistent laplacian and pre-trained transformer for protein solubility changes upon mutation. Computers in Biology and Medicine, pp. 107918, 2024.

Xin, C., Mukherjee, S., Samaga, S. N., and Dey, T. K. Gril: A 2-parameter persistence based vectorization for machine learning. In Topological, Algebraic and Geometric Learning Workshops 2023, pp. 313 333. PMLR, 2023.

Zeng, A., Song, S., Nießner, M., Fisher, M., Xiao, J., and Funkhouser, T. 3dmatch: Learning local geometric descriptors from rgb-d reconstructions. In CVPR, 2017.

ˇCufar, M. Ripserer.jl: flexible and efficient persistent homology computation in julia. Journal of Open Source Software, 5(54):2614, 2020. doi: 10.21105/joss.02614. URL https://doi.org/10.21105/joss.02614.

Point-Level Topological Representation Learning on Point Clouds

A. Extended Background

A brief history of topology and machine learning Algebraic topology is a discipline of Mathematics dating back roughly to the late 19th century (Poincaré, 1895). Starting with Henri Poincaré and continuing in the early 20th century, the mathematical community became interested in developing a framework to capture the global shapes of manifolds and topological spaces in concise algebraic terms. This development was partly made possible by the push towards a formalisation of mathematics and analysis, in particular, which took place inside the mathematical community in the 1800 s and early 1900 s (e.g. (Dedekind, 1888; Hilbert, 1899; Hausdorff, 1908)). The axiomatisation of analysis in the early 20th

century is an important result of this process. Another key step towards modern algebraic topology was the development of homological algebra which was led by the insights of Emmy Noether (Noether, 1926). The main idea was to replace the Betti numbers by the more structured and powerful homology groups. Over the course of the last 100 years, branching into many sub-areas like low-dimensional topology, differential topology, K-theory or homotopy theory (Atiyah, 1989; Hu, 1959), algebraic topology has resolved many of the important questions and provides a comprehensive tool-box for the study of topological spaces. These achievements were tied to an abstraction and generalisation of concepts: topological spaces turned into spectra, diffeomorphisms to homotopy equivalences and later weak equivalences, and Topologists turned to category theory (Eilenberg & Mac Lane, 1945), model categories (Bousfield, 1975) and recently -categories (Lurie, 2006) as the language of choice.

The 21st century saw the advent and rise of topological data analysis (TDA, (Bubenik et al., 2015; Chazal & Michel, 2021)). In short, mathematicians realised that the same notions of shape and topology that their predecessors carefully defined a century earlier were now characterising the difference between healthy and unhealthy tissue, between normal and abnormal behaviour protein behaviour, or more general between different categories in their complex data sets.

Topological data analysis is closely connected to distanceand density-based methods for data analysis like DBSCAN or tools from manifold learning like ISOMAP (Tenenbaum et al., 2000) or UMAP (Mc Innes et al., 2018). Persistent homology captures the shape of a point cloud across all scales and can be captured in persistence barcodes (Carlsson et al., 2004), persistent landscapes (Bubenik et al., 2015) or persistence images (Adams et al., 2017). In multiparameter persistent homology, an additional filtration parameter like density is included, allowing for stronger representations like graphcodes (Kerber & Russold, 2024) or GRIL (Xin et al., 2023). The persistent homology transform computes persistent homology over a height filtration across different directions in the data set, further uncovering geometric properties of the dataset (Turner et al., 2014). The mapper graph is a tool to visualise high dimensional point clouds with graphs using the (persistent) homology of slices according to some generalised heigh function (Singh et al., 2007).

Related Work The intersection of topological data analysis, topological signal processing and geometry processing has many interesting related developments in the past few years. On the side of homology and TDA, the authors in (De Silva & Vejdemo-Johansson, 2009) and (Perea, 2020) use harmonic cohomology representatives to reparametrise point clouds based on circular coordinates. This implicitly assumes that the underlying structure of the point cloud is amenable to such a characterization. Although circular coordinates are orthogonal to the core goal of TOPF, the approaches share many key ideas and insights. In (Basu & Cox, 2022; Gurnari et al., 2023), the authors develop and use harmonic persistent homology and provide a way to pool features to the point-level. However, their focus is not on providing robust topological point features and their approach includes no tunable homology feature selection across dimensions, no support for weighted simplicial complexes, and they only construct the simplicial complex at birth. In their paper on topological mode analysis, (Chazal et al., 2013) use persistent homology to cluster point clouds. However, they only consider 0-dimensional homology to base the clustering on densities and there is no clear way to generalise this to higher dimensions.

On the more geometric-centred side, (Ebli & Spreemann, 2019) already provide a notion of harmonic clustering on simplices, (Chen & Meil a, 2021; Chen et al., 2021) analyse the notion of geometry and topology encoded in the Hodge Laplacian and its relation to homology decompositions, (Schaub et al., 2020) study the normalised and weighted Hodge Laplacian in the context of random walks, and (Grande & Schaub, 2023a) use the harmonic space of the Hodge Laplacians to cluster point clouds respecting topology. Finally, a persistent variant of the Hodge Laplacian is used to study filtrations of simplicial complexes (Mémoli et al., 2022).

There are other constructions of isometry-invariant local shape descriptors, as for example proposed in (Mémoli, 2011). However, their aim is not relate the global topology back to the local features.

In (Moor et al., 2020), the authors basically set up a topological loss functions that ensures that topological features of the original point cloud are preserved in the latent space of the autoencoder. This is different from the approach TOPF takes, as

Point-Level Topological Representation Learning on Point Clouds

TOPF tries to detect and extract topological features. Similar things can be said about topological node-2-vec (Hiraoka et al., 2024) and some experiments in (Carriere et al., 2021) or some experiments of (Carriere et al., 2024): They all try to come up with ways to preserve global topological structure while embedding a point cloud, while we on the other hand construct the embedding based on the relationship between the points and the global topological structure.

On graphs, (Arafat et al., 2024) construct topological representations of graphs to study adversarial graph learning. They construct a local topological features for nodes and global topological features for the entire graph. In contrast to our approach, they do not relate the global topological features back to the local level, but rather consider the local topology on the point level and the single global topological summary separately. Furthermore when constructing their features, they use vision transformers on the persistence images of their witness filtrations, which does not allow for the same amount of interpretability as TOPF does.

In (Grande & Schaub, 2023a), the authors have introduced TPCC, the first method to cluster a point cloud based on the higher-order topological features encoded in the data set. However, TPCC is (i) computationally expensive due to extensive eigenvector computations, (ii) depending on high-dimensional subspace clustering algorithms, which are prone to instabilities and errors, (iii) sensitive to the correct choice of hyperparameters, (iv) requiring the topological true features and noise to occur in different steps of the simplicial filtration, and it (v) solely focussed on clustering the points rather than extracting relevant node-level features. This paper solves all the above by completely revamping the TPCC pipeline, introducing several new ideas from applied algebraic topology and differential geometry. The core insight is: When you have the time to compute persistent homology with generators on a data set, you get the topological node features with similar computational effort.

B. Theoretical Considerations

B.1. More details on VR and α-filtrations

Vietoris Rips complexes are easy to define, approximate the topological properties of a point cloud across all scales and computationally easy to implement. However for moderately large r, the associated VR complex contains a large number of simplices up to |X| n+1 n-simplices for large enough r leading to poor computational performance for any downstream task on some large point clouds. One way to see this is the following: After adding the first edge that connects two components or the final simplex that fills a hole in the simplicial complex the VR complex keeps adding more and more simplices in the same area that keep the topology unchanged. One way to mitigate this problem is to pre-compute a set of simplices that are able to express the entire topology of the point cloud. For a point cloud X Rn, the α-filtration consists of the intersection of the simplicial complexes of the VR filtration on X with the (higher-dimensional) Delaunay triangulation of X in R. Due to algorithmic reasons, the filtration value of a simplex is then related to the radius of the circumscribed sphere instead of the maximum pair-wise distance of vertices, where the filtration value αX(σ) of a simplex σ is given by the minimum ε such that σ is contained in αε(X), i.e. αX(σ) := inf{ε R : σ αε(V )}. This reduces the number of required simplices across all dimensions to O(|X| n/2 ). However, the Delaunay triangulation becomes computationally infeasible for larger n.

Definition B.1 (n-dimensional Delaunay triangulation). Given a set of vertices V Rn, a Delaunay triangulation DT(V ) is a triangulation of V such that for any n-simplex σn DT(V ) the interior of the circum-hypersphere of σn contains no point of DT(V ). A triangulation of V is a SC S with vertex set V such that its geometric realisation covers the convex hull of V hull(V ) = |S| and we have for any two simplices σ, σ that the intersection of geometric realisations |σ| |σ | is either empty or the geometric realisation |ˆσ| of a common sub-simplex ˆσ σ, σ .

If V is in general position, the Delaunay triangulation is unique and guaranteed to exist (Delaunay et al., 1934). We will now first introduce a slightly simpler version of the α-filtration, the α -filtration.

Definition B.2 (α -complex of a point cloud). Given a finite point cloud X in real space Rn, the α -complex α ε(X) is the subset of the n-dimensional Delaunay triangulation DT(X) consisting of all k-simplices σk DT(X) with a radius r of its circumscribed (k 1)-sphere with r ε for all k n.

The α and α-filtration share the same set of simplices and agree on the filtration values of the simplices on all topdimensional simplices and all other simplices which are Gabriel simplices.

Definition B.3 (Gabriel Simplex). Given a set of vertices V Rn and its Delaunay triangulation DT(V ). A simplex σ DT(V ) is called Gabriel if there exists no point v V such that v is contained in the interior of the circumsphere of σ.

Point-Level Topological Representation Learning on Point Clouds

Definition B.4 (α-complex of a point cloud, (Kerber & Edelsbrunner, 2013; Edelsbrunner, 2011)). Given a finite point cloud X in real space Rn, the α-complex αε(X) is the α -complex of V , except that the filtration value α(σ) of all non-Gabriel simplices σ with associated points X in the interior of the circumsphere of σ is given by minx X α(σ {x}).

We note that due to the definition of the Delaunay triangulation, all n-simplices are Gabriel simplices and hence this is well-defined. This is an equivalent formulation of the original definition, which was for example shown in (Kerber & Edelsbrunner, 2013). We chose to go with the above formulation, as it is the form used in implementations of α-filtrations.

We note two things: (i) The α -filtration and α-filtration have fewer simplices than the VR filtration and (ii) the filtration values of individual simplices differ between the α , the α and the VR filtration. In particular, the order of the filtration values can be different across the two types of filtration. For 1-simplicial complexes, the α and VR filtration values on the simplices present in both are equivalent: The VR filtration value of an edge is its length l, whereas its α -filtration value is the radius r = l/2 of the associated 0-sphere consisting only of the two vertices.

B.2. Differential forms and a continuous analogue of TOPF

In this section, we will discuss how a continuous analogue of TOPF on Riemannian manifolds would look like. Simply considering the simplicial homology of the manifold, as we did in the case of a discrete simplicial complex, is however not sufficient: Harmonic cycles and cocycles don t exist in the simplicial chain complex of the manifold. This is because the space of chains in simplicial homology on manifolds is infinite-dimensional. Thus, we cannot canonically identify the space of chains with its dual, the space of cochains. From an intuitive point of view, a harmonic chain would need to take infinitesimal small values on every simplex which is not allowed. Rather, the right language to formalise harmonic representatives is Differential Geometry, Hodge Theory, and harmonic forms. Dodziuk proved in (Dodziuk, 1976) that the discrete Hodge Laplacian as used in this paper if weighted correctly converges spectrally to the Hodge Laplacian on differential forms which are arbitrary-dimensional integrands over manifolds.

There is a beautiful connection between differential geometry and algebraic topology: A theorem by de Rham states that given a Riemannian manifold M, the real-valued homology Hk(M; R) with coefficients in R is isomorphic to the de Rham cohomology Hk d R(M; R) on differential forms on M for arbitrary k Z 0, some form of cohomology defined via the chain complex of vector spaces of differential k-forms on M. We can define an analogue of the discrete Hodge Laplacian on differential forms by the differential induced via exterior derivates d and its adjoint δ:

The kernel ker of the Hodge Laplacian is called the space of harmonic forms. Similar to the discrete case, the Hodge theorem now gives us a natural vector space isomorphism between the space of harmonic k-forms on M, Hk(M), and the kth real-valued homology group Hk(M; R) = Hk(M).

We can rephrase this as follows: the harmonic forms can be seen as unique and natural homology representatives.

Let us now get some intuition for what this means for continuous TOPF features: In dimension 0, 0-forms are functions f : M R. In this case, we can simply take their value f(x) at x M as the corresponding point feature at x. In dimension 1, there is a correspondence between 1-forms and vector fields on M. In this case, we can take the norm of the vector field that corresponds to a given harmonic form at a point x via the map x 7 |v(x)|.

In general in dimension k, this is more complicated. Luckily, as we are interested in point-features, we can do all computations point-wise. We consider a point x M and a harmonic form ω. At point x, ω determines an element in the exterior algebra on the dual of the tangent space of M at x, i.e ωx Vk T x M and we need to define a norm on this space. An orthonormal basis on T x M, e1(x), . . . , en(x) gives rise to a basis consisting of elements ei1(x) ... eik(x) of the exterior algebra. We can now evaluate ωx on the elements of this basis ωx(ei,1(x) ... ei,k(x)) and take the square root of the sum of squares of the individual results

TOPFω : x 7 s X

1 i1< <ik n ωx (ei1(x) eik(x))2.

As one can show, this result is independent of the choice of ONB of T x M and gives a point-wise norm of the differential forms.

Point-Level Topological Representation Learning on Point Clouds

In the discrete case discussed in this paper, however, we don t evaluate on all basis elements as described above, but rather on all k-simplices adjacent to x. These k-simplices can thus be seen as samples from the exterior algebra of x. This sampling induces some differences to the continuous case, which we address with the post-processing procedure as described in the paper.

We note that some authors prefer to reserve the name of Hodge Laplacian for the continuous case of differential geometry and for the representations of finite element theory and discrete exterior calculus. Thus, what we call Hodge Laplacian on simplicial complexes is sometimes known as the combinatorial Laplacian.

B.3. Fixing real lifts of homology generators in Z/3Z-coefficients

Figure 7: Z/3Z-homology generator with faulty lift to R-coefficients. Left: Initial lift produced in Z/3Z-coefficients. Right: Fixed lift with light blue edges representing the original lift with coefficients in { 1, 1} and dark blue edges representing the fix with coefficients in 3Z. The underlying data is the trefoil knot embedded in R3, as generated by (Perea et al., 2023).

In step 3. of the TOPF algorithm in Section 3, we attempt to lift a homology generator with simplices σj Σk and coefficients cj in Z/3Z-coefficients to a homology generator in R-coefficients using the embedding

( ι(cj) ˆσj Σi k : σ = ˆσj 0 else .

where ι: Z/3Z is the map induced by the canonical inclusion of { 1, 0, 1} , R. Thus, we know that ei k is a cycle modulo 3: Bk 1ei k 0 mod 3. A-prior however, there is no reason why this equality should hold in R as well. As it turns out, in virtually all examples in practice, we already have ei k ker Bk 1 which is very convenient. The same phenomenon occurs for cocycles as well, see (De Silva & Vejdemo-Johansson, 2009). We will now assume to be in a rare case of ei k ker Bk 1. We can now attempt to fix this using a linear integer program finding y Z|Sk| such that

3Bk 1y = Bk 1ei k.

And then we set fix(ei k) := ei k 3y. If this linear program is not feasible, ei k has no integer/real lift and TOPF will skip the associated topological feature. In Figure 7, we give a visual example of a faulty lift and a fix obtained by the above procedure. This is similar to the implementation used to compute Eilenberg Mac Lane coordinates in DREi Mac (Perea et al., 2023).

Point-Level Topological Representation Learning on Point Clouds

C. Theoretical guarantees

In this section, we want to investigate possible theoretical guarantees for TOPF computed on idealised datasets. Instead of directly working with the α-complexes, we will first prove guarantees for the closely related construction of α -complexes. α filtrations (Definition B.2) are closely related to α-filtrations (Definition B.4) used in the implementation and agree on the filtration values on most of the simplices, have however nicer properties. Finally, we give conditions on when this result applies to the α-filtrations, which is relevant to the actual implementation of TOPF.

Theorem C.1 (Topological Point Features of Spheres). Let X consist of at least (n + 2) points (denoted by S) sampled uniformly at random from a unit n-sphere in Rn+1 and an arbitrary number of points with distance of at least 2 to S. When we now consider the α -filtration on this point cloud, with probability 1 we have that (i) there exists an n-th persistent homology class generated by the n-simplices on the convex hull of S, and (ii) the support of the associated topological point feature (TOPF) F n is precisely S: supp(F n) = S. (iii) The same holds true for point clouds sampled from multiple ni-spheres if the above conditions are met on each individual sphere. (iv) If there are no other points contained in or on the circumspheres of the n-simplices of the convex hull of S, then the same above holds true for TOPF computed using the α-filtration. (v) More generally, for any contractible subset M of the (n + 1)-simplices such that all n-simplices on the boundary fulfill the above condition, this holds when using the α-filtration, the discussed n-simplices (i) and the subset of points contained in M (ii).

We will give a proof of this after the following remark. Remark C.2. The key idea of the proof is to use that α -filtrations and partially α-filtrations assign the filtration value based on the radius of the circumsphere of the k-simplex. Because all points in S lie on the same n-sphere, with probability 1 we can write down the filtration value of the (n + 1)-simplices explicitly. This is of course a very idealised setting. However in practice, datasets with topological structure consist in a majority of cases of points sampled with noise from deformed n-spheres. The theorem thus guarantees that TOPF will recover these structural information in an idealised setting. Experimental evidence suggests that this holds under the addition of noise as well which is plausible as harmonic persistent homology is robust against some noise (Basu & Cox, 2022). For the 2D setting, condition (iv) is equivalent to assuming that there is no line through the centre of the circle such that one half plane does not contain any points of S.

We will now give the proof of the theorem that guarantees that TOPF works. We discuss the assumptions of the theorem in Remark C.3.

Proof. Assume that we are in the scenario of the theorem. Now because the n-volume of (n 1)-submanifolds is zero, we have that with probability 1 the points of S don t lie on a single (n 1) sphere inside the n-sphere. Let us now look at the α-filtration of the simplices in S: Recall that the filtration values of a k-simplex is given by the radius of the (k 1)-sphere determined by its vertices. Because all of the (n+1)-simplices σn+1 with vertices V S in S lie on the same unit n-sphere Sn, they all share the filtration value of α(σn+1) = 1. By the same argument as above, with probability 1 there are no (n + 1) points in S that lie on an unit (n 1)-sphere. Thus all of the n-simplices σn lie on (n 1)-spheres Sn with a radius r < 1 smaller than 1 and hence have a filtration value α(σn) smaller than 1. Let

b := max ({α(σn) : σn hull(S)})

be the maximum filtration value of an n-simplex on the boundary of the convex hull of S. Then, a linear combination g of the n-simplices of the boundary of the convex hull of S with coefficients in 1 is a generator of a persistent homology class with life time (b, 1) (this follows from the fact that n-spheres and their triangulations are orientable). This proves claim (i).

Because of the assumption that all points not contained in S have a distance of at least 2 to the points in S, all (n + 1)- simplices σn+1 with vertices both in S and its complement in X will have a filtration value α(σn+1) 1 of at least 1. Recall that all (n + 1)-simplices σn+1 S with vertices inside S have a filtration value of α(σn+1) = 1. Thus the adjoint of the n-th boundary operator B n is trivial on the homology generator g for a simplicial complex constructed during (b, 1). Thus, we have that for the n-th Hodge Laplacian

Lng = B n 1Bn 1g + Bn B n g = 0 + 0 = 0

and hence g is a harmonic generator for the entire filtration range of (b, 1). Claim (ii) and (iii) then follow from the construction of the TOPF values.

Point-Level Topological Representation Learning on Point Clouds

For claim (iv), it suffices to notice that the n-simplices σ on the convex hull are precisely non-Gabriel simplices iff there exist a point x S lying inside the circumsphere of σ. Because the assumption of part (iv) excludes this case, all n-simplices are Gabriel simplices and the α-filtration values (Definition B.4) and α -filtration values (Definition B.2) agree on the n-simplices and (n + 1)-simplices of the Delaunay triangulation. Because these are the only considered filtration values for the proof, the claim follows.

Finally for part (v), we see that the same argument now applies to the simplices on the boundary, consisting of n-simplices of M, consisting of (n + 1)-simplices. Because M is contractible, it is connected and has trivial homology. Thus, its boundary is homotopy equivalent to an n-sphere. Because in this case only the points X on the boundary of M are contained in the generating n-simplices of the considered homology class, the associated TOPF feature will have support on X . This concludes the proof.

Remark C.3. The setting of the theorem is a very special setting. In general on data, harmonic chains have entries with absolute values somewhere in [0, 1]. In the setting of the theorem however, the harmonic representative will take values in 1, 0, 1. We will briefly explain why this is the case: Consider a k-homology representative r with simplex-values in { 1, 0, 1}. We consider the harmonic projection h of r into the harmonic space. Then, for a k-simplex σ, we have that h(σ) = 1 iff r(σ) = 1 and σ is not the face of any (k + 1)-simplices.

This can for example be seen by representing h as the difference of r and its gradient and curl parts h = r rgrad rcurl. Because r is already a cycle, Bkr = 0 and thus rgrad = 0. Now, the curl part can be written as stemming from a signal on the the k + 1-simplices, rcurl = Bk+1xcurl. However, because σ is not the face of a k + 1-simplex, rcurl(σ) = 0 and thus h(σ) = r(σ).

The setting of the theorem precisely constructs a case where we have k-simplices and no (k + 1)-simplices in the relevant part of the filtration. This is the case because in an α -filtration, the filtration value of a k-simplex is the radius of its circumscribed (k 1)-sphere. All relevant (k + 1)-simplices in the theorem lie on the k-unit sphere, and thus have filtration value 1. This is however an idealised setting. Because we construct the simplicial complex to compute the harmonic representative somewhere in the middle (determined by the interpolation coefficient) between the birth and death time of the homology class, empirically the majority of the ksimplices of the k-homology generators are faces of (k + 1)-simplices. In this case, the harmonic representative smooths out the feature.

D. Topological Clustering Benchmark Suite

We introduce seven point clouds for topological point cloud clustering in the topological clustering benchmark suite (TCBS). Some of these point clouds have been adapted from (Grande & Schaub, 2023a). The ground truth and the point clouds are depicted in Figure 8. The point clouds represent a mix between 0-, 1and 2-dimensional topological structures in noiseless and noisy settings in ambient 2-dimensional and 3-dimensional space. Point clouds Ellipses and spaceship already incorporate heterogeneous sampling/densities. The results of clustering according to TOPF can be found in Figure 9.

We constructed the benchmark by sampling from topologically different shapes with varying sampling density in different ambient dimensions. For example for the point cloud 2Spheres2Circles, we combined points sampled from two spheres and two circles. We then divided the point cloud into four ground truth clusters, one for each of the two spheres and two circles and assigned every point the cluster corresponding to the object it was sampled from. Thus, the ground truth labels of a point corresponds to the topological structure it was sampled from. We provide functions to resample some of the point clouds with variable point numbers and provide a method to provide sampling heterogeneity as described in Figure 16.

E. Implementation

We have created an easy-to-use python package TOPF which can be found at github (https://github.com/ vincent-grande/topf) and Py Pi (https://pypi.org/project/topf/) with many examples from the paper and the topological clustering benchmark suite. The package currently works under mac OS and Linux. The package contains both the code to generate the topological point features, as well as the code to reproduce the various visualisation steps in this paper.

All experiments for TOPF were run on a Apple M1 Pro chipset with 10 cores and 32 GB memory. TOPF and the experiments are implemented in Python and Julia. For persistent homology computations, we used GUDHI (The

Point-Level Topological Representation Learning on Point Clouds

Figure 8: Data sets of the Topological Clustering Benchmark Suite (TCBS) with true labels. Top: 2D data sets. From left to right: a): 4Spheres (656 points), b): Ellipses (158 points), c): Spheres+Grid (866 points), d): Halved Circle (249 points). Bottom: 3D data sets. From left to right: e): 2Spheres2Circles (4600 points), f): Spherein Circle (267 points), g): spaceship (650 points).

Figure 9: Data sets of the Topological Clustering Benchmark Suite (TCBS) with labels generated by TOPF. Top: 2D data sets. From left to right: a): 4Spheres (0.81 ARI), b): Ellipses (0.95 ARI), c): Spheres+Grid (0.70 ARI), d): Halved Circle (0.71 ARI). Bottom: 3D data sets. From left to right: e): 2Spheres2Circles (0.94 ARI), f): Spherein Circle (0.97 ARI), g): spaceship (0.92 ARI).

GUDHI Project, 2015) ( The GUDHI developers, MIT license) and Ripserer ( ˇCufar, 2020) ( mtsch, MIT license), which is a modified Julia implementation of (Bauer, 2021). For the least square problems, we used the LSMR im-

Point-Level Topological Representation Learning on Point Clouds

plementation of Sci Py (Fong & Saunders, 2011). We used the pgeof Python package for computation of geometric features https://github.com/drprojects/point_geometric_features ( Damien Robert, Loic Landrieu, Romain Janvier, MIT license). We use parts of the implementation of TPCC https://git.rwth-aachen. de/netsci/publication-2023-topological-point-cloud-clustering ( Computational Network Science Group, RWTH Aachen University, MIT license). We use the implementation of WSDesc (Li et al., 2022), https://github.com/craigleili/WSDesc (Lei Li, Hongbo Fu, Maks Ovsjanikov. CC BY-NC 4.0 License). We use the implementation of An Tao, https://github.com/antao97/dgcnn.pytorch of Dynamic Graph

CNNs (Wang et al., 2019), MIT license. The idea for the implementation of the fix of the lift of the Z/3Z-representative to R-coefficients was inspired by DREi Mac, (Perea et al., 2023).

E.1. Hyperparameters

0.1 0.2 0.3 0.4 0.5 0.6

interpolation coefﬁcient

0.025 0.050 0.075 0.1 00 0.1 25 0.1 50 0.1 75 0.200 delta

ﬁlename 4spheres Ellipses Spheres+Grid Halved Circle 2Spheres2Circles Spaceship Spherein Circle

0.0 0.2 0.4 0.6 0.8 1 .0

Feature Damping Coefﬁcient

kmeans spectral Clustering Method

ﬁlename 4spheres Ellipses Spheres+Grid Halved Circle 2Spheres2Circles Spaceship Spherein Circle

Figure 10: Hyperparameter robustness of TOPF on the TCBS. TOPF performs reasonably well across a large range around the default parameters (Interpolation coefficient: 0.3, δ: 0.07, damping coefficient (non-negative): 0, Clustering method: spectral). The clustering method is used on the TOPF vectors to determine the cluster-ids.

All the relevant hyperparameters are already mentioned in their respective sections. However, for convenience we gather and briefly discuss them in this section. We note that TOPF is robust and applicable in most scenarios when using the default parameters without tuning hyperparameters. The hyperparameters should more be thought of as an additional way where detailed domain-knowledge can enter the TOPF pipeline. We have performed additional experiments to validate the robustness of TOPF to moderate changes in parameters, see Figure 10.

Point-Level Topological Representation Learning on Point Clouds

Maximum Homology Dimension d The maximum homology dimension determines the dimensions of persistent homology the algorithm computes.

For the choice of the maximum homology degree d to be considered there are mainly three heuristics which we will list in decreasing importance (Cf. (Grande & Schaub, 2023a)):

I. In applications, we usually know which kind of topological features we are interested in, which will then determine d. This means that 1-dimensional homology and d = 1 suffices when we are looking at loops of protein chains. On the other hand, if we are working with voids and cavities in 3d histological data, we need d = 2 and thus compute 2-dimensional homology.

II. Algebraic topology tells us that there are no closed n-dimensional submanifolds of Rn. Hence their top-homology will always vanish and all interesting homological activity will appear for d < n.

III. In the vast majority of cases, the choice will be between d = 1 or d = 2 because empirically there are virtually no higher-dimensional topological features in practice.

In our quantitative experiments, we have always chosen d = n 1.

Thresholding parameter δ In step 4 of the algorithm, we normalise and threshold the harmonic representatives. After normalising, the entries of the vectors lie in the interval of [0, 1]. The thresholding parameter δ now essentially determines an interval of [0, δ] which we will linearly map to [0, 1], while mapping all entries above δ to 1 as well. This is necessary as most of the entries in the vector ei k are very close to 0 with a very small number of entries being close to 1. Without this thresholding, TOPF would now be almost entirely determined by these few large values. Thus this step limits the maximum possible influence of a single entry. However, because most of the entries of ei k are concentrated around 0, small changes in δ will not have a large effect and we chose δ = 0.07 in all our experiments.

Interpolation coefficient λ The interpolation coefficient λ (0, 1) determines whether we build our simplicial complexes close to the birth or the death of the relevant homological features at time t = b1 λd. This then in turns controls how localised or smooth the harmonic representative will be. In general, the noisier the ground data is the higher we should choose λ. However, TOPF is not sensitive to small changes in λ. We have picked λ = 0.3 for all the quantitative experiments, which empirically represents a good choice for a broad range of applications.

Feature selection factor β Increasing β leads to TOPF preferring to pick a larger number of relevant topological features. Without specific domain-knowledge, β = 0 represents a good choice.

Feature selection quotients max_total_quot, min_rel_quot, and min_0_ratio These are technical hyperparameters controlling the feature selection module of TOPF. For a technical account of them, see Appendix G. In most of the cases without domain knowledge, they do not have an effect on the performance of TOPF and should be kept at their default values.

Simplicial Complex Weights Although the simplicial weights are not technically a hyperparameter, there are many potential ways to weigh the considers SCs that can highlight or suppress different topological and geometric properties. In all our experiments, we use w weights discussed in Appendix H.

Ablation study In Table 2, we performed an ablation study and benchmarked TOPF against TOPF while skipping the harmonic projection step with the Hodge Laplacian (Step 3.). The result show that in general, skipping the step with the Hodge Laplacian decreases the performance of TOPF by a wide margin. The exception to this are Sphere In Circle or 2Spheres2Circles, where the two methods have similar performance. In these two last examples, points are directly sampled from a manifold, and thus the loops and spheres are very "thin". Thus, the homology generator is already a good approximation of all the simplices responsible for a topological feature. In general however, this is not the case, necessating the use of Differential Geometry and the Hodge Laplacian.

Point-Level Topological Representation Learning on Point Clouds

E.2. Runtime

In Figure 11, we analyse the runtime of the current TOPF implementation on two point clouds from the topological clustering benchmark suite. We increase the point density while keeping the structure of the point cloud intact. The runtime of our implementation with sparsification = off in the regime of Figure 11 appears to scale linearly with the total number of points with a significant constant term. The significant constant term is probably due to our implementation starting a new julia kernel and communicating between python and julia. A python-only or julia-only implementation would thus speed up computations.

Theoretical runtime complexity

1. a). Constructing the complex: The Vietoris Rips complex will have O(nk+1) simplices, where n is the number of points and k the maximum homology dimension. Apart from this, there is no significant computational effort needed. Note that we can decrease this if we set a maximum VR-radius, as done in many works from the literature. Computing the alpha-complex in ambient dimension d is equivalent to computing the convex hull in ambient dimension d + 1, which has a complexity of O(nd/2). For a fixed dimension d, the number of simplices is linear in n. b). Computing persistent homology and generators: As we need the homology generators, we use the recently introduced involuted persistent homology ( ˇCufar & Virk, 2023). While the authors discuss the runtime performance, they do not give a concrete complexity bound in ( ˇCufar & Virk, 2023). However, they deduce that it has a similar run-time to usual persistent cohomology computations, whose performance is for example discussed in (De Silva et al., 2011).

2. Picking the significant generators has negligible impact.

3. Computing the harmonic projections. Given a homology representative r in dimension k, computing the harmonic representative amounts to computing a sparse matrix vector product of Bk+1x and a sparse least squares problem lsmr(Bk+1, r), i.e. solving min x RSk+1 r Bk+1x 2

where Sk+1 is the number of (k + 1)-simplices in the simplicial complex and Bk+1 has (k + 2)Sk+1 non-zero entries. This is a sparse least-squares problem, which we solve using the iterative sparse solver lsmr (Fong & Saunders, 2011). Because this is an iterative solver, the authors do not give runtime complexity, but discuss the computational requirements in (Fong & Saunders, 2011).

4. Pooling and averaging over the simplicial neighbours has negligible runtime constraints.

F. More Details on the Experiments

High-dimensional Point Clouds In Figure 4, we use TOPF on a high-dimensional input point cloud representing the conformation space of cyclooctane (Martin & Watson, 2011). Because the ambient dimension is too large for α-filtrations, a Vietoris Rips filtration with more simplices is used. The VR filtration depends on the ambient dimension only for computing the distance matrix, which is a negligible part of the runtime. Hence, a similar performance can be expected for higher dimensions than 24 as well. To counteract the increase in computational complexity, TOPF automatically downsamples the point cloud using a mixture of landmark and random sampling. We see in the ISOMAP projections of Figure 4 that TOPF extracts reasonable features representing the topology of the conformation space. Setting nclusters to 4 even reveals the manifold anomalies in a separate cluster (blue), similar to the results of (Stolz et al., 2020). The runtime for this high-dimensional experiment was 189.8 seconds.

Variational Autoencoders We have trained a VAE on image patches sampled with a topological structure, see Figure 12. We have sampled the image patches around two loops and a connecting line. Using TOPF, we could recover this structure in the latent space of the VAE highlighting how topological point features can help in explainable AI. We note, of course, that many latent spaces do not carry any higher-order topological information, making characterisation by TOPF infeasible. Rather we showed that when we expect a topological structure to exist in the data set due to some specifics of the data, we can recover this topological structure even in the latent space of the VAE using TOPF. We have repeated the experiments for a higher-dimensional latent space and received similar results, see Figure 13.

Point-Level Topological Representation Learning on Point Clouds

0 10000 20000 30000 40000 n_points

sparsification

0 2000 4000 6000 8000 n_points

sparsification

Figure 11: Runtime on two examples from Topological Clustering Benchmark Suite while increasing point density. Left: Halved Circle, Right: 2 Spheres 2 Circles. Experiments kept the topological structure intact while increasing the number of points sampled. Both the runtime of the vanilla TOPF algorithm, as well as of TOPF using automatic subsampling heuristic as implemented in the python package, are listed.

Additional heatmaps on proteins In Figure 14, we provide additional experiments of TOPF on protein data. This time, we report every single feature vector produced by TOPF in a separate plot. In the experiment with Cys123, we wanted to show that TOPF can detect so-called protein pockets. Because pockets do not form holes detectable by TDA straightforward, the literature suggests adding (and later removing) these points on the convex hull, which turns the pockets into holes, see for example (Oda et al., 2024).

Performance with decreasing sampling density In Figure 15, we analyse the robustness of TOPF with respect to a decrease in sampling density. We have selected 2Spheres2Circles as the dataset with the highest original sampling density for this experiment. It shows that TOPF can still produce meaningful results even for significantly downsampled point clouds. We note that the considered point clouds are in particular very sparse when compared to the point clouds usually considered by neural 3D point cloud architectures. We interpret the results as basically repeating the good performance of TOPF, except for the datasets Ellipsesin Ellipses and spaceship, where the performance drops rapidly (at least in a logarithmic plot). Our interpretation of this is that both datasets consist of submanifolds with different sampling density, which intersect/are contained in one another. Thus, the point density is a key to distinguishing the classes, and a random change in these densities due to downsampling can make the encoded structure hard to detect or even causes them to vanish in a sense of topology.

Performance under heterogeneous sampling In Figure 16, we analyse the robustness of TOPF with respect to heterogeneous sampling. The results show that TOPF performs well even in the presence of inhomogeneous sampling, and the degrading of performance can largely be attributed to a decrease in minimum sampling rate, rather than the inhomogeneous nature.

From a theoretical point of view, this should not be surprising: The birth of the topological feature will now depend on the minimum sampling density, while the death time still depends on the size of the feature. The heuristic employed by TOPF will then choose a maximal triangulation radius which lies between these two radii. In the dense part, the triangulation will be denser then in the sparse parts. However, due to the the character of the alpha filtration, the weighting of the simplices discussed in Appendix H and the thresholding this does not cause an issue in the features generated by TOPF.

Details on experiments with WSDesc WSDesc (Weakly Supervised 3D Local Descriptor Learning for Point Cloud Registration), (Li et al., 2022), uses voxel-based representations of point clouds to extract robust 3D local descriptors for point cloud registration. As showcased in the original paper, WSDesc showcases a good generalisation for point clouds outside of the training set. We used WSDesc pretrained on the 3DMatch data. WSDesc was used with 512 keypoints, and

Point-Level Topological Representation Learning on Point Clouds

Variational Auto

Latent space with TOPF features

=> TOPF can extract topological information from latent spaces of NN and autoencoders

TOPF features overlayed on

sampling locations

Sample square image patches around the points marked in red

Figure 12: Using TOPF to explore the topological structure in latent/embedding spaces of Variational Auto Encoders (VAE) 1. Given a picture (of a cow strolling along the shore of Lake Prags), we sample 578 square patches around the centre points marked in red in the image. The centre points are taken from the sides of two squares and a line connecting the two squares. The assumption is that this topological structure (with two holes) is present in the sample space as well. 2. We train a VAE on the set of 578 square patches with latent space dimension of 3 (down from 33 33 3.) 3. We run TOPF with fixed_num_features set to [0, 2] on the latent space to extract the two most significant point-wise topological features. 4. We overlay the topological features from the latent space over the centre points of the corresponding image patches. We see that TOPF has roughly recovered the topological structure inherent in the sample space as described in step 1. We note that the image patches with centre on the middle horizontal line are almost identical, making it virtually impossible to distinguish them. Note that although TOPF thinks they contribute to the left loop, their corresponding feature entries are a lot smaller than those of the image patches (shown by lighter colours and smaller dots in the plot.)

tested with a varying of keypoints up to 5000. We thus tested WSDesc with the maximum number of keypoints possible per input point cloud.

Details on experiments with DGCNN We use the implementation of An Tao, https://github.com/antao97/ dgcnn.pytorch of Dynamic Graph CNNs (Wang et al., 2019). We use the data pretrained on the Shape Net Part segmentation dataset, where we use the features of the second last layer. This showed the best performance. We determine the optimal hyperparameters (k, object class, clustering method, layer to use) for every dataset of TCBS individually and only present the results on the best parameter set. The spectral clustering always outperformed k-means. The authors of the original paper state that DGCNN performs very well on datasets with above 500 points, and still is robust to downsampling beyond this point. Thus we always input the maximum available number of points to DGCNN. As described in (Wasserman, 2018), we scale the input point cloud to fit into a unit sphere. We pad the 2d data sets with zeroes in the third dimension. TOPF still outperforms DGCNN on the TCBS. We believe that this is due to the heterogeneous sampling, small sampling set size and most importantly the lack of training data for the neural network.

Point-Level Topological Representation Learning on Point Clouds

Figure 13: VAE experiment described in Figure 12 for a 16-dimensional latent space of the VAE and the original 8748-dimensional image space. Left: The first three of the 16 dimension of the latent space embeddings, colour represents the fourth dimension. Centre left: The weighted harmonic representatives of the selected topological features. Centre right: The features produced by TOPF overlayed on the centre pixels of the associated image patches, as done in Figure 12 Step 4., just without the picture of the cow. Right: TOPF features obtained from the original 8748-dimensional image space. This shows that TOPF can help analyse topological structure in high-dimensional data.

Figure 14: TOPF heatmaps for three proteins. Top left: NALCN channelosome (Kschonsak et al., 2022) Top right: Mutated Cys123 of E. coli (Hidber et al., 2007), with convex hull added during computation, only 2-dimensional homology features Bottom: Gro EL of E. coli (Chaudhry et al., 2004) (Selected features).

G. How to pick the most relevant topological features

Simplified heuristic The persistent homology Pk module in dimension k is given to us as a list of pairs of birth and death times (bk i , dk i ). We can assume these pairs are ordered in non-increasing order of the durations lk i = dk i bk i . This list

Point-Level Topological Representation Learning on Point Clouds

filename = spheres And Grid TRUE.csv

filename = 4spheres TRUE.csv

102 3 101 4 101 6 101 2 102

filename = Halved Circle TRUE.csv

102 3 101 4 101 6 101

filename = Ellipses In Ellipses TRUE.csv

filename = Two_Spheres_2_Circles TRUE.csv

filename = spaceship_v2TRUE.csv

filename = Spherein Circle TRUE.csv

method TOPF Spectral Clustering KMeans OPTICS DBSCAN Agglomerative Clustering Mean Shift WSDesc Geo

Figure 15: Performance of TOPF while decreasing sampling density. The x-axis is scaled logarithmically and TOPF achieves an adjusted rand index (ARI, random algorithms achieve an ARI of over 0.9 for 700 samples, down from 4600 original points on the 2Spheres2Circles Dataset. The smallest considered sample size was 16 points. The points were sampled at random. We ran each experiment 100 times and report the standard deviation.

is typically very long and consists to a large part of noisy homological features which vanish right after they appear. In contrast, we are interested in connected components, loops, cavities, etc. that persist over a long time, indicating that they are important for the shape of the point cloud. Distinguishing between the relevant and the irrelevant features is in general difficult and may depend on additional insights on the domain of application. In order to provide a heuristic which does not depend on any a-priori assumptions on the number of relevant features we pick the smallest quotient qk i := lk i+1/lk i > 0 as the point of cut-off Nk := arg mini qk i . The only underlying assumption of this approach is that the band of relevant features is separated from the noisy homological features by a drop in persistence.

Advanced Heuristic However, certain applications have a single very prominent feature, followed by a range of still relevant features with significantly smaller life times, that are then followed by the noisy features after another drop-off. This then could potentially lead the heuristic to find the wrong drop-off. We propose to mitigate this issue by introducing a hyperparameter β R>0. We then define the i-th importance-drop-off quotient qk i by

qk i := lk i+1/lk i (1 + β/i) .

Point-Level Topological Representation Learning on Point Clouds

10 2 10 1 100

filename = 4spheres TRUE.csv

10 2 10 1 100

filename = Halved Circle TRUE.csv

10 2 10 1 100

downsampling factor

filename = Two_Spheres_2_Circles TRUE.csv

10 2 10 1 100

downsampling factor

filename = Spherein Circle TRUE.csv

method TOPF Spectral Clustering KMeans OPTICS DBSCAN Agglomerative Clustering Mean Shift Geo WSDesc

Figure 16: Performance of TOPF under heterogeneous sampling. Top: We divide four of the point clouds with a symmetry into two halves. We downsample only one of the halves, creating sampling irregularities within the individual topological features. We repeat the experiment 100 times and report the achieved ARI. Note that the lowest value on the x-axis means downsampling by a factor of 100, meaning that the smaller point clouds will then almost only consist of one half. We compare the performance of TOPF with the other baselines, noting that TOPF outperforms them for reasonable heterogeneities. Bottom Left: 4spheres with downsampling factor 0.2 with true labels. Bottom Right: 2spheres2circles with downsampling factor 0.1 with true labels.

Point-Level Topological Representation Learning on Point Clouds

The basic idea is now to consider the most significant Nk homology classes in dimension k when setting Nk to be

Nk := arg min i qk i .

Increasing β leads the heuristic to prefer selections with more features than with fewer features. Empirically, we still found β = 0 to work well in a broad range of application scenarios and used it throughout all experiments. There are only a few cases where domain-specific knowledge could suggest picking a larger β.

To catch edge cases with multiple steep drops or a continuous transition between real features and noise, we introduce two more checks: We allow a minimal qk i of min_rel_quot = 0.1 and a maximal quotient qh 1/qk i of max_total_quot = 10 between any homology dimensions. Because features in 0-dimensional homology are often more noisy than features in higher dimensions, we add a minimum zero-dimensional homology ratio of min_0_ratio = 5, i.e. every chosen 0-dimensional feature needs to be at least min_0_ratio more persistent then the minimum persistence of the higher-dimensional features. Because these hyperparameters only deal with the edge cases of feature selection, TOPF is not very sensitive to them. For all our experiments, we used the above hyperparameters. We advise to change them only in cases where one has in-depth domain knowledge about the nature of relevant topological features.

Fixed number of topological features Alternatively, it is possible to specify a fixed desired number Nd of topological features per dimension d. TOPF then automatically returns the Nd most relevant features in dimension d. For practical purposes, we can then weigh these features by a function w( ) in the life time di bi or the life quotient di/bi of the features. Possible picks for w include an exponential function w(x) = ex, a quadratic function w(x) = x2, a linear function w(x) = x or a scaled sigmoid function. Our intuition suggests that when picking functions like a linear function, the many short-lived features will be given too much weight in comparison to the few more relevant features. Selection the best weight function is an interesting open problem for future work.

H. Simplicial Weights

In an ideal world, the harmonic eigenvectors in dimension k would be vectors assigning 1 to all k-simplices contributing to k-dimensional homological feature, a 0 to all k-simplices not contributing or orthogonal to the feature, and a value in ( 1, 1) for all simplices based on the alignment of the simplex with the boundary of the void. However, this is not the case: In dimension 1, we can for example imagine a total flow of 1 circling around the hole. This flow is then split up between all parallel edges which means two things: I Edges where the loop has a larger diameter have smaller harmonic values than edges in thin areas and II in VR complexes, which are the most frequently used simplicial complexes in TDA, edges in areas with a high point density have smaller harmonic values than edges in low-density areas. Point II is another advantage of α-complexes: The expected number of simplices per point does not scale with the point density in the same way as it does in the VR complex, because only the simplices of the Delaunay triangulation can appear in the complex.

We address this problem by weighing the k-simplices of the simplicial complex. The idea behind this is to weigh the simplicial complex in such a way that it increases and decreases the harmonic values of some simplices in an effort to make the harmonic eigenvectors more homogeneous. For weights w RSk, W = diag(w), the symmetric weighted Hodge Laplacian (Schaub et al., 2020) takes the form of

Lw k = W 1/2Bk 1B k 1W 1/2 + W 1/2Bk B k W 1/2.

Because we want the homology representative to lie in the weighted gradient space, we have to scale its entries with the weight and set ei k,w := W 1/2ei k. With this, we have that

B k 1W 1/2ei k,w = B k 1W 1/2W 1/2ei k = B k 1ei k = 0

We propose two options to weigh the simplicial complex. The first option is to weigh a k-simplex by the square of the number of k + 1-simplices the simplex is contained in:

w (σk) = 1/(|{σk+1 St k+1 : σk σk+1}| + 1)2

where the +1 is to enforce good behaviour at simplices that are not contained in any higher-order simplices. One of the advantages of the α-complex is that we don t have large concentrations of simplices in well-connected areas. The proposed

Point-Level Topological Representation Learning on Point Clouds

Figure 17: Effect of weighting a simplicial complex on harmonic representatives. Red edges have large absolute values and blue edges have small absolute values of the considered quantity. Top: VR complex. Bottom: α-complex Left: The base point cloud with different densities. 2nd Left: Unweighted harmonic homology representative of the large loop. 3rd Left: Effective resistance of the 1-simplices. 3rd Right: Harmonic homology representative of the complex weighted by effective resistance. 2nd Right: Inverse of number of incident triangles (Definition H.1). Right: Harmonic homology representative of the complex weighted by number of incident triangles. Up to a small threshold, the standard harmonic representative in the VR complex is almost exclusively supported in the low-density regions of the simplicial complex. This leads to poor and unpredictable classification performance in downstream tasks. In contrast, the harmonic homology representative of the weighted VR complex has a more homogenous support along the loop, while still being able to discriminate the edges not contributing to the loop. The α-complex suffers less from this phenomenon (at least in dimension 2), and hence reweighing is not necessarily required.

weighting w is computationally straightforward, as it can be obtained as the column sums of the absolute value of the boundary matrix |Bk|. The weights also deal with the previously mentioned problem II: As the homology representative is scaled inversely to the weight vector w, the simplices in high-density regions will be assigned a low weight and thus their weighted homology representative will have a larger entry. By the projection to the orthogonal complement of the curl space, this large entry is then diffused among the high-density region of the SC with many simplices, whereas the lower entries of the simplices in low-density regions are only diffused among fewer adjacent simplices.

(Paik & Park, 2023) consider the dual problem of weighting the simplicial complex such that a harmonic cohomology representative becomes sampling-density independent. They show that the dual to our approach, i.e. by dividing by the sum of lower-incident simplices (i.e. nodes) of the two vertices of an edge produces an harmonic representative stable under heterogeneous density.

However, the first weight is not able to incorporate the number of parallel simplices into the weighting. This is why we propose a second simplicial weight function based on generalised effective resistance.

Definition H.1 (Effective Hodge resistance weights). For a simplicial complex S with boundary matrices (Bk), we define

Point-Level Topological Representation Learning on Point Clouds

the effective Hodge resistance weights w R on k-simplices to be:

w R := diag B+ k 1Bk 1 2

where diag( ) denotes the vector of diagonal entries and ( )+ denotes taking the Moore Penrose inverse.

Intuitively for k = 1, we can assume that every edge has a resistance of 1 and then the effective resistance coincides with the notion from Physics. Thus simplices with many parallel simplices are assigned a small effective resistance, whereas simplices with few parallel simplices are assigned an effective resistance close to 1. However, computing the Moore Penrose inverse is computationally expensive and only feasible for small simplicial complexes.

In Figure 17, we show that the weights w are a good approximation of the effective resistance in terms of the resulting harmonic representative. The standard form of TOPF used in all experiments uses w -weights.

I. Limitations

Topological features are not everywhere The proposed topological point features take relevant persistent homology generators and turn these into point-level features. As such, applying TOPF only produces meaningful results on point clouds that have a topological structure. On these point clouds, TOPF can extract structural information unobtainable by non-topological methods. Although TDA has been successful in a wide range of applications, a large number of data sets does not possess a meaningful topological structure. Applying TOPF in these cases will produce no additional information. Other data sets require pre-processing before containing topological features. In Figure 3 left, the 2d topological features characterising protein pockets of Cys123 only appear after artificially adding points sampled on the convex hull of the point cloud (Cf. (Oda et al., 2024)).

Computing persistent homology can be computationally expensive As TOPF relies on the computation of persistent homology including homology generators, its runtime increases on very large point clouds. This is especially true when using VR instead of α-filtrations, which become computationally infeasible for higher-dimensional point clouds. Persistent homology computations for dimensions above 2 are only feasible for very small point clouds. Because virtually all discovered relevant homological features in applications appear in dimension 0, 1, or 2, this does not present a large problem. Despite these computational challenges, subsampling, either randomly or using landmarks, usually preserves relevant topological features and thus extends the applicability of TDA in general and TOPF even to very large point clouds.

Automatic feature selection is difficult without domain knowledge While the proposed heuristics works well across a variety of domains and application scenarios, only domainand problem-specific knowledge makes truthful feature selection feasible.

Experimental Evaluation There are no benchmark sets for topological point features in the literature, which makes benchmarking TOPF not straightforward. On the level of clustering, we introduced the topological clustering benchmark suite to make quantitative comparisons of TOPF possible, and benchmarked TOPF on some of the point clouds of (Grande & Schaub, 2023a). On both the level of point features and real-world data sets, it is however hard to establish what a ground truth of topological features would mean. Instead we chose to qualitatively report the results of TOPF on proteins and real-world data, see Figure 3.