Published as a conference paper at ICLR 2022

FAST TOPOLOGICAL CLUSTERING WITH WASSERSTEIN DISTANCE

Tananun Songdechakraiwut
Department of Electrical and Computer Engineering
University of Wisconsin-Madison, USA
songdechakra@wisc.edu

Bryan M. Krause & Matthew I. Banks
Department of Anesthesiology, Department of Neuroscience
University of Wisconsin-Madison, USA

Kirill V. Nourski
Department of Neurosurgery, Iowa Neuroscience Institute
University of Iowa, USA

Barry D. Van Veen
Department of Electrical and Computer Engineering
University of Wisconsin-Madison, USA

ABSTRACT

The topological patterns exhibited by many real-world networks motivate the development of topology-based methods for assessing the similarity of networks. However, extracting topological structure is difficult, especially for large and dense networks whose node degrees range over multiple orders of magnitude. In this paper, we propose a novel and computationally practical topological clustering method that clusters complex networks with intricate topology using principled theory from persistent homology and optimal transport. Such networks are aggregated into clusters through a centroid-based clustering strategy based on both their topological and geometric structure, preserving correspondence between nodes in different networks. The notions of topological proximity and centroid are characterized using a novel and efficient approach to computation of the Wasserstein distance and barycenter for persistence barcodes associated with connected components and cycles. The proposed method is demonstrated to be effective using both simulated networks and measured functional brain networks. Code for topological clustering is available at https://github.com/topolearn.

1 INTRODUCTION

Network models are extremely useful representations for complex data. Significant attention has been given to cluster analysis within a single network, such as detecting community structure (Newman, 2006; Rohe et al., 2011; Yin et al., 2017). Less attention has been given to clustering of collections of network representations. Clustering approaches typically group similar networks based on comparisons of edge weights (Xu & Wunsch, 2005), not topology. Assessing similarity of networks based on topological structure offers the potential for new insight, given the inherent topological patterns exhibited by most real-world networks. However, extracting meaningful network topology is a very difficult task, especially for large and dense networks whose node degrees range over multiple orders of magnitude (Barrat et al., 2004; Bullmore & Sporns, 2009; Honey et al., 2007).

Persistent homology (Barannikov, 1994; Edelsbrunner et al., 2000) has recently emerged as a powerful tool for understanding, characterizing and quantifying complex networks (Songdechakraiwut et al., 2021). Persistent homology represents a network using topological features such as connected components and cycles. Many networks naturally divide into modules or connected components (Bullmore & Sporns, 2009; Honey et al., 2007). Similarly, cycles are ubiquitous and are often used to describe information propagation, robustness and feedback mechanisms (Kwon & Cho, 2007; Lind et al., 2005). Effective use of such topological descriptors requires a notion of proximity that quantifies the similarity between persistence barcodes, a convenient representation for connected components and cycles (Ghrist, 2008).
Wasserstein distance, which measures the minimal effort to modify one persistence barcode into another (Rabin et al., 2011), is an excellent choice due to its appealing geometric properties (Staerman et al., 2021) and its effectiveness in many machine learning applications (Kolouri et al., 2017; Mi et al., 2018; Solomon et al., 2015). Importantly, Wasserstein distance can be used to interpolate networks while preserving topological structure (Songdechakraiwut et al., 2021), and the mean under the Wasserstein distance, known as the Wasserstein barycenter (Agueh & Carlier, 2011), can be viewed as the topological centroid of a set of networks. The high cost of computing persistence barcodes, the Wasserstein distance and the Wasserstein barycenter limits their application to small-scale problems; see, e.g., (Clough et al., 2020; Hu et al., 2019; Kolouri et al., 2017; Mi et al., 2018). Although approximation algorithms have been developed (Cuturi, 2013; Cuturi & Doucet, 2014; Lacombe et al., 2018; Li et al., 2020; Solomon et al., 2015; Vidal et al., 2019; Xie et al., 2020; Ye et al., 2017), it is unclear whether these approximations are effective for clustering complex networks, as they inevitably limit sensitivity to subtle topological features. Indeed, a growing number of studies, see, e.g., (Robins & Turner, 2016; Xia & Wei, 2014), have demonstrated that such subtle topological patterns are important for the characterization of complex networks, suggesting these approximation algorithms are undesirable.

Recently, it was shown that the Wasserstein distance and barycenter for network graphs have closed-form solutions that can be computed exactly and efficiently (Songdechakraiwut et al., 2021) because the persistence barcodes are inherently one-dimensional. Motivated by this result, we present a novel and computationally practical topological clustering method that clusters complex networks of the same size with intricate topological characteristics. Topological information alone is effective at clustering networks when there is no correspondence between nodes in different networks. However, when networks have meaningful node correspondence, we perform cluster analysis using combined topological and geometric information to preserve node correspondence. Statistical validation based on ground truth information is used to demonstrate the effectiveness of our method when discriminating subtle topological features in simulated networks. The method is further illustrated by clustering measured functional brain networks associated with different levels of arousal during administration of general anesthesia. Our proposed method outperforms other clustering approaches on both the simulated and measured data.

The paper is organized as follows. Background on our one-dimensional representation of persistence barcodes is given in Section 2, while Section 3 presents our topological clustering method. In Sections 4 and 5, we compare the performance of our method to several baseline algorithms using simulated and measured networks, and conclude the paper with a brief discussion of the potential impact of this work.

2 ONE-DIMENSIONAL PERSISTENCE BARCODES

2.1 GRAPH FILTRATION

Consider a network represented as a weighted graph G = (V, w) comprising a set of nodes V with symmetric adjacency matrix w = (w_ij), with edge weight w_ij representing the relationship between node i and node j.
The number of nodes is denoted as |V|. The binary graph G_ε = (V, w_ε) of G is defined as a graph consisting of the node set V and binary edge weights w_ε,ij = 1 if w_ij > ε and w_ε,ij = 0 otherwise. We view the binary network G_ε as a 1-skeleton (Munkres, 2018), a simplicial complex comprising only nodes and edges. In the 1-skeleton there are two types of topological features: connected components (0-dimensional topological features) and cycles (1-dimensional topological features). There are no topological features of higher dimension in the 1-skeleton, in contrast to the well-known Rips (Ghrist, 2008) and clique complexes (Otter et al., 2017). The number of connected components and the number of cycles in the binary network are referred to as the 0-th Betti number β0(G_ε) and the 1-st Betti number β1(G_ε), respectively.

A graph filtration of G is defined as a collection of nested binary networks (Lee et al., 2012):

$$ G_{\epsilon_0} \supseteq G_{\epsilon_1} \supseteq \cdots \supseteq G_{\epsilon_k}, $$

where ε_0 ≤ ε_1 ≤ ⋯ ≤ ε_k are filtration values. As ε increases, more and more edges are removed from the network G, since we threshold the edge weights at higher connectivity. For instance, G_{−∞} has each pair of nodes connected by an edge and thus is a complete graph consisting of a single connected component, while G_{∞} has no edges and represents the node set. Figure 1 illustrates the graph filtration of a four-node network and the corresponding Betti numbers.

Figure 1: Graph filtration of a four-node network G. As the filtration value increases, the number of connected components β0 monotonically increases while the number of cycles β1 monotonically decreases. Connected components are born at the edge weights e3, e5, e6 while cycles die at e1, e2, e4.

Note that other filtrations for analyzing graphs have been proposed, including the use of descriptor functions such as heat kernels (Carrière et al., 2020) and task-specific learning (Hofer et al., 2020).
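To make the filtration concrete, the following minimal sketch (our illustration, not the authors' released code; it assumes networkx and a dense symmetric weight matrix w) computes β0 and β1 of the binary network G_ε, using the fact that for a 1-skeleton β1 = |E| − |V| + β0:

```python
import networkx as nx
import numpy as np

def betti_numbers(w, eps):
    """Betti numbers of the binary network G_eps obtained by keeping
    edges with weight strictly greater than eps."""
    w = np.asarray(w)
    n = len(w)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    ii, jj = np.where(np.triu(w, k=1) > eps)      # threshold the upper triangle
    G.add_edges_from(zip(ii.tolist(), jj.tolist()))
    b0 = nx.number_connected_components(G)        # connected components
    b1 = G.number_of_edges() - n + b0             # independent cycles in a 1-skeleton
    return b0, b1
```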
2.2 BIRTH-DEATH DECOMPOSITION

Persistent homology keeps track of the birth and death of connected components and cycles over filtration values ε to deduce their persistence, i.e., the lifetime from birth to death over ε. The persistence is represented as a persistence barcode PB(G) comprising intervals [b_i, d_i] representing the lifetime of a connected component or a cycle that appears at the filtration value b_i and vanishes at d_i. In the edge-weight threshold graph filtration defined in Section 2.1, connected components are born and cycles die as the filtration value increases (Songdechakraiwut et al., 2021). Specifically, β0 is monotonically increasing from β0(G_{−∞}) = 1 to β0(G_{∞}) = |V|. There are β0(G_{∞}) − β0(G_{−∞}) = |V| − 1 connected components that are born over the filtration. Connected components never die once they are born, implying that every connected component has a death value at ∞. Thus, we can represent their persistence as a collection of finite birth values B(G) = {b_1, ..., b_{|V|−1}}. On the other hand, G_{−∞} is a complete graph containing all possible cycles; thus, all cycles have birth values at −∞. Again, we can represent the persistence of the cycles as a collection of finite death values D(G) = {d_i}. How many cycles are there? Since the deletion of an edge w_ij must result in either the birth of a connected component or the death of a cycle, every edge weight must be in either B(G) or D(G). Thus, the edge weight set W = {w_ij | i > j} decomposes into the collection of birth values B(G) and the collection of death values D(G). Since G is a complete graph with |V|(|V| − 1)/2 edge weights and |V| − 1 of these weights are associated with the birth of connected components, the number of cycles in G is thus equal to |V|(|V| − 1)/2 − (|V| − 1) = 1 + |V|(|V| − 3)/2. In the example of Figure 1, we have B(G) = {e3, e5, e6} and D(G) = {e1, e2, e4}. Other graph filtrations (Carrière et al., 2020; Hofer et al., 2020) do not necessarily share this monotonicity property, and consequently one-dimensional barcode representations are not applicable.

Finding the birth values in B(G) is equivalent to finding the edge weights comprising the maximum spanning tree of G and can be done using well-known methods such as Prim's and Kruskal's algorithms (Lee et al., 2012). Once B(G) is known, D(G) is simply given as the remaining edge weights. Finding B(G) and D(G) requires only O(n log n) operations, where n is the number of edges in the network, and thus is extremely computationally efficient.

3 CLUSTERING METHOD

3.1 TOPOLOGICAL DISTANCE SIMPLIFICATION

Use of the edge-weight threshold filtration and limiting consideration to connected components and cycles as topological features results in significant simplification of the 2-Wasserstein distance (Rabin et al., 2011) between barcode descriptors (Cohen-Steiner et al., 2010) of networks as follows. Let G and H be two given networks that have the same number of nodes. The topological distance d_top(G, H) is defined as the optimal matching cost:

$$ d_{top}(G, H) = \min_{\tau} \Big[ \sum_{p \in PB(G)} \| p - \tau(p) \|_2^2 \Big]^{1/2} = \min_{\tau} \Big[ \sum_{p = [b_p, d_p] \in PB(G)} \big( (b_p - b_{\tau(p)})^2 + (d_p - d_{\tau(p)})^2 \big) \Big]^{1/2}, \quad (1) $$

where the optimization is over all possible bijections τ from barcode PB(G) to barcode PB(H). Intuitively, we can think of each interval [b_i, d_i] as a point (b_i, d_i) in a two-dimensional plane. The topological distance measures the minimal amount of work needed to move the points of PB(G) onto those of PB(H). Note that this alternative representation of points in the plane is equivalent to the persistence barcode and is called the persistence diagram (Edelsbrunner & Harer, 2008). Moving a connected component point (b_i, ∞) to a cycle point (−∞, d_j) or vice versa takes an infinitely large amount of work. Thus, we only need to optimize over bijections that match the same type of topological features. Subsequently, we can equivalently rewrite d_top in terms of B(G), D(G), B(H) and D(H) as

$$ d_{top}(G, H) = \Big[ \min_{\tau_0} \sum_{b \in B(G)} \big( b - \tau_0(b) \big)^2 + \min_{\tau_1} \sum_{d \in D(G)} \big( d - \tau_1(d) \big)^2 \Big]^{1/2}, \quad (2) $$

where τ0 is a bijection from B(G) to B(H) and τ1 is a bijection from D(G) to D(H). The first term matches connected components to connected components and the second term matches cycles to cycles. Matching each type of topological feature separately is commonly done in medical imaging and machine learning studies (Clough et al., 2020; Hu et al., 2019).

The topological distance d_top has a closed-form solution that allows for efficient computation as follows (Songdechakraiwut et al., 2021):

$$ d_{top}(G, H) = \Big[ \sum_{b \in B(G)} \big( b - \tau_0^*(b) \big)^2 + \sum_{d \in D(G)} \big( d - \tau_1^*(d) \big)^2 \Big]^{1/2}, \quad (3) $$

where τ0* maps the l-th smallest birth value in B(G) to the l-th smallest birth value in B(H), and τ1* maps the l-th smallest death value in D(G) to the l-th smallest death value in D(H), for all l. A proof is provided in Appendix A. As a result, the optimal matching cost can be computed quickly and efficiently by sorting birth and death values and matching them in order. The computational cost of evaluating d_top is O(n log n), where n is the number of edges in the networks.
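A minimal sketch of the birth-death decomposition and the closed-form distance of Eq. (3) could look as follows. This is our illustration, assuming networkx and dense symmetric weight matrices for two complete networks of equal size, so that the birth and death sets have matching cardinalities; the released code linked in the paper may differ:

```python
import networkx as nx
import numpy as np

def birth_death_decomposition(w):
    """Split the edge weights of a complete weighted graph into sorted
    birth values B(G) (maximum spanning tree weights) and sorted death
    values D(G) (all remaining edge weights)."""
    G = nx.from_numpy_array(np.asarray(w, dtype=float))
    mst = nx.maximum_spanning_tree(G, weight="weight")
    mst_edges = {frozenset(e) for e in mst.edges()}
    births = sorted(d["weight"] for _, _, d in mst.edges(data=True))
    deaths = sorted(d["weight"] for i, j, d in G.edges(data=True)
                    if frozenset((i, j)) not in mst_edges)
    return np.array(births), np.array(deaths)

def d_top(w_g, w_h):
    """Closed-form topological distance of Eq. (3): sort births and
    deaths, then match them in order."""
    bg, dg = birth_death_decomposition(w_g)
    bh, dh = birth_death_decomposition(w_h)
    return np.sqrt(np.sum((bg - bh) ** 2) + np.sum((dg - dh) ** 2))
```

Because the only nontrivial step is the maximum spanning tree plus sorting, the O(n log n) cost quoted above follows directly from this construction.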
3.2 TOPOLOGICAL CLUSTERING

Let G = (V, w) and H = (V, u) be two networks. We define the network dissimilarity d²_net between G and H as a weighted sum of the squared geometric distance and the squared topological distance:

$$ d_{net}^2(G, H) = (1 - \lambda) \sum_{i > j} (w_{ij} - u_{ij})^2 + \lambda \, d_{top}^2(G, H), \quad (4) $$

where λ ∈ [0, 1] controls the relative weight between the geometric and topological terms. The geometric distance measures the node-by-node dissimilarity of the networks that is not captured by topology alone and is helpful when node identity is meaningful, such as in neuroscience applications.

Given observed networks with identical node sets, the goal is to partition the networks into k clusters C = {C_1, ..., C_k} with corresponding cluster centroids or representatives M = {M_1, ..., M_k} such that the sum of the network dissimilarities d²_net from the networks to their representatives is minimized, i.e.,

$$ \min_{C, M} L(C, M) = \min_{C, M} \sum_{h=1}^{k} \sum_{G \in C_h} d_{net}^2(M_h, G). \quad (5) $$

The topological clustering formulation given in (5) suggests a natural iterative relocation algorithm using coordinate descent (Banerjee et al., 2005). In particular, the algorithm alternates between two steps: an assignment step and a re-estimation step. In the assignment step, L is minimized with respect to C while holding M fixed. Minimization of L is achieved simply by assigning each observed network to the cluster whose representative M_h is nearest in terms of the criterion d²_net. In the re-estimation step, the algorithm minimizes L with respect to M while holding C fixed. In this case, L is minimized by re-estimating the representative of each individual cluster:

$$ \min_{M_h} \sum_{G \in C_h} d_{net}^2(M_h, G). \quad (6) $$

We will consider solving the objective function given in (6) for λ = 0, λ = 1 and λ ∈ (0, 1).

λ = 0 describes conventional edge clustering (MacQueen, 1967), since the topological term is excluded.

λ = 1 describes clustering based on pure topology and ignores correspondence of edge weights. This case is potentially useful for clustering networks whose node sets are not identical or are of different size. Each representative M_h of cluster C_h minimizes the sum of the squared topological distances, i.e.,

$$ \min_{M_h} \sum_{G \in C_h} d_{top}^2(M_h, G) = \min_{B(M_h), D(M_h)} \sum_{G \in C_h} \Big[ \sum_{b \in B(M_h)} \big( b - \tau_0^*(b) \big)^2 + \sum_{d \in D(M_h)} \big( d - \tau_1^*(d) \big)^2 \Big]. \quad (7) $$

Thus, we only need to optimize over the topology of the network, i.e., B(M_h) and D(M_h), instead of the original network M_h itself. The topology solving (7) is the topological centroid of the networks in cluster C_h. Interestingly, the topological centroid has a closed-form solution and can be calculated analytically as follows.

Lemma 1. Let G_1, ..., G_n be n networks, each with m nodes. Let B(G_i): b_{i,1} ≤ ⋯ ≤ b_{i,m−1} and D(G_i): d_{i,1} ≤ ⋯ ≤ d_{i,1+m(m−3)/2} be the topology of G_i. It follows that the l-th smallest birth value of the topological centroid of the n networks is given by the mean of the l-th smallest birth values of these networks, i.e., Σ_{i=1}^{n} b_{i,l} / n. Similarly, the l-th smallest death value of the topological centroid is given by Σ_{i=1}^{n} d_{i,l} / n.

Since Eq. (7) is quadratic, Lemma 1 can be proved by setting its derivative equal to zero. The complete proof is given in Appendix A. The results in (Songdechakraiwut & Chung, 2020; Songdechakraiwut et al., 2022) may be used to find the centroid of networks of different size.
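As an illustration of Lemma 1, a sketch of the topological centroid (our code, reusing birth_death_decomposition from the sketch above and assuming equal-size complete networks) is simply an index-wise average of the sorted birth and death values:

```python
import numpy as np

def topological_centroid(weight_matrices):
    """Closed-form topological centroid of a cluster: average the l-th
    smallest births (deaths) across the cluster's networks."""
    decomps = [birth_death_decomposition(w) for w in weight_matrices]
    births = np.mean([b for b, _ in decomps], axis=0)  # already sorted per network
    deaths = np.mean([d for _, d in decomps], axis=0)
    return births, deaths
```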
For the most general case considering both correspondence of edge weights and topology, i.e., when λ ∈ (0, 1), we can optimize the objective function given in (6) by gradient descent. Let H = (V, u) be a cluster representative being estimated given C_h. The gradient ∇_H d²_top(H, G) of the squared topological distance with respect to the edge weights u = (u_ij) is given as a gradient matrix whose ij-th entry is

$$ \frac{\partial d_{top}^2(H, G)}{\partial u_{ij}} = \begin{cases} 2\big(u_{ij} - \tau_0^*(u_{ij})\big) & \text{if } u_{ij} \in B(H); \\ 2\big(u_{ij} - \tau_1^*(u_{ij})\big) & \text{if } u_{ij} \in D(H). \end{cases} \quad (8) $$

This follows because the edge weight set decomposes into the collection of births and the collection of deaths. Intuitively, slightly adjusting the edge weight u_ij slightly adjusts either a birth value in B(H) or a death value in D(H), which in turn slightly changes the topology of the network H. The gradient computation consists of computing persistence barcodes and finding the optimal matching using the closed-form solution given in (3), requiring O(n log n) operations, where n is the number of edges in the networks.

Evaluating the gradient for (6) requires computing the gradients of all the observed networks. This can be computationally demanding when the size of a dataset is large. However, an equivalent minimization problem that allows faster computation is possible using the following result:

Lemma 2.

$$ \min_{H_h} \sum_{G \in C_h} d_{net}^2(H_h, G) \equiv \min_{H_h} \Big[ (1 - \lambda) \sum_{i > j} \big( u_{h,ij} - \bar{w}_{h,ij} \big)^2 + \lambda \, d_{top}^2\big( H_h, \widehat{M}_{top,h} \big) \Big], $$

where w̄_h = (w̄_{h,ij}) are the edge weights of the sample mean network M̄_h = (V, w̄_h) of cluster C_h, and M̂_top,h is the topological centroid of the networks in cluster C_h.

Thus, instead of computing the gradient for every network in the set, it is sufficient to compute the gradient at the cluster sample mean M̄_h and the topological centroid M̂_top,h. Hence, one only needs to perform topological interpolation between the sample mean network M̄_h and the topological centroid M̂_top,h of each cluster. That is, the optimal representative is the one whose geometric location is close to M̄_h and whose topology is similar to M̂_top,h. At each iteration, we take a step in the direction of the negative gradient from the representative H_h obtained at the previous iteration. Note that the proposed method constrains the search for cluster centroids to a space of meaningful networks through topological interpolation. In contrast, the sample mean, such as would be used in a naive application of k-means, does not necessarily represent a meaningful network.

Furthermore, we have the following theorem:

Theorem 1. The topological clustering algorithm monotonically decreases L in (5) and terminates in a finite number of steps at a locally optimal partition.

The proofs for Lemma 2 and Theorem 1 are provided in Appendix A.
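Putting the pieces together, one re-estimation step for λ ∈ (0, 1) can be sketched as follows. This is our illustration of Eq. (8) and Lemma 2, not the authors' implementation; the step size is a hypothetical default and convergence checks are omitted. The sorted births and deaths of the current representative are matched in order to the centroid's, and the resulting topological gradient is scattered back onto the corresponding edges:

```python
import networkx as nx
import numpy as np

def update_representative(u, w_bar, cen_births, cen_deaths, lam=0.5, step=0.05):
    """One gradient-descent step on representative H = (V, u): pulled toward
    the cluster sample mean w_bar (geometric term of Eq. (4)) and toward the
    topological centroid (cen_births, cen_deaths) (topological term, Eq. (8))."""
    G = nx.from_numpy_array(u)
    mst = {frozenset(e) for e in
           nx.maximum_spanning_tree(G, weight="weight").edges()}
    births, deaths = [], []                       # (weight, i, j) triples
    for i, j, d in G.edges(data=True):
        (births if frozenset((i, j)) in mst else deaths).append((d["weight"], i, j))
    grad = 2.0 * (1.0 - lam) * (u - w_bar)        # geometric gradient
    for group, targets in ((births, cen_births), (deaths, cen_deaths)):
        for (wij, i, j), t in zip(sorted(group), np.sort(targets)):
            g = 2.0 * lam * (wij - t)             # Eq. (8) via the order-matching
            grad[i, j] += g
            grad[j, i] += g
    return u - step * grad
```

Alternating this re-estimation step with the nearest-representative assignment under d²_net implements the coordinate-descent scheme described above.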
4 VALIDATION USING SIMULATED NETWORKS

Simulated networks of different topological structure are used to evaluate the clustering performance of the proposed approach relative to that of seven other methods. We use the term topology to denote the proposed approach with λ > 0. Clustering using the special case of λ = 0 is termed the edge-weight method. Recall that the graph filtration defined in Section 2.2 decomposes the edge weights into two groups of birth and death values corresponding to connected components and cycles. In order to assess the impact of separating edge weights into birth and death sets, we evaluate an ad hoc method called sort that applies k-means to feature vectors obtained by simply sorting edge weights. In addition, we evaluate several clustering algorithms that utilize network representations. Two persistent homology clustering methods based on conventional two-dimensional barcodes (Otter et al., 2017) are evaluated. The Wasserstein method performs clustering using k-medoids based on the Wasserstein distance between two-dimensional barcodes for connected components and cycles. The computational complexity of the Wasserstein method is managed by utilizing an approximation algorithm (Lacombe et al., 2018) to compute the Wasserstein distance. A Wasserstein barycenter clustering algorithm called Bregman Alternating Direction Method of Multipliers (B-ADMM) (Ye et al., 2017) is used to cluster two-dimensional barcodes for cycles. In order to make B-ADMM computationally tractable, each barcode is limited to no more than the 50 most persistent cycles. The NetStats clustering approach uses k-means to cluster three-dimensional feature vectors composed of the following network summary statistics: (1) average weighted degree over all nodes (Rubinov & Sporns, 2010), (2) average clustering coefficient over all nodes (Rubinov & Sporns, 2010) and (3) modularity (Newman, 2006; Reichardt & Bornholdt, 2006). Lastly, we evaluate two graph kernel based clustering approaches using kernel k-means (Dhillon et al., 2004): the GraphHopper kernel (Feragen et al., 2013) and the propagation kernel (Neumann et al., 2016). Supplementary description and implementation details of the candidate methods are provided in Appendix B. Initial clusters for all methods are selected at random.

Figure 2: Example networks with |V| = 60 nodes and m = 5 modules exhibit different within-module connection probabilities r = 0.9, 0.8, 0.7 and 0.6.

Modular network structure  Random modular networks X_i are simulated with |V| nodes and m modules such that the nodes are evenly distributed among modules. Figure 2 displays modular networks with |V| = 60 nodes and m = 5 modules such that |V|/m = 12 nodes are in each module. Edges connecting two nodes within the same module are assigned a random weight following a normal distribution N(µ, σ²) with probability r, or otherwise Gaussian noise N(0, σ²) with probability 1 − r. On the other hand, edges connecting nodes in different modules have probability 1 − r of being N(µ, σ²) and probability r of being N(0, σ²). The modular structure becomes more pronounced as the within-module connection probability r increases. Any negative edge weights are set to zero. This procedure yields random networks X_i that exhibit topological connectedness. We use µ = 1 and σ = 0.5 throughout the study. A sketch of this generator is given below.
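This sketch is our reading of the construction just described; the block-wise module assignment and the clipping convention are assumptions:

```python
import numpy as np

def modular_network(n_nodes=60, n_modules=5, r=0.9, mu=1.0, sigma=0.5, seed=None):
    """Random modular network: within-module edges are N(mu, sigma^2) with
    probability r and N(0, sigma^2) otherwise; between-module edges swap
    the two probabilities. Negative weights are set to zero.
    Assumes n_nodes is divisible by n_modules."""
    rng = np.random.default_rng(seed)
    module = np.repeat(np.arange(n_modules), n_nodes // n_modules)
    w = np.zeros((n_nodes, n_nodes))
    for i in range(n_nodes):
        for j in range(i):
            p_signal = r if module[i] == module[j] else 1.0 - r
            mean = mu if rng.random() < p_signal else 0.0
            w[i, j] = w[j, i] = max(rng.normal(mean, sigma), 0.0)
    return w
```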
Simulation  Three groups of modular networks L1 = {X_1, ..., X_20}, L2 = {X_21, ..., X_40} and L3 = {X_41, ..., X_60}, corresponding to m = 2, 3 and 5 modules, respectively, are simulated. This results in 60 networks in the dataset, each of which has a group label L1, L2 or L3. We consider r = 0.9, 0.8, 0.7 and 0.6 to vary the strength of the modular structure, as illustrated in Figure 2. The dataset is partitioned into three clusters C1, C2 and C3 using the candidate algorithms. Clustering performance is then evaluated by first assigning each cluster to the group label that is most frequent in that cluster and then calculating the accuracy statistic s as the fraction of correctly labeled networks, i.e.,

$$ s = \frac{1}{60} \sum_{i=1}^{3} \max_j |C_i \cap L_j|, $$

where |C_i ∩ L_j| denotes the number of networks common to both C_i and L_j. Note that this evaluation of clustering performance is called purity, which not only is transparent and interpretable but also works well in this simulation study, where the number of clusters is small and the cluster sizes are balanced (Manning et al., 2008).

Since the distribution of the accuracy s is unknown, a permutation test is used to determine the empirical distribution under the null hypothesis that sample networks and their group labels are independent (Ojala & Garriga, 2010). The empirical distribution is calculated by repeatedly shuffling the group labels and then re-computing the corresponding accuracy for one million random permutations. By comparing the observed accuracy to this empirical distribution, we can determine the statistical significance of the clustering performance. The p-value is calculated as the fraction of permutations that give accuracy higher than the observed accuracy s. The average p-value and average accuracy across 100 different initial assignments are reported.

Figure 3: Clustering performance comparison for the dataset of simulated networks with |V| = 60 nodes and m = 2, 3, 5 modules with respect to average accuracy (left) and average p-values (right). Results for within-module connection probabilities r = 0.6, 0.7, 0.8 and 0.9 are shown. Data points (middle horizontal lines) indicate the averages over results for 100 different initial conditions, and vertical error bars indicate standard deviations.

Figure 4: Clustering performance for simulated networks with |V| = 60 nodes and m = 2, 3, 5 modules as a function of λ for within-module connection probabilities r = 0.9 (top row) and r = 0.6 (bottom row).

Results  Figure 3 indicates that all methods considered perform relatively well with pronounced modularity (r = 0.9). As the modularity strength diminishes with decreasing r, clustering performance also decreases. The proposed topological clustering algorithm (λ = 1) performs relatively well for the more nuanced modularity associated with r = 0.7 and 0.6. Since the dataset is purposefully generated to exhibit dependency between the sample networks and their group labels, the p-value indicates the degree of statistical significance to which structure is differentiated (Ojala & Garriga, 2010). Small p-values indicate the algorithm is able to differentiate network structure. The proposed method has very low p-values, indicating that its improved accuracy over the baseline methods is significant. The Wasserstein and B-ADMM methods require very high computation costs for the conventional two-dimensional barcodes (Otter et al., 2017), even when using the approximations of the Wasserstein distance (Lacombe et al., 2018) and barycenter (Ye et al., 2017). The ad hoc sort method performs nearly identically to the topology approach, showing that the separation into births of connected components and deaths of cycles has no apparent impact for these simulated networks with densely connected structure. However, we hypothesize the separation will impact performance when networks are sparse. Supplementary experimental results, including an additional set of module sizes m = 2, 5 and 10, are provided in Appendix B.

Figure 4 displays topological clustering performance as a function of λ for the strongest (r = 0.9) and weakest (r = 0.6) degrees of modularity. The performance is not very sensitive to the value of λ. When the within-module connection probability is small (r = 0.6), increasing the weight of the topological distance by increasing λ results in the best performance. We hypothesize that in this case the relatively strong random nature of the edge weights introduces noise into the geometric term that hinders performance as λ decreases. Conversely, the topological distance appears less sensitive to this particular edge weight noise, resulting in the best performance when λ = 1.
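The purity score and permutation test used in this evaluation can be sketched as follows (our illustration; clusters and labels are integer arrays, and the permutation count is reduced here for speed relative to the one million permutations used in the paper):

```python
import numpy as np

def purity(clusters, labels):
    """Fraction of networks whose cluster's majority label matches their own."""
    total = 0
    for c in np.unique(clusters):
        _, counts = np.unique(labels[clusters == c], return_counts=True)
        total += counts.max()             # most frequent label within cluster c
    return total / len(labels)

def permutation_p_value(clusters, labels, n_perm=10_000, seed=0):
    """Fraction of label shufflings whose purity exceeds the observed purity."""
    rng = np.random.default_rng(seed)
    observed = purity(clusters, labels)
    exceed = sum(purity(clusters, rng.permutation(labels)) > observed
                 for _ in range(n_perm))
    return exceed / n_perm
```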
5 APPLICATION TO FUNCTIONAL BRAIN NETWORKS

Dataset  We evaluate our method using an extended brain network dataset from the anesthesia study reported by Banks et al. (2020) (see Appendix C). The measured brain networks are based on the alpha band (8-12 Hz) weighted phase lag index (Vinck et al., 2011) applied to 10-second segments of resting-state intracranial electroencephalography recordings from eleven neurosurgical patients administered increasing doses of the general anesthetic propofol just prior to surgery. The network size varies from 89 to 199 nodes, while the number of networks (10-second segments) per subject varies from 71 to 119. Each segment is labeled as one of three arousal states: pre-drug wake (W), sedated/responsive (S), or unresponsive (U); these labels are used as ground truth in the cluster analysis. Figure 5 illustrates sample mean networks and Betti plots describing topology for a representative subject.

Figure 5: Representative data from a single subject (ID R376). For this subject there are 36 measured networks in each of three conditions: wake, sedated, and unresponsive states, for a total of 108 networks. (a) Sample mean networks during wake, sedated and unresponsive states of the subject computed using ground truth labels. (b) Betti plots based on ground truth labels. Thick lines represent topological centroids. Shaded areas around the centroids represent standard deviation.

Performance evaluation  We apply the adjusted Rand index (ARI) (Hubert & Arabie, 1985) to compare clustering performance against ground truth. Lower scores indicate less similarity, while higher scores indicate higher similarity between estimated and ground-truth clusters. Perfect agreement is scored as 1.0. For each subject, we calculate these performance metrics by running the algorithm for 100 different initial conditions, resulting in 100 scores which are then averaged. We also average across subjects (11 × 100 scores) to obtain a final overall score, which describes the overall output clusters across trials and subjects. We calculate average confusion matrices for each method by assigning each cluster to the state that is most frequent in that cluster.

Figure 6: Average ARI per subject (gray lines) and across all eleven subjects (red line) as a function of λ.

Cluster analysis  All baselines used in the simulation study are evaluated on the brain network dataset. In addition, we consider clustering using k-medoids and three previously proposed network distance measures for brain network analyses: Gromov-Hausdorff (GH) (Lee et al., 2012); Kolmogorov-Smirnov (KS) (Chung et al., 2019); and the spectral norm (Banks et al., 2020). Supplementary description of the three methods is provided in Appendix B. Initial clusters are selected at random for all methods. Figure 6 reports the ARI of the topological approach for multiple predefined values of λ, including λ = 0 and λ = 1. The relative performance of topological clustering is reported in Figures 7 and 8 assuming λ = 0.5, which results in equal weighting of topological and geometric distances.
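The evaluation protocol can be sketched with scikit-learn's adjusted_rand_score; cluster_fn below is a placeholder for any of the candidate clustering algorithms, not a function from the paper:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def average_ari(networks, true_labels, cluster_fn, k=3, n_init=100):
    """Mean and std of ARI over repeated runs with random initial clusters."""
    scores = [adjusted_rand_score(true_labels, cluster_fn(networks, k))
              for _ in range(n_init)]
    return float(np.mean(scores)), float(np.std(scores))
```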
Results  Figure 6 indicates that the performance on the brain network dataset varies significantly across subjects but, in general, varies stably with λ. The best performance occurs with λ between 0.4 and 0.7, suggesting that a combination of geometric and topological distances gives the best result and that performance is not sensitive to the exact value chosen for λ. Note that different regions of the brain are functionally differentiated, so it is plausible that topological changes due to changing arousal level vary with location, giving geometric distance a role in the clustering of brain networks. Prior expectations of this sort can be used to determine whether to cluster based only on topology (λ = 1) or a combination of geometry and topology.

Figure 7: ARI performance for the brain network dataset. Data points (middle horizontal lines) indicate averages over 100 random initial conditions, and error bars indicate standard deviations.

Figure 8: Average confusion matrices over 100 random initializations for select methods for subject R376 (see Figure 5). Results for the other methods are provided in Appendix B.

Figure 7 compares the ARI performance metric for individual subjects and the combination of all subjects. All methods demonstrate significant variability in individual subject performance. This is expected due to variations in effective signal-to-noise ratio, the number of nodes, and the number of networks in each underlying state. Consequently, the combined performance across subjects has high variability. The topological method with λ = 0.5 performs relatively well across all subjects. Figure 8 illustrates that the proposed approach is particularly effective at separating wake (W) and unresponsive (U) states. Transitioning from the wake state to the unresponsive state results in dramatic changes in brain connectivity (Banks et al., 2020). The majority of errors from the proposed topological clustering approach are associated with the natural overlap between wake and sedated (S) states. The sedated brain, in which subjects have been administered propofol but are still conscious, is expected to be more like the wake brain than the unresponsive brain. Thus, errors are expected to be more likely in differentiating sedated and wake states than sedated and unresponsive states. This suggests that the types of errors observed in the proposed method are consistent with prior biological expectations.

Impact  The demonstrated effectiveness and computational elegance of our approach to clustering networks based on topological similarity will have a high impact on the analysis of large and complex network representations. In the study of brain networks, algorithms that can demonstrate correlates of behavioral states are of considerable interest. The derived biomarkers of changes in arousal state presented here demonstrate potential for addressing the important clinical problem of passively assessing arousal state in clinical settings, e.g., monitoring depth of anesthesia and establishing diagnosis and prognosis for patients with traumatic brain injury and other disorders of consciousness. More broadly, the algorithm presented here will contribute to elucidating the neural basis of consciousness, one of the most important open problems in biomedical science.
ACKNOWLEDGMENTS

This work is supported in part by the National Institute of General Medical Sciences under award R01 GM109086, and the Lynn H. Matthias Professorship from the University of Wisconsin.

REFERENCES

Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.

Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, Joydeep Ghosh, and John Lafferty. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(10), 2005.

Matthew I Banks, Bryan M Krause, Christopher M Endemann, Declan I Campbell, Christopher K Kovach, Mark Eric Dyken, Hiroto Kawasaki, and Kirill V Nourski. Cortical functional connectivity indexes arousal state during sleep and anesthesia. NeuroImage, 211:116627, 2020.

Serguei Barannikov. The framed Morse complex and its invariants. Advances in Soviet Mathematics, American Mathematical Society, 1994.

Alain Barrat, Marc Barthelemy, Romualdo Pastor-Satorras, and Alessandro Vespignani. The architecture of complex weighted networks. Proceedings of the National Academy of Sciences, 101(11):3747–3752, 2004.

Karsten Borgwardt, Elisabetta Ghisu, Felipe Llinares-López, Leslie O'Bray, and Bastian Alexander Rieck. Graph kernels: State-of-the-art and future challenges. Foundations and Trends in Machine Learning, 13(5-6):531–712, 2020.

Ed Bullmore and Olaf Sporns. Complex brain networks: Graph theoretical analysis of structural and functional systems. Nature Reviews Neuroscience, 10(3):186–198, 2009.

Mathieu Carrière, Frédéric Chazal, Yuichi Ike, Théo Lacombe, Martin Royer, and Yuhei Umeda. PersLay: A neural network layer for persistence diagrams and new graph topological signatures. In International Conference on Artificial Intelligence and Statistics, pp. 2786–2796. PMLR, 2020.

Moo K Chung, Hyekyoung Lee, Alex DiChristofano, Hernando Ombao, and Victor Solo. Exact topological inference of the resting-state brain networks in twins. Network Neuroscience, 3(3):674–694, 2019.

James Clough, Nicholas Byrne, Ilkay Oksuz, Veronika A Zimmer, Julia A Schnabel, and Andrew King. A topological loss function for deep-learning based image segmentation using persistent homology. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

David Cohen-Steiner, Herbert Edelsbrunner, John Harer, and Yuriy Mileyko. Lipschitz functions have Lp-stable persistence. Foundations of Computational Mathematics, 2010.

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 26:2292–2300, 2013.

Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pp. 685–693, 2014.

Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: Spectral clustering and normalized cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 551–556, 2004.

Herbert Edelsbrunner and John Harer. Persistent homology - a survey. Contemporary Mathematics, 453:257–282, 2008.

Herbert Edelsbrunner, David Letscher, and Afra Zomorodian. Topological persistence and simplification. In Proceedings 41st Annual Symposium on Foundations of Computer Science, pp. 454–463. IEEE, 2000.

Aasa Feragen, Niklas Kasenburg, Jens Petersen, Marleen de Bruijne, and Karsten Borgwardt. Scalable kernels for graphs with continuous attributes. Advances in Neural Information Processing Systems, 26, 2013.
Robert Ghrist. Barcodes: The persistent topology of data. Bulletin of the American Mathematical Society, 45(1):61–75, 2008.

Christoph Hofer, Florian Graf, Bastian Rieck, Marc Niethammer, and Roland Kwitt. Graph filtration learning. In International Conference on Machine Learning, pp. 4314–4323. PMLR, 2020.

Christopher J Honey, Rolf Kötter, Michael Breakspear, and Olaf Sporns. Network structure of cerebral cortex shapes functional connectivity on multiple time scales. Proceedings of the National Academy of Sciences, 104(24):10240–10245, 2007.

Xiaoling Hu, Fuxin Li, Dimitris Samaras, and Chao Chen. Topology-preserving deep image segmentation. Advances in Neural Information Processing Systems, 32, 2019.

Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.

Soheil Kolouri, Se Rim Park, Matthew Thorpe, Dejan Slepcev, and Gustavo K Rohde. Optimal mass transport: Signal processing and machine-learning applications. IEEE Signal Processing Magazine, 34(4):43–59, 2017.

Yung-Keun Kwon and Kwang-Hyun Cho. Analysis of feedback loops and robustness in network evolution based on Boolean models. BMC Bioinformatics, 8(1):1–9, 2007.

Théo Lacombe, Marco Cuturi, and Steve Oudot. Large scale computation of means and clusters for persistence diagrams using optimal transport. Advances in Neural Information Processing Systems, 2018.

Hyekyoung Lee, Hyejin Kang, Moo K Chung, Bung-Nyun Kim, and Dong Soo Lee. Persistent brain network homology from the perspective of dendrogram. IEEE Transactions on Medical Imaging, 31(12):2267–2277, 2012.

Lingxiao Li, Aude Genevay, Mikhail Yurochkin, and Justin M Solomon. Continuous regularized Wasserstein barycenters. Advances in Neural Information Processing Systems, 33, 2020.

Pedro G Lind, Marta C González, and Hans J Herrmann. Cycles and clustering in bipartite networks. Physical Review E, 72(5):056127, 2005.

James MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.

Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Liang Mi, Wen Zhang, Xianfeng Gu, and Yalin Wang. Variational Wasserstein clustering. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 322–337, 2018.

Dragoslav S Mitrinovic. Analytic Inequalities. Springer, 1970.

James R Munkres. Elements of Algebraic Topology. CRC Press, 2018.

Marion Neumann, Roman Garnett, Christian Bauckhage, and Kristian Kersting. Propagation kernels: Efficient graph kernels from propagated information. Machine Learning, 102(2):209–245, 2016.

Mark EJ Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, 2006.

Markus Ojala and Gemma C Garriga. Permutation tests for studying classifier performance. Journal of Machine Learning Research, 11(6), 2010.

Nina Otter, Mason A Porter, Ulrike Tillmann, Peter Grindrod, and Heather A Harrington. A roadmap for the computation of persistent homology. EPJ Data Science, 6:1–38, 2017.

Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pp. 435–446. Springer, 2011.

Jörg Reichardt and Stefan Bornholdt. Statistical mechanics of community detection. Physical Review E, 74(1):016110, 2006.
Vanessa Robins and Katharine Turner. Principal component analysis of persistent homology rank functions with case studies of spatial point patterns, sphere packing and colloids. Physica D: Nonlinear Phenomena, 334:99–117, 2016.

Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878–1915, 2011.

Mikail Rubinov and Olaf Sporns. Complex network measures of brain connectivity: Uses and interpretations. NeuroImage, 52(3):1059–1069, 2010.

Justin Solomon, Fernando De Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):1–11, 2015.

Tananun Songdechakraiwut and Moo K Chung. Topological learning for brain networks. arXiv preprint arXiv:2012.00675, 2020.

Tananun Songdechakraiwut, Li Shen, and Moo Chung. Topological learning and its application to multimodal brain network integration. In 24th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2021.

Tananun Songdechakraiwut, Bryan M Krause, Matthew I Banks, Kirill V Nourski, and Barry D Van Veen. Topological classification in a Wasserstein distance based vector space. arXiv preprint arXiv:2202.01275, 2022.

Guillaume Staerman, Pierre Laforgue, Pavlo Mozharovskyi, and Florence d'Alché-Buc. When OT meets MoM: Robust estimation of Wasserstein distance. In International Conference on Artificial Intelligence and Statistics, pp. 136–144, 2021.

Katharine Turner, Yuriy Mileyko, Sayan Mukherjee, and John Harer. Fréchet means for distributions of persistence diagrams. Discrete & Computational Geometry, 52(1):44–70, 2014.

Jules Vidal, Joseph Budin, and Julien Tierny. Progressive Wasserstein barycenters of persistence diagrams. IEEE Transactions on Visualization and Computer Graphics, 26(1):151–161, 2019.

Martin Vinck, Robert Oostenveld, Marijn van Wingerden, Francesco Battaglia, and Cyriel MA Pennartz. An improved index of phase-synchronization for electrophysiological data in the presence of volume-conduction, noise and sample-size bias. NeuroImage, 55(4):1548–1565, 2011.

Kelin Xia and Guo-Wei Wei. Persistent homology analysis of protein structure, flexibility, and folding. International Journal for Numerical Methods in Biomedical Engineering, 30(8):814–844, 2014.

Yujia Xie, Xiangfeng Wang, Ruijia Wang, and Hongyuan Zha. A fast proximal point method for computing exact Wasserstein distance. In Uncertainty in Artificial Intelligence, pp. 433–453, 2020.

Rui Xu and Donald Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645–678, 2005.

Jianbo Ye, Panruo Wu, James Z Wang, and Jia Li. Fast discrete distribution clustering using Wasserstein barycenter with sparse support. IEEE Transactions on Signal Processing, 65(9):2317–2332, 2017.

Hao Yin, Austin R Benson, Jure Leskovec, and David F Gleich. Local higher-order graph clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 555–564, 2017.
A.1 CLOSED-FORM SOLUTION OF TOPOLOGICAL DISTANCE

The topological distance d_top has the closed-form solution

$$ d_{top}(G, H) = \Big[ \sum_{b \in B(G)} \big( b - \tau_0^*(b) \big)^2 + \sum_{d \in D(G)} \big( d - \tau_1^*(d) \big)^2 \Big]^{1/2}, $$

where τ0* maps the l-th smallest birth value in B(G) to the l-th smallest birth value in B(H), and τ1* maps the l-th smallest death value in D(G) to the l-th smallest death value in D(H), for all l.

Proof. Let G and H be networks with m nodes. The topological distance is originally defined as

$$ d_{top}(G, H) = \Big[ \min_{\tau_0} \sum_{b \in B(G)} \big( b - \tau_0(b) \big)^2 + \min_{\tau_1} \sum_{d \in D(G)} \big( d - \tau_1(d) \big)^2 \Big]^{1/2}. $$

We rewrite the first term as follows:

$$ \min_{\tau_0} \sum_{b \in B(G)} \big( b - \tau_0(b) \big)^2 = \min_{\tau_0} \sum_{b \in B(G)} \big( b^2 - 2 b \tau_0(b) + \tau_0^2(b) \big) = C_1 - 2 \max_{\tau_0} \sum_{b \in B(G)} b \, \tau_0(b), $$

where C_1 is a constant. Note that Σ_{b ∈ B(G)} τ0²(b) is equivalent to Σ_{b ∈ B(H)} b², which is a constant over all bijections τ0. Similarly, for the second term we have

$$ \min_{\tau_1} \sum_{d \in D(G)} \big( d - \tau_1(d) \big)^2 = C_2 - 2 \max_{\tau_1} \sum_{d \in D(G)} d \, \tau_1(d). $$

Let B(G): x_1 ≤ ⋯ ≤ x_{m−1} and B(H): y_1 ≤ ⋯ ≤ y_{m−1} be the collections of birth values for G and H, respectively. By the rearrangement inequality (Mitrinovic, 1970), it is known that

$$ x_1 y_1 + \cdots + x_{m-1} y_{m-1} \;\geq\; x_{\pi(1)} y_1 + \cdots + x_{\pi(m-1)} y_{m-1} $$

for every permutation x_{π(1)}, ..., x_{π(m−1)} of x_1, ..., x_{m−1}. Thus, the optimal bijection τ0* maps the l-th smallest birth value in B(G) to the l-th smallest birth value in B(H). Similarly, for the second term, τ1* maps the l-th smallest death value in D(G) to the l-th smallest death value in D(H). ∎

A.2 CLOSED-FORM SOLUTION OF TOPOLOGICAL CENTROID

Lemma 1. Let G_1, ..., G_n be n networks, each with m nodes. Let B(G_i): b_{i,1} ≤ ⋯ ≤ b_{i,m−1} and D(G_i): d_{i,1} ≤ ⋯ ≤ d_{i,1+m(m−3)/2} be the topology of G_i. It follows that the l-th smallest birth value of the topological centroid of the n networks is given by the mean of the l-th smallest birth values of these networks, i.e., Σ_{i=1}^{n} b_{i,l} / n. Similarly, the l-th smallest death value of the topological centroid is given by Σ_{i=1}^{n} d_{i,l} / n.

Proof. Let M be a cluster representative. Recall that the topological centroid minimizes the sum of the squared topological distances, i.e.,

$$ \min_{M} \sum_{i=1}^{n} d_{top}^2(M, G_i) = \min_{B(M), D(M)} \sum_{i=1}^{n} \Big[ \sum_{b \in B(M)} \big( b - \tau_0^*(b) \big)^2 + \sum_{d \in D(M)} \big( d - \tau_1^*(d) \big)^2 \Big]. $$

Let B(M): γ_1 ≤ ⋯ ≤ γ_{m−1} and D(M): δ_1 ≤ ⋯ ≤ δ_{1+m(m−3)/2} be the topology of M. The first term can be expanded as

$$ \sum_{i=1}^{n} \sum_{b \in B(M)} \big( b - \tau_0^*(b) \big)^2 = \sum_{i=1}^{n} \Big( [\gamma_1 - b_{i,1}]^2 + \cdots + [\gamma_{m-1} - b_{i,m-1}]^2 \Big), $$

which is quadratic and thus can be solved by setting its derivative equal to zero. The solution is given by γ̂_l = Σ_{i=1}^{n} b_{i,l} / n. Similarly, the second term can be expanded as

$$ \sum_{i=1}^{n} \Big( [\delta_1 - d_{i,1}]^2 + \cdots + [\delta_{1+m(m-3)/2} - d_{i,1+m(m-3)/2}]^2 \Big), $$

which is again quadratic. By setting its derivative equal to zero, we find the minimum at δ̂_l = Σ_{i=1}^{n} d_{i,l} / n. ∎
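The rearrangement-inequality argument above can be checked numerically with a brute-force comparison over all bijections; this small sanity check is ours and not part of the paper:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.random(6), rng.random(6)            # two sets of birth (or death) values
brute = min(np.sum((x - y[list(p)]) ** 2)      # exhaustive search over bijections
            for p in itertools.permutations(range(6)))
closed = np.sum((np.sort(x) - np.sort(y)) ** 2)
assert np.isclose(brute, closed)               # sorted matching attains the minimum
```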
A.3 TOPOLOGICAL INTERPOLATION

Lemma 2.

$$ \min_{\Theta_h} \sum_{G \in C_h} d_{net}^2(\Theta_h, G) \equiv \min_{\Theta_h} \Big[ (1 - \lambda) \sum_{i > j} \big( \theta_{h,ij} - \bar{w}_{h,ij} \big)^2 + \lambda \, d_{top}^2\big( \Theta_h, \widehat{M}_{top,h} \big) \Big], \quad (10) $$

where w̄_h = (w̄_{h,ij}) are the edge weights of the sample mean network M̄_h = (V, w̄_h) of cluster C_h, and M̂_top,h is the topological centroid of the networks in cluster C_h.

Proof. Let G = (V, w) ∈ C and Θ = (V, θ). Denote the cardinality of C by m. We will show that

$$ \min_{\Theta} \sum_{G \in C} d_{net}^2(\Theta, G) \equiv \min_{\Theta} \Big[ (1 - \lambda) \sum_{i > j} (\theta_{ij} - \bar{w}_{ij})^2 + \lambda \, d_{top}^2(\Theta, \widehat{M}_{top}) \Big]. $$

We first rewrite the geometric term as follows:

$$ \sum_{G \in C} \sum_{i > j} (\theta_{ij} - w_{ij})^2 = \sum_{i > j} \sum_{G \in C} \big( \theta_{ij}^2 - 2 \theta_{ij} w_{ij} + w_{ij}^2 \big) = \sum_{i > j} \Big( m \theta_{ij}^2 - 2 \theta_{ij} \sum_{G \in C} w_{ij} \Big) + C_1 = m \sum_{i > j} \Big( \theta_{ij} - \sum_{G \in C} w_{ij} / m \Big)^2 + C_2 = m \sum_{i > j} (\theta_{ij} - \bar{w}_{ij})^2 + C_2, $$

where C_1 and C_2 are constants. Similarly, for the topological term we have

$$ \sum_{G \in C} d_{top}^2(\Theta, G) = \sum_{G \in C} \Big[ \sum_{b \in B(\Theta)} \big( b - \tau_0^*(b) \big)^2 + \sum_{d \in D(\Theta)} \big( d - \tau_1^*(d) \big)^2 \Big] = \sum_{b \in B(\Theta)} \Big( m b^2 - 2 b \sum_{G \in C} \tau_0^*(b) \Big) + \sum_{d \in D(\Theta)} \Big( m d^2 - 2 d \sum_{G \in C} \tau_1^*(d) \Big) + C_3 = m \sum_{b \in B(\Theta)} \Big( b - \sum_{G \in C} \tau_0^*(b) / m \Big)^2 + m \sum_{d \in D(\Theta)} \Big( d - \sum_{G \in C} \tau_1^*(d) / m \Big)^2 + C_4 = m \, d_{top}^2(\Theta, \widehat{M}_{top}) + C_4, $$

where the last equality follows from Lemma 1. Therefore,

$$ \min_{\Theta} \sum_{G \in C} d_{net}^2(\Theta, G) = \min_{\Theta} \Big[ (1 - \lambda) \Big( m \sum_{i > j} (\theta_{ij} - \bar{w}_{ij})^2 + C_2 \Big) + \lambda \Big( m \, d_{top}^2(\Theta, \widehat{M}_{top}) + C_4 \Big) \Big] = \min_{\Theta} \Big[ m \Big( (1 - \lambda) \sum_{i > j} (\theta_{ij} - \bar{w}_{ij})^2 + \lambda \, d_{top}^2(\Theta, \widehat{M}_{top}) \Big) + C_5 \Big] \equiv \min_{\Theta} \Big[ (1 - \lambda) \sum_{i > j} (\theta_{ij} - \bar{w}_{ij})^2 + \lambda \, d_{top}^2(\Theta, \widehat{M}_{top}) \Big]. \; \blacksquare $$

A.4 CONVERGENCE TO A LOCALLY OPTIMAL PARTITION

Theorem 1. The topological clustering algorithm monotonically decreases L and terminates in a finite number of steps at a locally optimal partition.

Proof. Let C^(t) = {C_h^(t)} for h = 1, ..., k be a partition of the observed networks with corresponding representatives M^(t) = {M_h^(t)} after the t-th iteration. Then

$$ L(C^{(t)}, M^{(t)}) = \sum_{h=1}^{k} \sum_{G \in C_h^{(t)}} d_{net}^2(M_h^{(t)}, G) \;\overset{(a)}{\geq}\; \sum_{h=1}^{k} \sum_{G \in C_h^{(t+1)}} d_{net}^2(M_h^{(t)}, G) \;\overset{(b)}{\geq}\; \sum_{h=1}^{k} \sum_{G \in C_h^{(t+1)}} d_{net}^2(M_h^{(t+1)}, G) = L(C^{(t+1)}, M^{(t+1)}), $$

where (a) follows from the assignment step and (b) follows from applying Lemma 1 (when λ = 1) and Lemma 2 (when λ ∈ (0, 1) and the step size in gradient descent is sufficiently small) to the re-estimation step. Note that λ = 0 describes conventional edge clustering (MacQueen, 1967). This result shows that the algorithm monotonically decreases L. Since the number of distinct partitions is finite and L is monotonically decreasing, eventually L is not decreased by either (a) the assignment step or (b) the re-estimation step, and the algorithm terminates. ∎

B EXPERIMENTAL DETAILS

B.1 SUPPLEMENTARY DESCRIPTION FOR CANDIDATE METHODS

METHODS BASED ON TWO-DIMENSIONAL BARCODES

The Wasserstein method clusters conventional two-dimensional barcodes representing connected components and cycles. To this end, two separate Wasserstein distances are needed to match connected components to connected components and cycles to cycles. The Wasserstein method performs clustering using k-medoids based on the sum of the two Wasserstein distances for connected components and cycles. Bregman Alternating Direction Method of Multipliers (B-ADMM) (Ye et al., 2017) is a Wasserstein barycenter based algorithm for clustering discrete distributions. We follow the standard approach (Turner et al., 2014) of representing the underlying discrete distributions on two-dimensional barcodes in the form of Dirac masses. The high computational complexity of B-ADMM is managed by limiting the clustering to no more than the 50 most persistent cycles.

The two methods require computation of two-dimensional persistence barcodes from networks. We compute a two-dimensional persistence barcode using the approach of Otter et al. (2017), in which edge weights are inverted via the function f(w) = 1/(1 + w). Then a point cloud is obtained from the shortest path distance between nodes. Finally, we compute the Rips filtration (Ghrist, 2008) and its corresponding persistence barcode from the point cloud.

METHODS BASED ON GRAPH KERNELS

The two graph kernel methods based on GraphHopper and propagation kernels require continuous node attributes. We follow the experimental protocol of Borgwardt et al. (2020), in which a node attribute is set to the sum of the edge weights incident to the node.

METHODS BASED ON NETWORK DISTANCE MEASURES FOR BRAIN NETWORK ANALYSES

Gromov-Hausdorff (GH) distance (Lee et al., 2012) compares dendrogram shape differences in brain networks. GH distance requires brain networks that exhibit metric spaces. To this end, we follow the approach of Otter et al. (2017) for converting a brain network to a metric space: the metric space is obtained by first inverting edge weights via the function f(w) = 1/(1 + w) and then using the shortest path distance between nodes as the metric. A sketch of this conversion is given below.
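This is our illustration of the conversion, assuming networkx and a dense connected weight matrix; the inversion makes strongly connected node pairs close in the resulting metric:

```python
import networkx as nx
import numpy as np

def network_to_metric(w):
    """Invert edge weights with f(w) = 1/(1 + w), then use shortest path
    lengths as the pairwise metric (e.g., as input to a Rips filtration)."""
    G = nx.from_numpy_array(np.asarray(w, dtype=float))
    for _, _, d in G.edges(data=True):
        d["dist"] = 1.0 / (1.0 + d["weight"])   # strong edges become short
    n = G.number_of_nodes()
    dist = np.zeros((n, n))
    for i, lengths in nx.all_pairs_dijkstra_path_length(G, weight="dist"):
        for j, ell in lengths.items():
            dist[i, j] = ell
    return dist
```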
Kolmogorov-Smirnov (KS) distance (Chung et al., 2019) compares topological differences in brain networks using their Betti numbers. Two separate KS distances are needed to compare 0-th Betti numbers to 0-th Betti numbers and 1-st Betti numbers to 1-st Betti numbers. The sum of the two KS distances is used to compare brain networks.

We follow the approach of Banks et al. (2020) of comparing brain networks using the operator or spectral norm of the difference between network adjacency matrices.

B.2 IMPLEMENTATION OF CANDIDATE METHODS

For all the baseline methods, we used existing implementations from the authors' publications and publicly available repository websites. We used the parameters recommended in the public code without any modification. Code for the Wasserstein distance approximation (Lacombe et al., 2018) is available at https://gudhi.inria.fr/. Code for the Bregman Alternating Direction Method of Multipliers (B-ADMM) method (Ye et al., 2017) is available at https://github.com/bobye/WBC_Matlab. Code for the three network summary statistics used in the NetStats method is available at https://sites.google.com/site/bctnet/. Code for the GraphHopper and propagation graph kernels is available at https://github.com/ysig/GraKeL.

B.3 SUPPLEMENTARY RESULTS FOR SIMULATED AND MEASURED DATASETS

For simulated networks, a clustering performance comparison is reported in Supplementary Table 1. For measured brain networks, confusion matrices of all the candidate methods for subject R376 (see Supplementary Table 2) are reported in Supplementary Figure 1.

Supplementary Figure 1: Average confusion matrices over 100 random initializations for subject R376. W: wake, S: sedated, U: unresponsive.
Supplementary Table 1: Clustering performance comparison for simulated networks with |V| = 60 nodes: (a) average accuracy and (b) average p-values for various parameter settings of m (number of modules) and r (within-module connection probability). Entries are mean ± standard deviation over 100 initial conditions.

| m | r | Top (λ = 1) | Edge-weight | Sort | Wasserstein | B-ADMM | NetStats | GraphHopper | Propagation |
|---|---|---|---|---|---|---|---|---|---|
| 2, 3, 5 | 0.6 | 0.75 ± 0.06 | 0.46 ± 0.04 | 0.76 ± 0.06 | 0.57 ± 0.06 | 0.61 ± 0.05 | 0.75 ± 0.06 | 0.69 ± 0.08 | 0.60 ± 0.12 |
| | 0.7 | 0.95 ± 0.06 | 0.62 ± 0.13 | 0.94 ± 0.08 | 0.77 ± 0.08 | 0.81 ± 0.08 | 0.96 ± 0.02 | 0.90 ± 0.12 | 0.75 ± 0.13 |
| | 0.8 | 0.97 ± 0.10 | 0.75 ± 0.17 | 0.97 ± 0.10 | 0.88 ± 0.12 | 0.92 ± 0.08 | 0.98 ± 0.09 | 0.98 ± 0.09 | 0.80 ± 0.16 |
| | 0.9 | 0.94 ± 0.13 | 0.81 ± 0.17 | 0.93 ± 0.14 | 0.90 ± 0.15 | 0.96 ± 0.09 | 0.95 ± 0.12 | 0.97 ± 0.13 | 0.82 ± 0.20 |
| 2, 5, 10 | 0.6 | 0.78 ± 0.07 | 0.44 ± 0.05 | 0.78 ± 0.07 | 0.63 ± 0.06 | 0.66 ± 0.05 | 0.77 ± 0.07 | 0.73 ± 0.11 | 0.65 ± 0.14 |
| | 0.7 | 0.85 ± 0.13 | 0.59 ± 0.10 | 0.85 ± 0.12 | 0.81 ± 0.09 | 0.86 ± 0.07 | 0.89 ± 0.10 | 0.85 ± 0.21 | 0.81 ± 0.12 |
| | 0.8 | 0.90 ± 0.14 | 0.72 ± 0.13 | 0.89 ± 0.15 | 0.88 ± 0.14 | 0.91 ± 0.13 | 0.91 ± 0.14 | 0.85 ± 0.26 | 0.85 ± 0.15 |
| | 0.9 | 0.92 ± 0.14 | 0.78 ± 0.15 | 0.92 ± 0.14 | 0.92 ± 0.14 | 0.91 ± 0.15 | 0.90 ± 0.15 | 0.73 ± 0.33 | 0.85 ± 0.17 |

(a) Average accuracy

| m | r | Top (λ = 1) | Edge-weight | Sort | Wasserstein | B-ADMM | NetStats | GraphHopper | Propagation |
|---|---|---|---|---|---|---|---|---|---|
| 2, 3, 5 | 0.6 | < 0.001 | 0.28 ± 0.26 | < 0.001 | 0.02 ± 0.12 | 0.003 ± 0.021 | < 0.001 | 0.02 ± 0.14 | 0.09 ± 0.27 |
| | 0.7 | < 0.001 | 0.07 ± 0.21 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | 0.04 ± 0.20 | 0.01 ± 0.10 |
| | 0.8 | < 0.001 | 0.03 ± 0.13 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | 0.02 ± 0.14 | 0.04 ± 0.20 |
| | 0.9 | < 0.001 | 0.01 ± 0.09 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | 0.04 ± 0.20 | 0.08 ± 0.27 |
| 2, 5, 10 | 0.6 | < 0.001 | 0.37 ± 0.33 | < 0.001 | 0.004 ± 0.034 | < 0.001 | < 0.001 | 0.06 ± 0.24 | 0.12 ± 0.31 |
| | 0.7 | < 0.001 | 0.05 ± 0.14 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | 0.14 ± 0.35 | 0.01 ± 0.10 |
| | 0.8 | < 0.001 | 0.03 ± 0.15 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | 0.20 ± 0.40 | 0.03 ± 0.17 |
| | 0.9 | < 0.001 | 0.004 ± 0.039 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | 0.41 ± 0.49 | 0.04 ± 0.20 |

(b) Average p-values

Supplementary Table 2: Subject demographics and network sizes.

| Subject | Age | Gender | Network size |
|---|---|---|---|
| R369 | 30 | M | 199 |
| L372 | 34 | M | 174 |
| R376 | 48 | F | 189 |
| B384 | 38 | M | 89 |
| R399 | 22 | F | 175 |
| L400 | 59 | F | 126 |
| L403 | 56 | F | 194 |
| L405 | 19 | M | 127 |
| L409 | 31 | F | 160 |
| L423 | 51 | M | 152 |
| L514 | 46 | M | 118 |

C BRAIN NETWORK DATASET

Brain network data were obtained from eleven neurosurgical patients (5 female, ages 19-59 years, median age 38 years; Supplementary Table 2). The patients had been diagnosed with medically refractory epilepsy and were undergoing chronic invasive intracranial electroencephalography (iEEG) monitoring to identify potentially resectable seizure foci. All human subjects experiments were carried out in accordance with The Code of Ethics of the World Medical Association (Declaration of Helsinki) for experiments involving humans. The research protocols were approved by the University of Iowa Institutional Review Board and the National Institutes of Health. Written informed consent was obtained from all subjects. Research participation did not interfere with acquisition of clinically required data. Subjects could rescind consent at any time without interrupting their clinical evaluation. Recordings were made using subdural and depth electrodes (Ad-Tech Medical, Oak Creek, WI) placed solely on the basis of clinical requirements, as determined by the team of epileptologists and neurosurgeons. The brain network dataset was based on recordings made in the operating room prior to electrode removal, before and during induction of general anesthesia with propofol. Details of the experimental procedure, recording, and electrophysiological data analysis are available in (Banks et al., 2020).
D TOY ILLUSTRATION OF TOPOLOGICAL CLUSTERING

Supplementary Figure 2 illustrates a cluster analysis of the proposed topological approach using only topological centroids (λ = 1) on a toy dataset generated using the random modular networks described in Section 4 of the main body of the paper.

Supplementary Figure 2: Toy illustration of performing topological clustering using only topological centroids (λ = 1). Clustering is performed on 15 networks: five each with m = 2, 5 and 10 modules. All networks use |V| = 30, within-module connection probability r = 0.9 and edge weight standard deviation σ = 0.1. (a) Network examples for m = 2 (top), 5 (middle) and 10 (bottom) modules. (b) Illustration of topological clustering. Topological centroids M̂_top,i (thick lines) and individual network topologies are visualized using Betti numbers as a function of edge weights (filtration values). The algorithm converges in three iterations and the final partition (last column) perfectly matches the ground truth, i.e., C1, C2 and C3 correspond to m = 2, 5 and 10.