# Why the Metric Backbone Preserves Community Structure

Maximilien Dreveton (EPFL) maximilien.dreveton@epfl.ch · Charbel Chucri (EPFL) charbel.chucri@epfl.ch · Matthias Grossglauser (EPFL) matthias.grossglauser@epfl.ch · Patrick Thiran (EPFL) patrick.thiran@epfl.ch

Abstract

The metric backbone of a weighted graph is the union of all-pairs shortest paths. It is obtained by removing all edges (u, v) that are not on the shortest path between u and v. In networks with well-separated communities, the metric backbone tends to preserve many inter-community edges, because these edges serve as bridges connecting two communities, but tends to delete many intra-community edges because the communities are dense. This suggests that the metric backbone would dilute or destroy the community structure of the network. However, this is not borne out by prior empirical work, which instead showed that the metric backbone of real networks preserves the community structure of the original network well. In this work, we analyze the metric backbone of a broad class of weighted random graphs with communities, and we formally prove the robustness of the community structure with respect to the deletion of all the edges that are not in the metric backbone. An empirical comparison of several graph sparsification techniques confirms our theoretical finding and shows that the metric backbone is an efficient sparsifier in the presence of communities.

1 Introduction

Graph clustering partitions the vertex set of a graph into non-overlapping groups, so that the vertices of each group share some typical pattern or property. For example, each group might be composed of vertices that interact closely with each other. Graph clustering is one of the main tasks in the statistical analysis of networks [4, 30]. In many scenarios, the observed pairwise interactions are weighted. In a proximity graph, these weights measure the degree of similarity between edge endpoints (e.g., frequency of interaction in a social network), whereas in a distance graph, they measure dissimilarity instead (for example, the length of segments in a road network or travel times in a flight network). To avoid confusion, we refer to the weights in a distance graph as costs, so that the cost of a path is naturally defined as the sum of the edge costs on this path.¹

Distance graphs obtained from real-world data typically violate the triangle inequality. More precisely, the shortest distance between two vertices in the graph is not always equal to the cost of the direct edge, but rather equal to the cost of an indirect path via other vertices. For example, the least expensive flight between two cities is often a flight including one or more layovers. An edge whose endpoints are connected by an indirect shorter path is called semi-metric; otherwise, it is called metric. We obtain the metric backbone of a distance graph by removing all its semi-metric edges. It has been experimentally observed that the metric backbone retains only a small fraction of the original edges, typically between 10% and 30% in social networks [35].

¹This additive property does not hold for similarity weights, but a proximity graph can be transformed into a distance graph by applying an isomorphic non-linear transformation on the weights [36].
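To make this concrete, here is a minimal sketch (our illustration, not the authors' released code) of how semi-metric edges can be identified with off-the-shelf tools: an edge (u, v) is metric exactly when its cost equals the shortest-path distance between u and v. It assumes a NetworkX graph whose edge attribute `cost` holds the distances; the helper name `metric_backbone` is ours.

```python
import networkx as nx

def metric_backbone(G: nx.Graph, weight: str = "cost") -> nx.Graph:
    """Return the subgraph of G containing only its metric edges.

    An edge (u, v) is metric iff no indirect path is strictly cheaper
    than the direct edge, i.e., c(u, v) equals the shortest-path distance.
    """
    backbone = nx.Graph()
    backbone.add_nodes_from(G.nodes(data=True))
    # One single-source Dijkstra per vertex: O(n (m + n log n)) overall.
    for u in G.nodes:
        dist = nx.single_source_dijkstra_path_length(G, u, weight=weight)
        for v in G.neighbors(u):
            if G[u][v][weight] <= dist[v]:  # no strictly cheaper indirect path
                backbone.add_edge(u, v, **G[u][v])
    return backbone

if __name__ == "__main__":
    G = nx.Graph()
    G.add_weighted_edges_from(
        [("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 3.0)], weight="cost"
    )
    B = metric_backbone(G)
    print(sorted(B.edges()))  # (a, c) is semi-metric (1 + 1 < 3), so it is removed
```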
Properties of the original graph that depend only on the shortest paths (such as connectivity, diameter, and betweenness centrality) are preserved by its metric backbone. Moreover, experiments on contact networks indicate that other properties (such as spreading processes and community structures) are also empirically well approximated by computing them on the metric backbone, rather than on the original graph [8]. The preservation of these properties by the backbone highlights a well-known empirical feature of complex networks: redundancy. This is the basis for graph sparsification: the task of building a sparser graph from a given graph so that important structural properties are approximately preserved while reducing storage and computational costs. Many existing network sparsifiers identify the statistically significant edges of a graph by comparing them to a null model [34, 43]. These methods typically require hyperparameters and can leave some vertices isolated (by removing all edges incident to a given vertex). Spectral sparsification aims at preserving the spectral structure of a graph but also relies on a hyperparameter, and the edges in the sparsified graph might have different weights than in the original graph [37]. In contrast, the metric backbone is parameter-free and automatically preserves all properties linked to the shortest-path structure. Moreover, all-pairs shortest paths can be computed efficiently [14, 33]. This makes the metric backbone an appealing mechanism for sparsification.

Among all the properties empirically shown to be well preserved by the metric backbone, the community structure is perhaps the most surprising. Indeed, if a network contains dense communities that are only loosely connected by a small number of inter-community edges, then all shortest paths between vertices in different communities must go through one of these edges. This suggests that these "bridge" edges are less likely to be semi-metric than intra-community edges, where the higher density provides shorter alternative paths.² This in turn implies that metric sparsification should thin out intra-community edges more than inter-community edges, thereby diluting the community structure.

The central contribution of this paper is to show that this intuition is wrong. To do so, we formally characterize the metric backbone of a weighted stochastic block model (wSBM). We assume that the vertices are separated into $k$ blocks (also referred to as clusters or communities). An edge between two vertices belonging to blocks $a$ and $b$ is present with probability $p_{ab}$, independently of the presence or absence of other edges. This edge, if present, has a weight sampled from a distribution with cdf $F_{ab}$, and this weight represents the cost of traveling through this edge. We denote by $p^{\mathrm{mb}}_{ab}$ the probability that an edge between a vertex in block $a$ and a vertex in block $b$ is present in the backbone.³ Under loose assumptions on the cost distributions $F_{ab}$ and on the probabilities $p_{ab}$, we show that $p^{\mathrm{mb}}_{ab} / p^{\mathrm{mb}}_{cd} = (1 + o(1))\, p_{ab} / p_{cd}$ for every $a, b, c, d \in [k]$. This shows that the metric backbone thins the edge set approximately uniformly and, therefore, preserves the community structure of the original graph. Moreover, we also prove that a spectral algorithm recovers the communities almost exactly. We also conduct numerical experiments with several graph clustering algorithms and datasets to back up these theoretical results.
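The following self-contained simulation sketch (an illustration under simplified assumptions, not the paper's experimental code) samples a small two-block weighted planted-partition graph with exponential costs, computes its metric backbone by brute force, and compares the fraction of intra- and inter-community edges that survive; the helper names and parameter values are ours.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)

def sample_wsbm(n, k, p_in, p_out, lam=1.0):
    """Sample a small weighted planted-partition graph: equal-probability blocks,
    edge prob. p_in within / p_out across blocks, Exp(lam) edge costs."""
    z = rng.integers(k, size=n)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if z[u] == z[v] else p_out
            if rng.random() < p:
                G.add_edge(u, v, cost=rng.exponential(1.0 / lam))
    return G, z

def backbone_edges(G):
    """Set of metric edges (one Dijkstra per vertex)."""
    kept = set()
    for u in G.nodes:
        dist = nx.single_source_dijkstra_path_length(G, u, weight="cost")
        for v in G.neighbors(u):
            if G[u][v]["cost"] <= dist[v]:
                kept.add(frozenset((u, v)))
    return kept

G, z = sample_wsbm(n=600, k=2, p_in=0.10, p_out=0.05)
kept = backbone_edges(G)

def retention(intra: bool) -> float:
    edges = [e for e in G.edges if (z[e[0]] == z[e[1]]) == intra]
    return sum(frozenset(e) in kept for e in edges) / len(edges)

# The theory developed below predicts that both retention rates shrink at the
# same speed, so their ratio stays close to 1 even though each one is small.
print(f"intra retained: {retention(True):.3f}, inter retained: {retention(False):.3f}")
```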
We show that all clustering algorithms achieve similar accuracy on the metric backbone as on the original network. These simulations, performed on different types of networks and with different clustering algorithms, generalize the experiments of [8], which are restricted to contact networks and the Louvain algorithm.

Another type of graph we consider consists of graphs constructed from data points in a Euclidean space, typically by a kernel similarity measure between $q$-nearest neighbors ($q$-NN) [15]. This is an important technique for graph construction, with applications in non-linear embedding and clustering. Although $q$ controls the sparsity of this embedding graph, this procedure is not guaranteed to produce only metric edges, and varying $q$ often impacts the clustering performance. By investigating the metric backbone of $q$-NN graphs, we notice that it makes clustering results more robust against the value of $q$. Consequently, leveraging graph sparsification alongside $q$-NN facilitates graph construction.

²In fact, this intuitive observation is at the core of one of the very first community detection algorithms [17].
³The events $\mathbf{1}\{(u,v) \text{ is metric}\}$ and $\mathbf{1}\{(w,x) \text{ is metric}\}$ are in general not independent, but the probability that $(u,v)$ is metric depends only on the cluster assignments of vertices $u$ and $v$.

The paper is structured as follows. We introduce the main definitions and theoretical results on the metric backbone of wSBMs in Section 2. We discuss these results in Section 3 and compare them with the existing literature. Sections 4 and 5 are devoted to numerical experiments and applications. Finally, we conclude in Section 6.

Code availability. We provide the code used for the experiments: https://github.com/Charbel-11/Why-the-Metric-Backbone-Preserves-Community-Structure

Notations. The notation $1_n$ denotes the vector of size $n \times 1$ whose entries are all equal to one. For any vector $\pi \in \mathbb{R}^n$ and any matrix $B \in \mathbb{R}^{n \times m}$, we denote by $\pi_{\min} = \min_{a \in [n]} \pi_a$ and $B_{\min} = \min_{a,b} B_{ab}$, and similarly for $\pi_{\max}$ and $B_{\max}$. Given two matrices $A$ and $B$ of the same size, we denote by $A \odot B$ the entry-wise matrix product (i.e., their Hadamard product). $A^T$ is the transpose of a matrix $A$. For a vector $\pi$, we denote by $\mathrm{diag}(\pi)$ the diagonal matrix whose diagonal element $(a,a)$ is $\pi_a$. The indicator of an event $A$ is denoted $\mathbf{1}\{A\}$. Binomial and exponential random variables are denoted by $\mathrm{Bin}(n,p)$ and $\mathrm{Exp}(\lambda)$, respectively. The uniform distribution over an interval $[a,b]$ is denoted $\mathrm{Unif}(a,b)$. Finally, given a cumulative distribution function $F$, we write $X \sim F$ for a random variable $X$ sampled from $F$, i.e., $\mathbb{P}(X \le x) = F(x)$, and we denote by $f$ the pdf of this distribution. We write whp (with high probability) for events whose probability tends to 1 as $n \to \infty$.

2 The Metric Backbone of Weighted Stochastic Block Models

2.1 Definitions and Main Notations

Let $G = (V, E, c)$ be an undirected weighted graph, where $V = [n]$ is the set of vertices, $E \subseteq V \times V$ is the set of undirected edges, and $c \colon E \to \mathbb{R}_+$ is the cost function. The cost of a path $(u_1, \dots, u_p)$ is naturally defined as $\sum_{q=1}^{p-1} c(u_q, u_{q+1})$, and a shortest path between $u$ and $v$ is a path of minimal cost starting from vertex $u$ and finishing at vertex $v$. We define the metric backbone as the union of all shortest paths of $G$.

Definition 1.
The metric backbone of a weighted graph $G = (V, E, c)$, where $c$ represents the edge costs, is the subgraph $G^{\mathrm{mb}} = (V, E^{\mathrm{mb}}, c^{\mathrm{mb}})$ of $G$, where $E^{\mathrm{mb}} \subseteq E$ is such that $e \in E^{\mathrm{mb}}$ if and only if $e$ belongs to a shortest path between two vertices of $G$, and $c^{\mathrm{mb}} \colon E^{\mathrm{mb}} \to \mathbb{R}_+;\ e \mapsto c(e)$ is the restriction of $c$ to $E^{\mathrm{mb}}$.

We investigate the structure of the metric backbone of weighted random graphs with community structure. We generate these graphs as follows. Each vertex $u \in [n]$ is randomly assigned to the cluster $a \in [k]$ with probability $\pi_a$. We denote by $z_u \in [k]$ the cluster of vertex $u$. Conditioned on $z_u$ and $z_v$, an edge $(u, v)$ is present with probability $p_{z_u z_v}$, independently of the presence or absence of other edges. If an edge $(u, v)$ is present, it is assigned a cost $c(u, v)$. The cost $c(u, v)$ is sampled from $F_{z_u z_v}$, where $F = (F_{ab})_{1 \le a, b \le k}$ denotes a collection of cumulative distribution functions such that $F_{ab} = F_{ba}$. This defines the weighted stochastic block model, and we denote $(z, G) \sim \mathrm{wSBM}(n, \pi, p, F)$ with $G = ([n], E, c)$, $z \in [k]^n$ and
$$\mathbb{P}(E \mid z) = \prod_{1 \le u < v \le n} p_{z_u z_v}^{\mathbf{1}\{(u,v) \in E\}} \left(1 - p_{z_u z_v}\right)^{1 - \mathbf{1}\{(u,v) \in E\}}.$$

Assumption 1 (Condition on the interaction probabilities). There exist a sparsity factor $\rho_n$ and a symmetric matrix $B$ with constant, non-negative entries such that $p_{ab} = B_{ab}\, \rho_n$ for all $a, b \in [k]$, with $n\rho_n = \omega(\log n)$. Moreover, $\pi_{\min} > 0$ and $B_{\min} > 0$.

To rule out some issues, such as edges with a zero cost,⁴ we also assume that the probability distributions of the costs have no mass at 0. More precisely, we require that the cumulative distribution functions $F_{ab}$ verify $F_{ab}(0) = 0$ and $F'_{ab}(0) = \lambda_{ab} > 0$, where $F'$ denotes the derivative of $F$ (i.e., the pdf). The first condition ensures that the distribution has support $\mathbb{R}_+$ and no mass at 0, and the second ensures that, in a neighborhood of 0, $F_{ab}$ behaves as the exponential distribution $\mathrm{Exp}(\lambda_{ab})$ (or as the uniform distribution $\mathrm{Unif}([0, \lambda_{ab}^{-1}])$).

Assumption 2 (Condition on the cost distributions). The costs are sampled from continuous distributions, and there exists $\Lambda = (\lambda_{ab})_{a,b}$ with $\lambda_{ab} > 0$ such that $F_{ab}(0) = 0$ and $F'_{ab}(0) = \lambda_{ab}$.

We define the following matrix
$$T = [\Lambda \odot B]\, \mathrm{diag}(\pi), \qquad (2.1)$$
where $B$ and $\Lambda$ are defined in Assumptions 1 and 2. Finally, we denote by $\tau_{\min}$ and $\tau_{\max}$ the minimum and maximum entries of the vector $\tau = T 1_k$.

Remark 1. Assume that $\Lambda = \lambda 1_k 1_k^T$ with $\lambda > 0$. We notice that $\tau_a = \lambda \sum_b \pi_b B_{ab}$. Denote by $\bar d = (\bar d_1, \dots, \bar d_k)$ the vector whose $a$-th entry $\bar d_a$ is the expected degree of a vertex in community $a$. We have $\bar d_a = n \sum_b \pi_b p_{ab}$. Then $\tau_{\min} = \lambda \bar d_{\min} (n\rho_n)^{-1}$ and $\tau_{\max} = \lambda \bar d_{\max} (n\rho_n)^{-1}$, where $\bar d_{\min}$ and $\bar d_{\max}$ are the minimum and maximum entries of $\bar d$.

2.2 Cost of Shortest Paths in wSBMs

Given a path $(u_1, \dots, u_p)$, recall that its cost is $\sum_{q=1}^{p-1} c(u_q, u_{q+1})$, and its hop count is the number of edges composing the path (that is, the hop count of $(u_1, \dots, u_p)$ is $p - 1$). For two vertices $u, v \in V$, we denote by $C(u, v)$ the cost of the shortest path from $u$ to $v$.⁵ The following proposition provides asymptotics for the cost of shortest paths in wSBMs.

Proposition 1. Let $(z, G) \sim \mathrm{wSBM}(n, \pi, p, F)$. Suppose that Assumptions 1 and 2 hold and let $\tau_{\min}$ and $\tau_{\max}$ be defined following Equation (2.1). Then, for two vertices $u$ and $v$ chosen uniformly at random in blocks $a$ and $b$, respectively, we have whp
$$(\tau_{\max})^{-1} \;\le\; \frac{n\rho_n}{\log n}\, C(u, v) \;\le\; (\tau_{\min})^{-1}.$$

To prove Proposition 1, in a first stage we simplify the problem by assuming exponentially distributed weights. Using the memoryless property of the exponential distribution, we analyze two first passage percolation (FPP) processes, originating from vertices $u$ and $v$, respectively. Each FPP explores the nearest neighbors of its starting vertex until it reaches $q$ neighbors.
As long as $q = o(\sqrt{n})$, the two FPP processes remain disjoint (with high probability). Thus, the cost $C(u, v)$ is lower-bounded by the sum of (i) the cost of the shortest path from $u$ to its $q$-th nearest neighbor and (ii) the cost of the shortest path from $v$ to its $q$-th nearest neighbor. On the contrary, when $q = \omega(\sqrt{n})$, the two FPPs intersect, revealing a path from $u$ to $v$, and the cost of this path upper-bounds the cost $C(u, v)$ of the shortest path from $u$ to $v$. In the second stage, we extend the result to general weight distributions by noticing that the edges belonging to the shortest paths have very small costs. Moreover, Assumption 2 yields that the weight distributions behave as an exponential distribution in a neighborhood of 0. We can thus adapt the coupling argument of [23] to show that the edge weight distributions do not need to be exponential, as long as Assumption 2 is verified. The proof of Proposition 1 is provided in Section A.

⁴An edge $(u, v) \in E$ such that $c(u, v) = 0$ yields the possibility of teleportation from vertex $u$ to vertex $v$.
⁵We notice that, given a wSBM, whp there is only one shortest path.

2.3 Metric Backbone of wSBMs

Let $(z, G) \sim \mathrm{wSBM}(n, \pi, p, F)$ and denote by $G^{\mathrm{mb}}$ the metric backbone of $G$. Choose two vertices $u$ and $v$ uniformly at random, and notice that the probability that the edge $(u, v)$ is present in $G^{\mathrm{mb}}$ depends only on $z_u$ and $z_v$, and not on $z_w$ for $w \notin \{u, v\}$. Denote by $p^{\mathrm{mb}}_{ab}$ the probability that an edge between a vertex in community $a$ and a vertex in community $b$ appears in the metric backbone, i.e., $p^{\mathrm{mb}}_{ab} = \mathbb{P}\left((u, v) \in G^{\mathrm{mb}} \mid z_u = a, z_v = b\right)$. The following theorem shows that the ratio $p^{\mathrm{mb}}_{ab} / p_{ab}$ scales as $\Theta\!\left(\frac{\log n}{n\rho_n}\right)$.

Theorem 1. Let $(z, G) \sim \mathrm{wSBM}(n, \pi, p, F)$ and suppose that Assumptions 1 and 2 hold. Let $\tau_{\min}$ and $\tau_{\max}$ be defined after Equation (2.1). Then
$$(1 + o(1))\, \frac{\lambda_{ab}}{\tau_{\max}} \frac{\log n}{n\rho_n} \;\le\; \frac{p^{\mathrm{mb}}_{ab}}{p_{ab}} \;\le\; (1 + o(1))\, \frac{\lambda_{ab}}{\tau_{\min}} \frac{\log n}{n\rho_n}.$$

We prove Theorem 1 in Appendix B.1. Theorem 1 shows that the metric backbone maintains the same proportion of intra- and inter-community edges as in the original graph. We illustrate the theorem with two important examples.

Example 1. Consider a weighted version of the planted partition model, where for all $a, b \in [k]$ we have $\pi_a = 1/k$ and $B_{ab} = p_0$ if $a = b$, $q_0$ otherwise, where $p_0$ and $q_0$ are constant. Assume that $\Lambda = \lambda 1_k 1_k^T$ with $\lambda > 0$. Using Remark 1, we have $\tau_{\min} = \tau_{\max} = \lambda k^{-1} (p_0 + (k-1) q_0)$, and Theorem 1 states that
$$p^{\mathrm{mb}} = (1 + o(1))\, \frac{k p_0}{p_0 + (k-1) q_0} \frac{\log n}{n} \quad \text{and} \quad q^{\mathrm{mb}} = (1 + o(1))\, \frac{k q_0}{p_0 + (k-1) q_0} \frac{\log n}{n}.$$
In particular, $p^{\mathrm{mb}} / q^{\mathrm{mb}} = (1 + o(1))\, p_0 / q_0$.

Example 2. Consider a stochastic block model with edge probabilities $p_{ab} = B_{ab} \rho_n$ such that the vertices of different communities have the same expected degree $\bar d$. If $\Lambda = \lambda 1_k 1_k^T$, then $p^{\mathrm{mb}}_{ab} = (1 + o(1))\, B_{ab} \rho_n \frac{\log n}{\bar d}$.

2.4 Recovering Communities from the Metric Backbone

In this section, we prove that a spectral algorithm applied to the (weighted) adjacency matrix of the metric backbone of a wSBM asymptotically recovers the clusters whp. Given an estimate $\hat z \in [k]^n$ of the clusters $z \in [k]^n$, we define the loss as
$$\mathrm{loss}(z, \hat z) = \frac{1}{n} \inf_{\sigma \in \mathrm{Sym}(k)} \mathrm{Ham}(z, \sigma \circ \hat z),$$
where $\mathrm{Ham}$ denotes the Hamming distance and $\mathrm{Sym}(k)$ is the set of all permutations of $[k]$.
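For concreteness, the loss above can be computed by solving a small assignment problem over the $k \times k$ confusion matrix; the sketch below (our illustration, assuming labels encoded as 0, …, k−1) uses `scipy.optimize.linear_sum_assignment` to take the minimum over label permutations.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_loss(z: np.ndarray, z_hat: np.ndarray, k: int) -> float:
    """loss(z, z_hat) = (1/n) * min over label permutations of the
    Hamming distance between z and the relabelled z_hat."""
    n = len(z)
    # confusion[a, b] = number of vertices with true label a and predicted label b
    confusion = np.zeros((k, k), dtype=int)
    for a, b in zip(z, z_hat):
        confusion[a, b] += 1
    # Maximising the total agreement over permutations is an assignment problem.
    rows, cols = linear_sum_assignment(-confusion)
    agreement = confusion[rows, cols].sum()
    return 1.0 - agreement / n

z = np.array([0, 0, 0, 1, 1, 1])
z_hat = np.array([1, 1, 1, 0, 0, 2])   # same partition up to relabelling, one error
print(clustering_loss(z, z_hat, k=3))  # 1/6 ≈ 0.167
```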
Algorithm 1: Spectral Clustering on the weighted adjacency matrix of the metric backbone
Input: Graph $G$, number of clusters $k$
Output: Predicted community memberships $\hat z \in [k]^n$
1. Denote by $W^{\mathrm{mb}} \in \mathbb{R}_+^{n \times n}$ the weighted adjacency matrix of the metric backbone $G^{\mathrm{mb}}$ of $G$.
2. Let $W^{\mathrm{mb}} = \sum_{i=1}^n \sigma_i u_i u_i^T$ be an eigendecomposition of $W^{\mathrm{mb}}$, with eigenvalues ordered in decreasing absolute value ($|\sigma_1| \ge \dots \ge |\sigma_n|$) and eigenvectors $u_1, \dots, u_n \in \mathbb{R}^n$.
3. Denote $U = [u_1, \dots, u_k] \in \mathbb{R}^{n \times k}$ and $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_k)$.
4. Let $\hat z \in [k]^n$ be a $(1 + \epsilon)$-approximate solution of $k$-means performed on the rows of $U \in \mathbb{R}^{n \times k}$.

The following theorem states that, as long as the matrix $T$ defined in (2.1) is invertible, the loss of spectral clustering applied on the metric backbone asymptotically vanishes whp.

Theorem 2. Let $(z, G) \sim \mathrm{wSBM}(n, \pi, p, F)$ and suppose that Assumptions 1 and 2 hold. Let $\mu$ be the minimal absolute eigenvalue of the matrix $T$ defined in (2.1). Moreover, assume that $\tau_{\max} = \tau_{\min}$ and $\mu \ne 0$. Then, the output $\hat z$ of Algorithm 1 on $G$ verifies whp
$$\mathrm{loss}(z, \hat z) = O\!\left(\frac{1}{\mu^2 \log n}\right).$$

We prove Theorem 2 in Appendix B.2. We saw in Examples 1 and 2 that the condition $\tau_{\max} = \tau_{\min}$ is verified in several important settings. The additional assumption $\mu \ne 0$ (equivalent to $T$ being invertible) also often holds: in the planted partition model of Example 1, $T$ is invertible if $p_0 \ne q_0$.

3 Comparison with Previous Work

The metric backbone has been introduced under different names, such as the essential subgraph [29], the transport overlay network [41], or simply the union of shortest path trees [39]. In this section, we discuss our contribution with respect to closely related earlier works.

3.1 Computing the Metric Backbone

Computing the metric backbone requires solving the All-Pairs Shortest Path (APSP) problem, a classic and fundamental problem in computer science. Simply running Dijkstra's algorithm from each vertex of the graph solves the APSP in $O(nm + n^2 \log n)$ worst-case time, where $m = |E|$ is the number of edges in the original graph [14], whereas [29] proposed an algorithm running in $O(n m^{\mathrm{mb}} + n^2 \log n)$ worst-case time, where $m^{\mathrm{mb}}$ is the number of edges in the metric backbone. APSP has also been studied in weighted random graphs in which the weights are independent and identically distributed [20, 16]. In particular, the APSP can be solved in $O(n^2)$ time with high probability on complete weighted graphs whose weights are drawn independently and uniformly at random from $[0, 1]$ [33]. However, practical implementations of APSP can achieve faster results. For example, in [24], an empirical observation regarding the low hop count of shortest paths is leveraged to compute the metric backbone efficiently. Although the exact time complexity is not provided, the implementation scales well to massive graphs, such as a Facebook graph with 190 million nodes and 49.9 billion edges, and the empirical running time appears to be linear in the number of edges [24, Table 1 and Figure 5]. Additionally, our simulations reveal that some popular clustering algorithms, such as spectral and subspace clustering, have higher running times than computing the metric backbone.

3.2 First-Passage Percolation in Random Graphs

To study the metric backbone theoretically, we need to understand the structure of the shortest path between vertices in a random graph. This classical and fundamental topic of probability theory is known as first-passage percolation (FPP) [19]. The paper [23] originally studied the weights and hop counts of shortest paths on the complete graph with iid weights.
This was later generalized to Erdős–Rényi graphs and configuration models (see, for example, [28, 12] and references therein). Closer to our setting, [25] studied FPP on inhomogeneous random graphs. Indeed, SBMs are a particular case of inhomogeneous random graphs, for which the set of vertex types is discrete (we refer to [7] for general statements on inhomogeneous random graphs). Assuming that the edge weights are independent and $\mathrm{Exp}(\lambda)$-distributed, [25] established a central limit theorem for the weight and hop count of the shortest path between two vertices chosen uniformly at random among all vertices. Using the notation of Section 2, this result implies that $\frac{n\rho_n}{\log n} C(u, v)$ converges in probability to $\tau^{-1}$, where $\tau$ is the Perron–Frobenius eigenvalue of $\lambda B\, \mathrm{diag}(\pi)$.

The novelty of our work is two-fold. First, we allow different cost distributions for each pair of communities, whereas all previous works on FPP in random graphs assume that the costs are sampled from a single distribution. Furthermore, we examine the cost of a path between two vertices, $u$ and $v$, chosen uniformly at random among vertices in block $a$ and in block $b$, respectively. This differs from previous work (and, in particular, [25]) in which vertices $u$ and $v$ are selected uniformly at random among all vertices. As a result, even for a single cost distribution, Proposition 1 cannot be obtained directly from [25, Theorem 1.2]. This difference is key, as this proposition is required to establish Theorem 1.

The closest result to Theorem 1 appearing in the literature is [39, Corollary 1]; it establishes a formula for the probability $p^{\mathrm{mb}}_{uv}$ that an edge between two vertices $u$ and $v$ exists in the metric backbone of a random graph whose edge costs are iid. This previous work does not focus on community structure, so the costs are sampled from a single distribution. More importantly, the expression of $p^{\mathrm{mb}}_{uv}$ given by [39, Theorem 2 and Corollary 1] is mainly of theoretical interest (and we use it in the proof of Theorem 1). Indeed, understanding the asymptotic behavior of $p^{\mathrm{mb}}_{uv}$ requires a complete analysis of the cost $C(u, v)$ of the shortest path between $u$ and $v$. [39] proposes such an analysis only in one simple scenario (namely, a complete graph with iid exponentially distributed costs).

4 Experimental Results

In this section, we test whether the metric backbone preserves a graph's community structure in various real networks for which a ground-truth community structure is known (see Table 2 in Appendix C.1 for an overview). Since in many weighted networks, such as social networks, the edge weights represent a measure (e.g., frequency) of interaction between two entities over time, we need to preprocess these proximity graphs into distance graphs. More precisely, given the original (weighted or unweighted) graph $G = (V, E, s)$, where the weights measure the similarities between pairs of vertices, we define the proximity $p(u, v)$ of vertices $u$ and $v$ as the weighted Jaccard similarity between the neighborhoods of $u$ and $v$, i.e.,
$$p(u, v) = \frac{\sum_{w \in \mathrm{Nei}(u) \cap \mathrm{Nei}(v)} \min\{s(u, w), s(v, w)\}}{\sum_{w \in \mathrm{Nei}(u) \cup \mathrm{Nei}(v)} \max\{s(u, w), s(v, w)\}},$$
where $\mathrm{Nei}(u) = \{w \in V \colon (u, w) \in E\}$ denotes the neighborhood of $u$, i.e., the vertices connected to $u$ by an edge. If $G$ is unweighted ($s(e) = 1$ for all $e \in E$), we simply recover the Jaccard index $\frac{|\mathrm{Nei}(u) \cap \mathrm{Nei}(v)|}{|\mathrm{Nei}(u) \cup \mathrm{Nei}(v)|}$. We note that other choices of normalization could have been made, such as the Adamic–Adar index [2]. We refer to [10] for an in-depth discussion of similarity and distance indices.
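As an illustration of this preprocessing step (a sketch under our own conventions, not the authors' code), the weighted Jaccard proximity can be computed directly from the adjacency structure; edge weights are read from a NetworkX attribute named `weight`, and missing weights default to 1 so that the unweighted Jaccard index is recovered.

```python
import networkx as nx

def weighted_jaccard_proximity(G: nx.Graph, u, v, weight: str = "weight") -> float:
    """Weighted Jaccard similarity between the neighbourhoods of u and v:
    sum of min(s(u,w), s(v,w)) over common neighbours, divided by the
    sum of max(s(u,w), s(v,w)) over the union of the neighbourhoods."""
    su = {w: G[u][w].get(weight, 1.0) for w in G.neighbors(u)}
    sv = {w: G[v][w].get(weight, 1.0) for w in G.neighbors(v)}
    num = sum(min(su[w], sv[w]) for w in set(su) & set(sv))
    den = sum(max(su.get(w, 0.0), sv.get(w, 0.0)) for w in set(su) | set(sv))
    return num / den if den > 0 else 0.0

def proximity_graph(G: nx.Graph, weight: str = "weight") -> nx.Graph:
    """Re-weight every edge of G by the weighted Jaccard proximity of its endpoints."""
    P = nx.Graph()
    P.add_nodes_from(G.nodes)
    for u, v in G.edges:
        P.add_edge(u, v, proximity=weighted_jaccard_proximity(G, u, v, weight))
    return P
```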
Once the proximity graph $G = (V, E, p)$ has been computed, we construct the distance graph $D = (V, E, c)$, where $c \colon E \to \mathbb{R}_+$ is such that for all $e \in E$: $c(e) = \frac{1}{p(e)} - 1$. This is the simplest and most commonly used method for converting a similarity into a distance [36]. We then compute the set $E^{\mathrm{mb}}$ of metric edges of the distance graph $D$, and let $G^{\mathrm{mb}} = (V, E^{\mathrm{mb}}, p^{\mathrm{mb}})$, where $p^{\mathrm{mb}} \colon E^{\mathrm{mb}} \to [0, 1];\ e \mapsto p(e)$ is the restriction of $p$ to $E^{\mathrm{mb}}$. We also consider the following two sparsifications (with the corresponding restrictions of $c$ to the sparsified edge sets) to compare the resulting community structure: the threshold graph $G^{\theta} = (V, E^{\theta}, p^{\theta})$, where an edge $e \in E$ is kept in $E^{\theta}$ iff $p(e) \ge \theta$; and the graph $G^{\mathrm{ss}} = (V, E^{\mathrm{ss}}, p^{\mathrm{ss}})$ obtained by spectral sparsification of $G$. We use the Spielman–Srivastava sparsification [37], implemented in the PyGSP package [13]. For both threshold and spectral sparsification, we tune the hyperparameters so that the number of edges kept is the same as in the metric backbone: $|E^{\mathrm{mb}}| = |E^{\theta}| = |E^{\mathrm{ss}}|$. We provide in Table 3 (in the Appendix) some statistics on the percentage of edges remaining in the sparsified graphs.

For each proximity graph $G$ and its three sparsifications $G^{\mathrm{mb}}$, $G^{\theta}$, and $G^{\mathrm{ss}}$, we run a graph clustering algorithm to obtain the respective predicted clusters $\hat z$, $\hat z^{\mathrm{mb}}$, $\hat z^{\theta}$ and $\hat z^{\mathrm{ss}}$. We show, in Figure 1, the adjusted Rand index⁶ (ARI) obtained between the ground-truth communities and the predicted clusters for three widely used graph clustering algorithms: the Bayesian algorithm [31], the Leiden algorithm [38] and spectral clustering [40]. We use the graph-tool implementation of the Bayesian algorithm, with an exponential prior for the weight distributions. The Leiden algorithm implementation is available at https://github.com/vtraag/leidenalg. For spectral clustering, we assume that the algorithm knows the correct number of clusters in advance, and we use the implementation from scikit-learn.⁷

⁶The Rand index (RI) measures the similarity between two clusterings based on pair counting. This index is adjusted for chance because a random clustering could lead to a large RI. The ARI is a modification of the RI such that two random clusterings (resp., two identical clusterings) give an ARI of 0 (resp., of 1) [22].

[Figure 1 — three panels, one per clustering algorithm: (a) Bayesian MCMC, (b) Leiden, (c) Spectral Clustering. Each panel reports the ARI on the High School, Primary School, DBLP and Amazon datasets for the original graph, the metric backbone, the threshold subgraph and the spectral sparsifier.]

Figure 1: Effect of sparsification on the performance of clustering algorithms on various data sets.

We observe that the metric backbone and the spectral sparsification retain the community structure equally well across all data sets and for all clustering algorithms tested. Thresholding often yields several disconnected components of small sizes, impacting the performance of clustering algorithms on $G^{\theta}$. We highlight the difference between the metric backbone and the threshold subgraph of the Primary School data set in Figure 2. We observe, in Figure 2a, that the edges in red (which are present in the backbone but not in the threshold graph) are mostly inter-community edges. On the contrary, in Figure 2b, the blue edges (which are present in the threshold graph but not in the backbone) are mostly intra-community edges. Despite this difference, the metric backbone retains the information about the community structure, as shown in Figure 1.
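A minimal sketch of this comparison pipeline is given below (our illustration; it assumes the proximity is stored in an edge attribute `proximity` and that the metric backbone is computed as in the earlier sketch): the proximity graph is converted into a distance graph via c(e) = 1/p(e) − 1, and the threshold subgraph is built by keeping exactly as many of the highest-proximity edges as the backbone retains.

```python
import networkx as nx

def to_distance_graph(P: nx.Graph, prox: str = "proximity") -> nx.Graph:
    """Convert a proximity graph into a distance graph via c(e) = 1/p(e) - 1."""
    D = nx.Graph()
    D.add_nodes_from(P.nodes)
    for u, v, data in P.edges(data=True):
        D.add_edge(u, v, cost=1.0 / data[prox] - 1.0)
    return D

def matched_threshold_graph(P: nx.Graph, n_edges: int, prox: str = "proximity") -> nx.Graph:
    """Keep the n_edges edges with the highest proximity, so that the
    threshold subgraph has exactly as many edges as the metric backbone."""
    ranked = sorted(P.edges(data=True), key=lambda e: e[2][prox], reverse=True)
    T = nx.Graph()
    T.add_nodes_from(P.nodes)
    T.add_edges_from((u, v, d) for u, v, d in ranked[:n_edges])
    return T
```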
[Figure 2 — two panels: (a) Metric Backbone, (b) Threshold Subgraph.]

Figure 2: Graphs obtained from the Primary School data set, after taking the metric backbone (Figure 2a) and after thresholding (Figure 2b), drawn using the same layout. Vertex colors represent the true clusters. Edges present in the metric backbone but not in the threshold graph are highlighted in red. Edges present in the threshold graph, but not in the metric backbone, are highlighted in blue.

5 Application to Graph Construction Using q-NN

In a large number of machine-learning tasks, data does not originally come from a graph structure, but instead from a cloud of points $x_1, \dots, x_n$ where each $x_u$ belongs to a metric space (say $\mathbb{R}^d$ for simplicity). The goal of graph construction is to discover a proximity graph $G = ([n], E, p)$ from the original data points. Graph construction is commonly done through the $q$-nearest neighbors ($q$-NN). Given a similarity function $\mathrm{sim} \colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+$ that quantifies the resemblance between two data points, we define the set $N(u, q)$ of $q$-nearest neighbors of $u \in [n]$. More precisely, $N(u, q)$ is the subset of $[n] \setminus \{u\}$ with cardinality $q$ such that for all $v \in N(u, q)$ and for all $w \notin N(u, q)$ we have $\mathrm{sim}(x_u, x_v) \ge \mathrm{sim}(x_u, x_w)$. The edge set $E$ is composed of all pairs of vertices $(u, v)$ such that $u \in N(v, q)$ or $v \in N(u, q)$,⁸ and the proximity $p_{uv}$ associated with the edge $(u, v)$ is $(s_{uv} + s_{vu})/2$, where $s_{uv} = \mathrm{sim}(x_u, x_v)$ if $v \in N(u, q)$, and 0 otherwise. In the following, we use the Gaussian kernel similarity
$$\mathrm{sim}(x_u, x_v) = \exp\!\left(-\frac{\|x_u - x_v\|^2}{d_K^2(x_u)}\right),$$
where $d_K(x_u)$ is the Euclidean distance between $x_u$ and its $q$-th nearest neighbor. In Appendix C.2, we provide results using another similarity measure.

⁷We also note that the threshold graph $G^{\theta}$ is often not connected. Hence, we ignored all components comprising fewer than 5 nodes before running spectral clustering.

We investigate the effect of graph sparsification on graphs built by $q$-NN. We sample $n = 10000$ images from MNIST [26] and Fashion-MNIST [42], and use the full HAR [3] dataset ($n = 10299$). From the sampled data points, we build the $q$-NN graph $G_q$, as well as its metric backbone $G_q^{\mathrm{mb}}$ and its spectral sparsification $G_q^{\mathrm{ss}}$. We then compare the performance of two clustering algorithms, spectral clustering and Poisson learning. Poisson learning is a semi-supervised graph clustering algorithm that was recently shown to outperform other graph-based semi-supervised algorithms [9]. Results of spectral clustering using another similarity measure are presented in the Appendix.

We compare the ARI of clustering algorithms on $q$-NN graphs and their sparsifications (metric backbone and spectral sparsification) for various choices of the number of nearest neighbors $q$. The results are shown in Figure 3 (for spectral clustering) and Figure 4 (for Poisson learning). Unlike the spectral sparsifier, the metric backbone retains a high ARI across all choices of $q$. Interestingly, the performance on the original graph often decreases with $q$, which is a hyperparameter of the graph construction step. Applying a clustering algorithm on the metric backbone thus comes with the two advantages of significantly reducing the number of edges in the graph and of making its performance robust against the choice of the hyperparameter $q$. Indeed, a larger value of $q$ creates more edges but with a higher distance (cost), which are therefore more likely to be non-metric.
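The q-NN construction with the self-tuning Gaussian kernel described above can be sketched as follows (our illustration, using scikit-learn's `NearestNeighbors`; a dense similarity matrix is used for simplicity, whereas a sparse representation would be preferable at the scale of the experiments).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def qnn_similarity_graph(X: np.ndarray, q: int) -> np.ndarray:
    """Build the symmetric q-NN similarity graph described above.

    s_uv = exp(-||x_u - x_v||^2 / d_K(x_u)^2) if v is among the q nearest
    neighbours of u (0 otherwise), where d_K(x_u) is the distance from x_u
    to its q-th nearest neighbour; the edge weight is (s_uv + s_vu) / 2.
    """
    n = X.shape[0]
    nn = NearestNeighbors(n_neighbors=q + 1).fit(X)   # +1 because the point itself is returned
    dist, idx = nn.kneighbors(X)                       # both of shape (n, q + 1)
    d_K = dist[:, -1]                                  # distance to the q-th neighbour
    S = np.zeros((n, n))
    for u in range(n):
        for j in range(1, q + 1):                      # skip the point itself (column 0)
            v = idx[u, j]
            S[u, v] = np.exp(-dist[u, j] ** 2 / d_K[u] ** 2)
    return 0.5 * (S + S.T)                             # symmetrise: (s_uv + s_vu) / 2

# Example usage on random 2-D points; W[u, v] > 0 exactly when u is a
# neighbour of v or vice versa.
X = np.random.default_rng(1).normal(size=(200, 2))
W = qnn_similarity_graph(X, q=10)
```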
[Figure 3 — three panels (MNIST, Fashion-MNIST, HAR): ARI versus the number of nearest neighbors $q$ (from 20 to 100) for the original graph, the metric backbone, and the spectral sparsifier.]

Figure 3: Performance of spectral clustering on subsets of the MNIST and Fashion-MNIST datasets, and on the HAR dataset. The ARI is averaged over 10 trials; error bars show the standard error of the mean.

[Figure 4 — three panels (MNIST, Fashion-MNIST, HAR): ARI versus $q$ for the original graph, the metric backbone, and the spectral sparsifier.]

Figure 4: Performance of Poisson learning on subsets of the MNIST and Fashion-MNIST datasets, and on the HAR dataset. The ARI is averaged over 100 trials, and error bars show the standard error of the mean.

Finally, we compare in Table 1 the ARI obtained on the $q$-nearest neighbor graph using $q = 10$ (as it is a common default choice) with the metric backbone graph of a $q$-nearest neighbor graph with $q = n/2$. Moreover, we compute an approximation of the metric backbone by sampling uniformly at random $2 \log n$ vertices and taking the union of the $2 \log n$ shortest-path trees rooted at each one of them, instead of the union of all $n$ shortest-path trees. This produces a graph $\tilde G^{\mathrm{mb}}_{n/2}$ whose edge set is a subset of the edges of the true metric backbone $G^{\mathrm{mb}}_{n/2}$. We observe that $\tilde G^{\mathrm{mb}}_{n/2}$ retains the community structure while typically being about twice as sparse as the 10-nearest neighbor graph $G_{10}$.

⁸In other words, an edge $(u, v)$ is present if at least one of its two endpoints is in the $q$-nearest neighborhood of the other.

| Algorithm | Data set | | $G_{10}$ | $\tilde G^{\mathrm{mb}}_{n/2}$ |
|---|---|---|---|---|
| Spectral clustering | MNIST | ARI | 0.533 | 0.563 ± 0.011 |
| | | Edges | 550,653 | 373,379 ± 3,018 |
| | Fashion-MNIST | ARI | 0.411 | 0.425 ± 0.006 |
| | | Edges | 578,547 | 272,063 ± 857 |
| | HAR | ARI | 0.519 | 0.492 ± 0.001 |
| | | Edges | 77,526 | 37,535 ± 977 |
| Poisson learning | MNIST | ARI | 0.814 ± 0.015 | 0.821 ± 0.021 |
| | | Edges | 550,653 | 373,379 ± 3,018 |
| | Fashion-MNIST | ARI | 0.526 ± 0.010 | 0.544 ± 0.014 |
| | | Edges | 578,547 | 272,063 ± 857 |
| | HAR | ARI | 0.618 ± 0.039 | 0.636 ± 0.037 |
| | | Edges | 77,526 | 37,535 ± 977 |

Table 1: Comparison of clustering on the full data sets. $G_{10}$ denotes the 10-nearest neighbors graph, and $\tilde G^{\mathrm{mb}}_{n/2}$ denotes the approximate metric backbone of the $n/2$-nearest neighbor graph. We approximate the metric backbone by sampling only $2 \log n$ shortest-path trees.

Additional discussion. The performance of clustering algorithms on the $q$-nearest neighbor graph $G_q$ tends to decrease when $q$ increases. Indeed, a larger $q$ introduces many low-similarity edges, which can act as noise. Spectral sparsification preserves the spectral properties of the graph, but this becomes ineffective if the spectral properties of $G_q$ are insufficient to recover the communities (as attested by the poor performance of spectral clustering for large values of $q$ in Figures 3a and 8a). However, when the performance of spectral clustering on $G_q$ remains stable as $q$ is varied, so does the performance of spectral clustering on the spectrally sparsified graph $G_q^{\mathrm{ss}}$ (as seen in Figures 3b and 8b). In contrast, the metric backbone preserves the shortest paths rather than the spectral properties. Because the shortest paths are robust to the addition of numerous low-similarity edges,⁹ the performance of clustering algorithms on the metric backbone $G_q^{\mathrm{mb}}$ remains stable when $q$ increases. This holds regardless of whether the performance on the original graph $G_q$ is stable or decreases with increasing $q$.
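The tree-sampling approximation used for Table 1 can be sketched as follows (our illustration, not the released implementation): the union of the shortest-path trees rooted at ⌈2 log n⌉ randomly sampled vertices only contains metric edges, so it is always a subgraph of the exact backbone.

```python
import math
import random
import networkx as nx

def approximate_backbone(G: nx.Graph, weight: str = "cost", seed: int = 0) -> nx.Graph:
    """Approximate metric backbone: union of the shortest-path trees rooted at
    ceil(2 log n) vertices sampled uniformly at random. Every edge kept this
    way lies on some shortest path, so the result is a subgraph of the exact
    metric backbone (it may miss some of its edges)."""
    rng = random.Random(seed)
    n = G.number_of_nodes()
    roots = rng.sample(list(G.nodes), k=min(n, math.ceil(2 * math.log(n))))
    approx = nx.Graph()
    approx.add_nodes_from(G.nodes(data=True))
    for r in roots:
        # Predecessor structure of the shortest-path tree rooted at r.
        preds, _ = nx.dijkstra_predecessor_and_distance(G, r, weight=weight)
        for v, parents in preds.items():
            for u in parents:            # (u, v) lies on a shortest path from r to v
                approx.add_edge(u, v, **G[u][v])
    return approx
```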
Finally, sparsified graphs obtained by metric sparsification are more consistent across different values of q than those obtained via spectral sparsification. For instance, on the MNIST dataset with Gaussian kernel similarity, the metric backbone Gmb 30 and Gmb 40 have both approximately 70k edges, with 64k edges in common. In contrast, the spectrally sparsified graphs Gss 30 and Gss 40, also with around 70k edges each, share only 22k edges in common. 6 Conclusion The metric backbone plays a crucial role in preserving several essential properties of a network. Notably, the metric backbone effectively preserves the network community structure, although many inter-community edges belong to shortest paths. In this paper, we have specifically proven that the metric backbone preserves the community structure in weighted stochastic block models. Moreover, our numerical experiments emphasize the performance of the metric backbone as a powerful graph sparsification tool. Furthermore, the metric backbone can serve as a preprocessing step for graph construction employing q-nearest neighbors, alleviating the sensitivity associated with selecting the hyperparameter q and producing sparser graphs. 9Consider, for example, an adversary adding an arbitrary number of edges with a cost of ω log n in a w SBM. Proposition 1 proves that this addition does not affect the metric backbone, as none of these high-cost edges will be included in any shortest path. [1] Emmanuel Abbe. Community detection and stochastic block models: recent developments. The Journal of Machine Learning Research, 18(1):6446 6531, 2017. [2] Lada A Adamic and Eytan Adar. Friends and neighbors on the web. Social networks, 25(3):211 230, 2003. [3] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, and Jorge Luis Reyes-Ortiz. A public domain dataset for human activity recognition using smartphones. In 21st European Symposium on Artificial Neural Networks (ESANN 13), 2013. [4] Konstantin Avrachenkov and Maximilien Dreveton. Statistical Analysis of Networks. Boston Delft: now publishers, 10 2022. [5] Afonso S. Bandeira and Ramon van Handel. Sharp nonasymptotic bounds on the norm of random matrices with independent entries. The Annals of Probability, 44(4):2479 2506, 2016. [6] Shankar Bhamidi, Remco van der Hofstad, and Gerard Hooghiemstra. First passage percolation on random graphs with finite mean degrees. The Annals of Applied Probability, 20(5):1907 1965, 2010. [7] Béla Bollobás, Svante Janson, and Oliver Riordan. The phase transition in inhomogeneous random graphs. Random Structures & Algorithms, 31(1):3 122, 2007. [8] Rion Brattig Correia, Alain Barrat, and Luis M Rocha. Contact networks have small metric backbones that maintain community structure and are primary transmission subgraphs. PLo S Computational Biology, 19(2):e1010854, 2023. [9] Jeff Calder, Brendan Cook, Matthew Thorpe, and Dejan Slepcev. Poisson learning: Graph based semi-supervised learning at very low label rates. In International Conference on Machine Learning, pages 1306 1316. PMLR, 2020. [10] Shihyen Chen, Bin Ma, and Kaizhong Zhang. On the similarity metric and the distance metric. Theoretical Computer Science, 410(24-25):2365 2376, 2009. [11] Gabor Csardi and Tamas Nepusz. The igraph software package for complex network research. Inter Journal, Complex Systems:1695, 2006. [12] Fraser Daly, Matthias Schulte, and Seva Shneer. First passage percolation on Erd os Rényi graphs with general weights. ar Xiv preprint ar Xiv:2308.12149, 2023. 
[13] Michaël Defferrard, Lionel Martin, Rodrigo Pena, and Nathanaël Perraudin. Py GSP: Graph signal processing in Python, October 2017. [14] Edsger Wybe Dijkstra. A note on two problems in connexion with graphs. Journal of the ACM (Numerische Mathematik, 1:269 271, 1959. [15] Wei Dong, Charikar Moses, and Kai Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web, pages 577 586, 2011. [16] Alan M Frieze and Geoffrey R Grimmett. The shortest-path problem for graphs with random arc-lengths. Discrete Applied Mathematics, 10(1):57 77, 1985. [17] Michelle Girvan and Mark EJ Newman. Community structure in social and biological networks. Proceedings of the national academy of sciences, 99(12):7821 7826, 2002. [18] Bruce Hajek, Yihong Wu, and Jiaming Xu. Achieving exact cluster recovery threshold via semidefinite programming. IEEE Transactions on Information Theory, 62(5):2788 2797, 2016. [19] John M Hammersley and Dominic JA Welsh. First-passage percolation, subadditive processes, stochastic networks, and generalized renewal theory. In Bernoulli 1713, Bayes 1763, Laplace 1813: Anniversary Volume. Proceedings of an International Research Seminar Statistical Laboratory University of California, Berkeley 1963, pages 61 110. Springer, 1965. [20] Refael Hassin and Eitan Zemel. On shortest paths in graphs with random weights. Mathematics of Operations Research, 10(4):557 564, 1985. [21] Reinhard Heckel and Helmut Bölcskei. Robust subspace clustering via thresholding. IEEE Transactions on Information Theory, 61(11):6320 6342, 2015. [22] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of classification, 2:193 218, 1985. [23] Svante Janson. One, two and three times log n/n for paths in a complete graph with random weights. Combinatorics, Probability and Computing, 8(4):347 361, 1999. [24] Vasiliki Kalavri, Tiago Simas, and Dionysios Logothetis. The shortest path is not always a straight line: leveraging semi-metricity in graph analysis. Proceedings of the VLDB Endowment, 9(9):672 683, 2016. [25] István Kolossváry and Júlia Komjáthy. First passage percolation on inhomogeneous random graphs. Advances in Applied Probability, 47(2):589 610, 2015. [26] Yann Le Cun, Corinna Cortes, and Christopher JC Burges. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998. [27] Jing Lei and Alessandro Rinaldo. Consistency of spectral clustering in stochastic block models. Annals of Statistics, 43(1):215 237, 2015. [28] Lasse Leskelä and Hoa Ngo. First passage percolation on sparse random graphs with boundary weights. Journal of Applied Probability, 56(2):458 471, 2019. [29] Catherine C. Mc Geoch. All-pairs shortest paths and the essential subgraph. Algorithmica, 13:426 441, 1995. [30] Mark EJ Newman. Networks. Oxford University Press, 07 2018. [31] Tiago P Peixoto. Nonparametric weighted stochastic block models. Physical Review E, 97(1):012306, 2018. [32] Tiago P. Peixoto and Alec Kirkley. Implicit models, latent compression, intrinsic biases, and cheap lunches in community detection. Phys. Rev. E, 108:024309, Aug 2023. [33] Yuval Peres, Dmitry Sotnikov, Benny Sudakov, and Uri Zwick. All-pairs shortest paths in O(n2) time with high probability. Journal of the ACM (JACM), 60(4):1 25, 2013. [34] M Ángeles Serrano, Marián Boguná, and Alessandro Vespignani. Extracting the multiscale backbone of complex weighted networks. 
Proceedings of the national academy of sciences, 106(16):6483 6488, 2009. [35] Tiago Simas, Rion Brattig Correia, and Luis M Rocha. The distance backbone of complex networks. Journal of Complex Networks, 9(6), 2021. [36] Tiago Simas and Luis M Rocha. Distance closures on complex networks. Network Science, 3(2):227 268, 2015. [37] Daniel A. Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. SIAM Journal on Computing, 40(6):1913 1926, 2011. [38] Vincent A Traag, Ludo Waltman, and Nees Jan Van Eck. From Louvain to Leiden: guaranteeing well-connected communities. Scientific reports, 9(1):5233, 2019. [39] Piet Van Mieghem and Huijuan Wang. The observable part of a network. IEEE/ACM Transactions on Networking, 17(1):93 105, 2009. [40] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17:395 416, 2007. [41] Huijuan Wang, Javier Martin Hernandez, and Piet Van Mieghem. Betweenness centrality in a weighted network. Physical Review E, 77(4):046105, 2008. [42] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. ar Xiv preprint ar Xiv:1708.07747, 2017. [43] Ali Yassin, Abbas Haidar, Hocine Cherifi, Hamida Seba, and Olivier Togni. An evaluation tool for backbone extraction techniques in weighted complex networks. Scientific Reports, 13(1):17000, 2023. [44] Yi Yu, Tengyao Wang, and Richard J Samworth. A useful variant of the davis kahan theorem for statisticians. Biometrika, 102(2):315 323, 2015. A Proof of Proposition 1 Let us set up some notations before proving Proposition 1. We let (z, G) w SBM(n, π, p, f) where the model parameters verify Assumptions 1 and 2, and we denote by c and c the cost and the extended cost associated with G, respectively. We let Γ1, , Γk be the k communities, i.e., Γa = {w [n]: zw = a} , and we denote by n1, , nk their respective sizes, i.e., na = |Γa|. As k = Θ(1), the concentration of multinomial distributions ensures that na = (1 + o(1))πan whp. A.1 Particular case of exponentially distributed costs Proof of Proposition 1 for exponentially distributed costs. We firstly assume that Fab Exp(λab) with λab > 0. It is immediate to notice that (fab)ab verify Assumption 2. Let u, v be two vertices chosen uniformly at random in Γa and Γb respectively, so that zu = a and zv = b. To explore the neighborhood of vertex u, we consider a first passage percolation (FPP) on G starting from u. More precisely, for any t > 0, we denote by B(u, t) = {w V : C(u, w) t} the set of vertices within a cost t of u, and by Cu(q) = min {t 0: |B(u, t)| q + 1} , the cost going from u to its q-th nearest neighbor (with the convention that min = ). In particular, B(u, Cu(q)) is the set of the q nearest neighbors of u. Let U(q) = (U1(q), , Uk(q)) be the collection of sets such that Uℓ(q) is the set of vertices that are in the q nearest neighborhood of u and in community ℓ, i.e., Uℓ(q) = B(u, Cu(q)) Γℓ. Finally, we denote by S(u, q) the matrix whose entry Sℓℓ (u, q) is equal to the number of edges going from Uℓ(q) to Γℓ \Uℓ (q). The outline of the proof is as follows. In a first step (i), we study the FPP starting from vertex u to obtain upper and lower bounds on the cost Cu( n log n) for going from u to its n log n-nearest neighbor10. More precisely, we will establish that nρn log n Cu p n log n 1 + o(1) 2τmin (A.1) holds whp. By symmetry, the same upper bound also holds for Cv( n log n). Next, we will show in step (ii) that B(u, n log n) B(v, n log n) = whp. 
Together with the upper bound (A.1), we can then upper bound the cost C(u, v) of the shortest path from u to v. Indeed, let w1 B(u, n log n) B(v, n log n) (such a w1 exists whp by step (ii)), and consider the path P = u w1 v, where u w1 and w1 v denote the shortest path from u to w1 and from w1 to v, respectively. Because the cost C(u, v) of the shortest path from u to v is upper-bounded by the cost of the path P, we have whp C(u, v) C(u, w1) + C(w1, v) 1 + o(1) where the second inequality holds because w1 B(u, n log n) B(v, n log n), which enables to use the upper bound (A.1). This establishes the desired upper bound on C(u, v). To lower-bound C(u, v), we will establish that 2τmax nρn log n Cu holds whp. Again, a similar bound holds for Cv p n/ log n by symmetry. Finally, we will show in step (iii) that B u, Cu( p n/ log n) B v, Cv( p n/ log n = whp. Combining this with the lower bound (A.2), we conclude then that whp C(u, v) 1 + o(1) 10Because n log n is not necessarily integer we should write n log n but we drop the and to avoid overburdening the notations. (i) Upper and lower bounds on Cu( p n/ log n) and Cu( n log n). In this paragraph, we will establish (A.1) and (A.2). A key property of the FPP is to notice that, conditioned on S(u, ) = (S(u, 1), , S(u, n)), the random numbers Cu(q + 1) Cu(q) are independent and exponentially distributed, such that Cu(q + 1) Cu(q) | S(u, ) Exp ℓ,ℓ λℓℓ Sℓℓ (u, q) Indeed, Cu(q + 1) Cu(q) is the difference in the cost of traveling from u to its (q + 1)-th nearestneighbor minus the cost from u and its q-th nearest neighbor. In other words, Cu(q + 1) Cu(q) | S(u, ) = min w Uc(q) w Γc \Uc (q) cww , (A.4) with cww Exp(λcc ). Statement (A.3) follows because min {Exp(λ), Exp(µ)} Exp(λ + µ). Let q n log n. The number of edges Sℓℓ (u, q) between the sets Uℓ(q) and Γℓ \Uℓ (q) is binomially distributed such that Sℓℓ (u, q) Bin (|Uℓ(q)| |Γℓ \Uℓ (q)| , pcc ) . The number of vertices in Γℓ is nℓ = Θ(n) and the number of vertices in |Uℓ (q)| verifies |Uℓ (q)| q n. Therefore, E [Sℓℓ (u, q)] = |Uℓ(q)| (nℓ |Uℓ (q)|) pℓℓ = (1 + o(1)) |Uℓ(q)| nℓ pℓℓ . Hence E [Sℓℓ (u, q)] 1, and the concentration of binomial distributions ensure that whp Sℓℓ (u, q) = (1 + o(1)) |Uℓ(q)| (nℓ |Uℓ(q)|)pℓℓ = (1 + o(1)) |Uℓ(q)| nℓ pℓℓ . (A.5) Therefore, using nℓ = (1 + o(1))πℓ n and pℓℓ = Bℓℓ ρn, we have from (A.4) E [Cu(q + 1) Cu(q) | S(u, )] = 1 + o(1) P ℓ|Uℓ(q)| P ℓ nℓ λℓℓ pℓℓ = 1 + o(1) nρn P ℓ πℓ λℓℓ Bℓℓ . We notice that P l |Uℓ(q)| P ℓ πℓ λℓℓ Bℓℓ = U T (q)T1k where T = (Λ B) diag(π) is the operator defined in (2.1). Then, using P ℓ|Uℓ(q)| = q, we have ℓ |Uℓ(q)| X ℓ πℓ λℓℓ Bℓℓ τmaxq. (A.6) 1 + o(1) nρnτmaxq E [Cu(q + 1) Cu(q) | S(u, )] 1 + o(1) nρnτminq . (A.7) This bound is uniform in S(u, ). Thus, using the total law of probability and summing over all 1 q n log n leads to 2τmax nρn log n ECu( p n log n) 1 + o(1) 2τmin , (A.8) where we used P n log n q=1 q 1 = 2 1(log n + log log n) + Θ(1). Let us now upper-bound the variance of Cu n log n . By the law of total variance, we have Var h Cu( p n log n) i = E h Var h Cu( p n log n) | S(u, ) ii + Var h E h Cu( p n log n) | S(u, ) ii . The first term on the right-hand side of (A.9) can be upper bounded by proceeding similarly as for the expectation. 
Indeed, we have Var [Cu(q + 1) Cu(q) | S(u, )] 1 + o(1) n2ρ2nτ 2 minq2 , and the independence of Cu(q + 1) Cu(q) conditioned on S(u, ) leads to Var h Cu( p n log n) | S(u, ) i = n log n 1 X q=1 Var (Cu(q + 1) Cu(q) | S(u, )) 1 + o(1) n2ρ2nτ 2 min 1 + o(1) n2ρ2nτ 2 min To upper bound the second term on the right-hand side of (A.9), we notice that Var h E h Cu( p n log n) | S(u, ) ii = Var n log n 1 X q=1 E [Cu(q + 1) Cu(q) | S(u, )] n log n 1 X q=1 Var [E [Cu(q + 1) Cu(q) | S(u, )]] , where the first line holds by the linearity of the conditional expectation, and the second one by the independence of Cu(q + 1) Cu(q) conditioned on S(u, ). Moreover, recall that Var(X) (b a)2/4 if a X b. Hence, using the upper and lower bound on Cu(q + 1) Cu(q) obtained in (A.7) leads to Var [E [Cu(q + 1) Cu(q) | S(u, )]] 1 + o(1) 1 nρnτminq 1 nρnτmaxq and therefore Var h E h Cu( p n log n) | S(u, ) ii 1 + o(1) 4n2ρ2nτ 2 min Combining (A.10) and (A.11) into (A.9) provides Var h Cu( p n log n) i 5 4 (1 + o(1))π2 6τ 2 minn2ρ2n . (A.12) This upper bound (A.12) ensures that Var Cu( n log n) = o E Cu( n log n) 2 , and therefore an application of Chebyshev s inequality ensures that 2τmax nρn log n Cu p n log n 1 + o(1) with high probability. Likewise, by symmetry for a FPP starting from v instead of u, we find 2τmax nρn log n Cv p n log n 1 + o(1) This establishes (A.1). Moreover, we establish (A.2) by doing a slight modification of this proof. More precisely, we sum over all q p n/ log n instead of all q n log n in step (A.8). We obtain the same bound as in (A.8) because P n/ log n q=1 q 1 = 2 1(log n log log n) + Θ(1). (ii) B(u, Cu( n log n)) B(v, Cv( n log n)) = whp. For ease of notations, let us shorten by Bu = B(u, Cu( n log n)) the set of the n log n nearest neighbours of u and by Bv(q) the set of the q-nearest neighbours of v. We also denote by wq the q-nearest neighbor of v. A key property is that the two FPPs (starting from vertex u and starting from vertex v) are independent of each other as long as they do not intersect. To make this rigorous, we denote by Q the random variable counting the number of steps made by the FPP starting from v without intersecting with Bu, i.e., Q = min {q [n]: Bu Bv(q) = } . We now show that P (Q > q) = o(1) whenever q p n/ log n, which implies that Bu Bv( n log n) = . Using [6, Lemma B.1], we have P (Q > m) = E q=1 Q(q) (Q > q | Q > q 1) where Q(q) denotes the conditional distribution given Bu and Bv(q). We further notice that Q(q) (Q > q | Q > q 1) = 1 Q(q) (Q = q | Q > q 1) , and that the event {Q > q 1} is equivalent to the event {w1 Bu, , wq 1 Bu}, i.e., the FPP starting from v has not yet collided with the one starting from u. Conditionally on this event, all vertices within the same block have an equal probability of being chosen at the q-th step of the FPP. Therefore, Q(q) (Q = q | Q > q 1) = X ℓ [k] P (wq Bu | wq Γℓ, Q > q 1) P (wq Γℓ) |Γℓ| P (wq Γℓ) . ℓ [k] |Bu Γℓ| = n log n, we have that11 maxℓ|Bu Γℓ| k 1 n log n. Moreover, |Γℓ| (1 + o(1))πmaxn whp and P ℓ [k] P (wq Γℓ) = 1. Therefore, Q(q) (Q = q | Q > q 1) maxℓ [k] |Bu Γℓ| minℓ [k] |Γℓ| ℓ [k] P (wq Γℓ) Going back to (A.13), this implies that which indeed goes to 0 when m p (iii) B(u, Cu( p n/ log n)) B(v, Cv( p n/ log n)) = whp. We proceed similarly to step (ii) by considering the FPP starting from v, and denote by wq the q-nearest neighbor of v. For ease of notations, let us denote in this paragraph Bu = B u, Cu p n/ log n and Bv(q) = B (v, Cv (q)). 
Note that, in contrast to step (ii), we now look at the FPP up to step p n/ log n instead of the FPP up to step n log n. We define Q = min{q [n]: Bu Bv(q) = }. P (Q = q) = P (wq Bu | Q > q 1) P (Q > q 1) |Γℓ| P (wq Γℓ| Q > q 1) P (Q > q 1) , 11Recall that if x1 + + xk = m with xℓ 0 then maxℓxℓ k 1m. where, as in step (ii), the second equality holds because all vertices within the same block have the same probability of being chosen at the next step of the FPP as long as the two FPP have not collided. Using |Γℓ| (1 + o(1))πminn and |Bu Γℓ| |Bu| p n/ log n leads to P (Q = q) 1 πmin 1 + o(1) n log n. P (Bu Bv = ) = q=1 P (Q = q) 1 + o(1) πmin log n = o(1). This concludes the proof of Proposition 1 when Fab Exp(λab). A.2 General case Proof of Proposition 1 with non-exponentially distributed costs. In this section, the probability densities Fab are regular (see Assumption 2) but not necessarily exponentially distributed. We adapt the argument provided at the beginning of Section 2 of [23]. The strategy is to transform the graph G = (V, E, c) generated from a w SBM(n, π, F) into a graph Gexp = (V, E, cexp) with the same vertices and edge set but where the costs cexp are obtained from F and are exponentially distributed so that Gexp has the same distribution as a w SBM (n, π, (Exp(λab))ab). We will then show that the shortest paths in Gexp and G are the same. Let fab be the density function of the cdf Fab, and remember that λab = F ab(0). Let F 1 ab be the generalized inverse function12 of Fab. By regularity of Fab (Assumption 2), we have limt 0 Fab(t)/t = λab and thus limt 0 F 1 ab (t)/t = λ 1 ab . We also denote by gab(x) = λabe λabx and by Gab(x) = 1 e λabx the density and cumulative distribution functions of Exp(λab). We first show that Proposition 1 holds for uniformly distributed edge weights, i.e., when fab(x) = λab1(x [0, λ 1 ab ]). Denote by Fab the cdf associated to fab, and let (z, G) w SBM(n, π, p, F) where G = (V, E, c). We construct the graph Gexp = (V, E, cexp) such that for all (w, w ) E with zw = ℓand zw = ℓ we have cexp(w, w ) = G 1 ℓℓ (λℓℓ c(w, w )). Note that the (unweighted) edges of G and Gexp are the same, only the costs c and cexp differ. Since λℓℓ c(w, w ) Unif(0, 1) we have cexp(w, w ) Exp(λℓℓ ). In particular, Gexp has the same distribution of a weighted SBM whose costs are exponentially distributed, i.e., (z, Gexp) w SBM(n, π, p, G) with G = (Gab)a,b. Let P(u, v) (resp., Pexp(u, v)) be the shortest path from u to v in G (resp., in Gexp), and denote by C(u, v) (resp., by Cexp(u, v)) its cost. We know from Section A.1 that Cexp(u, v) = Θ log n Suppose that the edge (w, w ) E belongs to Pexp(u, v). From cexp(w, w ) Cexp(u, v) we notice that cexp(w, w ) = O( log n nρn ), and hence by Assumption 1 we have cexp(w, w ) = o(1). Moreover, by definition of cexp we have c(w, w ) cexp(w, w ) = 1 λℓℓ Gℓℓ (cexp(w, w )) cexp(w, w ) . Recall that limt 0 Gℓℓ (t)/t = λℓℓ . Thus, we have for any ϵ > 0 and for n large enough, 1 ϵ < c(w,w ) cexp(w,w ) < 1 + ϵ. This holds for any edge (w, w ) belonging to Pexp(u, v). Therefore, the sum of the costs c(w, w ) over Pexp(u, v) is at most (1 + ϵ)Cexp(u, v), and hence C(u, v) < (1 + ϵ)Cexp(u, v). (A.14) Similarly, if (w, w ) P(u, v), then the upper bound (A.14) together with Assumption 1 imply that C(u, v) = o(1). This in turn implies c(w, w ) = o(1) and hence 1 ϵ < c(w,w ) cexp(w,w ) < 1 + ϵ. Thus, summing over all edges in P(u, v) leads to (1 ϵ)Cexp(u, v) < C(u, v). (A.15) 12More precisely, we define F 1 ab (t) = inf{x R: Fab(x) t}. 
Because the cdf Fab is increasing, F 1 ab (t) is well-defined. Combining (A.14) with (A.15) shows that Proposition 1 holds for uniformly distributed edges weights. Finally, if the costs c(w, w ) are sampled from general distributions Fab verifying Assumption 2, then we construct the graph Gunif = (V, E, cunif) where cunif(u, v) = λ 1 zuzv Fzuzv(c(u, v)). We have cunif(u, v) Unif([0, λ 1 zuzv]) and we apply the previous reasoning (by replacing the exponential distributions with uniform distributions) to conclude. B Proof of Sections 2.3 and 2.4 B.1 Proof of Theorem 1 Proof of Theorem 1. Let G = (V, E, c) be a w SBM, and let u, v V be two arbitrary distinct vertices such that zu = a and zv = b. Then, adapting the proof13 of [39, Corollary 1], we can write 0 c uv(x) log(1 pab Fab(x)) dx, (B.1) where c uv(x) is the probability density function of the weight of the shortest path between u and v and Fab(x) = R x 0 fab(x)dx is the cumulative distribution function of the length of an edge between two vertices belonging to communities a and b. Proposition 1 ensures that, with high probability, the cost C(u, v) is a random variable whose support is lower and upper bounded by 1+o(1) nρn and 1+o(1) nρn , respectively. Its density function c uv(x) tends therefore to zero outside these two bounds, and hence by setting the log( ) factor in Equation (B.1) to the lower and, respectively, the upper bound of the support of C(u, v), and by integrating next the pdf c uv(x) over the whole interval, Equation (B.1) implies that log 1 pab Fab pmb ab log 1 pab Fab We finish the proof using Fab (1+o(1)) log n = (1 + o(1))λab log n τminnρn and Fab (1+o(1)) log n (1 + o(1))λab log n τmaxnρn (Assumption 2). B.2 Proof of Theorem 2 Proof of Theorem 2. Let G = (V, E, c) be the original graph and Gmb = (V, Emb, cmb) its metric backbone. Let Eθ E be the subset of edges whose cost is no more than θ: (u, v) Eθ c(u, v) θ, and which is the edge set of the corresponding threshold graph Gθ = (V, Eθ, cθ), where cθ is the restriction of c to Eθ. Denote by W, W mb, W θ Rn n + the adjacency matrices of G, Gmb, and Gθ, respectively. Overview of the proof. To prove Theorem 2, the key idea is to choose a threshold θ large enough such that the threshold graph contains the metric backbone (i.e., Emb Eθ), but not too large so that the adjacency matrices W mb and W θ are not too different. Lemma 1 ensures that, for any ϵ > 0, we have P max1 u,v n C(u, v) 3+ϵ Hence, Emb Eθ whp as soon as θ 3+ϵ τmin log n nρn . We choose θ = 4 τmin log n nρn , and we proceed by conditioning on the event Emb Eθ, which occurs with high probability given our choice of θ. We will first prove that the clusters can exactly be recovered using the eigenvectors of EW mb. Then, using Davis-Kahan s Theorem [44, Theorem 2], we show that the clusters can also be recovered from the adjacency matrix W mb, provided that W mb EW mb is small enough. More precisely, we obtain an upper-bound on loss(z, z) that depends on W mb EW mb . The main ingredients of the proof are thus (i) the justification of the choice of θ in Lemma 1 and (ii) the careful upper-bounding of W mb EW mb . 13We note that [39] state the result for a weighted Erd os-Rényi random graph, but their proof holds for a w SBM as well. Starting point: eigenstructure of the expected adjacency matrix EW mb and Davis-Kahan. Let Z {0, 1}n k be the one-hot encoding of the true community structure z [k]n, i.e., u [n], a [k]: Zua = 1 if zu = a, 0 otherwise. 
For any two vertices u, v belonging to communities a and b, respectively, we write E[W^mb_uv] = m_ab. Let M = (m_ab) ∈ R^{k×k}. We have EW^mb = Z M Z^T. Denote by |σ̃_1| ≥ ⋯ ≥ |σ̃_k| the k largest (in absolute value) eigenvalues of EW^mb, and by ũ_1, …, ũ_k their associated eigenvectors. Let Σ̃ = diag(σ̃_1, …, σ̃_k) ∈ R^{k×k} and Ũ = [ũ_1, …, ũ_k] ∈ R^{n×k}. We have

EW^mb = Ũ Σ̃ Ũ^T.   (B.3)

Let Δ = diag(√(nπ_1), …, √(nπ_k)) ∈ R^{k×k}. We have Z M Z^T = (Z Δ^{−1}) (Δ M Δ) (Z Δ^{−1})^T. Notice that Z Δ^{−1} ∈ R^{n×k} has orthonormal columns (indeed, (Z Δ^{−1})^T (Z Δ^{−1}) = Δ^{−1} Z^T Z Δ^{−1} = I_k because Z^T Z = Δ²). Let O D O^T be an eigendecomposition of the symmetric real-valued matrix Δ M Δ (that is, D ∈ R^{k×k} is a diagonal matrix whose diagonal elements are in decreasing order of absolute value, and O ∈ R^{k×k} is an orthogonal matrix). Then Z M Z^T = (Z Δ^{−1} O) D (Z Δ^{−1} O)^T is an eigendecomposition of Z M Z^T (because Z Δ^{−1} O has orthonormal columns). Hence, going back to (B.3), we have Σ̃ = D and Ũ = Z Δ^{−1} O for some orthogonal matrix O ∈ R^{k×k}.

Because Ũ = Z Δ^{−1} O, two vertices are in the same cluster if and only if their corresponding rows in Ũ are equal. In other words, the spectral embedding of the expected graph EW^mb is condensed into k points (Δ^{−1}O)_{1·}, …, (Δ^{−1}O)_{k·} of R^k. Consequently, k-means applied to the rows of Ũ recovers the true clusters (up to a permutation).

Next, denote by Σ̂ ∈ R^{k×k} and Û ∈ R^{n×k} the matrices of the k leading eigenvalues and eigenvectors of W^mb. Then [27, Lemma 5.3] ensures that any (1 + ϵ)-approximate solution ẑ of the k-means problem on Û Σ̂ satisfies

loss(ẑ, z) ≤ 4 (4 + 2ϵ) min_{O ∈ O_{k×k}} ‖Û O − Ũ‖²_F,   (B.4)

where O_{k×k} denotes the group of orthogonal k-by-k matrices and ‖·‖_F is the Frobenius norm. The Davis-Kahan theorem [44, Theorem 2] ensures the existence of an orthogonal matrix O ∈ O_{k×k} such that

‖Û O − Ũ‖_F ≤ 2^{3/2} k^{1/2} ‖W^mb − EW^mb‖ / |σ̃_k|,   (B.5)

where ‖·‖ denotes the matrix operator norm.

Let us now derive an expression for |σ̃_k|. Observe first that Δ M Δ and M Δ² have the same eigenvalues.¹⁴ From Lemma 4, we have

(1/(2τ²_max)) (Λ ∘ B)_ab log²n/(n²ρ_n) ≤ m_ab ≤ (1/(2τ²_min)) (Λ ∘ B)_ab log²n/(n²ρ_n).

The definition of T = [Λ ∘ B] diag(π) in (2.1) and the fact that Δ² = n diag(π) further imply that

(1/(2τ²_max)) T_ab log²n/(nρ_n) ≤ (M Δ²)_ab ≤ (1/(2τ²_min)) T_ab log²n/(nρ_n).

Recall that θ = 4 log n/(τ_min nρ_n) and that µ is the smallest (in absolute value) eigenvalue of T. Using the assumption τ_min = τ_max, we obtain that M Δ² = (θ log n/(8τ_min)) T, and thus

σ̃_k = µ θ log n/(8τ_min).   (B.6)

Therefore, combining (B.4), (B.5) and (B.6), we obtain

loss(ẑ, z) ≤ (4 + 2ϵ) 2⁵ k (8τ_min/(µ θ log n))² ‖W^mb − EW^mb‖².   (B.7)

Core of the proof: concentration of W^mb around EW^mb. We finish the proof by showing that ‖W^mb − EW^mb‖ = O(θ √(log n)) whp. First, by the triangle inequality, we have

‖W^mb − EW^mb‖ ≤ ‖W^mb − W^θ − E[W^mb − W^θ]‖ + ‖W^θ − EW^θ‖.   (B.8)

For ease of exposition, we upper-bound the two terms of (B.8) in the following order: (i) the second term ‖W^θ − EW^θ‖, and then (ii) the first term ‖W^mb − W^θ − E[W^mb − W^θ]‖.

(i) Let us first study ‖W^θ − EW^θ‖. The matrix X = W^θ/θ is symmetric with zero diagonal, and its entries {X_uv, u < v} are independent [0, 1]-valued random variables. Moreover, Lemma 3 shows that E[X_uv | z_u = a, z_v = b] = (1+o(1)) p_ab λ_ab θ/2 = Θ(log n/n). Thus, [18, Theorem 5] ensures that for any c > 0 there exists c' > 0 such that

P( ‖W^θ − EW^θ‖ ≥ c' θ √(log n) ) ≤ n^{−c}.   (B.9)

(ii) Next, let us study ‖W^mb − W^θ − E[W^mb − W^θ]‖. Denote Y = |W^mb − W^θ|/θ (entry-wise absolute value). By Lemma 1 and our choice of θ, we have E^mb ⊆ E^θ (for n large enough). Moreover, W^θ_uv = W^mb_uv for all {u, v} ∈ E^mb ∩ E^θ = E^mb. Hence,

Y_uv = W^θ_uv/θ if {u, v} ∈ E^θ\E^mb, and Y_uv = 0 otherwise,

and

E[Y_uv] = E[W^θ_uv]/θ if {u, v} ∈ E^θ\E^mb, and E[Y_uv] = 0 otherwise.

¹⁴ Let A, B be two symmetric matrices of the same size. Let λ be an eigenvalue of ABA, with corresponding eigenvector x. Multiplying both sides of ABAx = λx by A, we get A²By = λy with y = Ax.
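The quantity that (B.7) controls is the misclassification rate of plain spectral clustering run on the backbone. A minimal sketch of that estimator is given below; the helper name and its defaults (eigsh plus k-means++ initialization) are our choices and not code released with the paper, and the theory only requires the k-means step to return a (1+ϵ)-approximate solution of the k-means objective.

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def spectral_clustering(W, k, seed=0):
    """k-means on the rows of the matrix of the k leading eigenvectors of W.

    W can be the weighted adjacency matrix of the original graph, of its metric
    backbone, or of any other sparsified version; k is the number of communities.
    """
    _, U = eigsh(W, k=k, which="LM")   # k eigenpairs of largest magnitude
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(U)
```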
We can thus rewrite the matrices Y and EY as Y = R ∘ W^θ/θ and EY = R ∘ EW^θ/θ, where ∘ denotes the Hadamard (entry-wise) matrix product and R ∈ {0, 1}^{n×n} is defined by

R_uv = R_vu = 1 if {u, v} ∈ E^θ\E^mb, and R_uv = R_vu = 0 otherwise.

Let E_{c_1} be the event that all rows of R have at most c_1 log n non-zero entries (with c_1 > 0), i.e.,

E_{c_1} = { max_{u∈[n]} Σ_{v=1}^n 1{R_uv ≠ 0} ≤ c_1 log n }.

Because E^mb ⊆ E^θ, we have 1{R_uv ≠ 0} = R_uv = 1{(u, v) ∈ E^θ\E^mb} ≤ 1{(u, v) ∈ E^θ}. Thus, E_{c_1,θ} ⊆ E_{c_1}, where

E_{c_1,θ} = { max_{u∈[n]} Σ_{v=1}^n 1{(u, v) ∈ E^θ} ≤ c_1 log n }.

Hence, P(E_{c_1}) ≥ P(E_{c_1,θ}). Recall also that 1{(u, v) ∈ E^θ} = 1{c(u, v) ≤ θ}. Thus, by Lemma 2, for any c_0 > 0 there exists a c_1 > 0 such that

P(E_{c_1}) ≥ 1 − n^{−c_0}.   (B.10)

Conditioned on the high-probability event E_{c_1}, every row of the matrix R has at most c_1 log n non-zero elements. Moreover, W^θ/θ is symmetric and has bounded (hence sub-gaussian) entries. Therefore, [5, Corollary 3.9] ensures the existence of constants C, C' > 0 such that

P( ‖R ∘ (W^θ − EW^θ)‖/θ ≥ C √(log n) + t | E_{c_1} ) ≤ e^{−C' t²}.

Using t = C'' √(log n) for a suitable constant C'' > 0, and because Y − EY = R ∘ (W^θ − EW^θ)/θ, we obtain the following statement: for any c > 0, there exists c' > 0 such that

P( ‖Y − EY‖ ≥ c' √(log n) | E_{c_1} ) ≤ n^{−c}.

Using (B.10), we finally obtain

P( ‖W^mb − W^θ − E[W^mb − W^θ]‖ ≥ c' θ √(log n) ) ≤ 2 n^{−c}.   (B.11)

Conclusion. Using (B.8), (B.9) and (B.11), we have

‖W^mb − EW^mb‖ ≤ 2 c' θ √(log n),   (B.12)

with probability at least 1 − 3 n^{−c}. The proof ends by combining (B.12) with (B.7).

B.3 Additional Lemmas

Lemma 1. Let (z, G) ~ wSBM(n, π, p, F) and suppose that Assumptions 1 and 2 hold. Then, with high probability,

max_{u,v∈[n]} C(u, v) ≤ (1 + o(1)) · 3 log n/(τ_min nρ_n).

Proof. We proceed as in the proof of Proposition 1, by first assuming that F_ab ≡ Exp(λ_ab). The extension to more general weight distributions follows from the same coupling argument presented in Section A.2 and is thus omitted. We use the same notations as in the proof of Proposition 1; recall in particular the notations C_u(q), U_ℓ(q) and X_{ℓℓ'}(u, q). Because we established that B(u, C_u(√(n log n))) ∩ B(v, C_v(√(n log n))) ≠ ∅ whp, we have C(u, v) ≤ C_u(√(n log n)) + C_v(√(n log n)), and hence

max_{u,v} C(u, v) ≤ max_{u,v} [ C_u(√(n log n)) + C_v(√(n log n)) ] = 2 max_u C_u(√(n log n)).

Using a union bound, we obtain for any t > 0,

P( max_u C_u(√(n log n)) ≥ t ) ≤ Σ_{u=1}^n P( C_u(√(n log n)) ≥ t ) = n P( C_{u*}(√(n log n)) ≥ t ),   (B.13)

where u* denotes an arbitrary vertex of [n]. For any s > 0, we have (by the Chernoff bound)

P( C_{u*}(√(n log n)) ≥ t ) ≤ e^{−st} E[ e^{s C_{u*}(√(n log n))} ].   (B.14)

Recall from (A.3) that, conditionally on S(u*, ·), the increments C_{u*}(q+1) − C_{u*}(q) are independent Exp(θ_q) random variables with θ_q = Σ_{ℓ,ℓ'} λ_{ℓℓ'} S_{ℓℓ'}(u*, q). As for X ~ Exp(θ) we have E[e^{sX}] = θ/(θ − s) = 1 + s/(θ − s),

E[ e^{s C_{u*}(√(n log n))} | S(u*, ·) ] = Π_{q=1}^{√(n log n)} E[ e^{s (C_{u*}(q+1) − C_{u*}(q))} | S(u*, ·) ] = exp( Σ_{q=1}^{√(n log n)} log(1 + s/(θ_q − s)) ) ≤ exp( Σ_{q=1}^{√(n log n)} s/(θ_q − s) ).

Let 0 < ϵ < 1/6. From Equations (A.5) and (A.6), we have whp (for n large enough) θ_q ≥ (1 − ϵ) nρ_n τ_min q. Choose s = (1 − 2ϵ) nρ_n τ_min, and let α_n be a diverging sequence verifying α_n = o(√(n log n)), to be chosen later. We split the sum at q = α_n. We have, for any 1 ≤ q ≤ α_n,

θ_q − s ≥ nρ_n τ_min q [ (1 − ϵ) − (1 − 2ϵ)/q ] ≥ nρ_n τ_min ϵ q,

while for α_n ≤ q we have (1 − 2ϵ)/α_n ≤ ϵ (for n large enough) and thus

θ_q − s ≥ nρ_n τ_min q [ (1 − ϵ) − (1 − 2ϵ)/q ] ≥ nρ_n τ_min q [ (1 − ϵ) − (1 − 2ϵ)/α_n ] ≥ nρ_n τ_min q (1 − 2ϵ),

and thus

Σ_{q=1}^{√(n log n)} s/(θ_q − s) ≤ (1 − 2ϵ) [ Σ_{q=1}^{α_n} 1/(ϵ q) + Σ_{q=α_n+1}^{√(n log n)} 1/((1 − 2ϵ) q) ].

Recalling that Σ_{q=m+1}^{n} q^{−1} ≤ log(n/m) for any m ≥ 1 (by integration) further leads to

Σ_{q=1}^{√(n log n)} s/(θ_q − s) ≤ (1 − 2ϵ) [ (1/ϵ)(1 + log α_n) + (1/(1 − 2ϵ)) log( √(n log n)/α_n ) ],

and hence

E[ e^{s C_{u*}(√(n log n))} | S(u*, ·) ] ≤ exp( (1 − 2ϵ) [ (1/ϵ)(1 + log α_n) + (1/(1 − 2ϵ)) log( √(n log n)/α_n ) ] ).

Because the right-hand side of this last upper bound does not depend on S(u*, ·), we have by the law of total probability,

E[ e^{s C_{u*}(√(n log n))} ] ≤ exp( (1 − 2ϵ) [ (1/ϵ)(1 + log α_n) + (1/(1 − 2ϵ)) log( √(n log n)/α_n ) ] ).   (B.15)

We choose α_n = √(log n). Because α_n = ω(1), α_n = o(log n), and ϵ is fixed, we have for n large enough,

(1/ϵ)(1 + log α_n) ≤ ϵ log n / 2   and   (1/(1 − 2ϵ)) log( √(n log n)/α_n ) ≤ (1 + 3ϵ) log n / 2,

where the second inequality uses ϵ < 1/6.
Thus, we can rewrite the inequality (B.15) (for n large enough) as

E[ e^{s C_{u*}(√(n log n))} ] ≤ exp( (1 − 2ϵ)(1 + 4ϵ) log n / 2 ) ≤ exp( (1 + 2ϵ) log n / 2 ).

Let t = (3 + ϵ') log n/(2 nρ_n τ_min) with ϵ' = 15ϵ. Recall that s = (1 − 2ϵ) nρ_n τ_min. From (B.14) we obtain

P( C_{u*}(√(n log n)) ≥ t ) ≤ e^{ −(1−2ϵ)(3+ϵ') log n/2 + (1+2ϵ) log n/2 } = e^{ −(log n/2)(2 − 8ϵ − 2ϵϵ' + ϵ') } ≤ e^{ −(1+ϵ) log n },

where the last inequality uses ϵ' − 8ϵ − 2ϵϵ' = 7ϵ − 30ϵ² ≥ 2ϵ (since ϵ < 1/6). Because 1 + ϵ > 1, we have P( C_{u*}(√(n log n)) ≥ t ) = o(n^{−1}), and we conclude using (B.13).

Lemma 2. Let (z, G) ~ wSBM(n, π, p, F) and suppose that Assumptions 1 and 2 hold. Let c_0, c_1 > 0 be two constants (independent of n), and let θ = c_0 log n/(nρ_n). Denote

E_{c_1,θ} = { max_{u∈[n]} Σ_{v=1}^n 1{ c(u, v) ≤ θ } ≤ (4 B_max λ_max c_0 + c_1) log n }.

Then, P(E_{c_1,θ}) ≥ 1 − n^{−c_1}.

Proof. For u ≠ v ∈ [n] such that z_u = a and z_v = b, we have (recall that the edge (u, v) is present in E with probability p_ab and that, when present, its cost is sampled from F_ab)

P( c(u, v) ≤ θ | z_u = a, z_v = b ) = p_ab F_ab(θ).

Recall also that, by Assumption 1, p_ab = B_ab ρ_n and that, by Assumption 2, F_ab(x) = (1 + o(1)) λ_ab x. Using the law of total probability, we obtain

P( c(u, v) ≤ θ ) ≤ (1 + o(1)) B_max λ_max θ ρ_n,

where B_max = max_{1≤a,b≤k} B_ab and λ_max = max_{1≤a,b≤k} λ_ab. Thus, for n large enough,

P( c(u, v) ≤ θ ) ≤ 2 B_max λ_max θ ρ_n.

For ease of notation, let p_θ = 2 B_max λ_max θ ρ_n and c̃ = 4 B_max λ_max c_0 + 1 + c_1. For u ∈ [n], denote d_u = Σ_{v=1}^n 1{ c(u, v) ≤ θ }. We have

P( max_{u∈[n]} d_u ≥ c̃ log n ) ≤ Σ_{u∈[n]} P( d_u ≥ c̃ log n ).   (B.16)

Moreover, for any t > 0, we have

E[ e^{t d_u} ] ≤ ( 1 − p_θ + p_θ e^t )^n ≤ e^{ n p_θ (e^t − 1) },

where the second inequality uses log(1 + x) ≤ x. Using Chernoff's bound, we have

P( d_u ≥ c̃ log n ) ≤ e^{ −t c̃ log n + n p_θ (e^t − 1) }.

Let t = 1. Because e − 1 ≤ 2 and n p_θ = 2 B_max λ_max c_0 log n, we obtain

P( d_u ≥ c̃ log n ) ≤ e^{ −log n ( c̃ − 4 B_max λ_max c_0 ) } = n^{−(1 + c_1)}.

We finish the proof using (B.16).

Lemma 3. Let (z, G) ~ wSBM(n, π, p, F) and suppose that Assumptions 1 and 2 hold. Let θ = c log n/(nρ_n). Denote by W the weighted adjacency matrix of G, and by W^θ the weighted adjacency matrix of the threshold graph. Let u, v ∈ [n] be such that z_u = a and z_v = b. We have E[W^θ_uv] = (1 + o(1)) p_ab λ_ab θ²/2.

Proof. We have W^θ_uv = W_uv 1{ W_uv ≤ θ }. Moreover, conditionally on W_uv ≠ 0, the entry W_uv is sampled from F_ab. Thus,

E[ W^θ_uv ] = p_ab E[ W_uv 1{ W_uv ≤ θ } | W_uv ≠ 0 ].

Denote by f_ab the pdf of F_ab, and recall from Assumption 2 that f_ab(x) = λ_ab + O(x). Because θ ≪ 1, we have

E[ W^θ_uv ] = p_ab ∫_0^θ x f_ab(x) dx = (1 + o(1)) p_ab λ_ab θ²/2.

Lemma 4. Let (z, G) ~ wSBM(n, π, p, F) and suppose that Assumptions 1 and 2 hold. Denote by W the weighted adjacency matrix of G, and by W^mb the weighted adjacency matrix of the metric backbone of G. Let u, v ∈ [n] be such that z_u = a and z_v = b. We have

(1/(2τ²_max)) (Λ ∘ B)_ab log²n/(n²ρ_n) ≤ E[W^mb_uv] ≤ (1/(2τ²_min)) (Λ ∘ B)_ab log²n/(n²ρ_n).

Proof. Let u, v ∈ [n] be two arbitrary vertices such that z_u = a and z_v = b. We have W^mb_uv = c(u, v) 1{ (u, v) ∈ E^mb }. Recall from Proposition 1 that, whp,

(τ_max)^{−1} log n/(nρ_n) ≤ C(u, v) ≤ (τ_min)^{−1} log n/(nρ_n),

where C(u, v) is the cost of the shortest path from u to v. Therefore,

E[ W^mb_uv ] = E[ c(u, v) 1{ (u, v) ∈ E^mb } ] ≤ E[ c(u, v) 1{ c(u, v) ≤ (1/τ_min) log n/(nρ_n) } ] = p_ab λ_ab (1/2) ( log n/(τ_min nρ_n) )²,

where we used Lemma 3 with θ = (1/τ_min) log n/(nρ_n). Similarly,

E[ W^mb_uv ] ≥ E[ c(u, v) 1{ c(u, v) ≤ (1/τ_max) log n/(nρ_n) } ] = p_ab λ_ab (1/2) ( log n/(τ_max nρ_n) )².

Recalling that p_ab = B_ab ρ_n, we obtain

(1/(2τ²_max)) (Λ ∘ B)_ab log²n/(n²ρ_n) ≤ E[ W^mb_uv ] ≤ (1/(2τ²_min)) (Λ ∘ B)_ab log²n/(n²ρ_n).
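Lemma 4 quantifies the phenomenon at the heart of the paper: the expected backbone weight E[W^mb_uv] is proportional to (Λ ∘ B)_ab, so intra- and inter-community edges are thinned at (almost) the same rate. The short simulation below (our own illustration with arbitrary parameter values, not one of the paper's experiments) makes this visible on a two-block wSBM with exponential costs before moving on to the real-data experiments of Appendix C.

```python
import numpy as np
import networkx as nx

# Monte-Carlo sketch (assumed parameters): on a two-block wSBM with exponential costs,
# the ratio of intra- to inter-community edges is roughly the same before and after
# taking the metric backbone, i.e., the backbone thins the edge set almost uniformly.
rng = np.random.default_rng(0)
n, p_in, p_out = 400, 0.20, 0.05
z = np.repeat([0, 1], n // 2)                 # two communities of equal size

G = nx.Graph()
G.add_nodes_from(range(n))
for u in range(n):
    for v in range(u + 1, n):
        if rng.random() < (p_in if z[u] == z[v] else p_out):
            G.add_edge(u, v, weight=rng.exponential(1.0))   # Exp(1) edge costs

# Metric backbone: an edge is kept iff its cost equals the shortest-path distance.
dist = dict(nx.all_pairs_dijkstra_path_length(G, weight="weight"))
backbone = [(u, v) for u, v, c in G.edges(data="weight") if c <= dist[u][v] + 1e-12]

def intra_inter_ratio(edges):
    edges = list(edges)
    intra = sum(z[u] == z[v] for u, v in edges)
    return intra / max(len(edges) - intra, 1)

print("intra/inter ratio, original:", round(intra_inter_ratio(G.edges()), 2))
print("intra/inter ratio, backbone:", round(intra_inter_ratio(backbone), 2))
print("fraction of edges kept:", round(len(backbone) / G.number_of_edges(), 2))
```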
C Additional Numerical Experiments

C.1 Additional Material for Section 4

Data set         n      |E|      k    d
High school      327    5,818    9    36
Primary school   242    8,317    11   69
DBLP             3,762  17,587   193  9
Amazon           8,035  183,663  195  46

Table 2: Dimensions of the networks considered.

The High school and Primary school data sets are taken from http://www.sociopatterns.org, where the weights represent the number of interactions between students. The DBLP and Amazon data sets are unweighted and taken from https://snap.stanford.edu/data/. d denotes the average unweighted degree, i.e., 2|E|/n.

Data set      High school   Primary school   DBLP   Amazon
|E^mb|/|E|    0.29          0.34             0.82   0.72

Table 3: Ratio of edges kept by the metric backbone in various real graphs.

We highlight the difference between the metric backbone and the spectral sparsification of the Primary school data set in Figure 5. Whereas Figure 2 highlights a clear pattern between the metric backbone and the threshold graph, finding from Figure 5 any pattern in the edges present in the metric backbone but not in the spectral sparsification (and vice versa) is much harder. Although both sparsified graphs preserve the community structure very well, they do so by keeping very different sets of edges.

Figure 5: Graphs obtained from the Primary school data set, after taking the metric backbone (Figure 5a) and after spectral sparsification (Figure 5b), drawn using the same layout. Vertex colors represent the true clusters. Edges present in the metric backbone but not in the spectral sparsifier graph are highlighted in red. Similarly, edges present in the spectral sparsifier graph but not in the metric backbone are highlighted in blue.

We preprocessed the DBLP and Amazon data sets by keeping only the components whose normalized Laplacian has a second eigenvalue larger than 0.1. This ensures that we do not consider communities that are not well connected.

We additionally show in Figure 6 the performance of the three clustering algorithms (Bayesian MCMC, Leiden, and spectral clustering) on four other data sets. As we do not have any ground truth for these additional data sets, we computed the ARI between the clustering obtained on the original network and the clustering obtained on the sparsified network. We observe that the largest ARI is almost always obtained for the metric backbone. This shows that the metric backbone is the sparsification that best preserves the community structure of the original networks.

Figure 6: Effect of sparsification on the performance of clustering algorithms, compared to the results of the same clustering algorithms run on the original graph, on the US Airport 500, DBLP, Open Flights, and Amazon data sets, for the metric backbone, the threshold subgraph, and the spectral sparsifier. Panels: (a) Bayesian MCMC, (b) Leiden, (c) spectral clustering.

C.2 Additional Material for Section 5

Number of edges in the q-nearest-neighbor graph and its metric backbone. We show in Figure 7 the number of edges in G_q and G^mb_q (we do not show the number of edges of the spectrally sparsified graph G^ss_q, because the hyperparameters of G^ss_q are chosen such that G^mb_q and G^ss_q have the same number of edges).
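For concreteness, the sketch below (our own minimal re-implementation with illustrative hyperparameters, not the code released with the paper) reproduces the kind of measurement reported in Figure 7: build a q-nearest-neighbor graph, map Gaussian-kernel similarities to costs with the standard proximity-to-distance map d = 1/s − 1, and count the edges kept by the metric backbone.

```python
import numpy as np
import networkx as nx
from sklearn.datasets import load_digits
from sklearn.neighbors import kneighbors_graph

# Sketch with assumed hyperparameters (q, sigma) and a small stand-in data set.
X, _ = load_digits(return_X_y=True)
X = X[:600]                                   # keep the example fast to run
q, sigma = 10, 30.0                           # illustrative q and kernel width
A = kneighbors_graph(X, n_neighbors=q, mode="distance", include_self=False)
A = A.maximum(A.T).tocoo()                    # symmetrize the q-NN relation

G = nx.Graph()
for u, v, d in zip(A.row, A.col, A.data):
    if u < v:
        sim = np.exp(-d**2 / (2 * sigma**2))            # Gaussian kernel similarity
        G.add_edge(int(u), int(v), weight=1.0 / sim - 1.0)  # similarity -> cost

dist = dict(nx.all_pairs_dijkstra_path_length(G, weight="weight"))
kept = sum(c <= dist[u][v] + 1e-12 for u, v, c in G.edges(data="weight"))
print(kept, "backbone edges out of", G.number_of_edges())
```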
Exploring another similarity measure. Finally, we provide additional evidence supporting the robustness of the metric backbone with respect to the number of nearest neighbors q by reporting the results of another clustering algorithm, namely threshold-based subspace clustering (TSC). TSC is a subspace clustering algorithm that was shown to succeed under very general conditions on the high-dimensional data set to be clustered [21]. TSC performs spectral clustering on a q-nearest-neighbor graph obtained using the similarity measure

sim(x_u, x_v) = exp(−2 arccos(⟨x_u, x_v⟩)).

Figure 7: Number of edges in the q-nearest-neighbor graph (built using Gaussian kernel similarity) and its metric backbone, as a function of q. Panels: (a) MNIST, (b) Fashion MNIST, (c) HAR.

Because in Section 5 we showed the performance of spectral clustering using the Gaussian kernel similarity, these additional experiments also show the robustness of our results with respect to the choice of the similarity measure. Figure 8 shows the performance of TSC on G_q and its sparsified versions G^mb_q and G^ss_q, and Figure 9 shows the number of edges (before and after sparsification).

Figure 8: Performance of TSC on subsets of the MNIST and Fashion MNIST data sets, and on the HAR data set. Each panel shows, as a function of q, the ARI obtained on the original graph, the metric backbone, and the spectral sparsifier; panels: (a) MNIST, (b) Fashion MNIST, (c) HAR. The plots show the ARI averaged over 10 trials, and the error bars show the standard error.

Figure 9: Number of edges remaining in each graph, using dot-product similarity, as a function of q. Panels: (a) MNIST, (b) Fashion MNIST, (c) HAR.

Computing Resources. All experiments were run on a standard laptop with 16GB of RAM and a 12th Gen Intel(R) Core(TM) i7-1250U CPU. To compute the full metric backbone, we used the igraph distances implementation [11]. It takes around 7 minutes for graphs with 10,000 vertices and q = 10. For the same graph, computing the spectral sparsification takes around 15 minutes. In general, computing the spectral sparsification was 2 to 3 times slower than computing the metric backbone. Computing the approximate metric backbone takes around 100 seconds for 70,000 vertices and q = 132 with our C++ implementation. For the spectral clustering and TSC algorithms, the bottleneck is the clustering step: running the scikit-learn implementation of spectral clustering takes 3.5 hours for MNIST and 2 hours for Fashion MNIST, whereas we only needed around 100 seconds to obtain the metric backbone approximation.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The abstract and introduction clearly state all the theoretical and numerical claims made in the paper.
Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes] Justification: Our work points out the assumptions made in the different theorems (especially Assumption 1 and 2). The numerical section (done only using real data) highlights that the results are robust to model misspecification (by going from weighted SBM to real data sets). We used several types of data sets (such as social networks, interaction networks, images), and several clustering algorithms to show how the method generalises well. Computational efficiency is discussed in Section 3. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: All the theorems, formulas, and proofs in the paper are numbered and crossreferenced. All assumptions are clearly stated or referenced in the statement of all the theorems. The proofs are in the supplemental material, but we provide a proof sketch of the main results in the main text. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. 
Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: we release all the code and data used to perform the experiments. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: we release all the code and data used to perform the experiments. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. 
While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: we present in the core of the paper all the detail necessary to appreciate the numerical results. The supplement (Appendix C) contains additional information (such as network statistics). Finally, we also release the code. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: the results are accompanied by error bars, showing the standard error. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). 
If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: Details about compute resources are included in the appendix C. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: We respect the Neur IPS Code of Ethics. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: there is no direct societal impact of the work performed. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. 
If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: the paper poses no such risks. Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: we cited all the original papers that produced the code of the various algorithms, packages and datasets used. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: the paper does not release new assets. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. 
Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: the paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: the paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.