# nearlyoptimal_hierarchical_clustering_for_wellclustered_graphs__d2e01dae.pdf Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs Steinar Laenen * 1 Bogdan-Adrian Manghiuc * 1 He Sun * 1 This paper presents two efficient hierarchical clustering (HC) algorithms with respect to Dasgupta s cost function. For any input graph G with a clear cluster-structure, our designed algorithms run in nearly-linear time in the input size of G, and return an O(1)-approximate HC tree with respect to Dasgupta s cost function. We compare the performance of our algorithm against the previous stateof-the-art on synthetic and real-world datasets and show that our designed algorithm produces comparable or better HC trees with much lower running time. 1. Introduction Hierarchical clustering (HC) is the recursive partitioning of a dataset into increasingly smaller clusters via an effective binary tree representation, and has been employed as a standard package in data analysis with widespread applications in practice. Traditional HC algorithms are typically based on agglomerative heuristics and, due to the lack of a clear objective function, there was limited work on their analysis. Dasgupta (2016) introduced a simple cost function for hierarchical clustering, and this work has inspired a number of algorithmic studies on hierarchical clustering. In this paper we study efficient hierarchical clustering for graphs with a clear structure of clusters. We prove that, under two different conditions of an input graph G that characterise its cluster-structure, one can construct in nearly-linear time1 an O(1)-approximate HC tree T of G with respect to Dasgupta s cost. Our results show that, while it s NP-hard to construct an O(1)-approximate HC tree for general graphs *Equal contribution 1School of Informatics, University of Edinburgh, United Kingdom. Correspondence to: Steinar Laenen , Bogdan-Adrian Manghiuc , He Sun . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). 1We say that, for an input graph G with n vertices and m edges, an algorithm runs in nearly-linear time if the algorithm s running time is O(m logc n) for some constant c. For simplicity we use e O( ) to hide this logc n factor. assuming the Small Set Expansion Hypothesis (Charikar & Chatziafratis, 2017), an O(1)-approximate HC tree can be constructed in nearly-linear time for a wide range of graph instances occurring in practice. This nearly-linear time complexity of our designed algorithms represents a significant improvement over the previous state-of-the-art on the same problem (Manghiuc & Sun, 2021). Our designed two algorithms share the same framework at the high level: we apply spectral clustering (Ng et al., 2001) to partition an input graph G into k clusters P1, . . . , Pk, and further partition each cluster by grouping the vertices in every Pi (1 i k) with respect to their degrees. We call the resulting vertex sets degree buckets, and show that the Dasgupta cost of HC trees constructed on degree buckets is low. These intermediate trees constructed on every bucket can therefore form the basis of our final tree. To construct the final HC tree on G, we merge the degree bucket trees based on the following two approaches: Our first approach treats every degree bucket of vertices in G as a single vertex of another contracted graph H with much fewer vertices. 
Thanks to the small size of H, we apply the recursive sparsest cut algorithm and construct a tree on H in a top-down fashion. The structure of this tree on H determines how the degree bucket trees are merged when constructing our final tree of G. For our second approach, we show that under a regularity assumption on the vertex degrees, it suffices to merge the clusters in a caterpillar style. This simpler construction allows our second algorithm to run in nearly-linear time for a larger number of clusters k compared with the first algorithm. To demonstrate the significance of our work, we experimentally compare our first algorithm against the previous state-of-the-art and a well-known linkage heuristic (Average Linkage) on both synthetic and real-world datasets. Our experimental results show that the trees constructed from our algorithm and Average Linkage achieve similar cost values, which are much lower than the ones constructed from Cohen-Addad et al. (2017) and Manghiuc & Sun (2021). Moreover, our algorithm runs significantly faster than the other three tested algorithms. Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs 2. Related Work Addressing the lack of an objective function for hierarchical clustering, Dasgupta (2016) introduced a simple cost function to measure the quality of an HC tree, and proved several properties of the cost function. Dasgupta further showed that a recursive sparsest cut algorithm can be applied to construct an O(log3/2 n)-approximate HC tree. Charikar & Chatziafratis (2017) improved the analysis of constructing HC trees based on the sparsest cut problem, and proved that an α-approximate algorithm for the sparsest cut problem can be employed to construct an O(α)-approximate HC tree. This implies that, by applying the state-of-the-art for the sparsest cut problem (Arora et al., 2009), an O( log n)- approximate HC tree can be constructed in polynomial-time. It is known that, assuming the Small Set Expansion Hypothesis, it is NP-hard to construct an O(1)-approximate HC tree for general graphs (Charikar & Chatziafratis, 2017). Hence, it is natural to examine the conditions of input graphs under which an O(1)-approximate HC tree can be constructed in polynomial-time. Cohen-Addad et al. (2017) studied a hierarchical extension of the classical stochastic block model (SBM) and showed that, for graphs randomly generated from this model, there is an SVD projection-based algorithm (Mc Sherry, 2001) that, together with linkage heuristics, constructs a (1 + o(1))-approximate HC tree with high probability. Manghiuc & Sun (2021) studied hierarchical clustering for well-clustered graphs and proved that, when there is a cluster-structure of an input graph G, an O(1)-approximate HC tree of G can be constructed in polynomial-time; their designed algorithm is based on the graph decomposition algorithm by Oveis Gharan & Trevisan (2014), and has high time complexity. There are recent studies of hierarchical clustering in different models of computation. For instance, Kapralov et al. (2022) studied the problem of learning the hierarchical cluster structure of graphs in a semi-supervised setting. Their presented algorithm runs in sub-linear time and, under some clusterability conditions of an input graph G with k clusters, their algorithm O( log k)-approximates Dasgupta s cost of an optimal HC tree. This work is incomparable to ours: the objective of their work is to approximate Dasgupta s cost of an HC tree, while the output of our algorithms is a complete HC tree. 
Finally, there are studies of hierarchical clustering under different objective functions. Moseley & Wang (2017) studied the dual of Dasgupta's cost function. This objective, and a dissimilarity objective by Cohen-Addad et al. (2019), have received considerable attention (Alon et al., 2020; Charikar et al., 2019; Chatziafratis et al., 2020). It is important to notice that an O(1)-approximate HC tree can be constructed efficiently for general graphs under these objectives, which suggests a fundamental difference in the computational complexity of constructing an HC tree under different objectives. This is the main reason why we focus entirely on Dasgupta's cost function in this work.

3. Preliminaries

This section lists the background knowledge used in our paper, and is organised as follows: In Section 3.1 we list the basic notation and facts from spectral graph theory. Section 3.2 discusses hierarchical clustering and Dasgupta's cost function. In Section 3.3 we introduce the notion of contracted graphs, and we finish the section with a brief introduction to spectral clustering in Section 3.4.

3.1. Notation

We always assume that $G = (V, E, w)$ is an undirected graph with $|V| = n$ vertices, $|E| = m$ edges, and weight function $w : V \times V \to \mathbb{R}_{\geq 0}$. For any edge $e = \{u, v\} \in E$, we write $w_e$ or $w_{uv}$ to express the weight of $e$. Let $w_{\min}$ and $w_{\max}$ be the minimum and maximum edge weight of $G$, respectively; we further assume that $w_{\max}/w_{\min} \leq c \cdot n^{\gamma}$ for some constants $\gamma > 0$ and $c$, which are independent of the input size. For a vertex $u \in V$, we denote its degree by $d_u \triangleq \sum_{v \in V} w_{uv}$. We use $\delta_G$, $\Delta_G$, and $\bar{d}_G$ for the minimum, maximum and average degrees in $G$ respectively, where $\bar{d}_G \triangleq \sum_{u \in V} d_u / n$. For any $S \subseteq V$, we use $\delta_G(S)$, $\Delta_G(S)$, and $\bar{d}_G(S)$ to represent the minimum, maximum, and average degrees of the vertices of $S$ in $G$. For any two subsets $S, T \subseteq V$, we define the cut value between $S$ and $T$ by $w(S, T) \triangleq \sum_{e \in E(S,T)} w_e$, where $E(S, T)$ is the set of edges between $S$ and $T$. For any $G = (V, E, w)$ and set $S \subseteq V$, the volume of $S$ is $\mathrm{vol}_G(S) \triangleq \sum_{u \in S} d_u$, and we write $\mathrm{vol}(G)$ when referring to $\mathrm{vol}_G(V)$. Sometimes we drop the subscript $G$ when it is clear from the context. For any non-empty subset $S \subseteq V$, we define $G[S]$ to be the induced subgraph on $S$. For any input graph $G = (V, E, w)$ and any $S \subseteq V$, let the conductance of $S$ in $G$ be
$$\Phi_G(S) \triangleq \frac{w(S, V \setminus S)}{\mathrm{vol}(S)}.$$
We define the conductance of $G$ by $\Phi_G \triangleq \min_{S \subseteq V,\ \mathrm{vol}(S) \leq \mathrm{vol}(V)/2} \Phi_G(S)$. For any non-empty subset $S \subseteq V$ we refer to $\Phi_G(S)$ as the outer conductance of $S$ with respect to $G$, and to $\Phi_{G[S]}$ as the inner conductance of $S$.

Our analysis is based on the spectral properties of graphs, and here we list the basics of spectral graph theory. For a graph $G = (V, E, w)$, let $D \in \mathbb{R}^{n \times n}$ be the diagonal matrix defined by $D_{uu} = d_u$ for all $u \in V$. We denote by $A \in \mathbb{R}^{n \times n}$ the adjacency matrix of $G$, where $A_{uv} = w_{uv}$ for all $u, v \in V$. The normalised Laplacian matrix of $G$ is defined as $\mathcal{L} \triangleq I - D^{-1/2} A D^{-1/2}$, where $I$ is the $n \times n$ identity matrix. The normalised Laplacian $\mathcal{L}$ is symmetric and real-valued, and has $n$ real eigenvalues which we write as $\lambda_1 \leq \ldots \leq \lambda_n$. It is known that $\lambda_1 = 0$ and $\lambda_n \leq 2$ (Chung, 1997). For any integer $k \geq 2$, we call subsets of vertices $A_1, \ldots, A_k$ a $k$-way partition of $G$ if $\bigcup_{i=1}^{k} A_i = V$ and $A_i \cap A_j = \emptyset$ for different $i$ and $j$. We define the $k$-way expansion of $G$ by
$$\rho(k) \triangleq \min_{\text{partitions } A_1, \ldots, A_k}\ \max_{1 \leq i \leq k} \Phi_G(A_i).$$
The celebrated higher-order Cheeger inequality (Lee et al., 2014) states that it holds for any graph $G$ and $k \geq 2$ that
$$\frac{\lambda_k}{2} \leq \rho(k) \leq O(k^3) \sqrt{\lambda_k}. \qquad (1)$$
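To make the notation concrete, the following NumPy sketch (our own illustration, not part of the paper) computes the normalised Laplacian $\mathcal{L} = I - D^{-1/2} A D^{-1/2}$, its eigenvalues, and the conductance $\Phi_G(S)$ of a vertex set on a small toy graph:

```python
import numpy as np

def normalised_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for a weighted adjacency matrix A."""
    d = A.sum(axis=1)                      # weighted degrees d_u
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return np.eye(len(d)) - (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

def conductance(A, S):
    """Phi_G(S) = w(S, V \\ S) / vol(S) for a vertex subset S."""
    mask = np.zeros(A.shape[0], dtype=bool)
    mask[list(S)] = True
    cut = A[mask][:, ~mask].sum()          # w(S, V \ S)
    vol = A[mask].sum()                    # vol(S): sum of degrees of S
    return cut / vol

# Toy weighted graph: two dense blocks joined by a single light edge.
A = np.array([[0,   2, 2, 0.1, 0, 0],
              [2,   0, 2, 0,   0, 0],
              [2,   2, 0, 0,   0, 0],
              [0.1, 0, 0, 0,   2, 2],
              [0,   0, 0, 2,   0, 2],
              [0,   0, 0, 2,   2, 0]], dtype=float)

eigvals = np.linalg.eigvalsh(normalised_laplacian(A))
print("lambda_1..lambda_n:", np.round(eigvals, 3))     # lambda_1 ~ 0, lambda_n <= 2
print("Phi_G({0,1,2}) =", round(conductance(A, {0, 1, 2}), 4))
```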
3.2. Hierarchical Clustering

A hierarchical clustering (HC) tree of a given graph $G$ is a binary tree $\mathcal{T}$ with $n$ leaf nodes such that each leaf corresponds to exactly one vertex $v \in V(G)$. Let $\mathcal{T}$ be an HC tree of a graph $G = (V, E, w)$, and $N \in \mathcal{T}$ be an arbitrary internal node² of $\mathcal{T}$. We denote by $\mathcal{T}[N]$ the subtree of $\mathcal{T}$ rooted at $N$, by $\mathrm{leaves}(\mathcal{T}[N])$ the set of leaf nodes of $\mathcal{T}[N]$, and by $\mathrm{parent}_{\mathcal{T}}(N)$ the parent of node $N$ in $\mathcal{T}$. In addition, each internal node $N \in \mathcal{T}$ induces a unique vertex set $C \subseteq V$ formed by the vertices corresponding to $\mathrm{leaves}(\mathcal{T}[N])$. For ease of presentation, we sometimes abuse the notation and write $N \in \mathcal{T}$ for both the internal node of $\mathcal{T}$ and the corresponding subset of vertices in $V$.

To measure the quality of an HC tree $\mathcal{T}$ with similarity weights, Dasgupta (2016) introduced the cost function defined by
$$\mathrm{COST}_G(\mathcal{T}) \triangleq \sum_{e = \{u,v\} \in E} w_e \cdot |\mathrm{leaves}(\mathcal{T}[u \vee v])|,$$
where $u \vee v$ is the lowest common ancestor of $u$ and $v$ in $\mathcal{T}$. Sometimes, it is convenient to consider the cost of an edge $e = \{u, v\} \in E$ in $\mathcal{T}$ as $\mathrm{cost}_G(e) \triangleq w_e \cdot |\mathrm{leaves}(\mathcal{T}[u \vee v])|$. Trees that achieve a better hierarchical clustering have a lower cost, and the objective of HC is to construct trees with the lowest cost based on the following consideration: for any pair of vertices $u, v \in V$ that corresponds to an edge of high weight $w_{uv}$ (i.e., $u$ and $v$ are highly similar), a good HC tree would separate $u$ and $v$ lower in the tree, which is reflected in a small value of $|\mathrm{leaves}(\mathcal{T}[u \vee v])|$. We denote by $\mathrm{OPT}_G$ the minimum cost of any HC tree of $G$, i.e., $\mathrm{OPT}_G = \min_{\mathcal{T}} \mathrm{COST}_G(\mathcal{T})$, and use the notation $\mathcal{T}^{*}$ to refer to an optimal tree achieving the minimum cost. We say that an HC tree $\mathcal{T}$ is an $\alpha$-approximate tree if $\mathrm{COST}_G(\mathcal{T}) \leq \alpha \cdot \mathrm{OPT}_G$ for some $\alpha \geq 1$.

²We consider any non-leaf node of $\mathcal{T}$ an internal node. We always use the term node(s) for the nodes of $\mathcal{T}$ and the term vertices for the elements of the vertex set $V$.

3.3. Contracted Graphs

Our work is based on contracted graphs, which were introduced by Kapralov et al. (2022) in the context of hierarchical clustering.

Definition 3.1 (Contracted Graph (Kapralov et al., 2022)). Let $G = (V, E, w)$ be a weighted graph, and $\mathcal{A} = \{A_i\}_{i=1}^{k}$ be a partition of $V$. We say that the vertex- and edge-weighted graph $H = ([k], \binom{[k]}{2}, W', w')$ is a contraction of $G$ with respect to $\mathcal{A}$ if for every $i, j \in [k]$ we have that $W'(i, j) = w(A_i, A_j)$, and for every $i \in [k]$ we have $w'(i) = |A_i|$. We denote the contraction of $G$ with respect to $\mathcal{A}$ by $G/\mathcal{A}$.

Note that contracted graphs are vertex-weighted, i.e., every vertex $u \in V(H)$ has a corresponding weight. To measure the quality of an HC tree $\mathcal{T}$ on a vertex-weighted graph $H = (V, E, W, w)$, we define the weighted Dasgupta cost of $\mathcal{T}$ on $H$ as
$$\mathrm{WCOST}_H(\mathcal{T}) \triangleq \sum_{e = \{u,v\} \in E} W(u, v) \cdot \sum_{z \in \mathrm{leaves}(\mathcal{T}[u \vee v])} w(z).$$
We denote by $\mathrm{WOPT}_H$ the minimum cost of any HC tree of $H$, i.e., $\mathrm{WOPT}_H = \min_{\mathcal{T}} \mathrm{WCOST}_H(\mathcal{T})$. For any set $S \subseteq V(H)$ we define the sparsity of the cut $(S, V \setminus S)$ in $H$ as
$$\mathrm{sparsity}_H(S) \triangleq \frac{W(S, V \setminus S)}{w(S) \cdot w(V \setminus S)}, \qquad (2)$$
where $w(S) \triangleq \sum_{v \in S} w(v)$. The vertex-weighted sparsest cut of $H$ is the cut with the minimum sparsity. We call the vertex-weighted variant of the recursive sparsest cut algorithm the WRSC algorithm, which is described as follows: let $\alpha \geq 1$, and let $H = (V, E, W, w)$ be a vertex- and edge-weighted graph. Let $(S, V \setminus S)$ be a vertex-weighted sparsest cut of $H$. The WRSC algorithm on $H$ is a recursive algorithm that finds a cut $(T, V \setminus T)$ satisfying $\mathrm{sparsity}_H(T) \leq \alpha \cdot \mathrm{sparsity}_H(S)$, and recurses on the induced subgraphs $H[T]$ and $H[V \setminus T]$. Kapralov et al.
(2022) showed that the approximation guarantee of this algorithm for constructing HC trees follows from the one for non-vertex-weighted graphs (Charikar & Chatziafratis, 2017), and their result is summarised as follows: Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs Lemma 3.2 (Kapralov et al. (2022)). Let H = (V, E, W, w) be a vertex and edge-weighted graph. Then, the WRSC algorithm achieves an O(α)-approximation for the weighted Dasgupta s cost of H, where α is the approximation ratio of the sparsest cut algorithm used in WRSC. We emphasise that we only use the combinatorial properties of vertex-weighted graphs. As such we don t consider their Laplacian matrices and the corresponding spectra. 3.4. Spectral Clustering Another key component used in our analysis is spectral clustering, which is one of the most popular clustering algorithms used in practice (Ng et al., 2001; Spielman & Teng, 1996). For any input graph G = (V, E, w) and k N, spectral clustering consists of the following three steps: (1) compute the eigenvectors f1 . . . fk of LG, and embed each u V (G) to the point F(u) Rk based on f1 . . . fk; (2) apply k-means on the embedded points {F(u)}u V (G); (3) partition V into k clusters P1 . . . Pk based on the output of k-means. To analyse the performance of spectral clustering, one can examine the scenario in which there is a large gap between λk+1 and ρ(k). By the higher-order Cheeger inequality (1), we know that a low value of ρ(k) ensures that the vertex set V of G can be partitioned into k subsets (clusters), each of which has conductance upper bounded by ρ(k); on the other hand, a high value of λk+1 implies that any (k + 1)- way partition of V would introduce some A V with conductance ΦG(A) ρ(k + 1) λk+1/2. Based on this observation, a sequence of works (Macgregor & Sun, 2022; Mizutani, 2021; Peng et al., 2017) showed that, assuming the presence of a large gap between λk+1 and ρ(k), spectral clustering returns clusters P1, . . . Pk of low outer conductance ΦG(Pi) for each 1 i k. We remark that spectral clustering can be implemented in nearly-linear time (Peng et al., 2017). 4. Hierarchical Clustering for Well-Clustered Graphs: Previous Approach Our presented new algorithms are based on the work of Manghiuc & Sun (2021) on the same problem, and this section gives a brief overview of their approach. We consider a graph G = (V, E, w) to have k well-defined clusters if V (G) can be partitioned into disjoint subsets {Ai}k i=1 such that (i) there s a sparse cut between Ai and V \ Ai, formulated as ΦG(Ai) Φout for any 1 i k, and (ii) each G[Ai] has high inner conductance ΦG[Ai] Φin. Then, for an input graph G with a clear cluster-structure, the algorithm by Manghiuc & Sun (2021), which we call the MS algorithm in the following, constructs an O(1)-approximate HC tree of G in polynomial-time. At a very high level, the MS algorithm is based on the following two components: They first show that, when an input graph G of n vertices and m edges has high conductance, an O(1)- approximate HC tree of G can be constructed based on the degree sequence of G, and the algorithm runs in O(m + n log n) time. We use Tdeg to express such trees constructed from the degree sequence of G. They combine the first result with the algorithm proposed by Oveis Gharan & Trevisan (2014), which decomposes an input graph into a set of clusters A1, . . . , Aℓ, where every Ai has low outerconductance ΦG(Ai) and high inner-conductance ΦG[Ai] for any 1 i ℓ. 
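For reference, here is a minimal sketch of the spectral clustering primitive from Section 3.4, which both the MS algorithm's analysis and our algorithms build on. The SciPy/scikit-learn calls are an illustrative implementation choice, not the exact nearly-linear time routine analysed in the paper:

```python
import numpy as np
from scipy.sparse import csgraph
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def spectral_clustering(A, k, seed=0):
    """Partition the vertices of a weighted graph (adjacency matrix A) into k clusters,
    following the three steps of Section 3.4: normalised-Laplacian eigenvectors,
    embedding, and k-means on the embedded points."""
    L = csgraph.laplacian(A, normed=True)          # L = I - D^{-1/2} A D^{-1/2}
    # Bottom k eigenvectors f_1, ..., f_k (smallest eigenvalues).
    _, F = eigsh(L, k=k, which='SM')
    # Embed every vertex u to the point F(u) in R^k and cluster with k-means.
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(F)
    # Return the clusters P_1, ..., P_k as lists of vertex indices.
    return [np.flatnonzero(labels == i) for i in range(k)]
```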
With these two components in place, one might intuitively think that an O(1)-approximate HC tree of $G$ can be easily constructed by (i) finding clusters $\{A_i\}$ of high conductance, (ii) constructing an O(1)-approximate HC tree $\mathcal{T}_i = \mathcal{T}_{\deg}(G[A_i])$ for every $G[A_i]$, and (iii) merging the constructed $\{\mathcal{T}_i\}$ in an optimal way. However, Manghiuc & Sun (2021) show by counterexample that this is not sufficient and, in order to achieve an O(1)-approximation, further decomposition of every $A_i$ is necessary. Specifically, they adjust the algorithm of Oveis Gharan & Trevisan (2014), and further decompose every vertex set $A_i$ of high conductance into smaller subsets, which they call critical nodes. They show that these critical nodes can be carefully merged to construct an O(1)-approximate HC tree for well-clustered graphs. On the downside, as the MS algorithm is heavily based on Oveis Gharan & Trevisan (2014), the time complexity of their algorithm is $\widetilde{O}(k^3 m^2 n \cdot (w_{\max}/w_{\min}))$, which limits the application of their algorithm to large-scale datasets.

5. Algorithms

This section presents our hierarchical clustering algorithms for well-clustered graphs. It consists of two subsections, each of which corresponds to one algorithm.

5.1. The First Algorithm

In this subsection we present a nearly-linear time algorithm that, given a well-clustered graph $G$ with a constant number of clusters as input, constructs an O(1)-approximate HC tree of $G$ with respect to Dasgupta's cost. Our result is as follows:

Theorem 5.1. Given a connected graph $G = (V, E, w)$ and some constant $k$ as input such that $\lambda_{k+1} = \Omega(1)$, $\lambda_k = O(k^{-12})$, and $w_{\max}/w_{\min} = O(n^{\gamma})$ for a constant $\gamma > 1$, there is a nearly-linear time algorithm that constructs an HC tree $\mathcal{T}$ of $G$ satisfying $\mathrm{COST}_G(\mathcal{T}) = O(1) \cdot \mathrm{OPT}_G$.

In comparison to the previous algorithms for hierarchical clustering on well-structured graphs (Cohen-Addad et al., 2017; Manghiuc & Sun, 2021), the advantages of our algorithm are its simplicity and nearly-linear time complexity, which is optimal up to a poly-logarithmic factor.

Overview of the Algorithm. We first describe the algorithm behind Theorem 5.1. Our algorithm consists of four steps. For any input graph $G = (V, E, w)$ and parameter $k$, our algorithm first runs spectral clustering and obtains clusters $P_1, \ldots, P_k$. The second step of our algorithm is a degree-based bucketing procedure. We set $\beta \triangleq 2^{k(\gamma+1)}$, and define for any $P_i$ returned from spectral clustering and $u \in P_i$ the set
$$B(u) \triangleq \{v \in P_i : d_u \leq d_v < \beta \cdot d_u\}. \qquad (3)$$
Moreover, as $B(u)$ consists of all the vertices $v \in P_i$ with $d_u \leq d_v < \beta \cdot d_u$, we generalise the definition of $B(u)$ and define for any $j \in \mathbb{Z}$ that
$$B_j(u) = \{v \in P_i : \beta^{j} \cdot d_u \leq d_v < \beta^{j+1} \cdot d_u\}.$$
By definition, it holds that $B(u) = B_0(u)$, and $\{B_j(u)\}_j$ forms a partition of $P_i$ for any $u \in P_i$. We illustrate the bucketing procedure in Figure 1. In our algorithm, we set $u^{(i)}$ ($1 \leq i \leq k$) to be a vertex in $P_i$ with minimum degree and focus on $\{B_j(u^{(i)})\}_j$, the partition of $P_i$ based only on non-negative values of $j$. Let $\mathcal{B}_i \triangleq \{B_j(u^{(i)})\}_j$ and $\mathcal{B} \triangleq \bigcup_{i=1}^{k} \mathcal{B}_i$.

Figure 1. Illustration of the bucketing procedure induced by a vertex $u$: the degree range of $P_i$, from $\delta_G(P_i)$ to $\Delta_G(P_i)$, is split at the thresholds $d_u, \beta d_u, \beta^2 d_u, \ldots$

In the third step, our algorithm constructs an arbitrary balanced binary tree $\mathcal{T}_B$ for every bucket $B \in \mathcal{B}$, and sets $\mathcal{T}$ to be the collection of our constructed balanced trees $\mathcal{T}_B$. In the fourth step, our algorithm constructs the contracted graph $H$ defined by $H \triangleq G/\mathcal{B}$, and applies the WRSC algorithm to construct an HC tree $\mathcal{T}_{G/\mathcal{B}}$ of $H$. Our algorithm finally replaces every leaf node of $\mathcal{T}_{G/\mathcal{B}}$ corresponding to a set $B \in \mathcal{B}$ by an arbitrary balanced tree $\mathcal{T}_B$ on $G[B]$.
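To make the bucketing step concrete, the following sketch (our own illustration; the helper names are hypothetical) computes the partition $\{B_j(u^{(i)})\}_{j \geq 0}$ of a cluster $P_i$ from (3), taking $u^{(i)}$ to be a vertex of minimum degree in $P_i$:

```python
import math
from collections import defaultdict

def degree_buckets(P_i, degree, beta):
    """Partition cluster P_i into buckets B_j(u) = {v : beta^j * d_u <= d_v < beta^{j+1} * d_u},
    where u is a vertex of minimum degree in P_i (so only indices j >= 0 occur).
    Floating-point edge cases at bucket boundaries are ignored in this sketch."""
    d_u = min(degree[v] for v in P_i)                 # degree of the representative u^(i)
    buckets = defaultdict(list)
    for v in P_i:
        j = int(math.floor(math.log(degree[v] / d_u, beta)))   # beta^j d_u <= d_v < beta^{j+1} d_u
        buckets[j].append(v)
    return [sorted(B) for _, B in sorted(buckets.items())]

# Illustrative degrees spanning several orders of magnitude, with beta = 4.
degree = {0: 1.0, 1: 1.5, 2: 3.9, 3: 4.0, 4: 15.0, 5: 70.0}
print(degree_buckets(list(degree), degree, beta=4.0))
# -> [[0, 1, 2], [3, 4], [5]]   (empty buckets simply do not appear)
```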
See Algorithm 1 for the formal description of our algorithm.

Algorithm 1 Spectral Clustering with Degree Bucketing and WRSC (SpecWRSC)
1: Input: $G = (V, E, w)$, $k \in \mathbb{N}$ such that $\lambda_{k+1} > 0$
2: $\{P_1, \ldots, P_k\} \leftarrow$ SpectralClustering$(G, k)$
3: $\beta \leftarrow 2^{k(\gamma+1)}$
4: $\mathcal{T} \leftarrow \emptyset$
5: for every $P_i$ do
6:   Order the vertices of $P_i$ increasingly with respect to their degrees
7:   $u^{(i)} \leftarrow$ vertex $u \in P_i$ with minimum degree
8:   $\mathcal{B}_i \leftarrow \{B_j(u^{(i)})\}_j$
9:   for every $B \in \mathcal{B}_i$ do
10:    Construct an arbitrary balanced $\mathcal{T}_B$ of $G[B]$
11:    $\mathcal{T} \leftarrow \mathcal{T} \cup \mathcal{T}_B$
12:  end for
13: end for
14: $\mathcal{B} = \bigcup_{i=1}^{k} \mathcal{B}_i$
15: $H \leftarrow G/\mathcal{B}$
16: $\mathcal{T} \leftarrow$ WRSC$(H)$
17: for every $\mathcal{T}_B \in \mathcal{T}$ do
18:   Replace the leaf node of $\mathcal{T}$ corresponding to $B$ by $\mathcal{T}_B$
19: end for
20: Return $\mathcal{T}$

It is important to notice that, as every output $P_i$ ($1 \leq i \leq k$) from spectral clustering does not necessarily have high inner conductance, our work shows that in general the inner-conductance condition on vertex sets is not needed in order to construct an O(1)-approximate HC tree. This is significant in our view, since ensuring the high inner conductance of certain vertex sets is the main reason behind the high time complexity of the MS algorithm. Moreover, the notion of critical nodes introduced in Manghiuc & Sun (2021), which are subsets of a cluster of high conductance, is replaced by the degree-based bucketing $\mathcal{B}$ of every output cluster $P_i$ from spectral clustering; this $\mathcal{B}$ can be constructed in nearly-linear time.

Proof Sketch of Theorem 5.1. We first analyse the approximation ratio of our constructed tree, and show why these simple four steps suffice to construct an O(1)-approximate HC tree for well-clustered graphs. By the algorithm description, our constructed $\mathcal{T}$ is based on merging different $\mathcal{T}_B$, and it holds that
$$\mathrm{COST}_G(\mathcal{T}) = \sum_{i=1}^{k} \sum_{B \in \mathcal{B}_i} \mathrm{COST}_{G[B]}(\mathcal{T}_B) + \sum_{\substack{e = \{u,v\} \in E \\ u \in B,\ v \notin B}} \mathrm{cost}_G(e). \qquad (4)$$
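As an illustration of the quantities appearing in (4), the following sketch (our own; the nested-tuple tree representation is a hypothetical choice) evaluates Dasgupta's cost of a given HC tree by summing $w_{uv} \cdot |\mathrm{leaves}(\mathcal{T}[u \vee v])|$ over all edges. It is a simple quadratic-time reference implementation, not the nearly-linear bookkeeping discussed later in this section:

```python
import itertools

def dasgupta_cost(tree, edge_weights):
    """COST_G(T) = sum over edges {u, v} of w_uv * |leaves(T[u v v])|.
    `tree` is a nested tuple such as ((0, 1), (2, (3, 4))) with vertices at the
    leaves; `edge_weights` maps frozenset({u, v}) -> w_uv."""
    total = 0.0

    def visit(node):
        nonlocal total
        if not isinstance(node, tuple):
            return {node}                        # a leaf: the single vertex below it
        left, right = visit(node[0]), visit(node[1])
        here = left | right                      # leaves(T[node])
        # Edges whose lowest common ancestor is `node` are exactly those with
        # one endpoint in each child subtree; each contributes w_e * |here|.
        for u, v in itertools.product(left, right):
            total += edge_weights.get(frozenset((u, v)), 0.0) * len(here)
        return here

    visit(tree)
    return total

# Toy example: a weighted triangle {0, 1, 2} plus a light pendant edge {2, 3}.
w = {frozenset((0, 1)): 1.0, frozenset((1, 2)): 1.0,
     frozenset((0, 2)): 1.0, frozenset((2, 3)): 0.1}
print(dasgupta_cost(((0, 1), (2, 3)), w))        # 1*2 + 0.1*2 + 1*4 + 1*4 = 10.2
```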
Hence, it holds that δG = O nγ+1 , and the total number of buckets in B is upper bounded by k max{1, logβ nγ+1} k+ k(γ + 1) log β log n = k+log n. Thanks to this, there are at most k + log n vertices in G/B, and a sparsest cut of G/B can be found in O(n) time by enumerating all of its possible subsets; as such we can set α = 1 in the analysis. We highlight that this is another advantage of our bucketing step: with careful choice of parameters, we only need to study a contracted graph with O(log n) vertices, whose sparsest cut can be found in linear time. Since the WRSC algorithm only computes the vertexweighted sparsest cut O(log n) times, the overall running time of WRSC is O(n log n). We remark that we don t need to deal with the general sparsest cut problem, for which most approximation algorithms are based on complicated optimisation techniques (e.g., Arora et al. (2009)). Combining (4) with Lemmas 5.2, 5.3 as well as the fact that α = 1 proves the approximation guarantee of Theorem 5.1. Next, we sketch the running time analysis of Algorithm 1. The first step (Line 2) applies spectral clustering, and takes e O(m) time (Peng et al., 2017). The second step (Lines 6 8) is a degree-based bucketing procedure for all the clusters, the time complexity of which is i=1 O (|Pi| log |Pi|) = O(n log n). The third step (Lines 9 12) of the algorithm constructs an arbitrary binary tree TB for every bucket B B, and we show that this step takes e O(m) time. The actual construction of the trees in this step isn t difficult, but we need to store several attributes for each internal node as we build the tree, such that we can compute Dasgupta s cost in nearly-linear time as well. We refer the reader to the appendix for the detailed discussion of this step. In the last step (Lines 14 19), the algorithm merges the trees TB based on the output of WRSC, which can be implemented in O(n log n) time as discussed above. Combining all these steps gives us the nearly-linear time complexity of Algorithm 1. Proof Sketch for Lemma 5.2 and Lemma 5.3. Our key approach to breaking the running time barrier of Manghiuc & Sun (2021) is that, instead of approximating the optimal tree T , we approximate the tree TMS constructed by the algorithm of Manghiuc & Sun (2021). We know from their result that one can upper bound the cost of TMS with respect to OPTG, i.e., COSTG(TMS) = O k22/λ10 k+1 OPTG. We take the existence of the tree TMS for granted, and perform two transformations to construct a tree T MS, while Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs ensuring that the total introduced cost from the two transformations can be upper bounded. In the first step, we identify how every bucket B B is spread throughout TMS, and make sure that all the separate components that make up B are isolated in our new tree. We call the resulting tree T MS, and prove that COSTG (T MS) COSTG (TMS) + O(k21/λ10 k+1) OPTG. For the second transformation, we carefully adjust the tree T MS, such that the currently isolated components that make up every bucket B B get grouped together into the same subtree. We call the resulting tree T MS, and bound its cost by COSTG (T MS) COSTG (T MS)+O(β k23/λ10 k+1) OPTG. Taking these transformations together, we get that COSTG (T MS) O(β k23/λ10 k+1) OPTG. (8) Notice that we can perform these two transformations without an explicit construction of TMS, but we still end up with a tree T MS with bounded Dasgupta cost. 
Moreover, since every bucket B B in T MS is separated, T MS is also a tree on the contracted graph G/B; this allows us to upper bound WOPTG/B with COSTG(T MS). Combining this with (8) proves Lemma 5.3. With the same approach, we prove that Lemma 5.2 holds as well. We remark that this is another conceptual contribution of our paper: although Manghiuc & Sun (2021) show that a similar proof technique can be generalised to construct O(1)-approximate HC trees for expander graphs, we show that such proof technique can be used to obtain O(1)- approximate HC trees for well-clustered graphs. 5.2. The Second Algorithm In this subsection, we present another nearly-linear time hierarchical clustering algorithm for well-clustered graphs. Here, we assume that the degrees of vertices inside every cluster are almost balanced, i.e., it holds for an optimal partitioning S1, . . . Sk corresponding to ρ(k) that the degrees inside each Si are upper bounded by a parameter ηS R+. With this condition, our second algorithm achieves the same approximation guarantee as Algorithm 1 even with k = O(logc n) for some constant c. This result is summarised as follows: Theorem 5.4. Let G = (V, E, w) be a graph with k clusters {Si}k i=1 such that maxi( G(Si)/δG(Si)) ηS, λk+1 = Ω(1), ρ(k) k 4, and ΦG[Si] Ω(1) for 1 i k. Then, there is a nearly-linear time algorithm that constructs an HC tree TSCB of G satisfying COSTG(TSCB) = O (1) OPTG. Algorithm 2 Spectral Clustering with Degree Bucketing and Caterpillar Construction 1: Input: G = (V, E, w), k N, ηS R+ 2: P = {P1, . . . , Pk} Spectral Clustering(G, k) 3: β ηS 4: Initialize T 5: for Pi P do 6: Order the vertices of Pi increasingly with respect to their degrees 7: B (u) {v Pi : du dv < β du} 8: u i argmaxu Pi vol(B(u)) 9: Bu i Bj (u i ) j 10: for B Bu i do 11: Let TB be any balanced HC tree on G[B] 12: T T TB 13: end for 14: end for 15: Let T = {T1, . . . , T|T|} be such that |Ti| |Ti+1|, for all 1 i < |T| 16: Initialize TSCB T1 17: for i 2, . . . , |T| do 18: TSCB TSCB Ti 19: end for 20: Return TSCB The algorithm behind Theorem 5.4 is similar with Algorithm 1, and is described in Algorithm 2. However, our second algorithm has two significant changes compared with Algorithm 1. 1. We adjust the bucketing step by setting the bucketing parameter β = ηS, where ηS is an upper bound for maxi ( G(Si)/δG(Si)). Moreover, instead of bucketing the vertices starting at the vertex u(1) of minimum degree inside Pi, we carefully choose for each Pi the vertex u i argmaxu Pi vol(B (u)) whose induced bucket B (u i ) has the highest volume inside Pi. Based on this, we set Bu i Bj (u i ) j and B Sk i=1 Bu i . 2. After constructing the balanced trees TB for every B B, instead of applying WRSC, we concatenate the trees TB in a simple caterpillar style according to the sizes |B| of the buckets B B. To explain the first change, notice that every cluster Pi returned by spectral clustering has a large overlap with its corresponding optimal cluster Si. Moreover, the degrees of the vertices inside Si are within a factor of ηS of each other. Therefore, if an arbitrary vertex u Pi is chosen as representative, the bucketing B (u) of Pi might equally divide the vertices in Pi Si into two consecutive buckets, which is undesirable due to the high induced cost of the crossing edges. 
To circumvent this issue, we choose $u_i^{*}$ as the vertex to induce the bucketing, and prove that this specific choice of $u_i^{*}$ guarantees that the bucketing $\mathcal{B}_{u_i^{*}}$ contains one bucket $B \in \mathcal{B}_{u_i^{*}}$ largely overlapping with $S_i$. This greatly reduces the number of crossing edges in our constructed HC tree and hence improves the approximation guarantee.

To explain the second change, we notice that the nearly-linear running time of WRSC relies on the condition that $k = O(1)$, which does not necessarily hold in the new setting. We prove that, as long as $\max_i (\Delta_G(S_i)/\delta_G(S_i))$ is upper bounded by a constant, it is sufficient to construct a caterpillar tree based on the sizes $|B|$ of the buckets $B \in \mathcal{B}$. We prove that our algorithm runs in nearly-linear time, and the output tree achieves an O(1)-approximation. See the appendix for the detailed analysis of Algorithm 2 and the proof of Theorem 5.4.

6. Experiments

We experimentally evaluate the performance of our designed SpecWRSC algorithm, and compare it against the previous state-of-the-art and a well-known linkage algorithm. Specifically, the SpecWRSC algorithm is compared against the following three algorithms: 1. the Linkage++ algorithm proposed in (Cohen-Addad et al., 2017); 2. the MS algorithm proposed in (Manghiuc & Sun, 2021); 3. the Average Linkage algorithm. Even though there are no theoretical approximation guarantees for Average Linkage with respect to Dasgupta's cost function, we include it in our evaluation for reference due to its excellent performance in practice. All the tested algorithms were implemented in Python 3.9, and the experiments were performed on a Lenovo ThinkPad T15G with an Intel(R) Xeon(R) W-10855M CPU@2.80GHz processor and 126 GB RAM. All of the reported costs and running times below are averaged over 5 independent runs. Our code can be downloaded from https://github.com/steinarlaenen/nearlyoptimal-hierarchical-clustering-for-well-clustered-graphs.

6.1. Results on Synthetic Data

We first compare the performance of our algorithm against the others described above on synthetic data.

Stochastic Block Model. We look at graphs generated according to the standard Stochastic Block Model (SBM). We set the number of clusters as $k = 5$, and the number of vertices in each cluster $\{P_i\}_{i=1}^{5}$ as $n_k$. For each pair of vertices $u \in P_i$ and $v \in P_j$, we add an edge $\{u, v\}$ with probability $p$ if $i = j$, and with probability $q$ if $i \neq j$.

Figure 2. Results for SBMs. In Figure 2(a) the x-axis represents the number of vertices inside each cluster, and the y-axis represents the algorithms' running time in seconds. In Figure 2(b), the x-axis represents different values of $p$, while the y-axis represents the cost of the algorithms' returned HC trees normalised by the cost of SpecWRSC.

To compare the running time, we incrementally increase the number of vertices in each cluster $\{P_i\}_{i=1}^{5}$ from $n_k = 100$ to $n_k = 3{,}000$, in increments of 200, and we fix $p = 0.1$ and $q = 0.002$. For each algorithm, we measure the running time in seconds, and the results are plotted in Figure 2(a), confirming our algorithm's low time complexity compared with our competitors.
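The SBM instances used in this experiment can be sampled directly; the following NumPy sketch (our own, not the paper's code) generates an adjacency matrix with the parameters above:

```python
import numpy as np

def sbm_adjacency(k, n_k, p, q, seed=0):
    """Sample a standard SBM with k clusters of size n_k: an edge {u, v} is present
    with probability p inside a cluster and probability q across clusters."""
    rng = np.random.default_rng(seed)
    n = k * n_k
    labels = np.repeat(np.arange(k), n_k)                 # cluster index of every vertex
    same = labels[:, None] == labels[None, :]
    prob = np.where(same, p, q)
    upper = np.triu(rng.random((n, n)) < prob, k=1)       # sample each pair {u, v} once
    A = (upper | upper.T).astype(float)                   # symmetric, zero diagonal
    return A, labels

# Setting used for the running-time experiment: k = 5, p = 0.1, q = 0.002.
A, labels = sbm_adjacency(k=5, n_k=100, p=0.1, q=0.002)
print(A.shape, int(A.sum() / 2), "edges")
```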
To further highlight the fast running time of our algorithm, we increase the size of each cluster to 20,000, which results in a total number of nodes $n = 100{,}000$. We find that SpecWRSC returns an HC tree in 14,437 seconds (about 4 hours), while all the other algorithms time out after 12 hours of compute time. To compare the quality of the constructed HC trees on SBMs, we fix the number of vertices inside each cluster $\{P_i\}_{i=1}^{5}$ as $n_k = 1{,}000$, the value $q = 0.002$, and consider different values of $p \in [0.06, 0.2]$. As shown in Figure 2(b), SpecWRSC performs better than all the other algorithms.

Hierarchical Stochastic Block Model. Next, we consider graphs generated from a hierarchical stochastic block model (HSBM) (Cohen-Addad et al., 2019); the HSBM is similar to the SBM but assumes the existence of a hierarchical structure between the clusters. We set the number of vertices in each cluster $\{P_i\}_{i=1}^{5}$ as $n_k = 600$. For each pair of vertices $u \in P_i$ and $v \in P_j$, we assume that $u$ and $v$ are connected by an edge with probability $p$ if $i = j$; otherwise $u$ and $v$ are connected by an edge with probability $q_{i,j}$ defined as follows: (i) for all $i \in \{1, 2, 3\}$ and $j \in \{4, 5\}$, $q_{i,j} = q_{j,i} = q_{\min}$; (ii) for $i \in \{1, 2\}$, $q_{i,3} = q_{3,i} = 2 q_{\min}$; (iii) $q_{4,5} = q_{5,4} = 2 q_{\min}$; (iv) $q_{1,2} = q_{2,1} = 3 q_{\min}$. We fix the value $q_{\min} = 0.0005$ and consider different values of $p \in [0.04, 0.2]$. This choice of hyperparameters bears similarity with the setting tested in (Cohen-Addad et al., 2017; Manghiuc & Sun, 2021). The result of our experiments is plotted in Figure 3(a). Again, we find that our algorithm outperforms MS and Linkage++, and has similar performance to Average Linkage.

Table 1. Setup of the parameters for the tested real-world datasets, and the running time of the tested algorithms. For each dataset, we list the number of nodes $n$, the number of clusters $k$, the $\sigma$-value used to construct the similarity graph based on the Gaussian kernel, and the spectral gap $\lambda_{k+1}/\lambda_k$ of the constructed normalised Laplacian. The running time is reported in seconds, and the lowest one is highlighted in bold.

|                          | Ir        | Wi       | Ca       | Ng         |
|--------------------------|-----------|----------|----------|------------|
| $n$                      | 50        | 180      | 558      | 3,516      |
| $k$                      | 3         | 5        | 5        | 5          |
| $\sigma$                 | 0.3       | 0.88     | 0.88     | 130        |
| $\lambda_{k+1}/\lambda_k$| 5.27      | 4.96     | 1.45     | 1.01       |
| Avg Linkage (s)          | 0.14      | 0.22     | 2.49     | 198.34     |
| MS (s)                   | 0.88      | 2.00     | 20.36    | 1,088.87   |
| Linkage++ (s)            | 0.61      | 0.90     | 9.18     | 944.05     |
| SpecWRSC (s)             | **0.136** | **0.16** | **1.69** | **128.17** |

6.2. Results on Real-World Data

To evaluate the performance of our algorithm on real-world datasets, we first follow the sequence of recent work on hierarchical clustering (Abboud et al., 2019; Cohen-Addad et al., 2017; Manghiuc & Sun, 2021; Menon et al., 2019; Roy & Pokutta, 2017), all of which are based on the following 4 datasets from the Scikit-learn library (Pedregosa et al., 2011) and the UCI ML repository (Dua & Graff, 2017): Iris (Ir), Wine (Wi), Cancer (Ca), and Newsgroup (Ng)³. For each dataset, we construct the similarity graph based on the Gaussian kernel, in which the $\sigma$-value is chosen according to the standard heuristic (Ng et al., 2001). The setup of parameters for each dataset is listed in Table 1. The cost of the HC trees returned by each algorithm is reported in Figure 3(b), and the figure shows that our algorithm performs better than MS and Linkage++ and matches the performance of Average Linkage. We further report the running time of the tested algorithms in Table 1, which shows that SpecWRSC has the lowest running time among the four tested algorithms. Moreover, we observe that SpecWRSC performs better when the spectral gap $\lambda_{k+1}/\lambda_k$ is large, which is in line with our theoretical analysis.
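As an illustration of how such similarity graphs are formed from feature vectors, the sketch below builds a Gaussian-kernel graph; the kernel convention $w_{uv} = \exp(-\lVert x_u - x_v \rVert^2 / (2\sigma^2))$ and the dataset loader are illustrative assumptions, and in the paper $\sigma$ is set by the heuristic of Ng et al. (2001), whose details we do not reproduce here:

```python
import numpy as np
from sklearn.datasets import load_iris

def gaussian_similarity_graph(X, sigma):
    """Weighted adjacency matrix with w_uv = exp(-||x_u - x_v||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    A = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)                  # no self-loops
    return A

X = load_iris().data                          # Iris feature vectors from Scikit-learn
A = gaussian_similarity_graph(X, sigma=0.3)   # sigma = 0.3 as listed for Ir in Table 1
print(A.shape, round(float(A.sum()) / 2, 1))
```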
³Due to the large size of Newsgroup, we consider only a subset consisting of 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'rec.sport.baseball', and 'rec.sport.hockey'.

Figure 3. (a) Results for HSBMs, and (b) results on the real-world datasets. In Figure 3(a) the x-axis represents different values of $p$, while the y-axis represents the cost of the algorithms' returned HC trees normalised by the cost of SpecWRSC. In Figure 3(b) the x-axis represents the different real-world datasets we evaluate on.

To further demonstrate the effectiveness of our algorithm on larger real-world datasets, we evaluate our algorithm on the GEMSEC Facebook dataset (Rozemberczki et al., 2019). This dataset represents a page-page graph of verified Facebook pages, in which every vertex represents an official Facebook page, and every edge represents the mutual likes between pages. We focus on the largest subset of this dataset, i.e., the artist category. This graph contains 50,515 nodes and 819,306 edges. Out of all the tested algorithms, SpecWRSC is the only one that terminates within 12 hours of compute time. Setting $k = 20$, SpecWRSC returns a tree in 1,223 seconds, with a Dasgupta cost of $1.62 \times 10^{10}$. These experimental results together demonstrate that our designed algorithm not only has excellent running time but also constructs hierarchical clusterings with a cost lower than or similar to the ones constructed by the previous algorithms.

Acknowledgements

This work is supported by an EPSRC Early Career Fellowship (EP/T00729X/1).

References

Abboud, A., Cohen-Addad, V., and Houdrouge, H. Subquadratic high-dimensional hierarchical clustering. In Advances in Neural Information Processing Systems 33 (NeurIPS'19), pp. 11576–11586, 2019.

Alon, N. Eigenvalues and expanders. Combinatorica, 6(2):83–96, 1986.

Alon, N., Azar, Y., and Vainstein, D. Hierarchical clustering: a 0.585 revenue approximation. In 33rd Annual Conference on Learning Theory (COLT'20), pp. 153–162, 2020.

Arora, S., Rao, S., and Vazirani, U. Expander flows, geo-
Hierarchical clustering: Objective functions and algorithms. Journal of the ACM, 66(4):1 42, 2019. Dasgupta, S. A cost function for similarity-based hierarchical clustering. In 48th Annual ACM Symposium on Theory of Computing (STOC 16), pp. 118 127, 2016. Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml. Kapralov, M., Kumar, A., Lattanzi, S., and Mousavifar, A. Learning hierarchical structure of clusterable graphs. Arxiv, 2207.02581, 2022. Lee, J. R., Oveis Gharan, S., and Trevisan, L. Multiway spectral partitioning and higher-order Cheeger inequalities. Journal of the ACM, 61(6):1 30, 2014. Macgregor, P. and Sun, H. A tighter analysis of spectral clustering, and beyond. In 39th International Conference on Machine Learning (ICML 22), pp. 14717 14742, 2022. Manghiuc, B.-A. and Sun, H. Hierarchical Clustering: O(1)-Approximation for Well-Clustered Graphs. In Advances in Neural Information Processing Systems 35 (Neur IPS 21), pp. 9278 9289, 2021. Mc Sherry, F. Spectral partitioning of random graphs. In 42nd Annual IEEE Symposium on Foundations of Computer Science (FOCS 01), pp. 529 537, 2001. Menon, A. K., Rajagopalan, A., Sumengen, B., Citovsky, G., Cao, Q., and Kumar, S. Online hierarchical clustering approximations. Arxiv, :1909.09667, 2019. Mizutani, T. Improved analysis of spectral algorithm for clustering. Optimization Letters, 15(4):1303 1325, 2021. Moseley, B. and Wang, J. Approximation bounds for hierarchical clustering: Average linkage, bisecting k-means, and local search. In Advances in Neural Information Processing Systems 31 (Neur IPS 17), pp. 3094 3103, 2017. Ng, A. Y., Jordan, M. I., and Weiss, Y. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 15 (Neur IPS 01), pp. 849 856, 2001. Oveis Gharan, S. and Trevisan, L. Partitioning into expanders. In 25th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 14), pp. 1256 1266, 2014. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. Peng, R., Sun, H., and Zanetti, L. Partitioning Well Clustered Graphs: Spectral Clustering Works! SIAM Journal on Computing, 46(2):710 743, 2017. Roy, A. and Pokutta, S. Hierarchical clustering via spreading metrics. The Journal of Machine Learning Research, 18(1):3077 3111, 2017. Rozemberczki, B., Davies, R., Sarkar, R., and Sutton, C. Gemsec: Graph embedding with self clustering. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2019, pp. 65 72. ACM, 2019. Spielman, D. A. and Teng, S.-H. Spectral partitioning works: Planar graphs and finite element meshes. In 37th Annual IEEE Symposium on Foundations of Computer Science (FOCS 96), pp. 96 105, 1996. Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs A. Additional Background In this section we list additional background notion and facts used in our analysis, and the section is organised as follows: In Section A.1 we introduce additional background on hierarchical clustering. In Section A.2 we provide a more detailed discussion on spectral clustering, and finally in Section A.3 we present and prove a new decomposition lemma of wellclustered graphs. A.1. 
Hierarchical Clustering We first examine the upper and lower bounds of the cost of HC trees. The first fact holds trivially by the fact that every edge s contribution towards the cost function is at most n. Fact A.1. It holds for any HC tree T of G that COSTG(T ) n vol(G) Secondly, it is known that OPTG can be lower bounded with respect to the degree distribution of G. Lemma A.1 ((Manghiuc & Sun, 2021)). It holds for any graph G that 9 max vol(G)2 G , δG n2 = 2ΦG 9 n vol(G) max d G Next, we introduce the notion of a dense branch, which is a useful tool to analyse the cost of HC trees. Definition A.2 (Dense branch, (Manghiuc & Sun, 2021)). Given a graph G with an HC tree T , the dense branch is the path (A0, A1, . . . , Ak) in T for some k Z 0, such that the following hold: 1. A0 is the root of T ; 2. Ak is the node such that vol(Ak) > vol(G)/2 and both of its children have volume at most vol(G)/2. It is important to note that the dense branch of T is unique, and consists of all the nodes Ai with vol(Ai) > vol(G)/2. Moreover, for every pair of consecutive nodes Ai, Ai+1 on the dense branch, Ai+1 is the child of Ai of the higher volume. The next lemma presents a lower bound on the cost of a tree T using the dense branch, which will be extensively used in our analysis. Lemma A.3 (Lower bound of COSTG(T ) based on the dense branch (Manghiuc & Sun, 2021)). Let G be a graph of conductance ΦG, and T an arbitrary HC tree of G. Suppose (A0, . . . , Ak) is the dense branch of T for some k Z 0, and suppose each node Ai has sibling Bi, for all 1 i k. Then, the following lower bounds of COSTG(T ) hold: 1. COSTG(T ) ΦG 2 Pk i=1 |Ai 1| vol(Bi); 2. COSTG(T ) ΦG 2 |Ak| vol(Ak). One of the two main results from Manghiuc & Sun (2021) is an O(1)-approximation on Dasgupta s cost for expander graphs. Their result is stated in the following theorem. Theorem A.4 (Manghiuc & Sun (2021)). Given any graph G = (V, E, w) with conductance ΦG as input, there is an algorithm that runs in O(m + n log n) time, and returns an HC tree Tdeg(G) of G that satisfies COSTG(Tdeg) = O 1/Φ4 G OPTG. Throughout our discussion, we always use Tdeg(G) to represent the HC tree of G that is constructed from Theorem A.4, i.e., the tree constructed based on the degree sequence of G. In particular, for any partition {C1, . . . , Cℓ}, one can apply Theorem A.4 to every induced graph G[Ci] for any 1 i ℓ, and for simplicity we write Ti Tdeg(G[Ci]) in the following discussion. We next define critical nodes with respect to every Ti. Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs Definition A.5 (Critical nodes (Manghiuc & Sun, 2021)). Let Ti = Tdeg(G[Ci]) be defined as above for any non-empty subset Ci V . Suppose (A0, . . . , Ari) is the dense branch of Ti for some ri Z+, Bj is the sibling of Aj, and let Ari+1, Bri+1 be the two children of Ari. We define Si {B1, . . . , Bri+1, Ari+1} to be the set of critical nodes of Ci. Each node N Si is a critical node. We remark that each critical node N Si (1 i ℓ) is an internal node of maximum size in Ti that is not in the dense branch. Moreover, every Si is a partition of Ci. The following lemma lists some facts of the critical nodes of the trees constructed from Theorem A.4. Lemma A.6 ((Manghiuc & Sun, 2021)). Let Si = {B1, . . . , Bri+1, Ari+1} be the set of critical nodes of Ti. Then the following statements hold: (Q1) |Bj| = 2 |Bj+1| for all j 2, (Q2) vol G[Ci](Bj) 2 vol G[Ci](Bj+1) for all j 1, (Q3) |Aimax| = |Bimax|. 
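To illustrate Definition A.2, the following sketch (our own, reusing the nested-tuple tree representation from the earlier sketch) walks down an HC tree and returns its dense branch, i.e. the maximal root path of nodes whose volume exceeds $\mathrm{vol}(G)/2$:

```python
def dense_branch(tree, degree):
    """Return the dense branch (A_0, ..., A_k) of an HC tree (Definition A.2),
    reporting each node A_i as its induced vertex set."""
    def leaves(t):
        return {t} if not isinstance(t, tuple) else leaves(t[0]) | leaves(t[1])
    def vol(t):
        return sum(degree[v] for v in leaves(t))

    half = vol(tree) / 2.0
    branch, node = [], tree
    while vol(node) > half:                    # every A_i satisfies vol(A_i) > vol(G)/2
        branch.append(sorted(leaves(node)))
        if not isinstance(node, tuple):
            break                              # degenerate case: a single heavy leaf
        left, right = node
        node = left if vol(left) >= vol(right) else right   # follow the heavier child
    return branch

# Toy example: vertex 0 carries most of the volume.
degree = {0: 4, 1: 3, 2: 1, 3: 1}
print(dense_branch(((0, 1), (2, 3)), degree))
# -> [[0, 1, 2, 3], [0, 1]]   (both children of (0, 1) have volume at most vol(G)/2)
```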
For convenience in our notation, we also define a critical sibling node of critical nodes. Definition A.7 (Critical Sibling Node). Let Si = {B1, . . . , Bri+1, Ari+1} be the set of critical nodes of Ti. We define a sibling node of a critical node Bj as sib Ti(Bj) Bj+1 for 1 j ri, and sib Ti(Bri+1) Ari+1 otherwise. We remark that the only critical node without a sibling is Ari+1. Furthermore, due to the degree based construction in Theorem A.4, it holds for all the other critical nodes N Si \ Ari+1 that vol(N) 2 vol(sib Ti(N)) and |N| 2 |sib Ti(N)|. Finally, the following inequality will be extensively in our proof of Theorem 5.4: Lemma A.8. Let G = (V, E, w) be a graph with a partition S1, . . . , Sk, such that Φout 1/2 and ΦG[Si] Φin for any 1 i k. Then it holds that k X i=1 |Si| vol(Si) 18 ηS where ηS is an upper bound of maxi( (Si)/δ(Si)). Proof. We can trivially lower bound OPTG by only considering the internal edges e Si of the individual clusters G[Si], and then applying the lower bound from Lemma A.1: e E[G[Si]] cost G[Si](e) i=1 OPTG[Si] 9 |Si| vol(G[Si]) d G[Si] i=1 |Si| vol(Si) Here, the second inequality follows by Lemma A.1 and the last inequality holds because of the assumptions in the Lemma and the fact that vol(Si) = vol(G[Si]) + w(Si, V \ Si) 2 vol(G[Si]) when Φout 1/2, and because d G[Si] G[Si] d G[Si] (Si) δ(Si) (Si) 2 1 2 ηS . Rearranging the terms above proves the statement. Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs A.2. Spectral Clustering Another component used in our analysis is spectral clustering. To analyse the theoretical performance of spectral clustering, one can examine the scenario in which there is a large gap between λk+1 and ρ(k). By the higher-order Cheeger inequality (1), we know that a low value of ρ(k) ensures that the vertex set V of G can be partitioned into k subsets (clusters), each of which has conductance upper bounded by ρ(k); on the other hand, a large value of λk+1 implies that any (k + 1)-th partition of V would introduce some A V with conductance ΦG(A) ρ(k + 1) λk+1/2. Based on this, Peng et al. (2017) introduced the parameter and showed that a large value of Υ(k) is sufficient to guarantee a good performance of spectral clustering. The result of Peng et al. (2017) has been improved by a sequence of works, and the following result, which can be shown easily by combining the proof technique of Peng et al. (2017) and Macgregor & Sun (2022), will be used in our analysis. Lemma A.9. There is an absolute constant CA.9 R>0, such that the following holds: Let G be a graph with k optimal clusters {Si}k i=1, and Υ(k) CA.9 k. Let {Pi}k i=1 be the output of spectral clustering and, without loss of generality, the optimal correspondence of Pi is Si for any 1 i k. Then, it holds for any 1 i k that vol(Pi Si) k CA.9 3Υ(k) vol(Si), where A B for any sets A and B is defined by A B (A \ B) (B \ A). Moreover, these P1, . . . , Pk can be computed in nearly-linear time. A.3. Strong Decomposition Lemma The objective of this subsection is to prove the following decomposition lemma. The lemma shows that, under a certain eigen-gap condition, an input graph G can be partitioned into clusters with bounded inner and outer conductance, and certain constraints on cut values. We remark that, while obtaining the partition promised by the lemma below requires high time complexity, we only need the existence of such partition in our analysis. Lemma A.10 (Improved Strong Decomposition Lemma). 
Let G = (V, E, w) be a graph such that λk+1 > 0 and λk < 1 270 c0 (k+1)6 2 , where c0 is the constant in Lemma A.12. Then, there is a polynomial-time algorithm that finds an ℓ-partition of V into sets {Ci}ℓ i=1, for some ℓ k, such that for every 1 i ℓand every vertex u Ci the following properties hold: (A1) Φ(Ci) = O(k6 λk); (A2) ΦG[Ci] = Ω(λ2 k+1/k4); (A3) w(u, V \ Ci) 6(k + 1) vol G[Ci](u). We remark that the first two properties of the partitioning promised by Lemma A.10 are the same as the ones from Manghiuc & Sun (2021). However, the third property of our lemma is stronger than theirs, as Property (A3) now holds for all vertices in u Ci, instead of only the critical nodes N Si. We emphasise that this improved decomposition is crucial for our final analysis. In particular, we only use Lemma A.10 to show the existence of a strong decomposition with properties (A1), (A2), and (A3), and we use the strengthened property (A3) in the analysis of the approximation factor of our algorithm. Before starting the analysis, by convention we set ΦG(V ) 0, and ΦG( ) 1. We also set w( , S) 0 for any non-empty subset S V . The following two results will be used in our analysis. Lemma A.11 (Cheeger Inequality, (Alon, 1986)). It holds for any graph G that λ2 2 ΦG 2λ2. Furthermore, there is a nearly-linear time algorithm (i.e., the Spectral Partitioning algorithm) that finds a set S such that vol(S) vol(V )/2, and ΦG(S) 2λ2. Lemma A.12 (Lemma 1.13, (Oveis Gharan & Trevisan, 2014)). There is an absolute constant c0 > 1 such that for any k 2 and any r-way partition C1, . . . , Cr of V , where r k 1, we have that min 1 i r λ2 LG[Ci] 2c0 k6 λk. Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs Now we describe the underlying algorithm and show a sequence of claims, which are used to prove Lemma A.10. At the very high level, our algorithm for computing a stronger decomposition of a well-clustered graph can be viewed as an adjustment to Algorithm 3 in (Manghiuc & Sun, 2021), which itself is based on Algorithm 3 in (Oveis Gharan & Trevisan, 2014) The main idea can be summarised as follows: the algorithm starts with the trivial 1-partition of G, i.e., C1 = V ; in every iteration, the algorithm applies the Spectral Partitioning algorithm for every graph in {G[Ci]}r i=1, and tries to find a sparse cut (S, Ci \ S) for some S Ci.4 If such a cut is found, the algorithm uses this cut to either introduce a new partition set Cr+1 of low conductance, or refine the current partition {Ci}r i=1; If no such cut is found, the algorithm checks if it is possible to perform a local refinement of the partition sets {Ci}r i=1 in order to reduce the overall weight of the crossing edges, i.e. P i =j w(Ci, Cj). If such a refinement is not possible, the algorithm terminates and outputs the current partition; otherwise, the partition sets are locally refined and the process is repeated. The output of the algorithm is guaranteed to satisfy Properties (A1) and (A2) of Lemma A.10. Our improved analysis will show that Property (A3) holds as well, and this will be proven with the two additional Properties (A4) and (A5) stated later. We begin our analysis by setting the notation, most of which follows from (Oveis Gharan & Trevisan, 2014). We write {Ci}r i=1 as a partition of V for some integer r 1, and every partition set Ci contains some core set denoted by core(Ci) Ci. For an arbitrary subset S Ci, we define S+ S core(Ci), and S S \ S+. We further define S+ core(Ci) \ S, and S Ci \ (S core(Ci)). 
Note that {S+, S+} forms a partition of core(Ci), and {S , S } forms a partition of Ci \ core(Ci). For the ease of presentation, we always write u+ and u if the set S consists of a single vertex u. For any sets S, T V which are not necessarily disjoint, we write w(S T) w(S, T \ S), For any subsets S C V , we follow (Oveis Gharan & Trevisan, 2014) and define the relative conductance as φ(S, C) w(S C) vol(C) w(S V \ C) , whenever the right-hand side is defined and otherwise we set φ(S, C) = 1. To explain the meaning of φ(S, C), suppose that C V is the vertex set such that ΦG(C) is low and ΦG[C] is high, i.e., C is a cluster. Then, we know that most of the subsets S C with vol(S) vol(C)/2 satisfy the following properties: Since ΦG[C](S) is high, a large fraction of the edges adjacent to vertices in S would leave S; Since ΦG(C) is low, a small fraction of edges adjacent to S would leave C. Combining the above observations, one could conclude that w(S C) w(S V \ C) if C is a good cluster, which means that φ(S, C) is lower bounded by a constant. Moreover, Oveis Gharan and Trevisan (Oveis Gharan & Trevisan, 2014) showed a converse of this fact: if φ(S, C) is large for all S C, then C has high inner conductance. These facts suggest that the relative conductance provides a good quantitative measure of the quality of a cluster. Now we explain the high-level idea of the proposed algorithm, and refer the reader to Algorithm 3 for the formal description. Our algorithm starts with the partitioning algorithm (Algorithm 3 in (Oveis Gharan & Trevisan, 2014)), and obtains an intermediate partition {Ci}r i=1 (Lines 6 22). For every Ci (1 i r), the algorithm further checks if the following conditions are satisfied: (A4) For every vertex u Ci with vol(u+) vol(CORE(Ci))/2, it holds that φ(u+, CORE(Ci)) 1 3(k+1); (A5) For every vertex u Ci with vol(u ) vol(Ci)/2, it holds that w(u Ci) w(u V \ Ci) 1 k+1. 4We denote by r the number of clusters in the current run of the algorithm, and denote by ℓthe final number of clusters output by the algorithm. Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs If (A4) is violated by some vertex u Ci for some i, then the algorithm uses u to refine the core set core(Ci) (Line 27). If (A5) is not satisfied, then the algorithm further refines the partition (Line 30). The algorithm repeats this local refinement process until no such update is found anymore. In the following analysis, we set 10 , 30c0 (k + 1)5 p where c0 is the constant specified in Lemma A.12, and ϕin λk+1 140(k + 1)2 , ϕout 90c0 (k + 1)6p Notice that, by assuming λk < 1 270 c0 (k+1)6 2 in Lemma A.10, it holds that ϕout < 1/3. This fact will be used several times in our analysis. Following the proof structure in (Oveis Gharan & Trevisan, 2014), we will prove Lemma A.10 via a sequence of claims. Notice that, during the entire execution of the algorithm, the sets {Ci}r i=1 always form a partition of V , and each CORE(Ci) is a subset of Ci. Firstly, we show that, at any point during the execution of the algorithm, the core sets CORE(Ci) (1 i r) always have low conductance. Claim A.1. Throughout the algorithm, we always have that max 1 i r ΦG(CORE(Ci)) ρ 1 + 1 k + 1 The following result will be used in our proof: Lemma A.13 (Lemma 2.2, (Oveis Gharan & Trevisan, 2014)). Let G = (V, E, w) be a graph, and let S, W be two subsets such that S W V . Suppose that the following two conditions are satisfied for some ε > 0: 1. φ(S, W) ε/3 and 2. max {ΦG(S), ΦG(W \ S)} (1 + ε) ΦG(W). 
Then it holds that min {ΦG(S), ΦG(W \ S)} ΦG(W). Proof of Claim A.1. Let r be the current number of clusters generated by the algorithm, and we prove by induction that the claim holds during the entire execution of the algorithm. First of all, for the base case of r = 1, we have that CORE(C1) = C1 = V , which gives us that ΦG(CORE(C1)) = 0; hence, the statement holds trivially. Secondly, for the inductive step, we assume that the statement holds for some fixed configuration of the core sets {core(Ci)}r i=1 and we prove that the statement holds after the algorithm updates the current configuration. Notice that {CORE(Ci)}r i=1 are updated through Lines 10, 13, 16, and 27 of the algorithm, so it suffices to show that the claim holds after executing these lines. We continue the proof with case distinction. When executing Lines 10, 16, the algorithm introduces some new set CORE(Cr+1) such that ΦG(CORE(Cr+1)) ρ 1 + 1 k + 1 Combining this with the inductive hypothesis, which assumes the inequality holds for CORE(Ci) (1 i r), we have that max 1 i r+1 ΦG(CORE(Ci)) ρ 1 + 1 k + 1 5If the set S Ci returned by Spectral Partitioning has vol(S+) > vol(G)/2, swap S with Ci \ S. Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs Algorithm 3 Algorithm for partitioning G into ℓ k clusters 1: Input: G = (V, E, w), k > 1 such that λk+1 > 0; 2: Output: A ϕ2 in/4, ϕout ℓ-partition {Ci}ℓ i=1 of G satisfying (A1) (A3), for some ℓ k; 3: Let r = 1, CORE(C1) = C1 = V ; 4: Let ϕin = λk+1 140(k+1)2 , and ϕout = 90c0 (k + 1)6 λk; 5: Let ρ = min n λk+1 10 , 30c0 (k + 1)5 λk o ; 6: while At least one of the following conditions holds 1. 1 i r such that w(Ci \ CORE(Ci) Ci) < w(Ci \ CORE(Ci) Cj) for some j = i; 2. Spectral Partitioning finds S Ci with 5 vol(S+) vol(core(Ci))/2, such that max ΦG[Ci](S), ΦG[Ci](Ci \ S) < ϕin; do 7: Order the sets C1, . . . , Cr such that λ2(LG[C1]) . . . λ2(LG[Cr]); 8: Let 1 i r the smallest index for which item 2 of the while-condition is satisfied, and let S Ci be the corresponding set; 9: if max n ΦG(S+), ΦG S+ o 1 + 1 k+1 r+1 ρ then 10: Let Ci = Ci \ S+, CORE(Ci) = S+, Cr+1 = CORE(Cr+1) = S+, r = r + 1 and go to Line 6; 11: end if 12: if min n φ(S+, CORE(Ci)), φ S+, CORE(Ci) o 1 3(k+1) then 13: Update CORE(Ci) to either S+ or S+ with the lower conductance, and go to Line 6; 14: end if 15: if ΦG(S ) 1 + 1 k+1 r+1 ρ then 16: Let Ci = Ci \ S , Cr+1 = CORE(Cr+1) = S , set r = r + 1 and go to Line 6; 17: end if 18: if w(Ci \ CORE(Ci) Ci) < w(Ci \ CORE(Ci) Cj) for some j = i then 19: Let Ci = CORE(Ci), merge (Ci \ CORE(Ci)) with argmax Cj{w(Ci \ CORE(Ci) Cj)}, and go to Line 6; 20: end if 21: if w(S Ci) < w(S Cj) for some j = i then 22: Let Ci = Ci \ S , merge S with argmax Cj{w(S Cj)}, and go to Line 6; 23: end if 24: end while 25: For every partition set Ci 26: if u Ci such that vol(u+) vol(CORE(Ci))/2 and φ(u+, CORE(Ci)) 1 3(k+1) then 27: Update CORE(Ci) to either u+ or u+ with the lower conductance, and go to Line 6; 28: end if 29: if u Ci such that vol(u ) vol(Ci)/2 and w(u Ci) < w(u Cj) for some j = i then 30: Let Ci = Ci \ u , merge u with argmax Cj{w(u Cj)}, and go to Line 6; 31: end if 32: Return: the partition {Ci}r i=1. Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs The case for executing Lines 13 and 27 is similar, and we prove this by applying Lemma A.13. We first focus on Line 13 here, dealing with Line 27 next. 
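As an aside, the Spectral Partitioning subroutine invoked in Line 6 of Algorithm 3 (whose guarantee is stated in Lemma A.11) is the standard sweep cut over the second eigenvector of the normalised Laplacian. The sketch below is a minimal NumPy/SciPy version for illustration only; the function name, the matrix representation, and the symmetric conductance used inside the sweep are assumptions rather than the implementation used in the paper.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags, identity
from scipy.sparse.linalg import eigsh

def spectral_partitioning(A):
    """Sweep cut over the second eigenvector of the normalised Laplacian.
    A: symmetric weighted adjacency matrix (dense or scipy.sparse), every vertex
    assumed to have positive degree. Returns a vertex set S with vol(S) <= vol(V)/2,
    in the spirit of the guarantee of Lemma A.11."""
    A = csr_matrix(A)
    n = A.shape[0]
    d = np.asarray(A.sum(axis=1)).ravel()                  # weighted degrees
    d_inv_sqrt = diags(1.0 / np.sqrt(d))
    L = identity(n) - d_inv_sqrt @ A @ d_inv_sqrt          # normalised Laplacian
    vals, vecs = eigsh(L, k=2, which='SM')                 # two smallest eigenpairs
    f = d_inv_sqrt @ vecs[:, np.argsort(vals)[1]]          # embedding from the 2nd eigenvector
    order = np.argsort(f)
    total_vol, vol_S, cut = d.sum(), 0.0, 0.0
    in_S = np.zeros(n, dtype=bool)
    best_phi, best_prefix = np.inf, 1
    for step, u in enumerate(order[:-1], start=1):         # sweep over all proper prefixes
        row = A.getrow(u).toarray().ravel()
        cut += d[u] - 2.0 * row[in_S].sum()                # edges to S leave the cut, the rest join it
        in_S[u] = True
        vol_S += d[u]
        phi = cut / min(vol_S, total_vol - vol_S)
        if phi < best_phi:
            best_phi, best_prefix = phi, step
    S = set(order[:best_prefix].tolist())
    if d[order[:best_prefix]].sum() > total_vol / 2:       # return the smaller-volume side
        S = set(range(n)) - S
    return S
```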
When executing Line 13, we know that the if-condition in Line 9 does not hold, so we have that max n ΦG(S+), ΦG S+ o > 1 + 1 k + 1 r+1 ρ 1 + 1 k + 1 ΦG(CORE(Ci)), where the last inequality follows by the inductive hypothesis. Moreover, when executing Line 13, we also know that the if-condition in Line 12 holds, i.e., min n φ S+, CORE(Ci) , φ S+, CORE(Ci) o 1 3(k + 1). Therefore, by applying Lemma A.13 with S+ core(Ci) and ε = 1/(k + 1) and using the inductive hypothesis, we conclude that min n ΦG(S+), ΦG S+ o ΦG(CORE(Ci)) ρ 1 + 1 k + 1 Now we look at Line 27. When executing Line 27, we know that u+ = u; since otherwise u+ = and φ( , CORE(Ci)) = 1. Therefore, it holds that 1 = max n ΦG(u+), ΦG u+ o > 1 + 1 k + 1 r+1 ρ 1 + 1 k + 1 ΦG(CORE(Ci)), where the first inequality holds because ΦG(u+) = 1, and the last inequality follows by the inductive hypothesis. Moreover, when executing Line 27, we also know that the if-condition in Line 26 holds, i.e., φ u+, CORE(Ci) 1 3(k + 1). Therefore, by applying Lemma A.13 with u+ core(Ci) and ε = 1/(k + 1) and using the inductive hypothesis, we conclude that min n ΦG(u+), ΦG u+ ΦG(CORE(Ci)) ρ 1 + 1 k + 1 where the first equality holds because ΦG(u+) = 1 holds for u+ = u. Combining the two cases above, we know that the claim always holds during the entire execution of the algorithm. This completes the proof. Next, we show that the number of partition sets cannot exceed k. This proof is identical to Claim 3.2 in (Oveis Gharan & Trevisan, 2014), and we include the proof here for completeness. Claim A.2. The total number of clusters returned by the algorithm satisfies that ℓ k. Proof. Suppose for contradiction that the number of clusters becomes r = k + 1 at some point during the execution of the algorithm. Then, since CORE(C1), . . . , CORE(Ck+1) are disjoint, by the definition of ρ(k + 1) and Claim A.1 we have that ρ(k + 1) max 1 i k+1 ΦG(core(Ci)) 1 + 1 k + 1 k+1 ρ e ρ e λk+1 which contradicts the higher-order Cheeger inequality (1). Therefore, the total number of clusters at any time satisfies r < k + 1, and the total number of returned clusters satisfies ℓ k. Now we are ready to show that the output {Ci}ℓ i=1 of Algorithm 3 and its core sets {CORE(Ci)}ℓ i=1 satisfy Properties (A4) and (A5), which will be used in proving Lemma A.10. Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs Claim A.3. Let {Ci}ℓ i=1 be the output of Algorithm 3 with the corresponding core sets {CORE(Ci)}ℓ i=1. Then, the following hold for any 1 i ℓ: 1. ΦG(CORE(Ci)) ϕout/(k + 1); 2. ΦG(Ci) ϕout; 3. ΦG[Ci] ϕ2 in/4; 4. For every u Ci with vol(u+) vol(CORE(Ci))/2, we have that φ(u+, CORE(Ci)) 1 3(k + 1); 5. For every u Ci with vol(u ) vol(Ci)/2, we have that w(u Ci) w(u V \ Ci) 1 k + 1. Proof. First of all, by Claim A.1 we have for any 1 i ℓthat ΦG(CORE(Ci)) ρ 1 + 1 k + 1 ℓ e ρ 30 e c0 (k + 1)5p where the second inequality holds by the fact that ℓ k + 1, the third one holds by the choice of ρ , and the last one holds by the choice of ϕout. This proves Item (1). To prove Item (2), we notice that the first condition of the while-loop (Line 6) doesn t hold when the algorithm terminates, hence we have for any 1 i = j ℓthat w(Ci \ CORE(Ci) Ci) w(Ci \ CORE(Ci) Cj). 
By applying the averaging argument, we have that w(Ci \ CORE(Ci) CORE(Ci)) = w(Ci \ CORE(Ci) Ci) w(Ci \ CORE(Ci) V ) ℓ w(Ci \ CORE(Ci) V \ Ci) We apply the same analysis used in (Oveis Gharan & Trevisan, 2014), and have that ΦG(Ci) = w(Ci V ) w (core(Ci) V ) + w (Ci \ core(Ci) V \ Ci) w(Ci \ core(Ci) core(Ci)) vol(core(Ci)) ΦG(core(Ci)) + (k 1) w(Ci \ core(Ci) core(Ci)) vol(core(Ci)) k ΦG(core(Ci)) where the second inequality uses equation (11). This proves Item (2). Next, we analyse Item (3). Again, we know that the second condition within the while-loop (Line 6) does not hold when the algorithm terminates. By the performance of the Spectral Partitioning algorithm (i.e., Lemma A.11), it holds for any 1 i ℓthat ΦG[Ci] ϕ2 in/4. With this, we prove that Item (3) holds. Similarly, when the algorithm terminates, we know that for any node u Ci the if-condition in Line 26 does not hold. Hence, for any 1 i ℓand any u Ci with vol(u+) vol(CORE(Ci))/2, we have that φ(u+, CORE(Ci)) 1 3(k + 1). Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs This shows that Item (4) holds as well. Finally, since there is no u Ci satisfying the if-condition in Line 29 of the algorithm, it holds for any 1 i = j ℓand every vertex u Ci that w(u Ci) w(u Cj). Therefore, by the same averaging argument we have that w(u Ci) w(u V ) ℓ w(u V \ Ci) which shows that Item (5) holds. It remains to prove that the algorithm does terminate. To prove this, we first show that, in each iteration of the while-loop (Lines 7 22), at least one of the if-conditions will be satisfied, and some sets are updated accordingly. This fact, stated as Claim A.4, is important, since otherwise the algorithm might end up in an infinite loop. The following result will be used in our proof. Lemma A.14 (Lemma 2.6, (Oveis Gharan & Trevisan, 2014)). Let core(Ci) Ci V , and S Ci be such that vol(S+) vol(core(Ci))/2. Suppose that the following hold for some parameters ρ and 0 < ε < 1: 1. ρ ΦG(S ) and ρ max{ΦG(S+), ΦG(S+)}; 2. If S = , then w(S Ci) w(S V )/k; 3. If S+ = , then φ(S+, core(Ci)) ε/3 and φ S+, core(Ci) ε/3. Then, it holds that ΦG[Ci](S) ε ρ 14k . Claim A.4. If at least one condition of the while-loop is satisfied, then at least one of the if-conditions (Lines 9,12,15,18 or 21) is satisfied. Proof. First of all, notice that if the first condition of the while-loop is satisfied, then the if-condition in Line 18 will be satisfied and the claim holds. Hence, we assume that only the second condition of the while-loop is satisfied, and we prove the claim by contradiction. That is, we show that, if none of the if-conditions holds, then the set S returned by the Spectral Partitioning algorithm would satisfy that ΦG[Ci](S) ϕin. The proof is structured in the following two steps: 1. We first prove that ΦG[Ci](S) max{ρ , ρ(r + 1)} 14(k + 1)2 ; 2. Using Item (1) we prove that ΦG[Ci](S) ϕin and reach our desired contradiction. Step 1: We prove this fact by applying Lemma A.14 with parameters ρ max{ρ , ρ(r + 1)} and ε 1 k + 1. Let us show that the conditions of Lemma A.14 are satisfied, beginning with the first one. If S = , then we trivially have that 1 = ΦG(S ) ρ; so we assume that S = . As the if-condition in Line 15 is not satisfied, we have that ΦG(S ) 1 + 1 k + 1 and combining this with Claim A.1 gives us that ΦG(S ) = max ΦG(CORE(C1)), . . . , ΦG(CORE(Cr)), ΦG(S ) ρ(r + 1). Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs Therefore, we have that ΦG(S ) max{ρ , ρ(r + 1)} = ρ. 
(12) Similarly, if S+ = , then we trivially have that 1 = max{ΦG(S+), ΦG(S+)} ρ; so we assume that S+ = . Moreover, since we have chosen S+ such that vol(S+) vol(core(Ci))/2, we know that S+ = core(Ci) \ S+ = . As the if-condition in Line 9 is not satisfied, we have that max n ΦG(S+), ΦG S+ o 1 + 1 k + 1 Combining this with Claim A.1 gives us that max n ΦG(S+), ΦG = max n ΦG(CORE(C1)), . . . , ΦG(CORE(Ci 1)), ΦG(S+), ΦG S+ , ΦG(CORE(Ci+1)), . . . , ΦG(CORE(Cr)) o Therefore we have that max n ΦG(S+), ΦG S+ o max{ρ , ρ(r + 1)} = ρ. (13) Combining (12) and (13), we see that the first condition of Lemma A.14 is satisfied. Since the if-condition in Line 21 is not satisfied, it follows by an averaging argument that w(S Ci) w(S V ) which shows that the second condition of Lemma A.14 is satisfied. Finally, since the if-condition in Line 12 is not satisfied, we know that min n φ(S+, CORE(Ci)), φ S+, CORE(Ci) o 1 3(k + 1), which shows that the third condition of Lemma A.14 is satisfied as well. Hence, by Lemma A.14 we conclude that ΦG[Ci](S) ε ρ 14(k + 1) = max{ρ , ρ(r + 1)} 14(k + 1)2 , (14) which completes the proof of the first step. Step 2: We prove this step with a case distinction as follows: Case 1: r = k. By (14) and (1), we have that ΦG[Ci](S) ρ(r + 1) 14(k + 1)2 = ρ(k + 1) 14(k + 1)2 λk+1 28(k + 1)2 ϕin, which leads to the desired contradiction. Case 2: r < k. Recall that the partition sets {Ci}r i=1 are labelled such that λ2(LG[C1]) . . . λ2(LG[Cr]), and the algorithm has chosen the lowest index i for which the set S Ci returned by the Spectral Partitioning algorithm satisfies the second condition of the while-loop. Our proof is based on a further case distinction depending on the value of i. Case 2a: i = 1 (i.e., the algorithm selects S C1). We combine the performance of the Spectral Partitioning algorithm (Lemma A.11) with Lemma A.12, and obtain that ΦG[C1](S) q 2λ2(LG[C1]) = min 1 j r 2λ2(LG[Cj]) p 4c0 k6 λk. (15) Combining (14) and (15) we have that ρ 28c0 (k + 1)5p Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs Thus, by the definition of ρ we have that We combine this with (14), and have that ΦG[Ci](S) λk+1 140(k + 1)2 = ϕin, which gives our desired contradiction. Case 2b: i > 1 (i.e., the algorithm selects S Ci for some i 2). Let S1 C1 be the set obtained by applying the Spectral Partitioning algorithm to the graph G[C1]. Since the algorithm did not select S1 C1, we know that ΦG[C1](S1) ϕin. Combining the performance of the Spectral Partitioning algorithm (Lemma A.11) with Lemma A.12, we have that ϕin ΦG[C1](S1) min 1 j r 2λ2(LG[Cj]) 2c0 k3 p This gives us that λk+1 10 = 14(k + 1)2 ϕin < 30c0 (k + 1)5 p and it holds by the definition of ρ that Therefore, by (14) we have that ΦG[Ci](S) λk+1 140(k + 1)2 = ϕin. Combining the two cases above gives us the desired contradiction. With this, we complete the proof of the claim. Next, we will show that the total number of iterations that the algorithm runs, i.e., the number of times the instruction go to Line 6 is executed, is finite. Claim A.5. For any graph G = (V, E, w) with the minimum weight wmin as the input, Algorithm 3 terminates after executing the while-loop O (k n vol(G)/wmin) times. Proof. Notice that the algorithm goes back to check the loop conditions (Line 6) right after any of Lines 10, 13, 16, 19, 22, 27 and 30 is executed, and each of these commands changes the current structure of our partition {Ci}r i=1 with core sets CORE(Ci) Ci. We classify these updates into the following three types: 1. 
The updates that introduce a new partition set Cr+1. These correspond to Lines 10, and 16; 2. The updates that contract the core sets CORE(Ci) to a strictly smaller subset T CORE(Ci). These correspond to Lines 13 and 27; 3. The updates that refine the partition sets {Ci}r i=1 by moving a subset T Ci \ CORE(Ci) from the partition set Ci to a different partition set Cj, for some Ci = Cj. These correspond to Lines 19, 22, and 30. We prove that these updates can occur only a finite number of times. The first type of updates can occur at most k times, since we know by Claim A.2 that the algorithm outputs ℓ k clusters. Secondly, for a fixed value of ℓ, the second type of updates occurs at most n times, since each update decreases the size of some CORE(Ci) by at least one. Finally, for a fixed ℓand a fixed configuration of core sets CORE(Ci) Ci, the third type of updates occurs at most O(vol(G)/wmin) times. This is due to the fact that, whenever every such update is executed, the total weight between different partition sets, i.e., P i =j w(Ci, Cj), decreases by at least wmin. Combining everything together proves the lemma. Finally, we combine everything together and prove Lemma A.10. Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs Proof of Lemma A.10. We first show that Properties (A1), (A2) and (A3) hold, and in the end we analyse the runtime of the algorithm. Combining Items (2) and (3) of Claim A.3 with the choices of ϕin, ϕout in (10), we obtain for all 1 i ℓ that ΦG(Ci) ϕout = O k6 λk and ΦG[Ci] ϕ2 in/4 = Ω λ2 k+1/k4 . Hence, Properties (A1) and (A2) hold for every Ci. To analyse Property (A3), we fix an arbitrary node u Ci that belongs to the partition set Ci with core set CORE(Ci). By definition, we have that w(u, V \ Ci) = w(u+, V \ Ci) + w(u , V \ Ci). We study w(u+, V \ Ci) and w(u , V \ Ci) separately. Bounding the value of w(u+, V \ Ci): We analyse w(u+, V \ Ci) by the following case distinction. Case 1: vol(u+) vol(CORE(Ci))/2. By Item (4) of Claim A.3 we know that φ(u+, CORE(Ci)) 1 3(k + 1), which is equivalent to 3(k + 1) vol(CORE(Ci)) vol(u+) w(u+ CORE(Ci)) w(u+ V \ CORE(Ci)). This implies that 6(k + 1) w(u+ CORE(Ci)) w(u+ V \ CORE(Ci)), and we have that w(u+, V \ Ci) w(u+ V \ CORE(Ci)) 6(k + 1) w(u+ CORE(Ci)) 6(k + 1) vol G[Ci](u+). Case 2: vol(u+) > vol(CORE(Ci))/2. We have that w(u+, V \ Ci) w(CORE(Ci), V \ Ci) w(CORE(Ci), V \ CORE(Ci)) = vol(CORE(Ci)) ΦG(CORE(Ci)) vol(CORE(Ci)) ϕout k + 1 vol(u+), (16) where the third inequality follows by Item (1) of Claim A.3. Therefore, we have that vol G[Ci](u+) = vol(u+) w(u+, V \ Ci) > vol(u+) 1 2ϕout where the last inequality follows by (16). We further combine (16) with (17), and obtain that w(u+, V \ Ci) 2ϕout k + 1 1 1 2ϕout k+1 vol G[Ci](u+) 2ϕout k vol G[Ci](u+), where the last inequality holds by our assumption that ϕout < 1/3. Therefore, combining the two cases above gives us that w(u+, V \ Ci) 6(k + 1) vol G[Ci] u+ . (18) Bounding the value of w(u , V \ Ci): We analyse w(u , V \ Ci) based on the following two cases. Case 1: vol(u ) vol(Ci)/2. By Item (5) of Claim A.3, we know that w(u Ci) w(u V \ Ci) 1 (k + 1), Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs which gives us that w(u , V \ Ci) (k + 1) w(u Ci) (k + 1) vol G[Ci](u ). Case 2: vol(u ) > vol(Ci)/2. In this case, we have that w(u , V \ Ci) w(Ci, V \ Ci) = ΦG(Ci) vol(Ci) ϕout vol(Ci) 2ϕout vol(u ), (19) where the second inequality follows by Item (2) of Claim A.3. 
This implies that vol G[Ci](u ) = vol(u ) w(u , V \ Ci) (1 2ϕout) vol(u ), (20) where the last inequality follows by (19). Finally, combining (19) and (20) gives us that w(u , V \ Ci) 2ϕout 1 2ϕout vol G[Ci](u ) 2 vol G[Ci](u ), where the last inequality follows by our assumption that ϕout < 1/3. Therefore, combining the two cases together gives us that w(u , V \ Ci) (k + 1) vol G[Ci](u ). (21) Our claimed property (A3) follows by summing the inequalities in (18) and (21) and the fact that vol G[Ci](u) = vol G[Ci](u+) + vol G[Ci](u ). Finally, we analyse the runtime of the algorithm. By Claims A.4 and A.5, we know that the algorithm does terminate, and the total number of iterations of the main while-loop executed by the algorithm is upper bounded by O(k n vol(G)/wmin). Notice that this quantity is upper bounded by O(poly(n)) given our assumption that wmax/wmin = O(poly(n)). This completes the proof. B. Omitted Details of Proof of Theorem 5.1 This section presents omitted details of the proof of Theorem 5.1. Our key approach of breaking the running time barrier of Manghiuc & Sun (2021) is that, instead of approximating the optimal tree T , we approximate the tree constructed by the algorithm of Manghiuc & Sun (2021). At a very high level, we first slightly adjust the tree construction by Manghiuc & Sun (2021), and show that the cost of this adjusted variant is the same as their original tree construction. We call the adjusted tree TMS. Then, we perform a sequence of changes to TMS, such that the total induced cost of these changes is not too high. After this sequence of changes, we end up with a tree T MS, whose cost can be approximated in nearly-linear time with the output of Algorithm 1. In particular, we only require the existence of the tree TMS, and not its explicit construction. The section is organised as follows: In Section B.1 we introduce a variant of the algorithm and tree construction from Manghiuc & Sun (2021). In Section B.2 we prove Lemma 5.2 and Lemma 5.3. Finally, we prove Theorem 5.1 in Section B.3. B.1. Caterpillar Tree on Extended Critical Nodes Now we present an adjusted version of the result on well-clustered graphs from Manghiuc & Sun (2021). Their original algorithm first uses their variant of Lemma A.10 to explicitly construct a partition {C1, . . . , Cℓ}, and then decomposes every Ci into critical nodes Si. These critical nodes are used to construct their final tree. In our adjusted form, instead of using the critical nodes Si as the building blocks, we use the extended critical nodes as building blocks, which are described below. Let C1, . . . Cℓbe the partition returned by Lemma A.10. We assume for every Ci that (A0, . . . , Ar) is the dense branch of Ti = Tdeg(G[Ci]) for some r Z+, and we extend the dense branch to (A0, . . . , Aimax 1) with the property that, for every i [r, imax 1], Ai is the child of Ai 1 with the higher volume, and Aimax 1 having children Aimax and Bimax such that |Aimax| 2 and let Si {B1, . . . , Bimax, Aimax} be the set of all extended critical nodes. Based on this new notion of extended critical nodes, Algorithm 4 gives a formal description of our adjusted tree called TMS, which consists of three phases: Partition, Prune and Merge. In the Partition phase (Lines 3 4), the algorithm Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs employs Lemma A.10 to partition V (G) into sets {Ci}ℓ i=1, and applies Theorem A.4 to obtain the corresponding trees {Ti}ℓ i=1. 
The Prune phase (Lines 7 9) consists of a repeated pruning process: for every such tree Ti, the algorithm prunes the subtree Ti[N], where N Si is the extended critical node closest to the root in Ti, and adds Ti[N] to T, which is the collection of all the pruned subtrees (Line 9). The process is repeated until Ti is completely pruned. Finally, in the Merge phase (Lines 13 16) the algorithm combines the trees in T in a caterpillar style according to an increasing order of their size, where we define the size of a tree T as |T | |leaves(T )|. We call the final constructed tree TMS, and the performance of this algorithm is summarised in Lemma B.1. We emphasise again that in our algorithm we do not explicitly run Algorithm 4 to construct our tree. Instead, we use it to show the existence of tree TMS, and our final analysis is to approximate the cost of TMS. Algorithm 4 Merge Extended Critical Nodes 1: Input: A graph G = (V, E, w), a parameter k Z+ such that λk+1 > 0 2: Output: An HC tree TMS of G 3: Apply the partitioning algorithm (Lemma A.10) on input (G, k) to obtain {Ci}ℓ i=1 for some ℓ k; 4: Let Ti = HCwith Degrees(G[Ci]); 5: Initialise T = ; 6: for All clusters Ci do 7: Let Si be the set of extended critical nodes of Ti; 8: for N Si do 9: Update T T Ti[N]. 10: end for 11: end for 12: Let t = |T|, and T = {f T1, . . . , f Tt} be such that |e Ti| |e Ti+1| for all 1 i < t; 13: Initialise TMS = e T1; 14: for i = 2, . . . , t do 15: Let TMS be the tree with TMS and e Ti as its two children; 16: end for 17: Return: TMS Lemma B.1. Let G = (V, E, w) be a graph, and k > 1 such that λk+1 > 0 and λk < 1 270 c0 (k+1)6 2 , where c0 is the constant in Lemma A.12. Then, the tree TMS returned by Algorithm 4 satisfies COSTG(TMS) = O k22/λ10 k+1 OPTG. The remaining part of this subsection is to analyse Algorithm 4, and prove Lemma B.1. First of all, notice that, since {Ci}ℓ i=1 is the output of our partitioning algorithm, it holds for every u Ci for any 1 i ℓthat (A1) Φ(Ci) = O(k6 λk); (A2) ΦG[Ci] = Ω(λ2 k+1/k4); (A3) w(u, V \ Ci) 6(k + 1) vol G[Ci](u). Generalising Property (A3), it is easy to see that it holds for every extended critical node N Si (1 i ℓ) that (A3 ) w(N, V \ Ci) 6(k + 1) vol G[Ci](N). Moreover, with the parameter ϕin = Θ λk+1/k2 , we have that ΦG[Ci] = Ω(ϕ2 in) holds for any 1 i ℓ. Let Ti = Tdeg(G[Ci]) with the corresponding set of extended critical nodes Si for all 1 i ℓ, and S = Sℓ i=1 Si be the set of all extended critical nodes. Now we group the edges of G into two categories: let E1 be the set of edges in the induced subgraphs G[Ci] for all 1 i ℓ, i.e., i=1 E [G[Ci]] , Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs and E2 be the remaining crossing edges. Therefore, we write the cost of our tree TMS as COSTG(TMS) = X e E1 cost TMS(e) + X e E2 cost TMS(e), (22) and analyse each sum individually in Lemmas B.3 and B.4. Our first result bounds the size of the parent of an extended critical node N Si in TMS, with respect to the size of its parent in the tree Ti. This result will be extensively used when analysing the cost of the edges adjacent to N. Lemma B.2. It holds for every 1 i ℓand every extended critical node N Si that parent TMS(N) 6k parent Ti(N) . Proof. Suppose the extended dense branch of Ti is (A0, . . . , Aimax 1) for some imax Z 0, with Bj being the sibling of Aj and Aimax 1 having children Aimax, Bimax. Recall that the set of extended critical nodes is Si = {B1, . . . , Bimax, Aimax}. 
By construction of the degree-based tree, we know from Lemma A.6 that it holds for all 1 j imax 1 that |Aj| = 2 |Aj+1|, which implies that |Bj+1| = |Aj+1| and |Bj| = 2 |Bj+1| for all j 2. Thus, we conclude that for every interval 2s 1, 2s , for some s Z 0, there are at most 3 critical nodes6 N Si of size |N| 2s 1, 2s . Now let us fix N Si. By construction, we have that parent Ti(N) 2 |N|. (23) On the other hand, by the construction of TMS we have that parent TMS(N) = M Sj |M| |N| M Sj 2s 1<|M| 2s j=1 3 2 log |N| +1 12k |N|. (24) By combining (23) and (24), we have that parent TMS(N) 12k |N| 6k parent Ti(N) , which proves the statement. Now we prove the two main technical lemmas of this subsection. Lemma B.3. It holds that P e E1 cost TMS(e) = O k/ϕ8 in OPTG. Proof. Notice that X e E1 cost TMS(e) = e E[G[Ci]] cost TMS(e). We prove that, for every 1 i ℓand e E[G[Ci]], the cost of e in TMS and the one in Ti differ by at most a factor of O(k). Combining this with Theorem A.4 will prove the lemma. To prove this O(k)-factor bound, we fix any 1 i ℓand let Si be the set of extended critical nodes of Ci. As the nodes of Si form a partition of the vertices of G[Ci], any edge e E[G[Ci]] satisfies exactly one of the following conditions: (i) e is inside a critical node; (ii) e is adjacent to a critical node. Formally, it holds that X e E[G[Ci]] cost TMS(e) = X N Si e E(N,N) cost TMS(e) + X N Si M Si\{N} |parent Ti(M)| |parent Ti(N)| e E(N,M) cost TMS(e) 6We remark that, in the worst case, all three nodes B1, Bimax and Aimax could have size in (2s 1, 2s]. Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs We first examine the case in which the cost of e in both trees are the same, i.e., Case (i). Since we do not change the structure of the tree inside any critical node, it holds that X N Si e E(N,N) cost TMS(e) = X N Si e E(N,N) cost Ti(e). For Case (ii), the cost of any such edge increases by at most a factor of O(k) due to Lemma B.2 and the construction of TMS. Formally, let N Si be an arbitrary extended critical node, and M Si \ {N} be an extended critical node such that parent Ti(M) parent Ti(N) . Firstly, notice that if parent Ti(N) is the root node of Ti, then for any edge e E(N, M) it holds that cost Ti(e) = we |Ti|. On the other hand, by the construction of TMS, we know that cost TMS(e) 6k we |Ti|, so we conclude that cost TMS(e) 6k cost Ti(e). Secondly, if parent Ti(N) is not the root node of Ti and since parent Ti(M) parent Ti(N) , we know that |M| |N|. Therefore it holds for any edge e E(N, M) that cost TMS(e) = we parent TMS(N) 6k we parent Ti(N) = 6k cost Ti(e), where the inequality follows by Lemma B.2. Combining the above facts, we have that e E1 cost TMS(e) e E[G[Ci]] cost Ti(e) i=1 6k COSTG[Ci](Ti). (25) On the other side, let T be any optimal HC tree of G with cost OPTG, and it holds that i=1 OPTG[Ci] = i=1 Ω Φ4 G[Ci] COST(Ti) = i=1 Ω ϕ8 in COST(Ti), (26) where the last equality follows by Property (A2) of Lemma A.10 and Theorem A.4 applied to every G[Ci]. Finally, by combining (25) and (26) we have that X e E1 cost TMS(e) = O(k/ϕ8 in) OPTG, which proves the lemma. Lemma B.4. It holds that X e E2 cost TMS(e) = O k2/ϕ10 in OPTG. Proof. For the edges e E2, we bound the cost with the help of Lemma B.2 similar as before. 
Specifically, we have that e E2 cost TMS(e) N Si M S\Si |M| |N| e E(N,M) cost TMS(e) parent TMS(N) w(N, V \ Ci) N Si 36k(k + 1) parent Ti(N) vol G[Ci](N) (27) 4 ΦG[Ci] COSTG[Ci](Ti) (28) 1 Φ5 G[Ci] OPTG[Ci] (29) = O(k2/ϕ10 in ) OPTG, Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs where (27) follows from Property (A3 ) of Lemma A.10 and Lemma B.2, (28) follows by Lemma A.3, and (29) follows by Theorem A.4 applied to every induced subgraph G[Ci]. Finally, we are ready to prove Lemma B.1. Proof of Lemma B.1. Let C1, . . . Cℓbe the partitioned returned by the algorithm from Lemma A.10. Let S Sk i=1 Si be the set of all critical nodes, and TCC be the caterpillar style tree on the critical nodes according to an increasing order of their sizes. We have that COSTG(TMS) = X e E1 cost TMS(e) + X e E2 cost TMS(e) = O(k/ϕ8 in) OPTG + O(k2/ϕ10 in ) OPTG = O(k2/ϕ10 in ) OPTG = O(k22/λ10 k+1) OPTG, where the second equality follows by Lemmas B.3 and B.4, and the last equality follows by the definition of ϕin. B.2. Proof of Lemmas 5.2 and 5.3 Now we prove Lemma 5.2 and Lemma 5.3, which allow us to approximate the cost of TMS. To sketch the main proof idea for Lemma 5.3, we construct a tree T MS and upper bound its weighted Dasgupta cost on the bucketing B, i.e., WCOSTG/B (T MS), with respect to OPTG. To construct T MS, we start with the tree TMS on G, whose upper bound with respect to OPTG is known because of Lemma B.1. We then transform TMS into T MS in two steps, such that there is only a small induced cost for both transformations. This allows us to upper bound COSTG(T MS) with respect to COSTG(TMS). Crucially, we can approximate the tree TMS with Algorithm 1 without explicitly constructing tree TMS. The proof for Lemma 5.2 follows from the proof for Lemma 5.3. In order to analyse the constructed trees T MS and T MS, we analyse the properties of the buckets in the context of the entire graph G, and not just on every set Pi returned by spectral clustering. We denote by B the set containing all the buckets obtained throughout all sets Pi, i.e., notice that B is a partition of V . We use κ |B| to denote the total number of buckets. For convenience, we label the buckets B = {B1, . . . , Bκ} arbitrarily, and in general we always refer to a bucket Bj B with the subscript j. Next, we also analyse the properties of the extended critical nodes in the context of the entire graph G, not just from every set Si. Therefore, let be the set of all extended critical nodes, and we label them as S = {N1, . . . Nk1} in the order they appear in if one would travel upwards starting from the bottom of the caterpillar tree TMS. In general, we always refer to an extended critical node Ni S with the subscript i. We use C(Ni) to denote the original cluster that Ni belongs to, i.e, the cluster C(Ni) is the Cs such that Ni Ss. We also use Tdeg(Ni) to denote the tree Tdeg that Ni originally corresponds to. To avoid confusion, we remind the reader that Ti is the degree-based tree constructed on the induced graph G[Ci], i.e., Ti = Tdeg(G[Ci]). Next, we define Xj {Ni S|Ni Bj = } as the set of critical nodes that have non-empty intersection with bucket Bj B. Note that the subscript of Xj matches the one of the corresponding bucket Bj. We then define the anchor (critical) node Ni S of the set Xj as the largest critical node inside Xj. i.e., we have that Ni argmax Ni Xj |Ni |, where ties are broken by choosing Ni to be the critical node highest up in TMS. 
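The caterpillar merge that assembles $T_{MS}$ (the Merge phase in Lines 13–16 of Algorithm 4) and the anchor-node rule just defined can be sketched as follows; the tree representation, the auxiliary height map, and the function names are illustrative assumptions rather than the paper's implementation.

```python
class Node:
    """A binary HC-tree node; `leaves` is the set of graph vertices below it."""
    def __init__(self, left=None, right=None, leaves=None):
        self.left, self.right = left, right
        self.leaves = leaves if leaves is not None else (left.leaves | right.leaves)

def caterpillar_merge(subtrees):
    """Merge trees in caterpillar style in increasing order of size, so that
    larger subtrees sit closer to the root (as in Lines 13-16 of Algorithm 4)."""
    subtrees = sorted(subtrees, key=lambda t: len(t.leaves))
    spine = subtrees[0]
    for t in subtrees[1:]:
        spine = Node(left=spine, right=t)   # each step hangs one more tree off the spine
    return spine

def anchor_node(extended_critical_nodes, bucket, height_in_TMS):
    """Return the anchor node of X_j = {N_i : N_i intersects B_j}: the largest node in X_j,
    with ties broken in favour of the node placed highest in T_MS. Here `height_in_TMS`
    is an assumed auxiliary map from each node (as a frozenset of vertices) to its height."""
    X_j = [N for N in extended_critical_nodes if N & bucket]
    return max(X_j, key=lambda N: (len(N), height_in_TMS[frozenset(N)]))
```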
These anchor nodes corresponding to the sets Xj and therefore the buckets Bj B are the key in our proof of the approximation factor of our algorithm; they Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs determine the locations in the tree TMS where we group together the buckets Bj B. It is also important to note that for every Xj its anchor node is determined uniquely. On the other hand, any Ni S might be the anchor node for multiple Xj s. Now we re ready to describe the two transformations. Step 1: Splitting the Critical Nodes. In this first step, we transform the tree TMS such that all the sets Ni Bj for 1 i k1 and 1 j κ are separated and restructured to form a caterpillar tree. This is done by splitting every Ni into Ni Bj for all 1 j κ and appending them next to each other arbitrarily in the caterpillar tree in the original location of Ni in TMS. More formally, we consider the partition of every Ni S into {Ni Bj}κ j=1. Then, for every internal node parent(Ni) in the tree TMS, we create new internal nodes parent(Ni Bj ) for every Ni Bj {Ni Bj}κ j=1, and let one of the children of each of these be a new internal node corresponding to Ni Bj , and let the other child be another node parent(Ni Bj ) for j = j . Moreover, we ensure that the ordering (starting from the bottom) of the new nodes parent(Ni Bj) along the back of the caterpillar tree will be such that the first all the ones corresponding to N1 will appear, then N2 etc., up until Nk1. We call the resulting tree T MS, whose cost is bounded in the lemma below. Lemma B.5. It holds that COSTG (T MS) COSTG (TMS) + O(k/ϕ10 in ) OPTG. Proof. First of all, notice that for any edge crossing critical nodes, i.e., e E(Ni, Ni ) for i = i , we have that cost T MS(e) cost TMS(e). This is because splitting up the critical nodes cannot increase the size of the lowest common ancestor of any crossing edge between critical nodes, due to the caterpillar tree construction. So we don t need to consider the increase in the cost of crossing edges between critical nodes. Next, we study the edges inside critical nodes e E(Ni, Ni) for any i. These edges cost could be increased compared with the one in TMS, since they can span across nodes Ni Bj. However, the increase in size of their lowest common ancestor is by construction at most |parent TMS(Ni)|, and therefore the total cost of this transformation can be upper bounded as e E(Ni,Ni) cost T MS(e) e E(Ni,Ni) w(e) parent TMS(Ni) i=1 6k parent Tdeg(Ni)(Ni) vol G[C(Ni)](Ni) (30) 4 ΦG[Cs] COSTG[Cs](Ts) (31) 1 Φ5 G[Cs] OPTG[Cs] (32) =O(k/ϕ10 in ) OPTG, where (30) follows from Lemma B.2, (31) follows from Lemma A.3, and (32) follows from Theorem A.4 applied to every induced subgraph G[Ci]. This proves the statement. Step 2: Grouping Degree Buckets. In this second step, we move around the sets Ni Bj, such that the buckets B1, . . . , Bκ are separated after this step. We do this by traversing the tree T MS from the bottom to the top, and visiting every node parent(Ni Bj) up along the back of the caterpillar construction. This means we first visit all the nodes parent(Ni Bj) corresponding to N1, then N2 etc. up to Nk1. When visiting parent(Ni Bj) we check whether Ni is an anchor node of Xj. If not, we continue travelling up. If it is, we transform the tree T MS by moving up all the internal nodes Ni Bj for Ni Xj \ Ni and placing them in the same subtree as the internal node corresponding to Ni Bj. 
We do so by creating a new root node RBj, and we loop through each Ni Bj for Ni Xj in an arbitrary order. In the first iteration, we set the children of RBj to be Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs Figure 4. Visualisation of a part of the transformation described in step 2, where the buckets are grouped. On the left we visualise a section of the tree T MS, corresponding to a single critical node Ni. For illustration, we assume here that Ni is the anchor node for X2. This means that all Ni B2 for Ni X2 \ Nj are moved up into the same subtree, as illustrated on the right in the new tree T MS. Ni Bj, and a newly created node denoted by z. We iteratively do this, setting the two children of z in the same way as for the root node, such that we construct a caterpillar tree on the nodes Ni Bj for Ni Xj. Finally, after this construction, we replace the child node Ni Bj of parent(Ni Bj) with the root node RBj of our newly constructed caterpillar tree. Note that every parent(Ni Bj) for Ni Xj \ Ni no longer has a corresponding Ni Bj as a child node, since it has been placed in the new caterpillar tree. Therefore, as a final step, we remove parent(Ni Bj) from the tree T MS, and appropriately update the child and parent node directly below and above parent(Ni Bj) to ensure that the tree stays connected. After travelling up the entire tree, this transformation ensures that all the vertices in Bj are placed in the same subtree, since S Ni Xj Ni Bj = Bj. Crucially, because Ni is an anchor node, we know by the construction of the tree TMS that every Ni Xj \ Ni is placed lower in the tree T MS, and hence we do not have to consider the induced cost of moving any critical node higher up in the tree down to Ni. To illustrate, We visualise (part of) the transformation in Figure 4. We call the resulting tree T MS, and we bound its cost with the following lemma. Lemma B.6. It holds that COSTG (T MS) COSTG (T MS) + O(β k3/ϕ10 in ) OPTG. Proof. We bound the cost of the transformation by looking at the total induced cost incurred at every Ni S that is an anchor node. First of all, notice that for all nodes Ni Bj in T MS we have that |parent T MS(Ni Bj)| |parent TMS(Ni)| 6k |parent Tdeg(Ni)(Ni)|, (33) where the first inequality holds because of the construction of T MS and the second inequality follows by Lemma B.2. Recall that Tdeg(Ni) is the tree Tdeg that Ni originally corresponds to. Now let us fix some Ni that is also an anchor node, and let Xi be the set such that Xj Xi shares Ni as their anchor node, i.e., Ni = argmax Ni Xj |Ni | for Xj Xi. The only edges whose cost can increase when moving up Ni Bj for some Ni Xj will be the edges with one endpoint in Ni Bj, since moving them up potentially increases their lowest common ancestor to be |parent T MS(Ni Bj)|. Note that because we carefully choose to perform the transformations at the anchor node, we don t introduce any new vertices from higher up in the tree which could potentially change the size of the lowest common ancestor. 
Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs Therefore, we can bound the cost of moving up all Ni Bj for Ni Xj and Xj Xi as |parent T MS(Ni Bj)| X Ni Xj vol(Ni Bj) 6k |parent Tdeg(Ni)(Ni)| X Ni Xj vol(Ni Bj) (34) 6k |parent Tdeg(Ni)(Ni)| X Ni Xj |Ni Bj| G(Ni Bj) 6k |parent Tdeg(Ni)(Ni)| X Ni Xj |Ni Bj| β G(Ni Bj) (35) 6k |parent Tdeg(Ni)(Ni)| X Ni Xj |Ni Bj| β G(Ni) 6k β |parent Tdeg(Ni)(Ni)| 14k G[C(Ni)](Ni) X Ni Xj |Ni | (36) O(k3 β) |parent Tdeg(Ni)(Ni)| G[C(Ni)](Ni) |parent Tdeg(Ni)(Ni)| (37) O(k3 β) |parent Tdeg(Ni)(Ni)| δG[C(Ni)](sib Tdeg(Ni)(Ni)) |parent Tdeg(Ni)(Ni)| (38) O(k3 β) |parent Tdeg(Ni)(sib Tdeg(Ni)(Ni))| vol G[C(Ni)] sib Tdeg(Ni)(Ni) . (39) Here, (34) follows by (33), (35) follows by the fact that all degrees du of u Ni Bj for Ni Xj lie within a factor of β of each other. Since we denote by C(Ni) the cluster to which Ni originally corresponds, (36) holds because of Property (A3) of Lemma A.10, (37) holds because we only move the critical nodes Ni that are below Ni in the caterpillar construction, so the sum of their sizes is at most |parent TMS(Ni)|, which we upper bound with (33). Moreover, (38) follows by the fact that the minimum degree in G[C(Ni)] of a sibling node of Ni (Definition A.7) is at least the maximum degree of Ni in G[C(Ni)], (39) holds because |parent Tdeg(Ni)(Ni)| 2 |parent Tdeg(Ni)(sib Tdeg(Ni)(Ni))| and δG[C(Ni)](sib Tdeg(Ni)(Ni)) |parent Tdeg(Ni)(Ni)| 4 vol G[C(Ni)] sib Tdeg(Ni)(Ni) . The bound above holds as long as Ni has a sibling node, and it remains to examine the case in which Ni doesn t have a sibling node. Recall that the set of extended critical nodes of a cluster Cs returned by Lemma B.1 is Ss {B1, . . . , Bsmax, Asmax}. The only critical node without a sibling node is Asmax, and we know by construction that |Asmax| 2. Because TMS is a caterpillar tree based on the sizes of the nodes, Asmax is placed as one of the critical nodes furthest down the tree. Given that there are at most three critical nodes of size at most two in every Cs, Asmax must be one of the last 3ℓextended critical nodes in the caterpillar tree. Suppose one of these Asmax from some Ss is an anchor node. The only nodes in the tree T MS that can be moved up to this node are other nodes of size at most two, of which there are at most 3ℓ(three for every partition Cs). Let Nq argmax Ni S |Ni | 2 vol(Ni ) be the critical node of size at most two with the largest volume. Then, all the transformations (at most 3k) occurring in the lower part of the tree TMS can be upper bounded by =O(k3) vol G(Nq) =O(k4) vol G[C(Nq)] (Nq) =O(β k3) |parent Tdeg(Nq)(Nq)| vol G[C(Nq)] (Nq) , (40) Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs where the first equality holds as Nq is the largest volume extended critical node of size at most two, the second one holds by Property (A3) of Lemma A.10, and the last one holds because |parent Tdeg(Nq)(Nq)| 4 and the fact that k 2k(γ+1) = β for k 1 and γ > 0. Finally, to upper bound the total cost of the transformation, we assume for the upper bound that every extended critical node is an anchor node, and we split the sum over the induced cost of all the anchor nodes that have critical siblings (39), and the transformation in the bottom of the tree corresponding to critical nodes of size at most 2, which don t have critical siblings (40). 
Hence, the total cost of the transformation is at most |parent Tdeg(Ni)(sib Tdeg(Ni)(Ni))| vol G[C(Ni)] sib Tdeg(Ni)(Ni) + O(β k3) |parent Tdeg(Nq)(Nq)| vol G[C(Nq)] (Nq) parent Tdeg(Ni)(Ni) vol G[C(Ni)](Ni) (41) 4 ΦG[Cs] COSTG[Cs](Ts) (42) 1 Φ5 G[Cs] OPTG[Cs] (43) =O(β k3/ϕ10 in ) OPTG. Here, (41) holds by the fact that the first sum of (41) is over |parent Tdeg(Ni)(Ni)| vol G[C(Ni)](Ni) for all 1 i k1 except for the extended critical nodes that are not the sibling node of any critical node. Hence, we upper bound the sum by including these terms, and relabeling the indices. Moreover, (42) follows by Lemma A.3, and (43) follows by Theorem A.4 applied to every induced subgraph G[Cs]. This proves the statement. Proof of Lemma 5.3. Since the tree T MS separates all the individual buckets B B by construction, we can upper bound the cost of WOPTG/B with respect to the cost of T MS and have that WOPTG/B COSTG (TMS) + O(k/ϕ10 in ) OPTG + O(β k3/ϕ10 in ) OPTG = O β k23/λ10 k+1 OPTG, where the first inequality holds by Lemmas B.5 and B.6, and the second one holds by Lemma B.1 and the fact ϕin = Θ(λk+1/k2). Proof of Lemma 5.2. We bound the cost of the internal edges in G[Bj] for 1 j κ using the trivial upper bound for every Bj, and have that B Bi COSTG[B] (TB) j=1 |Bj| vol(Bj) = Xj Xi |Bj| X Ni Xj vol(Ni Bj) Xj Xi |parent T MS(Ni Bj)| X Ni Xj vol(Ni Bj) where the last inequality holds by construction of the tree T MS and the property of anchor nodes. By repeating the same calculation as the one in proving Lemma 5.3, we have that B Bi COSTG[B] (TB) Xj Xi |parent T MS(Ni Bj)| X Ni Xj vol(Ni Bj) =O β k23/λ10 k+1 OPTG, which proves the statement. Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs B.3. Proof of Theorem 5.1 Now we prove the approximation and running time guarantee of Algorithm 1. In the following Lemma we analyse the approximation and running time guarantee of WRSC. Lemma B.7. Let wmax/wmin = O(nγ) for a constant γ > 1, and B the set of all buckets constructed inside every P1, . . . Pk. Then, given a contracted graph H = G/B as input the WRSC algorithm runs in e O(n) time and returns an HC tree T , such that WCOST(T ) = O (1) WOPTG/B. Proof. We denote by B the set containing all the buckets obtained throughout all sets Pi, i.e., and it holds that |B| = κ k logβ ( G/δG). Recall that our choice of γ and β satisfies wmin = O(nγ) and β = 2k(γ+1). Hence, it holds that G δG = O nγ+1 , and the total number of buckets in B is upper bounded by k max{1, logβ nγ+1} k + k (γ + 1) log β log n = k + log n. Hence, there are at most k +log n vertices in G/B. To compute the optimal weighted sparsest cut in H = G/B, we compute sparsity H(S) for every subset S V (H), the total number of which is upper bounded by 2k+log n = O 2k n . Since both w(S, V (H) \ S) and vol H(S) can be computed in e O(1) time, the sparsest cut of G/B can be computed in e O(n) time given that k is a constant. Combining this with the master theorem proves the time complexity of the WRSC algorithm. The approximation guarantee for the algorithm follows from Lemma 3.2 and the fact that we compute the optimal weighted sparsest cut; as such we have α = 1 Next, we study the time complexity of Algorithm 1. We first show that the cost COSTG (T ) can be computed in nearly-linear time, assuming that the internal nodes of our constructed tree T have several useful attributes. Lemma B.8. Let TG be an HC tree of a graph G = (V, E, w), with depth D. 
Assume that every internal node N of TG has (i) a pointer parent that points to its parent node, (ii) an attribute size, which specifies the number of leaves in the subtree TG (G[N]), and (iii) an attribute depth that specifies the depth of TG (G[N]). Then, COSTG(TG) can be computed in O(m D) time. Proof. Since G has m edges, it suffices to show that we can compute the value cost G(e) for every edge e = {u, v} E in O(D) time. Let us fix an arbitrary edge e = {u, v}, and let Nu and Nv be the leaf nodes of TG corresponding to u and v respectively. Without loss of generality, we assume that Nu(depth) Nv(depth). We successively travel from Nu up the tree via the parent pointers until we reach a node N u, such that N u(depth) = Nv(depth). After that, we find the lowest common ancestor of u and v by simultaneously travelling up one node at a time, starting from both N u and Nv, until we reach the same internal node Nr. Once this condition is satisfied, we can readily compute cost G({u, v}) = we Nr(size). The overall running time for computing cost G({u, v}) is O(D), and therefore the statement holds. By Lemma B.8 we know that the time complexity of constructing T is linear in the depth of T . The following lemma shows that the depth of T is O(log n), and the entire T can be constructed in nearly-linear time. Lemma B.9. Algorithm 1 runs in e O (m) time, and constructs the tree T of depth O (log n). Moreover, every internal node N T has a pointer parent, and every node T T has attributes size and depth. Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs Proof. We follow the execution of Algorithm 1 to analyse its time complexity, and the depth of its constructed T . Line 2 of Algorithm 1 applies spectral clustering to obtain the clusters {Pi}k i=1, and the running time of this step is e O(m) (Peng et al., 2017). Since ordering the vertices of every Pi with respect to their degrees takes O (|Pi| log |Pi|) time, the total time complexity of sorting the vertices for {Pi}, finding {u(i)}, as well as constructing {Bi} (Lines 6 8) is O Pk i=1 |Pi| log |Pi| = O(n log n). Now let Bj B be an arbitrary bucket, and we study the construction of TBj. We assume that Bj has vertices u1, . . . , unj sorted in increasing order of their degrees, where nj = |Bj|. We construct the balanced tree TBj recursively as follows: 1. Initialise a root node N0, such that N0(size) = nj, and N0(parent) = ; 2. Create two nodes N1 and N2 corresponding to the vertices u1, . . . , u nj/2 an u nj/2 +1, . . . unj respectively. Set N1(parent) = N0 and N2(parent) = N0; 3. Recursively construct the subtrees rooted at N1 and N2 on the corresponding sets of vertices, until we reach the leaves. By construction, TBj is a balanced tree and has O(nj) internal nodes. Hence, the depth of TBj is O(log nj) = O(log n). Moreover, it s straightforward to see that this step (Lines 9 12) takes O(n log n) time for constructing each TBj. For constructing the contracted graph H from B, notice that every edge of G only contributes to exactly one of the edge weights in H; as such constructing H (Line 15) takes O(m) time. Finally, to complete the construction of T , we run the WRSC algorithm on H (Line 16), and assume that we know how the vertices are split at each iteration of the recursive weighted sparsest cut algorithm. We construct the tree T recursively as follows: 1. Initialise a root node R0, such that R0(size) = n, and R0(parent) = ; 2. 
Create two nodes R1 and R2 corresponding to the sets A and B, where (A, B) is the first split of the WRSC algorithm. Set R1(parent) = R0 and R2(parent) = R0; 3. Recursively construct the subtrees rooted at R1 and R2 on the corresponding sets of vertices and split (A, B) returned by the WRSC algorithm, until we reach the leaves corresponding to B1, . . . Bκ. 4. For each TBj T, we update the parent of the root node N0(parent) with N0(parent) = RBj(parent), where RBj is the leaf node corresponding to Bj in the tree construction returned above by the WRSC algorithm. We also delete the leaf node RBj, thereby effectively replacing RBj by TBj. By Lemma B.7 we know that the WRSC algorithm (Line 16) runs in nearly-linear e O(m) time. Taking everything together, this proves the time complexity of Algorithm 1. The O(log n) depth of T follows by the fact that (i) the depth of every tree TBj is O(log n), and (ii) the distance between the root of T and the root of any such tree TBj in T is also O(log n). Finally, to set the depth attribute of every internal node, we perform a top-down traversal from the root of T to the leaves, and update the depth for every internal node as one more than the depth of their parent. We are now ready to prove the main result, which we will state here in full. Theorem B.10 (Formal statement of Theorem 5.1). Let G = (V, E, w) be a graph with wmax/wmin = O(nγ) for a constant γ > 1, and k > 1 such that λk+1 > 0 and λk < 1 270 c0 (k+1)6 2 , where c0 is the constant in Lemma A.12. Algorithm 1 runs in time e O(m) and both constructs an HC tree T of G and computes COSTG(T ) satisfying COSTG(T ) = O 2k(γ+1) k23/λ10 k+1 OPTG. In particular, when λk+1 = Ω(1) and k = O(1), the algorithm s constructed tree T satisfies that COSTG(T ) = O(1) OPTG. Proof of Theorem B.10. Let T be the tree returned by Algorithm 1. By combining (4), (7), Lemma 5.2, and the approximation guarantee in Lemma B.7 we have that COSTG(T ) = O 2k(γ+1) k23/λ10 k+1 OPTG. Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs The time complexity of Algorithm 1 follows by Lemma B.9. Moreover, as the depth of T is O(log n), and every internal node N T has a pointer parent and attributes size and depth, by Lemma B.8 we can compute COSTG(T ) in nearly-linear time e O(m). We complete the proof by dealing with our assumption that the algorithm knows the number of clusters k. If the number of clusters k is unknown, we perform a standard technique from the literature (Cohen-Addad et al., 2017; Manghiuc & Sun, 2021) and run independent copies of Algorithm 1 with all possible values k ranging from 1 to k. By adding an extra O(k) = O(1) factor in the overall time complexity, we ensure that one of the runs has the correct number of clusters k = k. C. Omitted Details of Proof of Theorem 5.4 This section presents omitted details of the proof of Theorem 5.4, and is structured as follows: we present a technical overview of our result in Section C.1; we describe our designed algorithm in Section C.2, and prove Theorem 5.4 in Section C.3. We introduce the parameter ηS such that ηS max 1 i k (Si) this will be used in our later discussion. C.1. Overview of Main Techniques We give an overview of our main techniques and discuss why spectral clustering suffices to construct HC trees with good approximation guarantees for well-clustered graphs. One general challenge for analysing the approximation ratio of any HC tree T of G is due to the limited understanding of the structure of an optimal HC tree. 
There are two main approaches to address this: (1) the first approach is to ensure that most cuts induced at the top of the tree T are sparse; as such, a recursive application of the sparsest cut algorithm is employed; (2) the second approach is to reason about the structure of an optimal HC tree assuming that the underlying graphs have a highly symmetric hierarchical structure of clusters (Cohen-Addad et al., 2017), or a clear cluster structure (Manghiuc & Sun, 2021). In particular, the algorithm in (Manghiuc & Sun, 2021) requires that every returned vertex set has high inner conductance. Therefore, taking these into account, the starting point of our approach is to consider the role of high inner-conductance in bounding the cost of an HC tree. Suppose that the optimal clusters S1, . . . , Sk corresponding to ρ(k) are given, and it holds ΦG[Si] Φin for 1 i k. These k clusters could have different sizes, volumes, and degree distributions. We show that one can construct a tree TS, by first constructing trees TG[Si] on every induced subgraph G[Si], and simply concatenating these TG[Si] in a caterpillar style according to the sizes of |Si| (1 i k). Moreover, the cost of our constructed tree TS can be upper bounded with respect to OPTG. See Figure 5 for an illustration of a caterpillar tree, and Lemma C.1 for a formal statement; the proof of Lemma C.1 can be found at the end of the subsection. Figure 5. An illustration of our constructed caterpillar tree from optimal clusters S1, . . . , Sk that satisfy |S1| . . . |Sk|. Lemma C.1. Let G = (V, E, w) be an input graph of n vertices, and G has optimal clusters S1, . . . , Sk corresponding to ρ(k) 1/k and it holds that ΦG[Si] Φin for 1 i k. Given S1, . . . , Sk as part of the input, there is a nearly-linear Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs time algorithm that constructs an HC tree TS such that COSTG(TS) 36 ηS Since Φin = Ω(1) for many well-clustered graphs, Lemma C.1 states that a simple caterpillar construction based on the k optimal clusters achieves an approximation of O(ηS), which is O(1) if the degrees of every cluster are almost-balanced. Moreover, thanks to this caterpillar structure over the Si s and an arbitrary tree construction for every G[Si], we can construct TS and compute its cost in nearly-linear time. However, the above discussion assumes that the optimal clusters are known, which are co-NP hard to find (Blum et al., 1981). As the starting point of our algorithm, we therefore employ the spectral clustering algorithm, expecting that the output clusters P1, . . . , Pk could directly replace S1, . . . , Sk. Unfortunately, it is unclear whether this approach would work for the following two reasons: 1. while the output P1, . . . , Pk of spectral clustering can be applied to approximate the optimal S1, . . . , Sk, it is unknown whether every Pi approximates their corresponding optimal Si with respect to ΦG[Si]; moreover, it remains open whether the inner conductance ΦG[Si] is approximated by ΦG[Pi]. Without the inner conductance guarantee of ΦG[Pi] for all 1 i k, we cannot apply Lemma C.1. 2. while Lemma A.9 shows that the symmetric difference between every Pi and their optimal corresponding Si can be bounded, this approximation guarantee can still increase the ratio between the maximum and minimum degrees in Pi. That is, with spectral clustering alone, we cannot upper bound ηP (which is an upper bound for maxi( (Pi)/δ(Pi))) from ηS. 
Moreover, Lemma A.9 does not provide any guarantees on the difference in size between Si and Pi, which is crucial if we want to construct arbitrary TPi s. We overcome these two bottlenecks through the following high-level ideas: 1. we show that bounded inner conductance of every output Pi isn t needed, and that the approximate Pi s returned from spectral clustering suffice to construct an HC tree with guaranteed approximation. This is a fundamental difference compared with (Manghiuc & Sun, 2021), which requires the inner conductance being constant for every partition set. 2. we introduce a novel bucketing procedure that decomposes Pi s into further subsets, by grouping together vertices of similar degree. These subsets will form the basis of our tree construction. Thanks to the bucketing step, we are able to construct an HC tree in nearly-linear line. Proof of Lemma C.1. Given an input graph G and optimal clusters S1, . . . , Sk such that |S1| . . . |Sk|, we construct the tree TS through the following two steps: first of all, we construct an arbitrary HC tree TSi for every G[Si], 1 i k; secondly, we construct TS by concatenating the trees TS1, . . . , TSk in a caterpillar style, i.e., TS1 is at the top of the tree, and TS2, . . . , TSk are appended inductively. Since we can construct every TG[Si] in an arbitrary manner and the tree TS can be constructed from {TG[Si]}1 i k through sorting the clusters by their sizes, the overall algorithm runs in nearly-linear time. Next, we analyse COSTG(TS). By definition, it holds that COSTG(TS) = e E[G[Si]] cost G(e) + X e e E cost G(e), (44) where e E is the set of edges crossing different clusters. For the first summation of (44), we have that i=1 COSTG[Si](TSi) 18 ηS i=1 OPTG[Si] 18 ηS Φin OPTG. (45) Here, the first inequality follows the fact that the cost of any HC tree of G = (V, E, w) is at most |V | vol(G) (Fact A.1) and subsequently applying Lemma A.8; the second inequality holds because OPTG[Si] is at most OPTG for all 1 i k. Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs For the second summation of (44), it holds that e e E cost G(e) = u Si,v Sj e={u,v} We look at all the edges adjacent to vertices in S1. By construction, it holds that |S1| n/k, and |leaves (TS[u v])| = n for any edge {u, v} satisfying u S1, v V \ S1. Since w(S1, V \ S1) vol(S1) ρ(k), it holds that X u S1,v S1 e={u,v} cost G(e) = n w(S1, V \ S1) n vol(S1) ρ(k) k |S1| ρ(k) vol(S1). We continue with looking at the edges between S2 and V \ (S1 S2). As S2 is the largest cluster among S2, . . . , Sk, it holds that |S2| n |S1| k 1 , which implies that n |S1| < k |S2|. Notice that any edge e = {u, v} with u S2 and v V \ (S1 S2) satisfies that |leaves (TS[u v])| = n |S1|, and as such we have that X u S2,v S1 S2 e={u,v} cost G(e) (n |S1|) w(S2, V \ (S1 S2)) k |S2| ρ(k) vol(S2). Generalising the analysis above, we have that e e E cost G(e) = u Si,v Sj e={u,v} i=1 |Si| vol(Si) i=1 OPTG[Si], (46) where the last inequality follows from Lemma A.8. Combining (44), (45) with (46), we have that COSTG(TS) 36 ηS which proves the statement. C.2. Algorithm Description Now informally describe our algorithm and refer the reader to Algorithm 2 for the formal presentation. The algorithm takes as input a graph G = (V, E, w) with k optimal clusters {Si}k i=1. For simplicity, we assume that the algorithm knows the target number of clusters k and an upper bound ηS such that ηS max 1 i k (Si) and we carefully deal with this assumption at the end of the section. 
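As a quick recall of the degree-bucketing primitive used by Algorithm 1, which is reused below with a different ratio $\beta$, here is a minimal sketch; the reading of the bucket as $\{v : d_u \le d_v < \beta\, d_u\}$, the geometric-interval interpretation of the induced bucketing, and all names are illustrative assumptions.

```python
import math

def bucket_of(P_i, deg, u, beta):
    """B(u): the vertices v in P_i whose degree satisfies d_u <= d_v < beta * d_u."""
    return {v for v in P_i if deg[u] <= deg[v] < beta * deg[u]}

def bucketing_induced_by(P_i, deg, u, beta):
    """The bucketing of P_i induced by u: vertices are grouped by the geometric degree
    interval [d_u * beta^t, d_u * beta^(t+1)) they fall into; index t = 0 is exactly B(u).
    (Boundary behaviour under floating point is ignored in this sketch.)"""
    buckets = {}
    for v in P_i:
        t = math.floor(math.log(deg[v] / deg[u], beta))
        buckets.setdefault(t, set()).add(v)
    return [buckets[t] for t in sorted(buckets)]
```

Different starting vertices generally induce different bucketings; the representative used to fix the bucketing of each $P_i$ is discussed next.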
C.2. Algorithm Description

We now informally describe our algorithm, and refer the reader to Algorithm 2 for the formal presentation. The algorithm takes as input a graph G = (V, E, w) with k optimal clusters {Si}_{i=1}^k. For simplicity, we assume that the algorithm knows the target number of clusters k and an upper bound ηS such that ηS ≥ max_{1≤i≤k} Δ(Si)/δ(Si), and we carefully deal with this assumption at the end of the section.

As the first step, the algorithm runs the Spectral Clustering subroutine and obtains the approximate clusters {Pi}_{i=1}^k. Following our previous discussion, it is unclear whether the algorithm can use the clusters Pi directly in order to construct an O(1)-approximate tree. To overcome this difficulty, we further partition every Pi into degree-based buckets, but rather than setting β = 2k(γ+1) as we did for Algorithm 1, we set β = ηS. Recall that, for any set Pi and every vertex u ∈ Pi, the bucket of Pi starting at u is defined as

B(u) = {v ∈ Pi : du ≤ dv < ηS · du}.

Figure 6. This figure illustrates two different bucketings of the vertices inside some Pi. Figure (a) is the bucketing induced by some vertex u. Figure (b) is the bucketing induced by u_i^*, for which vol(B(u_i^*)) is maximised.

Again, it is important to note that the bucket B(u) contains every vertex v ∈ Pi whose degree satisfies du ≤ dv < ηS · du. We refer the reader to Figure 6(a) for an illustration of the bucketing procedure. Similar to before, vertices u, v ∈ Pi having different degrees du ≠ dv generally induce different bucketings. In order to determine the appropriate bucketing, Algorithm 2 chooses as representative a vertex u_i^* ∈ Pi whose induced bucket B(u_i^*) has the highest volume (Line 8 of Algorithm 2). To motivate this choice, notice that every cluster Pi has a large overlap with its corresponding optimal cluster Si, and the degrees of the vertices inside Si are within a factor of ηS of each other. If an arbitrary vertex u ∈ Pi is chosen as representative, the bucketing induced by u might split the vertices in Pi ∩ Si roughly equally into two consecutive buckets, which is undesirable. However, we will prove that our specific choice of u_i^* guarantees that the bucketing B_{u_i^*} contains one bucket Bj ∈ B_{u_i^*} that largely overlaps with Si (see Figure 6(b) for an illustration).

Once the bucketing procedure of Pi is complete, the algorithm proceeds and constructs an arbitrary balanced binary tree TBj for every bucket Bj, and adds the tree TBj to a global set of trees T. This process is repeated for every cluster Pi. As a final step, the algorithm merges the trees inside T in a caterpillar fashion in decreasing order of their sizes, placing the larger trees closer to the root (Lines 16-18 of Algorithm 2). The resulting tree TSCB is the output of the algorithm.
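To illustrate the bucketing step just described, the following sketch (our own, with hypothetical names; the weighted degrees and a single cluster Pi are assumed to be given) selects the representative u_i^* whose bucket has maximum volume and returns the bucketing it induces, taken here to be the partition of Pi into consecutive degree intervals anchored at d_{u_i^*}, as in Figure 6:

    # A sketch of the degree-bucketing step (our own illustration, not Algorithm 2
    # verbatim): `degree` maps vertices to weighted degrees, `P_i` is one cluster
    # returned by spectral clustering, and `eta_S` is the given upper bound.
    import math
    from collections import defaultdict

    def bucketing(P_i, degree, eta_S):
        """Return the buckets of P_i induced by the representative u_i*.

        The bucket starting at u is B(u) = {v in P_i : d_u <= d_v < eta_S * d_u};
        the bucketing induced by u splits P_i into consecutive degree intervals
        [d_u * eta_S**j, d_u * eta_S**(j + 1)).  The representative u_i* is the
        vertex whose own bucket has the largest volume (sum of degrees).
        """
        def bucket_of(u):
            return [v for v in P_i if degree[u] <= degree[v] < eta_S * degree[u]]

        # u_i* = argmax_u vol(B(u)); a naive quadratic scan, kept simple for clarity
        u_star = max(P_i, key=lambda u: sum(degree[v] for v in bucket_of(u)))

        # group every vertex by the index j of its degree interval relative to d_{u_i*}
        buckets = defaultdict(list)
        for v in P_i:
            j = math.floor(math.log(degree[v] / degree[u_star], eta_S))
            buckets[j].append(v)
        return list(buckets.values())

    # toy usage: two degree scales, eta_S = 2 separates them into two buckets
    deg = {0: 1.0, 1: 1.2, 2: 1.5, 3: 4.0, 4: 4.5}
    print(bucketing([0, 1, 2, 3, 4], deg, eta_S=2.0))   # [[0, 1, 2], [3, 4]]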
C.3. Algorithm Analysis

This subsection analyses the performance of Algorithm 2, and proves Theorem 5.4. It is organised as follows: in the first part we prove several properties of our carefully constructed bucketing procedure; in the second part, we prove that the cost of our constructed tree TSCB achieves the claimed approximation guarantee; thirdly, we give a detailed description of how we can efficiently construct the tree TSCB and compute its cost in nearly-linear time. The proof of Theorem 5.4 follows by combining the three main results.

We first prove that our bucketing procedure induces at most 2k buckets inside every cluster Pi, and that the total number of buckets throughout all clusters Pi is at most 2k^2.

Lemma C.2. Let Pi be an arbitrary cluster, and u_i^* = argmax_{u ∈ Pi} vol(B(u)) be a vertex inducing the bucketing B_{u_i^*}. The following statements hold:

1. B(u_i^*) ∈ B_{u_i^*};
2. For every optimal cluster Sj, the vertices in Sj ∩ Pi belong to at most two buckets, i.e., |{Bt ∈ B_{u_i^*} : Sj ∩ Bt ≠ ∅}| ≤ 2;
3. The bucketing induced by u_i^* contains at most 2k buckets, i.e., |B_{u_i^*}| ≤ 2k.

Proof. The first property follows trivially from the definition. The second property follows from our assumption that Δ(Si) ≤ ηS · δ(Si) for every optimal cluster Si; therefore, the vertices inside Sj ∩ Pi belong to at most two buckets. The third property follows from the second one by summing over all clusters.

Our second result presents a very important property of our bucketing procedure. We show that, in every cluster Pi, exactly one bucket Bj ∈ B_{u_i^*} has a large overlap with the optimal cluster Si, and all the other buckets have low volume. We emphasise that this result is only possible due to our calibration technique of choosing the bucketing induced by a vertex u_i^* ∈ Pi whose bucket B(u_i^*) has the highest volume.

Lemma C.3. Let Pi be an arbitrary cluster, and u_i^* = argmax_{u ∈ Pi} vol(B(u)) be a vertex inducing the bucketing B_{u_i^*}. The following two statements hold:

1. There exists a unique heavy bucket Bj ∈ B_{u_i^*} such that vol(Bj ∩ Si) ≥ (1 − 2k·CA.9/(3Υ(k))) · vol(Si);
2. For every other bucket Bt ∈ B_{u_i^*} \ {Bj} it holds that vol(Bt) ≤ (k·CA.9/Υ(k)) · vol(Si).

Proof. For the first statement, we show that the bucket B(u_i^*) satisfies the claim. Let umin be a vertex in Pi ∩ Si of the lowest degree. Since the degrees of all vertices in Si (and hence in Pi ∩ Si) are within a factor of ηS of each other, we know that all vertices in Pi ∩ Si must be in the bucket B(umin), i.e., Pi ∩ Si ⊆ B(umin). Moreover, by the choice of u_i^* we have that

vol(B(u_i^*)) ≥ vol(B(umin)) ≥ vol(Pi ∩ Si) ≥ vol(Si) − vol(Pi △ Si) ≥ (1 − k·CA.9/(3Υ(k))) · vol(Si),   (47)

where the last inequality follows by Lemma A.9. We now write

vol(B(u_i^*) ∩ Si) = vol(B(u_i^*)) − vol(B(u_i^*) \ Si) ≥ (1 − k·CA.9/(3Υ(k))) · vol(Si) − vol(Pi \ Si) ≥ (1 − 2k·CA.9/(3Υ(k))) · vol(Si),

where the first inequality uses (47) together with B(u_i^*) \ Si ⊆ Pi \ Si, and the second follows by Lemma A.9. The uniqueness of the heavy bucket follows by noticing that vol(B(u_i^*) ∩ Si) > vol(Si)/2.

The second statement essentially follows from the fact that Si has a large volume overlap with B(u_i^*). Concretely, let Bt ∈ B_{u_i^*} \ {B(u_i^*)} be an arbitrary bucket. We have that

vol(Bt) = vol(Bt ∩ Si) + vol(Bt \ Si) ≤ vol(Si) − vol(Si \ Bt) + vol(Pi \ Si) ≤ vol(Si) − vol(Si ∩ B(u_i^*)) + vol(Pi △ Si) ≤ (2k·CA.9/(3Υ(k))) · vol(Si) + (k·CA.9/(3Υ(k))) · vol(Si) ≤ (k·CA.9/Υ(k)) · vol(Si).

This completes the proof.

In order to analyse the final constructed tree TSCB, we analyse the properties of the buckets in the context of the entire graph G, and not just on every set Pi. We denote by B the set containing all the buckets obtained throughout all sets Pi, i.e., B = ∪_{1≤i≤k} B_{u_i^*}, where u_i^* = argmax_{u ∈ Pi} vol(B(u)) is the vertex inducing the corresponding bucketing in set Pi. We remark that B is a partition of V. We use ℓ = |B| to denote the total number of buckets, and we know by Lemma C.2 that ℓ ≤ 2k^2. For convenience, we label the buckets B = {B1, . . . , Bℓ} in decreasing order of their sizes, breaking ties arbitrarily, i.e., |Bi| ≥ |Bi+1| for all i < ℓ.

Based on Lemma C.3, exactly k buckets in B are heavy, meaning that they have a large overlap with the optimal clusters Si. We denote by BH the set of those k buckets; we emphasise once more that each heavy bucket corresponds to a unique cluster Si and vice versa. Additionally, we denote by BL = B \ BH the set of the remaining light buckets. The next two results summarise the main properties of our constructed buckets. We first present three properties of the heavy buckets, which bound their size, their volume, and the weight of their crossing edges.
Lemma C.4. For every heavy bucket Bj ∈ BH corresponding to Si, the following statements hold:

(B1) vol(Bj) ≤ (1 + k·CA.9/(3Υ(k))) · vol(Si);
(B2) |Bj| ≤ (1 + ηS) · |Si|;
(B3) w(Bj, V \ Bj) ≤ ((k·CA.9 + λ_{k+1})/Υ(k)) · vol(Si).

Proof. Let Bj ∈ BH be a heavy bucket corresponding to Si. For (B1), we have that

vol(Bj) = vol(Bj ∩ Si) + vol(Bj \ Si) ≤ vol(Si) + vol(Pi △ Si) ≤ (1 + k·CA.9/(3Υ(k))) · vol(Si).

To prove (B2), suppose for contradiction that |Bj| > (1 + ηS) · |Si|, which implies that |Bj \ Si| = |Bj| − |Bj ∩ Si| > ηS · |Bj ∩ Si|. Based on this, we have that

vol(Bj \ Si) ≥ |Bj \ Si| · δ(Bj \ Si) > ηS · |Bj ∩ Si| · δ(Bj \ Si) ≥ |Bj ∩ Si| · Δ(Bj ∩ Si) ≥ vol(Bj ∩ Si) ≥ (1 − 2k·CA.9/(3Υ(k))) · vol(Si),

where the third inequality follows from the fact that, inside bucket Bj, all the degrees are within a factor of ηS of each other, and the last inequality follows by Lemma C.3. Since vol(Bj \ Si) ≤ vol(Pi △ Si), this implies that

vol(Si) < (1 − 2k·CA.9/(3Υ(k)))^{−1} · vol(Pi △ Si) ≤ (k·CA.9/(3Υ(k) − 2k·CA.9)) · vol(Si) ≤ vol(Si)/2,

where the second inequality follows by Lemma A.9, and the last inequality by our assumption on Υ(k); this is a contradiction, and hence (B2) holds.

Finally, we prove (B3). We have that

w(Bj, V \ Bj) ≤ w(Bj ∩ Si, V \ Bj) + w(Bj \ Si, V \ Bj)
 ≤ w(Bj ∩ Si, Si \ Bj) + w(Bj ∩ Si, V \ Si) + vol(Bj \ Si)
 ≤ vol(Si \ Bj) + w(Si, V \ Si) + vol(Pi △ Si)
 ≤ (2k·CA.9/(3Υ(k))) · vol(Si) + ρ(k) · vol(Si) + (k·CA.9/(3Υ(k))) · vol(Si)   (48)
 ≤ ((k·CA.9 + λ_{k+1})/Υ(k)) · vol(Si),   (49)

where (48) follows from Lemmas C.3 and A.9, and (49) uses the fact that ρ(k) = λ_{k+1}/Υ(k).

The final technical result is similar to Lemma C.4, and presents several useful upper bounds on the volume and size of the light buckets.

Lemma C.5. For every light bucket Bt ∈ BL, the following statements hold:

(B4) vol(Bt ∩ Sj) ≤ ηS · vol(Bt ∩ Si), for all clusters Si, Sj satisfying |Bt ∩ Sj| ≤ |Bt ∩ Si|;
(B5) vol(Bt ∩ Si) ≤ (2k·CA.9/(3Υ(k))) · vol(Si), for all clusters Si;
(B6) Σ_{Bt ∈ BL} |Bt| · vol(Bt) ≤ (4k^2·CA.9·ηS/(3Υ(k))) · Σ_{i=1}^k |Si| · vol(Si).

Proof. Let Bt ∈ BL be a light bucket, and let Si, Sj be optimal clusters so that |Bt ∩ Sj| ≤ |Bt ∩ Si|. It holds that

vol(Bt ∩ Sj) ≤ |Bt ∩ Sj| · Δ(Bt ∩ Sj) ≤ |Bt ∩ Sj| · ηS · δ(Bt ∩ Si) ≤ ηS · vol(Bt ∩ Si),

where the second inequality uses the fact that inside bucket Bt all the degrees are within a factor of ηS of each other. This proves (B4).

Next, we prove (B5). Suppose that Bt is a bucket from some cluster Pr, and let Si be an arbitrary cluster. The proof continues by a case distinction:

Case 1: r = i. In this case, since Bt is a light bucket in Pr, by Lemma C.3 we have that vol(Bt ∩ Si) = vol(Bt ∩ Sr) ≤ (2k·CA.9/(3Υ(k))) · vol(Sr).

Case 2: r ≠ i. In this case we simply have that vol(Bt ∩ Si) ≤ vol(Pr ∩ Si) ≤ vol(Pi △ Si) ≤ (k·CA.9/(3Υ(k))) · vol(Si).

Combining the two cases proves (B5). We finally prove (B6). We have that

Σ_{Bt ∈ BL} |Bt| · vol(Bt) = Σ_{Bt ∈ BL} Σ_{i,j=1}^k |Bt ∩ Si| · vol(Bt ∩ Sj)
 = Σ_{Bt ∈ BL} ( Σ_{i,j: |Bt ∩ Si| ≤ |Bt ∩ Sj|} |Bt ∩ Si| · vol(Bt ∩ Sj) + Σ_{i,j: |Bt ∩ Si| > |Bt ∩ Sj|} |Bt ∩ Si| · vol(Bt ∩ Sj) )   (50)
 ≤ Σ_{Bt ∈ BL} ( Σ_{i,j: |Bt ∩ Si| ≤ |Bt ∩ Sj|} |Bt ∩ Sj| · vol(Bt ∩ Sj) + Σ_{i,j: |Bt ∩ Si| > |Bt ∩ Sj|} |Bt ∩ Si| · ηS · vol(Bt ∩ Si) )   (51)
 ≤ Σ_{Bt ∈ BL} ( k · Σ_{j=1}^k |Bt ∩ Sj| · vol(Bt ∩ Sj) + k · Σ_{i=1}^k |Bt ∩ Si| · ηS · vol(Bt ∩ Si) )
 ≤ 2k · ηS · Σ_{i=1}^k Σ_{Bt ∈ BL} |Bt ∩ Si| · vol(Bt ∩ Si)
 ≤ 2k · ηS · (2k·CA.9/(3Υ(k))) · Σ_{i=1}^k Σ_{Bt ∈ BL} |Bt ∩ Si| · vol(Si)   (52)
 ≤ (4k^2·CA.9·ηS/(3Υ(k))) · Σ_{i=1}^k |Si| · vol(Si),

where (50) follows by grouping the terms |Bt ∩ Si| · vol(Bt ∩ Sj) into two categories, (51) follows by (B4), and (52) follows by (B5).

Now we combine the properties shown earlier, and upper bound the cost of our constructed tree TSCB through the following lemma.

Lemma C.6. It holds that COSTG(TSCB) ≤ (1 + 5k^4·CA.9/Υ(k)) · (36 ηS^2/Φin) · OPTG.

Proof. By construction, we know that TSCB is the caterpillar tree formed by merging the trees TBj, for all Bj ∈ B.
Therefore, in order to bound the overall cost it suffices to bound the cost within each induced subtree TBj, as well as the cost of the edges crossing different buckets. Formally, we have that

COSTG(TSCB) = Σ_{j=1}^ℓ COST_{G[Bj]}(TBj) + Σ_{j<t} Σ_{e ∈ E(Bj,Bt)} costG(e) ≤ Σ_{j=1}^ℓ |Bj| · vol(Bj) + Σ_{j<t} Σ_{e ∈ E(Bj,Bt)} costG(e),   (53)

where the inequality follows from Fact A.1. We study the first term of (53), and have that

Σ_{j=1}^ℓ |Bj| · vol(Bj) = Σ_{Bj ∈ BH} |Bj| · vol(Bj) + Σ_{Bt ∈ BL} |Bt| · vol(Bt)
 ≤ (1 + ηS) · (1 + k·CA.9/(3Υ(k))) · Σ_{i=1}^k |Si| · vol(Si) + (4k^2·CA.9·ηS/(3Υ(k))) · Σ_{i=1}^k |Si| · vol(Si)   (54)
 ≤ (2ηS · (1 + k·CA.9/(3Υ(k))) + 4k^2·CA.9·ηS/(3Υ(k))) · Σ_{i=1}^k |Si| · vol(Si)
 ≤ (2ηS + 2k^2·CA.9·ηS/Υ(k)) · (18 ηS/Φin) · OPTG   (55)
 = (1 + k^2·CA.9/Υ(k)) · (36 ηS^2/Φin) · OPTG,

where (54) follows from (B1), (B2) and (B6), and (55) follows by Lemma A.8.

Next we look at the second term of (53). Let Bj ∈ B be an arbitrary bucket, and we study the cost of the edges between Bj and all the other buckets Bt positioned lower in the tree, i.e., those with |Bt| ≤ |Bj|. By construction, for every such edge e we have that

costG(e) ≤ w_e · Σ_{t=j}^ℓ |Bt| ≤ w_e · ℓ · |Bj|.

Therefore, we can upper bound the total cost of the crossing edges adjacent to Bj by

Σ_{t>j} Σ_{e ∈ E(Bj,Bt)} costG(e) ≤ ℓ · |Bj| · w(Bj, V \ Bj).

Hence, by summing over all buckets Bj we have that

Σ_{j<t} Σ_{e ∈ E(Bj,Bt)} costG(e) ≤ Σ_{j=1}^ℓ ℓ · |Bj| · w(Bj, V \ Bj)
 ≤ ℓ · ( Σ_{Bj ∈ BH} |Bj| · w(Bj, V \ Bj) + Σ_{Bt ∈ BL} |Bt| · vol(Bt) )
 ≤ ℓ · ( (1 + ηS) · ((k·CA.9 + λ_{k+1})/Υ(k)) + 4k^2·CA.9·ηS/(3Υ(k)) ) · Σ_{i=1}^k |Si| · vol(Si)   (56)
 ≤ ℓ · ( 2ηS · 4k·CA.9/(3Υ(k)) + 4k^2·CA.9·ηS/(3Υ(k)) ) · Σ_{i=1}^k |Si| · vol(Si)
 ≤ ℓ · (4k^2·CA.9·ηS/Υ(k)) · (18 ηS/Φin) · OPTG   (57)
 ≤ (4k^4·CA.9/Υ(k)) · (36 ηS^2/Φin) · OPTG,

where (56) follows by (B2) and (B3) for the heavy buckets and (B6) for the light ones, (57) follows by Lemma A.8, and the last inequality uses ℓ ≤ 2k^2.

Finally, combining (53), (55) and (57) we conclude that

COSTG(TSCB) ≤ (1 + k^2·CA.9/Υ(k)) · (36 ηS^2/Φin) · OPTG + (4k^4·CA.9/Υ(k)) · (36 ηS^2/Φin) · OPTG ≤ (1 + 5k^4·CA.9/Υ(k)) · (36 ηS^2/Φin) · OPTG,

which completes the proof.

Finally, the following theorem summarises the performance of Algorithm 2.

Theorem C.7 (Formal Statement of Theorem 5.4). Let G = (V, E, w) be a graph of n vertices and m edges, with optimal clusters {Si}_{i=1}^k corresponding to ρ(k) satisfying k = O(polylog(n)), Φ_{G[Si]} ≥ Φin for 1 ≤ i ≤ k, and Υ(k) ≥ CA.9 · k. Then, there is an Õ(m) time algorithm that returns both an HC tree TSCB of G and COSTG(TSCB). Moreover, it holds that

COSTG(TSCB) ≤ (1 + 5k^4·CA.9/Υ(k)) · (36 ηS^2/Φin) · OPTG,

where ηS is an upper bound of max_i (Δ(Si)/δ(Si)).

Proof. The approximation guarantee of the returned tree TSCB from Algorithm 2 follows from Lemma C.6. By a similar proof to that of Lemma B.9, the time complexity of Algorithm 2 is Õ(m) and the depth of TSCB is O(polylog n). Therefore, by Lemma B.8, we can compute COSTG(TSCB) in nearly-linear time Õ(m).

It remains to deal with our assumption that the algorithm knows the number of clusters k and an upper bound ηS ≥ max_{1≤i≤k} Δ(Si)/δ(Si). If the number of clusters k is unknown, we apply the technique described before and run independent copies of Algorithm 2 with all possible values k′ ranging from 1 to O(polylog(n)). By adding an extra O(polylog n) = Õ(1) factor to the overall time complexity, we ensure that one of the runs has the correct number of clusters k′ = k.

Finally, in order to obtain an approximation of ηS, we additionally run O(log n) independent copies of Algorithm 2 with different values η̃S = 2^i, for all i ∈ [log δG, log ΔG]. This process ensures that at least one such estimate satisfies η̃S/2 ≤ ηS ≤ η̃S.
At the expense of an additional O(log n) factor in the overall running time, this ensures that at least one of the constructed trees of Algorithm 2 satisfies the promised approximation ratio, up to an additional factor of (η̃S/ηS)^2 ≤ 4. As a result, we return the tree with the minimum cost among all of the different runs of our algorithm.
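The parameter-guessing loop described in this proof can be summarised by the following sketch (our own illustration; algorithm_2, min_degree, max_degree and polylog_bound are hypothetical placeholders, and the guess range for η̃S is simplified to powers of two covering the possible degree ratios):

    # A sketch of the outer loop that removes the assumption that k and eta_S are known
    # (our own illustration): `algorithm_2(G, k, eta)` is assumed to return (tree, cost)
    # as in Theorem C.7; `min_degree`, `max_degree` and `polylog_bound` are placeholders.
    import math

    def best_tree(G, algorithm_2, min_degree, max_degree, polylog_bound):
        best_pair = (None, math.inf)
        k_max = polylog_bound(G)                      # k is assumed to be O(polylog n)
        # guesses 2, 4, 8, ... cover the possible degree ratios up to Delta_G / delta_G
        i_max = max(1, math.ceil(math.log2(max_degree(G) / min_degree(G))))
        for k_guess in range(1, k_max + 1):
            for i in range(1, i_max + 1):
                tree, cost = algorithm_2(G, k_guess, 2.0 ** i)
                if cost < best_pair[1]:
                    best_pair = (tree, cost)
        # the run with k_guess = k and eta_S <= 2**i < 2 * eta_S is among those tried
        return best_pair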