# Hierarchical Clustering: O(1)-Approximation for Well-Clustered Graphs

Bogdan-Adrian Manghiuc, School of Informatics, The University of Edinburgh, b.a.manghiuc@sms.ed.ac.uk
He Sun, School of Informatics, The University of Edinburgh, h.sun@ed.ac.uk

35th Conference on Neural Information Processing Systems (NeurIPS 2021). The full version of the paper is available at https://arxiv.org/abs/2112.09055.

Hierarchical clustering studies a recursive partition of a data set into clusters of successively smaller size, and is a fundamental problem in data analysis. In this work we study the cost function for hierarchical clustering introduced by Dasgupta [12], and present two polynomial-time approximation algorithms. Our first result is an O(1)-approximation algorithm for graphs of high conductance. Our simple construction bypasses the complicated recursive routines for finding sparse cuts known in the literature (e.g., [6, 11]). Our second and main result is an O(1)-approximation algorithm for a wide family of graphs that exhibit a well-defined structure of clusters. This result generalises the previous state-of-the-art [10], which holds only for graphs generated from stochastic models. The significance of our work is demonstrated by the empirical analysis on both synthetic and real-world data sets, on which our presented algorithm outperforms the previously proposed algorithm for graphs with a well-defined cluster structure [10].

## 1 Introduction

Hierarchical clustering (HC) studies a recursive partition of a data set into clusters of successively smaller size, via an effective binary tree representation. As a basic technique, hierarchical clustering is a standard tool in data analysis and has comprehensive applications in practice. While traditionally HC trees are constructed through bottom-up (agglomerative) heuristics, which lack a clearly defined objective function, Dasgupta [12] recently introduced a simple objective function to measure the quality of a particular hierarchical clustering, and his work has inspired a substantial body of research on this topic [3, 6, 7, 8, 10, 11, 20, 24]. Consequently, there has been significant interest in efficient HC algorithms that not only work in practice, but also have proven theoretical guarantees with respect to Dasgupta's cost function.

Our contribution. In this work, we present two new approximation algorithms for constructing HC trees that can be rigorously analysed with respect to Dasgupta's cost function. For our first result, we construct an HC tree of an input graph G entirely based on the degree sequence of V(G), and we show that the approximation guarantee of our constructed tree is expressed with respect to the conductance of G, which will be defined formally in Section 2. The striking aspect of this result is that, for any n-vertex graph G with m edges and conductance Ω(1) (a.k.a. an expander graph), an O(1)-approximate HC tree of G can be very easily constructed in O(m + n log n) time, although obtaining such a result for general graphs is impossible under the Small Set Expansion Hypothesis (SSEH) [6]. Our theorem is in line with a sequence of results for problems that are naturally linked to the Unique Games and Small Set Expansion problems: it has been shown that such problems are much easier to solve once the input instance exhibits the high conductance property [4, 5, 16, 18].
However, to the best of our knowledge, our result is the first of this type for hierarchical clustering.

While our first result presents an interesting theoretical fact, we further study whether we can extend this O(1)-approximate construction to a much wider family of graphs occurring in practice. Specifically, we look at well-clustered graphs, i.e., graphs in which vertices within each cluster are better connected than vertices between different clusters and the total number of clusters is constant. This includes a wide range of graphs occurring in practice with a clear cluster structure, and such graphs have been extensively studied over the past two decades (e.g., [13, 15, 23, 26]). As our second and main result, we present a polynomial-time O(1)-approximation algorithm that constructs an HC tree for a well-clustered graph. Given that the class of well-clustered graphs includes graphs with clusters of different sizes and asymmetrical internal structure, our result significantly improves the previous state-of-the-art [10], which only holds for graphs generated from stochastic models.

At the technical level, the design of our algorithm is based on the graph decomposition algorithm presented in [13], which is designed to find a good partition of a well-clustered graph. However, our analysis suggests that, in order to obtain an O(1)-approximation algorithm, directly applying their decomposition is not sufficient for our purpose. To overcome this bottleneck, we refine the output decomposition via a pruning technique, and carefully merge the refined parts to construct our final HC tree. In our view, our stronger graph decomposition procedure might have applications in other settings as well.

To demonstrate the significance of our work, we compare our algorithm against the previous state-of-the-art with a similar approximation guarantee [10] and well-known linkage heuristics on both synthetic and real-world data sets. Although our algorithm's performance is only marginally better than [10] for graphs generated from the stochastic block model (SBM), the cost of our algorithm's output is up to 50% lower than the one from [10] when the clusters of the input graph have different sizes and some cliques are embedded into a cluster.

Related work. Our work fits in a line of research initiated by Dasgupta [12], who introduced a cost function to measure the quality of an HC tree. Dasgupta proved that a recursive application of an algorithm for the Sparsest Cut problem achieves an $O(\log^{3/2} n)$-approximation, which has subsequently been improved to $O(\log n)$ by [24] and to $O(\sqrt{\log n})$ by [6, 11]. It is known to be NP-hard to find an optimal HC tree [12], and SSEH-hard to achieve an O(1)-approximation for a general input instance [6]. Cohen-Addad et al. [10] studied a hierarchical extension of the SBM and showed that a certain SVD projection algorithm together with several linkage heuristics achieves a (1 + o(1))-approximation with high probability. We emphasise that our notion of well-clustered graphs generalises the SBM variant studied in [10] and does not assume a rigid hierarchical structure of the clusters. For another line of related work, Moseley and Wang [20] studied the dual objective function and proved that Average Linkage achieves a (1/3)-approximation for the new objective function. Notice that, although this alternative objective has received significant attention very recently [3, 7, 8, 9, 25], achieving an O(1)-approximation under it is tractable.
This suggests a fundamental difference in the hardness of the problem under different objective functions, and is our reason to focus entirely on Dasgupta's cost function in this work.

## 2 Preliminaries

Throughout the paper, we always assume that G = (V, E, w) is an undirected graph with n vertices, m edges and weight function $w: V \times V \to \mathbb{R}_{\geq 0}$. For any edge $e = \{u, v\} \in E$ we write $w_e$ or $w_{uv}$ to indicate the similarity weight between u and v. For a vertex $u \in V$, we denote its degree by $d_u \triangleq \sum_{v \in V} w_{uv}$, and we assume that $w_{\max}/w_{\min} = O(\mathrm{poly}(n))$, where $w_{\min}$ ($w_{\max}$) is the minimum (maximum) edge weight. We use $d_{\min}$, $d_{\max}$, $d_{\mathrm{avg}}$ for the minimum, maximum and average degrees in G, where $d_{\mathrm{avg}} \triangleq \sum_{u \in V} d_u / n$. For a nonempty subset $S \subseteq V$, we define G[S] to be the subgraph induced on S, and we denote by G{S} the subgraph G[S] in which self-loops are added to vertices $v \in S$ such that their degrees in G and G{S} are the same. For any two subsets $S, T \subseteq V$, we define the cut value $w(S, T) \triangleq \sum_{e \in E(S,T)} w_e$, where E(S, T) is the set of edges between S and T. For any set $S \subseteq V$, the volume of S is $\mathrm{vol}(S) \triangleq \sum_{u \in S} d_u$, and we write vol(G) when referring to vol(V(G)). We further define the conductance of S by $\Phi_G(S) \triangleq w(S, V \setminus S)/\mathrm{vol}(S)$, and the conductance of G by $\Phi_G \triangleq \min_{S \subset V,\ \mathrm{vol}(S) \leq \mathrm{vol}(G)/2} \Phi_G(S)$. We call G an expander graph if $\Phi_G = \Omega(1)$. For a graph G, let $D \in \mathbb{R}^{n \times n}$ be the diagonal matrix defined by $D_{uu} = d_u$ for all $u \in V$. We denote by $A \in \mathbb{R}^{n \times n}$ the adjacency matrix of G, where $A_{uv} = w_{uv}$ for all $u, v \in V$. The normalised Laplacian matrix of G is defined as $\mathcal{L} \triangleq I - D^{-1/2} A D^{-1/2}$, where I is the $n \times n$ identity matrix. We denote the eigenvalues of $\mathcal{L}$ by $0 = \lambda_1 \leq \cdots \leq \lambda_n \leq 2$.

Hierarchical clustering. A hierarchical clustering (HC) tree of a given graph G is a binary tree T with n leaf nodes such that each leaf corresponds to exactly one vertex $v \in V(G)$. Let T be an HC tree of some graph G = (V, E, w) and let $N \in T$ be an arbitrary internal node of T (we always use (internal) nodes for the nodes of T, and vertices for the elements of the vertex set V). We denote by T[N] the subtree of T rooted at N, by leaves(T[N]) the set of leaf nodes of T[N], and by $\mathrm{parent}_T(N)$ the parent of node N in T. In addition, each internal node $N \in T$ induces a unique vertex set $C \subseteq V$ formed by the vertices corresponding to leaves(T[N]). We abuse notation and write $N \in T$ both for the internal node of T and for its corresponding set of vertices in V. To measure the quality of an HC tree T with similarity weights, Dasgupta [12] introduced the following cost function: $\mathrm{cost}_G(T) \triangleq \sum_{e=\{u,v\} \in E} w_e \cdot |\mathrm{leaves}(T[u \vee v])|$, where $u \vee v$ is the lowest common ancestor of u and v in T. Trees that achieve a better hierarchical clustering have a lower cost, based on the following consideration: for any pair of vertices $u, v \in V$ that corresponds to an edge of high weight $w_{uv}$ (i.e., u and v are highly similar), a good HC tree would separate u and v lower in the tree, which is reflected in a small size $|\mathrm{leaves}(T[u \vee v])|$. We denote by $\mathrm{OPT}_G$ the minimum cost of any HC tree of G, i.e., $\mathrm{OPT}_G = \min_{T} \mathrm{cost}_G(T)$, and use $T^*$ to refer to an optimal tree achieving the minimum. We say that an HC tree T is an α-approximate tree if $\mathrm{cost}_G(T) \leq \alpha \cdot \mathrm{OPT}_G$. All omitted proofs are deferred to the full version of this paper.
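To make the objective concrete, the following is a minimal sketch (ours, not part of the paper) of evaluating Dasgupta's cost for a given HC tree; the nested-tuple tree representation and the function names are illustrative choices.

```python
# Sketch: evaluating Dasgupta's cost for a given HC tree.
# A tree is a nested pair (left, right); a leaf is a vertex id.
# The graph is a dict mapping frozenset({u, v}) -> edge weight w_uv.

def leaves(tree):
    """Return the set of vertices (leaves) below a node of the HC tree."""
    if not isinstance(tree, tuple):
        return {tree}
    left, right = tree
    return leaves(left) | leaves(right)

def dasgupta_cost(tree, weights):
    """Sum over edges {u, v} of w_uv * |leaves(T[u v])|, where u v is the
    lowest common ancestor of u and v in the tree."""
    if not isinstance(tree, tuple):
        return 0.0
    left, right = tree
    left_leaves, right_leaves = leaves(left), leaves(right)
    size = len(left_leaves) + len(right_leaves)
    # Edges whose lowest common ancestor is the current node are exactly
    # those with one endpoint on each side of the split.
    crossing = sum(w for e, w in weights.items()
                   if len(e & left_leaves) == 1 and len(e & right_leaves) == 1)
    return crossing * size + dasgupta_cost(left, weights) + dasgupta_cost(right, weights)

# Example: a 4-cycle with unit weights and a balanced tree.
w = {frozenset(e): 1.0 for e in [(0, 1), (1, 2), (2, 3), (3, 0)]}
print(dasgupta_cost(((0, 1), (2, 3)), w))  # 2 + 2 + 4 + 4 = 12.0
```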
## 3 Hierarchical clustering for graphs of high conductance

In this section we study hierarchical clustering for graphs with high conductance, and prove that, for any input graph G with $\Phi_G = \Omega(1)$, an O(1)-approximate HC tree of G can be constructed simply from the degree sequence of G. As a starting point, we show that $\mathrm{cost}_G(T)$ for any T can be upper bounded with respect to $\Phi_G$ and the degree distribution of V(G).

Lemma 3.1. It holds for any HC tree T of graph G that $\mathrm{cost}_G(T) \leq \frac{9}{4\Phi_G} \cdot \min\left\{ \frac{d_{\mathrm{avg}}}{d_{\min}}, \frac{d_{\max}}{d_{\mathrm{avg}}} \right\} \cdot \mathrm{OPT}_G$.

While Lemma 3.1 holds for any graph G, it implies some interesting facts for expander graphs: first of all, when G = (V, E, w) satisfies $d_{\max}/d_{\min} = O(1)$ and $\Phi_G = \Omega(1)$, Lemma 3.1 shows that any tree T is an O(1)-approximate tree. In addition, although $\Phi_G$ plays a crucial role in analysing $\mathrm{cost}_G(T)$, as for many other graph problems, Lemma 3.1 indicates that the degree distribution of G might also have a significant impact. One could naturally ask the extent to which the degree distribution of V(G) influences the construction of optimal trees. To study this question, we look at the following graph G, in which all edges have unit weight: (i) let $G_1 = (V, E_1)$ be a constant-degree expander graph on n vertices with $\Phi_{G_1} = \Omega(1)$, e.g., the ones presented in [14]; (ii) we choose $n^{2/3}$ vertices from V to form S, and let $K = (S, S \times S)$ be a complete graph defined on S; (iii) we partition the vertices of $V \setminus S$ into $n^{2/3}$ groups of roughly the same size, associate each group with a unique vertex in S, and let $E_2$ be the set of edges formed by connecting every vertex in S with all the vertices in its associated group; (iv) let $G \triangleq (V, E_1 \cup (S \times S) \cup E_2)$ be the union of $G_1$, K and the edges in $E_2$; see Figure 1(a) for an illustration.

Figure 1: (a) Our considered graph G; (b) the tree that separates the vertices of high degrees from the others achieves an O(1)-approximation.

By construction, we know that $\Phi_G = \Omega(1)$, and the degrees of G satisfy $d_{\max} = \Theta(n^{2/3})$, $d_{\min} = \Theta(1)$, and $d_{\mathrm{avg}} = \Theta(n^{1/3})$. Therefore, the ratio between $\mathrm{cost}_G(T)$ for any HC tree T and $\mathrm{OPT}_G$ could be as high as $\Theta(n^{1/3})$. On the other hand, it is not difficult to show that the tree T illustrated in Figure 1(b), which first separates the set S of high-degree vertices from $V \setminus S$ at the top of the tree, actually O(1)-approximates $\mathrm{OPT}_G$. (To see this, notice that $\mathrm{cost}(T) \leq \mathrm{cost}(T[S]) + n \cdot \mathrm{vol}(G_1) + n^2 = \Theta(n^2)$, as the complete subgraph G[S] induces a cost of $\Theta((n^{2/3})^3)$ [12]. The subgraph G[S] also implies that $\mathrm{OPT}_G = \Omega(n^2)$.) This example suggests that grouping vertices of similar degrees first could potentially help reduce $\mathrm{cost}_G(T)$ for our constructed T. This motivates us to design the following Algorithm 1 to construct an HC tree. We highlight that the output of Algorithm 1 is uniquely determined by the ordering of the vertices of G according to their degrees, which can be computed in O(n log n) time.

Algorithm 1: HCwithDegrees(G{V})
Input: G = (V, E, w) with the vertices ordered such that $d_{v_1} \geq \cdots \geq d_{v_{|V|}}$;
Output: An HC tree $T_{\deg}(G)$;
1  if |V| = 1 then
2      return the single vertex in V as the tree;
3  else
4      $i_{\max} := \lfloor \log_2(|V| - 1) \rfloor$; $r := 2^{i_{\max}}$; $A := \{v_1, \ldots, v_r\}$; $B := V \setminus A$;
5      Let $T_1$ = HCwithDegrees(G{A}); $T_2$ = HCwithDegrees(G{B});
6      return $T_{\deg}$ with $T_1$ and $T_2$ as the two children.

Theorem 1. Given any graph G = (V, E, w) with conductance $\Phi_G$ as input, Algorithm 1 runs in O(m + n log n) time, and returns an HC tree $T_{\deg}$ of G that satisfies $\mathrm{cost}_G(T_{\deg}) = O(1/\Phi_G^4) \cdot \mathrm{OPT}_G$.

Theorem 1 guarantees that, when the input G satisfies $\Phi_G = \Omega(1)$, the output $T_{\deg}$ of Algorithm 1 achieves an O(1)-approximation. Moreover, as the high-conductance property can be determined in nearly-linear time by computing $\lambda_2(\mathcal{L}_G)$ and applying the Cheeger inequality, Algorithm 1 presents a very simple construction of an O(1)-approximate HC tree once the input G is known to have high conductance.
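Since Algorithm 1 only uses the degree ordering, it translates directly into a few lines of Python. The sketch below (ours, not the authors' implementation) reuses the nested-tuple tree representation from the earlier sketch and assumes the weighted degrees are computed upfront.

```python
import math

def hc_with_degrees(vertices):
    """Algorithm 1 (HCwithDegrees): build an HC tree from a list of vertices
    already sorted in non-increasing order of degree. Returns a nested-tuple
    binary tree whose leaves are the vertices."""
    if len(vertices) == 1:
        return vertices[0]
    i_max = int(math.floor(math.log2(len(vertices) - 1)))
    r = 2 ** i_max                      # split off the largest power of two
    a, b = vertices[:r], vertices[r:]   # A keeps the r highest-degree vertices
    return (hc_with_degrees(a), hc_with_degrees(b))

def hc_tree_of_graph(adj):
    """adj: dict mapping vertex -> dict of neighbour -> weight."""
    degree = {u: sum(nbrs.values()) for u, nbrs in adj.items()}
    order = sorted(adj, key=degree.get, reverse=True)
    return hc_with_degrees(order)
```

Note that the recursion never has to recompute degrees: G{A} and G{B} preserve the degrees of G by construction, so one global sort of V suffices.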
Proof sketch of Theorem 1. The key notion employed in our proof is that of the dense branch, which can be informally described as follows: we perform a traversal of T starting at the root node $A_0$ and sequentially travel to the child of higher volume. The process stops whenever we reach a node $A_k$, for some $k \in \mathbb{Z}_{\geq 0}$, such that $\mathrm{vol}(A_k) > \mathrm{vol}(G)/2$ and both its children have volume at most $\mathrm{vol}(G)/2$. The sequence of nodes visited by this process is the dense branch of T. Formally, for any HC tree T of G, the dense branch is the path $(A_0, A_1, \ldots, A_k)$ in T, for some $k \in \mathbb{Z}_{\geq 0}$, such that the following conditions hold: (1) $A_0$ is the root of T; (2) $A_k$ is the node such that $\mathrm{vol}(A_k) > \mathrm{vol}(G)/2$ and both children of $A_k$ have volume at most $\mathrm{vol}(G)/2$. It is important to note that the dense branch of a tree T is unique, and it contains all nodes $A_i$ such that $\mathrm{vol}(A_i) > \mathrm{vol}(G)/2$ and only those nodes.

At a very high level, the proof of Theorem 1 starts with any optimal HC tree $T_0$ of G and constructs trees $T_1, T_2, T_3$ and $T_4$ such that (i) $\mathrm{cost}_G(T_i)$ can be upper bounded with respect to $\mathrm{cost}_G(T_{i-1})$ for every $1 \leq i \leq 4$, and (ii) the final constructed $T_4$ is exactly the tree $T_{\deg}(G)$, the output of Algorithm 1. Combining these two facts allows us to upper bound $\mathrm{cost}_G(T_{\deg})$ with respect to $\mathrm{OPT}_G$.

Step 1: Regularisation. Let $(A_0, \ldots, A_{k_0})$ be the dense branch of any optimal $T_0$, and let $B_i$ be the sibling of $A_i$, for all $1 \leq i \leq k_0$. Let $i_{\min} = \lfloor \log_2(|A_{k_0}|) \rfloor$ and $i_{\max} = \lfloor \log_2(|V| - 1) \rfloor$. For simplicity, we assume that the dense branch has size at least 2, that $|A_1| \geq 2^{i_{\max}}$, and that $|A_{k_0}| = 2^{i_{\min}}$, which ensures that $i_{\min} \leq i_{\max}$. These assumptions allow us to better present the proof techniques by ignoring some corner cases; in the full version of our paper we deal with the general case when some of these assumptions do not hold.

Figure 2: (a) With a proper partition of $B_{j+1}$, we have a new node $A^2_j$ of size $2^i$; (b) the nodes between $A_{j_1}$ and $A_{j_2}$ (left) have size in $(2^{i-1}, 2^i)$; these nodes are compressed (right), and only the nodes whose size is a power of 2 remain in the dense branch.

The goal of this step is to adjust the dense branch of $T_0$, such that the resulting tree $T_1$ satisfies the following condition: for all $i \in [i_{\min}, i_{\max}]$, there is a node of size exactly $2^i$ along the dense branch of $T_1$. We achieve this by applying a sequence of operations, each of which creates a new node of size exactly $2^i$ for some suitable i. Concretely, let $i \in [i_{\min}, i_{\max}]$ be the largest integer such that there is no node of size $2^i$ on the dense branch of $T_0$. As $|A_{k_0}| = 2^{i_{\min}}$, there is some node $A_j$ on the dense branch such that $|A_j| > 2^i$ and $|A_{j+1}| < 2^i$. We adjust the branch at $A_j$ as follows: (i) we consider a partition $B_{j+1} = B^1_{j+1} \cup B^2_{j+1}$ such that $|A_{j+1}| + |B^2_{j+1}| = 2^i$; (ii) we replace the node $A_j$ by a newly created node $A^1_j$ that has children $B^1_{j+1}$ and a new node $A^2_j$; (iii) the two children of $A^2_j$ will be $A_{j+1}$ and $B^2_{j+1}$. This adjustment is illustrated in Figure 2(a), and we repeat this process until no such i exists any more. We call the resulting tree $T_1$, and the following lemma gives an upper bound on $\mathrm{cost}_G(T_1)$.

Lemma 3.2. Our constructed tree $T_1$ satisfies $\mathrm{cost}_G(T_1) \leq (1 + 4/\Phi_G) \cdot \mathrm{cost}_G(T_0)$.
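For illustration, the dense branch itself is easy to compute with a greedy walk from the root; the sketch below (ours) again uses the nested-tuple tree representation, with a degree dictionary supplying the volumes.

```python
def volume(node, degree):
    """Volume of the vertex set corresponding to a tree node (nested tuples, leaves = vertices)."""
    if not isinstance(node, tuple):
        return degree[node]
    left, right = node
    return volume(left, degree) + volume(right, degree)

def dense_branch(tree, degree):
    """Return the dense branch (A_0, ..., A_k): starting at the root, repeatedly move
    to the child of higher volume, stopping once both children have volume <= vol(G)/2."""
    half = volume(tree, degree) / 2.0
    branch, node = [tree], tree
    while isinstance(node, tuple):
        left, right = node
        vl, vr = volume(left, degree), volume(right, degree)
        if vl <= half and vr <= half:
            break                       # both children are light: the branch ends here
        node = left if vl >= vr else right
        branch.append(node)             # the heavier child has volume > vol(G)/2
    return branch
```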
Step 2: Compression. With a potential relabelling of the nodes, let $(A_0, \ldots, A_{k_1})$ be the dense branch of $T_1$ for some $k_1 \in \mathbb{Z}^+$ satisfying $|A_{k_1}| = 2^{i_{\min}}$, and let $B_i$ be the sibling of $A_i$. The objective of this step is to ensure, by a sequence of adjustments, that all the nodes along the dense branch have size equal to some power of 2. Here, we describe how one such adjustment is performed, and refer the reader to Figure 2(b) for an illustration. Let $i \in [i_{\min}, i_{\max}]$ be some index such that $|A_{j_1}| = 2^i$ and $|A_{j_2}| = 2^{i-1}$ for some $j_1 < j_2$. We compress the dense branch by removing all nodes between $A_{j_1}$ and $A_{j_2}$ as follows: the two children of $A_{j_1}$ will be $A_{j_2}$ and some new node $C_{j_1}$, which has children $C_{j_1+1}$ and $B_{j_1+1}$; the two children of $C_{j_1+1}$ will be some new node $C_{j_1+2}$ and $B_{j_1+2}$, etc. The last node $C_{j_2-2}$ has children $B_{j_2-1}$ and $B_{j_2}$. In addition, we perform one more such adjustment to remove all nodes $A_j$ of size $2^{i_{\max}} < |A_j| < n$, which ensures that all nodes (except potentially $A_0$) on the dense branch have size $2^i$ for some $i \in [i_{\min}, i_{\max}]$. We call the resulting tree $T_2$, and the following lemma bounds the cost of $T_2$.

Lemma 3.3. Our constructed tree $T_2$ satisfies $\mathrm{cost}_G(T_2) \leq 2 \cdot \mathrm{cost}_G(T_1)$.

Step 3: Matching. Let $(A_0, \ldots, A_{k_2})$ be the dense branch of $T_2$, for some $k_2 \in \mathbb{Z}^+$, and let $B_i$ be the sibling of $A_i$. In this step we transform $T_2$ into $T_3$, such that $T_3$ is isomorphic to $T_{\deg}(G)$, which ensures that $T_3$ and $T_{\deg}(G)$ have the same structure. To achieve this, for every $1 \leq i \leq k_2$ we simply replace each $T_2[B_i]$ with $T_{\deg}(G\{B_i\})$. We further replace $T_2[A_{k_2}]$ with $T_{\deg}(G\{A_{k_2}\})$. We call the resulting tree $T_3$, and bound its cost by the following lemma:

Lemma 3.4. Our constructed tree $T_3$ satisfies $\mathrm{cost}_G(T_3) \leq (1 + 4/\Phi_G) \cdot \mathrm{cost}_G(T_2)$.

Step 4: Sorting. We assume that $(A_0, \ldots, A_{k_3})$ is the dense branch of $T_3$ for some $k_3 \in \mathbb{Z}^+$, and we extend the dense branch to $(A_0, \ldots, A_{i_{\max}})$ with the same property that for every $i \in [k_3, i_{\max}]$, $A_i$ is the child of $A_{i-1}$ with the higher volume; we let $S \triangleq \{B_1, \ldots, B_{i_{\max}}, A_{i_{\max}}\}$. Recall that in $T_{\deg}(G)$ the first $r = 2^{i_{\max}}$ vertices of the highest degrees, i.e., $\{v_1, \ldots, v_r\}$, belong to $A_1$, of which the first $2^{i_{\max}-1}$ vertices belong to $A_2$, and so on; however, this might not be the case for $T_3$. Hence, in the final step, we prove that $T_3$ can be transformed into $T_{\deg}(G)$ without a significant increase of the total cost. In this step we swap vertices between the internal nodes $B_1, B_2, \ldots, B_{i_{\max}}, A_{i_{\max}}$ in such a way that $A_{i_{\max}}$ will consist of $v_1$ and $v_2$, $B_{i_{\max}}$ will consist of $v_3$ and $v_4$, $B_{i_{\max}-1}$ will consist of $v_5$ up to $v_8$, etc. We call a vertex u misplaced if the position of u in $T_3$ is different from the one in $T_{\deg}(G)$. To transform $T_3$ into $T_{\deg}(G)$ we perform a sequence of operations, each consisting of a chain of swaps focusing on the vertex of the highest degree that is currently misplaced. For the sake of argument, we assume that $v_1$ is misplaced, i.e., $v_1 \notin A_{i_{\max}}$, and we apply the following operation to move $v_1$ to $A_{i_{\max}}$: (i) let $v_1 \in B_{i_0}$ for some $i_0 \geq 1$, and let y be the vertex of the lowest degree among the vertices belonging to the nodes in $S \setminus \{B_1, \ldots, B_{i_0}\}$; say $y \in B_{i_1}$ for some $i_1 > i_0$, and swap $v_1$ with y while keeping the structure of the tree unchanged; (ii) repeat the swap operation above until $v_1$ reaches its correct place $A_{i_{\max}}$. Once the above process is complete and $v_1$ reaches $A_{i_{\max}}$, we apply a similar chain of swaps to $v_2$ to ensure $v_2$ also reaches $A_{i_{\max}}$. Then, we sequentially apply the process to $v_3$ and $v_4$ to ensure they reach $B_{i_{\max}}$, and continue this process until there are no more misplaced vertices. We call the resulting tree $T'_3$, and notice that every node $X \in S$ in $T'_3$ contains the correct set of vertices.
However, the positions of these vertices in $T'_3[X]$ might be different from the ones in $T_{\deg}(G\{X\})$. To overcome this issue, we apply Step 3 again to the tree $T'_3$, and this introduces another factor of $(1 + 4/\Phi_G)$ to the total cost of the constructed tree. Importantly, the final constructed tree after this step is exactly the tree $T_{\deg}(G)$, and we bound its cost by the following lemma:

Lemma 3.5. It holds for $T_{\deg}$ that $\mathrm{cost}(T_{\deg}) \leq (1 + 24/\Phi_G) \cdot (1 + 4/\Phi_G) \cdot \mathrm{cost}(T_3)$.

Finally, by combining Lemmas 3.2, 3.3, 3.4 and 3.5 we prove the approximation guarantee of $T_{\deg}$ in Theorem 1. The runtime of Algorithm 1 follows by a simple application of the master theorem.

## 4 Hierarchical clustering for well-clustered graphs

So far we have shown that an O(1)-approximate tree can be easily constructed for expander graphs. We now focus on the wider class of well-clustered graphs. Informally, a well-clustered graph is a collection of densely-connected components (clusters) of high conductance, which are weakly interconnected. As these graphs form some of the most meaningful objects for clustering in practice, one would naturally ask whether our O(1)-approximation for expanders can be extended to well-clustered graphs. In this section, we give an affirmative answer to this question.

To formalise the well-clustered property, we consider the notion of $(\Phi_{\mathrm{in}}, \Phi_{\mathrm{out}})$-decomposition introduced by Gharan and Trevisan [13]. Formally, for a graph G = (V, E, w) and $k \in \mathbb{Z}^+$, we say that G has k well-defined clusters if V(G) can be partitioned into disjoint subsets $\{P_i\}_{i=1}^k$ such that the following hold: (i) there is a sparse cut between $P_i$ and $V \setminus P_i$ for any $1 \leq i \leq k$, which is formulated as $\Phi_G(P_i) \leq \Phi_{\mathrm{out}}$; and (ii) each induced subgraph $G[P_i]$ has high conductance, i.e., $\Phi_{G[P_i]} \geq \Phi_{\mathrm{in}}$. We underline that, through the celebrated higher-order Cheeger inequality [17], this condition of $(\Phi_{\mathrm{in}}, \Phi_{\mathrm{out}})$-decomposition can be approximately reduced to other formulations of a well-clustered graph studied in the literature, e.g., [23, 26, 27].

The starting point of our second result is the following polynomial-time algorithm presented by Gharan and Trevisan [13], which produces a $(\Phi_{\mathrm{in}}, \Phi_{\mathrm{out}})$-decomposition of graph G for some parameters $\Phi_{\mathrm{in}}$ and $\Phi_{\mathrm{out}}$. Specifically, given a well-clustered graph as input, their algorithm returns disjoint sets of vertices $\{P_i\}_{i=1}^{\ell}$ with bounded $\Phi(P_i)$ and $\Phi_{G[P_i]}$ for each $P_i$, and the algorithm's performance is as follows:

Lemma 4.1 ([13], Theorem 1.6). Let G = (V, E, w) be a graph such that $\lambda_{k+1} > 0$, for some $k \geq 1$. Then, there is a polynomial-time algorithm that finds an $\ell$-partition $\{P_i\}_{i=1}^{\ell}$ of V, for some $\ell \leq k$, such that the following hold for every $1 \leq i \leq \ell$: (A1) $\Phi(P_i) = O(k^6 \sqrt{\lambda_k})$; (A2) $\Phi_{G[P_i]} = \Omega(\lambda_{k+1}^2 / k^4)$.

Informally, this result states that, when the underlying input graph G presents a clear structure of clusters, one can find in polynomial time a partition $\{P_i\}_{i=1}^{\ell}$ such that both the outer and inner conductance of every $P_i$ can be bounded. One natural question arising from this partition $\{P_i\}_{i=1}^{\ell}$ is whether we can directly use $\{P_i\}_{i=1}^{\ell}$ to construct an HC tree. As an obvious approach, one could consider to (i) construct trees $T_i = T_{\deg}(G[P_i])$ for every $1 \leq i \leq \ell$, and (ii) merge the trees $\{T_i\}$ in the best way to construct the final tree $T_G$. Unfortunately, this direct approach does not work, and we present in the full version of the paper an example where this approach fails to achieve an O(1)-approximation. (The example consists of two copies of the graph in Figure 1(a), connected by a sparse cut.) To address this, we follow the intuition gained from Figure 1, and further decompose every $P_i$ into smaller subsets.
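To make conditions (i) and (ii), and hence (A1) and (A2), concrete, the sketch below (ours) evaluates the outer conductance $\Phi_G(P)$ and, by brute force on very small sets, the inner conductance $\Phi_{G[P]}$ of a candidate cluster; it only illustrates the definitions and is not the partitioning algorithm of Lemma 4.1. The adjacency representation matches the earlier sketches.

```python
from itertools import chain, combinations

def conductance(adj, S, within=None):
    """w(S, complement of S) / vol(S). If `within` is given, the cut and the volume
    are measured inside the induced subgraph G[within]; otherwise in all of G."""
    S, universe = set(S), set(adj) if within is None else set(within)
    cut = sum(w for u in S for v, w in adj[u].items() if v in universe and v not in S)
    vol = sum(w for u in S for v, w in adj[u].items() if v in universe)
    return cut / vol if vol > 0 else 0.0

def inner_conductance_bruteforce(adj, P):
    """Phi_{G[P]}: minimum conductance over all cuts of G[P] with at most half the
    volume. Exponential time; only meant to illustrate the definition on tiny sets."""
    P = list(P)
    vol_P = sum(w for u in P for v, w in adj[u].items() if v in set(P))
    best = 1.0
    for T in chain.from_iterable(combinations(P, r) for r in range(1, len(P))):
        vol_T = sum(w for u in T for v, w in adj[u].items() if v in set(P))
        if 0 < vol_T <= vol_P / 2:
            best = min(best, conductance(adj, T, within=P))
    return best

# Checking condition (i) for a candidate partition {P_i} then reads:
#   all(conductance(adj, P) <= phi_out for P in clusters)
```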
Similarly to the analysis based on dense branches, we introduce the critical nodes associated with each $T_i$.

Definition 4.2 (Critical nodes). Let $T_i = T_{\deg}(G[P_i])$ be the tree computed by applying Algorithm 1 to the induced graph $G[P_i]$. Suppose $(A_0, \ldots, A_{r_i})$ is the dense branch of $T_i$, for some $r_i \in \mathbb{Z}^+$, $B_j$ is the sibling of $A_j$, and let $A_{r_i+1}, B_{r_i+1}$ be the two children of $A_{r_i}$. We define $S_i \triangleq \{B_1, \ldots, B_{r_i+1}, A_{r_i+1}\}$ and call every node $N \in S_i$ a critical node.

We remark that each critical node $N \in S_i$ ($1 \leq i \leq \ell$) is an internal node of maximum size in $T_i$ that is not on the dense branch. Moreover, each $S_i$ is a partition of $P_i$. Based on critical nodes, we present an improved decomposition algorithm, which is similar to the one in Lemma 4.1, and prove that the output quality of our algorithm can be significantly strengthened for hierarchical clustering. Specifically, in addition to satisfying (A1) and (A2), we prove that the total weight between each critical node $N \in S_i$ and all the other clusters $P_j$, for all $i \neq j$, can be upper bounded. We highlight that this is one of the key properties that allows us to obtain our main result, and it also suggests that the original decomposition algorithm in [13] might not suffice for our purpose.

Lemma 4.3 (Strong Decomposition Lemma). Let G = (V, E, w) be a graph such that $\lambda_{k+1} > 0$ and $\lambda_k = O(k^{-12})$. Then, there is a polynomial-time algorithm that finds an $\ell$-partition of V into sets $\{P_i\}_{i=1}^{\ell}$, for some $\ell \leq k$, such that properties (A1) and (A2) hold for every $1 \leq i \leq \ell$. Moreover, for every cluster $P_i$ and every critical node $N \in S_i$, it holds that (A3) $w(N, V \setminus P_i) \leq 6(k + 1) \cdot \mathrm{vol}_{G[P_i]}(N)$.

To underline the importance of (A3), recall that, in general, each subtree $T_i$ cannot be directly used to construct an O(1)-approximate HC tree of G because of the potentially high cost of the crossing edges $E(P_i, V \setminus P_i)$. However, if the internal cost of $T_i$ is high enough to compensate for the cost introduced by the crossing edges $E(P_i, V \setminus P_i)$, then one can safely use this $T_i$ as a building block. This is one of the most crucial insights that leads us to design our final algorithm PruneMerge.

Now we are ready to describe the algorithm PruneMerge, and we refer the reader to Algorithm 2 for the formal presentation. At a high level, our algorithm consists of three phases: Partition, Prune and Merge. In the Partition phase (Lines 1-2), the algorithm invokes Lemma 4.3 to partition V(G) into sets $\{P_i\}_{i=1}^{\ell}$, and applies Algorithm 1 to obtain the corresponding trees $\{T_i\}_{i=1}^{\ell}$. The Prune phase (Lines 4-11) consists of a repeated pruning process: for every such tree $T_i$, the algorithm checks in Line 7 whether the maximal possible cost of the edges $E(P_i, V \setminus P_i)$ (i.e., the LHS of the inequality in the if-condition) can be bounded by the internal cost of the critical nodes $N \in S_i$, up to a factor of O(k). If so, the algorithm uses $T_i$ as a building block and adds it to a global set of trees $\mathcal{T}$; otherwise, the algorithm prunes the subtree $T_i[N]$, where $N \in S_i$ is the critical node closest to the root in $T_i$, and adds $T_i[N]$ to $\mathcal{T}$ (Line 11). The process is repeated with the pruned $T_i$ until either the condition in Line 7 is satisfied, or $T_i$ is completely pruned. Finally, in the Merge phase (Lines 13-15) the algorithm combines the trees in $\mathcal{T}$ in a caterpillar style according to an increasing order of their sizes. The performance of this algorithm is summarised in Theorem 2.
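The Prune and Merge phases can be sketched in a few lines. In the sketch below (ours, not the authors' implementation), cross_w, parent_size and vol_in are caller-supplied dictionaries standing for $w(N, V \setminus P_i)$, $|\mathrm{parent}_{T_i}(N)|$ and $\mathrm{vol}_{G[P_i]}(N)$, and the test mirrors our reading of the condition in Line 7 of Algorithm 2; trees and critical nodes are again nested tuples.

```python
def prune_phase(tree_i, critical, cross_w, parent_size, vol_in, k):
    """Sketch of the Prune phase for one cluster tree T_i. `critical` is the set of
    critical nodes (each itself a subtree). Returns the subtrees added to the pool."""
    pool, S = [], set(critical)
    while S:
        lhs = sum(cross_w[N] for N in S)
        rhs = 6 * (k + 1) * sum(parent_size[N] * vol_in[N] for N in S)
        if lhs <= rhs or tree_i in S:
            # Either the crossing-edge cost is covered by the internal cost, or the
            # remaining tree is itself a single critical node: keep it whole.
            pool.append(tree_i)
            break
        left, right = tree_i            # the critical node closest to the root is a
        pruned = left if left in S else right   # child of the current root (cf. Line 10)
        kept = right if pruned is left else left
        pool.append(pruned)
        S.discard(pruned)
        tree_i = kept
    return pool

def merge_phase(pool, size):
    """Sketch of the Merge phase: combine the collected subtrees in a caterpillar,
    in increasing order of their number of leaves (given by `size`)."""
    pool = sorted(pool, key=size)
    t_pm = pool[0]
    for t in pool[1:]:
        t_pm = (t_pm, t)                # new root with the running tree and t as children
    return t_pm
```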
Theorem 2. Let G = (V, E, w) be a graph, and let k > 1 be such that $\lambda_{k+1} > 0$ and $\lambda_k = O(k^{-12})$. The algorithm PruneMerge runs in polynomial time and constructs an HC tree $T_{\mathrm{PM}}$ of G satisfying $\mathrm{cost}_G(T_{\mathrm{PM}}) = O(k^{22}/\lambda_{k+1}^{10}) \cdot \mathrm{OPT}_G$. In particular, when $\lambda_{k+1} = \Omega(1)$ and $k = O(1)$, the algorithm's constructed tree $T_{\mathrm{PM}}$ satisfies $\mathrm{cost}_G(T_{\mathrm{PM}}) = O(1) \cdot \mathrm{OPT}_G$.

Algorithm 2: PruneMerge(G, k)
Input: A graph G = (V, E, w), a parameter $k \in \mathbb{Z}^+$ such that $\lambda_{k+1} > 0$;
Output: An HC tree $T_{\mathrm{PM}}$ of G;
1  Apply the partitioning algorithm (Lemma 4.3) on input (G, k) to obtain $\{P_i\}_{i=1}^{\ell}$ for some $\ell \leq k$;
2  Let $T_i$ = HCwithDegrees($G[P_i]$);
3  Initialise $\mathcal{T} = \emptyset$;
4  for all clusters $P_i$ do
5      Let $S_i$ be the set of critical nodes of $T_i$;
6      while $S_i$ is nonempty do
7          if $\sum_{N \in S_i} w(N, V \setminus P_i) \leq 6(k+1) \sum_{N \in S_i} |\mathrm{parent}_{T_i}(N)| \cdot \mathrm{vol}_{G[P_i]}(N)$ then
8              Update $\mathcal{T} \leftarrow \mathcal{T} \cup \{T_i\}$ and $S_i \leftarrow \emptyset$;
9          else
10             Let N, M be the two children of the root of $T_i$ such that $N \in S_i$;
11             Update $\mathcal{T} \leftarrow \mathcal{T} \cup \{T_i[N]\}$, $S_i \leftarrow S_i \setminus \{N\}$ and $T_i \leftarrow T_i[M]$;
12 Let $t = |\mathcal{T}|$ and $\mathcal{T} = \{\widetilde{T}_1, \ldots, \widetilde{T}_t\}$ be such that $|\widetilde{T}_i| \leq |\widetilde{T}_{i+1}|$, for all $1 \leq i < t$;
13 Initialise $T_{\mathrm{PM}} = \widetilde{T}_1$;
14 for $i = 2 \ldots t$ do
15     Let $T_{\mathrm{PM}}$ be the tree with $T_{\mathrm{PM}}$ and $\widetilde{T}_i$ as its two children;
16 return $T_{\mathrm{PM}}$.

We remark that, although Algorithm 2 requires a parameter k as input, we can apply the standard technique of running Algorithm 2 for different values of k and returning the tree with the lowest cost. At the expense of a factor of O(k) in the algorithm's runtime, this ensures that one of the trees constructed by Algorithm 2 always satisfies our promised approximation ratio.

## 5 Experiments

We experimentally evaluate the performance of our proposed algorithm, and compare it against three well-known linkage heuristics for computing hierarchical clustering trees as well as different variants of the algorithm proposed in [10], i.e., Linkage++, on both synthetic and real-world data sets. At a high level, Linkage++ consists of the following three steps: (i) project the input data points into a lower-dimensional Euclidean subspace; (ii) run the Single Linkage algorithm [11] until k clusters are left; (iii) run a density-based linkage algorithm on the k clusters until one cluster is left. Specifically, our algorithm PruneMerge will be compared against the following six algorithms:

- Average Linkage, Complete Linkage, and Single Linkage: the three well-known linkage algorithms studied in the literature. We refer the reader to [11] for a complete description.
- Linkage++, PCA+ and Density: the algorithm proposed in [10], together with two variants also studied in [10]. The algorithm PCA+ corresponds to running Steps (i) and (ii) of Linkage++ until one cluster is left (as opposed to k clusters), while Density corresponds to running Steps (i) and (iii) of Linkage++.

All algorithms were implemented in Python 3.8, and the experiments were performed using an Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz processor with 16 GB RAM. All of the reported costs below are averaged over 5 independent runs.

Synthetic data sets. We first compare the performance of our algorithm with the aforementioned algorithms on synthetic data sets.

Clusters of the same size. Our first set of experiments employs input graphs generated according to random stochastic models in which all clusters have the same size. For our first experiment, we look at graphs generated from the standard Stochastic Block Model (SBM). We set the number of clusters as k = 3, and the number of vertices in each cluster $\{P_i\}_{i=1}^3$ as 1,000.
We assume that any pair of vertices within each cluster is connected by an edge with probability p, and any pair of vertices from different clusters is connected by an edge with probability q. We fix the value q = 0.002, and consider different values of $p \in [0.04, 0.2]$. Our experimental results are illustrated in Figure 3(a).

For our second experiment, we consider graphs generated according to a hierarchical stochastic block model (HSBM) [11]. This model assumes the existence of a ground-truth hierarchical structure of the clusters. For the specific choice of parameters, we set the number of clusters as k = 5, and the number of vertices in each cluster $\{P_i\}_{i=1}^5$ as 600. For every pair of vertices $(u, v) \in P_i \times P_j$, we assume that u and v are connected by an edge with probability p if i = j; otherwise u and v are connected by an edge with probability $q_{i,j}$ defined as follows: (i) for all $i \in \{1, 2, 3\}$ and $j \in \{4, 5\}$, $q_{i,j} = q_{j,i} = q_{\min}$; (ii) for $i \in \{1, 2\}$, $q_{i,3} = q_{3,i} = 2q_{\min}$; (iii) $q_{4,5} = q_{5,4} = 2q_{\min}$; (iv) $q_{1,2} = q_{2,1} = 3q_{\min}$. We fix the value $q_{\min} = 0.0005$ and consider different values of $p \in [0.04, 0.2]$, as illustrated in Figure 3(b). We remark that this choice of parameters is similar to the one in [10], and it ensures that the underlying clusters exhibit a ground-truth hierarchical structure.

As reported in Figure 3, our experimental results for both sets of graphs are similar, and the performance of our algorithm is marginally better than Linkage++. This is to be expected, as Linkage++ is specifically designed for the HSBM, in which all the clusters have the same inner density characterised by the parameter p, and their algorithm achieves a (1 + o(1))-approximation for those instances.

Figure 3: Results for clusters of the same size. The x-axis represents different values of p, while the y-axis represents the cost of the algorithms' returned HC trees normalised by the cost of PruneMerge. Figure (a) corresponds to inputs generated according to the SBM, while Figure (b) to those generated according to the HSBM.

Clusters with non-uniform densities. Next we study graphs in which edges are present non-uniformly within each cluster (e.g., Figure 1(a) discussed earlier). Specifically, we set k = 3, $|P_i| = 1000$, q = 0.002, p = 0.06, and every pair of vertices $(u, v) \in P_i \times P_j$ is connected by an edge with probability p if i = j and with probability q otherwise. Moreover, this time we choose a random set $S_i \subset P_i$ of size $|S_i| = c_p \cdot |P_i|$ from each cluster, and add edges to connect every pair of vertices in each $S_i$, so that the vertices of each $S_i$ form a clique. For different values of $c_p \in [0.05, 0.4]$, the performance of our algorithm is about 20%-50% better than Linkage++ with respect to the cost value of the constructed tree; see Figure 4(a) for detailed results. To explain the outperformance of our algorithm, notice that, by adding a clique into some cluster, the cluster structure is usually preserved with respect to a $(\Phi_{\mathrm{in}}, \Phi_{\mathrm{out}})$-decomposition or similar eigen-gap assumptions on well-clustered graphs. However, the existence of such a clique within a cluster makes the vertex degrees highly unbalanced; as such, many clustering algorithms that rely on matrix perturbation theory in their analysis might not work well.
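For reference, such inputs are easy to generate; the sketch below (ours) uses networkx's stochastic block model generator and then plants a clique in each cluster, mirroring the parameters of the non-uniform-densities experiment. The use of networkx is our assumption, not necessarily what the experiments used.

```python
import random
from itertools import combinations
import networkx as nx

def planted_clique_sbm(sizes, p_in, q_out, c_p, seed=0):
    """Generate an SBM graph and plant a clique of relative size c_p inside each
    cluster (within-cluster probability p_in, between-cluster probability q_out)."""
    k = len(sizes)
    probs = [[p_in if i == j else q_out for j in range(k)] for i in range(k)]
    G = nx.stochastic_block_model(sizes, probs, seed=seed)
    rng = random.Random(seed)
    for block in G.graph["partition"]:          # one vertex set per cluster
        clique = rng.sample(sorted(block), int(c_p * len(block)))
        G.add_edges_from(combinations(clique, 2))
    return G

# Parameters of the experiment: k = 3 clusters of 1000 vertices, p = 0.06, q = 0.002.
G = planted_clique_sbm([1000, 1000, 1000], p_in=0.06, q_out=0.002, c_p=0.2)
```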
Clusters of different sizes. To further highlight the significance of our algorithm on synthetic graphs with non-symmetric structure among the clusters, we study graphs in which the clusters have different sizes. We choose the same k and q values as before (k = 3, q = 0.002), but set the sizes of the clusters to be $|P_1| = 1900$, $|P_2| = 900$ and $|P_3| = 200$. Every pair of vertices $u, v \in P_i$, for $i \in \{1, 2\}$, is connected by an edge with probability $p_1 = 0.06$, while pairs of vertices $u, v \in P_3$ are connected with probability $p_2 = 5p_1 = 0.3$. (Such a choice of $p_2$ compensates for the small size of cluster $P_3$, and ensures that the outer conductance $\Phi_G(P_3)$ is low.) We further plant a clique $S_i \subset P_i$ of size $|S_i| = c_p \cdot |P_i|$ in each cluster $P_i$, as in the previous set of experiments. For different values of $c_p$ from [0.05, 0.4], our results are reported in Figure 4(b), demonstrating that our algorithm performs better than the ones in [10].

Figure 4: Results for graphs with non-uniform densities, or with clusters of different sizes. The x-axis represents the $c_p$-values, while the y-axis represents the cost of the algorithms' constructed trees normalised by the cost of the ones constructed by PruneMerge. Figure (a) corresponds to inputs where all clusters have the same size, while in Figure (b) the clusters have different sizes.

Figure 5: Results on real-world data sets. The x-axis represents the various data sets and our choice of the σ-value used for constructing the similarity graphs. The y-axis corresponds to the cost of the algorithms' output normalised by the cost of PruneMerge.

Real-world data sets. To evaluate the performance of our algorithm on real-world data sets, we follow the sequence of recent work on hierarchical clustering [2, 10, 19, 24], all of which use the following five data sets from the Scikit-learn library [22] and the UCI ML repository [1]: Iris, Wine, Cancer, Boston and Newsgroup. (Due to the very large size of the Newsgroup data set, we consider only a subset consisting of "comp.graphics", "comp.os.ms-windows.misc", "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "rec.sport.baseball", and "rec.sport.hockey".) Similarly to [24], for each data set we construct the similarity graph based on the Gaussian kernel, in which the σ-value is chosen according to the standard heuristic [21]. As reported in Figure 5, our algorithm performs marginally worse than Linkage++ and significantly better than PCA+.
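For reference, the similarity-graph construction takes only a few lines; the sketch below (ours) loads Iris from scikit-learn and builds a Gaussian-kernel graph. Choosing σ as the median pairwise distance is just one common variant of the heuristic in [21], not necessarily the exact choice used in the experiments.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics.pairwise import euclidean_distances

def gaussian_similarity_graph(X, sigma=None):
    """Dense weighted adjacency matrix with w_uv = exp(-||x_u - x_v||^2 / (2 sigma^2)).
    If sigma is not given, use the median pairwise distance as a simple heuristic."""
    D = euclidean_distances(X)
    if sigma is None:
        sigma = np.median(D[np.triu_indices_from(D, k=1)])
    W = np.exp(-(D ** 2) / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)            # no self-loops
    return W

X = load_iris().data
W = gaussian_similarity_graph(X)        # input for the hierarchical clustering algorithms
```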
## 6 Conclusion

The experimental results on synthetic data sets demonstrate that our presented algorithm PruneMerge not only has excellent theoretical guarantees, but also produces output of lower cost than the previous algorithm Linkage++. In particular, the outperformance of our algorithm is best illustrated on graphs whose clusters have asymmetric internal structure and non-uniform densities. On the other hand, the experimental results on real-world data sets show that the performance of PruneMerge is inferior to Linkage++ and especially to Average Linkage. We believe that developing more efficient algorithms for well-clustered graphs is a very meaningful direction for future work. Finally, our experimental results indicate that the Average Linkage algorithm performs extremely well on all instances, when compared to PruneMerge and Linkage++. This leads to the open question of whether Average Linkage achieves an O(1)-approximation for well-clustered graphs, although it fails to achieve an O(1)-approximation for general graphs [11]. In our view, the answer to this question could help design more efficient algorithms for hierarchical clustering that not only work in practice, but also have rigorous theoretical guarantees.

## Acknowledgements

Bogdan Manghiuc is supported by an EPSRC Doctoral Training Studentship (EP/R513209/1), and He Sun is supported by an EPSRC Early Career Fellowship (EP/T00729X/1).

## References

[1] UCI ML Repository. https://archive.ics.uci.edu/ml/index.php. Accessed: 2021-05-22.
[2] Amir Abboud, Vincent Cohen-Addad, and Hussein Houdrouge. Subquadratic high-dimensional hierarchical clustering. In 33rd Advances in Neural Information Processing Systems (NeurIPS'19), pages 11576-11586, 2019.
[3] Noga Alon, Yossi Azar, and Danny Vainstein. Hierarchical clustering: a 0.585 revenue approximation. In 33rd Annual Conference on Learning Theory (COLT'20), pages 153-162, 2020.
[4] Sanjeev Arora, Boaz Barak, and David Steurer. Subexponential algorithms for unique games and related problems. Journal of the ACM, 62(5), November 2015.
[5] Sanjeev Arora, Subhash Khot, Alexandra Kolla, David Steurer, Madhur Tulsiani, and Nisheeth K. Vishnoi. Unique games on expanding constraint graphs are easy: extended abstract. In 40th Annual ACM Symposium on Theory of Computing (STOC'08), pages 21-28, 2008.
[6] Moses Charikar and Vaggos Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. In 28th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'17), pages 841-854, 2017.
[7] Moses Charikar, Vaggos Chatziafratis, and Rad Niazadeh. Hierarchical clustering better than average-linkage. In 30th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'19), pages 2291-2304, 2019.
[8] Moses Charikar, Vaggos Chatziafratis, Rad Niazadeh, and Grigory Yaroslavtsev. Hierarchical clustering for Euclidean data. In 22nd International Conference on Artificial Intelligence and Statistics (AISTATS'19), pages 2721-2730, 2019.
[9] Vaggos Chatziafratis, Grigory Yaroslavtsev, Euiwoong Lee, Konstantin Makarychev, Sara Ahmadian, Alessandro Epasto, and Mohammad Mahdian. Bisect and conquer: Hierarchical clustering via max-uncut bisection. In 23rd International Conference on Artificial Intelligence and Statistics (AISTATS'20), pages 3121-3132, 2020.
[10] Vincent Cohen-Addad, Varun Kanade, and Frederik Mallmann-Trenn. Hierarchical clustering beyond the worst-case. In 31st Advances in Neural Information Processing Systems (NeurIPS'17), pages 6201-6209, 2017.
[11] Vincent Cohen-Addad, Varun Kanade, Frederik Mallmann-Trenn, and Claire Mathieu. Hierarchical clustering: Objective functions and algorithms. Journal of the ACM, 66(4):1-42, 2019.
[12] Sanjoy Dasgupta. A cost function for similarity-based hierarchical clustering. In 48th Annual ACM Symposium on Theory of Computing (STOC'16), pages 118-127, 2016.
[13] Shayan Oveis Gharan and Luca Trevisan. Partitioning into expanders. In 25th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'14), pages 1256-1266, 2014.
[14] Shlomo Hoory, Nathan Linial, and Avi Wigderson. Expander graphs and their applications. Bulletin of the American Mathematical Society, 43(4):439-561, 2006.
[15] Ravi Kannan, Santosh S. Vempala, and Adrian Vetta. On clusterings: Good, bad and spectral. Journal of the ACM, 51(3):497-515, 2004.
[16] Alexandra Kolla. Spectral algorithms for unique games. In 25th Conference on Computational Complexity (CCC'10), pages 122-130, 2010.
[17] James R. Lee, Shayan Oveis Gharan, and Luca Trevisan. Multiway spectral partitioning and higher-order Cheeger inequalities. Journal of the ACM, 61(6):37:1-37:30, 2014.
[18] Huan Li, He Sun, and Luca Zanetti. Hermitian Laplacians and a Cheeger inequality for the Max-2-Lin problem. In 27th Annual European Symposium on Algorithms (ESA'19), pages 71:1-71:14, 2019.
[19] Aditya Krishna Menon, Anand Rajagopalan, Baris Sumengen, Gui Citovsky, Qin Cao, and Sanjiv Kumar. Online hierarchical clustering approximations. arXiv:1909.09667, 2019.
[20] Benjamin Moseley and Joshua Wang. Approximation bounds for hierarchical clustering: Average linkage, bisecting k-means, and local search. In 31st Advances in Neural Information Processing Systems (NeurIPS'17), pages 3094-3103, 2017.
[21] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In 15th Advances in Neural Information Processing Systems (NeurIPS'02), pages 849-856, 2002.
[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
[23] Richard Peng, He Sun, and Luca Zanetti. Partitioning well-clustered graphs: Spectral clustering works! SIAM Journal on Computing, 46(2):710-743, 2017.
[24] Aurko Roy and Sebastian Pokutta. Hierarchical clustering via spreading metrics. The Journal of Machine Learning Research, 18(1):3077-3111, 2017.
[25] Danny Vainstein, Vaggos Chatziafratis, Gui Citovsky, Anand Rajagopalan, Mohammad Mahdian, and Yossi Azar. Hierarchical clustering via sketches and hierarchical correlation clustering. In 24th International Conference on Artificial Intelligence and Statistics (AISTATS'21), pages 559-567, 2021.
[26] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.
[27] Zeyuan Allen-Zhu, Silvio Lattanzi, and Vahab S. Mirrokni. A local algorithm for finding well-connected clusters. In 30th International Conference on Machine Learning (ICML'13), pages 396-404, 2013.