# learningaugmented_hierarchical_clustering__15fa760b.pdf

Learning-Augmented Hierarchical Clustering

Vladimir Braverman 1 2 3 Jon C. Ergun 4 Chen Wang 2 5 Samson Zhou 5

Hierarchical clustering (HC) is an important data analysis technique in which the goal is to recursively partition a dataset into a tree-like structure while grouping together similar data points at each level of granularity. Unfortunately, for many of the proposed HC objectives, there exist strong barriers to approximation algorithms with the hardness of approximation. We consider the problem of hierarchical clustering given auxiliary information from natural oracles in the learningaugmented framework. Our main results are algorithms that given learning-augmented oracles, compute efficient approximate HC trees for the celebrated Dasgupta s and Moseley-Wang objectives that overcome known hardness barriers.

1. Introduction

Hierarchical clustering (HC) is a popular data analysis technique that recursively partitions a dataset throughout a treelike structure, so that similar data points are grouped together at different levels of granularity. Specifically, the input is a set of n data points and a measure of similarity or dissimilarity between the points, which induces a weighted graph whose vertices represent the data points and whose edge weights represent the pairwise measure between the vertices. The output is a binary dendrogram, which is a rooted tree whose leaves represent the individual data points and whose internal nodes each represent a cluster of the data points in its subtree, thus providing a hierarchical representation of relationships within the dataset.

Hierarchical clustering has several advantages over flat clusterings such as k-means or k-median, where the dataset is partitioned into a fixed number of clusters. For example,

Alphabetical order of author names. 1Johns Hopkins University 2Rice University 3Google Research 4Carnegie Mellon University 5Texas A&M University. Correspondence to: Vladimir Braverman <vova@cs.jhu.edu>, Jon C. Ergun <ergunjon0@gmail.com>, Chen Wang <chen.wang.research@gmail.com>, Samson Zhou <samsonzhou@gmail.com>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

the correct number of clusters in flat clusterings is often a difficult question that is the focus of a sequence of works dating back to the 1950s (Thorndike, 1953). In hierarchical clustering, there is no fixed number of clusters that needs to be determined in advance. Another advantage of hierarchical clustering is that the dendrogram simultaneously captures structure at all levels of granularity, whereas flat clustering does not identify further structure inside each of the clusters. Hence, hierarchical clustering arises in various applications where data exhibits hierarchical structure, such as biology and phylogenetics (Sneath et al., 1973; Sotiriou et al., 2003), image and text analysis (Steinbach et al., 2000), and community detection (Leskovec et al., 2020).

Despite a wealth of heuristics for both agglomerative bottom-up (Ward Jr, 1963) and divisive top-down approaches (Gu enoche et al., 1991), formal mathematical understanding of hierarchical clustering often stagnated due to the absence of well-posed objectives until a relatively recent work by Dasgupta (Dasgupta, 2016). Subsequently, additional objectives (Moseley & Wang, 2017; Cohen-Addad et al., 2019) were proposed to quantify the performance of dendrograms with n leaves, so that high-revenue similarity trees and low-cost dissimilarity trees correspond to desirable hierarchical partitions of the dataset.

For a number of these objectives, various algorithms have been proven to achieve specific approximation guarantees. For example, a divisive clustering algorithm based on the sparsest cut subroutine was shown to give an O( log n) approximation (Charikar & Chatziafratis, 2017b; Cohen Addad et al., 2019; Deng et al., 2025) for Dasgupta s objective (Dasgupta, 2016), while the long-used agglomerative heuristic was shown to give a 2-approximation for the dissimilarity objective proposed by (Cohen-Addad et al., 2019) and a 1

3-approximation for the similarity objective proposed by (Moseley & Wang, 2017), i.e., the Moseley-Wang (MW) objective. These objectives were further explored in the contexts of better approximation factors (Chatziafratis et al., 2020b; Alon et al., 2020), sublinear computation models (Rajagopalan et al., 2021; Assadi et al., 2022; Agarwal et al., 2022), and graphs with special properties (Charikar et al., 2019b; Manghiuc & Sun, 2021).

Unfortunately, (Charikar et al., 2019a) showed that averagelinkage cannot do better than 3

2-approximation for the for-

Learning-Augmented Hierarchical Clustering

mer objective or better than 1

3-approximation for the latter. More general hardness of approximation results were given, showing the impossibility of achieving roughly 1.003approximation (Chatziafratis et al., 2020a) for dissimilarity under the Unique Games Conjecture and the impossibility of achieving (1 C)-approximation for the Moseley-Wang objective (Chatziafratis et al., 2020b) under the Small Set Expansion hypothesis, for a fixed constant C > 0. Moreover, for the Dasgupta objective, (Charikar & Chatziafratis, 2017a; Roy & Pokutta, 2017) showed that under the Small Set Expansion hypothesis, there is no constant approximation in polynomial time for any constant. Thus, we seek new practical approaches that enable better approximation guarantees without assumptions about the underlying dataset or weight function.

Learning-augmented algorithms. We draw inspiration from the recent advances in the predictive capabilities of machine learning models. On one hand, datasets often have additional auxiliary information that can be used to improve algorithmic performance if accurate. For example, in many applications, the input dataset can retain insightful patterns exhibited by similar datasets generated from previous instances. On the other hand, machine learning models lack provable guarantees and can result in wildly inaccurate predictions when generalizing to unfamiliar inputs (Szegedy et al., 2014). Nevertheless, learning-augmented algorithms (Mitzenmacher & Vassilvitskii, 2020) that overcome worst-case computational limits have been designed for a number of applications, such as warm-starts for faster algorithms (Dinitz et al., 2021; Chen et al., 2022c; Davies et al., 2023), data structures optimized for specific query distributions (Kraska et al., 2018; Mitzenmacher, 2018; Lin et al., 2022; Fu et al., 2025), online algorithms with some forecast of the future (Purohit et al., 2018; Gollapudi & Panigrahi, 2019; Lattanzi et al., 2020; Wang et al., 2020; Wei & Zhang, 2020; Bamas et al., 2020; Im et al., 2021; Lykouris & Vassilvitskii, 2021; Aamand et al., 2022; Anand et al., 2022; Azar et al., 2022; Grigorescu et al., 2022; Khodak et al., 2022; Gupta et al., 2022; Jiang et al., 2022; Antoniadis et al., 2023; Shin et al., 2023; Benomar & Perchet, 2023), input-sensitive sketches for more space-efficient streaming algorithms (Hsu et al., 2019; Indyk et al., 2019; Jiang et al., 2020; Chen et al., 2022b;a; Li et al., 2023), and classical NP hard problems (Braverman et al., 2024; Cohen-Addad et al., 2024; Braverman et al., 2025).

For clustering problems, (Ergun et al., 2022; Nguyen et al., 2023; C. S. et al., 2024) introduced flat clustering algorithms that use polynomial runtime and achieve approximation guarantees beyond NP hardness limits. Though their techniques are specific to k-means and k-median clustering, their work nevertheless serves as an important conceptual message that demonstrates machine learning oracles can be used to improve upon traditional techniques for cluster anal-

ysis. Furthermore, for graph-base problems, recent results by (Cohen-Addad et al., 2024; Braverman et al., 2024; Dong et al., 2025) have shown that natural learning-augmented oracles could help overcome NP-hardness constraints as well. We thus ask whether machine learning models can be used to provably improve (graph-based) hierarchical clustering.

1.1. Our Contributions

In this paper, we consider the problem of hierarchical clustering given a possibly erroneous oracle that uses auxiliary information, e.g., through clusterings of similar datasets, to provide local information about the relationship between queried data points. In particular, we consider a splitting oracle that, on an input query of a triplet (u, v, w) of vertices, outputs the vertex that is first separated away from the other two with respect to an optimal or near-optimal hierarchical clustering tree. In other words, if the oracle is consistent with some ground-truth tree, it will output the vertex that is not in the same subtree as the other two vertices, under their least common ancestor. We remark that such oracle advice is natural due to the plethora of machine learning models that are trained on related instances of graphs, where the triplet relationships are already labeled.

Using triplet split-away information is common in the literature, and there have been results explored in similar settings, e.g., tree reconstruction with accurate triplet relationships given (Aho et al., 1981), triplets are given as constraints (Chatziafratis et al., 2018), noisy triplet information with fresh randomness (Emamjomeh-Zadeh & Kempe, 2018), and algorithms with quartet information (Jiang et al., 2000; Snir & Yuster, 2011; Alon et al., 2014). Furthermore, the reconstruction of phylogenetic CSPs, which provides an oracle with a similar form to ours, is an extensively studied line of work (see, e.g., (Chatziafratis & Makarychev, 2023)). From the machine learning perspective, it is possible to learn such oracles in the PAC learning framework (see Appendix J for details).

In line with existing literature on learning-augmented algorithms, we investigate a stochastic and independently responding splitting oracle, where randomness is introduced only once. Specifically, this implies that the oracle correctly responds with a probability of p for some constant p > 1/2, independently across vertex triplets. Additionally, repeated queries of the same triplet consistently yield the same (possibly erroneous) responses, which rules out basic boosting strategies such as repeatedly making the same query to the oracle. We remark this splitting oracle mirrors numerous machine learning models that are trainable with data yet exhibit inherent noise.

We now present our main results for learning-augmented hierarchical clustering. We first show that for the Dasgupta objective, we can use such a splitting oracle to achieve a

Learning-Augmented Hierarchical Clustering

constant factor approximation in polynomial time. Theorem 1. There exists an algorithm that, given a weighted undirected graph G = (V, E, w) and a splitting oracle O, with high probability, in polynomial time and O(n3) queries computes a hierarchical clustering tree T such that cost G(T ) O(1) OPTDas(G), where OPTDas(G) is the cost of the optimal hierarchical clustering tree T , i.e., OPTDas(G) = cost G(T ).

By comparison, (Charikar & Chatziafratis, 2017a; Roy & Pokutta, 2017) showed that there is no polynomial-time algorithm that could achieve any constant approximation to Dasgupta s objective, under the Small Set Expansion hypothesis. Hence, Theorem 1 illustrates that the power of a splitting oracle can be used to break complexity hardness limitations. On the other hand, we remark that the runtime of the algorithm of Theorem 1, although polynomial, is perhaps embarrassingly large. We thus give an algorithm that uses the splitting oracle and O(n3) time to achieve approximation guarantees beyond the current state-of-theart oblivious algorithms. Theorem 2. There exists an algorithm that, given a weighted undirected graph G = (V, E, w) and a splitting oracle O, with high probability, in O(n3 log n) time and O(n3) queries computes a hierarchical clustering tree T such that cost G(T ) O( log log n) OPTDas(G), where OPTDas(G) is the cost of the optimal hierarchical clustering tree T , i.e., OPTDas(G) = cost G(T ).

By comparison, the best-known polynomial-time oblivious algorithm achieves O( log n)-approximation (Charikar & Chatziafratis, 2017b; Cohen-Addad et al., 2019). Thus our algorithm that achieves O( log log n) approximation gives very competitive practical bounds the improvement of our algorithm is approximately 2.3 times better for n = 1010.

Turning our attention to the Moseley-Wang objective, we similarly show that a splitting oracle can also be used to achieve any constant factor approximation in polynomial time. Theorem 3. There exists an algorithm that, given a weighted undirected graph G = (V, E, w) and a splitting oracle O, with high probability, in O(n2 polylog n) time and O(n2) queries computes a hierarchical clustering tree T such that rev G(T ) (1 o(1)) OPTMW(G), where OPTMW(G) is the revenue of the optimal hierarchical clustering tree T , i.e., OPTMW(G) = rev G(T ).

We note that (Chatziafratis et al., 2020b) showed the APXhardness of the (1 C) approximation for Moseley-Wang objective under the Small Set Expansion hypothesis, for a fixed constant C (0, 1). As such, Theorem 3 again also shows the power of splitting oracles to overcome impossibility barriers. Since a n-vertex graph could have input

size as large as Θ(n2), the time complexity in Theorem 3 is near-linear in the worst case.

Finally, we observe that our algorithms possess favorable properties that are extremely amenable to sublinear algorithms. As such, we can obtain the following results in the streaming and parallel computation (PRAM) settings. Theorem 4. In the single-pass graph streaming and the PRAM settings, there exists:

a single-pass (dynamic) streaming algorithm that, given a weighted undirected graph G = (V, E, w) in a poly(n)-length dynamic stream and an offline splitting oracle O, with high probability, uses O(n log3 n) bits of space and polynomial time computes a hierarchical clustering tree T such that cost G(T ) O(1) OPTDas(G) (Theorem 21).

a PRAM algorithm that, given a weighted undirected graph G = (V, E, w) and a splitting oracle O, with high probability, in O(n2 polylog n) work and log3 n depth computes a hierarchical clustering tree T such that rev G(T ) (1 o(1)) OPTMW(G) (Theorem 22).

Our results in the sublinear settings similarly outperform the state-of-the-art in the HC algorithms without oracle advice. For instance, in the single-pass graph streaming setting, (Assadi et al., 2022; Agarwal et al., 2022) designed semi-streaming algorithms with O(1) approximation but exponential time. By comparison, our algorithm only uses polynomial time, leveraging the advantage of the splitting oracle.

2. Preliminaries

We present the definition of the hierarchical clustering problem, our splitting oracle model, and the HC objectives in this section.

2.1. The hierarchical clustering problem

We consider the hierarchical clustering problem with a splitting oracle O. The hierarchical clustering problem is defined as follows. We are given an n-vertex weighted undirected input graph G = (V, E, w), and our goal is to produce a binary tree T whose root node corresponds to the vertex set V and the leaves represent the singleton vertices. The vertices set contained in the internal nodes form a Laminar set family: suppose node x has children (y, z), it represents a split Sx (Sy, Sz), where Sx = Sy Sz. In this work, we assume without loss of generality n 200 log n the bound holds for any n 2500, and if n is a constant we can simply use a brute-force algorithm.

Hierarchical clustering trees only define a data structure, and there are many ways to construct valid HC trees.

Learning-Augmented Hierarchical Clustering

What eventually matters is to construct good HC trees a notion that does not have a universal way to define. Popular approaches include heuristics, which work well subjectively but lack formal guarantees, and objective functions, which provide rigorous frameworks to study the optimal trees and the approximation algorithms. In recent years, the latter approach has attracted considerable attention with popular objective functions by (Dasgupta, 2016; Moseley & Wang, 2017; Cohen-Addad et al., 2019).

Notation. For each internal node x in T , we use leaves T [ x ] to denote the leaves in the induced subtree of x. Each internal node of an HC tree can be described by lowest common ancestor (LCA) of vertices. For two vertices (u, v) on the leaves of T , we use LCAT (u, v) to denote the node that is the lowest common ancestor of u and v. We can further generalize this notion to a set of vertices, i.e., for a set X V , the node LCAT (X) refers to the lowest common ancestor of all vertices in X. For a set X, we call the induced subtree TX a maximal subtree of T if leaves T [ LCATX(X) ] = X, i.e., the lowest common ancestor of X in TX induces all leaves of X in T .

Let r be the root of a hierarchical clustering tree T , and for any internal node x, we let dist T (r, x) be the number of edges on the shortest path between r and x. We say node x on level level T (x) is a higher level node than node y with level level T (y) in T if dist T (r, x) > dist T (r, y). For two internal nodes x and y in T , we use x = pa (y) to denote the relationship of x being the parent node of y. Note that if x = pa (y), it automatically implies that level T (y) = level T (x) + 1.

The split-away vertex. Note that in any hierarchical clustering tree, if we look at a triplet of vertices (u, v, w), there must exist a vertex that split away from the two others in the optimal tree T , i.e., two vertices with a LCA that is same as the LCA of all three vertices in T . Formally, for a triplet of vertices (u, v, w), we define w splits away from (u, v) as follows.

Definition 5 (Split-away vertex). Let G = (V, E, w) be a n-vertex graph, and let T be the optimal HC tree for G. Given a triplet of vertices (u, v, w), we say w splits away from (u, v) (in T ) if LCAT (w, u) (resp. LCAT (w, v)) is equal to LCAT ({u, v, w}).

2.2. The splitting oracle model

We study the hierarchical clustering problem with a natural oracle advice model. In particular, we assume an oracle O : V V V V that takes a triplet of vertices (u, v, w), probabilistically correctly returns the vertex that split away from the other two vertices in the optimal tree1. The formal

1We provide a discussion for splitting oracles with an approximately optimal HC tree in Appendix I.2.

definition is given as follows.

Definition 6 (The splitting oracle for hierarchical clustering). Let G = (V, E) be a n-vertex graph, and let T be the optimal hierarchical clustering tree of G. The oracle O : V V V V is a function that upon being queried with a triplet of vertices (u, v, w), responds as follows

with probability p, the correct answer on which vertex splits away from the two others in T .

with probability (1 p), an arbitrary (adversarial) answer on which vertex splits away from the two other vertices.

The randomness is taken independently over all the queries and is fixed across different queries on the same triplet. We assume each query to the oracle takes O(1) time.

Assuming the correct probability of an oracle is some constant p > 1/2 is very common in the literature, especially for graph problem (Braverman et al., 2024; Cohen-Addad et al., 2024; Dong et al., 2025). For the convenience of presentation, we assume p = 9 10 in this paper, and we provide a discussion about general success probabilities in Appendix I.1. Observe that by the fixed randomness for each triplet, there are at most n 3 many answers that O can have. This setting rules out trivial algorithms that simply get the correct by querying multiple times and boosting the success probability.

Several hierarchical clustering objectives are proved to be hard to approximate in polynomial time under very plausible complexity assumptions. As such, our goal is to explore whether we can obtain better approximation guarantees with the splitting oracle.

2.3. Objective functions for hierarchical clustering

We introduce the objective functions for hierarchical clustering we are going to discuss in this paper. These include the Dasgupta minimization (cost) objective (Dasgupta, 2016) and Moseley-Wang maximization (revenue) objective (Moseley & Wang, 2017). We start with the minimization objective as prescribed by (Dasgupta, 2016). Problem 1 (HC under Dasgupta s cost function). Given an n-vertex weighted graph G = (V, E, w) with vertices corresponding to data points and edges measuring their similarity, create a rooted tree T whose leaf nodes are V . The goal is to minimize the cost of this tree T defined as

cost G(T ) := X

e=(u,v) E w(e) |leaves T [ LCAT (u, v) ]|,

where |leaves T [ LCAT (u, v) ]| is the number of leafnodes in the sub-tree of T rooted at the lowest common

Learning-Augmented Hierarchical Clustering

ancestor of u and v. We use OPTDas(G) to denote the cost of an optimal HC tree under Dasgupta s cost for the graph G.

Roughly speaking, Dasgupta s objective accumulates the cost on an edge (u, v) by the number of leaves inside the subtree where u and v are first split. In contrast, the Moseley Wang objective focuses on the dual of Dastupta s objective: it gathers the revenue on an edge (u, v) by the number of leaves outside the subtree where u and v are first split. Formally, the Moseley-Wang objective can be given as follows.

Problem 2 (HC under Moseley-Wang revenue function). Given an n-vertex weighted graph G = (V, E, w) with vertices corresponding to data points and edges measuring their similarity, create a rooted tree T whose leaf nodes are V . The goal is to maximize the revenue rev G(T ) of a tree T defined as X

e=(u,v) E w(e) (n |leaves T [ LCAT (u, v) ]|)

e=(u,v) E w(e) |non-leaves T [ LCAT (u, v) ]|,

where |leaves T [ LCAT (u, v) ]| is the number of leafnodes in the sub-tree of T rooted at the lowest common ancestor of u and v, and |non-leaves T [ LCAT (u, v) ]| is the number of nodes that are not among leaves T [ LCAT (u, v) ]. We use OPTMW(G) to denote the revenue of an optimal HC tree under Dasgupta s cost for the graph G.

Observe that both objectives are composeable w.r.t. edges, i.e., it is possible to divide the total objective to objectives induced by each (or each set of) edge(s). For any HC tree T and any set of edge E1 E, we use rev G(T , E1) and cost G(T , E1) to denote the revenue and the cost induced by the edges in E1.

By a straightforward calculation, one can show that for any HC tree T , there is rev G(T ) = P

e=(u,v) E w(e) n cost G(T ). Since P

e=(u,v) E w(e) n is a deterministic function of the graph G itself, the optimal HC tree T under the two objectives are the same. However, the two objectively admits vastly different approximation algorithms. In particular, for the minimization objective, (Dasgupta, 2016) and the following work (Roy & Pokutta, 2016; Charikar & Chatziafratis, 2017b) showed that we can achieve an O( log n) approximation in polynomial time, and there is no O(1) approximation in polynomial time assuming Small Set Expansion (SSE) hypothesis. On the other hand, for the revenue maximization objective, (Moseley & Wang, 2017) proved that the average-linkage heuristic can achieve a 1/3 approximation in polynomial time. Therefore, we would

naturally expect different results for hierarchical clustering with the splitting oracle with the two objectives.

3. The Definitions and Results for Partial Hierarchical Clustering Trees

A technical backbone of our algorithms in this paper is the partial hierarchical clustering tree. Roughly speaking, these structures replicate the organizational framework of the optimal HC tree, exhibiting only minor ambiguity within small subsets of vertices. In this section, we formally define the strong and weak partial trees and give efficient construction algorithms for them. We remark that our constructions of the partial HC trees are entirely based on the vertex set V and the oracle O of the input graph, irrespective of specific objective functions. This inherent independence renders our partial HC trees highly versatile and potentially of significant interest in their own right.

3.1. Partial hierarchical clustering trees

We start by showing the definition of partial hierarchical clustering trees, which are very similar to the normal HC trees: the internal nodes represent subsets of vertices, and the leaves are individual vertices. However, in partial HC trees, we allow a collection of vertices that are not too large to have unknown local clustering, and we simply represent the whole set of vertices as a leaf node in the tree. The formal definition is as follows.

Definition 7 (Partial hierarchical clustering trees). A partial hierarchical clustering tree I is a binary tree such that

1. The root represents the vertex set V .

2. For a node x with children (y, z), it represents a split Sx (Sy, Sz), where Sx = Sy Sz.

3. The leaves of I corresponds to

either a singleton vertex in V . or a set of vertices S V such that S 50000 log n. In this case, we call the leave a supervertex.

Compared to the full hierarchical clustering tree, the partial HC tree allows the leaves to be contracted vertices with size at most O(log n). We now define a partial tree that is strongly consistent with a hierarchical clustering tree T .

Definition 8 (Partial tree strongly consistent with T ). Let I be a partial hierarchical clustering tree and let T be a (standard) hierarchical clustering tree. We say I is strongly consistent with T if

1. (Strong contraction property) Each super-vertex induces a maximal subtree in T .

Learning-Augmented Hierarchical Clustering

2. (Subtree preservation property) For any pair of leaves (x, y) in I, let X and Y be the set of leaves corresponding to x and y in T (recall that the leaves of I can be super-vertices). The subtree induced by LCAI(x, y) contains the exactly the same set of vertices as induced by LCAT (X Y ).

In other words, a partial tree I is strongly consistent with T if there exists a way to locally arrange tree structures for every super-vertex to exactly recover T . An illustration of the strongly consistent partial tree (w.r.t. T ) can be found in Figure 1.

The strong partial tree is a very helpful data structure for HC. Nevertheless, finding such a strong partial tree could be challenging. As such, we also define partial trees that are weakly consistent with the tree T as follows. Definition 9 (Partial tree weakly consistent with T ). Let I be a partial hierarchical clustering tree and let T be a (standard) hierarchical clustering tree. We say I is weakly consistent with T if

1. (Weak contraction property) Each super-vertex corresponds to a collection of maximal subtrees in T , i.e., i Vi such that each Vi satisfies

leaves T [ LCAT (Vi) ] = Vi.

Furthermore, the collection i Vi is with out-degree at most 2 in T such that

(a) At most one edge is connected to a node that is the parent of the LCA of i Vi. (b) At most one edge is connected to a node that is a sibling of a maximal subtree of T induced by (a subset of) i Vi.

2. (Subtree preservation property) For any pair of leaves (x, y) in I, let X and Y be the set of leaves corresponding to x and y in T (recall that the leaves of I can be super-vertices). The subtree induced by LCAI(x, y) contains the exactly the same set of vertices as induced by LCAT (X Y ).

The difference between the weak and strong consistency is that in weakly consistent partial trees, the contraction of vertices can happen in any consecutive region of the original tree T . An illustration of the weakly consistent partial tree (w.r.t. T ) can be found in Figure 2.

Our goal is to use oracle O, and vertex set V to construct a partial HC tree I that is consistent with the optimal HC tree T .

3.2. Main results of partial HC trees

We now give our main results for the strong and weak partial trees, respectively. In particular, our results include

Contains at most vertices

Figure 1. An illustration of the strongly consistent partial HC trees as defined in Definition 8. The boxes indicate super-vertices whose clustering is unknown in the partial HC tree.

Contains at most vertices, but not necessary from a maximal subtree

Figure 2. An illustration of the weakly consistent partial HC trees as defined in Definition 9. The boxes indicate super-vertices whose clustering is unknown in the partial HC tree.

An algorithm that, with high probability, constructs a partial tree strongly consistent with the optimal tree T

in O(n3 log n) time and O(n3) queries to the splitting oracle. Furthermore, the algorithm uses only O(n) space; and

An algorithm that, with high probability, constructs a partial tree weakly consistent with the optimal tree T in O(n2) time and queries to the splitting oracle. Furthermore, the algorithm can be implemented in the PRAM model with O(n2) work and polylog n depth.

Result for strongly consistent partial HC trees. Our main theorem to construct strongly consistent partial HC trees is as follows.

Theorem 10. There exists an algorithm that given a splitting oracle O of a weighted undirected graph G = (V, E, w), with high probability, in O(n3 log n) time and O(n3) queries computes a partial hierarchical clustering tree I that is strongly consistent with the optimal hierarchical clustering tree T . Furthermore, the algorithm has the following properties.

i). The runtime of the algorithm is deterministic, and

Learning-Augmented Hierarchical Clustering

the high probability randomness is over the correctness guarantee.

ii). The algorithm can be implemented in O(n log n) space.

While the algorithm outlined in Theorem 10 necessitates O(n3) queries to the oracle O, rendering it less efficient, the overall running time of O(n3) is tolerable, especially considering the hardness of HC.

Result for weakly consistent partial HC trees. We now show our algorithmic results for the weakly consistent partial HC trees, which enjoy much better efficiency in both the running time and the number of oracle queries.

Theorem 11. There exists an algorithm that given a splitting oracle O of a weighted undirected graph G = (V, E, w), with high probability, in O(n2 polylog n) time and O(n2) queries computes a partial hierarchical clustering tree I that is weakly consistent with the optimal hierarchical clustering tree T .

We note that since the number of longest dependent calls for our weak partial tree is at most O(log3 n), Theorem 11 implies a PRAM algorithm with O(n2 polylog n) work and O(log3 n) depth. The formal statement is as follows.

Corollary 12. There exists a PRAM algorithm that given a graph G = (V, E) and a splitting oracle O, with high probability, in O(n2 polylog n) work and O(log3 n) depth computes a partial hierarchical clustering tree I that is weakly consistent with the optimal HC tree T .

We suspect that by modifying some subroutines of the algorithm in Theorem 11, we could possibly bring the number of time and queries to O(n). The study of a sublinear-time algorithm is an interesting future problem.

A (very) high-level technical summary of the partial trees. As we will discuss in Appendix B, simply following the oracle advice does not lead to any valid outputs. The construction of the partial trees requires a careful aggregation of the split-away information to recover subtrees with sizes Ω(log n). One observation here is that if we are lucky to select a vertex u from the smaller side of the first binary partition, we could determine the correct side for each vertex v V in the first partition of the optimal tree with high probability. We could then recursively recover the partition in the optimal tree T . However, getting a vertex from the smaller side of the first partition is not easy, and we would need to test multiple vertices and use the size of the resulting set as the indicator for correctness. This includes careful design and analysis of different algorithmic subroutines, which is the main technical challenge. A detailed technical overview could be find in Appendix B.

We present the HC algorithms using the results for partial

HC trees for the rest of the main paper. We defer the detailed analysis of the partial trees to Appendices F to H.

4. Polynomial Time Algorithms for Dasgupta s Hierarchical Clustering Objective

We introduce our polynomial time algorithms for Dasgupta s HC objective in this section. These results include an O(1)-approximation algorithm in polynomial time (albeit some large constant on the exponent) and an O( log log n)- approximation algorithm in O(n3) time. Our algorithms crucially rely on the strongly consistent partial tree in Theorem 10.

4.1. A Polynomial-time Algorithm for O(1)-approximation on Dasgupta s HC Objective

We now introduce our algorithm that finds an HC tree with O(1)-approximation to Dasgupta s objective in polynomial time. One can find the formal statement of the result in Theorem 1. The algorithm of Theorem 1 is as Algorithm 1.

Algorithm 1 A polynomial-time algorithm for the Dasgupta s HC objective Input: Input graph G = (V, E, w); Splitting oracle O Output: A hierarchical clustering tree T Run the strong partial tree approximation algorithm in Theorem 10 to obtain partial tree I for each super-vertex in I do

On input vertex set S, exhaustively search the sparsest cut (A, B) on the induced subgraph G[S] Partition the vertices as S (A, B), and recurse on G[A] and G[B] end

We now prove the efficiency and the approximation guarantees for Dasgupta s objective. The following lemma provides the efficiency for Algorithm 1.

Lemma 4.1. Algorithm 1 runs (deterministically) in O(n50002) time and uses O(n3) queries.

Proof. By Theorem 10, the first step of the algorithm that computes the strong partial tree takes O(n3) time and O(n3) queries. Note that we only take queries in this step.

For the second step, when the input size is s, an exhaustive search on the sparsest cut takes O(2s) time. As such, let X be the set of induced vertices for a single super-vertex of I, since we have |X| 50000 log n, it only takes O(n50000) time to find the sparsest cut. Similarly, we can show that in each recursive call, the runtime is at most O(n50000). By Fact A.1, there are at most O(log n) nodes in a binary tree with O(log n) leaves. As such, the recursive sparsest cut for a single super-vertex in I takes O(n50000 log n) time. There

Learning-Augmented Hierarchical Clustering

are at most O(n) super-vertices in the tree; therefore, the total runtime of the second step takes O(n50000 log n n) = O(n50002) time.

Combining the efficiency of the two steps gives us the desired efficiency bound.

The main lemma for the approximation guarantees of Algorithm 1 is as follows. Lemma 4.2. Conditioning on the high probability guarantees of Theorem 10, Algorithm 1 outputs an HC tree T that achieves O(1)-approximation to the Dasgupta s objective.

Proof. For any partial tree I, we first partition the edges in to Ecross and Esame based on whether the edge (u, v) E crosses different partial trees, i.e.,

1. (u, v) Ecross iff u X and v Y for some supervertices X = Y in I.

2. (u, v) Esame iff u, v X for some super-vertex X in I.

Since E = Ecross Esame, by using Observation 1, we can show that OPTDas = cost G(T ) = cost G(T , E1) + cost G(T , E2). We now analyze the costs w.r.t. to E1 and E2, respectively.

1. For Ecross, we argue that cost G(T , Ecross) = cost G(T , Ecross). To see this, note that if u and v are of different super-vertices, by the definition of partial HC trees that are strongly consistent with the optimal tree T , there is

leaves T [ LCAT (u, v) ] = leaves T [ LCAT (u, v) ].

As such, we have cost G(T , Ecross) = cost G(T , Ecross) by the definition of the cost function.

2. For Esame, we argue that cost G(T , Esame) O(1) cost G(T , Esame). Formally, for each super-vertex X, we can use Proposition 17 on G[X] to argue that the cost G(T , Esame[X]) O(1) cost G(T , Esame[X]), where Esame[X] stands for the set of edges in Esame with both endpoints in X. Therefore, we can apply this calculation to every super-vertex to get the desired approximation factor.

We now use Observation 1 again on Ecross and Esame to bound that

cost G(T ) = cost G(T , Ecross) + cost G(T , Esame)

cost G(T , Ecross) + O(1) cost G(T , Esame)

O(1) (cost G(T , Ecross) + cost G(T , Esame))

= O(1) cost G(T ) = O(1) OPTDas,

as desired.

Combining Theorem 10, Lemma 4.2, and Lemma 4.1 leads to the proof of Theorem 1 (see Appendix C for a more formal version).

4.2. An O(n3) Time Algorithm for O( log log n)-approximation on Dasgupta s HC Objective

One drawback of the algorithm we have in Section 4.1 is that the efficiency is theoretical only after all, a runtime of O(n50002) is nowhere near being practical. Observe that the subroutine that leads to the very large exponent is the exhaustive search of the optimal sparsest cut. Therefore, we can hope to use some more efficient approximation for sparsest cuts while not sacrificing too much on the approximation guarantee. This intuition leads us to Algorithm 2.

Algorithm 2 An O(n3) time algorithm for the Dasgupta s HC objective Input: Input graph G = (V, E, w); Splitting oracle O Output: A hierarchical clustering tree T Run the strong partial tree approximation algorithm in Theorem 10 to obtain partial tree I for each super-vertex in I do

On input vertex set S, find an O( p

log |S|)- approximation of the sparsest cut (A, B) on the induced subgraph G[S] using the algorithm of Proposition 18 Partition the vertices as S (A, B), and recurse on G[A] and G[B] end

The main lemmas for the efficiency and approximation guarantees for Algorithm 2 are as follows.

Lemma 4.3. With high probability, Algorithm 2 runs in O(n3 log n) time and uses O(n3) queries.

Lemma 4.4. Conditioning on the high probability guarantees of Theorem 10, Algorithm 2 outputs an HC tree T that achieves O( log log n)-approximation to the Dasgupta s objective.

The proof of Lemma 4.3 and Lemma 4.4 could be found in Appendix C.

5. Near-linear Time Algorithms for Moseley-Wang Hierarchical Clustering Objective

We introduce our algorithm for the Moseley-Wang HC objective in this section. The algorithm is based on the weakly consistent trees as in Theorem 11, which could be implemented in near-linear time. The algorithm is described as Algorithm 3.

Learning-Augmented Hierarchical Clustering

Algorithm 3 A near-linear time algorithm for the Moseley Wang HC objective Input: Input graph G = (V, E, w); Splitting oracle O Output: A hierarchical clustering tree T Run the weak partial tree approximation algorithm in Theorem 11 to obtain partial tree I for each super-vertex in I do

partition the leaves arbitrarily to obtain an HC tree T end

Since each super-vertex contains O(log n) vertices, we can always partition the super-vertices in polylog n time2. There are at most O(n/ log n) such super-vertices, and the total runtime overhead is at most O(n polylog n).

Our analysis for Algorithm 3 is more involved compared to the results in Section 4 it requires careful handling of the contributions of the less significant edges to the Moseley Wang objective. In particular, we could prove the following structural result. Let I be any partial HC tree that is weakly consistent with the optimal tree T , we show that the set of edges (u, v) such that

a). has at most O(log2 n) non-leaves in T ; and

b). let X and Y be corresponding super-vertices that contain u and v in I; there is leaves T [ LCAT (X) ] leaves T [ LCAT (Y ) ] = .

can contribute to at most an o(1) fraction of the optimal cost. To the best of our knowledge, the structural result was not known before.

6. Learning-augmented Sublinear Algorithms for Hierarchical Clustering

In this section, we explore sublinear algorithms for hierarchical clustering with the splitting oracle. Among these is a semi-streaming algorithm, capable of computing an O(1) approximation of Dasgupta s HC objective within polynomial time. Additionally, we introduce a PRAM algorithm that achieves a (1 o(1)) approximation of the Moseley-Wang objective, utilizing O(n2) work and polylog n depth. From a technical standpoint, these algorithms represent straightforward extensions of the results outlined in Theorems 1 to 3. Despite their simplicity, these algorithms demonstrate the advantages of the splitting oracle in modern sublinear computation models. Specifically, we compare our sublinear algorithms with previous results as follows:

1. In the streaming setting, (Assadi et al., 2022; Agarwal et al., 2022) designed single-pass streaming al-

2Arbitrary balanced partitions requires O(log n log log n) time

gorithms that achieve O(n) memory usage and O(1)- approximation to Dasgupta s objective, albeit in exponential time. By improving the time efficiency to polynomial time, our streaming result echoes a similar narrative in the offline setting, demonstrating significantly more efficient constructions with the splitting oracle.

2. For the parallel setting, (Agarwal et al., 2024) (cf. (Agarwal et al., 2022)) provided parallel algorithms (in the PRAM and the similar MPC settings, see Appendix A.3 for details of these models) for Dasgupta s objective with O(n2) work and polylog n depth that achieve polylog n approximation. Since the objectives are different, their result is not directly comparable to ours; however, the conceptual message here is still that the splitting oracle is able to significantly improve the approximation guarantee and the efficiency.

Due to space limits, we only give the algorithms as in Algorithm 4 and Algorithm 5, and defer their proofs to Appendix E.

Algorithm 4 A polynomial-time single-pass semi-streaming algorithm for the Dasgupta s HC objective Input: Input graph G = (V, E, w); Splitting oracle O Output: A hierarchical clustering tree T Before the start of the stream, run the strong partial tree approximation algorithm in Theorem 10 to obtain partial tree I for each edge (u, v) with during the insertion/deletion stream do

If u, v X for any super-vertex X in I, update (u, v) with the same insertion/deletion update Otherwise, ignore the edge end for each super-vertex in I after the stream, run recursive sparsest cut as follows do

On input vertex set S, exhaustively search the sparsest cut (A, B) on the induced subgraph G[S]. Partition the vertices as S (A, B), and recurse on G[A] and G[B]. end

Algorithm 5 A near-linear work, poly-logarithmic depth PRAM algorithm for the Moseley-Wang HC objective Input: Input graph G = (V, E, w); Splitting oracle O Output: A hierarchical clustering tree T Run the PRAM weak partial tree approximation algorithm in Corollary 12 to obtain partial tree I for each super-vertex in I do

Partition the leaves arbitrarily to obtain an HC tree T end

Learning-Augmented Hierarchical Clustering

Acknowledgements

We thank anonymous ICML reviewers for the helpful comments and suggestions. Vladimir Braverman is supported in part by the Naval Research (ONR) grant N00014-23-1-2737 and NSF CNS-2333887 award. Samson Zhou is supported in part by NSF CCF-2335411. The work was conducted in part while Samson Zhou was visiting the Simons Institute for the Theory of Computing as part of the Sublinear Algorithms program.

Impact Statement

This paper presents work that advances the Machine Learning and Algorithm Design fields. Due to its theoretical nature, we do not see any immediate societal impacts, although there are many potential societal consequences of our work for downstream applications.

Aamand, A., Chen, J. Y., and Indyk, P. (optimal) online bipartite matching with degree information. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems, Neur IPS, 2022.

Agarwal, A., Khanna, S., Li, H., and Patil, P. Sublinear algorithms for hierarchical clustering. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems, Neur IPS, 2022.

Agarwal, A., Khanna, S., Li, H., Patil, P., Wang, C., White, N., and Zhong, P. Parallel approximate maximum flows in near-linear work and polylogarithmic depth. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 3997 4061. SIAM, 2024.

Aho, A. V., Sagiv, Y., Szymanski, T. G., and Ullman, J. D. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM Journal on Computing, 10(3):405 421, 1981.

Alon, N., Snir, S., and Yuster, R. On the compatibility of quartet trees. SIAM J. Discret. Math., 28(3):1493 1507, 2014.

Alon, N., Azar, Y., and Vainstein, D. Hierarchical clustering: A 0.585 revenue approximation. In Conference on Learning Theory, COLT, pp. 153 162, 2020.

Anand, K., Ge, R., Kumar, A., and Panigrahi, D. Online algorithms with multiple predictions. In International Conference on Machine Learning, ICML, pp. 582 598, 2022.

Anthony, M. and Bartlett, P. L. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999. doi: 10.1017/CBO9780511624216.

Antoniadis, A., Coester, C., Eli as, M., Polak, A., and Simon, B. Online metric algorithms with untrusted predictions. ACM Trans. Algorithms, 19(2):19:1 19:34, 2023.

Arora, S., Hazan, E., and Kale, S. 0(sqrt (log n)) approximation to SPARSEST CUT in o(n2) time. In 45th Symposium on Foundations of Computer Science (FOCS), Proceedings, pp. 238 247, 2004.

Assadi, S., Chatziafratis, V., Lacki, J., Mirrokni, V., and Wang, C. Hierarchical clustering in graph streams: Singlepass algorithms and space lower bounds. In Conference on Learning Theory, pp. 4643 4702, 2022.

Azar, Y., Panigrahi, D., and Touitou, N. Online graph algorithms with predictions. In Proceedings of the 2022 ACM-SIAM Symposium on Discrete Algorithms, SODA, pp. 35 66, 2022.

Bamas, E., Maggiori, A., and Svensson, O. The primal-dual method for learning augmented algorithms. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, Neur IPS, 2020.

Benomar, Z. and Perchet, V. Advice querying under budget constraint for online algorithms. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems, Neur IPS, 2023.

Braverman, V., Dharangutte, P., Shah, V., and Wang, C. Learning-augmented maximum independent set. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM, pp. 24:1 24:18, 2024.

Braverman, V., Dharangutte, P., Jiang, S. H.-C., Nguyen, H.- A., Wang, C., Zhang, Y., and Zhou, S. Relative error fair clustering in the weak-strong oracle model. In Proceedings of the 42nd International Conference on Machine Learning, ICML, 2025.

C. S., K., Lee, E., Rabani, Y., Schwiegelshohn, C., and Zhou, S. On approximability of ℓ2 2 min-sum clustering. Co RR, abs/2412.03332, 2024.

Charikar, M. and Chatziafratis, V. Approximate hierarchical clustering via sparsest cut and spreading metrics. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pp. 841 854, 2017a.

Learning-Augmented Hierarchical Clustering

Charikar, M. and Chatziafratis, V. Approximate hierarchical clustering via sparsest cut and spreading metrics. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 841 854. SIAM, 2017b.

Charikar, M., Chatziafratis, V., and Niazadeh, R. Hierarchical clustering better than average-linkage. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pp. 2291 2304. SIAM, 2019a.

Charikar, M., Chatziafratis, V., Niazadeh, R., and Yaroslavtsev, G. Hierarchical clustering for euclidean data. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS, pp. 2721 2730, 2019b.

Chatziafratis, V. and Makarychev, K. Triplet reconstruction and all other phylogenetic csps are approximation resistant. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS, pp. 253 284, 2023.

Chatziafratis, V., Niazadeh, R., and Charikar, M. Hierarchical clustering with structural constraints. In Proceedings of the 35th International Conference on Machine Learning, ICML, pp. 773 782, 2018.

Chatziafratis, V., Gupta, N., and Lee, E. Inapproximability for local correlation clustering and dissimilarity hierarchical clustering. Co RR, abs/2010.01459, 2020a.

Chatziafratis, V., Yaroslavtsev, G., Lee, E., Makarychev, K., Ahmadian, S., Epasto, A., and Mahdian, M. Bisect and conquer: Hierarchical clustering via max-uncut bisection. In The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS, pp. 3121 3132, 2020b.

Chen, J. Y., Eden, T., Indyk, P., Lin, H., Narayanan, S., Rubinfeld, R., Silwal, S., Wagner, T., Woodruff, D. P., and Zhang, M. Triangle and four cycle counting with predictions in graph streams. In The Tenth International Conference on Learning Representations, ICLR, 2022a.

Chen, J. Y., Indyk, P., and Wagner, T. Streaming algorithms for support-aware histograms. In International Conference on Machine Learning, ICML, pp. 3184 3203, 2022b.

Chen, J. Y., Silwal, S., Vakilian, A., and Zhang, F. Faster fundamental graph algorithms via learned predictions. In International Conference on Machine Learning, ICML, pp. 3583 3602, 2022c.

Cohen-Addad, V., Kanade, V., Mallmann-Trenn, F., and Mathieu, C. Hierarchical clustering: Objective functions and algorithms. Journal of the ACM (JACM), 66(4):1 42, 2019.

Cohen-Addad, V., d Orsi, T., Gupta, A., Lee, E., and Panigrahi, D. Learning-augmented approximation algorithms for maximum cut and related problems. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems, Neur IPS, 2024.

Dasgupta, S. A cost function for similarity-based hierarchical clustering. In Wichs, D. and Mansour, Y. (eds.), Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pp. 118 127. ACM, 2016.

Davies, S., Moseley, B., Vassilvitskii, S., and Wang, Y. Predictive flows for faster ford-fulkerson. In International Conference on Machine Learning, ICML, volume 202, pp. 7231 7248, 2023.

Deng, C., Gao, J., Upadhyay, J., Wang, C., and Zhou, S. On the price of differential privacy for hierarchical clustering. In The Thirteenth International Conference on Learning Representations, ICLR, 2025.

Dinitz, M., Im, S., Lavastida, T., Moseley, B., and Vassilvitskii, S. Faster matchings via learned duals. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, Neur IPS, pp. 10393 10406, 2021.

Dong, Y., Peng, P., and Vakilian, A. Learning-augmented streaming algorithms for approximating MAX-CUT. In 16th Innovations in Theoretical Computer Science Conference, ITCS, pp. 44:1 44:24, 2025.

Emamjomeh-Zadeh, E. and Kempe, D. Adaptive hierarchical clustering using ordinal queries. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pp. 415 429, 2018.

Ergun, J. C., Feng, Z., Silwal, S., Woodruff, D. P., and Zhou, S. Learning-augmented k-means clustering. In The Tenth International Conference on Learning Representations, ICLR, 2022.

Fu, C., Nguyen, B. G., Seo, J. H., Zesch, R. S., and Zhou, S. Learning-augmented search data structures. In The Thirteenth International Conference on Learning Representations, ICLR, 2025.

Gollapudi, S. and Panigrahi, D. Online algorithms for rentor-buy with expert advice. In Proceedings of the 36th International Conference on Machine Learning, ICML, pp. 2319 2327, 2019.

Grigorescu, E., Lin, Y., Silwal, S., Song, M., and Zhou, S. Learning-augmented algorithms for online linear and semidefinite programming. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems, Neur IPS, 2022.

Learning-Augmented Hierarchical Clustering

Gu enoche, A., Hansen, P., and Jaumard, B. Efficient algorithms for divisive hierarchical clustering with the diameter criterion. Journal of classification, 8:5 30, 1991.

Gupta, A., Panigrahi, D., Subercaseaux, B., and Sun, K. Augmenting online algorithms with $\varepsilon$- accurate predictions. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems, Neur IPS, 2022.

Høgemo, S., Bergougnoux, B., Brandes, U., Paul, C., and Telle, J. A. On dasgupta s hierarchical clustering objective and its relation to other graph parameters. In Bampis, E. and Pagourtzis, A. (eds.), Fundamentals of Computation Theory - 23rd International Symposium, FCT, Proceedings, pp. 287 300, 2021.

Hsu, C., Indyk, P., Katabi, D., and Vakilian, A. Learningbased frequency estimation algorithms. In 7th International Conference on Learning Representations, ICLR, 2019.

Im, S., Kumar, R., Qaem, M. M., and Purohit, M. Online knapsack with frequency predictions. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, Neur IPS, pp. 2733 2743, 2021.

Indyk, P., Vakilian, A., and Yuan, Y. Learning-based lowrank approximations. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS, pp. 7400 7410, 2019.

Izzo, Z., Silwal, S., and Zhou, S. Dimensionality reduction for wasserstein barycenter. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS, 2021.

Jiang, S. H., Liu, E., Lyu, Y., Tang, Z. G., and Zhang, Y. Online facility location with predictions. In The Tenth International Conference on Learning Representations, ICLR, 2022.

Jiang, T., Kearney, P. E., and Li, M. A polynomial time approximation scheme for inferring evolutionary trees from quartet topologies and its application. SIAM J. Comput., 30(6):1942 1961, 2000.

Jiang, T., Li, Y., Lin, H., Ruan, Y., and Woodruff, D. P. Learning-augmented data stream algorithms. In 8th International Conference on Learning Representations, ICLR, 2020.

Khodak, M., Balcan, M., Talwalkar, A., and Vassilvitskii, S. Learning predictions for algorithms with predictions. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems, Neur IPS, 2022.

Kraska, T., Beutel, A., Chi, E. H., Dean, J., and Polyzotis, N. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference, pp. 489 504, 2018.

Lattanzi, S., Lavastida, T., Moseley, B., and Vassilvitskii, S. Online scheduling via learned weights. In Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA, pp. 1859 1877, 2020.

Leskovec, J., Rajaraman, A., and Ullman, J. D. Mining of massive data sets. Cambridge university press, 2020.

Li, Y., Lin, H., Liu, S., Vakilian, A., and Woodruff, D. P. Learning the positions in countsketch. In The Eleventh International Conference on Learning Representations, ICLR, 2023.

Lin, H., Luo, T., and Woodruff, D. P. Learning augmented binary search trees. In International Conference on Machine Learning, ICML, pp. 13431 13440, 2022.

Lucic, M., Faulkner, M., Krause, A., and Feldman, D. Training gaussian mixture models at scale via coresets. J. Mach. Learn. Res., 18:160:1 160:25, 2017.

Lykouris, T. and Vassilvitskii, S. Competitive caching with machine learned advice. J. ACM, 68(4):24:1 24:25, 2021.

Manghiuc, B. and Sun, H. Hierarchical clustering: O(1)- approximation for well-clustered graphs. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, Neur IPS, pp. 9278 9289, 2021.

Mitzenmacher, M. A model for learned bloom filters and optimizing by sandwiching. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, Neur IPS, pp. 462 471, 2018.

Mitzenmacher, M. and Vassilvitskii, S. Algorithms with predictions. In Roughgarden, T. (ed.), Beyond the Worst Case Analysis of Algorithms, pp. 646 662. Cambridge University Press, 2020.

Moseley, B. and Wang, J. R. Approximation bounds for hierarchical clustering: Average linkage, bisecting k-means, and local search. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, pp. 3094 3103, 2017.

Learning-Augmented Hierarchical Clustering

Nguyen, T. D., Chaturvedi, A., and Nguyen, H. L. Improved learning-augmented algorithms for k-means and k-medians clustering. In The Eleventh International Conference on Learning Representations, ICLR, 2023.

Purohit, M., Svitkina, Z., and Kumar, R. Improving online algorithms via ML predictions. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, Neur IPS, pp. 9684 9693, 2018.

Rajagopalan, A., Vitale, F., Vainstein, D., Citovsky, G., Procopiuc, C. M., and Gentile, C. Hierarchical clustering of data streams: Scalable algorithms and approximation guarantees. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML, pp. 8799 8809, 2021.

Roy, A. and Pokutta, S. Hierarchical clustering via spreading metrics. In Advances in Neural Information Processing Systems, pp. 2316 2324, 2016.

Roy, A. and Pokutta, S. Hierarchical clustering via spreading metrics. J. Mach. Learn. Res., 18:88:1 88:35, 2017.

Shin, Y., Lee, C., Lee, G., and An, H. Improved learningaugmented algorithms for the multi-option ski rental problem via best-possible competitive analysis. In International Conference on Machine Learning, ICML, pp. 31539 31561, 2023.

Sneath, P. H., Sokal, R. R., et al. Numerical taxonomy. the principles and practice of numerical classification., 1973.

Snir, S. and Yuster, R. A linear time approximation scheme for maximum quartet consistency on sparse sampled inputs. SIAM J. Discret. Math., 25(4):1722 1736, 2011.

Sotiriou, C., Neo, S.-Y., Mc Shane, L. M., Korn, E. L., Long, P. M., Jazaeri, A., Martiat, P., Fox, S. B., Harris, A. L., and Liu, E. T. Breast cancer classification and prognosis based on gene expression profiles from a populationbased study. Proceedings of the National Academy of Sciences, 100(18):10393 10398, 2003.

Steinbach, M., Karypis, G., and Kumar, V. A comparison of document clustering techniques. Technical report, University of Minnesota, 2000.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR, Conference Track Proceedings, 2014.

Thorndike, R. L. Who belongs in the family? Psychometrika, 18(4):267 276, 1953.

Wang, S., Li, J., and Wang, S. Online algorithms for multishop ski rental with machine learned advice. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, Neur IPS, 2020.

Ward Jr, J. H. Hierarchical grouping to optimize an objective function. Journal of the American statistical association, 58(301):236 244, 1963.

Wei, A. and Zhang, F. Optimal robustness-consistency trade-offs for learning-augmented online algorithms. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, Neur IPS, 2020.

Learning-Augmented Hierarchical Clustering

A. Additional Technical Preliminaries

A.1. Concentration inequalities

We now present the standard concentration inequalities used in our proofs. We start from the following standard variant of Chernoff-Hoeffding bound.

Proposition 13 (Chernoff-Hoeffding bound). Let X1, . . . , Xn be n independent random variables with support in [0, 1]. Define X := Pn i=1 Xi. Then, for every δ (0, 1], there is

Pr (|X E [X]| > δ E [X]) 2 exp δ2 E [X]

Furthermore, for every δ > 0, there is

Pr (|X E [X]| > δ E [X]) 2 exp δ2 E [X]

A.2. Standard results for trees

Fact A.1. Any binary tree T with n leaves contains at most n internal nodes.

Proof. Consider the collection of the internal nodes that are the parents of the leaves. Since the tree is binary, there are at most n/2 such nodes. Contract all these internal nodes with the leaves, and we can obtain a new tree T such that the number of leaves is n/2. As such, we can again count an upper bound for the number of the parents for the leaves in T as n/4. We can continue this until we only have the root, and the number of internal nodes is at most

as desired.

A.3. The PRAM and the Massively Parallel Computation (MPC) Models

We briefly introduce the parallel models we investigate for the Moseley-Wang objective. In particular, we investigated the classical PRAM model and the Massively Parallel Computation model in our work.

The PRAM model. The Parallel Random Access Machine (PRAM) model is a widely used theoretical framework in parallel computing. It provides a simplified abstraction of a parallel computer system, where multiple processors work simultaneously to solve a computational problem. In the PRAM model, each processor has direct access to a common memory space (RAM), and communication between processors and the RAM is instantaneous ( parallel RAM). Processors can read from and write to any memory location in parallel, hence the term Random Access .

In the PRAM model, there are usually two objectives for algorithm designers to optimize: the total work, defined as the total number of elementary operations, and the depth, defined as the length of the longest dependent call in the algorithm. In the theoretical abstract version of PRAM, we do not care about the number of processors we use in the algorithm.

The MPC model. The Massively Parallel Computation (MPC) model is a theoretical framework used to analyze algorithms designed for modern parallel computing architectures, e.g., the Map Reduce framework. In this model, communications are conducted in synchronized rounds, and a machine can communicate with any other machine in a round. Furthermore, each machine can do unlimited local computation between the rounds. Unlike traditional CONGEST models, the communication here is limited only by the memory size a size s machine cannot send or receive more than s bits of information.

For graph problems, suppose each machine has size s, we typically use O(n2/s) machines, where O(n2) is the worst-case input size (alternatively, one can also target the more instance-optimal O(m/s) machines).

Our goal in the MPC model is to minimize two objectives: i). the memory size s of each machine; and ii). the number of parallel rounds. In particular, if our algorithm works with s = O(nδ) memory for any δ (0, 1), we call the algorithm fully scalable in the MPC model. Typically, the best MPC algorithms would ask for fully scalable memory and polylog n rounds.

Learning-Augmented Hierarchical Clustering

A reduction between the PRAM and the MPC algorithms. The PRAM and the MPC models share a great deal of similarities. And indeed, the following reduction is known.

Proposition 14. Suppose there exists a PRAM algorithm that computes a function f with w(n) work and d(n) depth, where n is the input size of f. Then, there exists a fully scalable MPC algorithm that computes f with O(w(n)) total memory and O(d) rounds. The memory per machine can be made O(nδ) for any δ (0, 1), and the number of machines is O( W (n)

nδ polylog n).

A.4. Existing Techniques for Dasgupta s Objective

We discuss some known techniques for Dasgupta s minimization HC objective in this section. We will use these techniques in our hierarchical clustering algorithms for Dasgupta s objective.

OPTIMAL HIERARCHICAL CLUSTERING TREES

We first give an observation that characterizes the composability of HC costs with respect to the edges under Dasgupta s objective.

Observation 1 ((Dasgupta, 2016)). Let G be any graph, and let E1 and E2 be two disjoint subsets of edges in G. For any HC tree T , let cost G(T , E1) and cost G(T , E2) be the HC costs induced by edges in E1 and E2, respectively. Then,

cost G(T ) = cost G(T , E1) + cost G(T , E2).

Observation 1 shows that to bound the total cost of the HC tree under Dasgupta s objective, it suffices to bound the edges split by the internal nodes.

APPROXIMATE HC TREES WITH RECURSIVE BALANCED MIN-CUTS AND SPARSEST CUTS

Dasgupta s work proved that finding the optimal trees for the hierarchical clustering function is NP-hard (Dasgupta, 2016). Consequently, significant attention has been directed towards developing approximation algorithms for efficient hierarchical clustering on the graph. A well-known approach involves obtaining an approximation of the optimal hierarchical clustering by iteratively employing sparsest cuts on the graph, e.g., (Dasgupta, 2016; Deng et al., 2025). Formally, we can define the sparsest cuts and the HC trees created by recursively applying the sparsest cuts on the induced subgraphs as in Definition 15 and Definition 16.

Definition 15 (Sparsest Cuts). For any parameter β such that 0 < β < 1, we say that a cut (A , B ) is a sparsest cut if its sparsity (edge expansion) is minimized, i.e.

w(A , B ) min{|A | , |B |} w(A, B) min{|A| , |B|}

for any cut (A, B) of G.

Definition 16 (Recursive Sparsest Cut Procedure). We say an HC tree T is obtained by the recursive sparsest procedure on G if for each non-leaf node z of T , the cut(T [z]) is obtained by a (possibly approximate) sparsest cut (A, B) on the subgraph induced by T [z]. We call an HC tree obtained by recursively applying (approximate) sparsest cuts on induced subgraphs as a recursive sparsest cut HC tree.

Previous work (see, e.g. (Charikar & Chatziafratis, 2017b; Assadi et al., 2022)) proved that if one applies the procedure in Definition 16, we can get an O(1) approximation of the optimal HC tree.

Proposition 17 ((Charikar & Chatziafratis, 2017b; Assadi et al., 2022)). For any graph G = (V, E, w), let Tsparse be an HC tree obtained by the recursive sparsest cut procedure in Definition 16, there is

cost G(Tsparse) O(1) OPT(G).

Note that finding the exact sparsest cut for Proposition 17 is NP-hard. The first way to circumvent this issue is to use approximation algorithms, especially for the best-known O( log n) approximation for sparsest cut in polynomial time (Arora et al., 2004). The formal guarantee is as follows.

Learning-Augmented Hierarchical Clustering

Proposition 18 ((Arora et al., 2004)). There exists a randomized algorithm that given a graph G = (V, E, w), with high probability, in O(|V |2) time finds a partition A B = V such that

w(A, B) min{|A| , |B|} O( p

log |V |) w(A , B ) min{|A | , |B |},

where (A , B ) is the sparsest cut of G.

We cannot immediately massage Proposition 18 with Proposition 17 since Proposition 17 does not state what will happen for approximate sparsest cuts. Fortunately, by the results in (Charikar & Chatziafratis, 2017b; Assadi et al., 2022), we can indeed obtain an O(α)-approximation algorithm for OPT(G) by recursively applying the α-approximate sparsest cut.

Proposition 19 ((Charikar & Chatziafratis, 2017b; Assadi et al., 2022)). Let T be a recursive sparsest cut tree obtained by recursively applying α-approximation sparsest cuts on the induced subgraphs (as in Definition 16). Then, we have

cost G(T ) O(α) OPT(G).

There is another way to deal with the NP-hardness issue of the sparsest cut. If we can reduce the input size, we can possibly obtain the exact optimal cuts on induced subgraphs of size O(log n). This strategy allows us to leverage our strong partial tree whose unknown clustering is only restricted to the induced subgraphs with O(log n) size.

B. Technical Overview

In this section, we give a high-level overview of our techniques. We also provide intuition on our algorithmic design choices, including a number of potential pitfalls, as well as a number of natural other approaches and why they do not work.

B.1. Why not simply follow the oracle (or other related strategies)?

At first glance, one might wonder whether the splitting oracle trivializes the problem. A natural question is whether it is possible to simply follow the oracle to recover the optimal tree T . Since the oracle only returns the relative information among a triplet of vertices (u, v, w), it is not immediately clear how to translate the answers from the oracle to a partition of vertices. After taking a closer look at the problem, we could observe issues with a handful of straightforward approaches.

The first natural approach is to pretend the oracle is always correct and construct a tree from the splitting-away information between the triplets. Unfortunately, due to the error probability and adversarial answers, there may not exist an underlying tree consistent with the answers to the queries. As such, it is unclear how the algorithm could produce a definitive answer.

The second approach we could try is to frame the problem as a phylogenetic reconstruction problem, e.g., take all the splitting away for triplets as constraints, and try to construct an HC tree that satisfies as many constraints as possible. However, such an approach has two issues: i). by a recent result of (Chatziafratis & Makarychev, 2023), the phylogenetic reconstruction problem is itself UG-hard; and ii). the HC tree we constructed may prioritize a small number of wrong answers from the oracle that happen to induce very large additive error.

A more involved idea is to aggregate the oracle answers to construct the HC tree s partitions. To this end, an algorithm to determine the partition of a vertex v is to fix u in the smaller subtree and look into the number of vertices t V that split away from (u, v). More concretely, consider the split of the tree on the root V (S1, S2), and suppose we know a vertex u that is on the smaller subtree of the root partition (this is a big suppose as we will see later). Then for any vertex v that is in the same subtree of u, we can get many vertices t V from O with the answer t split away from (u, v) . On the other hand, for a vertex v that is in the opposite subtree of u, only a few vertices t from O would answer t V split away from (u, v) . The gap is large enough to apply concentration inequalities and find a separation between the cases. As such, we can recursively apply the above procedure and produce the optimal tree T .

Unfortunately, the above idea only works for the idealized case where we indeed know a vertex u from the smaller part of the root partition. For the general case, the algorithm requires a surprising amount of new ideas and technical work. In particular, note that the aforementioned algorithm faces two major challenges: (1). as the partition goes deep down the HC tree, the sizes of the subtrees become too small for high-probability guarantees; and (2). it is not clear how to find a good vertex u that induces the root cut. To elaborate on challenge (1), note that when the subtree induces o(log n) leaves, it is generally not possible for us to guarantee correctness for the subsequent partitions. As such, we must handle some form

Learning-Augmented Hierarchical Clustering

of ambiguity when dealing with subtrees induced on vertex sets with smaller sizes. Our approach to this challenge is to forgo the guarantees inside each leaf with o(log n) vertices and work with the respective objective functions to show that the additive error is tolerable. In particular, we use the notion of partial hierarchical clustering trees that approximately capture the structure of the optimal HC tree T until the size of the induced vertex set becomes too small. In particular, we require specific structural properties of the costs of HC trees under Dasgupta s and the Moseley-Wang objectives. We provide more details about partial HC trees and how to use them to overcome challenge (1) in Appendix B.2.

Challenge (2) is even trickier and requires more care. Observe that in the example of root cut V (S1, S2), if u is in the bigger side of the partition, the argument may not work. However, since we only have access to the triplet split information, retrieving whether a vertex u is on the smaller side of a particular tree split seems to be too much to ask. In particular, consider an example that the optimal tree T first makes two splits of small subtrees of size n0.99, as illustrated in Figure 3. Here, if we use u2 as the baseline vertex to perform the split, we can still get a valid partition. However, the structure of the obtained tree is very different from T , and the additive error could be huge. Furthermore, since the actual tree T is hidden from us, it is not immediately clear how could we distinguish a partition obtained by using u1 vs. u2. The problem becomes even more intriguing when we want to obtain near-linear time efficiency. We will discuss the intuition and techniques to handle challenge (2) in Appendix B.3.

Figure 3. An illustration of the hard example that the straightforward majority voting does not work. Left: the optimal HC tree T ; Right: the outcome of the tree if we use u2 as the baseline to perform the partition we discussed.

B.2. Partial hierarchical clustering trees and HC

We now delve into more details about the definition of partial trees and how we use them to obtain low additive errors in Dasgupta s and the Moseley-Wang objectives.

The definition of partial hierarchical clustering trees. As discussed, a central element of our techniques is the notion of partial hierarchical clustering (HC) trees. The generic definition of the partial HC tree is similar to the normal HC tree, with the root representing the entire vertex set and the internal nodes representing the subsets of vertices. However, on the leaf level, we allow the leaves of partial HC trees to contain multiple vertices, up to O(log n) many vertices, and contract them into a single leaf, which we call super-vertices . The partial HC tree is then allowed to be oblivious of the clustering inside each leaf node.

A partial HC tree is only useful if it can somehow capture the optimal tree T . To this end, we introduce the strongly consistent and weakly consistent partial HC trees. Roughly speaking, a partial HC tree I is said to be strongly consistent with the optimal tree T if it follows every partition of the optimal tree in a top-down manner until the induced size of vertices is of size less than O(log n), in which case we simply collapse the leaf into a super-vertex. Note that in the strongly consistent partial HC trees, every super-vertex induces a maximal tree in T here, a maximal tree means a tree where its induced vertices are exactly the leaves of their lowest common ancestor. By comparison, the weakly consistent partial HC tree allows vertices that do not necessarily form maximal trees to collapse into a single super-vertex. For instance, if there are O(log n) vertices in multiple subtrees in T , and suppose the LCAs of these subtrees are close to the root and

Learning-Augmented Hierarchical Clustering

form a consecutive segment in T , the weakly consistent partial HC tree can still collapse all of the vertices into a single super-vertex. We provide illustrations of the weakly and strongly consistent partial trees in Figure 1 and Figure 2, and one can refer their formal definition to Definitions 8 and 9 in Section 3.

We will eventually show that, given the splitting oracle, we can efficiently construct both the strongly and the weakly consistent partial trees regardless of the input graph and the objective functions. However, for now, we first discuss why partial HC trees are useful for our HC objectives.

Using partial HC trees for hierarchical clustering. Intuitively, if we can obtain a partial HC tree that is consistent with the optimal tree T , we can build an actual HC tree by fixing the clustering of the super-vertices locally to obtain a good approximation. Indeed, we note that if a partial HC tree is strongly consistent with T , we can straightforwardly obtain approximation algorithms for both Dasgupta s and the Moseley-Wang objectives. For Dasgupta s objective, we can run the optimal or approximate recursive sparsest cuts for the subgraphs induced by the super-vertices. Note that since the subgraphs are only of size O(log n), we can even afford to find the exact optimal recursive sparsest cuts in polynomial time. The case for the Moseley-Wang objective is even easier: since the number of leaves outside each super-vertex is at least n O(log n), any arbitrary partition of the super-vertices can still give us a (1 o(1)) approximation.

Unfortunately, as we will see shortly, the strongly consistent partial tree can only be implemented in O(n3) time a tolerable yet far-from-optimal efficiency. As such, for the Moseley-Wang objective, we further investigate the algorithm that only uses the weakly consistent partial HC trees, which we can build in near-linear time. In this case, for two vertices that are in the same super-vertex, we can replicate the argument for the strongly consistent partial HC trees again to get a (1 o(1)) approximation. However, challenges arise when the two vertices (u, v) are in different super-vertices of the partial HC tree. Here, the number of induced leaves can differ by an O(log n) additive factor, but the number of non-leaves induced by (u, v) can be very small in T (say, o(log n)). As such, the O(log n) difference on the size of non-leaves might lead to infinity multiplicative gap in the revenue, which makes controlling the overall approximation factor hard. To tackle this issue, we prove some new structural results for HC trees under the Moseley-Wang objective: we show for all the edges (u, v) that only induce a very small number of non-leaves and are far away from each other in the optimal tree, the contribution of such edges to the optimal objective can only be an o(1) fraction. As such, we can simply ignore the approximation guarantees on these edges and obtain an (1 o(1)) approximation of the Moseley-Wang objective.

B.3. The construction of the partial HC trees

We now come back to the efficient construction of the partial HC trees that are strongly and weakly consistent with the optimal HC tree. For simplicity, we slightly abuse the notation to let V always denote the vertex set in the high-level discussion, even if we are talking about a subset of vertices3.

B.3.1. STRONGLY CONSISTENT PARTIAL HC TREES

We first discuss the case for the strongly consistent HC tree, which only requires top-down splits. As we have discussed before, for any fixed partition, since we do not have any information about which side a vertex u is on, it is generally very hard for us to know which obtained partition is actually consistent with the split in T . To address this challenge, introduce the small-tree splitting order, which, roughly speaking, is a thought process that recursively draws the smaller side of the subtree in T . For instance, in the tree T prescribed in Figure 3, the first n0.99 vertices form the first small tree V small 1 , the second n0.99 vertices form the second small tree V small 2 , and so on.

We shall show that if u is among the first few small trees in the small-tree splitting order, we can recover the set of vertices as the sibling of the small tree4. For example, if we select u2 in Figure 3, we can recover the subtree on the right but not the subtree that contains u1. Our strategy is as follows: for a fixed vertex u and a vertex v whose split is to be determined, in addition to testing how many vertices t V such that t splits away from (u, v), we also test the number of vertices t V such that v splits away from (u, t). To see why this additional test helps, let us again look at the example in Figure 3. With the additional subroutine, if we use u2 as the fixed vertex, the vertices in V small 1 will split away from many (u2, t) pairs. On the other hand, for every vertex v on the sibling subtree of V small 2 , it splits away from (u, t) only if t V small 2 , which creates a clear signal. By careful handling of cases, we could argue that the algorithm works for general cases as long as u belongs to an early enough small tree.

3In the formal analysis of Appendices G and H, we use e V as the set of vertices of the current recursion level. 4Since sibling is a generic word, we call this set counterpart in our formal description in Appendices G and H to avoid confusion.

Learning-Augmented Hierarchical Clustering

The above strategy provides a new way to identify a good u: it suffices to only look at the size of the set of vertices we recover. In particular, if u is among the root cut, it surely induces the largest size on the set of the recovered vertices. As such, a simple exhaustive search can find such a vertex u and the corresponding set T. Since there are n vertices to be tested, and each test requires O(n2) time, the total time for each partition is at most O(n3). We can then recursively run this procedure, which will lead to a partial tree that is strongly consistent in O(n4) time. Furthermore, using a simple sampling trick, we could reduce the time for each test to O(n log n) time and queries, which brings the total number of time and queries to O(n3). Furthermore, since we only need to maintain counters for each vertex, the entire algorithm can be implemented in O(n log n) space.

B.3.2. WEAKLY CONSISTENT PARTIAL HC TREES

The exhaustive search subroutine in the above idea inevitably leads to Θ(n3) time and queries on the splitting oracle. This gives us a new, and perhaps more intriguing challenge: if we only want to get partial trees that are weakly consistent with the optimal tree, can we improve the efficiency? Note that if we target a near-linear running time, we cannot always hope to get a u from the smaller side of the root partition. For a concrete example, let us look at the tree T in Figure 3 again. Here, before we hit a vertex in the first small tree of size n0.99, we will not be able to produce a root cut. However, by the size of the first small tree, we will need to test at least n0.01 vertices if we sample vertices uniformly at random. The overhead can be further enlarged: suppose the root cut splits the vertices into n 1 vertices and a single vertex, and suppose this process continues for n0.99 levels; then, it is entirely unclear how to avoid the n0.99 overhead.

The vertical split idea. The above hard instance inspires us to resort to vertical splits of the tree that is, instead of finding a vertex u on the split of the root, we use a vertex u that is sufficiently early in the small-tree split order. To elaborate, we can efficiently find a vertex u that is among the union of smaller subtrees that collectively induced at least n/polylog n leaves. In this way, we can still recover a maximal subtree whose induced leaves T is of size at least (n n/polylog n) a size reduction that is significant enough for the entire algorithm to converge in polylog n iterations. Finally, our algorithm will guarantee that V is a composable set from a single maximal tree, which implies that T and V \ T are composable sets, which allow us to recurse on both sides.

The use of horizon sets . There is yet another subtle issue in the above idea: we have to ensure both parts of the split always maintain the weak consistency property. Specifically, it is crucial to maintain super-vertices with an out-degree of at most 2, where each is linked to at most one parent node in T and one sibling node in T . Within our vertical split concept, as T invariably forms a single composable set, achieving this is straightforward. However, in the residual part of the split algorithm here, V \ T sustaining weak consistency becomes notably more complex. There can be two cases for such a guarantee to hold: either a). V \ T itself is a single composable set, which happens when V is a split on the root vertex, or b). V \ T has some orphaned vertices the sibling vertices of the subtree induced by T in V (see Definition 28 for the formal definition).

Our approach to handle both the a) and b) cases is to use a semi-invariant horizon set VH V . The idea here is that instead of finding T on V , we find it on VH, which is roughly defined as the set of vertices for us to find the partition T on before a split on the root node. In particular, suppose the set of orphaned vertices is of relatively small size in V . We can always find a vertex u V such that u is split earlier than the orphaned vertex set in the small-tree split order of VH. Therefore, we can make sure that the set T to be found in the new iteration will include the orphaned vertices in V . In this way, we always keep at most one edge connecting to the sibling of the orphaned vertices in the current iteration. An illustration of the role of the horizon set can be shown as Figure 9 in Appendix H.

The candidate vertex and root test. The above analysis assumed the size of orphaned vertices is relatively small in V . However, what if the size of orphaned vertices becomes large in V ? In this scenario, one of two sub-cases must happen: either we run into a root split, or we have a large set of vertices which is not on the root of VH, but occupies a large fraction among the remaining V . (Note that in case of a root split, the entire set of V \ T is an orphaned set.) The challenge here is that we should update VH in the former case while keep using the same VH in the latter case, which requires us to distinguish the cases.

To this end, we employ a new idea to select a candidate vertex that splits away from the orphaned set to address this challenge. Concretely, since the set of orphaned vertices is sufficiently large, when doing random sampling, we can get a vertex u from the orphaned set. Then, we use VH to test whether there exist vertices that split away from (u , t) for

Learning-Augmented Hierarchical Clustering

sufficiently many t VH. The idea here is that if u is in the orphaned set, and there still exists a vertex u that splits earlier than u in (the small-tree split order of) VH, then u should split away from many (u , t) pairs. On the other hand, if u is on the smaller side of the root cut of VH, there is only a small number of t that any u V can split away from. As such, we can make progress by either identifying the right candidate vertex that splits earlier than u , or by switching the horizon VH and recurse on the root cut.

Merging of two weakly consistent partial trees. In the case of strongly consistent partial trees, the merging of subtrees is very straightforward: since the splits always follow the top-down order of internal nodes, we can easily merge the two subtrees with a common parent node. In the case of weakly consistent partial trees, the story is much more complicated. For the merge to be correct, we have to correctly identify the orphaned subtree in the previous recursion; however, since we only have access to the splitting oracle, and the actual optimal tree structure is hidden from us, it is not immediately clear how could we identify the correct internal node to merge the trees.

Fortunately, we could utilize the good vertex from the previous recursion to identify the orphaned set of vertices V orphan. In particular, let u be the good vertex we used to split the tree; since u also belongs to V orphan, for each vertex v, we can test how many times v splits away from (u, t) for t T, where T is the single maximal tree to be merged in the level of recursion. If v V orphan, such a vertex should not split away from (u, t); otherwise, if v V orphan, v should split away from (u, t). Since the size of T is large enough, we could identify V orphan correctly with high probability, and perform the merge correctly. An illustration for this idea to identify V orphan is in Figure 10 in Appendix H.

The complexity of the algorithm for weakly consistent trees. Similar to the case for the strongly consistent trees, the subroutine that gives the sibling of a small tree for a fixed vertex u takes O(n2) time. However, in the new algorithm, we only need to sample and test O(log n) vertices u for each iteration. Similarly, the root test and the tree merging subroutines both use O(log n) vertices and O(n2) time. Furthermore, since we can roughly reduce the instance size by a (1 1/polylog n) factor every iteration, the entire process converges in polylog n iterations. As such, we could argue that the total running time is O(n2), and the longest chain of dependent calls is polylog n. This would imply a near-linear time offline algorithm and a parallel algorithm with near-linear work and poly-logarithmic depth.

C. Missing Proofs of Section 4 (Dasgupta s Objective)

We give the proofs we skipped in Section 4 in this section.

C.1. Missing proofs of Section 4.1

Formal proof of Theorem 1. With Algorithm 1, we can obtain the poly-time efficiency from Lemma 4.1. Furthermore, since the strong partial tree algorithm of Theorem 10 succeeds with high probability, the approximation guarantee of Lemma 4.2 holds with high probability as well. This concludes the proof.

C.2. Missing proofs of Section 4.2

Proof of Lemma 4.3. The first step of the algorithm that computes the strong partial tree takes O(n3 log n) time and O(n3) queries by Theorem 10. We need to argue that the second step takes at most O(n3 log n) time as well. Note that by Proposition 18, the algorithm, with high probability, runs in O(s2) time and finds an O( log s)-approximation of the sparsest cut (A, B), where s is the input size. In our case, for a single super-vertex of I with |X| 50000 log n, the running time is therefore O(log2 n) only. By Fact A.1, there are at most O(log n) nodes in a binary tree with O(log n) leaves, which implies O(log3 n) running time for a single super-vertex. Finally, with at most O(n) super-vertices in the tree, the second step only takes O(n log3 n) = O(n3 log n) time, as desired.

Proof of Lemma 4.4. Similar to the proof of Lemma 4.2, for any partial tree I, we first partition the edges in to Ecross and Esame based on whether the edge (u, v) E crosses different partial trees, i.e.,

1. (u, v) Ecross iff u X and v Y for some super-vertices X = Y in I.

2. (u, v) Esame iff u, v X for some super-vertex X in I.

Learning-Augmented Hierarchical Clustering

Again, by using Observation 1, we can show that OPTDas = cost G(T ) = cost G(T , E1) + cost G(T , E2). The costs w.r.t. to E1 and E2 are therefore as follows.

1. For Ecross, we have that cost G(T , Ecross) = cost G(T , Ecross) by using the same argument of Lemma 4.2.

2. For Esame, we argue that cost G(T , Esame) O( log log n) cost G(T , Esame). Formally, since algorithm in Proposition 18 finds an O( log s)-approximation for each super-vertex X, we can use Proposition 17 on G[X] to argue that the cost G(T , Esame[X]) O( log log n) cost G(T , Esame[X]) since s = O(log n). Therefore, we can apply the same argument to every super-vertex to get the desired approximation factor.

We now use Observation 1 again on Ecross and Esame to bound that

cost G(T ) = cost G(T , Ecross) + cost G(T , Esame)

cost G(T , Ecross) + O( p

log log n) cost G(T , Esame)

log log n) (cost G(T , Ecross) + cost G(T , Esame))

log log n) cost G(T ) = O( p

log log n) OPTDas,

as desired.

D. Missing Details of Section 5 (Moseley-Wang Objective)

We now prove the approximation guarantee of Theorem 3. To this end, we will show the following technical lemma that lower bounds the number of non-leaves between T and T . Lemma D.1. Let T be a hierarchical clustering tree obtained by Algorithm 3, and let u, v V be any two vertices. Then, conditioning on the high probability event that I is a partial tree that is weakly consistent with T , there is

If u and v are in the same super-vertex X of I, then, there is

|non-leaves T [ LCAT (u, v) ]| n 50000 log n.

If u and v are in different super-vertices X and Y of I, then, there is

|non-leaves T [ LCAT (u, v) ]| |non-leaves T [ LCAT (u, v) ]| 50000 log n.

Furthermore, if leaves T [ LCAT (X) ] leaves T [ LCAT (Y ) ] = ,

then, we additionally have

|non-leaves T [ LCAT (u, v) ]| = |non-leaves T [ LCAT (u, v) ]| .

Proof. We prove the two cases separately as follows.

u and v are in the same super-vertex X of I. By our construction, every leaf that is outside the subtree induced by leaves T [ LCAT (X) ] counts as a non-leave of (u, v). As such, since |X| 50000 log n, it is straightforward to get that

|non-leaves T [ LCAT (u, v) ]| n 50000 log n.

u and v are in different super-vertices X, Y of I. Suppose w.log. that u X and v Y . By the subtree preserving property, we have leaves T [ LCAT (u, v) ] = leaves T [ LCAT (X Y ) ]. We now discuss further two sub-cases:

a). If leaves T [ LCAT (X) ] and leaves T [ LCAT (Y ) ] are disjoint. In this case, we have exactly leaves T [ LCAT (X Y ) ] = leaves T [ LCAT (u, v) ], which implies leaves T [ LCAT (u, v) ] = leaves T [ LCAT (u, v) ]. Therefore, we have

|non-leaves T [ LCAT (u, v) ]| = |non-leaves T [ LCAT (u, v) ]| .

This proves the furthermore part of the second case.

Learning-Augmented Hierarchical Clustering

b). If leaves T [ LCAT (X) ] and leaves T [ LCAT (Y ) ] have intersections. In this case, note that by the weak contraction property, we must have inclusion relationship between leaves T [ LCAT (X) ] and leaves T [ LCAT (Y ) ]. Suppose without loss of generality, that leaves T [ LCAT (X) ] leaves T [ LCAT (Y ) ], the differences between leaves T [ LCAT (X Y ) ] and leaves T [ LCAT (u, v) ] is at most the set of X. Since |X| 50000 log n, we have

|non-leaves T [ LCAT (u, v) ]| |non-leaves T [ LCAT (u, v) ]| 50000 log n,

as desired.

This concludes the proof of Lemma D.1.

To prove the approximation guarantee of Theorem 3, we will need a handful of structural observations for the optimal HC trees in the Moseley-Wang objective. We first observe that for any optimal HC tree T necessarily has a monotone edge weight property between an internal node and its descendants.

Other leaves

Figure 4. An illustration of the edge weights and leaves described in Claim D.2.

Claim D.2. Let z1 and z2 be any two internal nodes of the optimal HC tree T under the Moseley-Wang objective, and let z2 be a descendant node of z1. We use w1 and w2 to denote the total weights of the edges induced by z1 and z2, respectively. Furthermore, let w3 be the total weights of the edges induced on the internal nodes between z1 and z2.

Suppose the partition induced by z1 is S (A, S \ A), and let B and C be the set of vertices induced by z2 such that B C S \ A. Furthermore, let D = (S \ A) \ (B C), and suppose max{|B| , |C|} |A|. Then, there is

|D| + 1) w3 |A| + |D| w2 max{|B| , |C|}.

An illustration of the edges and leaves used in the statement can be found in Figure 4.

Proof. The claim is similar in spirit to the switching lemma under Dasgupta s objective as proved in (Høgemo et al., 2021). Suppose w.log. that |B| |C|. For the purpose of this proof (and also that of Claim D.3), we let rev G(T , E1) be the revenue

Learning-Augmented Hierarchical Clustering

induced by a subset of edges E1 with T being the HC tree of G. We further observe some useful relationships between the weights w1, w2, w3 and the weights w(A, B), w(A, C), w(A, D), w(B, C), w(B, D), and w(C, D) as follows.

w1 = w(A, D) + w(A, B) + w(A, C) w2 = w(B, C) w3 w(B, D) + w(C, D).

We first prove a self-contained structural claim that the edge weights between A and B cannot be too much bigger than w3. More formally, the claim is as follows.

Claim D.3. Let A, B, C, and D be the set of vertices and w1, w2, and w3 be the edge weights as prescribed in Claim D.2. Furthermore, let E(A, B) be the edges between A and B, and w(A, B) be the weights of E(A, B). Then, there is

w(A, B) |A|

Proof. As consistent with the proof of Claim D.2, we assume w.log. that |B| |C|. Let us construct a tree T (1) based on T by switching the subtrees induced by A and D. Note that in such change of HC tree, the only edges that will have a changed revenue are E(A, B), E(A, C), E(A, D), E(B, C), and edges accounted by w3, which we call E3 (including but not limited to E(B, D) and E(C, D)), and edges in. We list these changes as follows.

1. E(A, B): we have that rev G(T (1), EA,B) rev G(T , EA,B) |D| w(A, B).

2. E(A, C): we have that rev G(T (1), EA,C) rev G(T , EA,C) |D| w(A, C).

3. E(A, D): we have that rev G(T (1), EA,D) rev G(T , EA,D) = 0.

4. E(B, C): we have that rev G(T (1), EB,C) rev G(T , EB,C) = 0.

5. E3: we have that rev G(T (1), E3) rev G(T , E3) w3 |A|.

To maintain the optimality of T , there should be rev G(T (1)) rev G(T ) 0. As such, we have

0 rev G(T (1)) rev G(T )

= rev G(T (1), EA,B) rev G(T , EA,B) + rev G(T (1), EA,C) rev G(T , EA,C)

+ rev G(T (1), EA,D) rev G(T , EA,D) + rev G(T (1), EB,C) rev G(T , EB,C)

+ rev G(T (1), E3) rev G(T , E3)

(w(A, B) + w(A, C)) |D| w3 |A| .

As such, we can move the terms around, and obtain that

w(A, B) w(A, B) + w(A, C) |A|

as desired. Claim D.3

We now use Claim D.3 to prove Claim D.2. We again construct a tree T (2) based on T by switching the subtrees induced by A and B (note that this is different from the proof of Claim D.3). Observe again that only the edges that have different revenue contribution in T (2) vs. T are E(A, B), E(A, C), E(A, D), E(B, C), and edges accounted by w3. We again list all the changes of revenue induced on these edges.

1. E(A, B): we have that rev G(T (2), EA,B) rev G(T , EA,B) = 0.

2. E(A, C): we have that rev G(T (2), EA,C) rev G(T , EA,C) (|B| + |D|) w(A, C).

3. E(A, D): we have that rev G(T (2), EA,D) rev G(T , EA,D) |B| w(A, D).

4. E(B, C): we have that rev G(T (2), EB,C) rev G(T , EB,C) (|A| + |D|) w(B, C).

Learning-Augmented Hierarchical Clustering

5. E3: we have that rev G(T (2), E3) rev G(T , E3) |A| w3.

Therefore, by the optimally of T , there should be rev G(T (2)) rev G(T ) 0. As such, we have

0 rev G(T (2)) rev G(T )

= rev G(T (2), EA,B) rev G(T , EA,B) + rev G(T (2), EA,C) rev G(T , EA,C)

+ rev G(T (2), EA,D) rev G(T , EA,D) + rev G(T (2), EB,C) rev G(T , EB,C)

+ rev G(T (2), E3) rev G(T , E3)

(|B| + |D|) w(A, C) + |B| w(A, D) (|A| + |D|) w(B, C) w3 |A| .

As such, by moving the terms around, we can get that

|B| (w(A, C) + w(A, D)) (|B| + |D|) w(A, C) + |B| w(A, D)

(|A| + |D|) w(B, C) + |A| w3.

We can use the observation that w1 = w(A, B) + w(A, C) + w(A, D) to obtain that

|B| w1 = (w(A, C) + w(A, B) + w(A, D))

(|A| + |D|) w(B, C) + |A| w3 + |B| w(A, B)

by adding |B| w(A, B) on both sides. Now, we can apply Claim D.3 to obtain that

|B| w1 (|A| + |D|) w(B, C) + |A| w3 + |B| |A|

(|A| + |D|) w(B, C) + |B| w3 + |B| |A|

|D| w3 (using |A| |B|)

Note that w(B, C) = w2. As such, the above implies that

|B| w1 (1 + |A|

(|A| + |D|) w2.

Moving the turns around in the above inequality gives us the desired bound. Claim D.2

We now use Claim D.2 to show that the set of edges (u, v) such that

a). has at most O(log2 n) non-leaves in T ; and

b). let X and Y be corresponding super-vertices that contain u and v in a weakly consistent partial tree I; there is leaves T [ LCAT (X) ] leaves T [ LCAT (Y ) ] = .

can contribute to at most an o(1) fraction of the optimal cost. This means that our estimation with non-leaves using Lemma D.1 would lead to a good approximation. The formal statement is as follows.

Lemma D.4. Let I be an arbitrary partial HC tree that is weakly consistent with the optimal HC tree T under the Moseley-Wang objective. Define Elow(I) V V as the edges such that for any (u, v) Elow(I), there is

a). |non-leaves T [ LCAT (u, v) ]| 50000 log2 n.

b). Let X and Y be corresponding super-vertices that contain u and v in I; there is

leaves T [ LCAT (X) ] leaves T [ LCAT (Y ) ] = .

Learning-Augmented Hierarchical Clustering

Then, for sufficiently large n, the total contribution of revenue from the edges in Elow(I) is at most 50000 log4 n

n fraction of OPTMW, i.e.,

e=(u,v) Elow(I) w(e) |non-leaves T [ LCAT (u, v) ]| O(log4 n

Proof. For any two super-vertices X and Y in the partial HC tree I, by the weak contraction property, if leaves T [ LCAT (X) ] leaves T [ LCAT (Y ) ] is not empty, the only possible case is to have inclusion relationships between the two sets of leaves. Suppose w.log. that leaves T [ LCAT (X) ] leaves T [ LCAT (Y ) ]. Let Y be the set of vertices induced by the sibling node of X in I, and it is straightforward to see that Y Y . We further let Y be the larger immediate child of the tree induced by Y .

By the conditions of i). |non-leaves T [ LCAT (u, v) ]| 50000 log2 n, ii). |X| 5000 log n, and iii). the inclusion relationship between the leaves, there is

Y n 50000 (log2 n + log n); |Y | n 50000 (log2 n + log n)

We now use Claim D.2 inductively to bound the weights of w(E1) for the sets of edges E1 Elow that are split by some internal nodes of X. Let E2 be the set of edges that is split in the subtree induced by Y , and let X be vertices that are split by E1 and as the sibling of Y . Our induction hypothesis is that

w(E1) < C log n w(E2) n 50000 (log2 n + log n)

for some absolute constant C.

To prove this statement, we first look at the base case. By the size bound on X, it is straightforward to see that

|X| X 50000 log n.

In the base case, we pick edges E1 Elow such that E2 is split immediately after E1, i.e., E1 are the edges between X and Y . Now, we can use Claim D.2 with A X, B Y , and D (w3 = 0) to argue that

|X| w(E1) X < w(E2)

By the size upper bound of X and the size lower bound of Y , the above inequality implies that

w(E1) 50000 log n < 2 w(E2) n 50000 (log2 n + log n) < C log n w(E2) n 50000 (log2 n + log n)

for any C > 2, which proves the base case.

For the inductive step, let us suppose the statement holds until some internal node z, and we look into E1 Elow that is induced on pa (z). We again use Claim D.2 by letting A X, B Y , and D X \ X. Define E3 as the set of edges that are split between z and LCAT ( Y ). By the induction hypothesis, for every node between z and LCAT ( Y ), the total induced weights is at most

C log n w(E2) n 50000 (log2 n + log n).

Furthermore, since |X|

| X| |X| 50000 log n, and z and pa LCAT ( Y ) are in the same X of I, we can bound the total

weights in E3 as follows

w(E3) 50000 log n C log n w(E2) n 50000 (log2 n + log n).

Learning-Augmented Hierarchical Clustering

Furthermore, by the size upper bound of X, we have |A| + |D| 50000 log n in Claim D.2. Combining the above gives us that for sufficiently large n, we have

|X| w(E1) (50000 log n + 1) w(E3)

|X| w(E1) ( |X|

| X| + 1) w(E3)

|X| < w(E2)

where the first inequality follows from the fact that w(E3) = O( log2 n

n ) w(E2), the second inequality follows from that

|X| / X |X| 50000 log n, and the last inequality follows from Claim D.2. Therefore, we can again obtain that

w(E1) 100000 log n < 4 w(E2) n 50000 (log2 n + log n),

which means w(E1) < C log n w(E2) n 50000 (log2 n+log n) for a sufficiently large constant C (C = 400000 suffices). This concludes our inductive proof for the weight bound of any E1 Elow in any X.

Using w(E1) C log n n 50000 (log2 n+log n) w(E2) for any (u, v) Elow, we note that a trivial lower bound of the optimal revenue is

OPTMW = Ω(log n w(E2)),

since this is the revenue induced on the edge set E2 only. On the other hand, since we need the inclusion relationship between the leaves, and by the fact that the number of non-leaves is at most log2 n, there are at most 50000(log2 n + log n) edges in Elow, i.e.,

|Elow| 50000(log2 n + log n).

Therefore, the total contribution of revenue by Elow is at most

e=(u,v) Elow(I) w(e) |non-leaves T [ LCAT (u, v) ]|

e=(u,v) Elow(I) w(e) log2 n (by the number of non-leaves)

max{w(E1) | E1 Elow} 50000(log2 n + log n) log2 n (uniform upper bound)

C log n n 50000 (log2 n + log n) w(E2) 50000(log2 n + log n) log2 n

(by the relationship between w(E2) and w(E1) for E1 Elow)

n OPTMW), (by the lower bound of OPTMW)

as desired.

Finalizing the proof of Theorem 3. We have discussed that the algorithm enjoys O(n2) running time and O(n2) query efficiency. For the approximation guarantee, note that each edge (u, v) gains a revenue of wu,v |non-leaves T [ LCAT (u, v) ]| in the optimal tree T . For edges (u, v) Elow, let Esame high be the set of edges where u and v are in the same super-vertex, and Ediff high be the set of edges where u and v are in different super-vertices. We now have

For edges in Esame high , we have |non-leaves T [ LCAT (u, v) ]| n 50000 log n that by Lemma D.1. On the other

Learning-Augmented Hierarchical Clustering

hand, any vertex pair has at most n non-leaves in the graph. Therefore, we have that

rev G Esame high (T ) = X

e Esame high w(e) |non-leaves T [ LCAT (u, v) ]|

e Esame high w(e) n 50000 log n

e Esame high w(e) n (1 50000 log n

(1 50000 log n

n ) rev G Esame high (T ).

For edges in Ediff high, we have

|non-leaves T [ LCAT (u, v) ]| |non-leaves T [ LCAT (u, v) ]| 50000 log n

(1 O( 1 log n)) |non-leaves T [ LCAT (u, v) ]| ,

where the first inequality follows from Lemma D.1, and the second inequality is from the fact that |non-leaves T [ LCAT (u, v) ]| 50000 log2 n for every (u, v) Elow. As such, we have that

rev G Ediff high(T ) = X

e Ediff high

w(e) |non-leaves T [ LCAT (u, v) ]|

1 O( 1 log n) X

e Ediff high

w(e) |non-leaves T [ LCAT (u, v) ]|

= 1 O( 1 log n) rev G Ediff high(T ).

Therefore, by additionally using Lemma D.4, we have that

rev G(T ) rev G Esame high (T ) + rev G Ediff high(T )

(1 O( 1 log n)) rev G Esame high (T ) + rev G Ediff high(T )

(1 O( 1 log n)) rev G Esame high (T ) + rev G Ediff high(T )

(1 O( 1 log n)) 1 O(log4 n

n ) rev G(T ) (using Lemma D.4)

(1 o(1)) rev G(T ),

as desired. Remark 20. We can observe that if we run Algorithm 3 with the strongly consistent partial HC tree, we can get a similar (and even stronger) approximation guarantee, albeit with worse efficiency ( O(n3) time). Concretely, note that if we use the strongly consistent partial HC tree, the additive error again only happens on Esame, and we do not need Lemma D.4 to bound the contributions of the edges in Elow(I) (since = ). In this way, we can further decrease the o(1) term to O( log n

n ). However, the sacrifice of the running time is too significant, and we skip the details of this algorithm.

E. Missing Details of Section 6 (Sublinear Algorithms)

We discuss the analysis of the algorithms and their analysis for the rest of this section.

E.1. A single-pass semi-streaming HC algorithm for Dasgupta s objective

We present our algorithm for the semi-streaming algorithm in this part. To begin with, we need to define the model for streaming hierarchical clustering with the splitting oracle.

Learning-Augmented Hierarchical Clustering

Graph streaming with offline splitting oracle. We focus on the (dynamic) graph streaming model with the offline splitting oracle. In this model, the edges of the graph are inserted and deleted (together with their edge weights), and the algorithm is asked to output an HC tree by the end of the stream. Additionally, the algorithm is given an offline splitting oracle O before the graph stream, and the algorithm is allowed to make unlimited computations before the stream starts. The total memory cost is the total number of bits the algorithm maintains at any point, including those dedicated to edge representation and those generated through offline computations.

For polynomial-time efficiency, we assume that the stream itself is of poly(n) length since otherwise, the stream itself will take super-polynomial time to complete. Under the above model and setting, the formal statement of our result is as follows.

Theorem 21. There exists a single-pass (dynamic) streaming algorithm that given a weighted undirected graph G = (V, E, w) in a poly(n)-length dynamic stream and an offline splitting oracle O, with high probability, in O(n log3 n) (bits) space and polynomial time computes a hierarchical clustering tree T such that cost G(T ) O(1) OPTDas(G), where OPTDas(G) is the cost of the optimal hierarchical clustering tree T , i.e., OPTDas(G) = cost G(T ).

Proof. The algorithm is to simply first construct a strong partial tree I with the algorithm of Definition 8 before the stream starts, and store edges only inside the same super-vertices during the stream. By the end of the stream, we compute recursive sparsest cuts on the vertices induced on super-vertices of I, and output in the same manner of Algorithm 1. The formal description is as Algorithm 4.

We first prove the time and space efficiency. Essentially, Algorithm 4 computes a strong partial tree I before the stream starts, and only main edges inside each super-vertex of O(log n) size. By Theorem 10, the pre-processing part takes polynomial (O(n3 log n)) time and O(n log n) space. Furthermore, for each super-vertex, we maintain at most O(log2 n) edges; each edge could have at most poly(n) updates, which means an additive space of O(log n) bits suffices for each edge. Therefore, we record at most O(log n) bits for each edge. There are at most n super-vertices, which means the algorithm maintains the information of O(n log2 n) edges, which takes O(n log3 n) bits. After the stream, we partition the super-vertices with the edges stored, and write down the rest of the HC tree. The time efficiency of the post-processing part is exactly as Lemma 4.1.

For the approximation guarantee, note that we are essentially simulating Algorithm 1 with the same input and output guarantees. As such, we can simply use Lemma 4.2 to argue the approximation guarantee.

E.2. Parallel HC algorithms for Moseley-Wang Objective

We now move the PRAM hierarchical clustering algorithm for the Moseley-Wang objective with near-linear O(n2) work and polylog (n) depth. The formal statement of the algorithm is as follows.

Theorem 22. There exists a PRAM algorithm that given a weighted undirected graph G = (V, E, w) and a splitting oracle O, with high probability, in O(n2 polylog n) work and polylog n depth computes a hierarchical clustering tree T such that rev G(T ) (1 o(1)) OPTMW(G), where OPTMW(G) is the revenue of the optimal hierarchical clustering tree T , i.e., OPTMW(G) = rev G(T ).

Proof. The algorithm is to run the PRAM weak partial tree algorithm in Corollary 12 to obtain I, and arbitrarily partition the vertices in the super-vertices of I. The formal description of the algorithm is as Algorithm 5.

By Corollary 12, the first step that computes the weak partial tree I only takes O(n2) work and polylog n depth. For the second step, since the partition is arbitrary, we can simply perform arbitrary balanced cuts on the vertices for each super-vertex X. For an individual X, this procedure can be done in O(log n log log n) work and O(log log n) depth. By accounting for all the super-vertices, we blow up the work by at most an O(n) factor and the depth remains the same since we can partition super-vertices in parallel. Therefore, the second step takes O(n log2 n) work and O(log log n) depth. Therefore, the entire procedure takes O(n2) work and polylog n depth.

The output of Theorem 22 follows exactly the same rules of Algorithm 3; therefore, the approximation guarantee follows from Theorem 3.

Remark 23. By the reduction of Proposition 14, the result of Theorem 22 also implies a fully-scalable Massively Parallel Computation (MPC) algorithm that computes the HC tree T with (1 o(1)) OPTMW(G) revenue in O(n2) total memory and polylog n depth. The memory on each machine here is allowed to be O(nα) for any α (0, 1), and we use O(n2 α) total machines in the MPC algorithm.

Learning-Augmented Hierarchical Clustering

F. Preliminaries for Partial HC Tree Algorithms

We will discuss the construction of partial HC trees for the rest of this paper (Appendices F to H). In this section, we first introduce the several notions that are essential in the analysis of the partial HC tree algorithms. We use these notions extensively in the analysis of both Theorem 10 (Appendix G) and Theorem 11 (Appendix H).

In the high-level overview, we slightly abused the notation to use V to denote the subset of vertices. In our formal description of the algorithms, we will use e V V as the set of vertices of the current recursion step, and use n = e V as the size of the set.

F.1. Composable vertex sets and restricted subtrees

To continue, we first introduce the notions of composable vertex sets for a graph and the HC tree restricted to the sets.

Definition 24 (Composable vertex sets). Let G = (V, E) be a n-vertex graph and T be a hierarchical clustering of G. For a subset e V V , we say e V is a composable (vertex) set of (G, T ) if e V can be written in a disjoint union of the leaves of maximal trees, i.e., a union of vertices e V = i e Vi, such that all e Vi satisfies that

leaves T [ LCAT (e Vi) ] = e Vi.

In the special case, we say e V is a single composable (vertex) set if there is only one such maximal tree, i.e., leaves T [ LCAT (e V ) ] = e V .

In other words, if a vertex set e V is composable, it means if we only look at the leaves of e V , they still form subtrees of T . An illustration of composable sets can be found in Figure 5.

Single Composable Sets (Vertices of Maximal Subtrees)

General Composable Sets but Nota Maximal Subtree

Figure 5. An illustration of the composable vertex sets and maximal trees as in Definition 24.

We can define the HC tree restricted to composable subsets of vertices as follows.

Definition 25 (Hierarchical clustering tree restricted to subset). Let G = (V, E) be a n-vertex graph and T be a hierarchical clustering of G. For any composable subset S V such that |S| 2, we call T (S) as T restricted to S if T (S) is a new binary tree constructed with the following process

Learning-Augmented Hierarchical Clustering

1. Extract T from T by taking the induced subtree of the internal node LCAT (S).

2. Remove all subtrees whose leaves contain only vertices not in S from T .

3. If there exists an internal node x that only has a single child, contract x and its child to one node. Repeat until all internal node has two children nodes.

Note that the algorithm in Definition 25 is a thought process, and it only serves the purpose of analysis. The third line eventually terminates since there exists at least one subtree with two leaves for any composable set S with at least two vertices. An illustration of HC trees restricted to subsets can be shown in Figure 6.

Figure 6. An illustration of the HC tree T to be restricted on a subset of vertices S.

We now provide the following observation that the new tree T (S) preserves the relative order of split away vertices of T .

Observation 2. For any triplet of vertices (u, v, w), the orders of split-away for (u, v, w) are the same in T (S) and T , i.e., w splits away from (u, v) in T (S) if and only if w splits away from (u, v) in T .

Proof. We first observe that the orders of split away for any triplet (u, v, w) are the same in T and T since T is a subtree of T . Furthermore, since S is a conposeable set, every subtree we remove from T is necessarily a maximal subtree, i.e., fix the removed vertex set S, the leaves of LCAT (S) is S itself.

Let x be the lowest common ancestor of (u, v) in T , and x be the lowest common ancestor of (u, v, w) in T . By our definition, we have level T (x ) > level T (x). In T (S), if none of x and x is contracted in line 3, then the order of splits away trivially remains the same as in T .

On the other hand, if at least one of x and x is contracted, and let the new internal nodes be y and y , we claim that there is still level T (S)(y ) > level T (S)(y). This is simply because every removed tree is a maximal tree, and the only case that y and y are merged is that the induced vertices of y become empty, which contradicts the fact that (u, v) is not removed. Therefore, the split order between (w, u, v) is the same as in T , which in turn is the same in T .

F.2. Small-tree splitting order

We now introduce the following notion of small-tree split order, which we frequently use in our analysis.

Definition 26 (Small-tree split order). Consider any subset S V and the hierarchical clustering tree T (S) restricted to S, and consider the following process that divides the leaves of S:

Learning-Augmented Hierarchical Clustering

1. Starting from the root r, let the split be S (S1 l , S1 r), and assume w.log. S1 l S1 r . Let V small 1 be the vertices (leaves) induced by the smaller subtree S1 l .

2. Starting from the lowest common ancestor of S1 r, recursively define V small ℓ as the vertices contained in the smaller subtree of level ℓ(by splitting from the larger subtree of level ℓ 1).

For the convenience of notation, we define V small ℓ = when ℓis larger than the depth of the tree. For any integer N [0, n H], we can define the first N vertices in the small-tree split order by taking the first N vertices in the order of V small 1 , V small 2 , etc.. Inside each set V small i , we pick vertices in an arbitrarily fixed order. We use the notation V small( N) to denote the first N vertices in the small-tree split order, and we use V small( N) to denote the last N vertices in the small-tree split order.

Intuitively, we can think of the small-tree split order as we always write the smaller tree on the left-hand side, and recurse on the larger tree for this procedure, then take leaves from the left-most vertices. Note that every vertex has to belong to V small i for some i. An illustration can be found in Appendix F.2.

Based on the notion of small-tree split order, we can define the notion of induced leaves of the first or last N vertices in the small-tree split order as follows.

Definition 27 (Induced leaves of the N first split vertices). Let T (S) and S be the hierarchical clustering tree and the set of leaves. For any integers N [0, n H], we define the induced leaves of V small( N) as the union of the ℓ i=1V small i , where ℓis the maximum level such that V small ℓ contains at least one vertex in V small( N).

An illustration of the induced leaves in Definition 27 can be found in Appendix F.2.

Figure 7. An illustration of the notion of small-tree split order Definition 26 and the Induced leaves of the N first split vertices Definition 27.

G. The Algorithm for the Strongly Consistent Partial HC Tree: Proof of Theorem 10

We now give an algorithm for partial trees that are strongly consistent with the optimal HC tree T . We first remind the readers of our main result on the algorithm for strongly consistent partial HC trees as follows.

Theorem 10. There exists an algorithm that given a splitting oracle O of a weighted undirected graph G = (V, E, w), with high probability, in O(n3 log n) time and O(n3) queries computes a partial hierarchical clustering tree I that is strongly consistent with the optimal hierarchical clustering tree T . Furthermore, the algorithm has the following properties.

i). The runtime of the algorithm is deterministic, and the high probability randomness is over the correctness guarantee.

ii). The algorithm can be implemented in O(n log n) space.

We refer the readers to Appendix B for a high-level overview of the algorithm. We directly give the algorithm and the analysis in this section.

Learning-Augmented Hierarchical Clustering

The algorithm. We introduce our main algorithm for the construction of strongly consistent partial HC trees. To begin with, we give a helper function counterpart-tester-strong as follows. As we have discussed in the high-level overview, this function tests the sibling vertices of a small tree that is early enough in the small-tree splitting order.

Algorithm 6 counterpart-tester-strong(O,Vinput)(u, t, threshold1, threshold2): an algorithm to test whether t is among the counterpart of u Input: Vertex set Vinput; the splitting oracle O; baseline vertex u; test vertex v; threshold1 (0, n), threshold2 (0, n) Output: Whether v is a counterpart of u Initialize counters c1 0, c2 0 for Every vertex t Vinput do

If v splits away from (u, t), increase c1 by 1 If t splits away from (u, v), increase c2 by 1 end if c1 threshold1 and c2 threshold2 then

Return v is a counterpart of u . end

Our main algorithm for the strongly consistent partial HC tree construction is as follows.

Algorithm 7 strong-partial-tree O(e V ): an algorithm for strongly consistent partial tree

Input: Vertex set e V of size n; the splitting oracle O; parameter ε < 1 a sufficiently small constant Output: A partial tree Te V that is strongly consistent with T (e V )

if e V 50000 log n then Return a super-vertex end for Each u e V do

Initialize T Sample a set S of s = 20 log n/ε2 vertices from e V for v e V do

Run counterpart-tester-strong(O,S)(u, v, (3/5 ε) s, (1/6 ε) s) Add v to T if v is a counterpart of u end Record the size |T| for this choice of u end Pick (T , e V \ T ) such that T T is with the largest size (breaking ties arbitrarily) Recursively call

Te V \T strong-partial-tree O(e V \ T ) TT strong-partial-tree O(T ).

Connect the two trees with a common ancestor as the root

We first observe that the algorithm takes at most O(n3) queries to O and O(n3/ε2) time. The formal statement and analysis are as follows.

Lemma G.1. The algorithm strong-partial-tree O(V ) takes at most O(n3) queries to O and O(n3 log n/ε2) running time.

Proof. The O(n3) query upper bound is trivial since there are at most n 3 such comparisons, and the algorithm can store the answers for reuse. For the running time, we claim that each recursive call on strong-partial-tree O(e V ) with e V = n vertices takes at most O( n2 log n) time. To see this, note that each call of Line 7 takes at most O(log n/ε2) time,

and there are at most O( n2) calls of the counterpart-tester-strong function. By Fact A.1, there are at most O(n) such recursion calls induced on internal nodes, which results in at most O(n3 log n/ε2) running time.

We now prove the correctness of the algorithm. To this end, we first establish the split-away relationships between the vertex set e V and the sampled set S.

Lemma G.2. Consider quantities c1, c2, c1, and c2 obtained by the following processes of running Algorithm 6:

Learning-Augmented Hierarchical Clustering

Let c1 and c2 be the counter returned by counterpart-tester-strong(O,S)(u, v, (3/5 ε) s, (1/6 ε) s) (i.e., by running line 7).

Let c1 and c2 be the counter obtained by running counterpart-tester-strong(O,e V )(u, v, 3 n/5, n/6).

Then, with high probability, the following statements are true.

If c1 3 n/5 and c2 n/6 , then c1 (3/5 ε) s and c2 (1/6 ε) s.

If c1 < (3/5 2ε) n and c2 < (1/6 2ε) n, then c1 < (3/5 ε) s and c2 < (1/6 ε) s.

Proof. The lemma is a direct application of Chernoff bound, and we prove the quantities for c1 and c1 only since the relationships between c2 and c2 follow from the same argument. For each fixed pair of (u, v) as the inputs to Algorithm 6,

let T e V be the set of vertices in e V reporting v splits away from (u, t) . It is clear that T = c1. Furthermore, define Xt

as the indicator random variable for the sampled vertex t T S and X as the total number of answers reporting v splits away from (u, t) in S. We have E [X] = P

x S Pr(Xt = 1) = s c1

If c1 3 n/5, we have that E [X] 3s

5 , and X is summation of independent indicator random variables. Therefore, by Chernoff bound we have

5 ε) s Pr (X E [X] < ε s) exp 2 ε2s2

exp ( 40 log n) 1

where the third inequality is by the choice that s = 20 log n/ε2. On the other hand, if c1 < (3/5 2ε) n, we have that E [X] < (3/5 2ε) s. Therefore, by Chernoff bound, we have

Pr (X E [X] ε s) exp 2 ε2s2

exp ( 40 log n) 1

where the third inequality is by the choice that s = 20 log n/ε2.

Using Lemma G.2, we now present the following key lemma for the behavior of the counterpart-test of Algorithm 7.

Lemma G.3. For any composable e V such that e V 50000 log n, let ℓ be the maximal level that V small ℓ contains a vertex

in V small( n/5). With high probability, Line 7 in Algorithm 7 satisfies the following properties.

a). For any vertex u i=ℓ +1V small i , there is

No vertex v i=ℓ +1V small i can be added to T by Line 7 of Algorithm 7.

No vertex v ℓ 1 i=1 V small i can be added to T by Line 7 of Algorithm 7.

b). For every level ℓ ℓ and vertices u V small ℓ , with high probability, there is

No vertex v V small ℓ can be added to T by Line 7 of Algorithm 7. No vertex v V small k for k < ℓcan be added to T by Line 7 of Algorithm 7.

c). There exists a ℓ ℓ such that ℓ i=1V small i n 25, and for any ℓ ℓand any vertex u V small ℓ , with high

probability, all vertices v V small k for k > ℓare added to T by Line 7 of Algorithm 7.

Proof. By Lemma G.2, we only need to prove the split-away properties on e V , and the split-away properties on S follows. Let ℓ be the maximal level that the V small ℓ contains a vertex in V small( n/5), and suppose the split on this level results in (Sℓ l , Sℓ r ). We assume w.log. that Sℓ l Sℓ r so that Sℓ l becomes V small ℓ . We first observe a structural property.

Observation 3. The size of the vertices in VH \ V small( n/5) (and equivalently i=ℓ +1V small i ) satisfies

5 < i=ℓ +1V small i 4 n

Learning-Augmented Hierarchical Clustering

Proof. The upper bound follows from V small( n/5) n/5, as otherwise V small( n/5) will not have enough vertices. For the lower bound, note that by level ℓ 1, the size there are still more than 4 n

5 vertices remain to be decided. Also, note that ℓ +1V small i collectively forms the larger subtree w.r.t. V small ℓ . As such, its size has to be more than 1

Note that by the definition of small trees, we also have V small ℓ n

2 for any ℓ. We now proceed one-by-one to the proofs of Item a)., Item b)., and Item c)..

Proof of Item a). Fix any vertex u i=ℓ +1V small i . For the first statement, note that for every v i=ℓ +1V small i , all the vertices in ℓ i=1V small i are splitting away from (u, v). Define Xu,v,t as the indicator random variable answered by O as t splits away from (u, v) , and define Xu,v = P

t e V Xu,v,t as the total number of vertex split away from (u, v). By Observation 3, the expected number of split away reported by oracle O is at least

Note also that Xu,v is a summation of 0/1 independent random variables. As such, we can apply Chernoff bound to get that

= Pr Xu,v 25

27 E [Xu,v]

exp (2/27)2

50 n 9000 log n using the lower bound on the size of e V )

For the second statement, note that for any t i=ℓ V small i , v actually splits away from (u, t). As such, there is at least 4 n

5 vertices such that v splits away from (u, t) this is because the LCA between (u, v) has to be higher than the LCA between (u, t) for any t i=ℓ V small i . As such, define Yu,v as the number of total answers of v splits away from (u, t) by O, we again have

As such, we again have

Pr Yu,v 3 n

= Pr Yu,v 15

18 E [Xu,v]

50 n 9000 log n using the lower bound on the size of e V )

By a union bound over at most n vertices, no vertex v ℓ 1 i=1 V small i or v i=ℓ +1V small i can pass the test and be recorded as a counterpart . Furthermore, by Lemma G.2, if Xu,v > n/6 and Yu,v > 3 n/5, then such u cannot pass the test on the sampled set S, which is as desired. This part of the proof can be visualized as in Figure 8(a).

Proof of Item b).. We now show that vertices v ℓ i=1V small i cannot pass the counterpart test. For vertices in v V small ℓ , we claim that there are too many vertices t that split away from (u, v). To see this, let us define Au,v as the total number of answers of t splits away from (u, v) answered by O. Note that for any vertex t i=ℓ +1V small i , the answer is always yes since ℓ ℓ ℓ. Therefore, we can lower bound the expectation of Au,v with E [Au,v] 2

10 9 25 n. Therefore, we can apply Chernoff bound to get

3 E [Au,v] 1

We now turn to the vertices that are in ℓ 1 i=1V small i . Note that by definition, there is i=ℓV small i i=ℓ V small i 4

5 n. For any t i=ℓ V small i , v splits away from (u, t). As such, define Bu,v as the number of answers by O with v splits away

Learning-Augmented Hierarchical Clustering

from (u, t) , there is E [Bu,v] 18

25 n. Once again, by applying Chernoff bound, there is

Pr Bu,v 3 n

3 E [Bu,v] 1

Applying a union bound over the cases and all n n vertices and using Lemma G.2 would lead to the desired statements. This part of the proof can be visualized as in Figure 8(b).

Proof of Item c).. We fix ℓas the minimum level such that the union of ℓ i=1V small i has at least n/25 vertices. Let ℓ ℓ, and for any vertex v in V small k for k > ℓ, define the following two random variables: Xu,v as the number of answers that t splits away from (u, v) , and Yu,v as the number of answers that v splits away from (u, t) . We shall show that both terms are not large and v can always pass the test. (Note that X and Y are overloaded and are not related to their meaning in the proof of Line b)..)

Note that by definition, we have ℓ 1 i=1V small i < n 25. Since ℓ ℓ, for any vertex u ℓ i=1V small i , only vertices t

ℓ 1 i=1V small i are actually splitting away from (u, v) for v i=ℓV small i . As such, we have E [Xu,v] ( 1

25) n = 7 50 n by the correct probability of O. For Xu,v, we also note that if the answer is at most (30000 100000ε) log n, then by Lemma G.2, the vertex passes the test with high probability. Therefore, we assume Xu,v E [Xu,v] (8000 100000ε) log n, and we again apply Chernoff for the tail bound:

Pr (Xu,v (1/6 2ε) n) exp 50 (2/75 2ε)2

7 3 E [Yu,v]

n3 . (using E [Yu,v] (8000 100000ε) log n and pick ε sufficiently small)

For Yu,v, note that only t V small ℓ can report v splits away from (u, t) since the LCA between (u, t) is at least ℓfor any other v. Therefore, the number of signals we can possibly get is n/2 plus the noise induced by the oracle O. As such, we again have E [Yu,v] 11 n/20. Also, note that if Yu,v is less than (30000 100000ε) log n, the vertex v would always pass the test, which allows us to assume w.log. that E [Yu,v] (30000 100000ε) log n. Therefore, we can again apply Chernoff bound to show

5 2ε) n exp 20 (1/20 2ε)2

n3 . (using E [Zu,v] (30000 100000ε) log n and pick ε sufficiently small)

Therefore, we could apply Lemma G.2 to argue that with high probability, the vertex v would pass the test. This part of the proof can be visualized as in Figure 8(c).

Combining the analysis of Item a)., Item b)., and Item c). gives us the desired proof of Lemma G.3.

Using Lemma G.3, we can now argue that with high probability, the algorithm always correctly identifies the vertex split from the root as long as the size is at least 50000 log n. More formally, we have

Lemma G.4. For any composable e V such that e V 50000 log n, let T (e V ) be the optimal HC tree restricted to e V , and

suppose the root split of T (e V ) is (S l , S r). Then, the output (T , e V \ T ) of Line 7 is exactly (S l , S r).

Proof. Assume without the loss the generality that S l S r. Note that for any vertex u S l , it is also among the vertex of V small( n/25). As such, by Lemma G.3, the entire vertex set of S r is returned. On the other hand, for any vertex u S r, we claim the induced set T is necessarily smaller than S r. To see this, note that by Lemma G.3, the only possible case for Line 7 to not return a subset of S r is if u V small ℓ such that ℓ> ℓ and ℓ = 1. However, in such a case, the algorithm can at most return the set S l , which is necessarily smaller than the size of S r. As such, the maximum size of the set is attained by picking the vertex from S l , which implies that the return rule of Line 7 gives the partition (S l , S r).

In essence, Lemma G.4 is the very natural consequence of Lemma G.3. This can be visualized in Figure 8(d) of Figure 8 the T with the largest size is always the set of vertices induced by u1, which is exactly what we want.

Learning-Augmented Hierarchical Clustering

splits away from

splits away from

(a) Illustration of Item a). of Lemma G.3 with u i=ℓ +1V small i .

splits away from

splits away from

splits away from

(b) Illustration of Item b). of Lemma G.3 with u V small ℓ for some ℓ ℓ .

(only) such that splits away from

(only) such that

splits away from

(c) Illustration of Item c). of Lemma G.3 with u V small ℓ for ℓ ℓas prescribed in Lemma G.3.

(d) The counterpart trees generated by vertices u1, u2, and u3. We can simply take the largest partition to guarantee the root cut.

Figure 8. An illustration of the analysis we used in the proof of Lemma G.3.

Learning-Augmented Hierarchical Clustering

Conditioning on the high probability event, any super-vertex has to induce a maximal subtree in T , which satisfies the requirement for the tree to be strong consistent with T . By Lemma G.1, Algorithm 7 deterministically takes at most O(n3) queries and O(n3 log n/ε2) = O(n3 log n) time by the choice of ε = Θ(1).

Finally, we observe that Algorithm 7 only takes O(n log n) memory to implement. To see this, observe that for any recursive call of Algorithm 7, for any fixed u and v, the subroutine counterpart-tester-strong can be implemented in O(n log n) bits of space; furthermore, the space can be re-used for different u and v vertices. As such, we only need to maintain the set T and its size for every u, which takes O(n log2 n) space. In fact, we can maintain this in O(n log n) space by keeping the set T with the largest size and re-using the space on the fly. After each recursive call, we can free up the space of e V to store e V \ T and T instead. On a separate O(n log n)-sized space, we can keep writing down the HC tree with

i). A name of the node that takes O(log n) bits of memory;

ii). The set of vertices induced by this node.

In each recursion, we re-use the space of ii) by erasing the set of vertices induced by an internal node x once the children of x are written down. In the end, the sets of vertices are only written on the super-vertices. As such, the entire partial HC tree can be stored in O(n log n) memory as well.

H. The Algorithm for the Weakly Consistent Partial HC Tree: Proof of Theorem 11

We present the considerably more involved algorithm for the weakly consistent partial tree construction in near-linear time. We first remind the readers of the main theorem of this result.

Theorem 11. There exists an algorithm that given a splitting oracle O of a weighted undirected graph G = (V, E, w), with high probability, in O(n2 polylog n) time and O(n2) queries computes a partial hierarchical clustering tree I that is weakly consistent with the optimal hierarchical clustering tree T .

Our main effort is to show the algorithm that satisfies the guarantee of Theorem 11. As we have discussed in the high-level overview, the algorithm is divided into the split and the merging parts. We first show an algorithm that given a composable subset of vertices e V V such that e V Ω(log n), return a set T that induces a maximal subtree in T (e V ). As

such, we can recurse on T and e V \ T to gradually reduce the sizes to at most O(log n). Then, in the second part, we show how to glue the subtrees together to eventually form a partial tree that is consistent with T . The overall structure of the algorithm can be shown as Algorithm 8.

Algorithm 8 weak-partial-tree O(e V , VH): an algorithm for partial tree construction

Input: Input vertex set e V ; input horizon vertex set VH; a splitting oracle O as prescribed in Definition 6 Output: A partial hierarchical clustering tree for e V . Initialize e V = V , VH = V . If e V < 50000 log n, return a super-vertex (defined as in Definition 7). Get the partition of T

such that TT is a complete subtree, i.e. (T, e V \ T, u , VH) vertex-split O(e V , VH) (Algorithm 12) Recurse on T and e V \ T, i.e.,

TT weak-partial-tree O(T, T); Te V \T weak-partial-tree O(e V \ T, VH)

for the respective HC trees Run the merging algorithm Te V tree-merge O(TT , Te V \T , u ) (Algorithm 13).

We will introduce and analyze the splitting and the merging algorithms in the subsequent sections.

H.1. An algorithm to split the vertices

We first discuss our algorithm that produces complete subtrees for a given set of vertices e V . For the clarity of presentation, we use n to denote the size of e V , i.e., n = e V .

Learning-Augmented Hierarchical Clustering

H.1.1. THE ALGORITHM

We now present the actual algorithm. We first give two tester subroutines that serves as a more general version of the counterpart-tester-strong algorithm we used in Algorithm 7.

Algorithm 9 counterpart-tester(O,VH)(u, t, threshold1, threshold2): an algorithm to test whether t is among the counterpart of u

Input: Vertex set VH; a splitting oracle O; baseline vertex u; test vertex v; threshold1 (0, n), threshold2 (0, n) Output: Whether v is a counterpart of u Initialize counters c1 0, c2 0 for Every vertex t VH do

If v splits away from (u, t), increase c1 by 1 If t splits away from (u, v), increase c2 by 1 end if c1 threshold1 and c2 threshold2 then

Return v is a counterpart of u . end

Algorithm 9 is very similar to Algorithm 6, and the difference is that instead of testing on e V set, we query on the VH set instead. This is more than a simple notation change: it allows us to separate the vertices to be tested (e.g., vertices in e V ) vs. the set we used in the tests (e.g., vertices in VH).

Next, we introduce the tester that given vertices u V small ℓ1 and t V small ℓ2 , returns whether ℓ2 < ℓ1, i.e., whether t is split earlier in the small-tree split order of u.

Algorithm 10 predecessor-tester(O,VH)(u, t, threshold): an algorithm to test whether t is among the predecessor of u Input: Vertex set VH; a splitting oracle O; baseline vertex u; test vertex t; threshold (0, n) Output: Whether t is a predecessor of u Initialize counters c 0 for Every vertex s VH do

If t splits away from (u, s), increase c by 1 end if c threshold then

Return t is a predecessor of u . end

Note that unlike Algorithm 9, the vertex t passes the test in Algorithm 10 if the number of split away is lower-bounded. With the predecessor-tester(O,VH)(u, t, threshold) algorithm, we can now define the following algorithm that tests whether a root split has happened in the previous iteration.

We now continue to the presentation of our tree-split algorithm. The full algorithm is as Algorithm 12.

Learning-Augmented Hierarchical Clustering

Algorithm 11 test-orphan-predecessor(O,VH,e V ): an algorithm to obtain any vertex that is split earlier than the orphaned vertex set in the small-tree split order

Input: Vertex set e V of size n; the horizon set VH of size n H; a splitting oracle O Output: Vertex X that is either empty or contain predecessor of orphaned sets Sample a set U of 100 log n vertices from e V using fresh randomness for u U do

for v VH do

if n H 3 n then

threshold-pred 3 n/2 end else

threshold-pred 2n H/3 end Add x to X if predecessor-tester(O,VH)(u, v, threshold-pred) (Algorithm 10) reports v is a predecessor of u end Record the size |X| for this choice of u

end Return X as the largest size of X

Algorithm 12 vertex-split O(e V , VH): an algorithm that splits e V to T and e V \ T

Input: Vertex set e V of size n; the horizon set VH of size n H; a splitting oracle O Output: Vertex sets T, e V \ T; new horizon set VH Sample a set U of 500 log n vertices from e V for u U do

Initialize T for v VH do

Run counterpart-tester(O,VH)(u, v, 3n H/5, n H/6) Add v to T if v is a counterpart of u end Record the size |T| for this choice of u end Pick (T , e V \ T ) such that T T is with the largest size if T e V = then

// Test whether a root split happened in the last iteration X test-orphan-predecessor(O,VH,e V ) (using Algorithm 11) if X = then

// The case of non-root split Pick an arbitrary u X Initialize T for v VH do

Run counterpart-tester(O,VH)(u , v, 3n H/5, n H/6) Add v to T if v is a counterpart of u

end Output with the same rules of the T e V = case end else

// Case for root split Let VH e V , run and output with vertex-split O(e V , e V ) Enforce a single level of recursion call (return FAIL if e V = VH and the algorithm enters the above line again with e V = VH) end end else

// Keep the current horizon Output (T e V , e V \ T ) as the partition and keep the same VH Output u as the vertex u U corresponding to the output T

Learning-Augmented Hierarchical Clustering

Note that in Line 12, we sample 500 log n as opposed to 500 log n vertices (this is on purpose as opposed to being a typo). The formal guarantee of the tree split algorithm is as follows.

Lemma H.1. Suppose e V is a composable set with size n that satisfies the weak contraction property as prescribed in Definition 9, and suppose n 50000 log n. Furthermore, let VH be a single composable set (e.g., VH induces a maximal tree in T ) of n H vertices, and suppose e V = ℓ i=1V small i

for some ℓin the small-tree split order of VH. Then, given a splitting oracle O with correct probability at least 9/10, Algorithm 12 with high probability outputs composable sets T e V and e V \ T and a vertex u such that

1. T induces a single maximal subtree in T (e V ).

2. e V \ T satisfies the weak contraction property as prescribed in Definition 9, and in particular, the subtrees induced by e V \ T have

(a) one edge connected to the node pa LCAT (e V ) (if it exist).

(b) one edge connected to the node that induces T in T .

3. The lowest common ancestor between the nodes in T is an (immediate) child node of the lowest common ancestor between T {u } in T (e V ), i.e.,

LCAT (e V )(T {u }) = pa LCAT (e V )(T) .

4. Size properties: at least one of the following guarantees hold.

(a) The new set e V \ T has at least 99 100 fraction of vertices that are orphaned vertices, i.e., the new V orphan set accounts of at least 99 100 fraction of vertices in e V \ T ; (b) The size of T satisfies the following properties:

i. The size of T e V is at least 1 200 n, i.e.,

T e V 1 200 n.

ii. If VH = e V , the size of T e V is at most (1 1 10000 log2 n) n, i.e.,

T e V (1 1 10000 log2 n) n.

Furthermore, case Item 4b always happens if VH = e V , and the algorithm runs in time O(n2 H log n).

We now proceed to the analysis to prove Lemma H.1.

THE ANALYSIS

To begin with, we define the notion of orphaned vertices from our split procedure.

Definition 28 (Orphaned vertices). Let e V be a composable set of vertices, and let T = i=ℓ+1V small i . We call V small ℓ e V the set of orphaned vertices in V small ℓ with respect to T. We denote the orphaned set of vertices as V orphan T,ℓ , and we write V orphan as the simplified notation when the context is clear.

One can refer to Figure 9 for a visualization of the orphaned vertices (we used V as opposed to e V in the figure). When the context is clear, we ignore the dependence on V small ℓ and T when talking about orphaned vertices, and simply denote the orphaned vertices as V orphan. In our proof of correctness, we will show inductively that the set e V only has a single orphaned set V orphan a key to guarantee property Item 2 of Lemma H.1.

We now proceed with the relatively straightforward analysis of the running time of a single level of recursion.

Learning-Augmented Hierarchical Clustering

Splits with recursive subsets of

More than one edge connect to sibling

Splits with semi-invariate

Exactly one edge connect to sibling

Figure 9. An illustration of how horizon sets help maintain weak consistency. With the same u, the reason for the different outcomes is that on the top, the small-tree split order restricted on V has changed compared to the last iteration. The use of the horizon set helps keep the small-tree split order invariant before a root cut happens.

Lemma H.2. The algorithm vertex-split O(e V , VH) runs in time O(n2 H log n).

Proof. Each run of the counterpart-tester takes O(n H) time. As such, the procedure that finds u takes O(n2 H log n) time since it involves O(n H log n) calls of counterpart-tester. Similarly, the procedure that finds X takes at most O(n H log n) calls of predecessor-tester, which again result in O(n2 H log n) time. Furthermore, if X = , we only make n H number of counterpart-tester calls, which takes O(n2 H) time. On the other hand, if X = , since we insist on a single level of recursion call, the runtime overhead is at most O(n2 H log n). Adding up the time complexity of the above two procedures gives us the desired bound.

We proceed with the proof of correctness for the algorithm. To this end, we first observe that by Observation 2 and since VH is composable, the answers for the splitting oracle O on (u, v, w) VH is fully preserved in T (VH). We now give a technical lemma that characterizes the return set T by running counterpart-tester on different sets of vertices this is essentially the same argument we used in Lemma G.3, albeit we switched e V to VH. We provide the lemma and the analysis for the purpose of completeness.

Lemma H.3 (Cf. Lemma G.3). For any composable VH such that |VH| 50000 log n, let ℓ be the maximal level that V small ℓ contains a vertex in V small( n H/5). With high probability, Line 12 in Algorithm 12 satisfies the following properties.

a). For any vertex u i=ℓ +1V small i , there are

No vertex v i=ℓ +1V small i can be added to T by Line 12 of Algorithm 12.

No vertex v ℓ 1 i=1 V small i can be added to T by Line 12 of Algorithm 12.

b). For every level ℓ ℓ and vertices u V small ℓ , with high probability, there are

No vertex v V small ℓ can be added to T by Line 12 of Algorithm 12. No vertex v V small k for k < ℓcan be added to T by Line 12 of Algorithm 12.

Learning-Augmented Hierarchical Clustering

c). There exists a ℓsuch that ℓ i=1V small i n H

25 , and for any ℓ ℓand any vertex u V small ℓ , with high probability, all

vertices v V small k for k > ℓare added to T by Line 12 of Algorithm 12.

Proof. Let ℓ be the maximal level that the V small ℓ contains a vertex in V small( n H/5), and suppose the split on this level results in (Sℓ l , Sℓ r ). We again assume w.log. that Sℓ l Sℓ r , which means Sℓ l becomes V small ℓ . The bounds in Observation 3 still hold with parameter n H, i.e., 2n H

5 < i=ℓ +1V small i 4n H

5 . We now proceed to the proofs of Item a)., Item b)., and Item c)., respectively.

Proof of Item a). Fix any vertex u i=ℓ +1V small i . For the first statement, note that for every v i=ℓ +1V small i , all the vertices in ℓ i=1V small i are splitting away from (u, v). Define Xu,v,t as the indicator random variable answered by O as t splits away from (u, v) , and define Xu,v = P

t e V Xu,v,t as the total number of vertex split away from (u, v). By Observation 3, the expected number of split away reported by oracle O is at least

Note also that Xu,v is a summation of 0/1 independent random variables. As such, we can apply Chernoff bound to get that

Pr Xu,v n H

= Pr Xu,v 25

27 E [Xu,v]

exp (2/27)2

50 n H 9000 log n using the lower bound on the size of VH)

For the second statement, note that for any t i=ℓ V small i , v actually splits away from (u, t). As such, there is at least 4n H

5 vertices such that v splits away from (u, t) this is because the LCA between (u, v) has to be higher than the LCA between (u, t) for any t i=ℓ V small i . As such, define Yu,v as the number of total answers of v splits away from (u, t) by O, we again have

As such, we again have

Pr Yu,v 3n H

= Pr Yu,v 15

18 E [Xu,v]

50 n H 9000 log n using the lower bound on the size of VH)

By a union bound over at most n vertices, no vertex v ℓ 1 i=1 V small i or v i=ℓ +1V small i can pass the test and be recorded as a counterpart .

Proof of Item b).. We now show that vertices v ℓ i=1V small i cannot pass the counterpart test. For vertices in v V small ℓ , we claim that there are too many vertices t that split away from (u, v). To see this, let us define Au,v as the total number of answers of t splits away from (u, v) answered by O. Note that for any vertex t i=ℓ +1V small i , the answer is always yes since ℓ ℓ ℓ. Therefore, we can lower bound the expectation of Au,v with E [Au,v] 2

5n H 9 10 9 25 n H. Therefore, we can apply Chernoff bound to get

Pr Au,v n H

3 E [Au,v] 1

We now turn to the vertices that are in ℓ 1 i=1V small i . Note that by definition, there is i=ℓV small i i=ℓ V small i 4

5 n H. For any t i=ℓ V small i , v splits away from (u, t). As such, define Bu,v as the number of answers by O with v splits away from (u, t) , there is E [Bu,v] 18

25 n H. Once again, by applying Chernoff bound, there is

Pr Bu,v 3n H

3 E [Bu,v] 1

Applying a union bound over the cases and all n H n vertices gives us the desired statements.

Learning-Augmented Hierarchical Clustering

Proof of Item c).. We fix ℓas the minimum level such that the union of ℓ i=1V small i has at least n H/25 vertices. Let ℓ ℓ, and for any vertex v in V small k for k > ℓ, define the following two random variables: Xu,v as the number of answers that t splits away from (u, v) , and Yu,v as the number of answers that v splits away from (u, t) . We shall show that both terms are not large and v can always pass the test. (Note that X and Y are overloaded and are not related to their meaning in the proof of Line a)..)

Note that by definition, we have ℓ 1 i=1V small i < n H

25 . Since ℓ ℓ, for any vertex u ℓ i=1V small i , only vertices

t ℓ 1 i=1V small i are actually splitting away from (u, v) for v i=ℓV small i . As such, we have E [Xu,v] ( 1

25) n H = 7 50 n H by the correct probability of O. For Xu,v, we also note that if the answer is at most 3000 log n, then the vertex trivially passes the test. Therefore, we assume Xu,v E [Xu,v] 3000 log n, and we again apply Chernoff for the tail bound:

Pr Xu,v n H

exp (4/21)2

n3 . (using the condition E [Yu,v] 3000 log n)

For Yu,v, note that only t V small ℓ can report v splits away from (u, t) since the LCA between (u, t) is at least ℓfor any other v. Therefore, the number of signals we can possibly get is n H/2 plus the noise induced by the oracle O. As such, we again have E [Yu,v] 11n H/20. Also, note that if Yu,v is less than 30000 log n, the vertex v would always pass the test, which allows us to assume w.log. that E [Yu,v] 30000 log n. Therefore, we can again apply Chernoff bound to show

Pr Yu,v 3n H

exp (1/11)2

n3 . (using the condition E [Zu,v] 30000 log n)

A direct corollary of Lemma H.3 is that the set T induced by any u ℓ i=1V small i is larger than the T induced by u i= ℓ+1V small i , and the higher level vertices in ℓ i=1V small i induces larger sets. More formally, we can summarize this observation as follows.

Lemma H.4. Conditioning on the high-probability event of Lemma H.3, the following statements are true:

Let T1 be the vertex set induced by u1 ℓ i=1V small i from Line 12 in Algorithm 12, and let T2 be the vertex set induced by u2 i= ℓ+1V small i from Line 12 in Algorithm 12. We have |T1| |T2|.

Let ℓ1 ℓ2 ℓ. Let T1 be the vertex set induced by u1 V small ℓ1 from Line 12 in Algorithm 12, and let T2 be the vertex set induced by u2 V small ℓ2 from Line 12 in Algorithm 12. We have |T1| |T2|.

Proof. We prove the second bullet first since the conclusion can be used to prove the first bullet. Note that conditioning on the high-probability event of Lemma H.3, if ℓ1 ℓ2, we have T2 T1 by Item c).. Therefore, we have |T1| |T2|.

For the first bullet, note that conditioning on the high-probability event of Lemma H.3, the set T2 can either be V small ℓ (by Item a).) or i= ℓ+1V small i (by Item b).). In either case, the size of such a set is at most i= ℓ+1V small i , which is the set T1 generated by u V small ℓ . Furthermore, by the result in the second bullet, if u V small i for some i ℓ, the induced set T1 can only be larger. This proves the first bullet.

We now show that conditioning on Lemma H.3 (resp. Lemma H.4), if the size of the orphaned set is relatively small, we will not need the subroutine in Line 12, and all the guarantees in Lemma H.1 will be satisfied.

Lemma H.5. Let e V and VH be as prescribed by Lemma H.1, and suppose the size of the orphaned set is at most 99 100 n, i.e., V orphan 99 100 n. Then, with high probability, we have T e V = , and the resulting T and e V \ T satisfy the properties of Lemma H.1.

Learning-Augmented Hierarchical Clustering

Proof. We discuss the cases based on whether VH = e V and whether there exists a V small ℓ such that V small ℓ 99 100 n.

1. If VH = e V . In this case, we show that with high probability, the guarantees in Item 1, Item 2, and Item 3 of Lemma H.1

always hold, and the Item 4b case of Lemma H.1 is going to happen. To see this, note that with high probability, we will sample a vertex that is among the first n 25 vertices in the small-tree split order: the probability for us to not sample a vertex from V small( n 25) is at most

25)500 log n 1

As such, we can condition on a vertex v among V small( n 25) is sampled. By Lemma H.3 and Lemma H.4, the counterpart set induced by v V small( n 25) is among ℓ i=1V small i , which necessarily induces a larger set than any other u i= ℓ+1V small i . Therefore, the induced set T must be from the vertex v .

We now use this to verify the desired properties. The proofs of Item 1, Item 2, and Item 3 are straightforward as follows.

For Item 1, note that by Lemma H.3, the induced set T is always i=ℓV small i , which forms a maximal subtree in T (VH). Similarly, T e V induces a maximal subtree in T (e V ). For Item 2, note that as long as T includes the orphaned set V orphan, there will be only one edge connecting to the node induces T in T . In this case, there is no orphaned set, and Item 2 holds trivially. Item 3 directly follows from Lemma H.3 since T is induced by u .

For the size upper and lower bounds (Item 4), we verify that the guarantees for case 4b always holds. Note that conditioning on the high-probability event of Lemma H.3, the size is at least 2n H

5 (Observation 3). Therefore, the set T e V = T has size at least 2 n

5 1 200 n, which proves the lower bound (Item 4(b)i). For the upper bound (Item 4(b)ii), we note that for the size of T e V to be more than (1 1 10000 log2 n) n, a necessary condition is to sample a vertex u V small( n/10000 log2 n). Since we sample 500 log n vertices, define X as the random variable for the number of vertices sampled from V small( n/10000 log2 n), we have

E [X] 1 20 log n.

Since X is a summation of independent random variables supported on [0, 1], we can apply Chernoff bound to show that

Pr (X 1) = Pr (X 50 log n E [X])

2500 log2 n 1 20 log n 2 + 20 log n

Therefore, we can apply a union bound to show that with high probability, the size of T e V will not be larger than (1 1 10000 log2 n) n, as desired.

2. If VH = e V . We need to handle this case with more care. We first show that at least one vertex that is in e V \ V orphan can be sampled with high probability. To see this, note that by the size bound on V orphan , the probability for a vertex in e V \ V orphan to not be sampled is at most 99/100. Therefore, the probability for no vertices in e V \ V orphan to be sampled is at most

(99/100)500 log n 1

We condition on the high-probability event that at least one vertex from e V \ V orphan is sampled for the rest of the proof. We now discuss two sub-cases.

Learning-Augmented Hierarchical Clustering

a). If there exists a V small ℓ such that V small ℓ 99 100 n and no vertex from ℓ 1 i=1V small i is sampled. In this case, we show that with high probability, Item 1, Item 2, and Item 3 of Lemma H.1 always hold, and Item 4a in Lemma H.1 is going to happen. Note that in this case, there exist vertices of V small ℓ that are among V small( n/100), and it is of the size at least n/100. As such, the probability for us to sample at least one vertex from V small ℓ is at least

100)500 log n 1 1

Let v V small ℓ be the sampled vertex. Note that since VH = e V , the vertex we sample from V small ℓ is among the vertices of ℓ i=1V small i in Lemma H.3. Therefore, by the same argument of the VH = e V case, if we pick T with the largest size, the entire set of i=ℓ+1V small i is going to be included in T . Therefore, the properties of Item 1, Item 2, and Item 3 follow from the same argument of the VH = e V case. Furthermore, by the condition that no vertex from ℓ 1 i=1V small i is sampled, we cannot have larger such T sets (see Lemma H.4). As such, the set we will pick is necessarily the set T corresponds to v V small ℓ , and V small ℓ becomes the new V orphan set of the next iteration. Finally, note that we have V small ℓ 99 100 n. And after we remove T from e V , we have (e V e V \ T ), which means ( n n C) for some C > 0. As such, for the next iteration, we must have V orphan 99 100 n.

b). If there exists a V small ℓ such that V small ℓ 99 100 n and a vertex from ℓ 1 i=1V small i is sampled. In this case, we show that Item 1, Item 2, and Item 3 of Lemma H.1 always hold, and Item 4b of Lemma H.1 is going to happen with high probability. Note that since a vertex v ℓ 1 i=1V small i is sampled, and since v is among ℓ i=1V small i in Lemma H.3, the entire set of V small ℓ is going to be included in T . The properties as prescribed by Item 1, Item 2, and Item 3 follow from the argument in the e V = VH case, and the size lower bound becomes T e V 99

100 n 1 200 n,

as desired. Finally, note that we do not need to guarantee the size upper bound (Item 4(b)ii) since we will not be able to meet the VH = e V condition. c). If V small ℓ < 99 100 n for all ℓamong e V . In this case, we show that Item 1, Item 2, and Item 3 of Lemma H.1 always hold, and Item 4b case of Lemma H.1 will happen. We first show the size lower bound of Item 4b: note that with high probability, we can sample one vertex that is among the first 1/200 vertices to be split in e V in the small-tree split order: the probability for us to not sample any vertex among the first n/200 vertices in the small-tree split order is at most 199

500 log n 1

Therefore, we condition on the event that a vertex v in V small( n/200) is sampled. Since we have the condition that V small ℓ < 99 100 n for all ℓamong e V , a vertex among V small( n/200) can induce at most 99 n

100 + n 200 = 199

200 n vertices. As such, let ℓ be the level in the small-tree split order of v , we have

i=ℓ +1V small i 1 200 n.

Furthermore, as in the case analysis of VH = e V , which means v is among the vertices of ℓ i=1V small i in Lemma H.3. Therefore, the properties of Item 1, Item 2, and Item 3 follow from the same argument as in the VH = e V case. For the size bounds of T e V , by Lemma H.4, if we pick the largest T by the subroutine of line Line 12, at least the entire set of i=ℓ +1V small i is going to be included, which means the size of T e V is of size at least 1 200 n. This gives us the size lower bound. Finally, we again note that we do not need to guarantee the size upper bound (Item 4(b)ii) since we will not be able to meet the VH = e V condition.

We now handle the case when V orphan becomes large. We first note that if we happen to sample a vertex u e V \ V orphan, we can still guarantee T e V = and obtain the properties as prescribed by Lemma H.1.

Learning-Augmented Hierarchical Clustering

Lemma H.6. Let e V and VH be as prescribed by Lemma H.1, and suppose the size of the orphaned set is more than 99 100 n, i.e., V orphan > 99 100 n. Furthermore, suppose U (e V \ V orphan) = . Then, with high probability, we have T e V = , and the resulting T and e V \ T satisfy the properties of Lemma H.1.

Proof. In the lemma statement, we have already conditioned on a vertex sampled from e V \ V orphan. Furthermore, we can again show that the probability for us to sample a vertex among the first n H/25 in the small-tree split order is at least

25)500 log n 1 1

The events of U (e V \ V orphan) = and U V small( n H/25) = are not independent. Nevertheless, we can still apply a union bound and argue that both events happen with high probability.

Conditioning on the high-probability events as above, we can argue by Lemma H.3 and Lemma H.4 that the entire set of V orphan is going to be included by the subroutine as defined in line Line 12 of Algorithm 12, which gives us the size lower bound. We do not need to guarantee the size upper bound since we cannot meet the condition of VH = e V .

Finally, for properties of Item 1, Item 2, and Item 3, we can repeat our proofs in Lemma H.5. We provide the analysis again for the purpose of self-contained proof.

For Item 1, note that by Lemma H.3, the induced set T is always i=ℓV small i . Therefore, T e V induces a maximal subtree in T (e V ).

For Item 2, note that as long as T includes the orphaned set V orphan, there will be only one edge connecting to the node induces T in T . This is exactly what we proved in the lemma.

Item 3 directly follows from Lemma H.3 since T is induced by u .

This concludes the proof of Lemma H.6.

By Lemma H.6, the only case of concern now is when T e V = , i.e., V orphan > 99 100 n and U does not contain samples from e V \ V orphan. We now show that our procedure in Line 12 can effectively distinguish between the cases of e V \ V orphan = (root cut) and e V \ V orphan being small.

Lemma H.7. Let e V and VH be as prescribed by Lemma H.1, and suppose the size of the orphaned set is more than 99 100 n, i.e., V orphan > 99 100 n. Furthermore, suppose T e V = in Line 12 of Algorithm 12. Then, the following statements are true.

(i). If e V \ V orphan = , with high probability, we have X = and X (e V \ V orphan), i.e., X only contains vertices in e V but not in V orphan.

(ii). If e V \ V orphan = , with high probability, we have X = .

Proof. Consider the small-tree splitting order of VH, and let u V small ℓ for some ℓ ℓ, where V small ℓ = V orphan. We prove

that with high probability, i). no vertices v e V ( i=ℓV small ℓ ) can be added to X by Line 11; and ii). all vertices in v e V ( ℓ 1 i=1V small ℓ ) are added to X by Line 11. (Note that this is why we name the subroutine as a predecessor test.)

We first observe that by our definition, there is V orphan e V ( i=ℓV small ℓ ). Too see i), note that any v e V ( i=ℓV small ℓ ) can only split away from (u, t) for t V small ℓ : this is true since for every t V small ℓ , the lowest common ancestor between (u, t) induces a subtree in VH that contains v. Moreover, since V small ℓ e V , we have V small ℓ n. Define Xv as the number of answers v splits away from (u, t) for t VH from O. By our choice of the parameter threshold-pred, we assume w.log. Xv n since otherwise v will not join X anyway. If n H 3 n, we have E [Xv] n + n H

10 n in expectation. Since

Learning-Augmented Hierarchical Clustering

Xv is a summation of independent random variables supported on {0, 1}, we can apply Chernoff bound to obtain that

2 n = Pr Xv 15

exp (2/13)2

n3 . (using the condition on the lower bound of Xv)

On the other hand, if n H > 3 n, we have E [Xv] n H

30 n H. We again assume w.log. that Xv n, and we can apply Chernoff bound to obtain that

exp (8/13)2

n3 . (using the condition on the lower bound of Xv)

Therefore, we can apply a union bound and argue that with high probability, no vertices v e V ( i=ℓV small ℓ ) can be added to X by Line 11, as desired by i).

We now proceed to show ii). all vertices in v e V ( ℓ 1 i=1V small ℓ ) are added to X by Line 11. Note that v e V ( ℓ 1 i=1V small ℓ ) implies v e V \ V orphan. Therefore, v splits from (u, t) for every t in the orphaned set and for every t as the sibling of V orphan, i.e., t i= ℓ+1V small i . Define Yv as the number of answers v splits away from (u, t) for t VH from O. By our

choice of the parameter threshold-pred, if n H 3 n, since we have V orphan 99 100 n and i= ℓ+1V small i V orphan , we

have E [Yv] 9 10 199

100 n in expectation. Since Yv is a summation of independent random variables supported on {0, 1}, we can apply Chernoff bound to obtain that

2 n = Pr Yv 17

exp (2/15)2

n3 . (using E [Yv] 17

On the other hand, if n H > 3 n, we have at least n H 1 100 n 299

300n H vertices t such that v splits away from (u, t). Therefore, we have E [Yv] 9 10 299

5 n H in expectation. Therefore, we can again apply Chernoff bound to obtain that

n3 . (using E [Yv] 4

Therefore, we can apply a union bound to obtain the desired statement on ii).

By our statements in i) and ii) as above, we can already conclude that X (e V \ V orphan). Therefore, Item (ii). of Lemma H.7 follows straightforwardly since if e V \ V orphan = , any of its subset can only be empty as well. For Item (i)., what remains to show is that with high probability, there is X = . Note that if u V orphan and e V \ V orphan = , then by

our statements in i) and ii) above, the set X will not be empty. Since we assume V orphan 99 100 e V , the probability for

Learning-Augmented Hierarchical Clustering

U to not have any vertex u V orphan is at most

100)100 log n 1

as desired. Thus, with high probability, X is not empty, which proves Item (i). and concludes the proof of Lemma H.7.

Finalizing the proof of Lemma H.1. By Lemma H.2, the algorithm runs in n2 H log n time. For the set of vertices e V , we either have V orphan < 99 100 n or V orphan 99 100 n. In the former case, we apply Lemma H.5, and all guarantees in Lemma H.1 are satisfied. Otherwise, if V orphan 99 100 n and U (e V \ V orphan) = , by Lemma H.6, we can still satisfy the properties as prescribed by Lemma H.1.

The only remaining case is if V orphan 99 100 n and U (e V \ V orphan) = . In such a case, the algorithm will enter Line 12. If e V \ V orphan = , then by Item (ii). of Lemma H.7, the algorithm uses VH = e V for a single level of recursion call, and the properties of Lemma H.1 are satisfied by the guarantees of Lemma H.5 (since now V orphan = ). Otherwise, if e V \ V orphan = , note that by Item (i). of Lemma H.7, any arbitrary vertex u U belongs to V small ℓ for some ℓ< ℓ, such that V small ℓ = V orphan. Since V orphan 99 100 n and n H n, the vertex u is among V small( n H/100). As such, by Lemma H.3, with high probability, the vertex set T contains all vertices in V orphan, and the size is sufficiently large to guarantee Item 4(b)i of Lemma H.1. We do not need to guarantee Item 4(b)ii since V orphan = , and we will not meet the e V = VH condition. The guarantees of Item 1 and Item 2 are similarly satisfied since we remove a set i V small i that contains V orphan. Finally, by a similar argument as we used in Lemma H.5 and Lemma H.6, Item 3 is satisfied as desired.

H.2. An algorithm to merge two subtrees

We now move to the algorithm that merges two partial trees. Note that this task is not trivial: we use the thought process of small-tree split order in the proof of Lemma H.1, but the algorithm vertex-split O(e V ) does not immediately tell us which node did we extract the set T. As such, it still takes considerable work to merge the two trees on the right internal node.

Exactly here is why we need the split algorithm vertex-split O(e V ) to return the vertex u . Note that our goal is essentially to find the lowest common ancestor x between u and T, and stitch the tree TT to the node. As such, a natural strategy is to ask whether in T (resp. T (e V )), whether a vertex v e V splits away from (u , x) for x T. If O is to answer the queries correctly, all vertices that are outside LCAT (e V )(T {u }) would answer yes, and all vertices that are among the leaves of LCAT (e V )(T {u }) would answer no. We then use the fact that T is always large enough to beat the noise from O.

The formal description of the algorithm is as Algorithm 13.

Algorithm 13 tree-merge O(TT , Te V \T , u ): an algorithm to merge partial trees TT and Te V \T .

Input: Vertex set e V of size n; a splitting oracle O; Partial trees TT and Te V \T constructed by vertex-split O(e V , VH);

vertex u by vertex-split O(e V , VH) Output: A partial tree on Te V Initialize S for s e V \ T do

Initialize a counter cs 0 for Every vertex t T do

If s splits away from (u , t), increase cs by 1 end if cs 1

2 |T| then Add s to S

end end Take the lowest common ancestor x = LCAT e V \T (S ) If S does not induces a maximal tree in Te V \T , i.e., leaves T e V \T [ S ] = S , abort the algorithm and report fail If the algorithm does not fail, split node x into two nodes: the left node induces the subtree of x, and the right node induces the subtree TT .

We now present the guarantees of the tree-merging algorithm.

Learning-Augmented Hierarchical Clustering

Lemma H.8. Given any composable vertex set e V V such that e V = n 50000 log n, a splitting oracle O with correct

probability 9/10, and suppose Te V \T , TT , and u are obtained by Algorithm 12. Furthermore, assume that

i). The high probability events of Lemma H.1 happens;

ii). The partial tree Te V \T is (weakly) consistent with T (e V \ T), TT is (weakly) consistent with T (T).

Then, with high probability, Algorithm 13 runs in time O( n2), and outputs a partial tree Te V that is weakly consistent with T (e V ).

Proof. We first remind the readers of the definition of a partial tree I weakly consistent with another tree T . For I to be consistent with T , there should be

a). each super-vertex of I corresponding to a connected subtree in T with out-degree at most 2, and each of the edge connects to either a parent or a sibling node; and

b). for leaves (x, y) in I, let X and Y be the corresponding leaves in T , the subtrees induced by LCAT (x, y) and LCAT (X Y ) contain exactly the same set of leaves.

We now show that with high probability, the algorithm tree-merge O(TT , Te V \T , u ) outputs a partial tree that satisfies

a) and b) w.r.t. T (e V ). For a), we note that the super-vertices in T (e V ) and (T (e V \ T), T (T)) are exactly the same. Furthermore, by the high probability event of Lemma H.1, both T (e V ) and (T (e V \ T), T (T)) satisfy the weakly consistent property. As such, the guarantee of a) follows.

The main work here is to prove the guarantee prescribed by b). To this end, we first observe that for any set of vertices X and Y , if LCAT (e V )(X Y ) only contain vertices in T (resp. e V \ T), then the assumptions of Te V \T being consistent with

T (e V \ T) and TT being consistent with T (T) is sufficient to prove the leaves induced by LCAT e V (X Y ) is the same as the leaves of LCAT (e V )(X Y ). This is evident by using the procedure that constructs T (e V ) as in Definition 25.

The final missing piece is the vertex sets X and Y that induce vertices in both T and e V \ T, which is the place where we need to show that the merging algorithm finds the correct node to merge. Let S = V orphan be the orphaned vertices by removing T from e V . Since we condition on the high probability event of Lemma H.1, there must be an internal vertex z, such that z = LCAT (e V )(T {u }), and nodes r T and r S such that i). z = pa (r T ) and z = pa (r S) in

T (e V ) and ii). the induced leaves of r T is T and the induced leaves of r S is S such that u S. Furthermore, we have |T| 50000 log n 1 200 200 log n by Lemma H.1. We now claim that by running tree-merge O(TT , Te V \T , u ), the set S we recover is exactly the leaves of S (the V orphan set of vertices). The detailed analysis is as follows.

For each s S, observe that in T (and T (e V )), s does not split away from (u , t) for t T. Therefore, define Cs as the random variable that records s split away from (u , t) for t T from O, we have in expectation E [Cs] 1 10 |T|. If Cs 20 log n, it trivially fails the test. Otherwise, if Cs 20 log n, we can apply Chernoff bound to get that

2 |T| = Pr (Cs 5 E [Cs])

n3 . (using Cs 20 log n)

For each d S, observe that in T (and T (e V )), d does split away from (u , t) for t T. Therefore, define Cd as the random variable that records d split away from (u , t) for t T from O, we have in expectation E [Cd] 9 10 |T|.

Learning-Augmented Hierarchical Clustering

Therefore, we can apply Chernoff bound to get that

2 |T| = Pr Cd 5

n3 . (using Cd 200 log n)

We can then apply a union bound over at most n n vertices in e V \ T to get the desired statement.

Observe that any internal node that induces leaves in both T and e V \ T has to at least include the whole set of S and T. For any leaves x and y in Te V , let the induced set of vertices in the leaves of T (e V ) be Z = S T P. By the above argument, S T should be returned in the set of vertices. Furthermore, by the consistency between Te V \T and T (e V \ T), the set P should also be returned. This concludes the proof.

The merging algorithm in Lemma H.8 could be visualized as in Figure 10.

a). Merging algorithm's view b). The actual

Figure 10. An illustration of the algorithm that merges T (V \ T) and T (T) The shaded internal node is the actual node to split S and T in T . If the oracle is correct, s does not split away from (u , t), but d does split away from (u , t). The size of T is large enough to overcome the adversarial noise.

H.3. Finalizing the proof of Theorem 11

We now move to prove Theorem 11 for our partial tree construction. We first bound the number of possible internal nodes as n by Fact A.1. Therefore, we can apply a union bound on all splits for Lemma H.1, and argue that with high probability, the events of Lemma H.1 hold for every partition. Furthermore, conditioning on Lemma H.1 always holds across the splits, we can again apply a union bound to show that Lemma H.8 holds across all internal nodes in the partial tree construction with high probability. We condition on the high probability events of Lemma H.1 and Lemma H.8 across the internal nodes for the rest of the proof.

Proof of efficiency. We now show that the depth of recursive calls on Lemma H.1 is O(log3 n), i.e., the longest sequence of recursive calls induced by any fixed e V is at most O(log3 n) in Algorithm 8. To see this, consider any set of vertices e V

Learning-Augmented Hierarchical Clustering

with size n, and we look into three levels of splits on e V . Suppose the vertices sets are e V (e V1, e V2, e V3, e V4, e V5, e V6, e V7, e V8). We claim that there is max i

{ e Vi }8 i=1 (1 1 10000 log2 n) n.

We prove the above statement by using Lemma H.1. Conditioning on the high probability event of Lemma H.1, if the first split of e V falls into the condition of e V = VH, then by Item 4(b)i and Item 4(b)ii of Lemma H.1, the balanceness of sizes already follows. Otherwise, we can have the following cases

If we enter the case of Item 4b, note that we have |T| n 200. As such, in the first split e V (e V \ T, T), we already

have e V \ T 199

200 n (1 1 10000 log2 n) n. The size of T might be large; however, it must have e V = VH for the

first split on T (by Algorithm 8). Therefore, in the next iteration, the maximum size is at most (1 1 10000 log2 n) n as well.

If we enter the case of Item 4a, note that the set T of the current iteration is of size at most n 100 (1 1 10000 log2 n) n. Furthermore, in the next iteration, the V orphan set accounts for at least 99 100 fraction of vertices. Hence, we can apply Lemmas H.6 and H.7 and argue that the split will be in the case of Item 4b with T2 as the new set T . Now, we have e V \ (T T2) with size at most n 100 (1 1 10000 log2 n) n. The new set T2 (e V \ T) might be large, but since it

forms a single maximal subtree in T (T2 (e V \ T)), there is VH = e V , and in the third iteration, the maximum size is going be to at most (1 1 10000 log2 n) n, as desired.

Since the size reduces by a (1 1 10000 log2 n) factor for every three level of splits, after 60000 log3 n recursive calls, we have

remaining size n (1 1 10000 log2 n)20000 log3 n

n exp ( 2 log n) O(1),

to which point the remaining vertices will be collapsed to a super-vertex by our algorithm. Therefore, since n n, the longest sequence of dependent calls is at most O(log3 n).

Finally, to complete the proof of efficiency, note that by Lemma H.1, the runtime for each call of Algorithm 12 is O(n2 H log n). The tree has depth at most log3 n; and at each level, the total number of runtime is at most O(n2 log n) since we have P n H n. Similarly, each call of the merging algorithm will happen only after the split algorithm, which causes an overall O(n2) runtime overhead on any level. Therefore, the total runtime is bounded by O(n2 log4 n) = O(n2 polylog n), as desired.

Proof of correctness. We inductively prove the correctness of Theorem 11. On the level of the leaves in Algorithm 8, by Lemma H.1, if the leave contains more than one vertex, it must be a composable set with out-degree at most 2 in T , and exactly one of them connecting to a parent node, and the other connecting to the sibling of the orphaned vertex. As such, when we merge two components X, Y in which at least one of them is a super-vertex, we can guarantee the out-degree is still at most 2, and the LCA of X and Y induces the same vertices as on T (LCAT (X Y )), which implies the partial tree is weakly consistent with T (X Y ). On the other hand, when merging two components who are both not super-vertices, we can use Lemma H.8, and the assumptions of weak consistency come from the guarantees on previous partitions. Therefore, the weak consistency inductively applies to every level of the merging process, which gives the desired correctness guarantee.

I. Discussions about Additional Settings for Our Algorithms

We discuss our algorithms in additional settings, which include general success probability (other than 9/10) and splitting oracle for approximately optimal HC trees.

I.1. General success probabilities

We use a success probability of 9/10 in our algorithms for technical convenience. Here, we discuss algorithms with more general success probabilities. We remark that due to adversarial incorrect answers, our algorithm cannot work with 1

Learning-Augmented Hierarchical Clustering

success probability for arbitrary ε. In fact, it is unclear whether any algorithm would work with 1

2 +ε success probability and adversarial incorrect answers. Concretely, suppose that in the optimal HC tree, the first cut is balanced with size (n/2, n/2). This appears to be a quite easy example. Now, let us fix a vertex u and determine whether a vertex v is on the same side of u. If v is on the same side of u, there are n/2 vertices w V such that w splits away from (u, v); conversely, if v is on the opposite side, there is no such w vertex. However, due to adversarial incorrect answers, we can report (n/2 n/3 = n/6) such w vertices in the former case, and n/3 such w vertices in the latter case. As such, in the above example, the correct probability for the oracle must be at least 3/4 to get anything meaningful. Finally, we remark our algorithm would work for any success probability 1

2 + C for sufficiently large C = Ω(1): all the analysis will go through with changes in the constants. Furthermore, if we deal with random incorrect answers instead, we will be able to work with 1

2 + ε success probability for any ε = Ω(1/n).

Finally, we give a remark on the discrepancy between success probabilities between learning-augmented HC and other learning-augmented graph algorithms. In some graph algorithm, e.g., in (Braverman et al., 2024; Cohen-Addad et al., 2024; Dong et al., 2025), the learning-augmented oracle could work with 1/2 + C probability for any C = Ω(1). The reason our algorithm should work with a sufficiently high success probability is due to the hierarchical structure. For instance, in the paper that studied learning-augmented max-cut (e.g. (Dong et al., 2025)), the errors in the algorithm are one shot . However, in the HC problem, if the construction of the partial tree is wrong at any level, the error will propagate to all subsequent nodes, and it is not clear how to control the error if this happens. Therefore, a constant success probability sufficiently larger than 1/2 is necessary.

I.2. Splitting oracle with approximately optimal HC trees

A natural extension of our algorithms is to explore HC algorithms with splitting oracles from an approximately optimal HC tree. In other words, for a triplet of vertices (u, v, w), the oracle O answers the splitting away query based on an HC tree T that achieves α-approximation of the optimal tree T . We remark that our algorithms based on the strongly consistent partial HC trees (i.e., the algorithms of Theorems 1, 2 and 21) could work with approximation HC trees. In particular, if the splitting oracle is constructed from an α-approximation HC tree T , our algorithm will produce HC trees with an extra O(α) factor on the approximation guarantees.

On the other hand, however, it is not immediately clear whether our algorithms based on weakly consistent partial HC trees could work for oracles from approximate HC trees. The main difficulty here is that to analyze the revenue decrement induced by the weakly consistent partial HC tree, we need to prove Lemma D.4 that characterizes the revenue structure of the optimal tree. We proved the statement by showing that if the statement is not true, we can increase the revenue, which forms a contradiction with the optimal HC tree (see Claim D.2 for details). We cannot easily argue that the same structural statement with an approximately optimal HC tree. This could be an interesting problem to resolve for future work.

J. Splitting Oracle and Learning Theory

In this section, we offer formal learning theory analysis for learning a splitting oracle for the learning-augmented hierarchical clustering problem. In particular, we utilize the PAC learning framework to show that a high-quality predictor can be efficiently learned, provided that the input instances are drawn from a specific distribution. We remark that similar results have been shown in other settings by (Izzo et al., 2021; Chen et al., 2022c; Ergun et al., 2022; Grigorescu et al., 2022). Thus although the results of this section are by now standard techniques, they still provide an end-to-end framework for designing learning-augmented algorithms.

First, we suppose that there exists an underlying distribution D, from which our input is drawn. In particular, D generates independent instances for hierarchical clustering, corresponding to the setting where similar instances of hierarchical clustering are being solved. Note that this is exactly the setting where we would like to apply learning-augmented algorithms. If there is instead generalization failure or distribution-shift, then inherently machine learning models will perform poorly.

Then our goal is to efficiently learn a predictor f from a family F of possible functions, where the input to each predictor f is a weighted undirected graph G = (V, E, w) and three specific nodes, and the output is a feature vector. We remark that each input instance G can be encoded as a vector in Rn2+n, by first considering the weighted n n adjacency matrix of the graph. We can then flatten the matrix into a vector of dimension n2 and then append a 3-sparse binary vector of length n, corresponding to the three vertices in the input. We also assume that the output of f has at most n dimension, indicating a binary vector for which vertex should be split from the other two vertices.

Learning-Augmented Hierarchical Clustering

We define a loss function L : f G R, which intuitively defines how accurate a predictor f performs on each input instance G. For example, f can represent the splitting oracle on G and L can denote the number of inaccurate responses compared to the best hierarchical clustering on G.

Now our goal is to learn the function f F, which minimizes the following objective:

E G D,(x,y,z) V 3 [L(f G(x, y, z))] . (3)

Let f be an optimal function in F,, so that f = argmin E G D,(x,y,z) V 3 [L(f G(x, y, z))] is a minimizer of the above

objective. Assuming that for each graph instance G, triplet (x, y, z) V , and each f F, we can efficiently compute f G(x, y, z) as well as L(f G(x, y, z)), say in polynomial time T(n), then we have:

Proposition 29. There exists an algorithm that uses poly T(n), 1

ε samples and returns a function ˆf, such that with probability at least 9 10,

E G D,(x,y,z) V 3

h L( ˆf G(x, y, z)) i min f E G D,(x,y,z) V 3 [L(f G(x, y, z))] + ε.

In particular, Proposition 29 is a PAC-style result that bounds the number of samples necessary to achieve a good probability of learning an approximately-optimal function ˆf. The algorithm corresponding to Proposition 29 is straightforward; it is simply the empirical minimizer after a sufficient number of samples are drawn. To prove correctness, we first require the following definition of pseudo-dimension for a function class, which is a generalization of VC dimension to real-valued functions.

Definition 30 (Pseudo-dimension, e.g., Definition 9 in (Lucic et al., 2017)). Let X denote a ground set, and let F be a collection of functions mapping elements from X to the interval [0, 1]. Consider a fixed set S = {x1, . . . , xn} X, a set of real numbers R = {r1, . . . , rn}, where each ri [0, 1], and a function f F. The subset Sf = {xi S | f(xi) ri} is referred to as the induced subset of S determined by the function f and the real values R. We say that the set S with associated values R is shattered by F if the number of distinct induced subsets is |{Sf | f F}| = 2n. Then the pseudo-dimension of F is defined as the size of the largest subset of X that can be shattered by F (or it is infinite if no such maximum exists).

Using pseudo-dimension, we can now present an accuracy-sample complexity trade-off for empirical risk minimization with and the number of necessary samples. First, we define H be the class of functions in F composed with L, i.e., H := {L f : f F}. Moreover, by normalization, we can assume the range of L is contained within [0, 1]. Then we have the following generalization bounds:

Theorem 31. (Anthony & Bartlett, 1999) Let D be a distribution over problem instances in G, and let H be a class of functions h : G [0, 1] with pseudo-dimension d G. Consider t i.i.d. samples G1, G2, . . . , Gt drawn from D. There exists a universal constant c0 such that for any ε > 0, if t c0 d H

ε2 , then for all h H, we have the following with probability at least 9 10: 1

i=1 h(Gi) E G D [h] (G)

We have the following immediate corollary by applying the triangle inequality.

Corollary 32. Let G1, . . . , Gt be a set of independent samples from D and let ˆh H be a function that minimizes 1 t Pt i=1 h(Gi). If the number of samples t is chosen as in Theorem 31, then with probability at least 9 10,

h ˆh(G) i min E G D [h (G)] + 2ε.

Thus, the main question is to analyze the pseudo-dimension of our function class H. To that end, we first relate the pseudo-dimension to the VC dimension of a related class of threshold functions.

Lemma J.1 (Pseudo-dimension to VC dimension, Lemma 10 in (Lucic et al., 2017)). For any h H, let Bh denote the indicator function of the threshold function, i.e., Bh(x, y) = sgn(h(x) y). Then the pseudo-dimension of H equals the VC-dimension of the subgraph class BH = {Bh | h H}.

Learning-Augmented Hierarchical Clustering

What remains is to bound the VC dimension of the function to compute in the class, which follows from the following standard result.

Lemma J.2 (Theorem 8.14 in (Anthony & Bartlett, 1999)). Let τ : Ra Rb {0, 1}, defining the class

T = {x 7 τ(θ, x) : θ Ra}.

Assume that any function τ can be computed by an algorithm that takes as input the pair (θ, x) Ra Rb and produces the value τ(θ, x) after performing no more than t of the following operations:

arithmetic operations +, , , / on real numbers,

comparisons involving >, , <, , =, and outputting the result of such comparisons,

outputting 0 or 1.

Then, the VC dimension of T is bounded by O(a2t2 + t2a log a).

We can now apply these results to prove Proposition 29 by instantiating Lemma J.2 with the computational complexity of evaluating any function in the class H.

Proof of Proposition 29. From Lemma J.1, we know that the pseudo-dimension of H is equivalent to the VC dimension of the threshold functions defined by H. Next, from Lemma J.2, we observe that the VC dimension of the relevant class of threshold functions is polynomial in the computational complexity of evaluating a function from the class. In other words, Lemma J.2 implies that the VC dimension of BH (as defined in Lemma J.1) is polynomial in the number of arithmetic operations required to compute the threshold function corresponding to some h H. According to our definition, this quantity is polynomial in T(n). Thus, the pseudo-dimension of H is also polynomial in T(n), and the desired result follows.

Note that we can initialize Proposition 29 with various oracles, in terms of the input and output predictions. Indeed, if each function in the family of oracles we are considering can be computed efficiently, then Proposition 29 guarantees that a polynomial number of samples is sufficient to learn a nearly optimal oracle.