# Fitting trees to ℓ1-hyperbolic distances

Joon-Hyeok Yim, Anna C. Gilbert

Building trees to represent or to fit distances is a critical component of phylogenetic analysis, metric embeddings, approximation algorithms, geometric graph neural nets, and the analysis of hierarchical data. Much of the previous algorithmic work, however, has focused on generic metric spaces (i.e., those with no a priori constraints). Leveraging several ideas from the mathematical analysis of hyperbolic geometry and geometric group theory, we study the tree fitting problem as finding the relation between the hyperbolicity (ultrametricity) vector and the error of tree (ultrametric) embedding. That is, we define a vector of hyperbolicity (ultrametricity) values over all triples of points and compare the ℓp norms of this vector with the ℓq norm of the distortion of the best tree fit to the distances. This formulation allows us to define the average hyperbolicity (ultrametricity) in terms of a normalized ℓ1 norm of the hyperbolicity vector. Furthermore, we can interpret the classical tree fitting result of Gromov as a p = q = ∞ result. We present an algorithm HCCROOTEDTREEFIT such that the ℓ1 error of the output embedding is analytically bounded in terms of the ℓ1 norm of the hyperbolicity vector (i.e., p = q = 1) and that this result is tight. Furthermore, this algorithm has significantly different theoretical and empirical performance as compared to Gromov's result and related algorithms. Finally, we show, using HCCROOTEDTREEFIT and related tree fitting algorithms, that supposedly standard data sets for hierarchical data analysis and geometric graph neural networks have radically different tree fits than those of synthetic, truly tree-like data sets, suggesting that a much more refined analysis of these standard data sets is called for.

1 Introduction

Constructing trees or ultrametrics to fit given data are both problems of great interest in scientific applications (e.g., phylogeny), algorithmic applications (optimal transport and Wasserstein distances are easier, for example, to compute quickly on trees), data visualization and analysis, and geometric machine learning. An ultrametric space is one in which the usual triangle inequality has been strengthened to d(x, y) ≤ max{d(x, z), d(y, z)}. A δ-hyperbolic metric space is one in which the metric relations amongst any four points are the same as they would be in a tree, up to the additive constant δ. More generally, any finite subset of a hyperbolic space looks like a finite tree. There has been a concerted effort to solve both of these problems in the algorithmic and machine learning communities, including [1]–[5] among many others. Indeed, the motivation for embedding into hyperbolic space or into trees was at the heart of the recent explosion in geometric graph neural networks [6]. As an optimization problem, finding the tree metric that minimizes the ℓp norm of the difference between the original distances and those on the tree (i.e., the distortion) is known to be NP-hard for most formulations. [7] showed that it is APX-hard under the ℓ∞ norm. The first positive result also came from [7], which provided a 3-approximation algorithm and introduced a reduction technique from a tree metric to an ultrametric (a now widely used technique).
The current best known result is an O((log n log log n)^{1/p}) approximation for 1 < p < ∞ [2], and an O(1) approximation for p = 1, recently shown by [1]. Both papers exploit the hierarchical correlation clustering reduction method and its LP relaxation to derive an approximation algorithm. While these approximation results are fruitful, they are not so practical (the LP result uses an enormous number of variables and constraints). On the more practical side, [8] provided a robust method, commonly known as NeighborJoin (which can be computed in O(n³) time), for constructing a tree. Recently, [9] proposed an O(n²) method known as TreeRep for constructing a tree. Unfortunately, neither algorithm provides a guaranteed bound on the distortion.

| Hyperbolicity measure | Value | Distortion |
| --- | --- | --- |
| Gromov's δ-hyperbolicity | δ = ‖Δ_x‖_∞ | ‖d − d_T‖_∞ = O(δ log n) = O(‖Δ_x‖_∞ log n) |
| Average hyperbolicity | δavg = ‖Δ_x‖_1 / C(n−1, 3) | ‖d − d_T‖_1 = O(δavg n³) = O(‖Δ_x‖_1) |

Table 1: Connection between δ-hyperbolicity and average hyperbolicity and how these quantities determine the distortion of the resulting tree metric. (Here C(a, b) denotes the binomial coefficient "a choose b".)

The main drawback with all of these results is that they assume almost nothing about the underlying discrete point set, when, in fact, many real application data sets are close to hierarchical or nearly so. After all, why fit a tree to generic data only to get a bad approximation? In fact, perhaps with some geometric assumptions on our data set, we can fit a better tree metric or ultrametric, perhaps even more efficiently than for a general data set. Motivated by both Gromov's δ-hyperbolicity [4] and the work of Chatterjee and Sloman [10] on average hyperbolicity, we define proxy measures of how tree-like a data set is. We note that [4], [11] provide a simple algorithm and analysis to find a tree approximation for which the maximum distortion (ℓ∞ norm) is bounded by O(δ log n), where δ is the hyperbolicity constant. Moreover, this bound turns out to be the best order we can have. In this paper, we go beyond the simple notion of δ-hyperbolicity to define a vector of hyperbolicity values Δ(d) for a set of distance values. The various ℓp norms of this vector capture how tree-like a data set is. Then, we show that the ℓq norm of the distortion of the best fit tree and ultrametrics can be bounded in terms of this tree proxy. Thus, we give a new perspective on the tree fitting problem, use the geometric nature of the data set, and arrive at, hopefully, better and more efficient tree representations. Table 1 captures the relationship between the different hyperbolicity measures and the tree fit distortion. We note the striking symmetry in the tradeoffs. Our main theoretical result can be summarized as follows: there is an algorithm which runs in time O(n³ log n) which returns a tree metric d_T with distortion bounded by the average (or ℓ1) hyperbolicity of the distances; i.e.,

‖d − d_T‖_1 ≤ 8‖Δ_x(d)‖_1 ≤ 8 C(n − 1, 3) AvgHyp(d).

Additionally, the performance of HCCROOTEDTREEFIT and other standard tree fitting algorithms on commonly used data sets (especially in geometric graph neural nets) shows that they are quite far from tree-like and are not well-represented by trees, especially when compared with synthetic data sets. This suggests that we need considerably more refined geometric notions for learning tasks with these data sets.
2 Preliminaries

2.1 Basic definitions

To set the stage, we work with a finite metric space (X, d). We set |X| = n, the number of triples in X as C(n, 3) = ℓ, and the number of quadruples of points in X as C(n, 4) = r. In a slight abuse of notation, we let C(X, 3) denote the set of all triples chosen from X (and, similarly, C(X, 4) the set of all quadruples of points). Next, we recall the notion of hyperbolicity, which is defined via the Gromov product [4]. Given a metric space (X, d), the Gromov product of two points x, y ∈ X with respect to a base point w ∈ X is defined as

gp_w(x, y) := (1/2)(d(x, w) + d(y, w) − d(x, y)).

We use the definition of the Gromov product on two points with respect to a third to establish the four point condition and define

fp_w(d; x, y, z) := max_{π ∈ perm} [min(gp_w(πx, πz), gp_w(πy, πz)) − gp_w(πx, πy)],

where the maximum is taken over all permutations π of the labels of the four points. Since fp_w(d; x, y, z) = fp_x(d; y, z, w) = fp_y(d; x, z, w) = fp_z(d; x, y, w), we sometimes denote the four point condition as fp(d; x, y, z, w). Similarly, we define the three point condition as

tp(d; x, y, z) := max_{π ∈ perm} [d(πx, πz) − max(d(πx, πy), d(πy, πz))],

which we use to define ultrametricity. Following a standard definition of Gromov, a metric space (X, d) is said to be δ-hyperbolic with respect to the base point w ∈ X if, for any x, y, z ∈ X, the following holds:

gp_w(x, y) ≥ min(gp_w(y, z), gp_w(x, z)) − δ.

We denote by Hyp(d) = δ the usual hyperbolicity constant (similarly, UM(d) is the usual ultrametricity constant). We note that this measure of hyperbolicity is a worst case measure and, as such, it may give a distorted sense of the geometry of the space. A graph which consists of a tree and a single cycle, for instance, is quite different from a single cycle alone, but with a worst case measure we will not be able to distinguish between those two spaces. In order to disambiguate different spaces, we define the hyperbolicity vector as the ℓ-dimensional vector of all four point conditions with respect to d:

Δ_w(d) = [fp(d; x, y, z, w)] for all {x, y, z} ∈ C(X, 3).

Similarly, we define the ultrametricity vector as the ℓ-dimensional vector of all three point conditions with respect to d:

Δ(d) = [tp(d; x, y, z)] for all {x, y, z} ∈ C(X, 3).

We use the hyperbolicity and ultrametricity vectors to express more refined geometric notions. We define the p-average hyperbolicity and p-average ultrametricity:

AvgHyp_p(d) = ( (1/C(n, 4)) Σ_{{x,y,z,w} ∈ C(X,4)} fp(d; x, y, z, w)^p )^{1/p},
AvgUM_p(d) = ( (1/C(n, 3)) Σ_{{x,y,z} ∈ C(X,3)} tp(d; x, y, z)^p )^{1/p}.

If p = 1, then the notions are simply the average (and we will call them so). Also, for clarity, note that the usual hyperbolicity and ultrametricity constants Hyp(d) and UM(d) are the p = ∞ case.

Proposition 2.1. We have the simple relations: (a) Hyp(d) = max_{x∈X} ‖Δ_x(d)‖_∞ ≥ ‖Δ_x(d)‖_∞ for any x ∈ X; (b) UM(d) = ‖Δ(d)‖_∞.

In the discussion of hierarchical correlation clustering in Section 2.4 and in the analysis of our algorithms in Section 3, we construct multiple graphs using the points of X as vertices and derived edges. Of importance to our analysis is the following combinatorial object, the set of bad triangles in a graph (i.e., those triples of vertices in the graph for which exactly two edges, rather than three, are in the edge set). Given a graph G = (V, E), denote by B(G) the set of bad triangles in G:

B(G) := { {x, y, z} ∈ C(V, 3) : |{(x, y), (y, z), (z, x)} ∩ E| = 2 }.
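To make these definitions concrete, the following is a minimal Python/numpy sketch (ours, not code from the paper) that computes the Gromov product, the three and four point conditions, and the p-average quantities directly from a symmetric distance matrix D. It uses the observation that fp is the gap between the two smallest of the three Gromov products and tp is the gap between the largest and second-largest side of a triple, which follows from the max-over-permutations definitions above.

```python
import itertools
import numpy as np

def gromov_product(D, w, x, y):
    # gp_w(x, y) = (d(x, w) + d(y, w) - d(x, y)) / 2
    return 0.5 * (D[x, w] + D[y, w] - D[x, y])

def four_point(D, x, y, z, w):
    # fp(d; x, y, z, w): max over relabelings of
    # min(gp_w(., .), gp_w(., .)) - gp_w(., .),
    # i.e. the gap between the two smallest of the three Gromov products.
    g = sorted([gromov_product(D, w, y, z),
                gromov_product(D, w, x, z),
                gromov_product(D, w, x, y)])
    return g[1] - g[0]

def three_point(D, x, y, z):
    # tp(d; x, y, z): max over relabelings of d(., .) - max(d(., .), d(., .)),
    # i.e. the gap between the largest and second-largest side of the triple.
    s = sorted([D[x, y], D[y, z], D[x, z]])
    return s[2] - s[1]

def avg_hyp(D, p=1):
    # AvgHyp_p(d): p-th power mean of fp over all quadruples of points.
    quads = itertools.combinations(range(D.shape[0]), 4)
    return np.mean([four_point(D, *q) ** p for q in quads]) ** (1.0 / p)

def avg_um(D, p=1):
    # AvgUM_p(d): p-th power mean of tp over all triples of points.
    trips = itertools.combinations(range(D.shape[0]), 3)
    return np.mean([three_point(D, *t) ** p for t in trips]) ** (1.0 / p)
```

Hyp(d) and UM(d) are then the p → ∞ limits, i.e., the maxima of the same quantities over all quadruples and triples, respectively.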
2.2 Problem formulation

First, we formulate the tree fitting problem. Given a finite, discrete metric space (X, d) and the distance d(x_i, x_j) between any two points x_i, x_j ∈ X, find a tree metric (T, d_T) in which the points of X are among the nodes of the tree T and the tree distance d_T(x_i, x_j) is close to the original distance d(x_i, x_j). While there are many choices to measure how close d_T and d are, in this paper we focus on the ℓp error, i.e., ‖d_T − d‖_p, for 1 ≤ p ≤ ∞. This definition is itself a shorthand notation for the following. Order the pairs of points (x_i, x_j), i < j, lexicographically and write d (overloading the symbol d) for the vector of pairwise distances d(x_i, x_j). Then we seek a tree distance function d_T whose vector of pairwise tree distances is close in ℓp norm to the original vector of distances. For example, if p = ∞, we wish to bound the maximum distortion between any pair of points of X; if p = 1, we wish to bound the total error over all pairs. Similarly, we define the ultrametric fitting problem. We also introduce the rooted tree fitting problem: given a finite, discrete metric space (X, d), the distance d(x_i, x_j) between any two points x_i, x_j ∈ X, and a distinguished point w ∈ X, find a tree metric (T, d_T) such that ‖d − d_T‖_p is small and d_T(w, x) = d(w, x) for all x ∈ X. Although the rooted tree fitting problem has more constraints, previous work (such as [7]) shows that by choosing the base point w appropriately, the optimal error of the rooted tree embedding is bounded by a constant times the optimal error of the tree embedding. Also, the rooted tree fitting problem is closely connected to the ultrametric fitting problem. Putting these pieces together, we observe that while considerable attention has been paid to the ℓq tree fitting problem for generic distances, only mild attention has been paid to assumptions on the input distances; no one has considered both restrictions on the distances and more sophisticated measures of distortion. We define the ℓp/ℓq tree (ultrametric) fitting problem as follows.

Definition 2.2 (ℓp/ℓq tree (ultrametric) fitting problem). Given (X, d) with hyperbolicity vector Δ_x(d) and AvgHyp_q(d) (ultrametricity vector Δ(d) and AvgUM_q(d)), find the tree metric (T, d_T) (ultrametric (X, d_U)) with distortion

‖d − d_T‖_p ≤ AvgHyp_q(d) · f(n)  or  ‖d − d_U‖_p ≤ AvgUM_q(d) · f(n)

for a growth function f(n) that is as small as possible. (Indeed, f(n) might be simply a constant.)

2.3 Previous results

Next, we detail Gromov's classic theorem on tree fitting, using our notation above.

Theorem 2.3 ([4]). Given a δ-hyperbolic metric space (X, d) and a reference point x ∈ X, there exists a tree structure T and its metric d_T such that
1. T is x-restricted, i.e., d(x, y) = d_T(x, y) for all y ∈ X;
2. ‖d − d_T‖_∞ ≤ 2‖Δ_x(d)‖_∞ log₂(n − 2).

In other words, we can bound the maximum distortion of the tree metric in terms of the hyperbolicity constant Hyp(d) = δ and the size of our input space X.

2.4 (Hierarchical) Correlation clustering and ultrametric fitting

Several earlier works ([2], [12]) connected the correlation clustering problem to that of tree and ultrametric fitting and, in order to achieve our results, we do the same. In the correlation clustering problem, we are given a graph G = (V, E) whose edges are labeled + (similar) or − (different) and we seek a clustering of the vertices so as to minimize the number of pairs incorrectly classified with respect to the input labeling; in other words, minimize the number of − edges within clusters plus the number of + edges between clusters.
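As a concrete illustration of this objective, here is a small toy sketch (ours, not code from the paper) that counts the disagreements of a candidate partition against +/− labels on a complete graph; the labels and the partition below are toy assumptions.

```python
import itertools

def correlation_clustering_cost(n, plus_edges, partition):
    # Disagreements: "-" edges inside a cluster plus "+" edges across clusters.
    # plus_edges: set of frozensets {i, j} labeled "+"; all other pairs are "-".
    cluster_of = {v: idx for idx, cluster in enumerate(partition) for v in cluster}
    cost = 0
    for i, j in itertools.combinations(range(n), 2):
        same = cluster_of[i] == cluster_of[j]
        plus = frozenset((i, j)) in plus_edges
        cost += (same and not plus) or (not same and plus)
    return cost

# Toy example: two "+" triangles joined by a single "+" edge.
plus = {frozenset(e) for e in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]}
print(correlation_clustering_cost(6, plus, [{0, 1, 2}, {3, 4, 5}]))  # 1 disagreement
```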
When the graph G is complete, correlation clustering is equivalent to the problem of finding an optimal ultrametric fit under the ℓ1 norm when the input distances are restricted to the values 1 and 2. Hierarchical correlation clustering is a generalization of correlation clustering that is also implicitly connected to ultrametric and tree fitting (see [1], [2], [12], [13]). In this problem, we are given a set of non-negative weights and a set of edge sets, and we seek a partition of the vertices that is both hierarchical and minimizes the weighted sum of incorrectly classified pairs of vertices; it is a (weighted) combination of correlation clustering problems. More precisely, given a graph G = (V, E) with k + 1 edge sets G_t = (V, E_t) and k + 1 weights δ_t ≥ 0 for 0 ≤ t ≤ k, we seek a hierarchical partition P_t that minimizes the ℓ1 objective function Σ_t δ_t |E_t △ E(P_t)|. It is hierarchical in that, for each t, P_t subdivides P_{t+1}.

Chowdhury et al. [13] observed that the single linkage hierarchical clustering algorithm (SLHC), whose output can be modified to produce an ultrametric designed to fit a given metric, satisfies a property similar to that of Gromov's tree fitting result. In this case, the distortion bound between the ultrametric and the input distances is a function of the ultrametricity of the metric space.

Theorem 2.4. Given (X, d) and the output of SLHC in the form of an ultrametric d_U, we have ‖d − d_U‖_∞ ≤ ‖Δ(d)‖_∞ log₂(n − 1).

2.5 Reductions and equivalent bounds

Finally, we articulate precisely how the tree and ultrametric fitting problems are related through the following reductions. We note that the proofs of these theorems use known techniques from [2] and [1], although the specific results are novel. First, an ultrametric fitting algorithm yields a tree fitting algorithm.

Theorem 2.5. Let 1 ≤ p < ∞ and 1 ≤ q ≤ ∞. Suppose we have an ultrametric fitting algorithm such that, for any distance function d on X (with |X| = n), the output d_U satisfies ‖d − d_U‖_p ≤ AvgUM_q(d) · f(n) for some growth function f(n). Then there exists a tree fitting algorithm (using the above) such that, given an input d on X (with |X| = n), the output d_T satisfies

‖d − d_T‖_p ≤ 2((n − 3)/n)^{1/q} AvgHyp_q(d) · f(n)

for the same growth function f(n).

Conversely, a tree fitting algorithm yields an ultrametric fitting algorithm. From this we conclude that both problems should have the same asymptotic bound, which justifies our problem formulation.

Theorem 2.6. Let 1 ≤ p < ∞ and 1 ≤ q ≤ ∞. Suppose that we have a tree fitting algorithm such that, for any distance function d on X (with |X| = n), the output d_T satisfies ‖d − d_T‖_p ≤ AvgHyp_q(d) · f(n) for some growth function f(n). Then there exists an ultrametric fitting algorithm (using the above) such that, given an input d on X (with |X| = n), the output d_U satisfies

‖d − d_U‖_p ≤ 3((1 + 2^q)/4)^{1/q} AvgUM_q(d) · f(2n)

for the same growth function f(n).

Proofs of both theorems can be found in Appendices 6.1 and 6.2.

3 Tree and ultrametric fitting: Algorithm and analysis

In this section, we present an algorithm for the p = q = 1 ultrametric fitting problem. We also present the proof of the upper bound and an example showing that this bound is asymptotically tight (despite our empirical evidence to the contrary). Then, using our reduction from Section 2, we produce a tree fitting result.
3.1 HCC Problem

As we detailed in Section 2, correlation clustering is connected with tree and ultrametric fitting, and in this section we present a hierarchical correlation clustering (HCC) algorithm which bounds the number of disagreement edges by a constant factor of the number of bad triangles. We follow with a proposition that shows the connection between the bad triangle objectives and the ℓ1 norm of the ultrametricity vector.

Definition 3.1 (HCC with triangle objectives). Given a vertex set V and k + 1 edge sets, set G_t = (V, E_t) for 0 ≤ t ≤ k. The edge sets are hierarchical, so that E_t ⊆ E_{t+1} for each t. We seek a hierarchical partition P_t so that, for each t, P_t subdivides P_{t+1} and the number of disagreement edges |E_t △ E(P_t)| is bounded by C · |B(G_t)| (where B denotes the set of bad triangles) for some constant C > 0.

Note that this problem does not include the weight sequence {δ_t}, as the desired output will also guarantee an upper bound on Σ_t δ_t |E_t △ E(P_t)|, the usual ℓ1 objective. The next proposition relates the ℓ1 vector norm of Δ(d), the average ultrametricity of the distances, and the bad triangles in the derived graphs. This result is why we adapt the hierarchical correlation clustering problem to include triangle objectives.

Proposition 3.2. Given a distance function d on C(X, 2) (with |X| = n) and s > 0, consider the s-neighbor graph G_s = (X, E_s), where E_s := {(x, y) ∈ C(X, 2) : d(x, y) ≤ s}. Then we have

‖Δ(d)‖_1 = ℓ · AvgUM(d) = ∫₀^∞ |B(G_s)| ds.

3.2 Main results

Our main contribution is that the HCCTRIANGLE algorithm detailed in Section 3.3 solves our modified HCC problem and its output partition behaves reasonably on every level. From this partition, we can construct a good ultrametric fit, which we then leverage for a good (rooted) tree fit (using the reduction from Section 2).

Theorem 3.3. HCCTRIANGLE outputs a hierarchical partition P_t where |E_t △ E(P_t)| ≤ 4|B(G_t)| holds for every 0 ≤ t ≤ C(n, 2) = k. Furthermore, the algorithm runs in time O(n²).

By Proposition 3.2 and Theorem 3.3, we can find an ultrametric fit using the HCCTRIANGLE subroutine to cluster our points. We refer to this algorithm as HCCULTRAFIT. Using Theorem 3.3, we have the following ℓ1 bound.

Theorem 3.4. Given d, HCCULTRAFIT outputs an ultrametric d_U with ‖d − d_U‖_1 ≤ 4‖Δ(d)‖_1 = 4ℓ · AvgUM_1(d). The algorithm runs in time O(n² log n).

In other words, HCCULTRAFIT solves the HCC Problem 3.1 with constant C = 4. The following proof shows why adopting HCCTRIANGLE as a subroutine is successful.

Proof of Theorem 3.4 from Theorem 3.3: Suppose the outputs of HCCTRIANGLE and HCCULTRAFIT are {P_t} and d_U, respectively. Denote d_i := d(e_i) (with d₀ = 0 and d_{k+1} = ∞) and, for any s > 0, set G_s = (X, E_t) and P_s = P_t for d_t ≤ s < d_{t+1}. Then, for any (x, y) ∈ C(X, 2), we see that (x, y) ∈ E_s △ E(P_s) if and only if d(x, y) ≤ s < d_U(x, y) or d_U(x, y) ≤ s < d(x, y). Again, we use the integral from Proposition 3.2: each pair (x, y) contributes to ∫₀^∞ |E_s △ E(P_s)| ds exactly the amount |d_U(x, y) − d(x, y)|. Then, by Theorem 3.3,

‖d − d_U‖_1 = ∫₀^∞ |E_s △ E(P_s)| ds ≤ 4 ∫₀^∞ |B(G_s)| ds = 4‖Δ(d)‖_1,

as desired. Assuming that HCCTRIANGLE runs in time O(n²), HCCULTRAFIT runs in time O(n² log n), as the initial step of sorting over all pairs is needed. This ends the proof.

By the reduction argument we discussed in Section 2, we can put all of these pieces together to conclude the following.

Theorem 3.5. Given (X, d), we can find two tree fits with the following guarantees. Given a base point x ∈ X, HCCROOTEDTREEFIT outputs a tree fit d_T with ‖d − d_T‖_1 ≤ 8‖Δ_x(d)‖_1; the algorithm runs in O(n² log n). Moreover, there exists x ∈ X with ‖Δ_x(d)‖_1 ≤ C(n − 1, 3) · AvgHyp_1(d).
Therefore, given (X, d), one can find a (rooted) tree metric d_T with ‖d − d_T‖_1 ≤ 8 C(n − 1, 3) AvgHyp_1(d) in time O(n³ log n).

3.3 Algorithm and analysis

Algorithm 1 ISHIGHLYCONNECTED: tests if two clusters are highly connected

function ISHIGHLYCONNECTED(X, Y, E)
  for x ∈ X: if |{y ∈ Y : (x, y) ∈ E}| < |Y|/2: return False
  for y ∈ Y: if |{x ∈ X : (x, y) ∈ E}| < |X|/2: return False
  return True

Figure 1: Illustration of the highly connected condition.

Algorithm 2 HCCTRIANGLE: find an HCC which respects the triangle objectives

function HCCTRIANGLE
  Input: V = {v₁, ..., v_n} and an ordering {e₁, e₂, ..., e_{C(n,2)}} of all pairs on V, so that E_t = {e₁, e₂, ..., e_t}
  Desired output: hierarchical partition {P_t} for 0 ≤ t ≤ C(n, 2) = k such that |E_t △ E(P_t)| ≤ 4|B(G_t)| holds for every t
  Init: P ← {{v₁}, {v₂}, ..., {v_n}}; P₀ ← P
  for t ∈ {1, 2, ..., C(n, 2)}:
    take e_t = (x, y) with x ∈ C_x and y ∈ C_y (C_x, C_y ∈ P)
    if C_x ≠ C_y:
      if ISHIGHLYCONNECTED(C_x, C_y, E_t) is true: add C = C_x ∪ C_y to P and remove C_x, C_y from P
    P_t ← P
  return {P_t}

In the rest of this subsection, we provide a sketch of the proof that HCCTRIANGLE provides the desired correlation clustering (the detailed proof can be found in Appendix 6.4). We denote the output of HCCTRIANGLE by {P_t} and also assume E = E_t and P_t = {C₁, C₂, ..., C_k}. The algorithm HCCTRIANGLE agglomerates two clusters if they are highly connected, adds the merged cluster to the partition, and iterates. The key to the ordering of the edges input to HCCTRIANGLE is that they are ordered by increasing distance, so that the number of edges that are in disagreement in the correlation clustering is upper bounded by the number of edges whose distances violate a triangle relationship. The proof proceeds in a bottom-up fashion. (It is clear that the output partition is naturally hierarchical.) We count the number of bad triangles within and between clusters, which are lower bounded by the number of disagreement edges within and between clusters, respectively. The proof uses several combinatorial properties which follow from the highly connected condition; this is the key point of our proof.

Algorithm 3 HCCULTRAFIT: ultrametric fitting algorithm, uses HCCTRIANGLE

function HCCULTRAFIT(d)
  Sort C(X, 2) as {e₁, ..., e_{C(n,2)}} so that d(e₁) ≤ d(e₂) ≤ ... ≤ d(e_{C(n,2)})
  {P_t} ← HCCTRIANGLE(X, {e_t})
  d_U(x, y) ← d(e_j) where j = argmin_t {x, y are in the same cluster in P_t}
  return d_U

From an ultrametric fit to a rooted tree fit: Following a procedure from Cohen-Addad et al. [1], we can also obtain a tree fit with the ℓ1/ℓ1 objective. Note that the algorithm HCCROOTEDTREEFIT uses the generic reduction method from an ultrametric fitting algorithm, but we have instantiated it with HCCULTRAFIT. For a self-contained proof, see Appendix 6.3.

Proof of Theorem 3.5: This follows simply by applying Theorem 2.5 with p = q = 1 and f(n) = 4 C(n, 3).

Algorithm 4 HCCROOTEDTREEFIT: find a tree fitting given an ultrametric fitting procedure

procedure HCCROOTEDTREEFIT
  Input: distance function d on C(X, 2) and base point w ∈ X
  Output: (rooted) tree metric d_T which fits d with d(x, w) = d_T(x, w) for all x ∈ X
  M ← max_{x∈X} d(x, w)
  c_w(x, y) ← 2M − d(x, w) − d(y, w) for all (x, y) ∈ C(X, 2)
  d_U ← HCCULTRAFIT(d + c_w)
  d_T ← d_U − c_w
  return d_T

Running time: Although ISHIGHLYCONNECTED is seemingly expensive, there is a way to implement HCCTRIANGLE so that all procedures run in O(n²) time. Thus, HCCULTRAFIT can be implemented so as to run in O(n² log n) time. The detailed algorithm can be found in Appendix 6.5.
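The following is a direct, unoptimized Python transcription of Algorithms 1–4 (our sketch; the O(n²) implementation referred to above is described in Appendix 6.5 and is more involved). The function names mirror the pseudocode.

```python
import itertools
import numpy as np

def is_highly_connected(A, B, adj):
    # Algorithm 1: every vertex of A must have >= |B|/2 neighbors in B and vice versa.
    return (all(sum(adj[x][y] for y in B) >= len(B) / 2 for x in A) and
            all(sum(adj[x][y] for x in A) >= len(A) / 2 for y in B))

def hcc_triangle(n, ordered_pairs):
    # Algorithm 2: scan pairs in the given order, merging two clusters only when they
    # are highly connected in the current edge set; record the partition at every step.
    # Plain transcription (no attempt at the O(n^2) implementation).
    adj = [[False] * n for _ in range(n)]
    cluster = {i: frozenset([i]) for i in range(n)}      # vertex -> its cluster
    partitions = [set(cluster.values())]                  # P_0: all singletons
    for (x, y) in ordered_pairs:
        adj[x][y] = adj[y][x] = True
        Cx, Cy = cluster[x], cluster[y]
        if Cx is not Cy and is_highly_connected(Cx, Cy, adj):
            merged = Cx | Cy
            for v in merged:
                cluster[v] = merged
        partitions.append(set(cluster.values()))
    return partitions

def hcc_ultrafit(D):
    # Algorithm 3: order pairs by distance, run HCCTRIANGLE, and set d_U(x, y) to the
    # distance of the pair whose insertion first put x and y into a common cluster.
    n = D.shape[0]
    pairs = sorted(itertools.combinations(range(n), 2), key=lambda e: D[e])
    partitions = hcc_triangle(n, pairs)
    dU = np.zeros_like(D)
    for x, y in itertools.combinations(range(n), 2):
        t = next(t for t, P in enumerate(partitions)
                 if any(x in C and y in C for C in P))
        dU[x, y] = dU[y, x] = D[pairs[t - 1]]
    return dU

def hcc_rooted_tree_fit(D, w):
    # Algorithm 4: shift by c_w, fit an ultrametric, shift back.
    cw = 2 * D[w].max() - D[w][:, None] - D[w][None, :]
    dT = hcc_ultrafit(D + cw) - cw
    np.fill_diagonal(dT, 0.0)
    return dT

# Usage sketch on a small random Euclidean point set: the l1 error of the
# ultrametric fit is expected to be at most 4 * ||Delta(d)||_1 (Theorem 3.4).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = rng.random((10, 2))
    D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    dU = hcc_ultrafit(D)
    iu = np.triu_indices(10, 1)
    l1_err = np.abs(D - dU)[iu].sum()
    delta_l1 = sum(sorted([D[x, y], D[y, z], D[x, z]])[2]
                   - sorted([D[x, y], D[y, z], D[x, z]])[1]
                   for x, y, z in itertools.combinations(range(10), 3))
    print(l1_err, 4 * delta_l1)
```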
Asymptotic tightness: Consider (X, d) with X = {x₁, ..., x_n}, d(x₁, x₂) = d(x₁, x₃) = 1, and all other distances equal to 2. Then tp(d; x₁, x₂, x₃) = 1 while tp vanishes on all other triples, so that ‖Δ(d)‖_p = 1 for every p. One can easily check that, for any ultrametric d_U, |ϵ(x₁, x₂)|^p + |ϵ(x₁, x₃)|^p + |ϵ(x₂, x₃)|^p ≥ 2^{1−p} for ϵ := d_U − d. When p = 1, ‖d − d_U‖_1 ≥ ‖Δ(d)‖_1 = C(n, 3) AvgUM(d) holds for any ultrametric d_U, while HCCULTRAFIT guarantees ‖d − d_U‖_1 ≤ 4‖Δ(d)‖_1 = 4 C(n, 3) AvgUM(d); this shows that our theoretical bound is asymptotically tight. Examples demonstrating how HCCULTRAFIT works and related discussion can be found in Appendix 6.7.1.

4 Experiments

In this section, we run HCCROOTEDTREEFIT on several different types of data sets: those that are standard for geometric graph neural nets and those that are synthetic. We also compare our results with other known algorithms. We conclude that HCCROOTEDTREEFIT (HCC) is best when the data sets are close to tree-like and when we measure with respect to ℓ1 distortion and running time. It is, however, suboptimal in terms of the ℓ∞ measure of distortion (as is to be expected). We also conclude that purportedly hierarchical data sets do not, in fact, embed into trees with low distortion, suggesting that geometric graph neural nets should be configured with different geometric considerations. Appendix 6.9 contains further details regarding the experiments.

| Data set | C-ELEGAN | CS PHD | CORA | AIRPORT |
| --- | --- | --- | --- | --- |
| n | 452 | 1025 | 2485 | 3158 |
| Hyp(d) | 1.5 | 6.5 | 11 | 1 |
| AvgHyp(d) | 0.13 | 0.51 | 0.36 | 0.18 |
| Bound | 158.0 | 1384.5 | 2376.2 | 1547.1 |
| HCC | 0.90 ± 0.19 | 2.60 ± 1.11 | 2.42 ± 0.44 | 1.09 ± 0.15 |
| Gromov | 1.14 ± 0.04 | 2.79 ± 0.36 | 3.38 ± 0.13 | 1.56 ± 0.08 |
| TR | 0.83 ± 0.16 | 2.55 ± 1.34 | 2.91 ± 0.63 | 1.28 ± 0.21 |
| NJ | 0.30 | 1.35 | 1.06 | 0.49 |

Table 2: Connection between hyperbolicity and average hyperbolicity and how these quantities determine the average distortion (‖d − d_T‖_1 / C(n, 2)) of the resulting tree metric.

| Data set | C-ELEGAN | CS PHD | CORA | AIRPORT |
| --- | --- | --- | --- | --- |
| HCC | 4.3 ± 0.64 | 23.37 ± 3.20 | 19.30 ± 1.11 | 7.63 ± 0.54 |
| Gromov | 3.32 ± 0.47 | 13.24 ± 0.67 | 9.23 ± 0.53 | 4.04 ± 0.20 |
| TR | 5.90 ± 0.72 | 21.01 ± 3.34 | 16.86 ± 2.11 | 10.00 ± 1.02 |
| NJ | 2.97 | 16.81 | 13.42 | 4.18 |

Table 3: ℓ∞ error (i.e., max distortion).

4.1 Common data sets

We used common unweighted graph data sets which are known to be hyperbolic or close to tree-like and are often used in graph neural nets, especially those with geometric considerations. The data sets we used are C-ELEGAN and CS PHD from [14], and CORA and AIRPORT from [15]. (For those which contain multiple components, we chose the largest connected component.) Given an unweighted graph, we computed its shortest-path distance matrix and used it as the input to obtain a tree metric. We compared these results with the following other tree fitting algorithms: TREEREP (TR) [9], NEIGHBORJOIN (NJ) [8], and the classical Gromov algorithm. As TREEREP is a randomized algorithm, and HCCROOTEDTREEFIT and Gromov's algorithm depend on the choice of a pivot vertex, we run all of these algorithms 100 times and report the average error with standard deviation. All edges with negative weights have been modified to weight 0, as TREEREP and NEIGHBORJOIN both occasionally produce edges with negative weights. Recall that neither TREEREP nor NEIGHBORJOIN enjoys a theoretical guarantee on distortion.
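For reference, the preprocessing just described (largest connected component, then all-pairs shortest paths) and the average distortion reported in Table 2 can be sketched as follows, assuming the graphs are loaded as networkx objects (loading the data sets themselves from the repositories of [14], [15] is not shown):

```python
import networkx as nx
import numpy as np

def graph_to_distance_matrix(G):
    # Shortest-path distances of the largest connected component, as a dense matrix.
    nodes = max(nx.connected_components(G), key=len)
    H = nx.convert_node_labels_to_integers(G.subgraph(nodes))
    n = H.number_of_nodes()
    D = np.zeros((n, n))
    for i, lengths in nx.all_pairs_shortest_path_length(H):
        for j, dij in lengths.items():
            D[i, j] = dij
    return D

def average_l1_error(D, DT):
    # Average distortion ||d - d_T||_1 / C(n, 2), the quantity reported in Table 2.
    n = D.shape[0]
    iu = np.triu_indices(n, 1)
    return np.abs(D - DT)[iu].sum() / (n * (n - 1) / 2)
```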
First, we examine the results in Table 2. We note that although the guaranteed bound on the average distortion, 8 C(n − 1, 3) AvgHyp(d) / C(n, 2), is asymptotically tight even in a worst case analysis, this bound is quite loose in practice; most fitting algorithms perform much better than that. We also see that the ℓ1 error of HCCROOTEDTREEFIT is comparable to that of TREEREP, while NEIGHBORJOIN performs much better than both. NEIGHBORJOIN tends to perform better when the graph data set is known to be more hyperbolic (or tree-like), despite having no theoretical guarantees. It is, however, quite slow. Also, we note from Table 3 that Gromov's algorithm, which solves our ℓ∞/ℓ∞ hyperbolicity problem, tends to return better output in terms of ℓ∞ error. On the other hand, its result on ℓ1 error is not as good as the other algorithms. In contrast, our HCCROOTEDTREEFIT performs better on the ℓ1 objective, which suggests that our approach to this problem is on target.

| Algorithm | Complexity | C-ELEGAN | CS PHD | CORA | AIRPORT |
| --- | --- | --- | --- | --- | --- |
| HCC | O(n² log n) | 0.648 ± 0.013 | 3.114 ± 0.029 | 18.125 ± 0.330 | 28.821 ± 0.345 |
| Gromov | O(n²) | 0.055 ± 0.005 | 0.296 ± 0.004 | 2.063 ± 0.033 | 3.251 ± 0.033 |
| TR | O(n²) | 0.068 ± 0.009 | 0.223 ± 0.046 | 0.610 ± 0.080 | 0.764 ± 0.151 |
| NJ | O(n³) | 0.336 | 4.659 | 268.45 | 804.67 |

Table 4: Running time (s). For NJ, we implemented the naive algorithm, which is O(n³). We note that TREEREP produces a tree and not a set of distances; its running times exclude the step which computes the distance matrix, which takes O(n²) time.

In analyzing the running times in Table 4, we notice that HCCROOTEDTREEFIT runs in truly O(n² log n) time. Also, its dominant part is the subroutine HCCTRIANGLE, which runs in O(n²).

4.2 Synthetic data sets

In order to analyze the performance of our algorithm in a more thorough and rigorous fashion, we generate random synthetic distances with low hyperbolicity. More precisely, we construct a synthetic weighted graph from fixed balanced trees. We use BT(r, h) to indicate a balanced tree with branching factor r and height h, and DISEASE from [6], an unweighted tree data set. For each tree, we add edges randomly until we reach 500 additional edges. Each added edge is given a distance designed empirically to keep the δ-hyperbolicity bounded by a value of 0.2 (a sketch of this construction is given at the end of this subsection). Then we measure the ℓ1 error of each tree fitting algorithm (averaged over 50 trials). Note that all rooted tree fitting algorithms use a root node (for balanced trees, we used the apex; for DISEASE, we used node 0).

| Initial tree | BT(8, 3) | BT(5, 4) | BT(3, 5) | BT(2, 8) | DISEASE |
| --- | --- | --- | --- | --- | --- |
| n | 585 | 776 | 364 | 511 | 2665 |
| ‖Δ_r(d)‖_1 / C(n − 1, 3) | 0.01887 ± 0.00378 | 0.01876 ± 0.00375 | 0.00150 ± 0.00036 | 0.00098 ± 0.00020 | 0.00013 ± 0.00010 |
| HCC | 0.00443 ± 0.00098 | 0.01538 ± 0.00733 | 0.04153 ± 0.01111 | 0.07426 ± 0.02027 | 0.00061 ± 0.00058 |
| Gromov | 0.18225 ± 0.00237 | 0.44015 ± 0.00248 | 0.17085 ± 0.00975 | 0.16898 ± 0.00975 | 0.18977 ± 0.00196 |
| TR | 0.01360 ± 0.00346 | 0.01103 ± 0.00336 | 0.06080 ± 0.00874 | 0.09550 ± 0.01887 | 0.00081 ± 0.00059 |
| NJ | 0.03180 ± 0.00767 | 0.06092 ± 0.01951 | 0.04309 ± 0.00521 | 0.04360 ± 0.00648 | 0.00601 ± 0.00281 |

Table 5: Average ℓ1 error when ne = 500.

For these experiments, we see quite different results in Table 5. All of these data sets are truly tree-like. Clearly, NEIGHBORJOIN performs considerably worse on these data than on the common data sets above, especially when the input comes from a tree with high branching factor (note that the branching factor of DISEASE is recorded as 6.224, which is also high). We also note that Gromov's method behaves much worse than all the other algorithms. This is possibly because Gromov's method is known to produce a non-stretching tree fit, while it is a better idea to stretch the metric in this case. The theoretical bound is still quite loose, but not as much as with the common data sets.
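Here is a sketch (ours) of that synthetic construction for BT(r, h), following the edge-addition rule spelled out in Appendix 6.9: an added edge (v, w) gets weight d_T(v, w) − 2δ with δ = 0.1, and only pairs at tree distance greater than 2 are used. networkx is assumed; the exact sampling details used in the experiments may differ.

```python
import networkx as nx
import numpy as np

def noisy_balanced_tree(r, h, ne=500, delta=0.1, seed=0):
    # BT(r, h) plus ne random "shortcut" edges; following Appendix 6.9, an added
    # edge (v, w) gets weight d_T(v, w) - 2*delta, and only pairs with tree
    # distance > 2 are used, which keeps the delta-hyperbolicity small.
    rng = np.random.default_rng(seed)
    G = nx.balanced_tree(r, h)
    nx.set_edge_attributes(G, 1.0, "weight")
    dT = dict(nx.all_pairs_shortest_path_length(G))   # tree distances (unit weights)
    nodes = list(G.nodes)
    added = 0
    while added < ne:
        v, w = rng.choice(nodes, size=2, replace=False)
        if dT[v][w] > 2 and not G.has_edge(v, w):
            G.add_edge(v, w, weight=dT[v][w] - 2 * delta)
            added += 1
    return G

# The input to the tree fitting algorithms is then the weighted shortest-path metric:
# D = dict(nx.all_pairs_dijkstra_path_length(G, weight="weight"))
```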
5 Discussion

All of our experiments show that it is critical to quantify how tree-like a data set is in order to understand how well different tree fitting algorithms will perform on that data set. In other words, we cannot simply assume that a data set is generic when fitting a tree to it. Furthermore, we develop both a measure of how tree-like a data set is and an algorithm, HCCROOTEDTREEFIT, that leverages this behavior so as to minimize the appropriate distortion of the fit. The performance of HCCROOTEDTREEFIT and other standard tree fitting algorithms shows that commonly used data sets (especially in geometric graph neural nets) are quite far from tree-like and are not well-represented by trees, especially when compared with synthetic data sets. This suggests that we need considerably more refined geometric notions for learning tasks with these data sets.

References

[1] V. Cohen-Addad, D. Das, E. Kipouridis, N. Parotsidis, and M. Thorup, "Fitting distances by tree metrics minimizing the total error within a constant factor," in 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), IEEE, 2022, pp. 468–479.
[2] N. Ailon and M. Charikar, "Fitting tree metrics: Hierarchical clustering and phylogeny," in 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS'05), IEEE, 2005, pp. 73–82.
[3] M. Farach, S. Kannan, and T. Warnow, "A robust model for finding optimal evolutionary trees," Algorithmica, vol. 13, no. 1, pp. 155–179, 1995.
[4] M. Gromov, "Hyperbolic groups," in Essays in Group Theory, Springer, 1987, pp. 75–263.
[5] L. L. Cavalli-Sforza and A. W. Edwards, "Phylogenetic analysis: Models and estimation procedures," American Journal of Human Genetics, vol. 19, no. 3 Pt 1, p. 233, 1967.
[6] I. Chami, Z. Ying, C. Ré, and J. Leskovec, "Hyperbolic graph convolutional neural networks," Advances in Neural Information Processing Systems, vol. 32, 2019.
[7] R. Agarwala, V. Bafna, M. Farach, M. Paterson, and M. Thorup, "On the approximability of numerical taxonomy (fitting distances by tree metrics)," SIAM Journal on Computing, vol. 28, no. 3, pp. 1073–1085, 1998.
[8] N. Saitou and M. Nei, "The neighbor-joining method: A new method for reconstructing phylogenetic trees," Molecular Biology and Evolution, vol. 4, no. 4, pp. 406–425, 1987.
[9] R. Sonthalia and A. C. Gilbert, "Tree! I am no tree! I am a low dimensional hyperbolic embedding," arXiv preprint arXiv:2005.03847, 2020.
[10] S. Chatterjee and L. Sloman, "Average Gromov hyperbolicity and the Parisi ansatz," Advances in Mathematics, vol. 376, p. 107417, 2021.
[11] É. Ghys and P. De La Harpe, "Espaces métriques hyperboliques," in Sur les groupes hyperboliques d'après Mikhael Gromov, Springer, 1990, pp. 27–45.
[12] B. Harb, S. Kannan, and A. McGregor, "Approximating the best-fit tree under ℓp norms," in Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques: APPROX 2005 and RANDOM 2005, Berkeley, CA, USA, August 22–24, 2005, Proceedings, Springer, 2005, pp. 123–133.
[13] S. Chowdhury, F. Mémoli, and Z. T. Smith, "Improved error bounds for tree representations of metric spaces," Advances in Neural Information Processing Systems, vol. 29, 2016.
[14] R. A. Rossi and N. K. Ahmed, "The network data repository with interactive graph analytics and visualization," in AAAI, 2015. [Online]. Available: https://networkrepository.com.
[15] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, "Collective classification in network data," AI Magazine, vol. 29, no. 3, pp. 93–93, 2008.
[16] W. H. Day, "Computational complexity of inferring phylogenies from dissimilarity matrices," Bulletin of Mathematical Biology, vol. 49, no. 4,
pp. 461–467, 1987.
[17] B. H. Bowditch, A Course on Geometric Group Theory, 2006.
[18] H. Samet, "The quadtree and related hierarchical data structures," ACM Computing Surveys (CSUR), vol. 16, no. 2, pp. 187–260, 1984.
[19] M. Elkin, Y. Emek, D. A. Spielman, and S.-H. Teng, "Lower-stretch spanning trees," SIAM Journal on Computing, vol. 38, no. 2, pp. 608–628, 2008.
[20] N. Ailon, M. Charikar, and A. Newman, "Aggregating inconsistent information: Ranking and clustering," Journal of the ACM (JACM), vol. 55, no. 5, pp. 1–27, 2008.
[21] V. Chepoi, F. Dragan, B. Estellon, M. Habib, and Y. Vaxès, "Diameters, centers, and approximating trees of delta-hyperbolic geodesic spaces and graphs," in Proceedings of the Twenty-Fourth Annual Symposium on Computational Geometry, 2008, pp. 59–68.
[22] W. Chen, W. Fang, G. Hu, and M. W. Mahoney, "On the hyperbolicity of small-world and treelike random graphs," Internet Mathematics, vol. 9, no. 4, pp. 434–491, 2013.
[23] D. Coudert, A. Nusser, and L. Viennot, "Computing graph hyperbolicity using dominating sets," in 2022 Proceedings of the Symposium on Algorithm Engineering and Experiments (ALENEX), SIAM, 2022, pp. 78–90.
[24] R. Sarkar, "Low distortion Delaunay embedding of trees in hyperbolic plane," in International Symposium on Graph Drawing, Springer, 2011, pp. 355–366.
[25] F. Sala, C. De Sa, A. Gu, and C. Ré, "Representation tradeoffs for hyperbolic embeddings," in International Conference on Machine Learning, PMLR, 2018, pp. 4460–4469.
[26] I. Chami, A. Gu, V. Chatziafratis, and C. Ré, "From trees to continuous embeddings and back: Hyperbolic hierarchical clustering," Advances in Neural Information Processing Systems, vol. 33, pp. 15065–15076, 2020.
[27] N. Monath, M. Zaheer, D. Silva, A. McCallum, and A. Ahmed, "Gradient-based hierarchical clustering using continuous representations of trees in hyperbolic space," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 714–722.
[28] M. Nickel and D. Kiela, "Poincaré embeddings for learning hierarchical representations," Advances in Neural Information Processing Systems, vol. 30, 2017.
[29] N. Linial, E. London, and Y. Rabinovich, "The geometry of graphs and some of its algorithmic applications," Combinatorica, vol. 15, no. 2, pp. 215–245, 1995.
[30] M. Nickel and D. Kiela, "Learning continuous hierarchies in the Lorentz model of hyperbolic geometry," in International Conference on Machine Learning, PMLR, 2018, pp. 3779–3788.
[31] H. Fournier, A. Ismail, and A. Vigneron, "Computing the Gromov hyperbolicity of a discrete metric space," Information Processing Letters, vol. 115, no. 6–8, pp. 576–579, 2015.
[32] P. K. Agarwal, K. Fox, A. Nath, A. Sidiropoulos, and Y. Wang, "Computing the Gromov-Hausdorff distance for metric trees," ACM Transactions on Algorithms (TALG), vol. 14, no. 2, pp. 1–20, 2018.
[33] B. Nica and J. Špakula, "Strong hyperbolicity," Groups, Geometry, and Dynamics, vol. 10, no. 3, pp. 951–964, 2016.
[34] T. Das, D. Simmons, and M. Urbański, Geometry and Dynamics in Gromov Hyperbolic Metric Spaces. American Mathematical Society, 2017, vol. 218.
[35] I. Abraham, M. Balakrishnan, F. Kuhn, D. Malkhi, V. Ramasubramanian, and K. Talwar, "Reconstructing approximate tree metrics," in Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing, 2007, pp. 43–52.

6.1 Proof of Theorem 2.5

We follow the details from [7] (and [1], to address minor issues). This is, in fact, the generic reduction method from an ultrametric fitting algorithm.
Algorithm 5 Find a tree fitting given an ultrametric fitting procedure

procedure ULTRAFIT
  Input: distance function d
  Output: ultrametric d_U′ which fits d

procedure ROOTEDTREEFIT
  Input: distance function d on C(X, 2) and base point w ∈ X
  Output: (rooted) tree metric d_T which fits d with d(w, x) = d_T(w, x) for all x ∈ X
  Define m = max_{x∈X} d(w, x), c_w(x, y) = 2m − d(w, x) − d(w, y), and β_x = 2(m − d(w, x)) for x ∈ X
  d_U′ = ULTRAFIT(d + c_w)
  Restrict d_U(x, y) = min(max(β_x, β_y, d_U′(x, y)), 2m)
  d_T = d_U − c_w
  return d_T

Claim: For any x, y ∈ X, |d_U(x, y) − (d + c_w)(x, y)| ≤ |d_U′(x, y) − (d + c_w)(x, y)| holds. In other words, the restriction does not increase the error.

Proof: It is enough to check the two cases in which d_U differs from d_U′.
Case 1: max(β_x, β_y) = d_U(x, y) > d_U′(x, y). Without loss of generality, assume d_U(x, y) = β_x ≥ β_y. We have (d + c_w)(x, y) = d(x, y) + c_w(x, y) ≥ (d(w, y) − d(w, x)) + (2m − d(w, x) − d(w, y)) = 2m − 2d(w, x) = β_x, which shows (d + c_w)(x, y) ≥ d_U(x, y) > d_U′(x, y). Therefore the claim holds.
Case 2: 2m = d_U(x, y) < d_U′(x, y). Since (d + c_w)(x, y) = 2m − 2gp_w(x, y) ≤ 2m, we have (d + c_w)(x, y) ≤ d_U(x, y) < d_U′(x, y). Again, the claim holds. This completes the proof.

Claim: The restriction d_U is also an ultrametric.

Proof: We need to prove d_U(x, y) ≤ max(d_U(x, z), d_U(y, z)) for all x, y, z ∈ X. As d_U′ is an ultrametric by assumption, it is enough to check the cases where d_U(x, y) > d_U′(x, y) or max(d_U(x, z), d_U(y, z)) < max(d_U′(x, z), d_U′(y, z)).
Case 1: d_U(x, y) > d_U′(x, y). Then max(β_x, β_y) = d_U(x, y) > d_U′(x, y). Without loss of generality, assume d_U(x, y) = β_x ≥ β_y. Then d_U(x, z) ≥ max(β_x, β_z) ≥ β_x = d_U(x, y), which shows d_U(x, y) ≤ max(d_U(x, z), d_U(y, z)).
Case 2: max(d_U(x, z), d_U(y, z)) < max(d_U′(x, z), d_U′(y, z)). Then d_U(x, z) < d_U′(x, z) or d_U(y, z) < d_U′(y, z) holds. In either case, the smaller restricted value was obtained by clipping at the maximum value 2m, so that max(d_U(x, z), d_U(y, z)) = 2m ≥ d_U(x, y). Therefore, we conclude that the strengthened triangle inequality still holds, which completes the proof.

Claim: With this restriction, d_T is a tree metric.

Proof: First, we verify that d_T is a metric. We have d_T(x, y) = d_U(x, y) − c_w(x, y) ≥ max(β_x, β_y) − c_w(x, y) = |d(w, x) − d(w, y)| ≥ 0, so that d_T is non-negative. Next, we prove the triangle inequality d_T(x, z) ≤ d_T(x, y) + d_T(y, z). Without loss of generality, assume d_U(x, y) ≥ d_U(y, z). Then

d_T(x, z) = d_U(x, z) − c_w(x, z) ≤ max(d_U(x, y), d_U(y, z)) − c_w(x, z) = d_U(x, y) − c_w(x, y) + c_w(x, y) − c_w(x, z) = d_T(x, y) + (d(w, z) − d(w, y)) ≤ d_T(x, y) + |d(w, z) − d(w, y)| ≤ d_T(x, y) + d_T(y, z).

Therefore, d_T is a (non-negative) metric. To show it is a tree metric, we examine the four point condition of d_T. For any x, y, z, t ∈ X,

d_T(x, y) + d_T(z, t) = (d_U(x, y) − c_w(x, y)) + (d_U(z, t) − c_w(z, t)) = (d_U(x, y) + d_U(z, t)) − (4m − d(w, x) − d(w, y) − d(w, z) − d(w, t)),

so every pair sum of d_T differs from the corresponding pair sum of d_U by the same quantity 4m − d(w, x) − d(w, y) − d(w, z) − d(w, t). As d_U is an ultrametric and thus 0-hyperbolic, d_T is also 0-hyperbolic, as desired. Thus, d_T is a tree metric.

Finally, we prove that the ℓp error of d_T is bounded as desired. This follows from

‖d − d_T‖_p = ‖(d + c_w) − (d_T + c_w)‖_p = ‖d_U − (d + c_w)‖_p ≤ ‖d_U′ − (d + c_w)‖_p ≤ AvgUM_q(d + c_w) · f(n),

by assumption. For 1 ≤ q < ∞, we have

Σ_{w∈X} (AvgUM_q(d + c_w))^q = (1/C(n, 3)) Σ_{w∈X} Σ_{{x,y,z} ∈ C(X\{w}, 3)} tp(d + c_w; x, y, z)^q ≤ (1/C(n, 3)) Σ_{w∈X} Σ_{{x,y,z} ∈ C(X\{w}, 3)} 2^q fp_w(d; x, y, z)^q = (2^q · 4 / C(n, 3)) Σ_{{x,y,z,w} ∈ C(X, 4)} fp(d; x, y, z, w)^q = 2^q (n − 3) (AvgHyp_q(d))^q.
Therefore, there exists w ∈ X so that AvgUM_q(d + c_w) ≤ 2((n − 3)/n)^{1/q} AvgHyp_q(d). The q = ∞ case can be checked separately. Hence, there exists w ∈ X for which this reduction yields a tree fitting with ℓp error bounded by 2((n − 3)/n)^{1/q} AvgHyp_q(d) · f(n), as desired. Note that when we use HCCULTRAFIT, max(β_x, β_y) ≤ d_U′(x, y) ≤ 2m always holds, so that the clipping is not necessary.

6.2 Proof of Theorem 2.6

Our reduction is based on the technique from [16] and Section 9 of [1]. Given d on X, we construct a distance d′ on Z = X ∪ Y, with |X| = |Y| = n, by

d′(z, w) = d(z, w) if z, w ∈ X; M if one of z, w is in X and the other in Y; c if z, w ∈ Y,

for all z ≠ w ∈ Z, with M large enough and c small enough. Then, first, we have:

Claim: If 1 ≤ q < ∞, then AvgHyp_q(d′) ≤ ((1 + 2^q)/4)^{1/q} AvgUM_q(d). For q = ∞, we have Hyp(d′) ≤ 2 UM(d).

Proof: It can be checked that if at least two of x, y, z, w are from Y, then we immediately have fp(d′; x, y, z, w) = 0. Therefore, we get

Σ_{{x,y,z,w} ∈ C(Z,4)} fp(d′; x, y, z, w)^q = Σ_{{x,y,z,w} ∈ C(X,4)} fp(d′; x, y, z, w)^q + Σ_{{x,y,z} ∈ C(X,3), w ∈ Y} fp(d′; x, y, z, w)^q ≤ Σ_{{x,y,z,w} ∈ C(X,4)} fp(d; x, y, z, w)^q + n Σ_{{x,y,z} ∈ C(X,3)} tp(d; x, y, z)^q = C(n, 4) AvgHyp_q(d)^q + n C(n, 3) AvgUM_q(d)^q.

To bound AvgHyp_q, we use the fact that fp(d; x, y, z, w) ≤ (1/2)[tp(d; x, y, z) + tp(d; x, y, w) + tp(d; x, z, w) + tp(d; y, z, w)]. Then

AvgHyp_q(d)^q = (1/C(n, 4)) Σ_{{x,y,z,w} ∈ C(X,4)} fp(d; x, y, z, w)^q ≤ (2^q / (4 C(n, 4))) Σ_{{x,y,z,w} ∈ C(X,4)} [tp(d; x, y, z)^q + tp(d; x, y, w)^q + tp(d; x, z, w)^q + tp(d; y, z, w)^q] = (2^q (n − 3) / (4 C(n, 4))) Σ_{{x,y,z} ∈ C(X,3)} tp(d; x, y, z)^q = 2^q AvgUM_q(d)^q.

To sum up, we get

AvgHyp_q(d′)^q = (1/C(2n, 4)) Σ_{{x,y,z,w} ∈ C(Z,4)} fp(d′; x, y, z, w)^q ≤ (C(n, 4)/C(2n, 4)) AvgHyp_q(d)^q + (n C(n, 3)/C(2n, 4)) AvgUM_q(d)^q ≤ (1/16) AvgHyp_q(d)^q + (1/4) AvgUM_q(d)^q ≤ (2^q/16) AvgUM_q(d)^q + (1/4) AvgUM_q(d)^q ≤ ((1 + 2^q)/4) AvgUM_q(d)^q.

The q = ∞ case can be checked separately.

Therefore, by the claim above and the assumption, we have a tree fitting d_T which fits d′ with ‖d_T − d′‖_p ≤ AvgHyp_q(d′) · f(2n). Our goal is to construct a reduction which deduces an ultrametric fit d_U on X from d_T. First:

Claim: If M is large enough and c is small enough, then d_T(x₁, y) + d_T(x₂, y) − d_T(x₁, x₂) ≥ 2c holds for every x₁, x₂ ∈ X and y ∈ Y.

Proof: Suppose the claim fails. As d′(x₁, y) + d′(x₂, y) − d′(x₁, x₂) = M + M − d(x₁, x₂), one of the three pairwise distances must have distortion at least (1/3)(2M − 2c − d(x₁, x₂)). This cannot happen if M − c is much larger than AvgHyp_q(d′) · f(2n).

We interpret the above claim as a structural property. Given a tree T which realizes the tree metric d_T, one can find a Steiner node w of the triple x₁, x₂, y. Then d_T(w, y) = (1/2)[d_T(x₁, y) + d_T(x₂, y) − d_T(x₁, x₂)] ≥ c. Since this holds for any x₁, x₂ ∈ X, every geodesic segment [x, y] for fixed y ∈ Y shares a common sub-path of length at least c. This observation allows the following uniformization on Y.

Claim: One can refine the tree metric so that d_T(y₁, y₂) = c for all y₁, y₂ ∈ Y with the ℓp error non-increasing.

Algorithm 6 Refine a tree fit on Z = X ∪ Y

procedure UNIFORMIZATION ON Y
  Input: tree fit (T, d_T) on Z = X ∪ Y, 1 ≤ p < ∞
  Output: refined tree (T′, d_T′) on Z
  d_T′(x₁, x₂) ← d_T(x₁, x₂) for x₁ ≠ x₂ ∈ X
  d_T′(y₁, y₂) ← c for y₁ ≠ y₂ ∈ Y
  Find y₀ = argmin_{y∈Y} Σ_{x∈X} |d_T(x, y) − M|^p
  d_T′(x, y) ← d_T(x, y₀) for x ∈ X, y ∈ Y
  return d_T′

Proof: Pick y₀ = argmin_{y∈Y} Σ_{x∈X} |d_T(x, y) − d′(x, y)|^p. We keep X ∪ {y₀} with its tree structure and relocate all other y ≠ y₀ in Y. As the geodesic segments ending at y₀ all share a sub-path of length at least c, we pick the point z on that shared segment with d_T(y₀, z) = c/2 and draw an edge (z, y) of length c/2 for every y ∈ Y \ {y₀}. Then d_T′(y₁, y₂) = c as desired.
Furthermore, as d_T′(x, y) = d_T(x, y₀) for any x ∈ X and y ∈ Y,

Σ_{{z,w} ∈ C(Z,2)} |d_T′(z, w) − d′(z, w)|^p = Σ_{{x,x′} ∈ C(X,2)} |d_T(x, x′) − d(x, x′)|^p + Σ_{x∈X, y∈Y} |d_T(x, y₀) − M|^p + Σ_{{y,y′} ∈ C(Y,2)} 0 ≤ Σ_{{x,x′} ∈ C(X,2)} |d_T(x, x′) − d(x, x′)|^p + Σ_{x∈X, y∈Y} |d_T(x, y) − M|^p ≤ ‖d_T − d′‖_p^p,

where the first inequality uses the choice of y₀, as desired.

Algorithm 7 Construction of a restricted tree with respect to a given root

procedure RESTRICTTREE
  Input: tree fitting (T, d_T) on Z
  Output: tree fitting d_T′ such that d_T′(x, y) = M for all x ∈ X and y ∈ Y
  for each x ∈ X:
    if d_T(x, y₀) > M: relocate x to the point on the geodesic [x, y₀] at distance M from y₀, so that d_T′(x, y₀) = M
    else if d_T(x, y₀) < M: add an edge (x, x′) of length M − d_T(x, y₀) and relocate x to x′
  return (T′, d_T′)

Claim: One can refine the tree metric so that d_T(x, y) = M for all x ∈ X and y ∈ Y with ℓp error at most 3^{(p−1)/p} times the original error. Furthermore, if we restrict the resulting metric to X, it is an ultrametric.

Proof: We use RESTRICTTREE to obtain the restricted tree as described. It is clear that the procedure returns a tree metric d_T′ with d_T′(x, y) = M for all x ∈ X and y ∈ Y.

Figure 2: How RESTRICTTREE works: every vertex x ∈ X is relocated so that its distance to y₀ becomes exactly M (the two panels show the cases d_T(x, y₀) > M and d_T(x′, y₀) < M).

Denote ϵ := d_T − d′. For any pair x₁, x₂ ∈ X, the distortion of that pair increases by at most |ϵ(x₁, y₀)| + |ϵ(x₂, y₀)|, which is the total amount of relocation. We can use this to bound

‖d_T′ − d′‖_p^p = Σ_{{z,w} ∈ C(Z,2)} |d_T′(z, w) − d′(z, w)|^p = Σ_{{x,x′} ∈ C(X,2)} |d_T′(x, x′) − d(x, x′)|^p ≤ Σ_{{x,x′} ∈ C(X,2)} (|ϵ(x, y₀)| + |ϵ(x, x′)| + |ϵ(x′, y₀)|)^p ≤ 3^{p−1} Σ_{{x,x′} ∈ C(X,2)} (|ϵ(x, y₀)|^p + |ϵ(x, x′)|^p + |ϵ(x′, y₀)|^p) = 3^{p−1} [ Σ_{{x,x′} ∈ C(X,2)} |ϵ(x, x′)|^p + (n − 1) Σ_{x∈X} |ϵ(x, y₀)|^p ] ≤ 3^{p−1} [ Σ_{{x,x′} ∈ C(X,2)} |ϵ(x, x′)|^p + n Σ_{x∈X} |ϵ(x, y₀)|^p ] ≤ 3^{p−1} ‖d_T − d′‖_p^p,

as desired, where the last inequality uses the choice of y₀. Then, by the restriction, d_T′(x, y₀) = M holds for all x ∈ X. Therefore, tp(d_T′; x₁, x₂, x₃) = 2 fp(d_T′; x₁, x₂, x₃, y₀) = 0 holds for all {x₁, x₂, x₃} ∈ C(X, 3), which shows that d_T′ restricted to X is an ultrametric. Denote the restricted metric by d_U (on C(X, 2)). Then we have

‖d_U − d‖_p = ‖d_T′ − d′‖_p ≤ 3^{(p−1)/p} ‖d_T − d′‖_p ≤ 3 ((1 + 2^q)/4)^{1/q} AvgUM_q(d) · f(2n),

as desired.

6.3 Self-contained proof of ultrametric fit to rooted tree fit

Choose any base point w ∈ X and run HCCROOTEDTREEFIT. Note that d + c_w = 2(M − gp_w). Then, by Theorem 3.4, the output d_T satisfies

‖d − d_T‖_1 = ‖(d + c_w) − d_U‖_1 ≤ 4‖Δ(d + c_w)‖_1 = 4‖Δ(2(M − gp_w))‖_1 = 8‖Δ_w(d)‖_1.

Although ‖Δ_w(d)‖_1 itself does not satisfy the guaranteed bound, we know that

Σ_{w∈X} ‖Δ_w(d)‖_1 = Σ_{w∈X} Σ_{{x,y,z} ∈ C(X\{w}, 3)} fp_w(d; x, y, z) = 4 Σ_{{x,y,z,w} ∈ C(X,4)} fp(d; x, y, z, w) = 4 C(n, 4) AvgHyp(d),

so there exists an output d_T (with appropriately chosen w) with its ℓ1 error bounded by 8 (4/n) C(n, 4) AvgHyp(d) = 8 C(n − 1, 3) AvgHyp(d). As we need to check every base point w ∈ X, this procedure runs in time O(n³ log n).

6.4 Proof of Theorem 3.3

We begin by recalling the definition of highly connectedness.

Definition 6.1. Given a graph G = (V, E) and two disjoint vertex sets X and Y, the pair (X, Y) is said to be highly connected if both
for all x ∈ X, |{y ∈ Y : (x, y) ∈ E}| ≥ |Y|/2, and
for all y ∈ Y, |{x ∈ X : (y, x) ∈ E}| ≥ |X|/2.

Our first lemma shows that every cluster contains reasonably many edges within it, so that the number of false positive edges can be bounded.
Because the highly connected condition requires that at least half of the possible edges between two merged clusters be present, we can expect the degree of every node within its cluster to be at least half of the cluster size.

Lemma 6.2. For any C ∈ P_t with |C| ≥ 2 and x ∈ C, |{x′ ∈ C : (x, x′) ∈ E_t}| ≥ |C|/2.

Proof. We use induction on t. The lemma is obviously true when t = 0. Next, suppose we pick C ∈ P_t with |C| ≥ 2 and the cluster C was formed at step t₀ ≤ t. If t₀ < t, then, since the edge sets are nested, it is clear that |{x′ ∈ C : (x, x′) ∈ E_t}| ≥ |{x′ ∈ C : (x, x′) ∈ E_{t₀}}| ≥ |C|/2 for all x ∈ C. Therefore, it is enough to check the case t₀ = t, i.e., C is the cluster newly added at step t. Suppose C was obtained by merging C = C_a ∪ C_b. We have two cases to check, depending on the sizes of C_a and C_b.

Case 1: |C_a| = 1 or |C_b| = 1. Without loss of generality, assume |C_a| = 1 and let C_a = {a}. If C_b is also a singleton, then there is nothing to prove. If not, then we must have (a, y) ∈ E_t for all y ∈ C_b in order for the pair to be highly connected. By the induction hypothesis, |{y′ ∈ C_b : (y, y′) ∈ E_t}| ≥ |{y′ ∈ C_b : (y, y′) ∈ E_{t−1}}| ≥ |C_b|/2 for all y ∈ C_b. Hence, we have |{y′ ∈ C : (a, y′) ∈ E_t}| = |C_b| = |C| − 1 ≥ |C|/2, and for all y ∈ C_b, |{y′ ∈ C : (y, y′) ∈ E_t}| = |{y′ ∈ C_b : (y, y′) ∈ E_t}| + 1 ≥ |C_b|/2 + 1 > |C|/2, which completes the proof of Case 1.

Case 2: |C_a| > 1 and |C_b| > 1. By the induction hypothesis, we have |{x′ ∈ C_a : (x, x′) ∈ E_t}| ≥ |{x′ ∈ C_a : (x, x′) ∈ E_{t−1}}| ≥ |C_a|/2 for all x ∈ C_a, and similarly for C_b. As C_a and C_b are highly connected, we have |{y′ ∈ C_b : (x, y′) ∈ E_t}| ≥ |C_b|/2 for all x ∈ C_a, and |{x′ ∈ C_a : (y, x′) ∈ E_t}| ≥ |C_a|/2 for all y ∈ C_b. Therefore, for any x ∈ C_a,

|{z′ ∈ C : (x, z′) ∈ E_t}| = |{x′ ∈ C_a : (x, x′) ∈ E_t}| + |{y′ ∈ C_b : (x, y′) ∈ E_t}| ≥ |C_a|/2 + |C_b|/2 = |C|/2,

where the first bound comes from the induction hypothesis and the second from the highly connected condition. We can show the same property for any y ∈ C_b in the same way, which completes the proof of Case 2. This completes the proof.

Next, we prove the following isoperimetric property:

Lemma 6.3. For C ∈ P_t and X ⊆ C, denote ∂X := {(x, y) ∈ E_t : x ∈ X, y ∈ C \ X}. Then for any proper nonempty X ⊂ C, |∂X| ≥ |C|/2.

Proof. Without loss of generality, assume that 1 ≤ |X| ≤ |C|/2. Then for any x ∈ X, by Lemma 6.2, there are at least |C|/2 − |X| + 1 edges which connect x to vertices of C not in X. Therefore, we have |∂X| ≥ |X| (|C|/2 − |X| + 1) ≥ |C|/2.

As our version of the HCC problem bounds the number of edges in disagreement in terms of the bad triangles, with the next lemmas we count the number of bad triangles |B(G_t)| and bound the number of edges in disagreement. Recall that a bad triangle is a triple of vertices in which exactly two of the three possible edges belong to E_t. For each cluster C ∈ P_t, we denote the set of bad triangles inside C by T_C, and for each cluster pair C, C′ ∈ P_t, we count some of the bad triangles between C and C′ in a set T_{(C,C′)}. More precisely:

Definition 6.4. Given a partition P_t, denote

T_C := {(x, y, z) : x, y, z ∈ C and (x, y, z) ∈ B(G_t)} for C ∈ P_t,

T_{(C,C′)} := {(x, x′, y) : x, x′ ∈ C, y ∈ C′ and (x, x′), (x, y) ∈ E_t, (x′, y) ∉ E_t} ∪ {(x, y, y′) : x ∈ C, y, y′ ∈ C′ and (y, y′), (x, y) ∈ E_t, (x, y′) ∉ E_t} for (C, C′) ∈ C(P_t, 2).

Note that T_{(C,C′)} does not count every bad triangle between C and C′, as there might be bad triangles in which the missing edge is inside a cluster. With these definitions, we have

Σ_{C ∈ P_t} |T_C| + Σ_{(C,C′) ∈ C(P_t, 2)} |T_{(C,C′)}| ≤ |B(G_t)|.

Proposition 6.5. For any cluster C ∈ P_t, the number of non-edges within C is bounded by |T_C|; that is, |{(x, y) ∉ E_t : x, y ∈ C}| ≤ |T_C|. (In fact, it is bounded by |T_C|/2.)

Proof. We will prove that for any e = (x, y) ∉ E_t with x, y ∈ C, there exist at least two elements of T_C that contain e, which implies our result. If |C| = 1, then there is nothing to prove.
For any |C| ≥ 2 and (x, y) ∉ E_t with x, y ∈ C, by Lemma 6.2, x and y each have at least |C|/2 neighbors in C (denote these neighbor sets by N_x and N_y, respectively). Since N_x, N_y ⊆ C \ {x, y}, we have |N_x ∩ N_y| = |N_x| + |N_y| − |N_x ∪ N_y| ≥ |C| − (|C| − 2) = 2. For any z ∈ N_x ∩ N_y, (x, y, z) ∈ T_C by definition, which proves the assertion.

Lemma 6.6. For G = (V, E) and two disjoint subsets X, Y ⊆ V, suppose |{x′ ∈ X : (x, x′) ∈ E}| ≥ |X|/2 for all x ∈ X and |{y′ ∈ Y : (y, y′) ∈ E}| ≥ |Y|/2 for all y ∈ Y. Further assume that X and Y are not highly connected. Then we have

|{(x, y) ∈ E : x ∈ X, y ∈ Y}| ≤ 4 (|{(x, x′, y) : x, x′ ∈ X, y ∈ Y and (x, x′), (x, y) ∈ E, (x′, y) ∉ E}| + |{(x, y, y′) : x ∈ X, y, y′ ∈ Y and (y, y′), (x, y) ∈ E, (x, y′) ∉ E}|).

Proof. The key argument is to use Lemma 6.3 and count the boundary sets. Define N_Y(x) := {y ∈ Y : (x, y) ∈ E} for x ∈ X (and N_X(y) for y ∈ Y, respectively). For any (y, y′) ∈ ∂N_Y(x), the triple (x, y, y′) is a bad triangle which contributes to the right-hand side; in other words, the quantity in parentheses on the right-hand side is at least

Σ_{x∈X} |∂N_Y(x)| + Σ_{y∈Y} |∂N_X(y)|.

By Lemma 6.3 (which only requires the degree conditions assumed here), whenever N_Y(x) ≠ ∅ and N_Y(x) ≠ Y, we have |∂N_Y(x)| ≥ |Y|/2, and similarly for ∂N_X(y). We count those x ∈ X and y ∈ Y whose neighbor sets are proper and nonempty in order to obtain a lower bound. We divide the rest of the analysis into three cases.

Case 1: There exists x₀ ∈ X such that N_Y(x₀) = ∅. Then for any y ∈ Y, x₀ ∉ N_X(y), so N_X(y) cannot be all of X; thus, whenever N_X(y) ≠ ∅, we have |∂N_X(y)| ≥ |X|/2. We then have

|{(x, y) ∈ E : x ∈ X, y ∈ Y}| ≤ |X| · |{y ∈ Y : N_X(y) ≠ ∅}| = Σ_{y ∈ Y, N_X(y) ≠ ∅} |X| ≤ Σ_{y ∈ Y, N_X(y) ≠ ∅} 2|∂N_X(y)|,

which proves the lemma.

Case 2: There exists y₀ ∈ Y such that N_X(y₀) = ∅. This case is proven as in Case 1.

Case 3: For every x ∈ X and y ∈ Y, N_Y(x) ≠ ∅ and N_X(y) ≠ ∅. Here we use the assumption that X and Y are not highly connected, so we may assume, without loss of generality, that there is an x₀ ∈ X such that |N_Y(x₀)| < |Y|/2. Then for any y ∉ N_Y(x₀), N_X(y) ≠ ∅ and N_X(y) ≠ X (as x₀ ∉ N_X(y)); thus the boundary set ∂N_X(y) is nonempty and has at least |X|/2 elements. Therefore,

(1/4) · RHS ≥ Σ_{y∈Y} |∂N_X(y)| ≥ Σ_{y ∉ N_Y(x₀)} |∂N_X(y)| ≥ (|Y| − |N_Y(x₀)|) · |X|/2 ≥ (|Y|/2)(|X|/2) = |X||Y|/4 ≥ (1/4) · LHS.

This completes the proof.

Proposition 6.7. For any cluster pair (C, C′) ∈ P_t, the number of edges between C and C′ is bounded by 4|T_{(C,C′)}|; i.e., |{(x, y) ∈ E_t : x ∈ C, y ∈ C′}| ≤ 4|T_{(C,C′)}|.

Proof. We use induction on t. When t = 0, there is nothing to prove. Suppose the claim holds for all steps before t.

Case 1: e_t does not connect two vertices between C and C′, and neither C nor C′ is newly added at step t. Then, by the induction hypothesis, the proposition holds at step t − 1, and it remains true at step t (the left-hand side is unchanged and |T_{(C,C′)}| can only increase).

Case 2: e_t = (x, y) for x ∈ C and y ∈ C′. Then the algorithm decided not to merge the two clusters, which means C and C′ are not highly connected. By Lemma 6.6, the number of edges between them is bounded by 4|T_{(C,C′)}|, as desired.

Case 3: e_t does not connect two vertices between C and C′, but C or C′ is newly formed at step t. Note that both clusters cannot be created at the same step, so assume without loss of generality that C is the new cluster, generated as C = C_a ∪ C_b. Then, by the induction hypothesis,

|{(x, y) ∈ E_t : x ∈ C, y ∈ C′}| = |{(x, y) ∈ E_t : x ∈ C_a, y ∈ C′}| + |{(x, y) ∈ E_t : x ∈ C_b, y ∈ C′}| ≤ 4(|T_{(C_a,C′)}| + |T_{(C_b,C′)}|) ≤ 4|T_{(C,C′)}|.

This completes the proof.

We are now ready to prove Theorem 3.3.

Proof of Theorem 3.3: For P_t = {C₁, C₂, ..., C_k}, denote

e_i := |{(x, x′) ∉ E_t : x, x′ ∈ C_i}|, e_{i,j} := |{(x, y) ∈ E_t : x ∈ C_i, y ∈ C_j}|.
Then, by the above propositions, we have

|E_t △ E(P_t)| = Σ_i e_i + Σ_{i<j} e_{i,j} ≤ Σ_i |T_{C_i}| + 4 Σ_{i<j} |T_{(C_i,C_j)}| ≤ 4|B(G_t)|,

which completes the proof.

The synthetic graphs of Section 4.2 are generated from a fixed tree (T, d_T) as follows:
1. Randomly choose a pair of vertices (v, w) with d_T(v, w) > 2. (If not, then run step 1 again.)
2. Add the edge (v, w) with weight d_T(v, w) − 2δ (for δ = 0.1).
3. Repeat until ne = 500 new edges have been added.
4. Compute the shortest-path metric of the resulting sparse graph.

We excluded pairs with d_T(v, w) ≤ 2 because: if d_T(v, w) = 1, the procedure simply shrinks an existing edge; if d_T(v, w) = 2, the procedure amounts to adding a Steiner node (for v, w, and their common neighbor), which does not introduce any hyperbolicity.

6.9.7 Comparison

For the comparison, we fixed the random seeds using the numpy library (from 0 to nseed − 1). For the common data set experiments, we run nseed = 100 trials; for the synthetic data set experiments, we run nseed = 50 trials.

6.10 Detailed Comparisons

In comparison with [1]: As [1] presented an O(1) approximation that minimizes the ℓ1 error, it can be deduced that the total error of its output is also bounded by O(AvgHyp(d) n³), although the leading coefficient is not known. It would be interesting to analyze the performance of LP-based algorithms, including [1], under tree-like assumptions on the input, such as bounds on AvgHyp(d) or the δ-hyperbolicity.

In comparison with [10]: We bounded the total error of the tree metric in terms of the average hyperbolicity AvgHyp(d) and the size of our input space X. As the growth function we found is O(n³), the average error bound is O(AvgHyp(d) |X|), which is asymptotically tight (in other words, the dependency on |X| is necessary). While the setup of [10] is quite different, their result can be interpreted as follows: they bound the expectation of the distortion in terms of the average hyperbolicity AvgHyp and the maximum bound on the Gromov product b (or the diameter D). The specific growth function in terms of AvgHyp and b is not known, or is hard to track. By slightly tweaking our asymptotic tightness example, it can also be seen that the dependency on b is necessary. It would also be interesting to determine how tight the dependency on b is.

In comparison with QUADTREE: There are a number of tree fitting algorithms in many applications, with various kinds of inputs and outputs. One notable method is QUADTREE (see, for example, [18]), which outputs a tree structure in which each node represents a rectangular area of the domain. There are two major differences between QUADTREE and our method: first, QUADTREE requires (Euclidean) data points as input, while ours only requires a distance matrix; second, the main output of QUADTREE is a tree data structure meant to be used directly, while ours (and the other methods we compare against) focus on fitting the metric. We conducted a simple experiment comparing QUADTREE with the other methods, including ours. We uniformly sampled 500 points in [0, 1]² ⊂ R² and ran the QUADTREE algorithm as a baseline. While there is no canonical way to define edge weights on the output of QUADTREE, we used 2^{−d} for edges at depth d. Note that the input we sampled is nowhere hyperbolic, so it may not enjoy the advantages of fitting algorithms which use such geometric assumptions.

| Error | Average ℓ1 error | Max ℓ∞ error |
| --- | --- | --- |
| HCC | 0.260 ± 0.058 | 1.478 ± 0.213 |
| Gromov | 0.255 ± 0.041 | 1.071 ± 0.092 |
| TR | 0.282 ± 0.067 | 1.648 ± 0.316 |
| NJ | 0.224 | 1.430 |
| QT | 1.123 | 1.943 |

Table 6: Experiments on the unit cube in R². This data set has no hyperbolic structure, so the results may differ in character from the main experiments.

There is a simple reason why QUADTREE behaves worse when it comes to the metric fitting problem.
QUADTREE may separate two very close points that lie across a cell border. The fitted distance between such points is then very large; in fact, it can be as large as the diameter of the output tree, which is nearly 2 in this experiment. Such pairs contribute a huge error.
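To see the border effect concretely, here is a tiny sketch (ours, not the QUADTREE implementation used in the experiment) of quadtree cell addressing on the unit square; two points only 0.002 apart already land in different cells at the very first split, so any tree built on the cell hierarchy places them far apart.

```python
def quadtree_cell_path(x, y, depth):
    # Quadrant indices of the cell containing (x, y) at each level of a quadtree
    # on [0, 1]^2 (a minimal addressing sketch, not the full QUADTREE algorithm).
    path, x0, y0, size = [], 0.0, 0.0, 1.0
    for _ in range(depth):
        size /= 2
        qx, qy = int(x >= x0 + size), int(y >= y0 + size)
        path.append((qx, qy))
        x0, y0 = x0 + qx * size, y0 + qy * size
    return path

# The pair (0.499, 0.5) and (0.501, 0.5) is split at the root: the first cell
# index is (0, 1) for one point and (1, 1) for the other.
print(quadtree_cell_path(0.499, 0.5, 3))
print(quadtree_cell_path(0.501, 0.5, 3))
```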