Journal of Machine Learning Research 17 (2016) 1-17. Submitted 7/11; Revised 8/15; Published 12/16.

A Characterization of Linkage-Based Hierarchical Clustering

Margareta Ackerman, margareta.ackerman@sjsu.edu
Department of Computer Science, San Jose State University, San Jose, CA

Shai Ben-David, shai@cs.uwaterloo.ca
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON

Editor: Marina Meila

Abstract

The class of linkage-based algorithms is perhaps the most popular class of hierarchical algorithms. We identify two properties of hierarchical algorithms, and prove that linkage-based algorithms are the only ones that satisfy both of these properties. Our characterization clearly delineates the difference between linkage-based algorithms and other hierarchical methods. We formulate an intuitive notion of locality of a hierarchical algorithm that distinguishes between linkage-based and global hierarchical algorithms such as bisecting k-means, and prove that popular divisive hierarchical algorithms produce clusterings that cannot be produced by any linkage-based algorithm.

1. Introduction

Clustering is a fundamental and immensely useful task, with many important applications. There are many clustering algorithms, and these algorithms often produce different results on the same data. Faced with a concrete clustering task, a user needs to choose an appropriate algorithm. Currently, such decisions are often made in a very ad hoc, if not completely random, manner. Users are aware of the costs involved in employing different clustering algorithms, such as running times, memory requirements, and software purchasing costs. However, there is very little understanding of the differences in the outcomes that these algorithms may produce. It has been proposed to address this challenge by identifying significant properties that distinguish between different clustering paradigms (see, for example, Ackerman et al.
(2010b) and Fisher and Van Ness (1971)). By focusing on the input-output behaviour of algorithms, these properties shed light on essential differences between them (Ackerman et al. (2010b, 2012)). Users could then choose desirable properties based on domain expertise, and select an algorithm that satisfies those properties.

In this paper, we focus on hierarchical algorithms, a prominent class of clustering algorithms. These algorithms output dendrograms, which the user can then traverse to obtain the desired clustering. Dendrograms provide a convenient method for exploring multiple clusterings of the data. Notably, for some applications the dendrogram itself, not any clustering found in it, is the desired final outcome. One such application is found in the field of phylogeny, which aims to reconstruct the tree of life.

© 2016 Margareta Ackerman and Shai Ben-David.

One popular class of hierarchical algorithms is the class of linkage-based algorithms. These algorithms start with singleton clusters, and repeatedly merge pairs of clusters until a dendrogram is formed. This class includes commonly-used algorithms such as single-linkage, average-linkage, complete-linkage, and Ward's method.

In this paper, we provide a property-based characterization of hierarchical linkage-based algorithms. We identify two properties of hierarchical algorithms that are satisfied by all linkage-based algorithms, and prove that, at the same time, no algorithm that is not linkage-based can satisfy both of these properties.

The popularity of linkage-based algorithms leads to a common misconception that linkage-based algorithms are synonymous with hierarchical algorithms. We show that even when the internal workings of algorithms are ignored, and the focus is placed solely on their input-output behaviour, there are natural hierarchical algorithms that are not linkage-based.
We define a large class of divisive algorithms that includes the popular bisecting k-means algorithm, and show that no linkage-based algorithm can simulate the input-output behaviour of any algorithm in this class.

2. Previous Work

Our work falls within the larger framework of studying properties of clustering algorithms. Several authors study such properties from an axiomatic perspective. For instance, Wright (1973) proposes axioms of clustering functions in a weighted setting, where every domain element is assigned a positive real weight, and its weight may be distributed among multiple clusters. A recent, and influential, paper in this line of work is Kleinberg's impossibility result (Kleinberg (2003)), where he proposes three axioms of partitional clustering functions and proves that no clustering function can simultaneously satisfy these properties.

Properties have been used to study different aspects of clustering. Ackerman and Ben-David (2008) consider properties satisfied by clustering quality measures, showing that properties analogous to Kleinberg's axioms are consistent in this setting. Meila (2005) studies properties of criteria for comparing clusterings, functions that map pairs of clusterings to real numbers, and identifies properties that are sufficient to uniquely identify several such criteria. Puzicha et al. (2000) explore properties of clustering objective functions. They propose a few natural properties of clustering objective functions, and then focus on objective functions that arise by requiring functions to decompose into additive form.

Most relevant to our work are previous results distinguishing linkage-based algorithms based on their properties. Most of these results are concerned with the single-linkage algorithm. In the hierarchical clustering setting, Jardine and Sibson (1971) and Carlsson and Mémoli (2010) formulate a collection of properties that define single linkage.
Zadeh and Ben-David (2009) characterize single linkage in the partitional setting, where instead of constructing a dendrogram, clusters are merged until a given number of clusters remain. Finally, Ackerman et al. (2010a) characterize linkage-based algorithms in the same partitional setting in terms of a few natural properties. These results enable a comparison of the input-output behaviour of (a partitional variant of) linkage-based algorithms with other partitional algorithms.

Figure 1: A dendrogram of domain set {x1, . . . , x8}. The horizontal lines represent levels and every leaf is associated with an element of the domain.

In this paper, we characterize hierarchical linkage-based algorithms, which map data sets to dendrograms. Our characterization is independent of any stopping criterion. It enables the comparison of linkage-based algorithms to other hierarchical algorithms, and clearly delineates the differences between the input-output behaviour of linkage-based algorithms and other hierarchical methods.

3. Definitions

A distance function is a symmetric function d : X × X → R⁺ such that d(x, x) = 0 for all x ∈ X. The data sets that we consider are pairs (X, d), where X is some finite domain set and d is a distance function over X.

We say that a distance function d′ over X′ extends a distance function d over X ⊆ X′, denoted d ⊑ d′, if d′(x, y) = d(x, y) for all x, y ∈ X. Two distance functions d over X and d′ over X′ agree on a data set Y if Y ⊆ X, Y ⊆ X′, and d(x, y) = d′(x, y) for all x, y ∈ Y.

A k-clustering C = {C1, C2, . . . , Ck} of a data set X is a partition of X into k non-empty disjoint subsets of X (so ∪i Ci = X). A clustering of X is a k-clustering of X for some 1 ≤ k ≤ |X|. For a clustering C, let |C| denote the number of clusters in C. For x, y ∈ X and a clustering C of X, we write x ∼C y if x and y belong to the same cluster in C, and x ≁C y otherwise.
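To make these definitions concrete, here is a minimal Python sketch of the extension relation and the ∼C relation. The encoding (distance functions as plain Python functions, clusterings as collections of sets) and all function names are ours, not the paper's:

```python
# A minimal Python sketch of the definitions above; names are ours.
# A data set is a pair (X, d): a finite set of points together with a
# symmetric distance function satisfying d(x, x) = 0.

def extends(d_prime, d, X):
    """d' (over a superset of X) extends d over X, i.e. d ⊑ d':
    the two functions agree on every pair of points from X."""
    return all(d_prime(x, y) == d(x, y) for x in X for y in X)

def same_cluster(C, x, y):
    """x ~_C y: x and y belong to the same cluster of clustering C."""
    return any(x in block and y in block for block in C)

# Example: d over {1, 2} is extended by d' over {1, 2, 3}.
d = lambda x, y: abs(x - y)
d_prime = lambda x, y: abs(x - y)  # agrees with d on {1, 2}
print(extends(d_prime, d, {1, 2}))        # True
print(same_cluster([{1, 2}, {3}], 1, 2))  # True
print(same_cluster([{1, 2}, {3}], 2, 3))  # False
```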
Given a rooted tree T where the edges are oriented away from the root, let V(T) denote the set of vertices in T, and E(T) denote the set of edges in T. We use the standard interpretation of the terms leaf, descendent, parent, and child.

A dendrogram over a data set X is a binary rooted tree where the leaves correspond to elements of X. In addition, every node is assigned a level, using a level function η; leaves are placed at level 0, parents have higher levels than their children, and no level is empty. See Figure 1 for an illustration. Formally,

Definition 1 (dendrogram) A dendrogram over (X, d) is a triple (T, M, η) where T is a binary rooted tree, M : leaves(T) → X is a bijection, and η : V(T) → {0, . . . , h} is onto (for some h ∈ Z⁺ ∪ {0}) such that

1. For every leaf node x ∈ V(T), η(x) = 0.
2. If (x, y) ∈ E(T), then η(x) > η(y).

Given a dendrogram D = (T, M, η) of X, we define a mapping from nodes to clusters C : V(T) → 2^X by C(x) = {M(y) | y is a leaf and a descendent of x}. If C(x) = A, then we write v(A) = x. We think of v(A) as the vertex (or node) in the tree that represents cluster A. We say that A ⊆ X is a cluster in D if there exists a node x ∈ V(T) so that C(x) = A. We say that a clustering C = {C1, . . . , Ck} of X′ ⊆ X is in D if Ci is in D for all 1 ≤ i ≤ k. Note that a dendrogram may contain clusterings that do not partition the entire domain, and that for i ≠ j, v(Ci) is not a descendent of v(Cj), since Ci ∩ Cj = ∅.

Definition 2 (sub-dendrogram) A sub-dendrogram of (T, M, η) rooted at x ∈ V(T) is a dendrogram (T′, M′, η′) where

1. T′ is the subtree of T rooted at x,
2. For every y ∈ leaves(T′), M′(y) = M(y), and
3. For all y, z ∈ V(T′), η′(y) < η′(z) if and only if η(y) < η(z).

Definition 3 (Isomorphisms) A few notions of isomorphisms of structures are relevant to our discussion.

1. We say that (X, d) and (X′, d′) are isomorphic domains, denoted (X, d) ≅X (X′, d′), if there exists a bijection φ : X → X′ so that d(x, y) = d′(φ(x), φ(y)) for all x, y ∈ X.

2.
We say that two clusterings (or partitions) C of some domain (X, d) and C′ of some domain (X′, d′) are isomorphic clusterings, denoted (C, d) ≅C (C′, d′), if there exists a domain isomorphism φ : X → X′ so that x ∼C y if and only if φ(x) ∼C′ φ(y).

3. We say that (T1, η1) and (T2, η2) are isomorphic trees, denoted (T1, η1) ≅T (T2, η2), if there exists a bijection H : V(T1) → V(T2) so that

(a) for all x, y ∈ V(T1), (x, y) ∈ E(T1) if and only if (H(x), H(y)) ∈ E(T2), and
(b) for all x ∈ V(T1), η1(x) = η2(H(x)).

4. We say that D1 = (T1, M1, η1) of (X, d) and D2 = (T2, M2, η2) of (X′, d′) are isomorphic dendrograms, denoted D1 ≅D D2, if there exists a domain isomorphism φ : X → X′ and a tree isomorphism H : (T1, η1) → (T2, η2) so that for all x ∈ leaves(T1), φ(M1(x)) = M2(H(x)).

4. Hierarchical and Linkage-Based Algorithms

In the hierarchical clustering setting, linkage-based algorithms are hierarchical algorithms that can be simulated by repeatedly merging close clusters. In this section, we formally define hierarchical algorithms and linkage-based hierarchical algorithms.

4.1 Hierarchical Algorithms

In addition to outputting a dendrogram, we require that hierarchical clustering functions satisfy a few natural properties.

Definition 4 (Hierarchical clustering function) A hierarchical clustering function F is a function that takes as input a pair (X, d) and outputs a dendrogram (T, M, η). We require such a function, F, to satisfy the following:

1. Representation Independence: Whenever (X, d) ≅X (X′, d′), then F(X, d) ≅D F(X′, d′).

2. Scale Invariance: For any domain set X and any pair of distance functions d, d′ over X, if there exists c ∈ R⁺ such that d(a, b) = c · d′(a, b) for all a, b ∈ X, then F(X, d) = F(X, d′).

3. Richness: For all data sets {(X1, d1), . . . , (Xk, dk)} where Xi ∩ Xj = ∅ for all i ≠ j, there exists a distance function d̂ over X1 ∪ · · · ∪ Xk that extends each of the di's (for i ≤ k), so that the clustering {X1, . . . , Xk} is in F(X1 ∪ · · · ∪ Xk, d̂).
The last condition, richness, requires that every clustering can be produced by the algorithm by manipulating between-cluster distances. Intuitively, if we place the clusters sufficiently far apart, then the resulting clustering should appear in the dendrogram.

In this work, we focus on distinguishing linkage-based algorithms from other hierarchical algorithms.

4.2 Linkage-Based Algorithms

The class of linkage-based algorithms includes some of the most popular hierarchical algorithms, such as single-linkage, average-linkage, complete-linkage, and Ward's method. Every linkage-based algorithm has a linkage function that can be used to determine which clusters to merge at every step of the algorithm.

Definition 5 (Linkage Function) A linkage function is a function

ℓ : {(X1, X2, d) | d is a distance function over X1 ∪ X2} → R⁺

such that

1. ℓ is representation independent: For all (X1, X2) and (X′1, X′2), if ({X1, X2}, d) ≅C ({X′1, X′2}, d′), then ℓ(X1, X2, d) = ℓ(X′1, X′2, d′).

2. ℓ is monotonic: For all (X1, X2, d), if d′ is a distance function over X1 ∪ X2 such that d′(x, y) = d(x, y) for all x ∼{X1,X2} y, and d′(x, y) ≥ d(x, y) for all x ≁{X1,X2} y, then ℓ(X1, X2, d′) ≥ ℓ(X1, X2, d).

As in our characterization of partitional linkage-based algorithms, we assume that a linkage function has a countable range; say, the set of non-negative algebraic real numbers.

The following are the linkage functions of some of the most popular linkage-based algorithms:

Single-linkage: ℓ(A, B, d) = min_{a∈A, b∈B} d(a, b)
Average-linkage: ℓ(A, B, d) = (Σ_{a∈A, b∈B} d(a, b)) / (|A| · |B|)
Complete-linkage: ℓ(A, B, d) = max_{a∈A, b∈B} d(a, b)

For a dendrogram D and clusters A and B in D, if there exists x so that parent(v(A)) = parent(v(B)) = x, then let parent(A, B) = x; otherwise parent(A, B) = ∅. We now define hierarchical linkage-based functions.
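As an illustration, the three linkage functions listed above, together with the merge procedure that a linkage function drives (start from singleton clusters, repeatedly merge the pair of current clusters with minimal linkage value), can be sketched in Python. This is a sketch under our own naming, not the paper's formal construction:

```python
from itertools import combinations

# The three linkage functions listed above (function names are ours).
def single_linkage(A, B, d):
    return min(d(a, b) for a in A for b in B)

def average_linkage(A, B, d):
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

def complete_linkage(A, B, d):
    return max(d(a, b) for a in A for b in B)

def linkage_merges(X, d, linkage):
    """Simulate a linkage-based algorithm: start with singleton clusters
    and repeatedly merge the pair with minimal linkage value, recording
    each merge with its step number (the level of the new node)."""
    clusters = [frozenset([x]) for x in X]
    merges = []
    level = 0
    while len(clusters) > 1:
        level += 1
        A, B = min(combinations(clusters, 2),
                   key=lambda pair: linkage(pair[0], pair[1], d))
        clusters.remove(A)
        clusters.remove(B)
        clusters.append(A | B)
        merges.append((level, A, B))
    return merges

# Example: points on a line; single-linkage merges the closest pair first.
d = lambda x, y: abs(x - y)
for step in linkage_merges([1, 2, 10], d, single_linkage):
    print(step)
# The first merge joins {1} and {2}; the final merge forms the root.
```

Note that ties between pairs with equal linkage value are broken arbitrarily here (by iteration order), which is consistent with the paper's assumption that merge order among minimal pairs is not uniquely determined.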
Definition 6 (Linkage-Based Function) A hierarchical clustering function F is linkage-based if there exists a linkage function ℓ so that for all (X, d), F(X, d) = (T, M, η), where η(parent(A, B)) = m if and only if ℓ(A, B, d) is minimal in

{ℓ(S, T, d) : S ∩ T = ∅, η(v(S)) < m, η(v(T)) < m, η(parent(v(S))) ≥ m, η(parent(v(T))) ≥ m}.

Note that the above definition implies that there exists a linkage function that can be used to simulate the output of F. We start by assigning every element of the domain to a leaf node. We then use the linkage function to identify the closest pair of nodes (with respect to the clusters that they represent), and repeatedly merge the closest pairs of nodes that do not yet have parents, until only one such node remains.

4.3 Locality

We introduce a new property of hierarchical algorithms. Locality states that if we select a clustering from a dendrogram (a union of disjoint clusters that appear in the dendrogram), and run the hierarchical algorithm on the data underlying this clustering, we obtain a result that is consistent with the original dendrogram.

Definition 7 (Locality) A hierarchical function F is local if for all X, d, and X′ ⊆ X, whenever a clustering C = {C1, C2, . . . , Ck} of X′ is in F(X, d) = (T, M, η), then for all 1 ≤ i ≤ k,

1. Cluster Ci is in F(X′, d|X′) = (T′, M′, η′), and the sub-dendrogram of F(X, d) rooted at v(Ci) is also a sub-dendrogram of F(X′, d|X′) rooted at v(Ci).
2. For all x, y ∈ X′, η′(x) < η′(y) if and only if η(x) < η(y).

Locality is often a desirable property. Consider, for example, the field of phylogenetics, which aims to reconstruct the tree of life. If an algorithm clusters phylogenetic data correctly, then if we cluster any subset of the data, we should get results that are consistent with the original dendrogram.

Figure 2: An example of an A-cut.

4.4 Outer Consistency

Clustering aims to group similar elements and separate dissimilar ones.
These two requirements are often contradictory, and algorithms vary in how they resolve this contradiction. Kleinberg (2003) proposed a formalization of these requirements in his consistency axiom for partitional clustering algorithms. Consistency requires that if within-cluster distances are decreased, and between-cluster distances are increased, then the output of a clustering function does not change. Since then it was found that while many natural clustering functions fail consistency, most satisfy a relaxation, which requires that the output of an algorithm is not changed by increasing between-cluster distances (Ackerman et al. (2010b)).

Given successfully clustered data, if points that are already assigned to different clusters are drawn even further apart, then it is natural to expect that, when clustering the resulting new data set, such points will not share the same cluster. Here we propose a variation of this requirement for the hierarchical clustering setting. Given a dendrogram produced by a hierarchical algorithm, we select a clustering C from the dendrogram and pull apart the clusters in C (thus making the clustering C more pronounced). If we then run the algorithm on the resulting data, we can expect that the clustering C will occur in the new dendrogram. Outer consistency is a relaxation of the above property, making this requirement only on a subset of clusterings.

For a cluster A in a dendrogram D, the A-cut of D is a clustering in D represented by nodes on the same level as v(A) or directly below v(A). For convenience, if node u is the root of the dendrogram, then assume its parent has infinite level, η(parent(u)) = ∞. Formally,

Definition 8 (A-cut) Given a cluster A in a dendrogram D = (T, M, η), the A-cut of D is cut_A(D) = {C(u) | u ∈ V(T), η(parent(u)) > η(v(A)) and η(u) ≤ η(v(A))}.

Note that for any cluster A in D of (X, d), the A-cut is a clustering of X, and A is one of the clusters in that clustering.
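Under one simple encoding of a dendrogram (each node carries a level, a parent pointer, and the cluster of leaves beneath it; the encoding and all names are ours), the A-cut of Definition 8 can be computed directly:

```python
# A sketch of the A-cut (Definition 8) under an ad-hoc dendrogram
# encoding of ours: each node u has level[u], parent[u] (None for the
# root, treated as level infinity), and cluster[u], the set of domain
# elements under u.

INF = float("inf")

def a_cut(nodes, level, parent, cluster, a):
    """Return cut_A(D) for a = v(A): the clusters C(u) of all nodes u
    with eta(u) <= eta(v(A)) < eta(parent(u))."""
    def parent_level(u):
        return INF if parent[u] is None else level[parent[u]]
    m = level[a]
    return {cluster[u] for u in nodes
            if level[u] <= m and parent_level(u) > m}

# Example: leaves w, x, y, z; node "wx" at level 1, node "yz" at level 2,
# root at level 3. The {y, z}-cut is {{w, x}, {y, z}}.
nodes = ["w", "x", "y", "z", "wx", "yz", "root"]
level = {"w": 0, "x": 0, "y": 0, "z": 0, "wx": 1, "yz": 2, "root": 3}
parent = {"w": "wx", "x": "wx", "y": "yz", "z": "yz",
          "wx": "root", "yz": "root", "root": None}
cluster = {"w": frozenset("w"), "x": frozenset("x"),
           "y": frozenset("y"), "z": frozenset("z"),
           "wx": frozenset("wx"), "yz": frozenset("yz"),
           "root": frozenset("wxyz")}
print(a_cut(nodes, level, parent, cluster, "yz"))
```

As the definition promises, the result is a clustering of the whole domain that contains A itself as one of its clusters.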
For example, consider the diagram in Figure 2. Let A = {x3, x4}. The horizontal line on level 4 of the dendrogram represents the intuitive notion of a cut. To obtain the corresponding clustering, we select all clusters represented by nodes on the line, and for the remaining clusters, we choose clusters represented by nodes that lie directly below the horizontal cut. In this example, clusters {x3, x4} and {x5, x6, x7, x8} are represented by nodes directly on the line, and {x1, x2} is a cluster represented by a node directly below the marked horizontal line.

Recall that a distance function d′ over X is (C, d)-outer-consistent if d′(x, y) = d(x, y) whenever x ∼C y, and d′(x, y) ≥ d(x, y) whenever x ≁C y.

Definition 9 (Outer-Consistency) A hierarchical function F is outer-consistent if for all (X, d) and any cluster A in F(X, d), if d′ is (cut_A(F(X, d)), d)-outer-consistent, then cut_A(F(X, d)) = cut_A(F(X, d′)).

5. Main Result

The following is our characterization of linkage-based hierarchical algorithms.

Theorem 10 A hierarchical function F is linkage-based if and only if F is outer-consistent and local.

We prove the result in the following subsections (one for each direction of the iff). In the last part of this section, we demonstrate the necessity of both properties.

5.1 All Local, Outer-Consistent Hierarchical Functions are Linkage-Based

Lemma 11 If a hierarchical function F is outer-consistent and local, then F is linkage-based.

We show that there exists a linkage function ℓ so that when ℓ is used in Definition 6, then for all (X, d) the output is F(X, d). Due to the representation independence of F, one can assume, w.l.o.g., that the domain sets over which F is defined are (finite) subsets of the set of natural numbers, N.

Definition 12 (The (pseudo-) partial ordering