Journal of Machine Learning Research 17 (2016) 1-17. Submitted 7/11; Revised 8/15; Published 12/16.

A Characterization of Linkage-Based Hierarchical Clustering

Margareta Ackerman, margareta.ackerman@sjsu.edu
Department of Computer Science, San Jose State University, San Jose, CA

Shai Ben-David, shai@cs.uwaterloo.ca
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON

Editor: Marina Meila

Abstract

The class of linkage-based algorithms is perhaps the most popular class of hierarchical algorithms. We identify two properties of hierarchical algorithms, and prove that linkage-based algorithms are the only ones that satisfy both of these properties. Our characterization clearly delineates the difference between linkage-based algorithms and other hierarchical methods. We formulate an intuitive notion of locality of a hierarchical algorithm that distinguishes between linkage-based and global hierarchical algorithms such as bisecting k-means, and prove that popular divisive hierarchical algorithms produce clusterings that cannot be produced by any linkage-based algorithm.

1. Introduction

Clustering is a fundamental and immensely useful task, with many important applications. There are many clustering algorithms, and these algorithms often produce different results on the same data. Faced with a concrete clustering task, a user needs to choose an appropriate algorithm. Currently, such decisions are often made in a very ad hoc, if not completely random, manner. Users are aware of the costs involved in employing different clustering algorithms, such as running times, memory requirements, and software purchasing costs. However, there is very little understanding of the differences in the outcomes that these algorithms may produce. It has been proposed to address this challenge by identifying significant properties that distinguish between different clustering paradigms (see, for example, Ackerman et al.
(2010b) and Fisher and Van Ness (1971)). By focusing on the input-output behaviour of algorithms, these properties shed light on essential differences between them (Ackerman et al. (2010b, 2012)). Users could then choose desirable properties based on domain expertise, and select an algorithm that satisfies those properties.

In this paper, we focus on hierarchical algorithms, a prominent class of clustering algorithms. These algorithms output dendrograms, which the user can then traverse to obtain the desired clustering. Dendrograms provide a convenient method for exploring multiple clusterings of the data. Notably, for some applications the dendrogram itself, not any clustering found in it, is the desired final outcome. One such application is found in the field of phylogeny, which aims to reconstruct the tree of life.

© 2016 Margareta Ackerman and Shai Ben-David.

One popular class of hierarchical algorithms is the class of linkage-based algorithms. These algorithms start with singleton clusters, and repeatedly merge pairs of clusters until a dendrogram is formed. This class includes commonly-used algorithms such as single-linkage, average-linkage, complete-linkage, and Ward's method.

In this paper, we provide a property-based characterization of hierarchical linkage-based algorithms. We identify two properties of hierarchical algorithms that are satisfied by all linkage-based algorithms, and prove that, at the same time, no algorithm that is not linkage-based can satisfy both of these properties.

The popularity of linkage-based algorithms leads to a common misconception that linkage-based algorithms are synonymous with hierarchical algorithms. We show that even when the internal workings of algorithms are ignored, and the focus is placed solely on their input-output behaviour, there are natural hierarchical algorithms that are not linkage-based.
We define a large class of divisive algorithms that includes the popular bisecting k-means algorithm, and show that no linkage-based algorithm can simulate the input-output behaviour of any algorithm in this class.

2. Previous Work

Our work falls within the larger framework of studying properties of clustering algorithms. Several authors study such properties from an axiomatic perspective. For instance, Wright (1973) proposes axioms of clustering functions in a weighted setting, where every domain element is assigned a positive real weight, and its weight may be distributed among multiple clusters. A recent, and influential, paper in this line of work is Kleinberg's impossibility result (Kleinberg (2003)), where he proposes three axioms of partitional clustering functions and proves that no clustering function can simultaneously satisfy these properties.

Properties have been used to study different aspects of clustering. Ackerman and Ben-David (2008) consider properties satisfied by clustering quality measures, showing that properties analogous to Kleinberg's axioms are consistent in this setting. Meila (2005) studies properties of criteria for comparing clusterings, functions that map pairs of clusterings to real numbers, and identifies properties that are sufficient to uniquely identify several such criteria. Puzicha et al. (2000) explore properties of clustering objective functions. They propose a few natural properties of clustering objective functions, and then focus on objective functions that arise by requiring functions to decompose into additive form.

Most relevant to our work are previous results distinguishing linkage-based algorithms based on their properties. Most of these results are concerned with the single-linkage algorithm. In the hierarchical clustering setting, Jardine and Sibson (1971) and Carlsson and Mémoli (2010) formulate a collection of properties that define single linkage.
Zadeh and Ben-David (2009) characterize single linkage in the partitional setting, where instead of constructing a dendrogram, clusters are merged until a given number of clusters remain. Finally, Ackerman et al. (2010a) characterize linkage-based algorithms in the same partitional setting in terms of a few natural properties. These results enable a comparison of the input-output behaviour of (a partitional variant of) linkage-based algorithms with other partitional algorithms.

Figure 1: A dendrogram of domain set {x1, . . . , x8}. The horizontal lines represent levels and every leaf is associated with an element of the domain.

In this paper, we characterize hierarchical linkage-based algorithms, which map data sets to dendrograms. Our characterization is independent of any stopping criterion. It enables the comparison of linkage-based algorithms to other hierarchical algorithms, and clearly delineates the differences between the input-output behaviour of linkage-based algorithms and other hierarchical methods.

3. Definitions

A distance function is a symmetric function d : X × X → R⁺ such that d(x, x) = 0 for all x ∈ X. The data sets that we consider are pairs (X, d), where X is some finite domain set and d is a distance function over X.

We say that a distance function d′ over X′ extends a distance function d over X ⊆ X′, denoted d ⊑ d′, if d′(x, y) = d(x, y) for all x, y ∈ X. Two distance functions d over X and d′ over X′ agree on a data set Y if Y ⊆ X, Y ⊆ X′, and d(x, y) = d′(x, y) for all x, y ∈ Y.

A k-clustering C = {C1, C2, . . . , Ck} of a data set X is a partition of X into k non-empty disjoint subsets of X (so ∪i Ci = X). A clustering of X is a k-clustering of X for some 1 ≤ k ≤ |X|. For a clustering C, let |C| denote the number of clusters in C. For x, y ∈ X and a clustering C of X, we write x ∼C y if x and y belong to the same cluster in C, and x ≁C y otherwise.
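To make these definitions concrete, here is a minimal Python sketch of the extension relation and the ∼C relation. The encoding (distance functions as plain Python functions, clusterings as collections of sets) and all function names are ours, not the paper's:

```python
# A minimal Python sketch of the definitions above; names are ours.
# A data set is a pair (X, d): a finite set of points together with a
# symmetric distance function satisfying d(x, x) = 0.

def extends(d_prime, d, X):
    """d' (over a superset of X) extends d over X, i.e. d ⊑ d':
    the two functions agree on every pair of points from X."""
    return all(d_prime(x, y) == d(x, y) for x in X for y in X)

def same_cluster(C, x, y):
    """x ~_C y: x and y belong to the same cluster of clustering C."""
    return any(x in block and y in block for block in C)

# Example: d over {1, 2} is extended by d' over {1, 2, 3}.
d = lambda x, y: abs(x - y)
d_prime = lambda x, y: abs(x - y)  # agrees with d on {1, 2}
print(extends(d_prime, d, {1, 2}))        # True
print(same_cluster([{1, 2}, {3}], 1, 2))  # True
print(same_cluster([{1, 2}, {3}], 2, 3))  # False
```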
Given a rooted tree T where the edges are oriented away from the root, let V(T) denote the set of vertices in T, and E(T) denote the set of edges in T. We use the standard interpretation of the terms leaf, descendent, parent, and child.

A dendrogram over a data set X is a binary rooted tree where the leaves correspond to elements of X. In addition, every node is assigned a level, using a level function η; leaves are placed at level 0, parents have higher levels than their children, and no level is empty. See Figure 1 for an illustration. Formally,

Definition 1 (dendrogram) A dendrogram over (X, d) is a triple (T, M, η) where T is a binary rooted tree, M : leaves(T) → X is a bijection, and η : V(T) → {0, . . . , h} is onto (for some h ∈ Z⁺ ∪ {0}) such that

1. For every leaf node x ∈ V(T), η(x) = 0.
2. If (x, y) ∈ E(T), then η(x) > η(y).

Given a dendrogram D = (T, M, η) of X, we define a mapping from nodes to clusters C : V(T) → 2^X by C(x) = {M(y) | y is a leaf and a descendent of x}. If C(x) = A, then we write v(A) = x. We think of v(A) as the vertex (or node) in the tree that represents cluster A. We say that A ⊆ X is a cluster in D if there exists a node x ∈ V(T) so that C(x) = A. We say that a clustering C = {C1, . . . , Ck} of X′ ⊆ X is in D if Ci is in D for all 1 ≤ i ≤ k. Note that a dendrogram may contain clusterings that do not partition the entire domain, and that for i ≠ j, v(Ci) is not a descendent of v(Cj), since Ci ∩ Cj = ∅.

Definition 2 (sub-dendrogram) A sub-dendrogram of (T, M, η) rooted at x ∈ V(T) is a dendrogram (T′, M′, η′) where

1. T′ is the subtree of T rooted at x,
2. For every y ∈ leaves(T′), M′(y) = M(y), and
3. For all y, z ∈ V(T′), η′(y) < η′(z) if and only if η(y) < η(z).

Definition 3 (Isomorphisms) A few notions of isomorphisms of structures are relevant to our discussion.

1. We say that (X, d) and (X′, d′) are isomorphic domains, denoted (X, d) ≅X (X′, d′), if there exists a bijection φ : X → X′ so that d(x, y) = d′(φ(x), φ(y)) for all x, y ∈ X.

2.
We say that two clusterings (or partitions) C of some domain (X, d) and C′ of some domain (X′, d′) are isomorphic clusterings, denoted (C, d) ≅C (C′, d′), if there exists a domain isomorphism φ : X → X′ so that x ∼C y if and only if φ(x) ∼C′ φ(y).

3. We say that (T1, η1) and (T2, η2) are isomorphic trees, denoted (T1, η1) ≅T (T2, η2), if there exists a bijection H : V(T1) → V(T2) so that

(a) for all x, y ∈ V(T1), (x, y) ∈ E(T1) if and only if (H(x), H(y)) ∈ E(T2), and
(b) for all x ∈ V(T1), η1(x) = η2(H(x)).

4. We say that D1 = (T1, M1, η1) of (X, d) and D2 = (T2, M2, η2) of (X′, d′) are isomorphic dendrograms, denoted D1 ≅D D2, if there exists a domain isomorphism φ : X → X′ and a tree isomorphism H : (T1, η1) → (T2, η2) so that for all x ∈ leaves(T1), φ(M1(x)) = M2(H(x)).

4. Hierarchical and Linkage-Based Algorithms

In the hierarchical clustering setting, linkage-based algorithms are hierarchical algorithms that can be simulated by repeatedly merging close clusters. In this section, we formally define hierarchical algorithms and linkage-based hierarchical algorithms.

4.1 Hierarchical Algorithms

In addition to outputting a dendrogram, we require that hierarchical clustering functions satisfy a few natural properties.

Definition 4 (Hierarchical clustering function) A hierarchical clustering function F is a function that takes as input a pair (X, d) and outputs a dendrogram (T, M, η). We require such a function, F, to satisfy the following:

1. Representation Independence: Whenever (X, d) ≅X (X′, d′), then F(X, d) ≅D F(X′, d′).

2. Scale Invariance: For any domain set X and any pair of distance functions d, d′ over X, if there exists c ∈ R⁺ such that d(a, b) = c · d′(a, b) for all a, b ∈ X, then F(X, d) = F(X, d′).

3. Richness: For all data sets {(X1, d1), . . . , (Xk, dk)} where Xi ∩ Xj = ∅ for all i ≠ j, there exists a distance function d̂ over X1 ∪ · · · ∪ Xk that extends each of the di's (for i ≤ k), so that the clustering {X1, . . . , Xk} is in F(X1 ∪ · · · ∪ Xk, d̂).
The last condition, richness, requires that every clustering can be produced by the algorithm by manipulating between-cluster distances. Intuitively, if we place the clusters sufficiently far apart, then the resulting clustering should appear in the dendrogram.

In this work, we focus on distinguishing linkage-based algorithms from other hierarchical algorithms.

4.2 Linkage-Based Algorithms

The class of linkage-based algorithms includes some of the most popular hierarchical algorithms, such as single-linkage, average-linkage, complete-linkage, and Ward's method. Every linkage-based algorithm has a linkage function that can be used to determine which clusters to merge at every step of the algorithm.

Definition 5 (Linkage Function) A linkage function is a function

ℓ : {(X1, X2, d) | d is a distance function over X1 ∪ X2} → R⁺

such that

1. ℓ is representation independent: For all (X1, X2) and (X′1, X′2), if ({X1, X2}, d) ≅C ({X′1, X′2}, d′), then ℓ(X1, X2, d) = ℓ(X′1, X′2, d′).

2. ℓ is monotonic: For all (X1, X2, d), if d′ is a distance function over X1 ∪ X2 such that d′(x, y) = d(x, y) for all x ∼{X1,X2} y, and d′(x, y) ≥ d(x, y) for all x ≁{X1,X2} y, then ℓ(X1, X2, d′) ≥ ℓ(X1, X2, d).

As in our characterization of partitional linkage-based algorithms, we assume that a linkage function has a countable range; say, the set of non-negative algebraic real numbers.

The following are the linkage functions of some of the most popular linkage-based algorithms:

Single-linkage: ℓ(A, B, d) = min_{a∈A, b∈B} d(a, b)
Average-linkage: ℓ(A, B, d) = (Σ_{a∈A, b∈B} d(a, b)) / (|A| · |B|)
Complete-linkage: ℓ(A, B, d) = max_{a∈A, b∈B} d(a, b)

For a dendrogram D and clusters A and B in D, if there exists x so that parent(v(A)) = parent(v(B)) = x, then let parent(A, B) = x; otherwise parent(A, B) = ∅. We now define hierarchical linkage-based functions.
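As an illustration, the three linkage functions listed above, together with the merge procedure that a linkage function drives (start from singleton clusters, repeatedly merge the pair of current clusters with minimal linkage value), can be sketched in Python. This is a sketch under our own naming, not the paper's formal construction:

```python
from itertools import combinations

# The three linkage functions listed above (function names are ours).
def single_linkage(A, B, d):
    return min(d(a, b) for a in A for b in B)

def average_linkage(A, B, d):
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

def complete_linkage(A, B, d):
    return max(d(a, b) for a in A for b in B)

def linkage_merges(X, d, linkage):
    """Simulate a linkage-based algorithm: start with singleton clusters
    and repeatedly merge the pair with minimal linkage value, recording
    each merge with its step number (the level of the new node)."""
    clusters = [frozenset([x]) for x in X]
    merges = []
    level = 0
    while len(clusters) > 1:
        level += 1
        A, B = min(combinations(clusters, 2),
                   key=lambda pair: linkage(pair[0], pair[1], d))
        clusters.remove(A)
        clusters.remove(B)
        clusters.append(A | B)
        merges.append((level, A, B))
    return merges

# Example: points on a line; single-linkage merges the closest pair first.
d = lambda x, y: abs(x - y)
for step in linkage_merges([1, 2, 10], d, single_linkage):
    print(step)
# The first merge joins {1} and {2}; the final merge forms the root.
```

Note that ties between pairs with equal linkage value are broken arbitrarily here (by iteration order), which is consistent with the paper's assumption that merge order among minimal pairs is not uniquely determined.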
Definition 6 (Linkage-Based Function) A hierarchical clustering function F is linkage-based if there exists a linkage function ℓ so that for all (X, d), F(X, d) = (T, M, η), where η(parent(A, B)) = m if and only if ℓ(A, B, d) is minimal in

{ℓ(S, T, d) : S ∩ T = ∅, η(v(S)) < m, η(v(T)) < m, η(parent(v(S))) ≥ m, η(parent(v(T))) ≥ m}.

Note that the above definition implies that there exists a linkage function that can be used to simulate the output of F. We start by assigning every element of the domain to a leaf node. We then use the linkage function to identify the closest pair of nodes (with respect to the clusters that they represent), and repeatedly merge the closest pairs of nodes that do not yet have parents, until only one such node remains.

4.3 Locality

We introduce a new property of hierarchical algorithms. Locality states that if we select a clustering from a dendrogram (a union of disjoint clusters that appear in the dendrogram), and run the hierarchical algorithm on the data underlying this clustering, we obtain a result that is consistent with the original dendrogram.

Definition 7 (Locality) A hierarchical function F is local if for all X, d, and X′ ⊆ X, whenever a clustering C = {C1, C2, . . . , Ck} of X′ is in F(X, d) = (T, M, η), then for all 1 ≤ i ≤ k,

1. Cluster Ci is in F(X′, d|X′) = (T′, M′, η′), and the sub-dendrogram of F(X, d) rooted at v(Ci) is also a sub-dendrogram of F(X′, d|X′) rooted at v(Ci).
2. For all x, y ∈ X′, η′(x) < η′(y) if and only if η(x) < η(y).

Locality is often a desirable property. Consider, for example, the field of phylogenetics, which aims to reconstruct the tree of life. If an algorithm clusters phylogenetic data correctly, then if we cluster any subset of the data, we should get results that are consistent with the original dendrogram.

Figure 2: An example of an A-cut.

4.4 Outer Consistency

Clustering aims to group similar elements and separate dissimilar ones.
These two requirements are often contradictory, and algorithms vary in how they resolve this contradiction. Kleinberg (2003) proposed a formalization of these requirements in his consistency axiom for partitional clustering algorithms. Consistency requires that if within-cluster distances are decreased, and between-cluster distances are increased, then the output of a clustering function does not change. Since then it was found that while many natural clustering functions fail consistency, most satisfy a relaxation, which requires that the output of an algorithm is not changed by increasing between-cluster distances (Ackerman et al. (2010b)).

Given successfully clustered data, if points that are already assigned to different clusters are drawn even further apart, then it is natural to expect that, when clustering the resulting new data set, such points will not share the same cluster. Here we propose a variation of this requirement for the hierarchical clustering setting. Given a dendrogram produced by a hierarchical algorithm, we select a clustering C from the dendrogram and pull apart the clusters in C (thus making the clustering C more pronounced). If we then run the algorithm on the resulting data, we can expect that the clustering C will occur in the new dendrogram. Outer consistency is a relaxation of the above property, making this requirement only on a subset of clusterings.

For a cluster A in a dendrogram D, the A-cut of D is a clustering in D represented by nodes on the same level as v(A) or directly below v(A). For convenience, if node u is the root of the dendrogram, then assume its parent has infinite level, η(parent(u)) = ∞. Formally,

Definition 8 (A-cut) Given a cluster A in a dendrogram D = (T, M, η), the A-cut of D is cut_A(D) = {C(u) | u ∈ V(T), η(parent(u)) > η(v(A)) and η(u) ≤ η(v(A))}.

Note that for any cluster A in D of (X, d), the A-cut is a clustering of X, and A is one of the clusters in that clustering.
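Under one simple encoding of a dendrogram (each node carries a level, a parent pointer, and the cluster of leaves beneath it; the encoding and all names are ours), the A-cut of Definition 8 can be computed directly:

```python
# A sketch of the A-cut (Definition 8) under an ad-hoc dendrogram
# encoding of ours: each node u has level[u], parent[u] (None for the
# root, treated as level infinity), and cluster[u], the set of domain
# elements under u.

INF = float("inf")

def a_cut(nodes, level, parent, cluster, a):
    """Return cut_A(D) for a = v(A): the clusters C(u) of all nodes u
    with eta(u) <= eta(v(A)) < eta(parent(u))."""
    def parent_level(u):
        return INF if parent[u] is None else level[parent[u]]
    m = level[a]
    return {cluster[u] for u in nodes
            if level[u] <= m and parent_level(u) > m}

# Example: leaves w, x, y, z; node "wx" at level 1, node "yz" at level 2,
# root at level 3. The {y, z}-cut is {{w, x}, {y, z}}.
nodes = ["w", "x", "y", "z", "wx", "yz", "root"]
level = {"w": 0, "x": 0, "y": 0, "z": 0, "wx": 1, "yz": 2, "root": 3}
parent = {"w": "wx", "x": "wx", "y": "yz", "z": "yz",
          "wx": "root", "yz": "root", "root": None}
cluster = {"w": frozenset("w"), "x": frozenset("x"),
           "y": frozenset("y"), "z": frozenset("z"),
           "wx": frozenset("wx"), "yz": frozenset("yz"),
           "root": frozenset("wxyz")}
print(a_cut(nodes, level, parent, cluster, "yz"))
```

As the definition promises, the result is a clustering of the whole domain that contains A itself as one of its clusters.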
For example, consider the diagram in Figure 2. Let A = {x3, x4}. The horizontal line on level 4 of the dendrogram represents the intuitive notion of a cut. To obtain the corresponding clustering, we select all clusters represented by nodes on the line, and for the remaining clusters, we choose clusters represented by nodes that lie directly below the horizontal cut. In this example, clusters {x3, x4} and {x5, x6, x7, x8} are represented by nodes directly on the line, and {x1, x2} is a cluster represented by a node directly below the marked horizontal line.

Recall that a distance function d′ over X is (C, d)-outer-consistent if d′(x, y) = d(x, y) whenever x ∼C y, and d′(x, y) ≥ d(x, y) whenever x ≁C y.

Definition 9 (Outer-Consistency) A hierarchical function F is outer-consistent if for all (X, d) and any cluster A in F(X, d), if d′ is (cut_A(F(X, d)), d)-outer-consistent, then cut_A(F(X, d)) = cut_A(F(X, d′)).

5. Main Result

The following is our characterization of linkage-based hierarchical algorithms.

Theorem 10 A hierarchical function F is linkage-based if and only if F is outer-consistent and local.

We prove the result in the following subsections (one for each direction of the iff). In the last part of this section, we demonstrate the necessity of both properties.

5.1 All Local, Outer-Consistent Hierarchical Functions are Linkage-Based

Lemma 11 If a hierarchical function F is outer-consistent and local, then F is linkage-based.

We show that there exists a linkage function ℓ so that when ℓ is used in Definition 6, then for all (X, d) the output is F(X, d). Due to the representation independence of F, one can assume, w.l.o.g., that the domain sets over which F is defined are (finite) subsets of the set of natural numbers, N.

Definition 12 (The (pseudo-) partial ordering