# Interactive Bayesian Hierarchical Clustering

Sharad Vikram (SVIKRAM@CS.UCSD.EDU), Sanjoy Dasgupta (DASGUPTA@CS.UCSD.EDU)
Computer Science and Engineering, UCSD, 9500 Gilman Drive, La Jolla, CA 92093

Abstract

Clustering is a powerful tool in data analysis, but it is often difficult to find a grouping that aligns with a user's needs. To address this, several methods incorporate constraints obtained from users into clustering algorithms, but unfortunately do not apply to hierarchical clustering. We design an interactive Bayesian algorithm that incorporates user interaction into hierarchical clustering while still utilizing the geometry of the data, by sampling a constrained posterior distribution over hierarchies. We also suggest several ways to intelligently query a user. The algorithm, along with the querying schemes, shows promising results on real data.

1. Introduction

Clustering is a basic tool of exploratory data analysis. There are a variety of efficient algorithms, including k-means, EM for Gaussian mixtures, and hierarchical agglomerative schemes, that are widely used for discovering natural groups in data. Unfortunately, they don't always find a grouping that suits the user's needs. This is inevitable. In any moderately complex data set, there are many different plausible grouping criteria. Should a collection of rocks be grouped according to value, or shininess, or geological properties? Should animal pictures be grouped according to the Linnaean taxonomy, or cuteness? Different users have different priorities, and an unsupervised algorithm has no way to magically guess these.

As a result, a rich body of work on constrained clustering has emerged. In this setting, a user supplies guidance, typically in the form of must-link or cannot-link constraints: pairs of points that must be placed together or apart. Introduced by Wagstaff & Cardie (2000), these constraints have since been incorporated into many different flat clustering procedures (Wagstaff et al., 2001; Bansal et al., 2004; Basu et al., 2004; Kulis et al., 2005; Biswas & Jacobs, 2014).

In this paper, we introduce constraints to hierarchical clustering, the recursive partitioning of a data set into successively smaller clusters to form a tree. A hierarchy has several advantages over a flat clustering. First, there is no need to specify the number of clusters in advance. Second, the tree captures cluster structure at multiple levels of granularity, simultaneously. As such, trees are particularly well-suited for exploratory data analysis and the discovery of natural groups. There are several well-established methods for hierarchical clustering, the most prominent among which are the bottom-up agglomerative methods such as average linkage (see, for instance, Chapter 14 of Hastie et al. (2009)). But they suffer from the same problem of under-specification that is the scourge of unsupervised learning in general. And, despite the rich literature on incorporating additional guidance into flat clustering, there has been relatively little work on the hierarchical case.

What form might the user's guidance take? The usual must-link and cannot-link constraints make little sense when data has hierarchical structure. Among living creatures, for instance, should elephant and tiger be linked? At some level, yes, but at a finer level, no.
A more straightforward assertion is that elephant and tiger should be linked in a cluster that does not include snake. We can write this as a triplet ({elephant, tiger}, snake). We could also assert ({tiger, leopard}, elephant). Formally, ({a, b}, c) stipulates that the hierarchy contains a subtree (that is, a cluster) containing a and b but not c. A wealth of research addresses learning taxonomies from triplets alone, mostly in the field of phylogenetics: see Felsenstein (2004) for an overview, and Aho et al. (1981) for a central algorithmic result.

Let's say there are n data items to be clustered, and that the user seeks a particular hierarchy T* on these items. This T* embodies at most $\binom{n}{3}$ triplet constraints, possibly fewer if it is not binary. It was pointed out in Tamuz et al. (2011) that roughly n log n carefully-chosen triplets are enough to fully specify T* if it is balanced. This is also a lower bound: there are $n^{\Omega(n)}$ different labeled rooted trees, so each tree requires Ω(n log n) bits, on average, to write down, and each triplet provides O(1) bits of information, since there are just three possible outcomes for each set of points a, b, c.

Although n log n is a big improvement over $n^3$, it is impractical for a user to provide this much guidance when the number of points is large. In such cases, a hierarchical clustering cannot be obtained on the basis of constraints alone; the geometry of the data must play a role.

We consider an interactive process during which a user incrementally adds constraints. Starting with a pool of data $X \subset \mathbb{R}^d$, the machine builds a candidate hierarchy T. The set of constraints C is initially empty. The machine presents the user with a small portion of T: specifically, its restriction to O(1) leaves $S \subset X$. We denote this T|S. The user either accepts T|S, or provides a triplet constraint ({a, b}, c) that is violated by it. If a triplet is provided, the machine adds it to C and modifies the tree T accordingly.
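To make this protocol concrete, the following is a minimal Python sketch of the outer loop. It assumes hypothetical helpers `sample_tree` (return a candidate hierarchy consistent with the current constraints), `restrict` (compute the induced subtree T|S), and `user_oracle` (accept the displayed subtree or return a violated triplet); these names are ours, not the paper's.

```python
import random

def interactive_clustering(X, user_oracle, sample_tree, restrict,
                           subset_size=10, max_rounds=100):
    """Sketch of the interactive loop: maintain a constraint set C, show the
    user small induced subtrees, and add any violated triplet they report."""
    C = set()                      # triplet constraints gathered so far
    T = sample_tree(X, C)          # initial candidate, informed by geometry only
    for _ in range(max_rounds):
        S = random.sample(range(len(X)), subset_size)  # O(1)-sized leaf subset
        shown = restrict(T, S)     # the restriction T|S presented to the user
        triplet = user_oracle(shown)    # None if accepted, else ({a, b}, c)
        if triplet is not None:
            C.add(triplet)         # grow the constraint set
            T = sample_tree(X, C)  # resample a tree consistent with C
    return T, C
```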
In realizing this scheme, a suitable clustering algorithm and querying strategy must be designed. Similar issues have been confronted in flat clustering with must-link and cannot-link constraints, but the solutions are unsuitable for hierarchies, and thus a fresh treatment is warranted.

THE CLUSTERING ALGORITHM

What is a method of hierarchical clustering that takes into account the geometry of the data points as well as user-imposed constraints? We adopt an interactive Bayesian approach. The learning procedure is uncertain about the intended tree, and this uncertainty is captured in the form of a distribution over all possible trees. Initially, this distribution is informed solely by the geometry of the data, but once interaction begins, it is also shaped by the growing set of constraints.

The nonparametric Bayes literature contains a variety of different distributional models for hierarchical clustering. We describe a general methodology for extending these to incorporate user-specified constraints. For concreteness, we focus on the Dirichlet diffusion tree (Neal, 2003), which has enjoyed empirical success. We show that triplet constraints are quite easily accommodated: when using a Metropolis-Hastings sampler, they can efficiently be enforced, and the state space remains strongly connected, assuring convergence to the unique stationary distribution.

THE QUERYING STRATEGY

What is a good way to select the subsets S? A simple option is to pick them at random from X. We show that this strategy leads to convergence to the target tree T*. Along the way, we define a suitable distance function for measuring how close T is to T*.

We might hope, however, that a more careful choice of S would lead to faster convergence, in much the same way that intelligent querying is often superior to random querying in active learning. In order to do this, we show how the Bayesian framework allows us to quantify which portions of the tree are the most uncertain, and thereby to pick S that focuses on these regions.

Querying based on uncertainty sounds promising, but is dangerous because it is heavily influenced by the choice of prior, which is ultimately quite arbitrary. Indeed, if only such queries were used, the interactive learning process could easily converge to the wrong tree. We show how to avoid this situation by interleaving the two types of queries.

Finally, we present a series of experiments that illustrate how a little interaction leads to significantly better hierarchical clusterings.

1.1. Other related work

A related problem that has been studied in more detail (Zoller & Buhmann, 2000; Eriksson et al., 2011; Krishnamurthy et al., 2012) is that of building a hierarchical clustering where the only information available is pairwise similarities between points, but these are initially hidden and must be individually queried. In another variant of interactive flat clustering (Balcan & Blum, 2008; Awasthi & Zadeh, 2010; Awasthi et al., 2014), the user is allowed to specify that individual clusters be merged or split. A succession of such operations can always lead to a target clustering, and a question of interest is how quickly this convergence can be achieved. Finally, it is worth mentioning the use of triplet constraints in learning other structures, such as Euclidean embeddings (Borg & Groenen, 2005).

2. Bayesian hierarchical clustering

The most basic form of hierarchical clustering is a rooted binary tree with the data points at its leaves. This is sometimes called a cladogram. Very often, however, the tree is adorned with additional information, for instance:

1. An ordering of the internal nodes, where the root is assigned the lowest number and each node has a higher number than its parent. This ordering uniquely specifies the induced k-clustering (for any k): just remove the k − 1 lowest-numbered nodes and take the clusters to be the leaf-sets of the k resulting subtrees.

2. Lengths on the edges. Intuitively, these lengths correspond to the amount of change (for instance, time elapsed) along the corresponding edges. They induce a tree metric on the nodes, and often, the leaves are required to be at the same distance from the root.

3. Parameters at internal nodes. These parameters are sometimes from the same space as the data, representing intermediate values on the way from the root to the leaves.

Many generative processes for trees end up producing these more sophisticated structures, with the understanding that undesired additional information can simply be discarded at the very end.

We now review some well-known distributions over trees and over hierarchical clusterings. Let's start with cladograms on n leaves. The simplest distribution over these is the uniform. Another well-studied option is the Yule model, which can be described using either a top-down or bottom-up generative process. The top-down view corresponds to a continuous-time pure birth process: start with one lineage; each lineage persists for a random exponential(1) amount of time and then splits into two lineages; this goes on until there are n lineages. The bottom-up view is a coalescing process: start with n points; pick a random pair of them to merge; then repeat.
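As a small illustration, here is the bottom-up (coalescing) view of the Yule model in Python, with trees represented as nested tuples; the representation is ours and edge lengths are ignored.

```python
import random

def yule_coalescent(n):
    """Bottom-up Yule model: start with n singleton lineages and repeatedly
    merge a uniformly random pair until a single tree remains."""
    lineages = list(range(n))                 # each leaf is its own lineage
    while len(lineages) > 1:
        i, j = random.sample(range(len(lineages)), 2)
        merged = (lineages[i], lineages[j])   # new internal node as a tuple
        lineages = [x for k, x in enumerate(lineages) if k not in (i, j)]
        lineages.append(merged)
    return lineages[0]

print(yule_coalescent(5))   # e.g. ((3, (0, 4)), (1, 2))
```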
Aldous (1995) has defined a one-parameter family of distributions over cladograms, called the beta-splitting model, that includes the uniform and the Yule model as special cases. It is a top-down generative process in which, roughly, each split is made by sampling from a Beta distribution to decide how many points go on each side. To move to arbitrary (not necessarily binary) splits, a suitable generalization is the Gibbs fragmentation tree (McCullagh et al., 2008).

In this paper, we will work with joint distributions over both tree structure and data. These are typically inspired by, or based directly upon, the simpler tree-only distributions described above. Our primary focus is the Dirichlet diffusion tree (Neal, 2003), which is specified by a birth process that we will shortly describe. However, our methodology applies quite generally. Other notable Bayesian approaches to hierarchical clustering include: Williams (2000), in which each node of the tree is annotated with a vector that is sampled from a Gaussian centered at its parent's vector; Heller & Ghahramani (2005), which defines a distribution over flat clusterings and then specifies an agglomerative scheme for finding a good partition with respect to this distribution; Adams et al. (2008), in which data points are allowed to reside at internal nodes of the tree; Teh et al. (2008) and Boyles & Welling (2012), in which the distribution over trees is specified by a bottom-up coalescing process; and Knowles & Ghahramani (2015), which generalizes the Dirichlet diffusion tree to allow non-binary splits.

2.1. The Dirichlet diffusion tree

The Dirichlet diffusion tree (DDT) is a generative model for d-dimensional vectors $x_1, x_2, \ldots, x_N$. Data are generated sequentially via a continuous-time process, lasting from time t = 0 to t = 1, when they reach their final value. The first point, $x_1$, is generated via a Brownian motion beginning at the origin, i.e.

$$X_1(t + dt) = X_1(t) + N(0, \sigma^2 I_d\, dt)$$

where $X_1(t)$ represents the value of $x_1$ at time t. The next point, $x_2$, follows the path created by $x_1$ until it eventually diverges at a random time, according to a specified acquisition function a(t). When $x_2$ diverges, it creates an internal node in the tree structure which contains both the time and value of $x_2$ when it diverged. After divergence, it continues until t = 1 with an independent Brownian motion.

In general, the i-th point follows the path created by the previous i − 1 points. When it reaches a node, it will first sample one of two branches to enter, then either 1) diverge on the branch, whereupon a divergence time is sampled according to the acquisition function a(t), or 2) recursively continue to the next node. Each of these choices has a probability associated with it, according to various properties of the tree structure and choice of acquisition function (details can be found in Neal (2003)). Eventually, all points will diverge and continue independently, creating an internal node storing the time and intermediate value for each point at divergence. The DDT thus defines a binary tree over the data (see Figure 1 for an example).

Figure 1. An example DDT with 1-dimensional data. Blue lines represent paths taken by each data point, and black dots represent nodes of the tree. The rightmost dots are leaves and the others are nodes created when points diverged. When drawing the hierarchy, typically the top stem is omitted.
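To give a feel for the generative story, below is a heavily simplified, discretized Python illustration for just the first two points: $x_1$ follows a Brownian motion and $x_2$ follows its path, diverging in each small interval $[t, t+dt)$ with probability roughly $a(t)\,dt$. The full process, with branch choices weighted by how many earlier points took each branch and divergence rate $a(t)/m$ on a path shared by m points, is in Neal (2003); the function and parameter names here are ours.

```python
import numpy as np

def ddt_two_points(sigma2=1.0, c=1.0, dt=1e-3, d=2, rng=None):
    """Discretized sketch of DDT generation for two points with a(t) = c/(1-t)."""
    rng = np.random.default_rng() if rng is None else rng
    steps = int(1.0 / dt)
    path1 = np.zeros((steps + 1, d))
    for i in range(steps):                       # Brownian motion for x1
        path1[i + 1] = path1[i] + rng.normal(0.0, np.sqrt(sigma2 * dt), size=d)

    x2 = path1[0].copy()
    t_div, x_div = None, None                    # the internal node's time/value
    for i in range(steps):
        t = i * dt
        if t_div is None:
            a_t = c / (1.0 - t)                  # acquisition (divergence) function
            if rng.random() < a_t * dt:          # diverge during [t, t + dt)?
                t_div, x_div = t, path1[i].copy()
                x2 = path1[i].copy()
            else:
                x2 = path1[i + 1].copy()         # keep following x1's path
        else:                                    # after divergence: independent motion
            x2 = x2 + rng.normal(0.0, np.sqrt(sigma2 * dt), size=d)
    return path1[-1], x2, t_div, x_div           # leaves x1, x2 and divergence node

x1, x2, t_div, x_div = ddt_two_points()
print(t_div)   # divergence time of the single internal node
```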
Furthermore, given a DDT with N points, it is possible to sample the possible divergence locations of an (N + 1)-th point, using the generative process.

Sampling the posterior DDT given data can be done with the Metropolis-Hastings (MH) algorithm, an MCMC method. The MH algorithm obtains samples from a target distribution p(x) indirectly by instead sampling from a conditional proposal distribution $q(x' \mid x)$, creating a Markov chain whose stationary distribution is p(x), assuming q satisfies some conditions. Our choice of proposal distribution modifies the DDT's tree structure via a subtree-prune and regraft (SPR) move, which has the added benefit of extending to other distributions over hierarchies.

An SPR move consists of first a prune and then a regraft. Suppose T is a binary tree with n leaves. Let s be a non-root node in T selected uniformly at random and S be its corresponding subtree. To prune S from T, we remove s's parent p from T, and replace p with s's sibling. Regrafting selects a branch at random and attaches S to it as follows. Let (u, v) be the chosen branch (u is the parent of v). S is attached to the branch by creating a node p with children s and v and parent u (see Figure 2 for an example).

Figure 2. In the subtree-prune and regraft (SPR) move, a subtree S is selected uniformly at random and is then pruned from the tree. Next, a regraft location is selected from the valid regraft locations, and S is re-attached at that location.

The MH proposal distribution for the DDT is an augmented SPR move, where the time and intermediate value at each node are sampled in addition to tree structure. The exchangeability of the DDT enables efficient sampling of regraft branches by simulating the generation process for a new point and returning the branch and time where it diverges. The intermediate values for the entire tree are sampled via an interleaved Gibbs sampling move, as all conditional distributions are Gaussian.
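Below is a minimal Python sketch of the SPR move on a plain binary tree, assuming at least three leaves; it ignores the divergence times and intermediate values that the DDT proposal also resamples, and it omits regrafting above the root. The `Node` class and helper names are ours.

```python
import random

class Node:
    """Rooted binary tree node; leaves carry a data index in `label`."""
    def __init__(self, label=None, left=None, right=None):
        self.label, self.left, self.right, self.parent = label, left, right, None
        for child in (left, right):
            if child is not None:
                child.parent = self

def nodes_of(root):
    """All nodes in the subtree rooted at `root`."""
    out, stack = [], [root]
    while stack:
        u = stack.pop()
        out.append(u)
        stack.extend(c for c in (u.left, u.right) if c is not None)
    return out

def prune(root, s):
    """Detach subtree s: remove s's parent p and splice in s's sibling."""
    p = s.parent
    sibling = p.right if p.left is s else p.left
    g = p.parent
    if g is None:                  # p was the root, so the sibling becomes the root
        sibling.parent = None
        return sibling
    if g.left is p:
        g.left = sibling
    else:
        g.right = sibling
    sibling.parent = g
    return root

def regraft(root, s, u, v):
    """Attach s onto branch (u, v): create a node p with children s, v and parent u."""
    p = Node(left=s, right=v)      # constructor sets s.parent = v.parent = p
    if u.left is v:
        u.left = p
    else:
        u.right = p
    p.parent = u
    return root

def random_spr(root):
    """One unconstrained SPR move: prune a random non-root subtree and regraft it."""
    s = random.choice([n for n in nodes_of(root) if n.parent is not None])
    root = prune(root, s)
    v = random.choice([n for n in nodes_of(root) if n.parent is not None])
    return regraft(root, s, v.parent, v)
```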
3. Adding interaction

Impressive as the Dirichlet diffusion tree is, there is no reason to suppose that it will magically find a tree that suits the user's needs. But a little interaction can be helpful in improving the outcome.

Let T* denote the target hierarchical clustering. It is not necessarily the case that the user would be able to write this down explicitly, but this is the tree that captures the distinctions he/she is able to make, or wants to make. Figure 3 (left) shows an example, for a small data set of 5 points. In this case, the user does not wish to distinguish between points 1, 2, 3, but does wish to place them in a cluster that excludes point 4.

Figure 3. Target tree T* (left) and a refinement of it.

We could posit our goal as exactly recovering T*. But in many cases, it is good enough to find a tree that captures all the distinctions within T* but also possibly has some extraneous distinctions, as in the right-hand side of Figure 3. Formally, given data set X, we say S ⊆ X is a cluster of tree T if there is some node of T whose descendant leaves are exactly S. We say T is a refinement of T* if they have the same set of leaves, and moreover every cluster of T* is also a cluster of T. This, then, is our goal: to find a refinement of the target clustering T*.

3.1. Triplets

The user provides feedback in the form of triplets. The constraint ({a, b}, c) means that the tree should have a cluster containing a and b but not c. Put differently, the lowest common ancestor of a, b should be a strict descendant of the lowest common ancestor of a, b, c. Let $\triangle(T)$ denote the set of all proper triplet constraints embodied in tree T. If T has n leaves, then $|\triangle(T)| \le \binom{n}{3}$. For non-binary trees, it will be smaller than this number. Figure 3 (left), for instance, has no triplet involving 1, 2, 3.

Refinement can be characterized in terms of triplets.

Lemma 3.1. Tree T is a refinement of tree T* if and only if $\triangle(T^*) \subseteq \triangle(T)$.

Proof. See supplement.

In particular, any triplet-querying scheme that converges to the full set of triplets of the target tree T* is also guaranteed to produce trees that converge to a refinement of T*. With this lemma in mind, it is natural to measure how close a tree T is to the target T* with the following (asymmetric) distance function, which we call triplet distance (TD):

$$\mathrm{TD}(T^*, T) = \frac{\sum_{c \in \triangle(T^*)} \mathbb{I}\left(c \notin \triangle(T)\right)}{|\triangle(T^*)|} \qquad (1)$$

where $\mathbb{I}$ is the indicator function. This distance is zero exactly when T is a refinement of T*, in which case we have reached our goal.
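As a self-contained illustration (trees as nested tuples, a triplet ({a, b}, c) encoded as `(frozenset((a, b)), c)`; the names are ours), the triplet set of a tree and the triplet distance of Equation (1) can be computed by brute force:

```python
from itertools import combinations

def leaves(t):
    """Leaf labels of a tree given as nested tuples, e.g. ((1, 2, 3), (4, 5))."""
    return {t} if not isinstance(t, tuple) else set().union(*(leaves(c) for c in t))

def triplets(t):
    """All triplets ({a, b}, c) embodied in t: some subtree contains a and b but
    not c.  Brute force; fine for small trees."""
    out = set()
    def visit(node):
        if not isinstance(node, tuple):
            return
        here = leaves(node)
        for child in node:
            inside, outside = leaves(child), here - leaves(child)
            for a, b in combinations(sorted(inside), 2):
                out.update((frozenset((a, b)), c) for c in outside)
            visit(child)
    visit(t)
    return out

def triplet_distance(T_star, T):
    """TD(T*, T) from Equation (1): fraction of T*'s triplets missing from T."""
    target, found = triplets(T_star), triplets(T)
    return sum(1 for c in target if c not in found) / len(target)

# The right-hand tree refines the target, so its triplet distance is zero.
T_star = ((1, 2, 3), (4, 5))
T = (((1, 2), 3), (4, 5))
print(triplet_distance(T_star, T))   # 0.0
```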
A simple strategy for obtaining triplets would be to present the user with three randomly chosen data points and have the user pick the odd one out. This strategy has several drawbacks. First, some sets of three points have no triplet constraint (for instance, points 1, 2, 3 in Figure 3). Second, the chosen set of points might correspond to a triplet that has already been specified, or is implied by specified triplets. For example, knowledge of ({a, b}, c) and ({b, c}, d) implies ({a, c}, d). Enumerating the set of implied triplets is non-trivial for n > 3 triplets (Bryant & Steel, 1995), making it difficult to avoid these implied triplets in the first place.

We thus consider another strategy: rather than arranging three data points into a triplet, the user observes the hierarchy induced over some O(1)-sized subset S of the data and corrects an error in the tree by supplying a triplet. We call this a subtree query.

Finally, we note that in this work we only consider the realizable case, where the triplets obtained from a user do not contain contradictory information and there is a tree that satisfies all of them.

3.2. Finding a tree consistent with constraints

We start with a randomly initialized hierarchy T over our data and show an induced subtree T|S to the user, obtaining the first triplet. The next step is constructing a new tree that satisfies the triplet. This begins the feedback cycle: a user provides a triplet given a subtree, and the triplet is incorporated into a clustering algorithm, producing a new candidate tree.

A starting point is an algorithm that returns a tree consistent with a set of triplets. The simplest algorithm to solve this problem is the BUILD algorithm, introduced in Aho et al. (1981). Given a set of triplets C, BUILD will either return a tree that satisfies C, or error if no such tree exists. In BUILD, we first construct the Aho graph $G_C$, which has a vertex for each data point and an undirected edge {a, b} for each triplet constraint ({a, b}, c). If $G_C$ is connected, there is no tree that satisfies all triplets. Otherwise, the top split of the tree is a partition of the connected components of $G_C$: any split is fine as long as points in the same component stay together. Satisfied triplets are discarded, and BUILD then continues recursively on the left and right subtrees.
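A minimal Python sketch of BUILD follows (trees as nested tuples, triplets encoded as `(frozenset((a, b)), c)`; names are ours). For simplicity it creates one child per connected component of the Aho graph rather than grouping the components into a binary split, but the recursion is otherwise as described above.

```python
def build(points, triplets):
    """Return a tree over `points` satisfying every triplet ({a, b}, c) in
    `triplets`, or None if no such tree exists (Aho et al., 1981), sketched."""
    points = list(points)
    if len(points) == 1:
        return points[0]
    pset = set(points)
    # Aho graph: an edge {a, b} for every constraint ({a, b}, c) with a, b, c here.
    adj = {p: set() for p in points}
    for pair, c in triplets:
        a, b = tuple(pair)
        if a in pset and b in pset and c in pset:
            adj[a].add(b)
            adj[b].add(a)
    # Connected components of the Aho graph.
    components, seen = [], set()
    for p in points:
        if p in seen:
            continue
        comp, stack = [], [p]
        seen.add(p)
        while stack:
            u = stack.pop()
            comp.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(comp)
    if len(components) == 1:          # graph is connected: no consistent tree
        return None
    children = [build(comp, triplets) for comp in components]
    return None if any(c is None for c in children) else tuple(children)

C = {(frozenset((1, 2)), 3), (frozenset((3, 4)), 1)}
print(build([1, 2, 3, 4], C))   # ((1, 2), (3, 4))
```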
BUILD satisfies triplet constraints but ignores the geometry of the data, whereas we wish to take both into account. By incorporating triplets into the posterior DDT sampler, we obtain high-likelihood trees that still satisfy C.

3.3. Incorporating triplets into the sampler

In this section, we present an algorithm to sample candidate trees from the posterior DDT, constrained by a triplet set C. It is based on the subtree-prune and regraft move. The SPR move is of particular interest because we can efficiently enforce triplets to form a constrained-SPR move, resulting in a sampler that only produces trees that satisfy a set of triplets. A constrained-SPR move is defined as an SPR move that assigns zero probability to any resulting trees that would violate a set of triplets.

Figure 4. Visualization of a constrained-SPR move. Pruning is identical, but a regraft location is selected from the valid regraft locations limited by triplets. In this image, we are constrained by the sole triplet ({2, 3}, 4).

Restricting the neighborhood of an SPR move runs the risk of partitioning the state space, losing the convergence guarantees of the Metropolis-Hastings algorithm. Fortunately, a constrained-SPR move does not compromise strong connectivity. For any realizable triplet set C, we prove the constrained-SPR move Markov chain's aperiodicity and irreducibility. Consider the Markov chain on the state space of rooted binary trees that is induced by the constrained sampler.

Lemma 3.2. The constrained-SPR Markov chain is aperiodic.

Proof. A sufficient condition for aperiodicity is the existence of a self-loop in the transition matrix: a non-zero probability of a state transitioning to itself. Suppose we have pruned a subtree already. When regrafting, the ordinary SPR move has a non-zero probability of choosing any branch, and a constrained-SPR move cannot regraft to branches that would violate triplets. Since the current tree in the Markov chain satisfies triplet set C, there is a non-zero probability of regrafting to the same location. We thus have an aperiodic Markov chain.

Lemma 3.3. A constrained-SPR Markov chain is irreducible.

Proof. (sketch) To show irreducibility, we show that a tree T has a non-zero probability of reaching an arbitrary tree T′ via constrained-SPR moves, where both T and T′ satisfy a set of triplets C. Our proof strategy is to construct a canonical tree $T_C$ and show that there exists a non-zero probability path from T to $T_C$, and likewise from T′ to $T_C$. We then show that for a given constrained-SPR move, the reverse move has a non-zero probability. Thus, there exists a path from T to $T_C$ to T′, satisfying irreducibility.

Recall that the split at a node in a binary tree that satisfies triplet set C corresponds to a binary partition of the Aho graph at the node (see Section 3.2). $T_C$ is a tree such that every node in $T_C$ is in canonical form. A node is in canonical form if it is a leaf node, or the partition of the Aho graph at that node can be written as (l, r), where l is the single connected component containing the point with the minimum data index, and r is the rest of the components. To convert a particular node s into canonical form, we first perform "grouping", which puts l into a single descendant of s via constrained-SPR moves. We then make two constrained-SPR moves to convert the partition at s into the form (l, r) (see Figure 5). We convert all nodes into canonical form recursively, turning an arbitrary tree T into $T_C$.

Figure 5. The process of converting s into canonical form. We first group nodes from l into their own isolated subtree, then perform two constrained-SPR moves to put s into canonical form.

Finally, the reverse constrained-SPR move has a non-zero probability. Suppose we perform a constrained-SPR move on tree $T_1$, converting it into $T_2$ by detaching subtree s and attaching it to branch (u, v). A constrained-SPR move on $T_2$ can select s for pruning and can regraft it to form $T_1$ with a non-zero probability, since $T_1$ satisfies the same constraint set C. For a full proof, please refer to the supplement.

The simplest possible scheme for a constrained-SPR move would be rejection sampling. The Metropolis-Hastings algorithm for the DDT would be the same as in the unconstrained case, but any trees violating C would have accept probability 0. Although this procedure is correct, it is impractical. As the number of triplets grows larger, more trees will be rejected and the sampler will slow down over time.

To efficiently sample a tree that satisfies a set of triplets C, we modify the regraft in the ordinary SPR move. The constrained-SPR move must assign zero probability to any regraft branches that would result in a tree that violates C. This is accomplished by generating the path from the root in the same manner as sampling a branch, but avoiding paths that would result in violated triplets.

DESCRIPTION OF CONSTRAINED-SPR SAMPLER

Recall that in the DDT's sampling procedure for regraft branches, a particle at a node picks a branch, and either diverges from that branch or recursively samples the node's child. Let s be the root of the subtree we are currently grafting back onto tree T, let C be the triplet set, and let leaves(u) denote the descendant leaves of node u. Suppose we are currently at node u, deciding whether to diverge at the branch (u, v) or to recursively sample v. Consider any triplet ({a, b}, c) ∈ C. If all or none of a, b, c are in leaves(s), then the triplet is unaffected by the graft, and can be ignored. Otherwise, checks are needed:

1. c ∈ leaves(s). Then we know a, b ∉ leaves(s). If a and b are split across v's children, we are banned from sampling v.

2. a ∈ leaves(s) but b, c ∉ leaves(s). If both b and c are in leaves(v), we are required to sample v. If just c is in leaves(v), we are banned from both diverging at (u, v) and sampling v. Otherwise we can either diverge at (u, v) or sample v.

(The case where b ∈ leaves(s) is symmetric to case 2.)

If we choose to sample v, we remove constraints from our current set C that are now satisfied, and continue recursively. This defines a procedure by which we can sample a divergence branch that does not violate constraints.
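The check above can be written as a small helper that, given the leaves of the pruned subtree s and the leaf-sets of v's children, returns which of the two actions remain allowed at the branch (u, v). This is our own transcription of the case analysis, with triplets encoded as `(frozenset((a, b)), c)`; it also skips the case where a and b both lie inside s, which the graft cannot violate.

```python
def regraft_options(s_leaves, v_child_leaves, triplets):
    """Allowed actions at branch (u, v) while walking down to pick a regraft
    location.  `s_leaves`: leaves of the pruned subtree s.  `v_child_leaves`:
    list of leaf-sets, one per child of v (empty if v is a leaf).  Returns a
    subset of {'diverge', 'descend'}: 'diverge' = attach s on (u, v) itself,
    'descend' = recursively sample within v."""
    v_leaves = set().union(*v_child_leaves) if v_child_leaves else set()
    allowed = {'diverge', 'descend'} if v_child_leaves else {'diverge'}
    for pair, c in triplets:
        a, b = tuple(pair)
        a_in, b_in, c_in = a in s_leaves, b in s_leaves, c in s_leaves
        if a_in == b_in == c_in or (a_in and b_in):
            continue                        # the graft cannot affect this triplet
        if c_in:
            # Case 1: c is in s, so a and b are outside.  If a and b are split
            # across v's children, we may not sample (descend into) v.
            both_under_v = a in v_leaves and b in v_leaves
            same_child = any(a in ch and b in ch for ch in v_child_leaves)
            if both_under_v and not same_child:
                allowed.discard('descend')
        else:
            # Case 2 and its mirror: exactly one of a, b lies in s; call the
            # other one y.
            y = b if a_in else a
            if y in v_leaves and c in v_leaves:
                allowed.discard('diverge')  # required to sample v
            elif c in v_leaves:
                allowed.clear()             # banned from both options
    return allowed

# Example with the sole triplet ({2, 3}, 4) of Figure 4, where s contains leaf 4:
C = {(frozenset((2, 3)), 4)}
print(regraft_options({4}, [{2}, {3}], C))       # {'diverge'}: cannot descend
print(regraft_options({4}, [{2, 3}, {5}], C))    # both actions remain allowed
```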
While the constrained-SPR sampler can produce a set of trees given a static set of constraints, the BUILD algorithm is useful for adding new triplets into the sampler. Suppose we have been sampling trees with constrained-SPR moves satisfying triplet set C and we obtain a new triplet u = ({a, b}, c) from a user query. We take the current tree T and find the least common ancestor (call it z) of a and b. We then call BUILD(C ∪ {u}) on just the nodes in leaves(z), and we substitute the resulting subtree at position z in tree T.

3.4. Intelligent subset queries

We now have a method to sample a constrained distribution over candidate trees. Given a particular candidate tree T, our first strategy for subtree querying is to pick a random subset S of the leaves of constant size, and show the user the induced subtree over the subset, T|S. We call this random subtree querying.

But can we use a set of trees produced by the sampler to make better subtree queries? If tree structure is ambiguous in a particular region of data, i.e. there are several hierarchies that could explain a particular configuration of data, the MH algorithm will sample over these different configurations. A query over points in these ambiguous regions may help our algorithm converge to a better tree faster. By looking for these regions in our samples, we can choose query subsets S for which the tree structure is highly variable, and hopefully the resulting triplet from the user will reduce the ambiguity.

More precisely, we desire a notion of tree variance. Given a set of trees $\mathcal{T}$, what is the variance over a given subset of the data S? We propose using the notion of tree distance as a starting point. For a given tree T, the tree distance between two nodes a and b, denoted $\mathrm{treedist}_T(a, b)$, is the number of edges of T needed to get from a to b. Consider two leaves u and v. If the tree structure around them is static, we expect the tree distance between u and v to change very little, as the surrounding tree will not change. However, if there is ambiguity in the surrounding structure, the tree distance will be more variable. Given a subset of data S and a set of trees $\mathcal{T}$, the tree distance variance (TDV) of the trees over the subset is defined as:

$$\mathrm{TDV}(\mathcal{T}, S) = \max_{u, v \in S} \operatorname{Var}_{T \in \mathcal{T}}\left[\mathrm{treedist}_{T|S}(u, v)\right] \qquad (2)$$

This measure of variance is the max of the variance of tree distance between any two points in the subset. Computing this requires $O(|\mathcal{T}|\,|S|^2 \log |S|)$ time, and since |S| is constant, it is not prohibitively expensive. Given a set of trees from the sampler $\mathcal{T}$, we now select a high-variance subtree by instantiating L random subsets of constant size, $S_1, \ldots, S_L$, and picking $\mathrm{argmax}_l\, \mathrm{TDV}(\mathcal{T}, S_l)$. We call this active subtree querying.
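A small Python sketch of Equation (2) and of active subset selection follows, again with trees as nested tuples; `restrict`, `treedist`, and the other names are ours, and efficiency is not a goal here.

```python
from itertools import combinations
from statistics import pvariance
import random

def restrict(t, S):
    """Induced subtree T|S: keep only leaves in S and splice out unary nodes."""
    if not isinstance(t, tuple):
        return t if t in S else None
    kept = [r for r in (restrict(c, S) for c in t) if r is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)

def treedist(t, u, v):
    """Number of edges on the path between leaves u and v in tree t."""
    def depth_of(node, target, d=0):
        if not isinstance(node, tuple):
            return d if node == target else None
        for c in node:
            r = depth_of(c, target, d + 1)
            if r is not None:
                return r
        return None
    lca = t                        # walk down to the smallest subtree holding both
    while isinstance(lca, tuple):
        inner = [c for c in lca
                 if depth_of(c, u) is not None and depth_of(c, v) is not None]
        if not inner:
            break
        lca = inner[0]
    return depth_of(lca, u) + depth_of(lca, v)

def tdv(trees, S):
    """Tree-distance variance, Equation (2): max over pairs in S of the variance
    of their distance in the induced subtrees, across the sampled trees."""
    S = set(S)
    return max(pvariance([treedist(restrict(t, S), u, v) for t in trees])
               for u, v in combinations(sorted(S), 2))

def active_query_subset(trees, all_leaves, subset_size=10, L=20):
    """Active subtree querying: among L random candidate subsets, return the one
    with the highest tree-distance variance."""
    candidates = [random.sample(sorted(all_leaves), subset_size) for _ in range(L)]
    return max(candidates, key=lambda S: tdv(trees, S))
```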
Although using tree variance will help reconcile ambiguity in the tree structure, if a set of samples from a tree all violate the same triplet, it is unlikely that active querying will recover that triplet. Thus, interleaving random querying and active querying will hopefully help the algorithm converge quickly, while avoiding local optima.

4. Experiments

We evaluated the convergence properties of five different querying schemes. In a "simple query", a user is presented with three random data points and picks the odd one out. In a "smart query", a user is unrealistically shown the entire candidate tree and reports a violated triplet. In a "random query", the user is shown the induced candidate tree over a random subset of the data. In an "active query", the user is shown a high-variance subtree using tree-distance variance. Finally, in an "interleaved query", the user is alternately shown a random subtree and a high-variance subtree.

In each experiment, T* was known, so user queries were simulated by picking a triplet violated by the root split of the queried tree, and, if no such triplet existed, recursing on a child. Each scheme was evaluated on four different datasets. The first dataset, MNIST (LeCun et al., 1998), is a 10-way image classification dataset where the data are 28 × 28 images of digits. The target tree T* is simply the k-way classification tree over the data. The second dataset is Fisher Iris, a 3-way flower classification problem, where each of 150 flowers has five features. The third dataset, Zoo (Lichman, 2013), is a set of 93 animals and 15 binary morphological features for each animal, the target tree being the induced binary tree from the Open Tree of Life (Hinchliff et al., 2015). The fourth dataset is 20 Newsgroups (Joachims, 1997), a corpus of text articles on 20 different subjects. We use the first 10 principal components as features in this classification problem. All datasets were modeled with DDTs with acquisition function a(t) = 1/(1 − t) and Brownian motion parameter σ² estimated from data. To better visualize the different convergence rates of the querying schemes, MNIST and 20 Newsgroups were subsampled to 150 random points.

For each dataset and querying scheme, we instantiated an SPR sampler with no constraints. Every one hundred iterations of the sampler, we performed a query. In subtree queries, we used subsets of size |S| = 10, and in active querying, the highest-variance subset was chosen from L = 20 different random subsets. As baselines, we measured the triplet distance of the vanilla DDT and the average-linkage tree. Finally, results were averaged over four runs of each sampler.

Figure 6. The average of four runs of constrained-SPR samplers for the Fisher Iris dataset and the MNIST dataset, using 5 different querying schemes. A query was made every 100 iterations.

The triplet distances for Fisher Iris and MNIST can be seen in Figure 6. Results for the other datasets can be found in the supplement. Although unrealistic due to the size of the tree shown to the user, the smart query performed the best, achieving minimum error with the fewest queries. Interleaved followed next, followed by active, random, and simple. In general, the vanilla DDT performed the worst, and the average-linkage score varied on each dataset, but in all cases, the subtree querying schemes performed better than both the vanilla DDT and average linkage. In three datasets (MNIST, Fisher Iris and Zoo), interactive methods achieve higher data likelihood than the vanilla DDT. Initially, the sampler is often restructuring the tree with new triplets and the data likelihood is unlikely to rise. However, over time, as fewer triplets are reported, the data likelihood increases rapidly. We thus conjecture that triplet constraints may help the MH algorithm find better optima.

5. Future Work

We are interested in studying the non-realizable case, i.e. when there does not exist a tree that satisfies triplet set C. We would also like to better understand the effect of constraints on searching for optima using MCMC methods.

Acknowledgements

Sharad Vikram and Sanjoy Dasgupta are supported by the National Science Foundation under grant CNS-1446912. The authors are grateful for feedback from the anonymous reviewers and for help and advice from Suqi Liu, Stefanos Poulis, and Christopher Tosh.

References

Adams, R.P., Ghahramani, Z., and Jordan, M.I. Tree-structured stick breaking for hierarchical data. In Advances in Neural Information Processing Systems, 2008.

Aho, A.V., Sagiv, Y., Szymanski, T.G., and Ullman, J.D. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM Journal on Computing, 10:405–421, 1981.
Aldous, D. Probability distributions on cladograms. In Aldous, D. and Pemantle, R. (eds.), Random Discrete Structures (IMA Volumes in Mathematics and its Applications 76), pp. 1–18, 1995.

Awasthi, P. and Zadeh, R.B. Supervised clustering. In Advances in Neural Information Processing Systems, 2010.

Awasthi, P., Balcan, M.-F., and Voevodski, K. Local algorithms for interactive clustering. In Proceedings of the 31st International Conference on Machine Learning, 2014.

Balcan, M.-F. and Blum, A. Clustering with interactive feedback. In Algorithmic Learning Theory (volume 5254 of the series Lecture Notes in Computer Science), pp. 316–328, 2008.

Bansal, N., Blum, A., and Chawla, S. Correlation clustering. Machine Learning, 56(1–3):89–113, 2004.

Basu, S., Banerjee, A., and Mooney, R. Active semi-supervision for pairwise constrained clustering. In SIAM International Conference on Data Mining, 2004.

Biswas, A. and Jacobs, D. Active image clustering with pairwise constraints from humans. International Journal of Computer Vision, 108(1):133–147, 2014.

Borg, I. and Groenen, P.J.F. Modern Multidimensional Scaling: Theory and Applications. Springer Verlag, 2005.

Boyles, L. and Welling, M. The time-marginalized coalescent prior for hierarchical clustering. In Advances in Neural Information Processing Systems, 2012.

Bryant, D. and Steel, M. Extension operations on sets of leaf-labeled trees. Advances in Applied Mathematics, 16(4):425–453, 1995.

Eriksson, B., Dasarathy, G., Singh, A., and Nowak, R. Active clustering: robust and efficient hierarchical clustering using adaptively selected similarities. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.

Felsenstein, J. Inferring Phylogenies. Sinauer, 2004.

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. Springer, 2nd edition, 2009.

Heller, K. and Ghahramani, Z. Bayesian hierarchical clustering. In Proceedings of the 22nd International Conference on Machine Learning, 2005.

Hinchliff, Cody E., Smith, Stephen A., Allman, James F., Burleigh, J. Gordon, Chaudhary, Ruchi, Coghill, Lyndon M., Crandall, Keith A., Deng, Jiabin, Drew, Bryan T., Gazis, Romina, Gude, Karl, Hibbett, David S., Katz, Laura A., Laughinghouse, H. Dail, McTavish, Emily Jane, Midford, Peter E., Owen, Christopher L., Ree, Richard H., Rees, Jonathan A., Soltis, Douglas E., Williams, Tiffani, and Cranston, Karen A. Synthesis of phylogeny and taxonomy into a comprehensive tree of life. In Proceedings of the National Academy of Sciences, 2015.

Joachims, Thorsten. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, ICML '97, pp. 143–151, 1997. ISBN 1-55860-486-3.

Knowles, D.A. and Ghahramani, Z. Pitman-Yor diffusion trees for Bayesian hierarchical clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):271–289, 2015.

Krishnamurthy, A., Balakrishnan, S., Xu, M., and Singh, A. Efficient active algorithms for hierarchical clustering. In Proceedings of the 29th International Conference on Machine Learning, 2012.

Kulis, B., Basu, S., Dhillon, I., and Mooney, R. Semi-supervised graph clustering: a kernel approach. In Proceedings of the 22nd International Conference on Machine Learning, 2005.
LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pp. 2278–2324, 1998.

Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

McCullagh, P., Pitman, J., and Winkel, M. Gibbs fragmentation trees. Bernoulli, 14(4):988–1002, 2008.

Neal, R.M. Density modeling and clustering using Dirichlet diffusion trees. In Bernardo, J.M. et al. (eds.), Bayesian Statistics 7, pp. 619–629. Oxford University Press, 2003.

Tamuz, O., Liu, C., Belongie, S., Shamir, O., and Kalai, A.T. Adaptively learning the crowd kernel. In Proceedings of the 28th International Conference on Machine Learning, 2011.

Teh, Y.W., Daumé III, H., and Roy, D.M. Bayesian agglomerative clustering with coalescents. In Advances in Neural Information Processing Systems, 2008.

Wagstaff, K. and Cardie, C. Clustering with instance-level constraints. In Proceedings of the 17th International Conference on Machine Learning, 2000.

Wagstaff, K., Cardie, C., Rogers, S., and Schroedl, S. Constrained k-means clustering with background knowledge. In Proceedings of the 18th International Conference on Machine Learning, 2001.

Williams, C.K.I. A MCMC approach to hierarchical mixture modeling. In Advances in Neural Information Processing Systems, 2000.

Zoller, T. and Buhmann, J.M. Active learning for hierarchical pairwise data clustering. In Proceedings of the 15th International Conference on Pattern Recognition, 2000.