# Flattening a Hierarchical Clustering through Active Learning

Fabio Vitale (Department of Computer Science, INRIA Lille, France & Sapienza University of Rome, Italy) fabio.vitale@inria.fr
Anand Rajagopalan (Google Research NY, New York, USA) anandbr@google.com
Claudio Gentile (Google Research NY, New York, USA) cgentile@google.com

Abstract

We investigate active learning by pairwise similarity over the leaves of trees originating from hierarchical clustering procedures. In the realizable setting, we provide a full characterization of the number of queries needed to achieve perfect reconstruction of the tree cut. In the non-realizable setting, we rely on known importance-weighted sampling procedures to obtain regret and query complexity bounds. Our algorithms come with theoretical guarantees on the statistical error and, more importantly, lend themselves to linear-time implementations in the relevant parameters of the problem. We discuss such implementations, prove running time guarantees for them, and present preliminary experiments on real-world datasets showing the compelling practical performance of our algorithms as compared to both passive learning and simple active learning baselines.

1 Introduction

Active learning is a learning scenario where labeled data are scarce and/or expensive to gather, as they require careful assessment by human labelers. This is often the case in several practical settings where machine learning is routinely deployed, from image annotation to document classification, from speech recognition to spam detection, and beyond. In all such cases, an active learning algorithm tries to limit human intervention by seeking as little supervision as possible, while still obtaining accurate predictions on unseen samples. This is an attractive learning framework offering substantial practical benefits, but also presenting statistical and algorithmic challenges.

Active learning is especially effective when combined with methods that exploit the cluster structure of data (e.g., [11, 21, 10], and references therein), where a cluster typically encodes some notion of semantic similarity across the involved data points. A ubiquitous approach to clustering is to organize data into a hierarchy, delivering clustering solutions at different levels of resolution. An (agglomerative) Hierarchical Clustering (HC) procedure is an unsupervised learning method parametrized by a similarity function over the items to be clustered and a linkage function that lifts similarity from items to clusters of items. Finding the right level of resolution amounts to turning a given HC into a flat clustering by cutting the resulting tree appropriately. We would like to do so by resorting to human feedback in the form of pairwise similarity queries, that is, yes/no questions of the form "are these two products similar to one another?" or "are these two news items covering similar events?". It is well known that such queries are relatively easy to respond to, but they are also intrinsically prone to subjectiveness and/or noise. More importantly, the hierarchy at hand need not be aligned with the similarity feedback we actually receive.

In this paper, we investigate the problem of cutting a tree originating from a pre-specified HC procedure through pairwise similarity queries generated by active learning algorithms.
Since the tree is typically not consistent with the similarity feedback, that is to say, the feedback is noisy, we are led to tackle this problem under a variety of assumptions about the nature of this noise (from noiseless, to random but persistent, to general agnostic). Moreover, because different linkage functions applied to the very same set of items may give rise to widely different tree topologies, our study also focuses on characterizing active learning performance as a function of the structure of the tree at hand. Finally, because these hierarchies may in practice be sizeable (on the order of billions of nodes), scalability will be a major concern in our investigation.

On motivation. It is often the case in big organizations that data processing pipelines are split into services, so that Machine Learning solution providers are constrained by the existing hardware/software infrastructure. In the Active Learning applications that motivate this work, the hierarchy over the items to be clustered is provided by a third party, i.e., by an exogenous data processing tool that relies on side information about the items (e.g., word2vec mappings and associated distance functions), which is possibly generated by yet another service, and so on. In this modular environment, it is reasonable to assume that the tree is given to us as part of the input of our Active Learning problem. The human feedback our algorithms rely upon may or may not be consistent with the tree at hand, both because human feedback is generally noisy and because this feedback may originate from yet another source of data, e.g., another production team in the organization that was not in charge of building the original tree. In fact, the same tree over the data items may serve the clustering needs of different groups within the organization, having different goals and views on the same data. This also motivates why, when studying this problem, we are led to consider different noise scenarios, that is, the presence of noisy feedback and the possibility that the given tree is not consistent with the clustering corresponding to the received feedback.

Our contribution. In the realizable setting (both noiseless and persistent noisy, Section 3), we introduce algorithms whose expected number of queries scales with the average complexity of tree cuts, a notion introduced in this paper. A distinctive feature of these algorithms is that they are rather ad hoc in the way they deal with the structure of our problem. In particular, they cannot be seen as finding the query that splits the version space as evenly as possible, a common approach in many active learning papers (e.g., [12, 25, 15, 16, 27, 24], and references therein). We then show that, at least in the noiseless case, this average complexity measure characterizes the expected query complexity of the problem. Our ad hoc analyses are beneficial in that they deliver sharper guarantees than those readily available from the above papers. In addition, and perhaps more importantly for practical usage, our algorithms admit linear-time implementations in the relevant parameters of the problem (like the number of items to be clustered). In the non-realizable setting (Section 4), we build on known results in importance-weighted active learning (e.g., [5, 6]) to devise a selective sampling algorithm working under more general conditions.
While our statistical analysis follows by adapting available results, our goal here is rather to come up with fast implementations, so as to put the resulting algorithms on the same computational footing as those operating under (noisy) realizability assumptions. By leveraging the specific structure of our hypothesis space, we design a fast incremental algorithm for selective sampling whose running time per round is linear in the height of the tree. In turn, this effort paves the way for our experimental investigation (Section 5), where we compare the effectiveness of the two above-mentioned approaches (realizable with persistent noise vs. non-realizable) on real data originating from various linkage functions. Though preliminary in nature, these experiments seem to suggest that in practice the algorithms originating from the persistent noise assumption exhibit more attractive learning curves than those working in the more general non-realizable setting.

Related work. The literature on active learning is vast, and we can hardly do it justice here. In what follows we confine ourselves to the references which we believe are closest to our paper. Since our sample space is discrete (the set of all possible pairs of items from a finite set of size n), our realizable setting is essentially a pool-based active learning setting. Several papers have considered greedy algorithms which generalize binary search [1, 20, 12, 25, 15, 24]. The query complexity can be measured either in the worst case or averaged over a prior distribution over all possible labeling functions in a given set. The query complexity of these algorithms can be analyzed by comparing it to the best possible query complexity achieved for that set of items. In [12] it is shown that if the probability mass of the version space is split as evenly as possible, then the approximation factor for its average query complexity is O(log(1/p_m)), where p_m is the minimal prior probability of any considered labeling function. [15] extended this result through a more general approach to approximate greedy rules, but with the worse factor O(log^2(1/p_m)). [20] observed that modifying the prior distribution always allows one to replace O(log(1/p_m)) by the smaller factor O(log N), where N is the size of the set of labeling functions. Results of a similar flavor are contained in [25, 24]. In our case, N can be exponential in n (see Section 2), making these landmark results too broad to be tight for our specific setting. Furthermore, some of these papers (e.g., [12, 25, 24]) are of theoretical interest only, because of their difficult algorithmic implementation. Interesting advances on this front are contained in the more recent paper [27], though when adapted to our specific setting, their results give rise to worse query bounds than ours. In the same vein are the papers by [7, 8], dealing with persistent noise. Finally, in the non-realizable setting, our work fully relies on [6], which in turn builds on standard references like [9, 5, 17]; see also, e.g., the comprehensive survey by [18]. Further references, specifically related to clustering with queries, are mentioned in Appendix A.

2 Preliminaries and learning models

We consider the problem of finding cuts of a given binary tree through pairwise similarity queries over its leaves. We are given as input a binary tree T (see Footnote 1) originating from, say, an agglomerative (i.e., bottom-up) HC procedure (single linkage, complete linkage, etc.) applied to a set of items L = {x_1, ..., x_n}.
Since T is the result of successive (binary) merging operations from bottom to top, T turns out to be a strongly binary tree (see Footnote 2), and the items in L are the leaves of T. We will denote by V the set of nodes of T, including its leaves L, and by r the root of T. The height of T will be denoted by h. When referring to a subtree T′ of T, we will use the notation V(T′), L(T′), r(T′), and h(T′), respectively. We also denote by T(i) the subtree of T rooted at node i, and by L(i) the set of leaves of T(i), so that L(i) = L(T(i)) and r(T(i)) = i. Moreover, par(i) will denote the parent of node i (in tree T), left(i) its left child, and right(i) its right child.

Footnote 1: In fact, the trees we can handle are more general than binary: we are making the binary assumption throughout for presentational convenience only.
Footnote 2: A strongly binary tree is a rooted binary tree for which the root is adjacent to either zero or two nodes, and all non-root nodes are adjacent to either one or three nodes.

Figure 1: Left: A binary tree corresponding to a hierarchical clustering of the set of items L = {x_1, ..., x_8}. The cut depicted in dashed green has two nodes above and the rest below. This cut induces over L the flat clustering C_1 = {{x_1, x_2, x_3, x_4, x_5}, {x_6}, {x_7, x_8}} corresponding to the leaves of the subtrees rooted at the 3 green-bordered nodes just below the cut (the lower boundary of the cut). Clustering C_1 is therefore realized by T. On the contrary, clustering C_2 = {{x_1, x_2, x_3}, {x_4, x_5, x_6}, {x_7, x_8}} is not. Close to each node i is also displayed the number N(i) of cuts realized by the subtree rooted at i. For instance, in this figure, 7 = 1 + 3·2 and 22 = 1 + 7·3, so that T admits overall N(T) = 22 cuts. Right: The same figure, where below each node i are the probabilities p(i) encoding a uniform prior distribution over cuts. Notice that p(i) = 1/N(i), so that, like all other cuts, the depicted green cut has probability (1 − 1/22)·(1 − 1/3)·(1/7)·1·(1/2) = 1/22.

A flat clustering C of L is a partition of L into disjoint (and non-empty) subsets. A cut c of T of size K is a set of K edges of T that partitions V into two disjoint subsets; we call them the nodes above c and the nodes below c. Cut c also univocally induces a clustering over L, made up of the clusters L(i_1), L(i_2), ..., L(i_K), where i_1, i_2, ..., i_K are the nodes below c that the edges of c are incident to. We denote this clustering by C(c), and call the nodes i_1, i_2, ..., i_K the lower boundary of c. We say that clustering C_0 is realized by T if there exists a cut c of T such that C(c) = C_0. See Figure 1 (left) for a pictorial illustration. Clearly enough, for a given L, and a given tree T with set of leaves L, not all possible clusterings over L are realized by T, as the number and shape of the clusterings realized by T are strongly influenced by T's structure.

Let N(T) be the number of clusterings realized by T (notice that this is also equal to the number of distinct cuts admitted by T). Then N(T) can be computed through a simple recursive formula. If we let N(i) be the number of cuts realized by T(i), one can easily verify that N(i) = 1 + N(left(i)) · N(right(i)), with N(x_i) = 1 for all x_i ∈ L. With this notation, we then have N(T) = N(r(T)). If T has n leaves, N(T) ranges from n, when T is a degenerate line tree, to the exponential α^n, when T is the full binary tree, where α ≈ 1.502 (e.g., http://oeis.org/A003095). See again Figure 1 (left) for a simple example.
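To make the recursion for N(T) concrete, here is a minimal Python sketch (ours, not from the paper; the Node class and field names are purely illustrative) that counts the cuts realized by a strongly binary tree.

```python
class Node:
    """A node of a strongly binary tree: either a leaf or a node with two children."""
    def __init__(self, left=None, right=None, label=None):
        self.left, self.right, self.label = left, right, label

    def is_leaf(self):
        return self.left is None and self.right is None


def num_cuts(node):
    """N(i) = 1 + N(left(i)) * N(right(i)), with N(leaf) = 1."""
    if node.is_leaf():
        return 1
    return 1 + num_cuts(node.left) * num_cuts(node.right)


# Example: the full binary tree on 4 leaves admits 1 + (1 + 1*1)*(1 + 1*1) = 5 cuts.
leaves = [Node(label=f"x{k}") for k in range(1, 5)]
t = Node(Node(leaves[0], leaves[1]), Node(leaves[2], leaves[3]))
print(num_cuts(t))  # -> 5
```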
Figure 2: Cuts and corresponding values of d_H(Σ, Ĉ) in four input trees for the same ground-truth Σ determined by the clustering {{x_1, x_2, x_3, x_4, x_5}, {x_6}, {x_7, x_8}}. Underneath each tree is the clustering Ĉ induced by the depicted cut. The color of a cluster is green if it corresponds to a cluster in the clustering of Σ, and is red otherwise. In T_1 we have d_H(Σ, Ĉ) = 0 (realizable case). The other 3 trees illustrate the non-realizable case with the cut minimizing d_H(Σ, Ĉ) on the corresponding tree. Recalling that d_H(Σ, Ĉ) counts ordered pairs, in T_2 we have d_H(Σ, Ĉ) = 10, because x_6 now belongs to the first cluster according to clustering Ĉ. In T_3 we have d_H(Σ, Ĉ) = 2. Finally, in T_4 it is easy to verify that d_H(Σ, Ĉ) = 10.

A ground-truth matrix Σ is an n × n, ±1-valued symmetric matrix Σ = [σ(x_i, x_j)]_{i,j=1}^n encoding a pairwise similarity relation over L. Specifically, if σ(x_i, x_j) = 1 we say that x_i and x_j are similar, while if σ(x_i, x_j) = −1 we say they are dissimilar. Moreover, we always have σ(x_i, x_i) = 1 for all x_i ∈ L. Notice that Σ need not be consistent with a given clustering over L, i.e., the binary relation defined by Σ over L need not be transitive.

Given T and its leaves L, an active learning algorithm A proceeds in a sequence of rounds. In a purely active setting, at round t, the algorithm queries a pair of items (x_{i_t}, x_{j_t}), and observes the associated label σ(x_{i_t}, x_{j_t}). In a selective sampling setting, at round t, the algorithm is presented with (x_{i_t}, x_{j_t}) drawn from some distribution over L × L, and has to decide whether or not to query the associated label σ(x_{i_t}, x_{j_t}). In both cases, the algorithm is stopped at some point, and is compelled to commit to a specific cut of T (inducing a flat clustering over L). Coarsely speaking, the goal of A is to come up with a good cut of T, by making as few queries as possible on the entries of Σ.

Noise Models. The simplest possible setting, called the noiseless realizable setting, is when Σ itself is consistent with a clustering realized by T, i.e., when there exists a cut c* of T such that C(c*) = {L(i_1), ..., L(i_K)}, for some nodes i_1, ..., i_K ∈ V, that satisfies the following: for all r = 1, ..., K, and for all pairs (x_i, x_j) ∈ L(i_r) × L(i_r), we have σ(x_i, x_j) = 1, while for all other pairs we have σ(x_i, x_j) = −1. We call (persistent) noisy realizable setting one where Σ is generated as follows. Start off from the noiseless ground-truth matrix, and call it Σ*. Then, in order to obtain Σ from Σ*, consider the set of all n(n − 1)/2 pairs (x_i, x_j) with i < j, and pick uniformly at random a subset of size λ·n(n − 1)/2, for some λ ∈ [0, 1/2). Each such pair has its label flipped in Σ: σ(x_i, x_j) = −σ*(x_i, x_j). This is then combined with the symmetry condition σ(x_i, x_j) = σ(x_j, x_i) and the reflexivity condition σ(x_i, x_i) = 1. We call λ the noise level. Notice that this kind of noise is random but persistent, in that if we query the same pair (x_i, x_j) twice we obtain the same answer σ(x_i, x_j). Clearly, the special case λ = 0 corresponds to the noiseless setting. Finally, in the general non-realizable (or agnostic) setting, Σ is an arbitrary matrix that need not be consistent with any clustering over L, in particular, with any clustering over L realized by T.
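As an illustration of the persistent noise model just described, the following sketch (ours; numpy-based, with the function name chosen for illustration) flips a uniformly random λ-fraction of the pairs (x_i, x_j) with i < j of a noiseless ±1 matrix Σ*, keeping the matrix symmetric and the diagonal at +1.

```python
import numpy as np

def add_persistent_noise(sigma_star, lam, rng=None):
    """Flip a uniformly random lam-fraction of the pairs (i, j) with i < j of a
    +/-1 similarity matrix; the flip is symmetric and persistent."""
    rng = np.random.default_rng(rng)
    n = sigma_star.shape[0]
    sigma = sigma_star.copy()
    iu, ju = np.triu_indices(n, k=1)                  # all pairs with i < j
    m = int(lam * len(iu))                            # number of flipped pairs
    flip = rng.choice(len(iu), size=m, replace=False)
    sigma[iu[flip], ju[flip]] *= -1
    sigma[ju[flip], iu[flip]] *= -1                   # keep the matrix symmetric
    np.fill_diagonal(sigma, 1)                        # reflexive pairs stay +1
    return sigma
```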
Error Measure. If Σ is some ground-truth matrix over L, and ĉ is the cut output by A, with induced clustering Ĉ = C(ĉ), we let Σ_Ĉ = [σ_Ĉ(x_i, x_j)]_{i,j=1}^n be the n × n similarity matrix associated with Ĉ, i.e., σ_Ĉ(x_i, x_j) = 1 if x_i and x_j belong to the same cluster, and −1 otherwise. Then the Hamming distance d_H(Σ, Ĉ) simply counts the number of (ordered) pairs (x_i, x_j) having inconsistent sign:

d_H(Σ, Ĉ) = |{ (x_i, x_j) ∈ L² : σ(x_i, x_j) ≠ σ_Ĉ(x_i, x_j) }| .

The same definition applies in particular to the case when Σ itself represents a clustering over L. The quantity d_H, sometimes called the correlation clustering distance, is closely related to the Rand index [26]; see, e.g., [23]. Figure 2 contains illustrative examples.

Figure 3: Left: The dotted green cut c* can be described by the set of values {y(i), i ∈ V}, shown below each node. In this tree, in order to query, say, node i_2, it suffices to query any of the four pairs (x_1, x_3), (x_1, x_4), (x_2, x_3), or (x_2, x_4). The baseline queries i_1 through i_6 in a breadth-first manner, and then stops, having identified c*. Right: This graph has N(T) = n. On the depicted cut, the baseline has to query all n − 1 internal nodes.

Prior distribution. Recall cut c* defined in the noiseless realizable setting and its associated Σ*. Depending on the specific learning model we consider (see below), the algorithm may have access to a prior distribution P(·) over c*, parametrized as follows. For i ∈ V, let p(i) be the conditional probability that i is below c* given that all of i's ancestors are above. If we denote by AB(c*) ⊆ V the nodes of T which are above c*, and by LB(c*) ⊆ V those on the lower boundary of c*, we can write

P(c*) = ∏_{i ∈ AB(c*)} (1 − p(i)) · ∏_{j ∈ LB(c*)} p(j) ,   (1)

where p(i) = 1 if i ∈ L. In particular, setting p(i) = 1/N(i) for all i yields the uniform prior P(c*) = 1/N(T) for all cuts c* realized by T. See Figure 1 (right) for an illustration. A canonical example of a non-uniform prior is one that favors cuts close to the root, thereby inducing clusterings having few clusters. These can be obtained, e.g., by setting p(i) = α, for some constant α ∈ (0, 1).

Learning models. We consider two learning settings. The first setting (Section 3) is an active learning setting under a noisy realizability assumption with prior information. Let C* = C(c*) be the ground-truth clustering induced by cut c* before noise is added. Here, for a given prior P(c*), the goal of learning is to identify C* either exactly (when λ = 0) or approximately (when λ > 0), while bounding the expected number of queries (x_{i_t}, x_{j_t}) made to the ground-truth matrix Σ, the expectation being over the noise, and possibly over P(c*). In particular, if Ĉ is the clustering produced by the algorithm after it stops, we would like to prove upper bounds on E[d_H(Σ*, Ĉ)], as related to the number of active learning rounds, as well as to the properties of the prior distribution.
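As an illustration of prior (1), the sketch below (ours, reusing the Node class, num_cuts, and the example tree t from the earlier sketch) samples a cut top-down: each visited node falls below the cut with probability p(i), given that all its ancestors are above, and leaves always stop the recursion (p(leaf) = 1). The function returns the lower boundary of the sampled cut; with p(i) = 1/N(i), the sampled cut is uniform over the N(T) cuts realized by T.

```python
import random

def sample_cut(node, p, rng=random):
    """Sample a cut from prior (1): return its lower boundary as a list of nodes.
    p(node) is the conditional probability that node is below the cut given that
    all of its ancestors are above; p(leaf) must be 1."""
    if node.is_leaf() or rng.random() < p(node):
        return [node]          # node is below the cut: it roots the cluster L(node)
    # node is above the cut: recurse on both children
    return sample_cut(node.left, p, rng) + sample_cut(node.right, p, rng)

# Uniform prior over cuts: p(i) = 1 / N(i)  (see the recursion for N above).
uniform_p = lambda node: 1.0 / num_cuts(node)
boundary = sample_cut(t, uniform_p)
print(len(boundary))  # number of clusters in the sampled flat clustering
```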
The second setting (Section 4) is a selective sampling setting where the pairs (x_{i_t}, x_{j_t}) are drawn i.i.d. according to an arbitrary and unknown distribution D over the n² entries of Σ, and the algorithm at every round can choose whether or not to query the label. After a given number of rounds the algorithm is stopped, and the goal is the typical goal of agnostic learning: no prior distribution over cuts is available anymore, and we would like to bound with high probability over the sample (x_{i_1}, x_{j_1}), (x_{i_2}, x_{j_2}), ... the so-called excess risk of the clustering Ĉ produced by A, i.e., the difference

P_{(x_i, x_j) ∼ D}( σ(x_i, x_j) ≠ σ_Ĉ(x_i, x_j) ) − min_c P_{(x_i, x_j) ∼ D}( σ(x_i, x_j) ≠ σ_{C(c)}(x_i, x_j) ) ,   (2)

the minimum being over all possible cuts c realized by T. Notice that when D is uniform the excess risk reduces to (1/n²)·( d_H(Σ, Ĉ) − min_c d_H(Σ, C(c)) ). At the same time, we would like to bound with high probability the total number of labels the algorithm has queried.

3 Active learning in the realizable case

As a warm-up, we start by considering the case where λ = 0 (no noise). The underlying cut c* can be conveniently described by assigning to each node i of T a binary value: y(i) = 0 if i is above c*, and y(i) = 1 if i is below. Then we can think of an active learning algorithm as querying nodes, instead of querying pairs of leaves. A query to node i ∈ V can be implemented by querying any pair (x_l, x_r) ∈ L(left(i)) × L(right(i)). When doing so, we actually receive y(i), since for any such pair we have y(i) = 1 exactly when σ*(x_l, x_r) = 1. An obvious baseline is then to perform a kind of breadth-first search in the tree: we start by querying the root r, and observe y(r); if y(r) = 1 we stop and output clustering Ĉ = {L}; otherwise, we go down by querying both left(r) and right(r), and then proceed recursively. It is not hard to show that this simple algorithm will make at most 2K − 1 queries, with an overall running time of O(K), where K is the number of clusters of C(c*). See Figure 3 for an illustration.
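A sketch (ours, again reusing the Node class above) of this breadth-first baseline in the noiseless case: querying a node amounts to querying one pair of leaves taken across its two subtrees, an answer of +1 closes a cluster, and an answer of −1 pushes the node's two children onto the frontier.

```python
from collections import deque

def leftmost_leaf(node):
    while not node.is_leaf():
        node = node.left
    return node

def bf_baseline(root, query):
    """Noiseless breadth-first baseline.
    query(xa, xb) returns +1 if the two leaves are similar, -1 otherwise.
    Returns the lower boundary of c* after at most 2K - 1 node queries."""
    boundary, frontier = [], deque([root])
    while frontier:
        i = frontier.popleft()
        if i.is_leaf():
            boundary.append(i)          # singleton cluster
            continue
        xa = leftmost_leaf(i.left)      # any pair across the two subtrees
        xb = leftmost_leaf(i.right)     # reveals y(i)
        if query(xa, xb) == 1:          # y(i) = 1: i is below the cut
            boundary.append(i)
        else:                           # y(i) = 0: go down one level
            frontier.append(i.left)
            frontier.append(i.right)
    return boundary
```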
If we know beforehand that K is very small, then this baseline is a tough competitor. Yet, this is not the best we can do in general. Consider, for instance, the line graph in Figure 3 (right), where c* has K = n. Ideally, for a given prior P(·), we would like to obtain a query complexity of the form log(1/P(c*)), holding in the worst case for all underlying c*. As we shall see momentarily, this is easily obtained when P(·) is uniform. We first describe a version space algorithm (One Third Splitting, OTS) that admits a fast implementation, and whose number of queries in the worst case is O(log N(T)). This will in turn pave the way for our second algorithm, Weighted Dichotomic Path (WDP). WDP leverages P(·), but its theoretical guarantees only hold in expectation over P(c*). WDP will then be extended to the persistent noisy setting through its variant Noisy Weighted Dichotomic Path (N-WDP).

We need a few ancillary definitions. First of all note that, in the noiseless setting, we have a clear hierarchical structure on the labels y(i) of the internal nodes of T: whenever a query reveals a label y(i) = 0, we know that all of i's ancestors have label 0; on the other hand, if we observe y(i) = 1, we know that all internal nodes of the subtree T(i) have label 1. Hence, disclosing the label of some node indirectly entails disclosing the labels of either its ancestors or its descendants. Given T, a bottom-up path is any path connecting a node with one of its ancestors in T. In particular, we call a backbone path any bottom-up path having maximal length. Given i ∈ V, we denote by S_t(i) the version space at time t associated with T(i), i.e., the set of all cuts of T(i) that are consistent with the labels revealed so far. For any node j ≠ i, S_t(i) splits into S_t^{y(j)=0}(i) and S_t^{y(j)=1}(i), the subsets of S_t(i) obtained by imposing a further constraint on y(j).

OTS (One Third Splitting): For all i ∈ V, OTS maintains over time the value |S_t(i)|, i.e., the size of S_t(i), along with the forest F made up of all maximal subtrees T′ of T such that |V(T′)| > 1 and for which none of their node labels have been revealed so far. OTS initializes F to contain T only, and keeps F updated over time by picking any backbone path of any subtree T′ ∈ F and visiting it in a bottom-up manner. See the details in Appendix B.1. The following theorem (proof in Appendix B.1) crucially relies on the fact that the path π visited therein is a backbone path of T′, rather than an arbitrary path.

Theorem 1. On a tree T with n leaves, height h, and number of cuts N, OTS finds c* by making O(log N) queries. Moreover, an ad hoc data structure exists that makes the overall running time O(n + h log N) and the space complexity O(n).

Hence, Theorem 1 ensures that, for all c*, a time-efficient active learning algorithm exists whose number of queries is of the form log(1/P(c*)), provided P(c*) = 1/N(T) for all c*. This query bound is fully in line with well-known results on splittable version spaces [12, 25, 24], so we cannot make claims of originality. Yet, what is relevant here is that this splitting can be done very efficiently.

We complement the above result with a lower bound holding in expectation over prior distributions on c*. This lower bound depends in a detailed way on the structure of T. Given tree T, with set of leaves L, and cut c*, recall the definitions of AB(c*) and LB(c*) we gave in Section 2. Let T_{c*} be the subtree of T whose nodes are (AB(c*) ∪ LB(c*)) \ L, and then let K̃(T, c*) = |L(T_{c*})| be the number of its leaves. For instance, in Figure 3 (left), T_{c*} is made up of the six nodes i_1, ..., i_6, so that K̃(T, c*) = 3, while in Figure 3 (right), T_{c*} has nodes i_1, ..., i_{n−1}, hence K̃(T, c*) = 1. Notice that we always have K̃(T, c*) ≤ K, but for many trees T, K̃(T, c*) may be much smaller than K. A striking example is again provided by the cut in Figure 3 (right), where K̃(T, c*) = 1, but K = n. It is also helpful to introduce Ls(T), the set of all pairs of sibling leaves in T. For instance, in the tree of Figure 3 (left), we have |Ls(T)| = 3. One can easily verify that, for all T, we have max_{c*} K̃(T, c*) = |Ls(T)| ≤ log_2 N(T).

We now show that there always exist families of prior distributions P(·) such that the expected number of queries needed to find c* is Ω(E[K̃(T, c*)]). The quantity E[K̃(T, c*)] is our notion of average (query) complexity. Since the lower bound holds in expectation, it also holds in the worst case. The proof can be found in Appendix B.2.

Theorem 2. In the noiseless realizable setting, for any tree T, any positive integer B ≤ |Ls(T)|, and any (possibly randomized) active learning algorithm A, there exists a prior distribution P(·) over c* such that the expected (over P(·) and A's internal randomization) number of queries A has to make in order to recover c* is lower bounded by B/2, while B ≤ E[K̃(T, c*)] ≤ 2B, the latter expectation being over P(·).
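The complexity measure K̃(T, c*) is easy to compute from the lower boundary of a cut. The sketch below (ours, same Node class as above) counts the leaves of T_{c*}: an internal lower-boundary node is always a leaf of T_{c*}, while a node above the cut is a leaf of T_{c*} exactly when both its children are leaves of T.

```python
def k_tilde(root, boundary):
    """K~(T, c*): number of leaves of the subtree of T spanned by the internal
    nodes lying above or on the lower boundary of the cut c*."""
    boundary = set(boundary)

    def count(node):
        # Returns (number of leaves of T_{c*} in this subtree, node belongs to T_{c*}?)
        if node.is_leaf():
            return 0, False               # leaves of T are excluded from T_{c*}
        if node in boundary:
            return 1, True                # internal boundary node: a leaf of T_{c*}
        lcount, lin = count(node.left)
        rcount, rin = count(node.right)
        if not lin and not rin:           # above the cut, both children outside T_{c*}
            return 1, True                # ... so this node is a leaf of T_{c*}
        return lcount + rcount, True      # above the cut, internal in T_{c*}

    return count(root)[0]

# Example: on the 4-leaf tree t above, the cut whose lower boundary is the two
# children of the root has K~ = 2 (and K = 2 as well).
print(k_tilde(t, [t.left, t.right]))  # -> 2
```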
Next, we describe an algorithm that, unlike OTS, is indeed able to take advantage of the prior distribution, but it does so at the price of bounding the number of queries only in expectation.

WDP (Weighted Dichotomic Path): Recall the prior distribution (1), collectively encoded through the values {p(i), i ∈ V}. As for OTS, we denote by F the forest made up of all maximal subtrees T′ of T such that |V(T′)| > 1 and for which none of their node labels have so far been revealed. F is updated over time, and initially contains only T. We denote by π(u, v) a bottom-up path in T having as terminal nodes u and v (hence v is an ancestor of u in T). For a given cut c*, with associated labels {y(i), i ∈ V}, any tree T′ ∈ F, and any node i ∈ V(T′), we define (see Footnote 3)

q(i) = P( y(i) = 1 | y(par(i)) = 0 ) = p(i) · ∏_{j ∈ π(par(i), r(T′))} (1 − p(j)) .   (3)

We then associate with any backbone path of the form π(ℓ, r(T′)), where ℓ ∈ L(T′), the entropy H(π(ℓ, r(T′))) = −∑_{i ∈ π(ℓ, r(T′))} q(i) log_2 q(i). Notice that at the beginning we have ∑_{i ∈ π(ℓ, r(T))} q(i) = 1 for all ℓ ∈ L. This invariant will be maintained on all subtrees T′. The prior probabilities p(i) will evolve during the algorithm's functioning into posterior probabilities based on the information revealed by the labels. Accordingly, the related values q(i), with respect to which the entropy H(·) is computed, will also change over time.

Footnote 3: For definiteness, we set y(par(r)) = 0, that is, we treat the parent of r(T) as a dummy super-root labeled 0 since time t = 0. Thus, according to this definition, q(r) = p(r).

Due to space limitations, WDP's pseudocode is given in Appendix B.3, but we have included an example of its execution in Figure 4.

Figure 4: An example of input tree T before (left) and after (right) the first binary search of WDP. The green node is a dummy super-root. The nodes in yellow are the roots of the subtrees currently included in forest F. The numbers in red within each node i indicate the probabilities p(i), while the q(i) values are in blue, viewed here as associated with the edges (par(i), i). The magenta numbers at each leaf ℓ give the entropy H(π(ℓ, r(T′))), where r(T′) is the root of the subtree in F that contains ℓ. Left: the input tree T at time t = 0; no labels are revealed, and no clusters of C(c*) have been found. Right: tree T after a full binary search has been performed on the depicted light blue path. Before this binary search, that path connected a leaf of a subtree in F to its root (in this case, F contains only T). The selected path is the one maximizing entropy within the forest/tree on the left. The dashed line indicates the edge of c* found by the binary search. The red, blue and magenta numbers are updated according to the result of the binary search. The leaves enclosed in the grey ellipse are now known to form a cluster of C(c*).

At each round, WDP finds the path whose entropy is maximized over all bottom-up paths π(ℓ, r′), with ℓ ∈ L and r′ = r(T′), where T′ is the subtree in F containing ℓ. WDP performs a binary search on such π(ℓ, r′) to find the edge of T which is cut by c*, taking into account the current values of q(i) over that path. Once a binary search terminates, WDP updates F and the probabilities p(i) at all nodes i in the subtrees of F. See Figure 4 for an example. Notice that the p(i) on the selected path become either 0 (if above the edge cut by c*) or 1 (if below). In turn, this causes updates of all probabilities q(i). WDP continues with the next binary search on the next path with maximum entropy at the current stage, discovering another edge cut by c*, and so on, until F becomes empty.
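The quantities WDP maintains on a backbone path can be illustrated with a few lines of code (ours): given the probabilities p(i) along a bottom-up path π(ℓ, r(T′)), listed from the root down to the leaf (whose p is 1), compute the q(i) of (3) and the path entropy H(π). The example reproduces the values displayed in Figure 4 (left).

```python
import math

def path_q_and_entropy(p_path):
    """p_path: probabilities p(i) along a bottom-up path, listed from the root
    r(T') down to the leaf l (the leaf has p = 1).
    Returns (q values in the same order, entropy of the path)."""
    q, survive = [], 1.0           # survive = prob. that all ancestors on the path are above c*
    for p in p_path:
        q.append(p * survive)      # q(i) = p(i) * prod over ancestors j on the path of (1 - p(j))
        survive *= (1.0 - p)
    entropy = -sum(qi * math.log2(qi) for qi in q if qi > 0)
    return q, entropy

# Example with the probabilities shown in Figure 4 (left): p = 0.25, 0.5, 0.9, 1.
q, H = path_q_and_entropy([0.25, 0.5, 0.9, 1.0])
# q = [0.25, 0.375, 0.3375, 0.0375]  (they sum to 1), and H is about 1.737 bits.
```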
Denote by P_{>0} the set of all priors P(·) such that P(c) > 0 for all cuts c of T. The proof of the following theorem is given in Appendix B.3.

Theorem 3. In the noiseless realizable setting, for any tree T of height h and any prior distribution P(·) over c* with P(·) ∈ P_{>0}, the expected number of queries made by WDP to find c* is O( E[K̃(T, c*)] · log h ), the expectations being over P(·).

For instance, in the line graph of Figure 3 (right), the expected number of queries is O(log n) for any prior P(·), while if T is a complete binary tree with n leaves and we know that C(c*) has O(K) clusters, we can set p(i) in (1) as p(i) = 1/log K, which would guarantee E[K̃(T, c*)] = O(K), and a bound on the expected number of queries of the form O(K log log n). By comparison, observe that the results in [12, 15, 27] would give a query complexity which is at best O(K log² n), while those in [25, 24] yield at best O(K log n). In addition, we show below (Remark 1) that our algorithm has very compelling running time guarantees.

It is often the case that a linkage function generating T also tags each internal node i with a coherence level α_i of T(i), which is typically increasing as we move downwards from root to leaves. A common situation in hierarchical clustering is then to figure out the right level of granularity of the flat clustering we search for by defining parallel bands of nodes of similar coherence where c* is possibly located. For such cases, a slightly more involved guarantee for WDP is contained in Theorem 6 in Appendix B.3, where the query complexity depends in a more detailed way on the interplay between T and the prior P(·). In the above example, if we have b-many edge-disjoint bands, Theorem 6 replaces the factor log h of Theorem 3 by log b.

N-WDP (Noisy Weighted Dichotomic Path): This is a robust variant of WDP that copes with persistent noise. Whenever a label y(i) is requested, N-WDP determines its value by a majority vote over randomly selected pairs from L(left(i)) × L(right(i)). Due to space limitations, all details are contained in Appendix B.4. The next theorem quantifies N-WDP's performance in terms of a tradeoff between the expected number of queries and the distance to the noiseless ground-truth matrix Σ*.

Theorem 4. In the noisy realizable setting, given any input tree T of height h, any cut c* drawn from a prior P(·) ∈ P_{>0}, and any δ ∈ (0, 1/2), N-WDP outputs with probability at least 1 − δ (over the noise in the labels) a clustering Ĉ such that

(1/n²) · d_H(Σ*, Ĉ) = O( (1/n) · (log(n/δ))^{3/2} / (1 − 2λ)³ )

by asking

O( ( log(n/δ) / (1 − 2λ)² ) · E[K̃(T, c*)] · log h )

queries in expectation (over P(·)).

Remark 1. Compared to the query bound in Theorem 3, the one in Theorem 4 adds a factor of O( log(n/δ) / (1 − 2λ)² ) due to noise. The very same extra factor is contained in the bound of [22]. Regarding the running time of WDP, the version we have described can be naively implemented to run in O( n · E[K̃(T, c*)] ) expected time overall. A more time-efficient variant of WDP exists for which Theorem 3 and Theorem 6 still hold, requiring O( n + h · E[K̃(T, c*)] ) expected time. Likewise, an efficient variant of N-WDP exists for which Theorem 4 holds, taking O( n + ( h + (log² n) / (1 − 2λ)² ) · E[K̃(T, c*)] ) expected time.
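A sketch (ours) of the noisy node query used by N-WDP: estimate y(i) by a majority vote over m pairs drawn at random (here with replacement, for simplicity) from L(left(i)) × L(right(i)). How m must scale with log(n/δ)/(1 − 2λ)² is prescribed by the analysis in Appendix B.4, so m is left as a free parameter below.

```python
import random

def collect_leaves(node):
    """Return the list of leaves of the subtree rooted at node."""
    if node.is_leaf():
        return [node]
    return collect_leaves(node.left) + collect_leaves(node.right)

def noisy_node_label(node, query, m, rng=random):
    """Estimate y(node) under persistent noise: majority vote of the noisy +/-1
    similarity labels of m random pairs across the two subtrees of node."""
    left_leaves, right_leaves = collect_leaves(node.left), collect_leaves(node.right)
    votes = sum(query(rng.choice(left_leaves), rng.choice(right_leaves))
                for _ in range(m))
    # y = 1 (below the cut) iff the majority says "similar"; ties resolved as 0.
    return 1 if votes > 0 else 0
```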
4 Selective sampling in the non-realizable case

In the non-realizable case, we adapt to our clustering scenario the importance-weighted algorithm in [6]. The algorithm is a selective sampler that proceeds in a sequence of rounds t = 1, 2, .... In round t, a pair (x_{i_t}, x_{j_t}) is drawn at random from distribution D over the entries of a given ground-truth matrix Σ, and the algorithm produces in response a probability value p_t = p_t(x_{i_t}, x_{j_t}). A Bernoulli variable Q_t ∈ {0, 1} is then generated with P(Q_t = 1) = p_t; if Q_t = 1, the label σ_t = σ(x_{i_t}, x_{j_t}) is queried, and the algorithm updates its internal state; otherwise, we skip to the next round.

The way p_t is generated is described as follows. Given tree T, the algorithm maintains at each round t an importance-weighted empirical risk minimizer cut ĉ_t, defined as ĉ_t = argmin_c err_{t−1}(C(c)), where the argmin is over all cuts c realized by T, and

err_{t−1}(C) = (1/(t−1)) ∑_{s=1}^{t−1} (Q_s / p_s) · 1{ σ_C(x_{i_s}, x_{j_s}) ≠ σ_s } ,

1{·} being the indicator function of the predicate at argument. This is paired up with a perturbed empirical risk minimizer

ĉ′_t = argmin_{c : σ_{C(c)}(x_{i_t}, x_{j_t}) ≠ σ_{C(ĉ_t)}(x_{i_t}, x_{j_t})} err_{t−1}(C(c)) ,

the argmin being over all cuts c realized by T that disagree with ĉ_t on the current pair (x_{i_t}, x_{j_t}). The value of p_t is a function of d_t = err_{t−1}(C(ĉ′_t)) − err_{t−1}(C(ĉ_t)), of the form

p_t = min{ 1, O( (1/d_t² + 1/d_t) · log( (N(T)/δ) · log t ) / t ) } ,   (4)

where N(T) is the total number of cuts realized by T (i.e., the size of our comparison class), and δ is the desired confidence parameter. Once stopped, say in round t_0, the algorithm outputs cut ĉ_{t_0+1} and the associated clustering C(ĉ_{t_0+1}). Let us call the resulting algorithm NR (Non-Realizable). Although N(T) can be exponential in n, there are very efficient ways of computing ĉ_t, ĉ′_t, and hence p_t at each round. In particular, an ad hoc procedure exists that incrementally computes these quantities by leveraging the sequential nature of NR.

For a given T and constant K ≥ 1, consider the class C(T, K) of cuts inducing clusterings with at most K clusters. Set R* = R*(T, D) = min_{c ∈ C(T,K)} P_{(x_i, x_j) ∼ D}( σ(x_i, x_j) ≠ σ_{C(c)}(x_i, x_j) ), and B_δ(K, n) = K log n + log(1/δ). The following theorem is an adaptation of a result in [6]. See Appendix C.1 for a proof.

Theorem 5. Let T have n leaves and height h. Given confidence parameter δ, for any t ≥ 1, with probability at least 1 − δ, the excess risk (2) achieved by the clustering C(ĉ_{t+1}) computed by NR w.r.t. the best cut in class C(T, K) is bounded by

O( √( B_δ(K, n) · log t / t ) + B_δ(K, n) · log t / t ) ,

while the (expected) number of labels ∑_{s=1}^t p_s is bounded by

O( θ · ( R*·t + √( t · B_δ(K, n) · log t ) ) + B_δ(K, n) · log³ t ) ,

where θ = θ(C(T, K), D) is the disagreement coefficient of C(T, K) w.r.t. distribution D. In particular, when D is uniform we have θ ≤ K. Moreover, there exists a fast implementation of NR whose expected running time per round is E_{(x_i, x_j) ∼ D}[ de(lca(x_i, x_j)) ] ≤ h, where de(lca(x_i, x_j)) is the depth in T of the lowest common ancestor of x_i and x_j.
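A schematic sketch (ours) of one round of NR over an explicitly enumerated, small set of candidate clusterings. The actual implementation never enumerates the cuts realized by T; what the sketch shows is the importance-weighted bookkeeping and the choice of p_t in (4), with the O(·) constant of (4) exposed as a tuning knob c0 and a small additive guard inside the logarithm to keep it defined at t = 1.

```python
import math, random

class NRState:
    """Importance-weighted selective sampler over a finite set of candidate
    clusterings (each clustering is a dict: leaf label -> cluster id)."""
    def __init__(self, clusterings, delta, c0=1.0):
        self.clusterings = clusterings
        self.delta, self.c0 = delta, c0
        self.loss = [0.0] * len(clusterings)   # cumulative importance-weighted errors
        self.t = 0

    def _sigma(self, k, xi, xj):               # +/-1 label predicted by clustering k
        return 1 if self.clusterings[k][xi] == self.clusterings[k][xj] else -1

    def round(self, xi, xj, query):
        """Process one pair; query(xi, xj) is called only with probability p_t."""
        self.t += 1
        t, n_cuts = self.t, len(self.clusterings)
        erm = min(range(n_cuts), key=lambda k: self.loss[k])              # hat{c}_t
        disagree = [k for k in range(n_cuts)
                    if self._sigma(k, xi, xj) != self._sigma(erm, xi, xj)]
        if not disagree:
            return 0.0                          # all candidates agree: no need to query
        alt = min(disagree, key=lambda k: self.loss[k])                   # hat{c}'_t
        d_t = (self.loss[alt] - self.loss[erm]) / max(t - 1, 1)
        # log(t + 2) is a guard so the logarithm is well defined at t = 1.
        bonus = math.log((n_cuts / self.delta) * math.log(t + 2)) / t
        p_t = 1.0 if d_t <= 0 else min(1.0, self.c0 * (1.0 / d_t**2 + 1.0 / d_t) * bonus)
        if random.random() < p_t:               # Q_t = 1: query and update all losses
            sigma = query(xi, xj)
            for k in range(n_cuts):
                self.loss[k] += (1.0 / p_t) * (self._sigma(k, xi, xj) != sigma)
        return p_t
```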
5 Preliminary experiments

The goal of these experiments was to contrast active learning methods originating from the persistent noisy setting (specifically, N-WDP) with those originating from the non-realizable setting (specifically, NR). The comparison is carried out on the hierarchies produced by standard HC methods operating on the first n = 10000 datapoints of the well-known MNIST dataset from http://yann.lecun.com/exdb/mnist/, yielding a sample space of 10^8 pairs. We used Euclidean distance combined with the single linkage (SING), median linkage (MED), and complete linkage (COMP) functions. The n × n ground-truth matrix Σ is provided by the 10 class labels of MNIST. We compared N-WDP with uniform prior and NR to two baselines: passive learning based on empirical risk minimization (ERM), and the active learning baseline performing breadth-first search from the root (BF, Section 3), made robust to noise as in N-WDP. For reference, we also computed, for each of the three hierarchies, the performance of the best cut in hindsight (BEST) on the entire matrix Σ. That is essentially the best one can hope for in each of the three cases.

All algorithms except ERM are randomized and have a single parameter to tune. We let such parameters vary across suitable ranges and, for each algorithm, picked the best performing value on a validation set of 500 labeled pairs. In Table 1, we have collected relevant statistics about the three hierarchies. In particular, the single linkage tree turned out to be very deep, while the complete linkage one is quite balanced.

Table 1: Statistics of the trees used in our experiments. These trees result from applying the linkage functions SING, COMP, and MED to the MNIST dataset (first 10000 samples). Each tree has the same set of n = 10000 leaves. Avg depth is the average depth of the leaves in the tree, Std. dev is its standard deviation. For reference, we report the performance of BEST (i.e., the minimizer of d_H over all possible cuts realized by the trees), along with the associated number of clusters K.

Tree   Avg depth   Std. dev   BEST's error   BEST's K
SING   2950        1413.6     8.26%          4679
MED    186.4       41.8       8.51%          1603
COMP   17.1        3.3        8.81%          557

We evaluated test set error vs. number of queries after parameter tuning, excluding these 500 pairs. For N-WDP, once a target number of queries was reached, we computed as current output the maximum-a-posteriori cut. In order to reduce variance, we repeated each experiment 10 times. The details of our empirical comparison are contained in Appendix C.3. Though our experiments are quite preliminary, some trends can be readily spotted (see Table 2 in Appendix C.3):

i. N-WDP significantly outperforms NR. E.g., in COMP at 250 queries, the test set error of N-WDP is at 9.52%, while NR is at 10.1%. A similar performance gap at low numbers of queries can be observed in SING and MED. This trend was expected: NR is very conservative, as it has been designed to work under more general conditions than N-WDP. We conjecture that, whenever the specific task at hand allows one to make an aggressive noise-free algorithm (like WDP) robust to persistent noise (like N-WDP), this outcome is quite likely to occur.
ii. BF is competitive only when BEST has few clusters.
iii. N-WDP clearly outperforms ERM, while the comparison between NR and ERM yields mixed results.

Conclusions and ongoing activity. Beyond presenting new algorithms and analyses for pairwise similarity-based active learning, our goal was to put different approaches to active learning on the same footing for comparison on real data. The initial evidence emerging from our experiments is that Active Learning algorithms based on persistent noise can in practice be more effective than those making the more general non-realizable assumption. Notice that the similarities of the pairs of items have been generated by the MNIST class labels, hence they have virtually nothing to do with the trees we generated, which in turn do not rely on those labels at all. These initial trends suggested by our experiments clearly need a more thorough investigation.
We are currently using other datasets, of different nature and size. Further HC methods are also under consideration, like those based on k-means.

Acknowledgments

Fabio Vitale acknowledges support from the Google Focused Award ALL4AI and the ERC Starting Grant DMAP (680153), awarded to the Department of Computer Science of Sapienza University.

References

[1] E. Arkin, H. Meijer, J. Mitchell, D. Rappaport, and S. Skiena. Decision trees for geometric models. In Proc. Symposium on Computational Geometry, pages 369–378, 1993.
[2] H. Ashtiani, S. Kushagra, and S. Ben-David. Clustering with same-cluster queries. In Proc. 30th NIPS, 2016.
[3] P. Awasthi, M. F. Balcan, and K. Voevodski. Local algorithms for interactive clustering. Journal of Machine Learning Research, 18, 2017.
[4] M. F. Balcan and A. Blum. Clustering with interactive feedback. In Proc. 19th International Conference on Algorithmic Learning Theory, pages 316–328, 2008.
[5] Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In Proc. ICML, pages 49–56. ACM, 2009.
[6] Alina Beygelzimer, Daniel Hsu, John Langford, and Tong Zhang. Agnostic active learning without constraints. In Proc. 23rd International Conference on Neural Information Processing Systems, NIPS'10, pages 199–207, 2010.
[7] Yuxin Chen, S. Hamed Hassani, Amin Karbasi, and Andreas Krause. Sequential information maximization: When is greedy near-optimal? In Proc. 28th Conference on Learning Theory, PMLR 40, pages 338–363, 2015.
[8] Yuxin Chen, S. Hamed Hassani, and Andreas Krause. Near-optimal Bayesian active learning with correlated and noisy tests. In Proc. 20th International Conference on Artificial Intelligence and Statistics, 2017.
[9] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15:201–221, 1994.
[10] C. Cortes, G. De Salvo, C. Gentile, M. Mohri, and N. Zhang. Region-based active learning. In Proc. 22nd International Conference on Artificial Intelligence and Statistics, 2019.
[11] S. Dasgupta and D. Hsu. Hierarchical sampling for active learning. In Proc. 25th International Conference on Machine Learning, 2008.
[12] Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems, pages 235–242, 2005.
[13] S. Davidson, S. Khanna, T. Milo, and S. Roy. Top-k and clustering with noisy comparisons. ACM Trans. Database Syst., 39(4):35:1–35:39, 2014.
[14] Donatella Firmani, Barna Saha, and Divesh Srivastava. Online entity resolution using an oracle. Proc. VLDB Endow., 9(5):384–395, 2016.
[15] Daniel Golovin and Andreas Krause. Adaptive submodularity: A new approach to active learning and stochastic optimization. arXiv:1003.3967, 2017.
[16] Alon Gonen, Sivan Sabato, and Shai Shalev-Shwartz. Efficient active learning of halfspaces: An aggressive approach. Journal of Machine Learning Research, 14:2583–2615, 2013.
[17] S. Hanneke. A bound on the label complexity of agnostic active learning. In Proc. 24th International Conference on Machine Learning, pages 353–360, 2007.
[18] S. Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2–3):131–309, 2014.
[19] Dov Harel and Robert E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing, 13(2):338–355, 1984.
[20] S. Kosaraju, T. Przytycka, and R. Borgstrom. On an optimal split tree problem. In Proc. 6th International Workshop on Algorithms and Data Structures, pages 157–168, 1999.
[21] S. Kpotufe, R. Urner, and S. Ben-David. Hierarchical label queries with data-dependent partitions. In Proc. 28th Conference on Learning Theory, pages 1176–1189, 2015.
[22] A. Mazumdar and B. Saha. Clustering with noisy queries. arXiv:1706.07510v1, 2017.
[23] M. Meila. Local equivalences of distances between clusterings: a geometric perspective. Machine Learning, 86(3):369–389, 2012.
[24] Stephen Mussmann and Percy Liang. Generalized binary search for split-neighborly problems. In Proc. 21st International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.
[25] Robert D. Nowak. The geometry of generalized binary search. IEEE Transactions on Information Theory, 57(12):7893–7906, 2011.
[26] W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66:846–850, 1971.
[27] C. Tosh and S. Dasgupta. Diameter-based active learning. In Proc. 34th International Conference on Machine Learning (ICML), 2017.
[28] Norases Vesdapunt, Kedar Bellare, and Nilesh Dalvi. Crowdsourcing algorithms for entity resolution. Proc. VLDB Endow., 7(12):1071–1082, 2014.
[29] Jiannan Wang, Tim Kraska, Michael J. Franklin, and Jianhua Feng. CrowdER: Crowdsourcing entity resolution. Proc. VLDB Endow., 5(11):1483–1494, July 2012.