# Active clustering for labeling training data

Quentin Lutz (Nokia Bell Labs, quentin.lutz@nokia-bell-labs.com), Élie de Panafieu (Nokia Bell Labs, elie.de_panafieu@nokia-bell-labs.com), Alex Scott (University of Oxford, scott@maths.ox.ac.uk), Maya Stein (University of Chile, mstein@dim.uchile.cl)

Authors presented in alphabetical order. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract

Gathering training data is a key step of any supervised learning task, and it is both critical and expensive. Critical, because the quantity and quality of the training data have a high impact on the performance of the learned function. Expensive, because most practical cases rely on humans-in-the-loop to label the data. The process of determining the correct labels is much more expensive than comparing two items to see whether they belong to the same class. Thus motivated, we propose a setting for training data gathering where the human experts perform the comparatively cheap task of answering pairwise queries, and the computer groups the items into classes (which can be labeled cheaply at the very end of the process). Given the items, we consider two random models for the classes: one where the set partition they form is drawn uniformly, the other where each item chooses its class independently following a fixed distribution. In the first model, we characterize the algorithms that minimize the average number of queries required to cluster the items and analyze their complexity. In the second model, we analyze a specific algorithm family, propose as a conjecture that it reaches the minimum average number of queries, and compare its performance to a random approach. We also propose solutions to handle errors or inconsistencies in the experts' answers.

1 Introduction

There is an increasing demand for software implementing supervised learning for classification. Training data input for such software consists of items belonging to distinct classes. The output is a classifier: a function that predicts, for any new item, the class it most likely belongs to. Its quality depends critically on the available learning data, in terms of both quantity and quality [21]. But labeling large quantities of data is costly. This task cannot be fully automated, as doing so would assume access to an already trained classifier. Thus, human intervention, although expensive, is required. In this article, we focus on helping the human experts build the learning data efficiently.

One natural way for the human experts to proceed is to learn (or discover) the classes and write down their characteristics. Then, items are considered one by one, assigning an existing class to each of them, or creating a new one if necessary. This approach requires the experts to learn the various classes, which, depending on the use-case, can be difficult. A different approach, proposed to us by Nokia engineer Maria Laura Maag, is to discover the partition by querying the experts on two items at a time, asking whether these belong to the same class or not. This approach avoids the part of the process where classes are learned, and can therefore be cheaper. It is the setting we consider here, and we call the corresponding algorithm an active clustering algorithm, or for short, AC algorithm.

More precisely, we assume there is a set of size n, with a partition unknown to us.
An AC algorithm will, in each step, choose a pair of elements and ask the oracle whether they belong to the same partition class or not. The queries chosen by the algorithm are allowed to depend on earlier answers. The algorithm will use transitivity inside the partition classes: if each of the pairs x, y and y, z is known to lie in the same class (for instance because of positive answers from the oracle), then the algorithm will know that x and z are also in the same class, and it will not submit a query for this pair. The algorithm terminates once the partition is recovered, i.e. when all elements from the same partition class have been shown to belong to the same class, and when for each pair of partition classes, there has been at least one query between their elements.

We investigate AC algorithms under two different random models for the unknown set partition. In Section 2.1, the set partition is sampled uniformly at random, while in Section 2.2, the number of blocks is fixed and each item chooses its class independently following the same distribution. Section 2.3 analyzes the cases where the experts' answers contain errors or inconsistencies. Our proofs, sketched in Section 3, rely on a broad variety of mathematical tools: probability theory, graph theory and analytic combinatorics. We conclude in Section 4, with some interesting open problems and research topics.

Related works. Note that our model has similarities with sorting algorithms. Instead of an array of elements equipped with an order, we consider a set of items equipped with a set partition structure. In a sorting algorithm, the oracle determines the order of a pair of elements, while in our setting, the oracle tells whether two items belong to the same class. Both in sorting and in AC algorithms, the goal is the design of algorithms using few queries.

The cost of annotating raw data to turn it into training data motivated the exploration of several variants of supervised learning. Transfer learning reduces the quantity of labeled data required by using the knowledge gained while solving a different but related problem. Semi-supervised learning reduces it by learning also from the inherent clustering present in unlabeled data. In active learning, the learner chooses the data needing labeling, and the goal is to maximize the learning potential for a given small number of queries. However, users who want to use supervised learning software for classification as a black box can mitigate the annotating cost only by modifying the labeling process.

Recent papers [9, 18] acknowledge that humans prefer pairwise queries over pointwise queries, as they are better suited for comparisons. Pairwise queries have been considered in semi-supervised clustering [5, 24, 16], where they are called must-link and cannot-link constraints. The pairs of vertices linked by those constraints are random in general, but chosen adaptively in active semi-supervised clustering [4]. In both cases, the existence of a similarity measure between items is assumed, and a general theoretical study of this setting is provided by [19]. This is not the case in [11, 17], where a hierarchical clustering is built by successively choosing pairs of items and measuring their similarity. There, the trade-off between the number of measurements and the accuracy of the hierarchical clustering is investigated.
The difference with the current paper is that the similarity measure takes real values, while we consider Boolean queries, and that their output is a hierarchical clustering, while our clustering (a set partition) is flat.

The problem we consider is motivated by its application to annotation for supervised machine learning. It also belongs to the field of combinatorial search [1, 7]. Related papers [2, 15] consider the problem of reconstructing a set partition using queries on sets of elements, where the answer to such a query is whether there is an edge in the queried set, or the number of distinct blocks of the partition present in the queried set, respectively. Our setting is a more constrained case corresponding to queries of size two. Another similar setting is that of entity resolution [13], where recent developments also assume perfect oracles and transitive properties, using pairwise queries to reconstruct a set partition [23, 25]. In this case, clusters correspond to duplicates of the same entity. Most solutions have a real-valued similarity measure between elements but rely on human verification to improve the results. The entity resolution literature also considers the noisy case for a fixed number of clusters [8, 9, 18, 10].

2 Our results

2.1 Uniform distribution on partitions

We start by considering the setting where the partition of the n-set is chosen uniformly at random among all possible partitions. The average complexity of an AC algorithm is the average number of queries used to recover the random partition. We will define a class of AC algorithms, called chordal algorithms, and prove in Theorem 2 that an algorithm has minimal average complexity if and only if it is chordal. Theorem 3 shows that all chordal algorithms have the same distribution on their number of queries. Finally, this distribution is characterized both exactly and asymptotically in Theorem 4.

Figure 1: Aggregated graphs obtained after various queries and organized into query trees. The positive answers are displayed on top, the negative ones on the bottom. Left, a non-chordal query, leading to an average complexity of 13/5. Right, chordal queries from the same initial situation, leading to an optimal average complexity of 12/5.

It will be useful to use certain graphs marking the progress of the AC algorithm on our n-set. Given a partition P of an n-set and an AC algorithm, we can associate a graph G_t to each step t of the algorithm. Namely, G_0 has n vertices, labeled with the elements of our set, and at time t ≥ 1, the graph G_t is obtained by adding an edge for each negatively-answered query, and successively merging pairs of vertices that received a positive answer. (A vertex u obtained by joining vertices v, w is adjacent to all neighbors of each of v, w, and we label u with the union of their earlier labels.) We call G_t the aggregated graph at time t. Note that after the last step of the algorithm, the aggregated graph is complete and each of its vertices corresponds to one of the blocks of P. Also note that any fixed G_t (with labels on vertices also fixed) may appear as the aggregated graph for more than one partition (possibly at a different time t′). We call the set of all these partitions the realizations of G_t. Those notions are illustrated in Figure 1.

We now need a quick graph-theoretic definition. A cycle C in a graph is induced if all edges induced by V(C) belong to the cycle. A graph is chordal if its induced cycles all have length three. Chordal graphs appear in many applications.
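Before turning to chordal algorithms, here is a minimal Python sketch (not from the paper; the oracle and the query-selection strategy are placeholders of our own) of an AC algorithm that maintains the aggregated graph: a positive answer merges two vertices, a negative answer adds an edge, and the algorithm stops once the aggregated graph is complete.

```python
def active_clustering(items, same_class, pick_query):
    """Generic AC algorithm. `same_class(x, y)` is the oracle (the human expert);
    `pick_query(blocks, diff)` returns a pair of block ids not yet known to differ.
    The aggregated graph has one vertex per current block and one edge per
    negative answer; the algorithm stops when this graph is complete."""
    blocks = {i: {x} for i, x in enumerate(items)}   # vertices (labeled by items)
    diff = {i: set() for i in blocks}                # adjacency: known "different"
    queries = 0
    while sum(len(d) for d in diff.values()) < len(blocks) * (len(blocks) - 1):
        i, j = pick_query(blocks, diff)
        queries += 1
        if same_class(next(iter(blocks[i])), next(iter(blocks[j]))):
            # positive answer: merge vertex j into vertex i (transitivity)
            blocks[i] |= blocks.pop(j)
            for k in diff.pop(j):
                diff[k].discard(j)
                diff[k].add(i)
                diff[i].add(k)
        else:
            # negative answer: add the edge ij to the aggregated graph
            diff[i].add(j)
            diff[j].add(i)
    return list(blocks.values()), queries

def first_unknown_pair(blocks, diff):
    """A naive query-selection strategy, for illustration only."""
    ids = sorted(blocks)
    return next((a, b) for a in ids for b in ids if a < b and b not in diff[a])

# Example: six items whose hidden classes are given by `labels`.
labels = {0: "a", 1: "b", 2: "a", 3: "c", 4: "b", 5: "a"}
parts, cost = active_clustering(list(labels), lambda x, y: labels[x] == labels[y],
                                first_unknown_pair)
print(parts, cost)
```

The strategies studied in this paper (the chordal algorithms, the clique algorithm of Definition 1 below, and the random algorithm of Section 2.2) differ only in how the next pair of blocks is chosen.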
We say that an AC algorithm is chordal if one of the following equivalent conditions is satisfied (see the Appendix for a proof of the equivalence of the conditions): (i) for all input partitions, each aggregated graph is chordal, (ii) for all input partitions, no aggregated graph has an induced C4, (iii) for all input partitions, and for every query u, v made on some aggregated graph G_t, the intersection of the neighborhoods of u and v separates u and v in G_t, where a set S (possibly empty) is said to separate vertices u, v ∈ V(G) \ S if u and v lie in distinct components of G − S.

The queries that keep a graph chordal are exactly those satisfying Condition (iii). Thus, Condition (iii) is used in chordal algorithms to identify the candidate queries. On the other hand, a query that creates no induced C4 might still turn a chordal graph non-chordal, by creating an induced cycle of length greater than four (in which case Condition (ii) will fail on a later query). Two examples of chordal algorithms are presented below.

Definition 1. The clique algorithm is an AC algorithm that gradually grows the partition, starting with the empty partition. Each new item is compared successively to an item of each block of the partition, until it is either added to a block (positive answer to a query) or all blocks have been visited, in which case a new block is created for this item. The universal algorithm finds the block containing the item of largest label by comparing it to all remaining items. The algorithm is then applied to partition the other items, and the previous block is added to the result.

Theorem 2. On partitions of size n chosen uniformly at random, an AC algorithm has minimal average complexity if and only if it is chordal.

The complexity distribution of an AC algorithm on a set S with n ≥ 1 elements is a tuple $(a_0, a_1, a_2, \dots)$ where $a_i$ is the number of partitions of S for which the algorithm has complexity i. Clearly, $a_i = 0$ for all $i < n - 1$, and $a_{\binom{n}{2}} = 1$, for any AC algorithm. Our next result shows that constraining the average complexity to be minimal fixes the complexity distribution.

Theorem 3. On partitions of size n chosen uniformly at random, all chordal AC algorithms have the same complexity distribution.

Note that either of Theorems 2 and 3 implies the weaker statement that all chordal AC algorithms on an n-set have the same average complexity, though the two proofs are very different. Sketches of the proofs of Theorems 2 and 3 can be found in Sections 3.1 and 3.2.

In practice, several human experts work in parallel to annotate the training data. The chordal algorithms can easily be parallelized: when an expert becomes available, give them a query that would keep the aggregated graph chordal if all pending queries received negative answers. This condition ensures the chordality of the parallelized algorithm.

Our third theorem for this setting describes the distribution of the number of queries used by chordal algorithms on partitions chosen uniformly at random. Two formulas for the probability generating function are proposed. The first one is well suited for computer algebra manipulations, while the second one, more elegant, is better suited for asymptotic analysis. It is a q-analog of Dobiński's formula for the Bell numbers, $B_n = e^{-1} \sum_{k \ge 0} k^n / k!$ (see e.g. [14, p. 762]), which count the number of partitions of size n. In order to state the theorem, we introduce the q-analogs of an integer, the factorial, the Pochhammer symbol and the exponential:
$$[n]_q = \sum_{k=0}^{n-1} q^k, \qquad [n]_q! = \prod_{k=1}^{n} [k]_q, \qquad (a; q)_n = \prod_{k=0}^{n-1} (1 - a q^k), \qquad e_q(z) = \sum_{n \ge 0} \frac{z^n}{[n]_q!}.$$
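For instance, $[3]_q = 1 + q + q^2$, $[3]_q! = [1]_q [2]_q [3]_q = (1+q)(1+q+q^2) = 1 + 2q + 2q^2 + q^3$, and $(a; q)_2 = (1-a)(1-aq)$.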
Observe that the q-analogs reduce to their classic counterparts for q = 1. The Lambert function W(x) is defined for any positive x as the unique solution of the equation $w e^w = x$. As x tends to infinity, we have $W(x) = \log(x) - \log(\log(x)) + o(1)$.

Theorem 4. Let $X_n$ denote the complexity of a chordal algorithm on a partition of size n chosen uniformly at random. The distribution of $X_n$ admits an explicit probability generating function in the variable q; in particular, it can be written as a q-analog of Dobiński's formula, a normalized sum over $m \ge 0$ of the terms $[m]_q^n / \big([m]_q! \, q^{nm}\big)$. The normalized variable $(X_n - E_n)/\sigma_n$ converges in distribution to a standard Gaussian law, where
$$E_n = \tfrac{1}{4}\,(2W(n) - 1)\,e^{2W(n)} \qquad\text{and}\qquad \sigma_n^2 = \frac{1}{3}\,\frac{W(n)^2 - 4W(n) + 2}{W(n) + 1}\,e^{3W(n)}.$$

As a corollary, the average complexity of chordal algorithms is asymptotically $n^2/(2\log(n))$. Because of this almost quadratic optimal complexity, using pairwise queries can only be better than direct classification for relatively small data sets. We arrive at a different conclusion for the model presented in the next section, where AC algorithms have linear complexity.

2.2 Random partitions with fixed number of blocks

In many real-world applications, we have information on the number of classes and their respective sizes in the data requiring annotation. This motivates the introduction of the following alternative random model for the set partition investigated by the AC algorithms. Now, the set of partitions of the n-set S is distributed in a not necessarily uniform way, but each element of S has probability $p_i$ of belonging to the partition class $C_i$ (and the probabilities $p_i$ sum up to 1). Our main focus here will be on the most applicable and accessible variant of the above-described model, where the number of partition classes is a fixed number k, and k is small compared to n. Without loss of generality, we can assume that the probabilities $p_1, p_2, \dots, p_k$ in this model satisfy $p_1 \ge p_2 \ge \dots \ge p_k$.

Intuitively, an algorithm with low average complexity should avoid making many comparisons between different classes (one such comparison is always necessary between any pair of classes, but more comparisons are not helpful). For this reason, it seems plausible that an AC algorithm with optimal complexity should compare a new element first with the largest class identified up to this moment, as both the new element and the largest class are most likely to coincide with $C_1$, the most likely class. This is precisely how the clique AC algorithm from Definition 1 operates, where we add the additional refinement that each new item is compared to the blocks discovered so far in decreasing order of their sizes. With this refinement, we conjecture the following.

Conjecture 5. For an n-set with a random partition with probabilities $p_1, p_2, \dots, p_k$, the clique algorithm has minimal average complexity among all AC algorithms.

In support of Conjecture 5, we present the following results. Firstly, we exhibit the limit behaviour of the expected number of queries of the clique algorithm. The simple proof of this theorem can be found in the appendix. It allows one to decide for practical cases whether direct classification by human experts or pairwise queries will be more efficient to annotate the training data.

Theorem 6. Let $p_1 \ge \dots \ge p_k$ be fixed with $\sum_{i=1}^{k} p_i = 1$. Let X denote the number of queries made by the clique algorithm with these parameters on an n-set. Then $\mathbb{E}[X] \sim \big(\sum_{i=1}^{k} i\, p_i\big)\, n$.
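To illustrate Definition 1 and the estimate of Theorem 6, here is a small Python sketch (not taken from the paper; the oracle is simulated with hidden labels and all numerical parameters are illustrative). It runs the clique algorithm with the decreasing-block-size refinement and compares the empirical number of queries to the linear estimate $\big(\sum_{i=1}^{k} i\, p_i\big)\, n$.

```python
import random

def clique_algorithm(items, same_class):
    """Clique AC algorithm (Definition 1), with the refinement of Section 2.2:
    each new item is compared to the existing blocks in decreasing order of size,
    until a positive answer is received or a new block is created."""
    blocks, queries = [], 0
    for x in items:
        blocks.sort(key=len, reverse=True)   # largest block first
        placed = False
        for block in blocks:
            queries += 1
            if same_class(x, next(iter(block))):
                block.add(x)
                placed = True
                break
        if not placed:
            blocks.append({x})               # all answers negative: new block
    return blocks, queries

def average_queries(n, p, trials=100):
    """Monte Carlo estimate of E[X] when each item falls in class i with prob. p[i]."""
    total = 0
    for _ in range(trials):
        labels = random.choices(range(len(p)), weights=p, k=n)
        _, q = clique_algorithm(range(n), lambda x, y: labels[x] == labels[y])
        total += q
    return total / trials

p = [0.5, 0.3, 0.2]                          # p_1 >= p_2 >= p_3, summing to 1
for n in (500, 2000):
    print(n, average_queries(n, p), n * sum((i + 1) * pi for i, pi in enumerate(p)))
```

For p = (0.5, 0.3, 0.2) the estimate is 1.7n; by Theorem 6, the simulated counts should approach this value as n grows.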
We now compare the complexity of the clique algorithm with the complexity of an algorithm that chooses its queries randomly. More precisely, we define the random algorithm to be the one that at each step t compares a pair of elements chosen uniformly at random among all pairs whose relation is not known at this time (that is, they are neither known to belong to the same class, nor known to belong to different classes). The reason for analyzing the random algorithm is that it models the situation where the human experts follow no strategy at all, which might make the procedure cheap to implement. It turns out, however, that the random algorithm comes at a cost: its average complexity is larger than that of our proposed candidate for the optimal algorithm, the clique algorithm.

Theorem 7. Let $p_1 \ge \dots \ge p_k$ be fixed with $\sum_{i=1}^{k} p_i = 1$. Let X be the number of queries of the random algorithm with these parameters on an n-set. Then $\mathbb{E}[X]$ exceeds $n - k$ by a term that is linear in n, with a constant depending on k and on the products $p_i p_j$. (In the proof, the contribution from the remaining steps is estimated by a sum of independent random variables; we do not pursue the details here.)

4 Conclusion

In this article, motivated by the building of training data for supervised learning, we studied active clustering using pairwise comparisons and without any other additional information (such as a similarity measure between items). Two random models for the secret set partition were considered: the uniform model and the bounded number of blocks model. They correspond to different practical annotation situations. The uniform model reflects the case where nothing is known about the set partition. In that case, many clusters will typically be discarded at the end of the clustering process, as they are too small for the supervised learning algorithm to learn anything from them. Thus, this is a worst case scenario. When some information is available on the set partition, such as its number of blocks or their relative sizes, the bounded number of blocks model becomes applicable.

Comparison between direct labeling and pairwise comparisons. As a practical application of this work, we provided tools to decide whether direct labeling or pairwise comparisons will be more efficient for a given annotation task. One should first decide whether the uniform model or the bounded number of blocks model best represents the data. In both cases, our theorems provide estimates for the number of pairwise queries required. Then the time taken by an expert to answer a direct labeling query or a pairwise comparison should be measured. Finally, combining those estimates, the time required to annotate the data using direct labeling or pairwise comparison can be compared (a small numerical sketch is given below). We also provided tools to detect and correct errors in the experts' answers.

Similarity measures. Generally, a similarity measure on the data is available and improves the quality of the queries we can propose to the human experts. This similarity measure can be a heuristic that depends only on the format of the data. For example, if we are classifying technical documents, a text distance can be used. The similarity could also be trained on a small set of data already labeled. This setting has been analyzed by [19]. Our motivation for annotating data is to train a supervised learning algorithm. However, one could use active learning to merge and replace the annotation and learning steps. The problem is then to find the queries that will improve the learned classifier the most. In this article, we focused on the case where such similarity measures are not available. However, we are confident that the mathematical tools developed here will be useful to analyze more elaborate settings as well. In particular, the aggregated graph contains exactly the information on the transitive closure of the answers to the pairwise queries, so its structure should prove relevant whenever pairwise comparisons are considered.
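As a rough illustration of the comparison described above, the following sketch (with hypothetical timings and class probabilities of our own choosing) estimates the total annotation time of the two strategies in the bounded number of blocks model, using the linear estimate of Theorem 6 for the pairwise cost.

```python
def annotation_times(n, p, t_label, t_pair):
    """Estimated total time of direct labeling vs. pairwise comparisons.
    t_label: measured time for one direct labeling query.
    t_pair:  measured time for one pairwise comparison.
    The number of pairwise queries is estimated via Theorem 6 (clique algorithm)."""
    p = sorted(p, reverse=True)                      # p_1 >= ... >= p_k
    expected_pairwise_queries = n * sum((i + 1) * pi for i, pi in enumerate(p))
    return n * t_label, expected_pairwise_queries * t_pair

# Hypothetical figures: 10,000 items, 30 s per direct label, 5 s per comparison.
direct, pairwise = annotation_times(10_000, [0.5, 0.3, 0.2], t_label=30, t_pair=5)
print(direct, pairwise)   # 300000 vs 85000 seconds in this illustrative scenario
```

In the uniform model, the quadratic estimate $n^2/(2\log n)$ from Theorem 4 should be used instead, so direct labeling becomes preferable for large data sets.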
Open problems. We leave as open problems the proof that the clique algorithm reaches the minimal average complexity in the bounded number of blocks model, and the complexity analysis of the random algorithm in the uniform model.

Acknowledgments and Disclosure of Funding

We thank Maria Laura Maag (Nokia) for introducing us to the problem of clustering using pairwise comparisons, and Louis Cohen, who sadly could not continue working with us on this topic. The authors of this paper met through and were supported by the RandNET project (RISE project H2020-EU.1.3.3). Alex Scott was supported by EPSRC grant EP/V007327/1. Quentin Lutz and Élie de Panafieu produced part of this work at LINCS (www.lincs.fr). Quentin Lutz is supported by Nokia Bell Labs (CIFRE convention 2018/1648). Maya Stein was supported by ANID Regular Grant 1180830, and by ANID PIA ACE210010.

References

[1] Martin Aigner. Combinatorial search. John Wiley & Sons, Inc., 1988.
[2] Noga Alon and Vera Asodi. "Learning a Hidden Subgraph". In: SIAM J. Discret. Math. 18.4 (2005), pp. 697–712.
[3] Noga Alon and Joel H. Spencer. The probabilistic method. Fourth edition. Wiley Series in Discrete Mathematics and Optimization. John Wiley & Sons, Inc., Hoboken, NJ, 2016, pp. xiv+375. ISBN: 978-1-119-06195-3.
[4] Sugato Basu, Arindam Banerjee, and Raymond J. Mooney. "Active semi-supervision for pairwise constrained clustering". In: Proceedings of the 2004 SIAM International Conference on Data Mining. SIAM, 2004, pp. 333–344.
[5] Sugato Basu, Mikhail Bilenko, and Raymond J. Mooney. "A probabilistic framework for semi-supervised clustering". In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2004, pp. 59–68.
[6] Douglas Bauer et al. "Combinatorial optimization problems in the analysis and design of probabilistic networks". In: Networks 15.2 (1985), pp. 257–271.
[7] Mathilde Bouvel, Vladimir Grebinski, and Gregory Kucherov. "Combinatorial search on graphs motivated by bioinformatics applications: A brief survey". In: International Workshop on Graph-Theoretic Concepts in Computer Science. Springer, 2005, pp. 16–27.
[8] Nicolò Cesa-Bianchi et al. A correlation clustering approach to link classification in signed networks. JMLR Workshop and Conference Proceedings, extended version on arXiv:1301.4769, 2012.
[9] I Eli Chien, Huozhi Zhou, and Pan Li. "HS2: Active learning over hypergraphs with pointwise and pairwise queries". In: The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2019, pp. 2466–2475.
[10] Susan Davidson et al. "Top-k and clustering with noisy comparisons". In: ACM Transactions on Database Systems (TODS) 39.4 (2014), pp. 1–39.
[11] Brian Eriksson et al. "Active clustering: Robust and efficient hierarchical clustering using adaptively selected similarities". In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 2011, pp. 260–268.
[12] Thomas Ernst. A comprehensive treatment of q-calculus. Springer Science & Business Media, 2012.
[13] Ivan P. Fellegi and Alan B. Sunter. "A theory for record linkage".
In: Journal of the American Statistical Association 64.328 (1969), pp. 1183–1210.
[14] Philippe Flajolet and Robert Sedgewick. Analytic Combinatorics. Cambridge University Press, 2009.
[15] Vladimir Grebinski and Gregory Kucherov. "Reconstructing set partitions". In: Tenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '99). ACM/SIAM, 1999, p. 2.
[16] Daniel Gribel, Michel Gendreau, and Thibaut Vidal. "Semi-Supervised Clustering with Inaccurate Pairwise Annotations". In: arXiv preprint arXiv:2104.02146 (2021).
[17] Akshay Krishnamurthy et al. "Efficient active algorithms for hierarchical clustering". In: arXiv preprint arXiv:1206.4672 (2012).
[18] Arya Mazumdar and Barna Saha. Clustering with Noisy Queries. 2017. arXiv: 1706.07510 [stat.ML].
[19] Arya Mazumdar and Barna Saha. "Query complexity of clustering with side information". In: Advances in Neural Information Processing Systems. 2017, pp. 4685–4696.
[20] Daniel S. Moak. "The q-analogue of Stirling's formula". In: The Rocky Mountain Journal of Mathematics (1984), pp. 403–413.
[21] Curtis G. Northcutt, Anish Athalye, and Jonas Mueller. "Pervasive label errors in test sets destabilize machine learning benchmarks". In: arXiv preprint arXiv:2103.14749 (2021).
[22] The Sage Developers. SageMath, the Sage Mathematics Software System (Version 9.0). https://www.sagemath.org. 2020.
[23] Norases Vesdapunt, Kedar Bellare, and Nilesh Dalvi. "Crowdsourcing algorithms for entity resolution". In: Proceedings of the VLDB Endowment 7.12 (2014), pp. 1071–1082.
[24] Kiri Wagstaff et al. "Constrained k-means clustering with background knowledge". In: ICML. Vol. 1. 2001, pp. 577–584.
[25] Jiannan Wang et al. "Leveraging Transitive Relations for Crowdsourced Joins". In: SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. arXiv preprint arXiv:1408.6916. 2013, pp. 229–240.