Published as a conference paper at ICLR 2023

STATISTICAL GUARANTEES FOR CONSENSUS CLUSTERING

Zhixin Zhou1, Gautam Dudeja2, Arash A. Amini2
1City University of Hong Kong, 2University of California, Los Angeles
zhixin0825@gmail.com, gdudeja@g.ucla.edu, aaamini@ucla.edu

Consider the problem of clustering n objects. One can apply multiple algorithms to produce N potentially different clusterings of the same objects, that is, partitions of the n objects into K groups. Even a single randomized algorithm can output different clusterings. This often happens when one samples from the posterior of a Bayesian model, or runs multiple MCMC chains from random initializations. A natural task is then to form a consensus among these different clusterings. The challenge in an unsupervised setting is that the optimal matching between clusters of different inputs is unknown. We model this problem as finding a barycenter (also known as Fréchet mean) relative to the misclassification rate. We show that by lifting the problem to the space of association matrices, one can derive aggregation algorithms that circumvent the knowledge of the optimal matchings. We analyze the statistical performance of aggregation algorithms under a stochastic label perturbation model, and show that a K-means type algorithm followed by a local refinement step can achieve near optimal performance, with a rate that decays exponentially fast in N. Numerical experiments show the effectiveness of the proposed methods.

1 INTRODUCTION

Clustering is a fundamental task in machine learning and data analysis. Given data on each of the n objects in a set, there are numerous algorithms to produce a clustering of these n objects, which is formally a partitioning of {1, . . . , n} into K disjoint sets. A natural problem that arises in practice is how to form a consensus among these clusterings. This is especially important if the different clusterings are produced by a single randomized algorithm. This situation often arises in Bayesian modeling, where the posterior naturally encodes the variability of the clustering problem. Finding a consensus clustering then corresponds to finding the center of the posterior, from which we can also obtain estimates of the variability of the posterior.

A clustering of n objects can be viewed as a label vector in $[K]^n$ where $[K] = \{1, \dots, K\}$. We assume that we are given N label vectors $z_j \in [K_j]^n$ for $j = 1, \dots, N$, with a potentially different number of clusters in each. Let $K = \max_j K_j$ and note that we can view all $z_j$ as vectors in $[K]^n$. The task is to obtain a consensus K-clustering, that is, a label vector $z \in [K]^n$ which is close to all of $z_1, \dots, z_N$ at the same time. We also refer to this task as the label aggregation problem. In the context of clustering, there is no meaning to the label of each cluster, that is, the label aggregation problem is unsupervised, in the sense that there is no natural correspondence between labels of different clusterings. This is in contrast to label aggregation in classification, in which the labels have a common meaning among the different input classifications. We refer to the latter task as supervised label aggregation. In the unsupervised setting, forming a consensus label is a nontrivial task due to the label-switching problem. Consider, for example, the case n = 5 and the two label vectors $z_1 = (1, 1, 1, 2, 2)$ and $z_2 = (2, 2, 2, 1, 1)$.
These two vectors differ in all 5 positions, but they define the same clustering of the objects. In this case, the consensus label $\widehat{z}$ can be taken to be either $z_1$ or $z_2$. More generally, for every $z_j$, there could be a permutation $\pi_j$ on $[K]$ such that the permuted vectors $\pi_j \circ z_j := (\pi_j(z_{ji}))_{i=1}^n$ are closer to each other than the original $z_j$'s.

To formalize the above idea, we recall the definition of the misclassification rate between two label vectors $z, y \in [K]^n$:
$$\mathrm{Mis}(z, y) = \min_{\pi} \frac{1}{n} \sum_{i=1}^n 1\{z_i \neq \pi(y_i)\} \qquad (1)$$
where the minimum is taken over all permutations $\pi : [K] \to [K]$. $\mathrm{Mis}(\cdot,\cdot)$ is a proper metric on the space of K-clusterings of n objects. It is also a metric on $[K]^n$ if we identify vectors that are obtained from each other by label-switching. We can now define the consensus label as the barycenter of $z_1, \dots, z_N$ in the $\mathrm{Mis}(\cdot,\cdot)$ metric, that is,
$$\widehat{z} \in \operatorname*{argmin}_{z \in [K]^n} \ \sum_{j=1}^N w_j \,\mathrm{Mis}(z, z_j) \qquad (2)$$
where $w_j \ge 0$ are a given set of weights. We often assume uniform weights: $w_j = 1$ for all $j$. The barycenter $\widehat{z}$ is also known as the Fréchet mean. Solving (2) is complicated by the presence of the permutation in the definition of the Mis function. More explicitly, we need to solve
$$\widehat{z} \in \operatorname*{argmin}_{z \in [K]^n} \ \min_{\pi_1, \dots, \pi_N} \ \sum_{j=1}^N \sum_{i=1}^n w_j\, 1\{z_i \neq \pi_j(z_{ji})\} \qquad (3)$$
showing that, in addition to $z$, we have to optimize over the N permutations $\pi_j$, $j = 1, \dots, N$. In this paper, we provide alternative solutions that avoid optimizing over these permutations.

Our contributions. The unsupervised version of the label aggregation problem is the realistic and practical one when dealing with aggregating labels from Bayesian clustering algorithms, since the posterior has K! modes corresponding to all possible label permutations, and the output will be near an arbitrary mode in each run of the algorithm. The main contributions of this paper to unsupervised aggregation are the following:

1. We show that by lifting the barycenter problem to the space of association matrices, one can derive algorithms that avoid optimizing over the unknown permutations (Section 2.1). In particular, we propose both a basic and a spectral K-means type aggregation algorithm.
2. We propose a random perturbation model (RPM) under which we can study the theoretical performance of both supervised and unsupervised aggregation algorithms. We prove the statistical consistency of the basic aggregation algorithm under RPM (Section 2.2).
3. Under RPM, the supervised setting corresponds to an oracle that knows the true matching permutations. By studying this oracle, we derive the optimal statistical misclassification rate for supervised aggregation (Section 3.1).
4. We propose an efficient local refinement step on the output of any consistent aggregation algorithm in the unsupervised setting, and show that the updated labels achieve nearly the same misclassification rate as the above oracle (Section 3.2).

Our theoretical analysis illustrates how different parameters affect the difficulty of the label aggregation problem. In Section 4, we provide numerical experiments comparing the performance of the proposed algorithms against each other and existing methods.

Related work. In the supervised setting, the problem of label aggregation is to combine multiple annotated datasets. The label inferred for each item from those produced by multiple annotators acts as the ground truth for the classification task.
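Returning to the unsupervised setting, the minimization over permutations in (1) can be computed exactly by solving a linear assignment problem on the K x K confusion matrix, rather than by enumerating all K! permutations. The following is a minimal NumPy/SciPy sketch of this computation; it is our own illustration rather than code from the paper, and the function name `mis_rate` and the 0-based label convention are our choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mis_rate(z, y, K):
    """Misclassification rate Mis(z, y) of equation (1): the fraction of
    disagreements, minimized over all permutations of the labels of y."""
    z, y = np.asarray(z), np.asarray(y)
    n = len(z)
    # Confusion matrix: C[k, l] = #{i : z_i = k, y_i = l}, labels in 0..K-1.
    C = np.zeros((K, K), dtype=int)
    np.add.at(C, (z, y), 1)
    # The best permutation maximizes the total agreement sum_k C[k, pi(k)],
    # which is a linear assignment problem (Hungarian algorithm).
    row, col = linear_sum_assignment(-C)
    agreements = C[row, col].sum()
    return 1.0 - agreements / n

# The label-switching example from the introduction (0-based labels):
z1 = [0, 0, 0, 1, 1]
z2 = [1, 1, 1, 0, 0]
print(mis_rate(z1, z2, K=2))  # 0.0: the two vectors encode the same clustering
```

The barycenter objective in (2) is then simply a weighted sum of such terms over the inputs $z_1, \dots, z_N$.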
Various probabilistic models have been proposed for aggregating annotations, with parameters to account for the expertise of the annotators and the noise in the labeling process (47; 37). The unsupervised setting is more challenging as there is no meaning to the cluster labels (the label-switching issue) and the clusterings can have potentially different number of clusters. The idea of passing to association matrices to get around the label-switching issue, has been leveraged in several existing approaches (13; 24; 43; 13; 21; 29), although the connection we make to the lifted barycenter problem and the resulting spectral methods is new to the best of our knowledge. In (24; 43), the authors employ an Expectation-Maximization strategy to obtain a nonnegative matrix factorization of the combined association matrix. The authors of (41) provide Published as a conference paper at ICLR 2023 several approaches, using the hypergraph representation of the clusterings, that have shown promising results in the context of image segmentation (22). A set of fuzzy consensus algorithms is proposed in (40) that generate soft consensus partitions by combining a collection of fuzzy clusterings. What we referred to as unsupervised label aggregation problem has appeared under many different names in the literature, including but not limited to, cluster ensembles (41), clustering/cluster ensemble problem (44; 10), ensemble clustering (1), clustering aggregation (17), combining clusterings (42; 14; 27), consensus clustering (48; 18) and the median partition problem (12; 46; 18). As can be surmised from the variety of names, there is a copious literature on the subject, spanning over multiple fields, with many ideas rediscovered time after time. We refer to the excellent surveys (48; 44; 18) for a more exhaustive list of references and historical discussions. There is also a parallel line of work in the Bayesian clustering literature on aggregating the posterior clusterings (31; 7; 9; 25; 15; 45; 8). The barycentric view to aggregation that we take in (2) has appeared in many previous work, but often with a different distance in place of Mis, including but not limited to the symmetric difference distance (SDD), a.k.a. the Mirkin metric (up to a constant), in the median partition problem (39; 23; 12; 17; 18), the Binder loss and variational information (VI) in (45; 15), the normalized mutual information (NMI) in the pioneering work of (41), the adjusted Rand index (15) and the category utility function (42; 35). After introducing our methods, in Section 2.3, we give a more detailed comparison with the literature. We choose Mis as the distance in the present work for a better comparison with the oracle problem in Section 3.1. We also show in Appendix B.1 that consistency in Mis implies the consistency in other distances. Despite the voluminous literature on the subject, statistical analysis of the methods under a statistical model for the input clusterings has not been undertaken before. This is the gap that we fill in this paper, by providing the first consistency and optimality results under a statistical model (the RPM) for a method of clustering aggregation that we propose. To the best of our knowledge, the question of consistency, let alone optimality, has not been considered for any method of aggregation before. 
We also shed more light on the relation between the barycentric approach and those based on association matrices (Section 2.3), and on how a convex relaxation leading to a spectral method can be used to approximate the median partition. To illustrate the importance of statistical analysis, we also show that a simple, common approach to the median partition problem, known as Best Of K (18), is in general inconsistent under RPM, despite being shown to be a 2-factor approximation of the median partition problem (12). This further highlights the kind of insight that statistical analysis under a model can provide, which is not available from CS-type theory on approximation algorithms.

2 LIFTED AGGREGATION ALGORITHMS

We start by introducing some notation. Let $\mathcal{E}_K = \{e_k\}_{k=1}^K$ be the set of standard basis vectors of $\mathbb{R}^K$. The elements of $\mathcal{E}_K$ can be considered one-hot encodings of the labels from $[K]$. From now on, instead of encoding labels as elements of $[K]$, we encode them as elements of $\mathcal{E}_K$. We can then view labelings of n objects as elements of the following set
$$\mathcal{E}_K^n = \{Z = (z_1, \dots, z_n) : z_i \in \mathcal{E}_K \ \ \forall i \in [n]\}. \qquad (4)$$
Each $z_i$ is viewed as a $K \times 1$ vector and each element of $\mathcal{E}_K^n$ as a $K \times n$ matrix, which we refer to as a label matrix. For $Z \in \mathcal{E}_K^n$, permuting the cluster labels is equivalent to pre-multiplication by a $K \times K$ permutation matrix $P$, that is, $PZ$. The label aggregation problem can be restated as follows: given label matrices $Z_1, \dots, Z_N \in \mathcal{E}_K^n$, find a consensus label matrix $Z \in \mathcal{E}_K^n$ by solving the barycenter problem
$$\widehat{Z} \in \operatorname*{argmin}_{Z \in \mathcal{E}_K^n} \ \min_{P_1, \dots, P_N} \ \sum_{j=1}^N w_j \, \|Z - P_j Z_j\|_F^2 \qquad (5)$$
where $P_1, \dots, P_N$ are $K \times K$ permutation matrices and $\|X\|_F := \big(\sum_{i,j} X_{ij}^2\big)^{1/2}$ is the Frobenius norm of a matrix $X$. One can verify that problem (5) is equivalent to (3). The following result shows that if we know the optimal permutations $P_j$, we can easily find the barycenter $\widehat{Z}$:

Proposition 1. Let $\{\widehat{P}_1, \dots, \widehat{P}_N\}$ be an optimal set of permutation matrices in (5). Then, the optimal solution $\widehat{Z}$ of (5) is the columnwise argmax of $\sum_j w_j \widehat{P}_j Z_j$.

Algorithm 1 Basic label aggregation algorithm.
1: Form the association matrices $X_j = Z_j^T Z_j$.
2: Form the average association matrix $\bar{X} = \sum_{j=1}^N w_j X_j$.
3: Perform K-means on the rows of $\bar{X}$.

Algorithm 2 Spectral label aggregation algorithm.
1: Define the same average association matrix $\bar{X}$ as in Algorithm 1.
2: Perform the K-truncated eigendecomposition $\bar{X} \approx U \Lambda U^T$, where $\Lambda \in \mathbb{R}^{K \times K}$ contains the top-K eigenvalues on the diagonal and the columns of $U \in \mathbb{R}^{n \times K}$ are the corresponding eigenvectors.
3: Perform K-means on the rows of $U$.

2.1 LIFTING TO ASSOCIATION MATRICES

The difficulty in the unsupervised setting is that the optimal permutations $\{\widehat{P}_j\}$ are unknown. To get around this issue, we lift the barycenter problem to the space of association matrices. For a label matrix $Z \in \mathcal{E}_K^n$, we define the corresponding association matrix as $X = Z^T Z \in \{0, 1\}^{n \times n}$. Note that $X_{ij} = 1$ iff $i$ and $j$ are in the same cluster according to $Z$, and $X_{ij} = 0$ otherwise. The advantage of $X$ is that it is invariant to label switching: $X = Z^T Z = (PZ)^T PZ$ for any permutation matrix $P$. This suggests solving the following lifted barycenter problem instead of (3):
$$\widehat{X} \in \operatorname*{argmin}_{X \in \mathcal{X}_K} \ \sum_{j=1}^N w_j \, \|X - X_j\|_F^2 \qquad (6)$$
where $X_j = Z_j^T Z_j$ and $\mathcal{X}_K = \{Z^T Z : Z \in \mathcal{E}_K^n\}$, that is, the set of (binary) association matrices with at most K clusters.

Semidefinite relaxation. Problem (6) is still hard to solve due to the combinatorial nature of $\mathcal{X}_K$.
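Before turning to this relaxation, here is a minimal sketch of Algorithms 1 and 2 in NumPy/scikit-learn. It is our own illustration rather than the authors' implementation; the helper names are hypothetical, and `sklearn.cluster.KMeans` stands in for any constant-factor approximate K-means solver, which suffices for the K-means step. Each $Z_j$ is a $K \times n$ one-hot label matrix as in (4).

```python
import numpy as np
from sklearn.cluster import KMeans

def average_association(Zs, weights=None):
    """Average association matrix of a list of K x n one-hot label matrices.
    Normalizing by the total weight does not change the K-means partition."""
    w = np.ones(len(Zs)) if weights is None else np.asarray(weights, dtype=float)
    return sum(wj * (Zj.T @ Zj) for wj, Zj in zip(w, Zs)) / w.sum()

def basic_aggregation(Zs, K, weights=None):
    """Algorithm 1 (sketch): K-means on the rows of the average association matrix."""
    Xbar = average_association(Zs, weights)
    return KMeans(n_clusters=K, n_init=10).fit_predict(Xbar)

def spectral_aggregation(Zs, K, weights=None):
    """Algorithm 2 (sketch): K-means on the rows of the top-K eigenvectors of Xbar."""
    Xbar = average_association(Zs, weights)
    vals, vecs = np.linalg.eigh(Xbar)   # symmetric matrix, eigenvalues ascending
    U = vecs[:, -K:]                    # n x K matrix of top-K eigenvectors
    return KMeans(n_clusters=K, n_init=10).fit_predict(U)
```

The only difference between the two procedures is the spectral truncation to the top-K eigenvectors, which pulls the relaxed solution closer to the rank-K set $\mathcal{X}_K$.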
An approach to solving problems over XK is to relax to a semidefinite program, an idea that has been applied before to community detection in networks (4). In particular, XK is inside the doubly nonnegative cone {X : X 0, X 0}, where X 0 and X 0 mean X is positive semidefinite and elementwise nonnegative. We note that Xii = 1 for all i. This suggests relaxing to the following problem b X argmin X j=1 wj X Xj 2 F : X 0, X 0, Xii = 1 i o . (7) Problem (7) has a simple solution. The solution of the unconstrained version of (7) over Rn n is X := Pn j=1 wj Xj/ Pn j=1 wj. Since {Xj} belong to the constraint set of (7) and this set is convex, X too belongs to the constraint set. Hence, X is the solution of (7), that is, b X = X . It remains to translate b X back to labels, for which we can preform rowwise K-means, leading to Algorithm 1. Since elements of XK are of rank at most K, to get a solution which is closer to that of the lifted barycenter problem (6), we can perform a spectral truncation of b X to its K top eigenvectors, before applying the rowwise K-means. This leads to the spectral aggregation Algorithm 1. Other variants of spectral clustering on b X are also possible, e.g., using the normalized Laplacian, etc. In the K-means step of Algorithm 1 and 2, any constant-factor approximation to the K-means problem can be used. 2.2 CONSISTENCY In order to study the statistical performance of different aggregation algorithms, we propose a random perturbation model (RPM), where both the clusters and the labels of the clusters can undergo random perturbations, allowing us to study the difficulty of the unsupervised aggregation problem. Let Z En K be the true label matrix with columns z i EK, i = 1, . . . , n. Definition 1 (RPM). We write Z L(Z , p) if Z = (z1, . . . , zn) En K with columns zi drawn independently as follows: zi = Pz i, z i (1 p)δz i + p Unif(EK) (8) where δz i is the point mass at z i and P is an independent K K permutation matrix. Published as a conference paper at ICLR 2023 Under RPM, the observed label matrices Z1, . . . , ZN are i.i.d., so it is reasonable to let w1 = = w N = 1 in Algorithms 1 and 2. We will discuss the algorithm for weighted samples in Section 5. Let nk(Z ) be the number of objects in cluster k according to Z . We make the following assumption: nk(Z ) βn/K, k [K] (9) for some β [1, K]. Here, β measures how much the true clustering deviates from being balanced. For β = 1, we have nk(Z ) = n/K for all k, while β = K corresponds to no restriction on the sizes of the true clusters. The following result shows that the basic aggregation algorithm is statistically consistent under the RPM: Theorem 1 (Consistency). Let Z En K be a label matrix satisfying (9). Assume that Z1, . . . , ZN are i.i.d. draws from L(Z , p) and let ξ := p(2 p). Let Mis be the misclassification rate between the true label matrix Z , and the output of Algorithm 1 with wj = 1 for all j [N]. Then, there exists a universal constant C > 0, such that E[Mis] Cξβ2K Consistency of Algorithm 1 follows from (10) and the Markov inequality: For any δ > 0, we have P(Mis δ) δ 1E[Mis] 0 as n, N and p is bounded away from 1. We note that in this and subsequent results all the parameters, such as K and p, are allowed to change as n, N , subject to the conditions of the theorems. The first term inside the parentheses in (10) is the dominant one. Assume for simplicity that β 1. If the model has low noise, then p is small and so is ξ since ξ p. 
Then, the dominant term is $O(pK/N)$; that is, a smaller number of clusters, K, and a larger sample size, N, improve the performance. The second term is independent of the sample size, but vanishes at the rate $O(p^2 K^2 / n)$ in the low-noise setting, as the number of objects, n, grows.

2.3 LITERATURE COMPARISON

The relation between the metrics on clusterings is discussed in (34; 33). Let $d_M(Z, Z_j)$ be the Mirkin metric between clusterings $Z$ and $Z_j$ (34, Eqn (6)). As we show in Appendix B, it turns out that
$$d_M(Z, Z_j) = \|X - X_j\|_F^2 = \|X - X_j\|_{\ell_1}$$
leading to
$$d_M / 2 = \mathrm{SDD} = \mathrm{Binder} = \binom{n}{2}(1 - \mathrm{Rand})$$
where Binder denotes the Binder loss (5) and Rand the Rand index (36). It follows that the lifted barycenter (6) that we derived is equivalent to the median partition (12), the Binder loss barycenter of (45), as well as the Rand barycenter (15). This problem is often solved by greedy search starting from a random initialization (45; 8). Our Algorithm 2 then provides a fast, scalable spectral method for obtaining an approximate solution to this ubiquitous problem.

A lot of the algorithms proposed for consensus clustering operate on the average association matrix $\bar{X}$, by treating it as a similarity matrix and performing usual clustering on it, for example, agglomerative clustering (31; 18; 14; 28). Wade and Ghahramani (45) criticize these approaches as being ad hoc compared to the decision-theoretic approach of finding a barycenter. However, we have shown that $\bar{X}$ is indeed a solution to the relaxed version (7) of the barycenter problem (6), which is the same as the Binder barycenter in (45); hence clustering $\bar{X}$ is effectively solving the same problem by a different method.

A k-means based aggregation algorithm, called KCC, has been proposed in (48). In our notation, this is equivalent to concatenating the $Z_j$'s row-wise to form an $NK \times n$ matrix and running k-means on its columns. This is different from our Algorithm 1, which operates on $\bar{X}$. We compare with KCC in Section 4. Unlike KCC, Algorithm 1 comes with a consistency guarantee under our model assumptions. Best Of K (12) essentially solves (6) by restricting the feasible region to $\{X_1, \dots, X_N\}$, hence picking the lowest-scoring input clustering. The approach proposed in (7; 9) (see also (15)) is to find the input clustering that minimizes the cost $X \mapsto \|X - \bar{X}\|_F$. It is not hard to see that the barycenter cost (6) equals $\|X - \bar{X}\|_F^2$ up to additive and multiplicative constants (essentially a bias-variance decomposition; see (26)). Hence, the approach of (7; 9) is equivalent to Best Of K. As we show in Appendix A, Best Of K is, in general, inconsistent under RPM unless N grows exponentially in n, a very strong condition not needed by our algorithms.

The K-means step 3 in Algorithm 1 can be replaced by other clustering algorithms, e.g., average-linkage clustering, Best Of K, and the CC pivot algorithm of (18; 12).

Figure 1: Left: Chernoff divergence I in equation (11) as a function of p, the label perturbation probability. Right: how fast $-(\log P_N)/N$ converges to the Chernoff divergence for K = 2.

Besides global clustering algorithms, many authors, including (41; 17; 18; 45; 8), also propose local search, a.k.a. greedy, algorithms which
update one label at a time to minimize the barycenter loss, based on any number of metrics discussed earlier, including the information-theoretic ones (NMI and VI (32)). In (26), after reducing (6) to minimizing X 7 X X 2 F , they write X = ZT Z (in our notation) and then relax Z to a general nonnegative matrix and solve the problem with nonnegative matrix factorization. Their work has the flavor of our Algorithm 2, although our approach, being based on regular spectral decomposition, is highly robust and scalable. The RPM, with P set equal to the identity, is closely related to the artificial data model considered in (18), with the difference that RPM does not potentially create a new cluster after perturbation, and introduces a random permutation of the clusters labels after perturbation (via P). In Appendix E, we argue that RPM is a good model of a concentrated posterior, hence consistency under RPM is relevant to Bayesian aggregation problems for which posterior consistency has been shown. 3 OPTIMAL RATE Theorem 1 guarantees an O(N 1) rate of misclassification for Algorithm 1. A natural question is whether we can do better. To answer this question, we first consider what is the best an oracle, with the knowledge of the random permutations in (8), can do. This oracle is effectively solving the supervised version of the problem. We then show that a refinement step allows us to achieve nearly the same as the optimal oracle rate, without knowing the matching permutations. 3.1 SUPERVISED ORACLE Let Z be a label matrix with the ith column z i (1 p)δz i + p Unif(EK), and let Z1, . . . , ZN be independent copies of Z . We would like to recover the true label matrix Z . This is the oracle version of model (8), since Zjs are label matrices without random permutations. In this case, the problem decouples to n independent label recovery problems. We further simplify the problem to that of deciding between z 1 = e1 and z 1 = e2. This problem is equivalent the hypothesis testing: H0 : Multinomial(N, (1 p, q, . . . , q)) versus H1 : Multinomial(N, (q, 1 p, . . . , q)), where q := p/K and p := (K 1)q := p q. A classical result from information theory (6, Theorem 11.9.1) allows us to determine the optimal performance in this case: Proposition 2. The Bayesian error probability, PN, for testing H0 against H1, with positive prior probabilities, is bounded by e NI, where I := log 2 p (1 p)q + (K 2)q (11) is the best achievable error exponent in the sense that 1 N log PN I as N . The left panel in Figure 1 shows plots of I as a function of p, for various K, and the right panel shows the convergence of the exponent of PN for K = 2. The Bayesian error probability PN can be achieved by performing a likelihood ratio test between the two hypotheses. We can generalize this result to testing between K hypotheses, in which case the Bayesian error probability is bounded by Published as a conference paper at ICLR 2023 Algorithm 3 Local Refinement 1: Input: Average association matrix X, initial label matrix e Z. 2: Output: An updated label matrix b Z. 3: for i = 1 to N do 4: Let Xi be the ith column of X = P j wj Xj. 5: Replace the ith column of e Z by zeros and denote this matrix by e Z i. 6: Let (n1, . . . , n K) be the row sums of e Z i. 7: Let (b1, . . . , b K) = e Z i Xi. 8: Update the ith label by arg maxk (bk/nk). 9: end for (K 1)e NI. The oracle algorithm that achieves this bound is the one that finds the columnwise argmax of the average of Zjs. 
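To make this oracle concrete, the snippet below simulates the supervised version of the RPM (that is, Definition 1 with the permutation P fixed to the identity) and recovers each object's label by the columnwise argmax of the averaged one-hot label matrices, i.e., a per-object majority vote, which implements the likelihood-ratio decision in this symmetric setting. This is an illustrative sketch with our own helper names (`sample_rpm`, `oracle_aggregate`), not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(labels, K):
    """K x n one-hot (label matrix) encoding of a label vector in {0,...,K-1}^n."""
    Z = np.zeros((K, len(labels)))
    Z[labels, np.arange(len(labels))] = 1.0
    return Z

def sample_rpm(z_star, K, p, permute=True):
    """One draw from the RPM of Definition 1: each label is kept w.p. 1-p and
    resampled uniformly from [K] w.p. p; optionally all cluster labels are then
    relabeled by a single random permutation."""
    z = z_star.copy()
    noisy = rng.random(len(z)) < p
    z[noisy] = rng.integers(0, K, size=noisy.sum())
    if permute:
        perm = rng.permutation(K)
        z = perm[z]
    return one_hot(z, K)

def oracle_aggregate(Zs):
    """Supervised oracle: columnwise argmax of the average of the label matrices
    (valid when the inputs share a common labeling, i.e. permute=False)."""
    return np.mean(Zs, axis=0).argmax(axis=0)

# Toy run: n = 300 objects, K = 3 balanced clusters, N = 30 noisy copies.
n, K, p, N = 300, 3, 0.6, 30
z_star = np.repeat(np.arange(K), n // K)
Zs = [sample_rpm(z_star, K, p, permute=False) for _ in range(N)]
print((oracle_aggregate(Zs) != z_star).mean())  # empirical misclassification rate
```

With `permute=True`, the same sampler produces the unsupervised inputs on which Algorithms 1 and 2 (and the refinement below) operate.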
In light of Proposition 2, the bound in Theorem 1 is far from optimal since it guarantees a linear decay of the error in N, that is O(N 1), rather than the exponential decay e NI. The question is whether this gap can be filled by a non-oracle algorithm. 3.2 LOCAL REFINEMENT To approach the oracle rate, we propose a fast local refinement on the label of each object, as outlined in Algorithm 3. This refinement can be performed on the output of any reasonable aggregation algorithm. The idea of performing a local refinement to boost the performance of clustering algorithms has been employed in various settings, including clustering of sub-Gaussian mixtures (30) and community detection in stochastic block models (SBM) (3; 16; 50; 51). A more detailed comparison with the SBM appears in Appendix F. Algorithm 3 requires a good initial label matrix e Z, with a small number of mismatches relative to the true matrix Z . The algorithm focuses on the local information of the ith object. With X the average of the association matrices, Xij is the sample proportion of objects i and j appearing in the same cluster. Viewing X as the adjacency matrix of a weighted graph, one can verify that bk = P j =i wj Xji1{ e Zj = k} is the weighted number of connections between object i and objects in cluster k. The algorithm then normalizes the number of connections by the cluster size nk. Thus, bk/nk estimates the probability that object i is connected to another object in cluster k. The higher this probability is, the higher the chance that object i belongs to cluster k. The last step in the for loop, updates the ith label according to these statistics. It is possible to repeat the local refinement procedure, by feeding its output b Z back as an initial label matrix. Given a good initialization, local refinements usually converge in constant or O(log n) number of steps (30). 3.3 ACHIEVING THE SUPERVISED RATE WITHOUT SUPERVISION From the oracle result (Proposition 2), we expect the optimal misclassification rate to be close to e NI. This is verified by the next result, showing that a single local refinement step applied to a consistent aggregation algorithm, such as Algorithm 1, can get us nearly to the optimal rate: Theorem 2. Assume that Z1, . . . , ZN are i.i.d. draws from the random perturbation model (8). Let b Z be the output of Algorithm 3 initialized by some, possibly data-dependent, e Z. Let nmin = mink [K] nk(Z ) be the smallest true cluster size and recall the definition of I in (11). Assume that (a) nminp(1 I)/K and NI log K , and there exists δ satisfying Knδ/(nminp I) = o(1) such that one of the followings holds: (b1) P(Mis( e Z, Z ) δ) = 1 o(1), or (b2) E[Mis( e Z, Z )] δ. Then, for some η = o(1), the misclassification rate satisfies P Mis( b Z, Z ) e (1 η)NI = 1 o(1). (12) Published as a conference paper at ICLR 2023 The assumptions of Theorem 2 are mild. Suppose that p is bounded away from 0 or 1, say p [0.01, 0.99], K = O(1) and the cluster sizes are similar. Then, the assumptions can be simplified to nmin, N and δ = o(1). The theorem is most interesting when p 1 and K is large. In this case, I 0 and the first requirement of assumption (a) becomes nmin I/K . This assumption guarantees that the samples provide sufficient information to recover the permutations, although we have not attempted to do so in our algorithm. The second requirement of assumption (a) provides evidence to distinguish the true labels from the other K 1 labels. 
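Before moving on, here is a minimal NumPy sketch of one sweep of Algorithm 3, written to follow the description above: for each object i, it zeroes out the i-th column of the initial label matrix, computes the cluster sizes $(n_1, \dots, n_K)$ and the connection counts $(b_1, \dots, b_K)$ from the i-th column of the average association matrix, and reassigns object i to $\arg\max_k b_k / n_k$. The function name is ours, and the sweep computes all updates from the initial labels; an in-place sequential variant is an equally valid reading of the pseudocode.

```python
import numpy as np

def local_refinement(Xbar, Z_init):
    """One sweep of Algorithm 3 (sketch).

    Xbar   : n x n average association matrix.
    Z_init : K x n one-hot initial label matrix (e.g. from Algorithm 1 or 2).
    Returns a refined K x n one-hot label matrix.
    """
    K, n = Z_init.shape
    Z_new = Z_init.copy()
    for i in range(n):
        Zi = Z_init.copy()
        Zi[:, i] = 0.0                  # exclude object i itself
        sizes = Zi.sum(axis=1)          # (n_1, ..., n_K): current cluster sizes
        b = Zi @ Xbar[:, i]             # (b_1, ..., b_K): weighted connections of i
        with np.errstate(divide="ignore", invalid="ignore"):
            scores = np.where(sizes > 0, b / sizes, -np.inf)
        Z_new[:, i] = 0.0
        Z_new[np.argmax(scores), i] = 1.0
    return Z_new
```

Feeding the output back in as the initial label matrix gives the iterated version of the refinement discussed above.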
Under the assumptions of Theorem 2, Algorithm 3 initialized with input satisfying (12), will have an output satisfying (b1). Hence, Theorem 2 also guarantees rate-optimality of an iterative Algorithm 3. 4 EXPERIMENTAL RESULTS We now present empirical results comparing the performances of the proposed aggregation algorithms, with additional results provided in Appendix C. The ground truth label matrix Z is generated by randomly assigning each of the n objects to one of the K labels. The N input clusterings Zj, j [N] are generated from model (8). We measure the performance of an algorithm by the adjusted Rand index (ARI) of its output against the ground truth. We consider seven different aggregation algorithms: (1) Our Algorithm 1, referred to as Basic in the plots; (2) Our Algorithm 2, referred to as SC; (3) KCC algorithm (48); (4) CC Pivot algorithm (2; 18) with threshold 0.25; (5) Best One Element Move (BOEM) algorithm (11; 18); (6) the EM algorithm of (24); and (7) Best Of K algorithm (12; 18). In addition, we consider variants of algorithms (1) (5) where we apply our refinement step to their output. This gives us a total of 12 methods. In the plots, the refined version is denoted with a solid line and the original version (without refinement) with a dotted line. We also use the average ARI of the N input clusterings, denoted by the INPUT, as a baseline. Balanced setting with varying n and N. Figure 2 depicts plots of ARI versus the noise probability p in model (8), for various methods. The results are averaged over 40 replications. The settings in Figure 2 all correspond to balanced cluster sizes. Generally, our proposed Basic and SC algorithms outperform the EM, KCC, CCPivot and BOEM algorithms, with the failure thresholds occurring at larger values of p (harder problems). In some settings, the refinement shows some improvement, but in others, the output of the refinement applied to Basic and SC nearly coincides with the original algorithm. This shows that in some settings the original algorithm implicitly performs the refinement itself. We note that increasing N shifts the failure thresholds to the right as expected, as does the increasing of n, both consistent with our theory. Note that Best Of K performs no better than INPUT. Moreover, refined versions of KCC and BOEM outperform their original versions. Unbalanced setting. Figure 3 depicts the results obtained with disproportionate cluster sizes, specifically with p1 proportion of the objects in one cluster and the rest distributed uniformly to the remaining clusters. Local refinement over Basic and SC performs significantly better as we deal with input clusterings of disproportionate cluster sizes, especially at lower noise probabilities. 5 CONCLUSION AND DISCUSSION In the present paper, we defined the random perturbation model to study the label aggregation problem. We developed a K-means type algorithm followed by an efficient local refinement step to achieve the optimal misclassification rate under the assumptions of the model. Numerical experiments also show the effectiveness of our proposed methods. Let us also discuss possible avenues for future work. A single-stage algorithm. Two-stage algorithms have been popular in many clustering problems (3; 30; 16). Numerical experiments show that, in many cases, a K-means type algorithm performs sufficiently close to the local refinement with good initialization (Section 4). The K-means type algorithm assigns labels based on the distances between the objects and the centers. 
This criterion is different from the likelihood ratio test in many cases, and so the output of the algorithm will not achieve the misclassification rate of the oracle problem. It will be a novel improvement if the K-means type algorithm can be generalized to an EM-type algorithm so that the distance" between the parameter and the object is defined by the likelihood. Whether such an algorithm exists, and how efficient it is statistically and computationally, are interesting open questions. A robust algorithm. We observe i.i.d. label matrices from the random perturbation model defined in Definition 1. As long as p < 1, every label matrix from this model provides the same amount Published as a conference paper at ICLR 2023 0.4 0.6 0.8 1.0 Probability Adjusted Rand Index (a) n = 100, N = 20, K = 6 0.4 0.6 0.8 1.0 Probability Adjusted Rand Index (b) n = 100, N = 200, K = 6 0.4 0.6 0.8 1.0 Probability Adjusted Rand Index (c) n = 500, N = 20, K = 6 0.4 0.6 0.8 1.0 Probability Adjusted Rand Index (d) n = 500, N = 200, K = 6 Figure 2: Performance impact with the increase in n and N 0.4 0.6 0.8 1.0 Probability Adjusted Rand Index (a) n = 100, N = 20, K = 6, p1 = 0.5 0.4 0.6 0.8 1.0 Probability Adjusted Rand Index (b) n = 100, N = 20, K = 6, p1 = 0.75 Figure 3: Significant improvements due to local refinement in the case of unbalanced cluster sizes. of information in expectation, so there is no reason to assign different weights to label matrices. However, in practice, the samples may not be i.i.d. Published as a conference paper at ICLR 2023 ACKNOWLEDGEMENT Zhixin Zhou was supported by the Research Grants Council of Hong Kong [Project Number 21502921]. A. A. Amini was supported by the NSF grant DMS-1945667. [1] Daniel Duarte Abdala, Pakaket Wattuya, and Xiaoyi Jiang. Ensemble clustering via random walker consensus strategy. In 2010 20th International Conference on Pattern Recognition, pages 1433 1436. IEEE, 2010. [2] Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: ranking and clustering. Journal of the ACM (JACM), 55(5):1 27, 2008. [3] Arash A Amini, Aiyou Chen, Peter J Bickel, and Elizaveta Levina. Pseudo-likelihood methods for community detection in large sparse networks. The Annals of Statistics, 41(4):2097 2122, 2013. [4] Arash A. Amini and Elizaveta Levina. On semidefinite relaxations for the block model. The Annals of Statistics, 46(1):149 179, 2018. [5] David A Binder. Bayesian cluster analysis. Biometrika, 65(1):31 38, 1978. [6] Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999. [7] David B Dahl. Model-based clustering for expression data via a dirichlet process mixture model. Bayesian inference for gene expression and proteomics, 4:201 218, 2006. [8] David B Dahl, Devin J Johnson, and Peter Müller. Search algorithms and loss functions for bayesian clustering. Journal of Computational and Graphical Statistics, pages 1 13, 2022. [9] David B Dahl and Michael A Newton. Multiple hypothesis testing by clustering treatment effects. Journal of the American Statistical Association, 102(478):517 526, 2007. [10] Xiaoli Zhang Fern and Carla E Brodley. Solving cluster ensemble problems by bipartite graph partitioning. In Proceedings of the twenty-first international conference on Machine learning, page 36, 2004. [11] Vladimir Filkov and Steven Skiena. Heterogeneous data integration with the consensus clustering formalism. In International Workshop on Data Integration in the Life Sciences, pages 110 123. Springer, 2004. 
[12] Vladimir Filkov and Steven Skiena. Integrating microarray data by consensus clustering. International Journal on Artificial Intelligence Tools, 13(04):863 880, 2004. [13] A. L. N. Fred and A. K. Jain. Data clustering using evidence accumulation. In 2002 International Conference on Pattern Recognition, volume 4, pages 276 280. IEEE, 2002. [14] Ana LN Fred and Anil K Jain. Combining multiple clusterings using evidence accumulation. IEEE transactions on pattern analysis and machine intelligence, 27(6):835 850, 2005. [15] Arno Fritsch and Katja Ickstadt. Improved criteria for clustering based on the posterior similarity matrix. Bayesian analysis, 4(2):367 391, 2009. [16] Chao Gao, Zongming Ma, Anderson Y Zhang, and Harrison H Zhou. Achieving optimal misclassification proportion in stochastic block models. The Journal of Machine Learning Research, 18(1):1980 2024, 2017. [17] Aristides Gionis, Heikki Mannila, and Panayiotis Tsaparas. Clustering aggregation. Acm transactions on knowledge discovery from data (tkdd), 1(1):4 es, 2007. [18] Andrey Goder and Vladimir Filkov. Consensus clustering algorithms: Comparison and refinement. In 2008 Proceedings of the Tenth Workshop on Algorithm Engineering and Experiments (ALENEX), pages 109 117. SIAM, 2008. Published as a conference paper at ICLR 2023 [19] David Inouye, Pradeep Ravikumar, and Inderjit Dhillon. Square root graphical models: Multivariate generalizations of univariate exponential families that permit positive dependencies. In International conference on machine learning, pages 2445 2453. PMLR, 2016. [20] David I Inouye, Eunho Yang, Genevera I Allen, and Pradeep Ravikumar. A review of multivariate distributions for count data derived from the poisson distribution. Wiley Interdisciplinary Reviews: Computational Statistics, 9(3):e1398, 2017. [21] Paul Kellam, Xiaohui Liu, Nigel Martin, Christine Orengo, Stephen Swift, and Allan Tucker. Comparing, contrasting and combining clusters in viral gene expression data. In Proceedings of 6th workshop on intelligent data analysis in medicine and pharmocology, pages 56 62, 2001. [22] Jens Keuchel and Daniel Küttel. Efficient combination of probabilistic sampling approximations for robust image segmentation. In Joint Pattern Recognition Symposium, pages 41 50. Springer, 2006. [23] Ludmila I Kuncheva and Dmitry P Vetrov. The problems of approximation in spaces of relations and qualitative data analysis. Information and Remote Control, 35(11):1424 1431, 2006. [24] Tilman Lange and Joachim M Buhmann. Combining partitions by probabilistic label aggregation. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 147 156, 2005. [25] John W Lau and Peter J Green. Bayesian model-based clustering procedures. Journal of Computational and Graphical Statistics, 16(3):526 558, 2007. [26] Tao Li and Chris Ding. Weighted consensus clustering. In Proceedings of the 2008 SIAM International Conference on Data Mining, pages 798 809. SIAM, 2008. [27] Tao Li, Mitsunori Ogihara, and Sheng Ma. On combining multiple clusterings: an overview and a new perspective. Applied Intelligence, 33(2):207 219, 2010. [28] Yan Li, Jian Yu, Pengwei Hao, and Zhulin Li. Clustering ensembles based on normalized edges. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 664 671. Springer, 2007. [29] Yuan Li, Benjamin Rubinstein, and Trevor Cohn. Exploiting worker correlation for label aggregation in crowdsourcing. 
In International Conference on Machine Learning, pages 3886 3895. PMLR, 2019. [30] Yu Lu and Harrison H Zhou. Statistical and computational guarantees of lloyd s algorithm and its variants. ar Xiv preprint ar Xiv:1612.02099, 2016. [31] Mario Medvedovic and Siva Sivaganesan. Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics, 18(9):1194 1206, 2002. [32] Marina Meil a. Comparing clusterings by the variation of information. In Learning theory and kernel machines, pages 173 187. Springer, 2003. [33] Marina Meil a. Comparing clusterings an information based distance. Journal of multivariate analysis, 98(5):873 895, 2007. [34] Marina Meilˇa. Comparing clusterings: an axiomatic view. In Proceedings of the 22nd international conference on Machine learning, pages 577 584, 2005. [35] Boris Mirkin. Reinterpreting the category utility function. Machine Learning, 45(2):219 228, 2001. [36] William M Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association, 66(336):846 850, 1971. [37] Vikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of machine learning research, 11(4), 2010. Published as a conference paper at ICLR 2023 [38] Zahra Razaee and Arash Amini. The potts-ising model for discrete multivariate data. Advances in Neural Information Processing Systems, 33:13727 13737, 2020. [39] Simon Régnier. Sur quelques aspects mathématiques des problèmes de classification automatique. Mathématiques et Sciences humaines, 82:13 29, 1983. [40] Xavier Sevillano, Francesc Alías, and Joan Claudi Socoró. Positional and confidence votingbased consensus functions for fuzzy cluster ensembles. Fuzzy Sets and Systems, 193:1 32, 2012. [41] Alexander Strehl and Joydeep Ghosh. Cluster ensembles a knowledge reuse framework for combining multiple partitions. Journal of machine learning research, 3(Dec):583 617, 2002. [42] Alexander Topchy, Anil K Jain, and William Punch. Combining multiple weak clusterings. In Third IEEE international conference on data mining, pages 331 338. IEEE, 2003. [43] Alexander Topchy, Anil K Jain, and William Punch. A mixture model for clustering ensembles. In Proceedings of the 2004 SIAM international conference on data mining, pages 379 390. SIAM, 2004. [44] Sandro Vega-Pons and José Ruiz-Shulcloper. A survey of clustering ensemble algorithms. International Journal of Pattern Recognition and Artificial Intelligence, 25(03):337 372, 2011. [45] Sara Wade and Zoubin Ghahramani. Bayesian cluster analysis: Point estimation and credible balls (with discussion). Bayesian Analysis, 13(2):559 626, 2018. [46] Yoshiko Wakabayashi. The complexity of computing medians of relations. Resenhas do Instituto de Matemática e Estatística da Universidade de São Paulo, 3(3):323 349, 1998. [47] Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier Movellan, and Paul Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. Advances in neural information processing systems, 22, 2009. [48] Junjie Wu, Hongfu Liu, Hui Xiong, Jie Cao, and Jian Chen. K-means-based consensus clustering: A unified view. IEEE transactions on knowledge and data engineering, 27(1):155 169, 2014. [49] Zhixin Zhou and Arash A Amini. Analysis of spectral clustering algorithms for community detection: the general bipartite setting. The Journal of Machine Learning Research, 20(1):1774 1820, 2019. 
[50] Zhixin Zhou and Arash A Amini. Optimal bipartite network clustering. The Journal of Machine Learning Research, 21(1):1460 1527, 2020. [51] Zhixin Zhou and Ping Li. Rate optimal Chernoff bound and application to community detection in the stochastic block models. Electronic Journal of Statistics, 14(1):1302 1347, 2020. Published as a conference paper at ICLR 2023 Supplementary material for Statistical Guarantees for Consensus Clustering This supplement contains the detailed proofs of the results and some extra simulations. A INCONSISTENCY OF BESTOFK Using the notation of the present paper, the name Best Of K" should be Best Of N". We will use our notation in the following proposition and keep the name Best Of K". Proposition 3. Best Of K is not consistent unless N grows exponentially fast in n. Proof. We will prove this proposition by providing a counterexample. Suppose K = 2, 1 p = 0.6 and q = 0.4. Then for a label vector z from the RPM, by the Hoeffding inequality, P(Mis(z, z ) 0.1) 1 exp(2(0.4 0.1)2n) exp(2(0.6 0.1)2n) 1 2 exp( 0.18n) where we have accounted for the two permutations in the definition of Mis. Suppose we observe N i.i.d. label vectors z1, . . . , z N from the RPM. Then P( min i [N] Mis(zi, z ) 0.1) (1 2 exp( 0.18n))N 1 2N exp( 0.18n). This probability (of missing the target) approaches 1 unless N grows exponentially fast in n. B RELATIONS AMONG CLUSTERING DISTANCES Let nk be the size of the kth cluster of Z En K and, n ℓthe size of the ℓth cluster of Z En L, and let X and X be the corresponding association matrices. The Mirkin distance (34, Eqn (6)) is given by d M(Z, Z ) = X ℓ (n ℓ)2 2 X k,ℓ n2 kℓ (13) where nkℓis the number of objects that are in cluster k according to Z and cluster ℓaccording to Z . It is not hard to see that P k n2 k = X 2 F and similarly P ℓ(n ℓ)2 = X 2 F . We also have Z(Z )T = (nkℓ), hence, using A 2 F = tr(AAT ), X k,ℓ n2 kℓ= Z(Z )T 2 F = tr(Z(Z )T Z ZT ) = tr((Z )T Z ZT Z) = tr(X X). Combining these facts, we obtain the first equality below d M(Z, Z ) = X X 2 F = X X ℓ1. (14) The second equality follows from X X having elements in { 1, 0, 1}. Here ℓ1 denotes the ℓ1 norm of a matrix viewed as a vector. The equality d M(Z, Z ) = X X ℓ1 immediately shows that d M is indeed a distance on the space of clusterings. It also connects the Mirkin distance with the Rand index. To see the connection with the Rand index, let Ndisagree be the number of pairs of objects for which Z and Z disagree about their co-clustering, that is, whether the two objects are in the same cluster or not. Similarly, let Nagree be the number of pairs of objects for which Z and Z agree about their co-clustering. We have Ndisagree + Nagree = n 2 . The Rand index is defined as the proportion of the agreements, that is, Rand = Nagree n 2 . It is easy to see that X X ℓ1 = 2Ndisagree where the factor of 2 is due to the double-counting caused by the symmetry of X X . This proves the relation 1 2d M = n 2 (1 Rand). (15) The symmetric difference distance (SDD) is another name for Ndisagree, hence d M/2 = SDD. The Binder loss is defined as half the expression in (13), that is, d M/2 = Binder. Published as a conference paper at ICLR 2023 B.1 CONSISTENCY IN Mis IMPLIES CONSISTENCY IN MIRKIN DISTANCE Let us now show that the consistency in Mis implies consistency in the normalized Mirkin distance defined as d M := d M/n2. See (34, Eqn (9)). It then follows that consistency in Mis implies consistency in the normalized SDD, normalized Binder loss and the Rand index, as discussed above. 
This claim follows from the following inequality: Proposition 4. We have d M 2 Mis. Proof. Let X and X be the association matrices corresponding to label vectors z and z . Then d M(z, z ) = X X ℓ1 as shown in (14). The entries of X X take values in { 1, 0, 1}. Assume, WLOG, that the optimal permutation between z and z is the identity. Then: 1. If the label zi = z i , then the ith row of X X has at most n Mis nonzero entries. There are at most n such rows. 2. If the label zi = z i , then the ith row of X X has at most n nonzero entries. There are at most n Mis such rows. Therefore, d M = X X ℓ1 n (n Mis) + (n Mis) n = 2n2 Mis and the result follows. C EXTRA SIMULATION RESULTS 0.4 0.6 0.8 1.0 Probability Adjusted Rand Index (a) n = 100, N = 20, K = 6, p1 = 0.8 0.4 0.6 0.8 1.0 Probability Adjusted Rand Index (b) n = 100, N = 20, K = 6, p1 = 0.9 Figure 4: Significant improvements due to local refinement in the case of unbalanced cluster sizes. Figure 4 shows some extra cases of unbalanced cluster sizes (various values of p1 as defined earlier), showing the significant improvement of the refinement step in such cases. All the results for the unbalanced case (including those in the main text) are averaged over 120 runs. Tables 1, 2 and 3 show the average ARI in all the eight settings (abbreviated Set in the tables) shown in Figures 2, 3 and 4. The tables show the performance of the methods at noise probabilities p = 0.45, 0.55 and 0.65 respectively corresponding to a cross-section of each plot at a line parallel to the y-axis, crossing the x-axis at the respective value of p. The settings are as follows: 1. Set 1: Balanced, n = 100, N = 20. 2. Set 2: Balanced, n = 100, N = 200. 3. Set 3: Balanced, n = 500, N = 20. 4. Set 4: Balanced, n = 500, N = 200. 5. Set 5: Unbalanced, n = 100, N = 20, p1 = 0.5. 6. Set 5: Unbalanced, n = 100, N = 20, p1 = 0.75. Published as a conference paper at ICLR 2023 Method Refined Set 1 Set 2 Set 3 Set 4 Set 5 Set 6 Set 7 Set 8 Basic FALSE 1.00 1.00 1.00 1.00 0.82 0.35 0.24 0.093 Basic TRUE 1.00 1.00 1.00 1.00 0.98 0.91 0.86 0.690 Best Of K FALSE 0.32 0.30 0.29 0.31 0.33 0.28 0.24 0.150 BOEM FALSE 0.50 0.48 0.57 0.69 0.38 0.41 0.33 0.310 BOEM TRUE 0.65 0.59 0.94 1.00 0.60 0.70 0.52 0.430 CCPivot FALSE 0.77 1.00 0.77 1.00 0.81 0.78 0.77 0.670 CCPivot TRUE 0.84 1.00 0.81 1.00 0.81 0.75 0.70 0.570 EM FALSE 0.97 0.98 0.97 0.99 0.72 0.41 0.36 0.220 Input FALSE 0.30 0.30 0.30 0.30 0.33 0.28 0.25 0.160 KCC FALSE 1.00 1.00 1.00 1.00 0.91 0.44 0.34 0.130 KCC TRUE 1.00 1.00 1.00 1.00 0.93 0.50 0.41 0.180 SC FALSE 0.99 1.00 1.00 1.00 0.96 0.65 0.65 0.430 SC TRUE 1.00 1.00 1.00 1.00 0.99 0.98 0.97 0.880 Table 1: Mean adjusted rand index (ARI) for all settings at noise probability p = 0.45. 
Method Refined Set 1 Set 2 Set 3 Set 4 Set 5 Set 6 Set 7 Set 8 Basic FALSE 0.96 1.00 0.98 1.00 0.66 0.26 0.19 0.077 Basic TRUE 0.97 1.00 0.98 1.00 0.89 0.64 0.57 0.370 Best Of K FALSE 0.20 0.21 0.20 0.20 0.23 0.19 0.17 0.096 BOEM FALSE 0.21 0.21 0.23 0.29 0.33 0.39 0.31 0.240 BOEM TRUE 0.46 0.36 0.56 0.65 0.55 0.62 0.48 0.280 CCPivot FALSE 0.51 0.99 0.51 0.99 0.57 0.52 0.44 0.360 CCPivot TRUE 0.61 0.98 0.54 0.94 0.52 0.36 0.32 0.200 EM FALSE 0.87 0.95 0.91 0.97 0.64 0.33 0.27 0.140 Input FALSE 0.20 0.20 0.20 0.20 0.23 0.18 0.16 0.098 KCC FALSE 0.94 1.00 0.98 1.00 0.67 0.28 0.21 0.081 KCC TRUE 0.97 1.00 0.98 1.00 0.74 0.36 0.27 0.130 SC FALSE 0.95 1.00 0.98 1.00 0.75 0.41 0.35 0.170 SC TRUE 0.97 1.00 0.98 1.00 0.95 0.86 0.79 0.550 Table 2: Mean adjusted rand index (ARI) for all settings at noise probability p = 0.55. 7. Set 6: Unbalanced, n = 100, N = 20, p1 = 0.8. 8. Set 7: Unbalanced, n = 100, N = 20, p1 = 0.9. D.1 PROOF OF PROPOSITION 1 For any Z En K, we have Z 2 F = P k,i Z2 ki = P k,i Zki = n. Thus, Z 2 F = b Pj Zj 2 F = n for all j [N]. Hence, solving (5) is equivalent to maximizing f(Z) := PN j=1 wj tr(ZT b Pj Zj) = tr(ZT Z) over En K, where Z := P j wj b Pj Zj. Let Z = (z1, . . . , zn) and Z = ( z1, . . . , zn). Maximizing f(Z) = Pn i=1 zi, zi is a separable problem over i, and maximizing z 7 z, zi over EK amounts to finding the index of the maximum element of zi, that is, the argmax of zi, as claimed. D.2 PROOF OF THEOREM 1 We have Zj = Pj Z j where Z j = (z j1, . . . , zjn) and z ji are i.i.d. draws as in (8). Since the algorithm is invariant to permutations Pj, without loss of generality we assume Pj = In, hence Zj = Z j. We write X = (Z )T Z for the true association matrix. Let En be the all-ones n n matrix. Published as a conference paper at ICLR 2023 Method Refined Set 1 Set 2 Set 3 Set 4 Set 5 Set 6 Set 7 Set 8 Basic FALSE 0.790 1.000 0.88 1.00 0.47 0.17 0.120 0.061 Basic TRUE 0.810 1.000 0.89 1.00 0.60 0.33 0.280 0.160 Best Of K FALSE 0.120 0.130 0.12 0.12 0.13 0.11 0.098 0.062 BOEM FALSE 0.085 0.098 0.12 0.11 0.24 0.29 0.320 0.270 BOEM TRUE 0.130 0.150 0.21 0.22 0.36 0.33 0.290 0.150 CCPivot FALSE 0.250 0.690 0.22 0.67 0.28 0.24 0.210 0.140 CCPivot TRUE 0.280 0.540 0.21 0.37 0.22 0.12 0.088 0.051 EM FALSE 0.600 0.920 0.74 0.95 0.46 0.23 0.180 0.078 Input FALSE 0.120 0.120 0.12 0.12 0.14 0.11 0.096 0.059 KCC FALSE 0.590 1.000 0.89 1.00 0.39 0.16 0.100 0.045 KCC TRUE 0.740 1.000 0.89 1.00 0.52 0.24 0.180 0.083 SC FALSE 0.750 1.000 0.88 1.00 0.47 0.19 0.140 0.076 SC TRUE 0.810 1.000 0.89 1.00 0.65 0.40 0.330 0.190 Table 3: Mean adjusted rand index (ARI) for all settings at noise probability p = 0.65. Lemma 1. Let Z L(Z , p) and let X = ZT Z be the corresponding association matrix. Then, M := E[X] = (1 ξ)X + ξ 1 K En + (1 1 where ξ = p(2 p). Proof of Lemma 1. We have Xij = (ZT Z)ij = zi, zj and E[zi] = (1 p)z i + p 1 K 1K. For i = j, zi and zj are independent, hence EXij = Ezi, Ezj = (1 p)z i + p 1 K 1K, (1 p)z j + p 1 = (1 p)2 z i , z j + 2p(1 p) 1 K For i = j, we have E[Xii] = 1. The above shows that E[X] = (1 p)2X + p(2 p) 1 K En + p(2 p) 1 1 which simplifies to the desired expression. Let Z1, . . . , ZN, Z L(Z , p) be independent draws, and let Xj = ZT j Zj and X = ZT Z be the associated association matrices. Setting X = 1 N PN t=1 Xt, we obtain E X M 2 F = X ij E( Xij Mij)2 = X ij var( Xij) = 1 ij var(Xij). We have var(Xij) = 0 for i = j. 
For i = j, one has Xij Ber((1 ξ)X ij + ξ/K), hence var(Xij) = (1 ξ)X ij + ξ K (1 ξ)2X ij + 2 ξ K (1 ξ)X ij + ξ2 X ij + ψ(ξ/K) where ψ(x) = x(1 x). Note that ξ = p(2 p) (0, 1). It follows that N E X M 2 F ψ(ξ) 1 2 ij X ij + n2ψ(ξ/K) where the inequality is due to bounding var(Xii) by the same formula used for var(Xij), i = j. Let n k be the number of entities in cluster k of Z , that is, n k = (Z 1n)k. We have P ij X ij = Z 1n 2 = P k(n k)2. Using the assumption n k βn/K, we have N E X M 2 F ψ(ξ) 1 2 K + n2ψ(ξ/K). Published as a conference paper at ICLR 2023 Calculating the center separations. Let f M = (1 ξ)X + (ξ/K)En. We note that M f M is diagonal and M f M 2 F = ξ(1 1/K)In 2 F = ξ2(1 1/K)2n ξ2n. It follows that E X f M 2 F = E X i =j ( Xij f Mij)2 + E X i ( Xii f Mii)2 E X M 2 F + ξ2n. 1 n2 E X f M 2 F 2 K + ψ(ξ/K) i + 2ξ2 The matrix f M is a K-means matrix with K distinct rows. If zi = r = k = zi , then f Mi Mi 2 = (1 ξ)2 X i X i 2 = (1 ξ)2(n r + n k) 2(1 ξ)2 n using n k n/(βK), which holds by assumption (9). We have nrδ2 r 2(1 ξ)2( n βK )2 which gives the following bound, using (49, Proposition 1), E[Misr] 1 N(1 ξ)2 h ψ(ξ)(K 2)β4 + β2K2ψ(ξ/K) i + ξ2 (1 ξ)2 β2K2 Here, Misr is the misclassification rate over true cluster r. The dependence on β of the first term is O(β2) when K = 2 and O(β4) when K > 2. Ignoring this difference, we can simplify the bound, by noting that K2ψ(ξ/K) = Kξ(1 ξ/K) Kξ and β2 β4. Then, E[Misr] ξ (1 ξ)2 2Kβ4 (1 ξ)2 β2K2 from which the bound in the theorem follows since Mis = P r(n r/n) Misr. D.3 PROOF SKETCH FOR THEOREM 2 For the benefit of the readers, we first give a proof sketch for Theorem 2 and its key lemma. A detailed proof is given in Appendix D.4. The proof of Theorem 2 relies on the following key lemma: Lemma 2. Let B(δ) denote the set of label matrices Z with at most nδ labels different from Z , and let b Z(Z) be the output of Algorithm 3 with initial label matrix Z. If nminp(1 I)/K and log K P Z B(δ) such that bzi(Z) = z i e (1 η )NI+ 3KnδN 2pnmin (17) for some η = o(1). The first step is to prove the case δ = 0 in Lemma 2, corresponding to the initial label matrix in Algorithm 3 being Z . If z i = e1, then the algorithm fails to recover z i if there exists k = 1 such that Yk := n1bk nkb1 0. Yk is the average of i.i.d. samples, where each sample follows a mixture model depending on which events among zi = e1, zi = ek, or zi {e1, ek} happens. We compute the MGF of Yk and obtain the bound E exp(t NYk/(n1n2(1 p))) (1 p) e t(1+o(1)) + q et(1+o(1)) + (K 2)q e nmin(1 p)2 N. The choice of t has little affect on the last term since nmin is large, so we set t = 1 2 log[(1 p)/q] to minimize (1 p)e t + qet. Under the regularity conditions of the lemma and the definition of I in (11), we have E exp(t Yk/(n1n2(1 p))) 2 p (1 p)q + (K 2)q (1 o(1))N = e (1 o(1))NI. Published as a conference paper at ICLR 2023 Applying the Chernoff inequality, it follows that P(bzi(Z ) = z i ) k=2 P (Yk 0) k=2 E exp(t Yk/(n1n2(1 p))) (K 1)e (1 o(1))NI. Using the assumption log K NI 0, we obtain P(bzi(Z ) = z i ) e (1 η)NI. This proves the case δ = 0. Now we compare Yk s obtained from Algorithm 3 initialized with label matrices Z and Z, and denoted by Yk(Z ) and Yk(Z), respectively. For all Z B(δ), we show that |Yk(Z ) Yk(Z)| 3(n1 nk) nδ if z i = e1, giving P Z B(δ) such that bzi(Z) = z i k=2 P Yk 3(n1 nk)nδ . We apply the Chernoff inequality with the same choice of t to obtain (17). We arrive at the proof of Theorem 2. 
Let b Z(Z) be the output of Algorithm 3 with initial label matrix Z. Consider the event Aδ = {Mis( e Z, Z ) δ}. For any ε > 0, P Mis( b Z( e Z), Z ) > ε P Ac δ + P Z B(δ ), Mis( b Z(Z), Z ) > ε . We have P Ac δ = o(1) under assumption (b1). Letting ε = NIe (1 η )NI+ 3KnδN 2pnmin , one can verify that the second probability also converges to 0 under the conditions of the theorem and ε = e (1 o(1))NI. This proves (20) under assumption (b1). For the proof under assumption (b2), please see Appendix D.4 D.4 DETAILED PROOF OF THEOREM 2 Let b Z(Z) be the output of Algorithm 3 with initial label matrix Z. Consider the event Aδ = {Mis( e Z, Z ) δ }. For any ε > 0, P Mis( b Z( e Z), Z ) > ε P Ac δ + P Z B(δ ), Mis( b Z(Z), Z ) > ε . (18) If assumption (b1) holds, then let δ = δ so that P(Ac δ) = o(1). If assumption (b2) holds, Then we let δ = p nminp Iδ/(Kn) so that Knδ nminp I = o(1) and by Markov s inequality, P(Mis( e Z, Z ) > δ ) 1 δ E[Mis( e Z, Z )] δ Knδ nminp I = o(1). Then, (b1) is satisfied with δ = δ . Therefore, it is enough to only consider assumption (b1) and let δ = δ for the rest of the proof. Let π be the permutation corresponding to Mis( e Z, Z ) in assumption (b1), that is, π = argminπ Pn i=1 1{ zi = π(z i )}. Since we can always assume π (z ) to be the true label, without loss of generality, we can assume π = identity. Writing T2 for the second term in (18), T2 P Z B(δ), i=1 1{bzi(Z) = z i } > nε P n X i=1 1 Z B(δ), bzi(Z) = z i > nε . By Markov s inequality, we obtain i=1 P Z B(δ), bzi(Z) = z i 1 εe (1 η )NI+ 3KnδN 2pnmin . (19) where the second inequality follows from Lemma 2, given assumption (a) of the theorem. Published as a conference paper at ICLR 2023 Assumption (a) of the theorem also implies 3KnδN 2pnmin = o(NI), so this term can be absorbed into η NI, giving T2 1 εe (1 η )NI for some η = o(1). Let ε = NIe (1 η )NI = e (1 η)NI, where η = η + log(NI) NI = o(1). It follows from (19) that εe (1 η )NI = 1 NI = o(1). Hence, we obtain (12) as desired. D.4.1 AN AUXILIARY LEMMA We state the case δ = 0 in Lemma 2 as a separate lemma and prove it first. Recall that nmin = mink [K] nk where nk is the size of the kth cluster. We have the following lemma. Lemma 3 (Local refinement with Z ). Suppose the initial label matrix in Algorithm 3 is Z , and assume nminp(1 I)/K and log K P(bzi = z i ) = e (1 η)NI (20) for some η = o(1). As a direct consequence, E[Mis( b Z, Z )] e (1 η)NI. Proof of Lemma 3. Let q := p/K and p := (K 1)q := p q. We first focus on the probability P(bz1 = z 1). Let C k = {i 2 : z i = ek}. We have bk = P i C k zi, z1 . Since z 1 = e1 by assumption, z1 takes values e1 and any of eℓ, ℓ = 1 w.p. 1 p and q. For i C 1, zi has the same distribution as z1. For i C 2, zi takes values e2 and any of eℓ, ℓ = 2 w.p. 1 p and q respectively. Note that (b1, b2) is independent of z1. It follows that (b1, b2) | z1 Bin(n1, 1 p) Bin(n2, q), if z1 = e1 Bin(n1, q) Bin(n2, 1 p), if z1 = e2 Bin(n1, q) Bin(n2, q), if z1 / {e1, e2} (21) where is the notation for the product measure, that is, b1 and b2 are independent in each case. The three possibilities above hold with probability 1 p, q and (K 2)q respectively. Let Y = n1b2 n2b1 and let MY (λ) be the moment-generating function (MGF) of Y . Let ψ(λ; p) = 1 p+peλ be the MGF of a Ber(p) variable. Then, the MGF of Bin(n, p) is ψ(λ; p)n and hence E[eλY | z1] = E[eλn1b2 | z1] Ee λn2b1 | z1] ψ(λn1; q)n2 ψ( λn2; 1 p)n1 if z1 = e1 ψ(λn1; 1 p)n2 ψ( λn2; q)n1 if z1 = e2 ψ(λn1; q)n2 ψ( λn2; q)n1 if z1 / {e1, e2}. 
Let $\phi(\lambda;\mu) = \exp(\mu(e^{\lambda}-1))$ be the MGF of $\mathrm{Poi}(\mu)$ and note that $\psi(\lambda; p)^n \le \phi(\lambda; np)$. Then, for example, we have
$$\mathbb{E}\big[e^{\lambda Y}\,\big|\,z_1 = e_1\big] \;\le\; \phi(\lambda n_1;\, n_2 q)\,\phi(-\lambda n_2;\, n_1(1-\bar p)).$$
Since $\phi(\lambda;\mu) = \exp[\mu(\lambda + o(\lambda))] = \exp[\mu\lambda(1+o(1))]$ for $\lambda = o(1)$, we obtain
$$\mathbb{E}\big[e^{\lambda Y}\,\big|\,z_1 = e_1\big] \;\le\; \exp\Big(n_1n_2\lambda q(1+o(1)) - n_1n_2\lambda(1-\bar p)(1+o(1))\Big),$$
assuming that $\lambda(n_1+n_2) = o(1)$. Then,
$$\mathbb{E}\big[e^{\lambda Y}\,\big|\,z_1 = e_1\big] \;\le\; \exp\Big(n_1n_2\lambda(q-1+\bar p)\big(1+o(1)\big)\Big).$$
Take $\lambda = t\,[n_1n_2(1-p)]^{-1}$ for some $t\ge 0$ to be determined below. Noting $q - 1 + \bar p = -(1-p)$,
$$\mathbb{E}\big[e^{\lambda Y}\,\big|\,z_1 = e_1\big] \;\le\; \exp\big(-t(1+o(1))\big).$$

The case $z_1 = e_2$ is argued similarly and we obtain the bound $\mathbb{E}[e^{\lambda Y}\,|\,z_1 = e_2] \le \exp\big(t(1+o(1))\big)$. For $z_1\notin\{e_1, e_2\}$, we perform a second-order expansion, assuming $\lambda = o(1)$:
$$\phi(\lambda;\mu) = \exp\Big(\mu\big(\lambda + \tfrac12\lambda^2 + o(\lambda^2)\big)\Big) \;\le\; \exp\big(\mu(\lambda+\lambda^2)\big),$$
and obtain
$$\psi(\lambda n_1; q)^{n_2}\,\psi(-\lambda n_2; q)^{n_1} \;\le\; \exp\big(\lambda^2 n_1n_2(n_1+n_2)q\big).$$
Let $\gamma := 2q/(1-p)^2$ and let $n_{\mathrm{har}} := 2n_1n_2/(n_1+n_2)$ be the harmonic mean of $n_1$ and $n_2$. Note that $n_{\mathrm{har}} \ge n_{\min}$. We have
$$\lambda^2 n_1n_2(n_1+n_2)q = \frac{t^2(n_1+n_2)q}{n_1n_2(1-p)^2} = \gamma t^2/n_{\mathrm{har}}.$$
To summarize, the conditional MGF satisfies
$$\mathbb{E}\big[e^{tY/(n_1n_2(1-p))}\,\big|\,z_1\big] \;\le\; \begin{cases} \exp\big(-t(1+o(1))\big) & \text{if } z_1 = e_1,\\ \exp\big(t(1+o(1))\big) & \text{if } z_1 = e_2,\\ \exp\big(\gamma t^2/n_{\mathrm{har}}\big) & \text{if } z_1\notin\{e_1, e_2\}. \end{cases}$$
Recall that the events $z_1 = e_1$, $z_1 = e_2$ and $z_1\notin\{e_1, e_2\}$ happen with probabilities $1-\bar p$, $q$ and $(K-2)q$, respectively. It follows that
$$M_Y\big(t/(n_1n_2(1-p))\big) \;\le\; (1-\bar p)\,e^{-t(1+o(1))} + q\,e^{t(1+o(1))} + (K-2)q\,e^{\gamma t^2/n_{\mathrm{har}}}. \tag{22}$$
We take
$$t = \tfrac12\log\big((1-\bar p)/q\big) = \tfrac12\log\Big(1 + \frac{K(1-p)}{p}\Big), \tag{23}$$
so that $(1-\bar p)e^{-t} = qe^{t} = \sqrt{(1-\bar p)q}$. Then $t\ge 0$, and since $\log(1+x)\le x$, we have
$$t \;\le\; K(1-p)/(2p). \tag{24}$$
The condition $\lambda(n_1+n_2) = o(1)$ is satisfied under the assumption $n_{\min}p/K\to\infty$, since
$$\lambda(n_1+n_2) = \frac{(n_1+n_2)\,t}{n_1n_2(1-p)} = \frac{2t}{n_{\mathrm{har}}(1-p)} \;\le\; \frac{K(1-p)/p}{n_{\mathrm{har}}(1-p)} \;\le\; \frac{K}{n_{\min}p} = o(1).$$
Recalling that $q = p/K$, the exponent of the last term in (22) satisfies
$$\frac{\gamma t^2}{n_{\mathrm{har}}} \;\le\; \frac{\gamma K^2(1-p)^2}{4p^2 n_{\mathrm{har}}} \;=\; \frac{2qK^2}{4p^2 n_{\mathrm{har}}} \;=\; \frac{K}{2n_{\mathrm{har}}p} \;\le\; \frac{K}{2n_{\min}p} \;=\; o(I)$$
under the assumption of the lemma. It follows that
$$M_Y\big(t/(n_1n_2(1-p))\big) \;\le\; 2\sqrt{(1-\bar p)q}\;e^{o(t)} + (K-2)q\,e^{o(I)} \;=\; 2\big(\sqrt{(1-\bar p)q}\big)^{1+o(1)} + (K-2)q\,e^{o(I)} \;\le\; \Big(2\sqrt{(1-\bar p)q} + (K-2)q\Big)\big(\sqrt{(1-\bar p)q}\big)^{o(1)}e^{o(I)} \;=\; e^{-I}\big(\sqrt{(1-\bar p)q}\big)^{o(1)}e^{o(I)},$$
where the first equality is by $e^{o(t)} = (e^{t})^{o(1)} = \big(\sqrt{(1-\bar p)q}\big)^{o(1)}$ for our choice of $t$, and the second equality by the definition (11) of $I$. Since $\sqrt{(1-\bar p)q}\ge e^{-I}$, we have $\big(\sqrt{(1-\bar p)q}\big)^{o(1)} = e^{o(I)}$, hence
$$M_Y\big(t/(n_1n_2(1-p))\big) \;\le\; e^{o(I)}e^{-I} = e^{-(1-o(1))I} = e^{-(1-\eta)I}.$$
Let $Y_1,\dots,Y_N$ be i.i.d. copies of $Y$. By Markov's inequality,
$$\mathbb{P}\Big(\sum_{j=1}^N Y_j \ge 0\Big) = \mathbb{P}\Big(e^{\lambda\sum_{j=1}^N Y_j} \ge 1\Big) \;\le\; \mathbb{E}\,e^{\lambda\sum_{j=1}^N Y_j} = M_{Y_1}(\lambda)^N \;\le\; e^{-(1-\eta)NI}.$$
The above argument shows that $\mathbb{P}\big(\tfrac{b_2}{n_2} \ge \tfrac{b_1}{n_1}\big) \le e^{-(1-\eta)NI}$. Repeating the argument for the $i$th label and every $k\neq 1$, we obtain
$$\mathbb{P}\big(\hat z_i(Z^*)\neq z^*_i\big) \;\le\; \mathbb{P}\Big(\max_{k=2,\dots,K}\frac{b_k}{n_k} \ge \frac{b_1}{n_1}\Big) \;\le\; (K-1)\,e^{-(1-\eta)NI}.$$
If $K = 2$, then we have already obtained (20). If $K > 2$, then $(K-1)e^{-(1-\eta)NI} = e^{-(1-\eta)NI+\log(K-1)} = e^{-(1-\eta)NI+o(NI)}$. The term $o(NI)$ can be absorbed into $\eta NI$, so we still obtain (20).

D.4.2 DETAILED PROOF OF LEMMA 2

Let $i = 1$ without loss of generality, let $Z\in B(\delta)$, and let $\hat n_k = n_k(Z)$ be the size of the $k$th cluster in the label matrix $Z$. Let $b_k(Z^*) = (Z^{*\top}X_1)_k$ and $b_k(Z) = (Z^{\top}X_1)_k$, where $X_1$ is the first column of $X$ in the algorithm. Suppose $Z$ has at most $n\delta$ labels different from $Z^*$; then
$$\big|[n_1b_2(Z^*) - n_2b_1(Z^*)] - [n_1b_2(Z) - n_2b_1(Z)]\big| \;\le\; (n_1\vee n_2)\,n\delta,$$
and
$$\big|[n_1b_2(Z) - n_2b_1(Z)] - [\hat n_1b_2(Z) - \hat n_2b_1(Z)]\big| \;\le\; |n_1-\hat n_1|\,b_2(Z) + |n_2-\hat n_2|\,b_1(Z) \;\le\; n\delta\,(n_1 + n_2).$$
Let $Y(Z^*) = n_1b_2(Z^*) - n_2b_1(Z^*)$ and $Y(Z) = \hat n_1b_2(Z) - \hat n_2b_1(Z)$. Combining the two bounds by the triangle inequality, we obtain
$$|Y(Z^*) - Y(Z)| \;\le\; 3(n_1\vee n_2)\,n\delta \;=:\; h(\delta).$$
By Markov's inequality, for $\lambda\ge 0$,
$$\mathbb{P}\Big(\max_{Z\in B(\delta)} Y(Z)\ge 0\Big) \;\le\; \mathbb{P}\big(Y(Z^*)\ge -h(\delta)\big) \;=\; \mathbb{P}\big(e^{\lambda N Y(Z^*)} \ge e^{-\lambda N h(\delta)}\big) \;\le\; e^{\lambda N h(\delta)}\,\mathbb{E}\big[e^{\lambda N Y(Z^*)}\big].$$
As in the proof of Lemma 3, we take $\lambda = t\,[n_1n_2(1-p)]^{-1}$, with $t$ given by (23).
Using the upper bound (24), we have
$$\lambda N h(\delta) \;=\; \frac{3(n_1\vee n_2)\,n\delta\,N\,t}{n_1n_2(1-p)} \;\le\; \frac{3Kn\delta N}{2pn_{\min}}.$$
The result follows as in Lemma 3.

E RPM AND BAYESIAN AGGREGATION

One might ask whether the RPM is a useful model in practice. For applications in which all the label vectors are perturbations of a common true "center", and our goal is to recover this center, the RPM is a good first approximation. This is the case for Bayesian label aggregation, as we argue below. In such settings, the RPM is like the i.i.d. noise model used in classical regression. Although one can imagine more complex regression models (with heteroscedastic noise, mixtures of regressions, etc.), the i.i.d. setting still provides a lot of insight for understanding the more complex models.

E.1 RPM IS A GOOD MODEL FOR A CONCENTRATED POSTERIOR

Let us now argue how one can arrive at the RPM in the context of Bayesian aggregation, by systematically making some assumptions. First, we note that our goal in this paper is not to prove posterior concentration around the "truth", also known as posterior consistency. This is problem-specific and outside the scope of this work. We assume that we are working with a model for which posterior consistency has already been established. The question that we are trying to answer is then this: given that the posterior concentrates around the true partition, and given that the MCMC has converged, that is, we are sampling from this concentrated posterior, can we obtain a consistent estimate of the center of the posterior (which would be the true partition) based on these samples?

For this purpose, it is enough to assume that we are observing samples from the posterior $p(z_1,\dots,z_n\,|\,D)$, where $D$ is the observed data, and that this posterior is concentrated around the true partition $z^*$. For simplicity, let us drop $D$ and note that the posterior is some multivariate discrete distribution $p(z_1,\dots,z_n)$. Then, we proceed in steps:

1. First, we address the label-switching issue. Let $z$ be a sample from the posterior and let the permutation $\pi$ be the minimizer of $H(z^*, \pi(z))$ over all $K!$ permutations, where $H(\cdot,\cdot)$ is the Hamming distance. We note that $\pi(z)$ is a label vector equivalent to $z$ (only the cluster labels are permuted). We consider the distribution of $\pi(z)$, rather than that of $z$, as the posterior distribution. This is only to simplify the discussion and is without loss of generality, since our methods are invariant to label-switching. The distribution of $\pi(z)$ will have a single mode at $z^*$, while that of $z$ will have $K!$ modes at all equivalent versions of $z^*$. Given this simplification, and renaming $\pi(z)$ to $z$ from now on, the posterior is a multivariate distribution $p(z_1,\dots,z_n)$ which is highly concentrated around $z^* = (z^*_1,\dots,z^*_n)$. This follows from the posterior consistency assumption.

2. We claim that this multivariate distribution can be approximated by the product of its marginals, $p(z_1,\dots,z_n) \approx \prod_{i=1}^n p_i(z_i)$. This is intuitively clear for any concentrated discrete distribution, and it matches what one observes in MCMC samples at stationarity: each node/object $i$ usually has a most likely assignment, which is $z^*_i$, but it occasionally jumps to some other label with a small probability. The fluctuations of different nodes are essentially independent; intuitively, the bulk of the labels do not change their clusters all at once, and only a few do at any given time.
3. Non-uniform RPM: Given the independence assumption across $i$, the most general form $p_i(z_i)$ can take is a categorical (a.k.a. $\mathrm{Multinomial}(1,\pi)$) distribution. The bulk of the mass of this categorical variable will be on $z^*_i$, with the rest distributed among the other labels. For example, for some node $i$, the label can jump, say, between $z^*_i = 3$ and $5$, with the bulk of the probability on $3$. For other nodes, when they jump away from $z^*_i$, they may land on a larger collection of labels. Here, we make the simplifying assumption that, for each node, the mass that is put on anything other than $z^*_i$ is uniformly distributed over the label set $[K]\setminus\{z^*_i\}$. This assumption can be removed, and we can work with general categorical variables, at the expense of making the rates and the analysis more complicated.

4. Inhomogeneous RPM: Given that the noise in the categorical variable is uniform over $[K]\setminus\{z^*_i\}$, we now assume that the probability of taking any of those values is the same for all nodes (i.e., independent of $i$). This is exactly the homogeneous RPM that we consider. This assumption is easy to remove, and we can work with the inhomogeneous RPM that allows this probability to depend on $i$.

We do not lose anything in Step 1. Steps 3 and 4 are simplifying steps that are taken for ease of understanding and presentation. The main assumption is Step 2, the approximation by the product of marginals. Below we provide some hard evidence that this is very reasonable in the Bayesian setting.

E.2 HARD EVIDENCE OF NEAR-INDEPENDENCE

We consider the problem of recovering the clusters in a stochastic block model (SBM), which is a random network model with latent node clusters. Figure 5 shows the sequential NMI plot for a Gibbs sampler on a non-parametric SBM (with a Dirichlet process prior on the labels). The sequential NMI means that we compute the NMI of the partition at iteration $t$ relative to that at iteration $t-1$, and the x-axis shows $t$. The plot suggests that the chain enters stationarity roughly around iteration 100.

We use the MMD-based approach, described in (19; 20; 38), to compare the posterior joint distribution of the labels to its approximate versions:

- Figure 6(a) shows the result for samples from iterations 100 to 1000 of the Gibbs sampler. This is the stationary joint distribution.
- Figure 6(b) shows the result for samples from iterations 1 to 100 of the Gibbs sampler. This is based on the transient samples (effectively, the average joint behavior over the transient period).
- Figure 6(c) shows the results on a movie rating dataset with a complicated, dependent joint distribution (that has nothing to do with the SBM).

We refer to (19; 20; 38) for the details of how these experiments are performed. The bootstrap MMD serves as the baseline; the methods whose MMD is closer to the bootstrap are better approximations of the joint distribution, and in general a smaller MMD means a better approximation of the joint. Ind Mult is exactly the approximation by independent multinomials (i.e., the product of marginals). The Copula Mult is a good joint model for the discrete multivariate distribution.

Figure 5: Sequential NMI for the SBM Gibbs sampler.

Figure 6: MMD histograms between the posterior distribution and its various approximations; panels (a), (b), (c).

We see that for the movie rating data and the transient chain, Copula Mult indeed has a much lower MMD than Ind Mult, showing that there is dependence in the joint distribution that is not captured by Ind Mult.
However, for the stationary distribution (Figure 6(a)), Ind Mult has a comparable (and even slightly smaller) MMD relative to Copula Mult, and is close to the bootstrap. This shows that the product of marginals is a good approximation in this case, and it justifies Step 2 in our reduction.

F COMPARISON WITH STOCHASTIC BLOCK MODEL

Below we outline some of the similarities and differences between the problem of cluster recovery in the SBM and the consensus clustering problem we consider in this paper. Suppose that $A$ is the adjacency matrix generated from the stochastic block model (SBM) and $X$ is the association matrix generated from the RPM.

Similarity: A larger value of $X_{ij}$ increases the likelihood of $i$ and $j$ being in the same cluster in the RPM. Similarly, $A_{ij} = 1$, i.e., the presence of an edge between $i$ and $j$, increases the likelihood of $i$ and $j$ being in the same community in the SBM. Therefore, both $X$ and $A$ can be considered proximity matrices, and we can utilize a min-cut algorithm on them to find the clusters. To approximate the min-cut algorithm, various researchers have proposed the approach of using a good initialization plus a local refinement step. This idea can be applied to many clustering problems, including community detection. In this paper, we show that it can also be applied to consensus clustering.

Difference: The entries of the adjacency matrix $A$ are independent, but the entries of the association matrix $X$ are not. Indeed, the entries in the same row of $X$ are strongly dependent. The likelihood function of $X$ is very different from that of $A$, but we can still show that a simple local refinement step outputs optimal labels. The error rate is comparable to the Bayes error rate given by the likelihood ratio test.
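To make the RPM and the association matrix $X$ concrete, the following minimal sketch (our own code, with hypothetical function names) draws $N$ label vectors from the homogeneous RPM centered at a true labeling and forms the association matrix on which the aggregation and refinement steps operate. It also checks the marginal behavior used in the proofs: off-diagonal within-cluster entries of $X$ concentrate around $(1-\xi)+\xi/K$ and between-cluster entries around $\xi/K$, with $\xi = p(2-p)$. The strong dependence among entries in the same row of $X$, noted in the Difference paragraph, comes from the fact that the whole row $X_{i\cdot}$ is driven by the $N$ labels of the single node $i$.

```python
import numpy as np

def sample_rpm(z_star, N, p, K, rng):
    """Draw N label vectors from the homogeneous RPM centered at z_star:
    each label is kept with probability 1 - p and otherwise resampled
    uniformly from {0, ..., K-1}."""
    Z = np.tile(z_star, (N, 1))
    flip = rng.random(Z.shape) < p
    Z[flip] = rng.integers(0, K, size=flip.sum())
    return Z                                   # shape (N, n)

def association_matrix(labels):
    """X[i, j] = fraction of the N clusterings that put i and j together."""
    labels = np.asarray(labels)
    return np.mean(labels[:, :, None] == labels[:, None, :], axis=0)

rng = np.random.default_rng(1)
K, n_per, N, p = 3, 60, 50, 0.2
z_star = np.repeat(np.arange(K), n_per)        # balanced true clustering of n = K * n_per objects
labels = sample_rpm(z_star, N, p, K, rng)
X = association_matrix(labels)

# Within-cluster associations concentrate near (1 - xi) + xi/K and
# between-cluster associations near xi/K, with xi = p(2 - p).
xi = p * (2 - p)
same = z_star[:, None] == z_star[None, :]
off_diag = ~np.eye(len(z_star), dtype=bool)
print('within :', X[same & off_diag].mean(), 'vs', (1 - xi) + xi / K)
print('between:', X[~same].mean(), 'vs', xi / K)
```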