# Comparing Hard and Overlapping Clusterings

Journal of Machine Learning Research 16 (2015) 2949-2997. Submitted 10/12; Revised 2/15; Published 12/15.

Danilo Horta (horta@icmc.usp.br)
Ricardo J. G. B. Campello (campello@icmc.usp.br)
Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, Campus de São Carlos, Caixa Postal 668, 13560-970, São Carlos-SP, Brazil

Editor: Marina Meila

Abstract

Similarity measures for comparing clusterings are an important component of, e.g., the evaluation of clustering algorithms, consensus clustering, and clustering stability assessment. Such measures have been studied for over 40 years in the domain of exclusive hard clusterings (exhaustive and mutually exclusive object sets). In recent years, the literature has proposed measures to handle more general clusterings (e.g., fuzzy/probabilistic clusterings). This paper provides an overview of these new measures and discusses their drawbacks. We ultimately develop a corrected-for-chance measure (13AGRI) capable of comparing exclusive hard, fuzzy/probabilistic, non-exclusive hard, and possibilistic clusterings. We prove that 13AGRI and the adjusted Rand index (ARI, by Hubert and Arabie) are equivalent in the exclusive hard domain. The reported experiments show that only 13AGRI could provide both a fine-grained evaluation across clusterings with different numbers of clusters and a constant evaluation between random clusterings, thus exhibiting all four desirable properties considered here. We identified a high correlation between 13AGRI applied to fuzzy clusterings and ARI applied to exclusive hard clusterings over 14 real data sets from the UCI repository, which corroborates the validity of the 13AGRI fuzzy clustering evaluation. 13AGRI also showed good results as a clustering stability statistic for solutions produced by the expectation maximization algorithm for Gaussian mixtures.
Implementation and supplementary figures can be found at http://sn.im/25a9h8u.

Keywords: overlapping, fuzzy, probabilistic, clustering evaluation

1. Introduction

Clustering is a task that aims to determine a finite set of categories (clusters) to describe a data set according to similarities/dissimilarities among its objects (Kaufman and Rousseeuw, 1990; Everitt et al., 2001). Several clustering algorithms are published every year, which makes the development of effective measures for comparing clusterings indispensable (Vinh et al., 2009, 2010). Clustering algorithm A is commonly considered better than B for a given data set X if A produces clusterings that are more similar (according to a similarity measure¹ for clustering) to a reference solution for X than those produced by B. Similarity measures are also used for consensus clustering, clustering stability assessment, and even for quantifying information loss (Strehl and Ghosh, 2003; Monti et al., 2003; Yu et al., 2007; Beringer and Hüllermeier, 2007; Vinh and Epps, 2009). A consensus clustering technique aims to find a high-quality clustering solution by combining several (potentially poor) solutions obtained from different methods, algorithm initializations, or perturbations of the same data set. This combination is achieved by producing a solution that shares the most information, quantified by a similarity measure, with the original solutions (Strehl and Ghosh, 2003). In the context of clustering stability assessment, the method used to generate a set of clustering solutions is considered stable if the set shows low variation, which is considered a desirable quality (Kuncheva and Vetrov, 2006).

1. Note that a dissimilarity/distance measure can always be cast into a similarity measure. For comparison purposes, we transformed dissimilarity/distance measures into similarity measures in this work.

© 2015 Danilo Horta and Ricardo J. G. B. Campello.
One can apply a clustering algorithm several times to subsamples of the original data set for each candidate number of clusters, producing a set of clusterings per number of clusters. The number of clusters for which the set of solutions is less diverse is considered a good estimate of the true number of clusters (Borgelt and Kruse, 2006; Vinh and Epps, 2009). Another interesting application of similarity measures is the quantification of information loss (Beringer and Hüllermeier, 2007). To increase efficiency (e.g., in the context of data stream clustering), one can first map the data into a low-dimensional space and then cluster the transformed data. If the transformation is almost lossless, the clustering structures in the two spaces should be highly similar; a similarity measure can be used to assess this. Several established measures are suitable for comparing exclusive hard clusterings (EHCs) (Albatineh et al., 2006; Meila, 2007; Vinh et al., 2009, 2010), i.e., clusterings in which each object exclusively belongs to one cluster. Examples of popular measures are the Rand index (RI) (Rand, 1971), the adjusted Rand index (ARI) (Hubert and Arabie, 1985), the Jaccard index (JI) (Jaccard, 1908), mutual information (Strehl and Ghosh, 2003), and variation of information (VI) (Meila, 2005). Bcubed (BC) (Bagga and Baldwin, 1998; Amigó et al., 2009) is a measure for evaluating coreferences (e.g., a set of pronouns referring to the same noun in a paragraph) in the natural language processing field. Coreferences can also be viewed as EHCs (Cardie and Wagstaff, 1999), and BC satisfies some properties, frequently regarded as desirable, that most well-known EHC measures do not (Amigó et al., 2009). Thus, we also include BC in this work.
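A minimal sketch of this subsample-stability protocol, in pure NumPy. The tiny Lloyd's k-means, the 0.8 subsample fraction, and the use of the Rand index on the objects shared by each pair of subsamples are illustrative choices of ours, not the cited authors' exact procedure:

```python
import numpy as np

def rand_index(u, v):
    """Rand index between two hard labelings u, v (1D integer arrays)."""
    same_u = u[:, None] == u[None, :]
    same_v = v[:, None] == v[None, :]
    iu = np.triu_indices(len(u), k=1)          # pairs i < j only
    return (same_u == same_v)[iu].mean()

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns hard labels."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for r in range(k):
            if (labels == r).any():
                centers[r] = X[labels == r].mean(0)
    return labels

def stability(X, k, n_runs=10, frac=0.8, seed=0):
    """Mean pairwise Rand index over clusterings of random subsamples,
    compared on the objects shared by each pair of subsamples."""
    rng = np.random.default_rng(seed)
    m = int(frac * len(X))
    runs = []
    for t in range(n_runs):
        idx = rng.choice(len(X), m, replace=False)
        labels = np.full(len(X), -1)            # -1 marks "not subsampled"
        labels[idx] = kmeans(X[idx], k, seed=t)
        runs.append(labels)
    scores = []
    for a in range(n_runs):
        for b in range(a + 1, n_runs):
            common = (runs[a] >= 0) & (runs[b] >= 0)
            scores.append(rand_index(runs[a][common], runs[b][common]))
    return float(np.mean(scores))

# Three well-separated blobs; stability is computed for several k.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.1, size=(30, 2)) for c in [(0, 0), (5, 0), (0, 5)]])
scores = {k: stability(X, k) for k in (2, 3, 4, 5)}
```

Under this protocol, the k whose score is closest to 1 would be taken as the estimate of the true number of clusters.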
There are other important clustering types, e.g., fuzzy/probabilistic clustering² (FC), non-exclusive hard clustering (NEHC), and possibilistic clustering (PC) (Campello, 2010; Anderson et al., 2010), that are not assessed using well-established measures but that would benefit from the tasks discussed above. Various EHC measure generalizations have recently appeared in the literature (Borgelt and Kruse, 2006; Campello, 2007; Anderson et al., 2010; Campello, 2010) to fill this gap. Unfortunately, all these measures exhibit critical problems that hinder their applicability. The fuzzy version of RI by Campello (2007) does not attain its maximum (i.e., 1) whenever two identical solutions are compared, which makes it difficult to convey the similarity of the compared solutions. The same issue is exhibited by other RI generalizations (Borgelt and Kruse, 2006; Ceccarelli and Maratea, 2008; Rovetta and Masulli, 2009; Brouwer, 2009; Anderson et al., 2010; Quere and Frelicot, 2011). Moreover, most of the proposed measures are not corrected for randomness, i.e., they do not provide a constant average evaluation over sets of independently generated clusterings (a constant baseline, for short). In practice, this means that these measures tend to favor clusterings with certain numbers of clusters (Vinh et al., 2009, 2010), whether the compared solutions are similar or not.

2. The usage of fuzzy or probabilistic depends on the interpretation of the object membership degrees given by the solution. Fuzzy c-means (Bezdek, 1981) and expectation maximization (EM) (Dempster et al., 1977) give a fuzzy and a probabilistic interpretation, respectively, although the solutions they produce come from the same domain of clusterings. We will hereafter call it fuzzy clustering in both cases for simplicity.
Additionally, several of the measures have a low sensitivity to differences in solution quality: close evaluation values can result from comparing very similar or very different solutions. Biclustering is also an important type of clustering solution, usually represented by a set of pairs C = {(C^e_1, C^c_1), (C^e_2, C^c_2), . . . , (C^e_k, C^c_k)}. Each pair (C^e_r, C^c_r) consists of two nonempty sets of objects of different types. In gene expression analysis, C^e_r could be the set of genes related to the experimental conditions in C^c_r (Madeira and Oliveira, 2004). In subspace clustering, C^e_r could be the set of objects related to the object features in C^c_r (Patrikainen and Meila, 2006; Günnemann et al., 2011). We do not consider this type of clustering henceforth, as it would overly extend the length and complexity of this work. Moreover, a biclustering can always be converted to an NEHC (Patrikainen and Meila, 2006), which is one of the scenarios we investigate here. We first develop an RI generalization, called the frand index (13FRI),³ to handle FCs. We then develop the adjusted frand index (13AFRI) by correcting 13FRI for randomness. Although the assumed randomness model is apparently unrelated to that assumed for ARI (Hubert and Arabie, 1985), we prove that 13AFRI and ARI are different formulations of the same measure in the EHC domain. Finally, we also extend the 13FRI and 13AFRI measures to the more general domain of PCs (which includes the NEHC, FC, and EHC solutions as special cases; Section 3), resulting in the grand index (13GRI) and the adjusted grand index (13AGRI), respectively. We define four clearly desirable properties that a good similarity measure should display. Under this framework, our proposed measures are empirically compared in two experiments with 32 others, of which 28 are measures proposed in recent years to handle more general clusterings than EHCs.
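The biclustering-to-NEHC conversion mentioned above can be sketched as follows. This is a hedged reading of the idea attributed to Patrikainen and Meila (2006): each bicluster contributes its object set C^e_r as one (possibly overlapping) hard cluster, and the feature/condition sets C^c_r are dropped. The function name and representation below are ours, not the paper's:

```python
import numpy as np

def biclusters_to_nehc(biclusters, n_objects):
    """Convert a biclustering {(Ce_r, Cc_r)} into a non-exclusive hard
    clustering matrix U (k x n): U[r, i] = 1 iff object i is in Ce_r.
    The feature/condition sets Cc_r are discarded in this NEHC view."""
    U = np.zeros((len(biclusters), n_objects), dtype=int)
    for r, (objects, _features) in enumerate(biclusters):
        U[r, list(objects)] = 1
    return U

# Two biclusters over 6 objects that overlap on object 2.
B = [({0, 1, 2}, {"f1", "f2"}), ({2, 3, 4, 5}, {"f2", "f3"})]
U = biclusters_to_nehc(B, 6)
```

Note that the result is a valid NEHC only when every object is covered by some bicluster; otherwise constraint (1c) of Section 3 would be violated and uncovered objects would need separate handling (e.g., a noise cluster).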
In the first experiment, several of the measures could not distinguish solutions that are close to the reference solution from those that are far from it in terms of the number of clusters. 13AGRI presented an evident, desirable sensitivity over the ranges of the numbers of clusters. In the second experiment, 13AGRI was the only measure that exhibited a constant baseline in all scenarios of randomly generated exclusive hard, fuzzy, non-exclusive hard, and possibilistic clusterings. We applied 13AGRI and ARI to evaluate fuzzy c-means (Bezdek, 1981) and k-means (MacQueen, 1967) solutions, respectively, over 14 real data sets from the UCI repository (Newman and Asuncion, 2010). We argue that the high correlation found between the 13AGRI and ARI evaluations is an indication of the appropriateness of 13AGRI for evaluating FCs. 13AGRI is also assessed as a stability statistic for FCs produced by the expectation maximization for Gaussian mixtures (EMGM) (Dempster et al., 1977) algorithm. The remainder of the paper is organized as follows. Section 2 discusses the evaluation of similarity measures and establishes four desirable properties. Section 3 sets the background of the work and reviews the measures proposed in the past years to tackle more general clusterings than EHCs. Section 4 presents the 13FRI measure for handling FCs, develops a corrected-for-chance version of 13FRI named 13AFRI, and explains why 13FRI and 13AFRI are not suitable for comparing every type of PC. Section 5 proposes the 13GRI and 13AGRI measures by addressing the issue that prevented 13FRI and 13AFRI from being appropriately applied to PCs.

3. The number 13 is a reminder of the publication year of the measure (2013). We use a reminder in front of each measure acronym, except for RI, ARI, JI, and BC. This helps us identify the recently proposed measures.
Section 6 deduces the asymptotic computational complexity of 13FRI, 13AFRI, 13GRI, and 13AGRI and introduces an efficient algorithm to calculate the expectations used by 13AFRI and 13AGRI. Section 7 presents four experiments, the first two of which empirically evaluate the measures according to the four desirable properties. The first experiment (Section 7.1) assesses how the measures behave when comparing solutions produced by clustering algorithms with reference solutions across a range of numbers of clusters. The second experiment (Section 7.2) assesses the ability of the measures to provide unbiased evaluations in several scenarios. The third experiment (Section 7.3) compares 13AGRI and ARI evaluations of fuzzy and exclusive hard clusterings on 14 real data sets. The fourth experiment (Section 7.4) uses 13AGRI as a stability statistic for FC assessment on five real data sets. Section 8 discusses the criteria adopted to evaluate and compare the measures. Section 9 concludes the work, and the Appendix proves some properties of our measures.

2. Desirable Measure Properties

Evaluating a measure for comparing clusterings is a difficult task, partly because different applications may require different perspectives regarding the similarity between clusterings, and partly because there is no universally accepted set of properties that a measure for comparing clusterings must have. It is often the case that a measure is modified to comply with a set of desirable properties but, as a side effect, loses another set of desirable properties that it previously had. This is the case of variation of information (Meila, 2005) and its corrected-for-chance version developed in (Vinh et al., 2009, 2010), where the latter gives up the metric property to gain the property of displaying constant baseline evaluations for randomly generated solutions. There is even a result stating that no sensible measure for comparing clusterings can simultaneously satisfy three particular desirable properties (Meila, 2005).
In order to evaluate the usefulness of our proposed measure, we compare it with the ones found in the literature with respect to four clearly desirable properties. These properties have been chosen because they are appealing from a practical perspective and, together, they can unveil flaws of several existing measures according to well-established intuitions. The properties are defined as follows:

Maximum. A measure is said to obey this property if it attains its known maximum value whenever two equivalent solutions are compared. The maximum has to be invariant to the data set as well.

Discriminant. A good measure must be able to detect the best solution among a given set of solutions.

Contrast. A good measure must provide progressively better (or worse) evaluations for progressively better (or worse) solutions.

Baseline. A measure that has a predetermined expected value over randomly generated solutions is said to have the baseline property (also, adjusted for chance).

It is common practice to have the maximum equal to 1 and the baseline value equal to 0, such that having the maximum property means that the measure attains 1 when comparing two equivalent solutions, and having the baseline property means that comparing randomly generated solutions tends to give evaluations close to zero. A measure having a known maximum that is always attained when two equivalent solutions are compared provides an objective goal (i.e., producing a clustering that attains that score) and assures the user that a better solution can be found whenever the evaluation is lower than the maximum. When a measure does not have a fixed maximum, comparisons between evaluations of clusterings generated from different data sets may be misguided because of the different extents to which variation is possible (Luo et al., 2009). As mentioned by Vinh et al.
(2010), the fact that all of the 22 different pair-counting-based measures discussed in (Albatineh et al., 2006) are normalized to have a known maximum further stresses the particular interest of the clustering community in this property. A measure may not attain its predefined maximum for the ideal solution but still be able to detect the best solution among a set of non-ideal solutions. Such a measure is said to have the discriminant property. This property definition naturally prompts the question "How can I know that a given solution is better than another one?", which is precisely what the measure tries to answer in the first place. However, there is one situation where the answer is unquestionable: any reasonable measure should evaluate the ideal solution (i.e., the one equivalent to the reference solution) as being superior to the others. If a measure somehow evaluates a given solution better than the reference one, it is clearly flawed as a similarity measure. We propose the contrast property because we observed in preliminary experiments that some measures would give flat evaluations over solutions progressively farther from the reference one. This behavior can be problematic when such a measure is used for assessing clustering algorithms with similar accuracy, as the measure might not be sensitive enough to capture any difference. The contrast property is also related to the useful range of a measure (Fowlkes and Mallows, 1983; Wu et al., 2009; Vinh et al., 2010). A measure can have known upper and lower bounds, but in practice its evaluations can be spread over only a small fraction of that range. As an example, for a given number of objects n, RI attains the maximum 1 for two equivalent clusterings and the minimum 0 when comparing a clustering having one cluster with a clustering having n clusters.
However, it has been reported that RI provides evaluations almost always above 0.5, even when comparing randomly generated clusterings (Fowlkes and Mallows, 1983; Wu et al., 2009). Knowing beforehand the useful range (i.e., the range within which the evaluations will fall in real applications) certainly increases the intuitiveness of the measure. The maximum property can be mathematically proved for each measure, but the other properties can only be experimentally assessed and/or disproved. The discriminant and contrast properties are somewhat subjective, but a measure that evaluates the ideal solution worse than another solution clearly does not comply with them. The baseline property does not specify a particular model for randomly generating solutions (and we believe that specifying one would be artificial). We thus empirically evaluate the measures regarding this property over different models of randomly generating solutions.

| U\V       | V_{1,:}   | V_{2,:}   | ... | V_{k_V,:}   | Sums      |
|-----------|-----------|-----------|-----|-------------|-----------|
| U_{1,:}   | N_{1,1}   | N_{1,2}   | ... | N_{1,k_V}   | N_{1,+}   |
| U_{2,:}   | N_{2,1}   | N_{2,2}   | ... | N_{2,k_V}   | N_{2,+}   |
| ...       | ...       | ...       | ... | ...         | ...       |
| U_{k_U,:} | N_{k_U,1} | N_{k_U,2} | ... | N_{k_U,k_V} | N_{k_U,+} |
| Sums      | N_{+,1}   | N_{+,2}   | ... | N_{+,k_V}   | N_{+,+}   |

Table 1: Contingency table.

3. Background and Related Work

Let X = {x_1, x_2, . . . , x_n} be a data set with n objects. A clustering solution with k clusters can be represented by a matrix U = [U_{r,i}] ∈ R^{k×n}, where U_{r,i} expresses the membership degree of x_i to the r-th cluster and U satisfies the following properties:

0 ≤ U_{r,i} ≤ 1 (∀ r ∈ N_{1,k} and i ∈ N_{1,n}), (1a)
0 < Σ_{i=1}^{n} U_{r,i} (∀ r ∈ N_{1,k}), and (1b)
0 < Σ_{r=1}^{k} U_{r,i} (∀ i ∈ N_{1,n}). (1c)

We say that U ∈ M_p = {U ∈ R^{k×n} | U satisfies Equations (1)} is a possibilistic clustering (PC). By adding more constraints, three other clustering types emerge: U ∈ M_f = {U ∈ M_p | Σ_{r=1}^{k} U_{r,i} = 1 ∀ i} is a fuzzy/probabilistic clustering (FC), U ∈ M_neh = {U ∈ M_p | U_{r,i} ∈ {0, 1} ∀ r, i} is a non-exclusive hard clustering (NEHC), and U ∈ M_eh = M_f ∩ M_neh is an exclusive hard clustering (EHC) (Campello, 2010; Anderson et al., 2010).
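The constraints (1a)-(1c) and the extra conditions defining M_f, M_neh, and M_eh translate directly into membership-matrix checks. A minimal sketch (function names and tolerances are ours; U is k×n, clusters in rows):

```python
import numpy as np

def is_pc(U, tol=1e-12):
    """Possibilistic clustering: Equations (1a)-(1c)."""
    U = np.asarray(U, dtype=float)
    return bool(
        ((U >= -tol) & (U <= 1 + tol)).all()   # (1a) memberships in [0, 1]
        and (U.sum(axis=1) > tol).all()        # (1b) no empty cluster (rows)
        and (U.sum(axis=0) > tol).all()        # (1c) every object covered (columns)
    )

def is_fc(U, tol=1e-9):
    """Fuzzy/probabilistic: each object's memberships sum to 1."""
    return is_pc(U) and np.allclose(np.asarray(U).sum(axis=0), 1.0, atol=tol)

def is_nehc(U):
    """Non-exclusive hard: memberships restricted to {0, 1}."""
    return is_pc(U) and bool(np.isin(np.asarray(U), (0, 1)).all())

def is_ehc(U):
    """Exclusive hard = fuzzy AND non-exclusive hard (M_eh = M_f ∩ M_neh)."""
    return is_fc(U) and is_nehc(U)

U_ehc = np.array([[1, 1, 0], [0, 0, 1]])              # hard partition
U_fc = np.array([[0.7, 0.2, 0.0], [0.3, 0.8, 1.0]])   # columns sum to 1
U_pc = np.array([[0.9, 0.2, 0.0], [0.3, 0.9, 0.5]])   # possibilistic only
```

The nesting M_eh ⊂ M_f ⊂ M_p (and M_eh ⊂ M_neh ⊂ M_p) follows by construction: each predicate only adds a condition on top of `is_pc`.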
Note that M_eh ⊂ M_f, M_eh ⊂ M_neh, M_f ⊂ M_p, and M_neh ⊂ M_p (Figure 1). The set M_p of all PCs covers the other sets, and a measure for this domain is applicable to virtually every type of clustering present in the literature.

Figure 1: Venn diagram representing the relationship between clustering domains. [Diagram omitted.]

We believe that the most popular measures for comparing EHCs are those based on pair counting, including ARI and JI. A common approach to compute these measures begins by obtaining a contingency matrix (Albatineh et al., 2006). Let U and V be two EHCs with k_U and k_V clusters, respectively, of the same data set of n objects. Table 1 defines their contingency table, where N = UVᵀ is the contingency matrix and N_{r,t} is the number of objects that simultaneously belong to the r-th cluster of U and the t-th cluster of V. The marginal totals N_{+,t} = Σ_{r=1}^{k_U} N_{r,t} and N_{r,+} = Σ_{t=1}^{k_V} N_{r,t} yield the cluster sizes, and the grand total N_{+,+} = Σ_{r,t} N_{r,t} = n yields the number of objects in the data set. The contingency matrix is then used to calculate the pairing variables a (the number of object pairs in the same cluster in both U and V), b (the number of object pairs in the same cluster in U but in different clusters in V), c (the number of object pairs in different clusters in U but in the same cluster in V), and d (the number of object pairs in different clusters in both U and V) (Jain and Dubes, 1988; Albatineh et al., 2006):

a = (Σ_{r,t} N_{r,t}² − N_{+,+}) / 2, (2a)
b = (Σ_{r} N_{r,+}² − Σ_{r,t} N_{r,t}²) / 2, (2b)
c = (Σ_{t} N_{+,t}² − Σ_{r,t} N_{r,t}²) / 2, and (2c)
d = N_{+,+}(N_{+,+} − 1)/2 − (a + b + c) = (N_{+,+}² − Σ_{r} N_{r,+}² − Σ_{t} N_{+,t}² + Σ_{r,t} N_{r,t}²) / 2. (2d)

Albatineh et al. (2006) list 22 measures based on pair counting defined solely using a, b, c, and d. For example, JI and RI are respectively defined as

JI(U, V) = a / (a + b + c) and (3)
RI(U, V) = (a + d) / (a + b + c + d). (4)

ARI is defined as (Hubert and Arabie, 1985)⁴

ARI(U, V) = [a − (a + c)(a + b)/(a + b + c + d)] / [((a + c) + (a + b))/2 − (a + c)(a + b)/(a + b + c + d)]. (5)
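Equations (2)-(5) can be computed directly from two hard label vectors. A sketch assuming labels encoded as integer arrays (the encoding and function names are ours):

```python
import numpy as np

def pair_counts(u, v):
    """a, b, c, d of Equations (2), given hard label vectors u and v."""
    N = np.zeros((u.max() + 1, v.max() + 1))
    np.add.at(N, (u, v), 1)                  # contingency matrix N = U V^T
    n = len(u)
    s = (N ** 2).sum()
    a = (s - n) / 2                           # (2a)
    b = ((N.sum(1) ** 2).sum() - s) / 2       # (2b)
    c = ((N.sum(0) ** 2).sum() - s) / 2       # (2c)
    d = n * (n - 1) / 2 - a - b - c           # (2d)
    return a, b, c, d

def rand_index(u, v):
    a, b, c, d = pair_counts(u, v)
    return (a + d) / (a + b + c + d)          # Equation (4)

def jaccard_index(u, v):
    a, b, c, _ = pair_counts(u, v)
    return a / (a + b + c)                    # Equation (3)

def adjusted_rand_index(u, v):
    a, b, c, d = pair_counts(u, v)            # Equation (5)
    total = a + b + c + d
    expected_a = (a + b) * (a + c) / total
    max_a = ((a + b) + (a + c)) / 2
    return (a - expected_a) / (max_a - expected_a)

u = np.array([0, 0, 0, 1, 1, 2])
v = np.array([0, 0, 1, 1, 2, 2])
```

For these two labelings, N has row sums (3, 2, 1) and column sums (2, 2, 2), giving a = 1, b = 3, c = 2, d = 9, hence RI = 10/15 and JI = 1/6.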
As an alternative to the contingency matrix, one can define the pairing variables by employing the co-association matrices J_U = UᵀU and J_V = VᵀV (Zhang et al., 2012). When U and V are EHCs, the above definition amounts to

[J_U]_{i,j} = 1 if ∃ r such that U_{r,i} = 1 and U_{r,j} = 1, and 0 otherwise. (6)

The pairing variables can be rewritten as⁵ a = Σ_{i<j} [J_U]_{i,j} [J_V]_{i,j}, with analogous expressions for b, c, and d. [The remainder of this passage was lost to extraction.]

[Algorithm listing for the expectation E[ã]_{U,V} (Section 6), partially reconstructed; the initialization lines (1-6) were lost to extraction and are inferred as: sort x and y in ascending order, E[ã]_{U,V} ← 0, and i, j ← m, m.]

7: while i > 0 do
8:   while j > 0 and x_i ≤ y_j do
9:     j ← j − 1
10:  end while
11:  E[ã]_{U,V} ← E[ã]_{U,V} + (m − j) · x_i
12:  i ← i − 1
13: end while
14: i, j ← m, m
15: while j > 0 do
16:   while i > 0 and x_i > y_j do
17:     i ← i − 1
18:   end while
19:   E[ã]_{U,V} ← E[ã]_{U,V} + (m − i) · y_j
20:   j ← j − 1
21: end while
22: E[ã]_{U,V} ← E[ã]_{U,V} / m

A measure must somehow adequately evaluate the similarity between the compared solutions. Section 7.1 follows this idea and compares 34 measures by applying them to evaluate solutions with different numbers of clusters produced by different clustering algorithms. This comparison is done by considering the first three properties proposed in Section 2: maximum, discriminant, and contrast. Synthetic data sets were generated according to the cluster types that these algorithms search for (e.g., it is well known that k-means (MacQueen, 1967) tends to produce spherical-like clusters), and the reference solution for each data set was defined by applying the corresponding clustering algorithm with a well-tuned initial solution. In this scenario, it is then expected that the dissimilarity between the generated and reference solutions will reflect the difference in the numbers of clusters. In a different scenario, Section 7.2 compares the measures when evaluating randomly generated solutions, assessing them according to the baseline property proposed in Section 2. A measure should display a uniform evaluation across the range of numbers of clusters because any resemblance between the compared solutions is only due to chance.
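The two-pointer expectation routine in the pseudocode above appears to compute, after sorting, E[ã]_{U,V} = (1/m) Σ_{i,j} min(x_i, y_j): the first loop adds each x_i once per y_j ≥ x_i, and the second adds each y_j once per x_i > y_j. That reading of the routine is our assumption; a sketch cross-checked against a brute-force double loop:

```python
import numpy as np

def expected_a_bruteforce(x, y):
    """Direct O(m^2) evaluation of (1/m) * sum_{i,j} min(x_i, y_j)."""
    m = len(x)
    return sum(min(xi, yj) for xi in x for yj in y) / m

def expected_a(x, y):
    """Two-pointer version after sorting, O(m log m), mirroring the
    structure of the pseudocode (0-indexed here)."""
    m = len(x)
    x, y = np.sort(x), np.sort(y)
    e = 0.0
    i, j = m - 1, m - 1
    while i >= 0:                      # pairs where x_i is the minimum
        while j >= 0 and x[i] <= y[j]:
            j -= 1
        e += (m - 1 - j) * x[i]        # number of y_j >= x[i]
        i -= 1
    i, j = m - 1, m - 1
    while j >= 0:                      # pairs where y_j is the strict minimum
        while i >= 0 and x[i] > y[j]:
            i -= 1
        e += (m - 1 - i) * y[j]        # number of x_i > y[j]
        j -= 1
    return e / m

rng = np.random.default_rng(0)
x, y = rng.random(15), rng.random(15)
val_fast = expected_a(x, y)
val_slow = expected_a_bruteforce(x, y)
```

The two pointers only ever move downward, so the total work of the paired loops is O(m) once x and y are sorted.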
Section 7.3 assesses the validity of the 13AGRI evaluation for FCs on 14 real data sets, and Section 7.4 uses 13AGRI as a stability statistic for estimating the number of clusters in five real data sets. Because 13GRI (13AGRI) is more general and becomes equivalent to 13FRI (13AFRI) when applied to FCs, we only show the results of 13GRI (13AGRI).

7.1 Measuring the Similarity Between Clusterings

We evaluated the measures on four synthetic data sets (Figure 4), each suitable for one of the following clustering types: EHC, FC, NEHC, and PC. The DEHC data set (Figure 4(a)) has nine well-separated clusters, whereas the DFC data set (Figure 4(b)) has nine overlapping clusters. In both data sets, the clusters were generated using Gaussian distributions with equal variances and no correlation between the attributes. The DNEHC data set (Figure 4(c)) has four clusters, but they reduce to two clusters when projected onto a single axis.¹⁴ We generated the DPC data set (Figure 4(d)) to resemble a synthetic one (Zhang and Leung, 2004) with noise added.

Figure 4: Data set for each clustering type: (a) data set for EHC (DEHC); (b) data set for FC (DFC); (c) data set for NEHC (DNEHC); (d) data set for PC (DPC). [Scatter plots omitted.]

Different clustering algorithms were employed for each data set, appropriate for the corresponding clustering type, as follows: k-means for DEHC, fuzzy c-means (FCM) and expectation maximization for Gaussian mixtures (EMGM) (Dempster et al., 1977) for DFC, SUBCLU (Kailing et al., 2004) for DNEHC, and improved possibilistic c-means 2 (IPCM2) (Zhang and Leung, 2004) for DPC. The FCM and IPCM2 exponent m was set to 2 (which is commonly adopted in the literature), the SUBCLU parameter minpts was set to 5, and the Euclidean norm was adopted; this same configuration was used in all the experiments reported in this work.
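The Gaussian construction described for DEHC and DFC can be sketched as follows. The 3×3 grid of centers, the per-cluster size, and the σ values below are illustrative assumptions of ours, not the paper's exact parameters:

```python
import numpy as np

def make_gaussian_blobs(centers, n_per_cluster, sigma, seed=0):
    """Spherical Gaussian clusters with equal variance and uncorrelated
    attributes, as described for the DEHC and DFC data sets."""
    rng = np.random.default_rng(seed)
    X, labels = [], []
    for r, c in enumerate(centers):
        X.append(rng.normal(loc=c, scale=sigma, size=(n_per_cluster, len(c))))
        labels += [r] * n_per_cluster
    return np.vstack(X), np.array(labels)

# Nine centers on a 3x3 grid: a small sigma yields well-separated clusters
# (DEHC-like), while a larger sigma yields overlapping ones (DFC-like).
centers = [(i, j) for i in (0, 3, 6) for j in (0, 3, 6)]
X, y = make_gaussian_blobs(centers, n_per_cluster=50, sigma=0.3)
```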
The reference solution for each combination of data set and clustering algorithm (i.e., (DEHC, k-means), (DFC, FCM), (DFC, EMGM), (DNEHC, SUBCLU), and (DPC, IPCM2)) was produced by applying the clustering algorithm with the right number of clusters (or a well-tuned epsilon for SUBCLU), and the result was analyzed to ensure that the solution could be considered ideal in the clustering space sought by the corresponding algorithm. For example, we applied k-means to DEHC with k = 9 clusters, using the means of the Gaussian distributions (used to generate the clusters) as the initial centroids. The final solution had virtually the same initial centroids, corroborating the validity of the obtained solution. It is worth noting that we are not suggesting that the considered clustering algorithms are unsuitable for the data sets to which they were not applied. For example, FCM can easily find the clustering structure in DEHC, just as IPCM2 can find the clustering structure in DFC. What is most important is that each data set has a clustering structure suitable for the clustering algorithm being applied.

14. The other data sets could have a similar interpretation as well. However, we only consider subspaces in this specific data set.

Figure 5: EHC measure evaluations of k-means solutions for the DEHC data set. [Three panels plotting clustering similarity against the number of clusters; the legends list (a) RI, 07CRI, 08BRIp, 08BRIm, 09RI, 09BRI, 10QRIp, 10QRIm, 10ARI, 10ARIn, 11ARInm, 11D2, 13GRI, 03MI, BC, 09EBC, and 03VI; (b) ARI, 07CARI, 09BARI, 10AARI, 10AARIn, 11AARInm, 13AGRI, JI, and 10CSI; and (c) 10CF.]

The algorithms k-means, FCM, EMGM, and IPCM2 were applied 30 times for each number of clusters k ∈ {2, 3, . . .
, √n} (the literature commonly adopts the upper threshold √n as a rule of thumb (Pal and Bezdek, 1995; Pakhira et al., 2005)), and SUBCLU was applied 30 times for each epsilon in the range {0.1, 0.2, . . . , 5.0}. The measures were applied to each solution, and only the highest (i.e., the "best") values attained for each k or epsilon by a given measure were retained to generate the plots in Figures 5, 6, 7, 8, and 9. We opted to plot the highest values instead of averages because we are interested in the solutions that are as close as possible to the reference one for a given number of clusters (or epsilon), and to make the results as independent as possible of the stochastic nature of the algorithms. Measures showing the same values were joined and represented by a single curve, and multiple figures for the same experiments were plotted for visualization purposes. Figure 5 shows that most generalized measures displayed the same results as RI or ARI when evaluating EHCs. This is expected because most of these measures were defined by extending the variables behind the RI or ARI formulations. For example, the 07CRI measure is a fuzzy version of RI in which the pairing variables a, b, c, and d were defined using fuzzy sets. When applied to EHCs, 07CRI reduces to RI (Campello, 2007). RI, 09HI, 10CFn, 12DB, and the measures that showed the same results as RI were weakly affected by a positive difference between the obtained and the true numbers of clusters.

Figure 6: FC measure evaluations of FCM solutions for the DFC data set. [Three panels plotting clustering similarity against the number of clusters; the legends list (a) 08BRIp, 09RI, 10ARI, 10ARIn, 05MI, 09BRI, and 10QRIp; (b) 08BRIm, 10QRIm, 10AARI, 10AARIn, and 03MI; and (c) 10CF.]
RI is equal to 1 and 0.94 for the solutions with 9 and 30 clusters, respectively, a difference of less than 10% of its total range [0, 1]. This weak responsiveness to the number of clusters makes it difficult to decide whether the solution at hand is really good or not (weak contrast property). 09CRI exhibited an increasing evaluation across the numbers of clusters, and 09CARI produced scores close to zero only. In fact, 09CARI resulted in evaluations close to zero in every scenario in this section. Conversely, JI, ARI, BC, 09EBC, 10CSI, 11MD, and the measures that showed the same results as ARI (including 13AGRI, proposed here) exhibited a steady decrease for high numbers of clusters. We believe that this more prominent responsiveness to differences in the clusterings is more intuitively appealing. 10CF (Figure 5(c)) attained the maximum 1 for the right number of clusters.

Figure 7: FC measure evaluations of EMGM solutions for the DFC data set. [Three panels plotting clustering similarity against the number of clusters; the legends list (a) 08BRIp, 09RI, 10ARI, 10ARIn, 05MI, 09BRI, and 10QRIp; (b) 08BRIm, 10QRIm, 10AARI, 10AARIn, and 03MI; and (c) 10CF.]

Figure 6 shows FC measure evaluations of FCM solutions for the DFC data set. Only 13AGRI and 11AARInm both provided the maximum value 1 for the true number of clusters and showed steadily decreasing evaluations as the positive difference between the obtained and true numbers of clusters increased. 09HI was 1 for the true number of clusters, but it showed an asymptotic-like curve for high numbers of clusters. 03VI, 08BRIp, 09RI, 09CRI, 09CARI, 10ARI, and 10ARIn could not indicate the reference solution. Figure 7 displays EMGM solution evaluations for the DFC data set. 07CRI, 08BRIp, 08BRIm, 09CRI, 09CARI, 09RI, 09BRI, 10QRIp, 10QRIm, 10ARI, 10ARIn, and 11ARInm could not indicate the true number of clusters.
09HI, 10CFn, 11AARInm, 13GRI, and 13AGRI attained their maxima 1 for the right number of clusters. However, 10CFn and 13GRI showed little to no evaluation change over the solutions with more than k = 9 clusters (low contrast). 10CF attained 0.92 for the right number of clusters.

Figure 8: NEHC measure evaluations of SUBCLU solutions for the DNEHC data set. [Panels plotting clustering similarity against epsilon; one legend lists 09BRI and 10QRIp, with further panels for 09HI, 11D2, 10CF, and 10ARI.]

Figure 8, in which NEHCs are evaluated, shows only the range {0.1, 0.2, . . . , 2.1} of epsilons, as the results from 1.4 to 5.0 are identical. The reference solution has 8 clusters: 4 from the data on the plane, 2 from the data projected onto the x axis, and 2 from the data projected onto the y axis (Figure 4(c)). Figure 8(b) indicates the number of clusters found for each epsilon. SUBCLU generates the reference solution only for epsilons from 0.4 to 1.0 (we know this by inspection), and most measures yield their highest score in this interval. 07CRI, 09CRI, 10ARI, and 10AARI judged the solution with an epsilon of 0.1 to be the best one. Most of the measures identified the correct solutions, but only 09EBC, 09HI, 10CSI, 10CF, 10CFn, 11AARInm, 11MD, 11D2, 13GRI, and 13AGRI attained their maxima 1 for these solutions. 11AARInm and 13AGRI rapidly approached zero for non-optimal epsilons. In Figure 9, 13GRI and 13AGRI exhibited a steep fall in the evaluations and a peak of 1 at the true number of clusters. The DPC data set has only 3 clusters, while the others have 9 (DEHC and DFC) or 8 (DNEHC) clusters. A steeper curve is therefore expected.
07CRI, 09HI, 09BRI, 10QRIp, 10QRIm, 10ARI, 10ARIn, and 11ARInm provided high evaluations for a wide range of numbers of clusters.

Figure 9: PC measure evaluations of IPCM2 solutions for the DPC data set. (Panels plot clustering similarity against the number of clusters, 0 to 30; panel legends: (a) 10ARIn, 11ARInm, 07CRI; (b) 10AARIn, 11AARInm, 07CARI, 09BRI, 10QRIp; (c) 10CF.)

Measures 10ARI and 09CARI could not discriminate between the solutions, and 09CRI could not indicate the true number of clusters. 10CFn showed an increasing evaluation for solutions with more than k = 5 clusters. 10CF indicated the right number of clusters in Figure 9(c), though it did not evaluate it as the maximum 1 (it was evaluated as 0.92).

Table 3 summarizes the results by indicating with a k the measures that identified the reference clustering (discriminant property) and with a 1 the measures that attained their maxima for the reference solution (maximum property). 09HI, 10CFn, 11AARInm, 13GRI, and 13AGRI are the only measures that displayed both properties in every scenario. However, 09HI, 10CFn, and 13GRI presented a poor sensitivity to solution variations in most cases (e.g., Figures 5(a) and 5(b)), and 10CFn showed an increasing evaluation for progressively worse solutions (Figure 9(a)). 11AARInm and 13AGRI identified the reference solution, attained their maxima 1 for the reference clustering, and were sensitive to the difference in the numbers of clusters in all scenarios.
    Measures    EHC   FC(FCM)  FC(EMGM)  NEHC  PC
    JI          k/1   -        -         -     -
    RI          k/1   -        -         -     -
    ARI         k/1   -        -         -     -
    BC          k/1   -        -         -     -
    03MI        k/1   k/       k/        -     -
    05MI        k/1   k/       k/        -     -
    03VI        k/1   /        k/        -     -
    07CRI       k/1   k/       /         /     k/
    07CARI      k/1   k/       k/        /     k/
    08BRIp      k/1   /        /         -     -
    08BRIm      k/1   k/       /         -     -
    09EBC       k/1   -        -         k/1   -
    09CRI       /     /        /         /     /
    09CARI      /     /        /         /     /
    09HI        k/1   k/1      k/1       k/1   k/1
    09RI        k/1   /        /         -     -
    09BRI       k/1   k/       /         k/    k/
    09BARI      k/1   k/       k/        k/    k/
    10QRIp      k/1   k/       /         k/    k/
    10QRIm      k/1   k/       /         k/    k/
    10ARI       k/1   /        /         /     /
    10AARI      k/1   k/       k/        /     k/1
    10ARIn      k/1   /        /         /     k/1
    10AARIn     k/1   k/       k/        /     k/1
    10CSI       k/1   -        -         k/1   -
    10CF        k/1   k/       k/        k/1   k/
    10CFn       k/1   k/1      k/1       k/1   k/1
    11ARInm     k/1   k/1      /         /     k/1
    11AARInm    k/1   k/1      k/1       k/1   k/1
    11MD        k/1   -        -         k/1   -
    11D2        k/1   -        -         k/1   -
    13GRI       k/1   k/1      k/1       k/1   k/1
    13AGRI      k/1   k/1      k/1       k/1   k/1
    12DB        k/1   -        -         /     -

A k entry means that the measure identified the reference clustering, and a 1 entry means that the measure attained its maximum 1 for the identified reference clustering. A cell with - denotes that the measure was not developed for the corresponding clustering type.

Table 3: Maximum and discriminant properties displayed by the measures.

7.2 Comparing Randomly Generated Clusterings

The experiment in this section is based on a previously published one (Vinh et al., 2009, 2010) that assessed the ability of proposed EHC measures (based on information theory) to yield a constant baseline for randomly generated solutions. For a particular clustering type (EHC, FC, NEHC, or PC), random model (uniform, beta, unbalanced, or unbalanced-beta), 2-tuple (n, k*), and k ∈ {2, 3, ..., 2k*}, we generated 30 clustering pairs with n objects. Each pair contains a clustering with k clusters (representing an obtained solution) and a clustering with k* clusters (representing a reference solution).
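A sketch of this pair-generation protocol for the EHC uniform model follows. Resampling until every cluster is non-empty is our assumption (the text only states that objects are assigned uniformly), and the function names are ours:

```python
import random

def random_ehc(n, k, rng):
    """Uniform model: each object is independently drawn into one of k clusters.
    We resample until all k clusters are non-empty (our assumption)."""
    while True:
        labels = [rng.randrange(k) for _ in range(n)]
        if len(set(labels)) == k:
            return labels

def random_ehc_pairs(n, k_true, rng, pairs_per_k=30):
    """For each k in {2, ..., 2*k_true}, build 30 (obtained, reference) pairs:
    the obtained clustering has k clusters, the reference has k_true clusters."""
    return {k: [(random_ehc(n, k, rng), random_ehc(n, k_true, rng))
                for _ in range(pairs_per_k)]
            for k in range(2, 2 * k_true + 1)}
```

Averaging a similarity measure over the 30 pairs at each k then yields one point of the baseline curves discussed below.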
We used four combinations of the number of objects and the true number of clusters: (n = 25, k* = 5), (n = 100, k* = 5), (n = 50, k* = 10), and (n = 200, k* = 10). The random models used to generate the clusterings depended on the clustering type as follows.

For EHC, we generated clusterings for both the uniform and unbalanced models. In the uniform model, each object was uniformly assigned to one cluster. In the unbalanced model, each object was assigned to one cluster according to the following distribution: $p_1 = 0.1/k$ and $p_j = p_{j-1} + \alpha$ s.t. $\sum_{j=1}^{k} p_j = 1$ (which implies $\alpha = 1.8/(k(k-1))$), where $p_j$ is the probability of assigning an object to the jth cluster.

For FC, we generated clusterings for the uniform, beta, and unbalanced-beta models. Let $X^u_r$ be a random variable distributed according to the uniform distribution U(0, 1). For the uniform model, object $x_i$ has a degree of membership to the rth cluster distributed according to $X^u_r/(X^u_1 + X^u_2 + \cdots + X^u_k)$, where k is the number of clusters. For the beta model, we uniformly draw $r_i \in \mathbb{N}_{1,k}$ for each object $x_i$ to indicate the cluster to which $x_i$ probably has the highest degree of membership. Formally, let $X^b_r$ and $Y^b$ be two random variables distributed according to the beta distributions Be(1, 5) and Be(5, 1), respectively. Object $x_i$ has a degree of membership to the rth cluster ($r \neq r_i$) distributed according to $X^b_r/(X^b_1 + \cdots + X^b_{r_i-1} + Y^b + X^b_{r_i+1} + \cdots + X^b_k)$ and to the $r_i$th cluster distributed according to $Y^b/(X^b_1 + \cdots + X^b_{r_i-1} + Y^b + X^b_{r_i+1} + \cdots + X^b_k)$. The unbalanced-beta model is equal to the beta model except that $r_i = 1$, such that the first cluster receives most of the membership.

For NEHC, we generated clusterings for both the uniform and unbalanced models. In the uniform model, each object $x_i$ was uniformly assigned to $k_i \in \mathbb{N}_{1,k}$ clusters, where $k_i$ was uniformly drawn. In the unbalanced model, each object $x_i$ was assigned to $k_i \in \mathbb{N}_{1,k}$ clusters according to the following method.
Object $x_i$ is assigned to a cluster according to the distribution p as in the EHC unbalanced model. The distribution p is then adjusted such that the cluster already drawn (say, the jth cluster) will not be selected again for $x_i$ (i.e., $p_j \leftarrow 0$) and normalized to sum to 1. The second cluster is randomly selected according to the resulting p. This process is repeated until $x_i$ is assigned to $k_i$ clusters.

For PC, we generated clusterings for the uniform, beta, and unbalanced-beta models. The distributions used are similar to those used for FC. The only difference is the absence of the normalizing denominators in their definitions.

    Cluster  (EHC, Un)  (FC, UBe)  (NEHC, Un)  (PC, UBe)
    1st          2        10.2        22         15.9
    2nd         11        10.4        53         16.4
    3rd         20        11.4        64         17.0
    4th         29        11.7        73         19.0
    5th         38        56.2        74         83.3

Table 4: Object-to-cluster membership sums for clustering samples having n = 100 objects and k = 5 clusters.

Figure 10: Average evaluations for (EHC, U, n = 25, k* = 5). (Panels plot clustering similarity against the number of clusters, 2 to 10; panel legends: (a) RI, 07CRI, 08BRIp, 08BRIm, 09RI, 09BRI, 10QRIp, 10QRIm, 10ARI, 10ARIn, 11ARInm, 13GRI, 03VI, 03MI, 05MI, 09HI; (b) BC, 09EBC, 09CRI, ARI, 07CARI, 09BARI, 10AARI, 10AARIn, 11AARInm, 13AGRI, JI, 10CSI; (c) 10CF.)

We denote a particular experimental setting using a 4-tuple. For example, (EHC, U, n = 25, k* = 5) refers to an EHC set generated according to the uniform model, where each clustering has 25 objects. The solutions of (EHC, U, n = 25, k* = 5) were arranged in 30 EHC pairs for each k ∈ {2, 3, ..., 10}. Each pair contains an EHC with k clusters and an EHC with k* clusters. Thus, the set (EHC, U, n = 25, k* = 5) has 30 × 9 = 270 pairs of clusterings. The measures were then applied to evaluate the similarity between the two clusterings of each EHC pair, and the average evaluation for each k ∈ {2, 3, ..
., 10} was calculated and plotted in Figure 10. Similarly, Figures 11, 12, and 13 refer to the experimental settings (FC, U, n = 100, k* = 5), (NEHC, U, n = 50, k* = 10), and (PC, Be, n = 200, k* = 10), respectively. The remaining figures are not shown here to avoid clutter but can be found in the supplementary material: http://sn.im/25a9h8u. Those figures will be referred to here when appropriate.

Figures 10(a) and 10(b) show that 11 measures exhibited the same averages as RI and that six measures displayed the same averages as ARI, respectively. RI and, to a lesser extent, JI do not show a constant baseline (Hubert and Arabie, 1985; Albatineh et al., 2006), and this behavior is again observed in Figures 10(a) and 10(b). The 13GRI and 13AGRI measures showed the same averages as RI and ARI, respectively, because of their equivalence in the EHC context (Corollaries 4 and 5 in the Appendix). 10CF attained a peak at k = k* clusters in Figure 10(c) for randomly generated clusterings. BC, 09EBC, 11MD, ARI, and the measures with values similar to ARI are the only ones that showed a constant baseline. The others showed a tendency to favor solutions with high or low numbers of clusters.

Figure 11 shows the results for the experimental setting (FC, U, n = 100, k* = 5). 03MI, 05MI, 07CARI, 09BARI, and 13AGRI displayed a constant baseline close to zero in Figure 11(b). 07CRI, 08BRIm, and 10QRIm (Figure 11(a)) also showed constant baselines, although not close to zero. These three measures were neither formally adjusted for chance nor based on a measure that was. Moreover, 07CRI, 08BRIm, and 10QRIm showed a low variance for a wide range of numbers of clusters in Figure 6. This leads us to suspect that the uniform behavior presented in Figure 11(a) is due to a poor sensitivity to solution variations. 09BRI and 10QRIp exhibited in Figure 11(b) a monotonically decreasing curve with low variation in values, as did 10AARI and 10AARIn in Figure 11(a).
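The FC uniform and beta random models described earlier can be sketched as follows. Rows are objects here (the transpose of the paper's membership matrix U), and the function names are ours:

```python
import random

def fc_uniform(n, k, rng):
    """Uniform FC model: memberships X_r / (X_1 + ... + X_k), X_r ~ U(0, 1)."""
    rows = []
    for _ in range(n):
        x = [rng.random() for _ in range(k)]
        s = sum(x)
        rows.append([v / s for v in x])
    return rows  # rows[i][r] = membership of object i in cluster r

def fc_beta(n, k, rng, unbalanced=False):
    """Beta FC model: the favored cluster r_i gets a Be(5, 1) draw, the rest
    Be(1, 5); the unbalanced-beta variant fixes r_i to the first cluster."""
    rows = []
    for _ in range(n):
        ri = 0 if unbalanced else rng.randrange(k)
        x = [rng.betavariate(5, 1) if r == ri else rng.betavariate(1, 5)
             for r in range(k)]
        s = sum(x)
        rows.append([v / s for v in x])
    return rows
```

Dropping the normalization by `s` would give the corresponding PC models, which need not sum to 1 per object.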
11AARInm produced values greater than its supposed maximum 1 and showed a counterintuitive behavior in Figure 11(c). 10CF, 11ARInm, and 13GRI showed a peak at k = k* for randomly generated clusterings.

07CARI, 09BARI, and 13AGRI are the only measures that displayed an approximately constant baseline close to zero in Figure 12, corresponding to the results for (NEHC, U, n = 50, k* = 10). As with 10QRIm in Figure 11(a), the 10QRIp measure had a constant baseline in Figure 12(a) probably due to a low sensitivity in solution discrimination, as it is not adjusted for chance and is based on a measure (RI) known to be biased. The same cannot be said about 13AGRI, as it compares the solutions against a null model and exhibited a strong sensitivity in all experiments in Section 7.1. 10AARI showed in Figure 12(b) values greater than 1 for most solutions. 10CF (Figure 12(e)) and 13GRI (Figure 12(b)) again showed a peak at k = k* for randomly generated solutions. 10ARI and 11AARInm (Figure 12(d)) produced highly irregular evaluations. 11AARInm produced an overflow for k = 2 due to a near-zero division.

Figure 11: Average evaluations for (FC, U, n = 100, k* = 5). (Panels plot clustering similarity against the number of clusters, 2 to 10; panel legends: (a) 08BRIp, 09CRI, 09RI, 10ARI, 10ARIn, 07CRI, 10AARI, 10AARIn, 09HI, 08BRIm, 10QRIm; (b) 09BRI, 10QRIp, 03MI, 05MI, 09BARI, 13AGRI, 03VI; plus panels (11AARInm) and (10CF).)

Figure 13 illustrates the results for (PC, Be, n = 200, k* = 10). 09BRI, 09BARI, 10QRIp, 10QRIm, and 13AGRI showed constant baselines, and the constant baselines of 13AGRI and 09BARI were close to zero. 10CF (Figure 13(d)) and 13GRI (Figure 13(b)) again scored random clusterings with k = k* as better solutions. 10AARI and 11AARInm displayed highly unexpected values (Figure 13(c)). Table 5 indicates which measures showed the baseline property.
The italicized n entries refer to measures that provided constant baselines in the experiments corresponding to Figures 10, 11, 12, and 13 but not in all of the remaining experiments. For example, BC and 09EBC showed unbiased evaluations in Figure 10(b) but not in the experiment (EHC, Un, n = 100, k* = 5) reported in the supplementary material. Most measures could not provide an unbiased evaluation. They usually tend to favor random solutions with high or low numbers of clusters or show a peak when evaluating random solutions with the same number of clusters as the reference one. This behavior is undesirable, as the compared solutions were independently generated.

Figure 12: Average evaluations for (NEHC, U, n = 50, k* = 10). (Panels plot clustering similarity against the number of clusters, 2 to 20; panel legends include 09BRI and 10QRIp, plus panels (09HI), (11D2), (11AARInm), (10ARI), and (10CF).)

Figure 13: Average evaluations for (PC, Be, n = 200, k* = 10). (Panels plot clustering similarity against the number of clusters, 2 to 20; panel legends: (a) 09BARI, 13AGRI, 07CARI, 09CRI, 10ARIn, 10QRIm; (b) 09BRI, 10QRIp; plus panels (11AARInm), (10AARI), (10CF), and (10ARI).)

Only 09BARI and 13AGRI presented an approximately constant (and close to zero) baseline in all scenarios. The null model of 13AGRI is clearly violated in each scenario, which suggests that adjusting 13GRI is not just a theoretical adornment but a true correction that makes practical clustering comparisons fairer.
Recall that, contrary to 13AGRI, 09BARI did not assign the maximum score 1 to the perfect solutions in all but the EHC scenario in the previous section.

    Measures  EHC  FC  NEHC  PC      Measures   EHC  FC  NEHC  PC
    JI         n   -    -    -       09BARI      y   y    y    y
    RI         n   -    -    -       10QRIp      n   n    n    n
    ARI        y   -    -    -       10QRIm      n   n    n    n
    BC         n   -    -    -       10ARI       n   n    n    n
    03MI       n   y    -    -       10AARI      y   n    n    n
    05MI       n   y    -    -       10ARIn      n   n    n    n
    03VI       n   n    -    -       10AARIn     y   n    n    n
    07CRI      n   n    y    n       10CSI       n   -    n    -
    07CARI     y   n    y    n       10CF        n   n    n    n
    08BRIp     n   n    -    -       10CFn       n   n    n    n
    08BRIm     n   n    -    -       11ARInm     n   n    n    n
    09EBC      n   -    n    -       11AARInm    y   n    n    n
    09CRI      n   n    n    n       11MD        n   -    n    -
    09CARI     -   -    -    -       11D2        n   -    n    -
    09HI       n   n    n    n       13GRI       n   n    n    n
    09RI       n   n    -    -       13AGRI      y   y    y    y
    09BRI      n   n    n    n       12DB        n   -    n    -

Table 5: Did the similarity measure display approximately constant baselines?

7.3 13AGRI Evaluation Validity for FCs

We applied the k-means and FCM algorithms 30 times for each number of clusters k ∈ {2, 3, ..., 20} to the UCI data sets (Newman and Asuncion, 2010) shown in Table 6. 13AGRI evaluated the best clustering (according to the respective algorithm's cost function) for each number of clusters using the known classification as the reference solution; the reference solution is thus an EHC. 13AGRI provides the same evaluation as ARI for k-means solutions, since k-means produces EHCs (Corollary 5 in the Appendix). FCM is regarded as the fuzzy version of k-means, both search for spherical-like clusters, and FCM tends to k-means when the FCM exponent m approaches 1 (Yu et al., 2004). Thus, their solutions are often similar in the sense that converting an FCM solution into an EHC (by assigning the objects to the clusters for which they have the highest membership degrees) results in a clustering in which the relative assignment of objects is similar to the relative assignment in the solution produced by k-means (i.e., when objects $x_i$ and $x_j$ are assigned to the same cluster in one solution, they are often assigned to the same cluster in the other solution).
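The hardening step mentioned above, assigning each object to the cluster for which it has the highest membership degree, is simply an argmax over each object's membership vector. A minimal sketch (rows are objects here, i.e., the transpose of the paper's U; the function name is ours):

```python
def harden(memberships):
    """Convert a fuzzy/probabilistic clustering (one membership vector per
    object) into an EHC label list via the highest membership degree."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]
```

For example, `harden([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])` assigns the first object to cluster 0 and the second to cluster 1.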
This section examines whether 13AGRI produces similar evaluations for solutions generated by k-means and FCM. If this is the case, we can be more confident in the validity of 13AGRI FC evaluations, since 13AGRI and ARI are equivalent in the EHC domain. For each data set, Table 7 displays the Pearson correlation between 13AGRI evaluations of the solutions produced by k-means and of the solutions produced by FCM across the numbers of clusters in {2, 3, ..., 20}. Five correlations were higher than 0.9, and more than half were higher than 0.7. Figures 14(a) and 14(b) depict 13AGRI evaluations for the data sets on which the correlations attained the three highest and the three lowest values, respectively. Figure legends display the corresponding data set, clustering type, and number of classes in the a priori classification. Because the reference solutions are EHCs, 13AGRI almost always provided higher scores when evaluating EHC solutions than when evaluating FC solutions. The lowest correlations seem to have been obtained on the data sets for which the algorithms could not find good clusterings. For these data sets, the similarity between the found solutions and the reference one mostly fluctuates across the numbers of clusters as (we conjecture) there is no ideal number of clusters at which a peak on the evaluation curve would be found.

1. The original data set has 16 objects with missing attributes. We adopted the k-nearest neighbor algorithm with Euclidean distance for imputation (Hastie et al., 1999) and used the resulting data set.

    Name                              # Objects  # Attributes  # Classes
    Breast cancer w. d. (bcw-d)          569         30            2
    Breast cancer w. o. (bcw-o)¹         699          9            2
    Synthetic control chart (chart)      600         60            6
    Ecoli data set (ecoli)               336          7            7
    Glass identification (glass)         214          9            6
    Haberman (haberman)                  306          3            2
    Image segmentation (img)             210         19            7
    Ionosphere (ion)                     351         34            2
    Iris (iris)                          150          4            3
    Pima indians diabetes (pima)         768          8            2
    Connectionist bench (sonar)          208         60            2
    SPECT heart (heart)                  267         22            2
    Vehicle silhouettes (vehicle)        846         18            4
    Wine (wine)                          178         13            3

Table 6: UCI data sets.

K-means and FCM produced rather poor solutions for the haberman and sonar data sets according to 13AGRI. 13AGRI evaluations indicate that k-means could uncover some structure in the chart data set, because a 13AGRI score (also an ARI score) of 0.5 is a considerable one in our experience. However, there was no distinctive solution across the numbers of clusters. 13AGRI indicates the FC solution with three clusters as the most similar to the reference one for the chart data set. To further investigate the behavior of 13AGRI for the chart solutions, we reduced the chart dimensionality by projecting the 60-dimensional data onto the first nine principal components (Jolliffe, 2002), which explain 90% of the variance. By projecting the data onto several planes, we identified two pairs of classes with a high degree of overlap (namely, the class "decreasing trend" with "downward shift", and "increasing trend" with "upward shift" (Alcock, 1999)). We joined the classes "decreasing trend" with "downward shift" and "increasing trend" with "upward shift", resulting in a classification (used as the reference clustering) with four classes. The Pearson correlation between 13AGRI evaluations is now 0.91 using the same experimental configuration as above. Figure 15 shows the evaluations for k-means and FCM solutions. 13AGRI provided high evaluations for k-means solutions with three and four clusters, while it suggests that the best FCM solution is the one with three clusters.
Results indicate that 13AGRI applied to FCs behaves similarly to 13AGRI (i.e., ARI) applied to EHCs, particularly when the solutions uncover some data set structure. Considering that ARI is one of the most trusted similarity measures, the results corroborate the validity of 13AGRI evaluations for FCs.

    Data set:     bcw-d  iris  wine  bcw-o  img   ecoli  ion
    Correlation:  0.99   0.99  0.98  0.98   0.91  0.89   0.83

    Data set:     vehicle  glass  pima  heart  haberman  chart  sonar
    Correlation:  0.75     0.70   0.69  0.60   0.23      0.02   -0.45

Table 7: Correlation between 13AGRI evaluations of hard exclusive and fuzzy clusterings.

Figure 14: 13AGRI evaluations that exhibited the three highest (a) and the three lowest (b) correlations. (Panels plot clustering similarity against the number of clusters, 2 to 20; legends: (a) bcw-d, iris, and wine, each with EHC and FC curves; (b) haberman, chart, and sonar, each with EHC and FC curves; legends also give the number of classes in the a priori classification.)

Figure 15: 13AGRI evaluations for the processed chart data set. (Legend: chart pca 90, EHC and FC curves, 4 classes.)

    Data set:     bcw-d  bcw-o  wine  iris  img
    Correlation:  0.94   0.90   0.85  0.67  0.59

Table 8: Correlation between 13AGRI evaluation and stability statistic.

7.4 Clustering Stability Assessment

We applied EMGM to subsamples of the top five data sets from the previous section (i.e., bcw-d, iris, wine, bcw-o, and img) 100 times for each number of clusters k ∈ {2, ..., 20}, generating 100 Gaussian mixtures for each number of clusters and data set; these Gaussian mixtures are different explanations for the phenomenon that produced the data set. We calculated a probabilistic clustering U (also known as an FC) of the whole data set for each Gaussian mixture such that $U_{r,i}$ is the probability of $x_i$ belonging to the rth cluster (i.e., to the rth Gaussian mixture component).
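Such a probabilistic clustering is obtained from the posterior responsibilities of the mixture components. A minimal one-dimensional sketch with scalar variances (the helper name is ours; the paper's EMGM uses full multivariate Gaussians):

```python
import math

def responsibilities(x, weights, means, variances):
    """U_{r,i} for one observation x: the posterior probability of each
    Gaussian component, w_r * N(x; mu_r, s2_r) normalized over components."""
    dens = [w * math.exp(-(x - m) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
            for w, m, s2 in zip(weights, means, variances)]
    total = sum(dens)
    return [d / total for d in dens]
```

Stacking these vectors over all objects gives the matrix U whose columns sum to 1, exactly the FC format the measures compare.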
13AGRI compared each of the $\binom{100}{2}$ probabilistic clustering pairs for each number of clusters and data set, and the average was taken as the stability statistic (the less diverse the solution set, the higher the stability statistic) for the corresponding number of clusters and data set. Subsamples were generated by randomly selecting 80% of the data set objects, without replacement, as in (Monti et al., 2003). Algorithm 2 describes how stability assessment can be used to estimate the number of clusters and to select a promising clustering from a set of solutions.

Algorithm 2 Stability assessment
Require: Data set X.
 1: for i ∈ {1, 2, ..., 100} do
 2:   $S_i \leftarrow$ randomly draw 80% of the objects from X, without replacement.
 3: end for
 4: for k ∈ {2, 3, ..., 20} do
 5:   for i ∈ {1, 2, ..., 100} do
 6:     Apply EMGM to $S_i$, finding a Gaussian mixture with k components.
 7:     $U_i \leftarrow$ calculate the probabilistic clustering of the whole data set using the found Gaussian mixture.
 8:   end for
 9:   $t_k \leftarrow$ the average of 13AGRI($U_i$, $U_j$) over all pairs i < j.

Proposition 1 Let U and V be FCs; that is, $\sum_{r=1}^{k_U} U_{r,l} = 1 \; \forall l$, $\sum_{r=1}^{k_V} V_{r,l} = 1 \; \forall l$, and $1 \leq k_U, k_V \leq n$. Then 13FRI(U, V) = 0 implies that U and V are EHCs and that $k_U = 1$ and $k_V = n$ or $k_U = n$ and $k_V = 1$, which unambiguously determine U and V.

Proof Realize from Equations (12) that $\sum_{r=1}^{k_U} U_{r,l} = 1 \; \forall l$ implies $S^U_{i,j} = 1 - J^U_{i,j}$. To have 13FRI(U, V) = 0, it must be the case that $\tilde{a} = \tilde{d} = 0$ (Equation 14), which implies that $\min\{J^U_{i,j}, J^V_{i,j}\} = \min\{1 - J^U_{i,j}, 1 - J^V_{i,j}\} = 0 \; \forall i < j$ (Equations 13a and 13d). Hence, $J^U_{i,j}, J^V_{i,j} \in \{0, 1\}$ and $J^U_{i,j} \neq J^V_{i,j}$ for all i < j. We first prove by contradiction that U cannot have a column i and a row r for which $U_{r,i} \in (0, 1)$ (the same holds for V). Assuming that the ith column of U has $U_{r,i} \in (0, 1)$ for some $r \in \mathbb{N}_{1,k_U}$, we have $k_U > 1$ and at least two elements of $U_{:,i}$ with values in the open interval (0, 1), because $\sum_{t=1}^{k_U} U_{t,i} = 1$. Without loss of generality, assume that i = 1 (the columns of U and V can always be simultaneously permuted without changing the measure). We know that $U_{:,1}^T U_{:,j} = J^U_{1,j} = 0 \; \forall j \in \mathbb{N}_{2,n}$ because $U_{:,1}^T U_{:,j}$ cannot yield 1. Thus, $J^V_{1,j} = 1 \; \forall j \in \mathbb{N}_{2,n}$.
This implies that the columns of V are all identical and that each one has an element equal to 1, resulting in $k_V = 1$ because of the constraint $\sum_{j=1}^{n} V_{t,j} > 0 \; \forall t$. We thus have $J^V_{i_1,j_1} = 1 \; \forall i_1 < j_1$ and $J^U_{i_2,j_2} = 0 \; \forall i_2 < j_2$. The last equality only holds under the constraint $\sum_{j=1}^{n} U_{t,j} > 0 \; \forall t$ if each row of U has exactly one value greater than zero. The property $\sum_{t=1}^{k_U} U_{t,j} = 1 \; \forall j$ of FCs and the assumption $k_U \leq n$ then require each column of U to have exactly one value greater than zero (and U to have $k_U = n$ rows), which is the value 1. This violates the assumption that $U_{r,i} \in (0, 1)$, which implies that U (and V) must be a matrix with only zeros and ones.

Suppose n = 2. If columns 1 and 2 of U are identical, then columns 1 and 2 of V are different, because we have already proven that $J^U_{i,j} \neq J^V_{i,j}$. This can only happen for $k_U = 1$ and $k_V = 2$ (remember the properties of an FC). Now, suppose that n > 2 and, without loss of generality, that $U_{:,1}$ and $U_{:,2}$ are identical and that $V_{:,1}$ and $V_{:,2}$ are different. If a column i > 2 of U differs from columns 1 and 2 of U, we conclude that columns 1 and 2 of V are equal to column i of V. However, this implies that columns 1 and 2 of V are equal, and, as we know, they are not. Consequently, all columns of U must be identical and all columns of V must be different. This can only happen for $k_U = 1$ and $k_V = n$, which proves the proposition.

Proposition 2 Given two EHCs U and V, we have 13FRI(U, V) = RI(U, V).

Proof Realize that $\tilde{a}$, $\tilde{b}$, $\tilde{c}$, and $\tilde{d}$ (Equations 13) are equivalent to a, b, c, and d (Equations 7) upon assigning the values 0 and 1 to $J^U_{i,j}$ and $J^V_{i,j}$.

Proposition 3 Given two EHCs U and V, we have 13AFRI(U, V) = ARI(U, V).

Proof Both ARI and 13AFRI use the framework of Equation (15). The expectation used by ARI given U and V is $E[RI]_{U,V} = (E[a]_{U,V} + E[d]_{U,V})/(a + b + c + d)$ (Hubert and Arabie, 1985). We must therefore only show that $E[a]_{U,V} = E[\tilde{a}]_{U,V}$ and $E[d]_{U,V} = E[\tilde{d}]_{U,V}$, since $\tilde{a} = a$, $\tilde{b} = b$, $\tilde{c} = c$, and $\tilde{d} = d$ by Proposition 2.
Let $J^U = U^T U$, $J^V = V^T V$, and $N = U V^T$. Because U and V are EHCs, we can rewrite $\min\{J^U_{i,j}, J^V_{i,j}\} = J^U_{i,j} J^V_{i,j}$. Both $\sum_i$