# adjusting_for_chance_clustering_comparison_measures__c46dd4a9.pdf

Journal of Machine Learning Research 17 (2016) 1-32 Submitted 12/15; Revised 7/16; Published 8/16

Adjusting for Chance Clustering Comparison Measures

Simone Romano simone.romano@unimelb.edu.au

Nguyen Xuan Vinh vinh.nguyen@unimelb.edu.au

James Bailey baileyj@unimelb.edu.au

Karin Verspoor karin.verspoor@unimelb.edu.au

Dept. of Computing and Information Systems, The University of Melbourne, VIC, Australia.

Editor: Sebastian Nowozin

Adjusted for chance measures are widely used to compare partitions/clusterings of the same data set. In particular, the Adjusted Rand Index (ARI) based on pair-counting, and the Adjusted Mutual Information (AMI) based on Shannon information theory are very popular in the clustering community. Nonetheless it is an open problem as to what are the best application scenarios for each measure and guidelines in the literature for their usage are sparse, with the result that users often resort to using both. Generalized Information Theoretic (IT) measures based on the Tsallis entropy have been shown to link pair-counting and Shannon IT measures. In this paper, we aim to bridge the gap between adjustment of measures based on pair-counting and measures based on information theory. We solve the key technical challenge of analytically computing the expected value and variance of generalized IT measures. This allows us to propose adjustments of generalized IT measures, which reduce to well known adjusted clustering comparison measures as special cases. Using the theory of generalized IT measures, we are able to propose the following guidelines for using ARI and AMI as external validation indices: ARI should be used when the reference clustering has large equal sized clusters; AMI should be used when the reference clustering is unbalanced and there exist small clusters.

Keywords: Clustering Comparison, Clustering Validation, Adjustment for Chance, Generalized Information Theoretic Measures, Pair-Counting Measures

1. Introduction

Clustering comparison measures are used to compare partitions/clusterings of the same data set. In the clustering community (Aggarwal and Reddy, 2013), they are extensively used for external validation when the ground truth clustering is available. A family of popular clustering comparison measures are measures based on pair-counting (Albatineh et al., 2006). This category comprises the well known similarity measures Rand Index (RI) (Rand, 1971) and the Jaccard coeﬃcient (J) (Ben-Hur et al., 2001). Recently, information theoretic (IT) measures have been also extensively used to compare partitions (Strehl and Ghosh, 2003; Vinh et al., 2010). Given the variety of diﬀerent possible measures, it is very challenging to identify the best choice for a particular application scenario (Wu et al., 2009). The picture becomes even more complex if adjusted for chance measures are also considered. Adjusted for chance measures are widely used external clustering validation techniques

c 2016 Simone Romano, Nguyen Xuan Vinh, James Bailey, and Karin Verspoor.

Romano, Vinh, Bailey, and Verspoor

because they improve the interpretability of the results. Indeed, two important properties hold true for adjusted measures: they have constant baseline equal to 0 value when the partitions are random and independent, and they are equal to 1 when the compared partitions are identical. Notable examples are the Adjusted Rand Index (ARI) (Hubert and Arabie, 1985) and the Adjusted Mutual Information (AMI) (Vinh et al., 2009). It is common to see published research that validates clustering solutions against a reference ground truth clustering with the ARI or the AMI. Nonetheless there are still open problems: there are no guidelines for their best application scenarios shown in the literature to date and authors often resort to employing them both and leaving the reader to interpret. Moreover, some clustering comparisons measures are susceptible to selection bias: when selecting the most similar partition to a given ground truth partition, clustering comparison measures are more likely to select partitions with many clusters (Romano et al., 2014). In Romano et al. (2014) it was shown that it is beneﬁcial to perform statistical standardization to IT measures to correct for this bias. In particular, standardized IT measures help in decreasing this bias when the number of objects in the data set is small. Statistical standardization has not been applied to pair-counting measures yet in the literature. We solve this challenge in the current paper, and provide further results about the utility of measure adjustment by standardization. In this work, we aim to bridge the gap between the adjustment of pair-counting measures and the adjustment of IT measures. In Furuichi (2006) and Simovici (2007) it has been shown that generalized IT measures based on the Tsallis q-entropy (Tsallis et al., 2009) are a further generalization of IT measures and some pair-counting measures such as RI. In this paper, we will exploit this useful idea to connect ARI and AMI. Furthermore using the same idea, we can perform statistical adjustment by standardization to a broader class of measures, including pair-counting measures. A key technical challenge is to analytically compute the expected value and variance for generalized IT measures when the clusterings are random. To solve this problem, we propose a technique applicable to a broader class of measures we name Lφ, which includes generalized IT measures as a special case. This generalizes previous work which provided analytical adjustments for narrower classes: measures based on pair-counting from the family L (Albatineh et al., 2006), and measures based on the Shannon mutual information (Vinh et al., 2009, 2010). Moreover, we deﬁne a family of measures Nφ which generalizes many clustering comparison measures. For measures which belong in this family, the expected value can be analytically approximated when the number of objects is large. Figure 1 depicts the families of measures discussed in this paper. Table 1 summarizes the development of this line of work over the past 30 years and positions our contribution. In summary, we make the following contributions:

We deﬁne families of measures for which the expected value and variance can be computed analytically when the clusterings are random;

We propose generalized adjusted measures to correct for the baseline property and for selection bias. This captures existing well known measures as special cases;

We provide insights into the open problem of identifying the best application scenarios for clustering comparison measures, in particular the application scenarios for ARI and AMI.

Adjusting for Chance Clustering Comparison Measures

Rand Index (RI)

MI Jaccard (J)

Generalized Information Theoretic VI

Figure 1: Families of clustering comparison measures discussed in this paper. We show how to analytically adjust measures in Lφ and how to obtain approximations for the family Nφ.

Year Contribution Reference

1985 Expectation of Rand Index (RI) (Hubert and Arabie, 1985) 2006 Expectation and variance of S L (Albatineh et al., 2006) 2009 Expectation of Shannon Mutual Information (MI) (Vinh et al., 2009) 2010 Expectation of Normalized Shannon MI (NMI) (Vinh et al., 2010) 2014 Variance of Shannon MI (Romano et al., 2014)

2016 Expectation and variance of S Lφ Asymptotic expectation of S Nφ This Work

Table 1: Work on adjusting clustering comparison measures carried out over the past 30 years. Information theoretic measures have been only recently adjusted for chance. In this paper, we bridge the gap between adjustment of pair-counting measures and information theoretic measures.

2. Comparing Partitions

Given two partitions (clusterings) U and V of the same data set of N objects, let {u1, . . . , ur} and {v1, . . . , vc} be the disjoint sets (clusters) for U and V respectively. Let |ui| = ai for i = 1, . . . , r denote the number of objects in the set ui and |vj| = bj for j = 1, . . . , c denote the number of objects in vj. Naturally, Pr i=1 ai = Pc j=1 bj = N. The overlap between the two partitions U and V can be represented in matrix form by a r c contingency table M where nij represents the number of objects in both ui and vj, i.e. nij = |ui vj|. Also, we refer to ai = Pr i=1 nij as the row marginals and to bj = Pc j=1 nij as the column marginals. A contingency table M is shown in Table 2. Pair-counting measures between partitions, such as the Rand Index (RI) (Rand, 1971), might be deﬁned using the following quantities: k11, the pairs of objects in the same set in both U and V ; k00 the pairs of objects not in the same set in U and not in the same set in V ; k10, the pairs of objects in the same set in U and not in the same set in V ; and k01 the pairs of objects not in the same set in U and in the same set in V . All these quantities can

Romano, Vinh, Bailey, and Verspoor

V b1 bj bc a1 n11 n1c ... ... ... ... U ai nij ... ... ... ... ar nr1 nrc

Table 2: r c contingency table M related to two clusterings U and V . ai = P j nij are the row marginals and bj = P i nij are the column marginals.

be computed using the contingency table M, for example:

j=1 nij(nij 1), k00 = 1

j=1 n2 ij r X

j=1 b2 j (1)

Using k00, k11, k10, and k01 it is possible to compute similarity measures, e.g. RI, or distance measures, e.g. the Mirkin index MK(U, V ) P i a2 i + P j b2 j 2 P i,j n2 ij, between partitions (Meil a, 2007):

RI(U, V ) (k11 + k00)/ N 2

, MK(U, V ) = 2(k10 + k01) = N(N 1)(1 RI(U, V )) (2)

Information theoretic measures are instead deﬁned for random variables but can also be used to compare partitions when we employ the empirical probability distributions associated to U, V , and the joint partition (U, V ). Let ai

N , and nij

N be the probability that an object falls in the set ui, vj, and ui vj respectively. We can therefore deﬁne the Shannon entropy with natural logarithms for a partition V as follows: H(V ) P j bj N ln bj

N . Similarly, we can deﬁne the entropy H(U) for the partition U, the joint entropy H(U, V ) for the joint partition (U, V ), and the conditional entropies H(U|V ) and H(V |U). Shannon entropy can be used to deﬁne the well know Mutual Information (MI) and employ it to compute similarity between partitions U and V :

MI(U, V ) H(U) H(U|V ) = H(V ) H(V |U) = H(U) + H(V ) H(U, V ) (3)

On contingency tables, MI is linearly related to G-statistics used for likelihood-ratio tests: G = 2NMI. In Meil a (2007), using the Shannon entropy it was shown that the following distance, namely the Variation of Information (VI) is a metric:

VI(U, V ) 2H(U, V ) H(U) H(V ) = H(U|V ) + H(V |U) = H(U) + H(V ) 2MI(U, V ) (4) Information theoretic measures are extensively used to compare crisp partitions (Strehl and Ghosh, 2003; Vinh et al., 2010). Very recently they have also been used to compare fuzzy partitions (Lei et al., 2014a, 2016).

Adjusting for Chance Clustering Comparison Measures

2.1 Generalized Information Theoretic Measures

Generalized Information Theoretic (IT) measures based on the generalized Tsallis q-entropy (Tsallis, 1988) can be deﬁned for random variables (Furuichi, 2006) and also be applied to the task of comparing partitions (Simovici, 2007). Indeed, these measures have also seen recent application in the machine learning community. More speciﬁcally, it has been shown that they can act as proper kernels (Martins et al., 2009). Furthermore, empirical studies demonstrated that careful choice of q yields successful results when comparing the similarity between documents (Vila et al., 2011), decision tree induction (Maszczyk and Duch, 2008; Wang et al., 2015), and reverse engineering of biological networks (Lopes et al., 2011). It is important to note that the Tsallis q-entropy is equivalent to the Harvda-Charvat-Dar oczy generalized entropy proposed in Havrda and Charv at (1967); Dar oczy (1970). Results available in literature about these generalized entropies are equivalently valid for all the proposed versions. Given q R+ {1}, the generalized Tsallis q-entropy for a partition V is deﬁned as follows: Hq(V ) 1 q 1 1 P j bj

N q . Similarly to the case of Shannon entropy, we have the joint q-entropy Hq(U, V ) and the conditional q-entropies Hq(U|V ) and Hq(V |U). Conditional q-entropy is computed according to a weighted average parametrized on q. More speciﬁcally the formula for Hq(V |U) is:

q Hq(V |ui) =

The q-entropy reduces to the Shannon entropy computed in nats for q 1. In Furuichi (2006), using the fact that q > 1 implies Hq(U) Hq(U|V ), it is shown that non-negative MI can be naturally generalized with q-entropy when q > 1:

MIq(U, V ) Hq(U) Hq(U|V ) = Hq(V ) Hq(V |U) = Hq(U) + Hq(V ) Hq(U, V ) (6)

However, q values smaller than 1 are allowed if the assumption that MIq(U, V ) is always positive can be dropped. In addition, generalized IT measures can be used to deﬁne the generalized Variation of Information distance (VIq) which tends to VI in Eq. (4) when q 1:

VIq(U, V ) Hq(U|V )+Hq(V |U) = 2Hq(U, V ) Hq(U) Hq(V ) = Hq(U)+Hq(V ) 2MIq(U, V ) (7) In Simovici (2007) it was shown that VIq is a proper metric and interesting links were identiﬁed between measures for comparing partitions U and V . We state these links in Proposition 1 given that they set the fundamental motivation of our paper:

Proposition 1 (Simovici, 2007) When q = 2 the generalized variation of information, the Mirkin index, and the Rand index are linearly related: VI2(U, V ) = 1 N2 MK(U, V ) = N 1

N (1 RI(U, V )).

Generalized IT measures are not only a generalization of IT measures in the Shannon sense but also a generalization of pair-counting measures for particular values of q. Note that in literature there exist another well know generalization of entropy: the Renyi entropy (Renyi,

Romano, Vinh, Bailey, and Verspoor

1961). This entropy is again parametrized on a real number q and is deﬁned as follows: Rq(V ) 1 1 q ln P j bj

N q . Because of the use of the logarithm for any value of q, the Renyi entropy does enable the generalization of pair-counting measures when q = 2. Therefore, in this paper we make use of generalized IT measures based on the Tsallis entropy.

2.2 Normalized Generalized IT Measures

To allow a more interpretable range of variation, a clustering similarity measure should be normalized: it should achieve its maximum at 1 when U = V . An upper bound to the generalized mutual information MIq is used to obtained a normalized measure. MIq can take diﬀerent possible upper bounds (Furuichi, 2006). Here, we choose to derive another possible upper bound using Eq. (7) when we use the minimum value of VIq = 0: max MIq = 1 2(Hq(U) + Hq(V )). This upper bound is valid for any q R+ {1} and allows us to link diﬀerent existing measures as we will show in the next sections of the paper. The Normalized Mutual Information with q-entropy (NMIq) is deﬁned as follows:

NMIq(U, V ) MIq(U, V ) max MIq(U, V ) = MIq(U, V ) 1 2 Hq(U) + Hq(V ) = Hq(U) + Hq(V ) Hq(U, V )

1 2 Hq(U) + Hq(V ) (8)

Even if NMIq(U, V ) achieves its maximum 1 when the partitions U and V are identical, NMIq(U, V ) is not a suitable clustering comparison measure. Indeed, it does not show constant baseline value equal to 0 when partitions are random. We explore this through an experiment. Given a dataset of N = 100 objects, we randomly generate uniform partitions U with r = 2, 4, 6, 8, 10 sets and V with c = 6 sets independently of each others. The average value of NMIq over 1, 000 simulations for diﬀerent values of q is shown in Figure 2. It is reasonable to expect that when the partitions are independent, the average value of NMIq is constant irrespectively of the number of sets r of the partition U. This is not the case. This behavior is unintuitive and misleading when comparing partitions. Computing the

2 3 4 5 6 7 8 9 10 Number of sets r in U

1 q = 0.9 q = 1.2 q = 2

Figure 2: The baseline value of NMIq(U, V ) between independent random partitions U and V . Despite the partitions are random, the baseline of NMIq is not constant and depends on the number of sets of the partitions.

analytical expected value of generalized IT measures under the null hypothesis of random and independent U and V is important; it can be subtracted from the measure itself to adjust its baseline for chance such that the result is 0 when U and V are random. Given

Adjusting for Chance Clustering Comparison Measures

Proposition 1, this strategy also allows us to generalize adjusted for chance pair-counting and Shannon IT measures.

3. Baseline Adjustment

In order to adjust the baseline of a similarity measure S(U, V ), we have to compute its expected value under the null hypothesis of independent random partitions U and V . We adopt the assumption of randomness used to adjust RI (Hubert and Arabie, 1985) and the Shannon MI (Vinh et al., 2009). This is formalized as follows:

Deﬁnition 1 (Random partitions) The partitions U and V are generated independently and at random ﬁxing the number of objects N and the marginals ai and bj.

This is also denoted as the permutation or the hypergeometric model of randomness. We are able to compute the exact expected value for a similarity measure in the family Lφ:

Deﬁnition 2 Let Lφ be the family of similarity measures S(U, V ) = α + β P ij φij(nij) where α and β do not depend on the entries nij of the contingency table M and φij( ) are bounded real functions.

Intuitively, Lφ represents the class of measures that can be written as a linear combination of φij(nij). A measure between partitions uniquely determines α, β, and φij. However, not every choice of α, β, and φij yields a meaningful similarity measure. Lφ is a superset of the set L deﬁned in Albatineh et al. (2006) as the family of measures S(U, V ) = α + β P ij n2 ij, i.e. S L are special cases of measures in Lφ with φij( ) = ( )2. Figure 1 shows a diagram of the similarity measures discussed in Section 2.1 and their relationships.

Lemma 1 If S(U, V ) Lφ, when partitions U and V are random:

E[S(U, V )] = α + β X

ij E[φij(nij)] where E[φij(nij)] is (9)

min{ai,bj} X

nij=max{0,ai+bj N} φij(nij) ai!bj!(N ai)!(N bj)! N!nij!(ai nij)!(bj nij)!(N ai bj + nij)! (10)

Lemma 1 extends the results in Albatineh and Niewiadomska-Bugaj (2011) showing exact computation of the expected value of measures in the family L. Given that generalized IT measures belong in Lφ we can employ this result to adjust them.

3.1 Baseline Adjustment for Generalized IT measures

Using Lemma 1 it is possible to compute the exact expected value of Hq(U, V ), VIq(U, V ) and MIq(U, V ):

Theorem 1 When the partitions U and V are random:

i) E[Hq(U, V )] = 1 q 1 1 1 Nq P ij E[nq ij] with E[nq ij] from Eq. (10) with φij(nij) = nq ij;

ii) E[MIq(U, V )] = Hq(U) + Hq(V ) E[Hq(U, V )];

Romano, Vinh, Bailey, and Verspoor

iii) E[VIq(U, V )] = 2E[Hq(U, V )] Hq(U) Hq(V ).

It is worth noting that this approach is valid for any q R+ {1}. We can use these expected values to adjust for baseline generalized IT measures. We use the method proposed in Hubert and Arabie (1985) to adjust similarity measures, such as MIq, and distance measures, such as VIq:

AMIq MIq E[MIq] max MIq E[MIq] AVIq E[VIq] VIq E[VIq] min VIq (11)

VIq is a distance measure, thus min VIq = 0. For MIq we use the upper bound max MIq = 1 2 Hq(U) + Hq(V ) as for NMIq in Eq. (8). An exhaustive list of adjusted versions of Shannon MI can be found in Vinh et al. (2010), when the upper bound 1

2(Hq(U) + Hq(V )) is used the authors named the adjusted MI as AMIsum. It is important to note that this type of adjustment turns distance measures into similarity measures, i.e., AVIq is a similarity measure. It is also possible to maintain both the distance properties and the baseline adjustment using NVIq VIq/E[VIq] which can be seen as a normalization of VIq with the stochastic upper bound E[VIq] (Vinh et al., 2009). It is also easy to see that AVIq = 1 NVIq. The adjustments in Eq. (11) also enable the measures to be normalized. AMIq and AVIq achieve their maximum at 1 when U = V and their minimum is 0 when U and V are random partitions. According to the chosen upper bound for MIq, we obtain the nice analytical form shown in Theorem 2. Our adjusted measures quantify the discrepancy between the values of the actual contingency table and their expected value in relation to the maximum discrepancy possible, i.e. the denominator in Eq. (12). It is also easy to see that all measures in Lφ resemble this form when adjusted.

Theorem 2 Using E[nq ij] in Eq. (10) with φij(nij) = nq ij, the adjustments for chance for MIq(U, V ) and VIq(U, V ) are:

AMIq(U, V ) = AVIq(U, V ) =

P ij nq ij P ij E[nq ij]

1 2 P i aq i + P j bq j P ij E[nq ij] (12)

From now on we only discuss AMIq, given that it is identical to AVIq. There are notable special cases for our proposed adjusted generalized IT measures. In particular, the Adjusted Rand Index (ARI) (Hubert and Arabie, 1985) is equal to AMI2. ARI is a classic measure, heavily used for validation in social sciences and the most popular clustering validity index.

Corollary 1 It holds true that: i) limq 1 AMIq = limq 1 AVIq = AMI = AVI with Shannon entropy; ii) AMI2 = AVI2 = ARI.

Therefore, using the permutation model we can perform baseline adjustment to generalized IT measures. Our generalized adjusted IT measures are a further generalization of particular well known adjusted measures such as AMI and ARI. It is worth noting, that ARI is equivalent to other known measures for comparing partitions (Albatineh et al., 2006).

Adjusting for Chance Clustering Comparison Measures

Furthermore, there is also a strong connection between ARI and Cohen s κ statistics used to quantify inter-rater agreement (Warrens, 2008). As ﬁnal remark, we point out that our baseline adjustments can also be seen as statistical corrections for generalized information theoretic measures. It is indeed well known that information theoretic measures are severely biased when plug-in estimators are used, and many have worked on correcting this bias for decades: there exist in literature frequentist approaches (Paninski, 2003) as well as Bayesian approaches (Archer et al., 2013; Cerquetti, 2014) to reduce bias. In this section, we discussed an adjustment to obtain exact bias correction in particular when U and V are independent.

Computational complexity: The computational complexity of AMIq in Eq. (12) is dominated by the computation of the sum of the expected value of each cell.

Proposition 2 The computational complexity of AMIq is O(N max {r, c}).

If all the possible contingency tables M obtained by permutations were generated, the computational complexity of the exact expected value would be O(N!). However, this can be dramatically reduced using properties of the expected value.

3.2 Experiments on Measure Baseline

Here we show that our adjusted generalized IT measures have a baseline value of 0 when comparing random partitions U and V . In Figure 3 we show the behavior of AMIq, ARI, and AMI on the same experiment proposed in Section 2.2. They are all close to 0 with negligible variation when the partitions are random and independent. Moreover, it is interesting to see the equivalence of AMI2 and ARI. On the other hand, the equivalence of AMIq and AMI with Shannon entropy is obtained only at the limit q 1.

2 3 4 5 6 7 8 9 10 Number of sets r in U

q = 0.9 q = 1.2 q = 2 ARI AMI

Figure 3: Baseline value of adjusted clustering comparison measures between two random partitions. When varying the number of sets for the random partition U, the value of AMIq(U, V ) is always very close to 0 with negligible variation for any q.

We also point out that NMIq does not show constant baseline when the relative size of the sets in U varies when U and V are random. In Figure 4, we generate random partitions V with c = 6 sets on N = 100 points, and random binary partitions U independently. NMIq(U, V ) shows diﬀerent behavior at the variation of the relative size of the biggest set

Romano, Vinh, Bailey, and Verspoor

in U. This is unintuitive given that the partitions U and V are random and independent. We obtain the desired property of a baseline value of 0 with AMIq.

50 60 70 80 90 Relative size of the biggest set in U [%]

50 60 70 80 90 Relative size of the biggest set in U [%]

q = 0.9 q = 1.2 q = 2 ARI AMI

Figure 4: The left panel shows the baseline value of NMIq(U, V ) between the random partions U and V at the variation of the size of the sets in U. The right panel shows the baseline value of adjusted clustering comparison measures. When varying the relative size of one cluster for the random partition U, the value of AMIq(U, V ) is always very close to 0 with negligible variation for any q.

3.3 Large Number of Objects

In this section, we introduce a very general family of measures which includes Lφ. For measures belonging to this family, it is possible to ﬁnd an approximation of their expected value when the number of objects N is large. This allows us to identify approximations for the expected value of measures in Lφ as well as for measures not in Lφ, such as the Jaccard coeﬃcient as shown in Figure 1. Let Nφ be the family of measures which are non-linear combinations of φij(nij):

Deﬁnition 3 Let Nφ be the family of similarity measures S(U, V ) = φ(n11

N , . . . , nij

N , . . . , nrc

N ) where φ is a bounded real function as N reaches inﬁnity.

Note that Nφ is a generalization of Lφ. At the limit of large number of objects N, it is possible to compute the expected value of measures in Nφ under random partitions U and V using only the marginals of the contingency table M:

Lemma 2 If S(U, V ) Nφ, then lim N + E[S(U, V )] = φ a1

N b1 N , . . . , ai

N bj N , . . . , ar

In Morey and Agresti (1984) the expected value of the RI was computed using an approximated value based on the multinomial distribution. It turns out this approximated value is equal to what we obtain for RI using Lemma 2. The authors of Albatineh et al. (2006) noticed that the diﬀerence between the approximation and the expected value obtained with the hypergeometric model is small on empirical experiments when N is large. We point out that this is a natural consequence of Lemma 2 given that RI Lφ Nφ. Moreover, the multinomial distribution was also used to compute the expected value of the Jaccard coeﬃcient (J) in Albatineh and Niewiadomska-Bugaj (2011), obtaining good results on empirical experiments with many objects. Again, this is a natural consequence of Lemma 2

Adjusting for Chance Clustering Comparison Measures

given that J Nφ but J / Lφ. Indeed, the Jaccard coeﬃcient does not allow analytical adjustment using the hypergeometric model but it allows an approximation using Lemma 2. Generalized IT measures belong in Lφ Nφ. Therefore we can employ Lemma 2. When the number of objects is large, the expected value under random partitions U and V of Hq(U, V ), MIq(U, V ), and VIq(U, V ) in Theorem 1 depends only on the entropy of the partitions U and V , i.e., just the marginals of the contingency table must be taken into account:

Theorem 3 It holds true that:

i) lim N + E[Hq(U, V )] = Hq(U) + Hq(V ) (q 1)Hq(U)Hq(V );

ii) lim N + E[MIq(U, V )] = (q 1)Hq(U)Hq(V );

iii) lim N + E[VIq(U, V )] = Hq(U) + Hq(V ) 2(q 1)Hq(U)Hq(V ).

Result i) recalls the property of non-additivity that holds true for random variables (Furuichi, 2006). Figure 5 shows the behavior of E[Hq(U, V )] when the partitions U and V are generated uniformly at random. V has c = 6 sets and U has r sets. In this case, Hq(U) + Hq(V ) (q 1)Hq(U)Hq(V ) appears to be a good approximation already for N = 1000. In particular, the approximation is good when the number of objects N is big with regards to the number of cells of the contingency table in Table 2: i.e., when N r c is large enough.

(a) Approximation on smaller sample size

2 4 6 8 10 Number of sets r in U

E[Hq(U; V )]

N = 50 objects

(b) Approximation on bigger sample size

2 4 6 8 10 Number of sets r in U

E[Hq(U; V )]

N = 1000 objects

q = 0.8 q = 1.1 q = 1.5

Figure 5: The left panel shows the average value of E[Hq(U, V )] between random partitions U and V when they are induced on N = 50 objects. The right panel shows results for N = 1000 objects. E[Hq(U, V )] are plotted using a solid line and their limit value Hq(U) + Hq(V ) (q 1)Hq(U)Hq(V ) is plotted using a dashed line. The solid line coincides approximately with the dashed one in 4(b) when N = 1000. The limit value is a good approximation for E[Hq(U, V )] when N

r c is large enough.

From point ii) follows the result proved in Vinh et al. (2010) to connect the adjusted mutual information to the widely used Normalized Mutual Information (NMI) based on Shannon entropy (Strehl and Ghosh, 2003):

Theorem 4 (Vinh et al., 2010) It holds true that:

lim N + AMI(U, V ) = NMI(U, V )

Romano, Vinh, Bailey, and Verspoor

NMI is easier to compute than AMI and it is less prone to computer precision errors. The analysis provided in this session aims at broadening the possible clustering comparison measures that can be adjusted: as long as a measure belongs in Nφ, its expected value at large N can be computed and thus it can be adjusted. Moreover, we saw that adjusted measures in Lφ Nφ have simpler formulas at large N. The adjustments at large N are faster to compute and less prone to computer precision errors. In the next section we put forward a theoretical analysis on the best choice between AMI and ARI when validating clustering solutions. Moreover being NMI equal to AMI at large N, the next section helps also to understand when to use either NMI or ARI.

4. Application Scenarios for AMIq

In this section we aim to answer to the question: Given a reference ground truth clustering V , which is the best choice for q in AMIq(U, V ) to validate the clustering solution U? By answering this question, we implicitly identify the application scenarios for ARI and AMI given the results in Corollary 1. This is particularly important for external clustering validation. Nonetheless, there are a number of other applications where the task is to ﬁnd the most similar partition to a reference ground truth partition: e.g., categorical feature selection (Vinh et al., 2014), decision tree induction (Criminisi et al., 2012), generation of alternative or multi-view clusterings (M uller et al., 2013), or the exploration of the clustering space with the Meta-Clustering algorithm (Caruana et al., 2006; Lei et al., 2014b) to list a few. Diﬀerent values for q in AMIq yield to diﬀerent biases. The source of these biases can be identiﬁed by analyzing the properties of the q-entropy. In Figure 6 we show the q-entropy for a binary partition at the variation of the relative size p of one cluster. This can be analytically computed: Hq(p) = 1 q 1(1 pq (1 p)q). The range of variation for Hq(p) is much bigger if q is small. More speciﬁcally when q is small, the diﬀerence in entropy between an unbalanced partition and a balanced partition is big.

Figure 6: Tsallis q-entropy Hq(p) for a binary clustering where p is the relative size of one cluster. When q is small, the q-entropy varies in a bigger range. When q is small, the difference in entropy between an unbalanced partition and a balanced partition is big.

q = 0.5 q = 1.5 q = 2.5

Let us focus on an example. Let V be a reference clustering with 3 clusters of size 50 each, and let U1 and U2 be two clustering solutions with the same number of clusters and same cluster sizes. The contingency tables for U1 and U2 are shown on Figure 7. Given that both contingency tables have the same marginals, the only diﬀerence between AMIq(U1, V ) and AMIq(U2, V ) according to Eq. (11) lies in MIq. Given that both solutions U1 and U2

Adjusting for Chance Clustering Comparison Measures

V 50 50 50 50 50 0 0 U1 50 0 44 6 50 0 6 44

V 50 50 50 50 48 1 1 U2 50 1 46 3 50 1 3 46

AMIq(U1; V ) AMIq(U2; V )

Figure 7: Ground truth clustering V compared in turn to the clustering solutions U1 and U2. AMIq with small q prefers the solution U1 because there exists one pure cluster: i.e., there is one cluster which contains elements from only one cluster in the reference clustering V .

are compared against V , the only term that varies in MIq(U, V ) = Hq(V ) Hq(V |U) is Hq(V |U). In order to identify the clustering solution that maximizes AMIq we have to analyze the solution that decreases Hq(V |U) the most. Hq(V |U) is a weighted average of the entropies Hq(V |ui) computed on the rows of the contingency table as shown in Eq. (5), and this is sensitive to values equal to 0. Given the bigger range of variation of Hq for small q, small q implies higher sensitivity to row entropies of 0. Therefore, small values of q tends to decrease Hq(V |U) much more if the clusters in the solution U are pure: i.e., clusters contain elements from only one cluster in the reference clustering V . In other words, AMIq with small q prefers pure clusters in the clustering solution. When the marginals in the contingency tables for two solutions are diﬀerent, another important factor in the computation of AMIq is the normalization coeﬃcient 1

2(Hq(U) + Hq(V )). Balanced solutions U will be penalized more by AMIq when q is small. Therefore, AMIq with small q prefers unbalanced clustering solutions. To summarize, AMIq with small q such as AMI0.5 or AMI1 = AMI with Shannon entropy:

Is biased towards pure clusters in the clustering solutions;

Prefers unbalanced clustering solutions.

By contrary, AMIq with bigger q such as AMI2.5 or AMI2 = ARI:

Is less biased towards pure clusters in the clustering solution;

Prefers balanced clustering solutions.

Given a reference clustering V , these biases can guide the choice of q in AMIq to identify more suitable clustering solutions.

4.1 Use AMIq with small q such as AMI0.5 or AMI1 = AMI when the reference clustering is unbalanced and there exist small clusters

If the reference cluster V is unbalanced and presents small clusters, AMIq with small q might prefer more appropriate clustering solutions U. For example, in Figure 8 we show two contingency tables associated to two clustering solutions U1 and U2 for the reference

Romano, Vinh, Bailey, and Verspoor

clustering V with 4 clusters of size [10, 10, 10, 70] respectively. When there exist small clusters in the reference V their identiﬁcation has to be precise in the clustering solution. The solution U1 looks arguably better than U2 because it shows many pure clusters. In this scenario we advise the use of AMI0.5 or AMI1 = AMI with Shannon entropy because it gives more weight to the clustering solution U1.

V 10 10 10 70

8 8 0 0 0 7 0 7 0 0 7 0 0 7 0 78 2 3 3 70

V 10 10 10 70

10 7 1 1 1 10 1 7 1 1 10 1 1 7 1 70 1 1 1 67

AMIq(U1; V ) AMIq(U2; V )

Figure 8: Ground truth clustering V compared in turn to the clustering solutions U1 and U2. V is unbalanced and presents small clusters. When the reference clustering has small clusters their identiﬁcation in the solution has to be precise. Therefore U1 appears to be a better solution than U2. AMIq with small q prefers the solution U1 because its clusters are pure. In this scenario we advise the use of AMI0.5 or AMI1 = AMI.

If the number of objects N is large, AMI is equivalent to NMI according to Theorem 4. Therefore, when the reference clustering is unbalanced, there exist small clusters, and N is large, it is advisable to use NMI rather than ARI.

4.2 Use AMIq with big q such as AMI2.5 or AMI2 = ARI when the reference clustering has big equal sized clusters

If V is a reference clustering with big equal size clusters it is less crucial to have precise clusters in the solution. Indeed, precise clusters in the solution penalize the recall of clusters from the reference. In this case, AMIq with bigger q might prefer more appropriate solutions. In Figure 9 we show two clustering solutions U1 and U2 for the reference clustering V with 4 equal size clusters of size 25. The solution U2 looks better than U1 because each of its clusters identiﬁes more elements from particular clusters in the reference. Moreover, U2 has to be preferred to U1 because it consists in 4 equal sized clusters as the reference clustering V consists in equal sized clusters. In this scenario we advise the use of AMI2.5 or AMI2 = ARI because it gives more importance to the solution U2. If the number of objects N is large, AMI is equivalent to NMI according to Theorem 4. Therefore, when the reference clustering is balanced with big equal sized clusters and N is large, it is advisable to use ARI rather than NMI.

5. Standardization of Clustering Comparison Measures

The selection of the most similar partition U to a reference partition V is biased according to the chosen similarity measure, the number of sets r in U, and their relative size.

Adjusting for Chance Clustering Comparison Measures

V 25 25 25 25

17 17 0 0 0 17 0 17 0 0 17 0 0 17 0 49 8 8 8 25

V 25 25 25 25

24 20 2 1 1 25 2 20 2 1 23 1 1 20 1 28 2 2 2 22

AMIq(U1; V ) AMIq(U2; V )

Figure 9: Ground truth clustering V compared in turn to the clustering solutions U1 and U2. V shows equal size clusters. U2 appears to be a better solution than U1 because its clusters are more balanced in size than the clusters in U1. When the reference clustering has big equal sized clusters their precise identiﬁcation is less crucial. AMIq with big q prefers the solution U2 because it is less biased to pure clusters in the solution. In this scenario we advise the use of AMI2.5 or AMI2 = ARI.

This phenomena is known as selection bias and it has been extensively studied in decision trees (White and Liu, 1994). Researchers in this area agree that in order to achieve unbiased selection of partitions, distribution properties of similarity measures have to be taken into account (Dobra and Gehrke, 2001; Shih, 2004; Hothorn et al., 2006). Using the permutation model, we proposed in Romano et al. (2014) to analytically standardize the Shannon MI by subtraction of its expected value and division by its standard deviation. In this section, we discuss how to achieve analytical standardization of measures S Lφ. In order to standardize a measure, we must analytically compute its variance:

Lemma 3 If S(U, V ) Lφ, when partitions U and V are random:

Var(S(U, V )) = β2 E h X

ij φij(nij) 2i X

ij E[φij(nij)] 2 ,

where E h X

ij φij(nij) 2i

is equal to

nij φ(nij)P(nij)

φij(nij) + X

ni j φi j( ni j)P( ni j)+

nij P( nij )

φij ( nij ) + X

ni j φi j ( ni j )P( ni j )

with nij Hyp(ai, bj, N), ni j Hyp(bj nij, ai , N ai), nij Hyp(ai nij, bj , N bj), ni j Hyp(ai , bj nij , N ai) hypergeometric random variables.

We can use the expected value to standardize measures S Lφ, such as generalized IT measures.

Romano, Vinh, Bailey, and Verspoor

5.1 Standardization of Generalized IT Measures

The variance under the permutation model of generalized IT measures is:

Theorem 5 Using Eqs. (10) and (13) with φij( ) = ( )q, when the partitions U and V are random:

i) Var(Hq(U, V )) = 1 (q 1)2N2q E[(P ij nq ij)2] (P ij E[nq ij])2 ;

ii) Var(MIq(U, V )) = Var(Hq(U, V ))

iii) Var(VIq(U, V )) = 4Var(Hq(U, V ))

We deﬁne the standardized version of the similarity measure MIq (SMIq), and the standardized version of the distance measure VIq (SVIq) as follows:

SMIq MIq E[MIq] p

Var(MIq) , SVIq E[VI]q VIq p

Var(VIq) , (14)

As for the case of AMIq and AVIq, it turns out that SMIq is equal to SVIq:

Theorem 6 Using Eqs. (10) and (13) with φij( ) = ( )q, the standardized MIq(U, V ) and the standardized VIq(U, V ) are:

SMIq(U, V ) = SVIq(U, V ) =

P ij nq ij P ij E[nq ij] q

E[(P ij nq ij)2] (P ij E[nq ij])2 (15)

This formula shows that we are interested in maximizing the diﬀerence between the sum of the cells of the actual contingency table and the sum of the expected cells under randomness. Standardized measures diﬀers from their adjusted counterpart because of the denominator, i.e. the standard deviation of the sums of the cells. Indeed, SMIq and SVIq measure the number of standard deviations MIq and VIq are from their mean. There are some notable special cases for particular choices of q. Indeed, our generalized standardization of IT measures allows us to generalize also the standardization of paircounting measures such as the Rand index. To see this, let us deﬁne the Standardized Rand Index (SRI): SRI RI E[RI]

Var(RI) and recall that the standardized G-statistic is deﬁned

as SG G E[G]

Var(G) (Romano et al., 2014):

Corollary 2 It holds true that: i) limq 1 SMIq = limq 1 SVIq = SMI = SVI = SG with Shannon entropy; ii) SMI2 = SVI2 = SRI.

Computational complexity: The computational complexity of SMIq is dominated by computation of the second moment of the sum of the cells deﬁned in Eq. (13):

Proposition 3 The computational complexity of SMIq is O(N3c max {c, r}).

Note that the complexity is quadratic in c and linear in r. This happens because of the way we decided to condition the probabilities in Eq. (13) in the proof of Lemma 3. With diﬀerent conditions, it is possible to obtain a formula symmetric to Eq. (13) with complexity O(N3r max {r, c}) (Romano et al., 2014).

Adjusting for Chance Clustering Comparison Measures

Statistical inference: All IT measures computed on partitions can be seen as estimators of their true value computed using the random variables associated to the partitions U and V . Therefore, SMIq can be used as non-parametric independence test for MIq. We formalize this with the following proposition:

Proposition 4 The p-value associated to the test for independence between U and V using MIq(U, V ) is smaller than: 1 1+(SMIq(U,V ))2 .

For example, if SMIq is equal to 4.46 the associated p-value is smaller than 0.05. Neural time series data is often analyzed making use of the Shannon MI (e.g. see Chapter 29 in Cohen (2014)). It is common practice to test the independence of two time series by computing SMI via Monte Carlo permutations, sampling from the space of N! cardinality. Our SMIq can be eﬀectively and eﬃciently used in this application because it is exact and obtains O(N3r max {r, c}) complexity.

5.2 Experiments on Selection Bias

In this section, we evaluate the performance of standardized measures on selection bias correction when partitions U are generated at random and independently from the reference partition V . This hypothesis has been employed in previous published research to study selection bias (White and Liu, 1994; Frank and Witten, 1998; Dobra and Gehrke, 2001; Shih, 2004; Hothorn et al., 2006; Romano et al., 2014). In particular, we experimentally demonstrate that NMIq is biased towards the selection of partitions U with more clusters at any q. Therefore, in this scenario it is beneﬁcial to perform standardization. Mind though that the choice of whether performing standardization or not is application dependent (Romano et al., 2015). For example, it has been argued that in some cases the selection of clustering solutions should be biased towards clusterings with the same number of clusters as in the reference (Amelio and Pizzuti, 2015). In this section we aim to show the eﬀects of selection bias when clusterings are independent and that standardization helps in reducing it. Moreover, we will see in Section 5.3 that it is particularly important to correct for selection bias when the number of objects N is small.

Given a reference partition V on N = 100 objects with c = 4 sets, we generate a pool of random partitions U with r ranging from 2 to 10 sets. Then, we use NMIq(U, V ) to select the closest partition to the reference V . The plot at the bottom of Figure 10 shows the probability of selection of a partition U with r sets using NMIq computed on 5000 simulations. We do not expect any partition to be the best given that they are all generated at random: i.e., the plot is expected to be ﬂat if a measure is unbiased. Nonetheless, we see that there is a clear bias towards partitions with 10 sets if we use NMIq with q respectively equal to 1.001, 2, or 3. We can see that the use of the adjusted measures such as AMIq helps in decreasing this bias, in particular when q = 2. On this experiment when q = 2, baseline adjustment seems to be eﬀective in decreasing the selection bias because the variance of AMI2 = ARI is almost constant. However for all q, using SMIq we obtain close to uniform probability of selection of each random partition U.

Romano, Vinh, Bailey, and Verspoor

Probability of selection (q = 1:001)

Number of sets r in U 2 4 6 8 10

Probability of selection (q = 2)

Number of sets r in U 2 4 6 8 10

Probability of selection (q = 3)

Number of sets r in U 2 4 6 8 10

Figure 10: Selection bias towards random partitions U with diﬀerent number of sets r when compared to a reference V . The probability of selection should be uniform when partitions are random. Using SMIq we achieve close to uniform probability of selection for q equal to 1.001, 2 and 3 respectively.

5.3 Large Number of Objects

It is likely to expect that the variance of generalized IT measures decreases when partitions are generated on a large number of objects N. Here we prove a general result about measures of the family Nφ.

Lemma 4 If S(U, V ) Nφ, then lim N + Var(S(U, V )) = 0.

Given that generalized IT measures belong in the family Nφ, we can prove the following:

Theorem 7 It holds true that:

lim N + Var(Hq(U, V )) = lim N + Var(MIq(U, V )) = lim N + Var(VIq(U, V )) = 0 (16)

Therefore, SMIq attains very large values when N is large. In practice of course, N is ﬁnite, so the use of SMIq is beneﬁcial. However, it is less important to correct for selection bias if the number of objects N is big with regards to the number of cells in the contingency table in Table 2: i.e., when N

r c is large. Indeed, when the number of objects is large AMIq might be suﬃcient to avoid selection bias and any test for independence between partitions has high power. In this scenario, SMIq is not needed and AMIq might be preferred as it can be computed more eﬃciently.

6. Conclusion

In this paper, we computed the exact expected value and variance of measures of the family Lφ, which contains generalized IT measures. We also showed how the expected value for measures S Nφ can be computed for large N. Using these statistics, we proposed AMIq and SMIq to adjust generalized IT measures both for baseline and for selection bias. AMIq is a further generalization of well known measures for clustering comparisons such as ARI and AMI. This analysis allowed us to provide guidelines for their best application in diﬀerent scenarios. In particular ARI might be used as external validation index when the reference

Adjusting for Chance Clustering Comparison Measures

clustering shows big equal sized clusters. AMI can be used when the reference clustering is unbalanced and there exist small clusters. The standardized SMIq can instead be used to correct for selection bias among many possible candidate clustering solutions when the number of objects is small. Furthermore, it can also be used to test the independence between two partitions. All code has been made available online1.

Acknowledgments

James Bailey s work was supported by an Australian Research Council Future Fellowship. Experiments were carried out on Amazon cloud supported by AWS in Education Grant Award.

1. https://sites.google.com/site/adjgenit/

Romano, Vinh, Bailey, and Verspoor

Appendix A. Theorem Proofs

Proposition 1 (Simovici, 2007) When q = 2 the generalized variation of information, the Mirkin index, and the Rand index are linearly related: VI2(U, V ) = 1 N2 MK(U, V ) = N 1

N (1 RI(U, V )).

VIq(U, V ) = 2Hq(U, V ) Hq(U) Hq(V )

= 1 (q 1)Nq

When q = 2, VI2(U, V ) = 1 N2 (P i a2 i + P j b2 j 2 P i,j n2 ij) = 1 N2 MK(U, V ) = N 1

N (1 RI(U, V )).

Lemma 1 If S(U, V ) Lφ, when partitions U and V are random:

E[S(U, V )] = α + β X

ij E[φij(nij)] where E[φij(nij)] is (9)

min{ai,bj} X

nij=max{0,ai+bj N} φij(nij) ai!bj!(N ai)!(N bj)! N!nij!(ai nij)!(bj nij)!(N ai bj + nij)! (10)

Proof The expected value of S(U, V ) according to the hypergeometric model of randomness is E[S(U, V )] = P M S(M)P(M) where M is a contingency table generated via permutations. This is reduced to E[S(U, V )] = P M(α + β P ij φij(nij))P(M) = α + β P M P ij φij(nij)P(M). Because of linearity of the expected value, it is possible to swap the summation over M and the one over cells obtaining α + β P ij P nij φij(nij)P(nij) = α + β P ij E[φij(nij)] where nij is a hypergeometric distribution with the marginals ai, bj, and N as parameters, i.e. nij Hyp(ai, bj, N).

Theorem 1 When the partitions U and V are random:

i) E[Hq(U, V )] = 1 q 1 1 1 Nq P ij E[nq ij] with E[nq ij] from Eq. (10) with φij(nij) = nq ij;

ii) E[MIq(U, V )] = Hq(U) + Hq(V ) E[Hq(U, V )];

iii) E[VIq(U, V )] = 2E[Hq(U, V )] Hq(U) Hq(V ).

Proof The results easily follow from Lemma 1 and the hypothesis of ﬁxed marginals.

Adjusting for Chance Clustering Comparison Measures

Theorem 2 Using E[nq ij] in Eq. (10) with φij(nij) = nq ij, the adjustments for chance for MIq(U, V ) and VIq(U, V ) are:

AMIq(U, V ) = AVIq(U, V ) =

P ij nq ij P ij E[nq ij]

1 2 P i aq i + P j bq j P ij E[nq ij] (12)

Proof The using the upper bound 1

2(Hq(U) + Hq(V )) to MIq, AMIq and AVIq are equiva-

lent. Therefore we compute AVIq. The denominator is equal to E[VIq] = 2 (q 1)Nq 1 2(P i aq i + P j bq j) P i,j E[nq ij] . The numerator is instead 2 (q 1)Nq P ij nq ij P i,j E[nq ij] .

Corollary 1 It holds true that: i) limq 1 AMIq = limq 1 AVIq = AMI = AVI with Shannon entropy; ii) AMI2 = AVI2 = ARI.

Proof Point i) follows from the limit of the q-entropy when q 1. Point ii) follows from:

AVI2 = E[VI2] VI2 E[VI2] min VI2 =

N (RI E[RI])

N (max RI E[RI]) = ARI

Proposition 2 The computational complexity of AMIq is O(N max {r, c}).

Proof The computation of P(nij) where nij is a hypergeometric distribution Hyp(ai, bj, N) is linear in N. However, the computation of the expected value E[nq ij] = P nij nq ij P(nij) can

exploit the fact that P(nij) are computed iteratively: P(nij+1) = P(nij) (ai nij)(bj nij) (nij+1)(N ai bj+nij+1). We compute P(nij) only for max {0, ai + bj N}. In both cases P(nij) can be computed in O(max {ai, bj}). We can compute all other probabilities iteratively as shown above in constant time. Therefore:

O(max {ai, bj}) +

min {ai,bj} X

j=1 O(max {ai, bj}) =

i=1 O(max {cai, N})

= O(max {c N, r N}) = O(N max {c, r})

Lemma 2 If S(U, V ) Nφ, then lim N + E[S(U, V )] = φ a1

N b1 N , . . . , ai

N bj N , . . . , ar

Romano, Vinh, Bailey, and Verspoor

Proof S(U, V ) can be written as φ(n11

N , . . . , nij

N , . . . , nrc

N ). Let X = (X1, . . . , Xrc) = (n11

N , . . . , nij

N , . . . , nrc

N ) be a vector of rc random variables where nij is a hypergeometric distribution with the marginals as parameters: ai, bj and N. The expected value of nij

N ] = 1 N aibj

N . Let µ = (µ1, . . . , µrc) = (E[X1], . . . , E[Xrc]) = (a1

N b1 N , . . . , ai

N bj N , . . . , ar

N bc N ) be the vector of the expected values. The Taylor approximation of S(U, V ) = φ(X) around µ is:

φ(X) φ(µ) +

t=1 (Xt µt) φ

s=1 (Xt µt)(Xs µs) 2φ Xt Xs + . . .

Its expected value is (see Section 4.3 of (Ang and Tang, 2006)):

E[φ(X)] φ(µ) + 1

s=1 Cov(Xt, Xs) 2φ Xt Xs + . . .

We just analyse the second order remainder given that it dominates the higher order ones. Using the Cauchy-Schwartz inequality we have that |Cov(Xt, Xs)| p

Var(Xt)Var(Xs). Each Xt and Xs is equal to nij

N for some indexes i and j. The variance of each Xt and Xs is therefore equal to Var(nij

N ) = 1 N2 aibj

N 1 . When the number of records is large also the marginals increase: N + ai + , and bj + i, j. However, because of the permutation model, all the fractions ai

N stay constant i, j. Therefore, also µ

is constant. However, at the limit of large N, the variance of nij

N tends to 0: Var nij

1 N ai N bj N 1 ai

N 1 + 1 N 1 bj

N 0. Therefore, at large N:

E[φ(X)] φ(µ) = φ a1

N b1 N , . . . , ai

N bj N , . . . , ar

Theorem 3 It holds true that:

i) lim N + E[Hq(U, V )] = Hq(U) + Hq(V ) (q 1)Hq(U)Hq(V );

ii) lim N + E[MIq(U, V )] = (q 1)Hq(U)Hq(V );

iii) lim N + E[VIq(U, V )] = Hq(U) + Hq(V ) 2(q 1)Hq(U)Hq(V ).

Proof E[Hq(U, V )] = 1 q 1 1 P ij E h nij

N qi and according to Lemma 2 for large N:

E[Hq(U, V )] 1 q 1 1 P ij ai N bj N q = 1 q 1 1 P i ai N q P j bj

N q . If we add an subtract

Adjusting for Chance Clustering Comparison Measures

1 P i ai N q P j bj

N q in the parenthesis above:

E[Hq(U, V )] 1 q 1

= Hq(U) + Hq(V ) + 1 q 1

= Hq(U) + Hq(V ) (q 1)Hq(U)Hq(V )

Point ii) and iii) follow from Equations (6) and (7).

Lemma 3 If S(U, V ) Lφ, when partitions U and V are random:

Var(S(U, V )) = β2 E h X

ij φij(nij) 2i X

ij E[φij(nij)] 2 ,

ij φij(nij) 2i

is equal to

nij φ(nij)P(nij)

φij(nij) + X

ni j φi j( ni j)P( ni j)+

nij P( nij )

φij ( nij ) + X

ni j φi j ( ni j )P( ni j )

with nij Hyp(ai, bj, N), ni j Hyp(bj nij, ai , N ai), nij Hyp(ai nij, bj , N bj), ni j Hyp(ai , bj nij , N ai) hypergeometric random variables.

Proof The proof follows Theorem 1 proof in Romano et al. (2014). Using the properties of the variance we can show that Var(S(U, V )) = β2Var(P ij φij(nij)) = β2 E[(P ij φij(nij))2]

Romano, Vinh, Bailey, and Verspoor

(P ij E[φij(nij)])2 . (E[P ij φij(nij)])2 = (P ij E[φij(nij)])2 can be computed using Eq. (10). The ﬁrst term in the sum is instead:

ij φij(nij))2] = X

i j E[φij(nij)φi j (ni j )] = X

ni j φij(nij)φi j (ni j )P(nij, ni j )

We cannot ﬁnd the exact form of the joint probability P(nij, ni j ) thus we rewrite it as P(nij)P(ni j |nij) = P(nij)P( ni j ). The random variable nij is an hypergeometric distribution that simulates the experiment of sampling without replacement the ai objects in the set ui from a total of N objects. Sampling one of the bj objects from vj is deﬁned as a success: nij Hyp(ai, bj, N). The random variable ni j has a diﬀerent distribution depending on the possible combinations of indexes i, i , j, j . Thus E[(P ij φij(nij))2] is equal to:

ni j φij(nij)φi j (ni j )P(nij, ni j ) = X

nij φij(nij)P(nij) X

ni j φi j ( ni j )P( ni j )

which, by taking care of all possible combinations of i, i , j, j , is equal to :

nij φij(nij)P(nij)

nij φij( nij)P( nij) + X

ni j φi j( ni j)P( ni j)

nij φij ( nij )P( nij ) + X

ni j φi j ( ni j )P( ni j )

Case 1: i = i j = j P( nij) = 1 if and only if nij = nij and 0 otherwise. This case produces the ﬁrst term φij(nij) enclosed in square brackets.

Case 2: i = i j = j In this case, the possible successes are the objects from the set vj . We have already sampled nij objects and we are sampling from the whole set of objects excluding the set vj. Thus, nij Hyp(ai nij, bj , N bj).

Case 3: i = i j = j This case is symmetric to the previous one where ai is now the possible number of successes. Therefore ni j Hyp(bj nij, ai , N ai).

Case 4: i = i j = j In order compute P( ni j ), we have to impose a further condition:

P( ni j ) = X

nij P( ni j | nij )P( nij ) = X

nij P( ni j )P( nij )

Adjusting for Chance Clustering Comparison Measures

We are considering sampling the ai objects in ui from the whole set of objects excluding the ai objects from ui. Just knowing that nij objects have already been sampled from ui does not allow us to know how many objects from vj have also been sampled. If we know that nij is the number of objects sampled from vj , we know there are bj nij possible successes and thus ni j | nij = ni j Hyp(ai , bj nij , N ai). So the last two terms in Eq. (18) can be put together: X

nij φij ( nij )P( nij ) + X

ni j φi j ( ni j )P( ni j )

nij φij ( nij )P( nij ) + X

ni j φi j ( ni j ) X

nij P( ni j | nij )P( nij )

nij φij ( nij )P( nij ) + X

ni j φi j ( ni j ) X

nij P( ni j )P( nij )

nij P( nij )φij ( nij ) + X

nij P( nij ) X

ni j φi j ( ni j )P( ni j )

nij P( nij )

φij ( nij ) + X

ni j φi j ( ni j )P( ni j )

By putting everything together we get that E[(P ij φij(nij))2] is equal to:

nij φ(nij)P(nij)

φij(nij) + X

ni j φi j( ni j)P( ni j)+

nij P( nij )

φij ( nij ) + X

ni j φi j ( ni j )P( ni j )

Theorem 5 Using Eqs. (10) and (13) with φij( ) = ( )q, when the partitions U and V are random:

i) Var(Hq(U, V )) = 1 (q 1)2N2q E[(P ij nq ij)2] (P ij E[nq ij])2 ;

ii) Var(MIq(U, V )) = Var(Hq(U, V ))

iii) Var(VIq(U, V )) = 4Var(Hq(U, V ))

Proof The results follow from Lemma 3, the hypothesis of ﬁxed marginals and properties of the variance.

Theorem 6 Using Eqs. (10) and (13) with φij( ) = ( )q, the standardized MIq(U, V ) and the standardized VIq(U, V ) are:

SMIq(U, V ) = SVIq(U, V ) =

P ij nq ij P ij E[nq ij] q

E[(P ij nq ij)2] (P ij E[nq ij])2 (15)

Romano, Vinh, Bailey, and Verspoor

Proof For SMIq: the numerator is equal to Hq(U, V ) E[Hq(U, V )] = 1 (q 1)Nq P ij nq ij P i,j E[nq ij] . According Theorem 5, the denominator is instead: q

Var(MIq(U, V )) = q

Var(Hq(U, V )) = 1 (q 1)Nq

ij nq ij)2] (E[ X

ij nq ij])2.

For SVIq: the numerator is equal to 2Hq(U, V ) 2E[Hq(U, V )] = 2 (q 1)Nq P ij nq ij P i,j E[nq ij] . According Theorem 5, the denominator is instead:

Var(VIq(U, V )) = q

4Var(Hq(U, V )) = 2 (q 1)Nq

ij nq ij)2] (E[ X

ij nq ij])2.

Therefore, SMIq and SVIq are equivalent.

Corollary 2 It holds true that: i) limq 1 SMIq = limq 1 SVIq = SMI = SVI = SG with Shannon entropy; ii) SMI2 = SVI2 = SRI.

Proof Point i) follows from the limit of the q-entropy when q 1 and the linear relation of G-statistic to MI: G = 2NMI. Point ii) follows from:

SVI2 = E[VI2] VI2 p

N (RI E[RI])

Var(RI) = SRI

Proposition 3 The computational complexity of SMIq is O(N3c max {c, r}).

Proof Each summation in Eq. (13) can be bounded above by the maximum value of the cell marginals and each sum can be done in constant time. The last summation in Eq. (13) is:

max {ai,bj } X

max {ai ,bj } X

ni j =0 O(1) =

max {ai,bj } X

i =1 O(max {ai , bj })

max {ai,bj } X

nij =0 O(max {N, rbj })

j =1 O(max {ai N, airbj , bj N, rb2 j })

= O(max {cai N, air N, r N2})

Adjusting for Chance Clustering Comparison Measures

The above term is thus the computational complexity of the inner loop. Using the same machinery one can prove that:

max {ai,bj} X

nij=0 O(max {cai N, air N, r N2}) = O(max {c2N3, rc N3}) = O(N3c max {c, r})

Proposition 4 The p-value associated to the test for independence between U and V using MIq(U, V ) is smaller than: 1 1+(SMIq(U,V ))2 .

Proof Let MI0 q be the random variable under the null hypothesis of independence between partitions associated to the test statistic MIq(U, V ). The p-value is deﬁned as:

p-value = P MI0 q MIq(U, V ) = P MI0 q E[MIq(U, V )] MIq(U, V ) E[MIq(U, V )]

MI0 q E[MIq(U, V )] p

Var(MIq(U, V )) MIq(U, V ) E[MIq(U, V )] p

Var(MIq(U, V ))

MI0 q E[MIq(U, V )] p

Var(MIq(U, V )) SMIq(U, V )

Let Z be the standardized random variable MI0 q E[MIq(U,V )]

Var(MIq(U,V )) , then using the one side Cheby-

shev s inequality also known as the Cantelli s inequality (Ross, 2012):

p-value = P(Z SMIq(U, V )) < 1

1 + SMIq(U, V ) 2

Lemma 4 If S(U, V ) Nφ, then lim N + Var(S(U, V )) = 0.

Proof Let X = (X1, . . . , Xrc) = (n11

N , . . . , nij

N , . . . , nrc

N ) be a vector of rc random variables where nij is a hypergeometric distribution with the marginals as parameters: ai, bj and N. Using the Taylor approximation (Ang and Tang, 2006) of S(U, V ) = φ(X), it is possible to show that:

s=1 Cov(Xt, Xs) φ

φ Xs + . . .

Using the Cauchy-Schwartz inequality we have that |Cov(Xt, Xs)| p

Var(Xt)Var(Xs). Each Xt and Xs is equal to nij

N for some indexes i and j. The variance of each Xt and Xs is

Romano, Vinh, Bailey, and Verspoor

therefore equal to Var(nij

N ) = 1 N2 aibj

N 1 . When the number of records is large also the marginals increase: N + ai + , and bj + i, j. However because of the permutation model, all the fractions ai

N stay constant i, j. Therefore, at the limit

of large N, the variance of nij

N tends to 0: Var nij

N ai N bj N 1 ai

N 1 + 1 N 1 bj

N 0 and thus Var(φ(X)) tends to 0.

Theorem 7 It holds true that:

lim N + Var(Hq(U, V )) = lim N + Var(MIq(U, V )) = lim N + Var(VIq(U, V )) = 0 (16)

Proof Trivially follows from Lemma 4.

Adjusting for Chance Clustering Comparison Measures

Charu C. Aggarwal and Chandan K. Reddy. Data Clustering: Algorithms and Applications. CRC Press, 2013.

Ahmed N. Albatineh and Magdalena Niewiadomska-Bugaj. Correcting Jaccard and other similarity indices for chance agreement in cluster analysis. Advances in Data Analysis and Classiﬁcation, 5(3):179 200, 2011.

Ahmed N. Albatineh, Magdalena Niewiadomska-Bugaj, and Daniel Mihalko. On similarity indices and correction for chance agreement. Journal of Classiﬁcation, 23(2):301 313, 2006.

Alessia Amelio and Clara Pizzuti. Is normalized mutual information a fair measure for comparing community detection methods? In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015, pages 1584 1585. ACM, 2015.

Alfredo H-S. Ang and Wilson H. Tang. Probability Concepts in Engineering: Emphasis on Applications to Civil and Environmental Engineering. John Wiley and Sons, 2006.

Evan Archer, Il Memming Park, and Jonathan W Pillow. Bayesian and quasi-bayesian estimators for mutual information from discrete data. Entropy, 15(5):1738 1755, 2013.

Asa Ben-Hur, Andr e Elisseeﬀ, and Isabelle Guyon. A stability based method for discovering structure in clustered data. In Paciﬁc symposium on biocomputing, volume 7, pages 6 17, 2001.

Rich Caruana, M Elhaway, Nam Nguyen, and Casey Smith. Meta clustering. In Data Mining, 2006. ICDM 06. Sixth International Conference on, pages 107 118. IEEE, 2006.

Annalisa Cerquetti. Bayesian nonparametric estimation of tsallis diversity indices under gnedin-pitman priors. ar Xiv preprint ar Xiv:1404.3441, 2014.

Mike X Cohen. Analyzing neural time series data: theory and practice. MIT Press, 2014.

Antonio Criminisi, Jamie Shotton, and Ender Konukoglu. Decision forests: A uniﬁed framework for classiﬁcation, regression, density estimation, manifold learning and semisupervised learning. Foundations and Trends in Computer Graphics and Vision, 7(2-3): 81 227, 2012.

Zolt an Dar oczy. Generalized information functions. Information and control, 16(1):36 51, 1970.

Alin Dobra and Johannes Gehrke. Bias correction in classiﬁcation tree construction. In Proceedings of the International Conference on Machine Learning, pages 90 97, 2001.

Eibe Frank and Ian H. Witten. Using a permutation test for attribute selection in decision trees. In Proceedings of the International Conference on Machine Learning, pages 152 160, 1998.

Romano, Vinh, Bailey, and Verspoor

Shigeru Furuichi. Information theoretical properties of tsallis entropies. Journal of Mathematical Physics, 47(2):023302, 2006.

Jan Havrda and Frantiˇsek Charv at. Quantiﬁcation method of classiﬁcation processes. concept of structural a-entropy. Kybernetika, 3(1):30 35, 1967.

Torsten Hothorn, Kurt Hornik, and Achim Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15 (3):651 674, 2006.

Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classiﬁcation, 2: 193 218, 1985.

Yang Lei, James C Bezdek, Jeﬀrey Chan, Nguyen Xuan Vinh, Simone Romano, and James Bailey. Generalized information theoretic cluster validity indices for soft clusterings. In IEEE Symposium on Computational Intelligence and Data Mining (CIDM), pages 24 31. IEEE, 2014a.

Yang Lei, Nguyen Xuan Vinh, Jeﬀrey Chan, and James Bailey. Filta: Better view discovery from collections of clusterings via ﬁltering. In Machine Learning and Knowledge Discovery in Databases, pages 145 160. Springer, 2014b.

Yang Lei, James Bezdek, Jeﬀrey Chan, Nguyen Vinh, Simone Romano, and James Bailey. Extending information-theoretic validity indices for fuzzy clustering. IEEE Transactions on Fuzzy Systems, 2016.

Fabr ıcio M Lopes, Evaldo A de Oliveira, and Roberto M Cesar. Inference of gene regulatory networks from time series by Tsallis entropy. BMC systems biology, 5(1):61, 2011.

Andr e FT Martins, Noah A Smith, Eric P Xing, Pedro MQ Aguiar, and M ario AT Figueiredo. Nonextensive information theoretic kernels on measures. The Journal of Machine Learning Research, 10:935 975, 2009.

Tomasz Maszczyk and W lodzis law Duch. Comparison of Shannon, Renyi and Tsallis entropy used in decision trees. In Artiﬁcial Intelligence and Soft Computing ICAISC 2008, pages 643 651. Springer, 2008.

Marina Meil a. Comparing clusterings an information based distance. Journal of Multivariate Analysis, 98(5):873 895, 2007.

Leslie C Morey and Alan Agresti. The measurement of classiﬁcation agreement: an adjustment to the rand statistic for chance agreement. Educational and Psychological Measurement, 44(1):33 37, 1984.

Emmanuel M uller, Stephan G unnemann, Ines F arber, and Thomas Seidl. Discovering multiple clustering solutions: Grouping objects in diﬀerent views of the data. Tutorial at ICML, 2013. URL http://dme.rwth-aachen.de/en/DMCS.

Liam Paninski. Estimation of entropy and mutual information. Neural computation, 15(6): 1191 1253, 2003.

Adjusting for Chance Clustering Comparison Measures

William M Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association, 66(336):846 850, 1971.

Alfred Renyi. On measures of entropy and information. 1961.

Simone Romano, James Bailey, Vinh Nguyen, and Karin Verspoor. Standardized mutual information for clustering comparisons: One step further in adjustment for chance. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1143 1151, 2014.

Simone Romano, Nguyen Xuan Vinh, James Bailey, and Karin Verspoor. A framework to adjust dependency measure estimates for chance. ar Xiv preprint ar Xiv:1510.07786, 2015.

Sheldon Ross. A ﬁrst course in probability. Pearson, 2012.

Y-S Shih. A note on split selection bias in classiﬁcation trees. Computational statistics & data analysis, 45(3):457 466, 2004.

Dan Simovici. On generalized entropy and entropic metrics. Journal of Multiple Valued Logic and Soft Computing, 13(4/6):295, 2007.

Alexander Strehl and Joydeep Ghosh. Cluster ensembles a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583 617, 2003.

Constantino Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of statistical physics, 52(1-2):479 487, 1988.

Constantino Tsallis et al. Introduction to Nonextensive Statistical Mechanics. Springer, 2009.

Marius Vila, Anton Bardera, Miquel Feixas, and Mateu Sbert. Tsallis mutual information for document classiﬁcation. Entropy, 13(9):1694 1707, 2011.

Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison: is a correction for chance necessary? In Proceedings of the International Conference on Machine Learning, pages 1073 1080. ACM, 2009.

Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11:2837 2854, 2010.

Nguyen Xuan Vinh, Jeﬀrey Chan, Simone Romano, and James Bailey. Eﬀective global approaches for mutual information based feature selection. In Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pages 512 521. ACM, 2014.

Yisen Wang, Chaobing Song, and Shu-Tao Xia. Improving decision trees using tsallis entropy. ar Xiv preprint ar Xiv:1511.08136, 2015.

Romano, Vinh, Bailey, and Verspoor

Matthijs J Warrens. On the equivalence of Cohens kappa and the Hubert-Arabie adjusted Rand index. Journal of Classiﬁcation, 25(2):177 183, 2008.

Allan P. White and Wei Zhong Liu. Bias in information-based measures in decision tree induction. Machine Learning, pages 321 329, 1994.

Junjie Wu, Hui Xiong, and Jian Chen. Adapting the right measures for k-means clustering. In Knowledge Discovery and Data Mining, pages 877 886, 2009.