# Unbiased Multivariate Correlation Analysis

Yisen Wang, Simone Romano, Vinh Nguyen, James Bailey, Xingjun Ma, Shu-Tao Xia
Dept. of Computer Science and Technology, Tsinghua University, China
Dept. of Computing and Information Systems, University of Melbourne, Australia
wangys14@mails.tsinghua.edu.cn; {simone.romano, vinh.nguyen, baileyj, xingjun.ma}@unimelb.edu.au; xiast@sz.tsinghua.edu.cn

## Abstract

Correlation measures are a key element of statistics and machine learning, and essential for a wide range of data analysis tasks. Most existing correlation measures are for pairwise relationships, but real-world data can also exhibit complex multivariate correlations, involving three or more variables. We argue that multivariate correlation measures should be comparable, interpretable, scalable and unbiased. However, no existing measures satisfy all these requirements. In this paper, we propose an unbiased multivariate correlation measure, called UMC, which satisfies all the above criteria. UMC is a cumulative entropy based non-parametric multivariate correlation measure, which can capture both linear and non-linear correlations for groups of three or more variables. It employs a correction for chance using a statistical model of independence to address the issue of bias. UMC has high interpretability, and we empirically show that it outperforms state-of-the-art multivariate correlation measures in terms of statistical power, as well as for use in both subspace clustering and outlier detection tasks.

## Introduction

Analysing correlations is a fundamental task in both statistics and machine learning. It has applications in many real-world learning tasks, e.g., feature selection (Brown et al. 2012), subspace search (Nguyen et al. 2013), causal inference (Bareinboim, Tian, and Pearl 2014) and subspace clustering (Kriegel, Kröger, and Zimek 2009). For the setting of continuous variables (as opposed to discrete), most existing correlation measures focus on pairwise relationships. For example, Pearson's correlation coefficient detects bivariate linear correlations, and the Maximal Information Coefficient (MIC) (Reshef et al. 2011) detects both linear and non-linear bivariate correlations. However, real-world data often contains three or more variables which can exhibit multivariate (higher-order) correlations. If bivariate measures are used to identify multivariate correlations through pairwise aggregation, multivariate correlations can potentially be overlooked. For example, it has been shown that genes may reveal only a weak correlation with a disease when considered individually, while the correlation for a group of genes may be very strong (Zhang et al. 2008).

Figure 1 (panels: (a) UDS and UDS-r, (b) MAC, (c) CMI, (d) HiCS; each plots the raw score against the dimension, 2–10, for a functional relationship and for independent variables): Raw scores for existing multivariate correlation measures. For interpretability and unbiasedness, a measure's score should be a horizontal line with value 1 for a functional relationship as the number of variables varies.
It should be a horizontal line with value 0 for sets of independent variables of increasing size. None of the four measures exhibits both behaviours.

Identifying multivariate correlations is therefore an important activity in data analysis. Measures for multivariate correlation analysis have three important applications (Romano et al. 2016): (a) detection; (b) quantification; and (c) ranking. We argue that these applications require a multivariate correlation measure to meet the following criteria:

1. Comparability: The correlation score for a small set of variables can be meaningfully compared against the score for a large set of variables. The score should also lie within a predetermined range such as [0, 1].
2. Interpretability: The correlation score should be 0 for a set of independent variables and 1 for a set of variables having a strong relationship.
3. Scalability: The score should be efficient to compute as the number of variables or the data size increases.
4. Unbiasedness: The score should remain constant as the number of variables increases, if a fixed degree of relationship is maintained among them.

Recently proposed multivariate correlation measures, such as the Universal Dependency Score (UDS) (Nguyen, Mandros, and Vreeken 2016), Multivariate Maximal Correlation (MAC) (Nguyen et al. 2014), Cumulative Mutual Information (CMI) (Nguyen et al. 2013) and HiCS (Keller, Müller, and Böhm 2012), only partially satisfy the above four criteria. For example, given a data set of 1000 data points with d variables {X_i} (i = 1, ..., d), we consider two simple correlations: a strong functional one (all X_i are identical) and a weak independent one (all X_i are random variables, drawn independently of each other). For good interpretability, we expect the measure's score to be close to 1 for the functional relationship and 0 for the set of independent variables. More importantly, the measure should be stable for a fixed correlation as dimensionality (the number of variables) increases, i.e., unbiased. However, Figure 1 shows that none of the above measures satisfies these basic requirements. UDS, UDS-r (UDS without regularization), MAC and CMI do not output a stable score, showing an increasing trend either for the functional relationship or for independent variables, while HiCS shows a decreasing trend for the functional relationship. We summarize the behaviour of these multivariate correlation measures in Table 1 and see that none of them satisfies all four criteria.

Table 1: The behaviour of multivariate correlation measures (UDS, UDS-r, MAC, CMI, HiCS and UMC) with respect to comparability, interpretability, scalability and unbiasedness; only UMC satisfies all four criteria.

Motivated by these observations, we propose an Unbiased Multivariate Correlation measure (UMC) that satisfies the above four criteria. UMC is designed by taking a unified view of existing correlation measures. It employs cumulative entropy, which permits non-parametric computation on (continuous) empirical data (Rao et al. 2004; Di Crescenzo and Longobardi 2009a). To address the interpretability and bias issues, we extend the framework proposed in (Romano et al. 2016) to multivariate correlation measures, analytically derive the expected value of the conditional cumulative entropy under a statistical model of independence, and use it for bias correction of UMC. Analytical computation (as opposed to Monte Carlo simulation) of this expected value ensures UMC maintains a good degree of computational efficiency.
Overall, UMC is purely non-parametric and suitable for capturing both linear and non-linear correlations amongst multiple variables, while achieving comparability, interpretability, scalability as well as unbiasedness.

Our contributions in this paper are fourfold: (1) we provide a unified view of existing multivariate correlation measures and the connections between them; (2) we theoretically and empirically analyse bias issues for existing multivariate correlation measures; (3) we propose UMC, an unbiased measure for multivariate correlation analysis; and (4) we empirically demonstrate that UMC achieves the best performance amongst existing multivariate correlation measures in terms of statistical power and task performance, based on experiments with both synthetic and real-world data.

## A Unified View of Correlation Measures

In this section, we review existing multivariate correlation measures in a unified way to highlight their connections. Consider a data set D with d real-valued variables {X_i} (i = 1, ..., d) and n data points. We also refer to d as the dimensionality of the data set. We regard each X_i as a random variable, characterized by its probability distribution function (PDF) p(X_i). A multivariate correlation measure M quantifies how much the relation of D = {X_i} deviates from statistical independence (Te Sun 1980), that is, how much the joint probability distribution differs from the product of the marginal probability distributions:

$$M(D) = \mathrm{diff}\Big(p(X_1, \ldots, X_d),\ \prod_{i=1}^{d} p(X_i)\Big). \tag{1}$$

If diff() is instantiated as the KL-divergence (Kullback and Leibler 1951), Eq. (1) becomes the Total Correlation (TC) measure (Te Sun 1978):

$$TC(D) = \mathrm{KL}\Big(p(X_1, \ldots, X_d)\ \Big\|\ \prod_{i=1}^{d} p(X_i)\Big) = \sum_{i=1}^{d} H(X_i) - H(X_1, \ldots, X_d). \tag{2}$$

However, the joint Shannon entropy H(X_1, ..., X_d) requires estimation of the joint probability mass function p(X_1, ..., X_d), which suffers from the empty space problem (Aggarwal and Philip 2001) in high dimensional space. To avoid using the joint probability distribution, the KL-divergence in Eq. (2) can be factorized according to the factorization property of the joint probability:

$$\mathrm{KL}\Big(p(X_1, \ldots, X_d)\ \Big\|\ \prod_{i=1}^{d} p(X_i)\Big) = \mathrm{KL}\big(p(X_2|X_1)\,\big\|\,p(X_2)\big) + \cdots + \mathrm{KL}\big(p(X_d|X_1, \ldots, X_{d-1})\,\big\|\,p(X_d)\big). \tag{3}$$

This is just one possible factorization of the KL-divergence; in general, different variable orderings lead to different factorization formulas. More importantly, this factorization inspires an approximation of Eq. (1) as:

$$M(D) \approx \sum_{i=2}^{d} \mathrm{diff}\big(p(X_i),\ p(X_i|X_1, \ldots, X_{i-1})\big), \tag{4}$$

where the conditional distributions are computed progressively with i − 1 variables, partially mitigating the high dimensional empty space problem. Furthermore, this kind of progressive aggregation from small variable sets to large helps the measure scale well to high dimensionality. Despite the advantages of Eq. (4), it is sensitive to the ordering of the variables {X_1, ..., X_d}. The maximal correlation analysis approach (Breiman and Friedman 1985; Rao et al. 2011) suggests an order-free technique as follows. Let F_d be the set of bijective functions σ: {1, ..., d} → {1, ..., d}; then

$$M(D) \approx \max_{\sigma \in F_d} \sum_{i=2}^{d} \mathrm{diff}\big(p(X_{\sigma(i)}),\ p(X_{\sigma(i)}|X_{\sigma(1)}, \ldots, X_{\sigma(i-1)})\big). \tag{5}$$
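As a concrete reference point for Eqs. (1)–(5), the total correlation of Eq. (2) can be estimated by discretizing each variable and plugging empirical Shannon entropies into TC = Σ H(X_i) − H(X_1, ..., X_d). The sketch below is illustrative only (equal-width binning and the bin count are arbitrary choices, not part of any of the measures discussed here); the joint-entropy term is the part that suffers from the empty space problem and that the factorized forms in Eqs. (4) and (5) avoid.

```python
import numpy as np
from collections import Counter

def shannon_entropy(labels):
    """Empirical Shannon entropy (in nats) of a list of discrete symbols."""
    n = len(labels)
    probs = np.array([c / n for c in Counter(labels).values()])
    return float(-(probs * np.log(probs)).sum())

def total_correlation(data, bins=8):
    """Plug-in estimate of Eq. (2) for an (n, d) array of real values."""
    n, d = data.shape
    digitized = np.empty((n, d), dtype=int)
    for i in range(d):
        edges = np.linspace(data[:, i].min(), data[:, i].max(), bins + 1)
        digitized[:, i] = np.digitize(data[:, i], edges[1:-1])   # bin index 0..bins-1
    marginal = sum(shannon_entropy(list(digitized[:, i])) for i in range(d))
    joint = shannon_entropy([tuple(row) for row in digitized])   # sparse in high dimensions
    return marginal - joint

rng = np.random.default_rng(0)
x = rng.random(1000)
print(total_correlation(np.column_stack([x, x, x])))   # identical variables: large TC
print(total_correlation(rng.random((1000, 3))))        # independent variables: near 0 (up to estimation bias)
```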
Existing multivariate correlation measures can now be interpreted within this framework. HiCS is based on a specific marginalized form of Eq. (1), i.e., diff(p(X_i), p(X_i | {X_1, ..., X_d} \ {X_i})), computed as an average across multiple iterations in which X_i is randomly picked; MAC is based on Eq. (2), using Shannon entropy estimated through data discretization; CMI and UDS are based on Eq. (5), each instantiating diff() with the cumulative entropy h(X_i) (defined below):

$$\mathrm{CMI} = \max_{\sigma \in F_d} \sum_{i=2}^{d} \big[h(X_{\sigma(i)}) - h(X_{\sigma(i)}|X_{\sigma(1)}, \ldots, X_{\sigma(i-1)})\big], \qquad \mathrm{UDS} = \frac{\mathrm{CMI}}{\sum_{i=2}^{d} h(X_{\sigma(i)})},$$

where UDS is a normalized version of CMI. Our proposed measure, UMC, is also based on Eq. (5), using cumulative entropy to instantiate diff(). However, differently from UDS, we adopt a tight upper bound as the normalization factor and employ a correction for chance using a statistical model of independence. These choices address the drawbacks of the existing measures summarized in Table 1.

## Cumulative Entropy

It is common practice to discretize real-valued data to estimate Shannon entropy, but this can lead to loss of information. On the other hand, differential entropy, which works directly on continuous variables, has been shown to be problematic for assessing correlations (Rao et al. 2004). In this paper, we employ cumulative entropy (Di Crescenzo and Longobardi 2009a), which can be estimated on real-valued data without discretization. The CMI and UDS measures are also based on cumulative entropy. The cumulative entropy of a continuous random variable Z, denoted h(Z), is defined as:

$$h(Z) = -\int P(Z \le z)\,\log P(Z \le z)\,dz, \tag{6}$$

where P(Z ≤ z) is the cumulative distribution function (CDF) of Z. Similar to Shannon entropy, h(Z) captures the degree of uncertainty in Z; the only difference is that the cumulative entropy is defined on the CDF. Since 0 ≤ P(Z ≤ z) ≤ 1, it follows that h(Z) ≥ 0. The conditional cumulative entropy of any real-valued random variable Z given a random variable Y is defined as:

$$h(Z|Y) = \int_{y \in \mathrm{domain}(Y)} h(Z|y)\,p(y)\,dy, \tag{7}$$

$$h(Z|y) = -\int P(Z \le z\,|\,y)\,\log P(Z \le z\,|\,y)\,dz. \tag{8}$$

The conditional cumulative entropy has two important properties (Rao et al. 2004; Di Crescenzo and Longobardi 2009a).

Theorem 1. h(Z|Y) ≥ 0, with equality iff Z is a function of Y.

Theorem 2. h(Z|Y) ≤ h(Z), with equality iff Z is statistically independent of Y.
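The following sketch estimates h(Z) from a sample (using the spacing-based estimator given later in Eq. (10)) and h(Z|Y) with the conditioning variable discretized into equal-frequency bins, which is a simplification of the optimal discretization used in this paper. It is only meant to make Theorems 1 and 2 concrete: h(Z|Y) drops towards 0 when Z is a function of Y, and stays close to h(Z) when Z and Y are independent.

```python
import numpy as np

def cumulative_entropy(z):
    """Sample estimate of h(Z) from the spacings of the order statistics (cf. Eq. (10))."""
    z = np.sort(np.asarray(z, dtype=float))
    n = len(z)
    j = np.arange(1, n)                       # j = 1, ..., n-1
    return float(-np.sum((z[1:] - z[:-1]) * (j / n) * np.log(j / n)))

def conditional_cumulative_entropy(z, y, bins=20):
    """h(Z|Y) with Y split into equal-frequency bins (illustrative discretization)."""
    z, n = np.asarray(z, dtype=float), len(z)
    total = 0.0
    for cell in np.array_split(np.argsort(y), bins):
        total += len(cell) / n * cumulative_entropy(z[cell])
    return total

rng = np.random.default_rng(0)
y = rng.random(5000)
z_func, z_indep = y ** 2, rng.random(5000)
print(cumulative_entropy(z_func), conditional_cumulative_entropy(z_func, y))    # h(Z|Y) << h(Z)
print(cumulative_entropy(z_indep), conditional_cumulative_entropy(z_indep, y))  # h(Z|Y) ~= h(Z)
```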
## Dimensionality Bias Analysis

The presence of bias is harmful for a multivariate correlation measure, since it makes the measure hard to interpret and unreliable in real-world learning tasks, especially for quantification and ranking applications. In this section, we identify and analyse the issue of dimensionality bias empirically and theoretically.

Empirically, to illustrate the dimensionality bias issue, we generate 5 data sets with d = [5, 10, 15, 20, 25] variables and n = 1000 data points for two correlations: (i) an identity relationship (all variables are identical) and (ii) an independent set of variables (all variables are randomly and independently drawn). For each existing correlation measure, we select the data set which achieves the highest correlation score, and then repeat this process 10,000 times. The probability of selecting a data set of each possible dimensionality is shown in Figure 2.

Figure 2 (panels: UDS, MAC, CMI, HiCS; each plots the probability of selection against the dimension, 5–25, for the identity relationship and for independent variables): Dimensionality bias of correlation measures for the identity relationship and for independent variables. The ideal case would be an equal 20% probability for each possible dimensionality. The figure is best viewed in color.

Given that the correlation score should be the same for every data set regardless of its dimensionality, each data set should have an equal chance of being selected (i.e., 20%). However, Figure 2 clearly demonstrates that, for the independent variables, all these measures are biased towards data sets with high dimensionality. For the identity relationship, except for MAC, which shows almost no bias, all the other measures suffer from different biases: UDS and CMI are biased towards high dimensionality, while HiCS is biased towards low dimensionality.

Theoretically, CMI has been proved to be biased towards high dimensionality in (Nguyen, Mandros, and Vreeken 2016). Here, we prove that UDS is also biased towards high dimensionality in some cases (proof in Supplementary Material A, available at https://sites.google.com/site/umcsupplementary/home).

Theorem 3. UDS(X_1, ..., X_{d+1}) ≥ UDS(X_1, ..., X_d) whenever X_{d+1} = X_i for some i ∈ {1, 2, ..., d}.

That is to say, UDS scores a set of variables higher when there are repeated (redundant) variables.

## Unbiased Multivariate Correlation Measure

In this section, we propose UMC, an Unbiased Multivariate Correlation measure, which satisfies comparability, interpretability, scalability and unbiasedness. Specifically, a statistical model of independence is employed to address the dimensionality bias issue, providing a constant baseline and good interpretability. The factorization property from Eq. (5) is adopted to guarantee good scalability to high dimensionality, and a tight upper bound is used for normalization to achieve good comparability. Furthermore, an analytical formula for the expected value of the conditional cumulative entropy under the statistical independence model is derived.

### Statistical Model of Independence

The technique of correcting bias using a statistical model of independence has already been used successfully in several pairwise dependency measures, e.g., Adjusted Mutual Information (AMI) (Vinh, Epps, and Bailey 2009) and the Adjusted Rand Index (ARI) (Hubert and Arabie 1985) in the context of comparing two clusterings. The common statistical model of independence they use is the permutation model (Hubert and Arabie 1985). Herein, we adopt the same permutation model of independence, extending it to multivariate correlation measures.

Definition 1 (The permutation model of independence). Samples are generated by permuting the n data points {X_i(k)} (k = 1, ..., n) of each variable X_i in the data set D = {X_i} independently. Let τ_i denote an independent random permutation of the n indices for each X_i; then D_perm = {(X_1(τ_1(k)), ..., X_d(τ_d(k)))} for k = 1, ..., n.

In other words, if D is in an n × d matrix format and we independently permute the elements of each column, then the distributions of the columns (the marginal distribution of each X_i) remain the same, but the columns become independent of each other. The expected value E_0 of a correlation measure under this model can be regarded as its correlation-by-chance value, i.e., its bias, so it can be used for adjusting the measure to achieve a constant baseline. This is the strategy we adopt for UMC, explained below.
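A minimal sketch of Definition 1 and of the correction-for-chance idea it supports: each column is permuted independently, the expected score E_0 under this null is estimated, and the adjusted score is (raw − E_0)/(upper bound − E_0). Here E_0 is estimated by Monte Carlo purely for illustration (UMC instead derives it analytically, as described below), and the mean absolute pairwise Pearson correlation is only a stand-in for a multivariate correlation measure.

```python
import numpy as np

def permute_columns(data, rng):
    """One draw from the permutation model: shuffle each column independently."""
    return np.column_stack([rng.permutation(col) for col in data.T])

def adjusted_for_chance(score, data, upper_bound=1.0, n_perm=50, seed=0):
    """(raw - E0) / (upper_bound - E0), with E0 estimated under the permutation model."""
    rng = np.random.default_rng(seed)
    e0 = np.mean([score(permute_columns(data, rng)) for _ in range(n_perm)])
    return (score(data) - e0) / (upper_bound - e0)

def mean_abs_corr(data):
    """Stand-in score: mean absolute pairwise Pearson correlation, bounded by 1."""
    c = np.corrcoef(data, rowvar=False)
    return float(np.abs(c[np.triu_indices_from(c, k=1)]).mean())

rng = np.random.default_rng(1)
independent = rng.random((200, 5))
identical = np.tile(independent[:, :1], (1, 5))
print(adjusted_for_chance(mean_abs_corr, independent))  # ~0: raw score equals its chance baseline
print(adjusted_for_chance(mean_abs_corr, identical))    # ~1: reaches the upper bound
```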
### Definition of UMC

Based on Eq. (5), using the cumulative entropy, a tight upper bound and the expected value under the permutation model of independence, we obtain:

Definition 2 (Unbiased Multivariate Correlation, UMC).

$$\mathrm{UMC}(D) = \max_{\sigma \in F_d} \frac{\sum_{i=2}^{d}\big[h(X_{\sigma(i)}) - h(X_{\sigma(i)}|X_{\sigma(1)}, \ldots, X_{\sigma(i-1)}) - E_0\big]}{\sum_{i=2}^{d}\big[A - E_0\big]}, \tag{9}$$

where σ is an ordering of the variables, A denotes the tight upper bound A = h(X_{σ(i)}) − min({h(X_{σ(i)}|X_{σ(1)}, ..., X_{σ(i−1)})}, i = 2, ..., d), and E_0 represents the expected value E_0[h(X_{σ(i)}) − h(X_{σ(i)}|X_{σ(1)}, ..., X_{σ(i−1)})] under the permutation model.

UMC belongs to the class of maximal correlation analysis. By taking the maximum, we aim to uncover the best correlation score of the variables involved and to make UMC invariant to variable order. In addition, UMC possesses several important properties required of a good multivariate correlation measure (proofs in Supplementary Material B): 1) UMC(X_1, ..., X_d) is greater than or equal to 0 on average, and attains 1 as its maximum value; 2) UMC(X_1, ..., X_d) is equal to 0 on average when {X_1, ..., X_d} are statistically independent; 3) UMC(X_1, ..., X_d) = 1 if there exists X_i such that each variable in {X_1, ..., X_d} \ {X_i} is a function of X_i.

According to UMC's definition, a search over all d! possible variable orderings is required to find the maximal value. Such a search is impractical. Therefore, we adopt the same heuristic as the UDS measure (Nguyen, Mandros, and Vreeken 2016) and specify an approximation of the optimal ordering, avoiding the exhaustive search. More specifically, the approximate optimal ordering σ* ∈ F_d satisfies h(X_{σ*(1)}) ≥ ... ≥ h(X_{σ*(d)}), i.e., it orders the variables by descending cumulative entropy. In the following, to simplify notation, we assume the optimal ordering of D is {X_1, ..., X_d} with h(X_1) ≥ ... ≥ h(X_d). Since the h(X_i) are in decreasing order and h(X_i|X_1, ..., X_{i−1}) ≤ h(X_i), we can conclude that min({h(X_i|X_1, ..., X_{i−1})}, i = 2, ..., d) is the last conditional cumulative entropy h(X_d|X_1, ..., X_{d−1}); therefore we do not need to search for the minimum. These techniques make UMC computationally efficient.

### Computing the Cumulative Entropy

The unconditional cumulative entropy can be calculated in closed form (Di Crescenzo and Longobardi 2009b). Let X_i(1) ≤ ... ≤ X_i(n) be the ordered data points of X_i; then

$$h(X_i) = -\sum_{j=1}^{n-1} \big(X_i(j+1) - X_i(j)\big)\,\frac{j}{n}\,\log\frac{j}{n}. \tag{10}$$

For computing the conditional cumulative entropy, an optimal correlation-aware discretization method has been proposed which does not break the correlation structure in the data (Reshef et al. 2011; Nguyen, Mandros, and Vreeken 2016). The idea is as follows: since the ordering of the variables has been fixed, we first search for the discretization of X_1 that maximizes h(X_2) − h(X_2|X_1). Then, keeping the discretization of X_1 fixed, we search only for the optimal discretization of X_2 that maximizes h(X_3) − h(X_3|X_1, X_2). We follow this rationale until we obtain h(X_d) − h(X_d|X_1, ..., X_{d−1}). Because no previous variable is ever re-discretized, a substantial amount of computation time is saved. The optimal discretization at each step can be found efficiently by dynamic programming (see (Reshef et al. 2011; Nguyen et al. 2014; Nguyen, Mandros, and Vreeken 2016) for proofs). In addition, to avoid overfitting, we set the minimum number of data points falling into each bin cell to n^ε, with 0 < ε < 1. This guarantees a sufficient number of points for computing the conditional cumulative entropy, similar to the technique in (Wang et al. 2016).
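A minimal sketch of the pipeline just described, under simplifying assumptions: variables are ordered by descending cumulative entropy, and the terms h(X_i) − h(X_i | X_1, ..., X_{i−1}) are accumulated progressively. Equal-frequency grid cells over the conditioning variables replace the optimal dynamic-programming discretization, with roughly n^ε points per cell on average; the normalization and E_0 correction of Eq. (9) are omitted. This is a reconstruction for illustration, not the authors' implementation.

```python
import numpy as np

def cumulative_entropy(z):
    z = np.sort(np.asarray(z, dtype=float))
    n, j = len(z), np.arange(1, len(z))
    return float(-np.sum((z[1:] - z[:-1]) * (j / n) * np.log(j / n)))

def greedy_information_terms(data, eps=0.3):
    """h(X_i) - h(X_i | previous variables) along the descending-entropy ordering."""
    n, d = data.shape
    order = np.argsort([-cumulative_entropy(data[:, i]) for i in range(d)])
    terms, cell_ids = [], np.zeros(n, dtype=int)     # cell_ids: current conditioning partition
    for step, idx in enumerate(order):
        x = data[:, idx]
        if step > 0:
            h_cond = sum((cell_ids == c).sum() / n * cumulative_entropy(x[cell_ids == c])
                         for c in np.unique(cell_ids))
            terms.append(cumulative_entropy(x) - h_cond)
        # refine the partition with x, keeping roughly n**eps points per cell on average
        bins = max(1, int(n ** (1 - eps) // len(np.unique(cell_ids))))
        ranks = np.argsort(np.argsort(x))
        cell_ids = cell_ids * bins + ranks * bins // n
    return terms

rng = np.random.default_rng(0)
x = rng.random(1000)
print(greedy_information_terms(np.column_stack([x, x ** 2, np.sin(x)])))  # terms close to each h(X_i)
print(greedy_information_terms(rng.random((1000, 3))))  # terms near 0: the chance bias E0 corrects
```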
### Analytical Derivation of the Expected Value E_0

In this section, we derive an analytical formula for the expected value E_0 under the permutation model of independence. As shown in Eq. (9), E_0 consists of two parts: E_0[h(X_i)] and E_0[h(X_i|X_1, ..., X_{i−1})]. Under the permutation model, h(X_i) is invariant, so only E_0[h(X_i|X_1, ..., X_{i−1})] needs to be computed. To simplify notation, let X = X_i and V = {X_1, ..., X_{i−1}}. h(X|V) is computed using the m bin cells {v_1, ..., v_m} obtained with dynamic programming:

$$h(X|V) = \sum_{\lambda=1}^{m} \frac{|v_\lambda|}{n}\,h(X|v_\lambda), \tag{11}$$

where |v_λ| is the number of points in the bin cell v_λ. The expected value under the permutation model is:

$$E_0[h(X|V)] = E_0\Big[\sum_{\lambda=1}^{m} \frac{|v_\lambda|}{n}\,h(X|v_\lambda)\Big] = \frac{1}{n}\sum_{\lambda=1}^{m} |v_\lambda|\,E_0[h(X|v_\lambda)], \tag{12}$$

where the conditional cumulative entropy E_0[h(X|v_λ)] for each bin cell v_λ is computed according to Eq. (10):

$$E_0[h(X|v_\lambda)] = E_0\Big[-\sum_{j=1}^{|v_\lambda|-1}\big(X(j+1) - X(j)\big)\,\frac{j}{|v_\lambda|}\,\log\frac{j}{|v_\lambda|}\Big]. \tag{13}$$

The key task is to compute E_0[X(j+1) − X(j)]. In the absence of any information about the distribution of X, and considering that |v_λ| is small because of the dynamic programming optimization, it is natural to assume that X|v_λ follows a maximum entropy distribution, i.e., the uniform distribution. Thus, with max = max(X) and min = min(X), we get E_0[X(j+1) − X(j)] = (max − min)/(|v_λ| + 1). We estimate the range max − min using the interquartile range IQR = Q_3 − Q_1 and propose the approximation E_0[X(j+1) − X(j)] ≈ 1.5(Q_3 − Q_1)/(|v_λ| + 1). This approximation makes the estimate more accurate when X is not uniform: (Q_3 − Q_1)/(max − min) = 1/2 for the uniform distribution, while it can approach 1 for an extremely skewed distribution. Finally, following Eq. (13), we get

$$E_0[h(X|v_\lambda)] = \frac{1.5\,(Q_3 - Q_1)}{|v_\lambda| + 1}\sum_{j=1}^{|v_\lambda|-1}\Big(-\frac{j}{|v_\lambda|}\log\frac{j}{|v_\lambda|}\Big) = \frac{1.5\,(Q_3 - Q_1)}{|v_\lambda| + 1}\left[\frac{\zeta'(-1) - \zeta'(-1, |v_\lambda|)}{|v_\lambda|} + \frac{|v_\lambda| - 1}{2}\log |v_\lambda|\right], \tag{14}$$

where ζ'(s) is the derivative of the Riemann zeta function and ζ'(s, a) is the partial derivative (Elizalde 1986) of the generalized Riemann zeta function with respect to its first argument (Di Crescenzo and Longobardi 2009a). By plugging Eq. (14) into Eq. (12), we obtain the expected value E_0[h(X|V)] and, eventually, the analytical formula for E_0:

$$E_0[h(X) - h(X|V)] = h(X) - \frac{1}{n}\sum_{\lambda=1}^{m} |v_\lambda|\,E_0[h(X|v_\lambda)]. \tag{15}$$
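A minimal sketch of Eqs. (12)–(14) under the assumptions above: within a cell of m points, the expected spacing is taken to be 1.5(Q_3 − Q_1)/(m + 1), and E_0[h(X|v_λ)] follows by summing −(j/m) log(j/m); this direct sum equals the zeta-function closed form in Eq. (14). The Monte Carlo check at the end is not part of the method; it only shows the order of magnitude, since the 1.5 IQR proxy deliberately undershoots the full range for a uniform column.

```python
import numpy as np

def expected_cell_ce(x, m):
    """E0[h(X|v)] for a cell of m points, per the uniform-spacing and 1.5*IQR assumptions."""
    q1, q3 = np.percentile(x, [25, 75])
    j = np.arange(1, m)
    return float(1.5 * (q3 - q1) / (m + 1) * np.sum(-(j / m) * np.log(j / m)))

def expected_conditional_ce(x, cell_sizes):
    """Eq. (12): E0[h(X|V)] = (1/n) * sum over cells of |v| * E0[h(X|v)]."""
    n = sum(cell_sizes)
    return sum(m * expected_cell_ce(x, m) for m in cell_sizes) / n

def h(z):                                      # Eq. (10) estimator, used only for the check below
    z = np.sort(z)
    n, j = len(z), np.arange(1, len(z))
    return -np.sum((z[1:] - z[:-1]) * (j / n) * np.log(j / n))

rng = np.random.default_rng(0)
x = rng.random(1000)
analytic = expected_conditional_ce(x, [100] * 10)          # ten equally sized cells
monte_carlo = np.mean([np.mean([h(c) for c in np.split(rng.permutation(x), 10)])
                       for _ in range(200)])
print(analytic, monte_carlo)   # same order of magnitude; the IQR proxy is conservative here
```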
## Experiments

In this section, we empirically assess UMC. First, we study UMC's performance on synthetic data. Second, we incorporate UMC into a state-of-the-art subspace beam search algorithm (Keller, Müller, and Böhm 2012) to mine correlated subspaces for subspace clustering and outlier detection on real data. Our baselines are the existing multivariate correlation measures: UDS (Nguyen, Mandros, and Vreeken 2016), MAC (Nguyen et al. 2014), CMI (Nguyen et al. 2013) and HiCS (Keller, Müller, and Böhm 2012). For each baseline, the parameters are optimized following its respective original paper. For UMC, we set ε = 0.3. All reported results are averages over 100 trials.

### Performance on Synthetic Data

Raw Score. We craft different correlations among the d variables of each data set D = {X_i}, including: (1) Rel. A: independent variables, where each X_i is randomly and independently sampled from [0, 1]; (2) Rel. B: identity relationship, where X_1 is uniformly sampled from [0, 1] and X_i = X_1 for all i = 2, ..., d; (3) Rel. C: power law relationship, where X_1 is uniformly sampled from [0, 1] and X_i = (X_1)^i for all i = 2, ..., d; (4) Rel. D: sine relationship, where X_1 is uniformly sampled from [0, 1] and X_i = sin(X_{i−1}) for all i = 2, ..., d. We vary d in [2, 20] and set n = 1000. The results are shown in Figure 3. We did not include CMI here, as CMI is not normalized.

Figure 3 (panels (a)–(d) for Rel. A–D; each plots the score of UMC, UDS, MAC and HiCS against the dimension, 2–20): Raw measure scores on synthetic data with different relationships. UMC shows a flat line at 0 for (a) and a flat line at 1 for each of (b)–(d), which is the desired behavior. The figure is best viewed in color.

Figure 3 shows that UMC has excellent interpretability, outputting 0 for independent variables as dimensionality increases. Furthermore, UMC outputs 1 for functional relationships as dimensionality varies. This is in contrast to the other measures, none of which behaves correctly across all four cases. From this perspective, UMC successfully addresses the dimensionality bias issue that affects the existing measures.

Dimensionality Bias. UMC is almost unbiased, showing nearly equal probability of selecting a data set regardless of dimensionality (more details are in Supplementary Material C).

Statistical Power. We test the power of the measures with a methodology similar to (Reshef et al. 2011), adding Gaussian noise within the range [0.1, 1.0]. Figure 4 reports the results for Rel. B and Rel. C on the data set with 1000 data points and 25 variables. The power of HiCS degenerates very quickly because it cannot work in high dimensional spaces (see Figure 1). UMC, UDS and CMI all make use of the factorization property in Eq. (5), so they can cope well with high dimensionality. UMC outperforms UDS and CMI on noisier data sets. MAC does not work well because it outputs a high score for independent variables, which makes it hard to identify when a real relationship exists.

Figure 4 (two panels, Rel. B and Rel. C; each plots the power of UMC, UDS, MAC, CMI and HiCS against the noise level, 0.1–1.0): Power of the measures with regard to noise; the higher the better. The figure is best viewed in color.

Scalability. We examine the scalability of the measures with regard to dimensionality and data size on a PC with an Intel Core i7-3770 CPU and 32 GB RAM. To measure the running time versus data dimensionality, we generate data sets with 4000 data points and varying dimensionality. To quantify efficiency with regard to data size, we generate data sets with 20 variables and varying data size. Figure 5 demonstrates that UMC is very computationally efficient; in particular, it is even faster than UDS.

Figure 5 (panels: (a) runtime versus dimensionality, 5–50; (b) runtime versus data size, 1000–10000; curves for UMC, UDS, MAC, CMI and HiCS): Runtime with varying dimensionality and data size (the time axis is in log scale).

### Performance on Real-World Data

We evaluate the performance of UMC on real-world data sets involving more complex correlations, considering two typical applications: subspace clustering and outlier detection. Due to space limitations, detailed results for the latter task are presented in Supplementary Material D.

Subspace Clustering. (Müller et al. 2009) showed that mining clusters from subspaces with high correlation usually produces more meaningful results; measuring the correlation of a subspace is therefore a critical aspect of subspace clustering performance. Following the existing literature (Müller et al. 2009), we plug the correlation measures into a state-of-the-art subspace search method (Keller, Müller, and Böhm 2012) to find highly correlated subspaces, and then evaluate the quality of these subspaces through DBSCAN (Ester et al. 1996) clustering on the top 50 subspaces with the highest measure score. The F1 score (Müller et al. 2009) is adopted as the performance metric to compare clustering results against the ground truth.
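A minimal sketch of the evaluation protocol just described, with simplifications: `mean_abs_corr` stands in for UMC or any baseline measure, exhaustive enumeration of low-dimensional subspaces replaces the beam search of (Keller, Müller, and Böhm 2012), and DBSCAN is run on each top-ranked subspace; the subspace-clustering F1 computation of (Müller et al. 2009) is left out.

```python
from itertools import combinations
import numpy as np
from sklearn.cluster import DBSCAN

def mean_abs_corr(data):
    """Stand-in subspace correlation score: mean absolute pairwise Pearson correlation."""
    c = np.corrcoef(data, rowvar=False)
    return float(np.abs(c[np.triu_indices_from(c, k=1)]).mean())

def top_subspaces(data, score, max_dim=3, top_k=5):
    """Rank all 2..max_dim dimensional subspaces by the correlation score."""
    d = data.shape[1]
    candidates = [s for k in range(2, max_dim + 1) for s in combinations(range(d), k)]
    return sorted(candidates, key=lambda s: score(data[:, list(s)]), reverse=True)[:top_k]

def cluster_top_subspaces(data, subspaces, eps=0.1, min_samples=10):
    """DBSCAN labels for the data projected onto each selected subspace."""
    return {s: DBSCAN(eps=eps, min_samples=min_samples).fit_predict(data[:, list(s)])
            for s in subspaces}

rng = np.random.default_rng(0)
x = rng.random((500, 8))
x[:, 1] = x[:, 0] + 0.01 * rng.standard_normal(500)       # plant a correlated subspace {0, 1}
best = top_subspaces(x, mean_abs_corr)
labels = cluster_top_subspaces(x, best)
print(best[0], np.unique(labels[best[0]]))
```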
We test 11 real UCI data sets (http://archive.ics.uci.edu/ml/index.html) widely used for benchmarking in the clustering community and in previous work on correlation measures, using their class labels as ground truth. Table 2 shows that UMC achieves the highest subspace quality compared to all other measures. We believe the reason lies in the unbiasedness of UMC, which enables it to correctly find the truly correlated subspaces, whereas the other measures may assign inflated correlation scores. Specifically, UDS and CMI are biased towards subspaces with higher dimensionality regardless of whether they possess true correlations, and MAC and HiCS output very similar scores for subspaces with and without correlations. By applying a Friedman test (Demšar 2006) at the 0.05 significance level, we find that the observed differences in F1 score are significant. Under a Wilcoxon signed rank test (Demšar 2006) at the 0.05 significance level, the difference between UMC and UDS is statistically significant.

Table 2: Clustering results (F1 score) on real data

| Data (n × d) | UMC | UDS | MAC | CMI | HiCS |
|---|---|---|---|---|---|
| WBC (198 × 33) | 0.89 | 0.82 | 0.81 | 0.79 | 0.75 |
| Shape (160 × 17) | 0.89 | 0.89 | 0.85 | 0.82 | 0.77 |
| Glass (214 × 9) | 0.63 | 0.60 | 0.60 | 0.59 | 0.37 |
| WBCD (569 × 30) | 0.82 | 0.77 | 0.74 | 0.58 | 0.60 |
| Diabetes (768 × 8) | 0.84 | 0.79 | 0.52 | 0.71 | 0.53 |
| Leaves (1600 × 64) | 0.81 | 0.70 | 0.61 | 0.52 | 0.45 |
| Pendigits (7494 × 16) | 0.87 | 0.83 | 0.81 | 0.73 | 0.55 |
| Waveform (5000 × 40) | 0.64 | 0.58 | 0.46 | 0.32 | 0.21 |
| Optical (5620 × 64) | 0.67 | 0.61 | 0.48 | 0.40 | 0.36 |
| Musk (6598 × 166) | 0.95 | 0.92 | 0.88 | 0.61 | 0.58 |
| Letter (20000 × 16) | 0.84 | 0.82 | 0.82 | 0.64 | 0.49 |

Outlier Detection. UMC also outperforms the other measures significantly at the 0.05 significance level (see Supplementary Material D).

## Conclusions

In this paper, we proposed an unbiased multivariate correlation measure (UMC) fulfilling comparability, interpretability, scalability and unbiasedness. We argued that existing multivariate correlation measures only partially satisfy these criteria, and we theoretically and empirically identified dimensionality bias issues in them. In our proposed UMC measure, we employed a statistical model of independence, i.e., the permutation model, under which an analytical formula for the expected value was derived, helping UMC improve interpretability and avoid dimensionality bias. Experimental evaluation convincingly demonstrated that UMC outperforms existing measures on a range of tasks.

## Acknowledgments

This research is supported by the National Natural Science Foundation of China (No. 61371078).

## References

Aggarwal, C., and Philip, S. Y. 2001. Outlier detection for high dimensional data. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (SIGMOD-01), 37–46.
Bareinboim, E.; Tian, J.; and Pearl, J. 2014. Recovering from selection bias in causal and statistical inference. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI-14), 2410–2416.
Breiman, L., and Friedman, J. H. 1985. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association 80(391):580–598.
Brown, G.; Pocock, A.; Zhao, M.-J.; and Luján, M. 2012. Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. Journal of Machine Learning Research 13(Jan):27–66.
Demšar, J. 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30.
Di Crescenzo, A., and Longobardi, M. 2009a. On cumulative entropies. Journal of Statistical Planning and Inference 139(12):4072–4087.
Di Crescenzo, A., and Longobardi, M. 2009b. On cumulative entropies and lifetime estimations. In International Work-Conference on the Interplay Between Natural and Artificial Computation, 132–141.
Elizalde, E. 1986. An asymptotic expansion for the first derivative of the generalized Riemann zeta function. Mathematics of Computation 47(175):347–350.
Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X.; et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-96), 226–231.
Hubert, L., and Arabie, P. 1985. Comparing partitions. Journal of Classification 2(1):193–218.
Keller, F.; Müller, E.; and Böhm, K. 2012. HiCS: High contrast subspaces for density-based outlier ranking. In Proceedings of the 2012 IEEE 28th International Conference on Data Engineering (ICDE-12), 1037–1048.
Kriegel, H.-P.; Kröger, P.; and Zimek, A. 2009. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD) 3(1):1.
Kullback, S., and Leibler, R. A. 1951. On information and sufficiency. The Annals of Mathematical Statistics 22(1):79–86.
Müller, E.; Günnemann, S.; Assent, I.; and Seidl, T. 2009. Evaluating clustering in subspace projections of high dimensional data. Proceedings of the VLDB Endowment 2(1):1270–1281.
Nguyen, H. V.; Müller, E.; Vreeken, J.; Keller, F.; and Böhm, K. 2013. CMI: An information-theoretic contrast measure for enhancing subspace cluster and outlier detection. In Proceedings of the 13th SIAM International Conference on Data Mining (SDM-13), 198–206.
Nguyen, H. V.; Müller, E.; Vreeken, J.; Efros, P.; and Böhm, K. 2014. Multivariate maximal correlation analysis. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 775–783.
Nguyen, H.-V.; Mandros, P.; and Vreeken, J. 2016. Universal dependency analysis. In Proceedings of the 16th SIAM International Conference on Data Mining (SDM-16), 792–800.
Rao, M.; Chen, Y.; Vemuri, B. C.; and Wang, F. 2004. Cumulative residual entropy: A new measure of information. IEEE Transactions on Information Theory 50(6):1220–1228.
Rao, M.; Seth, S.; Xu, J.; Chen, Y.; Tagare, H.; and Príncipe, J. C. 2011. A test of independence based on a generalized correlation function. Signal Processing 91(1):15–27.
Reshef, D. N.; Reshef, Y. A.; Finucane, H. K.; Grossman, S. R.; McVean, G.; Turnbaugh, P. J.; Lander, E. S.; Mitzenmacher, M.; and Sabeti, P. C. 2011. Detecting novel associations in large data sets. Science 334(6062):1518–1524.
Romano, S.; Vinh, N. X.; Bailey, J.; and Verspoor, K. 2016. A framework to adjust dependency measure estimates for chance. In Proceedings of the 16th SIAM International Conference on Data Mining (SDM-16), 423–431.
Te Sun, H. 1978. Nonnegative entropy measures of multivariate symmetric correlations. Information and Control 36:133–156.
Te Sun, H. 1980. Multiple mutual informations and multiple interactions in frequency data. Information and Control 46:26–45.
Vinh, N. X.; Epps, J.; and Bailey, J. 2009. Information theoretic measures for clusterings comparison: Is a correction for chance necessary? In Proceedings of the 26th International Conference on Machine Learning (ICML-09), 1073–1080.
Wang, Y.; Tang, Q.; Xia, S.-T.; Wu, J.; and Zhu, X. 2016. Bernoulli random forests: Closing the gap between theoretical consistency and empirical soundness. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16), 2167–2173.
Zhang, X.; Pan, F.; Wang, W.; and Nobel, A. 2008. Mining non-redundant high order correlations in binary data. Proceedings of the VLDB Endowment 1(1):1178–1188.