# Systematically Exploring Associations among Multivariate Data

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Lifeng Zhang, School of Information, Renmin University of China, 59 Zhongguancun Street, Haidian, Beijing, P.R. China, 100872. l.zhang@ruc.edu.cn

Detecting relationships among multivariate data is often of great importance in the analysis of high-dimensional data sets, and has received growing attention for decades from both academia and industry. In this study, we propose a statistical tool named the neighbor correlation coefficient (nCor), which is based on a new idea: measure the local continuity of the reordered data points to quantify the strength of the global association between variables. With a sufficient sample size, the new method is able to capture a wide range of functional relationships, whether linear or nonlinear, bivariate or multivariate, main effect or interaction. The nCor score roughly approximates the coefficient of determination ($R^2$) of the data, which indicates the proportion of variance in one variable that is predictable from one or more other variables. On this basis, three nCor-based statistics are also proposed to further characterize the intra and inter structures of the associations from the aspects of nonlinearity, interaction effect, and variable redundancy. The mechanisms of these measures are proved in theory and demonstrated with numerical analyses.

## Introduction

Identifying relationships among variables is one of the most critical issues in data analysis and interpretation (Altman and Krzywinski 2015), with a wide range of applications in diverse fields from data science to neuroscience. Nowadays, however, a large data set may contain a vast number of variable pairs and combinations that are difficult to examine manually (Reshef et al. 2011). Association measures can be used to quickly find the significant associations scattered among thousands or even millions of potential relationships without modelling the relationships explicitly, and thereby provide valuable knowledge and promising pointers for future study.

Consider a data sample $\{(\mathbf{x}_{(t)}, y_{(t)}) \mid 1 \le t \le N\}$ observed from an underlying functional relationship expressed as follows:

$$y = f(\mathbf{x}) + e = \sum_{x_i \in \mathbf{x}} g_i(x_i) + \sum_{\mathbf{x}_i \subseteq \mathbf{x}} h_i(\mathbf{x}_i) + e \tag{1}$$

where $y \in \mathbb{R}$ is the dependent variable, $\mathbf{x} = (x_i \mid 1 \le i \le M) \in \mathbb{R}^M$ ($M \ge 2$) the multiple independent variables, $\mathbf{x}_i \subseteq \mathbf{x}$ a subset of $\mathbf{x}$, and $e \in \mathbb{R}$ the additive noise. $f(\cdot)$, $g_i(\cdot)$ and $h_i(\cdot)$ denote the underlying function, the main effects, and the interaction effects respectively. If $f(\cdot)$ is linear, in which case all $g_i(\cdot)$ are linear and all $h_i(\cdot)$ are null, the Pearson correlation coefficient is a perfect measure of how much fluctuation in one variable can be explained by another variable ($R^2$) (Altman and Krzywinski 2015). If $f(\cdot)$ is nonlinear, the traditional correlation test is no longer sufficient. Developing concise and efficient nonlinear association detection methodologies has been a challenging research problem that has received wide attention for over half a century. Some of the previous approaches have been developed based on the theory of mutual information (MI) and partitioning (binning) techniques.
A typical one is the maximal information coefficient (MIC), a state-of-the-art association measure that has been extensively evaluated in recent years (Reshef et al. 2011; Reshef et al. 2018). These methods use partitioning as a means to apply MI to continuous random variables, based on the idea that if an association exists between two variables, then a grid can be drawn on the scatterplot that partitions the data so as to encapsulate that relationship. Similarly, Heller and Gorfine's SDDP adopts summation or maximization aggregation of the scores over all partitions of a fixed size to estimate the MI (Heller et al. 2016). In addition, other techniques, such as kernel density estimation (KDE), k-nearest neighbor (kNN) distances, and the nonlinear correlation information entropy, can also be used to compute an MI score as a dependence measure (Moon, Rajagopalan, and Lall 1995; Darbellay and Vajda 1999; Kraskov, Stögbauer, and Grassberger 2004; Wang, Shen, and Zhang 2005).

Another method, distance correlation (dCor), has a compact representation analogous to the Pearson correlation coefficient, but is calculated from certain Euclidean distances between sample elements (Székely, Rizzo, and Bakirov 2007; Székely and Rizzo 2009). dCor can be viewed as a special case of the kernel-based methods (Sejdinovic et al. 2013), a more general class of statistics defined on reproducing kernel Hilbert spaces (Gretton et al. 2008; Gretton and Györfi 2010). Empirical studies (Reshef et al. 2015; Reshef et al. 2018) showed that dCor also achieves excellent performance in some situations. In order to construct a distribution-free test, Székely and Rizzo (2009) considered using the ranks of each random variable instead of the actual values in computing dCor, and Heller, Heller, and Gorfine (2013) introduced an association test based on the cross-classification of distances from center points.

Since the 1980s, a number of higher-order correlation measures have been proposed to construct concise nonlinear model validity tests for system identification (Aguirre 1995; Billings and Zhu 1995; Mao and Billings 2000). Zhang, Zhu, and Longden (2007) and Zhu, Zhang, and Longden (2007) introduced a set of first-order correlation functions, named omni-directional cross-correlation functions (ODCCF), by considering the symmetry properties of nonlinear relationships. Other extensions of the ordinary correlation test include the Spearman rank correlation coefficient, the Kendall coefficient of concordance (Kendall 1938), maximal correlation (Rényi 1959; Breiman and Friedman 1985), principal curve based methods (Hastie and Stuetzle 1989; Delicado 2001; Delicado and Smrekar 2009), the randomized dependence coefficient (RDC) (Lopez-Paz, Hennig, and Schölkopf 2013), and nonlinear spectral correlation (Liu, Sohn, and Jeon 2017), all of which capture a certain range of nonlinear relationships. Nevertheless, these approaches still cannot detect associations in a satisfactory manner under every condition, since they are incapable of equitably estimating the $R^2$ of the relationships and show strong preferences for some types of nonlinear functions (Reshef et al. 2018). In addition, the overwhelming majority of existing methods are designed for pairwise association detection, and are thus unable to detect interaction effects. If only main effects exist among multiple variables, pairwise tests are sufficient, since the influence of the variables is separable in such a situation.
In contrast, an interaction effect describes a situation in which the simultaneous influence of two or more independent variables is not additive. In the real world, many bivariate relationships appear insignificant or non-functional, but can in fact be explained by interaction effects. Due to the complexity of interactions, sometimes there is no trend, principal curve, or particular pattern identifiable in the pairwise tests, and even various fittings or transformations are of no avail. Therefore, whenever interactions occur, all bivariate analysis techniques tend to be less effective. Although dCor and some MI estimators can handle multivariate data, they are still incapable of distinguishing interactions from main effects in all cases.

In the present study, we propose a new method named the neighbor correlation coefficient (nCor) to detect the relationships among data sequences in both the bivariate and multivariate cases, with the following properties. (i) With sufficient sample size, the method captures a wide range of functional relationships, including not only various bivariate functional forms such as exponential or periodic, but also multivariate associations and in particular interaction effects. (ii) The method roughly measures the association strength ($R^2$) of the data, with a statistical power that increases with sample size regardless of the functional form. (iii) The method can be used to further characterize and distinguish the inter and intra structures of the detected relationships from the aspects of nonlinearity, interactivity, and variable redundancy. Finally, nCor differs from previous approaches in that it detects associations by measuring the local continuity of the concomitants obtained from data reordering, rather than by partitioning scatterplots, estimating probability distributions, or computing with pairwise distances (and reproducing kernel Hilbert spaces). For this reason, in the new method the independent variables are only used for reordering the data points, and are not involved in the computation of the correlation scores at all. This provides an alternative way to assess the relationships among multivariate data.

The rest of this study is organized as follows. In the next two sections, nCor and three nCor-based statistics are proposed. Subsequently, empirical studies are performed to evaluate the effectiveness of the new statistics and to make comparisons with previous approaches. In the last section, conclusions are drawn to summarize the study. To reduce the length of the study, additional analyses and experimental studies are provided in the supplementary material.

## Neighbor correlation coefficient (nCor)

To simplify the proofs, without loss of generality, throughout this study $\mathbf{x}$ is assumed to be continuous and uniformly distributed, and $f(\cdot)$, $g_i(\cdot)$ and $h_i(\cdot)$ are all assumed to be continuous functions. The supplementary material gives theoretical proofs of all the lemmas and theorems, as well as empirical evidence of the robustness of the new method with respect to data distribution and function continuity. In fact, the new method exhibits almost the same performance under different data distributions when the sample size is sufficiently large (in the supplementary material, we tested 8 continuous and discrete distributions including uniform, normal, exponential, and bimodal).

### nCor for bivariate data

Consider a set of paired data $\{(x_{(t)}, y_{(t)}) \mid 1 \le t \le N\}$.
To detect the potential relationship $y = g(x) + e$, the sample points first need to be rearranged in increasing order of the independent variable. The concepts of order statistics and concomitants are given as follows (David and Nagaraja 2003).

**Order statistics:** sorting the independent variable data by value yields a new sequence $x_{(1:N)} \le x_{(2:N)} \le \cdots \le x_{(N:N)}$, where $x_{(k:N)}$ is known as the $k$-th order statistic of $\{x_{(t)}\}$. Let $\{n(k) \mid 1 \le k \le N\}$ be the reordering permutation; that is, if $n(k) = t$ then $x_{(k:N)} = x_{(t)}$.

**Lemma 1.** Let $x_{(1)}, x_{(2)}, \dots, x_{(N)}$ be a sample of a random variable that is continuous and uniformly distributed on $[a, b]$. Let $\Delta x_{(k:N)}$ be the difference between two neighboring order statistics of $\{x_{(t)}\}$,

$$\Delta x_{(k:N)} = x_{(k+1:N)} - x_{(k:N)} \tag{2}$$

Then it holds that

$$\lim_{N \to \infty} \Delta x_{(k:N)} = 0, \quad 1 \le k \le N-1 \tag{3}$$

**Concomitants:** rearranging the dependent variable data in accordance with $\{n(k)\}$ yields a new sequence $y_{[1:N]}, y_{[2:N]}, \dots, y_{[N:N]}$, where $y_{[k:N]}$ is known as the $k$-th concomitant, defined by $y_{[k:N]} = y_{(t)}$ when $n(k) = t$.

**Lemma 2.** Let $\{(x_{(t)}, y_{(t)}) \mid 1 \le t \le N\}$ be observed from a noise-free continuous relationship $y = g(x)$, where $x \in [a, b]$ is a uniformly distributed random variable. Let $\Delta y_{[k:N]}$ be the difference between two neighboring concomitants,

$$\Delta y_{[k:N]} = y_{[k+1:N]} - y_{[k:N]} \tag{4}$$

and let $\Delta y$ denote the sequence $\{\Delta y_{[k:N]}\}$. If $N$ is sufficiently large, then $\operatorname{Var}(\Delta y) < \operatorname{Var}(y)$. In addition,

$$\lim_{N \to \infty} \Delta y_{[k:N]} = 0, \quad 1 \le k \le N-1 \tag{5}$$

Figure 1 shows the scatterplots of four typical data relationships: when $\Delta x_{(k:N)}$ is sufficiently small (due to large $N$), the amplitude of $\Delta y_{[k:N]}$ is much smaller than that of $y_{(t)}$ ($y_{[k:N]}$).

Figure 1: The scatterplots of four bivariate associations.

nCor measures how much knowing the independent variables determines the value of the dependent variable, based on the idea that if a continuous functional relationship exists, data points that are very similar in the independent variables should also have similar values of the dependent variable. In such a situation, $y_{[k:N]}$ exhibits a positive correlation with $y_{[k+1:N]} = y_{[k:N]} + \Delta y_{[k:N]}$. The pairs of neighboring concomitants are then used to compute nCor by means of the product-moment correlation coefficient, as below.

**Definition 1 (neighbor correlation coefficient, nCor).** Let $(\mathbf{y}', \mathbf{y}'')$ denote the paired sequences of neighboring concomitants, where $\mathbf{y}' = \{y_{[k:N]} \mid 1 \le k \le N-1\}$ and $\mathbf{y}'' = \{y_{[k+1:N]} \mid 1 \le k \le N-1\}$. nCor is defined as

$$\mathrm{nCor}(x, y) = \frac{\operatorname{Cov}(\mathbf{y}', \mathbf{y}'')}{\sqrt{\operatorname{Var}(\mathbf{y}')\operatorname{Var}(\mathbf{y}'')}} \tag{6}$$

where $\operatorname{Cov}(\cdot)$ and $\operatorname{Var}(\cdot)$ denote the covariance and variance operators respectively. When applied to a sample, nCor can be calculated as

$$\mathrm{nCor}(x, y) = \frac{(N-1)\sum_{k=1}^{N-1} y_{[k:N]}\, y_{[k+1:N]} - \sum_{k=1}^{N-1} y_{[k:N]} \sum_{k=1}^{N-1} y_{[k+1:N]}}{\sqrt{\Big[(N-1)\sum_{k=1}^{N-1} y_{[k:N]}^2 - \big(\sum_{k=1}^{N-1} y_{[k:N]}\big)^2\Big]\Big[(N-1)\sum_{k=1}^{N-1} y_{[k+1:N]}^2 - \big(\sum_{k=1}^{N-1} y_{[k+1:N]}\big)^2\Big]}} \tag{7}$$

**Theorem 1.** Let $\{(\mathbf{x}_{(t)}, y_{(t)})\}$ ($|\mathbf{x}| \ge 1$) be a data sample observed from random variables $(\mathbf{x}, y)$. If $\mathbf{x}$ and $y$ are independent, then the expectation of the correlation coefficient satisfies $E[\mathrm{nCor}(\mathbf{x}, y)] = 0$. When applied to a sample, a hypothesis test rejects the null hypothesis of independence if

$$|\mathrm{nCor}(\mathbf{x}, y)| > \tanh\!\left(\Phi^{-1}(1 - \alpha/2)\big/\sqrt{N-3}\right) \tag{8}$$

where $\Phi(\cdot)$ denotes the standard normal cumulative distribution function and $\alpha$ is the significance level of the test.

**Theorem 2.** Let $\{(x_{(t),i}, y_{(t)})\}$ be paired data observed from the relationship defined in (1), with each $x_i \in \mathbf{x}$ uniformly distributed on $[a, b]$. If a main effect $g_i(x_i)$ exists, then with sufficient $N$ the correlation coefficient satisfies $\mathrm{nCor}(x_i, y) > 0$. In addition,

$$\lim_{N \to \infty} \mathrm{nCor}(x_i, y) = \operatorname{Var}(g_i(x_i))/\operatorname{Var}(y) \tag{9}$$
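As a concrete illustration, the bivariate computation reduces to a sort followed by a Pearson correlation of adjacent concomitants. Below is a minimal sketch, assuming numpy and scipy are available; the function names are ours, not the paper's, and the $\sqrt{N-3}$ term in the critical value is the standard Fisher-transformation form, which reproduces the confidence limits reported later in the experiments (about 0.062 at $N = 1000$, $\alpha = 0.05$).

```python
import numpy as np
from scipy.stats import norm

def ncor_bivariate(x, y):
    """Neighbor correlation coefficient for a single independent variable."""
    order = np.argsort(x)               # reordering permutation {n(k)}
    yc = np.asarray(y)[order]           # concomitants y_[1:N], ..., y_[N:N]
    # Pearson correlation of the neighboring concomitant pairs (Eqs. 6-7)
    return np.corrcoef(yc[:-1], yc[1:])[0, 1]

def ncor_critical(n, alpha=0.05):
    """Fisher-transformation critical value for the Theorem 1 test (sketch)."""
    return np.tanh(norm.ppf(1 - alpha / 2) / np.sqrt(n - 3))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1000)
y = np.sin(10 * x)                      # noise-free nonlinear main effect
print(ncor_bivariate(x, y))             # close to 1, i.e. the R^2 of the data
print(ncor_critical(len(x)))            # approx. 0.062 at N = 1000
```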
### nCor for multivariate data

When considering the relationship among three or more variables, an interaction effect may arise, in which the association between each of the interacting variables and the dependent variable depends on the values of the other interacting variable(s). Figure 2(a) clearly suggests that when an interaction occurs, the value of $y_{(t)}$ is unpredictable if only $x_{(t),1}$ or $x_{(t),2}$ is known: although $\Delta x_{(k:N),1}$ is small, $\Delta y_{[k:N]}$ can be large due to the potentially large value of $\Delta x_{(k:N),2}$. In this case, the order-statistics-based data rearranging described above is no longer sufficient.

Figure 2: The scatterplots (a) and contour map (b) of $y = x_1x_2$ with 100 sample points.

To address this problem, we convert the data reordering process into a travelling salesman problem (TSP) by viewing the reordering permutation as a short route that visits each sample point in the multi-dimensional independent variable space exactly once. The $\{n(k)\}$ obtained from solving the TSP is then applied to generate the concomitants for computing nCor. Figure 2(b) shows that although a short route cannot rearrange each $\{x_{(t),i}\}$ in ascending or descending order, it still ensures not only a small distance between each pair of connected data points in the space of $\mathbf{x}$, defined as $\lambda_{n(k)n(k+1)} = \|\mathbf{x}_{(k:N)} - \mathbf{x}_{(k+1:N)}\|$, but also a small difference in $y$ ($\Delta y_{[k:N]}$).

**Lemma 3.** Let $\{(\mathbf{x}_{(t)}, y_{(t)}) \mid 1 \le t \le N\}$ be observed from a noise-free continuous relationship $y = f(\mathbf{x})$, where each $x_i \in \mathbf{x}$ is uniformly distributed on $[a, b]$. Let $\lambda_{n^*(k)n^*(k+1)}$ be obtained from the optimal reordering permutation, defined as

$$\{n^*(k)\} = \arg\min_{n(1),\dots,n(N)} \left( \sum_{k=1}^{N-1} \lambda_{n(k)n(k+1)} + \lambda_{n(1)n(N)} \right) \tag{10}$$

Then it holds that

$$\lim_{N \to \infty} \Delta y_{[k:N]} = 0, \quad 1 \le k \le N-1 \tag{11}$$

**Theorem 3.** Let $\{(\mathbf{x}_{(t)}, y_{(t)})\}$ be a data sample observed from the relationship defined in (1), with each $x_i \in \mathbf{x}$ uniformly distributed on $[a, b]$. Suppose $\mathrm{nCor}(\mathbf{x}, y)$ is calculated based on the optimal $\{n^*(k)\}$. Then

$$\lim_{N \to \infty} \mathrm{nCor}(\mathbf{x}, y) = \frac{\operatorname{Var}(f(\mathbf{x}))}{\operatorname{Var}(y)} = \frac{\sum_{x_i \in \mathbf{x}} \operatorname{Var}(g_i(x_i)) + \sum_{\mathbf{x}_i \subseteq \mathbf{x}} \operatorname{Var}(h_i(\mathbf{x}_i))}{\operatorname{Var}(y)} \tag{12}$$

In this study, we adopt the nearest neighbor (NN) algorithm (Algorithm 1) to solve the TSP (Gutin and Punnen 2007); it is simple and quickly yields a short route of sufficient quality to satisfy the needs of association detection. (i) The time complexity of the NN algorithm is $O(N^2)$, so the computational time of $\mathrm{nCor}(\mathbf{x}, y)$ depends mainly on $N$ rather than on $M$. (ii) nCor is therefore able to cope with high-dimensional $\mathbf{x}$ without a large computational cost. With increasing $M$, however, the power of nCor in approximating $R^2$ decreases little by little, since with fixed $N$ the data points become sparser in a higher-dimensional space. (iii) nCor is robust to non-optimal reordering and is not sensitive to any particular order of the data points. Even when a TSP route is comparatively bad, the majority of the connected data points are still close to each other in $\mathbf{x}$ space, which is enough for computing nCor.
Algorithm 1: NN-algorithm-based data reordering.

```
Input:  the Euclidean distance matrix of the sample data {x_(t)},
        denoted [λ_pq]_(N×N), where λ_pq = ||x_(p) − x_(q)||
Output: the concomitants {y_[k:N] | 1 ≤ k ≤ N}

Start at data point t ← 1 as the current point; set n(1) ← 1 and y_[1:N] ← y_(t)
for k ← 1 to N−1 do
    Find the unvisited data point closest to the current point t:
        i* ← argmin_{i ∉ {n(1),…,n(k)}} λ_it
    Move the current point to t ← i*; set n(k+1) ← i* and y_[k+1:N] ← y_(i*)
end for
```

## Three nCor based association measures

To further characterize the inter and intra structures of the detected associations, three nCor-based statistics are also proposed in this study (a minimal code sketch combining all three follows Remark 3 below).

**Definition 2 (coefficient of interaction, COI).**

$$\mathrm{COI}(\mathbf{x}, y) = \mathrm{nCor}(\mathbf{x}, y) - \sum_{x_i \in \mathbf{x}} \max\{0, \mathrm{nCor}(x_i, y)\} - \sum_{\mathbf{x}_i \subset \mathbf{x}} \max\{0, \mathrm{COI}(\mathbf{x}_i, y)\} \tag{13}$$

where the second sum runs over the subsets with $2 \le |\mathbf{x}_i| < |\mathbf{x}|$. $\mathrm{COI}(\mathbf{x}, y)$ (i.e., $\mathrm{COI}(x_1 \cdots x_M, y)$) is a measure of the strength of the interaction effect exactly in terms of $\mathbf{x}$, and by Theorem 3 it holds that $\lim_{N\to\infty} \mathrm{COI}(\mathbf{x}, y) = \operatorname{Var}(h(\mathbf{x}))/\operatorname{Var}(y)$.

**Remark 1.** COI and nCor can be used together to distinguish interactions from main effects. (i) If $\mathrm{nCor}(x_i, y)$ is significant, then a main effect exists between $x_i$ and $y$; the stronger the main effect, the larger the nCor value. (ii) If $\mathrm{nCor}(\mathbf{x}, y)$ is significant and $\mathrm{COI}(\mathbf{x}, y) > 0$, then an interaction may exist; the stronger the interaction effect, the larger the COI value.

**Definition 3 (coefficient of nonlinearity, CON).**

$$\mathrm{CON}(x, y) = \mathrm{nCor}(x, y) - \mathrm{Cor}^2(x, y) \tag{14}$$

where $\mathrm{Cor}(x, y)$ denotes the Pearson correlation coefficient. $\mathrm{CON}(x, y)$ is a measure of the nonlinearity of a main effect, indicating the strength of the nonlinear part of a bivariate association. CON is defined similarly to the nonlinearity measure $\mathrm{MIC} - \rho^2$ in (Reshef et al. 2011).

**Remark 2.** CON and Cor can be used together to distinguish linear and nonlinear associations. Consider a significant $\mathrm{nCor}(x, y)$. (i) $\mathrm{CON}(x, y) \approx 0$ indicates that only a linear association exists. (ii) $\mathrm{CON}(x, y) > 0$ indicates that the association is nonlinear; the stronger the nonlinear effect, the larger the CON value. (iii) An insignificant $\mathrm{Cor}(x, y)$ indicates that only a nonlinear effect exists, such that a linear model will completely fail to capture the underlying relationship.

**Definition 4 (coefficient of essentialness, COE).**

$$\mathrm{COE}(\mathbf{x}_s, y) = \mathrm{nCor}(\mathbf{x}, y) - \max\{0, \mathrm{nCor}(\mathbf{x}\backslash\mathbf{x}_s, y)\} \tag{15}$$

where $\mathbf{x}_s \subseteq \mathbf{x}$ and $\mathbf{x}\backslash\mathbf{x}_s = \{x_i \mid x_i \in \mathbf{x},\, x_i \notin \mathbf{x}_s\}$. $\mathrm{COE}(\mathbf{x}_s, y)$ (i.e., $\mathrm{COE}(x_{s1} \cdots x_{sm}, y)$) indicates whether or not $\mathbf{x}_s$ is redundant in the presence of $\mathbf{x}\backslash\mathbf{x}_s$. Its role is similar to that of the partial correlation coefficient in the linear case and of conditional MI (CMI), the MI of two variables conditioned on a third (Fleuret 2004; Sato et al. 2006; Runge 2018).

**Remark 3.** COE can be used to detect whether a subset of independent variables is essential in analyzing the dependent variable. Consider a significant $\mathrm{nCor}(\mathbf{x}, y)$ and a subset $\mathbf{x}_s$. (i) $\mathrm{COE}(\mathbf{x}_s, y) > 0$ indicates that $\mathbf{x}_s$ is essential, in that it contains at least one irreplaceable independent variable that must be involved in model construction. (ii) Generally, the more essential an $\mathbf{x}_s$ of a given size is, the larger the value the COE test yields, and thus the higher the priority it should be given in analyzing $y$.
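The following is a minimal sketch of how these statistics can be computed together, assuming numpy and scipy; the function names are ours, the route construction follows Algorithm 1, and COI is shown only for the pairwise case (higher orders additionally subtract the lower-order COI terms, as in Eq. 13).

```python
import numpy as np
from scipy.spatial.distance import cdist

def nn_reorder(X):
    """Greedy nearest-neighbor route over the rows of X (Algorithm 1)."""
    d = cdist(X, X)                      # distance matrix [lambda_pq]
    visited = np.zeros(len(X), dtype=bool)
    route = [0]                          # start from data point 1
    visited[0] = True
    for _ in range(len(X) - 1):
        dist = np.where(visited, np.inf, d[route[-1]])
        nxt = int(np.argmin(dist))       # nearest unvisited data point
        route.append(nxt)
        visited[nxt] = True
    return np.asarray(route)

def ncor(X, y):
    """nCor for one or more independent variables (columns of X)."""
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    yc = np.asarray(y, dtype=float)[nn_reorder(X)]  # concomitants along route
    return np.corrcoef(yc[:-1], yc[1:])[0, 1]

def coi2(x1, x2, y):
    """Pairwise coefficient of interaction (Definition 2 with |x| = 2)."""
    joint = ncor(np.column_stack([x1, x2]), y)
    return joint - max(0.0, ncor(x1, y)) - max(0.0, ncor(x2, y))

def con(x, y):
    """Coefficient of nonlinearity (Definition 3)."""
    return ncor(x, y) - np.corrcoef(x, y)[0, 1] ** 2

def coe(X, y, cols):
    """Coefficient of essentialness of columns `cols` of X (Definition 4)."""
    rest = np.delete(np.asarray(X, dtype=float), cols, axis=1)
    return ncor(X, y) - max(0.0, ncor(rest, y))
```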
In summary, Figure 3 depicts the decision tree showing how to diagnose and characterize the various associations arising from a subset of independent variables $1 \le |\mathbf{x}_s \subseteq \mathbf{x}| \le M$ by the use of Cor, nCor, and the three nCor-based statistics.

Figure 3: Diagnose and characterize various associations by using nCor and the three nCor-based statistics.

## Empirical studies

In this section, a set of simulation examples and a real-world data set are employed to illustrate the effectiveness of the new method. The supplementary material gives the detailed experimental settings, as well as further empirical demonstrations of nCor in detecting associations and approximating $R^2$ under different data distributions, sample sizes, non-optimal data reorderings, and numbers of independent variables.

### Simulation experiments and comparative analysis

Six simulated examples were used for comparison purposes. Table 1 presents the underlying associations, which cover a wide range of functional forms including parabolic, exponential, periodic, cross terms, mixture functions, and even a classification problem (step function).

Table 1: The six simulated examples ($|g|$ and $|h|$ respectively denote the numbers of main effects $g(\cdot)$ and interactions $h(\cdot)$ occurring in each $f(\cdot)$).

| Underlying function | \|g\| | \|h\| |
|---|---|---|
| $y_1 = \sin(10x_2)$ if $x_1 \ge 0$; $\sin(10x_2 + 2)$ if $x_1 < 0$ | 1 | 1 |
| $y_2 = x_1x_2 - 0.7x_2x_3 + 3x_1x_2x_3$ | 0 | 3 |
| $y_3 = x_1x_2^2 - 3x_1 + \cos(20x_3) + e$ | 2 | 1 |
| $y_4 = \cos(20x_1x_2)\,x_2 + 0.5\exp(x_2) + e$, with $x_3 = 0.667x_2 + 0.333u$ | 1 | 1 |
| $(x_1, x_2, y_5)$: two-spirals problem | 0 | 1 |
| $(x_1, x_2, y_6)$: noisy two-spirals problem | 0 | 1 |

In examples 1 to 4, we considered three random independent variables uniformly distributed on $[-1, 1]$. In examples 3 and 4, normally distributed random noise with zero mean and variance 0.25 was added to increase the difficulty of association detection. All data sequences for the first four examples were generated with length 1000. In example 4, collinearity occurs between $x_2$ and $x_3$, and $u$ is a random variable with the same distribution as the $x_i$. Example 5, the two-spirals problem, is a benchmark task for nonlinear classification consisting of two spirals, each with 200 samples, in a 2-D space. In example 6, each independent variable of the two-spirals problem was corrupted by normally distributed additive noise with zero mean and variance $1\times10^{-4}$; in addition, 40 randomly selected samples (10%) were wrongly categorized, yielding a noisy dependent variable.

Here, we compared nCor with MIC, dCor, kNN-based MI, CODCF, and RDC, whose strong performance has been extensively demonstrated (Reshef et al. 2018). To make comparisons easier, the MI values were re-scaled to the range $[0, 1]$ via $r^2_{MI}(\mathbf{x}, y) = 1 - \exp(-2\,MI(\mathbf{x}, y))$ (Gelfand and Yaglom 1957; Lange and Grubmüller 2005). The hypothesis tests introduced in (Zhang, Zhu, and Longden 2007; Reshef et al. 2011; Székely, Rizzo, and Bakirov 2007; Lopez-Paz, Hennig, and Schölkopf 2013) were used to detect significant associations at the 95% confidence level ($\alpha = 0.05$). For MIC and the other MI measures, empirical confidence limits were obtained using 1000 surrogate sets of random data (Reshef et al. 2018).
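Before turning to the results, here is a short usage sketch (our code, not the authors') that generates example 2 of Table 1 and screens it with the functions sketched earlier:

```python
import numpy as np

rng = np.random.default_rng(1)
x1, x2, x3 = rng.uniform(-1, 1, size=(3, 1000))
y2 = x1 * x2 - 0.7 * x2 * x3 + 3 * x1 * x2 * x3   # example 2 of Table 1

X = np.column_stack([x1, x2, x3])
print(ncor(x1, y2), ncor(x2, y2), ncor(x3, y2))   # near 0: no main effects
print(coi2(x1, x2, y2))                           # > 0: pairwise interaction
print(ncor(X, y2))                                # near 1: y2 is noise free
```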
Table 2: Association detection by using previous methods (significant scores are underlined in the original).

|  | MIC x1 | MIC x2 | MIC x3 | CODCF x1 | CODCF x2 | CODCF x3 | RDC x1 | RDC x2 | RDC x3 | RDC x1x2 | RDC x1x3 | RDC x2x3 | RDC x1x2x3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| y1 | 0.134 | 0.462 | 0.134 | -0.032 | -0.096 | -0.001 | 0.107 | 0.244 | 0.095 | 0.260 | 0.149 | 0.212 | 0.255 |
| y2 | 0.180 | 0.229 | 0.231 | 0.384 | 0.479 | 0.353 | 0.478 | 0.532 | 0.503 | 0.693 | 0.730 | 0.714 | 0.976 |
| y3 | 0.412 | 0.131 | 0.268 | -0.659 | -0.360 | 0.039 | 0.664 | 0.390 | 0.118 | 0.809 | 0.673 | 0.382 | 0.814 |
| y4 | 0.133 | 0.242 | 0.226 | 0.009 | 0.441 | 0.392 | 0.098 | 0.477 | 0.414 | 0.486 | 0.424 | 0.483 | 0.491 |
| y5 | 0.222 | 0.207 | — | 0.098 | 0.006 | — | 0.012 | 0.001 | — | 0.020 | — | — | — |
| y6 | 0.196 | 0.183 | — | 0.080 | 0.009 | — | 0.011 | 0.004 | — | 0.023 | — | — | — |

dCor and $r^2_{MI}(\mathbf{x}, y) = 1 - \exp(-2\,MI(\mathbf{x}, y))$:

|  | dCor x1 | dCor x2 | dCor x3 | dCor x1x2 | dCor x1x3 | dCor x2x3 | dCor x1x2x3 | r²MI x1 | r²MI x2 | r²MI x3 | r²MI x1x2 | r²MI x1x3 | r²MI x2x3 | r²MI x1x2x3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| y1 | 0.054 | 0.114 | 0.043 | 0.106 | 0.062 | 0.079 | 0.089 | 0 | 0.992 | 0 | 0.862 | 0.001 | 0.548 | 0.522 |
| y2 | 0.210 | 0.213 | 0.262 | 0.279 | 0.267 | 0.290 | 0.342 | 0.296 | 0.510 | 0.312 | 0.791 | 0.725 | 0.796 | 0.978 |
| y3 | 0.624 | 0.106 | 0.070 | 0.558 | 0.525 | 0.092 | 0.498 | 0.487 | 0.163 | 0.207 | 0.643 | 0.546 | 0.256 | 0.572 |
| y4 | 0.052 | 0.405 | 0.360 | 0.342 | 0.263 | 0.400 | 0.361 | 0.089 | 0.260 | 0.212 | 0.461 | 0.205 | 0.326 | 0.383 |
| y5 | 0.091 | 0.034 | — | 0.081 | — | — | — | 0.025 | 0.090 | — | 0.747 | — | — | — |
| y6 | 0.079 | 0.049 | — | 0.079 | — | — | — | 0.001 | 0.030 | — | 0.463 | — | — | — |

Figure 4: The scatterplots and association detection results for the six examples. According to the properties of the new statistics, significant main effects (nCor), detected interaction effects (COI), and essential independent variables (COE) are marked in red, blue, and green respectively in the original figure. nCor was tested for significance using Fisher's transformation with confidence limits ($\alpha$ = 5%) of 0.062 for examples 1-4 and 0.098 for the last two. The per-panel scores are:

|  | nCor x1 | nCor x2 | nCor x3 | nCor/COI x1x2 | nCor/COI x1x3 | nCor/COI x2x3 | nCor/COI x1x2x3 | COE x1 | COE x2 | COE x3 |
|---|---|---|---|---|---|---|---|---|---|---|
| y1 | 0 | 0.31 | 0 | 0.88 / 0.56 | 0.01 / 0.01 | 0.27 / −0 | 0.5834 / −0.304 | 0.31 | 0.57 | −0.29 |
| y2 | 0 | −0.01 | 0 | 0.21 / 0.21 | 0 / 0 | 0.12 / 0.11 | 0.9289 / 0.5988 | 0.81 | 0.93 | 0.72 |
| y3 | 0.44 | 0 | 0.23 | 0.64 / 0.2 | 0.6 / −0.1 | 0.16 / −0.1 | 0.6662 / −0.2081 | 0.51 | 0.06 | 0.03 |
| y4 | 0.01 | 0.2 | 0.16 | 0.43 / 0.22 | 0.21 / 0.05 | 0.2 / −0.2 | 0.3159 / −0.3238 | 0.11 | 0.1 | −0.12 |
| y5 | 0.01 | 0.07 | — | 0.979 / 0.8983 | — | — | — | 0.91 | 0.97 | — |
| y6 | 0.05 | 0.02 | — | 0.3178 / 0.2484 | — | — | — | 0.3 | 0.27 | — |

Table 2 presents the experimental results for the previous methods. (i) None of these measures captures every underlying relationship satisfactorily; in particular, the association between $x_1$ and $y_1$ is missed by all the methods except RDC.
Generally, the multivariate measures achieve better performance than the bivariate ones, but even with these methods there are still some missed detections. (ii) Despite exceeding the confidence limits, these measures cannot assign appropriate scores that correctly state the importance of each independent variable for predicting or classifying the dependent variable. For instance, the values of CODCF($x_2$, $y_1$), RDC($x_1x_2$, $y_1$), MIC($x_1$, $y_5$), and MIC($x_2$, $y_5$) only slightly exceed the confidence limits, whereas the corresponding variables are strongly associated. dCor($x_2$, $y_1$) = 0.11 and $r^2_{MI}(x_2, y_1)$ = 0.99, whereas the real $R^2$ is in fact only about 0.3. (iii) Although RDC, dCor and MI can handle both bivariate and multivariate data, they still cannot be used to distinguish between interactions and main effects. The detection results of these measures are sometimes ambiguous and confusing, so that it is quite difficult to properly discern interaction effects from the values. For example, $r^2_{MI}(x_2, y_1) > r^2_{MI}(x_1x_2, y_1)$, but there is a strong interaction $h(x_1, x_2)$, and $y_1$ can only be predicted properly by using $x_1$ and $x_2$ simultaneously. In example 2, $r^2_{MI}(x_1x_2, y_2)$, $r^2_{MI}(x_1x_3, y_2)$, and $r^2_{MI}(x_2x_3, y_2)$ yield very similar values, and in particular $MI(x_1x_3, y_2) > MI(x_1, y_2) + MI(x_3, y_2)$, although an exact interaction $h(x_1, x_3)$ does not exist. In addition, $\mathrm{RDC}(x_1x_3, y_2) > \mathrm{RDC}(x_2x_3, y_2) > \mathrm{RDC}(x_1x_2, y_2)$, even though $y_2$ precisely contains the terms $x_1x_2$ and $x_2x_3$.
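As a cross-check on these scores, the theoretical effect strengths for example 2 can be worked out directly. The short derivation below assumes, as in the experimental setup, that the $x_i$ are independent and uniform on $[-1, 1]$, so that $E[x_i] = 0$ and $E[x_i^2] = 1/3$:

$$
\operatorname{Var}(x_1x_2) = E[x_1^2]\,E[x_2^2] = \tfrac{1}{9}, \qquad
\operatorname{Var}(0.7\,x_2x_3) = \tfrac{0.49}{9}, \qquad
\operatorname{Var}(3\,x_1x_2x_3) = 9\left(\tfrac{1}{3}\right)^3 = \tfrac{3}{9}.
$$

The three interaction terms are mutually uncorrelated, so they contribute to $\operatorname{Var}(y_2)$ in the proportion $1 : 0.49 : 3$, which is the pattern that the COI scores in Figure 4 recover, as discussed next.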
Figure 4 shows the scatterplots of the six examples and the correlation detection results. As shown in the figure, the scatterplots display a variety of patterns. In particular, between $y_1$, $y_4$ and $x_1$, there is no pattern at all that can be discovered by visual inspection of the scatterplots, since the data points look completely random. In contrast to the existing methods, the nCor-based statistics successfully detect the underlying relationships without any missed or false judgement. By the COI test, all the interaction effects are precisely discerned from the associations, along with a rough assessment of the effect strength. In example 2, $\mathrm{COI}(x_1x_2x_3, y_2)/\mathrm{COI}(x_1x_2, y_2) = 2.9$ and $\mathrm{COI}(x_2x_3, y_2)/\mathrm{COI}(x_1x_2, y_2) = 0.52$ are approximately equal to the theoretical values of the corresponding variance ratios, which can be derived as $\operatorname{Var}(3x_1x_2x_3)/\operatorname{Var}(x_1x_2) = 3$ and $\operatorname{Var}(0.7x_2x_3)/\operatorname{Var}(x_1x_2) = 0.49$. Moreover, a negative $\mathrm{COE}(x_3, y_4)$ indicates that $x_3$ is redundant; that is, although $\mathrm{nCor}(x_3, y_4)$ yields a considerable value, $x_3$ is not essential to analyzing $y_4$, since the predictive information carried by $x_3$ fully overlaps with $x_2$.

### Association detection in a large data set

The nCor-based statistics were used to explore a real-world data set consisting of 357 social, economic, health, and political indicators for 202 countries around the world for the period from 1960 through 2005. It was originally collected from the World Health Organization (WHO) and partner organizations (Rosling 2008; W.H.O. 2009). With the new method, we detected a huge number of interesting associations, including both nonlinear main effects and interactions; for more statistical results see the supplementary material.

Figure 5 shows six typical associations detected by the new measures. To confirm that nCor is an effective estimate of $R^2$, linear regression and feedforward artificial neural networks (ANN) were used to fit the detected relationships and obtain the real $R^2$ of the data.

Figure 5: Six typical examples of the associations detected by nCor and COI, including three interactions (a-i) and three main effects (j-l). In (g-l), $\hat{f}_L(\cdot)$ and $\hat{f}_N(\cdot)$ respectively indicate that the fitted line or curve (surface) and $R^2$ were obtained using linear regression or an ANN. The per-panel scores are: (a) gross national income per capita ($x_i$) vs. health expenditure per person ($y$): Cor² = 0.714, nCor = 0.667; (b) industry contribution to economy ($x_j$) vs. $y$: Cor² = 0.005, nCor = 0.004; (c) ($x_i$, $x_j$) vs. $y$: nCor = 0.793, COI = 0.126; (d) population annual growth rate ($x_i$) vs. deaths of children from HIV/AIDS ($y$): Cor² = 0.001, nCor = 0.103; (e) healthy life expectancy at birth ($x_j$) vs. $y$: Cor² = 0.236, nCor = 0.266; (f) ($x_i$, $x_j$) vs. $y$: nCor = 0.735, COI = 0.366; (g) age-standardized mortality rate from injuries ($x_i$) vs. years of life lost to injuries ($y$): Cor² = 0, nCor = 0.005, ANN $R^2$ = 0.0295; (h) healthy life expectancy at birth, B.S. ($x_j$) vs. $y$: Cor² = 0.267, nCor = 0.357, ANN $R^2$ = 0.3746; (i) ($x_i$, $x_j$) vs. $y$: nCor = 0.683, COI = 0.322, ANN $R^2$ = 0.7211; (j) years of life lost to communicable diseases ($x$) vs. life expectancy at birth (years), B.S. ($y$): Cor² = 0.801, nCor = 0.832; (k) government expenditure on health ($x$) vs. infant mortality rate, B.S. ($y$): Cor² = 0.221, nCor = 0.621; (l) continent ($x$) vs. life expectancy at birth, B.S. ($y$): Cor² = 0, nCor = 0.615.

(i) Fig. 5(a) depicts a superposition of two relationships that has been studied previously (Reshef et al. 2011): most data points obey a steeper trend, and the others obey a less steep trend. Obviously, it is impossible to separate the two trends of health expenditure ($y$) when considering national income ($x_i$) alone. By the COI test, we found another indicator, industry contribution to economy ($x_j$), which does not directly affect $y$ but influences it interactively (Figs. 5(b, c)). When looking at $y$ in the space of $x_i$ and $x_j$, the less steep minority of points can be precisely separated from the others by three lines. (ii) Similarly, Figs. 5(d-f) show another example that is even more persuasive. The COI test detects a strong interaction effect, implying that neither a low population growth rate ($x_i$) nor a short healthy life expectancy ($x_j$) alone is unique to the countries with extremely high deaths among children due to HIV/AIDS (the outliers of $y$), but the combination of the two is. (iii) The third example is an association consisting of both main and interaction effects. Figs. 5(g-i) show the relationships among the three variables, as well as the curves, surface, and $R^2$ obtained from the best-fitted ANNs (10 ANNs were trained for each case). By means of nCor, we can not only accurately reveal the composition of the association, but also properly foretell the $R^2$ of the data.
(iv) Figs. 5(j-l) show three pairwise associations, diagnosed respectively as weak, strong, and purely nonlinear effects, and then confirmed by linear regression and ANN fitting. Fig. 5(l) suggests that even with a qualitative independent variable, nCor still exhibits excellent detection power.

## Conclusion

Data-driven research is becoming increasingly popular in fields as varied as biology, physics, political science, and economics. In such studies, association detection is one of the most critical issues, and may provide valuable insight into large and complex data sets that is otherwise difficult to obtain. nCor inherits the merits of the Pearson correlation coefficient in the linear case, but is generally applicable to measuring all types of functional relationships. The three nCor-based statistics can be used to distinguish and characterize associations from the aspects of nonlinearity, interactivity, and variable redundancy. These measures, as illustrated in the empirical studies, are simple but powerful, and may have a wide range of applications, from quick association detection to various kinds of data analysis and interpretation.

## References

Aguirre, L. A. 1995. A nonlinear correlation function for selecting the delay time in dynamical reconstructions. Physics Letters A 203(2-3):88–94.

Altman, N., and Krzywinski, M. 2015. Points of significance: Association, correlation and causation. Nature Methods 12(10):899–900.

Billings, S. A., and Zhu, Q. M. 1995. Model validation tests for multivariable nonlinear models including neural networks. International Journal of Control 62(4):749–766.

Breiman, L., and Friedman, J. H. 1985. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association 80:580–598.

Darbellay, G. A., and Vajda, I. 1999. Estimation of the information by an adaptive partitioning of the observation space. IEEE Transactions on Information Theory 45(4):1315–1321.

David, H. A., and Nagaraja, H. N. 2003. Order Statistics, Third Edition. New Jersey: John Wiley and Sons.

Delicado, P., and Smrekar, M. 2009. Measuring non-linear dependence for two random variables distributed along a curve. Statistics and Computing 19(3):255–269.

Delicado, P. 2001. Another look at principal curves and surfaces. Journal of Multivariate Analysis 77(1):84–116.

Fleuret, F. 2004. Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research 5:1531–1555.

Gelfand, I. M., and Yaglom, A. M. 1957. Calculation of the amount of information about a random function contained in another such function. American Mathematical Society Translations: Series 2 12(1):199–236.

Gretton, A., and Györfi, L. 2010. Consistent nonparametric tests of independence. Journal of Machine Learning Research 11(3):1391–1423.

Gretton, A.; Fukumizu, K.; Teo, C.; Song, L.; Schölkopf, B.; and Smola, A. 2008. A kernel statistical test of independence. In Advances in Neural Information Processing Systems (NIPS) 20, 585–592.

Gutin, G., and Punnen, A. P. 2007. The Traveling Salesman Problem and Its Variations. Boston: Springer.

Hastie, T., and Stuetzle, W. 1989. Principal curves. Journal of the American Statistical Association 84:502–516.

Heller, R.; Heller, Y.; Kaufman, S.; Brill, B.; and Gorfine, M. 2016. Consistent distribution-free k-sample and independence tests for univariate random variables. Journal of Machine Learning Research 17:1–54.
Heller, R.; Heller, Y.; and Gorfine, M. 2013. A consistent multivariate test of association based on ranks of distances. Biometrika 100(2):503–510.

Kendall, M. 1938. A new measure of rank correlation. Biometrika 30:81–93.

Kraskov, A.; Stögbauer, H.; and Grassberger, P. 2004. Estimating mutual information. Physical Review E 69:066138.

Lange, O. F., and Grubmüller, H. 2005. Generalized correlation for biomolecular dynamics. Proteins: Structure, Function, and Bioinformatics 62(4):1053–1061.

Liu, P.; Sohn, H.; and Jeon, I. 2017. Nonlinear spectral correlation for fatigue crack detection under noisy environments. Journal of Sound and Vibration 400:305–316.

Lopez-Paz, D.; Hennig, P.; and Schölkopf, B. 2013. The randomized dependence coefficient. In Advances in Neural Information Processing Systems (NIPS) 27, 1–8.

Mao, K. Z., and Billings, S. A. 2000. Multi-directional model validity tests for non-linear system identification. International Journal of Control 73(2):132–143.

Moon, Y.; Rajagopalan, B.; and Lall, U. 1995. Estimation of mutual information using kernel density estimators. Physical Review E 52(3):2318–2321.

Reshef, D. N.; Reshef, Y. A.; Finucane, H. K.; Grossman, S. R.; McVean, G.; Turnbaugh, P. J.; Lander, E. S.; Mitzenmacher, M.; and Sabeti, P. C. 2011. Detecting novel associations in large data sets. Science 334(6062):1518–1524.

Reshef, Y. A.; Reshef, D. N.; Finucane, H. K.; Sabeti, P. C.; and Mitzenmacher, M. M. 2015. Measuring dependence powerfully and equitably. Journal of Machine Learning Research 17(1):7406–7468.

Reshef, D.; Reshef, Y.; Sabeti, P.; and Mitzenmacher, M. 2018. An empirical study of the maximal and total information coefficients and leading measures of dependence. The Annals of Applied Statistics 12:123–155.

Rényi, A. 1959. On measures of dependence. Acta Mathematica Hungarica 10:441–451.

Rosling, H. 2008. Indicators in Gapminder World. http://www.gapminder.org/gapminder-world/indicators-in-gapminder-world/.

Runge, J. 2018. Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), volume 84, 938–947.

Sato, T.; Yamanishi, Y.; Horimoto, K.; Kanehisa, M.; and Toh, H. 2006. Partial correlation coefficient between distance matrices as a new indicator of protein–protein interactions. Bioinformatics 22(20):2488–2492.

Sejdinovic, D.; Sriperumbudur, B.; Gretton, A.; and Fukumizu, K. 2013. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics 41:2263–2291.

Székely, G., and Rizzo, M. 2009. Brownian distance covariance. The Annals of Applied Statistics 3:1236–1265.

Székely, G.; Rizzo, M.; and Bakirov, N. 2007. Measuring and testing dependence by correlation of distances. The Annals of Statistics 35:2769–2794.

Wang, Q.; Shen, Y.; and Zhang, J. Q. 2005. A nonlinear correlation measure for multivariable data set. Physica D 200:287–295.

W.H.O. 2009. WHO statistical information system (WHOSIS). http://www.who.int/whosis/en/.

Zhang, L. F.; Zhu, Q. M.; and Longden, A. 2007. A set of novel correlation tests for nonlinear system variables. International Journal of Systems Science 38(1):47–60.

Zhu, Q. M.; Zhang, L. F.; and Longden, A. 2007. Development of omni-directional correlation functions for nonlinear model validation. Automatica 43(9):1519–1531.