Finding Statistically Significant Interactions between Continuous Features

Mahito Sugiyama¹,² and Karsten Borgwardt³,⁴
¹National Institute of Informatics, Tokyo 101-8430, Japan
²JST PRESTO, Japan
³D-BSSE, ETH Zürich, Basel 4058, Switzerland
⁴SIB Swiss Institute of Bioinformatics, Switzerland
mahito@nii.ac.jp, karsten.borgwardt@bsse.ethz.ch

The search for higher-order feature interactions that are statistically significantly associated with a class variable is of high relevance in fields such as genetics or healthcare, but the combinatorial explosion of the candidate space makes this problem extremely challenging in terms of computational efficiency and proper correction for multiple testing. While recent progress has been made on this challenge for binary features, we here present the first solution for continuous features. We propose an algorithm which overcomes the combinatorial explosion of the search space of higher-order interactions by deriving a lower bound on the p-value for each interaction, which enables us to massively prune interactions that can never reach significance and to thereby gain statistical power. In our experiments, our approach efficiently detects all significant interactions in a variety of synthetic and real-world datasets.

1 Introduction

A big challenge in high-dimensional data analysis is the search for features that are statistically significantly associated with the class variable, while accounting for the inherent multiple testing problem. This problem is relevant in a broad range of applications including natural language processing, statistical genetics, and healthcare. To date, this problem of feature selection [Guyon and Elisseeff, 2003] has been extensively studied in statistics and machine learning, including recent advances in selective inference [Taylor and Tibshirani, 2015], a technique that can assess the statistical significance of features selected by linear models such as the Lasso [Lee et al., 2016]. However, current approaches have a crucial limitation: they can find only single features or linear combinations of features, and it remains an open problem to find patterns, that is, multiplicative (potentially higher-order) interactions between features.

A relevant line of research towards this goal is significant (discriminative) pattern mining [Terada et al., 2013a; Llinares-López et al., 2015; Papaxanthos et al., 2016; Pellegrina and Vandin, 2018], which tries to find statistically associated feature interactions while controlling the family-wise error rate (FWER), the probability of detecting one or more false positive patterns. However, all existing methods for significant pattern mining apply only to combinations of binary or discrete features, and none of them can handle real-valued data, although such data is common in many applications. If we binarize the data beforehand to apply existing significant pattern mining approaches, the resulting method may not be able to distinguish correlated from uncorrelated features (see Figure 1).

To date, there is no method that can find all higher-order interactions of continuous features that are significantly associated with an output variable while accounting for the inherent multiple testing problem. To solve this problem, one has to address the following three challenges:

1. How to assess the significance of a multiplicative interaction of continuous features?
2. How to perform multiple testing correction? In particular, how to control the FWER (family-wise error rate), the probability of detecting one or more false positives?
3. How to manage the combinatorial explosion of the candidate space, where the number of possible interactions is 2^d for d features?

The second and third problems are related to each other: if one can reduce the number of combinations by pruning unnecessary ones, one gains statistical power and reduces false negative combinations while still controlling the FWER. Although there is an extensive body of work in statistics on multiple testing correction, including the FDR [Hochberg, 1988; Benjamini and Hochberg, 1995], and also several studies on interaction detection [Bogdan et al., 2015; Su and Candès, 2016], none of these approaches addresses the combinatorial explosion of the 2^d-dimensional search space when trying to find higher-order significant multiplicative interactions.

Our goal in this paper is to present the first method, called C-Tarone, that can find all higher-order interactions between continuous features that are statistically significantly associated with the class variable, while controlling the FWER. Our approach is to use rank order statistics to directly estimate the probability of joint occurrence of each feature combination from continuous data, which is known as the copula support [Tatti, 2013], and to apply a likelihood ratio test to assess the significance of association between feature interactions and the class label, which solves problem 1. We present a tight lower bound on the p-values of association, which enables us to prune unnecessary interactions that can never be significant through the notion of testability proposed by Tarone [1990], and which solves both problems 2 and 3.

[Figure 1: Although there is a clear correlation between Features 1 and 2 in the left panel and none in the right panel, after median-based binarization of Feature 1 and Feature 2, the estimated probability of the occurrence of the feature combination (the number of points for which Feature 1 = Feature 2 = 1, or the number of points in the red box) is exactly the same in both examples. Hence if the left case is a significant interaction, the right uncorrelated case also becomes a significant interaction in binarization-based methods.]

This paper is organized as follows: We introduce our method C-Tarone in Section 2. We introduce a likelihood ratio test as a statistical association test for interactions in Section 2.1, analyze multiple testing correction using testability in Section 2.2, and present an algorithm in Section 2.3. We experimentally validate our method in Section 3 and summarize our findings in Section 4.
2 The Proposed Method: C-Tarone

We are given a supervised dataset D = {(v_1, y_1), (v_2, y_2), ..., (v_N, y_N)}, where each data point is a pair of a d-dimensional vector v_i = (v_i^1, v_i^2, ..., v_i^d) ∈ R^d and a binary class label y_i ∈ {0, 1}. We denote the set of features by V = {1, 2, ..., d} and the power set of features by 2^V. For each feature j ∈ {1, 2, ..., d}, we write v^j = (v_1^j, v_2^j, ..., v_N^j) for the N-dimensional vector composed of the jth feature of the dataset D. Our goal is to find every multiplicative feature interaction that is associated with the class labels.

To tackle the problem, we first measure the joint occurrence probability η(J) ∈ R of a feature combination J ∈ 2^V in a dataset. For each combination J ∈ 2^V, the size |J| corresponds to the order of an interaction, and we find arbitrary-order interactions in the form of combinations. If the data is not real-valued but binary, that is, v_i^j ∈ {0, 1}, this problem is easily solved by measuring the support used in frequent itemset mining [Agrawal and Srikant, 1994; Aggarwal and Han, 2014]. Each feature combination J ∈ 2^V is called an itemset, and the support of J is defined as η(J) = (1/N) Σ_{i=1}^{N} Π_{j∈J} v_i^j, which corresponds to the joint probability of J. The support always lies between 0 and 1, and we can introduce a binary random variable X_J corresponding to the joint occurrence of J, where X_J = 1 if the features in J jointly occur and X_J = 0 otherwise; η(J) corresponds to the empirical estimate of the probability Pr(X_J = 1).

This approach can be generalized to continuous (real-valued) data using the copula support, a prominent result given by Tatti [2013] in the context of frequent pattern mining from continuous data. The copula support allows us to define the binary random variable X_J of joint occurrence of J and to estimate η(J) = Pr(X_J = 1) from continuous data. The key idea is to use the normalized ranking for each feature j ∈ V, defined as π(v_i^j) = (k − 1)/(N − 1) if v_i^j is the kth smallest value among the N values v_1^j, v_2^j, ..., v_N^j. Then we convert a dataset D with respect to the features in a combination J ∈ 2^V into a single N-dimensional vector x^J = (x_1^J, x_2^J, ..., x_N^J) ∈ [0, 1]^N by

    x_i^J = Π_{j∈J} π(v_i^j)    (1)

for each data point i ∈ {1, 2, ..., N}. Tatti [2013] showed that the empirical estimate of the probability Pr(X_J = 1) of the joint occurrence of a combination J is obtained by the copula support, defined as

    η(J) = (1/N) Σ_{i=1}^{N} x_i^J = (1/N) Σ_{i=1}^{N} Π_{j∈J} π(v_i^j).    (2)

Intuitively, joint occurrence of J means that the rankings of the features in J match each other. The definition in Equation (2) is analogous to the support for binary data; the only difference is π(v_i^j) in place of the binary value v_i^j. The copula support always satisfies η(J) ≤ 0.5 by definition and is monotone with respect to the inclusion relationship of combinations: η(J) ≥ η(J ∪ {j}) for all J ∈ 2^V and j ∈ V \ J. Note that, although Tatti [2013] considered a statistical test for the copula support, it cannot be used in our setting for two reasons: (1) his setting is unsupervised while ours is supervised; (2) multiple testing correction was not considered for the test, and the FWER was not controlled.
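As a concrete illustration, the following Python sketch (ours, not the authors' C/C++ implementation) computes the normalized ranks π and the copula support η(J) of Equation (2); it assumes distinct feature values, since ties are broken arbitrarily by the sort.

```python
import numpy as np

def normalized_rank(column):
    """pi(v): map each value to (k - 1) / (N - 1), where k is its rank
    (1 = smallest). Ties are broken arbitrarily in this sketch."""
    order = np.argsort(column)
    ranks = np.empty(len(column))
    ranks[order] = np.arange(len(column))          # 0, 1, ..., N - 1
    return ranks / (len(column) - 1)

def copula_support(data, J):
    """eta(J) = (1/N) sum_i prod_{j in J} pi(v_i^j), Equation (2)."""
    x_J = np.ones(data.shape[0])
    for j in J:
        x_J *= normalized_rank(data[:, j])
    return x_J.mean()

rng = np.random.default_rng(0)
D = rng.random((100, 5))
print(copula_support(D, [0]))      # exactly 0.5 for any single feature
print(copula_support(D, [0, 1]))   # <= 0.5; decreases as J grows
```

Averaging the ranks (0 + 1 + ... + N−1)/(N−1) gives exactly 0.5 for a single feature, matching the bound η(J) ≤ 0.5 above.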
2.1 Statistical Testing

We can now formulate our problem of finding feature interactions, or itemset mining on continuous features, that are associated with class labels as follows: Let Y be a binary output variable whose realizations are the class labels. The task is to find all feature combinations J ∈ 2^V such that the null hypothesis X_J ⊥ Y, that is, that X_J and Y are statistically independent, is rejected by a statistical association test, while rigorously controlling the FWER, the probability of detecting one or more false positive associations, under a predetermined significance level α. As a statistical test, we propose to use a likelihood ratio test [Fisher, 1922], which is a generalization of the χ²-test and often called the G-test [Woolf, 1957]. Although Fisher's exact test has been used as the standard statistical test in recent studies [Llinares-López et al., 2015; Sugiyama et al., 2015; Terada et al., 2013a], it applies only to discrete test statistics and cannot be used in our setting.

Suppose that Pr(Y = l) = r_l for each class l ∈ {0, 1}. From the two binary variables X_J and Y, we obtain a 2 × 2 contingency table, where each cell denotes the joint probability Pr(X_J = l, Y = l′) with l, l′ ∈ {0, 1}, which can be described as a four-dimensional probability vector p:

    p = ( Pr(X_J = 1, Y = 1), Pr(X_J = 1, Y = 0), Pr(X_J = 0, Y = 1), Pr(X_J = 0, Y = 0) ).

Let p_E be the probability vector under the null hypothesis X_J ⊥ Y and p_O be the empirical vector obtained from the N observations. The difference between the two distributions p_O and p_E can be measured by the Kullback–Leibler (KL) divergence D_KL(p_O, p_E) = Σ_i p_Oi log(p_Oi / p_Ei), and the independence X_J ⊥ Y translates into the condition D_KL(p_O, p_E) = 0. In the G-test, which is a special case of the likelihood ratio test, the test statistic is given as λ = 2N·D_KL(p_O, p_E), which follows the χ²-distribution with one degree of freedom. In our case, for each combination J ∈ 2^V, the probability vector p_E under the null is given as

    p_E = ( η(J)r_1, η(J)r_0, r_1 − η(J)r_1, r_0 − η(J)r_0 ),

where η(J) is the copula support of the feature combination J and r_l is the ratio of label l ∈ {0, 1} in the dataset. In contrast, the observed probability vector p_O is given as

    p_O = ( η_1(J), η_0(J), r_1 − η_1(J), r_0 − η_0(J) ),

where η_1(J) = (1/N) x^J · y and η_0(J) = (1/N) x^J · (1 − y), with x^J defined in Equation (1). These two distributions are shown in Table 1.

Table 1: Contingency tables.

(a) Expected distribution:

            X_J = 1       X_J = 0           Total
    Y = 1   η(J)r_1       r_1 − η(J)r_1     r_1
    Y = 0   η(J)r_0       r_0 − η(J)r_0     r_0
    Total   η(J)          1 − η(J)          1

(b) Observed distribution:

            X_J = 1       X_J = 0           Total
    Y = 1   η_1(J)        r_1 − η_1(J)      r_1
    Y = 0   η_0(J)        r_0 − η_0(J)      r_0
    Total   η(J)          1 − η(J)          1

While it is known that the statistic λ does not exactly follow the χ²-distribution if one of the components of p_E is too small, this situation does not usually occur in our setting: η(J) is never too large, as η(J) ≤ 0.5 by definition, and never too small, as such combinations are not testable, which will be shown in Section 2.3. In the following, we write the set of all possible probability vectors as P = { p | p_i ≥ 0, Σ p_i = 1 } and its subset with given marginals a and b as P(a, b) = { p ∈ P | p_1 + p_2 = a, p_1 + p_3 = b }.
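The following sketch (again ours; the function name g_test_p_value is our choice) evaluates this G-test from the quantities η(J), η_1(J), and r_1, using SciPy's χ² tail:

```python
import numpy as np
from scipy.stats import chi2

def g_test_p_value(eta, eta1, r1, N):
    """G-test of Section 2.1: lambda = 2 * N * KL(p_O || p_E) compared
    against a chi-squared distribution with one degree of freedom."""
    r0 = 1.0 - r1
    eta0 = eta - eta1
    p_O = np.array([eta1, eta0, r1 - eta1, r0 - eta0])
    p_E = np.array([eta * r1, eta * r0, r1 - eta * r1, r0 - eta * r0])
    mask = p_O > 0                 # convention: 0 * log(0 / q) = 0
    kl = np.sum(p_O[mask] * np.log(p_O[mask] / p_E[mask]))
    return chi2.sf(2.0 * N * kl, df=1)

# eta(J) = 0.25 split evenly across classes: no association, p close to 1
print(g_test_p_value(eta=0.25, eta1=0.125, r1=0.5, N=100))
# the same support concentrated in class 1: strong association, tiny p
print(g_test_p_value(eta=0.25, eta1=0.24, r1=0.5, N=100))
```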
2.2 Multiple Testing Correction

Since we have 2^d hypotheses, one for each feature combination, we need to perform multiple testing correction to control the FWER (family-wise error rate); otherwise we will report massive numbers of false positive combinations. The most popular multiple testing correction method is the Bonferroni correction [Bonferroni, 1936], in which the predetermined significance level α (e.g., α = 0.05) is corrected to δ_Bon = α/m, where m is called the correction factor and is the number of hypotheses, m = 2^d in our case. Each hypothesis J is declared significant only if p-value(J) < δ_Bon = α/2^d; then FWER < α is guaranteed. However, it is well known that the Bonferroni correction is overly conservative. In our case in particular, the correction factor 2^d is so large, due to the combinatorial effect, that it is almost impossible to find significant combinations, which generates massive numbers of false negatives.

Here we use the testability of hypotheses introduced by Tarone [1990] and widely used in the literature on significant pattern mining [Llinares-López et al., 2015; Papaxanthos et al., 2016; Sugiyama et al., 2015; Terada et al., 2013a], which allows us to prune unnecessary feature combinations that can never be significant while controlling the FWER at the same level. Hence Tarone's testability always offers better results than the Bonferroni correction whenever it is applicable. The only requirement for Tarone's testability is the existence of a lower bound on the p-value given the marginals of the contingency table, which are η(J) and r_0 (or r_1) in our setting. In the following, we prove that we can analytically obtain the tight upper bound of the KL divergence, which immediately leads to the tight lower bound of the p-value.

Theorem 1 (Tight upper bound of the KL divergence). For a, b ∈ [0, 1] with a ≤ b ≤ 1/2 and a probability vector p_E = (ab, a(1 − b), (1 − a)b, (1 − a)(1 − b)),

    D_KL(p, p_E) < a log(1/b) + (b − a) log( (b − a) / ((1 − a)b) ) + (1 − b) log( 1/(1 − a) )    (3)

for all p ∈ P(a, b), and this bound is tight.

We now formally introduce how to prune unnecessary hypotheses via Tarone's testability. Let ψ(J) be the lower bound on the p-value of J obtained from the upper bound of the KL divergence proved in Theorem 1. Let J_1, J_2, ..., J_{2^d} be the sequence of all combinations sorted such that ψ(J_1) ≤ ψ(J_2) ≤ ψ(J_3) ≤ ... ≤ ψ(J_{2^d}), and let m be the threshold such that

    m·ψ(J_m) < α  and  (m + 1)·ψ(J_{m+1}) ≥ α.    (4)

Tarone [1990] showed that the FWER is controlled under α with the correction factor m. The set T = {J_1, J_2, ..., J_m} is called the set of testable combinations, and each J ∈ T is significant if p-value(J) < δ_Tar = α/m. The combinations J_{m+1}, ..., J_{2^d} are called untestable, as they can never be significant. Since m ≪ 2^d usually holds, we can expect higher statistical power from Tarone's method than from the Bonferroni method. Moreover, the algorithm C-Tarone presented in the next subsection can enumerate the testable combinations without visiting untestable ones; hence we overcome the combinatorial explosion of the search space of combinations.

Let us denote by B(a, b) the upper bound provided in Equation (3). We analyze the behavior of the bound B(a, b) as a function of a with b fixed. This is the typical situation in our setting, as a corresponds to the copula support η(J), which varies across combinations, while b corresponds to the class ratio r_1, which is fixed in each analysis. Assume that b ≤ 1/2. When a < b, we have ∂B(a, b)/∂a = log( (1 − a)/(b − a) ) > 0, hence the bound monotonically increases as a increases. When b < a < 1/2, we have ∂B(a, b)/∂a = log( (a − b)/a ) < 0, so it monotonically decreases as a increases. We illustrate the bound B(a, b) with b = 0.3 and the corresponding minimum achievable p-value with sample size N = 100 in Figure 4 in the Appendix.
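Under our reading of Theorem 1, the minimum achievable p-value ψ for a combination with copula support η(J) = a and class ratio r_1 = b follows by plugging the bound B(a, b) into the χ² tail of the G-test statistic λ = 2N·D_KL. A sketch (function names are ours):

```python
import numpy as np
from scipy.stats import chi2

def kl_upper_bound(a, b):
    """B(a, b) of Equation (3), assuming 0 < a <= b <= 1/2."""
    assert 0 < a <= b <= 0.5
    if a == b:                      # the middle term vanishes (0 * log 0 = 0)
        return a * np.log(1.0 / b) + (1.0 - b) * np.log(1.0 / (1.0 - a))
    return (a * np.log(1.0 / b)
            + (b - a) * np.log((b - a) / ((1.0 - a) * b))
            + (1.0 - b) * np.log(1.0 / (1.0 - a)))

def min_p_value(eta, r1, N):
    """Lower bound psi on the p-value of any J with copula support eta."""
    return chi2.sf(2.0 * N * kl_upper_bound(eta, r1), df=1)

# psi shrinks as eta(J) grows toward r1, so low-support combinations
# can be pruned once their minimum achievable p-value is too large.
for eta in [0.05, 0.1, 0.2, 0.3]:
    print(eta, min_p_value(eta, r1=0.3, N=100))
```

Because B(a, b) increases with a for a < b, ψ decreases monotonically as the support grows toward r_1, which is exactly what makes support-based pruning sound.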
2.3 Algorithm

We now present the algorithm C-Tarone, summarized in Algorithm 1, which performs a depth-first search to efficiently find the testable combinations J_1, J_2, ..., J_m with ψ(J_1) ≤ ψ(J_2) ≤ ... ≤ ψ(J_m), where m satisfies condition (4). Suppose that r_1 ≤ 1/2 ≤ r_0. Since the lower bound ψ(J) on the p-value takes its minimum value when η(J) = r_1 (see Figure 4 in the Appendix, letting a = η(J) and b = r_1) and monotonically increases as η(J) decreases, for any η(J) ≤ r_1 we always have T ⊆ C with C = {J ∈ 2^V | η(J) ≥ σ} and T = {J ∈ 2^V | ψ(J) ≤ B(σ, r_1)}, where σ is a threshold on copula supports and, with a slight abuse of notation, B(σ, r_1) here denotes the minimum achievable p-value obtained from Equation (3) with a = σ and b = r_1. Thus, if condition (4) is satisfied for some m ≤ |T|, the m smallest combinations in T are the testable combinations. Moreover, since η(J) is monotone with respect to the inclusion relationship, that is, η(J) ≥ η(J ∪ {j}) for all j ∈ V \ J, finding the set C is achieved by a simple DFS (depth-first search): starting from the empty combination with η(∅) = 1, if η(J) ≥ σ for a combination J, we extend J by adding some j ∈ V \ J and recursively check the extension; otherwise, if η(J) < σ, we prune all combinations K ⊇ J, as η(K) < σ holds for them.

Our algorithm dynamically updates the threshold σ so as to enumerate testable combinations while pruning massive numbers of unnecessary combinations. First we set σ = 0, which means that all combinations are testable. Whenever we find a new combination J, we update C and T and check condition (4). If |T|·B(σ, r_1) ≥ α holds, σ is too low and T would become too large under the current σ, hence we update σ to min_{J∈C} η(J) and remove the combination argmin_{J∈C} η(J). When the algorithm stops, the set T coincides with the set of testable combinations; hence each combination J ∈ T is significant if its p-value is smaller than δ_Tar = α/|T|.

Algorithm 1: C-Tarone
 1  C-TARONE(D, α)
 2      σ ← 0; C ← ∅;                       // σ is a global variable
 3      DFS(∅, 0, C, D, α);
 4      T ← {K ∈ C | ψ(K) ≤ B(σ, r_1)};     // the set of testable combinations
 5      SIGNIFICANCETESTING(T, α);
 6  DFS(J, j_prev, C, D, α)
 7      foreach j ∈ {j_prev + 1, ..., d} do
 8          J ← J ∪ {j};
 9          compute η(J) by Equation (2);
10          if η(J) > σ then
11              compute ψ(J); C ← C ∪ {J};
12              T ← {K ∈ C | ψ(K) ≤ B(σ, r_1)};
13              while |T|·B(σ, r_1) ≥ α do
14                  J_min ← argmin_{K∈C} η(K);
15                  σ ← η(J_min); C ← C \ {J_min};
16                  T ← T \ {J_min};
17              DFS(J, j, C, D, α);
18          J ← J \ {j};
19  SIGNIFICANCETESTING(T, α)
20      foreach J ∈ T do
21          if p-value(J) < α/|T| then output J;

Since Algorithm 1 always outputs the set of testable combinations T, which follows directly from T ⊆ C and the monotonicity of η(J), the SIGNIFICANCETESTING function finds all significant combinations while controlling the FWER under α. Moreover, C-Tarone is independent of the feature ordering, and the above completeness with respect to the set of significant combinations is always satisfied. The time complexity of C-Tarone is O(|T|).
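To make the control flow concrete, here is an illustrative Python transcription of Algorithm 1 (the paper's implementation is in C/C++; names such as c_tarone_sketch and shrink are ours). It reuses normalized_rank, min_p_value, and g_test_p_value from the sketches above.

```python
import numpy as np

def c_tarone_sketch(data, y, alpha=0.05):
    """Enumerate testable combinations by DFS with a rising support
    threshold sigma, then test them at the corrected level alpha / |T|."""
    N, d = data.shape
    r1 = y.mean()                     # assume r1 <= 1/2 <= r0
    ranks = np.column_stack([normalized_rank(data[:, j]) for j in range(d)])
    state = {"sigma": 0.0, "C": {}}   # C maps frozenset J -> eta(J)

    def psi(eta):                     # minimum achievable p-value at support eta
        if eta <= 0.0:
            return 1.0
        return min_p_value(min(eta, r1), r1, N)

    def testable():                   # T = {K in C | psi(K) <= B(sigma, r1)}
        return [J for J, eta in state["C"].items()
                if psi(eta) <= psi(state["sigma"])]

    def shrink():                     # lines 13-16 of Algorithm 1
        while state["C"] and len(testable()) * psi(state["sigma"]) >= alpha:
            J_min = min(state["C"], key=state["C"].get)
            state["sigma"] = state["C"].pop(J_min)

    def dfs(J, x_J, j_prev):          # lines 6-18 of Algorithm 1
        for j in range(j_prev + 1, d):
            x_new = x_J * ranks[:, j]
            eta = x_new.mean()
            if eta > state["sigma"]:  # otherwise prune the whole subtree
                state["C"][J | {j}] = eta
                shrink()
                dfs(J | {j}, x_new, j)

    dfs(frozenset(), np.ones(N), -1)
    T = testable()
    delta = alpha / max(len(T), 1)    # delta_Tar = alpha / |T|
    significant = []
    for J in T:                       # SIGNIFICANCETESTING, lines 19-21
        x_J = ranks[:, sorted(J)].prod(axis=1)
        if g_test_p_value(x_J.mean(), (x_J * y).mean(), r1, N) < delta:
            significant.append(J)
    return significant
```

On exit, T coincides with the testable set, and each reported J survives the Tarone-corrected threshold α/|T|.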
3 Experiments

We examine the effectiveness and efficiency of C-Tarone using synthetic and real-world datasets. We used Amazon Linux AMI release 2017.09 and ran all experiments on a single core of a 2.3 GHz Intel Xeon CPU E7-8880 v3 with 2.0 TB of memory. All methods were implemented in C/C++ and compiled with gcc 4.8.5. The FWER level is α = 0.05 throughout the experiments.

There exists no method that can enumerate significant feature combinations from continuous data with multiple testing correction. We therefore compare C-Tarone to a significant itemset mining method applied after prior binarization of the given dataset, since significant itemset mining can enumerate all significant feature combinations from binary data while controlling the FWER. We employ median-based binarization as preprocessing: for each feature j ∈ V, we take the median of v^j = (v_1^j, v_2^j, ..., v_N^j), denoted by med(j), and binarize each value v_i^j into the pair (v_i^{≤med(j)}, v_i^{>med(j)}), where v_i^{≤med(j)} = 1 if v_i^j ≤ med(j) and 0 otherwise, and v_i^{>med(j)} = 1 if v_i^j > med(j) and 0 otherwise. A given dataset is thus converted into a binarized dataset with 2d features (a sketch is given below). We use the state-of-the-art significant itemset mining algorithm LAMP ver. 2 [Minato et al., 2014] in the implementation provided by Llinares-López et al. [2018], which incorporates the fastest frequent pattern mining algorithm, LCM [Uno et al., 2004], to enumerate testable feature combinations from binary data. Note that, in terms of runtime, a comparison with the brute-force approach of testing all 2^d combinations with the Bonferroni correction is not valid, as the resulting FWER differs between the brute-force and the proposed method.

We also tried other binarization approaches as preprocessing for significant itemset mining: interordinal scaling, used in numerical pattern mining [Kaytoue et al., 2011], and interval binarization, used in subgroup discovery [Grosskreutz and Rüping, 2009]. However, both preprocessings generate too many dense binarized features, so the itemset mining step lacks scalability in the enumeration of testable combinations. Hence we do not employ them as comparison partners. Details of these binarization techniques are summarized in Section C in the Appendix.
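For reference, a minimal sketch (ours) of the median-based binarization used as the main comparison partner, which doubles the feature count from d to 2d:

```python
import numpy as np

def median_binarize(data):
    """Replace each feature j by the indicator pair
    (v <= med(j), v > med(j))."""
    med = np.median(data, axis=0)
    return np.hstack([(data <= med).astype(int),   # v_i^{<= med(j)}
                      (data > med).astype(int)])   # v_i^{> med(j)}

rng = np.random.default_rng(0)
print(median_binarize(rng.random((5, 3))))         # 5 x 6 binary matrix
```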
To evaluate the efficiency of the methods, we measure the running time needed to enumerate all significant combinations. For the binarization method, we exclude the time used for binarization, as this preprocessing step is efficient and negligible compared to the pattern mining step. To examine the quality of the detected combinations, we compute precision, recall, and the F-measure by comparing them with the combinations obtained by the standard decision tree method CART [Breiman et al., 1984], which builds multiplicative combinations of features in the form of binary trees. We used the rpart function in R with its default parameter setting, where the Gini index is used for splitting and the minimum number of data points that must exist in a node is 20. We apply the decision tree to each dataset and retrieve all paths from the root to the leaf nodes of the learned tree. For each path, we use the collection of features used along the path as a positive feature combination of the ground truth; that is, a feature combination found by C-Tarone (or the binarization method) is deemed a true positive if and only if it constitutes one of the full paths from the root to a leaf of the learned decision tree. Note that the FWER is always controlled under α for both C-Tarone and the binarization method; our aim is to empirically examine the quality of the detected feature combinations against those selected by a standard decision tree. We used a single tree rather than a forest because with forests the ground truth would depend on the number of trees, leading to arbitrary results.

Please note that, because there are up to 2^d feature combinations and we only count an exact match between a retrieved pattern and a true pattern as a hit, precision and F-measure will be close to 0. Still, we compute them to allow a comparison of the relative performance of the different approaches. Evaluation criteria that take overlaps between retrieved and true patterns (partial matches) into account would lead to higher precision, but they are a topic of ongoing research and not a focus of this work.

Results on Synthetic Data

First we evaluate C-Tarone on synthetic data, varying the sample size N from 1,000 to 200,000 and the number of features d from 20 to 100, and setting the class ratio to r_1 = 0.5 or r_1 = 0.2, i.e., the number N_1 of samples in the minor class is N/2 or N/5. In each dataset, 20% of the features are associated with the class labels. More precisely, we first generate the entire dataset from the uniform distribution on [0, 1] and assign class label 1 to the first N_1 data points. Then, for the N_1 data points in class 1, we pick one of the 20% associated features and copy it to every associated feature, adding Gaussian noise with (µ, σ²) = (0, 0.1). Hence there is no correlation across features in class 0, while there are positive correlations among the 20% associated features in class 1. The other 80% are uninformative features. A sketch of this generator is given below.
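The following sketch reconstructs this generator under our reading of the description (the exact script is not given in the paper; for example, whether the source feature itself also receives noise is our assumption):

```python
import numpy as np

def make_synthetic(N, d, r1, seed=0):
    """Uniform features; the first 20% of features are correlated within
    class 1 via a shared source plus Gaussian noise with variance 0.1."""
    rng = np.random.default_rng(seed)
    X = rng.random((N, d))                 # uniform on [0, 1]
    y = np.zeros(N, dtype=int)
    N1 = int(round(N * r1))                # size of the minor class
    y[:N1] = 1                             # first N1 points get label 1
    informative = range(int(0.2 * d))      # the 20% associated features
    source = X[:N1, 0].copy()              # feature copied to the others
    for j in informative:
        X[:N1, j] = source + rng.normal(0.0, np.sqrt(0.1), size=N1)
    return X, y

X, y = make_synthetic(N=1_000, d=20, r1=0.5)
```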
Results are plotted in Figure 2 for r_1 = 0.5 (balanced classes); see Figure 5 in the Appendix for r_1 = 0.2 (imbalanced classes). We plot results varying N with d = 20 fixed in the left column and varying d with N = 1,000 fixed in the right column.

[Figure 2: Results on synthetic data with the minor class ratio r_1 = 0.5. Regarding the scale of precision and F-measure, see the remark in the last paragraph before the subsection "Results on Synthetic Data". The number of features is d = 20 in the left column and the sample size is N = 1,000 in the right column. Both x- and y-axes are in logarithmic scale. C-Tarone is shown in red circles, the binarization approach in blue triangles.]

Compared with the median-based binarization method (plotted in blue), C-Tarone (plotted in red) has a clear advantage in precision, which is several orders of magnitude higher than that of the binarization method in every case. This is because the binarization method cannot distinguish correlated from uncorrelated combinations, as discussed in the Introduction and illustrated in Figure 1, and hence includes uncorrelated features in significant combinations. Although recall is competitive across the various N and d, the F-measure of C-Tarone is higher than that of the median-based binarization method in all cases.

In both methods, precision drops as the sample size N becomes large: as we gain more statistical power with larger N, many feature combinations, even those with very small dependence on the class labels that are not used in the decision tree, reach statistical significance. Although the itemset mining algorithm LCM used in the binarization approach is highly optimized for efficiency, C-Tarone is competitive with it on all datasets, as can be seen in Figure 2 (bottom row). To summarize, we observe that C-Tarone improves over the competing binarization approach in terms of the F-measure, detecting higher-quality feature combinations for classification.

Results on Real Data

We also evaluate C-Tarone on the real-world datasets shown in Table 2 in the Appendix, which are benchmark datasets for binary classification from the UCI repository [Lichman, 2013]. To illustrate the exponentially large search space, the table also shows the number 2^d of candidate combinations for d features. All datasets are balanced to maximize the statistical power for comparing the detected significant combinations, i.e., r_1 = 0.5; if a dataset is not balanced originally, we randomly subsample data points from the larger class.

We summarize the results in Figure 3. Again, C-Tarone shows higher precision than the binarization method on all datasets and better or competitive recall, resulting in higher F-measure scores on all datasets. In addition, its running time is competitive with the binarization method, which means that C-Tarone successfully prunes the massive candidate space of significant feature combinations. These results demonstrate the effectiveness of C-Tarone.

[Figure 3: Results on real data. Regarding the scale of precision and F-measure, see the remark in the last paragraph before the subsection "Results on Synthetic Data". The y-axis is in logarithmic scale. C-Tarone is shown in red and the binarization approach in blue. Higher (taller) is better for precision, recall, and F-measure; lower is better for running time.]

4 Conclusion

In this paper, we have proposed a solution to the open problem of finding all multiplicative interactions between continuous features that are significantly associated with an output variable while rigorously controlling for multiple testing. While interaction detection with multiple testing has been studied before, our approach, called C-Tarone, is the first to overcome the problem of detecting all higher-order interactions within the enormous search space of 2^d candidates for d features. Our work opens the door to many applications of searching for significant feature combinations in which the data is not adequately described by binary features, including large fields such as data analysis for high-throughput technologies in biology and medicine. Our work here addresses continuous input features; finding significant combinations associated with continuous output variables is an equally challenging and practically relevant problem that we will tackle in future work.

Acknowledgments

This work was supported by JSPS KAKENHI Grant Numbers JP16K16115 and JP16H02870, and by JST, PRESTO Grant Number JPMJPR1855, Japan (M.S.). This work was funded in part by the SNSF Starting Grant "Significant Pattern Mining" (K.B.).

A Proof of Theorem 1

Let f(x) = D_KL(p, p_E) with p = (x, a − x, b − x, (1 − b) − (a − x)). We have

    f(x) = x log( x/(ab) ) + (a − x) log( (a − x)/(a(1 − b)) ) + (b − x) log( (b − x)/((1 − a)b) ) + (1 − a − b + x) log( (1 − a − b + x)/((1 − a)(1 − b)) ).

The second derivative of f(x) is given as

    ∂²f/∂x² = 1/x + 1/(a − x) + 1/(b − x) + 1/(1 − a − b + x),

which is always positive under the constraint 0 < x < min{a, b} = a. Thus f(x) is convex and is maximized as x goes to 0 or a.
The limits of f(x) are obtained as

    lim_{x→0} f(x) = a log( 1/(1 − b) ) + b log( 1/(1 − a) ) + (1 − a − b) log( (1 − a − b)/((1 − a)(1 − b)) ),
    lim_{x→a} f(x) = a log(1/b) + (b − a) log( (b − a)/((1 − a)b) ) + (1 − b) log( 1/(1 − a) ).

To check which is larger, let δ be the difference lim_{x→a} f(x) − lim_{x→0} f(x). It is obtained as

    δ = (1 − b) log(1 − b) − b log b + (b − a) log(b − a) − (1 − a − b) log(1 − a − b).

The partial derivative of δ with respect to a is ∂δ/∂a = log(1 − a − b) − log(b − a), which is always nonnegative, as (1 − a − b) − (b − a) = 1 − 2b ≥ 0 under the condition b ≤ 1/2. Hence the difference δ takes its minimum value at a = 0, where

    δ ≥ (1 − b) log(1 − b) − b log b + b log b − (1 − b) log(1 − b) = 0.

Thus lim_{x→a} f(x) is the larger of the two limits, and the bound in Equation (3), which equals lim_{x→a} f(x), is the tight upper bound.

B Related Work

Since the only line of work that tries to find multiplicative feature combinations while controlling the FWER is significant pattern mining, we provide an overview of this field in the following. Significant pattern mining introduces statistical significance into the task of contrast (or discriminative) pattern mining [Dong and Bailey, 2013], where the objective is to find discriminative patterns with respect to a class partitioning of a dataset. After early work on multiple testing correction in association rule mining [Hämäläinen, 2012; Webb, 2007], Terada et al. [2013a] were the first to achieve control of the FWER in itemset mining, by successfully combining a pattern mining algorithm with Tarone's trick [Tarone, 1990]. The enumeration algorithm has been improved in LAMP ver. 2 [Minato et al., 2014] and in significant subgraph mining [Sugiyama et al., 2015]. To date, significant pattern mining has been extended to various types of tests and data, including a Westfall–Young permutation test to treat dependencies among patterns [Llinares-López et al., 2015; Terada et al., 2013b], logistic regression [Terada et al., 2016] and a Cochran–Mantel–Haenszel (CMH) test [Llinares-López et al., 2017; Papaxanthos et al., 2016] for categorical covariates, hypothesis streams [Webb and Petitjean, 2016], and top-K significant patterns [Pellegrina and Vandin, 2018]. However, none of the above studies succeeded in directly performing significant pattern mining on continuous variables without prior binarization.

[Figure 4: The upper bound B(a, b) of the KL divergence (left) and the corresponding minimum achievable p-value with N = 100 (right), with respect to changes in a when b = 0.3.]

Although the field of subgroup discovery [Atzmueller, 2015; Novak et al., 2009; Herrera et al., 2011] also considers measures of statistical dependence for finding multiplicative feature combinations, e.g., [Grosskreutz and Rüping, 2009; Mampaey et al., 2015; van Leeuwen and Ukkonen, 2016], none of these methods accounts for multiple testing by controlling the FWER.

C Additional Binarization Methods

In interordinal scaling, each binarized feature is of the form "≤ a" or "≥ a", where the endpoints a are taken from the dataset, that is, a ∈ {v_1^j, v_2^j, ..., v_N^j} for a feature j ∈ {1, 2, ..., d}. Thus, for a d-dimensional real-valued vector v_i ∈ D, each element v_i^j is expanded into the 2N-dimensional binary vector

    ( v_i^{≤v_1^j}, v_i^{≤v_2^j}, ..., v_i^{≤v_N^j}, v_i^{≥v_1^j}, v_i^{≥v_2^j}, ..., v_i^{≥v_N^j} ),

where v_i^{≤v_k^j} = 1 if v_i^j ≤ v_k^j and 0 otherwise. As a result, the dataset D is converted into a binary dataset with 2dN features.
In interval binarization, each binarized feature is of the form "(a, b]", where the endpoints a, b are taken from the data, and each element v_i^j of a d-dimensional vector v_i is expanded into the (N(N − 1)/2)-dimensional binary vector

    ( v_i^{(v_1^j, v_2^j]}, v_i^{(v_1^j, v_3^j]}, ..., v_i^{(v_1^j, v_N^j]}, v_i^{(v_2^j, v_3^j]}, ..., v_i^{(v_{N−1}^j, v_N^j]} ).

Thus a dataset D is converted into a binary dataset with dN(N − 1)/2 features. Both interordinal scaling and interval binarization could finish their computation for a tiny dataset with (N, d) = (50, 5) in approximately 24 hours, but did not finish after 48 hours for (N, d) = (100, 10), and they exceeded the memory limit (2.0 TB) for larger datasets.

[Figure 5: Results on synthetic data with the minor class ratio r_1 = 0.2. The number of features is d = 20 in the left column and the sample size is N = 3,000 in the right column. Both x- and y-axes are in logarithmic scale. C-Tarone is shown in red circles, the binarization approach in blue triangles. Missing points in (b) mean that no significant combination is detected.]

Table 2: Statistics of real data.

Data        | N      | d  | # candidate combinations (search space)
ctg         | 942    | 22 | 4,194,304
faults      | 316    | 27 | 134,217,728
ijcnn       | 9,706  | 22 | 4,194,304
magic       | 13,376 | 10 | 1,024
segment     | 660    | 19 | 524,288
transfusion | 356    | 4  | 16
waveform    | 3,314  | 21 | 2,097,152
wdbc        | 424    | 30 | 1,073,741,824
wine        | 3,198  | 11 | 2,048
yacht       | 308    | 6  | 64

References

[Aggarwal and Han, 2014] C. C. Aggarwal and J. Han, editors. Frequent Pattern Mining. Springer, 2014.
[Agrawal and Srikant, 1994] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases, pages 487–499, 1994.
[Atzmueller, 2015] M. Atzmueller. Subgroup discovery. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(1):35–49, 2015.
[Benjamini and Hochberg, 1995] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289–300, 1995.
[Bogdan et al., 2015] M. Bogdan, E. van den Berg, C. Sabatti, W. Su, and E. J. Candès. SLOPE: adaptive variable selection via convex optimization. The Annals of Applied Statistics, 9(3):1103–1140, 2015.
[Bonferroni, 1936] C. E. Bonferroni. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 8:3–62, 1936.
[Breiman et al., 1984] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and Regression Trees. CRC Press, 1984.
[Dong and Bailey, 2013] G. Dong and J. Bailey, editors. Contrast Data Mining: Concepts, Algorithms, and Applications. CRC Press, 2013.
[Fisher, 1922] R. A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 222(594–604):309–368, 1922.
[Grosskreutz and Rüping, 2009] H. Grosskreutz and S. Rüping. On subgroup discovery in numerical domains.
Data Mining and Knowledge Discovery, 19(2):210–226, 2009.
[Guyon and Elisseeff, 2003] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
[Hämäläinen, 2012] W. Hämäläinen. Kingfisher: an efficient algorithm for searching for both positive and negative dependency rules with statistical significance measures. Knowledge and Information Systems, 32(2):383–414, 2012.
[Herrera et al., 2011] F. Herrera, C. J. Carmona, P. González, and M. J. del Jesus. An overview on subgroup discovery: foundations and applications. Knowledge and Information Systems, 29(3):495–525, 2011.
[Hochberg, 1988] Y. Hochberg. A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75(4):800–802, 1988.
[Kaytoue et al., 2011] M. Kaytoue, S. O. Kuznetsov, and A. Napoli. Revisiting numerical pattern mining with formal concept analysis. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, pages 1342–1347, 2011.
[Lee et al., 2016] J. D. Lee, D. L. Sun, Y. Sun, and J. E. Taylor. Exact post-selection inference, with application to the lasso. Annals of Statistics, 44(3):907–927, 2016.
[Lichman, 2013] M. Lichman. UCI machine learning repository, 2013.
[Llinares-López et al., 2015] F. Llinares-López, M. Sugiyama, L. Papaxanthos, and K. M. Borgwardt. Fast and memory-efficient significant pattern mining via permutation testing. In Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 725–734, 2015.
[Llinares-López et al., 2017] F. Llinares-López, L. Papaxanthos, D. Bodenham, D. Roqueiro, COPDGene Investigators, and K. Borgwardt. Genome-wide genetic heterogeneity discovery with categorical covariates. Bioinformatics, 33(12):1820–1828, 2017.
[Llinares-López et al., 2018] F. Llinares-López, L. Papaxanthos, D. Roqueiro, D. Bodenham, and K. Borgwardt. CASMAP: detection of statistically significant combinations of SNPs in association mapping. Bioinformatics, bty1020, 2018.
[Mampaey et al., 2015] M. Mampaey, S. Nijssen, A. Feelders, R. Konijn, and A. Knobbe. Efficient algorithms for finding optimal binary features in numeric and nominal labeled data. Knowledge and Information Systems, 42(2):465–492, 2015.
[Minato et al., 2014] S. Minato, T. Uno, K. Tsuda, A. Terada, and J. Sese. A fast method of statistical assessment for combinatorial hypotheses based on frequent itemset enumeration. In Machine Learning and Knowledge Discovery in Databases, volume 8725 of LNCS, pages 422–436. Springer, 2014.
[Novak et al., 2009] P. K. Novak, N. Lavrač, and G. I. Webb. Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining. The Journal of Machine Learning Research, 10:377–403, 2009.
[Papaxanthos et al., 2016] L. Papaxanthos, F. Llinares-López, D. Bodenham, and K. M. Borgwardt. Finding significant combinations of features in the presence of categorical covariates. In Advances in Neural Information Processing Systems, volume 29, pages 2271–2279, 2016.
[Pellegrina and Vandin, 2018] L. Pellegrina and F. Vandin. Efficient mining of the most significant patterns with permutation testing. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2070–2079, 2018.
[Su and Candès, 2016] W. Su and E. J. Candès.
SLOPE is adaptive to unknown sparsity and asymptotically minimax. The Annals of Statistics, 44(3):1038–1068, 2016.
[Sugiyama et al., 2015] M. Sugiyama, F. Llinares-López, N. Kasenburg, and K. M. Borgwardt. Significant subgraph mining with multiple testing correction. In Proceedings of the 2015 SIAM International Conference on Data Mining, pages 37–45, 2015.
[Tarone, 1990] R. E. Tarone. A modified Bonferroni method for discrete data. Biometrics, 46(2):515–522, 1990.
[Tatti, 2013] N. Tatti. Itemsets for real-valued datasets. In 2013 IEEE 13th International Conference on Data Mining, pages 717–726, 2013.
[Taylor and Tibshirani, 2015] J. Taylor and R. J. Tibshirani. Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634, 2015.
[Terada et al., 2013a] A. Terada, M. Okada-Hatakeyama, K. Tsuda, and J. Sese. Statistical significance of combinatorial regulations. Proceedings of the National Academy of Sciences, 110(32):12996–13001, 2013.
[Terada et al., 2013b] A. Terada, K. Tsuda, and J. Sese. Fast Westfall–Young permutation procedure for combinatorial regulation discovery. In 2013 IEEE International Conference on Bioinformatics and Biomedicine, pages 153–158, 2013.
[Terada et al., 2016] A. Terada, D. du Verle, and K. Tsuda. Significant pattern mining with confounding variables. In Advances in Knowledge Discovery and Data Mining (PAKDD 2016), volume 9651 of LNCS, pages 277–289, 2016.
[Uno et al., 2004] T. Uno, T. Asai, Y. Uchida, and H. Arimura. An efficient algorithm for enumerating closed patterns in transaction databases. In Discovery Science, volume 3245 of LNCS, pages 16–31, 2004.
[van Leeuwen and Ukkonen, 2016] M. van Leeuwen and A. Ukkonen. Expect the unexpected: on the significance of subgroups. In Discovery Science, volume 9956 of LNCS, pages 51–66, 2016.
[Webb and Petitjean, 2016] G. I. Webb and F. Petitjean. A multiple test correction for streams and cascades of statistical hypothesis tests. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1255–1264, 2016.
[Webb, 2007] G. I. Webb. Discovering significant patterns. Machine Learning, 68(1):1–33, 2007.
[Woolf, 1957] B. Woolf. The log likelihood ratio test (the G-test). Annals of Human Genetics, 21(4):397–409, 1957.