# Nearest-Neighbor Sampling Based Conditional Independence Testing

Shuai Li¹, Ziqi Chen¹, Hongtu Zhu², Christina Dan Wang³, Wang Wen⁴

¹ School of Statistics, KLATASDS-MOE, East China Normal University, Shanghai, China
² Departments of Biostatistics, Statistics, Computer Science, and Genetics, The University of North Carolina at Chapel Hill, Chapel Hill, USA
³ Business Division, New York University Shanghai, Shanghai, China
⁴ School of Mathematics and Statistics, Central South University, Changsha, China

shuaili.cq@foxmail.com, zqchen@fem.ecnu.edu.cn, htzhu@email.unc.edu, christina.wang@nyu.edu, wangwen.a@foxmail.com

Corresponding author: Ziqi Chen.

## Abstract

The conditional randomization test (CRT) was recently proposed to test whether two random variables X and Y are conditionally independent given random variables Z. The CRT assumes that the conditional distribution of X given Z is known under the null hypothesis, and samples drawn from it are then compared with the observed samples of the original data. The aim of this paper is to develop a novel alternative to the CRT that uses nearest-neighbor sampling and does not assume the exact form of the distribution of X given Z. Specifically, we utilize the computationally efficient 1-nearest-neighbor method to approximate the conditional distribution that encodes the null hypothesis. We then show theoretically that the distribution of the generated samples is very close to the true conditional distribution in terms of total variation distance. Furthermore, we take the classifier-based conditional mutual information estimator as our test statistic. As an empirical estimate of a fundamental information-theoretic quantity, this test statistic captures the conditional-dependence structure well. We show that our proposed test is computationally very fast while controlling type I and type II errors quite well. Finally, we demonstrate the efficiency of our proposed test in both synthetic and real data analyses.

## Introduction

Conditional independence testing (CIT) has wide applications in statistics and machine learning, including causal inference (Spirtes et al. 2000; Pearl 2009; Cai, Li, and Zhang 2022) and graphical models (Lauritzen 1996; Koller and Friedman 2009) as two well-known examples. The aim of this paper is to develop a flexible and fast method for CIT. Specifically, we consider two univariate continuous random variables $X$ and $Y$, and a set of random variables $Z \in \mathbb{R}^{d_Z}$, whose dimension $d_Z$ can potentially diverge to infinity, with joint density function $p_{X,Y,Z}(x, y, z)$. Based on $n$ independently and identically distributed (i.i.d.) copies $\{(X_i, Y_i, Z_i) : i = 1, \ldots, n\}$ of $(X, Y, Z)$, we are interested in testing whether $X$ and $Y$ are conditionally independent given $Z$; that is,
$$H_0 : X \perp Y \mid Z \quad \text{versus} \quad H_1 : X \not\perp Y \mid Z,$$
where $\perp$ denotes independence. The high dimensionality of $Z$ makes CIT challenging (Bellot and van der Schaar 2019; Shi et al. 2021). Our proposed method can be readily extended to the scenario of multivariate $X$ and $Y$. Recently, many methods have been proposed to test conditional independence; see, for example, Candes et al. (2018), Zhang et al. (2011), Zhang et al. (2017), Bellot and van der Schaar (2019), Strobl, Zhang, and Visweswaran (2019), Berrett et al. (2020), Shah and Peters (2020), Shi et al. (2021), and Zhang et al. (2022).
Among them, the conditional randomization test (CRT) proposed by Candes et al. (2018) is one of the most important methods, but the CRT assumes that the true conditional distribution $p_{X|Z}$ is known. Conditional on $\{Z_1, \ldots, Z_n\}$, one can independently draw $X_i^{(m)} \sim p_{X|Z=Z_i}$ for each $i$ and each $m = 1, \ldots, M$, such that all $\mathbf{X}^{(m)} := (X_1^{(m)}, \ldots, X_n^{(m)})$ are independent of $\mathbf{X} := (X_1, \ldots, X_n)$ and $\mathbf{Y} := (Y_1, \ldots, Y_n)$, where $M$ is the number of repetitions. Thus, under the null hypothesis $H_0 : X \perp Y \mid Z$, we have $(\mathbf{X}^{(m)}, \mathbf{Y}, \mathbf{Z}) \overset{d}{=} (\mathbf{X}, \mathbf{Y}, \mathbf{Z})$ for all $m$, where $\overset{d}{=}$ denotes equality in distribution. A large difference between $(\mathbf{X}^{(m)}, \mathbf{Y}, \mathbf{Z})$ and $(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$ can therefore be regarded as strong evidence against $H_0$. Statistically, one can consider a test statistic $T(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$ and approximate its p-value by
$$\frac{1 + \sum_{m=1}^{M} \mathbb{I}\{T(\mathbf{X}^{(m)}, \mathbf{Y}, \mathbf{Z}) \ge T(\mathbf{X}, \mathbf{Y}, \mathbf{Z})\}}{1 + M}, \tag{1}$$
where $\mathbb{I}(\cdot)$ is the indicator function. Under $H_0$, this p-value, denoted $p$, is valid: $P(p \le \alpha \mid H_0) \le \alpha$ holds for any $\alpha \in (0, 1)$.
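To make (1) concrete, here is a minimal sketch of the generic CRT skeleton. It assumes the conditional law $p_{X|Z}$ is available through a user-supplied sampler; the names `crt_pvalue`, `sample_x_given_z`, and `test_stat` are ours rather than the paper's, and the toy statistic (absolute Pearson correlation) is only for illustration.

```python
import numpy as np

def crt_pvalue(x, y, z, sample_x_given_z, test_stat, M=100, rng=None):
    """Generic CRT p-value as in Eq. (1).

    sample_x_given_z(zi, rng) must return a draw from p_{X|Z=zi} for a row zi of z;
    test_stat(x, y, z) is any scalar test statistic T(X, Y, Z).
    """
    rng = np.random.default_rng(rng)
    t_obs = test_stat(x, y, z)
    exceed = 0
    for _ in range(M):
        # resample X conditionally on Z, independently of the observed (X, Y)
        x_tilde = np.array([sample_x_given_z(zi, rng) for zi in z])
        exceed += test_stat(x_tilde, y, z) >= t_obs
    return (1 + exceed) / (1 + M)

# toy usage: Z -> X and Z -> Y, so H0 (X independent of Y given Z) holds
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 500
    z = rng.normal(size=(n, 3))
    beta = np.array([1.0, 0.5, -0.5])
    x = z @ beta + rng.normal(size=n)
    y = z.sum(axis=1) + rng.normal(size=n)
    pval = crt_pvalue(
        x, y, z,
        sample_x_given_z=lambda zi, g: zi @ beta + g.normal(),
        test_stat=lambda a, b, c: abs(np.corrcoef(a, b)[0, 1]),
        M=200, rng=1)
    print(pval)
```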
Several methods have been developed based on different approximations to $p_{X|Z}$, since $p_{X|Z}$ is rarely known in practice. For instance, Bellot and van der Schaar (2019) developed a Generative Conditional Independence Test (GCIT) by using Wasserstein generative adversarial networks (WGANs; Arjovsky, Chintala, and Bottou 2017) to approximate $p_{X|Z}$. Let $\hat{p}_{X|Z}$ be an estimator of $p_{X|Z}$ based on WGANs. Theoretically, as shown in Bellot and van der Schaar (2019), the excess type I error over a desired level $\alpha$ of their GCIT test is bounded by $E\{d_{TV}(p_{X|Z}, \hat{p}_{X|Z})\}$, where $d_{TV}$ denotes the total variation distance. However, Figure 1 shows that $\hat{p}_{X|Z}$ approximates $p_{X|Z}$ very poorly in two relatively simple simulation settings. Thus, as shown in our synthetic data analysis, the GCIT test has inflated type I errors. Recently, Shi et al. (2021) proposed to use Sinkhorn GANs (Genevay, Peyré, and Cuturi 2018) to approximate $p_{X|Z}$. As shown in Figure 1, the Sinkhorn GANs also perform poorly in these two relatively simple simulation settings.

The choice of test statistic in the CRT is crucial for achieving adequate statistical power as well as controlling the type I error, but it has not been carefully investigated. For instance, Bellot and van der Schaar (2019) proposed to consider multiple test statistics, including the maximum mean discrepancy (MMD), Pearson's correlation coefficient (PCC), the distance correlation (DC), and the Kolmogorov-Smirnov distance (KS), but little is known about how to appropriately choose test statistics in different scenarios. Moreover, test statistics that solely measure the dependence between $X$ and $Y$ may suffer from inflated type I errors and/or inadequate power under $H_1$. We consider two scenarios: (i) a simple Markov chain $X \to Z \to Y$ and (ii) $X \to Z \leftarrow Y$, where a directed arrow between two random variables indicates a direct cause. In both scenarios, test statistics that solely measure the dependence between $X$ and $Y$ inflate the type I error and/or lose statistical power in testing $H_0$ against $H_1$. It is therefore necessary to use test statistics that can capture conditional dependence. In this paper, we consider the conditional mutual information (CMI) of $(X, Y, Z)$, denoted by $I(X; Y|Z)$, and its empirical version (Mukherjee, Asnani, and Kannan 2020), since it provides a strong theoretical guarantee for conditional-dependence relations in that $I(X; Y|Z) = 0 \iff X \perp Y \mid Z$ (Cover and Thomas 2012). The CMI has been widely used in causal learning (Hlinka et al. 2013), graphical models (Liang and Wang 2008), and feature selection (Fleuret 2004). However, the empirical CMI can be computationally demanding, especially for high-dimensional $Z$.

In this paper, we propose a novel CIT method based on the 1-nearest-neighbor sampling strategy (NNSCIT) to simulate samples from a distribution that is approximately close to the true density $p_{X|Z}$. Nearest-neighbor sampling, first developed by Fix and Hodges (1951), has been widely used in density estimation, classification, and regression problems (Silverman 2018; Cover and Hart 1967; Devroye et al. 1994). Recently, Sen et al. (2017) used a nearest-neighbor bootstrap procedure to generate samples from the joint distribution of $(X, Y, Z)$ under $X \perp Y \mid Z$. Compared with GANs, the 1-nearest-neighbor (1-NN) approach not only is computationally efficient but also yields better approximation quality.

We make four major contributions. First, we propose to use the 1-NN method to generate samples from the approximated conditional distribution of $X$ given $Z$. The 1-NN method is computationally much more efficient than WGANs (Bellot and van der Schaar 2019) and Sinkhorn GANs (Shi et al. 2021). Theoretically, we show that the distribution of samples generated from 1-NN is very close to the true conditional distribution in terms of the total variation distance. Second, we take $I(X; Y|Z)$ as our test statistic and estimate it empirically using the recent classifier-based method (Mukherjee, Asnani, and Kannan 2020). Third, for the pseudo samples $\tilde{X}^{(m)}$ ($m = 1, \ldots, M$) generated from 1-NN, we provide insights for replacing $I(\tilde{X}^{(m)}; Y|Z)$ with $I(\tilde{X}^{(m)}; Y)$ to speed up the calculation, because estimating the $I(\tilde{X}^{(m)}; Y|Z)$'s is computationally intensive, especially when the dimensionality of $Z$ is high. Fourth, our proposed test not only asymptotically achieves a valid control of the type I error but also outperforms all competing tests in our numerical studies.

## 1-Nearest-Neighbor Sampling

In this section, we present the 1-NN sampling algorithm, together with theoretical and empirical results showing that the distribution of the generated samples closely resembles the true conditional distribution.

### 1-NN Sampling from $p_{X|Z}(x|z)$

We have two data sets $V_1$ and $V_2$, both with sample size $n$, such that $V = V_1 \cup V_2$ consists of $2n$ i.i.d. samples from the distribution $p_{X,Z}(x, z)$. Given all $Z$-coordinates in $V_2$, Algorithm 1 presents the procedure to generate a data set $U_0$ consisting of $n$ samples that mimic samples generated from $p_{X|Z}(x|z)$. Specifically, for each $Z$-coordinate in $V_2$, we search for the nearest neighbor $(\tilde{X}, \tilde{Z})$ in $V_1$ in terms of the $Z$-coordinate in the $l_2$ norm and then add $\tilde{X}$ to $U_0$. When $V$ is a set containing samples from the distribution $p_{X,Y,Z}(x, y, z)$, Algorithm 1 continues to work with the $Y$-coordinates ignored.

Algorithm 1: 1-Nearest-Neighbor sampling (1-NN($V_1$, $V_2$, $n$))
Input: Data sets $V_1$ and $V_2$, both with sample size $n$, such that $V = V_1 \cup V_2$ consists of $2n$ i.i.d. samples from $p_{X,Z}$.
Output: A sample $\tilde{X}$ generated from (an approximation of) $X|Z$ for each $Z$-coordinate in $V_2$.
1: Let $U_0 = \emptyset$.
2: for $(X, Z)$ in $V_2$ do
3:   Find the sample $(\tilde{X}, \tilde{Z})$ in $V_1$ such that $\tilde{Z}$ is the 1-nearest neighbor of $Z$ in terms of the $l_2$ norm.
4:   $U_0 = U_0 \cup \{\tilde{X}\}$.
5: end for
6: return $U_0$
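A minimal Python sketch of Algorithm 1, using scikit-learn's `NearestNeighbors` for the Euclidean 1-NN query; the function name `one_nn_sample` and the array-based interface are our own choices rather than the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def one_nn_sample(x1, z1, z2):
    """Algorithm 1: for each Z-coordinate in V2, return the X-coordinate of its
    1-nearest neighbor (in the l2 norm) among the Z-coordinates of V1.

    x1 : (n,) or (n, dx) array of X-coordinates of V1
    z1 : (n, dz) array of Z-coordinates of V1
    z2 : (m, dz) array of Z-coordinates of V2
    """
    nn = NearestNeighbors(n_neighbors=1).fit(z1)
    idx = nn.kneighbors(z2, return_distance=False).ravel()
    return x1[idx]                       # U0: one pseudo sample per row of V2

# toy usage: X depends on Z through its first coordinate
rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 50))
x = z[:, 0] + 0.3 * rng.normal(size=1000)
x1, z1, z2 = x[:500], z[:500], z[500:]   # V1 and the Z-part of V2
x_tilde = one_nn_sample(x1, z1, z2)      # mimics draws from p(x | z) on V2
```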
### Theoretical Results

For a given $Z$-coordinate in $V_2$, we show that the distribution of the $\tilde{X}$ generated by Algorithm 1 is very close to the true conditional distribution in terms of the total variation distance. Before presenting our theoretical result, we first introduce Lemma 1 of Cover and Hart (1967), which states that the nearest neighbor of $Z$ converges almost surely to $Z$ as the training size $n$ grows to infinity.

Lemma 1. Let $Z$ and $Z_1, Z_2, \ldots, Z_n$ be i.i.d. random variables distributed according to $p(z)$. Let $Z'_n$ be the nearest neighbor to $Z$ from the set $\{Z_1, Z_2, \ldots, Z_n\}$. Then $Z'_n$ converges almost surely to $Z$ as $n$ grows to infinity.

We next present several standard regularity conditions, which have been introduced in Gao, Oh, and Viswanath (2016), Gao, Oh, and Viswanath (2017), and Sen et al. (2017). For the sake of simplicity, subscripts may be dropped; for example, $p(x|z)$ may be used in place of $p_{X|Z}(x|z)$.

Smoothness assumption on $p(x|z)$: a smoothness condition is imposed on $p(x|z)$, which can be regarded as a generalization of the boundedness of the maximum eigenvalue of the Fisher information matrix of $x$ with respect to $z$.

Assumption 1. For all $z \in \mathbb{R}^{d_z}$ and all $a$ such that $\|a - z\|_2 \le \epsilon_1$, we have $0 \le \lambda_{\max}(I_a(z)) \le \beta$, where $\beta > 0$, $\|\cdot\|_2$ is the $l_2$ norm, and the generalized curvature matrix $I_a(z) = (I_a(z)_{ij})$ is defined as
$$I_a(z)_{ij} = \mathbb{E}\left[-\frac{\partial^2 \log p(x|\tilde{z})}{\partial \tilde{z}_i \partial \tilde{z}_j}\bigg|_{\tilde{z}=a} \,\Big|\, z\right] = \frac{\partial^2}{\partial \tilde{z}_i \partial \tilde{z}_j}\int \log\frac{p(x|z)}{p(x|\tilde{z})}\, p(x|z)\, dx\,\bigg|_{\tilde{z}=a}.$$

Smoothness assumption on $p(z)$:

Assumption 2. The probability density function $p(z)$ is twice continuously differentiable, and the Hessian matrix $H_p(z)$ of $p(z)$ with respect to $z$ satisfies $\|H_p(z)\|_2 \le c_{d_z}$ almost everywhere, where $c_{d_z}$ depends only on $d_z$.

Given $Z$, let $\tilde{X}$ denote the sample produced by 1-NN; that is, $\tilde{X} = X'_n$ is the $X$-coordinate of the sample $(X'_n, Z'_n)$ in $V_1$ with $Z'_n$ being the nearest neighbor of $Z$. Clearly, $\tilde{X} \sim p(x|Z'_n)$. Let $\hat{p}(x|Z) := p(x|Z'_n)$. For any two distributions $P_1$ and $P_2$ defined on the same probability space, the total variation distance between $P_1$ and $P_2$ is defined as $d_{TV}(P_1, P_2) = \sup_{A \subseteq \Omega} |P_1(A) - P_2(A)|$, where the supremum is taken over all measurable subsets of the sample space $\Omega$. We have the following theorem, whose proof is given in the Supplementary Materials.

Theorem 2. Under Assumptions 1 and 2, we have
$$d_{TV}(p(x|Z), p(x|Z'_n)) = d_{TV}(p(x|Z), \hat{p}(x|Z)) = o_p(1)$$
as the sample size $n$ of $V_1$ goes to infinity.

### Empirical Goodness of Fit

In this subsection, we investigate the empirical goodness of fit of samples generated from 1-NN. We consider the following two scenarios.

Scenario 1. $X \sim \text{Uniform}[0, 1]$ and $Z$ are assumed to be independent, where $Z$ follows a 50-dimensional multivariate Gaussian distribution with mean vector $(0.7, 0.7, \ldots, 0.7)$ and the identity covariance matrix. The true conditional distribution of $X|Z$ is therefore the same as that of $X$.

Scenario 2. Set $X = A_f^T Z + \epsilon$, where the entries of $A_f$ are randomly and uniformly sampled from $[0, 1]$ and then normalized to the unit $l_1$ norm, and $Z$ is generated from a 50-dimensional multivariate Gaussian distribution with mean vector $(0.7, 0.7, \ldots, 0.7)$ and the identity covariance matrix. The noise variables $\epsilon$ are independently sampled from a normal distribution with mean zero and variance 0.49.

Figure 1: The conditional histograms.

For each of the two models, we generate $n = 1000$ samples. We randomly choose 500 samples as the training data set $V_1$ and use the remaining samples as the testing data set $V_2$. For 1-NN, we generate 500 pseudo samples by 1-NN($V_1$, $V_2$, 500). Given each $Z$-coordinate in $V_2$, we also generate pseudo samples $\tilde{X}$ using WGANs and the Sinkhorn GANs, respectively.
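Before turning to the results, the sketch below illustrates this goodness-of-fit check for Scenario 2 with the 1-NN sampler only. It is a simplified stand-in for the experiment behind Figure 1 (it pools the pseudo samples into a single histogram rather than plotting conditional histograms), and all helper names are ours.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n, dz = 1000, 50

# Scenario 2: X = A_f^T Z + eps, with A_f uniform on [0, 1] then l1-normalized
a_f = rng.uniform(size=dz)
a_f /= np.abs(a_f).sum()
z = rng.normal(loc=0.7, scale=1.0, size=(n, dz))
x = z @ a_f + rng.normal(scale=0.7, size=n)      # noise variance 0.49

# split into training set V1 and testing set V2
x1, z1 = x[:500], z[:500]
z2 = z[500:]

# 1-NN pseudo samples for each Z-coordinate in V2 (Algorithm 1)
idx = NearestNeighbors(n_neighbors=1).fit(z1).kneighbors(
    z2, return_distance=False).ravel()
x_tilde = x1[idx]

# fresh draws from the true conditional p(x | z) for comparison
x_true = z2 @ a_f + rng.normal(scale=0.7, size=len(z2))

plt.hist(x_true, bins=30, alpha=0.5, density=True, label="true X | Z")
plt.hist(x_tilde, bins=30, alpha=0.5, density=True, label="1-NN pseudo samples")
plt.legend()
plt.show()
```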
Figure 1 shows the conditional histograms of the generated samples as well as the true samples, all normalized to the range $[0, 1]$, for Scenarios 1 and 2. We observe that 1-NN fits the conditional densities reasonably well, whereas WGANs and the Sinkhorn GANs perform poorly. Specifically, WGANs tend to be biased towards either 0 or 1, and the Sinkhorn GANs cannot capture the shape of the true conditional distribution.

## Nearest-Neighbor Sampling Based CIT

In this section, we introduce our CIT based on nearest-neighbor sampling (NNSCIT) and present the pseudo code for computing NNSCIT and its p-value in Algorithm 2. Moreover, we show theoretically that our proposed test achieves a valid control of the type I error.

### The Proposed CIT Approach

Our CIT test is based on an approximation of the CMI
$$I(X; Y|Z) = I(X; Y, Z) - I(X; Z),$$
where $I(X; Y, Z)$ and $I(X; Z)$ are, respectively, the mutual information of $(X; Y, Z)$ and that of $(X; Z)$. We construct our CIT statistic as the classifier-based CMI estimator (CCMI; Mukherjee, Asnani, and Kannan 2020) of $I(X; Y|Z)$ given by
$$\hat{I}(X; Y|Z) = \hat{I}(X; Y, Z) - \hat{I}(X; Z), \tag{2}$$
where $\hat{I}(X; Y, Z)$ (or $\hat{I}(X; Z)$) is a classifier-based estimator of $I(X; Y, Z)$ (or $I(X; Z)$). By Theorem 1 in Mukherjee, Asnani, and Kannan (2020), $\hat{I}(X; Y|Z)$ is a consistent estimator of $I(X; Y|Z)$. Furthermore, for samples $\tilde{X}^{(m)}$ ($m = 1, \ldots, M$) generated from 1-NN conditioned on $Z$, we can show that $\hat{I}(\tilde{X}^{(m)}; Y|Z) - I(\tilde{X}^{(m)}; Y|Z)$ converges to zero for all $m$.

Without loss of generality, we discuss how to approximate $I(X; Z) = D_{KL}(p_{X,Z}(x, z) \| p_X(x)p_Z(z))$, where $p_{X,Z}(x, z)$ is the joint density of $(X, Z)$, and $p_X(x)$ and $p_Z(z)$ are, respectively, the marginal densities of $X$ and $Z$. Here, $D_{KL}(F\|G) = \int f(x) \log(f(x)/g(x))\, dx$ is the Kullback-Leibler (KL) divergence between two distribution functions $F$ and $G$ whose density functions are $f(x)$ and $g(x)$, respectively. The Donsker-Varadhan (DV) representation of $D_{KL}(F\|G)$ is given by
$$D_{KL}(F\|G) = \sup_{s \in \mathcal{S}}\left[\mathbb{E}_{x \sim f}\, s(x) - \log\{\mathbb{E}_{x \sim g} \exp(s(x))\}\right], \tag{3}$$
where the function class $\mathcal{S}$ includes all functions with finite expectations in (3). The optimal function in (3) is $s^*(x) = \log(f(x)/g(x))$ (Belghazi et al. 2018), leading to
$$D_{KL}(F\|G) = \mathbb{E}_{x \sim f} \log\frac{f(x)}{g(x)} - \log\Big\{\mathbb{E}_{x \sim g}\, \frac{f(x)}{g(x)}\Big\}. \tag{4}$$
Following Mukherjee, Asnani, and Kannan (2020), we use the classifier two-sample principle (Lopez-Paz and Oquab 2016) to estimate the likelihood ratio $L(x) = f(x)/g(x)$ as follows. Specifically, we consider $n$ i.i.d. samples $\{X_i^f\}_{i=1}^n$ with $X_i^f \sim f(x)$ and $d$ i.i.d. samples $\{X_j^g\}_{j=1}^d$ with $X_j^g \sim g(x)$. We label $y_i^f = 1$ for all $X_i^f$ and $y_j^g = 0$ for all $X_j^g$, and train a binary classifier based on a deep neural network on this supervised classification task. The classifier produces a predicted probability $\alpha_l = \Pr(y = 1|X_l)$ for a given sample $X_l$, leading to the likelihood-ratio estimate $\hat{L}(X_l) = \alpha_l/(1 - \alpha_l)$. Therefore, it follows from (4) that an estimator of the KL divergence is given by
$$\hat{D}_{KL}(F\|G) = \frac{1}{n}\sum_{i=1}^{n} \log \hat{L}(X_i^f) - \log\Big\{\frac{1}{d}\sum_{j=1}^{d} \hat{L}(X_j^g)\Big\}.$$
Since mutual information is a special case of the KL divergence, we obtain the estimator $\hat{I}(X; Z)$ of $I(X; Z)$ and, analogously, $\hat{I}(X; Y, Z)$ of $I(X; Y, Z)$.
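The following is a compact sketch of this classifier-based divergence estimator. A small scikit-learn MLP stands in for the deep network used by Mukherjee, Asnani, and Kannan (2020), the ratio is evaluated on the training points for brevity (a held-out split is preferable), and the names `classifier_kl` and `classifier_mi` are ours.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def classifier_kl(samples_f, samples_g, seed=0):
    """Estimate D_KL(F || G) via the classifier two-sample principle.

    A probabilistic classifier separates samples from f (label 1) and g (label 0);
    its odds ratio alpha / (1 - alpha) estimates L(x) = f(x) / g(x), which is
    plugged into the Donsker-Varadhan form (4).
    """
    X = np.vstack([samples_f, samples_g])
    y = np.r_[np.ones(len(samples_f)), np.zeros(len(samples_g))]
    clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                        random_state=seed).fit(X, y)
    eps = 1e-6
    alpha = np.clip(clf.predict_proba(X)[:, 1], eps, 1 - eps)
    lr = alpha / (1 - alpha)                       # estimated likelihood ratio
    lr_f, lr_g = lr[:len(samples_f)], lr[len(samples_f):]
    return np.mean(np.log(lr_f)) - np.log(np.mean(lr_g))

def classifier_mi(x, z, seed=0):
    """Estimate I(X; Z) = D_KL(p_{X,Z} || p_X p_Z); permuting X mimics samples
    from the product of the marginals."""
    rng = np.random.default_rng(seed)
    joint = np.column_stack([x, z])
    product = np.column_stack([rng.permutation(x), z])
    return classifier_kl(joint, product, seed=seed)

# CMI estimate as in (2): I(X; Y | Z) = I(X; Y, Z) - I(X; Z)
# cmi_hat = classifier_mi(x, np.column_stack([y, z])) - classifier_mi(x, z)
```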
Following the idea of the CRT, the p-value of our CIT method can be given by
$$p = \frac{1 + \sum_{m=1}^{M} \mathbb{I}\{\hat{I}(\tilde{X}^{(m)}; Y|Z) \ge \hat{I}(X; Y|Z)\}}{1 + M}. \tag{5}$$
In Lemma 3, we show that the excess type I error of the test based on (5) is bounded by the total variation distance between $p_{X|Z}(\cdot|Z)$ and $\hat{p}_{X|Z}(\cdot|Z)$. By Theorem 2, we further obtain $P(p \le \alpha|H_0) \le \alpha + o(1)$. Therefore, the excess type I error of our CIT method is guaranteed to tend to zero as $n \to \infty$.

Two binary classifiers based on deep neural networks need to be trained to obtain $\hat{I}(\tilde{X}^{(m)}; Y|Z)$ for each $m$. Together with $\hat{I}(X; Y|Z)$, a total of $2(M + 1)$ binary classification neural networks must be trained to compute the p-value in (5). When $M$ is large, this calculation is extremely intensive and time consuming, especially when the dimensionality of $Z$ is high. In (5), instead of using $\hat{I}(\tilde{X}^{(m)}; Y|Z)$, we therefore propose to use $\hat{I}(\tilde{X}^{(m)}; Y)$, computed by the method of Mesner and Shalizi (2020), for the following reasons. First, compared with $\hat{I}(\tilde{X}^{(m)}; Y|Z)$, $\hat{I}(\tilde{X}^{(m)}; Y)$ is computationally very fast. Second, since $\tilde{X}^{(m)}$ is generated from 1-NN conditional on $Z$, we have $I(\tilde{X}^{(m)}; Y|Z) = 0$, whereas $\tilde{X}^{(m)}$ and $Y$ may share information via $Z$; that is, $I(\tilde{X}^{(m)}; Y) \ge I(\tilde{X}^{(m)}; Y|Z) = 0$. By the consistency of $\hat{I}(\tilde{X}^{(m)}; Y|Z)$ and $\hat{I}(\tilde{X}^{(m)}; Y)$, we conclude that replacing $\hat{I}(\tilde{X}^{(m)}; Y|Z)$ with $\hat{I}(\tilde{X}^{(m)}; Y)$ can improve the control of the type I error of our CIT method. Thus, we propose the following simple counterpart of (5) for p-value calculation:
$$p' = \frac{1 + \sum_{m=1}^{M} \mathbb{I}\{\hat{I}(\tilde{X}^{(m)}; Y) \ge \hat{I}(X; Y|Z)\}}{1 + M}. \tag{6}$$
Since the $\tilde{X}_i^{(m)}$'s are generated by the 1-NN sampling strategy, we call our test NNSCIT. Equation (6) lays the foundation of our CIT method, whose pseudo code is summarized in Algorithm 2.

Algorithm 2: Nearest-Neighbor sampling based conditional independence test (NNSCIT)
Input: Data set $U$ of $n$ i.i.d. samples from $p_{X,Y,Z}$.
Parameters: The number of repetitions $M$; the neighbor order $k$ in MI estimation; the significance level $\alpha$.
Output: Accept $H_0 : X \perp Y \mid Z$ or $H_1 : X \not\perp Y \mid Z$.
1: Randomly divide $U$ into two disjoint parts: $U_1 := \{X_{\text{train}}, Y_{\text{train}}, Z_{\text{train}}\}$ with sample size $n - \lfloor n/3 \rfloor$ and $U_2 := \{X_{\text{test}}, Y_{\text{test}}, Z_{\text{test}}\}$ with sample size $\lfloor n/3 \rfloor$.
2: $m = 1$.
3: while $m \le M$ do
4:   Randomly take $\lfloor n/3 \rfloor$ samples from $U_1$ to obtain $V_1$.
5:   Produce $U_0^m := \{\tilde{X}^{(m)}\}$ using 1-NN($V_1$, $U_2$, $\lfloor n/3 \rfloor$) in Algorithm 1.
6:   Compute $I^{(m)} := \hat{I}(\{\tilde{X}^{(m)}\}; \{Y_{\text{test}}\})$ according to Eq. (7).
7:   $m = m + 1$.
8: end while
9: Compute $I := \hat{I}(\{X_{\text{test}}\}; \{Y_{\text{test}}\}|\{Z_{\text{test}}\})$ according to Eq. (2).
10: Compute the p-value: $p' := \frac{1 + \sum_{m=1}^{M} \mathbb{I}\{I^{(m)} \ge I\}}{1 + M}$.
11: if $p' > \alpha$ then
12:   Accept $H_0 : X \perp Y \mid Z$.
13: else
14:   Accept $H_1 : X \not\perp Y \mid Z$.
15: end if

We now describe how to obtain $\hat{I}(\tilde{X}^{(m)}; Y)$. Specifically, suppose we are given i.i.d. samples $\{(\tilde{X}_i, Y_i)\}_{i=1}^n$ with $(\tilde{X}_i, Y_i) \sim p_{\tilde{X},Y}$. Let $\rho_{k,i}/2$ be the $l_\infty$-distance from the point $(\tilde{X}_i, Y_i)$ to its $k$th nearest neighbor. Define $n_{\tilde{X},i} = |\{\tilde{X}_j : |\tilde{X}_i - \tilde{X}_j| \le \rho_{k,i}/2,\ j \ne i\}|$, where $|A|$ is the number of elements in the set $A$; $n_{Y,i}$ is defined similarly. For each $i$, we define $\delta_i = \psi(k) - \psi(n_{\tilde{X},i}) - \psi(n_{Y,i}) + \psi(n)$, where $\psi(x) := d\log\Gamma(x)/dx$ is the digamma function. Therefore, we have
$$\hat{I}(\tilde{X}; Y) = \max\Big\{\frac{1}{n}\sum_{i=1}^{n} \delta_i,\ 0\Big\}. \tag{7}$$
It follows from Theorems 3.1 and 3.2 in Mesner and Shalizi (2020) that $\hat{I}(\tilde{X}; Y)$ is a consistent estimator of $I(\tilde{X}; Y)$.
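A minimal implementation of the estimator in (7), written directly from the construction above ($l_\infty$ distances, digamma correction, clipping at zero); it is our own compact version for continuous data, not the authors' code.

```python
import numpy as np
from scipy.special import digamma

def knn_mi(x, y, k=3):
    """k-NN mutual information estimate of I(X; Y) as in Eq. (7).

    For each point, rho_{k,i}/2 is the l_inf distance in the joint (X, Y) space
    to its k-th nearest neighbour; n_x and n_y count the marginal neighbours
    within that radius, and the averaged digamma terms are clipped at zero.
    """
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = len(x)
    xy = np.hstack([x, y])

    # pairwise l_inf distances in the joint space and in each marginal
    d_xy = np.max(np.abs(xy[:, None, :] - xy[None, :, :]), axis=-1)
    d_x = np.max(np.abs(x[:, None, :] - x[None, :, :]), axis=-1)
    d_y = np.max(np.abs(y[:, None, :] - y[None, :, :]), axis=-1)

    deltas = np.empty(n)
    for i in range(n):
        rho_half = np.sort(d_xy[i])[k]        # k-th NN distance (index 0 is the point itself)
        n_x = np.sum(d_x[i] <= rho_half) - 1  # exclude the point itself
        n_y = np.sum(d_y[i] <= rho_half) - 1
        deltas[i] = digamma(k) - digamma(n_x) - digamma(n_y) + digamma(n)
    return max(np.mean(deltas), 0.0)

# example: I(X; Y) for dependent X and Y
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x + 0.5 * rng.normal(size=1000)
print(knn_mi(x, y, k=3))
```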
Finally, we discuss why we cannot replace $\hat{I}(X; Y|Z)$ with $\hat{I}(X; Y)$ in (6). One might think of approximating the p-value by
$$p'' = \frac{1 + \sum_{m=1}^{M} \mathbb{I}\{\hat{I}(\tilde{X}^{(m)}; Y) \ge \hat{I}(X; Y)\}}{1 + M}, \tag{8}$$
which results in another CRT-type test. Let $\hat{c}_\alpha$ be the upper $\alpha$ quantile of the distribution of $\hat{I}(\tilde{X}^{(m)}; Y)$. Given significance level $\alpha$, the rejection regions of (6) and (8) are $\{\hat{I}(X; Y|Z) > \hat{c}_\alpha\}$ and $\{\hat{I}(X; Y) > \hat{c}_\alpha\}$, respectively. Under $H_1$, $I(X; Y|Z)$ deviates from zero. Intuitively, the test with rejection region $\{\hat{I}(X; Y|Z) > \hat{c}_\alpha\}$ is therefore more likely to accept $H_1$ than the test with $\{\hat{I}(X; Y) > \hat{c}_\alpha\}$. For example, consider $X \to Z \leftarrow Y$. This relation implies $X \not\perp Y \mid Z$ ($H_1$ holds), but $X$ and $Y$ may be marginally independent. Therefore, the rejection region $\{\hat{I}(X; Y|Z) > \hat{c}_\alpha\}$ can detect $H_1$, but $\{\hat{I}(X; Y) > \hat{c}_\alpha\}$ may fail to do so. That is, the test using (6) is generally more powerful than the test using (8) under $H_1$. As another special case, when $X \perp Z$ we obtain $I(X; Y|Z) \ge I(X; Y)$. By the consistency of $\hat{I}(X; Y|Z)$ and $\hat{I}(X; Y)$, replacing $\hat{I}(X; Y)$ in (8) with $\hat{I}(X; Y|Z)$ will increase the power under $H_1$; that is, (6) yields more power than (8) under $H_1$. The same conclusion holds when $Y \perp Z$.

### Theoretical Results

In this subsection, we present theoretical results for our NNSCIT based on (5) and (6). We first introduce some notation. Without loss of generality, let $U_2 := \{(X_1, Y_1, Z_1), \ldots, (X_{n_1}, Y_{n_1}, Z_{n_1})\}$ in Algorithm 2 with $n_1 = \lfloor n/3 \rfloor$, where $\lfloor x \rfloor$ is the largest integer not greater than $x$. We define $\mathbf{X} := (X_1, X_2, \ldots, X_{n_1})$, $\mathbf{Y} := (Y_1, Y_2, \ldots, Y_{n_1})$, and $\mathbf{Z} := (Z_1, Z_2, \ldots, Z_{n_1})$. Denote $P(\cdot|\mathbf{Z}) := p(\cdot|Z_1) \times \cdots \times p(\cdot|Z_{n_1})$ and $\hat{P}(\cdot|\mathbf{Z}) := \hat{p}(\cdot|Z_1) \times \cdots \times \hat{p}(\cdot|Z_{n_1})$. Assume that $\tilde{\mathbf{X}}^{(m)} := (\tilde{X}_1^{(m)}, \ldots, \tilde{X}_{n_1}^{(m)})$ is sampled according to $\hat{P}(\cdot|\mathbf{Z})$ for $m = 1, \ldots, M$. Let $T(\mathbf{X}, \mathbf{Y}, \mathbf{Z}) := \hat{I}(X; Y|Z)$ and $T(\tilde{\mathbf{X}}^{(1)}, \mathbf{Y}, \mathbf{Z}) := \hat{I}(\tilde{X}^{(1)}; Y|Z), \ldots, T(\tilde{\mathbf{X}}^{(M)}, \mathbf{Y}, \mathbf{Z}) := \hat{I}(\tilde{X}^{(M)}; Y|Z)$. Let $\tilde{\mathbf{X}}^F$ be an additional copy sampled from $\hat{P}(\cdot|\mathbf{Z})$, independently of $\mathbf{Y}$ and of $\mathbf{X}, \tilde{\mathbf{X}}^{(1)}, \ldots, \tilde{\mathbf{X}}^{(M)}$.

Under $H_0: X \perp Y \mid Z$, conditionally on $\mathbf{Y}$ and $\mathbf{Z}$, $\mathbf{X}$ and $(\tilde{\mathbf{X}}^{(1)}, \ldots, \tilde{\mathbf{X}}^{(M)})$ are independent, and $\tilde{\mathbf{X}}^F$ and $(\tilde{\mathbf{X}}^{(1)}, \ldots, \tilde{\mathbf{X}}^{(M)})$ are independent. Thus, we have
$$d_{TV}\{((\mathbf{X}, \tilde{\mathbf{X}}^{(1)}, \ldots, \tilde{\mathbf{X}}^{(M)})|\mathbf{Y}, \mathbf{Z}), ((\tilde{\mathbf{X}}^F, \tilde{\mathbf{X}}^{(1)}, \ldots, \tilde{\mathbf{X}}^{(M)})|\mathbf{Y}, \mathbf{Z})\} = d_{TV}\{(\mathbf{X}|\mathbf{Y}, \mathbf{Z}), (\tilde{\mathbf{X}}^F|\mathbf{Y}, \mathbf{Z})\} = d_{TV}\{P(\cdot|\mathbf{Z}), \hat{P}(\cdot|\mathbf{Z})\}.$$
Define a set $A_\alpha$ as
$$A_\alpha := \Big\{(x, \tilde{x}^{(1)}, \ldots, \tilde{x}^{(M)}) : \frac{1 + \sum_{m=1}^{M} \mathbb{I}\{T(\tilde{x}^{(m)}, \mathbf{Y}, \mathbf{Z}) \ge T(x, \mathbf{Y}, \mathbf{Z})\}}{1 + M} \le \alpha\Big\}.$$
Then, we have
$$\begin{aligned} P(p \le \alpha|\mathbf{Y}, \mathbf{Z}) &= P\{(\mathbf{X}, \tilde{\mathbf{X}}^{(1)}, \ldots, \tilde{\mathbf{X}}^{(M)}) \in A_\alpha|\mathbf{Y}, \mathbf{Z}\} \\ &\le d_{TV}\{((\mathbf{X}, \tilde{\mathbf{X}}^{(1)}, \ldots, \tilde{\mathbf{X}}^{(M)})|\mathbf{Y}, \mathbf{Z}), ((\tilde{\mathbf{X}}^F, \tilde{\mathbf{X}}^{(1)}, \ldots, \tilde{\mathbf{X}}^{(M)})|\mathbf{Y}, \mathbf{Z})\} + P\{(\tilde{\mathbf{X}}^F, \tilde{\mathbf{X}}^{(1)}, \ldots, \tilde{\mathbf{X}}^{(M)}) \in A_\alpha|\mathbf{Y}, \mathbf{Z}\} \\ &= d_{TV}\{P(\cdot|\mathbf{Z}), \hat{P}(\cdot|\mathbf{Z})\} + P\{(\tilde{\mathbf{X}}^F, \tilde{\mathbf{X}}^{(1)}, \ldots, \tilde{\mathbf{X}}^{(M)}) \in A_\alpha|\mathbf{Y}, \mathbf{Z}\}. \end{aligned}$$
Conditionally on $\mathbf{Y}$ and $\mathbf{Z}$, the vectors $\tilde{\mathbf{X}}^F, \tilde{\mathbf{X}}^{(1)}, \ldots, \tilde{\mathbf{X}}^{(M)}$ are identically distributed and thus exchangeable, so $P\{(\tilde{\mathbf{X}}^F, \tilde{\mathbf{X}}^{(1)}, \ldots, \tilde{\mathbf{X}}^{(M)}) \in A_\alpha|\mathbf{Y}, \mathbf{Z}\} \le \alpha$ holds and we obtain the following result.

Lemma 3. Assume that $H_0 : X \perp Y \mid Z$ is true. For any desired significance level $\alpha \in (0, 1)$, the type I error of the test based on (5) satisfies
$$P(p \le \alpha|\mathbf{Y}, \mathbf{Z}) \le \alpha + d_{TV}\{P(\cdot|\mathbf{Z}), \hat{P}(\cdot|\mathbf{Z})\}. \tag{9}$$
An immediate implication of Lemma 3 is that the type I error bound holds unconditionally:
$$P(p \le \alpha|H_0) \le \alpha + \mathbb{E}[d_{TV}\{P(\cdot|\mathbf{Z}), \hat{P}(\cdot|\mathbf{Z})\}].$$
Furthermore, for any given test statistic $T(\cdot)$, we can compute the p-value via (1) by replacing $\mathbf{X}^{(m)}$ with the 1-NN sample $\tilde{\mathbf{X}}^{(m)}$; the resulting test also satisfies (9) by similar arguments.

Under $H_0$, $I(\tilde{X}^{(m)}; Y) \ge I(\tilde{X}^{(m)}; Y|Z)$. Recall that $p$ and $p'$ denote the p-values in (5) and (6), respectively. With the consistency of $\hat{I}(\tilde{X}^{(m)}; Y|Z)$ and $\hat{I}(\tilde{X}^{(m)}; Y)$, we obtain the following main result.

Theorem 4. Assume that $H_0$ holds. Then
$$P(p' \le \alpha|H_0) - \alpha \le P(p \le \alpha|H_0) - \alpha \le \mathbb{E}[d_{TV}\{P(\cdot|\mathbf{Z}), \hat{P}(\cdot|\mathbf{Z})\}].$$
Theorem 4 has three important implications. First, the excess type I error over a desired level $\alpha \in (0, 1)$ of the test based on (6) is bounded by $\mathbb{E}\{d_{TV}(\hat{P}(\cdot|\mathbf{Z}), P(\cdot|\mathbf{Z}))\}$. Second, our proposed method outperforms the CRT based on (5) in controlling the type I error. Third, by Theorem 2, we get $P(p' \le \alpha|H_0) \le \alpha + o(1)$. Thus, the excess type I error of our NNSCIT is guaranteed to be small.

## Performance Evaluation

In this section, we examine the finite-sample performance of our NNSCIT on synthetic data sets. We compare NNSCIT with GCIT (Bellot and van der Schaar 2019), the classifier-based CI test (CCIT) (Sen et al. 2017), the kernel-based CI test (KCIT) (Zhang et al. 2011), RCIT (Strobl, Zhang, and Visweswaran 2019), the CMI-based CI test (CMIknn) (Runge 2018), and DGCIT (Shi et al. 2021). Additional simulation studies and the real data analysis are presented in the Supplementary Materials. The source code of NNSCIT is available at https://github.com/LeeShuaikenwitch/NNSCIT.

### Performances on Synthetic Dataset

The synthetic data sets are generated using a post-nonlinear model similar to those in Zhang et al. (2011), Doran et al. (2014), and Bellot and van der Schaar (2019). Specifically, we define $(X, Y, Z)$ under $H_0$ and $H_1$ as follows (a minimal simulation sketch is given after the scenario descriptions below):
$$H_0 : X = f(A_f^T Z + \epsilon_f), \quad Y = g(A_g^T Z + \epsilon_g); \qquad H_1 : Y = h(A_h^T Z + bX) + \epsilon_h.$$
The entries of $A_f$ and $A_g$ are randomly and uniformly sampled from $[0, 1]$ and then normalized to the unit $l_1$ norm. The entries of $A_h$ are sampled from a standard normal distribution, and $b$ is set to 2. The noise variables $\epsilon_f$, $\epsilon_g$, and $\epsilon_h$ are independently sampled from a normal distribution with mean zero and variance 0.49. The significance level is set at $\alpha = 0.05$ and the sample size is fixed at $n = 1000$. We set $M = 500$ and $k = 3$. We consider the following four scenarios:

Scenario I. Set $f$, $g$, and $h$ to be the identity functions, inducing linear dependencies, with $Z \sim N(0.7, 1)$ and $X \sim N(0, 1)$ under $H_1$.

Scenario II. Set $f$, $g$, and $h$ as in Scenario I, but use a Laplace distribution to generate $Z$.

Scenario III. Set $f$, $g$, and $h$ as in Scenario I, but use $\text{Uniform}[-2.5, 2.5]$ to generate $Z$.

Scenario IV. Set $f$, $g$, and $h$ to be randomly sampled from $\{x^2, x^3, \tanh(x), \cos(x)\}$. Set $Z \sim N(0, 1)$ and $X \sim N(0, 1)$ under $H_1$.

We vary the dimension of $Z$ as $d_Z = 30, 40, 50, 60, 70, 80, 90, 100, 110$, and $120$.
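For reference, here is a small sketch of this post-nonlinear generating mechanism (shown for Scenario I with identity $f$, $g$, $h$); the function name and its defaults are our own choices.

```python
import numpy as np

def generate_postnonlinear(n=1000, dz=50, hypothesis="H0",
                           f=None, g=None, h=None, b=2.0, seed=0):
    """Generate (X, Y, Z) from the post-nonlinear model used in the experiments.

    H0: X = f(A_f^T Z + eps_f),  Y = g(A_g^T Z + eps_g)        (X indep. of Y given Z)
    H1: X ~ N(0, 1),             Y = h(A_h^T Z + b X) + eps_h  (X not indep. of Y given Z)
    """
    rng = np.random.default_rng(seed)
    ident = lambda t: t
    f, g, h = f or ident, g or ident, h or ident

    z = rng.normal(loc=0.7, scale=1.0, size=(n, dz))   # Scenario I: Z ~ N(0.7, 1)
    noise = lambda: rng.normal(scale=0.7, size=n)      # noise variance 0.49

    def unit_l1(v):                                    # normalize to unit l1 norm
        return v / np.abs(v).sum()

    if hypothesis == "H0":
        a_f, a_g = unit_l1(rng.uniform(size=dz)), unit_l1(rng.uniform(size=dz))
        x = f(z @ a_f + noise())
        y = g(z @ a_g + noise())
    else:
        a_h = rng.normal(size=dz)                      # entries of A_h are standard normal
        x = rng.normal(size=n)                         # X ~ N(0, 1) under H1
        y = h(z @ a_h + b * x) + noise()
    return x, y, z

x, y, z = generate_postnonlinear(hypothesis="H1", dz=100)
```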
Figures 2 and 3 report the type I error rates under $H_0$ and the powers under $H_1$, respectively, over 300 data replications. Additional simulation results for $d_Z = 5, 10, 15, 20$, and $25$ can be found in the Supplementary Materials (Figures 1 and 2).

Figure 2: The empirical type I error rate of various tests under $H_0$, plotted against the dimension of $Z$ for Scenarios I-IV.

Figure 3: The empirical power of various tests under $H_1$, plotted against the dimension of $Z$ for Scenarios I-IV.

We have the following observations. First, our test controls the type I error very well under $H_0$, while achieving high power under $H_1$. Second, CMIknn performs satisfactorily in controlling the type I error, but under $H_1$ it loses power in almost all scenarios. Third, although DGCIT and KCIT have adequate power under $H_1$, they have inflated type I errors in some cases, especially when $d_Z$ is less than 30. Fourth, GCIT, CCIT, and RCIT cannot control type I errors in some cases, especially when $d_Z$ is less than 30; moreover, under $H_1$, GCIT, CCIT, and RCIT lose some power in almost all scenarios.

Figure 4 in the Supplementary Materials reports the run times as a function of $d_Z$ for a single CIT with data generated under Scenario II; the other scenarios show similar behavior. Our NNSCIT is computationally very efficient. In contrast, CCIT, CMIknn, and DGCIT are very time consuming and are prohibitive in practice.

### Performances on Two Examples

As discussed in the Introduction, we evaluate the performance of our method in the following two examples. The details of the data-generation mechanisms are presented in the Supplementary Materials.

Example 1. $X \to Z \to Y$. In this case, $H_0$ holds, but there is a strong dependence between $X$ and $Y$. Table 1 reports the type I error rates. Our NNSCIT controls the type I error very well, but GCIT, CCIT, RCIT, KCIT, and DGCIT break down, as their type I errors are very large.

| $d_Z$ | GCIT | CCIT | RCIT | KCIT | CMIknn | DGCIT | NNSCIT |
|---|---|---|---|---|---|---|---|
| 5 | 0.11 | 0.36 | 0.53 | 0.72 | 0.03 | 0.50 | 0.01 |
| 10 | 0.48 | 0.22 | 0.82 | 0.84 | 0.06 | 0.86 | 0.05 |
| 15 | 0.87 | 0.21 | 0.85 | 0.93 | 0.15 | 0.91 | 0.05 |
| 20 | 0.93 | 0.25 | 0.90 | 0.98 | 0.05 | 0.97 | 0.07 |

Table 1: The empirical type I error rate of various tests for Example 1.

Example 2. $X \to Z \leftarrow Y$. In this case, $H_1$ holds, but $X$ and $Y$ are independent. Table 2 reports the powers of the different methods. Our method achieves power as high as 1. In contrast, RCIT and KCIT have power less than 0.2, and GCIT and DGCIT also lose some power.

| $d_Z$ | GCIT | CCIT | RCIT | KCIT | CMIknn | DGCIT | NNSCIT |
|---|---|---|---|---|---|---|---|
| 5 | 0.37 | 1 | 0.03 | 0.02 | 1 | 0.78 | 1 |
| 10 | 0.53 | 1 | 0.04 | 0.10 | 1 | 0.82 | 1 |
| 15 | 0.55 | 1 | 0.05 | 0.14 | 1 | 0.79 | 1 |
| 20 | 0.63 | 1 | 0.07 | 0.16 | 1 | 0.88 | 1 |

Table 2: The empirical power of various tests for Example 2.

## Conclusion

In this paper, we propose a novel and fast NNSCIT. We use the 1-NN sampling strategy to approximate the conditional distribution of $X|Z$. Compared with GANs, 1-NN is not only computationally efficient but also more accurate in approximation. We take the classifier-based conditional mutual information (CCMI) estimator as our test statistic, which captures the conditional-dependence structure very well. We show that our NNSCIT has three notable features: it controls the type I error well, achieves high power under $H_1$, and is computationally efficient.

## Acknowledgments

Dr. Ziqi Chen's work was partially supported by the National Natural Science Foundation of China (NSFC) (12271167 and 11871477) and the Natural Science Foundation of Shanghai (21ZR1418800). Dr. Christina Dan Wang's work was partially supported by the National Natural Science Foundation of China (NSFC) (12271363 and 11901395).

## References

Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein generative adversarial networks. In International Conference on Machine Learning, 214–223. PMLR.

Belghazi, M. I.; Baratin, A.; Rajeshwar, S.; Ozair, S.; Bengio, Y.; Courville, A.; and Hjelm, D. 2018. Mutual information neural estimation. In International Conference on Machine Learning, 531–540. PMLR.

Bellot, A.; and van der Schaar, M. 2019. Conditional independence testing using generative adversarial networks. In Advances in Neural Information Processing Systems, 2199–2208.
Berrett, T. B.; Wang, Y.; Barber, R. F.; and Samworth, R. J. 2020. The conditional permutation test for independence while controlling for confounders. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(1): 175–197.

Cai, Z.; Li, R.; and Zhang, Y. 2022. A distribution free conditional independence test with applications to causal discovery. Journal of Machine Learning Research, 23(85): 1–41.

Candes, E.; Fan, Y.; Janson, L.; and Lv, J. 2018. Panning for gold: model-X knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3): 551–577.

Cover, T.; and Hart, P. 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1): 21–27.

Cover, T. M.; and Thomas, J. A. 2012. Elements of Information Theory. John Wiley & Sons.

Devroye, L.; Györfi, L.; Krzyzak, A.; and Lugosi, G. 1994. On the strong universal consistency of nearest neighbor regression function estimates. The Annals of Statistics, 22(3): 1371–1385.

Doran, G.; Muandet, K.; Zhang, K.; and Schölkopf, B. 2014. A permutation-based kernel conditional independence test. In Uncertainty in Artificial Intelligence, 132–141. Citeseer.

Fix, E.; and Hodges, J. L. 1951. Discriminatory Analysis, Nonparametric Discrimination: Consistency Properties. Technical Report 4, USAF School of Aviation Medicine.

Fleuret, F. 2004. Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research, 5(9): 1531–1555.

Gao, W.; Oh, S.; and Viswanath, P. 2016. Breaking the bandwidth barrier: Geometrical adaptive entropy estimation. In Advances in Neural Information Processing Systems, 2460–2468.

Gao, W.; Oh, S.; and Viswanath, P. 2017. Demystifying fixed k-nearest neighbor information estimators. In IEEE International Symposium on Information Theory, 1267–1271.

Genevay, A.; Peyré, G.; and Cuturi, M. 2018. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, 1608–1617. PMLR.

Hlinka, J.; Hartman, D.; Vejmelka, M.; Runge, J.; Marwan, N.; Kurths, J.; and Paluš, M. 2013. Reliability of inference of directed climate networks using conditional mutual information. Entropy, 15(6): 2023–2045.

Koller, D.; and Friedman, N. 2009. Probabilistic graphical models: principles and techniques. MIT Press.

Lauritzen, S. L. 1996. Graphical models, volume 17. Clarendon Press.

Liang, K.-C.; and Wang, X. 2008. Gene regulatory network reconstruction using conditional mutual information. EURASIP Journal on Bioinformatics and Systems Biology, 2008: 1–14.

Lopez-Paz, D.; and Oquab, M. 2016. Revisiting classifier two-sample tests. arXiv:1610.06545.

Mesner, O. C.; and Shalizi, C. R. 2020. Conditional mutual information estimation for mixed, discrete and continuous data. IEEE Transactions on Information Theory, 67(1): 464–484.

Mukherjee, S.; Asnani, H.; and Kannan, S. 2020. CCMI: Classifier based conditional mutual information estimation. In Uncertainty in Artificial Intelligence, 1083–1093. PMLR.

Pearl, J. 2009. Causality. Cambridge University Press.

Runge, J. 2018. Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information. In International Conference on Artificial Intelligence and Statistics, 938–947. PMLR.

Sen, R.; Suresh, A. T.; Shanmugam, K.; Dimakis, A. G.; and Shakkottai, S. 2017. Model-powered conditional independence test. In Advances in Neural Information Processing Systems, 2955–2965.
Shah, R. D.; and Peters, J. 2020. The hardness of conditional independence testing and the generalised covariance measure. The Annals of Statistics, 48(3): 1514–1538.

Shi, C.; Xu, T.; Bergsma, W.; and Li, L. 2021. Double generative adversarial networks for conditional independence testing. Journal of Machine Learning Research, 22(285): 1–32.

Silverman, B. W. 2018. Density estimation for statistics and data analysis. Routledge.

Spirtes, P.; Glymour, C. N.; Scheines, R.; and Heckerman, D. 2000. Causation, prediction, and search. MIT Press.

Strobl, E. V.; Zhang, K.; and Visweswaran, S. 2019. Approximate kernel-based conditional independence tests for fast non-parametric causal discovery. Journal of Causal Inference, 7(1): 1–24.

Zhang, H.; Zhou, S.; Zhang, K.; and Guan, J. 2017. Causal discovery using regression-based conditional independence tests. In Proceedings of the AAAI Conference on Artificial Intelligence, 1250–1256.

Zhang, H.; Zhou, S.; Zhang, K.; and Guan, J. 2022. Residual similarity based conditional independence test and its application in causal discovery. In Proceedings of the AAAI Conference on Artificial Intelligence, 5942–5949.

Zhang, K.; Peters, J.; Janzing, D.; and Schölkopf, B. 2011. Kernel-based conditional independence test and application in causal discovery. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, 804–813.