# Conditional Diffusion Models Based Conditional Independence Testing

Yanfeng Yang¹, Shuai Li¹, Yingjie Zhang¹, Zhuoran Sun¹, Hai Shu², Ziqi Chen¹, Renming Zhang³

¹School of Statistics, KLATASDS-MOE, East China Normal University, Shanghai, China
²Department of Biostatistics, School of Global Public Health, New York University, New York, USA
³Department of Computer Science, Boston University, Boston, USA

Conditional independence (CI) testing is a fundamental task in modern statistics and machine learning. The conditional randomization test (CRT) was recently introduced to test whether two random variables, $X$ and $Y$, are conditionally independent given a potentially high-dimensional set of random variables, $Z$. The CRT operates exceptionally well under the assumption that the conditional distribution of $X$ given $Z$ is known. However, since this distribution is typically unknown in practice, accurately approximating it becomes crucial. In this paper, we propose using conditional diffusion models (CDMs) to learn the distribution of $X\mid Z$. Theoretically and empirically, we show that CDMs closely approximate the true conditional distribution. Furthermore, CDMs offer a more accurate approximation of $X\mid Z$ than GANs, potentially leading to a CRT that outperforms GAN-based ones. To accommodate complex dependency structures, we utilize a computationally efficient classifier-based conditional mutual information (CMI) estimator as our test statistic. The proposed testing procedure performs effectively without requiring assumptions about specific distribution forms or feature dependencies, and is capable of handling mixed-type conditioning sets that include both continuous and discrete variables. Theoretical analysis shows that our proposed test achieves valid control of the type I error.
A series of experiments on synthetic data demonstrates that our new test effectively controls both type-I and type-II errors, even in high-dimensional scenarios. Code: https://github.com/Yanfeng-Yang-0316/CDCIT

## Introduction

Conditional independence (CI) is an important concept in statistics and machine learning. Testing conditional independence plays a central role in classical problems such as causal inference (Pearl 1988; Spirtes, Glymour, and Scheines 2000), graphical models (Lauritzen 1996; Koller and Friedman 2009), and variable selection (Dai, Shen, and Pan 2022). It is widely used in various scientific problems, including gene regulatory network inference (Dai et al. 2024) and personalized therapies (Khera and Kathiresan 2017; Zhu et al. 2018).

*These authors contributed equally. Corresponding author: Ziqi Chen (zqchen@fem.ecnu.edu.cn, chenzq453@163.com). Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.*

We consider testing whether two random variables, $X$ and $Y$, are independent given a random vector $Z$, based on observations from the joint density $p_{X,Y,Z}(x, y, z)$. Specifically, we test the following hypotheses:
$$H_0: X \perp\!\!\!\perp Y \mid Z \quad \text{versus} \quad H_1: X \not\perp\!\!\!\perp Y \mid Z,$$
where $\perp\!\!\!\perp$ denotes independence. In practical genome-wide association studies, $X$ represents a specific genetic variant, $Y$ denotes disease status, and $Z$ accounts for the rest of the genome. By conditioning on $Z$, CI testing can evaluate whether the genetic variant $X$ has an effect on the disease status $Y$ (Liu et al. 2022). CI testing becomes particularly challenging due to the high dimensionality of the conditioning vector $Z$ (Bellot and van der Schaar 2019; Shi et al. 2021). Moreover, the presence of mixed discrete and continuous variables in $Z$, as in many real-world applications, presents further challenges for testing (Mesner and Shalizi 2020; Zan et al. 2022).
Recently, there has been a large and growing literature on CI testing; for a comprehensive review, we refer readers to Li and Fan (2020). Metric-based tests (e.g., Su and White 2008, 2014; Wang et al. 2015) employ kernel smoothers to estimate the conditional characteristic function or the distribution function of $Y$ given $X$ and $Z$. However, due to the curse of dimensionality, these tests are usually unsuitable when the conditioning vector $Z$ is high-dimensional. Kernel-based tests, such as Fukumizu et al. (2007), Zhang et al. (2011), and Scetbon, Meunier, and Romano (2022), represent probability distributions as elements of a reproducing kernel Hilbert space (RKHS), which makes properties of these distributions accessible through Hilbert space operations. However, these tests, which rely on asymptotic distributions, may exhibit inflated type-I errors or inadequate power when dealing with high-dimensional $Z$ (Doran et al. 2014; Runge 2018; Shi et al. 2021). The most relevant work to ours is the conditional randomization test (CRT) proposed by Candès et al. (2018), which assumes that the true conditional distribution of $X$ given $Z$, denoted by $P(\cdot\mid Z)$, is known. It is theoretically proven that the CRT maintains validity by ensuring that the type I error does not exceed the significance level $\alpha$ (Berrett et al. 2020; Liu et al. 2022). However, the true conditional distribution $P(\cdot\mid Z)$ is rarely known in practice, so several methods have been developed to approximate it. Smoothing-based methods (Hall, Racine, and Li 2004; Hall and Yao 2005; Izbicki and Lee 2017) suffer from the curse of dimensionality, and their performance deteriorates sharply when the dimensionality of $Z$ becomes large.
Bellot and van der Schaar (2019) developed the Generative Conditional Independence Test (GCIT), which uses Wasserstein generative adversarial networks (WGANs; Arjovsky, Chintala, and Bottou 2017) to approximate $P(\cdot\mid Z)$. Shi et al. (2021) proposed using Sinkhorn GANs (Genevay, Peyré, and Cuturi 2018) to approximate $P(\cdot\mid Z)$. However, the training of GANs is often unstable, with a risk of collapse if hyperparameters and regularizers are not carefully chosen (Dhariwal and Nichol 2021). The potentially inaccurate learning of conditional distributions by GANs can lead to inflated type I errors in GAN-based CI tests (Li et al. 2023). Li et al. (2023) introduced a method using the 1-nearest-neighbor technique to generate samples from the approximated conditional distribution of $X$ given $Z$. However, their approach necessitates dividing the dataset into two segments; consequently, only one-third of the total samples are allocated to the testing dataset used for calculating test statistics, which reduces the test's statistical power. Li et al. (2024) proposed a k-nearest-neighbor local sampling strategy to generate samples from the approximated conditional distribution of $X$ given $Z$. Nevertheless, this approach suffers from insufficient sample diversity, resulting in unstable performance of the CI test, particularly when the conditioning variables $Z$ include both continuous and discrete variables. Diffusion models, which have recently emerged as a notable class of generative models, have attracted significant attention (Ho, Jain, and Abbeel 2020; Song et al. 2021; Yang et al. 2023). Unlike GANs, diffusion models provide a much more stable training process and generate more realistic samples (Dhariwal and Nichol 2021; Song, Meng, and Ermon 2021). They have achieved great success in various tasks, such as image generation (Ho, Jain, and Abbeel 2020) and video generation (Ho et al. 2022).
In this paper, we propose using conditional diffusion models to approximate $P(\cdot\mid Z)$ and to generate samples from the approximated conditional distribution. Moreover, as highlighted in Li et al. (2023), the choice of test statistic in the CRT procedure is crucial for achieving adequate statistical power as well as controlling the type I error. Conditional mutual information (CMI) for $(X, Y, Z)$, denoted $I(X; Y \mid Z)$, provides a strong theoretical guarantee for conditional dependence relations: specifically, $I(X; Y \mid Z) = 0 \iff X \perp\!\!\!\perp Y \mid Z$ (Cover and Thomas 2012). In this paper, we adopt the classifier-based CMI estimator as the test statistic (Mukherjee, Asnani, and Kannan 2020; Li et al. 2024).

Our main contributions are summarized as follows. First, we propose, for the first time, using conditional diffusion models to generate samples from the conditional distribution in the CI testing task. Compared to GANs, this method is not only more stable during training but also demonstrates significant advantages in sample quality and diversity. We show, both theoretically and empirically, that the distribution of the generated samples is very close to the true conditional distribution. Second, we use a computationally efficient classifier-based CMI estimator as the test statistic, which captures intricate dependence structures among variables. Third, theoretical analysis demonstrates that our proposed test achieves valid control of the type I error asymptotically. Fourth, our empirical evidence demonstrates that, compared to state-of-the-art methods, our test effectively controls the type I error while maintaining sufficient power under $H_1$. This remains true even when handling high-dimensional data and/or mixed-type conditioning sets that include both continuous and discrete variables.

## The Proposed Approach

### The Conditional Randomization Test (CRT)

Our work builds on the conditional randomization test (CRT) proposed by Candès et al.
(2018), which, however, assumes that the true conditional distribution $P(\cdot\mid Z)$ is known. Specifically, consider $n$ i.i.d. copies $\mathcal{D}_T = \{(X_i, Y_i, Z_i) : i = 1, \dots, n\}$ of $(X, Y, Z)$. If $P(\cdot\mid Z)$ is known, then conditionally on $\mathbf{Z} = (Z_1, \dots, Z_n)^\top$, one can independently draw $X_i^{(b)} \sim P(\cdot\mid Z_i)$ for each $i$ across $b = 1, \dots, B$, such that all $\mathbf{X}^{(b)} := (X_1^{(b)}, \dots, X_n^{(b)})^\top$ are independent of $\mathbf{X} := (X_1, \dots, X_n)^\top$ and $\mathbf{Y} := (Y_1, \dots, Y_n)^\top$, where $B$ is the number of repetitions. Thus, under the null hypothesis $H_0: X \perp\!\!\!\perp Y \mid Z$, we have $(\mathbf{X}^{(b)}, \mathbf{Y}, \mathbf{Z}) \stackrel{d}{=} (\mathbf{X}, \mathbf{Y}, \mathbf{Z})$ for all $b$, where $\stackrel{d}{=}$ denotes equality in distribution. A large difference between $(\mathbf{X}^{(b)}, \mathbf{Y}, \mathbf{Z})$ and $(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$ can therefore be regarded as strong evidence against $H_0$. Statistically, for any test statistic $T(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$, we can calculate the p-value of the CI test by
$$p = \frac{1 + \sum_{b=1}^{B} \mathbb{I}\{T(\mathbf{X}^{(b)}, \mathbf{Y}, \mathbf{Z}) \ge T(\mathbf{X}, \mathbf{Y}, \mathbf{Z})\}}{1 + B}, \qquad (1)$$
where $\mathbb{I}\{\cdot\}$ is the indicator function. Under the null hypothesis $H_0$, since the $(B+1)$ triples $(\mathbf{X}, \mathbf{Y}, \mathbf{Z}), (\mathbf{X}^{(1)}, \mathbf{Y}, \mathbf{Z}), \dots, (\mathbf{X}^{(B)}, \mathbf{Y}, \mathbf{Z})$ are exchangeable, the above p-value is valid; specifically, $P(p \le \alpha \mid H_0) \le \alpha$ holds for any $\alpha \in (0, 1)$ (Candès et al. 2018; Berrett et al. 2020).

### Methodology for Sampling

The traditional CRT procedure assumes that the true conditional distribution $P(\cdot\mid Z)$ is known. In practice, however, this distribution is seldom known. We propose using score-based conditional diffusion models to approximate the conditional distribution of $X$ given $Z$. Specifically, we have an unlabelled data set $\mathcal{D}_U$ consisting of $N$ i.i.d. samples $\{(X_i^U, Z_i^U) : i = 1, \dots, N\}$ from the distribution $P_{X,Z}$, and we aim to accurately recover the true distribution of $X$ conditioning on $Z$ using $\mathcal{D}_U$. The score-based conditional diffusion models involve two stochastic processes: the forward process and the reverse process. Specifically, let $t \in [0, T]$ be the time index of the forward and reverse processes, and denote $X := X(0) \in \mathbb{R}^{d_x}$.
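To make Equation (1) concrete, the resampling loop can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `crt_p_value`, the statistic `T`, and the sampler `sample_x_given_z` are placeholder names, and the toy run assumes the true conditional $X\mid Z \sim N(Z, 1)$ is known.

```python
import numpy as np

def crt_p_value(T, X, Y, Z, sample_x_given_z, B=100, rng=None):
    """CRT p-value from Equation (1): compare T on the observed data against
    T on B synthetic copies whose rows are drawn from P(.|Z_i)."""
    rng = np.random.default_rng(rng)
    t_obs = T(X, Y, Z)
    exceed = sum(T(sample_x_given_z(Z, rng), Y, Z) >= t_obs for _ in range(B))
    return (1 + exceed) / (1 + B)

# Toy run under H0 (X and Y depend on Z but not on each other), with the
# true conditional X|Z = N(Z, 1) assumed known:
rng = np.random.default_rng(1)
n = 200
Z = rng.standard_normal(n)
X = Z + rng.standard_normal(n)
Y = Z + rng.standard_normal(n)
T = lambda X, Y, Z: abs(np.corrcoef(X, Y)[0, 1])
p = crt_p_value(T, X, Y, Z, lambda Z, r: Z + r.standard_normal(len(Z)), B=100)
```

Under $H_0$ the $(B+1)$ statistics are exchangeable, so `p` is (super-)uniform; with a constant statistic the formula returns exactly 1, its most conservative value.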
Conditioned on $Z$, the forward process is given by
$$\mathrm{d}X(t) = -\tfrac{1}{2}X(t)\,\mathrm{d}t + \mathrm{d}B(t),$$
where $X(0) \sim P(\cdot\mid Z)$ and $B(t)$ is a standard Brownian motion in $\mathbb{R}^{d_x}$. At any time $t$, let $p_t(\cdot\mid Z)$ and $P_t(\cdot\mid Z)$ be the conditional density and distribution of $X(t)\mid Z$, respectively. By the properties of the Ornstein–Uhlenbeck (OU) process,
$$X(t) \stackrel{d}{=} \exp(-t/2)\,X(0) + \sqrt{1 - \exp(-t)}\,\epsilon, \qquad (2)$$
where $\epsilon \sim N(0, I_{d_x})$ and $I_{d_x}$ is the $d_x$-dimensional identity matrix. It follows that as $T$ approaches infinity, $X(T)$ converges in distribution to $N(0, I_{d_x})$. In practice, however, the forward process is stopped at a sufficiently large $T$ to ensure computational feasibility. According to Song et al. (2021), the true reverse process is
$$\mathrm{d}\bar{X}(t) = \Big[\tfrac{1}{2}\bar{X}(t) + \nabla \log p_{T-t}\big(\bar{X}(t)\mid Z\big)\Big]\mathrm{d}t + \mathrm{d}\bar{B}(t), \quad \bar{X}(0) \sim P_T(\cdot\mid Z),$$
where $\bar{B}(t)$ is a standard Brownian motion in the reverse process. The distributions of $\bar{X}(t)$ and $X(T-t)$ are identical; thus, the conditional density and distribution of $\bar{X}(t)\mid Z$ are $p_{T-t}(\cdot\mid Z)$ and $P_{T-t}(\cdot\mid Z)$, respectively, and $\nabla \log p_{T-t}(\cdot\mid z)$ is the score function of $\bar{X}(t)$ conditioned on $Z = z$. The analytical form of the conditional score function is unattainable, so we utilize a ReLU neural network $\hat{s}(x, z, t) \in \mathcal{F}$ to approximate it, where $\mathcal{F}$ is the function class of ReLU neural networks. Following Ho, Jain, and Abbeel (2020) and Song et al. (2021), we train $\hat{s}(x, z, t)$ by minimizing the score-matching objective
$$\mathbb{E}_t\,\mathbb{E}_{X(0),Z}\,\mathbb{E}_{X(t)}\big\|\hat{s}(X(t), Z, t) - \nabla \log \varphi(X(t)\mid X(0))\big\|_2^2,$$
where $X(t) \sim N\big(\exp(-t/2)X(0),\, [1 - \exp(-t)]I_{d_x}\big)$,
$$\nabla \log \varphi(X(t)\mid X(0)) = -\frac{X(t) - \exp(-t/2)X(0)}{1 - \exp(-t)},$$
$(X(0), Z) \sim P_{X,Z}$, $t \sim \mathrm{Uniform}[t_{\min}, T]$, and $t_{\min}$ is an early-stopping time close to zero ensuring that the denominator in $\nabla \log \varphi$ is not zero. As described in (2), when $T$ is sufficiently large, $P_T(\cdot\mid Z)$ can be approximated by $N(0, I_{d_x})$. Therefore, given $Z$, we propose the following reverse process with the approximated score function to generate pseudo samples:
$$\mathrm{d}\tilde{X}(t) = \Big[\tfrac{1}{2}\tilde{X}(t) + \hat{s}\big(\tilde{X}(t), Z, T-t\big)\Big]\mathrm{d}t + \mathrm{d}\tilde{B}(t), \quad \tilde{X}(0) \sim N(0, I_{d_x}). \qquad (3)$$
Let $\tilde{X}(T - t_{\min})$ be the pseudo-sample generated by (3). As demonstrated in Theorem 1, this pseudo-sample has a conditional distribution $\widehat{P}(\cdot\mid Z)$ that effectively approximates the true conditional distribution $P(\cdot\mid Z)$ (Fu et al. 2024). Algorithm 1 outlines the training procedure for the conditional score-matching model, and the sampling procedure for the reverse process is detailed in Algorithm 2.

**Algorithm 1: Training the conditional score-matching model**
Input: A data set with $N$ i.i.d. samples $\{X_i^U, Z_i^U\}_{i=1}^N$.
Output: The score network $\hat{s}$.
1: Let $\{X_i(0)\}_{i=1}^N = \{X_i^U\}_{i=1}^N$
2: Initialize a deep neural network $\hat{s}(x, z, t)$
3: while not converged do
4: &nbsp;&nbsp;Draw $t \sim \mathrm{Uniform}[t_{\min}, T]$
5: &nbsp;&nbsp;Draw $\epsilon_1, \dots, \epsilon_N \sim N(0, I_{d_x})$
6: &nbsp;&nbsp;Let $X_i(t) = \exp(-t/2)X_i(0) + \sqrt{1 - \exp(-t)}\,\epsilon_i$
7: &nbsp;&nbsp;Compute $L_{\text{score}} = \sum_{i=1}^N \Big\|\hat{s}(X_i(t), Z_i^U, t) + \frac{X_i(t) - \exp(-t/2)X_i(0)}{1 - \exp(-t)}\Big\|_2^2$
8: &nbsp;&nbsp;Take an optimization step on $L_{\text{score}}$ and update the parameters of $\hat{s}$
9: end while
10: Return $\hat{s}$

**Algorithm 2: Sampling from score-based conditional diffusion models**
Input: A sample $Z \sim P_Z$, the score network $\hat{s}$, and the number of sampling steps $K$.
Output: Pseudo-sample $\hat{X}$.
1: Evenly divide $[0, T - t_{\min}]$ into $t_0 = 0 < t_1 < \dots < t_K = T - t_{\min}$ and let $\Delta t = (T - t_{\min})/K$
2: Draw $\tilde{X}(0) \sim N(0, I_{d_x})$
3: for $k = 0$ to $K - 1$ do
4: &nbsp;&nbsp;Draw $\epsilon_k \sim N(0, I_{d_x})$
5: &nbsp;&nbsp;Let $\tilde{X}(k+1) = \tilde{X}(k) + \big[\tfrac{1}{2}\tilde{X}(k) + \hat{s}(\tilde{X}(k), Z, T - t_k)\big]\Delta t + \sqrt{\Delta t}\,\epsilon_k$
6: end for
7: Let $\hat{X} = \tilde{X}(K)$
8: Return $\hat{X}$

### Theoretical Guarantee for Sampling Quality

We present a theoretical result showing that the distribution of the generated samples closely resembles the true conditional distribution. Specifically, denote the true conditional density of $X$ given $Z$ as $p(\cdot\mid Z)$, and the density of $\hat{X}$ sampled from Algorithm 2 given $Z$ as $\hat{p}(\cdot\mid Z)$. The total variation distance between two density functions $p_1$ and $p_2$ is defined as $d_{\mathrm{TV}}(p_1, p_2) = \frac{1}{2}\int |p_1(x) - p_2(x)|\,\mathrm{d}x$. Define
$$\Gamma_1(k, \alpha) = \frac{k + \alpha}{d_x + d_z + 2k + 2\alpha}, \qquad \Gamma_2(k, \alpha) = \max\Big\{\frac{19}{2},\, k + \alpha + 2\Big\},$$
where $k$ and $\alpha$ are defined in the Supplementary Materials.
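As a concrete illustration of Algorithm 2, the following sketch runs the Euler–Maruyama reverse loop with a `score` callable standing in for the trained network $\hat{s}$. The sanity check uses the analytically known score $s(x, t) = -x$ of the stationary $N(0, I)$ case (a hypothetical stand-in, not a learned model), for which the reverse process should again produce approximately standard Gaussian samples.

```python
import numpy as np

def reverse_sampler(score, dx=1, T=5.0, t_min=1e-3, K=200, n=1, rng=None):
    """Euler-Maruyama discretization of the reverse-time SDE (3):
    X(k+1) = X(k) + [X(k)/2 + s(X(k), T - t_k)] * dt + sqrt(dt) * eps_k,
    started from X(0) ~ N(0, I_dx) as in Algorithm 2."""
    rng = np.random.default_rng(rng)
    dt = (T - t_min) / K
    x = rng.standard_normal((n, dx))        # X(0) ~ N(0, I)
    t = 0.0
    for _ in range(K):
        drift = x / 2.0 + score(x, T - t)   # reverse-SDE drift with the score
        x = x + drift * dt + np.sqrt(dt) * rng.standard_normal((n, dx))
        t += dt
    return x

# Sanity check with an analytically known score in place of a trained network:
# if X(0) ~ N(0, I), the OU forward process is stationary, the true score is
# s(x, t) = -x, and the reverse process should again yield ~N(0, I) samples.
samples = reverse_sampler(lambda x, t: -x, dx=1, K=200, n=20000, rng=0)
```

In the actual procedure, `score` would be the network trained via Algorithm 1, conditioned on the given $Z$.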
The proof of Theorem 1 is deferred to the Supplementary Materials.

**Theorem 1.** Under Assumptions 1 and 2 in the Supplementary Materials, taking the early-stopping time $t_{\min} = N^{-4\Gamma_1(k,\alpha)-1}$ and the terminal time $T = 2\Gamma_1(k, \alpha)\log N$, as $N \to \infty$ we have
$$\mathbb{E}_{\{X_i^U, Z_i^U\}_{i=1}^N}\,\mathbb{E}_Z\big[d_{\mathrm{TV}}\big(p(\cdot\mid Z), \hat{p}(\cdot\mid Z)\big)\big] = O\big(N^{-\Gamma_1(k,\alpha)} (\log N)^{\Gamma_2(k,\alpha)}\big).$$

### Empirical Evidence for Sampling Quality

In this subsection, we investigate the empirical performance of our proposed method for approximating the conditional distribution when generating samples in the CRT framework. Specifically, we compare it with several well-known conditional distribution estimation methods used in CI testing: WGANs (Bellot and van der Schaar 2019), Sinkhorn GANs (Shi et al. 2021), and k-nearest neighbors (k-NN) (Li et al. 2024). We consider the following three models.

Model M1: $X = Z_1^2 + \exp(Z_2 + Z_3/3) + Z_4 Z_5 + 0.5(1 + Z_2^2 + Z_5^2)\,\epsilon$, where $Z_1, \dots, Z_5, \epsilon \overset{\text{i.i.d.}}{\sim} N(0, 1)$.

Model M2: $X = (5 + Z_1^2/3 + Z_2^2 + Z_3^2 + Z_4^2 + Z_5^2)\exp(r)$, where $r \sim B\,N(-2, 1) + (1 - B)\,N(2, 1)$, $B \sim \mathrm{Bernoulli}(1, 0.5)$, and $Z_1, \dots, Z_5 \overset{\text{i.i.d.}}{\sim} N(0, 1)$.

Model M3: $X = \sum_{i=1}^{13} Z_i/13 + 0.33\,\epsilon$, where $Z_1, \dots, Z_{10}, \epsilon \overset{\text{i.i.d.}}{\sim} N(0, 1)$ and $Z_{11}, \dots, Z_{20} \overset{\text{i.i.d.}}{\sim} 2\,\mathrm{Bernoulli}(1, 0.5) - 1$.

Models M1 and M2 exhibit complex, nonlinear, and nonmonotonic relationships between $X$ and $Z$. Model M3 involves a mixed-type $Z$ consisting of both continuous and discrete variables, where the true relationship between $X$ and $Z = (Z_1, \dots, Z_{20})$ uses only $(Z_1, \dots, Z_{13})$. For each model, we use 500 samples to learn the conditional distribution with each method. We then estimate the conditional density functions from the 500 samples generated by each method using kernel smoothing (Weglarczyk 2018). Figure 1 and Figure 3 (Supplementary Materials) display the estimated conditional density functions for a randomly generated value of $Z$.
The results demonstrate that our conditional diffusion estimation method yields better conditional density estimators than WGANs, Sinkhorn GANs, and k-NN. Furthermore, the samples generated by k-NN can be observed to lack diversity. To further evaluate these methods, we compute the mean squared errors between the quantiles of the generated samples and those of the true conditional distribution. Specifically, a value $z$ is randomly sampled. Given $Z = z$, we generate 500 samples from each conditional distribution estimation method. We then calculate the squared error between the $\tau$-quantile of the generated samples and the $\tau$-quantile of the true distribution for $\tau \in \{0.05, 0.25, 0.50, 0.75, 0.95\}$. This procedure is repeated 100 times, and the mean squared errors for each $\tau$ are reported in Table 1. Our proposed sampling method performs best on Models M1 and M3. It is also quite competitive on Model M2, though it is not always the top performer.

Figure 1: Comparison of conditional density estimators on Model M1. $Z = ( 0.810, 0.049, 1.566, 0.030, 1.495)$.

### Test Statistic

Conditional mutual information (CMI), $I(X; Y \mid Z)$, is defined as
$$I(X; Y \mid Z) = \iiint p_{X,Y,Z}(x, y, z)\,\log \frac{p_{X,Y,Z}(x, y, z)}{p_{X,Z}(x, z)\,p_{Y\mid Z}(y\mid z)}\,\mathrm{d}x\,\mathrm{d}y\,\mathrm{d}z,$$
where $p_{X,Y,Z}(x, y, z)$ is the joint density of $(X, Y, Z)$, $p_{X,Z}(x, z)$ is the joint density of $(X, Z)$, and $p_{Y\mid Z}(y\mid z)$ is the conditional density of $Y$ given $Z = z$. CMI, as a measure of conditional dependence with $I(X; Y \mid Z) = 0 \iff X \perp\!\!\!\perp Y \mid Z$, has been used in conditional independence testing (Runge 2018; Li et al. 2024). Its major advantages include not requiring the data to adhere to particular distributional assumptions or the features to have specific dependency relationships, making CMI applicable to a wide range of real-world datasets (Mukherjee, Asnani, and Kannan 2020). In Equation (1), our test statistic $T(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$ is set as the classifier-based CMI estimator (CCMI; Mukherjee, Asnani, and Kannan 2020).
Specifically, $I(X; Y \mid Z)$ can be expressed in terms of the Kullback–Leibler (KL) divergence:
$$I(X; Y \mid Z) = d_{\mathrm{KL}}\big(p_{X,Y,Z}(x, y, z)\,\|\,p_{X,Z}(x, z)\,p_{Y\mid Z}(y\mid z)\big), \qquad (4)$$
where $d_{\mathrm{KL}}(f\|g)$ denotes the KL divergence between two distributions $F$ and $G$ with density functions $f(x)$ and $g(x)$, respectively. We further utilize the Donsker–Varadhan (DV) representation of $d_{\mathrm{KL}}(f\|g)$:
$$d_{\mathrm{KL}}(f\|g) = \sup_{s \in \mathcal{S}} \big[\mathbb{E}_{w \sim f}\, s(w) - \log \mathbb{E}_{w \sim g} \exp\{s(w)\}\big], \qquad (5)$$
where the function class $\mathcal{S}$ includes all functions with finite expectations. In fact, the optimal function in (5) is given by $s^*(x) = \log\{f(x)/g(x)\}$ (Belghazi et al. 2018), which leads to
$$d_{\mathrm{KL}}(f\|g) = \mathbb{E}_{w \sim f} \log\{f(w)/g(w)\} - \log\big[\mathbb{E}_{w \sim g}\{f(w)/g(w)\}\big]. \qquad (6)$$

| Model | Quantile | Ours | WGANs | Sinkhorn GANs | k-NN |
|---|---|---|---|---|---|
| M1 | 0.05 | **1.731**(**3.740**) | 7.478(6.372) | 5.474(5.910) | 2.062(5.962) |
| M1 | 0.25 | **0.508**(**0.998**) | 3.676(4.585) | 2.384(2.468) | 1.201(2.193) |
| M1 | 0.50 | **0.405**(**0.421**) | 2.568(3.408) | 2.595(2.905) | 0.876(1.773) |
| M1 | 0.75 | **0.951**(**0.936**) | 2.364(2.509) | 4.451(4.221) | 1.601(2.278) |
| M1 | 0.95 | **3.758**(4.589) | 3.871(**1.386**) | 9.716(9.708) | 4.617(6.932) |
| M2 | 0.05 | 2.684(4.288) | 4.465(**1.684**) | 7.733(6.258) | **1.014**(3.343) |
| M2 | 0.25 | **3.609**(**3.985**) | 6.929(5.917) | 11.670(11.963) | 8.640(24.409) |
| M2 | 0.50 | **9.072**(11.561) | 18.820(13.056) | 25.303(13.502) | 36.565(**6.240**) |
| M2 | 0.75 | **31.397**(**58.095**) | 51.404(64.778) | 63.834(152.047) | 68.384(106.082) |
| M2 | 0.95 | 132.283(**166.269**) | **127.967**(250.006) | 214.595(559.565) | 282.834(646.127) |
| M3 | 0.05 | **0.012**(**0.018**) | 0.190(0.105) | 0.207(0.116) | 0.051(0.075) |
| M3 | 0.25 | **0.011**(**0.018**) | 0.048(0.111) | 0.045(0.092) | 0.057(0.079) |
| M3 | 0.50 | **0.012**(**0.018**) | 0.042(0.088) | 0.040(0.105) | 0.044(0.072) |
| M3 | 0.75 | **0.014**(**0.019**) | 0.084(0.130) | 0.094(0.141) | 0.052(0.071) |
| M3 | 0.95 | **0.017**(**0.023**) | 0.263(0.234) | 0.296(0.197) | 0.083(0.126) |

Table 1: Mean squared errors (MSEs) and standard deviations (SDs) of the quantiles of samples generated by our conditional diffusion models, WGANs, Sinkhorn GANs, and k-NN. The smallest MSEs and SDs for each quantile are highlighted in bold.

Next, our primary goal is to empirically estimate (6) with $f = p_{X,Y,Z}(x, y, z)$ and $g = p_{X,Z}(x, z)\,p_{Y\mid Z}(y\mid z)$,
which requires samples from both $p_{X,Y,Z}(x, y, z)$ and $p_{X,Z}(x, z)\,p_{Y\mid Z}(y\mid z)$. Following the approach of Li et al. (2024), we use the 1-NN sampling algorithm to estimate the conditional distribution of $Y \mid Z$; for further details, see Algorithm 4 in the Supplementary Materials. Finally, we formalize the classifier-based CMI estimator; see Algorithm 5 in the Supplementary Materials. Specifically, consider a data set $\mathcal{V}$ consisting of $2n$ i.i.d. samples $\{W_i := (X_i, Y_i, Z_i)\}_{i=1}^{2n}$ with $(X_i, Y_i, Z_i) \sim p_{X,Y,Z}(x, y, z)$. We divide $\mathcal{V}$ into two parts, $\mathcal{V}_1$ and $\mathcal{V}_2$, each containing $n$ samples. Using Algorithm 4, we generate a new data set $\mathcal{V}'$ with $n$ samples from $\mathcal{V}_1$ and $\mathcal{V}_2$. We assign labels $l = 1$ to all samples in $\mathcal{V}_2$ (positive samples drawn from $p_{X,Y,Z}(x, y, z)$) and $l = 0$ to all samples in $\mathcal{V}'$ (negative samples drawn from $p_{X,Z}(x, z)\,p_{Y\mid Z}(y\mid z)$). For this supervised classification task, we train a binary classifier using models such as XGBoost (Sen et al. 2017; Chen and Guestrin 2016) or deep neural networks (Goodfellow, Bengio, and Courville 2016). The classifier outputs the predicted probability $\alpha_m = P(l = 1 \mid W_m)$ for a given sample $W_m$, which yields an estimator of the likelihood ratio at $W_m$ given by $\hat{L}(W_m) = \alpha_m/(1 - \alpha_m)$. Based on Equations (4) and (6), we derive an estimator of $I(X; Y \mid Z)$:
$$\hat{I}(X; Y \mid Z) := \hat{d}_{\mathrm{KL}}\big(p_{X,Y,Z}\,\|\,p_{X,Z}\,p_{Y\mid Z}\big) = \frac{1}{d}\sum_{i=1}^{d} \log \hat{L}(W_i^f) - \log\Big\{\frac{1}{d}\sum_{j=1}^{d} \hat{L}(W_j^g)\Big\},$$
where $d = \lfloor n/3 \rfloor$ with $\lfloor t \rfloor$ being the largest integer not greater than $t$, $W_i^f$ is a sample in $\mathcal{V}_f^{\text{test}}$, and $W_j^g$ is a sample in $\mathcal{V}_g^{\text{test}}$, with $\mathcal{V}_f^{\text{test}}$ and $\mathcal{V}_g^{\text{test}}$ defined in Algorithm 5.

### Computation of the p-Value

We calculate the p-value based on the score-based conditional diffusion models to make informed decisions regarding the hypothesis test. Specifically, by Algorithm 2, we independently draw pseudo-samples $\hat{X}_i^{(b)} \sim \hat{p}(\cdot\mid Z_i)$ for each $i$ across $b = 1, \dots, B$, where $\hat{p}(\cdot\mid Z_i)$ is the conditional density of $\hat{X}$ given $Z_i$ and $B$ is the number of repetitions.
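The likelihood-ratio step behind $\hat{I}(X; Y \mid Z)$ can be isolated in a short sketch. Here `dv_kl_from_classifier` is an illustrative helper implementing the plug-in from Equation (6), and the toy check swaps the trained classifier for the exact posterior of two known Gaussians (an assumption made for testability), whose KL divergence is 0.5.

```python
import numpy as np

def dv_kl_from_classifier(alpha_f, alpha_g):
    """Donsker-Varadhan plug-in from Equation (6): recover the likelihood
    ratio L(w) = alpha/(1 - alpha) from classifier probabilities alpha, then
    average log L on held-out f-samples and L on held-out g-samples."""
    lr_f = alpha_f / (1.0 - alpha_f)
    lr_g = alpha_g / (1.0 - alpha_g)
    return np.mean(np.log(lr_f)) - np.log(np.mean(lr_g))

# Toy check, replacing the trained classifier by the *exact* posterior:
# f = N(1, 1), g = N(0, 1), so KL(f || g) = 0.5 and, with balanced classes,
# P(l = 1 | w) = f(w) / (f(w) + g(w)) = sigmoid(w - 0.5).
rng = np.random.default_rng(0)
wf = rng.normal(1.0, 1.0, 50_000)   # held-out samples from f
wg = rng.normal(0.0, 1.0, 50_000)   # held-out samples from g
posterior = lambda w: 1.0 / (1.0 + np.exp(-(w - 0.5)))
est = dv_kl_from_classifier(posterior(wf), posterior(wg))
```

In the actual estimator, `alpha_f` and `alpha_g` come from a classifier (e.g., XGBoost) trained to separate $\mathcal{V}_2$ from $\mathcal{V}'$; the better the classifier approximates the posterior, the closer the estimate is to the true divergence.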
Conditional on $\mathbf{Z}$, all $\widehat{\mathbf{X}}^{(b)} := (\hat{X}_1^{(b)}, \dots, \hat{X}_n^{(b)})^\top$ are independent of $\mathbf{Y}$ and also of $\mathbf{X}$. We denote the CMI estimator of $I(X; Y \mid Z)$ based on $(\widehat{\mathbf{X}}^{(b)}, \mathbf{Y}, \mathbf{Z})$ as $\widehat{\mathrm{CMI}}^{(b)}$ and the estimator based on $(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$ as $\widehat{\mathrm{CMI}}$. According to Theorem 1 in Mukherjee, Asnani, and Kannan (2020), $\hat{I}(X; Y \mid Z)$ is a consistent estimator of $I(X; Y \mid Z)$. We calculate the p-value using Equation (1) by substituting $T(\mathbf{X}^{(b)}, \mathbf{Y}, \mathbf{Z})$ and $T(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$ with $\widehat{\mathrm{CMI}}^{(b)}$ and $\widehat{\mathrm{CMI}}$, respectively. The pseudo-code is summarized in Algorithm 3. In the next section, we prove that our test asymptotically achieves valid control of the type I error.

## Theoretical Results

In this section, we present our main theoretical results, with all proofs deferred to the Supplementary Materials. Denote $p^{(n)}(\cdot\mid \mathbf{Z}) := \prod_{i=1}^n p(\cdot\mid Z_i)$ and $\hat{p}^{(n)}(\cdot\mid \mathbf{Z}) := \prod_{i=1}^n \hat{p}(\cdot\mid Z_i)$. In Theorem 2, we bound the excess type I error, conditionally on $\mathbf{Y}$ and $\mathbf{Z}$, by the total variation distance between $\hat{p}^{(n)}(\cdot\mid \mathbf{Z})$ and $p^{(n)}(\cdot\mid \mathbf{Z})$.

**Theorem 2.** Assume $H_0: X \perp\!\!\!\perp Y \mid Z$ is true. For any significance level $\alpha \in (0, 1)$, the p-value obtained from Algorithm 3 satisfies
$$P(p \le \alpha \mid \mathbf{Y}, \mathbf{Z}) \le \alpha + d_{\mathrm{TV}}\big(p^{(n)}(\cdot\mid \mathbf{Z}),\, \hat{p}^{(n)}(\cdot\mid \mathbf{Z})\big).$$

An immediate implication of Theorem 2 is that the type I error rate can be unconditionally controlled as
$$P(p \le \alpha \mid H_0) \le \alpha + \mathbb{E}\big[d_{\mathrm{TV}}\big(p^{(n)}(\cdot\mid \mathbf{Z}),\, \hat{p}^{(n)}(\cdot\mid \mathbf{Z})\big)\big].$$
Applying Theorem 1 to this inequality yields Corollary 1, which shows that our CI testing procedure asymptotically controls the type I error at level $\alpha$.

**Algorithm 3: Conditional diffusion models based conditional independence testing (CDCIT)**
Input: Dataset $\mathcal{D}_T = (\mathbf{X}, \mathbf{Y}, \mathbf{Z})$ consisting of $n$ i.i.d. samples from $p_{X,Y,Z}(x, y, z)$ and unlabelled dataset $\mathcal{D}_U = (\mathbf{X}^U, \mathbf{Z}^U)$ consisting of $N$ i.i.d. samples from $p_{X,Z}(x, z)$.
Parameters: The number of repetitions $B$; the significance level $\alpha$.
Output: Accept $H_0: X \perp\!\!\!\perp Y \mid Z$ or $H_1: X \not\perp\!\!\!\perp Y \mid Z$.
1: Use Algorithm 5 to obtain $\widehat{\mathrm{CMI}}$ based on $\mathcal{D}_T$.
2: Use Algorithm 1 to obtain the score network $\hat{s}$ based on $\mathcal{D}_U$.
3: $b = 1$.
4: while $b \le B$ do
5: &nbsp;&nbsp;For each $i \in \{1, \dots, n\}$, given $Z_i$, produce $\hat{X}_i^{(b)}$ using Algorithm 2 with $Z_i$ and $\hat{s}$ as input. Let $\widehat{\mathbf{X}}^{(b)} := (\hat{X}_1^{(b)}, \dots, \hat{X}_n^{(b)})^\top$.
6: &nbsp;&nbsp;Use Algorithm 5 to obtain $\widehat{\mathrm{CMI}}^{(b)}$ based on $(\widehat{\mathbf{X}}^{(b)}, \mathbf{Y}, \mathbf{Z})$.
7: &nbsp;&nbsp;$b = b + 1$.
8: end while
9: Compute the p-value: $p := \big(1 + \sum_{b=1}^{B} \mathbb{I}\{\widehat{\mathrm{CMI}}^{(b)} \ge \widehat{\mathrm{CMI}}\}\big)/(1 + B)$.
10: if $p > \alpha$ then
11: &nbsp;&nbsp;Accept $H_0: X \perp\!\!\!\perp Y \mid Z$.
12: else
13: &nbsp;&nbsp;Accept $H_1: X \not\perp\!\!\!\perp Y \mid Z$.
14: end if

**Corollary 1.** Assume $n\,N^{-\Gamma_1(k,\alpha)}(\log N)^{\Gamma_2(k,\alpha)} = o(1)$. Under the assumptions of Theorems 1 and 2, the p-value obtained from Algorithm 3 satisfies $P(p \le \alpha \mid H_0) \le \alpha + o(1)$.

**Remark 1.** Since $\log N = O(N^\delta)$ for any $\delta > 0$, the sample-size assumption in Corollary 1 implies that the sample size $N$ in $\mathcal{D}_U$ needs to satisfy $N \gg n^{1/(\Gamma_1(k,\alpha) - \delta)}$ for any sufficiently small $\delta > 0$ in order to asymptotically control the type I error.

## Synthetic Data Analysis

We evaluate our method, CDCIT, on synthetic datasets and compare it with seven state-of-the-art (SOTA) methods: GCIT (Bellot and van der Schaar 2019), NNSCIT (Li et al. 2023), the classifier-based CI test (CCIT) (Sen et al. 2017), the kernel-based CI test (KCIT) (Zhang et al. 2011), LPCIT (Scetbon, Meunier, and Romano 2022), DGCIT (Shi et al. 2021), and NNLSCIT (Li et al. 2024). We set the number of repetitions $B$ to 100 and the significance level $\alpha$ to 0.05. We report the type I error rate and the testing power under $H_1$ for all methods in each experiment. All results are averaged over 100 independent trials. Additional simulation studies, the detailed training parameter settings for CDCIT, and the real data analysis are presented in the Supplementary Materials.

**Scenario I: the post-nonlinear model.** The first synthetic dataset is generated using the post-nonlinear model, similar to those in Zhang et al. (2011), Bellot and van der Schaar (2019), Scetbon, Meunier, and Romano (2022), and Li et al. (2024).
Specifically, the triples $(X, Y, Z)$ under $H_0$ and $H_1$ are generated using the following models:
$$H_0: X = f_1(\bar{Z} + 0.25\,\epsilon_x), \quad Y = f_2(\bar{Z} + 0.25\,\epsilon_y), \qquad (8)$$
$$H_1: X = f_1(\bar{Z} + 0.25\,\epsilon_x) + 0.5\,\epsilon_b, \quad Y = f_2(\bar{Z} + 0.25\,\epsilon_y) + 0.5\,\epsilon_b,$$
where $\bar{Z}$ is the sample mean of $Z = (z_1, \dots, z_{d_z})$; all $z_l$ in $Z$, $\epsilon_x$, $\epsilon_y$, and $\epsilon_b$ are i.i.d. samples generated from the standard Gaussian distribution; the functions $f_1$ and $f_2$ are randomly sampled from the set $\{x, x^2, x^3, \tanh(x), \cos(x)\}$; and $d_z$ denotes the dimension of $Z$.

**Scenario II: the mixed continuous and discrete conditioning-set model.** The conditioning set $Z = (z_1, \dots, z_{d_z})$ is mixed-type, consisting of $\lfloor d_z/2 \rfloor$ continuous variables $(z_1, \dots, z_{\lfloor d_z/2 \rfloor})$ and $d_z - \lfloor d_z/2 \rfloor$ discrete variables $(z_{\lfloor d_z/2 \rfloor + 1}, \dots, z_{d_z})$. We use only $(z_1, z_2, \dots, z_{\lfloor 2d_z/3 \rfloor})$ to generate $X$ and $Y$ under both $H_0$ and $H_1$ in the true model. Specifically,
$$H_0: X = \frac{1}{\lfloor 2d_z/3 \rfloor}\sum_{i=1}^{\lfloor 2d_z/3 \rfloor} z_i + 0.33\,\epsilon_x, \quad Y = \frac{1}{\lfloor 2d_z/3 \rfloor}\sum_{i=1}^{\lfloor 2d_z/3 \rfloor} z_i + 0.33\,\epsilon_y, \qquad (9)$$
$$H_1: X = \frac{1}{\lfloor 2d_z/3 \rfloor}\sum_{i=1}^{\lfloor 2d_z/3 \rfloor} z_i + 0.33\,\epsilon_b, \quad Y = \frac{1}{\lfloor 2d_z/3 \rfloor}\sum_{i=1}^{\lfloor 2d_z/3 \rfloor} z_i + 0.33\,\epsilon_b,$$
where $z_1, \dots, z_{\lfloor d_z/2 \rfloor} \overset{\text{i.i.d.}}{\sim} N(0, 1)$, $z_{\lfloor d_z/2 \rfloor+1}, \dots, z_{d_z} \overset{\text{i.i.d.}}{\sim} 2\,\mathrm{Bernoulli}(1, 0.5) - 1$, and $\epsilon_x$, $\epsilon_y$, and $\epsilon_b$ all follow the standard Gaussian distribution.

For each experiment, 1000 samples are generated. We use $N = 500$ samples to train the conditional sampler and $n = 500$ samples to compute the test statistic in CDCIT. We vary $d_z$, the dimension of $Z$, from 10 to 100. The results are shown in Figure 2; more results regarding $N$ and $n$ are provided in Figures 4 and 5 in the Supplementary Materials. We make the following observations. First, in both the post-nonlinear and mixed models, our test controls the type I error very well and achieves high power under $H_1$ as $d_z$ increases. Second, NNSCIT performs satisfactorily in controlling the type I error but loses power under $H_1$, especially when $d_z$ exceeds 40 in the mixed model. Third, although CCIT, KCIT, and GCIT have adequate power under $H_1$, they exhibit inflated type I errors in almost all scenarios.
Fourth, DGCIT and NNLSCIT sometimes fail to control the type I error well, especially when $d_z \ge 20$. Fifth, LPCIT shows weak performance on both type I error and testing power.

Figure 2: Comparison of the type I error (lower is better) and power (higher is better) of our method with seven SOTA methods on the post-nonlinear model (8) and the mixed model (9) with varying dimension of $Z$. Under the mixed model, the power of our method, as well as those of DGCIT, CCIT, and NNLSCIT, stays consistently at 1 across different $d_z$.

Figure 7 in the Supplementary Materials reports the timing performance of all considered methods for a single test. CDCIT is found to be highly computationally efficient even when dealing with large sample sizes and high-dimensional conditioning sets.

## Conclusion

We introduce a novel CI testing procedure that uses conditional diffusion models to approximate the distribution of $X\mid Z$. We have theoretically and empirically shown that the distribution of the generated samples is very close to the true conditional distribution. We use a computationally efficient classifier-based CMI estimator as the test statistic, which captures intricate dependence structures among variables. We demonstrate that our proposed test achieves valid control of the type I and type II errors. Furthermore, our test remains highly computationally efficient, even when dealing with high-dimensional conditioning sets. Our method has the potential to broaden the applicability of causal discovery in real-world scenarios, such as gene regulatory networks, the identification of disease-associated genes, and intricate social networks, thereby aiding in the identification of relationships and patterns within complex systems.

## Acknowledgments

Dr. Ziqi Chen's work was partially supported by the National Natural Science Foundation of China (NSFC) (12271167 and 72331005) and the Basic Research Project of Shanghai Science and Technology Commission (22JC1400800).
We thank the anonymous reviewers for their helpful comments.

## References

Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein generative adversarial networks. In International Conference on Machine Learning, 214–223.
Belghazi, M. I.; Baratin, A.; Rajeshwar, S.; Ozair, S.; Bengio, Y.; Courville, A.; et al. 2018. Mutual information neural estimation. In International Conference on Machine Learning, 531–540.
Bellot, A.; and van der Schaar, M. 2019. Conditional independence testing using generative adversarial networks. In Advances in Neural Information Processing Systems, volume 32.
Berrett, T. B.; Wang, Y.; Barber, R. F.; and Samworth, R. J. 2020. The conditional permutation test for independence while controlling for confounders. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(1): 175–197.
Candès, E.; Fan, Y.; Janson, L.; and Lv, J. 2018. Panning for gold: model-X knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3): 551–577.
Chen, T.; and Guestrin, C. 2016. XGBoost: A scalable tree boosting system. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794.
Cover, T. M.; and Thomas, J. A. 2012. Elements of Information Theory. John Wiley & Sons.
Dai, B.; Shen, X.; and Pan, W. 2022. Significance tests of feature relevance for a black-box learner. IEEE Transactions on Neural Networks and Learning Systems.
Dai, H.; Ng, I.; Luo, G.; Spirtes, P.; Stojanov, P.; and Zhang, K. 2024. Gene regulatory network inference in the presence of dropouts: a causal view. arXiv:2403.15500.
Dhariwal, P.; and Nichol, A. 2021. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, volume 34.
Doran, G.; Muandet, K.; Zhang, K.; and Schölkopf, B. 2014. A permutation-based kernel conditional independence test. In Conference on Uncertainty in Artificial Intelligence, 132–141.
Fu, H.; Yang, Z.; Wang, M.; and Chen, M. 2024. Unveil conditional diffusion models with classifier-free guidance: a sharp statistical theory. arXiv:2403.11968.
Fukumizu, K.; Gretton, A.; Sun, X.; and Schölkopf, B. 2007. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems, volume 20.
Genevay, A.; Peyré, G.; and Cuturi, M. 2018. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, 1608–1617.
Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep learning. MIT Press.
Hall, P.; Racine, J.; and Li, Q. 2004. Cross-validation and the estimation of conditional probability densities. Journal of the American Statistical Association, 99(468): 1015–1026.
Hall, P.; and Yao, Q. 2005. Approximating conditional distribution functions using dimension reduction. The Annals of Statistics, 33(3): 1404–1421.
Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33.
Ho, J.; Salimans, T.; Gritsenko, A.; et al. 2022. Video diffusion models. In Advances in Neural Information Processing Systems, volume 35.
Izbicki, R.; and Lee, A. B. 2017. Converting high-dimensional regression to high-dimensional conditional density estimation. Electronic Journal of Statistics, 11: 2800–2831.
Khera, A. V.; and Kathiresan, S. 2017. Genetics of coronary artery disease: discovery, biology and clinical translation. Nature Reviews Genetics, 18(6): 331–344.
Koller, D.; and Friedman, N. 2009. Probabilistic graphical models: principles and techniques. MIT Press.
Lauritzen, S. L. 1996. Graphical models, volume 17. Clarendon Press.
Li, C.; and Fan, X. 2020. On nonparametric conditional independence tests for continuous variables. Wiley Interdisciplinary Reviews: Computational Statistics, 12(3): e1489.
Li, S.; Chen, Z.; Zhu, H.; Wang, C.; and Wen, W. 2023.
Nearest-neighbor sampling based conditional independence testing. In AAAI Conference on Artificial Intelligence, volume 37, 8631–8639.
Li, S.; Zhang, Y.; Zhu, H.; Wang, C.; Shu, H.; Chen, Z.; et al. 2024. K-nearest-neighbor local sampling based conditional independence testing. In Advances in Neural Information Processing Systems, volume 36.
Liu, M.; Katsevich, E.; Janson, L.; and Ramdas, A. 2022. Fast and powerful conditional randomization testing via distillation. Biometrika, 109(2): 277–293.
Mesner, O. C.; and Shalizi, C. R. 2020. Conditional mutual information estimation for mixed, discrete and continuous data. IEEE Transactions on Information Theory, 67(1): 464–484.
Mukherjee, S.; Asnani, H.; and Kannan, S. 2020. CCMI: Classifier based conditional mutual information estimation. In Conference on Uncertainty in Artificial Intelligence, 1083–1093.
Pearl, J. 1988. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann.
Runge, J. 2018. Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information. In International Conference on Artificial Intelligence and Statistics, 938–947.
Scetbon, M.; Meunier, L.; and Romano, Y. 2022. An asymptotic test for conditional independence using analytic kernel embeddings. In International Conference on Machine Learning, 19328–19346.
Sen, R.; Suresh, A. T.; Shanmugam, K.; Dimakis, A. G.; and Shakkottai, S. 2017. Model-powered conditional independence test. In Advances in Neural Information Processing Systems, volume 30.
Shi, C.; Xu, T.; Bergsma, W.; and Li, L. 2021. Double generative adversarial networks for conditional independence testing. Journal of Machine Learning Research, 22(285): 1–32.
Song, J.; Meng, C.; and Ermon, S. 2021. Denoising diffusion implicit models. In International Conference on Learning Representations.
Song, Y.; Sohl-Dickstein, J.; Kingma, D. P.; Kumar, A.; Ermon, S.; and Poole, B. 2021.
Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations.
Spirtes, P.; Glymour, C. N.; and Scheines, R. 2000. Causation, prediction, and search. MIT Press.
Su, L.; and White, H. 2008. A nonparametric Hellinger metric test for conditional independence. Econometric Theory, 24(4): 829–864.
Su, L.; and White, H. 2014. Testing conditional independence via empirical likelihood. Journal of Econometrics, 182(1): 27–44.
Wang, X.; Pan, W.; Hu, W.; Tian, Y.; and Zhang, H. 2015. Conditional distance correlation. Journal of the American Statistical Association, 110(512): 1726–1734.
Węglarczyk, S. 2018. Kernel density estimation and its application. ITM Web of Conferences, 23: 00037.
Yang, L.; Zhang, Z.; Song, Y.; et al. 2023. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4): 1–39.
Zan, L.; Meynaoui, A.; Assaad, C. K.; et al. 2022. A conditional mutual information estimator for mixed data and an associated conditional independence test. Entropy, 24(9): 1234.
Zhang, K.; Peters, J.; Janzing, D.; and Schölkopf, B. 2011. Kernel-based conditional independence test and application in causal discovery. In Conference on Uncertainty in Artificial Intelligence, 804–813.
Zhu, Z.; Zheng, Z.; Zhang, F.; et al. 2018. Causal associations between risk factors and common diseases inferred from GWAS summary data. Nature Communications, 9(1): 1–12.