# modelpowered_conditional_independence_test__aadcd919.pdf

Model-Powered Conditional Independence Test

Rajat Sen1,*, Ananda Theertha Suresh2,*, Karthikeyan Shanmugam3,*, Alexandros G. Dimakis1, and

Sanjay Shakkottai1

1The University of Texas at Austin

2Google, New York 3IBM Research, Thomas J. Watson Center

We consider the problem of non-parametric Conditional Independence testing (CI testing) for continuous random variables. Given i.i.d samples from the joint distribution f(x, y, z) of continuous random vectors X, Y and Z, we determine whether X ?? Y |Z. We approach this by converting the conditional independence test into a classiﬁcation problem. This allows us to harness very powerful classiﬁers like gradient-boosted trees and deep neural networks. These models can handle complex probability distributions and allow us to perform signiﬁcantly better compared to the prior state of the art, for high-dimensional CI testing. The main technical challenge in the classiﬁcation problem is the need for samples from the conditional product distribution f CI(x, y, z) = f(x|z)f(y|z)f(z) the joint distribution if and only if X ?? Y |Z. when given access only to i.i.d. samples from the true joint distribution f(x, y, z). To tackle this problem we propose a novel nearest neighbor bootstrap procedure and theoretically show that our generated samples are indeed close to f CI in terms of total variational distance. We then develop theoretical results regarding the generalization bounds for classiﬁcation for our problem, which translate into error bounds for CI testing. We provide a novel analysis of Rademacher type classiﬁcation bounds in the presence of non-i.i.d nearindependent samples. We empirically validate the performance of our algorithm on simulated and real datasets and show performance gains over previous methods.

1 Introduction

Testing datasets for Conditional Independence (CI) have signiﬁcant applications in several statistical/learning problems; among others, examples include discovering/testing for edges in Bayesian networks [15, 27, 7, 9], causal inference [23, 14, 29, 5] and feature selection through Markov Blankets [16, 31]. Given a triplet of random variables/vectors (X, Y, Z), we say that X is conditionally independent of Y given Z (denoted by X ?? Y |Z), if the joint distribution f X,Y,Z(x, y, z) factorizes as f X,Y,Z(x, y, z) = f X|Z(x|z)f Y |Z(y|z)f Z(z). The problem of Conditional Independence Testing (CI Testing) can be deﬁned as follows: Given n i.i.d samples from f X,Y,Z(x, y, z), distinguish between the two hypothesis H0 : X ?? Y |Z and H1 : X 6?? Y |Z.

In this paper we propose a data-driven Model-Powered CI test. The central idea in a model-driven approach is to convert a statistical testing or estimation problem into a pipeline that utilizes the power of supervised learning models like classiﬁers and regressors; such pipelines can then leverage recent advances in classiﬁcation/regression in high-dimensional settings. In this paper, we take such a model-powered approach (illustrated in Fig. 1), which reduces the problem of CI testing to Binary Classiﬁcation. Speciﬁcally, the key steps of our procedure are as follows:

* Equal Contribution

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

... Original Samples

Training Set

ˆg (Trained Classiﬁer)

L(ˆg, De) (Test Error)

3n Original Samples

Original Samples

Nearest Neighbor Bootstrap

n samples close to f CI

x3n y3n z3n

Figure 1: Illustration of our methodology. A part of the original samples are kept aside in U1. The rest of the samples are used in our nearest neighbor boot-strap to generate a data-set U0

2 which is close to f CI in distribution. The samples are labeled as shown and a classiﬁer is trained on a training set. The test error is measured on a test set there-after. If the test-error is close to 0.5, then H0 is not rejected, however if the test error is low then H0 is rejected.

(i) Suppose we are provided 3n i.i.d samples from f X,Y,Z(x, y, z). We keep aside n of these original

samples in a set U1 (refer to Fig. 1). The remaining 2n of the original samples are processed through our ﬁrst module, the nearest-neighbor bootstrap (Algorithm 1 in our paper), which produces n simulated samples stored in U0

2. In Section 3, we show that these generated samples in U0

2 are in fact close in total variational distance (deﬁned in Section 3) to the conditionally independent distribution f CI(x, y, z) , f X|Z(x|z)f Y |Z(y|z)f Z(z). (Note that only under H0 does the equality f CI(.) = f X,Y,Z(.) hold; our method generates samples close to f CI(x, y, z) under both hypotheses).

(ii) Subsequently, the original samples kept aside in U1 are labeled 1 while the new samples simulated

from the nearest-neighbor bootstrap (in U0

2) are labeled 0. The labeled samples (U1 with label 1 and U0

2 labeled 0) are aggregated into a data-set D. This set D is then broken into training and test sets Dr and De each containing n samples each.

(iii) Given the labeled training data-set (from step (ii)), we train powerful classiﬁers such as gradient

boosted trees [6] or deep neural networks [17] which attempt to learn the classes of the samples. If the trained classiﬁer has good accuracy over the test set, then intuitively it means that the joint distribution f X,Y,Z(.) is distinguishable from f CI (note that the generated samples labeled 0 are close in distribution to f CI). Therefore, we reject H0. On the other hand, if the classiﬁer has accuracy close to random guessing, then f X,Y,Z(.) is in fact close to f CI, and we fail to reject H0.

For independence testing (i.e whether X ?? Y ), classiﬁers were recently used in [19]. Their key observation was that given i.i.d samples (X, Y ) from f X,Y (x, y), if the Y coordinates are randomly permuted then the resulting samples exactly emulate the distribution f X(x)f Y (y). Thus the problem can be converted to a two sample test between a subset of the original samples and the other subset which is permuted - Binary classiﬁers were then harnessed for this two-sample testing; for details see [19]. However, in the case of CI testing we need to emulate samples from f CI. This is harder because the permutation of the samples needs to be Z dependent (which can be high-dimensional). One of our key technical contributions is in proving that our nearest-neighbor bootstrap in step (i) achieves this task.

The advantage of this modular approach is that we can harness the power of classiﬁers (in step (iii) above), which have good accuracies in high-dimensions. Thus, any improvements in the ﬁeld of binary classiﬁcation imply an advancement in our CI test. Moreover, there is added ﬂexibility in choosing the best classiﬁer based on domain knowledge about the data-generation process. Finally, our bootstrap is also efﬁcient owing to fast algorithms for identifying nearest-neighbors [24].

1.1 Main Contributions

(i) (Classiﬁcation based CI testing) We reduce the problem of CI testing to Binary Classiﬁcation as

detailed in steps (i)-(iii) above and in Fig. 1. We simulate samples that are close to f CI through a novel nearest-neighbor bootstrap (Algorithm 1) given access to i.i.d samples from the joint distribution.

The problem of CI testing then reduces to a two-sample test between the original samples in U1 and U0

2, which can be effectively done by binary classiﬁers.

(ii) (Guarantees on Bootstrapped Samples) As mentioned in steps (i)-(iii), if the samples gener-

ated by the bootstrap (in U0

2) are close to f CI, then the CI testing problem reduces to testing whether the data-sets U1 and U0

2 are distinguishable from each other. We theoretically justify that this is indeed true. Let φX,Y,Z(x, y, z) denote the distribution of a sample produced by Algorithm 1, when it is supplied with 2n i.i.d samples from f X,Y,Z(.). In Theorem 1, we prove that d T V (φ, f CI) = O(1/n1/dz) under appropriate smoothness assumptions. Here dz is the dimension of Z and d T V denotes total variational distance (Def. 1).

(iii) (Generalization Bounds for Classiﬁcation under near-independence) The samples generated

from the nearest-neighbor bootstrap do not remain i.i.d but they are close to i.i.d. We quantify this property and go on to show generalization risk bounds for the classiﬁer. Let us denote the class of function encoded by the classiﬁer as G. Let ˆR denote the probability of error of the optimal classiﬁer ˆg 2 G trained on the training set (Fig. 1). We prove that under appropriate assumptions, we have

r0 O(1/n1/dz) ˆR r0 + O(1/n1/dz) + O

with high probability, upto log factors. Here r0 = 0.5(1 d T V (f, f CI)), V is the VC dimension [30] of the class G. Thus when f is equivalent to f CI (H0 holds) then the error rate of the classiﬁer is close to 0.5. But when H1 holds the loss is much lower. We provide a novel analysis of Rademacher complexity bounds [4] under near-independence which is of independent interest.

(iv) (Empirical Evaluation) We perform extensive numerical experiments where our algorithm

outperforms the state of the art [32, 28]. We also apply our algorithm for analyzing CI relations in the protein signaling network data from the ﬂow cytometry data-set [26]. In practice we observe that the performance with respect to dimension of Z scales much better than expected from our worst case theoretical analysis. This is because powerful binary classiﬁers perform well in high-dimensions.

1.2 Related Work

In this paper we address the problem of non-parametric CI testing when the underlying random variables are continuous. The literature on non-parametric CI testing is vast. We will review some of the recent work in this ﬁeld that is most relevant to our paper.

Most of the recent work in CI testing are kernel based [28, 32, 10]. Many of these works build on the study in [11], where non-parametric CI relations are characterized using covariance operators for Reproducing Kernel Hilbert Spaces (RKHS) [11]. KCIT [32] uses the partial association of regression functions relating X, Y , and Z. RCIT [28] is an approximate version of KCIT that attempts to improve running times when the number of samples are large. KCIPT [10] is perhaps most relevant to our work. In [10], a speciﬁc permutation of the samples is used to simulate data from f CI. An expensive linear program needs to be solved in order to calculate the permutation. On the other hand, we use a simple nearest-neighbor bootstrap and further we provide theoretical guarantees about the closeness of the samples to f CI in terms of total variational distance. Finally the two-sample test in [10] is based on a kernel method [3], while we use binary classiﬁers for the same purpose. There has also been recent work on entropy estimation [13] using nearest neighbor techniques (used for density estimation); this can subsequently be used for CI testing by estimating the conditional mutual information I(X; Y |Z).

Binary classiﬁcation has been recently used for two-sample testing, in particular for independence testing [19]. Our analysis of generalization guarantees of classiﬁcation are aimed at recovering guarantees similar to [4], but in a non-i.i.d setting. In this regard (non-i.i.d generalization guarantees), there has been recent work in proving Rademacher complexity bounds for β-mixing stationary processes [21]. This work also falls in the category of machine learning reductions, where the general philosophy is to reduce various machine learning settings like multi-class regression [2], ranking [1], reinforcement learning [18], structured prediction [8] to that of binary classiﬁcation.

2 Problem Setting and Algorithms

In this section we describe the algorithmic details of our CI testing procedure. We ﬁrst formally deﬁne our problem. Then we describe our bootstrap algorithm for generating the data-set that mimics samples from f CI. We give a detailed pseudo-code for our CI testing process which reduces the problem to that of binary classiﬁcation. Finally, we suggest further improvements to our algorithm.

Problem Setting: The problem setting is that of non-parametric Conditional Independence (CI) testing given i.i.d samples from the joint distributions of random variables/vectors [32, 10, 28]. We are given 3n i.i.d samples from a continuous joint distribution f X,Y,Z(x, y, z) where x 2 Rdx, y 2 Rdy and z 2 Rdz. The goal is to test whether X ?? Y |Z i.e whether f X,Y,Z(x, y, z) factorizes as, f X,Y,Z(x, y, z) = f X|Z(x|z)f Y |Z(y|z)f Z(z) , f CI(x, y, z)

This is essentially a hypothesis testing problem where: H0 : X ?? Y |Z and H1 : X 6?? Y |Z.

Note: For notational convenience, we will drop the subscripts when the context is evident. For instance we may use f(x|z) in place of f X|Z(x|z).

Nearest-Neighbor Bootstrap: Algorithm 1 is a procedure to generate a data-set U0 consisting of n samples given a data-set U of 2n i.i.d samples from the distribution f X,Y,Z(x, y, z). The data-set U is broken into two equally sized partitions U1 and U2. Then for each sample in U1, we ﬁnd the nearest neighbor in U2 in terms of the Z coordinates. The Y -coordinates of the sample from U1 are exchanged with the Y -coordinates of its nearest neighbor (in U2); the modiﬁed sample is added to U0.

Algorithm 1 Data Gen - Given data-set U = U1 [ U2 of 2n i.i.d samples from f(x, y, z) (|U1| = |U2| = n ), returns a new data-set U0 having n samples.

1: function DATAGEN(U1, U2, 2n) 2: U0 = ; 3: for u in U1 do 4: Let v = (x0, y0, z0) 2 U2 be the sample such that z0 is the 1-Nearest Neighbor (1-NN) of z (in 2 norm) in the whole data-set U2, where u = (x, y, z) 5: Let u0 = (x, y0, z) and U0 = U0 [ {u0}. 6: end for 7: end function

One of our main results is that the samples in U0, generated in Algorithm 1 mimic samples coming from the distribution f CI. Suppose u = (x, y, z) 2 U1 be a sample such that f Z(z) is not too small. In this case z0 (the 1-NN sample from U2) will not be far from z. Therefore given a ﬁxed z, under appropriate smoothness assumptions, y0 will be close to an independent sample coming from f Y |Z(y|z0) f Y |Z(y|z). On the other hand if f Z(z) is small, then z is a rare occurrence and will not contribute adversely.

CI Testing Algorithm: Now we introduce our CI testing algorithm, which uses Algorithm 1 along with binary classiﬁers. The psuedo-code is in Algorithm 2 (Classiﬁer CI Test -CCIT).

Algorithm 2 CCITv1 - Given data-set U of 3n i.i.d samples from f(x, y, z), returns if X ?? Y |Z.

1: function CCIT(U, 3n, , G) 2: Partition U into three disjoint partitions U1, U2 and U3 of size n each, randomly. 3: Let U0

2 = Data Gen(U2, U3, 2n) (Algorithm 1). Note that |U0

2| = n. 4: Create Labeled data-set D := {(u, = 1)}u2U1 [ {(u0, 0 = 0)}u02U0

2 5: Divide data-set D into train and test set Dr and De respectively. Note that |Dr| = |De| = n. 6: Let ˆg = argming2G ˆL(g, Dr) := 1 |Dr|

(u, )2Dr 1{g(u) 6= l}. This is Empirical Risk Minimization for training the classiﬁer (ﬁnding the best function in the class G). 7: If ˆL(ˆg, De) > 0.5 , then conclude X ?? Y |Z, otherwise, conclude X 6?? Y |Z. 8: end function

In Algorithm 2, the original samples in U1 and the nearest-neighbor bootstrapped samples in U0

2 should be almost indistinguishable if H0 holds. However, if H1 holds, then the classiﬁer trained in Line 6 should be able to easily distinguish between the samples corresponding to different labels. In Line 6, G denotes the space of functions over which risk minimization is performed in the classiﬁer.

We will show (in Theorem 1) that the variational distance between the distribution of one of the samples in U0

2 and f CI(x, y, z) is very small for large n. However, the samples in U0

2 are not exactly i.i.d but close to i.i.d. Therefore, in practice for ﬁnite n, there is a small bias b > 0 i.e.

ˆL(ˆg, De) 0.5 b, even when H0 holds. The threshold needs to be greater than b in order for Algorithm 2 to function. In the next section, we present an algorithm where this bias is corrected.

Algorithm with Bias Correction: We present an improved bias-corrected version of our algorithm as Algorithm 3. As mentioned in the previous section, in Algorithm 2, the optimal classiﬁer may be able to achieve a loss slightly less that 0.5 in the case of ﬁnite n, even when H0 is true. However, the classiﬁer is expected to distinguish between the two data-sets only based on the Y, Z coordinates, as the joint distribution of X and Z remains the same in the nearest-neighbor bootstrap. The key idea in Algorithm 3 is to train a classiﬁer only using the Y and Z coordinates, denoted by ˆg0. As before we also train another classier using all the coordinates, which is denoted by ˆg. The test loss of ˆg0 is expected to be roughly 0.5 b, where b is the bias mentioned in the previous section. Therefore, we can just subtract this bias. Thus, when H0 is true ˆL(ˆg0, D0

e) ˆL(ˆg, De) will be close to 0. However, when H1 holds, then ˆL(ˆg, De) will be much lower, as the classiﬁer ˆg has been trained leveraging the information encoded in all the coordinates.

Algorithm 3 CCITv2 - Given data-set U of 3n i.i.d samples, returns whether X ?? Y |Z.

1: function CCIT(U, 3n, , G) 2: Perform Steps 1-5 as in Algorithm 2. 3: Let D0

r = {((y, z), )}(u=(x,y,z), )2Dr. Similarly, let D0

e = {((y, z), )}(u=(x,y,z), )2De. These are the training and test sets without the X-coordinates. 4: Let ˆg = argming2G ˆL(g, Dr) := 1 |Dr|

(u, )2Dr 1{g(u) 6= l}. Compute test loss: ˆL(ˆg, De). 5: Let ˆg0 = argming2G ˆL(g, D0

r) := 1 |D0r|

(u, )2D0r 1{g(u) 6= l}. Compute test loss: ˆL(ˆg0, D0

e). 6: If ˆL(ˆg, De) < ˆL(ˆg0, D0

e) , then conclude X 6?? Y |Z, otherwise, conclude X ?? Y |Z. 7: end function

3 Theoretical Results

In this section, we provide our main theoretical results. We ﬁrst show that the distribution of any one of the samples generated in Algorithm 1 closely resemble that of a sample coming from f CI. This result holds for a broad class of distributions f X,Y,Z(x, y, z) which satisfy some smoothness assumptions. However, the samples generated by Algorithm 1 (U2 in the algorithm) are not exactly i.i.d but close to i.i.d. We quantify this and go on to show that empirical risk minimization over a class of classiﬁer functions generalizes well using these samples. Before, we formally state our results we provide some useful deﬁnitions.

Deﬁnition 1. The total variational distance between two continuous probability distributions f(.) and g(.) deﬁned over a domain X is, d T V (f, g) = supp2B|Ef[p(X)] Eg[p(X)]| where B is the set of all measurable functions from X ! [0, 1]. Here, Ef[.] denotes expectation under distribution f.

We ﬁrst prove that the distribution of any one of the samples generated in Algorithm 1 is close to f CI in terms of total variational distance. We make the following assumptions on the joint distribution of the original samples i.e. f X,Y,Z(x, y, z):

Smoothness assumption on f(y|z): We assume a smoothness condition on f(y|z), that is a generalization of boundedness of the max. eigenvalue of Fisher Information matrix of y w.r.t z.

Assumption 1. For z 2 Rdz, a such that ka zk2 1, the generalized curvature matrix Ia(z) is,

f(y|z0)f(y|z)dy

δ2 log f(y|z0)

We require that for all z 2 Rdz and all a such that ka zk2 1, λmax (Ia(z)) β. Analogous

assumptions have been made on the Hessian of the density in the context of entropy estimation [12].

Smoothness assumptions on f(z): We assume some smoothness properties of the probability density function f(z). The smoothness assumptions (in Assumption 2) is a subset of the assumptions made in [13] (Assumption 1, Page 5) for entropy estimation. Deﬁnition 2. For any δ > 0, we deﬁne G(δ) = P (f(Z) δ). This is the probability mass of the distribution of Z in the areas where the p.d.f is less than δ. Deﬁnition 3. (Hessian Matrix) Let Hf(z) denote the Hessian Matrix of the p.d.f f(z) with respect to z i.e Hf(z)ij = @2f(z)/@zi@zj, provided it is twice continuously differentiable at z. Assumption 2. The probability density function f(z) satisﬁes the following:

(1) f(z) is twice continuously differentiable and the Hessian matrix Hf satisﬁes k Hf(z)k2 cdz

almost everywhere, where cdz is only dependent on the dimension.

f(z)1 1/ddz c3, 8d 2 where c3 is a constant. Theorem 1. Let (X, Y 0, Z) denote a sample in U0

2 produced by Algorithm 1 by modifying the original sample (X, Y, Z) in U1, when supplied with 2n i.i.d samples from the original joint distribution f X,Y,Z(x, y, z). Let φX,Y,Z(x, y, z) be the distribution of (X, Y 0, Z). Under smoothness assumptions (1) and (2), for any < 1, n large enough, we have:

d T V (φ, f CI) b(n)

c3 21/dzΓ(1/dz)

(nγdz)1/dzdz

+ β G (2cdz 2)

2nγdzcdz dz+2

Here, γd is the volume of the unit radius 2 ball in Rd.

Theorem 1 characterizes the variational distance of the distribution of a sample generated in Algorithm 1 with that of the conditionally independent distribution f CI. We defer the proof of Theorem 1 to Appendix A. Now, our goal is to characterize the misclassiﬁcation error of the trained classiﬁer in Algorithm 2 under both H0 and H1. Consider the distribution of the samples in the data-set Dr used for classiﬁcation in Algorithm 2. Let q(x, y, z| = 1) be the marginal distribution of each sample with label 1. Similarly, let q(x, y, z| = 0) denote the marginal distribution of the label 0 samples. Note that under our construction,

q(x, y, z| = 1) = f X,Y,Z(x, y, z) =

f CI(x, y, z) if H0 holds 6= f CI(x, y, z) if H1 holds

q(x, y, z| = 0) = φX,Y,Z(x, y, z) (2)

where φX,Y,Z(x, y, z) is as deﬁned in Theorem 1.

Note that even though the marginal of each sample with label 0 is φX,Y,Z(x, y, z) (Equation (2)), they are not exactly i.i.d owing to the nearest neighbor bootstrap. We will go on to show that they are actually close to i.i.d and therefore classiﬁcation risk minimization generalizes similar to the i.i.d results for classiﬁcation [4]. First, we review standard deﬁnitions and results from classiﬁcation theory [4].

Ideal Classiﬁcation Setting: We consider an ideal classiﬁcation scenario for CI testing and in the process deﬁne standard quantities in learning theory. Recall that G is the set of classiﬁers under consideration. Let q be our ideal distribution for q given by q(x, y, z| = 1) = f X,Y,Z(x, y, z),

q(x, y, z| = 0) = f CI

X,Y,Z(x, y, z) and q( = 1) = q( = 0) = 0.5. In other words this is the ideal classiﬁcation scenario for testing CI. Let L(g(u), ) be our loss function for a classifying function g 2 G, for a sample u , (x, y, z) with true label . In our algorithms the loss function is the 0 1 loss, but our results hold for any bounded loss function s.t. |L(g(u), )| |L|. For a distribution q

and a classiﬁer g let R q(g) , Eu, q[L(g(u), )] be the expected risk of the function g. The risk optimal classiﬁer g

q under q is given by g

q , arg ming2G R q(g). Similarly for a set of samples S and a classiﬁer g, let RS(g) , 1 |S|

u, 2S L(g(u), ) be the empirical risk on the set of samples. We deﬁne g S as the classiﬁer that minimizes the empirical loss on the observed set of samples S that is, g S , arg ming2G RS(g).

If the samples in S are generated independently from q, then standard results from the learning theory states that with probability 1 δ,

R q(g S) R q(g

where V is the VC dimension [30] of the classiﬁcation model, C is an universal constant and n = |S|.

Guarantees under near-independent samples: Our goal is to prove a result like (3), for the classiﬁcation problem in Algorithm 2. However, in this case we do not have access to i.i.d samples because the samples in U0

2 do not remain independent. We will see that they are close to independent in some sense. This brings us to one of our main results in Theorem 2.

Theorem 2. Assume that the joint distribution f(x, y, z) satisﬁes the conditions in Theorem 1. Further assume that f(z) has a bounded Lipschitz constant. Consider the classiﬁer ˆg in Algorithm 2

trained on the set Dr. Let S = Dr. Then according to our deﬁnition g S = ˆg. For > 0 we have:

(i) Rq(g S) Rq(g

4dz log(n/δ) + on(1/ )

with probability at least 1 8δ. Here V is the V.C. dimension of the classiﬁcation function class, G is as deﬁned in Def. 2, C is an universal constant and |L| is the bound on the absolute value of the loss.

(ii) Suppose the loss is L(g(u), ) = 1g(u)6= (s.t |L| 1). Further suppose the class of classifying

functions is such that Rq(g

q) r0 + . Here, r0 , 0.5(1 d T V (q(x, y, z|1), q(x, y, z|0))) is the risk of the Bayes optimal classiﬁer when q( = 1) = q( = 0). This is the best loss that any classiﬁer can achieve for this classiﬁcation problem [4]. Under this setting, w.p at least 1 8δ we have:

1 d T V (f, f CI)

2 Rq(g S) 1

1 d T V (f, f CI)

where b(n) is as deﬁned in Theorem 1.

We prove Theorem 2 as Theorem 3 and Theorem 4 in the appendix. In part (i) of the theorem we prove that generalization bounds hold even when the samples are not exactly i.i.d. Intuitively, consider two sample inputs ui, uj 2 U1, such that corresponding Z coordinates zi and zj are far away. Then we expect the resulting samples u0

2) to be nearly-independent. By carefully capturing this notion of spatial near-independence, we prove generalization errors in Theorem 3. Part (ii) of the theorem essentially implies that the error of the trained classiﬁer will be close to 0.5 (l.h.s)

when f f CI (under H0). On the other hand under H1 if d T V (f, f CI) > 1 γ, the error will be less than 0.5(γ + b(n)) + γn which is small.

4 Empirical Results

In this section we provide empirical results comparing our proposed algorithm and other state of the art algorithms. The algorithms under comparison are: (i) CCIT - Algorithm 3 in our paper where we use XGBoost [6] as the classiﬁer. In our experiments, for each data-set we boot-strap the samples and run our algorithm B times. The results are averaged over B bootstrap runs1. (ii) KCIT - Kernel CI test from [32]. We use the Matlab code available online. (iii) RCIT - Randomized CI Test from [28]. We use the R package that is publicly available.

1The python package for our implementation can be found here (https://github.com/rajatsen91/CCIT).

4.1 Synthetic Experiments

We perform the synthetic experiments in the regime of post-nonlinear noise similar to [32]. In our experiments X and Y are dimension 1, and the dimension of Z scales (motivated by causal settings and also used in [32, 28]). X and Y are generated according to the relation G(F(Z) + ) where is a noise term and G is a non-linear function, when the H0 holds. In our experiments, the data is generated as follows: (i) when X ?? Y |Z, then each coordinate of Z is a Gaussian with unit mean and variance, X = cos(a T Z + 1) and Y = cos(b T Z + 2). Here, a, b 2 Rdz and kak = kbk = 1. a,b are ﬁxed while generating a single dataset. 1 and 2 are zero-mean Gaussian noise variables, which are independent of everything else. We set V ar( 1) = V ar( 2) = 0.25. (ii) when X 6?? Y |Z, then everything is identical to (i) except that Y = cos(b T Z + c X + 2) for a randomly chosen constant c 2 [0, 2].

In Fig. 2a, we plot the performance of the algorithms when the dimension of Z scales. For generating each point in the plot, 300 data-sets were generated with the appropriate dimensions. Half of them are according to H0 and the other half are from H1 Then each of the algorithms are run on these data-sets, and the ROC AUC (Area Under the Receiver Operating Characteristic curve) score is calculated from the true labels (CI or not CI) for each data-set and the predicted scores. We observe that the accuracy of CCIT is close to 1 for dimensions upto 70, while all the other algorithms do not scale as well. In these experiments the number of bootstraps per data-set for CCIT was set to B = 50. We set the threshold in Algorithm 3 to = 1/pn, which is an upper-bound on the expected variance of the test-statistic when H0 holds.

4.2 Flow-Cytometry Dataset

We use our CI testing algorithm to verify CI relations in the protein network data from the ﬂowcytometry dataset [26], which gives expression levels of 11 proteins under various experimental conditions. The ground truth causal graph is not known with absolute certainty in this data-set, however this dataset has been widely used in the causal structure learning literature. We take three popular learned causal structures that are recovered by causal discovery algorithms, and we verify CI relations assuming these graphs to be the ground truth. The three graph are: (i) consensus graph from [26] (Fig. 1(a) in [22]) (ii) reconstructed graph by Sachs et al. [26] (Fig. 1(b) in [22]) (iii) reconstructed graph in [22] (Fig. 1(c) in [22]).

For each graph we generate CI relations as follows: for each node X in the graph, identify the set Z consisting of its parents, children and parents of children in the causal graph. Conditioned on this set Z, X is independent of every other node Y in the graph (apart from the ones in Z). We use this to create all CI conditions of these types from each of the three graphs. In this process we generate over 60 CI relations for each of the graphs. In order to evaluate false positives of our algorithms, we also need relations such that X 6?? Y |Z. For, this we observe that if there is an edge between two nodes, they are never CI given any other conditioning set. For each graph we generate 50 such non-CI relations, where an edge X $ Y is selected at random and a conditioning set of size 3 is randomly selected from the remaining nodes. We construct 50 such negative examples for each graph. In Fig. 2, we display the performance of all three algorithms based on considering each of the three graphs as ground-truth. The algorithms are given access to observational data for verifying CI and non-CI relations. In Fig. 2b we display the ROC plot for all three algorithms for the data-set generated by considering graph (ii). In Table 2c we display the ROC AUC score for the algorithms for the three graphs. It can be seen that our algorithm outperforms the others in all three cases, even when the dimensionality of Z is fairly low (less than 10 in all cases). An interesting thing to note is that the edges (pkc-raf), (pkc-mek) and (pka-p38) are there in all the three graphs. However, all three CI testers CCIT, KCIT and RCIT are fairly conﬁdent that these edges should be absent. These edges may be discrepancies in the ground-truth graphs and therefore the ROC AUC of the algorithms are lower than expected.

0 5 20 50 70 100 150 Dimension of Z

CCIT RCIT KCIT

Algo. Graph (i) Graph (ii) Graph (iii)

CCIT 0.6848 0.7778 0.7156 RCIT 0.6448 0.7168 0.6928 KCIT 0.6528 0.7416 0.6610

Figure 2: In (a) we plot the performance of CCIT, KCIT and RCIT in the post-nonlinear noise synthetic data. In generating each point in the plots, 300 data-sets are generated where half of them are according to H0 while the rest are according to H1. The algorithms are run on each of them, and the ROC AUC score is plotted. In (a) the number of samples n = 1000, while the dimension of Z varies. In (b) we plot the ROC curve for all three algorithms based on the data from Graph (ii) for the ﬂow-cytometry dataset. The ROC AUC score for each of the algorithms are provided in (c), considering each of the three graphs as ground-truth.

5 Conclusion

In this paper we present a model-powered approach for CI tests by converting it into binary classiﬁcation, thus empowering CI testing with powerful supervised learning tools like gradient boosted trees. We provide an efﬁcient nearest-neighbor bootstrap which makes the reduction to classiﬁcation possible. We provide theoretical guarantees on the bootstrapped samples, and also risk generalization bounds for our classiﬁcation problem, under non-i.i.d near independent samples. In conclusion we believe that model-driven data dependent approaches can be extremely useful in general statistical testing and estimation problems as they enable us to use powerful supervised learning tools.

Acknowledgments

This work is partially supported by NSF grants CNS 1320175, NSF Sa TC 1704778, ARO grants W911NF-17-1-0359, W911NF-16-1-0377 and the US Do T supported D-STOP Tier 1 University Transportation Center.

[1] Maria-Florina Balcan, Nikhil Bansal, Alina Beygelzimer, Don Coppersmith, John Langford,

and Gregory Sorkin. Robust reductions from ranking to classiﬁcation. Learning Theory, pages 604 619, 2007.

[2] Alina Beygelzimer, John Langford, Yuri Lifshits, Gregory Sorkin, and Alex Strehl. Condi-

tional probability tree estimation analysis and algorithms. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artiﬁcial Intelligence, pages 51 58. AUAI Press, 2009.

[3] Karsten M Borgwardt, Arthur Gretton, Malte J Rasch, Hans-Peter Kriegel, Bernhard Schölkopf,

and Alex J Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49 e57, 2006.

[4] Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classiﬁcation: A survey

of some recent advances. ESAIM: probability and statistics, 9:323 375, 2005.

[5] Eliot Brenner and David Sontag. Sparsityboost: A new scoring function for learning bayesian

network structure. ar Xiv preprint ar Xiv:1309.6820, 2013.

[6] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of

the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785 794. ACM, 2016.

[7] Jie Cheng, David Bell, and Weiru Liu. Learning bayesian networks from data: An efﬁcient

approach based on information theory. On World Wide Web at http://www. cs. ualberta. ca/ jcheng/bnpc. htm, 1998.

[8] Hal Daumé, John Langford, and Daniel Marcu. Search-based structured prediction. Machine

learning, 75(3):297 325, 2009.

[9] Luis M De Campos and Juan F Huete. A new approach for learning belief networks using

independence criteria. International Journal of Approximate Reasoning, 24(1):11 37, 2000.

[10] Gary Doran, Krikamol Muandet, Kun Zhang, and Bernhard Schölkopf. A permutation-based

kernel conditional independence test. In UAI, pages 132 141, 2014.

[11] Kenji Fukumizu, Francis R Bach, and Michael I Jordan. Dimensionality reduction for supervised

learning with reproducing kernel hilbert spaces. Journal of Machine Learning Research, 5(Jan):73 99, 2004.

[12] Weihao Gao, Sewoong Oh, and Pramod Viswanath. Breaking the bandwidth barrier: Geometri-

cal adaptive entropy estimation. In Advances in Neural Information Processing Systems, pages 2460 2468, 2016.

[13] Weihao Gao, Sewoong Oh, and Pramod Viswanath. Demystifying ﬁxed k-nearest neighbor

information estimators. ar Xiv preprint ar Xiv:1604.03006, 2016.

[14] Markus Kalisch and Peter Bühlmann. Estimating high-dimensional directed acyclic graphs with

the pc-algorithm. Journal of Machine Learning Research, 8(Mar):613 636, 2007.

[15] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques.

MIT press, 2009.

[16] Daphne Koller and Mehran Sahami. Toward optimal feature selection. Technical report,

Stanford Info Lab, 1996.

[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classiﬁcation with deep

convolutional neural networks. In Advances in neural information processing systems, pages 1097 1105, 2012.

[18] John Langford and Bianca Zadrozny. Reducing t-step reinforcement learning to classiﬁcation.

In Proc. of the Machine Learning Reductions Workshop, 2003.

[19] David Lopez-Paz and Maxime Oquab. Revisiting classiﬁer two-sample tests. ar Xiv preprint

ar Xiv:1610.06545, 2016.

[20] Colin Mc Diarmid. On the method of bounded differences. Surveys in combinatorics, 141(1):148

[21] Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-iid processes.

In Advances in Neural Information Processing Systems, pages 1097 1104, 2009.

[22] Joris Mooij and Tom Heskes. Cyclic causal discovery from continuous equilibrium data. ar Xiv

preprint ar Xiv:1309.6849, 2013.

[23] Judea Pearl. Causality. Cambridge university press, 2009.

[24] V Ramasubramanian and Kuldip K Paliwal. Fast k-dimensional tree algorithms for nearest

neighbor search with application to vector quantization encoding. IEEE Transactions on Signal Processing, 40(3):518 531, 1992.

[25] Bero Roos. On the rate of multivariate poisson convergence. Journal of Multivariate Analysis,

69(1):120 134, 1999.

[26] Karen Sachs, Omar Perez, Dana Pe er, Douglas A Lauffenburger, and Garry P Nolan.

Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523 529, 2005.

[27] Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and search. MIT

press, 2000.

[28] Eric V Strobl, Kun Zhang, and Shyam Visweswaran. Approximate kernel-based conditional

independence tests for fast non-parametric causal discovery. ar Xiv preprint ar Xiv:1702.03877, 2017.

[29] Ioannis Tsamardinos, Laura E Brown, and Constantin F Aliferis. The max-min hill-climbing

bayesian network structure learning algorithm. Machine learning, 65(1):31 78, 2006.

[30] Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies

of events to their probabilities. In Measures of Complexity, pages 11 30. Springer, 2015.

[31] Eric P Xing, Michael I Jordan, Richard M Karp, et al. Feature selection for high-dimensional

genomic microarray data. In ICML, volume 1, pages 601 608. Citeseer, 2001. [32] Kun Zhang, Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Kernel-based conditional

independence test and application in causal discovery. ar Xiv preprint ar Xiv:1202.3775, 2012.