Causal Discovery Using Regression-Based Conditional Independence Tests

Hao Zhang, Shuigeng Zhou, Kun Zhang, Jihong Guan
Shanghai Key Lab of Intelligent Information Processing, Fudan University, China.
Department of Philosophy, Carnegie Mellon University, USA.
Department of Computer Science & Technology, Tongji University, China
{haoz15, sgzhou}@fudan.edu.cn; kunz1@cmu.edu; jhguan@tongji.edu.cn

Abstract

Conditional independence (CI) testing is an important tool in causal discovery. Generally, by using CI tests, a set of Markov equivalence classes w.r.t. the observed data can be estimated by checking whether each pair of variables x and y is d-separated given a set of variables Z. Due to the curse of dimensionality, CI testing often fails to return a reliable result when Z is high-dimensional. In this paper, we propose a regression-based CI test that relaxes the test of x ⊥ y | Z to simpler unconditional independence tests of x − f(Z) ⊥ y − g(Z), together with x − f(Z) ⊥ Z or y − g(Z) ⊥ Z, under the assumption that the data-generating procedure follows additive noise models (ANMs). When the ANM is identifiable, we prove that x − f(Z) ⊥ y − g(Z) ⇒ x ⊥ y | Z. We also show that 1) f and g can be easily estimated by regression, 2) our test is more powerful than the state-of-the-art kernel CI tests, and 3) existing causal learning algorithms can infer many more causal directions by using the proposed method.

Introduction

Statistical independence and conditional independence (CI) are important concepts in statistics, artificial intelligence (AI) and other related fields. In causal discovery, we consider the following scenario: let X, Y and Z denote sets of random variables. If the CI between X and Y given Z holds, denoted by X ⊥ Y | Z, then given Z, further knowing X (or Y) provides no additional information about Y (or X), so we can deduce that X and Y have no direct causal relation.

Independence and CI play a central role in causal discovery. Generally speaking, the CI relationship X ⊥ Y | Z allows us to separate X and Y when constructing a probabilistic model for P(X, Y, Z), which results in a parsimonious representation. By using CI tests, the PC algorithm (Spirtes, Glymour, and Scheines 2000), for example, can determine a set of Markov equivalence classes (Pearl 2009). However, CI testing is much more difficult than unconditional independence testing (Bergsma 2004). Traditional CI testing methods either focus on the discrete case (where the conditioning set can be combined into a single variable according to the corresponding conditional probability table, which makes it easier to handle), or impose simplifying assumptions to deal with the continuous case. For example, under the assumption of Gaussian variables with linear dependence relationships, partial correlation can be used to test CI (Baba, Shibata, and Sibuya 2004). In that setting, X ⊥ Y | Z reduces to zero partial correlation (or zero conditional correlation) between X and Y given Z, which can be easily tested. Nevertheless, nonlinearity and non-Gaussian noise are common in practice, so this assumption is not always reasonable and often leads to incorrect results.
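To make the partial-correlation baseline concrete, the following is a minimal Python sketch (our own illustration, not code from the paper): x and y are regressed linearly on Z and the residuals are correlated, which is exactly the linear-Gaussian shortcut just described. The function name and the toy data are purely illustrative.

```python
import numpy as np
from scipy.stats import pearsonr

def partial_correlation_test(x, y, Z):
    """Test x _||_ y | Z via partial correlation: correlate the residuals of
    linear (least-squares) regressions of x and y on Z. Valid as a CI test
    only under the linear-Gaussian assumption discussed above."""
    Zb = np.column_stack([np.ones(len(x)), Z])                 # add intercept column
    rx = x - Zb @ np.linalg.lstsq(Zb, x, rcond=None)[0]        # residual of x given Z
    ry = y - Zb @ np.linalg.lstsq(Zb, y, rcond=None)[0]        # residual of y given Z
    return pearsonr(rx, ry)                                    # (partial corr., p-value)

# Toy usage: x and y are conditionally independent given z, so the p-value
# should be large (CI not rejected).
rng = np.random.default_rng(0)
z = rng.normal(size=500)
x = 2.0 * z + rng.normal(size=500)
y = -1.5 * z + rng.normal(size=500)
print(partial_correlation_test(x, y, z.reshape(-1, 1)))
```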
Most existing methods are based on explicit estimation of conditional densities or their variants, or discretize the conditioning set Z into a set of bins and transform CI into unconditional independence within each bin. Due to the curse of dimensionality, once the conditioning set becomes large the required sample size inevitably increases dramatically. For example, in (Su and White 2008) the authors used a characterization of CI, P(X | Y, Z) = P(X | Z), and determined CI by measuring the distance between estimates of conditional densities. However, accurate estimation of conditional densities or related quantities is not easy, which deteriorates the testing result, especially when the conditioning set is large.

Generally speaking, CI testing is a nontrivial task, and the curse of dimensionality of the conditioning variable Z makes it even more challenging. When Z takes a finite number of values {z_1, ..., z_k}, then X ⊥ Y | Z iff X ⊥ Y | Z = z_i for each value z_i. Given a sample of size n, even if the data points are distributed evenly over the values of Z, we must establish independence within each subset of the sample sharing the same Z value using only approximately n/k points per subset. When Z is real-valued and P_Z is continuous, the observed values of Z are almost surely unique. To extend the above procedure to the continuous case, we must infer conditional independence using non-identical but neighboring values of Z, where "neighboring" is quantified by some distance metric. Finding neighboring points becomes more difficult as the dimensionality of Z grows. To approximate CI by unconditional independence between X and Y within each subset, we need a large number of subsets of Z; however, with too many subsets, each subset may not have enough data points to evaluate independence.

Recently, kernel-based tests have been proposed for conditional and unconditional independence testing. With the ability to represent high-order moments, mapping variables into reproducing kernel Hilbert spaces (RKHSs) allows us to infer properties of distributions, such as independence and homogeneity (Gretton et al. 2006). In (Fukumizu et al. 2007), the authors proposed to use the Hilbert-Schmidt norm of the conditional cross-covariance operator, which measures the conditional covariance of the images of X and Y under the corresponding functions from the RKHSs. When the kernels are characteristic, the operator norm is zero iff X ⊥ Y | Z; we denote this method by CI_PERM. A more recent method (denoted KCIT for short), proposed in (Zhang et al. 2011), uses partial association of regression functions to measure CI: X ⊥ Y | Z iff E(f̃ g̃) = 0 for all f ∈ L²_{XZ} and g ∈ L²_Y (where L²_{XZ} and L²_Y denote the spaces of square-integrable functions of (X, Z) and Y, respectively), with f̃(X, Z) = f(X, Z) − r_f(Z) and g̃(Y, Z) = g(Y) − r_g(Z), where r_f, r_g ∈ L²_Z are regression functions. The method relaxes the spaces of the functions f, g, r_f and r_g to RKHSs corresponding to kernels defined on these variables. Compared to discretization-based CI testing methods, kernel methods exploit more complete information in the data and involve less random error. It has been shown that causal learning methods based on kernel CI tests can discover more accurate causal relations.
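As a concrete illustration of the discretization strategy discussed earlier in this section (bin Z and test plain independence within each bin), here is a minimal sketch; it is our own illustration, with an assumed per-cell linear test and arbitrary thresholds. The number of cells grows as bins_per_dim**d, so the points available per cell shrink exponentially with the dimensionality of Z.

```python
import numpy as np
from scipy.stats import pearsonr

def binned_ci_test(x, y, Z, bins_per_dim=3, alpha=0.01, min_points=30):
    """Crudely approximate a test of x _||_ y | Z by discretizing Z into
    quantile bins and testing x vs. y inside every resulting cell."""
    n, d = Z.shape
    edges = [np.quantile(Z[:, j], np.linspace(0, 1, bins_per_dim + 1)[1:-1])
             for j in range(d)]
    cell = np.zeros(n, dtype=int)
    for j in range(d):                                   # flat index of the d-dimensional cell
        cell = cell * bins_per_dim + np.digitize(Z[:, j], edges[j])
    rejections, usable_cells = 0, 0
    for c in np.unique(cell):
        idx = cell == c
        if idx.sum() < min_points:                       # too few points to test reliably
            continue
        usable_cells += 1
        _, p = pearsonr(x[idx], y[idx])
        rejections += p < alpha
    return rejections, usable_cells                      # many cells become unusable for large d
```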
In causal discovery, the mechanism of data generation is often assumed. A widely used model is the additive noise model (ANM) (Shimizu et al. 2006; Hoyer et al. 2009; Peters, Janzing, and Schölkopf 2011), because many real-world data sets are regarded as being generated by ANMs (Peters, Janzing, and Schölkopf 2011). Concretely, an ANM assumes that the observed variables follow a directed acyclic graph (DAG) with the structural function Y = f(X) + ε, where X is the parent of Y and ε is a random noise term with X ⊥ ε. Thus CI tests under ANMs involve not only the three sets of variables X, Y and Z, but also the random noise, which may be small but really exists.

In this paper, we develop a new CI testing method for causal discovery from the perspective of ANMs. Given a set of variables Z and two further variables x and y, we show that if the data-generating procedure follows an ANM, then we can relax x ⊥ y | Z to two conditions: x − f(Z) ⊥ y − g(Z), and x − f(Z) ⊥ Z or y − g(Z) ⊥ Z, where the functions f and g are estimated by regression. When the ANM is identifiable (Zhang and Hyvärinen 2009; Peters, Janzing, and Schölkopf 2011), we further argue that x − f(Z) ⊥ y − g(Z) implies x − f(Z) ⊥ Z or y − g(Z) ⊥ Z, which means x − f(Z) ⊥ y − g(Z) alone is sufficient to support x ⊥ y | Z. We also show that f and g can be easily estimated independently by regressing x on Z and y on Z so as to minimize the residuals. Based on these results, we propose the regression-based conditional independence test, denoted by RCIT. RCIT provides a way to relax CI tests to simpler unconditional independence tests. Finally, we apply RCIT to causal discovery. It is well known that existing causal discovery methods based on CI tests usually return a set of Markov equivalence classes. In RCIT, x − f(Z) ⊥ y − g(Z) together with x − f(Z) ⊥ Z or y − g(Z) ⊥ Z implies that Z causes x or y. This means that causal discovery methods (e.g., the PC algorithm) using RCIT for CI testing can detect more causal directions (details are given in the next section). In our experiments, we show that on synthetic datasets the proposed method is more powerful than state-of-the-art approaches, and that it can accurately estimate the null distribution of the test statistic as the dimensionality of Z grows, producing a well-calibrated test. We also validate the practicability of the new test for inferring CI relationships on real-world datasets.

Regression-based conditional independence test (RCIT)

Generally, an ANM is defined as a tuple (S, P(X)), where S = {S_1, S_2, ..., S_n} is a collection of n equations
S_i: x_i = f_i(pa(x_i)) + ε_i, i = 1, 2, ..., n,
where pa(x_i) denotes the set of direct parents of x_i in a DAG G, the noise variables ε_i have a strictly positive density (with respect to the Lebesgue measure) and are i.i.d., and ε_i ⊥ pa(x_i). The ANM reflects the data-generating processes of X in the DAG G. We say an ANM is identifiable if it is asymmetric in cause and effect and is thus capable of distinguishing between them. In fact, ANMs are generally identifiable in the nonlinear case; all the non-identifiable cases are summarized in (Zhang and Hyvärinen 2009) (taking the invertible mapping in the post-nonlinear causal model of (Zhang and Hyvärinen 2009) to be the identity mapping).

We consider the following scenario: given a DAG G whose data-generating procedure follows an ANM, and two randomly selected nodes x_i and x_j, we want to test whether x_i and x_j are conditionally independent given a set of variables Z. By default, throughout this paper we assume that all variables follow an ANM. In what follows, we present the theoretical results that characterize CIs (i.e., x_i ⊥ x_j | Z) from the perspective of ANMs and that underlie the proposed method.
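Before stating the formal results, here is an operational sketch of what the test looks like in practice: regress x on Z and y on Z (the paper uses Gaussian-process regression), then run unconditional independence tests on the residuals. The sketch below is our own self-contained illustration under those assumptions; the HSIC permutation test, the median-heuristic bandwidth and all function names are illustrative choices, not the authors' implementation (which uses KCIT for the unconditional tests).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def _rbf_gram(a):
    """RBF Gram matrix with the median-heuristic bandwidth."""
    d2 = np.sum((a[:, None, :] - a[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / np.median(d2[d2 > 0]))

def hsic_perm_test(u, v, n_perm=500, seed=0):
    """Permutation p-value for a (biased) HSIC statistic between u and v."""
    n = len(u)
    H = np.eye(n) - np.ones((n, n)) / n                       # centering matrix
    Ku = H @ _rbf_gram(u.reshape(n, -1)) @ H
    Kv = H @ _rbf_gram(v.reshape(n, -1)) @ H
    stat = np.sum(Ku * Kv) / n ** 2
    rng = np.random.default_rng(seed)
    null = [np.sum(Ku * Kv[np.ix_(p, p)]) / n ** 2
            for p in (rng.permutation(n) for _ in range(n_perm))]
    return float(np.mean(np.array(null) >= stat))

def rcit(x, y, Z, alpha=0.01):
    """Sketch of RCIT for x _||_ y | Z. Returns True if CI is accepted.
    f and g are estimated by Gaussian-process regression of x and y on Z;
    the residuals are then tested for unconditional independence."""
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    rx = x - gp.fit(Z, x).predict(Z)                          # x - f(Z)
    ry = y - gp.fit(Z, y).predict(Z)                          # y - g(Z)
    if hsic_perm_test(rx, ry) < alpha:                        # test x - f(Z) _||_ y - g(Z)
        return False
    # residual-vs-Z conditions developed below; under a nonlinear identifiable
    # ANM the residual-residual test alone is argued to suffice
    return hsic_perm_test(rx, Z) >= alpha or hsic_perm_test(ry, Z) >= alpha
```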
Theorem 1. If x_i and x_j are neither directly connected nor unconditionally independent, then there must exist a set of variables Z and two functions f and g such that x_i − f(Z) ⊥ x_j − g(Z), and x_i − f(Z) ⊥ Z or x_j − g(Z) ⊥ Z.

Proof. Without loss of generality, assume that x_j is an ancestor of x_i, and let pa(x_i) denote the set of direct parents of x_i. Following the data-generating process of the ANM, we have x_i = f(pa(x_i)) + ε_i and ε_i ⊥ pa(x_i), i.e., x_i − f(pa(x_i)) ⊥ pa(x_i). Since ε_i is an exogenous additive noise that is independent of all the non-descendants of x_i, we have ε_i ⊥ (x_j, pa(x_i)). Thus, for an arbitrary function g, we have x_i − f(pa(x_i)) ⊥ x_j − g(pa(x_i)). Similarly, if x_i is an ancestor of x_j, or x_i and x_j share common ancestors, we can also obtain x_i − f(pa(x_i)) ⊥ (pa(x_i), x_j − g(pa(x_i))). Therefore, letting Z = pa(x_i) (or Z = pa(x_j)) completes the proof of this theorem.

Actually, Z = pa(x_i) or Z = pa(x_j) is just a sufficient choice for Theorem 1. In many cases, we need not restrict Z to pa(x_i) or pa(x_j) in order to satisfy x_i − f(Z) ⊥ x_j − g(Z) and x_i − f(Z) ⊥ Z or x_j − g(Z) ⊥ Z. For example, given a DAG with x_1 → z_1 → x_2 and z_2 → x_2, if x_2 can be expressed as x_2 = f_1(z_1) + f_2(z_2) + ε, then letting Z = z_1 and g be an arbitrary function, we also obtain x_2 − f_1(Z) ⊥ x_1 − g(Z) and x_2 − f_1(Z) ⊥ Z. In what follows, we show that x_i and x_j are conditionally independent given such a Z.

Theorem 2. Given two variables x_i and x_j, if there exist a set of variables Z and two functions f and g such that x_i − f(Z) ⊥ x_j − g(Z), and x_i − f(Z) ⊥ Z or x_j − g(Z) ⊥ Z, and the corresponding ANM is 1) linear, or 2) identifiable and faithful, then x_i ⊥ x_j | Z.

Proof. Note that
I(x_i; x_j | Z) = I(x_i − f(Z); x_j − g(Z) | Z)
= I(x_i − f(Z); (x_j − g(Z), Z)) − I(x_i − f(Z); Z)
= I(x_j − g(Z); (x_i − f(Z), Z)) − I(x_j − g(Z); Z).
In the linear case, if x_i − f(Z) ⊥ x_j − g(Z) and x_i − f(Z) ⊥ Z, then I(x_i − f(Z); (x_j − g(Z), Z)) = 0 and I(x_i − f(Z); Z) = 0, hence I(x_i; x_j | Z) = 0, i.e., x_i ⊥ x_j | Z. Similarly, under the analogous conditions we can deduce I(x_j − g(Z); (x_i − f(Z), Z)) − I(x_j − g(Z); Z) = 0. In the case where the ANM is identifiable and faithful, let ε_i = x_i − f(Z); if x_i − f(Z) ⊥ Z, then ε_i ⊥ Z. Suppose for contradiction that x_i and x_j are not conditionally independent given Z. Then there must be at least one active path P between x_i and x_j given Z, and P must pass through ε_i. Hence ε_i is dependent on x_j; and since ε_i ⊥ Z implies ε_i ⊥ g(Z), ε_i must be dependent on x_j − g(Z), which is a contradiction.

In (Zhang and Hyvärinen 2009), the authors showed that only very carefully chosen parameters can lead to a non-identifiable ANM in nonlinear cases. Therefore, Theorem 2 means that x_i − f(Z) ⊥ x_j − g(Z), together with x_i − f(Z) ⊥ Z or x_j − g(Z) ⊥ Z, is generally sufficient to support x_i ⊥ x_j | Z. Combining Theorem 1 and Theorem 2, we see that the CI test of x_i ⊥ x_j | Z can be replaced by the unconditional independence tests x_i − f(Z) ⊥ x_j − g(Z), and x_i − f(Z) ⊥ Z or x_j − g(Z) ⊥ Z; that is, a CI test is relaxed to at most three unconditional independence tests. In causal discovery, we therefore need at most 3·Σ_{i=1}^{|S|} C(|S|, i) unconditional independence tests (where S denotes the maximal conditioning set and Z ⊆ S) to determine whether x_i and x_j are CI in the worst case, while existing CI testing methods need Σ_{i=1}^{|S|} C(|S|, i) CI tests.

In what follows, we try to further simplify the conditions. Consider the scenario where the data-generating procedure follows a nonlinear ANM. Motivated by plenty of empirical results, we find that in practice only one unconditional independence test is enough. Thus, we have the following conjecture.

Conjecture 1. Given a set of variables Z, two variables x_i and x_j, and two functions f and g such that x_i − f(Z) ⊥ x_j − g(Z), a necessary condition for x_i and x_j to remain conditionally dependent given Z is that the corresponding ANM is not identifiable.¹

¹ We found a flaw in the proof of this result given in an earlier version, and therefore it is treated here as a conjecture.
To rationalize this conjecture, without loss of generality we assume that x_i is an ancestor (or parent) of x_j. If x_i − f(Z) ⊥ x_j − g(Z) and x_i − f(Z) ⊥ Z or x_j − g(Z) ⊥ Z, then we have x_i ⊥ x_j | Z according to Theorem 2. If instead neither x_i − f(Z) ⊥ Z nor x_j − g(Z) ⊥ Z holds, we can generate two nodes v_i and v_j to extend the corresponding DAG, with v_i = x_i − f(Z) + ε_i and v_j = x_j − g(Z) + ε_j, where ε_i and ε_j are randomly generated additive noises; thus v_i and v_j are both dependent on Z. If x_i − f(Z) ⊥ x_j − g(Z), then we obtain v_i ⊥ v_j. Combining the dependence of v_i and v_j on Z with v_i ⊥ v_j, we can deduce that v_i, Z and v_j form a V-structure (Cai, Zhang, and Hao 2013), i.e., both v_i and v_j are parents of Z. Recall that v_i = x_i − f(Z) + ε_i and v_j = x_j − g(Z) + ε_j imply Z → v_i and Z → v_j, respectively. Then there are two cases: 1) the distribution is faithful to the original DAG with the added edge Z → v_i or Z → v_j; 2) the distribution is faithful neither to the original DAG with the added edge Z → v_i nor to that with the added edge Z → v_j, so that v_i and v_j can be expressed as functions of the other nodes excluding Z. In case 1, Z would be both a cause and an effect of v_i (or v_j). According to the mechanism of ANM (Hoyer et al. 2009), this can occur only when the ANM is not identifiable. All non-identifiable cases of ANM (taking the invertible mapping in the post-nonlinear causal model to be the identity mapping) are summarized in (Zhang and Hyvärinen 2009), where it was shown that only very carefully chosen parameters can lead to a non-identifiable ANM in nonlinear cases. That is, if the ANM is not identifiable, then it is generally linear; strictly speaking, the causal relationship between v_i and Z is linear, i.e., v_i = x_i − f(Z) + ε_i where f is a linear function. In case 2, if both Z → v_i and Z → v_j can be removed, then, considering that v_i = x_i − f(Z) + ε_i and v_j = x_j − g(Z) + ε_j, x_i (respectively x_j) is a parent of v_i (respectively v_j). Since x_i is an ancestor (or parent) of x_j, v_i cannot be independent of v_j, contradicting v_i ⊥ v_j. Thus, if x_i − f(Z) ⊥ x_j − g(Z) while neither x_i − f(Z) ⊥ Z nor x_j − g(Z) ⊥ Z holds, the corresponding ANM is not identifiable, and generally either f or g is linear.

For example, consider a DAG in which x is a parent of both y and Z, and Z → y, with Z = f(x) + ε_z (f a linear function) and y = h(x) + g(Z) + ε_y. Then x − f(Z) = ε_z is independent of y − g(Z) = h(x) + ε_y. In practice, even in such a case, it is not easy to find appropriate f and g that guarantee x − f(Z) ⊥ y − g(Z).

In the context of nonlinear ANMs, for two arbitrary variables that are not directly connected, we can surely find two nonlinear functions f and g that satisfy x_i − f(Z) ⊥ x_j − g(Z) according to Theorem 1. By Conjecture 1, x_i − f(Z) ⊥ x_j − g(Z) covers x_i − f(Z) ⊥ Z or x_j − g(Z) ⊥ Z, which leads to x_i ⊥ x_j | Z according to Theorem 2. Thus, instead of testing x_i ⊥ x_j | Z, we only need to check whether there exist two nonlinear functions f and g such that x_i − f(Z) ⊥ x_j − g(Z). A consequent advantage is that the target functions are easy to find independently: we only need to perform nonlinear regression of x_i on Z and of x_j on Z, respectively, minimizing the residuals. Moreover, in causal discovery, causal directions are usually detected by identifying V-structures and performing consistent propagation (Pearl 2009). Since x_i − f(Z) ⊥ x_j − g(Z) implies x_i − f(Z) ⊥ Z or x_j − g(Z) ⊥ Z by Conjecture 1, RCIT can capture more information about causal directions than V-structure detection alone.
This is because, in a nonlinear ANM, if x − f(Z) ⊥ Z then it rarely happens that Z contains a child of x, as was also suggested in (Mooij et al. 2009): whenever {z_1, ..., z_l} contains a child of x, the independence between Residuals({z_1, ..., z_l}, x) (which fits x as a function of {z_1, ..., z_l} and returns the residuals) and {z_1, ..., z_l} is rejected. Compared with CI tests, RCIT can thus detect more causal directions even when the corresponding DAG contains no V-structure. Consider a simple example: given the DAG x_1 ← x_2 → x_3, it is easy to find two functions f and g such that x_1 − f(x_2) ⊥ x_2 and x_3 − g(x_2) ⊥ x_2, so we can infer x_2 → x_1 and x_2 → x_3. However, it is difficult for CI tests to distinguish the three structures x_1 → x_2 → x_3, x_1 ← x_2 ← x_3 and x_1 ← x_2 → x_3, because all of them fit the observed conditional and unconditional independences, though they are obviously completely different structures.

Causal discovery based on RCIT

In this section, we present a new causal discovery method based on the PC algorithm and RCIT. For convenience, the method is called PC_RCIT, i.e., the PC algorithm based on RCIT. PC_RCIT follows the PC algorithm, in which we use RCIT in place of the CI test and use existing methods (e.g., KCIT (Zhang et al. 2011)) for the unconditional independence tests. We perform RCIT simply by estimating f̂ of f and ĝ of g by regression; then we test whether x − f̂(Z) ⊥ y − ĝ(Z), and x − f̂(Z) ⊥ Z or y − ĝ(Z) ⊥ Z. If there is sufficient prior knowledge showing that the corresponding data-generating procedure follows a nonlinear ANM, we can test only x − f̂(Z) ⊥ y − ĝ(Z), according to Conjecture 1.

Our method is outlined in Algorithm 1. The first step (Lines 1-6) constructs the causal skeleton using RCIT, following the PC algorithm: we form the complete undirected graph G on the variable set X, then check whether each pair of variables x_i and x_j is conditionally independent given a set Z of variables, while saving the independence results (e.g., recording that Z contains ancestors or parents of x_i if x_i − f(Z) ⊥ Z). After obtaining the causal skeleton, we orient the edges according to the saved independence results and perform consistent propagation (Line 7). Finally, as a refinement step, we deal with the remaining unoriented edges by detecting V-structures and doing consistent propagation as in the PC algorithm, i.e., checking whether a local structure x_i − x_j − x_k forms a V-structure and, if so, orienting it as x_i → x_j ← x_k (Line 8).

Note that in the case of a nonlinear ANM, the performance of testing x − f(Z) ⊥ y − g(Z), and x − f(Z) ⊥ Z or y − g(Z) ⊥ Z, is slightly different from that of testing only the first term. For example, suppose the ground truth is x_i ⊥ x_j | Z, and let a and b denote the error rates of the regressions for f and g, respectively. If we assume that an unreliable f̂ (or ĝ) causes the residual test x_i − f̂(Z) ⊥ x_j − ĝ(Z) to fail as well as x_i − f̂(Z) ⊥ Z (or x_j − ĝ(Z) ⊥ Z), then the error rate of RCIT testing two (or three) terms is a·b, while the error rate of RCIT testing one term is a + b. However, different assumptions may lead to different results, and it is difficult to generalize. According to our observations, in many cases the two kinds of RCIT testing are very close in performance.

Algorithm 1: PC algorithm based on RCIT (PC_RCIT)
Input: variable set X = {x_1, ..., x_n}, threshold k.
Output: partial DAG G.
1: Form the complete undirected graph G on the variable set X.
2: for all x_i, x_j ∈ X adjacent in G do
3:   if there exists Z ⊆ X \ {x_i, x_j} with |Z| < k such that x_i ⊥ x_j | Z (estimated by RCIT) then
4:     delete the edge x_i − x_j from G and save the independence results (e.g., record Z → x_i if x_i − f(Z) ⊥ Z).
5:   end if
6: end for
7: Orient the edges of the skeleton according to the saved independence results and then do consistent propagation.
8: Orient the remaining unoriented edges (via V-structure detection) and do consistent propagation.
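A rough Python sketch of the skeleton phase (Lines 1-6 of Algorithm 1) is given below; it is our own illustration, treating the RCIT test as a black box (for example, a wrapper around the rcit() sketch shown earlier that also reports which residual-vs-Z conditions held). The variable names and data layout are assumptions.

```python
from itertools import combinations
import numpy as np

def pc_rcit_skeleton(data, rcit_test, k=3):
    """Skeleton phase of Algorithm 1 (Lines 1-6), as a rough sketch.

    data:      dict mapping variable name -> 1-D sample array
    rcit_test: callable (x, y, Zmat) -> (ci_accepted, x_resid_indep_Z, y_resid_indep_Z)
    k:         upper bound on the size of the conditioning set
    """
    names = list(data)
    adj = {v: set(names) - {v} for v in names}            # complete undirected graph
    hints = []                                             # saved orientation hints: (Z, child)
    for xi, xj in combinations(names, 2):
        if xj not in adj[xi]:
            continue
        others = [v for v in names if v not in (xi, xj)]
        for size in range(1, k):                           # |Z| = 0 case omitted for brevity
            for Z in combinations(others, size):
                Zmat = np.column_stack([data[z] for z in Z])
                ci, xi_ok, xj_ok = rcit_test(data[xi], data[xj], Zmat)
                if ci:                                     # x_i _||_ x_j | Z accepted
                    adj[xi].discard(xj); adj[xj].discard(xi)
                    if xi_ok:                              # x_i - f(Z) _||_ Z  =>  record Z -> x_i
                        hints.append((Z, xi))
                    if xj_ok:
                        hints.append((Z, xj))
                    break
            else:
                continue
            break                                          # edge removed; move to the next pair
    return adj, hints                                      # hints feed the orientation step (Line 7)
```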
Performance evaluation

We apply the proposed method to both synthetic and real data to evaluate its practical performance, and compare it with KCIT, CI_PERM and partial correlation, as well as their corresponding instantiations of the PC algorithm. In our implementation, we perform the regressions using Gaussian processes (Rasmussen 2006) and carry out the unconditional independence tests of RCIT using KCIT (Zhang et al. 2011).

Effect of Z's dimensionality and sample size

We first examine, by simulation, how the probabilities of Type I errors (where the CI hypothesis is incorrectly rejected) and Type II errors (where the CI hypothesis is not rejected although it is false) of RCIT change with the dimensionality of the conditioning set Z (D = 1, 2, ..., 5) and the sample size (n = 100 and 200). We consider two cases. In Case I, only one variable in Z, denoted by Z_1, is effective, i.e., the other conditioning variables are independent of X, Y and Z_1. We generate X and Y from Z_1 according to the ANM data-generating procedure: each is generated as f(g(Z_1)) + ε, where f and g are randomly selected from the sin, cos, tanh, square and cubic functions and are different for X and Y, and ε ~ U(−0.2, 0.2). Hence X ⊥ Y | Z holds. In our simulations, each Z_i is i.i.d. uniform U(0, 1). In Case II, all variables in the conditioning set Z are effective in generating X and Y. We first generate the independent variables Z_i; then X and Y are generated as Σ_i f_i(g_i(Z_i)) + ε, where the f_i and g_i are randomly selected from the sin, cos, tanh, square and cubic functions.

We compare RCIT with KCIT, CI_PERM (with the standard setting of 500 bootstrap samples) and the partial correlation test in terms of both types of errors. The significance level is fixed at 0.01. Note that for a good testing method, the probability of Type I error should be as close to the significance level as possible, and the probability of Type II error should be as small as possible. To see how large they are for RCIT, we increase the dimensionality of Z and the sample size n, and repeat the tests 1000 times. We calculate Type I and Type II errors as follows: for example, with D = 3 in Case I, x should be independent of y given (Z_1), (Z_1, Z_2), (Z_1, Z_3) and (Z_1, Z_2, Z_3), so Type I error = 1 − (number of accepted CIs)/4; on the other hand, x is not independent of y given the empty set, (Z_2), (Z_3) and (Z_2, Z_3), so Type II error = (number of accepted CIs)/4. Type I and Type II errors in Case II are calculated similarly.

We first examine the Type I error in both Case I and Case II. As shown in Fig. 1(a) and 1(c), the Type I error of RCIT is close to the significance level, and as D increases, the probability of Type I error increases slightly. One can see that in Case I, even when D = 3, the probability of Type I error of the other three methods is clearly larger than the significance level. Furthermore, KCIT and CI_PERM are very sensitive to D. The curve of partial correlation tends to be a straight line parallel to the x-axis, because all the experimental data follow the nonlinear generating procedure described above. A notable observation is that increasing the sample size (from 100 to 200) does not reduce the Type I errors of the four methods.

In many scenarios, two non-adjacent variables are conditionally independent given a set of variables, so a good test is also expected to have a small probability of Type II error.
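For concreteness, the two simulation settings just described could be generated roughly as follows; this is a sketch under the stated assumptions, and the exact scaling and random seeds are our own choices since they are not specified in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
FUNS = [np.sin, np.cos, np.tanh, np.square, lambda v: v ** 3]
pick = lambda: FUNS[rng.integers(len(FUNS))]      # random choice from the function pool

def case1_data(n=200, dim_z=3):
    """Case I: only Z[:, 0] is effective, so X _||_ Y | Z holds."""
    Z = rng.uniform(0, 1, size=(n, dim_z))
    fx, gx, fy, gy = pick(), pick(), pick(), pick()           # different for X and Y
    X = fx(gx(Z[:, 0])) + rng.uniform(-0.2, 0.2, n)
    Y = fy(gy(Z[:, 0])) + rng.uniform(-0.2, 0.2, n)
    return X, Y, Z

def case2_data(n=200, dim_z=3):
    """Case II: every Z_i contributes to both X and Y."""
    Z = rng.uniform(0, 1, size=(n, dim_z))
    X = sum(pick()(pick()(Z[:, i])) for i in range(dim_z)) + rng.uniform(-0.2, 0.2, n)
    Y = sum(pick()(pick()(Z[:, i])) for i in range(dim_z)) + rng.uniform(-0.2, 0.2, n)
    return X, Y, Z
```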
As shown in Fig. 1(b) and 1(d), RCIT clearly obtains the best result. As D increases, the probability of Type II error always increases. Intuitively, this is reasonable: due to the finite-sample effect, as the conditioning set becomes larger and larger, X and Y tend to be judged conditionally independent. On the other hand, as the sample size increases from 100 to 200, the probability of Type II error quickly approaches zero. In particular, as shown in Fig. 1(b), the curves of RCIT with sample size 200 stay close to zero. In contrast to the Type I error, increasing the sample size (from 100 to 200) dramatically reduces the Type II error. Note that KCIT and RCIT have very similar performance when the dimensionality of Z is 1 or 2, which means that for very small DAGs the two methods should perform similarly in discovering the causal skeleton. However, RCIT can learn more information about the causal directions, which will be discussed in the next subsection.

[Figure 1: The probabilities of Type I and Type II errors obtained by simulation in various situations, plotted against the dimensionality of Z (1 to 5). Panels: (a) Type I error, Case I; (b) Type II error, Case I; (c) Type I error, Case II; (d) Type II error, Case II. Case I: only one variable in Z is effective for X and Y; Case II: all variables in Z are effective.]

Performance in causal discovery

CI tests are frequently used in causal inference, where one assumes that the true causal structure of n random variables x_1, ..., x_n can be represented by a directed acyclic graph (DAG) G. More specifically, the causal Markov condition assumes that the joint distribution satisfies all the CIs imposed by the true causal graph (note that this is an assumption about the physical generating process of the data, not only about their distribution). Constraint-based methods like the PC algorithm make the additional assumption of faithfulness (i.e., the joint distribution does not contain any CIs that are not entailed by the Markov condition) and recover the graph structure by exploiting the (conditional) independences that can be found in the data. Obviously, this is only possible up to Markov equivalence classes, which are sets of graphs that impose exactly the same independences and CIs. Hence, the PC algorithm based on existing CI test methods orients causal directions by finding V-structures and performing consistent propagation (Pearl 2009). In our experiments, we show that the PC algorithm based on RCIT can reveal many more causal directions, as mentioned above.

We generate data from a random DAG G. In particular, we sample four random variables x_1, ..., x_4 and allow arrows from x_i to x_j only for i < j. With probability 0.5, each possible arrow is either present or absent. The root variables are generated from N(0, 1), and each non-root variable x_i is generated as f(g(Σ_k f_k(g_k(pa_k)))) + ε, where the sum runs over the parents pa_k of x_i, f, g, f_k and g_k are randomly selected from the sin, cos, tanh, square and cubic functions and differ across variables, and ε ~ U(−0.2, 0.2) is independent noise. For significance level 0.01 and sample sizes between 25 and 400, we simulate 1000 DAGs and evaluate the performance of the different methods in discovering the causal skeleton and the PDAG (including identifiable causal directions).
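The random-DAG generating process just described could be sketched as follows; this is our own illustration, with function scaling chosen arbitrarily since it is not specified in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
FUNS = [np.sin, np.cos, np.tanh, np.square, lambda v: v ** 3]
pick = lambda: FUNS[rng.integers(len(FUNS))]

def random_anm_dag(n_samples=200, n_vars=4, edge_prob=0.5):
    """Sample a random DAG over x_1..x_n (arrows only from lower to higher index)
    and generate data following the nonlinear ANM described above."""
    A = np.triu(rng.random((n_vars, n_vars)) < edge_prob, k=1)   # A[i, j]: edge x_i -> x_j
    X = np.zeros((n_samples, n_vars))
    for j in range(n_vars):
        parents = np.flatnonzero(A[:, j])
        if parents.size == 0:
            X[:, j] = rng.normal(size=n_samples)                 # root variable ~ N(0, 1)
        else:
            s = sum(pick()(pick()(X[:, p])) for p in parents)    # sum_k f_k(g_k(pa_k))
            X[:, j] = pick()(pick()(s)) + rng.uniform(-0.2, 0.2, n_samples)
    return A, X
```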
Because RCIT and KCIT work significantly better than CI_PERM and partial correlation, as shown in Fig. 1 (for the performance comparison among KCIT, CI_PERM and partial correlation within PC, see (Zhang et al. 2011)), here we compare PC_RCIT with PC based on KCIT (denoted by PC_KCIT). To the best of our knowledge, in generic cases KCIT outperforms the other existing methods in terms of discovering causality with PC when the input data are continuous.

As shown in Fig. 2(a), when the sample size is small (e.g., less than 200), PC_RCIT performs significantly better than PC_KCIT. As the sample size increases, the performance of PC_KCIT approaches that of PC_RCIT. When the sample size reaches 400, the F1 curves of PC_RCIT and PC_KCIT tend to overlap, though the former is still slightly (about 0.016) better than the latter. That is, although both RCIT and KCIT utilize regression, PC_RCIT performs significantly better than PC_KCIT in discovering the causal skeleton when the sample size is small, which is frequently the case in practice. Our method is thus advantageous over existing CI tests, for which a large conditioning set combined with a small sample tends to lead to incorrect conclusions, whereas regression followed by an unconditional test performs significantly better.

We also evaluate the two methods in discovering the PDAG. The results are presented in Fig. 2(b). PC_RCIT achieves the best result in all cases, even though the performance of PC_KCIT in discovering the causal skeleton is very close to that of PC_RCIT when the sample size is large enough. The reason is that PC_KCIT orients causal directions based only on V-structures and consistent propagation (Pearl 2009), in other words, it returns only a set of Markov equivalence classes, while PC_RCIT can uncover more causal directions by checking whether x − f(Z) ⊥ Z.

[Figure 2: Performance comparison (Recall/Precision/F1) between PC_RCIT and PC_KCIT for sample sizes from 25 to 400, in terms of discovering (a) the causal skeleton and (b) the PDAG.]

Performance in causal direction inference

We apply PC_RCIT to the dataset presented in (Mooij et al. 2009), which was generated following an ANM w.r.t. a DAG consisting of seven variables, as shown in Fig. 3(a). For the comparison of causal direction discovery, we choose two similar skeletons reconstructed by PC_RCIT and PC_KCIT with 1000 samples, shown in Fig. 3(b) and Fig. 3(c). All the causal edges discovered by PC_RCIT are correct. However, as shown in Fig. 3(c), the edges between nodes 2 and 4, 6 and 7, and 6 and 5 are incorrectly oriented by PC_KCIT. By taking advantage of RCIT, existing constraint-based methods (e.g., the PC algorithm) can greatly improve their performance in causal discovery, as RCIT helps to break Markov equivalence classes.

[Figure 3: Performance comparison in causal direction inference. (a) Ground-truth causal model; (b) reconstructed DAG based on PC_RCIT; (c) reconstructed DAG based on PC_KCIT. Red arrows indicate incorrect directions.]

Graphical modeling from medical data

We finally apply RCIT to a real-world dataset used in a previous work (Fukumizu et al. 2007). The data consist of three variables: creatinine clearance (C), digoxin clearance (D), and urine flow (U). They were taken from 35 patients and analyzed with graphical models in (Edwards 2012). Based on medical knowledge, D should be independent of U when controlling for C, i.e., D ⊥ U | C is known as the ground truth.
The results are presented in Table 1. We can see that D ⊥ U | C is strongly affirmed by RCIT, but is not found by partial correlation.

Table 1: Results on the real medical data (testing D ⊥ U | C).
Method        P-value
RCIT          0.1827
part. corr.   0.0037

Conclusion

In this paper, we propose a novel regression-based conditional independence testing approach based on the additive noise model. In contrast to existing CI testing methods, it makes use of a characterization of conditional independence in terms of residuals (or exogenous noise) between variables. We show that once the causal process is assumed, general CIs can be replaced by weaker conditions: we relax the test of x ⊥ y | Z to the simpler unconditional independence tests of x − f(Z) ⊥ y − g(Z), and x − f(Z) ⊥ Z or y − g(Z) ⊥ Z, under the assumption that the data-generating procedure follows an ANM. When the ANM is identifiable, we prove that x − f(Z) ⊥ y − g(Z) ⇒ x ⊥ y | Z. Compared to existing methods, our method is less sensitive to the dimensionality of Z. Experiments on both simulated and real-world data show that the new method outperforms existing techniques in discovering causality.

Acknowledgement

This work was supported by the Key Projects of Fundamental Research Program of Shanghai Municipal Commission of Science and Technology under grant No. 14JC1400300. Jihong Guan was supported by the Program of Shanghai Subject Chief Scientist (15XD1503600).

References

Baba, K.; Shibata, R.; and Sibuya, M. 2004. Partial correlation and conditional correlation as measures of conditional independence. Australian & New Zealand Journal of Statistics 46(4):657-664.
Bergsma, W. P. 2004. Testing conditional independence for continuous random variables. Eurandom.
Cai, R.; Zhang, Z.; and Hao, Z. 2013. Causal gene identification using combinatorial V-structure search. Neural Networks 43(7):63-71.
Edwards, D. 2012. Introduction to Graphical Modelling. Springer Science & Business Media.
Fukumizu, K.; Gretton, A.; Sun, X.; and Schölkopf, B. 2007. Kernel measures of conditional dependence. Advances in Neural Information Processing Systems 20(1):167-204.
Gretton, A.; Borgwardt, K. M.; Rasch, M.; Schölkopf, B.; and Smola, A. J. 2006. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems, 513-520.
Hoyer, P. O.; Janzing, D.; Mooij, J. M.; Peters, J.; and Schölkopf, B. 2009. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems, 689-696.
Mooij, J.; Janzing, D.; Peters, J.; and Schölkopf, B. 2009. Regression by dependence minimization and its application to causal inference in additive noise models. In Proceedings of the 26th Annual International Conference on Machine Learning, 745-752. ACM.
Pearl, J. 2009. Causality. Cambridge University Press.
Peters, J.; Janzing, D.; and Schölkopf, B. 2011. Causal inference on discrete data using additive noise models. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(12):2436-2450.
Rasmussen, C. E. 2006. Gaussian Processes for Machine Learning.
Shimizu, S.; Hoyer, P. O.; Hyvärinen, A.; and Kerminen, A. 2006. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research 7:2003-2030.
Spirtes, P.; Glymour, C. N.; and Scheines, R. 2000. Causation, Prediction, and Search, volume 81. MIT Press.
Su, L., and White, H. 2008. A nonparametric Hellinger metric test for conditional independence. Econometric Theory 24(4):829-864.
Zhang, K., and Hyvärinen, A. 2009. On the identifiability of the post-nonlinear causal model. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 647-655. AUAI Press.
Zhang, K.; Peters, J.; Janzing, D.; and Schölkopf, B. 2011. Kernel-based conditional independence test and application in causal discovery. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, 804-813. Corvallis, OR, USA: AUAI Press.