# causal_discovery_from_heterogeneousnonstationary_data__cf81c265.pdf Journal of Machine Learning Research 21 (2020) 1-53 Submitted 3/19; Revised 2/20; Published 5/20 Causal Discovery from Heterogeneous/Nonstationary Data Biwei Huang biweih@andrew.cmu.edu Kun Zhang kunz1@cmu.edu Department of Philosophy, Carnegie Mellon University, Pittsburgh, PA 15213 Jiji Zhang jijizhang@ln.edu.hk Department of Philosophy, Lingnan University, Tuen Mun, Hong Kong Joseph Ramsey jdramsey@andrew.cmu.edu Department of Philosophy, Carnegie Mellon University, Pittsburgh, PA 15213 Ruben Sanchez-Romero ruben.saro@rutgers.edu Center for Molecular and Behavioral Neuroscience, Rutgers University, Newark, NJ 07102 Clark Glymour cg09@andrew.cmu.edu Department of Philosophy, Carnegie Mellon University, Pittsburgh, PA 15213 Bernhard Sch olkopf bernhard.schoelkopf@tuebingen.mpg.de Max Planck Institute for Intelligent Systems, T ubingen 72076, Germany Editor: Jon Mc Auliffe It is commonplace to encounter heterogeneous or nonstationary data, of which the underlying generating process changes across domains or over time. Such a distribution shift feature presents both challenges and opportunities for causal discovery. In this paper, we develop a framework for causal discovery from such data, called Constraint-based causal Discovery from heterogeneous/NOnstationary Data (CD-NOD), to find causal skeleton and directions and estimate the properties of mechanism changes. First, we propose an enhanced constraintbased procedure to detect variables whose local mechanisms change and recover the skeleton of the causal structure over observed variables. Second, we present a method to determine causal orientations by making use of independent changes in the data distribution implied by the underlying causal model, benefiting from information carried by changing distributions. After learning the causal structure, next, we investigate how to efficiently estimate the driving force of the nonstationarity of a causal mechanism. That is, we aim to extract from data a low-dimensional representation of changes. The proposed methods are nonparametric, with no hard restrictions on data distributions and causal mechanisms, and do not rely on window segmentation. Furthermore, we find that data heterogeneity benefits causal structure identification even with particular types of confounders. Finally, we show the connection between heterogeneity/nonstationarity and soft intervention in causal discovery. Experimental results on various synthetic and real-world data sets (task-f MRI and stock market data) are presented to demonstrate the efficacy of the proposed methods. Keywords: causal discovery, heterogeneous/nonstationary data, independent-change principle, kernel distribution embedding, driving force estimation, confounder . Equal contribution c 2020 Biwei Huang, Kun Zhang, Jiji Zhang, Joseph Ramsey, Ruben Sanchez-Romero, Clark Glymour, and Bernhard Sch olkopf. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/19-232.html. Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf 1. Introduction Many tasks across several disciplines of empirical sciences and engineering rely on the underlying causal information. As it is often difficult, if not impossible, to carry out randomized experiments, inferring causal relations from purely observational data, known as the task of causal discovery, has drawn much attention in machine learning, philosophy, statistics, and computer science. Traditionally, for causal discovery from observational data, under appropriate assumptions, so-called constraint-based approaches recover some information of the underlying causal structure based on conditional independence relationships of the variables (Spirtes et al., 1993). Alternatively, approaches based on functional causal models infer the causal structure by exploiting the fact that under certain conditions, the independence between the noise and the hypothetical cause only holds for the correct causal direction but not for the wrong direction (Shimizu et al., 2006; Hoyer et al., 2009; Zhang and Hyv arinen, 2009a,b). Over the last few years, with the rapid accumulation of huge volumes of data of various types, causal discovery is facing exciting opportunities but also great challenges. One feature such data often exhibit is distribution shift. Distribution shift may occur across data sets, which may be obtained under different interventions or with different data collection conditions, or over time, as featured by nonstationary data. For an example of the former kind, consider remote sensing imagery data. The data collected in different areas and at different times usually have different distributions due to varying physical factors related to ground, vegetation, illumination conditions, etc. As an example of the latter kind, f MRI recordings are usually nonstationary: information flows in the brain may change with stimuli, tasks, attention of the subject, etc. More specifically, it is believed that one of the basic properties of neural connections is their time-dependence (Havlicek et al., 2011). In these situations many existing approaches to causal discovery may fail, as they assume a fixed causal model and hence a fixed joint distribution underlying the observed data. For example, if changes in local mechanisms of some variables are related, one can model the situation as if there exists some unobserved quantity which influences all those variables and, as a consequence, the conditional independence relationships in the distribution-shifted data will be different from those implied by the true causal structure. In this paper, we assume that mechanisms or parameters, associated with the causal model, may change across data sets or over time (we allow mechanisms to change in such a way that some causal links in the structure may vanish or appear in some domains or over some time periods). We aim to develop a principled framework to model such situations as well as practical methods, called Constraint-based causal Discovery from heterogeneous/NOnstationary Data (CD-NOD), to address the following questions: 1. How can we efficiently identify variables with changing local mechanisms and reliably recover the skeleton of the causal structure over observed variables? 2. How can we take advantage of the information carried by distribution shifts for the purpose of identifying causal directions? After identifying the causal structure, it is then appealing to ask how causal mechanisms change across domains or over time, which raises the question: 3. How can we extract from data a low-dimensional and potentially interpretable representation of changes, the so-called driving force of changing causal mechanisms? Causal Discovery from Heterogeneous/Nonstationary Data Furthermore, we extend our approach to deal with more general scenarios, e.g., dynamic systems which involve both time-varying instantaneous and lagged causal relations and the case of stationary confounders. In answering these questions, we make use of the following properties of causal systems. (i) Causal models and distribution shifts are heavily coupled: causal models provide a compact description of how data-generating processes, as well as data distributions, change, and distribution shifts exhibit such changes. (ii) From a causal perspective, the distribution shift in heterogeneous/nonstationary data is usually constrained it may be due to the changes in the data-generating processes (i.e., the local causal mechanisms) of a small number of variables. (iii) From a latent variable modeling perspective, heterogeneous/nonstationary data are generated by some quantities that change across domains or over time, which gives hints as to how to understand distribution shift and estimate its underlying driving forces. (iv) Suppose that there are no confounders for cause and effect. Then P(cause) and P(effect | cause) are either fixed or change independently, also known as the modularity property of causal systems (Pearl, 2000). Such an independence helps identify causal directions in the presence of distribution shifts. To reliably estimate the skeleton of the causal structure and detect changing causal mechanisms from heterogeneous/nonstationary data (Problem 1), we make use of property (i), (ii), and (iii) listed above. Specifically, we introduce a surrogate variable C into the causal system to characterize hidden quantities that lead to the changes across domains or over time. The variable C can be a domain or time index. Including C in the causal system provides a convenient way to unpack distribution shifts to causal representations. We show that given C, (conditional) independence relationships between observed variables are the same as those implied by the true causal structure. We, additionally, show that variables that are adjacent to the surrogate variable C have changing causal mechanisms. We make the assumption of faithfulness on the graph involving C, and it is known that faithfulness implies a minimality condition on the edges (Zhang and Spirtes, 2016); as a consequence, the graphical representation produced by our procedure naturally enjoys a minimal change principle the representation explains the conditional independence relations and changeability of the distribution with a minimal number of changing conditional distributions. Regarding Problem 2, as a sub-problem of causal discovery, we show that distribution shift provides additional information for causal direction identification. it is known that with functional causal model-based approaches, there are cases where causal directions are not identifiable, e.g., the linear-Gaussian case and the case with a general functional class (Hyv arinen and Pajunen, 1999; Zhang et al., 2015b). This restricts the causal direction identification to certain functional classes, e.g., additive (Shimizu et al., 2006; Hoyer et al., 2009; Zhang and Hyv arinen, 2009a) or post-nonlinear models (Zhang and Chan, 2006; Zhang and Hyv arinen, 2009b). We show that using information carried by distribution shifts does not suffer from these restrictions the method applies to general causal mechanisms. Specifically, we take advantage of property (iv) for causal direction determination: if there is no confounder for Vi and Vj, then the causal mechanisms, represented by the conditional distributions P(Vi | PAi) and P(Vj | PAj), change independently across data sets or over time. However, independence typically no longer holds for wrong directions. This gives rise to causal asymmetry. To exploit this asymmetry, we develop a kernel embedding Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf of nonstationary conditional distributions to represent changing causal mechanisms and accordingly propose a dependence measure to determine causal directions. Furthermore, it is worth noting that although our method can been seen as an extension of constraint-based methods such as PC (Spirtes et al., 2001), unlike the original ones, it can find the causal direction between even two variables, thanks to the surrogate variable C in the system or more generally, the independent change property implied by a causal system. Regarding Problem 3, traditionally, one may use Bayesian change point detection to detect change points of observed time series (Adams and Mac Kay, 2007) or use sliding windows and then estimate the causal model within each segment separately. However, Bayesian change point detection can only be applied to detect changes in marginal or joint distributions, whereas causal mechanisms are represented by conditional distributions. Moreover, neither of them is appropriate when causal mechanisms change continuously over time. More recently, a window-free method has been proposed, by extending Gaussian process regression (Huang et al., 2015). However, it requires the assumption of linearity, and it fails to handle the case when nonstationarity results from the change of noise distributions. In this paper, by leveraging property (iii), we propose a nonparametric method to recover a low-dimensional and interpretable representation of mechanism changes, which does not rely on window segmentation. This paper is organized as follows.1 In Section 2 we define and motivate the problem in more detail and review related work. Section 3 proposes an enhanced constraint-based method to recover the causal skeleton over observed variables and identify variables with changing causal mechanisms. Section 4 develops a method for determining causal directions by exploiting distribution shifts. It makes use of the property that in a causal system, causal modules change independently if there are no confounders. Section 5 proposes a method, termed Kernel Nonstationary Visualization (KNV), to visualize a low-dimensional and interpretable representation of changing mechanisms, the so-called driving force . In Section 6, we extend CD-NOD to the case that allows both time-varying lagged and instantaneous causal relationships, and we discuss whether distribution shifts also help for causal discovery when there exist stationary confounders. In addition, we give a procedure to leverage both CD-NOD and approaches based on constrained functional causal models. In Section 7, we show the connection between heterogeneity/nonstationarity and soft intervention in causal 1. This paper is built on the ar Xiv paper by Zhang et al. (2015a) and the conference papers by Zhang et al. (2017) and Huang et al. (2017) but is significantly extended in several aspects. We reformulate assumptions in Section 3.1. We extend Section 3.2 to show how we detect pseudo confounders for nonadjacent variables. In Section 4.1, we add Algorithm 2 which uses generalization of invariance to identify causal directions. In Section 4.2, we propose a new approach to efficiently identify causal directions using independent changes between causal mechanisms and detect pseudo confounders behind adjacent variables (Algorithm 3). In Section 4.3, we give identifiability conditions of CD-NOD and define the equivalence class that CD-NOD can achieve if those conditions do not hold. Furthermore, we extend CD-NOD to the case when there exist both time-varying instantaneous and lagged causal relationships (Section 6.1). Accordingly, we propose Algorithm 5 to efficiently recover both instantaneous and time-lagged causal relationships. In Section 6.2, we further discuss whether distribution shifts also help for causal discovery when there exist stationary confounders. With CD-NOD, some causal directions may not be identifiable, if the identifiability conditions are not satisfied (Theorem 2). To make the method more applicable, we combine our framework with approaches based on constrained functional causal models (Section 6.3). In Section 7, we show that heterogeneity/nonstationarity and soft intervention are related in causal discovery, and we find that our proposed method is even more effective. Causal Discovery from Heterogeneous/Nonstationary Data discovery. Section 8 reports experimental results tested on both synthetic and real-world data sets, including task-f MRI data, Hong Kong stock data, and US stock data. 2. Problem Definition and Related Work In this section, we first review causal discovery approaches with fixed causal models. Then we give examples to show that if the underlying causal model changes, directly applying approaches with fixed causal models may result in spurious edges or wrong causal directions, which motivates our work in causal discovery with changing causal models. 2.1. Causal Discovery of Fixed Causal Models Most causal discovery methods assume that there is a fixed causal model underlying the observed data and aim to estimate it from the data. Classic approaches to causal discovery divide roughly into two types. In the late 1980 s and early 1990 s, it was noted that under appropriate assumptions, one could recover a Markov equivalence class of the underlying causal structure based on conditional independence relationships among the variables (Spirtes et al., 1993). This gave rise to the constraint-based approach to causal discovery, and the resulting equivalence class may contain multiple DAGs (or other graphical objects to represent causal structures), which entail the same conditional independence relationships. The required assumptions include the causal Markov condition and the faithfulness assumption, which entail a correspondence between d-separation properties in the underlying causal structure and statistical independence properties in the data. The so-called score-based approach (see, e.g., Chickering, 2003; Heckerman et al., 1995) searches for the equivalence class which gives the highest score under some scoring criteria, such as the Bayesian Information Criterion (BIC), the posterior of the graph given the data (Heckerman et al., 2006), and the generalized score functions (Huang et al., 2018). Another set of approaches is based on constrained functional causal models, which represent the effect as a function of the direct causes together with an independent noise term. The causal direction implied by the constrained functional causal model is generically identifiable, in that the model assumptions, such as the independence between the noise and cause, hold only for the true causal direction and are violated for the wrong direction. Examples of such constrained functional causal models include the linear non-Gaussian acyclic model (Li NGAM, Shimizu et al., 2006), the additive noise model (Hoyer et al., 2009; Zhang and Hyv arinen, 2009a), and the post-nonlinear causal model (Zhang and Chan, 2006; Zhang and Hyv arinen, 2009b). 2.2. With Changing Causal Models Suppose that we are given a set of observed variables V = {Vi}m i=1 whose causal structure is represented by a DAG G. For each Vi, let PAi denote the set of parents of Vi in G. Suppose that at each time point or in each domain, the joint probability distribution of V factorizes according to G: i=1 P(Vi | PAi). (1) Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf V1 V2 V3 V4 V1 V2 V3 V4 Figure 1: An illustration on how ignoring changes in the causal model may lead to spurious edges by constraint-based methods. (a) The true causal graph (including confounder g(C), which is hidden). (b) The estimated causal skeleton on the observed data in the asymptotic case given by constraint-based methods, e.g., PC or FCI. We call each P(Vi | PAi) a causal module (the same meaning with causal mechanism in previous sections). If there are distribution shifts (i.e., P(V) changes across domains or over time), at least some causal modules P(Vk | PAk), k N must change. We call those causal modules changing causal modules. Their changes may be due to changes of the involved functional models, causal strengths, noise levels, etc. We assume that those quantities that change across domains or over time can be written as functions of a domain or time index, and denote by C such an index. If the changes in some modules are related, one can treat the situation as if there exists some unobserved quantity (confounder) which influences those modules and, as a consequence, the conditional independence relationships in the distribution-shifted data will be different from those implied by the true causal structure. Therefore, standard constraintbased algorithms such as PC or FCI (Spirtes et al., 1993) may not be able to reveal the true causal structure. As an illustration, suppose that the observed data were generated according to Fig. 1(a), where g(C), a function of C, is involved in the generating processes for both V2 and V4; the causal skeleton over the observed data then contains spurious edges V1 V4 and V2 V4, as shown in Fig. 1(b), because there is only one conditional independence relationship, V3 V1 | V2. (a) On data set 1 (b) On data set 2 (c) On merged data sets (d) On merged data sets Figure 2: An illustration of a failure of using the approach based on functional causal models for causal direction determination when the causal model changes. (a) Scatter plot of V1 and V2 on data set 1. (b) Scatter plot of V1 and V2 on data set 2. (c) Scatter plot of V1 and V2 on merged data (both data sets). (d) Scatter plot of V1 and the estimated regression residual ˆE on merged data by regressing V2 on V1. Causal Discovery from Heterogeneous/Nonstationary Data Moreover, when one fits a fixed functional causal model (e.g., a linear, non-Gaussian model, Shimizu et al., 2006) to distribution-shifted data, the estimated noise may not be independent of the cause. Consequently, the approach based on constrained functional causal models, in general, cannot infer the correct causal structure either. Figure 2 gives an illustration of this point. Suppose that we have two data sets for variables V1 and V2: V2 is generated from V1 according to V2 = 0.3V1 + E in the first data set and according to V2 = 0.7V1 + E in the second one, and in both data sets V1 and E are mutually independent and follow a uniform distribution. Figure 2(a-c) show the scatter plots of V1 and V2 on data set 1, on data set 2, and on merged data, respectively. Figure 2(d) shows the scatter plot of V1, the cause, and the estimated regression residual on the merged data set by regressing V2 on V1; they are not independent anymore, although on either data set the regression residual is independent of V1. Thus, we cannot correctly determine the causal direction. To tackle the issue of changing causal models, one may try to find causal models in each domain, for data from multiple domains, or in each sliding window, for nonstationary data, separately, and then compare and merge them. For instance, regarding the former type of data from multiple domains, in particular, multiple data sets obtained by external interventions where it is unknown what variables are manipulated, He and Geng (2016) considered two settings, depending on the sample size of each data set in the setting with a large sample size for each data set, they proposed a graph-merging method after learning a causal network in each domain separately; in the setting with a relative small sample size, they proposed to pool together the data to learn a network structure and then use a re-sampling approach to evaluate the edges of the learned network. Regarding nonstationary data, improved versions include the online change point detection method (Adams and Mac Kay, 2007), the online undirected graph learning (Talih and Hengartner, 2005), and the locally stationary structure tracker algorithm (Kummerfeld and Danks, 2013). Such methods may suffer from high estimation variance due to sample scarcity, large type II errors, or multiple testing problems from a large number of statistical tests. Some methods aim to estimate the time-varying causal model by making use of certain types of smoothness of the change (Huang et al., 2015), but they do not explicitly locate the nonstationary causal modules. Several methods aim to model time-varying time-delayed causal relations (Xing et al., 2010; Song et al., 2009), which can be reduced to online parameter learning because the direction of causal relations is given (i.e., the past influences the future). Compared to them, learning changing instantaneous causal relations, with which we are concerned in this paper, is generally more difficult. Recently, several methods have been proposed to tackle time-varying or domain-varying instantaneous causal relations (Ghassami et al., 2018; Huang et al., 2019a,b, 2020). However, they assume linear causal models, limiting their applicability to complex problems with nonlinear causal relations. In contrast, we develop a nonparametric and computationally efficient method that can identify changing causal modules and reliably recover the causal structure. We show that distribution shifts actually contain useful information for the purpose of determining causal directions and develop practical algorithms accordingly. After identifying the causal structure, we propose a method to estimate a low-dimensional and interpretable representation of changes. Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf 3. CD-NOD Phase I: Changing Causal Module Detection and Causal Skeleton Estimation In this section, we first formalize the assumptions that will be used in CD-NOD. Specifically, we allow a particular type of confounders pseudo confounders, and we do not put hard restrictions on functional forms of causal mechanisms and data distributions. Accordingly, we propose an approach to efficiently detect changing causal modules and identify the causal skeleton; we call this step CD-NOD phase I. We show that the proposed approach is guaranteed to asymptotically recover the true graph as if unobserved changing factors were known. 3.1. Assumptions In this paper, we allow changes in causal modules and some of the changes to be related; the related changes can be explained by positing particular types of confounders. Intuitively, such confounders may refer to some high-level background variables. For instance, for f MRI data, they may be the subject s attention or some unmeasured background stimuli; for the stock market, they may be related to economic policies. Thus, we do not assume causal sufficiency for the set of observed variables. Instead, we assume pseudo causal sufficiency as stated below. Assumption 1 (Pseudo Causal Sufficiency) We assume that the confounders, if any, can be written as functions of the domain index or smooth functions of time 2. It follows that in each domain or at each time instance, the values of these confounders are fixed. To clearly express our basic idea in the presence of distribution shift, we focus on DAGs and assume pseudo causal sufficiency. Note that our approach is flexible enough to be extended to cover other types of graphs, e.g., graphs with confounders and graphs with cycles. Later in Section 6.2, we will discuss how nonstationarity helps when there exist stationary confounders. In table 1, we summarize descriptions of different types of confounders (latent common causes) that will be used in this paper, including pseudo confounders, stationary confounders, and nonstationary confounders. We start with contemporaneous causal relations; the mechanisms and parameters associated with the causal model are allowed to change across data sets or over time, or even vanish or appear in some domains or over some time periods. However, it is natural to generalize our framework to incorporate time-delayed causal relations (Section 6.1). Denote by {gl(C)}L l=1 the set of pseudo confounders (which may be empty). We further assume that for each Vi, its local causal process can be represented by the following structural equation model (SEM): Vi = fi PAi, gi(C), θi(C), ϵi , (2) 2. More specifically, for data with multiple domains, we require that the confounders can be written as a function of the domain index (i.e., it does not change within a domain); for nonstationary time series, we require that the confounder is a smooth function of the time index. Roughly speaking, the smoothness constraint requires the gradient of the function to not change rapidly. In practice, one may specify the level of smoothness in advance (say, by assuming the function follows a Gaussian process prior and properly setting the kernel width to some range) or learn it from data by maximizing marginal likelihood or cross validation. Causal Discovery from Heterogeneous/Nonstationary Data Confounder type Description Pseudo confounder It can be represented as functions of domain index or smooth functions of time index. Stationary confounder Its distribution is fixed and it cannot be represented as functions of domain index or smooth functions of time. Nonstationary confounder Its distribution changes across domains or over time and it cannot be represented as functions of domain index or smooth functions of time index, respectively. Table 1: Descriptions of different types of confounders (latent common causes). where gi(C) {gl(C)}L l=1 denotes the set of confounders that influence Vi (it is an empty set if there is no confounder behind Vi and any other variable), θi(C) denotes the effective parameters in the model that are also assumed to be functions of C, and ϵi is a disturbance term that is independent of C and PAi and has a non-zero variance (i.e., the model is not deterministic). We also assume that the ϵi s are mutually independent. Note that θi(C) is specific to Vi and is independent of θj(C) for i = j. The variable C can be the domain or time index. In special cases, e.g., the case with multiple domains and all of which have nonstationarity, C has two dimensions: one is the domain index and the other is the time index. The SEM given in Eq. 2 does not have any restrictions on data distributions or functional classes. In this paper we treat C as a random variable, and so there is a joint distribution over V {gl(C)}L l=1 {θi(C)}m i=1. We assume that this distribution is Markov and faithful to the graph resulting from the following additions to G (which, recall, is the causal structure over V): add {gl(C)}L l=1 {θi(C)}m i=1 to G, and for each i, add an arrow from each variable in gi(C) to Vi and add an arrow from θi(C) to Vi. We refer to this augmented graph as Gaug. Obviously, G is simply the induced subgraph of Gaug over V. Specifically, the assumption is summarized below. Assumption 2 The joint distribution over V {gl(C)}L l=1 {θi(C)}m i=1 is Markov and faithful to the augmented graph Gaug. In addition, there is no selection bias; i.e., the observed data are perfect random samples from the populations implied by the causal model. The distribution change across domains or over time can be considered in the following way. In the case when C is the domain index, C follows a uniform distribution over all possible values, and we have a particular way to generate its value: all possible values are generated once, resulting in domain indices. In the case when C is the time index, we take time to be a special random variable which follows a uniform distribution over the considered time period, with the corresponding data points evenly sampled at a certain sampling frequency. Correspondingly, the generating process of nonstationary data can be considered as follows: we generate random values from C, and then we generate data points over V according to the SEM in (2). The generated data points are then sorted in ascending order according to the values of C. In other words, we observe the distribution P(V|C), where P(V|C) may change across different values of C, resulting in non-identical distributions of data. Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf 3.2. Detection of Changing Modules and Recovery of Causal Skeleton In this section, we propose a method to detect variables whose causal modules change and infer the skeleton of G. The basic idea is simple: we use the (observed) variable C as a surrogate for the unobserved {gl(C)}L l=1 {θi(C)}m i=1, or in other words, we take C to capture C-specific information. We now show that given the assumptions in Section 3.1, we can apply conditional independence tests to V C to detect variables with changing modules and recover the skeleton of G. We consider C as a surrogate variable (it itself is not a causal variable, it is always available, and confounders and changing parameters are its functions): by adding only C to the variable set V, the skeleton of G and the changing causal modules can be estimated as if {gl(C)}L l=1 {θi(C)}m i=1 were known. This is achieved by Algorithm 1 and supported by Theorem 1. Algorithm 1 Detection of Changing Modules and Recovery of Causal Skeleton 1. Build a complete undirected graph UG on the variable set V C. 2. (Detection of changing modules) For each i, test for the marginal and conditional independence between Vi and C. If they are independent given a subset of {Vk | k = i}, remove the edge between Vi and C in UG. 3. (Recovery of causal skeleton) For every i = j, test for the marginal and conditional independence between Vi and Vj. If they are independent given a subset of {Vk | k = i, k = j} {C}, remove the edge between Vi and Vj in UG. The procedure given in Algorithm 1 outputs an undirected graph UG that contains C as well as V. In Step 2, whether a variable Vi has a changing module is decided by whether Vi and C are independent conditional on some subset of other variables. The justification for one side of this decision is trivial. If Vi s module does not change, that means P(Vi | PAi) remains the same for every value of C, and so Vi C | PAi. Thus, if Vi and C are not independent conditional on any subset of other variables, Vi s module changes with C, which is represented by an edge between Vi and C. Conversely, we assume that if Vi s module changes, which entails that Vi and C are not independent given PAi, then Vi and C are not independent given any other subset of V\{Vi}. If this assumption does not hold, then we only claim to detect some (but not necessarily all) variables with changing modules. Step 3 aims to discover the skeleton of the causal structure over V. It leverages the results from Step 2: if neither Vi nor Vj is adjacent to C, then C does not need to be involved in the conditioning set. In practice, one may apply any constraint-based search procedures on V C, e.g., SGS and PC (Spirtes et al., 1993). Its (asymptotic) correctness is justified by the following theorem: Theorem 1 Given Assumptions 1 and 2, for every Vi, Vj V, Vi and Vj are not adjacent in G if and only if they are independent conditional on some subset of {Vk | k = i, k = j} {C}. Basic idea of the proof. For a complete proof see Appendix A. The only if direction is proved by making use of the weak union property of conditional independence repeatedly, the fact that all gl(C) and θi(C) are deterministic functions of C, some implications of the Causal Discovery from Heterogeneous/Nonstationary Data SEMs Eq. 2, the assumptions in Section 3.1, and the properties of mutual information given in Madiman (2008). The if direction is shown based on the faithfulness assumption on Gaug and the fact that {gl(C)}L l=1 {θi(C)}m i=1 is a deterministic function of C. Furthermore, for any pair of nonadjacent variables Vi and Vj with Vi C and Vj C, we can easily detect whether there are pseudo confounders behind Vi and Vj from the independence test results derived from Algorithm 1: 1. If Vk V\{Vi, Vj}, Vi Vj|Vk, and Vk V\{Vi, Vj}, so that Vi Vj|{Vk , C}, then there exist pseudo confounders behind Vi and Vj. 2. If Vk V\{Vi, Vj}, so that Vi Vj|Vk, then there is no pseudo confounder behind Vi and Vj. Note that in Algorithm 1, it is crucial to use a general, nonparametric conditional independence test, for how variables depending on C is unknown and usually very nonlinear. In this work, we use the kernel-based conditional independence test (KCI-test, Zhang et al., 2011) to capture the dependence on C in a nonparametric way. By contrast, if we use, for example, tests of vanishing partial correlations, as is widely used in the neuroscience community, the proposed method may not work well. Moreover, it is worth noting that the estimated graphical representation by Algorithm 1 naturally follows the principle of minimal changes, which was explicitly formulated by Ghassami et al. (2018). This is because faithfulness on the augmented graph implies the edge minimality condition in the graphical representation. Any variable adjacent to C in the augmented graph has a changing mechanism, and the estimated graph by Algorithm 1 has as few edges involving C as possible; hence it has the smallest number of changing causal mechanisms (or conditional distributions). 4. CD-NOD Phase II: Distribution Shifts Benefit Causal Direction Determination We now show that introducing the additional variable C as a surrogate not only allows us to infer the skeleton of the causal structure but also facilitates the determination of some causal directions. Let us call those variables that are adjacent to C in the output of Algorithm 1 C-specific variables , which are actually the effects of changing causal modules. For each C-specific variable Vk, it is possible to determine the direction of every edge which has an endpoint on Vk. Let Vl be any variable adjacent to Vk in the output of Algorithm 1. Then there are two possible scenarios to consider: S1. Vl is not adjacent to C. Then C Vk Vl forms an unshielded triple. For practical purposes, we take the direction between C and Vk as C Vk (though we do not claim C to be a cause in any substantial sense). Then we can use standard orientation rules for unshielded triples to orient the edge between Vk and Vl (Spirtes et al., 1993; Pearl, 2000). There are two possible situations: 1.a If Vl and C are independent given a set of variables excluding Vk, then the triple is a V-structure, and we have Vk Vl. 1.b Otherwise, if Vl and C are independent given a set of variables including Vk, then the triple is not a V-structure, and we have Vk Vl. Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf S2. Vl is also adjacent to C. This case is more complex than S1, but it is still possible to identify the causal direction between Vk and Vl, based on the principle that P(cause) and P(effect | cause) change independently; a heuristic method is given in Section 4.2. The procedure in S1, which will be further discussed in Section 4.1, contains the methods proposed in Hoover (1990); Tian and Pearl (2001); Peters et al. (2016) for causal discovery from changes as special cases. It may also be interpreted as special cases of the principle underlying the method for S2: if one of P(cause) and P(effect | cause) changes while the other remains invariant, they are clearly independent. 4.1. Causal Direction Identification by Generalization of Invariance There exist methods for causal discovery from differences among multiple data sets (Hoover, 1990; Tian and Pearl, 2001; Peters et al., 2016) that explore invariance of causal mechanisms. They used linear models to represent causal mechanisms and, as a consequence, the invariance of causal mechanisms can be assessed by checking whether the involved parameters change across data sets or not. Actually, S1.b above provides a nonparametric way to achieve this in light of nonparametric conditional independence tests. For any variable Vi and a set of variables S, the conditional distribution P(Vi | S) is invariant across different values of C if and only if P(Vi | S, C = c1) = P(Vi | S, C = c2), c1 and c2. This is exactly the condition under which Vi C | S. In other words, testing for invariance (or homogeneity) of the conditional distribution is naturally achieved by performing a conditional independence test on Vi and C given the set of variables S, for which there exist off-the-shelf algorithms and implementations. When S is the empty set, this reduces to a test of marginal independence between Vi and C, or a test of homogeneity of P(Vi). In S1.a, we have the invariance of P(Vl) (i.e., P(cause)) when the causal mechanism, represented by P(Vk|Vl) (i.e., P(effect | cause)), changes, which is complementary to the invariance of causal mechanisms in S1.b. The (conditional) independence test results between Vi and C are readily available from Algorithm 1 and can be applied to determine causal directions between variables which satisfy S1. The procedure is summarized in Algorithm 2. Algorithm 2 Causal Direction Identification by Generalization of Invariance 1. Input: causal skeleton UG from Algorithm 1. 2. Orient C Vk, for any variable which is adjacent to C. 3. For any unshielded triple with C Vk Vl, where Vl is not adjacent to C, a. if Vl C|S, with S V and S Vk = , orient Vk Vl; b. if Vl C|S, with S V and Vk S, orient Vk Vl. 4. Output: partially oriented graph UG using the property of generalization of invariance. Causal Discovery from Heterogeneous/Nonstationary Data θ1(C) θ2(C) Figure 3: An illustration of a two-variable case: V1 V2 with corresponding parameters θ1(C) and θ2(C) changing independently. Naturally, both invariance properties above are particular cases of the principle of independent changes of causal modules underlying the method for S2: if one of P(cause) and P(effect | cause) changes while the other remains invariant, they are clearly independent. Usually, there is no reason why only one of them could change, so the above invariance properties are rather restrictive. The property of independent changes holds in rather generic situations, e.g., when there is no confounder behind cause and effect. Below we will propose an algorithm for causal direction determination based on independent changes of causal modules. 4.2. Causal Direction Identification by Independently Changing Modules We now develop a method to handle S2 above. To clearly express the idea, let us start with a two-variable case: suppose V1 and V2 are adjacent and are both adjacent to C. We aim to identify the causal direction between them, which, without loss of generality, we assume to be V1 V2. Figure 3 shows the case where the involved changing parameters, θ1(C) and θ2(C), are independent, i.e., P(V1; θ1) and P(V2 | V1; θ2) change independently (we dropped the argument C in θ1 and θ2 to simplify notation). For the reverse direction, one can decompose the joint distribution of (V1, V2) according to P(V1, V2; θ 1, θ 2) = P(V2; θ 2)P(V1 | V2; θ 1), (3) where θ 1 and θ 2 are assumed to be sufficient for the corresponding distribution modules P(V2) and P(V1|V2). Generally speaking, θ 1 and θ 2 are not independent, because they are determined jointly by θ1 and θ2. Now we face the problem of how to compare the dependence between θ1 and θ2 with that between θ 1 and θ 2. Since θ is assumed to be sufficient for the corresponding distribution module, it is equivalent to compare the dependence between P(V1) and P(V2|V1) with that between P(V2) and P(V1|V2). The idea that causal modules are independent is not new (Pearl, 2000), but note that in a stationary situation where each module is fixed, such independence is very difficult, if not impossible, to test. By contrast, in the situation we are considering presently, both P(V1) and P(V2|V1) are changing, and we can try to measure the extent to which variation in P(V1) and variation in P(V2) are dependent (and similarly for P(V2) and P(V1|V2)). This is the sense in which distribution change actually helps in the identification of causal directions, Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf and as far as we know, this is the first time that such an advantage is exploited in the case where both P(cause) and P(effect | cause) change. We extend the Hilbert Schmidt Independence Criterion (HSIC, Gretton et al., 2008) to measure the dependence between causal modules. To do so, we first develop a novel kernel embedding of nonstationary conditional distributions which does not rely on sliding windows and estimate their corresponding Gram matrices in Section 4.2.1, which will be used in the extended HSIC and the causal direction determination rule in Section 4.2.2. In Section 4.2.3, we propose an algorithm for causal direction determination in multi-variable cases by taking advantage of independent changes. 4.2.1. Kernel Embedding of Constructed Joint Distributions Notation Throughout this section, we use the following notation. Let X be a random variable on domain X, and (H, k) be a Reproducing Kernel Hilbert Space (RKHS) with a measurable kernel on X. Let φ(x) H represent the feature map for each x X, with φ : X H. We assume integrability: EX[k(X, X)] . Similar notations are for variables Y and C. The cross-covariance operator CY X : H G is defined as CY X := EY X[φ(X) ψ(Y )], where G is the RKHS associated with Y . We represent causal modules P(Vi|PAi, C) by kernel embedding. Intuitively, to represent the kernel embedding of changing causal modules, we need to consider P(Vi|PAi, C) for each value of C separately. If C is a domain index, for each value of C we have a dataset of (Vi, PAi). If C is a time index, one may use a sliding window to use the data of (Vi, PAi) in the window of length L centered at C = c. However, in some cases, it might be hard to find an appropriate window length L, especially when the causal module changes fast. In the following, we propose a way to estimate the kernel embedding of changing causal modules on the whole dataset, avoiding window segmentation. For the sake of conciseness, below we use Y and X to denote Vi and PAi, respectively. Suppose that there are N samples for each variable. Instead of working with P(Y |X, C = cn) (n = 1, , N) directly, we virtually construct a particular distribution P(Y , X | C = cn) as follows:3 P(Y , X|C = cn) = P(Y |X, C = cn)P(X). (4) The embedding of this joint distribution of X and Y is simpler than that of the conditional of Y given X. Since P(X) does not depend on C and its support is rich enough to contain that of P(X|C = cn), one can see that whenever there are changes in P(Y |X, C = cn) across different values of cn, there must be changes in P(Y , X|C = cn), and vice versa. In other words, the constructed distribution P(Y , X|C = cn) captures changes in P(Y |X, C = cn) across different cn. We let P(Y , X, C = cn) = P(Y |X, C = cn)P(X)P(C = cn). Proposition 1 shows that the kernel embedding of the distribution P(Y , X|C = cn) can be estimated on the whole data set, without window segmentation. Proposition 1 Let X represent the direct causes of Y , and suppose that they have N observations. The kernel embedding of distribution P(Y , X|C = cn) can be represented as ˆ µY ,X|C=cn = 1 nΦy(Kx Kc + λI) 1diag(kc,cn)KxΦ x, 3. Here we use Y instead of Y to emphasize that in this constructed distribution Y and X are not symmetric. Causal Discovery from Heterogeneous/Nonstationary Data where Φy := [φ(y1), ..., φ(y N)], Φx := [φ(x1), ..., φ(x N)], kc,cn := [k(c1, cn), ..., k(c N, cn)] , Kx(xt, xt ) = φ(xt), φ(xt ) , Kc(ct, ct ) = φ(ct), φ(ct ) , and represents point-wise product. The detailed proof of proposition 1 is given in Appendix B. Next we estimate the N N Gram matrix of ˆ µY ,X|C=c. We consider different kernels for the estimation of Gram matrix. Let Ml Y X represent the Gram matrix of ˆ µY ,X|C=c with a linear kernel, and MG Y X the Gram matrix of ˆ µY ,X|C=c with a Gaussian kernel. If we use a linear kernel, the (c, c )th entry of the Gram matrix Ml Y X is the inner product between ˆ µY ,X|C=c and ˆ µY ,X|C=c : Ml Y X(c, c ) tr(ˆ µ Y ,X|C=cˆ µY ,X|C=c ) N2 k c,c K3 x (Kx Kc + λI) 1Ky(Kx Kc + λI) 1 kc,c , (5) which is the (c, c )th entry of the matrix Ml Y X = 1 N2 Kc K3 x (Kx Kc + λI) 1Ky(Kx Kc + λI) 1 Kc. (6) If we use a Gaussian kernel with kernel width σ2, the Gram matrix is given by MG Y X(c, c ) =exp || µY ,X|C=c µY ,X|C=c ||2 F 2σ2 2 =exp Ml Y X(c, c) + Ml Y X(c , c ) 2Ml Y X(c , c) where || ||F denotes the Frobenius norm. This can be represented in matrix notation as MG Y X = exp diag(Ml Y X) 1N + 1N diag(Ml Y X) 2Ml Y X 2σ2 2 where diag( ) sets all off-diagonal entries zero, and 1N is an N N matrix with all entries being 1. Excitingly, we can see that with our methods, we do not need to explicitly learn the high-dimensional kernel embedding µY ,X|C=c for each c. With the kernel trick, the final Gram matrix can be represented by N N kernel matrices directly. There are several hyperparameters to set. The hyperparameters associated with Kx, Kc, and the regularization parameter λ in equation (6) are learned through a Gaussian process regression framework: they are learned by maximizing the marginal likelihood of Y . For the hyperparameters associated with Ky and the kernel with σ2 in equation (7), we set them with empirical values; please refer to Zhang et al. (2011) for details. Change in marginal distributions. As a special case, when we are concerned with how the marginal distribution of Y changes with C, i.e., when X = , we directly make use of µY |C=cn = CY CC 1 CCφ(cn). (9) Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf This can also be obtained by constraining X in µY ,X|C=cn to take a fixed value. Its empirical estimate is ˆµY |C=cn = 1 nΦcΦ c + λI) 1φcn = Φy(Kc + λI) 1kc,cn. (10) Then (c, c ) entry of the Gram matrix with a linear kernel is: Ml Y (c, c ) ˆµ Y |C=cˆµY |C=c = k c,c(Kc + λI) 1Φ yΦy(Kc + λI) 1kc,c = k c,c(Kc + λI) 1Ky(Kc + λI) 1kc,c , (11) which is the (c, c )th entry of Ml Y = Kc(Kc + λI) 1Ky(Kc + λI) 1Kc. (12) For a Gaussian kernel with kernel with σ2, the Gram matrix is MG Y = exp diag(Ml Y X) 1N + 1N diag(Ml Y X) 2Ml Y X 2σ2 2 4.2.2. Two-Variable Case In this section, we extend HSIC to measure the dependence between causal modules, based on which we determine causal directions. For simplicity, let us start with the two-variable case: suppose that X and Y are adjacent and both are adjacent to C. We aim to identify the causal direction between them, which, without loss of generality, we assume to be X Y . The guiding idea is that distribution shift may carry information that confirms the independence of causal modules, which, in the simple case we are considering, is the independence between P(X) and P(Y |X). If P(X) and P(Y |X) are independent but P(Y ) and P(X|Y ) are not, then the causal direction is inferred to be from X to Y . The dependence between P(X) and P(Y |X) can be measured by extending the HSIC (Gretton et al., 2008). HSIC Given a set of observations {(u1, v1), (u2, v2), ..., (u N, v N)} for variables U and V , HSIC provides a measure of dependence and a statistic for testing their statistical independence. Roughly speaking, it measures the squared covariances between feature maps of U and feature maps of V . Let MU and MV be the Gram matrices for U and V calculated on the sample, respectively. An estimator of HSIC is given by Gretton et al. (2008): HSICUV = 1 (N 1)2 tr(MUHMV H), (14) where H is used to center the features, with entries Hij := δij N 1. Causal Discovery from Heterogeneous/Nonstationary Data In what follows, we will use a normalized version of the estimated HSIC, which is invariant to the scale in MU and MV : HSICN UV = HSICUV 1 N 1tr(MUH) 1 N 1tr(MV H) = tr(MUHMV H) tr(MUH)tr(MV H). (15) Dependence between Changing Modules In our case, we aim to check whether P(Y |X) and P(X) change independently along with C. We work with the estimate of their embeddings. Then we can think of ˆµX|C=cn, ˆ µY ,X|C=cn N n=1 as the observed data pairs and measure their dependence from the data pairs. This can be done by applying (the normalized version of) the estimate of HSIC given in Eq. 15 to the above data pairs. The expression then involves MX, the Gram matrix of ˆµX|C at C = c1, c2, ..., c N, and MY X, the Gram matrix of ˆ µY X|C at C = c1, c2, ..., c N. In particular, the dependence between P(Y |X) and P(X) on the given data can be estimated by ˆ X Y = tr(MXHMY XH) tr(MXH)tr(MY XH). (16) Similarly, for the hypothetical direction Y X the dependence between P(X|Y ) and P(Y ) on the data is estimated by ˆ Y X = tr(MY HMXY H) tr(MY H)tr(MXY H). (17) We then have the following rule to determine the causal direction between X and Y . Causal Direction Determination Rule Suppose that X and Y are two random variables with N observations. We assume that X and Y are adjacent and both are adjacent to C and assume no pseudo confounders behind them. The causal direction between X and Y is then determined according to the following rule: if ˆ X Y < ˆ Y X, output X Y ; if ˆ X Y > ˆ Y X, output X Y . In practice, there may exist pseudo confounders. In such a case, we set a threshold α on ˆ . If ˆ X Y > α and ˆ Y X > α, we conclude that there are pseudo confounders which influence both X and Y and leave the direction undetermined. 4.2.3. With More Than Two Variables The causal direction determination rule in the two-variable case can be extended to learn causal directions in multi-variable cases. Suppose that we have m observed random variables {Vi}m i=1 and a partially oriented graph UG derived from Algorithms 1 and 2. Let VS be the subset of {Vi}m i=1, such that Vi VS if and only if Vi s causal module changes. Note that generally speaking, different from the unconfounded two-variable case, when identifying the causal direction between an unoriented pair of adjacent variables, we need to remove the effect from their common causes. Thus, before moving forward, we first define deconfounding set and potential deconfounding set of a pair of adjacent variables in UG. Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf Figure 4: An illustration of the definitions of minimal deconfounding set and minimal potential deconfounding set on a partially oriented graph. Definition 1 (Deconfounding Set) A set of variables Z V\{Vl, Vk} is the deconfounding set of a pair of adjacent variables (Vl, Vk), if (i) no node in Z is a descendant of Vl or Vk, (ii) and Z blocks every path between Vl and Vk that contains arrows into Vl and Vk. Furthermore, a set of variables Z is the minimal deconfounding set of a pair of adjacent variables (Vl, Vk), if any Zs Z is not a deconfounding set. Definition 2 (Potential Deconfounding Set) A set of variables Z V\{Vl, Vk} is the potential deconfounding set of a pair of adjacent variables (Vl, Vk), if (i) no node in Z is a descendant of Vl or Vk, (ii) Z blocks every path between Vl and Vk that does not contain an arrow out of Vl or Vk, (iii) and any Z Z is not in the deconfounding set. Similarly, a set of variables Z is the minimal potential deconfounding set of a pair of adjacent variables (Vl, Vk), if any subset Zs Z is not a potential deconfounding set. In Figure 4, for example, the set Z = {V3} is a minimal deconfounding set of (V1, V2), and the set Z = {V5} is a minimal potential deconfounding set of (V1, V2). We take advantage of the independence between causal modules to identify directions. To efficiently identify causal directions using independent changes when there are multiple variables, we propose Algorithm 3, with the main procedure as follows. For each undirected pair Vl, Vk VS, we denote their minimal deconfounding set by Z(1) lk and their minimal potential deconfounding set by Z(2) lk . Note that there may be multiple minimal deconfounding sets and multiple minimal potential deconfounding sets; to search for such sets more efficiently, in our implementation, we only consider those ones with every variable in Z(1) lk and Z(2) lk being adjacent to Vl. Let Z(2,n) lk be a subset of Z(2) lk , where n is the total cardinality of Z(1) lk and Z(2,n) lk ; i.e., |Z(1) lk | + |Z(2,n) lk | = n. We evaluate the dependence between P(Vk, Z(1) lk , Z(2,n) lk ) and P(Vl|Vk, Z(1) lk , Z(2,n) lk ), and that between P(Vl, Z(1) lk , Z(2,n) lk ) and P(Vk|Vl, Z(1) lk , Z(2,n) lk ). If we find that P(Vl, Z(1) lk , Z(2,n) lk ) P(Vk|Vl, Z(1) lk , Z(2,n) lk ) and that P(Vk, Z(1) lk , Z(2,n) lk ) P(Vl|Vk, Z(1) lk , Z(2,n) lk ), we output Vl Vk, and if there are unoriented edges from variables in Z(2,n) lk to Vk or Vl, then we consider those variables as Causal Discovery from Heterogeneous/Nonstationary Data parents. Similarly, if we find that P(Vk, Z(1) lk , Z(2,n) lk ) P(Vl|Vk, Z(1) lk , Z(2,n) lk ) and that P(Vl, Z(1) lk , Z(2,n) lk ) P(Vk|Vl, Z(1) lk , Z(2,n) lk ), we output Vk Vl, instead. Similar to the search procedure of PC, we start from n = 0, evaluate the dependence between corresponding modules for each undirected pair, and then let n = n + 1 and repeat the procedure until no unoriented pairs have the total cardinality of minimal deconfounding set and minimal potential deconfounding set greater than or equal to n. Note that in Algorithm 3, we use dependence measures in (16) and (17) to determine the independence between causal modules. Particularly, we set a threshold α. If the dependence measure ˆ α, then the corresponding modules are independent; otherwise, they are dependent. For a pair of adjacent variables Vk, Vl VS, if their direction is undetermined by Algorithm 3 and all their measured modules are dependent, then there exist pseudo confounders behind Vk and Vl. Furthermore, for variables whose causal modules are stationary and which are adjacent only to variables with stationary modules, the causal directions between them cannot be determined by Algorithm 2 or 3. In such a case, one may further infer some causal directions by making use of Meek s orientation rules (Meek, 1995) used in PC. To better illustrate the algorithm, we go through a simple example, for which the true graph and brief orientation procedures are given in Figure 5. Below is the precise orientation procedure. 1. In step 1, we have observed variables {Vi}4 i=1, with Vi VS for any i, and an unoriented graph UG over {Vi}4 i=1 after Algorithms 1 and 2. Since all causal modules are changing, there is no invariance property that can be used for orientation determination in Algorithm 2. 2. In step 2, we start from n = 0. For example, for the unoriented pair of adjacent variables (V1, V2), Z(1) 12 = { } and Z(2,n) 12 = { }. In this case, we have P(V1) P(V2|V1) and P(V2) P(V1|V2), and thus we cannot determine the direction between V1 and V2. Similarly, we cannot determine the direction between V2 and V4. For the unoriented pair (V1, V3), we have that P(V3) P(V1|V3) and that P(V1) P(V3|V1), and thus we output V3 V1. Similarly, we can determine the direction between V3 and V4 and output V3 V4. Then let n = 1. For the unoriented pair (V1, V2), Z(1) 12 = { } and Z(2,n) 12 = {V3}. We then have that P(V1, V3) P(V2|V1, V3) and that P(V2, V3) P(V1|V2, V3), and thus we output V1 V2. For the unoriented pair (V2, V4), Z(1) 24 = {V3} and Z(2,n) 24 = { }. We then have that P(V4, V3) P(V2|V4, V3) and that P(V2, V3) P(V4|V2, V3), and thus output V4 V2. 3. We output the fully identified causal graph. 4.3. Identifiability Conditions of CD-NOD In this section, we first give identifiability conditions of CD-NOD, and then we define the equivalence class that CD-NOD can identify if corresponding conditions do not hold. We make use of independent changes and orientation rules, including discovering Vstructures and using orientation propagation, to determine causal directions. To fully identify the DAG, we ought to make some assumption on the change property of the distribution in wrong directions. In particular, we have the following assumption. Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf Algorithm 3 Causal Direction Identification by Independent Changes of Causal Modules 1. Input: observations of {Vi}m i=1, a subset VS V ( V VS, V has a changing causal module), a partially oriented causal graph UG from Algorithms 1 and 2. 2. n = 0 Repeat i. Select an unoriented pair of adjacent variables (Vl, Vk), with Vl, Vk VS. A. Let Z(1) lk be the set of minimal deconfounding set of Vk and Vl; any Z Z(1) lk is adjacent to Vl. B. Let Z(2) lk be the set of minimal potential deconfounding set; any Z Z(2) lk is adjacent to Vl. ii. If |Z(1) lk | + |Z(2) lk | n, repeat A. Take Z(2,n) lk Z(2) lk , with |Z(2,n) lk | = n |Z(1) lk |. B. If P(Vk, Z(1) lk , Z(2,n) lk ) P(Vl|Vk, Z(1) lk , Z(2,n) lk ) and P(Vl, Z(1) lk , Z(2,n) lk ) P(Vk|Vl, Z(1) lk , Z(2,n) lk ), output Vk Vl; for Z Z(2,n) lk with Z Vl, output Z Vl; for Z Z(2,n) lk with Z Vk, output Z Vk. C. If P(Vk, Z(1) lk , Z(2,n) lk ) P(Vl|Vk, Z(1) lk , Z(2,n) lk ) and P(Vl, Z(1) lk , Z(2,n) lk ) P(Vk|Vl, Z(1) lk , Z(2,n) lk ), output Vl Vk; for Z Z(2,n) lk with Z Vl, output Z Vl; for Z Z(2,n) lk with Z Vk, output Z Vk. D. If one of the conditions in (B) and (C) holds, return to step (i); Until every unoriented pair of adjacent variables (Vl, Vk), with Vl, Vk VS, has been selected. 3. Output: graph UG, with edges between variables in VS oriented. Causal Discovery from Heterogeneous/Nonstationary Data True graph Graph after Algorithms 1 and 2 Independencies & Dependencies Resulting graphs V3 V4 n = 0: P(V3) P(V1|V3) & P(V1) P(V3|V1). So output V3 V1. V3 V4 P(V3) P(V4|V3) & P(V4) P(V3|V1). So output V3 V4. V3 V4 n = 1: P(V1, V3) P(V2|V1, V3) & P(V2, V3) P(V1|V2, V3). So output V1 V2. V3 V4 P(V4, V3) P(V2|V4, V3) & P(V2, V3) P(V4|V2, V3). So output V4 V2. Figure 5: An example to illustrate causal direction identification with Algorithm 3. Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf Assumption 3 If Vi Vj, with Vi, Vj V, and at least one of them has a changing mechanism, then P(Vi|Vj, Z) and P(Vj, Z) are dependent, where Z V\{Vl, Vk} is a minimal deconfounding set of (Vi, Vj). This assumption can be viewed as faithfulness on another level. Specifically, if Vi Vj, P(Vi|Vj, Z) and P(Vj, Z) involve the changing parameters both in Vi and Vj, so they are expected to be dependent. Now we are ready to give identifiability conditions of causal directions, which are stated in Theorem 2. Theorem 2 Under Assumptions 1, 2, and 3, the causal direction between adjacent variables Vi, Vj V is identifiable with CD-NOD, if at least one of the following three conditions is satisfied: (1) the edge between Vi and Vj is involved in a V-structure; (2) at least one of V i s causal module and V j s causal module changes, and if both, the causal modules of Vi and Vj are independent; (3) there exists an edge incident to one and only one of the variables in {Vi, Vj}. Please refer to Appendix C for detailed proofs. Note that CD-NOD does not require hard restrictions on the functional class of causal mechanisms and data distributions. Combined with Theorem 1, we know that for any pair of adjacent variables in graph G, if at least one of the three conditions in Theorem 2 holds, then the whole causal structure is identifiable. It is given in the following corollary. Corollary 1 Under Assumptions 1, 2, and 3, the whole causal graph is identifiable, if any pair of adjacent variables Vi and Vj in graph G satisfies at least one of the following three conditions: (1) the edge between Vi and Vj is involved in a V-structure; (2) at least one of V i s causal module and V j s causal module changes, and if both, the causal modules of Vi and Vj are independent; (3) there exists an edge incident to one and only one of the variables in {Vi, Vj}. Proof From Theorem 1, we know that the causal skeleton is identifiable. In addition, from Theorem 2, we know that for any pair of adjacent variables, if at least one of the three conditions is satisfied, then the direction is identifiable. Therefore, the whole causal structure is identifiable. For a pair of adjacent variables Vi and Vj, if none of the conditions in Theorem 2 holds, we may not be able to determine the causal direction between them. Thus, in this case, we cannot derive a fully identified causal graph but an equivalence class. Note that the causal skeleton is identifiable, as shown by Theorem 1, regardless of these conditions. Below we give a formal definition of equivalence class with CD-NOD. Causal Discovery from Heterogeneous/Nonstationary Data Definition 3 (CD-NOD Equivalence Class) Let G = (V, E) and G = (V, E ) be two DAGs over the same set of variables V. G and G are called CD-NOD equivalent, if and only if they satisfy the following properties. (1) G and G have the same causal skeleton. (2) For any pair of adjacent variables Vi, Vj V, if it satisfies at least one of the conditions given in Theorem 2, then the causal direction between Vi and Vj is the same in G and G . Remark: Another important factor that needs to be considered in practical problems is computational complexity. In CD-NOD phase I, the computational complexity of each kernel-based (conditional) independence test is O(N3), where N is the number of samples of each variable. The computational complexity of the search procedure with PC is bounded by (m+1)2mk 1 (k 1)! , where m is the number of observed variables, and k is the maximal degree of variables in V C. In CD-NOD phase II, the computational complexity of each dependence measure, with normalized HSIC, is O(N3). 5. CD-NOD Phase III: Nonstationary Driving Force Estimation In this section, we focus on the visualization of how causal module P(Vi | PAi, C) changes, i.e., where the changes occur, how fast it changes, and how to visualize the changes. We assume that we already know the causal structure and know which causal modules are changing (see Algorithms 1, 2, and 3). In the parametric case, if we know which parameters of the causal model PAi Vi are changing, e.g., the mean of a direct cause, the coefficients in a linear SEM, then we can estimate such parameters for different values of C and see how they change. However, such knowledge is usually not available, and for the sake of flexibility, it is generally better to model causal processes nonparametrically. Therefore, it is desirable to develop a general nonparametric procedure to capture the change of causal modules. We aim to find a low-dimensional and interpretable mapping of P(Vi | PAi, C) which captures its nonstationarity in a nonparametric way: λi(C) = hi(P(Vi | PAi, C)), (18) where hi is a nonlinear function, mapping the conditioning distribution P(Vi | PAi, C) to a low dimensional representation λi(C). We call λi(C) the nonstationary driving force of P(Vi | PAi, C). This formulation is rather general: any identifiable parameters in P(Vi | PAi, C) can be expressed this way, and in the nonparametric case, λi(C) can be seen as a statistic to summarize changes in P(Vi | PAi, C) along with different values of C. If P(Vi | PAi, C) does not change along with C, then λi(C) remains constant; otherwise, λi(C) is intended to capture the variability of P(Vi | PAi, C) across different values of C. Now there are two problems to solve. One is that given only observed data, how we represent the conditional distributions conveniently. The other is what method to use to enable λi(C) to capture the variability in the conditional distribution along with C. For the former problem, we represent changing conditional distributions by kernel embedding as that given in Proposition 1, which is readily achievable. For the latter one, Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf we use kernel principle component analysis (KPCA) to capture its variability along with C and accordingly propose a method called Kernel Nonstationary Visualization (KNV) of causal modules. In the following, for conciseness, we will use X and Y to denote Vi and PAi, respectively. Nonstationary Driving Force Estimation as Eigenvalue Decomposition Problems We use the estimated kernel embedding of distributions, ˆ µY ,X|C=cn (n = 1, , N), derived from Proposition 1 as the input, and aim to find ˆλ(C) as a (linear or nonlinear) transformation of µY ,X|C=cn, to capture its variability across different values of C. This can be achieved by exploiting KPCA techniques (Sch olkopf et al., 1998), which computes principal components in kernel spaces of the input. To perform KPCA, we need to know the Gram matrix of ˆ µY ,X|C=cn, which has already been estimated in Section 4.2.1. We use Ml Y X to represent the Gram matrix of ˆ µY ,X|C with a linear kernel, and MG Y X the Gram matrix of ˆ µY ,X|C with a Gaussian kernel. Then ˆλ(C) can be found by performing eigenvalue decomposition on the above Gram matrix, Ml Y X or Mg Y X; for details please see Sch olkopf et al. (1998). In practice, one may take the first few eigenvectors which capture most of the variance. We can see that with our methods, we do not need to explicitly learn the high-dimensional kernel embedding µY ,X|C=c for each c. With the kernel trick, the final Gram matrix can be represented by N N kernel matrices directly. Then the nonstationary driving force ˆλ(C) can be estimated by performing eigenvalue decomposition on the Gram matrix. Algorithm 4 summarizes the proposed KNV method. Algorithm 4 KNV of Causal Modules P(Y |X, C) 1. Input: N observations of X and Y . 2. Calculate Gram matrix MY X (see Eq. 6 for linear kernels and Eq. 8 for Gaussian kernels). 3. Find ˆλ(C) by directly feeding Gram matrix MY X to KPCA. That is, perform eigenvalue decomposition on MY X to find the nonlinear principal components ˆλ(C), as in Section 4.1 of Sch olkopf et al. (1998). In practice, we may take the first few eigenvectors which capture most of the variance. 4. Output: the estimation of nonstationary driving force ˆλ(C) of P(Y |X, C). 6. Extensions of CD-NOD In this section, we show how to extend CD-NOD to deal with more general scenarios. For example, we extend CD-NOD to allow both time-varying instantaneous and lagged causal relationships. We additionally discuss whether and how distribution shifts help for causal discovery in the presence of stationary confounders. We then give a procedure, which Causal Discovery from Heterogeneous/Nonstationary Data leverages both CD-NOD and approaches based on constrained functional causal models, to extend the generality of the proposed method. 6.1. With Time-Varying Instantaneous and Lagged Causal Relationships In many scenarios, e.g., dynamic systems with insufficient time resolution, there may exist both instantaneous and time-lagged causal relationships over the measured data (Hyv arinen et al., 2010; Gong* et al., 2015). In this section, we extend CD-NOD proposed in Section 3 to recover both time-varying instantaneous and lagged causal relationships and identify changing causal modules. Suppose that there are m observed processes V(t) = (V1(t), , Vm(t))T with t = 1, , T, where there exist both time-varying instantaneous and lagged causal relations over these m processes. Further denote the largest time lag by P. To efficiently recover both instantaneous and time-lagged causal relations, we reorganize the set of variables V into V = V(1) V(P+1), with V(1) = V (1) 1 , V (1) 2 , , V (1) m , V(2) = V (2) 1 , V (2) 2 , , V (2) m , V(P+1) = V (P+1) 1 , V (P+1) 2 , , V (P+1) m , where V (k) i = Vi(k), Vi(k + 1), , Vi(T P + k 1) , indicating the ith process from the kth time point to the (T P + k 1)th time point, for i = 1, , m and k = 1, , P + 1. After reorganization, in total there are m (P + 1) variables. We constrain that the future cannot cause the past; i.e., variables in V(i) cannot cause variables in V(j) when i > j. Figure 6 gives an illustration of a two-variable case with time lag P = 1. Figure 6(a) gives the repetitive causal graph over two processes V(t) = (V1(t), V2(t))T with t = 1, , T. Figure 6(b) gives the unit causal graph over the reorganized set of variables V = V(1) V(2). In this case, we aim to recover the (time-varying) instantaneous causal relations between V (2) 1 and V (2) 2 , and the (time-varying) lagged causal relations from V (1) 1 to V (2) 2 , V (1) 1 to V (2) 1 , and V (1) 2 to V (2) 2 . Note that in this example, when inferring the instantaneous causal relation between V (2) 1 and V (2) 2 , we should consider the influence of the lagged common cause V (1) 1 . We also add the surrogate variable C into the causal system to characterize distribution shifts. Algorithm 5 extends Algorithm 1 to recover both instantaneous and lagged causal skeletons and detect changing causal modules. The main procedure is as follows. We first construct a complete undirected graph over the reconstructed variable set V and C. Then in step 2, we detect changing modules for variables in V(P+1) by testing the independence between V (P+1) i and C given a subset of V(P+1)\V (P+1) i , for i = 1, , m. If they are independent, we remove the edge between V (k) i and C, for k = 1, , P + 1. In step 3, we recover lagged causal relations between variables in V(P+1) and those in V(P p+1), for p = 1, , P. For example, if V (P+1) i and V (P p+1) j are independent given a subset of V\{V (P+1) i , V (P p+1) j } C, then remove the edge between V (P+1 k) i and V (P p+1 k) j for k = 0, , P p. In step 4, we estimate the instantaneous causal skeleton between V (P+1) i and V (P+1) j (i = j). If V (P+1) i and V (P+1) j are independent given a subset of Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf V1(1) V1(2) V1(T 1) V1(T) V2(1) V2(2) V2(T 1) V2(T) V (1) 1 V (2) 1 V (1) 2 V (2) 2 (a) Repetitive causal graph. (b) Unit causal graph. Figure 6: A two-variable case with both instantaneous and one-lagged (P = 1) causal relationships. (a) Repetitive causal graph. (b) Unit causal graph. V\{V (P+1) i , V (P+1) j } C Zij, where Zij { V(k)}P k=1 are the lagged common causes of V (P+1) i and V (P+1) j , then we remove the edge between V (k) i and V (k) j for k = 1, , P + 1. Note that it is important to consider lagged common causes Zij in this step, e.g., V (1) 1 is the lagged common cause of V (2) 1 and V (2) 2 in Figure 6 (b). For the recovery of causal directions, lagged causal relations obey the rule that past causes future, which is obvious, while instantaneous causal directions can be inferred in the same way as that described in Algorithm 2 and 3. 6.2. With Stationary Confounders In this section, we discuss the case when there exist stationary confounders; that is, the distribution of the confounder is fixed. We find that distribution shifts over observed variables may still help estimate causal directions in such a case. To show how distribution shifts help, we analyze two-variable cases in Figure 7, where V1 and V2 are observed variables, and Z is a hidden variable which influences both V1 and V2. In Figure 7(a) & (b), the causal module of V1 changes, with changing parameter θ1(C), while causal modules of V2 and Z are fixed. In such a case, we have V1 V2 and V1 V2|C in both graphs. Furthermore, in (a) we have C V2 and C V2|V1, while in (b) we have C V2 and C V2|V1. We can see that distribution shift helps to distinguish between the two graphs when only one causal module changes. In Figure 7(c) & (d), both the causal modules of V1 and V2 change, and they change independently, while the distribution of Z is fixed. In these two graphs, we have V1 V2 and V1 V2|C, and we have P(V1) P(V2|V1) and P(V2) P(V1|V2), for the reason given below. Hence, graphs in (c) and (d) have the same independence over V C and between distribution modules, and thus we cannot distinguish between (c) and (d) without further information. The reason that P(V1) P(V2|V1) in (c) is as follows. In (c), P(V2|V1) can be represented as: P(V2|V1) = R P(V2|V1, Z)P(Z|V1) d Z, where P(V2|V1, Z) contains the information of θ2(C), and P(Z|V1) contains the information of θ1(C), so P(V2|V1) also contains the information of θ1(C). Since both P(V2|V1) and P(V1) contain the information of θ1(C), generally speaking, P(V2|V1) and P(V1) are not independent. Therefore, although, according to the causal module independence, P(V1|Z) P(V2|V1, Z), P(V1) and P(V2|V1) are dependent. Similarly, we can derive that P(V2) P(V1|V2) in (d). Causal Discovery from Heterogeneous/Nonstationary Data Algorithm 5 Detection of Changing Modules and Recovery of Causal Skeleton with both (Time-Varying) Instantaneous and Lagged Causal Relationships 1. Build a complete undirected graph UG over the variable set V C. 2. (Detection of changing modules) For each i (i = 1, , m), test for the marginal and conditional independence between V (P+1) i and C. If they are independent given a subset of V(P+1)\V (P+1) i , remove the edge between V (P+1) i and C in UG; at the same time, remove the edge between V (k) i (k = 1, , P) and C. 3. (Recovery of lagged causal skeleton) For the pth (p = 1, , P) lagged causal relationships, test for the marginal and conditional independence between the variables in V(P+1) and those in V(P p+1). Particularly, for V (P+1) i and V (P p+1) j , if they are independent given a subset of V\{V (P+1) i , V (P p+1) j } C, remove the edge between V (P+1) i and V (P p+1) j ; at the same time, remove the edge between V (P k+1) i and V (P p+1 k) j , for k = 0, , P p. 4. (Recovery of instantaneous causal skeleton) For every i = j, test for the marginal and conditional independence between V (P+1) i and V (P+1) j . Let Zij { V(k)}P k=1 be the lagged common causes of V (P+1) i and V (P+1) j , which can be derived from the output of Step 3 above. If V (P+1) i and V (P+1) j are independent given a subset of V(P+1)\{V (P+1) i , V (P+1) j } C Zij, remove the edge between V (P+1) i and V (P+1) j in UG; at the same time, remove the edge between V (k) i and V (k) j , for k = 1, , P. θ1(C) θ2(C) Z θ1(C) θ2(C) Z Figure 7: Two-variable cases with stationary confounders. (a) V1 V2 with the changing causal module of V1. (b) V1 V2 with the changing causal module of V1. (c) V1 V2 with the changing causal modules of both V1 and V2. (d) V1 V2 with the changing causal modules of both V1 and V2. Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf 6.3. Combination with Approaches Based on Constrained Functional Causal Models For a pair of adjacent variables Vi and Vj, if none of the identifiability conditions given in Theorem 2 are satisfied, the causal direction between Vi and Vj is undetermined with CD-NOD. In such a case, to infer the causal direction between Vi and Vj, one possible way is to further leverage the approaches based on constrained functional causal models, e.g., the linear non-Gaussian model (Shimizu et al., 2006), the nonlinear additive noise model (Hoyer et al., 2009; Zhang and Hyv arinen, 2009a), or the post-nonlinear model (Zhang and Chan, 2006; Zhang and Hyv arinen, 2009b). To determine the causal direction with functional causal model-based approaches, we have two scenarios to consider, depending on whether P(Vi, Vj) changes or not. 1. The joint distribution P(Vi, Vj) is fixed. In this scenario, we can fit a constrained functional causal model, e.g., the nonlinear additive noise model, as usual. Let Z V\{Vi, Vj} be a minimal deconfounding set of Vi and Vj. We first assume Vi Vj and fit an additive noise model Vj = fj(Vi, Z) + Ej, and test the independence between estimated residual ˆEj and hypothetical causes (Vi, Z), and then assume Vj Vi and fit an additive noise model in this direction Vi = fi(Vj, Z) + Ei, and test the independence between estimated residual ˆEi and hypothetical causes (Vj, Z). We choose the direction in which the independence between estimated residual and hypothetical causes holds. 2. The joint distribution P(Vi, Vj) changes. The fact that for Vi and Vj, none of the conditions in Theorem 2 is satisfied implies that neither Vi nor Vj is adjacent to C. It further infers that there are some variables in V with changing distributions and influencing Vi or Vj. Let Zi V\{Vi, Vj} represent a set of variables which directly influence Vi, and satisfies that Zi Zi, its distribution P(Zi) changes. Similarly, we denote Zj as parents of Vj that have changing distributions. Let Z V\{Vi, Vj} be a minimal deconfounding set of Vi and Vj, and let Z Z, with P( Z) fixed and Z Z\ Z, P(Z) changes. Note that it is easy to infer that Zi Z and Zj Z. We first fit an additive noise model by assuming Vi Vj Vj = fj(Vi, Z, Zj) + Ej, and test the independence between estimated residual ˆEj and hypothetical causes (Vi, Zj, Z), Causal Discovery from Heterogeneous/Nonstationary Data and then assume Vj Vi and fit an additive noise model accordingly Vi = fi(Vj, Z, Zi) + Ei, and test the independence between estimated residual ˆEi and hypothetical causes (Vj, Zi, Z). We choose the direction in which the independence between estimated residual and hypothetical causes holds. Note that in case 2, we cannot ignore the changing factors Zi and Zj in their corresponding functional causal models, because functional causal model-based approaches assume a fixed causal model. If we do not consider Zi or Zj, then the corresponding functions fi and fj, respectively, may change across data sets or over time. If there are unoriented edges to Vi or Vj, the potential minimal deconfounding set Z, and potential changing influences Zi and Zj are chosen in a heuristic way, similar to that in Algorithm 3. 7. Relations between Heterogeneity/Nonstationarity and Soft Intervention in Causal Discovery In this section, we show that heterogeneity/nonstationarity and soft intervention are tightly related in causal discovery, and the former may be more effective for causal discovery than the latter. The definition of soft intervention is as follows. Definition 4 (Soft Intervention (Eberhardt and Scheines, 2007)) Given a set of measured variables V, a soft intervention Ip on a variable S V is an intervention on S that satisfies the following constraints: when Ip = 1, Ip does not make S independent of their causes in V; i.e., it does not break any edges that are incident to S. The causal module P(S|PAS) is replaced by P (S|PAS, Ip = 1), where P (S|PAS, Ip = 1) = P(S|PAS, Ip = 0). (19) Otherwise, all terms remain unchanged. Additionally, Ip satisfies the following properties: Ip is a variable with two states, Ip = 1 or Ip = 0, whose state is known. When Ip = 0, the passive observational distribution over S obtains. Ip is a direct cause of S and only S. Ip is exogenous, that is, uncaused. Note that the soft intervention does not imply any structural changes over the variables in V; instead, it influences the manipulated probability distribution. Heterogeneity/nonstationarity can be seen as a consequence of soft interventions done by nature. Recall that the heterogeneous/nonstationary causal modules are defined as Vi = fi PAi, gi(C), θi(C), ϵi , Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf where gi(C) {gl(C)}L l=1 denotes the set of confounders that influence Vi (it is an empty set if there is no confounder behind Vi and any other variable), θi(C) denotes the effective parameters in the model that are also assumed to be functions of C, and ϵi is a disturbance term that is independent of C. The distribution of P(Vi|PAi, C) changes when C takes different values, i.e., P(Vi|PAi, C = c) = P(Vi|PAi, C = c ). (20) This means that across the values of C, c and c , there exists a soft intervention on Vi. Our developed framework in causal discovery from heterogeneous/nonstationary data is more applicable than that from Eberhardt and Scheines (2007) in the following aspects: 1. In Eberhardt and Scheines (2007), Ip is known and only contains two states, while in our case we can detect where it happens in an automated way, and the changes can be discrete (across domains) or continuous (over time). 2. In Eberhardt and Scheines (2007), Ip is assumed to be a direct cause to a single variable, while in our case there is no such a restriction. For example, we allow pseudo confounders gl(C) which affects several variables. 3. Our framework also allows edges to vanish or appear in some domains or over some time periods; that is, the structure over V may change, for which hard intervention is a special case, while in Eberhardt and Scheines (2007), the causal structure does not change. 8. Experimental Results We applied the proposed CD-NOD (both phase I and phase II) to both synthetic and real-world data sets to learn causal graphs and identify changing causal modules, and then we learned a low-dimensional representation of changing causal modules (phase III). 8.1. Synthetic Data 8.1.1. Setting 1 To show the generality of the proposed methods in causal discovery when data distributions change, we considered two types of data: multi-domain heterogeneous data and nonstationary time series. Moreover, we considered changes in both causal strength and noise variances. For variable Vi whose causal module changes, the tth data point was generated according to the following functional causal model: Vj PAi bij,tfi(Vj,t) + σi,tϵi,t, where PAi denotes the set of Vi s direct causes, Vj PAi is the jth direct cause of Vi, bij,t is the varying causal strength from variable Vj to Vi, and σi,t is the changing parameter applied to the noise term ϵi,t. The function fi was randomly chosen from linear, cubic, tanh, sinc functions, or random mixtures of those functions. The noise term ϵi,t was randomly chosen from a uniform distribution U( 0.5, 0.5) or a standard normal distribution N(0, 1). The varying causal strength bij,t and varying noise s parameter σi,t were generated in different ways for heterogeneous and nonstationary data. Causal Discovery from Heterogeneous/Nonstationary Data For heterogeneous data whose data distributions change across domains, in each domain, bij,t and σ2 i,t were randomly generated from uniform distributions U(0.5, 2.5) and U(1, 3), respectively. Note that in the same domain both bij,t and σ2 i,t remain the same. For nonstationary data whose data distributions smoothly change over time, we generated bij,t and σi,t by sampling from a Gaussian process (GP) prior bij(t) GP( 0, K(t, t)) and σi(t) GP( 0, K(t, t)), respectively. We used squared exponential kernel k(t, t ) = exp( 1 λ2 ), where λ is the kernel width. For variable Vi whose causal module is fixed, it was generated according to a fixed functional causal model: Vi,t = X Vj PAi bijfi(Vj,t) + ϵi,t, where both the causal strength bij and variance of the noise term ϵi are fixed over time. We randomly generated acyclic causal structures according to the Erdos-Renyi model (Erd os and R enyi, 1959) with the probability of each edge 0.3. Each generated graph has 6 variables. In each graph, we randomly picked variables with changing causal modules, and the number was randomly chosen from {4, 5, 6}. We also considered different sample sizes. Particularly, for heterogeneous data, the sample size of each domain was chosen between 50 and 100, and the total sample size was N =600, 900, 1200, and 1500. For nonstationary data, we fixed the kernel with of the Gaussian prior and generated the data with sample size N = 600, 900, 1200, and 1500. In each scenario (each data type and each sample size), we generated 100 random realizations. We applied the proposed CD-NOD to identify the underlying causal structure, both causal skeleton (phase I) and causal directions (phase II), and detect changing causal modules (phase I). Phase I: causal skeleton identification and changing causal module detection. We applied CD-NOD phase I (Algorithm 1) to identify the causal skeleton and detect changing causal modules. Specifically, for heterogeneous data, we used the domain index as the surrogate variable C to capture the distribution change, while for nonstationary data, we used the time index. We used PC (Spirtes et al., 1993) as the search procedure and the kernel-based conditional independence (KCI) test to test conditional independence relationships. We compared our approach with the original constraint-based method, which does not take into account distribution changes. For the original constraint-based method, we applied the PC search over the observed variables, combined with the KCI test. We also compared with approaches based on the minimal change principle, the identical boundaries (IB) method and the minimal changes (MC) method (Ghassami et al., 2018). Both IB and MC methods are designed for multi-domain causal discovery in linear systems. For the KCI test, the significance level was 0.05. We used Gaussian kernels, and the hyperparameters, e.g. the kernel width, were set with empirical values; please refer to Zhang et al. (2011) for details. Since both IB and MC methods need data from multiple domains, for heterogeneous data, we segmented the data according to the domain index Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf 600 900 1200 1500 Sample size 600 900 1200 1500 Sample size Heterogeneous data 600 900 1200 1500 Sample size CD-NOD Original constraint IB MC 600 900 1200 1500 Sample size 600 900 1200 1500 Sample size Nonstationary data 600 900 1200 1500 Sample size Figure 8: Accuracy of the recovered causal skeleton from heterogeneous (upper row) and nonstationary (lower row) data. We compared three accuracy measures: F1 score (left column), precision (middle column), and recall (right column). We compared our proposed CD-NOD (phase I) with original constraint-based method, IB method, and MC method. before applying IB and MC methods, and for nonstationary data, we segmented the data into non-overlapping domains, with sample size 100 in each domain. Figure 8 gives the accuracy of the recovered causal skeleton from both heterogeneous (upper row) and nonstationary (lower row) data. The x-axis indicates the sample size, and the y-axis shows the accuracy of the recovered causal skeleton. We considered three accuracy measures: F1 score, precision, and recall, where precision indicates the rate of spurious edges, recall the rate of missing edges, and F1 = 2 precision recall precision+recall. We can see that the proposed CD-NOD (phase I) gives the best F1 score and precision on both heterogeneous and nonstationary data for all sample sizes, and it gives similar recall with others. Moreover, there is a slight increase in F1 score and recall along with the sample size. The original constraint-based method gives a much lower precision, and hence a lower F1 score as well, which means that there are many spurious edges. The reason is that it does not consider distribution shifts, as we illustrated in Figure 1. Both IB and MC methods give a much lower precision, probably because they are designed for linear systems, and thus the indirect nonlinear influences cannot be totally captured by variables along the pathway with linear functions. Table 2 shows the accuracy of the detected changing causal modules, measured by F1 score, precision, and recall. We can see that the proposed CD-NOD performs well, and that the F1 score tends to increase along with sample size in both scenarios. Causal Discovery from Heterogeneous/Nonstationary Data N F1 Precision Recall 600 0.93 0.98 0.89 900 0.95 0.99 0.93 1200 0.96 1.00 0.93 1500 0.96 0.99 0.93 (a) Heterogeneous data N F1 Precision Recall 600 0.94 0.99 0.90 900 0.96 0.99 0.94 1200 0.97 0.99 0.96 1500 0.96 0.97 0.95 (b) Nonstationary data Table 2: Accuracy of the identified changing causal modules from (a) heterogeneous data and (b) nonstationary data. Phase II: causal direction identification. We then applied three rules to identify causal directions, based on the skeleton learned in phase I. The three rules are (1) generalization of invariance (Algorithm 2), (2) independent changes between causal modules (Algorithm 3), (3) and Meek s orientation rule (Meek, 1995). The three rules are applied in the above sequence to identify causal directions. We compared with the window-based method (Zhang et al., 2017), the IB method, and the MC method for direction determination. The window-based method identifies the causal direction by measuring the dependence between hypothetical causal modules; however, the kernel density estimation of modules were performed in each sliding window or in each domain, separately. For the window-based method, we used the causal skeleton derived from CD-NOD phase I. For heterogeneous data, the window size was the same as the sample size in each domain; for nonstationary data, the size of each sliding window was 100. For the IB and MC method, the skeleton and orientations were derived simultaneously. Figure 9 shows the F1 score of the recovered whole causal graph (both skeletons and directions) from heterogeneous and nonstationary data. We can see that the proposed CD-NOD gives the best F1 score in all scenarios. The window-based method has second-best performance, and it performs slightly better on heterogeneous data than on nonstationary data. The reason may be that on nonstationary data, the sliding window-based method may lead to large estimation errors, especially when the causal influence varies quickly over time. The performance of IB and MC methods is worse; it is not surprising since the generated data are not linear. Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf 600 900 1200 1500 Sample size Nonstationary data CD-NOD Window-based IB MC 600 900 1200 1500 Sample size Heterogeneous data Figure 9: Accuracy of the recovered whole causal graph (both skeletons and directions) from both heterogeneous and nonstationary data. We compared our proposed CD-NOD with window-based method, IB method, and MC method. 8.1.2. Setting 2 To clearly show the efficacy of Algorithm 3 in causal direction identification (without being affected by the accuracy from causal skeleton identification and the other two orientation rules), we generated another synthetic data set. Particularly, we generated fully connected acyclic graphs. For each variable Vi, all its causal strength bij and noise s parameter σi change, either across domains or over time. We considered different graph sizes m = 2, 4, and 6 and different sample sizes N = 600, 900, 1200, and 1500. To assess the performance of causal direction identification, we assumed that the causal skeleton is given (fully connected), and we inferred causal directions with Algorithm 3, which exploits the independence between causal modules. We compared CD-NOD with the window-based method, the IB method, and the MC method. The setting of hyperparameters and window size was the same as that in Setting 1. Figure 10 gives the accuracy (F1 score) of the inferred causal directions from heterogeneous and nonstationary data. We can see that CD-NOD (Algorithm 3) gives the best F1 score, and its accuracy slightly increase with sample size. The accuracy of window-based method is comparably lower than CD-NOD (Algorithm 3), and its performance is better on heterogeneous data than that on nonstationary data. The accuracy of the IB and MC method is much lower, since they are only for linear systems. 8.1.3. Setting 3 We further investigated the effect of sample size per domain in heterogeneous data of the final performance. We varied the number of samples N0 in each domain, with N0 = 10, 20, 40, 60, 80, and 100, and there were in total 10 domains. Other settings of the data generating process are the same as that in Setting 1. Figure 11(left) gives the F1 score of the recovered causal skeleton, along with the number of samples per domain. We compared CD-NOD with original constraint-based methods Causal Discovery from Heterogeneous/Nonstationary Data 600 900 1200 1500 Sample size (2-variable graph) 600 900 1200 1500 Sample size (4-variable graph) Nonstationary data 600 900 1200 1500 Sample size (6-variable graph) CD-NOD Window-based IB MC 600 900 1200 1500 0.5 600 900 1200 1500 0.5 Heterogeneous data 600 900 1200 1500 0.5 Figure 10: Accuracy of inferred causal directions from heterogeneous (upper row) and nonstationary (lower row) data. We tested the accuracy on 2-variable (left column), 4-variable (middle column), 6-variable (right column) fully connected graphs. We compared our proposed CD-NOD with the window-based method, the IB method, and the MC method. Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf 10 20 40 60 80 100 Sample size per domain Skeleton (Heterogeneous) CD-NOD Original constraint IB MC 10 20 40 60 80 100 Sample size per domain Direction (Heterogeneous) CD-NOD Window-based IB MC Figure 11: Accuracy of the estimated causal skeleton (left) and the whole causal graph (right) from heterogeneous data, along with the number of samples in each domain. (without taking into account the distribution shift), the IB method, and the MC method. Figure 11(right) shows the F1 score of the recovered whole causal graph, including both the skeleton and directions, produced by CD-NOD, the window-based method, the IB method, and the MC method, respectively. We can see that CD-NOD achieves the best accuracy in all cases, regarding both the estimated skeleton and the whole causal graph. Moreover, we found that not surprisingly, its performance improves with the number of samples in each domain, with a relatively large improvement (around 5%) from N0 = 40 to N0 = 60. 8.1.4. Setting 4 After identifying the causal structure, we then estimated the nonstationary driving force of changing causal modules by the procedure KNV given in Algorithm 4. For illustration purposes, we used a single driving force for each module, which changes both causal strength b and noise parameter σ. Phase III: nonstationary driving force estimation We used Gaussian kernels both in kernel embedding of constructed joint distributions and kernel PCA. We compared our approach with the linear time-dependent functional causal model (Huang et al., 2015), which puts a GP prior on time-varying coefficients and uses the mean of the posterior to represent the nonstationary driving force. In addition, we compared our methods with Bayesian change point detection (Adams and Mac Kay, 2007), 4 which is widely used in nonstationary data to detect change points. Figure 12(a) and 12(b) show the estimated low-dimensional driving force of changing causal modules for smooth changes (nonstationary data) and sudden changes (heterogeneous data), respectively, when N = 600. In left panels, blue lines show the estimated driving force by KNV, and red lines show ground truth. Vertical black dashed lines indicate detected change points by Bayesian change point detection. Middle panels show the largest ten 4. We used the implemented Matlab code from http://hips.seas.harvard.edu/content/bayesian-onlinechangepoint-detection. Causal Discovery from Heterogeneous/Nonstationary Data eigenvalues of Gram matrix Mg. In right panels, blue lines are the recovered changing components by the linear time-dependent GP, and red lines are ground truth. The scale of the y-axis has been adapted for clarity. We only showed the first principal component in the left panel, since the first eigenvector captures most of the variance, as shown in middle panels. We can see that KNV gives the best recovery of the changing components in all cases. Bayesian change point detection, which aims to estimate change points in the marginal or joint distributions (instead of causal mechanisms), fails to handle the case of smooth changes. The linear time-dependent GP does not work well, and one reason is that it cannot account for influences from changing noise parameters, in cases of both smooth and sudden change. 8.2. Real-World Data Sets We then applied the proposed CD-NOD to three real-world data sets. We performed the same three procedures as before: recover causal structure over variables, identify changing causal modules, and estimate low-dimensional driving forces of changing causal modules. Specifically, for causal skeleton determination and changing module detection, we applied Algorithm 1. For causal direction determination, we applied three orientation rules: Algorithm 2, Algorithm 3, and Meek s orientation rule, in sequence. For driving force estimation, we applied KNV (Algorithm 4). Since our proposed methods outperform all the alternatives on synthetic data, we only applied our proposed methods to real-world data sets. 8.2.1. Task f MRI We applied our methods to task f MRI data, which were recorded under a star/plus experiment (Wang et al., 2004). There are three states in the experiment. (1) The subject rested or gazed at a fixation point on the screen (State 1). (2) The subject was shown a picture and a sentence that was not negated, and was instructed to press a button to indicate whether the sentence matched the picture (State 2). (3) The subject was shown a picture and a negated sentence, and was instructed to press a button indicating whether the sentence matched the picture (State 3). The time resolution of the recording is 500 ms. The f MRI voxel data are organized into regions of interests (ROIs), resulting in 25 ROIs. Figure 13 shows the recovered causal structure over the 25 ROIs, where red circles mean that corresponding brain areas have changing causal mechanisms. We can see that the causal edges from the ROIs in the right hemisphere to the corresponding left ones are robust, e.g., RIPL LIPL, RIT LIT. ROIs CALC, LSGA, LIPL, and LIT have large indegree centrality, where CALC and LIT are responsible for visual input processing, and LSGA and LIPL are for language processing. The causal modules of CALC, RIPL, LIPL, RSGA, LSGA, RDLPFC, LDLPFC, RTRIA, LSP, and RIT are time-varying, which are marked with red circles. The identified changing causal modules correspond to key areas for visual and language perception. By considering nonstationarity, 16 connections that were produced by the original PC algorithm are removed, e.g., LIPL - RSGA, LSPL - RDLPFC, and RDLPFC - RTRIA, indicating possible pseudo confounders behind those pairs of ROIs. Figure 14 visualizes the estimated nonstationary driving force by KNV. The top figure shows the state of input stimuli: State = 1 (State 1) means that the subject was in resting state; State = 2 (State 2) means that the subject was shown a picture and a non-negated sentence, and State = 3 (State 3) means that the shown sentence was negated. The lower Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf Ground truth KNV Change point detection 0 200 400 600 Time (t) 0 5 10 Principle component Ground truth Time-dependent GP 0 200 400 600 Time (t) (a) Smooth changes 0 200 400 600 Time (t) 0 5 10 Principle component 0 200 400 600 Time (t) (b) Sudden changes Figure 12: Visualization of estimated driving forces of changing causal modules for (a) smooth changes and (b) sudden changes. Left panel: blue lines show the recovered nonstationary components by KNV. Red lines are ground truth. Vertical black dashed lines indicate detected change points by Bayesian change point detection. Middle Panel: the largest ten eigenvalues of Gram matrix Mg. Right Panel: blue lines are recovered nonstationary components by linear time-dependent GP. Red lines are ground truth. Causal Discovery from Heterogeneous/Nonstationary Data Figure 13: Recovered causal graph over 25 ROIs. The red circles denote that causal modules of corresponding brain areas changing over states. Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf 0 500 1000 Time (t) Figure 14: Visualization of estimated nonstationary driving force from task f MRI data. The top figure shows the state of input stimuli: State = 1 means that the subject was in resting state; State = 2 means that the subject was shown a picture and a sentence, in which the sentence was not negated, while in State = 3 the sentence was negated. The other four figures show the estimated nonstationary driving forces of LDLPFC, LIPL, LSGA, and RIPL, respectively. The shaded areas are intervals which have obvious changes; they match the resting states (State = 1) with no input stimuli. For LDLPFC, the second shaded interval is delayed. There are changes between State 2 and State 3 but are less obvious, comparatively. four figures show the estimated nonstationary driving forces of the causal modules of several key ROIs: LDLPFC, LIPL, LSGA, and RIPL, respectively. The shaded areas are intervals which have obvious changes; they match the resting state (State 1) with no input stimuli. Quantitatively, the correlations between states and the recovered driving forces of LDLPFC, LIPL, LSGA, and RIPL are 0.22, 0.49, 0.41, 0.47, respectively. For LDLPFC, the second shaded interval is delayed, which may be due to the fact that LDLPFC is involved in cognitive processes, e.g., working memory, so that it may remain activated for some time when the input stimulus has been removed. There are changes between State 2 and State 3, but they are less obvious. In short, the recovered nonstationary driving forces show that there are obvious changes between the resting state and the task states, while the changes between two task states are relatively less obvious. Causal Discovery from Heterogeneous/Nonstationary Data 8.2.2. Stock Returns We applied our methods to daily returns of stocks from Hong Kong (HK) and the United States (US), downloaded from Yahoo Finance. HK Stock Market The HK stock dataset contains 10 major stocks, which are daily dividend adjusted closing prices from 10/09/2006 to 08/09/2010. For the few days when the stock price is not available, a simple linear interpolation is used to estimate the price. Denoting the closing price of the ith stock on day t by Pi,t, the corresponding return is calculated by Vi,t = Pi,t Pi,t 1 Pi,t 1 . The 10 stocks are Cheung Kong Holdings (1), Wharf (Holdings) Limited (2), HSBC Holdings plc (3), Hong Kong Electric Holdings Limited (4), Hang Seng Bank Ltd (5), Henderson Land Development Co. Limited (6), Sun Hung Kai Properties Limited (7), Swire Group (8), Cathay Pacific Airways Ltd (9), and Bank of China Hong Kong (Holdings) Ltd (10). Among these stocks, 3, 5, and 10 belong to Hang Seng Finance Sub-index (HSF), 1, 8, and 9 belong to Hang Seng Commerce & Industry Sub-index (HSC), 2, 6, and 7 belong to Hang Seng Properties Sub-index (HSP), and 4 belongs to Hang Seng Utilities Sub-index (HSU). Figure 15 shows the estimated causal structure, where the causal modules of 2, 3, 4, 5 and 7 are found to be time-dependent, indicated by red circles. In contrast, the PC algorithm with the KCI test (without taking into account the nonstationarity) gives five more edges, which are 2 3, 2 5, 3 6, 5 7, and 4 5; most of these spurious edges (the former four) are in the finance sub-index and the properties sub-index, indicating the existence of pseudo confounders behind these sub-indices. We found that all stock returns that have changing causal mechanisms are in HSF, HSP, and HSU; they might be affected by some unconsidered changing factors, e.g., the change of economic policies. Furthermore, we estimated causal directions by the procedure given in Algorithm 2 and 3, and the inferred directions are consistent with some background knowledge of the market. For instance, the within sub-index causal directions tend to follow the owner-member relationship. Examples include 5 3 (note that 3 partially owns 5), 9 8, and 4 1. Those stocks in HSF are major causes for those in HSC and HSP, and the stocks in HSP and HSU impact those in HSC. These results indicate that overseas markets (e.g., the US market) may influence the local market via large banks, and that prices of stocks in commerce and industry are usually affected by certain companies in finance, properties, and utilities sectors. Figure 16 (bottom panels) visualizes the estimated driving forces of stocks 2, 3, 4, 5, and 7 (the scale of y-axis has been adjusted). We can see that the nonstationary components of root causes, 4 and 5, share a similar variability; the change points are around 01/22/2007 (T1), 07/16/2007 (T2), 06/30/2008 (T3), and 02/11/2009 (T4). The nonstationary components of the causal modules of 2, 3, and 7 have change points around T2, T3, and T4. Stock 5 has additional change points at T1 (01/22/2007), which is perhaps due to the fact that Hang Seng Bank established its wholly owned subsidiary Hang Seng Bank (China) Limited in 2007. The change points around T2, T3, and T4 match with the critical time points of financial crisis around the year of 2008. The active phase of the crisis, which manifested as a liquidity crisis, could be dated from August, 2007,5 around T2. The estimated driving 5. See more information at https://en.m.wikipedia.org/wiki/Financial_crisis_of_2007-08. Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf Figure 15: Recovered causal graph over the 10 main HK stocks. Red circles indicate that causal modules of corresponding stocks change over time. forces are consistent with the change of the TED spread,6 which is an indicator of perceived credit risk in the general economy and is shown in Figure 16 (top panel) for comparison. US Stock Market We then applied our methods to daily returns of a number of stocks from New York Stock Exchange, downloaded from Yahoo Finance. We considered 80 major stocks and used their daily dividend adjusted closing prices from 07/05/2006 to 12/16/2009. They are clustered into 10 sectors: energy, public utilities, capital goods, health care, consumer service, finance, transportation, consumer nondurable goods, basic industry, and technology. Figure 17 shows the estimated causal graph over stock returns, each color representing one sector. The size of each node reflects its degree of connections (sum of indegree and outdegree), and larger node indicates a higher degree. We found that intra-sector connections are more dense than inter-sector connections. The stocks in energy, finance, public utilities, and basic industries are more likely to be causes of stocks in other sectors; among these four sectors, stocks in energy and finance usually causally influence stocks in utilities and basic industries. More specifically, GE, a stock in energy, does not have causal edges within the sector; instead, it is adjacent to stocks in other sectors, which may be due to the reason that GE is a conglomerate and has various segments, including healthcare, power, transportation, oil and gas, etc. In regard to causal interactions within energy, CHK and KEG are root causes, which may be because KEG is an oilfield services company, and CHK is a holding company for a variety of energy enterprises; NBR and HAL are also more likely to cause others, and PBR, WMB, TLM, and NE are more likely to be influenced by other stocks. 6. See https://en.m.wikipedia.org/wiki/TED_spread. Causal Discovery from Heterogeneous/Nonstationary Data 10/09/06 08/09/10 T1 T2 T3 T4 -10 0 10 20 T1 T2 T3 T4 -10 0 10 20 Figure 16: The estimated nonstationary driving forces of time-dependent stock returns, as well as the curve of the TED spread, over the period from 10/09/2006 to 08/09/2010. Top: Curve of the TED spread shown for comparison. Bottom: Estimated nonstationary driving force of stocks 2, 3, 4, 5, and 7, where T1, T2, T3, and T4 stand for 01/22/2007, 07/16/2007, 05/03/2008, and 11/02/2009, respectively. We can see that the nonstationary components of root causes, 4 and 5, share the similar variability with change points around T1, T2, T3, and T4. The nonstationary components of 2, 3, and 7 have change points only around T2, T3, and T4. Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf Figure 17: Recovered causal graph over 80 NYSE stocks. Each color of nodes represents one sector. Among the stocks in finance, MS, SAN, and JPM have a large number of edges with stocks from other sectors. Within finance, USB and KEY usually cause others, and WFC is the sink node. Among the stocks in basic industry, FCX is largely affected by stocks from energy, and it causes other stocks in basic industry, probably because it also belongs to petroleum industry; GLW and DOW have a large number of edges with stocks in consumer service and technology, which may be related to their products in glass & ceramic materials, and they are not adjacent to other stocks in basic industry. Within basic industry, IAG is the root cause, and KGC is always influenced by others. Regarding the stocks from the remaining sectors, DIS, HST, and HD from consumer service, ORCL from technology, and CSX from transportation have large centrality. By considering nonstationarity, 63 edges that were produced by the original PC algorithm were removed; 19 of them are within sectors, and 44 are between sectors. The causal modules of 37 out of 80 stocks change over time; most of them are from energy (10 out of 28), finance (7 out of 9), basic industry (6 out of 14), and consumer service (5 out of 7). It is reasonable that stocks, that are close to the root node, are more likely to be influenced by some external factors, such as changing economic environments and policies. These findings, such as stocks from finance are more likely to be causes of stocks in other sectors, are consistent with what we discovered from HK stock market reported above. Causal Discovery from Heterogeneous/Nonstationary Data 07/05/06 T1 T2 12/17/09 07/05/06 T1 T2 12/17/09 07/05/06 T1 T2 12/17/09 Figure 18: The estimated nonstationary driving force of six stock returns from 07/05/2006 12/16/2009. The stocks SAN, MS, and DOW have obvious change points at 07/16/2007 (T1) and 05/05/2008 (T2) . The stocks GE, CHK, and BAC have change points only at 05/05/2008 (T2). The change points match the critical time of financial crisis. Figure 18 visualizes the estimated nonstationary driving forces of stocks SAN, MS, DOW, GE, CHK, and BAC. We can see that among these six stocks, SAN, MS, and DOW have obvious change points around 07/16/2007 (T1) and 05/05/2008 (T2), while the stocks GE, CHK, and BAC only have changes points around 05/05/2008 (T2). We found that stocks, which have change points around both time points, usually have large centrality and influence a large number of other stocks. The change points match the critical time of the 2008 financial crisis - those in the TED spread and parts of the change points (T2 and T3) in HK stock data. 9. Discussion and Conclusions This paper is concerned with causal discovery from heterogeneous/nonstationary data and visualization of changing causal modules. We assume a pseudo causal sufficiency condition, which states that all confounders can be represented by functions of domain or smooth functions of time index. We proposed (1) an enhanced constraint-based method for locating variables whose causal modules change and estimating the skeleton of the causal structure over observed variables, (2) a method for causal direction determination that takes advantage of the information carried by distribution shifts, and (3) a procedure for estimating a lowdimensional representation of changing causal modules. We then showed that distribution shifts also help for causal discovery when there are stationary confounders. In summary, to identify the causal structure, we take advantage of two types of independence constraints: (a) conditional independence (with constraint-based approach) and (b) independent changes of changing causal modules. Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf The performance evaluated on synthetic data sets showed largely improved F1 score and precision with our proposed CD-NOD, compared to other methods for causal discovery. The results over task-f MRI data well revealed information flows between brain regions and how causal influences change across resting state and task states. The causal relationships learned from stock data (both HK and US) are consistent with background knowledge, e.g., stocks in finance and energy are causes to others, and the estimated nonstationary driving forces match the critical time of financial crisis around the year of 2008. We showed that distribution shift contains useful information for causal discovery, since causal model provides a compact description of how the joint distribution changes. Moreover, it has been recently noticed that this view help understand and solve some machine learning problems, e.g., domain adaptation and prediction in nonstationary environments. Since distribution shifts in heterogeneous/nonstationary data are usually constrained - it might be due to the changes in the data-generating processes of only a few variables, and thus it is only necessary to adapt a small proportion of the joint distribution in domain adaptation problems (Zhang et al., 2013; Sch olkopf et al., 2012). There are several open questions that we aim to answer in future work. First, in the paper we assumed that causal directions do not flip despite distribution shifts. What if some causal directions also change across domains or over time? It is important to develop a general approach to detect such direction flips and do causal discovery, because of the ubiquity of feedback loops. Second, the issue of distribution shift may decrease the power of statistical (conditional) independence tests. It is essential to develop a reliable statistical test to mitigate this problem. Third, in practice, there may also exist other types of confounders, e.g., nonstationary confounders. We will extend our methods to cover general types of confounders. Acknowledgments We are grateful to the anonymous reviewers whose careful comments and suggestions helped improve this manuscript. We would like to acknowledge the support by National Institutes of Health under Contract No. NIH-1R01EB022858-01, FAIN-R01EB022858, NIH1R01LM012087, NIH5U54HG008540-02, and FAIN-U54HG008540, and by the United States Air Force under Contract No. FA8650-17-C7715. The National Institutes of Health and the U.S. Air Force are not responsible for the views reported in this article. JZ s research was supported in part by the Research Grants Council of Hong Kong under the General Research Fund LU342213. Appendix A. Proof of Theorem 1 Proof Before getting to the main argument, let us establish some implications of the SEMs in Eq. 2 and the assumptions in Section 3.1. Since the structure is assumed to be acyclic or recursive, according to Eq. 2, all variables Vi can be written as a function of {gl(C)}L l=1 {θm(C)}n m=1 and {ϵm}n m=1. As a consequence, the probability distribution of V at each value of C is determined by the distribution of ϵ1, ..., ϵn, and the values of {gl(C)}L l=1 {θm(C)}n m=1. In other words, P(V|C) is determined by Qn i=1 P(ϵi) (for Causal Discovery from Heterogeneous/Nonstationary Data ϵ1, ..., ϵn are mutually independent), and {gl(C)}L l=1 {θm(C)}n m=1, where P( ) denotes the probability density or mass function. For any Vi, Vj, and Vij {Vk | k = i, k = j}, because P(Vi, Vj | Vij, C) is determined by P(V|C), it is also determined by Qn i=1 P(ϵi) and {gl(C)}L l=1 {θm(C)}n m=1. Since Qn i=1 P(ϵi) does not change with C, we have P(Vi, Vj | Vij {gl(C)}L l=1 {θm(C)}n m=1 {C}) =P(Vi, Vj | Vij {gl(C)}L l=1 {θm(C)}n m=1). (21) That is, C (Vi, Vj) | Vij {gl(C)}L l=1 {θm(C)}n m=1. (22) By the weak union property of conditional independence, it follows that C Vj | {Vi} Vij {gl(C)}L l=1 {θm(C)}n m=1. (23) We are now ready to prove the theorem. Let Vi, Vj be any two variables in V. First, suppose that Vi and Vj are not adjacent in G. Then they are not adjacent in Gaug, which recall is the graph that incorporates {gl(C)}L l=1 {θm(C)}n m=1. It follows that there is a set Vij {Vk | k = i, k = j} such that Vij {gl(C)}L l=1 {θm(C)}n m=1 d-separates Vi from Vj. Since the joint distribution over V {gl(C)}L l=1 {θm(C)}n m=1 is assumed to be Markov to Gaug, we have Vi Vj | Vij {gl(C)}L l=1 {θm(C)}n m=1. (24) Because all gl(c) and θm(C) are deterministic functions of C, we have P(Vi, Vj | Vij {C}) = P(Vi, Vj | Vij {gl(C)}L l=1 {θm(C)}n m=1 {C}). According to the properties of mutual information given in Madiman (2008) , Eqs. 24 and 22 imply Vi (C, Vj) | Vij {gl(C)}L l=1 {θm(C)}n m=1. By the weak union property of conditional independence, it follows that Vi Vj | Vij {gl(C)}L l=1 {θm(C)}n m=1 {C}. As all gl(C) and θm(C) are deterministic functions of C, it follows that Vi Vj | Vij {C}. In other words, Vi and Vj are conditionally independent given a subset of {Vk | k = i, k = j} {C}. Conversely, suppose Vi and Vj are conditionally independent given a subset S of {Vk | k = i, k = j} {C}. We show that Vi and Vj are not adjacent in G, or equivalently, that they are not adjacent in Gaug. There are two possible cases to consider: Suppose S does not contain C. Then since the joint distribution over V {gl(C)}L l=1 {θm(C)}n m=1 is assumed to be faithful to Gaug, Vi and Vj are not adjacent in Gaug, and hence not adjacent in G. Otherwise, S = Vij {C} for some Vij {Vk | k = i, k = j}. That is, Vi Vj | Vij {C}, or (25) P(Vi, Vj | Vij {C}) = P(Vi | Vij {C})P(Vj | Vij {C}). According to Eq. 21, and also noting that {gl(C)}L l=1 {θm(C)}n m=1 is a deterministic function of C, we have P(Vi, Vj | Vij {C}) = P(Vi, Vj | Vij {gl(C)}L l=1 {θm(C)}n m=1), (26) Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf which also implies P(Vi | Vij {C}) = P(Vi | Vij {gl(C)}L l=1 {θm(C)}n m=1), (27) P(Vj | Vij {C}) = P(Vj | Vij {gl(C)}L l=1 {θm(C)}n m=1). (28) Substituting Eqs. 26 - 28 into Eq. 25 gives P(Vi, Vj | Vij {gl(C)}L l=1 {θm(C)}n m=1) (29) =P(Vi | Vij {gl(C)}L l=1 {θm(C)}n m=1)P(Vj | Vij {gl(C)}L l=1 {θm(C)}n m=1). That is, Vi Vj | Vij {gl(C)}L l=1 {θm(C)}n m=1. Again, by the Faithfulness assumption on Gaug, this implies that Vi and Vj are not adjacent in Gaug and hence are not adjacent in G. Therefore, Vi are Vj are not adjacent in G if and only if they are conditionally independent given some subset of {Vk | k = i, k = j} {C}. Appendix B. Proof of Proposition 1 µY ,X|C=cn E(Y,X) P(Y ,X | C=cn)[φ(Y ) φ(X)] = C(Y,X),CC 1 C,Cφ(cn) =E(Y,X,C) P(Y ,X,C)[φ(Y ) φ(X) φ(C)]C 1 C,Cφ(cn) =E(X,C) P(X,C){EY P(Y |X,C)[φ(Y )] φ(X) φ(C)}C 1 C,Cφ(cn) =CY,(X,C)C 1 (X,C),(X,C) EX P(X)EC P(C){[φ(X) φ(C)] φ(X) φ(C)}C 1 C,Cφ(cn) (30) Furthermore, EX P(X)EC P(C){[φ(X) φ(C)] φ(X) φ(C)}C 1 C,Cφ(cn) =EX P(X)EC P(C){[φ(X) φ(C)] [φ(X) φ (C)C 1 C,Cφ(cn)]} =EX P(X)EC P(C){φ(X) [φ(C) (φ(X)φ (C)C 1 C,Cφ(cn)) ]} (31) =EX P(X)EC P(C){φ(X) [φ(C) φ (cn)C 1 C,Cφ(C)φ (X)]} =EX P(X)EC P(C){φ(X) [CC,CC 1 C,Cφ(cn)φ (X)]} (32) =EX P(X){φ(X) φ(cn) φ(X)} where (31) holds because tensor product is associative, and (32) holds because φ (cn)C 1 C,Cφ(C) is a scaler, implying φ (cn)C 1 C,Cφ(C) = [φ (cn)C 1 C,Cφ(C)] = φ (C)C 1 C,Cφ(cn). Causal Discovery from Heterogeneous/Nonstationary Data Therefore, equation (30) becomes µY ,X|C=cn = CY,(X,C)C 1 (X,C),(X,C)EX P(X){φ(X) φ(cn) φ(X)}. Let φ (X, C) := φ(X) φ(C), Φy := [φ(y1), ..., φ(y N)], Φx := [φ(x1), ..., φ(x N)], Φx,c := [φ (x1, c1), ..., φ (x N, cn)], Φx,cn := [φ (x1, cn), ..., φ (x N, cn)]. µY ,X|C=cn can be estimated as ˆ µY ,X|C=cn = 1 N ΦyΦ x,c( 1 N Φx,cΦ x,c + λI) 1( 1 N Φy(Kx Kc + λI) 1diag(kc,cn)KxΦ x. Appendix C. Proof of Theorem 2 Proof Let Vi and Vj be an unoriented pair of adjacent variables. If condition (1) is satisfied, then there are two possibilities: Vj is the collider, and there exists another variable Vk which influences Vj and does not have an edge with Vi, so that Vi Vj Vk. In such a case, Vi Vk|S, with S V and S Vj = . Vi is the collider, and there exists another variable Vk which influences Vi and does not have an edge with Vj, so that Vj Vi Vk. In such a case, Vj Vk|S, with S V and S Vi = . Thus, we can identify the direction between Vi and Vj depending on whether Vi Vk|S or Vj Vk |S . If condition (2) is satisfied, and there is only one change influences Vi or Vj. Then we have the following two cases: The change influences Vj, i.e., Vi Vj C: if Vi C|S, with S V and S Vj = , orient Vi Vj; if Vi C|S, with S V and Vj S, orient Vi Vj. The change influences Vi, i.e., Vj Vi C: if Vj C|S, with S V and S Vi = , orient Vj Vi; if Vj C|S, with S V and Vi S, orient Vj Vi. If condition (2) is satisfied, and there are independent changes to both Vi and Vj. Then we have the following two cases: if P(Vi) P(Vj|Vi, Z) and P(Vj) P(Vi|Vj, Z), where Z is the deconfounding set of (Vi, Vj), then orient Vi Vj; if P(Vj) P(Vi|Vj, Z) and P(Vi) P(Vj|Vi, Z), where Z is the deconfounding set of (Vi, Vj), then orient Vj Vi. If condition (3) is satisfied, then we have the following two cases: There is an edge incident to Vj but not to Vi, i.e., Vi Vj Vl: Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf if Vi Vl|S, with S V and S Vj = , orient Vi Vj; if Vi Vl|S, with S V and Vj S, orient Vi Vj. There is an edge incident to Vi but not to Vj, i.e., Vj Vi Vl: if Vj Vl|S, with S V and S Vi = , orient Vj Vi; if Vj Vl|S, with S V and Vi S, orient Vj Vi. Therefore, it is easy to see that the direction between Vi and Vj is identifiable, if at least one of the conditions is satisfied. R. P. Adams and D. J. Mac Kay. Bayesian online changepoint detection. ar Xiv preprint ar Xiv:0710.3742, 2007. D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507 554, 2003. F. Eberhardt and R. Scheines. Interventions and causal inference. Philosophy of Science, 74 (5):981 995, 2007. P. Erd os and A. R enyi. On random graphs I. Publicationes Mathematicae, 6:290 297, 1959. A. E. Ghassami, N. Kiyavash, B. Huang, and K. Zhang. Multi-domain causal structure learning in linear systems. In Advances in Neural Information Processing Systems (Neur IPS), pages 6266 6276, 2018. M. Gong*, K. Zhang*, D. Tao, P. Geiger, and B. Sch olkopf. Discovering temporal causal relations from subsampled data. In International Conference on Machine Learning (ICML), pages 1898 1906, 2015. A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Sch olkopf, and A. J. Smola. A kernel statistical test of independence. In Neural Information Processing Systems (NIPS), pages 585 592, 2008. M. Havlicek, K.J. Friston, J. Jan, M. Brazdil, and V.D. Calhoun. Dynamic modeling of neuronal responses in f MRI using cubature Kalman filtering. Neuroimage, 56:2109 2128, 2011. Y. He and Z. Geng. Causal network learning from multiple interventions of unknown manipulated targets. In ar Xiv preprint ar Xiv: 1610.08611, 2016. D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197 243, 1995. D. Heckerman, C. Meek, and G. Cooper. A Bayesian approach to causal discovery. Innovations in Machine Learning, 194:1 28, 2006. K. Hoover. The logic of causal inference. Economics and Philosophy, 6:207 234, 1990. Causal Discovery from Heterogeneous/Nonstationary Data P. O. Hoyer, D. Janzing, J. Mooji, J. Peters, and B. Sch olkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems (NIPS), 2009. B. Huang, K. Zhang, and B. Sch olkopf. Identification of time-dependent causal model: A Gaussian process treatment. In International Joint Conference on Artificial Intelligence (IJCAI), pages 3561 3568, 2015. B. Huang, K. Zhang, J. Zhang, Sanchez-Romero, C. R., Glymour, and B. Sch olkopf. Behind distribution shift: Mining driving forces of changes and causal arrows. In IEEE International Conference on Data Mining (ICDM), pages 913 918, 2017. B. Huang, K. Zhang, Y. Lin, B. Sch olkopf, and C. Glymour. Generalized score functions for causal discovery. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pages 1551 1560, 2018. B. Huang, K. Zhang, M. Gong, and C. Glymour. Causal discovery and forecasting in nonstationary environments with state-space models. In International Conference on Machine Learning (ICML), 2019a. B. Huang, K. Zhang, P. Xie, M. Gong, E. Xing, and C. Glymour. Specific and shared causal relation modeling and mechanism-based clustering. In Advances in Neural Information Processing Systems (Neur IPS), 2019b. B. Huang, K. Zhang, M. Gong, and C. Glymour. Causal discovery from non-identical variable sets. In AAAI Conference on Artificial Intelligence (AAAI), 2020. A. Hyv arinen and P. Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429 439, 1999. A. Hyv arinen, K. Zhang, S. Shimizu, and P. Hoyer. Estimation of a structural vector autoregression model using non-Gaussianity. Journal of Machine Learning Research, pages 1709 1731, 2010. E. Kummerfeld and D. Danks. Tracking time-varying graphical structure. In Advances in Neural Information Processing Systems (NIPS), pages 1205 1213, 2013. M. Madiman. On the entropy of sums. In IEEE Information Theory Workshop (ITW), pages 303 307, 2008. C. Meek. Strong completeness and faithfulness in bayesian networks. In Uncertainty in Artificial Intelligence (UAI), pages 411 419, 1995. J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, 2000. J. Peters, P. B uhlmann, and N. Meinshausen. Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B, 78(5):947 1012, 2016. Huang, Zhang, Zhang, Ramsey, Sanchez-Romero, Glymour, and Sch olkopf B. Sch olkopf, A. Smola, and K. Muller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299 1319, 1998. B. Sch olkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij. On causal and anticausal learning. In Proc. 29th International Conference on Machine Learning (ICML 2012), Edinburgh, Scotland, 2012. S. Shimizu, P. O. Hoyer, A. Hyv arinen, and A. J. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003 2030, 2006. L. Song, M. Kolar, and E. Xing. Time-varying dynamic Bayesian networks. In Advances in Neural Information Processing systems (NIPS), 2009. P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Spring-Verlag Lectures in Statistics, 1993. P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2001. M. Talih and N. Hengartner. Structural learning with time-varying components: Tracking the cross-section of financial time series. Journal of the Royal Statistical Society - Series B, 67(3):321 341, 2005. J. Tian and J. Pearl. Causal discovery from changes: a Bayesian approach. In Uncertainty in Artificial Intelligence (UAI), pages 512 521, 2001. X. Wang, R. Hutchinson, and T. M. Mitchell. Training f MRI classifiers to detect cognitive states across multiple human subjects. In Neural Information Processing Systems (NIPS), pages 709 716, 2004. E. P. Xing, W. Fu, and L. Song. A state-space mixed membership blockmodel for dynamic network tomography. Annals of Applied Statistics, 4(2):535 566, 2010. J. Zhang and P. Spirtes. The three faces of faithfulness. Synthese, 2016. K. Zhang and L. Chan. Extensions of ICA for causality discovery in the hong kong stock market. In International Conference on Neural Information Processing (ICONIP), 2006. K. Zhang and A. Hyv arinen. Acyclic causality discovery with additive noise: An informationtheoretical perspective. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2009a. K. Zhang and A. Hyv arinen. On the identifiability of the post-nonlinear causal model. In Uncertainty in Artificial Intelligence, 2009b. K. Zhang, J. Peters, D. Janzing, and B. Sch olkopf. Kernel-based conditional independence test and application in causal discovery. In Uncertainty in Artificial Intelligence (UAI), 2011. K. Zhang, B. Sch olkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, 2013. Causal Discovery from Heterogeneous/Nonstationary Data K. Zhang, B. Huang, J. Zhang, B. Sch olkopf, and C. Glymour. Discovery and visualization of nonstationary causal models. In ar Xiv preprint ar Xiv: 1509.08056, 2015a. K. Zhang, Z. Wang, J. Zhang, and B. Sch olkopf. On estimation of functional causal models: General results and application to post-nonlinear causal model. ACM Transactions on Intelligent Systems and Technologies, 7(2), 2015b. K. Zhang, B. Huang, J. Zhang, C. Glymour, and B. Sch olkopf. Causal discovery from nonstationary/heterogeneous data: Skeleton estimation and orientation determination. In International Joint Conference on Artificial Intelligence (IJCAI), 2017.