# multidomain_causal_structure_learning_in_linear_systems__641c74a3.pdf

Multi-domain Causal Structure Learning

in Linear Systems

Amir Emad Ghassami , Negar Kiyavash , Biwei Huang , Kun Zhang

Department of ECE, University of Illinois at Urbana-Champaign, Urbana, IL, USA.

School of ISy E and ECE, Georgia Institute of Technology, Atlanta, GA, USA.

Department of Philosophy, Carnegie Mellon University, Pittsburgh, PA, USA.

We study the problem of causal structure learning in linear systems from observational data given in multiple domains, across which the causal coefﬁcients and/or the distribution of the exogenous noises may vary. The main tool used in our approach is the principle that in a causally sufﬁcient system, the causal modules, as well as their included parameters, change independently across domains. We ﬁrst introduce our approach for ﬁnding causal direction in a system comprising two variables and propose efﬁcient methods for identifying causal direction. Then we generalize our methods to causal structure learning in networks of variables. Most of previous work in structure learning from multi-domain data assume that certain types of invariance are held in causal modules across domains. Our approach uniﬁes the idea in those works and generalizes to the case that there is no such invariance across the domains. Our proposed methods are generally capable of identifying causal direction from fewer than ten domains. When the invariance property holds, two domains are generally sufﬁcient.

1 Introduction

Consider a system comprised of two dependent random variables X and Y with no latent confounders. Assuming that the dependency is due to a unidirectional causation, how can we determine which variable is the cause and which one is the effect? The golden standard for answering this question is performing controlled experiments or interventions on the variables. Randomizing or intervening on the cause variable may change the effect variable, but not vice versa [Ebe07]. The issue with this approach is that in many cases performing experiments are expensive, unethical, impossible, or even undeﬁned. Therefore, there is a high interest in ﬁnding causal relations from purely observational data. Unfortunately, for a general system, using traditional conditional independence-based methods on purely observational data can identify a structure only up to its Markov equivalence [SGS00, Pea09]. For instance, for the aforementioned system of only two variables, the structures X ! Y and Y ! X (arrow indicates the causal direction), are Markov equivalent. Therefore, extra assumptions are required to make the problem well-deﬁned.

One successful approach in the literature is to restrict the data generating model. For the case that data is generated from a structural equation model with additive noise [Bol89], if the noise is non-Gaussian [SHHK06], or if the equations are non-linear with some mild conditions [ZC06, HJM+09, ZH09], the structure may be identiﬁable. Note that for additive noise models, in the basic case of a linear system with Gaussian noise, the causal direction is non-identiﬁable [SHHK06].

Another recently developing approach is to assume that the data is non-homogenous and is gathered from several different distributions [Hoo90, TP01, PBM16, ZHZ+17, GSKZ17, HZZ+17]. In many real-life settings, the data generating distribution may vary over time, or the dataset may be gathered from different domains and hence, not follow a single distribution. While such data is usually problematic in statistical analysis and causes restrictions on the learning power, this property can be

Correspondence to: Amir Emad Ghassami, email: ghassam2@illinois.edu.

32nd Conference on Neural Information Processing Systems (Neur IPS 2018), Montréal, Canada.

leveraged for the purpose of causal discovery, which is our focus in the paper. Moreover, it has been shown that causal modeling can be exploited to facilitate transfer learning [SJP+12, ZSMW13]. This is because of the coupling relationship between causal modeling and distribution change: Causal models constrain how the data distribution may change, while distribution change exhibits such changes.

In this paper, we focus on linear systems, with observational data for the variables in the system given in multiple domains, across which the causal coefﬁcients and/or the distribution of the exogenous noises may vary. We will present efﬁcient approaches to exploiting such changes for causal structure learning. The proposed methods are based on the principle that although the cause and the effect variables are dependent, the mechanism that generates the cause variable changes independently from the mechanism that generates the effect variable across domains. The same principle was utilized in [ZHZ+17] for exploiting non-stationary or heterogeneous data for causal structure learning. However, since that work considers a non-parametric approach, it is restricted to general independence tests among distributions, which may not have high efﬁciency. The remaining previous works on causal discovery from multiple domains usually assume that a certain type of invariance holds across domains [PBM16, GSKZ17]. Our approach uniﬁes the idea in those works, and generalizes to the case that there is no such invariance across the domains. We present structure learning methods, which are generally capable of identifying causal direction from fewer than ten domains (as a special case, when the invariance property holds, two domains generally sufﬁces).

The rest of the paper is organized as follows. We ﬁrst introduce our approach for ﬁnding causal relations from multi-domain observational data in a system comprising two variables in Section 2. We propose our methods for identifying causal direction in Subsections 2.1 and 2.2. These methods are evaluated in Subsection 2.3. We generalize our method to causal structure learning in networks of variables in Section 3, where the generalized version of our methods are presented in Subsections 3.1 and 3.2 and evaluated in Subsection 3.4. We also present an approach for combining our proposed methods with standard conditional independence-based structure learning algorithms in Subsection 3.3. All the proofs are provided in the supplementary materials.

2 Identifying Causal Relation between Two Variables

In a general system, each variable is generated by a causal module, with conditional distribution P(effect | causes), which takes the direct causes of the variable as input. We assume that the system is causally sufﬁcient, that is, the variables do not have latent confounders. Given a causal directed acyclic graph (DAG), the causal Markov condition [SGS00, Pea09], as a more formal version of Reichenbach s common cause principle [Rei91], says that any variable is independent from its nondescendants given its direct causes. The contrapositive implies that if two quantities are dependent, there must exist a causal path between them. As a consequence, when the joint distribution of a causally sufﬁcient system changes, for any effect variable, the distribution of its direct causes, P(causes), must change independently from P(effect | causes); otherwise, the parameter sets (or variables, from a Bayesian perspective) involved in the two modules change dependently, and there must exist a causal path connecting the parameter sets, which implies the existence of confounders and contradicts the causal sufﬁciency assumption. This can be viewed as a realization of the modularity property of causal systems [Pea09], and as the dynamic counterpart of the principle of independent mechanisms, which states that causal modules are algorithmically independent [DJM+10], or the exogeneity property of the causal system [ZZS15]. We formalize this characteristic as follows. Deﬁnition 1 (Principle of Independent Changes (PIC)). In a causally sufﬁcient system, the causal modules, as well as their included parameters, change independently across domains.

As mentioned in Section 1, this principle was used for causal discovery in nonparametric settings in [ZHZ+17, HZZ+17]. As will be explained in Subsection 2.2, the case of having an invariant causal mechanism is a special case, because a constant is independent from any other variable.

To introduce our methodology, we consider a system comprised of two dependent variables X and Y . Observational data for variables X and Y , or in the asymptotic case, the joint distributions of X and Y , in d domains D = {D(1), , D(d)} is given. The goal is to discover the causal direction between X and Y . We denote the ground truth cause variable and effect variable by C and E, respectively. The relation between C and E is assumed to be linear. Hence, the model in domain D(i) 2 D is denoted as follows:

domain D(i): C = N (i)

C , E = a(i)C + N (i)

where, N (i)

C and N (i)

E are independent exogenous noises with variances (σ2

C)(i) and (σ2

E)(i), respectively. Without loss of generality, we assume that the exogenous noises are zero-mean. We refer to variances of the exogenous noises and the causal coefﬁcients as the parameters of the system. In general, all these three parameters can vary across the domains. For our parametric model of interest, σ2

C corresponds to the cause module, while a and σ2

E correspond to the effect generation module. Therefore, PIC implies that σ2

C changes independently from the pair (a, σ2

E). Note that in general, σ2

E need not to be independent from a, as they both correspond to the mechanism generating the effect.

2.1 Proposed Approach Let βY |X be the linear regression coefﬁcient obtained from regressing Y on X, and let σ2

Y |X = Var(Y βY |X X), i.e., the variance of the residual of regressing Y on X. For the causal direction, we have

C, βE|C = aσ2

For the reverse direction, we have

E, βC|E = E[CE]

E[E2] = aσ2

C|E = Var(NC aσ2

(a NC + NE)) = σ2

We will utilize the change information across domains to ﬁnd the causal direction as follows. For any parameter γ 2 {σ2

C|;, βE|C, σ2

E|;, βC|E, σ2

C|E}, let γ(i) denote the value of this parameter in domain D(i), 1 i d. Consider {γ(1), , γ(d)} as samples from random variable γ. As stated earlier, according to PIC, σ2

C is independent form (βE|C, σ2

E|C) = (a, σ2

E), while as we can see from the expressions in (2), such independence does not hold in general in the reverse direction. For instance, if a and σ2

E are both ﬁxed, an increase in σ2

E|; always leads to an increase in βC|E and σ2

C|E. Therefore, we propose our causal discovery method as follows. To test whether X is the cause of Y , we test the independence between σ2

X|; and (βY |X, σ2

Y |X). If σ2

X|; and (βY |X, σ2

Y |X) are independent but the counterpart in the reverse direction is not, X is considered as the cause variable and Y the effect variable. More speciﬁcally, for testing if X is the cause of Y , let ΓX!Y = {|βY |X|, σ2

Y |X}, and deﬁne the dependence measure

TX!Y (D) :=

where any standard non-parametric measure of dependence I( , ), such as mutual information, can be used. Therefore, for inferring the causal relation between X and Y , we calculate TX!Y (D) and TY !X(D) and pick the direction which has the smaller value, i.e., arg min 2{X!Y,Y !X} T (D), as the correct causal direction. Alternatively, one can use test of statistical independence, such as the kernel-based one [GFT+08] to infer the direction. 2.1.1 Identical Boundaries Method Although checking for independence is sufﬁcient for discovering causal relation, in general performing a non-parametric independence test may not be efﬁcient. This may be specially problematic as in many applications the number of domains is small. In this subsection, we show that the parametric structure of our model can be exploited to devise an efﬁcient independence test.

The main idea is as follows. Consider two bounded random variables γ and γ. We refer to the minimum and maximum of γ as the boundaries of this variable. If γ and γ are independent, then γ will have the same distribution conditioned on γ = γmin and γ = γmax, i.e., γ will have identical distributions on the boundaries of γ. Speciﬁcally, it will have same expected values on the boundaries of γ. On the other hand, if γ and γ are dependent, the expected value of γ on the boundaries of γ are not necessarily different. However, we will show that if X ! Y is not the causal direction and βY |X (and σ2

Y |X) is dependent on σ2

X|;, due to the speciﬁc parametric structure of our model, the expected value of βY |X (and σ2

Y |X) on the boundaries of σ2

X|; will not be identical.

For any parameter of interest, we denote its minimum and its maximum value in the dataset with subscripts ˆm and ˆ M, respectively. For inferring if X is the cause of Y , let

"""E[log γ |log(σ2

X|;)=(log(σ2

X|;)) ˆ M ] E[log γ |log(σ2

X|;)=(log(σ2

and according to (3), we deﬁne the causal order indicator T IB

X!Y (D) := P

γ2ΓX!Y IIB(γ, σ2

X|;), where IB stands for identical boundaries. We have the following result for identiﬁability.

Theorem 1. For dataset D = {D(1), , D(d)}, as d ! 1, T IB

C!E(D) ! 0, and T IB

E!C(D) ! c, for some positive value c which is bounded away from 0.

Using Theorem 1, to see the causal relation between X and Y , we calculate T IB

X!Y (D) and T IB

Y !X(D) and pick the direction which has the smaller value, i.e., arg min 2{X!Y,Y !X} T IB

(D), as the true causal direction. We term this approach the Identical Boundaries (IB) method.

Note that in the IB method we only perform ﬁrst-order statistical test (i.e., regarding the mean) on the boundaries. Clearly, the idea can be extended to performing higher-order tests as well. We have provided the required formulation for the extension to second-order test in the supplementary materials.

2.2 Minimal Changes Method One may perform causal discovery from multiple domains by assuming certain invariances in the parameters across domains. More speciﬁcally, consider a pair of domains {D(i), D(j)}. In order to ﬁnd the causal direction, the authors of [GSKZ17] consider the particular case that a(i) = a(j)

and either (σ2

C)(i) 6= (σ2

C)(j) or (σ2

E)(i) 6= (σ2

E)(j). In this case, β(i)

E|C, and we have

C|E if and only if σ(i)

E . Therefore, comparing the coefﬁcients in both

directions, the causal direction is identiﬁable if σ(i)

E . Note that if either of the variances does not change, this condition always holds. In [PBM16], the authors assume that (σ2

E)(i) = (σ2

E)(j). Therefore, for variable E and any set S, if (σ2

E|S)(i) 6= (σ2

E|S)(j), then S is not the parent set of E. Similarly, the authors of [WSBU18] consider the case that a(i) 6= a(j) and either σ(i)

C , or σ(i)

E . Therefore, under this assumption, in this case for variable X 2 {C, E} and set S {C, E} \ X, if (σ2

X|S)(i) = (σ2

X|S)(j), then S is the parent set of X.

We note that invariance is a special case of the condition of independent changes, as a constant is independent from any variable. Therefore, our idea introduced in Subsection 2.1 can be applied to the case of existence of invariant parameter across domains. In this case, two domains are generally sufﬁcient to identify the causal direction. Therefore, we give a uniﬁcation and generalization of the perspectives of the aforementioned previous works. The assumptions in these works can be seen as particular realizations of the faithfulness assumption [SGS00], which is required for our proposed approach as well:

Assumption 1. For any pair of domains D(i) and D(j), if (σ2

E|;)(i) = (σ2

E|;)(j), or (βC|E)(i) = (βC|E)(j), or (σ2

C|E)(i) = (σ2

C|E)(j), then all parameters of the system are unchanged across D(i)

and D(j), i.e., (σ2

C)(i) = (σ2

C)(j), and a(i) = a(j), and (σ2

E)(i) = (σ2

Assumption 1 is mild in the sense that it only rules out a 2-dimensional subspace of a 3-dimensional space. Therefore, considering Lebesgue measure on the 3-dimensional space, we are only ruling out a measure-zero subset.

Since invariance is a special case of independent changes, based on PIC, change in one causal module does not force any changes in another causal module, i.e., a change in, say, σ2

C|;, will not enforce any changes on βE|C or σ2

E|C. However, in the reverse direction, as it can be seen from equations in (1) and (2), if any of the variables a, σ2

E varies across two domains, by Assumption 1, all three variables σ2

E|;, βC|E, and σ2

C|E will change. Based on this observation, we propose the following principle for causal discovery, which is the counterpart of PIC for the case of invariance. Deﬁnition 2 (Principle of Minimal Changes (PMC)). Suppose Assumption 1 holds. Compared to the direction from effect to cause, fewer or an equal number of changes are required in the causal direction to explain the variation in the joint distribution. Therefore, for testing whether variable X is the cause of variable Y , we propose the following method. Let Γ0

X|;, |βY |X|, σ2

Y |X}. We denote a member of Γ0

X!Y by γ, and

its realization in domain D(i) by γ(i). For any pair of domains {D(i), D(j)}, let Q(i,j)

X!Y 1[log γ(i) 6= log γ(j)]. This quantity counts the number of members of Γ0

0 10 20 30 40 50 Number of domains

0 10 20 30 40 50 Number of domains

IB MC HSIC IB Confounder MC Confounder HSIC with Confounder

Figure 1: F1 score versus number of domains for model 1 on the left and model 2 on the right.

vary across domains {D(i), D(j)}. We deﬁne the causal direction indicator

1[X ! Y 2 arg min 2{X!Y,Y !X} Q(i,j)

where MC stands for minimal changes. Under Assumption 1, we have the following result.

Theorem 2. For a given dataset D, we have T MC

C!E(D) T MC

E!C(D). The inequality is strict if there exists a pair of domains across which at least one and at most two of the parameters σ2

C, a, and σ2

Using Theorem 2, for testing the causal relation between X and Y , we calculate T MC

X!Y (D) and T MC

Y !X(D) and pick the direction that has the larger value, i.e., arg max 2{X!Y,Y !X} T MC

(D), as the true causal direction. We term this approach the Minimal Changes (MC) method.

2.3 Evaluation We consider two models for generating the parameters of the system. In the ﬁrst model, the variances of the noises and the causal coefﬁcients follow the distributions Unif([1,3]) and Unif([- 3,-0.5] [ [0.5,3]), respectively. In the second model, with equal probability, they either follow the aforementioned distributions, or are equal to a ﬁxed value. The number of samples in each domain is 103. We have used the state-of-the-art HISC test [GFT+08] as our non-parametric independence test. The performance of the proposed methods are depicted in Figure 1. We have depicted the F1 score for each case. As seen from the F1 score, the IB method performs better than the MC method in the ﬁrst model. However, if in an application we know that the parameters are not likely to change much (as in model 2), the MC method also has high performance. Moreover, for the case of limited number of domains, the IB method signiﬁcantly outperforms using HSIC test. We also tested our proposed methods when a latent confounder was present in the system. As could be seen in Figure 1, IB is generally more robust to the presence of latent confounder.

3 Causal Discovery with More than Two Variables

In this section, we extend the proposed methods in Section 2 to the case with more than two variables. We consider a causal DAG G = (V, A) over p variables of a causally sufﬁcient system, where V = {X1, , Xp} is the set of variables of the system, and A denotes the set of directed edges, i.e., each element of A is an ordered pair of elements of V . Observational data for variables in d domains D = {D(1), , D(d)} is given. Deﬁne the random vector X := (X1, , Xp)>. We consider the following linear structural equation model (SEM):

X = B>X + N, (4)

where B is the matrix of coefﬁcients of edges, with Bij denoting the coefﬁcient from variable Xi to Xj. If Bij 6= 0, Xi is a parent of Xj and Xj is a child of Xi. We denote the parent set and children set of X by Pa(X) and Ch(X), respectively. A variable is called source variable if it does not have any parents, and sink variable if it does not have any children. Without loss of generality, we assume that B is a strictly upper triangular matrix. Since the underlying structure is DAG, rows and columns of B can be permuted for this condition to be satisﬁed. N = (N1, , Np)> is the vector of exogenous noises. We denote the variance of Ni by σ2

i , and assume that elements of N are independent and without loss of generality, we assume that they are zero-mean variables. Also, we deﬁne as the p p diagonal matrix with values σ2

p on the diagonal.

Algorithm 1 Multi-domain Causal Structure Learning.

Input: Data from variables V in domains D, initial order t on V . Initiation: ˆ = ;. while | t| 6= 0 do

for X 2 t do

Γ X, 1(n) = {|( ˆB X, 1)m,n|, (ˆ X, 1)n,n; 1 m n}, 1 n | t|. TX, 1(D) = P

γ2Γ X, 1(n)

γ2Γ X, 1(k) I(γ, γ).

For the IB method: T IB

X, 1(D) = P

γ2Γ X, 1(n)

γ2Γ X, 1(k) |E[log γ |log γ=(log γ) ˆ M ]

E[log γ |log γ=(log γ) ˆ m]|.

end for Xlast = arg min X2 t T IB

X, 1(D). ˆ = concatenate(Xlast, ˆ ), Remove Xlast from t. end while Output: ˆ .

We deﬁne an ordering of the variables as a permutation of the variables of the system. An ordering on a set of variables and a DAG on those variables are consistent if in the ordering, every variable comes after its parents and before its children. Note that given an undirected graph, an ordering determines the direction of all the edges of the graph uniquely; however, there may be more than one ordering consistent with a given DAG. For example, for the DAG W ! X ! Y Z, orderings (W, Z, X, Y ), (W, X, Z, Y ), and (Z, W, X, Y ) are consistent, but the ordering (W, X, Y, Z) is not

consistent with the DAG, as it contradicts with the direction of the edge between Y and Z.

Deﬁnition 3 (Causal Order). An ordering on the variables is called causal if it is consistent with the ground truth causal DAG.

Since the skeleton of the causal DAG can be identiﬁed from basic conditional independence tests, the main challenge in causal structure learning is to ﬁnd a causal order. In the following we present our approaches to estimate a causal order on the variables of the system.

3.1 Proposed Approach Suppose a candidate for the causal order on the variables, denoted by , is given. In order to generalize our methodology to the network case, for each domain, we need to ﬁrst estimate the regression coefﬁcients and exogenous noise variances of each variable Xi on all the variables Xj with 1(Xj) < 1(Xi), i.e., all the variables, which proceed Xi in the given order. For the given order on variables, for 1 m p, let Sm := ( (1), , (m 1))>. We denote the estimated regression coefﬁcients and exogenous noise variances for the given ordering by ˆB and

ˆ , respectively, where ( ˆB )(1:m 1),m = β (m)|Sm, and (ˆ )m,m = σ2

(m)|Sm, and zero elsewhere.

Note that if is a causal order, then ˆB = B and ˆ = up to permutation. Standard regression calculation can be used for obtaining ˆB and ˆ . We additionally propose an efﬁcient method for estimating the regression coefﬁcients and noise variances, provided in the Supplementary Materials. Our proposed method also makes a connection between precision matrix and adjacency matrix of a Bayesian network.

We will utilize the change information across domains to ﬁnd a causal order as follows. Consider matrix M = B + . According to PIC, elements in each column of M should be jointly independent from elements in any other column, as they correspond to distinct causal modules. Therefore, we can set a metric for measuring dependencies, and orders that obtain the minimum value are causal orders. More speciﬁcally, for a given order on variables, let ˆB and ˆ be the outputs of regression. A naive way of applying the idea in Section 2 to networks is as follows: For 1 n p, let Γ (n) := {|( ˆB )m,n|, (ˆ )n,n; 1 m n}, and deﬁne Q (n) := P

γ2Γ (k) I(γ, γ), where I( , ) is any standard measure for dependence. We deﬁne the causal order indicator for exhaustive search T e

n=2 Q (n). Hence, one can estimate the causal order as ˆ = arg min T e

(D). Therefore, in low dimensions, the causal order can be found by exhaustive search over all orders. However, for large dimensions, this is infeasible, as the number of orders increases super-exponentially

Algorithm 2 MC Causal Structure Learning.

Input: Data from variables V in domains D, initial order t on V . Initiation: ˆ = ;. while | t| 6= 0 do

for X 2 t do

Deﬁne X = { t, X, 1, X, 2}. Γ0

= {|( ˆB )m,n|, (ˆ )m,m; 1 m n | t|}, 2 X. Q(i,j)

γ2Γ0 1[log γ(i) 6= log γ(j)], 2 X, 1 i < j d.

1 i<j d 1[ 2 arg min 02 X Q(i,j)

0 ], 2 X. Update t = arg max 2 X T MC

(D). end for

ˆ = concatenate( t( 1), ˆ ), Remove t( 1) from t. end while Output: ˆ .

with the number of variables. Therefore, we propose an alternative method for learning the causal structure.

The pseudo-code of the proposed approach is presented in Algorithm 1. The main idea is that in each round, we ﬁnd one variable which is the last in the causal order and remove it from the list, until all the variables are ordered. The algorithm starts with a random initial order t on all variables. In each round, for each variable X 2 t, quantity TX, 1(D), given in line 5 of the algorithm, is calculated, which shows the amount of dependency between parameters of the causal module of X and all the other estimated parameters when X is moved to the last position in t. After calculating the quantity TX, 1(D) for all variables in t, the variable Xlast that has the lowest value for this quantity is concatenated to the left side of our estimated order ˆ , and is removed from t. This procedure is continued until all the variables are moved to ˆ . 3.1.1 IB Causal Structure Learning In this subsection, we present an efﬁcient generalization of the IB methods. For the exhaustive version of IB, we deﬁne QIB

""E[log γ |log γ=(log γ) ˆ M ]

E[log γ |log γ=(log γ) ˆ m]

"". We deﬁne the IB causal order indicator for exhaustive search T e IB

(n). This gives us the following result, which is similar to Theorem 1.

Theorem 3. Let be a causal and 0 be a non-causal order. For a given dataset D, as d ! 1, T e IB

(D) ! 0, and T e IB

0 (D) ! c, for some value c, which is bounded away from 0.

Using Theorem 3, one can estimate the causal order as ˆ = arg min T e IB

(D). Therefore, in low dimensions, the causal order can be found by exhaustive search over all orders. However, as mentioned earlier for large dimensions, this is infeasible. Therefore, we need to have an alternative search approach similar to the version for non-parametric version, except that in each round, for each variable X 2 t, quantity T IB

X, 1(D), given in line 6 of Algorithm 1, is calculated.

Theorem 4. In any round of Algorithm 1, let Xs be a sink variable, and Xv be a non-sink variable in the sub-graph induced on t. Then as d ! 1, T IB

Xs, 1(D) ! 0 and T IB

Xv, 1(D) ! c, for some value c, which is bounded away from 0.

3.2 MC Causal Structure Learning In order to generalize the MC method to the network case, we need the following assumption, which is a counterpart of Assumption 1 in the multiple variable case.

Assumption 2. For any pair of domains D(i), D(j), for any variable X and set S, if (σ2

X|S)(i) = (σ2

X|S)(j) (or [(βX|S)(i)]k = [(βX|S)(j)]k), then the value of all the parameters of the system involved in the expression of σ2

X|S (or [(βX|S)]k) are equal in the two domains.

Roughly speaking, Assumption 2 for the linear SEM states that the parameters of the model should not have been designed in a way that they cancel each other out on correlations.

For a given order , let ˆB and ˆ be the output of regressions. Similar to Subsection 3.1, a naive way to apply the MC method to a network is as follows: Let Γ0

:= {|( ˆB )m,n|, (ˆ )m,m; 1

m n p}, and for any pair of domains {D(i), D(j)}, let Q(i,j)

1[log γ(i) 6= log γ(j)]. We deﬁne the MC causal order indicator for exhaustive search T e MC

1 i<j d 1[ 2

arg min 0 Q(i,j)

0 ]. This gives us the following result similar to Theorem 2. Theorem 5. Let be a causal and 0 be a non-causal order. For a given dataset D, we have T e MC

0 (D). Also, there exists two parameters of the system γ1 and γ2 such that if there exists two domains D(i), D(j) with γ(i)

2 , then T e MC

(D) > T e MC

Using Theorem 5, one can estimate the causal order as ˆ = arg max T e MC

(D). Therefore, in low dimensions, the causal order can be found by exhaustive search over all orders. However, for large dimensions, this is infeasible. Therefore, we propose the following alternative method for applying the MC method to networks.

The pseudo-code of the proposed approach is presented in Algorithm 2. The main idea is that in each round, we ﬁnd one variable with lowest causal order and remove it from the list, until all the variables are ordered. The algorithm starts with a random initial order t on all variables. In each round, for each variable X it forms 3 orders in set X. t is the initial order, X, 1 is the same as t, but X is moved to the last position, and X, 2 is the same as t, but X is moved to one before the last position in the order.2 The algorithm then calculates the quantity T MC

(D), given in line 7 of the pseudo-code, for each of the three orders in X, and updates t to the element of X that has the maximum value for this quantity. In the case of tie, we prioritize the orders as follows: t > X, 2 > X, 1. This prioritization guarantees that after performing the aforementioned update of t for all variables, the last variable in t, i.e., t( 1), will be a sink variable, in the subgraph induced on variables in t. We concatenate t( 1) to the left side of our estimated order ˆ , remove it from t, and continue to the next round until all the variables are moved to ˆ . Theorem 6. In each round of Algorithm 2, if Xs is a sink variable, then for all 2 Xs, T MC

Xs, 1(D) T MC

(D). Also, for any of Xs s parents, Xv, if there exists a pair of domains across which at least one and at most two of variables Var(Xv), Bv,s, σ2

s varies, then at the end of round, t( 1) will be a sink variable. Remark 1. Finding independence in Algorithm 1 and invariance in Algorithm 2 can also be done from top to bottom, similar to the approach used in [SIS+11]. That is, in each round we can also ﬁnd a variable with highest causal order as well.

3.3 Combining Proposed Methods with Conditional Independence-based Algorithms One can also run our proposed methods after applying any standard conditional independence-based algorithms, such as PC [SGS00] or GES [Chi02], to the data from some or all domains. Therefore, our proposed approach learns beyond the Markov equivalence class. The approach for combining our methods with standard conditional independence-based algorithms would be as follows.

1. We run the observational algorithm on the data from all domains (or its subset, if the sample

size is too big) to learn the essential graph (also known as CPDAG) of the underlying DAG. 2. We note that an essential graph is a chain graph. For each chain component C, we denote

its vertices by V (C) and denote the set of parents of V (C) by Pa(V (C)), and deﬁne VC = V (C) [ Pa(V (C)). Note that a variable cannot have both ingoing and outgoing edges from variables in V (C), otherwise we will have a partially directed cycle, which is not allowed in a chain graph [AMP+97]. 3. We apply our methods on each chain component separately and use the data corresponding

to VC as the input. We need to include Pa(V (C)) in VC to ensure that the variables under consideration do not have any latent confounders. 4. In VC we locate variables in Pa(V (C)) at the beginning of the order and use our methods to

ﬁnd the order on the remaining variables of VC, i.e., on V (C). 5. We combine the orders obtained from chain components. Note that with sufﬁciently many changing domains, our proposed methods can learn the whole causal structure regardless of utilizing any conditional independence-based algorithm.

3.4 Evaluation We considered model 1 described in Subsection 2.3 for generating the parameters of the system, with the number of generated samples in each domain equal to 103. After identifying the causal ordering,

2 We have provided an example in the Supplementary Materials to demonstrate why it is required to consider both orders X, 1 and X, 2.

2 10 20 30 40 50 Number of domains

p = 4 (complete) p = 8 (complete) p = 12 (complete) p = 4 (sparse) p = 8 (sparse) p = 12 (sparse)

2 4 6 8 10 12 14 16 Number of variables

d = 10 (complete) d = 20 (complete) d = 50 (complete) d = 10 (sparse) d = 20 (sparse) d = 50 (sparse)

2 10 20 30 40 50 Number of domains

p = 4 (complete) p = 8 (complete) p = 12 (complete) p = 4 (sparse) p = 8 (sparse) p = 12 (sparse)

2 4 6 8 10 12 14 16 Number of variables

d = 10 (complete) d = 20 (complete) d = 50 (complete) d = 10 (sparse) d = 20 (sparse) d = 50 (sparse)

2 10 20 30 40 50 Number of domains

p = 4 (complete) p = 8 (complete) p = 12 (complete) p = 4 (sparse) p = 8 (sparse) p = 12 (sparse)

2 4 6 8 10 12 14 16 Number of variables

d = 10 (complete) d = 20 (complete) d = 50 (complete) d = 10 (sparse) d = 20 (sparse) d = 50 (sparse)

2 10 20 30 40 50 Number of domains

p = 4 (complete) p = 8 (complete) p = 12 (complete) p = 4 (sparse) p = 8 (sparse) p = 12 (sparse)

2 4 6 8 10 12 14 16 Number of variables

d = 2 (complete) d = 10 (complete) d = 20 (complete) d = 50 (complete) d = 2 (sparse) d = 10 (sparse) d = 20 (sparse) d = 50 (sparse)

Figure 2: F1 score versus number of domains and number of variables.

we then estimate the causal coefﬁcients B on each domain separately. We set a threshold = 0.1 on B from each domain; if |Bij| is larger than , then there is an edge from Xi to Xj. Then if an edge appears in more than 80% of all domains, we take this edge in the ﬁnal graph. The results are show in Figure 2. All experiments are performed either on complete graphs or on sparse graph generated from Erdos-Renyi model with parameter 0.3. In general, we observed better performance on denser graphs. This is expected as having more parameters helps us in predicting the order. The IB and MC methods both showed high performance in our simulations, while the performance of non-parametric HSIC test was in general not acceptable. We also compared the performance with Li NGAM Algorithm [SHHK06]. To do so, we applied Li NGAM algorithm to the pooled data of all domains. As explained in [ZHZ+17], Li NAGM failed to perform well on our multi-domain data.

f MRI hippocampus data: We applied our methods to f MRI hippocampus dataset [PL], which contains signals from six separate brain regions: perirhinal cortex (PRC), parahippocampal cortex (PHC), entorhinal cortex (ERC), subiculum (Sub), CA1, and CA3/Dentate Gyrus (CA3) in the resting states on the same person in 84 successive days. We used the anatomical connections [BB08, ZHZ+17] as a reference. We applied both MC and IB on this dataset. We investigated all possible causal orders and found the one that minimizes the causal order indicator for MC and IB. After identifying the causal ordering, we estimated the causal coefﬁcients B on each domain separately with threshold = 0.1, and if an edge appears in more than 60% of all domains, we took this edge in the ﬁnal graph. The recovered causal graph between the six regions is shown in the ﬁgure on the right. The black edges indicate edges, which are identiﬁed by both MC and IB methods. The blue edges are only identiﬁed by the MC method, and the orange edges are only identiﬁed by the IB method. The edges in the anatomical ground truth are as follows: PHC ! ERC, PRC ! ERC, ERC ! CA3/DG, CA3/DG ! CA1, CA1 ! Sub, Sub ! ERC, and ERC ! CA1. 4 Conclusion We studied causal structure learning from multi-domain observational data. We proposed methods based on the principle that in a causally sufﬁcient system, the causal modules, as well as their included parameters, change independently across domains. The main idea in our approach does not require any type of invariance across the domains. We ﬁrst introduced our methods on a linear system comprised of two variables, and then proposed efﬁcient algorithms to generalize the idea to the case of multiple variables. We evaluated our method on both synthetic and real data. Our proposed methods are generally capable of identifying causal direction from fewer than ten domains, and when the invariance property holds, two domains are generally sufﬁcient.

Acknowledgments

This work was supported in part by Army grant W911NF-15-1-0281 and NSF grant NSF CCF 1065022. This material is partially based upon work supported by United States Air Force under Con-

tract No. FA8650-17-C-7715, by National Science Foundation under EAGER Grant No. IIS-1829681, and National Institutes of Health under Contract No. NIH-1R01EB022858-01, FAINR01EB022858, NIH-1R01LM012087, NIH-5U54HG008540-02, and FAIN-U54HG008540, and work funded and supported by the Department of Defense under Contract No. FA8702-15-D-0002 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center. Any opinions, ﬁndings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reﬂect the views of the United States Air Force or the National Institutes of Health or the National Science Foundation. We thank Clark Glymour and Malcolm Forster for helpful discussions, and appreciate the comments from anonymous reviewers, which greatly helped to improve the paper.

[AMP+97] Steen A Andersson, David Madigan, Michael D Perlman, et al. A characterization of

markov equivalence classes for acyclic digraphs. The Annals of Statistics, 25(2):505 541, 1997.

[BB08] Chris M Bird and Neil Burgess. The hippocampus and memory: insights from spatial

processing. Nature Reviews Neuroscience, 9(3):nrn2335, 2008.

[Bol89] Kenneth A. Bollen. Structural equations with latent variables. Wiley series in probability

and mathematical statistics. Applied probability and statistics section. Wiley, 1989.

[Chi02] David Maxwell Chickering. Optimal structure identiﬁcation with greedy search. Journal

of machine learning research, 3(Nov):507 554, 2002.

[DJM+10] Povilas Daniusis, Dominik Janzing, Joris Mooij, Jakob Zscheischler, Bastian Steudel,

Kun Zhang, and Bernhard Schölkopf. Distinguishing causes from effects using nonlinear acyclic causal models. In Proc. 26th Conference on Uncertainty in Artiﬁcial Intelligence (UAI2010), 2010.

[Ebe07] Frederick Eberhardt. Causation and intervention. Unpublished doctoral dissertation,

Carnegie Mellon University, 2007.

[GFT+08] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel

statistical test of independence. In NIPS 20, pages 585 592, Cambridge, MA, 2008. MIT Press.

[GSKZ17] Amir Emad Ghassami, Saber Salehkaleybar, Negar Kiyavash, and Kun Zhang. Learn-

ing causal structures using regression invariance. In Advances in Neural Information Processing Systems, pages 3015 3025, 2017.

[HJM+09] Patrik O Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf.

Nonlinear causal discovery with additive noise models. In Advances in neural information processing systems, pages 689 696, 2009.

[Hoo90] K. Hoover. The logic of causal inference. Economics and Philosophy, 6:207 234, 1990.

[HZZ+17] Biwei Huang, Kun Zhang, Jiji Zhang, Ruben Sanchez Romero, Clark Glymour, and

Bernhard Schölkopf. Behind distribution shift: Mining driving forces of changes and causal arrows. In Proceedings of IEEE 17th International Conference on Data Mining (ICDM 2017), 2017.

[PBM16] Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using

invariant prediction: identiﬁcation and conﬁdence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947 1012, 2016.

[Pea09] Judea Pearl. Causality. Cambridge university press, 2009.

[PL] Poldrack and Laumann. https://openfmri.org/dataset/ds000031/, 2015.

[Pou11] Mohsen Pourahmadi. Covariance estimation: The glm and regularization perspectives.

Statistical Science, pages 369 387, 2011.

[Rei91] Hans Reichenbach. The direction of time, volume 65. Univ of California Press, 1991.

[SGS00] Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and

search. MIT press, 2000.

[SHHK06] Shohei Shimizu, Patrik O Hoyer, Aapo Hyvärinen, and Antti Kerminen. A linear non-

gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003 2030, 2006.

[SIS+11] Shohei Shimizu, Takanori Inazumi, Yasuhiro Sogawa, Aapo Hyvärinen, Yoshinobu

Kawahara, Takashi Washio, Patrik O Hoyer, and Kenneth Bollen. Directlingam: A direct method for learning a linear non-gaussian structural equation model. Journal of Machine Learning Research, 12(Apr):1225 1248, 2011.

[SJP+12] B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij. On causal

and anticausal learning. In Proc. 29th International Conference on Machine Learning (ICML 2012), Edinburgh, Scotland, 2012.

[TP01] Jin Tian and Judea Pearl. Causal discovery from changes. In Proceedings of the

Seventeenth conference on Uncertainty in artiﬁcial intelligence, pages 512 521. Morgan Kaufmann Publishers Inc., 2001.

[WSBU18] Yuhao Wang, Chandler Squires, Anastasiya Belyaeva, and Caroline Uhler. Direct

estimation of differences in causal graphs. ar Xiv preprint ar Xiv:1802.05631, 2018.

[XG16] Pan Xu and Quanquan Gu. Semiparametric differential graph models. In Advances in

Neural Information Processing Systems, pages 1064 1072, 2016.

[ZC06] K. Zhang and L. Chan. Extensions of ICA for causality discovery in the Hong Kong

stock market. In Proc. 13th International Conference on Neural Information Processing (ICONIP 2006), 2006.

[ZH09] Kun Zhang and Aapo Hyvärinen. On the identiﬁability of the post-nonlinear causal

model. In Proc. 25th Conference on Uncertainty in Artiﬁcial Intelligence (UAI 2009), Montreal, Canada, 2009.

[ZHZ+17] Kun Zhang, Biwei Huang, Jiji Zhang, Clark Glymour, and Bernhard Schölkopf. Causal

discovery in the presence of distribution shift: Skeleton estimation and orientation determination. In Proc. International Joint Conference on Artiﬁcial Intelligence (IJCAI 2017), 2017.

[ZSMW13] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target

and conditional shift. In ICML, 2013.

[ZZS15] K. Zhang, J. Zhang, and B. Schölkopf. Distinguishing cause from effect based on exo-

geneity. In Proc. 15th Conference on Theoretical Aspects of Rationality and Knowledge (TARK 2015), 2015.