# Direct Estimation of Differences in Causal Graphs

Yuhao Wang (yuhaow@mit.edu), Chandler Squires (csquires@mit.edu), Anastasiya Belyaeva (belyaeva@mit.edu), Caroline Uhler (cuhler@mit.edu)
Lab for Information & Decision Systems and Institute for Data, Systems and Society, Massachusetts Institute of Technology, Cambridge, MA 02139

We consider the problem of estimating the differences between two causal directed acyclic graph (DAG) models with a shared topological order, given i.i.d. samples from each model. This is of interest, for example, in genomics, where changes in the structure or edge weights of the underlying causal graphs reflect alterations in the gene regulatory networks. We here provide the first provably consistent method for directly estimating the differences in a pair of causal DAGs without separately learning two possibly large and dense DAG models and computing their difference. Our two-step algorithm first uses invariance tests between regression coefficients of the two data sets to estimate the skeleton of the difference graph and then orients some of the edges using invariance tests between regression residual variances. We demonstrate the properties of our method through a simulation study and apply it to the analysis of gene expression data from ovarian cancer and during T-cell activation.

1 Introduction

Directed acyclic graph (DAG) models, also known as Bayesian networks, are widely used to model causal relationships in complex systems. Learning the causal DAG from observations on the nodes is an important problem across disciplines [8, 25, 30, 36]. A variety of causal inference algorithms based on observational data have been developed, including the prominent PC [36] and GES [20] algorithms, among others [35, 38]. However, these methods require strong assumptions [39]; in particular, theoretical analyses of the PC [14] and GES [23, 40] algorithms have shown that these methods are usually not consistent in the high-dimensional setting, i.e., when the number of nodes is of the same order as or exceeds the number of samples, unless highly restrictive assumptions are made on the sparsity and/or the maximum degree of the underlying DAG. The presence of high-degree hub nodes is a well-known feature of many networks [2, 3], thereby limiting the direct applicability of causal inference algorithms.

However, in many applications the end goal is not to recover the full causal DAG but to detect changes in the causal relations between two related networks. For example, in the analysis of EEG signals it is of interest to detect neurons or brain regions that interact differently when the subject is performing different activities [31]; in biological pathways, genes may control different sets of target genes under different cellular contexts or disease states [11, 28].
Due to recent technological advances that allow the collection of large-scale EEG or single-cell gene expression data sets in different contexts, there is a growing need for methods that can accurately identify differences in the underlying regulatory networks and thereby provide key insights into the underlying system [11, 28]. The limitations of causal inference algorithms for accurately learning large causal networks with hub nodes, together with the fact that the difference of two related networks is often sparse, call for methods that directly learn the difference of two causal networks without having to estimate each network separately.

The complementary problem to learning the difference of two DAG models is that of inferring the causal structure that is invariant across different environments. Algorithms for this problem have been developed in the recent literature [9, 27, 42]. Note, however, that the difference DAG can only be inferred from the invariant causal structure if the two DAGs are known. The problem of learning the difference between two networks has been considered previously in the undirected setting [17, 18, 43]. However, the undirected setting is often insufficient: only a causal (i.e., directed) network provides insights into the effects of interventions such as knocking out a particular gene. In this paper we provide, to our knowledge, the first provably consistent method for directly inferring the differences between pairs of causal DAG models that does not require learning each model separately.

The remainder of this paper is structured as follows: In Section 2, we set up the notation and review related work. In Section 3, we present our algorithm for directly estimating the difference of causal relationships, and in Section 4 we provide consistency guarantees for it. In Section 5, we evaluate the performance of our algorithm on both simulated and real datasets, including gene expression data from ovarian cancer and T cell activation.

2 Preliminaries and Related Work

Let $G = ([p], A)$ be a DAG with node set $[p] := \{1, \ldots, p\}$ and arrow set $A$. Without loss of generality we label the nodes such that if $i \to j$ in $G$ then $i < j$ (i.e., the labeling is a topological ordering). To each node $i$ we associate a random variable $X_i$, and we let $P$ be a joint distribution over the random vector $X = (X_1, \ldots, X_p)^T$. In this paper we consider the setting where the causal DAG model is given by a linear structural equation model (SEM) with Gaussian noise, i.e.,
$$X = B^T X + \epsilon,$$
where $B$ (the autoregressive matrix) is strictly upper triangular and consists of the edge weights of $G$, i.e., $B_{ij} \neq 0$ if and only if $i \to j$ in $G$, and the noise satisfies $\epsilon \sim \mathcal{N}(0, \Omega)$ with $\Omega := \mathrm{diag}(\sigma_1^2, \ldots, \sigma_p^2)$, i.e., there are no latent confounders. Denoting by $\Sigma$ the covariance matrix of $X$ and by $\Theta$ its inverse (i.e., the precision matrix), a short computation yields $\Theta = (I - B)\,\Omega^{-1}\,(I - B)^T$, and hence
$$\Theta_{ij} = -\sigma_j^{-2} B_{ij} + \sum_{k > j} \sigma_k^{-2} B_{ik} B_{jk} \;\; (i \neq j), \qquad \Theta_{ii} = \sigma_i^{-2} + \sum_{j > i} \sigma_j^{-2} B_{ij}^2 \;\; (i \in [p]). \tag{1}$$
This shows that the support of $\Theta$ is given by the moral graph of $G$, obtained by adding an edge between pairs of nodes that have a common child and removing all edge orientations. By the causal Markov assumption, which we assume throughout, the missing edges in the moral graph encode a subset of the conditional independence (CI) relations implied by a DAG model on $G$; the complete set of CI relations is given by the d-separations that hold in $G$ [15, Section 3.2.2], i.e., $X_i \perp\!\!\!\perp X_j \mid X_S$ in $P$ whenever nodes $i$ and $j$ are d-separated in $G$ given a set $S \subseteq [p] \setminus \{i, j\}$.
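To make the model concrete, here is a minimal numerical sketch (ours, not from the paper; the 4-node graph and parameter values are arbitrary choices for illustration) that samples from a linear Gaussian SEM and checks the identity $\Theta = (I - B)\Omega^{-1}(I - B)^T$ behind (1):

```python
import numpy as np

# A minimal sketch, assuming an arbitrary 4-node DAG and noise variances:
# sample from X = B^T X + eps and compare the empirical precision matrix
# with Theta = (I - B) Omega^{-1} (I - B)^T from equation (1).
rng = np.random.default_rng(0)
p = 4
B = np.zeros((p, p))
B[0, 1], B[1, 2], B[0, 3] = 0.8, -0.6, 0.5     # edges 1->2, 2->3, 1->4
omega = np.array([1.0, 0.5, 2.0, 1.0])          # noise variances sigma_j^2

# Solving X = B^T X + eps gives X = (I - B)^{-T} eps;
# row-wise this is X_row = eps_row @ (I - B)^{-1}.
n = 200_000
eps = rng.normal(scale=np.sqrt(omega), size=(n, p))
X = eps @ np.linalg.inv(np.eye(p) - B)

Theta = (np.eye(p) - B) @ np.diag(1 / omega) @ (np.eye(p) - B).T
Theta_hat = np.linalg.inv(np.cov(X, rowvar=False))
print(np.round(Theta, 2))        # exact precision matrix
print(np.round(Theta_hat, 2))    # empirical estimate, close for large n
```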
The faithfulness assumption is the assertion that all CI relations entailed by $P$ are implied by d-separation in $G$. A standard approach to causal inference is to first infer CI relations from the observations on the nodes of $G$ and then use the CI relations to learn the DAG structure. However, several DAGs can encode the same CI relations, and therefore $G$ can only be identified up to an equivalence class of DAGs, known as the Markov equivalence class (MEC) of $G$, which we denote by $[G]$. In [41], the authors gave a graphical characterization of the members of $[G]$: two DAGs are in the same MEC if and only if they have the same skeleton (i.e., underlying undirected graph) and the same v-structures (i.e., induced subgraphs of the form $i \to j \leftarrow k$). $[G]$ can be represented combinatorially by a partially directed graph with the skeleton of $G$ and an arrow for those edges in $G$ that have the same orientation in all members of $[G]$. This is known as the CP-DAG (or essential graph) of $G$ [1].

Various algorithms have been developed for learning the MEC of $G$ given observational data on the nodes, most notably the prominent GES [20] and PC [36] algorithms. While GES is a score-based approach that greedily optimizes a score such as the BIC (Bayesian Information Criterion) over the space of MECs, the PC algorithm views causal inference as a constraint satisfaction problem, with the constraints being the CI relations. In a two-stage approach, the PC algorithm first learns the skeleton of the underlying DAG and then determines the v-structures, both from the inferred CI relations. The GES and PC algorithms are provably consistent under the faithfulness assumption, meaning that they output the correct MEC given an infinite amount of data. Unfortunately, this assumption is very sensitive to hypothesis-testing errors when inferring CI relations from data, and violations are frequent, especially in non-sparse graphs [39]. If the noise variables in a linear SEM with additive noise are non-Gaussian, the full causal DAG can be identified (as opposed to just its MEC) [33], for example using the prominent LiNGAM algorithm [33]. Non-Gaussianity and sparsity of the underlying graph in the high-dimensional setting are crucial for consistency of LiNGAM.

In this paper we develop a two-stage approach, similar to the PC algorithm, for directly learning the difference between two linear SEMs with additive Gaussian noise on the DAGs $G$ and $H$. Note that naive algorithms that separately estimate $[G]$ and $[H]$ and take their difference can only identify edges that appeared/disappeared and cannot identify changes in edge weights (since the DAGs themselves are not identifiable). Our algorithm overcomes this limitation. In addition, we show in Section 4 that instead of requiring the restrictive faithfulness assumption on both DAGs $G$ and $H$, consistency of our algorithm only requires assumptions on the (usually) smaller and sparser network of differences.

Let $P^{(1)}$ and $P^{(2)}$ be a pair of linear SEMs with Gaussian noise defined by $(B^{(1)}, \epsilon^{(1)})$ and $(B^{(2)}, \epsilon^{(2)})$. Throughout, we make the simplifying assumption that both $B^{(1)}$ and $B^{(2)}$ are strictly upper triangular, i.e., that the underlying DAGs $G^{(1)}$ and $G^{(2)}$ share the same topological order. This assumption is reasonable, for example, in applications to genomics: genetic interactions may appear or disappear and edge weights may change, but interactions generally do not change direction. For example, in biological pathways an upstream gene does not generally become a downstream gene under different conditions.
Hence $B^{(1)} - B^{(2)}$ is also strictly upper triangular, and we define the difference-DAG (D-DAG) of the two models by $\Delta := ([p], A_\Delta)$ with $i \to j \in A_\Delta$ if and only if $B^{(1)}_{ij} \neq B^{(2)}_{ij}$; i.e., an edge $i \to j$ in $\Delta$ represents a change in the causal effect of $i$ on $j$, including changes in the presence/absence of an effect as well as changes in edge weight. Given i.i.d. samples from $P^{(1)}$ and $P^{(2)}$, our goal is to infer $\Delta$. Just as when estimating a single causal DAG model, the D-DAG is in general not completely identifiable, in which case we wish to identify the skeleton of $\Delta$ as well as a subset of the arrows $A_\Delta$.

A simpler task is learning differences of undirected graphical models. Let $\Theta^{(1)}$ and $\Theta^{(2)}$ denote the precision matrices corresponding to $P^{(1)}$ and $P^{(2)}$. The support of $\Theta^{(k)}$ consists of the edges in the undirected graph (UG) model corresponding to $P^{(k)}$. We define the difference-UG (D-UG) by $\Delta_\Theta := ([p], E_{\Delta_\Theta})$, with $i - j \in E_{\Delta_\Theta}$ if and only if $\Theta^{(1)}_{ij} \neq \Theta^{(2)}_{ij}$ for $i \neq j$. Two recent methods that directly learn the difference of two UG models are KLIEP [18] and DPM [43]; for a review and comparison of these methods see [17]. These methods can be used as a first step towards estimating the D-DAG $\Delta$: under genericity assumptions, the formulae for $\Theta_{ij}$ in (1) imply that if $B^{(1)}_{ij} \neq B^{(2)}_{ij}$ then $\Theta^{(1)}_{ij} \neq \Theta^{(2)}_{ij}$. Hence the skeleton of $\Delta$ is a subgraph of $\Delta_\Theta$, i.e., $\Delta \subseteq \Delta_\Theta$. In the following section we present our algorithm, showing how to obtain $\Delta$ and determine some of the edge directions in $\Delta$. We end this section with a piece of notation needed for introducing our algorithm: we define the set of changed nodes as $S_\Theta := \{\, i \mid \exists\, j \in [p] \text{ such that } \Theta^{(1)}_{ij} \neq \Theta^{(2)}_{ij} \,\}$.

3 Difference Causal Inference Algorithm

In Algorithm 1 we present our Difference Causal Inference (DCI) algorithm for directly learning the difference between two linear SEMs with additive Gaussian noise, given i.i.d. samples from each model. Our algorithm consists of a two-step approach similar to the PC algorithm. The first step, described in Algorithm 2, estimates the skeleton of the D-DAG $\Delta$ by removing edges one by one; it takes $\Delta_\Theta$ and $S_\Theta$ as input. In the high-dimensional setting, KLIEP can be used to estimate $\Delta_\Theta$ and $S_\Theta$. For completeness, in the Supplementary Material we also provide a constraint-based method that consistently estimates $\Delta_\Theta$ and $S_\Theta$ in the low-dimensional setting for general additive noise models.

Algorithm 1 (Difference Causal Inference (DCI)):
Input: sample data $\hat X^{(1)}, \hat X^{(2)}$.
Output: estimated skeleton and arrows $\hat A_\Delta$ of the D-DAG $\Delta$.
1. Estimate the D-UG $\Delta_\Theta$ and the set of changed nodes $S_\Theta$;
2. use Algorithm 2 to estimate the skeleton of $\Delta$;
3. use Algorithm 3 to estimate $A_\Delta$.

Algorithm 2 (Estimating the skeleton of the D-DAG $\Delta$):
Input: sample data $\hat X^{(1)}, \hat X^{(2)}$; estimated D-UG $\Delta_\Theta$; estimated set of changed nodes $S_\Theta$.
Output: estimated skeleton of $\Delta$.
Set the skeleton estimate to $\Delta_\Theta$.
For each edge $i - j$ in the current skeleton: if there exists $S \subseteq S_\Theta \setminus \{i, j\}$ such that $\beta^{(k)}_{i,j|S}$ is invariant across $k \in \{1, 2\}$, delete $i - j$; otherwise continue to the next edge.

Algorithm 3 (Directing edges in the D-DAG $\Delta$):
Input: sample data $\hat X^{(1)}, \hat X^{(2)}$; estimated set of changed nodes $S_\Theta$; estimated skeleton of $\Delta$.
Output: estimated set of arrows $A_\Delta$.
Set $A_\Delta := \emptyset$.
For each node $j$ incident to at least one undirected edge in the skeleton: if there exists $S \subseteq S_\Theta \setminus \{j\}$ such that $\sigma^{(k)}_{j|S}$ is invariant across $k \in \{1, 2\}$, add $i \to j$ to $A_\Delta$ for every neighbour $i \in S$ and add $j \to i$ to $A_\Delta$ for every neighbour $i \notin S$; then continue to the next node.
Finally, orient as many remaining undirected edges as possible via graph traversal using the following rule: orient $i - j$ as $i \to j$ whenever there is a directed chain $i \to \ell_1 \to \cdots \to \ell_t \to j$.
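In plain Python, the control flow of Algorithms 2 and 3, together with ordinary-least-squares plug-ins for the tested quantities $\beta^{(k)}_{i,j|S}$ and $(\sigma^{(k)}_{j|S})^2$ (formally defined in Definitions 3.1 and 3.2 below), looks roughly as follows. This is our illustrative sketch, not the released implementation: the thresholded comparisons (with `tol` a hypothetical tolerance) are crude stand-ins for the F-tests the paper actually uses, and the final graph-traversal rule of Algorithm 3 is omitted for brevity.

```python
import numpy as np
from itertools import combinations

def subsets(pool):
    """All subsets of `pool`, smallest first."""
    pool = sorted(pool)
    for r in range(len(pool) + 1):
        for S in combinations(pool, r):
            yield set(S)

def beta_coef(X, i, j, S):
    """OLS estimate of beta_{i,j|S}: the coefficient of X_i when regressing
    X_j on X_{{i} union S} (data assumed mean-centered)."""
    M = [i] + sorted(S)
    coef, *_ = np.linalg.lstsq(X[:, M], X[:, j], rcond=None)
    return coef[0]

def resid_var(X, j, S):
    """OLS estimate of sigma_{j|S}^2: residual variance of X_j given X_S."""
    S = sorted(S)
    if not S:
        return X[:, j].var(ddof=1)
    coef, *_ = np.linalg.lstsq(X[:, S], X[:, j], rcond=None)
    return (X[:, j] - X[:, S] @ coef).var(ddof=len(S))

def dci_skeleton(X1, X2, delta_ug_edges, s_theta, tol=0.05):
    """Algorithm 2: delete edge i-j if some S within S_Theta \\ {i, j}
    renders beta_{i,j|S} invariant across the two data sets."""
    skeleton = set(delta_ug_edges)
    for (i, j) in delta_ug_edges:
        if any(abs(beta_coef(X1, i, j, S) - beta_coef(X2, i, j, S)) < tol
               for S in subsets(s_theta - {i, j})):
            skeleton.discard((i, j))
    return skeleton

def dci_orient(X1, X2, skeleton, s_theta, tol=0.05):
    """Algorithm 3 (without the final traversal rule): if the residual
    variance of j is invariant given some S, orient all edges at j."""
    arrows = set()
    for j in {v for e in skeleton for v in e}:
        for S in subsets(s_theta - {j}):
            if abs(resid_var(X1, j, S) - resid_var(X2, j, S)) < tol:
                for (a, b) in skeleton:
                    if j in (a, b):
                        i = a if b == j else b
                        arrows.add((i, j) if i in S else (j, i))
                break
    return arrows
```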
Finally, $\Delta_\Theta$ can also simply be chosen to be the complete graph with $S_\Theta = [p]$. These different initializations of Algorithm 2 are compared via simulations in Section 5. The second step of DCI, described in Algorithm 3, infers some of the edge directions in the D-DAG. While the PC algorithm uses CI tests based on partial correlations both for inferring the skeleton and for determining edge orientations, DCI tests the invariance of certain regression coefficients across the two data sets in the first step and the invariance of certain regression residual variances in the second step. These tests are similar to the regression invariances used in [9] and are introduced in the following definitions.

Definition 3.1. Given $i, j \in [p]$ and $S \subseteq [p] \setminus \{i, j\}$, let $M := \{i\} \cup S$ and let $\beta^{(k)}_M$ be the best linear predictor of $X^{(k)}_j$ given $X^{(k)}_M$, i.e., the minimizer of $\mathbb{E}\big[(X^{(k)}_j - (\beta^{(k)}_M)^T X^{(k)}_M)^2\big]$. We define the regression coefficient $\beta^{(k)}_{i,j|S}$ to be the entry of $\beta^{(k)}_M$ corresponding to $i$.

Definition 3.2. For $j \in [p]$ and $S \subseteq [p] \setminus \{j\}$, we define $(\sigma^{(k)}_{j|S})^2$ to be the variance of the regression residual when regressing $X^{(k)}_j$ onto the random vector $X^{(k)}_S$.

Note that in general $\beta^{(k)}_{i,j|S} \neq \beta^{(k)}_{j,i|S}$. Each entry of $B^{(k)}$ can be interpreted as a regression coefficient, namely $B^{(k)}_{ij} = \beta^{(k)}_{i,j|(\mathrm{Pa}^{(k)}(j) \setminus \{i\})}$, where $\mathrm{Pa}^{(k)}(j)$ denotes the parents of node $j$ in $G^{(k)}$. Thus, when $B^{(1)}_{ij} = B^{(2)}_{ij}$, there exists a set $S$ such that $\beta^{(k)}_{i,j|S}$ stays invariant across $k \in \{1, 2\}$. This motivates using invariances between regression coefficients to determine the skeleton of the D-DAG. For orienting edges, observe that when $\sigma^{(k)}_j$ stays invariant across the two conditions, $\sigma^{(k)}_{j|S}$ also stays invariant if $S$ is chosen as $S = \mathrm{Pa}^{(1)}(j) \cup \mathrm{Pa}^{(2)}(j)$. This motivates using invariances of residual variances to discover the parents of node $j$ and then assign orientations. Similar to [9], we use hypothesis tests based on the F-test for testing the invariance of regression coefficients and residual variances. See the Supplementary Material for details regarding the construction of these hypothesis tests, the derivation of their asymptotic distributions, and an example outlining how this approach differs from the invariant structure learning of [9].

Example 3.3. We end this section with a 4-node example showing how the DCI algorithm works. Let $B^{(1)}$ and $B^{(2)}$ be the autoregressive matrices defined by the edge weights of $G^{(1)}$ and $G^{(2)}$, and let the noise variances satisfy the invariance pattern $\sigma^{(1)}_1 = \sigma^{(2)}_1$, $\sigma^{(1)}_3 = \sigma^{(2)}_3$, $\sigma^{(1)}_2 \neq \sigma^{(2)}_2$, $\sigma^{(1)}_4 \neq \sigma^{(2)}_4$. Initializing Algorithm 2 with $\Delta_\Theta$ the complete graph and $S_\Theta = [4]$ yields the DCI output. [Figure omitted: the DAGs $G^{(1)}$ and $G^{(2)}$ and the resulting DCI output shown in the original paper.]

4 Consistency of DCI

The DCI algorithm is consistent if it outputs a partially oriented graph $\hat\Delta$ that has the same skeleton as the true D-DAG $\Delta$ and whose oriented edges are all correctly oriented. Just as methods for estimating a single DAG require assumptions on the underlying model (e.g., the faithfulness assumption) to ensure consistency, our method for estimating the D-DAG requires assumptions on the relationship between the two underlying models. To define these assumptions it is helpful to view $(\sigma^{(k)}_j)_{j \in [p]}$ and the non-zero entries $(B^{(k)}_{ij})_{(i,j) \in A^{(k)}}$ as variables or indeterminates, and each entry of $\Theta^{(k)}$ as a rational function, i.e., a fraction of two polynomials, in the variables $B^{(k)}_{ij}$ and $\sigma^{(k)}_j$ as defined in (1).
Using Schur complements one can then similarly express $\beta^{(k)}_{v,w|S}$ and $(\sigma^{(k)}_{w|S})^2$ as rational functions in the entries of $\Theta^{(k)}$, and hence as rational functions in the variables $(B^{(k)}_{ij})_{(i,j) \in A^{(k)}}$ and $(\sigma^{(k)}_j)_{j \in [p]}$. The exact formulae are given in the Supplementary Material. Clearly, if $B^{(1)}_{ij} = B^{(2)}_{ij}$ for all $(i, j)$ and $\sigma^{(1)}_j = \sigma^{(2)}_j$ for all $j \in [p]$, then $\beta^{(1)}_{v,w|S} = \beta^{(2)}_{v,w|S}$ and $\sigma^{(1)}_{w|S} = \sigma^{(2)}_{w|S}$ for all $v, w, S$. For consistency of the DCI algorithm we assume that the converse holds as well, namely that differences in $B_{ij}$ and $\sigma_j$ between the two distributions are not cancelled out by changes in other variables and hence result in differences in the regression coefficients and regression residual variances. This allows us to deduce invariance patterns of the autoregressive matrices $B^{(k)}$ from invariance patterns of the regression coefficients and residual variances, and hence the differences of the two causal DAGs.¹

Assumption 4.1. For any choice of $i, j \in S_\Theta$, if $B^{(1)}_{ij} \neq B^{(2)}_{ij}$ then for all $S \subseteq S_\Theta \setminus \{i, j\}$ it holds that $\beta^{(1)}_{i,j|S} \neq \beta^{(2)}_{i,j|S}$ and $\beta^{(1)}_{j,i|S} \neq \beta^{(2)}_{j,i|S}$.

Assumption 4.2. For any choice of $i, j \in S_\Theta$ it holds that
1. if $B^{(1)}_{ij} \neq B^{(2)}_{ij}$, then for all $S \subseteq S_\Theta \setminus \{i, j\}$, $\sigma^{(1)}_{j|S} \neq \sigma^{(2)}_{j|S}$ and $\sigma^{(1)}_{i|S \cup \{j\}} \neq \sigma^{(2)}_{i|S \cup \{j\}}$;
2. if $\sigma^{(1)}_j \neq \sigma^{(2)}_j$, then $\sigma^{(1)}_{j|S} \neq \sigma^{(2)}_{j|S}$ for all $S \subseteq S_\Theta \setminus \{j\}$.

Assumption 4.1 ensures that the skeleton of the D-DAG is inferred correctly, whereas Assumption 4.2 ensures that the arrows returned by the DCI algorithm are oriented correctly. These assumptions are the analogues of the adjacency-faithfulness and orientation-faithfulness assumptions that ensure consistency of the PC algorithm for estimating the MEC of a causal DAG [29].

We now provide our main results, namely consistency of the DCI algorithm. For simplicity we here discuss the consistency guarantees when Algorithm 2 is initialized with $\Delta_\Theta$ the complete graph and $S_\Theta = [p]$. However, in practice we recommend initialization using KLIEP (see also Section 5) to avoid performing an unnecessarily large number of hypothesis tests. The consistency guarantees for such an initialization, including a method for learning the D-DAG in general additive noise models (not necessarily Gaussian), are provided in the Supplementary Material.

¹This is similar to the faithfulness assumption in the Gaussian setting, where partial correlations are used for CI testing; the partial correlations are rational functions in the variables $B^{(k)}_{ij}$ and $\sigma^{(k)}_j$, and the faithfulness assumption asserts that if a partial correlation $\rho_{ij|S}$ is zero then the corresponding rational function is identically zero and hence $B_{ij} = 0$ [16].

Theorem 4.3. Given Assumption 4.1, Algorithm 2 consistently estimates the skeleton of the D-DAG $\Delta$.

The proof is given in the Supplementary Material. The main ingredient is showing that if $B^{(1)}_{ij} = B^{(2)}_{ij}$, then there exists a conditioning set $S \subseteq S_\Theta \setminus \{i, j\}$ such that $\beta^{(1)}_{i,j|S} = \beta^{(2)}_{i,j|S}$, namely the parents of node $j$ in both DAGs excluding node $i$. Next, we provide consistency guarantees for Algorithm 3.

Theorem 4.4. Given Assumption 4.2, all arrows $A_\Delta$ output by Algorithm 3 are correctly oriented. In particular, if $\sigma^{(k)}_i$ is invariant across $k \in \{1, 2\}$, then all edges adjacent to $i$ are oriented.

Similar to the proof of Theorem 4.3, the proof follows by interpreting the rational functions corresponding to the regression residual variances in terms of d-connecting paths in $G^{(k)}$; it is given in the Supplementary Material.
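The key ingredient of Theorem 4.3 can also be checked numerically. In the sketch below (ours, with arbitrary parameters), only the edge $3 \to 4$ changes between the two models, and the regression coefficient $\beta^{(k)}_{1,3|\{2\}}$, which equals $B^{(k)}_{13}$ when conditioning on the remaining parent of node 3, is indeed invariant:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
B1 = np.zeros((p, p)); B2 = np.zeros((p, p))
B1[0, 2], B1[1, 2], B1[2, 3] = 0.9, 0.7, 0.5
B2[0, 2], B2[1, 2], B2[2, 3] = 0.9, 0.7, -0.4   # only the edge 3 -> 4 changes

def sample(B, n=200_000):
    eps = rng.normal(size=(n, p))                # Omega = I_p in both models
    return eps @ np.linalg.inv(np.eye(p) - B)

for X in (sample(B1), sample(B2)):
    # beta_{1,3|{2}}: regress X_3 on its parents X_1, X_2 (0-indexed columns)
    coef, *_ = np.linalg.lstsq(X[:, [0, 1]], X[:, 2], rcond=None)
    print(round(coef[0], 3))                     # both approx 0.9 = B[0, 2]
```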
It is important to note that as a direct corollary of Theorem 4.4 we obtain sufficient conditions for full identifiability of the D-DAG $\Delta$ (i.e., of all its arrows) using the DCI algorithm.

Corollary 4.5. Given Assumptions 4.1 and 4.2, and assuming that the error variances are the same across the two distributions, i.e., $\Omega^{(1)} = \Omega^{(2)}$, the DCI algorithm outputs the D-DAG $\Delta$.

In addition, we conjecture that Algorithm 3 is complete, i.e., that it directs all edges that are identifiable in the D-DAG. We end this section with two remarks, namely regarding the sample complexity of the DCI algorithm and an evaluation of how restrictive Assumptions 4.1 and 4.2 are.

Remark 4.6 (Sample complexity of DCI). For constraint-based methods such as the PC or DCI algorithms, the sample complexity is determined by the number of hypothesis tests performed by the algorithm [14]. In the high-dimensional setting, the number of hypothesis tests performed by the PC algorithm scales as $O(p^s)$, where $p$ is the number of nodes and $s$ is the maximum degree of the DAG, thereby implying severe restrictions on the sparsity of the DAG for a reasonable sample size. Meanwhile, the number of hypothesis tests performed by the DCI algorithm scales as $O(|\Delta_\Theta|\, 2^{|S_\Theta|-1})$ and hence does not depend on the degrees of the two DAGs. Therefore, even if the two DAGs $G^{(1)}$ and $G^{(2)}$ are high-dimensional and highly connected, the DCI algorithm is consistent and has better sample complexity (compared to estimating two DAGs separately) as long as the differences between $G^{(1)}$ and $G^{(2)}$ are sparse, i.e., $|S_\Theta|$ is small compared to $p$ and $s$.

Remark 4.7 (Strength of Assumptions 4.1 and 4.2). Since faithfulness, the standard assumption for consistency of causal inference algorithms that estimate an MEC, is known to be restrictive [39], it is of interest to compare Assumptions 4.1 and 4.2 to the faithfulness assumption of $P^{(k)}$ with respect to $G^{(k)}$ for $k \in \{1, 2\}$. In the Supplementary Material we provide examples showing that Assumptions 4.1 and 4.2 do not imply the faithfulness assumption on the two distributions, and vice versa. However, in the finite-sample regime we conjecture that Assumptions 4.1 and 4.2 are weaker than the faithfulness assumption: violations of faithfulness, as well as of Assumptions 4.1 and 4.2, correspond to points that are close to conditional independence hypersurfaces [39]. The number of these hypersurfaces (and hence the number of possible violations) grows with $s$ for the faithfulness assumption and with $|S_\Theta|$ for Assumptions 4.1 and 4.2. Hence if the two DAGs $G^{(1)}$ and $G^{(2)}$ are large and complex while having a sparse difference, then $|S_\Theta| \ll s$. See the Supplementary Material for more details.

5 Evaluation

In this section we compare our DCI algorithm with PC and GES on both synthetic and real data. The code used for the following experiments can be found at https://github.com/csquires/dci.

5.1 Synthetic data

We analyze the performance of our algorithm in both the low- and high-dimensional settings. For both settings we generated 100 realizations of pairs of upper-triangular SEMs $(B^{(1)}, \epsilon^{(1)})$ and $(B^{(2)}, \epsilon^{(2)})$. For $B^{(1)}$, the graphical structure was generated from an Erdős-Rényi model with expected neighbourhood size $s$ on $p$ nodes, with $n$ samples drawn per model. The edge weights were drawn uniformly from $[-1, -0.25] \cup [0.25, 1]$ to ensure that they were bounded away from zero. $B^{(2)}$ was then generated from $B^{(1)}$ by adding and removing edges with probability 0.1, i.e., independently across entries,
$$B^{(2)}_{ij} \sim \mathrm{Ber}(0.9) \cdot B^{(1)}_{ij} \;\; \text{if } B^{(1)}_{ij} \neq 0, \qquad B^{(2)}_{ij} \sim \mathrm{Ber}(0.1) \cdot \mathrm{Unif}\big([-1, -0.25] \cup [0.25, 1]\big) \;\; \text{if } B^{(1)}_{ij} = 0.$$
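A sketch (ours) of this data-generating process; the Erdős-Rényi edge probability $s/(p-1)$ is our assumption for achieving expected neighbourhood size $s$, a detail the paper leaves unspecified:

```python
import numpy as np

rng = np.random.default_rng(2)
p, s = 10, 3

def random_weight(size):
    """Uniform on [-1, -0.25] U [0.25, 1]: magnitudes bounded away from 0."""
    return rng.uniform(0.25, 1.0, size=size) * rng.choice([-1.0, 1.0], size=size)

upper = np.triu(np.ones((p, p), dtype=bool), k=1)     # strictly upper triangular
present = upper & (rng.random((p, p)) < s / (p - 1))  # expected degree ~ s
B1 = np.zeros((p, p))
B1[present] = random_weight(present.sum())

# B2: delete each existing edge w.p. 0.1, insert each absent edge w.p. 0.1.
B2 = B1.copy()
B2[present & (rng.random((p, p)) < 0.1)] = 0.0
inserted = upper & ~present & (rng.random((p, p)) < 0.1)
B2[inserted] = random_weight(inserted.sum())
```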
Figure 1: Proportion of consistently estimated D-DAGs for 100 realizations per setting with $p = 10$ nodes and sample size $n$. Panels (a) "skeleton" and (b) "skeleton & orientation" show the proportion of consistently estimated D-DAGs when considering just the skeleton of $\Delta$ and both the skeleton and the edge orientations, respectively; $\alpha$ is the significance level used for the hypothesis tests in the algorithms. Panel (c) "changed variances" shows the proportion of consistent estimates with respect to the number of changes in internal node variances $v$.

Note that while the DCI algorithm is able to identify changes in edge weights, we only generated DAG models that differ by edge insertions and deletions. This provides a fair comparison to the naive approach, in which the two DAGs $G^{(1)}$ and $G^{(2)}$ are estimated separately and their difference is taken, since that approach can only identify insertions and deletions of edges.

In Figure 1 we analyzed how the performance of the DCI algorithm changes over different choices of the significance level $\alpha$. The simulations were performed on graphs with $p = 10$ nodes, neighbourhood size $s = 3$, and sample size $n \in \{10^3, 10^4\}$. For Figure 1 (a) and (b) we set $\epsilon^{(1)}, \epsilon^{(2)} \sim \mathcal{N}(0, I_p)$, which by Corollary 4.5 ensures that the D-DAG is fully identifiable. We compared the performance of DCI to the naive approach of separately estimating the two DAGs $G^{(1)}$ and $G^{(2)}$ and taking their difference; for separate estimation we used the prominent PC and GES algorithms tailored to the Gaussian setting. Since KLIEP requires an additional tuning parameter, in order to understand how $\alpha$ influences the performance of the DCI algorithm we here only analyzed initialization with the fully connected graph (DCI-FC) and with the constraint-based method described in the Supplementary Material (DCI-C); both initializations yield a provably consistent algorithm.

Figure 1 (a) and (b) show the proportion of consistently estimated D-DAGs when considering just the skeleton and both the skeleton and orientations, respectively. For PC and GES, we took the set of edges appearing in one estimated skeleton but not in the other as the estimated skeleton of the D-DAG $\Delta$; in determining orientations, we took the arrows directed in one estimated CP-DAG but absent from the other as the estimated set of directed arrows. Since the main purpose of this low-dimensional simulation study is to validate our theoretical findings, we used the exact recovery rate as the evaluation criterion. In line with our theoretical findings, both variants of the DCI algorithm outperformed taking differences after separate estimation. Figure 1 (a) and (b) also show that the PC algorithm outperformed GES, which is unexpected given previous results showing that GES usually has a higher exact recovery rate than the PC algorithm for estimating a single DAG. This is because, while the PC algorithm usually estimates fewer DAGs exactly correctly, its incorrect estimates tend to look more similar to the true model than the incorrect estimates of GES (as also reported in [35]) and can therefore still lead to a correct estimate of the D-DAG.
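For reference, the naive baseline described above, i.e., the symmetric difference of two separately estimated skeletons, can be written compactly. This is our sketch, with `skel1` and `skel2` standing for edge sets estimated by any structure-learning method such as PC or GES:

```python
def naive_skeleton_difference(skel1, skel2):
    """Edges present in exactly one of the two estimated skeletons."""
    e1 = {frozenset(e) for e in skel1}
    e2 = {frozenset(e) for e in skel2}
    return e1 ^ e2   # symmetric difference of the two edge sets
```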
In Figure 1 (c) we analyzed the effect of changes in the noise variances on estimation performance. We set $\epsilon^{(1)} \sim \mathcal{N}(0, I_p)$, while for $\epsilon^{(2)}$ we randomly picked $v$ nodes and sampled their variances uniformly from $[1.25, 2]$. We used significance level $\alpha = 0.05$ based on the evaluation in Figure 1. In line with Theorem 4.4, as we increase the number of nodes $i$ with $\sigma^{(1)}_i \neq \sigma^{(2)}_i$, the number of edges whose orientations can be determined decreases; this is because Algorithm 3 can only determine an edge's orientation when the variance of at least one of its endpoints is invariant. Moreover, Figure 1 (c) shows that the accuracy of Algorithm 2 is not impacted by changes in the noise variances.

Figure 2: High-dimensional evaluation of the DCI algorithm on simulated and real data. Panels (a) "D-DAG skeleton" and (b) show the ROC curves for estimating the skeleton of the D-DAG and the D-DAG itself with $p = 100$ nodes, expected neighbourhood size $s = 10$, $n = 300$ samples, and 5% change between the DAGs; panel (c) "T-cell activation" shows the estimated D-DAG between gene expression data from naive and activated T cells.

Finally, Figure 2 (a)-(b) show the performance (as ROC curves) of the DCI algorithm in the high-dimensional setting when initialized using KLIEP (DCI-K) and using DCI-C. The simulations were performed on graphs with $p = 100$ nodes, expected neighbourhood size $s = 10$, sample size $n = 300$, and $\epsilon^{(1)}, \epsilon^{(2)} \sim \mathcal{N}(0, I_p)$. $B^{(2)}$ was derived from $B^{(1)}$ so that the total number of changes was 5% of the total number of edges in $B^{(1)}$, with an equal number of insertions and deletions. Figure 2 (a)-(b) show that DCI-C and DCI-K perform similarly well and both outperform separate estimation using GES and the PC algorithm. The corresponding plots for a 10% change between $B^{(1)}$ and $B^{(2)}$ are given in the Supplementary Material.

5.2 Real data analysis

Ovarian cancer. We tested our method on an ovarian cancer data set [37] that contains two groups of patients with different survival rates and was previously analyzed with the DPM algorithm in the undirected setting [43]. We followed the analysis of [43] and applied the DCI algorithm to gene expression data from the apoptosis and TGF-β pathways. In the apoptosis pathway we identified two hub nodes: BIRC3, also discovered by DPM, is an inhibitor of apoptosis [12] and one of the main dysregulated genes in ovarian cancer [13]; PRKAR2B, not identified by DPM, has been shown to be important for disease progression in ovarian cancer cells [4] and is an important regulatory unit for cancer cell growth [5]. In addition, the RII-β protein encoded by PRKAR2B has been considered as a therapeutic target for cancer therapy [6, 22], confirming the relevance of our findings. In the TGF-β pathway, the DCI method identified THBS2 and COMP as hub nodes. Both of these genes have been implicated in resistance to chemotherapy in epithelial ovarian cancer [19] and were also recovered by DPM. Overall, the D-UG discovered by DPM is comparable to the D-DAG found by our method. More details on this analysis are given in the Supplementary Material.

T cell activation. To demonstrate the relevance of our method for current genomics applications, we applied DCI to single-cell gene expression data of naive and activated T cells in order to study the pathways involved in the immune response to a pathogen. We analyzed data from 377 activated and 298 naive T cells obtained by [34] using the recent drop-seq technology. From the previously identified differentially expressed genes between naive and activated T cells [32], we selected all genes with a fold expression change above 10, resulting in 60 genes for further analysis.
We initialized DCI using KLIEP with the edge weights thresholded at 0.005, ran DCI over a range of tuning parameters with cross-validation, and obtained the final DCI output shown in Figure 2 (c) using stability selection as described in [21]. The genes with the highest out-degree, and hence of interest for future interventional experiments, are GZMB and UHRF1. Interestingly, GZMB is known to induce cytotoxicity, important for attacking and killing invading pathogens; furthermore, this gene has been reported as the most differentially expressed gene during T cell activation [10, 26]. UHRF1 has been shown through knockout experiments to be critical for T cell maturation and proliferation [7, 24]. Interestingly, the UHRF1 protein is a transcription factor, i.e., it binds to DNA sequences and regulates the expression of other genes, confirming its role as an important causal regulator. Learning a D-DAG as opposed to a D-UG is crucial for prioritizing such interventional experiments. In addition, the difference-UG for this application would not only have been denser, but it would also have contained additional hub nodes such as FABP5, KLRC1, and ASNS, which based on the current biological literature appear secondary to T cell activation (FABP5 is involved in lipid binding, KLRC1 has a role in natural killer cells but not in T cells, and ASNS is an asparagine synthetase gene). The difference DAGs learned by separately applying the GES and PC algorithms to the naive and activated T cell data sets, as well as to the ovarian cancer data sets, are included in the Supplementary Material for comparison.

6 Discussion

We presented an algorithm for directly estimating the difference between two causal DAG models given i.i.d. samples from each model. To our knowledge this is the first such algorithm; it is of particular interest for learning differences between related networks, where each network might be large and complex while the difference is sparse. We provided consistency guarantees for our algorithm and showed on synthetic and real data that it outperforms the naive approach of separately estimating two DAG models and taking their difference. While our proofs are for the setting with no latent variables, they extend to the setting where the edge weights and noise terms of all latent variables remain invariant across the two DAGs. We applied our algorithm to gene expression data in bulk and from single cells, showing that DCI is able to identify biologically relevant genes for ovarian cancer and T cell activation. This positions DCI as a promising method for identifying intervention targets that are causal for a particular phenotype, for subsequent experimental validation. A more careful analysis of the D-DAGs discovered by our DCI algorithm is needed to reveal its impact for scientific discovery.

In order to make DCI scale to networks with thousands of nodes, an important challenge is to reduce the number of hypothesis tests. As mentioned in Remark 4.6, the time complexity of DCI (given by the number of hypothesis tests) currently scales exponentially with the size of $S_\Theta$. The PC algorithm overcomes the analogous problem by dynamically updating the list of CI tests given the current estimate of the graph; it is an open problem whether one can similarly reduce the number of hypothesis tests for DCI. Another challenge is to relax Assumptions 4.1 and 4.2.
Furthermore, in many applications (e.g., when comparing normal to disease states) there is an imbalance of data or prior knowledge between the two models, and it is of interest to develop methods that can make use of this when learning the differences between the two models. Finally, as described in Section 2, DCI is preferable to separate estimation methods such as PC and GES since it can infer not only edges that appear or disappear, but also edges whose weights change. However, unlike separate estimation methods, DCI relies on the assumption that the two DAGs share a topological order. Developing methods that directly estimate the difference of two DAGs not sharing a topological order is of great interest for future work.

Acknowledgements

Yuhao Wang was supported by ONR (N00014-17-1-2147), NSF (DMS-1651995) and the MIT-IBM Watson AI Lab. Anastasiya Belyaeva was supported by an NSF Graduate Research Fellowship (1122374) and the Abdul Latif Jameel World Water and Food Security Lab (J-WAFS) at MIT. Caroline Uhler was partially supported by ONR (N00014-17-1-2147), NSF (DMS-1651995), and a Sloan Fellowship.

References

[1] S. A. Andersson, D. Madigan, and M. D. Perlman. A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics, 25(2):505–541, 1997.
[2] A.-L. Barabási, N. Gulbahce, and J. Loscalzo. Network medicine: a network-based approach to human disease. Nature Reviews Genetics, 12(1):56–68, 2011.
[3] A.-L. Barabási and Z. N. Oltvai. Network biology: understanding the cell's functional organization. Nature Reviews Genetics, 5(2):101–113, 2004.
[4] C. Cheadle, M. Nesterova, T. Watkins, K. C. Barnes, J. C. Hall, A. Rosen, K. G. Becker, and Y. S. Cho-Chung. Regulatory subunits of PKA define an axis of cellular proliferation/differentiation in ovarian cancer cells. BMC Medical Genomics, 1(1):43, 2008.
[5] F. Chiaradonna, C. Balestrieri, D. Gaglio, and M. Vanoni. RAS and PKA pathways in cancer: new insight from transcriptional analysis. Frontiers in Bioscience, 13:5257–5278, 2008.
[6] Y. S. Cho-Chung. Antisense oligonucleotide inhibition of serine/threonine kinases: an innovative approach to cancer treatment. Pharmacology & Therapeutics, 82(2):437–449, 1999.
[7] Y. Cui, X. Chen, J. Zhang, X. Sun, H. Liu, L. Bai, C. Xu, and X. Liu. Uhrf1 controls iNKT cell survival and differentiation through the Akt-mTOR axis. Cell Reports, 15(2):256–263, 2016.
[8] N. Friedman, M. Linial, I. Nachman, and D. Pe'er. Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7(3-4):601–620, 2000.
[9] A. Ghassami, S. Salehkaleybar, N. Kiyavash, and K. Zhang. Learning causal structures using regression invariance. In Advances in Neural Information Processing Systems, pages 3015–3025, 2017.
[10] L. A. Hatton. Molecular Mechanisms Regulating CD8+ T Cell Granzyme and Perforin Gene Expression. PhD thesis, University of Melbourne, 2013.
[11] N. J. Hudson, A. Reverter, and B. P. Dalrymple. A differential wiring analysis of expression data correctly identifies the gene containing the causal mutation. PLoS Computational Biology, 5(5):e1000382, 2009.
[12] R. W. Johnstone, A. J. Frew, and M. J. Smyth. The TRAIL apoptotic pathway in cancer onset, progression and therapy. Nature Reviews Cancer, 8(10):782–798, 2008.
[13] J. Jönsson, K. Bartuma, M. Dominguez-Valentin, K. Harbst, Z. Ketabi, S. Malander, M. Jönsson, A. Carneiro, A. Måsbäck, G. Jönsson, and M. Nilbert. Distinct gene expression profiles in ovarian cancer linked to Lynch syndrome. Familial Cancer, 13:537–545, 2014.
[14] M. Kalisch and P. Bühlmann. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8(Mar):613–636, 2007.
[15] S. L. Lauritzen. Graphical Models, volume 17. Clarendon Press, 1996.
[16] S. Lin, C. Uhler, B. Sturmfels, and P. Bühlmann. Hypersurfaces and their singularities in partial correlation testing. Foundations of Computational Mathematics, 14(5):1079–1116, 2014.
[17] S. Liu, K. Fukumizu, and T. Suzuki. Learning sparse structural changes in high-dimensional Markov networks. Behaviormetrika, 44(1):265–286, 2017.
[18] S. Liu, J. A. Quinn, M. U. Gutmann, T. Suzuki, and M. Sugiyama. Direct learning of sparse changes in Markov networks by density ratio estimation. Neural Computation, 26(6):1169–1197, 2014.
[19] S. Marchini, R. Fruscio, L. Clivio, L. Beltrame, L. Porcu, I. F. Nerini, D. Cavalieri, G. Chiorino, G. Cattoretti, C. Mangioni, R. Milani, V. Torri, C. Romualdi, A. Zambelli, M. Romano, M. Signorelli, S. di Giandomenico, and M. D'Incalci. Resistance to platinum-based chemotherapy is associated with epithelial to mesenchymal transition in epithelial ovarian cancer. European Journal of Cancer, 49(2):520–530, 2013.
[20] C. Meek. Graphical Models: Selecting Causal and Statistical Models. PhD thesis, Carnegie Mellon University, 1997.
[21] N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 72(4):417–473, 2010.
[22] T. Mikalsen, N. Gerits, and U. Moens. Inhibitors of signal transduction protein kinases as targets for cancer therapy. Biotechnology Annual Review, 12:153–223, 2006.
[23] P. Nandy, A. Hauser, and M. H. Maathuis. High-dimensional consistency in score-based and hybrid structure learning, 2015. To appear in The Annals of Statistics.
[24] Y. Obata, Y. Furusawa, T. A. Endo, J. Sharif, D. Takahashi, K. Atarashi, M. Nakayama, S. Onawa, Y. Fujimura, M. Takahashi, T. Ikawa, T. Otsubo, Y. I. Kawamura, T. Dohi, S. Tajima, H. Masumoto, O. Ohara, K. Honda, S. Hori, H. Ohno, H. Koseki, and K. Hase. The epigenetic regulator Uhrf1 facilitates the proliferation and maturation of colonic regulatory T cells. Nature Immunology, 15(6):571–579, 2014.
[25] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.
[26] A. Peixoto, C. Evaristo, I. Munitic, M. Monteiro, A. Charbit, B. Rocha, and H. Veiga-Fernandes. CD8 single-cell gene coexpression reveals three different effector types present at distinct phases of the immune response. The Journal of Experimental Medicine, 204(5):1193–1205, 2007.
[27] J. Peters, P. Bühlmann, and N. Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947–1012, 2016.
[28] J. E. Pimanda, K. Ottersbach, K. Knezevic, S. Kinston, W. Y. Chan, N. K. Wilson, J. Landry, A. D. Wood, A. Kolb-Kokocinski, A. R. Green, D. Tannahill, G. Lacaud, V. Kouskoff, and B. Göttgens. Gata2, Fli1, and Scl form a recursively wired gene-regulatory circuit during early hematopoietic development. Proceedings of the National Academy of Sciences, 104(45):17692–17697, 2007.
[29] J. Ramsey, P. Spirtes, and J. Zhang. Adjacency-faithfulness and conservative causal inference. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pages 401–408. AUAI Press, 2006.
[30] J. M. Robins, M. A. Hernán, and B. Brumback. Marginal structural models and causal inference in epidemiology. Epidemiology, 11(5):550–560, 2000.
[31] S. Sanei and J. A. Chambers. EEG Signal Processing. John Wiley & Sons, 2013.
[32] S. Sarkar, V. Kalia, W. N. Haining, B. T. Konieczny, S. Subramaniam, and R. Ahmed. Functional and genomic profiling of effector CD8 T cell subsets with distinct memory fates. The Journal of Experimental Medicine, 205(3):625–640, 2008.
[33] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003–2030, 2006.
[34] M. Singer, C. Wang, L. Cong, N. D. Marjanovic, M. S. Kowalczyk, H. Zhang, J. Nyman, K. Sakuishi, S. Kurtulus, D. Gennert, J. Xia, J. Y. H. Kwon, J. Nevin, R. H. Herbst, I. Yanai, O. Rozenblatt-Rosen, V. K. Kuchroo, A. Regev, and A. C. Anderson. A distinct gene module for dysfunction uncoupled from activation in tumor-infiltrating T cells. Cell, 166(6):1500–1511, 2016.
[35] L. Solus, Y. Wang, L. Matejovicova, and C. Uhler. Consistency guarantees for permutation-based causal inference algorithms, 2017.
[36] P. Spirtes, C. N. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2000.
[37] R. W. Tothill, A. V. Tinker, J. George, R. Brown, S. B. Fox, S. Lade, D. S. Johnson, M. K. Trivett, D. Etemadmoghadam, B. Locandro, N. Traficante, S. Fereday, J. A. Hung, Y. Chiew, I. Haviv, Australian Ovarian Cancer Study Group, D. Gertig, A. de Fazio, and D. D. L. Bowtell. Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clinical Cancer Research, 14(16):5198–5208, 2008.
[38] I. Tsamardinos, L. E. Brown, and C. F. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31–78, 2006.
[39] C. Uhler, G. Raskutti, P. Bühlmann, and B. Yu. Geometry of the faithfulness assumption in causal inference. The Annals of Statistics, pages 436–463, 2013.
[40] S. Van de Geer and P. Bühlmann. ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. The Annals of Statistics, 41(2):536–567, 2013.
[41] T. Verma and J. Pearl. Equivalence and synthesis of causal models. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence, pages 255–270. Elsevier Science Inc., 1990.
[42] K. Zhang, B. Huang, J. Zhang, C. Glymour, and B. Schölkopf. Causal discovery from nonstationary/heterogeneous data: skeleton estimation and orientation determination. In IJCAI: Proceedings of the Conference, volume 2017, page 1347. NIH Public Access, 2017.
[43] S. D. Zhao, T. T. Cai, and H. Li. Direct estimation of differential networks. Biometrika, 101(2):253–268, 2014.