Discovering Fully Oriented Causal Networks

Osman Mian, Alexander Marx and Jilles Vreeken
CISPA Helmholtz Center for Information Security
{osman.mian, alexander.marx, jv}@cispa.de

We study the problem of inferring causal graphs from observational data. We are particularly interested in discovering graphs where all edges are oriented, as opposed to the partially directed graphs that the state-of-the-art discovers. To this end we base our approach on the algorithmic Markov condition. Unlike the statistical Markov condition, it uniquely identifies the true causal network as the one that provides the simplest factorization of the joint distribution, as measured by Kolmogorov complexity. Although Kolmogorov complexity is not computable, we can approximate it from above via the Minimum Description Length principle, which allows us to define a consistent and computable score based on non-parametric multivariate regression. To efficiently discover causal networks in practice, we introduce the GLOBE algorithm, which greedily adds, removes, and orients edges such that it minimizes the overall cost. Through an extensive set of experiments we show GLOBE performs very well in practice, beating the state-of-the-art by a margin.

Introduction

Discovering causal dependencies from observational data is one of the most fundamental problems in science (Pearl 2009). We consider the problem of recovering the causal network over a set of continuous-valued random variables X based on an iid sample from their joint distribution. The state-of-the-art does so by first recovering an undirected causal skeleton, which identifies the variables that have a direct causal relation, and then uses conditional independence tests to orient as many edges as possible. By the nature of these tests this can only be done up to Markov equivalence classes, which means that these methods in practice return networks where only few edges are oriented. In contrast, we develop an approach that discovers fully directed causal graphs.

We base our approach on the algorithmic Markov condition (AMC), a recent postulate that states that the factorization of the joint distribution according to the true causal network coincides with the one that achieves the lowest Kolmogorov complexity (Janzing and Schölkopf 2010). As an example, consider the case where X causes Y. Whereas the traditional statistical Markov condition cannot differentiate between P(X)P(Y | X) and P(Y)P(X | Y), as both are valid factorizations of the joint distribution P(X, Y), the algorithmic Markov condition takes the complexities of these distributions into account: in this case, the simplest factorization of P(X, Y) is K(P(X)) + K(P(Y | X)), as only this factorization upholds the true independence between the marginal and the conditional distribution; any competing factorization will be more complex because of inherent redundancy between the terms. As Kolmogorov complexity can capture any physical process (Li and Vitányi 2009), the AMC is a very general model for causality. However, Kolmogorov complexity is not computable, and hence we need a practical score to instantiate it. Here we do so through the Minimum Description Length principle (Grünwald 2007), which provides a statistically well-founded approach to approximate Kolmogorov complexity from above.
We develop an MDL-based score for directed acyclic graphs (DAGs), where we model the dependencies between variables through non-parametric multivariate regression. Simply put, the lower the regression error of the discovered model, the lower its cost, while more parameters mean higher complexity. We show this score is consistent: given sufficiently many samples from the joint distribution, we can uniquely identify the true causal graph if the causal relations are nearly deterministic. To efficiently discover causal networks directly from data we introduce the GLOBE algorithm, which, much like the well-known GES algorithm (Chickering 2002), greedily adds and removes edges to optimize the score. Unlike GES, however, GLOBE traverses the space of DAGs rather than Markov equivalence classes, orienting edges during its search based on the AMC, and hence is guaranteed to return a fully directed network.

Through extensive empirical evaluation we show that GLOBE performs well in practice and outperforms the state-of-the-art conditional-independence-based and score-based causal discovery algorithms. On synthetic data we confirm that GLOBE does not discover spurious edges between independent variables, and overall achieves the best scores on both the structural Hamming distance and the structural intervention distance. Last, but not least, on real-world data we show that GLOBE works well even when it is unlikely that our modelling assumptions are met. For reproducibility we provide detailed pseudo-code in the technical appendix, and make all code and data available.

Preliminaries

First, we introduce the notation for causal graphs and the main information-theoretic concepts that we need later on.

Causal Graph

We consider data over the joint distribution of m continuous-valued random variables X = {X1, . . . , Xm}. As is common, we assume causal sufficiency. That is, we assume that X contains all random variables that are relevant to the system, or in other words, that there exist no latent confounders. Under the assumptions of causal sufficiency and acyclicity, we can model causal relationships over X using a directed acyclic graph (DAG). A causal DAG G over X is a graph in which the random variables are the nodes and the edges identify the causal relationship between a pair of nodes. In particular, a directed edge Xi → Xj between two nodes indicates that Xi is a direct cause, or parent, of Xj. We denote the set of all parents of Xi with Pa(Xi). When working with causal DAGs, we assume the common assumptions, the causal Markov condition and the faithfulness condition, to hold. Simply put, the combination of both assumptions implies that each separation present in the true graph G implies an independence in the joint distribution P over the random variables X, and vice versa (Pearl 2009).

Identifiability of Causality

A causal relationship is said to be identifiable if it is possible to unambiguously recover it from observational data alone. In general, causal dependencies are not identifiable without assumptions on the causal model. The common assumptions for discovering causal DAGs allow identification up to the Markov equivalence class (Pearl 2009). Given additional assumptions, such as that the relation between cause and effect is a non-linear function with additive Gaussian noise (Hoyer et al. 2009), it is possible to identify causal directions within a Markov equivalence class (Glymour, Zhang, and Spirtes 2019). This is the causal model we investigate.
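As a concrete illustration of this setting, the following minimal sketch, not taken from the paper, generates an iid sample from a bivariate causal model in which the effect is a non-linear function of the cause plus additive Gaussian noise; the particular mechanism, the noise level, and all names are illustrative assumptions.

```python
# Minimal sketch (illustrative, not from the paper): data from a non-linear
# additive Gaussian noise model X -> Y, the setting assumed to be identifiable.
import numpy as np

rng = np.random.default_rng(0)
n = 1000

x = rng.normal(size=n)                 # cause, drawn from its marginal
noise = 0.1 * rng.normal(size=n)       # additive Gaussian noise, independent of x
y = np.tanh(2.0 * x) + x ** 3 + noise  # hypothetical non-linear mechanism f(x) + N

data = np.column_stack([x, y])         # an iid sample over X = {X1, X2}
print(data.shape)                      # (1000, 2)
```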
Kolmogorov Complexity

The Kolmogorov complexity of a finite binary string x is the length of the shortest binary program p for a universal Turing machine U that outputs x and then halts (Kolmogorov 1965; Li and Vitányi 2009). Simply put, p is the most succinct algorithmic description of x, and therewith the Kolmogorov complexity of x is the length of its ultimate lossless compression. Conditional Kolmogorov complexity, K(x | y) ≤ K(x), is then the length of the shortest binary program p that generates x, and halts, given y as input.

The Kolmogorov complexity of a probability distribution P, K(P), is the length of the shortest program that outputs P(x) to precision q on input x and q (Li and Vitányi 2009). More formally, we have

$K(P) = \min \left\{ |p| : p \in \{0,1\}^*, \; \forall x, q : |\mathcal{U}(p, x, q) - P(x)| \le 1/q \right\} .$

The conditional, K(P | Q), is defined similarly, except that the universal Turing machine U now gets the additional information Q. For more details on Kolmogorov complexity see Li and Vitányi (2009).

Minimum Description Length Principle

Although Kolmogorov complexity is not computable, we can approximate it from above through lossless compression (Li and Vitányi 2009). The Minimum Description Length (MDL) principle (Rissanen 1978; Grünwald 2007) provides a statistically well-founded and computable framework to do so. Conceptually, instead of all programs, ideal MDL considers only those programs for which we know that they output x and halt, i.e., lossless compressors. Formally, given a model class M, MDL identifies the best model M ∈ M for data D as the one minimizing L(D, M) = L(M) + L(D | M), where L(M) is the length in bits of the description of M, and L(D | M) is the length in bits of the description of data D given M. This is known as two-part, or crude MDL. There also exists one-part, or refined MDL. Although refined MDL has theoretically appealing properties, it is only efficiently computable for a small number of model classes. Asymptotically, there is no difference between the two (Grünwald 2007). To use MDL in practice we need to define a model class, and how to encode a model, resp. the data given a model, into bits. Note that we are only concerned with optimal code lengths, not actual codes; our goal is to measure the complexity of a dataset under a model class, after all (Grünwald 2007). Hence, all logarithms are to base 2, and we use the common convention that 0 log 0 = 0.

Theory

In this section, we first introduce the algorithmic model of causality, which is based on Kolmogorov complexity. To put it into practice, we need to introduce a set of modelling assumptions that allow us to approximate it using MDL. We conclude this section by providing consistency guarantees.

Algorithmic Model of Causality

Here we introduce the main concepts of algorithmic causal inference as introduced by Janzing and Schölkopf (2010), starting with the causal model.

Postulate 1 (Algorithmic Model of Causality). Let G be a DAG formalizing the causal structure among the strings x1, . . . , xm. Then, every xj is computed by a program qj with constant length from its parents Pa(xj) and an additional input nj. That is, xj = qj(Pa(xj), nj), where the inputs nj are jointly independent.

As any mathematical object x can be described as a binary string, and a program qj can model any physical process (Deutsch 1985) or possible function hj (Li and Vitányi 2009), this is a particularly general model of causality.
Analogous to the statistical model, we can derive that the algorithmic model of causality fulfils the algorithmic Markov property (Janzing and Schölkopf 2010), that is,

$K(x_1, \dots, x_m) \overset{+}{=} \sum_{j=1}^{m} K(x_j \mid \mathit{Pa}(x_j)) ,$

where $\overset{+}{=}$ denotes equality up to an additive constant. Meaning, to most succinctly describe all strings, it suffices to know the parents and the additional input nj of each string xj. Unlike its statistical counterpart, which can only identify the causal network up to Markov equivalence, the algorithmic Markov property can identify a single DAG as the most succinct description of all strings. As any mathematical object, including distributions, can be described by a binary string, Janzing and Schölkopf (2010) define the following postulate.

Postulate 2 (Algorithmic Markov Condition). A causal DAG G over random variables X with joint density P is only acceptable if the shortest description of P factorizes as

$K(P(X_1, \dots, X_m)) \overset{+}{=} \sum_{j=1}^{m} K(P(X_j \mid \mathit{Pa}(X_j))) . \quad (1)$

Hence, under the assumption that the true causal graph can be modelled by a DAG, it has to be the one minimizing Eq. (1). As K is not computable, we cannot directly compute this score. What we can do, however, is restrict our model class from allowing all possible functions to a subset of these, and then approximate K using MDL.

Causal Model

As causal model we consider a rich class of structural equation models (SEMs) (Pearl 2009), where the value of each node is determined by a linear combination of functions over all possible subsets of parents and additional independent noise. Formally, for all Xi ∈ X we have

$X_i = \sum_{S_j \in \mathcal{P}(\mathit{Pa}(X_i))} h_j(S_j) + N_i , \quad (2)$

where hj is a non-linear function of the j-th subset over the power set, P(Pa(Xi)), of parents of Xi, and Ni is an independent noise term. We assume that all noise variables are jointly independent, Gaussian distributed, and that Ni ⊥⊥ Pa(Xi).

MDL Encoding of the Causal Model

Next, we specify our MDL score for DAGs. Given an iid sample Xn drawn from the joint distribution P over X, our goal is to approximate Eq. (1) using two-part MDL, which means we need to define a model class M for which we can compute the optimal code length. Here, we define M to include all possible DAGs over X and their corresponding parametrization according to our causal model. That is, for each node Xi a model M ∈ M contains an index indicating the parents of Xi, which is equivalent to storing the DAG structure, and the corresponding functional dependencies. Building upon Eq. (1), we want to find that model M* ∈ M such that

$M^* = \operatorname*{argmin}_{M \in \mathcal{M}} L(X^n, M) = \operatorname*{argmin}_{M \in \mathcal{M}} \left( L(M) + \sum_{i=1}^{m} L(X^n_i \mid \mathit{Pa}(X_i), M) \right) = \operatorname*{argmin}_{M \in \mathcal{M}} \left( L(M) + \sum_{i=1}^{m} L(\epsilon_i) \right) ,$

where Pa(Xi) are the parents of Xi according to the model M. In the last line, we replace L(Xn_i | Pa(Xi), M) with L(ϵi) to clarify that encoding a node given M and its parents comes down to encoding the residuals ϵi.

Encoding the Model

The model complexity L(M) for a model M ∈ M comprises the parameters of the functional dependencies and the graph structure. The total cost is simply the sum of the code lengths of the individual nodes,

$L(M) = \sum_{i=1}^{m} L(M_i) .$

To encode the individual node Xi, we need to transmit its parents, the form of the functional dependency, and the bias or mean shift µi. We encode the model Mi for a node Xi as

$L(M_i) = L_{\mathbb{N}}(k) + k \log m + L_F(f_i) + L_p(\mu_i) ,$

where we first encode the number of parents k using LN, the MDL-optimal encoding for integers z ≥ 0 (Rissanen 1983). It is defined as $L_{\mathbb{N}}(z) = \log^* z + \log c_0$, where $\log^* z = \log z + \log \log z + \dots$, of which we consider only the positive terms, and c0 is a normalization constant that ensures the Kraft inequality holds (Kraft 1949).
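As a concrete reading of the structural part of this encoding, the sketch below implements L_N and the first two terms of L(M_i); the value of the constant c_0 and the z + 1 shift used so that zero parents can be encoded are assumptions of this sketch, and L_F and L_p are treated as given.

```python
# A minimal sketch of the universal integer code L_N and of the structural part of
# L(M_i). The constant c_0 and the z -> z + 1 shift (so that zero parents can be
# encoded) are assumptions of this sketch; L_F and L_p are treated as given.
import math

C0 = 2.865064  # normalization constant so the code lengths satisfy the Kraft inequality

def L_N(z: int) -> float:
    """MDL-optimal code length for an integer z >= 1: log*(z) + log(c_0) (Rissanen 1983)."""
    assert z >= 1
    bits, term = math.log2(C0), math.log2(z)
    while term > 0:              # log*(z) = log z + log log z + ..., positive terms only
        bits, term = bits + term, math.log2(term)
    return bits

def structure_bits(k: int, m: int) -> float:
    """Bits to state the number of parents k and which of the m variables they are."""
    return L_N(k + 1) + k * math.log2(m)

# e.g. a node with k = 2 parents out of m = 10 variables:
# structure_bits(2, 10) = L_N(3) + 2 * log2(10), roughly 10.4 bits
```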
Next, we identify which out of the m random variables these parents are, and then proceed to encode the function fi over these parents, where fi represents the summation term on the right-hand side of Eq. (2). Last, we encode the bias term using Lp, defined later in Eq. (3).

Encoding the Functions

We instantiate the framework using non-parametric functions hi that also allow for non-linear transformations of the parent variables. To this end, we fit non-parametric Multivariate Adaptive Regression Splines (Friedman 1991). In essence, we estimate Xi as

$\hat{X}_i = \sum_{j=1}^{|H|} h_j(S_j) ,$

where hj is called a hinge function that is applied to a subset of the parents, Sj, with size |Sj|, that is associated with the j-th hinge. A hinge takes the form

$h_j(S_j) = \prod_{i=1}^{T} a_i \max(0, \, g_i(s_i) - b_i) ,$

where T denotes the number of multiplicative terms in h, si ∈ Sj is the parent associated with the i-th term, and gi is a non-linear transformation applied to si, where gi belongs to the function class F, e.g. the class of all polynomials up to a certain degree. We specify F in more detail in the supplementary section, but the encoding can be very general and can include any regression function as long as we can describe the parameters and |F| < ∞. If T = 1 for all hinges, the above definition simplifies to an additive model over individual parents.

We encode a hinge function as follows,

$L_F(h) = L_{\mathbb{N}}(|H|) + \sum_{j=1}^{|H|} \left( L_{\mathbb{N}}(T_j) + \log \binom{|S| + T_j - 1}{T_j} + T_j \log |\mathcal{F}| + L_p(\theta(h_j)) \right) .$

First, we use LN to encode the number of hinges and the number of terms per hinge. We then transmit the correct assignment of terms Tj to parents in S, and finally need log |F| bits to identify the specific non-linear transformation that is used for each of the Tj terms in the hinge.

Encoding Parameters

To encode the bias we use the proposal of Marx and Vreeken (2017) for encoding parameters up to a user-specified precision p. We have

$L_p(\theta) = |\theta| + \sum_{i=1}^{|\theta|} \left( L_{\mathbb{N}}(s_i) + L_{\mathbb{N}}(\lceil |\theta_i| \cdot 10^{s_i} \rceil) \right) , \quad (3)$

where si is the smallest integer such that $|\theta_i| \cdot 10^{s_i} \ge 10^{p}$. Simply put, p = 2 implies that we consider two digits of the parameter. We need one bit to store the sign of the parameter, then we encode the shift si and the shifted parameter θi.

Encoding Residuals

Last, we need to encode the residual term, L(ϵi). Since we use regression functions, we aim to minimize the variance of the residual, and hence encode the residual ϵ as Gaussian distributed with zero mean (Marx and Vreeken 2017; Grünwald 2007),

$L(\epsilon) = \frac{n}{2} \left( \frac{1}{\ln 2} + \log 2 \pi \hat{\sigma}^2 \right) ,$

where we can compute the empirical variance $\hat{\sigma}^2$ from ϵ. Combining the above, we now have a lossless MDL score for a causal DAG.
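The parameter and residual codes translate almost directly into code. The following sketch computes L_p(θ) for a given precision p and the Gaussian code length of a residual vector; the treatment of exact zeros and of negative shifts, as well as the variance guard, are assumptions of the sketch rather than details specified above.

```python
# A sketch of the parameter code L_p of Eq. (3) and of the Gaussian residual code.
# L_N is repeated from the previous sketch so the block runs on its own; the handling
# of exact zeros and the variance guard are assumed conventions, not taken from the text.
import math

def L_N(z: int) -> float:
    bits, term = math.log2(2.865064), math.log2(z)
    while term > 0:
        bits, term = bits + term, math.log2(term)
    return bits

def L_p(theta, p=2):
    """Bits for a parameter vector `theta` encoded up to precision p."""
    bits = float(len(theta))                      # one sign bit per parameter
    for t in map(abs, theta):
        if t == 0:
            bits += 2 * L_N(1)                    # assumed convention for exact zeros
            continue
        s = math.ceil(p - math.log10(t))          # smallest integer with t * 10^s >= 10^p
        bits += L_N(abs(s) + 1) + L_N(math.ceil(t * 10 ** s))
    return bits

def L_residual(eps):
    """Gaussian code length of residuals `eps`, using their empirical variance."""
    n = len(eps)
    var = max(sum(e * e for e in eps) / n, 1e-12)
    return (n / 2) * (1 / math.log(2) + math.log2(2 * math.pi * var))

# e.g. L_p([0.533, -1.42]) and L_residual([0.1, -0.2, 0.05]) return code lengths in bits
```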
Consistency

Since MDL can only upper bound Kolmogorov complexity, but not compute it, it is not possible to directly derive strict guarantees from the AMC. We can, however, derive consistency results. We first show that our score allows for identifying the Markov equivalence class of the true DAG, i.e. the partially directed network for which each collider is correctly identified. Then, we show that under slightly stricter assumptions, we can orient the remaining edges correctly. The main idea for the first part is to show that our score is consistent; simply put, the likelihood term dominates in the limit. For a score with such properties, e.g. BIC (Haughton 1988), Chickering (2002) showed that it is possible to identify the Markov equivalence class of the true DAG.

To show that our score behaves in the same way, we need to make two lightweight assumptions for n → ∞:
1. the number of hinges |H| is bounded by O(log n), and
2. the precision of the parameters θ is constant w.r.t. n and hence Lp(θ) ∈ O(1).
Based on these assumptions, we can show that our score is consistent as it asymptotically behaves like BIC, meaning that the penalty term for the parameters only grows with O(log n), while the likelihood term grows linearly with n and hence is the dominating term as n → ∞.

Theorem 1. Given a causal model as defined in Eq. (2) and corresponding data Xn drawn iid from the joint distribution P, under Assumptions (1) and (2), L(Xn, M) asymptotically behaves like BIC.

With the above, we know that given sufficient data our score will identify the correct Markov equivalence class. To infer the complete DAG, we need to be able to infer the direction of those edges that cannot be oriented using collider structures, i.e. single edges like X → Y. Closest to our approach is the work of Marx and Vreeken (2019), who showed that it is possible to distinguish between X → Y and Y → X using any L0-regularized score, e.g. BIC, if we assume that the underlying causal function is nearly deterministic, i.e. Y := f(X) + αN, where f is a non-linear function and N is an unbiased, unit-variance noise term regulated by a small constant α > 0. Since our score in the limit behaves like an L0-based score (cf. Theorem 1), we can distinguish between Markov equivalent DAGs under these stricter assumptions. For a detailed discussion, readers are directed to the proof of Theorem 1 in the technical appendix.

Although our score is consistent and can be used to distinguish Markov equivalent DAGs, these guarantees only hold if we were to score all DAGs over X. Since this is infeasible for large graphs, we propose a modified greedy DAG search algorithm to minimize L(Xn, M).

The GLOBE Algorithm

We now present GLOBE, a score-based method for discovering directed acyclic causal graphs from multivariate continuous-valued data. GLOBE consists of three steps: edge scoring, forward search, and backward search, as shown in Algorithm 1; we provide detailed pseudocode for each step in the technical appendix.

Algorithm 1: The GLOBE Algorithm
Data: Data Xn over X
Result: Causal DAG G
1: Q ← EDGESCORING(Xn)
2: G ← FORWARDSEARCH(Q, Xn)
3: G ← BACKWARDSEARCH(G)

Edge Scoring

To improve the forward search, where we greedily add the edge that provides the highest gain, we first order all potential edges in a priority queue by their causal strength. We measure the causal strength of an edge using the absolute gain in bits for orienting the edge in either direction in our model. Formally, let e = (Xi, Xj) be an undirected edge between Xi and Xj, and further let →e refer to the directed edge Xi → Xj and ←e to the directed edge in the reverse direction. Now, let M be the current model. We write M ⊕ →e to refer to the model where we add edge →e, and M ⊕ ←e for the model where we add ←e. We define the gain in bits, δ, associated with edge →e as

$\delta(\vec{e}\,) = \max \{ 0, \; L(X^n, M) - L(X^n, M \oplus \vec{e}\,) \} ,$

where L(Xn, M) is defined according to the causal model specified in the Theory section, and define δ(←e) analogously. Based on δ(→e) and δ(←e), we define the directed gain Ψ(→e) for a given edge as

$\Psi(\vec{e}\,) = \delta(\vec{e}\,) - \delta(\overleftarrow{e}\,) ,$

and correspondingly Ψ(←e) = −Ψ(→e). The higher the value of Ψ(→e), the higher edge →e is ranked. Intuitively, the larger the difference between the two edge directions, the more certain we are that we inferred the correct direction. The algorithm for this step is straightforward: we pick each undirected edge e, calculate δ and Ψ for →e and ←e, and add the edges to a priority queue.
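The following simplified sketch mirrors this ranking step. The score() function used here is a deliberately simple stand-in (Gaussian residual bits of a linear least-squares fit) for the full MDL score with MARS hinge functions that GLOBE actually uses, and it ignores the model-cost term; all names are illustrative.

```python
# A simplified sketch of the edge-scoring step (illustrative names, stand-in score).
import heapq
import numpy as np

def score(data, j, parents):
    """Stand-in cost in bits of column j given `parents`: Gaussian residual bits of a
    least-squares fit. GLOBE instead uses the full MDL score with MARS hinge functions."""
    y = data[:, j]
    n = len(y)
    if parents:
        A = np.column_stack([data[:, list(parents)], np.ones(n)])
        y_hat = A @ np.linalg.lstsq(A, y, rcond=None)[0]
    else:
        y_hat = np.full(n, y.mean())
    var = max(float(np.var(y - y_hat)), 1e-12)
    return (n / 2) * (1 / np.log(2) + np.log2(2 * np.pi * var))

def edge_scoring(data):
    """Rank every pair of variables by the directed gain Psi (largest first)."""
    m = data.shape[1]
    queue = []
    for i in range(m):
        for j in range(i + 1, m):
            delta_ij = max(0.0, score(data, j, []) - score(data, j, [i]))  # gain of i -> j
            delta_ji = max(0.0, score(data, i, []) - score(data, i, [j]))  # gain of j -> i
            psi = delta_ij - delta_ji        # directed gain; its sign picks the direction
            src, dst = (i, j) if psi >= 0 else (j, i)
            heapq.heappush(queue, (-abs(psi), src, dst))  # max-heap via negated priority
    return queue
```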
Forward Search

In the forward search phase, we use the priority queue obtained from the edge ranking step to build the causal graph by iteratively adding the highest-ranked edge. We reject edges that would introduce a cycle. After adding an edge Xi → Xj, we need to update the score of all edges pointing towards Xj and re-rank them in the priority queue. Due to the greedy nature of the algorithm, we may add edges in the wrong direction when we do not yet know all the parents of a node. Hence, after adding edge Xi → Xj to the current model, i.e. discovering a new parent for Xj, we check for all children of Xj whether flipping the direction of the edge improves the overall score. If so, we delete that edge e from our model, re-calculate δ and Ψ for →e and ←e, and push them again to the priority queue (see Fig. 1). The forward search stops when the priority queue is empty.

To avoid spurious edges, we check the significance of the gain. Let k = δ(→e). Based on the no-hypercompression inequality (Grünwald 2007), the probability to gain k bits over the null model is smaller than or equal to $2^{-k}$. If for an edge the gain k is not significant, i.e. $2^{-k} > \alpha$, where α is a user-defined significance threshold, we disregard the edge.

Figure 1: Edge reversal in the forward search. We start with the graph where we wrongly added edge Xj → Xk, then we add the correct edge Xi → Xj. Revisiting the children of Xj, we see that flipping Xj → Xk improves our score, and hence delete the edge. In the next step we add the correct edge.

Backward Search

To further refine the graph discovered in the forward search, we iteratively remove superfluous edges. In particular, for each node Xj with |Pa(Xj)| = k ≥ 2, we score all graphs in which we only use a subset of the parents of size k − 1. If any of these graphs provides a gain in compression, we select the one that provides the largest gain and update the model accordingly. We continue this process until we cannot find such a subset for any node, and output the current graph as our predicted causal DAG.

Complexity Analysis

The edge ranking does one pass over the edges and hence has a runtime of O(|V|²). In the forward search, each edge can lead to at most (|V| − 1) ranking updates due to edge flips, resulting in a total complexity of O(|V|³). The backward search has a loose upper bound of O(|V|³), which results when the forward search returns a fully connected graph and we delete each of those edges in the backward search. Hence, the overall complexity of GLOBE is in O(|V|³). In practice, GLOBE is fast enough for networks as large as 500 nodes.

Instantiation

We instantiate GLOBE using the open-source R implementation of the Multivariate Adaptive Regression Splines framework (Friedman 1991). The name GLOBE stems from discovering fully, rather than locally, oriented networks, as well as from it being based on Multivariate Adaptive Regression Splines (MARS), of which the public implementation is known as EARTH. Since we could face issues like multicollinearity (Farrar and Glauber 1967) and unrealistic runtimes if we allow for arbitrarily many interactions between parents, we restrict the maximum number of interaction terms to 2 in the experiments.
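To make the search procedure concrete, the condensed sketch below reuses the stand-in score() from the previous sketch; it keeps the cycle check and the no-hypercompression significance test, but omits the re-ranking of affected edges and the edge-flip check described above.

```python
# A condensed sketch of the forward search (reuses score() from the previous sketch).
import heapq

def creates_cycle(children, src, dst):
    """True if adding src -> dst would close a directed cycle, i.e. dst already reaches src."""
    stack, seen = [dst], set()
    while stack:
        node = stack.pop()
        if node == src:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, ()))
    return False

def forward_search(data, queue, alpha=0.001):
    children = {}                                  # node -> set of child nodes
    parents = {}                                   # node -> list of parent nodes
    while queue:
        _, src, dst = heapq.heappop(queue)
        if creates_cycle(children, src, dst):
            continue                               # reject edges that would introduce a cycle
        gain = score(data, dst, parents.get(dst, [])) \
             - score(data, dst, parents.get(dst, []) + [src])
        if gain <= 0 or 2.0 ** (-gain) > alpha:    # no-hypercompression significance check
            continue
        children.setdefault(src, set()).add(dst)
        parents.setdefault(dst, []).append(src)
        # omitted: re-scoring all edges into dst and checking whether edges to
        # dst's children should be flipped, which the full algorithm does here
    return children, parents

# usage: graph, parent_sets = forward_search(data, edge_scoring(data))
```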
Related Work

Causal discovery from observational data has drawn increasing attention in recent years (Bühlmann et al. 2014; Huang et al. 2018; Hu et al. 2018; Margaritis and Thrun 2000) and is still an open problem. To give a succinct overview, we focus on the most related methods, i.e. those that aim to recover a DAG or its Markov equivalence class from continuous-valued data. We exclude methods that aim at weakening assumptions such as causal sufficiency or acyclicity (Spirtes et al. 2000), as well as methods for discrete data (Budhathoki and Vreeken 2017). Most approaches can be classified as constraint based or score based. Both rely on the Markov and faithfulness conditions to recover the Markov equivalence class of the true DAG.

Constraint-based methods, such as the PC and FCI algorithms (Spirtes et al. 2000), their extensions (Colombo and Maathuis 2014; Pearl, Verma et al. 1991), as well as the Grow-Shrink algorithm (Margaritis and Thrun 2000), rely on conditional independence (CI) tests to first recover the undirected causal graph, and then infer edge directions only up to the Markov equivalence class using additional edge orientation rules (Meek 1995). The main bottleneck for these approaches is the CI test. The standard choice is the Gaussian CI test (Kalisch and Bühlmann 2007); however, it cannot capture non-linear correlations. The current state-of-the-art uses kernel-based tests such as HSIC (Gretton et al. 2005), which can capture non-linear dependencies.

Score-based methods define a scoring function, S(G, Xn), that evaluates how well a causal DAG G fits the provided data Xn. If the true causal graph G is a DAG, then given infinite data the highest-scoring DAG is part of the equivalence class of G (Chickering 2002). Score-based approaches start with an empty graph and greedily traverse to the highest-scoring Markov equivalence class that is reachable by adding, deleting or reversing an edge. Well-known algorithms in this category include greedy equivalence search (GES) (Chickering 2002; Hauser and Bühlmann 2012), its extensions (Ramsey et al. 2017), and the current state-of-the-art, generalized GES (GGES) (Huang et al. 2018), which uses kernel regression to capture complex dependencies.

In contrast, additive noise models (ANMs) aim to discover the fully directed graph (Hoyer et al. 2009). The primary assumption is that the effect can be written as a function of the cause plus additive noise that is independent of the cause. Under this assumption, the function is only admissible in the causal direction and not vice versa (Hoyer et al. 2009). Methods range from linear non-Gaussian models (LINGAM) (Shimizu et al. 2006) and non-linear functions (RESIT) (Peters et al. 2014) to mixtures of non-linear additive noise models (Hu et al. 2018). The main caveat of ANMs is again the CI test: fitting a non-linear function that maximizes the independence between the cause and the noise is a slow process, which restricts the application of ANMs to small networks (Hoyer et al. 2009).

Most related to our work are methods based on regression error. These methods have been shown to successfully decide between Markov equivalent DAGs under the assumption of a non-linear function and low noise (Marx and Vreeken 2017; Blöbaum et al. 2018; Marx and Vreeken 2019), or have been proven to correctly identify the causal ordering of all nodes (CAM) (Bühlmann et al. 2014). Directly comparing a causal ordering to a DAG is, however, not straightforward. In this paper, we combine the advantages of score-based methods and methods based on regression error by discovering the fully oriented graph and allowing for complex non-linear dependencies, while being fast in practice.

Experiments

We evaluate GLOBE on both synthetic and real-world data with known ground truth.
GLOBE is implemented in Python, and both the source code and the synthetic data are made available for reproducibility at http://eda.mmci.uni-saarland.de/globe/. We compare GLOBE to the state-of-the-art from different classes of algorithms: to RESIT (Peters et al. 2014) and LINGAM (Shimizu et al. 2006) as representative ANM-based methods, to GGES as the best score-based method (Huang et al. 2018), and to PC with the Hilbert-Schmidt Independence Criterion, PCHSIC for short (Colombo and Maathuis 2014; Gretton et al. 2005), as the state-of-the-art constraint-based method for causal discovery. A comparison with FASTGES (Ramsey et al. 2017) is omitted since its performance was significantly worse than that of the other methods. We provide details on the experimental setup as well as additional experiments, including a case study, in the technical appendix. GLOBE finished within ten minutes for each experimental instance, except for one real-world dataset with 500 nodes, on which it took three days; the competitors could not handle this data at all.

Figure 2: [Lower is better] SHD (left) and SID (right) for increasing number of parents.

Evaluation Metrics

We evaluate the predicted and the ground-truth graphs on the basis of their structural as well as their causal similarity. We justify our choice of evaluation metrics in the technical appendix. The Structural Hamming Distance (SHD) (Kalisch and Bühlmann 2007) between two partially directed acyclic graphs (PDAGs) G and Ĝ is the total number of edges in which the two graphs differ. Denoting the edge adjacency matrices of G and Ĝ with X resp. X̂, we have

SHD(G, Ĝ) := ∑ 1≤i