Discovering Fully Oriented Causal Networks

Osman Mian, Alexander Marx and Jilles Vreeken
CISPA Helmholtz Center for Information Security
{osman.mian, alexander.marx, jv}@cispa.de

We study the problem of inferring causal graphs from observational data. We are particularly interested in discovering graphs where all edges are oriented, as opposed to the partially directed graphs that the state-of-the-art discovers. To this end we base our approach on the algorithmic Markov condition. Unlike the statistical Markov condition, it uniquely identifies the true causal network as the one that provides the simplest factorization of the joint distribution, as measured by Kolmogorov complexity. Although Kolmogorov complexity is not computable, we can approximate it from above via the Minimum Description Length principle, which allows us to define a consistent and computable score based on non-parametric multivariate regression. To efficiently discover causal networks in practice, we introduce the GLOBE algorithm, which greedily adds, removes, and orients edges such that it minimizes the overall cost. Through an extensive set of experiments we show GLOBE performs very well in practice, beating the state-of-the-art by a margin.

Introduction

Discovering causal dependencies from observational data is one of the most fundamental problems in science (Pearl 2009). We consider the problem of recovering the causal network over a set of continuous-valued random variables X based on an iid sample from their joint distribution. The state-of-the-art does so by first recovering an undirected causal skeleton, which identifies the variables that have a direct causal relation, and then uses conditional independence tests to orient as many edges as possible. By the nature of these tests this can only be done up to Markov equivalence classes, which means that these methods in practice return networks where only few edges are oriented. In contrast, we develop an approach that discovers fully directed causal graphs.

We base our approach on the algorithmic Markov condition (AMC), a recent postulate that states that the factorization of the joint distribution according to the true causal network coincides with the one that achieves the lowest Kolmogorov complexity (Janzing and Schölkopf 2010). As an example, consider the case where X causes Y. Whereas the traditional statistical Markov condition cannot differentiate between P(X)P(Y | X) and P(Y)P(X | Y), as both are valid factorizations of the joint distribution P(X, Y), the algorithmic Markov condition takes the complexities of these distributions into account: in this case, the simplest factorization of P(X, Y) is K(P(X)) + K(P(Y | X)), as only this factorization upholds the true independence between the marginal and the conditional distribution; any competing factorization will be more complex because of inherent redundancy between the terms. As Kolmogorov complexity can capture any physical process (Li and Vitányi 2009), the AMC is a very general model for causality. However, Kolmogorov complexity is not computable, and hence we need a practical score to instantiate it. Here we do so through the Minimum Description Length principle (Grünwald 2007), which provides a statistically well-founded approach to approximate Kolmogorov complexity from above.
We develop an MDL-based score for directed acyclic graphs (DAGs), where we model the dependencies between variables through non-parametric multivariate regression. Simply put, the lower the regression error of the discovered model, the lower its cost, while more parameters mean higher complexity. We show this score is consistent: given sufficiently many samples from the joint distribution, we can uniquely identify the true causal graph if the causal relations are nearly deterministic. To efficiently discover causal networks directly from data we introduce the GLOBE algorithm, which, much like the well-known GES algorithm (Chickering 2002), greedily adds and removes edges to optimize the score. Unlike GES, however, GLOBE traverses the space of DAGs rather than Markov equivalence classes, orienting edges during its search based on the AMC, and hence is guaranteed to return a fully directed network.

Through extensive empirical evaluation we show that GLOBE performs well in practice and outperforms the state-of-the-art conditional-independence-based and score-based causal discovery algorithms. On synthetic data we confirm that GLOBE does not discover spurious edges between independent variables, and overall achieves the best scores on both the structural Hamming distance and the structural intervention distance. Last, but not least, on real-world data we show that GLOBE works well even when it is unlikely that our modelling assumptions are met. For reproducibility we provide detailed pseudo-code in the technical appendix, and make all code and data available.

Preliminaries

First, we introduce the notation for causal graphs and the main information-theoretic concepts that we need later on.

Causal Graph

We consider data over the joint distribution of m continuous-valued random variables X = {X1, . . . , Xm}. As is common, we assume causal sufficiency. That is, we assume that X contains all random variables that are relevant to the system, or in other words, that there exist no latent confounders. Under the assumptions of causal sufficiency and acyclicity, we can model causal relationships over X using a directed acyclic graph (DAG). A causal DAG G over X is a graph in which the random variables are the nodes and the edges identify the causal relationship between a pair of nodes. In particular, a directed edge Xi → Xj between two nodes indicates that Xi is a direct cause, or parent, of Xj. We denote the set of all parents of Xi with Pa(Xi). When working with causal DAGs, we assume the common assumptions, the causal Markov condition and the faithfulness condition, to hold. Simply put, the combination of both assumptions implies that each separation present in the true graph G implies an independence in the joint distribution P over the random variables X, and vice versa (Pearl 2009).

Identifiability of Causality

A causal relationship is said to be identifiable if it is possible to unambiguously recover it from observational data alone. In general, causal dependencies are not identifiable without assumptions on the causal model. The common assumptions for discovering causal DAGs allow identification up to the Markov equivalence class (Pearl 2009). Given additional assumptions, such as that the relation between cause and effect is a non-linear function with additive Gaussian noise (Hoyer et al. 2009), it is possible to identify causal directions within a Markov equivalence class (Glymour, Zhang, and Spirtes 2019). This is the causal model we investigate.
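As a concrete illustration of this setting, the following minimal sketch, not taken from the paper, generates an iid sample from a bivariate causal model in which the effect is a non-linear function of the cause plus additive Gaussian noise; the particular mechanism, the noise level, and all names are illustrative assumptions.

```python
# Minimal sketch (illustrative, not from the paper): data from a non-linear
# additive Gaussian noise model X -> Y, the setting assumed to be identifiable.
import numpy as np

rng = np.random.default_rng(0)
n = 1000

x = rng.normal(size=n)                 # cause, drawn from its marginal
noise = 0.1 * rng.normal(size=n)       # additive Gaussian noise, independent of x
y = np.tanh(2.0 * x) + x ** 3 + noise  # hypothetical non-linear mechanism f(x) + N

data = np.column_stack([x, y])         # an iid sample over X = {X1, X2}
print(data.shape)                      # (1000, 2)
```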
Kolmogorov Complexity

The Kolmogorov complexity of a finite binary string x is the length of the shortest binary program p for a universal Turing machine U that outputs x and then halts (Kolmogorov 1965; Li and Vitányi 2009). Simply put, p is the most succinct algorithmic description of x, and therewith the Kolmogorov complexity of x is the length of its ultimate lossless compression. Conditional Kolmogorov complexity, K(x | y) ≤ K(x), is then the length of the shortest binary program p that generates x, and halts, given y as input.

The Kolmogorov complexity of a probability distribution P, K(P), is the length of the shortest program that outputs P(x) to precision q on input x and q (Li and Vitányi 2009). More formally, we have

$K(P) = \min \left\{ |p| : p \in \{0,1\}^*, \; \forall x, q : |\mathcal{U}(p, x, q) - P(x)| \le 1/q \right\} .$

The conditional, K(P | Q), is defined similarly, except that the universal Turing machine U now gets the additional information Q. For more details on Kolmogorov complexity see Li and Vitányi (2009).

Minimum Description Length Principle

Although Kolmogorov complexity is not computable, we can approximate it from above through lossless compression (Li and Vitányi 2009). The Minimum Description Length (MDL) principle (Rissanen 1978; Grünwald 2007) provides a statistically well-founded and computable framework to do so. Conceptually, instead of all programs, ideal MDL considers only those programs for which we know that they output x and halt, i.e., lossless compressors. Formally, given a model class M, MDL identifies the best model M ∈ M for data D as the one minimizing L(D, M) = L(M) + L(D | M), where L(M) is the length in bits of the description of M, and L(D | M) is the length in bits of the description of data D given M. This is known as two-part, or crude MDL. There also exists one-part, or refined MDL. Although refined MDL has theoretically appealing properties, it is only efficiently computable for a small number of model classes. Asymptotically, there is no difference between the two (Grünwald 2007). To use MDL in practice we need to define a model class, and how to encode a model, resp. the data given a model, into bits. Note that we are only concerned with optimal code lengths, not actual codes; our goal is to measure the complexity of a dataset under a model class, after all (Grünwald 2007). Hence, all logarithms are to base 2, and we use the common convention that 0 log 0 = 0.

Theory

In this section, we first introduce the algorithmic model of causality, which is based on Kolmogorov complexity. To put it into practice, we need to introduce a set of modelling assumptions that allow us to approximate it using MDL. We conclude this section by providing consistency guarantees.

Algorithmic Model of Causality

Here we introduce the main concepts of algorithmic causal inference as introduced by Janzing and Schölkopf (2010), starting with the causal model.

Postulate 1 (Algorithmic Model of Causality). Let G be a DAG formalizing the causal structure among the strings x1, . . . , xm. Then, every xj is computed by a program qj with constant length from its parents Pa(xj) and an additional input nj. That is, xj = qj(Pa(xj), nj), where the inputs nj are jointly independent.

As any mathematical object x can be described as a binary string, and a program qj can model any physical process (Deutsch 1985) or possible function hj (Li and Vitányi 2009), this is a particularly general model of causality.
Analogous to the statistical model, we can derive that the algorithmic model of causality fulfils the algorithmic Markov property (Janzing and Schölkopf 2010), that is,

$K(x_1, \dots, x_m) \overset{+}{=} \sum_{j=1}^{m} K(x_j \mid \mathit{Pa}(x_j)) ,$

where $\overset{+}{=}$ denotes equality up to an additive constant. Meaning, to most succinctly describe all strings, it suffices to know the parents and the additional input nj of each string xj. Unlike its statistical counterpart, which can only identify the causal network up to Markov equivalence, the algorithmic Markov property can identify a single DAG as the most succinct description of all strings. As any mathematical object, including distributions, can be described by a binary string, Janzing and Schölkopf (2010) define the following postulate.

Postulate 2 (Algorithmic Markov Condition). A causal DAG G over random variables X with joint density P is only acceptable if the shortest description of P factorizes as

$K(P(X_1, \dots, X_m)) \overset{+}{=} \sum_{j=1}^{m} K(P(X_j \mid \mathit{Pa}(X_j))) . \quad (1)$

Hence, under the assumption that the true causal graph can be modelled by a DAG, it has to be the one minimizing Eq. (1). As K is not computable, we cannot directly compute this score. What we can do, however, is restrict our model class from allowing all possible functions to a subset of these, and then approximate K using MDL.

Causal Model

As causal model we consider a rich class of structural equation models (SEMs) (Pearl 2009), where the value of each node is determined by a linear combination of functions over all possible subsets of parents and additional independent noise. Formally, for all Xi ∈ X we have

$X_i = \sum_{S_j \in \mathcal{P}(\mathit{Pa}(X_i))} h_j(S_j) + N_i , \quad (2)$

where hj is a non-linear function of the j-th subset over the power set, P(Pa(Xi)), of parents of Xi, and Ni is an independent noise term. We assume that all noise variables are jointly independent, Gaussian distributed, and that Ni ⊥⊥ Pa(Xi).

MDL Encoding of the Causal Model

Next, we specify our MDL score for DAGs. Given an iid sample Xn drawn from the joint distribution P over X, our goal is to approximate Eq. (1) using two-part MDL, which means we need to define a model class M for which we can compute the optimal code length. Here, we define M to include all possible DAGs over X and their corresponding parametrization according to our causal model. That is, for each node Xi a model M ∈ M contains an index indicating the parents of Xi, which is equivalent to storing the DAG structure, and the corresponding functional dependencies. Building upon Eq. (1), we want to find that model M* ∈ M such that

$M^* = \operatorname*{argmin}_{M \in \mathcal{M}} L(X^n, M) = \operatorname*{argmin}_{M \in \mathcal{M}} \left( L(M) + \sum_{i=1}^{m} L(X^n_i \mid \mathit{Pa}(X_i), M) \right) = \operatorname*{argmin}_{M \in \mathcal{M}} \left( L(M) + \sum_{i=1}^{m} L(\epsilon_i) \right) ,$

where Pa(Xi) are the parents of Xi according to the model M. In the last line, we replace L(Xn_i | Pa(Xi), M) with L(ϵi) to clarify that encoding a node given M and its parents comes down to encoding the residuals ϵi.

Encoding the Model

The model complexity L(M) for a model M ∈ M comprises the parameters of the functional dependencies and the graph structure. The total cost is simply the sum of the code lengths of the individual nodes,

$L(M) = \sum_{i=1}^{m} L(M_i) .$

To encode the individual node Xi, we need to transmit its parents, the form of the functional dependency, and the bias or mean shift µi. We encode the model Mi for a node Xi as

$L(M_i) = L_{\mathbb{N}}(k) + k \log m + L_F(f_i) + L_p(\mu_i) ,$

where we first encode the number of parents k using LN, the MDL-optimal encoding for integers z ≥ 0 (Rissanen 1983). It is defined as $L_{\mathbb{N}}(z) = \log^* z + \log c_0$, where $\log^* z = \log z + \log \log z + \dots$, of which we consider only the positive terms, and c0 is a normalization constant that ensures the Kraft inequality holds (Kraft 1949).
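As a concrete reading of the structural part of this encoding, the sketch below implements L_N and the first two terms of L(M_i); the value of the constant c_0 and the z + 1 shift used so that zero parents can be encoded are assumptions of this sketch, and L_F and L_p are treated as given.

```python
# A minimal sketch of the universal integer code L_N and of the structural part of
# L(M_i). The constant c_0 and the z -> z + 1 shift (so that zero parents can be
# encoded) are assumptions of this sketch; L_F and L_p are treated as given.
import math

C0 = 2.865064  # normalization constant so the code lengths satisfy the Kraft inequality

def L_N(z: int) -> float:
    """MDL-optimal code length for an integer z >= 1: log*(z) + log(c_0) (Rissanen 1983)."""
    assert z >= 1
    bits, term = math.log2(C0), math.log2(z)
    while term > 0:              # log*(z) = log z + log log z + ..., positive terms only
        bits, term = bits + term, math.log2(term)
    return bits

def structure_bits(k: int, m: int) -> float:
    """Bits to state the number of parents k and which of the m variables they are."""
    return L_N(k + 1) + k * math.log2(m)

# e.g. a node with k = 2 parents out of m = 10 variables:
# structure_bits(2, 10) = L_N(3) + 2 * log2(10), roughly 10.4 bits
```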
Next, we identify which out of the m random variables these parents are, and then proceed to encode the function fi over these parents, where fi represents the summation term on the right-hand side of Eq. (2). Last, we encode the bias term using Lp, defined later in Eq. (3).

Encoding the Functions

We instantiate the framework using non-parametric functions hi that also allow for non-linear transformations of the parent variables. To this end, we fit non-parametric Multivariate Adaptive Regression Splines (Friedman 1991). In essence, we estimate Xi as

$\hat{X}_i = \sum_{j=1}^{|H|} h_j(S_j) ,$

where hj is called a hinge function that is applied to a subset of the parents, Sj, with size |Sj|, that is associated with the j-th hinge. A hinge takes the form

$h_j(S_j) = \prod_{i=1}^{T} a_i \max(0, \, g_i(s_i) - b_i) ,$

where T denotes the number of multiplicative terms in h, si ∈ Sj is the parent associated with the i-th term, and gi is a non-linear transformation applied to si, where gi belongs to the function class F, e.g. the class of all polynomials up to a certain degree. We specify F in more detail in the supplementary section, but the encoding can be very general and can include any regression function as long as we can describe the parameters and |F| < ∞. If T = 1 for all hinges, the above definition simplifies to an additive model over individual parents.

We encode a hinge function as follows,

$L_F(h) = L_{\mathbb{N}}(|H|) + \sum_{j=1}^{|H|} \left( L_{\mathbb{N}}(T_j) + \log \binom{|S| + T_j - 1}{T_j} + T_j \log |\mathcal{F}| + L_p(\theta(h_j)) \right) .$

First, we use LN to encode the number of hinges and the number of terms per hinge. We then transmit the correct assignment of terms Tj to parents in S, and finally need log |F| bits to identify the specific non-linear transformation that is used for each of the Tj terms in the hinge.

Encoding Parameters

To encode the bias we use the proposal of Marx and Vreeken (2017) for encoding parameters up to a user-specified precision p. We have

$L_p(\theta) = |\theta| + \sum_{i=1}^{|\theta|} \left( L_{\mathbb{N}}(s_i) + L_{\mathbb{N}}(\lceil |\theta_i| \cdot 10^{s_i} \rceil) \right) , \quad (3)$

where si is the smallest integer such that $|\theta_i| \cdot 10^{s_i} \ge 10^{p}$. Simply put, p = 2 implies that we consider two digits of the parameter. We need one bit to store the sign of the parameter, then we encode the shift si and the shifted parameter θi.

Encoding Residuals

Last, we need to encode the residual term, L(ϵi). Since we use regression functions, we aim to minimize the variance of the residual, and hence encode the residual ϵ as Gaussian distributed with zero mean (Marx and Vreeken 2017; Grünwald 2007),

$L(\epsilon) = \frac{n}{2} \left( \frac{1}{\ln 2} + \log 2 \pi \hat{\sigma}^2 \right) ,$

where we can compute the empirical variance $\hat{\sigma}^2$ from ϵ. Combining the above, we now have a lossless MDL score for a causal DAG.
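The parameter and residual codes translate almost directly into code. The following sketch computes L_p(θ) for a given precision p and the Gaussian code length of a residual vector; the treatment of exact zeros and of negative shifts, as well as the variance guard, are assumptions of the sketch rather than details specified above.

```python
# A sketch of the parameter code L_p of Eq. (3) and of the Gaussian residual code.
# L_N is repeated from the previous sketch so the block runs on its own; the handling
# of exact zeros and the variance guard are assumed conventions, not taken from the text.
import math

def L_N(z: int) -> float:
    bits, term = math.log2(2.865064), math.log2(z)
    while term > 0:
        bits, term = bits + term, math.log2(term)
    return bits

def L_p(theta, p=2):
    """Bits for a parameter vector `theta` encoded up to precision p."""
    bits = float(len(theta))                      # one sign bit per parameter
    for t in map(abs, theta):
        if t == 0:
            bits += 2 * L_N(1)                    # assumed convention for exact zeros
            continue
        s = math.ceil(p - math.log10(t))          # smallest integer with t * 10^s >= 10^p
        bits += L_N(abs(s) + 1) + L_N(math.ceil(t * 10 ** s))
    return bits

def L_residual(eps):
    """Gaussian code length of residuals `eps`, using their empirical variance."""
    n = len(eps)
    var = max(sum(e * e for e in eps) / n, 1e-12)
    return (n / 2) * (1 / math.log(2) + math.log2(2 * math.pi * var))

# e.g. L_p([0.533, -1.42]) and L_residual([0.1, -0.2, 0.05]) return code lengths in bits
```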
Consistency

Since MDL can only upper bound Kolmogorov complexity, but not compute it, it is not possible to directly derive strict guarantees from the AMC. We can, however, derive consistency results. We first show that our score allows for identifying the Markov equivalence class of the true DAG, i.e. the partially directed network for which each collider is correctly identified. Then, we show that under slightly stricter assumptions, we can orient the remaining edges correctly. The main idea for the first part is to show that our score is consistent; simply put, the likelihood term dominates in the limit. For a score with such properties, e.g. BIC (Haughton 1988), Chickering (2002) showed that it is possible to identify the Markov equivalence class of the true DAG.

To show that our score behaves in the same way, we need to make two lightweight assumptions for n → ∞:
1. the number of hinges |H| is bounded by O(log n), and
2. the precision of the parameters θ is constant w.r.t. n and hence Lp(θ) ∈ O(1).
Based on these assumptions, we can show that our score is consistent as it asymptotically behaves like BIC, meaning that the penalty term for the parameters only grows with O(log n), while the likelihood term grows linearly with n and hence is the dominating term as n → ∞.

Theorem 1. Given a causal model as defined in Eq. (2) and corresponding data Xn drawn iid from the joint distribution P, under Assumptions (1) and (2), L(Xn, M) asymptotically behaves like BIC.

With the above, we know that given sufficient data our score will identify the correct Markov equivalence class. To infer the complete DAG, we need to be able to infer the direction of those edges that cannot be oriented using collider structures, i.e. single edges like X → Y. Closest to our approach is the work of Marx and Vreeken (2019), who showed that it is possible to distinguish between X → Y and Y → X using any L0-regularized score, e.g. BIC, if we assume that the underlying causal function is nearly deterministic, i.e. Y := f(X) + αN, where f is a non-linear function and N is an unbiased, unit-variance noise term regulated by a small constant α > 0. Since our score in the limit behaves like an L0-based score (cf. Theorem 1), we can distinguish between Markov equivalent DAGs under these stricter assumptions. For a detailed discussion, readers are directed to the proof of Theorem 1 in the technical appendix.

Although our score is consistent and can be used to distinguish Markov equivalent DAGs, these guarantees only hold if we were to score all DAGs over X. Since this is infeasible for large graphs, we propose a modified greedy DAG search algorithm to minimize L(Xn, M).

The GLOBE Algorithm

We now present GLOBE, a score-based method for discovering directed acyclic causal graphs from multivariate continuous-valued data. GLOBE consists of three steps: edge scoring, forward search, and backward search, as shown in Algorithm 1; we provide detailed pseudocode for each step in the technical appendix.

Algorithm 1: The GLOBE Algorithm
Data: Data Xn over X
Result: Causal DAG G
1: Q ← EDGESCORING(Xn)
2: G ← FORWARDSEARCH(Q, Xn)
3: G ← BACKWARDSEARCH(G)

Edge Scoring

To improve the forward search, where we greedily add the edge that provides the highest gain, we first order all potential edges in a priority queue by their causal strength. We measure the causal strength of an edge using the absolute gain in bits for orienting the edge in either direction in our model. Formally, let e = (Xi, Xj) be an undirected edge between Xi and Xj, and further let →e refer to the directed edge Xi → Xj and ←e to the directed edge in the reverse direction. Now, let M be the current model. We write M ⊕ →e to refer to the model where we add edge →e, and M ⊕ ←e for the model where we add ←e. We define the gain in bits, δ, associated with edge →e as

$\delta(\vec{e}\,) = \max \{ 0, \; L(X^n, M) - L(X^n, M \oplus \vec{e}\,) \} ,$

where L(Xn, M) is defined according to the causal model specified in the Theory section, and define δ(←e) analogously. Based on δ(→e) and δ(←e), we define the directed gain Ψ(→e) for a given edge as

$\Psi(\vec{e}\,) = \delta(\vec{e}\,) - \delta(\overleftarrow{e}\,) ,$

and correspondingly Ψ(←e) = −Ψ(→e). The higher the value of Ψ(→e), the higher edge →e is ranked. Intuitively, the larger the difference between the two edge directions, the more certain we are that we inferred the correct direction. The algorithm for this step is straightforward: we pick each undirected edge e, calculate δ and Ψ for →e and ←e, and add the edges to a priority queue.
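The following simplified sketch mirrors this ranking step. The score() function used here is a deliberately simple stand-in (Gaussian residual bits of a linear least-squares fit) for the full MDL score with MARS hinge functions that GLOBE actually uses, and it ignores the model-cost term; all names are illustrative.

```python
# A simplified sketch of the edge-scoring step (illustrative names, stand-in score).
import heapq
import numpy as np

def score(data, j, parents):
    """Stand-in cost in bits of column j given `parents`: Gaussian residual bits of a
    least-squares fit. GLOBE instead uses the full MDL score with MARS hinge functions."""
    y = data[:, j]
    n = len(y)
    if parents:
        A = np.column_stack([data[:, list(parents)], np.ones(n)])
        y_hat = A @ np.linalg.lstsq(A, y, rcond=None)[0]
    else:
        y_hat = np.full(n, y.mean())
    var = max(float(np.var(y - y_hat)), 1e-12)
    return (n / 2) * (1 / np.log(2) + np.log2(2 * np.pi * var))

def edge_scoring(data):
    """Rank every pair of variables by the directed gain Psi (largest first)."""
    m = data.shape[1]
    queue = []
    for i in range(m):
        for j in range(i + 1, m):
            delta_ij = max(0.0, score(data, j, []) - score(data, j, [i]))  # gain of i -> j
            delta_ji = max(0.0, score(data, i, []) - score(data, i, [j]))  # gain of j -> i
            psi = delta_ij - delta_ji        # directed gain; its sign picks the direction
            src, dst = (i, j) if psi >= 0 else (j, i)
            heapq.heappush(queue, (-abs(psi), src, dst))  # max-heap via negated priority
    return queue
```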
Forward Search

In the forward search phase, we use the priority queue obtained from the edge ranking step to build the causal graph by iteratively adding the highest-ranked edge. We reject edges that would introduce a cycle. After adding an edge Xi → Xj, we need to update the score of all edges pointing towards Xj and re-rank them in the priority queue. Due to the greedy nature of the algorithm, we may add edges in the wrong direction when we do not yet know all the parents of a node. Hence, after adding edge Xi → Xj to the current model, i.e. discovering a new parent for Xj, we check for all children of Xj whether flipping the direction of the edge improves the overall score. If so, we delete that edge e from our model, re-calculate δ and Ψ for →e and ←e, and push them again to the priority queue (see Fig. 1). The forward search stops when the priority queue is empty.

To avoid spurious edges, we check the significance of the gain. Let k = δ(→e). Based on the no-hypercompression inequality (Grünwald 2007), the probability to gain k bits over the null model is smaller than or equal to $2^{-k}$. If for an edge the gain k is not significant, i.e. $2^{-k} > \alpha$, where α is a user-defined significance threshold, we disregard the edge.

Figure 1: Edge reversal in the forward search. We start with the graph where we wrongly added edge Xj → Xk, then we add the correct edge Xi → Xj. Revisiting the children of Xj, we see that flipping Xj → Xk improves our score, and hence delete the edge. In the next step we add the correct edge.

Backward Search

To further refine the graph discovered in the forward search, we iteratively remove superfluous edges. In particular, for each node Xj with |Pa(Xj)| = k ≥ 2, we score all graphs in which we only use a subset of the parents of size k − 1. If any of these graphs provides a gain in compression, we select the one that provides the largest gain and update the model accordingly. We continue this process until we cannot find such a subset for any node, and output the current graph as our predicted causal DAG.

Complexity Analysis

The edge ranking does one pass over the edges and hence has a runtime of O(|V|²). In the forward search, each edge can lead to at most (|V| − 1) ranking updates due to edge flips, resulting in a total complexity of O(|V|³). The backward search has a loose upper bound of O(|V|³), which results when the forward search returns a fully connected graph and we delete each of those edges in the backward search. Hence, the overall complexity of GLOBE is in O(|V|³). In practice, GLOBE is fast enough for networks as large as 500 nodes.

Instantiation

We instantiate GLOBE using the open-source R implementation of the Multivariate Adaptive Regression Splines framework (Friedman 1991). The name GLOBE stems from discovering fully, rather than locally, oriented networks, as well as from it being based on Multivariate Adaptive Regression Splines (MARS), of which the public implementation is known as EARTH. Since we could face issues like multicollinearity (Farrar and Glauber 1967) and unrealistic runtimes if we allow for arbitrarily many interactions between parents, we restrict the maximum number of interaction terms to 2 in the experiments.
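To make the search procedure concrete, the condensed sketch below reuses the stand-in score() from the previous sketch; it keeps the cycle check and the no-hypercompression significance test, but omits the re-ranking of affected edges and the edge-flip check described above.

```python
# A condensed sketch of the forward search (reuses score() from the previous sketch).
import heapq

def creates_cycle(children, src, dst):
    """True if adding src -> dst would close a directed cycle, i.e. dst already reaches src."""
    stack, seen = [dst], set()
    while stack:
        node = stack.pop()
        if node == src:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, ()))
    return False

def forward_search(data, queue, alpha=0.001):
    children = {}                                  # node -> set of child nodes
    parents = {}                                   # node -> list of parent nodes
    while queue:
        _, src, dst = heapq.heappop(queue)
        if creates_cycle(children, src, dst):
            continue                               # reject edges that would introduce a cycle
        gain = score(data, dst, parents.get(dst, [])) \
             - score(data, dst, parents.get(dst, []) + [src])
        if gain <= 0 or 2.0 ** (-gain) > alpha:    # no-hypercompression significance check
            continue
        children.setdefault(src, set()).add(dst)
        parents.setdefault(dst, []).append(src)
        # omitted: re-scoring all edges into dst and checking whether edges to
        # dst's children should be flipped, which the full algorithm does here
    return children, parents

# usage: graph, parent_sets = forward_search(data, edge_scoring(data))
```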
Related Work

Causal discovery from observational data has drawn increasing attention in recent years (Bühlmann et al. 2014; Huang et al. 2018; Hu et al. 2018; Margaritis and Thrun 2000) and is still an open problem. To give a succinct overview, we focus on the most related methods, i.e. those that aim to recover a DAG or its Markov equivalence class from continuous-valued data. We exclude methods that aim at weakening assumptions such as causal sufficiency or acyclicity (Spirtes et al. 2000), as well as methods for discrete data (Budhathoki and Vreeken 2017). Most approaches can be classified as constraint based or score based. Both rely on the Markov and faithfulness conditions to recover the Markov equivalence class of the true DAG.

Constraint-based methods, such as the PC and FCI algorithms (Spirtes et al. 2000), their extensions (Colombo and Maathuis 2014; Pearl, Verma et al. 1991), as well as the Grow-Shrink algorithm (Margaritis and Thrun 2000), rely on conditional independence (CI) tests to first recover the undirected causal graph, and then infer edge directions only up to the Markov equivalence class using additional edge orientation rules (Meek 1995). The main bottleneck for these approaches is the CI test. The standard choice is the Gaussian CI test (Kalisch and Bühlmann 2007); however, it cannot capture non-linear correlations. The current state-of-the-art uses kernel-based tests such as HSIC (Gretton et al. 2005), which can capture non-linear dependencies.

Score-based methods define a scoring function, S(G, Xn), that evaluates how well a causal DAG G fits the provided data Xn. If the true causal graph G is a DAG, then given infinite data the highest-scoring DAG is part of the equivalence class of G (Chickering 2002). Score-based approaches start with an empty graph and greedily traverse to the highest-scoring Markov equivalence class that is reachable by adding, deleting or reversing an edge. Well-known algorithms in this category include greedy equivalence search (GES) (Chickering 2002; Hauser and Bühlmann 2012), its extensions (Ramsey et al. 2017), and the current state-of-the-art, generalized GES (GGES) (Huang et al. 2018), which uses kernel regression to capture complex dependencies.

In contrast, additive noise models (ANMs) aim to discover the fully directed graph (Hoyer et al. 2009). The primary assumption is that the effect can be written as a function of the cause plus additive noise that is independent of the cause. Under this assumption, the function is only admissible in the causal direction and not vice versa (Hoyer et al. 2009). Methods range from linear non-Gaussian models (LINGAM) (Shimizu et al. 2006) and non-linear functions (RESIT) (Peters et al. 2014) to mixtures of non-linear additive noise models (Hu et al. 2018). The main caveat of ANMs is again the CI test: fitting a non-linear function that maximizes the independence between the cause and the noise is a slow process, which restricts the application of ANMs to small networks (Hoyer et al. 2009).

Most related to our work are methods based on regression error. These methods have been shown to successfully decide between Markov equivalent DAGs under the assumption of a non-linear function and low noise (Marx and Vreeken 2017; Blöbaum et al. 2018; Marx and Vreeken 2019), or have been proven to correctly identify the causal ordering of all nodes (CAM) (Bühlmann et al. 2014). Directly comparing a causal ordering to a DAG is, however, not straightforward. In this paper, we combine the advantages of score-based methods and methods based on regression error by discovering the fully oriented graph and allowing for complex non-linear dependencies, while being fast in practice.

Experiments

We evaluate GLOBE on both synthetic and real-world data with known ground truth.
GLOBE is implemented in Python, and both the source code and the synthetic data are made available for reproducibility at http://eda.mmci.uni-saarland.de/globe/. We compare GLOBE to the state-of-the-art from different classes of algorithms: to RESIT (Peters et al. 2014) and LINGAM (Shimizu et al. 2006) as representative ANM-based methods, to GGES as the best score-based method (Huang et al. 2018), and to PC with the Hilbert-Schmidt Independence Criterion, PCHSIC for short (Colombo and Maathuis 2014; Gretton et al. 2005), as the state-of-the-art constraint-based method for causal discovery. A comparison with FASTGES (Ramsey et al. 2017) is omitted since its performance was significantly worse than that of the other methods. We provide details on the experimental setup as well as additional experiments, including a case study, in the technical appendix. GLOBE finished within ten minutes for each experimental instance, except for one real-world dataset with 500 nodes, on which it took three days; the competitors could not handle this data at all.

Figure 2: [Lower is better] SHD (left) and SID (right) for increasing number of parents.

Evaluation Metrics

We evaluate the predicted and the ground-truth graphs on the basis of their structural as well as their causal similarity. We justify our choice of evaluation metrics in the technical appendix. The Structural Hamming Distance (SHD) (Kalisch and Bühlmann 2007) between two partially directed acyclic graphs (PDAGs) G and Ĝ is the total number of edges in which the two graphs differ. Denoting the edge adjacency matrices of G and Ĝ with X resp. X̂, we have

SHD(G, Ĝ) := ∑ 1≤i