Journal of Machine Learning Research 26 (2025) 1-65. Submitted 3/24; Revised 11/24; Published 3/25.

Recursive Causal Discovery

Ehsan Mokhtarian (ehsan.mokhtarian@epfl.ch), School of Computer and Communication Sciences, EPFL, 1015 Lausanne, Switzerland
Sepehr Elahi (sepehr.elahi@epfl.ch), School of Computer and Communication Sciences, EPFL, 1015 Lausanne, Switzerland
Sina Akbari (sina.akbari@epfl.ch), School of Computer and Communication Sciences, EPFL, 1015 Lausanne, Switzerland
Negar Kiyavash (negar.kiyavash@epfl.ch), College of Management of Technology, EPFL, 1015 Lausanne, Switzerland

Editor: Ilya Shpitser

Abstract

Causal discovery from observational data, i.e., learning the causal graph from a finite set of samples from the joint distribution of the variables, is often the first step toward the identification and estimation of causal effects, a key requirement in numerous scientific domains. Causal discovery is hampered by two main challenges: limited data results in errors in statistical testing, and the computational complexity of the learning task is daunting. This paper builds upon and extends four of our prior publications (Mokhtarian et al., 2021; Akbari et al., 2021; Mokhtarian et al., 2022, 2023a). These works introduced the concept of removable variables, which are the only variables that can be removed recursively for the purpose of causal discovery. The presence and identification of removable variables enable recursive approaches to causal discovery, a promising solution that addresses the aforementioned challenges by successively reducing the problem size. This reduction not only shrinks the conditioning sets in each conditional independence (CI) test, leading to fewer errors, but also significantly decreases the number of required CI tests. The worst-case performances of these methods nearly match the lower bound.
In this paper, we present a unified framework for the proposed algorithms, refined with additional details and enhancements for a coherent presentation. A comprehensive literature review is also included, comparing the computational complexity of our methods with existing approaches and showcasing their state-of-the-art efficiency. Another contribution of this paper is the release of RCD, a Python package that efficiently implements these algorithms. This package is designed for practitioners and researchers interested in applying these methods in practical scenarios. The package is available at github.com/ban-epfl/rcd, with comprehensive documentation provided at rcdpackage.com.

©2025 Ehsan Mokhtarian, Sepehr Elahi, Sina Akbari, Negar Kiyavash. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-0384.html.

Keywords: causal discovery, removable variable, Python package, recursive, causal structure learning, constraint-based, score-based, permutation-based

Note to Readers

This paper builds upon and extends the methodologies from four of our previous publications in causal discovery. These publications are outlined as follows:

R1 Mokhtarian et al. (2021): A recursive Markov boundary-based approach to causal structure learning; published in KDD-CD 2021; introducing MARVEL.
R2 Akbari et al. (2021): Recursive causal structure learning in the presence of latent variables and selection bias; published in NeurIPS 2021; introducing L-MARVEL.
R3 Mokhtarian et al. (2022): Learning Bayesian networks with structural side information; published in AAAI 2022; introducing RSL.
R4 Mokhtarian et al. (2023a): Novel ordering-based approaches for causal structure learning in the presence of unobserved variables; published in AAAI 2023; introducing ROL.
For readers primarily interested in the practical implementation of our proposed methods, we recommend proceeding directly to Section 8, where our Python package RCD is introduced. Additionally, to facilitate ease of reading, we have summarized the key notations in Table 2. The main results discussed in this paper are also summarized in Table 5.

1. Introduction

A fundamental task in various fields of science is discovering the causal relations among the variables of interest in a system. These relations have broad applications across numerous fields, including economics (Angrist et al., 1996; Heckman, 2008), genetics (Friedman et al., 2000; Schadt et al., 2003), finance (Granger, 1969; Mokhtarian et al., 2024), natural language processing (Agrawal et al., 2024; Wu et al., 2024), and social sciences (Morgan and Winship, 2015). They are often encoded as a maximal ancestral graph (MAG), which represents the causal structure of the system. In cases where no hidden variables are present, a MAG simplifies to a directed acyclic graph (DAG). The problem of learning the causal structure of a system, known as the causal discovery problem, is notoriously challenging and is recognized as an NP-hard problem (Chickering et al., 2004). To address this problem, a slew of methodologies, broadly categorized into constraint-based, score-based, and ordering-based methods, have been developed. Constraint-based methods leverage statistical tests to identify structures consistent with the conditional independence (CI) relations observed in the data. Score-based methods learn the graph as a solution to an optimization problem that maximizes a predefined score. Ordering-based methods focus on discovering a causal ordering of the variables, simplifying structure learning by reducing the search space and computational complexity.
Additionally, hybrid methods have emerged that combine aspects of these methodologies. This paper primarily focuses on constraint-based methods, which are most commonly used in the presence of hidden variables. In Section 7, we present a literature review of causal discovery methods.

Table 1: Summary of the assumptions, guarantees, and complexities of the recursive causal discovery methods discussed in this paper. For a comparison with the existing literature, refer to Table 3.

Reference  Algorithm    Causal sufficiency  Other assumptions  Complete  #CI tests
R1         MARVEL       Yes                 -                  Yes       O(n^2 + n Δ_in^2 2^{Δ_in})
R2         L-MARVEL     No                  -                  Yes       O(n^2 + n (Δ⁺_in)^2 2^{Δ⁺_in})
R3         RSL_ω        Yes                 ω(G) ≤ m           Yes       O(n^2 + n Δ_in^{m+1})
R3         RSL_D        Yes                 Diamond-free       Yes       O(n^2 + n Δ_in^3)
R4         ROL_HC       No                  -                  No        O(maxIter · n^3)
R4         ROL_VI       No                  -                  Yes       O(n^2 2^n)
R4         ROL_PG       No                  -                  No        N/A
-          Lower bound  Yes                 -                  Yes       Ω(n^2 + n Δ_in 2^{Δ_in})
-          Lower bound  No                  -                  Yes       Ω(n^2 + n Δ⁺_in 2^{Δ⁺_in})

Causal discovery is plagued by numerous challenges, most critically those pertaining to computational/time complexity and sample efficiency. In constraint-based methods, time complexity is primarily determined by the number of required CI tests. The two classical approaches, the PC algorithm, developed for DAG learning, and its counterpart for MAG learning, the FCI algorithm, are not scalable to large graphs (Spirtes et al., 2000). Subsequent research has focused on reducing the computational burden and improving the statistical efficiency of these seminal works. On the other hand, methods such as RFCI (Colombo et al., 2012), specifically designed to gain computational speed, do so at the cost of possibly compromising completeness. Completeness refers to a method's asymptotic correctness, i.e., its correctness when sufficiently large numbers of samples are available so that the statistical tests used in the method are error-free.
Recent advancements, such as our four publications (R1-R4) in recursive causal discovery, have made significant strides in addressing time and sample complexity while maintaining completeness guarantees. Our proposed framework strategically identifies a so-called removable variable (Definition 16), denoted by X, and learns its neighbors. After omitting this variable, the causal structure over the remaining variables is learned using only the samples from those variables. It is essential to carefully select X to avoid any erroneous inclusions or exclusions in the causal graph, which we shall explain in detail with examples in Section 3. As these methods operate iteratively, they systematically reduce the problem size, leading to fewer CI tests and smaller conditioning sets, hence improving the statistical reliability of these tests.

Table 1 provides a concise summary of the assumptions, guarantees, and complexity of the recursive causal discovery methods discussed in this paper.¹ In this table, causal sufficiency refers to the setting in which all variables in the system are observable. The last column shows the total number of CI tests performed, a common proxy for the complexity of constraint-based methods. As mentioned earlier, causal discovery from observational data is NP-hard (Chickering et al., 2004). In the last two rows, we present lower bounds for the complexity of constraint-based methods under causal sufficiency and in the absence of it, as established in Section 6.1. We shall discuss in more detail in Section 6 how this table demonstrates the state-of-the-art efficiency of our proposed methods under various assumptions.

Our main contributions in this paper are as follows.

- We present a unified framework for the algorithms proposed in R1-R4, refined with additional details and enhancements for a coherent presentation.

- We launch the RCD Python package, an efficient implementation of our recursive algorithms.
- We conduct a comprehensive literature review and compare the computational efficiency of our methods with existing approaches.

The remainder of the paper is organized as follows. We provide preliminaries, including formal definitions and the goal of causal discovery from observational data, in Section 2. The theoretical foundations of our proposed algorithms are established in Section 3. In this section, we cover the generic recursive framework for causal discovery, describe the concept and characteristics of removable variables, and provide a comparison of our novel removable orders with traditional approaches. In Section 4, we introduce four recursive causal discovery methods: MARVEL, L-MARVEL, RSL, and ROL, and discuss their integration within the recursive framework. In Section 5, we delve into the implementation details of these methods and provide detailed pseudocode for each of the aforementioned algorithms. In Section 6, we discuss the complexity and completeness of various causal discovery methods, with a particular emphasis on our proposed recursive approaches, alongside lower bounds that provide theoretical limits for constraint-based methods. An extensive literature review is carried out in Section 7. Finally, Section 8 is dedicated to introducing our Python package RCD, which efficiently implements our proposed recursive causal discovery algorithms.

2. Preliminaries

The key notations are summarized in Table 2 to enhance clarity in our presentation. Throughout the paper, we denote random variables by capital letters, sets of variables by bold letters, and graphs by calligraphic letters (e.g., G). Further, since the graphs are defined over a set of random variables, we use the terms variable and vertex interchangeably.

2.1 Preliminary Graph Definitions

A mixed graph (MG) is a graph G = ⟨W, E1, E2⟩, where W is a set of vertices, E1 is a set of directed edges, i.e., E1 ⊆ {(X, Y) | X, Y ∈ W}, and E2 is a set of bi-directed edges, i.e.,

1.
For the notations used in Table 1, please refer to Table 2.

E2 ⊆ {{X, Y} | X, Y ∈ W}. All graphs in this paper are assumed to be mixed graphs. If a directed edge X → Y exists in the graph, we say X is a parent of Y and Y is a child of X. Two variables are said to be neighbors if a directed or bi-directed edge exists between them.

Table 2: Table of notations.

Notation         Description                                                     Definition
V, U             Sets of observed and unobserved variables
n                Number of observed variables, i.e., |V|
MG               Mixed graph
MAG              Maximal ancestral graph
DAG              Directed acyclic graph
Pa_G(X)          Parents of vertex X in MG G
Pa_G(X)          ∪_{X ∈ X} Pa_G(X), for a set of vertices X
Ch_G(X)          Children of vertex X in MG G
Ne_G(X)          Neighbors of vertex X in MG G
Ne_W(X)          {Y ∈ W \ {X} | ∀ Z ⊆ W \ {X, Y}: (X ⊥̸⊥ Y | Z)_{P_W}}           Definition 15
Anc_G(X)         Ancestors of vertex X (including X) in MG G
Anc_G(X)         ∪_{X ∈ X} Anc_G(X), for a set of vertices X
Λ_G(X)           Co-parents of vertex X in DAG G                                 Definition 6
Mb_W(X)          Markov boundary of vertex X with respect to W                   Definition 12
Mb_W             {Mb_W(X) | X ∈ W}                                               Definition 12
Mb_G(X)          Markov boundary of vertex X in MG G                             Definition 13
VS_G(X)          V-structures in DAG G in which vertex X is a parent             Definition 5
Dis_G(X)         District set of vertex X in MG G                                Definition 3
Pa⁺_G(X)         Pa_G(X) ∪ Dis_G(X) ∪ Pa_G(Dis_G(X))                             Definition 3
Π(W)             Orders over W                                                   Definition 2
Π_c(G)           c-orders of DAG G                                               Definition 23
Π_r(G)           r-orders of MAG G                                               Definition 24
Δ(G)             Maximum degree of MG G
Δ_in(G)          Maximum in-degree of DAG G
Δ⁺_in(G)         Maximum size of Pa⁺_G(·) in MG G                                Definition 3
ω(G)             Clique number of MG G
α(G)             Maximum Markov boundary size of MAG G                           Definition 13
G[W]             Induced subgraph of MG G over W                                 Definition 1
G_W              Latent projection of MAG G over W                               Definition 9
[G]              Markov equivalent MAGs of MAG G                                 Definition 11
[G]_d            Markov equivalent DAGs of DAG G                                 Definition 11
(X ⊥⊥ Y | Z)_G   An m-separation in MG G                                         Definition 4
(X ⊥⊥ Y | Z)_P   A conditional independence (CI) in distribution P
Data(W)          A collection of i.i.d. samples from distribution P_W
The skeleton of G is an undirected graph with the same set of vertices W and an edge between X and Y if they are neighbors in G. The clique number of G, denoted by ω(G), is the size of the largest clique (complete subgraph) of the skeleton of G. Further, we denote by Δ(G) the maximum degree of a graph G.

Definition 1 (G[W]) For an MG G' = ⟨W', E'1, E'2⟩ and a subset W ⊆ W', the MG G'[W] = ⟨W, E1, E2⟩ denotes the induced subgraph of G' over W, where E1 = {(X, Y) ∈ E'1 | X, Y ∈ W} and E2 = {{X, Y} ∈ E'2 | X, Y ∈ W}.

A path P = (X1, . . . , Xk) is called directed if Xi → Xi+1 for all 1 ≤ i < k. If a directed path exists from X to Y, we say X is an ancestor of Y (we assume each variable is an ancestor of itself). We denote by Pa_G(X), Ch_G(X), Ne_G(X), and Anc_G(X) the sets of parents, children, neighbors, and ancestors of X in graph G, respectively. Further, for a set of vertices X, we define

Anc_G(X) := ∪_{X ∈ X} Anc_G(X),  Pa_G(X) := ∪_{X ∈ X} Pa_G(X).

Definition 2 (Π(W)) For a set W = {W1, W2, . . . , Wm}, an order over W is a tuple (W_{i1}, W_{i2}, . . . , W_{im}), where (i1, i2, . . . , im) is a permutation of (1, 2, . . . , m). We denote by Π(W) the set of all orders over W.

Definition 3 (Pa⁺_G(X)) The district set of a variable X in MG G, denoted by Dis_G(X), is the set of variables with a path to X comprised only of bi-directed edges. By Pa⁺_G(X), we denote the union of the parents, the district set, and the parents of the district set. That is, Pa⁺_G(X) := Pa_G(X) ∪ Dis_G(X) ∪ Pa_G(Dis_G(X)). Furthermore, by Δ⁺_in(G), we denote the maximum size of Pa⁺_G(·) in G.

A non-endpoint vertex Xi on a path (X1, X2, . . . , Xk) is called a collider if one of the following situations arises:

X_{i−1} → Xi ← X_{i+1},  X_{i−1} ↔ Xi ← X_{i+1},  X_{i−1} → Xi ↔ X_{i+1},  X_{i−1} ↔ Xi ↔ X_{i+1}.

When X_{i−1} and X_{i+1} are not neighbors and Xi is a collider, the arrangement is termed an unshielded collider. A path P is a collider path if every non-endpoint vertex on P is a collider on P.
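Definition 1 is mechanical to implement. The following minimal Python sketch (an illustration using an assumed tuple-and-frozenset edge representation, not the RCD package API) builds the induced subgraph G[W]:

```python
def induced_subgraph(vertices, directed, bidirected, W):
    """Induced subgraph G[W] of a mixed graph (Definition 1):
    keep only the edges whose endpoints both lie in W."""
    W = set(W)
    E1 = {(x, y) for (x, y) in directed if x in W and y in W}
    E2 = {e for e in bidirected if set(e) <= W}
    return W, E1, E2

# Mixed graph with directed edges X -> Y, Y -> Z and a bi-directed edge Z <-> W.
V = {"X", "Y", "Z", "W"}
E1 = {("X", "Y"), ("Y", "Z")}
E2 = {frozenset({"Z", "W"})}
Wv, D1, D2 = induced_subgraph(V, E1, E2, {"X", "Y", "Z"})
print(D1)  # {('X', 'Y'), ('Y', 'Z')} (set order may vary)
print(D2)  # set() -- the bi-directed edge is dropped since W is excluded
```

Bi-directed edges are stored as frozensets because, unlike directed edges, they are unordered pairs.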
Definition 4 (m-separation) In an MG G over W, a path P = (X, W1, . . . , Wk, Y) between two distinct variables X and Y is said to be blocked by a set Z ⊆ W \ {X, Y} in G if there exists 1 ≤ i ≤ k such that (i) Wi is a collider on P and Wi ∉ Anc_G(Z ∪ {X, Y}), or (ii) Wi is not a collider on P and Wi ∈ Z. We say Z m-separates X and Y in G, denoted by (X ⊥⊥ Y | Z)_G, if all the paths in G between X and Y are blocked by Z.

A directed cycle exists in an MG G = ⟨W, E1, E2⟩ when there exist X, Y ∈ W such that (X, Y) ∈ E1 and Y ∈ Anc_G(X). Similarly, an almost directed cycle exists in G when there exist X, Y ∈ W such that {X, Y} ∈ E2 and Y ∈ Anc_G(X). An MG with no directed or almost directed cycles is said to be ancestral. An ancestral MG is called maximal if every pair of non-neighbor vertices is m-separable, i.e., there exists a set of vertices that m-separates them. An MG is called a maximal ancestral graph (MAG) if it is both ancestral and maximal. A MAG with no bi-directed edges is called a directed acyclic graph (DAG). When G is a DAG, the definition of m-separation reduces to the so-called d-separation. Moreover, unshielded colliders in a DAG reduce to v-structures. For a DAG G, we denote its maximum in-degree by Δ_in(G).

Definition 5 (VS_G(X)) In a DAG G, we denote by VS_G(X) the set of v-structures in which vertex X is one of the two parents.

Definition 6 (Co-parent) In a DAG, two non-neighbor vertices are co-parents if they have at least one common child. The set of co-parents of vertex X in DAG G is denoted by Λ_G(X).

Definition 7 (Discriminating path) In a MAG G, a path P = (X, V1, . . . , Vk, Y), where k ≥ 1, is a discriminating path for Vk if (i) X and Y are not neighbors, (ii) {V1, . . . , V_{k−1}} ⊆ Pa_G(X), and (iii) the vertices V1, . . . , V_{k−1} are colliders on P.

Definition 8 (Inducing path) Suppose G is a MAG over W1 ⊔ W2,² and let X, Y be distinct vertices in W1.
An inducing path between X and Y relative to W2 is a path on which (i) every non-collider is a member of W2, and (ii) every collider belongs to Anc_G({X, Y}). We note that, by Definition 8, no subset of W1 can block an inducing path relative to W2.

Definition 9 (Latent projection) Suppose G is a MAG over W1 ⊔ W2. The latent projection of G over W1, denoted by G_{W1}, is a MAG over W1 constructed as follows: (i) Skeleton: X, Y ∈ W1 are neighbors in G_{W1} if there exists an inducing path in G between X and Y relative to W2. (ii) Orientation: For each pair of neighbors X, Y in G_{W1}, the edge between X and Y is oriented as X → Y if X ∈ Anc_G(Y) and Y ∉ Anc_G(X), and as X ↔ Y if X ∉ Anc_G(Y) and Y ∉ Anc_G(X).

Remark 10 The latent projection maintains the ancestral relationships. Furthermore, in a MAG G over W, for any W2 ⊆ W1 ⊆ W we have (G_{W1})_{W2} = G_{W2}.

Richardson et al. (2002) showed that the latent projection in Definition 9 is the unique projection of a MAG G = ⟨W1 ⊔ W2, E1, E2⟩ over W1 that satisfies the following: for any distinct variables X, Y in W1 and Z ⊆ W1 \ {X, Y},

(X ⊥⊥ Y | Z)_{G_{W1}} ⇔ (X ⊥⊥ Y | Z)_G.  (1)

2. We use ⊔ to denote disjoint union.

2.2 Generative Model: Structural Equation Model (SEM)

Consider a DAG G over V ⊔ U, where V and U denote the sets of observed and unobserved variables, respectively. In a structural equation model (SEM) with causal graph G, each variable X ∈ V ⊔ U is generated as X = f_X(Pa_G(X), ε_X), where f_X is a deterministic function and ε_X is the exogenous variable corresponding to X, with the additional assumption that the exogenous variables are jointly independent (Pearl, 2009). Such a SEM induces a unique joint distribution P_{V∪U} over all the variables, which satisfies the Markov factorization property with respect to G. That is,

P_{V∪U}(V, U) = ∏_{X ∈ V∪U} P_{V∪U}(X | Pa_G(X)).

The marginalized distribution P_V := Σ_U P_{V∪U} is the observational distribution of the underlying SEM.
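To make the generative model concrete, the following toy linear-Gaussian SEM for the chain X → Y → Z (a hypothetical example with arbitrarily chosen coefficients, not taken from the paper) samples each variable as X = f_X(Pa_G(X), ε_X) with independent exogenous noises; the Markov property then appears empirically as a vanishing partial correlation of X and Z given Y:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000
# Jointly independent exogenous variables, one per vertex.
eps_X, eps_Y, eps_Z = rng.normal(size=(3, m))
# Structural equations with linear f's (coefficients are illustrative).
X = eps_X
Y = 2.0 * X + eps_Y
Z = -1.5 * Y + eps_Z

def residual(t, c):
    """Residual of t after least-squares regression on c (both zero-mean here)."""
    return t - (t @ c) / (c @ c) * c

# Markov property: (X indep Z | Y), so the partial correlation is near zero.
r = np.corrcoef(residual(X, Y), residual(Z, Y))[0, 1]
print(abs(r) < 0.05)  # True
```

The raw correlation of X and Z is strongly negative here; only after conditioning on Y (residualizing) does the dependence vanish, exactly as the factorization predicts.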
Furthermore, the MAG G_V (i.e., the latent projection of DAG G over V, as introduced in Definition 9) is the causal MAG over the observed variables. If all the variables in the SEM are observable, i.e., when U = ∅, G_V coincides with G. This assumption is commonly referred to as causal sufficiency. Therefore, the causal graph over the observed variables is a DAG when causal sufficiency holds. Next, we show that under suitable assumptions, the causal MAG G_V captures the conditional independencies of P_V.

2.3 Markov Property and Faithfulness

Let P be the joint distribution of a set of variables W. For distinct variables X, Y ∈ W and Z ⊆ W \ {X, Y}, a conditional independence (CI) test in P on the triplet (X, Y, Z) yields independence, denoted by (X ⊥⊥ Y | Z)_P, if P(X | Y, Z) = P(X | Z). Distribution P satisfies the Markov property with respect to an MG G if m-separations in G imply CIs in P. That is,

(X ⊥⊥ Y | Z)_G ⟹ (X ⊥⊥ Y | Z)_P.

Conversely, distribution P satisfies faithfulness with respect to an MG G if CIs in P imply m-separations in G. That is,

(X ⊥⊥ Y | Z)_P ⟹ (X ⊥⊥ Y | Z)_G.

Consider a SEM over V ⊔ U with causal DAG G and joint distribution P_{V∪U}. The joint distribution P_{V∪U} satisfies the Markov property with respect to DAG G (Pearl, 2000). Although P_{V∪U} does not necessarily satisfy faithfulness with respect to G, faithfulness is a common assumption in the literature. When it holds, we have

(X ⊥⊥ Y | Z)_{P_{V∪U}} ⇔ (X ⊥⊥ Y | Z)_G,  (2)

for any distinct variables X, Y in V ⊔ U and Z ⊆ (V ⊔ U) \ {X, Y}. Recall that U is the set of unobserved variables. Causal discovery aims to find a graphical model over V that encodes the CI relations in P_V. Note that the induced subgraph G[V] does not necessarily satisfy the Markov property and faithfulness with respect to P_V, failing to encode the CIs of P_V. On the other hand, Equations (1) and (2) imply that the latent projection of G over V, i.e., G_V, satisfies the Markov property and faithfulness with respect to P_V.
That is, for distinct variables X, Y in V and Z ⊆ V \ {X, Y},

(X ⊥⊥ Y | Z)_{P_V} ⇔ (X ⊥⊥ Y | Z)_{G_V}.  (3)

Accordingly, causal discovery aims to learn G_V. However, as we discuss next, using the observational distribution P_V, the MAG G_V can only be learned up to an equivalence class.

2.4 Causal Discovery from Observational Distribution

By performing CI tests in the observational distribution P_V, the m-separations of the causal MAG G_V can be recovered using Equation (3). However, two MAGs might have the same set of m-separations.

Definition 11 (MEC) Two MAGs G1 and G2 are Markov equivalent if they impose the same set of m-separations, i.e., (X ⊥⊥ Y | Z)_{G1} ⇔ (X ⊥⊥ Y | Z)_{G2}. We denote by [G] the set of Markov equivalent MAGs of G, known as the Markov equivalence class (MEC). Moreover, if G is a DAG, we denote by [G]_d the set of Markov equivalent DAGs of G.

CIs are statistically sufficient for causal discovery in non-parametric models.³ That is, if multiple MAGs satisfy the Markov property and faithfulness with respect to the observational distribution P_V, we cannot identify which one is the causal MAG of the underlying SEM. Nevertheless, without the assumption of causal sufficiency, causal discovery from the observational distribution aims to identify [G_V]. When causal sufficiency holds, i.e., U = ∅ and G_V = G, the goal is to identify [G]_d.

2.5 Markov Boundary

The majority of the approaches that we present in this paper employ the concept of Markov boundary. This notion can be defined based on either a distribution or a graph.

Definition 12 (Mb_W(X)) Suppose P is the joint distribution of a set of random variables W and let X ∈ W. The Markov boundary of X with respect to W, denoted by Mb_W(X), is a minimal subset of W \ {X} such that

(X ⊥⊥ W \ (Mb_W(X) ∪ {X}) | Mb_W(X))_P.

Additionally, we define Mb_W as the set of Markov boundaries of all variables in W with respect to W, i.e., Mb_W := {Mb_W(X) | X ∈ W}.
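Equation (3) equates CIs in P_V with m-separations in G_V, so the graphical side can be computed exactly on small examples. A brute-force sketch of Definition 4 for DAGs (where m-separation reduces to d-separation); this is an illustration with an assumed edge-tuple representation, not the RCD implementation, and it enumerates all simple paths, so it is exponential in general:

```python
def ancestors(E, sources):
    """All ancestors (including the vertices themselves) w.r.t. directed edges E."""
    out = set(sources)
    changed = True
    while changed:
        changed = False
        for (a, b) in E:
            if b in out and a not in out:
                out.add(a)
                changed = True
    return out

def m_separated(V, E, x, y, Z):
    """Definition 4 for a DAG: every path between x and y must be blocked by Z."""
    Z = set(Z)
    adj = {v: set() for v in V}
    for (a, b) in E:
        adj[a].add(b)
        adj[b].add(a)
    anc = ancestors(E, Z | {x, y})  # Anc_G(Z ∪ {X, Y}) from condition (i)

    def paths(cur, target, visited):
        if cur == target:
            yield visited
            return
        for nxt in adj[cur]:
            if nxt not in visited:
                yield from paths(nxt, target, visited + [nxt])

    for p in paths(x, y, [x]):
        blocked = False
        for i in range(1, len(p) - 1):
            w = p[i]
            collider = (p[i - 1], w) in E and (p[i + 1], w) in E
            if collider and w not in anc:    # condition (i)
                blocked = True
                break
            if not collider and w in Z:      # condition (ii)
                blocked = True
                break
        if not blocked:
            return False
    return True

V = {"X", "Y", "Z"}
chain_E = {("X", "Y"), ("Y", "Z")}                 # chain X -> Y -> Z
print(m_separated(V, chain_E, "X", "Z", {"Y"}))    # True: Y blocks the chain
vstruct_E = {("X", "Z"), ("Y", "Z")}               # v-structure X -> Z <- Y
print(m_separated(V, vstruct_E, "X", "Y", set()))  # True: collider Z blocks
print(m_separated(V, vstruct_E, "X", "Y", {"Z"}))  # False: conditioning opens it
```

The v-structure example shows the asymmetry that constraint-based methods exploit: conditioning on a collider activates, rather than blocks, the path.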
Definition 13 (Mb_G(X)) In an MG G, the Markov boundary of a variable X, denoted by Mb_G(X), consists of all the variables that have a collider path to X. We denote by α(G) the maximum Markov boundary size of the variables in G.

3. Side information about the underlying SEM can render the causal graph uniquely identifiable. For instance, when the functions are linear and the exogenous noises are non-Gaussian, CIs are no longer statistically sufficient, and the causal MAG is uniquely identifiable from the observational distribution.

Remark 14 If G is a DAG, then Mb_G(X) = Pa_G(X) ∪ Ch_G(X) ∪ Λ_G(X) = Ne_G(X) ∪ Λ_G(X).

Consider a MAG G and a distribution P, both over a set W. Pellet and Elisseeff (2008a) and Yu et al. (2018) showed that if G satisfies the Markov property and faithfulness with respect to P, then for each variable X ∈ W, Mb_W(X) is unique and is equal to Mb_G(X).

3. Theoretical Foundation of Proposed Algorithms

In this section, we establish the theoretical basis for our proposed algorithms in causal discovery. Firstly, we present a generic recursive framework for causal discovery in Section 3.1. Then, we explore the concept and characteristics of removable variables in Section 3.2. Finally, we compare and contrast the newly proposed removable orders with traditional approaches in Section 3.3.

3.1 A Recursive Framework for Causal Discovery

Herein, we present a generic recursive framework for causal discovery that does not necessarily assume causal sufficiency. Consider a SEM over V ⊔ U with causal DAG G, where V and U denote the set of observed and unobserved variables, respectively. As discussed in Section 2.4, our goal is to learn [G_V] (or [G]_d under the assumption of causal sufficiency) from P_V when Equation (3) holds. In the following sections, we mainly focus on recovering the skeleton of G_V. Later, in Section 5.7, we show that a slight variation of the discussed methods can be employed to recover the MEC.
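Before presenting the framework, note that Remark 14 makes the Markov boundary in a DAG easy to read off the graph. A minimal sketch (assuming a simple directed-edge-tuple representation, not the RCD package API):

```python
def markov_boundary_dag(E, x):
    """Mb_G(X) = Pa(X) ∪ Ch(X) ∪ Λ(X) in a DAG with directed edges E (Remark 14)."""
    pa = {a for (a, b) in E if b == x}
    ch = {b for (a, b) in E if a == x}
    # Co-parents: other parents of X's children.
    co = {a for (a, b) in E if b in ch and a != x}
    return pa | ch | co

# DAG: W -> X -> Z <- Y. Then Mb(X) = {W} ∪ {Z} ∪ {Y}.
E = {("W", "X"), ("X", "Z"), ("Y", "Z")}
print(markov_boundary_dag(E, "X"))  # {'W', 'Z', 'Y'} (set order may vary)
```

Here Y enters Mb(X) only through the shared child Z, which is why conditioning sets built from Markov boundaries must include co-parents, not just neighbors.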
Definition 15 (Ne_W(X)) Suppose P_W is the joint distribution of the variables in a set W. For X ∈ W, we denote by Ne_W(X) the set of variables Y ∈ W \ {X} such that for each Z ⊆ W \ {X, Y}, we have (X ⊥̸⊥ Y | Z)_{P_W}.

Since non-neighbors in any MAG are m-separable, for any W ⊆ V we have

Ne_W(X) = Ne_{G_W}(X).  (4)

Hence, to learn the skeleton of G_V, it suffices to learn Ne_V(X) for each X ∈ V. In practice, a finite set of samples from P_V is available instead of the exact probability distribution. Let Data(V) denote a collection of i.i.d. samples from P_V, sufficiently large to recover the CI relations in P_V. For any subset W ⊆ V, Data(W) represents the samples corresponding to the variables in W.

A generic recursive framework for causal discovery is presented in Algorithm 1. The algorithm iteratively removes variables from V and learns the skeleton over the remaining variables. At the i-th iteration, a variable Xi ∈ Vi is selected, where Vi denotes the set of remaining variables. As we shall discuss shortly, this variable cannot be arbitrary and must be removable. After finding such a variable Xi, the algorithm learns the set of neighbors of Xi in G_{Vi} using Data(Vi). Note that this is possible because of Equation (4). Then, the samples corresponding to Xi are discarded, and the causal discovery problem is solved recursively for the remaining variables Vi+1 = Vi \ {Xi}.

Algorithm 1: Recursive framework for learning the skeleton of G_V
1: Input: Data(V)
2: V1 ← V
3: for i from 1 to n − 1 do
4:   Xi ← Find a removable variable in G_{Vi} using Data(Vi)
5:   Learn Ne_{Vi}(Xi) using Data(Vi)
6:   Vi+1 ← Vi \ {Xi}

Figure 1: Graphs in Example 1. (a) G_{{X,Y,Z,W}}; (b) G_{{X,Y,W}} = G_{{X,Y,Z,W}}[X, Y, W]; (c) G_{{X,Z,W}}.

For Algorithm 1 to correctly learn the skeleton of G_V, it is necessary that at each iteration, Ne_{Vi}(Xi) = Ne_{G_V[Vi]}(Xi). Recall that Ne_{Vi}(Xi) = Ne_{G_{Vi}}(Xi) and that G_V[Vi] denotes the induced subgraph of G_V over Vi.
Since the latent projection can only add new edges to the projected graph, this condition is equivalent to

G_{Vi} = G_V[Vi],  1 ≤ i < n.  (5)

However, selecting an arbitrary variable Xi in line 4 may not uphold this property.

Example 1 Consider the MAG G_{{X,Y,Z,W}} shown in Figure 1a. In Figure 1b, the MAG G_{{X,Y,W}} is the same as G_{{X,Y,Z,W}}[X, Y, W]. However, in Figure 1c, the MAG G_{{X,Z,W}} has two extra edges, between X, Z and between X, W, which are not present in G_{{X,Y,Z,W}}[X, Z, W]. This demonstrates that when Vi = {X, Y, Z, W} in Algorithm 1, Z can be selected in line 4, whereas Y cannot.

To employ Algorithm 1, we need to provide a method to find a removable variable and learn its neighbors at each iteration. In the next section, we define the concept of removable variables and show that Equation (5) holds if and only if a removable variable is selected at each iteration of this recursive framework.

3.2 Removable Variables

In this section, we define removable variables, present a graphical characterization of them under different assumptions, and provide certain crucial properties.

Definition 16 (Removable variable) In a MAG G over W, a vertex X ∈ W is called removable if G and G[W \ {X}] impose the same set of m-separations over W \ {X}. That is, for any distinct vertices Y, T ∈ W \ {X} and Z ⊆ W \ {X, Y, T},

(Y ⊥⊥ T | Z)_G ⇔ (Y ⊥⊥ T | Z)_{G[W \ {X}]}.

Figure 2: Graphical criterion of removability. (a) For DAGs. (b) For MAGs. The asterisk (*) is used as a wildcard, indicating that the edge endpoint can be either an arrowhead or a tail. In the case of MAGs, the path (X, V1, . . . , Vm, Y) is a collider path and X, V1, . . . , Vm ∈ Pa_G(Z). X is removable if and only if, for all such paths, Y and Z are adjacent. In the case of a DAG, it suffices to check this condition for vertices Y that are a parent, child, or co-parent of X.

Proposition 17 (Only removables can get removed) Suppose G is a MAG over W and X ∈ W.
The MAG G_{W \ {X}} is equal to G[W \ {X}] if and only if X is removable in G.

Proposition 17 implies that Equation (5) holds for Algorithm 1 if and only if Xi is removable in G_{Vi} at each iteration. Next, we provide graphical characterizations of removable variables in both MAGs and DAGs, along with their key properties. We then define removable orders and discuss their properties.

3.2.1 Graphical Characterization of Removable Variables

We present graphical characterizations of removable variables in DAGs and MAGs.

Theorem 18 (Graphical characterization in DAGs) A variable X is removable in a DAG G if and only if the following two conditions are satisfied for every Z ∈ Ch_G(X).

Condition 1: Ne_G(X) ⊆ Ne_G(Z) ∪ {Z}.
Condition 2: Pa_G(V) ⊆ Pa_G(Z) for every V ∈ Ch_G(X) ∩ Pa_G(Z).

Figure 2a depicts the two conditions of Theorem 18 for removable variables in DAGs.

Theorem 19 (Graphical characterization in MAGs) Let G be a MAG over W. A vertex X is removable in G if and only if for any collider path u = (X, V1, . . . , Vm, Y) and any Z ∈ W \ {X, Y, V1, . . . , Vm} such that {X, V1, . . . , Vm} ⊆ Pa_G(Z), Y and Z are neighbors.

Figure 2b depicts the condition of Theorem 19 for removable variables in MAGs. Path u = (X, V1, . . . , Vm, Y) is a collider path where {X, V1, . . . , Vm} ⊆ Pa_G(Z). Theorem 19 states that X is removable if and only if, for any such path, Y and Z are neighbors. In the case of a DAG, the collider paths are of size either 1 (parents and children of X) or 2 (co-parents of X). As shown in Figure 2a, the graphical criterion of Theorem 19 reduces to checking the adjacency of these three groups of vertices with each child of X (see Theorem 18).

3.2.2 Properties of Removable Variables

Herein, we discuss some key properties of removable variables.

Proposition 20 (Removables exist) In any MAG, the variables with no children are removable, and the set of vertices with no children is non-empty. Therefore, any MAG has at least one removable variable.
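Theorem 18 and Algorithm 1 can be made concrete under causal sufficiency by replacing CI tests with graph oracles. A minimal Python sketch (an illustration with an assumed edge-tuple representation, not the RCD package implementation): `is_removable_dag` checks the two conditions of Theorem 18, and `recursive_skeleton` runs Algorithm 1's loop, reading neighbors off the induced graph, which is valid by Equation (5):

```python
def _pa(E, v): return {a for (a, b) in E if b == v}
def _ch(E, v): return {b for (a, b) in E if a == v}
def _ne(E, v): return _pa(E, v) | _ch(E, v)

def is_removable_dag(E, x):
    """Theorem 18: X is removable iff, for every child Z,
    (1) Ne(X) ⊆ Ne(Z) ∪ {Z}, and
    (2) Pa(V) ⊆ Pa(Z) for every V in Ch(X) ∩ Pa(Z)."""
    for z in _ch(E, x):
        if not _ne(E, x) <= _ne(E, z) | {z}:
            return False  # Condition 1 violated
        for v in _ch(E, x) & _pa(E, z):
            if not _pa(E, v) <= _pa(E, z):
                return False  # Condition 2 violated
    return True

def recursive_skeleton(V, E):
    """Algorithm 1 with graph oracles: repeatedly pick a removable vertex,
    record its neighbors in the induced graph, and drop it."""
    remaining, skeleton = set(V), set()
    while len(remaining) > 1:
        Ei = {(a, b) for (a, b) in E if a in remaining and b in remaining}
        x = next(v for v in sorted(remaining) if is_removable_dag(Ei, v))
        skeleton |= {frozenset({x, nb}) for nb in _ne(Ei, x)}
        remaining.remove(x)
    return skeleton

E = {("X", "Y"), ("Y", "Z")}         # chain X -> Y -> Z
print(is_removable_dag(E, "Z"))      # True: childless vertices are removable
print(is_removable_dag(E, "Y"))      # False: removing Y would hide the X-Z dependence
print(recursive_skeleton({"X", "Y", "Z"}, E) ==
      {frozenset({"X", "Y"}), frozenset({"Y", "Z"})})  # True
```

In practice the oracle calls on line 4 and in `_ne` are replaced by CI tests on Data(Vi), which is exactly where the removable-variable choice pays off: the tests involve only the shrinking vertex set.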
Proposition 20 implies that Xi in line 4 of Algorithm 1 is well-defined.

Proposition 21 (Removables have small Mb size) In a MAG G, if a vertex X is removable, then |Mb_G(X)| ≤ Δ⁺_in(G). Furthermore, if G is a DAG, then |Mb_G(X)| ≤ Δ_in(G).

Note that for an arbitrary variable X, |Mb_G(X)| can be as large as n, whereas Δ⁺_in(G) (or Δ_in(G) for DAGs) is typically a small number, demonstrating that removable variables have relatively small Markov boundary sizes.

Proposition 22 (Removables are invariant in a MEC) Two Markov equivalent MAGs have the same set of removable variables.

We note that the set of vertices without children is not the same for all MAGs in a MEC. However, Proposition 22 implies that the set of removable variables, which is invariant across all MAGs in a MEC, is a superset of the vertices without children.

3.3 Removable Orders

In Algorithm 1, our generic framework can be viewed as an ordering-based approach. In this approach, a variable is eliminated at each iteration based on a specific order. In this section, we first define c-orders, which are the standard orders that existing ordering-based methods use to recover the graph. We then introduce r-orders, which are orders that can be integrated into Algorithm 1. Finally, we show that r-orders are advantageous over c-orders for structure learning because of the properties depicted in Figure 3.

Score-based methods are one of the main classes of algorithms for causal discovery. These algorithms use a score function, such as a regularized likelihood function or the Bayesian information criterion (BIC), to evaluate graphs and determine the structure that maximizes the score. Under the causal sufficiency assumption, the search space of the majority of these methods is the space of DAGs, which contains 2^{Ω(n²)} members. The first ordering-based search strategy was introduced by Teyssier and Koller (2005). These methods search through the space of orders (Definition 2), which includes 2^{O(n log n)} members.
It is worth noting that the space of orders is much smaller than the space of DAGs. Ordering-based methods divide the learning task into two stages. In the first stage, they use the available data to find a causal order (c-order), which is defined as follows.

Definition 23 (c-order) In a DAG G = (W, E1), an order π = (X1, ..., Xn) ∈ Π(W) is called causal (in short, c-order) if i > j for each (Xi, Xj) ∈ E1. Equivalently, π is a c-order if Xi has no children in G[{Xi, Xi+1, ..., Xn}] for each 1 ≤ i ≤ n. We denote by Πc(G) the set of c-orders of G.

Πr(G1) = Πr(G2) = · · · = Πr(Gk) ⊇ Πc(G1) ∪ Πc(G2) ∪ · · · ∪ Πc(Gk)

Figure 3: In this figure, {G1, ..., Gk} denotes a set of Markov equivalent DAGs. Π(V) denotes the set of orders over V, which is the search space of ordering-based methods. Πc(Gi) denotes the set of c-orders of Gi, the target space of existing ordering-based methods in the literature. Πr(Gi) denotes the set of r-orders of Gi, which is the target space of ROL.

We note that c-orders are defined over DAGs, and accordingly, most of the ordering-based approaches for causal discovery require causal sufficiency. In the second stage, ordering-based methods use the learned order to identify the MEC of G. We introduce a novel type of order for MAGs, called removable order (in short, r-order), and argue that r-orders are advantageous over c-orders for structure learning.

Definition 24 (r-order) In a MAG G, an order π = (X1, ..., Xn) is called removable (r-order) if Xi is a removable variable in G[{Xi, Xi+1, ..., Xn}] for each 1 ≤ i ≤ n. We denote by Πr(G) the set of r-orders of G.

Note that r-orders are defined over MAGs, which enables us to design ordering-based methods that do not assume causal sufficiency.

Example 2 Consider the two DAGs G1 and G2 and their sets of c-orders depicted in Figure 4. In this case, G1 and G2 are Markov equivalent and together form a MEC.
Furthermore, Πc(G1) and Πc(G2) are disjoint and each contains 2 c-orders, while any order over the set of vertices is an r-order for both G1 and G2. Hence, each graph has 4! = 24 r-orders.

In a MEC, all MAGs share the same r-orders, and in DAGs, the set of r-orders contains the set of c-orders. This is illustrated in Figure 3 and formalized in the following propositions.

Proposition 25 (r-orders are invariant across a MEC) If G1 and G2 are two Markov equivalent MAGs, then Πr(G1) = Πr(G2).

Proposition 26 (r-orders include c-orders) For any DAG G, we have Πc(G) ⊆ Πr(G).

Figure 4: Two Markov equivalent DAGs G1 and G2 that together form a MEC, and their disjoint sets of c-orders: (a) DAG G1 with Πc(G1) = {(X1, X2, X3, X4), (X1, X2, X4, X3)}; (b) DAG G2 with Πc(G2) = {(X2, X1, X3, X4), (X2, X1, X4, X3)}. In this example, any order over V = {X1, X2, X3, X4} is an r-order, i.e., Πr(G1) = Πr(G2) = Π(V). Note that |Πr(G1)| = |Πr(G2)| = 24 > 2 = |Πc(G1)| = |Πc(G2)|.

4. RCD Methods: MARVEL, L-MARVEL, RSL, ROL

In the previous section, we introduced a recursive framework for causal discovery. We defined removable variables and provided graphical characterizations of them in MAGs and DAGs, along with their important properties. Building on that framework, in this section, we present four recursive causal discovery methods: MARVEL, L-MARVEL, RSL, and ROL. MARVEL, L-MARVEL, and RSL provide various approaches for lines 4-5 of Algorithm 1 under different sets of assumptions. ROL, however, identifies all removable variables at once using r-orders.

4.1 Finding Removable Variables and Their Neighbors

Motivated by Proposition 21, we provide Algorithm 2, a framework for finding a removable variable.

Algorithm 2: Finding a removable variable.
1: Function Find Removable(Data(W), Mb W)
3: (W1, . . .
, Wm) ← Sort W such that |Mb W(W1)| ≤ |Mb W(W2)| ≤ · · · ≤ |Mb W(Wm)|
4: for i = 1 to m do
5: if Is Removable(Wi, Data(Mb W(Wi))) is True then
6: return Wi

Given a set W, the algorithm takes the Markov boundaries as input, sorts the variables in ascending order of the size of their Markov boundaries, and applies the function Is Removable to them. The function Is Removable determines whether a variable Wi is removable in GW. Algorithm 2 returns the first removable variable it identifies and halts. Therefore, Proposition 21 implies that Is Removable will only be called for variables with Markov boundary sizes at most ∆⁺in(G). As we showed in Theorems 18 and 19, the removability of a variable is a property of the causal graph over its Markov boundary. Therefore, we only need to utilize data from the variables in Mb W(Wi) for the Is Removable function. Next, we introduce three approaches for efficiently testing the removability of a variable under different sets of assumptions.

4.2 MARVEL: MArkov boundary-based Recursive Variable ELimination

The first recursive approach that introduced and utilized the notion of removability is MARVEL (Mokhtarian et al., 2021). MARVEL assumes causal sufficiency, i.e., the causal graph is a DAG. Suppose G is a DAG with the set of vertices V. To verify the removability of a variable X ∈ V, MARVEL first learns the neighbors of X, i.e., Ne G(X), and the set of v-structures in which X is a parent, i.e., VSG(X), using the following two lemmas.

Lemma 27 (Pellet and Elisseeff, 2008a) Suppose G is a MAG over V. For X ∈ V and Y ∈ Mb G(X), Y is a neighbor of X if and only if

(X ⊥̸ Y | S)G, ∀S ⊆ Mb G(X) \ {Y}. (6)

We note that while MARVEL uses Lemma 27 when G is a DAG, Pellet and Elisseeff (2008a) showed that it also holds for MAGs.

Lemma 28 Suppose G is a DAG, Y ∈ ΛG(X), (X ⊥ Y | SXY)G, and Z ∈ Ne G(X). Z is a common child of X and Y, i.e., (X → Z ← Y) ∈ VSG(X), if and only if Z ∉ SXY and (Y ⊥̸ Z | S)G, ∀S ⊆ Mb G(X) ∪ {X} \ {Y, Z}.
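Lemma 27 translates directly into a brute-force neighbor search over subsets of the Markov boundary. Below is an illustrative sketch with a hand-coded d-separation oracle for the v-structure X1 → X3 ← X2, whose only conditional independence is X1 ⊥ X2 given the empty set (all names are ours, not the RCD API):

```python
from itertools import chain, combinations

def subsets(s):
    """All subsets of the iterable s (the powerset)."""
    s = list(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def find_neighbors(x, mb, ci):
    """Lemma 27: Y in Mb(X) is a neighbor of X iff X and Y are
    dependent given every subset S of Mb(X) \ {Y}."""
    nbrs = set()
    for y in mb:
        if all(not ci(x, y, set(s)) for s in subsets(mb - {y})):
            nbrs.add(y)
    return nbrs

# Toy CI oracle for the v-structure X1 -> X3 <- X2: the only
# independence is X1 and X2 given the empty set.
def ci(a, b, s):
    return {a, b} == {"X1", "X2"} and s == set()
```

Here `find_neighbors("X1", {"X2", "X3"}, ci)` correctly excludes the co-parent X2, since X1 and X2 are separated by the empty set, and keeps the child X3.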
After learning Ne G(X) and VSG(X) using Lemmas 27 and 28, MARVEL utilizes the following two lemmas to verify the two conditions of Theorem 18.

Lemma 29 Variable X satisfies Condition 1 of Theorem 18 if and only if

(Y ⊥̸ Z | S ∪ {X})G, ∀Y, Z ∈ Ne G(X), ∀S ⊆ Mb G(X) \ {Y, Z}.

Lemma 30 Suppose variable X satisfies Condition 1 of Theorem 18. Then X satisfies Condition 2 of Theorem 18, and therefore, X is removable in G, if and only if

(Y ⊥̸ Z | S ∪ {X, V})G, ∀(X → V ← Y) ∈ VSG(X), ∀Z ∈ Ne G(X) \ {V}, ∀S ⊆ Mb G(X) \ {V, Y, Z}.

The following corollary outlines the computational complexity of applying these lemmas.

Corollary 31 Given Mb G(X), by applying Lemmas 27-30, we can identify Ne G(X), ΛG(X), VSG(X), and determine whether X is removable in G by performing at most O(|Mb G(X)|^2 · 2^{|Mb G(X)|}) unique CI tests.

4.3 L-MARVEL: Latent-MARVEL

L-MARVEL extends MARVEL to the case where causal sufficiency does not necessarily hold, i.e., the causal graph is a MAG (Akbari et al., 2021).4 To verify the removability of a variable X in a MAG G, L-MARVEL first learns Ne G(X) using Lemma 27 as follows. If Y ∈ Mb G(X) is not a neighbor of X, then X and Y have a separating set in Mb G(X) \ {Y}. Hence, identifying Ne G(X) can be performed using a brute-force search in the Markov boundary, using at most |Mb G(X)| · 2^{|Mb G(X)|−1} CI tests. After learning Ne G(X), L-MARVEL utilizes the following theorem to check the removability of X.

Theorem 32 In a MAG G over V, a variable X ∈ V is removable if and only if for every Y ∈ Mb G(X) and Z ∈ Ne G(X) \ {Y}, at least one of the following holds.
Condition 1: ∀W ⊆ Mb G(X) \ {Y, Z}: (Y ⊥̸ Z | W)G.
Condition 2: ∃W ⊆ Mb G(X) \ {Y, Z}: (Y ⊥ Z | W ∪ {X})G.

Remark 33 If Y ∈ Ne G(X), we can skip Condition 1 and only check Condition 2.

Below, we present a proposition that we use in Section 5.2 to avoid performing duplicate CI tests in the implementation of L-MARVEL.

Proposition 34 Suppose G is a MAG with the set of vertices V, X ∈ V, Y ∈ Mb G(X), and Z ∈ Ne G(X) \ {Y}.
If at least one of the two conditions of Theorem 32 holds for X, Y, Z, then the graphical characterization for MAGs in Theorem 19 also holds for X, Y, Z.

4.4 RSL: Recursive Structure Learning

Another recursive algorithm for causal discovery is RSL, which aims to reduce the computational complexity of causal discovery under structural assumptions. RSL requires causal sufficiency and provides algorithms under two types of structural side information: (I) an upper bound on the clique number of the graph is known, or (II) the graph is diamond-free. The causal discovery algorithms provided under these assumptions are RSLω and RSLD, respectively. Under the corresponding assumptions, both of these methods achieve polynomial-time complexity.

RSLω assumes that an upper bound m on the clique number of the causal graph is known, i.e., ω(G) ≤ m.

Remark 35 For a random graph G generated from the Erdős-Rényi model G(n, p), ω(G) ≤ m with high probability when p n^{2/m} → 0 as n → ∞.

4. L-MARVEL, as presented in Akbari et al. (2021), can also handle the presence of selection bias. In this paper, however, for the sake of simplicity, we assume there is no selection bias and that we have access to i.i.d. samples from PV.

Figure 5: Diamond graphs.

RSLω provides the following result for verifying the removability of a variable under the assumption ω(G) ≤ m.

Theorem 36 Suppose G is a DAG such that ω(G) ≤ m. Vertex X is removable in G if for every S ⊆ Mb G(X) with |S| ≤ m − 2, the following conditions hold.
Condition 1: (Y ⊥̸ Z | Mb G(X) ∪ {X} \ ({Y, Z} ∪ S))G, ∀Y, Z ∈ Mb G(X) \ S.
Condition 2: (X ⊥̸ Y | Mb G(X) \ ({Y} ∪ S))G, ∀Y ∈ Mb G(X) \ S.
Also, the set of vertices that satisfy these conditions is nonempty.

To identify the neighbors of a removable variable detected using Theorem 36, RSLω provides the following proposition, which distinguishes co-parents from neighbors in the Markov boundary.

Proposition 37 Suppose G is a DAG such that ω(G) ≤ m.
Let X be a vertex that satisfies the two conditions of Theorem 36 and Y ∈ Mb G(X). Then, Y ∈ ΛG(X) if and only if

∃S ⊆ Mb G(X) \ {Y}: |S| = m − 1, (X ⊥ Y | Mb G(X) \ ({Y} ∪ S))G.

Moreover, such a set S is unique, and S = Ch G(X) ∩ Ch G(Y).

RSLD assumes that the causal DAG is diamond-free, i.e., it contains no diamond as an induced subgraph. A diamond is any of the three DAGs shown in Figure 5.

Remark 38 A random graph G generated from the Erdős-Rényi model G(n, p) is diamond-free with high probability when p n^{0.8} → 0.

The following theorem provides an efficient method for checking the removability of a variable when the causal DAG is diamond-free.

Theorem 39 In a diamond-free DAG G, a vertex X is removable if and only if

(Y ⊥̸ Z | (Mb G(X) ∪ {X}) \ {Y, Z})G, ∀Y, Z ∈ Mb G(X).

Analogous to the case with the bounded clique number, the following proposition can be used to learn the neighbors of a removable variable in a diamond-free DAG.

Proposition 40 In a diamond-free DAG G, let X be a removable variable and Y ∈ Mb G(X). In this case, Y ∈ ΛG(X) if and only if

∃Z ∈ Mb G(X) \ {Y}: (X ⊥ Y | Mb G(X) \ {Y, Z})G. (7)

Moreover, such a variable Z is unique, and {Z} = Ch G(X) ∩ Ch G(Y).

4.5 Removable-Order Learning: ROL

ROL is an ordering-based method that leverages the notion of removability for causal discovery and does not require the assumption of causal sufficiency. As discussed in Section 3.3, ordering-based methods in the literature prior to this approach recover a graph through learning a causal order (c-order) of DAGs, which is a topological order of variables (Definition 23). ROL introduces and uses a novel order called removable order (r-order), which we defined for MAGs in Definition 24. Note that in our general framework given in Algorithm 1, (X1, ..., Xn) forms an r-order. While the recursive methods that we discussed in the previous subsections seek to identify a removable variable in each iteration, ROL aims to learn the whole order at once.
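As a concrete illustration of Theorem 39, the diamond-free removability test admits a compact sketch with a CI oracle. Here the oracle encodes the chain X1 → X2 → X3, whose only independence is X1 ⊥ X3 given a set containing X2 (helper names are ours, not the RCD API):

```python
# Hypothetical sketch of the RSL-D removability test (Theorem 39).

def is_removable_df(x, mb, ci):
    """X is removable in a diamond-free DAG iff every pair Y, Z in
    Mb(X) remains dependent given (Mb(X) union {X}) \ {Y, Z}."""
    mb = sorted(mb)
    for i, y in enumerate(mb):
        for z in mb[i + 1:]:
            cond = (set(mb) | {x}) - {y, z}
            if ci(y, z, cond):
                return False
    return True

# Toy CI oracle for the chain X1 -> X2 -> X3.
def ci(a, b, s):
    return {a, b} == {"X1", "X3"} and "X2" in s
```

For the chain, Mb(X2) = {X1, X3} (no co-parents), and the single required test finds X1 ⊥ X3 | {X2}, so X2 is correctly declared non-removable, while both endpoints pass vacuously.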
To this end, ROL first proposes a recursive algorithm that learns an undirected graph, whose pseudocode is given in Algorithm 3.

Algorithm 3: Learning Gπ.
1: Function Learn Gπ(π, Data(V))
2: V1 ← V, Eπ ← ∅
3: for t = 1 to n − 1 do
4: Xt ← the next variable according to π
5: Ne Vt(Xt) ← Find Neighbors(Xt, Data(Vt))
6: Add undirected edges between Xt and the variables in Ne Vt(Xt) to Eπ.
7: Vt+1 ← Vt \ {Xt}
8: Return Gπ = (V, Eπ)

Algorithm 3 removes variables according to an arbitrary given order π and learns an undirected graph Gπ recursively. Using this algorithm, ROL defines the cost of an order π to be the number of edges in Gπ, denoted by |Eπ|. It then defines an optimization problem that seeks an order with minimum cost.

Theorem 41 (Consistency) Any solution of the optimization problem

arg min over π of |Eπ|, (8)

is a member of Πr(G). Conversely, every member of Πr(G) is a solution of (8).

Theorem 41 shows that finding an r-order is equivalent to solving (8). To solve this optimization problem, ROL proposes three algorithms:

ROLHC, a hill-climbing-based heuristic algorithm that is scalable to large graphs.
ROLVI, an exact reinforcement learning (RL)-based algorithm that has theoretical guarantees but is not scalable to large graphs.
ROLPG, an approximate RL-based algorithm that exploits stochastic policy gradients.

In Section 5.4, we will delve into the details of these methods. Herein, we discuss how ROL formulates the optimization problem in Theorem 41 as an RL problem. ROL interprets the process of recovering Gπ from a given order π as a Markov decision process (MDP), where the index t denotes time, the action space is the set of variables V, and the state space is the set of all subsets of V. More precisely, let st and at denote the state and the action of the MDP at iteration t, respectively. Then, st is the set of remaining variables at time t (i.e., st = Vt), and action at is the variable that will be removed from Vt in that iteration (i.e., at = Xt).
Consequently, the state transition due to action at is st+1 = Vt \ {at}. The immediate reward of selecting action at at state st is the negative of the instantaneous cost, naturally defined as the number of discovered neighbors for at by Find Neighbors in line 5 of Algorithm 3. Formally, the reward of picking action at when in state st is thus given by

r(st, at) = −|Find Neighbors(at, Data(st))| = −|Nest(at)|.

In Sections 5.4.2 and 5.4.3, we discuss two of our RL-based approaches with the above formulation.

5. Implementation Details

In Section 4, we explored techniques for identifying removable variables that can be used in various recursive causal discovery algorithms. However, certain details were left out. In this section, we discuss these implementation details and provide pseudocode for the methods. Moreover, in Section 5.7, we discuss how our methods can be augmented to identify the MEC of the underlying causal graph.

5.1 MARVEL

Algorithm 4 provides pseudocode for MARVEL, which is compatible with the generic frameworks of Algorithms 1 and 2. Algorithm 5 presents the main functions that MARVEL uses to learn neighbors and v-structures, as well as to verify the removability of a variable. As elaborated in Section 5.1.1, MARVEL incorporates three data structures, Skip Check vec, Skip Check cond1, and Skip Check cond2, designed to avoid performing redundant CI tests and improve overall computational efficiency. Algorithm 4 initializes the required variables in lines 2-5. It then calls the Compute Mb function to initially compute the Markov boundaries. We will discuss this step in Section 5.5. In the for loop of lines 11-30, the removability of variable Xj is checked, and its neighbors among the remaining variables are learned. If the neighbors of Xj have not been learned in the previous iterations, the algorithm calls the Find Neighbors function in Algorithm 5 to learn its neighbors. This function uses Lemma 27 to learn the following.
Ne Vi(Xj), ΛVi(Xj), {SXj Y : Y ΛVi(Xj), Xj Y |SXj Y Recursive Causal Discovery Algorithm 4: MARVEL 1: Input: Data(V) 2: Initialize undirected graph ˆG = (V, E = ) 4: X V: Skip Check vec(X) False 5: X, Y, Z V: Skip Check cond1(X, Y, Z), Skip Check cond2(X, Y, Z) False 6: Mb V1 Compute Mb(Data(V)) 7: for i from 1 to n 1 do 8: Vi {X Vi|Skip Check vec(Xj) = False} 10: (X1, . . . , Xr) Sort Vi such that |Mb Vi(X1)| |Mb Vi(X2)| |Mb Vi(Xr)| 11: for j from 1 to r do 12: if First time applying the Find Neighbors function to Xj then 13: Ne Vi(Xj), ΛVi(Xj), Sep Set(Xj) Find Neighbors(Xj, Mb Vi(Xj)) 14: Add undirected edges between Xj and Ne Vi(Xj) in ˆG if not present 16: Ne Vi(Xj) Ne ˆG[Vi](Xj) 17: ΛVi(Xj) Mb Vi(Xj) \ Ne Vi(Xj) 18: if Condition1(Xj, Ne Vi(Xj), Mb Vi(Xj)) is True then 19: if First time applying the Find VS function to Xj then 20: VS(Xj) Find VS(Xj, Ne Vi(Xj), ΛVi(Xj), Mb Vi(Xj), Sep Set(Xj)) 22: VS(Xj) {(Xj Z Y ) VS(Xj)|Z, Y Vi} 23: if Condition2((Xj, Ne Vi(Xj), ΛVi(Xj), Mb Vi(Xj), VS(Xj))) is True then 24: Vi+1 Vi \ {Xj} 25: Mb Vi+1 Update Mb(Xj, Ne Vi(Xj), Mb Vi) 26: Break the for loop of line 11 28: Skip Check vec True 30: Skip Check vec True 31: return ˆG If the neighbors of Xj have been learned in previous iterations, the algorithm uses ˆG to recover this information. Then, upon calling function Condition1, we verify the first condition of Theorem 18, which is designed in accordance with Lemma 29. If this condition is satisfied, Lemma 28 is applied (function Find VS) to learn VSVi(Xj). Note that if VSVi(X) is learned using Lemma 28 in an iteration, we can save it and delete a v-structure from it when one of the three variables of the v-structure is removed. Finally, the second condition of Theorem 18 is verified using Lemma 30 (function Condition2). If this condition is also satisfied, the algorithm concludes that Xj is removable and proceeds to remove it from the variables. 
For the next iteration, we update the Markov boundaries by calling the Update Mb function in Algorithm 11, which we discuss in Section 5.6. Mokhtarian, Elahi, Akbari, Kiyavash Algorithm 5: MARVEL functions 1: Function Find Neighbors(X, Mb W(X)) % Lemma 27 2: Ne W(X), ΛW(X), Sep Set(X) 3: for Y Mb W(X) do 4: if SX,Y Mb W(X) \ {Y } : (X Y |SX,Y )Data(W) then 5: Add Y to ΛW(X) 6: Add SX,Y to Sep Set(X) 8: Add Y to Ne W(X) 9: return Ne W(X), ΛW(X), Sep Set(X) 1: Function Find VS(X, Ne W(X), ΛW(X), Mb W(X), Sep Set(X)) % Lemma 28 3: for Y ΛW(X) and Z Ne W(X) do 4: SX,Y The separating set in Sep Set corresponding to Y 5: if Z / SXY and S Mb W(X) {X} \ {Y, Z} : (Y Z|S)Data(W) then 6: Add (X Z Y ) to VS(X) 7: return VS(X) 1: Function Condition1(X, Ne W(X), Mb W(X)) % Lemma 29 2: for Y, Z Ne W(X) such that Skip Check cond1(X, Y, Z) is False do 3: for S Mb W(X) \ {Y, Z} do 4: if (Y Z|S {X})Data(W) then 5: return False 6: Skip Check cond1(X, Y, Z) True 7: return True 1: Function Condition2(X, Ne W(X), ΛW(X), Mb W(X), VS(X)) % Lemma 30 2: for Y ΛW(X) and Z Ne W(X) such that Skip Check cond2(X, Y, Z) is False do 3: Γ {V = Z : (X V Y ) VS(X)} 4: for S Mb W(X) \ {Y, Z} s.t. S Γ = do 5: if (Y Z|S {X})Data(W) then 6: return False 7: Skip Check cond2(X, Y, Z) True 8: return True 5.1.1 Avoiding Duplicate CI Tests in MARVEL Suppose that MARVEL verifies the removability of a variable X and determines that it is not removable. As a result, the algorithm will need to verify again the removability of X in some of the following iterations, potentially leading to redundant CI tests. To address Recursive Causal Discovery this, we propose a method to eliminate such redundancies by leveraging information from previous iterations. Suppose that during iteration i, we invoke Condition1 from Algorithm 5 for a variable X, where W = Vi. If two variables Y, Z Ne Vi(X) do not have a separating set in Mb Vi(X) \ {Y, Z}, then they will not have a separating set in Mb Vi (X) \ {Y, Z} for any i > i. 
Accordingly, to prevent redundant CI tests, Skip Check cond1 is employed in function Condition1 to save this information and avoid performing duplicate CI tests. A similar approach is adopted in Condition2 using Skip Check cond2. To further enhance the implementation, Skip Check vec is integrated into Algorithm 4. This is based on the observation that a variable's removability hinges on its Markov boundary: if a variable X is found non-removable in one iteration, it remains non-removable as long as its Markov boundary is unchanged, thereby obviating the need to recheck it within the for loop of lines 11-30. Initially, every variable in V has Skip Check vec set to False. Should either function Condition1 or Condition2 return False, we switch Skip Check vec(X) to True. Skip Check vec(X) stays True unless there is a change in its Markov boundary. As detailed in Section 5.6, updating Skip Check vec is crucial and is done in the Update Mb function.

5.2 L-MARVEL

Similar to MARVEL, we provide pseudocode for L-MARVEL in Algorithm 6. In the for loop of lines 11-22, the algorithm learns the neighbors of Xj and verifies its removability. If the algorithm is learning the neighbors of Xj for the first time, it uses Lemma 27 (function Find Neighbors) to learn Ne Vi(Xj). Otherwise, this information has already been stored in ˆG, and the algorithm executes line 13 to recover the neighbors of Xj among the remaining variables. To verify the removability of Xj, the Is Removable function is invoked. This function checks the two conditions of Theorem 32. If neither condition is met for some pair, the function determines that Xj is not removable. Algorithm 6 uses Skip Check vec and Skip Check mat to avoid performing duplicate CI tests. Similar to MARVEL, if a variable is found non-removable in one iteration, its Skip Check vec value is set to True, and it remains so as long as the Markov boundary of the variable remains unchanged.
Additionally, Proposition 34 implies that if the condition of Theorem 32 is met for Y and Z during an iteration in the Is Removable function, the graphical characterization of Theorem 19 holds for X, Y, Z. Therefore, to check the removability of X, it is not necessary to check the two conditions of Theorem 32. Accordingly, Skip Check mat is employed in the Is Removable function to avoid performing redundant CI tests. In this part, we discuss the implementation details of RSLω and RSLD. Algorithm 7 presents the pseudocode for RSLω. This algorithm takes m as an input, which is an upper bound on the clique number of the true underlying graph. One of the differences between this algorithm and MARVEL and L-MARVEL is the sequence of operations. This algorithm first checks the removability of a variable using Theorem 36 Mokhtarian, Elahi, Akbari, Kiyavash Algorithm 6: L-MARVEL 1: Input: Data(V) 2: Initialize undirected graph ˆG = (V, E = ) 3: X V: Skip Check vec(X) False 4: X, Y, Z V: Skip Check mat(X, Y, Z) False 6: Mb V1 Compute Mb(Data(V)) 7: for i from 1 to n 1 do 8: Vi {X Vi|Skip Check vec(Xj) = False} 10: (X1, . . . 
, Xr) Sort Vi such that |Mb Vi(X1)| |Mb Vi(X2)| |Mb Vi(Xr)| 11: for j from 1 to r do 12: if First time learning the neighbors of Xj then 13: Ne Vi(Xj) Find Neighbors(Xj, Mb Vi(Xj)) 14: Add undirected edges between Xj and Ne Vi(Xj) in ˆG if not present 16: Ne Vi(Xj) Ne ˆG[Vi](Xj) 17: if Is Removable(Xj, Ne Vi(Xj), Mb Vi(Xj))) is True then 18: Vi+1 Vi \ {Xj} 19: Mb Vi+1 Update Mb(Xj, Ne Vi(Xj), Mb Vi) 20: Break the for loop of line 11 22: Skip Check vec(Xj) True 23: return ˆG 1: Function Find Neighbors(X, Mb W(X)) % Lemma 27 2: Ne W(X) Mb W(X) 3: for Y Mb W(X) do 4: if S Mb W(X) \ {Y } : (X Y |S)Data(W) then 5: Remove Y from Ne W(X) 6: return Ne W(X) 1: Function Is Removable(X, Ne W(X), Mb W(X)) % Theorem 32 2: for Y Mb W(X) and Z Ne W(X) such that Skip Check mat(X, Y, Z) is False do 3: if S Mb W(X) \ {Y, Z} : (Y Z|S)Data(W) then 4: if S Mb W(X) \ {Y, Z} : (Y Z|S {X})Data(W) then 5: Return False 6: Skip Check mat(X, Y, Z) True 7: Return True Recursive Causal Discovery through the function Is Removable. Subsequently, it finds the neighbors of the variable by employing Proposition 37 in the function Find Neighbors. Furthermore, Algorithm 9 only uses one data structure, Skip Check vec, to efficiently prevent redundant CI tests. As mentioned above, Algorithm 7 takes m, an upper bound on the clique number of the causal graph. But what happens if it is not a valid upper bound, i.e., m < ω(G)? In this scenario, two outcomes are possible: either Algorithm 7 is unable to identify any removable variables at an iteration and halts, or RSLω terminates and returns a graph. Proposition 42 (RSLω is verifiable) If Algorithm 7 terminates with an input m > 0, then the clique number of the learned skeleton is greater than or equal to the clique number of the true causal graph. Proposition 42 implies that, upon termination of the algorithm, if the clique number of the learned graph is at most m, then m is a valid upper bound on the clique number, ensuring the correctness of the output. 
Otherwise, it indicates that the true clique number is greater than m. Accordingly, we present Algorithm 8, which can learn the skeleton of G without any prior knowledge about ω(G).

We present Algorithm 9 for RSLD. The main body of the algorithm is the same as Algorithm 7 for RSLω. However, the difference lies in the Is Removable and Find Neighbors functions. The former uses Theorem 39 to check the removability of a variable, while the latter uses Proposition 40 to learn the neighbors of a removable variable. Note that RSLD assumes that the underlying graph is diamond-free.

5.4 ROL

As we discussed in Section 4.5, ROL aims to solve the optimization problem described in Equation (8), which uses Algorithm 3 to define a cost function for a given permutation. Additionally, we presented a reinforcement learning (RL) formulation of this optimization problem. In this section, we present the implementation details of three approaches for learning an r-order by solving Equation (8).

5.4.1 ROLHC

In Algorithm 10, we propose a hill-climbing approach, called ROLHC, for finding an r-order. In general, the output of Algorithm 10 is a suboptimal solution to (8): it takes an initial order π and gradually modifies it to another order with lower cost, but such a greedy approach is not guaranteed to find a minimizer of (8). Nevertheless, this algorithm is suitable in practice, as it is scalable to large graphs and achieves superior accuracy compared to state-of-the-art methods. The inputs to Algorithm 10 are the observational data Data(V) and two parameters, max Iter and max Swap. max Iter denotes the maximum number of iterations before the algorithm terminates, and max Swap is an upper bound on the index difference of two variables that can be swapped in an iteration (line 6). The initial order π in line 2 can be any arbitrary order, but selecting it cleverly (such as initializing with the output of other approaches) will improve the performance of the algorithm.
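The swap-and-compare control flow of ROLHC can be sketched as follows. This is a toy illustration with our own names: the cost function here is a stand-in (inversion count), whereas in ROLHC the cost of an order is |Eπ| computed by Compute Cost from data.

```python
import itertools

def hill_climb(order, cost, max_swap=1, max_iter=100):
    """Greedy local search over orders: repeatedly swap two positions
    whose index difference is at most max_swap, keeping any swap that
    lowers the cost, and restart the scan after each improvement."""
    order = list(order)
    best = cost(order)
    for _ in range(max_iter):
        improved = False
        for a, b in itertools.combinations(range(len(order)), 2):
            if b - a > max_swap:
                continue
            cand = order[:]
            cand[a], cand[b] = cand[b], cand[a]
            c = cost(cand)
            if c < best:
                order, best = cand, c
                improved = True
                break  # restart the swap scan from the new order
        if not improved:
            break  # local minimum reached
    return order, best

def inversions(seq):
    """Stand-in cost: number of out-of-order pairs."""
    return sum(1 for i, j in itertools.combinations(range(len(seq)), 2)
               if seq[i] > seq[j])
```

With adjacent swaps only (max_swap = 1), the search behaves like a greedy bubble sort on this toy cost, which mirrors how ROLHC descends toward a low-cost order.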
Mokhtarian, Elahi, Akbari, Kiyavash Algorithm 7: RSLω 1: Input: Data(V), m 2: Initialize undirected graph ˆG = (V, E = ) 3: X V: Skip Check vec(X) False 5: Mb V1 Compute Mb(Data(V)) 6: for i from 1 to n 1 do 7: Vi {X Vi|Skip Check vec(Xj) = False} 9: (X1, . . . , Xr) Sort Vi such that |Mb Vi(X1)| |Mb Vi(X2)| |Mb Vi(Xr)| 10: for j from 1 to r do 11: if Is Removable(Xj, Mb Vi(Xj)), m) is True then 12: Ne Vi(Xj) Find Neighbors(Xj, Mb Vi(Xj), m) 13: Add undirected edges between Xj and Ne Vi(Xj) in ˆG if not present 14: Vi+1 Vi \ {Xj} 15: Mb Vi+1 Update Mb(Xj, Ne Vi(Xj), Mb Vi) 16: Break the for loop of line 10 18: Skip Check vec(Xj) True 19: return ˆG 1: Function Is Removable(X, Mb W(X), m) % Theorem 36 2: for S Mb W(X) with |S| m 2 do 3: if Y, Z Mb W(X) \ S : Y Z| Mb W(X) {X} \ {Y, Z} S or Y Mb W(X) \ S : (X Y |Mb W(X) \ ({Y } S))Data(W) then 4: Return False 5: Return True 1: Function Find Neighbors(X, Mb W(X), m) % Proposition 37 2: Ne W(X) Mb W(X) 3: for Y Mb W(X) do 4: if S Mb W(X) \ {Y } : |S| = m 1, (X Y |Mb W(X) \ ({Y } S))Data(W) then 5: Remove Y from Ne W(X) 6: return Ne W(X) Algorithm 8: RSLω Without Side Information. 1: Input: Data(V) 2: for m from 1 to n do 3: ˆG RSLω(Data(V), m) 4: if RSLω terminates and ω( ˆG) m then 5: return ˆG Recursive Causal Discovery Algorithm 9: RSLD 1: Input: Data(V) 2: Initialize undirected graph ˆG = (V, E = ) 3: X V: Skip Check vec(X) False 5: Mb V1 Compute Mb(Data(V)) 6: for i from 1 to n 1 do 7: Vi {X Vi|Skip Check vec(Xj) = False} 9: (X1, . . . 
, Xr) ← Sort Vi such that |Mb Vi(X1)| ≤ |Mb Vi(X2)| ≤ · · · ≤ |Mb Vi(Xr)|
10: for j from 1 to r do
11: if Is Removable(Xj, Mb Vi(Xj)) is True then
12: Ne Vi(Xj) ← Find Neighbors(Xj, Mb Vi(Xj))
13: Add undirected edges between Xj and Ne Vi(Xj) in ˆG if not present
14: Vi+1 ← Vi \ {Xj}
15: Mb Vi+1 ← Update Mb(Xj, Ne Vi(Xj), Mb Vi)
16: Break the for loop of line 10
18: Skip Check vec(Xj) ← True
19: return ˆG

1: Function Is Removable(X, Mb W(X)) % Theorem 39
2: for Y, Z ∈ Mb W(X) do
3: if (Y ⊥ Z | (Mb W(X) ∪ {X}) \ {Y, Z})Data(W) then
4: Return False
5: Return True

1: Function Find Neighbors(X, Mb W(X)) % Proposition 40
2: Ne W(X) ← Mb W(X)
3: for Y ∈ Mb W(X) do
4: if ∃Z ∈ Mb W(X) \ {Y}: (X ⊥ Y | Mb W(X) \ {Y, Z})Data(W) then
5: Remove Y from Ne W(X)
6: return Ne W(X)

The subroutine Compute Cost takes as input an order π and two numbers 1 ≤ a < b ≤ n and returns a vector Cab ∈ R^n. For a ≤ t ≤ b, the t-th entry of Cab is equal to |Ne Vt(Xt)|, the number of neighbors of Xt in the remaining graph. Hence, to learn the total cost of π, we can call this function with a = 1 and b = n and then compute the sum of the entries of the output C1n. Accordingly, the main algorithm initially computes the cost vector of π in line 3. The remainder of the algorithm (lines 4-13) updates π iteratively, at most max Iter times. It updates the current order π = (X1, ..., Xn) as follows: first, it constructs a set of orders Πnew ⊆ Π(V) from π by swapping any two variables Xa and Xb in π as long as 1 ≤ b − a ≤ max Swap. Next, for each πnew ∈ Πnew, it computes the cost of πnew, and if it has a lower cost than the current order, the algorithm replaces π with that order and repeats the process. Note that in line 8, the algorithm calls the function Compute Cost with a and b to compute the cost of the new order. The reason is that for i < a and i > b, the i-th entries of the costs of π and πnew are the same, because the set of remaining variables is the same.
Hence, to compare the cost of π with the cost of πnew, it suffices to compare their entries between a and b. Accordingly, Algorithm 10 checks the condition in line 9, and if the cost of the new order is lower, the algorithm updates π and its corresponding cost. Note that it suffices to update the entries between a and b of C1n.

Algorithm 10: ROLHC
1: Input: Data(V), max Swap, max Iter
2: Initialize π ∈ Π(V)
3: C1n ← Compute Cost(π, 1, n, Data(V))
4: for 1 to max Iter do
5: Denote π by (X1, ..., Xn)
6: for 1 ≤ a < b ≤ n such that b − a ≤ max Swap do
7: πnew ← Swap Xa and Xb in π
8: Cnew ab ← Compute Cost(πnew, a, b, Data(V))
9: if Σ from i=a to b of Cnew ab(i) < Σ from i=a to b of C1n(i) then
10: π ← πnew
11: for j from a to b do
12: C1n(j) ← Cnew ab(j)
13: Break the for loop of line 6
14: Return π

1: Function Compute Cost(π, a, b, Data(V))
2: Va ← {Xa, Xa+1, ..., Xn}
3: Cab ← (0, 0, ..., 0) ∈ R^n
4: for t = a to b do
6: Ne Vt(Xt) ← Find Neighbors(Xt, Data(Vt))
7: Cab(t) ← |Ne Vt(Xt)|
8: Vt+1 ← Vt \ {Xt}
9: Return Cab

5.4.2 ROLVI

Following the RL formulation of ROL introduced in Section 4.5, we present ROLVI. This algorithm uses value iteration to tackle the problem. Given a deterministic policy πθ parameterized by θ, we can adapt Algorithm 3 to the RL setting as follows: the algorithm takes as input a policy πθ instead of a permutation π. Furthermore, it uses the given policy to select Xt in line 4, given by Xt = πθ(Vt). Finally, given a policy πθ and the initial state s1 = V, the cumulative reward of a trajectory τθ = (s1, a1, s2, a2, ..., sn−1, an−1), the sequence of states and actions selected by πθ, is given by

R(τθ) = Σ from t=1 to n−1 of r(st, at) = −Σ from t=1 to n−1 of |Nest(at)|.

Hence, if we denote the output of this modified algorithm by Gθ = (V, Eθ), then R(τθ) = −|Eθ|. With this RL formulation, any optimal-policy-finding RL algorithm can be used to find a minimum-cost policy πθ and thus solve the optimization problem given in Theorem 41. Accordingly, ROLVI applies the value iteration algorithm.
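The MDP bookkeeping above can be sketched in a few lines. Here `find_neighbors` is a stand-in oracle that returns the true skeleton neighbors among the remaining variables (what the CI-based routine of Algorithm 3 recovers in the oracle setting when π is an r-order, by Theorem 41), so the cumulative reward equals −|Eπ|. All names are ours, not the RCD API.

```python
# Hypothetical rollout of the ROL MDP: state = remaining variables,
# action = next variable of the order, reward = -|Ne_st(a_t)|.

def rollout(order, find_neighbors):
    state = set(order)
    total_reward, edges = 0, set()
    for a in order[:-1]:              # n - 1 steps, as in Algorithm 3
        nbrs = find_neighbors(a, state)
        edges |= {frozenset((a, y)) for y in nbrs}
        total_reward -= len(nbrs)     # r(s_t, a_t) = -|Ne_st(a_t)|
        state -= {a}                  # s_{t+1} = V_t \ {a_t}
    return total_reward, edges

# Toy neighbor oracle: skeleton of the DAG A -> B, B -> C, A -> C.
skeleton = {frozenset(e) for e in [("A", "B"), ("B", "C"), ("A", "C")]}

def find_neighbors(x, remaining):
    return {y for y in remaining - {x} if frozenset((x, y)) in skeleton}
```

Rolling out the r-order (C, B, A) discovers all three skeleton edges and accumulates a reward of −3 = −|Eπ|, matching the identity R(τθ) = −|Eθ|.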
5.4.3 ROLPG

Although any suitable RL algorithm is capable of finding an optimal deterministic policy for us, the complexity does not scale well with the graph size. Therefore, we can use a stochastic policy that increases exploration during the training of an RL algorithm. ROLPG exploits stochastic policies parameterized by neural networks to further improve scalability. However, this could come at the price of approximating the optimal solution instead of finding the exact one. In the stochastic setting, an action at is selected according to a distribution over the remaining variables, i.e., at ∼ Pθ(· | st = Vt), where θ denotes the parameters of the policy (e.g., the weights of a neural network). In this case, the objective of the algorithm is to minimize the expected total number of edges learned by policy Pθ(· | st = Vt), i.e.,

arg min over θ of E over τθ ∼ Pθ of |Eθ|, (9)

where the expectation is taken w.r.t. the randomness of the stochastic policy. Many algorithms have been developed in the literature for finding stochastic policies and solving (9). Some examples include Vanilla Policy Gradient (VPG) (Williams, 1992), REINFORCE (Sutton et al., 1999), and Deep Q-Networks (DQN) (Mnih et al., 2013). Accordingly, ROLPG applies VPG.

5.5 Initially Computing Markov Boundaries

Several algorithms have been proposed in the literature for discovering Markov boundaries (Margaritis and Thrun, 1999; Guyon et al., 2002; Tsamardinos and Aliferis, 2003; Tsamardinos et al., 2003; Yaramakala and Margaritis, 2005; Fu and Desmarais, 2010). One simple approach is to use total conditioning (TC) (Pellet and Elisseeff, 2008b). TC states that X and Y are in each other's Markov boundary if and only if

(X ⊥̸ Y | V \ {X, Y})PV. (10)

Using total conditioning, n(n − 1)/2 CI tests suffice to identify the Markov boundaries of all vertices. However, each test requires conditioning on a large set of variables.
This issue has been addressed in multiple algorithms, including Grow-Shrink (GS) (Margaritis and Thrun, 1999), IAMB (Tsamardinos et al., 2003), and its various modifications. These algorithms propose methods that perform more CI tests⁵ but with smaller conditioning sets. Choosing the right algorithm for computing the Markov boundaries depends on the available data.

Algorithm 11: Updates Markov boundaries.
1: Function UpdateMb(X, Ne_W(X), Mb_W)
2:   Mb_{W\{X}} ← {Mb_W(Y): Y ∈ W \ {X}}
3:   for Y ∈ Mb_W(X) do
4:     Remove X from Mb_{W\{X}}(Y)
5:   for Y, Z ∈ Ne_W(X) such that |Mb_W(Y)| ≤ |Mb_W(Z)| do
6:     if $(Y \perp\!\!\!\perp Z \mid Mb_{W\setminus\{X\}}(Y) \setminus \{Y, Z\})_{Data(W)}$ then
7:       Remove Z from Mb_{W\{X}}(Y)
8:       Remove Y from Mb_{W\{X}}(Z)
9:       SkipCheck_vec(Y) ← False
10:      SkipCheck_vec(Z) ← False
11: return Mb_{W\{X}}

5.6 Updating Markov Boundaries

When a variable is removed in the recursive framework of Algorithm 1, we do not need to recompute the Markov boundaries of all the vertices. Instead, we can update the Markov boundaries of the remaining vertices. Let W be the set of variables in an iteration with the set of Markov boundaries $Mb_W$. Suppose we want to remove a variable X from W at the end of the current iteration, where $Ne_W(X)$ is the set of neighbors of X. In this case, we only need to compute $Mb_{W\setminus\{X\}}$, the set of Markov boundaries of the remaining variables, which can be done with Algorithm 11. Removing vertex X from MAG $G_W$ has two effects:

1. X is removed from all Markov boundaries, and
2. for $Y, Z \in W \setminus \{X\}$, if all of the collider paths between Y and Z in $G_W$ pass through X, then Y and Z must be excluded from each other's Markov boundary.

In the latter case, $Y, Z \in Mb_W(X)$, and the update is performed using a single CI test, $(Y \perp\!\!\!\perp Z \mid Mb_W(Z) \setminus \{X, Y, Z\})_{P_W}$, or equivalently, $(Y \perp\!\!\!\perp Z \mid Mb_W(Y) \setminus \{X, Y, Z\})_{P_W}$. We choose the CI test with the smaller conditioning set. If the CI test shows that Y and Z are conditionally independent, we remove them from each other's Markov boundary.
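The update step of Algorithm 11 can be sketched with plain Python sets standing in for the package's Markov boundary matrix; `ci_test` is again a hypothetical independence oracle. In the toy run below, removing the middle vertex of a v-structure Y → X ← Z also severs the spouse link between Y and Z, exercising both effects described above.

```python
from itertools import combinations

def update_mb(x, neighbors_x, mb, skip_check, ci_test):
    """Update Markov boundaries after removing x (cf. Algorithm 11)."""
    new_mb = {v: set(s) for v, s in mb.items() if v != x}
    for y in mb[x]:                      # effect 1: drop x everywhere
        new_mb[y].discard(x)
    for y, z in combinations(sorted(neighbors_x), 2):
        if len(mb[z]) < len(mb[y]):      # condition on the smaller boundary
            y, z = z, y
        if z in new_mb[y]:
            cond = new_mb[y] - {y, z}
            if ci_test(y, z, cond):      # effect 2: x carried all collider paths
                new_mb[y].discard(z)
                new_mb[z].discard(y)
                skip_check[y] = False    # their boundaries changed, so
                skip_check[z] = False    # they must be re-examined later
    return new_mb
```

Only pairs of neighbors of x are tested, with a single CI test per pair, matching the cost argument in the text.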
Recall that we used SkipCheck_vec in the proposed algorithms to avoid unnecessary computations. As discussed in Section 5.1.1, when the Markov boundary of a variable has changed, we need to set its SkipCheck_vec value to False. Accordingly, when the condition of line 6 holds, Algorithm 11 removes Y and Z from each other's Markov boundary and therefore sets their SkipCheck_vec values to False in lines 9 and 10.

5. These algorithms perform at most $O(n^2)$ CI tests.

5.7 Identifying the MEC

In the previous sections, we primarily addressed the task of learning the skeleton of the causal graph. In this section, we discuss how our methods can be augmented to identify the MEC of the underlying causal mechanism, $[G_V]$. As a general rule, having access to the true skeleton and a separating set for each pair of non-neighbor vertices suffices to identify the MEC (Zhang, 2008b). However, we will provide modifications tailored to a few of our algorithms, which improve the computational complexity.

Algorithms MARVEL and RSL require causal sufficiency. This implies that their objective is to recover the Markov equivalence class of a DAG, $[G]_d$. Verma and Pearl (1991) showed that two DAGs are Markov equivalent if and only if they have the same skeleton and v-structures. Accordingly, to identify $[G]_d$, it suffices to learn the skeleton and v-structures of DAG G. As such, in Sections 5.7.1 and 5.7.2 we describe how to recover the v-structures. With this information at hand, Meek rules (Meek, 1995) can be applied to obtain a maximally oriented partially directed graph (CPDAG), also known as the essential graph. For a characterization and graph-theoretical properties of such graphs, refer to Andersson et al. (1997). On the other hand, Algorithms L-MARVEL and ROL serve to recover the MEC of a MAG. Having the same skeleton and unshielded colliders is necessary but not sufficient for two MAGs to be Markov equivalent.
The following proposition by Spirtes and Richardson (1996) characterizes necessary and sufficient conditions for two MAGs to be Markov equivalent.

Proposition 43 (Spirtes and Richardson, 1996) Two MAGs $G_1$ and $G_2$ are Markov equivalent if and only if (i) they have the same skeleton, (ii) they have the same unshielded colliders, and (iii) if a path P is a discriminating path (Definition 7) for a vertex X in both MAGs, then X is a collider on P in $G_1$ if and only if it is a collider on P in $G_2$.

Building upon this proposition, it suffices to learn the skeleton, the unshielded colliders, and the shielded colliders for which a discriminating path exists. Subsequently, complete orientation rules can be applied to obtain a maximally oriented (a.k.a. maximally informative) partial ancestral graph (PAG). We refer to Zhang (2008b) for the complete orientation rules and further discussion.

5.7.1 Recovering V-structures in MARVEL

Our goal is to recover the v-structures. Note that a v-structure comprises a pair of co-parents and a common child of theirs. As such, we first describe how to identify the pairs of variables that are co-parents. Since each variable is identified as removable in exactly one iteration of MARVEL, we can identify the co-parents of a variable at the iteration where it gets removed. However, removing a variable does not preserve co-parent relationships. Indeed, if Y, Z are co-parents in a DAG G, where X is a removable variable, then either Y, Z are still co-parents of each other in $G[V \setminus \{X\}]$, or $Y \in Mb_G(Z) \setminus Mb_{G[V \setminus \{X\}]}(Z)$. The latter case happens when X is the only common child of Y and Z. Based on this observation, in order to identify all co-parent pairs, we first modify Algorithm 11, the procedure for updating Markov boundaries, so that whenever a pair of variables (Y, Z) is removed from each other's Markov boundary in lines 7 and 8, this pair is marked as co-parents of each other in the final graph.
Furthermore, the separating set for this pair, namely $Mb_{W\setminus\{X\}}(Y)$, is stored as $S_{YZ}$. The rest of the co-parents and their corresponding separating sets are discovered in line 13 of Algorithm 4, which consists of an application of Lemma 27. During this step, the Markov boundary of X is partitioned into $Mb_{V_i}(X) = Ne_{V_i}(X) \cup \Lambda_{V_i}(X)$ by finding a separating set $S_{XY}$ for every variable $Y \in \Lambda_{V_i}(X)$. These separating sets are stored for every pair of co-parents (X, Y). At the end of the skeleton discovery phase, the set of v-structures can be identified based on the following lemma.

Lemma 44 (Verma and Pearl, 1991) Let X, Y, Z be three arbitrary vertices of DAG G. These vertices form a v-structure in G, i.e., $X \to Z \leftarrow Y$, if and only if all of the following hold: $Y \in \Lambda_G(X)$, $Z \in Ne_G(X) \cap Ne_G(Y)$, and $Z \notin S_{XY}$.

Accordingly, Lemma 44 can be integrated into MARVEL to identify the v-structures.

5.7.2 Recovering V-structures in RSL

In analogy to MARVEL, it suffices to identify the v-structures. Also, the v-structures that are not preserved due to vertex removals can be identified and oriented by modifying Algorithm 11, just as described in the case of MARVEL. Herein, we present the procedure for identifying the v-structures $VS_G(X)$ for a removable variable X. We describe this procedure for RSLD and RSLω separately.

RSLD. In the case of diamond-free graphs, if X is a removable variable and $Y \in \Lambda_G(X)$, then X and Y have a unique common child. Indeed, Proposition 40 reveals not only the co-parents of a removable variable X but also the unique common child of X and each of its co-parents. Accordingly, to identify the v-structures $VS_G(X)$ during RSLD, it suffices to modify the FindNeighbors subroutine as follows. Each time a variable $Z \in Mb_G(X) \setminus \{Y\}$ satisfies Equation (7) in line 4 of FindNeighbors, the edges are oriented as $X \to Z$ and $Y \to Z$, since Z is the unique common child of X and Y.

RSLω.
Analogous to RSLD, we can exploit the procedure for finding the neighbors to further identify the v-structures. In particular, Proposition 37 identifies a unique set S as the common children of X and Y, for any Y that is a co-parent of X. Once such a set S is found in line 4 of FindNeighbors in RSLω (see line 12 of Algorithm 7) for a removable variable $X_j$ and $Y \in Mb_G(X_j)$, it suffices to orient every edge U − V with $U \in \{X_j, Y\}$ and $V \in S$ as $U \to V$.

6. Complexity and Completeness Analysis

In this section, we discuss the complexity and completeness of various causal discovery methods, with a particular emphasis on our proposed recursive approaches. This analysis is crucial for understanding the efficiency and reliability of these methods in practical applications. In addition, we delve into the theoretical limits of these algorithms by providing lower bounds on the complexity of constraint-based algorithms for learning DAGs and MAGs.

6.1 Lower Bound

We introduce two fundamental theorems: Theorem 45, which establishes a lower bound on the complexity of constraint-based algorithms for learning DAGs, and Theorem 46, which does the same for MAGs. These theorems are instrumental in quantifying the theoretical limits of constraint-based methods.

Theorem 45 (Lower bound for DAGs) For any positive integers n and $1 \le c \le n$, there exists a DAG G with n vertices and $\Delta_{in}(G) = c$ such that the number of d-separation statements of the form $(X \perp\!\!\!\perp Y \mid Z)_G$ required by any constraint-based algorithm to learn the skeleton of G is lower bounded by

$\Omega\left(n^2 + n\,\Delta_{in}(G)\,2^{\Delta_{in}(G)}\right).$   (11)

Theorem 45 characterizes the hardness of causal discovery under causal sufficiency in terms of $\Delta_{in}(G)$ rather than other graph parameters such as $\Delta(G)$ or $\alpha(G)$.
Theorem 46 (Lower bound for MAGs) For any positive integers n and $1 \le c \le n$, there exists a MAG G with n vertices and $\Delta^+_{in}(G) = c$ such that the number of m-separation statements of the form $(X \perp\!\!\!\perp Y \mid Z)_G$ required by any constraint-based algorithm to learn the skeleton of G is lower bounded by

$\Omega\left(n^2 + n\,\Delta^+_{in}(G)\,2^{\Delta^+_{in}(G)}\right).$   (12)

Theorem 46 extends the complexity analysis to the setting of MAGs, addressing scenarios in causal discovery where the assumption of causal sufficiency does not hold. It similarly parameterizes the hardness of the problem in terms of $\Delta^+_{in}(G)$, an extension of $\Delta_{in}$ to MAGs (Definition 3). While our analysis focuses on the computational complexity of constraint-based causal discovery algorithms, we also acknowledge the complementary line of research on sample complexity and methods addressing limited-data scenarios in causal discovery. Notably, Ghoshal and Honorio (2017), Gao et al. (2022), and Jamshidi et al. (2024) have established lower bounds on the number of samples required to recover causal graphs under various assumptions and settings. Additionally, Cui et al. (2022) propose methods to improve constraint-based causal discovery under insufficient data by enhancing the robustness of statistical tests.

6.2 Completeness Analysis and Achievable Bounds of RCD Methods

We present completeness guarantees and upper bounds on the number of CI tests performed by MARVEL, L-MARVEL, RSL, and ROL, as implemented in Section 5. Recall that our algorithms take as input Data(V), a set of i.i.d. samples from the observational distribution $P_V$. In the propositions that follow, we provide asymptotic guarantees of correctness for our methods under the assumption that Data(V) is sufficiently large to accurately recover the CI relations present in $P_V$. Additionally, as outlined in Section 2, we assume that the true underlying graph encodes the CI relations of the observational distribution, as stated in Equation (3).
It should be noted that while our methods are also compatible with weaker notions of faithfulness, we focus on this primary assumption for simplicity of presentation.

Proposition 47 (Completeness and complexity of MARVEL) Under the assumption of causal sufficiency, MARVEL, as implemented in Algorithm 4, correctly returns the skeleton of DAG G by performing at most

$\binom{n}{2} + n\,\Delta_{in}(G)\,2^{\Delta_{in}(G)}\left(1 + 0.45\,\Delta_{in}(G)\right) = O\left(n^2 + n\,\Delta_{in}(G)^2\,2^{\Delta_{in}(G)}\right)$   (13)

unique CI tests, and at most $\binom{n}{2}\,2^{\Delta_{in}(G)-1}$ duplicate CI tests in the worst case.⁶

Corollary 48 If $\Delta_{in}(G) \le c \log n$, MARVEL uses at most $O(n^2 + n^{c+1} \log^2 n)$ unique CI tests in the worst case, which is polynomial in the number of variables.

As n gets larger, if DAG G has a constant value of $\Delta_{in}(G)$, or more generally $\Delta_{in}(G) \le (1 - \epsilon) \log n$ for some $\epsilon > 0$, then both the achievable upper bound of MARVEL in (13) and the lower bound in (11) are quadratic in n. For larger values of $\Delta_{in}(G)$, the second terms in these expressions become dominant. In this case, the upper bound of MARVEL differs from the lower bound only by a factor of $\Delta_{in}(G)$. This demonstrates that under causal sufficiency and without any additional information, MARVEL has a worst-case performance that nearly matches the lower bound.

Proposition 49 (Completeness and complexity of L-MARVEL) L-MARVEL, as implemented in Algorithm 6, correctly returns the skeleton of MAG $G_V$ by performing at most $O\left(n^2 + n\,\Delta^+_{in}(G_V)^2\,2^{\Delta^+_{in}(G_V)}\right)$ CI tests.

Similar to our previous argument for MARVEL, Proposition 49 shows that the complexity of L-MARVEL aligns closely with the lower bound for MAGs in Theorem 46, demonstrating its near-optimal performance in causal models with latent variables.

Proposition 50 (Completeness and complexity of RSLω) Under causal sufficiency, if $\omega(G) \le m$, then RSLω, as implemented in Algorithm 7, correctly returns the skeleton of DAG G by performing at most $O\left(n^2 + n\,\Delta_{in}(G)^{m+1}\right)$ CI tests.
Proposition 50 marks a pivotal development in causal discovery for DAGs with bounded treewidth. Recent studies, including Korhonen and Parviainen (2013), Nie et al. (2014), and Ramaswamy and Szeider (2021), have emphasized the importance of algorithms tailored for scenarios where an upper bound on the treewidth of the causal graph is given as side information. A bound on the treewidth is more restrictive than a bounded clique number, as indicated by the following inequality:

$\omega(G) \le \text{treewidth}(G) + 1.$

6. Duplicate CI tests can be completely eliminated through efficient use of memory. This aspect is omitted here for simplicity and readability.

As such, our RSLω algorithm is also applicable to causal discovery in DAGs with bounded treewidth. Notably, while existing exact discovery algorithms for these networks have exponential complexity, RSLω maintains a polynomial complexity. This indicates that when the treewidth is constant, causal discovery is no longer NP-hard and can be solved in polynomial time.

Proposition 51 (Completeness and complexity of RSLD) Under causal sufficiency, if G is a diamond-free DAG, then RSLD, as implemented in Algorithm 9, correctly returns the skeleton of DAG G by performing at most

$O\left(n^2 + n\,\Delta_{in}(G)^3\right)$   (14)

CI tests.

Remark 52 Even if DAG G has diamonds, RSLD correctly recovers all the existing edges, possibly along with extra edges; i.e., RSLD has no false negatives.

RSLD is the fastest among our proposed recursive methods. This can also be seen from the upper bound in Equation (14), where the number of CI tests stays tractable even for graphs with more than 1000 variables.

Proposition 53 (Complexity of ROLHC) ROLHC, as implemented in Algorithm 10, performs at most

$O\left(\text{maxIter} \cdot n^3\right)$   (15)

CI tests, excluding the initialization step in line 2.

Note that the upper bound in Equation (15) may vary depending on the initialization step in line 2 and the choice of the FindNeighbors function.
Also, here we assume that maxSwap is a constant.

Proposition 54 (Completeness and complexity of ROLVI) Under the RL setting introduced in Section 5.4.2, ROLVI finds the optimal policy by performing at most $O(n^2 2^n)$ CI tests.

Note that the bound in Proposition 54 is much lower than the $O(n!)$ cost of iterating over all orders.

6.3 Comparison

In this part, we present a comparative analysis of various causal discovery methods, including our proposed algorithms. The summary of this comparison is presented in Table 3, which categorizes each algorithm based on its assumptions, completeness guarantees, and computational complexity in terms of the number of CI tests required. The last two rows present the lower bounds on the complexity under causal sufficiency and in its absence, as established in Section 6.1. A critical observation from this analysis is that the upper bounds on the complexity of our algorithms (MARVEL, L-MARVEL, RSL, and ROL) are significantly better than those of the others. This efficiency in DAGs is largely due to the following inequality:

$\Delta_{in}(G) \le \Delta(G) \le \alpha(G).$

| Algorithm | Causal sufficiency | Other assumptions | Completeness | #CI tests |
|---|---|---|---|---|
| MARVEL | YES | - | YES | $O(n^2 + n\,\Delta_{in}^2\,2^{\Delta_{in}})$ |
| L-MARVEL | NO | - | YES | $O(n^2 + n\,(\Delta^+_{in})^2\,2^{\Delta^+_{in}})$ |
| RSLω | YES | $\omega(G) \le m$ | YES | $O(n^2 + n\,\Delta_{in}^{m+1})$ |
| RSLD | YES | diamond-free | YES | $O(n^2 + n\,\Delta_{in}^3)$ |
| ROLHC | NO | - | NO | $O(\text{maxIter} \cdot n^3)$ |
| ROLVI | NO | - | YES | $O(n^2 2^n)$ |
| ROLPG | NO | - | NO | N/A |
| PC | YES | - | YES | $O(n^{\Delta})$ |
| GS | YES | - | YES | $O(n^2 + n\,\alpha^2 2^{\alpha})$ |
| MMPC | YES | - | YES | $O(n^2 2^{\alpha})$ |
| CS | YES | - | YES | $O(n^2 2^{\alpha})$ |
| FCI | NO | - | YES | N/A |
| RFCI | NO | - | NO | N/A |
| FCI+ | NO | - | NO | $O(n^{2(\Delta+2)})$ |
| MBCS* | NO | - | YES | $O(n^2 2^{\alpha})$ |
| Lower bound | YES | - | YES | $\Omega(n^2 + n\,\Delta_{in} 2^{\Delta_{in}})$ |
| Lower bound | NO | - | YES | $\Omega(n^2 + n\,\Delta^+_{in} 2^{\Delta^+_{in}})$ |

Table 3: Summary of the assumptions, guarantees, and complexity of various causal discovery methods from observational data.

Additionally, in a DAG with a constant in-degree, $\Delta$ and $\alpha$ can grow linearly with the number of variables.
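As a rough back-of-the-envelope check, the gap between the MARVEL upper bound and the lower bound in Table 3 can be computed directly. The functions below keep only the leading terms of the O-expressions; the particular values n = 1000 and $\Delta_{in}$ = 10 are illustrative choices, not figures from the paper.

```python
def marvel_upper(n, d_in):
    """Leading terms of MARVEL's upper bound: n^2 + n * d_in^2 * 2^d_in."""
    return n**2 + n * d_in**2 * 2**d_in

def lower_bound(n, d_in):
    """Leading terms of the lower bound: n^2 + n * d_in * 2^d_in."""
    return n**2 + n * d_in * 2**d_in

# for a moderately dense graph, the two bounds differ by roughly d_in
ratio = marvel_upper(1000, 10) / lower_bound(1000, 10)
```

For these values the ratio is close to $\Delta_{in}$, matching the observation in Section 6.2 that MARVEL's worst-case performance differs from the lower bound only by a factor of $\Delta_{in}(G)$.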
Corollary 55 Under the assumption of causal sufficiency, RSLD is the fastest among our proposed methods. In scenarios lacking causal sufficiency, L-MARVEL is the fastest for sparse graphs, while ROLHC outperforms it on denser graphs.

7. Related Work

Most causal discovery methods can be broadly categorized into constraint-based and score-based approaches. While constraint-based methods recover the structure that is consistent with conditional independence constraints, score-based methods opt for the graph that maximizes a specific score function. In this section, we present an overview of relevant works in the field, organized into these two main categories. Under the score-based category, we give special consideration to the so-called ordering-based methods, which rely on recovering a specific order among the variables to guide the causal discovery task. While we strive to provide a representative review of key works in causal discovery, we note that the vast body of literature includes many more studies and approaches. For a more comprehensive compilation and deeper discussion, we refer to surveys on causal discovery such as Kitson et al. (2023), Hasan et al. (2023), Vowels et al. (2022), Glymour et al. (2019), and Mooij et al. (2016).

7.1 Constraint-Based

The PC algorithm (Spirtes et al., 2000), widely used for causal discovery, stands as a foundational approach for this task using observational data, laying the groundwork for most of the subsequent developments in constraint-based approaches. Notably, acknowledging the high computational cost of PC, Le et al. (2016) provided a parallel computing framework for it. Colombo et al. (2014) drew attention to the fact that under potentially erroneous conditional independence tests, the results obtained by the PC algorithm may depend on the order in which these tests are conducted.
Following this observation, they introduced PC-stable, which offers an order-independent approach to causal discovery and enhances PC's robustness with respect to uncertainty over the order of variables. Further, RPC (Harris and Drton, 2013) was introduced as a relaxation of the PC algorithm to handle instances where strict adherence to conditional independence tests may not be possible. PC and its derivatives work under causal sufficiency. Extending the scope of constraint-based methods, Spirtes et al. (1995) introduced the fast causal inference algorithm (FCI), which accommodates latent variables and selection bias. Although foundational, FCI faced challenges with incomplete edge orientation rules. Zhang (2008b) later augmented it with further orientation rules to ensure the completeness of its output. FCI also suffers from intractable computational complexity in moderate- to high-dimensional causal discovery tasks. Focusing on learning high-dimensional causal graphs, Colombo et al. (2012) introduced RFCI, which performs only a subset of the conditional independence tests that FCI requires. Faster computation came at the cost of completeness, in the sense that the recovered graph may contain extra edges in general. Closest to our approach are the works by Margaritis and Thrun (1999), Pellet and Elisseeff (2008a), and Pellet and Elisseeff (2008b), which put forward the idea of using Markov boundary information to guide causal discovery. The grow-shrink (GS) algorithm (Margaritis and Thrun, 1999) was originally devised as a method to infer Markov boundaries. The authors augmented GS with further steps to make it capable of recovering the causal structure. Pellet and Elisseeff (2008b) proposed the use of total conditioning (TC) to infer the Markov boundary information and recover the causal structure afterward. Pellet and Elisseeff (2008a) also generalized the same ideas to handle causal discovery in the presence of latent variables and selection bias.
Along a separate path, CPC (Ramsey et al., 2006) introduced the concept of adjacency-faithfulness, building a conservative framework for constraint-based causal discovery. KCL (Sun et al., 2007) presented a kernel-based causal discovery algorithm, extending the scope of applicability of causal discovery algorithms to non-linearly related structural models through the use of kernel methods. On the same note, the kernel-based conditional independence test (KCI-test) introduced by Zhang (2008a) further enriches the toolkit for assessing conditional independence through kernel methods. ION (Danks et al., 2008) focused on integrating locally learned causal structures over overlapping variable sets to refine causal relationships.

7.2 Score-Based

Score-based methods provide an alternative to constraint-based methods, emphasizing the optimization of a score function to identify the most likely causal graph. The foundation for this line of research was laid by early work in Bayesian statistics, as well as the development of graphical models (Pearl, 1988). A notable contribution was made by Heckerman et al. (1995), which put forward the idea of integrating prior beliefs and statistical data through a Bayesian approach to causal discovery. The authors reviewed certain heuristic algorithms to search for the graphical structure maximizing their scoring function. The greedy equivalence search (GES) algorithm, introduced by Chickering (2002), represents another significant advancement in the field. GES employs a step-wise greedy search strategy to iteratively refine the graph structure, aiming to maximize the Bayesian information criterion (Raftery, 1995; Geiger and Heckerman, 1994). GES has been influential in guiding subsequent advancements in score-based causal discovery. An extension was introduced by Bühlmann et al. (2014), which diverges from GES by decoupling the order search among variables from the edge selection in the DAG.
While the variable order search is carried out through non-regularized maximum likelihood estimation, sparse regression techniques are used for edge selection. The method developed by Bühlmann et al. (2014) is valid for additive models. Another extension is the fast greedy equivalence search (FGES) (Ramsey et al., 2017). FGES builds upon GES by introducing two modifications that increase the speed of the search, adapting it to high-dimensional causal models. Beyond greedy search approaches, there is also literature based on coordinate descent for optimizing the score function. Fu and Zhou (2013) utilize an ℓ1-regularized likelihood approach and block-wise coordinate descent to estimate the causal structure. Gu et al. (2019) model the conditional density of a variable given its parents by multi-logit regression, employing a group norm penalty to obtain a sparse graph. Aragam et al. (2015) reduce causal discovery to a series of neighborhood regressions under suitable assumptions. A broad range of recent research on score-based causal discovery has focused on methods based on continuous optimization. The most influential work in this direction was DAGs with NO TEARS (Zheng et al., 2018), which reformulated the combinatorial problem of causal search into a continuous optimization problem. Goudet et al. (2018) introduced CGNN, which uses neural networks to learn the functional mappings between variables and incorporates a hill-climbing search algorithm for optimization. To address the computational cost of CGNN, Kalainathan et al. (2022) presented SAM. Recently, several methods have been introduced to further tackle the time-consuming nature of these methods. GraN-DAG (Lachapelle et al., 2019) improved upon NO TEARS by leveraging neural networks to model nonlinear relationships while maintaining computational efficiency through gradient-based techniques.
SparseRC (Misiakos et al., 2024) also addressed the scalability issue by adopting a sparse regularization framework. DAG-NoCurl (Yu et al., 2021) presented another promising approach, reducing computational costs without explicitly enforcing acyclicity constraints. Bhattacharya et al. (2021) extended the approach to graphs with hidden variables. Other notable extensions include, but are not limited to, DAGMA (Bello et al., 2022), TOPO (Deng et al., 2023), NoFEARS (Wei et al., 2020), DAG-GNN (Yu et al., 2019), Graph Autoencoder (Ng et al., 2019), NOBEARS (Lee et al., 2019), and DYNOTEARS (Pamfil et al., 2020).

Bayesian Approaches. These methods represent another prominent class within score-based approaches, emphasizing the integration of prior knowledge with observed data to learn causal structures. The foundational work in this area was that of Heckerman et al. (1995), which proposed methods that merge expert knowledge and statistical data for learning Bayesian networks. Building on this work, Friedman and Koller (2003) introduced a principled Bayesian approach to structure discovery in Bayesian networks. More recent works have advanced Bayesian approaches in different ways. Viinikka et al. (2020) introduced a scalable Bayesian framework that leverages variational inference, making it computationally efficient and applicable to large-scale data sets. Zhang et al. (2024) employed a similar approach for active learning. Another significant contribution is from Lorch et al. (2021), which framed Bayesian structure learning as a flexible differentiable optimization problem. Deleu et al. (2022) proposed using generative flow networks to approximate the posterior distributions. Others have focused on making the posterior computations more tractable (Annadani et al., 2021; Hoang et al., 2024), as well as extending Bayesian approaches to latent confounder models (Ma et al., 2024) and active learning (Toth et al., 2022).
7.3 Ordering-Based

Despite the inherent difficulty of causal discovery (Chickering, 1996), finding the graphical structure that maximizes a scoring function becomes tractable when an ordering is postulated on the variables (Buntine, 1991; Cooper and Herskovits, 1992). Based on this observation, Teyssier and Koller (2005) proposed a search over the space of variable orderings, rather than the previously adopted search over the space of DAGs. After recovering the ordering among variables, Teyssier and Koller (2005) used an exhaustive search through all possible parent sets for each vertex. Improving on the latter, Schmidt et al. (2007) showed that this search can be well-approximated through ℓ1 regularization. Another extension of the work by Teyssier and Koller (2005) was introduced by Scanagatta et al. (2015), who proposed anytime algorithms to circumvent the high costs of searching in the space of potential parent sets. Raskutti and Uhler (2018) introduced the sparsest permutation (SP) algorithm, which relaxes the common faithfulness assumption to a weaker assumption called u-frugality. On the same note, Lam et al. (2022) developed a class of permutation-based algorithms, namely GRaSP, which operate under weaker assumptions than faithfulness. Solus et al. (2021) were the first to provide consistency guarantees for a greedy permutation-based search algorithm, namely GSP. Bernstein et al. (2020) extended the scope of permutation-based methods to causal structures with latent variables. By postulating an ordering on the variables, ordering-based methods reduce the search space of causal discovery. Recent approaches such as sortnregress (Reisach et al., 2021) and HOST (Duong and Nguyen, 2023) have been proposed to tackle complex data structures, particularly heterogeneous and heteroskedastic data.
7.4 Miscellaneous

While the focus of this paper is on methods for learning a causal graph from observational data, it is essential to acknowledge other directions in causal discovery research. For example, Yu et al. (2023) introduce a novel approach for causal discovery in zero-inflated data, leveraging directed graphical models to enhance gene regulatory network analysis. Another example is Zhao et al. (2024), which introduces a neighborhood selection method for learning the structure of Gaussian functional graphical models for high-dimensional functional data, applicable to EEG and fMRI data. Furthermore, there has been a growing interest in causal discovery for cyclic graphs (Richardson, 1996a,b; Mooij et al., 2011; Sethuraman et al., 2023), as these models have implications for causal relationships that involve feedback loops and dynamic dependencies. It is noteworthy that cyclic structures pose additional challenges compared to acyclic graphs. There has also been a growing interest in causal discovery for challenging data types. For instance, Huang et al. (2020) and Zhou et al. (2022) studied causal discovery from heterogeneous and non-stationary data. Günther et al. (2022) considered the problem of independence testing with heteroskedastic data. Several recent works investigated causal discovery under heteroskedastic noise models (Duong and Nguyen, 2023; Yin et al., 2024; Kikuchi, 2022; Cai et al., 2020). Another avenue of research has concentrated on causal discovery methods that leverage interventional data as well as observational data (Kocaoglu et al., 2019; Brouillard et al., 2020; Li et al., 2023). Provided access to interventional data, one can reduce the size of the equivalence class, resulting in a finer specification of the causal model.
Some works consider causal discovery in an active manner, where experiments are designed explicitly to learn causal graphs (Hyttinen et al., 2013; Hauser and Bühlmann, 2014; Hu et al., 2014; Ghassami et al., 2017; Addanki et al., 2020; Mokhtarian et al., 2023b). This active approach involves strategically choosing interventions to obtain the most informative data for learning the graphical structure. Other notable lines of research include causal discovery for temporal data (Entner and Hoyer, 2010; Assaad et al., 2022; Chu et al., 2014; Gong et al., 2023) and causal representation learning (Schölkopf et al., 2021; Wang and Jordan, 2021; Ahuja et al., 2023).

8. RCD: A Python Package for Recursive Causal Discovery

We have implemented the presented algorithms, along with other necessary and auxiliary utility functions, in our Python package, RCD. Our implementation is available on GitHub at github.com/ban-epfl/rcd, and detailed documentation of our package is available at rcdpackage.com. Some of the key aspects of our package are highlighted below.

- Simple installation: RCD is available on PyPI. Use the command pip install rcd to add it to your environment.
- Lightweight dependencies: RCD uses only four packages, NetworkX, NumPy, Pandas, and SciPy, all commonly used in causal discovery.
- Well-documented: We have written thorough documentation for each class and function in the Google Python documentation style, available on the website.
- Readable code: We used consistent naming schemes for functions and variables and added descriptive in-code comments for increased readability.
- Efficient implementation: We used suitable data structures and minimized loops and redundant CI tests, ensuring a lean and fast codebase.
8.1 Source Code Organization

The source code of the RCD package, available in our GitHub repository, is divided into four directories:

rcd: contains the implementation of the methods.
tests: contains unit tests, in the pytest framework, that check the correctness of the implementation of each method.
examples: contains working demonstrations of each method.
docs: contains the configuration for MkDocs, which generates our documentation site.

8.2 Method Implementation

Our proposed methods in the RCD package are implemented as Python modules: marvel, l_marvel, rsl_d, rsl_w, and rol_hc. RCD is designed with modularity at its core. Each module requires a CI testing function upon initialization and optionally accepts a Markov boundary matrix finding function. This design enables users to incorporate their own CI testing and Markov boundary-finding algorithms. If a Markov boundary-finding function is not given, our methods use a naive approach to find the initial Markov boundary matrix. Each module has a primary public function intended for the user, learn_and_get_skeleton. This function receives a Pandas DataFrame as input, which contains data samples for each variable, with each column representing a variable. The function then returns the learned skeleton as an undirected NetworkX graph. Below is a small snippet showing the module corresponding to the RSL-D method, named rsl_d, being used to learn the skeleton of a data set using the Fisher-Z test.

from rcd import rsl_d
from rcd.utilities.ci_tests import fisher_z

# run RSL-D on data_df with the Fisher-Z test
n = data_df.shape[1]  # number of variables in the data set
ci_test = lambda x, y, z, data: fisher_z(x, y, z, data, significance_level=2 / n ** 2)
learned_skeleton = rsl_d.learn_and_get_skeleton(ci_test, data_df)

Algorithm    Original Paper             Source Code
RSL-D        Mokhtarian et al. (2022)   rcd package
RSL-ω        Mokhtarian et al. (2022)   rcd package
L-MARVEL     Akbari et al. (2021)       rcd package
MARVEL       Mokhtarian et al. (2021)   rcd package
ROL-HC       Mokhtarian et al. (2023a)  rcd package
Other CD algorithms:
PC           Spirtes et al. (2000)      causal-learn package
FCI          Spirtes et al. (2000)      causal-learn package
CD-NOD       Huang et al. (2020)        causal-learn package
GRaSP        Lam et al. (2022)          causal-learn package
GES          Chickering (2002)          causal-learn package
fGES         Ramsey et al. (2016)       py-tetrad package
NoCURL       Yu et al. (2021)           paper's repository
GOLEM        Ng et al. (2020)           gCastle package
SparseRC     Misiakos et al. (2024)     paper's repository

Table 4: The causal discovery algorithms used in our experiments.

9. Experiments

In this section, we present a series of simulations to compare the RCD algorithms with other widely used causal discovery algorithms for learning the skeleton of a causal graph. We study the effect of varying the number of variables, graph density, and sample size in both linear and non-linear settings. We also include experiments on real-world networks.

9.1 Experimental Setup

We provide an overview of the experimental setup, including the algorithms compared, data sets used, experimental environment, evaluation metrics, and methods for Markov boundary estimation.

9.1.1 Algorithms

We compare our proposed algorithms in the rcd package, namely RSL-D, RSL-ω, L-MARVEL, MARVEL, and ROL-HC, with the following nine causal discovery algorithms: PC, FCI, CD-NOD, GRaSP, GES, fGES, SparseRC, GOLEM, and NoCURL. More information about these methods is provided in Table 4.

9.1.2 Data Sets

For the ground-truth DAGs, we consider both synthetic random graphs generated from Erdős–Rényi (ER) models and real-world networks available at https://www.bnlearn.com/bnrepository. For the synthetic data sets, we generate data under two settings: linear and non-linear.

Linear setting. We generate linear SEMs with Gaussian noise based on the generated DAGs.
Details of the data generation process are as follows. For each graph, we assign edge weights sampled uniformly from the intervals [−1.5, −1] and [1, 1.5], allowing for both positive and negative dependencies. Gaussian noise with a standard deviation randomly chosen from [0.7, 1.2] is added to each variable to introduce variability. The value of each variable is computed as a linear combination of its parent variables, weighted by the assigned edge weights, plus the corresponding noise term.

Non-linear setting. We initially generate samples as described in the linear setting. Then, we apply a non-linear transformation to the samples of each variable. Specifically, each variable is transformed by applying the cumulative distribution function (CDF) of its marginal distribution. This transformation guarantees that the marginal distribution of each variable is uniform on [0, 1].

9.1.3 Experiment Environment

We ran all algorithms using Python 3.10, except for fGES, which was run using JPype to connect to Tetrad running on the Amazon Corretto 22 JDK. All simulations were conducted on a PC equipped with two Intel Xeon E5-2680 v3 CPUs, 256 GB of RAM, and running Ubuntu 20.04.4 LTS. We set a time limit of 100 seconds for each algorithm, after which the execution was terminated.

9.1.4 Metrics

In all experiments, we measure the process time taken by each algorithm to learn and produce the skeleton from the given data, the average conditioning set size, the number of CI tests, and the F1 score with respect to the true skeleton. Note that the CI test-related metrics are only measured and reported for algorithms that utilize CI tests, namely all RCD algorithms, PC, FCI, and CD-NOD.

9.1.5 Markov Boundary Estimation

As discussed in Section 5.5, several algorithms exist for discovering Markov boundaries. In our experiments, we use the Total Conditioning (TC) method (Pellet and Elisseeff, 2008b) to compute the Markov boundary of each variable.
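The TC method declares a variable to be in the Markov boundary of another whenever the two are dependent given all remaining variables. For linear Gaussian data, this amounts to testing full-order partial correlations. A minimal sketch of this procedure follows; the function and variable names are ours for illustration, not the rcd package's API:

```python
import numpy as np
from scipy.stats import norm


def tc_markov_boundaries(data, alpha):
    """Total-Conditioning-style Markov boundary matrix for linear Gaussian data.

    data: (num_samples, n) array of observations.
    Returns a boolean (n, n) matrix M with M[i, j] = True iff X_j is estimated
    to belong to the Markov boundary of X_i. (Illustrative sketch only.)
    """
    num_samples, n = data.shape
    # Precision matrix: inverse of the sample covariance matrix.
    prec = np.linalg.inv(np.cov(data, rowvar=False))
    d = np.sqrt(np.diag(prec))
    # Partial correlation of (i, j) given all remaining variables:
    # rho_ij = -P_ij / sqrt(P_ii * P_jj)
    partial_corr = -prec / np.outer(d, d)
    np.fill_diagonal(partial_corr, 0.0)
    # Fisher's z-transform; the conditioning set here has size n - 2.
    z = np.arctanh(np.clip(partial_corr, -1 + 1e-12, 1 - 1e-12))
    stat = np.sqrt(max(num_samples - (n - 2) - 3, 1)) * np.abs(z)
    return stat > norm.ppf(1 - alpha / 2)
```

In the experiments, the significance level would be set to α = 2/n², as described below.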
Since we are dealing with linear Gaussian models, we can efficiently evaluate the required conditional independencies using the precision matrix (the inverse of the covariance matrix). Specifically, we compute the partial correlations between variables from the precision matrix and apply Fisher's z-transform to perform the conditional independence tests. This approach allows us to compute all the Markov boundaries efficiently. For the significance level of the conditional independence tests, we set α = 2/n² (Pellet and Elisseeff, 2008b).

9.2 Results

We now present the results of our experiments. For the linear SEMs, we consider four scenarios: varying the number of variables, varying the graph density, varying the sample size, and using real-world graphs. For the non-linear SEMs, we evaluate two scenarios: varying the number of variables and varying the sample size.

9.2.1 Linear SEMs: Varying the Number of Variables

We evaluate the performance of each algorithm as a function of the number of variables n. We investigate two regimes: sparse and dense.

Sparse regime: We generate ER graphs with edge probability p = 1/n, varying n from 10 to 500.
Dense regime: We generate ER graphs with edge probability p = log n/n, varying n from 10 to 150.

For each value of n, we generate 20 DAGs, and for each DAG, we generate 10 data sets, each containing 50n samples. We present the results for the sparse regime in Figure 6 and for the dense regime in Figure 7. In the sparse regime (Figure 6), our RCD algorithms are the fastest by far. The time plots (note the logarithmic y-axis) demonstrate that our methods are orders of magnitude faster than the other causal discovery algorithms. This efficiency is corroborated by the number of CI tests performed, which is significantly lower for our methods, confirming that the number of CI tests is a good proxy for the time complexity of constraint-based methods.
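The F1 score reported throughout compares the learned skeleton's edge set against the ground truth, treating each undirected edge as one prediction. A minimal sketch with NetworkX graphs (the function name is ours for illustration):

```python
import networkx as nx


def skeleton_f1(learned: nx.Graph, true: nx.Graph) -> float:
    """F1 score of a learned skeleton against the true skeleton.

    Each undirected edge counts as one prediction; returns the harmonic
    mean of edge precision and recall. (Illustrative sketch only.)
    """
    learned_edges = {frozenset(e) for e in learned.edges()}
    true_edges = {frozenset(e) for e in true.edges()}
    tp = len(learned_edges & true_edges)  # correctly recovered edges
    if tp == 0:
        return 0.0
    precision = tp / len(learned_edges)
    recall = tp / len(true_edges)
    return 2 * precision * recall / (precision + recall)
```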
In terms of accuracy, our methods achieve near-perfect F1 scores. While fGES and GRaSP also achieve high accuracy, they are much slower than our methods on sparse graphs. In the dense regime (Figure 7), RSL-D and fGES are the fastest algorithms. However, fGES suffers from lower accuracy, with F1 scores dropping below 0.7, whereas our methods maintain F1 scores around 0.9. GRaSP achieves the best accuracy in dense graphs but is only scalable up to 40 variables in this setting. In contrast, RSL-D can easily be applied to graphs with up to 150 variables, and potentially much larger graphs, demonstrating better scalability while maintaining high accuracy.

9.2.2 Linear SEMs: Varying the Graph Density

We run simulations on ER graphs with n = 50 variables and a sample size of 2500. The edge probability p varies from 0.02 to 0.2. For each value of p, we generate 10 DAGs, and for each DAG, we generate 10 data sets. We present the results in Figure 8. These results confirm our previous observations. For smaller values of the edge probability (i.e., sparser graphs), our RCD algorithms are faster than the other methods while achieving similar accuracy. As the graphs become denser, the runtime of our methods increases, becoming comparable to some of the other algorithms. In the densest cases, fGES becomes faster than our methods by an order of magnitude. However, this speed comes at the cost of accuracy: our methods maintain much higher F1 scores compared to fGES in dense graphs. This demonstrates that while fGES is faster on very dense graphs, our methods offer a better trade-off between speed and accuracy across different graph densities.

Figure 6: Time taken, average conditioning set size, number of CI tests, and F1 score of RCD algorithms and other CD algorithms on linear SEMs against the number of variables on the sparse ER data set.

Figure 7: Time taken, average conditioning set size, number of CI tests, and F1 score of RCD algorithms and other CD algorithms on linear SEMs against the number of variables on the dense ER data set.

Figure 8: Time taken, average conditioning set size, number of CI tests, and F1 score of RCD algorithms and other CD algorithms on linear SEMs against the ER edge probability.

9.2.3 Linear SEMs: Varying the Sample Size

We run simulations on ER graphs with n = 30 variables and an edge probability of p = 0.1. We generate data sets with sample sizes ranging from 250 to 2000. For each sample size, we generate 10 DAGs and 10 data sets for each DAG. We present the results in Figure 9. The figure shows that all algorithms are relatively insensitive to the number of samples. For sample sizes greater than 500 and up to 2000, the performance of all algorithms remains stable in terms of both accuracy and runtime.

9.2.4 Linear SEMs: Real-World Graphs

We conduct experiments on 16 real-world graphs from the Bayesian Network Repository (bnlearn.com/bnrepository), with the number of variables ranging from 8 to 724. For each graph, we generate a data set following the data generation procedure described earlier. We present the results for graphs with fewer than 50 variables in Figure 10 and for those with more than 50 variables in Figure 11. In almost all cases, our methods are the fastest, with RSL-D being the fastest in every instance.
Regarding accuracy, our methods generally outperform the others, with the difference being more pronounced on larger graphs (Figure 11). Notably, RSL-D not only has the best runtime but also achieves the highest accuracy across these 16 real-world graphs. This highlights RSL-D as the best-performing method in this setting. It is worth mentioning that some of these networks contain diamond structures, whereas RSL-D has theoretical guarantees only when the graph is diamond-free. Despite this, RSL-D performs exceptionally well, demonstrating robustness to violations of its structural assumptions. This robustness is discussed further in Subsection 6.2.

9.2.5 Non-linear SEMs: Varying the Number of Variables

We evaluate the performance of each algorithm as a function of the number of variables n in non-linear SEMs. In this setting, we consider the dense regime by generating ER graphs with edge probability p = log n/n, varying n from 10 to 100. For each value of n, we generate 20 DAGs, and for each DAG, we generate 10 data sets, each containing 50n samples. The non-linear SEMs are obtained by applying the CDF-based transformation to the generated data, ensuring that each variable follows a uniform distribution on [0, 1]. Figure 12 presents the results in terms of time taken, average conditioning set size, number of CI tests, and F1 score. Our proposed methods outperform the other algorithms in terms of both time complexity and accuracy. Notably, GRaSP no longer achieves the best accuracy; it is now on par with our approaches, while being much slower.

Figure 9: Time taken, average conditioning set size, number of CI tests, and F1 score of RCD algorithms and other CD algorithms on linear SEMs against the number of samples.
Figure 10: Time taken, average conditioning set size, number of CI tests, and F1 score of RCD algorithms and other CD algorithms on small BN repository graphs, including insurance (27), mildew (35), and barley (48).

Figure 11: Time taken, average conditioning set size, number of CI tests, and F1 score of RCD algorithms and other CD algorithms on large BN repository graphs, including hailfinder (56), hepar2 (70), win95pts (76), pathfinder (109), munin1 (186), and andes (223).

Figure 12: Time taken, average conditioning set size, number of CI tests, and F1 score of RCD algorithms and other CD algorithms on non-linear SEMs against the number of variables.

Figure 13: Time taken, average conditioning set size, number of CI tests, and F1 score of RCD algorithms and other CD algorithms on non-linear SEMs against the number of samples.

9.2.6 Non-linear SEMs: Varying the Sample Size

We run simulations on ER graphs with a fixed number of variables (n = 30) and edge probability p = 0.1, generating non-linear SEMs as described above. Data sets are generated with sample sizes ranging from 250 to 2000. For each sample size, we generate 10 DAGs and 10 data sets.
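The CDF-based transformation that produces these non-linear SEMs (Section 9.1.2) admits a simple reading: each marginal of a linear Gaussian SEM is Gaussian, so applying the Gaussian CDF with the marginal mean and standard deviation makes it approximately uniform on [0, 1]. A minimal sketch, assuming the marginal moments are estimated from the sample (the exact estimator may differ):

```python
import numpy as np
from scipy.stats import norm


def cdf_transform(data: np.ndarray) -> np.ndarray:
    """Map each column of a linear-Gaussian sample to a uniform marginal.

    data: (num_samples, n) array. Each column is standardized with its
    sample mean and standard deviation, then passed through the Gaussian
    CDF, yielding approximately uniform marginals on [0, 1].
    (Illustrative sketch only.)
    """
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    return norm.cdf((data - mean) / std)
```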
The performance metrics are reported in Figure 13. Similar to the linear case, all algorithms are relatively insensitive to the number of samples, and our approaches continue to be the fastest.

10. Conclusion, Limitations, and Future Work

In this work, we developed a comprehensive framework for recursive causal discovery, derived from our previous publications (R1–R4) and refined with additional details and enhancements. The main results discussed in this paper are summarized in Table 5. Our methodology revolves around identifying removable variables, learning their neighbors, discarding them, and then recursively learning the graph over the remaining variables. Through this iterative process, we significantly reduce the number of performed CI tests, enhancing both the computational efficiency and the accuracy of our methods. We further provided lower bounds on the worst-case complexity of constraint-based methods and showed that our proposed methods almost match these lower bounds. Finally, we introduced RCD, an open-source Python package hosted at github.com/ban-epfl/rcd, with documentation at rcdpackage.com.

Limitations. While our framework offers significant advancements in recursive causal discovery, it shares a common limitation with all constraint-based methods: it relies on CI tests, which may not consistently provide reliable results, especially in non-parametric models where such tests can be less robust. Additionally, our approaches leverage Markov boundaries in their recursive framework, allowing the application of any existing algorithm for computing Markov boundaries. Although we proposed methods that iteratively update the Markov boundaries at a low cost, the initial computation may pose a computational bottleneck in large-scale applications.

Future Work. In the following, we discuss potential directions for future work.
As mentioned earlier, although our current approaches can learn graphs with on the order of 10³ variables with conventional computational power, the initial computation of Markov boundaries poses a computational challenge for larger graphs. An influential direction for future work is the development of recursive causal discovery methods that do not depend on Markov boundary computations.

Result          Description                                        Source
Proposition 17  Only removables can get removed                    R1, R2
Theorem 18      Graphical characterization of removables in DAGs   R1
Theorem 19      Graphical characterization of removables in MAGs   R2
Proposition 20  Removables exist                                   R1, R2
Proposition 21  Removables have small Mb size                      R1, R2
Proposition 22  Removables are invariant in a MEC
Proposition 25  r-orders are invariant across a MEC                R4
Proposition 26  r-orders include c-orders                          R4
Lemma 28        Finding v-structures                               R1
Lemma 29        Testing condition 1 of removability in DAGs        R1
Lemma 30        Testing condition 2 of removability in DAGs        R1
Theorem 32      Testing removability in MAGs                       R2
Proposition 34  Avoiding duplicate CI tests in L-MARVEL
Theorem 36      Removability test in RSL-ω                         R3
Proposition 37  Finding neighbors in RSL-ω                         R3
Theorem 39      Removability test in RSL-D                         R3
Proposition 40  Finding neighbors in RSL-D                         R3
Theorem 41      Consistency result of ROL's objective              R4
Proposition 42  RSL-ω is verifiable                                R3
Theorem 45      Lower bound for DAGs                               R1
Theorem 46      Lower bound for MAGs                               R2
Proposition 47  Completeness and complexity of MARVEL              R1
Proposition 49  Completeness and complexity of L-MARVEL            R2
Proposition 50  Completeness and complexity of RSL-ω               R3
Proposition 51  Completeness and complexity of RSL-D               R3
Proposition 53  Complexity of ROL-HC
Proposition 54  Completeness and complexity of ROL-VI              R4

Table 5: Table of results.

In exploring real-world applications of our package, one domain of particular interest is network biology, specifically learning Gene Regulatory Networks (GRNs).
However, given the large number of genes in the human genome (approximately 20,000), applying our current methods directly might be challenging due to dimensionality issues. While our methods can be used to analyze the GRNs of other organisms with fewer genes, or adapted for local analyses within the human genome, future modifications tailoring our recursive causal discovery methods to high-dimensional biological data, such as GRNs, could unlock new opportunities for applying our package to complex biological systems.

Another direction for future work involves the parallelization of our proposed recursive causal discovery methods. By leveraging parallel computing techniques, such as performing removability checks in parallel, we can significantly reduce the computational time, making our algorithms more efficient and scalable.

Another research question is to analyze the sample complexity of our proposed recursive causal discovery methods. While our methods have empirically been shown to require fewer samples than other approaches in R1–R4, a theoretical analysis of their sample complexity remains an open question.

References

Raghavendra Addanki, Shiva Kasiviswanathan, Andrew McGregor, and Cameron Musco. Efficient intervention design for causal discovery with latents. In International Conference on Machine Learning, pages 63–73. PMLR, 2020.

Ishan Agrawal, Zhijing Jin, Ehsan Mokhtarian, Siyuan Guo, Yuen Chen, Mrinmaya Sachan, and Bernhard Schölkopf. CausalCite: A causal formulation of paper citations. In Findings of the Association for Computational Linguistics: ACL 2024, pages 8395–8410, 2024.

Kartik Ahuja, Divyat Mahajan, Yixin Wang, and Yoshua Bengio. Interventional causal representation learning. In International Conference on Machine Learning, pages 372–407. PMLR, 2023.

Sina Akbari, Ehsan Mokhtarian, Amir Emad Ghassami, and Negar Kiyavash. Recursive causal structure learning in the presence of latent variables and selection bias.
Advances in Neural Information Processing Systems, 34:10119–10130, 2021.

Steen A Andersson, David Madigan, and Michael D Perlman. A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics, 25(2):505–541, 1997.

Joshua D Angrist, Guido W Imbens, and Donald B Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455, 1996.

Yashas Annadani, Jonas Rothfuss, Alexandre Lacoste, Nino Scherrer, Anirudh Goyal, Yoshua Bengio, and Stefan Bauer. Variational causal networks: Approximate Bayesian inference over causal structures. arXiv preprint arXiv:2106.07635, 2021.

Bryon Aragam, Arash A Amini, and Qing Zhou. Learning directed acyclic graphs with penalized neighbourhood regression. arXiv preprint arXiv:1511.08963, 2015.

Charles K Assaad, Emilie Devijver, and Eric Gaussier. Survey and evaluation of causal discovery methods for time series. Journal of Artificial Intelligence Research, 73:767–819, 2022.

Kevin Bello, Bryon Aragam, and Pradeep Ravikumar. DAGMA: Learning DAGs via M-matrices and a log-determinant acyclicity characterization. Advances in Neural Information Processing Systems, 35:8226–8239, 2022.

Daniel Bernstein, Basil Saeed, Chandler Squires, and Caroline Uhler. Ordering-based causal structure learning in the presence of latent variables. In International Conference on Artificial Intelligence and Statistics, pages 4098–4108. PMLR, 2020.

Rohit Bhattacharya, Tushar Nagarajan, Daniel Malinsky, and Ilya Shpitser. Differentiable causal discovery under unmeasured confounding. In International Conference on Artificial Intelligence and Statistics, pages 2314–2322. PMLR, 2021.

Philippe Brouillard, Sébastien Lachapelle, Alexandre Lacoste, Simon Lacoste-Julien, and Alexandre Drouin. Differentiable causal discovery from interventional data. Advances in Neural Information Processing Systems, 33:21865–21877, 2020.
Peter Bühlmann, Jonas Peters, and Jan Ernest. CAM: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6):2526–2556, 2014.

Wray Buntine. Theory refinement on Bayesian networks. In Uncertainty Proceedings 1991, pages 52–60. Elsevier, 1991.

Ruichu Cai, Jincheng Ye, Jie Qiao, Huiyuan Fu, and Zhifeng Hao. FOM: Fourth-order moment based causal direction identification on the heteroscedastic data. Neural Networks, 124:193–201, 2020.

David Maxwell Chickering. Learning Bayesian networks is NP-complete. Learning from Data: Artificial Intelligence and Statistics V, pages 121–130, 1996.

David Maxwell Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554, 2002.

Max Chickering, David Heckerman, and Chris Meek. Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research, 5, 2004.

Victor W Chu, Raymond K Wong, Wei Liu, and Fang Chen. Causal structure discovery for spatio-temporal data. In International Conference on Database Systems for Advanced Applications, pages 236–250. Springer, 2014.

Diego Colombo, Marloes H Maathuis, Markus Kalisch, and Thomas S Richardson. Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, pages 294–321, 2012.

Diego Colombo, Marloes H Maathuis, et al. Order-independent constraint-based causal structure learning. Journal of Machine Learning Research, 15(1):3741–3782, 2014.

Gregory F Cooper and Edward Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347, 1992.

Zijun Cui, Naiyu Yin, Yuru Wang, and Qiang Ji. Empirical Bayesian approaches for robust constraint-based causal discovery under insufficient data. arXiv preprint arXiv:2206.08448, 2022.

David Danks, Clark Glymour, and Robert Tillman. Integrating locally learned causal structures with overlapping variables.
Advances in Neural Information Processing Systems, 21, 2008.

Tristan Deleu, António Góis, Chris Emezue, Mansi Rankawat, Simon Lacoste-Julien, Stefan Bauer, and Yoshua Bengio. Bayesian structure learning with generative flow networks. In Uncertainty in Artificial Intelligence, pages 518–528. PMLR, 2022.

Chang Deng, Kevin Bello, Bryon Aragam, and Pradeep Kumar Ravikumar. Optimizing NOTEARS objectives via topological swaps. In International Conference on Machine Learning, pages 7563–7595. PMLR, 2023.

Bao Duong and Thin Nguyen. Heteroscedastic causal structure learning. In ECAI 2023, pages 598–605. IOS Press, 2023.

Doris Entner and Patrik O Hoyer. On causal discovery from time series data using FCI. Probabilistic Graphical Models, pages 121–128, 2010.

Nir Friedman and Daphne Koller. Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50:95–125, 2003.

Nir Friedman, Michal Linial, Iftach Nachman, and Dana Pe'er. Using Bayesian networks to analyze expression data. In Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, pages 127–135, 2000.

Fei Fu and Qing Zhou. Learning sparse causal Gaussian networks with experimental intervention: regularization and coordinate descent. Journal of the American Statistical Association, 108(501):288–300, 2013.

Shunkai Fu and Michel C Desmarais. Markov blanket based feature selection: A review of past decade. In Proceedings of the World Congress on Engineering, volume 1, pages 321–328. Newswood Ltd, 2010.

Ming Gao, Wai Ming Tai, and Bryon Aragam. Optimal estimation of Gaussian DAG models. In International Conference on Artificial Intelligence and Statistics, pages 8738–8757. PMLR, 2022.

Dan Geiger and David Heckerman. Learning Gaussian networks. In Uncertainty Proceedings 1994, pages 235–243. Elsevier, 1994.

Amir Emad Ghassami, Saber Salehkaleybar, and Negar Kiyavash.
Optimal experiment design for causal discovery from fixed number of experiments. arXiv preprint arXiv:1702.08567, 2017.

Asish Ghoshal and Jean Honorio. Information-theoretic limits of Bayesian network structure learning. In Artificial Intelligence and Statistics, pages 767–775. PMLR, 2017.

Clark Glymour, Kun Zhang, and Peter Spirtes. Review of causal discovery methods based on graphical models. Frontiers in Genetics, 10:524, 2019.

Chang Gong, Di Yao, Chuzhe Zhang, Wenbin Li, and Jingping Bi. Causal discovery from temporal data: An overview and new perspectives. arXiv preprint arXiv:2303.10112, 2023.

Olivier Goudet, Diviyan Kalainathan, Philippe Caillou, Isabelle Guyon, David Lopez-Paz, and Michele Sebag. Learning functional causal models with generative neural networks. Explainable and Interpretable Models in Computer Vision and Machine Learning, pages 39–80, 2018.

Clive WJ Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, pages 424–438, 1969.

Jiaying Gu, Fei Fu, and Qing Zhou. Penalized estimation of directed acyclic graphs from discrete data. Statistics and Computing, 29:161–176, 2019.

Wiebke Günther, Urmi Ninad, Jonas Wahl, and Jakob Runge. Conditional independence testing with heteroskedastic data and applications to causal discovery. Advances in Neural Information Processing Systems, 35:16191–16202, 2022.

Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389–422, 2002.

Naftali Harris and Mathias Drton. PC algorithm for nonparanormal graphical models. Journal of Machine Learning Research, 14(11), 2013.

Uzma Hasan, Emam Hossain, and Md Osman Gani. A survey on causal discovery methods for temporal and non-temporal data. arXiv preprint arXiv:2303.15027, 2023.

Alain Hauser and Peter Bühlmann.
Two optimal strategies for active learning of causal models from interventional data. International Journal of Approximate Reasoning, 55(4):926–939, 2014.

David Heckerman, Dan Geiger, and David M Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197–243, 1995.

James J Heckman. Econometric causality. International Statistical Review, 76(1):1–27, 2008.

Nu Hoang, Bao Duong, and Thin Nguyen. Scalable variational causal discovery unconstrained by acyclicity. In ECAI 2024, pages 2685–2692. IOS Press, 2024.

Huining Hu, Zhentao Li, and Adrian R Vetta. Randomized experimental design for causal graph discovery. Advances in Neural Information Processing Systems, 27, 2014.

Biwei Huang, Kun Zhang, Jiji Zhang, Joseph Ramsey, Ruben Sanchez-Romero, Clark Glymour, and Bernhard Schölkopf. Causal discovery from heterogeneous/nonstationary data. Journal of Machine Learning Research, 21(89):1–53, 2020.

Antti Hyttinen, Frederick Eberhardt, and Patrik O Hoyer. Experiment selection for causal discovery. Journal of Machine Learning Research, 14:3041–3071, 2013.

Fateme Jamshidi, Luca Ganassali, and Negar Kiyavash. On the sample complexity of conditional independence testing with von Mises estimator with application to causal discovery. In Forty-first International Conference on Machine Learning, 2024.

Diviyan Kalainathan, Olivier Goudet, Isabelle Guyon, David Lopez-Paz, and Michèle Sebag. Structural agnostic modeling: Adversarial learning of causal graphs. The Journal of Machine Learning Research, 23(1):9831–9892, 2022.

Genta Kikuchi. Differentiable causal discovery under heteroscedastic noise. In International Conference on Neural Information Processing, pages 284–295. Springer, 2022.

Neville Kenneth Kitson, Anthony C Constantinou, Zhigao Guo, Yang Liu, and Kiattikun Chobtham. A survey of Bayesian network structure learning. Artificial Intelligence Review, 56(8):8721–8814, 2023.
Murat Kocaoglu, Amin Jaber, Karthikeyan Shanmugam, and Elias Bareinboim. Characterization and learning of causal graphs with latent variables from soft interventions. Advances in Neural Information Processing Systems, 32, 2019.

Janne Korhonen and Pekka Parviainen. Exact learning of bounded tree-width Bayesian networks. In Artificial Intelligence and Statistics, pages 370–378. PMLR, 2013.

Sébastien Lachapelle, Philippe Brouillard, Tristan Deleu, and Simon Lacoste-Julien. Gradient-based neural DAG learning. arXiv preprint arXiv:1906.02226, 2019.

Wai-Yin Lam, Bryan Andrews, and Joseph Ramsey. Greedy relaxations of the sparsest permutation algorithm. In Uncertainty in Artificial Intelligence, pages 1052–1062. PMLR, 2022.

Thuc Duy Le, Tao Hoang, Jiuyong Li, Lin Liu, Huawen Liu, and Shu Hu. A fast PC algorithm for high dimensional causal discovery with multi-core PCs. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 16(5):1483–1495, 2016.

Hao-Chih Lee, Matteo Danieletto, Riccardo Miotto, Sarah T Cherng, and Joel T Dudley. Scaling structural learning with NO-BEARS to infer causal transcriptome networks. In Pacific Symposium on Biocomputing 2020, pages 391–402. World Scientific, 2019.

Adam Li, Amin Jaber, and Elias Bareinboim. Causal discovery from observational and interventional data across multiple environments. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

Lars Lorch, Jonas Rothfuss, Bernhard Schölkopf, and Andreas Krause. DiBS: Differentiable Bayesian structure learning. Advances in Neural Information Processing Systems, 34:24111–24123, 2021.

Pingchuan Ma, Rui Ding, Qiang Fu, Jiaru Zhang, Shuai Wang, Shi Han, and Dongmei Zhang. Scalable differentiable causal discovery in the presence of latent confounders with skeleton posterior. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2141–2152, 2024.

Dimitris Margaritis and Sebastian Thrun.
Bayesian network induction via local neighborhoods. Advances in Neural Information Processing Systems, 12:505–511, 1999.
Christopher Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 403–410, 1995.
Panagiotis Misiakos, Chris Wendler, and Markus Püschel. Learning DAGs from data with few root causes. Advances in Neural Information Processing Systems, 36, 2024.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
Ehsan Mokhtarian, Sina Akbari, Amir Emad Ghassami, and Negar Kiyavash. A recursive Markov boundary-based approach to causal structure learning. In The KDD '21 Workshop on Causal Discovery, pages 26–54. PMLR, 2021.
Ehsan Mokhtarian, Sina Akbari, Fateme Jamshidi, Jalal Etesami, and Negar Kiyavash. Learning Bayesian networks in the presence of structural side information. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 7814–7822, 2022.
Ehsan Mokhtarian, Mohammadsadegh Khorasani, Jalal Etesami, and Negar Kiyavash. Novel ordering-based approaches for causal structure learning in the presence of unobserved variables. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 12260–12268, 2023a.
Ehsan Mokhtarian, Saber Salehkaleybar, Amir Emad Ghassami, and Negar Kiyavash. A unified experiment design approach for cyclic and acyclic causal models. Journal of Machine Learning Research, 24(354):1–31, 2023b.
Ehsan Mokhtarian, Negar Kiyavash, and Federico Cané. Sector and style factor rotations in equity markets: Detection and towards causation. Available at SSRN 4966272, 2024.
Joris M Mooij, Dominik Janzing, Tom Heskes, and Bernhard Schölkopf. On causal discovery with cyclic additive noise models.
Advances in Neural Information Processing Systems, 24, 2011.
Joris M Mooij, Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf. Distinguishing cause from effect using observational data: methods and benchmarks. The Journal of Machine Learning Research, 17(1):1103–1204, 2016.
Stephen L Morgan and Christopher Winship. Counterfactuals and Causal Inference: Methods and Principles for Social Research. Cambridge University Press, 2015.
Ignavier Ng, Shengyu Zhu, Zhitang Chen, and Zhuangyan Fang. A graph autoencoder approach to causal structure learning. arXiv preprint arXiv:1911.07420, 2019.
Ignavier Ng, Amir Emad Ghassami, and Kun Zhang. On the role of sparsity and DAG constraints for learning linear DAGs. Advances in Neural Information Processing Systems, 33:17943–17954, 2020.
Siqi Nie, Denis D Mauá, Cassio P De Campos, and Qiang Ji. Advances in learning Bayesian networks of bounded treewidth. Advances in Neural Information Processing Systems, 27:2285–2293, 2014.
Roxana Pamfil, Nisara Sriwattanaworachai, Shaan Desai, Philip Pilgerstorfer, Konstantinos Georgatzis, Paul Beaumont, and Bryon Aragam. DYNOTEARS: Structure learning from time-series data. In International Conference on Artificial Intelligence and Statistics, pages 1595–1605. PMLR, 2020.
Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988.
Judea Pearl. Causality: Models, reasoning and inference. Cambridge, UK: Cambridge University Press, 19(2):3, 2000.
Judea Pearl. Causality. Cambridge University Press, 2009.
Jean-Philippe Pellet and André Elisseeff. Finding latent causes in causal networks: an efficient approach based on Markov blankets. Neural Information Processing Systems Foundation, 2008a.
Jean-Philippe Pellet and André Elisseeff. Using Markov blankets for causal structure learning. Journal of Machine Learning Research, 9(Jul):1295–1342, 2008b.
Adrian E Raftery.
Bayesian model selection in social research. Sociological Methodology, pages 111–163, 1995.
Vaidyanathan Peruvemba Ramaswamy and Stefan Szeider. Turbocharging treewidth-bounded Bayesian network structure learning. In Proceedings of AAAI-21, the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021.
Joseph Ramsey, Peter Spirtes, and Jiji Zhang. Adjacency-faithfulness and conservative causal inference. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pages 401–408, 2006.
Joseph Ramsey, Madelyn Glymour, Ruben Sanchez-Romero, and Clark Glymour. A million variables and more: the fast greedy equivalence search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images. International Journal of Data Science and Analytics, 3:121–129, 2017.
Garvesh Raskutti and Caroline Uhler. Learning directed acyclic graph models based on sparsest permutations. Stat, 7(1):e183, 2018.
Alexander Reisach, Christof Seiler, and Sebastian Weichwald. Beware of the simulated DAG! Causal discovery benchmarks may be easy to game. Advances in Neural Information Processing Systems, 34:27772–27784, 2021.
Thomas Richardson, Peter Spirtes, et al. Ancestral graph Markov models. The Annals of Statistics, 30(4):962–1030, 2002.
Thomas S Richardson. Discovering cyclic causal structure. Carnegie Mellon [Department of Philosophy], 1996a.
Thomas S Richardson. A discovery algorithm for directed cyclic graphs. In Conference on Uncertainty in Artificial Intelligence, pages 454–461, 1996b.
Mauro Scanagatta, Cassio P de Campos, Giorgio Corani, and Marco Zaffalon. Learning Bayesian networks with thousands of variables. Advances in Neural Information Processing Systems, 28, 2015.
Eric E Schadt, Stephanie A Monks, Thomas A Drake, Aldons J Lusis, Nelson Che, Victor Colinayo, ..., and Stephen H Friend. Genetics of gene expression surveyed in maize, mouse and man. Nature, 422(6929):297–302, 2003.
Mark Schmidt, Alexandru Niculescu-Mizil, Kevin Murphy, et al. Learning graphical model structure using l1-regularization paths. In AAAI, volume 7, pages 1278–1283, 2007.
Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021.
Muralikrishnna G Sethuraman, Romain Lopez, Rahul Mohan, Faramarz Fekri, Tommaso Biancalani, and Jan-Christian Hütter. NODAGS-Flow: Nonlinear cyclic causal structure learning. In International Conference on Artificial Intelligence and Statistics, pages 6371–6387. PMLR, 2023.
Liam Solus, Yuhao Wang, and Caroline Uhler. Consistency guarantees for greedy permutation-based causal inference algorithms. Biometrika, 108(4):795–814, 2021.
Peter Spirtes and Thomas Richardson. A polynomial time algorithm for determining DAG equivalence in the presence of latent variables and selection bias. In Proceedings of the 6th International Workshop on Artificial Intelligence and Statistics, pages 489–500. Citeseer, 1996.
Peter Spirtes, Christopher Meek, and Thomas Richardson. Causal inference in the presence of latent variables and selection bias. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 499–506, 1995.
Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and search. MIT Press, 2000.
Xiaohai Sun, Dominik Janzing, Bernhard Schölkopf, and Kenji Fukumizu. A kernel-based causal learning algorithm.
In Proceedings of the 24th International Conference on Machine Learning, pages 855–862, 2007.
Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999.
Marc Teyssier and Daphne Koller. Ordering-based search: a simple and effective algorithm for learning Bayesian networks. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 584–590, 2005.
Christian Toth, Lars Lorch, Christian Knoll, Andreas Krause, Franz Pernkopf, Robert Peharz, and Julius von Kügelgen. Active Bayesian causal inference. Advances in Neural Information Processing Systems, 35:16261–16275, 2022.
Ioannis Tsamardinos and Constantin F Aliferis. Towards principled feature selection: Relevancy, filters and wrappers. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003.
Ioannis Tsamardinos, Constantin F Aliferis, Alexander R Statnikov, and Er Statnikov. Algorithms for large scale Markov blanket discovery. In FLAIRS Conference, volume 2, pages 376–380, 2003.
Thomas Verma and Judea Pearl. Equivalence and synthesis of causal models. UCLA, Computer Science Department, 1991.
Jussi Viinikka, Antti Hyttinen, Johan Pensar, and Mikko Koivisto. Towards scalable Bayesian learning of causal DAGs. Advances in Neural Information Processing Systems, 33:6584–6594, 2020.
Matthew J Vowels, Necati Cihan Camgoz, and Richard Bowden. D'ya like DAGs? A survey on structure learning and causal discovery. ACM Computing Surveys, 55(4):1–36, 2022.
Yixin Wang and Michael I Jordan. Desiderata for representation learning: A causal perspective. arXiv preprint arXiv:2109.03795, 2021.
Dennis Wei, Tian Gao, and Yue Yu. DAGs with no fears: A closer look at continuous optimization for learning Bayesian networks. Advances in Neural Information Processing Systems, 33:3895–3906, 2020.
Ronald J Williams.
Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.
Anpeng Wu, Kun Kuang, Minqin Zhu, Yingrong Wang, Yujia Zheng, Kairong Han, Baohong Li, Guangyi Chen, Fei Wu, and Kun Zhang. Causality for large language models. arXiv preprint arXiv:2410.15319, 2024.
Sandeep Yaramakala and Dimitris Margaritis. Speculative Markov blanket discovery for optimal feature selection. In Fifth IEEE International Conference on Data Mining (ICDM'05), 4 pp. IEEE, 2005.
Naiyu Yin, Tian Gao, Yue Yu, and Qiang Ji. Effective causal discovery under identifiable heteroscedastic noise model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 16486–16494, 2024.
Kui Yu, Lin Liu, Jiuyong Li, and Huanhuan Chen. Mining Markov blankets without causal sufficiency. IEEE Transactions on Neural Networks and Learning Systems, 29(12):6333–6347, 2018.
Shiqing Yu, Mathias Drton, and Ali Shojaie. Directed graphical models and causal discovery for zero-inflated data. In Conference on Causal Learning and Reasoning, pages 27–67. PMLR, 2023.
Yue Yu, Jie Chen, Tian Gao, and Mo Yu. DAG-GNN: DAG structure learning with graph neural networks. In International Conference on Machine Learning, pages 7154–7163. PMLR, 2019.
Yue Yu, Tian Gao, Naiyu Yin, and Qiang Ji. DAGs with no curl: An efficient DAG structure learning approach. In International Conference on Machine Learning, pages 12156–12166. PMLR, 2021.
Jiji Zhang. Causal reasoning with ancestral graphs. Journal of Machine Learning Research, 9:1437–1474, 2008a.
Jiji Zhang. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172(16-17):1873–1896, 2008b.
Zeyu Zhang, Chaozhuo Li, Xu Chen, and Xing Xie. Bayesian active causal discovery with multi-fidelity experiments. Advances in Neural Information Processing Systems, 36, 2024.
Boxin Zhao, Percy S Zhai, Y Samuel Wang, and Mladen Kolar. High-dimensional functional graphical model structure learning via neighborhood selection approach. Electronic Journal of Statistics, 18(1):1042–1129, 2024.
Xun Zheng, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. DAGs with NO TEARS: Continuous optimization for structure learning. Advances in Neural Information Processing Systems, 31, 2018.
Fangting Zhou, Kejun He, and Yang Ni. Causal discovery with heterogeneous observational data. In Uncertainty in Artificial Intelligence, pages 2383–2393. PMLR, 2022.