Causal Discovery with Fewer Conditional Independence Tests

Kirankumar Shiragur*1, Jiaqi Zhang*1,2, Caroline Uhler2

*Equal contribution; alphabetic order. 1Eric and Wendy Schmidt Center, Broad Institute. 2Laboratory for Information & Decision Systems, Massachusetts Institute of Technology. Correspondence to: Kirankumar Shiragur, Jiaqi Zhang. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Many questions in science center around the fundamental problem of understanding causal relationships. However, most constraint-based causal discovery algorithms, including the well-celebrated PC algorithm, often incur an exponential number of conditional independence (CI) tests, posing limitations in various applications. Addressing this, our work focuses on characterizing what can be learned about the underlying causal graph with a reduced number of CI tests. We show that it is possible to learn a coarser representation of the hidden causal graph with a polynomial number of tests. This coarser representation, named the Causally Consistent Partition Graph (CCPG), comprises a partition of the vertices and a directed graph defined over its components. CCPG satisfies consistency of orientations and additional constraints which favor finer partitions. Furthermore, it reduces to the underlying causal graph when the causal graph is identifiable. As a consequence, our results offer the first efficient algorithm for recovering the true causal graph with a polynomial number of tests, in special cases where the causal graph is fully identifiable through observational data and potentially additional interventions.

1. Introduction

Causal discovery is a fundamental task in various scientific disciplines including biology, economics, and sociology (King et al., 2004; Cho et al., 2016; Tian, 2016; Sverchkov & Craven, 2017; Rotmensch et al., 2017; Pingault et al., 2018; de Campos et al., 2019; Reichenbach, 1956; Woodward, 2005; Eberhardt & Scheines, 2007; Hoover, 1990; Friedman et al., 2000; Robins et al., 2000; Spirtes et al., 2000; Pearl, 2003). Directed acyclic graphs (DAGs) stand out as a popular choice for representing causal relations, with edge directions signifying the flow of information between variables. The core objective of causal discovery is to identify both the edges and their orientations based on available data. While certain structures can be recovered from observational data (Verma & Pearl, 1990), orienting the full graph often requires additional experiments or interventions.

Research on causal structure learning from observational data dates back to the 1990s (Verma & Pearl, 1990; Spirtes et al., 1989). As a pioneering work in this direction, the PC algorithm (Spirtes et al., 2000), named after the authors Peter Spirtes and Clark Glymour, still remains one of the most popular and widely used algorithms. It recovers the structure from observational data through conditional independence (CI) tests, with the number of tests being exponential in the degree of the graph. Following this, many causal discovery algorithms emerged (Kalisch & Bühlman, 2007; Brenner & Sontag, 2013; Alonso-Barba et al., 2013; Schulte et al., 2010), accommodating diverse and more general settings, including the presence of latent variables (Spirtes et al., 1999; 2013) and interventional data (Eberhardt et al., 2005; 2006).
However, a common challenge shared by these algorithms is their reliance, to different extents, on a number of CI tests that is exponential in certain graph parameters. This inherent dependence on an exponential number of tests poses practical challenges, making them unsuitable for many real-world scenarios. Moreover, it suggests that achieving exact causal structure learning can be highly challenging. As performing an exponential number of tests is infeasible in many applications, we are motivated to study the following question:

What useful information about the underlying causal graph can be inferred with fewer conditional independence tests?

Aligned with this motivation, our work also explores the role of interventions in the structure learning process. We study these questions under the standard Markov, faithfulness and causal sufficiency assumptions (Lauritzen, 1996; Spirtes et al., 2000). The primary contribution of our work is an efficient algorithm that uses a polynomial number of CI tests and recovers a representation of the underlying causal graph from observational and, optionally, interventional datasets. This representation consists of a partition of the vertices and a DAG defined over its components which is consistent with the underlying causal graph. In addition, our representation is designed to avoid dummy partitions that group all the vertices into a single component. The definition of our representation ensures that the components in the partition satisfy several additional properties, guaranteeing that each component either contains a single vertex or comprises an edge that could only be oriented after an intervention is performed on one of its endpoints. We refer to this representation as the Causally Consistent Partition Graph (CCPG) representation.

An important implication of our results is that if the underlying causal graph is fully identifiable using only observational data, our algorithm yields a CCPG whose partition consists entirely of components of size one. The size-one property of each component means that our algorithm recovers the true causal graph using only a polynomial number of conditional independence tests. We extend this result to the presence of interventions and provide an algorithm that recovers the true causal graph when the given set of interventions is sufficient to identify the underlying causal graph. To the best of our knowledge, our algorithms are the first that guarantee recovery of the true causal graph using a polynomial number of tests when the graph is either entirely identifiable from observational data or identifiable with an additional set of interventions.

1.1. Related Works

Efficient algorithms for causal structure learning (Spirtes et al., 2000; Claassen et al., 2013) exist for graphs of constant bounded degree, recovering the causal graph with a polynomial number of CI tests. For general causal graphs, current methods often entail an exponential number of CI tests; Xie & Geng (2008) and Zhang et al. (2024) aimed to reduce this complexity. For Bayesian network learning, finding a minimal Bayesian network is NP-hard, even with a constant-time CI oracle and nodes with at most k ≥ 3 parents. Chickering et al. (2004) demonstrated this hardness through a polynomial reduction from the NP-complete problem Degree-Bounded Feedback Arc Set.
These findings highlight the contrast between causal structure learning and minimal Bayesian network learning, suggesting that causal structure learning is notably more tractable. Our results further reinforce this notion by identifying a special class of causal graphs that can be recovered with a polynomial number of conditional independence tests. For other hardness results on Bayesian network learning, we refer readers to Bouckaert (1994); Chickering et al. (2004) and references therein.

Learning causal relationships from observational data (Verma & Pearl, 1990; Spirtes et al., 1989; 2000; Chickering, 2002; Geiger & Heckerman, 2002; Nandy et al., 2018) and interventional data (Eberhardt, 2010; Hu et al., 2014; Shanmugam et al., 2015; Greenewald et al., 2019; Squires et al., 2020; Choo et al., 2022; Choo & Shiragur, 2023; Shiragur et al., 2024) is a well-studied problem with a rich literature. We encourage interested readers to explore Glymour et al. (2019); Squires & Uhler (2022) and references therein for a more comprehensive understanding and further details.

1.2. Organization

The rest of our paper is organized as follows. Section 2 contains preliminaries. In Section 3, we provide all the main results of the paper. In Sections 4 and 5 combined, we provide our CCPG recovery algorithm when only observational data is available. In Section 6, we extend our results to the case of interventions. We provide numerical results in Section 7. Finally, in Section 8, we conclude with a short discussion and a few open directions.

2. Preliminaries

2.1. Graph Definitions

Let G be a directed acyclic graph (DAG) on n vertices in V. For a vertex v ∈ V, let Pa(v), Anc(v), and Des(v) denote the parents, ancestors, and descendants of v respectively. Let Anc[v] = Anc(v) ∪ {v} and Des[v] = Des(v) ∪ {v}. For a set of vertices S ⊆ V, denote Pa(S) = ∪_{v∈S} Pa(v). Similarly define Anc(S), Des(S), Anc[S], and Des[S]. We write src(S) for the set of source nodes within S, that is, src(S) = {v ∈ S | Anc(v) ∩ S = ∅}. Denote S̄ = V \ S. Let G[S] be the subgraph of G obtained by removing all vertices in S̄. A v-structure refers to three distinct vertices u, v, w such that u → v ← w and u, w are not adjacent. An edge u → v is a covered edge (Chickering, 1995) if Pa[u] = Pa(v), where Pa[u] = Pa(u) ∪ {u}. A path in G is a list of distinct vertices where consecutive vertices are adjacent. We can associate a topological ordering π : V → [n] to any DAG G such that every edge u → v in G satisfies π(u) < π(v). Note that such a topological ordering is not necessarily unique.

2.2. D-Separation and Conditional Independence

DAGs are commonly used in causality (Pearl, 2009), where vertices represent random variables and their joint distribution P factorizes according to the DAG: P(v1, . . . , vn) = ∏_{i=1}^{n} P(vi | Pa(vi)). This factorization entails a set of conditional independencies (CIs) in the observational distribution P. These CI relations are fully characterized by d-separation (Geiger & Pearl, 1990). Formally, for disjoint vertex sets A, B, C ⊆ V, the sets A, B are d-separated by C if and only if every path connecting A and B in G is inactive given C. A path is inactive given C when it has a collider¹ d ∉ Anc[C] or a non-collider c ∈ C; otherwise the path is active given C. Figure 1 illustrates these concepts.

Figure 1. (Left). {1} and {4} are d-separated by {2}, as all paths are inactive given {2}. (Right). {1} and {4} are not d-separated by {2, 3}, as the path 1 → 3 ← 4 is active given {2, 3} via the collider 3.
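Since all CI relations of a faithful distribution are given by d-separation, it is useful to be able to check d-separation programmatically, e.g., to simulate a CI oracle on a known DAG. The following Python snippet is a minimal sketch (our own code and naming, not the authors') of the classical moralization criterion of Lauritzen (1996): A and B are d-separated by C iff they are disconnected after restricting G to the ancestral closure of A ∪ B ∪ C, moralizing, and deleting C.

```python
import networkx as nx

def d_separated(G: nx.DiGraph, A: set, B: set, C: set) -> bool:
    """Check whether A and B are d-separated by C in the DAG G,
    via the moralization criterion (a sketch for simulations)."""
    # Restrict to the ancestral closure of A ∪ B ∪ C.
    keep = set(A) | set(B) | set(C)
    for v in list(keep):
        keep |= nx.ancestors(G, v)
    H = G.subgraph(keep)
    # Moralize: "marry" all parents of a common child, then drop directions.
    M = nx.Graph(H.to_undirected())
    for child in H:
        parents = list(H.predecessors(child))
        M.add_edges_from(
            (p, q) for i, p in enumerate(parents) for q in parents[i + 1:]
        )
    # Delete the conditioning set; d-separation = no remaining path A to B.
    M.remove_nodes_from(C)
    return not any(nx.has_path(M, a, b) for a in A for b in B)

# Toy example (not the graph of Figure 1): in the chain 1 -> 2 -> 3,
# conditioning on {2} blocks the path; in the v-structure 1 -> 3 <- 2,
# conditioning on the collider 3 activates it.
chain = nx.DiGraph([(1, 2), (2, 3)])
assert d_separated(chain, {1}, {3}, {2})
vstruct = nx.DiGraph([(1, 3), (2, 3)])
assert d_separated(vstruct, {1}, {2}, set())
assert not d_separated(vstruct, {1}, {2}, {3})
```

Under faithfulness, such a routine run on a ground-truth DAG serves as an exact CI oracle; on data one would instead use a statistical test (e.g., a partial-correlation test in the linear Gaussian setting).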
We write A ⊥ B | C² when A, B are conditionally independent given C in the observational distribution P. If any set among A, B, C contains only one node, e.g., A = {a}, we write a ⊥ B | C for simplicity. When C d-separates A, B, it holds that A ⊥ B | C (known as the global Markov property (Geiger & Pearl, 1990)). Under the faithfulness assumption, the reverse also holds, i.e., all CI relations in P are implied by d-separations in G.

Setup. In this work, we assume that the causal DAG G is unknown. But we assume causal sufficiency (i.e., no latent confounders), faithfulness, and access to enough samples from P to determine whether A ⊥ B | C for any A, B, C ⊆ V. As all CIs are implied by d-separations, we may infer information about G using these tests.

2.3. Interventions

An intervention I ⊆ V is an experiment where the conditional distributions P(v | Pa(v)) for v ∈ I are changed into P_I(v).³ Such interventions eliminate the dependency between v and Pa(v). Let G_I denote the modified version of G where all incoming edges to each v ∈ I are removed. Let P_I denote the interventional distribution, i.e., P_I(v1, . . . , vn) = ∏_{v∈I} P_I(v) · ∏_{v∉I} P(v | Pa(v)). Then P_I factorizes with respect to G_I. We write A ⊥_I B | C for CI tests in the interventional distribution P_I.

Setup with Interventions. Similar to the observational setting, we assume faithfulness of P_I to G_I and access to enough samples from P_I to determine whether A ⊥_I B | C.

2.4. Verifying Intervention Sets and Covered Edges

When it is possible to perform any number of CI tests: with observational data, a DAG G is in general only identifiable up to its skeleton, v-structures (Andersson et al., 1997), and possibly additional edges given by the Meek rules (Meek, 1995). Identifiability can be improved with interventional data (Hauser & Bühlmann, 2012), where an intervention I allows us to infer the orientation of any edge cut by I and V \ I. A verifying intervention set 𝓘 for a DAG G (Choo et al., 2022) is a set of interventions that fully orients G, possibly with repeated applications of the Meek rules. We will make use of the following result in our work.

Proposition 2.1 (Theorem 9 in (Choo et al., 2022)). A set 𝓘 is a verifying intervention set if and only if for every covered edge u → v in G, there is |I ∩ {u, v}| = 1 for some I ∈ 𝓘.

The verification number ν(G) is defined as the minimum size of any verifying intervention set of G. This proposition tells us that ν(G) equals the minimum size of any vertex cover of the covered edges in G.

¹ Vertex d is a collider on a path iff → d ← appears on the path.
² For simplicity, we also write A ⊥ B | C for potentially overlapping sets to denote A ⊥ B | C \ (A ∪ B).
³ We consider hard interventions in this work.
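To make Proposition 2.1 concrete: when G is known, its covered edges (and hence ν(G), via a minimum vertex cover of them) are straightforward to enumerate. Below is a small sketch under the same networkx setup as the earlier snippet; the helper name is our own.

```python
import networkx as nx

def covered_edges(G: nx.DiGraph) -> list:
    """Edges u -> v of the DAG G that are covered in the sense of
    Chickering (1995): Pa[u] = Pa(v), i.e. Pa(v) = Pa(u) ∪ {u}."""
    return [
        (u, v)
        for u, v in G.edges
        if set(G.predecessors(u)) | {u} == set(G.predecessors(v))
    ]

# Example: in the chain 1 -> 2 -> 3, only 1 -> 2 is covered:
# Pa[1] = {1} = Pa(2), whereas Pa[2] = {1, 2} != Pa(3) = {2}.
print(covered_edges(nx.DiGraph([(1, 2), (2, 3)])))  # [(1, 2)]
```

By Proposition 2.1, ν(G) is then the size of a minimum vertex cover of the graph formed by these covered edges; e.g., the chain above has ν(G) = 1, consistent with the fact that it cannot be fully oriented from observational data alone.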
3. Main Results

Here we present our main findings. As highlighted in the introduction, the key contribution of our work lies in recovering a representation of the underlying causal graph that satisfies various desirable properties with very few CI tests. We now provide a formal definition of this representation.

Definition 3.1 (CCPG & I-CCPG). A Causally Consistent Partition Graph (CCPG) representation of a DAG G on V consists of a partition of V into components V1, . . . , Vk and a DAG D between the components such that:

(intra-component property): for each i ∈ [k], it holds that |src(Vi)| = 1. Furthermore, if |Vi| > 1, then G[Vi] has at least one covered edge.

(inter-component property): D is topologically ordered, i.e., i → j in D only if i < j. It is also consistent with G: (1) if there is no directed edge i → j in D, then there are no edges between Vi and Vj in G; (2) if there is a directed edge i → j in D, then there is u ∈ Vi such that u ∈ Pa(src(Vj)).

We further define an Interventional Causally Consistent Partition Graph (I-CCPG) representation of G with respect to an intervention set 𝓘: an I-CCPG is a CCPG representation of G that additionally satisfies the following strong intra-component condition: for each i ∈ [k], if |Vi| > 1 then G[Vi] has at least one unintervened⁴ covered edge.

⁴ An edge is intervened by 𝓘 if exactly one of its endpoints is in some I ∈ 𝓘.

Figure 2 illustrates these concepts. Note that when 𝓘 = ∅, I-CCPG reduces to CCPG. In Definition 3.1, the first property favors finer partitions, while the second property ensures consistency. Formally, we can show the following properties of these representations, which establish the significance of CCPGs. Proofs for all lemmas in this section can be found in Appendix A.

Figure 2. Example of CCPG & I-CCPG. (Left). Ground-truth G. (Right). A CCPG representation of G, where V1, V2, V3 are indicated by green boxes and D is illustrated in chalk strokes. Vertices 3, 4 can be in one component as 3 → 4 is a covered edge. For 𝓘 = {4}, the only I-CCPG is G itself (due to the strong intra-component condition in Definition 3.1).

Lemma 3.2 (Properties of CCPG). For any intervention set 𝓘 (including ∅), the following statements hold:
- D = G (i.e., partitioning V into individual vertices) is a valid I-CCPG of G.
- If the verification number of G is zero, i.e., ν(G) = 0, then D = G is the unique valid I-CCPG of G.
- If 𝓘 is a verifying intervention set of G, then D = G is the unique valid I-CCPG of G.

The key algorithmic contribution of our work, proven in Section 5, lies in an efficient algorithm that learns a valid CCPG with only a polynomial number of CI tests.

Theorem 3.3 (Learning CCPG). Given observational data, there exists an efficient algorithm that performs at most O(n^5) CI tests and outputs a CCPG representation.

This result extends to the interventional setting as follows; the proof is given in Section 6.

Theorem 3.4 (Learning I-CCPG). Given observational data and interventional data from interventions in 𝓘, there exists an efficient algorithm that performs at most O(n^5) + |𝓘| · O(n^3) CI tests and outputs an I-CCPG representation.

Combining these results with the properties of CCPGs in Lemma 3.2 provides an efficient algorithm for learning the causal graph with a polynomial number of conditional independence tests in certain cases, detailed below.

Corollary 3.5 (Causal Discovery with Polynomial CI Tests). For a DAG G and a verifying intervention set 𝓘 of G (which can be ∅), our algorithm recovers the full causal graph with at most O(n^5) + |𝓘| · O(n^3) CI tests.

We remark here that Eberhardt et al. (2012) provide a construction of a verifying intervention set of size log2(n) that is independent of the underlying DAG G. Together with Corollary 3.5, this implies that our algorithm can learn the full causal graph with at most O(n^5) CI tests. To the best of our knowledge, our results present the first formal characterization of the information recoverable about general causal graphs using a polynomial number of CI tests. Prior works showed that it is possible to learn sparse causal graphs with n^{O(k)} CI tests (Claassen et al., 2013; Spirtes et al., 2000), where k is an upper bound on the vertex degrees. Note that the number of CI tests our algorithm requires, O(n^5), is a polynomial in n that is independent of any graph parameters.
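Definition 3.1 is mechanical enough to check directly against a known ground-truth DAG, which is handy for unit-testing an implementation of the algorithms that follow. Below is a sketch under the same assumptions as the earlier snippets; the helper names are our own, covered_edges is the hypothetical helper from the Section 2.4 sketch, and we interpret the intra-component condition as requiring a covered edge of G with both endpoints inside the component.

```python
import networkx as nx

def is_ccpg(G: nx.DiGraph, parts: list, D: set) -> bool:
    """Check the CCPG conditions of Definition 3.1 for a candidate
    partition parts = [V_1, ..., V_k] and component edges D ⊆ [k] x [k]."""
    cov = set(covered_edges(G))  # sketch from Section 2.4

    def src(S):  # source nodes of S: no ancestor inside S
        return {v for v in S if not (nx.ancestors(G, v) & set(S))}

    for Vi in parts:  # intra-component property
        if len(src(Vi)) != 1:
            return False
        if len(Vi) > 1 and not any((u, v) in cov for u in Vi for v in Vi):
            return False
    idx = {v: i for i, Vi in enumerate(parts) for v in Vi}
    for u, v in G.edges:  # consistency (1): cross edges must appear in D
        i, j = idx[u], idx[v]
        if i != j and (i > j or (i, j) not in D):
            return False
    for i, j in D:  # ordering and consistency (2)
        if i >= j:
            return False
        (s,) = src(parts[j])
        if not any(G.has_edge(u, s) for u in parts[i]):
            return False
    return True
```

For instance, the singleton partition with D = G always passes once components are indexed along a topological order of G, in line with the first statement of Lemma 3.2.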
3.1. Proxy V-structure and Meek Rule Statements

In our derivations, we will make use of the following results, which we believe are of separate interest well beyond the scope of this work.

Lemma 3.6 (Proxy V-Structure). Let S ⊆ V and u, v, z ∈ V \ S. If u ⊥ v | S and u ⊥̸ v | S ∪ {z}, then u, v ∉ Des[z].⁵

⁵ A similar argument is also proven in Lemma 1 of Magliacane et al. (2016).

Definition 3.7 (Prefix Vertex Set). We call S ⊆ V a prefix vertex set if it satisfies: for all w ∈ S, Anc[w] ⊆ S (vertices in S appear first in the topological order).

Lemma 3.8 (Proxy Meek Rule 1). Let S be a prefix subset. If u, v and w are such that u ∈ S, v, w ∉ S, u ⊥̸ v | S and u ⊥ w | S ∪ {v}, then v ∉ Des[w].

The results stated above serve as proxy statements of the v-structure and Meek Rule 1 orientation rules. Given the CI tests in the preceding lemmas, if we additionally had confirmed adjacencies between specific pairs of vertices, stronger statements could be made; e.g., in Lemma 3.6 we could conclude that z is a child (or descendant) of both u and v, and in Lemma 3.8 that w is a child of v. However, since our lemma statements do not assume any knowledge of adjacency, we can only ascertain weaker statements: in the first case that u and v are not descendants of z, and in the second case that v is not a descendant of w. While our proxy results reveal weaker relationships among variables, they achieve this using a constant number of CI tests. Uncovering stronger relationships requires adjacency information, which may entail an exponential number of CI tests. Our main contribution lies in leveraging these weaker relationships, along with other favorable properties embedded in our algorithm, to implement the prefix vertex procedure and reconstruct the CCPG representation of the underlying causal graph using few CI tests.

4. Prefix Vertex Set

The core component of our CCPG algorithm involves a procedure that produces a series of prefix vertex sets. We now show how such prefix vertex sets can be learned by performing few CI tests. This will be helpful for obtaining a CCPG representation of G, since the components V1, . . . , Vk satisfy that V1 ∪ · · · ∪ Vi is a prefix vertex set for all i ∈ [k].

4.1. Algorithm for Learning a Prefix Vertex Set

We begin by presenting our algorithm for learning a prefix vertex set. This algorithm takes as input a prefix vertex set S ⊆ V (which can be ∅) and produces a larger prefix vertex set S′. The analysis of Algorithm 1 will be provided in the next section. For an input prefix vertex set S ⊆ V, it makes use of three types of CI tests, which we formalize below.⁶

⁶ We note that u, v, v′ and w in the definitions below are mutually distinct. We omit writing this for simplicity.

Definition 4.1 (Type-I Set D_S). For all w ∉ S, let w ∈ D_S if and only if u ⊥ v | S and u ⊥̸ v | S ∪ {w} for some u ∈ V and v ∉ S.

By the proxy v-structure in Lemma 3.6, these two CI tests indicate that v ∉ Des[w]. Therefore w can potentially be a descendant of v. Thus D_S contains vertices that are potential descendants of some other vertex in S̄. We will rule out this set when searching for a prefix set S′ ⊇ S. Similarly, we can define a type-II set E_S.

Definition 4.2 (Type-II Set E_S). For all w ∈ S̄ \ D_S, let w ∈ E_S if and only if u ⊥ v′ | S ∪ {v} and u ⊥̸ v′ | S ∪ {v, w} for some u ∈ S and v, v′ ∈ S̄ \ D_S.

We also exclude a type-III set F_S, defined as follows.

Definition 4.3 (Type-III Set F_S). For all w ∈ S̄ \ D_S, let w ∈ F_S if and only if u ⊥̸ v | S, u ⊥ w | S ∪ {v}, and v ⊥̸ w | S for some u ∈ S and v ∈ S̄ \ D_S.
By the proxy Meek Rule 1 in Lemma 3.8, the first two CI tests u ⊥̸ v | S and u ⊥ w | S ∪ {v} guarantee that v ∉ Des[w]. The remaining CI test v ⊥̸ w | S ensures that we do not exclude too many vertices in F_S, in particular src(S̄), as we show in the next section.

Algorithm 1 Learning a Prefix Vertex Set
1: Input: A prefix vertex set S ⊆ V. CI queries from G.
2: Output: A prefix vertex set S′ such that S ⊊ S′.
3: Compute the type-I set D_S.
4: Compute the type-II and type-III sets E_S, F_S.
5: Let U = S̄ \ (D_S ∪ E_S ∪ F_S).
6: return S′ = S ∪ U.

4.2. Correctness and Guarantees

Algorithm 1 satisfies the following guarantees. All omitted proofs can be found in Appendix B.

Figure 3. D_S, E_S, F_S satisfy that (1) they contain all downstream vertices of any vertex in them; (2) they do not intersect with src(S̄).

Theorem 4.4. Algorithm 1 outputs a prefix vertex set using O(n^4) CI tests. In addition, this prefix vertex set contains all the remaining source nodes, i.e., src(S̄) ⊆ S′.

For its proof, we will make use of the following properties of the type-I, II, and III sets (illustrated in Figure 3).

Lemma 4.5. Let S be a prefix vertex set. If w ∈ D_S, then Des[w] ⊆ D_S. Furthermore, D_S ∩ src(S̄) = ∅. The same properties hold for E_S.

Lemma 4.6. Let S be a prefix vertex set. If w ∈ F_S, then Des[w] ⊆ E_S ∪ F_S. Furthermore, F_S ∩ src(S̄) = ∅.

Proof of Theorem 4.4. We first show that the set S′ returned by Algorithm 1 satisfies src(S̄) ⊆ S′. By Lemmas 4.5 and 4.6, we have (D_S ∪ E_S ∪ F_S) ∩ src(S̄) = ∅. As S̄′ = D_S ∪ E_S ∪ F_S, it must hold that src(S̄) ⊆ S′. Next we show that S′ is a prefix vertex set. For this, we only need to show that for all w ∉ S′ and y ∈ Des(w), it holds that y ∉ S′. Since S̄′ = D_S ∪ E_S ∪ F_S, one of the following three scenarios must hold: (1) if w ∈ D_S, then y ∈ D_S by Lemma 4.5; (2) if w ∈ E_S, then y ∈ E_S by Lemma 4.5; or (3) if w ∈ F_S, then y ∈ E_S ∪ F_S by Lemma 4.6. Therefore S′ must be a prefix vertex set. We now bound the number of CI tests performed by Algorithm 1: computing D_S takes O(n^3) CI tests; computing E_S takes O(n^4) CI tests; computing F_S takes O(n^3) CI tests. This completes the proof.

4.3. Relation to Covered Edges

In Theorem 4.4, we showed that src(S̄) ⊆ S′. In fact, when there are no covered edges coming from src(S̄), one can show that S′ = src(S̄) ∪ S via the following Lemma 4.7.

Lemma 4.7. Let S be a prefix vertex set. For w ∈ S̄ \ D_S, if w ∉ src(S̄) and there is no covered edge from Anc[w] ∩ src(S̄) to Anc[w], then w ∈ E_S ∪ F_S.

Corollary 4.8. Let S be a prefix vertex set. If there is no covered edge in S̄, then S′ = src(S̄) ∪ S.

This result will be useful when deriving CCPG representations using Algorithm 1. To see this, consider the simple case where there is no covered edge in G. Running Algorithm 1 with S = ∅, we can learn src(V) by Corollary 4.8. Then, running Algorithm 1 with S = src(V), we can learn the source vertices of V \ src(V). Applying this iteratively, we can obtain the ground-truth topological order of G. As a consequence, one can easily learn G,⁷ which is the sole CCPG representation as there is no covered edge.

⁷ For i < j in the topological order, the corresponding edge vi → vj ∈ G iff vi ⊥̸ vj | v1, . . . , vj−1.
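To make the three exclusion sets concrete, here is a direct, unoptimized Python transcription of Algorithm 1 against an abstract oracle ci(u, v, C), assumed to return True iff u ⊥ v | C (e.g., the d-separation sketch from Section 2.2 run on a ground-truth DAG). This is our own sketch of the pseudocode above, not the authors' implementation.

```python
def learn_prefix_set(V, S, ci):
    """One call of Algorithm 1 (sketch): given a prefix vertex set S,
    return S' = S ∪ (S̄ \ (D_S ∪ E_S ∪ F_S))."""
    S, rest = set(S), [w for w in V if w not in S]

    # Type-I set D_S (Definition 4.1): conditioning on w activates a
    # dependence, so by Lemma 3.6 neither u nor v is a descendant of w.
    D = {w for w in rest for u in V for v in rest
         if len({u, v, w}) == 3 and ci(u, v, S) and not ci(u, v, S | {w})}
    cand = [w for w in rest if w not in D]

    # Type-II set E_S (Definition 4.2).
    E = {w for w in cand for u in S for v in cand for vp in cand
         if len({u, v, vp, w}) == 4
         and ci(u, vp, S | {v}) and not ci(u, vp, S | {v, w})}

    # Type-III set F_S (Definition 4.3): the first two tests invoke the
    # proxy Meek rule (Lemma 3.8); the third protects src(S̄).
    F = {w for w in cand for u in S for v in cand
         if len({u, v, w}) == 3
         and not ci(u, v, S) and ci(u, w, S | {v}) and not ci(v, w, S)}

    return S | {w for w in rest if w not in (D | E | F)}
```

Iterating this from S = ∅ yields the chain of prefix vertex sets used by Algorithm 2 in the next section; by Theorem 4.4, each call costs O(n^4) CI tests.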
5. Learning Causally Consistent Partition Graph Representations

We now present our algorithm for learning causally consistent partition graph representations. All omitted proofs can be found in Appendix C. Algorithm 2 contains three parts, indicated by colored boxes below. In the first part, we use Algorithm 1 iteratively to learn prefix vertex sets of the form ∅ ⊊ S ⊊ S′ ⊊ · · · ⊊ V. In the second part, we break each S′ \ S into smaller components that are pairwise independent given S. As we will see, these components satisfy the intra-component property of CCPGs in Definition 3.1. In the third part, we build the acyclic graph between the CCPG components.

Algorithm 2 Learning a CCPG Representation
1: Input: CI queries from G.
2: Output: A CCPG representation of G.
3: Set S = ∅ and S as an empty ordered list.
4: while S ≠ V do
5: Run Algorithm 1 on S to obtain S′.
6: Add S′ \ S to the end of S.
7: Update S = S′.
8: end while
9: Initialize l_1 = 1.
10: for S_i in S = [S_1, . . . , S_m] do
11: Create an empty graph T on the vertices in S_i.
12: Add v − w to T iff v ⊥̸ w | S_1 ∪ · · · ∪ S_{i−1}.
13: Split S_i into components V_{l_i}, V_{l_i+1}, . . . , V_{l_{i+1}−1} based on the connected components in T.
14: end for
15: Create an empty graph D on the vertices [l_{m+1} − 1].
16: for V_i, V_j with i < j do
17: Add i → j to D iff V_i ⊥̸ V_j | V_1 ∪ · · · ∪ V_{j−1}.
18: end for
19: return V_1, . . . , V_{l_{m+1}−1} and D
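Continuing the sketch, Algorithm 2 can be transcribed on top of learn_prefix_set as follows. This is again our own code, with one design note: under faithfulness, the set-level test in line 17 reduces to pairwise d-separation tests with the same conditioning set, which is what the last loop exploits.

```python
import networkx as nx

def learn_ccpg(V, ci):
    """Sketch of Algorithm 2: returns (components, D) for a CCPG of G."""
    # Part 1: grow prefix vertex sets, recording the increments S_1, ..., S_m.
    S, chunks = set(), []
    while S != set(V):
        S_new = learn_prefix_set(V, S, ci)
        chunks.append(sorted(S_new - S))
        S = S_new

    # Part 2: split each S_i by pairwise dependence given S_1 ∪ ... ∪ S_{i-1}.
    components, below = [], set()
    for chunk in chunks:
        T = nx.Graph()
        T.add_nodes_from(chunk)
        T.add_edges_from((v, w) for v in chunk for w in chunk
                         if v < w and not ci(v, w, below))
        components += [sorted(c) for c in nx.connected_components(T)]
        below |= set(chunk)

    # Part 3: add i -> j iff V_i, V_j are dependent given V_1 ∪ ... ∪ V_{j-1}.
    D = set()
    for j in range(len(components)):
        context = set().union(*components[:j])
        D |= {(i, j) for i in range(j)
              if any(not ci(u, v, context)
                     for u in components[i] for v in components[j])}
    return components, D
```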
To show that the components we obtain in the second part satisfy the properties of CCPGs, we will use the following result. Essentially, we can show that the subgraph G[S′ \ S] can be split into connected subgraphs based on the individual vertices in src(S′ \ S) (= src(S̄) by Theorem 4.4). This is illustrated in Figure 4.

Figure 4. Illustration of S̄ \ D_S (inside the dashed box). It can be split into connected subgraphs based on the vertices in src(S̄) (indicated by the fill color of each vertex in S̄ \ D_S).

Lemma 5.1. Let S be a prefix subset. For any w ∈ S̄ \ D_S, there is |Anc[w] ∩ src(S̄)| = 1. Furthermore, for any other w′ ∈ S̄ \ D_S, there is w ⊥̸ w′ | S if and only if Anc[w] ∩ src(S̄) = Anc[w′] ∩ src(S̄).

Based on this result, we can establish that Algorithm 2 outputs a CCPG representation by proving Theorem 3.3.

Proof of Theorem 3.3. We show that Algorithm 2 outputs a CCPG representation of G. Since it runs in polynomial time and uses O(n^5) CI tests, this proves Theorem 3.3.

Intra-component Property. Consider S_i and denote S = S_1 ∪ · · · ∪ S_{i−1}. We will show that S_i is split into the sets Des[s] ∩ S_i for s ∈ src(S̄). Since S_i ⊆ S̄, we have S_i = ∪_{s∈src(S̄)} (Des[s] ∩ S_i). By Theorem 4.4, we know that S is a prefix vertex set. Since S_i ⊆ S̄ \ D_S, by Lemma 5.1, the sets Des[s] ∩ S_i for different s ∈ src(S̄) are disjoint. Furthermore, by Theorem 4.4, these disjoint sets are non-empty because s ∈ Des[s] ∩ S_i. Therefore src(Des[s] ∩ S_i) = {s} and we have |src(Des[s] ∩ S_i)| = 1. Thus by Lemma 5.1, each of V_{l_i}, V_{l_i+1}, . . . , V_{l_{i+1}−1} corresponds to Des[s] ∩ S_i for some s ∈ src(S̄).

If |Des[s] ∩ S_i| > 1, then it contains a vertex that is not s. Suppose w ∈ Des(s) ∩ S_i. By Lemmas 4.7 and 5.1, there is a covered edge from s to some vertex x ∈ Anc[w]. Since S and S ∪ S_i are prefix vertex sets, s, w ∈ S_i implies x ∈ S_i. Thus x ∈ Des[s] ∩ S_i and there is a covered edge s → x in Des[s] ∩ S_i. Thus, we have proven that each component in V_1, . . . , V_{l_{m+1}−1} satisfies the first property of CCPGs.

Inter-component Property. For each V_i, denote the subset that it came from as S_{h_i}, i.e., V_i ⊆ S_{h_i}. By the construction of D in Algorithm 2, no edge i → j in D means either i > j or V_i ⊥ V_j | V_1 ∪ · · · ∪ V_{j−1}. If i > j and h_i ≠ h_j, then, by the fact that S_1 ∪ · · · ∪ S_{h_j} is a prefix vertex set, there is no edge from V_i to V_j in G. If h_i = h_j = h, then we know that there exist s_i ≠ s_j such that V_i = Des[s_i] ∩ S_h and V_j = Des[s_j] ∩ S_h. If there were an edge from V_i to V_j in G, denoted v → w, then w ∈ Des[v] ⊆ Des[s_i], which means V_i ∩ V_j ≠ ∅, a contradiction. Therefore there is no edge from V_i to V_j in G. If V_i ⊥ V_j | V_1 ∪ · · · ∪ V_{j−1}, then clearly there is no edge from V_i to V_j in G. Thus when there is no i → j in D, there is no edge from V_i to V_j in G.

If there is an edge i → j in D, then we have i < j and V_i ⊥̸ V_j | V_1 ∪ · · · ∪ V_{j−1}. Note that since S_1 ∪ · · · ∪ S_{h_j} is prefixed, it holds that Pa(V_j) ⊆ S_1 ∪ · · · ∪ S_{h_j} and Des[V_j] ∩ (S_1 ∪ · · · ∪ S_{h_j−1}) = ∅. As shown above, there is no edge between any other V_{j′} ⊆ S_{h_j} and V_j. Thus we have Pa(V_j) ⊆ V_1 ∪ · · · ∪ V_j and Des[V_j] ∩ (V_1 ∪ · · · ∪ V_{j−1}) = ∅. Similarly Pa(V_i) ⊆ V_1 ∪ · · · ∪ V_i. Thus by the local Markov property and the Bayes rule,⁸ V_i ⊥̸ V_j | V_1 ∪ · · · ∪ V_{j−1} only if there is a direct edge u → v from some u ∈ V_i to some v ∈ V_j.

We now show that there is also an edge u → s where {s} = src(V_j). Assume on the contrary that there is no edge u → s. Since s is the source node of V_j and thus a source node of S_{h_j}, we have from Pa(V_j) ⊆ S_1 ∪ · · · ∪ S_{h_j} that Pa(s) ⊆ S_1 ∪ · · · ∪ S_{h_j−1}. In addition, Des[s] ∩ (S_1 ∪ · · · ∪ S_{h_j−1}) = ∅. Thus by the local Markov property, u ⊥ s | S_1 ∪ · · · ∪ S_{h_j−1}. However, as u → v and v ∈ Des[s], we have u ⊥̸ s | S_1 ∪ · · · ∪ S_{h_j−1} ∪ {v}. As a consequence, v ∈ D_{S_1∪···∪S_{h_j−1}}. By Algorithm 1, it is then impossible that v ∈ S_{h_j}, a contradiction. Therefore we must have u ∈ Pa(s). This proves that D satisfies the second property of CCPGs.

6. Interventions

In Sections 4 and 5, we showed how to learn a CCPG representation using observational data. Here we generalize these methods to interventions, which results in a more refined I-CCPG representation of G. All omitted proofs can be found in Appendix D.

6.1. Learning Refined Prefix Vertex Sets

In Section 4.1, we obtained a prefix vertex set by excluding three types of sets D_S, E_S and F_S. With interventions, we can exclude an additional set. To define it, we first characterize what can be learned using interventions.

Lemma 6.1. Let S be a prefix vertex set and let I ⊆ V be an intervention. For any u ∉ I, we have u ⊥̸_I v for some v ∈ I \ S if and only if u ∈ Des(I \ S) \ (I \ S).

Note that when the number of CI tests is not restricted, one can learn all edges cut by I as well as their orientations. Lemma 6.1 shows that it is possible to learn the joint set of descendants using O(n^2) CI tests. With this set and a few more CI tests, it is possible to learn additional directional information.

Lemma 6.2. Let S be a prefix subset. Given an intervention I and v ∈ I \ S, denote H^I_S(v) = {u ∉ S ∪ Des[I \ S] : u ⊥̸_I v | V \ Des[I \ S]}.⁹ Then Pa(v) \ (S ∪ Des[I \ S]) ⊆ H^I_S(v) ⊆ Anc(v) \ (S ∪ Des[I \ S]).

Such sets can be useful when deciding whether a target of I \ S should be excluded when learning a prefix vertex set S′ ⊇ S. Formally, we define the type-IV set as follows.

Definition 6.3 (Type-IV Set J^I_S). Let S be a prefix vertex set. For an intervention I, let J^I_S = Des(I \ S) ∪ {v ∈ I \ S : H^I_S(v) ∩ S̄ ≠ ∅}.

Analogous to Lemma 4.5, we can show that J^I_S satisfies similar properties (illustrated in Figure 5).

Figure 5. Illustration of J^I_S, where I is indicated by the purple circle. J^I_S satisfies similar properties as D_S, E_S and F_S.

Lemma 6.4. Let S be a prefix vertex set. Then for all w ∈ J^I_S, we have Des(w) ⊆ J^I_S. Furthermore, J^I_S ∩ src(S̄) = ∅.

Therefore we can exclude J^I_S as well, which results in the following modification of Algorithm 1, for which we can show similar guarantees using Lemma 6.4.

⁸ See Lemma A.1 in Appendix A.
⁹ Note that Des[I \ S] can be obtained via Lemma 6.1, as Des[I \ S] = (Des(I \ S) \ (I \ S)) ∪ (I \ S).

Algorithm 3 Learning a Prefix Vertex Set (with Interventions)
1: Input: A prefix vertex set S ⊆ V. CI queries from G and G_I for each intervention I ∈ 𝓘.
2: Output: A prefix vertex set S′ such that S ⊊ S′.
3: for I ∈ 𝓘 do
4: Compute the type-IV set J^I_S.
5: end for
6: Compute the type-I set D_S.
7: Compute the type-II and type-III sets E_S, F_S.
8: Let U = S̄ \ ((∪_{I∈𝓘} J^I_S) ∪ D_S ∪ E_S ∪ F_S).
9: return S′ = S ∪ U.

Theorem 6.5. Algorithm 3 outputs a prefix vertex set using O(n^4) + |𝓘| · O(n^2) CI tests. In addition, this prefix vertex set contains all the remaining source nodes, i.e., src(S̄) ⊆ S′.

Proof. Similar to the proof of Theorem 4.4, we can use the additional Lemma 6.4 to show that (1) S′ is a prefix vertex set, and (2) it contains src(S̄). We now bound the number of CI tests performed by Algorithm 3: compared to Algorithm 1, it additionally needs to compute J^I_S. By Lemmas 6.1 and 6.2, computing J^I_S for each I ∈ 𝓘 takes O(n^2) CI tests. Thus the total complexity is O(n^4) + |𝓘| · O(n^2).

In addition, we can show the following property for covered edges coming from src(S̄).

Lemma 6.6. Let S be a prefix vertex set, v ∈ src(S̄), and let v → w be a covered edge. Given any intervention I, if |I ∩ {v, w}| = 1, then w ∈ J^I_S.

Proof. If I ∩ {v, w} = {v}, then w ∈ Des(I \ S). If I ∩ {v, w} = {w}, then w ∈ I \ S. We will show that H^I_S(w) ∩ S̄ ≠ ∅, which in turn proves that w ∈ J^I_S. Note that since v ∈ src(S̄), we have v ∉ S ∪ Des[I \ S]. Furthermore, since v → w is an edge, we have v ∈ Pa(w). Thus by Lemma 6.2, we have v ∈ Pa(w) \ (S ∪ Des[I \ S]) ⊆ H^I_S(w).

6.2. Learning I-CCPG Representations

We obtain the final algorithm by plugging Algorithm 3 into Algorithm 2, i.e., replacing Algorithm 1 in line 5 by Algorithm 3.

Proof of Theorem 3.4. By Theorem 6.5, for any i, the set S = S_1 ∪ · · · ∪ S_{i−1} is a prefix vertex set. Therefore, using the proof of Theorem 3.3 and the fact that src(S̄) ⊆ S ∪ S_i (Theorem 6.5), we immediately have |src(V_i)| = 1 and that D satisfies the second property of I-CCPG in Definition 3.1. We now show that when |V_i| > 1, the subgraph G[V_i] has at least one unintervened covered edge. Suppose on the contrary that G[V_i] has no unintervened covered edge. Denote {s} = src(V_i) and let w ∈ V_i with w ≠ s. Similar to the proof of Theorem 3.3, we can show that w ∈ Des(s). By Lemma 4.7, there is a covered edge from s to some vertex x ∈ Anc[w] with x ∈ V_i. Since G[V_i] has no unintervened covered edge, s → x is intervened by some I ∈ 𝓘. Then by Lemma 6.6, we have x ∈ J^I_S. This contradicts x ∈ V_i.

7. Experiments

In this section, we test our proposed method and compare it to existing causal discovery methods on synthetic data and a toy real-world example. Source code for these results can be found at https://github.com/uhlerlab/CCPG.

7.1. Synthetic Data

In these experiments, we consider the observational setting and generate samples from linear causal models with additive Gaussian noise, governed by identifiable causal graphs, in particular in-star-shaped DAGs with varying numbers of nodes. In such settings, with enough samples, our algorithm is guaranteed to return the ground-truth DAG (Corollary 3.5). We compare our Algorithm 2, termed CCPG, to other constraint-based and hybrid methods that rely on conditional independence tests. The constraint-based methods we include are: PC (Spirtes et al., 2000), FCI (Spirtes et al., 1999), and RFCI (Colombo et al., 2012), where we use the order-independent variants from Colombo et al. (2014) (i.e., the stable versions). The hybrid methods we include are: GSP with depth = 4 and depth = ∞ (Solus et al., 2021). Implementation details can be found in Appendix E.

Figure 6. Runtime comparison across graphs of different sizes.
CCPG is as fast as hybrid methods (GSP) and significantly more efficient than constraint-based methods (PC, FCI, and RFCI). The programming language behind each implementation is indicated by dashed or solid lines (see Appendix E for details).

Figure 7. Sample complexity comparison. CCPG requires the least number of samples to recover the true graph. Methods that do not assume causal sufficiency are indicated by shading (see Appendix E for details).

Runtime Analysis. To test the computational efficiency of our method, we generated 100k samples across graphs of different sizes and report the runtime averaged over five runs. Figure 6 shows that CCPG is as fast as hybrid methods (GSP) and significantly more efficient than constraint-based methods (PC, FCI, and RFCI), which can only scale to 20 nodes with reasonable runtime.

Sample Complexity Analysis. To test the sample efficiency of CCPG, we consider a 10-node in-star-shaped DAG. For each method, we increase the number of samples until it returns the ground-truth DAG and repeat this procedure for five runs. Figure 7 shows that CCPG requires the least number of samples to recover the true graph. In general, constraint-based methods (PC, FCI, and RFCI) have better sample efficiency than hybrid methods, while being significantly less runtime efficient. In comparison, CCPG, with its polynomial-number-of-CI-tests guarantee, enjoys both low sample complexity and low runtime complexity.

7.2. A Real-world Example

To illustrate the utility of the coarser representation learned by CCPG in real-world settings, we include a simple 6-variable Airfoil example (Asuncion & Newman, 2007; Lam et al., 2022); see Lam et al. (2022) for a detailed description of this example. Although there is no known ground-truth DAG in this setting, a few causal relations are known: (1) velocity, chord, and attack should be source nodes; (2) pressure is downstream of all other nodes. The coarser representation learned by CCPG is shown in Figure 8.

Figure 8. Learned coarser representation by CCPG in the Airfoil example.

Figure 9. Learned partially directed graph by PC in the Airfoil example.

Compared to PC, which returns the partially directed graph in Figure 9, CCPG appears to be more consistent with the known causal relations, while it contains less information due to the coarser representation.

8. Discussion

In our work, we studied causal structure learning under the constraint of fewer CI tests. Since exact structure learning may demand an exponential number of CI tests, we defined a representation (CCPG) that captures partial but crucial information about the underlying causal graph. Moreover, we provided an efficient algorithm that recovers a CCPG representation with a polynomial number of CI tests. This result enabled us to design efficient algorithms for the full recovery of causal graphs in two specific settings, utilizing only a polynomial number of CI tests. We hope that our work will motivate further exploration of the causal discovery problem under the constraint of fewer CI tests, extending to various settings, including those involving latent variables.
Furthermore, our research establishes a foundation for addressing the search problem¹⁰ with reduced tests, suggesting the potential existence of search algorithms capable of recovering the causal graph with a polynomial number of independence tests while performing an approximately optimal number of interventions.

¹⁰ The search problem involves finding the minimum set of interventions that orient the entire causal graph.

Acknowledgements

We thank the anonymous reviewers for helpful feedback. J.Z. was partially supported by an Apple AI/ML PhD Fellowship. K.S. was supported by a fellowship from the Eric and Wendy Schmidt Center at the Broad Institute. C.U. was partially supported by NCCIH/NIH (1DP2AT012345), ONR (N00014-22-1-2116), DOE-ASCR (DE-SC0023187), the MIT-IBM Watson AI Lab, and a Simons Investigator Award.

Impact Statement

This paper presents theoretical work whose goal is to advance the field of causal inference. There are many potential societal applications of our work, none of which we feel must be specifically highlighted here.

References

Alonso-Barba, J. I., Gámez, J. A., Puerta, J. M., et al. Scaling up the greedy equivalence search algorithm by constraining the search space of equivalence classes. International Journal of Approximate Reasoning, 54(4):429-451, 2013.
Andersson, S. A., Madigan, D., and Perlman, M. D. A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics, 25(2):505-541, 1997.
Asuncion, A. and Newman, D. UCI Machine Learning Repository, 2007.
Bouckaert, R. R. Properties of Bayesian belief network learning algorithms. In Uncertainty Proceedings 1994, pp. 102-109. Elsevier, 1994.
Brenner, E. and Sontag, D. SparsityBoost: A new scoring function for learning Bayesian network structure. arXiv preprint arXiv:1309.6820, 2013.
Chickering, D. M. A transformational characterization of equivalent Bayesian network structures. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, UAI'95, pp. 87-98, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc. ISBN 1558603859.
Chickering, D. M. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507-554, 2002.
Chickering, M., Heckerman, D., and Meek, C. Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research, 5:1287-1330, 2004.
Cho, H., Berger, B., and Peng, J. Reconstructing causal biological networks through active learning. PLoS ONE, 11(3):e0150611, 2016.
Choo, D. and Shiragur, K. Subset verification and search algorithms for causal DAGs. arXiv preprint arXiv:2301.03180, 2023.
Choo, D., Shiragur, K., and Bhattacharyya, A. Verification and search algorithms for causal DAGs. Advances in Neural Information Processing Systems, 35, 2022.
Claassen, T., Mooij, J., and Heskes, T. Learning sparse causal models is not NP-hard. arXiv preprint arXiv:1309.6824, 2013.
Colombo, D., Maathuis, M. H., Kalisch, M., and Richardson, T. S. Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, pp. 294-321, 2012.
Colombo, D., Maathuis, M. H., et al. Order-independent constraint-based causal structure learning. Journal of Machine Learning Research, 15(1):3741-3782, 2014.
de Campos, L. M., Cano, A., Castellano, J. G., and Moral, S. Combining gene expression data and prior knowledge for inferring gene regulatory networks via Bayesian networks using structural restrictions. Statistical Applications in Genetics and Molecular Biology, 18(3), 2019.
Eberhardt, F. Causal discovery as a game. In Causality: Objectives and Assessment, pp. 87-96. PMLR, 2010.
Eberhardt, F. and Scheines, R. Interventions and causal inference. Philosophy of Science, 74(5):981-995, 2007.
Eberhardt, F., Glymour, C., and Scheines, R. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among N variables. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pp. 178-184, 2005.
Eberhardt, F., Glymour, C., and Scheines, R. N-1 experiments suffice to determine the causal relations among N variables. In Innovations in Machine Learning, pp. 97-112. Springer, 2006.
Eberhardt, F., Glymour, C., and Scheines, R. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among N variables. arXiv preprint arXiv:1207.1389, 2012.
Friedman, N., Linial, M., Nachman, I., and Pe'er, D. Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7(3-4):601-620, 2000.
Geiger, D. and Heckerman, D. Parameter priors for directed acyclic graphical models and the characterization of several probability distributions. The Annals of Statistics, 30(5):1412-1440, 2002.
Geiger, D. and Pearl, J. On the logic of causal models. In Machine Intelligence and Pattern Recognition, volume 9, pp. 3-14. Elsevier, 1990.
Glymour, C., Zhang, K., and Spirtes, P. Review of causal discovery methods based on graphical models. Frontiers in Genetics, 10:524, 2019.
Greenewald, K., Katz, D., Shanmugam, K., Magliacane, S., Kocaoglu, M., Boix-Adserà, E., and Bresler, G. Sample efficient active learning of causal trees. Advances in Neural Information Processing Systems, 32, 2019.
Hauser, A. and Bühlmann, P. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. The Journal of Machine Learning Research, 13(1):2409-2464, 2012.
Hoover, K. D. The logic of causal inference: Econometrics and the conditional analysis of causation. Economics & Philosophy, 6(2):207-234, 1990.
Hu, H., Li, Z., and Vetta, A. Randomized experimental design for causal graph discovery. Advances in Neural Information Processing Systems, 27, 2014.
Kalisch, M. and Bühlman, P. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8(3), 2007.
Kalisch, M., Hauser, A., Maechler, M., Colombo, D., Entner, D., Hoyer, P., Hyttinen, A., Peters, J., Andri, N., Perkovic, E., et al. Package 'pcalg'. 2024.
King, R. D., Whelan, K. E., Jones, F. M., Reiser, P. G. K., Bryant, C. H., Muggleton, S. H., Kell, D. B., and Oliver, S. G. Functional genomic hypothesis generation and experimentation by a robot scientist. Nature, 427(6971):247-252, 2004.
Lam, W.-Y., Andrews, B., and Ramsey, J. Greedy relaxations of the sparsest permutation algorithm. In Uncertainty in Artificial Intelligence, pp. 1052-1062. PMLR, 2022.
Lauritzen, S. L. Graphical Models, volume 17. Clarendon Press, 1996.
Magliacane, S., Claassen, T., and Mooij, J. M. Ancestral causal inference. Advances in Neural Information Processing Systems, 29, 2016.
Meek, C. Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, UAI'95, pp. 403-410, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc. ISBN 1558603859.
Nandy, P., Hauser, A., and Maathuis, M. H. High-dimensional consistency in score-based and hybrid structure learning. The Annals of Statistics, 46(6A):3151-3183, 2018.
Pearl, J. Causality: models, reasoning, and inference. Econometric Theory, 19(4):675-685, 2003.
Pearl, J. Causality: Models, Reasoning and Inference. Cambridge University Press, USA, 2nd edition, 2009. ISBN 052189560X.
Pingault, J.-B., O'Reilly, P. F., Schoeler, T., Ploubidis, G. B., Rijsdijk, F., and Dudbridge, F. Using genetic data to strengthen causal inference in observational research. Nature Reviews Genetics, 19(9):566-580, 2018.
Reichenbach, H. The Direction of Time, volume 65. University of California Press, 1956.
Robins, J. M., Hernan, M. A., and Brumback, B. Marginal structural models and causal inference in epidemiology. Epidemiology, pp. 550-560, 2000.
Rotmensch, M., Halpern, Y., Tlimat, A., Horng, S., and Sontag, D. Learning a health knowledge graph from electronic medical records. Scientific Reports, 7(1):1-11, 2017.
Schulte, O., Frigo, G., Greiner, R., and Khosravi, H. The IMAP hybrid method for learning Gaussian Bayes nets. In Advances in Artificial Intelligence: 23rd Canadian Conference on Artificial Intelligence, Canadian AI 2010, Ottawa, Canada, May 31-June 2, 2010, Proceedings 23, pp. 123-134. Springer, 2010.
Shanmugam, K., Kocaoglu, M., Dimakis, A. G., and Vishwanath, S. Learning causal graphs with small interventions. Advances in Neural Information Processing Systems, 28, 2015.
Shiragur, K., Zhang, J., and Uhler, C. Meek separators and their applications in targeted causal discovery. Advances in Neural Information Processing Systems, 36, 2024.
Solus, L., Wang, Y., and Uhler, C. Consistency guarantees for greedy permutation-based causal inference algorithms. Biometrika, 108(4):795-814, 2021.
Spirtes, P., Glymour, C., and Scheines, R. Causality from probability. 1989.
Spirtes, P., Meek, C., and Richardson, T. An algorithm for causal inference in the presence of latent variables and selection bias (vol. 1), 1999.
Spirtes, P., Glymour, C. N., Scheines, R., and Heckerman, D. Causation, Prediction, and Search. MIT Press, 2000.
Spirtes, P. L., Meek, C., and Richardson, T. S. Causal inference in the presence of latent variables and selection bias. arXiv preprint arXiv:1302.4983, 2013.
Squires, C. causaldag: Creation, manipulation, and learning of causal models. 2018. URL https://github.com/uhlerlab/causaldag.
Squires, C. and Uhler, C. Causal structure learning: a combinatorial perspective. Foundations of Computational Mathematics, pp. 1-35, 2022.
Squires, C., Magliacane, S., Greenewald, K., Katz, D., Kocaoglu, M., and Shanmugam, K. Active structure learning of causal DAGs via directed clique trees. Advances in Neural Information Processing Systems, 33:21500-21511, 2020.
Sverchkov, Y. and Craven, M. A review of active learning approaches to experimental design for uncovering biological networks. PLoS Computational Biology, 13(6):e1005466, 2017.
Tian, T. Bayesian computation methods for inferring regulatory network models using biomedical data. Translational Biomedical Informatics: A Precision Medicine Perspective, pp. 289-307, 2016.
Verma, T. and Pearl, J. Equivalence and synthesis of causal models. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence, pp. 255-270, 1990.
Woodward, J. Making Things Happen: A Theory of Causal Explanation. Oxford University Press, 2005.
Xie, X. and Geng, Z. A recursive method for structural learning of directed acyclic graphs. The Journal of Machine Learning Research, 9:459-483, 2008.
Zhang, J., Shiragur, K., and Uhler, C. Membership testing in Markov equivalence classes via independence queries. In International Conference on Artificial Intelligence and Statistics, pp. 3925-3933. PMLR, 2024.

A. Useful Lemmas

A.1. Proof of Lemma 3.2

Lemma 3.2 (Properties of CCPG). For any intervention set 𝓘 (including ∅), the following statements hold:
- D = G (i.e., partitioning V into individual vertices) is a valid I-CCPG of G.
- If the verification number of G is zero, i.e., ν(G) = 0, then D = G is the unique valid I-CCPG of G.
- If 𝓘 is a verifying intervention set of G, then D = G is the unique valid I-CCPG of G.

Proof. Note that if |Vi| = 1 for all i, then the I-CCPG is exactly G. If |Vi| > 1 for some i, then there is a covered edge in this component that is not intervened. By Proposition 2.1, this is impossible when 𝓘 is a verifying intervention set.

A.2. Proof of Lemma 3.6

Lemma 3.6 (Proxy V-Structure). Let S ⊆ V and u, v, z ∈ V \ S. If u ⊥ v | S and u ⊥̸ v | S ∪ {z}, then u, v ∉ Des[z].¹¹

Proof. Let P be an active path (that carries dependency) between u and v when conditioned on S ∪ {z}. Since u ⊥ v | S and u ⊥̸ v | S ∪ {z}, there exists a set of vertices w1, . . . , wk that are colliders on P and satisfy Des[wi] ∩ S = ∅ and z ∈ Des[wi]. Consider the path P and note that it takes the form P = u ··· w1 ··· wi ··· wk ··· v. Define P1 = u ··· w1 and P2 = wk ··· v. Since all the colliders on the paths P1, P2 are in or have their descendants in S, and all non-colliders do not belong to S, we have that P1, P2 are active given S.

We prove our lemma by contradiction. Assume that one of the vertices in {u, v} belongs to the set Des[z]; without loss of generality let that vertex be u. Then, since z ∈ Des[wi] for all i, there must be u ∈ Des[wi] for all i. Since u ∈ Des[wk], let Q be the directed path in the graph that connects wk to u. Note that all the vertices on the path Q are descendants of wk and do not belong to the set S (because Des[wk] ∩ S = ∅). Now consider the path (P2, Q) that connects vertices v and u, and note that the vertex wk is a non-collider on this new path. Since P2 and Q are active paths given S, and since wk ∉ S, we immediately get that the path (P2, Q) is active given S, which further implies u ⊥̸ v | S; a contradiction, and we conclude the proof.

A.3. Proof of Lemma 3.8

Lemma 3.8 (Proxy Meek Rule 1). Let S be a prefix subset. If u, v and w are such that u ∈ S, v, w ∉ S, u ⊥̸ v | S and u ⊥ w | S ∪ {v}, then v ∉ Des[w].

Proof. Suppose on the contrary that there are u, v, w such that u ∈ S, v, w ∉ S, u ⊥̸ v | S, u ⊥ w | S ∪ {v}, and v ∈ Des[w]. Since u ⊥̸ v | S, let P be the active path connecting u and v given S. As u ∈ S and v ∉ S, there is an edge u′ → v′ on P such that u′ ∈ S but v′ ∉ S. Now denote the vertex on P immediately next to v as x. If the edge is directed as v → x, then there must be a collider on P between u and v, as S is a prefix subset with u ∈ S and v ∉ S. Let y be the collider on P between u and v that is closest to v, so that v → · · · → y on P. Since P is active given S, the collider y ∈ Anc[S]. However, then v ∈ Anc[y] ⊆ Anc[S] ⊆ S while v ∉ S, a contradiction to S being prefix. Thus the edge must be directed as x → v on P. Since we assumed on the contrary that v ∈ Des[w], we can consider the path Q joined by P and the directed path from w to v.
Compared to P, the path Q has one additional collider v, and a few additional non-colliders that lie between v and w, which are not in S (as they are all descendants of w and w ∉ S). Therefore Q is active given S ∪ {v}. This means u ⊥̸ w | S ∪ {v}, a contradiction.

¹¹ A similar argument is also proven in Lemma 1 of Magliacane et al. (2016).

A.4. Additional Lemma

In addition, we will make use of the following lemma.

Lemma A.1. For disjoint sets A, B, C ⊆ V, if Pa(A) ⊆ C ∪ A, Pa(B) ⊆ C ∪ B and Des(B) ∩ C = ∅, then A ⊥ B | C.

Proof. Assume without loss of generality that the vertices in B have the topological order b1, . . . , bm. Then for any i ∈ [m], by the local Markov property (Spirtes et al., 1989), we have A ⊥ bi | C ∪ {b1, . . . , bi−1}. Therefore, using the Bayes rule, we have
P(B | C, A) = P(b1 | C, A) P(b2 | C, A, b1) · · · P(bm | C, A, b1, . . . , bm−1) = P(b1 | C) P(b2 | C, b1) · · · P(bm | C, b1, . . . , bm−1) = P(B | C),
and thus A ⊥ B | C, which completes the proof.

B. Missing Proofs of Prefix Vertex Set

B.1. Proof of Lemma 4.5

We restate the lemma below.

Lemma 4.5. Let S be a prefix vertex set. If w ∈ D_S, then Des[w] ⊆ D_S. Furthermore, D_S ∩ src(S̄) = ∅. The same properties hold for E_S.

B.1.1. Type-I Set D_S

Proof of Lemma 4.5 for D_S. We first show that if w ∈ D_S, then it must hold that Des[w] ⊆ D_S. Since w ∈ D_S, there exist a vertex u ∈ V and v ∉ S such that u ⊥ v | S and u ⊥̸ v | S ∪ {w}. We now show that for any x ∈ Des[w], we have u ⊥̸ v | S ∪ {x}. Since u ⊥̸ v | S ∪ {w}, there is a path P from u to v that is active given S ∪ {w}. Therefore, all non-colliders on P are not in S ∪ {w} and all colliders on P are in Anc[S ∪ {w}]. Since x ∈ Des[w], all colliders on P are in Anc[S ∪ {x}]. If all non-colliders are not in S ∪ {x}, then P is an active path from u to v given S ∪ {x}, and thus u ⊥̸ v | S ∪ {x}. Otherwise, there is a non-collider on P that equals x.

Since u ⊥ v | S, the path P is inactive given S. From above we know that all non-colliders on P are not in S. Therefore there exists a collider on P that is not in Anc[S]. Suppose the leftmost and rightmost such colliders are k, k′ (it is possible that k = k′); then k, k′ must be in Anc[S ∪ {w}] \ Anc[S] ⊆ Anc[w] ⊆ Anc[x]. Consider the path Q in the graph obtained by cutting out the parts between k, x (and k′, x) on P and replacing them with directed paths from k to x (and from k′ to x). Compared to P, the additional non-colliders on Q all lie on the directed paths from k to x (or k′ to x). They are not in S since k, k′ ∉ Anc[S], and thus Q has no non-colliders in S. Compared to P, no collider of P that is not in Anc[S] remains on Q, by the fact that k, k′ are the leftmost and rightmost colliders on P that are not in Anc[S]. Therefore, x must be a collider on Q, or else Q would be active given S and u ⊥̸ v | S. Therefore all non-colliders on Q are not in S ∪ {x}. Every collider on Q is either x or a collider of P, which is in Anc[S ∪ {x}]. Thus Q is active given S ∪ {x}, and therefore u ⊥̸ v | S ∪ {x}.

Next we show that D_S ∩ src(S̄) = ∅. For contradiction, assume that there exists a vertex a ∈ src(S̄) such that a ∈ D_S, that is, a ∈ src(S̄) and, for some vertices u ∈ V and v ∉ S, v ⊥ u | S and v ⊥̸ u | S ∪ {a}. Since v ⊥ u | S and v ⊥̸ u | S ∪ {a}, there exists a path P between v and u which is inactive when conditioned on S but active upon conditioning on S ∪ {a}. Moreover, this path contains a vertex b that is a collider on P and satisfies a ∈ Des[b] and Des[b] ∩ S = ∅. Since Des[b] ∩ S = ∅, we have that b ∉ S. Furthermore, since b ∉ S, a ∈ src(S̄) and a ∈ Des[b], this implies that b = a. Therefore, the path P takes the form P = v ··· → a ← ··· u.
All the colliders on the path P either belong to or have a descendant in the set S ∪ {a}. Now consider the sub-path Q of P from v to a and note that it is active given S. Let k be the number of vertices between v and a on this sub-path v ··· v1 ··· vk → a. It is immediate that vk ∈ S since a ∈ src(S̄). However, since vk ∈ S, and since we condition on the set S, vk would have to be a collider for the path Q to be active, which is not possible given the outgoing edge vk → a. Thus we get a contradiction, which completes the proof.

B.1.2. Type-II Set E_S

Proof of Lemma 4.5 for E_S. We first show that if w ∈ E_S, then it must hold that y ∈ E_S for any y ∈ Des(w). Since w ∈ E_S, there are u ∈ S and v, v′ ∈ S̄ \ D_S such that u ⊥ v′ | S ∪ {v} and u ⊥̸ v′ | S ∪ {v, w}. We will show u ⊥̸ v′ | S ∪ {v, y}. Since u ⊥ v′ | S ∪ {v} and u ⊥̸ v′ | S ∪ {v, w}, by Lemma 3.6 (note that the set S in the exposition of Lemma 3.6 can be an arbitrary subset), we know that v′ ∉ Des[w].

Assume on the contrary that u ⊥ v′ | S ∪ {v, y}. Since u ⊥ v′ | S ∪ {v} and u ⊥̸ v′ | S ∪ {v, w}, there is a path P between u, v′ that is active given S ∪ {v, w} but inactive given S ∪ {v} or S ∪ {v, y}. This means that (1) y is a non-collider on P, (2) all colliders on P are in Anc[S ∪ {v, w}], and (3) P has a collider that is in Anc[w] \ Anc[S ∪ {v}]. Note that v′ ∉ Des[y] since y ∈ Des(w) but v′ ∉ Des[w]. In addition, we also have u ∉ Des[w] since u is in the prefix vertex set S but w ∉ S. Therefore, y being a non-collider on P means there is a collider x on P such that y ∈ Anc(x). Note that this collider has to be in Anc[S ∪ {v, w}]. However, since y ∉ Anc[S ∪ {w}] and y ∈ Anc(x), it must hold that x ∈ Anc[v], which means y ∈ Anc(v). Since y ∈ Des(w), this means v ∈ Des(w), which makes Anc[w] \ Anc[S ∪ {v}] = ∅. This violates (3) above, a contradiction.

Next we show that E_S ∩ src(S̄) = ∅. Assume there exists w ∈ S̄ \ D_S such that w ∈ E_S ∩ src(S̄). Then u ⊥ v′ | S ∪ {v} and u ⊥̸ v′ | S ∪ {v, w} for some u ∈ S and v, v′ ∈ S̄ \ D_S. Thus there is a path P connecting u and v′ such that P is active given S ∪ {v, w} but inactive given S ∪ {v}. Therefore all non-colliders on P are not in S ∪ {v, w}, and there is a collider x on P such that x ∈ Anc[w]. Note that since w ∈ src(S̄) and S is a prefix set, all parents of x belong to S (if x = w then Pa(w) ⊆ S because w ∈ src(S̄); otherwise x ∈ Anc(w) ⊆ S and Pa(x) ⊆ S by the prefix property). Since x is a collider, there is y ≠ u such that y → x on P; hence y ∈ S. However, y is then a non-collider on P that lies in S, which contradicts P being active given S ∪ {v, w} and completes the proof.

B.2. Proof of Lemma 4.6

Lemma 4.6. Let S be a prefix vertex set. If w ∈ F_S, then Des[w] ⊆ E_S ∪ F_S. Furthermore, F_S ∩ src(S̄) = ∅.

Proof. We first show that if w ∈ F_S, then for any y ∈ Des(w), we have y ∈ E_S ∪ F_S. Since w ∈ F_S, we have u ⊥̸ v | S, u ⊥ w | S ∪ {v}, and v ⊥̸ w | S for some u ∈ S and v ∈ S̄ \ D_S. Since v ⊥̸ w | S, there is an active path between v, w given S. Consider extending this path by the directed path from w to y. Note that none of the vertices on the directed path from w to y are in S, since S is prefixed and w ∉ S. Therefore, this extended path is also active given S, which means v ⊥̸ y | S. Thus, if y ∉ F_S, then it must hold that u ⊥̸ y | S ∪ {v}. This means there is an active path, denoted by P, between u and y given S ∪ {v}. Consider extending this path by the directed path from w to y (which exists in the graph), and denote the extended path by Q. Compared to P, the additional non-colliders on Q are not in S ∪ {v}: for S, this is because S is prefix, w ∉ S, and all additional non-colliders are descendants of w; for v, this is because v ∉ Des(w) (by Lemma 3.8, u ⊥̸ v | S and u ⊥ w | S ∪ {v}). Thus Q is active given S ∪ {v} unless y is a collider on Q. Since u ⊥ w | S ∪ {v}, the path Q must be inactive given S ∪ {v}, which means y is a collider on Q.
B.2. Proof of Lemma 4.6

Lemma 4.6. Let $S$ be a prefix vertex set. If $w \in F_S$, then $\mathrm{Des}[w] \subseteq E_S \cup F_S$. Furthermore, $F_S \cap \mathrm{src}(\bar S) = \emptyset$.

Proof. We first show that if $w \in F_S$, then for any $y \in \mathrm{Des}(w)$, we have $y \in E_S \cup F_S$. Since $w \in F_S$, we have $u \not\perp v \mid S$, $u \perp w \mid S \cup \{v\}$, and $v \not\perp w \mid S$ for some $u \in S$ and $v \in \bar S \setminus D_S$. Since $v \not\perp w \mid S$, there is an active path between $v$ and $w$ given $S$. Consider extending this path by the directed path from $w$ to $y$. Note that none of the vertices on the directed path from $w$ to $y$ is in $S$, since $S$ is prefix and $w \notin S$. Therefore, this extended path is also active given $S$, which means $v \not\perp y \mid S$. Thus, if $y \notin F_S$, it must hold that $u \not\perp y \mid S \cup \{v\}$. This means there is an active path, denoted by $P$, between $u$ and $y$ given $S \cup \{v\}$. Consider the path $Q$ between $u$ and $w$ obtained by extending $P$ with the directed path from $w$ to $y$ (which exists in the graph). Compared to $P$, the additional non-colliders on $Q$ are not in $S \cup \{v\}$: for $S$, this is because $S$ is prefix, $w \notin S$, and all additional non-colliders are descendants of $w$; for $v$, this is because $v \notin \mathrm{Des}(w)$ (by Lemma 3.8, applied to $u \not\perp v \mid S$ and $u \perp w \mid S \cup \{v\}$). Thus $Q$ is active given $S \cup \{v\}$ unless $y$ is a collider on $Q$. Since $u \perp w \mid S \cup \{v\}$, the path $Q$ must be inactive given $S \cup \{v\}$, which means $y$ is a collider on $Q$. This means $Q$ is active given $S \cup \{v, y\}$, and therefore $u \not\perp w \mid S \cup \{v, y\}$. Together with $u \perp w \mid S \cup \{v\}$, we have $y \in E_S$.

Next we show that $F_S \cap \mathrm{src}(\bar S) = \emptyset$. Assume on the contrary that $w \in F_S \cap \mathrm{src}(\bar S)$. Since $w \in F_S$, we have $u \not\perp v \mid S$, $v \not\perp w \mid S$ and $u \perp w \mid S \cup \{v\}$ for some $u \in S$ and $v \in \bar S \setminus D_S$. By Lemma 3.8, we have $v \notin \mathrm{Des}[w]$. However, since $v \not\perp w \mid S$, there must be an active path $P$ between $v$ and $w$ given $S$. This path cannot have any vertex in $S$: otherwise, consider the first vertex on it that is in $S$; since $v \notin S$ and $S$ is prefix, such a vertex must be a non-collider, which would make $P$ inactive given $S$. Therefore $P$ lies fully in $\bar S$. Since $w \in \mathrm{src}(\bar S)$, the edge of $P$ at $w$ must point out of $w$, i.e., $P : v - \cdots \leftarrow w$. However, since $v \notin \mathrm{Des}[w]$, $P$ cannot be the directed path from $w$ to $v$, and hence there must be a collider on $P$. Since this collider is not in $S$ and $S$ is prefix, it makes $P$ inactive given $S$, a contradiction.

B.3. Remarks

The above proofs suffice as intermediate results to show Theorem 4.4, which we proved in Section 4.2. Regarding Lemma 4.7 in Section 4.3, we will prove it in Appendix C after proving Lemma 5.1, since it depends on Lemma 5.1.

C. Missing Proofs of Causally Consistent Partition Graph Representations

C.1. Proof of Lemma 5.1

To prove Lemma 5.1, we will make use of the following lemma.

Lemma C.1. Let $S \subseteq V$ be a prefix subset. Then the following statements hold:
- $u \perp v \mid S$ for all distinct $u, v \in \mathrm{src}(\bar S)$;
- $S \cup U$ is a prefix subset for any $U \subseteq \mathrm{src}(\bar S)$.

Proof. We first prove the first statement. For contradiction, assume that $u \not\perp v \mid S$, which implies that there exists an active path between $u$ and $v$ when conditioned on $S$. Let $P = u - u_1 - \cdots - u_k - v$ be this path, with $u_1, \ldots, u_k$ the vertices along it. Note that $k > 0$, as $u$ and $v$ cannot be adjacent: an edge between $u$ and $v$ would mean that one of these vertices is not a source vertex of $\bar S$. Consider $u_1$ and note that there are two possibilities: $u \rightarrow u_1$ or $u \leftarrow u_1$. We start with the first case, $u \rightarrow u_1$. Since $S$ is a prefix subset and $u \notin S$, $u \rightarrow u_1$ implies that $u_1 \notin S$. Since only vertices of $S$ are conditioned upon and $\mathrm{Anc}[S] = S$ (as $S$ is prefix), the vertex $u_1 \notin S$ cannot be a collider on the active path $P$, and hence the edge between $u_1$ and $u_2$ is directed as $u_1 \rightarrow u_2$. Repeating the same argument for $u_2$ and all the other vertices on the path, we see that $P$ takes the form $P = u \rightarrow u_1 \rightarrow \cdots \rightarrow u_k \rightarrow v$. This implies that $v \in \mathrm{Des}[u]$ and therefore $v$ does not belong to $\mathrm{src}(\bar S)$, a contradiction to our assumption that $u, v \in \mathrm{src}(\bar S)$. Now consider the other case, $u \leftarrow u_1$. Since $u \in \mathrm{src}(\bar S)$, it is immediate that $u_1 \in S$. Furthermore, irrespective of the orientation between $u_1$ and $u_2$, the vertex $u_1$ is a non-collider on the path $P$. Moreover, since $P$ is an active path, $u_1$ must not be conditioned upon. However, we condition on $S$ and $u_1 \in S$, a contradiction. In the above case analysis, we showed that there does not exist an active path between $u$ and $v$ when conditioned on $S$, which implies $u \perp v \mid S$ and concludes the proof of the first statement.

In the remainder of the proof, we focus on the second statement. Consider any subset $U \subseteq \mathrm{src}(\bar S)$. For contradiction, assume that $S \cup U$ is not a prefix subset, which implies that there exists a vertex $v \in V \setminus (S \cup U)$ such that $v \in \mathrm{Anc}[u]$ for some vertex $u \in S \cup U$. Since $S$ is a prefix subset, it is immediate that $u \notin S$. Therefore, the only possibility is $u \in U$. Note that both $u$ and $v$ belong to $\bar S$ and $v \in \mathrm{Anc}(u)$; combined, these contradict the fact that $u \in \mathrm{src}(\bar S)$. Therefore, $S \cup U$ must be a prefix subset, and we conclude the proof.
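Lemma C.1 manipulates two purely graphical notions: prefix (ancestrally closed) vertex sets and the sources of the complement $\bar S$. A small sketch of both, in the same parents-dictionary representation used above (function names are ours):

```python
def is_prefix(parents, S):
    """S is a prefix vertex set iff it is closed under parents
    (equivalently, under ancestors)."""
    S = set(S)
    return all(p in S for v in S for p in parents[v])

def src_of_complement(parents, S):
    """src(S_bar): vertices outside S with no parent outside S, i.e.,
    the sources of the subgraph induced on S_bar = V \\ S."""
    S = set(S)
    return {v for v in parents if v not in S and parents[v] <= S}

# Toy check on the chain a -> b -> c: {"a"} is a prefix, b is the unique
# source of its complement, and (per Lemma C.1) {"a", "b"} is again a prefix.
chain = {"a": set(), "b": {"a"}, "c": {"b"}}
assert is_prefix(chain, {"a"}) and src_of_complement(chain, {"a"}) == {"b"}
assert is_prefix(chain, {"a", "b"})
```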
Now we prove Lemma 5.1, restated below.

Lemma 5.1. Let $S$ be a prefix subset. For any $w \in \bar S \setminus D_S$, we have $|\mathrm{Anc}[w] \cap \mathrm{src}(\bar S)| = 1$. Furthermore, for any other $w' \in \bar S \setminus D_S$, we have $w \not\perp w' \mid S$ if and only if $\mathrm{Anc}[w] \cap \mathrm{src}(\bar S) = \mathrm{Anc}[w'] \cap \mathrm{src}(\bar S)$.

Proof. We first show that for any $w \in \bar S \setminus D_S$, we have $|\mathrm{Anc}[w] \cap \mathrm{src}(\bar S)| = 1$. Since $w \notin S$, it must hold that $|\mathrm{Anc}[w] \cap \mathrm{src}(\bar S)| \geq 1$. Assume now that there is $w \in \bar S \setminus D_S$ such that $|\mathrm{Anc}[w] \cap \mathrm{src}(\bar S)| \geq 2$, and let $v_1, v_2 \in \mathrm{Anc}[w] \cap \mathrm{src}(\bar S)$ with $v_1 \neq v_2$. By Lemma C.1, we have $v_1 \perp v_2 \mid S$. Now consider the path $P$ obtained by stitching together two directed paths, one from $v_1$ to $w$ and another from $v_2$ to $w$. All non-colliders on $P$ are not in $S$ since $S$ is a prefix subset. The only collider on $P$ is $w$. Therefore $P$ is active given $S \cup \{w\}$. We thus have $v_1 \not\perp v_2 \mid S \cup \{w\}$, which means $w \in D_{v_1, S} \subseteq D_S$, a contradiction to $w \notin D_S$.

Next we show that for any other $w' \in \bar S \setminus D_S$, we have $w \not\perp w' \mid S$ if and only if $\mathrm{Anc}[w] \cap \mathrm{src}(\bar S) = \mathrm{Anc}[w'] \cap \mathrm{src}(\bar S)$. For the "if" direction, let $\{s\} = \mathrm{Anc}[w] \cap \mathrm{src}(\bar S) = \mathrm{Anc}[w'] \cap \mathrm{src}(\bar S)$ and consider the trek obtained by joining the two directed paths from $s$ to $w$ and from $s$ to $w'$. Since this path has no colliders and lies fully in $\bar S$ (since $S$ is prefix and $s \notin S$), it is active given $S$. Thus $w \not\perp w' \mid S$. For the "only if" direction, assume on the contrary that $w \not\perp w' \mid S$ but $\mathrm{Anc}[w] \cap \mathrm{src}(\bar S) \neq \mathrm{Anc}[w'] \cap \mathrm{src}(\bar S)$. Let $P$ be the active path between $w$ and $w'$ given $S$. Then all non-colliders on $P$ are not in $S$ and all colliders on $P$ are in $\mathrm{Anc}[S] = S$. This means that $P$ has no colliders: otherwise, such a collider would be in $S$ while being a child of the vertex next to it on $P$, which is a non-collider (or an endpoint) not in $S$; this would give an edge from $\bar S$ into $S$, a contradiction with $S$ being prefix. Since $P$ has no colliders, it must satisfy $P \cap \mathrm{Anc}[w] \cap \mathrm{Anc}[w'] \neq \emptyset$. However, since $\mathrm{Anc}[w] \cap \mathrm{src}(\bar S) \neq \mathrm{Anc}[w'] \cap \mathrm{src}(\bar S)$ while $|\mathrm{Anc}[w] \cap \mathrm{src}(\bar S)| = |\mathrm{Anc}[w'] \cap \mathrm{src}(\bar S)| = 1$, we must have $P \cap \mathrm{Anc}[w] \cap \mathrm{Anc}[w'] \subseteq S$. This means there is a non-collider on $P$ that is in $S$, which would make $P$ inactive given $S$, a contradiction.
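Lemma 5.1 is what makes the components of the partition computable from few tests: on $\bar S \setminus D_S$, dependence given $S$ coincides with sharing the unique source ancestor and is therefore an equivalence relation. Consequently, comparing each vertex against a single representative of each existing block suffices, keeping the number of CI tests at most quadratic. A minimal sketch under these assumptions (`ci` is a CI oracle as before; names ours, not the paper's code):

```python
def blocks_by_source(rest, S, ci):
    """Partition rest = V \\ (S u D_S) into blocks of vertices sharing the
    same unique source ancestor in src(S_bar), using only CI tests: by
    Lemma 5.1, two vertices share it iff they are dependent given S."""
    S = set(S)
    blocks = []
    for w in rest:
        for block in blocks:
            if not ci(w, block[0], S):   # dependent given S: same source
                block.append(w)
                break
        else:
            blocks.append([w])
    return blocks
```

Testing against `block[0]` alone is sound precisely because the lemma characterizes the relation through equality of (singleton) source-ancestor sets, which is transitive.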
C.2. Proof of Lemma 4.7

To prove Lemma 4.7, we will make use of the following results.

Lemma C.2. Let $S$ be a prefix subset and $w \in \bar S \setminus (D_S \cup \mathrm{src}(\bar S))$. By Lemma 5.1, let $\mathrm{Anc}[w] \cap \mathrm{src}(\bar S) = \{v\}$. Then for any $u \in S$, if there is no directed path from $u$ to $w$ that does not intersect $\mathrm{src}(\bar S)$, then $u \perp w \mid S \cup \{v\}$.

Proof. Assume on the contrary that $u \not\perp w \mid S \cup \{v\}$, and let $P$ be an active path from $u$ to $w$ given $S \cup \{v\}$. First, $P \cap S = \{u\}$: otherwise, consider the last vertex on $P$ that is in $S$. Since $w \notin S$ and $S$ is prefix, this vertex must point into a vertex outside $S$ on $P$, which makes it a non-collider; however, it belongs to $S$, which means $P$ is inactive given $S \cup \{v\}$. There is also no collider on $P$: otherwise, consider the first collider; it is not in $S$ since $P \cap S = \{u\}$, yet, since $P$ is active, it must lie in $\mathrm{Anc}[S \cup \{v\}]$. This is impossible since $v \in \mathrm{src}(\bar S)$ and, from Lemma C.1, the set $S \cup \{v\}$ is again prefix. Therefore, $P$ must be a directed path from $u$ to $w$, where the vertex $x$ adjacent to $u$ is not in $S$; in particular, $x \in \mathrm{Anc}[w]$. If $x \notin \mathrm{src}(\bar S)$, then $P$ is a directed path from $u$ to $w$ that does not intersect $\mathrm{src}(\bar S)$ (every later vertex on $P$ has its parent on $P$ outside $S$ and hence cannot be in $\mathrm{src}(\bar S)$), contradicting the assumption of the lemma. If $x \in \mathrm{src}(\bar S)$, then, since $\mathrm{Anc}[w] \cap \mathrm{src}(\bar S) = \{v\}$, it must hold that $x = v$. This means $P$ is inactive given $S \cup \{v\}$, a contradiction.

Lemma C.3. Let $S$ be a prefix subset and $v \in \mathrm{src}(\bar S)$. Then for any $w \in \bar S$, either $v \perp w \mid S$ or $w \in \mathrm{Des}[v]$.

Proof. Suppose $w \notin \mathrm{Des}[v]$; we will show that $v \perp w \mid S$. Assume on the contrary that $v \not\perp w \mid S$, and let $P$ be the active path between $v$ and $w$ given $S$. If $P$ intersects $S$, then consider the last vertex on $P$ that is in $S$. This vertex must be a non-collider since $w \notin S$ and $S$ is prefix, which contradicts $P$ being active given $S$. Therefore, $P$ lies fully in $\bar S$. Since $v \in \mathrm{src}(\bar S)$, the edge of $P$ at $v$ points out of $v$, i.e., $P : v \rightarrow \cdots - w$. Since $w \notin \mathrm{Des}[v]$, $P$ cannot be a directed path from $v$ to $w$, so there must be a collider on $P$. However, this collider is not in $S$ and $S$ is prefix, which means that $P$ is inactive given $S$, a contradiction.

We now prove Lemma 4.7, restated below.

Lemma 4.7. Let $S$ be a prefix vertex set. For $w \in \bar S \setminus D_S$, if $w \notin \mathrm{src}(\bar S)$ and there is no covered edge from $\mathrm{Anc}[w] \cap \mathrm{src}(\bar S)$ to $\mathrm{Anc}[w]$, then $w \in E_S \cup F_S$.

Proof of Lemma 4.7. Assume $w \notin \mathrm{src}(\bar S)$. For contradiction, assume that there is no covered edge from $\mathrm{Anc}[w] \cap \mathrm{src}(\bar S)$ to $\mathrm{Anc}[w]$ and $w \in \bar S \setminus (D_S \cup E_S \cup F_S)$. Since $w \notin \mathrm{src}(\bar S)$, there is $v \in \mathrm{src}(\bar S) \cap \mathrm{Anc}[w]$. As $v \in \mathrm{Anc}[w]$, there is a directed path from $v$ to $w$. Consider the longest directed path $P$ from $v$ to $w$, and let $v'$ be the vertex adjacent to $v$ on $P$, i.e., $P : v \rightarrow v' \rightarrow \cdots \rightarrow w$. By Lemma 4.5 and $w \notin D_S$, we must have $v' \notin D_S$. Since $v \rightarrow v'$ is not a covered edge, there are only two possible cases:

- There is $u \neq v'$ such that $u \in \mathrm{Pa}(v) \setminus \mathrm{Pa}(v')$. Since $v \in \mathrm{src}(\bar S)$, we must have $u \in S$. Note that $u \not\perp v \mid S$ and $v \not\perp w \mid S$; since $w \notin F_S$, we must therefore have $u \not\perp w \mid S \cup \{v\}$. By Lemma C.2, there must exist a directed path $Q$ from $u$ to $w$ that does not intersect $\mathrm{src}(\bar S)$. Consider the path between $u$ and $v'$ obtained by joining $Q$ and the directed path from $v'$ to $w$. Since this path does not intersect $S \cup \mathrm{src}(\bar S)$ (other than at $u$) and $w$ is the only collider on it, this path is active given $S \cup \{v\} \cup \{w\}$. Thus $u \not\perp v' \mid S \cup \{v\} \cup \{w\}$. Since $w \notin E_S$, we must have $u \not\perp v' \mid S \cup \{v\}$. By Lemma C.2 (applied to $v'$), there must exist a directed path from $u$ to $v'$ that does not intersect $\mathrm{src}(\bar S)$. Since $u \notin \mathrm{Pa}(v')$, this path has length at least two. Let $k$ be the vertex on this path such that $k \rightarrow v'$. Note that $v \rightarrow v' \leftarrow k$, and therefore $v \not\perp k \mid S \cup \{v'\}$. Since $v' \notin D_S$, we must have $v \not\perp k \mid S$. By Lemma C.3, we must have $k \in \mathrm{Des}[v]$. Thus we can increase the length of the directed path $P$ by replacing $v \rightarrow v'$ with $v \rightarrow \cdots \rightarrow k \rightarrow v'$. This contradicts $P$ being the longest directed path from $v$ to $w$.

- There is $k \neq v$ such that $k \in \mathrm{Pa}(v') \setminus \mathrm{Pa}(v)$. Note that $v \rightarrow v' \leftarrow k$, and therefore $v \not\perp k \mid S \cup \{v'\}$. Since $v' \notin D_S$, we must have $v \not\perp k \mid S$. By Lemma C.3, we must have $k \in \mathrm{Des}[v]$. Thus we can increase the length of the directed path $P$ by replacing $v \rightarrow v'$ with $v \rightarrow \cdots \rightarrow k \rightarrow v'$. This contradicts $P$ being the longest directed path from $v$ to $w$.

This completes the proof.

C.3. Remarks

The above proofs suffice as intermediate results to show Theorem 3.3, which we proved in Section 5.

D. Missing Proofs of Interventions

D.1. Proof of Lemma 6.1

Lemma 6.1. Let $S$ be a prefix vertex set, and consider an intervention on the vertices $I \subseteq V$. For any $u \notin I$, we have $u \not\perp_I v$ for some $v \in I \setminus S$ if and only if $u \in \mathrm{Des}(I \setminus S) \setminus (I \setminus S)$, where $\not\perp_I$ denotes marginal dependence in the distribution obtained by intervening on $I$.

Proof. For each $v \in I$, denote by $\mathrm{Des}(v, I)$ the set of vertices $u \notin I$ reachable from $v$ by a directed path that is not cut by $I$. We first show that, for $u \notin I$, $u \not\perp_I v$ if and only if $u \in \mathrm{Des}(v, I)$. For the "if" direction, let $u \in \mathrm{Des}(v, I)$ and suppose for contradiction that $u \perp_I v$. The defining directed path from $v$ to $u$ survives in the modified DAG $G_I$, which means it is active given $\emptyset$, a contradiction. For the "only if" direction, let $u \notin I$ be such that $u \not\perp_I v$. Then there is an active path $P : u - \cdots - v$ in the modified DAG $G_I$ given $\emptyset$. Since $P$ is active given $\emptyset$, there is no collider on $P$. Furthermore, since all incoming edges to any vertex in $I$ are removed in the modified DAG, this path must be the directed path $P : u \leftarrow \cdots \leftarrow v$ and satisfies $P \cap I = \{v\}$. Thus $u \in \mathrm{Des}(v, I)$.

Next we show that $\bigcup_{v \in I \setminus S} \mathrm{Des}(v, I) = \mathrm{Des}(I \setminus S) \setminus (I \setminus S)$; this will prove the lemma. Since each $\mathrm{Des}(v, I)$ consists of descendants of $v$ reachable by a directed path from $v$ not cut by $I$, it is clear that $\bigcup_{v \in I \setminus S} \mathrm{Des}(v, I) \subseteq \mathrm{Des}(I \setminus S) \setminus (I \setminus S)$. Now let $u \in \mathrm{Des}(I \setminus S) \setminus (I \setminus S)$. Then there is $v \in I \setminus S$ such that $u \in \mathrm{Des}(v)$. Consider a directed path from $v$ to $u$ and let $v'$ be the last vertex on this path that is in $I$; then $u \in \mathrm{Des}(v', I)$. By the fact that $S$ is prefix, we have from $v \notin S$ that $v' \notin S$. Thus $u \in \mathrm{Des}(v', I) \subseteq \bigcup_{v \in I \setminus S} \mathrm{Des}(v, I)$, and hence $\mathrm{Des}(I \setminus S) \setminus (I \setminus S) \subseteq \bigcup_{v \in I \setminus S} \mathrm{Des}(v, I)$. We therefore have $\bigcup_{v \in I \setminus S} \mathrm{Des}(v, I) = \mathrm{Des}(I \setminus S) \setminus (I \setminus S)$.
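Lemma 6.1 is operational: it lets the algorithm discover which vertices lie downstream of the intervened set $I \setminus S$ using only marginal (unconditional) tests in the interventional distribution. A small sketch, where `ci_I(u, v)` is assumed to answer marginal independence of u and v under the intervention on I (for instance, a statistical test on interventional samples, or d-separation in the mutilated graph $G_I$); the function name is ours:

```python
def interventional_descendants(V, I, S, ci_I):
    """Des(I\\S) \\ (I\\S) via Lemma 6.1: u outside I lies below I\\S
    (along a directed path not cut by I) iff u is marginally dependent
    on some intervened vertex v in I\\S under the intervention on I."""
    I, S = set(I), set(S)
    return {
        u for u in V
        if u not in I and any(not ci_I(u, v) for v in I - S)
    }

# Per footnote 12 below, Des[I\S] itself is this set together with I\S.
```

This uses at most $|V| \cdot |I \setminus S|$ tests, again polynomial, in line with the paper's overall budget.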
D.2. Proof of Lemma 6.2

Lemma 6.2. Let $S$ be a prefix subset. Given an intervention $I$ and $v \in I \setminus S$, denote $H^I_S(v) = \{u \notin S \cup \mathrm{Des}[I \setminus S] : u \not\perp v \mid V \setminus \mathrm{Des}[I \setminus S]\}$, where, as usual, $u$ and $v$ themselves are excluded from the conditioning set.¹² Then $\mathrm{Pa}(v) \setminus (S \cup \mathrm{Des}[I \setminus S]) \subseteq H^I_S(v) \subseteq \mathrm{Anc}(v) \setminus (S \cup \mathrm{Des}[I \setminus S])$.

Proof. On the one hand, let $u \in \mathrm{Pa}(v) \setminus (S \cup \mathrm{Des}[I \setminus S])$. Clearly $u \notin S \cup \mathrm{Des}[I \setminus S]$ and, since $u$ and $v$ are adjacent, $u \not\perp v \mid V \setminus \mathrm{Des}[I \setminus S]$. Thus $u \in H^I_S(v)$, which proves $\mathrm{Pa}(v) \setminus (S \cup \mathrm{Des}[I \setminus S]) \subseteq H^I_S(v)$.

On the other hand, let $u \in H^I_S(v)$. Since $u \not\perp v \mid V \setminus \mathrm{Des}[I \setminus S]$, let $P : u - \cdots - v$ be an active path given $V \setminus \mathrm{Des}[I \setminus S]$. Then all non-colliders on $P$ are in $\mathrm{Des}[I \setminus S]$, and all colliders on $P$ are in $\mathrm{Anc}[V \setminus \mathrm{Des}[I \setminus S]]$. Note that $\mathrm{Des}[I \setminus S] \cap \mathrm{Anc}[V \setminus \mathrm{Des}[I \setminus S]] = \emptyset$: otherwise, let $w \in \mathrm{Des}[I \setminus S] \cap \mathrm{Anc}[V \setminus \mathrm{Des}[I \setminus S]]$. Since $w \in \mathrm{Anc}[V \setminus \mathrm{Des}[I \setminus S]]$, there is $w' \notin \mathrm{Des}[I \setminus S]$ such that $w' \in \mathrm{Des}[w]$. However $w \in \mathrm{Des}[I \setminus S]$, which means $w' \in \mathrm{Des}[I \setminus S]$, a contradiction. If $P$ has a collider, write it as $y \rightarrow x \leftarrow z$ with $z$ the neighbor on the side of $v$; then $z$ is either a non-collider or the endpoint $v$. In either case we have $z \in \mathrm{Des}[I \setminus S]$, which implies $x \in \mathrm{Des}[I \setminus S]$. However, $x \in \mathrm{Anc}[V \setminus \mathrm{Des}[I \setminus S]]$ and $\mathrm{Des}[I \setminus S] \cap \mathrm{Anc}[V \setminus \mathrm{Des}[I \setminus S]] = \emptyset$, a contradiction. Thus there is no collider on $P$. Denote the vertex next to $u$ on $P$ by $t$, i.e., $P : u - t - \cdots - v$. Then $t$ is either a non-collider or the endpoint $v$; in either case $t \in \mathrm{Des}[I \setminus S]$. Therefore the edge must be oriented $u \rightarrow t$, as otherwise $u \in \mathrm{Des}[I \setminus S]$. Furthermore, as $P$ has no colliders, it must be that $P : u \rightarrow t \rightarrow \cdots \rightarrow v$. Thus $u \in \mathrm{Anc}(v)$. Together with $u \notin S \cup \mathrm{Des}[I \setminus S]$, we have $u \in \mathrm{Anc}(v) \setminus (S \cup \mathrm{Des}[I \setminus S])$. This proves $H^I_S(v) \subseteq \mathrm{Anc}(v) \setminus (S \cup \mathrm{Des}[I \setminus S])$.

¹²Note that $\mathrm{Des}[I \setminus S]$ can be obtained via Lemma 6.1, as $(\mathrm{Des}(I \setminus S) \setminus (I \setminus S)) \cup (I \setminus S)$.

D.3. Proof of Lemma 6.4

Lemma 6.4. Let $S$ be a prefix vertex set. Then for any $w \in J^I_S$, we have $\mathrm{Des}(w) \subseteq J^I_S$. Furthermore, $J^I_S \cap \mathrm{src}(\bar S) = \emptyset$.

Proof. Assume on the contrary that $v \in J^I_S$ but there is $w \in \mathrm{Des}(v)$ such that $w \notin J^I_S$. Since $v \in J^I_S$, there are two possible cases:
- $v \in \mathrm{Des}(I \setminus S)$. Then clearly $w \in \mathrm{Des}(v) \subseteq \mathrm{Des}(I \setminus S)$, a contradiction (as $\mathrm{Des}(I \setminus S) \subseteq J^I_S$).
- $v \in I \setminus S$ and there is $u \in H^I_S(v) \setminus S$. Then $w \in \mathrm{Des}(v) \subseteq \mathrm{Des}(I \setminus S)$, a contradiction.

Next we show that $\mathrm{src}(\bar S) \cap J^I_S = \emptyset$. Note that by $\mathrm{src}(\bar S) \cap \mathrm{Des}(\bar S) = \emptyset$ and $\mathrm{Des}(I \setminus S) \subseteq \mathrm{Des}(\bar S)$, we have $\mathrm{src}(\bar S) \cap \mathrm{Des}(I \setminus S) = \emptyset$. In addition, for any $v \in \mathrm{src}(\bar S) \cap (I \setminus S)$, by Lemma 6.2 we have $H^I_S(v) \subseteq \mathrm{Anc}(v) \setminus S = \emptyset$. Thus $\mathrm{src}(\bar S) \cap \{v \in I \setminus S : H^I_S(v) \setminus S \neq \emptyset\} = \emptyset$. We then have $\mathrm{src}(\bar S) \cap J^I_S = \emptyset$.
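Putting Lemmas 6.1 through 6.4 together operationally: $H^I_S(v)$ costs one CI test per candidate vertex, and $J^I_S$, as it is used in the proof above (this is our reading; the formal definition appears in Section 6 of the paper), collects the interventional descendants together with those intervened vertices that still have unexplained ancestors outside $S$. A hedged sketch with our own function names; `des_IS` and `des_proper` are the sets delivered by Lemma 6.1 and footnote 12:

```python
def H_set(v, V, S, des_IS, ci):
    """H^I_S(v) from Lemma 6.2: conditioning on everything outside
    Des[I\\S] (minus the tested pair), the vertices dependent on v are
    sandwiched between Pa(v) and Anc(v), outside S u Des[I\\S]."""
    V, S, des_IS = set(V), set(S), set(des_IS)
    C = V - des_IS
    return {
        u for u in V - S - des_IS
        if u != v and not ci(u, v, C - {u, v})
    }

def J_set(V, I, S, des_proper, H_sets):
    """J^I_S as used in Lemma 6.4 (our reading): Des(I\\S) together with
    the intervened vertices v in I\\S whose H^I_S(v) is nonempty outside
    S. Here des_proper = Des(I\\S)\\(I\\S) and H_sets maps v to H^I_S(v)."""
    IS = set(I) - set(S)
    return set(des_proper) | {v for v in IS if H_sets[v] - set(S)}
```

By Lemma 6.4, the set returned by `J_set` is closed under taking descendants and avoids $\mathrm{src}(\bar S)$, mirroring the properties proven for $D_S$, $E_S$ and $F_S$ in the observational setting.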
D.4. Remarks

The above proofs suffice as intermediate results for proving Theorem 6.5. Together with Lemma 6.6 (proven in Section 6.1), they then allow us to prove Theorem 3.4, which is given in Section 6.2.

E. Details of Numerical Experiments

Implementation Details. For FCI and RFCI, we used the implementations in Kalisch et al. (2024), which are written in R with C++ accelerations. For PC and GSP, we used the implementation in Squires, which is written in Python. Our method, CCPG, is written in Python. The acceleration provided by R (with C++) can be seen by comparing the two implementations of PC (stable) in Figure 6.

Remark on causal sufficiency. Among the constraint-based methods, we marked the ones that do not assume causal sufficiency in Figure 7. These methods run additional tests to check for unobserved causal variables, and therefore might require more samples compared to, e.g., PC, since the underlying systems we test on satisfy causal sufficiency.