# Invariant Ancestry Search

Phillip B. Mogensen¹, Nikolaj Thams¹, Jonas Peters¹

¹ Department of Mathematical Sciences, University of Copenhagen, Denmark. Correspondence to: Phillip B. Mogensen, Jonas Peters.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

## Abstract

Recently, methods have been proposed that exploit the invariance of prediction models with respect to changing environments to infer subsets of the causal parents of a response variable. If the environments influence only a few of the underlying mechanisms, the subset identified by invariant causal prediction (ICP), for example, may be small, or even empty. We introduce the concept of minimal invariance and propose invariant ancestry search (IAS). In its population version, IAS outputs a set which contains only ancestors of the response and is a superset of the output of ICP. When applied to data, corresponding guarantees hold asymptotically if the underlying test for invariance has asymptotic level and power. We develop scalable algorithms and perform experiments on simulated and real data.

## 1. Introduction

Causal reasoning addresses the challenge of understanding why systems behave the way they do and what happens if we actively intervene. Such mechanistic understanding is inherent to human cognition, and developing statistical methodology that learns and utilizes causal relations is a key step in improving both narrow and broad AI (Jordan, 2019; Pearl, 2018). Several approaches exist for learning causal structures from observational data. Approaches such as the PC-algorithm (Spirtes et al., 2000) or greedy equivalence search (Chickering, 2002) learn (Markov equivalent) graphical representations of the causal structure (Lauritzen, 1996). Other approaches learn the graphical structure under additional assumptions, such as non-Gaussianity (Shimizu et al., 2006) or non-linearity (Hoyer et al., 2009; Peters et al., 2014). Zheng et al. (2018) convert the problem into a continuous optimization problem, at the expense of identifiability guarantees.

Invariant causal prediction (ICP) (Peters et al., 2016; Heinze-Deml et al., 2018; Pfister et al., 2019; Gamella & Heinze-Deml, 2020; Martinet et al., 2021) assumes that data are sampled from heterogeneous environments (which can be discrete, categorical or continuous) and identifies direct causes of a target $Y$, also known as causal parents of $Y$. Learning ancestors (or parents) of a response $Y$ yields understanding of anticipated changes when intervening in the system. It is a less ambitious task than learning the complete graph, but may allow for methods that come with weaker assumptions and stronger guarantees. More concretely, for predictors $X_1, \dots, X_d$, ICP searches for subsets $S \subseteq \{1, \dots, d\}$ that are invariant; a set $X_S$ of predictors is called invariant if it renders $Y$ independent of the environment, conditional on $X_S$. ICP then outputs the intersection of all invariant predictor sets, $S_{\mathrm{ICP}} := \bigcap_{S\ \text{invariant}} S$. Peters et al. (2016) show that if invariance is tested empirically from data at level $\alpha$, the resulting intersection $\hat S_{\mathrm{ICP}}$ is a subset of the direct causes of $Y$ with probability at least $1 - \alpha$.¹ In many cases, however, the set learned by ICP forms a strict subset of all direct causes or may even be empty.
This is because disjoint sets of predictors can be invariant, yielding an empty intersection; this may happen both for finite samples and in the population setting.

In this work, we introduce and characterize minimally invariant sets of predictors, that is, invariant sets $S$ for which no proper subset is invariant. We propose to consider the union $S_{\mathrm{IAS}}$ of all minimally invariant sets, where IAS stands for invariant ancestry search. We prove that $S_{\mathrm{IAS}}$ is a subset of the causal ancestors of $Y$, invariant, non-empty and contains $S_{\mathrm{ICP}}$. Learning causal ancestors of a response may be desirable for several reasons: e.g., they are the variables that may have an influence on the response variable when intervened on. In addition, because IAS yields an invariant set, it can be used to construct predictions that are stable across environments (e.g., Rojas-Carulla et al., 2018; Christiansen et al., 2022).

In practice, we estimate minimally invariant sets using a test for invariance. If such a test has asymptotic power against some of the non-invariant sets (specified in Section 5.2), we show that, asymptotically, the probability of $\hat S_{\mathrm{IAS}}$ being a subset of the ancestors is at least $1 - \alpha$. This puts stronger assumptions on the invariance test than ICP (which does not require any power) in return for discovering a larger set of causal ancestors. We prove that our approach retains the ancestral guarantee if we test minimal invariance only among subsets up to a certain size. This yields a computational speed-up compared to testing minimal invariance in all subsets, but comes at the cost of potentially finding fewer causal ancestors.

¹ Rojas-Carulla et al. (2018), Magliacane et al. (2018), Arjovsky et al. (2019) and Christiansen et al. (2022) propose techniques that consider similar invariance statements with a focus on distribution generalization instead of causal discovery.

The remainder of this work is organized as follows. In Section 2 we review relevant background material, and we introduce the concept of minimal invariance in Section 3. Section 4 contains an oracle algorithm for finding minimally invariant sets (and a closed-form expression of $S_{\mathrm{ICP}}$), and Section 5 presents theoretical guarantees when testing minimal invariance from data. In Section 6 we evaluate our method in several simulation studies as well as a real-world data set on gene perturbations. Code is provided at https://github.com/PhillipMogensen/InvariantAncestrySearch.

## 2. Preliminaries

### 2.1. Structural Causal Models and Graphs

We consider a setting where data are sampled from a structural causal model (SCM) (Pearl, 2009; Bongers et al., 2021)
$$Z_j := f_j(\mathrm{PA}_j, \epsilon_j),$$
for some functions $f_j$, parent sets $\mathrm{PA}_j$ and noise distributions $\epsilon_j$. Following Peters et al. (2016) and Heinze-Deml et al. (2018), we consider an SCM over variables $Z := (E, X, Y)$, where $E$ is an exogenous environment variable (i.e., $\mathrm{PA}_E = \emptyset$), $Y$ is a response variable and $X = (X_1, \dots, X_d)$ is a collection of predictors of $Y$. We denote by $\mathcal{P}$ the family of all possible distributions induced by an SCM over $(E, X, Y)$ of the above form.

For a collection of nodes $j \in [d] := \{1, \dots, d\}$ and their parent sets $\mathrm{PA}_j$, we define a directed graph $\mathcal{G}$ with nodes $[d]$ and edges $j' \to j$ for all $j' \in \mathrm{PA}_j$. We denote by $\mathrm{CH}_j$, $\mathrm{AN}_j$ and $\mathrm{DE}_j$ the children, ancestors and descendants of a variable $j$, respectively, neither containing $j$. A graph $\mathcal{G}$ is called a directed acyclic graph (DAG) if it does not contain any directed cycles. See Pearl (2009) for more details and the definition of d-separation.
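To make the setting concrete, here is a minimal simulation sketch (our illustration, not code from the paper) of an SCM over $(E, X, Y)$ in which the environment shifts a single mechanism; the chain graph $E \to X_1 \to X_2 \to Y$ and all coefficients are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Two environments: observational (E = 0) and interventional (E = 1).
E = rng.integers(0, 2, size=n)
# Chain SCM E -> X1 -> X2 -> Y; the environment shifts only the mechanism of X1.
X1 = 2.0 * E + rng.normal(size=n)
X2 = 1.5 * X1 + rng.normal(size=n)
Y = -1.0 * X2 + rng.normal(size=n)
X = np.column_stack([X1, X2])
```

In this graph, both $\{X_1\}$ and $\{X_2\}$ render $Y$ independent of $E$, so the invariant sets are $\{X_1\}$, $\{X_2\}$ and $\{X_1, X_2\}$; we reuse this example below.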
Throughout the remainder of this work, we make the following assumptions about causal sufficiency and exogeneity of $E$ (Section 7 describes how these assumptions can be relaxed).

**Assumption 2.1.** Data are sampled from an SCM over nodes $(E, X, Y)$, such that the corresponding graph is a DAG, the distribution is faithful with respect to this DAG, and the environments are exogenous, i.e., $\mathrm{PA}_E = \emptyset$.

Figure 1. Two structures where $S_{\mathrm{ICP}} \subsetneq \mathrm{PA}_Y$. (left) $S_{\mathrm{ICP}} = \emptyset$. (right) $S_{\mathrm{ICP}} = \{1\}$. In both, our method outputs $S_{\mathrm{IAS}} = \{1, 2, 3\}$.

### 2.2. Invariant Causal Prediction

Invariant causal prediction (ICP), introduced by Peters et al. (2016), exploits the existence of heterogeneity in the data, here encoded by an environment variable $E$, to learn a subset of the causal parents of a response variable $Y$. A subset of predictors $S \subseteq [d]$ is invariant if $Y \perp\!\!\!\perp E \mid X_S$, and we define $\mathcal{I} := \{S \subseteq [d] \mid S\ \text{invariant}\}$ to be the set of all invariant sets. We denote the corresponding hypothesis that $S$ is invariant by $H^I_{0,S}: S \in \mathcal{I}$. Formally, $H^I_{0,S}$ corresponds to a subset of distributions in $\mathcal{P}$, and we denote by $H^I_{A,S} := \mathcal{P} \setminus H^I_{0,S}$ the alternative hypothesis to $H^I_{0,S}$. Peters et al. (2016) define the oracle output
$$S_{\mathrm{ICP}} := \bigcap_{S : H^I_{0,S}\ \text{true}} S \tag{1}$$
(with $S_{\mathrm{ICP}} = \emptyset$ if no sets are invariant) and prove $S_{\mathrm{ICP}} \subseteq \mathrm{PA}_Y$. If provided with a test for the hypotheses $H^I_{0,S}$, we can test all sets $S \subseteq [d]$ for invariance and take the intersection over all accepted sets: $\hat S_{\mathrm{ICP}} := \bigcap_{S : H^I_{0,S}\ \text{not rejected}} S$. If the invariance test has level $\alpha$, then $\hat S_{\mathrm{ICP}} \subseteq \mathrm{PA}_Y$ with probability at least $1 - \alpha$.

However, even for the oracle output in Equation (1), there are many graphs for which $S_{\mathrm{ICP}}$ is a strict subset of $\mathrm{PA}_Y$. For example, in Figure 1 (left), since both $\{1, 2\}$ and $\{3\}$ are invariant, $S_{\mathrm{ICP}} \subseteq \{1, 2\} \cap \{3\} = \emptyset$. This does not violate $S_{\mathrm{ICP}} \subseteq \mathrm{PA}_Y$, but it is non-informative. Similarly, in Figure 1 (right), $S_{\mathrm{ICP}} = \{1\}$, as all invariant sets contain $\{1\}$. Here, $S_{\mathrm{ICP}}$ contains some information, but is not able to recover the full parental set. In neither of these two cases is $S_{\mathrm{ICP}}$ an invariant set. If the environments are such that each parent of $Y$ is either affected by the environment directly or is a parent of an affected node, then $S_{\mathrm{ICP}} = \mathrm{PA}_Y$ (Peters et al., 2016, proof of Theorem 3). The shortcomings of ICP thus relate to settings where the environments act on too few variables or on uninformative ones.

For large $d$, it has been suggested to apply ICP to the variables in the Markov boundary (Pearl, 2014), $\mathrm{MB}_Y = \mathrm{PA}_Y \cup \mathrm{CH}_Y \cup \mathrm{PA}(\mathrm{CH}_Y)$ (we denote the oracle output by $S^{\mathrm{MB}}_{\mathrm{ICP}}$). As $\mathrm{PA}_Y \subseteq \mathrm{MB}_Y$, it still holds that $S^{\mathrm{MB}}_{\mathrm{ICP}}$ is a subset of the causal parents of the response.² However, the procedure must still be applied to $2^{|\mathrm{MB}_Y|}$ sets, which is only feasible if the Markov boundary is sufficiently small. In practice, the Markov boundary can, for example, be estimated using Lasso regression or gradient boosting techniques (Tibshirani, 1996; Meinshausen & Bühlmann, 2006; Friedman, 2001).

² In fact, $S^{\mathrm{MB}}_{\mathrm{ICP}}$ is always at least as informative as ICP. E.g., there exist graphs in which $S_{\mathrm{ICP}} = \emptyset$ and $S^{\mathrm{MB}}_{\mathrm{ICP}} \neq \emptyset$, see Figure 1 (left). There are no possible structures for which $S^{\mathrm{MB}}_{\mathrm{ICP}} \subsetneq S_{\mathrm{ICP}}$, as both search spaces contain all sets of parents of $Y$.
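In code, the finite-sample version of Equation (1) is a short enumeration. The following is a minimal sketch (ours, not the authors' implementation), assuming a function `invariance_pvalue(S, X, Y, E)` that returns a p-value for $H^I_{0,S}$ (one such test is sketched in Section 5.1):

```python
from itertools import chain, combinations

def icp(X, Y, E, invariance_pvalue, alpha=0.05):
    """Finite-sample ICP: intersect all sets whose invariance is not rejected."""
    d = X.shape[1]
    subsets = chain.from_iterable(combinations(range(d), r) for r in range(d + 1))
    accepted = [set(S) for S in subsets
                if invariance_pvalue(set(S), X, Y, E) > alpha]
    # With no accepted set, ICP outputs the empty set by convention.
    return set.intersection(*accepted) if accepted else set()
```

On the chain example above, both $\{0\}$ and $\{1\}$ are typically accepted, so the intersection, and hence the ICP output, is empty.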
## 3. Minimal Invariance and Ancestry

We now introduce the concept of minimally invariant sets, which are invariant sets that do not have any invariant subsets. We propose to consider $S_{\mathrm{IAS}}$, the oracle outcome of invariant ancestry search, defined as the union of all minimally invariant sets. We will see that $S_{\mathrm{IAS}}$ is an invariant set, that it consists only of ancestors of $Y$, and that it contains $S_{\mathrm{ICP}}$ as a subset.

**Definition 3.1.** Let $S \subseteq [d]$. We say that $S$ is minimally invariant if and only if $S \in \mathcal{I}$ and there is no $S' \subsetneq S$ with $S' \in \mathcal{I}$; that is, $S$ is invariant and no proper subset of $S$ is invariant. We define $\mathcal{MI} := \{S \mid S\ \text{minimally invariant}\}$.

The concept of minimal invariance is closely related to the concept of minimal d-separators (Tian et al., 1998). This connection allows us to state several properties of minimal invariance. For example, an invariant set is minimally invariant if and only if it becomes non-invariant as soon as one of its elements is removed.

**Proposition 3.2.** Let $S \subseteq [d]$. Then $S \in \mathcal{MI}$ if and only if $S \in \mathcal{I}$ and for all $j \in S$ it holds that $S \setminus \{j\} \notin \mathcal{I}$.

The proof follows directly from (Tian et al., 1998, Corollary 2). We can therefore decide whether a given invariant set $S$ is minimally invariant using $O(|S|)$ checks for invariance, rather than $O(2^{|S|})$ (as suggested by Definition 3.1). We use this insight in Section 5.1, when we construct a statistical test for whether or not a set is minimally invariant.

To formally define the oracle outcome of IAS, we denote the hypothesis that a set $S$ is minimally invariant by $H^{MI}_{0,S}: S \in \mathcal{MI}$ (and the alternative hypothesis, $S \notin \mathcal{MI}$, by $H^{MI}_{A,S}$) and define the quantity of interest
$$S_{\mathrm{IAS}} := \bigcup_{S : H^{MI}_{0,S}\ \text{true}} S, \tag{2}$$
with the convention that a union over the empty set is the empty set.

The following proposition states that $S_{\mathrm{IAS}}$ is a subset of the ancestors of the response $Y$. Similarly to $\mathrm{PA}_Y$, variables in $\mathrm{AN}_Y$ are causes of $Y$, in the sense that for each ancestor there is a directed causal path to $Y$. Thus, generically, these variables have a causal effect on the response when intervened on.

**Proposition 3.3.** It holds that $S_{\mathrm{IAS}} \subseteq \mathrm{AN}_Y$.

The proof follows directly from (Tian et al., 1998, Theorem 2); see also (Acid & De Campos, 2013, Proposition 2). The setup in these papers is more general than what we consider here; we therefore provide direct proofs of Propositions 3.2 and 3.3 in Appendix A, which may provide further intuition for the results.

Finally, we show that the oracle output of IAS contains that of ICP and, contrary to ICP, is always an invariant set.

**Proposition 3.4.** Assume that $E \notin \mathrm{PA}_Y$. It holds that (i) $S_{\mathrm{IAS}} \in \mathcal{I}$ and (ii) $S_{\mathrm{ICP}} \subseteq S_{\mathrm{IAS}}$, with equality if and only if $S_{\mathrm{ICP}} \in \mathcal{I}$.

## 4. Oracle Algorithms

When provided with an oracle that tells us whether a set is invariant or not, how can we efficiently compute $S_{\mathrm{ICP}}$ and $S_{\mathrm{IAS}}$? Here, we assume that the oracle is given by a DAG, see Assumption 2.1. A direct application of Equations (1) and (2) would require checking a number of sets that grows exponentially in the number of nodes. For $S_{\mathrm{ICP}}$, we have the following characterization.³

**Proposition 4.1.** If $E \notin \mathrm{PA}_Y$, then $S_{\mathrm{ICP}} = \mathrm{PA}_Y \cap (\mathrm{CH}_E \cup \mathrm{PA}(\mathrm{AN}_Y \cap \mathrm{CH}_E))$.

³ To the best of our knowledge, this characterization is novel.

This allows us to efficiently read off $S_{\mathrm{ICP}}$ from the DAG (e.g., it can naively be done in $O((d + 2)^{2.373} \log(d + 2))$ time, where the exponent 2.373 comes from matrix multiplication). For $S_{\mathrm{IAS}}$, to the best of our knowledge, there is no closed-form expression with a similarly simple structure. Instead, for IAS, we exploit the recent development of efficient algorithms for computing all minimal d-separators (for two given sets of nodes) in a given DAG (see, e.g., Tian et al., 1998; van der Zander et al., 2019). A set $S$ is called a minimal d-separator of $E$ and $Y$ if it d-separates $E$ and $Y$ and no strict subset of $S$ satisfies this property.
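The oracle versions of these checks can be written directly against a DAG. The sketch below (ours; helper names and node labels are assumptions) uses networkx's d-separation test, available as `nx.is_d_separator` from networkx 3.3 (older releases call it `nx.d_separated`), together with the criterion of Proposition 3.2:

```python
import networkx as nx

def is_invariant(G, S):
    """Oracle invariance: S is invariant iff E and Y are d-separated given X_S."""
    return nx.is_d_separator(G, {"E"}, {"Y"}, set(S))

def is_minimally_invariant(G, S):
    """Proposition 3.2: S is minimally invariant iff S is invariant and
    removing any single element breaks invariance (O(|S|) oracle checks)."""
    if not is_invariant(G, S):
        return False
    return all(not is_invariant(G, set(S) - {j}) for j in S)

# Chain graph E -> X1 -> X2 -> Y: both {X1} and {X2} are minimally invariant.
G = nx.DiGraph([("E", "X1"), ("X1", "X2"), ("X2", "Y")])
print(is_minimally_invariant(G, {"X1"}), is_minimally_invariant(G, {"X1", "X2"}))
# -> True False
```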
These algorithms are often motivated by the search for minimal adjustment sets (e.g., Pearl, 2009), which can be used to compute the total causal effect between two nodes, for example. If the underlying distribution is Markov and faithful with respect to the DAG, then a set $S$ is minimally invariant if and only if it is a minimal d-separator of $E$ and $Y$. We can therefore use the same algorithms to find minimally invariant sets; van der Zander et al. (2019) provide an algorithm (based on work by Takata (2010)) for finding minimal d-separators with polynomial delay time. Applied to our case, this means that while there may be exponentially many minimally invariant sets,⁴ when listing all such sets it takes at most polynomial time until the next set, or the message that there are no further sets, is output. In practice, on random graphs, we found this to work well (see Section 6.1). But since $S_{\mathrm{IAS}}$ is the union of all minimally invariant sets, even faster algorithms may be available; to the best of our knowledge, it is an open question whether finding $S_{\mathrm{IAS}}$ is an NP-hard problem (see Appendix B for details). We provide a function for listing all minimally invariant sets in our Python code; it uses an implementation of the above-mentioned algorithm, provided in the R (R Core Team, 2021) package dagitty (Textor et al., 2016).

⁴ This is the case if there are $d/2$ (disjoint) directed paths between $E$ and $Y$, with each path containing two $X$-nodes, for example (e.g., van der Zander et al., 2019).

In Section 6.1, we study the properties of the oracle set $S_{\mathrm{IAS}}$. When applied to 500 randomly sampled, dense graphs with $d = 15$ predictor nodes and five interventions, the dagitty implementation had a median speedup of a factor of roughly 17 compared to a brute-force search (over the ancestors of $Y$). The highest speedup achieved was by a factor of more than 1,900.

The above-mentioned literature can be used only for oracle algorithms, where the graph is given. In the following sections, we discuss how to test the hypothesis of minimal invariance from data.

## 5. Invariant Ancestry Search

### 5.1. Testing a Single Set for Minimal Invariance

Usually, we neither observe a full SCM nor its graphical structure. Instead, we observe data from an SCM, which we want to use to decide whether a set is in $\mathcal{MI}$, such that we make the correct decision with high probability. We now show that a set $S$ can be tested for minimal invariance with asymptotic level and power if we are given a test for invariance that has asymptotic level and power.

Assume that $D_n = (X_i, E_i, Y_i)_{i=1}^n$ are observations (which may or may not be independent) of $(X, E, Y)$ and let $\phi^{MI}_n: \mathrm{powerset}([d]) \times \mathcal{D}_n \times (0, 1) \to \{0, 1\}$ (where $\mathcal{D}_n$ denotes the sample space of $D_n$) be a decision rule that transforms $(S, D_n, \alpha)$ into a decision $\phi^{MI}_n(S, D_n, \alpha)$ about whether the hypothesis $H^{MI}_{0,S}$ should be rejected ($\phi^{MI}_n = 1$) at significance threshold $\alpha$, or not ($\phi^{MI}_n = 0$). To ease notation, we suppress the dependence on $D_n$ and $\alpha$ when the statements are unambiguous.

A test $\psi_n$ for the hypothesis $H_0$ has pointwise asymptotic level if
$$\forall \alpha \in (0, 1): \sup_{P \in H_0} \lim_{n \to \infty} P(\psi_n = 1) \le \alpha \tag{3}$$
and pointwise asymptotic power if
$$\forall \alpha \in (0, 1): \inf_{P \in H_A} \lim_{n \to \infty} P(\psi_n = 1) = 1. \tag{4}$$
If the limit and the supremum (resp. infimum) in Equation (3) (resp. Equation (4)) can be interchanged, we say that $\psi_n$ has uniform asymptotic level (resp. power).

Tests for invariance have been examined in the literature. Peters et al. (2016) propose two simple methods for testing invariance in linear Gaussian SCMs when the environments are discrete, although the proposed methods extend directly to other regression scenarios. Pfister et al. (2019) propose resampling-based tests for sequential data from linear Gaussian SCMs. Furthermore, any valid test for conditional independence between $Y$ and $E$ given a set of predictors $S$ can be used to test for invariance. Although for continuous $X$ there exists no general conditional independence test that has both level and non-trivial power (Shah & Peters, 2020), it is possible to impose restrictions on the data-generating process that ensure the existence of non-trivial tests (e.g., Fukumizu et al., 2008; Zhang et al., 2011; Berrett et al., 2020; Shah & Peters, 2020; Thams et al., 2021). Heinze-Deml et al. (2018) provide an overview and a comparison of several conditional independence tests in the context of invariance.
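As one concrete choice (also used in Section 6.2), the following sketches the residual-based test of Peters et al. (2016, Section 3.2.1) for two discrete environments: regress $Y$ on $X_S$ pooling the environments, then compare the residuals' means and variances across environments. Combining the two p-values with a Bonferroni factor of two is our assumption about the details.

```python
import numpy as np
from scipy import stats

def invariance_pvalue(S, X, Y, E):
    """Residual-based invariance test for two environments (E in {0, 1}):
    pooled OLS of Y on X_S, then test equality of residual means (t-test)
    and variances (F-test) across environments; combine by Bonferroni."""
    S = sorted(S)
    Z = X[:, S] if S else np.empty((len(Y), 0))
    Z = np.column_stack([np.ones(len(Y)), Z])        # add intercept
    beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)     # pooled regression fit
    resid = Y - Z @ beta
    r0, r1 = resid[E == 0], resid[E == 1]
    p_mean = stats.ttest_ind(r0, r1, equal_var=False).pvalue
    f = np.var(r0, ddof=1) / np.var(r1, ddof=1)      # two-sided F-test
    df0, df1 = len(r0) - 1, len(r1) - 1
    p_var = 2 * min(stats.f.cdf(f, df0, df1), stats.f.sf(f, df0, df1))
    return min(1.0, 2 * min(p_mean, p_var))          # Bonferroni combination
```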
To test whether a set $S \subseteq [d]$ is minimally invariant, we define the decision rule
$$\phi^{MI}_n(S) := \begin{cases} 1 & \text{if } \phi_n(S) = 1 \text{ or } \min_{j \in S} \phi_n(S \setminus \{j\}) = 0, \\ 0 & \text{otherwise}, \end{cases} \tag{5}$$
where $\phi^{MI}_n(\emptyset) := \phi_n(\emptyset)$. Here, $\phi_n$ is a test for the hypothesis $H^I_{0,S}$, e.g., one of the tests mentioned above. This decision rule rejects $H^{MI}_{0,S}$ either if $H^I_{0,S}$ is rejected by $\phi_n$ or if there exists $j \in S$ such that $H^I_{0,S \setminus \{j\}}$ is not rejected. If $\phi_n$ has pointwise (resp. uniform) asymptotic level and power, then $\phi^{MI}_n$ has pointwise (resp. uniform) asymptotic level and pointwise (resp. uniform) asymptotic power of at least $1 - \alpha$.

**Theorem 5.1.** Let $\phi^{MI}_n$ be defined as in Equation (5) and let $S \subseteq [d]$. Assume that the decision rule $\phi_n$ has pointwise asymptotic level and power for $S$ and for all $S \setminus \{j\}$, $j \in S$. Then, $\phi^{MI}_n$ has pointwise asymptotic level and pointwise asymptotic power of at least $1 - \alpha$, i.e.,
$$\inf_{P \in H^{MI}_{A,S}} \lim_{n \to \infty} P(\phi^{MI}_n(S) = 1) \ge 1 - \alpha.$$
If $\phi_n$ has uniform asymptotic level and power, then $\phi^{MI}_n$ has uniform asymptotic level and uniform asymptotic power of at least $1 - \alpha$.

Due to Proposition 3.3, a test for $H^{MI}_{0,S}$ is implicitly a test for $S \subseteq \mathrm{AN}_Y$, and can thus be used to infer whether intervening on $S$ will have a potential causal effect on $Y$. However, rejecting $H^{MI}_{0,S}$ is not evidence for $S \not\subseteq \mathrm{AN}_Y$; it is evidence for $S \notin \mathcal{MI}$.

### 5.2. Learning $S_{\mathrm{IAS}}$ from Data

We now consider the task of estimating the set $S_{\mathrm{IAS}}$ from data. If we are given a test for invariance that has asymptotic level and power, and if we correct for multiple testing appropriately, we can estimate $S_{\mathrm{IAS}}$ by $\hat S_{\mathrm{IAS}}$, which, asymptotically, is a subset of $\mathrm{AN}_Y$ with large probability.

**Theorem 5.2.** Assume that the decision rule $\phi_n$ has pointwise asymptotic level for all minimally invariant sets and pointwise asymptotic power for all $S \subseteq [d]$ such that $S$ is not a superset of a minimally invariant set. Define $C := 2^d$ and let
$$\hat{\mathcal{I}} := \{S \subseteq [d] \mid \phi_n(S, \alpha C^{-1}) = 0\}$$
be the set of all sets for which the hypothesis of invariance is not rejected, and define
$$\widehat{\mathcal{MI}} := \{S \in \hat{\mathcal{I}} \mid \nexists S' \subsetneq S: S' \in \hat{\mathcal{I}}\} \quad \text{and} \quad \hat S_{\mathrm{IAS}} := \bigcup_{S \in \widehat{\mathcal{MI}}} S.$$
It then holds that
$$\lim_{n \to \infty} P(\hat S_{\mathrm{IAS}} \subseteq \mathrm{AN}_Y) \ge \lim_{n \to \infty} P(\hat S_{\mathrm{IAS}} = S_{\mathrm{IAS}}) \ge 1 - \alpha.$$

A generic algorithm for implementing $\hat S_{\mathrm{IAS}}$ is given in Appendix D.
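A brute-force version of the estimator in Theorem 5.2 can be sketched as follows (Appendix D contains the authors' generic algorithm; this enumeration over all $2^d$ subsets is our illustration and is only feasible for small $d$):

```python
from itertools import chain, combinations

def estimate_s_ias(X, Y, E, invariance_pvalue, alpha=0.05, C=None):
    """Estimator of Theorem 5.2: test every subset at level alpha / C, keep
    the accepted sets with no accepted proper subset, and return their union."""
    d = X.shape[1]
    C = 2 ** d if C is None else C
    subsets = chain.from_iterable(combinations(range(d), r) for r in range(d + 1))
    accepted = [frozenset(S) for S in subsets
                if invariance_pvalue(set(S), X, Y, E) > alpha / C]
    minimal = [S for S in accepted
               if not any(T < S for T in accepted)]  # no accepted proper subset
    return set().union(*minimal) if minimal else set()
```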
**Remark 5.3.** Consider a decision rule $\phi_n$ that only just (correctly) rejects the empty set (e.g., because the p-value is just below the threshold $\alpha$), indicating that the effect of the environments is weak. It is likely that there are other sets $S \notin \mathcal{I}$ which the test does not have sufficient power against and which are (falsely) accepted as invariant. If one of these sets contains non-ancestors of $Y$, this yields a violation of $\hat S_{\mathrm{IAS}} \subseteq \mathrm{AN}_Y$. To guard against this, the set $S = \emptyset$ can be tested at a lower significance level, $\alpha_0 < \alpha$. This modified IAS approach is conservative and may return $\hat S_{\mathrm{IAS}} = \emptyset$ if the environments do not have a strong impact on $Y$, but it retains the guarantee $\lim_{n \to \infty} P(\hat S_{\mathrm{IAS}} \subseteq \mathrm{AN}_Y) \ge 1 - \alpha$ of Theorem 5.2.

The multiple testing correction performed in Theorem 5.2 is strictly conservative, because we only need to correct for the number of minimally invariant sets, and there do not exist $2^d$ minimally invariant sets. Indeed, the statement of Theorem 5.2 remains valid for $C = \tilde{C}$ if the underlying DAG has at most $\tilde{C}$ minimally invariant sets. We hypothesize that a DAG can contain at most $3^{d/3}$ minimally invariant sets and therefore propose using $C = 3^{d/3}$ in practice. If this hypothesis is true, Theorem 5.2 remains valid (for any DAG) using $C = 3^{d/3}$ (see Appendix C for a more detailed discussion).

Alternatively, as shown in the following section, we can restrict the search for minimally invariant sets to a predetermined size. This requires milder correction factors and comes with computational benefits.

### 5.3. Invariant Ancestry Search in Large Systems

We now develop a variation of Theorem 5.2 which allows us to search for ancestors of $Y$ in large graphs, at the cost of only identifying minimally invariant sets up to some a priori determined size. Similarly to ICP (see Section 2.2), one could restrict IAS to the variables in $\mathrm{MB}_Y$, but the output may be smaller than $S_{\mathrm{IAS}}$; in particular, $\mathrm{MB}_Y$ contains non-parental ancestors only if these are parents of both a parent and a child of $Y$. (For instance, in the graph $E \to X_1 \to \dots \to X_d \to Y$, $S_{\mathrm{IAS}} = \{1, \dots, d\}$, but restricting IAS to $\mathrm{MB}_Y$ would yield the set $\{d\}$.) Thus, we do not expect such an approach to be particularly fruitful in learning ancestors. Here, we propose an alternative approach and define
$$S^m_{\mathrm{IAS}} := \bigcup_{S : S \in \mathcal{MI}\ \text{and}\ |S| \le m} S \tag{6}$$
as the union of the minimally invariant sets that are no larger than $m \le d$. For computing $S^m_{\mathrm{IAS}}$, one only needs to check invariance of the $\sum_{i=0}^m \binom{d}{i}$ sets that are no larger than $m$. $S^m_{\mathrm{IAS}}$ itself, however, can be larger than $m$: in the graph above Equation (6), $S^1_{\mathrm{IAS}} = \{1, \dots, d\}$. The following proposition characterizes properties of $S^m_{\mathrm{IAS}}$.

**Proposition 5.4.** Let $m < d$ and let $m_{\min}$ and $m_{\max}$ be the sizes of a smallest and a largest minimally invariant set, respectively. The following statements are true:
(i) $S^m_{\mathrm{IAS}} \subseteq \mathrm{AN}_Y$.
(ii) If $m \ge m_{\max}$, then $S^m_{\mathrm{IAS}} = S_{\mathrm{IAS}}$.
(iii) If $m \ge m_{\min}$ and $E \notin \mathrm{PA}_Y$, then $S^m_{\mathrm{IAS}} \in \mathcal{I}$.
(iv) If $m \ge m_{\min}$ and $E \notin \mathrm{PA}_Y$, then $S_{\mathrm{ICP}} \subseteq S^m_{\mathrm{IAS}}$, with equality if and only if $S_{\mathrm{ICP}} \in \mathcal{I}$.

If $m < m_{\min}$ and $S_{\mathrm{ICP}} \neq \emptyset$, then $S_{\mathrm{ICP}} \subseteq S^m_{\mathrm{IAS}}$ does not hold. However, we show in Section 6.1 using simulations that $S^m_{\mathrm{IAS}}$ is larger than $S_{\mathrm{ICP}}$ in many sparse graphs, even for $m = 1$, when few nodes are intervened on.

In addition to the computational speedup offered by considering $S^m_{\mathrm{IAS}}$ instead of $S_{\mathrm{IAS}}$, the set $S^m_{\mathrm{IAS}}$ can be estimated from data using a smaller correction factor than the one employed in Theorem 5.2. This has the benefit that, in practice, smaller sample sizes may be needed to detect non-invariance.
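To illustrate Equation (6) on the chain graph mentioned above, the following oracle sketch computes $S^1_{\mathrm{IAS}}$ for $E \to X_1 \to \dots \to X_5 \to Y$ (networkx and the node labels are our assumptions):

```python
import networkx as nx
from itertools import combinations

d = 5
edges = ([("E", "X1")] + [(f"X{i}", f"X{i+1}") for i in range(1, d)]
         + [(f"X{d}", "Y")])
G = nx.DiGraph(edges)
nodes = [f"X{i}" for i in range(1, d + 1)]

def is_invariant(S):
    return nx.is_d_separator(G, {"E"}, {"Y"}, set(S))

# Every singleton {X_i} blocks the single directed path E -> X1 -> ... -> Y,
# while the empty set does not; hence each singleton is minimally invariant
# and S^1_IAS = {X1, ..., X5}, even though each set in MI has size 1.
m = 1
mi_small = [set(S) for r in range(m + 1) for S in combinations(nodes, r)
            if is_invariant(S)
            and not any(is_invariant(set(S) - {j}) for j in S)]
print(sorted(set().union(*mi_small)))  # -> ['X1', 'X2', 'X3', 'X4', 'X5']
```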
**Theorem 5.5.** Let $m \le d$ and define $C(m) := \sum_{i=0}^m \binom{d}{i}$. Assume that the decision rule $\phi_n$ has pointwise asymptotic level for all minimally invariant sets of size at most $m$ and pointwise asymptotic power for all sets of size at most $m$ that are not supersets of a minimally invariant set. Let
$$\hat{\mathcal{I}}^m := \{S \subseteq [d] \mid \phi_n(S, \alpha C(m)^{-1}) = 0 \text{ and } |S| \le m\}$$
be the set of all sets of size at most $m$ for which the hypothesis of invariance is not rejected, and define
$$\widehat{\mathcal{MI}}^m := \{S \in \hat{\mathcal{I}}^m \mid \nexists S' \subsetneq S: S' \in \hat{\mathcal{I}}^m\} \quad \text{and} \quad \hat S^m_{\mathrm{IAS}} := \bigcup_{S \in \widehat{\mathcal{MI}}^m} S.$$
It then holds that
$$\lim_{n \to \infty} P(\hat S^m_{\mathrm{IAS}} \subseteq \mathrm{AN}_Y) \ge \lim_{n \to \infty} P(\hat S^m_{\mathrm{IAS}} = S^m_{\mathrm{IAS}}) \ge 1 - \alpha.$$

The method proposed in Theorem 5.5 outputs a non-empty set if there exists a non-empty set of size at most $m$ for which the hypothesis of invariance cannot be rejected. In a sparse graph, it is likely that many small sets are minimally invariant, whereas if the graph is dense, it may be that all invariant sets are larger than $m$, such that $S^m_{\mathrm{IAS}} = \emptyset$. In dense graphs, however, many other approaches may fail too; for example, it is also likely that the size of the Markov boundary is so large that applying ICP to $\mathrm{MB}_Y$ is not feasible.

## 6. Experiments

We apply the methods developed in this paper in a population-case experiment using oracle knowledge (Section 6.1), a synthetic experiment using finite-sample tests (Section 6.2), and a real-world data set from a gene perturbation experiment (Section 6.3). In Sections 6.1 and 6.2 we consider a setting with two environments, an observational environment ($E = 0$) and an intervention environment ($E = 1$), and examine how the strength and number of interventions affect the performance of IAS.

### 6.1. Oracle IAS in Random Graphs

For the oracle setting, we know that $S_{\mathrm{IAS}} \subseteq \mathrm{AN}_Y$ (Proposition 3.3) and $S_{\mathrm{ICP}} \subseteq S_{\mathrm{IAS}}$ (Proposition 3.4). We first verify that the inclusion $S_{\mathrm{ICP}} \subseteq S_{\mathrm{IAS}}$ is often strict in low-dimensional settings when there are few interventions. Second, we show that the set $S^m_{\mathrm{IAS}}$ is often strictly larger than the set $S^{\mathrm{MB}}_{\mathrm{ICP}}$ in large, sparse graphs with few interventions.

Figure 2. Low-dimensional oracle experiment, see Section 6.1. The panels show $P_n(S_{\mathrm{ICP}} \subsetneq S_{\mathrm{IAS}})$ for sparse and dense graphs ($d = 4, 6, \dots, 20$) against the proportion of predictors intervened on (%). In all cases, as predicted by theory, $S_{\mathrm{ICP}}$ is contained in $S_{\mathrm{IAS}}$. For many graphs, $S_{\mathrm{IAS}}$ is strictly larger than $S_{\mathrm{ICP}}$; on average, this effect is more pronounced when there are fewer intervened nodes. $P_n$ refers to the distribution used to sample graphs, and every point in the figure is based on 50,000 independently sampled graphs; $d$ denotes the number of covariates $X$. Empirical confidence bands are plotted around each line, but are very narrow.

In principle, for a given number of covariates, one can enumerate all DAGs and, for each DAG, compare $S_{\mathrm{ICP}}$ and $S_{\mathrm{IAS}}$. However, because the space of DAGs grows super-exponentially in the number of nodes (Chickering, 2002), this is infeasible. Instead, we sample graphs from the space of all DAGs that satisfy Assumption 2.1 and $Y \in \mathrm{DE}_E$ (see Appendix E.1 for details). In the low-dimensional setting ($d \le 20$), we compute $S_{\mathrm{ICP}}$ and $S_{\mathrm{IAS}}$, whereas in the larger graphs ($d \ge 100$), we compute $S^{\mathrm{MB}}_{\mathrm{ICP}}$ and the reduced set $S^m_{\mathrm{IAS}}$, for $m \in \{1, 2\}$ when $d = 100$ and for $m = 1$ when $d = 1{,}000$. Because there is no guarantee that IAS outputs a superset of ICP when searching only up to sets of some size lower than $d$, we compare the sizes of the sets output by either method. For the low-dimensional setting, we consider both sparse and dense graphs, but for larger dimensions, we only consider sparse graphs. In the sparse setting, the DAGs are constructed such that there is an expected number of $d + 1$ edges between the $d + 1$ nodes $X$ and $Y$; in the dense setting, the expected number of edges equals $0.75 \cdot d(d + 1)/2$.
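The exact sampling scheme is given in Appendix E.1 (not reproduced here); as a rough, hedged illustration, one can sample Erdős–Rényi-style DAGs with a prescribed expected number of edges and let $E$ intervene on a random subset of predictors:

```python
import networkx as nx
import numpy as np

def sample_dag(d, expected_edges, n_interventions, rng):
    """Rough illustration only (the paper's scheme is in Appendix E.1):
    upper-triangular adjacency over a random causal order of (X, Y) with Y
    last, plus E -> a random set of predictors.  A full implementation would
    additionally enforce Y in DE_E, e.g., by resampling."""
    p = expected_edges / (d * (d + 1) / 2)   # edge probability among X and Y
    order = list(rng.permutation([f"X{i}" for i in range(1, d + 1)])) + ["Y"]
    G = nx.DiGraph()
    G.add_nodes_from(["E"] + order)
    for i in range(len(order)):
        for j in range(i + 1, len(order)):
            if rng.random() < p:
                G.add_edge(order[i], order[j])
    for x in rng.choice(order[:-1], size=n_interventions, replace=False):
        G.add_edge("E", x)
    return G
```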
The results of the simulations are displayed in Figures 2 and 3. In the low-dimensional setting, $S_{\mathrm{IAS}}$ is a strict superset of $S_{\mathrm{ICP}}$ for many graphs. This effect is more pronounced the larger $d$ is and the fewer nodes are intervened on, see Figure 2. In fact, when there are interventions on all predictors, we know that $S_{\mathrm{IAS}} = S_{\mathrm{ICP}} = \mathrm{PA}_Y$ (Peters et al., 2016, Theorem 2), and thus the probability that $S_{\mathrm{ICP}} \subsetneq S_{\mathrm{IAS}}$ is exactly zero. For the larger graphs, we find that the set $S^m_{\mathrm{IAS}}$ is, on average, larger than $S^{\mathrm{MB}}_{\mathrm{ICP}}$, in particular when $d = 1{,}000$ or when $m = 2$, see Figure 3. In the setting with $d = 100$ and $m = 1$, the two sets are roughly the same size when 10% of the predictors are intervened on. The set $S^{\mathrm{MB}}_{\mathrm{ICP}}$ becomes larger than $S^1_{\mathrm{IAS}}$ after roughly 15% of the predictor nodes are intervened on (not shown). For both $d = 100$ and $d = 1{,}000$, the average size of the Markov boundary of $Y$ was found to be approximately 3.5.

Figure 3. High-dimensional oracle experiment with sparse graphs, see Section 6.1. The average size of the set $S^m_{\mathrm{IAS}}$ (plotted against the proportion of predictors intervened on) is larger than the average size of the set $S^{\mathrm{MB}}_{\mathrm{ICP}}$, both when using IAS to search for sets up to sizes $m = 1$ and $m = 2$. Except for the choice of $d$, the setup is the same as in Figure 2.

### 6.2. Simulated Linear Gaussian SCMs

In this experiment, we show through simulation that IAS finds more ancestors than ICP in a finite-sample setting when applied to linear Gaussian SCMs. To compare the outputs of IAS and ICP, we use the Jaccard similarity between $\hat S_{\mathrm{IAS}}$ ($\hat S^1_{\mathrm{IAS}}$ when $d$ is large) and $\mathrm{AN}_Y$, and between $\hat S_{\mathrm{ICP}}$ ($\hat S^{\widehat{\mathrm{MB}}}_{\mathrm{ICP}}$ when $d$ is large⁵) and $\mathrm{AN}_Y$.⁶

⁵ $\widehat{\mathrm{MB}}$ is a Lasso regression estimate of $\mathrm{MB}_Y$ containing at most 10 variables.

⁶ The Jaccard similarity between two sets $A$ and $B$ is defined as $J(A, B) := |A \cap B| / |A \cup B|$, with $J(\emptyset, \emptyset) := 0$. The Jaccard similarity equals one if the two sets are equal, zero if they are disjoint, and takes a value in $(0, 1)$ otherwise.

We sample data from sparse linear Gaussian models with i.i.d. noise terms in two scenarios, $d = 6$ and $d = 100$. In both cases, the coefficients of the linear assignments are drawn randomly. We consider two environments, one observational and one interventional; in the interventional environment, we apply do-interventions of strength one to children of $E$, i.e., we fix the value of a child of $E$ to one. We standardize the data along the causal order, to prevent variance accumulation along the causal order (Reisach et al., 2021). Throughout the section, we consider a significance level of $\alpha = 5\%$. For a detailed description of the simulations, see Appendix E.2.
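For reference, the Jaccard similarity of footnote 6 in code (a one-off helper, not taken from the paper's code base):

```python
def jaccard(A, B):
    """J(A, B) = |A & B| / |A | B|, with J(empty, empty) := 0."""
    A, B = set(A), set(B)
    return len(A & B) / len(A | B) if A | B else 0.0
```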
To test for invariance, we employ the test used in Peters et al. (2016): we calculate a p-value for the hypothesis of invariance of $S$ by first linearly regressing $Y$ onto $X_S$ (ignoring $E$), and second testing whether the mean and variance of the prediction residuals are equal across environments. For details, see Peters et al. (2016, Section 3.2.1). Schultheiss et al. (2021) also consider the task of estimating ancestors, but since their method is uninformative for Gaussian data and does not consider environments, it is not directly applicable here.

In Theorem 5.2, we assume asymptotic power of our invariance test. When $d = 6$, we test hypotheses with a correction factor $C = 3^{6/3} = 9$, as suggested in Appendix C, in an attempt to reduce false positive findings. In Appendix E.3, we repeat the experiment of this section with $C = 2^6$ and find almost identical results. We hypothesize that the effect of a reduced $C$ is more pronounced for larger $d$. When $d = 100$, we test hypotheses with the correction factor $C(1)$ of Theorem 5.5. In both cases, we test the hypothesis of invariance of the empty set at level $\alpha_0 = 10^{-6}$ (cf. Remark 5.3). In Appendix E.4, we investigate the effects on the quantities $P(\hat S_{\mathrm{IAS}} \subseteq \mathrm{AN}_Y)$ and $P(\hat S^1_{\mathrm{IAS}} \subseteq \mathrm{AN}_Y)$ when varying $\alpha_0$, confirming that choosing $\alpha_0$ too high can lead to a reduced probability of $\hat S_{\mathrm{IAS}}$ being a subset of ancestors.
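For intuition on the sizes of these corrections (our arithmetic only, not the paper's code): for $d = 100$, the full correction of Theorem 5.2 would be $2^{100}$, while restricting the search to $m = 1$ or $m = 2$ gives manageable factors:

```python
from math import comb

d = 100
C_full = 2 ** d                        # correction factor of Theorem 5.2
C_hyp = 3 ** (d // 3)                  # hypothesized bound of Appendix C
C_m = lambda m: sum(comb(d, i) for i in range(m + 1))  # Theorem 5.5
print(C_full, C_hyp, C_m(1), C_m(2))   # per-test level alpha / C(m) stays usable
# C_m(1) == 101, C_m(2) == 5051
```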
Figure 4. Comparison between the finite-sample outputs of IAS and ICP and $\mathrm{AN}_Y$ on simulated data, see Section 6.2. The plots show the Jaccard similarities between $\mathrm{AN}_Y$ and either $\hat S_{\mathrm{IAS}}$ ($\hat S^1_{\mathrm{IAS}}$ when $d = 100$, red) or $\hat S_{\mathrm{ICP}}$ ($\hat S^{\widehat{\mathrm{MB}}}_{\mathrm{ICP}}$ when $d = 100$, blue), plotted against the number of observations ($10^2$ to $10^5$). When $S_{\mathrm{ICP}} \neq S_{\mathrm{IAS}}$ (left column), $\hat S_{\mathrm{IAS}}$ is more similar to $\mathrm{AN}_Y$ than $\hat S_{\mathrm{ICP}}$. The procedures are roughly equally similar to $\mathrm{AN}_Y$ when $S_{\mathrm{ICP}} = S_{\mathrm{IAS}}$ (right column). Graphs represented in each boxplot: 42 (top left), 58 (top right), 40 (bottom left) and 60 (bottom right).

In Figure 4, the results of the simulations are displayed. In SCMs where the oracle versions $S_{\mathrm{IAS}}$ and $S_{\mathrm{ICP}}$ are not equal, $\hat S_{\mathrm{IAS}}$ achieved, on average, a higher Jaccard similarity to $\mathrm{AN}_Y$ than $\hat S_{\mathrm{ICP}}$. This effect is less pronounced when $d = 100$. We believe that the difference in Jaccard similarities is more pronounced when using larger values of $m$. When $S_{\mathrm{IAS}} = S_{\mathrm{ICP}}$, the two procedures achieve roughly the same Jaccard similarities to $\mathrm{AN}_Y$, as expected.

When the number of observations is one hundred, IAS generally fails to find any ancestors and outputs the empty set (see Figure 7), indicating that we do not have power to reject the empty set when there are few observations. This is partly by design; we test the empty set for invariance at the reduced level $\alpha_0$ in order to protect against making false positive findings when the environment has a weak effect on $Y$. However, even without testing the empty set at a reduced level, IAS has to correct for making multiple comparisons, contrary to ICP, thus lowering the marginal significance level each set is tested at. When computing the Jaccard similarities with either $\alpha_0 = \alpha$ or $\alpha_0 = 10^{-12}$, the results were similar (not shown). We repeated the experiments with $d = 6$ with a weaker influence of the environment (do-interventions of strength 0.5 instead of 1) and found comparable results, with slightly less power in that the empty set is found more often, see Appendix E.5.

We compare our method with a variant, called IAS$_{\mathrm{est.graph}}$, where we first estimate (e.g., using methods proposed by Mooij et al. 2020 or Squires et al. 2020) a member graph of the Markov equivalence class (I-MEC) and apply the oracle algorithm from Section 4 (by reading off d-separations in that graph) to estimate $\mathcal{MI}$. In general, however, such an approach comes with additional assumptions; furthermore, even in the linear setup considered here, its empirical performance for large graphs is worse than the proposed method IAS, see Appendix E.7.

### 6.3. IAS in High-Dimensional Genetic Data

We evaluate our approach on a data set on gene expression in yeast (Kemmeren et al., 2014). The data contain full-genome mRNA expressions of $d = 6{,}170$ genes and consist of $n_{\mathrm{obs}} = 160$ unperturbed observations ($E = 0$) and $n_{\mathrm{int}} = 1{,}479$ intervened-upon observations ($E = 1$); each of the latter observations corresponds to the deletion of a single (known) gene. For each response gene $Y \in [d]$, we apply the procedure from Section 5.3 with $m = 1$ to search for ancestors. We first test for invariance of the empty set, i.e., whether the distribution of gene Y differs between the observational and interventional environment. We test this at a conservative level $\alpha_0 = 10^{-12}$ in order to protect against a high false positive rate (see Remark 5.3). For 3,631 out of 6,170 response genes, the empty set is invariant, and we disregard them as response genes.

For each response gene for which the empty set is not invariant, we apply our procedure. More specifically, when testing whether gene X is an ancestor of gene Y, we exclude any observation in which either gene X or gene Y was intervened on. We then test whether the empty set is still rejected, at level $\alpha_0 = 10^{-12}$, and whether gene X is invariant at level $\alpha = 0.25$. Since a set {gene X} is deemed minimally invariant if the p-value exceeds $\alpha$, setting $\alpha$ large is conservative for the task of finding ancestors. Indeed, when estimating $\hat S^m_{\mathrm{IAS}}$, one can test the sets of size $m$ at a higher level $\alpha_1 > \alpha$. This is conservative, because falsely rejecting a minimally invariant set of size $m$ does not break the inclusion $\hat S^m_{\mathrm{IAS}} \subseteq \mathrm{AN}_Y$. However, if one has little power against the non-invariant sets of size $m$, testing at level $\alpha_1$ can protect against false positives.⁷

⁷ Only sets of size exactly $m$ can be tested at level $\alpha_1$; the remaining hypotheses should still be corrected by $C(m)$ (or by the hypothesized number of minimally invariant sets).

We use the held-out data point, where gene X is intervened on, to determine, as ground truth, whether gene X is indeed an ancestor of gene Y. We define gene X as a true ancestor of gene Y if the value of gene Y, when gene X is intervened on, lies in the $q_{\mathrm{TP}} = 1\%$ tails of the observational distribution of gene Y.
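The per-pair screen might be organized as in the sketch below; `expr` (an $n \times d$ expression matrix), `is_obs`, `intervened_gene` (e.g., $-1$ for observational rows) and the reuse of `invariance_pvalue` from Section 5.1 are all our assumptions, not the authors' code. The double loop is $O(d^2)$ and is shown for clarity, not speed.

```python
import numpy as np

def screen_pairs(expr, is_obs, intervened_gene, invariance_pvalue,
                 alpha0=1e-12, alpha=0.25):
    """m = 1 ancestor screen: for each pair (x, y), drop observations where
    x or y was deleted, require the empty set to remain non-invariant at
    alpha0, and keep x as an ancestor candidate of y if {x} is invariant."""
    n, d = expr.shape
    pairs = []
    for y in range(d):
        for x in range(d):
            if x == y:
                continue
            keep = is_obs | ((intervened_gene != x) & (intervened_gene != y))
            E = (~is_obs[keep]).astype(int)
            X, Y = expr[keep][:, [x]], expr[keep][:, y]
            if (invariance_pvalue(set(), X, Y, E) < alpha0
                    and invariance_pvalue({0}, X, Y, E) > alpha):
                pairs.append((x, y))
    return pairs
```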
We find 23 invariant pairs (gene X, gene Y); of these, 7 are true positives. In comparison, Peters et al. (2016) apply ICP to the same data, with the same definition of true positives. They predict 8 pairs, of which 6 are true positives. This difference is consistent with the motivation put forward in Section 5.2: our approach predicts many more ancestral pairs (8 for ICP compared to 23 for IAS). Since ICP does not depend on the power of the test, it has a lower false positive rate (25% for ICP compared to 69.6% for IAS).

Figure 5. True positive rates and number of gene pairs found in the experiment in Section 6.3. On the x-axis ($10^{-17}$ to $10^{-12}$), we change $\alpha_0$, the threshold for invariance of the empty set. When $\alpha_0$ is small, we only search for pairs if the environment has a very significant effect on $Y$. For smaller $\alpha_0$, fewer pairs are found to be invariant (blue line), but those found are more likely to be true positives (red line). This supports the claim that the lower $\alpha_0$ is, the more conservative our approach is.

In Figure 5, we explore how changing $\alpha_0$ and $q_{\mathrm{TP}}$ impacts the true positive rate. Reducing $\alpha_0$ increases the true positive rate, but lowers the number of gene pairs found (see Figure 5). This is because a lower $\alpha_0$ makes it more difficult to detect non-invariance of the empty set, making the procedure more conservative (with respect to finding ancestors); see Remark 5.3. For example, when $\alpha_0 \le 10^{-15}$, the true positive rate is above 0.8; however, 5 or fewer pairs are found. When searching for ancestors, the effect of intervening may be diluted by noise from intermediary variables, so $q_{\mathrm{TP}} = 1\%$ might be too strict; in Appendix E.6, we analyze the impact of increasing $q_{\mathrm{TP}}$.

## 7. Extensions

### 7.1. Latent Variables

In Assumption 2.1, we assume that all variables $X$ are observed and that there are no hidden variables $H$. Let us write $X = X_O \cup X_H$, where only $X_O$ is observed, and define $\mathcal{I} := \{S \subseteq X_O \mid S\ \text{invariant}\}$. We can then define
$$S_{\mathrm{IAS},O} := \bigcup_{S \subseteq X_O : H^{MI}_{0,S}\ \text{true}} S$$
(again with the convention that a union over the empty set is the empty set), and have the following modification of Proposition 3.3.

**Proposition 7.1.** It holds that $S_{\mathrm{IAS},O} \subseteq \mathrm{AN}_Y$.

All results in this paper remain correct in the presence of hidden variables, except for Proposition 3.4 and Proposition 5.4 (iii)-(iv).⁸ Thus, the union of the observed minimally invariant sets, $S_{\mathrm{IAS},O}$, is a subset of $\mathrm{AN}_Y$ and can be learned from data in the same way as if no latent variables were present.

⁸ These results do not hold in the presence of hidden variables, because it is not guaranteed that an invariant set exists among $X_O$ (e.g., consider a graph where all observed variables share a common, unobserved confounder with $Y$). However, if at least one minimally invariant set exists among the observed variables, then all results stated in this paper hold.

### 7.2. Non-exogenous Environments

Throughout this paper, we have assumed that the environment variable is exogenous (Assumption 2.1). However, all of the results stated in this paper, except for Proposition 4.1, also hold under the alternative assumption that $E$ is an ancestor of $Y$, but not necessarily exogenous. Of the remaining results, only the proof of Proposition 3.2 uses exogeneity of $E$, but there the result follows from Tian et al. (1998). In all other proofs, we account for both options. This extension also remains valid in the presence of hidden variables, using the same arguments as in Section 7.1.

## 8. Conclusion and Future Work

Invariant Ancestry Search (IAS) provides a framework for searching for causal ancestors of a response variable $Y$ through finding minimally invariant sets of predictors, exploiting the existence of exogenous heterogeneity. The set $S_{\mathrm{IAS}}$ is a subset of the ancestors of $Y$, a superset of $S_{\mathrm{ICP}}$ and, contrary to $S_{\mathrm{ICP}}$, invariant itself. Furthermore, the hierarchical structure of minimally invariant sets allows IAS to search for causal ancestors only among subsets up to a predetermined size. This avoids exponential runtime and allows us to apply the algorithm to large systems.

We have shown that, asymptotically, $S_{\mathrm{IAS}}$ can be identified from data with high probability if we are provided with a test for invariance that has asymptotic level and power. We have validated our procedure both on simulated and real data. Our proposed framework would benefit from further research into the maximal number of minimally invariant sets among graphs of a fixed size, as this would provide larger finite-sample power for identifying ancestors. Further, it is of interest to establish finite-sample guarantees or convergence rates for IAS, possibly by imposing additional assumptions on the class of SCMs. Finally, even though current implementations are fast, it is an open theoretical question whether computing $S_{\mathrm{IAS}}$ in the oracle setting of Section 4 is NP-hard, see Appendix B.
Finally, even though current implementations are fast, it is an open theoretical question whether computing SIAS in the oracle setting of Section 4 is NP-hard, see Appendix B. Acknowledgements NT and JP were supported by a research grant (18968) from VILLUM FONDEN. Acid, S. and De Campos, L. M. An algorithm for finding minimum d-separating sets in belief networks. In Proceedings of the 29th Annual Conference on Uncertainty in Artificial Intelligence (UAI), 2013. Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez Paz, D. Invariant risk minimization. ar Xiv preprint ar Xiv:1907.02893, 2019. Berrett, T. B., Wang, Y., Barber, R. F., and Samworth, R. J. The conditional permutation test for independence while controlling for confounders. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(1): 175 197, 2020. Bongers, S., Forr e, P., Peters, J., and Mooij, J. M. Foundations of structural causal models with cycles and latent variables. Annals of Statistics, 49(5):2885 2915, 2021. Chickering, D. M. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3: 507 554, 2002. Christiansen, R., Pfister, N., Jakobsen, M. E., Gnecco, N., and Peters, J. A causal framework for distribution generalization. IEEE Transactions on Pattern Analysis and Machine Intelligence (accepted), 2022. Friedman, J. H. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189 1232, 2001. Fukumizu, K., Gretton, A., Sun, X., and Sch olkopf, B. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems (Neur IPS), volume 20, 2008. Invariant Ancestry Search Gamella, J. L. and Heinze-Deml, C. Active invariant causal prediction: Experiment selection through stability. In Advances in Neural Information Processing Systems (Neur IPS), volume 33, 2020. Gaspers, S. and Mackenzie, S. On the number of minimal separators in graphs. In International Workshop on Graph-Theoretic Concepts in Computer Science, pp. 116 121. Springer, 2015. Heinze-Deml, C., Peters, J., and Meinshausen, N. Invariant causal prediction for nonlinear models. Journal of Causal Inference, 6(2), 2018. Hoyer, P., Janzing, D., Mooij, J. M., Peters, J., and Sch olkopf, B. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems (Neur IPS), volume 21, 2009. Jordan, M. I. Artificial intelligence the revolution hasn t happened yet. Harvard Data Science Review, 1(1), 2019. Kemmeren, P., Sameith, K., Van De Pasch, L. A., Benschop, J. J., Lenstra, T. L., Margaritis, T., O Duibhir, E., Apweiler, E., van Wageningen, S., Ko, C. W., et al. Largescale genetic perturbations reveal regulatory networks and an abundance of gene-specific repressors. Cell, 157 (3):740 752, 2014. Lauritzen, S. L. Graphical models. Clarendon Press, 1996. Magliacane, S., van Ommen, T., Claassen, T., Bongers, S., Versteeg, P., and Mooij, J. M. Domain adaptation by using causal inference to predict invariant conditional distributions. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 10846 10856. Curran Associates, Inc., 2018. Martinet, G., Strzalkowski, A., and Engelhardt, B. E. Variance minimization in the Wasserstein space for invariant causal prediction. ar Xiv preprint ar Xiv:2110.07064, 2021. Meinshausen, N. and B uhlmann, P. High-dimensional graphs and variable selection with the lasso. 
Mooij, J. M., Magliacane, S., and Claassen, T. Joint causal inference from multiple contexts. Journal of Machine Learning Research, 21(99):1-108, 2020. URL http://jmlr.org/papers/v21/17-123.html.

Pearl, J. Causality. Cambridge University Press, 2009.

Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 2014.

Pearl, J. Theoretical impediments to machine learning with seven sparks from the causal revolution. arXiv preprint arXiv:1801.04016, 2018.

Peters, J., Mooij, J. M., Janzing, D., and Schölkopf, B. Causal discovery with continuous additive noise models. Journal of Machine Learning Research, 15:2009-2053, 2014.

Peters, J., Bühlmann, P., and Meinshausen, N. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), pp. 947-1012, 2016.

Pfister, N., Bühlmann, P., and Peters, J. Invariant causal prediction for sequential data. Journal of the American Statistical Association, 114(527):1264-1276, 2019.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2021. URL https://www.R-project.org/.

Reisach, A., Seiler, C., and Weichwald, S. Beware of the simulated DAG! Causal discovery benchmarks may be easy to game. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, 2021.

Rojas-Carulla, M., Schölkopf, B., Turner, R., and Peters, J. Invariant models for causal transfer learning. Journal of Machine Learning Research, 19(1):1309-1342, 2018.

Schultheiss, C., Bühlmann, P., and Yuan, M. Higher-order least squares: assessing partial goodness of fit of linear regression. arXiv preprint arXiv:2109.14544, 2021.

Shah, R. D. and Peters, J. The hardness of conditional independence testing and the generalised covariance measure. Annals of Statistics, 48(3):1514-1538, 2020.

Shimizu, S., Hoyer, P. O., Hyvärinen, A., Kerminen, A., and Jordan, M. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(10), 2006.

Spirtes, P., Glymour, C. N., Scheines, R., and Heckerman, D. Causation, Prediction, and Search. MIT Press, 2000.

Squires, C., Wang, Y., and Uhler, C. Permutation-based causal structure learning with unknown intervention targets. In Conference on Uncertainty in Artificial Intelligence, pp. 1039-1048. PMLR, 2020.

Takata, K. Space-optimal, backtracking algorithms to list the minimal vertex separators of a graph. Discrete Applied Mathematics, 158:1660-1667, 2010.

Textor, J., van der Zander, B., Gilthorpe, M. S., Liśkiewicz, M., and Ellison, G. T. Robust causal inference using directed acyclic graphs: the R package 'dagitty'. International Journal of Epidemiology, 45(6):1887-1894, 2016.

Thams, N., Saengkyongam, S., Pfister, N., and Peters, J. Statistical testing under distributional shifts. arXiv preprint arXiv:2105.10821, 2021.

Tian, J., Paz, A., and Pearl, J. Finding minimal d-separators. Technical report, University of California, Los Angeles, 1998.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1996.

van der Zander, B., Liśkiewicz, M., and Textor, J. Separators and adjustment sets in causal graphs: Complete criteria and an algorithmic framework. Artificial Intelligence, 270:1-40, 2019.
Zhang, K., Peters, J., Janzing, D., and Schölkopf, B. Kernel-based conditional independence test and application in causal discovery. In Proceedings of the 27th Annual Conference on Uncertainty in Artificial Intelligence (UAI), pp. 804-813, 2011.

Zheng, X., Aragam, B., Ravikumar, P. K., and Xing, E. P. DAGs with NO TEARS: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, 2018.

## A.1. A Direct Proof of Proposition 3.2

Proof. Assume that $E$ is exogenous. If $E \in \mathrm{PA}_Y$, then there are no minimally invariant sets, and the statement holds trivially. If $E \notin \mathrm{PA}_Y$, then assume, for contradiction, that an invariant set $S_0 \subsetneq S$ exists. By assumption, $|S \setminus S_0| > 1$, because otherwise $S_0$ would be non-invariant. We can choose $S_1 \subseteq S$ and $k_0, k_1, \dots, k_l \in S$ with $l \ge 1$ such that for all $i = 1, \dots, l$: $k_i \notin \mathrm{DE}_{k_0}$, and
$$S_0 \cup S_1 \cup \{k_0, \dots, k_l\} = S \in \mathcal{I} \quad \text{and} \quad \text{for } 0 \le i < l: S_0 \cup S_1 \cup \{k_0, \dots, k_i\} \notin \mathcal{I}.$$
This can be done by iteratively removing elements from $S \setminus S_0$, removing first the earliest elements in the causal order. The first invariant set reached in this process is then $S_0 \cup S_1$.

Since $S_0 \cup S_1 \cup \{k_0\}$ is non-invariant, there exists a path $\pi$ between $E$ and $Y$ that is open given $S_0 \cup S_1 \cup \{k_0\}$ but blocked given $S_0 \cup S_1$. Since removing $k_0$ blocks $\pi$, $k_0$ must be a collider or a descendant of a collider $c$ on $\pi$:
$$E \;\overset{\pi_E}{\cdots}\; c \;\overset{\pi_Y}{\cdots}\; Y, \qquad k_0 \in \{c\} \cup \mathrm{DE}_c,$$
where $\pi_E$ and $\pi_Y$ denote the sub-paths of $\pi$ from $E$ to $c$ and from $c$ to $Y$, respectively; here, each edge on $\cdots$ may point either left or right. Since $\pi$ is open given $S_0 \cup S_1 \cup \{k_0\}$, the two sub-paths $\pi_E$ and $\pi_Y$ are open given $S_0 \cup S_1$.

Additionally, since $S_0 \cup S_1 \cup \{k_1, \dots, k_l\} = S \setminus \{k_0\}$ is non-invariant, there exists a path $\tau$ between $E$ and $Y$ that is unblocked given $S_0 \cup S_1 \cup \{k_1, \dots, k_l\}$ and blocked given $S_0 \cup S_1 \cup \{k_1, \dots, k_l\} \cup \{k_0\}$. It follows that $k_0$ lies on $\tau$ (otherwise $\tau$ cannot be blocked by adding $k_0$) and $k_0$ has at least one outgoing edge. Assume, without loss of generality, that there is an outgoing edge towards $Y$; call this sub-path of $\tau$ from $k_0$ to $Y$ $\tau_Y$. Since $\tau$ is open given $S_0 \cup S_1 \cup \{k_1, \dots, k_l\}$, so is $\tau_Y$. If there are no colliders on $\tau_Y$, then $\tau_Y$ is also open given $S_0 \cup S_1$. But then the path $E \,\overset{\pi_E}{\cdots}\, c \to \cdots \to k_0 \,\overset{\tau_Y}{\cdots}\, Y$ is also open given $S_0 \cup S_1$, contradicting invariance of $S_0 \cup S_1$. If there are colliders on $\tau_Y$, let $m$ be the collider closest to $k_0$, meaning that $m \in \mathrm{DE}_{k_0}$. Since $\tau_Y$ is open given $S_0 \cup S_1 \cup \{k_1, \dots, k_l\}$, either $m$ or a descendant of $m$ is in $S_0 \cup S_1 \cup \{k_1, \dots, k_l\}$. Since $\{k_1, \dots, k_l\} \cap \mathrm{DE}_{k_0} = \emptyset$, there exists $v \in (S_0 \cup S_1) \cap (\{m\} \cup \mathrm{DE}_m)$. But then $v \in \mathrm{DE}_{k_0} \cap (S_0 \cup S_1)$, meaning that $\pi$ is open given $S_0 \cup S_1$, contradicting invariance of $S_0 \cup S_1$.

We could assume that $\tau_Y$ had an outgoing edge from $k_0$ without loss of generality, because if there was instead an outgoing edge from $k_0$ on $\tau_E$, the above argument would work with $\pi_Y$ and $\tau_E$ instead. This concludes the proof.

## A.2. A Direct Proof of Proposition 3.3

Proof. If $E$ is a parent of $Y$, we have $\mathcal{MI} = \emptyset$ and the statement follows trivially. Thus, assume that $E$ is not a parent of $Y$. We will show that if $S \in \mathcal{I}$ is not a subset of $\mathrm{AN}_Y$, then $S' := S \cap \mathrm{AN}_Y \in \mathcal{I}$, meaning that $S \notin \mathcal{MI}$.

Assume, for contradiction, that there is a path $p$ between $E$ and $Y$ that is open given $S'$. Since $S \in \mathcal{I}$, $p$ is blocked given $S$. Then there exists a non-collider $Z$ on $p$ that is in $S \setminus \mathrm{AN}_Y$. We now argue that all nodes on $p$ are ancestors of $Y$, yielding a contradiction. First, assume that there are no colliders on $p$. If $E$ is exogenous, then $p$ is directed from $E$ to $Y$. (If $E$ is an ancestor of $Y$, any node on $p$ is either an ancestor of $Y$ or of $E$, and thus of $Y$.) Second, assume that there are colliders on $p$.
Since $p$ is open given the smaller set $S' \subseteq S$, all colliders on $p$ are in $S'$ or have a descendant in $S'$; therefore all colliders are ancestors of $Y$. If $E$ is exogenous, any node on $p$ is either an ancestor of $Y$ or of a collider on $p$. (If $E$ is an ancestor of $Y$, any node on $p$ is either an ancestor of $Y$, of a collider on $p$, or of $E$, and thus also of $Y$.) This completes the proof of Proposition 3.3.

## A.3. Proof of Proposition 3.4

Proof. First, we show that $S_{\mathrm{IAS}} \in \mathcal{I}$. If $S_{\mathrm{IAS}}$ is the union of a single minimally invariant set, it trivially holds that $S_{\mathrm{IAS}} \in \mathcal{I}$. Now assume that $S_{\mathrm{IAS}}$ is the union of at least two minimally invariant sets, $S_{\mathrm{IAS}} = S_1 \cup \dots \cup S_n$, $n \ge 2$, and assume, for a contradiction, that there exists a path $\pi$ between $E$ and $Y$ that is unblocked given $S_{\mathrm{IAS}}$. Since $\pi$ is blocked by a strict subset of $S_{\mathrm{IAS}}$, it follows that $\pi$ has at least one collider; further, every collider of $\pi$ is either in $S_{\mathrm{IAS}}$ or has a descendant in $S_{\mathrm{IAS}}$, and hence every collider of $\pi$ is an ancestor of $Y$, by Proposition 3.3. If $E$ is exogenous, $\pi$ has the following shape:
$$E \;\overset{\pi_1}{\cdots}\; c_1 \;\overset{\pi_2}{\cdots}\; c_2 \;\overset{\pi_3}{\cdots}\; \cdots \;\overset{\pi_k}{\cdots}\; c_k \;\overset{\pi_{k+1}}{\cdots}\; Y,$$
where $c_1, \dots, c_k$ are the colliders on $\pi$ and $\pi_1, \dots, \pi_{k+1}$ are the collider-free segments between them; if $E$ is exogenous, the first edge of $\pi_1$ points away from $E$. (If $E$ is not exogenous but $E \in \mathrm{AN}_Y$, the first edge may instead point into $E$; however, no matter which of the shapes $\pi$ takes, the proof proceeds the same.) The paths $\pi_1, \dots, \pi_{k+1}$, $k \ge 1$, do not have any colliders and are unblocked given $S_{\mathrm{IAS}}$. In particular, $\pi_1, \dots, \pi_{k+1}$ are unblocked given $S_1$. The path $\pi_{k+1}$ must have a final edge pointing to $Y$, because otherwise it would be a directed path from $Y$ to $c_k$, which contradicts acyclicity since $c_k$ is an ancestor of $Y$.

As $c_1$ is an ancestor of $Y$, there exists a directed path, say $\rho_1$, from $c_1$ to $Y$. Since $\pi_1$ is open given $S_1$ and since $S_1$ is invariant, it follows that $\rho_1$ must be blocked by $S_1$ (otherwise the path $E \,\overset{\pi_1}{\cdots}\, c_1 \,\overset{\rho_1}{\to}\, Y$ would be open). For this reason, $S_1$ contains a descendant of the collider $c_1$. Similarly, if $\rho_2$ is a directed path from $c_2$ to $Y$, then $S_1$ blocks $\rho_2$, because otherwise the path $E \,\overset{\pi_1}{\cdots}\, c_1 \,\overset{\pi_2}{\cdots}\, c_2 \,\overset{\rho_2}{\to}\, Y$ would be open. Again, for this reason, $S_1$ contains a descendant of $c_2$. Iterating this argument, it follows that $S_1$ contains a descendant of every collider on $\pi$, and since $\pi_1, \dots, \pi_{k+1}$ are unblocked by $S_1$, $\pi$ is open given $S_1$. This contradicts invariance of $S_1$ and proves that $S_{\mathrm{IAS}} \in \mathcal{I}$.

We now show that $S_{\mathrm{ICP}} \subseteq S_{\mathrm{IAS}}$, with equality if and only if $S_{\mathrm{ICP}} \in \mathcal{I}$. First, $S_{\mathrm{ICP}} \subseteq S_{\mathrm{IAS}}$ because $S_{\mathrm{IAS}}$ is a union of minimally invariant sets, and $S_{\mathrm{ICP}}$ is the intersection over all invariant sets. We now show the equivalence statement. Assume first that $S_{\mathrm{ICP}} \in \mathcal{I}$. As $S_{\mathrm{ICP}}$ is the intersection of all invariant sets, $S_{\mathrm{ICP}} \in \mathcal{I}$ implies that there exists exactly one invariant set that is contained in all other invariant sets. By definition, this means that there is only one minimally invariant set, and that this set is exactly $S_{\mathrm{ICP}}$. Thus, $S_{\mathrm{IAS}} = S_{\mathrm{ICP}}$. Conversely, assume that $S_{\mathrm{ICP}} \notin \mathcal{I}$. By construction, $S_{\mathrm{ICP}}$ is contained in any invariant set, in particular in the minimally invariant sets. However, since $S_{\mathrm{ICP}}$ is not invariant itself, this containment is strict, and it follows that $S_{\mathrm{ICP}} \subsetneq S_{\mathrm{IAS}}$.

## A.4. Proof of Proposition 4.1

Proof. First, we show $\mathrm{PA}_Y \cap (\mathrm{CH}_E \cup \mathrm{PA}(\mathrm{AN}_Y \cap \mathrm{CH}_E)) \subseteq S_{\mathrm{ICP}}$. If $j \in \mathrm{PA}_Y \cap \mathrm{CH}_E$, any invariant set contains $j$, because otherwise the path $E \to j \to Y$ is open. Similarly, if $j \in \mathrm{PA}_Y \cap \mathrm{PA}(\mathrm{AN}_Y \cap \mathrm{CH}_E)$, any invariant set contains $j$: there exists a node $j' \in \mathrm{AN}_Y \cap \mathrm{CH}_E$ with $j \in \mathrm{PA}_{j'}$, such that $E \to j' \to \cdots \to Y$ and $E \to j' \leftarrow j \to Y$; any invariant set $S$ must contain $j'$ or one of its descendants (to block the former path), and thus it must also contain $j$ to ensure that the path $E \to j' \leftarrow j \to Y$ is blocked by $S$.
It follows that, for all invariant $S$, $\mathrm{PA}_Y \cap (\mathrm{CH}_E \cup \mathrm{PA}(\mathrm{AN}_Y \cap \mathrm{CH}_E)) \subseteq S$, and hence
$$\mathrm{PA}_Y \cap (\mathrm{CH}_E \cup \mathrm{PA}(\mathrm{AN}_Y \cap \mathrm{CH}_E)) \subseteq \bigcap_{S\ \text{invariant}} S = S_{\mathrm{ICP}}.$$

To show $S_{\mathrm{ICP}} \subseteq \mathrm{PA}_Y \cap (\mathrm{CH}_E \cup \mathrm{PA}(\mathrm{AN}_Y \cap \mathrm{CH}_E))$, take any $j \notin \mathrm{PA}_Y \cap (\mathrm{CH}_E \cup \mathrm{PA}(\mathrm{AN}_Y \cap \mathrm{CH}_E))$. We argue that an invariant set $S$ not containing $j$ exists, such that $j \notin S_{\mathrm{ICP}} = \bigcap_{S\ \text{invariant}} S$. If $j \notin \mathrm{PA}_Y$, let $S = \mathrm{PA}_Y$, which is invariant. If $j \in \mathrm{PA}_Y$, define
$$S = (\mathrm{PA}_Y \setminus \{j\}) \cup \mathrm{PA}_j \cup (\mathrm{CH}_j \cap \mathrm{AN}_Y) \cup \mathrm{PA}(\mathrm{CH}_j \cap \mathrm{AN}_Y).$$
Because $j \notin \mathrm{CH}_E$ and $j \notin \mathrm{PA}(\mathrm{AN}_Y \cap \mathrm{CH}_E)$, we have $E \notin S$. Also observe that $S \subseteq \mathrm{AN}_Y$. We show that any path between $E$ and $Y$ is blocked by $S$, by considering all possible paths (we write $\cdots$ for the unspecified initial segment starting at $E$):

- $\cdots\, j' \to Y$ for $j' \neq j$: blocked because $j' \in \mathrm{PA}_Y \setminus \{j\} \subseteq S$.
- $\cdots\, v \to j \to Y$: blocked because $v \in \mathrm{PA}_j \subseteq S$ and $E \notin \mathrm{PA}_j$.
- $\cdots\, v \to c \leftarrow j \to Y$ and $c \in \mathrm{AN}_Y$: blocked because $v \in \mathrm{PA}(\mathrm{CH}_j \cap \mathrm{AN}_Y) \subseteq S$.
- $\cdots\, v \to c \leftarrow j \to Y$ and $c \notin \mathrm{AN}_Y$: blocked because $S \subseteq \mathrm{AN}_Y$, and since $c \notin \mathrm{AN}_Y$, $S \cap \mathrm{DE}_c = \emptyset$ and the path is blocked given $S$ because of the collider $c$.
- $\cdots \to c \leftarrow v \leftarrow j \to Y$ and $c \in \mathrm{AN}_Y$: blocked because $v \in \mathrm{AN}_c$ and $c \in \mathrm{AN}_Y$, so $v \in \mathrm{CH}_j \cap \mathrm{AN}_Y \subseteq S$.
- $\cdots \to c \leftarrow v \leftarrow j \to Y$ and $c \notin \mathrm{AN}_Y$: blocked for the same reason as in the case $\cdots\, v \to c \leftarrow j \to Y$ with $c \notin \mathrm{AN}_Y$.
- $\cdots \to c \leftarrow Y$: since $S \subseteq \mathrm{AN}_Y$, we must have $S \cap \mathrm{DE}_c = \emptyset$ (otherwise this would create a directed cycle from $Y$ to $Y$). Hence the path is blocked given $S$ because of the collider $c$.

Since there are no open paths from $E$ to $Y$ given $S$, $S$ is invariant, and $S_{\mathrm{ICP}} \subseteq S$. Since $j \notin S$, it follows that $j \notin S_{\mathrm{ICP}}$. This concludes the proof.

## A.5. Proof of Theorem 5.1

Proof. Consider first the case where all marginal tests have pointwise asymptotic power and pointwise asymptotic level.

Pointwise asymptotic level: Let $P_0 \in H^{MI}_{0,S}$. By the assumption of pointwise asymptotic level, there exists a non-negative sequence $(\epsilon_n)_{n \in \mathbb{N}}$ such that $\lim_{n} \epsilon_n = 0$ and $P_0(\phi_n(S) = 1) \le \alpha + \epsilon_n$. Then
$$P_0(\phi^{MI}_n(S) = 1) = P_0\Big(\{\phi_n(S) = 1\} \cup \bigcup_{j \in S} \{\phi_n(S \setminus \{j\}) = 0\}\Big) \le P_0(\phi_n(S) = 1) + \sum_{j \in S} P_0(\phi_n(S \setminus \{j\}) = 0) \le \alpha + \epsilon_n + \sum_{j \in S} P_0(\phi_n(S \setminus \{j\}) = 0) \longrightarrow \alpha$$
as $n \to \infty$. The convergence step follows from $H^{MI}_{0,S} = H^I_{0,S} \cap \bigcap_{j \in S} H^I_{A, S \setminus \{j\}}$ and from the assumption of pointwise asymptotic level and power. As $P_0 \in H^{MI}_{0,S}$ was arbitrary, this shows that $\phi^{MI}_n$ has pointwise asymptotic level.

Pointwise asymptotic power: To show that the decision rule has pointwise asymptotic power, consider any $P_A \in H^{MI}_{A,S}$. We have that
$$H^{MI}_{A,S} = H^I_{A,S} \cup \Big(H^I_{0,S} \cap \bigcup_{j \in S} H^I_{0, S \setminus \{j\}}\Big). \tag{7}$$
As the two sets on the right-hand side are disjoint, we can consider them one at a time. Consider first the case $P_A \in H^I_{A,S}$. This means that $S$ is not invariant and thus
$$P_A(\phi^{MI}_n(S) = 1) = P_A\Big(\{\phi_n(S) = 1\} \cup \bigcup_{j \in S} \{\phi_n(S \setminus \{j\}) = 0\}\Big) \ge P_A(\phi_n(S) = 1) \longrightarrow 1$$
by the assumption of pointwise asymptotic power. Next, assume that there exists $j' \in S$ such that $P_A \in H^I_{0,S} \cap H^I_{0, S \setminus \{j'\}}$. Then,
$$P_A(\phi^{MI}_n(S) = 1) = P_A\Big(\{\phi_n(S) = 1\} \cup \bigcup_{j \in S} \{\phi_n(S \setminus \{j\}) = 0\}\Big) \ge P_A(\phi_n(S \setminus \{j'\}) = 0) \ge 1 - \alpha - \epsilon_n \longrightarrow 1 - \alpha$$
as $n \to \infty$. Thus, for arbitrary $P_A \in H^{MI}_{A,S}$ we have shown that $P_A(\phi^{MI}_n(S) = 1) \ge 1 - \alpha$ in the limit. This shows that $\phi^{MI}_n$ has pointwise asymptotic power of at least $1 - \alpha$. This concludes the argument for pointwise asymptotic power.

Next, consider the case that the marginal tests have uniform asymptotic power and uniform asymptotic level. The calculations for showing that $\phi^{MI}_n$ has uniform asymptotic level and uniform asymptotic power of at least $1 - \alpha$ are almost identical to the pointwise calculations.

Uniform asymptotic level: By the assumption of uniform asymptotic level, there exists a non-negative sequence $(\epsilon_n)$ such that $\lim_n \epsilon_n = 0$ and $\sup_{P \in H^I_{0,S}} P(\phi_n(S) = 1) \le \alpha + \epsilon_n$.
Then,

sup_{P ∈ H^MI_{0,S}} P(ϕ^MI_n(S) = 1) = sup_{P ∈ H^MI_{0,S}} P( {ϕn(S) = 1} ∪ ⋃_{j∈S} {ϕn(S \ {j}) = 0} )
≤ sup_{P ∈ H^MI_{0,S}} P(ϕn(S) = 1) + Σ_{j∈S} sup_{P ∈ H^MI_{0,S}} P(ϕn(S \ {j}) = 0)
= sup_{P ∈ H^MI_{0,S}} P(ϕn(S) = 1) + Σ_{j∈S} ( 1 − inf_{P ∈ H^MI_{0,S}} P(ϕn(S \ {j}) = 1) )
≤ α + ϵn + Σ_{j∈S} ( 1 − inf_{P ∈ H^MI_{0,S}} P(ϕn(S \ {j}) = 1) ) → α + Σ_{j∈S} (1 − 1) = α as n → ∞,

where the first term is bounded using H^MI_{0,S} ⊆ H^I_{0,S} and uniform asymptotic level, and each infimum converges to one using H^MI_{0,S} ⊆ H^I_{A,S\{j}} and uniform asymptotic power.

Uniform asymptotic power: From (7), it follows that

inf_{P ∈ H^MI_{A,S}} P(ϕ^MI_n(S) = 1) = min{ inf_{P ∈ H^I_{A,S}} P(ϕ^MI_n(S) = 1), inf_{P ∈ H^I_{0,S} ∩ ⋃_{j∈S} H^I_{0,S\{j}}} P(ϕ^MI_n(S) = 1) }.

We consider the two inner terms separately. First,

inf_{P ∈ H^I_{A,S}} P(ϕ^MI_n(S) = 1) = inf_{P ∈ H^I_{A,S}} P( {ϕn(S) = 1} ∪ ⋃_{j∈S} {ϕn(S \ {j}) = 0} ) ≥ inf_{P ∈ H^I_{A,S}} P(ϕn(S) = 1) → 1.

Second,

inf_{P ∈ H^I_{0,S} ∩ ⋃_{j∈S} H^I_{0,S\{j}}} P(ϕ^MI_n(S) = 1)
= min_{j∈S} inf_{P ∈ H^I_{0,S} ∩ H^I_{0,S\{j}}} P( {ϕn(S) = 1} ∪ ⋃_{j′∈S} {ϕn(S \ {j′}) = 0} )
≥ min_{j∈S} inf_{P ∈ H^I_{0,S} ∩ H^I_{0,S\{j}}} P(ϕn(S \ {j}) = 0)
= min_{j∈S} ( 1 − sup_{P ∈ H^I_{0,S} ∩ H^I_{0,S\{j}}} P(ϕn(S \ {j}) = 1) )
≥ 1 − α − ϵn → 1 − α as n → ∞.

This shows that ϕ^MI_n has uniform asymptotic power of at least 1 − α, which completes the proof.

A.6. Proof of Theorem 5.2

Proof. We have that

lim_{n→∞} P(ŜIAS ⊆ ANY) ≥ lim_{n→∞} P(ŜIAS = SIAS),

as SIAS ⊆ ANY by Proposition 3.4. Furthermore, we have P(ŜIAS = SIAS) ≥ P(M̂I = MI). Let

A := {S | S ∉ I} \ {S | ∃ S′ ⊆ S such that S′ ∈ MI}

be those non-invariant sets that do not contain a minimally invariant set, and observe that

(M̂I = MI) ⊇ ⋂_{S∈MI} (ϕn(S, αC⁻¹) = 0) ∩ ⋂_{S∈A} (ϕn(S, αC⁻¹) = 1).    (8)

To see why this is true, note that to correctly recover MI, we need to 1) accept the hypothesis of minimal invariance for all minimally invariant sets and 2) reject the hypothesis of invariance for all non-invariant sets that are not supersets of a minimally invariant set (any superset of a set for which the hypothesis of minimal invariance is not rejected is removed in the computation of M̂I). Then,

P(M̂I = MI) ≥ P( ⋂_{S∈MI} (ϕn(S, αC⁻¹) = 0) ∩ ⋂_{S∈A} (ϕn(S, αC⁻¹) = 1) )
= 1 − P( ⋃_{S∈MI} (ϕn(S, αC⁻¹) = 1) ∪ ⋃_{S∈A} (ϕn(S, αC⁻¹) = 0) )
≥ 1 − Σ_{S∈MI} P(ϕn(S, αC⁻¹) = 1) − Σ_{S∈A} P(ϕn(S, αC⁻¹) = 0)
≥ 1 − Σ_{S∈MI} (αC⁻¹ + ϵ_{n,S}) − Σ_{S∈A} P(ϕn(S, αC⁻¹) = 0)
= 1 − |MI| αC⁻¹ − Σ_{S∈MI} ϵ_{n,S} − Σ_{S∈A} P(ϕn(S, αC⁻¹) = 0)
→ 1 − |MI| αC⁻¹ ≥ 1 − α as n → ∞,

where (ϵ_{n,S})_{n∈N, S∈MI} are non-negative sequences that converge to zero, the last convergence step follows from the assumption of asymptotic power, and the final inequality uses |MI| ≤ C. The sequences (ϵ_{n,S})_{n∈N, S∈MI} exist by the assumption of asymptotic level.

A.7. Proof of Proposition 5.4

Proof. We prove the statements one by one.

(i) Since S^m_IAS is the union over some of the minimally invariant sets, S^m_IAS ⊆ SIAS. The statement then follows from Proposition 3.3.
(ii) If m ≥ m_max, all S ∈ MI satisfy the requirement |S| ≤ m.
(iii) If m ≥ m_min, then S^m_IAS contains at least one minimally invariant set. The statement then follows from the first part of the proof of Proposition 3.4 given in Appendix A.3.
(iv) S^m_IAS contains at least one minimally invariant set and, by (iii), it is itself invariant. Thus, if SICP ∉ I, then SICP ⊊ S^m_IAS. If SICP ∈ I, then there exists only one minimally invariant set, which is SICP (see the proof of Proposition 3.4), and we have SICP = S^m_IAS.

This concludes the proof.

A.8. Proof of Theorem 5.5

Proof. The proof is identical to the proof of Theorem 5.2, after changing the correction factor 2^d to C(m), adding a superscript m to the quantities M̂I, ŜIAS and SIAS, and adding the condition |S| ≤ m to all unions, intersections and sums.

A.9. Proof of Proposition 7.1

Proof. By Proposition 3.3, we have SIAS ⊆ ANY, and since SIAS,O ⊆ SIAS, the claim follows immediately.
B. Oracle Algorithms for Learning SIAS

In this section, we review some of the existing literature on minimal d-separators, which can be exploited to give an algorithmic approach for finding SIAS from a DAG. We first introduce the concept of K-minimal separation with respect to a constraining set I.

Definition B.1 (van der Zander et al. (2019), Section 2.2). Let I ⊆ [d], K ⊆ [d] and S ⊆ [d]. We say that S is a K-minimal separator of E and Y with respect to a constraining set I if all of the following are true: (i) I ⊆ S; (ii) S separates E and Y; (iii) there does not exist a separator S′ of E and Y such that K ⊆ S′ ⊊ S. We denote by M_{K,I} the set of all K-minimal separating sets with respect to the constraining set I. (In this work, "S separates E and Y" means E ⊥ Y | S, i.e., S is invariant, but the definition can stand for other separation statements, too.)

The definition of a K-minimal separator coincides with the definition of a minimally invariant set if both K and the constraining set I are equal to the empty set. An ∅-minimal separator with respect to constraining set I is called a strongly-minimal separator with respect to constraining set I. We can now represent (2) using this notation: M_{∅,∅} contains exactly the minimally invariant sets, and thus SIAS = ⋃_{S ∈ M_{∅,∅}} S.

Listing the set M_{I,I} of all I-minimal separators with respect to the constraining set I (for any I) can be done in polynomial delay time O(d³) (van der Zander et al., 2019; Takata, 2010), where delay means that finding the next element of M_{I,I} (or announcing that there is no further element) has cubic complexity. This is the algorithm we exploit, as described in the main part of the paper. Furthermore, we have i ∈ SIAS if and only if M_{∅,{i}} ≠ ∅. This is because i ∈ SIAS if and only if there is a minimally invariant set that contains i, which is the case if and only if there exists a strongly-minimal separating set with respect to the constraining set {i}. Thus, we can construct SIAS by checking, for each i, whether M_{∅,{i}} is non-empty. Finding a strongly-minimal separator with respect to a constraining set I, i.e., finding an element of M_{∅,I}, is NP-hard if the set I is allowed to grow (van der Zander et al., 2019). To the best of our knowledge, however, it is unknown whether finding an element of M_{∅,{i}} for a singleton {i} is NP-hard.
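To make these oracle objects concrete, the following is a minimal brute-force sketch (exponential in d, unlike the polynomial-delay algorithm of van der Zander et al. (2019) discussed above), assuming a networkx DiGraph with nodes "E" and "Y"; the function names are ours. Invariance of a set S is checked as d-separation of E and Y given S, implemented as undirected separation in the moral graph of the ancestral subgraph (cf. Appendix C below).

```python
from itertools import combinations
import networkx as nx

def is_invariant(G, S, env="E", resp="Y"):
    """Oracle check that E is d-separated from Y given S in the DAG G,
    done as undirected separation in the moralized ancestral graph."""
    keep = {env, resp} | set(S)
    anc = set(keep)
    for v in keep:
        anc |= nx.ancestors(G, v)
    M = nx.moral_graph(G.subgraph(anc))  # marry co-parents, drop directions
    M.remove_nodes_from(S)               # conditioning on S removes these nodes
    return not nx.has_path(M, env, resp)

def minimally_invariant_sets(G, predictors):
    """All invariant sets that have no invariant proper subset (brute force)."""
    invariant = [frozenset(S) for r in range(len(predictors) + 1)
                 for S in combinations(sorted(predictors), r)
                 if is_invariant(G, S)]
    return [S for S in invariant if not any(T < S for T in invariant)]

def s_ias(G, predictors):
    """S_IAS: the union of all minimally invariant sets."""
    return set().union(*minimally_invariant_sets(G, predictors))

# Example: E -> X1 -> Y and E -> X2 -> Y. The only minimally invariant
# set is {X1, X2}, so S_IAS = {X1, X2}.
G = nx.DiGraph([("E", "X1"), ("X1", "Y"), ("E", "X2"), ("X2", "Y")])
print(s_ias(G, ["X1", "X2"]))  # {'X1', 'X2'}
```

This exhaustive search is only feasible for small d; for larger graphs, one would use the polynomial-delay algorithm referenced above.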
C. The Maximum Number of Minimally Invariant Sets

If one does not have a priori knowledge about the graph of the system being analyzed, one can still apply Theorem 5.2 with a correction factor of 2^d, as this ensures (with high probability) that no minimally invariant sets are falsely rejected. However, this correction factor is strictly conservative, as there cannot exist 2^d minimally invariant sets in a graph. Thus, correcting for 2^d tests controls the familywise error rate (FWER) among minimally invariant sets, but it increases the risk of falsely accepting a non-invariant set more than is necessary to control the FWER. Here, we discuss the maximum number of minimally invariant sets that can exist in a graph with d predictor nodes, and how a priori knowledge about the sparsity of the graph and the number of interventions can be leveraged to estimate a less strict correction that still controls the FWER.

As minimally invariant sets contain only ancestors of Y (see Proposition 3.3), we only need to consider graphs where Y comes last in a causal ordering. Since d-separation is equivalent to undirected separation in the moralized ancestral graph (Lauritzen, 1996), finding the largest number of minimally invariant sets is equivalent to finding the maximum number of minimal separators in an undirected graph with d + 2 nodes. It is an open question how many minimal separators can exist in a graph with d + 2 nodes, but it is known that a lower bound for the maximum number of minimal separators is in Ω(3^{d/3}) (Gaspers & Mackenzie, 2015). We therefore propose using a correction factor of C = 3^{⌈d/3⌉} when estimating the set ŜIAS from Theorem 5.2 if one does not have a priori knowledge of the number of minimally invariant sets in the DAG of the SCM being analyzed. This is a heuristic choice and is not conservative for all graphs.

Theorem 5.2 assumes asymptotic power of the invariance test, but as we can only have a finite amount of data, we will usually not have full power against all non-invariant sets that are not supersets of a minimally invariant set. Therefore, choosing a correction factor that is potentially too low represents a trade-off between error types: if we correct too little, we risk falsely rejecting a minimally invariant set while failing to reject a superset of it, whereas if we correct too harshly, we risk failing to reject non-invariant sets due to a lack of power.

If one has a priori knowledge of the sparsity of the graph or the number of interventions, this knowledge can be leveraged to estimate the maximum number of minimally invariant sets by simulation, using the following procedure:

1. For b = 1, . . . , B:
(a) Sample a DAG with d predictor nodes, N_interventions ∼ P_N interventions and edge probability p ∼ P_p for the graph over (X, Y), such that Y is last in a causal ordering. The measures P_N and P_p are distributions representing the a priori knowledge. For instance, in a controlled experiment, the researcher may have chosen the number N0 of interventions; P_N is then a degenerate distribution with P_N({N0}) = 1.
(b) Compute the set of all minimally invariant sets, e.g., using the adjustmentSets algorithm from dagitty (Textor et al., 2016).
(c) Return the number of minimally invariant sets.
2. Return the largest number of minimally invariant sets found in the B repetitions above.

Instead of performing a fixed number B of repetitions, one can continually update the largest number of minimally invariant sets found so far and end the procedure if, for example, the maximum has not changed in a predetermined number of steps.

D. A Finite Sample Algorithm for Computing ŜIAS

In this section, we provide an algorithm for computing the sets ŜIAS and Ŝ^m_IAS presented in Theorems 5.2 and 5.5. The algorithm finds minimally invariant sets by searching for invariant sets among sets of increasing size, starting from the empty set. This works because the first (correctly) accepted invariant set is a minimally invariant set. Furthermore, any set that is a superset of an accepted invariant set does not need to be tested, as such a set cannot be minimal. Tests for invariance can be computationally expensive when one has large amounts of data; skipping unnecessary tests therefore offers a significant speedup. In the extreme case where all singletons are found to be invariant, the algorithm completes in d + 1 steps, compared to Σ_{i=0}^{m} (d choose i) steps (2^d steps if m = d). This is implemented in lines 8–10 of Algorithm 1, which is stated below and followed by a Python sketch.

Algorithm 1 An algorithm for computing ŜIAS from data
input: a decision rule ϕn for invariance, significance thresholds α0 and α, a maximum size m of the sets to test (potentially m = d), and data
output: the set ŜIAS
1: initialize M̂I as an empty list
2: PS ← {S ⊆ [d] : |S| ≤ m}
3: if ϕn(∅, α0) = 0 then
4:   end the procedure and return ŜIAS = ∅
5: end if
6: sort PS in increasing order according to the set sizes
7: for S ∈ PS do
8:   if S′ ⊆ S for some S′ ∈ M̂I then
9:     skip the test of S and go to the next iteration of the loop
10:  else
11:    add S to M̂I if ϕn(S, α) = 0, else continue
12:  end if
13:  if the union of the sets in M̂I contains all nodes then
14:    break the loop
15:  end if
16: end for
17: return ŜIAS as the union of all sets in M̂I
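The following Python sketch mirrors Algorithm 1. The decision rule phi_n(S, level) is a user-supplied invariance test returning 1 if invariance of S is rejected and 0 otherwise (its name and interface are our assumptions, not part of the paper's code); comments mark the corresponding lines of Algorithm 1.

```python
from itertools import combinations

def ias_hat(phi_n, d, alpha, alpha0, m=None):
    """Sketch of Algorithm 1: search for minimally invariant sets in order
    of increasing size and return the union of the accepted sets.

    phi_n: callable (frozenset, level) -> {0, 1}, where 1 rejects invariance.
    """
    m = d if m is None else m
    if phi_n(frozenset(), alpha0) == 0:         # lines 3-5: empty set accepted,
        return set()                            # so S_IAS is estimated as empty
    mi_hat = []                                 # accepted minimally invariant sets
    for size in range(1, m + 1):                # lines 2 and 6: increasing size
        for S in map(frozenset, combinations(range(d), size)):
            if any(T <= S for T in mi_hat):     # lines 8-10: supersets of accepted
                continue                        # sets cannot be minimal; skip test
            if phi_n(S, alpha) == 0:            # line 11: accepted invariant set
                mi_hat.append(S)
        if mi_hat and set().union(*mi_hat) == set(range(d)):
            break                               # lines 13-15: all nodes covered
    return set().union(*mi_hat) if mi_hat else set()
```

Any invariance test with the level and power properties assumed in Section 5 fits this interface; the early-stopping check is performed once per size level here, a minor deviation from the per-set check in Algorithm 1.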
E. Additional Experiment Details

E.1. Simulation Details for Section 6.1

We sample graphs that satisfy Assumption 2.1, with the additional requirement that Y ∈ DE_E, by the following procedure:

1. Sample a DAG G for the graph of (X, Y) with d + 1 nodes, for d ∈ {4, 6, . . . , 20} ∪ {100, 1,000}, and choose Y to be a node (chosen uniformly at random) that is not a root node.
2. Add a root node E to G with N_interventions children that are not Y. When d ≤ 20, N_interventions ∈ {1, . . . , d}, and when d ≥ 100, N_interventions ∈ {1, . . . , 0.1·d} (i.e., we consider interventions on up to ten percent of the predictor nodes).
3. Repeat the first two steps if Y ∉ DE_E.

E.2. Simulation Details for Section 6.2

We simulate data for the experiment in Section 6.2 (and the additional plots in Appendix E.4) by the following procedure; a code sketch of the sampling step i is given after the list.

1. Sample data from a single graph by the following procedure:
(a) Sample a random graph G of size d + 1 and sample Y (chosen uniformly at random) as any node that is not a root node in this graph.
(b) Sample coefficients β_{i→j} for all edges (i → j) in G independently from U((−2, −0.5) ∪ (0.5, 2)).
(c) Add a node E with no incoming edges and N_interventions children, none of which are Y. When d = 6, we set N_interventions = 1, and when d = 100, we sample N_interventions uniformly from {1, . . . , 10}.
(d) If Y is not a descendant of E, repeat steps (a), (b) and (c) until a graph with Y ∈ DE_E is obtained.
(e) For n ∈ {10², 10³, 10⁴, 10⁵}:
i. Draw 50 datasets of size n from an SCM with graph G, coefficients β_{i→j} and i.i.d. N(0, 1) noise innovations. The environment variable E is sampled independently from a Bernoulli distribution with probability parameter p = 0.5, corresponding to (roughly) half the data being observational and half interventional. The data are generated by looping through a causal ordering of (X, Y), starting at the bottom, and standardizing each node by its own empirical standard deviation before generating the children of that node; that is, a node Xj is first generated from PAj and then standardized before any node in CHj is generated. If Xj is intervened on, we standardize it prior to the intervention.
ii. For each sampled dataset, apply IAS and ICP. Record the Jaccard similarities between the output of IAS and ANY and between the output of ICP and ANY, and record whether each output was a subset of ANY and whether it was empty.
iii. Estimate the quantity plotted (the average Jaccard similarity in Figure 4, or the probability of ŜIAS ⊆ ANY or of ŜIAS = ∅ in Figure 7) from the 50 simulated datasets.
(f) Return the estimated quantities from the previous step.
2. Repeat the above 100 times and save the results in a data frame.
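The following is a minimal sketch of the sampling step i above, assuming a linear SCM in which a do-intervention sets each intervened node to the intervention strength in the samples with E = 1; this reading of the intervention mechanism, and all function names, are ours.

```python
import numpy as np
import networkx as nx

def sample_dataset(G, beta, intervened, n, strength=1.0, seed=None):
    """Draw n samples from a linear SCM over the DAG G (nodes exclude E).

    Each node is generated from its parents plus i.i.d. N(0, 1) noise and is
    standardized by its empirical standard deviation before its children are
    generated. E ~ Bernoulli(1/2); in samples with E = 1, each node in
    `intervened` is set to `strength` after standardization (our reading of
    the do-interventions described above).
    """
    rng = np.random.default_rng(seed)
    E = rng.binomial(1, 0.5, size=n)
    X = {}
    for j in nx.topological_sort(G):           # loop through a causal ordering
        x = rng.standard_normal(n)             # noise innovation
        for i in G.predecessors(j):
            x = x + beta[(i, j)] * X[i]
        x = x / x.std()                        # standardize before children
        if j in intervened:
            x = np.where(E == 1, strength, x)  # standardize prior to intervention
        X[j] = x
    return E, X

# Example: X1 -> Y with an intervention on X1.
G = nx.DiGraph([("X1", "Y")])
E, X = sample_dataset(G, {("X1", "Y"): 1.3}, {"X1"}, n=1000)
```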
E.3. Analysis of the Choice of C in Section 6.2

We have repeated the simulation with d = 6 from Section 6.2, but with the correction factor C = 2⁶ suggested by Theorem 5.2 instead of the heuristic correction factor C = 9 suggested in Appendix C. Figure 6 shows the results. We see that the results are almost identical to those presented in Figure 4. Thus, in the scenario considered here, there is no change in the performance of ŜIAS (as measured by Jaccard similarity) between using a correction factor of C = 2⁶ = 64 and a correction factor of C = 3^{6/3} = 9. In larger graphs, the difference is likely more pronounced: at d = 10, the strictly conservative correction factor suggested by Theorem 5.2 is 2¹⁰ = 1,024, whereas the correction factor suggested in Appendix C is only 3^{⌈10/3⌉} = 3⁴ = 81; at d = 20, the two are 2²⁰ = 1,048,576 and 3^{⌈20/3⌉} = 3⁷ = 2,187, respectively.

[Figure 6: two panels ("Oracle IAS and ICP different", showing ŜIAS and ŜICP, and "Oracle IAS and ICP equal"); x-axis: number of observations (10²–10⁵); y-axis: Jaccard similarity to ANY.]

Figure 6. The same figure as Figure 4, but with a correction factor of C = 2⁶ = 64 instead of C = 3^{6/3} = 9. Only d = 6 is shown, as the correction factor for d = 100 is unchanged. Here, the guarantees of Theorem 5.2 are not violated by a potentially too small correction factor, and the results are near identical to those of Figure 4, which uses the milder correction factor.

E.4. Analysis of the Choice of α0 in Section 6.2

Here, we investigate the quantities P(ŜIAS ⊆ ANY), P(Ŝ¹_IAS ⊆ ANY), P(ŜIAS = ∅) and P(Ŝ¹_IAS = ∅) using the same simulation setup as described in Section 6.2. In addition to α0 = 10⁻⁶ (conservative, see Remark 5.3), used in Section 6.2, we also ran the simulations with α0 = α (testing all hypotheses at the same level) and α0 = 10⁻¹² (very conservative). The results for α0 = 10⁻⁶ (shown in Figure 7) were recorded in the same simulations that produced the output for Figure 4. For α0 ∈ {α, 10⁻¹²} (shown in Figures 8 and 9, respectively), we only simulated up to 10,000 observations, to keep computation time low.

Generally, we find that the probability of IAS outputting a subset of the ancestors is high, and increasingly so for larger sample sizes (see Figures 7 to 9), in line with Theorem 5.2. With 100,000 observations, the probability of IAS being a subset of the ancestors is roughly equal to one for almost all SCMs, although there are a few SCMs for which IAS is never a subset of the ancestors (see Figure 7). For α0 = 10⁻⁶, the median probability of IAS containing only ancestors is one in all cases except d = 100 with 1,000 observations, where the median probability is 87%.

In general, varying α0 has the effect hypothesized in Remark 5.3: lowering α0 increases the probability that IAS contains only ancestors, but at the cost of increasing the probability that the output is empty (see Figures 7 to 9). For instance, the median probability of IAS being a subset of the ancestors when α0 = 10⁻¹² is one for all sample sizes, but the output is always empty with 100 observations, and empty roughly half the time even at 1,000 observations when d = 100 (see Figure 9). In contrast, not testing the empty set at a reduced level means that the output of IAS is rarely empty, but the probability of IAS containing only ancestors decreases. Still, even with α0 = α, the median probability of IAS containing only ancestors was never lower than 80% (see Figure 8).
Thus, choosing α0 amounts to a trade-off between finding more ancestor-candidates and more of those candidates being false positives.

E.5. Analysis of the Strength of Interventions in Section 6.2

Here, we repeat the d = 6 simulations from Section 6.2 with a reduced strength of the environment, to investigate the performance of IAS under weaker interventions. We sample from the same SCMs as in Section 6.2, but reduce the strength of the interventions to 0.5 instead of 1. That is, the observational distributions are the same as in Section 6.2, but interventions on a node Xj are here half as strong as in Section 6.2.

The Jaccard similarity between ŜIAS and ANY is generally lower than what we found in Figure 4 (see Figure 10). This is likely due to having lower power to detect non-invariance, which has two implications. First, lower power means that we may fail to reject the empty set, in which case we output nothing and the Jaccard similarity between ŜIAS and ANY is zero. Second, we may correctly reject the empty set but fail to reject another non-invariant set that contains non-ancestors of Y, which are then potentially included in the output. In that case, the Jaccard similarity between ŜIAS and ANY is lower, because the number of false findings increases.

[Figure 7: four panels showing Pn(ŜIAS ⊆ ANY) and Pn(Ŝ¹_IAS ⊆ ANY) (top row) and Pn(ŜIAS = ∅) and Pn(Ŝ¹_IAS = ∅) (bottom row); x-axis: number of observations (10²–10⁵).]

Figure 7. The empirical probabilities of recovering a subset of ANY (top row) and recovering an empty set (bottom row), when testing the empty set for invariance at level α0 = 10⁻⁶. Generally, our method seems to hold its level well, especially when sample sizes are large. When the sample size is small, the output is often the empty set. When d = 6, we estimate ŜIAS (left column), and when d = 100, we estimate Ŝ¹_IAS (right column). The results here are from the simulations that also produced Figure 4. Medians are displayed as orange lines through each boxplot. Each point represents the probability that the output set is ancestral (resp. empty) for a randomly selected SCM, as estimated by repeatedly sampling data from the same SCM for every n ∈ {10², 10³, 10⁴, 10⁵}. Observations from the same SCM are connected by a line. Each panel contains data from 100 randomly drawn SCMs. Points have been perturbed slightly along the x-axis to improve readability.

We find that the probability that ŜIAS is a subset of the ancestors is largely unchanged at the lower intervention strength, but the probability that ŜIAS is empty generally increases for small sample sizes (see Table 1). This indicates that IAS does not make more mistakes under the weaker interventions, but it is more often uninformative. We also see that in both settings, ŜIAS is empty more often than ŜICP for small sample sizes, but less often for larger samples (see Table 1). This is likely because IAS tests the empty set at a much lower level than ICP does (10⁻⁶ compared to 0.05). Thus, IAS requires more power to find anything, but once it has sufficient power, it finds more than ICP (see also Figure 10). The median probability of ICP returning a subset of the ancestors was always at least 95% (not shown).

[Figure 8: same four-panel layout as Figure 7; x-axis: number of observations (10²–10⁴).]

Figure 8. The same figure as Figure 7, but with α0 = α = 0.05 and n ∈ {10², 10³, 10⁴}.
Testing the empty set at the non-conservative level α0 = α means that the empty set is output less often for small sample sizes, but it decreases the probability that the output is a subset of the ancestors. Thus, we find more ancestor-candidates, but make more mistakes, when α0 = α. However, the median probability of the output being a subset of the ancestors is at least 80% in all configurations.

Table 1. Summary of the quantities P(ŜIAS ⊆ ANY), P(ŜIAS = ∅) and P(ŜICP = ∅) for weak and strong do-interventions (strength 0.5 and 1, respectively) when d = 6. Numbers not in parentheses are means; numbers in parentheses are medians. The level is generally unchanged when the environments have a weaker effect, but the power is lower, in the sense that the empty set is output more often.

                         P(ŜIAS ⊆ ANY)   P(ŜIAS = ∅)    P(ŜICP = ∅)
Strong interventions
  n = 100                96.6% (100%)     89.6% (98%)     52.3% (52%)
  n = 1,000              75.7% (100%)     10.0% (0%)      30.4% (14%)
  n = 10,000             83.7% (100%)     1.0% (0%)       24.9% (10%)
  n = 100,000            93.8% (100%)     0.2% (0%)       22.9% (10%)
Weak interventions
  n = 100                99.3% (100%)     98.7% (100%)    72.0% (84%)
  n = 1,000              81.1% (100%)     40.2% (26%)     36.9% (24%)
  n = 10,000             80.8% (100%)     1.7% (0%)       27.5% (15%)
  n = 100,000            92.6% (100%)     1.1% (0%)       24.8% (14%)

[Figure 9: same four-panel layout as Figure 7; x-axis: number of observations (10²–10⁴).]

Figure 9. The same figure as Figure 7, but with α0 = 10⁻¹² and n ∈ {10², 10³, 10⁴}. Testing the empty set at the very conservative level α0 = 10⁻¹² means that the empty set is output more often (with one hundred observations, we only find the empty set), but it increases the probability that the output is a subset of the ancestors. Thus, testing at a very conservative level means that we do not make many mistakes, but the output is often non-informative.

[Figure 10: two panels ("Oracle IAS and ICP different", showing ŜIAS and ŜICP, and "Oracle IAS and ICP equal"); x-axis: number of observations (10²–10⁵); y-axis: Jaccard similarity to ANY.]

Figure 10. The same figure as Figure 4, but with weaker environments (do-interventions of strength 0.5, compared to 1 in Figure 4). Generally, IAS performs the same for weaker interventions as for strong interventions when there are more than 10,000 observations. Graphs represented in each boxplot: 42 (left), 58 (right).

E.6. Analysis of the Choice of q_TP in Section 6.3

In this section, we analyze the effect of changing the cut-off q_TP that determines when a gene pair is considered a true positive in Section 6.3. For the results in the main paper, we use q_TP = 1%, meaning that the pair (gene X, gene Y) is considered a true positive if the value of gene Y when intervening on gene X lies outside the 0.01- and 0.99-quantiles of gene Y in the observational distribution. In Figure 11, we plot the true positive rates for several other choices of q_TP. We compare to the true positive rate of random guessing, which also increases as the criterion becomes easier to satisfy. We observe that the choice of q_TP does not substantially change the excess true positive rate of our method compared to random guessing. This indicates that while the true positives in this experiment are inferred from data, the conclusions drawn in Figure 5 are robust with respect to this modelling choice.

[Figure 11: true positive rate for varying q_TP ∈ {0.01, 0.025, 0.05, 0.1}; a reference line shows the TPR of random guessing.]

Figure 11. True positive rates (TPRs) for the gene experiment in Section 6.3. q_TP specifies the quantile in the observed distribution that an intervention effect has to exceed to be considered a true positive. While the TPR of our method increases when q_TP is increased, the TPR of random guessing increases comparably. This validates that changing the definition of true positives in this experiment by choosing a different q_TP does not change the conclusion of the experiment substantially.

E.7. Learning Causal Ancestors by Estimating the I-MEC

In this section, we repeat the experiments performed in Section 6.2, this time including a procedure (here denoted IASest. graph),
where we perform the following steps:

1. Estimate a member graph of the I-MEC and the locations of the intervention sites using Unknown-Target Interventional Greedy Sparsest Permutation (UT-IGSP) (Squires et al., 2020), using the implementation from the Python package causaldag.⁹
2. Apply the oracle algorithm described in Section 4 to the estimated graph to obtain an estimate of MI.
3. Output the union of all sets in the estimate of MI.

⁹Available at https://github.com/uhlerlab/causaldag.

The results for the low-dimensional experiment are displayed in Figure 12, and the results for the high-dimensional experiment are displayed in Table 2. We see that IASest. graph generally performs well (as measured by Jaccard similarity) in the low-dimensional setting (d = 6), even better than IAS for sample sizes N ≤ 10³, but it is slightly outperformed by IAS for larger sample sizes. However, in the high-dimensional setting (d = 100), IASest. graph fails to hold level and identifies only very few ancestors (see Table 2).

We hypothesize that the poor performance of IASest. graph in the high-dimensional setting is due to IASest. graph attempting to solve a more difficult task than IAS: it first estimates a full graph (here using UT-IGSP), even though only a subgraph of the full graph is relevant in this scenario. In addition, UT-IGSP aims to estimate the sites of the unknown interventions. In contrast, IAS only needs to identify nodes that are capable of blocking all paths between two variables, and does not need to know the sites of the interventions.

Table 2. Identifying ancestors by first estimating the I-MEC of the underlying DAG and then applying the oracle algorithm of Section 4 fails to hold level and identifies fewer ancestors than applying IAS in the high-dimensional setting.

                  d = 100, N = 10³           d = 100, N = 10⁴           d = 100, N = 10⁵
                  IAS      IASest. graph     IAS      IASest. graph     IAS      IASest. graph
P(Ŝ ⊆ ANY)        84.64%   15.30%            94.04%   14.92%            94.72%   14.74%
P(Ŝ = ∅)          51.96%   12.32%            12.72%   11.84%            6.98%    11.42%
J(Ŝ, ANY)         0.19     0.10              0.33     0.10              0.35     0.11

[Figure 12: Jaccard similarity to ANY vs. number of observations (10²–10⁵), for IAS and IASest. graph.]

Figure 12. Comparison between the finite-sample output of IAS and the procedure described in Appendix E.7, in the low-dimensional case. Generally, the two procedures have similar performance, although IAS performs worse for small sample sizes and slightly better for large sample sizes.
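For reference, the summary quantities reported in this appendix (probability of the output being a subset of ANY, probability of the output being empty, and Jaccard similarity to ANY) can be computed as in the following sketch; the function names and the convention J(∅, ∅) = 1 are ours.

```python
def jaccard(S, T):
    """Jaccard similarity |S & T| / |S | T|, with J(empty, empty) := 1."""
    S, T = set(S), set(T)
    return 1.0 if not (S | T) else len(S & T) / len(S | T)

def summarize(estimates, ancestors):
    """Aggregate a list of estimated sets against the true ancestors AN_Y."""
    ancestors = set(ancestors)
    n = len(estimates)
    return {
        "P(S subset of AN_Y)": sum(set(S) <= ancestors for S in estimates) / n,
        "P(S empty)": sum(len(S) == 0 for S in estimates) / n,
        "mean Jaccard": sum(jaccard(S, ancestors) for S in estimates) / n,
    }

# Example: two estimated sets evaluated against AN_Y = {1, 2, 3}.
print(summarize([{1, 2}, set()], {1, 2, 3}))
# {'P(S subset of AN_Y)': 1.0, 'P(S empty)': 0.5, 'mean Jaccard': 0.333...}
```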