# on_efficient_adjustment_in_causal_graphs__dea34bb6.pdf

Journal of Machine Learning Research 21 (2020) 1-45 Submitted 2/20; Revised 11/20; Published 12/20

On Eﬃcient Adjustment in Causal Graphs

Janine Witte witte@leibniz-bips.de Leibniz Institute for Prevention Research and Epidemiology BIPS, Bremen, Germany and Faculty of Mathematics and Computer Science, University of Bremen, Germany

Leonard Henckel henckel@stat.math.ethz.ch Seminar for Statistics, ETH Zurich, Switzerland

Marloes H. Maathuis maathuis@stat.math.ethz.ch Seminar for Statistics, ETH Zurich, Switzerland

Vanessa Didelez didelez@leibniz-bips.de Leibniz Institute for Prevention Research and Epidemiology BIPS, Bremen, Germany and Faculty of Mathematics and Computer Science, University of Bremen, Germany

Editor: Peter Spirtes

We consider estimation of a total causal eﬀect from observational data via covariate adjustment. Ideally, adjustment sets are selected based on a given causal graph, reﬂecting knowledge of the underlying causal structure. Valid adjustment sets are, however, not unique. Recent research has introduced a graphical criterion for an optimal valid adjustment set (O-set). For a given graph, adjustment by the O-set yields the smallest asymptotic variance compared to other adjustment sets in certain parametric and non-parametric models. In this paper, we provide three new results on the O-set. First, we give a novel, more intuitive graphical characterisation: We show that the O-set is the parent set of the outcome node(s) in a suitable latent projection graph, which we call the forbidden projection. An important property is that the forbidden projection preserves all information relevant to total causal eﬀect estimation via covariate adjustment, making it a useful methodological tool in its own right. Second, we extend the existing IDA algorithm to use the O-set, and argue that the algorithm remains semi-local. This is implemented in the R-package pcalg. Third, we present assumptions under which the O-set can be viewed as the target set of popular non-graphical variable selection algorithms such as stepwise backward selection. Keywords: causal discovery, causal inference, confounder selection, confounding, eﬃciency, graphical models, IDA algorithm, model selection, suﬃcient adjustment set

1. Introduction

In typical analyses of observational data, we wish to estimate the total causal eﬀect of a (possibly multivariate) treatment or exposure X on a (possibly multivariate) outcome Y. Ideally, we can fully specify the underlying causal directed acyclic graph (DAG). We can then use a graphical adjustment criterion, e.g. Pearl s back-door criterion (Pearl, 2009) or the generalised adjustment criterion (Perkovi c et al., 2015, 2018; Shpitser et al., 2010), to check whether a set of covariates is valid for adjustment. However, there may be more than one valid adjustment set. Although all resulting estimators are then consistent, their variances may diﬀer considerably.

c 2020 Janine Witte, Leonard Henckel, Marloes H. Maathuis and Vanessa Didelez.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/20-175.html.

Witte, Henckel, Maathuis and Didelez

There are several approaches to choose an adjustment set among all valid adjustment sets. For example, one can pick a minimal adjustment set (de Luna et al., 2011; Textor and Li skiewicz, 2011). An alternative strategy is to aim at decreasing the causal eﬀect estimator s variance by including variables associated with the outcome (e.g. Brookhart et al., 2006; Lunceford and Davidian, 2004; Shortreed and Ertefaie, 2017). Witte and Didelez (2019) referred to this strategy as the outcome-oriented approach. It is especially popular when little graphical knowledge is available. A major advancement for the outcomeoriented approach was the graphical characterisation of the optimal adjustment set (O-set) by Henckel et al. (2019) (HPM19). They showed that under a linear model, adjusting for the O-set yields the smallest asymptotic variance for the causal eﬀect estimator compared to all other valid adjustment sets, under assumptions detailed below. Strengthening this result, Rotnitzky and Smucler (2020) (RS20) recently showed that the minimal variance property of the O-set is retained for a class of non-parametric estimators. All these results apply to DAGs, as well as so-called amenable completed partially directed acyclic graphs (CPDAGs; see e.g. Andersson et al., 1997) and amenable maximally oriented partially directed acyclic graphs (max PDAGs; see Perkovi c et al., 2017). These are larger classes of graphs allowing for undirected edges where the direction cannot be decided. Amenability implies that despite the undirected edges, an adjustment set can be identiﬁed from the CPDAG (or max PDAG) so that this set is valid for adjustment in all DAGs in the equivalence class. If a CPDAG (or max PDAG) is not amenable, no common adjustment set for all DAGs in the equivalence class exists (Perkovi c et al., 2018), and hence diﬀerent DAGs may imply diﬀerent true causal eﬀects of X on Y. However, it is then still possible to estimate a multiset of possible causal eﬀects (meaning that all eﬀects in the multiset are compatible with the non-amenable graph) using the IDA algorithm by Maathuis et al. (2009, 2010). In this paper, we provide three new results on eﬃcient causal eﬀect estimation. First, after brieﬂy reviewing the results of HPM19 and RS20 (Section 2), we provide an alternative, intuitive characterisation of the O-set. This is based on the new concept of a forbidden projection, which has many interesting properties regarding adjustment for confounding (Section 3). Second, we extend the application of the O-set to non-amenable CPDAGs and max PDAGs, by incorporating optimal adjustment into the IDA algorithm (Section 4). Third, we discuss how and under what assumptions the O-set can be viewed as the target set of data-driven variable selection methods such as backward model selection (Section 5).

2. Optimal Adjustment for Known Causal Structure

We begin by clarifying our setting and deﬁning the O-set, before proposing an alternative deﬁnition in Section 3. We defer most of the terminology and formal deﬁnitions to Appendix A; here we only state some key concepts. (Possibly) causal nodes and forbidden nodes. Let G be a causal DAG, CPDAG or max PDAG. A path (V1, . . . , Vm) in G is called causal from V1 to Vm if Vi Vi+1 for all i {1, . . . , m 1}. It is called possibly causal if there are no i, j {1, . . . , m}, i < j, such that Vi Vj. Otherwise it is called non-causal from V1 to Vm. A path from X to Y is proper if only its ﬁrst node is in X. If there is a causal path from V1 to Vm in G, then Vm is called a descendant of V1 in G. Analogously, if there is a possibly causal path from V1 to Vm in G, then Vm is called a possible descendant of V1 in G. The set of all

On Efficient Adjustment in Causal Graphs

descendants of V1 in G is denoted by de(V1, G), and the set of all possible descendants by possde(V1, G). The causal nodes with respect to (X, Y) in G, denoted by cn(X, Y, G), are the nodes on proper causal paths from X to Y, excluding X itself. The possibly causal nodes posscn(X, Y, G) are deﬁned analogously. The forbidden set with respect to (X, Y) and G is deﬁned as forb(X, Y, G) = possde(posscn(X, Y, G), G) X. In a DAG, this simpliﬁes to forb(X, Y, G) = de(cn(X, Y, G), G) X. The nodes in the forbidden set are called forbidden nodes. It can be shown that valid adjustment sets never contain forbidden nodes (Perkovi c et al., 2018). Valid adjustment sets. We consider a set of treatments X and a set of outcomes Y. A (possibly empty) set Z is a valid adjustment set relative to (X, Y) if the interventional distribution f(y | do(x)) of Y, given we set X to x by intervention, factorises as follows:

f(y | do(x)) =

( f(y | x) if Z = , R

z f(y | x, z)f(z)dz otherwise.

Valid adjustment sets can be read oﬀfrom a given causal DAG, CPDAG or max PDAG G using the generalised adjustment criterion (Perkovi c et al., 2017, 2018; Shpitser et al., 2010), which generalises Pearl s back-door criterion (Pearl, 2009): Z is a valid adjustment set relative to (X, Y) in G if and only if the following three conditions hold: (a) every proper possibly causal path from (X to Y) starts with a directed edge out of X, (b) Z forb(X, Y, G) = , (c) all proper non-causal deﬁnite-status paths from X to Y are blocked by Z. Property (a) is called amenability. See Appendix A for the deﬁnition of a deﬁnite-status path. In a DAG, all paths are of deﬁnite status. We consider two model classes and corresponding strategies for estimating causal eﬀects when a valid adjustment set is available: (i) the causal linear model with possibly non Gaussian error terms, where causal eﬀects are estimated using linear regression (HPM19), and (ii) the more general non-parametric causal model, where estimation proceeds nonparametrically (RS20). In both settings, we assume an underlying causal DAG, and that we observe all variables displayed as nodes in the DAG, i.e. there are no latent variables. Causal linear models (HPM19). A causal linear model is a causal DAG where every edge represents a linear causal eﬀect. In a causal linear model, the (joint) causal eﬀect of X = {X1, . . . , Xkx} on Y = {Y1, . . . , Yky} is deﬁned as the matrix τ yx with elements

(τ yx)j,i = xi E(Yj | do(x1, . . . , xkx))

= E(Yj | do(x1, . . . , xi + 1, . . . , xkx)) E(Yj | do(x1, . . . , xkx)),

where element (τ yx)j,i corresponds to the controlled direct eﬀect (Robins and Greenland, 1992; Pearl, 2001) of Xi on Yj relative to X. In other words, (τ yx)j,i is the diﬀerence in E(Yj) when X is set to (x1, , xi + 1, . . . , xkx) by intervention, compared to when X is set to (x1, . . . , xkx) by intervention. We can compute the eﬀect of more general interventions as functions of the elements of τ yx; for example, the sum of the ﬁrst row corresponds to the eﬀect on Y1 of increasing all elements of (x1, . . . , xkx) by one. Given a valid adjustment set Z for the eﬀect of X on Y, τ yx can be rewritten as a matrix of regression coeﬃcients as follows: Denote by βyx.z the (ky kx)-matrix whose (j, i)-th element is the regression coeﬃcient βyjxi.x iz of Xi in a linear regression of Yj on Xi and Z X i, where X i = X\{Xi}. Then

Witte, Henckel, Maathuis and Didelez

τ yx = βyx.z. The ordinary least squares (OLS) estimator ˆβyx.z is a consistent estimator of βyx.z. We denote the asymptotic variance of ˆβyjxi.x iz by a.var(ˆβyjxi.x iz). Non-parametric estimation of causal eﬀects (RS20). In the more general setting of a causal DAG without linearity or other assumptions on the functional form, we deﬁne the causal eﬀect of X on Y as follows. Let X be the set of values that X can take. For a pair of vectors x, x X, the causal eﬀect of intervening to set X to x vs. x is the vector yxx with elements

( yxx )j = E(Yj | do(x)) E(Yj | do(x )).

Note that in the non-parametric case, it is not possible to compactly represent the causal eﬀect of X on Y in a (ky kx)-matrix. RS20 considered the class of regular asymptotically linear estimators for the non-parametric estimation of yxx . This class includes inverse probability weighting by a non-parametrically estimated propensity score (Hirano et al., 2003), non-parametric outcome regression (Hahn, 1998), and double machine learning (Chernozhukov et al., 2018; Smucler et al., 2019). We use ˆ yxx .z to denote an estimator from this class that estimates yxx adjusting for a valid adjustment set Z. Under a causal DAG model and certain smoothness and complexity restrictions, ˆ yxx .z is a consistent estimator of yxx . For given y, x and x , the asymptotic distribution of estimators from this class depends only on Z, therefore we do not further distinguish between the estimators. We denote the asymptotic variance of ( ˆ yxx .z)j = ˆ yjxx .z by a.var( ˆ yjxx .z). See RS20 and the references therein for more details on regular asymptotically linear estimators.

Deﬁnition 1 (O-set; HPM19 Deﬁnition 3.8) Let X and Y be disjoint node sets in a DAG, CPDAG or max PDAG G. Then O(X, Y, G) is deﬁned as:

O(X, Y, G) = pa(cn(X, Y, G), G) \ forb(X, Y, G).

An example is given in Figure 1. It shows the causal relations between 12 symptoms of prodromal schizophrenia as measured by the Schizotypic Syndrome Questionnaire (van Kampen, 2006). The DAG was constructed using a combination of expert knowledge and data-driven structure learning (van Kampen, 2014). For illustration, we here take this given DAG as ground truth. Suppose we are interested in the causal eﬀect of Alienation (ALN) on Delusional Thinking (DET). The bold edges indicate the causal paths with causal nodes {PER, SUS, FTW, DET} (circles). The parents of the causal nodes are {ALN, PER, SUS, FTW, AIS, CDR}, the forbidden set is {ALN, PER, SUS, FTW, DET, HOS, EGC} and the O-set is {ALN, PER, SUS, FTW, AIS, CDR} \ {ALN, PER, SUS, FTW, DET, HOS, EGC}={AIS, CDR} (shown in boxes). Other valid adjustment sets are, for example, {AFF, SAN}, {AIS, CDR, AFF} and {AFF, APA, AIS, CDR, SAN}. This can be checked using the generalised adjustment criterion stated above. Note that in many applications it might be possible to augment a causal graph e.g. with further parents of Y that are marginally independent of all other non-descendants of Y . This induces a diﬀerent O-set illustrating that this set depends on what variables are included in the graph. Note also that the O-set is deﬁned even if no valid adjustment set exists, but this case will rarely be of interest.

On Efficient Adjustment in Causal Graphs

Figure 1: DAG from van Kampen (2014) illustrating the assumed causal relations between 12 prodromal symptoms of schizophrenia: AFF=Aﬀective Flattening, AIS=Active Isolation, ALN=Alienation, APA=Apathy, CDR=Cognitive Derailment, DET=Delusional Thinking, EGC=Egocentrism, FTW=Living in a Fantasy World, HOS=Hostility, PER=Perceptual Aberrations, SAN=Social Anxiety, SUS=Suspiciousness. We are interested in the causal eﬀect of ALN on DET, both shown in grey circles. Bold arrows show the causal paths from ALN to DET. The forbidden nodes are shown as circles, nodes in the O-set are shown as boxes.

Proposition 2 (HPM19 Theorem 3.10 (1)) Let X and Y be disjoint subsets of the node set V of a causal DAG, CPDAG or max PDAG G. The set O(X, Y, G) is a valid adjustment set relative to (X, Y) in G if (i) Y possde(X, G) and (ii) a valid adjustment set relative to (X, Y) in G exists.

Condition (i) can be checked using a simple query on G. If Y possde(X, G), we know that the causal eﬀect of X on Y\possde(X, G) is zero. Hence, without loss of generality, we can consider the set of outcome variables Y possde(X, G) instead of Y. Condition (ii) is satisﬁed if O(X, Y, G) or any other subset of V \ {X, Y} fulﬁls the generalised adjustment criterion stated above. For the DAG in Figure 1, it can easily be seen that DET de(ALN), hence condition (i) is satisﬁed. Under condition (i), condition (ii) is always satisﬁed for univariate treatment and outcome in a DAG, because the parents of treatment then form a valid adjustment set (see Pearl, 2009, p. 72f.).

Witte, Henckel, Maathuis and Didelez

The following proposition, which builds on earlier work by Kuroki and Miyakawa (2003) and Kuroki and Cai (2004), establishes the optimality of the O-set in terms of the asymptotic variance in the linear and in the non-parametric setting.

Proposition 3 Let X and Y be disjoint subsets of the node set V of a causal DAG, CPDAG or max PDAG G, such that Y possde(X, G). Let Z be a valid adjustment set relative to (X, Y) in G and let O = O(X, Y, G).

(a) (HPM19 Theorem 3.10 (2)) If the variables V follow a linear causal model compatible with G, then, for every Xi X and Yj Y, a.var(ˆβyjxi.x io) a.var(ˆβyjxi.x iz).

(b) (RS20 Theorem 2) For every Yj Y and pair of vectors x, x X,

a.var( ˆ yjxx .o) a.var( ˆ yjxx .z).

In other words, for a given causal linear model, the O-set yields the smallest asymptotic variance for the OLS estimator among all valid adjustment sets. If linearity cannot be assumed, the O-set yields the smallest variance for regular asymptotically linear estimators. Thus, assume that Figure 1 represents a causal linear model. Proposition 3 then implies that if we estimate the eﬀect of ALN on DET by linearly regressing DET on ALN and the O-set {AIS, CDR}, then the estimator will have a smaller asymptotic variance than if we regress DET on ALN and a diﬀerent valid adjustment set, say the parent set of ALN, which equals {AFF, SAN}. Moreover, when we relax linearity, non-parametric adjustment for the O-set {AIS, CDR} is more eﬃcient than non-parametric adjustment for any other valid adjustment set, provided the estimator is in the class of regular asymptotically linear estimators.

3. The O-Set via Forbidden Projection

In this section we provide an alternative, intuitive construction of the O-set. For the sake of clarity, we restrict ourselves to DAGs; generalisations to amenable max PDAGs are given in Appendix C. To motivate our alternative construction, we posit that a useful adjustment set should be i) valid, ii) easy to compute, and iii) eﬃcient. Consider singleton treatment X and outcome Y , where the latter is not an ancestor of X. The parents of X are easy to determine and guaranteed to be valid (see Pearl, 2009, p. 72f.). However, it is well-known that adjusting for variables strongly associated with treatment tends to reduce the eﬃciency of OLS and other estimators of the treatment eﬀect. Hence, adjusting for the parents of treatment is typically ineﬃcient compared to other valid adjustment sets. In contrast, it is also wellknown that regression adjustment for variables strongly associated with the outcome tends to improve the eﬃciency of OLS and other estimators. Hence, the parents of the outcome would appear a natural, easy to determine and more eﬃcient alternative for adjustment. However, the parents of Y are not guaranteed to be a valid adjustment set; they may contain forbidden nodes, speciﬁcally mediators between treatment and outcome. For example, in Figure 1, FTW is a parent of the outcome DET, but a descendant of the treatment ALN and hence cannot be used for adjustment. Simply omitting such nodes from the parents of Y does not generally lead to a valid adjustment set either. For example, CDR alone does

On Efficient Adjustment in Causal Graphs

not form a valid adjustment set in Figure 1, since there are open confounding paths, e.g. ALN SAN AIS SUS FTW DET. Nonetheless, the intuition of using the parents of Y is correct if applied to a modiﬁed graph. As we show below, marginalising out, i.e. projecting over, the forbidden nodes results in a graph where the parent set of Y indeed coincides with the O-set, and is thus guaranteed to yield an estimator with minimal asymptotic variance in the settings we consider, see Proposition 3. This characterization of the O-set thus combines validity, graphical simplicity and eﬃciency. We will now explain this formally. Consider again the case of a DAG D containing sets X and Y. We ﬁrst need the concept of latent projection, used to marginalise or collapse over latent, i.e. unobserved nodes, while preserving the remaining causal relations and (in)dependencies between the observed nodes.

Deﬁnition 4 (Latent projection; Verma and Pearl, 1990; Shpitser et al., 2014) i Let D be a DAG with node set W L and W L = . The latent projection D(W) over L on W is a graph with node set W and edges as follows: For distinct nodes Wi, Wj W,

1. D(W) contains a directed edge Wi Wj if and only if D contains a directed path Wi Wj on which all non-endpoint nodes are in L,

2. D(W) contains a bi-directed edge Wi Wj if and only if D contains a path, with at least one non-endpoint node, of the form Wi Wj on which all non-endpoint nodes are non-colliders and in L.

In the latent projection D(W), two nodes may be connected by a directed and a bidirected edge at the same time. (In)dependence relations can be read oﬀfrom a latent projection using the m-separation criterion (Richardson, 2003). For disjoint A, B, C W, A and B are d-separated given C in D if and only if A and B are m-separated given C in D(W) (Richardson et al., 2017). For our deﬁnition of the O-set, we project over the forbidden nodes, save X and Y, which motivates the following deﬁnition:

Deﬁnition 5 (Forbidden projection) Let D be a DAG with node set V and let X and Y be disjoint subsets of V. We call the graph DXY = D((V \ forb(X, Y, D)) X Y) the forbidden projection of D with respect to (X, Y).

Figures 2 and 3 show some examples, where the forbidden nodes are shown as circles. In panels A and B of Figure 2, the forbidden sets only contain nodes in X Y, hence nothing is projected over. Panels E and F show DAGs where the forbidden projection has bi-directed edges, which will become relevant in Proposition 6. While we primarily introduce the forbidden projection to provide an alternative characterisation of the O-set, it is a useful tool in its own right. In particular, as we show next, the forbidden projection of a causal DAG preserves all information relevant to the estimation of a causal eﬀect via adjustment. All proofs are given in Appendix B and generalised to max PDAGs in Appendix C.

Witte, Henckel, Maathuis and Didelez

V1 X1 V2 X2 Y

Figure 2: Example DAGs with their forbidden projections. The forbidden nodes are shown as circles, nodes in the O-set are shown as boxes. The bold arrows show the causal paths from X to Y. In panels A and B, the original DAGs and their forbidden projections are identical. In panel E, the empty set is a valid adjustment set and also the O-set. In panel F, the bi-directed edge between X2 and Y indicates that the eﬀect of X = {X1, X2} on Y is not identiﬁed via adjustment.

On Efficient Adjustment in Causal Graphs

Figure 3: Forbidden projection of the DAG in Figure 1 with respect to X = {ALN} and Y = {DET}. The forbidden nodes are shown as circles, nodes in the O-set (parents of DET) are shown as boxes. The bold arrow shows the causal path from ALN to DET.

First, the forbidden projection can be used to check whether a valid adjustment set exists relative to given sets of nodes X and Y:

Proposition 6 Let X and Y be disjoint node sets in a causal DAG D such that Y de(X, D). Then a valid adjustment set relative to (X,Y) in D exists if and only if there is no bi-directed edge between any X X and Y Y in DXY.

In Figure 2, valid adjustment sets with respect to X and Y exist in all panels except for panel F. The eﬀect of X = {X1, X2} on Y in panel F is, however, identiﬁed e.g. by the more general G-formula (Robins, 1986; Dawid and Didelez, 2010), the algorithm in Tian and Pearl (2003), or the methods in Nandy et al. (2017). See Guo and Perkovi c (2020) and RS20 for results on eﬃcient adjustment in the linear and non-parametric case, respectively. The bi-directed edge between Y1 and Y2 in panel E has no relevance in deﬁning or determining a valid adjustment set. For singleton Y such that a valid adjustment set with respect to (X, Y ) exists, the forbidden projection is particularly easy to interpret, as it is itself a causal DAG.

Proposition 7 Let X and {Y } be disjoint node sets in a causal DAG D such that Y de(X, D). Then DXY is a causal DAG if and only if there exists a valid adjustment set relative to (X, Y ) in D.

Further, an adjustment set that is valid in the original graph is also valid in the forbidden projection and vice versa:

Proposition 8 Let X, Y and Z be disjoint node sets in a causal DAG D. Then Z is a valid adjustment set relative to (X, Y) in D if and only if Z is also a valid adjustment set relative to (X, Y) in DXY.

Using the forbidden projection, we now deﬁne the O -set and prove that it is equal to the O-set.

Witte, Henckel, Maathuis and Didelez

Deﬁnition 9 (O -set) Let X and Y be disjoint node sets in a DAG D. We deﬁne O (X, Y, D) as: O (X, Y, D) = pa(Y, DXY) \ (X Y).

In words, the O -set is the set of parents of Y in the forbidden projection DXY, excluding treatment nodes and outcome nodes. The next proposition states our key result.

Proposition 10 Let X and Y be disjoint subsets of the node set V of a DAG D such that Y de(X, D). Then O(X, Y, D) = O (X, Y, D).

It now follows trivially that the statements about the O-set in Proposition 3 are true for the O -set as well. Again, Y de(X, D) in Proposition 10 is not a severe restriction, because if Y de(X, D), we can instead consider the eﬀect on Y de(X, D), as we know that the eﬀect on Y \ de(X, D) is zero. Figure 3 shows the forbidden projection with respect to ALN and DET of the DAG in Figure 1. The O-set {AIS, CDR} (in boxes) is the parent set of DET. All other valid adjustment sets are less eﬃcient, for example the afore-mentioned sets {AFF, SAN}, {AIS, CDR, AFF} and {AFF, APA, AIS, CDR, SAN}. Due to Proposition 8, the validity of all of these sets can be conﬁrmed by using the generalised adjustment criterion stated above on either the original DAG (Figure 1) or its forbidden projection (Figure 3). See Figure 2 for further examples. To summarise, the forbidden projection can be used as follows: First, check in the original graph if Y de(X, D). Next, construct the forbidden projection GXY and check for bi-directed edges. If there is a bi-directed edge between a node in X and a node in Y, then the causal eﬀect of interest is not identiﬁed via adjustment (Proposition 6). Else, GXY

contains all information necessary to determine a valid adjustment set (Proposition 8), and in particular the O-set, which is then the set of parents of Y (Deﬁnition 9, Proposition 10). If Y contains only one node, then GXY is a causal DAG itself and hence straightforward to interpret (Proposition 7).

4. Optimal Adjustment in the IDA Algorithm

In Sections 2 and 3, we considered optimal adjustment in DAGs, and in Appendix C we generalised the results to amenable max PDAGs, which include amenable CPDAGs. As a reminder, a max PDAG is said to be amenable relative to (X, Y) if every proper possibly causal path from X to Y starts with a directed edge out of X. In this section, we consider non-amenable CPDAGs and max PDAGs. CPDAGs and max PDAGs are of interest because they are the output of popular causal search algorithms, i.e. algorithms that attempt to learn a graph from data. Under the linear model with Gaussian error terms, which we focus on in this section, it is generally not possible to learn a unique DAG. Even under the additional assumptions of causal suﬃciency and faithfulness (see e.g. Spirtes et al., 2000), one can at best learn a Markov equivalence class of DAGs, uniquely represented by a CPDAG (see e.g. Andersson et al., 1997). Given additional knowledge of some causal relationships between variables, access to interventional data, or other model restrictions, one can obtain a reﬁnement of this class,

On Efficient Adjustment in Causal Graphs

uniquely represented by a max PDAG (Meek, 1995; Perkovi c et al., 2017). For a CPDAG or max PDAG G, we use [G] to denote the set of DAGs that it represents. The interpretation of edges in a CPDAG or max PDAG G is as follows: A directed edge A B means that this edge is present in all DAGs in [G]. An undirected edge A B means that A and B are adjacent in every DAG in [G] and there is at least one DAG in [G] with A B and at least one with A B.

We suppose in this section that we are interested in a univariate exposure X and a univariate outcome Y . For a given CPDAG or max PDAG G, the true causal eﬀect of X on Y may diﬀer across the DAGs in [G]. In particular, Perkovi c (2020) (Proposition 4.2) showed that assuming Y pa(X, G), the true causal eﬀect of X on Y diﬀers across DAGs in [G] if and only if G is non-amenable relative to (X, Y ), i.e. there is a possibly causal path from X to Y that starts with an undirected edge. Hence, when G is non-amenable relative to (X, Y ), we can at best determine a multiset of possible causal eﬀects (τyx(D))D [G], one for each DAG in [G]. (A multiset (τyx(D))D [G] may contain the same entry multiple times, e.g. if [G] contains ﬁve DAGs, of which three imply an eﬀect of 0 and two imply an eﬀect of 1.2, then (τyx(D))D [G] = {0, 0, 0, 1.2, 1.2}.) While obviously less informative than a single number, this multiset of possible causal eﬀects may still yield useful statistics. The minimum absolute value, for example, is a lower bound for the size of the causal eﬀect. However, enumerating all DAGs in [G] is computationally very expensive even for moderately sized G when there are many undirected edges.

Maathuis et al. (2009) proposed to reduce the complexity of this problem as follows. Consider two DAGs D, D [G] such that pa(X, D) = pa(X, D ) = P and Y P. As the parents of X form a valid adjustment set (Pearl, 2009, p. 72f.), τyx(D) = τyx(D ) = τyx(P), where τyx(P) denotes the coeﬃcient of X in the linear regression of Y on X and P, i.e. βyx.p. Let P = {pa(X, D) | D [G]} denote the set of all possible parent sets of X compatible with G. Then (τyx(P))P P contains the same distinct values as (τyx(D))D [G], while |P| |[G]|. Maathuis et al. (2009) showed that it is possible to determine P locally from the CPDAG G without enumerating all DAGs in G. They hence proposed a simple local procedure for calculating (ˆτyx(P))P P, which is called local IDA (local Intervention Calculus when the DAG is Absent). Perkovi c et al. (2017) proposed a semi-local generalisation to max PDAGs ( semi-local IDA ).

The semi-local IDA algorithm for a max PDAG is given in Algorithm 1. Let sib(X, G) denote the set of nodes sharing an undirected edge with X in G. Semi-local IDA loops over all subsets S sib(X, G). It ﬁrst constructs a graph G such that pa(X, G ) = P = pa(X, G) S. Here the complexity reduction becomes apparent: only the edges adjacent to X need to be oriented. To verify whether the added orientations are compatible with the original graph G, the algorithm attempts to extend the graph to a max PDAG by applying Meek s orientation rules (Construct Max PDAG algorithm; Meek, 1995; Perkovi c et al., 2017; see Figure 8 in Appendix A). This step is semi-local as edges not adjacent to X need to be oriented. If successful, ˆβyx.p is added as a possible causal eﬀect estimate, where P = pa(X, G ) = S pa(X, G).

Nandy et al. (2017) further generalised semi-local IDA to sets X and Y. However, this procedure does not use regression adjustment for possible causal eﬀect estimation and is therefore not directly related to our results.

Witte, Henckel, Maathuis and Didelez

Algorithm 1 Local or semi-local IDA (Maathuis et al., 2009; Perkovi c et al., 2017). When the input is a CPDAG, line 7 can be simpliﬁed and the algorithm becomes fully local. Require: CPDAG or max PDAG G with node set V = {V1, . . . , Vp, X, Y }, i.i.d. observations for V1, , Vp, X, Y Ensure: multiset of estimates bΘ

2: sib(X, G) {V V : X V in G}

3: for all S sib(X, G) do

4: Local Bg

5: for all S S, add {S X} to Local Bg

6: for all S sib(X, G) \ S, add {S X} to Local Bg

7: G Construct Max PDAG(G, Local Bg)

8: if G = FAIL then

9: if Y / pa(X, G ) then

10: regress Y on X pa(X, G ) and add the estimated coeﬃcient of X to bΘ

12: add 0 to bΘ

15: end for

16: return bΘ

Algorithm 2 Optimal IDA. Require: CPDAG or max PDAG G with node set V = {V1, . . . , Vp, X, Y }, i.i.d. observations for V1, , Vp, X, Y Ensure: multiset of estimates bΘ

2: sib(X, G) {V V : X V in G}

3: for all S sib(X, G) do

4: Local Bg

5: for all S S, add {S X} to Local Bg

6: for all S sib(X, G) \ S, add {S X} to Local Bg

7: G Construct Max PDAG(G, Local Bg)

8: if G = FAIL then

9: if Y possde(X, G ) then

10: regress Y on X O(X, Y, G ) and add the estimated coeﬃcient of X to bΘ

12: add 0 to bΘ

15: end for

16: return bΘ

On Efficient Adjustment in Causal Graphs

4.1 Optimal IDA

HPM19 established that the parents of X, as used for adjustment by semi-local IDA, form one of the least eﬃcient valid adjustment sets. It therefore seems a good idea to replace pa(X, D) by the O-set within the IDA algorithm to improve estimation precision. The key question is, however, whether the possible O-sets can still be determined semi-locally. More formally, our aim is to estimate the multiset (τyx(O))O O, O = {O(X, Y, D) | D [G]}, where with a slight abuse of notation we deﬁne τyx(O) = 0 if Y possde(X, D). As before, for two DAGs D and D with the same valid O-set O(X, Y, D) = O(X, Y, D ) = O, we have τyx(D) = τyx(D ) = τyx(O). At ﬁrst glance, it appears impossible to determine O locally or semi-locally, as by Definitions 1 and 9 the causal nodes, their parents and the forbidden nodes, or the forbidden projection, are required to ﬁnd the O-set. However, it turns out that O can be determined semi-locally almost in the same manner as P. This is because once the directions of all edges involving X are given, i.e. for given P, application of Meek s rules reveals all descendants of X and, in consequence, all causal nodes, their parents and the forbidden nodes (cf. Lemma 18 in Appendix C). Hence, via Meek s rules there exists a correspondence between possible parent sets and possible O-sets. We therefore propose Algorithm 2, which we call optimal IDA. It is implemented in the R package pcalg (Kalisch et al., 2012, 2019). Algorithm 2 does not specify whether O(X, Y, G ) is determined from G or from the forbidden projection. We expect this choice to be of limited relevance to the algorithm s runtime. In our implementation, we determine O(X, Y, G ) directly from G . Note also that diﬀerent possible parent sets can correspond to the same O-set. Hence, optimal IDA could be modiﬁed to collect all sets in O ﬁrst, remove duplicates, and only then estimate regression coeﬃcients. In the following, we ﬁrst state formally what can be said about the eﬃciency of the estimates output by optimal IDA, showing that it is worthwhile to replace the parents of X by the O-set. Subsequently we compare the computational burden of the two algorithms.

Proposition 11 Let X and Y be nodes in a causal CPDAG or max PDAG G = (V, E), such that V follows a causal linear model compatible with G with Gaussian errors. Let bΘP

and bΘO be the multisets returned by semi-local IDA and optimal IDA, respectively, applied to X, Y and G, with the subsets of sib(X, G) considered in the same order for both. Then, for i {1 . . . , k}, with k = |bΘP| = |bΘO|,

1. E[bΘP i ] = E[bΘO i ] and

2. a.var(bΘP i ) a.var(bΘO i ).

The proof is given in Appendix D. Note that if we do not assume Gaussianity in Proposition 11, then a.var(ˆβyx.o) a.var(ˆβyx.z) can only be guaranteed if (i) Z is a valid adjustment set in the true DAG, and (ii) O is the O-set of the true DAG. This is because in a causal linear model with non-Gaussian errors, a variable is only required to be linear in its parents, and is not necessarily linear given another node set (cf. Nandy et al., 2017). However, if we are willing to assume that all errors in the underlying causal model are non-Gaussian, alternative causal search approaches exist which output a DAG instead of an equivalence class, e.g. algorithms such as Li NGAM (Shimizu et al., 2006).

Witte, Henckel, Maathuis and Didelez

Remark 12 (1) In terms of the computational burden, semi-local and optimal IDA are very similar for max PDAGs. The key diﬀerence is that optimal IDA adjusts for the O-set instead of the parent set of X (line 10), where the O-set is straightforward to determine from G . However, optimal IDA crucially relies on the construction of the max PDAG in line 7 to determine the O-set, while in semi-local IDA this step can be replaced by a simple local query when the input is known to be a CPDAG. Hence, for the special case of a CPDAG, semi-local IDA can be made fully local by simplifying line 7, whereas optimal IDA cannot. (2) A further minor diﬀerence between semi-local and optimal IDA is the if-statement in line 9. Semi-local IDA only checks whether Y / pa(X, G ), whereas optimal IDA checks the stronger condition Y possde(X, G ). Both conditions ensure that the considered adjustment sets pa(X, G ) and O(X, Y, G ), respectively, are valid adjustment sets. Moreover, if Y possde(X, G ), then τyx(D) = 0 for any D [G ]. The 0 estimate of optimal IDA in this case is therefore the most eﬃcient estimate. Alternatively, we could also insist on Y possde(X, G ) in semi-local IDA and return 0 otherwise. As discussed in the appendix of Maathuis et al. (2009), this is only recommended if the input graph is thought to be reliable, but can lead to the ampliﬁcation of errors if the input graph is not accurate.

Remark 13 Proposition 11 concerns the asymptotic variance when the true CPDAG or a true max PDAG is given. When the graph is estimated on the same data as used for IDA, the naive standard errors from the adjusted linear regressions are invalid. Although considerable progress has been made in the area of post-selection inference (e.g. Berk et al., 2013; Belloni et al., 2014; Rinaldo et al., 2019), no method has been proposed speciﬁcally for estimating standard errors of causal eﬀect estimates after causal search.

It is straightforward to extend optimal IDA to situations where X and Y are sets. However, as noted earlier, in this case joint causal eﬀect estimation via regression adjustment is not always possible. Optimal IDA will then not return an estimate. The estimation procedures used by joint IDA (Nandy et al., 2017) provide an alternative.

4.2 Illustration

We now illustrate optimal IDA (Algorithm 2) using a toy example. Consider the CPDAG G shown in Figure 4(a) and suppose we are interested in the causal eﬀect of X on Y . Clearly, G is not amenable relative to (X, Y ) and thus it is sensible to apply optimal IDA. The set sib(X, G) contains 3 nodes, hence there are 8 potential orientations of the undirected edges with endpoint X. From these 8, 3 imply new v-structures and are thus not compatible with G. The other 5 can be extended to the max PDAGs shown in Figure 4(bf), where the bold arrows indicate orientations derived by Meek s rules (see Figure 8 in Appendix A). For example, in 4(b) it follows from V1 X V3 that V1 V3 by Meek s Rule 2. By Rule 1, it then follows that V3 V5. The compatibility check and the application of Meek s rules are carried out in line 7 of optimal IDA. Next, optimal IDA checks for each max PDAG G , whether Y possde(X, G ). Here, this is the case for all max PDAGs except 4(c). For the other four graphs, O = O(X, Y, G ) is determined and used to compute ˆβyx.o. We indicate O(X, Y, G ) by boxes in the Figures 4(b) and 4(d)-(f). For (c), an eﬀect estimate of zero is returned.

On Efficient Adjustment in Causal Graphs

Figure 4: A CPDAG G (a) and the ﬁve max PDAGs (b-f) corresponding to the ﬁve valid orientations of the neighbourhood of X. The bold edges have been obtained by applying Meek s rules. For each max PDAG G , the boxes indicate O(X, Y, G ), while the diamonds indicate pa(X, G). In (c), optimal IDA returns 0, as there is no possibly causal path from X to Y .

-1 0 1 2 3 4

0.0 0.2 0.4 0.6

Estimated causal effect

Figure 5: IDA density plot in the style of Maathuis et al. (2009). Shown are density curves for the estimated possible causal eﬀects returned by local IDA (solid) and optimal IDA (dotted). The true possible causal eﬀects are 0, 1.5 and 2.5 (vertical lines; height indicates relative frequency: 0 and 2.5 each occur in two of the ﬁve max PDAGs in Figure 4).

Witte, Henckel, Maathuis and Didelez

For comparison, the diamonds in Figures 4(b-f) show the adjustment sets in local IDA (Algorithm 1), i.e. pa(X, G ). In (b) and (e), pa(X, G ) = O(X, Y, G ). In (c), optimal IDA returns zero (Algorithm 2, line 12), while local IDA returns ˆβyx.p with P = {V1, V3}, which converges to βyx.p = 0. The main advantage of optimal IDA becomes apparent in cases (d) and (f): In (d), O(X, Y, G ) = , whereas pa(X, G ) = {V4} which is guaranteed to reduce eﬃciency. In (f), pa(X, G ) = {V3} and O(X, Y, G ) = {V3, V5}, where the latter improves eﬃciency. For further illustration, we carried out a small simulation study in which we generated data according to a causal linear model compatible with Figure 4(b). 1 000 datasets with 40 observations each were generated and given as input to local IDA and optimal IDA, together with the CPDAG in Figure 4(a). The true possible causal eﬀects are 0, 1.5 and 2.5, visualised as vertical lines in Figure 5. The plot shows smoothed density curves for the estimates returned by local IDA (solid) and optimal IDA (dotted). The density plot for optimal IDA is clearly narrower around the values 0 and 2.5. The diﬀerence between the algorithms is even more pronounced for graphs with more nodes and longer paths (not shown). The R-code (R Core Team, 2019) for reproducing Figure 5 is available in the Online Supplement.

4.3 Simulation

In order to compare the performance of optimal versus local IDA in ﬁnite sample settings, we carried out a more extensive simulation study. The design was chosen to reﬂect a typical situation where IDA is used, i.e. interest lies in the causal eﬀect of X on Y in a (known or estimated) CPDAG G that is non-amenable relative to (X, Y ). Non-amenability implies that the multiset (τxy(D))D [G] of possible causal eﬀects of X on Y compatible with G contains more than one distinct value (for almost all parameters values of the causal linear model) (Perkovi c, 2020, Proposition 4.2). A useful summary of (τxy(D))D [G] is the minimum absolute value, min(abs((τxy(D))D [G])), because when this value is non-zero, we know that X has some eﬀect on Y . The aim of our simulation study was to compare how well min(abs((τxy(D))D [G])) is estimated by optimal versus local IDA, in terms of the Monte-Carlo mean squared error (MSE). We investigated 24 scenarios by considering all combinations of the following parameters: number of nodes p {10, 20, 50, 100}, expected number of neighbours per node d {2, 3, 4}, and sample size n {100, 1 000}. In each scenario, the following was repeated 1 000 times (R code for reproducing the simulation study is available in the Online Supplement): A DAG D, with CPDAG G, with p nodes and d expected neighbours per node was randomly chosen such that G was non-amenable relative to two randomly chosen nodes (X, Y ) and such that min(abs((τxy(D))D [G])) was non-zero. (Note that the DAG with its unique true causal eﬀect was simulated for convenience only. Conceptually, we drew directly from the space of CPDAGs, which is why we consider the whole multiset of possible eﬀects to be the truth .) The following was then repeated 100 times: A dataset with n observations was generated from a linear causal model on D where the non-zero coeﬃcients were randomly chosen from a uniform distribution on [ 1, 0.1] [0.1, 1]. Greedy equivalence search (Chickering, 2002) was applied to the data, yielding an estimated CPDAG G . Optimal and local IDA were both applied to the true CPDAG G and the estimated

On Efficient Adjustment in Causal Graphs

CPDAG G . The four output multisets of estimates were summarised by their minimum absolute values. These were compared on the basis of their Monte-Carlo MSE, i.e. the squared diﬀerence between the estimated minimum absolute value and the true minimum absolute value, averaged over the 100 repetitions. Speciﬁcally, we calculated the MSEs for the estimated minima using optimal IDA versus local IDA by computing the relative MSE (RMSE), MSE(optimal IDA)/MSE(local IDA). This was done separately for G and for G , and denoted r and r , respectively. An RMSE of less than one indicates that optimal IDA is more precise than local IDA in estimating the minimum of the multiset of causal eﬀects. In light of Remark 13, we did not consider estimated standard errors. In addition to the above 24 scenarios, we investigated the relative performance of optimal IDA in a scenario where the graph is sparse (d = 1) and the sample size is moderate (n = 100). We considered eight setting where the number of nodes was between p = 10 and p = 1 000. Such high-dimensional scenarios occur, for instance, with gene expression data. As greedy equivalence search is slow for large graphs, we reduced the number of replications from 1 000 to 100 and the number of datasets per graph from 100 to 10.

RMSE optimal IDA / local IDA

RMSE optimal IDA / local IDA

Figure 6: Violin plots of the relative mean squared errors (RMSEs) r and r for the true and estimated CPDAGs, respectively. Scenario A: p = 100 nodes, d = 4 expected neighbours per node, sample size n = 1 000. Scenario B: p = 10, d = 4 and n = 100. The dots mark the geometric means, the plus signs the medians.

Figure 6 shows violin plots of the RMSEs r and r over the 1 000 repetitions, together with the geometric mean and the median. Two scenarios are shown: The one where optimal IDA showed the best overall performance (scenario A, p = 100, d = 4, n = 1 000), and the worst one (scenario B, p = 10, d = 4, n = 100) of all the simulation settings considered. The geometric means and medians for all scenarios are summarised in Tables 1 and 2; the

Witte, Henckel, Maathuis and Didelez

n = 100 n = 1 000

d = 2 d = 3 d = 4 d = 2 d = 3 d = 4

p = 10 0.70 (0.76) 0.72 (0.79) 0.76 (0.86) 0.69 (0.78) 0.71 (0.78) 0.75 (0.86) p = 20 0.64 (0.69) 0.63 (0.68) 0.61 (0.66) 0.63 (0.68) 0.60 (0.65) 0.59 (0.65) p = 50 0.60 (0.64) 0.54 (0.57) 0.51 (0.55) 0.55 (0.58) 0.50 (0.54) 0.46 (0.49) p = 100 0.57 (0.61) 0.50 (0.52) 0.44 (0.46) 0.54 (0.58) 0.44 (0.46) 0.40 (0.42)

Table 1: Geometric means (in parentheses: medians) of the relative mean squared errors (RMSEs) r over 1 000 repetitions for scenarios with diﬀerent numbers of nodes (p), expected number of neighbours per node (d), and sample sizes (n). Optimal and local IDA were applied to the true CPDAG G.

n = 100 n = 1 000

d = 2 d = 3 d = 4 d = 2 d = 3 d = 4

p = 10 1.06 (1.01) 1.06 (1.04) 1.06 (1.06) 0.95 (0.99) 0.97 (1.00) 1.01 (1.00) p = 20 0.99 (1.00) 0.99 (1.00) 0.96 (1.00) 0.88 (0.96) 0.89 (0.97) 0.94 (0.99) p = 50 0.94 (0.98) 0.89 (0.93) 0.89 (0.93) 0.81 (0.90) 0.79 (0.85) 0.78 (0.86) p = 100 0.97 (1.00) 0.94 (0.97) 0.90 (0.94) 0.81 (0.91) 0.73 (0.80) 0.72 (0.77)

Table 2: Geometric means (in parentheses: medians) of the relative mean squared errors (RMSEs) r over 1 000 repetitions for scenarios with diﬀerent numbers of nodes (p), expected number of neighbours per node (d), and sample sizes (n). Optimal and local IDA were applied to the estimated CPDAG G .

complete set of violin plots is shown in Appendix E. Optimal IDA clearly outperformed local IDA, in terms of the geometric mean and median of the RMSE, in all scenarios when applied to the true CPDAG. When the CPDAG was estimated using greedy equivalence search, optimal IDA was still superior in the majority of scenarios, but r was notably larger than r in all scenarios, i.e. the relative performance of optimal IDA was worse with an estimated CPDAG than with a known CPDAG. As an estimated graph inevitably contains some errors regarding the presence and direction of edges, this result may indicate that estimation adjusting for the O-set suﬀers more from such errors than adjusting for the set of parents of X. Small n and small p do not entail much advantage of using optimal IDA: In graphs with only a few nodes, the O-set and the set of parents of X are often similar or even coincide, so that the gain in eﬃciency when using the O-set is less pronounced. A smaller sample size leads to more errors in the estimated graph, which aﬀects estimation of the O-set more than estimation of the set of parents of X, as we conjectured above. However, optimal IDA seems to have a slight advantage for larger d when p is also larger.

On Efficient Adjustment in Causal Graphs

p=10 p=20 p=50 p=100 p=250 p=500 p=750 p=1 000

true CPDAG estimated CPDAG

RMSE optimal IDA / local IDA

Figure 7: Violin plots of the relative mean squared errors (RMSEs) r and r for the true and estimated CPDAGs, respectively. All graphs have d = 1 expected neighbour per node and graphs were estimated from n = 100 observations; the number of nodes p varies. The dots mark the geometric means, the plus signs the medians.

p = 10 p = 20 p = 50 p = 100

true CPDAG 0.71 (0.77) 0.69 (0.76) 0.66 (0.69) 0.66 (0.71) estimated CPDAG 1.04 (1.06) 1.04 (1.01) 0.99 (1.02) 0.93 (1.00)

p = 250 p = 500 p = 750 n = 1 000

true CPDAG 0.60 (0.68) 0.67 (0.69) 0.67 (0.77) 0.69 (0.75) estimated CPDAG 0.73 (1.02) 1.16 (1.09) 0.94 (1.04) 0.79 (1.00)

Table 3: Geometric means (in parentheses: medians) of the relative mean squared errors (RMSEs) r and r for the true and estimated CPDAGs, respectively, over 100 repetitions for scenarios with diﬀerent numbers of nodes (p), d = 1 expected neighbours per node, and sample size n = 100.

Witte, Henckel, Maathuis and Didelez

The additional results for the sparse graphs are shown in Figure 7 and Table 3. Optimal IDA outperformed local IDA regardless of the number of nodes p when the CPDAG was known. When the CPDAG was not known, the median RMSE was about 1 for all p. The geometric mean varied around 1 with no obvious pattern, which may be due to the small number of replications. The results suggest that in the high-dimensional setting, the primary diﬃculty is learning the graph, limiting the advantage of optimal IDA over local IDA.

In summary, based on the simulation results, we recommend using optimal IDA when there is high conﬁdence in the estimated graph. The advantage over local IDA will be most pronounced when the number of nodes is at least 20, or better 50 or more.

5. The O-Set and Non-Graphical Variable Selection

We now assume that neither the causal DAG D nor a CPDAG or max PDAG is known to us, therefore we wish to select a valid adjustment set in a non-graphical manner. We restrict our discussion to the case where we have a univariate treatment X and outcome Y of interest.

In multiple regression analyses, it is common to apply variable selection procedures, e.g. backward selection, to ﬁnd a set of relevant predictors for an outcome Y . In highdimensional settings, regularisation methods that combine selection and estimation, such as the Lasso or the Elastic Net, are commonly used (Tibshirani, 1996; Zou and Hastie, 2005). While variable selection for prediction is in general a diﬀerent task than ﬁnding an eﬃcient or optimal adjustment set for causal eﬀect estimation, we will discuss next under what assumptions and modiﬁcations these tasks coincide. For a general overview of the relation between variable and confounder selection see Witte and Didelez (2019). A basic assumption for the validity of a selected adjustment set is that the set Z from which we select the variables must itself be a valid adjustment set, as deﬁned in Section 2. A number of selection procedures can then be used to determine diﬀerent types of valid adjustment sets as subsets of Z, e.g. a minimal valid adjustment set (Witte and Didelez, 2019; de Luna et al., 2011).

Consider ﬁrst Algorithm 3, which shows the template for backward regression selection (see e.g. Kleinbaum and Kupper, 1978; Montgomery et al., 2012) with the above basic assumption added at the outset. Under the linear model assumptions with Gaussian errors, Y Zi | (Z i, X) can be tested by comparing the models with regressors Z i {X} versus Z {X}, using a t-test with null hypothesis βyzi.xz i = 0. Pval in line 6 is a function that outputs the p-value of a test for the null hypothesis speciﬁed in the argument. The maximum p-value is compared in line 9 to a threshold α. For a given α, Algorithm 3 implements the classical p-value method (see e.g. Greenland and Pearce, 2015). Denote by Fχ2 1(.) the distribution function of the χ2 distribution with one degree of freedom. For a given sample size n, Algorithm 3 with α = 1 Fχ2 1(2) or α = 1 Fχ2 1(log(n)) is equivalent to backward selection using the AIC or BIC, respectively (e.g. Murtaugh, 2014; Derryberry et al., 2018; although the motivation for using them stems from frameworks other than independence testing, see Akaike, 1974 and Schwarz, 1978). Algorithm 3 can easily be adapted to work with a measure of conditional independence other than the p-value of the

On Efficient Adjustment in Causal Graphs

t-test. For example, two non-parametric implementations of Algorithm 3 were proposed by Li et al. (2005). If the true independence relations are known, Algorithm 3 can be condensed to its oracle version, Algorithm 4. Comparing p-values is then redundant, and every Zi needs to be visited only once, as it follows from general properties of conditional independence that the ordering Z1, Z2, . . . , Zp does not matter, provided the joint probability of all variables is strictly positive. Essentially, Algorithm 4 eliminates variables until only the direct predictors of Y are left, i.e. those variables with non-zero coeﬃcients in the oracle regression of Y on X and Z. Algorithm 4 is the non-graphical version of the pruning algorithm introduced in HPM19 which uses d-separation relationships to prune a valid adjustment set to a subset such that the resultant eﬀect estimator has a smaller asymptotic variance. Assume now that an underlying graph exists. The following Proposition 14 formalises how the O-set can be viewed as the target set of backward variable selection algorithms and follows from Proposition 3.6 of HPM19 and Theorem 1 in RS20.

Algorithm 3 Backward regression selection. Require: i.i.d. observations for variables X, Y and Z, such that Z is a valid adjustment set relative to (X, Y )

3: while Pmax > α do

4: Plist empty list of length |Z |

5: for all i in 1 to |Z | do

6: Plist[i] Pval(Y Zi | (X, Z i))

8: Pmax max(Plist)

9: if Pmax > α then

10: Z Z argmax(Plist) 11: end if

12: end while

13: return Z

Algorithm 4 Oracle backward regression selection. Require: independence relations between variables X, Y and Z, such that Z is a valid adjustment set relative to (X, Y )

2: for all i in 1 to |Z | do

3: if Y Zi | (X, Z i) then

4: Z Z i 5: end if

7: return Z

Witte, Henckel, Maathuis and Didelez

Proposition 14 Let X and Y be nodes in a causal DAG, CPDAG or max PDAG G with node set V and let V follow a causal model with a joint density faithful to G. Let Z be a valid adjustment set relative to (X, Y ) in G and let Z be the output of Algorithm 4 when applied to Z.

(a) Z is a valid adjustment set and does not depend on the order in which the variables in Z are considered in Algorithm 4.

(b) If O(X, Y, G) Z, then Z = O(X, Y, G).

(c) If V follows a causal linear model, then a.var(ˆβyx.z ) a.var(ˆβyx.z).

(d) For every pair of values x, x X, a.var( ˆ yxx .z ) a.var( ˆ yxx .z).

As mentioned earlier, Lasso estimation can also be regarded as variable selection, even though its original motivation and common usage mostly concerns prediction. Under speciﬁc assumptions, the Lasso asymptotically selects all and only all the direct predictors of Y with probability 1 (Zhao and Yu, 2006; Lounici, 2008). Thus, although Lasso uses a diﬀerent principle than backward selection, it follows from Proposition 14 that when the starting set Z is a valid adjustment set and a superset of the O-set, the O-set can also be viewed as the target set of the Lasso. We emphasise again that the O-set cannot be determined in a purely data-driven way. Neither the assumption that Z is valid nor O(X, Y, G) Z can be veriﬁed empirically. Hence, prior causal knowledge is essential before any variable selection algorithm can be applied (Witte and Didelez, 2019). In contrast to (semi-)local or optimal IDA, however, selection of an adjustment set based on Algorithm 4 allows some latent structures, as long as the assumption that Z is a valid adjustment set continues to hold. This may be of advantage when only a subset of the variables have been measured. The guarantees of Proposition 14 for Algorithm 4 do not translate to the ﬁnite sample version Algorithm 3. Regression selection in ﬁnite samples is known to have several weaknesses (see e.g. Harrell, 2010). Some issues are that the output may only be a local optimum, and that valid post-selection inference is diﬃcult (Leeb and P otscher, 2008). There is, however, a growing literature on post-selection inference both in the context of OLS-based approaches (e.g. Berk et al., 2013; Rinaldo et al., 2019) and of Lasso-based approaches (e.g. Lockhart et al., 2014; Lee et al., 2016). For causal eﬀect estimation in a non-graphical context, post-selection inference has been considered by Belloni et al. (2014), Dukes and Vansteelandt (2020a), Dukes and Vansteelandt (2020b), Chernozhukov et al. (2018) and others.

6. Conclusions

In this paper, we provided insight into the construction and properties of the O-set introduced by HPM19. We showed that the O-set equals the set of parents of Y in the latent projection over the forbidden nodes (Proposition 10). This lends formal support to the intuition that adjusting for all direct causes of Y minimises the residual variance and hence improves precision when estimating the causal eﬀect of X on Y.

On Efficient Adjustment in Causal Graphs

The forbidden projection is a useful tool in its own right when the aim is to estimate a causal eﬀect via adjustment. It displays all variables of interest, while the forbidden variables, which must not be adjusted for, are marginalised out. The forbidden projection thus reduces the complexity of the causal graph while preserving all information relevant for choosing an adjustment set (see Propositions 8 and 7).

We further proposed a new modiﬁcation of the IDA algorithm, called optimal IDA, which outputs multisets of estimates of possible causal eﬀects by adjusting for the possible O-sets. We showed that this increases estimation precision also in cases where the causal structure is a-priori unknown and needs to be estimated. Moreover, this extends the applicability of optimal adjustment to non-amenable CPDAGs/max PDAGs. Optimal IDA has been implemented in the R package pcalg. While causal search methods in general have some well-known shortcomings, IDA has proved to be a valuable tool for instance for screening purposes in large datasets (Le et al., 2013; Engelmann et al., 2015; Luo et al., 2018). The optimal version can further improve its performance.

Finally, we detailed the prerequisites and assumptions under which non-graphical algorithms for backward variable selection can be viewed as aiming at selecting the O-set. Essentially, we need to assume that the set of variables to select from consists of all nodes in the forbidden projection, or a suitable subset thereof. The algorithm then determines the parents / direct causes of Y based on detected conditional independencies. If the input contains forbidden nodes, however, or lacks certain confounders, the algorithm might select an invalid adjustment set. To avoid the latter, suﬃcient prior knowledge on the set of variables corresponding to forbidden nodes is therefore a key prerequisite when automated variable selection is to be used for causal inference. While this prerequisite may not require full knowledge of the underlying causal DAG, it is important to recognise that such prior knowledge cannot be established in a purely data-driven way (Witte and Didelez, 2019).

Much research on variable selection in causal graphs has focussed on ﬁnding small or minimal adjustment sets (de Luna et al., 2011; Textor and Li skiewicz, 2011; Kn uppel and Stang, 2010). Small adjustment sets are useful during study planning, for instance when data collection is expensive and costs are to be minimised. Moreover, they entail desirable statistical properties e.g. for matching estimators, because suitable matches are more easily found when matching on a few variables only. In general, the O-set is not minimal, but instead entails optimality of causal eﬀect estimation by regression adjustment in linear causal models and non-parametric settings. Simulation results further indicate that the optimality of the O-set extends to other parametric settings and estimation methods, e.g. estimation of the marginal odds ratio via standardised logistic regression (Witte and Didelez, 2019). Combining the beneﬁts of small and optimal adjustment sets, RS20 show that the optimal minimal set, i.e. the set among all minimal adjustment sets yielding the most precise estimation in the class of regular asymptotically linear estimators, must be a subset of the O-set, underlining its relevance and importance.

We note that adjustment is only one of several possible ways of identifying causal eﬀects. While adjusting for the O-set is asymptotically more eﬃcient than adjusting for any other valid adjustment set, it is possible that an even smaller asymptotic variance can be obtained by using an alternative identiﬁcation strategy, e.g. the front-door strategy (Pearl, 2009). This is further investigated in RS20 and in Guo and Perkovi c (2020).

Witte, Henckel, Maathuis and Didelez

Finally, we would like to discuss some avenues for future research. First, given the results by RS20, a natural question is whether a non-parametric version of optimal IDA is feasible. Those aspects of IDA that relate to ﬁnding diﬀerent possible valid adjustment sets are obviously not limited to the causal linear model, and estimators such as in RS20 could also be employed for any given X, Y and adjustment set. The simpliﬁcations for graph search algorithms and IDA under linearity, however, are considerable. For instance, greedy equivalence search with a Gaussian score has been shown to be consistent (Chickering, 2002), and has the advantage of always returning a CPDAG. Non-parametric graph search algorithms exist, but often come with large computational burdens and/or a low power to detect edges (Shah and Peters, 2020; Ramsey, 2014). Further, under a causal linear model, the causal eﬀect of X on Y is a single value, see Section 2; and the marginal and conditional causal eﬀects for diﬀerent valid adjustment sets are all identical. For the non-parametric case, instead, the causal eﬀect of X on Y is an unspeciﬁed function, and issues of non-collapsibility might also come into play. Solving these conceptual hurdles for non-parametric optimal IDA remains an open question. Second, we assumed throughout this paper that all variables are observed. HPM19 have shown that in the presence of hidden variables, an asymptotically optimal set may not exist. Smucler et al. (2020), however, gave a suﬃcient condition under which an optimal adjustment set exists when the underlying DAG includes hidden variables, and showed that an optimal minimal adjustment set always exists. A necessary and suﬃcient condition for the existence of an optimal adjustment set, however, has not yet been formulated.

Acknowledgments

We thank the reviewers and the editor for their valuable comments and suggestions. We gratefully acknowledge ﬁnancial support by the German Research Foundation (DFG Project DI 2372/1-1).

On Efficient Adjustment in Causal Graphs

Appendix A. Terminology

The following terminology is used throughout this paper. It is consistent with, and extends HPM19 where needed. Graphs. A graph G = (V, E) consists of a node set V and a set of edges E. We consider three types of edges: directed ( ), bi-directed ( ) and undirected ( ). There can be more than one edge between a given pair of nodes. We only consider loop-free graphs, i.e. an edge between a node and itself is not allowed. A loop-free graph where there is at most one edge between a given pair of nodes is called a simple graph. Two nodes joined by at least one edge are called endpoints of the edge and adjacent. A directed edge A B is said to be out of A and into B. A graph G = (V , E ) is the induced subgraph of G = (V, E) with respect to V if V V and E includes all edges in E that are between nodes in V . Paths. A path is a sequence of nodes and edges (V0, e1, V1, . . . , e K, VK), K 1, such that every node occurs only once and for k = 1, . . . , K, ek has endpoints Vk 1 and Vk. In a simple graph, the path (V0, e1, V1, . . . , e K, VK) can unambiguously be identiﬁed by the sequence of nodes (V0, V1, . . . , VK) alone. V0 and VK are called endpoints of the path (V0, e1, V1, . . . , e K, VK) and the path is said to be between V0 and VK or from V0 to VK, irrespective of the direction of the edges. For sets of nodes A and B, a path is said to be between A and B or from A to B if its ﬁrst node is in A and the last node is in B. A path from A to B is proper if only its ﬁrst node V0 is in A. Let p = (V0, e1, V1, . . . , e K, VK) and k = 1, . . . , K. Then an edge Vk Vk+1 on p is said to point towards V0, . . . , Vk, while an edge Vk Vk+1 on p is said to point towards Vk+1, . . . , VK. A path is directed from V0 to VK if all edges in the sequence are directed and point towards VK. A path p is possibly directed from V0 to VK if all edges on p are either directed or undirected and there are no i, j, 1 i < j K, such that Vi Vj (cf. Perkovi c et al. (2017); this deﬁnition of a possibly directed path is non-standard as Vi and Vj are not necessarily adjacent nodes on the path, which is required for max PDAGs later). We deﬁne the concatenation of two paths p = (V0, e1, V1, . . . , e K, VK) and q = (VK, e K+1, VK+1, . . . , e K+L, VK+L) as p q = (V0, e1, V1, . . . , e K+L, VK+L), where we require that the nodes V0, . . . , VK+L are distinct. Ancestry. If there is a directed path from A to B, or if A = B, then A is an ancestor of B and B is a descendant of A. If there is a possibly directed path from A to B, or if A = B, then A is a possible ancestor of B and B is a possible descendant of A. If there is an edge A B, then A is a parent of B and B is a child of A. If there is an edge A B, A and B are siblings. Note that in our terminology, a node is a (possible) ancestor and (possible) descendant of itself, but not a parent/child/sibling of itself. For a node V in a simple graph G, we denote the set of all ancestors, possible ancestors, descendants, possible descendants, parents, children and siblings of V in G as an(V, G), possan(V, G), de(V, G), possde(V, G), pa(V, G), ch(V, G), sib(V, G), respectively. For a set of nodes W, the set an(W, G) is deﬁned as S W W an(W, G), with analogous deﬁnitions for possan(W, G), de(W, G), possde(W, G), pa(W, G), ch(W, G) and sib(W, G). Colliders, deﬁnite-status paths and v-structures. A non-endpoint node V is a collider on a path p if both edges adjoining V on p have arrowheads at V , i.e. V , V , V , V . A non-endpoint node V is a non-collider on a path p if at least one of the edges adjoining V on p is out of V , i.e. V , V , V , V , V , V , V , or if both edges adjoining V on p are undirected edges and the

Witte, Henckel, Maathuis and Didelez

two nodes adjacent to V on p are not adjacent to each other. A deﬁnite-status path is a path on which every non-endpoint is either a collider or a non-collider. In a DAG or an ADMG, all paths are of deﬁnite status. Three nodes A, B and C form a v-structure in a graph G if A B C is the induced subgraph G on {A, B, C}.

ADMGs, DAGs and PDAGs. A directed path from A to B, together with an edge A B forms a directed cycle. A graph with only directed and bi-directed edges and without directed cycles is called an acyclic directed mixed graph (ADMG). A simple graph with only directed edges and without directed cycles is called a directed acyclic graph (DAG). A simple graph with only directed and undirected edges containing no directed cycles is called a partially directed acyclic graph (PDAG).

Blocking and separation. (Richardson, 2003; Maathuis and Colombo, 2015; Pearl, 2009) A deﬁnite-status path p in an ADMG or PDAG G is blocked by a node set C if (i) p contains a non-collider in C or (ii) p contains a collider that is not in an(C, G). Otherwise the path p is open given C. Node sets A and B are said to be m-separated given a set C if every path between an A A and a B B is blocked by C. We then write A G B | C. In DAGs, m-separation is called d-separation.

Markov equivalence and CPDAGs. (Andersson et al., 1997) The (Markov) equivalence class of a DAG D is the set of DAGs that imply the same d-separation relationships as D. These are all DAGs with the same adjacencies and v-structures Verma and Pearl (1990). Markov equivalence classes can be represented as completed partially directed acyclic graphs (CPDAGs), which are simple graphs with directed or undirected edges, without directed cycles and with certain restrictions regarding the patterns of edges that can occur. The equivalence class represented by a CPDAG G is denoted by [G]. A directed edge A B in G means that this edge is present in all DAGs in the equivalence class [G]. An undirected edge A B in G means that A and B are adjacent in every DAG in [G] and there is at least one DAG in [G] with A B and at least one with A B.

Meek s rules and max PDAGs. (Perkovi c et al., 2017) Certain subsets of equivalence classes of DAGs can be represented by maximally oriented PDAGs (max PDAGs), which are PDAGs with edge orientations that are closed under the orientation rules in Figure 8 (Meek s rules, Meek (1995)). The set of DAGs represented by a max PDAG G is denoted by [G]. The edges in max PDAGs have the same interpretation as in CPDAGs. DAGs and CPDAGs are special cases of max PDAGs.

Figure 8: Meek s orientation rules. Let G be a simple graph with only directed and undirected edges and without directed cycles. If the graph on the left is an induced subgraph of G, then orient the undirected edges in G according to the graph on the right (Meek, 1995). The rules prevent directed cycles and new v-structures.

On Efficient Adjustment in Causal Graphs

Partial topological ordering. Let D be a DAG with node set V and let V1, . . . , Vp be a partition of V. Then V1 < < Vp is a partial topological ordering of V if for every i > j, there are no directed edges from Vi to Vj. Independence and faithfulness. For sets of random variables X, Y and Z, if X and Y are conditionally independent given Z, we write X Y | Z. A joint density f(v) over a set of random variables V is Markov with respect to a DAG D with node set V if for disjoint X, Y, Z V, X D Y | Z X Y | Z; the density f(v) is faithful to D if also X Y | Z X D Y | Z. Causal DAGs, CPDAGs, max PDAGs and ADMGs. Intuitively, a causal DAG is a DAG where an edge A B means that A is a direct cause of B (relative to the variables included). This can be formalised using the intervention operator, denoted by do( ) in Pearl (2009). For random variables V and X V, the post-intervention density f(v | do(x )) is the joint density of V in a (hypothetical) experiment that ﬁxes X to x for everyone in the population by an external intervention. A joint density f(v) is compatible with a causal DAG D = (V, E) if for all X V, the post-intervention density f(v | do(x )) can be written as f(v | do(x )) = 1(x = x ) Y

V V\X f(v | pa(V, D)),

where 1(x = x ) is the indicator function that is 1 if x = x and 0 otherwise. This is known as the truncated factorisation formula (Spirtes et al., 2000; Pearl, 2009). A CPDAG or max PDAG G is called a causal CPDAG or causal max PDAG if [G] contains a causal DAG. A causal ADMG is an ADMG that has been obtained by subjecting a causal DAG to a latent projection, see Deﬁnition 4. (Possibly) causal nodes and forbidden nodes. See Section 2. Valid adjustment sets and amenability. Let X, Y and Z be disjoint sets of random variables, where Z is possibly empty. Then Z is a valid adjustment set relative to (X, Y) if we have

f(y | do(x)) =

( f(y | x) if Z = , R

z f(y | x, z)f(z)dz otherwise. (1)

Relative to a causal DAG, CPDAG, max PDAG or ADMG G = (V, E), a valid adjustment set is deﬁned as follows: Let X, Y and Z be disjoint subsets of V, where Z is possibly empty. Then Z is a valid adjustment set relative to (X, Y) in G if equation (1) holds for every joint density f(v) compatible with G (Perkovi c et al., 2018). Further, G is said to be amenable for adjustment relative to (X, Y) if every proper possibly causal path from X to Y starts with a directed edge out of X (Perkovi c et al., 2018). Generalised adjustment criterion. (Perkovi c et al., 2017, 2018; Shpitser et al., 2010) Let X, Y and Z be disjoint node sets in a causal DAG, CPDAG, max PDAG or ADMG G. Then Z is a valid adjustment set relative to (X, Y) in G if and only if the following three conditions hold:

(a) G is amenable relative to (X, Y),

(b) Z forb(X, Y, G) = ,

(c) all proper non-causal deﬁnite-status paths from X to Y are blocked by Z.

Witte, Henckel, Maathuis and Didelez

Causal linear model. Let D be a causal DAG with node set V = (V1, , Vp). Then V is said to follow a causal linear model compatible with D if the distribution of each Vi V can be described by an equation of the form

Vj pa(Vi,D) αij Vj + ϵvi

with αij R and ϵvi a random variable with mean 0 and ﬁnite variance such that ϵv1, . . . , ϵvp are jointly independent. For a causal CPDAG or max PDAG G, V is said to follow a causal linear model compatible with G if V follows a causal linear model compatible with a DAG in [G]. Partial variance notation. Consider a random variable S and a random vector T. We denote the covariance matrix of T by Σtt and the row vector of covariances between S and T by Σst. The partial variance of S given T is deﬁned as σss.t = Var(S) ΣstΣ 1 tt Σ 1 st . Asymptotic variance. Consider a sequence of estimators (ˆβn)n N such that n(ˆβn β) converges in distribution to N(0, v). We call v the asymptotic variance of ˆβ and write a.var(ˆβ) = v.

Appendix B. Proofs for Section 3

In this appendix, we prove our claims about the forbidden projection made in Section 3.

Proposition 6 Let X and Y be disjoint node sets in a causal DAG D such that Y de(X, D). Then a valid adjustment set relative to (X,Y) in D exists if and only if there is no bi-directed edge between any X X and Y Y in DXY.

Proof We show that a valid adjustment set relative to (X,Y) in D cannot exist if and only if there is a bi-directed edge between a X X and a Y Y in the forbidden projection DXY. Assume ﬁrst that there is a bi-directed edge in DXY between some X X and a Y Y. Then according to Deﬁnition 4 of the latent projection there is a path in D between X and Y on which all nodes are non-colliders and contained in the forbidden set. This constitutes a non-causal path that cannot be blocked by any sets of nodes that are not forbidden. Hence no valid adjustment set relative to (X,Y) and D exists. Assume now that there is no valid adjustment set relative to (X,Y) in D. Then Lemma 15 implies X de(cn(X, Y, D), D) = . Let X X de(cn(X, Y, D), D). Then there must exist a node C cn(X, Y, D) and a node Y Y such that there is a path of the form X C Y where all non-endpoints are non-colliders on the path and in the forbidden set. It follows from Deﬁnition 4 of the latent projection that DXY contains a bi-directed edge X Y .

Lemma 15 (Corollary 27 in Perkovi c et al., 2018) Let X and Y be disjoint node sets in a causal DAG D such that Y de(X, D). Then a valid adjustment set relative to (X,Y) in D exists if and only if X de(cn(X, Y, D), D) = .

On Efficient Adjustment in Causal Graphs

Proposition 7 Let X and {Y } be disjoint node sets in a causal DAG D such that Y de(X, D). Then DXY is a causal DAG if and only if there exists a valid adjustment set relative to (X, Y ) in D.

Proof First assume that DXY is a causal DAG. Then by Proposition 6, a valid adjustment set relative to (X, Y ) in D exists. Now assume that a valid adjustment set relative to (X, Y ) in D exists. We show that (1) DXY is a DAG syntactically, i.e. a directed graph without cycles, (2) semantically, applying the d-separation criterion to sets of nodes in DXY yields the same separations as applying the d-separation criterion to the same node sets in D, and (3) DXY is a causal DAG for (V \ forb(X, Y, D)) X {Y }. (1) As we assume that a valid adjustment set relative to (X, Y ) in D exists, it follows from Proposition 6 together with Lemma 16 that DXY does not contain bi-directed edges. Acyclicity of latent projections is guaranteed by property 1 of Deﬁnition 4 of the latent projection: every directed edge in D(W) corresponds to a directed path in D, hence if DXY

had a directed cycle then so would D. It follows that DXY is acyclic. (2) The m-separations in a latent projection D(W) correspond to the d-separations between nodes in W in the original DAG D (Richardson et al., 2017). In our case, D(W) = DXY is itself a DAG syntactically, and for DAGs m-separation and d-separation are equivalent. (3) Since D is a causal DAG for the random variables V, the truncated factorisation derived from D holds for all interventions do(T = t ) with T V:

f(v | do(t )) = 1(t = t ) Y

V V\T f(v | pa(V, D)). (2)

We need to show that the truncated factorisation implied by DXY holds for the joint marginal distribution of (V\forb(X, Y, D)) X {Y }. We distinguish two cases. In the ﬁrst case, Y de(X, D). This case is trivial, as the forbidden set is then empty and D = DXY . For the second case, Y de(X, D) we deﬁne the following sets: A = (V \ forb(X, Y, D)) X is the node set of DXY without Y , C = cn(X, Y, D) \ (X {Y }) is the set of forbidden nodes that are ancestors of Y , excluding X and Y , and M = forb(X, Y, D) \ (X {Y } C) is the forbidden set excluding X, Y and C, so that C M = forb(X, Y, D) \ (X {Y }) is the set over which we marginalise when subjecting D to the forbidden projection. Then the following partial topological ordering holds: A < C < Y < M. (Note that Y cannot have descendants in X, as otherwise no valid adjustment set would exist by Lemma 15.) We can now rewrite equation (2) as

f(v | do(t )) = 1(t = t ) Y

A A\T f(a | pa(A, D)) Y

C C\T f(c | pa(C, D))

f(y | pa(Y, D))1(Y T) Y

M M\T f(m | pa(M, D)).

Consider now interventions only in nodes T (V \ forb(X, Y, D)) X {Y }, then C \ T = C and M \ T = M. Upon marginalising the above intervention distribution over

Witte, Henckel, Maathuis and Didelez

M the last term in the product vanishes but the remaining terms do not change:

f(a, y, c | do(t )) = 1(t = t ) Y

A A\T f(a | pa(A, D)) Y

C C f(c | pa(C, D))

f(y | pa(Y, D))1(Y T).

Further marginalising over C, the partial topological order guarantees that the variables in A do not have parents in C. This yields

f(a, y | do(t )) = 1(t = t ) Y

A A\T f(a | pa(A, D)) Z

C C f(c | pa(C, D))

f(y | pa(Y, D))1(Y T)dc.

A variable is conditionally independent of its non-descendants given its parents. All variables in A C are non-descendants of Y , hence Y A C | pa(Y, D) and f(y | pa(Y, D)) = f(y | pa(Y, D) a c) = f(y | a c). The second equality holds because the parents of Y , if there are any, form a subset of A C. Similarly, all variables in A are non-descendants of all variables in C, hence f(c | pa(C, D)) = f(c | pa(C, D) a). Further, all parents of variables in C are in A C, hence Q C C f(c | pa(C, D) a) = f(c | a). We obtain

f(a, y | do(t )) = 1(t = t ) Y

A A\T f(a | pa(A, D)) Z

c f(c | a)f(y | a c)1(Y T)dc

= 1(t = t ) Y

A A\T f(a | pa(A, D)) Z

c f(c, y | a)1(Y T)f(c | a)1(Y T)dc

= 1(t = t ) Y

A A\T f(a | pa(A, D))f(y | a)1(Y T).

Two things remain to be shown. First, for every A A, pa(A, D) = pa(A, DXY ) because A does not have parents in the node set over which we marginalised in the projection. Second, f(y | a) = f(y | pa(Y, DXY )), which follows from the fact that all conditional independencies between variables in DXY can be read oﬀDXY using the d-separation criterion, as we showed in part (2) of this proof. Hence, we have

f(a, y | do(t )) = 1(t = t ) Y

A A\T f(a | pa(A, DXY ))f(y | pa(Y, DXY ))1(Y T)

= 1(t = t ) Y

V (A {Y })\T f(v | pa(V, DXY )),

which is exactly the truncated factorisation formula implied by DXY . Hence, DXY is a causal DAG for the random variables A {Y } = (V \ forb(X, Y, D)) X {Y }.

Lemma 16 Let D be a DAG with node set V and let X V and Y V \ X such that a valid adjustment set relative to (X, Y ) in D exists. Then any edge that is in DXY but not in D is a directed edge into Y .

On Efficient Adjustment in Causal Graphs

Proof We only consider the case where Y de(X, D), as otherwise D = DXY and our statement follows trivially. Deﬁne F = forb(X, Y, D) \ (X {Y }). By Deﬁnition 5, an edge present in DXY but not in D only occurs if D contains a node Wj V \ F that has an ancestor in F. We show that the only node in V \ F that can have an ancestor in F is Y . Consider ﬁrst a W V \ forb(X, Y, D). W does not have ancestors in F, as otherwise W would be a forbidden node itself. Consider next a node X X. X does not have ancestors in F either, as every node in F is a descendant of cn(X, Y, D), but we assume that a valid adjustment set exists relative to (X, Y ) in D, implying X de(cn(X, Y, D), D) = by Lemma 15. Hence, Y is the only node in V \ F that can have an ancestor in F.

Proposition 8 Let X, Y and Z be disjoint node sets in a causal DAG D. Then Z is a valid adjustment set relative to (X, Y) in D if and only if Z is also a valid adjustment set relative to (X, Y) in DXY.

Proof Throughout, let F = forb(X, Y, D) \ (X Y). We ﬁrst suppose that Z is a valid adjustment set in D and show that this implies that it is also a valid adjustment set in DXY. Hence, Z forb(X, Y, D) = and Z Y = , so that every node in Z is also a node of DXY. Further, forb(X, Y, DXY) X Y and hence Z forb(X, Y, DXY) = . Amenability trivially holds in both D and DXY by assumption. It remains to show that every proper non-causal path from X to Y in DXY is blocked by Z, which we do by contradiction. So suppose that Z is a valid adjustment set relative to (X, Y) in D and that there exists a proper non-causal path p = (V0, e1, V1, . . . , e K, VK) from X to Y in DXY that is open given Z. We denote YF = Y forb(X, Y, DXY) = forb(X, Y, DXY) \ X and note that de(YF , DXY) YF . Let VL Y be the ﬁrst node on p that is in Y and consider the path segment p = (V0, e1, V1, . . . , e L, VL). Suppose that L < K and that VL YF. If p is causal, then p must either be causal or contain a collider in YF , contradicting our assumption that it is open given Z. If VL Y \ YF then p cannot be causal. Hence, we can suppose that L = K or replace p with p without loss of generality. Consider now the case that VK Y\YF . This implies that all nodes on p except V0 are not in forb(X, Y, DXY). Since de(forb(X, Y, G) \ X, D) forb(X, Y, D) and by deﬁnition of latent projections, this implies that p is also a path in D. As for any node V in DXY, de(V, D) de(V, DXY) F, and Z F = , it follows that p is also open given Z in D. Consider now the case that VK YF . The path p cannot be a one-edge path, as the two possible such paths would require the existence of paths in D implying that no valid adjustment sets relative to (X, Y) exist in D. By the fact that de(YF , D) YF , the last edge of p must be of the form p = VK 1 VK. By the same argument and the deﬁnition of the forbidden projection, the segment p = (V0, e1, V1, . . . , VK 1) is also a path in D, which by deﬁnition of p and p must be non-causal. The path p corresponds to a causal path q in D, such that all nodes except for VK 1 on q are forbidden. The path q = p q is a proper non-causal path from X to Y in D; we now show that it is open given Z. Since de(V, D) de(V, DXY) F, for any node V / F in D, it follows from the fact that p is open given Z in DXY that it is also open given Z in D. Since F Z = ,

Witte, Henckel, Maathuis and Didelez

q is also open given Z. The node VK 1 is a non-collider on p and hence by the assumption that p is open given Z it follows that VK / Z. Since VK 1 is also a non-collider on q it follows that q is open given Z in D. We now turn to the second part of the proof showing that if a set Z containing no nodes in F is not a valid adjustment relative to (X, Y) in D then it is also not a valid adjustment set relative to (X, Y) in DXY. Suppose that Z forb(X, Y, D) = . Since Z F = it follows that Z (X Y) = ; but this clearly implies that Z cannot be a valid adjustment set in DXY. Suppose now that Z (forb(X, Y, D) Y) = and that Z is not a valid adjustment set relative to (X, Y) in D. This implies the existence of a proper non-causal path p = (V0, e1, V1, . . . , e K, VK) from X to Y in D that is open given Z. Consider ﬁrst the case that p contains no nodes in forb(X, Y, D). Then p also exists in DXY and by the fact that de(V, DXY) = de(V, D) \ F it follows that p is also open given Z in DXY. Suppose now that p contains at least one node in forb(X, Y, D). Since p is open given Z, it cannot contain a collider in forb(X, Y, D). If we suppose that all nodes on p are in forb(X, Y, D), then its existence implies that no valid adjustment exists in D, while the corresponding edge in the forbidden projection would imply the same for DXY. Hence, we can suppose that p contains at least one node not in forb(X, Y, D) and let VL be the last such node. Let p = (V0, e1, V1, . . . , e L, VL) and p = (VL, e L+1, V1, . . . , e K, VK). By construction, VL+1 forb(X, Y, D) and since forb(X, Y, D) \ X forb(X, Y, D), it follows that p is causal. Thus the forbidden projection will map p to the path q = VL VK. This also implies that p is non-causal. Suppose ﬁrst that V0 X de(forb(X, Y, D)\X, D). Then no valid adjustment set exists in D. Further, there must be a bi-directed edge from X to Y in DXY and hence that no valid adjustment set exists in DXY either. We can hence suppose that V0 / forb(X, Y, D) \ X. This implies that all nodes on p except V0 are not in forb(X, Y, D). Since this implies that no node on p is in de(F, D) it follows that p is also a path in DXY. The path q = p q

is a proper non-causal path from X to Y in DXY. By the usual argument p is also open given Z in DXY and trivially, this is also true for q . Further, VL / Z is a non-collider on q and hence, q is open given Z in DXY.

Proposition 10 Let X and Y be disjoint subsets of the node set V of a DAG D such that Y de(X, D). Then O(X, Y, D) = O (X, Y, D).

Proof We ﬁrst show that O(X, Y, D) O (X, Y, D). Let Z O(X, Y, D). By Deﬁnition 1, O(X, Y, D) forb(X, Y, D) = and hence Z is a node in DXY. Furthermore, since O(X, Y, D) pa(cn(X, Y, D), D), and cn(X, Y, D) forb(X, Y, D), there is a node Y Y such that D contains a directed path Z Y on which all non-endpoint nodes are in forb(X, Y, D). Due to property 1 of Deﬁnition 4, this corresponds to an edge Z Y in DXY, hence Z O (X, Y, D). Next, we show that O (X, Y, D) O(X, Y, D). Let Z O (X, Y, D). By Deﬁnition 5, this implies that Z V \ (forb(X, Y, D) X Y). Moreover, by Deﬁnition 9, there is an edge Z Y in DXY with Y Y. In D, this corresponds to a directed path Z Y on which all non-endpoint nodes are in forb(X, Y, D), and

On Efficient Adjustment in Causal Graphs

Z forb(X, Y, D). Denote the path by p. There are two cases: In the ﬁrst case, p has no non-endpoint nodes, i.e. D contains the edge Z Y . Since we assume Y de(X, D), Y must be in cn(X, Y, D), hence Z O(X, Y, D). In the second case, p has at least one non-endpoint node. This means that Z pa(W, D), where W forb(X, Y, D) \ (X Y) and W an(Y , D). Since in a DAG, all forbidden nodes are descendants of X, we also have W de(X, D), and hence W cn(X, Y, D). It follows that Z O(X, Y, D).

Appendix C. Generalisation of the Forbidden Projection and the O -Set to Amenable Max PDAGs

In this appendix, we generalise the forbidden projection (Deﬁnition 5) and the O -set (Deﬁnition 9) to amenable max PDAGs and show that Propositions similar to 6, 8, 7 and 10 still hold for the more general deﬁnitions. The latent projection in general (Deﬁnition 4) cannot be generalised to (amenable) max PDAGs as marginalising does not generally result in an ADMG. As an example, consider the max PDAG W1 L W2 with latent node L. It is not clear how the projection should be constructed in this case: W1 W2 would give the wrong impression that W1 is an ancestor of W2 (instead of a possible ancestor), while W1 W2 would imply that W2 is a possible ancestor of W1. As we will show in the following propositions, however, the latent projection can be meaningfully generalised to amenable max PDAGs when projecting over the special case of a forbidden set.

Deﬁnition 17 (Forbidden projection for amenable max PDAGs) Let G be a max PDAG with node set V, and let X V and Y V \ X such that G is amenable relative to (X, Y ). Deﬁne F = forb(X, Y, G) \ (X {Y }). The forbidden projection GXY of G is a graph with node set V \ F and edges as follows: For distinct nodes Wi, Wj V \ F,

1. GXY contains a directed edge Wi Wj if and only if G contains a directed path Wi Wj on which all non-endpoint nodes are in F,

2. GXY contains a bi-directed edge Wi Wj if and only if G contains a path, with at least one non-endpoint node, of the form Wi Wj on which all non-endpoints are non-colliders and in F,

3. GXY contains an undirected edge Wi Wj if and only if G contains Wi Wj.

Note that we restrict the deﬁnition to singleton Y . This is because with a set Y, we run into similar construction/interpretation problems as described above. Consider, for example, an amenable max PDAG with node set {X, F, Y1, Y2} and edges X Y1 F Y2 as well as X F. None of Y1 Y2, Y1 Y2 or Y1 Y2 are correct representations of the marginal distribution. Before generalising the O -set, we now describe the properties of the forbidden projection for max PDAGs. A key property is that if a valid adjustment set exists, the forbidden projection of a max PDAG is itself a max PDAG (Proposition 22). This is analogous to Proposition 7 for DAGs. Proposition 19, in analogy to Proposition 6, states that if X has

Witte, Henckel, Maathuis and Didelez

a causal eﬀect on Y , then the forbidden projection can be used to check whether a valid adjustment set exists. In Proposition 25, we show that a set Z is a valid adjustment set in the forbidden projection if and only if it is a valid adjustment set in the original graph, which is analogous to Proposition 8. We begin with a lemma that will allow us to use possde(X, G) and de(X, G) interchangeably when G is an amenable max PDAG.

Lemma 18 Let p = (V1, V2, . . . , VK) be a possibly directed path in a max PDAG G such that no node on p shares an undirected edge with V1. Then a subsequence of p forms a directed path from V1 to VK in G.

Proof We show this by induction. By assumption, (V1, V2) is a subsequence of p and forms a directed path from V1 to V2. Now assume that a subsequence of p forms a directed path from V1 to Vk 1, for 2 < k K. Denote this subsequence by (V1 = W1, W2, . . . , WQ = Vk 1). Clearly, if Vk 1 Vk then (V1 = W1, W2, . . . , WQ = Vk 1, Vk) is a subsequence of p and forms a directed path from V1 to Vk in G, which is what we wanted to show. If, on the other hand, if Vk 1 Vk then there are four cases, three of which lead to a contradiction: (1) The induced subgraph of G on {WQ 1, WQ = Wk 1, Vk} is Graph 1 in Figure 9. Then G is not closed under Meek s Rule 1 (see Figure 8), which is a contradiction. (2) The induced subgraph of G on {WQ 1, WQ = Wk 1, Vk} is Graph 2 in Figure 9. Then G is not closed under Meek s Rule 2, which is a contradiction. (3) The induced subgraph of G on {WQ 1, WQ = Wk 1, Vk} is Graph 3 in Figure 9. This implies the induced subgraph of G on {WQ 2, WQ 1, Vk} would also be graph 3 (with the same reasons as above excluding graphs 1 and 2). Repeating the argument for WQ 3, WQ 4, . . . implies an undirected edge between W1 = V1 and Vk, which is a contradiction. (4) The induced subgraph of G on {WQ 1, WQ = Wk 1, Vk} is Graph 4 in Figure 9. Then (V1 = W1, W2, . . . , WQ 1, Vk) is a subsequence of p and forms a directed path from V1 to Vk, which is what we wanted to show.

Figure 9: Graphs for the proof of Lemma 18.

Proposition 19 Let G be a causal max PDAG with node set V, and let X V and Y V \ X such that Y possde(X, G) and G is amenable relative to (X, Y ). Then a valid adjustment set relative to (X, Y ) in G exists if and only if there is no bi-directed edge between any nodes in the forbidden projection GXY .

On Efficient Adjustment in Causal Graphs

Proof By Lemma 18, Y possde(X, G) implies Y de(X, G). The proof is now analogous to the proof of Proposition 6, with Lemma 15 replaced by Lemma 20.

Lemma 20 Let G be a causal max PDAG with node set V, and let X V and Y V \ X such that Y de(X, G) and G is amenable relative to (X, Y ). Then a valid adjustment set relative to (X, Y ) in G exists if and only if X de(cn(X, Y, G), G) = .

Proof We show that no valid adjustment set relative to (X, Y ) in G exists if and only if X de(cn(X, Y, G), G) = . The proof is similar to the proof of Corollary 27 in Perkovi c et al. (2018). Assume ﬁrst that no valid adjustment set relative to (X, Y ) in G exists, then by Lemma 21, there is a proper non-causal deﬁnite-status path from some X X to Y that is open given adjust(X, Y, G) = possan(X {Y }, G) \ (X forb(X, Y, G)). Denote one such path by p. Assume for contradiction that p contains a collider and denote the collider by C. As p is open given adjust(X, Y, G), a descendant of C is in adjust(X, Y, G). This implies that an(C, G) forb(X, Y, G) = , as otherwise all descendants of C would be in forb(X, Y, G) and could not be in adjust(X, Y, G). As Y forb(X, Y, G), it follows that at least one of the nodes adjacent to C on p must be a non-endpoint non-collider on p. Denote one such node by B. As B pa(C, G), B forb(X, Y, G) and B adjust(X, Y, G). But then p is not open given adjust(X, Y, G), which is a contradiction. Hence, p does not contain a collider. As p is non-causal, p cannot be directed towards Y , and as we assume that Y de(X, G), p cannot be directed towards X. Hence, p is a path of the form X A Y , where every non-endpoint is a non-collider not in adjust(X, Y, G). It follows that A is in forb(X, Y, G) and thus is a descendant of X, which implies that X de(cn(X, Y, G), G) and hence X de(cn(X, Y, G), G) = . Assume now that X de(cn(X, Y, G), G) = . Pick a node from X de(cn(X, Y, G), G) and denote it by X . Then there must exist a node C cn(X, Y, G) such that there is a path of the form X C where that all non-endpoint non-colliders on the path are in the forbidden set. This path cannot be blocked by any set of non-forbidden nodes.

Lemma 21 (Theorem 5.6 in Perkovi c et al., 2017) Let X and Y be disjoint node sets in a causal max PDAG G such that G is amenable relative to (X, Y), and let adjust(X, Y, G) = possan(X Y, G) \ (X Y forb(X, Y, G)). Then a valid adjustment set relative to (X, Y) in D exists if and only if all proper non-causal deﬁnite-status paths from X to Y are blocked by adjust(X, Y, G) in G.

Proposition 22 Let G be a causal max PDAG with node set V, and let X V and Y V\X such that G is amenable relative to (X, Y ) and there exists a valid adjustment set relative to (X, Y ) in G. Denote the set of DAGs represented by G by [G] = {D1, D2, . . . , DM}. Then the forbidden projection GXY is the causal max PDAG representing the DAGs in {DXY 1 , DXY 2 , . . . , DXY M }.

Witte, Henckel, Maathuis and Didelez

Proof We only consider the case that Y possde(X, G), as otherwise the proposition follows trivially from the fact that GXY = G. By Lemma 18, Y possde(X, G) implies Y de(X, G). We know from Propositions 6 and 19 that none of GXY , DXY 1 , DXY 2 , . . . , DXY M contain any bi-directed edges. Consider edges present in the latent projections but not in the original graphs: For the max PDAG G, denote the set of edges in GXY but not in G by e(G), and deﬁne analogous sets e(D1), e(D2), . . . , e(DM) for the DAGs D1, D2, . . . , DM. None of e(G), e(D1), e(D2), . . . , e(DM) contain any undirected edges. Further, any directed edges in any of e(G), e(D1), e(D2), . . . , e(DM) are into Y . This is because for every G {G, D1, D2, . . . , DM}, an edge in e(G ) corresponds to a directed path with at least one forbidden non-endpoint node in G . If an edge in e(G ) was into a node V V \ (X {Y }), then V would itself be forbidden, which is a contradiction. If an edge in e(G ) was into a node X X, then X would be in de(cn(X, Y, G ), G ), which by Lemma 20 contradicts our assumption that a valid adjustment set exists relative to (X, Y ) in G . Hence, all edges in all of e(G), e(D1), e(D2), . . . , e(DM) are into Y . In fact, by Lemma 23 below, e(G) = e(D1) = e(D2) = = e(DM) = e. The graphs GXY , DXY 1 , DXY 2 , . . . , DXY M can thus be constructed by copying the induced subgraphs of G, D1, D2, . . . , DM with respect to (V \ forb(X, Y, G)) (X, Y ) and adding the edges in e. Hence, GXY represents the DAGs in {DXY 1 , DXY 2 , . . . , DXY M } in the sense that every directed edge in GXY is also present in all DAGs in {DXY 1 , DXY 2 , . . . , DXY M }, and for every undirected edge Vi Vj in GXY , there is at least one DAG in {DXY 1 , DXY 2 , . . . , DXY M } with Vi Vj and at least one with Vi Vj. In order show that GXY has all the characteristics of a max PDAG, we show that GXY is closed under Meek s rules. Referring to Figure 8, we argue that the graphs on the left-hand sides of Rules 1 4 cannot be induced subgraphs of GXY . Assume for contradiction that the left-hand graph of Rule 1, , was an induced subgraph of GXY . As this graph is not an induced subgraph of G by assumption, and all of e(G), e(D1), e(D2), . . . , e(DM) consist of only directed edges into Y , we can conclude that the directed edge in is into Y , i.e. Y . Hence, Y shares an undirected edge with some node V in G, but this means that V is a forbidden node in some D [G], which is not allowed according to Lemma 24. By similar arguments, none of the graphs on the left-hand sides of Rules 1 4 in Figure 8 is an induced subgraph of GXY .

Lemma 23 Let G be a causal max PDAG with node set V, and let X V and Y V \ X such that G is amenable relative to (X, Y ) and there exists a valid adjustment set relative to (X, Y ) in G. Deﬁne F = forb(X, Y, G) \ (X {Y }) and pick a node V1 V \ F. Then the following two statements are equivalent:

(i) A DAG D [G] contains a directed path p = (V1, V2, . . . , VK = Y ) such that all non-endpoint nodes on p are in F.

(ii) The max PDAG G contains a directed path q = (V1 = W1, W2, . . . , WQ = Y ) such that all non-endpoint nodes on q are in F.

Proof Statement (ii) implies that the directed path p is present in all DAGs in [G] by the deﬁning properties of a max PDAG. Hence, we only show that (i) implies (ii). Again by the properties of max PDAGs, the sequence of nodes (V1, V2, . . . , VK = Y ) forms a possibly

On Efficient Adjustment in Causal Graphs

directed path from V1 to Y in G. We ﬁrst show that no node in {V2, . . . , VK = Y } shares an undirected edge with V1. Suppose, for contradiction, that node Vk, 2 k K shares an undirected edge with V1 and distinguish two cases: (1) V1 X, (2) V1 V \ X. The ﬁrst case contradicts our assumption that G is amenable relative to (X, Y ). The second case implies that V1, as a possible descendant of Vk, is in F, but we chose V1 such that V1 V \ F. Hence, no node in {V2, . . . , VK = Y } shares an undirected edge with V1. We can thus apply Lemma 18 and conclude that a subsequence of (V1, V2, . . . , VK = Y ) forms a directed path from V1 to Y in G, which implies that statement (ii) holds.

Lemma 24 (Lemma E.8 in HPM19) Let X and Y be disjoint node sets in a max PDAG G, such that G is amenable relative to (X, Y), and let D be a DAG in [G]. Then forb(X, Y, G) = forb(X, Y, D).

Proposition 25 Let G be a causal max PDAG with node set V and let X V and Y V \ X such that G is amenable relative to (X, Y ). Then a set Z is a valid adjustment set relative to (X, Y ) in G if and only if it is a valid adjustment set relative to (X, Y ) in the forbidden projection GXY .

Proof Let D [G] and DXY its forbidden projection. By Proposition 8, a set Z is a valid adjustment set relative to (X, Y ) in D if and only if it is a valid adjustment set relative to (X, Y ) in DXY . By Proposition 22, the set [GXY ] contains exactly the forbidden projections of all DAGs in [G]. Hence if Z is a valid adjustment set in all D [G], then it is a valid adjustment set in all DXY [GXY ] and vice versa.

To summarise, the forbidden projection for amenable causal max PDAGs has very similar properties as the forbidden projection for causal DAGs, as long as we consider a singleton outcome node Y : Bi-directed edges in the projection indicate the lack of a valid adjustment set; if a valid set exists, the forbidden projection is a max PDAG itself, preserving all the information relevant to causal eﬀect identiﬁcation via adjustment; in particular, all valid sets can be read oﬀthe forbidden projection as well as the original graph. Finally, we now generalise Deﬁnition 9 of the O -set and its optimality property in Proposition 10 to amenable max PDAGs with singleton Y .

Deﬁnition 26 Let G be a causal max PDAG with node set V, let X V and Y V \ X such that G is amenable relative to (X, Y ), and let GXY be the corresponding forbidden projection. We deﬁne O (X, Y, G) as:

O (X, Y, G) = pa(Y, GXY) \ X.

Proposition 27 Let G be a causal max PDAG with node set V, let X V and Y V \ X such that Y possde(X, D) and G is amenable relative to (X, Y ), let Z be a valid adjustment set relative to (X, Y ) in G and let O = O (X, Y, G). Then O is a valid adjustment set relative to (X, Y ) in G and if the variables V follow a linear causal model compatible with G, then, for every Xi X, a.var(ˆβyxi.x io ) a.var(ˆβyxi.x iz).

Witte, Henckel, Maathuis and Didelez

Proof We prove this by showing that O(X, Y, G) and O (X, Y, G) are equal and invoking Propositions 2 and 3. By Lemma 18, Y possde(X, G) implies Y de(X, G). Then the equivalence follow directly from Lemma 28 in combination with Proposition 10: For every DAG D [G], O(X, Y, G) = O(X, Y, D) = O (X, Y, D) = O (X, Y, G).

Lemma 28 (Lemma E.7 in HPM19) Let X and Y be disjoint node sets in a max PDAG G such that G is amenable relative to (X, Y) in G, let Y possde(X, G), and let D be a DAG in [G]. Then O(X, Y, D) = O(X, Y, G).

Appendix D. Proof for Section 4

Proposition 11 Let X and Y be nodes in a causal CPDAG or max PDAG G = (V, E), such that V follows a causal linear model compatible with G with Gaussian errors. Let bΘP

and bΘO be the multisets returned by semi-local IDA and optimal IDA respectively, applied to X, Y and G, with the subsets of sib(X, G) considered in the same order for both. Then, for i {1 . . . , k}, with k = |bΘP| = |bΘO|,

1. E[bΘP i ] = E[bΘO i ] and

2. a.var(bΘP i ) a.var(bΘO i ).

Proof Consider any set Si sib(X, G). Perkovi c et al. (2017) showed that there exists a DAG D [G], such that Pi = pa(X, D) = Si pa(X, G), if and only if directing the edges in the neighbourhood of X according to Pi and applying Meek s orientation rules results in a valid max PDAG G i. If this is not the case for Si, both algorithms discard Si at their respective line 8. We can hence suppose that there exists a DAG D [G], such that Pi = pa(X, D) = Si pa(X, G). Suppose that Y possde(X, G i). In this case bΘO i = ˆβyx.oi, where Oi = O(X, Y, G i). As G i is amenable by construction, it follows from Lemma 28 in Appendix C that Oi = O(X, Y, D). Further, Y possde(X, G i) implies that Y / pa(X, G i) and thus bΘP i = ˆβyx.pi. By Proposition 2, Oi is a valid adjustment set relative to (X, Y ) in D, and clearly the same holds for Pi. Since we suppose multivariate Gaussianity, this implies that both ˆβyx.oi and ˆβyx.pi are consistent estimators of τyx(D), and E[ˆβyx.oi] = E[ˆβyx.pi] = τyx(D). Further, by Lemmas E.4 and E.5 of the Supplement of HPM19, Pi \ Oi is conditionally independent of Y given {X} Pi, and Oi \ Pi is conditionally independent of X given Pi, respectively. These two independencies allow us to invoke Lemma C.2 of HPM19 and conclude that σyy.xoi σyy.xpi as well as σxx.oi σxx.pi. As we assume a multivariate Gaussian distribution, it follows that

a.var(ˆβyx.oi) = σyy.xoi

σxx.oi σyy.xpi

σxx.pi = a.var(ˆβyx.pi).

Suppose now that Y / possde(X, G i). Then Y / de(X, D), hence τyx(D) = 0. As Y / possde(X, G i), bΘO i = 0 and as a result a.var(bΘO i ) = 0. If Y pa(X, G i), then bΘPi = 0 and as a result a.var(bΘP i ) = 0. If Y / possde(X, G i) pa(X, G i), then bΘP i = ˆβyx.pi and by nature of parent sets E[ˆβyx.pi] = 0. Clearly, a.var(bΘP i ) > 0 in this case.

On Efficient Adjustment in Causal Graphs

Appendix E. Violin Plots

d=2 d=3 d=4 d=2 d=3 d=4

p=10 p=20 p=50 p=100

RMSE optimal IDA / local IDA

n=100 n=1 000

Figure 10: Violin plots of the relative mean squared error (RMSE) r over 1000 repetitions for scenarios with diﬀerent numbers of nodes (p), expected number of neighbours per node (d), and sample sizes (n). Optimal and semi-local IDA were applied to the true CPDAG G. The dots mark the geometric means, the plus signs the medians.

Witte, Henckel, Maathuis and Didelez

d=2 d=3 d=4 d=2 d=3 d=4

p=10 p=20 p=50 p=100

RMSE optimal IDA / local IDA

n=100 n=1 000

Figure 11: Violin plots of the relative mean squared error (RMSE) r over 1000 repetitions for scenarios with diﬀerent numbers of nodes (p), expected number of neighbours per node (d), and sample sizes (n). Optimal and semi-local IDA were applied to the estimated CPDAG G . The dots mark the geometric means, the plus signs the medians.

On Efficient Adjustment in Causal Graphs

Hirotugu Akaike. A new look at the statistical model identiﬁcation. IEEE Transactions on Automatic Control, 19(6):716 723, 1974.

Steen A. Andersson, David Madigan, and Michael D. Perlman. A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics, 25(2):505 541, 1997.

Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. Inference on treatment eﬀects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608 650, 2014.

Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao. Valid postselection inference. The Annals of Statistics, 41(2):802 837, 2013.

M. Alan Brookhart, Sebastian Schneeweiss, Kenneth J. Rothman, Robert J. Glynn, Jerry Avorn, and Til St urmer. Variable selection for propensity score models. American Journal of Epidemiology, 163(12):1149 1156, 2006.

Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duﬂo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1 C68, 2018.

David Maxwell Chickering. Optimal structure identiﬁcation with greedy search. Journal of Machine Learning Research, 3(Nov):507 554, 2002.

A. Philip Dawid and Vanessa Didelez. Identifying the consequences of dynamic treatment strategies: A decision-theoretic overview. Statistics Surveys, 4:184 231, 2010.

Xavier de Luna, Ingeborg Waernbaum, and Thomas S Richardson. Covariate selection for the nonparametric estimation of an average treatment eﬀect. Biometrika, 98(4):861 875, 2011.

De Wayne Derryberry, Ken Aho, John Edwards, and Teri Peterson. Model selection and regression t-statistics. The American Statistician, 72(4):379 381, 2018.

Oliver Dukes and Stijn Vansteelandt. How to obtain valid tests and conﬁdence intervals after propensity score variable selection? Statistical Methods in Medical Research, 29(3): 677 694, 2020a.

Oliver Dukes and Stijn Vansteelandt. Inference on treatment eﬀect parameters in potentially misspeciﬁed high-dimensional models. Biometrika, 2020b.

Julia C. Engelmann, Thomas Amann, Birgitta Ott-R otzer, Margit N utzel, Yvonne Reinders, J org Reinders, Wolfgang E. Thasler, Theresa Kristl, Andreas Teufel, Christian G. Huber, Peter J. Oefner, Rainer Spang, and Claus Hellerbrand. Causal modeling of cancer-stromal communication identiﬁes PAPPA as a novel stroma-secreted factor activating NFκB signaling in hepatocellular carcinoma. PLo S Computational Biology, 11(5):e1004293, 2015.

Sander Greenland and Neil Pearce. Statistical foundations for model-based adjustments. Annual Review of Public Health, 36:89 108, 2015.

Witte, Henckel, Maathuis and Didelez

F. Richard Guo and Emilija Perkovi c. Eﬃcient least squares for estimating total eﬀects under linearity and causal suﬃciency. ar Xiv preprint ar Xiv:2008.03481v2, 2020.

Jinyong Hahn. On the role of the propensity score in eﬃcient semiparametric estimation of average treatment eﬀects. Econometrica, 66:315 331, 1998.

Frank E. Harrell, Jr. Regression modeling strategies. Springer, 5th edition, 2010.

Leonard Henckel, Emilija Perkovi c, and Marloes H. Maathuis. Graphical criteria for efﬁcient total eﬀect estimation via adjustment in causal linear models. ar Xiv preprint ar Xiv:1907.02435, 2019.

Keisuke Hirano, Guido W. Imbens, and Geert Ridder. Eﬃcient estimation of average treatment eﬀects using the estimated propensity score. Econometrica, 71(4):1161 1189, 2003.

Markus Kalisch, Martin M achler, Diego Colombo, Marloes H. Maathuis, and Peter B uhlmann. Causal inference using graphical models with the R package pcalg. Journal of Statistical Software, 47(11):1 26, 2012.

Markus Kalisch, Alain Hauser, Martin Maechler, Diego Colombo, Doris Entner, Patrik Hoyer, Antti Hyttinen, Jonas Peters, Nicoletta Andri, Emilija Perkovic, Preetam Nandy, Philipp Ruetimann, Daniel Stekhoven, Manuel Schuerch, and Marco Eigenmann. pcalg: Methods for Graphical Models and Causal Inference, 2019. URL https: //CRAN.R-project.org/package=pcalg. R package version 2.6-6.

David G. Kleinbaum and Lawrence L. Kupper. Applied regression analysis and other multivariable methods. Duxbury Press, 1978.

Sven Kn uppel and Andreas Stang. DAG program: Identifying minimal suﬃcient adjustment sets. Epidemiology, 21(1):159, 2010.

Manabu Kuroki and Zhihong Cai. Selection of identiﬁability criteria for total eﬀects by using path diagrams. In Proceedings of the Twentieth Conference on Uncertainty in Artiﬁcial Intelligence (UAI-04), pages 333 340, 2004.

Manabu Kuroki and Masami Miyakawa. Covariate selection for estimating the causal eﬀect of control plans by using causal diagrams. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(1):209 222, 2003.

Thuc Duy Le, Lin Liu, Anna Tsykin, Gregory J Goodall, Bing Liu, Bing-Yu Sun, and Jiuyong Li. Inferring micro RNA m RNA causal regulatory relationships from expression data. Bioinformatics, 29(6):765 771, 2013.

Jason D. Lee, Dennis L. Sun, Yuekai Sun, and Jonathan E. Taylor. Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907 927, 2016.

Hannes Leeb and Benedikt M. P otscher. Can one estimate the unconditional distribution of post-model-selection estimators? Econometric Theory, 24(2):338 376, 2008.

On Efficient Adjustment in Causal Graphs

Lexin Li, R. Dennis Cook, and Christopher J. Nachtsheim. Model-free variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):285 299, 2005.

Richard Lockhart, Jonathan Taylor, Ryan J. Tibshirani, and Robert Tibshirani. A signiﬁcance test for the lasso. Annals of Statistics, 42(2):413 468, 2014.

Karim Lounici. Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electronic Journal of Statistics, 2:90 102, 2008.

Jared K. Lunceford and Marie Davidian. Stratiﬁcation and weighting via the propensity score in estimation of causal treatment eﬀects: A comparative study. Statistics in Medicine, 23(19):2937 2960, 2004.

Jiawei Luo, Wei Huang, and Buwen Cao. A novel approach to identify the mi RNA-m RNA causal regulatory modules in cancer. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 15(1):309 315, 2018.

Marloes H. Maathuis and Diego Colombo. A generalised back-door criterion. The Annals of Statistics, 43(3):1060 1088, 2015.

Marloes H. Maathuis, Markus Kalisch, and Peter B uhlmann. Estimating high-dimensional intervention eﬀects from observational data. The Annals of Statistics, 37(6A):3133 3164, 2009.

Marloes H. Maathuis, Diego Colombo, Markus Kalisch, and Peter B uhlmann. Predicting causal eﬀects in large-scale systems from observational data. Nature Methods, 7(4):247 248, 2010.

Christopher Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artiﬁcial Intelligence (UAI-95), pages 403 410, 1995.

Douglas C. Montgomery, Elizabeth A. Peck, and C. Geoﬀrey Vining. Introduction to linear regression analysis. John Wiley & Sons, 5th edition, 2012.

Paul A. Murtaugh. In defense of p values. Ecology, 95(3):611 617, 2014.

Preetam Nandy, Marloes H. Maathuis, and Thomas S. Richardson. Estimating the eﬀect of joint interventions from observational data in sparse high-dimensional settings. The Annals of Statistics, 45(2):647 674, 2017.

Judea Pearl. Direct and indirect eﬀects. In Proceedings of the Seventeenth Conference on Uncertainty in Artiﬁcial Intelligence (UAI-01), pages 411 420, 2001.

Judea Pearl. Causality: Models, reasoning, and inference. Cambridge University Press, 2nd edition, 2009.

Emilija Perkovi c. Identifying causal eﬀects in maximally oriented partially directed acyclic graphs. In Proceedings of the Thirty-Sixth Conference on Uncertainty in Artiﬁcial Intelligence (UAI-20), page ID: 229, 2020.

Witte, Henckel, Maathuis and Didelez

Emilija Perkovi c, Johannes Textor, Markus Kalisch, and Marloes H. Maathuis. A complete generalized adjustment criterion. In Proceedings of the Thirty-First Conference on Uncertainty in Artiﬁcial Intelligence (UAI-15), pages 682 691, 2015.

Emilija Perkovi c, Markus Kalisch, and Maloes H. Maathuis. Interpreting and using CPDAGs with background knowledge. In Proceedings of the Thirty-Third Conference on Uncertainty in Artiﬁcial Intelligence (UAI-17), page ID: 120, 2017.

Emilija Perkovi c, Johannes Textor, Markus Kalisch, and Marloes H. Maathuis. Complete graphical characterization and construction of adjustment sets in Markov equivalence classes of ancestral graphs. Journal of Machine Learning Research, 18(220):1 62, 2018.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2019. URL https://www.R-project.org/.

Joseph D. Ramsey. A scalable conditional independence test for nonlinear, non-gaussian data. ar Xiv preprint ar Xiv:1401.5031v2, 2014.

Thomas Richardson. Markov properties for acyclic directed mixed graphs. Scandinavian Journal of Statistics, 30(1):145 157, 2003.

Thomas S. Richardson, Robin J. Evans, James M. Robins, and Ilya Shpitser. Nested Markov properties for acyclic directed mixed graphs. ar Xiv preprint ar Xiv:1701.06686, art. ar Xiv:1701.06686, 2017.

Alessandro Rinaldo, Larry Wasserman, and Max G Sell. Bootstrapping and sample splitting for high-dimensional, assumption-lean inference. The Annals of Statistics, 47(6):3438 3469, 2019.

James Robins. A new approach to causal inference in mortality studies with a sustained exposure period application to control of the healthy worker survivor eﬀect. Mathematical Modelling, 7(9 12):1393 1512, 1986.

James M. Robins and Sander Greenland. Identiﬁability and exchangeability for direct and indirect eﬀects. Epidemiology, 3(2):143 155, 1992.

Andrea Rotnitzky and Ezequiel Smucler. Eﬃcient adjustment sets for population average treatment eﬀect estimation in non-parametric causal graphical models. Journal of Machine Learning Research, 21(188):1 86, 2020.

Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2): 461 464, 1978.

Rajen D. Shah and Jonas Peters. The hardness of conditional independence testing and the generalised covariance measure. The Annals of Statistics, 48(3):1514 1538, 2020.

Shohei Shimizu, Patrik O. Hoyer, Aapo Hyv arinen, and Antti Kerminen. A linear non Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7: 2003 2030, 2006.

On Efficient Adjustment in Causal Graphs

Susan M. Shortreed and Ashkan Ertefaie. Outcome-adaptive lasso: Variable selection for causal inference. Biometrics, 73(4):1111 1122, 2017.

Ilya Shpitser, Tyler Vander Weele, and James M. Robins. On the validity of covariate adjustment for estimating causal eﬀects. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artiﬁcial Intelligence (UAI-10), pages 527 536, 2010.

Ilya Shpitser, Robin J. Evans, Thomas S. Richardson, and James M. Robins. Introduction to nested Markov models. Behaviormetrika, 41(1):3 39, 2014.

Ezequiel Smucler, Andrea Rotnitzky, and James M. Robins. A unifying approach for doublyrobust l1 regularized estimation of causal contrasts. ar Xiv preprint ar Xiv:1904.03737v3, 2019.

Ezequiel Smucler, Facundo Sapienza, and Andrea Rotnitzky. Eﬃcient adjustment sets in causal graphical models with hidden variables. ar Xiv preprint ar Xiv:1912.00306, 2020.

Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, prediction, and search. MIT press, 2000.

Johannes Textor and Maciej Li skiewicz. Adjustment criteria in causal diagrams: An algorithmic perspective. In Proceedings of the 27th Conference on Uncertainty in Artiﬁcial Intelligence (UAI-11), pages 681 688, 2011.

Jin Tian and Judea Pearl. On the identiﬁcation of causal eﬀects. Technical Report R-290-L, Department of Computer Science, University, University of California, Los Angeles, 2003.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58(1):267 288, 1996.

Dirk van Kampen. The Schizotypic Syndrome Questionnaire (SSQ): Psychometrics, validation and norms. Schizophrenia Research, 84(2 3):305 322, 2006.

Dirk van Kampen. The SSQ model of schizophrenic prodromal unfolding revised: An analysis of its causal chains based on the language of directed graphs. European Psychiatry, 29(7):437 448, 2014.

TS Verma and Judea Pearl. Equivalence and synthesis of causal models. Technical Report R150, Department of Computer Science, University, University of California, Los Angeles, 1990.

Janine Witte and Vanessa Didelez. Covariate selection strategies for causal inference: Classiﬁcation and comparison. Biometrical Journal, 61(5):1270 1289, 2019.

Peng Zhao and Bin Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541 2563, 2006.

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301 320, 2005.