
The Thirty-Fourth AAAI Conference on Artiﬁcial Intelligence (AAAI-20)

Estimating Causal Effects Using Weighting-Based Estimators

Yonghan Jung Department of Computer Science Purdue University jung222@purdue.edu

Jin Tian Department of Computer Science Iowa State University jtian@iastate.edu

Elias Bareinboim Department of Computer Science Columbia University eb@cs.columbia.edu

Causal effect identiﬁcation is one of the most prominent and well-understood problems in causal inference. Despite the generality and power of the results developed so far, there are still challenges in their applicability to practical settings, arguably due to the ﬁnitude of the samples. Simply put, there is a gap between causal effect identiﬁcation and estimation. One popular setting in which sample-efﬁcient estimators from ﬁnite samples exist is when the celebrated back-door condition holds. In this paper, we extend weighting-based methods developed for the back-door case to more general settings, and develop novel machinery for estimating causal effects using the weighting-based method as a building block. We derive graphical criteria under which causal effects can be estimated using this new machinery and demonstrate the effectiveness of the proposed method through simulation studies.

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1 Introduction

Computing the effects of interventions is one of the central tasks in data-intensive sciences. This problem appears in the literature under the rubric of causal effect identification (Pearl 2000, Def. 3.2.4), which asks whether the causal distribution $P(Y = y \mid do(X = x))$ (for short, $P_x(y)$) can be uniquely identified from a combination of substantive knowledge about the phenomenon under investigation, usually in the form of a causal graph $G$, and the observational distribution $P(\mathbf{V})$, where $\mathbf{V}$ is the set of observed variables. Causal identification has been extensively studied based on the do-calculus (Pearl 1995). Building on this logic, a number of solutions were developed for variants of this problem, including complete graphical and algorithmic conditions (Tian 2002; Huang and Valtorta 2006; Shpitser and Pearl 2006; Bareinboim and Pearl 2012; 2016; Jaber, Zhang, and Bareinboim 2018; Lee, Correa, and Bareinboim 2019). Even though causal identification is well understood and solved in principle, there are still outstanding challenges to the application of these results in practice. By and large, these results assume that the precise observational distribution, $P(\mathbf{V})$, is available for use, while in reality one has access to only a limited number of samples drawn from $P(\mathbf{V})$.

One setting where estimators of $P_x(y)$ from finite samples have been systematically developed is when the well-known back-door (BD) criterion holds (Pearl 2000, Ch. 3.3.1). That is, if a set of variables $\mathbf{Z}$ (called covariates) satisfies the BD criterion relative to $(X, Y)$, then the effect $P_x(y)$ can be identified by covariate adjustment as $P_x(y) = \sum_{\mathbf{z}} P(y \mid x, \mathbf{z}) P(\mathbf{z})$, and the corresponding mean as:

$$\mathbb{E}_{P_x(y)}[Y] = \sum_{\mathbf{z}} \mathbb{E}[Y \mid x, \mathbf{z}]\, P(\mathbf{z}). \tag{1}$$

Computing Eq. (1) naively, i.e., estimating $\mathbb{E}[Y \mid x, \mathbf{z}]$ and summing over all values $\mathbf{Z} = \mathbf{z}$, is computationally and statistically challenging whenever the set $\mathbf{Z}$ is high-dimensional. Regarding the former, summing over $\mathbf{Z} = \mathbf{z}$ entails a computational burden that is exponential in $|\mathbf{Z}|$, the cardinality of $\mathbf{Z}$; regarding the latter, covering the support of $\mathbb{E}[Y \mid x, \mathbf{z}]$ and $P(\mathbf{z})$ with some statistical significance is hardly realizable. A series of robust and efficient estimators for the BD estimand (Eq. (1)) from finite samples have been developed to circumvent these challenges with great practical success, including propensity score matching (Rosenbaum and Rubin 1983), inverse-probability or stabilized weighting (IPW, SW) (Horvitz and Thompson 1952; Robins, Hernan, and Brumback 2000), doubly robust estimation (Bang and Robins 2005), targeted maximum likelihood estimation (TMLE) (Van Der Laan and Rubin 2006), and outcome regression such as BART (Hill 2011), just to cite a few. These techniques have been extended to BD-like estimands for time series, known as the g-formula (Robins 1986). This formula holds whenever sequential exchangeability or the sequential back-door (SBD) condition holds (Pearl and Robins 1995). Despite all their power, these BD-like conditions only cover a limited set of identifiable scenarios, while causal effects can be identifiable in many settings that are not in the form of an adjustment, for which no general-purpose estimators have been developed. For instance, we discuss below two practical examples where the causal effects are identifiable but not by BD-like adjustment.

Example 1: Surrogate endpoints. The causal graph in Fig. 1a illustrates the data-generating process of an experimental study that leverages a surrogate endpoint $X$, a variable intended to substitute for a clinical endpoint $Y$ when the clinical endpoint is hardly accessible. Suppose one is interested in estimating the causal effect of $X$ (e.g., CD4 cell counts) on $Y$ (e.g., progression of HIV) to validate the CD4 cell counts as a surrogate endpoint (Hughes et al. 1998). $W_2$ denotes the treatment for the CD4 cell counts and $W_1$ is a set of confounders affecting the treatment (e.g., a previous disease history). The resultant estimand is given by

$$P_x(y) = \frac{\sum_{w_1} P(x, y \mid w_1, w_2)\, P(w_1)}{\sum_{w_1} P(x \mid w_1, w_2)\, P(w_1)},$$

which is clearly not BD-like. To the best of our knowledge, no effective statistical estimator exists for this type of estimand.

Example 2: Causal mediators. Consider the causal graph in Fig. 1b, where $X$ represents the level of body mass index (BMI), $Z_4$ the level of multiple, possibly high-dimensional, metabolites, and $Y$ the occurrence of breast cancer (Derkach et al. 2019). Suppose we observe $Z_1$ (e.g., age), $Z_2$ (e.g., diet), and $Z_3$ (e.g., smoking), a set of confounding variables affecting levels of BMI, metabolites, and breast cancer. The goal of the analysis is to assess the effect of BMI levels on breast cancer. The resultant estimand is given by

$$P_x(y) = \sum_{\mathbf{z}} P(z_4 \mid x, \mathbf{z}^{(3)})\, P(\mathbf{z}^{(3)}) \sum_{x'} P(y \mid x', \mathbf{z})\, P(x' \mid \mathbf{z}^{(3)}),$$

where $\mathbf{Z} = (Z_1, Z_2, Z_3, Z_4)$ and $\mathbf{Z}^{(3)} = (Z_1, Z_2, Z_3)$, but no statistical estimator is readily available for this estimand.

In general, many graphical and algorithmic conditions have been developed for determining the identifiability of a causal effect $P_x(y)$ in a given causal graph. However, no general method exists in the literature for estimating $P_x(y)$ from finite samples whenever it is identifiable (for example, as given in Eq. (9)) but not in the form of a BD-like adjustment as in Eq. (1).¹ In short, given a causal graph $G$: (i) complete solutions have been developed for identifying $P_x(y)$ from $P(\mathbf{V})$; (ii) a plethora of methods exist for estimating BD-like estimands from finite samples when $G$ satisfies the BD/SBD criteria, but these criteria capture only a small fraction of the scenarios under which causal effects are identifiable; (iii) no systematic treatment exists for estimating arbitrary causal effect estimands that are not BD-like.

In this paper, we aim to start bridging the gap between causal identification and causal estimation. Specifically, we propose to extend weighting-based methods developed for the BD case (Robins, Hernan, and Brumback 2000) to settings beyond the BD, and further use the weighting method as a building block to estimate complex causal effect estimands. The contributions of the paper are as follows:

1. We introduce a weighting operator as a building-block estimand that can be estimated efficiently using existing statistical techniques developed for the BD estimand.
2. We develop novel machinery for estimating complex causal effects based on the composition of weighting operators.
3. We prove graphical criteria (mSBD, Surrogate, and mSBD composition) that go beyond the BD, under which a causal estimand can be expressed as a weighting operator or a composition of weighting operators, and therefore lends itself to effective estimators. Simulation studies demonstrate the effectiveness of the proposed estimators.

¹ Estimators for specific settings, including the SBD and front-door, have been developed based on influence functions (IF) (Fulcher et al. 2019).

Figure 1: Causal graphs corresponding to Examples 1 and 2: (a) Surrogate endpoints; (b) Causal mediators. Nodes representing the treatment and outcome are colored in blue and red, respectively.

All the proofs are provided in Appendix D in the supplemental material.

2 Preliminaries

We use the language of structural causal models (SCMs) (Pearl 2000, pp. 204-207) as our basic semantical framework. Each SCM $M$ over a set of variables $\mathbf{V}$ has a causal graph $G$ associated with it. Solid directed arrows encode functional relationships between observed variables, and dashed bidirected arrows encode unobserved common causes (e.g., see Fig. 1a). Within the structural semantics, performing an intervention and setting $\mathbf{X} = \mathbf{x}$ is represented through the do-operator, $do(\mathbf{X} = \mathbf{x})$, which encodes the operation of replacing the original equations of $\mathbf{X}$ by the constant $\mathbf{x}$ and induces a submodel $M_{\mathbf{x}}$ and an experimental distribution $P_{\mathbf{x}}(\mathbf{v})$. Given a causal graph $G$ over a set of variables $\mathbf{V}$, a causal effect $P_{\mathbf{x}}(\mathbf{y})$ is said to be identifiable in $G$ if $P_{\mathbf{x}}(\mathbf{y})$ is uniquely computable from $P(\mathbf{v})$ in any SCM that induces $G$. For a detailed discussion of SCMs, refer to (Pearl 2000).

Each variable is represented with a capital letter ($X$) and its realized value with the corresponding small letter ($x$). We use bold letters ($\mathbf{X}$) to denote sets of variables. Given an ordered set of variables $\mathbf{X} = (X_1, \dots, X_n)$, we denote $\mathbf{X}^{(i)} = (X_1, \dots, X_i)$ and $\mathbf{X}^{\geq i} = (X_i, \dots, X_n)$. We use the typical graph-theoretic terminology $PA(\mathbf{C})$, $Ch(\mathbf{C})$, $De(\mathbf{C})$, $An(\mathbf{C})$ to represent the union of $\mathbf{C}$ and, respectively, the parents, children, descendants, and ancestors of $\mathbf{C}$. We use $G_{\overline{\mathbf{C}_1}\underline{\mathbf{C}_2}}$ to denote the graph resulting from deleting all incoming edges to $\mathbf{C}_1$ and all outgoing edges from $\mathbf{C}_2$ in $G$. $(\mathbf{X} \perp\!\!\!\perp \mathbf{Y} \mid \mathbf{Z})_G$ denotes that $\mathbf{X}$ is d-separated from $\mathbf{Y}$ given $\mathbf{Z}$ in $G$. $\mathbb{E}[f(\mathbf{Y}) \mid \mathbf{x}]$ denotes the conditional expectation of $f(\mathbf{Y})$ over $P(\mathbf{Y} \mid \mathbf{x})$. We use $\hat{P}(\mathbf{v})$ to denote the corresponding empirical distribution.

3 Effect Estimation by Weighting Operators

In this section, we start by formally defining a weighting operator as a causal estimand that can be estimated using existing statistical techniques and further used as a building block to construct more complex causal estimands. We then present graphical conditions under which a causal estimand can be expressed as a weighting operator.

3.1 Weighting Operator

Causal effect estimation by the BD adjustment is widely used in practice in part due to the availability of efficient estimators from finite samples. In particular, weighting-based statistical estimators for the BD estimand in Eq. (1) have been developed, including inverse-probability weighting (IPW) and stabilized weighting (SW) (Robins, Hernan, and Brumback 2000). To present weighting techniques, we first define the notion of a weighted distribution as follows:

Definition 1 (Weighted distribution $P^W(\mathbf{v})$). Given a distribution $P(\mathbf{v})$ and a weight function $W(\mathbf{v}) > 0$, a weighted distribution $P^W(\mathbf{v})$ is given by

$$P^W(\mathbf{v}) \triangleq \frac{W(\mathbf{v})\, P(\mathbf{v})}{\sum_{\mathbf{v}'} W(\mathbf{v}')\, P(\mathbf{v}')}. \tag{2}$$

Weighting-based estimators for BD adjustment have been developed based on the following reformulation of the adjustment equation:

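Proposition 1. Let $\mathbf{X}, \mathbf{Y}, \mathbf{Z} \subseteq \mathbf{V}$. If the causal effect $P_{\mathbf{x}}(\mathbf{y})$ is identifiable by the BD adjustment, then $P_{\mathbf{x}}(\mathbf{y}) = P^W(\mathbf{y} \mid \mathbf{x})$ where $W = P(\mathbf{x}) / P(\mathbf{x} \mid \mathbf{z})$, and

$$\mathbb{E}_{P_{\mathbf{x}}(\mathbf{y})}[\mathbf{Y}] = \mathbb{E}_{P^W(\mathbf{y} \mid \mathbf{x})}[\mathbf{Y} \mid \mathbf{X} = \mathbf{x}]. \tag{3}$$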

Remarkably, one can estimate $W = P(\mathbf{x}) / P(\mathbf{x} \mid \mathbf{z})$ as the weight of each individual sample, and treat the reweighted samples as if they were drawn from the causal distribution $P_{\mathbf{x}}(\mathbf{y})$ (Pearl 2000, Ch. 3.6.1). In other words, letting $D_{obs}$ denote samples drawn from $P(\mathbf{x}, \mathbf{y}, \mathbf{z})$, and $D^{W}_{obs} \sim P^W(\mathbf{x}, \mathbf{y}, \mathbf{z})$ represent the reweighted $D_{obs}$, Prop. 1 says that $D^{W}_{obs}$ plays the role of samples drawn from the post-intervention distribution $P_{\mathbf{x}}(\mathbf{y})$. Therefore, the expected causal effects may be estimated by computing the conditional expectation on the reweighted samples. Such weighting-based estimators have also been developed for estimating the g-formula (i.e., g-estimation) (Robins 1986; Robins, Hernan, and Brumback 2000) whenever the SBD condition holds. In this paper, we extend the weighting techniques to situations beyond the BD and the g-formula. Towards this goal, we formally define a weighting operator as follows:

Definition 2 (Weighting operator $\mathcal{B}$). Given a weight function $W(\mathbf{v}) > 0$, a function $h(\mathbf{Y})$, and a set of variables $\mathbf{X} = \mathbf{x}$, the weighting operator $\mathcal{B}[h(\mathbf{Y}) \mid \mathbf{x}; W]$ is defined by

$$\mathcal{B}[h(\mathbf{Y}) \mid \mathbf{x}; W] \triangleq \mathbb{E}_{P^W(\mathbf{y} \mid \mathbf{x})}[h(\mathbf{Y}) \mid \mathbf{x}] = \sum_{\mathbf{y}} h(\mathbf{y})\, P^W(\mathbf{y} \mid \mathbf{x}).$$

Note that $h(\mathbf{Y})$ is an arbitrary function over $\mathbf{Y}$, and $\mathcal{B}$ is a function of $\mathbf{X} = \mathbf{x}$. We describe in Sec. 5 an empirical estimator of the weighting operator $\mathcal{B}$ from finite samples, which extends the existing statistical techniques developed for BD adjustment. Therefore, whenever a causal estimand is expressed as a weighting operator, it lends itself to effective estimators. In particular, in the form of a weighting operator, the BD causal estimand in Prop. 1 is given by $\mathbb{E}_{P_{\mathbf{x}}(\mathbf{y})}[\mathbf{Y}] = \mathcal{B}[\mathbf{Y} \mid \mathbf{x}; W]$, where $W = P(\mathbf{x}) / P(\mathbf{x} \mid \mathbf{z})$.

As alluded to earlier, the BD-like conditions cover just a limited set of identifiable scenarios. In many settings, causal effects are identifiable but not in the form of an adjustment, and no effective estimators have been developed. In the sequel, we go beyond the BD condition and propose new graphical conditions under which a causal estimand can be expressed as a weighting operator. In Sec. 4, we further show that weighting operators can be used as building blocks to construct more complex causal estimands.
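As a small numerical illustration of Def. 2 and Prop. 1, the sketch below evaluates the weighting operator on a randomly generated discrete joint $P(x, y, z)$ and compares it with the plug-in value of Eq. (1). The function name `weighting_operator`, the array layout, and the random joint are assumptions made for this illustration only; the agreement shown is the algebraic identity behind Prop. 1, not a claim about any particular causal graph.

```python
import numpy as np

def weighting_operator(joint, weight, h_values, x):
    """B[h(Y) | x; W] for a discrete joint P(x, y, z) (Def. 2).

    joint, weight: arrays of shape (|X|, |Y|, |Z|); h_values: h(y) per level of Y.
    """
    weighted = weight * joint
    weighted = weighted / weighted.sum()          # P^W(v), Eq. (2)
    p_w_xy = weighted.sum(axis=2)                 # P^W(x, y)
    p_w_y_given_x = p_w_xy[x] / p_w_xy[x].sum()   # P^W(y | x)
    return float(h_values @ p_w_y_given_x)

# Hypothetical joint distribution over binary X, binary Y, and four-valued Z.
rng = np.random.default_rng(1)
joint = rng.random((2, 2, 4)) + 0.1
joint /= joint.sum()

p_x = joint.sum(axis=(1, 2))                      # P(x)
p_xz = joint.sum(axis=1)                          # P(x, z)
p_z = joint.sum(axis=(0, 1))                      # P(z)
p_x_given_z = p_xz / p_z                          # P(x | z), shape (|X|, |Z|)
# BD weight W = P(x) / P(x | z), broadcast to the shape of the joint.
bd_weight = (p_x[:, None] / p_x_given_z)[:, None, :] * np.ones_like(joint)

y_values = np.array([0.0, 1.0])
x = 1
via_operator = weighting_operator(joint, bd_weight, y_values, x)
# Plug-in evaluation of Eq. (1): sum_z E[Y | x, z] P(z).
via_adjustment = float((y_values @ (joint[x] / p_xz[x])) @ p_z)
```

Under these assumptions the two quantities agree up to floating-point error, because $W \cdot P(x, y, z) = P(x) P(z) P(y \mid x, z)$.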

3.2 Multi-outcome Sequential Back-door (mSBD) Criterion and Weighting

One setting of practical interest where the causal estimand can be expressed as a weighting operator is the time-series domain, with a sequence of treatments $X_1, \dots, X_n$ and corresponding covariates $\mathbf{Z}_1, \dots, \mathbf{Z}_n$. We highlight that the BD criterion has been extended to the sequential BD (SBD) criterion in the time-series domain (Pearl and Robins 1995), where the outcome variable $Y$ is assumed to be a singleton. Here, we generalize the SBD criterion to accommodate the situation where $\mathbf{Y}$ is a set of variables, for example, when the outcomes are longitudinal.²

Definition 3 (Multi-outcome sequential back-door (mSBD) criterion). Given the pair of sets $(\mathbf{X}, \mathbf{Y})$, let $\mathbf{X} = \{X_1, X_2, \dots, X_n\}$ be topologically ordered as $X_1 < X_2 < \dots < X_n$. Let $\mathbf{Y}_0 = \mathbf{Y} \setminus De(\mathbf{X})$ and $\mathbf{Y}_i = \mathbf{Y} \cap \left( De(X_i) \setminus De(\mathbf{X}^{\geq i+1}) \right)$ for $i = 1, 2, \dots, n$. Let $ND(\mathbf{X}^{\geq i})$ be the set of nondescendants of $\mathbf{X}^{\geq i}$. Then the sequence of variables $\mathbf{Z} = (\mathbf{Z}_1, \mathbf{Z}_2, \dots, \mathbf{Z}_n)$ is said to be mSBD admissible relative to $(\mathbf{X}, \mathbf{Y})$ if it holds that $\mathbf{Z}_i \subseteq ND(\mathbf{X}^{\geq i})$, and

$$\left( \mathbf{Y}^{\geq i} \perp\!\!\!\perp X_i \mid \mathbf{Y}^{(i-1)}, \mathbf{Z}^{(i)}, \mathbf{X}^{(i-1)} \right)_{G_{\underline{X_i}\,\overline{\mathbf{X}^{\geq i+1}}}}.$$

Roughly speaking, Def. 3 requires that the past observations $\mathbf{X}^{(i-1)}, \mathbf{Y}^{(i-1)}, \mathbf{Z}^{(i)}$ satisfy the BD criterion relative to each $(X_i, \mathbf{Y}^{\geq i})$ pair as covariates. The mSBD criterion reduces to the original SBD (Pearl and Robins 1995) whenever $\mathbf{Y}$ is a singleton. When the mSBD criterion holds in a causal graph, the causal effect is identifiable as follows:

Theorem 1 (mSBD adjustment). If $\mathbf{Z}$ is mSBD admissible relative to $(\mathbf{X}, \mathbf{Y})$, then $P_{\mathbf{x}}(\mathbf{y})$ is identifiable and given by³

$$P_{\mathbf{x}}(\mathbf{y}) = \sum_{\mathbf{z}} \prod_{k=0}^{n} P\left(\mathbf{y}_k \mid \mathbf{x}^{(k)}, \mathbf{z}^{(k)}, \mathbf{y}^{(k-1)}\right) \prod_{j=1}^{n} P\left(\mathbf{z}_j \mid \mathbf{x}^{(j-1)}, \mathbf{z}^{(j-1)}, \mathbf{y}^{(j-1)}\right). \tag{4}$$

² Note that treating $\mathbf{Y}$ in the SBD criterion as a set would NOT yield the mSBD criterion.

³ We note that expressions in the form of Eq. (4) or similar are often called the g-formula (Robins, Greenland, and Hu 1999). The mSBD criterion provides a graphical condition under which the causal effect is identifiable as the g-formula.

For example, the causal graph in Fig. 2a represents a time-series setting with a sequence of treatments $X_1, X_2$, longitudinal outcomes $Y_1, Y_2$, and corresponding covariates $Z_1, Z_2$. The BD criterion is not applicable for identifying $P_{x_1, x_2}(y_1, y_2)$. However, $(Z_1, Z_2)$ satisfies the mSBD criterion relative to $((X_1, X_2), (Y_1, Y_2))$. By Thm. 1, $P_{x_1, x_2}(y_1, y_2)$ is identifiable and the expected causal effect of $\{X_1, X_2\}$ on $Y_2$ is given by

$$\mathbb{E}_{P_{x_1, x_2}(y_2)}[Y_2] = \sum_{z_1, z_2, y_1} \mathbb{E}[Y_2 \mid x_1, x_2, z_1, z_2, y_1]\, P(y_1 \mid x_1, z_1)\, P(z_1)\, P(z_2 \mid x_1, z_1, y_1). \tag{5}$$

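Whenever the mSBD admissible $\mathbf{Z}$ is high-dimensional, evaluating the causal effect is non-trivial in terms of computation and sample efficiency. We address this challenge by leveraging the weighting technique, as shown next.

Theorem 2. If $\mathbf{Z}$ is mSBD admissible relative to $(\mathbf{X}, \mathbf{Y})$, then

$$\mathbb{E}_{P_{\mathbf{x}}(\mathbf{y})}[h(\mathbf{Y})] = \mathcal{B}[h(\mathbf{Y}) \mid \mathbf{x}; W], \tag{6}$$

where

$$W = W_{mSBD}(\mathbf{x}, \mathbf{y}, \mathbf{z}) \triangleq \frac{P(\mathbf{x})}{\prod_{k=1}^{n} P\left(x_k \mid \mathbf{x}^{(k-1)}, \mathbf{y}^{(k-1)}, \mathbf{z}^{(k)}\right)}. \tag{7}$$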

For example, in Fig. 2a, the expected causal effect of $\{X_1, X_2\}$ on $Y_2$ can be written, and evaluated, as

$$\mathbb{E}_{P_{x_1, x_2}(y_2)}[Y_2] = \mathcal{B}[Y_2 \mid \{x_1, x_2\}; W], \tag{8}$$

where $W = \dfrac{P(x_1, x_2)}{P(x_1 \mid z_1)\, P(x_2 \mid x_1, y_1, z_1, z_2)}$.

By Thm. 2, once a set $\mathbf{Z}$ is mSBD admissible, the expected causal effect can be estimated using the empirical weighting operator described in Sec. 5.
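The equality between Eq. (5) and Eq. (8) can be checked numerically on a small discrete example. The sketch below is illustrative only: the axis layout, the helper `marg`, and the randomly generated joint over six binary variables are assumptions, and $Y_2$ is taken to be binary so that its expectation equals $P(y_2 = 1 \mid \cdot)$. The agreement is an algebraic consequence of the chain rule; whether either value equals the causal effect still depends on the mSBD criterion holding in the graph.

```python
import numpy as np

rng = np.random.default_rng(2)
# Axes: 0=Z1, 1=X1, 2=Y1, 3=Z2, 4=X2, 5=Y2 (all binary, hypothetical joint).
P = rng.random((2,) * 6) + 0.1
P /= P.sum()

def marg(keep):
    """Marginal over the axes in `keep`; summed-out axes are kept with size 1."""
    drop = tuple(a for a in range(P.ndim) if a not in keep)
    return P.sum(axis=drop, keepdims=True)

# mSBD weight of Eq. (8): P(x1, x2) / (P(x1 | z1) P(x2 | x1, y1, z1, z2)).
p_x1_given_z1 = marg({0, 1}) / marg({0})
p_x2_given_past = marg({0, 1, 2, 3, 4}) / marg({0, 1, 2, 3})
W = marg({1, 4}) / (p_x1_given_z1 * p_x2_given_past)

# Weighting operator B[Y2 | {x1, x2}; W].
PW = W * P
PW /= PW.sum()                                    # weighted distribution, Eq. (2)
pw_x1x2y2 = PW.sum(axis=(0, 2, 3))                # remaining axes: X1, X2, Y2
x1, x2 = 1, 0
operator_value = pw_x1x2y2[x1, x2, 1] / pw_x1x2y2[x1, x2].sum()

# Direct evaluation of Eq. (5).
p_y1 = marg({0, 1, 2}) / marg({0, 1})             # P(y1 | x1, z1)
p_z1 = marg({0})                                  # P(z1)
p_z2 = marg({0, 1, 2, 3}) / marg({0, 1, 2})       # P(z2 | x1, z1, y1)
e_y2 = (P / marg({0, 1, 2, 3, 4}))[..., 1:]       # E[Y2 | z1, x1, y1, z2, x2]
terms = e_y2 * p_y1 * p_z1 * p_z2                 # shape (2, 2, 2, 2, 2, 1)
eq5_value = terms[:, x1, :, :, x2, 0].sum()       # sum over z1, y1, z2
```

The weighted route never enumerates the values of $(z_1, z_2, y_1)$ once the weights are available per sample, which is what makes the empirical version attractive in high dimensions.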

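3.3 Surrogate Criterion and Weighting

We present another setting where the causal estimand can be expressed as a weighting operator and can therefore be estimated from finite samples using weighting techniques.

Definition 4 (Surrogate criterion). $(\mathbf{R}, \mathbf{Z})$ is said to be surrogate admissible relative to $(\mathbf{X}, \mathbf{Y})$ if (1) $(\mathbf{Y} \perp\!\!\!\perp \mathbf{R} \mid \mathbf{X})_{G_{\overline{\mathbf{X}\mathbf{R}}}}$; (2) $(\mathbf{Y} \perp\!\!\!\perp \mathbf{X} \mid \mathbf{R})_{G_{\underline{\mathbf{X}}\overline{\mathbf{R}}}}$; and (3) $\mathbf{Z}$ is mSBD admissible relative to $(\mathbf{R}, (\mathbf{X}, \mathbf{Y}))$.

Theorem 3. If $(\mathbf{R}, \mathbf{Z})$ is surrogate admissible relative to $(\mathbf{X}, \mathbf{Y})$, then⁴

$$\mathbb{E}_{P_{\mathbf{x}}(\mathbf{y})}[h(\mathbf{Y})] = \mathcal{B}\left[h(\mathbf{Y}) \mid \mathbf{x} \cup \mathbf{r}; W_{mSBD}(\mathbf{r}, \mathbf{x} \cup \mathbf{y}, \mathbf{z})\right].$$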

To demonstrate the application of the surrogate criterion, we consider Example 1 with its corresponding causal graph given in Fig. 1a, where we are interested in estimating the causal effect of the surrogate endpoint $X$ on the clinical endpoint $Y$, with $W_1$ being a set of confounders. It can be derived (e.g., by do-calculus) that the causal effect $P_x(y)$ is identifiable and given by

$$P_x(y) = \frac{\sum_{w_1} P(y, x \mid w_1, w_2)\, P(w_1)}{\sum_{w_1} P(x \mid w_1, w_2)\, P(w_1)}. \tag{9}$$

⁴ Note that the weight function $W_{mSBD}$ is defined in Eq. (7).

Figure 2: Example graphs. Panel (b): Front-door.

At first glance, estimating such a quotient estimand looks daunting since the variance can be arbitrarily large. To the best of our knowledge, no statistical estimator has been established for estimands of the type in Eq. (9). Thm. 3 provides a solution. By Def. 4, $(W_2, W_1)$ is surrogate admissible relative to $(X, Y)$, and by Thm. 3 we have

$$\mathbb{E}_{P_x(y)}[Y] = \mathcal{B}\left[Y \mid \{w_2, x\}; W = \frac{P(w_2)}{P(w_2 \mid w_1)}\right]. \tag{10}$$

The surrogate criterion allows one to express a complex quotient estimand in the form of a weighting operator, which can then be estimated through the method discussed in Sec. 5.
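To make the link between the quotient estimand of Example 1 and its weighting-operator form concrete, the following sketch evaluates both on a randomly generated discrete joint over $(W_1, W_2, X, Y)$. The array layout, the three-valued $W_1$, and the binary $Y$ are assumptions for illustration; the quotient is taken as $\sum_{w_1} P(x, y \mid w_1, w_2) P(w_1) / \sum_{w_1} P(x \mid w_1, w_2) P(w_1)$, and the weight as $W = P(w_2) / P(w_2 \mid w_1)$.

```python
import numpy as np

rng = np.random.default_rng(3)
# Axes: 0=W1 (3 levels), 1=W2, 2=X, 3=Y (binary); hypothetical joint.
P = rng.random((3, 2, 2, 2)) + 0.1
P /= P.sum()

p_w1 = P.sum(axis=(1, 2, 3))                      # P(w1)
p_w2 = P.sum(axis=(0, 2, 3))                      # P(w2)
p_w1w2 = P.sum(axis=(2, 3))                       # P(w1, w2)
p_xy_given_w = P / p_w1w2[:, :, None, None]       # P(x, y | w1, w2)

w2, x = 1, 0
# Quotient estimand, evaluated for Y = 1 (so it equals E[Y] for binary Y).
numer = (p_xy_given_w[:, w2, x, 1] * p_w1).sum()
denom = (p_xy_given_w[:, w2, x, :].sum(axis=1) * p_w1).sum()
quotient_value = numer / denom

# Weighting operator B[Y | {w2, x}; W] with W = P(w2) / P(w2 | w1).
p_w2_given_w1 = p_w1w2 / p_w1[:, None]
W = (p_w2[None, :] / p_w2_given_w1)[:, :, None, None]
PW = W * P
PW /= PW.sum()                                    # weighted distribution, Eq. (2)
pw = PW.sum(axis=0)                               # P^W(w2, x, y)
operator_value = pw[w2, x, 1] / pw[w2, x].sum()
```

In this toy setting the two values coincide, since the reweighted joint factorizes as $P(w_2) P(w_1) P(x, y \mid w_1, w_2)$; the ratio of sums is absorbed into a single conditional expectation under $P^W$.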

4 Causal Effects Estimation by the Composition of Weighting Operators

So far, we have defined a weighting operator as a causal estimand that can be estimated using existing statistical techniques and presented graphical conditions (the mSBD and Surrogate criteria) under which a causal estimand can be expressed as a weighting operator. In this section, we introduce novel machinery for causal effect estimation by formulating the front-door estimand as a composition of BD weighting operators. We then extend this idea to develop graphical conditions under which causal effects can be estimated by the composition of weighting operators.

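4.1 Estimation of Front-door as a Composition of BD Weighting Operators

A well-known setting where causal effects are identifiable is characterized by what is known as the front-door criterion (Pearl 1995), which states that if $\mathbf{Z}$ satisfies the front-door criterion relative to $(\mathbf{X}, \mathbf{Y})$, then the causal effect of $\mathbf{X}$ on $\mathbf{Y}$ is identifiable and is given by the formula

$$P_{\mathbf{x}}(\mathbf{y}) = \sum_{\mathbf{z}} P(\mathbf{z} \mid \mathbf{x}) \sum_{\mathbf{x}'} P(\mathbf{y} \mid \mathbf{x}', \mathbf{z})\, P(\mathbf{x}'). \tag{11}$$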

As an example, consider the causal graph in Fig. 2b, where $X$ represents the level of body mass index (BMI), $\mathbf{Z}$ the level of multiple, possibly high-dimensional, metabolites, and $Y$ the occurrence of breast cancer (Derkach et al. 2019). The goal is to assess the effect of the level of BMI ($X$) on breast cancer ($Y$) in the presence of $\mathbf{Z}$, often called causal mediators. We have that $\mathbf{Z}$ satisfies the front-door criterion relative to $(X, Y)$, and the expected causal effect is given by

$$\mathbb{E}_{P_x(y)}[Y] = \sum_{\mathbf{z}} P(\mathbf{z} \mid x) \sum_{x'} \mathbb{E}[Y \mid x', \mathbf{z}]\, P(x'). \tag{12}$$

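Computing Eq. (12) is non-trivial in terms of computation and sample efficiency when $\mathbf{Z}$ is high-dimensional. In this paper, we propose a novel method for estimating the front-door estimand. We note something simple albeit powerful: the front-door can be seen as a composition of BD adjustments. To see this, note that:

$$P_{\mathbf{x}}(\mathbf{y}) = \sum_{\mathbf{z}} \underbrace{P_{\mathbf{x}}(\mathbf{z})}_{BD = \emptyset}\; \underbrace{P_{\mathbf{z}}(\mathbf{y})}_{BD = \{\mathbf{X}\}}, \tag{13}$$

$$\mathbb{E}_{P_{\mathbf{x}}(\mathbf{y})}[\mathbf{Y}] = \mathbb{E}_{P_{\mathbf{x}}(\mathbf{z})}\left[\mathbb{E}_{P_{\mathbf{z}}(\mathbf{y})}[\mathbf{Y}]\right], \tag{14}$$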

where BD represents a BD admissible set, that is, both effects in Eq. (13) can be identified by BD adjustments. In practice, $\mathbb{E}_{P_{\mathbf{x}}(\mathbf{y})}[\mathbf{Y}]$ can be estimated by first estimating $\mathbb{E}_{P_{\mathbf{z}}(\mathbf{y})}[\mathbf{Y}]$, and then estimating the expectation of the resultant quantity over $P_{\mathbf{x}}(\mathbf{z})$, both times using the BD weighting operator. Therefore, we can compute Eq. (12) as a composition of BD weighting operators. Using this example, we formally define a composition of weighting operators as follows:

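Definition 5 (Composition of weighting operators). Given two weighting operators $\mathcal{B}_1(\mathbf{x}) \triangleq \mathcal{B}[h_z(\mathbf{Z}) \mid \mathbf{x}; W_1]$ and $\mathcal{B}_2(\mathbf{z}) \triangleq \mathcal{B}[h_y(\mathbf{Y}) \mid \mathbf{z}; W_2]$, the composition of $\mathcal{B}_1$ and $\mathcal{B}_2$ is defined by

$$(\mathcal{B}_1 \circ \mathcal{B}_2)(\mathbf{x}) \triangleq \mathcal{B}[\mathcal{B}_2(\mathbf{z}) \mid \mathbf{x}; W_1]. \tag{15}$$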

The front-door estimand (Eq. (12)) can be computed in terms of the composition operation as follows.

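Proposition 2. If $\mathbf{Z}$ satisfies the front-door criterion relative to $(\mathbf{X}, \mathbf{Y})$, then

$$\mathbb{E}_{P_{\mathbf{x}}(\mathbf{y})}[\mathbf{Y}] = (\mathcal{B}_1 \circ \mathcal{B}_2)(\mathbf{x}), \tag{16}$$

where $\mathcal{B}_1(\mathbf{x}) = \mathcal{B}[h(\mathbf{Z}) \mid \mathbf{x}; W_1]$, $\mathcal{B}_2(\mathbf{z}) = \mathcal{B}[\mathbf{Y} \mid \mathbf{z}; W_2]$, $W_1 = 1$, and $W_2 = P(\mathbf{z}) / P(\mathbf{z} \mid \mathbf{x})$.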
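To see why these weights recover the front-door expression of Eq. (12), the following short derivation may help. It is a sketch that assumes positivity ($P(z \mid x) > 0$), uses only the chain rule $P(x, y, z) = P(x) P(z \mid x) P(y \mid x, z)$, and takes $h(\mathbf{Z}) = \mathcal{B}_2(\mathbf{z})$ as in Def. 5; the primed symbol $x'$ is just a summation index.

```latex
\begin{align*}
P^{W_2}(x, y, z) &\propto \frac{P(z)}{P(z \mid x)}\, P(x)\, P(z \mid x)\, P(y \mid x, z)
                  = P(z)\, P(x)\, P(y \mid x, z), \\
P^{W_2}(y \mid z) &= \sum_{x'} P(y \mid x', z)\, P(x'), \\
\mathcal{B}_2(z) &= \sum_{x'} \mathbb{E}[Y \mid x', z]\, P(x'), \\
(\mathcal{B}_1 \circ \mathcal{B}_2)(x) &= \mathbb{E}[\mathcal{B}_2(Z) \mid x]
                  = \sum_{z} P(z \mid x) \sum_{x'} \mathbb{E}[Y \mid x', z]\, P(x').
\end{align*}
```

The first line shows that the normalizing constant of the weighted distribution is one and that $P^{W_2}(z) = P(z)$, which is what makes the second line a plain ratio; the last line uses $W_1 = 1$, so the outer operator is an ordinary conditional expectation given $x$.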

More generally, we propose using the composition of weighting operators as a novel machinery to construct and estimate complex causal estimands. The corresponding empirical estimator of the composition of B operators will be discussed in Sec. 5.

4.2 Causal Effect Estimation by Composition of Weighting Operators

In this section, we study the conditions under which causal effects may be identiﬁed by a composition of weighting operators, in which the front-door is just a special case.

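Definition 6 (Decomposability criterion). A set of variables $\mathbf{Z}$ satisfies the decomposability criterion relative to $(\mathbf{X}, \mathbf{Y})$ if (1) $(\mathbf{Y} \perp\!\!\!\perp \mathbf{X} \mid \mathbf{Z})_{G_{\overline{\mathbf{X}\mathbf{Z}}}}$; and (2) $(\mathbf{Y} \perp\!\!\!\perp \mathbf{Z} \mid \mathbf{X})_{G_{\overline{\mathbf{X}}\underline{\mathbf{Z}}}}$.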

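Theorem 4. If $\mathbf{Z}$ satisfies the decomposability criterion, then

$$P_{\mathbf{x}}(\mathbf{y}) = \sum_{\mathbf{z}} P_{\mathbf{x}}(\mathbf{z})\, P_{\mathbf{z}}(\mathbf{y}), \quad \text{and}$$

$$\mathbb{E}_{P_{\mathbf{x}}(\mathbf{y})}[h(\mathbf{Y})] = \mathbb{E}_{P_{\mathbf{x}}(\mathbf{z})}\left[\mathbb{E}_{P_{\mathbf{z}}(\mathbf{y})}[h(\mathbf{Y})]\right]. \tag{17}$$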

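The importance of this theorem lies in that if both causal effects $P_{\mathbf{x}}(\mathbf{z})$ and $P_{\mathbf{z}}(\mathbf{y})$ can be computed using weighting operators, then $P_{\mathbf{x}}(\mathbf{y})$ can be computed by the composition of weighting operators. In particular, we present a criterion that delineates under what conditions a causal effect can be pieced together through the composition of mSBD weighting operators.

Definition 7 (mSBD composition criterion). Sets of variables $(\mathbf{Z}, \mathbf{W}_1, \mathbf{W}_2)$ are said to satisfy the mSBD composition criterion relative to $(\mathbf{X}, \mathbf{Y})$ if: (1) $\mathbf{Z}$ satisfies the decomposability criterion relative to $(\mathbf{X}, \mathbf{Y})$; and (2) $\mathbf{W}_1$ is mSBD admissible relative to $(\mathbf{X}, \mathbf{Z})$, and $\mathbf{W}_2$ is mSBD admissible relative to $(\mathbf{Z}, \mathbf{Y})$.

Theorem 5 (mSBD composition). If $(\mathbf{Z}, \mathbf{W}_1, \mathbf{W}_2)$ satisfy the mSBD composition criterion relative to $(\mathbf{X}, \mathbf{Y})$, then:

$$\mathbb{E}_{P_{\mathbf{x}}(\mathbf{y})}[\mathbf{Y}] = (\mathcal{B}_1 \circ \mathcal{B}_2)(\mathbf{x}), \tag{18}$$

where $\mathcal{B}_1(\mathbf{x}) \triangleq \mathcal{B}[h(\mathbf{Z}) \mid \mathbf{x}; W_{mSBD}(\mathbf{x}, \mathbf{z}, \mathbf{w}_1)]$ and $\mathcal{B}_2(\mathbf{z}) \triangleq \mathcal{B}[\mathbf{Y} \mid \mathbf{z}; W_{mSBD}(\mathbf{z}, \mathbf{y}, \mathbf{w}_2)]$.

To demonstrate the application of the mSBD composition criterion, consider the causal mediator scenario (Example 2) with its corresponding causal graph given in Fig. 1b. The set $\mathbf{Z} = (Z_1, Z_2, Z_3, Z_4)$ satisfies the decomposability condition relative to $(X, Y)$, and $(\mathbf{Z}, \emptyset, \{X\})$ satisfy the mSBD composition criterion relative to $(X, Y)$. Therefore, the causal effect $P_x(y)$ can be expressed as $P_x(y) = \sum_{\mathbf{z}} P_x(\mathbf{z})\, P_{\mathbf{z}}(y)$. We have that $\emptyset$ satisfies the SBD conditions relative to $(X, (Z_1, Z_2, Z_3, Z_4))$, which yields

$$P_x(\mathbf{z}) = P(z_1, z_2, z_3)\, P(z_4 \mid z_1, z_2, z_3, x), \tag{19}$$

$$\mathbb{E}_{P_x(\mathbf{z})}[h_z(\mathbf{Z})] = \mathcal{B}[h_z(\mathbf{Z}) \mid x; W_1] \triangleq \mathcal{B}_1(x), \tag{20}$$

where $W_1 = P(x) / P(x \mid z_1, z_2, z_3)$. Further note that $\{X\}$ (i.e., $(\emptyset, \emptyset, \emptyset, X)$) is SBD admissible relative to $((Z_1, Z_2, Z_3, Z_4), Y)$, which yields

$$\mathbb{E}_{P_{\mathbf{z}}(y)}[Y] = \mathcal{B}[Y \mid \mathbf{z}; W_y] \triangleq \mathcal{B}_2(\mathbf{z}), \tag{21}$$

$$W_y = \frac{P(z_1, z_2, z_3, z_4)}{P(z_1, z_2, z_3)\, P(z_4 \mid z_1, z_2, z_3, x)} = \frac{P(z_4 \mid z_1, z_2, z_3)}{P(z_4 \mid z_1, z_2, z_3, x)}.$$

Finally, we obtain that the expected causal effect $\mathbb{E}_{P_x(y)}[Y] = \mathbb{E}_{P_x(\mathbf{z})}\left[\mathbb{E}_{P_{\mathbf{z}}(y)}[Y]\right]$ is given by $(\mathcal{B}_1 \circ \mathcal{B}_2)(x)$.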

5 Weighting-based Empirical Estimators

We have introduced the weighting operator as a building-block estimand and the composition of weighting operators as a new tool for estimating causal effects. In this section, we present how to estimate the weighting operator and its composition empirically from finite samples. In other words, instead of having access to the true distribution $P(\mathbf{v})$, we only have an i.i.d. data set $D_{obs} = \{\mathbf{V}_{(i)}\}_{i=1}^{N}$ drawn from $P(\mathbf{v})$.

Figure 3: Experimental results for front-door (Fig. 2b) in which $\mathbf{Z} = (Z_1, \dots, Z_D)$ consists of $D$ binary variables $Z_i$: (a) MAAE plots against the sample size $N$ (0 to 10000) with varying $D \in \{6, 7, 8, 9\}$ (front-door, low-dimensional) and (b) $D \in \{10, 12, 15, 20\}$ (front-door, high-dimensional); (c) the number of samples required to reach a predefined estimation error bound, $D$ vs. $N_D$, for $D$ from 2 to 20. Plots are best viewed in color.

5.1 Empirical Weighting Operators

We extend the weighting-based statistical estimation procedures developed for the BD adjustment to the weighting operator defined in Def. 2. One of the widely used methods for estimating the conditional expectation on the weighted samples is the following weighted regression (also known as weighted least squares) estimator (Robins, Hernan, and Brumback 2000):

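Definition 8 (Empirical weighting operator $\hat{\mathcal{B}}$). Given $D_{obs} = \{\mathbf{V}_{(i)}\}_{i=1}^{N} \sim P(\mathbf{v})$, the empirical weighting operator $\hat{\mathcal{B}}[h(\mathbf{Y}) \mid \mathbf{x}; W](\mathbf{x}) \triangleq \hat{g}(\mathbf{x})$ is estimated by the weighted regression as follows:

$$\hat{g} = \arg\min_{g \in R} \sum_{i=1}^{N} \hat{W}(\mathbf{V}_{(i)}) \left( h(\mathbf{Y}_{(i)}) - g(\mathbf{X}_{(i)}) \right)^2, \tag{22}$$

where $\hat{W}(\mathbf{v})$ is the empirically estimated $W(\mathbf{v})$, and $R$ is a class of regression functions (e.g., linear regressions).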
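A minimal sketch of the weighted regression in Eq. (22) is given below, assuming that $R$ is the class of linear functions with an intercept and that the estimated weights are already available as an array. The function name `empirical_weighting_operator` and its arguments are hypothetical; the sketch solves the weighted least squares problem by rescaling rows with the square root of the weights.

```python
import numpy as np

def empirical_weighting_operator(x, h_y, w_hat):
    """Weighted least squares fit of h(Y) on X (Eq. (22)), linear class with intercept.

    x: (N,) or (N, d) regressors; h_y: (N,) targets h(Y_(i)); w_hat: (N,) positive weights.
    Returns a function g_hat mapping new x values to fitted values.
    """
    h_y = np.asarray(h_y, dtype=float)
    x = np.asarray(x, dtype=float).reshape(len(h_y), -1)
    design = np.column_stack([np.ones(len(h_y)), x])
    sqrt_w = np.sqrt(np.asarray(w_hat, dtype=float))
    # Minimising sum_i w_i (h_i - design_i @ beta)^2 is ordinary least squares
    # on rows scaled by sqrt(w_i).
    beta, *_ = np.linalg.lstsq(design * sqrt_w[:, None], h_y * sqrt_w, rcond=None)

    def g_hat(x_new):
        x_new = np.asarray(x_new, dtype=float).reshape(-1, x.shape[1])
        return np.column_stack([np.ones(len(x_new)), x_new]) @ beta

    return g_hat

# Toy usage: with a binary regressor the linear class is saturated, so the fit
# returns the weighted mean of h(Y) within each group.
g_hat = empirical_weighting_operator(
    x=[0, 0, 1, 1], h_y=[1.0, 3.0, 2.0, 6.0], w_hat=[1.0, 3.0, 1.0, 1.0]
)
fitted = g_hat([0, 1])   # weighted means: (1*1 + 3*3)/4 = 2.5 and (2 + 6)/2 = 4.0
```

A richer regression class can be substituted for the linear one without changing the structure: only the inner least squares solve would be replaced.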

For example, for the BD adjustment, we have $\hat{W}(\mathbf{V}_{(i)}) = \hat{P}(\mathbf{x}_{(i)}) / \hat{P}(\mathbf{x}_{(i)} \mid \mathbf{z}_{(i)})$. When estimating the weight $W$ from data, in practice, some parametric model is assumed for $P(\mathbf{x} \mid \mathbf{z})$, and the parameters of the model are learned from data. When $\mathbf{X} = (X_1, \dots, X_n)$, one can first use the chain rule of probability and then model each individual component of $P(\mathbf{x} \mid \mathbf{z}) = \prod_{d=1}^{n} P(x_d \mid \mathbf{z}, \mathbf{x}^{(d-1)})$. For example, when $X$ is a singleton binary variable, $P(X = 1 \mid \mathbf{z})$ is typically assumed to be a logistic regression function of the form $\left(1 + \exp(\alpha_0 + \alpha_{z_1} z_1 + \dots + \alpha_{z_k} z_k)\right)^{-1}$, and the parameters $\alpha$ are learned from data. The trained model is then used to estimate the probability. More expressive function classes than logistic regression can be applied to estimate the weights more accurately (Lee, Lessler, and Stuart 2010; Gruber et al. 2015), which may be appealing depending on the particular setting.

Equipped with the estimated weight, one can then estimate the weighting operator by the weighted regression. One can go beyond the standard linear regression class and employ flexible regression functions (Hill 2011; Wen, Hassanpour, and Greiner 2018). We note that $\hat{\mathcal{B}}$ provides a consistent estimator of $\mathcal{B}$ if the models for $W$ and $R$ are correctly specified, following the same argument as in (Robins, Hernan, and Brumback 2000).

Another commonly used method in back-door settings is the Horvitz-Thompson (H-T) estimator (Horvitz and Thompson 1952), an IPW estimator. We use the weighted regression estimator as the empirical estimator for weighting operators because it has been shown that the H-T estimator may have a higher variance than the weighted regression estimator (Robins, Hernan, and Brumback 2000).
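The weight estimation step described above can be sketched as follows for a binary treatment. This is an illustrative implementation under stated assumptions: the logistic model is fitted by a plain Newton-Raphson iteration written in NumPy (a library routine would normally be used instead), the functions `fit_logistic` and `bd_weights` are hypothetical names, and the sign convention is $P(X = 1 \mid z) = (1 + \exp(-(\alpha_0 + \alpha \cdot z)))^{-1}$, which differs from the form in the text only by the sign of $\alpha$.

```python
import numpy as np

def fit_logistic(z, x, n_iter=25):
    """Newton-Raphson fit of P(X = 1 | z) = sigmoid(a0 + a . z); returns coefficients."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float).reshape(len(x), -1)
    design = np.column_stack([np.ones(len(x)), z])
    alpha = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ alpha))
        grad = design.T @ (x - p)                               # score vector
        hess = (design * (p * (1.0 - p))[:, None]).T @ design   # observed information
        # Tiny ridge term only stabilises the linear solve; the fixed point is unchanged.
        alpha += np.linalg.solve(hess + 1e-8 * np.eye(len(alpha)), grad)
    return alpha

def bd_weights(z, x):
    """Estimated BD weights P(x_i) / P(x_i | z_i) per sample, for binary X."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float).reshape(len(x), -1)
    alpha = fit_logistic(z, x)
    p1 = 1.0 / (1.0 + np.exp(-(np.column_stack([np.ones(len(x)), z]) @ alpha)))
    p_x_given_z = np.where(x == 1, p1, 1.0 - p1)
    p_x = np.where(x == 1, x.mean(), 1.0 - x.mean())
    return p_x / p_x_given_z

# Toy data (hypothetical): one binary covariate that shifts the treatment probability.
rng = np.random.default_rng(0)
z = rng.binomial(1, 0.4, 20_000)
x = rng.binomial(1, 0.2 + 0.5 * z)
w_hat = bd_weights(z, x)
```

With a single binary covariate the logistic model is saturated, so the fitted propensities match the empirical frequencies and the weights average to one over the sample.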

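5.2 Estimating Composition of Weighting Operators

Given the empirical weighting operator defined in Def. 8, we simply define the empirical composition of weighting operators as a chain of regressions. Given $\mathcal{B}_1(\mathbf{x}) \triangleq \mathcal{B}[h_z(\mathbf{Z}) \mid \mathbf{x}; W_1]$ and $\mathcal{B}_2(\mathbf{z}) \triangleq \mathcal{B}[h_y(\mathbf{Y}) \mid \mathbf{z}; W_2]$, we define $\widehat{(\mathcal{B}_1 \circ \mathcal{B}_2)}(\mathbf{x}) \triangleq (\hat{\mathcal{B}}_1 \circ \hat{\mathcal{B}}_2)(\mathbf{x})$, which is implemented as $\hat{\mathcal{B}}[\hat{\mathcal{B}}_2(\mathbf{z}) \mid \mathbf{x}; W_1]$, the weighted regression of the function $\hat{\mathcal{B}}_2(\mathbf{z})$ onto $\mathbf{X}$ with weight $W_1$. Formally,

Definition 9 (Empirical composition of $\mathcal{B}$). Let $\mathcal{B}_1(\mathbf{x}) \triangleq \mathcal{B}[h_z(\mathbf{Z}) \mid \mathbf{x}; W_1]$ and $\mathcal{B}_2(\mathbf{z}) \triangleq \mathcal{B}[h_y(\mathbf{Y}) \mid \mathbf{z}; W_2]$. The empirical composition $\widehat{(\mathcal{B}_1 \circ \mathcal{B}_2)}(\mathbf{x})$ is defined by

$$\widehat{(\mathcal{B}_1 \circ \mathcal{B}_2)}(\mathbf{x}) \triangleq (\hat{\mathcal{B}}_1 \circ \hat{\mathcal{B}}_2)(\mathbf{x}) \triangleq \hat{\mathcal{B}}\left[\hat{\mathcal{B}}_2(\mathbf{z}) \mid \mathbf{x}; W_1\right]. \tag{23}$$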
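The chain of regressions can be illustrated on simulated front-door data. The sketch below is a toy example under several assumptions that are not part of the paper's experiments: a hypothetical SCM with a binary unobserved confounder `u`, a single binary mediator `z`, weights estimated by empirical frequencies rather than logistic regression, and a saturated regression class (for a binary regressor, weighted least squares with an intercept reduces to weighted group means). For this SCM the true effect is $\mathbb{E}[Y \mid do(x)] = 0.3 + 0.2x$.

```python
import numpy as np

def weighted_group_mean(target, group, weight):
    """Weighted regression of `target` on a binary regressor `group`.

    With an intercept and a binary regressor the linear class is saturated, so
    the weighted least squares fit equals the weighted mean within each group.
    """
    return np.array(
        [np.average(target[group == g], weights=weight[group == g]) for g in (0, 1)]
    )

# Hypothetical front-door SCM: U -> X, U -> Y, X -> Z -> Y.
rng = np.random.default_rng(0)
n = 200_000
u = rng.binomial(1, 0.5, n)                        # unobserved confounder
x = rng.binomial(1, 0.2 + 0.6 * u)
z = rng.binomial(1, 0.3 + 0.4 * x)
y = 0.5 * z + 0.3 * u + rng.normal(0.0, 0.1, n)

# Estimated W2 = P(z) / P(z | x) per sample, using empirical frequencies.
p_z1 = z.mean()
p_z1_given_x = np.array([z[x == 0].mean(), z[x == 1].mean()])
p_z = np.where(z == 1, p_z1, 1.0 - p_z1)
p_z_given_x = np.where(z == 1, p_z1_given_x[x], 1.0 - p_z1_given_x[x])
w2 = p_z / p_z_given_x

# Inner operator: B2_hat(z) for z in {0, 1}, weighted regression of Y on Z.
b2_hat = weighted_group_mean(y, z, w2)
# Outer operator: regress B2_hat(Z_(i)) on X with W1 = 1.
b1_of_b2 = weighted_group_mean(b2_hat[z], x, np.ones(n))
```

Here `b1_of_b2[x]` estimates $\mathbb{E}_{P_x(y)}[Y]$ for $x \in \{0, 1\}$ and should lie close to 0.3 and 0.5 for this sample size; note that `u` is used only to generate the data and never enters the estimator.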

One question that naturally arises is about the consistency of the empirical composition of weighting operators, which is addressed by the following theorem.

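Theorem 6 (Consistency of the composition). Let $\hat{\mathcal{B}}_1(\mathbf{x})$ and $\hat{\mathcal{B}}_2(\mathbf{z})$ be consistent estimators of $\mathcal{B}_1(\mathbf{x})$ and $\mathcal{B}_2(\mathbf{z})$. Let the function class $R_1$ of $\hat{\mathcal{B}}_1$ be a compact space. Then, $(\hat{\mathcal{B}}_1 \circ \hat{\mathcal{B}}_2)(\mathbf{x})$ is a consistent estimator of $(\mathcal{B}_1 \circ \mathcal{B}_2)(\mathbf{x})$.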

6 Simulation Studies

6.1 Simulation Setup

Given a causal graph, we specify an SCM $M$ from which a dataset $D_{obs}$ is generated. To compute the target $\mu(x) \triangleq \mathbb{E}_{P_x(y)}[Y]$, we generate $N_{int} = 10^7$ samples $D_{int}$ from $M_x$, the model induced by $do(X = x)$. We estimate $\mu(x)$ by computing the mean of $Y$ in $D_{int}$, which is treated as the ground truth. Because there exists no general method in the literature for estimating arbitrary identifiable causal effects that are not in the form of BD-like adjustment, we compare the proposed estimators with a naive procedure, as discussed next:

Figure 4: MAAE plots against the sample size $N$ (0 to 10000) for (a) mSBD (Fig. 2a), with $D \in \{3, 4, 5, 6, 7\}$; (b) Surrogate endpoints (Fig. 1a), with $D \in \{4, 5, 10, 15, 20\}$; and (c) Causal mediators (Fig. 1b), with $D \in \{4, 5, 10, 15, 20\}$. Plots are best viewed in color.

Naive procedure. As an example, assume we want to evaluate the expression in Eq. (5). We compute each conditional probability such as $P(z_2 \mid x_1, z_1, y_1)$ as $N_{z_2, x_1, z_1, y_1} / N_{x_1, z_1, y_1}$, where $N_{\mathbf{w}}$ is the number of examples in which $\mathbf{W} = \mathbf{w}$. If $N_{x_1, z_1, y_1} = 0$, then $P(z_2 \mid x_1, z_1, y_1)$ is set to zero. $\mathbb{E}[Y \mid x_1, x_2, z_1, z_2, y_1]$ is computed as the mean of $Y$ in examples with values $(x_1, x_2, z_1, z_2, y_1)$, and is set to zero if no example has these values. The expected causal effect is computed by summing over all the possible values of $Z_1, Y_1, Z_2$.

Proposed procedure (named CWO - Composition of Weighting Operators). We use the empirical estimators described in Sec. 5. The conditional probabilities in the weights are estimated by the logistic regression model (binary variables are used in the simulation studies).

Accuracy measure. Given a data set $D_{obs}$ with $N$ examples, let $\hat{\mu}_{cwo}(x)$ and $\hat{\mu}_{nai}(x)$ be the estimates of $\mathbb{E}_{P_x(y)}[Y]$ obtained with the CWO and naive procedures, respectively. We compute the average absolute error (AAE) as $|\mu(x) - \hat{\mu}_{cwo}(x)|$ averaged over $x$ and $|\mu(x) - \hat{\mu}_{nai}(x)|$ averaged over $x$, respectively. For each sample size $N$, we generate 100 data sets. We call the median of the 100 AAEs the median average absolute error, or MAAE. A plot of MAAE vs. the sample size $N$ is called an MAAE plot.
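The accuracy measure can be written down in a few lines. The sketch below is a direct transcription of the AAE and MAAE definitions; the function names and the toy numbers are illustrative assumptions and not values from the experiments.

```python
import numpy as np

def average_absolute_error(mu_true, mu_hat):
    """AAE: |mu(x) - mu_hat(x)| averaged over the values of x."""
    return float(np.mean(np.abs(np.asarray(mu_true) - np.asarray(mu_hat))))

def maae(mu_true, estimates_per_dataset):
    """Median of the AAEs across repeated data sets of the same size N."""
    return float(
        np.median([average_absolute_error(mu_true, est) for est in estimates_per_dataset])
    )

# Toy usage with x in {0, 1} and three simulated data sets (hypothetical numbers).
mu_true = [0.3, 0.5]
estimates = [[0.31, 0.49], [0.35, 0.50], [0.30, 0.50]]
score = maae(mu_true, estimates)   # AAEs are 0.01, 0.025, 0.0, so the median is 0.01
```

Using the median rather than the mean across data sets makes the summary less sensitive to a few runs with unusually poor estimates.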

6.2 Simulation Results

We test the proposed CWO against the naive approach in several scenarios (we only compare with the naive method due to the nonexistence of other general-purpose methods applicable in these cases). The detailed descriptions of the corresponding SCMs are provided in Appendix E.

Front-door (Fig. 2b). We first test on the front-door graph for estimating $\mathbb{E}_{P_x(y)}[Y]$ in Eq. (12). We set $X$ to be binary, $Y$ continuous within $[0, 1]$, and $\mathbf{Z} = (Z_1, \dots, Z_D)$ with all $Z_i$ binary. Fig. 3a shows the MAAE of CWO vs. naive for $D \in \{6, 7, 8, 9\}$, and Fig. 3b the plots for $D \in \{10, 12, 15, 20\}$. We observe that the naive approach works well when $\mathbf{Z}$ is low-dimensional (up to $D = 8$) and many examples are given. CWO may have bias due to the use of logistic regression models. When $\mathbf{Z}$ is high-dimensional, CWO significantly outperforms the naive approach. To get a better understanding of the sample efficiency, for each given $D$, we gradually increase the sample size $N = 500, 1000, 1500, \dots$, find the corresponding MAAE, and stop to record the sample size $N_D$ when the MAAE is within a predetermined threshold. The threshold was set to 0.025 in these experiments. Roughly, $N_D$ represents how many samples are needed for the estimator to reach a predetermined accuracy. Fig. 3c shows the curves of $D$ vs. $N_D$. We note that the number of samples needed to reach a predetermined accuracy increases very rapidly (exponentially in $D$) for the naive approach, while CWO scales very well.

mSBD (Fig. 2a). We test on estimating $\mathbb{E}_{P_{x_1, x_2}(y_2)}[Y_2]$ given in Eq. (5). We set $X_1, X_2, Y_1$ to be binary, $Y_2$ continuous within $[0, 1]$, and $\mathbf{Z}_i = (Z_{i1}, \dots, Z_{iD})$ for $i = 1, 2$, where all $Z_{ij}$ are binary. Fig. 4a presents the MAAE plots for $D \in \{3, 4, 5, 6, 7\}$. We note that CWO provides more robust estimates and significantly outperforms the naive procedure in high-dimensional settings.

Surrogate endpoints (Fig. 1a). We test on estimating $\mathbb{E}_{P_x(y)}[Y]$ (where the causal effect $P_x(y)$ is given in Eq. (9)). The MAAE plots for $D \in \{4, 5, 10, 15, 20\}$ are given in Fig. 4b. We observe that the CWO method significantly outperforms the naive approach.

Causal mediators (Fig. 1b). We test on estimating $\mathbb{E}_{P_x(y)}[Y]$. Fig. 4c presents the MAAE plots for $D \in \{4, 5, 10, 15, 20\}$. Again, we note that CWO significantly outperforms the naive procedure in high-dimensional settings.

These experimental results show that CWO significantly outperforms its naive counterpart. In Appendix B, we provide a discussion of why CWO outperforms the naive procedure. To better understand to what extent the performance gains over the naive procedure should be attributed to the use of parametric assumptions, we also performed simulations comparing CWO against the parametric plug-in estimator, given in Appendix C. Finally, we performed simulations comparing CWO with the H-T estimator, given in Appendix G.
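The search for $N_D$ described in the front-door experiment can be sketched as a simple loop. The function `samples_needed`, the `max_n` cap, and the toy error curve are assumptions introduced for illustration; in an actual experiment `maae_at` would run the 100 simulated data sets at sample size `n` and return their MAAE.

```python
def samples_needed(maae_at, threshold=0.025, step=500, max_n=100_000):
    """Smallest N in {step, 2*step, ...} whose MAAE is within `threshold`, else None.

    maae_at: callable mapping a sample size N to the MAAE observed at that size.
    max_n is a hypothetical cap that keeps the search finite.
    """
    n = step
    while n <= max_n:
        if maae_at(n) <= threshold:
            return n
        n += step
    return None

# Toy usage: an idealised error curve decaying as 1 / sqrt(N).
n_d = samples_needed(lambda n: n ** -0.5)
```

For the idealised curve above the error first drops below 0.025 between $N = 1500$ and $N = 2000$, so the search stops at the first grid point past that crossing.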

7 Conclusions

The problem of determining whether a causal effect is identifiable from observational data given a causal graph is well-understood, while there is virtually no work on how, in general, one can efficiently estimate an identifiable causal effect from finite samples beyond BD-like settings. This paper takes the first step toward filling the gap between identification and estimation by developing novel machinery for estimating causal effects through weighting operators and their composition. We introduced graphical criteria for determining when the new estimation methods are applicable. These results offer data scientists new tools for estimating effects to which the usual methods (including propensity score, IPW, and BART) are not applicable because the causal estimand is not BD-like.

This work opens new research directions. On the one hand, many techniques have been developed for BD estimation, both for and besides weighted regression; can those techniques be applied to the composition of weighting operators? How should model misspecification, which is well-studied through doubly robust methods in the BD case, be addressed in this more general setting? On the other hand, can weighting operators be further composed to identify effects beyond the decomposability criterion? Also, can the weighting operators be combined in alternative ways to identify new effects?

Acknowledgements

We thank Sanghack Lee, Daniel Kumor, and the reviewers for all the feedback provided. Elias Bareinboim and Yonghan Jung were supported in part by grants from NSF IIS-1704352, IIS-1750807 (CAREER), IBM Research, and Adobe Research. Jin Tian was partially supported by NSF grant IIS-1704352 and ONR grant N000141712140.

References

Bang, H., and Robins, J. M. 2005. Doubly robust estimation in missing data and causal inference models. Biometrics 61(4):962-973.
Bareinboim, E., and Pearl, J. 2012. Causal inference by surrogate experiments: z-identifiability. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, 113-120. AUAI Press.
Bareinboim, E., and Pearl, J. 2016. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences 113(27):7345-7352.
Derkach, A.; Pfeiffer, R. M.; Chen, T.-H.; and Sampson, J. N. 2019. High dimensional mediation analysis with latent variables. Biometrics.
Fulcher, I. R.; Shpitser, I.; Marealle, S.; and Tchetgen, E. J. T. 2019. Robust inference on population indirect causal effects: the generalized front door criterion. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Gruber, S.; Logan, R. W.; Jarrín, I.; Monge, S.; and Hernán, M. A. 2015. Ensemble learning of inverse probability weights for marginal structural modeling in large observational datasets. Statistics in Medicine 34(1):106-117.
Hill, J. L. 2011. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics 20(1):217-240.
Horvitz, D. G., and Thompson, D. J. 1952. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47(260):663-685.
Huang, Y., and Valtorta, M. 2006. Pearl's calculus of intervention is complete. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, 217-224. AUAI Press.
Hughes, M. D.; Daniels, M. J.; Fischl, M. A.; Kim, S.; and Schooley, R. T. 1998. CD4 cell count as a surrogate endpoint in HIV clinical trials: a meta-analysis of studies of the AIDS Clinical Trials Group. AIDS 12(14):1823-1832.
Jaber, A.; Zhang, J.; and Bareinboim, E. 2018. Causal identification under Markov equivalence. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence. AUAI Press.
Lee, S.; Correa, J. D.; and Bareinboim, E. 2019. General identifiability with arbitrary surrogate experiments. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence. AUAI Press.
Lee, B. K.; Lessler, J.; and Stuart, E. A. 2010. Improving propensity score weighting using machine learning. Statistics in Medicine 29(3):337-346.
Pearl, J., and Robins, J. 1995. Probabilistic evaluation of sequential plans from causal models with hidden variables. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, 444-453. Morgan Kaufmann Publishers Inc.
Pearl, J. 1995. Causal diagrams for empirical research. Biometrika 82(4):669-710.
Pearl, J. 2000. Causality: Models, Reasoning, and Inference. New York: Cambridge University Press. 2nd edition, 2009.
Robins, J. M.; Greenland, S.; and Hu, F.-C. 1999. Estimation of the causal effect of a time-varying exposure on the marginal mean of a repeated binary outcome. Journal of the American Statistical Association 94(447):687-700.
Robins, J. M.; Hernán, M. A.; and Brumback, B. 2000. Marginal structural models and causal inference in epidemiology. Epidemiology 11(5).
Robins, J. 1986. A new approach to causal inference in mortality studies with a sustained exposure period application to control of the healthy worker survivor effect. Mathematical Modelling 7(9-12):1393-1512.
Rosenbaum, P. R., and Rubin, D. B. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70(1):41-55.
Shpitser, I., and Pearl, J. 2006. Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the 21st AAAI Conference on Artificial Intelligence, 1219. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999.
Tian, J. 2002. Studies in Causal Reasoning and Learning. Ph.D. Dissertation, Computer Science Department, University of California, Los Angeles, CA.
Van Der Laan, M. J., and Rubin, D. 2006. Targeted maximum likelihood learning. The International Journal of Biostatistics 2(1).
Wen, J.; Hassanpour, N.; and Greiner, R. 2018. Weighted Gaussian process for estimating treatment effect. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems.