# Unified Covariate Adjustment for Causal Inference

Yonghan Jung (Purdue University), Jin Tian (Mohamed bin Zayed University of Artificial Intelligence), and Elias Bareinboim (Columbia University)
jung222@purdue.edu, jin.tian@mbzuai.ac.ae, eb@cs.columbia.edu

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Causal effect identification and estimation are fundamental tasks found throughout the data sciences. Although causal effect identification has been solved in theory, many existing estimators only address a subset of scenarios, known as the sequential back-door adjustment (SBD) (Pearl and Robins, 1995a) or g-formula (Robins, 1986). Recent efforts toward general-purpose estimators with broader coverage, incorporating the front-door adjustment (FD) (Pearl, 2000) and others, are not scalable due to the high computational cost of summing over a high-dimensional set of variables. In this paper, we introduce a novel approach that achieves broad coverage of causal estimands beyond the SBD, incorporating various sum-product functionals such as the FD, while remaining scalable, i.e., estimable in polynomial time relative to the number of variables and samples in the problem. Specifically, we present the class of unified covariate adjustment (UCA), for which we develop a scalable and doubly robust estimator. In particular, we illustrate the expressiveness of UCA for a wide spectrum of causal estimands (e.g., SBD, FD, and others) in causal inference. We then develop an estimator that exhibits computational efficiency and double robustness. Experiments corroborate the scalability and robustness of the proposed framework.

1 Introduction

Causal inference is a crucial aspect of scientific research, with broad applications ranging from social sciences to economics, and from biology to medicine. Two significant tasks in causal inference are causal effect identification and estimation. Causal effect identification concerns determining the conditions under which a causal effect can be inferred from a combination of available data distributions and a causal graph depicting the data-generating process. Causal effect estimation, on the other hand, develops an estimator for the identified causal effect expression using finite samples.

Causal effect identification theories have been well established across various scenarios. These include cases where the input distribution is purely observational (Tian and Pearl, 2003; Shpitser and Pearl, 2006; Huang and Valtorta, 2006) (known as observational identification, or obs-ID) or a combination of observational and interventional (Bareinboim and Pearl, 2012a; Lee et al., 2019) (referred to as generalized identification, or gID); scenarios where the target query and input distributions originate from different populations (Bareinboim and Pearl, 2012b; Bareinboim et al., 2014; Bareinboim and Pearl, 2016; Correa et al., 2018; Lee et al., 2020) (known as recoverability or transportability); or cases where the target query is counterfactual (Rung 3) (Correa et al., 2021) (referred to as Ctf-ID), beyond interventional (Rung 2) queries of the Ladder of Causation (Pearl and Mackenzie, 2018; Bareinboim et al., 2020). In these settings, algorithmic solutions have been devised that take input distributions along with specified target queries and formulate identification functionals as arithmetic operations (sums/integrals, products, ratios) on conditional distributions induced from the input distributions.
Despite all the progress, existing estimators cover only a subset of all identification scenarios. Specifically, well-established estimators exist for the back-door (BD) adjustment (Pearl, 1995), represented as $\sum_z E[Y \mid x, z]P(z)$, the sequential back-door adjustment (SBD) (Robins, 1986; Pearl and Robins, 1995b), and off-policy evaluation (OPE) (Murphy, 2003), which is SBD with policy interventions; these estimators are known for their robustness to bias (Bang and Robins, 2005; Robins et al., 2009; van der Laan and Gruber, 2012; Murphy, 2003; Rotnitzky et al., 2017; Luedtke et al., 2017; Uehara et al., 2022; Díaz et al., 2023). These estimators are also scalable, i.e., evaluable in polynomial time relative to the number of covariates ($|\mathbf{Z}|$) and applicable in the presence of mixed discrete and continuous covariates. However, SBDs only address a fraction of the broader spectrum of identification scenarios.

Beyond SBD, recent efforts have expanded to developing estimators for the front-door (FD) adjustment $\sum_{z, x'} E[Y \mid x', z]\,P(z \mid x)\,P(x')$ (Pearl, 1995). At first glance, this adjustment appears similar to SBD, as both involve sum-products of conditional probabilities. However, FD involves the treatment variable in dual roles: one copy is summed ($x'$ in $\sum_{x'} E[Y \mid x', z]P(x')$) and the other is fixed ($x$ in $P(z \mid x)$). While FD estimators achieving double robustness have been developed (Fulcher et al., 2019; Guo et al., 2023), they lack scalability due to the necessity of summing over the values of $\mathbf{Z}$ (i.e., $\sum_z$), which limits their practicality when $\mathbf{Z}$ is high-dimensional or continuous.

Similar challenges arise in more general identification scenarios beyond SBD and FD. Recent efforts have focused on developing estimators for broad causal estimands, such as Tian's adjustment (Tian and Pearl, 2002a), which incorporates FD and other cases where causal effects are represented as sum-product functionals (Bhattacharya et al., 2022). These efforts also include work on covering any identification functional (Jung et al., 2021a; Xia et al., 2021, 2022; Bhattacharya et al., 2022; Jung et al., 2023a). While these estimators are designed to achieve a wide coverage of functionals, they lack scalability due to the necessity of summing over high-dimensional variables.

Table 1: Scope. The table marks, for each functional class (UCA class, obs-ID/gID, transportability), whether coverage and scalability are addressed by prior works and by UCA; separate marks denote addressed, unaddressed, and partially addressed areas, and "?" indicates areas where no known results are present.

Thus far, we have assessed the pair (functional class, estimator) based on two criteria: (1) coverage of the functional class, and (2) scalability of the corresponding estimators. Scalable estimators achieving double robustness have been established predominantly for the BD/SBD classes. While recent studies have developed estimators with a strong emphasis on coverage (e.g., any identification functional), less attention has been given to achieving scalability. In this paper, we establish a novel pair of a functional class and a corresponding estimation framework designed to ensure scalability while covering a broad spectrum of identification functionals. Our work aims to maximize coverage, enabling the effective development of scalable estimators with the doubly robust property.
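To make the scalability contrast concrete, the following minimal sketch (our own toy illustration, not the paper's experiments) shows the back-door functional $\sum_z E[Y \mid x, z]P(z)$ estimated in $O(n)$ by fitting an outcome regression and averaging over the empirical distribution of $Z$, which is the kind of estimator the BD/SBD literature above relies on. The simulated data and the `LinearRegression` nuisance model are assumptions for illustration only.

```python
# Minimal sketch (not from the paper): the back-door adjustment
# sum_z E[Y | x, z] P(z) estimated in O(n) by regression + averaging,
# instead of enumerating the (possibly exponential) support of Z.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, d = 5000, 20                                   # d-dimensional covariate Z
Z = rng.normal(size=(n, d))
X = rng.binomial(1, 1 / (1 + np.exp(-Z[:, 0])))   # treatment confounded through Z[:, 0]
Y = 2.0 * X + Z.sum(axis=1) + rng.normal(size=n)  # outcome

# Outcome regression mu(x, z) ~= E[Y | x, z]
mu = LinearRegression().fit(np.column_stack([X, Z]), Y)

def backdoor_plugin(x):
    # Average mu(x, Z_i) over the empirical distribution of Z (one pass over the data).
    Xfix = np.full((n, 1), x)
    return mu.predict(np.column_stack([Xfix, Z])).mean()

print(backdoor_plugin(1) - backdoor_plugin(0))    # close to 2.0, the ATE in this toy model
```

No enumeration over the $2^{20}$-sized support of a discretized $Z$ is needed; the estimators developed in this paper aim to retain exactly this property beyond the BD/SBD class.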
This functional class, termed unified covariate adjustment (UCA), integrates sum-products of conditional distributions appearing in many causal inference scenarios such as BD/FD, Tian's adjustment, S-admissibility in transportability/recoverability (Bareinboim and Pearl, 2016), the effect-of-treatment-on-the-treated (ETT) (Heckman, 1992), and nested counterfactuals (Correa et al., 2021). The coverage of the proposed class is further demonstrated through the application to a novel estimand for the counterfactual directed effect (Ctf-DE) derived from fairness analysis (Plečko and Bareinboim, 2024). For the proposed UCA class, we develop a scalable and doubly robust estimator that is computationally efficient relative to the number of samples. Table 1 visualizes the scope of our framework. The contributions of this paper are as follows:

1. We introduce unified covariate adjustment (UCA), a comprehensive framework that encompasses a broad class of sum-product causal estimands. This framework's expressiveness is demonstrated across various scenarios beyond SBD, including Tian's adjustment, which incorporates FD and others, as well as novel counterfactual scenarios in fairness analysis.
2. We develop a corresponding estimator that is computationally efficient and doubly robust, and provide its finite sample guarantee. We demonstrate scalability and robustness to bias both theoretically and empirically through simulations.

Figure 1: (a) Front-door in Example 1, (b) Verma in Example 2, (c) Napkin, (d) Standard fairness model in Example 3, and (e) Example graph from (Jung et al., 2021a, Fig. 1b).

Notations. We use ($\mathbf{X}$, $X$, $\mathbf{x}$, $x$) to denote a random vector, a random variable, and their realized values, respectively. For a function $f(z_i)$ for $i = 1, 2, \ldots$, we use $\sum_i f(z_i) = f(z_1) + f(z_2) + \cdots$. Also, for a function $f(z)$, we use $\sum_z f(z)$ to denote the summation/integration over a mixture of discrete/continuous random variables $Z$. For example, we write the back-door adjustment as $\sum_z E_P[Y \mid x, z]P(z)$ when $Z$ is a mixture of discrete/continuous variables. Given an ordered set $\mathbf{X} = \{X_1, \ldots, X_m\}$, we denote $\mathbf{X}^{(i)} := \{X_1, \ldots, X_i\}$ and $\mathbf{X}^{>i} := \{X_{i+1}, \ldots, X_m\}$ for $m = |\mathbf{X}|$. For a discrete $X$, we use $\mathbb{1}_x(X)$ as a function such that $\mathbb{1}_x(X) = 1$ if $X = x$ and $\mathbb{1}_x(X) = 0$ otherwise. $P(\mathbf{V})$ denotes a distribution over $\mathbf{V}$, and $P(\mathbf{v})$ a probability at $\mathbf{V} = \mathbf{v}$. We use $E_P[f(\mathbf{V})]$ and $V_P[f(\mathbf{V})]$ to denote the mean and variance of $f(\mathbf{V})$ relative to $P(\mathbf{V})$. We use $\|f\|_P := \sqrt{E_P[\{f(\mathbf{V})\}^2]}$ as the $L_2$-norm of $f$ with respect to $P$. If a function $\hat f$ is a consistent estimator of $f$ with rate $r_n$, we write $\hat f - f = o_P(r_n)$. We say $\hat f$ is $L_2$-consistent if $\|\hat f - f\|_P = o_P(1)$. We write $\hat f - f = O_P(1)$ if $\hat f - f$ is bounded in probability, and $\hat f - f$ is said to be bounded in probability at rate $r_n$ if $\hat f - f = O_P(r_n)$. $[n] := \{1, \ldots, n\}$ denotes an index set. $D := \{\mathbf{V}^{(i)} : i \in [n]\}$ denotes a sample set, where $\mathbf{V}^{(i)}$ denotes the $i$th sample in $D$. The empirical average of $f(\mathbf{V})$ over $D$ is $E_D[f(\mathbf{V})] := (1/|D|)\sum_{i:\mathbf{V}^{(i)} \in D} f(\mathbf{V}^{(i)})$.

Structural causal models. We use structural causal models (SCMs) (Pearl, 2000; Bareinboim et al., 2020) as our framework. An SCM $M$ is a quadruple $M = \langle \mathbf{U}, \mathbf{V}, P(\mathbf{U}), \mathcal{F} \rangle$, where $\mathbf{U}$ is a set of exogenous (latent) variables following a joint distribution $P(\mathbf{U})$, and $\mathbf{V}$ is a set of endogenous (observable) variables whose values are determined by functions $\mathcal{F} = \{f_{V_i}\}_{V_i \in \mathbf{V}}$ such that $V_i \leftarrow f_{V_i}(\mathrm{pa}_i, u_i)$, where $\mathrm{PA}_i \subseteq \mathbf{V}$ and $U_i \subseteq \mathbf{U}$.
Each SCM $M$ induces a distribution $P(\mathbf{V})$ and a causal graph $G = G(M)$ over $\mathbf{V}$ in which directed edges exist from every variable in $\mathrm{PA}_i$ to $V_i$ and dashed bidirected arrows encode common latent variables. Performing an intervention fixing $\mathbf{X} = \mathbf{x}$ is represented through the do-operator, $do(\mathbf{X} = \mathbf{x})$, which encodes the operation of replacing the original equations of $X$ (i.e., $f_X(\mathrm{pa}_X, u_X)$) by the constant $x$ for all $X \in \mathbf{X}$, and induces an interventional distribution $P(\mathbf{V} \mid do(\mathbf{x}))$. For any $Y \in \mathbf{V}$, the potential response $Y_{\mathbf{x}}(\mathbf{u})$ is defined as the solution of $Y$ in the submodel $M_{\mathbf{x}}$ given $\mathbf{U} = \mathbf{u}$, which induces a counterfactual variable $Y_{\mathbf{x}}$.

Related work. Our work is an extension of existing sequential back-door adjustment (SBD) estimators (Mises, 1947; Bickel et al., 1993; Bang and Robins, 2005; Robins et al., 2009; van der Laan and Gruber, 2012; Rotnitzky et al., 2017; Luedtke et al., 2017; Díaz et al., 2023) to a broader class of sum-product functionals, such as the front-door adjustment (FD), Tian's adjustment (Tian and Pearl, 2002a), which generalizes FD and more, and nested counterfactuals, which will be detailed in later sections. Our work is aligned with recent works of Chernozhukov et al. (2022); Li and Luedtke (2023); Quintas-Martinez et al. (2024), which examined SBD derived from various joint distributions. Specifically, Li and Luedtke (2023) considered the SBD setting where conditional distributions are induced from different sources. In contrast, we study a broader class of sum-product functionals from multiple populations. Also, Quintas-Martinez et al. (2024) considered the Markovian model $\prod_{i=1}^n P^i(V_i \mid \mathrm{PA}_i)$ where each $P^i$ can be distinct. In contrast, we study a broader class of estimands that are not confined to conditioning on $\mathrm{PA}_i$. On the other hand, Chernozhukov et al. (2022) considered the case where covariate distributions are allowed to change, and demonstrated that FD can be captured through this technique. Our work expands on these findings by covering a broader class, such as Tian's adjustment and a nested counterfactual from the fairness literature, and by providing a more formal theory that includes finite sample guarantees and asymptotic analysis.

2 Unified Covariate Adjustment

A class of causal estimands termed unified covariate adjustment (UCA) is defined as follows:

Definition 1 (Unified Covariate Adjustment (UCA)). Let $\Psi[\mathcal{P}; \sigma]$ denote the following probability measure over an ordered set $\mathbf{V} := (\mathbf{C}_1, \mathbf{R}_1, \ldots, \mathbf{C}_m, \mathbf{R}_m, Y := \mathbf{C}_{m+1})$:
$$\Psi[\mathcal{P}; \sigma] := P^{m+1}(Y \mid \mathbf{S}_m) \prod_{i=1}^m P^i(\mathbf{C}_i \mid \mathbf{S}_{i-1})\, \sigma^i_{\mathbf{R}_i}(\mathbf{R}_i \mid \mathbf{S}_i \setminus \mathbf{R}_i),$$
where $\mathcal{P} := \{P^i(\mathbf{V}) : i \in [m+1]\}$ is a set of distributions of the form $P^i(\mathbf{V}) = Q^i(\mathbf{V} \mid \mathbf{S}^b_{i-1} = s^b_{i-1})$, where $Q^i$ is a distribution and $\mathbf{S}^b_{i-1}$ is a (potentially empty) set. Each pair $P^i(\mathbf{V})$ and $P^j(\mathbf{V})$ can be the same ($P^i(\mathbf{V}) = P^j(\mathbf{V})$) or distinct ($P^i(\mathbf{V}) \neq P^j(\mathbf{V})$). For $i \in [m+1]$, $\mathbf{S}_{i-1} := (\mathbf{C}^{(i-1)} \cup \mathbf{R}^{(i-1)}) \setminus \mathbf{S}^b_{i-1}$. Each $\mathbf{R}_i$ is controlled by a pre-specified/known probability measure $\sigma^i_{\mathbf{R}_i} := \sigma^i_{\mathbf{R}_i}(r_i \mid s_i \setminus r_i)$, where $\sum_{r_i} \sigma^i_{\mathbf{R}_i}(r_i \mid s_i \setminus r_i) = 1$ and $0 \le \sigma^i_{\mathbf{R}_i} \le 1$ almost surely (e.g., $\sigma^i_{\mathbf{R}_i} := \mathbb{1}_{r_i}(\mathbf{R}_i)$). Then, the expectation of $Y$ over $\Psi[\mathcal{P}; \sigma]$ is called a unified covariate adjustment (UCA):
$$\psi_0 := E_{\Psi[\mathcal{P};\sigma]}[Y] = \sum_{\mathbf{c} \cup \mathbf{r}} E_{P^{m+1}}[Y \mid s_m] \prod_{i=1}^m P^i(c_i \mid s_{i-1})\, \sigma^i_{\mathbf{R}_i}(r_i \mid s_i \setminus r_i). \quad (1)$$
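Before turning to concrete examples, here is a hypothetical sketch of how a UCA specification in the sense of Definition 1 could be encoded as plain data. The class and field names are ours, not the authors' code; the front-door entry anticipates Example 1 below.

```python
# Hypothetical encoding of a UCA specification (Definition 1) as plain data;
# the field names below are illustrative assumptions, not from the authors' code.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class UCAFactor:
    C_i: List[str]                                            # chance variables C_i
    S_prev: List[str]                                         # conditioning set S_{i-1}
    fixed: Dict[str, object] = field(default_factory=dict)    # S^b_{i-1} = s^b_{i-1}
    R_i: List[str] = field(default_factory=list)              # policy variables R_i
    sigma: Optional[object] = None                            # sigma^i_{R_i}, e.g. an indicator

# Front-door adjustment (Example 1):
# E[Y | do(x)] = sum_{c,z,x'} E[Y | c, x', z] P(z | c, x) P(c, x')
front_door = [
    UCAFactor(C_i=["X", "C"], S_prev=[]),                     # P^1(C_1) = P(X, C)
    UCAFactor(C_i=["Z"], S_prev=["C"], fixed={"X": "x"}),     # P^2(Z | C, X = x)
    UCAFactor(C_i=["Y"], S_prev=["Z", "X", "C"]),             # P^3(Y | Z, X, C)
]
```

The point of such an encoding is only that a UCA estimand is fully described by the ordered blocks $\mathbf{C}_i$, their conditioning sets, the fixed values $s^b_{i-1}$, and the policies $\sigma^i_{\mathbf{R}_i}$.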
We will exemplify that UCA encompasses many well-known causal estimands, including the sequential back-door adjustment (SBD) (Robins, 1986; Pearl and Robins, 1995b), the front-door adjustment (Pearl, 1995), Tian's adjustment (Tian and Pearl, 2002a), S-admissibility in transportability/recoverability (Bareinboim and Pearl, 2016), the effect-of-treatment-on-the-treated (ETT) (Heckman, 1992), nested counterfactuals (Correa et al., 2021), treatment-treatment interaction (Jung et al., 2023b), and off-policy evaluation (Murphy, 2003). This section particularly focuses on recently developed and lesser-known estimands, for which scalable estimators have rarely been explored. Appendix B provides additional examples, demonstrating how UCA can represent well-known estimands such as off-policy evaluation and S-admissibility.

At first glance, UCA closely resembles the sequential back-door adjustment (SBD) (Robins, 1986; Pearl and Robins, 1995b). Indeed, UCA reduces to SBD in the special case where $P^i = P(\mathbf{V})$ for all $i = 1, \ldots, m+1$ and $\sigma^i_{\mathbf{R}_i} := \mathbb{1}_{r_i}(\mathbf{R}_i)$; i.e., $\psi_0 = \sum_{\mathbf{c}} E_P[Y \mid \mathbf{c}^{(m)} \cup \mathbf{r}^{(m)}] \prod_{i=1}^m P(c_i \mid \mathbf{c}^{(i-1)} \cup \mathbf{r}^{(i-1)})$. However, UCA provides the flexibility to represent target estimands beyond SBD by allowing $P^i$ to be any distribution that aligns with the target estimand, permitting arbitrary conditional distributions beyond the observational distribution $P$. To demonstrate, consider the front-door adjustment (FD) scenario (Pearl, 1995) depicted in Fig. 1a:
$$E[Y \mid do(x)] = \sum_{c, z, x'} E[Y \mid c, x', z]\, P(z \mid c, x)\, P(c, x'). \quad (2)$$
Even though FD cannot be expressed using SBD, because the treatment variable $X$ is being fixed (in $P(z \mid c, x)$) and summed (via $\sum_{x'}$) simultaneously, it can be represented through UCA as follows:

Example 1 (FD as UCA). FD can be written as the expectation of $Y$ over $P(Y \mid Z, X, C)\,P(Z \mid x, C)\,P(X, C)$. We set $\mathbf{C}_1 := \{X, C\}$, $\mathbf{C}_2 := \{Z\}$, $\mathbf{R} = \emptyset$, $P^1(\mathbf{C}_1) = P(X, C)$, $P^2(\mathbf{C}_2 \mid \mathbf{S}_1) = P(Z \mid x, C)$ with $\mathbf{S}^b_1 = \{X\}$ and $\mathbf{S}_1 = \{C\}$, and $P^3(Y \mid \mathbf{S}_2) = P(Y \mid Z, X, C)$ with $\mathbf{S}_2 = \{Z, X, C\}$.

Next, consider Verma's equation (Verma and Pearl, 1990; Tian and Pearl, 2002b) with Fig. 1b:
$$E[Y \mid do(x)] = \sum_{b, a, x'} E[Y \mid b, a, x]\, P(b \mid a, x')\, P(a \mid x)\, P(x'), \quad (3)$$
where $X$ is fixed to $x$ in $E[Y \mid x, a, b]$ and $P(a \mid x)$, while it is summed over in $P(b \mid a, x')$ and $P(x')$. Similar to FD, due to the dual role of $X$, the existing SBD framework is not suitable to express Verma's equation, which can be represented through UCA as follows:

Example 2 (Verma as UCA). Verma's equation is expressible as the expectation of $Y$ over $P(Y \mid B, A, x)\,P(B \mid A, X)\,P(A \mid x)\,P(X)$. We set $\mathbf{C}_1 = \{X\}$, $\mathbf{C}_2 = \{A\}$, $\mathbf{C}_3 = \{B\}$, and $\mathbf{R} = \emptyset$. We map $P^1(\mathbf{C}_1) := P(X)$; $P^2(\mathbf{C}_2 \mid \mathbf{S}_1) = P(A \mid x)$ with $\mathbf{S}_1 = \emptyset$, $\mathbf{S}^b_1 = \{X\}$; $P^3(\mathbf{C}_3 \mid \mathbf{S}_2) = P(B \mid A, X)$ with $\mathbf{S}_2 = \{A, X\}$; and $P^4(Y \mid \mathbf{S}_3) = P(Y \mid B, A, x)$ with $\mathbf{S}_3 = \{B, A\}$, $\mathbf{S}^b_3 = \{X\}$.

In both examples, the variable $\mathbf{S}^b_i = \{X\}$ is bifurcated, being fixed in some conditional distributions (e.g., $P(z \mid x, c)$ in the front-door adjustment (FD)) and summed over via $\sum_{x'}$ in others (e.g., $P(y \mid z, x', c)$ in FD).

Algorithm 1: Tian-to-UCA($G$, $\mathbf{V} := (V_1, \ldots, V_K, Y)$)
Input: A graph $G$ and a set of topologically ordered variables $\mathbf{V} := (V_1, \ldots, V_K, Y)$.
1. Set $\mathbf{C}_1 := (V_1, \ldots, V_{k-1}, V_k := X, V_{k+1}, \ldots, V_{k+i_1})$ as an ordered sequence, where $(V_1, \ldots, V_{k-1})$ are predecessors of $X$, and $(V_{k+1}, \ldots, V_{k+i_1})$ are successors of $X$ within $\mathbf{S}_X$.
2. Set $P^1(\mathbf{C}_1) := P(\mathbf{C}_1)$, $i := 2$, and $\mathbf{R} := \emptyset$.
3. while $\mathbf{V} \setminus (\{Y\} \cup \mathbf{C}^{(i-1)}) \neq \emptyset$ do
4. &nbsp;&nbsp; if $\mathbf{C}_{i-1} \subseteq \mathbf{S}_X$, set $\mathbf{C}_i$ as the next sequence of vertices in $\mathbf{V} \setminus (\{Y\} \cup \mathbf{C}^{(i-1)})$ that are not in $\mathbf{S}_X$; $\mathbf{S}_{i-1} := \mathbf{C}^{(i-1)} \setminus \{X\}$; and $P^i(\mathbf{C}_i \mid \mathbf{S}_{i-1}) := P(\mathbf{C}_i \mid \mathbf{S}_{i-1}, x)$ with $\mathbf{S}^b_{i-1} := \{X\}$.
5. &nbsp;&nbsp; else, set $\mathbf{C}_i$ as the next sequence of vertices in $\mathbf{V} \setminus (\{Y\} \cup \mathbf{C}^{(i-1)})$ that are in $\mathbf{S}_X$; $\mathbf{S}_{i-1} := \mathbf{C}^{(i-1)}$; and $P^i(\mathbf{C}_i \mid \mathbf{S}_{i-1}) := P(\mathbf{C}_i \mid \mathbf{S}_{i-1})$ with $\mathbf{S}^b_{i-1} := \emptyset$.
6. &nbsp;&nbsp; $i \leftarrow i + 1$.
7. end
8. Set $m \leftarrow i$. If $Y \in \mathbf{S}_X$, set $\mathbf{S}_m := \mathbf{C}^{(m)}$, $\mathbf{S}^b_m = \emptyset$, and $P^{m+1}(Y \mid \mathbf{S}_m) = P(Y \mid \mathbf{S}_m)$. Otherwise, set $\mathbf{S}_m := \mathbf{C}^{(m)} \setminus \{X\}$, $\mathbf{S}^b_m = \{X\}$, and $P^{m+1}(Y \mid \mathbf{S}_m) = P(Y \mid \mathbf{S}_m, x)$.
9. return $E_{\Psi[\mathcal{P}]}[Y]$ where $\Psi[\mathcal{P}] := P^{m+1}(Y \mid \mathbf{S}_m) \prod_{i=1}^m P^i(\mathbf{C}_i \mid \mathbf{S}_{i-1})$.

Both FD and Verma's equations are special cases of Tian's adjustment (Tian and Pearl, 2002a), which states that $E[Y \mid do(x)]$ is identifiable under certain conditions. Specifically, when $X$ and its children $ch_G(X)$ in the graph $G$ are not connected by bidirected edges, it can be expressed as
$$E[Y \mid do(x)] = \sum_{\mathbf{v}^{(K)} \setminus \{x\},\, x'} E_P[Y \mid v^{(K)}] \prod_{i=1}^K P^*(v_i \mid v^{(i-1)}), \quad (4)$$
where $\mathbf{V} := (V_1, V_2, \ldots, V_K, Y)$ is a topologically ordered set with $V_k := X$ for some $k$ being the treatment variable, $P^*(v_i \mid v^{(i-1)}) := P(v_i \mid v^{(k-1)}, x, v_{k+1}, \ldots, v_{i-1})$ (i.e., $X$ is fixed to $x$) if $V_i \notin \mathbf{S}_X$, where $\mathbf{S}_X$ is the set of vertices connected to $X$ through bidirected edges, and $P^*(v_i \mid v^{(i-1)}) := P(v_i \mid v^{(k-1)}, x', v_{k+1}, \ldots, v_{i-1})$ (i.e., $X$ is summed over via $\sum_{x'}$) if $V_i \in \mathbf{S}_X$. In Tian's adjustment, $X$ is bifurcated into being summed via $\sum_{x'}$ and being fixed to $X = x$. We exhibit the expressiveness of UCA for Tian's adjustment:

Proposition 1. Tian's adjustment in Eq. (4) is UCA-expressible through Algo. 1.

Next, we exhibit the coverage of UCA for a counterfactual quantity in the fairness literature. Specifically, we focus on the counterfactual directed effect (Ctf-DE) in the standard fairness model (SFM) (Plečko and Bareinboim, 2024), as illustrated in Fig. 1d. This model includes several key components: the protected (discrete) attribute ($X$), such as race; the baseline covariates ($Z$), like age; the mediator variables ($W$) affected by $X$, for example, educational level; and the outcome variable ($Y$), such as salary. Consider a scenario where we investigate the query, "What would be the expected salary for someone who is Black, but hypothetically of Asian race and had been educated as a White person typically would be?" The query is represented as the Ctf-DE $E[Y_{X=x_0, W_{X=x_1}} \mid X = x_2]$, where $x_0$, $x_1$, and $x_2$ correspond to the races Asian, White, and Black, respectively. This query can be identified through the algorithm in (Correa et al., 2021) under the SFM in Fig. 1d:
$$E[Y_{X=x_0, W_{X=x_1}} \mid X = x_2] = \sum_{w, z} E[Y \mid X = x_0, w, z]\, P(w \mid X = x_1, z)\, P(z \mid X = x_2). \quad (5)$$
This identification functional is UCA-expressible:

Example 3 (Ctf-DE as UCA). The Ctf-DE is expressible as the expectation of $Y$ over $P(Y \mid X = x_0, W, Z)\,P(W \mid X, Z)\,P(Z \mid X = x_2)\,\mathbb{1}_{x_1}(X)$. Set $\mathbf{R}_1 := \{X\}$, $\sigma^1_{\mathbf{R}_1} := \mathbb{1}_{x_1}(X)$, $P^1(\mathbf{C}_1) = P(Z \mid X = x_2)$ with $\mathbf{C}_1 = \{Z\}$ and $\mathbf{S}^b_0 = \{X\}$, $P^2(\mathbf{C}_2 \mid \mathbf{S}_1) = P(W \mid X, Z)$ with $\mathbf{C}_2 = \{W\}$ and $\mathbf{S}_1 = \{X, Z\}$, and $P^3(Y \mid \mathbf{S}_2) = P(Y \mid X = x_0, W, Z)$ with $\mathbf{S}_2 = \{W, Z\}$ and $\mathbf{S}^b_2 = \{X\}$.

Despite the broad expressiveness of UCA, as illustrated in this section and Appendix B, not all causal estimand functionals are UCA-expressible. To witness, consider the napkin estimand described in (Pearl and Mackenzie, 2018; Jung et al., 2021a) with $G$ in Fig. 1c, defined as $P(y \mid do(x)) = \frac{\sum_w P(y, x \mid r, w)\,P(w)}{\sum_w P(x \mid r, w)\,P(w)}$. Here, the functional for $E[Y \mid do(x)]$ is represented not as the expectation of a product of conditional distributions, but rather as a quotient of sums of conditional distributions. The napkin estimand is not UCA-expressible.
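To make the dual role of $X$ in Eq. (2) tangible, the following toy numerical check (our own hand-specified discrete SCM for the graph in Fig. 1a, not the paper's simulation design) evaluates the front-door expression from the observational joint and confirms that it matches $E[Y \mid do(x)]$ computed directly from the SCM.

```python
# Toy check (assumption: a hand-specified discrete SCM for Fig. 1a, not the paper's
# experiments): the front-door expression in Eq. (2), computed from the *observational*
# joint, matches E[Y | do(x)] computed from the SCM with latent confounder U.
import numpy as np
from itertools import product

pU, pC = 0.5, 0.6
p_x = lambda u, c: 0.2 + 0.5 * u + 0.2 * c                 # P(X=1 | u, c)
p_z = lambda x, c: 0.3 + 0.4 * x + 0.1 * c                 # P(Z=1 | x, c)
p_y = lambda z, u, c: 0.1 + 0.4 * z + 0.3 * u + 0.1 * c    # P(Y=1 | z, u, c)
bern = lambda p, v: p if v == 1 else 1 - p

# Observational joint P(c, x, z, y), marginalizing the latent confounder U.
P = np.zeros((2, 2, 2, 2))
for u, c, x, z, y in product(range(2), repeat=5):
    P[c, x, z, y] += (bern(pU, u) * bern(pC, c) * bern(p_x(u, c), x)
                      * bern(p_z(x, c), z) * bern(p_y(z, u, c), y))

def do_x(x):          # ground truth E[Y | do(x)] from the SCM itself
    return sum(bern(pU, u) * bern(pC, c) * bern(p_z(x, c), z) * p_y(z, u, c)
               for u, c, z in product(range(2), repeat=3))

def front_door(x):    # Eq. (2) evaluated from the observational joint P only
    total = 0.0
    for c, z, xp in product(range(2), repeat=3):
        p_z_given_cx = P[c, x, z].sum() / P[c, x].sum()    # P(z | c, x): x is FIXED
        p_c_xp = P[c, xp].sum()                            # P(c, x'): x' is SUMMED
        EY = P[c, xp, z, 1] / P[c, xp, z].sum()            # E[Y | c, x', z]
        total += EY * p_z_given_cx * p_c_xp
    return total

print(do_x(1), front_door(1))   # the two agree (front-door identification)
```

The two printed numbers coincide up to floating-point error, illustrating how the same variable $X$ enters once as a fixed value and once as a summed copy in the UCA measure of Example 1.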
Intuitively, if a target functional is expressed as the expectation of a probability measure that is a product of multiple conditional distributions, it can be captured through UCA. A formal criterion is the following:

Theorem 1 (Expressiveness). Suppose a functional $\psi_0$ is expressed as the mean of the following measure, $P^{m+1}(Y \mid \mathbf{S}'_m) \prod_{i=1}^m P^i(\mathbf{C}_i \mid \mathbf{S}'_{i-1})\,\sigma^i_{\mathbf{R}_i}(\mathbf{R}_i \mid \mathbf{S}'_i \setminus \mathbf{R}_i)$, where $\mathbf{S}'_i = (\mathbf{C}^{(i)} \cup \mathbf{R}^{(i)}) \setminus \mathbf{S}^b_i$ for each $i = 1, \ldots, m$ and $P^j(\mathbf{V})$ for $j = 1, \ldots, m+1$ are distributions of the form $P^j(\mathbf{V}) = Q^j(\mathbf{V} \mid \mathbf{S}^b_{j-1} = s^b_{j-1})$. Then, the functional $\psi_0$ can be expressed through UCA in Eq. (1).

3 Scalable Estimator for Unified Covariate Adjustment

So far, we discussed the coverage of UCA. In this section, we construct a scalable estimator for UCA that achieves the double robustness property and provide its finite sample guarantee. We define the estimator with two sets of nuisance parameters $\mu$ and $\pi$: $\mu$ is a collection of regression parameters, and $\pi$ is a collection of ratio parameters.

We introduce sets to define the regression nuisances. Define $\mathbf{B}_{i-1} := \mathbf{S}_i \cap \mathbf{C}^{(i-1)} \cap \mathbf{S}^b_{i-1}$ for $i = 2, \ldots, m$ as a bifurcated set, which is the subset of $\mathbf{S}_i$ in $P^{i+1}(\mathbf{C}_{i+1} \mid \mathbf{S}_i)$ that is fixed to $s^b_{i-1}$ in $P^i(\mathbf{C}_i \mid \mathbf{S}_{i-1})$, while being marginalized out over $P^j(\mathbf{C}_j \mid \mathbf{S}_{j-1})$ for some $j < i$ (e.g., $X$ in FD). Set $\mathbf{B}_m = \emptyset$. We use $\mathbf{B}'_{i-1}$ to denote an independent copy of $\mathbf{B}_{i-1}$ (variables following the same distribution as $\mathbf{B}_{i-1}$ but independent of $\mathbf{B}_{i-1}$ and $\mathbf{V}$). With $\mathbf{B}_{i-1}$ and $\mathbf{B}'_{i-1}$, we define $\mathbf{S}'_i := ((\mathbf{S}_i \cup \mathbf{B}_i) \setminus \mathbf{B}_{i-1}) \cup \mathbf{B}'_{i-1}$ and $\check{\mathbf{S}}_i := \mathbf{S}'_i \setminus \mathbf{R}_i$ for $i = 2, \ldots, m$. Define the regression nuisance parameters as follows: $\mu^m_0(\mathbf{S}_m) := E_{P^{m+1}}[Y \mid \mathbf{S}_m]$ and $\check\mu^m_0(\check{\mathbf{S}}_m) := \sum_{r_m} \sigma^m_{\mathbf{R}_m}(r_m \mid \mathbf{S}_m \setminus \mathbf{R}_m)\,\mu^m_0(r_m, \check{\mathbf{S}}_m)$. For $i = m-1, \ldots, 1$,
$$\mu^i_0(\mathbf{S}_i, \mathbf{B}'_i) := E_{P^{i+1}}[\check\mu^{i+1}_0(\check{\mathbf{S}}_{i+1}) \mid \mathbf{S}_i, \mathbf{B}'_i], \quad (6)$$
$$\check\mu^i_0(\check{\mathbf{S}}_i) := \sum_{r_i} \sigma^i_{\mathbf{R}_i}(r_i \mid \mathbf{S}_i \setminus \mathbf{R}_i)\,\mu^i_0(r_i, \mathbf{S}'_i). \quad (7)$$
Equipped with the regression nuisances, UCA can be computed as follows:

Proposition 2. UCA in Eq. (1) can be parameterized as $\psi_0 = E_{P^1}[\check\mu^1_0(\check{\mathbf{S}}_1)]$.

Whenever no variables are simultaneously summed and fixed (i.e., $\mathbf{B}_{i-1} = \emptyset$ for all $i = 2, \ldots, m$) in the UCA functional, as in Eq. (5) for Ctf-DE, the standard SBD adjustment, or the examples in Appendix B, we can estimate $\mu$ through nested regression with off-the-shelf regression models and compute UCA in Eq. (1) as $\psi_0 = E_{P^1}[\check\mu^1_0(\check{\mathbf{S}}_1)]$. This approach aligns with existing SBD estimators (Bang and Robins, 2005; Robins et al., 2009; van der Laan and Gruber, 2012; Rotnitzky et al., 2017; Luedtke et al., 2017; Díaz et al., 2023). For instance, for Ctf-DE in Example 3, $\mu^2_0(W, Z) := E_P[Y \mid W, Z, x_0]$, $\check\mu^2_0(W, Z) = \mu^2_0(W, Z)$, $\mu^1_0(X, Z) := E_P[\check\mu^2_0(W, Z) \mid X, Z]$, $\check\mu^1_0(Z) = \mu^1_0(x_1, Z)$, and $\psi_0 = E_P[\check\mu^1_0(Z) \mid x_2]$. These nuisances can be estimated efficiently with regression models that run in polynomial time relative to the number of variables and samples (e.g., neural networks (LeCun et al., 2015) or XGBoost (Chen and Guestrin, 2016)).

Beyond the SBD framework, the regression nuisances can represent functionals in the presence of variables that are simultaneously summed and fixed (e.g., FD in Eq. (2) or Verma in Eq. (3)). As an example, consider FD in Eq. (2) with its UCA representation in Example 1. First, define $\mu^2_0(Z, X, C) := E_P[Y \mid Z, X, C]$ with $\mathbf{S}_2 = \{Z, X, C\}$. Next, we have $\mathbf{B}_1 = \mathbf{S}^b_1 \cap \mathbf{C}_1 = \{X\}$ and $\check{\mathbf{S}}_2 = \{Z, X', C\}$, where $X'$ is an independent copy of $X$. Consequently, $\check\mu^2_0(Z, X', C) := \mu^2_0(Z, X', C)$, where $(Z, X', C)$ is plugged into the function $\mu^2_0$. Next, define $\mu^1_0(C, X') := E_P[\check\mu^2_0(Z, X', C) \mid x, C, X']$. Finally, we have $\check\mu^1_0(C, X) = \mu^1_0(C, X)$.
The expectation $E_P[\check\mu^1_0(C, X)] = \sum_{c, x'} P(c, x')\,\check\mu^1_0(c, x')$ correctly recovers FD in Eq. (2):
$$\sum_{c, x'} P(c, x')\,\check\mu^1_0(c, x') = \sum_{c, x'} P(c, x')\,\mu^1_0(c, x') = \sum_{c, x'} P(c, x')\, E_P[\check\mu^2_0(Z, x', c) \mid x, c, x'] \overset{(*)}{=} \sum_{c, x', z} P(c, x')\,P(z \mid x, c)\,\mu^2_0(z, x', c) = \sum_{c, x', z} P(c, x')\,P(z \mid x, c)\,E_P[Y \mid z, x', c],$$
where the equality $(*)$ holds since $X'$ is an independent copy of $X$ and is therefore independent of $Z$. Empirically, generating $\mathbf{B}'_i$ involves permuting copied samples of $\mathbf{B}_i$, as used in recent works (Chernozhukov et al., 2022; Xu and Gretton, 2022). We name this approach empirical bifurcation:

Definition 2 (Empirical bifurcation). An empirical bifurcation for $\mathbf{B}$ following a distribution $P$ is the procedure of copying samples of $\mathbf{B} \sim P$ and randomly permuting them to obtain new samples $\mathbf{B}'$.

In general, the regression nuisances can be estimated from data by employing empirical bifurcation and off-the-shelf regression models. Next, we define the ratio nuisance parameters $\pi$. Define $\pi^m_0(\mathbf{S}_m)$ as the solution functional satisfying $E_{P^{m+1}}[\mu^m_0(\mathbf{S}_m)\pi^m_0] = \psi_0$. Recursively, for $i = m-1, \ldots, 1$, define $\pi^i_0(\mathbf{S}_i, \mathbf{B}'_i)$ as a functional satisfying the following equation for any $\mu^{i+1} \in L_2(P^{i+2})$:
$$E_{P^{i+2}}[\pi^{i+1}_0(\mathbf{S}_{i+1}, \mathbf{B}'_{i+1})\,\mu^{i+1}(\mathbf{S}_{i+1}, \mathbf{B}'_{i+1})] = E_{P^{i+1}}\big[\pi^i_0(\mathbf{S}_i, \mathbf{B}'_i)\, E_{P^{i+1}}[\check\mu^{i+1}(\check{\mathbf{S}}_{i+1}) \mid \mathbf{S}_i, \mathbf{B}'_i]\big], \quad (9)$$
where a closed-form solution is given by
$$\pi^i_0 = \frac{\prod_{j=1}^i P^j(\mathbf{C}_j \mid \mathbf{S}_{j-1})\,\sigma^j_{\mathbf{R}_j}(\mathbf{R}_j \mid \mathbf{S}_j \setminus \mathbf{R}_j)}{P^{i+1}(\mathbf{S}_i, \mathbf{B}'_i)}. \quad (10)$$
For the example of FD, $\pi^2_0 = \frac{P(Z \mid x, C)}{P(Z \mid X, C)}$ and $\pi^1_0 = \frac{P(x)}{P(x \mid C)}$. Equipped with the ratio nuisances, UCA can be computed as follows:

Proposition 3. UCA in Eq. (1) can be parameterized as $\psi_0 = E_{P^{m+1}}[\pi^m_0 Y]$.

Estimating the ratio nuisances may be challenging due to distribution ratios of continuous or high-dimensional variables. To address this challenge, we use Bayes' rule to transform the distribution ratio into a more tractable form. For example, in FD, if the treatment $X$ is a single binary variable, instead of estimating $\pi^2_0 = \frac{P(Z \mid x, C)}{P(Z \mid X, C)}$, the equivalent estimand $\pi^2_0 = \frac{P(x \mid Z, C)\,P(X \mid C)}{P(X \mid Z, C)\,P(x \mid C)}$ can be estimated. This allows the use of off-the-shelf probabilistic classification methods for estimating distribution ratios, enabling scalable computation. A detailed procedure for ratio estimation is in Appendix C.2.

Combining the regression and ratio nuisances, we present a double/debiased machine learning (DML) (Chernozhukov et al., 2018)-based estimator $\hat\psi$ for the UCA, titled DML-UCA, in Algo. 2. We provide detailed nuisance specifications for various examples in Appendices A and B.

Algorithm 2: DML-UCA($\{D^i\}$, $L$)
1. (Sample splitting) For each $i \in [m+1]$, randomly split $D^i \overset{iid}{\sim} P^i$ into $L$ folds. Let $D^i_\ell$ denote the $\ell$-th partition, and define $D^i_{-\ell} := D^i \setminus D^i_\ell$. We use $\mathbf{W}(D^i_\ell)$ to refer to the samples of $\mathbf{W}$ in $D^i_\ell$.
2. for $\ell \in [L]$ do
3. &nbsp;&nbsp; for $i = m, \ldots, 1$ do
&nbsp;&nbsp;&nbsp;&nbsp;(a) Learn $\hat\mu^i_\ell(\mathbf{S}_i, \mathbf{B}'_i)$ by regressing $\check\mu^{i+1}_\ell(\check{\mathbf{S}}_{i+1}(D^{i+1}_{-\ell}))$ onto $\mathbf{S}_i(D^{i+1}_{-\ell}), \mathbf{B}'_i(D^{i+1}_{-\ell})$ (where $\check\mu^{m+1} := Y$).
&nbsp;&nbsp;&nbsp;&nbsp;(b) Evaluate $\check\mu^i_\ell(\check{\mathbf{S}}_i(D^i_\ell))$ using empirical bifurcation under the policy $\sigma^i_{\mathbf{R}_i}$.
&nbsp;&nbsp;&nbsp;&nbsp;(c) Compute $\check\mu^i_\ell := \check\mu^i_\ell(\check{\mathbf{S}}_i(D^i_\ell))$ by evaluating $\check\mu^i_\ell$ using samples $D^i_\ell$.
&nbsp;&nbsp;&nbsp;&nbsp;(d) For a nuisance parameter $\pi^i_0$ satisfying Eq. (9), learn $\hat\pi^i_\ell$ using samples $\{D^j_{-\ell} : j \in [i+1]\}$.
&nbsp;&nbsp;&nbsp;&nbsp;(e) Evaluate $\hat\pi^i_\ell := \hat\pi^i_\ell(\{D^j_\ell : j \in [i+1]\})$.
4. &nbsp;&nbsp; end
5. end
6. Return the DML-UCA estimator
$$\hat\psi := \frac{1}{L}\sum_{\ell=1}^L \Big\{ \sum_{i=1}^m E_{D^{i+1}_\ell}\big[\hat\pi^i_\ell(\check\mu^{i+1}_\ell - \hat\mu^i_\ell)\big] + E_{D^1_\ell}[\check\mu^1_\ell] \Big\}. \quad (8)$$

DML-UCA provides a scalable estimator for functionals expressible through UCA.
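The following is a condensed sketch of DML-UCA for the front-door example, written under simplifying assumptions: a single train/evaluation split instead of $L$-fold cross-fitting, linear/logistic nuisance models instead of the XGBoost learners used in the paper's experiments, and a toy SCM of our own. It is an illustration of the nuisance plumbing (empirical bifurcation, Bayes-rule ratios, and the one-step correction of Eq. (8)), not the authors' implementation.

```python
# Sketch of DML-UCA (Algo. 2) for the front-door example under simplifying
# assumptions: one train/eval split, linear/logistic nuisances, toy SCM.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 20_000
U = rng.normal(size=n)                                     # latent confounder of X and Y
C = rng.normal(size=n)
X = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * U + 0.5 * C))))
Z = 0.7 * X + 0.3 * C + rng.normal(size=n)                 # mediator
Y = 1.2 * Z + 0.8 * U + 0.4 * C + rng.normal(size=n)       # true E[Y | do(x)] = 0.84 x

tr = rng.permutation(n)[: n // 2]                          # training fold
ev = np.setdiff1d(np.arange(n), tr)                        # evaluation fold

def psi_hat(x):
    # --- regression nuisances (trained on the training fold) ---
    mu2 = LinearRegression().fit(np.c_[Z[tr], X[tr], C[tr]], Y[tr])   # mu2 ~ E[Y|Z,X,C]
    Xp_tr = rng.permutation(X[tr])                         # empirical bifurcation: X'
    mu2_chk_tr = mu2.predict(np.c_[Z[tr], Xp_tr, C[tr]])   # check-mu2 = mu2(Z, X', C)
    at_x = X[tr] == x                                      # rows with X fixed to x
    mu1 = LinearRegression().fit(np.c_[C[tr][at_x], Xp_tr[at_x]], mu2_chk_tr[at_x])

    # --- ratio nuisances via probabilistic classification (Bayes-rule form) ---
    e_zc = LogisticRegression().fit(np.c_[Z[tr], C[tr]], X[tr])       # P(X=1 | Z, C)
    e_c = LogisticRegression().fit(np.c_[C[tr]], X[tr])               # P(X=1 | C)
    p_x = (X[tr] == x).mean()                                         # P(X = x)

    # --- evaluate nuisances and the one-step correction on the held-out fold ---
    Xp = rng.permutation(X[ev])
    mu2_ev = mu2.predict(np.c_[Z[ev], X[ev], C[ev]])
    mu2_chk = mu2.predict(np.c_[Z[ev], Xp, C[ev]])
    mu1_chk = mu1.predict(np.c_[C[ev], X[ev]])             # check-mu1(C, X)
    mu1_ev = mu1.predict(np.c_[C[ev], Xp])                 # mu1(C, X')

    pz = e_zc.predict_proba(np.c_[Z[ev], C[ev]])           # columns: X=0, X=1
    pc = e_c.predict_proba(np.c_[C[ev]])
    rows = np.arange(len(ev))
    # pi2 = P(Z|x,C)/P(Z|X,C) = [P(x|Z,C)/P(X|Z,C)] * [P(X|C)/P(x|C)]
    pi2 = (pz[:, x] / pz[rows, X[ev]]) * (pc[rows, X[ev]] / pc[:, x])
    pi1 = p_x / pc[:, x]                                   # P(x)/P(x|C)

    at_x_ev = X[ev] == x
    term1 = np.mean(pi2 * (Y[ev] - mu2_ev))
    term2 = np.mean((pi1 * (mu2_chk - mu1_ev))[at_x_ev])   # E_P[ . | X = x]
    term3 = np.mean(mu1_chk)
    return term1 + term2 + term3

print(psi_hat(1) - psi_hat(0))   # compare with the true contrast 0.84 in this toy SCM
```

The three terms mirror the nuisance specification for FD given in Appendix A.1; swapping the linear/logistic learners for XGBoost and averaging over $L$ cross-fitting folds would recover the full Algo. 2 recipe.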
When the target query is BD/SBD, DML-UCA aligns with existing doubly robust SBD estimators (Bang and Robins, 2005; Robins et al., 2009; van der Laan and Gruber, 2012; Rotnitzky et al., 2017; Luedtke et al., 2017; Díaz et al., 2023). Beyond SBD, DML-UCA can be evaluated in polynomial time relative to the number of variables and samples, ensuring its scalability:

Theorem 2 (Scalability). Algo. 2 runs in $O(K n_{\max} + T(m, n_{\max}, K))$ time, where $K$ is the number of distinct distributions in $\mathcal{P}$, $n_{\max} := \max\{|D^k| : k \in [K]\}$, and $T(m, n_{\max}, K)$ is the time complexity for learning the nuisances $\hat\mu^i_\ell$ and $\hat\pi^i_\ell$. Specifically, $O(T(m, n_{\max}, K)) = O(K \cdot L \cdot (T_\mu + T_\pi))$, where $T_\mu := \max\{T_{\hat\mu^i_\ell} : i \in [m], \ell \in [L]\}$, $T_\pi := \max\{T_{\hat\pi^i_\ell} : i \in [m], \ell \in [L]\}$, and $T_{\hat\mu^i_\ell}$ and $T_{\hat\pi^i_\ell}$ denote the time complexity for learning and evaluating $\hat\mu^i_\ell$ and $\hat\pi^i_\ell$, respectively.

| Estimand | Estimator | Complexity |
| --- | --- | --- |
| BD/SBD | Plug-in | $O(n 2^m)$ |
| BD/SBD | IPW (Rosenbaum and Rubin, 1983) | $O(n + T(m, n))$ |
| BD/SBD | OM (Robins, 1986) | $O(n + T(m, n))$ |
| BD/SBD | AIPW (Rotnitzky et al., 1998) | $O(n + T(m, n))$ |
| FD | Fulcher et al. (2019); Guo et al. (2023) | $O(n 2^m + T(m, n))$ |
| Tian's | Bhattacharya et al. (2022) | $O(n 2^m + T(m, n))$ |
| UCA | DML-UCA (BD, FD, and Tian's) | $O(n + T(m, n))$ |
| UCA | DML-UCA (general) | $O(K n_{\max} + T(m, n_{\max}, K))$ |
| obs-ID | DML-ID (Jung et al., 2021a) | $O(n 2^{2m} + T(m, n))$ |
| gID | DML-gID (Jung et al., 2023a) | $O(K n_{\max} 2^{2m} + T(m, n_{\max}, K))$ |

Table 2: Comparison of time complexities of existing estimators: $n_{\max} := \max_i |D^i|$ is the number of samples, $m$ is the number of variables, and $T(m, n_{\max}, K)$ (or $T(m, n) := T(m, n_{\max} = n, K = 1)$) is the time complexity for learning nuisance parameters for the target functional. The plug-in estimator for BD is one where $E_P[Y \mid x, z]$ and $P(z)$ are estimated from data and $\sum_z E_P[Y \mid x, z]P(z)$ is evaluated. Details are in Sec. C.4.

As an example, for XGBoost (Chen and Guestrin, 2016), $T_\pi = T_\mu = O(\mathrm{num}_{\mathrm{tree}} \cdot \mathrm{depth}_{\mathrm{tree}} \cdot n_{\max} \log n_{\max})$, where $\mathrm{num}_{\mathrm{tree}}$ and $\mathrm{depth}_{\mathrm{tree}}$ are the number and depth of trees in XGBoost.

Table 2 summarizes the comparison of time complexities of existing estimators. As shown in the table, scalable estimators with polynomial time complexity have only been developed for BD/SBD estimands. Existing estimators beyond SBD often lack scalability. For instance, existing estimators for FD (Fulcher et al., 2019; Guo et al., 2023) or Tian's adjustment (Bhattacharya et al., 2022) face exponential time complexity in the dimension of the mediators. In contrast, DML-UCA's polynomial time complexity positions it as a uniquely scalable solution within the UCA functional class, which includes FD and Tian's adjustment as special cases. For general obs-ID/gID estimands beyond the UCA class, scalable estimators have yet to be developed.

3.1 Error analysis

In this section, we show that DML-UCA exhibits double robustness, in addition to scalability. Since UCA is composed of multiple (possibly distinct) distributions, we provide a tool to distinguish them.

Definition 3 (Index set). The index sets $I_1, \ldots, I_K$ partition $\{1, \ldots, m+1\}$ such that indices $i$ and $j$ are in the same set $I_k$ if and only if $P^i(\mathbf{V}) = P^j(\mathbf{V})$. We will use $P_k$ for $k = 1, \ldots, K$ to denote the distribution $P^i$ for $i \in I_k$.

Then, the functional $\Psi[\mathcal{P}; \sigma]$ in Eq. (1) can be written as follows:
$$\Psi[\mathcal{P}; \sigma] = \Psi[\{P_k : k = 1, \ldots, K\}; \sigma]. \quad (11)$$
Since multiple distributions are involved in UCA, deriving an influence function for each distribution $P_k$ becomes necessary. A standard influence function is typically defined for a single distribution $P$ and thus does not suffice for studying the multi-distribution setting.
To address the issue, we employ a partial influence function (PIF) (Pires and Branco, 2002), an influence function defined relative to each $P_k$. A formal definition is in Appendix C. For UCA, PIFs are given as follows:

Theorem 3 (PIF for UCA). Assume that $|\mu^i_0| < \infty$ and $0 < \pi^i_0 < \infty$ almost surely for $i = 1, \ldots, m$. Define $\eta^1_0 := \{\mu^1_0\}$ and $\eta^i_0 := \{\pi^{i-1}_0, \mu^i_0, \mu^{i-1}_0\}$ for $i = 2, \ldots, m+1$, and
$$\varphi^i(\check{\mathbf{S}}_i; \eta^i_0, \psi_0) := \begin{cases} \pi^{i-1}_0\{\check\mu^i_0 - \mu^{i-1}_0\} & \text{if } i > 1, \\ \check\mu^1_0 - \psi_0 & \text{if } i = 1. \end{cases} \quad (12)$$
Let $\mathbf{V}^k := \cup_{i \in I_k} \check{\mathbf{S}}_i$ and $\eta^k_0 := \cup_{i \in I_k} \eta^i_0$. Then, the $k$-th PIF for UCA is $\phi^k_0 := \phi^k(\mathbf{V}^k; \eta^k_0, \psi_0) := \sum_{i \in I_k} \varphi^i(\check{\mathbf{S}}_i; \eta^i_0, \psi_0)$.

Equipped with PIFs, we provide a finite-sample guarantee for DML-UCA, extending Chernozhukov et al. (2023), which analyzed DML estimators for BD.

Theorem 4 (Finite sample guarantee). Suppose $|\mu^i_0|, |\hat\mu^i_\ell| < \infty$ and $0 < \pi^i_0, \hat\pi^i_\ell < \infty$ almost surely for $i = 1, \ldots, m$. Suppose the third moments of $\phi^k_0$ for $k = 1, \ldots, K$ exist. Let $\phi^k_0 := \phi^k(\mathbf{V}^k; \eta^k_0, \psi_0)$ and $\hat\phi^k_\ell := \phi^k(\mathbf{V}^k; \hat\eta^k_\ell, \psi_0)$. Let $R^k_1 := (1/L)\sum_{\ell=1}^L (E_{D^k_\ell}[\hat\phi^k_\ell] - E_{P_k}[\hat\phi^k_\ell])$. Then,

1. The error $\hat\psi - \psi_0$ decomposes as
$$\hat\psi - \psi_0 = \sum_{k=1}^K R^k_1 + \frac{1}{L}\sum_{\ell=1}^L R^\ell_2, \qquad R^\ell_2 := \sum_{i=1}^m E_{P^{i+1}}[(\hat\pi^i_\ell - \pi^i_0)(\mu^i_0 - \hat\mu^i_\ell)]. \quad (13)$$

2. Let $\rho^2_{k,0} := V_{P_k}[\phi^k_0]$. With probability greater than $1 - \epsilon$, $R^k_1$ is bounded, up to a constant depending on $\epsilon$, in terms of $\rho^2_{k,0}/|D^k|$ and $\|\hat\phi^k_\ell - \phi^k_0\|^2_{P_k}/|D^k_\ell|$ (Eq. (14)).

3. Let $\kappa^3_{k,0} := E_{P_k}[|\phi^k_0|^3]$ and let $\mathrm{NORMAL}(x)$ denote the standard normal CDF. With probability greater than $1 - \epsilon$, the Kolmogorov distance $\sup_x \big| P\big(\sqrt{|D^k|}\,R^k_1/\rho_{k,0} < x\big) - \mathrm{NORMAL}(x) \big|$ is bounded, up to constants, by terms of order $\|\hat\phi^k_\ell - \phi^k_0\|^2_{P_k}/|D^k_\ell|$ plus the Berry-Esseen term $0.4748\,\kappa^3_{k,0}/(\rho^3_{k,0}\sqrt{|D^k|})$ (Eq. (15)).

This is a novel finite-sample guarantee for DML-based estimators for functionals beyond SBD. Finite-sample analyses for functionals beyond SBD have been studied only for non-doubly-robust estimators (Bhattacharyya et al., 2022). For doubly robust estimators, only asymptotic analyses were available for FD (Fulcher et al., 2019; Guo et al., 2023), Tian's adjustment (Bhattacharya et al., 2022), and obs-ID (Jung et al., 2021b). Thm. 4 shows that the error decomposes into two terms, $R^k_1$ and $R^\ell_2$. The term $R^k_1$ closely approximates a (scaled) standard normal random variable, and $R^\ell_2$, which comprises the errors of $(\hat\pi^i_\ell, \hat\pi^{i-1}_\ell)$ and $\hat\mu^i_\ell$, exhibits doubly robust behavior. Specifically, if the nuisance estimates $\hat\mu^i_\ell$, $\hat\pi^i_\ell$, and $\hat\pi^{i-1}_\ell$ converge at rate $n^{-1/4}$ (where $n$ is the size of the smallest sample set), then DML-UCA converges at the faster rate $n^{-1/2}$. This point becomes evident in the corresponding asymptotic analysis:

Corollary 4 (Asymptotic error). Assume $|\mu^i_0|, |\hat\mu^i_\ell| < \infty$ and $0 < \pi^i_0, \hat\pi^i_\ell < \infty$ almost surely. Suppose the map $\hat\eta^k_\ell \mapsto \hat\phi^k_\ell$ is uniformly differentiable with respect to $\hat\eta^k_\ell$ and the derivative of $\hat\phi^k_\ell$ w.r.t. $\hat\eta^k_\ell$ is bounded by a constant. Suppose $\hat\mu^i_\ell$ and $\hat\pi^i_\ell$ are $L_2$-consistent. Then,
$$\hat\psi - \psi_0 = \sum_{k=1}^K R^k_1 + \frac{1}{L}\sum_{\ell=1}^L \sum_{i=1}^m O_{P^{i+1}}\big(\|\hat\mu^i_\ell - \mu^i_0\| \cdot \|\hat\pi^i_\ell - \pi^i_0\|\big),$$
and $\sqrt{|D^k|}\,R^k_1$ converges in distribution to $\mathrm{normal}(0, \rho^2_{k,0})$.

4 Experiments

In this section, we demonstrate the scalability and double robustness of the DML-UCA estimator, where nuisances are learned through XGBoost (Chen and Guestrin, 2016). We specify an SCM $M$ for FD (Fig. 1a), Verma (Fig. 1b), and the example graph of (Jung et al., 2021a) (Fig. 1e), and generate datasets $D^k \sim P_k$ from the SCM. The target estimand is denoted $\psi_0$. Details are in Appendix F; further simulations are provided in Appendix E.
Scalability. To demonstrate the scalability of DML-UCA, we compare its running time with the existing estimators of (Fulcher et al., 2019) (FD) and (Jung et al., 2021a) (Verma's equation and the estimand associated with Fig. 1e, $E[Y \mid do(x_1, x_2)] = \sum_{x_1', r, z} E_P[Y \mid r, x_1', x_2, z]\,P(r \mid x_1, z)\,P(z, x_1')$, which we call "Jung's equation").

Figure 2: Comparison of DML-UCA ("DML") with existing estimators using (top) run-time plots (x-axis: the dimension of the summed variables; y-axis: running time) and (bottom) AAE plots (x-axis: the sample size; y-axis: errors). DML-UCA is compared with (a, d) Fulcher et al. (2019) for FD; (b, e) Jung et al. (2021a) for Verma's equation; and (c, f) Jung et al. (2021a) for Jung's equation.

For each example, we increment the dimension of the summed variables, run 100 simulations, take the average of the running times, and compare these averages. We label this plot as the run-time plot, presented in the top row of Fig. 2. In the comparison with (Fulcher et al., 2019) for FD in Fig. 2a, we fix $|C| = 2$ and increment $|Z| \in \{2, 4, 6, 8, 12, 20, 30, 50, 100\}$. When comparing with (Jung et al., 2021a) for Verma's equation in Fig. 2b, we fix $|A| = 2$ and increment $|B| \in \{2, 4, 6, 8, 12, 20, 30, 50, 100\}$. For Jung's equation in Fig. 2c, we fix $|Z| = 2$ and increment $|R| \in \{2, 4, 6, 8, 12, 20, 30, 100\}$. The timeout for the run time is set to 300 seconds. In all scenarios, the run time of the existing estimators increases rapidly with dimension due to the summation operation, while DML-UCA scales well for high-dimensional covariates.

Double robustness. To demonstrate double robustness, we compare the error of DML-UCA with existing estimators for FD (Fulcher et al., 2019) and for Verma's and Jung's equations (Jung et al., 2021a). We use $\hat\psi_{\mathrm{est}}$ for $\mathrm{est} \in \{\mathrm{DML}, \mathrm{Fulcher}, \mathrm{Jung}\}$ to denote each estimator. We use the average absolute error (AAE), which is the average of the error of the estimated versus true causal effect of $X = x$: $\frac{1}{|\mathrm{domain}(X)|}\sum_{x \in \mathrm{domain}(X)} |\hat\psi_{\mathrm{est}}(x) - \psi_0(x)|$. To witness the fast convergence of DML-UCA, we enforce the convergence rate of the nuisance estimates to be no faster than $n^{-1/4}$ by adding a noise term $\epsilon \sim \mathrm{normal}(n^{-1/4}, n^{-1/4})$ to the nuisances, inspired by the experimental design in (Kennedy, 2023). We ran 100 simulations for each sample size $n \in \{2500, 5000, 10000, 20000\}$. We label the plot as the AAE plot, presented in the bottom row of Fig. 2. For each example, DML-UCA outperforms the other estimators, exhibiting fast convergence.

5 Conclusions

We introduce a framework that encompasses a broad class of sum-product causal estimands, called the UCA class, for which scalable estimators were previously unavailable. We demonstrate the expressiveness of the UCA class, which includes not only BD/SBD but also broader classes such as Tian's adjustment (incorporating FD and Verma) and Ctf-DE, for which the existing SBD-based framework is not applicable. We develop an estimator for UCA, called DML-UCA, that can estimate the target estimand in polynomial time relative to the number of samples and variables, ensuring scalability. We provide finite-sample guarantees and a corresponding asymptotic error analysis for DML-UCA, demonstrating its fast convergence. These scalability and fast convergence properties are empirically verified through simulations.
Our results pave the way toward developing an estimation framework maximizing both coverage and scalability in Table 1. Acknowledgments We thank anonymous reviewers for constructive comments to improve the manuscript. This research is supported in part by the NSF, ONR, AFOSR, Do E, Amazon, JP Morgan, and The Alfred P. Sloan Foundation. Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962 973. Bareinboim, E., Correa, J. D., Ibeling, D., and Icard, T. (2020). On pearl s hierarchy and the foundations of causal inference. In Probabilistic and Causal Inference: The Works of Judea Pearl, Technical Report R-60. Causal Artificial Intelligence Laboratory, Columbia University. Bareinboim, E. and Pearl, J. (2012a). Causal inference by surrogate experiments: z-identifiability. In In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, pages 113 120. AUAI Press. Bareinboim, E. and Pearl, J. (2012b). Transportability of causal effects: Completeness results. In Proceedings of the 26th AAAI Conference on Artificial Intelligence, pages 698 704. Bareinboim, E. and Pearl, J. (2016). Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345 7352. Bareinboim, E., Tian, J., and Pearl, J. (2014). Recovering from selection bias in causal and statistical inference. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, pages 2410 2416. Berry, A. C. (1941). The accuracy of the gaussian approximation to the sum of independent variates. Transactions of the american mathematical society, 49(1):122 136. Bhattacharya, R., Nabi, R., and Shpitser, I. (2022). Semiparametric inference for causal effects in graphical models with hidden variables. Journal of Machine Learning Research, 23:1 76. Bhattacharyya, A., Gayen, S., Kandasamy, S., Raval, V., and Variyam, V. N. (2022). Efficient interventional distribution learning in the pac framework. In International Conference on Artificial Intelligence and Statistics, pages 7531 7549. PMLR. Bickel, P. J., Klaassen, C. A., Bickel, P. J., Ritov, Y., Klaassen, J., Wellner, J. A., and Ritov, Y. (1993). Efficient and adaptive estimation for semiparametric models, volume 4. Johns Hopkins University Press Baltimore. Chen, T. and Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785 794. Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters: Double/debiased machine learning. The Econometrics Journal, 21(1). Chernozhukov, V., Newey, W., Singh, R., and Syrgkanis, V. (2022). Automatic debiased machine learning for dynamic treatment effects and general nested functionals. ar Xiv preprint ar Xiv:2203.13887. Chernozhukov, V., Newey, W. K., and Singh, R. (2023). A simple and general debiased machine learning theorem with finite-sample guarantees. Biometrika, 110(1):257 264. Correa, J., Lee, S., and Bareinboim, E. (2021). Nested counterfactual identification from arbitrary surrogate experiments. Advances in Neural Information Processing Systems, 34. Correa, J. D., Tian, J., and Bareinboim, E. (2018). Generalized adjustment under confounding and selection biases. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence. Díaz, I., Williams, N., Hoffman, K. L., and Schenck, E. J. 
(2023). Nonparametric causal effects based on longitudinal modified treatment policies. Journal of the American Statistical Association, 118(542):846 857. Esseen, C.-G. (1942). On the liapunov limit error in the theory of probability. Ark. Mat. Astr. Fys., 28:1 19. Fulcher, I. R., Shpitser, I., Marealle, S., and Tchetgen Tchetgen, E. J. (2019). Robust inference on population indirect causal effects: the generalized front door criterion. Journal of the Royal Statistical Society: Series B (Statistical Methodology). Guo, A., Benkeser, D., and Nabi, R. (2023). Targeted machine learning for average causal effect estimation using the front-door functional. ar Xiv preprint ar Xiv:2312.10234. Heckman, J. J. (1992). Randomization and Social Policy Evaluation. In Manski, C. and Garfinkle, I., editors, Evaluations: Welfare and Training Programs, pages 201 230. Harvard University Press, Cambridge, MA. Huang, Y. and Valtorta, M. (2006). Pearl s calculus of intervention is complete. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, pages 217 224. AUAI Press. Jung, Y., Díaz, I., Tian, J., and Bareinboim, E. (2023a). Estimating causal effects identifiable from combination of observations and experiments. In Proceedings of the 37th Neural Information Processing Systems. Jung, Y., Tian, J., and Bareinboim, E. (2021a). Estimating identifiable causal effects on markov equivalence class through double machine learning. In Proceedings of the 38th International Conference on Machine Learning. Jung, Y., Tian, J., and Bareinboim, E. (2021b). Estimating identifiable causal effects through double machine learning. In Proceedings of the 35th AAAI Conference on Artificial Intelligence. Jung, Y., Tian, J., and Bareinboim, E. (2023b). Estimating joint treatment effects by combining multiple experiments. In Proceedings of the 40th International Conference on Machine Learning. Kennedy, E. H. (2023). Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics, 17(2):3008 3049. Kennedy, E. H., Balakrishnan, S., G Sell, M., et al. (2020). Sharp instruments for classifying compliers and generalizing causal effects. Annals of Statistics, 48(4):2008 2030. Le Cun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. nature, 521(7553):436 444. Lee, S. and Bareinboim, E. (2020). Causal effect identifiability under partial-observability. In Proceedings of the 37th International Conference on Machine Learning. Lee, S., Correa, J., and Bareinboim, E. (2020). General transportability synthesizing observations and experiments from heterogeneous domains. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10210 10217. Lee, S., Correa, J. D., and Bareinboim, E. (2019). General identifiability with arbitrary surrogate experiments. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence. AUAI Press. Li, S. and Luedtke, A. (2023). Efficient estimation under data fusion. Biometrika, 110(4):1041 1054. Luedtke, A. R., Sofrygin, O., van der Laan, M. J., and Carone, M. (2017). Sequential double robustness in right-censored longitudinal models. ar Xiv preprint ar Xiv:1705.02459. Mises, R. v. (1947). On the asymptotic distribution of differentiable statistical functions. The annals of mathematical statistics, 18(3):309 348. Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society Series B: Statistical Methodology, 65(2):331 355. Pearl, J. (1995). 
Causal diagrams for empirical research. Biometrika, 82(4):669 710. Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press, New York. 2nd edition, 2009. Pearl, J. and Mackenzie, D. (2018). The book of why: the new science of cause and effect. Basic Books. Pearl, J. and Robins, J. (1995a). Probabilistic evaluation of sequential plans from causal models with hidden variables. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, pages 444 453. Morgan Kaufmann Publishers Inc. Pearl, J. and Robins, J. (1995b). Probabilistic evaluation of sequential plans from causal models with hidden variables. In Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, pages 444 453. Pires, A. M. and Branco, J. A. (2002). Partial influence functions. Journal of Multivariate Analysis, 83(2):451 468. Pleˇcko, D. and Bareinboim, E. (2024). Causal fairness analysis: A causal toolkit for fair machine learning. Foundations and Trends in Machine Learning, 17(3):304 589. Quintas-Martinez, V., Bahadori, M. T., Santiago, E., Mu, J., Janzing, D., and Heckerman, D. (2024). Multiply-robust causal change attribution. ar Xiv preprint ar Xiv:2404.08839. Robins, J. (1986). A new approach to causal inference in mortality studies with a sustained exposure period application to control of the healthy worker survivor effect. Mathematical modelling, 7(9-12):1393 1512. Robins, J., Li, L., Tchetgen, E., and van der Vaart, A. W. (2009). Quadratic semiparametric von mises calculus. Metrika, 69:227 247. Robins, J. M. and Rotnitzky, A. (1995). Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429):122 129. Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41 55. Rotnitzky, A., Robins, J., and Babino, L. (2017). On the multiply robust estimation of the mean of the g-functional. ar Xiv preprint ar Xiv:1705.08582. Rotnitzky, A., Robins, J. M., and Scharfstein, D. O. (1998). Semiparametric regression for repeated outcomes with nonignorable nonresponse. Journal of the american statistical association, 93(444):1321 1339. Shevtsova, I. (2014). On the absolute constants in the berry-esseen-type inequalities. In Doklady Mathematics, volume 89, pages 378 381. Springer. Shpitser, I. and Pearl, J. (2006). Identification of joint interventional distributions in recursive semimarkovian causal models. In Proceedings of the 21st AAAI Conference on Artificial Intelligence, page 1219. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999. Tian, J. and Pearl, J. (2002a). A general identification condition for causal effects. In Proceedings of the 18th National Conference on Artificial Intelligence, pages 567 573. Tian, J. and Pearl, J. (2002b). On the testable implications of causal models with hidden variables. In Proceedings of the 18th conference on Uncertainty in artificial intelligence, pages 519 527. Morgan Kaufmann Publishers Inc. Tian, J. and Pearl, J. (2003). On the identification of causal effects. Technical Report R-290-L, UCLA. Uehara, M., Shi, C., and Kallus, N. (2022). A review of off-policy evaluation in reinforcement learning. ar Xiv preprint ar Xiv:2212.06355. van der Laan, M. J. and Gruber, S. (2012). Targeted minimum loss based estimation of causal effects of multiple time point interventions. The international journal of biostatistics, 8(1). Verma, T. 
and Pearl, J. (1990). Causal networks: Semantics and expressiveness. In Machine Intelligence and Pattern Recognition, volume 9, pages 69-76. Elsevier. Xia, K., Lee, K.-Z., Bengio, Y., and Bareinboim, E. (2021). The causal-neural connection: Expressiveness, learnability, and inference. Advances in Neural Information Processing Systems, 34. Xia, K. M., Pan, Y., and Bareinboim, E. (2022). Neural causal models for counterfactual identification and estimation. In The Eleventh International Conference on Learning Representations. Xu, L. and Gretton, A. (2022). A neural mean embedding approach for back-door and front-door adjustment. In The Eleventh International Conference on Learning Representations.

Supplement to "Unified Covariate Adjustment for Causal Inference"

Contents:
1 Introduction
2 Unified Covariate Adjustment
3 Scalable Estimator for Unified Covariate Adjustment
3.1 Error analysis
4 Experiments
5 Conclusions
A Nuisance Specification
A.1 Front-door adjustment in Example 1
A.2 Verma's equation in Example 2
A.3 Counterfactual directed effect in Example 3
A.4 Example Estimand for Fig. 1e
B More UCA Examples
B.1 Effect of the treatment on the treated (ETT)
B.2 Transportability (S-admissibility)
B.3 Off-policy evaluation
B.4 Treatment-treatment interactions
C More Results
C.1 Formal definition of Partial influence function (PIF)
C.2 Density Ratio Estimation
C.3 Analysis of non-UCA functionals
C.3.1 On Case 1
C.3.2 On Case 2
C.4 Time Complexity
C.4.1 BD/SBD
C.4.2 Front-door adjustment (FD)
C.4.3 Tian's adjustment
C.4.4 DML-UCA (BD, FD, and Tian's)
C.4.5 DML-UCA (general)
C.4.6 DML-ID (obs-ID)
C.4.7 DML-gID (gID)
D Proofs
D.1 Proof of Proposition 1
D.2 Proof for Proposition 2
D.3 Proof for Proposition 3
D.4 Proof for Theorem 1
D.5 Proof for Theorem 2
D.6 Proof for Theorem 3
D.7 Proof for Theorem 4
D.7.1 Helper lemmas
D.7.2 Preliminary Results
D.7.3 Proof of Theorem 4 - (1)
D.7.4 Proof of Theorem 4 - (2)
D.7.5 Proof of Theorem 4 - (3)
D.8 Proof for Corollary 4
E More Experiments
F Details in Experiments
F.1 FD (Fig. 1a) for Simulation in Fig. 2a
F.2 Verma (Fig. 1b) for Simulation in Fig. 2b
F.3 Example estimand (Fig. 1e) for Simulation in Fig. 2c
F.4 ETT in Sec. B for Simulation in Fig. E.4a
F.5 Transportability in Sec. B for Simulation in Fig. E.4b
F.6 FD with continuous mediators for Simulation in Fig. E.4c
F.7 Verma's equation with continuous mediators for Simulation in Fig. E.4d
F.8 Ctf-DE in Example 3 for Simulation in Fig. E.4e

A Nuisance Specification

A.1 Front-door adjustment in Example 1

We first note that the front-door adjustment is an expectation of the following product: $P(C, X)\,P(Z \mid x, C)\,P(Y \mid C, X, Z)$, which implies that
- $\mathbf{C}_1 = \{C, X\}$; $\mathbf{C}_2 = \{Z\}$; $\mathbf{C}_3 = \{Y\}$ (i.e., $m = 2$).
- $\mathbf{S}_1 = \{C\}$ and $\mathbf{S}_2 = \{C, X, Z\}$.
- $\mathbf{B}_1 = \mathbf{S}_2 \cap \mathbf{C}_1 \cap \mathbf{S}^b_1 = \{X\}$.
- $\check{\mathbf{S}}_2 = \{C, X', Z\}$ and $\check{\mathbf{S}}_1 = \{C, X\}$.
- $P^2 = P(\cdot \mid x)$.

The regression nuisances are the following:
- $\mu^2_0(\mathbf{S}_2) := \mu^2_0(Z, X, C) := E_P[Y \mid Z, X, C]$
- $\check\mu^2_0(\check{\mathbf{S}}_2) := \mu^2_0(Z, X', C)$
- $\mu^1_0(\mathbf{S}_1, \mathbf{B}'_1) := \mu^1_0(X', C) := E_P[\mu^2_0(Z, X', C) \mid x, C, X']$
- $\check\mu^1_0(\check{\mathbf{S}}_1) = \mu^1_0(X, C)$.

The ratio nuisances are the following:
$$\pi^2_0(Z, X, C) = \frac{P(Z \mid x, C)}{P(Z \mid X, C)}, \quad (A.1)$$
$$\pi^1_0(C) = \frac{P(x)}{P(x \mid C)}. \quad (A.2)$$

The representation for DML-UCA is
$$E_P[\pi^2_0(Z, X, C)\{Y - \mu^2_0(Z, X, C)\}] + E_P[\pi^1_0(C)\{\mu^2_0(Z, X', C) - \mu^1_0(X', C)\} \mid x] + E_P[\mu^1_0(X, C)].$$

A.2 Verma's equation in Example 2

We first note that Verma's equation in Eq. (3) is an expectation of the following product: $P(X)\,P(A \mid x)\,P(B \mid A, X)\,P(Y \mid B, A, x)$, which implies that
- $\mathbf{C}_1 = \{X\}$; $\mathbf{C}_2 = \{A\}$; $\mathbf{C}_3 = \{B\}$; and $\mathbf{C}_4 = \{Y\}$ (i.e., $m = 3$).
- $\mathbf{S}_1 = \emptyset$; $\mathbf{S}_2 = \{A, X\}$; and $\mathbf{S}_3 = \{B, A\}$.
- $\mathbf{S}^b_1 = \{X\}$ and $\mathbf{S}^b_3 = \{X\}$.
- $\mathbf{B}_1 = \mathbf{S}_2 \cap \mathbf{C}_1 \cap \mathbf{S}^b_1 = \{X\}$.
- $\check{\mathbf{S}}_3 = \{B, A\}$; $\check{\mathbf{S}}_2 = \{A, X'\}$; and $\check{\mathbf{S}}_1 = \{X\}$.
- $P^2 = P^4 = P(\cdot \mid x)$.

The regression nuisances are the following:
- $\mu^3_0(\mathbf{S}_3) := \mu^3_0(B, A) := E_P[Y \mid B, A, x]$
- $\check\mu^3_0(\check{\mathbf{S}}_3) := \mu^3_0(B, A) = E_P[Y \mid B, A, x]$
- $\mu^2_0(\mathbf{S}_2) := \mu^2_0(A, X) := E_P[\mu^3_0(B, A) \mid A, X]$
- $\check\mu^2_0(\check{\mathbf{S}}_2) = \mu^2_0(A, X')$
- $\mu^1_0(\mathbf{S}_1, \mathbf{B}'_1) := E_P[\mu^2_0(A, X') \mid x, X']$
- $\check\mu^1_0(\check{\mathbf{S}}_1) := \mu^1_0(X)$.

The ratio nuisances are the following:
$$\pi^3_0(B, A) = \frac{\sum_{x'} P(B \mid A, x')\,P(x')}{P(B \mid A, x)}, \quad (A.3)$$
$$\pi^2_0(A, X) = \frac{P(A \mid x)}{P(A \mid X)}. \quad (A.4)$$

The representation for DML-UCA is
$$E_P[\pi^3_0(B, A)\{Y - \mu^3_0(B, A)\} \mid x] + E_P[\pi^2_0(A, X)\{\mu^3_0(B, A) - \mu^2_0(A, X)\}] + E_P[\check\mu^2_0(A, X')].$$

A.3 Counterfactual directed effect in Example 3

Recall that Ctf-DE in Eq. (5) is represented as the expectation of $Y$ over $P(Y \mid X = x_0, W, Z)\,P(W \mid X, Z)\,P(Z \mid X = x_2)\,\mathbb{1}_{x_1}(X)$. Set
- $\mathbf{C}_1 = \{Z\}$; $\mathbf{C}_2 = \{W\}$; and $\mathbf{C}_3 = \{Y\}$.
- $\mathbf{S}_1 = \{X, Z\}$ and $\mathbf{S}_2 = \{W, Z\}$.
- $\mathbf{S}^b_0 = \{X\}$; $\mathbf{S}^b_1 = \{X\}$; $\mathbf{S}^b_2 = \{X\}$.
- $\mathbf{B}_i = \emptyset$ for all $i$.
- $\check{\mathbf{S}}_i = \mathbf{S}_i \setminus \mathbf{R}_i$ for all $i$.

The regression nuisances are the following:
- $\mu^2_0(\mathbf{S}_2) := \mu^2_0(W, Z) := E_P[Y \mid W, x_0, Z]$
- $\check\mu^2_0(\check{\mathbf{S}}_2) := \mu^2_0(W, Z) = E_P[Y \mid W, x_0, Z]$
- $\mu^1_0(\mathbf{S}_1) := \mu^1_0(X, Z) := E_P[\mu^2_0(W, Z) \mid X, Z]$
- $\check\mu^1_0(\check{\mathbf{S}}_1) = \mu^1_0(x_1, Z)$.

The ratio nuisances are the following:
$$\pi^2_0(W, Z) = \frac{P(W \mid x_1, Z)\,P(Z \mid x_2)}{P(W, Z \mid x_0)}, \quad (A.6)$$
$$\pi^1_0(X, Z) = \frac{\mathbb{1}_{x_1}(X)\,P(Z \mid x_2)}{P(X \mid Z)\,P(Z)}. \quad (A.7)$$

The representation for DML-UCA is
$$E_P[\pi^2_0(W, Z)\{Y - \mu^2_0(W, Z)\} \mid X = x_0] + E_P[\pi^1_0(X, Z)\{\mu^2_0(W, Z) - \mu^1_0(X, Z)\}] + E_P[\mu^1_0(x_1, Z) \mid x_2].$$

A.4 Example Estimand for Fig. 1e

Given Fig. 1e, the causal effect is given as
$$E[Y \mid do(x_1, x_2)] = \sum_{r, z, x_1'} E_P[Y \mid r, x_2, z, x_1']\,P(r \mid x_1, z)\,P(x_1', z),$$
which is the expectation of $Y$ over the probability measure $P(Y \mid R, X_2, Z, X_1)\,P(R \mid x_1, Z)\,P(X_1, Z)\,\mathbb{1}_{x_2}(X_2)$.
- $\mathbf{C}_1 = \{X_1, Z\}$; $\mathbf{C}_2 = \{R\}$; $\mathbf{C}_3 = \{Y\}$.
- $\mathbf{R}_1 = \emptyset$ and $\mathbf{R}_2 = \{X_2\}$ with $\sigma^2_{\mathbf{R}_2} = \mathbb{1}_{x_2}(X_2)$.
- $\mathbf{S}_1 = \{Z\}$ and $\mathbf{S}_2 = \{R, X_2, Z, X_1\}$.
- $\mathbf{S}^b_1 = \{X_1\}$.
- $\mathbf{B}_1 = \mathbf{S}_2 \cap \mathbf{C}_1 \cap \mathbf{S}^b_1 = \{X_1\}$.
- $\check{\mathbf{S}}_2 = \{R, X_1', Z\}$ and $\check{\mathbf{S}}_1 = \mathbf{S}_1$.
- $P^2 = P(\cdot \mid x_1)$.

The regression nuisances are the following:
- $\mu^2_0(\mathbf{S}_2) := \mu^2_0(R, X_2, Z, X_1) := E_P[Y \mid R, X_2, Z, X_1]$
- $\check\mu^2_0(\check{\mathbf{S}}_2) := \check\mu^2_0(R, Z, X_1') = E_P[Y \mid R, x_2, Z, X_1']$
- $\mu^1_0(\mathbf{S}_1, \mathbf{B}'_1) := \mu^1_0(Z, X_1') := E_P[\check\mu^2_0(R, Z, X_1') \mid x_1, Z, X_1']$
- $\check\mu^1_0(\check{\mathbf{S}}_1) = \mu^1_0(Z, X_1)$.

The ratio nuisances are the following:
$$\pi^2_0(X_2, R, X_1, Z) = \frac{\mathbb{1}_{x_2}(X_2)}{P(X_2 \mid R, X_1, Z)} \cdot \frac{P(R \mid x_1, Z)}{P(R \mid X_1, Z)}, \quad (A.8)$$
$$\pi^1_0(Z) = \frac{P(Z)}{P(Z \mid x_1)} = \frac{P(x_1)\,P(Z)}{P(x_1 \mid Z)\,P(Z)} = \frac{P(x_1)}{P(x_1 \mid Z)}. \quad (A.9)$$

The representation for DML-UCA is
$$E_P[\pi^2_0(X_2, R, X_1, Z)\{Y - \mu^2_0(R, X_2, Z, X_1)\}] + E_P[\pi^1_0(Z)\{\check\mu^2_0(R, Z, X_1') - \mu^1_0(Z, X_1')\} \mid X_1 = x_1] + E_P[\mu^1_0(Z, X_1)].$$

B More UCA Examples

B.1 Effect of the treatment on the treated (ETT)

Let $\mathbf{V} = \{Z, X, Y\}$ be a set of variables where $Z$ is a covariate, $X$ is a treatment, and $Y$ is an outcome. The target estimand is
$$E[Y(x) \mid x'] = \sum_z E_P[Y \mid x, z]\,P(z \mid x'). \quad (B.1)$$
The ETT estimand can be written as an expectation of $Y$ over the probability measure $\Psi = P(Y \mid X, Z)\,P(Z \mid x')\,\mathbb{1}_x(X)$. This factorization implies that $\mathbf{C}_1 := \{Z\}$ and $\mathbf{R} := \mathbf{V} \setminus (\mathbf{C}_1 \cup \{Y\}) = \{X\}$, where $\mathbf{R}_1 = \{X\}$ and $\sigma^1_{\mathbf{R}_1} := \mathbb{1}_x(X)$. Also, $\mathbf{S}_1 = \{X\} \cup \{Z\}$. Finally,
- $P^1(\mathbf{C}_1) = P(Z \mid x')$
- $P^2(Y \mid \mathbf{S}_1) = P(Y \mid X, Z)$.

The regression nuisances are the following:
- $\mu^1_0(\mathbf{S}_1) := \mu^1_0(X, Z) := E_P[Y \mid X, Z]$
- $\check\mu^1_0(\mathbf{S}_1 \setminus \mathbf{R}_1) := \check\mu^1_0(Z) := \mu^1_0(x, Z)$.

The ratio nuisance is
$$\pi^1_0(X, Z) = \frac{P(Z \mid x')\,\mathbb{1}_x(X)}{P(X, Z)} = \frac{P(x' \mid Z)\,P(Z)}{P(x')} \cdot \frac{\mathbb{1}_x(X)}{P(X \mid Z)\,P(Z)} = \frac{P(x' \mid Z)}{P(x')} \cdot \frac{\mathbb{1}_x(X)}{P(X \mid Z)}.$$

The representation for DML-UCA is
$$E_P[\pi^1_0(X, Z)\{Y - \mu^1_0(X, Z)\}] + E_P[\check\mu^1_0(Z) \mid x'].$$

B.2 Transportability (S-admissibility)

Let $\mathbf{V} = \{Z, X, Y\}$ be a set of variables where $Z$ is a covariate, $X$ is a treatment, and $Y$ is an outcome. Let $S$ denote the domain indicator such that $S = 0$ denotes the target domain and $S = 1$ denotes the source. The S-admissibility estimand appearing in the transportability scenario is
$$E[Y \mid do(x)] = \sum_z E_P[Y \mid x, z, S = 1]\,P(z \mid S = 0). \quad (B.2)$$
The estimand can be written as an expectation of $Y$ over the probability measure $\Psi = P(Y \mid X, Z, S = 1)\,P(Z \mid S = 0)\,\mathbb{1}_x(X)$. From this factorization, we have $\mathbf{C}_1 := \{Z\}$ and $\mathbf{R}_1 := \{X\}$. Also, set $P^1(\mathbf{C}_1) := P(Z \mid S = 0)$ with $\mathbf{S}^b_0 = \{S\}$, and set $P^2(Y \mid \mathbf{S}_1) := P(Y \mid X, Z, S = 1)$ with $\mathbf{S}^b_1 = \{S\}$ and $\mathbf{S}_1 := \{X\} \cup \{Z\}$.

The regression nuisances are the following:
- $\mu^1_0(\mathbf{S}_1) := \mu^1_0(X, Z) := E_P[Y \mid X, Z, S = 1]$
- $\check\mu^1_0(\check{\mathbf{S}}_1) := \check\mu^1_0(Z) = \mu^1_0(x, Z)$.

The ratio nuisance is
$$\pi^1_0(X, Z) = \frac{\mathbb{1}_x(X)}{P(X \mid Z, S = 1)} \cdot \frac{P(Z \mid S = 0)}{P(Z \mid S = 1)}.$$

The representation for DML-UCA is
$$E_P[\pi^1_0(X, Z)\{Y - \mu^1_0(X, Z)\} \mid S = 1] + E_P[\check\mu^1_0(Z) \mid S = 0].$$
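As a concrete companion to the ETT specification in B.1, the following minimal sketch (our own toy data-generating process with a binary treatment, logistic/linear nuisance models, and no cross-fitting; there is no hidden confounding in this toy, which is used purely to illustrate the nuisance plumbing) evaluates the DML representation $E_P[\pi^1_0(X, Z)\{Y - \mu^1_0(X, Z)\}] + E_P[\check\mu^1_0(Z) \mid x']$.

```python
# Minimal sketch of the ETT DML representation in B.1 under illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(3)
n = 20_000
Z = rng.normal(size=(n, 3))
X = rng.binomial(1, 1 / (1 + np.exp(-Z @ np.array([0.8, -0.5, 0.3]))))
Y = 1.5 * X + Z @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

x, x_pr = 1, 0                                   # E[Y(x) | X = x'] with x = 1, x' = 0
mu = LinearRegression().fit(np.c_[X, Z], Y)      # mu(X, Z) ~ E[Y | X, Z]
prop = LogisticRegression().fit(Z, X)            # P(X = 1 | Z)

p_z = prop.predict_proba(Z)                      # columns: P(X=0|Z), P(X=1|Z)
p_xpr = (X == x_pr).mean()                       # P(x')
# pi(X, Z) = [P(x'|Z) / P(x')] * 1_x(X) / P(X|Z)   (B.1, after Bayes' rule)
pi = (p_z[:, x_pr] / p_xpr) * (X == x) / p_z[:, x]

mu_obs = mu.predict(np.c_[X, Z])                 # mu(X, Z) at observed values
mu_x = mu.predict(np.c_[np.full(n, x), Z])       # check-mu(Z) = mu(x, Z)

ett = np.mean(pi * (Y - mu_obs)) + np.mean(mu_x[X == x_pr])
print(ett)                                       # estimate of E[Y(1) | X = 0] per Eq. (B.1)
```

The first term is the ratio-weighted residual correction and the second term is the regression nuisance averaged over the $X = x'$ subpopulation, exactly mirroring the two summands of the B.1 representation.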
B.3 Off-policy evaluation Let V = {Z, X, Y } be a set of variables where Z is a covariate, X is a treatment and Y is an outcome. Let σ (X | Z) denote the behavioral policy that an agent observed; i.e., (Z, X, Y ) P(Y | X, Z)σ (X | Z)P(Z). (B.3) Let σ(X | Z) denote a policy to be evaluated. Then, the effect of the policy σ is given as E[Y | σ] := X x,z EP [Y | x, z]σ (x | z)P(z). (B.4) The policy treatment effect in Eq. (B.4) can be represented as UCA as follow. C1 := Z R1 := {X} σ1 R1 := σ (X | Z) S1 := {X} Z. Set P 1(C1) P(Z), σ1 R1(R1 | Z1) σ(X | Z), and P 2(Y | C1, R1) P(Y | X, Z). Then, Ψ(P; σ) := X c,R EP 2[Y | c1, R1]σ1 R1P 1(c1) x,z EP [Y | x, z]σ (x | z)P(z) = E[Y | σ] (Eq. (B.4)). The regression nuisances are the followings: µ1 0(C(1) R(1)) := µ1 0(X, Z) := EP [Y | X, Z] ˇµ1 0(C(1)) := ˇµ1 0(Z) := X x µ1 0(x, Z)σ (x | Z). The ratio nuisances are the followings: π1 0(X, Z) = σ (X | Z) The representation for DML-UCA is EP [π1 0(X, Z){Y µ1 0(X, Z)}] + EP [ˇµ1 0(Z)]. B.4 Treatment-treatment interactions Let V = {Z, X, Y } be a set of variables where Z is a covariate, X is a treatment and Y is an outcome. The estimand for treatment-treatment interaction discussed in Jung et al. (2023b) is E[Y | do(x1, x2)] = X z E[Y | do(x2), z, x1]P(z | do(x1)), (B.5) which is an expectation of Y over a product of probability measure P(Y | Z, do(x2), X1)P(Z | do(x1))1x1(X1), which satisfies an additivity. Therefore, E[Y | do(x1, x2)] is UCA-expressible. Such reduction can be done since the probability measure satisfies additivity w.r.t. all conditional distributions and the policy 1x1(X1). Specifically, set C1 := Z R1 := {X1} S1 := {X1} Z. P 1(C1) := P(Z | do(x1)) P 2(Y | C1 R1) := P(Y | X1, Z, do(x2)) σ1 R1 := 1x1(X1). The regression nuisances are the followings: µ1 0(C(1) R(1)) := µ1 0(X1, Z) := EP [Y | X1, Z, do(x2)] ˇµ1 0(C(1)) := EP [Y | x1, Z, do(x2)]. The ratio nuisances are the followings: π1 0(X, Z) = 1x1(X1)P(Z | do(x1)) P(X1 | Z, do(x2))P(Z | do(x2)), which can be estimated through the density estimation approach using the probabilistic classification method described in (Díaz et al., 2023, Sec. 5.4). The representation for DML-UCA is EP [π1 0(X, Z){Y µ1 0(X, Z)} | do(x2)] + EP [ˇµ1 0(Z) | do(x1)]. C More Results C.1 Formal definition of Partial influence function (PIF) Definition C.1 (Partial influence function (PIF) (Pires and Branco, 2002)). Let g(P1, , PK) denote a K-multi-distribution functional. For the k-th component, let Pk t := Pk + t(Qk Pk) for t [0, 1], where Qk is an arbitrary distribution absolutely continuous w.r.t. Pk. The k-th partial influence function is a function ϕk(V; ηi(Pk), g0) such that EPk[ϕk(V; ηk(Pk), g0)] = 0, VPk[ϕk(V; ηk(Pk), g0)] < , and tg(P1, , Pk t , , PK) t=0 = EQk[ϕk(V; ηk(Pk), g0)]. C.2 Density Ratio Estimation Two available approaches for estimating the density ratio are the followings. The first approach is to apply the Bayes rule for rewriting the density ratio into more tractable form. For example, consider the problem of estimating π2 0 for FD, which is given as π2 0 := P(Z | x, C) P(Z | X, C). Suppose Z, C are high-dimensional random vectors, and X is a binary singleton variable. Then, P(X | C) or P(X | Z, C) are tractable to estimate compared to P(Z | X, C), since estimating P(X | ) can be done using off-the-shelf probabilistic classification method. 
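For instance, the two conditional models of the treatment that appear in the rewriting of π^2_0 below can be fit with any off-the-shelf probabilistic classifier; the following is a minimal sketch with assumed variable names and shapes, not the authors' code.

import numpy as np
from xgboost import XGBClassifier

def fit_treatment_models(C, Z, X):
    """C: (n, d_C) covariates, Z: (n, d_Z) mediators, X: (n,) binary treatment."""
    model_x_given_c = XGBClassifier().fit(C, X)                   # P(X | C)
    model_x_given_zc = XGBClassifier().fit(np.hstack([Z, C]), X)  # P(X | Z, C)
    return model_x_given_c, model_x_given_zc

# e.g., P(X = x | C) at the observed samples: model_x_given_c.predict_proba(C)[:, x]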
Here, π2 0 can be written as a tractable form as follows: π2 0 := P(Z | x, C) P(Z | X, C) = P(Z, X, C) P(X | C)P(C) P(x | C)P(C) P(C) P(Z, C) P(Z, C) P(x | C) P(X | C) P(X | Z, C) P(x | Z, C) P(X | C) P(X | Z, C) P(x | Z, C) . The second approach is to recast the density ratio into the classification problem (Díaz et al., 2023, Sec. 5.4). For example, consider the ratio nuisance appeared in Treatment-treatment interactions: π1 0(X, Z) = 1x1(X1)P(Z | do(x1)) P(X1 | Z, do(x2))P(Z | do(x2)). Here, P (Z|do(x1)) P (Z|do(x2)) can be estimated as a following procedure. Let D1 P(Z | do(x1) and D2 P(Z | do(x2) denote samples. Let D0 := D1 D2. Let λ denote an indicator such that λ = 0 means samples are from D1 and λ = 1 means they are from D2. Without loss of generality, |D1| = |D2|. Then, P(Z | do(x1)) P(Z | do(x2)) = P(Z | λ = 0) P(Z | λ = 1) = P(λ = 1) P(λ = 0) P(λ = 0 | Z)P(Z) P(λ = 1 | Z)P(Z) = P(λ = 0 | Z) P(λ = 1 | Z). Then, instead of estimating the density ratio explicitly as P (Z|do(x1)) P (Z|do(x2)), we can estimate the equivalent estimand P (λ=0|Z) P (λ=1|Z) using any off-the-shelf probabilistic classification method. C.3 Analysis of non-UCA functionals We consider two cases where a target estimand cannot be expressed through UCA: 1. Case 1. The target estimand is not in a form of the product (e.g., the target estimand is the quotient of sum-products of two conditional distributions ). 2. Case 2. For a target estimand that is represented as the expectation of Y over the measure Ψ [P; σ] := P m+1(Y | S m) i=1 P i(Ci | S i 1)σi Ri(Ri | S i \ Ri), where P i(V) = Qi(V | Sb i 1 = s) for some distribution Qi, i {2, , m + 1} such that S i 1 = (C(i 1) R(i 1)) \ Sb i 1. In this section, we will provide example functionals that cannot be expressed through UCA. C.3.1 On Case 1 Consider Fig. 1c where the causal effect P(y | do(x)) is identifiable and given as P(y | do(x)) = P w P(y, x | r, w)P(w) P w P(x | r, w)P(w) . (C.1) Here, the functional for E[Y | do(x)] is represented not as the expectation of a product of conditional distributions, but rather as a quotient of sums of conditional distributions. The napkin estimand is not UCA-expressible. Figure C.3: (a-c) Example for Case 2 (Generalized identification under partial observability (Lee and Bareinboim, 2020, Fig. 1)) C.3.2 On Case 2 Consider Figs. (C.3a-C.3c). A goal is to identify P(y | do(x)) from Fig. C.3c from two input distributions: (1) an interventional distribution P(c, z | do(x)) with Fig. C.3a, and (2) an observational distribution P(r, z, y) with Fig. C.3b. This problem is entitled as the generalized identification under partial observability (Lee and Bareinboim, 2020). Here, the causal effect is identifiable and given as (Lee and Bareinboim, 2020) E[Y | do(x)] = X c,r,z EP [Y | r, z]P(z | do(x), c)P(r)P(c | do(x)). (C.2) This functional is an expectation of the probability measure Y over P(Y | R, Z)P(Z | do(x), C)P(R)P(C | do(x)). Based on this probability measure, apply the following setting: C1 = {C}, C2 := {R} and C3 := {Z}. P 1(C1) := P(C | do(x)) with Sb 0 = . P 2(C2 | S 1) = P(R) with Sb 1 = and S 1 := . P 3(C3 | S 2) = P(Z | do(x), C) with Sb 2 = and S 2 := {C}. P 4(Y | S 3) = P(Y | R, Z) with Sb 2 = and S 3 := {R, Z}. Here, S 1 = = (C(1) R(1)) \ Sb 1 = {C}. Also, S 2 = {C} = (C(2) R(2)) \ Sb 2 = {C, R}. Finally, S 3 = {R, Z} = (C(3) R(3)) \ Sb 3 = {C, R, Z}. Therefore, Eq. (C.2) is not within UCA-class. 
Now, we will witness that the target estimand cannot be correctly represented through the nested regression and empirical bifurcation. Applying the nested regression, we have µ3 0(S 3) = µ3 0(R, Z) := EP [Y | R, Z] ˇµ3 0(ˇS 3) = µ3 0(R, Z) = EP [Y | R, Z]. µ2 0(S 2) = E[µ3 0(R, Z) | C, do(x)] = X r,z EP [Y | r, z]P(r, z | c, do(x)). This doesn t correctly represent the target estimand in Eq. (C.2) because P(r, z | c, do(x)) is not decomposed into P(r) and P(z | r, c, do(x)). C.4 Time Complexity In this section, we provide a detailed analysis on the time complexity in Table 2. Here, n is the sample size. When there are multiple sample sets (i.e., K > 1), we will use nmax to denote the size of the largest sample set. K is the number of sample sets in Eq. (11). m is the number of variables in a causal graph. C.4.1 BD/SBD Here, we focus on the back-door adjustment, since the SBD can be analyized similarly. The back-door (BD) adjustment estimand is given as X z EP [Y | x, z]P(z). Plug-in. The plug-in estimator composed of two stage learn the conditional probability table and and evaluate it for each samples. For a detailed description, we define some notations. Let D := {V(j) : j = 1, , n}, where V(j) denote the j th sample. For any W V, we will use Dw to denote the sub-sample of D that W is fixed to w; i.e., Dw := {V(j) D such that W(j) = w}. We will use I(Dw) to denote the index set for Dw. Finally, we will use nw := |Dw|. For the BD adjustment, the plug-in estimator is X z ˆE[Y | x, z] ˆP(z), (C.3) ˆE[Y | x, z] := 1 nx,z j I(Dx,z) Y(j), (C.4) For the fixed (x, z), learning ˆE[Y | x, z] and ˆP(z) take O(n). Such learning needs to be done for all possible realizations (x, z), where the cardinality of the realization is O(2m). As a result, the computational complexity is O(n2m). IPW, OM, AIPW We first consider the inverse probability-weighting (IPW) estimator (Rosenbaum and Rubin, 1983). The IPW estimator is 1x(X(i)) ˆπ(X(i) | Z(i))Y(i), (C.6) where ˆπ(X | Z) is the evaluated function for P(X | Z). Learning the nuisance parameter ˆπ takes T(n, m) and evaluating the IPW estimator takes O(n). As a result, the time complexity for the IPW estimator is O(n + T(n, m)). Next, we consider the outcome-model (OM) estimator (Robins, 1986). The OM estimator is i=1 ˆµ(X(i), Z(i)), (C.7) where ˆµ(X, Z) is the evaluated function for EP [Y | X, Z]. Learning the nuisance parameter ˆπ takes T(n, m) and evaluating the OM estimator takes O(n). As a result, the time complexity for the OM estimator is O(n + T(n, m)). Finally, the AIPW estimator (Robins and Rotnitzky, 1995) is 1x(X(i)) ˆπ(X(i) | Z(i))Y(i) + 1 i=1 ˆµ(X(i), Z(i)) 1 1x(X(i)) ˆπ(X(i) | Z(i)) ˆµ(X(i), Z(i)). (C.8) Learning the nuisance parameters ˆπ and ˆµ takes T(n, m) and evaluating the OM/IPW estimator takes O(n). As a result, the time complexity for the OM estimator is O(n + T(n, m)). C.4.2 Front-door adjustment (FD) The front-door adjustment (Pearl, 2000) is X z,c P(z | x, z) X z EP [Y | z, x , c]P(x | c)P(c). The FD estimators of (Fulcher et al., 2019; Guo et al., 2023) is ˆξ(Z(i), x, C(i)) ˆξ(Z(i), X(i), C(i)) {Y(i) ˆµ(C(i), X(i), Z(i))} 1x(X(i)) ˆπ(X(i), C(i)) x ˆµ(C(i), x, Z(i))ˆπ(x, C(i)) X x,z ˆµ(C(i), x, z)ˆξ(z, X(i), C(i))ˆπ(x, C(i)) z ˆµ(C(i), X(i), z)ˆξ(z, x, C(i))}, (C.9) where ˆµ(C, X, Z), ˆξ(Z, X, C) and ˆπ(X, C) are the evaluated functions for EP [Y | C, X, Z], P(Z | X, C) and P(X | C). Learning these nuisances takes T(n, m) time. 
Equipped with ˆµ, ˆξ, ˆπ, evaluating the FD estimator takes O(n2m), since the evaluation over n samples is repeated for O(2m) realization of every (x, z). Therefore, the overall time complexity is O(n2m + T(n, m)). C.4.3 Tian s adjustment The Tian s adjustment (Tian and Pearl, 2002a) is x EP [Y | v(K)] i=1 P (vi | v(i 1)). The estimator for Tian s adjstment proposed in (Bhattacharya et al., 2022) is 1 n Pn i=1 φ(V(i); ˆη), where φ(V; η0) is given as Vi V k+1\SX 1x(X) P(Vj | V(j 1)) x ,v i+1 y Y Vj (V k SX) V i+1 P(vj | v(j 1))|x=x if Vj SX Vi V k+1\SX 1x(X) P(Vj | V(j 1)) x ,v i+1 y Y Vj (V k SX) V i P(vj | v(j 1))|x=x if Vj SX Vi Vi V k+1 SX Vj V(i 1) P(Vj | V(j 1))|X=x Q Vj V(i 1) P(Vj | V(j 1)) Vj V i+1 P(vj | v(j 1))|x=x if Vj SX Vi Vi V k+1 SX Vj V(i 1) P(Vj | V(j 1))|X=x Q Vj V(i 1) P(Vj | V(j 1)) Vj V i+1 P(vj | v(j 1))|x=x if Vj SX Vj V k+1\SX P(vj | v(j 1))|x =x Y Vr V k+1 SX P(vr | v(r 1)). Learning these nuisances takes T(n, m) time. Equipped with nuisances corresponding to ˆP(vi | v(i 1)), evaluating the estimator takes O(n2m), since the evaluation over n samples is repeated for O(2m) realization of every v. Therefore, the overall time complexity is O(n2m + T(n, m)). C.4.4 DML-UCA (BD, FD, and Tian s). For BD, FD, and Tian s, the time complexity can be derived by specializing Thm. 2 with K = 1 and nmax = n, and T(m, n) := K L (Tµ + Tπ). Then, the complexity in Thm. 2 reduces to O(n + T(m, n)). C.4.5 DML-UCA (general). Set T(m, nmax, K) := L K (Tµ + Tπ). Then, the complexity in Thm. 2 reduces to O(Knmax + T(m, nmax, K)). C.4.6 DML-ID (obs ID). The DML-ID estimator of (Jung et al., 2021a) writes the identification functional of causal effect as am arithmetic function of multiple sequential back-door adjustments, where the arithmetic function is an arbitrary combination of marginalization, product, and division. Let f({Ak : k = 1, , K}) denote the DML-ID, where each Ak is the sequential back-door adjustment, and f denotes the arithmetic function. To evaluate the DML-ID functional, the first step is to learn all nuisances composing each Ak. This takes O(T(m, n)) time. The second step is to evaluate f({Ak : k = 1, , K}). Whenever f contains a marginalization over some random vector, the time complexity for evaluating it is O(n2m). In DML-ID, such marginalization can happen O(2m) times in worst case. Therefore, evaluating f({Ak : k = 1, , K}) can take O(n2m 2m) = O(n22m). As a result, the total time complexity is O(n22m + T(n, m)). C.4.7 DML-g ID (g ID). The DML-g ID estimator of (Jung et al., 2023a) writes the identification functional of causal effect as am arithmetic function of multiple generalized sequential back-door adjustments called g-m SBD (Jung et al., 2023a), where the arithmetic function is an arbitrary combination of marginalization, product, and division. Let f({Aj : j = 1, , J}) denote the DML-ID, where each Aj is the g-m SBD, and f denotes the arithmetic function. To evaluate the DML-g ID functional, the first step is to learn all nuisances composing each Aj. This takes O(T(m, nmax, K)) time. The second step is to evaluate f({Aj : j = 1, , J}). Whenever f contains a marginalization over some random vector, the time complexity for evaluating it is O(nmax2m). In DML-g ID, such marginalization can happen O(2m K) times in worst case. Therefore, evaluating f({Aj : j = 1, , J}) can take O(nmax2m 2m K) = O(Knmax22m). As a result, the total time complexity is O(Knmax22m + T(m, nmax, K)). 
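To make the scaling gap in this section concrete, the following toy sketch (ours, purely illustrative) contrasts a plug-in-style summation over all 2^m configurations of binary variables with the O(n) sample-average evaluation used by the DML-style estimators.

import itertools
import numpy as np

def plugin_style_sum(mu_hat, p_hat, m):
    """O(2^m) terms: sum_z mu_hat(z) * p_hat(z) over z in {0,1}^m."""
    return sum(mu_hat(np.array(z)) * p_hat(np.array(z))
               for z in itertools.product((0, 1), repeat=m))

def sample_average(mu_hat, Z_samples):
    """O(n) terms: (1/n) * sum_i mu_hat(Z_i), independent of 2^m."""
    return float(np.mean([mu_hat(z) for z in Z_samples]))

With m = 30 binary covariates, the first routine already involves roughly 10^9 terms, while the second touches each of the n samples once; this is the gap between the O(n 2^m)-type and O(n + T(n, m))-type entries discussed above.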
D.1 Proof of Proposition 1 The algorithm provides a product of probabilities in a form of Vi SX P(Vi | V(i)) Y Vj V\SX P(Vj | V(j) \ X, x). Then, by taking an expectation over Y , it gives x EP [Y | v(K)] i=1 P (vi | v(i 1)). D.2 Proof for Proposition 2 First, define set1 := ((Si+1 Bi+1) \ Ri+1 \ Bi) = (((C(i+1) R(i) Bi+1) \ Sb i+1) \ Bi) B i, set2 := Si B i = (C(i) R(i) \ Sb i) B i. Then, we claim that set1 \ set2 = Ci+1. This holds when Bi+1 \ Bi \ Si = . To prove this with contradiction, suppose Bi+1 \ Si = . This holds when Bi+1 Sb i. Recall that Bi+1 is a subset of fixed variables in Sb i+1 in P i+2. Then, Bi+1 Sb i means that this variable will be fixed in P i+1. However, for this variable to be bifurcated in some Cj, this variable should be within Bi. However, this is a contradiction of the definition of Bi+1 and Bi. Therefore, Bi+1 \ Bi \ Si = and set1 \ set2 = Ci+1. µi 0(Si, B i) = EP i+1[ˇµi+1 0 (ˇSi+1) | Si, B i] = EP i+1[ˇµi+1 0 (((Si+1 Bi+1) \ Ri+1 \ Bi) B i) | Si, B i] ci+1,ri+1 P i+1(ci+1 | Si, B i)σi+1 Ri+1(ri+1)µi+1 0 (ci+1, ri+1, Si, B i) ci+1,ri+1 P i+1(ci+1 | Si)σi+1 Ri+1(ri+1)µi+1 0 (ci+1, ri+1, Si, B i). By recursion, EP 1[ˇµ1 0( ˇS1)] = X j=1 P j(cj | sj 1)σj Rj(rj | sj \ rj)EP m+1[Y | sm]. D.3 Proof for Proposition 3 By definition of πm 0 . D.4 Proof for Theorem 1 If conditions in Theorem 1 met, the estimand reduces to the UCA by definition. D.5 Proof for Theorem 2 1. The sample-splitting takes O((m + 1)nmax). 2. For the fixed ℓ, learning ˆµi ℓfor i = m, , 1 takes O(Tµ m). Therefore, learning all regressionnuisances takes O(Tµ m L). 3. For the fixed ℓ, learning ˆπi ℓfor i = 1, , m takes O(Tπ m). Therefore, learning all rationuisances takes O(Tπ m L). 4. Evaluating the DML estimator in Eq. (8) takes O((m + 1)nmax). In total, the time complexity is O((m + 1)nmax) + O(Tµ m L) + O(Tπ m L) + O((m + 1)nmax) = O(m {nmax + L (Tµ + Tπ)}) D.6 Proof for Theorem 3 Define Ψi := Qi k=1 P k(Ck | Sk 1)σk(Rk | Sk \ Rk). Define Ψi t := P i t (Ci | Si 1)σi(Ri | Si \ Ri) k=1 P k(Ck | Sk 1)σk(Rk | Sk \ Rk). µi t(S i) := EP i+1 t [ˇµi+1 | S i]. For any P i, we choose the following parametric submodel: P i t := P i + t(Qi P i). For any i = m, , 1, ψ0 := EΨi[µi 0(S i)]. Fix i {1, , m}. Then, consider the differention with respect to P i+1 t . t EΨi 0[µi t(S i)] = t EP i+1[µi t(S i)πi 0] t EP i+1 t [µi t(S i)πi 0] t EP i+1 t [µi 0(S i)πi 0] t EP i+1 t [πi 0ˇµi+1(ˇSi+1)] t EP i+1 t [πi 0µi 0(S i)] t EP i+1 t [πi 0 ˇµi+1(ˇSi+1) µi 0(S i) ] = EQi+1[πi 0 ˇµi+1(ˇSi+1) µi 0(S i) ]. Also, consider the differentiation with respect to P 1: t EP 1 t [ˇµ1 0(ˇS1)] = EQ1[ˇµ1 0(ˇS1) ψ0]. Then, define φi as the differentiation with respect to P i as φi(ˇSi; ηi 0, ψ0) := ( πi 1 0 {ˇµi 0 µi 1 0 } if i > 1 ˇµ1 0 ψ0 if i = 1. tΨ(P1, , Pk t , , PK) t=0 = X P i t t P i t Ψ(P 1, , P i t , , P m+1; σ) t=0 i Ik EQi[φi(ˇSi; ηi 0, ψ0)], which completes the proof. D.7 Proof for Theorem 4 Structure of the proof. Theorem 4 will be proven based on Lemma D.2, Lemma D.3, and Lemma D.4. Specifically, we proceed the proof as follows: 1. We will prove Lemma D.2, Lemma D.3, and Lemma D.4. 2. Berry-Essen s inequality (Berry, 1941) will be stated as a preliminary in Prop. D.1. 3. Theorem 4 will be proven based on the main lemmas and Berry-Essen s inequality. D.7.1 Helper lemmas We first state and prove helper lemmas. i=1 EP i+1[πi 0{ˇµi+1 0 µi 0}] + EP 1[ˇµ1 0]. (D.1) Proof of Lemma D.1. By the total expectation law, it suffices to show that Ψ(P; σ) = EP 1[ˇµ1 0(ˇS1)]. This holds from Prop. 2. Lemma D.2 (Decomposition). 
Define the following Φ(ˆµ, ˆπ) := i=1 EP i+1[ˆπi{ˇµi+1 ˆµi}] + EP 1[ˇµ1] (D.2) Φ(µ0, π0) := i=1 EP i+1[πi 0{ˇµi+1 0 µi 0}] + EP 1[ˇµ1 0]. (D.3) The following decomposition holds: Φ(ˆµ, ˆπ) Φ(µ0, π0) = r=1 EP r+1[ˆω(r 1){µr 0 ˆµr}{ˆπr πr 0}]. (D.4) Proof of Lemma D.2. First, Φ(ˆµ, ˆπ) ψ0 = Φ(ˆµ, ˆπ) Φ(µ0, π0). EP m+1[ˆπm{µm 0 ˆµm}] + EP m+1[πm 0 ˆµm] EP m+1[πm 0 µm 0 ] | {z } :=ψ0 = EP m+1[{πm 0 ˆπm}{ˆµm µm 0 }]. For i = m 1, , 1, define µi 0[ˇµi+1] := EP i+1[ˇµi+1(ˇSi+1) | Si, B i]. EP i+1[ˆπi{µi 0[ˇµi+1] ˆµi}] + EP i+1[πi 0ˆµi] EP i+1[πi 0µi 0[ˇµi+1]] = EP i+1[{πi 0 ˆπi}{ˆµi µi 0[ˇµi+1]}]. Also, for any µi+1 and corresponding ˇµi+1, and for all i = m 1, , 1, by the definition of the πi 0 nuisance, EP i+2[πi+1 0 µi+1] = EP i+1[πi 0µi 0[ˇµi+1]]. EP m+1[ˆπm{µm 0 ˆµm}] + EP m+1[πm 0 ˆµm] EP m+1[πm 0 µm 0 ] i=1 EP i+1[ˆπi{µi 0[ˇµi+1] ˆµi}] + EP i+1[πi 0ˆµi] EP i+1[πi 0µi 0[ˇµi+1]] i=1 EP i+1[ˆπi{µi 0[ˇµi+1] ˆµi}] + EP 2[π1 0 ˆµ1] ψ0 i=1 EP i+1[{πi 0 ˆπi}{ˆµi µi 0[ˇµi+1]}]. Note that EP 2[π1 0 ˆµ1] = EP 1[ˇµ1], since EP 1[ˇµ1(S1 B1)] = EP 1[σ1 R1(R1 | S1 \ R1)µ1((S1 B1) \ R1)] = EP 2 P 1((S1 B1) \ R1) P 2(S1 B1) σ1 R1(R1 | S1 \ R1)µ1((S1 B1) \ R1) = EP 2 P 1(C1) P 2(S1 B1)σ1 R1(R1 | S1 \ R1)µ1((S1 B1) \ R1) = EP 2[π1 0µ1 0]. Φ(ˆµ, ˆπ) Φ(µ0, π0) i=1 EP i+1[{πi 0 ˆπi}{ˆµi µi 0[ˇµi+1]}]. Lemma D.3 (Stochastic Equicontinuity). Let D iid P. Let D = D0 D1, where n := |D0|. Let ˆf be a function estimated from D1. Then, in probability greater than 1 ϵ for any ϵ (0, 1), ED0 P h ˆf f i w.p 1 ϵ < ˆf f P nϵ , (D.5) which implies that ED0 P [| ˆf f|] = OP Proof of Lemma D.3. This proof is from (Kennedy et al., 2020, Lemma 2). Since ˆf is a function of D1, we will denote ˆf D1. Define a following random variable of interest: X := ED0 P [ ˆf D1 f]. Then, the conditional expectation of X given D1 is zero, since i=1 ˆf D1(Vi) D1 i=1 EP [ ˆf D1(Vi) | D1] = 1 i=1 EP [ ˆf D1(V) | D1] = EP [ ˆf D1(V) | D1], where the third equality holds by the independence of D0 and D1. Therefore, EP [X | D1] = EP [ED0 P [ ˆf D1 f] | D1] = EP [ED0[ ˆf D1 f] | D1] EP [EP [ ˆf D1 f] | D1] = EP [EP [ ˆf D1 f] | D1] EP [EP [ ˆf D1 f] | D1] = 0. VP [X | D1] = VP [ED0 P [ ˆf D1 f] | D1] = VP [ED0[ ˆf D1 f] | D1] n VP [ ˆf D1 f | D1] n ˆf D1 f 2 P . By applying the (conditional-) Chevyshev s inequality, P(|X EP [X | D1]| t | D1) 1 t2 VP [X | D1] 1 nt2 ˆf D1 f 2 P . P(|X| t) = P(|X EP [X | D1]| t) = EP (D1)[P(|X EP [X | D1]| t | D1)] 1 nt2 ˆf D1 f 2 P . In other words, X < t in probability greater than 1 1 nt2 ˆf D1 f 2 P . If t = ˆ f D1 f P nϵ , then X < ˆ f D1 f P nϵ in the probability greater than 1 ϵ for any ϵ (0, 1). Lemma D.4 (Combining concentration inequalities). Suppose P(Ak > t) bk/t2 for k = 1, , K. Then, Proof. The event PK k=1 Ak t K includes the case where Ak < t for k = 1, , K. Therefore, P (A1 t and and AK t) = 1 P (A1 > t or or AK > t) k=1 P (Ak > t) D.7.2 Preliminary Results Proposition D.1 (Berry Esseen s inequality (Berry, 1941; Esseen, 1942; Shevtsova, 2014)). Suppose D = {X1, , Xn} are independent and identically distributed random variables with EP [Xi] = 0, EP [X2 i ] = σ2 and EP [|Xi|3] = κ3. Then, for all x and n, P n σ0 ED[X] < x Φ(x) 0.4748κ3 D.7.3 Proof of Theorem 4 - (1) By Lemma D.2, we decompose the error as follow: k=1 EDk Pk[ϕk 0] (D.6) k=1 EDk ℓ Pk[ˆϕk ℓ ϕk 0] (D.7) i=1 EP i+1[{µi 0 ˆµi ℓ}{ˆπi ℓ πi 0}]. (D.8) Rk 1 := EDk Pk[ϕk 0] + 1 ℓ=1 EDk ℓ Pk[ˆϕk ℓ ϕk 0]. Then it completes the proof. D.7.4 Proof of Theorem 4 - (2) We first study the term EDk Pk[ϕk 0]. 
By Chebyshev s inequality, EDk Pk[ϕk 0] > t ρk,0 p Equivalently, P EDk Pk[ϕk 0] > t < 1 t2 ρ2 k,0 |Dk|. By Lemma D.4, EDk Pk[ϕk 0] t1K ρ2 k,0 |Dk|. By Lemma D.3, Pk EDk ℓ Pk[ˆϕk ℓ ϕk 0] > t2 1 ˆϕk ℓ ϕk 0 2 Pk |Dk ℓ| . (D.9) By Lemma D.4, EDk ℓ Pk[ˆϕk ℓ ϕk 0] Kt2 ˆϕk ℓ ϕk 0 2 Pk |Dk ℓ| . (D.10) Choose t1 := 2 ϵ PK k=1 ρ2 k,0 |Dk| and t2 := 2 ϵ PL ℓ=1 PK k=1 ˆϕk ℓ ϕk 0 2 Pk |Dk ℓ| . Then, with a probability greater than 1 ϵ, ρ2 k,0 |Dk| + ˆϕk ℓ ϕk 0 2 Pk |Dk ℓ| ρ2 k,0 |Dk| + ˆϕk ℓ ϕk 0 2 Pk |Dk ℓ| D.7.5 Proof of Theorem 4 - (3) By Lemma D.3, Pk EDk ℓ Pk[ˆϕk ℓ ϕk 0] > t 1 t2 ˆϕk ℓ ϕk 0 2 Pk |Dk ℓ| . (D.11) By Lemma D.4, EDk ℓ Pk[ˆϕk ℓ ϕk 0] t ˆϕk ℓ ϕk 0 2 Pk |Dk ℓ| . (D.12) Equivalently, by choosing t = 1 ϵ PL ℓ=1 ˆϕk ℓ ϕk 0 2 Pk |Dk ℓ| , EDk ℓ Pk[ˆϕk ℓ ϕk 0] w.p 1 ϵ ˆϕk ℓ ϕk 0 2 Pk |Dk ℓ| . (D.13) Ak := EDk Pk[ϕk 0] (D.14) ℓ=1 EDk ℓ Pk[ˆϕk ℓ ϕk 0] (D.15) EDk ℓ Pk[ˆϕk ℓ ϕk 0] (D.16) ˆϕk ℓ ϕk 0 2 Pk |Dk ℓ| . (D.17) Rk := Ak + Bk. (D.18) Pk Rk < x (D.19) = Pk (Ak + Bk < x) (D.20) = Pk (Ak < x Bk) (D.21) Pk (Ak < x + Ck) (D.22) w.p 1 ϵ Pk (Ak < x + k) . (D.23) Then, Pk (Ak < x + k) Φ(x) (D.24) = Pk (Ak < x + k) Φ(x + k) + Φ(x + k) Φ(x) (D.25) Pk (Ak < x + k) Φ(x + k) + |Φ(x + k) Φ(x)| (D.26) 0.4748κ3 0 ρ3 k,0 p |Dk| + |Φ(x + k) Φ(x)| (Prop. D.1) (D.27) = 0.4748κ3 0 ρ3 k,0 p |Dk| + |Φ (x ) k| (Mean-value theorem) (D.28) 0.4748κ3 0 ρ3 k,0 p 2π k. (D.29) This completes the proof. D.8 Proof for Corollary 4 By Cauchy-Schwartz inequality, i=1 EP i+1[{µi 0 ˆµi ℓ}{ˆπi ℓ πi 0}] 1 i=1 OP i+1 µi 0 ˆµi ℓ πi 0 ˆπi ℓ . (D.30) Given assumption, the upper bound in Eq. (15) converges at o Pk(1/ q |Dk ℓ|). Therefore, we conclude that Rk converges in distribution to normal(0, ρ2 k,0). E More Experiments In this section, we demonstrate the DML-UCA estimator through examples for the ETT, Sadmissibility, FD, Verma s equation, and Ctf-DE described in Sec. 2. For each example, the proposed estimator is constructed using a dataset Dk following a distribution Pk. Our goal is to provide empirical evidence of the fast convergence behavior of the proposed estimator compared to competing baseline estimators. We consider two standard baselines in the literature: the regression-based estimator (reg) only uses the regression nuisance parameters µ, and the ratio-based estimator (ratio) that only uses the ratio nuisance parameters π, while our DML-UCA estimator ( dml ) uses both. Details of the regression-based ( reg ) and the ratio-based ( ratio ) estimators are provided in Sec. A. Details of experimental setting is provided in Sec. F. In this experiments, we set all variables other than the treatment variable X as continuous. We compare DML-UCA estimator to the regression-based estimator ( reg ) and the ratio-based estimator ( ratio ). In particular, we use ˆψest for est {reg, pw, dml} to denote the regressionbased, probability-weighting, and DML-UCA estimators. We assess the quality of the estimators by computing the average absolute error AAEest which is defined as follow. For the ETT and Ctf-DE, AAEest := | ˆψest ψ0|, where ψ0 := E[YX=0 | X = 1] for the ETT and ψ0 := E[YX=0,WX=1 | X = 2] for the Ctf-DE. For the other examples, AAEest := 1 domqin(X) P x domain(X) | ˆψest(x) ψ0(x)| where ψ0(x) := E[Y | do(x)], ˆψest(x) is an estimator for ψ0(x) and dom(X) is a cardinality of the domain of X. Nuisance functions are estimated using XGBoost (Chen and Guestrin, 2016). We ran 100 simulations for each number of samples n = {2500, 5000, 10000, 20000} and drew the AAE plot. 
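For concreteness, the AAE reported below can be computed as in the following sketch (the helper names are ours; for the ETT and Ctf-DE the average reduces to the single term |psi_hat - psi_0|).

import numpy as np

def average_absolute_error(psi_hat, psi_0):
    """psi_hat, psi_0: dicts mapping each x in dom(X) to the estimate / ground truth."""
    return float(np.mean([abs(psi_hat[x] - psi_0[x]) for x in psi_0]))

# Assumed reporting loop: 100 simulations per sample size.
# aae = {n: np.mean([average_absolute_error(estimate(n), psi_0) for _ in range(100)])
#        for n in (2500, 5000, 10000, 20000)}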
We evaluate the AAE_est in the presence of the converging noise ϵ as in Sec. 4.

Statistical Robustness. The AAE plots for all scenarios are presented in Fig. E.4. For all examples, all of the estimators (reg, pw, dml) converge as the sample size grows. Furthermore, the proposed DML-UCA estimator outperforms the other two estimators by achieving faster convergence. This result corroborates the robustness property in Thm. 4, which implies that DML-UCA converges faster than its counterparts.

Figure E.4: AAE versus sample size (n = 2500, 5000, 10000, 20000) for the OM, PW, and DML estimators. (a) ETT in Sec. B, (b) Transportability (S-admissibility) in Sec. B, (c) Front-door in Example 1, (d) Verma in Example 2, (e) Ctf-DE in Example 3.

F Details in Experiments

As described in Sec. 4, we used XGBoost (Chen and Guestrin, 2016) as the model for estimating nuisances, implemented in Python via the command xgboost.XGBClassifier(eval_metric='logloss')¹. We tuned the parameters for each example to empirically guarantee the convergence of the regression and ratio nuisances. For each example, the same parameters are used globally when implementing DML-UCA, the regression-based estimator, the ratio-based estimator, and the other competing estimators (Fulcher et al., 2019; Jung et al., 2021a). We now present the structural causal models (SCMs) used to generate the datasets, together with a segment of the code employed to generate them.

¹Detailed parametrization, including learning rates, maximum depth of the trees, etc., is explained in https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.training.

F.1 FD (Fig. 1a) for Simulation in Fig. 2a

We define the following structural causal model:

U ~ normal(0.5, 0.5), U_Zi ~ normal(0, 1) for i = 1, ..., d_Z
C_i := f_Ci(U), where C := {C_i : i = 1, ..., d_C}
X := f_X(C, U)
Z_i := f_Zi(C, X), where Z := {Z_i : i = 1, ..., d_Z}
Y := f_Y(C, Z, U),

with

f_Ci(U) := 1 / (1 + exp(0.25 U_Z + 2U - 1))
f_X(C, U) := Binary( 1 / (1 + exp(2C_1 - 1 + U)) )
f_Zi(C, X) := Binary( 1 / (1 + exp(2X - 1 + 0.5 C_1 + U_Zi)) )
f_Y(C, Z, U) := 1 / (1 + exp((1/d_C) C_1 + (1/d_Z)(2Z_1 - 1) + 2U)).

The parameterization for XGBoost used in µ (called mu_params) and π (called pi_params) is the following:

mu_params = {
    'booster': 'gbtree', 'eta': 0.3, 'gamma': 0, 'max_depth': 10,
    'min_child_weight': 1, 'subsample': 1.0, 'colsample_bytree': 1,
    'lambda': 0.0, 'alpha': 0.0, 'objective': 'reg:squarederror',
    'eval_metric': 'rmse', 'n_jobs': 4
}
pi_params = {
    'booster': 'gbtree', 'eta': 0.3, 'gamma': 0, 'max_depth': 10,
    'min_child_weight': 1, 'subsample': 0.0, 'colsample_bytree': 1,
    'objective': 'binary:logistic', 'eval_metric': 'logloss',
    'reg_lambda': 0.0, 'reg_alpha': 0.0, 'nthread': 4
}

F.2 Verma (Fig. 1b) for Simulation in Fig. 2b

We define the following structural causal model:

U_XB ~ normal(1, 0.5), U_AY ~ normal(-1, 0.5), U_A ~ normal(0, 1), U_B ~ normal(0, 1)
X := f_X(U_XB)
A_i := f_Ai(X, U_AY) for i = 1, ..., d_A
B_i := f_Bi(X, U_XB) for i = 1, ..., d_B
Y := f_Y(B, U_AY),

with

f_X(U_XB) := Binary( 1 / (1 + exp(2U_XB - 1)) )
f_Ai(X, U_AY) := Binary( 1 / (1 + exp(2X - 1 + U_A + U_AY)) )
f_Bi(X, U_XB) := Binary( 1 / (1 + exp(2A_1 - 1 + U_B + 0.5 U_XB)) )
f_Y(B, U_AY) := 1 / (1 + exp(2B_1 - 1 + 0.5 U_AY)).
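To make the Binary(1/(1+exp(·))) notation above concrete, here is a minimal sampling sketch following the structure of this Verma SCM; the reading of the flattened expressions (e.g., 2X - 1) and the resulting coefficients are our assumptions, not the authors' exact data-generating code.

import numpy as np

def sigma(t):
    # Matches the 1 / (1 + exp(t)) form used in the displayed equations.
    return 1.0 / (1.0 + np.exp(t))

def sample_verma(n, d_A=1, d_B=1, seed=0):
    rng = np.random.default_rng(seed)
    U_XB = rng.normal(1.0, 0.5, n)    # latent confounder shared by X and B
    U_AY = rng.normal(-1.0, 0.5, n)   # latent confounder shared by A and Y
    U_A = rng.normal(0.0, 1.0, n)
    U_B = rng.normal(0.0, 1.0, n)
    X = rng.binomial(1, sigma(2 * U_XB - 1))
    A = np.stack([rng.binomial(1, sigma(2 * X - 1 + U_A + U_AY))
                  for _ in range(d_A)], axis=1)
    B = np.stack([rng.binomial(1, sigma(2 * A[:, 0] - 1 + U_B + 0.5 * U_XB))
                  for _ in range(d_B)], axis=1)
    Y = sigma(2 * B[:, 0] - 1 + 0.5 * U_AY)   # continuous outcome in (0, 1)
    return X, A, B, Y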
The parameterization for XGBoost used in µ called (mu_params) and π called (pi_params) is the following: mu_params = { 'booster': 'gbtree', 'eta': 0.35, 'gamma': 0, 'max_depth': 6, 'min_child_weight': 1, 'subsample': 1.0, 'colsample_bytree': 1, 'lambda': 0.0, 'alpha': 0.0, 'objective': 'reg:squarederror', 'eval_metric': 'rmse', 'n_jobs': 4 # Assuming you have 4 cores } pi_params = { 'booster': 'gbtree', 'eta': 0.1, 'gamma': 0, 'max_depth': 10, 'min_child_weight': 1, 'subsample': 0.0, 'colsample_bytree': 1, 'objective': 'binary:logistic', # Change as per your objective 'eval_metric': 'logloss', # Change as per your needs 'reg_lambda': 0.0, 'reg_alpha': 0.0, 'nthread': 4 } F.3 Example estimand (Fig. 1e) for Simulation in Fig. 2c We define the following structural causal models: UX1,Z normal(1, 0, 5), UX1,Y normal( 1, 0, 5), UZ,Y normal(0.5, 0.5) UR normal(0, 0.5) UZ normal(0, 0.5) UX2 normal(0, 0.5) X1 := f X1(UX1,Z, UX1,Y ) Zi := f Zi(X1, UX1,Z, UZ,Y ), for i = 1, , d Z Ri := f Ri(X1), , for i = 1, , d R Y := f Y (B, UAY ), f X1(UX1,Z, UX1,Y ) := Binary 1 1 + exp(2UX1,Z UX1,Y 1) f Ri(X1) := Binary 1 1 + exp(2X1 1 + UR) f Zi(X1, UX1,Z, UZ,Y ) := Binary 1 1 + exp(4X1 1 + UZ + UX1,Z + UZ,Y ) f X2(Z, X1) := Binary 1 1 + exp((2X1 1)Z 1 UX2) f Y (R, X2, UX1,Y , UZ,Y ) := 1 1 + exp((1/d R)R 1 + 2X2 1 + 2(UX1,Y + UZ,Y )). The parameterization for XGBoost used in µ called (mu_params) and π called (pi_params) is the following: mu_params = { 'booster': 'gbtree', 'eta': 0.3, 'gamma': 0, 'max_depth': 8, 'min_child_weight': 1, 'subsample': 0.8, 'colsample_bytree': 0.8, 'lambda': 0.0, 'alpha': 0.0, 'objective': 'reg:squarederror', 'eval_metric': 'rmse', 'n_jobs': 4 # Assuming you have 4 cores } pi_params = { 'booster': 'gbtree', 'eta': 0.1, 'gamma': 0, 'max_depth': 10, 'min_child_weight': 1, 'subsample': 0.75, 'colsample_bytree': 0.75, 'objective': 'binary:logistic', # Change as per your objective 'eval_metric': 'logloss', # Change as per your needs 'reg_lambda': 0.0, 'reg_alpha': 0.0, 'nthread': 4 } F.4 ETT in Sec. B for Simulation in Fig. E.4a We define the following structural causal models: UX normal(0, 1) UY 0.5 textttnormal(0, 1) Z 0.25normal(0, 1, d Z), X := f X(Z) Y := f Y (X, Z) f X(Z) := Binary 1 1 + exp(2Z 1 1 + UX) f Y (Z, X) := 1 1 + exp(Z 1(2X 1) + UY ). The parameterization for XGBoost used in µ called (mu_params) and π called (pi_params) is the following: mu_params = { 'booster': 'gbtree', 'eta': 0.5, 'gamma': 0, 'max_depth': 15, 'min_child_weight': 1, 'subsample': 0.8, 'colsample_bytree': 1, 'lambda': 0, 'alpha': 0, 'objective': 'reg:squarederror', 'eval_metric': 'rmse', 'n_jobs': 4 # Assuming you have 4 cores } pi_params = { 'booster': 'gbtree', 'eta': 0.3, 'gamma': 0, 'max_depth': 10, 'min_child_weight': 1, 'subsample': 1, 'colsample_bytree': 1, 'objective': 'binary:logistic', # Change as per your objective 'eval_metric': 'logloss', # Change as per your needs 'reg_lambda': 1, 'reg_alpha': 0, 'nthread': 4 } F.5 Transportability in Sec. B for Simulation in Fig. E.4b We define the following structural causal models: UX normal(0, 1) UY 0.5 textttnormal(0, 1) Z 0.25normal(0, 0.5, d Z) + Snormal(0.1, 0.5, d Z) X := f X(Z) Y := f Y (X, Z) f X(Z) := Binary 1 1 + exp((1/d Z)(2Z 1 1) + UX) f Y (Z, X) := 1 1 + exp(Z 1(2X 1) + UY ). 
The parameterization for XGBoost used in µ called (mu_params) and π called (pi_params) is the following: mu_params = { 'booster': 'gbtree', 'eta': 0.3, 'gamma': 0, 'max_depth': 15, 'min_child_weight': 1, 'subsample': 0.8, 'colsample_bytree': 1, 'lambda': 0, 'alpha': 0, 'objective': 'reg:squarederror', 'eval_metric': 'rmse', 'n_jobs': 4 # Assuming you have 4 cores } pi_params = { 'booster': 'gbtree', 'eta': 0.1, 'gamma': 0, 'max_depth': 10, 'min_child_weight': 1, 'subsample': 1, 'colsample_bytree': 1, 'objective': 'binary:logistic', # Change as per your objective 'eval_metric': 'logloss', # Change as per your needs 'reg_lambda': 1, 'reg_alpha': 0, 'nthread': 4 } F.6 FD with continuous mediators for Simulation in Fig. E.4c We define the following structural causal models: UC normal(0, 1, d C) U normal(0, 1) C := f C(U) X := f X(U, C) Z := f Z(X, C) Y := f Y (U, Z, C) f C(U) := 0.25UC + 2U 1 f X(U, C) := Binary 1 1 + exp((2C 1 1) + U) f Z(X, C) := 1 1 + exp(0.1C 1(2X 1) + X) f Y (Z, X) := 1 1 + exp(C 1 + (2Z 1) + U). The parameterization for XGBoost used in µ called (mu_params) and π called (pi_params) is the following: mu_params = { 'booster': 'gbtree', 'eta': 0.01, 'gamma': 0, 'max_depth': 10, 'min_child_weight': 1, 'subsample': 1.0, 'colsample_bytree': 1, 'lambda': 0.0, 'alpha': 0.0, 'objective': 'reg:squarederror', 'eval_metric': 'rmse', 'n_jobs': 4 } pi_params = { 'booster': 'gbtree', 'eta': 0.3, 'gamma': 0, 'max_depth': 20, 'min_child_weight': 1, 'subsample': 0.0, 'colsample_bytree': 1, 'objective': 'binary:logistic', 'eval_metric': 'logloss', 'reg_lambda': 0.0, 'reg_alpha': 0.0, 'nthread': 4 } F.7 Verma s equation with continuous mediators for Simulation in Fig. E.4d We define the following structural causal models: UXB normal(1, 0, 5), UAY normal( 1, 0, 5), X := f X(UXB) A := f A(X, UAY ) B := f B(X, UXB) Y := f Y (B, UAY ), f X(UXB) := Binary 1 1 + exp(2UXB 1) f A(X, UAY ) := Binary 1 1 + exp(2X 1 + 0.5UAY ) f B(X, UXB) := Binary 1 1 + exp(2A 1 + 0.5UXB) f Y (B, UAY ) := 1 1 + exp(2B 1 + 0.5UAY ). The parameterization for XGBoost used in µ called (mu_params) and π called (pi_params) is the following: mu_params = { 'booster': 'gbtree', 'eta': 0.35, 'gamma': 0, 'max_depth': 6, 'min_child_weight': 1, 'subsample': 1.0, 'colsample_bytree': 1, 'lambda': 0.0, 'alpha': 0.0, 'objective': 'reg:squarederror', 'eval_metric': 'rmse', 'n_jobs': 4 # Assuming you have 4 cores } pi_params = { 'booster': 'gbtree', 'eta': 0.1, 'gamma': 0, 'max_depth': 10, 'min_child_weight': 1, 'subsample': 0.0, 'colsample_bytree': 1, 'objective': 'binary:logistic', # Change as per your objective 'eval_metric': 'logloss', # Change as per your needs 'reg_lambda': 0.0, 'reg_alpha': 0.0, 'nthread': 4} F.8 Ctf-DE in Example 3 for Simulation in Fig. E.4e We define the following structural causal models: U normal(0, 2), X := f X(U) Z := f Z(U) W := f W (X, Z) Y := f Y (X, Z, W), 0 if 1 1+exp(2UXB 1) < 0.5 1 if 0.5 1 1+exp(2UXB 1) < 0.8 2 if 0.8 1 1+exp(2UXB 1). f Z(U) := 1 1 + exp( U + 1) f W (X, Z) := 1 1 + exp(X 1 + Z) f Y (Z, X, W) := 1 1 + exp(3X 1 + 0.1Z + 0.1W + W(X 1)). 
The parameterization for XGBoost used in µ (called mu_params) and π (called pi_params) is the following:

mu_params = {
    'booster': 'gbtree', 'eta': 0.3, 'gamma': 0.0, 'max_depth': 6,
    'min_child_weight': 1, 'subsample': 1.0, 'colsample_bytree': 1,
    'lambda': 0.0, 'alpha': 0.0, 'objective': 'reg:squarederror',
    'eval_metric': 'rmse', 'n_jobs': 4
}
pi_params = {
    'booster': 'gbtree', 'eta': 0.05, 'gamma': 0, 'max_depth': 10,
    'min_child_weight': 1, 'subsample': 1.0, 'colsample_bytree': 1,
    'objective': 'multi:softprob', 'num_class': 3,
    'eval_metric': 'mlogloss', 'reg_lambda': 0.0, 'reg_alpha': 0.0,
    'nthread': 4
}

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: We clearly state our main claim in the abstract. The contributions are described in the introduction and are clearly reflected throughout the entire paper.
Guidelines:
The answer NA means that the abstract and introduction do not include the claims made in the paper.
The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [NA] Justification: We explicitly state the assumptions that theories are depending on. This assumption is serving as limitations of the work. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: Detailed proofs are provided in the supplemental material. Additionally, we have provided the full set of assumptions for each theorems. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. 
Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: We provide the detailed recipe for running the simulation and the guideline for implementing theories. Also, we provide a detailed method on how the synthetic data are generated. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Every detail in the simulations is sufficiently provided to reproduce the results. Guidelines: The answer NA means that paper does not include experiments requiring code. 
Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: We provide the details of simulations enough to reproduce the results. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: We provide the details of simulations enough to reproduce the results. Also, our simulation includes confidence intervals. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). 
If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We provide the details of simulations enough to reproduce the results. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: We conform the Neur IPS Code of Ethics. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: Our contribution is to make a new causal effect estimator, focusing on theoretic contribution. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. 
If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: [NA] Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [NA] Justification: [NA] Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: [NA] Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? 
Answer: [NA] Justification: [NA] Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: [NA] Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.