# multiplyrobust_causal_change_attribution__3da5c100.pdf

Multiply-Robust Causal Change Attribution

Victor Quintas-Martinez 1 2 Mohammad Taha Bahadori 3 Eduardo Santiago 3 Jeff Mu 3 David Heckerman 3

Abstract Comparing two samples of data, we observe a change in the distribution of an outcome variable. In the presence of multiple explanatory variables, how much of the change can be explained by each possible cause? We develop a new estimation strategy that, given a causal model, combines regression and re-weighting methods to quantify the contribution of each causal mechanism. Our proposed methodology is multiply robust, meaning that it still recovers the target parameter under partial misspecification. We prove that our estimator is consistent and asymptotically normal. Moreover, it can be incorporated into existing frameworks for causal attribution, such as Shapley values, which will inherit the consistency and large-sample distribution properties. Our method demonstrates excellent performance in Monte Carlo simulations, and we show its usefulness in an empirical application. Our method is implemented as part of the Python library Do Why (Sharma & Kiciman, 2020; Bl obaum et al., 2022).

1. Introduction

Analysts are often interested in identifying and quantifying the contribution of multiple possible causes of change to the performance metrics of large-scale systems, such as sales volume, throughput, or retention rate. As an example, a manufacturer compares data from 2023 and 2022 and realizes that net sales have increased. However, many factors have changed between these two time periods, including the characteristics of the product, competitors prices, or market conditions. How much did each of these variables contribute to the increase in average sales? This is a change attribution question. When policymakers and business leaders are interested in using these insights for future changes, the change attribution problem necessarily becomes causal.

1MIT Department of Economics. 2This work was completed while VQM was an intern at Amazon. 3Amazon. Correspondence to: Victor Quintas-Martinez <vquintas@mit.edu>.

Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

In the causal change attribution problem, we observe two samples of data, including an outcome and multiple explanatory variables. We are also given a causal Directed Acyclic Graph (DAG) that encodes the prior expert knowledge about the causal conditional (in-)dependence relationships among the variables in the data. The objective is to assign a score to each of the causal mechanisms in this DAG that quantifies its contribution to the change in the distribution of the outcome. Although in many instances we observe a change over time, we want to emphasize that our methods apply to more general settings involving a comparison of two samples for example, differences between groups (e.g., females and males) or geographical locations (e.g., East vs. West Coast of the US).

The contribution of each causal mechanism is generally very difficult to disentangle. We need to characterize what the distribution of the outcome variable would have been if we shifted only some causal mechanisms, but left others unchanged. This is a counterfactual distribution, which is not directly observable in the data. We overcome this fundamental challenge by employing a combination of regression and re-weighting methods. Moreover, our estimating equation is multipy robust, in the sense that it still recovers the parameters of interest even if some components of the model are misspecified. We show that our estimator is consistent and asymptotically normal under weak conditions on the ML algorithms used to learn the regression and the weights. The asymptotic variance is also consistently estimable, allowing us to compute valid standard errors, confidence intervals and p-values. To obtain these results we build on the existing double/debiased ML literature (Chernozhukov et al., 2018a; 2021; 2023).

There is a large body of work on the attribution problem (Efron, 2020; Yamamoto, 2012; Dalessandro et al., 2012), some of which has studied causal attribution (Liu et al., 2023; Lu et al., 2023; Mougan et al., 2023; Zhao et al., 2023; Berman, 2018; Ji et al., 2016; Dawid et al., 2014; Shao & Li, 2011). An important branch of the literature has focused on developing suitable definitions of the contribution of each variable, e.g., with variations of Shapley values (Janzing et al., 2024; Chen et al., 2023; Budhathoki et al., 2022; Jung et al., 2022; Sharma et al., 2022) or optimal transport methods (Kulinski & Inouye, 2023), while remaining agnostic about estimation. Our contribution is

Multiply-Robust Causal Change Attribution

related but distinct: the estimator we develop can be readily integrated into existing frameworks for causal change attribution, such as Shapley values, or it may be of interest in its own right. Shapley values based on our estimator will inherit its consistency and asymptotic normality properties; we also propose a convenient and computationally efficient bootstrap procedure to compute their standard errors.

Although other recent algorithms have been proposed to estimate Shapley Values, a majority of these algorithms are not multiply-robust, and generally they are based on regression only (learning the distribution of the outcome given the explanatory variables).1 As such, they will not perform well unless we have a high-quality estimator of the regression function. In high-dimensional or non-parametric settings, estimators for those objects will typically exhibit slow convergence rates, rendering traditional normal-based large sample inference (standard errors, confidence intervals or p-values) invalid. We provide the first formal analysis of the asymptotic properties of an estimator in the causal change attribution setting of Budhathoki et al. (2021), which allows us to perform valid inference in large samples (standard errors, confidence intervals and tests).

Causal change attribution is related to the classical problem of treatment effect estimation (Pearl, 2009; Imbens & Rubin, 2015; Peters et al., 2017) in the need to evaluate counterfactual distributions, but there are some fundamental differences. Within the treatment effects literature, the closest problem to ours is that of causal mediation. Tchetgen Tchetgen & Shpitser (2012) gave multiple-robustness results for mediation analysis with a single mediator (explanatory variable). The case with multiple mediators and an arbitrary DAG is substantially more challenging. Although causal mediation with multiple mediators has been studied before (e.g., Daniel et al., 2015), we provide the first multiply-robust estimator in this setting. We also believe to be the first to explicitly build a connection between the causal change attribution problem and the causal mediation literature.

2. Methodology

In this section we describe our multiply-robust methodology for causal change attribution. Section 2.1 defines the causal model and the causal change attribution problem and its connections and contrasts with treatment effect estimation. Section 2.2 gives an identification result in a simple example, which we then generalize in Section 2.3. Section 2.4

1The only exception that we are aware of is Jung et al. (2022). However, their notion of do-Shapley values decomposes the effect of fixing a value for the explanatory variables, rather than a distribution for each causal mechanism, as we do in this paper. They consider only discrete-valued covariates, and their approach does not seem easily generalizable beyond that.

describe how to implement our estimator, and 2.5 gives large-sample inference results. Finally, in Section 2.6 we explain how our method can be incorporated within existing frameworks of causal change attribution, such as Shapley values. We provide the technical conditions and proofs for all lemmas and theorems in Appendix A.

2.1. Setting

We observe multiple i.i.d. measurements of the same variables (T, X, Y ). Each observation consists of a sample indicator T {0, 1}, K explanatory variables X := (X1, . . . , XK) and an outcome of interest Y Y R. The explanatory variables in X could be continuous, categorical, or even unstructured data types such as text or images.

We assume that the distribution of (X, Y ) | T = t has a causal Markov factorization (Spirtes et al., 2000):

P (t) (X,Y ) := P (t) Y |X

k=1 P (t) Xk|PAk, (F)

where PAk are the parents (direct causes) of Xk in the underlying causal DAG, other than T. Throughout, the superscript (t) denotes conditioning on T = t {0, 1}. Each conditional distribution on the right-hand side of (F) is called a causal mechanism (Peters et al., 2017). Without loss of generality, we assume that the explanatory variables are labeled so that k < k if Xk PAk , i.e., the causal antecedents of Xk in the DAG have indices lower than k . In practice, the researcher only needs to know one such causal ordering, rather than the full causal graph for (X, Y ). Our causal model implies a distribution for Y | T = t by marginalization, i.e.,

P (t) Y := Z P (t) Y |X

k=1 d P (t) Xk|PAk.

The problem of change attribution tries to quantify how much of the differences between P (1) Y and P (0) Y are due to shifting each causal mechanism from its distribution at T = 0 to its distribution at T = 1. To clarify, intervening on a causal mechanism PXk|PAk may impact the outcome both directly and indirectly through other explanatory variables which are causal descendants of Xk. We are interested on the total effect of changing PXk|PAk, without fixing the marginal distribution of its causal descendants. Finally, note that we allow the causal mechanisms to differ between samples, but the underlying DAG is assumed to be the same.

We want to draw a clear distinction between attribution and a related problem that has received a much larger attention in the literature, Average Treatment Effect (ATE) estimation under unconfoundedness. The difference between the

Multiply-Robust Causal Change Attribution

(a) Causal Change Attribution.

(b) ATE under Unconfoundedness.

Figure 1. Two possible DAGs for (T, X, Y ).

causal models underlying these two problems is depicted in Figure 1. In an ATE setting, X typically represents pretreatment covariates, that affect both the outcome Y and the propensity to receive a binary treatment T. To compute the causal effect of T we need to control for those covariates, that is, compare units with similar values of X. In contrast, causal change attribution acknowledges that both the distributions of X and of Y given X vary depending on whether T = 0 or T = 1. The goal of causal change attribution is to quantify how much of the effect of T on Y goes through (is mediated by) each causal mechanism in this distribution. Under an unconfoundedness assumption about the assignment of T, the problem of causal attribution can be thought of as decomposing the ATE of T on Y into the Natural Direct Effect and the Natural Indirect Effect corresponding to each causal mechanism (Pearl, 2009, 4.5; Daniel et al., 2015). We describe this in more detail, including the corresponding structural equation model and potential outcomes, in Appendix B.

We note, however, that our definition of attribution does not require T to be a treatment that can be administered in the interventional sense. As discussed, T could also be an indicator for group membership (e.g., female or male), time period (e.g., before and after a certain date), or geographical location (e.g., East vs. West Coast of the US). Under which conditions can this change attribution be interpreted as causal? We will assume that the researcher knows a causal ordering of the underlying true causal graph for (X, Y ). This will allow us to quantify the contribution of each causal mechanism, rather than changes in the marginal distributions of covariates. In practice, researchers can use a combination of causal discovery methods (e.g., the conditional independence test of Zhang et al., 2011) and domain knowledge to obtain the DAG. We also assume that there is no unobserved variable U such that T U, U X and U Y . We discuss ways to relax this assumption in future research in Section 4.

2.2. Preliminary Example

We begin with a simple yet illustrative example. Consider the simplest situation of K = 1. How much of the difference in means E[Y | T = 1] E[Y | T = 0] is due to changing PX? How much of it is due to changing PY |X?

Y Sample 0 Sample 1

(a) Regression: We learn the dependence between Y and X in sample 0 through the regression function γ(X) (gray curve). Subsequently, we average the fitted values for X in sample 1 (red crosses).

Y Sample 0 Sample 1

(b) Re-weighting: We average Y in sample 0 (red circles), but we give more weight to observations whose X is more likely to come from sample 1.

Figure 2. Visual intuition for regression and re-weighting.

In order to answer these questions, we would like to know what the mean of Y would be if we shifted PX to be as in sample 1, but left PY |X unchanged as in sample 0, which we denote as:

θ 1,0 = Z yd P 1,0 Y (y), for P 1,0 Y = Z P (0) Y |Xd P (1) X .

The fundamental challenge is that P 1,0 Y is a counterfactual distribution (Pearl, 2009, 4.5), in the sense that we don t observe data sampled from it, and so we cannot estimate θ 1,0

directly as a sample average. It is still possible, however, to identify θ 1,0 , as shown in the following lemma: Lemma 2.1. Under the regularity conditions given in the appendix, we have the following identification results:

θ 1,0 = E(1)[γ(X)] (REG)

= E(0)[α(X)Y ], (REW)

where γ(X) := E(0)[Y | X], α(X) := d P (1) X /d P (0) X (X), and E(t)[ ] denotes the expectation conditional on T = t.

The intuition for Lemma 2.1 is captured graphically in Figure 2. Equation (REG) gives identification by regression.

Multiply-Robust Causal Change Attribution

We learn the dependence between Y and X in sample 0 through a (non-parametric) regression, and then average that regression function over the X in sample 1. This is essentially a non-parametric generalization of the Oaxaca Blinder decomposition (Oaxaca, 1973; Blinder, 1973). It also appears in the causality literature as part of the mediation formula (see Appendix B and Pearl, 2009, 4.5). Equation (REW) gives identification by re-weighting. We average Y in sample 0, but we give more weight to observations whose X is more likely to come from sample 1. Formally, the weights α(X) are Radon-Nykodim (RN) derivatives (Billingsley, 1995). When X has a density or a probability mass function, these are simply the ratio of densities or probability mass functions between the two samples, respectively. The re-weighting idea has been used in the literature on covariate shift problems (Shimodaira, 2000; Bickel et al., 2009), but we are not aware of causal change attribution methods that leverage this insight. Remark 2.2 (Relation to Mediation). Causal Change Attribution is closely related to decomposing the total effect of an intervention that sets T = 1 into the Natural Direct Effect and the Natural Indirect Effect of Pearl (2009, 4.5). We discuss this connection in detail in Appendix B.

The following result combines regression and re-weighting methods to obtain a more robust identifying equation. It is essentially a version without pre-treatment covariates of the efficient influence function given in Tchetgen-Tchetgen & Shpitser (2012) for causal mediation analysis with a single mediator. We restate it here in our notation, because it will make the intuition of our novel result for a general causal graph (Section 2.3) clearer.

Lemma 2.3. Let g(X), a(X) be two functions such that E(0)[g(X)2] < , E(0)[a(X)2] < . Consider the following estimating equation:

E(1)[g(X)] + E(0)[a(X)e(X, Y )], (DR)

where e(X, Y ) := Y g(X).

Under the conditions of Lemma 2.1, (DR) is equal to θ 1,0

if g(X) = γ(X) or a(X) = α(X), but not necessarily both.

Equation (DR) is doubly robust in the following sense: it still identifies the parameter of interest even if one of the regression function or the weights is misspecified. The term E(0)[a(X)e(X, Y )] is a debiasing term, as in the double/debiased machine learning literature (Chernozhukov et al., 2018a), consisting of an average of the non-parametric regression error e(Y, X) weighted by a(X). If the regression function is correctly specified, this average will be zero regardless of the weights (by the Law of Iterated Expectations). On the other hand, if the regression function is not correctly specified but the weights are, the second term will

account and correct for the misspecification of g(X). We refer the reader to the proof in Appendix A for details. Remark 2.4 (On the Overlap Assumption). One of the regularity conditions (Assumption A.1) imposes that the support of P (0) X includes the support of P (1) X . From a technical perspective, this assumption guarantees that the RN derivative α(X) exists. In this remark, we give a more intuitive explanation. The regression strategy (REG) requires estimating a regression of h(Y ) on X in sample 0, and then obtaining the fitted values in sample 1. Without overlap, we would be extrapolating. In general, we want to avoid this (unless we have a credibly good parametric model for the regression). Similarly, the re-weighting strategy (REW) breaks down without overlap, because some regions of X values that have positive probability in sample 1 are never observed in sample 0. For our asymptotic results in Section 2.5 we will impose a stronger form of overlap (Assumption A.3), which is analogous to the overlap assumption in ATE estimation under unconfoundedness.

2.3. Identification

We are now ready to introduce the main result. Let c := c1, . . . , c K, c K+1 {0, 1}K+1 denote a change vector, where ck = 1 if we shift the k-th causal mechanism to be as in sample 1, and ck = 0 otherwise. The last entry c K+1 indicates whether we want to shift the conditional distribution of the outcome, PY |X. We will denote by P c Y the distribution:

P c Y := Z P (c K+1) Y |X

k=1 d P (ck) Xk|PAk.

For example, in Section 2.2 we considered P 1,0 Y , where we shifted PX to be as in sample 1 but kept PY |X to be as in sample 0. Of all the possible P c Y for c {0, 1}K+1, only two are observed directly in the data: P (0) Y := P 0,...,0,0 Y and P (1) Y := P 1,...,1,1 Y ; the rest are counterfactual distributions.

We will quantify differences in the distribution of the outcome through the change in some functional (summary measure) of the form θ(PY ) = R h(y)d PY (y) for some h: R R. Important examples of such functionals are the mean h(y) = y, the second moment h(y) = y2 (which, combined with the mean, allows to obtain the variance), and the CDF at a point u: hu(y) = 1{y u} (which can be inverted to obtain quantiles and Wasserstein distances between two distributions). Our main results concern identification, estimation and inference on θc := θ(P c Y ) for a counterfactual distribution described by change vector c {0, 1}K+1.

To state our main results, we adopt the following notation. We denote with a bar Xk := (X1, . . . , Xk) for any k K.

Multiply-Robust Causal Change Attribution

In a slight abuse of notation, we write XK+1 = (X, Y ) and define g K+1( XK+1) := h(Y ) as a convention to make mathematical expressions more compact. Theorem 2.5. Let gk( Xk), ak( Xk) be any candidate functions such that E(ck+1)[gk( Xk)2] < , E(ck+1)[ak( Xk)2] < for k = 1, . . . , K. Consider the following estimating equation:

E(c1)[g1(X1)]+

k=1 E(ck+1)[ak( Xk)ek( Xk+1)], (MR)

where ek( Xk+1) := gk+1( Xk+1) gk( Xk) for k = 1, . . . , K.

For k = 1, . . . , K define:

γk( Xk) = E(ck+1)[gk+1( Xk+1) | Xk],

d P (cj) Xj| Xj 1 d P (ck+1) Xj| Xj 1 (Xj | Xj 1).

Under regularity conditions given in the appendix, (MR) is equal to θc if, for every k = 1, . . . , K, either gk( Xk) = γk( Xk) or ak( Xk) = αk( Xk), but not necessarily both.

The intuition for this result is similar to Lemma 2.3. The parameter of interest can be estimated by regression, but now we need to use sequentially nested regressions γk( Xk) (i.e. regressions where the outcome itself is a regression function). This fixes the desired distribution for each causal mechanism. Since we are estimating K regression functions, we require K debiasing terms, which appear additively. Each of these debiasing terms includes a weight with true value αk( Xk), which is a product of the RN derivatives for the distributions that are shifted with respect to ck+1. Including these K debiasing terms makes equation (MR) multiply robust: for each causal mechanism, only the corresponding regression function or the corresponding weights need to be correctly specified, but not necessarily both. We also note that one of our conditions for multiple robustness is that gk( Xk) = γk( Xk) where γk( Xk) := E(ck+1)[gk+1( Xk+1) | Xk]. This is weaker than setting γk( Xk) = E(ck+1)[γk+1( Xk+1) | Xk], because we do not require that gk+1 is correctly specified, as long as ak+1( Xk+1) = αk+1( Xk+1). Example 2.6. Suppose K = 2 and the causal Markov factorization (F) is PY |X1,X2PX2|X1PX1, as captured by the

Figure 3. DAG for Example 2.6.

DAG in Figure 3. Suppose we are interested in estimating θc

for c = 0, 1, 0 . The multiply-robust estimating equation in Theorem 2.5 for this case is:

E(0)[g1(X1)]

+ E(1)[a1(X1){g2(X1, X2) g1(X1)}]

+ E(0)[a2(X1, X2){h(Y ) g2(X1, X2)}],

where the true values of the unknown functions are:

γ2(X1, X2) = E(0)[h(Y ) | X1, X2],

γ1(X1) = E(1)[g2(X1, X2) | X1],

α2(X1, X2) = d P (1) X2|X1/d P (0) X2|X1(X2 | X1),

α1(X1) = d P (0) X1 /d P (1) X1 (X1).

Remark 2.7 (Runtime and Speedup). The general result of Theorem 2.5 implies that, when there are K explanatory variables, we can identify θc for any c with a multiplyrobust estimating equation that depends on 2K unknown functions (K regression functions and K weights) that need to be estimated. For certain change vectors c, however, it is possible to simplify the identifying equation to have fewer unknown functions while maintaining multiple robustness. In Appendix C, we discuss two settings where simplifications may occur. The first one is when ck = ck+1 for some k (i.e., when two consecutive regressions are computed with respect to the same probability distribution). The second one is when Xk and Xk+1 are (conditionally) independent. In particular, when all the explanatory variables are independent, for any given c we only need to estimate one regression and one RN derivative, regardless of K.

2.4. Estimation

In this section, we describe how to implement the multiplyrobust estimating equation (MR) in practice.

Step 1. Estimate the Regressions The nested regression functions γk( Xk) can be estimated by any parametric or non-parametric ML regression algorithm, including LASSO, ridge, random forests, neural networks, boosting, or any ensemble method combining multiple of these. We work recursively, starting with an estimator ˆγK( XK) for the regression of h(Y ) on XK := X, using the the data from sample c K+1 {0, 1} (i.e., the distribution we want to fix for PY |X). Next, we obtain ˆγK 1( XK 1) by regressing ˆγK( XK) on XK 1, using the the data from sample c K {0, 1} (i.e., the distribution we want to fix for PXK|PAK). We proceed this way up until we obtain ˆγ1(X1).

Step 2. Estimate the Weights We propose two different approaches to estimate the RN derivatives that appear in the weights αk( Xk): classification and automatic estimation. Both methods target the weights directly, rather than

Multiply-Robust Causal Change Attribution

estimating a density or probability mass function in each sample and then taking the ratio. For conciseness, we only describe estimation of µ( Xj) := d P (1) Xj /d P (0) Xj ( Xj). The reciprocal RN derivative can be estimated analogously by exchanging the indices 0 and 1. Conditional RN derivatives for Xj | Xj 1 can be obtained by dividing the RN derivative for Xj by the RN derivative for Xj 1.

2a. Classification By Bayes rule, we can express:

µ( Xj) = β( Xj) 1 β( Xj) 1 p

where β( Xj) := Pr(T = 1 | Xj) and p := Pr(T = 1) (the posterior and prior probabilities of Xj coming from sample 1, respectively). For a given p, this defines a one-toone mapping between µ( Xj) and β( Xj).

This suggests the following methodology to estimate µ( Xj). First, we train a classification ML algorithm (logistic regression with LASSO or ridge penalties, neural networks, random forests, etc.) to predict T based on the concatenated data for Xj. Second, we replace β( Xj) in Bayes rule with the predicted posterior probabilities, and we replace (1 p)/p by its empirical analog, n0/n1, where nt is the number of observations in the T = t sample. The posterior probabilities can also be calibrated by cross-validation, as discussed, e.g., in Niculescu-Mizil & Caruana (2005).

Although this method of estimating RN derivatives is not new (c.f., for instance, Sugiyama et al., 2012, pt. II, ch. 4, or Arbour et al., 2021), the application to building a multiplyrobust moment function is novel.

2b. Automatic Estimation We specialize general results derived by Chernozhukov et al. (2021) to the causal change attribution problem. In particular, we can characterize the RN derivative µ( Xj) as the solution of:

µ( Xj) = arg min m E(0)[m( Xj)2] 2E(1)[m( Xj)].

The literature refers to this method as automatic estimation of the weights because the loss function above depends only in m( Xj), without using the explicit form of µ( Xj) as an RN derivative. Thus, we could minimize a sample version of the criterion above over some class of functions (e.g. linear with LASSO or ridge penalties, neural networks, random forests, etc.).

Step 3. Estimate the Target Parameter Finally, we build an estimator of the counterfactual parameter θc by replacing g( Xk) with ˆγk( Xk) and a( Xk) with ˆαk( Xk) in a sample

analog of equation (MR):

ˆθc := b E(c1)[ˆγ1(X1)]

k=1 b E(ck+1)[ˆαk( Xk)ˆek( Xk+1)], (EST)

where ˆek( Xi,k+1) := ˆγk+1( Xk+1) ˆγk( Xk) for k = 1, . . . , K, and we use b E(t)[f(W)] = n 1 t P

i:Ti=t f(Wi) to denote the sample average on sample t {0, 1}. Remark 2.8 (Sample-splitting). To avoid overfitting bias, our recommendation is that the data used to learn the unknown functions γk( Xk), αk( Xk) are not re-used to estimate the main parameter θc. When the dataset is large, different random subsamples can be used. In medium-sized datasets, a good alternative is to use cross-fitting (see, for example, Newey & Robins, 2018).

2.5. Large-Sample Inference

In this section, we show consistency and asymptotic normality of our estimator ˆθc, and explain how to perform large-sample inference on the true parameter θc. First, we notice that the estimator in (EST) is the sum of two averages over different samples:

ˆθc = b E(0)[ ˆψ(0)( XK+1)] + b E(1)[ ˆψ(1)( XK+1)],

where ˆψ(t)( XK+1) contains the terms that depend on the data from sample t {0, 1} (the exact expression is given in the proof of Theorem 2.9). This characterization will be useful in the results below in terms of writing the asymptotic variance of ˆθc. We define an oracle version of the same object, ψ(t)( XK+1), by replacing ˆγk( Xk), ˆαk( Xk) with their respective true values γk( Xk), ˆαk( Xk) for k = 1, . . . , K.

The following theorem gives an asymptotic normality result for ˆθc, which allow us to perform valid inference on the true parameter θc in large samples (e.g., provide standard errors or confidence intervals):

Theorem 2.9. Under regularity conditions given in the appendix, ˆθc is consistent and asymptotically normal:

ˆθc p θc and n(ˆθc θc) d N(0, V ),

where n := n0 + n1 and

V := 1 1 p Var(0)[ψ(0)( XK+1)]+1

p Var(1)[ψ(1)( XK+1)].

Moreover, V can be estimated consistently by:

n0 d Var(0)[ ˆψ(0)( XK+1)] + n

n1 d Var(1)[ ˆψ(1)( XK+1)]

and Pr(θc [ˆθc z1 a/2 ( ˆV /n)1/2]) 1 a, where z1 a/2 be the (1 a/2)-th quantile of a standard Gaussian random variable.

Multiply-Robust Causal Change Attribution

Remark 2.10 (On the Rate Conditions). Assumption A.5, given in the appendix, imposes certain requirements on the root mean-square (RMSE) convergence rates of ˆγk and ˆαk, which depend on the choice of algorithm and the properties of the true regression and weighting functions (smoothness, sparsity, etc.). The main condition is that the product of the RMSE for the regression and for the weights has to vanish faster than n 1/2. This restriction guarantees that the bias of the main estimator ˆθc is small enough in large samples to obtain valid inference. The product condition captures an important trade-off: in situations where the regression functions can be estimated at a fast rate, the method will work even if the weights converge very slowly, and vice versa, which is a key advantage of using a multiply-robust estimating equation.

2.6. Quantifying the Contribution of Each Causal Mechanism

We now discuss two ways to use the θc for the problem of causal attribution: Shapley values and along a causal path.

Shapley Values Shapley values compute the effect of shifting a causal mechanism averaging over each possible combination of which other causal mechanisms to change. Formally, let ek be a (K + 1)-vector with a 1 in the k-th entry and 0 everywhere else. The Shapley value associated with PXk|PAk is:

1 (K + 1) K P

j cj (θc+ek θc).

Along a Causal Path Define, for each k = 1, . . . , K + 1, a vector bk {0, 1}K+1 with entries 1, . . . , k equal to 1, and the rest equal to 0. In this method, we define the contribution of the k-th causal mechanism to be:

PATHk = θbk θbk 1.

In other words, we define the contribution of the k-th causal mechanism as the effect of shifting from PXk|PAk fixing its causal antecedents to be as in sample 1, but its causal descendants to be as in sample 0.

How to choose between these two approaches? As discussed in the literature (Budhathoki et al., 2021), the effect of changing the one causal mechanism may depend on which other mechanisms have already been changed; Shapley values take that into account, whereas the method along a causal path does not. Whether this is a relevant concern depends on the application: in some cases, the impact of a given explanatory variable may be approximately the same regardless of the distribution of other explanatory variables. On the other hand, causal attribution along a causal path requires estimating θc for only K change vectors c (and, by

the simplification in Remark C.1, each requires training only one regressor and one classifier), whereas computing Shapley values requires estimating θc for 2K+1 change vectors c (each of which could require training up to 2K machine learners). Some computationally efficient approximations to SHAPk exist (e.g, Kolpaczki et al., 2023).

Let \ SHAPk and \ PATHk be estimators of SHAPk and PATHk, respectively, that replace the true θc with the multiply-robust estimator ˆθc in (EST). The following corollary gives a useful result for large-sample inference on the causal attribution parameters:

Corollary 2.11. Under the conditions of Theorem 2.9, \ SHAPk and \ PATHk are consistent and asymptotically normal, with a consistently estimable asymptotic variance.

This is a consequence of \ SHAPk and \ PATHk being finite linear combinations of ˆθc. The asymptotic variance could be computed explicitly, although this might be cumbersome, especially for Shapley values, since it requires the 2K+1 2K+1 covariance matrix of ˆθc for c {0, 1}K+1. As an alternative, the multiplier bootstrap of Belloni et al. (2017) can be used. We describe the procedure in Appendix D. An advantage of the multiplier approach is that each bootstrap replication does not require re-training the ML estimators of the regression function and weights.

3. Empirical Results

In this section, we discuss two sets of Monte-Carlo simulation results and a real-data application of our method to understanding the gender wage gap. The code for the simulations and the empirical application is available at https://github.com/victor5as/ mr_causal_attribution.

3.1. Monte-Carlo Simulations

Design 1 We now present simulation evidence on the performance of our method. Our first simulation design is based on the causal model of Example 2.6, with DAG X1 X2, X1 Y and X2 Y . We give details about the datagenerating process in Appendix E.

Table 1 presents the results of our simulation (over 1,000 draws) for different choices of learners for the regression and the weights. 1a shows the Mean Absolute Error (MAE) for a scenario where both the regression and the weights are correctly specified parametric learners: OLS with quadratic terms for the regressions and logistic classifier with quadratic terms to build the weights.2 Comparing our multiply-robust (MR) estimator to using regression or

2This is the correct specification for the weights, since the loglikelihood ratio between two Gaussian distributions with possibly different variances is a quadratic polynomial.

Multiply-Robust Causal Change Attribution

re-weighting only, we can see that its performance is statistically indistinguishable from the best of the two methods (regression).

Tables 1b and 1c show what happens when either the weights or the regressions are misspecified, while the specification of the other component of the model is still correct. As a particular form of misspecification, we omit the quadratic terms from either the OLS regression or the logistic classifier. In either case, we observe that MR performs as well as the correctly-specified method, while the misspecified method exhibits much higher errors. This illustrates the multiple robustness property of our estimator. When both the weights and the regression are misspecified parametric models, as in 1d, our theoretical guarantees no longer apply. However, we still show these results for completeness. Although it seems to be the case that the error in Shapley values achieved by the MR methods is about as low as the best of the two other methods, it is significantly larger than the correctly-specified benchmark of Table 1a.

Next, we turn our attention to flexible, non-parametric learners for the unknown regression functions and weights, to mimic a more realistic situation in which we do not want to make strong parametric assumptions about these. First, we consider neural networks (Table 1e). We use MLPRegressor and MLPClassifier from the Python library sklearn. We use the default settings for all hyperparameters, except that we allow for early stopping and use 3 hidden layers of 100 neurons each. Second, we consider gradient boosting (Table 1f), using Gradient Boosting Regressor and Gradient Boosting Classifier from sklearn with the default settings. In both cases, we calibrate the predicted probabilities after classification with Calibrated Classifier CV also from sklearn. The results clearly illustrate Remark 2.10: the bias of the MR non-parametric estimator will, in general, vanish faster than the convergence rates of regression-only or reweighting-only methods. As such, we see that in general MR has equal or lower error than the best of the other two methods. Moreover, the MAE is often statistically indistinguishable from that of the correctly-specified parametric benchmark 1a.

Design 2 In a second set of experiments, we consider the effect of increasing hte number of explanatory variables K. We consider a line DAG, X1 X2 XK Y . We give more details about the data-generating process in Appendix E. We use LASSO to learn the regression functions, and Logistic Regression with ℓ1 penalty (Logit LASSO) for the weights. In both cases, the penalty is selected by cross-validation.

Table 2 shows the average (across 500 simulations) of the worst-case (across parameters) absolute error, that is, the

Education Occupation

Figure 4. DAG for the gender wage gap application.

average of maxk=1,...,K+1 |PATHk \ PATHk|. The pattern is almost identical if we look at the Mean Absolute Error instead. The MR estimator performs significantly better than the other two methods, even though the regression and the weights are correctly specified (i.e., they are truly linear/logistic and sparse).

3.2. Real-World Application: Gender Wage Gap

Following Budhathoki et al. (2021), we ask how much of the gender wage gap can be attributed to differences in education or occupation. We use data from the Current Population Survey (CPS) 2015. After applying the same sample restrictions as Chernozhukov et al. (2018c), the resulting sample contains 18,137 male and 14,382 female employed individuals. On this data, the (unconditional) gender wage gap in this sample is of $7.91/hour (standard error 0.01). In other words, female workers earn 24% less on average than male workers; the difference is statistically significant.

We assume the same causal graphical model as Budhathoki et al. (2021), which is captured by the DAG in Figure 4. We randomly split the data into a training set used to estimate the regression function and weights (40%), a calibration set to calibrate the predicted probabilities from our classification algorithm (10%), and an evaluation set used to obtain the main estimates of the ˆθc parameters (50%). To estimate the regression and the weights, we use Hist Gradient Boosting Regressor and Hist Gradient Boosting Classifier from sklearn in Python. We calibrate the probabilities by isotonic regression.

Our estimated Shapley values are plotted in Figure 5. The contribution of education is estimated to be positive and significant. In contrast, the distribution of occupation given education has a significant negative effect, slightly larger in magnitude than the contribution of education, but of opposite sign. Most of the gender wage gap is therefore explained by differences in the distribution of wages given education and occupation. This is consistent with previous findings in the literature. For example, Chernozhukov et al. (2018b), documents the same fact for the UK.

One could think of P(wage | occup, educ) as residual variation, i.e., the remaining gender gap which is not ex-

Multiply-Robust Causal Change Attribution

8 6 4 2 0 2 Gender Wage Gap ($/hour)

Unconditional Wage Gap: -7.76*** (0.01)

P(educ): 1.13*** (0.34)

P(occup | educ): -2.08*** (0.35)

P(wage | occup, educ): -6.81*** (0.36)

Figure 5. Shapley Values in our gender wage gap application. We plot the point estimates and 95% confidence intervals. We also report the point estimates (bootstrapped standard errors in brackets). We denote statistical significance at the 5% level by , and at the 1% level by .

plained by education or occupation. It may capture the effect of additional variables that are not included in our analysis. For example, there may be gender differences in labor market experience due to women temporarily exiting the labor force to take care of young children. Unfortunately, experience is not available in census data, and so we cannot quantify its impact, but excluding it from the analysis does not bias our results, because it is plausibly not a causal antecedent of education and occupational choice. It may also be due to other factors, including gender discrimination in promotion or gender differences in wage bargaining.

To gain more insight into the results of our attribution exercise, it is natural to ask two questions: (1) how different is the distribution of education and occupation between the two groups, and (2) how do these differences translate into differences in average wage. We present some summary statistics related to these questions in the appendix. Figure 7a compares the distribution of educational attainment for males and females. It appears that females tend to be slightly more educated, with higher rates of college and advanced degree graduation. This could explain the positive Shapley value associated with education; for comparison, college graduates earn $12.60/hour more than high school graduates on average. Figure 7b shows the distribution of occupation conditional on having a college degree. Females are more predominant in administrative, education or healthcare professions, whereas males are more likely to work in management and sales. For comparison, managers earn $16.66/hour more than educators and $6.29/hour more than healthcare practitioners on average. At the same time, there are wide differences across genders that cannot be explained by education and occupation alone: female college graduate managers earn $13.77/hour less on average than their male counterparts, which is consistent with the finding that

P(wage | occup, educ) has the largest Shapley value in our causal change attribution exercise.

4. Conclusions

In this paper, we develop a new estimator for causal change attribution measures, which combines regression and reweighting methods in a multiply-robust estimating equation. We provide the first results on consistency and asymptotic normality of ML-based estimators in this setting, and discuss how to perform inference. Moreover, our method can be used to estimate Shapley values, which will inherit the consistency and large-sample distribution properties.

Finally, we suggest two directions for future research. First, in some settings it will be more plausible to assume unconfoundedness conditional on observed covariates. Tchetgen Tchetgen & Shpitser (2012) gave a multiply-robust estimator for a single mediator. Extending their results to a general causal graph with multiple explanatory variables would be interesting. Second, the causal interpretation of the change attribution parameters relies on an assumption of no unobserved confounding. The sensitivity bounds in Chernozhukov et al. (2022) could be adapted as a way to test the robustness of a causal change attribution study to unobserved confounding.

Acknowledgements

The authors would like to thank Victor Chernozhukov, Dominik Janzing, Eric Tchetgen-Tchetgen, and seminar audiences at Amazon and MIT for providing helpful feedback. We also would like to thank Patrick Bl obaum for help with code integration in Do Why.

Multiply-Robust Causal Change Attribution

Impact Statement

This paper presents work whose goal is to advance the fields of Causal Inference and Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

The causal interpretation of our change attribution measures is grounded on certain assumptions about the Directed Acyclic Graph (DAG) that codifies causal dependence among the variables in the data. We are precise about these assumptions (Section 2.1), and discuss some ways to relax them in future extensions (Section 4). Our theoretical results also rely on regularity conditions, which we state formally in the Appendix and explain in more intuitive terms in the main body of the paper. Practitioners need to be mindful that, if these assumptions do not hold, our proposed algorithm is not guaranteed to lead to accurate results.

Arbour, D., Dimmery, D., and Sondhi, A. Permutation weighting. In International Conference on Machine Learning, pp. 331 341. PMLR, 2021.

Belloni, A., Chernozhukov, V., Fern andez-Val, I., and Hansen, C. Program evaluation and causal inference with high-dimensional data. Econometrica, 85(1):233 298, 2017.

Berman, R. Beyond the last touch: Attribution in online advertising. Marketing Science, 37(5):771 792, 2018.

Bickel, S., Br uckner, M., and Scheffer, T. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10(9), 2009.

Billingsley, P. Probability and Measure. John Wiley & Sons, third edition, 1995.

Blinder, A. S. Wage discrimination: reduced form and structural estimates. Journal of Human resources, pp. 436 455, 1973.

Bl obaum, P., G otz, P., Budhathoki, K., Mastakouri, A. A., and Janzing, D. Dowhy-gcm: An extension of dowhy for causal inference in graphical causal models. ar Xiv preprint ar Xiv:2206.06821, 2022.

Budhathoki, K., Janzing, D., Bloebaum, P., and Ng, H. Why did the distribution change? In AISTATS, 2021.

Budhathoki, K., Minorics, L., Bl obaum, P., and Janzing, D. Causal structure based root cause analysis of outliers. In International Conference on Machine Learning, pp. 2357 2369. PMLR, 2022.

Chen, H., Covert, I. C., Lundberg, S. M., and Lee, S.-I. Algorithms to estimate shapley value feature attributions. Nature Machine Intelligence, pp. 1 12, 2023.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1 C68, 2018a.

Chernozhukov, V., Fern andez-Val, I., and Luo, S. Distribution regression with sample selection, with an application to wage decompositions in the UK. Technical Report 1811.11603, ar Xiv.org, 2018b.

Chernozhukov, V., Fern andez-Val, I., and Luo, Y. The sorted effects method: Discovering heterogeneous effects beyond their averages. Econometrica, 86(6):1911 1938, 2018c.

Chernozhukov, V., Newey, W. K., Quintas-Mart ınez, V., and Syrgkanis, V. Automatic debiased machine learning via Riesz regression. ar Xiv preprint ar Xiv:2104.14737, 2021.

Chernozhukov, V., Cinelli, C., Newey, W., Sharma, A., and Syrgkanis, V. Long Story Short: Omitted Variable Bias in Causal Machine Learning. Technical report, National Bureau of Economic Research, Cambridge, MA, Jul 2022.

Chernozhukov, V., Newey, W., Singh, R., and Syrgkanis, V. Automatic debiased machine learning for dynamic treatment effects and general nested functionals, 2023.

Dalessandro, B., Perlich, C., Stitelman, O., and Provost, F. Causally motivated attribution for online advertising. In Proceedings of the sixth international workshop on data mining for online advertising and internet economy, pp. 1 9, 2012.

Daniel, R. M., De Stavola, B. L., Cousens, S. N., and Vansteelandt, S. Causal mediation analysis with multiple mediators. Biometrics, 71(1):1 14, 2015.

Dawid, A. P., Faigman, D. L., and Fienberg, S. E. Fitting science into legal contexts: Assessing effects of causes or causes of effects? Sociological Methods & Research, 43 (3):359 390, 2014.

Efron, B. Prediction, estimation, and attribution. International Statistical Review, 88:S28 S59, 2020.

Imbens, G. W. and Rubin, D. B. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.

Janzing, D., Bl obaum, P., Minorics, L., Faller, P., and Mastakouri, A. Quantifying intrinsic causal contributions via structure preserving interventions. In International Conference on Artificial Intelligence and Statistics. PMLR, 2024.

Multiply-Robust Causal Change Attribution

Ji, W., Wang, X., and Zhang, D. A probabilistic multi-touch attribution model for online advertising. In Proceedings of the 25th acm international on conference on information and knowledge management, pp. 1373 1382, 2016.

Jung, Y., Kasiviswanathan, S., Tian, J., Janzing, D., Bl obaum, P., and Bareinboim, E. On measuring causal contributions via do-interventions. In International Conference on Machine Learning, pp. 10476 10501. PMLR, 2022.

Kolpaczki, P., Bengs, V., Muschalik, M., and H ullermeier, E. Approximating the shapley value without marginal contributions. Technical Report 2302.00736, ar Xiv.org, 2023.

Kulinski, S. and Inouye, D. I. Towards explaining distribution shifts. In International Conference on Machine Learning, pp. 17931 17952. PMLR, 2023.

Liu, J., Wang, T., Cui, P., and Namkoong, H. On the need for a language describing distribution shifts: Illustrations on tabular datasets. ar Xiv preprint ar Xiv:2307.05284, 2023.

Lu, Z., Geng, Z., Li, W., Zhu, S., and Jia, J. Evaluating causes of effects by posterior effects of causes. Biometrika, 110(2):449 465, 2023.

Mougan, C., Broelemann, K., Masip, D., Kasneci, G., Thiropanis, T., and Staab, S. Explanation shift: Investigating interactions between models and shifting data distributions. ar Xiv preprint ar Xiv:2303.08081, 2023.

Newey, W. K. and Robins, J. R. Cross-fitting and fast remainder rates for semiparametric estimation. ar Xiv preprint ar Xiv:1801.09138, 2018.

Niculescu-Mizil, A. and Caruana, R. Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning, pp. 625 632, 2005.

Oaxaca, R. Male-female wage differentials in urban labor markets. International economic review, pp. 693 709, 1973.

Pearl, J. Causality. Cambridge University Press, 2009.

Peters, J., Janzing, D., and Sch olkopf, B. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.

Shao, X. and Li, L. Data-driven multi-touch attribution models. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 258 264, 2011.

Shapley, L. A value for n-person games. Contributions to the theory of games, 1953.

Sharma, A. and Kiciman, E. Dowhy: An end-to-end library for causal inference. ar Xiv preprint ar Xiv:2011.04216, 2020.

Sharma, A., Li, H., and Jiao, J. The counterfactual-shapley value: Attributing change in system metrics. In Neur IPS 2022 Workshop on Causality for Real-world Impact, 2022.

Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227 244, 2000.

Spirtes, P., Glymour, C. N., Scheines, R., and Heckerman, D. Causation, Prediction, and Search. MIT press, 2000.

Sugiyama, M., Suzuki, T., and Kanamori, T. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012.

Tchetgen-Tchetgen, E. J. and Shpitser, I. Semiparametric theory for causal mediation analysis: Efficiency bounds, multiple robustness, and sensitivity analysis. Annals of Statistics, 40(3):1816, 2012.

Yamamoto, T. Understanding the past: Statistical analysis of causal attribution. American Journal of Political Science, 56(1):237 256, 2012.

Zhang, K., Peters, J., Janzing, D., and Sch olkopf, B. Kernelbased conditional independence test and application in causal discovery. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pp. 804 813. AUAI, 2011.

Zhao, R., Zhang, L., Zhu, S., Lu, Z., Dong, Z., Zhang, C., Xu, J., Geng, Z., and He, Y. Conditional counterfactual causal effect for individual attribution. In Uncertainty in Artificial Intelligence, pp. 2519 2528. PMLR, 2023.

Multiply-Robust Causal Change Attribution

A.1. Proofs of Section 2.2

Proof of Lemma 2.1. We begin by assuming P (1) X P (0) X , that is, P (1) X is absolutely continuous with respect to P (0) X . We give a formal definition of absolute continuity in Assumption A.1 (Weak Overlap), which we will present in the proof of the general identification result in Theorem 2.5. We also discuss this assumption in Remark 2.4 in the main text. By the Radon-Nykodim Theorem (Billingsley, 1995), the Radon-Nykodim (RN) derivative α(X) := d P (1) X /d P (0) X (X) exists, is almost-surely unique. Furthermore, we assume E(0)[Y 2] < and E(0)[α(X)2] < , which is a version of Assumption A.2 (Moments). Assumptions A.1 and A.2, combined with the Cauchy-Schwartz inequality, imply that any function g(X) such that E(0)[g(X)2] < will be P (1) X -integrable function g(X), and moreover:

E(1)[g(X)] = Z g(x)d P (1) X (x) = Z α(x)g(x)d P (0) X (x) = E(0)[α(X)g(X)] (RN)

We re now ready to prove our two identification results. Recall the definition of the target parameter:

θ 1,0 = Z y d P (0) Y |X(y | x)d P (1) X (x).

By A.2, γ(X) := E(0)[Y | X] exists and E(0)[γ(X)2] < . By definition of γ(X):

E(1)[γ(X)] = Z γ(x)d P (1) X (x) = Z Z y d P (0) Y |X(y | x) d P (1) X (x) = Z y d P (0) Y |X(y | x)d P (1) X (x),

which shows (REG).

On the other hand, by the Law of Iterated Expectations and property (RN),

E(0)[α(X)Y ] = E(0)[α(X)γ(X)] = E(1)[γ(X)],

which shows (REW).

Proof of Lemma 2.3. Consider first the case of g(X) = γ(X). In that case, by definition of γ(X), E[e(Y, X) | X] = E[Y γ(X) | X] = 0. By the Law of Iterated Expectations, E(0)[a(X)e(Y, X)] = 0 for any a(X) such that E(0)[a(X)2] < , and so (DR) and (REG) give:

E(1)[γ(X)] + E(0)[a(X)e(Y, X)] = E(1)[γ(X)] = θ 1,0 .

If a(X) = α(X), by (RN), E(1)[g(X)] = E(0)[α(X)g(X)] for any g(X) such that E(0)[g(X)2] < , and so (DR) and (REW) give: E(1)[g(X)] + E(0)[α(X)Y ] E(0)[α(X)g(X)] = E(0)[α(X)Y ] = θ 1,0 .

A.2. Proofs of Section 2.3

Proof of Theorem 2.5. We assume the following regularity conditions:

Assumption A.1 (Weak Overlap). For all k = 1, . . . , K, P (0) Xk|PAk and P (1) Xk|PAk are mutually absolutely continuous.

Recall that, for two measures µ, ν on a σ-algebra B, we say that ν µ (ν is absolutely continuous with respect to µ) whenever µ(B) = 0 implies ν(B) = 0 for v B (Billingsley, 1995). If ν µ and µ ν, we say that µ and ν are mutually absolutely continuous. Absolute continuity guarantees that the RN derivatives that go into the weight exist. Intuitively, we are requiring that the distribution of Xk | PAk has the same support in samples 0 and 1. This is essential both for regression methods (avoids extrapolating the regression function) and for re-weighting methods (guarantees that we can find observations that will be similar enough to the other sample), as discussed in Remark 2.4.

Assumption A.2 (Moments). E(t)[h(Y )2] < and E(t)[αk(X)2] < for all k = 1, . . . , K and t {0, 1}.

Multiply-Robust Causal Change Attribution

In particular, Assumption A.2 implies E(t)[γk(Y )2] < for all k = 1, . . . , K and t {0, 1}.

We can now prove the main result. Recall the definition of the general target parameter:

θc = Z h(y)d P (c K+1) Y |X (y | x)

k=1 d P (ck) Xk|PAk(xk | pak).

Step 1: Initial Case We begin by showing that, if g1(X1) = γ1(X1) := E(c2)[g2(X1, X2) | X1] or a1(X1) = α1(X1) :=

d P (c1) X1 /d P (c2) X1 (X1) (but not necessarily both), we have:

E(c1)[g1(X1)] + E(c2)[a1(X1)e1(X1, X2)] = Z g2(x1, x2)d P (c2) X2|X1(x2 | x1)d P (c1) X1 (x1).

Consider first the case g1(X1) = γ1(X1). Then, Ec2[e1(X1, X2) | X1] = 0 and by the Law of Iterated Expectations Ec2[a1(X1)e1(X1, X2)] = 0 for any a1(X1) such that E(c2)[a1(X1)2] < . The definition of γ1(X1) implies:

E(c1)[γ1(X1)] = Z γ1(x1)d P (1) X1 (x1) = Z Z g2(x1, x2)d P (c2) X2|X1(x2 | x1) d P (1) X (x).

as we wanted to show.

On the other hand, if a1(X1) = α1(X1), by (RN), E(c1)[g1(X1)] = E(c2)[α1(X1)g1(X1)] for any g1(X) such that E(c2)[g1(X)2] < . Moreover, by the law of iterated expectations and (RN), E(c2)[α1(X1)g2(X1, X2)] = E(c2)[α1(X1)γ1(X1)] = E(c1)[γ1(X1)], so

E(c1)[g1(X1)] + E(c2)[α1(X1)g2(X1, X2)] E(c2)[α1(X1)g1(X1)] = Z g2(x1, x2)d P (c2) X2|X1(x2 | x1)d P (c1) X1 (x1),

as we wanted to show.

Step 2: Induction Suppose that for some 2 j K, it holds that:

E(c1)[g1(X1)] +

k=1 E(ck+1)[ak( Xk)ek( Xk+1)] = Z gj( Xj)

k=1 d P (ck) Xk| Xk 1(xk | Xk 1).

We next show that, if

gj( Xj) = γj( Xj) := E(cj+1)[gj+1( Xj+1) | Xj]

or aj( Xj) = αj( Xj) :=

d P (ck) Xk| Xk 1 d P (cj+1) Xk| Xk 1 (Xk | Xk 1)

(but not necessarily both), we have:

E(c1)[g1(X1)] +

k=1 E(ck+1)[ak( Xk)ek( Xk+1)] = Z gj+1( Xj+1)

k=1 d P (ck) Xk| Xk 1(xk | Xk 1).

If gj( Xj) = γj( Xj), as before, we have Ecj+1[ej( Xj+1) | Xj] = 0, and by the Law of Iterated Expectations E(cj+1)[aj( Xj)ej( Xj+1)] = 0 for any aj( Xj) such that E(cj+1)[aj( Xj)2] < . The inductive hypothesis and the definition of γj( Xj) imply the desired result:

E(c1)[g1(X1)] +

k=1 E(ck+1)[ak( Xk)ek( Xk+1)] + E(cj+1)[aj( Xj)ej( Xj+1)]

= Z γj( Xj)

k=1 d P (ck) Xk| Xk 1(xk | Xk 1)

= Z Z gj+1( Xj+1)d P (cj+1) Xj+1| Xj(xj+1 | Xj) j Y

k=1 d P (ck) Xk| Xk 1(xk | Xk 1). ( )

Multiply-Robust Causal Change Attribution

Otherwise, suppose that aj( Xj) = αj( Xj). Notice that αj( Xj) is the RN derivative of

k=1 d P (ck) Xk| Xk 1 with respect to

k=1 d P (cj+1) Xk| Xk 1.

The first distribution is a counterfactual distribution if ck = ck for some k, k j, because some causal mechanisms are distributed as in sample 0, and others as in sample 1. The second distribution, however, is the factual joint distribution of Xj in sample cj+1, by the causal Markov assumption and the order of the variable labels. In this light, αj( Xj) satisfies property (RN), and for any gj( Xj) with E(cj+1)[gj( Xj)2] < , we have

E(cj+1)[αj( Xj)gj( Xj)] = Z gj( Xj)

k=1 d P (ck) Xk| Xk 1(xk | Xk 1).

In particular, by the inductive hypothesis and the Law of Iterated Expectations:

E(c1)[g1(X1)] +

k=1 E(ck+1)[ak( Xk)ek( Xk+1)] + E(cj+1)[αj( Xj)gj+1( Xj+1)] E(cj+1)[αj( Xj)gj( Xj)]

= E(cj+1)[αj( Xj)γj( Xj)].

Finally, property (RN) and ( ) give the desired result.

Step 3: Final Step Before we finish, remember that we adopted the notational convention g K+1( XK+1) := h(Y ). Therefore, the K-th inductive step implies:

E(c1)[g1(X1)] +

k=1 E(ck+1)[ak( Xk)ek( Xk+1)] = Z h(y)d P (c K+1) Y |X (y | x)

k=1 d P (ck) Xk| Xk 1(xk | Xk 1).

To conclude the proof, remember that we have labeled the variables so that PAk are part of Xk 1. Because of the causal Markov factorization assumption, PXk| Xk 1 = PXk|PAk. In other words, once we condition on its direct causes in the DAG, Xk is independent of the rest of variables that are part of Xk 1.

A.3. Proofs of Section 2.5

Proof of Theorem 2.9. Chernozhukov et al. (2023, Theorem 9) gives an asymptotic normality and inference result for nested regression functionals. Here we discuss the assumptions that allow us to apply their results. In the statement of the assumptions, C < denotes a generic positive and finite constant.

Assumption A.3 (Strong Overlap). There exists ϵ > 0 such that, for all k = 1, . . . , K, ϵ Pr(T = 1 | Xk) 1 ϵ almost surely. Moreover, for all k = 1, . . . , K, ˆαk C.

This assumption requires the posterior probability of observation Xk sample 1 to be bounded away from 0 and 1. In other words, it must not be possible to tell which sample any given observation comes from with absolute certainty; thus strengthening Assumption A.1. Notice that, this implies 0 < p < 1, where p := Pr(T = 1), and that αk( Xk) will also be almost surely bounded by Bayes rule (BAY). Since the true weights are bounded, we also impose boundedness of the estimated weights ˆαk( Xk).

Assumption A.4 (Higher-order Moments). We assume:

(i) For some q > 2 and all k = 1, . . . , K, E(ck)[γk( Xk)q] C, E(ck)[ˆγk( Xk)q] C, E(ck+1)[γk( Xk)q] C, and E(ck+1)[ˆγk( Xk)q] C.

(ii) For all k = 1, . . . , K, E(ck+1)[(γk+1( Xk+1) γk( Xk))2 | Xk] C.

(iii) V := 1 1 p Var(0)[ψ(0)( XK+1)] + 1

p Var(1)[ψ(1)( XK+1)] c0 > 0.

Multiply-Robust Causal Change Attribution

The first part guarantees that we can apply the Central Limit Theorem. The second part is a regularity condition that allows for heteroskedasticity in each regression, but requires the conditional variance to be bounded. The third part assumes that the asymptotic variance of the target parameter is bounded away from 0. Recall that we defined ψ(t)( XK+1) to be the part of the estimating equation that depends on data from sample t, namely:

ψ(t)( XK+1) =

( γ1(X1) + P

k:ck+1=t αk( Xk)ek( Xk+1) if c1 = t, P

k:ck+1=t αk( Xk)ek( Xk+1) otherwise.

Assumption A.5 (Estimation Rates). Let f t,2 = (E(t)[f(X)2])1/2 denote the L2(P (t) X ) norm. For t {0, 1} and all k = 1, . . . , K, we assume:

(i) ˆγk( Xk) γk( Xk) t,2 = op(1) and ˆαk( Xk) αk( Xk) t,2 = op(1).

(ii) n ˆγk( Xk) γk( Xk) t,2 ˆαk( Xk) αk( Xk) t,2 = op(1) and n ˆγk+1( Xk+1) γk+1( Xk+1) t,2 ˆαk( Xk) αk( Xk) t,2 = op(1).

Assumption A.5 (i) assumes that both ˆγk and ˆαk are consistent in the mean-square sense for k = 1, . . . , K. Assumption A.5 (ii) imposes a trade-off between the rates of convergence of ˆγk and ˆαk that is often encountered in the double/debiased ML literature (see, for example, Chernozhukov et al., 2018a). As we discuss in the main text (Remark 2.10), the product of the root mean-squared errors of ˆγk and ˆαk has to vanish faster than n 1/2. That means that, when we have a high-quality estimator for the regression function, the asymptotic guarantees of the method will still hold even if the weights can only be estimated at a relatively slow rate, and vice versa.

B. Structural Equations and Potential Outcomes

Suppose we are in the setting of the example in Section 2.2. Here we give a structural equations model (SEM) that implies the DAG in Figure 1a:

Y g Y (T, X, εY ),

X g X(T, εX),

where (εY , εX, εT ) are mutually independent disturbances of arbitrary dimension, and denotes assignment. Shifting the causal mechanism P (t) X from t = 0 to t = 1 is equivalent to replacing the structural function g X(0, εX) with g X(1, εX). Likewise, shifting the causal mechanism P (t) Y |X from t = 0 to t = 1 is equivalent to replacing the structural function g Y (0, X, εY ) with g Y (1, X, εY ).

An intervention that sets T = t, or do(T = t), is associated with potential outcomes X(t) := g X(t, εX) and Y (t) := g Y (t, X(t), εX) (Imbens & Rubin, 2015). Notice that the SEM does not impose any independence restrictions between X(0), X(1), Y (0), and Y (1). An intervention that sets T = t and X = x, or do(T = t, X = x), in turn defines another set of potential outcomes Y (t, x).

Pearl (2009, 4.5) defines the natural indirect effect of the transition from T = 0 to T = 1 as:

IE0,1 := E[Y (0, X(1))] E[Y (0)].

In words, this is the expected effect of holding T = 0 constant, and changing X to whatever value it would have attained if T had been set to 1, X(1). As long as there are no open backdoor paths relative to T X and (T, X) Y ,3

IE0,1 = θ 1,0 θ 0,0 ,

where θ 0,0 = E(0)[Y ] can be estimated directly by taking the average of Y in sample 0, and θ 1,0 can be identified as described in Section 2.2. The mediation formula (Pearl, 2009, equation (4.18)) is capturing the same idea as identification by regression (REG).

3More generally, we could allow for pre-treatment covariates W that block those backdoor paths. We discuss this direction for future research in Section 4.

Multiply-Robust Causal Change Attribution

Pearl (2009, 4.5) also defines a natural direct effect of the transition from T = 0 to T = 1 as:

DE0,1 := E[Y (1, X(0))] E[Y (0)] = θ 0,1 θ 0,0 .

This corresponds to the average effect of an intervention that sets T = 1 while, at the same time fixing X to the value it would have attained under T = 0.

The definitions above imply two ways of decomposing the total effect of T on Y , as the difference between a direct effect and the indirect effect of the opposite transition:

E[Y (1)] E[Y (0)] = (θ 1,1 θ 1,0 ) + (θ 1,0 θ 0,0 ) = IE0,1 DE1,0, but also

= (θ 1,1 θ 0,1 ) + (θ 0,1 θ 0,0 ) = DE0,1 IE1,0.

Which of the possible decompositions to choose? We suggest two methods in Section 2.6: Shapley values (Shapley, 1953; Budhathoki et al., 2021), which average over all possible decompositions, or along a causal path, which suggests following an order that respects causal ancestry. We note, however, that our main results are about estimating the θc parameters, and so they are valid regardless of the particular decomposition a researcher commits to.

C. Simplifications to the Identifying Equation

In the general formulation of Theorem 2.5, when there are K explanatory variables we need to estimate a total of 2K unknown functions (K regressions and K weights). For certain change vectors c, however, it is possible to simplify the identifying equation to have fewer unknown functions. In this section, we discuss two settings where simplifications may occur. The first is when ck = ck+1 for some k (i.e., when two consecutive regressions are computed with respect to the same probability distribution), allowing us to apply the Law of Iterated Expectations. The second is when Xk and Xk+1 are (conditionally) independent. Remark C.1 (Simplification I: Iterated Expectations). Recall the Law of Iterated expectations:

E[E[A | B, C] | B] = E[A | B].

We can apply this property when ck = ck+1 for some k (i.e., when two consecutive regressions are computed with respect to the same probability distribution) because:

E(ck)[γk( Xk) | Xk 1] = E(ck)[E(ck+1)[γk+1( Xk+1) | Xk] | Xk 1]

= E(ck)[γk+1( Xk+1) | Xk 1].

Thus, we can skip estimation of γk( Xk), and define γk 1( Xk 1) := E(ck)[gk+1( Xk+1) | Xk 1] directly. Because we are not estimating γk( Xk), we can also drop the debiasing term E(ck+1)[ak( Xk)ek( Xk+1)], so that we also do not need to estimate the corresponding weight αk( Xk). Finally, we need to adjust the residual in the (k 1)-th debiasing term according to the simplification to ek 1( Xk+1) = gk+1( Xk+1) gk 1( Xk 1). Remark C.2 (Simplification II: Conditional Independence). Suppose that A B | C:

E[E[g(A, B, C) | A, C] | C] = E[E[g(A, B, C) | B, C] | C],

i.e., the order of integration of A and B can be exchanged. We can apply this property when Xk Xk+1 | Xk 1 for some k as:

E(ck)[γk( Xk) | Xk 1] = E(ck)[E(ck+1)[γk+1( Xk+1) | Xk] | Xk 1]

= E(ck+1)[E(ck)[γk+1( Xk+1) | Xk 1, Xk+1] | Xk 1].

Combined with the previous remark, this re-ordering can reduce the number of functions to estimate, as shown in the following examples.

Example C.3. Consider K = 3 and the causal Markov factorization depicted in Figure 6. (We omit T from the DAG, with the understanding that the distribution of all the variables depends on T.) This factorization implies that X2 X3 | X1. Suppose we are interested in estimating θc for c = 1, 0, 1, 0 . Remark C.2 implies that we can exchange the order of

Multiply-Robust Causal Change Attribution

Figure 6. DAG with causal factorization P (t) (X,Y ) = P (t) Y |X2,X3P (t) X3|X1P (t) X2|X1P (t) X1

integration of X2 and X3. Because we re integrating X1 and X3 with respect to the same distribution P (1) (X,Y ), and X2 and

Y with respect to the same distribution P (0) (X,Y ), Remark C.1 helps us in reducing the number of unknown functions to estimate. In particular, we can use the doubly-robust estimating equation:

E(1)[g1(X1, X3)] + E(0)[a1(X1, X3){h(Y ) g1(X1, X3)}],

where the true value of the unknown functions is:

γ1(X1, X2) = E(0)[h(Y ) | X1, X2], α1(X1, X2) = d P (1) X1,X3/d P (0) X1,X3(X1, X3).

Thus, we only need to estimate 2 unknown functions (rather than 6). Example C.4. An important example is when all explanatory variables are independent, i.e., the causal Markov factorization is P(X,Y ) = PY |X QK k=1 PXk. For a given change vector c, let Xc = (Xk : ck = 1) and X c = (Xk : ck = 0). Thanks to Remarks C.1 and C.2 we can give a doubly-robust estimating equation for θc that only depends only on 2 unknown functions, regardless of what K is. Whenever c K+1 = 0 (i.e., we want to keep the conditional distribution of Y | X as in sample 0), the doubly-robust estimating equation will be:

E(1)[g1(Xc)] + E(0)[a1(Xc){h(Y ) g1(Xc)}],

where the true value of the unknown functions is:

γ1(Xc) = E(0)[h(Y ) | Xc], α1(Xc) = d P (1) Xc/d P (0) Xc(Xc).

Conversely, if c K+1 = 1 (i.e., we want to shift the conditional distribution of Y | X to be as in sample 1), the doubly-robust estimating equation equation will be:

E(0)[g1(X c)] + E(1)[a1(X c){h(Y ) g1(X c)}],

where the true value of the unknown functions is:

γ1(X c) = E(1)[h(Y ) | X c], α1(X c) = d P (0) X c/d P (1) X c(X c).

D. Multiplier Bootstrap

In this section, we describe how to apply the multiplier bootstrap procedure of Belloni et al. (2017) to compute standard errors for \ PATHk and \ SHAPk.

Let (ξ i )n i=1 be i.i.d. draws, independent of the data, from a random variable with E[ξ] = 0, E[ξ2] = 1 and E[exp(|ξ|)] < . Common choices for the distribution of ξ include a re-centered standard exponential distribution (Bayesian bootstrap) and the standard Gaussian distribution (Gaussian multiplier bootstrap). We refer the reader to Belloni et al. (2017) for references.

Define, for each t {0, 1}, ψ(t)( XK+1) = ˆψ(t)( XK+1) b E(t)[ ˆψ(t)( XK+1)]. A multiplier bootstrap draw of ˆθc can be computed as: ˆθc, = ˆθc + X

t {0,1} b E(t)[ξ ψ(t)( XK+1)].

Multiply-Robust Causal Change Attribution

For each bootstrap draw of (ξ i )n i=1, the resulting set of ˆθc, can be used to compute a bootstrap version of the causal attribution parameters of interest, \ PATH k and \ SHAP k. If we repeat this process many times, the standard deviation of \ PATH k and \ SHAP k over the bootstrap distribution will be an asymptotically valid estimator for the standard errors of \ PATHk and \ SHAPk, respectively.

E. Monte-Carlo Simulations

Design 1 Our simulation design is based on the causal model of Example 2.6. At each simulation draw, we generate two samples of size n0 = n1 = 1000, according to the following distributions:

X1 N(1, σ2 (t)),

X2 | X1 N(β(t)X1, 1),

Y | X1, X2 N(X1 + X2 + 0.25X2 1 + δ(t)X2 2, 1),

for t {0, 1}. The parameters µ(t) := (σ2 (t), β(t), δ(t)) capture the features of the distribution that change between sample 0 and 1, namely the variance of X1, the dependence between X2 and X1, and the dependence between Y and X2 2. In particular, we choose the following values: µ(0) = (1, 0.5, 0.25) and µ(1) = (1.21, 0.2, 0.25). This choice of parameters ensures that the distribution of the simulated data satisfies the regularity conditions in Assumptions A.1 and A.2. The functional of interest will be the mean of Y under different counterfactual distributions. Table 1 presents the results of our simulation (over 1000 draws) for different choices of estimators for the regression and the weights.

Design 2 In a second set of experiments, we consider the effect of increasing K. We consider a line DAG, X1 X2 XK Y . In sample T = 0, we draw n0 = 2, 000 observations according to X1 N(0, 1) and Xk | Xk 1 N(0.5Xk 1, 0.75) for k = 2, . . . , K + 1. For each simulation draw, we randomly select 1/10 of the causal mechanisms to shift their mean by 0.2 in sample T = 1, also with n1 = 2, 000. In each simulation draw, we compute the \ PATHk attribution measure, using LASSO to estimate the regressions, and Logistic Regression with ℓ1 penalty (Logit LASSO) for the weights. In both cases, the penalty is selected by cross-validation. We do this for K {10, 20, 50, 100}.

Multiply-Robust Causal Change Attribution

Table 1. Monte-Carlo simulation results (over 1000 draws). The top panel of each table shows the Mean Absolute Error (MAE) in the estimates of the counterfactual mean parameters θc for c {0, 1}3. We omit the results for θ 0,0,0 and θ 1,1,1 , since these parameters can be estimated as sample means directly from the data. The bottom panel of each table shows the MAE in the estimates of the Shapley values corresponding to each causal mechanism. We compare methods that rely only on regression or re-weighting with our multiply-robust (MR) proposed estimator.

(a) Regression and Weights Correctly Specified

Mean Absolute Error std. err.

Regression Re-weighting MR

0, 0, 1 0.060 0.001 0.065 0.002 0.060 0.001 0, 1, 0 0.059 0.001 0.065 0.002 0.059 0.001 0, 1, 1 0.057 0.001 0.057 0.001 0.057 0.001 1, 0, 0 0.072 0.002 0.075 0.002 0.072 0.002 1, 0, 1 0.063 0.002 0.077 0.002 0.063 0.002 1, 1, 0 0.064 0.002 0.078 0.002 0.064 0.002

SHAP1 0.072 0.002 0.073 0.002 0.072 0.002 SHAP2 0.036 0.001 0.043 0.001 0.037 0.001 SHAP3 0.040 0.001 0.046 0.001 0.041 0.001

(b) Weights Incorrectly Specified

Mean Absolute Error std. err.

Regression Re-weighting MR

0, 0, 1 0.060 0.001 0.059 0.001 0.060 0.001 0, 1, 0 0.059 0.001 0.071 0.002 0.059 0.001 0, 1, 1 0.057 0.001 0.074 0.002 0.057 0.001 1, 0, 0 0.072 0.002 0.085 0.002 0.072 0.002 1, 0, 1 0.063 0.002 0.060 0.001 0.063 0.002 1, 1, 0 0.064 0.002 0.101 0.002 0.064 0.002

SHAP1 0.072 0.002 0.083 0.002 0.072 0.002 SHAP2 0.036 0.001 0.036 0.001 0.036 0.001 SHAP3 0.040 0.001 0.064 0.001 0.040 0.001

(c) Regression Incorrectly Specified

Mean Absolute Error std. err.

Regression Re-weighting MR

0, 0, 1 0.130 0.002 0.065 0.002 0.062 0.001 0, 1, 0 0.066 0.002 0.065 0.002 0.060 0.001 0, 1, 1 0.071 0.002 0.057 0.001 0.057 0.001 1, 0, 0 0.089 0.002 0.075 0.002 0.073 0.002 1, 0, 1 0.098 0.002 0.077 0.002 0.064 0.002 1, 1, 0 0.069 0.002 0.078 0.002 0.066 0.002

SHAP1 0.084 0.002 0.073 0.002 0.073 0.002 SHAP2 0.045 0.001 0.043 0.001 0.038 0.001 SHAP3 0.081 0.002 0.046 0.001 0.042 0.001

(d) Regression and Weights Incorrectly Specified

Mean Absolute Error std. err.

Regression Re-weighting MR

0, 0, 1 0.130 0.002 0.059 0.001 0.113 0.002 0, 1, 0 0.066 0.002 0.071 0.002 0.076 0.002 0, 1, 1 0.071 0.002 0.074 0.002 0.072 0.002 1, 0, 0 0.089 0.002 0.085 0.002 0.089 0.002 1, 0, 1 0.098 0.002 0.060 0.001 0.084 0.002 1, 1, 0 0.069 0.002 0.101 0.002 0.063 0.001

SHAP1 0.084 0.002 0.083 0.002 0.084 0.002 SHAP2 0.045 0.001 0.036 0.001 0.036 0.001 SHAP3 0.081 0.002 0.064 0.001 0.063 0.001

(e) Non-parametric Regression and Weights (NN)

Mean Absolute Error std. err.

Regression Re-weighting MR

0, 0, 1 0.071 0.002 0.068 0.002 0.061 0.001 0, 1, 0 0.079 0.002 0.076 0.002 0.059 0.001 0, 1, 1 0.083 0.002 0.059 0.001 0.057 0.001 1, 0, 0 0.101 0.003 0.076 0.002 0.073 0.002 1, 0, 1 0.080 0.002 0.080 0.002 0.063 0.002 1, 1, 0 0.070 0.002 0.097 0.002 0.064 0.002

SHAP1 0.080 0.002 0.068 0.002 0.072 0.002 SHAP2 0.053 0.001 0.061 0.001 0.037 0.001 SHAP3 0.051 0.001 0.061 0.001 0.041 0.001

(f) Non-parametric Regression and Weights (GBoost)

Mean Absolute Error std. err.

Regression Re-weighting MR

0, 0, 1 0.060 0.001 0.265 0.002 0.061 0.001 0, 1, 0 0.062 0.001 0.080 0.002 0.062 0.001 0, 1, 1 0.058 0.001 0.065 0.002 0.058 0.001 1, 0, 0 0.074 0.002 0.118 0.002 0.074 0.002 1, 0, 1 0.064 0.002 0.280 0.002 0.064 0.002 1, 1, 0 0.065 0.002 0.139 0.002 0.065 0.002

SHAP1 0.072 0.002 0.058 0.001 0.072 0.002 SHAP2 0.037 0.001 0.120 0.002 0.037 0.001 SHAP3 0.041 0.001 0.079 0.002 0.041 0.001

Multiply-Robust Causal Change Attribution

Table 2. Monte-Carlo simulation results (over 5000 draws), Design 2. The table shows the Average Worst-Case Absolute Error (as defined in the main text) for the PATHk attribution measure.

Avg. Worst-Case AE std. err.

K Regression Re-weighting MR

10 0.051 0.001 0.041 0.001 0.030 0.001 20 0.052 0.001 0.042 0.001 0.030 0.001 50 0.050 0.001 0.044 0.001 0.033 0.001 100 0.053 0.001 0.055 0.002 0.044 0.001

F. Additional Figures for the Real-World Application

Less than HS

HS Graduate

Some College

College Graduate

Advanced Degree

Relative Frequency (%)

Male Female

(a) Female vs. male distribution of education (data from CPS2015).

Sales Administrative

Education Healthcare Practitioner

Occupation | College Graduate

Relative Frequency (%)

Male Female

(b) Female vs. male distribution of occupation conditional on having a college degree (data from CPS2015).

Figure 7. Distribution of explanatory variables by gender in our gender wage gap application.