On Measuring Causal Contributions via do-interventions

Yonghan Jung 1  Shiva Prasad Kasiviswanathan 2  Jin Tian 3  Dominik Janzing 2  Patrick Blöbaum 2  Elias Bareinboim 4

Abstract

Causal contributions measure the strengths of different causes to a target quantity. Understanding causal contributions is important in empirical sciences and data-driven disciplines, since it allows one to answer practical queries like "what are the contributions of each cause to the effect?" In this paper, we develop a principled method for quantifying causal contributions. First, we provide desiderata of properties (axioms) that causal contribution measures should satisfy and propose the do-Shapley values (inspired by do-interventions (Pearl, 2000)) as the unique method satisfying these properties. Next, we develop a criterion under which the do-Shapley values can be efficiently inferred from non-experimental data. Finally, we provide do-Shapley estimators exhibiting consistency, computational feasibility, and statistical robustness. Simulation results corroborate the theory.

1. Introduction

Inferring causal effects is a fundamental problem throughout the data sciences, since it can answer queries like "what would be an expected outcome if inputs had been fixed to certain values?" There is a growing literature tackling this question, both in understanding the conditions under which causal conclusions can be drawn from non-experimental data (causal effect identification) (Pearl, 1995; Tian & Pearl, 2003; Huang & Valtorta, 2006; Shpitser & Pearl, 2006; Bareinboim & Pearl, 2016; Jaber et al., 2018; Lee et al., 2019; 2020; Lee & Bareinboim, 2020), and in estimating the identified causal functionals from data (causal effect estimation) (Jung et al., 2020; Bhattacharya et al., 2020; Jung et al., 2021a;b; Bhattacharyya et al., 2020; 2021). Beyond the task of causal inference, interpreting the result

1Purdue University 2Amazon 3Iowa State University 4Columbia University. Correspondence to: Yonghan Jung.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

of causal inference, including "what is the most important cause of the effect?" or, more generally, "what are the contributions of each cause to the effect?", is also practically important. Answering these queries falls under the task of measuring causal contributions, which quantifies the degree of contribution of causes to a target effect. As a motivation, consider the following scenario:

Example 1. A video streaming service company has collected data that contains various features (as described in (Lundberg, 2021)), including sales calls (S), product needs (P), interaction with customers (I), monthly usage (M), discounts provided (D), last upgrade (L), economic factors (E), ad spend (A), and bugs reported (B). These features are causally related and affect the target variable: customer retention (Y) (see Fig. 1a). The company aims to measure the causal contributions of these features to the target effect: the expected customer retention if each feature had been fixed to a certain value (e.g., set to lower sales calls, higher product needs, etc.).

Example 1 captures practical cases where the target quantity is related to the query "what would be the output if inputs had been fixed to certain values?"
This includes cases where the target quantity is a machine learning (ML) model output, since the output is the quantity derived by fixing the inputs to specific values, as detailed in Remark 1. In the area of explainable AI (XAI), there has been a recent thrust on measuring the contributions of features to the ML output (e.g., (Lundberg & Lee, 2017; Schwab & Karlen, 2019; Janzing et al., 2020b; Heskes et al., 2020; Covert et al., 2021)). The majority of existing methods have focused on queries where the target quantity is induced from an accessible model (we say a model for target Y is accessible if the model can be evaluated to obtain the Y value for arbitrary input features), with little attention paid to cases where the target is induced by nature (i.e., the data-generating process is inaccessible; e.g., the customer retention in Example 1) or where the ML model is inaccessible. Also, many existing techniques are based on correlation (rather than causation) between the features and the ML model output (e.g., (Lundberg & Lee, 2017; Frye et al., 2020)). Even though another line of methods measures contributions based on causation (e.g., (Schwab & Karlen, 2019; Janzing et al., 2020b; Heskes et al., 2020)), these methods often assume that the data-generating process for the target is known and accessible, so that an outcome corresponding to arbitrary feature values can be generated, ruling out scenarios where the target quantity is induced from an inaccessible model (e.g., Example 1). A detailed comparison with the existing literature is presented in Sec. 3.1.

In this paper, we generalize previous approaches to measure the causal contributions of each feature to a target effect induced by a black-box/unknown/inaccessible model. Our proposed method is directly applicable to the task of quantifying the causal contributions of input features to an ML model prediction (formalized in Remark 1). Our key contributions, in further detail, are as follows:

1. [Sec. 3] We axiomatize causal contribution measures. Specifically, we propose desiderata for causal contribution measures as axioms, and propose the do-Shapley, a Shapley value (Shapley, 1953)-based method specialized for quantifying causal contributions by leveraging the do-intervention (Pearl, 2000).1 Our axiomatic characterization provides a theoretical justification for using the do-Shapley to quantify causal contributions.

2. [Sec. 4] We provide conditions under which the do-Shapley values can be inferred from observational data (identifiability) in polynomial time (computational feasibility). Although the identifiability can be verified through existing theories of causal-effect identification, those theories do not render the determination computationally feasible. To address this, we provide a sufficient condition under which the identifiability and computational feasibility of the do-Shapley can be efficiently determined.

3. [Sec. 5] We construct estimators for the do-Shapley exhibiting consistency, computational feasibility, and statistical robustness. We develop three estimators based on inverse probability weighting (IPW) (Rosenbaum & Rubin, 1983), outcome regression (REG) (Rubin, 1979), and double/debiased machine learning (DML) (Chernozhukov et al., 2018). We prove that all estimators manifest consistency and computational feasibility. In addition, we show that the DML estimator additionally displays statistical robustness to model misspecification and bias.
Finally, we present simulation results on these estimators that corroborate the theory [Sec. 6]. Due to space constraints, the proofs and other omitted details are provided in the appendix.

1The do-Shapley is a generalization of the causal Shapley (Heskes et al., 2020), which also uses do-interventions, to the case where the target quantity is induced by an inaccessible model.

Figure 1: Causal graph for Example 1, taken from Lundberg (2021).

2. Preliminaries

Notation. Each variable is represented with a capital letter ($V$) and its realized value with a small letter ($v$). We use bold letters $\mathbf{V}$ and $\mathbf{v}$ to denote a set of variables and a realized value of it, respectively. For any set $S$, we use $|S|$ to denote its cardinality. Given a topological order over the vertices $\mathbf{V} := \{V_1, \ldots, V_n\}$ of a graph $G$, we use $pre(V_i)$ to denote the predecessors of $V_i$. We use $pre(v_i)$ for a realization of the set of variables $pre(V_i)$; i.e., $pre(v_i) = \mathbf{w}_i$ for $pre(V_i) = \mathbf{W}_i$. We use $Ch(V_i)$ to represent the children of a variable $V_i$ in $G$. For an index set $[n] := \{1, \ldots, n\}$ and a subset $S \subseteq [n]$, we use $\mathbf{V}_S := \{V_k\}_{k \in S}$ and $\mathbf{V}_{\bar{S}} := \{V_k\}_{k \notin S}$. We use $D$ for the $N$ samples from a distribution $P$ over $\mathbf{V}$; i.e., $D := \{\mathbf{V}^{(i)}\}_{i=1}^{N} \sim P$, where $\mathbf{V}^{(i)}$ denotes the $i$th sample. For a function $f$, we use $E[f(\mathbf{V})]$ for the expectation of $f(\mathbf{V})$ over $P$, and $E_D[f(\mathbf{V})] := (1/N)\sum_{i=1}^{N} f(\mathbf{V}^{(i)})$. We use $\|f(\mathbf{V})\| := \sqrt{E[(f(\mathbf{V}))^2]}$ to denote the $L_2(P)$ norm of $f(\mathbf{V})$. $O_P(\cdot)$ and $o_P(\cdot)$ denote big O and little o in probability, respectively.

Structural Causal Models. We use the language of structural causal models (SCMs) as our basic semantical framework (Pearl, 2000). A structural causal model (SCM) is a tuple $M := \langle \mathbf{V}, \mathbf{U}, F, P(\mathbf{u}) \rangle$, where $\mathbf{V}, \mathbf{U}$ are sets of endogenous (observable) and exogenous (latent) variables, $F$ is a set of functions $f_{V_i}$, one for each $V_i \in \mathbf{V}$, where $V_i \leftarrow f_{V_i}(PA_{V_i}, U_{V_i})$ for some $PA_{V_i} \subseteq \mathbf{V}$ and $U_{V_i} \subseteq \mathbf{U}$, and $P(\mathbf{u})$ is a strictly positive probability measure over $\mathbf{U}$. Each SCM $M$ induces a semi-Markovian causal graph $G$ over the node set $\mathbf{V}$, where $V_i \rightarrow V_j$ if $V_i$ is an argument of $f_{V_j}$, and $V_i \leftrightarrow V_j$ if $U_{V_i}$ and $U_{V_j}$ are correlated. Performing an intervention $\mathbf{X} = \mathbf{x}$ is represented through the do-operator, $do(\mathbf{X} = \mathbf{x})$ (shortly, $do(\mathbf{x})$), which encodes the operation of replacing the original equations of $\mathbf{X}$ by the constant $\mathbf{x}$ in the SCM $M$, inducing a submodel $M_{\mathbf{x}}$ and an interventional distribution $P(\mathbf{V} = \mathbf{v} \mid do(\mathbf{x}))$ (shortly, $P(\mathbf{v} \mid do(\mathbf{x}))$).

Causal Effect Identification. Given a causal graph $G$ over $\mathbf{V}$, an effect $P(y \mid do(\mathbf{x}))$ where $\mathbf{X}, Y \subseteq \mathbf{V}$ is identifiable if $P(y \mid do(\mathbf{x}))$ is computable from the distribution $P(\mathbf{v})$ in any SCM $M$ that induces $G$ (Pearl, 2000, p. 77). One key notion is that of confounded components (in short, C-components): a set of nodes connected by a path composed solely of bidirected edges $V_i \leftrightarrow V_j$ (Tian & Pearl, 2003). For any $\mathbf{C} \subseteq \mathbf{V}$, the quantity $Q[\mathbf{C}] := P(\mathbf{c} \mid do(\mathbf{v} \setminus \mathbf{c}))$, called a C-factor (Tian & Pearl, 2003), is defined as the interventional distribution of $\mathbf{C}$ under an intervention on $\mathbf{V} \setminus \mathbf{C}$. We use $C(V_i)_G$ (shortly, $C(V_i)$) to denote the C-component of $V_i$ in $G$, the set of variables belonging to the same C-component as $V_i$. We use $C(\mathbf{W}) := \bigcup_{V_i \in \mathbf{W}} C(V_i)$ to denote the C-component of a set $\mathbf{W} \subseteq \mathbf{V}$.

Shapley Value. The Shapley value (Shapley, 1953) seeks to allocate the contribution of each player $i \in [n]$ on some function value $f([n])$, given a value function $\nu(S)$ that measures the value of a coalition of players $i \in S \subseteq [n]$. The Shapley value, given as

$$\phi_i = \sum_{S \subseteq [n] \setminus \{i\}} \omega(S)\, \{\nu(S \cup \{i\}) - \nu(S)\}, \quad (1)$$

where $\omega(S) := \frac{1}{n}\binom{n-1}{|S|}^{-1}$, is the unique value satisfying a set of desiderata for fair allocation (Young, 1985) (see Appendix A for more details).
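To make Eq. (1) concrete, the following is a minimal Python sketch (ours, for illustration; not part of the paper's implementation) that computes Shapley values exactly by enumerating all coalitions. The helper name `shapley_values` and the toy additive game are hypothetical.

```python
from itertools import combinations
from math import comb

def shapley_values(n, nu):
    """Exact Shapley values for players 0..n-1 following Eq. (1).
    nu maps a frozenset S (coalition) to a real number."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):                      # |S| = k
            w = 1.0 / (n * comb(n - 1, k))      # omega(S) from Eq. (1)
            for S in combinations(others, k):
                S = frozenset(S)
                phi[i] += w * (nu(S | {i}) - nu(S))
    return phi

# Toy check: an additive game recovers each player's worth (efficiency + dummy).
worth = [3.0, 1.0, 2.0]
nu = lambda S: sum(worth[j] for j in S)
print(shapley_values(3, nu))  # -> [3.0, 1.0, 2.0]
```

The enumeration costs $O(2^n)$ evaluations of $\nu$, which is exactly the bottleneck the permutation-sampling estimator of Sec. 5 (Algorithm 1) avoids.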
2.1. Problem Definition

We are given samples $D$ drawn from a distribution $P := P_M$ and a compatible semi-Markovian causal graph $G := G_M$, induced by the SCM $M$, on topologically ordered variables $(\mathbf{V}, Y)$, with $Y$ assumed to be the final variable in the order. We assume that $Y$ is bounded and that $\mathbf{V}$ is a set of discrete variables. Given $(G, D, \mathbf{v})$, where $\mathbf{v}$ is a realized value for $\mathbf{V}$, our goal is to measure the contribution of each $v_i \in \mathbf{v}$ to the target causal effect2 $E[Y \mid do(\mathbf{v})]$, based on its impact on $Y$ when the SCM $M$ fixes the value of the variable as $V_i = v_i$, i.e., via $P_M(y \mid do(v_i))$, where $P_M$ denotes the distribution induced by $M$. We set $E[Y] = 0$ without loss of generality. For generality, we make no assumptions regarding the data-generating process of $Y$. With the following additional assumption on the generating function, our problem straightforwardly reduces to the problem of attributing the importance of features in XAI:

Remark 1 (Reduction to the XAI). If $Y$ is generated by a deterministic function (e.g., $Y$ is the output of an ML prediction model $f$ s.t. $Y := f(\mathbf{V})$), then our task reduces to measuring the causal contribution of each feature $v_i \in \mathbf{v}$ to the ML prediction $f(\mathbf{v})$, since $E[Y \mid do(\mathbf{v})] = f(\mathbf{v})$.

2We focus on the average causal effect $E[Y \mid do(\mathbf{v})]$, the most widely used quantity in practice. Our method is applicable to any function of the causal distribution $P(y \mid do(\mathbf{v}))$. A condition under which the target quantity $E[Y \mid do(\mathbf{v})]$ (or $P(y \mid do(\mathbf{v}))$) can be determined using non-experimental data is discussed in Sec. 4.

3. Axioms for Causal Contribution

We start by asking the question: what makes a good causal contribution measure? To answer, we propose the following desiderata:

Axiom 1 (Desiderata for Causal Contribution). A causal contribution measure $\{\phi_{v_i}\}_{i=1}^{n}$ w.r.t. $G$ is considered desirable if the following properties are satisfied:

1. Perfect assignment: Contributions are perfectly assigned; i.e., $E[Y \mid do(\mathbf{v})] = \sum_{i=1}^{n} \phi_{v_i}$.

2. Causal irrelevance: If $V_i$ is causally irrelevant to $Y$ for all witnesses $\mathbf{w} \subseteq \mathbf{v} \setminus \{v_i\}$ (i.e., $\forall y$, $P(y \mid do(v_i, \mathbf{w})) = P(y \mid do(\mathbf{w}))$)3, then $\phi_{v_i} = 0$.

3. Causal symmetry: If $(v_i, v_j) \in \mathbf{v}$ have the same causal explanatory power4 to $Y$ for every $\mathbf{w} \subseteq \mathbf{v} \setminus \{v_i, v_j\}$ (i.e., $\forall y$, $P(y \mid do(v_i, \mathbf{w})) = P(y \mid do(v_j, \mathbf{w}))$), then $\phi_{v_i} = \phi_{v_j}$.

4. Causal approximation: For any $S \subseteq [n]$ and $\mathbf{v}_S := \{v_i\}_{i \in S}$, $\sum_{i \in S} \phi_{v_i}$ well approximates $E[Y \mid do(\mathbf{v}_S)]$. Formally, $\{\phi_{v_i}\}_{i=1}^{n}$ is a solution of the following weighted least squares problem: $\{\phi_{v_i}\}_{i=1}^{n} = \arg\min_{\{\phi'_{v_i}\}_{i=1}^{n}} \sum_{S \subseteq [n]} \big(E[Y \mid do(\mathbf{v}_S)] - \sum_{i \in S} \phi'_{v_i}\big)^2\, \omega(S)$ for some positive and bounded function $\omega(S)$.

The rationale behind Axiom 1 is the following: (1) Perfect assignment is a natural requirement, since we aim to attribute the degree of contribution of each feature $v_i \in \mathbf{v}$ to the target causal effect. (2) Causal irrelevance reflects a desire to understand the cause of the outcome by forcing zero contributions for variables not causing the outcome. (3) Causal symmetry enforces equal contributions for a pair of features having the same causal explanatory power. (4) Causal approximation allows $\phi_{v_i}$ to be interpreted as a proxy for the causal effect s.t. $\sum_{i \in S} \phi_{v_i} \approx E[Y \mid do(\mathbf{v}_S)]$ for any $S \subseteq [n]$.

Perhaps surprisingly, there is a unique causal contribution measure $\{\phi_{v_i}\}$ satisfying the above four properties.

Definition 1 (do-Shapley). The do-Shapley5 is the causal contribution measure $\{\phi_{v_i}\}_{i=1}^{n}$ of $\mathbf{v}$ on $E[Y \mid do(\mathbf{v})]$ w.r.t. $G$,
3$V_i$ is causally irrelevant to $Y$ given $\mathbf{V}_S$ if $P(y \mid do(v_i, \mathbf{v}_S)) = P(y \mid do(\mathbf{v}_S))$ (Galles & Pearl, 1997, Def. 7).
4The causal explanatory power of $X = x$ to $Y = y$ is a measure of how much $Y = y$ is made more likely if $X$ had been fixed to $x$, comparing $P(y \mid do(x))$ with $P(y)$ (Eva & Stern, 2019).
5Heskes et al. (2020) proposed the same equation for measuring contributions in the accessible-model setting and referred to it as the causal Shapley. In this paper, we use the term do-Shapley to make it clearer that the definition is based on the do-intervention.

defined as:

$$\phi_{v_i} = \sum_{S \subseteq [n] \setminus \{i\}} \omega(S)\, \{E[Y \mid do(\mathbf{v}_{S \cup \{i\}})] - E[Y \mid do(\mathbf{v}_S)]\}, \quad (2)$$

where $\omega(S) := \frac{1}{n}\binom{n-1}{|S|}^{-1}$.

Theorem 1 (Uniqueness of the do-Shapley). The do-Shapley is the unique causal contribution measure satisfying all the properties in Axiom 1.

Remark 2. Thm. 1 is significant because the axioms do not restrict the value function to any fixed form. Instead, Thm. 1 characterizes the do-Shapley as the unique causal contribution measure satisfying Axiom 1 among arbitrary value functions and their corresponding contribution measures, as in (Sundararajan & Najmi, 2020).

The do-Shapley, as the name implies, is a specialization of the Shapley value in Eq. (1) for $\nu(S) = E[Y \mid do(\mathbf{v}_S)]$. The do-Shapley can alternatively be viewed as a marginal causal effect of $v_i \in \mathbf{v}$ (i.e., $E[Y \mid do(\mathbf{v}_{S \cup \{i\}})] - E[Y \mid do(\mathbf{v}_S)]$) weighted-averaged over sets $S$. The significance of Thm. 1 stems from the fact that it codifies the guarantees of the do-Shapley and provides a tool to compare and contrast it with alternative contribution metrics.
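As an illustration of Def. 1, the following sketch (ours) instantiates $\nu(S) = E[Y \mid do(\mathbf{v}_S)]$ by Monte-Carlo simulation of a toy Markovian SCM $V_1 \rightarrow V_2 \rightarrow Y$ in which do-interventions can be simulated directly; the SCM and all names are assumptions of this example. It reuses `shapley_values` from the sketch in Sec. 2. In a real application the interventional expectations cannot be simulated and must instead be identified from data, as in Sec. 4.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_do(n, do):
    """Simulate the toy SCM V1 -> V2 -> Y under do-interventions.
    `do` maps a variable index (0 for V1, 1 for V2) to a fixed value."""
    v1 = do.get(0, rng.binomial(1, 0.5, n))
    v2 = do.get(1, (v1 + rng.binomial(1, 0.3, n)) % 2)   # V2 = V1 XOR noise
    return 2.0 * v2 + 0.5 * v1 + rng.normal(0, 0.1, n)   # Y

v = {0: 1, 1: 1}                       # realized input v whose effect we attribute
mu = sample_do(10**6, {}).mean()       # E[Y]; the paper centers Y so E[Y] = 0
nu = lambda S: sample_do(10**6, {i: v[i] for i in S}).mean() - mu
phi = shapley_values(2, nu)            # do-Shapley of Eq. (2)
# Perfect assignment: sum(phi) ~ E[Y | do(v)] - E[Y]
print(phi, sum(phi), nu(frozenset({0, 1})))
```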
Remark 3 (Attribution of contributions for a subset of variables). It is worth noting that the do-Shapley allocates contributions to all $v_i \in \mathbf{v}$. In practice, assigning contributions exclusively to a subset $\mathbf{x} \subseteq \mathbf{v}$ may yield a more interpretable result. For example, when $\mathbf{X} := Pa(Y) \subseteq \mathbf{V}$, assigning contributions only to the features in $\mathbf{x} \subseteq \mathbf{v}$ might be more interpretable whenever features that only indirectly affect the outcome should be assigned zero contribution. Enforcing the do-Shapley to assign contributions only to the subset $\mathbf{x}$ can be done simply (without loss of generality) by the following procedure: (1) derive a causal graph $G[\mathbf{X}]$ compatible with $P(\mathbf{x})$ by applying the projection of a graph6; and (2) compute the do-Shapley w.r.t. $G[\mathbf{X}]$. See Appendix B for more details.

6$G[\mathbf{X}]$ is constructed as follows: for any $V_i, V_j \in \mathbf{X}$, (1) add a directed edge $V_i \rightarrow V_j$ in $G[\mathbf{X}]$ if there exists a directed path from $V_i$ to $V_j$ in $G$ such that every internal vertex on the path is not in $\mathbf{X}$; (2) add a bidirected edge $V_i \leftrightarrow V_j$ in $G[\mathbf{X}]$ if there exists a divergent path between $V_i$ and $V_j$ in $G$ such that every internal vertex on the path is not in $\mathbf{X}$ (Tian & Pearl, 2003).

3.1. Relation with Other Work

In this section, we compare the do-Shapley in Def. 1 with other known methods aiming to measure the contributions of features to an outcome. Table 1 summarizes the comparison.

              Causality   Inaccessibility   Axioms
Conditional       ✗              ✓             ✗
Marginal          ✗              ✗             ✗
Causal            ✓              ✗             ✗
ICC               ✓              ✗             ✗
do-Shapley        ✓              ✓             ✓

Table 1: Summary of comparisons of the conditional, marginal, and causal Shapley values and the ICC with our method (do-Shapley) w.r.t. consideration of causality, capability of handling outcomes induced by an inaccessible model (e.g., Example 1), and characterization by axioms among measures based on arbitrary value functions.

Conditional Shapley. The conditional Shapley ($\phi^{cond}_{v_i}$) is a specialization of the Shapley value with $\nu(S) = E[Y \mid \mathbf{v}_S]$ (Lundberg & Lee, 2017; Frye et al., 2020). The conditional Shapley measures contributions based on association rather than causation. In general, the conditional Shapley does not match the do-Shapley; the causal irrelevance property does not hold for the conditional Shapley (see Example C.1).

Marginal Shapley. The marginal Shapley is another widely used contribution measure in XAI, in which the target variable is a model prediction $Y = f(\mathbf{V})$, where $f$ is a deterministic (cf. Remark 1) and accessible prediction model. The marginal Shapley is a specialization of the Shapley value with $\nu(S) = E[f(\mathbf{v}_S, \mathbf{V}_{\bar{S}})]$ (Janzing et al., 2020b). The marginal Shapley is known to satisfy certain desiderata in attributing feature importance (Sundararajan & Najmi, 2020). With access to the model $f$, and under the particular graphical assumption that features do not causally affect each other, the marginal Shapley matches the do-Shapley (Janzing et al., 2019, Eq. (14)). In general settings where features are causally related, as in Example 1, the marginal Shapley does not match the do-Shapley.

Causal Shapley. The causal Shapley (Heskes et al., 2020) is the method most closely related to the do-Shapley. Specifically, (Heskes et al., 2020) proposed the same equation for measuring contributions when the outputs are generated by accessible models and the graph is unknown (only a partial topological ordering of the graph is known). The do-Shapley, in contrast, does not require the output to be induced by an accessible model and is defined specifically on semi-Markovian causal graphs (DAGs with bidirected edges; see Sec. 2), for which rich theories of causal effect identification and estimation are available.

Intrinsic Causal Contribution (ICC). Janzing et al. (2020a) proposed a method called Intrinsic Causal Contribution (ICC) ($\phi^{icc}_{v_i}$) to measure the causal contribution under the setting where the causal graph is Markovian and the structural functions are invertible, in the sense that the noise values can be reconstructed from the observations. The ICC relies on a so-called structure-based intervention, which intervenes on features while keeping the causal structure and the joint distribution unaffected, to measure the contribution of $V_i$ on $Y$. By doing so, the ICC can measure the contribution of $V_i$ on $Y$ that is not via upstream variables. However, to the best of our knowledge, there is no axiomatic characterization of the ICC. It is easy to show that the ICC does not satisfy the causal symmetry property (see Example C.2).

Other Contribution Measures. Wang et al. (2021a) focused on measuring the relevance of paths in a causal graph to a target node, whereas Singal et al. (2021) provided a recursive approach to capture the flow of importance through the graph. The causal influence defined in Janzing et al. (2013) is based on an operation called deletion of edges and measures the relevance of edges with respect to the joint distribution, but not the relevance of edges for a certain target node. Schamberg et al. (2020) describe a generalization of the information-theoretic approach of Janzing et al. (2013) which quantifies the relevance of paths or edges for a target node, based on operations on edges. Under some particular graphical assumptions, e.g., flat graphs (Singal et al., 2021, Def. 8), the path/edge-based Shapley values (Wang et al., 2021a; Singal et al., 2021) match the do-Shapley. In general, however, the link between these lines of work is yet to be fully established.

4. Identification of the do-Shapley

In this section, we investigate the question of evaluating the do-Shapley values.
To evaluate the do-Shapley, expressing $E[Y \mid do(\mathbf{v}_S)]$ as a functional of the observational distribution $P$ using $G$ is essential, because we are only given the non-experimental dataset $D$. For each $S \subseteq [n]$, complete causal effect identification algorithms for identifying $E[Y \mid do(\mathbf{v}_S)]$ are already available (Tian & Pearl, 2003; Huang & Valtorta, 2006; Shpitser & Pearl, 2006). A major practical challenge remains, however, in using them, because determining the identifiability for all subsets $S \subseteq [n]$ takes exponential computation time. In this section, we address this computational challenge by presenting a graphical criterion under which the identifiability can be determined in polynomial time, which makes the procedure feasible in practice. Formally,

Definition 2 (Identifiability & Feasibility). The do-Shapley values $\{\phi_{v_i}\}_{i=1}^{n}$ w.r.t. $G$ are said to be identifiable if all elements in $\{E[Y \mid do(\mathbf{v}_S)]\}_{S \subseteq [n]}$ are identifiable in the causal graph $G$. The identification of the do-Shapley values is said to be (computationally) feasible if the identification can be done in $O(poly(n))$.

Since naïvely applying the existing causal effect identification algorithms to determine the identifiability of the do-Shapley values is not computationally feasible (it requires $O(2^n)$ computations), we provide a simple sufficient graphical criterion under which determining the do-Shapley identifiability is feasible. We start with a definition (refer to Sec. 2 for C-component and C-factor):

Definition 3 (C-partition). For a set of variables $\mathbf{X} \subseteq \mathbf{V}$, $\{\mathbf{X}_k\}_{k=1}^{c}$ is said to be the C-partition of $\mathbf{X}$ if $\mathbf{X} = \cup_{k=1}^{c} \mathbf{X}_k$ (where $\mathbf{X}_a \cap \mathbf{X}_b = \emptyset$ for $a \neq b$) and, for each $k \in [c]$, $\mathbf{X}_k$ is a set such that any two elements $X_i, X_j \in \mathbf{X}_k$ are in the same C-component in $G$.

Theorem 2 (Identifiability & Feasibility of do-Shapley). The do-Shapley is identifiable if no variable $V_i \in \mathbf{V}$ is connected to its children $Ch(V_i)$ by bidirected paths in $G$. Suppose, in addition, that $Y$ is not connected to any node by bidirected paths. In this case, for any $S \subseteq [n]$,

$$E[Y \mid do(\mathbf{v}_S)] = \sum_{\mathbf{v}_{\bar{S}}} E[Y \mid \mathbf{v}]\; Q[\mathbf{V} \setminus \mathbf{V}_S],$$

where $Q[\mathbf{V} \setminus \mathbf{V}_S] := Q[\mathbf{V} \setminus \mathbf{V}_S](\mathbf{v})$ is given as

$$Q[\mathbf{V} \setminus \mathbf{V}_S] = \frac{P(\mathbf{v})}{Q[C(\mathbf{V}_S)]} \prod_{k=1}^{c} \sum_{\mathbf{s}_k} Q[C(\mathbf{S}_k)],$$

where $Q[C(\mathbf{V}_S)] = \prod_{V_a \in C(\mathbf{V}_S)} P(v_a \mid pre(v_a))$ is the C-factor of the C-component of $\mathbf{V}_S$ ($C(\mathbf{V}_S)$); $\{\mathbf{S}_k\}_{k=1}^{c}$ is the C-partition of $\mathbf{V}_S$; and $Q[C(\mathbf{S}_k)] := \prod_{V_a \in C(\mathbf{S}_k)} P(v_a \mid pre(v_a))$ is the C-factor of the C-component $C(\mathbf{S}_k)$ of $\mathbf{S}_k$.

Fig. 2a provides an example graph satisfying the conditions in Thm. 2. Specifically, for all $\mathbf{V}_S \subseteq \mathbf{V} := \{V_1, V_2, V_3\}$, the causal effects are identified through Thm. 2 as

$$E[Y \mid do(\mathbf{v}_S)] = \begin{cases} \sum_{\mathbf{v}_{\bar{S}}} E[Y \mid \mathbf{v}]\, P(v_2 \mid v_1, v_3)\, P(\mathbf{v}_{\bar{S} \setminus \{2\}}), & \text{if } S \in \{\{1\}, \{3\}\}, \\ \sum_{\mathbf{v}_{\bar{S}}} E[Y \mid \mathbf{v}]\, P(\mathbf{v}_{\bar{S}}), & \text{if } S \in \{\emptyset, \{2\}, \{1,2\}, \{2,3\}\}, \\ \sum_{\mathbf{v}_{\bar{S}}} E[Y \mid \mathbf{v}]\, P(\mathbf{v}_{\bar{S}} \mid \mathbf{v}_S), & \text{if } S = \{1,3\}, \\ E[Y \mid \mathbf{v}], & \text{if } S = \{1,2,3\}. \end{cases} \quad (3)$$

Thm. 2 entails the feasibility of the do-Shapley values, since the proposed graphical criterion (checking whether $V_i$ and $Ch(V_i)$ are connected by bidirected paths) can be checked in $O(n^3)$ by applying breadth-first search from each variable $V_i \in \mathbf{V}$ (a sketch of this check is given after the two cases below). To demonstrate the wide applicability of Theorem 2, we provide two special cases which are commonly considered in the literature:

1. Markovian case: No latent confounders exist in the system; i.e., $G$ is given as a DAG (Janzing et al., 2013; 2019; Heskes et al., 2020; Basu, 2020; Wang et al., 2021b; Singal et al., 2021).

2. Direct-cause case: No pair of variables $(V_i, V_j) \in \mathbf{V}$ ($i \neq j$) is connected by a directed path, no $V_i$ is connected to $Y$ via bidirected edges, and no directed edge from $Y$ to $V_i$ exists (i.e., only $V_i \rightarrow Y$ is allowed) (Janzing et al., 2020a;b).
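The following is a minimal sketch (ours) of the criterion check: for each $V_i$, a BFS over the bidirected edges recovers $C(V_i)$, and the criterion fails iff some child of $V_i$ lies in it. The edge-list encoding and the example graph at the end are assumptions of this sketch, not taken from Fig. 2.

```python
from collections import deque

def do_shapley_identifiable(n_nodes, directed, bidirected):
    """Sufficient criterion of Thm. 2: no Vi may be connected to one of its
    children by a path of bidirected edges. One BFS over the bidirected part
    per node, O(n^3) overall. Edges are 0-indexed (i, j) pairs."""
    children = [set() for _ in range(n_nodes)]
    for i, j in directed:
        children[i].add(j)
    adj = [set() for _ in range(n_nodes)]
    for i, j in bidirected:
        adj[i].add(j)
        adj[j].add(i)
    for v in range(n_nodes):
        seen, queue = {v}, deque([v])    # BFS along bidirected edges only
        while queue:
            u = queue.popleft()
            for w in adj[u] - seen:
                seen.add(w)
                queue.append(w)
        if seen & children[v]:           # a child of v lies in v's C-component
            return False
    return True

# Hypothetical example: V1 -> V2 -> V3 -> Y with V1 <-> V3 (no child confounded).
print(do_shapley_identifiable(4, [(0, 1), (1, 2), (2, 3)], [(0, 2)]))  # True
```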
For each of these cases, the identification result in Theorem 2 can be simplified as follows.

Corollary 1 (Identification — Markovian). In the Markovian case, $E[Y \mid do(\mathbf{v}_S)]$ is given as
$$E[Y \mid do(\mathbf{v}_S)] = \sum_{\mathbf{v}_{\bar{S}}} E[Y \mid \mathbf{v}_S, \mathbf{v}_{\bar{S}}] \prod_{i \notin S} P(v_i \mid pre(v_i)).$$

Corollary 2 (Identification — Direct-cause). In the Direct-cause case, $E[Y \mid do(\mathbf{v}_S)]$ is given as
$$E[Y \mid do(\mathbf{v}_S)] = \sum_{\mathbf{v}_{\bar{S}}} E[Y \mid \mathbf{v}_S, \mathbf{v}_{\bar{S}}]\, P(\mathbf{v}_{\bar{S}}).$$

Figs. (2b, 2c) provide example graphs for the Markovian and Direct-cause cases. For Fig. 2b, Coro. 1 gives
$$E[Y \mid do(\mathbf{v}_S)] = \begin{cases} \sum_{\mathbf{v}_{\bar{S}}} E[Y \mid \mathbf{v}]\, P(\mathbf{v}_{\bar{S}} \mid \mathbf{v}_S), & \text{if } S \in \{\{1\}, \{3\}, \{1,3\}\}, \\ \sum_{\mathbf{v}_{\bar{S}}} E[Y \mid \mathbf{v}]\, P(\mathbf{v}_{\bar{S}}), & \text{if } S \in \{\emptyset, \{2\}, \{2,3\}, \{1,2\}\}, \\ E[Y \mid \mathbf{v}], & \text{if } S = \{1,2,3\}. \end{cases} \quad (4)$$

For Fig. 2c, Coro. 2 gives
$$E[Y \mid do(\mathbf{v}_S)] = \sum_{\mathbf{v}_{\bar{S}}} E[Y \mid \mathbf{v}]\, P(\mathbf{v}_{\bar{S}}) \quad (5)$$
for all $\mathbf{v}_S \subseteq \{v_1, v_2, v_3\}$.

Figure 2: Example graphs for (a) Thm. 2 and the two special cases: (b) Markovian and (c) Direct-cause.

5. Estimation of the do-Shapley

Estimating the do-Shapley values in Eq. (2) is computationally and statistically challenging because (1) iterating over all $S \subseteq [n]$ takes computation time exponential in $n$, and (2) estimating $E[Y \mid do(\mathbf{v}_S)]$ might be vulnerable to bias due to the finiteness of the sample dataset. In this section, we design computationally efficient and statistically robust estimators for the do-Shapley values to overcome these challenges, using three different techniques. For ease of presentation, we focus only on the Markovian and Direct-cause cases discussed in Sec. 4, because we are not aware of any general causal effect estimators suitable for estimating the expression in Thm. 2.7 Throughout this section, we assume all variables are discrete.

7The general causal effect estimators proposed by (Jung et al., 2021b) assume that the expression is given by the identification algorithm. Therefore, the results in (Jung et al., 2021b) are not directly applicable to estimating the expression in Thm. 2.

We first introduce estimators leveraging the idea of inverse probability weighting (IPW) (Rosenbaum & Rubin, 1983). Our construction of the IPW estimator is based on the following result.

Lemma 1 (Representation using IPW). Let $S = \{m_1, \ldots, m_s\} \subseteq [n]$ denote an index set for $\mathbf{V}_S$. Let
$$\omega^S_k := \prod_{r=1}^{k} \frac{\mathbb{1}_{v_{m_r}}(V_{m_r})}{h^S_r} \text{ for } k = s, \ldots, 1; \qquad \omega^S := \frac{\mathbb{1}_{\mathbf{v}_S}(\mathbf{V}_S)}{h^S},$$
where $h^S_r := P(V_{m_r} \mid pre(V_{m_r}))$ and $h^S := P(\mathbf{V}_S \mid \mathbf{V}_{\bar{S}})$. Then, $E[Y \mid do(\mathbf{v}_S)] = E[Y\omega]$, where $\omega = \omega^S_s$ for the Markovian case and $\omega = \omega^S$ for the Direct-cause case.

Using Lemma 1, we construct the IPW estimators.

Definition 4 (IPW for $E[Y \mid do(\mathbf{v}_S)]$). The IPW estimator $T^{ipw}(S)$ for $E[Y \mid do(\mathbf{v}_S)]$ is constructed as:
1. Split $D$ randomly into two halves: $D_0$ and $D_1$;
2. Let $\hat{\omega}^S_{s,p}$, $\hat{\omega}^S_p$ denote estimators for $\omega^S_s$, $\omega^S$ from $D_p \in \{D_0, D_1\}$, respectively;
3. For each $p \in \{0, 1\}$, set $T^{ipw}_p(S) := E_{D_{1-p}}[Y \hat{\omega}^S_{s,p}]$ (Markovian) or $E_{D_{1-p}}[Y \hat{\omega}^S_p]$ (Direct-cause);
4. $T^{ipw}(S) := \{T^{ipw}_0(S) + T^{ipw}_1(S)\}/2$.

The data-splitting (also known as sample-splitting) technique (Klaassen, 1987; Robins & Ritov, 1997; Robins et al., 2008; Zheng & van der Laan, 2011; Chernozhukov et al., 2018) will be employed in constructing all do-Shapley estimators discussed in this section. Without data-splitting, some restriction on the complexity of the estimator function class must be imposed to guarantee statistical consistency.
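The following is a minimal sketch (ours) of the cross-fitted IPW estimator of Def. 4 in the Direct-cause case, for a single binary $V_S$, using scikit-learn's LogisticRegression to estimate $h^S = P(V_S \mid \mathbf{V}_{\bar{S}})$. The function name and data layout are assumptions of this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_direct_cause(VS, VSbar, Y, vS, seed=0):
    """Cross-fitted IPW estimate of E[Y | do(v_S)] (Lemma 1 / Def. 4,
    Direct-cause case). VS: (N,) binary feature column; VSbar: (N, d)
    remaining features; vS: the intervened value."""
    rng = np.random.default_rng(seed)
    halves = np.array_split(rng.permutation(len(Y)), 2)
    estimates = []
    for p in (0, 1):
        train, evalu = halves[p], halves[1 - p]
        # Fit h^S = P(V_S | V_Sbar) on one half only (data splitting).
        h = LogisticRegression().fit(VSbar[train], VS[train])
        prob_vS = h.predict_proba(VSbar[evalu])[:, list(h.classes_).index(vS)]
        w = (VS[evalu] == vS) / prob_vS      # omega^S = 1_{v_S}(V_S) / h^S
        estimates.append(np.mean(Y[evalu] * w))
    return np.mean(estimates)
```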
We next introduce estimators leveraging the idea of outcome regression (REG) (Rubin, 1979). Our REG estimator is based on the following result.

Lemma 2 (Representation using REG). Let $S := \{m_1, \ldots, m_s\} \subseteq [n]$ denote an index set for $\mathbf{V}_S$. Let $\theta^S_{s,1} := Y$. For $k = s, s-1, \ldots, 1$,
$$\theta^S_{k,2} := E[\theta^S_{k,1} \mid V_{m_k}, pre(V_{m_k})], \qquad \theta^S_{k-1,1} := E[\theta^S_{k,1} \mid v_{m_k}, pre(V_{m_k})],$$
and let $\theta^S_a := E[Y \mid \mathbf{v}_S, \mathbf{V}_{\bar{S}}]$ and $\theta^S_b := E[Y \mid \mathbf{V}_S, \mathbf{V}_{\bar{S}}]$. Then, $E[Y \mid do(\mathbf{v}_S)] = E[\theta]$, where $\theta = \theta^S_{0,1}$ for the Markovian case and $\theta = \theta^S_a$ for the Direct-cause case.

We construct the REG estimator based on Lemma 2.

Definition 5 (REG for $E[Y \mid do(\mathbf{v}_S)]$). The REG estimator $T^{reg}(S)$ for $E[Y \mid do(\mathbf{v}_S)]$ is constructed as:
1. Split $D$ randomly into two halves: $D_0$ and $D_1$;
2. Let $\hat{\theta}^S_{k,2,p}$, $\hat{\theta}^S_{k-1,1,p}$, $\hat{\theta}^S_{a,p}$ denote estimators for $\theta^S_{k,2}$, $\theta^S_{k-1,1}$, $\theta^S_a$ from $D_p \in \{D_0, D_1\}$, respectively;
3. For each $p \in \{0, 1\}$, $T^{reg}_p(S) := E_{D_{1-p}}[\hat{\theta}^S_{0,1,p}]$ (Markovian) or $E_{D_{1-p}}[\hat{\theta}^S_{a,p}]$ (Direct-cause);
4. $T^{reg}(S) := \{T^{reg}_0(S) + T^{reg}_1(S)\}/2$.

For the IPW and REG estimators to be consistent, one needs to estimate each individual functional (called a nuisance), including $E[Y \mid \mathbf{v}_S, \mathbf{v}_{\bar{S}}]$ and $P(v_i \mid pre(v_i))$, consistently. A desirable robust estimator is one that converges to the ground truth at a fast rate even when estimates for the nuisances are misspecified (i.e., wrongly specified) or converge relatively slowly. Double/debiased machine learning (DML) (Chernozhukov et al., 2017) is a recently introduced technique for constructing such estimators.

Lemma 3 (Representation using DML). Let
$$\eta^S := \begin{cases} \{\theta^S_{0,1}\} \cup \{\theta^S_{k,1}, \theta^S_{k,2}\}_{k=1}^{s} \cup \{h^S_r\}_{r=1}^{s} & \text{(Markovian)} \\ \{\theta^S_a, \theta^S_b, h^S\} & \text{(Direct-cause)}, \end{cases}$$
with the nuisances defined in Lemmas (1, 2) above, and
$$V^S(\mathbf{V}; \eta^S) := \begin{cases} \theta^S_{0,1} + \sum_{k=1}^{s} \omega^S_k\, (\theta^S_{k,1} - \theta^S_{k,2}) & \text{(Markovian)} \\ \theta^S_a + \omega^S\, (Y - \theta^S_b) & \text{(Direct-cause)}, \end{cases}$$
where $\omega^S_k := \prod_{r=1}^{k} \mathbb{1}_{v_{m_r}}(V_{m_r})/h^S_r$ and $\omega^S := \mathbb{1}_{\mathbf{v}_S}(\mathbf{V}_S)/h^S$. Then, $E[Y \mid do(\mathbf{v}_S)] = E[V^S(\mathbf{V}; \eta^S)]$.

We construct the DML estimators based on Lemma 3:

Definition 6 (DML for $E[Y \mid do(\mathbf{v}_S)]$). The DML estimator $T^{dml}(S)$ is constructed as:
1. Split $D$ randomly into two halves: $D_0$ and $D_1$;
2. Construct $\hat{\eta}^S_p$, estimates of $\eta^S$, from $D_p$, $p \in \{0, 1\}$;
3. $T^{dml}_p(S) := E_{D_{1-p}}[V^S(\mathbf{V}; \hat{\eta}^S_p)]$ for $p \in \{0, 1\}$;
4. $T^{dml}(S) := \{T^{dml}_0(S) + T^{dml}_1(S)\}/2$.

Based on the estimators in Defs. (4, 5, 6), we now propose a computationally efficient estimator for the do-Shapley values based on random permutations:

Definition 7 (do-Shapley estimators — Two cases). Let $T^{est}(S) \in \{T^{ipw}(S), T^{reg}(S), T^{dml}(S)\}$ denote an estimator for $E[Y \mid do(\mathbf{v}_S)]$ defined in Defs. (4, 5, 6), respectively. The do-Shapley estimator is given as
$$\hat{\phi}^{est}_{v_i} := \frac{1}{M} \sum_{j=1}^{M} \left( T^{est}(\{i\} \cup pre_{\pi_j}(i)) - T^{est}(pre_{\pi_j}(i)) \right),$$
where $M$ is the number of randomly generated permutations of $[n]$, $\pi_j$ denotes the $j$th permutation, and $pre_{\pi_j}(i)$ is the set of elements that precede $i$ in $\pi_j$.

Equipped with the above results, a systematic procedure for constructing do-Shapley estimators is provided in Algorithm 1.

Algorithm 1 do-Shapley($M$, $T^{est}(\cdot)$)
1: Input: $M$, estimator $T^{est}(\cdot)$ from Defs. (4, 5, 6).
2: Output: Estimates $\{\hat{\phi}_{v_i}\}_{i=1}^{n}$.
3: Initialize $\hat{\phi}_{v_i} = 0$ for all $V_i \in \mathbf{V}$.
4: for $j = 1$ to $M$ do
5:   Generate a random permutation $\pi$ over $[n]$.
6:   for $i = 1$ to $n$ do
7:     $\hat{\phi}_{v_i} \leftarrow \hat{\phi}_{v_i} + T^{est}(\{i\} \cup pre_\pi(i)) - T^{est}(pre_\pi(i))$
8:   end for
9: end for
10: return $\{\hat{\phi}_{v_i}/M\}_{i=1}^{n}$
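A direct Python transcription (ours) of Algorithm 1; `T_est` stands for any of the coalition-level estimators $T^{ipw}$, $T^{reg}$, $T^{dml}$ of Defs. (4-6).

```python
import numpy as np

def do_shapley_estimates(n, M, T_est, seed=0):
    """Algorithm 1: Monte-Carlo do-Shapley estimation over M random
    permutations. T_est maps a frozenset S of feature indices to an
    estimate of E[Y | do(v_S)]."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(n)
    for _ in range(M):
        pi = rng.permutation(n).tolist()
        for pos, i in enumerate(pi):
            pre = frozenset(pi[:pos])                 # pre_pi(i)
            phi[i] += T_est(pre | {i}) - T_est(pre)   # marginal contribution
    return phi / M
```

Since the same coalition $S$ can recur across permutations, memoizing `T_est` (e.g., with functools.lru_cache, which works here because frozensets are hashable) avoids redundant estimation.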
The following theorem summarizes the error analyses of all three do-Shapley estimators.

Theorem 3 (Bias Analysis). Let $\{\pi_j\}_{j=1}^{M}$ denote $M$ randomly generated permutations of $[n]$. For a fixed index $i$, let $S_{j,0} := pre_{\pi_j}(i)$ and $S_{j,1} := \{i\} \cup S_{j,0}$. Let $\{\hat{\eta}^{S_{j,0}}, \hat{\eta}^{S_{j,1}}\}_{j=1}^{M}$ denote $L_2$-consistent estimates for all nuisances $\{\eta^{S_{j,0}}, \eta^{S_{j,1}}\}_{j=1}^{M}$ defined in Def. 6. Let $R_{M,N} := O_P(M^{-1/2} + N^{-1/2})$. Let $e(\hat{g}) := \|\hat{g} - g\|$ denote the error of a nuisance estimate for any $\hat{g} \in \hat{\eta}$ and $g \in \eta$. For the do-Shapley estimators defined in Def. 7, suppose the estimators $T^{est}(S)$ are bounded. Let $\epsilon^{est}_{v_i} := \hat{\phi}^{est}_{v_i} - \phi_{v_i}$ (where $est \in \{ipw, reg, dml\}$). Under the Markovian case,
$$\epsilon^{ipw}_{v_i} = R_{M,N} + O_P\Big\{\tfrac{1}{M}\textstyle\sum_{j=1}^{M} e(\hat{\omega}^{S_{j,p}}_{s_j})\Big\}, \qquad \epsilon^{reg}_{v_i} = R_{M,N} + O_P\Big\{\tfrac{1}{M}\textstyle\sum_{j=1}^{M} e(\hat{\theta}^{S_{j,p}}_{0,1})\Big\},$$
$$\epsilon^{dml}_{v_i} = R_{M,N} + O_P\Big\{\tfrac{1}{M}\textstyle\sum_{j=1}^{M}\sum_{k=1}^{s_j} e(\hat{h}^{S_{j,p}}_{k})\, e(\hat{\theta}^{S_{j,p}}_{k,2})\Big\}.$$
Under the Direct-cause case,
$$\epsilon^{ipw}_{v_i} = R_{M,N} + O_P\Big\{\tfrac{1}{M}\textstyle\sum_{j=1}^{M} e(\hat{\omega}^{S_{j,p}})\Big\}, \qquad \epsilon^{reg}_{v_i} = R_{M,N} + O_P\Big\{\tfrac{1}{M}\textstyle\sum_{j=1}^{M} e(\hat{\theta}^{S_{j,p}}_{a})\Big\},$$
$$\epsilon^{dml}_{v_i} = R_{M,N} + O_P\Big\{\tfrac{1}{M}\textstyle\sum_{j=1}^{M} e(\hat{h}^{S_{j,p}})\, e(\hat{\theta}^{S_{j,p}}_{b})\Big\}.$$

Figure 3: The $L_1$-error plots for the twelve settings {Thm. 2, Markovian, Direct} × {Non-noisy, Noisy, Incorrect REG, Incorrect IPW}. Plots are rendered in high resolution and can be zoomed in.

Remark 4 (Properties of the Proposed Estimators). The error analyses in Thm. 3 exhibit the consistency of the IPW, REG, and DML estimators. Specifically, if the nuisances are consistently estimated, $\epsilon^{ipw}_{v_i} = \epsilon^{reg}_{v_i} = \epsilon^{dml}_{v_i} = o_P(1)$, indicating that the estimators converge to the true quantity. Furthermore, the result presents the statistical robustness property of the DML. In particular, the DML estimate $\hat{\phi}^{dml}_{v_i}$ converges to the true value if either $e(\hat{h}^{S_{j,p}}_k)$ or $e(\hat{\theta}^{S_{j,p}}_{k,2})$ under Markovian, and either $e(\hat{h}^{S_{j,p}})$ or $e(\hat{\theta}^{S_{j,p}}_b)$ under Direct-cause, is accurate (double robustness). Also, $\hat{\phi}^{dml}_{v_i}$ converges at the root-$N$ rate if all nuisances $\hat{h}^{S_{j,p}}_k$, $\hat{\theta}^{S_{j,p}}_{k,2}$ under Markovian, and all nuisances $\hat{h}^{S_{j,p}}$, $\hat{\theta}^{S_{j,p}}_b$ under Direct-cause, converge at least at the $N^{-1/4}$ rate (debiasedness).

6. Experiments

In this section, we empirically compare the performance of the proposed do-Shapley estimators from the previous section. Details of the experiments and a different simulation example are provided in Appendices E and F.

Experimental Setup. We use synthetic datasets based on Figs. (2a, 2b, 2c), where each figure corresponds to the Thm. 2, Markovian, and Direct-cause case, respectively. We note that the causal effects are identified as in Eqs. (3, 4, 5), respectively. Even though no known estimators exist for the general expression in Thm. 2, we note that Eq. (3) is in an amenable form to which the results in Sec. 5 are applicable. Throughout the simulation, we denote by $\{\phi_{v_i}\}_{i=1}^{n}$ the ground-truth do-Shapley values.

Comparison Between Estimators. We compare the three estimators (IPW, REG, DML), denoted by $\{\hat{\phi}^{ipw}_{v_i}, \hat{\phi}^{reg}_{v_i}, \hat{\phi}^{dml}_{v_i}\}$ respectively, for the scenarios depicted in the graphs in Figs. (2a, 2b, 2c). Nuisances are estimated using a gradient boosting model (Friedman, 2001). Let $\hat{\phi}^{est}_{v_i,k} \in \{\hat{\phi}^{dml}_{v_i,k}, \hat{\phi}^{ipw}_{v_i,k}, \hat{\phi}^{reg}_{v_i,k}\}$ denote the estimated importance of the $i$th feature of the $k$th sample set (i.e., $V_{i,k} \in \mathbf{V}^{(k)} \in D$). We assess the quality of an estimator by computing the $L_1$ error $L_1(est, k) := (1/n)\sum_{i=1}^{n} |\hat{\phi}^{est}_{v_i,k} - \phi_{v_i,k}|$ (where $n$ is the number of features). We ran the simulation for 50 randomly generated sets of samples, i.e., $k \in \{1, 2, \ldots, 50\}$, and with sample sizes $N := |D| \in \{100, 250, 500, 750, 1000\}$ to observe the convergence behaviors of the estimators. We fix $M = 20$. We refer to the box-plot of $L_1(est, k)$ as the "$L_1$-error plot". For all of the {Thm. 2, Markovian, Direct-cause} cases, we compare the performances of the three do-Shapley estimators for (1) Non-noisy, where no noise is introduced into the model; (2) Noisy, where a converging noise $\epsilon$, decaying at an $N^{-\alpha}$ rate (i.e., $\epsilon \sim \text{Normal}(N^{-\alpha}, N^{-2\alpha})$) for $\alpha = 1/4$, is added to the estimated nuisances to control the convergence rate, following the technique in (Kennedy, 2020); (3) Incorrect REG, where the model for the REG estimator in Def. 5 is wrongly specified; and (4) Incorrect IPW, where the model for the IPW estimator in Def. 4 is wrongly specified.
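For reproducibility of the Noisy setting, the following is our reading of the noise-injection step (after Kennedy, 2020): $\epsilon \sim \text{Normal}(N^{-\alpha}, N^{-2\alpha})$ (mean $N^{-\alpha}$, variance $N^{-2\alpha}$) is added to the fitted nuisance values, forcing an $N^{-\alpha}$ convergence rate. The helper name is ours; the paper's exact implementation is not shown.

```python
import numpy as np

def add_converging_noise(nuisance_hat, N, alpha=0.25, rng=None):
    """Degrade a nuisance estimate so that it converges at the N^{-alpha} rate,
    as described in Sec. 6: add eps ~ Normal(N^{-alpha}, N^{-2*alpha}) to each
    fitted value (scale = std = N^{-alpha})."""
    rng = rng or np.random.default_rng()
    eps = rng.normal(loc=N**-alpha, scale=N**-alpha,
                     size=np.shape(nuisance_hat))
    return nuisance_hat + eps
```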
Experimental Results. The $L_1$-error plots for all cases are presented in Fig. 3. In the non-noisy setting, the performances of all three estimators {DML, REG, IPW} are similar. In the noisy setting, where the estimated nuisances are controlled to converge at the $N^{-1/4}$ rate, the DML estimators outperform the other two estimators by achieving fast convergence with the smallest variance. This result corroborates the robustness property of the DML (Remark 4). Also, the DML estimator exhibits the double robustness property; the estimator converges in both the Incorrect IPW and Incorrect REG settings, where each corresponding nuisance is wrongly specified.

Contrasting with the Conditional Shapley. We contrast the do-Shapley and the conditional Shapley in the non-noisy setting. We compare the importance ranking measured by the true do-Shapley with the ranks from the do-DML and conditional Shapley through Spearman's rank correlation. The correlation is close to 1 if the two rankings are similar and to -1 if the rankings are opposite. The true data-generating function is $Y = 3V_1 + 0.4V_2 + V_3 + U_Y$, and the true do-Shapley identifies $V_1$, having the largest coefficient, as the most important. As shown in Table 2, the do-DML-Shapley ranks the feature importance closer to the true rank. As can be noted, the do-DML-Shapley identifies $V_1$ as the most important.

              Thm. 2   Markovian   Direct
DML            1.0        0.8       0.93
Conditional   -0.28      -0.74      0.52

Table 2: Comparison of the rank correlations.

7. Conclusion

We propose the do-Shapley as a causal contribution measure and provide theoretical justification through an axiomatic characterization (Thm. 1). Next, we provide conditions under which the do-Shapley values can be inferred from non-experimental data in polynomial time (Thm. 2). We then propose three do-Shapley estimators (IPW, REG, DML) that are consistent. We show that the DML estimator has the additional robustness properties of double robustness and debiasedness (Thm. 3). We expect the proposed contribution measure will help empirical scientists answer "what are the contributions of each cause to the effect?"

Acknowledgements

We thank the reviewers for their feedback and help in improving this manuscript. This work was done in part while Jin Tian was visiting the Simons Institute for the Theory of Computing. Elias Bareinboim and Yonghan Jung were partially supported by grants from NSF IIS-1750807 (CAREER).

References

Bareinboim, E. and Pearl, J. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345-7352, 2016.

Basu, D. On shapley credit allocation for interpretability. arXiv preprint arXiv:2012.05506, 2020.

Bhattacharya, R., Nabi, R., and Shpitser, I. Semiparametric inference for causal effects in graphical models with hidden variables. arXiv preprint arXiv:2003.12659, 2020.

Bhattacharyya, A., Gayen, S., Kandasamy, S., Maran, A., and Variyam, V. N. Learning and sampling of atomic interventions from observations. In International Conference on Machine Learning, pp. 842-853. PMLR, 2020.

Bhattacharyya, A., Gayen, S., Kandasamy, S., Raval, V., and Vinodchandran, N. Efficient inference of interventional distributions. arXiv preprint arXiv:2107.11712, 2021.

Charnes, A., Golany, B., Keane, M., and Rousseau, J. Extremal principle solutions of games in characteristic function form: Core, chebychev and shapley value generalizations. 1988.

Chen, T. and Guestrin, C.
Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785-794, 2016.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., and Newey, W. Double/debiased/neyman machine learning of treatment effects. American Economic Review, 107(5):261-65, 2017.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), 2018.

Covert, I., Lundberg, S., and Lee, S.-I. Explaining by removing: A unified framework for model explanation. Journal of Machine Learning Research, 22(209):1-90, 2021.

Eva, B. and Stern, R. Causal explanatory power. The British Journal for the Philosophy of Science, 70(4):1029-1050, 2019.

Friedman, J. H. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189-1232, 2001.

Frye, C., Rowat, C., and Feige, I. Asymmetric shapley values: incorporating causal knowledge into model-agnostic explainability. Advances in Neural Information Processing Systems, 33, 2020.

Galles, D. and Pearl, J. Axioms of causal relevance. Artificial Intelligence, 97(1-2):9-43, 1997.

Heskes, T., Sijben, E., Bucur, I. G., and Claassen, T. Causal shapley values: Exploiting causal knowledge to explain individual predictions of complex models. Advances in Neural Information Processing Systems, 33, 2020.

Huang, Y. and Valtorta, M. Pearl's calculus of intervention is complete. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, pp. 217-224. AUAI Press, 2006.

Jaber, A., Zhang, J., and Bareinboim, E. Causal identification under markov equivalence. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence, 2018.

Janzing, D., Balduzzi, D., Grosse-Wentrup, M., and Schölkopf, B. Quantifying causal influences. The Annals of Statistics, 41(5):2324-2358, 2013.

Janzing, D., Budhathoki, K., Minorics, L., and Blöbaum, P. Causal structure based root cause analysis of outliers. arXiv preprint arXiv:1912.02724, 2019.

Janzing, D., Blöbaum, P., Minorics, L., and Faller, P. Quantifying causal contributions via structure preserving interventions. arXiv preprint arXiv:2007.00714, 2020a.

Janzing, D., Minorics, L., and Blöbaum, P. Feature relevance quantification in explainable ai: A causal problem. In International Conference on Artificial Intelligence and Statistics, pp. 2907-2916. PMLR, 2020b.

Jung, Y., Tian, J., and Bareinboim, E. Learning causal effects via weighted empirical risk minimization. Advances in Neural Information Processing Systems, 33, 2020.

Jung, Y., Tian, J., and Bareinboim, E. Estimating identifiable causal effects on markov equivalence class through double machine learning. In Proceedings of the 38th International Conference on Machine Learning, 2021a.

Jung, Y., Tian, J., and Bareinboim, E. Estimating identifiable causal effects through double machine learning. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2021b.

Kennedy, E. H. Optimal doubly robust estimation of heterogeneous causal effects. arXiv preprint arXiv:2004.14497, 2020.

Kennedy, E. H., Balakrishnan, S., G'Sell, M., et al. Sharp instruments for classifying compliers and generalizing causal effects. Annals of Statistics, 48(4):2008-2030, 2020.

Klaassen, C. A. Consistent estimation of the influence function of locally asymptotically linear estimators.
The Annals of Statistics, pp. 1548-1562, 1987.

Koster, J. T. et al. Marginalizing and conditioning in graphical models. Bernoulli, 8(6):817-840, 2002.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.

Lee, S. and Bareinboim, E. Causal effect identifiability under partial-observability. In Proceedings of the 37th International Conference on Machine Learning, 2020.

Lee, S., Correa, J. D., and Bareinboim, E. General identifiability with arbitrary surrogate experiments. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2019.

Lee, S., Correa, J., and Bareinboim, E. Generalized transportability: Synthesis of experiments from heterogeneous domains. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, 2020.

Lundberg, S. Be careful when interpreting predictive models in search of causal insights. 2021. URL https://bit.ly/3gcZmgl.

Lundberg, S. M. and Lee, S.-I. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4768-4777, 2017.

Molnar, C. Interpretable Machine Learning. Lulu.com, 2020.

Pearl, J. Causal diagrams for empirical research. Biometrika, 82(4):669-710, 1995.

Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000. 2nd edition, 2009.

Robins, J. A new approach to causal inference in mortality studies with a sustained exposure period: application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9-12):1393-1512, 1986.

Robins, J., Li, L., Tchetgen, E., van der Vaart, A., et al. Higher order influence functions and minimax estimation of nonlinear functionals. In Probability and Statistics: Essays in Honor of David A. Freedman, pp. 335-421. Institute of Mathematical Statistics, 2008.

Robins, J. M. and Ritov, Y. Toward a curse of dimensionality appropriate (coda) asymptotic theory for semi-parametric models. Statistics in Medicine, 16(3):285-319, 1997.

Rosenbaum, P. R. and Rubin, D. B. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41-55, 1983.

Rubin, D. B. Using multivariate matched sampling and regression adjustment to control bias in observational studies. Journal of the American Statistical Association, 74(366a):318-328, 1979.

Schamberg, G., Chapman, W., Xie, S.-P., and Coleman, T. P. Direct and indirect effects: an information theoretic perspective. Entropy, 22(8):854, 2020.

Schwab, P. and Karlen, W. Cxplain: Causal explanations for model interpretation under uncertainty. Advances in Neural Information Processing Systems 32, pp. 10220-10230, 2019.

Shapley, L. S. A value for n-person games. In Contributions to the Theory of Games, Annals of Mathematics Study No. 28, 1953.

Shapley, L. S. and Shubik, M. A method for evaluating the distribution of power in a committee system. American Political Science Review, 48(3):787-792, 1954.

Shpitser, I. and Pearl, J. Identification of joint interventional distributions in recursive semi-markovian causal models. In Proceedings of the 21st AAAI Conference on Artificial Intelligence, pp. 1219. AAAI Press, 2006.

Singal, R., Michailidis, G., and Ng, H. Flow-based attribution in graphical models: A recursive shapley approach. Available at SSRN 3845526, 2021.

Štrumbelj, E. and Kononenko, I. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3):647-665, 2014.

Sundararajan, M. and Najmi, A.
The many shapley values for model explanation. In International Conference on Machine Learning, pp. 9269-9278. PMLR, 2020.

Tian, J. and Pearl, J. On the identification of causal effects. Technical Report R-290-L, 2003.

Wang, J., Wiens, J., and Lundberg, S. Shapley flow: A graph-based approach to interpreting model predictions. In Banerjee, A. and Fukumizu, K. (eds.), The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, volume 130 of Proceedings of Machine Learning Research, pp. 721-729. PMLR, 2021a.

Wang, L., Zhang, Y., Richardson, T. S., and Robins, J. M. Estimation of local treatment effects under the binary instrumental variable model. Biometrika, 2021b.

Winter, E. The shapley value. Handbook of Game Theory with Economic Applications, 3:2025-2054, 2002.

Young, H. P. Monotonic solutions of cooperative games. International Journal of Game Theory, 14(2):65-72, 1985.

Zheng, W. and van der Laan, M. J. Cross-validated targeted minimum-loss-based estimation. In Targeted Learning, pp. 459-474. Springer, 2011.

Appendix: On Measuring Causal Contributions via do-interventions

A. Fundamentals of the Shapley Value

The Shapley value (Shapley, 1953) in Eq. (1) seeks to allocate the contribution of each $i \in [n]$ on some function value $f([n])$, given a coalition function $\nu(S)$ that measures the value of the coalition of players $i \in S$ (where $\nu([n]) = f([n])$). The Shapley value uniquely satisfies the following desiderata:

Theorem A.1 (Axiomatization of the Shapley Value (Shapley, 1953; Shapley & Shubik, 1954; Young, 1985)). For any subset $S$ of the players indexed by $[n] = \{1, 2, \ldots, n\}$ and a value function $\nu(S)$, the Shapley value of player $i$, denoted $\phi_i = \phi_i(\nu)$, which equals

$$\phi_i = \sum_{S \subseteq [n] \setminus \{i\}} \frac{|S|!\, (n - |S| - 1)!}{n!}\, \{\nu(S \cup \{i\}) - \nu(S)\}, \quad (A.1)$$

is the unique attribution method satisfying the following axioms (properties):

1. Efficiency: $\sum_{i \in [n]} \phi_i = \nu([n])$;
2. Dummy: For some $i \in [n]$, if $\nu(S \cup \{i\}) = \nu(S)$ for all $S \subseteq [n] \setminus \{i\}$, then $\phi_i = 0$;
3. Symmetry: For some distinct $(i, j) \in [n]$, if $\nu(S \cup \{i\}) = \nu(S \cup \{j\})$ for all $S \subseteq [n] \setminus \{i, j\}$, then $\phi_i = \phi_j$;
4. Linearity: For all $i \in [n]$ and any two coalition functions $\nu_1$ and $\nu_2$, $\phi_i(\nu_1 + \nu_2) = \phi_i(\nu_1) + \phi_i(\nu_2)$.

B. Details on Remark 3

Given a semi-Markovian causal graph $G$, a realized vector $(\mathbf{v}, y)$ corresponding to a set of variables $(\mathbf{V}, Y)$, and a subset $\mathbf{x} \subseteq \mathbf{v}$ corresponding to a set of variables $\mathbf{X} \subseteq \mathbf{V}$, a procedure for assigning contributions only to $x_i \in \mathbf{x}$ is the following:

1. Construct a graph $G[\mathbf{X}]$ composed of the nodes in $\mathbf{X}$ and edges added as follows (Tian & Pearl, 2003):
(a) add a directed edge $V_i \rightarrow V_j$ in $G[\mathbf{X}]$ if there exists a directed path from $V_i$ to $V_j$ in $G$ such that every internal vertex on the path is not in $\mathbf{X}$;
(b) add a bidirected edge $V_i \leftrightarrow V_j$ in $G[\mathbf{X}]$ if there exists a divergent path between $V_i$ and $V_j$ in $G$ such that every internal vertex on the path is not in $\mathbf{X}$.

2. Compute the do-Shapley w.r.t. $\{y, \mathbf{x}\}$ on $G[\mathbf{X}]$. Specifically, for all $x_i \in \mathbf{x}$,

$$\phi_{x_i} = \sum_{S \subseteq [|\mathbf{x}|] \setminus \{i\}} \omega_{\mathbf{x}}(S)\, \{E[Y \mid do(\mathbf{x}_{S \cup \{i\}})] - E[Y \mid do(\mathbf{x}_S)]\}, \quad (B.1)$$

where $\omega_{\mathbf{x}}(S) := \frac{1}{|\mathbf{x}|}\binom{|\mathbf{x}|-1}{|\mathbf{x}_S|}^{-1}$.

Then, $\{\phi_{x_i}\}_{x_i \in \mathbf{x}}$ is a unique causal contribution measure:

Proposition S.1. $\{\phi_{x_i}\}_{x_i \in \mathbf{x}}$ is a unique causal contribution measure w.r.t. $\{y, \mathbf{x}\}$ on $G$.

Proof. It suffices to show that $G[\mathbf{X}]$ is a graph compatible with $P(\mathbf{X})$, because $\phi_{x_i}$ is the do-Shapley value defined on a graph compatible with $P(\mathbf{X})$. By (Koster et al., 2002), $G[\mathbf{X}]$ is such a graph.
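A sketch (ours) of step 1(a), the directed-edge half of the projection; the bidirected step 1(b) for divergent paths is analogous and omitted here. All names are assumptions of this example.

```python
def project_directed(nodes, directed, X):
    """Step 1(a) of the projection in Appendix B: add Vi -> Vj to G[X]
    (Vi, Vj in X) when G has a directed path from Vi to Vj whose internal
    vertices all lie outside X. `directed` is a set of (parent, child) pairs."""
    children = {v: set() for v in nodes}
    for a, b in directed:
        children[a].add(b)
    X, edges = set(X), set()
    for src in X:
        # DFS from src through non-X vertices; the first X vertex reached on
        # each branch becomes a child of src in the projected graph G[X].
        stack, seen = list(children[src]), set()
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            if v in X:
                edges.add((src, v))      # stop: internal vertices must avoid X
            else:
                stack.extend(children[v])
        edges.discard((src, src))
    return edges

# V1 -> A -> V2 with X = {V1, V2}: A is projected away, leaving V1 -> V2.
print(project_directed(["V1", "A", "V2"], {("V1", "A"), ("A", "V2")}, {"V1", "V2"}))
```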
C. Relation with Other Work - Examples

In this section, we provide examples demonstrating that other types of Shapley values do not satisfy Axiom 1. We first note that the conditional Shapley does not satisfy the causal irrelevance property in Axiom 1.

Example C.1 (Causal Irrelevance Property does not hold for the conditional Shapley (Janzing et al., 2020b)). Consider $G = \{V_1 \leftrightarrow V_2, V_2 \rightarrow Y\}$, where $V_1, V_2 \in \{0, 1\}$ and the bidirected edge denotes the existence of a hidden confounder. Suppose $P(v_1, v_2) = 1/2$ whenever $v_1 = v_2$. Note that $V_1$ is causally irrelevant to $Y$. The causal irrelevance property does not hold for the conditional Shapley. Specifically, for any $v_1$, $E[Y \mid v_1] - E[Y] = v_1 - 1/2 \neq 0$, which implies that $\phi^{cond}_{v_1} \neq 0$. In contrast, $E[Y \mid do(v_1)] - E[Y] = E[Y \mid do(v_1, v_2)] - E[Y \mid do(v_2)] = 0$. Therefore, $\phi_{v_1} = 0$, implying that the do-Shapley satisfies the causal irrelevance property, unlike the conditional Shapley.
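Example C.1 can be checked numerically; the sketch below (ours) assumes the unstated mechanism $Y = V_2$, which is consistent with $E[Y \mid v_1] = v_1$ above.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10**6
u = rng.binomial(1, 0.5, N)     # hidden confounder behind V1 <-> V2
v1, v2 = u, u                   # V1 and V2 are confounded copies; no edge V1 -> V2
y = v2.astype(float)            # assumed mechanism Y = V2; V1 is causally irrelevant

# Conditional value: E[Y | V1=1] - E[Y] = 1 - 1/2 != 0, so phi^cond_{v1} != 0.
print(y[v1 == 1].mean() - y.mean())

# Interventional value: do(V1=1) leaves the mechanisms of V2 and Y untouched,
# so Y's distribution is unchanged and E[Y | do(V1=1)] - E[Y] ~ 0.
y_do = rng.binomial(1, 0.5, N).astype(float)   # fresh draw of V2 (= U), hence of Y
print(y_do.mean() - y.mean())
```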
The ICC does not satisfy the causal symmetry property in Axiom 1.

Example C.2 (Causal Symmetry Property does not hold for the ICC Approach). Consider the following SCM $M$ over binary variables $U_{V_1}, U_{V_2}, U_Y, V_1, V_2, Y \in \{0, 1\}$: $P(U_{V_1} = 1) = 0.5$, $P(U_{V_2} = 1) = 0.2$, and $P(U_Y = 1) = 0.8$; also, $V_1 \leftarrow f_{V_1}(U_{V_1}) = U_{V_1}$, $V_2 \leftarrow f_{V_2}(V_1, U_{V_2})$, and $Y \leftarrow f_Y(V_2, U_Y)$. The corresponding causal diagram is $G = \{V_1 \rightarrow V_2 \rightarrow Y, \{U_V \rightarrow V \text{ for all } V \in \{V_1, V_2, Y\}\}\}$. Let $y = v_1 = v_2 = 1$. Then, $P(y \mid do(v_1)) = P(y \mid v_1) = 0.8$, $P(y \mid do(v_2)) = P(y \mid v_2) = 0.8$, $P(y \mid do(v_1, v_2)) = P(y \mid v_2) = 0.8$, and $P(y) = 0.65$. We first note that $v_1$ and $v_2$ have the same causal explanatory power to $Y$, since $P(y \mid do(v_1)) = P(y \mid do(v_2)) = 0.8$. Also, the do-Shapley values for $v_1, v_2$ are the same, $\phi_{v_1} = \phi_{v_2} = 0.075$, which exhibits the causal symmetry. To compute the ICC of the features $v_1 = v_2 = 1$, we fix $u_1 = 1$ and $u_2 = 0$, which makes $v_1 = v_2 = 1$. Let $\phi^{icc}_{v_i}$ denote the ICC of $v_i$. Then, $\phi^{icc}_{v_1} = 0.225$ and $\phi^{icc}_{v_2} = 0.075$, even though $v_1, v_2$ have the same causal explanatory power. This implies that the causal symmetry property does not hold.

D. Proofs

We provide complete proofs and additional missing details here.

D.1. Proofs from Section 3

We use $\nu_{do}(S) := E[Y \mid do(\mathbf{v}_S)]$ in the proofs.

Theorem D.1 (Restated Theorem 1). The do-Shapley is the unique causal contribution measure satisfying all the properties in Axiom 1.

Proof. We first prove that the do-Shapley satisfies all the properties in Axiom 1.

Lemma S.1 (Soundness of do-Shapley). The do-Shapley satisfies all properties in Axiom 1.

Proof. First, consider the perfect assignment property. By the result of (Štrumbelj & Kononenko, 2014), we can represent the do-Shapley as
$$\phi_{v_i}(\nu_{do}) = \frac{1}{n!} \sum_{\pi \in \Pi([n])} \{\nu_{do}(\{i\} \cup pre_\pi(i)) - \nu_{do}(pre_\pi(i))\},$$
where $\Pi([n])$ is the set of all possible permutations of $[n]$, $\pi$ is an individual permutation in $\Pi([n])$, and $pre_\pi(i) := \{k \in [n] \text{ such that } k < i \text{ in } \pi([n])\}$. Then, by telescoping,
$$\sum_{i=1}^{n} \phi_{v_i}(\nu_{do}) = \frac{1}{n!} \sum_{\pi \in \Pi([n])} \sum_{i=1}^{n} \{\nu_{do}(\{i\} \cup pre_\pi(i)) - \nu_{do}(pre_\pi(i))\} = \frac{1}{n!} \sum_{\pi \in \Pi([n])} \{\nu_{do}([n]) - \nu_{do}(\emptyset)\} = \nu_{do}([n]) - \nu_{do}(\emptyset) = E[Y \mid do(\mathbf{v})] - E[Y] = E[Y \mid do(\mathbf{v})].$$

Now we consider the causal irrelevance property. Suppose $V_i$ is causally irrelevant to $Y$ in expectation for all witnesses $\mathbf{w} \subseteq \mathbf{v} \setminus \{v_i\}$. Then, the equality $\nu_{do}(S \cup \{i\}) - \nu_{do}(S) = 0$ holds immediately for all $S \subseteq [n] \setminus \{i\}$, so $\phi_{v_i} = 0$.

Next we consider the causal symmetry property. Suppose $v_i, v_j$ have the same causal explanatory power w.r.t. any witness $\mathbf{w} \subseteq \mathbf{v} \setminus \{v_i, v_j\}$. This gives $\nu_{do}(S \cup \{i\}) = \nu_{do}(S \cup \{j\})$ for any $S \subseteq [n] \setminus \{i, j\}$. Splitting the sum over $S \subseteq [n] \setminus \{i\}$ according to whether $j \in S$,
$$\phi_{v_i}(\nu_{do}) = \sum_{S \subseteq [n] \setminus \{i,j\}} \omega(S)\, \{\nu_{do}(S \cup \{i\}) - \nu_{do}(S)\} + \sum_{S \subseteq [n] \setminus \{i,j\}} \omega(S \cup \{j\})\, \{\nu_{do}(S \cup \{i, j\}) - \nu_{do}(S \cup \{j\})\}$$
$$= \sum_{S \subseteq [n] \setminus \{i,j\}} \omega(S)\, \{\nu_{do}(S \cup \{j\}) - \nu_{do}(S)\} + \sum_{S \subseteq [n] \setminus \{i,j\}} \omega(S \cup \{i\})\, \{\nu_{do}(S \cup \{i, j\}) - \nu_{do}(S \cup \{i\})\} = \phi_{v_j}(\nu_{do}),$$
where the second equality holds since $(v_i, v_j)$ have the same causal explanatory power, noting $\omega(S \cup \{i\}) = \omega(S \cup \{j\}) = \frac{1}{n}\binom{n-1}{|S|+1}^{-1}$.

Now, we prove that the do-Shapley satisfies the causal approximation property by showing that there exists an $\omega(S)$ that makes the do-Shapley the solution of the weighted least squares problem defined in Axiom 1. For a coalition function $\nu(S)$ (see the Shapley value paragraph in Sec. 2), it is known that there exists a specific weight function $\omega(S)$ that makes the Shapley value in Eq. (1) the solution of the following WLS problem: $\arg\min_{\{\phi'_{v_i}\}_{i=1}^{n}} \sum_{S \subseteq [n]} (\nu(S) - \sum_{i \in S} \phi'_{v_i})^2\, \omega(S)$ (by (Charnes et al., 1988, Thm. 4) and (Lundberg & Lee, 2017, Theorem 2)). This implies that such an $\omega(S)$ is the weight function that makes the do-Shapley the solution of the weighted least squares problem defined in Axiom 1.

We now show the other direction: a measure satisfying all properties in Axiom 1 is the do-Shapley.

Lemma S.2 (Completeness of do-Shapley). A vector $\{\phi_{v_i}\}_{v_i \in \mathbf{v}}$ satisfying Axiom 1 is the do-Shapley value.

Proof. Throughout the proof, we will use the following canonical SCM: let $T \subseteq [n]$ denote any fixed index set. An SCM is called canonical for $T$ if $E[Y \mid do(\mathbf{V}_S = \mathbf{1})] = 1$ iff $T \subseteq S$, and $0$ otherwise. We use $\nu^T_{do}(S)$ to denote the causal coalition function induced by the canonical SCM; note $\nu^T_{do}(S) = 1$ iff $T \subseteq S$, and $0$ otherwise, by the definition of the canonical SCM.

We first note that a vector $\phi_{v_i}$ that satisfies the causal approximation property can be represented as a linear function of $\nu_{do}(S)$, because $\phi_{v_i}$ is a solution of the weighted least squares linear regression problem. Therefore,
$$\phi_{v_i} = \sum_{S \subseteq [n]} a^i_S\, \nu_{do}(S) \quad (D.1)$$
for some constants $\{a^i_S\}$.

Now we focus on the causal irrelevance property. Suppose $T \subseteq [n] \setminus \{i\}$. For any $S \subseteq [n]$, $(T \subseteq S) \Leftrightarrow (T \subseteq S \cup \{i\})$, whereas with $i \in T$ this equivalence fails. Therefore, $\nu^T_{do}(S) = \nu^T_{do}(S \cup \{i\})$ for all $S \subseteq [n]$, and, by the causal irrelevance property, $\phi_{v_i}(\nu^T_{do}) = 0$ if $T \subseteq [n] \setminus \{i\}$. Then, $\phi_{v_i}(\nu^{[n] \setminus \{i\}}_{do}) = a^i_{[n]} + a^i_{[n] \setminus \{i\}} = 0$. Suppose it has been shown that $a^i_{T \cup \{i\}} + a^i_T = 0$ for all $T \subseteq [n] \setminus \{i\}$ such that $|T| \geq k$ for some $k$. Then, for any $S \subseteq [n] \setminus \{i\}$ such that $|S| = k - 1$,
$$\phi_{v_i}(\nu^S_{do}) = \sum_{T \subseteq [n]} a^i_T\, \nu^S_{do}(T) = \sum_{\substack{T \subseteq [n] \setminus \{i\} \\ S \subseteq T,\; T \neq S}} (a^i_{T \cup \{i\}} + a^i_T) + a^i_{S \cup \{i\}} + a^i_S = a^i_{S \cup \{i\}} + a^i_S,$$
where the first equality is by Eq. (D.1), the second by the property of the canonical SCM and standard algebra, and the last by the inductive hypothesis. Since $S \subseteq [n] \setminus \{i\}$, the causal irrelevance property gives $\phi_{v_i}(\nu^S_{do}) = 0$, implying $a^i_{S \cup \{i\}} + a^i_S = 0$. Therefore, for any $T \subseteq [n] \setminus \{i\}$, $a^i_{T \cup \{i\}} + a^i_T = 0$. Fix $p^i_T := a^i_{T \cup \{i\}} = -a^i_T$. Then,
$$\phi_{v_i}(\nu_{do}) = \sum_{T \subseteq [n]} a^i_T\, \nu_{do}(T) = \sum_{T \subseteq [n] \setminus \{i\}} \{a^i_{T \cup \{i\}}\, \nu_{do}(T \cup \{i\}) + a^i_T\, \nu_{do}(T)\} = \sum_{T \subseteq [n] \setminus \{i\}} p^i_T\, \{\nu_{do}(T \cup \{i\}) - \nu_{do}(T)\}.$$

Now we focus on the causal symmetry property. Suppose $v_i$ and $v_j$ have the same causal explanatory power with any given witness $\mathbf{w} \subseteq \mathbf{v} \setminus \{v_i, v_j\}$ in the canonical SCM for $[n]$; i.e., $\nu^{[n]}_{do}(S \cup \{i\}) = \nu^{[n]}_{do}(S \cup \{j\})$ for $S \subseteq [n] \setminus \{i, j\}$. We note that $\phi_{v_i}(\nu^{[n]}_{do}) = p^i_{[n] \setminus \{i\}}$ and $\phi_{v_j}(\nu^{[n]}_{do}) = p^j_{[n] \setminus \{j\}}$, so by causal symmetry there exists $p_{n-1} := p^i_{[n] \setminus \{i\}} = p^j_{[n] \setminus \{j\}}$. Again, suppose $v_i, v_j$ have the same causal explanatory power with any given witness $\mathbf{w} \subseteq \mathbf{v} \setminus \{v_i, v_j\}$ in the canonical SCM for $[n] \setminus \{k\}$, for any fixed $k \notin \{i, j\}$. Then, $\phi_{v_i}(\nu^{[n] \setminus \{k\}}_{do}) = p^i_{[n] \setminus \{i, k\}} + p_{n-1} = \phi_{v_j}(\nu^{[n] \setminus \{k\}}_{do}) = p^j_{[n] \setminus \{j, k\}} + p_{n-1}$. This implies that there exists a constant $p_{n-2} := p^i_{[n] \setminus \{i, k\}} = p^j_{[n] \setminus \{j, k\}}$. By repeating this argument, we obtain constants $p_0, \ldots, p_{n-1}$, where $p_{|T|}$ applies to all $p^i_T$ for any $T \subseteq [n] \setminus \{i\}$. Therefore, there are constants $\{p_{|T|}\}_{T \subseteq [n] \setminus \{i\}}$ such that
$$\phi_{v_i}(\nu_{do}) = \sum_{T \subseteq [n] \setminus \{i\}} p_{|T|}\, \{\nu_{do}(T \cup \{i\}) - \nu_{do}(T)\}.$$

Finally, we focus on the perfect assignment property.
An attribution $\phi_{v_i}$ satisfies the perfect assignment property if and only if $\sum_{i \in [n]} p_{n-1} = 1$ and, for any nonempty $T \subsetneq [n]$, $\sum_{i \in T} p_{|T|-1} = \sum_{j \notin T} p_{|T|}$ (Winter, 2002, Chap. 7, Theorem 11). This gives $p_{n-1} = 1/n$ and, for any nonempty $T \subsetneq [n]$, $|T|\, p_{|T|-1} = (n - |T|)\, p_{|T|}$. Solving this recursion yields the closed form
$$p_{|T|} = \frac{(n - |T| - 1)!\, |T|!}{n!}.$$
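For completeness, this closed form matches the do-Shapley weight $\omega(T)$ of Eq. (2):
$$p_{|T|} = \frac{(n - |T| - 1)!\, |T|!}{n!} = \frac{1}{n} \cdot \frac{|T|!\, (n - |T| - 1)!}{(n-1)!} = \frac{1}{n}\binom{n-1}{|T|}^{-1} = \omega(T),$$
so the coefficients forced by the axioms coincide exactly with the do-Shapley weights.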
D.2. Proofs from Section 4

Theorem D.2 (Restated Theorem 2). The do-Shapley is identifiable if no variable $V_i \in \mathbf{V}$ is connected to its children $Ch(V_i)$ by bidirected paths in $G$. Suppose, in addition, that $Y$ is not connected to any node by bidirected paths. In this case, for any $S \subseteq [n]$,
$$E[Y \mid do(\mathbf{v}_S)] = \sum_{\mathbf{v}\setminus\mathbf{v}_S} E[Y \mid \mathbf{v}]\, Q[\mathbf{V}\setminus\mathbf{V}_S],$$
where $Q[\mathbf{V}\setminus\mathbf{V}_S] := Q[\mathbf{V}\setminus\mathbf{V}_S](\mathbf{v})$ is given as
$$Q[\mathbf{V}\setminus\mathbf{V}_S] = \frac{P(\mathbf{v})}{Q[C(\mathbf{V}_S)]} \prod_{k=1}^{c} \sum_{\mathbf{s}_k} Q[C(\mathbf{S}_k)],$$
where $Q[C(\mathbf{V}_S)] = \prod_{V_a \in C(\mathbf{V}_S)} P(v_a \mid pre(v_a))$ is the C-factor of the C-component $C(\mathbf{V}_S)$; $\{\mathbf{S}_k\}_{k=1}^c$ is a C-partition of $\mathbf{V}_S$; and $Q[C(\mathbf{S}_k)] := \prod_{V_a \in C(\mathbf{S}_k)} P(v_a \mid pre(v_a))$ is the C-factor of the C-component $C(\mathbf{S}_k)$.

Proof. We prove the following proposition, which implies the theorem.

Proposition S.1 (Generalized Tian's Adjustment; complete identification criterion for $P(\mathbf{V}\mid do(\mathbf{X}))$). $P(\mathbf{V}\mid do(\mathbf{X}))$ is identifiable if no $X_a \in \mathbf{X}$ is connected to $Ch(X_a)$ by bidirected paths. If identifiable, it is given as
$$P(\mathbf{V}\mid do(\mathbf{X})) = \frac{P(\mathbf{v})}{Q[C(\mathbf{X})]} \prod_{k=1}^{c} \sum_{\mathbf{x}_k} Q[C(\mathbf{X}_k)], \tag{D.2}$$
where $\{\mathbf{X}_k\}_{k=1}^c$ is a C-partition of $\mathbf{X}$ in $G$; $Q[C(\mathbf{X})] := \prod_{V_i \in C(\mathbf{X})} P(v_i\mid pre(v_i))$ is the C-factor of the C-component of $\mathbf{X}$ ($C(\mathbf{X})$); and $Q[C(\mathbf{X}_k)] := \prod_{V_i \in C(\mathbf{X}_k)} P(v_i\mid pre(v_i))$ is the C-factor of the C-component of $\mathbf{X}_k$ ($C(\mathbf{X}_k)$).

Proof. In the proof, for a set $\mathbf{W}$, we use $De(\mathbf{W})$ to denote the set of descendants of $W_i \in \mathbf{W}$ in $G$. Suppose no $X_a \in \mathbf{X}$ is connected with $Ch(X_a)$ by bidirected paths. We first show that $P(\mathbf{V}\mid do(\mathbf{X}_1))$, for any C-component $\mathbf{X}_1 \subseteq \mathbf{X}$ in $G(\mathbf{X})$, is identifiable and given as
$$P(\mathbf{V}\mid do(\mathbf{x}_1)) = \frac{P(\mathbf{v})}{Q[C(\mathbf{X}_1)]} \sum_{\mathbf{x}_1} Q[C(\mathbf{X}_1)]. \tag{D.3}$$
By the result of (Jaber et al., 2018, Lemma 1), it suffices to show that $\mathbf{X}_1 = De(\mathbf{X}_1)_{G(C(\mathbf{X}_1))}$. We show this by contradiction. Suppose there exists $V_a \in De(\mathbf{X}_1)_{G(C(\mathbf{X}_1))}$ such that $V_a \notin \mathbf{X}_1$. Since $V_a \in G(C(\mathbf{X}_1))$, $V_a$ is connected with $\mathbf{X}_1$ by bidirected paths. Since $V_a$ is a descendant of some $X_a \in \mathbf{X}_1$ in $G(C(\mathbf{X}_1))$, some $V_b \in Ch(\mathbf{X}_1)$ is also in $G(C(\mathbf{X}_1))$. This means that $V_b$ and $X_a$ are connected by a bidirected path, contradicting the given condition. Therefore, $\mathbf{X}_1 = De(\mathbf{X}_1)_{G(C(\mathbf{X}_1))}$, and Eq. (D.3) holds.

Now, consider the following inductive hypothesis for $i = 1, 2, \dots, c$:
$$Q\big[\mathbf{V}\setminus\mathbf{X}^{(i)}\big] = \frac{Q\big[\mathbf{V}\setminus\mathbf{X}^{(i-1)}\big]}{Q[C(\mathbf{X}_i)]} \sum_{\mathbf{x}_i} Q[C(\mathbf{X}_i)]. \tag{D.4}$$
As shown above, it holds for $i = 1$. Suppose it holds for $i-1 \ge 1$ for some $i \ge 2$. We first note that $\mathbf{X}_i = De(\mathbf{X}_i)_{G(C(\mathbf{X}_i))}$ in $G(\mathbf{V}\setminus\mathbf{X}^{(i-1)})$. To witness, suppose for contradiction that for some $X_a \in \mathbf{X}_i$ there exists $V_a \in De(\mathbf{X}_i)_{G(C(\mathbf{X}_i))}$ in $G(\mathbf{V}\setminus\mathbf{X}^{(i-1)})$ such that $V_a \notin \mathbf{X}_i$. First, $V_a$ is connected with $X_a$ by bidirected paths since $V_a$ lies in $G(C(\mathbf{X}_i))$ restricted to $G(\mathbf{V}\setminus\mathbf{X}^{(i-1)})$. Also, since $V_a$ is a descendant of $X_a$, a child of $X_a$ is also in $G(C(\mathbf{X}_i))$, connected by bidirected paths. This is a contradiction. Therefore, $\mathbf{X}_i = De(\mathbf{X}_i)_{G(C(\mathbf{X}_i))}$ in $G(\mathbf{V}\setminus\mathbf{X}^{(i-1)})$.

Now, we show that $C(\mathbf{X}_i)_{G(\mathbf{V}\setminus\mathbf{X}^{(i-1)})} = C(\mathbf{X}_i)_G$. We start from the obvious observation $C(\mathbf{X}_i)_{G(\mathbf{V}\setminus\mathbf{X}^{(i-1)})} \subseteq C(\mathbf{X}_i)_G$. We now prove $C(\mathbf{X}_i)_G \subseteq C(\mathbf{X}_i)_{G(\mathbf{V}\setminus\mathbf{X}^{(i-1)})}$. For some $V_a \in C(\mathbf{X}_i)_G$, suppose $V_a \notin C(\mathbf{X}_i)_{G(\mathbf{V}\setminus\mathbf{X}^{(i-1)})}$. This means that the bidirected paths connecting $V_a$ to some node $X_1 \in \mathbf{X}_i$ must pass through other nodes $X_2 \in \mathbf{X}^{(i-1)}$, so that $V_a$, $X_2$, and $X_1$ are connected by bidirected paths. However, given that $X_2 \in \mathbf{X}^{(i-1)}$ and $X_1 \in \mathbf{X}_i$, this is a contradiction, because they belong to distinct C-components of the C-partition. Therefore, $C(\mathbf{X}_i)_{G(\mathbf{V}\setminus\mathbf{X}^{(i-1)})} = C(\mathbf{X}_i)_G$, and Eq. (D.4) holds.

By unfolding Eq. (D.4),
$$Q\big[\mathbf{V}\setminus\mathbf{X}^{(i)}\big] = P(\mathbf{v}) \prod_{k=1}^{i} \frac{\sum_{\mathbf{x}_k} Q[C(\mathbf{X}_k)]}{Q[C(\mathbf{X}_k)]}.$$
We note $Q\big[C(\mathbf{X}^{(i)})\big] = \prod_{k=1}^{i} Q[C(\mathbf{X}_k)]$, since
$$Q\big[C(\mathbf{X}^{(i)})\big] = \prod_{V_i \in C(\mathbf{X}^{(i)})} P(v_i\mid pre(v_i)) = \prod_{k=1}^{i} \prod_{V_i \in C(\mathbf{X}_k)} P(v_i\mid pre(v_i)) = \prod_{k=1}^{i} Q[C(\mathbf{X}_k)].$$
This completes the proof.

Now, back to witnessing Thm. D.2. Under the given condition that $Y$ is not connected via bidirected paths to any node, the following holds for any $S \subseteq [n]$: $(Y \perp\!\!\!\perp \mathbf{V}_S \mid \mathbf{V}\setminus\mathbf{V}_S)$ in $G_{\underline{\mathbf{V}_S}}$. Therefore,
$$P(y, \mathbf{v}\setminus\mathbf{v}_S \mid do(\mathbf{v}_S)) = P(y \mid do(\mathbf{v}_S), \mathbf{v}\setminus\mathbf{v}_S)\, Q[\mathbf{V}\setminus\mathbf{V}_S] = P(y \mid \mathbf{v})\, Q[\mathbf{V}\setminus\mathbf{V}_S],$$
which implies that
$$E[Y \mid do(\mathbf{v}_S)] = \sum_{\mathbf{v}\setminus\mathbf{v}_S} E[Y \mid \mathbf{v}]\, \frac{P(\mathbf{v})}{Q[C(\mathbf{V}_S)]} \prod_{k=1}^{c} \sum_{\mathbf{s}_k} Q[C(\mathbf{S}_k)].$$
This completes the proof.

Corollary D.2 (Restated Corollary 1). In the Markovian case, $E[Y\mid do(\mathbf{v}_S)]$ is given as
$$E[Y\mid do(\mathbf{v}_S)] = \sum_{\mathbf{v}\setminus\mathbf{v}_S} E[Y \mid \mathbf{v}_S, \mathbf{v}\setminus\mathbf{v}_S] \prod_{i \notin S} P(v_i\mid pre(v_i)).$$
Proof. In the Markovian case, $C(\mathbf{W}) = \mathbf{W}$ for all $\mathbf{W} \subseteq \mathbf{V}$. Then,
$$P(y, \mathbf{v}\setminus\mathbf{v}_S \mid do(\mathbf{v}_S)) = P(\mathbf{v}, y)\, \frac{\prod_k \sum_{\mathbf{s}_k} Q[C(\mathbf{S}_k)]}{Q[C(\mathbf{V}_S)]} = \frac{P(\mathbf{v}, y)}{Q[\mathbf{V}_S]} = P(y\mid\mathbf{v}) \prod_{V_i \in \mathbf{V}\setminus\mathbf{V}_S} P(v_i\mid pre(v_i)).$$
This completes the proof.

Corollary D.3 (Restated Corollary 2). In the Direct-cause case, $E[Y\mid do(\mathbf{v}_S)]$ is given as
$$E[Y\mid do(\mathbf{v}_S)] = \sum_{\mathbf{v}\setminus\mathbf{v}_S} E[Y \mid \mathbf{v}_S, \mathbf{v}\setminus\mathbf{v}_S]\, P(\mathbf{v}\setminus\mathbf{v}_S).$$
Proof. In the Direct-cause case, $Q[\mathbf{W}] = P(\mathbf{w})$ for all $\mathbf{W} \subseteq \mathbf{V}$ since there are no causal paths between any pair of variables in $\mathbf{V}$. Therefore, $Q[\mathbf{V}\setminus\mathbf{V}_S] = P(\mathbf{v}\setminus\mathbf{v}_S)$, which completes the proof.
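To illustrate Corollary 2, the following Python sketch computes a plug-in (regression-based) estimate of $E[Y\mid do(\mathbf{v}_S)]$ in the Direct-cause case: fit $E[Y\mid\mathbf{v}]$, fix the intervened coordinates to $\mathbf{v}_S$, and average over the empirical distribution of the remaining features. The function name, the use of scikit-learn's GradientBoostingRegressor, and the pandas interface are our own illustrative choices, not the paper's implementation.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def direct_cause_reg_estimate(df: pd.DataFrame, target: str, do: dict) -> float:
    """Plug-in estimate of E[Y | do(v_S)] via Corollary 2:
    sum over v\\v_S of E[Y | v_S, v\\v_S] * P(v\\v_S)."""
    features = [c for c in df.columns if c != target]
    model = GradientBoostingRegressor().fit(df[features], df[target])
    X = df[features].copy()
    for col, val in do.items():            # fix V_S = v_S for every sample
        X[col] = val
    return float(model.predict(X).mean())  # average over empirical P(v\v_S)

# e.g., direct_cause_reg_estimate(df, target="Y", do={"V1": 1})
```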
D.3. Proofs from Section 5

Lemma D.1 (Restated Lemma 1). Let $S = \{m_1, \dots, m_s\} \subseteq [n]$ denote an index set for $\mathbf{V}_S$. Let
$$\omega^S_k := \prod_{r=1}^{k} \mathbb{1}_{v_{m_r}}(V_{m_r})/h^S_r \text{ for } k = s, \dots, 1; \qquad \omega^S := \mathbb{1}_{\mathbf{v}_S}(\mathbf{V}_S)/h^S,$$
where $h^S_r := P(v_{m_r}\mid pre(V_{m_r}))$ and $h^S := P(\mathbf{v}_S \mid \mathbf{V}\setminus\mathbf{V}_S)$. Then, $E[Y\mid do(\mathbf{v}_S)] = E[Y\omega]$, where $\omega = \omega^S_s$ for the Markovian case and $\omega = \omega^S$ for the Direct-cause case.

Proof. For the Markovian case,
$$E[Y\mid do(\mathbf{v}_S)] = \sum_{\mathbf{v}\setminus\mathbf{v}_S} E[Y\mid\mathbf{v}]\, \frac{P(\mathbf{v})}{\prod_{r=1}^{s} P(v_{m_r}\mid pre(v_{m_r}))} = E\bigg[Y \prod_{r=1}^{s} \frac{\mathbb{1}_{v_{m_r}}(V_{m_r})}{P(v_{m_r}\mid pre(V_{m_r}))}\bigg] = E[Y\omega^S_s].$$
For the Direct-cause case,
$$E[Y\mid do(\mathbf{v}_S)] = \sum_{\mathbf{v}\setminus\mathbf{v}_S} E[Y\mid\mathbf{v}]\, \frac{P(\mathbf{v})}{P(\mathbf{v}_S\mid\mathbf{v}\setminus\mathbf{v}_S)} = E\bigg[Y\,\frac{\mathbb{1}_{\mathbf{v}_S}(\mathbf{V}_S)}{P(\mathbf{v}_S\mid\mathbf{V}\setminus\mathbf{V}_S)}\bigg] = E[Y\omega^S].$$

Lemma D.2 (Restated Lemma 2). Let $S := \{m_1, \dots, m_s\} \subseteq [n]$ denote an index set for $\mathbf{V}_S$. Let $\theta^S_{s,1} := Y$. For $k = s, s-1, \dots, 1$,
$$\theta^S_{k,2} := E\big[\theta^S_{k,1}\mid V_{m_k}, pre(V_{m_k})\big], \qquad \theta^S_{k-1,1} := E\big[\theta^S_{k,1}\mid v_{m_k}, pre(V_{m_k})\big],$$
and let $\theta^S_a := E[Y\mid\mathbf{v}_S, \mathbf{V}\setminus\mathbf{V}_S]$ and $\theta^S_b := E[Y\mid\mathbf{V}_S, \mathbf{V}\setminus\mathbf{V}_S]$. Then, $E[Y\mid do(\mathbf{v}_S)] = E[\theta]$, where $\theta = \theta^S_{0,1}$ for the Markovian case and $\theta = \theta^S_a$ for the Direct-cause case.

Proof. For the Markovian case, we will prove the following lemma, which implies the result.

Lemma S.3. Suppose $\mathbf{V}' = \{Y\}\cup\mathbf{V}$, where $\mathbf{V}'$ is an ordered set and $Y$ is the last variable in the given order. Let $\mathbf{V}_S := \{V_{m_1}, \dots, V_{m_s}\} \subseteq \mathbf{V}$ (where $\{m_1,\dots,m_s\} \subseteq [n]$) denote a set of discrete variables. For each $k = 2, \dots, s$, let $\mathbf{V}_{\ell_k} := \{V_j \in \mathbf{V}\setminus\mathbf{V}_S : V_{m_{k-1}} < V_j < V_{m_k}\}$; let $\mathbf{V}_{\ell_1} := \{V_j \in \mathbf{V}\setminus\mathbf{V}_S : V_j < V_{m_1}\}$ and $\mathbf{V}_{\ell_{s+1}} := \{V_j \in \mathbf{V}\setminus\mathbf{V}_S : V_{m_s} < V_j\}$. Let $g_S(P)$ denote the following functional (a.k.a. the g-formula (Robins, 1986)):
$$g_S(P) := \int_{\mathbf{v}\setminus\mathbf{v}_S} E[Y\mid\mathbf{v}] \prod_{i\notin S} P(v_i\mid pre(v_i))\, d[\mathbf{v}\setminus\mathbf{v}_S].$$
Let $\theta_{s,1} := Y$, and for $k = s, \dots, 1$, let $\theta_{k,2} := E[\theta_{k,1}\mid V_{m_k}, pre(V_{m_k})]$ and $\theta_{k-1,1} := E[\theta_{k,1}\mid v_{m_k}, pre(V_{m_k})]$. Then, $g_S(P) = E[\theta_{0,1}]$.

Proof. Define
$$A_k := pre(V_{m_k}), \qquad B_k := \{\mathbf{V}_{\ell_{k+1}}, \mathbf{V}_{\ell_{k+2}}, \dots, \mathbf{V}_{\ell_{s+1}}\}, \qquad C_k := \{V_{m_{k+1}}, V_{m_{k+2}}, \dots, V_{m_s}\},$$
and let $q(\mathbf{w}) := \prod_{V_i\in\mathbf{W}} P(v_i\mid pre(v_i))$ if $\mathbf{W} \neq \emptyset$, and $q(\mathbf{w}) := 1$ if $\mathbf{W} = \emptyset$. Then, it suffices to show that, for each $k$,
$$\theta_{k,2} = \int_{b_k, c_k} E[Y\mid V_{m_k}, A_k, b_k, c_k]\, q(b_k)\,\mathbb{1}_{c_k}(C_k)\, d[b_k, c_k], \qquad \theta_{k-1,1} = \int_{b_k, c_{k-1}} E[Y\mid A_k, b_k, c_{k-1}]\, q(b_k)\,\mathbb{1}_{c_{k-1}}(C_{k-1})\, d[b_k, c_{k-1}],$$
because then witnessing $E[\theta_{0,1}] = g_S(P)$ becomes trivial. Since $\theta_{s,1} := Y$, it is easy to check that the displays hold for $\theta_{s,2}$ and $\theta_{s-1,1}$. Suppose they hold for $k, k+1, \dots, s$, and consider $k-1$. By definition, $\theta_{k-1,2} := E[\theta_{k-1,1}\mid V_{m_{k-1}}, pre(V_{m_{k-1}})]$ and $\theta_{k-2,1} := E[\theta_{k-1,1}\mid v_{m_{k-1}}, pre(V_{m_{k-1}})]$. Then,
$$\theta_{k-1,2} = E\bigg[\int_{b_k, c_{k-1}} E[Y\mid A_k, b_k, c_{k-1}]\, q(b_k)\mathbb{1}_{c_{k-1}}(C_{k-1})\, d[b_k, c_{k-1}] \,\bigg|\, V_{m_{k-1}}, A_{k-1}\bigg] = \int_{b_{k-1}, c_{k-1}} E[Y\mid V_{m_{k-1}}, A_{k-1}, b_{k-1}, c_{k-1}]\, q(b_{k-1})\mathbb{1}_{c_{k-1}}(C_{k-1})\, d[b_{k-1}, c_{k-1}],$$
and
$$\theta_{k-2,1} = E\bigg[\int_{b_k, c_{k-1}} E[Y\mid A_k, b_k, c_{k-1}]\, q(b_k)\mathbb{1}_{c_{k-1}}(C_{k-1})\, d[b_k, c_{k-1}] \,\bigg|\, v_{m_{k-1}}, A_{k-1}\bigg] = \int_{b_{k-1}, c_{k-2}} E[Y\mid A_{k-1}, b_{k-1}, c_{k-2}]\, q(b_{k-1})\mathbb{1}_{c_{k-2}}(C_{k-2})\, d[b_{k-1}, c_{k-2}].$$
Iterating down to $k = 1$ yields
$$\theta_{0,1} = \int_{b_1, c_0} E[Y\mid A_1, b_1, c_0]\, q(b_1)\mathbb{1}_{c_0}(C_0)\, d[b_1, c_0],$$
which gives the equality $E[\theta_{0,1}] = g_S(P)$.

For the Direct-cause case, $\sum_{\mathbf{v}\setminus\mathbf{v}_S} E[Y\mid\mathbf{v}]\, P(\mathbf{v}\setminus\mathbf{v}_S) = E[Y\mid do(\mathbf{v}_S)]$, which completes the proof.

Lemma D.3 (Restated Lemma 3). Let
$$\eta^S := \begin{cases} \{\theta^S_{0,1}\} \cup \{\theta^S_{k,1}, \theta^S_{k,2}\}_{k=1}^{s} \cup \{h^S_r\}_{r=1}^{s} & \text{(Markovian)} \\ \{\theta^S_a, \theta^S_b, h^S\} & \text{(Direct-cause)}, \end{cases}$$
as defined in Defs. (4, 5) above, and
$$\mathcal{V}^S(V;\eta^S) := \begin{cases} \theta^S_{0,1} + \sum_{k=1}^{s} \omega^S_k\big(\theta^S_{k,1} - \theta^S_{k,2}\big) & \text{(Markovian)} \\ \theta^S_a + \omega^S\big(Y - \theta^S_b\big) & \text{(Direct-cause)}, \end{cases}$$
where $\omega^S_k := \prod_{r=1}^{k} \mathbb{1}_{v_{m_r}}(V_{m_r})/h^S_r$ and $\omega^S := \mathbb{1}_{\mathbf{v}_S}(\mathbf{V}_S)/h^S$. Then, $E[Y\mid do(\mathbf{v}_S)] = E\big[\mathcal{V}^S(V;\eta^S)\big]$.

Proof. For the Markovian case, it suffices to show that $E\big[\omega^S_k(\theta^S_{k,1} - \theta^S_{k,2})\big] = 0$ for every $k = 1, 2, \dots, s$. This holds since $\omega^S_k$ is a function of $(V_{m_k}, pre(V_{m_k}))$ and
$$E\big[\omega^S_k(\theta^S_{k,1} - \theta^S_{k,2})\big] = E\big[\omega^S_k\, E[\theta^S_{k,1} - \theta^S_{k,2}\mid V_{m_k}, pre(V_{m_k})]\big] = E\big[\omega^S_k(\theta^S_{k,2} - \theta^S_{k,2})\big] = 0.$$
Therefore, $E[\mathcal{V}^S(V;\eta^S)] = E[\theta^S_{0,1}] = E[Y\mid do(\mathbf{v}_S)]$, where the second equality holds by Lemma 2. For the Direct-cause case, $E[\omega^S(Y - \theta^S_b)] = 0$ by the definition of $\theta^S_b$. Therefore, $E[\mathcal{V}^S(V;\eta^S)] = E[\theta^S_a] = E[Y\mid do(\mathbf{v}_S)]$.

Theorem D.3 (Restated Theorem 3). Let $\{\pi_j\}_{j=1}^{M}$ denote $M$ randomly generated permutations of $[n]$. For a fixed index $i$, let $S_{j,0} := pre_{\pi_j}(i)$ and $S_{j,1} := \{i\}\cup S_{j,0}$. Let $\{\hat\eta^{S_{j,0}}, \hat\eta^{S_{j,1}}\}_{j=1}^{M}$ denote $L_2$-consistent estimates for all nuisances $\{\eta^{S_{j,0}}, \eta^{S_{j,1}}\}_{j=1}^{M}$ defined in Def. 6. Let $R_{M,N} := O_P(M^{-1/2} + N^{-1/2})$, and let $e(\hat g) := \hat g - g$ denote the error of a nuisance estimate for any $\hat g \in \hat\eta$ and $g \in \eta$. For the do-Shapley estimators defined in Def. 7, suppose the estimators $T^{est}(S)$ are bounded, and let $\epsilon^{est}_{v_i} := \hat\phi^{est}_{v_i} - \phi_{v_i}$ for $est \in \{ipw, reg, dml\}$. Under the Markovian case,
$$\epsilon^{ipw}_{v_i} = R_{M,N} + O_P\Big\{\textstyle\sum_{j=1}^{M}\sum_{p\in\{0,1\}} \big\|e\big(\hat\omega^{S_{j,p}}_{s_{j,p}}\big)\big\|\Big\}, \quad \epsilon^{reg}_{v_i} = R_{M,N} + O_P\Big\{\textstyle\sum_{j=1}^{M}\sum_{p\in\{0,1\}} \big\|e\big(\hat\theta^{S_{j,p}}_{0,1}\big)\big\|\Big\}, \quad \epsilon^{dml}_{v_i} = R_{M,N} + O_P\Big\{\textstyle\sum_{j=1}^{M}\sum_{p\in\{0,1\}}\sum_{k=1}^{s_{j,p}} \big\|e\big(\hat h^{S_{j,p}}_{k}\big)\big\|\,\big\|e\big(\hat\theta^{S_{j,p}}_{k,2}\big)\big\|\Big\}.$$
Under the Direct-cause case,
$$\epsilon^{ipw}_{v_i} = R_{M,N} + O_P\Big\{\textstyle\sum_{j=1}^{M}\sum_{p\in\{0,1\}} \big\|e\big(\hat\omega^{S_{j,p}}\big)\big\|\Big\}, \quad \epsilon^{reg}_{v_i} = R_{M,N} + O_P\Big\{\textstyle\sum_{j=1}^{M}\sum_{p\in\{0,1\}} \big\|e\big(\hat\theta^{S_{j,p}}_{a}\big)\big\|\Big\}, \quad \epsilon^{dml}_{v_i} = R_{M,N} + O_P\Big\{\textstyle\sum_{j=1}^{M}\sum_{p\in\{0,1\}} \big\|e\big(\hat h^{S_{j,p}}\big)\big\|\,\big\|e\big(\hat\theta^{S_{j,p}}_{b}\big)\big\|\Big\}.$$

Proof. In the proof, we will use the notation $E_{D-P}[f(V)] := E_D[f(V)] - E[f(V)]$, and $N := |D|$. Also, for any quantities $A, B$, we write $A \lesssim B$ if there is a constant $c$ such that $A \le cB$. We first introduce a useful tool for analyzing the errors of the proposed estimators.

Lemma S.4. Let $\eta_0$ denote some nuisance and $\hat\eta$ its $L_2$-consistent estimate. Let $f(V;\eta)$ denote an arbitrary function having a bounded second moment for any fixed $\eta$. Suppose the samples used for constructing $\hat\eta$ and for evaluating $f(V;\hat\eta)$ are independent. Then,
$$E_D[f(V;\hat\eta)] - E[f(V;\eta_0)] = O_P(N^{-1/2}) + E[f(V;\hat\eta) - f(V;\eta_0)].$$
Proof. We first note that
$$E_D[f(V;\hat\eta)] - E[f(V;\eta_0)] = E_{D-P}[f(V;\eta_0)] + E_{D-P}[f(V;\hat\eta) - f(V;\eta_0)] + E[f(V;\hat\eta) - f(V;\eta_0)].$$
First, $E_{D-P}[f(V;\eta_0)] = O_P(N^{-1/2})$ by the classical central limit theorem. Second, $E_{D-P}[f(V;\hat\eta) - f(V;\eta_0)] = O_P(N^{-1/2})$ under the given conditions by (Kennedy et al., 2020, Lemma 2).

Now, we introduce an equivalent representation of the do-Shapley.

Proposition S.2 ((Štrumbelj & Kononenko, 2014, Eq. (10))). An equivalent representation of the do-Shapley in Eq. (2) is
$$\phi_{v_i} = \frac{1}{n!}\sum_{\pi\in\Pi([n])} \Big( E\big[Y\mid do(\mathbf{v}_{pre_\pi(i)\cup\{i\}})\big] - E\big[Y\mid do(\mathbf{v}_{pre_\pi(i)})\big] \Big),$$
where $\Pi([n])$ is the set of all permutations of $[n]$, $\pi$ is an individual permutation in $\Pi([n])$, and $pre_\pi(i) := \{k \in [n] : k \text{ precedes } i \text{ in } \pi\}$.

This representation motivates the following Monte-Carlo-based approximation:
$$\hat\phi_{v_i} := \frac{1}{M}\sum_{j=1}^{M} \Big\{ E\big[Y\mid do(\mathbf{v}_{pre_{\pi_j}(i)\cup\{i\}})\big] - E\big[Y\mid do(\mathbf{v}_{pre_{\pi_j}(i)})\big] \Big\}, \tag{D.5}$$
where $M$ is the number of randomly generated permutations of $[n]$ and $\pi_j$ denotes the $j$th permutation. Convergence of $\hat\phi_{v_i}$ is guaranteed by the following result:

Lemma S.5.
$$\hat\phi_{v_i} - \phi_{v_i} = O_P(M^{-1/2}). \tag{D.6}$$
Proof. Let $Z(\sigma) := E[Y\mid do(\mathbf{v}_{pre_\sigma(i)\cup\{i\}})] - E[Y\mid do(\mathbf{v}_{pre_\sigma(i)})]$ denote a random variable where the randomness is over the permutation $\sigma$, with $P(\sigma) = 1/n!$. Then, $E_P[Z(\sigma)] = \phi_{v_i}$. By the given assumption, $Z(\sigma)$ and $\frac{1}{M}\sum_{k=1}^{M} Z(\sigma^{(k)})$ are bounded random variables; let $B$ denote such a bound. Then, by (Lattimore & Szepesvári, 2020, Corollary 5.5),
$$\phi_{v_i} - \sqrt{\frac{2B^2\log(1/\delta)}{M}} \le \hat\phi_{v_i} \le \phi_{v_i} + \sqrt{\frac{2B^2\log(1/\delta)}{M}}$$
with probability at least $1-\delta$, which implies that $\hat\phi_{v_i}$ converges at the $\sqrt{M}$ rate. This completes the proof.
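The Monte-Carlo approximation in Eq. (D.5) is straightforward to implement once an (estimated) oracle for $E[Y\mid do(\mathbf{v}_S)]$ is available. A minimal sketch, assuming a user-supplied function `do_effect(S)` that returns an estimate of $E[Y\mid do(\mathbf{v}_S)]$ for an index set $S$ (the name and interface are ours). Each sampled permutation is reused for all features $i$, which is the standard trick and still matches Eq. (D.5) term by term:

```python
import random

def do_shapley_mc(n, do_effect, M=100, seed=0):
    """Monte-Carlo do-Shapley, Eq. (D.5): average the incremental effect of
    each feature i over M random permutations of [n]."""
    rng = random.Random(seed)
    phi = [0.0] * n
    for _ in range(M):
        pi = list(range(n))
        rng.shuffle(pi)              # a random permutation pi_j
        pre = set()                  # pre_{pi_j}(i), built left to right
        for i in pi:
            phi[i] += do_effect(pre | {i}) - do_effect(pre)
            pre.add(i)
    return [p / M for p in phi]
```

By Lemma S.5, the permutation-sampling error decays as $O_P(M^{-1/2})$; in practice one may cache `do_effect` values since many coalitions repeat across permutations.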
Let $S_{j,0} := pre_{\pi_j}(i)$ and $S_{j,1} := \{i\}\cup pre_{\pi_j}(i)$. By Def. 7 and Eqs. (D.5, D.6),
$$\epsilon^{est}_{v_i} = \hat\phi^{est}_{v_i} - \phi_{v_i} = \big(\hat\phi^{est}_{v_i} - \hat\phi_{v_i}\big) + \big(\hat\phi_{v_i} - \phi_{v_i}\big) \tag{D.7}$$
$$\lesssim \frac{1}{M}\sum_{j=1}^{M}\Big\{ \big| T^{est}(S_{j,1}) - E[Y\mid do(\mathbf{v}_{S_{j,1}})] \big| + \big| T^{est}(S_{j,0}) - E[Y\mid do(\mathbf{v}_{S_{j,0}})] \big| \Big\} + O_P(M^{-1/2}). \tag{D.8}$$
Now, we analyze each of the IPW, REG, and DML estimators in Defs. (4, 5, 6).

Lemma S.6 (Error analysis for IPW). For any nonempty $S \subseteq [n]$,
$$T^{ipw}(S) - E[Y\mid do(\mathbf{v}_S)] = \begin{cases} O_P(N^{-1/2}) + O_P\big(\|\hat\omega^S_s - \omega^S_s\|\big) & \text{(Markovian)} \\ O_P(N^{-1/2}) + O_P\big(\|\hat\omega^S - \omega^S\|\big) & \text{(Direct-cause)}. \end{cases} \tag{D.9}$$
Proof. We prove only the Markovian case, since exactly the same proof applies to the Direct-cause case. First, $E[Y\mid do(\mathbf{v}_S)] = E[Y\omega^S_s]$. From Lemma S.4, it suffices to show that $E[Y\hat\omega^S_s - Y\omega^S_s] = O_P(\|\hat\omega^S_s - \omega^S_s\|)$. This can be shown by
$$E\big[Y\hat\omega^S_s - Y\omega^S_s\big] \le \|Y\|\,\|\hat\omega^S_s - \omega^S_s\| \lesssim \|\hat\omega^S_s - \omega^S_s\|,$$
where the first inequality is by the Cauchy-Schwarz inequality and the second by the boundedness of $Y$.

Lemma S.7 (Error analysis for REG). For any nonempty $S \subseteq [n]$,
$$T^{reg}(S) - E[Y\mid do(\mathbf{v}_S)] = \begin{cases} O_P(N^{-1/2}) + O_P\big(\|\hat\theta^S_{0,1} - \theta^S_{0,1}\|\big) & \text{(Markovian)} \\ O_P(N^{-1/2}) + O_P\big(\|\hat\theta^S_a - \theta^S_a\|\big) & \text{(Direct-cause)}. \end{cases}$$
Proof. We prove only the Markovian case, since exactly the same proof applies to the Direct-cause case. We note that $E[\theta^S_{0,1}] = E[Y\mid do(\mathbf{v}_S)]$ by Lemma 2. From Lemma S.4, it suffices to show that $E[\hat\theta^S_{0,1} - \theta^S_{0,1}] = O_P(\|\hat\theta^S_{0,1} - \theta^S_{0,1}\|)$, which holds by the Cauchy-Schwarz inequality.

Lemma S.8 (Error analysis for DML). For any nonempty $S \subseteq [n]$,
$$T^{dml}(S) - E[Y\mid do(\mathbf{v}_S)] = \begin{cases} O_P(N^{-1/2}) + \sum_{j=1}^{s} O_P\big(\|\hat\theta^S_{j,2} - \theta^S_{j,2}\|\,\|\hat h^S_j - h^S_j\|\big) & \text{(Markovian)} \\ O_P(N^{-1/2}) + O_P\big(\|\hat\theta^S_b - \theta^S_b\|\,\|\hat h^S - h^S\|\big) & \text{(Direct-cause)}. \end{cases}$$
Proof. We note that $E[\mathcal{V}^S(V;\eta^S)] = E[Y\mid do(\mathbf{v}_S)]$ by Lemma 3. From Lemma S.4, it suffices to show that
$$E\big[\mathcal{V}^S(V;\hat\eta^S) - \mathcal{V}^S(V;\eta^S)\big] = \begin{cases} \sum_{j=1}^{s} O_P\big(\|\hat\theta^S_{j,2} - \theta^S_{j,2}\|\,\|\hat h^S_j - h^S_j\|\big) & \text{(Markovian)} \\ O_P\big(\|\hat\theta^S_b - \theta^S_b\|\,\|\hat h^S - h^S\|\big) & \text{(Direct-cause)}. \end{cases}$$
First, consider the Markovian case; we omit the superscript $S$. For $j = 1, 2, \dots, s$, consider the quantity
$$Q_j := \theta_{j-1,1} + \sum_{k=j}^{s} \omega_{j:k}\big(\theta_{k,1} - \theta_{k,2}\big), \qquad \omega_{j:k} := \prod_{r=j}^{k} \frac{\mathbb{1}_{v_{m_r}}(V_{m_r})}{h^S_r},$$
and let $Q_{s+1} := Y$. We note that $Q_1 = \mathcal{V}(V;\eta^S)$ and $E[Q_1] = E[Y\mid do(\mathbf{v}_S)]$. Also, the following holds by the definitions of $\theta_{k-1,1}$ and $\theta_{k,2}$:
$$E[\theta_{k-1,1}] = E\bigg[\frac{\mathbb{1}_{v_{m_k}}(V_{m_k})}{h^S_k}\,\theta_{k,2}\bigg], \qquad E[\hat\theta_{k-1,1}] = E\bigg[\frac{\mathbb{1}_{v_{m_k}}(V_{m_k})}{h^S_k}\,\hat\theta_{k,2}\bigg].$$
Next, we note that $Q_j$ can be written recursively as $Q_j = \theta_{j-1,1} + \omega_{j:j}(Q_{j+1} - \theta_{j,2})$ for $j = 1, 2, \dots, s$. To witness, consider
$$Q_{j+1} = \theta_{j,1} + \omega_{j+1:j+1}(\theta_{j+1,1} - \theta_{j+1,2}) + \omega_{j+1:j+2}(\theta_{j+2,1} - \theta_{j+2,2}) + \cdots,$$
so that
$$\omega_{j:j}\,Q_{j+1} = \omega_{j:j}\theta_{j,1} + \omega_{j:j+1}(\theta_{j+1,1} - \theta_{j+1,2}) + \omega_{j:j+2}(\theta_{j+2,1} - \theta_{j+2,2}) + \cdots,$$
and therefore
$$Q_j = \theta_{j-1,1} + \omega_{j:j}(\theta_{j,1} - \theta_{j,2}) + \big(\omega_{j:j}Q_{j+1} - \omega_{j:j}\theta_{j,1}\big) = \theta_{j-1,1} + \omega_{j:j}\big(Q_{j+1} - \theta_{j,2}\big).$$
Finally, we will witness that
$$E[\hat Q_j - Q_j] = E[\hat Q_j - \theta_{j-1,1}] = \sum_{k=j}^{s} O_P\big(\|\theta_{k,2} - \hat\theta_{k,2}\|\,\|\hat h_k - h_k\|\big).$$
We will prove this by induction. First, at $j = s$,
$$E[\hat Q_s - Q_s] = E[\hat Q_s - \theta_{s-1,1}] = E\big[\hat\theta_{s-1,1} + \hat\omega_{s:s}(Y - \hat\theta_{s,2}) - \theta_{s-1,1}\big] = E\bigg[\hat\theta_{s-1,1} + \frac{\mathbb{1}_{v_{m_s}}(V_{m_s})}{\hat h_s}\big(Y - \hat\theta_{s,2}\big) - \theta_{s-1,1}\bigg]$$
$$= E\bigg[\frac{\mathbb{1}_{v_{m_s}}(V_{m_s})}{h_s}\big(\hat\theta_{s,2} - \theta_{s,2}\big) + \frac{\mathbb{1}_{v_{m_s}}(V_{m_s})}{\hat h_s}\big(\theta_{s,2} - \hat\theta_{s,2}\big)\bigg] = O_P\big(\|\theta_{s,2} - \hat\theta_{s,2}\|\,\|\hat h_s - h_s\|\big).$$
For any $j = s-1, \dots, 1$,
$$E[\hat Q_j - Q_j] = E[\hat Q_j - \theta_{j-1,1}] = E\big[\hat\theta_{j-1,1} - \theta_{j-1,1} + \hat\omega_{j:j}\big(\hat Q_{j+1} - \hat\theta_{j,2}\big)\big] = E\big[\hat\theta_{j-1,1} - \theta_{j-1,1} + \hat\omega_{j:j}\big(\hat Q_{j+1} - \theta_{j,1}\big) + \hat\omega_{j:j}\big(\theta_{j,1} - \hat\theta_{j,2}\big)\big]$$
$$= E\big[\hat\omega_{j:j}\big(\hat Q_{j+1} - \theta_{j,1}\big)\big] + E\bigg[\frac{\mathbb{1}_{v_{m_j}}(V_{m_j})}{h_j}\big(\hat\theta_{j,2} - \theta_{j,2}\big) + \hat\omega_{j:j}\big(\theta_{j,1} - \hat\theta_{j,2}\big)\bigg] = E\big[\hat\omega_{j:j}\big(\hat Q_{j+1} - \theta_{j,1}\big)\big] + E\bigg[\frac{\mathbb{1}_{v_{m_j}}(V_{m_j})}{h_j\,\hat h_j}\,\big\{\theta_{j,2} - \hat\theta_{j,2}\big\}\big\{\hat h_j - h_j\big\}\bigg]$$
$$\lesssim E\big[\hat Q_{j+1} - \theta_{j,1}\big] + E\big[\{\theta_{j,2} - \hat\theta_{j,2}\}\{\hat h_j - h_j\}\big] \lesssim E\big[\hat Q_{j+1} - \theta_{j,1}\big] + \|\theta_{j,2} - \hat\theta_{j,2}\|\,\|\hat h_j - h_j\|.$$
If we assume $E[\hat Q_r - \theta_{r-1,1}] = \sum_{k=r}^{s} O_P(\|\theta_{k,2} - \hat\theta_{k,2}\|\,\|\hat h_k - h_k\|)$ for $r = j+1, \dots, s$, then it is easy to witness that it holds for $r = j$, too. Therefore, by induction, the equality holds for all $r = s, s-1, \dots, 1$. This completes the proof for the Markovian case.

For the Direct-cause case,
$$E\big[\mathcal{V}(V;\hat\eta^S) - \mathcal{V}(V;\eta^S)\big] = E\bigg[\hat\theta_a - \theta_a + \frac{\mathbb{1}_{\mathbf{v}_S}(\mathbf{V}_S)}{\hat h^S}\big(Y - \hat\theta_b\big)\bigg] = E\bigg[\frac{\mathbb{1}_{\mathbf{v}_S}(\mathbf{V}_S)}{h^S}\big(\hat\theta_b - \theta_b\big) + \frac{\mathbb{1}_{\mathbf{v}_S}(\mathbf{V}_S)}{\hat h^S}\big(\theta_b - \hat\theta_b\big)\bigg]$$
$$= E\bigg[\frac{\mathbb{1}_{\mathbf{v}_S}(\mathbf{V}_S)}{h^S\,\hat h^S}\big(\theta_b - \hat\theta_b\big)\big(h^S - \hat h^S\big)\bigg] \lesssim \|\theta_b - \hat\theta_b\|\,\|h^S - \hat h^S\|,$$
where the last step follows from the Cauchy-Schwarz inequality.

By combining Lemmas (S.4, S.5, S.6, S.7, S.8), we complete the proof of Theorem D.3.
E. Additional Experimental Details From Section 6

E.1. Data Generating Processes

Here, we present the structural causal models for the data generating processes used in Section 6. We first note that $U \sim \text{Bernoulli}(0.4)$, $U_{V_1} \sim \text{Bernoulli}(0.8)$, $U_{V_3} \sim \text{Bernoulli}(0.4)$, $U_{V_2} \sim \text{Bernoulli}(0.3)$, and $U_Y \sim \text{Normal}(0, 1)$.

The SCM that induced the graph in Fig. 2a is
$$V_2 = (V_1 \oplus V_3) \oplus U_{V_2}, \qquad Y = 3V_1 + 0.4\,V_2 + V_3 + U_Y.$$

Figure F.4: (a) Causal graph for Example 1, taken from Lundberg (2021). (b) The same graph with variables {S, P, M, L} hidden. These graphs are used for Appendix F.

The SCM that induced the graph in Fig. 2b is
$$V_1 = U_{V_1}, \quad V_3 = U_{V_3}, \quad V_2 = (V_1 \oplus V_3) \oplus U_{V_2}, \quad Y = 3V_1 + 0.4\,V_2 + V_3 + U_Y.$$

The SCM that induced the graph in Fig. 2c is
$$V_2 = U_{V_2}, \qquad Y = 3V_1 + 0.4\,V_2 + V_3 + U_Y.$$
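For concreteness, here is a runnable sketch of the Fig. 2b (Markovian) process. The binary operation in the $V_2$ equation is garbled in our copy; we read it as XOR, which is an assumption on our part, as are the function name and interface:

```python
import numpy as np

def sample_fig2b(N, seed=0):
    """Draw N samples from our reading of the Fig. 2b (Markovian) SCM,
    taking the binary mechanism for V2 to be XOR (an assumption)."""
    rng = np.random.default_rng(seed)
    V1 = rng.binomial(1, 0.8, N)              # V1 = U_V1 ~ Bernoulli(0.8)
    V3 = rng.binomial(1, 0.4, N)              # V3 = U_V3 ~ Bernoulli(0.4)
    V2 = (V1 ^ V3) ^ rng.binomial(1, 0.3, N)  # V2 = (V1 xor V3) xor U_V2
    Y = 3 * V1 + 0.4 * V2 + V3 + rng.normal(0, 1, N)
    return np.column_stack([V1, V2, V3]), Y
```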
F. Additional Experiments

In this section, we consider a different data generating process based on Example 1.

Experimental Setup. We use synthetic datasets based on: (a) Example 1, for which the corresponding causal graph (Fig. F.4a) is Markovian, and (b) the graph in Fig. F.4b, which matches the Direct-cause case. These two graphs share the same data generating process, since the graph in Fig. F.4b is generated from the graph in Fig. F.4a by omitting a set of variables. Details of the data generating process are provided below. Throughout the simulation, we denote by $\{\phi_{v_i}\}_{i=1}^{n}$ the ground-truth do-Shapley values.

Comparison Between Estimators. We compare the three estimators (IPW, REG, DML), denoted $\{\hat\phi^{ipw}_{v_i}, \hat\phi^{reg}_{v_i}, \hat\phi^{dml}_{v_i}\}$ respectively, for the scenarios depicted in Figs. (F.4a, F.4b). For all estimators, nuisances are estimated using the gradient boosting model XGBoost (Chen & Guestrin, 2016). Let $\hat\phi^{est}_{v_i,k} \in \{\hat\phi^{dml}_{v_i,k}, \hat\phi^{ipw}_{v_i,k}, \hat\phi^{reg}_{v_i,k}\}$ denote the estimated importance of the $i$th feature in the $k$th repetition (i.e., on the $k$th dataset $D^{(k)}$). As in Section 6, we assess the quality of an estimator by computing the L1 error
$$L_1(est, k) := \frac{1}{n}\sum_{i=1}^{n} \big|\hat\phi^{est}_{v_i,k} - \phi_{v_i}\big|$$
(where $n$ is the number of features). We ran the simulation for 100 random repetitions, i.e., $k \in \{1, 2, \dots, 100\}$, with sample sizes $N := |D| \in \{100, 1000, 5000, 10000\}$, to observe the convergence behavior of the estimators. We fix $M = 100$.

Data Generating Processes. Here, we present the structural causal model for the data generating process used in this section, whose qualitative graphical description is given by the causal graph in Fig. F.4a. We will denote $V_0$: sales calls, $V_1$: interaction, $V_2$: economic factors, $V_3$: last upgrade, $V_4$: product needs, $V_5$: discounts provided, $V_6$: monthly usage, $V_7$: Ad spend, $V_8$: bugs reported, and $Y$: customer retention (the target variable).

Figure F.5: The L1-error plots for the scenario in Section F. Panels: (a) Markovian, Non-noisy; (b) Markovian, Noisy; (c) Direct-cause, Non-noisy; (d) Direct-cause, Noisy.

The exogenous variables are $V_0 \sim P(U_{V_0})$, $V_2 \sim P(U_{V_2})$, and $V_3 \sim P(U_{V_3})$, where $P(U_{V_0})$ is a Uniform distribution over $(0, 4)$; $P(U_{V_2})$ is a Uniform distribution over $(0, 1)$; and $P(U_{V_3})$ is a Uniform distribution over $(0, 20)$. For the rest of the variables,
$$V_1 = V_0 + U_{V_1}, \quad V_4 = 0.1\,V_0 + U_{V_4}, \quad V_5 = 0.5\big(1 - \mathrm{logit}(V_4)\big) + 0.5\,U_{V_5}, \quad V_6 = \mathrm{logit}\big(0.3\,V_4 + U_{V_6}\big),$$
$$V_7 = V_6\,U_{V_7} + \mathbb{1}_{V_3 < 1}(V_3) + \mathbb{1}_{V_3 < 2}(V_3), \qquad V_8 = U_{V_8}(V_6),$$
where $U_{V_1}$ is a Poisson random variable with parameter $0.2$; $U_{V_4} \sim \text{Normal}(0, 1)$; $U_{V_5}$ is a Uniform variable over $(0, 1)$; $U_{V_6} \sim \text{Normal}(0, 1)$; $U_{V_7}$ is a Uniform variable over $(0.9, 0.99)$; $U_{V_8}(V_6)$ is a Poisson random variable whose parameter is $2V_6$; and $\mathbb{1}_{A}(\cdot)$ denotes the indicator function.
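The feature mechanisms above can be sketched in code as follows. The equation for $Y$ is truncated in our copy of the source, so only $V_0, \dots, V_8$ are generated; we also read "logit" here as the logistic sigmoid and the garbled noise in the $V_7$ equation as $U_{V_7}$, both of which are assumptions on our part:

```python
import numpy as np

def sample_features_f4a(N, seed=0):
    """Sketch of the Appendix F feature mechanisms (Fig. F.4a). The equation
    for Y is truncated in the source, so only V0..V8 are generated."""
    rng = np.random.default_rng(seed)
    sigmoid = lambda x: 1 / (1 + np.exp(-x))   # our reading of 'logit' here
    V0 = rng.uniform(0, 4, N)                  # U_V0 ~ Uniform(0, 4)
    V2 = rng.uniform(0, 1, N)                  # U_V2 ~ Uniform(0, 1)
    V3 = rng.uniform(0, 20, N)                 # U_V3 ~ Uniform(0, 20)
    V1 = V0 + rng.poisson(0.2, N)              # U_V1 ~ Poisson(0.2)
    V4 = 0.1 * V0 + rng.normal(0, 1, N)        # U_V4 ~ Normal(0, 1)
    V5 = 0.5 * (1 - sigmoid(V4)) + 0.5 * rng.uniform(0, 1, N)
    V6 = sigmoid(0.3 * V4 + rng.normal(0, 1, N))
    V7 = V6 * rng.uniform(0.9, 0.99, N) + (V3 < 1) + (V3 < 2)  # assumed U_V7
    V8 = rng.poisson(2 * V6)                   # U_V8(V6) ~ Poisson(2 * V6)
    return np.column_stack([V0, V1, V2, V3, V4, V5, V6, V7, V8])
```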