# axiomatic_effect_propagation_in_structural_causal_models__efa557d2.pdf

Journal of Machine Learning Research 25 (2024) 1-71 Submitted 3/22; Revised 2/24; Published 3/24

Axiomatic eﬀect propagation in structural causal models

Raghav Singal singal@dartmouth.edu Operations and Management Science Tuck School of Business at Dartmouth College Hanover, NH 03755, USA

George Michailidis gmichail@ucla.edu Department of Statistics and Data Science University of California, Los Angeles Los Angeles, CA 90095, USA

Editor: Mladen Kolar

We study eﬀect propagation in a causal directed acyclic graph (DAG), with the goal of providing a ﬂow-based decomposition of the eﬀect (i.e., change in the outcome variable) as a result of changes in the source variables. We ﬁrst compare various ideas on causality to quantify eﬀect propagation, such as direct and indirect eﬀects, path-speciﬁc eﬀects, and degree of responsibility. We discuss the shortcomings of such approaches and propose a ﬂow-based methodology, which we call recursive Shapley value (RSV). By considering a broader set of counterfactuals than existing methods, RSV obeys a unique adherence to four desirable ﬂow-based axioms. Further, we provide a general path-based characterization of RSV for an arbitrary non-parametric structural equations model (SEM) deﬁned on the underlying DAG. Interestingly, for the special class of linear SEMs, RSV exhibits a simple and tractable characterization (and hence, computation), which recovers the classical method of path coeﬃcients and is equivalent to path-speciﬁc eﬀects. For non-parametric SEMs, we use our general characterization to develop an unbiased Monte-Carlo estimation procedure with an exponentially decaying sample complexity. We showcase the application of RSV on two challenging problems on causality (causal overdetermination and causal unfairness).

Keywords: Structural causal model, attribution, eﬀect decomposition, Shapley value, Monte-Carlo estimation

1. Introduction

A fundamental problem in causal analysis is to quantify eﬀect propagation; namely, given a change in the outcome variable (eﬀect) that is driven by changes at the source nodes in a causal graph, how does the eﬀect ﬂow through the graph? This question relates to various foundational ideas on causality, including direct and indirect eﬀects (Pearl, 2001), path analysis (Wright, 1934; Duncan, 1966; Goldberger, 1972) and path-speciﬁc eﬀects (Pearl, 2001), degree of responsibility (Chockler and Halpern, 2004), explanations (Halpern and Pearl, 2005), mediation analysis (Baron and Kenny, 1986; Mac Kinnon et al., 2007; Imai et al., 2010), information ﬂow (Ay and Polani, 2008), causal inﬂuence (Janzing et al., 2013), and eﬀect decomposition (Kitagawa, 1955; Blinder, 1973; Oaxaca, 1973; Alwin and

c 2024 Raghav Singal and George Michailidis.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/22-0285.html.

Singal and Michailidis

Hauser, 1975; Fortin et al., 2011; Chernozhukov et al., 2013). In this work, we study the eﬀect propagation problem on a causal directed acyclic graph (DAG) with arbitrary non-parametric structural equations. Such a fairly general framework captures a broad spectrum of causal systems and has found applications in political science (Imai et al., 2011), epidemiology (Vander Weele et al., 2012), ecology (Weible et al., 2004) and wage inequality (Firpo and Pinto, 2016), to name a few.

X2 2X1 = X3 = 3X1

Y = 5X1 + X2 + X3

Figure 1: An example of a causal graph with linear structural equations. The source variable X1 (without parents) is set exogenously, X2 = 2X1, X3 = 3X1, and Y = 5X1 + X2 + X3.

To build intuition, consider the causal graph in Figure 1, where the underlying structural equations are linear and deterministic (no error / noise terms present). Suppose the source variable X1 changes from a value of 0 (background) to 1 (foreground), causing the outcome variable Y to change from 0 to 10. How does this eﬀect (10) of the change in X1 propagate through the DAG? Given the linear equations, an intuitive answer is to express the eﬀect in terms of the corresponding edge weights / path coeﬃcients (Wright, 1918, 1934; Alwin and Hauser, 1975):

Path X1 X2 Y carries 2 units of the total eﬀect.

Path X1 Y carries 5 units of the total eﬀect.

Path X1 X3 Y carries 3 units of the total eﬀect.

This decomposition of (2, 5, 3) along the three channels sums up to the total eﬀect of 10, which is desirable. However, quantifying eﬀect propagation is not as straightforward when the underlying structural equations are non-linear and earlier work, including Wright (1934) and Alwin and Hauser (1975), assumes linear equations. Motivated by this shortcoming, Pearl (2001) introduced natural direct and indirect eﬀects, which are well-deﬁned for nonlinear equations and recover the above-mentioned seemingly natural decomposition for linear equations. However, the natural direct and indirect eﬀects do not necessarily sum to the total eﬀect, which is rather strange (quoted from Pearl (2001)). To see this, consider the causal graph in Figure 2 with the following non-linear equations:

X1 exogenous

X2 = X1 Y = X1X2.

In particular, the outcome involves an interaction of X1 and X2. Similar to the previous example, suppose the source variable X1 changes from 0 to 1, causing Y to change from 0 to 1. How does this eﬀect (change in Y ) propagate through the two channels X1 X2 Y

Axiomatic effect propagation in structural causal models

Figure 2: An example of a two-channeled graph. X1 X2 Y is the indirect channel with X2 being the mediator and X1 Y is the direct channel.

(indirect) and X1 Y (direct)? As we elaborate in 3, both the natural direct and indirect eﬀects equal 0, implying such a decomposition explains none of the total eﬀect. The key reason driving this behavior is that direct (indirect) eﬀect only considers one counterfactual, i.e., when computing eﬀect propagated through the direct (indirect) channel, edges on the indirect (direct) channel are assumed to be inactive. As we will see, by allowing for a broader set of counterfactuals, the proposed approach will overcome this limitation and always explain 100% of the eﬀect, irrespective of the underlying equations being linear or non-linear. A further challenge to quantify eﬀect propagation is to allow for the possibility of multiple source variables changing simultaneously and letting the eﬀect ﬂow through multiple mediators, as opposed to a single block of mediators (Zhang and Bareinboim (2018b) call it en bloc ), as shown in Figure 3. For instance, in domains such as school admissions, credit approval, and wage discrimination, it is often of interest to quantify causal unfairness (Kilbertus et al., 2017; Kusner et al., 2017; Chiappa, 2019), which can be driven by simultaneous changes in multiple protected attributes. A desirable property in such a setting is ﬂow conservation, i.e., the total eﬀect (due to changes at all the source nodes) ﬂows down the DAG while obeying ﬂow-in equals ﬂow-out at each node (formally deﬁned in 2). To the best of our knowledge, most state-of-the-art approaches are either not well-deﬁned in such a setting or fail to obey ﬂow conservation, primarily due to their inability to account for multiple counterfactuals, which is important in applications with multiple moving pieces (e.g., multiple source variables changing or multiple variables mediating the eﬀect). From both a foundational and a practical point-of-view, this is unsatisfying. One possibility to characterize the direct eﬀect of multiple source variables changing simultaneously is via k-way interaction eﬀects; i.e., let the changed values of race and sex propagate only through the red direct links in Figure 3 (2-way eﬀect) and compute the corresponding diﬀerence in the outcome variable (income). However, such a way of thinking leads to exponentially many eﬀects (one for each possible combination of the source nodes). In addition, when one sums up all of these interaction eﬀects, they do not necessarily equal the total eﬀect, which is again undesirable. To see this, consider the Figure 4 setup.

Figure 4: A simple graph to illustrate 2-way interaction eﬀects.

Singal and Michailidis

race age sex

Figure 3: A DAG with multiple source variables (race, age, and sex) and multiple mediators (edu, ms, and hrs). We re-visit this DAG in 7 to illustrate how RSV quantiﬁes causal unfairness (eﬀect propagated via the unfair thick red edges) when multiple sensitive attributes (race and sex) change simultaneously and the underlying dynamics are nonlinear.

Let y = f(x1, x2) with the source variables (x1, x2) changing from (0, 0) to (1, 1). The 1-way eﬀects correspond to the two direct eﬀects (one for X1 and one for X2):

θ1 := f(1, 0) f(0, 0)

θ2 := f(0, 1) f(0, 0).

There is one 2-way eﬀect, which corresponds to changing both the source variables:

θ12 := f(1, 1) f(0, 0).

There are no k-way eﬀects for k 3 since there are only 2 source variables. On adding all of these eﬀects, we get

θ1 + θ2 + θ12 = f(1, 0) + f(0, 1) + f(1, 1) 3f(0, 0).

However, the latter does not necessarily equal the total eﬀect, which is f(1, 1) f(0, 0). Naturally, θ12 equals the total eﬀect, but such a view provides no insight into how to decouple this interaction eﬀect in terms of the underlying contributing factors. Even in a setting as simple as Figure 4 (i.e., no mediators), being able to decouple the total eﬀect is important and analogous to Shapley value, which is a well-accepted concept in economics. In particular, as discussed in Friedenberg and Halpern (2019), one potential application of such a decomposition is in legal ascription of responsibility . In fact, this has been discussed in the legal literature as well, where Ferey and Dehez (2016) motivate it as fairness of the apportionment from an ex post perspective . To dive deeper, consider the Tragedy of the Commons example from Friedenberg and Halpern (2019):

100 ﬁshermen live by a lake. If at least 10 of them overﬁsh this year, then the entire ﬁsh population of the lake will die out. Each ﬁsherman overﬁshes. By the end of the year, the entire ﬁsh population has died out.

Here, we can deﬁne k-way interaction eﬀects for k {1, , 100}. All of them will equal 0 for k < 10 and equal 1 for k 10. However, such a view gives no insight into individual-

Axiomatic effect propagation in structural causal models

level responsibility.1 On the other hand, the Shapley value assigns a responsibility equal to 1/100 to each ﬁsherman, which given their symmetric behavior, is a fair apportionment from an ex post perspective. This work aims to develop a principled framework to quantify eﬀect propagation in causal DAGs with linear / non-linear structural equations. While obeying intuitive properties such as the eﬀect decomposition adding up to the total eﬀect (source eﬃciency) and obeying ﬂow conservation, our framework is ﬂexible enough to allow for the possibility of simultaneous changes at an arbitrary number of source nodes. Note that such properties represent one possible way to operationalize the notion of fair apportionment from an ex post perspective. We reckon there might be other sets of properties2 (and contextspeciﬁc knowledge could be useful in determining which properties are relevant), but we show uniqueness to four desirable axioms under the posited properties. While doing so, we do not impose any additional assumption on the graph topology, besides being a DAG, and hence, allow for multiple mediators. When the graph has no mediators, our framework recovers Shapley value. In addition, it goes a step beyond and is able to decouple the total eﬀect in more involved settings (i.e., with mediators), while accounting for a multitude of non-trivial interaction eﬀects. As such, our proposal can be seen as a generalization of SV to graphs with mediators. The main contributions of this work are as follows. First, we use the language of causal graphs to formally state the underlying attribution / eﬀect propagation problem. The framework is general enough to capture an arbitrary causal graph as long as the underlying graph topology contains no cycles (DAG). We use our problem deﬁnition to highlight some intuitive properties (source eﬃciency and ﬂow conservation) that any methodology to quantify eﬀect propagation should satisfy. Second, we develop a ﬂow-based methodology to quantify eﬀect propagation, which we call recursive Shapley value (RSV). As illustrated in Figure 5, RSV employs a top-down principle by ﬁrst attributing to the source nodes and then quantifying how the eﬀect ﬂows down the DAG. Such a top-down principle generalizes a number of existing node-based attribution techniques. On the foundational front, we establish that by considering a broad spectrum of counterfactuals, RSV uniquely obeys a set of natural ﬂow-based axioms (Theorem 4). We further show that adherence to such axioms result in RSV satisfying a number of intuitive properties (implementation invariance, sensitivity, monotonicity, and aﬃne scale invariance) discussed in recent literature (Sundararajan et al., 2017; Sundararajan and Najmi, 2020). On the implementation front, we provide a closed-form characterization for RSV under both linear (Theorem 6) and non-linear 3 (Theorem 8) SEMs. Our characterization for the linear model recovers the classical method of path coeﬃcients (Wright, 1918, 1934) and is equivalent to path-speciﬁc eﬀects advocated by Pearl (2001), whereas for non-parametric (possibly non-linear) models, the decomposition provided by RSV sums up

1. Of course, this is not meant to diminish the importance of such interaction eﬀects from an ex ante perspective, where one is interested in the impact of (possibly multiple) interventions on the outcome variable. However, that is not our objective in this work. 2. In fact, Friedenberg and Halpern (2019) discuss this caveat as well: words like blame have a wide variety of nuanced meanings in natural language. While we think that the notion that we are trying to capture is useful, it corresponds at best to only one way that the word blame is used by people. So do Chockler and Halpern (2004): We cannot say that a deﬁnition is right or wrong . 3. All results for the non-linear case hold more generally for an arbitrary non-parametric SEM, as it will become clear in 5.

Singal and Michailidis

to the total eﬀect (an implication of the more fundamental ﬂow-based axioms). We use our characterization for the non-parametric models to develop a Monte-Carlo estimation scheme and analyze the quality of our estimator (Proposition 10), which is illustrated through a numerical study. Subsequently, we show how RSV facilitates non-linear mediation analysis to quantify causal overdetermination and causal unfairness and contrast it with existing ideas such as degree of responsibility and path-speciﬁc eﬀects.

Third, we provide a comprehensive view of existing approaches to eﬀect propagation. In particular, we show how various existing techniques in the causality literature operate and shed light on their shortcomings via simple examples. Interestingly, most existing approaches do not result in the decomposition adding up to the total eﬀect. To the best of our knowledge, this is the ﬁrst work to understand the eﬀect propagation literature using a common framework.

Figure 5: Illustrating RSV on the causal graph from Figure 1. First (see the leftmost plot), RSV attributes the total eﬀect of 10 to the source node 1. Second (see plot #2), node 1 s attribution (10) is decomposed among its outgoing edges by evaluating their contributions via the following counterfactual questions: how much attribution would node 1 have received had edge (1, 2) not propagated the change at node 1? This results in edge (1, 2) receving a ﬂow of 2, edge (1, Y ) a ﬂow of 5, and edge (1, 3) a ﬂow of 3. Third (see plot #3), to propagate down the ﬂow node 2 receives, RSV evaluates the attribution node 2 would have received had edge (2, Y ) not communicated the change at node 2. Finally (see the rightmost plot), RSV repeats the exercise at node 3.

Outline The remainder of the paper is organized as follows. In 2, we formally state the eﬀect propagation problem and use such formalism to discuss existing approaches and their shortcomings in 3. We then propose the recursive Shapley value (RSV) approach and establish its desirability in 4, in addition to discussing its connections with the existing works. We characterize RSV for both linear and non-parametric (possibly non-linear) structural equations models and discuss its exact computation in 5, followed by developing a Monte-Carlo estimation scheme in 6. We then showcase applications in 7 and discuss concluding remarks in 8. Technical details are deferred to the appendices.

2. Problem formulation

We start by formulating the problem under consideration. First, we deﬁne the causal graph of interest ( 2.1) and subsequently, formally present the eﬀect propagation problem ( 2.2).

Axiomatic effect propagation in structural causal models

2.1 Causal graph setting

Let G = (N+, E) denote the underlying directed acyclic graph (DAG), with node set N+ := {1, . . . , n, n + 1} and edge set E. A node without parents is called a source node and we allow for multiple such nodes, with N0 denoting their collection. The outcome variable corresponds to the sink node (node without children) and we let it correspond to node n+1 without loss of generality. We focus on a single sink node and note that it is straightforward to generalize our framework to multiple sink nodes (multivariate outcome). We let N := N+ \ {n + 1} = {1, . . . , n} be the set of non-sink nodes. Given an edge (i, j) in E, node i is referred to as a parent of node j whereas node j is a child of node i. For an arbitrary node j N+, we deﬁne Pj to be the set of its parents:

Pj := {i N : (i, j) E} j N+.

Naturally, nodes in N0, i.e., source nodes, have no parents: Pj = for all j N0. Similarly, for node i N+, we deﬁne Ci to be the set of its children:

Ci := {j N+ : (i, j) E} i N+,

and note that node n + 1, i.e., the sink node, has no children, i.e., Cn+1 = . Corresponding to each non-outcome node i N is a variable 4 Xi R whereas the variable corresponding to the outcome node n + 1 is denoted by Y R. For ease of notation, we will sometimes denote Y by Xn+1. We deﬁne the vector of all non-outcome variables as X := (X1, . . . , Xn) and a subset of this collection as XN := (Xi)i N for N N. Each non-source variable Xi is a function of its parents XPi and a corresponding noise variable Ui, leading to the following structural equations F := [fi( )]i N+\N0 in the DAG:

xi = fi(x Pi, ui) i N+ \ N0.

Non-capitalized symbols are used to denote the realizations of the corresponding random variables. The noise variables U := (Ui)i N+\N0 are drawn from some exogenous distribution D, which is independent of (X, Y ). The values of the source variables XN0 are set exogenously and independently of each other. Mediating variables or mediators (variables that are neither source nor outcome, i.e., XN\N0) propagate the source nodes values down to the outcome node n + 1. Given source variables XN0 = x N0 and noise variables U = u, we denote the observed outcome at node n + 1 by

Clearly, conditioned on the noise U = u, the model is deterministic and the source variables realizations XN0 = x N0 are suﬃcient to determine the outcome. Hence, we denote the expected outcome by

g(x N0) := E[f(x N0,u)],

4. Our framework is well-deﬁned and all our results go through for multi-dimensional variables as well, i.e., each non-outcome node i N corresponding to a multi-dimensional variable. However, to be consistent with the literature on causal graphs, we treat each variable as being one-dimensional.

Singal and Michailidis

where the expectation is over the distribution of u. Similar to Deﬁnition 7.1.1 in Pearl (2009), the structural causal model (or causal graph) is denoted by M := (G, F, D) 5. Given our focus on eﬀect propagation, we assume M to be ﬁxed and known throughout the paper, as do related notions such as direct, indirect, and path-speciﬁc eﬀects (Pearl, 2001) and degree of responsibility (Chockler and Halpern, 2004).

2.2 The eﬀect propagation problem

Next, we use the setup previously deﬁned to formally state the eﬀect propagation problem. Similar to Pearl (2001), we are interested in two realizations of the source variables XN0:

x(1) N0 := (x(1) i )i N0 (background)

x(2) N0 := (x(2) i )i N0. (foreground)

There is an underlying temporal / interventional aspect such that the background value changes to the foreground value. To tie our terminology with existing literature, note that our background is similar to the reference value x in Pearl (2001). Given noise realization U = u, the structural causal model is deterministic and hence, the changes at source nodes dictate changes in the entire graph. The background value deﬁnes the baseline outcome

y(1) u := f(x(1) N0 ,u),

and analogously

y(2) u := f(x(2) N0 ,u).

The super-script in y(t) u captures the dependence on the source values x(t) N0 for t {1, 2} whereas the sub-script captures the dependence on the noise u. Hence, the change in the outcome equals

δu := y(2) u y(1) u ,

and the expected change equals

δ := E[δu] = E[y(2) u y(1) u ] = E[y(2) u ] E[y(1) u ] = y(2) y(1),

where y(t) := E[y(t) u ] = g(x(t) N0) for t {1, 2} with the expectation over u. Note that this delta eﬀect, i.e., the (expected) change in the outcome, is entirely driven by the changes at the source nodes and their propagation through the DAG. This is because all the exogenous sources of variation are captured by the source nodes. The question here is how to attribute the expected change in Y (i.e., δ) to the changes at the source nodes and their propagation through the edges of the graph. If the graph has

5. Note that Pearl (2009) deﬁnes a structural causal model (Deﬁnition 7.1.1) as the triplet (V, F, U) where V := (X, Y ), whereas we deﬁne it as (G, F, D). The two deﬁnitions are equivalent as we use G instead of V and D instead of U.

Axiomatic effect propagation in structural causal models

only one node, i.e., n = 1, such a task is straightforward since the only node is completely responsible for the expected change and thus, receives 100% attribution, which propagates through the only edge in the graph. However, if the graph has multiple nodes, i.e., n > 1, then non-trivial interactions are possible. Given an arbitrary causal graph, is there a systematic way to disentangle such interactions and quantify eﬀect propagation via ﬂowbased attribution, i.e., the amount of eﬀect each edge carries? In order to quantify eﬀect propagation, we introduce the following notation. We use [πi]i N to capture the attribution to nodes and [πij](i,j) E to denote the attribution to edges (attribution ﬂow). The scale of both is in terms of δ. As discussed above, all the exogenous variation is captured by the source nodes. Hence, a desirable property is source eﬃciency: X

i N0 πi = δ.

In addition, it seems natural to ensure conservation of ﬂow at each node, that is, ﬂow in equals ﬂow out.

Deﬁnition 1 (Flow conservation) At each internal node j N \ N0, X

i Pj πij = X

For source nodes, X

j Ci πij = δ.

Finally, at sink node, X

i Pn+1 πi,n+1 = δ.

Despite being a desirable feature, ﬂow conservation does not map to a unique attribution ﬂow [πij](i,j) E in general, making the task of attribution non-trivial. As mentioned earlier, consistent with the related literature (Pearl, 2001; Chockler and Halpern, 2004), the underlying causal graph M = (G, F, D) is assumed to be ﬁxed and known in this work. It is worth emphasizing that attribution, i.e., quantifying eﬀect propagation, is a challenging problem in itself, which complements the ongoing developments in causal discovery (Peters et al., 2017; Glymour et al., 2019).

Remark 2 (Individual-level eﬀect) In the problem formulation above, we focus on the population-level eﬀect (or the average treatment eﬀect) as we take an expectation over u and aim to decompose δ (as opposed to δu). We note that our framework works for individuallevel eﬀect as well since given u, all of our machinery can be recycled to propagate δu through the graph. However, a practical challenge might be that u is unobserved. To overcome this, we can use the abduction, action, and prediction framework of Pearl et al. (2016). We elaborate further in the context of the causal unfairness application (see Remark 11).

Singal and Michailidis

3. Existing approaches and their limitations

As stated in 1, quantifying eﬀect propagation relates to various concepts on causality. To be concise, we review direct and indirect eﬀects ( 3.1), path-based techniques ( 3.2), and degree of responsibility ( 3.3) as these notions seem to be the closest to our work. As we illustrate below, each of these notions either fails to satisfy the seemingly natural property of source eﬃciency or is not even well-deﬁned beyond the class of linear SEMs. On the other hand, in addition to satisfying source eﬃciency, our ﬂow-based approach ( 4) uniquely obeys a set of desirable axioms (e.g., ﬂow conservation and attributing zero ﬂow to a redundant edge) and is well-deﬁned for a non-parametric SEM with arbitrary DAG structure. It is also worth highlighting ideas such as Blinder-Oaxaca decomposition (Kitagawa, 1955; Blinder, 1973; Oaxaca, 1973), which attempt to quantify eﬀect decomposition and understand direct and indirect discrimination in a regression setting. Our ﬂow-based approach ( 4) naturally aids such quantiﬁcation in a much more general setting than regression. We discuss one such application in 7 (causal unfairness). We note that in addition to the causality literature, our work intersects with the burgeoning explainable AI / interpretable ML literature. We provide an in-depth discussion of this literature (and its connections to the proposed approach) in our preliminary work (Singal et al., 2021). Therefore, for brevity, we primarily center our discussion here on the causality literature.

3.1 Direct and indirect eﬀects

Pearl (2001) considers a graph with two channels (see Figure 2 in 1) and quantiﬁes direct (eﬀect propagated through the direct channel X1 Y ) and indirect (eﬀect mediated through the indirect channel X1 X2 Y ) eﬀects. Given background x(1) 1 and foreground x(2) 1 with noise (u2, u3), Pearl (2001) deﬁnes natural direct eﬀect as the following diﬀerence:

f3(x(2) 1 , f2(x(1) 1 , u2), u3) f3(x(1) 1 , f2(x(1) 1 , u2), u3). (1)

That is, in the output equation f3(x1, x2, u3), the source variable changes from x(1) 1 to x(2) 1 while the intermediate variable X2 is held at its background value f2(x(1) 1 , u2). On the other hand, for the natural indirect eﬀect, the intermediate variable changes from f2(x(1) 1 , u2) to f2(x(2) 1 , u2) while the source variable is held at its background value x(1) 1 :

f3(x(1) 1 , f2(x(2) 1 , u2), u3) f3(x(1) 1 , f2(x(1) 1 , u2), u3). (2)

Figure 6 provides a visual description of (1) and (2) in terms of active (solid black) and inactive (dashed red) edges (formally deﬁned in 4). As alluded to in 1, it should be clear that such a notion only allows for one possible counterfactual. The average direct and indirect eﬀects are deﬁned by taking an expectation over the noise (u2, u3). A key limitation of such a deﬁnition is that it fails to capture the interactions between the two channels, resulting in it violating source eﬃciency. We illustrate this in Example 1.

Example 1 (Interaction) Consider the graph in Figure 2 with the following structural equations: X2 = X1 and Y = X1X2 (interaction). Suppose background x(1) 1 = 0 and

Axiomatic effect propagation in structural causal models

(a) Natural direct eﬀect

(b) Natural indirect eﬀect

Figure 6: Illustration of (a) direct and (b) indirect eﬀects (Pearl, 2001). Under the direct eﬀect, as shown in the ﬁrst graph, the foreground value x(2) 1 propagates through the direct edge (1, 3) (active) but not through the indirect edge (1, 2) (inactive). (Node 3 denotes the outcome Y .) Direct eﬀect corresponds to the diﬀerence between the ﬁrst two graphs. Under the indirect eﬀect, as shown in the third graph, the foreground value x(2) 1 propagates through the indirect edge (1, 2) (active) but not through the direct edge (1, 3) (inactive). Indirect eﬀect corresponds to the diﬀerence between the last two graphs.

foreground x(2) 1 = 1. Hence, x(1) 2 = y(1) = 0 and x(2) 2 = y(2) = 1. Both natural direct and indirect eﬀects as in (1) and (2) equal 0, even though the change in the outcome equals 1. Hence, the sum of the two eﬀects is less than the total eﬀect. (We assumed the model in this example to be deterministic for ease of illustration and note that it is straightforward to introduce noise variables.)

In fact, realizing that natural direct and indirect eﬀects do not necessarily sum up to the total eﬀect, Pearl (2001) himself stated such relationships to be strange (see discussion below Equation (23) in Pearl (2001)). The notion of reverse indirect eﬀect (Pearl, 2010) can ﬁx this limitation. However, as we discuss in Appendix B, it is rather an ad hoc ﬁx and can result in diﬀerent attributions depending on the order in which one adds the channels. In fact, the recent work of Plecko and Bareinboim (2024) points this out as well (see 5.1 in their paper) and highlights the averaging we propose in this work as a solution. We mention in passing the work of Zhang and Bareinboim (2018a), which generalizes such notions to account for confounders. However, under the absence of counfounders (which is the focus of this work), the proposal of Zhang and Bareinboim (2018a) suﬀers from the above-mentioned limitations as it reduces to Pearl (2001). Beyond the simple graphical structure as in Figure 6 (one source variable and one mediator), it is possible to generalize the notion of direct and indirect eﬀects to graphs with multiple source variables and multiple mediators. In fact, Pearl (2001) does so via path-speciﬁc eﬀects, which we discuss next.

3.2 Path-based techniques

Path-based techniques decompose the total eﬀect by attributing it to the underlying paths in the DAG and date back to the method of path coeﬃcients, introduced over a century ago (Wright, 1918) and discussed in various ﬁelds, including mathematical statistics (Wright, 1934), sociology (Duncan, 1966; Alwin and Hauser, 1975; Fox, 1980), and econometrics (Goldberger, 1972). As alluded to in 1, though the path coeﬃcients method handles the

Singal and Michailidis

case of linear SEMs (by appropriately deﬁning path coeﬃcients using the underlying edge weights, as we did for the example in Figure 1), its key limitation is the inability to generalize beyond the class of linear SEMs. Recognizing this limitation, Pearl (2001) deﬁned path-speciﬁc eﬀects, where given a path from a source node to the outcome node, one only allows the foreground value of the source node to be propagated along the edges in the path (and all other edges are inactive). The corresponding path-speciﬁc eﬀect is the diﬀerence between the resulting (expected) outcome value and the baseline outcome value y(1). Despite being well-deﬁned for non-parametric SEMs, path-speciﬁc eﬀects can suﬀer from the same key limitation as direct and indirect eﬀects, i.e., the sum of path-speciﬁc eﬀects might not add up to the total eﬀect. For instance, in Example 1, the path-speciﬁc eﬀect of both the direct path (X1 Y ) and the indirect path (X1 X2 Y ) equals 0, since the absence of either of the paths results in a failure to capture the interaction eﬀect X1X2. As with direct and indirect eﬀects, the key reason for this shortcoming is that path-speciﬁc eﬀects only allow for one counterfactual, i.e., all other edges are assumed to be inactive. Recent path-based works include Zhang and Bareinboim (2018b) and Henshaw et al. (2020). Though non-parametric in nature, Zhang and Bareinboim (2018b) tackle a diﬀerent problem, as their goal is to decompose the covariance between a source variable and the outcome variable as a sum over the unblocked paths. On the other hand, though we do not discuss the path-speciﬁc causal derivative / selection gradient (Henshaw et al., 2020) in detail, it is straightforward to verify that it suﬀers from the same limitation as path-speciﬁc eﬀects (decomposition not adding up to the total eﬀect).

3.3 Degree of responsibility

Motivated by causality typically [being] treated an all-or-nothing concept , Chockler and Halpern (2004) study a similar problem and propose degree of responsibility to quantify attribution to nodes. To illustrate the intuition behind degree of responsibility, consider the following example, which is adapted from Chockler and Halpern (2004).

Figure 7: The graph corresponding to the voting example (Example 2).

Example 2 (Voting) There are two candidates (1 and 2) in an election and three voters. For each voter i {1, 2, 3}, source variable Xi {0, 1, 2} denotes the candidate voter i votes for, where 0 means the voter is undecided. We are interested in the outcome Y {0, 1}, which denotes whether the ﬁrst candidate wins:

( 1 if P3 i=1 I{Xi = 1} > P3 i=1 I{Xi = 2} 0 otherwise.

Axiomatic effect propagation in structural causal models

I{ } denotes the indicator function and the underlying graph is shown in Figure 7. Consider the following two scenarios:

1. Background x(1) = (0, 0, 0) and foreground x(2) = (1, 1, 1), i.e., all three voters vote for candidate 1 and hence, candidate 1 wins the election 3-0.

2. Background x(1) = (0, 0, 0) and foreground x(2) = (1, 1, 2), i.e., the ﬁrst two voters pick candidate 1 and the third voter picks candidate 2, meaning candidate 1 wins 2-1.

In the words of Chockler and Halpern (2004),

If someone wins an election [3-0], then each person who votes for [them] is less responsible for the victory than if [they] had won [2-1].

To operationalize this intuition, Chockler and Halpern (2004) deﬁne degree of responsibility of A for B as 1/(κ+1), where κ is the minimal number of changes that have to be made to obtain a contingency where B counterfactually depends on A . In the ﬁrst scenario (3-0), this notion attributes a value of 1/2 to each of the 3 voters, as 1 change is needed for a vote to be critical. On the other hand, in the second scenario (2-1), the degree of responsibility of each of the 2 voters (for candidate 1) is 1, as each vote is critical (and attribution to the remaining 1 voter is 0).

Similar to Pearl (2001), a key limitation of such a notion is that it violates source eﬃciency, as should be clear from Example 2. In fact, Chockler and Halpern (2004) themselves stated that a better name for their notion may have been degree of criticality (see their discussion in 5) by pointing out the following example:

For example, consider a voter who voted for [candidate 1] in the case of a 1-0 vote and a voter who voted for [candidate 1] in the case of a 100-99 vote. In both case, that voter has degree of responsibility 1. While it is true that, in both cases, that voter s vote was ciritical, in the second case, the voter may believe that his responsibility is more diﬀuse.

Recognizing this, Chockler and Halpern (2004) discussed the possibility of using Shapley value to deﬁne responsibility, which motivates RSV.

4. The recursive Shapley value (RSV) approach

We ﬁrst provide the intuition behind the proposed approach ( 4.1) and then formally introduce it ( 4.2). Subsequently, we establish the ﬂow-based axioms ( 4.3) obeyed by RSV. Finally, we discuss how RSV connects to existing approaches in the literature ( 4.4). We note that the presentation here generalizes our preliminary work (Singal et al., 2021) since the model here has noise variables U, which were absent before. Before providing intuition, we introduce some notation. Without loss of generality, we insert node 0 (super-source) to the DAG G. This node has no parent but has |N0| outgoing edges (one to each of the source node): (0, j) for j N0. The set of edges is re-deﬁned to include these edges: E E {(0, j) : j N0}. Similarly, the set of nodes is re-deﬁned: N N {0} and N+ N+ {0}. The original set of source nodes is still denoted by N0

Singal and Michailidis

and this collection does not contain the super-source node 0. For each non-outcome node i N, we deﬁne Ei := {(i, j) : j Ci} as the set of its outgoing edges, implying (E0, . . . , En) forms a partition of E. The proposed ﬂow-based approach will quantify edge attributions [πij](i,j) E. These edge attributions are used to deﬁne the node attributions [πj]j N\{0} as the corresponding incoming ﬂow:

i Pj πij j N \ {0}.

It will help to formalize the notion of an active or inactive edge. In our causal graph setting, the role of an edge is to propagate information. As a result, motivated by Pearl (2001) (recall Figure 6), if an edge (i, j) E is active, then it propagates the updated value xi from node i to node j, whereas if it is inactive, then it does not propagate the updated value, meaning node j receives the background value. An illustration is provided in Figure 8.

Figure 8: Illustrating active / inactive edges. Here, E = {(0, 1), (1, 2), (1, 3), (2, 3)} is the set containing all edges in the DAG (active and inactive). Node 3 denotes the outcome Y . There are two inactive edges: (1, 3) and (2, 3) (dashed red lines). Consider arbitrary noise value U = (u2, u3). Since edge (0, 1) is active, we set node 1 to its foreground value x(2) 1 . As edge (1, 2) is active and X1 is set to x(2) 1 , node 2 receives x(2) 1 . However, since edges (1, 3) and (2, 3) are inactive, node 3 (Y ) receives x(1) 1 (from node 1) and x(1) 2 := f2(x(1) 1 , u2) (from node 2). Therefore, X1 = x(2) 1 , X2 = f2(x(2) 1 , u2), and Y = f3(x(1) 1 , x(1) 2 , u3). Note that if edge (0, 1) had been inactive, then the variables would have been set as follows: X1 = x(1) 1 , X2 = f2(x(1) 1 , u2), and Y = f3(x(1) 1 , x(1) 2 , u3).

In general, we consider a subset E E of active edges and arbitrary noise U = u. Let x(1)(u) denote the background corresponding to u, i.e., x(1) i (u) denotes the value at node i N+ \ {0} given (x(1) N0 ,u). Then, each source node i N0 is set as a function of E as follows:

( x(1) i if (0, i) / E

x(2) i if (0, i) E. (3a)

Leveraging the underlying structural equations, we set each non-source node j N+ \{N0 {0}} as a functon of active edges E as follows:

xj(E,u) = fj((xij(E,u))i Pj, uj), (3b)

where for all i Pj,

xij(E,u) :=

( x(1) i (u) if (i, j) / E xi(E,u) if (i, j) E. (3c)

Axiomatic effect propagation in structural causal models

We note that such a construction is motivated by Pearl (2001) (see Figure 3 in Pearl (2001) for example). Notation xij is new and xn+1(E,u) corresponds to the outcome, which we denote by yu(E). Analogous to the notation in 2.2, we deﬁne y(E) := E[yu(E)] by taking the expectation over u. Thus, given arbitrary E E, the notation y(E) is well-deﬁned, using which we provide intuition next. This notation would be critical to deﬁne RSV, since unlike existing notions, RSV allows for a broad spectrum of counterfactuals in terms of which edges are active. (Note that the discussion below assumes basic knowledge of Shapley value (Shapley, 1953), a brief primer for which is presented in Appendix A.)

4.1 Intuition guiding the RSV approach

First, the proposed RSV approach attributes to the source nodes. Then, it quantiﬁes eﬀect propagation by ﬂowing down the source node attributions on the outgoing edges (top-down). To convey intuition, we illustrate the RSV mechanics on a relatively simple DAG (see Figure 9) with arbitrary F (structural equations) and D (noise distribution).

Figure 9: DAG used to illustrate the intuition behind the proposed RSV approach.

Recall that XN0 changes from x(1) N0 to x(2) N0 , resulting in the expected outcome changing

from y(1) = E[y(1) u ] to y(2) = E[y(2) u ]. Since the source nodes N0 = {1, 2} capture all the exogenous source of variation, the expected change in Y is entirely driven by the propagation of changes at (X1, X2). As a result, we begin by attributing to the source nodes, which is done via the following game at the super-source node 0. The collection of outgoing edges from node 0, i.e., E0 = {(0, 1), (0, 2)}, is the set of players. Given a subset of players (coalition) E0 E0, we deﬁne the corresponding characteristic function as

v0(E0) := y(E0, E1, E2, E3, E4).

Recall the notation y( ) is deﬁned in (3). The downstream edges (E1, E2, E3, E4) are set to be active in this game. Then, the attributions received by edges (0, 1) and (0, 2), i.e., π01 and π02, equal the SVs of this game (illustrated in Figure 10). To explicitly capture the dependence on E = (E0, E1, E2, E3, E4) in our notation, we denote these attributions as π01(E), π02(E), π1(E), and π2(E). We then move down to the children of node 0, i.e., nodes 1 and 2, to understand how the eﬀect ﬂows through the graph. Since node 1 has only one outgoing edge, the eﬀect propagation through node 1 is trivial (cf. ﬂow conservation): π13 = π1(E). On the other hand, quantifying the propagation of π2(E) through node 2 is challenging as there are two

Singal and Michailidis

(a) Edge (0, 1) inactive

(b) Edge (0, 1) active

Figure 10: Illustration of the ﬂow π02. The eﬀect propagated by edge (0, 2) is quantiﬁed via its value-add, as determined by the following counterfactual question: how much would have been the eﬀect (expected change in Y ) had edge (0, 2) been inactive? Note that such a counterfactual question involves two possibilities: (a) edge (0, 1) being inactive and (b) edge (0, 1) being active. The diﬀerence between the ﬁrst two graphs captures possibility (a) (value-add of edge (0, 2) given edge (0, 1) being inactive). On the other hand, the diﬀerence between the last two graphs captures possibility (b) (value-add of edge (0, 2) given edge (0, 1) being active). Edge (0, 2) receives an attribution equal to a (weighted) average of these two counterfactuals, where the weights are determined via SV (1/2 in this example). We note that all downstream edges (E1, E2, E3, E4) are set to be active in both the possibilities.

outgoing edges: E2 = {(2, 3), (2, 4)}. Our goal is to decouple the ﬂow received by node 2 into its two outgoing edges. To do so, we aim to understand how much does the presence of each of these two edges contributes to π2(E). Recall that the upstream game at node 0 assumed all the downstream edges to be active. Thus, node 2 receives an attribution of π2(E0, E1, E2, E3, E4) if both the edges in E2 are active. To decompose π2 into π23 and π24, following counterfactual questions seem natural:

Had both the edges in E2 been inactive, how much ﬂow would have propagated through node 2?

What if edge (2, 3) was inactive and (2, 4) active (and vice versa)?

In the upstream game at node 0, had both the edges (2, 3) and (2, 4) been inactive, then the resulting characteristic function would have been y(E0, E1, , E3, E4) E0 E0, meaning edge (0, 2) (and thus, node 2) would have received no ﬂow at all, i.e., π2(E0, E1, , E3, E4) = 0. Therefore, edges (2, 3) and (2, 4) are entirely responsible for the ﬂow π2(E) through node 2. As a result, ﬂow conservation seems logical: π2(E) = π23 + π24. However, this does not uniquely determine the ﬂow on the two edges: π23 and π24. An intuitive scenario is if an edge is redundant, i.e., the attribution received by node 2 is independent of the edge being active or inactive. Then, assigning zero ﬂow on such an edge seems apt (nullity). A second intuitive scenario corresponds to both the edges in E2 being equivalent. For instance, if the value attributed to node 2 equals π2(E)/2 if either of the two edges is active, then decoupling π2(E) equally between them aligns with intuition (symmetry).

Axiomatic effect propagation in structural causal models

We operationalize such intuition by considering the following game at node 2. The set of outgoing edges (E2) maps to the set of players. For arbitrary subset of players (coalition) E2 E2, we deﬁne the characteristic function as the attribution received by node 2 from the upstream game at node 0 (hence, recursive):

v2(E2) := π2(E0, E1, E2, E3, E4).

The ﬂows received by edges (2, 3) and (2, 4), i.e., π23 and π24, are deﬁned to be the SVs of this game (illustrated in Figures 11 and 12).

(a) Edge (2, 4) inactive

(b) Edge (2, 4) active

Figure 11: Illustration of the ﬂow π23. The eﬀect propagated by edge (2, 3) is quantiﬁed via its value-add, as determined by the following counterfactual question: how much would have been the attribution received by node 2 had edge (2, 3) been inactive? Note that such a counterfactual question involves two possibilities: (a) edge (2, 4) being inactive and (b) edge (2, 4) being active. The diﬀerence between the ﬁrst two graphs captures to possibility (a) (value-add of edge (2, 3) given edge (2, 4) being inactive). On the other hand, the diﬀerence between the last two graphs captures possibility (b) (value-add of edge (2, 3) given edge (2, 4) being active). Edge (0, 2) receives an attribution equal to a (weighted) average of these two counterfactuals, where the weights are determined via SV (1/2 in this example). Note that the attribution node 2 receives in each of these four graphs still needs to be evaluated. Figure 10 illustrated this evaluatation for the third graph here (both (2, 3) and (2, 4) active). In Figure 12 below, we show it for the ﬁrst graph ((2, 3) active but (2, 4) inactive). Computations for the second and fourth graphs are similar (not shown to be concise).

Having computed the ﬂow through nodes 1 and 2, we move down to the next layer of children: nodes 3 and 4. To compute the ﬂow through these two nodes, we recycle our logic (in a recursive manner). For example, at the node 3 game, the players are the corresponding outgoing edges E3. Given arbitrary coalition E3 E3, we deﬁne the characteristic function to be the incoming ﬂow at node 3 from the upstream games: v3(E3) := π3(E0, E1, E2, E3, E4).

4.2 Recursive Shapley value based attribution

The formal deﬁnition of RSV involves n recursions, with each recursion being initialized at the corresponding node j N \ {0} via the following game. The set of players includes all the outgoing edges Ej. Given a subset of players (coalition) Ej Ej, we deﬁne the characteristic function to be the node j attribution from the upstream games with all other

Singal and Michailidis

(a) Edge (0, 1) inactive

(b) Edge (0, 1) active

Figure 12: Illustration of the ﬂow π02 (attribution node 2 receives) when one of its outgoing edges (edge (2, 4)) is inactive. The four graphs here replicate the ones in Figure 10 but with edge (2, 4) being inactive. The steps to compute π02 are identical to the ones in the caption of Figure 10.

edges being active:

vj(Ej) := πj(E0, . . . , Ej, . . . , En) = X

i Pj πij(E0, . . . , Ej, . . . , En). (4)

Given this game, the ﬂow attributed to edge (j, k) Ej is the corresponding SV:

Ej Ej\{(j,k)} w Ej(Ej) {vj(Ej {(j, k)}) vj(Ej)} , (5)

w Ej(Ej) := |Ej|! (|Ej| |Ej| 1)!

is the SV weight function as in Appendix A. It should be clear from (4) and (5) that πjk(E0, . . . , En) is a function of πij(E0, . . . , Ej, . . . , En) for Ej Ej, where node i is a parent of node j. The latter deﬁnes a recursion since to compute πij(E0, . . . , Ej, . . . , En) at an arbitrary upstream node i Pj, we use

πi(E0, . . . , Ei, . . . , Ej, . . . , En) Ei Ei

as the characteristic function of the corresponding game at node i, as opposed to

πi(E0, . . . , Ei, . . . , Ej, . . . , En).

In other words, all other edges are not assumed to be active within a recursion, but only at the initialization. The pseudocode in Algorithm 1 formalizes (and clariﬁes) this. When necessary, we will use the notation vj(Ej | E j) to highlight this diﬀerence, where E j := (E0, . . . , Ej 1, Ej+1, . . . , En) and Eℓ Eℓ ℓ N. Recall that the default notation vj( ) assumes all other edges E j to be active, i.e., vj( ) = vj( | E j) for all j N.

Axiomatic effect propagation in structural causal models

Each recursion breaks at node 0 via the following game. The set of players is E0 and we deﬁne this game conditioned on an arbitrary collection of downstream edges being active: (E1, . . . , En) where Eℓ Eℓ, ℓ= 1, . . . , n. For a given coalition E0 E0, we deﬁne the characteristic function to be the expected outcome with (E0, E 0) being the active edges (recall (3)):

v0(E0 | E 0) := y(E0, E 0).

Given this game, the ﬂow on edge (0, k) E0 is the corresponding SV:

π0k(E0, E 0) = X

E0 E0\{(0,k)} w E0(E0) {v0(E0 {(0, k)} | E 0) v0(E0 | E 0)} .

Observe that this deﬁnition is non-recursive and hence, it breaks every recursion. The ﬂow attributed to super-source edges E0 equals the SVs of the node 0 game deﬁned above but with all downstream edges being active, i.e., E 0 is set to E 0. Given the DAG structure, all of the n recursions are well-deﬁned and end up evaluating the expected outcome y(E) for various E E. We formally state our recursive deﬁnition of RSV in Algorithm 1 (and sub-routine Algorithm 2). Algorithm 1 outputs RSV: πRSV jk = πjk(E) (j, k) E. Note that the two inputs N and E provided to Algorithm 1 are assumed to include the super-source node 0 and the corresponding edges E0, i.e., N = {0, . . . , n} and E = (E0, . . . , En). Note that the (j, k) notation in the sub-script of RSVjk represents an input to Algorithm 2. Also, the notation E, E, and E is used to distinguish three diﬀerent sets of edges since the recursion primarily revolves around them.

Algorithm 1 RSV(N, E)

1: for (j, k) E

2: πRSV jk = RSVjk(E0, . . . , En)

4: return [πRSV jk ](j,k) E

Algorithm 2 RSVjk(E0, . . . , En)

1: if j > 0 % recursion

2: return P Ej Ej\{(j,k)} w Ej(Ej) P i Pj {RSVij(E0, . . . , Ej {(j, k)}, . . . , En) RSVij(E0, . . . , Ej, . . . , En)}

3: else % base case

4: return P E0 E0\{(0,k)} w E0(E0) {y(E0 {(0, k)}, E1, . . . , En) y(E0, E1, . . . , En)}

4.3 Flow-based axioms

Having deﬁned RSV, we now establish its desirability by proving uniqueness to a set of seemingly natural ﬂow-based axioms. The ﬁrst axiom is conservation of ﬂow (recall Deﬁnition 1 in 2). The next two axioms are ﬂow symmetry and ﬂow nullity, as touched upon

Singal and Michailidis

in 4.1. Intuitively, they require equivalent outgoing edges to receive equal ﬂow and a redundant edge to receive no ﬂow, respectively. The ﬁnal axiom advocates for ﬂow linearity, which we shed light upon after formally deﬁning these axioms.

Deﬁnition 3 (Flow-based axioms) The ﬂow-based axioms are as follows:

1. Flow conservation: P k C0 π0k = y(2) y(1) and P i Pj πij = P k Cj πjk j N \ {0}.

2. Flow symmetry: For node j N, if (j, k) Ej and (j, ℓ) Ej are such that vj(Ej {(j, k)}) = vj(Ej {(j, ℓ)}) Ej Ej \ {(j, k), (j, ℓ)}, then πjk = πjℓ.

3. Flow nullity: For node j N, if vj(Ej {(j, k)}) = vj(Ej) Ej Ej \ {(j, k)}, then πjk = 0.

4. Flow linearity: For node j N, consider characteristic functions vj( ) and v j( ). Linearity requires πjk(vj + v j) = πjk(vj) + πjk(v j) (j, k) Ej. (Notation πjk(vj) captures the dependence of πjk on vj( ).)

Though conservation of ﬂow is stated slightly diﬀerently than before (Deﬁnition 1), it is easy to establish equivalence between the two. Flow linearity can be interpreted as follows. Note that the characteristic functions [vj( )]j N (deﬁned in 4.2) in the end correspond to the structural equations F as given arbitrary j N, vj( ) ends up evaluating v0( ), which ultimately is a function of F. Now, let [v j( )]j N correspond to a diﬀerent population of variables X, governed by a diﬀerent set of structural equations (say F ) and noise (say D ). Linearity is the requirement that the attribution should be robust to mixing such heterogeneous populations. In other words, one can either (1) mix the two populations ﬁrst (vj+v j) and compute attribution using the mixture (πjk(vj+v j)) or (2) compute attributions on the two populations ﬁrst (πjk(vj) and πjk(v j)) and mix them later (πjk(vj) + πjk(v j)). Under linearity, both (1) and (2) output the same attributions. It is possible to interpret ﬂow symmetry and nullity using the model primitive y( ). To do so, observe that that given node j N, if (j, k) Ej and (j, ℓ) Ej obey

y(E {(j, k)}) = y(E {(j, ℓ)}) E E \ {(j, k), (j, ℓ)},

then symmetry requires the two edges two receive same ﬂow, i.e., πjk = πjℓ. Similarly, if

y(E {(j, k)}) = y(E) E E \ {(j, k)},

then nullity requires edge (j, k) to receive zero ﬂow, i.e., πjk = 0. In Theorem 4, we establish RSV is the unique solution to these four ﬂow-based axioms, with a proof in Appendix C.

Theorem 4 Given structural causal model M = (G, F, D), the RSV [πRSV jk ](j,k) E deﬁned via Algorithm 1 is the unique solution to the ﬂow-based axioms.

A direct corollary of Theorem 4 is that RSV overcomes the source eﬃciency violation of most existing approaches discussed in 3. In fact, Theorem 4 implies that if one believes in the seemingly natural ﬂow-based axioms, then RSV is the only metric they should use to quantify eﬀect propagation (since it is the unique solution). To showcase the desirability of the ﬂow-based axioms, we illustrate them on a DAG equipped with linear structural equations F (Example 3).

Axiomatic effect propagation in structural causal models

Figure 13: The graph for Example 3. (Node 0 is not included for brevity.)

Example 3 (Linear SEM) Consider the graph in Figure 13. There is one source variable X1. The structural equations are linear in X and have an additive noise U:

X2 = a12X1 + U2 X3 = a13X1 + U3 Y = a14X1 + a24X2 + a34X3 + U4.

Suppose the noise is mean-zero, i.e., E[U] = 0 and consider the background x(1) 1 = 0 and the foreground x(2) 1 = 1. Then, the background and foreground expected outcomes equal:

y(2) = a14 + a12a24 + a13a34.

RSV outputs the following ﬂow:

πRSV 12 = πRSV 24 = a12a24 πRSV 14 = a14 πRSV 13 = πRSV 34 = a13a34.

Interpreting conservation of ﬂow is straightforward (ﬂow in equals ﬂow out). To see the intuition behind symmetry, let a12 = a13 and a24 = a34. Then, the change at node 1 is propagated identically through edges (1, 2) and (1, 3). As a result, these two edges receive the same ﬂow. To understand nullity, consider an arbitrary (j, k) E and let ajk = 0. Clearly, such as edge is redundant in terms of eﬀect propagation and RSV attributes zero ﬂow to it. Finally, to see the mechanics behind linearity, consider a diﬀerent model with the same underlying DAG G but suppose the coeﬃcients are diﬀerent: [a jk](j,k) E. RSV under the mixture model ( a = (a + a )/2) is the same as averaging the RSVs under the individual models. For instance, edge (1, 4) receives a ﬂow of (a14 + a 14)/2 under the mixture model and a14 and a 14 under the individual models.

The above example also highlights RSV s ability to decouple direct and indirect eﬀects (Pearl, 2001). In particular, the direct eﬀect of the source X1 on the outcome Y corresponds to the ﬂow a14 on edge (1, 4) whereas the indirect eﬀect is the sum of ﬂow on edges (1, 2) and (1, 3): a12a24 + a13a34. In addition, RSV goes a step beyond by splitting the indirect eﬀects into the underlying edge-speciﬁc ﬂows and thus, facilitates mediation analysis: a mediating variable transmits the eﬀect of an independent variable on a dependent variable

Singal and Michailidis

(Mac Kinnon et al., 2007). Connecting this notion to Figure 13, the independent variable X1 changes from a value of 0 (background) to 1 (foreground), causing Y (the dependent variable) to change from an expected value of 0 to a14 + a12a24 + a13a34 (total eﬀect). RSV provides a crisp decomposition of this total eﬀect. In particular, the direct channel (X1 Y ) carries a14 whereas the two mediating channels (X1 X2 Y ) and (X1 X3 Y ) transmit a12a24 and a13a34, respectively. (Note that it is possible to generalize the intuition here to an arbitrary DAG G with linear structural equations F and mean-zero noise. We do so in 5 (Theorem 6).) Before concluding this subsection, we note that RSV s adherence to the ﬂow-based axioms results in it respecting a mix of desirable properties (implementation invariance, sensitivity, monotonicity, and aﬃne scale invariance) discussed in the interpretable ML literature (Sundararajan et al., 2017; Sundararajan and Najmi, 2020). For instance, implementation invariance requires the attributions to be robust to internal changes in the graph whereas sensitivity / monotonicity require the attributions to respect basic independence and monotonicity relations. Our preliminary work (Singal et al., 2021) provides a detailed discussion of such additional properties along with formal propositions, which we omit here for brevity. We emphasize here that such properties come out naturally as implications of our more fundamental ﬂow-based axioms.

4.4 Connections with existing approaches in the causality literature

Having established RSV s desirable properties, we now highlight connections between RSV and existing approaches on causality. As before, we refer the reader to our preliminary work (Singal et al., 2021) for the connections to the interpretable ML literature, where we rigorously show how the edge-based RSV generalizes a number of existing node-based approaches.

Direct and indirect eﬀects To understand the connection between natural direct / indirect eﬀects (Pearl, 2001) and RSV, we revisit the two-channeled graph introduced in Figure 2 and illustrate the computation of direct eﬀect under RSV in Figure 14. When computing the direct eﬀect, i.e., attribution to edge (1, 3), RSV allows for two possibilities with respect to the indirect channel: (a) edge (1, 2) inactive and (b) edge (1, 2) active. In particular, RSV averages over these two possibilities, allowing it to decouple the interaction eﬀect (which only appears in possibility (b)). On the other hand, natural direct eﬀect (Pearl, 2001) only considers possibility (a) and hence, is unable to capture the interaction. The connection between indirect eﬀect under the two deﬁnitions is analogous. We exemplify this discussion by revisiting Example 1 next.

Revisiting Example 1 Recall from Example 1 (interaction) that a key limitation of the proposal in Pearl (2001) is it does not explain interaction eﬀects, which leads to a source eﬃciency violation (both channels receiving zero attribution in Example 1). RSV, on the other hand, obeys source eﬃciency (cf. ﬂow conservation axiom). In Example 1, RSV attributes 1/2 to each of the two channels (cf. ﬂow symmetry) and hence, captures the interaction eﬀect. Thus, by allowing for a richer set of counterfactuals, RSV is able to decouple interaction eﬀects in a principled manner, ensuring the direct and indirect eﬀects add up to the total eﬀect. Further, recall from Appendix B that although the notion of

Axiomatic effect propagation in structural causal models

(a) Edge (1,2) inactive

(b) Edge (1,2) active

Figure 14: Direct eﬀect under RSV for the graph introduced in Figure 2. The direct channel (1, 3) receives an attribution equal to its value-add, which is evaluated via the following counterfactual question: how much would have been the eﬀect (change in Y ) had edge (1, 3) been inactive? Note that there are two possibilities in such a counterfactual question: (a) edge (1, 2) is inactive and (b) edge (1, 2) is active. The diﬀerence between the ﬁrst two plots corresponds to possibility (a), i.e., value-add of edge (1, 3) when edge (1, 2) is inactive. Similarly, the diﬀerence between the last two plots corresponds to possibility (b), i.e., value-add of edge (1, 3) when edge (1, 2) is active. The attribution received by edge (1, 3) is the weighted average of these two value-adds, where the weights come from the classical SV (1/2 for this instance). Recall that the natural direct eﬀect of Pearl (2001) places all the weight on possibility (a).

reverse eﬀects obeys eﬃciency, it lacks uniqueness since it results in either the direct or the indirect channel getting all the attribution depending on the order in which one adds the channels. RSV avoids this issue by averaging over these two orderings.

Path-based techniques The connection between path-speciﬁc eﬀects (Pearl, 2001) / path-speciﬁc selection gradient (Henshaw et al., 2020) and RSV is similar. In particular, to quantify the eﬀect propagated through the direct path X1 Y , both path-speciﬁc eﬀects and selection gradient put all the weight on possibility (a) and hence, do not account for the interaction eﬀect, which only appears in possibility (b).

Remark 5 It is worth noting that under the class of linear SEMs, a multitude of existing approaches (direct and indirect eﬀects, path coeﬃcients, path-speciﬁc eﬀects, and pathspeciﬁc selection gradient) result in the same intuitive quantiﬁcation as discussed around Figure 1. In particular, all these approaches boil down to using the edge weights to compute the corresponding path coeﬃcients and attribute accordingly 6. As we formalize in 5 (Theorem 6), for linear SEMs, RSV recovers the same ﬂow. Accordingly, RSV is consistent with most existing techniques in the relatively simple linear setting but provides a new perspective under non-linearity, where it is systematically able to account for interaction eﬀects.

Degree of responsibility RSV operationalizes the intuition of Chockler and Halpern (2004) (see the quotes if someone wins ... and ... responsibility is more diﬀuse stated in 3.3) while adhering to a set of desirable axioms, which imply source eﬃciency. To illustrate this, we revisit Example 2.

6. Though some of these approaches are not clearly deﬁned for the setting with multiple source nodes and / or multiple mediators, the implicit decoupling in the linear SEM case (due to no interaction eﬀects) enables their computations.

Singal and Michailidis

Revisiting Example 2 Recall that a key limitation of degree of responsibility is it violates source eﬃciency as well, as shown in Example 2 (voting). RSV, on the other hand, attributes 1/3 to each of the 3 voters in scenario 1 whereas in scenario 2, RSV attributes 2/3 to each of the 2 voters of candidate 1 and 1/3 to the voter of candidate 2. If, in scenario 2, we consider the background of (0, 0, 2) (as opposed to (0, 0, 0)) while keeping the foreground as (1, 1, 2), then voters 1 and 2 receive an attribution of 1/2 each, whereas voter 3 receives zero attribution (nullity).

Blameworthiness It is worth discussing the work of Friedenberg and Halpern (2019), which is complementary to our work in the sense that they propose a deﬁnition of blame in a structural causal model. Their key contribution is to extend the framework of Halpern and Kleiman-Weiner (2018) to a multi-agent setting and they do so using Shapley value. Saying that, their deﬁnition is fundamentally diﬀerent than ours since it depends on the notion of an epistemic state (distribution over causal models) to capture an agent s beliefs about the eﬀects that actions may have . RSV is independent of the individual-level beliefs but only operates on a single given causal model. To dive deeper, consider the illustrative example presented by Friedenberg and Halpern (2019) (see 3.4 of their paper):

Consider a scenario where a committee of 7 people, ag1 through ag7, vote for whether or not to pass a bill. If at least 4 agents vote yes, then the bill will pass. Everyone agrees that it would be better for the bill to pass, but there are external reasons (such as opinions of constituents) that might result in agents beneﬁting from voting no as long as the bill is passed. The committee votes and agents ag1 through ag5 all vote no, so the bill does not pass. How blameworthy is each agent for this outcome?

As discussed in Friedenberg and Halpern (2019), the degree of blameworthiness varies as a function of the agent s beliefs. For instance,

ag1 believed that each of the 6 other agents started with a 60% chance of voting yes. For any coalition of n agents, ag1 also believed that for a cost of n 100 each agent s probability of voting yes (including that of agents not in the coalition) could be increased by n 5% by applying social pressure. In addition, if ag1 herself was in the coalition, then for an additional cost of 2000 she would have switched her vote to yes. Given these beliefs, the degree of blameworthiness for the entire group is 0.390, while ag1 s degree of blameworthiness is 0.073.

On the other hand, RSV is independent of individual-level beliefs and simply attributes the same value to the 5 agents who voted no (cf. symmetry axiom). Degree of blameworthiness requires much more information (individual-level beliefs) than RSV, which can be challenging to gather. Clearly, there is a fundamental diﬀerence in how the two approaches attribute ex post blame.

We conclude this section by summarizing our foundational developments. So far, we have focussed on the deﬁnitional / philosophical aspects of ﬂow-based attribution. Our goal has been to convince the reader that leaving computational tractability aside, RSV is an attractive metric for quantifying eﬀect propagation. It recovers the existing approaches when appropriate and overcomes their limitations when they fail , with a unique adherence

Axiomatic effect propagation in structural causal models

to a set of seemingly natural ﬂow-based axioms. Having established such desirability of RSV, we shed light on its computational tractability next.

5. Characterization and computation of RSV

Next, we focus on the computational aspects of RSV and consider the following two cases: (i) all the structural equations F = [fi( )]i N+\N0 in model M = (G, F, D) are linear ( 5.1) and (ii) F is non-parametric and hence, possibly non-linear ( 5.2). For both settings, we provide a closed-form characterization of RSV and discuss the implications on its computation.

5.1 Linear SEM

Consider an arbitrary DAG G (containing super-source node 0) with linear structural equations F and mean-zero noise:

i Pj aij Xi + Uj j N+ \ {N0 0} (6a)

E[U] = 0. (6b)

Note that this is a classical setting for structural equations in DAGs (see Peters et al. (2017) for example). Denote by Ein j := {(i, j) : i Pj} the incoming edges of node j N+ \ {0}. Deﬁne the forward-looking weights c as follows:

cj,n+1 := aj,n+1 (j, n + 1) Ein n+1 (7a)

cij := aij X

k Cj cjk (i, j) E \ Ein n+1, (7b)

where a0j := x(2) j x(1) j (0, j) E0. Similarly, deﬁne the backward-looking weights b as follows:

b0j := x(2) j x(1) j (0, j) E0 (8a)

i Pj bijajk (j, k) E \ E0. (8b)

Since G is a DAG, weights (7) and (8) are well-deﬁned. In such a linear setting, RSV can be characterized as in Theorem 6.

Theorem 6 Consider structural causal model M = (G, F, D) , wherein the structural equations F are linear and the noise distribution D has zero mean (see (6)), with c and b as in (7) and (8). Then,

πRSV 0j = c0j (0, j) E0

πRSV jk = X

i Pj bijcjk (j, k) E \ E0.

A proof is given in Appendix D and illustrations are provided in Example 3 ( 4.3) and the motivating example ( 1). Furthermore, this characterization establishes a tight connection between RSV and most existing approaches, as alluded to in Remark 5 earlier.

Singal and Michailidis

Remark 7 (Computational implication of Theorem 6) Given the characterization in Theorem 6, it follows that RSV can be computed in linear time and exhibits linear space complexity. Note that topological sorting of a DAG requires O(|N|+|E|) time. Given a topologically sorted graph, c can be computed in O(|N|+|E|) time by proceeding in a reverse topological order and storing local computations at each node (O(|N|) additional space):

k Cj cjk j N.

Similarly, b can be computed in O(|N|+|E|) time. Finally, given c and b, RSV can be computed in O(|N|+|E|) time by using the characterization in Theorem 6. This highlights the tractability of RSV for linear models.

5.2 Non-parametric SEM

We now characterize RSV for an arbitrary set of structural equations (possibly non-linear). We start by illustrating the intuition of the derivation, through the DAG depicted in Figure 15. Our discussion highlights the role of paths in the DAG and the notion of sister edges that prove critical in the calculation of RSV.

0 X1 X2 X3 X4 X5 Y

Figure 15: DAG for illustrating the path-based characterization of RSV.

Pick an arbitrary edge (say the thick blue edge (3, 4) in Figure 15) and suppose we are interested in computing the corresponding RSV πRSV 34 (E), with E = (E0, . . . , E5) denoting the set of all edges. By deﬁnition (recall from 4.2), πRSV 34 (E) is the SV of the following game. The set of players is E3 = {(3, 4), (3, 5), (3, 6)} and given coalition E3 E3, the characteristic function is given by

v3(E3) = π3(E0, E1, E2, E3, E4, E5)

= π3(E3, E 3)

= π13(E3, E 3) + π23(E3, E 3).

Hence, πRSV 34 (E) equals

E3 E3\{(3,4)} w E3(E3) {v3(E3 {(3, 4)}) v3(E3)}

w E3(E3) {v3(E3 {(3, 4)}) v3(E3)}

w E3(E3) {π13(E3 {(3, 4)}) π13(E3)}

Axiomatic effect propagation in structural causal models

w E3(E3) {π23(E3 {(3, 4)}) π23(E3)}

where we use the short-hand notation Ej i := Ei \ {(i, j)} for arbitrary node i N+ and its child node j Ci. Next, we analyze the ﬁrst term in (9), i.e., ( ). Given E3 E3, observe that π13(E3) denotes the ﬂow (RSV) on edge (1, 3) when only E3 edges are active at node 3, i.e.,

π13(E0, E1, E2, E3, E4, E5) = X

w E1(E1) {v1(E1 {(1, 3)} | E3) v1(E1 | E3)}

w E1(E1) {π01(E1 {(1, 3)}, E3) π01(E1, E3)} . (10)

Plugging (10) into ( ), we get

w E3(E3) {π13(E3 {(3, 4)}) π13(E3)}

w E1(E1) {π01(E1 {(1, 3)}, E3 {(3, 4)})

π01(E1, E3 {(3, 4)})} X

w E1(E1) {π01(E1 {(1, 3)}, E3) π01(E1, E3)}

w E1(E1)w E3(E3) {π01(E1 {(1, 3)}, E3 {(3, 4)}) π01(E1, E3 {(3, 4)})

π01(E1 {(1, 3)}, E3) + π01(E1, E3)} .

Now, given E1 E1 and E3 E3, observe that π01(E1, E3) equals π01(E0, E1, E2, E3, E4, E5), which equals

w E0(E0) {v0(E0 {(0, 1)} | E1, E3) v0(E0 | E1, E3)}

w E0(E0) {y(E0 {(0, 1)}, E1, E2, E3, E4, E5) y(E0, E1, E2, E3, E4, E5)} , (12)

where y(E) is the expected outcome as a function of the set of active edges E E (see (3)). Plugging (12) into (11) and re-arranging gives us

w E0(E0)w E1(E1)w E3(E3)

{y(E0 {(0, 1)}, E1 {(1, 3)}, E3 {(3, 4)}) y(E0 {(0, 1)}, E1 {(1, 3)}, E3) y(E0 {(0, 1)}, E1, E3 {(3, 4)}) y(E0, E1 {(1, 3)}, E3 {(3, 4)}) + y(E0 {(0, 1)}, E1, E3) + y(E0, E1 {(1, 3)}, E3) + y(E0, E1, E3 {(3, 4)}) y(E0, E1, E3)} .

Singal and Michailidis

There is a total of eight y( ) terms in (13). Observe the following pattern among them. The ﬁrst term y(E0 {(0, 1)}, E1 {(1, 3)}, E3 {(3, 4)})

includes each of the three edges {(0, 1), (1, 3), (3, 4)} (3 choose 3) and has a positive sign in front of it. Each of the next three terms

y(E0 {(0, 1)}, E1 {(1, 3)}, E3),

y(E0 {(0, 1)}, E1, E3 {(3, 4)}), and

y(E0, E1 {(1, 3)}, E3 {(3, 4)})

include two of the three edges (3 choose 2) and has a negative sign. The next three terms

+ y(E0 {(0, 1)}, E1, E3),

+ y(E0, E1 {(1, 3)}, E3), and

+ y(E0, E1, E3 {(3, 4)})

include one of the three edges (3 choose 1) and are positive. The last term

y(E0, E1, E3)

includes none of the three edges (3 choose 0) and is negative. The previous illustration reveals certain patterns that hold in the general case. First, we introduce concise notation to summarize the previous calculation. Clearly, ( ) corresponds to the path 0 1 3 from source node 0 to node 3. Accordingly, deﬁne W := (0, 1, 3) to be the corresponding (ordered) set of nodes in this path. We use the short-hand X

E013 E134 013

κ013(E013) := X

w E0(E0)w E1(E1)w E3(E3). (14)

for the summation over the weights. In the illustrative example we have here, E134 013 := E1 0 E3 1 E4 3, E013 := (E0, E1, E3), and κ013(E013) := w E0(E0)w E1(E1)w E3(E3). Next, in the general case, given path W, we use the notation X

EW EW {j}\{0} W

where j denotes the end node (4 in the above example) of the edge we are interested in ((3, 4) in the example). As a side note, observe that X

EW EW {j}\{0} W

κW (EW ) = 1 (15)

since each of the w Ek(Ek) term decouples and sums to 1 (by deﬁnition of the SV weight function). The latter fact is useful for developing a Monte-Carlo estimation scheme ( 6). Next, we express the 23 y( ) terms in ( ) (recall (13)). Recycling W = (0, 1, 3) and

Axiomatic effect propagation in structural causal models

deﬁning Edge(W) := {(0, 1), (1, 3)} to be the set of edges in path W, these y( ) terms can be expressed as follows (cf. the pattern discussed above): X

V Edge(W {4}) ( 1)|W| |V | y(EW V, E W ).

In summary, given path W = (0, 1, 3) corresponding to ( ), we have

EW EW {4}\{0} W

V Edge(W {4}) ( 1)|W| |V | y(EW V, E W ). (16)

Similarly, ( ) from (9) equals

EW EW {4}\{0} W

V Edge(W {4}) ( 1)|W| |V | y(EW V, E W ), (17)

where the only diﬀerence is that path W equals (0, 1, 2, 3), instead of (0, 1, 3). Putting (16) and (17) together, we get

πRSV 34 (E) = X

EW EW {4}\{0} W

V Edge(W {4}) ( 1)|W| |V | y(EW V, E W ),

where W3 := {(0, 1, 3), (0, 1, 2, 3)} is the set of all unique paths from node 0 to node 3. As we prove in Appendix E, this characterization holds in general and we state it formally in Theorem 8.

Theorem 8 Given structural causal model M = (G, F, D), the RSV of edge (i, j) E exhibits the following characterization:

πRSV ij (E) = X

EW EW {j}\{0} W

V Edge(W {j}) ( 1)|W| |V | y(EW V, E W ).

Hence, the recursion in RSV (recall Algorithms 1 and 2) can be boiled down to a pathbased expression, wherein we sum over each unique path W Wi from node 0 to node i. For each such path W, we consider the subset of edges EW that excludes the edges in this path, i.e., EW contains the sister edges of the edges in W. We weigh EW by a corresponding factor κW (EW ) and evaluate the value-add of all possible subsets V of the edges in W (to EW ). That is, how much value the edges in the path W add to their sister edges EW .

Remark 9 (Computational implication of Theorem 8) It should be clear from the characterization in Theorem 8 that in the worst case, computing RSV takes an exponential number of operations in the number of edges. For example, consider the characterization for edge (5, 6) in Figure 15 and focus on the path W = (0, 1, 2, 3, 4, 5). The P

EW EW {j}\{0} W sum expands into roughly 2|E| terms (all possible subsets of E). Since a dense DAG has O(n2) edges (where n is the number of nodes), we get the worst-case run-time to be O(2n2). Further, techniques such as dynamic programming can not reduce this worst-case runtime since each of the O(2n2) term can be unique in general.

Singal and Michailidis

Naturally, such a high runtime is undesirable and the following two questions are of interest:

1. Can we develop a computationally tractable characterization for a special class of models?

2. Can we estimate RSV and reduce the exponential dependence on the number of edges?

For the ﬁrst question, Theorem 6 provides a positive answer for the class of linear SEMs. However, as we discuss in Appendix F, developing a tractable characterization even for a linear model with two-way interactions 7 is challenging. Therefore, we focus on developing an estimation scheme for RSV next, which leverages the path-based characterization in Theorem 8.

6. Estimation of RSV

Our estimation scheme along with its properties (unbiasedness, rate of decay of its variance, and sample complexity) is discussed in 6.1 followed by a numerical illustration in 6.2. Note that the model M is assumed to be known (as it is throughout the paper) and our focus here is on estimating the RSV (for a given model M).

6.1 Estimation scheme and properties

Suppose we are interested in estimating πRSV ij (E) for an arbitrary edge (i, j) E. Recalling the path-based characterization (Theorem 8), consider an arbitrary path W Wi. The key computational bottleneck is the sum P

EW EW {j}\{0} W since it enumerates all possible subsets of the sister edges . To estimate this sum, we leverage the fact that the kappa factor sums to 1 (recall (15)) and is non-negative (and hence, can be interpreted as a probability), which leads to a natural Monte-Carlo estimation scheme. Deﬁne

EW EW {j}\{0} W

V Edge(W {j}) ( 1)|W| |V | y(EW V, E W ) (19)

so that πRSV ij equals P W Wi µj W (cf. Theorem 8). We focus on estimating µj W , for which we propose the following strategy:

1. Instead of summing over all EW EW {j}\{0} W (which can be O(2n2)), sample EW with corresponding probability κW (EW ). To do so, we can sample using the independent structures in EW {j}\{0} W (each independent structure corresponds to a node in path W) and use the SV weight function (recall the relation in (14)) to leverage existing SV estimation techniques (Castro et al., 2009; Maleki et al., 2013; Castro et al., 2017; Mitchell et al., 2022). For example, given W = (0, 1, 3) in Figure 15, observe from

7. That is, f(Xj) = P

i Pj aj i Xi + P

i,i Pj aj ii Xi Xi ; arguably the closest non-linear model to a linear model, since the model remains linear in the coeﬃcients a and quadratic in the variables X. The superscripts in the coeﬃcients (e.g., j in aj i) do not represent exponents but are used to indicate the child node of the corresponding edge.

Axiomatic effect propagation in structural causal models

(14) that the sum P

EW EW {j}\{0} W κW (EW ) decomposes as follows ( independent

structures ): X

EW EW {j}\{0} W

κW (EW ) = X

E013 E134 013

w E0(E0)w E1(E1)w E3(E3)

Accordingly, we can ﬁrst sample E0 E1 0 using the SV weight w E0(E0), then E1 E3 1 using the SV weight w E1(E1), followed by E3 E4 3 using the SV weight w E3(E3) and ﬁnally, use (E0, E1, E3) as a sample EW . Since each of the sampling probabilities (w E0(E0), w E1(E1), and w E3(E3)) corresponds to the SV weight function, we can use existing Monte-Carlo techniques to generate such samples of E0, E1, and E3 (Castro et al., 2009; Maleki et al., 2013; Castro et al., 2017; Mitchell et al., 2022). As one possibility, observing that for an arbitrary function h( ),

w E1(E1) h(E1) = X

|E1|! (|E1| |E1| 1)!

|E1|! h(E1) [|E3 1|= |E1| 1]

1 |E3 1| k X

{E1 E3 1:|E1|=k} h(E1),

we can sample a coalition size ﬁrst from a uniform distribution ( 1/|E1| factor) and then a coalition (among the coalitions of the sampled size) from a uniform distribution ( 1/(|E3 1| choose k) factor). That is, to sample E1 E3 1, we will ﬁrst sample the coalition size k Uniform({0, . . . , |E3 1|})

and then sample a coalition

E1 Uniform(E3 1(k))

from the set of all coalitions of this size, i.e., E3 1(k) := {E1 E3 1 : |E1|= k}. Note that the set E3 1(k) contains |E3 1| choose k elements. Fortunately, we do not need to enumerate all possible elements to sample E1. Instead, to generate a uniform sample of size k from E3 1, we can uniformly sample k elements from E3 1 without replacement. Repeating the process to sample E0 and E3 results in (E0, E1, E3), which is an unbiased sample of EW .

2. Use the sampled EW to compute the inner term in (19):

ˆµj W (s) := X

V Edge(W {j}) ( 1)|W| |V | y(EW V, E W ),

where s denotes the Monte-Carlo sample number (corresponding to the sample EW ).

Singal and Michailidis

3. Repeat steps 1 and 2 for s = 1, . . . , S (for relatively large S, the number of Monte Carlo samples) and return the average, which we denote by

s=1 ˆµj W (s).

Since πRSV ij equals P W Wi µj W , a natural estimate for πRSV ij is

ˆπRSV ij := X

W Wi ˆµj W .

We summarize our estimation scheme in Algorithm 3 along with the sub-routine Algorithm 4 to sample sister edges .

Algorithm 3 Estimating πRSV ij for a given edge (i, j) E using S Monte-Carlo samples

1: for W Wi 2: for s = 1, . . . , S

3: EW = sample sister edges(W {j})

4: ˆµj W (s) = P V Edge(W {j})( 1)|W| |V | y(EW V, E W )

7: ˆµj W = 1

S PS s=1 ˆµj W (s)

8: return ˆπRSV ij = P W Wi ˆµj W

Algorithm 4 sample sister edges(W {j})

1: for node h W

2: k Uniform({0, . . . , |Eh| 1})

3: Eh Uniform(Eh+ h (k)) % h+ denotes the node that follows h in path W {j}

5: return EW = (Eh)h W

By construction of our Monte-Carlo scheme, ˆπRSV ij is an unbiased estimator of πRSV ij and its variance decays at a rate of 1/S. It is also possible to provide a high probability concentration bound on the quality of our estimate. We summarize such properties in Proposition 10, with a proof in Appendix G.

Proposition 10 For edge (i, j) E, ˆπRSV ij is an unbiased estimator of πRSV ij with variance decaying at a rate of 1/S. Furthermore, for all t 0,

P πRSV ij ˆπRSV ij t 2 exp βij St2

for some βij > 0 ﬁnite. (The probability measure P{ } here denotes the Monte-Carlo sampling distribution of the estimator ˆπRSV ij .)

Axiomatic effect propagation in structural causal models

An exponentially decaying sample complexity (w.r.t. number of samples S) is highly desirable. The proposed Monte-Carlo scheme tackles the key computational bottleneck in the path-based characterization, i.e., the sum P

EW EW {j}\{0} W . Though this eliminates

the O(2n2) dependence, we still require O(2n) number of operations in the worst-case, since each of the remaining two sums P W Wi and P V Edge(W {j}) can be such. We note that it is possible to develop polynomial-time Monte-Carlo estimation schemes to estimate both the sums via techniques such as importance sampling. For example, to estimate the sum P W Wi, we can multiply and divide by |Wi| and use the probability weight 1/|Wi| to sample W. This requires us to sample uniformly from Wi, which can be done in linear time and space (see Appendix H for one procedure that does so). Similar desirable properties such as in Proposition 10 apply to such sequential Monte-Carlo estimation as well. Furthermore, it is worth mentioning that estimating SV is an active area of research (see Mitchell et al. (2022) for a recent work). Given our estimation scheme for RSV involves sampling SV coalitions ( sister edges ) as a sub-routine (recall Algorithm 4), such advances can be leveraged to further enhance our strategy.

6.2 Numerical illustration of Algorithms 3 and 4

We consider a fully connected layered graph with L layers and M nodes per layer (see Figure 16 for example). We assume each non-source node to be a product of all its parents, e.g., X4 = X1X2X3 (non-linear SEM) in Figure 16. In general,

i Pj Xi j N+ \ {N0 0}.

For each source node in N0, we use a background and foreground of 0 and 1, respectively. This stylized setup enables us to know the exact RSV without resorting to a brute force calculation. In particular, the axioms of ﬂow conservation and ﬂow symmetry prove useful. For example, in Figure 16, the eﬀect (change in Y ) equals 1, with each outgoing edge of node 0 receiving a ﬂow of 1/3, each outgoing edge in layer 1 (edges coming out of nodes 1, 2, and 3) receiving a ﬂow of 1/9, and each edge in the last layer receiving 1/3. In general, given the setup, the exact RSV equals 1/M for the edges coming out of node 0 and going in to node Y , whereas it equals 1/M2 for all other edges. Knowing the exact RSV is useful for our numerical exercise since it enables us to evaluate the quality of our estimate. (Though not the focus of this subsection, note that essentially all existing approaches attribute zero to every edge or path, again highlighting their limitation to explain the total eﬀect in the presence of interactions.) The goal is to evaluate the quality of the proposed RSV estimation scheme given a layered DAG of size (L, M). For ease of presentation, we focus on the top-right edge in the DAG (e.g., the blue thick edge (4, Y ) in Figure 16). Note that this edge is most computationally demanding, since compute time grows with the depth of the DAG. Figure 17a depicts the Monte-Carlo distribution of the estimate (white bars) and the exact RSV (blue vertical line) for a layered graph with (L, M) = (3, 3) for S {100, 200, . . . , 900} Monte-Carlo samples. The distribution is based on 100 runs for each value of S. It can be seen that mean estimate is very close to the theoretical RSV of 1/3, even for low values of S, which is in accordance with the unbiased nature of the estimate. Further, the vari-

Singal and Michailidis

Figure 16: A layered graph with L = 2 layers and M = 3 nodes per layer. For conciseness, we always evaluate the quality of our estimate on the top-right edge in the graph (e.g., the blue thick edge (4, Y ) in this ﬁgure).

ance decays as S increases, as seen in Figure 17b and the rate of decay is 1/S as stated in Proposition 10.

(a) Estimator distribution

0 200 400 600 800 1000 Number of Monte-Carlo samples S

Variance in the estimate

0.73 0.59 0.51 0.41 0.35 0.33

Data 1/S fit

(b) Variance as a function of S

Figure 17: Numerical illustration of our estimation scheme for (L, M) = (3, 3). In plot (a), we show the Monte-Carlo distribution of our estimate (white bars) and the exact RSV (blue vertical line) for S {100, 200, . . . , 900} Monte-Carlo samples. The exact RSV (of the top-right edge) equals 1/M, which equals 1/3 in this case. At the top of each subplot, we report the mean and the standard deviation of the corresponding distribution. In plot (b), we show the variance of the Monte-Carlo estimate as a function of samples S. The black dots correspond to the square of the standard deviation numbers reported in plot (a) (at the top of each subplot). The dotted red line is a regression ﬁtted on the black dots (y = β

x), highlighting that the variance decays at a rate of 1/S.

In Figures 18 and 19, we report the results for M = 5 and M = 7, respectively (while holding L = 3). All previous ﬁndings regarding the unbiased nature of the Monte-Carlo estimate and the rate of decay of its variance as a function of S continue to hold. Results from experiments involving an increasing number of layers from L = 3 to L {4, 5} (for M {3, 5, 7} as above) are given in Appendix I. The additional experimental results

Axiomatic effect propagation in structural causal models

highlight the robustness of the proposed estimation scheme to the number of nodes per layer and number of layers, respectively.

Note that even for (L, M) = (3, 7), a brute force calculation of the RSV requires a long computing time, since the latter is exponential in the number of edges which is around 100. On the other hand, for the same setting, the Monte-Carlo estimate takes 10 seconds for S = 100 and 90 seconds for S = 900, which highlights the computational beneﬁt to estimate the P

EW EW {j}\{0} W sum. Note that the compute time of our estimation scheme

increases (exponentially) in the depth L (e.g., around 7 hours for (L, M) = (5, 7) with S = 900), primarily due to the other two sums in the path-based characterization (P W Wi and P V Edge(W {j})). As previously mentioned, it can be controlled via importance sampling enhancements. Also worth noting, the Monte-Carlo estimate is trivially parallelizable and the compute times can be signiﬁcantly reduced in multi-core environments.

(a) Estimator distribution

0 200 400 600 800 1000 Number of Monte-Carlo samples S

Variance in the estimate

0.47 0.43 0.37 0.3 0.28 0.25

Data 1/S fit

(b) Variance as a function of S

Figure 18: Numerical illustration of our estimation scheme for (L, M) = (3, 5). Note that the exact RSV (of the top-right edge) equals 1/M, which equals 1/5 in this case (blue vertical line in plot (a)). The estimator s mean is close to 0.2 for all values of S and the variance decays at a rate of 1/S.

Singal and Michailidis

(a) Estimator distribution

0 200 400 600 800 1000 Number of Monte-Carlo samples S

Variance in the estimate

0.36 0.29 0.26 0.22 0.17 0.14 0.13

Data 1/S fit

(b) Variance as a function of S

Figure 19: Numerical illustration of our estimation scheme for (L, M) = (3, 7). Note that the exact RSV (of the top-right edge) equals 1/M, which equals 1/7 in this case (blue vertical line in plot (a)). The estimator s mean is close to 0.14 for all values of S and the variance decays at a rate of 1/S.

7. Applications

We showcase the application of the RSV framework on two challenging problems in causality, namely, causal overdetermination in 7.1 and causal unfairness in 7.2 and benchmark it with appropriate existing methodologies.

7.1 Causal overdetermination

The rock-throwing example is a textbook application introduced by Hall (2004) that helps illustrate the robustness of RSV to model changes (implementation invariance) and compare it with existing methods on causality. In the words of Chockler and Halpern (2004), the setup is as follows:

Suppose that Suzy and Billy both pick up rocks and throw them at a bottle. Suzy s rock gets there ﬁrst, shattering the bottle.

Each throw is assumed to be perfectly accurate and hence, is suﬃcient but not necessary to shatter the bottle (causal overdetermination). A corresponding DAG is shown in Figure 20a with the following structural equation for the outcome BS (bottle shatters):

BS = I{ST = 1 or BT = 1}.

ST {0, 1} denotes whether Suzy throws (1) or not (0) whereas BT {0, 1} denotes whether Billy throws (1) or not (0), and I{ } denotes the indicator function. The background for the source variables (ST, BT) is (0, 0) and the foreground is (1, 1). As discussed in 1, in this work, we care about the cause of the bottle shattering from a blame / ex post perspective. In this simple model, the degree of responsibility (Chockler

Axiomatic effect propagation in structural causal models

(a) Model 1

(b) Model 2

Figure 20: Two models for the rock-throwing application.

and Halpern, 2004) of Suzy is 1/2 and of Billy is 1/2, which is consistent with RSV8. However, techniques such as direct and indirect eﬀects (Pearl, 2001), path-speciﬁc eﬀects (Pearl, 2001), and path-speciﬁc selection gradient (Henshaw et al., 2020) attribute 1 to both Suzy and Billy, resulting in an over-allocation, i.e., they attribute more than the total eﬀect. Note that in this example, such approaches give the same decomposition as RSV up to a normalization constant, but this is not true in general (cf. Example 1 in 3). Chockler and Halpern (2004) also discuss another way to model this setting (Figure 20b). They implicitly assume a context where Suzy throws ﬁrst, so there is an edge from SH [Suzy hits] to BH [Billy hits], but not one in the other direction . The underlying structural equations are as follows:

SH = I{ST = 1}

BH = I{BT = 1 and SH = 0}

BS = I{SH = 1 or BH = 1}.

Interestingly, the degree of responsibility for Suzy remains at 1/2, but for Billy drops to 0 under the restriction to allowable settings (Chockler and Halpern, 2004)9. Clearly, the decomposition does not add up to the total eﬀect of 1. In a more extreme setting with k 1 people throwing rocks in addition to Suzy (as opposed to just Billy), the degree of responsibility equals 1/k for Suzy and 0 for all others. As k , the degree of responsibility over all people sums to 0. To put it another way, if an inﬁnite number of people harm a person, then according to the degree of responsibility framework, no one is responsible! This is rather strange. There are two possible ways to interpret the dynamics here. In the ﬁrst one, to capture the fact that Suzy throws ﬁrst , we can model the change in source variables as a sequence of two changes, i.e., the source variables (ST, BT) change from (0, 0) to (1, 0) to (1, 1), as opposed to a direct change from (0, 0) to (1, 1). Under such a view, the outcome BS changes from 0 to 1 to 1 and it is easy to verify that all approaches attribute 1 to Suzy (as a result of the ﬁrst change) and 0 to Billy (as a result of the second change), irrespective

8. In this simple example, a standard Shapley value framing would suﬃce and hence, RSV can be seen as a generalization of SV to graphs with mediators. We discussed this in 1 as well. The fact that RSV recovers SV when appropriate should be seen in positive light. In fact, we formalized such connections in the preliminary version of this work; see Proposition 3 and Proposition B in Singal et al. (2021). 9. From 5 of Chockler and Halpern (2004): If Suzy s rock hits ﬁrst and it requires only one rock to shatter the bottle then, as we have seen, Suzy has degree of responsibility 1 or 1/2 (depending on whether we consider only allowable settings) and Billy has degree of responsibility 0.

Singal and Michailidis

of the underlying model one uses (Figure 20a or Figure 20b). However, this case does not highlight the diﬀerence between RSV and existing techniques. To understand the diﬀerence, we discuss the second interpretation next. Both Suzy and Billy throw the rocks at the same time, but Suzy s throw is faster than Billy s. Hence, the source variables (ST, BT) change from (0, 0) to (1, 1) without an intermediate value of (1, 0). Under this view, both model 1 (Figure 20a) and model 2 (Figure 20b) are correct, but model 2 is more ﬁne-grained. As discussed above, Suzy s degree of responsibility is 1/2 under both models, but Billy s degree of responsibility drops from 1/2 (model 1) to 0 (model 2). Despite the two models being equivalent (in terms of the relationship between the source variables and the outcome variable), the degree of responsibility changes. We posit that such lack of robustness to how the underlying mechanics are implemented is undesirable. As we established in our preliminary work (Proposition 1 in Singal et al. (2021)), RSV obeys implementation invariance, i.e., source variables receive the same attribution under equivalent models. For instance, RSV for both Suzy and Billy equals 1/2 under both models 10. Further, path-based approaches including path-speciﬁc eﬀects and path-speciﬁc selection gradient attribute 1 to both Suzy and Billy under both models, resulting in an over-allocation irrespective of the model one uses. In particular, paths ST SH BS and BT BH BS have a path-speciﬁc eﬀect of 1, whereas the only remaining path ST SH BH BS has a path-speciﬁc eﬀect of 0.

7.2 Causal unfairness

As alluded to in 1 (recall Figure 3), we next illustrate an application of RSV on nonlinear mediation analysis to quantify causal unfairness, which has been a topic of great interest recently (Kilbertus et al., 2017; Kusner et al., 2017; Chiappa, 2019). The goal is to understand the inﬂuence exerted by sensitive attributes on the outcome, through both fair channels (eﬀect mediated by resolving variables) and unfair channels. As we show next, RSV provides a crisp ﬂow-based decomposition of the total eﬀect even when multiple sensitive attributes (e.g., race and sex) change and the underlying dynamics are non-linear. Such a ﬂow-based accounting naturally quantiﬁes causal unfairness, with one possible context being an employee (or a class of employees) suing their employer for unjust treatment and the court trying to apportion legal responsibility to the employer.

Setup To illustrate causal unfairness, we use a graph structure motivated by real data, but assume intuitive relationships between the variables for ease of exposition. In particular, we focus on the DAG shown in Figure 21, which is motivated by the graph topologies discovered in recent works (Zhang et al., 2017; Wu et al., 2019), by employing the PC algorithm (Spirtes et al., 2000) on the adult dataset in the UCI repository (Dua and Graﬀ, 2017). Given source variables race (sensitive attribute #1), age, and sex (sensitive attribute #2), we assume the following non-linear structural equations (with noise):

( 4 log(age) + N(0, 32) if race = 0 5 log(age) + N(0, 32) if race = 1 (20a)

10. In the setting with k 1 people throwing rocks in addition to Suzy, every person s RSV equals 1/k, and hence, the sum equals the total eﬀect.

Axiomatic effect propagation in structural causal models

race age sex

Figure 21: DAG used to illustrate causal unfairness. There are three source variables (race, age, and sex), two of which are sensitive (race and sex). There are four non-source variables: edu stands for years of education, ms for marital status (1 for married and 0 for not), hrs for average hours spent working per day, and inc for whether the person s annual income is above a threshold (e.g., $50,000 in the adult dataset (Dua and Graﬀ, 2017)). The two red thick edges show instances of unfair channels / unresolved discrimination (overt sexism and racism) and the corresponding unfairness parameters a1 and a3 are highlighted in red in the corresponding structural equation (20d). Note that although the mediated pathways could be said to reﬂect systemic racism (e.g., unequal access to education), they can also map to no unfairness if the diﬀerence is due to voluntary behavior of the underlying population. A well-known example is the Berkeley admissions case study (Bickel et al., 1975) where one category of sex (sensitive attribute) voluntarily chooses to apply to a more competitive department (mediating variable) and hence, is less likely to be admitted (outcome variable). We discussed such an example in our preliminary version (see 6 in Singal et al. (2021)). Irrespective of whether the diﬀerence is systemic or voluntary, we note that RSV is well-deﬁned.

( 1 w.p. p(age, sex, edu) 0 w.p. 1 p(age, sex, edu) (20b)

hrs = 8 ms 0.05age + sex + N(0, 22) (20c)

( 1 w.p. 1 1+exp{ 0.1(age 50) 0.05edu 0.02hrs a1race a3sex} 0 w.p. 1 1 1+exp{ 0.1(age 50) 0.05edu 0.02hrs a1race a3sex}. (20d)

The ms probability p(age, sex, edu) in (20b) is governed by the following logistic equations such that sex = 1 population is 90% less likely to be married relative to the sex = 0 population with all else being equal (i.e., same age and edu):

p(age, sex, edu) =

( 1 1+exp{ 0.1(age 25)+0.03edu} if sex = 0

0.9 1+exp{ 0.1(age 25)+0.03edu} if sex = 1. (20e)

To provide intuition, we visualize these relationships in Figure 22. In Figure 22a, we plot the expected value of edu (from (20a)) as we vary age from 20 to 100 for race {0, 1}. The monotonicity and concavity is encoded to reﬂect reality and the race = 1 population is expected to have a higher level of education. Figure 22b depicts the ms probability (from (20b) and (20e)) as a function of age for three levels of edu with sex = 0. Again, the monotonicity with respect to age captures reﬂects reality (more likely to be married as age increases) and we further encode an inverse relation with edu. We assume a linear function

Singal and Michailidis

for hrs (see (20c)) with a baseline value of 8, which decreases by 1 if ms = 1 (married), decreases in age, and increases by 1 if sex equals 1. Finally, in Figure 22c, we show expected inc (from (20d)) as a function of age for two settings of the sensitive attributes: background (race(1), sex(1)) = (0, 0) and foreground (race(2), sex(2)) = (1, 1). In this plot, we set the unfairness parameters a1 and a3 equal to 1 and the intermediate variables (edu, ms, and hrs) are set according to the corresponding relationships in (20). To compute the expected outcome for a given instance of the source variables (race, age, and sex), we generate 10,000 samples of the noise, which we found to be large enough to provide a stable estimate. As before, the monotonicity with age is desirable. Further, the functional form stated above in (20d) makes it clear that the expected outcome is increasing in edu and hrs as well, and the level of unfairness increases as we increase a1 and a3.

20 40 60 80 100 age

race = 1 race = 0

(a) Education

20 40 60 80 100 age

p(age, 0, edu)

edu = 0 edu = 10 edu = 20

(b) Marital status

20 40 60 80 100 age

(race, sex) = (1, 1) (race, sex) = (0, 0)

Figure 22: Visualizing the underlying structural equations presented in (20).

Our goal is to understand causal unfairness, which relates to the change in the expected outcome as a result of changes in sensitive attributes. For instance, in Figure 22c, for age = 40, the expected outcome roughly equals 0.46 and 0.89 for the two scenarios of sensitive attributes. Accordingly, the delta corresponding to an (race, age, sex) background and foreground of (0, 40, 0) and (1, 40, 1) equals the diﬀerence 0.89 0.46 = 0.43. However, not all of this 0.43 can be credited as unfair since some of this eﬀect is propagated through mediators (edu, ms, and hrs), as seen in Figure 21. The question of interest is how to decompose the total eﬀect of 0.43 along the various edges / pathways in order to quantify causal unfairness.

Results In Figure 23a, we show the RSV 11 corresponding to the SEM of (20) with a background (race(1), age(1), sex(1)) = (0, 40, 0) and a foreground (race(2), age(2), sex(2)) = (1, 40, 1) and unfairness parameters (a1, a3) = (1, 1). Clearly, RSV obeys ﬂow conservation and thus, provides an exact decomposition of the total eﬀect of 0.43, which is in contrast to the over-allocation under path-speciﬁc eﬀects (Figure 23b) 12. The zero ﬂow attributed

11. Given the relatively small size of the DAG, we computed the RSV using Algorithm 1 (brute-force). With 10,000 noise samples (used to evaluate expected outcome), it took around 10 minutes on a 3.8 GHz 8-core Intel i7 machine with 16 GB memory to compute RSV for all the edges. 12. We benchmark RSV with the path-speciﬁc eﬀects approach, since it can be perceived as a generalization of direct and indirect eﬀects and is very similar to path-speciﬁc selection gradient in this setup. Further, it

Axiomatic effect propagation in structural causal models

to the edges coming out of age is an instance of the ﬂow nullity axiom, since those edges carry no new information (recall age has the same background and foreground).

eﬀect = 0.43

eﬀect = 0.43

(b) Path-speciﬁc eﬀects

Figure 23: RSV and path-speciﬁc eﬀects for the SEM of (20) with a background (race(1), age(1), sex(1)) = (0, 40, 0) and a foreground (race(2), age(2), sex(2)) = (1, 40, 1) and unfairness parameters (a1 a1 a1, a3 a3 a3) = (1, 1). Expected inc under the background equals 0.89 and under the foreground equals 0.46 and hence, the total eﬀect inc(2) inc(1) equals 0.43. Path-speciﬁc eﬀects add up to 0.54, which is more than the total eﬀect. The thickness of each edge is proportional to the ﬂow.

Hence, with a relatively large value of (a1, a3) = (1, 1), around 90% of the eﬀect is propagated via the unfair edges and the remainder of around 10% is mediated through resolving variables. Of course, these numbers would vary as we change the unfairness parameters a1 and a3. In Figure 24, we show the RSV for (a1, a3) {(0, 0), (0, 0.2), (0.2, 0), (0.2, 0.2)}. When (a1, a3) = (0, 0), it follows from (20d) that there is no causal unfairness since the sensitive attributes do not exert any direct inﬂuence on the outcome. In fact, the corresponding edges (race, inc) and (sex, inc) obey the ﬂow nullity axiom and as a result, they receive an attribution equal to 0 (see Figure 24a). The total eﬀect equals 0.05 and all of it is propagated through intermediate variables (0.04 by edu and 0.01 by hrs). When we increase a3 to 0.2 (see Figure 24b), edge (sex, inc) is no longer redundant and all of the 0.05 increment in the total eﬀect ﬂows through this unfair channel. A similar observation holds for the (a1, a3) = (0.2, 0) setting (see Figure 24c), with edge (race, inc) no longer being redundant and carrying the 0.05 increment in the total eﬀect. Finally, at (a1, a3) = (0.2, 0.2) (see Figure 24d), the total eﬀect jumps to 0.15, with each of the unfair channels carrying 0.05 units and the remainder 0.05 units being propagated by the mediating variables, as in the (a1, a3) = (0, 0) case. As it should be clear, there is nothing special about keeping the age variable at a constant value of 40 in both background and foreground. Our framework is ﬂexible to allow changes in age as well. In Figure 25, we show the RSV when the source variables change from a

is unclear how one would compute the degree of responsibility for this setting. Note that the path-speciﬁc eﬀects approach outputs a value corresponding to each source-to-sink path, and these path-speciﬁc values uniquely map to edge-speciﬁc values via simple aggregation, i.e., for each edge, consider all the paths it appears in and add those path values. We did such an aggregation in our results to facilitate comparison with RSV, e.g., in Figure 23.

Singal and Michailidis

eﬀect = 0.05

(a) (a1 a1 a1, a3 a3 a3) = (0, 0)

eﬀect = 0.10

(b) (a1 a1 a1, a3 a3 a3) = (0, 0.2)

eﬀect = 0.10

(c) (a1 a1 a1, a3 a3 a3) = (0.2, 0)

eﬀect = 0.15

(d) (a1 a1 a1, a3 a3 a3) = (0.2, 0.2)

Figure 24: RSV for the SEM of (20) with a background (race(1), age(1), sex(1)) = (0, 40, 0) and a foreground (race(2), age(2), sex(2)) = (1, 40, 1) and unfairness parameters (a1 a1 a1, a3 a3 a3) {(0, 0), (0, 0.2), (0.2, 0), (0.2, 0.2)}.

background of (0, 40, 0) to a foreground of (1, 45, 1), i.e., age changes as well (from 40 to 45). The unfair parameters are set at (a1, a3) = (0.2, 0.2), as in Figure 24d (to facilitate comparison). The expected outcome in the background equals 0.46 (as before) and in the foreground equals 0.73 and hence, the total eﬀect equals the diﬀerence 0.27, an increase of 0.12 when compared to the total eﬀect of 0.15 in Figure 24d. This increase of 0.12 in total eﬀect is essentially attributed to the outgoing edges of age, as seen in Figure 25. (Note that the outgoing edges of age are attributed 0.13, which is not exactly 0.12. This is primarily because we rounded to two decimal places and there is small noise in our numbers since we used 10,000 noisy samples to evaluate the expected outcome.)

Connecting this application back to our motivation discussed in 1, RSV is naturally able to quantify eﬀect propagation even when multiple source variables change simultaneously and the eﬀect is propagated through multiple mediators, that too in a non-linear manner. Further, it does so while obeying a set of four desirable ﬂow-based axioms, primarily because of its ability to consider a broader set of counterfactuals than existing approaches. On the contrary, approaches such as direct / indirect eﬀects and path-speciﬁc eﬀects only allow

Axiomatic effect propagation in structural causal models

eﬀect = 0.27

Figure 25: RSV for the SEM of (20) with a background (race(1), age(1), sex(1)) = (0, 40, 0) and a foreground (race(2), age(2), sex(2)) = (1, 45, 1) and unfairness parameters (a1, a3) = (0.2, 0.2). As opposed to before, the edges coming out of age are no longer redundant.

for the possibility of one counterfactual (all other edges being inactive), resulting in them violating arguably the most primitive axiom (ﬂow conservation). RSV can in fact handle more involved scenarios. As an illustration, we can have an unfair path from race to inc that goes through one of the intermediate variables (edu). As shown in Figure 26, we insert a new node denoting geographical location of the individual (loc {0, 1}), which depends on race and inﬂuences edu and inc. We modify (20) as follows:

( 1 w.p. p(race) 0 w.p. 1 p(race)

4 + I{race = 1} + I{loc = 1} | {z } new term

log(age) + N(0, 32)

modiﬁcation z }| { 0.9 + 0.1I{loc = 1} 1+exp{ 0.1(age 50) 0.05edu 0.02hrs a1X1 a3X3} 0 w.p. 1 0.9+0.1I{loc=1} 1+exp{ 0.1(age 50) 0.05edu 0.02hrs a1X1 a3X3},

where p(race) is parameterized by p [0, 1] such that p(1) = p and p(0) = p/10, i.e., race 1 is 10 times more likely to live in loc 1 (than race 0). The equation for edu is a generalization of that in (20) and simply boosts the edu level, if loc equals 1. As such, race 1 is more likely to live in the location with better access to education and the loc edu edge can be seen as the unfair channel that goes through an intermediate variable (edu). The unfairness here is due to a lower access to education in location 0. Further, we modify the equation for inc by changing the numerator from 1 to 0.9 + 0.1I{loc = 1}, so that location 1 results in a direct boost to the income as well. This might not be unfair, e.g., location 1 might have a higher cost of living and hence, the incomes are higher as well. All other components of the SCM remain the same as before13.

13. Note that in Figure 26, we do not show the edges coming out of age since the background and foreground values of age equal each other and hence, the corresponding edges will have an RSV of 0. As such, to

Singal and Michailidis

In addition to the unfairness discussed already (i.e., on edges race inc and sex inc), it is also of interest to understand the unfairness mediated through loc (i.e., on the edge loc edu). We provide such an understanding in Figure 26. RSV attributes 1/16th of the total eﬀect to such unfairness (0.01 out of 0.16) for p = 0.3 and 2/19th for p = 0.6. As we increase p from 0 to 1, the total eﬀect increases from 0.13 to 0.22. RSV values for all edges except the ones corresponding to loc do not change much. This aligns with intuition (as increasing p results in more eﬀect being propagated through loc). Accordingly, in Figure 27, we show how RSV evolves as a function of p for the two edges coming out of loc. For p = 0, both the links are attributed a value of 0, which makes sense (since they do not propagate any eﬀect when p = 0). The attributions increase with p, with a clear delineation of how much of the eﬀect ﬂows through the unfair channel of interest (loc edu).

eﬀect = 0.16

(a) p = 0.3

eﬀect = 0.19

(b) p = 0.6

Figure 26: RSV for the modiﬁed SCM of with a background (race(1), age(1), sex(1)) = (0, 40, 0) and a foreground (race(2), age(2), sex(2)) = (1, 40, 1) with unfairness parameters (a1, a3) = (0.2, 0.2) and location parameter p {0.3, 0.6}. (Note that the reported numbers do not necessarily obey ﬂow conservation as they are rounded to 2 decimal places.)

Remark 11 (Individual-level counterfactual) Note that in our illustration of causal unfairness, we only ﬁx the data corresponding to source variables (race, age, sex) and do not condition on any realization of the non-source variables but let them by determined by the structural equations (20). We did this for ease of illustration and we note that it is straightforward to account for individual-level data observed at non-source nodes. Motivated by Kusner et al. (2017), this can be done by following the standard three-step framework of Pearl et al. (2016) (Chapter 4): (a) abduction, (b) action, and (c) prediction. In particular, given individual-level realization (X, Y ) = (x, y), we can ﬁrst compute the posterior of the noise (abduction) U | (x, y). Second, as we did above, we can consider the two cases of interest (action): setting the source variables at background and foreground. Finally, we can compute the RSV-based eﬀect propagation (prediction) as we did above but by using the posterior U | (x, y) (instead of the prior). Furthermore, this can be done

prevent clutter, we omit showing those edges but note that they exist and the corresponding structural equations are the same as in (20).

Axiomatic effect propagation in structural causal models

0 0.2 0.4 0.6 0.8 1 p

(loc, edu) (loc, inc)

Figure 27: RSV on the two outgoing edges of loc (i.e., loc edu and loc inc) as we vary p {0, 0.1, . . . , 1}. The rest of the setup remains the same as in Figure 26.

for realizations corresponding to any subset of (X, Y ). For example, one might only observe (race, age, sex, edu, ms, inc) but not the hrs of an individual, which will result in the corresponding noise posterior.

8. Concluding remarks

Before concluding, we elaborate on a number of interesting extensions for the posited framework. Instead of having the quantity of attribution (i.e., the δ) as the diﬀerence in the expected outcomes (i.e., Eu[y(2) u ] Eu[y(1) u ]), we can deﬁne it using other functionals of the noise distribution, e.g., the quantile or the variance. Our key results (axiomatic support in Theorem 4, path-based characterization for non-parametric SEMs in Theorem 8, and the Monte-Carlo estimation scheme in 6 along with the properties in Proposition 10) hold for such a generalization. To see this, observe that given a subset E E of active edges, the deﬁnition of yu(E) remains the same as discussed around (3) in 4. The key deﬁnition that changes is that of y(E). Instead of deﬁning y(E) as Eu[yu(E)] (as we did below (3)), we now deﬁne it by using an arbitrary operator of interest, denoted by Fu[ ] (e.g., quantile or variance), i.e.,

y(E) := Fu[yu(E)].

Accordingly, given background and foreground source values x(1) N0 and x(2) N0 , the background

and foreground outcome equals y(1) := Fu[y(1) u ] and y(2) := Fu[y(2) u ], respectively, where the super-script captures the dependence on source values as before. Hence, the quantity of attribution equals the diﬀerence y(2) y(1) as before. For this modiﬁcation, the intuition provided in 4.1 holds as it is and the deﬁnition of RSV remains as in 4.2. Further, the aforementioned results (Theorem 4, Theorem 8, and Proposition 10) go through without any additional modiﬁcation (proofs are identical). This highlights the ﬂexibility of the proposed framework to accommodate various metrics of interest for attribution purposes. For example, given the widespread usage of quantile regression (Koenker and Bassett Jr, 1978; Koenker and Hallock, 2001; Meinshausen, 2006), one might be interested in understand-

Singal and Michailidis

ing the change in a certain p-quantile of the outcome, for some p (0, 1). In fact, if the underlying SEM is linear (as in (6)), then our closed-form characterization in Theorem 6 holds for quantiles as well. The reason for this is that due to the additive nature of the noise in a linear SEM, the p-quantile of the outcome can be represented as the outcome s expected value plus some constant (with respect to which edges are active). This constant gets cancelled out under the RSV computations and hence, the solution ends up being the same as the one for the expectation operator. In summary, we formulate a generic problem to study eﬀect propagation in causal DAGs, and use it to provide a comprehensive view on existing approaches such as direct and indirect eﬀects, path-speciﬁc eﬀects, and degree of responsibility. In addition to highlighting limitations of such techniques, we propose an axiomatic ﬂow-based methodology (RSV) to quantify attribution. RSV operates on a top-down principle and ﬂows down the eﬀect from the source nodes via a sequence of recursive games, which allow RSV to consider a broader spectrum of counterfactuals than existing approaches. For linear SEMs, RSV admits a tractable closed-form characterization, which recovers the classical method of path coeﬃcients and is equivalent to ideas such as path-speciﬁc eﬀects. For non-parametric SEMs, we provide a path-based characterization of RSV, which leads to an unbiased Monte-Carlo estimation scheme with an exponentially decaying sample complexity. Our principled approach to eﬀect propagation provides a new perspective on challenging problems on causality, such as causal overdetermination and causal unfairness. Next, we outline several directions worth further exploring. First, although our pathbased characterization leads to an estimation scheme for non-parametric SEMs, one can potentially develop tractable exact characterizations of RSV, especially beyond the class of linear SEMs. We found it challenging to do so even for the case of a linear model with two-way interactions, as mentioned at the end of 5 (recall Footnote 7). Even though if such characterizations are intractable, they might lead to eﬃcient estimation schemes. Second, our framework assumes knowledge of the underlying structural causal model and it would be useful to understand RSV s robustness to model estimation and presence of hidden confounders. For example, in the case of a linear SEM, one can possibly use the characterization in Theorem 6 to understand the sensitivity of RSV to the model parameters [aij](i,j) E (i.e., the edge weights). Understanding such robustness is critical before employing RSV in real-world applications, especially for sensitive applications such as causal unfairness. Third, given the applicability of causal attribution in practical domains such as advertising (Dalessandro et al., 2012; Singal et al., 2022) and legal studies (Ferey and Dehez, 2016), applying RSV to such real-world applications is of interest. Finally, this work proposed one set of desirable axioms for eﬀect propagation and it is of interest to explore alternative axioms.

Acknowledgments

We thank the Action Editor and two anonymous referees for the careful reading of the paper and many constructive comments and suggestions. The work of GM was supported in part by NSF grant DMS 2348640.

Axiomatic effect propagation in structural causal models

Appendix A. Primer on Shapley value (SV)

In this appendix, we provide a brief primer on Shapley value (SV) and refer the reader to Chapter 8 of Peleg and Sudh olter (2007) for details. In cooperative game theory, SV (Shapley, 1953) is a well-accepted solution concept to fairly attribute the total value in a game. In particular, given a ﬁnite set P of players, let the value generated by any subset of players (coalition) P P be denoted by the characteristic function v(P). Accordingly, the total value in the game equals v(P) and the goal is to distribute this value back to the individual players in P. To do so, SV attributes the following to player r P:

P P\{r} w P(P) {v(P {r}) v(P)} ,

where the weight function is deﬁned as

w P(P) := |P|! (|P| |P| 1)!

Intuitively speaking, SV of player r computes its value-add to each coalition P that does not contain player r and weights it by a corresponding probability 14 w P(P). Hence, it can be seen as the expected value-add of player r. SV is desirable since it is the unique solution to obey the following four axioms:

1. Eﬃciency: Total value is distributed, i.e.,

r P πr = v(P) v( ).

2. Symmetry: Equivalent players receive same attribution, i.e., πr = πr if players r, r P obey the following:

v(P {r}) = v(P {r }) P P \ {r, r }.

3. Linearity: Given two characteristic functions v1( ) and v2( ), linearity requires robustness to mixing and scaling, i.e., for all r P,

πr(v1 + v2) = πr(v1) + πr(v2)

πr(αv1) = απr(v1) α R.

4. Null player: Useless player gets zero attribution, i.e., πr = 0 if player r P obeys

v(P {r}) = v(P) P P \ {r}.

14. Note that P

P P\{r} w P(P) equals 1 and each of the weight term is non-negative.

Singal and Michailidis

Appendix B. Reverse eﬀects

As in our direct and indirect eﬀects discussion ( 3.1), consider the setup of Figure 2. To understand a limitation of reverse indirect eﬀect (Pearl, 2010), we ﬁrst visualize direct eﬀect (DE) and reverse indirect eﬀect ( IEr) in Figure 28. Simply put, DE starts oﬀwith the baseline of both direct and indirect channels being inactive (subplot 2) and computes the value-add of making the direct channel active (subplot 1). IEr starts oﬀwith the baseline of the direct channel being active (subplot 4) and computes the value-add of making the indirect channel active (subplot 3). When we add up DE and IEr, subplots 1 and 4 in Figure 28 cancel out and we are left with the total eﬀect TE (diﬀerence between subplots 3 and 2). Clearly, DE = 0 and IEr = 1, suggesting all of the eﬀect is mediated by the indirect channel. However, we can similarly deﬁne reverse direct eﬀect ( DEr) by ﬂipping the order in which we proceed. In particular, as we show in Figure 29, we can ﬁrst compute indirect eﬀect (IE) by starting oﬀwith the baseline of both direct and indirect channels being inactive (subplot 2) and computing the value-add of making the indirect channel active (subplot 1). Then, we can start oﬀwith the baseline of the indirect channel being active (subplot 4) and compute the value-add of making the direct channel active (subplot 3), which we call DEr (analogous to IEr). Similar to TE = DE IEr, we have TE = IE DEr. Doing so results in IE = 0 and DEr = 1, meaning all of the eﬀect is propagated by the direct channel, which is the opposite of what we concluded above. As such, the order in which we proceed leads to a diﬀerent conclusion. RSV avoids this issue by averaging over these two orderings:

TE = (DE DEr)/2 | {z } RSV of direct channel

+ (IE IEr)/2 | {z } RSV of indirect channel

Note that RSV simpliﬁes to (21) only for the very speciﬁc graph structure under consideration. To see this, observe that (21) is path-based in that it decouples the total eﬀect along two paths (direct and indirect), which is diﬀerent from the edge-based RSV. Futhermore, (21) provides a very coarse decomposition of the total eﬀect as it only decouples it along two paths, which is very diﬀerent from RSV s ability to provide a granular ﬂow-based decomposition along every edge. Given the clean decomposition in (21), it is of interest to understand if RSV can be expressed in a similar manner for more general graphs. In fact, this was partly our motivation behind the path-based characterization of RSV in Theorem 8, but RSV s intricate nature leads to a much more involved expression in general. Irrespective, RSV can be seen as generalizing (21) to more involved settings as it recovers (21) in simple settings but has the ability to provide a more granular decomposition of the total eﬀect.

This issue is exacerbated in bigger graphs as there are more possible choices of which order to proceed in. For example, in the graph shown in Figure 30, there are 6 possible orderings. Depending on the order, we end up attributing diﬀerent values, and there is no unique right order. As such, though the notion of reverse eﬀects can restore eﬃciency (decomposition adding up to the total eﬀect), it feels rather ad hoc and lacks uniqueness. This is in sharp contrast to RSV, which comes out uniquely from four desirable axioms.

Axiomatic effect propagation in structural causal models

(a) Direct eﬀect (DE)

(b) Reverse indirect eﬀect ( IEr)

Figure 28: Visualizing DE and IEr. DE corresponds to the diﬀerence between the two graphs in subplot (a). IEr corresponds to the diﬀerence between the two graphs in subplot (b).

(a) Indirect eﬀect (IE)

(b) Reverse direct eﬀect ( DEr)

Figure 29: Visualizing IE and DEr. IE corresponds to the diﬀerence between the two graphs in subplot (a). DEr corresponds to the diﬀerence between the two graphs in subplot (b).

Appendix C. Proof of Theorem 4

Theorem 4 Given structural causal model M = (G, F, D), the RSV [πRSV jk ](j,k) E deﬁned via Algorithm 1 is the unique solution to the ﬂow-based axioms.

Proof First, consider the node 0 game. E0 is the set of players with the characteristic function v0(E0) = y(E0, E1, . . . , En) for coalition E0 E0. Under RSV, ﬂow received by each edge in E0 equals the corresponding SV of this game:

πRSV 0k = X

E0 E0\{(0,k)} w E0(E0) {v0(E0 {(0, k)}) v0(E0)} (0, k) E0.

The uniqueness result of SV w.r.t. the original SV axioms (Shapley, 1953) implies that [πRSV 0k ](0,k) E0 uniquely satisﬁes the four ﬂow-based axioms (at node 0). Observe that the symmetry, nullity, and linearity axioms of the classical SV are equivalent to the ﬂow symmetry, ﬂow nullity, and ﬂow linearity axioms. In addition, ﬂow conservation at node 0 is implied by the eﬃciency axiom of the classical SV. To see this, observe that the classical eﬃciency axiom states P k C0 πRSV 0k = v0(E0) v0( ). This is equivalent to the conservation of ﬂow at node 0:

v0(E0) v0( ) = y(E0, E1, . . . , En) y( , E1, . . . , En)

Singal and Michailidis

Figure 30: A graph with one direct and two indirect channels. There are 6 possible orderings, which can be understood as follows. We start with the baseline of all 3 channels being inactive and hence, we have the option of adding any of the 3 channels ﬁrst. Then, we activate any of the remaining 2 channels. Finally, we activate the remaining 1 channel. Note that 3 2 1 = 6.

= y(2) y(1).

The ﬁnal equality is due to the deﬁnition of y( ) (recall (3)):

y(E0, E1, . . . , En) = y(2)

y( , E1, . . . , En) = y(1).

Next, consider the node j N \ {0} game. The players are Ej with the characteristic function vj(Ej) = P i Pj πij(E0, . . . , Ej, . . . , En) for coalition Ej Ej. Under RSV, ﬂow received by each edge in Ej equals the corresponding SV of this game:

πRSV jk = X

Ej Ej\{(j,k)} w Ej(Ej) {vj(Ej {(j, k)}) vj(Ej)} (j, k) Ej.

The uniqueness result of SV w.r.t. the original SV axioms (Shapley, 1953) implies that [πRSV jk ](j,k) Ej uniquely satisﬁes the four ﬂow-based axioms (at node j). Observe that the symmetry, nullity, and linearity axioms of the classical SV are equivalent to the ﬂow symmetry, ﬂow nullity, and ﬂow linearity axioms. In addition, ﬂow conservation at node j is implied by the eﬃciency axiom of the classical SV. To see this, observe that the classical eﬃciency axiom states P k Cj πRSV jk = vj(Ej) vj( ). This is equivalent to the conservation of ﬂow at node j:

vj(Ej) vj( ) = X

i Pj πij(E0, . . . , Ej, . . . , En) X

i Pj πij(E0, . . . , , . . . , En)

i Pj πRSV ij .

The last equality holds since

πij(E0, . . . , Ej, . . . , En) = πRSV ij i Pj πij(E0, . . . , , . . . , En) = 0 i Pj.

The ﬁrst statement is by deﬁnition and the second statement is true because for Ej = , no new information is passed via node j to its children (see the deﬁnition of y( ) as in (3)) and therefore, edge (i, j) becomes a null player in the node i Pj upstream game. This completes the proof.

Axiomatic effect propagation in structural causal models

Appendix D. Proof of Theorem 6

Theorem 6 Consider structural causal model M = (G, F, D) , wherein the structural equations F are linear and the noise distribution D has zero mean (see (6)), with c and b as in (7) and (8). Then,

πRSV 0j = c0j (0, j) E0

πRSV jk = X

i Pj bijcjk (j, k) E \ E0.

Proof Seeting E = E in Lemma 12 (stated below) gives the following ﬂow on each edge (0, j) E0:

k:(j,k) Ej b0j(E0)cjk(Ej, . . . , En)

k:(j,k) Ej b0jcjk

k:(j,k) Ej a0jcjk

The second equality holds due to b = b(E) (see (8) and (23)) and c = c(E) (see (7) and (22)), the third equality holds as b0j = x(2) j x(1) j = a0j (0, j) E0, and the fourth equality directly follows the deﬁnition of c0j (0, j) E0 (see (7)). Similarly, setting E = E in Lemma 12 gives the following attributions to each edge (j, k) E \ E0:

ℓ:(k,ℓ) Ek bjk(E0, . . . , Ej)ckℓ(Ek, . . . , En)

ℓ Ck bjkckℓ

i Pj bijajkckℓ

i Pj bijajk X

i Pj bijcjk.

The third and ﬁfth equalities directly follow the deﬁnitions of b and c (see (8) and (7)). Recalling that πRSV jk is deﬁned as πjk(E) for all (j, k) E, the proof is complete.

Details on Lemma 12

The proof of Theorem 6 uses a more general result (Lemma 12), which we develop next. It will help to generalize the deﬁnitions of both the forward-looking weights c and the

Singal and Michailidis

backward-looking weights b. In 5.1 (recall Equations (7) and (8)), when deﬁning these weights, we implicitly assumed all edges E as being active. We now let these weights vary as a function of an arbitrary subset E = (E0, . . . , En) E of edges. We assume topologically sorting wlog, i.e., there does not exist an edge (i, j) E such that i > j (Cormen et al., 2009). We deﬁne the generalized forward-looking weights c(E) as follows:

cj,n+1(Ej) := aj,n+1I{(j, n + 1) Ej} (j, n + 1) Ein n+1 (22a)

cij(Ei, . . . , En) := aij I{(i, j) Ei} X

k:(j,k) Ej cjk(Ej, . . . , En) (i, j) E \ Ein n+1, (22b)

where a0j := x(2) j x(1) j (0, j) E0 as before. Similarly, we deﬁne the generalized backwardlooking weights b(E) as follows:

b0j(E0) := (x(2) j x(1) j )I{(0, j) E0} (0, j) E0 (23a)

bjk(E0, . . . , Ej) := X

i:(i,j) Ei bij(E0, . . . , Ei)ajk I{(j, k) Ej} (j, k) E \ E0. (23b)

Observe that setting in E = E recovers the original weights as in (7) and (8), i.e., b = b(E) and c = c(E). We next present Lemma 12.

Lemma 12 Consider structural causal model M = (G, F, D) where the structural equations F are linear and the noise D is mean-zero (see (6)), with c( ) and b( ) as in (22) and (23). Then, given E E, for all i N,

k:(j,k) Ej bij(E0, . . . , Ei)cjk(Ej, . . . , En) j : (i, j) E,

where E = (E0, . . . , En) and the graph G is assumed to be topologically sorted (wlog).

Proof First, consider the node i = 0 (super-source). Recall that π0j(E) is the SV of the following game. The players are the elements of E0 and given an arbitrary coalition E0 E0, characteristic function is

v0(E0 | E 0) = y(E0, E1, . . . , En)

j:(0,j) E0 c0j(E0, E1, . . . , En)

j:(0,j) E0 a0j I{(0, j) E0} X

k:(j,k) Ej cjk(Ej, . . . , En)

j:(0,j) E0 b0j(E0) X

k:(j,k) Ej cjk(Ej, . . . , En).

The second equality directly follows the deﬁnition of c( ) (see (22)) and y( ) (see (3)). The third equality follows the deﬁnition of c( ) (see (22)). The last equality holds since a0j = x(2) j x(1) j and it also leverages the deﬁnition of b0j( ) (see (23)). Given v0(E0 | E 0)

Axiomatic effect propagation in structural causal models

is separabile over the players (0, j) in E0, we have that the SV of each player (0, j) E0 equals

π0j(E) = b0j(E0) X

k:(j,k) Ej cjk(Ej, . . . , En)

k:(j,k) Ej b0j(E0)cjk(Ej, . . . , En).

The base case is complete. Due to the DAG structure, it is suﬃcient to show the claim holds at node j N by assuming it to hold at each of its parent nodes i such that (i, j) Ei. In other words, it suﬃces to show that

k:(j,k) Ej bij(E0, . . . , Ei)cjk(Ej, . . . , En) j : (i, j) E (24)

ℓ:(k,ℓ) Ek bjk(E0, . . . , Ej)ckℓ(Ek, . . . , En) k : (j, k) E.

Recall that πjk(E) corresponds to the SV of the game with Ej as the set of players and the following characteristic function given coalition Ej Ej:

vj(Ej | E j) = X

i:(i,j) Ei πij(E0, . . . , Ej, . . . , En)

k:(j,k) Ej bij(E0, . . . , Ei)cjk(Ej, . . . , En)

k:(j,k) Ej bij(E0, . . . , Ei)cjk(Ej, . . . , En)

i:(i,j) Ei bij(E0, . . . , Ei)cjk(Ej, . . . , En).

The second equality follows (24), the third is true as cjk(Ej, . . . , En) = cjk(Ej, . . . , En) for (j, k) Ej (see (22)). Given the characteristic function is separabile over the underlying players, we have that the SV of player (j, k) Ej equals

i:(i,j) Ei bij(E0, . . . , Ei)cjk(Ej, . . . , En)

i:(i,j) Ei bij(E0, . . . , Ei)ajk I{(j, k) Ej} X

ℓ:(k,ℓ) Ek ckℓ(Ek, . . . , En)

= bjk(E0, . . . , Ej) X

ℓ:(k,ℓ) Ek ckℓ(Ek, . . . , En)

ℓ:(k,ℓ) Ek bjk(E0, . . . , Ej)ckℓ(Ek, . . . , En),

where the second and third equalities follow from the deﬁnitions of c( ) (see (22)) and b( ) (see (23)), respectively. This completes the proof.

Singal and Michailidis

Appendix E. Proof of Theorem 8

Theorem 8 Given structural causal model M = (G, F, D), the RSV of edge (i, j) E exhibits the following characterization:

πRSV ij (E) = X

EW EW {j}\{0} W

V Edge(W {j}) ( 1)|W| |V | y(EW V, E W ).

Proof Directly follows when we plug in E equal to E in Lemma 13 (see below).

Details on Lemma 13

The proof of Theorem 8 leverages a more general result (Lemma 13), which we present now. We generalize the deﬁnitions of various primitives as a function of a subset E = (E0, . . . , En) E. For instance, the set of parents of node i N+ is denoted as Pi(E), the set of all unique paths to node i N+ is denoted as Wi(E), ﬂow (RSV) on edge (i, j) E is denoted as πRSV ij (E) (or simply πij(E) for conciseness), the SV weight as w Ei( ) for the game at node i N, the kappa factor (given path W) as κE W ( ), etc. We assume wlog that the DAG is topologically sorted, i.e., there is no edge (i, j) E with i > j (Cormen et al., 2009). Note that plugging in E as E recovers the original primitives. We are now in a position to present Lemma 13.

Lemma 13 Given structural causal model M = (G, F, D) and a subset E = (E0, . . . , En) E of edges, the RSV (corresponding to input E) of edge (i, j) E exhibits the following characterization:

πRSV ij (E) = X

EW EW {j}\{0} W

κE W (EW ) X

V Edge(W {j}) ( 1)|W| |V | y(EW V, E W ).

(Note that πRSV ij (E) corresponds to the following call of Algorithm 1: RSV(N, E). In particular, the second input is E instead of E.)

Proof First, consider the super-source node i = 0. By deﬁnition, π0j(E) corresponds to the Shapley value of the following game. The set of players is E0 and for a given coalition E0 E0, characteristic function equals

v0(E0 | E 0) = y(E0, E1, . . . , En)

= y(E0, E 0).

Accordingly, the Shapley value for player (0, j) E0 equals

E0 E0\{(0,j)} w E0(E0) {y(E0 {(0, j)}, E 0) y(E0, E 0)} . (25)

Observing that W0(E) = (0), the RHS of the claim for (i, j) = (0, j) equals:

EW EW {j}\{0} W

κE W (EW ) X

V Edge(W {j}) ( 1)|W| |V | y(EW V, E W )

Axiomatic effect propagation in structural causal models

V {(0,j)} ( 1)1 |V | y(E0 V, E 0)

E0 E0\{(0,j)} w E0(E0) {y(E0 {(0, j)}, E 0) y(E0, E 0)}

The ﬁrst equation is the RHS of the claim, the second is obtained by plugging in W0(E) = (0) and simplifying, the third follows the deﬁnition of the kappa factor and expanding the sum over V , and the ﬁnal equality follows (25). This completes the base case. Given the DAG structure, it suﬃces to show the claim holds at node j N by assuming it to hold at each of its parent nodes i s.t. (i, j) Ei. That is, it suﬃces to show that

EW EW {j}\{0} W

κE W (EW ) X

V Edge(W {j}) ( 1)|W| |V | y(EW V, E W )

for all j : (i, j) E implies

EW EW {k}\{0} W

κE W (EW ) X

V Edge(W {k}) ( 1)|W| |V | y(EW V, E W )

for all k : (j, k) E. Consider arbitrary edge (j, k) E. Observe that

Ej Ej\{(j,k)} w Ej(Ej) {vj(Ej {(j, k)} | E j) vj(Ej | E j)}

Ej Ej\{(j,k)} w Ej(Ej) X

i Pj(E) {πij(Ej {(j, k)}, E j) πij(Ej, E j)}

Ej Ej\{(j,k)} w Ej(Ej) X

EW EW {j}\{0} W

V Edge(W {j}) ( 1)|W| |V |

y(EW V, Ej {(j, k)}, E (W j)) y(EW V, Ej, E (W j))

Ej Ej\{(j,k)}

EW EW {j}\{0} W

w Ej(Ej)κE W (EW )

V Edge(W {j}) ( 1)|W| |V |

y(EW V, Ej {(j, k)}, E (W j)) y(EW V, Ej, E (W j))

Ej Ej\{(j,k)}

EW EW {j}\{0} W

w Ej(Ej)κE W (EW )

V Edge(W {j}) ( 1)|W| |V | X

V {(j,k)} ( 1)1 |V |y(EW V, Ej V , E (W j))

Singal and Michailidis

EW EW {k}\{0} W

κE W (EW ) X

V Edge(W {k}) ( 1)|W| |V | y(EW V, E W ).

The ﬁrst equality follows the deﬁnition of SV, second follows the recursive deﬁnition of the characteristic function vj( ), third invokes (26), and the last three equalities use basic algebra. This completes the proof.

Appendix F. RSV characterization for a linear model with interactions

Consider the DAG in Figure 31 for illustration.

Figure 31: DAG used to illustrate the RSV characterization for a linear model with two-way interactions. We have excluded node 0 for simplicity.

Suppose X1 is set exogenously and the true (deterministic) relations are linear with two-way interactions:

X2 = a2 1X1 X3 = a3 1X1 + a3 2X2 + a3 12X1X2 Y = a4 1X1 + a4 2X2 + a4 3X3 + a4 12X1X2 + a4 13X1X3 + a4 23X2X3.

The super-scripts in the coeﬃcients (e.g., 2 in a2 1) do not represent exponents but are used to indicate the child node of the corresponding edge (e.g., 2 in a2 1 denotes node 2). In F.1, we compute the RSV for this relatively simple example and observe a pattern, which we generalize in F.2. Finally, we discuss its computational implications in F.3. The key message of this section is that developing a computationally tractable RSV characterization even for a linear model (in the coeﬃcients) with two-way interactions in the variables is challenging.

F.1 RSV computation

For simplicity 15, assume background X(1) 1 = 0 and foreground X(2) 1 = 1. Then, Y (1) = 0 and

Y (2) = a4 1X(2) 1 + a4 2X(2) 2 + a4 3X(2) 3 + a4 12X(2) 1 X(2) 2 + a4 13X(2) 1 X(2) 3 + a4 23X(2) 2 X(2) 3

= a4 1 + a4 2X(2) 2 + a4 3X(2) 3 + a4 12X(2) 2 + a4 13X(2) 3 + a4 23X(2) 2 X(2) 3

= a4 1 + a4 2a2 1 + a4 3X(2) 3 + a4 12a2 1 + a4 13X(2) 3 + a4 23a2 1X(2) 3

15. As we show, even under this simpliﬁcation, a computationally tractable characterization of RSV is challenging.

Axiomatic effect propagation in structural causal models

= a4 1 + a4 2a2 1 + a4 3(a3 1 + a3 2a2 1 + a3 12a2 1) + a4 12a2 1 + a4 13(a3 1 + a3 2a2 1 + a3 12a2 1)

+ a4 23a2 1(a3 1 + a3 2a2 1 + a3 12a2 1).

First equality is by deﬁnition. Second plugs in X(2) 1 = 1. Third plugs in X(2) 2 = a2 1. Fourth plugs in X(2) 3 = a3 1 + a3 2a2 1 + a3 12a2 1. First, RSV considers a game at node 1, where the players are the outgoing edges E1 = {(1, 2), (1, 3), (1, 4)}. All downstream edges are assumed to be active. The RSV of edge (1, 2) equals:

πRSV 12 = 1

3 (v1({(1, 2)}) v1( )) + 1

6 (v1({(1, 2), (1, 3)}) v1({(1, 3)}))

6 (v1({(1, 2), (1, 4)}) v1({(1, 4)}))

3 (v1({(1, 2), (1, 3), (1, 4)}) v1({(1, 3), (1, 4)})) .

Observe that the coalition values are as follows (via bruteforce computation):

v1( ) = Y (1) = 0

v1({(1, 2)}) = a4 2a2 1 + a4 3a3 2a2 1 + a4 23a2 1a3 2a2 1 v1({(1, 3)}) = a4 3a3 1 v1({(1, 4)}) = a4 1 v1({(1, 2), (1, 3)}) = a4 2a2 1 + a4 3(a3 1 + a3 2a2 1 + a3 12a2 1) + a4 23a2 1(a3 1 + a3 2a2 1 + a3 12a2 1)

v1({(1, 2), (1, 4)}) = a4 1 + a4 2a2 1 + a4 3a3 2a2 1 + a4 12a2 1 + a4 13a3 2a2 1 + a4 23a2 1a3 2a2 1 v1({(1, 3), (1, 4)}) = a4 1 + a4 3a3 1 + a4 13a3 1 v1({(1, 2), (1, 3), (1, 4)}) = Y (2) = a4 1 + a4 2a2 1 + a4 3(a3 1 + a3 2a2 1 + a3 12a2 1) + a4 12a2 1 + a4 13(a3 1 + a3 2a2 1 + a3 12a2 1) + a4 23a2 1(a3 1 + a3 2a2 1 + a3 12a2 1).

Plugging these in πRSV 12 , we get πRSV 12 equals

3 a4 2a2 1 + a4 3a3 2a2 1 + a4 23a2 1a3 2a2 1

6 a4 2a2 1 + a4 3(a3 2a2 1 + a3 12a2 1) + a4 23a2 1(a3 1 + a3 2a2 1 + a3 12a2 1)

6 a4 2a2 1 + a4 3a3 2a2 1 + a4 12a2 1 + a4 13a3 2a2 1 + a4 23a2 1a3 2a2 1

3 a4 2a2 1 + a4 3(a3 2a2 1 + a3 12a2 1) + a4 12a2 1 + a4 13(a3 2a2 1 + a3 12a2 1) + a4 23a2 1(a3 1 + a3 2a2 1 + a3 12a2 1)

= a4 2a2 1 + a4 3a3 2a2 1 + a4 23a2 1a3 2a2 1 + a4 3a3 12a2 1 2 + a4 23a2 1a3 1 2 + a4 23a2 1a3 12a2 1 2 + a4 12a2 1 2 + a4 13a3 2a2 1 2

+ a4 13a3 12a2 1 3 .

We have arranged the terms using the denominators 1, 2, and 3. Note that each of the numerator term appears in Y (2) and hence, πRSV 12 splits Y (2) via these term-speciﬁc denominators. Interestingly, there is an underlying pattern behind these denominators. Observe

Singal and Michailidis

that the denominator corresponds to the number of edges (out of the 3 edges of interest in this game, i.e., E1 = {(1, 2), (1, 3), (1, 4)}) that play a role in the term. For example, in a4 13a3 12a2 1, all three edges of interest appear:

a4 13: (1, 4) and (3, 4)

a3 12: (1, 3) and (2, 3)

a2 1: (1, 2).

Hence, the corresponding denominator equals 3. Extrapolating this pattern, we can decompose Y (2) as follows:

πRSV 12 = a4 2a2 1 + a4 3a3 2a2 1 + a4 23a2 1a3 2a2 1 + a4 3a3 12a2 1 2 + a4 23a2 1a3 1 2 + a4 23a2 1a3 12a2 1 2 + a4 12a2 1 2 + a4 13a3 2a2 1 2

+ a4 13a3 12a2 1 3

πRSV 13 = a4 3a3 1 + a4 3a3 12a2 1 2 + a4 13a3 1 2 + a4 13a3 2a2 1 2 + a4 23a2 1a3 1 2 + a4 13a3 12a2 1 3

πRSV 14 = a4 1 + a4 12a2 1 2 + a4 13a3 1 2 + a4 13a3 2a2 1 2 + a4 13a3 12a2 1 3 .

By construction of this decomposition, the sum (πRSV 12 + πRSV 13 + πRSV 14 ) equals Y (2). As a sanity check, let s verify the value of πRSV 14 via ﬁrst principles:

πRSV 14 = 1

3 (v1({(1, 4)}) v1( )) + 1

6 (v1({(1, 4), (1, 3)}) v1({(1, 3)}))

6 (v1({(1, 4), (1, 2)}) v1({(1, 2)}))

3 (v1({(1, 4), (1, 2), (1, 3)}) v1({(1, 2), (1, 3)}))

6 a4 1 + a4 13a3 1 + 1

6 a4 1 + a4 12a2 1 + a4 13a3 2a2 1

3 a4 1 + a4 12a2 1 + a4 13(a3 1 + a3 2a2 1 + a3 12a2 1)

= a4 1 + a4 13a3 1 2 + a4 12a2 1 2 + a4 13a3 2a2 1 2 + a4 13a3 12a2 1 3 .

This matches the pattern above. We don t need to verify πRSV 13 due to eﬃciency (ﬂow conservation). Having computed RSV at the node 1 game, let s ﬂow it down. Consider node 2. It receives an inﬂow of πRSV 12 , which is split into the outﬂow πRSV 23 + πRSV 24 . Recall from above that

πRSV 12 = a4 2a2 1 + a4 3a3 2a2 1 + a4 23a2 1a3 2a2 1 + a4 3a3 12a2 1 2 + a4 23a2 1a3 1 2 + a4 23a2 1a3 12a2 1 2 + a4 12a2 1 2 + a4 13a3 2a2 1 2

+ a4 13a3 12a2 1 3 .

Axiomatic effect propagation in structural causal models

Extrapolating the pattern from above, we get the following ﬂow decomposition:

πRSV 23 = a4 3a3 2a2 1 + a4 23a2 1a3 2a2 1 2 + a4 3a3 12a2 1 2 + a4 23a2 1a3 12a2 1 4 + a4 13a3 2a2 1 2 + a4 13a3 12a2 1 3

πRSV 24 = a4 2a2 1 + a4 23a2 1a3 2a2 1 2 + a4 23a2 1a3 1 2 + a4 23a2 1a3 12a2 1 4 + a4 12a2 1 2 .

Only two inﬂow terms, a4 23a2 1a3 2a2 1 and a4 23a2 1a3 12a2 1 2 , involve both the edges of interest ((2, 3) and (2, 4)) and hence, are split between the two. All other terms involve just one of the two edges and hence, are allocated accordingly. As a sanity check, let s verify the value of πRSV 24 via ﬁrst principles:

πRSV 24 = 1

2 (v2({(2, 4)}) v2( )) + 1

2 (v2({(2, 3), (2, 4)}) v2({(2, 3)})) .

Observe that the coalition values are as follows:

v2({(2, 3)}) = a4 3a3 2a2 1 + a4 3a3 12a2 1 2 + a4 13a3 2a2 1 2 + a4 13a3 12a2 1 3

v2({(2, 4)}) = a4 2a2 1 + a4 23a2 1a3 1 2 + a4 12a2 1 2

v2({(2, 3), (2, 4)}) = πRSV 12 = a4 2a2 1 + a4 3a3 2a2 1 + a4 23a2 1a3 2a2 1 + a4 3a3 12a2 1 2 + a4 23a2 1a3 1 2 + a4 23a2 1a3 12a2 1 2

+ a4 12a2 1 2 + a4 13a3 2a2 1 2 + a4 13a3 12a2 1 3 .

Plugging these in πRSV 24 , we get πRSV 24 equals

a4 2a2 1 + a4 23a2 1a3 1 2 + a4 12a2 1 2

a4 2a2 1 + a4 23a2 1a3 2a2 1 + a4 23a2 1a3 1 2 + a4 23a2 1a3 12a2 1 2 + a4 12a2 1 2

= a4 2a2 1 + a4 23a2 1a3 1 2 + a4 12a2 1 2 + a4 23a2 1a3 2a2 1 2 + a4 23a2 1a3 12a2 1 4 .

It matches the decomposition above. We do not need to verify πRSV 23 due to ﬂow conservation. Hence, the pattern seems to hold when we ﬂow down the attribution. Note that the due to ﬂow conservation, attribution through node 3 is trivial (πRSV 13 = πRSV 34 ) and hence,

πRSV 34 = a4 3a3 1 + a4 3a3 12a2 1 2 + a4 13a3 1 2 + a4 13a3 2a2 1 2 + a4 23a2 1a3 1 2 + a4 13a3 12a2 1 3 .

This completes the computation of RSV for the example at hand.

F.2 General expression

Continuing with the example from above, consider the decomposition of Y that one obtains by substituting all of the foreground inputs:

Y = a4 1 + a4 2a2 1 + a4 3a3 1 + a4 3a3 2a2 1 + a4 3a3 12a2 1 + a4 12a2 1 + a4 13a3 1 + a4 13a3 2a2 1 + a4 13a3 12a2 1

Singal and Michailidis

+ a4 23a2 1a3 1 + a4 23a2 1a3 2a2 1 + a4 23a2 1a3 12a2 1. (27)

Since the background equals 0, (27) denotes the total eﬀect. The decomposition has 12 terms and we will use the notation a to denote an individual term and the notation P a Y to sum over all such terms in Y (notation abuse since Y is not a set). Given this decomposition, RSV for the topmost game (edges (1, 2), (1, 3), and (1, 4)) is as follows:

πRSV 1j = X

a I{(1, j) a}

unique1(a) j {2, 3, 4},

where I{(1, j) a} denotes whether edge (1, j) appears in term a or not (another notation abuse since a is not a set) and unique1(a) denotes the number of unique edges in a out of the outgoing edges of node 1 (sub-script 1 in unique1( ) captures this dependence on node 1). For example, a4 1 has only edge (1, 4) and unique1(a4 1) = 1 whereas a4 3a3 12a2 1 has four edges ((3, 4), (1, 3), (2, 3), and (1, 2)) but only two of these four emit from node 1 and hence, unique1(a4 3a3 12a2 1) = 2. Flowing down this attribution can be done using the same formula. In particular, everything remains the same except the value that is split. For example, at node 2, instead of splitting Y , we split the value it receives from the top, i.e., πRSV 2 , which equals πRSV 12 in this example. The formula above gives us a decomposition of πRSV 12 (and hence, πRSV 2 ):

πRSV 2 = a4 2a2 1 + a4 3a3 2a2 1 + a4 23a2 1a3 2a2 1 + a4 3a3 12a2 1 2 + a4 23a2 1a3 1 2 + a4 23a2 1a3 12a2 1 2 + a4 12a2 1 2 + a4 13a3 2a2 1 2

+ a4 13a3 12a2 1 3 .

The decomposition has 9 terms and as before, we will use the notation a to denote an individual term (including the denominator that comes with it). Given this decomposition, RSV for the game at node 2 (edges (2, 3) and (2, 4)) is as follows:

πRSV 2j = X

a I{(2, j) a}

unique2(a) j {3, 4}.

Note that we use unique2( ) (sub-script changed from 1 to 2), which captures the number of unique edges out of the outgoing edges at node 2. The general pattern follows (formal proof omitted for brevity). Given this characterization, we have a handle on computing RSV without using the bruteforce recursions in Algorithms 1 and 2. However, as we discuss next, the such a procedure runs into computational tractability issues (primarily due to the interaction terms).

F.3 Runtime

It should be clear that the runtime of this technique is primarily driven by the number of terms in the decomposition of Y as in (27), which we denote by |Y |. In particular, the runtime is upper bounded by O(|E| |Y |), where |E| denotes the number of edges in the graph. This is because computing πRSV ij for an arbitrary (i, j) E involves a sum over at most |Y | terms. Accordingly, the question of interest is as follows: how many terms does Y have?

Axiomatic effect propagation in structural causal models

In this direction, consider a fully dense DAG with n input nodes (X1, . . . , Xn) and outcome node Y . Then, the structural equations for a linear model with two-way interactions are:

X1 exogenous

X2 = a2 1X1 X3 = a3 1X1 + a3 2X2 + a3 12X1X2 ...

i=1 an i Xi +

j=i+1 an ij Xi Xj

i=1 an+1 i Xi +

j=i+1 an+1 ij Xi Xj.

Obtaining the decomposition of Y (especially when X1 equals 1) is conceptually straightforward (though quite mechanical): starting from X1, keep performing forward-substituition. Denote by κj the number of terms that appear in Xj for j = 1, . . . , n + 1. That is, |Y | equals κn+1. Given the structural equations as above, the sequence {κ1, . . . , κn+1} obeys the following recursion:

i=h+1 κhκi j = 2, . . . , n + 1.

κ1 = 1 initializes the recursion. Unfortunately, this sequence explodes quite fast, as shown in Figure 32. Hence, the technique of decomposing Y and using it to attribute via the unique( ) characterization seems computationally intractable. Of course, this is not an impossibility result in general for SEMs with linear main eﬀects and 2-way interaction terms, as there might some other characterization that might be computationally tractable. However, it seems rather challenging to come up with one, primarily due to the presence of the interaction terms.

Appendix G. Proof of Proposition 10

Proposition 14 For edge (i, j) E, ˆπRSV ij is an unbiased estimator of πRSV ij with variance decaying at a rate of 1/S. Furthermore, for all t 0,

P πRSV ij ˆπRSV ij t 2 exp βij St2

for some βij > 0 ﬁnite. (The probability measure P{ } here denotes the Monte-Carlo sampling distribution of the estimator ˆπRSV ij .)

Proof The unbiasedness and 1/S rate of variance decay directly follow our construction of the Monte-Carlo scheme. For the sample complexity, ﬁrst, observe that given S samples, Hoeﬀding s inequality (Hoeﬀding, 1994) implies the following for W Wi: for all t 0,

P n µj W ˆµj W t o 2 exp

2S2t2 PS s=1(max(ˆµj W (s)) min(ˆµj W (s)))2

Singal and Michailidis

2 4 6 8 10 12 14

0 100 200 300 400 500

Figure 32: Growth of κn (on the log scale) as a function of n.

where min(ˆµj W (s)) and max(ˆµj W (s)) capture the range of ˆµj W (s) over s, i.e., ˆµj W (s) [min(ˆµj W (s)), max(ˆµj W (s))] for all s. Since the range is independent of s, we get

P n µj W ˆµj W t o 2 exp

(max(ˆµj W ) min(ˆµj W ))2

Note that (28) bounds the gap between µj W and ˆµj W (i.e., for a given path W Wi) whereas we are ultimately interested in estimating πRSV ij , which equals P W Wi µj W . It is possible to leverage (28) along with a corresponding concentration inequality for sub-Gaussian random variables to obtain a similar concentration bound for the diﬀerence between πRSV ij and ˆπRSV ij . With the error equal to

πRSV ij ˆπRSV ij = X

Theorem 2.6.2 of Vershynin (2018) implies the following sample complexity: for all t 0,

P πRSV ij ˆπRSV ij t = P

2cij St2 P W Wi||Zj W ||2 ψ2

cij := min W Wi

(max(ˆµj W ) min(ˆµj W ))2

Axiomatic effect propagation in structural causal models

and for a random variable Z, the norm

||Z||ψ2:= inf a 0 : E exp(Z2/a2) 2 ,

which is ﬁnite if and only if Z is sub-Gaussian. For us, note that Zj W is sub-Gaussian due to (28) for all W Wi. Setting

βij = 2cij P W Wi||Zj W ||2 ψ2

in (29) completes the proof.

Appendix H. Sampling paths in linear time and space

Here, we supplement our discussion at the end of 6.1 by illustrating how we can sample paths uniformly at random from Wj (set of all paths from node 0 to j) in linear time and space for node j N+ \ {0}. For simplicity, we showcase the procedure on the topologically sorted DAG in Figure 33a. (Note that topological sorting a DAG takes linear time.)

(b) Cumulative values [dj]j N+

Figure 33: The DAG and the corresponding cumulative values used to illustrate our sampling procedure.

We focus on node 6 (Y ) and show how we can generate uniform samples from W6. To do so, we compute the cumulative values (Figure 33b):

i Pj di j N+ \ {0}.

Observe that, by construction, dj equals |Wj| (number of paths from node 0 to node j) for all j N+ \ {0}. Hence, to generate a uniform sample from W6, we can employ the following procedure:

Singal and Michailidis

(a) Estimator distribution

0 200 400 600 800 1000 Number of Monte-Carlo samples S

Variance in the estimate

0.86 0.67 0.52 0.42 0.31 0.3

Data 1/S fit

(b) Variance as a function of S

Figure 34: Numerical illustration of our estimation scheme for (L, M) = (4, 3). Note that the exact RSV (of the top-right edge) equals 1/M, which equals 1/3 in this case (blue vertical line in plot (a)). The estimator s mean is close to 0.33 for all values of S and the variance decays at a rate of 1/S.

1. Start at node 6 and sample a node from its set of parents P6 = {1, 4, 5} with sampling weights (1, 3, 6), i.e., the cumulative values (d1, d4, d5) of the parents.

2. Suppose the sampled node is j P6. Repeat the procedure at node j by sampling from Pj with weights (di)i Pj.

3. Keep doing so until node 0 is reached and stitch together the sampled edges as a sample path from W6.

Since the DAG is topologically sorted, we will reach node 0 in at most n steps and the entire procedure runs in linear time and space, without enumerating W6. It should be clear that the procedure samples paths uniformly at random from W6 and the same idea can be used to sample from Wj for an arbitrary node j N+ \ {0}. Furthermore, generalizing this procedure to an arbitrary DAG is straightforward.

Appendix I. Additional plots for the numerical illustration of our estimation scheme

See Figures 34 to 39.

Axiomatic effect propagation in structural causal models

(a) Estimator distribution

0 200 400 600 800 1000 Number of Monte-Carlo samples S

Variance in the estimate

0.47 0.34 0.3 0.26 0.22 0.2

Data 1/S fit

(b) Variance as a function of S

Figure 35: Numerical illustration of our estimation scheme for (L, M) = (4, 5). Note that the exact RSV (of the top-right edge) equals 1/M, which equals 1/5 in this case (blue vertical line in plot (a)). The estimator s mean is close to 0.20 for all values of S and the variance decays at a rate of 1/S.

(a) Estimator distribution

0 200 400 600 800 1000 Number of Monte-Carlo samples S

Variance in the estimate

0.42 0.34 0.29 0.23 0.21 0.18 0.17

Data 1/S fit

(b) Variance as a function of S

Figure 36: Numerical illustration of our estimation scheme for (L, M) = (4, 7). Note that the exact RSV (of the top-right edge) equals 1/M, which equals 1/7 in this case (blue vertical line in plot (a)). The estimator s mean is close to 0.14 for all values of S and the variance decays at a rate of 1/S.

Singal and Michailidis

(a) Estimator distribution

0 200 400 600 800 1000 Number of Monte-Carlo samples S

Variance in the estimate

0.86 0.69 0.68 0.6 0.54 0.41 0.36

Data 1/S fit

(b) Variance as a function of S

Figure 37: Numerical illustration of our estimation scheme for (L, M) = (5, 3). Note that the exact RSV (of the top-right edge) equals 1/M, which equals 1/3 in this case (blue vertical line in plot (a)). The estimator s mean is close to 0.33 for all values of S and the variance decays at a rate of 1/S.

(a) Estimator distribution

0 200 400 600 800 1000 Number of Monte-Carlo samples S

Variance in the estimate

0.54 0.42 0.35 0.27 0.25 0.24

Data 1/S fit

(b) Variance as a function of S

Figure 38: Numerical illustration of our estimation scheme for (L, M) = (5, 5). Note that the exact RSV (of the top-right edge) equals 1/M, which equals 1/5 in this case (blue vertical line in plot (a)). The estimator s mean is close to 0.20 for all values of S and the variance decays at a rate of 1/S.

Axiomatic effect propagation in structural causal models

(a) Estimator distribution

0 200 400 600 800 1000 Number of Monte-Carlo samples S

Variance in the estimate

0.57 0.46 0.37 0.29 0.27 0.2 0.17

Data 1/S fit

(b) Variance as a function of S

Figure 39: Numerical illustration of our estimation scheme for (L, M) = (5, 7). Note that the exact RSV (of the top-right edge) equals 1/M, which equals 1/7 in this case (blue vertical line in plot (a)). The estimator s mean is close to 0.14 for all values of S and the variance decays at a rate of 1/S.

Duane F Alwin and Robert M Hauser. The decomposition of eﬀects in path analysis. American Sociological Review, pages 37 47, 1975.

Nihat Ay and Daniel Polani. Information ﬂows in causal networks. Advances in Complex Systems, 11(01):17 41, 2008.

Reuben M Baron and David A Kenny. The moderator mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51(6):1173, 1986.

Peter J Bickel, Eugene A Hammel, and J William O Connell. Sex bias in graduate admissions: Data from Berkeley. Science, 187(4175):398 404, 1975.

Alan S Blinder. Wage Discrimination: Reduced Form and Structural Estimates. Journal of Human Resources, pages 436 455, 1973.

Javier Castro, Daniel G omez, and Juan Tejada. Polynomial calculation of the Shapley value based on sampling. Computers & Operations Research, 36(5):1726 1730, 2009.

Javier Castro, Daniel G omez, Elisenda Molina, and Juan Tejada. Improving polynomial estimation of the Shapley value by stratiﬁed random sampling with optimum allocation. Computers & Operations Research, 82:180 188, 2017.

Victor Chernozhukov, Iv an Fern andez-Val, and Blaise Melly. Inference on counterfactual distributions. Econometrica, 81(6):2205 2268, 2013.

Singal and Michailidis

Silvia Chiappa. Path-speciﬁc counterfactual fairness. In Proceedings of the 33rd AAAI Conference on Artiﬁcial Intelligence, volume 33, pages 7801 7808, 2019.

Hana Chockler and Joseph Y Halpern. Responsibility and Blame: A Structural-Model Approach. Journal of Artiﬁcial Intelligence Research, 22:93 115, 2004.

Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Cliﬀord Stein. Introduction to Algorithms. MIT press, 2009.

Brian Dalessandro, Claudia Perlich, Ori Stitelman, and Foster Provost. Causally motivated attribution for online advertising. In Proceedings of the 6th International Workshop on Data Mining for Online Advertising and Internet Economy, pages 1 9, 2012.

Dheeru Dua and Casey Graﬀ. UCI machine learning repository, 2017. URL http:// archive.ics.uci.edu/ml.

Otis Dudley Duncan. Path analysis: Sociological examples. American Journal of Sociology, 72(1):1 16, 1966.

Samuel Ferey and Pierre Dehez. Multiple causation, apportionment, and the Shapley value. The Journal of Legal Studies, 45(1):143 171, 2016.

Sergio Firpo and Cristine Pinto. Identiﬁcation and estimation of distributional impacts of interventions using changes in inequality measures. Journal of Applied Econometrics, 31 (3):457 486, 2016.

Nicole Fortin, Thomas Lemieux, and Sergio Firpo. Decomposition methods in economics. In Handbook of Labor Economics, volume 4, pages 1 102. Elsevier, 2011.

John Fox. Eﬀect analysis in structural equation models: Extensions and simpliﬁed methods of computation. Sociological Methods & Research, 9(1):3 28, 1980.

Meir Friedenberg and Joseph Y Halpern. Blameworthiness in multi-agent settings. In Proceedings of the 33rd AAAI Conference on Artiﬁcial Intelligence, volume 33, pages 525 532, 2019.

Clark Glymour, Kun Zhang, and Peter Spirtes. Review of Causal Discovery Methods Based on Graphical Models. Frontiers in Genetics, 10:524, 2019.

Arthur S Goldberger. Structural equation methods in the social sciences. Econometrica, pages 979 1001, 1972.

Ned Hall. Two concepts of causation. Causation and Counterfactuals, pages 225 276, 2004.

Joseph Halpern and Max Kleiman-Weiner. Towards formal deﬁnitions of blameworthiness, intention, and moral responsibility. In Proceedings of the 32nd AAAI Conference on Artiﬁcial Intelligence, volume 32, 2018.

Joseph Y Halpern and Judea Pearl. Causes and Explanations: A Structural-Model Approach. Part II: Explanations. The British Journal for the Philosophy of Science, 56(4): 889 911, 2005.

Axiomatic effect propagation in structural causal models

Jonathan M Henshaw, Michael B Morrissey, and Adam G Jones. Quantifying the causal pathways contributing to natural selection. Evolution, 74(12):2560 2574, 2020.

Wassily Hoeﬀding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeﬀding, pages 409 426. Springer, 1994.

Kosuke Imai, Luke Keele, and Dustin Tingley. A general approach to causal mediation analysis. Psychological Methods, 15(4):309, 2010.

Kosuke Imai, Luke Keele, Dustin Tingley, and Teppei Yamamoto. Unpacking the black box of causality: Learning about causal mechanisms from experimental and observational studies. American Political Science Review, 105(4):765 789, 2011.

Dominik Janzing, David Balduzzi, Moritz Grosse-Wentrup, Bernhard Sch olkopf, et al. Quantifying causal inﬂuences. The Annals of Statistics, 41(5):2324 2358, 2013.

Niki Kilbertus, Mateo Rojas Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik Janzing, and Bernhard Sch olkopf. Avoiding Discrimination through Causal Reasoning. In Advances in Neural Information Processing Systems, pages 656 666, 2017.

Evelyn M Kitagawa. Components of a Diﬀerence Between Two Rates. Journal of the American Statistical Association, 50(272):1168 1194, 1955.

Roger Koenker and Gilbert Bassett Jr. Regression quantiles. Econometrica, pages 33 50, 1978.

Roger Koenker and Kevin F Hallock. Quantile regression. Journal of Economic Perspectives, 15(4):143 156, 2001.

Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. Advances in Neural Information Processing Systems, 30, 2017.

David P Mac Kinnon, Amanda J Fairchild, and Matthew S Fritz. Mediation Analysis. Annual Review of Psychology, 58:593 614, 2007.

Sasan Maleki, Long Tran-Thanh, Greg Hines, Talal Rahwan, and Alex Rogers. Bounding the estimation error of sampling-based Shapley value approximation. ar Xiv preprint ar Xiv:1306.4265, 2013.

Nicolai Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7(6), 2006.

Rory Mitchell, Joshua Cooper, Eibe Frank, and Geoﬀrey Holmes. Sampling Permutations for Shapley Value Estimation. Journal of Machine Learning Research, 23(43):1 46, 2022.

Ronald Oaxaca. Male-Female Wage Diﬀerentials in Urban Labor Markets. International Economic Review, pages 693 709, 1973.

Judea Pearl. Direct and Indirect Eﬀects. Proceedings of the 17th Conference on Uncertainty in Artiﬁcial Intelligence, 32:411 420, 2001.

Singal and Michailidis

Judea Pearl. Causality. Cambridge University Press, 2009.

Judea Pearl. An introduction to causal inference. The International Journal of Biostatistics, 6(2):1 62, 2010.

Judea Pearl, Madelyn Glymour, and Nicholas P Jewell. Causal inference in statistics: A primer. John Wiley & Sons, 2016.

Bezalel Peleg and Peter Sudh olter. Introduction to the Theory of Cooperative Games, volume 34. Springer Science & Business Media, 2007.

Jonas Peters, Dominik Janzing, and Bernhard Sch olkopf. Elements of Causal Inference. The MIT Press, 2017.

Drago Plecko and Elias Bareinboim. Causal Fairness Analysis. Foundations and Trends in Machine Learning, 17(3):1 238, 2024.

Lloyd S Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28): 307 317, 1953.

Raghav Singal, George Michailidis, and Hoiyi Ng. Flow-based attribution in graphical models: a recursive Shapley approach. In Proceedings of the 38th International Conference on Machine Learning. PMLR, 2021.

Raghav Singal, Omar Besbes, Antoine Desir, Vineet Goyal, and Garud Iyengar. Shapley meets uniform:: An axiomatic framework for attribution in online advertising. Management Science, 68(10):7457 7479, 2022.

Peter Spirtes, Clark N Glymour, Richard Scheines, and David Heckerman. Causation, prediction, and search. MIT press, 2000.

Mukund Sundararajan and Amir Najmi. The Many Shapley Values for Model Explanation. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 9269 9278. PMLR, 2020.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic Attribution for Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3319 3328. PMLR, 2017.

Tyler J Vander Weele, Sunni L Mumford, and Enrique F Schisterman. Conditioning on intermediates in perinatal epidemiology. Epidemiology, 23(1):1, 2012.

Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.

Christopher Weible, Paul A Sabatier, and Mark Lubell. A comparison of a collaborative and top-down approach to the use of science in policy: Establishing marine protected areas in California. Policy Studies Journal, 32(2):187 207, 2004.

Sewall Wright. On the nature of size factors. Genetics, 3(4):367, 1918.

Axiomatic effect propagation in structural causal models

Sewall Wright. The method of path coeﬃcients. The Annals of Mathematical Statistics, 5 (3):161 215, 1934.

Yongkai Wu, Lu Zhang, and Xintao Wu. Counterfactual fairness: Unidentiﬁcation, bound and algorithm. In Proceedings of the 28th International Joint Conference on Artiﬁcial Intelligence, 2019.

Junzhe Zhang and Elias Bareinboim. Fairness in decision-making the causal explanation formula. In Proceedings of the 32nd AAAI Conference on Artiﬁcial Intelligence, 2018a.

Junzhe Zhang and Elias Bareinboim. Non-parametric path analysis in structural causal models. In Proceedings of the 34th Conference on Uncertainty in Artiﬁcial Intelligence, 2018b.

Lu Zhang, Yongkai Wu, and Xintao Wu. A Causal Framework for Discovering and Removing Direct and Indirect Discrimination. In Proceedings of the 26th International Joint Conference on Artiﬁcial Intelligence, 2017.