# Do Not Marginalize Mechanisms, Rather Consolidate!

Moritz Willig (Technical University of Darmstadt, moritz.willig@cs.tu-darmstadt.de)
Matej Zečević (Technical University of Darmstadt, matej.zecevic@tu-darmstadt.de)
Devendra Singh Dhami (Eindhoven University of Technology; Hessian Center for AI (hessian.AI), d.s.dhami@tue.nl)
Kristian Kersting (Technical University of Darmstadt; Hessian Center for AI (hessian.AI); German Research Center for AI (DFKI), kersting@cs.tu-darmstadt.de)

Structural causal models (SCMs) are a powerful tool for understanding the complex causal relationships that underlie many real-world systems. As these systems grow in size, the number of variables and the complexity of the interactions between them grow as well, rendering the models convoluted and difficult to analyze. This is particularly true in the context of machine learning and artificial intelligence, where an ever increasing amount of data demands new methods for simplifying and compressing large-scale SCM. While methods for marginalizing and abstracting SCM already exist today, they may destroy the causality of the marginalized model. To alleviate this, we introduce the concept of consolidating causal mechanisms to transform large-scale SCM while preserving consistent interventional behaviour. We show that consolidation is a powerful method for simplifying SCM, discuss the resulting reduction of computational complexity, and give a perspective on the generalizing abilities of consolidated SCM.

1 Introduction

Even complex real-world systems can be modeled using structural causal models (SCM) [Pearl, 2009], and several methods exist for doing so automatically from data [Spirtes et al., 2000, Pearl, 2009, Peters et al., 2017]. While technically reflecting the causal structure of the systems under consideration, SCM might not offer intuitive interpretations to the user.
Large-scale SCM, appearing for example in genomics, medical data [Squires et al., 2022, Ribeiro-Dantas et al., 2023] or machine learning [Schölkopf et al., 2021, Berrevoets et al., 2023], may become increasingly complex and thereby less interpretable. Contrary to this, computing average treatment effects might be too uninformative for the specific application, as the complete causal mechanism is compressed into a single number. Ideally, a user could express the factors of interest and obtain a reduced causal system that isolates the relevant mechanism from the rest of the model. In contrast to other probabilistic models, SCM model the additional aspect of interventions. Consider for example a row of dominoes and its corresponding causal graph as shown in Figure 1. If the starting stone is tipped over, it will affect the following stones, causing the whole row to fall. Humans usually have a good intuition about predicting the unfolding of such physical systems [Gerstenberg, 2022, Beck and Riggs, 2014, Zhou et al., 2023]. Beyond that, it is easy to imagine what would happen if we were to hold onto a domino stone, that is, if we intervened actively upon the domino sequence. Alternatively, we can programmatically simulate these systems to reason about their outcomes. A simulator tediously computes and updates positions, rotations and collision states of all objects in

DSD contributed while being with hessian.AI and TU Darmstadt, before joining TU/e.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Figure 1: Consolidation vs. Marginalization. Even simple real-world systems, like this row of dominoes, are composed of numerous intermediate steps. Classical structural causal models require the explicit evaluation of the individual structural equations to respect possible interventions along the computational chain and yield the final value of X5.
The intermediate steps (X2, X3, X4) might be marginalized to obtain a simplified functional representation. Marginalization, however, loses some of the causal interpretation of the process, as interventions on the marginalized variables can no longer be performed. Consolidation of causal models simplifies the graph structure (compare to Appendix D.1) while respecting interventions on the marginalized variables, thus preserving the ability to intervene on the underlying causal mechanisms. (Best viewed in color.)

the system. Depending on the abstraction level of our SCM, computations might be simplified to represent individual stones as binary variables, indicating whether a stone is standing up or getting pushed over. Nonetheless, classical evaluation of a simplified SCM is still performed step by step in order to respect possible interventions on the individual stones. Given that we might only be interested in the outcome, that is, whether or not the last stone will tip over, computing all intermediate steps seems to be a waste of computation, as already noted by Peters and Halpern [2021]. Under these premises, we are interested in preserving the ability to intervene while also being computationally efficient. Classical marginalization [Pearl, 2009, Rubenstein et al., 2017] is of no help to us, as it destroys the causal aspect of interventions attached to the variables.

Consolidation vs. Marginalization. By marginalizing we do not only remove variables, but also all associated interventions, destroying the causal mechanisms of the marginalized variables. The insight of this paper, as alluded to in Figure 1 (center), is that there exists an intermediate tier of consolidated models that fill the gap between unaltered SCM and ones with classical marginalization applied.
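This gap can be illustrated with a minimal, hypothetical sketch of the domino chain of Figure 1 (all function names are ours and not part of the paper's formalism):

```python
# Domino chain X1 -> X2 -> X3 -> X4 -> X5 with structural equations X_i := X_{i-1}.
# Interventions are passed as a dict {i: value} for do(X_i = value).

def evaluate_full(x1, do=None):
    """Step-by-step evaluation of the unaltered SCM; respects every intervention."""
    do = do or {}
    value = do.get(1, x1)
    for i in range(2, 6):
        value = do.get(i, value)  # X_i copies X_{i-1} unless intervened upon
    return value

def evaluate_marginalized(x1):
    """Classical marginalization of X2..X4: only X5 := X1 remains.
    Interventions such as do(X3 = 0) can no longer be expressed at all."""
    return x1

def evaluate_consolidated(x1, do=None):
    """Consolidated equation: functionally equivalent to the full model without
    computing X2..X4 explicitly; the most downstream intervention overwrites."""
    do = do or {}
    return do[max(do)] if do else x1
```

Without interventions all three evaluations agree; under do(X3 = 0) the marginalized model cannot even pose the query, while the consolidated one still answers it consistently with the full model.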
Consolidation simplifies the causal graph by compressing the computations of the consolidated variables into the equations of compositional variables that are functionally equivalent to the initial model, while still respecting the causal effects of possible interventions. As such, consolidation generalizes marginalization in the sense that marginalization can be modeled by consolidating without interventions (I = ∅; see Def. 1 and Sec. 3). If questions involve causality, then consolidation necessarily needs to be considered, since it can actually handle interventions (all cases where I ≠ ∅). If causal questions are not of concern, then standard marginalization remains the procedure of choice. One perspective on our approach is to describe consolidation as "intervention-preserving marginalization".

Structure and Contributions of this Paper. In section two we discuss the foundation of SCM and related work. In section three we formally establish the composition of structural causal equations, partitioned SCM, and finally consolidated SCM. In section four we discuss the possible compression and computational simplifications resulting from consolidating models. We present an applied example of simplifying time series data and, in a second example, demonstrate how consolidation reveals the policy of a game agent. Finally, in section five, we discuss the generalizing abilities of our method and provide a perspective on the broader impact and further research. The technical contributions of this paper are as follows:

- We define causal compositional variables that yield functionally equivalent distributions to SCM under intervention.
- We formalize (partially) consolidated SCM by partitioning SCM under a constraint that guarantees consistent evaluation with respect to the initial SCM.
- We discuss conditions under which consolidation leads to compressed causal equations.
- We demonstrate consolidation on two examples:
First, obtaining a causal model of reduced size and, second, revealing the underlying policy of a causal decision-making process.

2 Preliminaries and Related Work

In general, we write sets of variables in bold upper-case (X) and their values in lower-case (x). Single variables and their values are written in normal style (X, x). Specific elements of a set are indicated by a subscript index (Xi). Probability distributions over a variable X or a set of variables X are denoted by PX and PX, respectively. A detailed list of notation can be found in Appendix E.

Structural Causal Models provide a framework to formalize a notion of causality via graphical models [Pearl, 2009]. From a computational perspective, structural equation models (SEM) can be considered instead of SCM [Halpern, 2000, Spirtes et al., 2000]. While focusing on computational aspects of consolidating causal equations, we use Pearl's formalism of SCM. Modeling causal systems using SCM over SEM does not restrict our freedom, as Rubenstein et al. [2017] show consistency between both frameworks. Similar to the earlier works of Halpern [2000], Beckers and Halpern [2019] and Rubenstein et al. [2017], we do not assume independence of exogenous variables and model SCM with an explicit set of allowed interventions.

Definition 1. A structural causal model is a tuple M = (V, U, F, I, P_U) forming a directed acyclic graph G over variables X = {X_1, . . . , X_K} taking values in 𝒳 = ∏_{k∈{1,...,K}} 𝒳_k, subject to a strict partial order < over X.

Allowing for a deviation ε > 0 provides a relaxed consolidation constraint for noisy systems. SCM constitute a well-suited framework for representing causal knowledge in the form of graphical models. The ability to trace effects through structural equations, which yields explanations about the role of variables within causal models, is required to make results accessible to non-experts.
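A minimal sketch of the ingredients of Definition 1 (the class and all names below are illustrative stand-ins, not the paper's implementation): an SCM carries an explicit set I of allowed interventions, and evaluation rejects any do() outside of it.

```python
class SCM:
    """Toy structural causal model (V, U, F, I, P_U) with an explicit
    set I of allowed intervention targets."""

    def __init__(self, exogenous, equations, allowed_interventions):
        self.U = exogenous              # dict: name -> sampler for P_U
        self.F = equations              # dict: name -> (parents, function), in topological order
        self.I = allowed_interventions  # set of variable names that may be intervened upon

    def evaluate(self, do=None):
        do = do or {}
        if not set(do) <= self.I:
            raise ValueError("intervention outside the allowed set I")
        values = {name: sample() for name, sample in self.U.items()}
        for name, (parents, f) in self.F.items():
            values[name] = do[name] if name in do else f(*(values[p] for p in parents))
        return values

# A small chain A -> B -> C with a single allowed intervention target:
chain = SCM(
    exogenous={"A": lambda: 1},
    equations={"B": (("A",), lambda a: a + 1), "C": (("B",), lambda b: 2 * b)},
    allowed_interventions={"B"},
)
```

Here do(B = 0) is honoured, while do(C = 1) is rejected because C is not in I, mirroring the role of the explicit intervention set in the definition.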
Actual causation and, only recently, causal abstractions and constrained causal models have received attention in the field of causality [Halpern, 2016, Zennaro et al., 2023, Beckers and Halpern, 2019, Blom et al., 2020] and might be beneficial for future work on consolidation. Apart from computational advantages, consolidation of SCMs presents itself as a method that enables researchers to break down complex structures and present aspects of causal systems in a broadly accessible manner. Without such tools, SCMs run the danger of being useful only to specialized experts.

Acknowledgments and Disclosure of Funding

The authors acknowledge the support of the German Science Foundation (DFG) project "Causality, Argumentation, and Machine Learning" (CAML2, KE 1686/3-2) of the SPP 1999 "Robust Argumentation Machines" (RATIO). The work was supported by the Hessian Ministry of Higher Education, Research, Science and the Arts (HMWK) via the DEPTH group CAUSE of the Hessian Center for AI (hessian.AI). This work was partly funded by the ICT-48 Network of AI Research Excellence Center "TAILOR" (EU Horizon 2020, GA No 952215) and by the Federal Ministry of Education and Research (BMBF; project "PlexPlain", FKZ 01IS19081). It benefited from the Hessian research priority programme LOEWE within the project "WhiteBox", the HMWK cluster project "The Third Wave of AI", and the Collaboration Lab "AI in Construction" (AICO) of the TU Darmstadt and HOCHTIEF. We thank the anonymous reviewers for their valuable feedback and constructive criticism, which significantly improved this paper. We sincerely appreciate their time and expertise in evaluating our research. We acknowledge the use of ChatGPT for restructuring some sentences while writing the paper.

References

Tara V Anand, Adèle H Ribeiro, Jin Tian, and Elias Bareinboim. Effect identification in cluster causal diagrams. arXiv preprint arXiv:2202.12263, 2022.

Sarah R Beck and Kevin J Riggs. Developing thoughts about what might have been.
Child Development Perspectives, 8(3):175-179, 2014.

Sander Beckers and Joseph Y Halpern. Abstracting causal models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2678-2685, 2019.

Sander Beckers, Frederick Eberhardt, and Joseph Y Halpern. Approximate causal abstractions. In Uncertainty in Artificial Intelligence, pages 606-615. PMLR, 2020.

Jeroen Berrevoets, Krzysztof Kacprzyk, Zhaozhi Qian, and Mihaela van der Schaar. Causal deep learning. arXiv preprint arXiv:2303.02186, 2023.

Tineke Blom, Stephan Bongers, and Joris M Mooij. Beyond structural causal models: Causal constraints models. In Uncertainty in Artificial Intelligence, pages 585-594. PMLR, 2020.

Stephan Bongers, Tineke Blom, and Joris M Mooij. Causal modeling of dynamical systems. arXiv preprint arXiv:1803.08784, 2018.

Stephan Bongers, Patrick Forré, Jonas Peters, and Joris M Mooij. Foundations of structural causal models with cycles and latent variables. The Annals of Statistics, 49(5):2885-2915, 2021.

Johann Brehmer, Pim De Haan, Phillip Lippe, and Taco S Cohen. Weakly supervised causal representation learning. Advances in Neural Information Processing Systems, 35:38319-38331, 2022.

Krzysztof Chalupka, Frederick Eberhardt, and Pietro Perona. Multi-level cause-effect systems. In Artificial Intelligence and Statistics, pages 361-369. PMLR, 2016.

Tobias Gerstenberg. What would have happened? Counterfactuals, hypotheticals and causal judgements. Philosophical Transactions of the Royal Society B, 377(1866):20210339, 2022.

Joseph Y Halpern. Axiomatizing causal reasoning. Journal of Artificial Intelligence Research, 12:317-337, 2000.

Joseph Y Halpern. Actual causality. MIT Press, 2016.

Joseph Y Halpern and Spencer Peters. Reasoning about causal models with infinitely many variables. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 5668-5675, 2022.

Mark Hopkins and Judea Pearl. Causality and counterfactuals in the situation calculus.
Journal of Logic and Computation, 17(5):939-953, 2007.

Andrei N Kolmogorov. On tables of random numbers. Sankhyā: The Indian Journal of Statistics, Series A, pages 369-376, 1963.

Judea Pearl. Causality. Cambridge University Press, 2009.

Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: Foundations and learning algorithms. The MIT Press, 2017.

Jonas Peters, Stefan Bauer, and Niklas Pfister. Causal models for dynamical systems. In Probabilistic and Causal Inference: The Works of Judea Pearl, pages 671-690. 2022.

Spencer Peters and Joseph Y Halpern. Causal modeling with infinitely many variables. arXiv preprint arXiv:2112.09171, 2021.

Marcel da Câmara Ribeiro-Dantas, Honghao Li, Vincent Cabeli, Louise Dupuis, Franck Simon, Liza Hettal, Anne-Sophie Hamy, and Hervé Isambert. Learning interpretable causal networks from very large datasets, application to 400,000 medical records of breast cancer patients. arXiv preprint arXiv:2303.06423, 2023.

Paul K Rubenstein, Sebastian Weichwald, Stephan Bongers, Joris M Mooij, Dominik Janzing, Moritz Grosse-Wentrup, and Bernhard Schölkopf. Causal consistency of structural equation models. arXiv preprint arXiv:1707.00819, 2017.

Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612-634, 2021.

Peter Spirtes, Clark N Glymour, Richard Scheines, and David Heckerman. Causation, prediction, and search. MIT Press, 2000.

Chandler Squires, Annie Yun, Eshaan Nichani, Raj Agrawal, and Caroline Uhler. Causal structure discovery between clusters of nodes induced by latent factors. In Conference on Causal Learning and Reasoning, pages 669-687. PMLR, 2022.

Fabio Massimo Zennaro, Máté Drávucz, Geanina Apachitei, W Dhammika Widanage, and Theodoros Damoulas. Jointly learning consistent causal abstractions over multiple interventional distributions.
arXiv preprint arXiv:2301.05893, 2023.

Liang Zhou, Kevin A Smith, Joshua B Tenenbaum, and Tobias Gerstenberg. Mental Jenga: A counterfactual simulation model of causal judgments about physical support. Journal of Experimental Psychology: General, 2023.

Alexander K Zvonkin and Leonid A Levin. The complexity of finite objects and the development of the concepts of information and randomness by means of the theory of algorithms. Russian Mathematical Surveys, 25(6):83, 1970.

Supplementary Material: Do Not Marginalize Mechanisms, Rather Consolidate!

A Evaluation of Partitioned SCM

A partitioned SCM M_A consists of several sub-SCM M_A that, in sum, cover all variables and structural equations of an initial SCM M. Thus, the evaluation of a partitioned SCM yields the same set of values v ∈ V as the original M. Similar to the evaluation of the structural equations in the initial M, the sub-SCM need to be evaluated in a specific order to guarantee that all required inputs u′ exist. As such, sub-SCM can be considered multivariate variables that establish another high-level DAG. The evaluation order is determined via the relation R_X as defined in Sec. 3.1 and depends on the graph partition A and the order of X imposed by the initial SCM.

Algorithm 2 Evaluation of partitioned SCM
1: procedure PartitionedSCMEval(M_A, u, I)
2:   x ← u                        ▷ x will gradually collect all values x ∈ X of M
3:   for A in sort(A, R_X) do     ▷ sort clusters by the strict partial order imposed by M
4:     M_A′ ← M_A′ ∈ M_A where A′ = A
5:     u′ ← {x_i ∈ x | X_i ∈ M_U′}
6:     I′ ← ψ_A(I)
7:     v ← M_A′^I′(u′)
8:     x ← x ∪ v
9:   end for
10:  v ← {x_i ∈ x | X_i ∈ M_V}    ▷ filter all u ∈ U to get v ∈ V
11:  return v
12: end procedure

Algorithm 2 shows the evaluation of a partitioned SCM, where M_A is the partitioned SCM we want to evaluate, u are the values of the exogenous variables of the initial model M, and I is the set of applied interventions. The outcomes of sub-SCM that are not related via R_X are invariant to the evaluation order among each other.
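The loop of Algorithm 2 can be sketched in Python; the data structures below are hypothetical stand-ins for M_A and R_X, and ψ_A is simplified to restricting the do-dictionary to the cluster's variables:

```python
from graphlib import TopologicalSorter

def partitioned_scm_eval(clusters, cluster_dag, equations, u, do=None):
    """clusters: cluster name -> its variables in local topological order;
    cluster_dag: cluster name -> set of predecessor clusters (high-level DAG);
    equations: variable -> (parents, function); u: exogenous values of M."""
    do = do or {}
    x = dict(u)                                  # x gradually collects all values
    for cluster in TopologicalSorter(cluster_dag).static_order():
        for v in clusters[cluster]:              # evaluate this cluster's sub-SCM
            parents, f = equations[v]
            x[v] = do[v] if v in do else f(*(x[p] for p in parents))
    return x
```

Any total order returned by `static_order()` that is consistent with the cluster DAG yields the same result, which mirrors the freedom sort(A, R_X) has in picking a valid total ordering.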
Even though R_X defines the ordering of the sub-SCM only up to a partial order, sort(A, R_X) can pick any total ordering that is consistent with R_X.

Proof 1 (Consistency of Partitioned SCM Evaluation) Every evaluation of M_A′ in step 7 computes all variables V_i ∈ A by evaluating the f_i of the original SCM, yielding the same values as the evaluation of A in M. Therefore P_{M_A′} = P_{M_A}. By Def. 4 every variable V ∈ V is contained within some sub-SCM M_A′. The evaluation of PartitionedSCMEval is complete, in the sense that all V = ⋃_{A∈A} A are evaluated, as the evaluation of all M_A′ ∈ M_A is guaranteed by iterating over all A in step 3. Finally, ⋃_{A∈A} P_{M_A′} = ⋃_{A∈A} P_{M_A} = P_{M_V}.

B Complexity reduction in function composition

The reduction of encoding length might vary depending on the type and structure of the equations under consideration. No compression of the structural equations is gained when the system of consolidated equations is already minimal. Compression of an equation to the identity function is showcased in the following.

B.1 Compression of chained inverses

Reduction to constant complexity for the unintervened system is reached in the case of f_B = f_A^(-1). Consider the equation chain X → A → B with A getting marginalized. Immediately f_B′ := f_B ∘ f_A = f_A^(-1) ∘ f_A = Id follows. Therefore, B := X, which is a single assignment of the value(s) of X into B. The remaining complexity within the consolidated function is then only due to conditional branching in the cases do(A = a), do(B = b) ∈ I.

B.2 Matrix composition is not sufficient for compressing equations

The operation of matrix multiplication, as a way of expressing the composition of linear functions, stays within the class of matrices. Matrix multiplication, therefore, serves as a possible candidate to be considered when consolidating equations and reducing the encoding length of a linear structural system.
When written down at a high level, matrices can be expressed in terms of single variables A, B ∈ R^(M×N) and matrix multiplication ·: R^(M×N) × R^(N×O) → R^(M×O). Assuming the equations f_Y := B·X and f_Z := A·Y, we can reduce the length of the composed equation f_Z := A·B·X by multiplying the matrices A and B together, f_Z = C·X with C = A·B. While we effectively reduced the number of high-level symbols written in the equation, we are hiding computational complexity in the structure of the matrix C. The following simple counterexample demonstrates a situation where the size, as well as the number of non-zero entries, even increases:

A = [0 1; 0 1; 0 1],   B = [0 0 0; 0 1 1],   C = A·B = [0 1 1; 0 1 1; 0 1 1]

Thus, pure matrix multiplication is not suitable to keep, or even minimize, the size of composed function representations.

B.3 Compression over Finite Discrete Domains

Consolidation may reduce the number of variables within a graph, but burdens the remaining equations with the complexity of the consolidated variables. Without the need to explicitly compute the values of consolidated variables, we might leverage cancellation effects to simplify equations, as outlined in the main paper. In terms of compression, no guarantees can be given in the general case. However, we will now show that the often considered case of chained maps between finite discrete domains simplifies, or at least preserves, complexity. The cardinality of the image of a deterministic function f : X → Y between two finite discrete sets X, Y is bounded by the cardinality of its domain: |Img(f)| ≤ |Dom(f)| = |X|, where Img(f) is the image and Dom(f) the domain of f. In particular, the strict inequality |Img(f)| < |Dom(f)| holds for all non-injective maps. Function composition may further reduce the effective domain Dom_effective(f) of a function, by only considering values in the image of the previous map as inputs to the next function.
In contrast to considering all possible values of X, as in the case of the non-composed map, the image of the previous function may only be a subset of X. Therefore, for f_2 ∘ f_1: |Img_effective(f_2)| ≤ |Dom_effective(f_2)| = |Img(f_1)| ≤ |Dom(f_1)|. In particular, the effective image of a composition chain f_n ∘ ... ∘ f_1 is bounded by the function with the smallest image: |Img_effective(f_n ∘ ... ∘ f_1)| ≤ min_i |Img(f_i)|. Thus, equation chains over finite discrete domains strictly preserve or reduce the effective size of the image, allowing for a possibly simpler combined representation in comparison to representing the functions individually.

C Reparameterization of non-deterministic structural equations

Consolidation of structural equations might lead to duplication of non-deterministic terms within consolidated systems, for example when consolidating fork structures (compare to Sec. 4.1). Without further precautions, different values might be sampled from the duplicated non-deterministic equations. An example where consolidating a variable B with a non-deterministic equation f_B (indicated by a squiggly line) leads to inconsistent behaviour is shown in Figure 6. In M1, C and D both copy the value of B; therefore, c = d always holds. M1′ shows the graph where B is consolidated out of M1. As a result, the non-deterministic equation f_B is duplicated into the equations of C and D, such that f_C := Bern(A) and f_D := Bern(A). Within the consolidated model M1′, different values might be sampled from the two noise terms Bern(A) in f_C and f_D. Consequently, c ≠ d might occur in M1′. To obtain behaviour consistent with the initial M1, we need to ensure agreement about the value of Bern(A) across all instances of the duplicated equation. To do so, we reparameterize M1 and explicitly store a fixed value, sampled from Bern(A), into a new exogenous variable R. The equation f_B is then reparameterized into a deterministic structural equation taking the variable R as an additional argument, resulting in M2.
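A minimal sketch of this reparameterization (the function names are ours; we realize Bern(a) via the standard uniform comparison r < a):

```python
import random

def m1_consolidated_naive(a):
    """Consolidating B duplicates the noise term Bern(A): the copies in f_C and
    f_D are sampled independently, so c and d may disagree."""
    c = random.random() < a      # f_C := Bern(A)
    d = random.random() < a      # f_D := Bern(A), drawn independently
    return c, d

def m2_consolidated(a, r):
    """After reparameterization the randomness is fixed in the exogenous R, so
    every duplicated instance of f_B yields the same value and c = d always holds."""
    c = r < a                    # f_C := deterministic in A and R
    d = r < a                    # f_D := the same deterministic equation
    return c, d
```

Sampling r once per evaluation of M2 restores the perfect correlation between C and D that naive consolidation destroys.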
When consolidating B within M2, all instances of f_B now yield the same value, as the noise term is fixed via R, and finally P_{M2′} = P_{M1}.

Figure 6: Reparameterization of non-deterministic models. (1) Non-Deterministic Model: yields C = D. (2) Inconsistent Behaviour: C ≠ D. (3) Reparameterization to Deterministic Equations. (4) Consistent Model Behaviour: C = D. The SCM M1 contains a non-deterministic equation B := Bern(A) (marked with a squiggly line). With C := B and D := B, M1 always yields C = D. Simply consolidating (or marginalizing) B creates a model M1′ with C := Bern(A) and D := Bern(A), such that possibly C ≠ D. Reparameterizing f_B by introducing an exogenous random variable R := U(0, 1) and B := A < R yields the SCM M2 with only deterministic equations. Consolidating (or marginalizing) B in M2 leads to M2′ where C := A < R and D := A < R, thus always C = D.

D Consolidation Examples

In this section we show further detailed applications of consolidation. Section D.1 presents the worked-out consolidation of the dominoes motivating example of the paper, with regard to the generalizing abilities of consolidated models. Section D.2 considers consolidation of the classical firing squad example. In contrast to the other examples, we focus on consolidating graphs with multiple edges in the causal graph. Lastly, we provide the causal graph and structural equations of the game agent policy discussed in the main paper in Section D.4.

D.1 Motivating Example: Dominoes

While we applied consolidation to particular SCMs in the main paper, we will now discuss the motivating example with a focus on obtaining representations that generalize over populations of SCM. We demonstrate this on the particular example of a row of dominoes, as a simple SCM with a highly homogeneous structure. Regardless of whether the SCM is obtained by using methods for the direct identification of causal graphs from image data, as presented by Brehmer et al.
[2022], or by abstracting a physical simulation using τ-abstractions [Beckers and Halpern, 2019], we assume to be provided with a binary representation of the domino stones. The state of every domino S_i indicates whether it is standing up or getting pushed over. In this case, the structural equations for all dominoes are the same: f_i := S_{i-1}. As a result, tipping over the first stone in a row will lead to all stones falling. Also, we are only interested in the final outcome of the chain, that is, whether the last stone will fall or not (E = {S_n}). Again, we use consolidation to collapse the structural equations in the unintervened case: S_n := f_n ∘ ... ∘ f_1 = S_1. We consider a single active allowed intervention of holding up any of the dominoes or tipping it over, I = {do(S_i = 0), do(S_i = 1)}. Upon evaluation, the unconsolidated model needs to check for every domino whether it is being intervened upon or not, requiring n conditional branches. Using the fact that perfect interventions overwrite the variable state for the following dominoes, we introduce a first-order quantifier that handles all interventions in a unified way. Finally, by combining the formulas of the intervened and unintervened case, we find the following simple equation:

S_n := x_i   if ∃i. do(S_i = x_i) ∈ I
       S_1   else

The resulting equation no longer has a notion of the actual number of dominoes and, in fact, is invariant to it. We realise that introducing the first-order for-all and exists quantifiers allows for a unified representation of arbitrary chains of dominoes. Similar observations are discussed in Peters and Halpern [2021] and Halpern and Peters [2022], which introduce generalized SEM (GSEM). As the intermediate equations are no longer computed explicitly, the structural equations of consolidated models for different row lengths only differ in the set of allowed interventions I. That is, for a row of three domino stones I = {do(V_1 = v_1), do(V_2 = v_2), do(V_3 = v_3)}, while for four stones the additional do(V_4 = v_4) is defined.
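The length-invariance of the consolidated equation can be checked with a small sketch (our own illustrative code; `s_n_stepwise` plays the role of the explicit chain with its n conditional branches):

```python
def s_n_consolidated(s1, do=None):
    """Consolidated domino equation: the intervention do(S_i = x_i) overwrites
    everything upstream of it, independently of the row length n."""
    do = do or {}
    return do[max(do)] if do else s1

def s_n_stepwise(s1, n, do=None):
    """Reference evaluation of f_i := S_{i-1} with one conditional check per stone."""
    do = do or {}
    s = do.get(1, s1)
    for i in range(2, n + 1):
        s = do.get(i, s)
    return s
```

For any row length n and any single intervention, both evaluations agree, while the consolidated form performs no per-stone work at all.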
As set out in the introduction of this paper, we consider consolidation as a tool for obtaining more interpretable SCM. Towards this end, consolidation might help us in detecting similar structures within an SCM. Doing so eases the understanding of causal systems, as the user only has to understand the general mechanisms of a particular SCM once and is then able to apply the gained knowledge to all newly appearing SCM of the same type.

f_B(A) := 0 if A ≤ 5; 1 if A > 5
f_C(B) := true if B = 0; false if 0 < B ≤ 10; true otherwise
f_E(A) := A % 5 = 0
f_F(A) := A % 10 = 0
f_G(E, F) := E ∧ F
f_D(C) := C
f_H(C, G) := C ∧ G

Figure 7: Example SCM for the Application of CONSOLIDATE. The figure shows a toy SCM for demonstrating the application of the CONSOLIDATE algorithm (endogenous variables are B, C, D, E, F, G, H with only one exogenous variable A; the subscript on f denotes the variable to be determined, e.g., B ← f_B(A)).

D.2 Firing Squad Example

While the dominoes and tool wear examples were mainly concerned with the consolidation of sequential structures, we want to briefly demonstrate the consolidation of structural equations that are arranged in a parallel fashion. We consider a variation of the well-known firing squad example [Hopkins and Pearl, 2007] with a variable number N of riflemen. A captain (C) gives orders to the riflemen (R_i, i ∈ {1 . . . N}), which shoot accurately, and the prisoner (P) dies. For the sequential stacking of equations we found that interventions exert an overwriting effect.
That is, every intervention fixes the value of a variable, making the unfolding of the following equations independent of all previous computations. To yield a similar effect for parallel equations, we need to block all paths between the cause and the effect. In this scenario, this can easily be expressed by using an all-quantifier. When consolidating the SCM, we consider only the captain C and the prisoner P, E = {C, P}, while allowing for any combination of interventions that prevent the riflemen from shooting, I = 𝒫({do(R_i = 0)}_{i∈{1...N}}). After consolidation, we obtain the following equation:

P := lives   if C = 0 ∨ (∀R_i. do(R_i = 0) ∈ I)
     dies    else

As with the dominoes example, we are again in a situation where the consolidated equation intuitively summarizes the effects of the individual equations: "The prisoner lives if the captain does not give orders, or if all riflemen are prevented from shooting."

D.3 Step-by-step CONSOLIDATE Application

In this section we provide a step-by-step application of the CONSOLIDATE algorithm given in Algorithm 1. Consider the SCM shown in Figure 7 with its structural equations and resulting graph. The endogenous variables are B, C, D, E, F, G, H with only one exogenous variable A. The structural equations are highlighted on the right-hand side. Note that the subscript on f_X denotes the variable to be determined, e.g., B ← f_B(A). In a first step, the algorithm's user has to decide on a suitable partition.
Consider for instance the following partition (indicated by dashed lines in the figure) and the following allowed intervention and consolidation sets:

A = {{E, F, G}, {B, C}, {D, H}}
I = {{do(D = true)}, {do(D = false)}, {do(G = false)}}
E = {C, F, H}

The following example presents a step-by-step application of the CONSOLIDATE algorithm for the cluster A_1 = {E, F, G}:

Step 3: E_1 ← {E, F, G} ∩ {C, F, H} = {F}
Step 4: E′_1 ← {F} ∪ (pa(V \ {E, F, G}) ∩ {E, F, G}) = {F} ∪ ({A, B, C, G} ∩ {E, F, G}) = {F, G}
Step 5: U_{A_1} ← pa({E, F, G}) \ {E, F, G} = {A, E, F} \ {E, F, G} = {A}
Step 6: I_{A_1} ← {{do(X_i = v) ∈ I : X_i ∈ {E, F, G}} : I ∈ I} = {{do(G = false)}}
Step 7: ρ_{E′_1} ← {f_E(A) := A mod 5 = 0; f_F(A) := A mod 10 = 0; f_G(E, F) := E ∧ F}
Step 8: ρ′_{E′_1} ← argmin K(ρ_{E′_1}) = {ρ_F(A) := A mod 10 = 0; ρ_G(F, I_{A_1}) := F ∧ (do(G = false) ∉ I_{A_1})}
Step 9: M_{A_1,E} ← ({F, G}, {F}, ρ′_{E′_1}, {{do(G = false)}}, P_A)

Note how computing f_E is no longer required. In a similar fashion, the equations in A_2 resemble a chain that can be composed: f_C ∘ f_B (previously called "stacked"; cf. Sec. 4.1). Since |Img(f_B)| = 2, at least one of the three conditions of f_C (f_C is a 3-case function) will be discarded, eventually yielding ρ′_{E′_2} ← {ρ_C(A) := A ≤ 5}. As D is not in E and not required by any other sub-SCM, it can be marginalized. A_3 then reduces to ρ′_{E′_3} ← {ρ_H(C, G) := C ∧ G}.

D.4 Revealing Agent Policy: Causal Graph and Equations

In this section we explicitly list the structural equations representing the observed interactions between a platformer environment and a possible rule-based agent. The resulting causal graph is shown in Fig. 8 at the end of the appendix.
Except for the parentless variables coin_reward, powerup_reward, enemy_reward, flag_reward, player_position, position_coin, position_powerup, position_enemy, position_flag and target_flag, which are exogenous and determined by the environment, all variables are considered endogenous:

player_position, position_coin, position_powerup, position_enemy, position_flag ∈ [0..1]^2
coin_reward := 3; powerup_reward := 1; enemy_reward := 9; flag_reward := 2

With X in {coin, powerup, enemy, flag}:
distance_X := ‖position_X − player_position‖_2
near_X := distance_X < 3.0
targeting_cost_X := 1.0 + 0.5 · distance_X

target_coin := targeting_cost_coin < coin_reward
target_powerup := targeting_cost_powerup < powerup_reward
target_enemy := targeting_cost_enemy < enemy_reward ∧ powered_up
target_flag := True
powered_up := target_powerup

towards_coin := target_coin ∧ coin_reward > max({X_reward | target_X}, X ∈ {powerup, enemy, flag})
towards_powerup := target_powerup ∧ powerup_reward > max({X_reward | target_X}, X ∈ {coin, enemy, flag})
towards_enemy := target_enemy ∧ enemy_reward > max({X_reward | target_X}, X ∈ {coin, powerup, flag})
towards_flag := target_flag ∧ flag_reward > max({X_reward | target_X}, X ∈ {coin, powerup, enemy})

jump := near_enemy ∧ ¬powered_up

planning_sequence_i :=
  finished  if towards_flag ∧ (flag ∈ ⋃_{j=1}^{i−1} planning_sequence_j)
  coin      if towards_coin ∧ (coin ∉ ⋃_{j=1}^{i−1} planning_sequence_j)
  powerup   if towards_powerup ∧ (powerup ∉ ⋃_{j=1}^{i−1} planning_sequence_j)
  enemy     if towards_enemy ∧ (enemy ∉ ⋃_{j=1}^{i−1} planning_sequence_j)
  flag      if towards_flag ∧ (flag ∉ ⋃_{j=1}^{i−1} planning_sequence_j)
  finished  else

score := 20 − time_taken
  + coin_reward if coin ∈ ⋃_i planning_sequence_i
  + powerup_reward if powerup ∈ ⋃_i planning_sequence_i
  + enemy_reward if enemy ∈ ⋃_i planning_sequence_i ∧ powerup ∈ ⋃_i planning_sequence_i
  + flag_reward if flag ∈ ⋃_i planning_sequence_i

E Mathematical symbols and notation

The following table contains mathematical functions and notation used throughout the paper.

Notation          Meaning
X; X              A (set of) variable(s).
x; x              Value(s) of X; X.
X_i               The i-th variable of X.
X_S               The subset {X_i : i ∈ S} of X.
P_X               A probability distribution over variables X.
x ∼ P_X           A value x sampled from a distribution over X.
𝒫(·)              The power set.
f ∘ g             Function composition, (f ∘ g)(x) = f(g(x)).
∏_{X_i∈X} 𝒳_i     N-ary Cartesian product over the domains of X.
‖·‖_2             l2 vector norm.
U(a, b)           Uniform distribution.
N(µ, σ²)          Normal distribution.
Bern(p)           Bernoulli distribution; takes value 1 with probability p and 0 otherwise.
P_M               Probability distribution over the SCM M.
P^I_M             Probability distribution over the SCM M under intervention I.
V_i               An endogenous variable of an SCM M.
U_i               An exogenous variable of an SCM M.
f_i               Structural equation of the variable X_i.

Figure 8: Causal graph of an agent policy. The causal graph of a greedy agent inside a platformer environment, containing the distance_X, near_X, targeting_cost_X, target_X, towards_X and planning_sequence_i variables. The parentless variables (player_position, the position_X and the X_reward variables) are exogenous; their value is determined via the game environment. The final score variable is left out for clarity.