A Measure-Theoretic Axiomatisation of Causality

Junhyung Park, Empirical Inference Department, MPI for Intelligent Systems, 72076 Tübingen, Germany (junhyung.park@tuebingen.mpg.de)
Simon Buchholz, Empirical Inference Department, MPI for Intelligent Systems, 72076 Tübingen, Germany (simon.buchholz@tuebingen.mpg.de)
Bernhard Schölkopf, Empirical Inference Department, MPI for Intelligent Systems, 72076 Tübingen, Germany (bs@tuebingen.mpg.de)
Krikamol Muandet, CISPA Helmholtz Center for Information Security, 66123 Saarbrücken, Germany (muandet@cispa.de)

Abstract. Causality is a central concept in a wide range of research areas, yet there is still no universally agreed axiomatisation of causality. We view causality both as an extension of probability theory and as a study of what happens when one intervenes on a system, and argue in favour of taking Kolmogorov's measure-theoretic axiomatisation of probability as the starting point towards an axiomatisation of causality. To that end, we propose the notion of a causal space, consisting of a probability space along with a collection of transition probability kernels, called causal kernels, that encode the causal information of the space. Our proposed framework is not only rigorously grounded in measure theory, but it also sheds light on long-standing limitations of existing frameworks, including, for example, cycles, latent variables and stochastic processes.

1 Introduction

Causal reasoning has been recognised as a hallmark of human and machine intelligence, and in recent years, the machine learning community has taken a rapidly growing interest in the subject [46, 53, 54], in particular in representation learning [55, 41, 62, 57, 11, 40] and natural language processing [35, 19].
Causality has also been extensively studied in a wide range of other research domains, including, but not limited to, philosophy [39, 64, 16], psychology [60], statistics [45, 56] including social, biological and medical sciences [51, 31, 29, 26], mechanics and law [6]. The field of causality was born from the observation that probability theory and statistics (Figure 1a) cannot encode the notion of causality, and so we need additional mathematical tools to support the enhanced view of the world involving causality (Figure 1b). Our goal in this paper is to give an axiomatic framework for the forwards direction of Figure 1b, which currently consists of many competing models (see Related Works). As a starting point, we observe that the forwards direction of Figure 1a, i.e. probability theory, has a set of axioms based on measure theory (Axioms 2.1) that are widely accepted and used¹, and hence argue that it is natural to take the primitive objects of this framework as the basic building blocks. Despite the fact that all of the existing mathematical frameworks of causality recognise the crucial role that probability plays, or should play, in any causal theory, it is surprising that few of them try to build directly upon the axioms of probability theory, and those that do fall short in different ways (see Related Works).

¹Kolmogorov's axiomatisation is without doubt the standard in probability theory. However, we are aware of other, less popular frameworks, for example, one that is more amenable to Bayesian probability [34], one based on game theory [59] and one based on imprecise probabilities [61].

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

[Figure 1: Data generating processes and data. (a) Statistics (or machine learning) is an inverse problem of probability theory. (b) Causal discovery is an inverse problem of causal reasoning.]
On the other hand, we place manipulations at the heart of our approach to causality; in other words, we make changes to some parts of a system, and we are interested in what happens to the rest of the system. This manipulative philosophy towards causality is shared by many philosophers [64], and is the essence behind almost all causal frameworks proposed and adopted in the statistics/machine learning community that we are aware of. To this end, we propose the notion of causal spaces (Definition 2.2), constructed by adding causal objects, called causal kernels, directly to probability spaces. We show that causal spaces strictly generalise (the interventional aspects of) existing frameworks, i.e. given any configuration of, for example, a structural causal model or the potential outcomes framework, we can construct a causal space that carries the same (interventional) information. Further, we show that causal spaces can seamlessly support situations where existing frameworks struggle, for example those with hidden confounders, cyclic causal relationships or continuous-time stochastic processes.

Related Works

We stress that our paper should not be understood as a criticism of the existing frameworks, or as an attempt to replace them. On the contrary, we foresee that they will continue to thrive in whatever domains they have been used in, and will continue to find novel application areas. Most prominently, there are the structural causal models (SCMs) [43, 46], based most often on directed acyclic graphs (DAGs). Here, the theory of causality is built around variables and structural equations, and probability only enters the picture through a distribution on the exogenous variables [33].
Efforts have been made to axiomatise causality based on this framework [21, 23, 28], but models based on structural equations or graphs inevitably rely on assumptions even for the definitions themselves, such as being confined to a finite number of variables, the issue of solvability in non-recursive (or cyclic) models, that all common causes (whether latent or observed) are modelled, or that the variables in the model do not causally affect anything outside the model. Hence, these cannot be said to be axiomatic definitions in the strictest sense. In a series of examples in Section 4, we highlight cases for which causal spaces have a natural representation but SCMs do not, including common causes, cycles and continuous-time stochastic processes. The potential outcomes framework is a competing model, most often used in economics, social sciences or medical research, in which we have a designated treatment variable, whose causal effect we are interested in, and for each value of the treatment variable, we have a separate, potential outcome variable [31, 26]. There are other, perhaps lesser-known approaches to modelling causality, such as those based on decision theory [17, 52], on category theory [32, 20], on an agent explicitly performing actions that transform the state space [15], or on settable systems [63]. Perhaps the works most relevant to this paper are those that have already recognised the need for an axiomatisation of causality based on measure-theoretic probability theory. Ortega [42] uses a particular form of a tree to define a causal space, and in so doing, uses an alternative, Bayesian set-up of probability theory [34]. It has the obvious drawback that it only considers countable sets of "realisations", clearly ruling out many interesting and commonly-occurring cases, and it also does not seem to accommodate cycles. Heymann et al. [27] define the information dependency model based on measurable spaces to encode causal information.
We find this to be a highly interesting and relevant approach, but the issue of cycles and solvability arises, and again, only countable sets of outcomes are considered, with the authors admitting that the results are likely not to hold for uncountable sets. Moreover, probabilities and interventions require additional work to be taken care of. Lastly, Cabreros and Storey [12] attempt to provide a measure-theoretic grounding to the potential outcomes framework, but thereby confine attention to the setting of a finite number of variables, and even restrict the random variables to be discrete. Finally, we mention the distinction between type causality and actual causality. The former is a theory about general causality, involving statements such as "in general, smoking makes lung cancer more likely". Type causality is what we will be concerned with in this paper. Actual causality, on the other hand, is interested in whether a particular event was caused by a particular action, dealing with statements such as "Bob got lung cancer because he smoked for 30 years". It is an extremely interesting area of research that has far-reaching implications for concepts such as responsibility, blame, law, harm [4, 5], model explanation [7] and algorithmic recourse [36]. Many definitions of actual causality have been proposed [25, 22, 24], but the question of how to define actual causality is still not settled [1]. The current definitions of actual causality are all grounded in (variants of) SCMs, and though out of the scope of this paper, it will be an interesting future research direction to consider how actual causality can be incorporated into our proposed framework.

2 Causal Spaces and Interventions

Familiarity with measure-theoretic probability theory is necessary, and we succinctly collect the most essential definitions and results in Appendix A.
Most important to note is the definition of a transition probability kernel (also given at the end of Appendix A.1): for measurable spaces $(E, \mathcal{E})$ and $(F, \mathcal{F})$, a mapping $K : E \times \mathcal{F} \to [0, \infty]$ is called a transition probability kernel from $(E, \mathcal{E})$ into $(F, \mathcal{F})$ if the mapping $x \mapsto K(x, B)$ is measurable for every set $B \in \mathcal{F}$ and the mapping $B \mapsto K(x, B)$ is a probability measure on $(F, \mathcal{F})$ for every $x \in E$. Also worthy of mention is the definition of a measurable rectangle: if $(E, \mathcal{E})$ and $(F, \mathcal{F})$ are measurable spaces and $A \in \mathcal{E}$ and $B \in \mathcal{F}$, then the measurable rectangle $A \times B$ is the set of all pairs $(x, y)$ with $x \in A$ and $y \in B$. All proofs are deferred to Appendix F.

We start by recalling the axioms of probability theory, which we will use as the starting point of our work.

Axioms 2.1 (Kolmogorov [38]). A probability space is a triple $(\Omega, \mathcal{H}, \mathbb{P})$, where:
(i) $\Omega$ is a set of outcomes;
(ii) $\mathcal{H}$ is a collection of events forming a $\sigma$-algebra, i.e. a collection of subsets of $\Omega$ such that (a) $\Omega \in \mathcal{H}$; (b) if $A \in \mathcal{H}$, then $\Omega \setminus A \in \mathcal{H}$; (c) if $A_1, A_2, \ldots \in \mathcal{H}$, then $\cup_n A_n \in \mathcal{H}$;
(iii) $\mathbb{P}$ is a probability measure on $(\Omega, \mathcal{H})$, i.e. a function $\mathbb{P} : \mathcal{H} \to [0, 1]$ satisfying (a) $\mathbb{P}(\emptyset) = 0$; (b) $\mathbb{P}(\cup_n A_n) = \sum_n \mathbb{P}(A_n)$ for any disjoint sequence $(A_n)$ in $\mathcal{H}$; (c) $\mathbb{P}(\Omega) = 1$.

In the development of probability theory, one starts by assuming the existence of a probability space $(\Omega, \mathcal{H}, \mathbb{P})$. However, the actual construction of probability spaces that can carry random variables corresponding to desired random experiments is done through (repeated applications of) two main results, those of Ionescu-Tulcea and Kolmogorov [14, p.160, Chapter IV, Section 4]; the former constructs a probability space that can carry a finite or countably infinite chain of trials, and the latter shows the existence of a probability space that can carry a process with an arbitrary index set.
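On a finite space, a transition probability kernel is nothing more than a row-stochastic matrix; the following minimal sketch (spaces and numbers our own, not from the paper) makes the two defining conditions concrete.

```python
import numpy as np

# A transition probability kernel K from E = {0, 1} into F = {0, 1, 2}:
# row x is the probability measure K(x, ·) on F, so each row sums to 1;
# measurability of x ↦ K(x, B) is automatic on a finite space.
K = np.array([
    [0.2, 0.5, 0.3],   # K(0, ·)
    [0.7, 0.1, 0.2],   # K(1, ·)
])

def kernel_prob(K, x, B):
    """K(x, B): mass the kernel assigns to the measurable set B from point x."""
    return K[x, sorted(B)].sum()

# Each K(x, ·) is a probability measure on F.
assert np.allclose(K.sum(axis=1), 1.0)
```

For instance, `kernel_prob(K, 0, {1, 2})` evaluates the measure $K(0, \cdot)$ on the set $\{1, 2\}$.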
In both cases, the measurable space $(\Omega, \mathcal{H})$ is constructed as a product space: (i) for a finite set of trials, each taking place in some measurable space $(E_t, \mathcal{E}_t)$, $t = 1, \ldots, n$, we have $(\Omega, \mathcal{H}) = \bigotimes_{t=1}^n (E_t, \mathcal{E}_t)$; (ii) for a countably infinite set of trials, each taking place in some measurable space $(E_t, \mathcal{E}_t)$, $t \in \mathbb{N}$, we have $(\Omega, \mathcal{H}) = \bigotimes_{t \in \mathbb{N}} (E_t, \mathcal{E}_t)$; (iii) for a process $\{X_t : t \in T\}$ with an arbitrary index set $T$, we assume that all the $X_t$ live in the same standard measurable space $(E, \mathcal{E})$, and let $(\Omega, \mathcal{H}) = (E, \mathcal{E})^T = \bigotimes_{t \in T} (E, \mathcal{E})$.

In the construction of a causal space, we will take as our starting point a probability space $(\Omega, \mathcal{H}, \mathbb{P})$, where the measure $\mathbb{P}$ is defined on a product measurable space $(\Omega, \mathcal{H}) = \bigotimes_{t \in T} (E_t, \mathcal{E}_t)$, with the $(E_t, \mathcal{E}_t)$ being the same standard measurable space if $T$ is uncountable. Denote by $\mathcal{P}(T)$ the power set of $T$, and for $S \in \mathcal{P}(T)$, denote by $\mathcal{H}_S$ the sub-$\sigma$-algebra of $\mathcal{H} = \bigotimes_{t \in T} \mathcal{E}_t$ generated by measurable rectangles $\times_{t \in T} A_t$, where $A_t \in \mathcal{E}_t$ differs from $E_t$ only for $t \in S$. In particular, $\mathcal{H}_\emptyset = \{\emptyset, \Omega\}$ is the trivial $\sigma$-algebra of $\Omega = \times_{t \in T} E_t$. Also, we denote by $\Omega_S$ the subspace $\times_{s \in S} E_s$ of $\Omega = \times_{t \in T} E_t$, and for $T \supseteq S \supseteq U$, we let $\pi_{SU}$ denote the natural projection from $\Omega_S$ onto $\Omega_U$.

Definition 2.2. A causal space is defined as the quadruple $(\Omega, \mathcal{H}, \mathbb{P}, \mathbb{K})$, where $(\Omega, \mathcal{H}, \mathbb{P}) = (\times_{t \in T} E_t, \bigotimes_{t \in T} \mathcal{E}_t, \mathbb{P})$ is a probability space and $\mathbb{K} = \{K_S : S \in \mathcal{P}(T)\}$, called the causal mechanism, is a collection of transition probability kernels $K_S$ from $(\Omega, \mathcal{H}_S)$ into $(\Omega, \mathcal{H})$, called the causal kernel on $\mathcal{H}_S$, that satisfy the following axioms:
(i) for all $A \in \mathcal{H}$ and $\omega \in \Omega$, $K_\emptyset(\omega, A) = \mathbb{P}(A)$;
(ii) for all $\omega \in \Omega$, $A \in \mathcal{H}_S$ and $B \in \mathcal{H}$, $K_S(\omega, A \cap B) = \mathbf{1}_A(\omega) K_S(\omega, B) = \delta_\omega(A) K_S(\omega, B)$; in particular, for $A \in \mathcal{H}_S$, $K_S(\omega, A) = \mathbf{1}_A(\omega) K_S(\omega, \Omega) = \mathbf{1}_A(\omega) = \delta_\omega(A)$.

Here, the probability measure $\mathbb{P}$ should be viewed as the "observational measure", and the causal mechanism $\mathbb{K}$, consisting of causal kernels $K_S$ for $S \in \mathcal{P}(T)$, contains the causal information of the space, by directly specifying the interventional distributions.
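A finite sketch may help to make Definition 2.2 concrete. All numbers below are our own toy choices: $T = \{1, 2\}$ and $E_1 = E_2 = \{0, 1\}$, so that $\Omega$ has four outcomes and every kernel can be written out explicitly.

```python
import numpy as np

# Observational measure P on the four outcomes (ω1, ω2).
P = np.array([[0.3, 0.1],    # P({(0,0)}), P({(0,1)})
              [0.2, 0.4]])   # P({(1,0)}), P({(1,1)})

def K_empty(omega, A):
    """Kernel on the trivial σ-algebra: axiom (i) forces K_∅(ω, A) = P(A)."""
    return sum(P[a] for a in A)

def K_1(omega, A):
    """Kernel on H_{1}: it may depend only on ω1, and axiom (ii)
    (interventional determinism) forces all its mass onto {ω1} × E2.
    Here we choose to propagate P's conditional of ω2 given ω1."""
    w1 = omega[0]
    cond = P[w1] / P[w1].sum()              # conditional law of ω2 given ω1
    return sum(cond[a2] for (a1, a2) in A if a1 == w1)

# Axiom (i): K_∅ reproduces the observational measure.
assert np.isclose(K_empty((1, 1), {(0, 0), (1, 1)}), 0.7)
# Axiom (ii): for A = {ω1 = 1} × E2 ∈ H_{1}, K_1(ω, A) = 1_A(ω).
A = {(1, 0), (1, 1)}
assert np.isclose(K_1((1, 0), A), 1.0) and K_1((0, 0), A) == 0.0
```

The choice of conditional law inside `K_1` is just one admissible causal kernel; the axioms constrain only its behaviour on $\mathcal{H}_{\{1\}}$, and any kernel satisfying (i) and (ii) defines a valid causal mechanism.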
We write $\mathbf{1}_A(\omega)$ when viewed as a function in $\omega$ for a fixed $A$, and $\delta_\omega(A)$ when viewed as a measure for a fixed $\omega \in \Omega$. Note that $\mathbb{K}$ cannot be determined independently of the probability measure $\mathbb{P}$, since, for example, $K_\emptyset$ is clearly dependent on $\mathbb{P}$ by (i). Before we discuss the meaning of the two axioms, we immediately give the definition of an intervention. An intervention is carried out on a sub-$\sigma$-algebra of the form $\mathcal{H}_U$ for some $U \in \mathcal{P}(T)$. In the following, for $S \in \mathcal{P}(T)$, we denote $\omega_S = \pi_{TS}(\omega)$. Then note that $\Omega = \Omega_S \times \Omega_{T \setminus S}$, and for any $\omega \in \Omega$, we can decompose it into components as $\omega = (\omega_S, \omega_{T \setminus S})$. Then $K_S(\omega, A) = K_S((\omega_S, \omega_{T \setminus S}), A)$ for any $A \in \mathcal{H}$ only depends on the first component $\omega_S$ of $\omega = (\omega_S, \omega_{T \setminus S})$. As a slight abuse of notation, we will sometimes write $K_S(\omega_S, A)$ for conciseness.

Definition 2.3. Let $(\Omega, \mathcal{H}, \mathbb{P}, \mathbb{K}) = (\times_{t \in T} E_t, \bigotimes_{t \in T} \mathcal{E}_t, \mathbb{P}, \mathbb{K})$ be a causal space, $U \in \mathcal{P}(T)$, $Q$ a probability measure on $(\Omega, \mathcal{H}_U)$ and $\mathbb{L} = \{L_V : V \in \mathcal{P}(U)\}$ a causal mechanism on $(\Omega, \mathcal{H}_U, Q)$. An intervention on $\mathcal{H}_U$ via $(Q, \mathbb{L})$ is a new causal space $(\Omega, \mathcal{H}, \mathbb{P}^{do(U,Q)}, \mathbb{K}^{do(U,Q,\mathbb{L})})$, where the intervention measure $\mathbb{P}^{do(U,Q)}$ is a probability measure on $(\Omega, \mathcal{H})$ defined, for $A \in \mathcal{H}$, by

$$\mathbb{P}^{do(U,Q)}(A) = \int Q(d\omega) K_U(\omega, A) \quad (1)$$

and $\mathbb{K}^{do(U,Q,\mathbb{L})} = \{K_S^{do(U,Q,\mathbb{L})} : S \in \mathcal{P}(T)\}$ is the intervention causal mechanism, whose intervention causal kernels are

$$K_S^{do(U,Q,\mathbb{L})}(\omega, A) = \int L_{S \cap U}(\omega_{S \cap U}, d\omega'_U) \, K_{S \cup U}((\omega_{S \setminus U}, \omega'_U), A). \quad (2)$$

The intuition behind these definitions is as follows. Starting from the probability space $(\Omega, \mathcal{H}, \mathbb{P})$, we choose a subspace on which to intervene, namely a sub-$\sigma$-algebra $\mathcal{H}_U$ of $\mathcal{H}$. The intervention is the process of placing any desired measure $Q$ on this subspace $(\Omega, \mathcal{H}_U)$, along with an internal causal mechanism $\mathbb{L}$ on this subspace². The causal kernel $K_U$ corresponding to the subspace $\mathcal{H}_U$, which is encoded in the original causal space, determines what the intervention measure on the whole space $\mathcal{H}$ will be, via equation (1).
For the causal kernels after intervention, the causal effect first takes place within $\mathcal{H}_U$ via the internal causal mechanism $\mathbb{L}$, and then propagates to the rest of $\mathcal{H}$ via equation (2). The definition of intervening on a $\sigma$-algebra of the form $\mathcal{H}_U$ given in Definition 2.3 sheds light on the two axioms of causal spaces given in Definition 2.2.

Remark 2.4. Trivial Intervention: Axiom (i) of causal mechanisms in Definition 2.2 ensures that intervening on the trivial $\sigma$-algebra (i.e. not intervening at all) leaves the probability measure intact, i.e. writing $Q$ for the trivial probability measure on $\{\emptyset, \Omega\}$, we have $\mathbb{P}^{do(\emptyset,Q)} = \mathbb{P}$. Interventional Determinism: Axiom (ii) of Definition 2.2 ensures that for any $A \in \mathcal{H}_U$, we have $\mathbb{P}^{do(U,Q)}(A) = Q(A)$, which means that if we intervene on the causal space by giving $\mathcal{H}_U$ a particular probability measure $Q$, then $\mathcal{H}_U$ indeed has that measure with respect to the intervention probability measure.

²Choosing $Q$ to have measure 1 on a single element would correspond to what is known as a "hard intervention" in the SCM literature. Letting $Q$ and $\mathbb{L}$ be arbitrary allows us to obtain any "soft intervention".

[Figure 2: Altitude and Temperature.]

The following example should serve as further clarification of the concepts.

Example 2.5. Let $E_1 = E_2 = \mathbb{R}$, and $\mathcal{E}_1, \mathcal{E}_2$ be the Lebesgue $\sigma$-algebras on $E_1$ and $E_2$. Each $e_1 \in E_1$ and $e_2 \in E_2$ respectively represent the altitude in metres and the temperature in Celsius of a random location. For simplicity, we assume a jointly Gaussian measure $\mathbb{P}$ on $(\Omega, \mathcal{H}) = (E_1 \times E_2, \mathcal{E}_1 \otimes \mathcal{E}_2)$, say with mean vector $(1000, 10)$ and covariance matrix $\begin{pmatrix} 300 & -15 \\ -15 & 1 \end{pmatrix}$. For $e_1 \in E_1$ and $A \in \mathcal{E}_2$, we let $K_1(e_1, A)$ be the conditional measure of $\mathbb{P}$ given $e_1$, i.e. Gaussian with mean $\frac{1200 - e_1}{20}$ and variance $\frac{1}{4}$. This represents the fact that, if we intervene by fixing the altitude of a location, then the temperature of that location will be causally affected.
However, if we intervene by fixing the temperature of a location, say by manually heating up or cooling down a place, then we expect that this has no causal effect on the altitude of the place. This can be represented by the causal kernel $K_2(e_2, B) = \mathbb{P}(B)$ for each $B \in \mathcal{E}_1$, i.e. the Gaussian measure with mean 1000 and variance 300, regardless of the value of $e_2$. The corresponding causal graph would be Figure 2. If we intervene on $E_1$ with the measure $\delta_{1000}$, i.e. we fix the altitude at 1000m, then the intervention measure $\mathbb{P}^{do(1,\delta_{1000})}$ on $(E_2, \mathcal{E}_2)$ would be Gaussian with mean 10 and variance $\frac{1}{4}$. If we intervene on $E_2$ with any measure $Q$, the intervention measure $\mathbb{P}^{do(2,Q)}$ on $(E_1, \mathcal{E}_1)$ would still be Gaussian with mean 1000 and variance 300.

The following theorem proves that the intervention measure and causal mechanism are indeed valid.

Theorem 2.6. In Definition 2.3, $\mathbb{P}^{do(U,Q)}$ is indeed a measure on $(\Omega, \mathcal{H})$, and $\mathbb{K}^{do(U,Q,\mathbb{L})}$ is indeed a valid causal mechanism on $(\Omega, \mathcal{H}, \mathbb{P}^{do(U,Q)})$, i.e. they satisfy the axioms of Definition 2.2.

To end this section, we make a couple of further remarks on the definition of causal spaces.

Remark 2.7. (i) We require causal spaces to be built on top of product probability spaces, as opposed to general probability spaces, and causal kernels are defined on sub-$\sigma$-algebras of $\mathcal{H}$ of the form $\mathcal{H}_S$ for $S \in \mathcal{P}(T)$, as opposed to general sub-$\sigma$-algebras of $\mathcal{H}$. This is because, for two events that are not in separate components of a product space, one can always intervene on one of those events in such a way that the measure on the other event has to change, meaning the causal kernel cannot be decoupled from the intervention itself. For example, in a dice roll with outcomes $\{1, 2, 3, 4, 5, 6\}$, each with probability $\frac{1}{6}$, if we intervene to give measure 1 to rolling a 6, then the other outcomes are forced to have measure 0.
Only if we consider separate components of product measurable spaces can we set meaningful causal relationships that are decoupled from the act of intervention itself. (ii) We do not distinguish between interventions that are practically possible and those that are not. For example, the causal effect of sunlight on the moon's temperature cannot be measured realistically, as it would require covering up the sun, but the information encoded in the causal kernel would still correspond to what would happen if we covered up the sun.

3 Comparison with Existing Frameworks

In this section, we show how causal spaces can encode the interventional aspects of the two most widely-used frameworks of causality, i.e. structural causal models and potential outcomes.

3.1 Structural Causal Models (SCMs)

Consider an SCM in its most basic form, given in the following definition.

Definition 3.1 ([46, p.83, Definition 6.2]). A structural causal model $C = (S, \mathbb{P}_N)$ consists of a collection $S$ of $d$ (structural) assignments $X_j := f_j(\mathrm{PA}_j, N_j)$, $j = 1, \ldots, d$, where $\mathrm{PA}_j \subseteq \{X_1, \ldots, X_d\} \setminus \{X_j\}$ are called the parents of $X_j$ and the $N_j$ are the noise variables; and a distribution $\mathbb{P}_N$ over the noise variables such that they are jointly independent. The graph $G$ of an SCM is obtained by creating one vertex for each $X_j$ and drawing directed edges from each parent in $\mathrm{PA}_j$ to $X_j$. This graph is assumed to be acyclic.

Below, we show that a unique causal space that corresponds to such an SCM can be constructed. First, we let the variables $X_j$, $j = 1, \ldots, d$ take values in measurable spaces $(E_j, \mathcal{E}_j)$ respectively, and let $(\Omega, \mathcal{H}) = \bigotimes_{j=1}^d (E_j, \mathcal{E}_j)$. An SCM $C$ entails a unique distribution $\mathbb{P}$ over the variables $X = (X_1, \ldots, X_d)$ by the propagation of the noise distribution $\mathbb{P}_N$ through the structural equations $f_j$ [46, p.84, Proposition 6.3], and we take this $\mathbb{P}$ as the observational measure of the causal space. More precisely, assuming $\{1, \ldots, d\}$ is a topological ordering, we have, for $A_j \in \mathcal{E}_j$, $j = 1, \ldots, d$,

$$\mathbb{P}(A_1 \times E_2 \times \ldots \times E_d) = \mathbb{P}_N(\{n_1 : f_1(n_1) \in A_1\})$$
$$\mathbb{P}(A_1 \times A_2 \times E_3 \times \ldots \times E_d) = \mathbb{P}_N(\{(n_1, n_2) : (f_1(n_1), f_2(f_1(n_1), n_2)) \in A_1 \times A_2\})$$
$$\vdots$$
$$\mathbb{P}(A_1 \times \ldots \times A_d) = \mathbb{P}_N(\{(n_1, \ldots, n_d) : (f_1(n_1), \ldots, f_d(f_1(n_1), \ldots, n_d)) \in A_1 \times \ldots \times A_d\}).$$

Finally, for each $S \in \mathcal{P}(\{1, \ldots, d\})$ and each $\omega \in \Omega$, define $f_j^{S,\omega} = f_j$ if $j \notin S$ and $f_j^{S,\omega} = \omega_j$ if $j \in S$. Then we have

$$K_S(\omega, A_1 \times \ldots \times A_d) = \mathbb{P}_N(\{(n_1, \ldots, n_d) : (f_1^{S,\omega}(n_1), \ldots, f_d^{S,\omega}(f_1^{S,\omega}(n_1), \ldots, n_d)) \in A_1 \times \ldots \times A_d\}).$$

This uniquely specifies the causal space $(\Omega, \mathcal{H}, \mathbb{P}, \mathbb{K})$ that corresponds to the SCM $C$. While this shows that causal spaces strictly generalise (the interventional aspects of) SCMs, there are fundamental philosophical differences between the two approaches, as highlighted in the following remark.

Remark 3.2. (i) The "system" in an SCM can be viewed as the collection of all variables $X_1, \ldots, X_d$, and the "subsystems" the individual variables or groups of variables. Each structural equation $f_j$ encodes how the whole system, when intervened on, affects a subsystem $X_j$, i.e. how the collection of all other variables affects the individual variables (even though, in the end, the equations only depend on the parents). This way of encoding causal effects seems somewhat inconsistent with the philosophy laid out in the Introduction, that we are interested in what happens to the system when we intervene on a "subsystem". It also seems inconsistent with the actual action taken, which is to intervene on subsystems, not on the whole system or the parents of a particular variable. In contrast, the causal kernels encode exactly what happens to the whole system, i.e. what measure we get on the whole measurable space $(\Omega, \mathcal{H})$, when we intervene on a "subsystem", i.e. put a desired measure on a sub-$\sigma$-algebra of $\mathcal{H}$³. (ii) The primitive objects of SCMs are the variables $X_j$, the structural equations $f_j$ and the distribution $\mathbb{P}_N$ over the noise variables. The observational distribution, as well as the interventional distributions, are derived from these objects.
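The SCM-to-causal-space construction above can be sketched numerically with a toy SCM (equations and numbers entirely our own): the observational measure is the pushforward of the noise distribution through the structural equations, and the causal kernel for $S = \{1\}$ replaces $f_1$ by the intervened value before propagating the same noise.

```python
import numpy as np

# Toy SCM (our own choice): X1 := N1,  X2 := X1 + N2,  N1, N2 ~ N(0, 1) indep.
rng = np.random.default_rng(0)
n = 200_000
N1 = rng.standard_normal(n)
N2 = rng.standard_normal(n)

# Observational measure P: pushforward of the noise distribution P_N
# through the structural equations, as in the display above.
X1 = N1
X2 = X1 + N2

# Causal kernel K_S for S = {1}, evaluated at ω1: replace f1 by the constant
# ω1 and push the *same* noise through the remaining equations.
def K_S1(omega1):
    X1_do = np.full(n, omega1)
    X2_do = X1_do + N2
    return X1_do, X2_do

X1_do, X2_do = K_S1(1.0)
# Observationally X2 ~ N(0, 2); under do(X1 = 1), X2 ~ N(1, 1).
```

With this toy SCM the kernel is available in closed form, but the Monte Carlo construction works for any acyclic system of assignments.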
It turns out that the unique existence of observational and interventional distributions is not guaranteed, and can only be shown under the acyclicity assumption or rather stringent and hard-to-verify conditions on the structural equations and the noise distributions [10]. Moreover, it means that the observational and interventional distributions are not decoupled, but rather are linked through the structural equations $f_j$, and as a result, it is not possible to encode the full range of observational and interventional distributions using just the variables of interest (see Example 4.1). In contrast, in causal spaces, the observational distribution $\mathbb{P}$, as well as the interventional distributions (via the causal kernels), are the primitive objects. Not only does this automatically guarantee their unique existence, but it also allows the interventional distributions (i.e. the causal information) to be completely decoupled from the observational distribution. (iii) Galles and Pearl [21, Section 3] propose three axioms of counterfactuals based on SCMs (called "causal models" in that paper), namely composition, effectiveness and reversibility. Even though these three concepts can be carried over to causal spaces, the mathematics through which they are represented needs to be adapted, since the tools used in causal spaces differ from those used in the causal models of Galles and Pearl [21]. In particular, we work directly with measures as the primitive objects, whereas Galles and Pearl [21] use the structural equations as the primitive objects, and probabilities only enter through a measure on the exogenous variables. Thus, the three properties can be phrased in the causal space language as follows:

³In this sense, some philosophy is shared with generalised structural equation models (GSEMs) [48].

Composition: For $S, R \subseteq T$, denote by $Q'$ the measure on $\mathcal{H}_{S \cup R}$ obtained by restricting $\mathbb{P}^{do(S,Q)}$. Then $\mathbb{P}^{do(S,Q)} = \mathbb{P}^{do(S \cup R, Q')}$.
In words, intervening on $\mathcal{H}_S$ via the measure $Q$ is the same as intervening on $\mathcal{H}_{S \cup R}$ via the measure that it would have if we intervened on $\mathcal{H}_S$ via $Q$. This is not in general true. A counterexample can be demonstrated with a simple SCM, in which $X_1$, $X_2$ and $X_3$ causally affect $Y$ in a way that depends not only on the marginal distributions of $X_1$, $X_2$ and $X_3$ but on their joint distribution, and $X_1$, $X_2$ and $X_3$ have no causal relationships among themselves. Then intervening on $X_1$ with some measure $Q$ cannot be the same as intervening on $X_1$ and $X_2$ with the product of $Q$ and the marginal of $X_2$, since such an intervention would change the joint distribution of $X_1$, $X_2$ and $X_3$, even if we give them the same marginal distributions.

Effectiveness: For $S \subseteq R \subseteq T$, if we intervene on $\mathcal{H}_R$ via a measure $Q$, then $\mathcal{H}_S$ has the measure $Q$ restricted to $\mathcal{H}_S$. This is indeed guaranteed by interventional determinism (Definition 2.2(ii)), so effectiveness continues to hold in causal spaces.

Reversibility: For $S, R, U \subseteq T$, let $Q$ be some measure on $\mathcal{H}_S$, and $Q_1$ and $Q_2$ be measures on $\mathcal{H}_{S \cup R}$ and $\mathcal{H}_{S \cup U}$ respectively, such that they coincide with $Q$ when restricted to $\mathcal{H}_S$. If $\mathbb{P}^{do(S \cup R, Q_1)}(B) = Q_2(B)$ for all $B \in \mathcal{H}_U$ and $\mathbb{P}^{do(S \cup U, Q_2)}(C) = Q_1(C)$ for all $C \in \mathcal{H}_R$, then $\mathbb{P}^{do(S,Q)}(A) = Q_1(A)$ for all $A \in \mathcal{H}_R$. This does not hold in general in causal spaces; in fact, Example 4.2 is a counterexample of this, with $S = \emptyset$.

3.2 Potential Outcomes (PO) Framework

In the PO framework, the treatment and outcome variables of interest are fixed in advance. Although much of the literature begins with individual units, these units are in the end i.i.d. copies of random variables under the stable unit treatment value assumption (SUTVA), and that is how we begin. Denote by $(\tilde\Omega, \tilde{\mathcal{H}}, \tilde{\mathbb{P}})$ the underlying probability space. Let $Z : \tilde\Omega \to \mathsf{Z}$ be the treatment variable, taking values in a measurable space $(\mathsf{Z}, \mathcal{Z})$.
Then for each value $z$ of the treatment, there is a separate random variable $Y_z : \tilde\Omega \to \mathsf{Y}$, called the potential outcome given $Z = z$, taking values in a measurable space $(\mathsf{Y}, \mathcal{Y})$; we also have the "observed outcome", which is the potential outcome consistent with the treatment, i.e. $Y = Y_Z$. The researcher is interested in quantities such as the "average treatment effect", $\mathbb{E}[Y_{z_1} - Y_{z_2}]$, where $\mathbb{E}$ is the expectation with respect to $\tilde{\mathbb{P}}$, to measure the causal effect of the treatment. Often, there are other, pre-treatment variables or "covariates", which we denote by $X : \tilde\Omega \to \mathsf{X}$, taking values in a measurable space $(\mathsf{X}, \mathcal{X})$. Given these, another object of interest is the "conditional average treatment effect", defined as $\mathbb{E}[Y_{z_1} - Y_{z_2} \mid X]$. It is relatively straightforward to construct a causal space that can carry this framework. We define $\Omega = \mathsf{Z} \times \mathsf{Y} \times \mathsf{X}$ and $\mathcal{H} = \mathcal{Z} \otimes \mathcal{Y} \otimes \mathcal{X}$. We also define $\mathbb{P}$, for each $A \in \mathcal{Z}$, $B \in \mathcal{Y}$ and $C \in \mathcal{X}$, as $\mathbb{P}(A \times B \times C) = \tilde{\mathbb{P}}(Z \in A, Y \in B, X \in C)$. As for the causal kernels, we are essentially only interested in $K_Z(z, B)$ for $B \in \mathcal{Y}$, and we define these to be $K_Z(z, B) = \tilde{\mathbb{P}}(Y_z \in B)$.

4 Examples

In this section, we give a few more concrete constructions of causal spaces. In particular, they are designed to highlight cases which are hard to represent with existing frameworks, but which have natural representations in terms of causal spaces. Comparisons are made particularly with SCMs.

4.1 Confounders

The following example highlights the fact that, with graphical models, there is no way to encode correlation but no causation between two variables using just the variables of interest.

Example 4.1. Consider the popular example of monthly ice cream sales and shark attacks in the US (Figure 3a), which shows that correlation does not imply causation. This cannot be encoded by an SCM with just two variables as in Figure 3b, since no causation means no arrows between the variables, which in turn also means no dependence.
One needs to add the common causes into the model (whether observed or latent), the most obvious one being the temperature (high temperatures make people desire ice cream more, as well as go to the beach more), as seen in Figure 3c. Now we have a model in which both dependence and no causation are captured. But can we stop here? There are probably other factors that affect both variables, such as the economy (the better the economic situation, the more likely people are to buy ice cream, and to take beach holidays); see Figure 3d. Not only is the model starting to lose parsimony, but as soon as we stop adding variables to the model, we are making an assumption that there are no further confounding variables out there in the world⁴. In contrast, causal spaces allow us to model any observational and causal relationships with just the variables that we are interested in, without any restrictions or the need to add more variables. In this particular example, we would take as our causal space $(E_1 \times E_2, \mathcal{E}_1 \otimes \mathcal{E}_2, \mathbb{P}, \mathbb{K})$, where $E_1 = E_2 = \mathbb{R}$, with values in $E_1$ and $E_2$ corresponding to ice cream sales and shark attacks respectively, and $\mathcal{E}_1 = \mathcal{E}_2$ being Lebesgue $\sigma$-algebras. Then we can let $\mathbb{P}$ be a measure that has a strong dependence between any $A \in \mathcal{E}_1$ and $B \in \mathcal{E}_2$, but let the causal kernels be $K_1(x, B) = \mathbb{P}(B)$ for any $x \in E_1$ and $B \in \mathcal{E}_2$, and likewise $K_2(x, A) = \mathbb{P}(A)$ for any $x \in E_2$ and $A \in \mathcal{E}_1$.

[Figure 3: Correlation but no causation between ice cream sales and shark attacks. S stands for the number of shark attacks, I for ice cream sales, T for temperature and E for economy.]

Nancy Cartwright argued against the completeness of the causal Markov condition, using an example of two factories [13, p.108], in which there may not even be any confounders between dependent variables, not even an unobserved one. If we accept her position, then there are situations which SCMs would not be able to represent, whereas causal spaces would have no problems at all.
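Example 4.1 can be sketched by simulation (all numbers our own): the observational measure exhibits strong dependence, while both causal kernels are just the corresponding marginals of $\mathbb{P}$, so neither variable causally affects the other.

```python
import numpy as np

# Sales and shark attacks are strongly dependent under P because of an
# unmodelled common cause, yet the causal kernels ignore the intervened value.
rng = np.random.default_rng(1)
n = 200_000
temp = rng.standard_normal(n)                 # unmodelled common cause
ice = temp + 0.3 * rng.standard_normal(n)     # ice cream sales
shark = temp + 0.3 * rng.standard_normal(n)   # shark attacks

corr = np.corrcoef(ice, shark)[0, 1]          # strong observational dependence

def K1(x, size):
    """Law of shark attacks after do(sales = x): K1(x, B) = P(B),
    i.e. just the P-marginal, whatever x is."""
    return shark[rng.integers(0, n, size)]

do_mean = K1(10.0, 100_000).mean()            # unchanged by the intervention
```

Note that no confounder appears anywhere in the causal space itself: the dependence sits in $\mathbb{P}$ and the absence of causation sits in $\mathbb{K}$, with only the two variables of interest modelled.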
4.2 Cycles

As mentioned before, cycles in SCMs cause serious problems, namely that observational and interventional distributions consistent with the given structural equations and noise distribution may not exist, and when they do exist, they may not be unique. This is an artefact of the fact that these distributions are derived from the structural equations rather than taken as the primitive objects. In the vast majority of cases, cycles are excluded from consideration from the beginning, and only directed acyclic graphs (DAGs) are considered. Some works study the solvability of cyclic SCMs [23, 10], investigating under what conditions on the structural equations and the noise variables there exist random variables and distributions that solve the given structural equations, and if so, when that happens uniquely. Other works allow cycles to exist, but restrict the definition of an SCM to those that have a unique solution [23, 43, 49]. Of course, cyclic causal relationships abound in the real world. In our proposed causal spaces (Definition 2.2), given two sub-$\sigma$-algebras $\mathcal{H}_S$ and $\mathcal{H}_U$ of $\mathcal{H}$, nothing stops both of them from having a causal effect on the other (see Definition B.1 for a precise definition of causal effects), but we are still guaranteed to have a unique causal space, both before intervention and after intervention on either $\mathcal{H}_S$ or $\mathcal{H}_U$. The following is an example of a situation with a cyclic causal relationship.

Example 4.2. We want to model the relationship between the amount of rice in the market and its price per kg.
Take as the probability space $(E_1 \times E_2, \mathcal{E}_1 \otimes \mathcal{E}_2, \mathbb{P})$, where $E_1 = E_2 = \mathbb{R}$, with values in $E_1$ and $E_2$ representing the amount of rice in the market in million tonnes and the price of rice per kg in thousands of KRW respectively, $\mathcal{E}_1, \mathcal{E}_2$ are Lebesgue $\sigma$-algebras and $\mathbb{P}$ is for simplicity taken to be jointly Gaussian. Without any intervention, the higher the yield, the more rice there is in the market and the lower the price, as in Figure 4b. If the government intervenes on the market by buying up extra rice or releasing rice into the market from its stock, with the goal of stabilising supply at 3 million tonnes, then the price will stabilise accordingly, say with a Gaussian distribution with mean 4.5 and standard deviation 0.5, as in Figure 4c. The corresponding causal kernel has density

$$K_1(3, x) = \frac{1}{\sqrt{2\pi} \cdot 0.5} \exp\left(-\frac{1}{2}\left(\frac{x - 4.5}{0.5}\right)^2\right).$$

On the other hand, if the government fixes the price of rice at, say, 6,000 KRW per kg, then the farmers will be incentivised to produce more, say with a Gaussian distribution with mean 4 and standard deviation 0.5, as in Figure 4d. The corresponding causal kernel has density

$$K_2(6, y) = \frac{1}{\sqrt{2\pi} \cdot 0.5} \exp\left(-\frac{1}{2}\left(\frac{y - 4}{0.5}\right)^2\right).$$

Causal spaces treat causal effects as precisely what happens after an intervention takes place, and with this approach, cycles can be encoded rather naturally, as shown above. We do not view cyclic causal relationships as an equilibrium of a dynamical system, or require them to be viewed as an acyclic stochastic process, as done by some authors [46, p.85, Remark 6.5].

⁴One solution could be to add a single variable that collects all of the confounders into one, but then the numerical value of this "variable", as well as its distribution and the structural equations from this variable into S and I, would be completely meaningless.

[Figure 4: Rice in the market in million tonnes and price per kg in KRW.]
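Example 4.2 can be sketched by sampling from the two kernels (distributions taken from the text, implementation our own): each coordinate causally affects the other, yet every intervention yields a perfectly well-defined measure, because there are no structural equations to solve.

```python
import numpy as np

# Cyclic causal kernels from Example 4.2: supply affects price and price
# affects supply, with no solvability issue in sight.
rng = np.random.default_rng(2)

def K1(supply, size):
    """Price law under do(supply = 3): N(4.5, 0.5^2), per the text.
    (Only the value 3 is specified in the example.)"""
    return 4.5 + 0.5 * rng.standard_normal(size)

def K2(price, size):
    """Supply law under do(price = 6): N(4, 0.5^2), per the text.
    (Only the value 6 is specified in the example.)"""
    return 4.0 + 0.5 * rng.standard_normal(size)

price_do = K1(3.0, 100_000)     # intervene on the amount of rice
supply_do = K2(6.0, 100_000)    # intervene on the price
```

Both interventional distributions are primitives of the causal space, so their existence and uniqueness are immediate, in contrast to cyclic SCMs where a solution to the equations must first be shown to exist.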
4.3 Continuous-time Stochastic Processes, and Parents
A very well-established sub-field of probability theory is the field of stochastic processes, in which the index set representing (most often) time can be either discrete or continuous, and in both cases infinite. However, most causal models start by assuming a finite number of variables, which immediately rules out considering stochastic processes, and efforts to extend to an infinite number of variables usually consider only discrete time steps [46, Chapter 10] or dynamical systems [9, 47, 8, 50]. Since probability spaces have proven to accommodate continuous-time stochastic processes in a natural way, it is natural to believe that causal spaces, being built up from probability spaces, should be able to incorporate the concept of causality into the theory of stochastic processes. Let W be a totally ordered set, in particular W = N = {0, 1, ...}, W = Z = {..., −2, −1, 0, 1, 2, ...}, W = R₊ = [0, ∞) or W = R = (−∞, ∞), considered as the time set. Then, we consider causal spaces of the form (Ω, H, P, K) = (∏_{t∈T} E_t, ⊗_{t∈T} ℰ_t, P, K), where the index set T can be written as T = W × T′ for some other index set T′. The following notion captures the intuition that causation can only go forwards in time. Definition 4.3. Let (Ω, H, P, K) = (∏_{t∈T} E_t, ⊗_{t∈T} ℰ_t, P, K) be a causal space, where the index set T can be written as T = W × T′, with W representing time. Then we say that the causal mechanism K respects time, or that K is a time-respecting causal mechanism, if, for all w₁, w₂ ∈ W with w₁ < w₂, we have that H_{{w₂}×T′} has no causal effect (in the sense of Definition B.1) on H_{{w₁}×T′}. In a causal space where the index set T has a time component, the fact that the causal mechanism K respects time means that the past can affect the future, but the future cannot affect the past. This already distinguishes intervention from conditioning: conditioning on the future does have implications for past events.
We illustrate this point with the example of a Brownian motion. Example 4.4. We consider a 1-dimensional Brownian motion. Take (∏_{t∈R₊} E_t, ⊗_{t∈R₊} ℰ_t, P, K), where, for each t ∈ R₊, E_t = R and ℰ_t is the Lebesgue σ-algebra, and P is the Wiener measure. For s < t, we have causal kernels K_s(x, y) = (1/√(2π(t−s))) e^{−(y−x)²/(2(t−s))}, giving the density of the value y at time t after an intervention setting the value at time s to x, and K_t(x, y) = (1/√(2πs)) e^{−y²/(2s)}, giving the density of the value y at time s after an intervention at time t. The former says that, if we intervene by setting the value of the process to x at time s, then the process starts again from x, whereas the latter says that if we intervene at time t, the past values at time s are not affected. On the left-hand plot of Figure 5, we set the value of the process at time 1 to 0. The past values of the process are not affected, and there is a discontinuity at time 1 where the process starts again from 0.
Figure 5: 1-dimensional Brownian motion, intervened and conditioned to have value 0 at time 1.
Contrast this to the right-hand plot, where we condition on the process having value 0 at time 1. This does affect past values, and creates a Brownian bridge from time 0 to time 1. Note that Brownian motion is not differentiable, so no approach based on dynamical systems is applicable. Remark 4.5. The concept of parents is central in SCMs: the structural equations are defined on the parents of each variable. However, continuous time is dense, so given two distinct points in time, there is always a time point in between. Suppose we have a one-dimensional continuous-time Markov process (X_t)_{t∈R} [14, p.169], and a point t₀ in time. Then for any t < t₀, X_t has a causal effect on X_{t₀}, but there always exists some t′ with t < t′ < t₀ such that, conditioned on X_{t′}, X_t does not have a causal effect on X_{t₀}, meaning X_t cannot be a parent of X_{t₀}. In such a case, X_{t₀} cannot be said to have any parents, and hence no corresponding SCM can be defined.
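The contrast between intervening and conditioning in Example 4.4 can be reproduced numerically. The following sketch (our own illustration; the discretisation grid and seed are arbitrary) builds one Brownian path, then forms the intervened path, which leaves the past untouched, and the conditioned path, whose past is replaced by a Brownian bridge via the standard construction B_t = W_t − t·W_1:

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 1000, 2.0
t = np.linspace(0.0, T, n + 1)

# One sample path of standard Brownian motion on [0, 2].
W = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(T / n), n))])
i1 = n // 2                            # grid index of time 1

# Intervention do(W_1 = 0): the past is untouched; the process restarts
# from 0 at time 1, with a discontinuity there (left-hand plot of Figure 5).
W_do = W.copy()
W_do[i1:] -= W[i1]

# Conditioning on W_1 = 0: the future again continues from 0, but the past
# is replaced by a Brownian bridge on [0, 1] via B_t = W_t - t * W_1
# (right-hand plot of Figure 5).
W_cond = W.copy()
W_cond[i1:] -= W[i1]
W_cond[:i1] = W[:i1] - t[:i1] * W[i1]

print(W_do[i1], W_cond[i1])            # both paths take the value 0 at time 1
assert np.allclose(W_do[:i1], W[:i1])  # intervention leaves the past alone
```

The final assertion makes the point of Example 4.4 explicit: only conditioning, not intervening, changes the past of the process.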
5 Conclusion
In this work, we discussed the lack of a universally agreed axiomatisation of causality, and gave some arguments as to why measure-theoretic probability theory can provide a good foundation on which to build such an axiomatisation. We proposed causal spaces, enriching probability spaces with causal kernels that encode information about what happens after an intervention. We showed how the interventional aspects of existing frameworks can be captured by causal spaces, and finally we gave some explicit constructions, highlighting cases in which existing frameworks fall short. Even if causal spaces prove with time to be the correct approach to axiomatising causality, there is much work to be done; in fact, all the more so in that case. Most conspicuously, we only discussed the interventional aspects of the theory of causality, but the notion of counterfactuals is also seen as a key part of the theory, both interventional counterfactuals as advocated by Pearl's ladder of causation [44, Figure 1.2] and backtracking counterfactuals [58]. We provide some discussion and possible first steps towards this goal in Appendix E, but mostly leave it as essential future work. Only then will we be able to provide a full comparison with the counterfactual aspects of SCMs and the potential outcomes framework. As discussed in Section 1, the notion of actual causality is also important, and it would be interesting to investigate how this notion can be embedded into causal spaces. Many important definitions, including causal effects, conditional causal effects, hard interventions and sources, as well as technical results, were deferred to the appendix purely due to space constraints, but we foresee that there are many more interesting results to be proved, inspired both by theory and by practice. In particular, the theory of causal abstraction [49, 2, 3] should benefit from our proposal, through extensions of homomorphisms of probability spaces to causal spaces⁵.
As a final note, we stress again that our goal should not be understood as replacing existing frameworks. Indeed, causal spaces cannot compete in terms of interpretability, and in the vast majority of situations in which SCMs, potential outcomes or any of the other existing frameworks are suitable, we expect them to be much more useful. In particular, assumptions are unavoidable for identifiability from observational data, and those assumptions are much better captured by existing frameworks⁶. However, just as measure-theoretic probability theory has its value despite not being directly useful for practitioners in applied statistics, we believe that it is a worthy endeavour to formally axiomatise causality.
⁵In a similar vein, Keurti et al. [37] consider homomorphisms between groups of interventions.
⁶Researchers from the potential outcomes community and the graphical model community are arguing as to which framework is better for which situations [30, 43]. We do not take part in this debate.
Acknowledgments and Disclosure of Funding
We express our sincere gratitude to Robin Evans at the University of Oxford, Michel de Lara at École des Ponts ParisTech and Wojciech Niemiro at the University of Warsaw for fruitful discussions and providing valuable feedback on earlier drafts. We also thank the anonymous reviewers for their suggestions for improvements. This work was supported by the Tübingen AI Center.
References
[1] S. Beckers. Causal Sufficiency and Actual Causation. Journal of Philosophical Logic, 50(6):1341-1374, 2021.
[2] S. Beckers and J. Y. Halpern. Abstracting Causal Models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2678-2685, 2019.
[3] S. Beckers, F. Eberhardt, and J. Y. Halpern. Approximate Causal Abstractions. In Uncertainty in Artificial Intelligence, pages 606-615. PMLR, 2020.
[4] S. Beckers, H. Chockler, and J. Halpern. A Causal Analysis of Harm. Advances in Neural Information Processing Systems, 35:2365-2376, 2022.
[5] S.
Beckers, H. Chockler, and J. Y. Halpern. Quantifying Harm. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI 2023), 2023.
[6] H. Beebee, C. Hitchcock, and P. Menzies. The Oxford Handbook of Causation. Oxford University Press, 2009.
[7] G. Biradar, V. Viswanathan, and Y. Zick. Model Explanations via the Axiomatic Causal Lens. arXiv preprint arXiv:2109.03890, 2021.
[8] T. Blom, S. Bongers, and J. M. Mooij. Beyond Structural Causal Models: Causal Constraints Models. In Uncertainty in Artificial Intelligence, pages 585-594. PMLR, 2020.
[9] S. Bongers, T. Blom, and J. M. Mooij. Causal Modeling of Dynamical Systems. arXiv preprint arXiv:1803.08784, 2018.
[10] S. Bongers, P. Forré, J. Peters, and J. M. Mooij. Foundations of Structural Causal Models with Cycles and Latent Variables. The Annals of Statistics, 49(5):2885-2915, 2021.
[11] J. Brehmer, P. De Haan, P. Lippe, and T. S. Cohen. Weakly Supervised Causal Representation Learning. Advances in Neural Information Processing Systems, 35:38319-38331, 2022.
[12] I. Cabreros and J. D. Storey. Causal Models on Probability Spaces. arXiv preprint arXiv:1907.01672, 2019.
[13] N. Cartwright. The Dappled World: A Study of the Boundaries of Science. Cambridge University Press, 1999.
[14] E. Çınlar. Probability and Stochastics, volume 261. Springer Science & Business Media, 2011.
[15] T. Cohen. Towards a Grounded Theory of Causation for Embodied AI. In UAI 2022 Workshop on Causal Representation Learning, 2022.
[16] J. Collins, N. Hall, and L. A. Paul. Causation and Counterfactuals. The MIT Press, 2004.
[17] P. Dawid. Decision-Theoretic Foundations for Statistical Causality. Journal of Causal Inference, 9(1):39-77, 2021.
[18] H. B. Enderton. Elements of Set Theory. Academic Press, 1977.
[19] A. Feder, K. A. Keith, E. Manzoor, R. Pryzant, D. Sridhar, Z. Wood-Doughty, J. Eisenstein, J. Grimmer, R. Reichart, M. E. Roberts, et al.
Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond. Transactions of the Association for Computational Linguistics, 10:1138-1158, 2022.
[20] T. Fritz, T. Gonda, N. G. Houghton-Larsen, P. Perrone, and D. Stein. Dilations and Information Flow Axioms in Categorical Probability. arXiv preprint arXiv:2211.02507, 2022.
[21] D. Galles and J. Pearl. An Axiomatic Characterization of Causal Counterfactuals. Foundations of Science, 3(1):151-182, 1998.
[22] J. Halpern. A Modification of the Halpern-Pearl Definition of Causality. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[23] J. Y. Halpern. Axiomatizing Causal Reasoning. Journal of Artificial Intelligence Research, 12:317-337, 2000.
[24] J. Y. Halpern. Actual Causality. MIT Press, 2016.
[25] J. Y. Halpern and J. Pearl. Causes and Explanations: A Structural-Model Approach. Part I: Causes. British Journal for the Philosophy of Science, 56(4):843-887, 2005.
[26] M. Hernan and J. Robins. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC, 2020.
[27] B. Heymann, M. De Lara, and J.-P. Chancelier. Causal Inference Theory with Information Dependency Models. arXiv preprint arXiv:2108.03099, 2021.
[28] D. Ibeling and T. Icard. Probabilistic Reasoning Across the Causal Hierarchy. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10170-10177, 2020.
[29] P. M. Illari, F. Russo, and J. Williamson. Causality in the Sciences. Oxford University Press, 2011.
[30] G. Imbens. Potential Outcome and Directed Acyclic Graph Approaches to Causality: Relevance for Empirical Practice in Economics. Technical report, National Bureau of Economic Research, 2019.
[31] G. W. Imbens and D. B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
[32] B. Jacobs, A. Kissinger, and F. Zanasi. Causal Inference by String Diagram Surgery.
In Foundations of Software Science and Computation Structures: 22nd International Conference, FOSSACS 2019, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2019, Prague, Czech Republic, April 6-11, 2019, Proceedings 22, pages 313-329. Springer, 2019.
[33] D. Janzing and B. Schölkopf. Causal Inference Using the Algorithmic Markov Condition. IEEE Transactions on Information Theory, 56(10):5168-5194, 2010.
[34] E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003.
[35] Z. Jin, A. Feder, and K. Zhang. CausalNLP Tutorial: An Introduction to Causality for Natural Language Processing. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts, pages 17-22, 2022.
[36] A.-H. Karimi, G. Barthe, B. Schölkopf, and I. Valera. A Survey of Algorithmic Recourse: Contrastive Explanations and Consequential Recommendations. ACM Computing Surveys, 55(5):1-29, 2022.
[37] H. Keurti, H.-R. Pan, M. Besserve, B. F. Grewe, and B. Schölkopf. Homomorphism Autoencoder - Learning Group Structured Representations from Observed Transitions. In International Conference on Machine Learning, pages 16190-16215. PMLR, 2023.
[38] A. N. Kolmogorov. Foundations of the Theory of Probability. NY: Chelsea Publishing Co, 1933.
[39] D. Lewis. Counterfactuals. John Wiley & Sons, 2013.
[40] F. Locatello, S. Bauer, M. Lucic, G. Raetsch, S. Gelly, B. Schölkopf, and O. Bachem. Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. In International Conference on Machine Learning, pages 4114-4124. PMLR, 2019.
[41] J. Mitrovic, B. McWilliams, J. C. Walker, L. H. Buesing, and C. Blundell. Representation Learning via Invariant Causal Mechanisms. In International Conference on Learning Representations, 2020.
[42] P. A. Ortega. Subjectivity, Bayesianism, and Causality. Pattern Recognition Letters, 64:63-70, 2015.
[43] J. Pearl. Causality.
Cambridge University Press, 2009.
[44] J. Pearl and D. Mackenzie. The Book of Why. Basic Books, New York, 2018.
[45] J. Pearl, M. Glymour, and N. P. Jewell. Causal Inference in Statistics: A Primer. John Wiley & Sons, 2016.
[46] J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference. The MIT Press, 2017.
[47] J. Peters, S. Bauer, and N. Pfister. Causal Models for Dynamical Systems. In Probabilistic and Causal Inference: The Works of Judea Pearl, pages 671-690. ACM, 2022.
[48] S. Peters and J. Y. Halpern. Causal Modeling with Infinitely Many Variables. arXiv preprint arXiv:2112.09171, 2021.
[49] P. Rubenstein, S. Weichwald, S. Bongers, J. Mooij, D. Janzing, M. Grosse-Wentrup, and B. Schölkopf. Causal Consistency of Structural Equation Models. In 33rd Conference on Uncertainty in Artificial Intelligence (UAI 2017), pages 808-817. Curran Associates, Inc., 2017.
[50] P. K. Rubenstein, S. Bongers, B. Schölkopf, and J. M. Mooij. From Deterministic ODEs to Dynamic Structural Causal Models. In 34th Conference on Uncertainty in Artificial Intelligence (UAI 2018), 2018.
[51] F. Russo. Causality and Causal Modelling in the Social Sciences. Springer, 2010.
[52] P. Schenone. Causality: A Decision Theoretic Foundation. arXiv preprint arXiv:1812.07414, 2018.
[53] B. Schölkopf. Causality for Machine Learning. In Probabilistic and Causal Inference: The Works of Judea Pearl, pages 765-804. ACM, 2022.
[54] B. Schölkopf and J. von Kügelgen. From Statistical to Causal Learning. arXiv preprint arXiv:2204.00607, 2022.
[55] B. Schölkopf, F. Locatello, S. Bauer, N. R. Ke, N. Kalchbrenner, A. Goyal, and Y. Bengio. Toward Causal Representation Learning. Proceedings of the IEEE, 109(5):612-634, 2021.
[56] P. Spirtes, C. N. Glymour, R. Scheines, and D. Heckerman. Causation, Prediction, and Search. MIT Press, 2000.
[57] J. Von Kügelgen, Y. Sharma, L. Gresele, W. Brendel, B. Schölkopf, M. Besserve, and F. Locatello.
Self-Supervised Learning with Data Augmentations Provably Isolates Content from Style. Advances in Neural Information Processing Systems, 34:16451-16467, 2021.
[58] J. Von Kügelgen, A. Mohamed, and S. Beckers. Backtracking Counterfactuals. In Conference on Causal Learning and Reasoning, pages 177-196. PMLR, 2023.
[59] V. Vovk and G. Shafer. Game-Theoretic Probability. Introduction to Imprecise Probabilities, pages 114-134, 2014.
[60] M. Waldmann. The Oxford Handbook of Causal Reasoning. Oxford University Press, 2017.
[61] P. Walley. Statistical Reasoning with Imprecise Probabilities, volume 42. Springer, 1991.
[62] Y. Wang and M. I. Jordan. Desiderata for Representation Learning: A Causal Perspective. arXiv preprint arXiv:2109.03795, 2021.
[63] H. White and K. Chalak. Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning. Journal of Machine Learning Research, 10(8), 2009.
[64] J. Woodward. Making Things Happen: A Theory of Causal Explanation. Oxford University Press, 2005.
A Mathematical Preliminaries
In this section, we recall some basic facts about measure and probability theory that we need for the development in the main body of the paper. We follow Çınlar [14].
A.1 Measure Theory
Suppose that E is a set. We first define the notion of a σ-algebra. A non-empty collection ℰ of subsets of E is called a σ-algebra on E if it is closed under complements and countable unions, that is, if (i) A ∈ ℰ ⟹ E\A ∈ ℰ; (ii) A₁, A₂, ... ∈ ℰ ⟹ ∪_{n=1}^∞ A_n ∈ ℰ [14, p.2]. We call {∅, E} the trivial σ-algebra of E. If C is an arbitrary collection of subsets of E, then the smallest σ-algebra that contains C, or equivalently, the intersection of all σ-algebras that contain C, is called the σ-algebra generated by C, and is denoted σC. A measurable space is a pair (E, ℰ), where E is a set and ℰ is a σ-algebra on E [14, p.4]. Suppose (E, ℰ) and (F, ℱ) are measurable spaces.
For A ∈ ℰ and B ∈ ℱ, we define the measurable rectangle A × B as the set of all pairs (x, y) with x ∈ A and y ∈ B. We define the product σ-algebra ℰ ⊗ ℱ on E × F as the σ-algebra generated by the collection of all measurable rectangles. The measurable space (E × F, ℰ ⊗ ℱ) is the product of (E, ℰ) and (F, ℱ) [14, p.4]. More generally, if (E₁, ℰ₁), ..., (Eₙ, ℰₙ) are measurable spaces, their product is ∏_{i=1}^n (E_i, ℰ_i) = (E₁ × ... × Eₙ, ℰ₁ ⊗ ... ⊗ ℰₙ), where E₁ × ... × Eₙ is the set of all n-tuples (x₁, ..., xₙ) with x_i in E_i for i = 1, ..., n and ℰ₁ ⊗ ... ⊗ ℰₙ is the σ-algebra generated by the measurable rectangles A₁ × ... × Aₙ with A_i in ℰ_i for i = 1, ..., n [14, p.44]. If T is an arbitrary (countable or uncountable) index set and (E_t, ℰ_t) is a measurable space for each t ∈ T, the product space of {E_t : t ∈ T} is the set ∏_{t∈T} E_t of all collections (x_t)_{t∈T} with x_t ∈ E_t for each t ∈ T. A rectangle in ∏_{t∈T} E_t is a subset of the form ∏_{t∈T} A_t = {x = (x_t)_{t∈T} ∈ ∏_{t∈T} E_t : x_t ∈ A_t for each t in T}, where A_t differs from E_t for only a finite number of t. It is said to be measurable if A_t ∈ ℰ_t for every t (for which A_t differs from E_t). The σ-algebra on ∏_{t∈T} E_t generated by the collection of all measurable rectangles is called the product σ-algebra and is denoted by ⊗_{t∈T} ℰ_t [14, p.45]. A collection C of subsets of E is called a p-system if it is closed under intersections [14, p.2]. If two measures µ and ν on a measurable space (E, ℰ) with µ(E) = ν(E) < ∞ agree on a p-system generating ℰ, then µ and ν are identical [14, p.16, Proposition 3.7]. Let (E, ℰ) and (F, ℱ) be measurable spaces. A mapping f : E → F is measurable if f⁻¹B ∈ ℰ for every B ∈ ℱ [14, p.6]. Let (E, ℰ) and (F, ℱ) be measurable spaces. Let f be a bijection between E and F, and let f̂ denote its functional inverse. Then, f is an isomorphism if f is measurable relative to ℰ and ℱ, and f̂ is measurable relative to ℱ and ℰ. The measurable spaces (E, ℰ) and (F, ℱ) are isomorphic if there exists an isomorphism between them [14, p.11].
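As a finite sanity check of the product σ-algebra definition (our own illustration, not taken from [14]): with E = F = {0, 1} and full power sets, the measurable rectangles A × B do not exhaust the subsets of E × F, but the σ-algebra they generate does. The helper `sigma_closure` below simply closes a family of sets under complements and finite unions, which suffices on a finite space:

```python
from itertools import product, combinations

# Finite sketch: rectangles generate strictly more than themselves.
E, F = [0, 1], [0, 1]
points = list(product(E, F))

def powerset(s):
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

# All measurable rectangles A x B with A in 2^E, B in 2^F.
rects = {frozenset((x, y) for x in A for y in B)
         for A in powerset(E) for B in powerset(F)}

def sigma_closure(family, space):
    """Close a family of subsets under complements and unions (finite space)."""
    family = set(family)
    changed = True
    while changed:
        changed = False
        for s in list(family):
            comp = frozenset(space) - s
            if comp not in family:
                family.add(comp); changed = True
        for a in list(family):
            for b in list(family):
                if a | b not in family:
                    family.add(a | b); changed = True
    return family

sigma = sigma_closure(rects, points)
diag = frozenset({(0, 0), (1, 1)})
print(diag in rects, diag in sigma)   # -> False True
```

The diagonal {(0,0), (1,1)} is not a rectangle, yet it lies in the generated σ-algebra as the complement of the union of the rectangles {0}×{1} and {1}×{0}.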
A measurable space (E, ℰ) is a standard measurable space if it is isomorphic to (F, B_F) for some Borel subset F of R. Polish spaces with their Borel σ-algebras are standard measurable spaces [14, p.11]. Let A ⊆ E. Its indicator, denoted by 1_A, is the function defined by 1_A(x) = 1 if x ∈ A and 1_A(x) = 0 if x ∉ A [14, p.8]. Obviously, 1_A is ℰ-measurable if and only if A ∈ ℰ. A function f : E → R is said to be simple if it is of the form f = Σ_{i=1}^n a_i 1_{A_i} for some n ∈ N, a₁, ..., aₙ ∈ R and A₁, ..., Aₙ ∈ ℰ [14, p.8]. The A₁, ..., Aₙ ∈ ℰ can be chosen to be a measurable partition of E, in which case this is called the canonical form of the simple function f. A positive function on E is ℰ-measurable if and only if it is the limit of an increasing sequence of positive simple functions [14, p.10, Theorem 2.17]. A measure on a measurable space (E, ℰ) is a mapping µ : ℰ → [0, ∞] such that (i) µ(∅) = 0; and (ii) µ(∪_{n=1}^∞ A_n) = Σ_{n=1}^∞ µ(A_n) for every disjoint sequence (A_n) in ℰ [14, p.14]. A measure space is a triplet (E, ℰ, µ), where (E, ℰ) is a measurable space and µ is a measure on it. A measurable set B is said to be negligible if µ(B) = 0, and an arbitrary subset of E is said to be negligible if it is contained in a measurable negligible set. The measure space is said to be complete if every negligible set is measurable [14, p.17]. Next, we review the notion of integration of a real-valued function f : E → R with respect to µ [14, p.20, Definition 4.3]. (a) Let f : E → [0, ∞] be simple. If its canonical form is f = Σ_{i=1}^n a_i 1_{A_i} with a_i ∈ R, then we define ∫ f dµ = Σ_{i=1}^n a_i µ(A_i). (b) Suppose f : E → [0, ∞] is measurable. Then by the above, we have a sequence (f_n) of positive simple functions such that f_n ↗ f pointwise, and we define ∫ f dµ = lim_n ∫ f_n dµ, where ∫ f_n dµ is defined for each n by (a). (c) Suppose f : E → [−∞, ∞] is measurable. Then f⁺ = max{f, 0} and f⁻ = −min{f, 0} are measurable and positive, so we can define ∫ f⁺ dµ and ∫ f⁻ dµ as in (b). Then we define ∫ f dµ = ∫ f⁺ dµ − ∫ f⁻ dµ, provided that at least one term on the right is finite.
Otherwise, ∫ f dµ is undefined. If ∫ f⁺ dµ < ∞ and ∫ f⁻ dµ < ∞, then we say that f is integrable. Finally, we review the notion of transition kernels, which are crucial in the consideration of conditional distributions. Let (E, ℰ) and (F, ℱ) be measurable spaces. Let K be a mapping from E × ℱ into [0, ∞]. Then, K is called a transition kernel from (E, ℰ) into (F, ℱ) if (a) the mapping x ↦ K(x, B) is measurable for every set B ∈ ℱ; and (b) the mapping B ↦ K(x, B) is a measure on (F, ℱ) for every x ∈ E. A transition kernel from (E, ℰ) into (F, ℱ) is called a probability transition kernel if K(x, F) = 1 for all x ∈ E. A probability transition kernel K from (E, ℰ) into (E, ℰ) is called a Markov kernel on (E, ℰ) [14, p.37, 39, 40].
A.2 Probability Theory
Now we translate the above measure-theoretic notions into the language of probability theory, and introduce some additional concepts. A probability space is a measure space (Ω, H, P) such that P(Ω) = 1 [14, p.49]. We call Ω the sample space, and each element ω ∈ Ω an outcome. We call H a collection of events, and for any A ∈ H, we read P(A) as the probability that the event A occurs [14, p.50]. A random variable taking values in a measurable space (E, ℰ) is a function X : Ω → E, measurable with respect to H and ℰ. The distribution of X is the measure µ on (E, ℰ) defined by µ(A) = P(X⁻¹A) [14, p.51]. For an arbitrary set T, let X_t be a random variable taking values in (E, ℰ) for each t ∈ T. Then the collection {X_t : t ∈ T} is called a stochastic process with state space (E, ℰ) and parameter set T [14, p.53]. Henceforth, random variables are defined on (Ω, H, P) and take values in [−∞, ∞]. We define the expectation of a random variable X : Ω → [−∞, ∞] as E[X] = ∫_Ω X dP [14, p.57-58]. We also define the conditional expectation [14, p.140, Definition 1.3]. Suppose F is a sub-σ-algebra of H. (a) Suppose X is a positive random variable.
Then the conditional expectation of X given F is any positive random variable E^F X satisfying E[V X] = E[V E^F X] for all positive F-measurable V : Ω → [0, ∞]. (b) Suppose X : Ω → [−∞, ∞] is a random variable. If E[X] exists, then we define E^F X = E^F X⁺ − E^F X⁻, where E^F X⁺ and E^F X⁻ are defined as in (a). Next, we define conditional probabilities, and regular versions thereof [14, pp.149-151]. Suppose H ∈ H, and let F be a sub-σ-algebra of H. Then the conditional probability of H given F is defined as P^F H = E^F 1_H. Let Q(H) be a version of P^F H for every H ∈ H. Then Q : (ω, H) ↦ Q_ω(H) is said to be a regular version of the conditional probability P^F provided that Q be a probability transition kernel from (Ω, F) into (Ω, H). Regular versions exist if (Ω, H) is a standard measurable space [14, p.151, Theorem 2.7]. The conditional distribution of a random variable X given F is any probability transition kernel L : (ω, B) ↦ L_ω(B) from (Ω, F) into (E, ℰ) such that P^F{X ∈ B} = L(B) for all B ∈ ℰ. If (E, ℰ) is a standard measurable space, then a version of the conditional distribution of X given F exists [14, p.151]. Suppose that T is a totally ordered set, i.e. whenever r, s, t ∈ T with r < s and s < t, we have r < t, and for any s, t ∈ T, exactly one of s < t, s = t and t < s holds [18, p.62]. For each t ∈ T, let F_t be a sub-σ-algebra of H. The family F = {F_t : t ∈ T} is called a filtration provided that F_s ⊆ F_t for s < t [14, p.79]. A filtered probability space (Ω, H, F, P) is a probability space (Ω, H, P) endowed with a filtration F. Finally, we review the notions of independence and conditional independence. For a fixed integer n ≥ 2, let F₁, ..., Fₙ be sub-σ-algebras of H. Then {F₁, ..., Fₙ} is called an independency if P(H₁ ∩ ... ∩ Hₙ) = P(H₁) ··· P(Hₙ) for all H₁ ∈ F₁, ..., Hₙ ∈ Fₙ. Let T be an arbitrary index set. Let F_t be a sub-σ-algebra of H for each t ∈ T. The collection {F_t : t ∈ T} is called an independency if its every finite subset is an independency [14, p.82].
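On a finite space, the defining property E[V X] = E[V E^F X] and the kernel property of a regular conditional distribution can be checked directly. The following sketch is our own illustration (the joint probabilities are arbitrary assumptions): Ω = {0,1}², F = σ(first coordinate), and X the second coordinate:

```python
# Finite sketch of conditional expectation and regular conditional
# distribution. The joint probabilities below are illustrative assumptions.
P = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
X = lambda w: w[1]                       # random variable: second coordinate

def EF_X(w):
    """Conditional expectation E^F X, constant on each atom {w0 = c} of F."""
    atom = [v for v in P if v[0] == w[0]]
    pa = sum(P[v] for v in atom)
    return sum(P[v] * X(v) for v in atom) / pa

def L(w, B):
    """Regular conditional distribution of X given F, as a kernel L(w, B)."""
    atom = [v for v in P if v[0] == w[0]]
    pa = sum(P[v] for v in atom)
    return sum(P[v] for v in atom if X(v) in B) / pa

# Defining property (a): E[V X] = E[V E^F X] for F-measurable V >= 0.
for a, b in [(1.0, 1.0), (2.0, 0.5), (0.0, 3.0)]:
    V = lambda w: a if w[0] == 0 else b  # F-measurable: depends on w0 only
    lhs = sum(P[w] * V(w) * X(w) for w in P)
    rhs = sum(P[w] * V(w) * EF_X(w) for w in P)
    assert abs(lhs - rhs) < 1e-12

# Kernel property: L(w, .) is a probability measure for every w.
assert all(abs(L(w, {0, 1}) - 1.0) < 1e-12 for w in P)
print(EF_X((0, 0)), EF_X((1, 0)))        # -> 0.4 0.8
```

On each atom of F the conditional expectation is just the atom-wise average, which is exactly what makes the regular version a probability transition kernel from (Ω, F) into the state space of X.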
Moreover, F₁, ..., Fₙ are said to be conditionally independent given F if P^F(H₁ ∩ ... ∩ Hₙ) = P^F(H₁) ··· P^F(Hₙ) for all H₁ ∈ F₁, ..., Hₙ ∈ Fₙ [14, p.158].
B Causal Effect
In this section, we define what it means for a sub-σ-algebra of the form H_S to have a causal effect on an event A ∈ H. Definition B.1. Let (Ω, H, P, K) = (∏_{t∈T} E_t, ⊗_{t∈T} ℰ_t, P, K) be a causal space, U ∈ P(T), A ∈ H an event and F a sub-σ-algebra of H (not necessarily of the form H_S for some S ∈ P(T)). (i) If K_S(ω, A) = K_{S\U}(ω, A) for all S ∈ P(T) and all ω ∈ Ω, then we say that H_U has no causal effect on A, or that H_U is non-causal to A. We say that H_U has no causal effect on F, or that H_U is non-causal to F, if, for all A ∈ F, H_U has no causal effect on A. (ii) If there exists ω ∈ Ω such that K_U(ω, A) ≠ P(A), then we say that H_U has an active causal effect on A, or that H_U is actively causal to A. We say that H_U has an active causal effect on F, or that H_U is actively causal to F, if H_U has an active causal effect on some A ∈ F. (iii) Otherwise, we say that H_U has a dormant causal effect on A, or that H_U is dormantly causal to A. We say that H_U has a dormant causal effect on F, or that H_U is dormantly causal to F, if H_U does not have an active causal effect on any event in F and there exists A ∈ F on which H_U has a dormant causal effect. Sometimes, we will say that H_U has a causal effect on A to mean that H_U has either an active or a dormant causal effect on A. The intuition is as follows. For any S ∈ P(T) and any fixed event A ∈ H, consider the function ω_{S∩U} ↦ K_S((ω_{S∩U}, ω_{S\U}), A). If H_U has no causal effect on A, then the causal kernel does not depend on the ω_{S∩U} component of ω_S. Since this has to hold for all S ∈ P(T), it is possible to have, for example, K_U(ω, A) = P(A) for all ω ∈ Ω and yet for H_U to have a causal effect on A. This is precisely the case where H_U has a dormant causal effect on A, and it means that, for some S ∈ P(T), K_S((ω_{S∩U}, ω_{S\U}), A) does depend on the ω_{S∩U} component.
We collect some straightforward but important special cases in the following remark. Remark B.2. (a) If H_U has no causal effect on A, then letting S = U in Definition B.1(i) and applying Definition 2.2(i), we see that, for all ω ∈ Ω, K_U(ω, A) = K_{U\U}(ω, A) = K_∅(ω, A) = P(A). In particular, this means that H_U cannot have both no causal effect and an active causal effect on A. (b) It is immediate that the trivial σ-algebra H_∅ = {∅, Ω} has no causal effect on any event A ∈ H. Conversely, it is also clear that H_U for any U ∈ P(T) has no causal effect on the trivial σ-algebra. (c) Let U ∈ P(T) and F a sub-σ-algebra of H. If H_U ∩ F ≠ {∅, Ω}, then H_U has an active causal effect on F, since, for A ∈ H_U ∩ F with A ≠ ∅ and A ≠ Ω, Definition 2.2(ii) tells us that K_U(ω, A) = 1_A(ω) ≠ P(A) for some ω ∈ Ω. In particular, H_U has an active causal effect on itself. Further, the full σ-algebra H = H_T has an active causal effect on all of its sub-σ-algebras except the trivial σ-algebra, and every H_U, U ∈ P(T), except the trivial σ-algebra, has an active causal effect on the full σ-algebra H. (d) Let U ∈ P(T) and F₁, F₂ be sub-σ-algebras of H. If F₁ ⊆ F₂ and H_U has no causal effect on F₂, then it is clear that H_U has no causal effect on F₁. (e) If H_U has no causal effect on an event A, then for any V ∈ P(T) with V ⊆ U, H_V has no causal effect on A. Indeed, take any S ∈ P(T). Then using the fact that H_U has no causal effect on A, we see that, for any ω ∈ Ω, K_{S\V}(ω, A) = K_{(S\V)\U}(ω, A) (applying Definition B.1(i) with S\V) = K_{S\U}(ω, A) (since V ⊆ U) = K_S(ω, A) (applying Definition B.1(i) with S). Since S ∈ P(T) was arbitrary, we have that H_V has no causal effect on A. (f) Contrapositively, if U, V ∈ P(T) with V ⊆ U and H_V has a causal effect on A, then H_U has a causal effect on A. (g) If H_U has no causal effect on A, then for any V ∈ P(T), we have K_V(ω, A) = K_{U∪V}(ω, A). Indeed, K_{U∪V}(ω, A) = K_{(U∪V)\(U\V)}(ω, A) (since U\V has no causal effect on A by (e)) = K_V(ω, A) (since (U∪V)\(U\V) = V).
(h) If U, V ∈ P(T) and neither H_U nor H_V has a causal effect on A, then H_{U∪V} has no causal effect on A. Indeed, for any S ∈ P(T) and any ω ∈ Ω, K_{S\(U∪V)}(ω, A) = K_{(S\U)\V}(ω, A) = K_{S\U}(ω, A) (as H_V has no causal effect on A) = K_S(ω, A) (as H_U has no causal effect on A). Since S ∈ P(T) was arbitrary, H_{U∪V} has no causal effect on A. (i) Contrapositively, if U, V ∈ P(T) and H_{U∪V} has a causal effect on A, then either H_U or H_V has a causal effect on A. Following the definition of no causal effect, we define the notion of a trivial causal kernel. Definition B.3. Let (Ω, H, P, K) = (∏_{t∈T} E_t, ⊗_{t∈T} ℰ_t, P, K) be a causal space, and U ∈ P(T). We say that the causal kernel K_U is trivial if H_U has no causal effect on H_{T\U}. Note that we can decompose H as H = H_U ⊗ H_{T\U}, and so H is generated by events of the form A ∩ B for A ∈ H_U and B ∈ H_{T\U}. But if K_U is trivial, then we have, by Axiom 2.2(ii), K_U(ω, A ∩ B) = 1_A(ω)P(B) for such a rectangle. We also define a conditional version of causal effects. Definition B.4. Let (Ω, H, P, K) = (∏_{t∈T} E_t, ⊗_{t∈T} ℰ_t, P, K) be a causal space, U, V ∈ P(T), A ∈ H an event and F a sub-σ-algebra of H (not necessarily of the form H_S for some S ∈ P(T)). (i) If K_{S∪V}(ω, A) = K_{(S∪V)\(U\V)}(ω, A) for all S ∈ P(T) and all ω ∈ Ω, then we say that H_U has no causal effect on A given H_V, or that H_U is non-causal to A given H_V. We say that H_U has no causal effect on F given H_V, or that H_U is non-causal to F given H_V, if, for all A ∈ F, H_U has no causal effect on A given H_V. (ii) If there exists ω ∈ Ω such that K_{U∪V}(ω, A) ≠ K_V(ω, A), then we say that H_U has an active causal effect on A given H_V, or that H_U is actively causal to A given H_V. We say that H_U has an active causal effect on F given H_V, or that H_U is actively causal to F given H_V, if H_U has an active causal effect on some A ∈ F given H_V. (iii) Otherwise, we say that H_U has a dormant causal effect on A given H_V, or that H_U is dormantly causal to A given H_V.
We say that H_U has a dormant causal effect on F given H_V, or that H_U is dormantly causal to F given H_V, if H_U does not have an active causal effect on any event in F given H_V and there exists A ∈ F on which H_U has a dormant causal effect given H_V. Sometimes, we will say that H_U has a causal effect on A given H_V to mean that H_U has either an active or a dormant causal effect on A given H_V. The intuition is as follows. For any fixed S ∈ P(T) and any fixed event A ∈ H, consider the function ω_{S∩(U\V)} ↦ K_{S∪V}((ω_{(S∪V)\(U\V)}, ω_{S∩(U\V)}), A). If H_U has no causal effect on A given H_V, then the causal kernel does not depend on the ω_{S∩(U\V)} component of ω_{S∪V}; in other words, H_U only has an influence on A through its V component. We collect some important special cases in the following remark. Remark B.5. (a) Letting V = U, we always have K_{S∪U}(ω, A) = K_{(S∪U)\(U\U)}(ω, A) = K_{S∪U}(ω, A) for all ω ∈ Ω and A ∈ H, which means that H_U has no causal effect on any event A ∈ H given itself. (b) If H_U has no causal effect on A given H_V, then letting S = U in Definition B.4(i), we see that, for all ω ∈ Ω, K_{U∪V}(ω, A) = K_V(ω, A). In particular, this means that H_U cannot have both no causal effect and an active causal effect on A given H_V. (c) The case V = ∅ reduces Definition B.4 to Definition B.1, i.e. H_U having no causal effect in the sense of Definition B.1 is the same as H_U having no causal effect given {∅, Ω} in the sense of Definition B.4, etc. (d) It is possible for H_U to be causal to an event A, and for there to exist V ∈ P(T) such that H_U has no causal effect on A given H_V. However, if H_U has no causal effect on A, then for any V ∈ P(T), H_U has no causal effect on A given H_V. To see this, note that Remark B.2(e) tells us that H_{U\V} also does not have any causal effect on A. Then given any S ∈ P(T), K_{S∪V}(ω, A) = K_{(S∪V)\(U\V)}(ω, A), applying Definition B.1(i) to S∪V. Since S ∈ P(T) was arbitrary, H_U has no causal effect on A given H_V.
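The notions of Definition B.1 can be made concrete on a finite causal space. The sketch below is our own illustration (the kernels and numbers are assumptions in the spirit of a two-variable chain X₁ → X₂, not taken from the paper): it verifies that H_{{2}} has no causal effect on the event {X₁ = 1}, while H_{{1}} has an active causal effect on {X₂ = 1}:

```python
import itertools

# Toy causal space: T = {1, 2}, E_1 = E_2 = {0, 1}, X2 copies X1 with
# probability 0.9. All kernels below are illustrative assumptions.
p_copy = 0.9
Omega = list(itertools.product([0, 1], repeat=2))

def P(w):                                   # observational measure
    return 0.5 * (p_copy if w[1] == w[0] else 1 - p_copy)

def K(S, v, w):
    """Causal kernel K_S(v, {w}): point mass on the intervened coordinates
    in S, the mechanism X1 -> X2 (and the marginal of X1) on the rest."""
    p = float(w[0] == v[0]) if 1 in S else 0.5
    if 2 in S:
        p *= float(w[1] == v[1])
    else:
        p *= p_copy if w[1] == w[0] else 1 - p_copy
    return p

def K_event(S, v, A):                       # K_S(v, A) for an event A
    return sum(K(S, v, w) for w in Omega if A(w))

A = lambda w: w[0] == 1                     # event {X1 = 1} in H_{1}
B = lambda w: w[1] == 1                     # event {X2 = 1} in H_{2}

# Definition B.1(i): H_{2} has no causal effect on {X1 = 1}, since
# K_S(v, A) = K_{S \ {2}}(v, A) for every S in P(T) and every v.
for S in [set(), {1}, {2}, {1, 2}]:
    for v in Omega:
        assert abs(K_event(S, v, A) - K_event(S - {2}, v, A)) < 1e-12

# Definition B.1(ii): H_{1} has an active causal effect on {X2 = 1},
# since K_{1}(v, B) differs from P(B) for some v.
PB = sum(P(w) for w in Omega if B(w))
print(K_event({1}, (1, 0), B), round(PB, 6))   # 0.9 versus 0.5
```

Note that the kernels, not structural equations, are the primitives here: the loop checks the defining equality of "no causal effect" for every S, exactly as Definition B.1(i) requires.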
C Interventions

In this section, we provide a few more definitions and results related to the notion of interventions, introduced in Definition 2.3. First, we make a few remarks on how the intervention causal kernels K^do(U,Q,L)_S behave in some special cases, depending on the relationship between U and S.

Remark C.1. (a) For S ∈ P(T) with U ⊆ S, we have, for all ω ∈ Ω and all A ∈ H,

K^do(U,Q,L)_S(ω, A) = ∫ L_U(ω_U, dω'_U) K_S((ω_{S\U}, ω'_U), A)
= ∫ δ_{ω_U}(dω'_U) K_S((ω_{S\U}, ω'_U), A)   (by Axiom 2.2(ii))
= K_S((ω_{S\U}, ω_U), A) = K_S(ω, A).

This means that, after an intervention on H_U, a subsequent intervention on H_S with H_U ⊆ H_S simply overwrites the original intervention. Note that this is reminiscent of the partial ordering on the set of interventions in [49], but in our setting, it is given by the partial ordering induced by the inclusion structure of sub-σ-algebras of H.

(b) For S ∈ P(T) with S ⊆ U,

K^do(U,Q,L)_S(ω, A) = ∫ L_S(ω_S, dω'_U) K_U(ω'_U, A)

for all ω ∈ Ω and A ∈ H, i.e. K^do(U,Q,L)_S is a product of the two kernels K_U and L_S [14, p.39]; in particular, K^do(U,Q,L)_S(ω, A) = L_S(ω, A) for all A ∈ H_U.

(c) For S ∈ P(T) with S ∩ U = ∅,

K^do(U,Q,L)_S(ω, A) = ∫ L_∅(ω_∅, dω'_U) K_{S∪U}((ω_S, ω'_U), A)
= ∫ Q(dω'_U) K_{S∪U}((ω_S, ω'_U), A)   (by Axiom 2.2(i))

for all ω ∈ Ω and A ∈ H, i.e. the effect of intervening on H_U with Q and then intervening on H_S is the same as intervening on H_{U∪S} with a product measure of Q on H_U and whatever measure we place on H_S.

We give a name to the special case in which the internal causal kernels are all trivial (see Definition B.3).

Definition C.2. Let (Ω, H, P, K) = (∏_{t∈T} E_t, ⊗_{t∈T} E_t, P, K) be a causal space, U ∈ P(T) and Q a probability measure on (Ω, H_U).
A hard intervention on H_U via Q is a new causal space (Ω, H, P^do(U,Q), K^do(U,Q,hard)), where the intervention measure P^do(U,Q) is a probability measure on (Ω, H) defined in the same way as in Definition 2.3, and the intervention causal mechanism K^do(U,Q,hard) = {K^do(U,Q,hard)_S : S ∈ P(T)} consists of causal kernels that are obtained from the intervention causal kernels in Definition 2.3 in which L_{S∩U} is a trivial causal kernel, i.e. one that has no causal effect on H_{U\S}. From the discussion following Definition B.3, we have that, for A ∈ H_{S∩U} and B ∈ H_{U\S}, L_{S∩U}(ω, A ∩ B) = 1_A(ω_{S∩U}) Q(B).

The next result gives an explicit expression for the causal kernels obtained after a hard intervention.

Theorem C.3. Let (Ω, H, P, K) = (∏_{t∈T} E_t, ⊗_{t∈T} E_t, P, K) be a causal space, U ∈ P(T) and Q a probability measure on (Ω, H_U). Then after a hard intervention on H_U via Q, the intervention causal kernels K^do(U,Q,hard)_S are given by

K^do(U,Q,hard)_S(ω, A) = K^do(U,Q,hard)_S(ω_S, A) = ∫ Q(dω'_{U\S}) K_{S∪U}((ω_S, ω'_{U\S}), A).

Intuitively, hard interventions do not encode any internal causal relationships within H_U, so after we subsequently intervene on H_S, the measure Q that we originally imposed on H_U remains on H_{U\S}.

The following lemma contains a couple of results about particular sub-σ-algebras having no causal effect on particular events in the intervention causal space, regardless of the measure and causal mechanism that were used for the intervention.

Lemma C.4. Let (Ω, H, P, K) = (∏_{t∈T} E_t, ⊗_{t∈T} E_t, P, K) be a causal space, U ∈ P(T), Q a probability measure on (Ω, H_U) and L = {L_V : V ∈ P(U)} a causal mechanism on (Ω, H_U, Q). Suppose we intervene on H_U via (Q, L).

(i) For A ∈ H_U and V ∈ P(T) with V ∩ U = ∅, H_V has no causal effect on A (cf. Definition B.1(i)) in the intervention causal space (Ω, H, P^do(U,Q), K^do(U,Q,L)), i.e. events in the σ-algebra H_U on which the intervention took place are not causally affected by σ-algebras outside H_U.

(ii) Again, let V ∈ P(T) with V ∩ U = ∅, and also let A ∈ H be any event.
If, in the original causal space, H_V had no causal effect on A, then in the intervention causal space, H_V has no causal effect on A either.

(iii) Now let V ∈ P(T), let A ∈ H be any event, and suppose that the intervention on H_U via Q is hard. Then if H_V had no causal effect on A in the original causal space, H_V has no causal effect on A in the intervention causal space either.

Lemma C.4(ii) and (iii) tell us that, if H_V had no causal effect on A in the original causal space, then neither by intervening on H_U with V ∩ U = ∅ nor by any hard intervention can we create a causal effect from H_V on A. However, by intervening on a sub-σ-algebra that contains both H_V and (a part of) A, and manipulating the internal causal mechanism L appropriately, it is clear that we can create a causal effect from H_V.

The next result tells us that if a sub-σ-algebra H_U has a dormant causal effect on an event A, then there is a sub-σ-algebra of H_U and a hard intervention after which that sub-σ-algebra has an active causal effect on A.

Lemma C.5. Let (Ω, H, P, K) = (∏_{t∈T} E_t, ⊗_{t∈T} E_t, P, K) be a causal space, and U ∈ P(T). For an event A ∈ H, if H_U has a dormant causal effect on A in the original causal space, then there exist a hard intervention and a subset V ⊆ U such that, in the intervention causal space, H_V has an active causal effect on A.

The next result is about what happens to the causal effect of a sub-σ-algebra that has no causal effect on an event conditionally on another sub-σ-algebra, after intervening on that other sub-σ-algebra.

Lemma C.6. Let (Ω, H, P, K) = (∏_{t∈T} E_t, ⊗_{t∈T} E_t, P, K) be a causal space, and U, V ∈ P(T). For an event A ∈ H, suppose that H_U has no causal effect on A given H_V (see Definition B.4). Then after an intervention on H_V via any (Q, L), H_{U\V} has no causal effect on A.

The next result shows that, under a hard intervention, a time-respecting causal mechanism stays time-respecting.

Theorem C.7.
Let (Ω, H, P, K) = (∏_{t∈T} E_t, ⊗_{t∈T} E_t, P, K) be a causal space, where the index set can be written as T = W × T̃, with W representing time and K respecting time. Take any U ∈ P(T) and any probability measure Q on H_U. Then the intervention causal mechanism K^do(U,Q,hard) also respects time.

D Sources

In causal spaces, the observational distribution P and the causal mechanism K are completely decoupled. In Section 3.1, we give a detailed argument as to why this is desirable, but of course, there is no doubt that the special case in which the causal kernels coincide with conditional measures with respect to P is worth studying. To that end, we introduce the notion of sources.

Definition D.1. Let (Ω, H, P, K) = (∏_{t∈T} E_t, ⊗_{t∈T} E_t, P, K) be a causal space, U ∈ P(T), A ∈ H an event and F a sub-σ-algebra of H. We say that H_U is a (local) source of A if K_U(·, A) is a version of the conditional probability P_{H_U}(A). We say that H_U is a (local) source of F if H_U is a source of all A ∈ F. We say that H_U is a global source of the causal space if H_U is a source of all A ∈ H.

Clearly, source σ-algebras are not unique (whether local or global). It is easy to see that H_∅ = {∅, Ω} and H_T = ⊗_{t∈T} E_t are global sources, and axiom (ii) of Definition 2.2 implies that any H_S is a local source of any of its sub-σ-algebras, including itself, since, for any A ∈ H_S, K_S(·, A) = 1_A = P_{H_S}(A). Also, a sub-σ-algebra of a source is not necessarily a source, nor is a σ-algebra that contains a source necessarily a source (whether local or global).

In Example 2.5 above, altitude is a source of temperature (and hence a global source), since the causal kernel corresponding to temperature coincides with the conditional measure given altitude, but temperature is not a source of altitude.

When we intervene on H_U (via any (Q, L)), H_U becomes a global source. This precisely coincides with the gold standard of randomised controlled trials in causal inference, i.e.
the idea that, if we are able to intervene on H_U, then the causal effect of H_U on any event can be obtained by first intervening on H_U and then considering the conditional distribution given H_U. The next theorem shows that when one intervenes on H_U, H_U indeed becomes a source.

Theorem D.2. Suppose we have a causal space (Ω, H, P, K) = (∏_{t∈T} E_t, ⊗_{t∈T} E_t, P, K), and let U ∈ P(T).

(i) For any measure Q on H_U and any causal mechanism L on (Ω, H_U, Q), the causal kernel K^do(U,Q,L)_U = K_U is a version of P^do(U,Q)_{H_U}, which means that H_U is a global source σ-algebra of the intervened causal space (Ω, H, P^do(U,Q), K^do(U,Q,L)).

(ii) Suppose V ∈ P(T) with V ⊆ U. Suppose that the measure Q on (Ω, H_U) factorises over H_V and H_{U\V}, i.e. for any A ∈ H_V and B ∈ H_{U\V}, Q(A ∩ B) = Q(A)Q(B). Then after a hard intervention on H_U via Q, the causal kernel K^do(U,Q,hard)_V is a version of P^do(U,Q)_{H_V}, which means that H_V is a global source σ-algebra of the intervened causal space (Ω, H, P^do(U,Q), K^do(U,Q,hard)).

Let A ∈ H be an event, and U ∈ P(T). By the definition of the intervention measure (Definition 2.3), we always have

P^do(U,Q)(A) = ∫ Q(dω) K_U(ω, A),

hence P^do(U,Q)(A) can be written in terms of P and Q whenever K_U(ω, A) can be written in terms of P. This occurs in three trivial cases: first, if H_U is a local source of A (see Definition D.1), in which case K_U(ω, A) = P_{H_U}(ω, A); secondly, if H_U has no causal effect on A (see Definition B.1), in which case K_U(ω, A) = P(A); and finally, if A ∈ H_U, in which case, by interventional determinism (Definition 2.2(ii)), we have K_U(ω, A) = 1_A(ω). In the last case, we do not even have dependence on P. Can we generalise these results?

Lemma D.3. Let (Ω, H, P, K) = (∏_{t∈T} E_t, ⊗_{t∈T} E_t, P, K) be a causal space. Let A ∈ H be an event, and U ∈ P(T).
If there exists a sub-σ-algebra G of H (not necessarily of the form H_V for some V ∈ P(T)) such that

(i) the conditional probability P^do(U,Q)_{H_U ∨ G}(·, A) can be written in terms of P and Q;
(ii) the causal kernel K_U(·, B) can be written in terms of P for all B ∈ G;

then P^do(U,Q)(A) can be written in terms of P and Q.

Remark D.4. The three cases discussed in the paragraph above Lemma D.3 are special cases of the Lemma, with G being any sub-σ-algebra of H with {∅, Ω} ⊆ G ⊆ H_U. In this case, condition (ii) is trivially satisfied since we have K_U(·, B) = 1_B(·) by interventional determinism (Definition 2.2(ii)), and for condition (i), by Theorem D.2(i), we have P^do(U,Q)_{H_U}(·, A) = K_U(·, A), which means that the problem reduces to checking whether K_U(·, A) can be written in terms of P.

Proof. By the law of total expectation, we have

P^do(U,Q)(A) = ∫ P^do(U,Q)_{H_U ∨ G}(ω, A) P^do(U,Q)(dω) = ∫ P^do(U,Q)_{H_U ∨ G}(ω, A) ∫ Q(dω') K_U(ω', dω).

Here, P^do(U,Q)_{H_U ∨ G}(ω, A) can be written in terms of P and Q by condition (i). Moreover, note that it suffices to be able to write the restriction of K_U(ω', ·) to H_U ∨ G in terms of P, since the integrand is an (H_U ∨ G)-measurable function. Since the collection of intersections {D ∩ B : D ∈ H_U, B ∈ G} is a π-system that generates H_U ∨ G [14, p.5, 1.18], it suffices to check that K_U(ω', D ∩ B) can be written in terms of P for all D ∈ H_U and B ∈ G. But by interventional determinism (Definition 2.2(ii)), we have K_U(ω', D ∩ B) = 1_D(ω') K_U(ω', B). Since K_U(ω', B) can be written in terms of P by condition (ii), the restriction of K_U(ω', ·) to H_U ∨ G can be written in terms of P, and hence P^do(U,Q)(A) can be written in terms of P and Q.

Corollary D.5. Let (Ω, H, P, K) = (∏_{t∈T} E_t, ⊗_{t∈T} E_t, P, K) be a causal space. Let A ∈ H be an event, and U ∈ P(T).
If there exists a V ∈ P(T) such that condition (i) of Lemma D.3 is satisfied with G = H_V, and one of the following conditions is satisfied:

(a) H_U is a local source of H_V; or
(b) H_U has no causal effect on H_V; or
(c) V ⊆ U;

then P^do(U,Q)(A) can be written in terms of P and Q.

Proof. Condition (i) of Lemma D.3 is satisfied by hypothesis. If one of (a), (b) or (c) is satisfied, then trivially, condition (ii) of Lemma D.3 is also satisfied. The result now follows from Lemma D.3.

The above is reminiscent of valid adjustments in the context of structural causal models [46, p.115, Proposition 6.41], and in fact contains valid adjustments as a special case.

E Counterfactuals

There are various notions of counterfactuals in the literature. The one considered in the SCM literature is the interventional counterfactual, which captures the notion of "what would have happened had we intervened on the space, given some observations (that are possibly contradictory to the intervention we imagine)?" Recently, backtracking counterfactuals have also been integrated into the SCM framework [58]. These capture the notion of "what would have happened if the background conditions of the world had been different, given that the causal laws of the system stay the same?" Finally, we note that in the potential outcomes framework, the random variables representing potential outcomes, which form the primitives of the framework, can be directly counterfactual.

Vanilla probability measures have just one argument, namely the event. Conditional measures and causal kernels (in the sense of our Definition 2.2) have two arguments, the first being the outcome whose occurrence we either observe or force, and the second being the event in whose measure we are interested. For both of the above concepts of counterfactuals, we need to go one step further and consider three arguments. The first is the outcome which we observe, just as in conditioning, and the last should be the event in whose measure we are interested.
For interventional counterfactuals, the second argument should be an outcome whose occurrence we imagine to have forced, given that we observed the outcome in the first argument; for backtracking counterfactuals, the second argument should be an outcome which we imagine to have observed instead of the outcome in the first argument that we actually observed. From these principles, we tentatively propose to extend Definition 2.2 to account for interventional counterfactuals as follows.

Definition E.1. A causal space is defined as the quadruple (Ω, H, P, K), where (Ω, H, P) = (∏_{t∈T} E_t, ⊗_{t∈T} E_t, P) is a probability space and K = {K_{S,F} : S ∈ P(T), F a sub-σ-algebra of H}, called the causal mechanism, is a collection of functions K_{S,F} : Ω × Ω × H → [0, 1], called the causal kernel on H_S after observing F, such that

(i) for each fixed η ∈ Ω and A ∈ H, K_{S,F}(·, η, A) is measurable with respect to H_S;
(ii) for each fixed ω ∈ Ω and A ∈ H, K_{S,F}(ω, ·, A) is measurable with respect to F;
(iii) for each fixed pair (ω, η) ∈ Ω × Ω, K_{S,F}(ω, η, ·) is a measure on H;
(iv) for all A ∈ H and ω, η ∈ Ω, K_{∅,F}(ω, η, A) = P_F(η, A);
(v) for all A ∈ H_S, all B ∈ H and all ω, η ∈ Ω, K_{S,F}(ω, η, A ∩ B) = 1_A(ω) K_{S,F}(ω, η, B); in particular, for A ∈ H_S, K_{S,F}(ω, η, A) = 1_A(ω) K_{S,F}(ω, η, Ω) = 1_A(ω);
(vi) for all A ∈ H, ω ∈ Ω and sub-σ-algebras F ⊆ G ⊆ H, E^F[K_{S,G}(ω, ·, A)] = K_{S,F}(ω, ·, A).

Note that letting F = {∅, Ω} trivially recovers the causal space as defined in Definition 2.2. Moreover, letting S = ∅, we recover the conditional distribution given F.
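One concrete way to obtain kernels satisfying Definition E.1 in the finite case is to derive them from an SCM. The sketch below is our own toy construction, not part of the axioms: it takes the SCM a ~ Bern(1/2), b = a XOR n with noise n ~ Bern(0.1), and F = H (full observation). Since every outcome has positive probability, observing η pins the noise exactly, so the counterfactual kernel is a point mass; properties (iv) and (v) can then be checked directly.

```python
# Toy interventional-counterfactual kernel (our SCM-derived sketch of an
# instance of Definition E.1, not the paper's construction):
# Ω = {0,1}² with coordinates (a, b), SCM a ~ Bern(1/2), b = a XOR n,
# n ~ Bern(0.1), and F = H (we observe everything, so the noise
# posterior is a point mass).

def cf_kernel(S, omega, eta, event):
    """K_{S,H}(ω, η, ·): having observed η, the noise is pinned to
    n = η_a XOR η_b; coordinates in S ⊆ {'a','b'} are then forced to ω,
    and unforced coordinates are regenerated through the same SCM."""
    n = eta[0] ^ eta[1]                      # abduction: recover the noise
    a = omega[0] if 'a' in S else eta[0]     # action on a, if forced
    b = omega[1] if 'b' in S else a ^ n      # prediction through b = a XOR n
    return 1.0 if (a, b) in event else 0.0

# Property (iv): with S = ∅, the kernel is P_H(η, ·) = δ_η, the
# conditional probability given the full σ-algebra.
for eta in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    assert cf_kernel(set(), (0, 0), eta, {eta}) == 1.0

# Property (v), interventional determinism: for A ∈ H_a, the kernel of A
# equals 1_A(ω) regardless of the observation η.
assert cf_kernel({'a'}, (1, 0), (0, 0), {(1, 0), (1, 1)}) == 1.0

# A counterfactual query: we observed η = (0, 0), hence n = 0; had we
# forced a = 1, we would have had b = 1 with probability one.
assert cf_kernel({'a'}, (1, 0), (0, 0), {(1, 1)}) == 1.0
```

When F is a strict sub-σ-algebra, the observation no longer pins the noise and the kernel becomes a genuine mixture; property (vi) then says that coarsening F averages the counterfactual kernels by conditional expectation.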
Recall that one of the biggest philosophical differences between the SCM framework and our proposed causal spaces (Definition 2.2) is that SCMs take the variables, the structural equations and the noise distributions as the primitive objects, with the observational and interventional distributions over the endogenous variables derived from these, whereas causal spaces take the observational and interventional distributions themselves as the primitive objects (the latter via causal kernels). Note that, in the above extended definition of causal spaces incorporating interventional counterfactuals (Definition E.1), we applied the same principle, in that we treated the observational distribution (P), the interventional distributions (K_{S,{∅,Ω}}) and the (interventional) counterfactual distributions (K_{S,F}) as the primitive objects. This differs significantly from the SCM framework, where, again, the (interventional) counterfactual distributions are derived from the structural equations: one first conditions on the observed values of the endogenous variables to obtain a modified (often Dirac) measure on the exogenous variables, then intervenes on some of the endogenous variables, and derives the measure on the remaining endogenous variables by propagating these through the same structural equations. We see the value of this approach in that the (interventional) counterfactual distributions can be neatly derived from the same primitive objects that are used to calculate the observational and interventional distributions. However, we argue that this cannot be an axiomatisation of (interventional) counterfactual distributions in the strictest sense, because it relies on assumptions. In particular, it relies strongly on the assumption that the endogenous variables have no causal effect on the exogenous variables, and when this assumption is violated, i.e. when there is a hidden mediator, the calculation of (interventional) counterfactual distributions is not possible.
In contrast, Definition E.1 treats the (interventional) counterfactual measures as the primitive objects, and does not impose any a priori assumptions on the system. As mentioned in Section 5 of the main body of the paper, we leave further development of this interventional counterfactual causal space, as well as the definition of a backtracking counterfactual causal space, as essential future work.

F Proofs

Theorem 2.6. From Definition 2.3, P^do(U,Q) is indeed a measure on (Ω, H), and K^do(U,Q,L) is indeed a valid causal mechanism on (Ω, H, P^do(U,Q)), i.e. they satisfy the axioms of Definition 2.2.

Proof. That P^do(U,Q) is a measure on (Ω, H) follows immediately from the usual construction of measures from measures and transition probability kernels; see e.g. Çınlar [14, p.38, Theorem 6.3]. It remains to check that K^do(U,Q,L) is a valid causal mechanism in the sense of Definition 2.2.

(i) For all A ∈ H and ω ∈ Ω,

K^do(U,Q,L)_∅(ω, A) = ∫ L_∅(ω_∅, dω'_U) K_U(ω'_U, A) = ∫ Q(dω'_U) K_U(ω'_U, A) = P^do(U,Q)(A),

where we applied Axiom 2.2(i) to L_∅.

(ii) For all A ∈ H_S and B ∈ H, we have, by Axiom 2.2(ii) and the fact that A ∈ H_S ⊆ H_{S∪U},

K^do(U,Q,L)_S(ω, A ∩ B) = ∫ L_{S∩U}(ω_{S∩U}, dω'_U) K_{S∪U}((ω_{S\U}, ω'_U), A ∩ B)
= ∫ L_{S∩U}(ω_{S∩U}, dω'_U) 1_A((ω_{S\U}, ω'_U)) K_{S∪U}((ω_{S\U}, ω'_U), B)
= ∫ L_{S∩U}(ω_{S∩U}, dω'_U) 1_A((ω_{S\U}, ω'_{S∩U})) K_{S∪U}((ω_{S\U}, ω'_U), B),

where, in going from the second line to the third, we split the ω'_U in 1_A((ω_{S\U}, ω'_U)) into components (ω'_{S∩U}, ω'_{U\S}) and notice that, since A ∈ H_S, 1_A does not depend on the component ω'_{U\S}. Here, the map ω'_{S∩U} ↦ 1_A((ω_{S\U}, ω'_{S∩U})) is H_{S∩U}-measurable, so we can write it as the limit of an increasing sequence of positive H_{S∩U}-simple functions (see Section A.1), say (f_n)_{n∈N} with f_n = Σ_{i_n=1}^{m_n} b_{i_n} 1_{B_{i_n}}, where B_{i_n} ∈ H_{S∩U}. Likewise, the map ω'_U ↦ K_{S∪U}((ω_{S\U}, ω'_U), B) is H_U-measurable, so we can write it as the limit of an increasing sequence of positive H_U-simple functions, say (g_n)_{n∈N} with g_n = Σ_{j_n=1}^{l_n} c_{j_n} 1_{C_{j_n}}, where C_{j_n} ∈ H_U.
Hence

K^do(U,Q,L)_S(ω, A ∩ B) = ∫ L_{S∩U}(ω_{S∩U}, dω'_U) (lim_n f_n(ω'_{S∩U})) (lim_n g_n(ω'_U)).

Since, for each ω'_U, both of the limits exist by construction (being the original measurable functions), the product of the limits is the limit of the products:

K^do(U,Q,L)_S(ω, A ∩ B) = ∫ L_{S∩U}(ω_{S∩U}, dω'_U) lim_n f_n(ω'_{S∩U}) g_n(ω'_U).

Here, since (f_n) and (g_n) are individually increasing sequences of positive functions, the pointwise products f_n g_n also form an increasing sequence. Hence, we can apply the monotone convergence theorem to see that

K^do(U,Q,L)_S(ω, A ∩ B)
= lim_n ∫ L_{S∩U}(ω_{S∩U}, dω'_U) f_n(ω'_{S∩U}) g_n(ω'_U)
= lim_n Σ_{i_n=1}^{m_n} Σ_{j_n=1}^{l_n} b_{i_n} c_{j_n} ∫ L_{S∩U}(ω_{S∩U}, dω'_U) 1_{B_{i_n}}(ω'_{S∩U}) 1_{C_{j_n}}(ω'_U)
= lim_n Σ_{i_n=1}^{m_n} Σ_{j_n=1}^{l_n} b_{i_n} c_{j_n} L_{S∩U}(ω_{S∩U}, B_{i_n} ∩ C_{j_n})
= lim_n Σ_{i_n=1}^{m_n} Σ_{j_n=1}^{l_n} b_{i_n} c_{j_n} 1_{B_{i_n}}(ω_{S∩U}) L_{S∩U}(ω_{S∩U}, C_{j_n})
= lim_n (Σ_{i_n=1}^{m_n} b_{i_n} 1_{B_{i_n}}(ω_{S∩U})) (Σ_{j_n=1}^{l_n} c_{j_n} L_{S∩U}(ω_{S∩U}, C_{j_n}))
= (lim_n f_n(ω_{S∩U})) (lim_n Σ_{j_n=1}^{l_n} c_{j_n} L_{S∩U}(ω_{S∩U}, C_{j_n}))
= lim_n f_n(ω_{S∩U}) · lim_n ∫ L_{S∩U}(ω_{S∩U}, dω'_U) Σ_{j_n=1}^{l_n} c_{j_n} 1_{C_{j_n}}(ω'_U)
= 1_A((ω_{S\U}, ω_{S∩U})) ∫ L_{S∩U}(ω_{S∩U}, dω'_U) lim_n g_n(ω'_U)
= 1_A(ω_S) ∫ L_{S∩U}(ω_{S∩U}, dω'_U) K_{S∪U}((ω_{S\U}, ω'_U), B)
= 1_A(ω_S) K^do(U,Q,L)_S(ω, B),

where we used Axiom 2.2(ii) for the fourth equality, the fact that the limit of the products is the product of the limits (both limits existing by construction) for the sixth, and the monotone convergence theorem again for the eighth. This is the required result.

Theorem C.3. Let (Ω, H, P, K) = (∏_{t∈T} E_t, ⊗_{t∈T} E_t, P, K) be a causal space, U ∈ P(T) and Q a probability measure on (Ω, H_U). Then after a hard intervention on H_U via Q, the intervention causal kernels K^do(U,Q,hard)_S are given by

K^do(U,Q,hard)_S(ω, A) = K^do(U,Q,hard)_S(ω_S, A) = ∫ Q(dω'_{U\S}) K_{S∪U}((ω_S, ω'_{U\S}), A).

Proof. We decompose H_U as a product σ-algebra into H_{S∩U} ⊗ H_{U\S}. Then events of the form B ∩ C with B ∈ H_{S∩U} and C ∈ H_{U\S} generate H_U, so for fixed ω_{S∩U}, the measure L_{S∩U}(ω_{S∩U}, ·) is completely determined by the values L_{S∩U}(ω_{S∩U}, B ∩ C) for all B ∈ H_{S∩U}, C ∈ H_{U\S}. But we have

L_{S∩U}(ω_{S∩U}, B ∩ C) = δ_{ω_{S∩U}}(B) L_{S∩U}(ω_{S∩U}, C)   (by Axiom 2.2(ii))
= δ_{ω_{S∩U}}(B) Q(C),   (since L_{S∩U} is trivial and C ∈ H_{U\S})
So the measure L_{S∩U}(ω_{S∩U}, ·) is the product measure of δ_{ω_{S∩U}} and Q. Hence, applying Fubini's theorem,

K^do(U,Q,hard)_S(ω, A) = ∫ L_{S∩U}(ω_{S∩U}, dω'_U) K_{S∪U}((ω_{S\U}, ω'_U), A)
= ∫∫ K_{S∪U}((ω_{S\U}, ω'_{S∩U}, ω'_{U\S}), A) δ_{ω_{S∩U}}(dω'_{S∩U}) Q(dω'_{U\S})
= ∫ K_{S∪U}((ω_{S\U}, ω_{S∩U}, ω'_{U\S}), A) Q(dω'_{U\S})
= ∫ Q(dω'_{U\S}) K_{S∪U}((ω_S, ω'_{U\S}), A),

as required.

Theorem D.2. Suppose we have a causal space (Ω, H, P, K) = (∏_{t∈T} E_t, ⊗_{t∈T} E_t, P, K), and let U ∈ P(T).

(i) For any measure Q on H_U and any causal mechanism L on (Ω, H_U, Q), the causal kernel K^do(U,Q,L)_U = K_U is a version of P^do(U,Q)_{H_U}, which means that H_U is a global source σ-algebra of the intervened causal space (Ω, H, P^do(U,Q), K^do(U,Q,L)).

(ii) Suppose V ∈ P(T) with V ⊆ U. Suppose that the measure Q on (Ω, H_U) factorises over H_V and H_{U\V}, i.e. for any A ∈ H_V and B ∈ H_{U\V}, Q(A ∩ B) = Q(A)Q(B). Then after a hard intervention on H_U via Q, the causal kernel K^do(U,Q,hard)_V is a version of P^do(U,Q)_{H_V}, which means that H_V is a global source σ-algebra of the intervened causal space (Ω, H, P^do(U,Q), K^do(U,Q,hard)).

Proof. Suppose that f = Σ_{i=1}^m b_i 1_{B_i} is an H_U-simple function, i.e. B_i ∈ H_U for i = 1, ..., m. Then for any B ∈ H_U,

∫_B f(ω) P^do(U,Q)(dω) = ∫ Σ_{i=1}^m b_i 1_{B∩B_i}(ω) P^do(U,Q)(dω)
= Σ_{i=1}^m b_i P^do(U,Q)(B ∩ B_i)
= Σ_{i=1}^m b_i ∫ Q(dω) K_U(ω, B ∩ B_i)   (by the definition of P^do(U,Q))
= Σ_{i=1}^m b_i ∫ Q(dω) 1_{B∩B_i}(ω)   (by Axiom 2.2(ii))
= ∫ Σ_{i=1}^m b_i 1_{B_i}(ω) 1_B(ω) Q(dω)
= ∫_B f(ω) Q(dω).

Now, any positive H_U-measurable map g : Ω → R can be written as the limit of an increasing sequence of positive H_U-simple functions f_n (see Section A.1), so for any B ∈ H_U,

∫_B g(ω) P^do(U,Q)(dω) = ∫_B lim_n f_n(ω) P^do(U,Q)(dω)
= lim_n ∫_B f_n(ω) P^do(U,Q)(dω)   (by the monotone convergence theorem)
= lim_n ∫_B f_n(ω) Q(dω)   (by the above)
= ∫_B lim_n f_n(ω) Q(dω)   (by the monotone convergence theorem)
= ∫_B g(ω) Q(dω).

We use this fact in the proofs of both parts of this theorem.

(i) First note that we indeed have K^do(U,Q,L)_U = K_U, by Remark C.1(a).
For any A ∈ H, the map ω ↦ K_U(ω, A) is H_U-measurable, so for any B ∈ H_U,

∫_B K_U(ω, A) P^do(U,Q)(dω) = ∫_B K_U(ω, A) Q(dω)   (by the above fact)
= ∫ 1_B(ω) K_U(ω, A) Q(dω)
= ∫ K_U(ω, A ∩ B) Q(dω)   (by Axiom 2.2(ii))
= P^do(U,Q)(A ∩ B)
= ∫ 1_{A∩B}(ω) P^do(U,Q)(dω)
= ∫ 1_B(ω) 1_A(ω) P^do(U,Q)(dω)
= ∫_B 1_A(ω) P^do(U,Q)(dω).

So K_U(·, A) = K^do(U,Q,L)_U(·, A) is indeed a version of the conditional probability P^do(U,Q)_{H_U}(A), which means that H_U is a global source of (Ω, H, P^do(U,Q), K^do(U,Q,L)).

(ii) For any A ∈ H, the map ω ↦ K^do(U,Q,hard)_V(ω, A) is H_V-measurable and hence H_U-measurable, so for any B ∈ H_V ⊆ H_U,

∫_B K^do(U,Q,hard)_V(ω_V, A) P^do(U,Q)(dω_V) = ∫_B K^do(U,Q,hard)_V(ω_V, A) Q(dω_V)   (by the above fact)
= ∫ K^do(U,Q,hard)_V(ω_V, A ∩ B) Q(dω_V)   (by Axiom 2.2(ii))
= ∫∫ Q(dω'_{U\V}) K_U((ω_V, ω'_{U\V}), A ∩ B) Q(dω_V)
= ∫ K_U(ω_U, A ∩ B) Q(dω_U)
= P^do(U,Q)(A ∩ B)
= ∫_B 1_A(ω) P^do(U,Q)(dω),

where, in going from the second line to the third, we used Theorem C.3, and in going from the third line to the fourth, we used the hypothesis that Q factorises over H_V and H_{U\V}, meaning Q(dω'_{U\V}) Q(dω_V) = Q(dω_U). So K^do(U,Q,hard)_V(·, A) is indeed a version of the conditional probability P^do(U,Q)_{H_V}(A), which means that H_V is a global source of (Ω, H, P^do(U,Q), K^do(U,Q,hard)).

Lemma C.4. Let (Ω, H, P, K) = (∏_{t∈T} E_t, ⊗_{t∈T} E_t, P, K) be a causal space, U ∈ P(T), Q a probability measure on (Ω, H_U) and L = {L_V : V ∈ P(U)} a causal mechanism on (Ω, H_U, Q). Suppose we intervene on H_U via (Q, L).

(i) For A ∈ H_U and V ∈ P(T) with V ∩ U = ∅, H_V has no causal effect on A (cf. Definition B.1(i)) in the intervention causal space (Ω, H, P^do(U,Q), K^do(U,Q,L)), i.e. events in the σ-algebra H_U on which the intervention took place are not causally affected by σ-algebras outside H_U.

(ii) Again, let V ∈ P(T) with V ∩ U = ∅, and also let A ∈ H be any event. If, in the original causal space, H_V had no causal effect on A, then in the intervention causal space, H_V has no causal effect on A either.

(iii) Now let V ∈ P(T), let A ∈ H be any event, and suppose that the intervention on H_U via Q is hard.
Then if H_V had no causal effect on A in the original causal space, H_V has no causal effect on A in the intervention causal space either.

Proof. (i) Take any S ∈ P(T). See that

K^do(U,Q,L)_S(ω, A) = ∫ L_{S∩U}(ω_{S∩U}, dω'_U) K_{S∪U}((ω_{S\U}, ω'_U), A)
= ∫ L_{S∩U}(ω_{S∩U}, dω'_U) 1_A(ω'_U)
= ∫ L_{S∩U}(ω_{S∩U}, dω'_U) K_{(S\V)∪U}((ω_{(S\V)\U}, ω'_U), A)
= ∫ L_{(S\V)∩U}(ω_{(S\V)∩U}, dω'_U) K_{(S\V)∪U}((ω_{(S\V)\U}, ω'_U), A)
= K^do(U,Q,L)_{S\V}(ω, A),

where, in going from the first line to the second and from the second line to the third, we used the fact that A ∈ H_U, and in going from the third line to the fourth, we applied the fact that (S\V) ∩ U = S ∩ U, since V ∩ U = ∅. Since S ∈ P(T) was arbitrary, H_V has no causal effect on A in the intervention causal space.

(ii) Take any S ∈ P(T). See that

K^do(U,Q,L)_S(ω, A) = ∫ L_{S∩U}(ω_{S∩U}, dω'_U) K_{S∪U}((ω_{S\U}, ω'_U), A)
= ∫ L_{S∩U}(ω_{S∩U}, dω'_U) K_{(S∪U)\V}((ω_{(S\V)\U}, ω'_U), A)
= ∫ L_{(S\V)∩U}(ω_{(S\V)∩U}, dω'_U) K_{(S\V)∪U}((ω_{(S\V)\U}, ω'_U), A)
= K^do(U,Q,L)_{S\V}(ω, A),

where, in going from the first line to the second, we used the fact that H_V has no causal effect on A in the original causal space, and in going from the second line to the third, we used U ∩ V = ∅, which gives us S ∩ U = (S\V) ∩ U and (S∪U)\V = (S\V)∪U. Since S ∈ P(T) was arbitrary, H_V has no causal effect on A in the intervention causal space.

(iii) Take any S ∈ P(T). Apply Theorem C.3 to see that

K^do(U,Q,hard)_S(ω, A) = ∫ Q(dω'_{U\S}) K_{S∪U}((ω_S, ω'_{U\S}), A)
= ∫ Q(dω'_{U\S}) K_{(S∪U)\V}((ω_S, ω'_{U\S}), A)   (Def. B.1(i))
= ∫ Q(dω'_{U\S}) K_{((S\V)∪U)\V}((ω_S, ω'_{U\S}), A)
= ∫ Q(dω'_{U\S}) K_{(S\V)∪U}((ω_S, ω'_{U\S}), A)   (Def. B.1(i))
= ∫ Q(dω'_{U\(S\V)}) K_{(S\V)∪U}((ω_{S\V}, ω'_{U\(S\V)}), A)
= K^do(U,Q,hard)_{S\V}(ω, A),

where, in going from the second line to the third, we used that (S∪U)\V = ((S\V)∪U)\V. Since S ∈ P(T) was arbitrary, H_V has no causal effect on A in the intervention causal space.

Lemma C.5. Let (Ω, H, P, K) = (∏_{t∈T} E_t, ⊗_{t∈T} E_t, P, K) be a causal space, and U ∈ P(T).
For an event A ∈ H, if H_U has a dormant causal effect on A in the original causal space, then there exist a hard intervention and a subset V ⊆ U such that, in the intervention causal space, H_V has an active causal effect on A.

Proof. That H_U has a dormant causal effect on A tells us that K_U(ω, A) = P(A) for all ω ∈ Ω, but that there exist some S ∈ P(T) and some ω₀ ∈ Ω such that K_S(ω₀, A) ≠ K_{S\U}(ω₀, A). We must have S ∩ U ≠ ∅, since otherwise S\U = S and we cannot possibly have K_S(ω₀, A) ≠ K_{S\U}(ω₀, A). We then hard-intervene on H_{S\U} with the Dirac measure δ_{ω₀}. Apply Theorem C.3 to see that

K^do(S\U, δ_{ω₀}, hard)_{S∩U}((ω₀)_{S∩U}, A) = ∫ δ_{ω₀}(dω'_{S\U}) K_S(((ω₀)_{S∩U}, ω'_{S\U}), A) = K_S(ω₀, A) ≠ K_{S\U}(ω₀, A).

Note that the intervention measure of A is equal to K_{S\U}(ω₀, A):

P^do(S\U, δ_{ω₀})(A) = ∫ δ_{ω₀}(dω') K_{S\U}(ω', A) = K_{S\U}(ω₀, A).

Putting these together, we have

K^do(S\U, δ_{ω₀}, hard)_{S∩U}(ω₀, A) ≠ P^do(S\U, δ_{ω₀})(A),

i.e. in the intervention causal space (Ω, H, P^do(S\U, δ_{ω₀}), K^do(S\U, δ_{ω₀}, hard)), H_{S∩U} has an active causal effect on A. Taking V = S ∩ U ⊆ U completes the proof.

Lemma C.6. Let (Ω, H, P, K) = (∏_{t∈T} E_t, ⊗_{t∈T} E_t, P, K) be a causal space, and U, V ∈ P(T). For an event A ∈ H, suppose that H_U has no causal effect on A given H_V (see Definition B.4). Then after an intervention on H_V via any (Q, L), H_{U\V} has no causal effect on A.

Proof. Take any probability measure Q on (Ω, H_V) and any causal mechanism L on (Ω, H_V, Q). Then see that, for any S ∈ P(T) and all ω ∈ Ω,

K^do(V,Q,L)_S(ω, A) = ∫ L_{S∩V}(ω_{S∩V}, dω'_V) K_{S∪V}((ω_{S\V}, ω'_V), A)
= ∫ L_{S∩V}(ω_{S∩V}, dω'_V) K_{(S∪V)\(U\V)}((ω_{S\(U∪V)}, ω'_V), A)
= ∫ L_{(S\(U\V))∩V}(ω_{(S\(U\V))∩V}, dω'_V) K_{(S\(U\V))∪V}((ω_{S\(U∪V)}, ω'_V), A)
= K^do(V,Q,L)_{S\(U\V)}(ω, A),

where, in going from the first line to the second, we used the fact that H_U has no causal effect on A given H_V, and in going from the second line to the third, we used the identities S ∩ V = (S\(U\V)) ∩ V and (S∪V)\(U\V) = (S\(U\V))∪V. Since S ∈ P(T) was arbitrary, we have that H_{U\V} has no causal effect on A in the intervention causal space.
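The dormant-to-active construction in the proof of Lemma C.5 can be illustrated on a finite XOR example of our own (not from the paper): two independent fair coins with their parity as a third coordinate. The first coin is dormantly causal to the parity event, and the hard intervention from the proof makes it actively causal.

```python
from itertools import product

# Toy illustration of Lemma C.5 (our own finite example, not the
# paper's): Ω = {0,1}³ with coordinates (a, b, c), where a and b are
# independent fair coins and c = a XOR b causally.  H_a has a *dormant*
# causal effect on A = {c = 1}; hard-intervening on H_b makes it active.

OMEGA = [w for w in product([0, 1], repeat=3)]
A = {w for w in OMEGA if w[2] == 1}          # the event {c = 1}

def K(S, omega, event):
    """Causal kernel K_S(ω, ·), S ⊆ {0,1,2} indexing (a, b, c): pinned
    coordinates follow ω; free a, b are fair coins, free c = a XOR b."""
    total = 0.0
    for a in (0, 1):
        for b in (0, 1):
            pa = (1.0 if a == omega[0] else 0.0) if 0 in S else 0.5
            pb = (1.0 if b == omega[1] else 0.0) if 1 in S else 0.5
            c = omega[2] if 2 in S else a ^ b
            if (a, b, c) in event:
                total += pa * pb
    return total

# Dormant (Definition B.1(iii)): K_a(ω, A) = P(A) = 1/2 for every ω ...
assert all(K({0}, w, A) == K(set(), w, A) == 0.5 for w in OMEGA)
# ... yet H_a is causal to A: with S = {a, b} and ω₀ = (0, 1, 0),
# K_S(ω₀, A) ≠ K_{S\{a}}(ω₀, A).
w0 = (0, 1, 0)
assert K({0, 1}, w0, A) == 1.0 and K({1}, w0, A) == 0.5

# Hard intervention on H_{S\U} = H_b via the Dirac measure at ω₀ (the
# construction in the proof), computed through Theorem C.3: K^do_S(ω,·)
# equals K_{S∪{b}} with the b-coordinate pinned to ω₀_b when b ∉ S.
def K_do(S, omega, event):
    w = list(omega)
    if 1 not in S:
        w[1] = w0[1]
    return K(S | {1}, tuple(w), event)

# In the intervened space, H_a now has an *active* causal effect on A:
P_do_A = K_do(set(), w0, A)                  # intervention measure of A
assert P_do_A == 0.5
assert K_do({0}, (0, 0, 0), A) == 1.0        # forcing a = 0 gives c = 1
assert K_do({0}, (1, 0, 0), A) == 0.0        # forcing a = 1 gives c = 0
```

In the notation of the proof, U = {a}, S = {a, b}, and the actively causal subset is V = S ∩ U = {a}.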
Theorem C.7. Let (Ω, H, P, K) = (∏_{t∈T} E_t, ⊗_{t∈T} E_t, P, K) be a causal space, where the index set can be written as T = W × T̃, with W representing time and K respecting time. Take any U ∈ P(T) and any probability measure Q on H_U. Then the intervention causal mechanism K^do(U,Q,hard) also respects time.

Proof. Take any w₁, w₂ ∈ W with w₁ < w₂, and write T_w = {w} × T̃ for w ∈ W. Since K respects time, we have that H_{T_{w₂}} has no causal effect on H_{T_{w₁}} in the original causal space. To show that H_{T_{w₂}} has no causal effect on H_{T_{w₁}} after a hard intervention on H_U via Q, take any S ∈ P(T) and any event A ∈ H_{T_{w₁}}. Then, using Theorem C.3,

K^do(U,Q,hard)_S(ω, A) = ∫ Q(dω'_{U\S}) K_{S∪U}((ω_S, ω'_{U\S}), A)
= ∫ Q(dω') K_{(S∪U)\T_{w₂}}((ω_{S\T_{w₂}}, ω'_{U\(S∪T_{w₂})}), A)
= ∫ Q(dω') K_{((S∪U)\T_{w₂})∪(U∩T_{w₂})}((ω_{S\T_{w₂}}, ω'_{(U\(S∪T_{w₂}))∪(U∩T_{w₂})}), A)
= ∫ Q(dω'_{U\(S\T_{w₂})}) K_{(S\T_{w₂})∪U}((ω_{S\T_{w₂}}, ω'_{U\(S\T_{w₂})}), A)
= K^do(U,Q,hard)_{S\T_{w₂}}(ω, A),

where, from the first line to the second, we used the fact that H_{T_{w₂}} has no causal effect on A; from the second line to the third, the fact that H_{U∩T_{w₂}} has no causal effect on A (by Remark B.2(e)) and Remark B.2(g); and from the third line to the fourth, the identities ((S∪U)\T_{w₂}) ∪ (U∩T_{w₂}) = (S\T_{w₂}) ∪ U and (U\(S∪T_{w₂})) ∪ (U∩T_{w₂}) = U\(S\T_{w₂}). Since S ∈ P(T) was arbitrary, H_{T_{w₂}} has no causal effect on A (Definition B.1(i)). Since A ∈ H_{T_{w₁}} was arbitrary, H_{T_{w₂}} has no causal effect on H_{T_{w₁}}, and so K^do(U,Q,hard) respects time.
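The statement of Theorem C.7 can be checked numerically on a finite two-step process; the sketch below is our own toy example, not from the paper. It enumerates every subset S, every outcome ω and several hard interventions, and confirms that the time-2 coordinate never acquires a causal effect on a time-1 event.

```python
from itertools import product

# Numerical check of Theorem C.7 on a toy two-step process (our own
# finite example): coordinate 0 lives at time 1 and coordinate 1 at
# time 2, with x2 copying x1 with probability 0.9, so the mechanism
# respects time (the future never affects the past).

OMEGA = [w for w in product([0, 1], repeat=2)]
PAST = {(1, 0), (1, 1)}                      # event {x1 = 1} at time 1

def K(S, omega, event):
    """Time-respecting kernels: pinned coordinates follow ω; a free x1
    is a fair coin and a free x2 copies x1 with probability 0.9."""
    total = 0.0
    for x1 in (0, 1):
        p1 = (1.0 if x1 == omega[0] else 0.0) if 0 in S else 0.5
        for x2 in (0, 1):
            if 1 in S:
                p2 = 1.0 if x2 == omega[1] else 0.0
            else:
                p2 = 0.9 if x2 == x1 else 0.1
            if (x1, x2) in event:
                total += p1 * p2
    return total

def K_do_hard(U, Q, S, omega, event):
    """Theorem C.3 on a finite space: K^do(U,Q,hard)_S(ω, ·) is the
    Q-average of K_{S∪U} over the U\S coordinates.  Q is a list of
    (assignment, probability) pairs over the full U-coordinates."""
    total = 0.0
    for assign, q in Q:
        w = list(omega)
        for i in U - S:                      # only the U\S marginal matters
            w[i] = assign[i]
        total += q * K(S | U, tuple(w), event)
    return total

# After every hard intervention, the time-2 coordinate still has no
# causal effect on the time-1 event: K^do_S(ω, PAST) = K^do_{S\{1}}(ω, PAST).
subsets = [set(), {0}, {1}, {0, 1}]
for U in [{0}, {1}, {0, 1}]:
    Q = [({i: 1 for i in U}, 0.3), ({i: 0 for i in U}, 0.7)]  # a generic Q
    for S in subsets:
        for w in OMEGA:
            lhs = K_do_hard(U, Q, S, w, PAST)
            rhs = K_do_hard(U, Q, S - {1}, w, PAST)
            assert abs(lhs - rhs) < 1e-12
```

The enumeration mirrors Definition B.1(i) applied in the intervened space: no causal effect means equality of kernels for every S and every ω, which is exactly what the nested loops verify.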