# A Topological Perspective on Causal Inference

Duligur Ibeling, Department of Computer Science, Stanford University, duligur@stanford.edu

Thomas Icard, Department of Philosophy, Stanford University, icard@stanford.edu

This paper presents a topological learning-theoretic perspective on causal inference by introducing a series of topologies defined on general spaces of structural causal models (SCMs). As an illustration of the framework we prove a topological causal hierarchy theorem, showing that substantive assumption-free causal inference is possible only in a meager set of SCMs. Thanks to a known correspondence between open sets in the weak topology and statistically verifiable hypotheses, our results show that inductive assumptions sufficient to license valid causal inferences are statistically unverifiable in principle. Similar to no-free-lunch theorems for statistical inference, the present results clarify the inevitability of substantial assumptions for causal inference. An additional benefit of our topological approach is that it easily accommodates SCMs with infinitely many variables. We finally suggest that the framework may be helpful for the positive project of exploring and assessing alternative causal-inductive assumptions.

## 1 Introduction and Motivation

In the background of any investigation into learning algorithms are no-free-lunch phenomena: roughly, the observation that assumption-free statistical learning is infeasible in general (see, e.g., [33, Ch. 5] for a formal statement). Common wisdom is that learning algorithms and architectures must adequately reflect non-trivial features of the data-generating distribution to gain inductive purchase. For many purposes we need to move beyond passive observation, focusing instead on what would happen were we to act upon a given system.
Even further, we sometimes desire to explain the behavior of a system, raising questions about what would have occurred had some aspects of a situation been different. Such questions depend not just on the data distribution; they depend on deeper features of underlying data-generating processes or mechanisms. It is thus generally acknowledged that stronger assumptions are required if we want to draw causal conclusions from data [35, 28, 20, 30, 32].

Whether implicit or explicit, any approach to causal inference involves a space of candidate causal models, viz. data-generating processes. Indeed, a blunt way of incorporating inductive bias is simply to omit some class of possible causal hypotheses from consideration. Many (im)possibility results in the literature can accordingly be understood as pertaining to all models within a class. For instance, if we can restrict attention to Markovian models that satisfy faithfulness, then we can always identify the structure of a model from experimental data (e.g., [11, 35]). If we can restrict attention to Markovian (continuous) models with linear functions and non-Gaussian noise, then every model can be learned even from purely observational data [34]. As a negative example, in the larger class of (not necessarily Markovian) models, no model can ever be determined from observational data alone [35, 2].

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

At the same time, in many settings it is sensible to aim for results with nearly universal force. It is natural to ask, e.g., within the class of all Markovian models, how typical are those in which the faithfulness condition is violated? This might tell us, for instance, how typically we could expect failure of a method that depended on these assumptions. A well-known result shows that, fixing
any particular causal dependence graph, such violations have measure zero for any smooth (e.g., Lebesgue) measure on the parameter space of distributions consistent with that graph [24]. In fact, the standard notion of statistical consistency itself, which underlies many possibility results in causal inference, requires omission of some purportedly negligible set of possible data streams [9, 35].

There are two standard mathematical approaches to making concepts like "typical" and "negligible" rigorous: measure-theoretic and topological. While the two approaches often agree, they capture slightly different intuitions [25]. One virtue of the measure-theoretic approach is its natural probabilistic interpretation: intuitively, we are exceedingly unlikely to hit upon a set with measure zero. At the same time, the measure-theoretic approach is sometimes criticized in statistical settings for its alleged dependence on a measure, and this has been argued to favor topological approaches (see, e.g., [3] on no-free-lunch theorems). The latter of course in turn demands an appropriate topology.

In the present work we show how to define a sequence of meaningful topologies on the space of causal models, each corresponding to a progressively coarser level of the so-called causal hierarchy ([29, 2]; see Fig. 1 for an abbreviated pictorial summary). We aim to demonstrate that topologizing causal models in this way helps clarify the scope and limits of causal inference under different assumptions, as well as the potential empirical status of those very assumptions, in a highly general setting.

Our starting point is a canonical topology on the space of Borel probability distributions called the weak topology. The weak topology is grounded in the fundamental notion of weak convergence of probability distributions [4] and is thereby closely related to problems of statistical inference (see, e.g., [8]).
Recent work has sharpened this correspondence, showing that open sets in the weak topology correspond exactly to the statistical hypotheses that can be naturally deemed verifiable [14, 16]. We extend the correspondence to higher levels of the causal hierarchy, including the most refined and expansive top level consisting of all (well-founded) causal models. Lower levels and natural subspaces (e.g., corresponding to prominent causal assumption classes) emerge as coarsenings and continuous projections of this largest space.

As an illustration of the general approach, we prove a topological version of the causal hierarchy theorem from [2]. Rather than showing that collapse happens only in a measure zero set as in [2], our Theorem 3 shows that collapse is topologically meager. Conceptually, this highlights a different (but complementary) intuition: not only is collapse exceedingly unlikely in the sense of measure, meagerness implies that collapse could never be statistically verified. Correlatively, this implies that any causal assumption that would generally allow us to infer counterfactual probabilities from experimental (or interventional) probabilities must itself be statistically unverifiable (Corollary 1).

To derive such a result we actually show something slightly stronger (see Lem. 2): even with respect to the subspace of models consistent with a fixed temporal order on variables, the causal hierarchy theorem holds. Merely knowing the temporal order of the variables is not enough to render collapse of the hierarchy a statistically verifiable proposition. Furthermore, we show that the witness to collapse can be taken as any of the well-known counterfactual probabilities of causation (see, e.g., [27]): probabilities of necessity, sufficiency, necessity and sufficiency, enablement, or disablement. That is, none of these important quantities is fully determined by experimental data except in a meager set.
In §2 we give background on causal models, and in §3 we present a model-theoretic characterization of the causal hierarchy as a sequence of spaces. Topology is introduced in §4, and the main results about collapse appear in §5. For the technical results, we include proof sketches in the main text to provide the core intuitions, relegating some of the details to an exhaustive technical appendix, which also includes additional supplementary material.

## 2 Structural Causal Models

A fundamental building block in the theory of causality is the structural causal model [26, 35, 28] or SCM, which formalizes the notion of a data-generating process. In addition to specifying data-generating distributions, these models also specify the generative mechanisms that produce them. For the purpose of causal inference and learning, SCMs provide a broad, fine-grained hypothesis space. The notions in this section have their usual definition following, e.g., [28], but we have recast them in the standard language of Borel probability spaces so as to handle the case of infinitely many variables rigorously. We start with notation, basic assumptions, and some probability theory.

Notation. The signature (or range) of a variable $V$ is denoted $\chi_V$. Where $\mathbf{S}$ is a set of variables, let $\chi_{\mathbf{S}} = \prod_{S \in \mathbf{S}} \chi_S$. Given an indexed family of sets $\{S_\beta\}_{\beta \in B}$ and elements $s_\beta \in S_\beta$, let $(s_\beta)_\beta$ denote the tuple whose element at index $\beta$ is $s_\beta$, for all $\beta$. For $B' \subseteq B$ write $\pi_{B'} : \prod_{\beta \in B} S_\beta \to \prod_{\beta \in B'} S_\beta$ for the projection map sending each $(s_\beta)_{\beta \in B} \mapsto (s_{\beta'})_{\beta' \in B'}$; abbreviate $\pi_{\beta'} = \pi_{\{\beta'\}}$, where $\beta' \in B$. The reader is referred to standard texts [21, 5] for elaboration on the concepts used below.

Definition 1 (Topology). For discrete spaces (like $\chi_S$, for a single categorical variable $S$) we use the discrete topology and for product spaces (like $\chi_{\mathbf{S}}$ for a set of variables $\mathbf{S}$) we use the product topology.

Note that the so-called cylinder sets of the form $\pi_{\mathbf{Y}}^{-1}(\{\mathbf{y}\})$ for finite subsets $\mathbf{Y} \subseteq \mathbf{S}$ and $\mathbf{y} \in \chi_{\mathbf{Y}}$ form a basis for the product topology on $\chi_{\mathbf{S}}$.
This cylinder set is a subset of $\chi_{\mathbf{S}}$, and contains exactly those valuations agreeing with the value $\pi_Y(\mathbf{y})$ specified in $\mathbf{y}$ for $Y$, for every $Y \in \mathbf{Y}$. Following standard statistical notation this cylinder is abbreviated as simply $\mathbf{y}$.

Definition 2 (Probability). Where $\vartheta$ is a topological space write $\mathcal{B}(\vartheta)$ for its Borel $\sigma$-algebra of measurable subsets. Let $\mathcal{P}(\vartheta)$ be the set of probability measures on $\mathcal{B}(\vartheta)$. Specifically, elements of $\mathcal{P}(\vartheta)$ are functions $\mu : \mathcal{B}(\vartheta) \to [0, 1]$ assigning a probability to each measurable set such that $\mu(\vartheta) = 1$ and $\mu\big(\bigcup_{i=1}^{\infty} S_i\big) = \sum_{i=1}^{\infty} \mu(S_i)$ for each sequence $S_1, S_2, \dots$ of pairwise disjoint sets from $\mathcal{B}(\vartheta)$. A map $f : \vartheta_1 \to \vartheta_2$ is said to be measurable if $f^{-1}(S_2) \in \mathcal{B}(\vartheta_1)$ for every $S_2 \in \mathcal{B}(\vartheta_2)$.

Fact 1 (Lemma 1.9.4 [5]). A Borel probability measure is determined by its values on a basis.

### 2.1 SCMs, Observational Distributions

Let $\mathbf{V}$ be a set of endogenous variables. We assume for simplicity every variable $V \in \mathbf{V}$ is dichotomous with $\chi_V = \{0, 1\}$, although the results here generalize to any larger countable range.

Influences among endogenous variables are the main phenomena our formalism aims to capture. A well-founded¹ direct influence relation $\to$ on $\mathbf{V}$ encapsulates the notion of one endogenous variable possibly influencing another. For each $V \in \mathbf{V}$, we call $\{V' \in \mathbf{V} : V' \to V\} = \mathrm{Pa}(V)$ the parents of $V$. We assume every set $\mathrm{Pa}(V)$ is finite; this condition is called local finiteness. These two assumptions (well-foundedness and local finiteness) generalize the common recursiveness assumption to the infinitary setting, and have an alternative characterization in terms of temporal orderings:

Fact 2. Say that a total order $\prec$ on $\mathbf{V}$ is $\omega$-like if every node has finitely many predecessors: for each $V \in \mathbf{V}$, the set $\{V' : V' \prec V\}$ is finite. Then the influence relation $\to$ is extendible to an $\omega$-like order iff $\to$ is well-founded and locally finite.

In addition to endogenous variables, causal models have exogenous variables $\mathbf{U}$.
Each endogenous $V$ depends on a subset $\mathbf{U}(V) \subseteq \mathbf{U}$ of exogenous parents and uncertainty enters via exogenous noise, that is, a distribution from $\mathcal{P}(\chi_{\mathbf{U}})$. A structural function (or mechanism) for $V \in \mathbf{V}$ is a measurable $f_V : \chi_{\mathrm{Pa}(V)} \times \chi_{\mathbf{U}(V)} \to \chi_V$ mapping parental endogenous and exogenous valuations to values.

Definition 3. A structural causal model is a tuple $\mathcal{M} = \langle \mathbf{U}, \mathbf{V}, \{f_V\}_{V \in \mathbf{V}}, P \rangle$ where $\mathbf{U}$ is a collection of exogenous variables, $\mathbf{V}$ is a collection of endogenous variables, $f_V$ is a structural function for each $V \in \mathbf{V}$, and $P \in \mathcal{P}(\chi_{\mathbf{U}})$ is a probability measure on (the Borel $\sigma$-algebra of) $\chi_{\mathbf{U}}$.

As is well known, recursiveness implies that each $\mathbf{u} \in \chi_{\mathbf{U}}$ induces a unique $\mathbf{v} \in \chi_{\mathbf{V}}$ that solves the simultaneous system of structural equations $\{V = f_V\}_V$:

Proposition 1. Any SCM $\mathcal{M}$ with well-founded, locally finite parent relation $\to$ induces a unique measurable $m^{\mathcal{M}} : \chi_{\mathbf{U}} \to \chi_{\mathbf{V}}$ such that $f_V\big(\pi_{\mathrm{Pa}(V)}(m^{\mathcal{M}}(\mathbf{u})), \pi_{\mathbf{U}(V)}(\mathbf{u})\big) = \pi_V\big(m^{\mathcal{M}}(\mathbf{u})\big)$ for all $\mathbf{u} \in \chi_{\mathbf{U}}$ and $V \in \mathbf{V}$.

Measurability then entails that the exogenous noise $P$ induces a distribution on joint valuations of $\mathbf{V}$, called the observational distribution, which characterizes passive observations of the system.

Definition 4. The observational distribution $p^{\mathcal{M}} \in \mathcal{P}(\chi_{\mathbf{V}})$ is defined on open sets by $p^{\mathcal{M}}(\mathbf{y}) = P\big((m^{\mathcal{M}})^{-1}(\mathbf{y})\big)$. Here recall that $\mathbf{y}$ represents a cylinder subset (Definition 1) of $\chi_{\mathbf{V}}$.

¹ See Appendix A for additional background on orders and relations.

### 2.2 Interventions

What makes SCMs distinctively causal is the way they accommodate statements about possible manipulations of a causal setup capturing, e.g., observations resulting from a controlled experimental trial. This is formalized in the following definition.

Definition 5. An intervention is a choice of a finite subset of variables $\mathbf{W} \subseteq \mathbf{V}$ and $\mathbf{w} \in \chi_{\mathbf{W}}$. This intervention is written $\mathbf{W} := \mathbf{w}$, and we let $A$ be the set of all interventions. Under this intervention, each $W \in \mathbf{W}$ is held fixed to its value $\pi_W(\mathbf{w}) \in \chi_W$ in $\mathbf{w}$ while the mechanism for any $V \in \mathbf{V} \setminus \mathbf{W}$ is left unchanged.
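Specifically, where $\mathcal{M}$ is as in Definition 3, the manipulated model for $\mathbf{W} := \mathbf{w}$ is the model $\mathcal{M}_{\mathbf{W}:=\mathbf{w}} = \langle \mathbf{U}, \mathbf{V}, \{f_V^{\mathbf{W}:=\mathbf{w}}\}_{V \in \mathbf{V}}, P \rangle$ where

$$f_V^{\mathbf{W}:=\mathbf{w}} = \begin{cases} f_V, & V \notin \mathbf{W} \\ \text{constant function mapping to } \pi_V(\mathbf{w}), & V \in \mathbf{W}. \end{cases}$$

The interventional or experimental distribution $p^{\mathcal{M}_{\mathbf{W}:=\mathbf{w}}} \in \mathcal{P}(\chi_{\mathbf{V}})$ is just the observational distribution for the manipulated model $\mathcal{M}_{\mathbf{W}:=\mathbf{w}}$, and it encodes the probabilities for an experiment in which the variables $\mathbf{W}$ are fixed to the values $\mathbf{w}$.

Remark 1. Empty interventions $\emptyset := ()$ are just passive observations, i.e., $p^{\mathcal{M}_{\emptyset:=()}} = p^{\mathcal{M}}$.

### 2.3 Counterfactuals

By permitting multiple manipulated settings to share exogenous noise, we can consider not only the distribution arising from a single manipulation, but also joint distributions over several manipulations. These are often called counterfactuals. The set $\mathcal{P}(\chi_{A \times \mathbf{V}})$ encompasses the combined joint distributions over $\mathbf{V}$ for any combination of interventions from $A$. A basis for the space $\chi_{A \times \mathbf{V}}$ is given by the cylinder sets of the following form, for some sequence $(\mathbf{X} := \mathbf{x}, \mathbf{Y}), \dots, (\mathbf{W} := \mathbf{w}, \mathbf{Z})$ of pairs, where $\mathbf{Y}, \dots, \mathbf{Z} \subseteq \mathbf{V}$ are finite, and $\mathbf{X} := \mathbf{x}, \dots, \mathbf{W} := \mathbf{w} \in A$ are interventions:

$$\pi^{-1}_{\{\mathbf{X}:=\mathbf{x}\} \times \mathbf{Y}}(\{\mathbf{y}\}) \cap \dots \cap \pi^{-1}_{\{\mathbf{W}:=\mathbf{w}\} \times \mathbf{Z}}(\{\mathbf{z}\}).$$

We will abbreviate this open set as $\mathbf{y}_{\mathbf{x}}, \dots, \mathbf{z}_{\mathbf{w}}$, writing, e.g., simply $\mathbf{x}$ for the intervention $\mathbf{X} = \mathbf{x}$.

Definition 6. Given $\mathcal{M}$, define a counterfactual distribution $p^{\mathcal{M}}_{\mathrm{cf}} \in \mathcal{P}(\chi_{A \times \mathbf{V}})$ on a basis as follows:

$$p^{\mathcal{M}}_{\mathrm{cf}}(\mathbf{y}_{\mathbf{x}}, \dots, \mathbf{z}_{\mathbf{w}}) = P\big((m^{\mathcal{M}_{\mathbf{X}:=\mathbf{x}}})^{-1}(\mathbf{y}) \cap \dots \cap (m^{\mathcal{M}_{\mathbf{W}:=\mathbf{w}}})^{-1}(\mathbf{z})\big).$$

Here, the letters $\mathbf{y}, \dots, \mathbf{z}$ on the right-hand side abbreviate the respective cylinder sets (Definition 1) $\pi_{\mathbf{Y}}^{-1}(\{\mathbf{y}\}), \dots, \pi_{\mathbf{Z}}^{-1}(\{\mathbf{z}\})$.

Remark 2. Marginalizing $p^{\mathcal{M}}_{\mathrm{cf}}$ to any single intervention $\mathbf{W} := \mathbf{w}$ yields $p^{\mathcal{M}_{\mathbf{W}:=\mathbf{w}}}$. If $\chi_{\mathbf{U}}$ is finite, we obtain a familiar [13] sum formula

$$p^{\mathcal{M}}_{\mathrm{cf}}(\mathbf{y}_{\mathbf{x}}, \dots, \mathbf{z}_{\mathbf{w}}) = \sum_{\{\mathbf{u} \,\mid\, m^{\mathcal{M}_{\mathbf{X}:=\mathbf{x}}}(\mathbf{u}) \in \mathbf{y}, \dots, m^{\mathcal{M}_{\mathbf{W}:=\mathbf{w}}}(\mathbf{u}) \in \mathbf{z}\}} P(\mathbf{u}).$$

Example 1. As a very simple example (drawn from [28, 2]), just to illustrate the previous definitions and notation, consider a scenario with two binary exogenous variables $\mathbf{U} = \{U_1, U_2\}$ and two binary endogenous variables $\mathbf{V} = \{X, Y\}$.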
Let $U_1, U_2$ both be uniformly distributed, and define $f_X : \chi_{U_1} \to \chi_X$ to be the identity, and $f_Y : \chi_X \times \chi_{U_2} \to \chi_Y$ by $f_Y(u, x) = ux + (1 - u)(1 - x)$. This fully defines an SCM $\mathcal{M}$ with influence $X \to Y$, and produces an observational distribution $p^{\mathcal{M}}$ such that $p^{\mathcal{M}}(x, y) = 1/4$ for all four settings $X = x$, $Y = y$.

The space $A$ of interventions in this example includes the empty intervention and all combinations of $X := x$ and $Y := y$, with $x, y \in \{0, 1\}$. Notably, all interventional distributions here collapse to observational distributions, e.g., $p^{\mathcal{M}_{X:=x}}(X, Y) = p^{\mathcal{M}}(X, Y)$, for both values of $x$. Thus, experimental manipulations of this system reveal little interesting causal structure. The counterfactual distribution $p^{\mathcal{M}}_{\mathrm{cf}}$, however, does not trivialize. For instance, $p^{\mathcal{M}}_{\mathrm{cf}}\big((X := 1, Y = 1), (X := 0, Y = 0)\big) = 1/2$. This term is known as the probability of necessity and sufficiency [27], which we can abbreviate by $p^{\mathcal{M}}_{\mathrm{cf}}(y_x, y'_{x'})$. Note that $p^{\mathcal{M}}_{\mathrm{cf}}(y_x, y'_{x'}) \neq p^{\mathcal{M}}_{\mathrm{cf}}(y_x)\, p^{\mathcal{M}}_{\mathrm{cf}}(y'_{x'}) = 1/4$. Similarly, $p^{\mathcal{M}}_{\mathrm{cf}}(y'_x, y_{x'}) = 1/2$.

### 2.4 SCM classes

We now define several subclasses of SCMs that we will use throughout the paper. Notably, we do not require their endogenous variable sets $\mathbf{V}$ to be finite. The set $\mathbf{V}$ is infinite in many applications, for instance, in time series models, or generative models defined by probabilistic programs (see, e.g., [18, 36]). Because the proofs call for slightly different methods, we deal with the infinite and finite cases separately. We make one additional assumption in the infinite case.

Definition 7. $\mu \in \mathcal{P}(\vartheta)$ is atomless if $\mu(\{t\}) = 0$ for each $t \in \vartheta$; $\mathcal{M}$ is atomless if $p^{\mathcal{M}}_{\mathrm{cf}}$ is atomless.

Intuitively, an atomless distribution is one in which weight is always smeared out continuously and there are no point masses; infinitely many fair coin flips, for example, generate an atomless distribution as the probability of obtaining any given infinite sequence is zero.

Definition 8. For the remainder of the paper, fix a countable endogenous variable set $\mathbf{V}$.
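Define the following classes of SCMs:

$\mathfrak{M}_{\prec}$ = SCMs over $\mathbf{V}$ whose influence relation $\to$ is extendible to the $\omega$-like order $\prec$;

$\mathfrak{M}_X$ = SCMs over $\mathbf{V}$ in which the variable $X$ has no parents: $\mathrm{Pa}(X) = \emptyset$;

$\mathfrak{M}$ = all SCMs over $\mathbf{V}$ = $\bigcup_{\prec} \mathfrak{M}_{\prec}$.

If $\mathbf{V}$ is infinite then all SCMs in the classes above are assumed to be atomless.

## 3 The Causal Hierarchy

Implicit in §2, and indeed in much of the literature on causal inference, is a hierarchy of causal expressivity. Following the metaphor offered in [29], it is natural to characterize three levels of the hierarchy as the observational, interventional (experimental), and counterfactual (explanatory). Drawing on recent work [2, 19] we make this characterization explicit. The levels will be defined in descending order of causal expressivity (the reverse of §2). Fig. 1(a) summarizes our definitions.

Higher levels determine lower levels: counterfactuals determine interventionals, and the observational is just an (empty) interventional. Thus movement downward in the causal hierarchy corresponds to a kind of projection. For indexed $\{S_\beta\}_{\beta \in B}$ and $B' \subseteq B$ let $\varsigma_{B'} : \mathcal{P}\big(\prod_{\beta \in B} S_\beta\big) \to \mathcal{P}\big(\prod_{\beta \in B'} S_\beta\big)$ be the marginalization map taking a joint distribution to its marginal on $B'$.

Definition 9. Define three composable causal projections $\{\varpi_i\}_{1 \le i \le 3}$ with signatures and definitions

$$\varpi_3 : \mathfrak{M} \to \mathcal{P}(\chi_{A \times \mathbf{V}}), \qquad \varpi_2 : \mathcal{P}(\chi_{A \times \mathbf{V}}) \to \prod_{\alpha \in A} \mathcal{P}(\chi_{\mathbf{V}}), \qquad \varpi_1 : \prod_{\alpha \in A} \mathcal{P}(\chi_{\mathbf{V}}) \to \mathcal{P}(\chi_{\mathbf{V}});$$

$$\varpi_3 : \mathcal{M} \mapsto p^{\mathcal{M}}_{\mathrm{cf}}, \qquad \varpi_2 : \mu_3 \mapsto \big(\varsigma_{\{\alpha\} \times \mathbf{V}}(\mu_3)\big)_{\alpha \in A}, \qquad \varpi_1 : (\mu_\alpha)_{\alpha \in A} \mapsto \mu_{\emptyset:=()} = \pi_{\emptyset:=()}\big((\mu_\alpha)_\alpha\big).$$

The causal hierarchy consists of three sets $\{S_i\}_{1 \le i \le 3}$ defined as images or projections of $\mathfrak{M}$: $S_3 = \varpi_3(\mathfrak{M})$, $S_2 = \varpi_2(S_3)$, $S_1 = \varpi_1(S_2)$. These are the three Levels of the hierarchy.

The definitions cohere with those of §2 (and, e.g., [28, 2]):

Fact 3. Let $\mathcal{M} \in \mathfrak{M}$. Then $\mu_3 = \varpi_3(\mathcal{M}) \in S_3$ trivially coincides with its counterfactual distribution as defined in §2.3, while $(\mu_\alpha)_\alpha = \varpi_2(\mu_3) \in S_2$ coincides with the indexed family of all its interventional distributions (§2.2), i.e., $\pi_{\mathbf{W}:=\mathbf{w}}\big((\mu_\alpha)_\alpha\big) = p^{\mathcal{M}_{\mathbf{W}:=\mathbf{w}}}$ for each $\mathbf{W} := \mathbf{w} \in A$. Finally $\mu = \varpi_1\big((\mu_\alpha)_\alpha\big) \in S_1$ coincides with its observational distribution (§2.1).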
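
As a concrete check on the three levels, the following minimal sketch enumerates the four exogenous settings of Example 1 (§2.3) and computes an observational, an interventional, and a counterfactual quantity from them. The function names `solve`, `counterfactual`, and `interventional` and the dictionary encoding of interventions are illustrative assumptions of ours, not notation from the paper; the sketch handles only this finite two-variable model.

```python
from fractions import Fraction
from itertools import product

# Exogenous noise of Example 1: U1, U2 are independent fair coins.
P_U = {u: Fraction(1, 4) for u in product((0, 1), repeat=2)}

def solve(u, do=None):
    """Return the endogenous valuation (X, Y) induced by u under intervention `do`.

    `do` is a dict such as {"X": 1}; the empty dict is the empty intervention.
    """
    do = do or {}
    u1, u2 = u
    x = do.get("X", u1)                                  # f_X is the identity on U1
    y = do.get("Y", u2 * x + (1 - u2) * (1 - x))         # f_Y(u, x) = ux + (1-u)(1-x)
    return (x, y)

def counterfactual(interventions):
    """Level 3, restricted to the listed interventions: joint law of their outcomes."""
    dist = {}
    for u, p in P_U.items():
        key = tuple(solve(u, do) for do in interventions)  # shared exogenous noise u
        dist[key] = dist.get(key, 0) + p
    return dist

def interventional(do):
    """Level 2: the marginal of the counterfactual law on a single intervention."""
    return {key[0]: p for key, p in counterfactual([do]).items()}

# Level 1 is the empty intervention (Remark 1).
observational = interventional({})

# Probability of necessity and sufficiency: Y = 1 under X := 1 and Y = 0 under X := 0.
joint = counterfactual([{"X": 1}, {"X": 0}])
pns = sum(p for (v1, v0), p in joint.items() if v1[1] == 1 and v0[1] == 0)
```

Here `observational` assigns 1/4 to each of the four valuations and `pns` evaluates to 1/2, matching the values stated in Example 1.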
Thus, e.g., $S_3$ is the set of counterfactual distributions that are consistent with at least some SCM from $\mathfrak{M}$. It is a fact that $S_3 \subsetneq \mathcal{P}(\chi_{A \times \mathbf{V}})$ and similarly not every interventional family belongs to $S_2$; see Appendix B for explicit characterizations. At the observational level, this is simple:

Fact 4. $S_1 = \mathcal{P}(\chi_{\mathbf{V}})$ in the finite case. In the infinite case, $S_1 = \{\mu \in \mathcal{P}(\chi_{\mathbf{V}}) : \mu \text{ is atomless}\}$.

We will also use the subsets $\{S_i^{\prec}\}_i$ and $\{S_i^{X}\}_i$, which are defined analogously but via projection from $\mathfrak{M}_{\prec}$ and $\mathfrak{M}_X$ respectively.

[Figure 1, panel (a) "Causal Hierarchy": the chain $\mathfrak{M} \to S_3 \to S_2 \to S_1$ with maps $\varpi_3, \varpi_2, \varpi_1$, together with projections $\varpi_2^{X \to Y}$ from $S_2$ to the spaces $S_2^{X \to Y}$. Panel (b) "Collapse Set $C_2$".]

Figure 1: (a) $S_3$ can be seen as a coarsening of $\mathfrak{M}$, abstracting from irrelevant intensional details. $S_2$ is obtained from $S_3$ by marginalization (also a coarsening), while $S_1$ is a projection of $S_2$ via the empty intervention. Each map $\varpi_i$, $i = 1, 2, 3$, is continuous for the respective weak topologies (Prop. 4). The projections $\varpi_2^{X \to Y}$ from $S_2$ to the 2VE-spaces are likewise continuous (Prop. 4). (b) The shaded region, $C_2 \subseteq S_2$, is the collapse set in which Level 2 facts determine all Level 3 facts: those points in $S_2$ whose $\varpi_2$-preimage in $S_3$ is a singleton set. The main result of this paper is that $C_2$ is meager in the weak topology on $S_2$ (Thm. 3). This means $C_2$ contains no nonempty open subset, which by Thm. 2 implies no part of $C_2$ is statistically verifiable, even with infinitely many ideal experiments.

### 3.1 Problems of Causal Inference

As elucidated in [29, 2], the causal hierarchy helps characterize many standard problems of causal inference, insofar as these problems typically involve ascending levels of the hierarchy. Some examples include:

1. Classical identifiability: given observational data about some variables in $\mathbf{V}$, estimate a causal effect of setting variables $\mathbf{X}$ to values $\mathbf{x}$ [26, 35]. In the notation here, given information about $p^{\mathcal{M}}(\mathbf{V})$, can we determine $p^{\mathcal{M}_{\mathbf{X}:=\mathbf{x}}}(\mathbf{Y})$?

2.
General identifiability: given a mix of observational data and limited experimental data (that is, information about $p^{\mathcal{M}}(\mathbf{V})$ as well as some experimental distributions of the form $p^{\mathcal{M}_{\mathbf{W}:=\mathbf{w}}}(\mathbf{V})$), determine $p^{\mathcal{M}_{\mathbf{X}:=\mathbf{x}}}(\mathbf{Y})$ [38, 22].

3. Structure learning: given observational data, and perhaps experimental data, infer properties of the underlying causal influence relation $\to$ [35, 30].

4. Counterfactual estimation: given a combination of observational and experimental data, infer a counterfactual quantity, such as probability of necessity [31], or probability of necessity and sufficiency [27, 37] (see also §3.3 below).

5. Global identifiability: given observational data drawn from $p^{\mathcal{M}}(\mathbf{V})$ infer the full counterfactual distribution $p^{\mathcal{M}}_{\mathrm{cf}}(A \times \mathbf{V})$ [34, 10].

This is not an exhaustive list, and these problems are not all independent of one another. They are also all unsolvable in general. Problems 1, 2, and 3 involve ascending to Level 2 given information at Level 1 (and perhaps partial information at Level 2); problems 4 and 5 ask us to ascend to Level 3 given only Level 1 (and perhaps also Level 2) information. The upshot of the causal hierarchy theorem from [2] is that these steps are impossible without assumptions, formalizing the common wisdom, "no causes in, no causes out" [7]. To understand the statement of the causal hierarchy theorem and our topological version of it we explain what it means for the hierarchy to collapse.

### 3.2 Collapse of the Hierarchy

In the present setting a collapse of the hierarchy can be understood in terms of injectivity of the functions $\varpi_i$. For $i = 1, 2$ let $C_i \subseteq S_i$ be the injective fibers of $\varpi_i$, i.e.,

$$C_i = \{\mu_i \in S_i : \mu_{i+1} = \mu'_{i+1} \text{ whenever } \varpi_i(\mu_{i+1}) = \varpi_i(\mu'_{i+1}) = \mu_i\}.$$

Every element $\mu \in C_i$ is a witness to (global) collapse of the hierarchy: knowing $\mu$ would be sufficient to determine the Level $i + 1$ facts completely.

A first observation is that $\varpi_1$ is never injective. In other words, the distribution $p^{\mathcal{M}}(\mathbf{V})$ never determines all the interventional distributions $p^{\mathcal{M}_{\mathbf{X}:=\mathbf{x}}}(\mathbf{Y})$.
This is essentially a way of stating that correlation never implies causation absent assumptions. (See also [2, Thm. 1].)

Proposition 2. $C_1 = \emptyset$. That is, Level 2 never collapses to Level 1 without assumptions.

To overcome this formidable inferential barrier, researchers often assume we are not working in the full space $\mathfrak{M}$ of all causal models, but rather some proper subset embodying a range of causal assumptions. This may effectively eliminate counterexamples to collapse (cf. Fig. 1(b)). For problems of type 1 or 2 (from the list above in §3.1) it is common to assume we are only dealing with models whose graph (direct influence relation) satisfies a fixed set of properties. For problems of type 3 it is common to assume that $p^{\mathcal{M}}$ and $\to$ relate in some way (for instance, through an assumption like faithfulness or minimality [35]). All of these problems become solvable with sufficiently strong assumptions about the form of the functions $\{f_V\}_V$ or the probability space $P$.

In some cases, the relevant causal assumptions are justified by appeal to background or expert knowledge. In other cases, however, an assumption will be justified by the fact that it rules out only a "small" or "negligible" or "measure zero" part of the full set $\mathfrak{M}$ of possibilities. As emphasized by a number of authors [12, 39, 23], not all small subsets are the same, and it seems reasonable to demand further justification for eliminating one over another. We believe that the framework presented here can contribute to this positive project, but our immediate interest is in solidifying and clarifying limitative results about what cannot be done.

The issue of collapse becomes especially delicate when we turn to $C_2$. When do interventional distributions fully determine counterfactual distributions? In contrast to Prop. 2 we have:

Proposition 3. $C_2 \neq \emptyset$. That is, there exists an SCM in which Level 3 collapses to Level 2.

Proof sketch.
As a very simple example in the finite case, any fully deterministic SCM will result in collapse. This is because, if $(\mu_\alpha)_{\alpha \in A}$ are all binary-valued then the measure $\mu_3 \in \mathcal{P}\big(\prod_{\alpha \in A} \chi_{\mathbf{V}}\big)$ that produces the marginals $\mu_\alpha$ is completely determined: each $\mu_\alpha$ specifies an element of $\chi_{\mathbf{V}}$, so $\mu_3$ must assign unit probability to the tuple that matches $\mu_\alpha$ at the $\alpha$ projection. In the infinite case, any example must be non-deterministic by atomlessness, but collapse is still possible; see Example 2 in Appendix B.

### 3.3 Probabilities of Causation

A handful of counterfactual quantities over two given variables, collected below, have been particularly prominent in the literature (e.g., [27]). Our main result will show that any of these six quantities (for any two fixed variables) is robust against collapse. Below, fix two distinct variables $Y \neq X \in \mathbf{V}$ and distinct values $x \neq x' \in \chi_X$, $y \neq y' \in \chi_Y$.

Definition 10. The probabilities of causation are the following quantities:

$P(y_x, y'_{x'})$: probability of necessity and sufficiency

$P(y'_x, y_{x'})$: converse probability of necessity and sufficiency

$P(y'_{x'} \mid x, y)$: probability of necessity

$P(y_x \mid x', y')$: probability of sufficiency

$P(y'_{x'} \mid y)$: probability of disablement

$P(y_x \mid y')$: probability of enablement

Consider, for example, the probability of necessity and sufficiency (PNS), which is the joint probability that $Y$ would take on value $y$ if $X$ is set by intervention to $x$, and $y'$ if $X$ is set to $x'$. PNS has been thoroughly studied [27, 37, 1], in part due to its widespread relevance: from medical treatment to online advertising, we would like to assess which interventions are likely to be both necessary and sufficient for a given outcome. Using the notation from §2.3, PNS concerns the measure of sets $y_x, y'_{x'} = \pi^{-1}_{(X:=x, Y)}(\{y\}) \cap \pi^{-1}_{(X:=x', Y)}(\{y'\})$.

The probabilities of causation are paradigmatically Level 3, and we will be interested in their manifestations at Level 2.
In that direction we introduce a small part of $S_2$, just enough to witness the behavior of $Y$ (and $X$) under the empty intervention and the two possible interventions on $X$:

Definition 11. Let $A_X = \{\emptyset := (), X := 0, X := 1\}$. Define a small subspace $S_2^{X \to Y} \subseteq \prod_{\alpha \in A_X} \mathcal{P}(\chi_{\{X, Y\}})$ as the image of the map $\varpi_2^{X \to Y} = (\varsigma_{\{X,Y\}} \times \varsigma_{\{X,Y\}} \times \varsigma_{\{X,Y\}}) \circ \pi_{A_X}$ (see Fig. 1(a)). Call $S_2^{X \to Y}$ a two-variable effect (2VE) space; fixing $X$, we have a 2VE-space for each $Y$.

It is known in the literature that the probabilities of causation are not identifiable from the data $p(X, Y)$, $p(Y_x)$, and $p(Y_{x'})$ (see, e.g., [1] for PNS). As part of our proof of Theorem 3 below, we will strengthen this considerably to show them all to be generically unidentifiable, in a topological sense to be made precise.

## 4 The Weak Topology

We now demonstrate how $S_1$, $S_2$ and $S_3$ can be topologized. In general, given a space $\vartheta$ and the set $S = \mathcal{P}(\vartheta)$ of Borel probability measures on $\vartheta$, a natural topology on $S$ can be defined as follows:

Definition 12. For a sequence $(\mu_n)_n$ of measures in $S$, write $(\mu_n)_n \Rightarrow \mu$ and say it converges weakly [4, p. 7] to $\mu$ if $\int_\vartheta f \, d\mu_n \to \int_\vartheta f \, d\mu$ for all bounded, continuous $f : \vartheta \to \mathbb{R}$. Then the weak topology $\tau^w$ on $S$ is that with the following closed sets: $E \subseteq S$ is closed in $\tau^w$ iff for any weakly convergent sequence $(\mu_n)_n \Rightarrow \mu$ in which every $\mu_n \in E$, the limit point $\mu$ is in $E$.

There are several alternative characterizations of $\tau^w$, which hold under very general conditions. For instance, it coincides with the topology induced by the so-called Lévy-Prohorov metric [4]. The most useful characterization for our purposes is that it can be generated by subbasic open sets of the form

$$\{\mu : \mu(X) > r\} \tag{1}$$

with $X$ ranging over basic clopens in $\vartheta$ and $r$ over rationals (see, e.g., [16, Lemma A.5]).

Conceptually, the explication of $\tau^w$ in terms of weak convergence strongly suggests a connection with statistical learning. We now make this connection precise, building on existing work [8, 14, 16].
### 4.1 Connection to Learning Theory

Roughly speaking, we will say a hypothesis $H \subseteq S$ is statistically verifiable if there is some error bound $\epsilon$ and a sequence of statistical tests that converge on $H$ with error at most $\epsilon$, when data are generated from $H$. More formally, a test is a function $\lambda : \vartheta^n \to \{\text{accept}, \text{reject}\}$, where $\vartheta^n$ is the $n$-fold product of $\vartheta$, viz. finite data streams from $\vartheta$. The interest is in whether a null hypothesis can be rejected given data observed thus far. The boundary of a set $A \subseteq \vartheta$, written $\mathrm{bd}(A)$, is the difference of its closure and its interior. Intuitively, a learner will not be able to decide whether to accept or reject on the boundary. Consequently it is assumed that $\lambda$ is feasible in the sense that the boundary of its acceptance zone (in the product topology on $\vartheta^n$) always has measure 0, i.e., $\mu^n[\mathrm{bd}(\lambda^{-1}(\text{accept}))] = 0$ for every $\mu \in S$, where $\mu^n$ is the $n$-fold product measure of $\mu$.

Say a hypothesis $H \subseteq S$ is verifiable [14] if there is $\epsilon > 0$ and a sequence $(\lambda_n)_{n \in \mathbb{N}}$ of feasible tests (of the complement of $H$ in $S$, i.e., the "null hypothesis") such that

1. $\mu^n[\lambda_n^{-1}(\text{reject})] \le \epsilon$ for all $n$, whenever $\mu \notin H$;
2. $\lim_{n \to \infty} \mu^n[\lambda_n^{-1}(\text{reject})] = 1$, whenever $\mu \in H$.

That is, to be verifiable we only require a sequence of tests that converges in probability to the true hypothesis in the limit of infinite data (requirement 2), while incurring (type 1) error only up to a given bound at finite stages (requirement 1). As an illustrative example, conditional dependence is verifiable [14].

This is a relatively lax notion of verifiability. For instance, the hypothesis need not also be refutable (and thus "decidable"). For our purposes this generality is a virtue: we want to show that certain hypotheses are not statistically verifiable by any method, even in this wide sense.

The fundamental link between verifiability and the weak topology is the following, due to [14, 16]:

Theorem 1. A set $H \subseteq S$ is verifiable if and only if it is open in the weak topology.
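
To see requirements 1 and 2 at work on the simplest open sets, here is a minimal sketch of a test sequence for a subbasic hypothesis of the form $\{\mu : \mu(X) > r\}$ from (1). It is our own illustration, not the construction used in [14, 16]: we assume each observation is reduced to a 0/1 indicator of membership in the clopen set $X$, and we use a Hoeffding margin to control the error. The names `make_test` and `reject_probability` are hypothetical. Because the decision depends only on membership in a clopen set, the acceptance zone is itself clopen and has empty boundary, so the test is feasible.

```python
import math

def make_test(r, eps):
    """Return a test for the hypothesis H = {mu : mu(X) > r}.

    The returned function maps a finite sample of 0/1 indicators (1 means the
    observation fell in X) to "reject" (the null is rejected, H is affirmed)
    or "accept".
    """
    def test(sample):
        n = len(sample)
        # Hoeffding margin: P(freq - mu(X) >= slack) <= exp(-2 n slack^2) = eps.
        slack = math.sqrt(math.log(1 / eps) / (2 * n))
        freq = sum(sample) / n
        return "reject" if freq > r + slack else "accept"
    return test

def reject_probability(n, p, r, eps):
    """Exact probability that the n-sample test rejects when mu(X) = p."""
    slack = math.sqrt(math.log(1 / eps) / (2 * n))
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if k / n > r + slack)

# Requirement 1: if mu(X) <= r, the rejection probability stays below eps for every n.
null_errors = [reject_probability(n, 0.5, 0.5, 0.05) for n in (1, 10, 100, 400)]

# Requirement 2: if mu(X) > r, the rejection probability tends to 1 as n grows.
power = [reject_probability(n, 0.7, 0.5, 0.05) for n in (10, 100, 400)]
```

A hypothesis defined by an equality such as $\mu(X) = r$ admits no analogous margin, which is the informal reason such sets fail to be open and hence fail to be verifiable.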
### 4.2 Topologizing Causal Models

We now reinterpret $\tau^w$ at each level of the causal hierarchy:

Definition 13. The weak causal topology $\tau^w_i$, $1 \le i \le 3$, is the subspace topology on $S_i$, induced by

$$\begin{cases} \tau^w \text{ on } \mathcal{P}(\chi_{A \times \mathbf{V}}), & \text{if } i = 3; \\ \text{product of } \tau^w \text{ on } \prod_{\alpha \in A} \mathcal{P}(\chi_{\mathbf{V}}), & \text{if } i = 2; \\ \tau^w \text{ on } \mathcal{P}(\chi_{\mathbf{V}}), & \text{if } i = 1. \end{cases}$$

Proposition 4. All projections $\{\varpi_i\}_i$, $\varpi_2^{X \to Y}$ are continuous in the weak causal topologies.

A significant observation is that the learning-theoretic interpretation, originally intended for $\tau^w_1$, naturally extends to $\tau^w_2$. While data streams at Level 1 amount to passive observations of $\mathbf{V}$, data streams at Level 2 can be seen as sequences of experimental results, i.e., observations of potential outcomes $Y_x$. To make verifiability as easy as possible we assume a learner can observe a sample from all conceivable experiments at each step. A learner is thus a function $\lambda : E_n \to \{\text{accept}, \text{reject}\}$, where $E_n = ((\chi_{\mathbf{V}})^n)_\alpha$ is the set of potential experimental observations over $n$ trials (with $\alpha$ indexing the experiments). Construing $E_n$ as a product space we can again speak of feasibility of $\lambda$.

Recall that elements of $S_2$ are tuples $(\mu_\alpha)_{\alpha \in A}$ of measures. Say a hypothesis $H \subseteq S_2$ is experimentally verifiable if there is $\epsilon > 0$ and a sequence $(\lambda_n)_{n \in \mathbb{N}}$ of feasible tests such that requirements 1 and 2 of §4.1 hold, replacing $\mu^n[\lambda_n^{-1}(\text{reject})]$ with $\prod_\alpha \mu_\alpha^n[(\lambda_n^{-1}(\text{reject}))_\alpha]$. That is, when experimental data are drawn from the interventional distributions $(\mu_\alpha)_{\alpha \in A} \in H$, we require that the learner eventually converge on $H$ with bounded error at finite stages. We can then show (see Appendix C):

Theorem 2. A set $H \subseteq S_2$ is experimentally verifiable if and only if it is open in $\tau^w_2$.

A similar result can be given for $(S_3, \tau^w_3)$, although it is less clear what the empirical content of this result would be. Note also that $\tau^w_1, \tau^w_2, \tau^w_3$ give a sequence of increasingly fine topologies on the set of actual SCMs $\mathfrak{M}$ by simply pulling back the projections.
The point is that $\tau^w_2$ is the finest that has clear empirical significance, while $\tau^w_3$ is the finest in terms of relevance to the causal hierarchy.

5 Collapse is Meager

Recall that a set $X \subseteq \vartheta$ is nowhere dense if every nonempty open set contains a nonempty open $Y$ with $X \cap Y = \varnothing$. A countable union of nowhere dense sets is said to be meager (or of first category). The complement of a meager set is comeager. Intuitively, a meager set is one that can be approximated by sets "perforated with holes" [25]. Meagerness is notably preserved under continuous preimages.

As discussed above, one intuition highlighted by the weak topology $\tau^w$ is that open sets are the kinds of probabilistic propositions that could, in the limit of infinite data, be verified (Theorems 1 and 2). Correlatively, meager sets in $\tau^w$ are so negligible as to be unverifiable: as a meager set contains no nonempty open subsets (by the Baire Category Theorem [25]), it is statistically unverifiable.

We will now show that the injective collapse set $C_2$ from §3.2 is topologically meager. The crux is to identify a good comeager 2VE-subspace where collapse never occurs (with separation witnessed by probabilities of causation). In this subspace, the constraints circumscribing Level 3 have sufficient slack to make a tweak without thereby disturbing Level 2 (cf. Figure 2). We define the good set as the locus of a set of strict inequalities:

Definition 14. A family $(\mu_\alpha)_{\alpha \in A_X} \in S_2^{X \to Y}$ is $Y$-good if we have the following, abbreviating the members of $A_X$ as $()$, $x$, $x'$:

$$0 < \mu_x(y') - \mu_{()}(x, y') < \mu_{()}(x'), \tag{2}$$

$$0 < \mu_{()}(x', y') < \mu_{()}(x'). \tag{3}$$

Lemma 1. The subspace of $Y$-good families is comeager in $S_2^{X \to Y}$.

Proof sketch. The non-strict versions of (2), (3) hold universally, so the complement of the good set is defined by equalities. This complement is closed and contains no nonempty open set, by the weak subbasis (1); it is therefore nowhere dense, and in particular meager.

Figure 2 presents the construction in a small, two-variable case, and Lemma 2 below is proven by generalizing it to arbitrary $V$.
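
The two-variable construction can be checked mechanically. The sketch below is a hedged illustration, not a transcription of Figure 2: the unit-level assignments in `make_models` are an assumption, chosen as one instance consistent with (2), (3), and the PNS values stated in the caption of Figure 2, and all function names are hypothetical. Each model is a list of exogenous states with their probability, the value of X, and the two potential outcomes of Y. We assume X and Y are binary, with values written x, x' and y, y'.

```python
from fractions import Fraction


def make_models(eps):
    """Return (M, M_prime) as lists of (P(u), X_u, Y_{x,u}, Y_{x',u}).

    Assumed instance: M has two equiprobable states; M_prime splits each
    state, moving mass eps onto states where Y responds to X.
    """
    half = Fraction(1, 2)
    M = [(half, "x'", "y'", "y'"), (half, "x'", "y", "y")]
    M_prime = [
        (half - eps, "x'", "y'", "y'"),
        (eps, "x'", "y'", "y"),
        (half - eps, "x'", "y", "y"),
        (eps, "x'", "y", "y'"),
    ]
    return M, M_prime


def level2(model):
    """Joint law of (X, Y) under no intervention and under X := x, X := x'."""
    dists = {}
    for alpha in ("()", "x", "x'"):
        d = {}
        for p, x_u, y_x, y_xp in model:
            x_val = x_u if alpha == "()" else alpha
            y_val = y_x if x_val == "x" else y_xp
            d[(x_val, y_val)] = d.get((x_val, y_val), 0) + p
        dists[alpha] = d
    return dists


def pns(model):
    """Probability of necessity and sufficiency: P(Y_x = y, Y_{x'} = y')."""
    return sum(p for p, _, y_x, y_xp in model if y_x == "y" and y_xp == "y'")


def is_y_good(model):
    """Check the strict inequalities (2) and (3) on the Level 2 projection."""
    mu = level2(model)
    obs = mu["()"]
    mu_x_yp = sum(p for (_, y), p in mu["x"].items() if y == "y'")
    p_x_yp = obs.get(("x", "y'"), 0)
    p_xp = sum(p for (x, _), p in obs.items() if x == "x'")
    p_xp_yp = obs.get(("x'", "y'"), 0)
    return 0 < mu_x_yp - p_x_yp < p_xp and 0 < p_xp_yp < p_xp


if __name__ == "__main__":
    M, M_prime = make_models(Fraction(1, 10))
    print("Y-good:", is_y_good(M))
    print("Level 2 agreement:", level2(M) == level2(M_prime))
    print("PNS in M and M':", pns(M), pns(M_prime))
```

Exact rational arithmetic (`fractions.Fraction`) is used so that agreement at Level 2 is tested by equality rather than by a floating-point tolerance.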
Guaranteeing agreement on every interventional distribution in the general case is subtle (Appendix D): it has been observed that enlarging $V$ can enable additional inferences (e.g., [32]), though the next result reflects a dependence on further assumptions.

Lemma 2. Suppose $\prec$ is an order in which $X$ comes first and $(\mu_\alpha)_{\alpha \in A} \in S_2^{\prec}$ is such that $\varpi_2^{X \to Y}((\mu_\alpha)_\alpha)$ is $Y$-good, and let $\phi$ be PNS, the converse PNS, the probability of sufficiency, or the probability of enablement (Definition 10). Then for any $\mu_3 \in S_3^{\prec}$ such that $\varpi_2(\mu_3) = (\mu_\alpha)_\alpha$, there exists a $\mu'_3 \in S_3^{\prec}$ such that $\mu_3$ and $\mu'_3$ disagree on $\phi$.

Note that by reversing the roles of $x$ and $x'$, we may obtain the same for the probability of necessity and probability of disablement. The main theorem and its important learning-theoretic corollary are now straightforward.

Theorem 3 (Topological Hierarchy). The set $C_2$ of points where all Level 3 facts are identifiable from Level 2 is meager in $(S_2, \tau^w_2)$. The preimage $\varpi_2^{-1}(C_2) = C_3$ is likewise meager in $(S_3, \tau^w_3)$.

Proof. Let $D_2^{X,Y} \subseteq S_2^X$ be the preimage under $\varpi_2^{X \to Y}$ of the set of $Y$-good tuples in $S_2^{X \to Y}$. Lemma 2 implies that $C_2 \cap S_2^X$ is contained in $S_2^X \setminus D_2^{X,Y}$, for any $Y \neq X$. Meanwhile, since $\varpi_2^{X \to Y}$ is continuous, Lemma 1 implies that $S_2^X \setminus D_2^{X,Y}$ is meager in $S_2^X$, and thereby also in $S_2$. Thus $C_2 = \bigcup_{X \in V} (C_2 \cap S_2^X)$ is a countable union of meager sets, and hence meager.

Corollary 1. No causal hypothesis licensing arbitrary counterfactual inferences (and specifically those of the probabilities of causation) from observational and experimental data is itself statistically (even experimentally) verifiable.

[Figure 2, panels (a) "Y-good Model M" and (b) "Example Separating Levels 2 and 3": tables listing, for each exogenous state $u$, the probability $P(u)$ and the values $X_u$, $Y_{x,u}$, $Y_{x',u}$. Panel (a) has states $u_0, u_1$, each with probability $1/2$; panel (b) has states $u_0, \dots, u_3$ with probabilities $1/2 - \varepsilon$, $\varepsilon$, $1/2 - \varepsilon$, $\varepsilon$.]

Figure 2: (a): the structural functions and exogenous noise for a model $M$ with direct influence $X \to Y$. This $M$ meets (2) and (3), so we may apply Lemma 2, constructing the model $M'$ in (b), where $0 < \varepsilon < 1/2$.
Note that $p^{M}_{\mathrm{cf}}(y_x, y'_{x'}) = 0$ while $p^{M'}_{\mathrm{cf}}(y_x, y'_{x'}) = \varepsilon$, so that the two models disagree on a Level 3 PNS quantity; on the other hand, it is easy to check agreement on all of Level 2. Similarly, $M$ and $M'$ disagree on the converse PNS, probability of sufficiency, and probability of enablement (Definition 10).

6 Conclusion

We introduced a general framework for topologizing spaces of causal models, including the space of all (discrete, well-founded) causal models. As an illustration of the framework we characterized levels of the causal hierarchy topologically, and proved a topological version of the causal hierarchy theorem from [2]. While the latter shows that collapse of the hierarchy (specifically of Level 3 to Level 2) is exceedingly unlikely in the sense of (Lebesgue) measure, we offer a complementary result: any condition guaranteeing that we could infer arbitrary Level 3 information from purely Level 2 information must be statistically unverifiable, even by experimental means. Both results capture an important sense in which collapse is negligible in the space of all possible models. As an added benefit, the topological approach extends seamlessly to the setting of infinitely many variables.

There are many natural extensions of these results. For instance, we have begun work on a version for continuous endogenous variables. Also of interest are subspaces embodying familiar causal assumptions or other well-studied coarsenings of SCMs (see, e.g., [23] on Bayesian networks, or [17, 15] on linear non-Gaussian models), which often render important inference problems solvable, though sometimes only generically so. In the opposite direction, we expect analogous hierarchy theorems to hold for extensions of the SCM concept, e.g., those dropping the well-foundedness or recursiveness requirements [6].
As emphasized by [2], a causal hierarchy theorem should not be construed as a purely limitative result, but rather as further motivation for understanding the whole range of causal-inductive assumptions, how they relate, and what they afford. We submit that the topological constructions presented here can help clarify and systematize this broader landscape.

Acknowledgments

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-16565. We are very grateful to the five anonymous NeurIPS reviewers for insightful and detailed comments and questions that led to significant improvements in the paper. We would also like to thank Jimmy Koppel, Krzysztof Mierzewski, Francesca Zaffora Blando, and especially Kasey Genin for helpful feedback on earlier versions.

References

[1] C. Avin, I. Shpitser, and J. Pearl. Identifiability of path-specific effects. In Proceedings of IJCAI, 2005.

[2] E. Bareinboim, J. D. Correa, D. Ibeling, and T. Icard. On Pearl's hierarchy and the foundations of causal inference. Technical Report R-60, Causal AI Lab, Columbia University, 2020.

[3] G. Belot. Absolutely no free lunches! Theoretical Computer Science, 845:159–180, 2020.

[4] P. Billingsley. Convergence of Probability Measures. Wiley, 2nd edition, 1999.

[5] V. I. Bogachev. Measure Theory. Springer Berlin Heidelberg, 2007.

[6] S. Bongers, P. Forré, J. Peters, and J. M. Mooij. Foundations of structural causal models with cycles and latent variables, 2021.

[7] N. Cartwright. Nature's Capacities and their Measurement. Clarendon Press, 1989.

[8] A. Dembo and Y. Peres. A topological criterion for hypothesis testing. Annals of Statistics, 22(1):106–117, 1994.

[9] P. Diaconis and D. Freedman. On the consistency of Bayes estimates. Annals of Statistics, 14(1):1–26, 1986.

[10] M. Drton, R. Foygel, and S. Sullivant. Global identifiability of linear structural equation models. Annals of Statistics, 39(2):865–886, 2011.

[11] F. Eberhardt, C. Glymour, and R. Scheines. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among n variables. In Proceedings of UAI, pages 178–184, 2005.

[12] D. Freedman. From association to causation via regression. Advances in Applied Mathematics, 18:59–110, 1997.

[13] D. Galles and J. Pearl. An axiomatic characterization of causal counterfactuals. Foundations of Science, 3(1):151–182, Jan 1998.

[14] K. Genin. The Topology of Statistical Inquiry. PhD thesis, Carnegie Mellon University, 2018.

[15] K. Genin. Statistical undecidability in linear, non-Gaussian models in the presence of latent confounders. In Proceedings of NeurIPS, 2021.

[16] K. Genin and K. Kelly. The topology of statistical verifiability. In Proceedings of TARK, pages 236–250, 2017.

[17] K. Genin and C. Mayo-Wilson. Statistical decidability in linear, non-Gaussian models. In Causal Discovery & Causality-Inspired Machine Learning, NeurIPS, 2020.

[18] D. Ibeling and T. Icard. On open-universe causal reasoning. In Proceedings of UAI, 2019.

[19] D. Ibeling and T. Icard. Probabilistic reasoning across the causal hierarchy. In Proceedings of AAAI, 2020.

[20] G. W. Imbens and D. B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

[21] J. L. Kelley. General Topology. Springer, 1975.

[22] S. Lee, J. Correa, and E. Bareinboim. General identifiability with arbitrary surrogate experiments. In Proceedings of UAI, 2019.

[23] H. Lin and J. Zhang. On learning causal structures from non-experimental data without any faithfulness assumption. In Proceedings of ALT, pages 554–582, 2020.

[24] C. Meek. Strong completeness and faithfulness in Bayesian networks. In Proceedings of UAI, pages 411–418, 1995.

[25] J. C. Oxtoby. Measure and Category. Springer, 1971.

[26] J. Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–710, 1995.

[27] J. Pearl. Probabilities of causation: Three counterfactual interpretations and their identification. Synthese, 121(1):93–149, 1999.

[28] J. Pearl. Causality. Cambridge University Press, 2009.

[29] J. Pearl and D. Mackenzie. The Book of Why: The New Science of Cause and Effect. Basic Books, 2018.

[30] J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, 2017.

[31] J. Robins and S. Greenland. The probability of causation under a stochastic model for individual risk. Biometrics, 45(4):1125–1138, 1989.

[32] B. Schölkopf, F. Locatello, S. Bauer, N. R. Ke, N. Kalchbrenner, A. Goyal, and Y. Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021.

[33] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[34] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(72):2003–2030, 2006.

[35] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2001.

[36] Z. Tavares, J. Koppel, X. Zhang, R. Das, and A. S. Lezama. A language for counterfactual generative models. In Proceedings of ICML, 2021.

[37] J. Tian and J. Pearl. Probabilities of causation: Bounds and identification. Annals of Mathematics and Artificial Intelligence, 28(1):287–313, 2000.

[38] J. Tian and J. Pearl. A general identification condition for causal effects. In Proceedings of AAAI, 2002.

[39] C. Uhler, G. Raskutti, P. Bühlmann, and B. Yu. Geometry of the faithfulness assumption in causal inference. Annals of Statistics, 41(2):436–463, 2013.