# Neural Causal Abstractions

Kevin Xia, Elias Bareinboim
Causal AI Lab, Columbia University
{kevinmxia, eb}@cs.columbia.edu

## Abstract

The ability of humans to understand the world in terms of cause and effect relationships, as well as their ability to compress information into abstract concepts, are two hallmark features of human intelligence. These two topics have been studied in tandem under the theory of causal abstractions, but it is an open problem how to best leverage abstraction theory in real-world causal inference tasks, where the true model is not known and limited data is available in most practical settings. In this paper, we focus on a family of causal abstractions constructed by clustering variables and their domains, redefining abstractions to be amenable to individual causal distributions. We show that such abstractions can be learned in practice using Neural Causal Models, allowing us to utilize the deep learning toolkit to solve causal tasks (identification, estimation, sampling) at different levels of abstraction granularity. Finally, we show how representation learning can be used to learn abstractions, which we apply in our experiments to scale causal inferences to high-dimensional settings such as image data.

## 1 Introduction

Humans understand the world around them through the use of abstract notions. Biologists can study the function of the liver without understanding the interactions between its subatomic particles studied by physicists. Economists find it more practical to consider macro-level behavior through concepts like aggregate supply and demand rather than studying the purchasing behavior of individuals. At home, we choose to interpret the object on the television as a dog or a car as opposed to a collection of photons or pixels.

Humans are highly capable of learning through interacting with the environment and understanding cause and effect between different concepts. Understanding causality is considered a hallmark of human intelligence and allows humans to plan a course of action, determine blame and responsibility, and generalize across environments. It follows that the ability to abstract concepts and study them causally is a key ability expected from modern intelligent systems.

AI systems are built on a foundation of generative models, which are representations of the underlying processes from which data is collected. Standard generative models simply model some joint density of a set of variables of interest, while causal generative models further model distributions involving causal interventions and counterfactual relations. In this paper, we study the problem of learning a causal generative model from data. One major challenge is that data is often provided in complex low-level forms (e.g., pixels), while it would be more useful in applications to focus on higher-level concepts (e.g., dog or car). We would therefore like to learn a more abstract causal generative model at a higher level of granularity, while guaranteeing that the queries answered by the coarser model match the ground truth.

To formalize this problem, we build on the semantics of a class of generative models called structural causal models (SCMs) (Pearl 2000). An SCM M describes a collection of mechanisms and a distribution over unobserved factors.
Each SCM induces three qualitatively different sets of distributions related to the human concepts of seeing (called observational), doing (interventional), and imagining (counterfactual), collectively known as the Ladder of Causation or the Pearl Causal Hierarchy (PCH) (Pearl and Mackenzie 2018; Bareinboim et al. 2022). The PCH is a containment hierarchy in which these distribution sets are arranged into increasingly refined layers: observational distributions go in layer 1 (L1), interventional in layer 2 (L2), and counterfactual in layer 3 (L3). In typical causal inference tasks, the goal is to obtain a quantity from a higher layer when given data only from lower layers (e.g., inferring interventional quantities from observational data). Still, it is understood that this is generally impossible without additional assumptions, since higher layers are underdetermined by lower layers (Bareinboim et al. 2022; Ibeling and Icard 2020).

Generative models can often be implemented in practice as neural networks. Deep learning models have achieved promising success in a variety of applications such as computer vision (Krizhevsky, Sutskever, and Hinton 2012), speech recognition (Graves and Jaitly 2014), and game playing (Mnih et al. 2013). Many of these successes are attributed to representation learning (Bengio, Courville, and Vincent 2013), in which the learned representation can be thought of as an abstraction of the data. Further, there has been growing interest in incorporating causality into deep models. For instance, many successful approaches have been developed to estimate causal effects from observational data under backdoor or ignorability conditions (Shalit, Johansson, and Sontag 2017; Louizos et al. 2017; Li and Fu 2017; Johansson, Shalit, and Sontag 2016; Yao et al. 2018; Yoon, Jordon, and van der Schaar 2018; Kallus 2020; Shi, Blei, and Veitch 2019; Du et al. 2020; Guo et al. 2020), and also to answer causal queries through neural-parameterized SCMs (Kocaoglu et al. 2018; Goudet et al. 2018). Our work leverages one such model, the Neural Causal Model (NCM), which incorporates the causal assumptions encoded in a causal diagram to identify and estimate interventional and counterfactual distributions (Xia et al. 2021; Xia, Pan, and Bareinboim 2023). Despite the soundness of this approach in theory, current NCM-based methods face challenges when applied to complex real-world settings for various reasons: (1) optimization is difficult when scaled to high dimensions, (2) unprocessed data can come in complicated forms (e.g., images, text), and (3) the causal diagram is difficult to fully specify in some high-dimensional settings. In this work, we address these challenges by studying how representation learning and causal reasoning are related to each other and by building on this understanding to develop a neural framework for causal abstraction learning.

Existing works that study causal abstractions set a solid foundation by defining various mathematical notions of abstractions (Rubenstein et al. 2017; Beckers and Halpern 2019; Beckers, Eberhardt, and Halpern 2019). Such definitions are declarative; that is, if the lower and higher level models are given, one can use the definition to decide whether the higher level model is indeed an abstraction of the lower level one. However, neither model is available in practice, and one would want to use limited lower level data to learn a higher level causal abstraction. We expand on the current generation of causal abstractions in two ways. First, given that the true SCM is almost never available in practice, nor entirely learnable from data, we introduce a relaxed notion of abstractions that applies to the layers of the PCH.
Second, we develop algorithms to systematically obtain abstractions in practice given some structural information about the data, which can then be used for downstream inferential tasks such as causal identification, estimation, and sampling.

Fig. 1 summarizes the general problem tackled by this paper. The ground truth model M_L (left) is defined over low-level variables V_L (e.g., pixels), while it may be more practical to work with their high-level abstract counterparts V_H (e.g., dog or car). M_L induces distributions from the three layers of the PCH (i.e., L1, L2, L3), defined over V_L. In this work, we introduce a new type of abstraction function τ that maps distributions over V_L to ones over V_H (i.e., τ(L1), τ(L2), τ(L3)). Furthermore, M_L is unobserved, and only limited data is given (e.g., observational data from L1). The goal is to learn a high-level SCM M̂_H (right) over the high-level variables V_H that encodes the given causal constraints (G_C in the figure) and matches M_L on the available data across τ (e.g., L̂1 = τ(L1)). Then, we investigate when and how the resulting model M̂_H can be used as a surrogate, allowing one to make interventional and counterfactual inferences about the higher layers of M_L through the higher layers of M̂_H.

Figure 1: Overview of this paper. The high-level SCM M̂_H (right) is trained on available data to serve as an abstract proxy of the true, unobserved, low-level SCM M_L (left).

More specifically, our contributions are as follows. In Sec. 2, we define a new class of abstractions based on clusters of variables (intervariable) and their domains (intravariable). Building on this new class, we define a notion of abstraction consistency on the layers of the PCH. We then show how to systematically construct an abstraction consistent with all three layers of the PCH and relate these abstractions to existing definitions. In Sec. 3, we show how to leverage NCM machinery to perform interventional (layer 2) and counterfactual (layer 3) inferences across these abstractions when the true SCM is unavailable. In Sec. 4, we introduce a variant of the NCM that learns representations of each variable and encodes causal assumptions on the representation level, allowing us to learn abstractions even in settings where the assumption of the availability of clusters is relaxed. Experiments in Sec. 5 corroborate the theory. All appendices, including the proofs, experimental details, further discussion, and examples, can be found in the full technical report (Xia and Bareinboim 2023).

### 1.1 Preliminaries

We now introduce the notation and definitions used throughout the paper. We use uppercase letters (X) to denote random variables and lowercase letters (x) to denote corresponding values. Similarly, bold uppercase (X) and lowercase (x) letters denote sets of random variables and values, respectively. We use D_X to denote the domain of X and D_X = D_{X_1} × ⋯ × D_{X_k} for the domain of X = {X_1, ..., X_k}. We denote by P(X = x) (often shortened to P(x)) the probability of X taking the values x under the distribution P(X). We utilize the basic semantic framework of structural causal models (SCMs), as defined in (Pearl 2000, Ch. 7).
An SCM M consists of endogenous variables V, exogenous variables U with distribution P(U), and mechanisms F. F contains a function f_{V_i} for each V_i ∈ V that maps the endogenous parents Pa_{V_i} and exogenous parents U_{V_i} to V_i. Each M induces a causal diagram G, where every V_i ∈ V is a vertex, there is a directed arrow (V_j → V_i) for every V_i ∈ V and V_j ∈ Pa_{V_i}, and there is a dashed bidirected arrow (V_j ↔ V_i) for every pair V_i, V_j ∈ V such that U_{V_i} and U_{V_j} are not independent (Markovianity is not assumed). Our treatment is constrained to recursive SCMs, which implies acyclic causal diagrams, with finite discrete domains over the endogenous variables V. Counterfactual quantities can be computed from an SCM M as follows.

Definition 1 (Layer 3 Valuation). An SCM M induces layer L3(M), a set of distributions over V, each of the form P(Y_*) = P(Y_{1[x_1]}, Y_{2[x_2]}, ...), such that

$$P^{M}(\mathbf{y}_{1[\mathbf{x}_1]}, \mathbf{y}_{2[\mathbf{x}_2]}, \ldots) = \int_{D_{\mathbf{U}}} \mathbb{1}\big[\mathbf{Y}_{1[\mathbf{x}_1]}(\mathbf{u}) = \mathbf{y}_1,\ \mathbf{Y}_{2[\mathbf{x}_2]}(\mathbf{u}) = \mathbf{y}_2,\ \ldots\big]\, dP(\mathbf{u}), \qquad (1)$$

where Y_{i[x_i]}(u) is evaluated under the mutilated mechanisms F_{x_i} := {f_{V_j} : V_j ∈ V \ X_i} ∪ {f_X ← x : X ∈ X_i}. L2 is the subset of L3 for which all x_i are identical, and L1 is the subset for which all X_i = ∅.

Each Y_i corresponds to a set of variables in a world where the original mechanisms f_X are replaced with constants x_i for each X ∈ X_i; this is also known as the mutilation procedure. This procedure corresponds to interventions, and we use subscripts to denote the intervening variables (e.g., Y_x) or subscripts with brackets when the variables are indexed (e.g., Y_{1[x_1]}). For instance, P(y_x, y'_{x'}) is the probability of the joint counterfactual event "Y = y had X been x, and Y = y' had X been x'." We use the notation Li(M) to denote the set of Li distributions induced by M. We use Z to denote a set of quantities from Layer 2 (i.e., Z = {P(V_{z_k}) : k = 1, ..., ℓ}), and Z(M) denotes those same quantities induced by SCM M (i.e., Z(M) = {P^M(V_{z_k}) : k = 1, ..., ℓ}).

This work utilizes Neural Causal Models (NCMs) for practical implementations, defined as follows.

Definition 2 (G-Constrained Neural Causal Model (G-NCM) (Xia et al. 2021, Def. 7)). Given a causal diagram G, a G-constrained Neural Causal Model (G-NCM) M̂(θ) over V with parameters θ = {θ_{V_i} : V_i ∈ V} is an SCM ⟨Û, V, F̂, P(Û)⟩ such that (1) Û = {Û_C : C ∈ C(G)}, where C(G) is the set of all maximal cliques over the bidirected edges of G; (2) F̂ = {f̂_{V_i} : V_i ∈ V}, where each f̂_{V_i} is a feedforward neural network parameterized by θ_{V_i} ∈ θ, mapping U_{V_i} ∪ Pa_{V_i} to V_i, with U_{V_i} = {Û_C : Û_C ∈ Û s.t. V_i ∈ C} and Pa_{V_i} = Pa_G(V_i); (3) P(Û) is defined such that Û ~ Unif(0, 1) for each Û ∈ Û.

In words, a G-NCM is an SCM in which the exogenous variables Û are fixed, and the mechanisms F̂ are trainable neural networks whose inputs are determined by the graph G.
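To ground Def. 2, the following is a minimal PyTorch sketch of a G-constrained NCM that supports forward sampling and interventions by mutilation (Def. 1). The example graph, network sizes, and output activation are illustrative assumptions, not the authors' implementation (their official code is linked in Sec. 5).

```python
# A minimal sketch of a G-constrained NCM (Def. 2). Graph, dimensions, and
# architecture are illustrative assumptions made for this example.
import torch
import torch.nn as nn

class GNCM(nn.Module):
    def __init__(self, parents, confounded_cliques, hidden=32):
        # parents: dict V -> list of endogenous parents of V (directed edges of G)
        # confounded_cliques: tuples of variables, maximal cliques over bidirected edges
        super().__init__()
        self.parents = parents
        self.cliques = confounded_cliques
        self.order = self._topological_order()
        self.f = nn.ModuleDict()
        for v in parents:
            n_u = sum(v in c for c in confounded_cliques)   # shared U_C noise terms
            n_in = n_u + len(parents[v])
            self.f[v] = nn.Sequential(nn.Linear(max(n_in, 1), hidden),
                                      nn.ReLU(), nn.Linear(hidden, 1))

    def _topological_order(self):
        order, placed = [], set()
        while len(order) < len(self.parents):
            for v, pa in self.parents.items():
                if v not in placed and all(p in placed for p in pa):
                    order.append(v); placed.add(v)
        return order

    def forward(self, n, do=None):
        do = do or {}
        u = {c: torch.rand(n, 1) for c in self.cliques}       # U_C ~ Unif(0, 1)
        vals = {}
        for v in self.order:
            if v in do:                                        # mutilation: f_V := constant
                vals[v] = torch.full((n, 1), float(do[v]))
                continue
            inp = [u[c] for c in self.cliques if v in c]
            inp += [vals[p] for p in self.parents[v]]
            x = torch.cat(inp, dim=1) if inp else torch.zeros(n, 1)
            vals[v] = torch.sigmoid(self.f[v](x))              # outputs in [0, 1]
        return vals

# A small hypothetical graph, loosely mirroring Fig. 3: D -> Z -> B, with an
# assumed unobserved confounder between D and B.
ncm = GNCM(parents={"D": [], "Z": ["D"], "B": ["Z"]},
           confounded_cliques=[("D", "B")])
obs = ncm(1000)                   # L1 samples
interv = ncm(1000, do={"D": 1})   # L2 samples under do(D = 1)
```

Sampling under do(·) above simply swaps the mechanism of the intervened variable for a constant, which is the mutilation procedure described after Def. 1.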
## 2 Abstractions of the Pearl Causal Hierarchy

The discussion of abstractions begins with defining causal variables. In many established causal inference tasks, it is typically assumed that there is a well-specified and known set of endogenous variables of interest V, and nature is modeled by a collection of mechanisms that assign values to each of these variables. However, in practice, the definition of V may not always be clear. In particular, the variables of interest may not align with the features of the data. For example, in an economic system, data may be collected on each individual consumer, but the variable of interest is an aggregate measure like gross domestic product (GDP). In image data, the pixel values are collected, but the variables of interest relate to the objects in the image, not the individual pixels.

Acknowledging that the data is not always provided at the best choice of granularity, the causal abstraction literature typically defines two sets of variables, V_L and V_H, which describe the lower-level and higher-level settings, respectively. They are typically modeled by corresponding causal models M_L and M_H. In this section, we study the distinction between low-level variables V_L (e.g., pixels) and their higher-level counterparts V_H (e.g., image) from the perspective of individual distributions of the PCH. We consider nature's underlying SCM M_L defined over V_L, and the goal is to reason about the higher-level variables V_H given data on V_L. (For concreteness, we assume that M_L is an SCM, but the underlying generative model can be left implicit, as explained in Appendix D.1.) See the full technical report (Xia and Bareinboim 2023) for detailed examples of every definition.

### 2.1 Constructive Abstraction Functions

The connection between V_H and V_L can be described through a mapping between their domains, τ : D_{V_L} → D_{V_H}. Here, we consider a family of abstraction functions where τ is based on clusters of the variables and values of V_L.

Definition 3 (Inter/Intravariable Clusterings). Let M be an SCM over variables V.
1. A set C is said to be an intervariable clustering of V if C = {C_1, C_2, ..., C_n} is a partition of a subset of V. C is further considered admissible w.r.t. M if, for any C_i ∈ C and any V ∈ C_i, no descendant of V outside of C_i is an ancestor of any variable in C_i. That is, there exists a topological ordering of the clusters of C relative to the functions of M.
2. A set D is said to be an intravariable clustering of variables V w.r.t. C if D = {D_{C_i} : C_i ∈ C}, where D_{C_i} = {D^1_{C_i}, D^2_{C_i}, ..., D^{m_i}_{C_i}} is a partition (of size m_i) of the domain of the variables in C_i, D_{C_i} (recall that D_{C_i} is the Cartesian product D_{V_1} × D_{V_2} × ⋯ × D_{V_k} for C_i = {V_1, V_2, ..., V_k}, so elements of D^j_{C_i} take the form of tuples of value settings of C_i).

In words, intervariable clusters partition the low-level variables so that each high-level variable is described as a collection of low-level variables. Intravariable clusters then describe the domains of these high-level variables by partitioning the corresponding value spaces of the intervariable clusters.

Example 1. Consider a study on the effects of certain food dishes on body mass index (BMI), inspired by nutrition studies such as Gamba et al. (2014). Data is collected on individuals eating at restaurants, including the restaurant (R), the dish ordered (D), the amount of carbohydrates (C), fat (F), and protein (P) in the dish, and the BMI of the customer (B). That is, V_L = {R, D, C, F, P, B}. One food scientist argues that any nutritional impact of the food on BMI could be abstracted based on how many calories are in each dish. One may then be tempted to cluster the variables C, F, and P together into one variable, named calories and labeled Z. This is an example of intervariable clustering. To denote this formally, we may choose C = {C_1 = {B}, C_2 = {C, F, P}, C_3 = {D}} as the intervariable clusters. In this case, B and D are placed in their own clusters, C_1 and C_3, respectively. C, F, and P are all clustered together into C_2. R is not included and is abstracted away, which may be desirable if R is not relevant to the study.
Collectively, C_1, C_2, and C_3 form a partition of the subset of V_L without R. Each of the clusters of C will correspond to a high-level variable of V_H. In this case, for example, let Z denote the high-level variable corresponding to cluster C_2, interpreted as calories. This is shown at the top of Fig. 2 (red). The domain of C_2 contains every tuple of C, F, and P, but the domain of Z can be simplified. After all, the computation of calories can be specified as Z = 4C + 9F + 4P, which means that two sets of values (c_1, f_1, p_1) and (c_2, f_2, p_2) are considered equivalent if 4c_1 + 9f_1 + 4p_1 = 4c_2 + 9f_2 + 4p_2. This clustering of domain values is an example of intravariable clustering, shown at the bottom of Fig. 2 (blue). More formally, the intravariable clusters would be denoted D = {D_{C_1}, D_{C_2}, D_{C_3}}, where each D_{C_i} is a partition of D_{C_i}. In the case of D_{C_2}, we may define D_{C_2} = {D^1_{C_2}, D^2_{C_2}, ...}, where each D^j_{C_2} is a collection of tuples (c, f, p) ∈ D_{C_2} corresponding to some specific value of 4c + 9f + 4p. In Fig. 2, for example, D^1_{C_2} = {(c, f, p) ∈ D_{C_2} : 4c + 9f + 4p = 200}. Each of the intravariable clusters corresponds to a domain value of the high-level variable. For example, D^1_{C_2} corresponds to the value Z = 200.

Figure 2: Example of a constructive abstraction function τ w.r.t. corresponding inter/intravariable clusters. Top (intervariable): the low-level variables dish (D) and BMI (B) are in their own clusters, while restaurant (R) is abstracted away; carbohydrates (C), fat (F), and protein (P) are clustered together and mapped to a single variable, calories (Z). Bottom (intravariable): the intravariable clustering for C_2 = {C, F, P}. Calories Z can be computed from C, F, P using the formula Z = 4C + 9F + 4P, so the domain is partitioned such that two values (c_1, f_1, p_1) and (c_2, f_2, p_2) are in the same intravariable cluster if 4c_1 + 9f_1 + 4p_1 = 4c_2 + 9f_2 + 4p_2.

For the remainder of this paper, we consider settings where the intervariable clusters are admissible. Collectively, given an intervariable clustering C and an intravariable clustering D of V_L, an abstraction function τ can be defined as follows.

Definition 4 (Constructive Abstraction Function). A function τ : D_{V_L} → D_{V_H} is said to be a constructive abstraction function w.r.t. inter/intravariable clusters C and D iff
1. there exists a bijective mapping between V_H and C such that each V_{H,i} ∈ V_H corresponds to C_i ∈ C;
2. for each V_{H,i} ∈ V_H, there exists a bijective mapping between D_{V_{H,i}} and D_{C_i} such that each v^j_{H,i} ∈ D_{V_{H,i}} corresponds to D^j_{C_i} ∈ D_{C_i}; and
3. τ is composed of subfunctions τ_{C_i} for each C_i ∈ C such that v_H = τ(v_L) = (τ_{C_i}(c_i) : C_i ∈ C), where τ_{C_i}(c_i) = v^j_{H,i} if and only if c_i ∈ D^j_{C_i}.

We also apply the same notation to any W_L ⊆ V_L such that W_L is a union of clusters in C (i.e., τ(w_L) = (τ_{C_i}(c_i) : C_i ∈ C, C_i ⊆ W_L)).

In words, through the subfunction τ_{C_i}, each low-level cluster C_i ∈ C maps to a single high-level variable V_{H,i} ∈ V_H, and the value c_i ∈ D_{C_i} maps to a corresponding high-level value v^j_{H,i} ∈ D_{V_{H,i}}. Specifically, τ_{C_i}(c_i) maps to v^j_{H,i} if c_i is in the intravariable cluster D^j_{C_i}. The overall function τ is then simply the composition of the subfunctions τ_{C_i}. Intuitively, τ is a constructive abstraction function if it maps V_L to V_H by first grouping the variables by their corresponding intervariable cluster in C (red maps to yellow in Fig. 2, top), followed by assigning each cluster a value based on the intravariable cluster it belongs to in D (blue maps to green in Fig. 2, bottom).
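To make Def. 4 concrete, here is a minimal Python sketch of τ for Example 1. The calorie formula is taken from the example; the specific dish and restaurant strings and the rounding of calories to integers are illustrative choices.

```python
# A minimal sketch of a constructive abstraction function tau (Def. 4) for
# Example 1. Z = 4C + 9F + 4P is from the example; rounding calories to the
# nearest integer to form intravariable clusters is an illustrative choice.
def tau_C1(b):                      # cluster C1 = {B} -> high-level BMI (identity)
    return b

def tau_C2(c, f, p):                # cluster C2 = {C, F, P} -> calories Z
    # (c1, f1, p1) and (c2, f2, p2) land in the same intravariable cluster
    # exactly when they yield the same calorie count.
    return round(4 * c + 9 * f + 4 * p)

def tau_C3(d):                      # cluster C3 = {D} -> high-level dish (identity)
    return d

def tau(v_low):
    """Map a low-level value (r, d, c, f, p, b) to a high-level value
    (d_H, z, b_H). The restaurant R is abstracted away (no cluster)."""
    r, d, c, f, p, b = v_low
    return (tau_C3(d), tau_C2(c, f, p), tau_C1(b))

# Two low-level settings that differ in macronutrients but agree in calories
# map to the same high-level value:
v1 = ("diner", "pasta", 30, 10, 20, 27.5)   # 4*30 + 9*10 + 4*20 = 290 kcal
v2 = ("cafe",  "pasta", 40, 10, 10, 27.5)   # 4*40 + 9*10 + 4*10 = 290 kcal
assert tau(v1) == tau(v2) == ("pasta", 290, 27.5)
```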
As a result of Def. 4, V_H can be interpreted such that V_H = C and D_{V_{H,i}} = D_{C_i} for each V_{H,i} ∈ V_H. Note that the relationship between V_L and V_H modeled by τ is not causal. Rather, the contents of V_L constitute V_H. (The distinction between causal and constitutional relationships is important and is explained in detail in Appendix D.1.) Intuitively, two variables of V_L are mapped to the same intervariable cluster if they constitute the same high-level variable (e.g., two pixels of the same dog), and two values are mapped to the same intravariable cluster if, from a higher-level perspective, they are functionally identical (e.g., the same image of the dog but rotated or cropped). In this sense, intravariable clustering can be thought of as encoding invariances in the data.

This paper focuses on abstractions based on constructive abstraction functions τ created from intervariable and intravariable clusters. This is in contrast with the previous works on causal abstractions discussed in App. B, which leave the functional form of τ implicit. One benefit of making τ concrete is that it allows for a rigorous definition of equivalence between the distributions of a low-level model and those of a high-level model, as elaborated next.

### 2.2 Layer-Specific Abstractions

Ultimately, we would like to study causal properties of V_L through their higher-level counterparts V_H. A sensible goal is, therefore, to learn an SCM M_H over V_H, which can then be queried for causal inference tasks. Still, even if V_H and V_L are connected through some function τ, this alone does not imply that M_H is an abstraction of M_L, since the distributions over V_H induced by M_H may not have any clear connection with the distributions over V_L.

When two SCMs are defined over the same space of variables, one can verify that they are similar if they induce the same distributions. For example, an SCM M' is L2-consistent with M if L2(M') = L2(M), that is, M' and M match in every interventional distribution (Bareinboim et al. 2022; Xia et al. 2021). However, when two SCMs are defined over different variable spaces, comparing their distributions is no longer well-defined. Hence, a different notion of consistency is needed to compare an SCM over V_L with another over V_H through τ.

We first note that, due to the clusters, not all low-level quantities have corresponding high-level counterparts. To define the low-level counterfactual quantities that do have high-level counterparts through τ, first denote by Y_{L,*} a set of counterfactual variables over V_L, that is,

$$\mathbf{Y}_{L,*} = \big\langle \mathbf{Y}_{L,1[\mathbf{x}_{L,1}]},\ \mathbf{Y}_{L,2[\mathbf{x}_{L,2}]},\ \ldots \big\rangle, \qquad (2)$$

where each Y_{L,i[x_{L,i}]} corresponds to the potential outcomes of the variables Y_{L,i} under the intervention X_{L,i} = x_{L,i}. Each Y_{L,i} and X_{L,i} must be a union of clusters from C (i.e., Y_{L,i} = ⋃_{C ∈ C'} C for some C' ⊆ C) so that τ(Y_{L,i}) and τ(X_{L,i}) are well-defined (i.e., τ(Y_{L,i}) is obtained by applying the corresponding subfunctions τ_C to the clusters composing Y_{L,i}). For the high-level counterpart, denote

$$\mathbf{Y}_{H,*} = \tau(\mathbf{Y}_{L,*}) = \big\langle \tau(\mathbf{Y}_{L,1})_{[\tau(\mathbf{x}_{L,1})]},\ \tau(\mathbf{Y}_{L,2})_{[\tau(\mathbf{x}_{L,2})]},\ \ldots \big\rangle. \qquad (3, 4)$$

For any value y_{H,*} ∈ D_{Y_{H,*}}, denote

$$D_{\mathbf{Y}_{L,*}}(\mathbf{y}_{H,*}) = \{\mathbf{y}_{L,*} \in D_{\mathbf{Y}_{L,*}} : \tau(\mathbf{y}_{L,*}) = \mathbf{y}_{H,*}\}, \qquad (5)$$

that is, the set of all values y_{L,*} such that τ(y_{L,*}) = y_{H,*}. We can now define a notion of consistency relating low-level counterfactual quantities to their high-level counterparts.

Definition 5 (Q-τ Consistency). Let M_L and M_H be SCMs defined over variables V_L and V_H, respectively. Let τ : D_{V_L} → D_{V_H} be a constructive abstraction function w.r.t. clusters C and D.
Let

$$Q = \sum_{\mathbf{y}_{L,*} \in D_{\mathbf{Y}_{L,*}}(\mathbf{y}_{H,*})} P(\mathbf{Y}_{L,*} = \mathbf{y}_{L,*}) \qquad (6)$$

be a low-level Layer 3 quantity of interest (for some y_{H,*} ∈ D_{Y_{H,*}}), as expressed in Eq. 2, and let

$$\tau(Q) = P(\mathbf{Y}_{H,*} = \mathbf{y}_{H,*}) \qquad (7)$$

be its high-level counterpart, as expressed in Eq. 4. We say that M_H is Q-τ consistent with M_L if

$$\sum_{\mathbf{y}_{L,*} \in D_{\mathbf{Y}_{L,*}}(\mathbf{y}_{H,*})} P^{M_L}(\mathbf{Y}_{L,*} = \mathbf{y}_{L,*}) = P^{M_H}(\mathbf{Y}_{H,*} = \mathbf{y}_{H,*}), \qquad (8)$$

that is, the value of Q induced by M_L is equal to the value of τ(Q) induced by M_H (note that the equality in Eq. 8 coincides with the pushforward measure through τ). Furthermore, if M_H is Q-τ consistent with M_L for all Q ∈ Li(M_L) of the form of Eq. 6, then M_H is said to be Li-τ consistent with M_L.

Def. 5 defines the formal connection between quantities of M_L and M_H. Intuitively, M_H can only be viewed as an abstraction of M_L for the quantities in which they are τ-consistent. Note that the definition naturally applies to the L2 case (i.e., all x_{L,i} are identical) and the L1 case (i.e., all X_{L,i} = ∅). It turns out that when M_H is Q-τ consistent with M_L on all three layers of the PCH (i.e., L3-τ consistent), then M_H can be considered an abstraction of M_L on the SCM level, which coincides with the definition of constructive τ-abstractions from Beckers and Halpern (2019, Def. 3.19), as shown below.

Proposition 1 (Abstraction Connection). Let τ : D_{V_L} → D_{V_H} be a constructive abstraction function (Def. 4). M_H is L3-τ consistent (Def. 5) with M_L if and only if there exist SCMs M'_L and M'_H such that L3(M'_L) = L3(M_L), L3(M'_H) = L3(M_H), and M'_H is a constructive τ-abstraction of M'_L.

All proofs are provided in Appendix A. This proposition provides the connection between the abstractions defined in this work and established definitions from previous works. (One subtlety of this result is that it is not M_H itself that is directly a constructive τ-abstraction of M_L, but rather their L3-equivalent counterparts, M'_H and M'_L. Indeed, the definition of constructive τ-abstractions is stronger than L3-τ consistency, as shown in the proof, but in tasks concerned only with the layers of the PCH, this distinction is inconsequential.)
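As a simple illustration of Def. 5 in its L1 instance, the sketch below checks consistency empirically by pushing low-level samples through τ and comparing against the high-level distribution (the pushforward of Eq. 8). The toy samplers and τ here are hypothetical placeholders, not the paper's models.

```python
# A sketch of checking Q-tau consistency (Def. 5) for an L1 query by Monte
# Carlo: the pushforward of P_{M_L}(V_L) through tau should match P_{M_H}(V_H).
# Both samplers and tau are hypothetical placeholders.
from collections import Counter
import random

def sample_low():                       # stand-in for sampling V_L from M_L
    return (random.randint(0, 3), random.randint(0, 3))

def tau(v_low):                         # high-level value = sum (an intravariable clustering)
    c, f = v_low
    return c + f

def sample_high():                      # stand-in for sampling V_H from M_H
    return tau(sample_low())            # this M_H is L1-tau consistent by construction

def l1_tau_gap(n=100_000):
    """Largest difference, over high-level values, between the pushforward of
    the low-level distribution through tau and the high-level distribution."""
    pushforward = Counter(tau(sample_low()) for _ in range(n))
    high = Counter(sample_high() for _ in range(n))
    support = set(pushforward) | set(high)
    return max(abs(pushforward[v] - high[v]) / n for v in support)

print(l1_tau_gap())                     # close to 0 when M_H is L1-tau consistent
```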
### 2.3 Algorithmic Abstraction Construction

With the abstraction function τ defined, the notion of Q-τ consistency allows for comparisons of distributions between the low-level model M_L and the abstraction M_H. Still, it would be desirable to systematically construct M_H given M_L and τ such that M_H is Q-τ consistent with M_L for as many queries Q as possible. Moving in this direction, we first note a subtlety: for some cases of M_L, there are certain choices of C and D (and corresponding τ) for which Q-τ consistency (for some queries Q) is impossible to achieve with any choice of M_H. This phenomenon can be described formally by the following condition.

Definition 6 (Abstract Invariance Condition (AIC)). Let M_L = ⟨U_L, V_L, F_L, P(U_L)⟩ be an SCM. Let τ : D_{V_L} → D_{V_H} be a constructive abstraction function relative to C and D. The SCM M_L is said to satisfy the abstract invariance condition (AIC) with respect to τ if, for all v_1, v_2 ∈ D_{V_L} such that τ(v_1) = τ(v_2), all u ∈ D_{U_L}, and all C_i ∈ C, the following holds:

$$\tau_{\mathbf{C}_i}\Big(\big\langle f^L_V(\mathbf{pa}^{(1)}_V, \mathbf{u}_V) : V \in \mathbf{C}_i \big\rangle\Big) = \tau_{\mathbf{C}_i}\Big(\big\langle f^L_V(\mathbf{pa}^{(2)}_V, \mathbf{u}_V) : V \in \mathbf{C}_i \big\rangle\Big), \qquad (9)$$

where pa^{(1)}_V and pa^{(2)}_V are the parent values corresponding to v_1 and v_2. Then, $\widetilde{\mathbf{pa}}_V$ is used to denote any arbitrary value such that $\tau(\widetilde{\mathbf{pa}}_V) = \tau(\mathbf{pa}^{(1)}_V) = \tau(\mathbf{pa}^{(2)}_V)$.

In words, the AIC enforces that if two low-level values v_1, v_2 ∈ D_{V_L} map to the same high-level value (i.e., τ(v_1) = τ(v_2)), then for each cluster C_i ∈ C, the functions of those clusters should map to the same value regardless of U_L (i.e., the outputs of f^L_V(pa^{(1)}_V, u_V) for each V ∈ C_i should map to the same result as the outputs of f^L_V(pa^{(2)}_V, u_V) when passed through τ_{C_i}). Intuitively, this implies that two values in the same intravariable cluster have the same functional effect in the higher-level setting. It turns out that the AIC describes precisely when an appropriate M_H exists as an abstraction of the low-level model M_L, as shown by the following result.

Proposition 2 (Abstraction Conditions). For any SCM M_L and constructive abstraction function τ relative to C and D, there exists an SCM M_H over variables V_H = τ(V_L) such that M_H is L3-τ consistent with M_L if and only if there exists M'_L such that L3(M_L) = L3(M'_L) and M'_L satisfies the abstract invariance condition with respect to τ.

This critical property guarantees the existence of a high-level SCM M_H such that L3-τ consistency holds, so we will assume that the AIC holds for the rest of this work. Still, see App. D.2 for further discussion of its implications and for relaxations in cases where L3-τ consistency is not required.

With the notion of abstractions well-defined, we study how M_H can be obtained from M_L. Interestingly, when given the admissible clusterings C and D, the procedure for recovering τ and converting M_L to M_H can be carried out as shown in Alg. 1.

Algorithm 1: Constructing M_H from M_L.
Input: SCM M_L = ⟨U_L, V_L, F_L, P(U_L)⟩; admissible inter/intravariable clusters C and D satisfying the abstract invariance condition
Output: SCM M_H and τ : D_{V_L} → D_{V_H} such that M_H is L3-τ consistent with M_L
1: U_H ← U_L, P(U_H) ← P(U_L)
2: V_H ← C, D_{V_H} ← D
3: τ ← AbsFunc(C, D)  // from Def. 4
4: for C_i ∈ C do
5:   f^H_i ← τ_{C_i}(⟨f^L_V($\widetilde{\mathbf{pa}}_V$, u_V) : V ∈ C_i⟩)
6: F_H ← {f^H_i : C_i ∈ C}
7: return τ, M_H = ⟨U_H, V_H, F_H, P(U_H)⟩

Intuitively, one can obtain an abstraction M_H of M_L by first constructing the abstraction function τ using the clusterings C and D (lines 2-3), followed by designing the functions of M_H to wrap the original functions of M_L with τ (lines 4-6). This can be verified using the following result.

Proposition 3. Let τ and M_H be the function and SCM obtained from running Alg. 1 on inputs M_L, C, and D. Then M_H is L3-τ consistent with M_L.

Alg. 1 can be used to systematically obtain an abstraction M_H of the low-level model M_L, so long as M_L is provided alongside the clusters C and D. Since M_L is almost never available in practice, the following sections show how this requirement can be relaxed.
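For additional concreteness, a minimal Python sketch of Alg. 1 follows. The dictionary-based SCM encoding and the `representative` helper, which returns one low-level value in the preimage of a high-level value (any choice is valid under the AIC), are illustrative assumptions rather than the paper's implementation.

```python
# A minimal sketch of Alg. 1: wrapping the mechanisms of M_L with tau to
# obtain the mechanisms F_H of M_H. The encoding below (plain dicts) and the
# `representative` map are illustrative assumptions.
def construct_high_level_scm(f_low, parents_low, clusters, tau_sub, representative):
    """
    f_low:          dict V -> mechanism f_V(pa_values: dict, u) -> value of V
    parents_low:    dict V -> list of low-level endogenous parents of V
    clusters:       dict cluster name -> list of low-level variables (C)
    tau_sub:        dict cluster name -> tau_Ci mapping a tuple of low-level
                    values to a high-level value (induced by D)
    representative: dict cluster name -> function mapping a high-level value
                    to one low-level tuple in its preimage (valid under AIC)
    Assumes pa_high supplies a value for every parent cluster in G_C.
    """
    def make_mechanism(ci):
        def f_high(pa_high, u):
            # Decode each high-level parent value into one low-level value.
            low_vals = {}
            for cj, vh in pa_high.items():
                for v, val in zip(clusters[cj], representative[cj](vh)):
                    low_vals[v] = val
            # Evaluate the cluster's low-level mechanisms in topological order,
            # filling in intra-cluster parents as they are computed.
            pending = list(clusters[ci])
            while pending:
                for v in list(pending):
                    if all(p in low_vals for p in parents_low[v]):
                        low_vals[v] = f_low[v](
                            {p: low_vals[p] for p in parents_low[v]}, u)
                        pending.remove(v)
            # Line 5 of Alg. 1: abstract the cluster's outputs with tau_Ci.
            return tau_sub[ci](tuple(low_vals[v] for v in clusters[ci]))
        return f_high

    return {ci: make_mechanism(ci) for ci in clusters}        # F_H of M_H
```

Applied to Example 1, the `tau_sub` entries would be the subfunctions sketched after Def. 4, and `f_low` would hold the (unknown in practice) nutrition mechanisms.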
## 3 Inferences Across Abstractions

As demonstrated by Alg. 1, converting a low-level model M_L to a high-level model M_H is somewhat immediate when given full observability of the underlying SCM M_L. However, in real applications, it is rarely the case that the full specification of M_L is known. Typically, one will only be given partial information about M_L in the form of data, such as samples of the observational distribution P(V_L). The question we investigate in this section is: is it still possible to learn some M_H given the observed data?

We first note the impossibility result described by the Causal Hierarchy Theorem (CHT) (Bareinboim et al. 2022, Thm. 1), which states that a model trained to match another SCM on lower layers of the causal hierarchy (e.g., L1) will likely not match on higher layers (e.g., L2 or L3). Naturally, the same is true when it comes to inferring causal quantities across abstractions. One may be tempted to believe that M_H can be learned given L1 data from M_L by instantiating some expressive parametric model M̂_H on V_H and then training M̂_H on P(V_H) = P(τ(V_L)) such that M̂_H is L1-τ consistent with M_L. Unfortunately, such a model M̂_H will fail to generalize because, even under perfect training, M̂_H is not guaranteed to be L2-τ (or L3-τ) consistent with M_L. This means that any causal quantities induced by M̂_H will likely bear no relationship with the causal quantities induced by M_L. We show this in the next result.

Proposition 4 (Abstract Causal Hierarchy Theorem (Informal)). Given a constructive abstraction function τ : D_{V_L} → D_{V_H}, even if M_H is Li-τ consistent with M_L, M_H will almost never be Lj-τ consistent with M_L for j > i.

In words, matching across abstractions on lower layers does not guarantee the same will hold for higher layers. The consequence of this result is that causal assumptions will be necessary to make progress. Given this necessity, one type of assumption prevalent throughout the causal inference literature is the availability of a causal diagram (Pearl 1995), a graphical structure that qualitatively describes the functional relationships between variables. This assumption is weaker than assuming the availability of the entire SCM, since it does not require full detail of the generating mechanisms and exogenous distributions. Still, it has been shown that having the causal diagram allows certain inferences across layers, determined through the causal identification problem (Pearl 2000; Bareinboim and Pearl 2016).

In the context of abstractions, however, specifying the causal diagram for the true model M_L requires describing the relationships between every pair of low-level variables in V_L. This is still unrealistic in many practical settings, since there are typically too many low-level variables (e.g., 128 × 128 pixels in an image) to expect a description of the relationship between every pair, and many of these relationships may not even be well-defined in a causal manner. Instead, it may be more reasonable to specify a causal diagram over V_H (or the intervariable clusters C). When |V_H| ≪ |V_L|, the amount of information required is reduced, and the causal relationships between variables may be clearer given that the higher-level variables tend to be more explainable. The causal diagram over V_H can be viewed as a graphical abstraction of the causal diagram over V_L. This relationship can be formalized through the concept of cluster causal diagrams (C-DAGs), introduced in Anand et al. (2023).

Definition 7 (Cluster Causal Diagram (C-DAG) (Anand et al. 2023, Def. 1)). Given a causal diagram G = ⟨V, E⟩ and an admissible clustering C = {C_1, ..., C_k} of V, construct a graph G_C = ⟨C, E_C⟩ over C with a set of edges E_C defined as follows:
1. A directed edge C_i → C_j is in E_C if there exist some V_i ∈ C_i and V_j ∈ C_j such that V_i → V_j is an edge in E.
2. A dashed bidirected edge C_i ↔ C_j is in E_C if there exist some V_i ∈ C_i and V_j ∈ C_j such that V_i ↔ V_j is an edge in E.

In words, the nodes of the C-DAG G_C simply correspond to the clusters of C, and edges connect clusters C_i and C_j if they connect some V_i ∈ C_i and V_j ∈ C_j in the original causal diagram G. Interestingly, the C-DAG definition aligns with the concept of intervariable clusters, providing a way of encoding constraints in the smaller space of V_H. Following the nutrition study in Ex. 1, Fig. 3 shows the corresponding causal diagram G (left) and the simpler C-DAG G_C (right).

Figure 3: The causal diagram G over variables V_L for the nutrition study in Ex. 1 (left), with the clusters C = {D_H = {D}, Z = {C, F, P}, B_H = {B}} outlined in blue, and the corresponding C-DAG G_C (right).
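A minimal sketch of the projection in Def. 7 is given below. The edge-list encoding and the example edge set for the nutrition study are assumptions made for illustration (the paper does not spell out the full low-level graph G).

```python
# A minimal sketch of Def. 7: projecting a causal diagram G onto a C-DAG G_C.
# The graph encoding (edge lists) is an illustrative assumption.
def cluster_dag(directed, bidirected, clusters):
    """
    directed:   list of (Vi, Vj) pairs, directed edges Vi -> Vj of G
    bidirected: list of (Vi, Vj) pairs, bidirected edges Vi <-> Vj of G
    clusters:   dict cluster name -> set of low-level variables (admissible C)
    Returns the directed and bidirected edges of G_C over the cluster names.
    Variables not assigned to any cluster are simply dropped in this sketch.
    """
    member = {v: c for c, vs in clusters.items() for v in vs}
    dir_c = {(member[a], member[b]) for a, b in directed
             if a in member and b in member and member[a] != member[b]}
    bi_c = {frozenset((member[a], member[b])) for a, b in bidirected
            if a in member and b in member and member[a] != member[b]}
    return dir_c, bi_c

# Assumed low-level edges loosely following Example 1 (for illustration only),
# with a hypothetical unobserved confounder between dish and BMI:
directed = [("D", "C"), ("D", "F"), ("D", "P"),
            ("C", "B"), ("F", "B"), ("P", "B")]
bidirected = [("D", "B")]
clusters = {"D_H": {"D"}, "Z": {"C", "F", "P"}, "B_H": {"B"}}
print(cluster_dag(directed, bidirected, clusters))
# directed edges ('D_H','Z') and ('Z','B_H'); bidirected edge {'D_H','B_H'}
```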
With the constraints of G_C, we now introduce a notion of identification across abstractions to determine precisely which queries can be inferred.

Definition 8 (Abstract Identification). Let τ : D_{V_L} → D_{V_H} be a constructive abstraction function. Consider a C-DAG G_C, and let Z = {P(V_{L[z_k]}) : k = 1, ..., ℓ} be a collection of available interventional (or observational, if Z_k = ∅) distributions over V_L. Let Ω_L and Ω_H be the spaces of SCMs defined over V_L and V_H, respectively, and let Ω_L(G_C) and Ω_H(G_C) be their corresponding subsets that induce the C-DAG G_C. We say that query Q is τ-ID from G_C and Z iff, for every M_L ∈ Ω_L(G_C) and M_H ∈ Ω_H(G_C) such that M_H is Z-τ consistent with M_L, M_H is also Q-τ consistent with M_L.

This definition establishes a notion of identification between two different spaces of SCMs, Ω_L and Ω_H, connected through τ. In words, τ-identifiability implies that in every pair of SCMs M_L over V_L and M_H over V_H, matching in the graph G_C and the data Z implies a match in the query Q. Since M_L and M_H are defined over different spaces of variables, the term "match" carries some nuance. Specifically, matching in G_C means that G_C is a C-DAG for M_L and a causal diagram for M_H. Matching in Z (resp. Q) means that M_H is Z-τ consistent (resp. Q-τ consistent) with M_L. On the other hand, τ-nonidentifiability implies that there exists a pair of models M_L over V_L and M_H over V_H such that M_L and M_H match in both G_C and Z yet still do not match in Q. This means that, despite the constraints added through the C-DAG G_C, there are still queries that cannot be inferred across τ due to nonidentifiability. This is more acute when there is a large amount of unobserved confounding.

The definition of τ-ID provides rigorous semantics for answering whether a query can be inferred across abstractions. The next step is to establish an approach for determining τ-ID given the available data and graph. For this purpose, one fundamental result is that the notion of τ-ID is actually equivalent to classical identification in the higher-level space.

Theorem 1 (Dual Abstract ID). Q is τ-ID from G_C and Z if and only if τ(Q) is ID from G_C and τ(Z).

This result is powerful since it implies that inferences can be made about the low-level space by using existing results in the high-level space. Notably, since our goal is to learn a higher-level SCM M_H to make inferences about M_L, we can build on the machinery of Neural Causal Models (NCMs) (Xia et al. 2021). NCMs allow one to take the graph G_C as an inductive bias (a G_C-NCM, as described in Def. 2), and they can leverage gradient methods to fit any SCM within the constrained space. Indeed, identification in NCMs can be shown to be equivalent to classical identification when considering models of the same granularity (Xia, Pan, and Bareinboim 2023, Thm. 3). When combined with Thm. 1, this implies the following result.

Corollary 1 (Abstract ID with NCMs). Q is τ-ID from G_C and Z iff τ(Q) is Neural-ID from Ω̂(G_C) and τ(Z). Moreover, if it is ID, then Q can be computed by evaluating τ(Q) by definition from any G_C-NCM M̂ that is τ(Z)-consistent.

In words, determining τ-ID is equivalent to determining neural identification (identification in the space of NCMs) on the space of V_H. Further, to compute Q in the identifiable case, τ(Q) can be queried from any G_C-NCM M̂ that is τ(Z)-consistent. Corol. 1 implies that we can perform causal identification and estimation across abstractions using the Neural ID algorithm (Xia, Pan, and Bareinboim 2023, Alg. 1) on the high-level space. This procedure is shown in Alg. 2. First, τ is constructed as described in Def. 4 given the clusters. Then, a G_C-NCM is constructed over the high-level variables V_H. Two parameterizations of the NCM are created. Both are optimized to fit the transformed data τ(Z), but one is optimized to maximize the transformed query τ(Q) while the other is optimized to minimize it. If both parameterizations return the same result, then it must be the true value of the query; otherwise, the query is not identifiable. To implement this algorithm in practice, we leverage the GAN-NCM introduced in Xia, Pan, and Bareinboim (2023); see details in App. C.

Algorithm 2: Neural Abstract ID. Identifying and estimating queries across abstractions using NCMs.
Input: query Q, L2 datasets Z(M_L), C-DAG G_C, and admissible inter/intravariable clusters C and D satisfying the AIC
Output: Q(M_L) if identifiable, FAIL otherwise.
1: V_H ← C, D_{V_H} ← D
2: τ ← AbsFunc(C, D)  // from Def. 4
3: M̂ ← NCM(V_H, G_C)  // from Def. 2
4: θ*_min ← arg min_θ τ(Q)(M̂(θ)) s.t. τ(Z)(M̂(θ)) = τ(Z(M_L))
5: θ*_max ← arg max_θ τ(Q)(M̂(θ)) s.t. τ(Z)(M̂(θ)) = τ(Z(M_L))
6: if τ(Q)(M̂(θ*_min)) ≠ τ(Q)(M̂(θ*_max)) then
7:   return FAIL
8: else
9:   return τ(Q)(M̂(θ*_min))  // choose min or max arbitrarily
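The optimization skeleton underlying lines 4-6 of Alg. 2 can be sketched as follows. The query and data-matching objectives are hypothetical placeholders, and the penalty-based formulation below stands in for the GAN-NCM training used in the actual implementation.

```python
# A sketch of the min/max optimization in Alg. 2 (Neural Abstract ID): two
# parameterizations of a G_C-NCM are fit to the abstracted data tau(Z), one
# minimizing and one maximizing the transformed query tau(Q). The interfaces
# (make_ncm, query, data_fit_loss) are hypothetical placeholders.
import torch

def neural_abstract_id(make_ncm, query, data_fit_loss,
                       steps=2000, lam=1.0, lr=0.01, tol=0.03):
    """
    make_ncm:      () -> torch.nn.Module, a fresh G_C-constrained NCM over V_H
    query:         model -> scalar tensor, differentiable estimate of tau(Q)
    data_fit_loss: model -> scalar tensor, penalty for mismatch with tau(Z)
    Returns (estimate, identifiable flag), mirroring the output of Alg. 2.
    """
    estimates = []
    for sign in (1.0, -1.0):                    # minimize, then maximize tau(Q)
        model = make_ncm()
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = sign * query(model) + lam * data_fit_loss(model)
            loss.backward()
            opt.step()
        estimates.append(query(model).item())
    q_min, q_max = estimates
    identifiable = abs(q_max - q_min) <= tol    # gap near 0  =>  tau-ID
    return (q_min if identifiable else None), identifiable

# Toy usage with stand-ins (not a real causal model): a single parameter is
# pinned by the "data fit" term, so the min and max estimates coincide.
make = lambda: torch.nn.Linear(1, 1, bias=False)
q = lambda m: m.weight.sum()
fit = lambda m: (m.weight.sum() - 0.5) ** 2
print(neural_abstract_id(make, q, fit, lam=100.0))
```

The hard constraint of Alg. 2 is replaced here by a Lagrangian-style penalty weighted by `lam`, which is one common way to implement such constrained fits in practice.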
Alg. 2 is sound and complete for solving the abstract identification problem, as shown below.

Corollary 2 (Soundness and Completeness). Let M_L be the low-level SCM, C and D be inter/intravariable clusters of V_L, G_C be a C-DAG, Q be a query, and Q̂ be the result of running Alg. 2 with inputs Z(M_L) > 0, C, D, G_C, and Q. Then, Q is τ-ID from G_C and Z if and only if Q̂ is not FAIL. Moreover, if Q̂ is not FAIL, then Q̂ = Q(M_L).

While Alg. 2 solves the abstract ID problem, the consequences of the results in this section are more general. Notably, if Q is indeed τ-ID (which can be verified through Alg. 2), the algorithm produces a neural model M̂ that serves as a proxy SCM that is Q-τ consistent with the true model M_L. Such a model could serve as a generative model of the distribution Q, which has many uses. The samples generated from such a model could be used to estimate the query, or, in more complex settings such as with image data, it may be desirable simply to have novel generated samples consistent with the causal invariances embedded in the system.

## 4 Representations in Learning Abstractions

In many applications, the choice of intervariable clusters C is natural and can be made in tandem with deciding the assumptions of the C-DAG G_C (see App. D.1 for best practices on how to choose or learn intervariable clusters). However, fully specifying the intravariable clusters D is quite challenging when working with high-dimensional data like images. Doing so would require an enumeration of every possible image along with a label designating each one to a cluster. In this section, we investigate the problem of learning abstractions when the intravariable clusters D are left unspecified. While coarser clusters tend to be better in practice due to the dimensionality reduction, the theory in this paper applies to any choice of D so long as the AIC (Def. 6) holds. Hence, a possible constraint when learning D is to find a set of clusters such that the AIC is not violated.
To this effect, the following result can be leveraged.

Proposition 5. M_L is guaranteed to satisfy the AIC w.r.t. τ iff D_{C_i} partitions D_{C_i} into singletons (i.e., each value of D_{C_i} forms its own intravariable cluster) for all C_i ∈ C.

In other words, this means that Alg. 2 can be applied in any case where τ_{C_i} is a bijective mapping between D_{C_i} and D_{V_{H,i}}. Also implied by this result is that, without additional information, one cannot choose any coarser clustering without potentially violating the AIC. (In many cases, there may be additional information in the form of invariances, e.g., rotational invariance in image data; such information can be leveraged to learn coarser clusters. See Appendix D.3 for more details.) While this choice of D does not reduce the size of the abstracted space, it means that we are not restricted to the original space of V_L and can choose any V_H with the same cardinality. In practice, this means that we can choose the option for V_H that is most beneficial for our task. Leveraging this insight, we introduce the representational NCM.

Definition 9 (Representational NCM (RNCM)). A representational NCM (RNCM) is a tuple ⟨τ̂, M̂⟩, where τ̂(v_L; θ_τ) is a function parameterized by θ_τ mapping from V_L to V_H, and M̂ is an NCM defined over V_H. A G_C-constrained RNCM (G_C-RNCM) is an RNCM ⟨τ̂, M̂⟩ such that τ̂ is composed of subfunctions τ̂_{C_i} for each C_i ∈ C (each with its own parameters θ_{τ_{C_i}}), and M̂ is a G_C-NCM.

In an RNCM, the abstraction function τ̂ is a trainable parameterized function, and the NCM M̂ is trained over the resulting space mapped by τ̂. Fig. 4 shows an example illustrating the difference between the RNCM and a standard NCM.

Figure 4: Example comparison between (b) the G_C-NCM and (c) the G_C-RNCM, with G_C shown in (a). Functions of the NCM directly output values of the lower-level variables (grouped by the clusters in C), while functions of the RNCM output values of their higher-level counterparts, mapped by τ̂.

Training can be done in a two-step procedure: first, τ̂ is trained to map to an optimal task-specific space, and then M̂ is trained on τ̂(V_L) (e.g., through Alg. 2). To enforce bijectivity between D_{C_i} and D_{V_{H,i}}, as suggested by Prop. 5, one can train τ̂ in an autoencoder-like setup (Kramer 1991; Kingma and Welling 2014) with a reconstruction loss. τ̂ can be thought of as a function mapping to a representation space, making this approach amenable to the wide developments of the representation learning literature (Bengio, Courville, and Vincent 2013). We empirically demonstrate this approach in the experiment of Sec. 5.2.
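A minimal sketch of the two-step RNCM training (Def. 9) is shown below. The autoencoder architecture, the 3×28×28 image shape, and the training loop are illustrative assumptions, not the paper's architecture.

```python
# A minimal sketch of the representational NCM (Def. 9): a trainable
# abstraction tau_hat for an image-valued cluster, trained autoencoder-style
# with a reconstruction loss so that it stays (approximately) bijective on
# the observed values (cf. Prop. 5); an NCM is then trained over tau_hat(V_L).
import torch
import torch.nn as nn

class TauHat(nn.Module):
    """tau_hat_Ci for an image cluster (e.g., I in the Colored MNIST C-DAG)."""
    def __init__(self, dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 28 * 28, 256),
                                 nn.ReLU(), nn.Linear(256, dim))
        self.dec = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                 nn.Linear(256, 3 * 28 * 28))

    def forward(self, x):               # x: (batch, 3, 28, 28) low-level value
        return self.enc(x)              # high-level representation value

    def reconstruction_loss(self, x):
        x_hat = self.dec(self.enc(x)).view_as(x)
        return ((x - x_hat) ** 2).mean()

# Step 1: train tau_hat on low-level data with the reconstruction objective.
tau_hat = TauHat()
opt = torch.optim.Adam(tau_hat.parameters(), lr=1e-3)
images = torch.rand(64, 3, 28, 28)      # placeholder batch standing in for data
for _ in range(10):
    opt.zero_grad()
    loss = tau_hat.reconstruction_loss(images)
    loss.backward()
    opt.step()

# Step 2 (not shown): train a G_C-NCM over the abstracted variables, feeding
# tau_hat(images) in place of raw pixels, e.g., via Alg. 2.
representation = tau_hat(images).detach()
```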
## 5 Experiments

In this section, we empirically evaluate the effects of utilizing abstractions in causal inference tasks. Details of the data-generating models and architectures can be found in Appendix C. Implementation code is publicly available at https://github.com/CausalAILab/NeuralCausalAbstractions.

### 5.1 Nutritional Study

We perform the toy study on nutrition depicted in Ex. 1. Since a BMI of 25 or over is considered overweight, the goal is to identify and estimate the query Q = P(B_{D=d} ≥ 25) (the causal effect of diet on weight) using Alg. 2. R and D are 32-dimensional one-hot vectors, and the other variables are real-valued, so the query may be difficult to answer given such high-dimensional variables. Instead, it may be more effective to work in an abstract space with the proposed intervariable clusters C = {D_H = {D}, Z = {C, F, P}, B_H = {B}}. The original graph G and the corresponding C-DAG G_C are shown in Fig. 3. We are also given intravariable clusters D such that the values of D_H, Z, and B_H are clustered into binary categories. Specifically, D_H = 1 denotes unhealthy dishes, Z = 1 denotes a high calorie count, and B_H = 1 denotes an overweight BMI (≥ 25). We compare the effectiveness of identifying and estimating Q with NCMs in both the original setting over V_L and in the abstracted setting over V_H computed using the constructive abstraction function τ defined on C and D.

The results are shown in Fig. 6. Since Q is identifiable, the gap between the max and min queries computed in Alg. 2 is expected to be as small as possible. As shown in Fig. 6a, the proposed approach converges quickly while the others fail to close the gap. Fig. 6b also shows that the proposed approach can estimate Q with significantly lower error.

Figure 6: Results of the nutrition experiment. Our approach (blue) is compared with a GAN-NCM trained on raw data (red) and one trained on normalized data (yellow). (a) Gaps between the max and min query across 1000 training iterations when running Alg. 2. (b) Mean absolute error (MAE) vs. dataset size (in log-log scale) for query estimation.

### 5.2 Colored MNIST Digits

We evaluate the RNCM on a high-dimensional image dataset of colorized MNIST (Deng 2012) digits. Each image (I) has a corresponding digit (D) and color (C) label, and their relationships are shown in the C-DAG G_C in Fig. 7a. Color and digit are highly correlated (e.g., 0s are typically red, while 5s are cyan), as shown in Fig. 7b.

Figure 7: Colored MNIST experimental setup. (a) G_C for Colored MNIST. (b) Image samples; digits are highly correlated with their corresponding gradient color.

We evaluate three approaches on the task of sampling images from causal queries. The first approach is a naïve conditional GAN that does not take causality into account. The second is a standard GAN-NCM as described in Xia, Pan, and Bareinboim (2023). The third is our approach described in Sec. 4, a representational NCM also implemented as a GAN, called the GAN-RNCM. Samples of the results are shown in Fig. 5.

Figure 5: Colored MNIST results. Samples from various causal queries (top) are collected from competing approaches (left).

All models are capable of producing digit images, as shown in the first column. The second column illustrates P(I | D = 0), the images conditioned on digit 0; many red 0s are expected since most 0s are red in the dataset. The third column illustrates the interventional query P(I_{D=0}), the images with digits forced to be 0 through intervention; since interventions ignore spurious correlations, 0s of all colors are expected. Finally, the fourth column illustrates the counterfactual query P(I_{D=0} | D = 5), indicating what the digits would have looked like had they been 0, given that they were originally 5. Since 5s tend to be cyan, the samples are expected to be 0s that retain the cyan color of the original 5s. In all cases, the GAN-RNCM produces results close to the expected output, while the other approaches have difficulty disentangling color from digit.

## 6 Conclusions

Through the notions of inter/intravariable clusters and Q-τ consistency, we introduced a new family of abstractions allowing analysis of individual PCH distributions. We proved that ID across abstractions is equivalent to classical ID (Thm. 1) and provided a sound and complete algorithm to perform such inferences (Alg. 2).
We provided a relaxation of intravariable clusters leveraging representation learning through the RNCM (Def. 9). Finally, we demonstrated empirically that abstractions are vital in high-dimensional settings.

## Acknowledgements

This research was supported in part by the NSF, ONR, AFOSR, DARPA, DoE, Amazon, JP Morgan, and The Alfred P. Sloan Foundation.

## References

Anand, T. V.; Ribeiro, A. H.; Tian, J.; and Bareinboim, E. 2023. Causal Effect Identification in Cluster DAGs. In Proceedings of the 37th AAAI Conference on Artificial Intelligence. AAAI Press.

Bareinboim, E.; Correa, J. D.; Ibeling, D.; and Icard, T. 2022. On Pearl's Hierarchy and the Foundations of Causal Inference. In Probabilistic and Causal Inference: The Works of Judea Pearl, 507–556. New York, NY, USA: Association for Computing Machinery, 1st edition.

Bareinboim, E.; and Pearl, J. 2016. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27): 7345–7352.

Beckers, S.; Eberhardt, F.; and Halpern, J. Y. 2019. Approximate Causal Abstraction. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence.

Beckers, S.; and Halpern, J. Y. 2019. Abstracting Causal Models. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI'19/IAAI'19/EAAI'19. AAAI Press. ISBN 978-1-57735-809-1.

Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation Learning: A Review and New Perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8): 1798–1828.

Deng, L. 2012. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6): 141–142.

Du, X.; Sun, L.; Duivesteijn, W.; Nikolaev, A.; and Pechenizkiy, M. 2020. Adversarial Balancing-based Representation Learning for Causal Effect Inference with Observational Data. arXiv:1904.13335.

Gamba, R.; Schuchter, J.; Rutt, C.; and Seto, E. 2014. Measuring the Food Environment and its Effects on Obesity in the United States: A Systematic Review of Methods and Results. Journal of Community Health, 40(3): 464–475.

Goudet, O.; Kalainathan, D.; Caillou, P.; Guyon, I.; Lopez-Paz, D.; and Sebag, M. 2018. Learning functional causal models with generative neural networks. In Explainable and Interpretable Models in Computer Vision and Machine Learning, 39–80. Springer.

Graves, A.; and Jaitly, N. 2014. Towards End-To-End Speech Recognition with Recurrent Neural Networks. In Xing, E. P.; and Jebara, T., eds., Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, 1764–1772. Beijing, China: PMLR.

Guo, R.; Cheng, L.; Li, J.; Hahn, P. R.; and Liu, H. 2020. A Survey of Learning Causality with Data. ACM Computing Surveys, 53(4): 1–37.

Ibeling, D.; and Icard, T. 2020. Probabilistic reasoning across the causal hierarchy. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 10170–10177.

Johansson, F. D.; Shalit, U.; and Sontag, D. 2016. Learning Representations for Counterfactual Inference. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, 3020–3029. JMLR.org.

Kallus, N. 2020. DeepMatch: Balancing Deep Covariate Representations for Causal Inference Using Adversarial Training. In III, H. D.; and Singh, A., eds., Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, 5067–5077. PMLR.
Kingma, D. P.; and Welling, M. 2014. Auto-Encoding Variational Bayes. In Bengio, Y.; and LeCun, Y., eds., 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.

Kocaoglu, M.; Snyder, C.; Dimakis, A. G.; and Vishwanath, S. 2018. CausalGAN: Learning Causal Implicit Generative Models with Adversarial Training. In International Conference on Learning Representations.

Kramer, M. A. 1991. Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37(2): 233–243.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Pereira, F.; Burges, C. J. C.; Bottou, L.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems, volume 25, 1097–1105. Curran Associates, Inc.

Li, S.; and Fu, Y. 2017. Matching on Balanced Nonlinear Representations for Treatment Effects Estimation. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 30, 929–939. Curran Associates, Inc.

Louizos, C.; Shalit, U.; Mooij, J.; Sontag, D.; Zemel, R.; and Welling, M. 2017. Causal Effect Inference with Deep Latent-Variable Models. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, 6449–6459. Red Hook, NY, USA: Curran Associates Inc. ISBN 9781510860964.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari With Deep Reinforcement Learning. In NIPS Deep Learning Workshop.

Pearl, J. 1995. Causal diagrams for empirical research. Biometrika, 82(4): 669–688.

Pearl, J. 2000. Causality: Models, Reasoning, and Inference. New York, NY, USA: Cambridge University Press, 2nd edition.

Pearl, J.; and Mackenzie, D. 2018. The Book of Why. New York: Basic Books.

Rubenstein, P. K.; Weichwald, S.; Bongers, S.; Mooij, J.; Janzing, D.; Grosse-Wentrup, M.; and Schölkopf, B. 2017. Causal Consistency of Structural Equation Models. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence.

Shalit, U.; Johansson, F. D.; and Sontag, D. 2017. Estimating individual treatment effect: generalization bounds and algorithms. In Precup, D.; and Teh, Y. W., eds., Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 3076–3085. International Convention Centre, Sydney, Australia: PMLR.

Shi, C.; Blei, D. M.; and Veitch, V. 2019. Adapting Neural Networks for the Estimation of Treatment Effects. In Wallach, H. M.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E. B.; and Garnett, R., eds., Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, 2503–2513.

Xia, K.; and Bareinboim, E. 2023. Neural Causal Abstractions. Technical Report R-101, Columbia University, Department of Computer Science, New York.

Xia, K.; Lee, K.-Z.; Bengio, Y.; and Bareinboim, E. 2021. The Causal-Neural Connection: Expressiveness, Learnability, and Inference. In Ranzato, M.; Beygelzimer, A.; Dauphin, Y.; Liang, P.; and Vaughan, J. W., eds., Advances in Neural Information Processing Systems, volume 34, 10823–10836. Curran Associates, Inc.
Xia, K.; Pan, Y.; and Bareinboim, E. 2023. Neural Causal Models for Counterfactual Identification and Estimation. In Proceedings of the 11th International Conference on Learning Representations (ICLR-23).

Yao, L.; Li, S.; Li, Y.; Huai, M.; Gao, J.; and Zhang, A. 2018. Representation Learning for Treatment Effect Estimation from Observational Data. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 31, 2633–2643. Curran Associates, Inc.

Yoon, J.; Jordon, J.; and van der Schaar, M. 2018. GANITE: Estimation of Individualized Treatment Effects using Generative Adversarial Nets. In International Conference on Learning Representations.