# abstracting_causal_models__173efcd4.pdf The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19) Abstracting Causal Models Sander Beckers Dept. of Philosophy and Religious Studies Utrecht University Utrecht, Netherlands srekcebrednas@gmail.com Joseph Y. Halpern Dept. of Computer Science Cornell University Ithaca, NY 14853 halpern@cs.cornell.edu We consider a sequence of successively more restrictive definitions of abstraction for causal models, starting with a notion introduced by Rubenstein et al. (2017) called exact transformation that applies to probabilistic causal models, moving to a notion of uniform transformation that applies to deterministic causal models and does not allow differences to be hidden by the right choice of distribution, and then to abstraction, where the interventions of interest are determined by the map from low-level states to high-level states, and strong abstraction, which takes more seriously all potential interventions in a model, not just the allowed interventions. We show that procedures for combining micro-variables into macro-variables are instances of our notion of strong abstraction, as are all the examples considered by Rubenstein et al. 1 Introduction We can and typically do analyze problems at different levels of abstraction. For example, we can try to understand human behavior by thinking at the level of neurons firing in the brain or at the level of beliefs, desires, and intentions. A political scientist might try to understand an election in terms of individual voters or in terms of the behavior of groups such as midwestern blue-collar workers. Since, in these analyses, we are typically interested in the causal connections between variables, it seems reasonable to model the various levels of abstraction using causal models (Halpern 2016; Pearl 2000). The question then arises whether a high-level macro causal model (e.g., one that considers beliefs, desires, and intentions) is a faithful abstraction of a low-level micro model (e.g., one that describes things at the neuronal level). What should this even mean? Perhaps the most common way to approach the question of abstraction is to cluster micro-variables in the low-level model into a single macro-variable in the highlevel model (Chalupka, Eberhardt, and Perona 2015; 2016; Iwasaki and Simon 1994). Of course, one has to be careful to do this in a way that preserves the causal relationships in the low-level model. For example, we do not want to cluster variables X, Y , and Z into a single variable X +Y +Z if different settings (x, y, z) and (x , y , z ) such that x + y + z = x + y + z lead to different outcomes. Copyright c 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Rubenstein et al. (2017) (RW+ from now on) provided an arguably more general approach to abstraction. They defined a notion of an exact transformation between two causal models. They suggest that if there is an exact transformation τ from causal model M1 to M2, then we should think of M2 as an abstraction of M1, so that M2 is the high-level model and M1 is the low-level model. Abstraction almost by definition involves ignoring inessential differences. So it seems that RW+ would want to claim that if there exists an exact transformation from M1 and M2, then M2 and M1 are the same, except for inessential differences . This leads to the obvious question: what counts as an inessential difference? Of course, this is to some extent in the eye of the beholder, and may well depend on the application. Nevertheless we claim that the notion of inessential difference implicitly encoded in the definition of exact transformation is far too broad. As we show by example, there are models that we would view as significantly different that are related by exact transformations. There are two reasons for this. The first is that, because RW+ consider probabilistic causal models, some differences that are intuitively significant are overlooked by considering just the right distributions. Second, besides a function that maps low-level states to high-level states, RW+ define a separate mapping of interventions that can mask what we view as essential differences between the interventions allowed at the low level and the high level. In this paper, we consider a sequence of successively more restrictive definitions of abstraction, starting with the RW+ notion of exact transformation, moving to a notion of uniform transformation that applies to deterministic causal models and does not allow differences to be hidden by the right choice of distribution, and then to abstraction, where the mapping between the interventions is determined by the mapping from low-level states to high-level states, and strong abstraction, which takes more seriously all potential interventions in a model, not just the allowed interventions. Finally, we define constructive abstraction, which is the special case of strong abstraction where the mapping from lowlevel states to high-level states partitions the low-level variables and maps each cell to a unique high-level variable. As we show, procedures for combining micro-variables into macro-variables are instances of constructive abstraction, as are all the other examples considered by RW+. While we view constructive abstraction as the notion that is likely to be the most useful in practice, as we show by example, the weaker notions of strong abstraction and abstraction are of interest as well. Not surprisingly, the idea of abstracting complicated lowlevel models to simpler high-level models that in some sense act the same way has also been considered in other settings; see, for example, (Binahashemi, de Giacomo, and Lesp erance 2017). While we are trying to capture these intuitions as well, considering a setting that involves causality adds new subtleties. 2 Probabilistic causal models: a review In this section we review the definition of causal models. Much of the discussion is taken from (Halpern 2016). Definition 2.1: A signature S is a tuple (U, V, R), where U is a set of exogenous variables (intuitively, variables that represent factors outside the control of the model), V is a set of endogenous variables (intuitively, variables whose values are ultimately determined by the values of the endogenous variables), and R, a function that associates with every variable Y U V a nonempty set R(Y ) of possible values for Y (i.e., the set of values over which Y ranges). If X = (X1, . . . , Xn), R( X) denotes the crossproduct R(X1) R(Xn). Definition 2.2: A basic causal model M is a pair (S, F), where S is a signature and F defines a function that associates with each endogenous variable X a structural equation FX giving the value of X in terms of the values of other endogenous and exogenous variables (discussed in more detail below). A causal model M is a tuple (S, F, I), where (S, F) is a basic causal model and I is a set of allowed interventions (also discussed in more detail below). Formally, the equation FX maps R(U V {X}) to R(X), so FX determines the value of X, given the values of all the other variables in U V. Note that there are no functions associated with exogenous variables; their values are determined outside the model. We call a setting u of values of exogenous variables a context. The value of X may depend on the values of only a few other variables. Y depends on X in context u if there is some setting of the endogenous variables other than X and Y such that if the exogenous variables have value u, then varying the value of X in that context results in a variation in the value of Y ; that is, there is a setting z of the endogenous variables other than X and Y and values x and x of X such that FY (x, z, u) = FY (x , z, u). In this paper we restrict attention to recursive (or acyclic) models, that is, models where, for each context u, there is a partial order u on variables such that if X u Y , then Y depends on X in context u. In a recursive model, given a context u, the values of all the remaining variables are determined (we can just solve for the value of the variables in the order given by u). A model is strongly recursive if the partial order u is independent of u; that is, there is a partial order such that = u for all contexts u. In a strongly recursive model, we often write the equation for an endogenous variable as X = f( Y ); this denotes that the value of X depends only on the values of the variables in Y , and the connection is given by f. For example, we might have X = Y + U. 1 An intervention has the form X x, where X is a set of endogenous variables. Intuitively, this means that the values of the variables in X are set to x. The structural equations define what happens in the presence of external interventions. Setting the value of some variables X to x in a causal model M = (S, F) results in a new causal model, denoted M X x, which is identical to M, except that F is replaced by F X x: for each variable Y / X, F X x Y = FY (i.e., the equation for Y is unchanged), while for each X in X, the equation FX for is replaced by X = x (where x is the value in x corresponding to x). The set I of interventions can be viewed as the set of interventions that we care about for some reason or other. For example, it might consist of the interventions that involve variables and values that are under our control. In (Halpern and Pearl 2005; Halpern 2016), only basic causal models are considered (and are called causal models). RW+ added the set of allowed interventions to the model. We consider allowed interventions as well, since it seems useful when considering abstractions to describe the set of interventions of interest. We sometimes write a causal model M = (S, F, I) as (M , I), where M is the basic causal model (S, F), if we want to emphasize the role of the set of interventions. Given a signature S = (U, V, R), a primitive event is a formula of the form X = x, for X V and x R(X). A causal formula (over S) is one of the form [Y1 y1, . . . , Yk yk]ϕ, where ϕ is a Boolean combination of primitive events, Y1, . . . , Yk are distinct variables in V, and Such a formula is abbreviated as [ Y y]ϕ. The special case where k = 0 is abbreviated as ϕ. Intuitively, [Y1 y1, . . . , Yk yk]ϕ says that ϕ would hold if Yi were set to yi, for i = 1, . . . , k. A causal formula ψ is true or false in a causal model, given a context. As usual, we write (M, u) |= ψ if the causal formula ψ is true in causal model M given context u. The |= relation is defined inductively. (M, u) |= X = x if the variable X has value x in the unique (since we are dealing with recursive models) solution to the equations in M in context u (i.e., the unique vector of values that simultaneously satisfies all equations in M with the variables in U set to u). The 1RW+ do not restrict to acyclic models. Rather, they make the weaker restriction that, for every setting of the causal variables, with probability 1, there is a unique solution to the equations. In the deterministic setting, the analogous restriction would be to consider causal models where there is a unique solution to all equations. None of our definitions or results changes if we allow this more general class of models. We have restricted to recursive models only to simplify the exposition. truth of conjunctions and negations is defined in the standard way. Finally, (M, u) |= [ Y y]ϕ if (M Y y, u) |= ϕ. To simplify notation, we sometimes write M( u) to denote the unique element of R(V) such that (M, u) |= V = v. Similarly, given an intervention Y y, M( u, Y y) denotes the unique element of R(V) such that (M, u) |= [ Y y](V = v). A probabilistic causal model M = (S, F, I, Pr) is just a causal model together with a probability Pr on contexts. We often abuse notation slightly and denote the probabilistic causal model (S, F, I, Pr) as (M, Pr), where M is the underlying deterministic causal model (S, F, I). RW+ worked with probabilistic causal models, but added one more feature and made a restrictive assumption. They consider models M that place a partial order M on interventions. However, they (and we) consider only what they call the natural partial order, where ( X x) M ( X x ) if X is a subset of X and x is the corresponding subset of x , so we do not explicitly introduce the partial order as a component of the model here. In addition, RW+ assume that for each endogenous variable X, there is a unique exogenous variable UX such that UX is the only exogenous variable on whose value X depends, and UX = UY if X = Y . We say that a causal model has unique exogenous variables (uev) if this is the case. Assuming that a causal model has uev makes sense if we think of UX as the noise variable corresponding to X. However, this assumption is not always appropriate (e.g., if we take the temperature to be exogenous, and temperature can affect a number of endogenous variables). Not surprisingly, in (non-probabilistic) causal models, assuming uev entails a significant loss of generality. In particular, we cannot express the correlation in values between two endogenous variables due to being affected by a common exogenous variable. However, the uev assumption can be made essentially without loss of generality in probabilistic causal models, as the lemma below shows. Definition 2.3 : Two probabilistic causal models M = ((U, V, R), F, I, Pr) and M = ((U , V , R ), F , I , Pr ) are equivalent, written M M , if V = V , R(Y ) = R (Y ) for all Y V, I = I , and all causal formulas have the same probability of being true in both M and M ; that is, for all causal formulas ϕ, we have Pr({ u R(U) : (M, u) |= ϕ}) = Pr ({ u R(U ) : (M , u ) |= ϕ}). Lemma 2.4: Given a probabilistic causal model M, there is a probabilistic causal model M with uev such that M M .2 All the models that we consider in our examples have uev. Whatever problems there are with the RW+ notions, they do not arise from the assumption that models have uev. 3 From exact transformations to abstractions In this section, we review the RW+ definition, point out some problems with it, and then consider a sequence of strengthenings of the definition. 2Proofs can be found in the appendix. 3.1 Exact transformations We need some preliminary definitions. First observe that, given a probabilistic model M = ((U, V, R), F, I, Pr), the probability Pr on R(U) can also be viewed as a probability on R(V) (since each context in R(U) determines a unique setting of the variables in V); more precisely, Pr( v) = Pr({ u : M( u) = v}). In the sequel, we freely view Pr as a distribution on both R(U) and R(V); the context should make clear which we intend. Each intervention X x also induces a probability Pr X x on R(V) in the obvious way: Pr X x( v) = Pr({ u : M( u, X x) = v}). One last piece of notation: We are interested in when a high-level model is an abstraction of a low-level model. In the sequel, we always use ML = ((UL, VL, RL), FL, IL) and MH = ((UH, VH, RH), FH, IH) to denote deterministic causal models (where the L and H stand for low level and high level, respectively). We write (M, Pr) to denote a probabilistic causal model that extends M. With this background, we can give the RW+ definition of exact transformation. Although the definition was given for probabilistic causal models that satisfy uev, it makes sense for arbitrary probabilistic causal models. Definition 3.1: If (ML, Pr L) and (MH, Pr L) are probabilistic causal models, ω : IL IH is an order-preserving, surjective mapping (where ω is order-preserving if, for all interventions in i1, i2 IL such that i1 ML i2 according to the natural order, we have ω(ii) MH ω(i2)), and τ : RL(VL) RH(VH), then (MH, Pr H) is an exact (τω)-transformation of (ML, Pr L) if, for every intervention Y y IL, we have Prω( Y y) H = τ(Pr Y y L ), (1) where τ(Pr L) is the pushforward distribution on R(VH) determined by τ and Pr L: τ(Pr L)( v H) = Pr L({ v L : τ( v L) = v H}). The key point here is the requirement that Prω( Y y) H = τ(Pr Y y L ). Roughly speaking, it says that if you start from the low-level intervention Y y and move up to the highlevel model following two distinct routes, you end up at the same place. The first route goes as follows. The intervention Y y changes the probability distribution on low-level outcomes, giving rise to Pr Y y L (where an outcome is a setting of the endogenous variables). This distribution can be moved up to the high level by applying τ, giving τ(Pr Y y L ), which is a distribution on high-level outcomes. The second route goes as follows. From the low-level intervention Y y we move up to a high-level intervention by applying ω, giving ω( Y y). This intervention changes the probability distribution on high-level outcomes, giving rise to Prω( Y y) H . To be an exact transformation means that this distribution and the previous one are identical, for all interventions Y y. Despite all the notation, we hope that the intuition is clear: the intervention Y y acts the same way in the low-level model as the intervention ω( Y y) does in the high levelmodel. (See RW+ for more discussion and intuition.) The following example illustrates Definition 3.1. Example 3.2: Consider a simple voting scenario where we have 99 voters who can either vote for or against a proposition. The campaign for the proposition can air some subset of two advertisements to try to influence how the voters vote. The low-level model is characterized by endogenous variables Xi, i = 1, . . . , 99, A1, A2, and T, and exogenous variables Ui, i = 1, . . . , 101. Xi denotes voter i s vote, so Xi = 1 if voter i votes for the proposition, and Xi = 0 if voter i votes against. Ai denotes whether add i is run, and T denotes the total number of votes for the proposition. Ui determines how voter i votes as a function of which ads are run for i = 1, . . . , 99, while U100 and U101 determine A1 and A2, respectively. We can cluster the voters into three groups: X1 X33, X34 X66, X67 X99. For example, the first group might represent older, wealthy voters; the second group might represent soccer moms; and the third group might represent young singles. Members of the same group are affected by the ads in the same way, meaning that Pr(Xi = 1|A1 = a1 A2 = a2) = Pr(Xj = 1|A1 = a1 A2 = a2) for all a1, a2 and all i, j that belong to the same group. The highlevel model replaces the variables X1, . . . , X99 by variables G1, G2, and G3, representing the sum of the votes of each group, it replaces U1, . . . , U99 by U 1, U 2, U 3, and replaces T by a binary variable T that just indicates who won. The only interventions allowed in the low-level model are interventions to the variables A1 and A2. We now have an obvious map τ from VL to VH that maps a low-level state to a high-level state by taking G1, G2, and G3 to be the total vote of the corresponding groups; the map ω is just the identity. Given a probability Pr L on UL, there is an obvious probability Pr H on UH such that (MH, Pr H) is an exact transformation of (ML, Pr L). Note that it is critical here that we don t allow interventions on the individual variables Xi at the low level. For example, it is not clear to what high-level intervention ω should map the low-level intervention X3 1. RW+ discuss three applications of exact transformations: a model from which some variables are marginalized; moving from the micro-level to the macro-level by aggregating groups of variables; and moving from a time-evolving dynamical process to a stationary equilibrium state. We review the details of their second application here, just to show how it plays out in our framework. Example 3.3: Let ML be a causal model with endogenous variables X = {Xi : 1 i n} and Y = {Yi : 1 i m}, and equations Xi = Ui for 1 i n and Yi = Pj=1 n Aij Xj + Vi for 1 i m, where A is an m n matrix and there exists an a R such that each column of the matrix A sums to a. Finally, the intervention set is IL = { , X x, Y y, ( X, Y ) ( x, y) : x Rn, y Rm}. Let MH be a model with endogenous variables X and Y , equations X = U and Y = a m X + V , and intervention set IH = { , X x, Y y, ( X, Y ) ( x, y) : x, y R}. Consider the following transformation that averages the X and Y variables: τ : R(VL) R(VH) = R2 ( X, Y ) ( 1 n Pi=1 n Xi, 1 m Pi=1 m Yi). Given a probability Pr UL on Ul (the contexts in the lowlevel model ML), if we take Pr U = Pr UL( 1 n Pi=1 n Ui) and Pr V = Pr UL( 1 m Pi=1 m Vi), then MH is an exact (τ-ω)- transformation of ML for the obvious choice of ω. 3.2 Uniform transformations As the following example shows, much of the work to ensure that a transformation is an exact transformation can be done by choosing appropriate distributions Pr L and Pr H. This leads to cases where (MH, Pr H) is an exact transformation of (ML, Pr L) although it is hard to think of MH as a high-level abstraction of ML. Example 3.4: For i = {1, 2}, let Mi be a deterministic causal model with signature (Ui, Vi, Ri); let ui be a fixed context in Mi; let vi Ri(Vi) be such that (Mi, ui) |= Vi = vi; let Ii consist only of the empty intervention; let Pri put probability 1 on ui; let τi map all elements of R(Vi) to v3 i; and let ωi be the identity map from Ii to I3 i. Clearly (Mi, Pri) is an exact (τi-ωi)-transformation of (M3 i, Pr3 i). The fact that each of M1 and M2 is an exact transformation of the other, despite the fact that the models are completely unrelated, suggests to us that exact transformations are not capturing the essence of abstraction. Roughly speaking, what is happening here is that a high-level model MH can be arbitrary in contexts that do not lead to settings v H that have positive probability for some allowed low-level intervention. This means that if there are few allowed lowlevel interventions or few contexts with positive probability, then there are very few constraints on MH. We end up with high-level models MH that should not (in our view) count as abstractions of ML. We can address this concern by strengthening the notion of exact transformation to require it to hold for all distributions Pr L. Definition 3.5: If ML and MH are deterministic causal models, ω is an order-preserving, surjective mapping ω : IL IH, and τ : RL(VL) RH(VH), then MH is a uniform (τω)-transformation of ML if, for all Pr L, there exists Pr H such that (MH, Pr H) is an exact (τ-ω)-transformation of (ML, Pr L). As we pointed out earlier, since RW+ assume uev, the probability distribution in general might do a lot of work to capture correlations between values of endogenous variables. It makes sense to consider arbitrary distributions if we drop the uev assumption (as in fact we do). In Example 3.4 it is easy to see that neither M1 nor M2 is a uniform transformation of the other. On the other hand, in Example 3.2, we do have a uniform transformation. Considering uniform transformations has other nice features. For one thing, it allows us to derive from τ a mapping τU from R(UL) to R(UH) that explains how Pr L and Pr H are related. More precisely, not only do we know that, for the appropriate ω, for all distributions Pr L there exists Pr H such that (MH, Pr H) is an exact (τ-ω)-transformation of (ML, Pr L), we can take Pr H to be τU(Pr L) (i.e., the pushforward of Pr L under τU). Proposition 3.6: If MH is a uniform (τ-ω)-transformation of ML, then there exists a function τU : R(UL) R(UH) such that, for all distributions Pr L on R(UL), (MH, τU(Pr L)) is an exact (τ-ω)-transformation of (ML, Pr L). The next result provides a characterization of when MH is a uniform (τ-ω)-transformation of ML. Definition 3.7: τ : R(UL) R(UH) is compatible with τ : R(VL) R(VH) if, for all Y y IL and u L R(UL), τ(ML( u L, Y y)) = MH(τ ( u L), ω( Y y)). Theorem 3.8 : Given causal models ML and MH, τ : R(VL) R(VH), and an order-preserving surjective function ω : I(VL) I(VH), the following are equivalent: (a) MH is a uniform (τ-ω)-transformation of ML; (b) there exists a function τU : R(UL) R(UH) compatible with τ. It is easy to check that uniform transformations are closed under composition. Theorem 3.9: If MH is a uniform (τ1-ω1)-transformation of MI and MI is a uniform (τ2-ω2)-transformation of ML, then MH is a uniform ((τ2 τ1)-(ω2 ω1))-transformation of ML. 3.3 Abstraction Although the notion of a uniform transformation deals with some of the problems we see with the RW+ notion of exact transformation, it does not deal with all of them, as the following two examples show. Example 3.10: Let M1 and M2 be deterministic causal models, both with endogenous binary variables X1 and X2 and corresponding binary exogenous variables U1 and U2. 3 In M1, the equations are X1 = U1 and X2 = X1. (U2 plays no role in the equations in M1. We added it just to make M2 a model that has uev and thus show that having uev is not an issue here.) In M2, the equations are X1 = U1 and X2 = U2. The only allowed interventions in M1 are 3A variable is binary if its range is {0, 1}. X1 x1, for x1 {0, 1}; the only allowed interventions in M2 are (X1, X2) (x1, x1), for x1 {0, 1}. It is easy to see that M1 is a uniform transformation of M2 and that M2 is a uniform transformation of M1. If τij and ωij are the maps showing that Mj is a uniform transformation of Mi, then we can take both τ12 and τ21 to be the identity, ω12 maps X1 x1 to (X1, X2) (x1, x1), while ω21 maps (X1, X2) (x1, x1) to X1 x1. But this does not match our intuition that if MH is an abstraction of ML, then MH is a higher-level description of the situation than ML. Whatever higher-level description means, we would expect that if ML and MH are different, then we should not have ML and MH being abstractions of each other. What is the problem here? If we just focus on these sets of allowed interventions, then there is in fact no problem. M1 and M2 do, in a sense, work the same way as far as these allowed interventions go. However, the mappings ω12 and ω21 seem to be in conflict with taking τ12 and τ21 to be the identity. Given that τij is the identity mapping, we would expect ωij to also be the identity mapping. Why should ω12 map X1 0 to something other than X1 0 here? It is easy to see that if we take ω12 to also be the identity mapping then the problem disappears, as we no longer have uniform transformations between these two models. More generally, we define below a natural way in which a mapping τ on states induces a mapping ω on allowed interventions. But even when ω is well-behaved there exist counterintuitive examples of uniform transformations. Example 3.11: Given a model M3, let M4 be a model that is like M3 except that U3 has a new exogenous binary variable U and a new binary endogenous variable X . Modify the equations in M3 so that U is the only parent of X , but X is the parent of every other endogenous variable in M4 (and thus of every endogenous variable in M3). Take I 4 = I3. If X = 1, then all equations in M4 are identical to those in M3. However, if X = 0, then all equations behave in some arbitrary way (the exact way they behave is irrelevant). Define τ : R(V3) R(V4) by taking τ( v L) = ( v L, X = 1). We claim that M4 is a uniform (τ-ω)-transformation of ML, where ω is the identity. Given a distribution Pr3 on R(U3), define Pr4 so that its marginal on the variables in U3 is Pr3 and Pr4(U = 0) = 0. It is easy to see that (M4, Pr4) is an exact (τ-ω)-transformation of (M3, Pr3), regardless of how the equations in M4 are defined if X = 1. What goes wrong in this example is that the high level is more detailed than the low level, contrary to what one expects of an abstraction. Concretely, introducing the extra variable X allows M4 to capture a whole range of possibilities that have no counterpart whatsoever in M3. That doesn t sound right (at least to us). We can fix this by simply demanding that our abstraction function τ be surjective. Combining both observations, we define a natural way in which an abstraction function τ determines which sets of interventions should be allowed at the low level and the high level, and the mapping ωτ between them. Definition 3.12: Given a set V of endogenous variables, X V, and x R( X), let Rst(V, x) = { v R(V) : x is the restriction of v to X}. Given τ : R(VL) R(VH), define ωτ( X x) = Y y if there exists Y VH and y R( Y ) such that τ(Rst(VL, x)) = Rst(VH, y) (as usual, given T R(VL), we define τ(T) = {τ( v L) : v L T}). It is easy to see that, given X and x, there can be at most one such Y and y. If there does not exist such a Y and y, we take ωτ( X x) to be undefined. Let Iτ L be the set of interventions for which ωτ is defined, and let Iτ H = ωτ(Iτ L). It is straightforward to check that in Example 3.2, ωτ is defined on interventions to A1, A2, and on these interventions it is the identity (and thus agrees with ω as defined in that example), but it is also defined on simultaneous interventions on X1 X33, X34 X66, and X67 X99, and on T (as well as combinations of these interventions). In Example 3.3, the interventions on which ωτ is defined are precisely those in the set IL of that example; on these interventions, ωτ = ω. Note that if τ is surjective, then it follows that ωτ( ) = , and for all v L R(VL), ωτ(VL v L) = VH τ( v L). Definition 3.13: (MH, IH) is a τ-abstraction of (ML, IL) if the following conditions hold: τ is surjective; there is a surjective function τU : R(UL) R(UH) compatible with τ; IH = ωτ(IL). As intended, Examples 3.10 and 3.11 are not τabstractions; on the other hand, in Examples 3.2 and 3.3, MH is a τ-abstraction of ML. Unlike exact and uniform transformations, τ-abstraction is a relation between causal models: the mapping ω is determined by τ, and there is no need to specify a probability distribution. We can strengthen the notion of τ-abstraction to define a relation on basic causal models, by considering the largest possible sets of allowed interventions. Definition 3.14: If MH and ML are basic causal models, then MH is a strong τ-abstraction of ML if Iτ H = I H, the set of all high-level interventions, and (MH, Iτ H) is a τ-abstraction of (ML, Iτ L). The notion of strong τ-abstraction provides a clean, powerful relation between basic causal models. However, there are applications where the two additional requirements that make an abstraction strong are too much to ask. In the following example, neither requirement is satisfied. Example 3.15: Consider an object in the earth s gravitational field. On the low level (ML), there are three endogenous variables: V (velocity), H (height), and M (mass), and three corresponding exogenous variables, UV , UH, and UM. The equations in ML are V = UV , H = UH, and M = UM. The high level captures the object s current energy. MH contains endogenous variables K (kinetic energy) and P (potential energy), and two corresponding exogenous variables, UK and UP . The equations in MH are K = UK and P = UP . We define τ : R(VL) R(VH) using the standard equations for kinetic energy and gravitational potential energy, so τ(v, h, m) = ( 1 2mv2, 9.81mh). It is easy to see that τ is a surjection onto R(VH). We claim that MH is not a strong τ-abstraction of ML. To see why, consider interventions of the form M m for m > 0. Applying Definition 3.12, we get that ωτ(M m) = , since τ(Rst(VL, m)) = VH; by choosing v and h appropriately, we can still get all values in VH, as long as m > 0. We also clearly have that ωτ maps the empty intervention in ML to the empty intervention in MH. With this, we can already show that (MH, Iτ H) is not a uniform (τ, ωτ)- transformation of (ML, Iτ L). Suppose that Pr L is a probability on UL that puts probability 1 on (1, 1, 1). For condition (1) in Definition 3.1 to hold for the intervention M m, the probability Pr H on UH must put probability 1 on (.5m, 9.81m). But (1) must hold for all choices of m. This is clearly impossible. Although MH is not a strong τ-abstraction of ML, we can easily construct a sensible and useful τ-abstraction between these models by simply not allowing interventions of the form M m in the low-level model. Concretely, if we define IL as containing the empty intervention and all interventions of the form (V, H, M) (v, h, m), then ωτ maps this to the set IH that contains the empty intervention and all interventions of the form (K, P) (k, p). As the following example shows, there also exist interesting cases where only the first requirement of Definition 3.14 is not satisfied. Roughly speaking, this is because some high-level variables are not logically independent, so not all high-level interventions are meaningful. Example 3.16: Suppose that we have a 100 100 grid of pixels, each of which can be black or white. In the lowlevel model, we have 10,000 endogenous variables Xij, for 1 i, j 100, and 10,000 corresponding exogenous variables Uij for 1 i, j 100, with the obvious equations Xij = Uij. We would expect there to be other variables that are affected by the Xijs (e.g., what a viewer perceives), but for ease of exposition, we ignore these other variables in this example and focus only on the Xij variables. Suppose that all we care about is how many of the pixels in the upper half of the grid are black and how many pixels in the left half of the grid are black. Thus, in the high-level model, we have variables UH and LH whose range is {0, . . . , 5000}. Because of the dependencies between UH and LH , there is a single exogenous variable that determines their values, which are pairs (m, m ) such that 0 m, m 5, 000 and |m m | 2, 500. Now we have an obvious map τ from low-level states to high-level states. We claim that (MH, Iτ H) is a τ-abstraction of (ML, Iτ L), where Iτ L consists of the empty intervention and interventions that simultaneously set all the variables in the upper half and left half (i.e., all variables Xij with 1 i 50 or 51 j 100) and an arbitrary subset of the variables in the bottom right. Given a nonempty intervention X x of this form, ωτ( X x) = (UH m, LH m ), where m is the number of Xij variables set to 1 with 1 i 50 and m is the number of Xij variables set to 1 with 51 j 100; how the variables in the bottom right are set in X x is irrelevant. Thus, Iτ H consists of interventions of the form (UH m, LH m ), where 1 m, m 5000 and |m m | 2500. It is straightforward to check that there is no low-level intervention X x such that ωτ( X x) = UH m. For suppose that ωτ( X x) = UH m. Then τ(Rst(VL, x)) = {(m, m ) : 1 m 5000}. This means that (m, m ) τ(Rst(VL, x)) for some m such that |m m | > 2500, which is a contradiction. A similar argument shows that no intervention of the form LH m can be in Iτ H. It is straightforward to check that (MH, Iτ H) is a uniform (τ, ωτ)-transformation of (ML, Iτ L), so (MH, Iτ H) is a τ-abstraction of (ML, Iτ L), however it is clearly not a strong τ-abstraction of ML. The problem here is that although MH has variables UH and LH , we can only intervene on them simultaneously. It may make sense to consider such interventions if we want a visual effect that depends on both the number of black pixels in the upper half and the number of black pixels in the left half. But it is worth noting that if we consider a high-level model M H with only a single variable ULH that counts the number of pixels that are black in the upper half and the left half altogether, then M H is a strong τ-abstraction of ML with the obvious map τ. The full paper (posted on arxiv) gives an example where the second requirement of Definition 3.14 is not satisfied. 3.4 From micro-variables to macro-variables Roughly speaking, the intuition for clustering microvariables into macro-variables is that in the high-level model, one variable captures the effect of a number of variables in the low-level model. This makes sense only if the low-level variables that are being clustered together work the same way as far as the allowable interventions go. The following definition makes this precise. Definition 3.17 : MH is a constructive τ-abstraction of ML if MH is a strong τ-abstraction of ML and, if VH = {Y1, . . . , Yn}, then there exists a partition P = { Z1, . . . , Zn+1} of VL, where Z1, . . . , Zn are nonempty, and mappings τi : R( Zi) R(Yi) for i = 1, . . . , n such that τ = (τ1, . . . , τn); that is, τ( v L) = τ1( z1) . . . τn( zn), where zi is the projection of v L onto the variables in Zi, and is the concatenation operator on sequences. MH is a constructive abstraction of ML if it is a constructive τabstraction of ML for some τ. In this definition, we can think of each Zi as describing a set of microvariables that are mapped to a single macrovariable Yi. The variables in Zn+1 (which might be empty) are ones that are marginalized away. By definition, every constructive τ-abstraction is a strong τ-abstraction. We conjecture that a converse to this also holds: that is, if MH is a strong τ-abstraction of ML, that perhaps satisfies a few minor technical conditions, then it will in fact be a constructive τ-abstraction of ML. However, we have not proved this result yet. We suspect that constructive τ-abstractions are the notion of abstraction that will arise most often in practice. All three of the examples discussed by RW+ (one of which is Example 3.3) are constructive abstractions. We can easily extend Example 3.2 by adding low-level and high-level interventions to make it a constructive abstraction as well. 4 Discussion and Conclusions We believe that getting a good notion of abstraction will be critical in allowing modelers to think at a high level while still being faithful to a more detailed model. As the analysis of this paper shows, there are different notions of abstraction, that relate causal models at different levels of detail. For example, τ-abstraction is a relation between basic causal models, while a uniform (τ, ω)- transformation relates causal models, and RW+ s notion of exact transformation relates probabilistic causal models. Although our final notion of constructive abstraction is the cleanest and arguably easiest to use, we believe that there exist applications for which the weaker abstraction relations are more appropriate. More work needs to be done to understand which abstraction relation is most suitable for a given application. We hope that the definitions proposed here will help clarify the relevant issues. They should also shed light on some of the recent discussions of higher-level causation in communities ranging from physics to philosophy (see, e.g., (Fenton Glynn 2017; Hoel, Albantakis, and Tononi 2013)). In fact, we see the current paper as laying the formal groundwork for several interesting topics that we intend to explore in future work. First, we hope to generalize the abstraction relation to a notion of approximate abstraction, given that in most real-life settings the mappings between different levels are only approximately correct. Second, our framework makes it possible to explore whether the notion of actual causation could be applied across causal models, rather than merely within a single causal model. For example, it seems to be useful to think of an event in a lowlevel model as causing an event in a high-level model. Third, abstracting causal models of large complexity into simpler causal models with only a few variables is of direct relevance to the increasing demand for explainable AI, for in many situations the problem lies not in the fact that no causal model is available, but in the fact that the only available model is too complicated for humans to understand. Acknowledgments: Halpern was supported in part by NSF grants IIS-1703846 and IIS-1718108, ARO grant W911NF-17-1-0592, and a grant from the Open Philanthropy project. Beckers was supported by the grant ERC2013Co G project REINS, nr. 616512. Some of this work was done while Beckers was a postdoc at Cornell University, supported by the Belgian American Educational Foundation. We thank Frederick Eberhardt and the reviewers of the paper for many useful comments. A Appendix: Proofs Proof of Lemma 2.4: Let M = ((U, V, R), F, I, Pr) and define M = ((U , V, R ), F , I, Pr ) as follows. U has one exogenous variable for each endogenous variable in V. Taking V = {Y1, . . . , Yn}, we take Ui to be the exogenous variable corresponding to Yi. Let U = {U 1, . . . , U n}. We take R(U i) = R(U) for i = 1, . . . , n (so the set of possible values for each variable U i is the set of all contexts in M). If z R(V {Yi}), we define F Yi( z, u1, . . . , un) = FYi( z, ui). (Note that here ui R(U) = R(U i).) Thus, it is clear that the only exogenous variable that the value of Yi in M depends on is U i, so M has uev, as desired. Pr places probability 0 on a context ( u1, . . . , un) unless u1 = . . . = un, and Pr ( u, . . . , u) = Pr( u). It is almost immediate that, with these choices, M M . Proof of Proposition 3.6: Suppose that MH is a uniform (τ-ω)-transformation of ML. Say that u L and u H correspond if τ(ML( u L, Y y)) = MH( u H, ω( Y y)) for all interventions Y y IL. We claim that for all u L R(UL), there exists at least one u H R(UH) that corresponds to u L. To see this, fix u L R(UL). Let Pr L give u L probability 1. Then for each intervention Y y IL, the distribution Pr Y y L gives probability 1 to ML( u L, Y y). Let Pr H be a probability distribution such that (MH, Pr H) is an exact (τ-ω)-transformation of (ML, Pr L). Since τ(Pr Y y L ) = Prω( Y y) H , it follows that Prω( Y y) H gives probability 1 to τ(ML( u L, Y y)), and hence also to the set U u L, Y y H = { u H : MH( u H, ω( Y y)) = τ(ML( u L, Y y))}. Since there are only finitely many interventions in IL, Y y ILU u L, Y y H also has probability 1, and thus must be nonempty. Choose u H Y y ILU Y y H . By construction, u H corresponds to u L. Define τU by taking τU( u L) = u H, where u H corresponds to u L. (If more than one tuple u H corresponds to u L, then one is chosen arbitrarily.) It is now straightforward to check that (MH, τU(Pr L)) is an exact (τ-ω)-transformation of (ML, Pr L). We leave details to the reader. Proof of Theorem 3.8: To show that (a) implies (b), suppose that MH is a uniform (τ-ω)-transformation of ML. Let τU : R(UL) R(UH) be the function guaranteed to exist by Proposition 3.6. We must show that for all Y y IL, u L R(UL), τ(ML( u L, Y y)) = MH(τU( u L), ω( Y y)). Fix u L R(UL). From the construction of τU in the proof of Proposition 3.6, it follows that u L and τU( u L) correspond, which, by definition, means that τ(ML( u L, Y y)) = MH(τU( u L), ω( Y y)) for all interventions Y y IL. To show that (b) implies (a), suppose that (b) holds. Given a distribution Pr L on R(UL), let Pr H = τU(Pr L). It suffices to show that (MH, Pr H) is an exact (τ-ω)-transformation of (ML, Pr L). Thus, we must show that for every intervention Y y IL, we have Prω( Y y) H = τ(Pr Y y L ). Straight- forward computations now show that Prω( Y y) H ( v H) = Pr H({ u H : MH( u H, ω( Y y)) = v H}) = Pr L({ u L : MH(τU( u L), ω( Y y)) = v H)}) = Pr L({ u L : τ(ML( u L, Y y)) = v H}) = Pr L({ v L : τ( v L) = v H and u L(ML( u L, Y y) = v L)}) = Pr Y y L ({ v L : τ( v L) = v H}) = τ(Pr Y y L )( v H), as desired. References Binahashemi, B.; de Giacomo, G.; and Lesp erance, Y. 2017. Abstraction in situation calculus action theories. In Proc. Thirty-First National Conference on Artificial Intelligence (AAAI 17), 1048 1055. Chalupka, K.; Eberhardt, F.; and Perona, P. 2015. Visual causal feature learning. In Proc. 32nd Conference on Uncertainty in Artificial Intelligence (UAI 2015), 181 190. Chalupka, K.; Eberhardt, F.; and Perona, P. 2016. Multilevel cause-effect systems. In Proc. 19th Int. Conf. on Artificial Intelligence and Statistics (AISTATS 2016), 361 369. Fenton-Glynn, L. 2017. Is there high-level causation? Ergo 4(30):845 898. Halpern, J. Y., and Pearl, J. 2005. Causes and explanations: a structural-model approach. Part I: Causes. British Journal for Philosophy of Science 56(4):843 887. Halpern, J. Y. 2016. Actual Causality. Cambridge, MA: MIT Press. Hoel, E. P.; Albantakis, L.; and Tononi, G. 2013. Quantifying causal emergence shows that macro can beat micro. Proc. National Academy of Science 110(49):19790 19795. Iwasaki, Y., and Simon, H. A. 1994. Causality and model abstraction. Artificial Intelligence 67(1):143 194. Pearl, J. 2000. Causality: Models, Reasoning, and Inference. New York: Cambridge University Press. Rubenstein, P. K.; Weichwald, S.; Bongers, S.; Mooij, J. M.; Janzing, D.; Grosse-Wentrup, M.; and Sch olkopf, B. 2017. Causal consistency of structural equation models. In Proc. 33rd Conference on Uncertainty in Artificial Intelligence (UAI 2017).