# Identifying Causal Effects Under Functional Dependencies

Yizuo Chen, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90024, yizuo.chen@ucla.edu
Adnan Darwiche, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90024, darwiche@cs.ucla.edu

Abstract

We study the identification of causal effects, motivated by two improvements to identifiability which can be attained if one knows that some variables in a causal graph are functionally determined by their parents (without needing to know the specific functions). First, an unidentifiable causal effect may become identifiable when certain variables are functional. Second, certain functional variables can be excluded from being observed without affecting the identifiability of a causal effect, which may significantly reduce the number of needed variables in observational data. Our results are largely based on an elimination procedure which removes functional variables from a causal graph while preserving key properties in the resulting causal graph, including the identifiability of causal effects.

1 Introduction

A causal effect measures the impact of an intervention on some events of interest, and is exemplified by the question: what is the probability that a patient would recover had they taken a drug? This type of question, also known as an interventional query, belongs to the second rung of Pearl's causal hierarchy [1], so it ultimately requires experimental studies if it is to be estimated from data. However, it is well known that such interventional queries can sometimes be answered based on observational queries (first rung of the causal hierarchy) which can be estimated from observational data. This becomes very significant when experimental studies are either not available, expensive to conduct, or would entail ethical concerns. Hence, a key question in causal inference asks when and how a causal effect can be estimated from available observational data assuming a causal graph is provided [2]. More precisely, given a set of treatment variables X and a set of outcome variables Y, the causal effect of x on Y, denoted Pr(Y|do(x)) or Prx(Y), is the marginal probability on Y when an intervention sets the states of variables X to x. The problem of identifying a causal effect studies whether Prx(Y) can be uniquely determined from a causal graph and a distribution Pr(V) over some variables V in the causal graph [2], where Pr(V) is typically estimated from observational data. The causal effect is guaranteed to be identifiable if V corresponds to all variables in the causal graph (with some positivity assumptions); that is, if all such variables are observed. When some variables are hidden (unobserved), it is possible that different parameterizations of the causal graph will induce the same distribution Pr(V) but different values for the causal effect Prx(Y), which leads to unidentifiability. In the past few decades, a significant amount of effort has been devoted to studying the identifiability of causal effects; see, e.g., [3, 2, 4-7]. Some early works include the back-door criterion [8, 2] and the front-door criterion [3, 2]. These criteria are sound but incomplete as they may fail to identify certain causal effects that are indeed identifiable. Complete identification methods include the do-calculus [2], the identification algorithm in [9], and the ID algorithm proposed in [10].
These methods require some positivity assumptions (constraints) on the observational distribution Pr(V) and can derive an identifying formula that computes the causal effect based on Pr(V) when the causal effect is identifiable. Some recent works take a different approach by first estimating the parameters of a causal graph to obtain a fully-specified causal model which is then used to estimate causal effects through inference [11-14]. Further works focus on the efficiency of estimating causal effects from finite data [15-18], the general identifiability of causal effects from both observational and experimental data [19], and causal effect identification with data collected from sub-populations [20].

A recent line of work studies the impact of additional information on identifiability, beyond causal graphs and observational data. For example, [21] showed that certain unidentifiable causal effects can become identifiable given information about context-specific independence. Our work in this paper follows the same direction as we consider the problem of causal effect identification in the presence of a particular type of qualitative knowledge called functional dependencies [22]. We say there is a functional dependency between a variable X and its parents P in the causal graph if the distribution Pr(X|P) is deterministic but we do not know the distribution itself (i.e., the specific values of Pr(x|p)). We also say in this case that variable X is functional. Previous works have shown that functional dependencies can be exploited to improve the efficiency of Bayesian network inference [23, 13, 24-26]. We complement these works by showing that functional dependencies can also be exploited for identifiability. In particular, we show that some unidentifiable causal effects may become identifiable given such dependencies, propose techniques for testing identifiability in this context, and highlight other implications of such dependencies on the practice of identifiability.

Consider the following motivational example where we are interested in how the enforcement of speed limits may affect car accidents. The Driving Age (A) is functionally determined by Country (C); Driving Age and Country are causes of Speed (X); and Speed and Driving Age are causes of Accidents (Y). The DAG on the right captures the causal relations among these variables, where variable A is circled to indicate it is functional. Suppose further that variables C, X, Y are observed. According to classical causal-effect identification methods (e.g., do-calculus, ID algorithm), the causal effect of X on Y is unidentifiable in this case. However, if we take into account that variable A is a function of C, which restricts the class of distributions under consideration, then the causal effect of X on Y becomes identifiable. This exemplifies the improvements to identifiability pursued in this paper.

Consider a causal graph G and a distribution Pr(V) over the observed variables V in G. To check the identifiability of a causal effect, it is standard to first apply the projection operation in [27, 28], which constructs another causal graph G′ with V as its non-root variables, and to follow by applying an identification algorithm to G′, such as the ID algorithm [10]. This two-stage procedure, which we will call project-ID, is applicable only under some positivity constraints (assumptions) which preclude some events from having a zero probability.
Since such positivity constraints may contradict functional dependencies, we formulate the notion of constrained identifiability which takes positivity constraints as an input (in addition to the causal graph G and distribution Pr(V)). We also formulate the notion of functional identifiability which further takes functional dependencies as an input. This allows us to explicitly treat the interactions between positivity constraints and functional dependencies, which is needed for combining classical methods like project-ID with the results we present in this paper.

We start with some technical preliminaries in Section 2. We formally define positivity constraints and functional dependencies in Section 3, where we also introduce the problems of constrained and functional identifiability. Section 4 introduces two primitive operations, functional elimination and functional projection, which are needed for later treatments. Section 5 presents our core results on functional identifiability and how they can be combined with existing identifiability algorithms. We finally close with concluding remarks in Section 6. Proofs of all results are included in the Appendix.

2 Technical Preliminaries

We consider discrete variables in this work. Single variables are denoted by uppercase letters (e.g., X) and their states are denoted by lowercase letters (e.g., x). Sets of variables are denoted by bold uppercase letters (e.g., X) and their instantiations are denoted by bold lowercase letters (e.g., x).

2.1 Causal Bayesian Networks and Interventions

A Causal Bayesian network (CBN) is a pair ⟨G, F⟩ where G is a causal graph in the form of a directed acyclic graph (DAG), and F is a set of conditional probability tables (CPTs). We have one CPT for each variable X with parents P in G, which specifies the conditional probability distributions Pr(X|P). This CPT will often be denoted by fX(X, P), so fX(x, p) ∈ [0, 1] for all instantiations x, p and Σ_x fX(x, p) = 1 for every instantiation p. A CBN induces a joint distribution over its variables V which is exactly the product of its CPTs, i.e., Pr(V) = Π_{V ∈ V} fV. Applying a treatment do(x) to the joint distribution yields a new distribution called the interventional distribution, denoted Prx(V). One way to compute the interventional distribution is to consider the mutilated CBN ⟨G′, F′⟩ that is constructed from the original CBN ⟨G, F⟩ as follows: remove from G all edges that point to variables in X, then replace the CPT in F for each X ∈ X with a CPT fX(X) where fX(x) = 1 if x is consistent with x and fX(x) = 0 otherwise. Figure 1a depicts a causal graph G and Figure 1b depicts the mutilated causal graph G′ under a treatment do(x1, x2). The interventional distribution Prx is the distribution induced by the mutilated CBN ⟨G′, F′⟩, where Prx(Y) corresponds to the causal effect Pr(Y|do(x)), also notated Pr(Yx).

2.2 Identifying Causal Effects

A key question in causal inference is to check whether a causal effect can be (uniquely) computed given the causal graph G and a distribution Pr(V) over a subset V of its variables. If the answer is yes, we say that the causal effect is identifiable given G and Pr(V). Otherwise, the causal effect is unidentifiable. Variables V are said to be observed and the remaining variables are said to be hidden, where Pr(V) is usually estimated from observational data. We start with the general definition of identifiability (not necessarily for causal effects) from [2, Ch. 3.2.4] with a slight rephrasing.
Definition 1 (Identifiability [2]). Let Q(M) be any computable quantity of a model M. We say that Q is identifiable in a class of models if, for any pair of models M1 and M2 from this class, Q(M1) = Q(M2) whenever PrM1(V) = PrM2(V), where V are the observed variables.

In the context of causal effects, the problem of identifiability is to check whether every pair of fully-specified CBNs (M1 and M2 in Definition 1) that induce the same distribution Pr(V) will also produce the same value for the causal effect. Note that Definition 1 does not restrict the considered models M1 and M2 based on properties of the distributions PrM1(V) and PrM2(V). However, in the literature on identifying causal effects, it is quite common to only consider CBNs (models) that induce distributions which satisfy some positivity constraints, such as Pr(V) > 0. We will examine such constraints more carefully in Section 3 as they may contradict functional dependencies.

It is well known that under some positivity constraints (e.g., Pr(V) > 0), the identifiability of causal effects can be efficiently tested using what we shall call the project-ID algorithm. Given a causal graph G, project-ID first applies the projection operation in [27-29] to yield a new causal graph G′ whose hidden variables are all roots and each has exactly two children. These properties are needed by the ID algorithm [10], which is then applied to G′ to yield an identifying formula if the causal effect is identifiable, or FAIL otherwise. Consider the causal effect Prx1x2(y) in Figure 1a where the only hidden variable is the non-root variable B. We first project the causal graph G in Figure 1a onto its observed variables to yield the causal graph G′ in Figure 1c (all hidden variables in G′ are auxiliary and roots). We then run the ID algorithm on G′ which returns the following (simplified) identifying formula:

Prx1x2(y) = Σ_{a,c,d} Pr(c|x1) Pr(d|x1, x2) Σ_{x′1,x′2} Pr(y|x′1, x′2, a, c, d) Pr(x′2|x′1, a, c) Pr(x′1, a).

Hence, the causal effect Prx1x2(y) is identifiable and can be computed using the above formula. Moreover, all quantities in the formula can be obtained from the distribution Pr(A, C, D, X1, X2, Y) over observed variables, which can be estimated from observational data. More details on the projection operation and the ID algorithm can be found in Appendix A.

[Figure 1: Example adapted from [30]. Hidden variables are circled. A bidirected edge X <-> Y is compact notation for X ← H → Y where H is an auxiliary hidden variable. Panels: (a) causal graph; (b) mutilated; (c) projected.]

3 Constrained and Functional Identifiability

As mentioned earlier, Definition 1 of identifiability [2, Ch. 3.2.4] does not restrict the pair of considered models M1 and M2. However, it is common in the literature on causal-effect identifiability to only consider CBNs with distributions Pr(V) that satisfy some positivity constraints. Strict positivity, Pr(V) > 0, is perhaps the most widely used constraint [9, 29, 2]. That is, in Definition 1, we only consider CBNs M1 and M2 which induce distributions PrM1 and PrM2 that satisfy PrM1(V) > 0 and PrM2(V) > 0. Weaker, and somewhat intricate, positivity constraints were employed by the ID algorithm in [10], as discussed in Appendix A, but we will apply this algorithm only under strict positivity to keep things simple. See also [31] for a recent discussion of positivity constraints.
Positivity constraints are motivated by two considerations: technical convenience, and the fact that most causal effects would be unidentifiable without some positivity constraints (more on this later). Given the multiplicity of positivity constraints considered in the literature, and given the subtle interaction between positivity constraints and functional dependencies (which are the main focus of this work), we next provide a systematic treatment of identifiability under positivity constraints.

3.1 Positivity Constraints

We will first formalize the notion of a positivity constraint and then define the notion of constrained identifiability which takes a set of positivity constraints as input (in addition to the causal graph G and distribution Pr(V)). (Footnote 1: We are incorporating positivity constraints directly into the definition of identifiability. This is different from the analysis in [32] which derives positivity constraints from a particular run of the identification algorithm.)

Definition 2. A positivity constraint on Pr(V) is an inequality of the form Pr(S|Z) > 0 where S ⊆ V, Z ⊆ V and S ∩ Z = ∅. That is, for all instantiations s, z, if Pr(z) > 0 then Pr(s, z) > 0. When Z = ∅, the positivity constraint is defined on a marginal distribution, Pr(S) > 0.

We may impose multiple positivity constraints on a set of variables V. We will use CV to denote the set of positivity constraints imposed on Pr(V) and use vars(CV) to denote all the variables mentioned by CV. The weakest set of positivity constraints is CV = {} (no positivity constraints, as in Definition 1), and the strongest positivity constraint is CV = {Pr(V) > 0}. We next provide a definition of identifiability for the causal effect of treatments X on outcomes Y in which positivity constraints are an input to the identifiability problem. We call it constrained identifiability in contrast to the (unconstrained) identifiability of Definition 1.

Definition 3. We call ⟨G, V, CV⟩ an identifiability tuple where G is a causal graph (DAG), V is its set of observed variables, and CV is a set of positivity constraints.

Definition 4 (Constrained Identifiability). Let ⟨G, V, CV⟩ be an identifiability tuple. The causal effect of X on Y is said to be identifiable with respect to ⟨G, V, CV⟩ if Pr1x(y) = Pr2x(y) for any pair of distributions Pr1 and Pr2 which are induced by G and that satisfy Pr1(V) = Pr2(V) and that also satisfy the positivity constraints CV.

For simplicity, we say identifiability to mean constrained identifiability in the rest of the paper. We next show that without some positivity constraints, most causal effects would not be identifiable. We say that a treatment X ∈ X is a first ancestor of some outcome Y ∈ Y if X is an ancestor of Y in causal graph G, and there exists a directed path from X to Y that is not intercepted by X \ {X}. A first ancestor must exist if some treatment variable is an ancestor of some outcome variable.

Proposition 5. The causal effect of X on Y is not identifiable wrt an identifiability tuple ⟨G, V, CV⟩ if some X ∈ X is a first ancestor of some Y ∈ Y, and CV does not imply Pr(X) > 0.

Hence, identifiability is not possible without some positivity constraints if at least one treatment variable is an ancestor of some outcome variable (which is common). Consider the causal graph on the right where U is the only hidden variable. By Proposition 5, the causal effect of {X1, X2} on {Y1, Y2} is not identifiable if the considered distributions do not satisfy Pr(X2) > 0, as X2 is a first ancestor of Y2.
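As an aside, Definition 2 can be checked mechanically against an empirical estimate of Pr(V). The following is a minimal sketch, not part of the paper's formal development; the dictionary-based representation of the joint distribution and all variable names are illustrative assumptions.

```python
from itertools import product

def satisfies_positivity(joint, variables, S, Z):
    """Check a positivity constraint Pr(S|Z) > 0 in the sense of Definition 2:
    for all instantiations s, z, if Pr(z) > 0 then Pr(s, z) > 0.
    `joint` maps full instantiations (tuples ordered as `variables`) to probabilities."""
    idx = {v: i for i, v in enumerate(variables)}
    states = {v: sorted({inst[idx[v]] for inst in joint}) for v in variables}

    def prob(assignment):  # marginal probability of a partial assignment {var: state}
        return sum(p for inst, p in joint.items()
                   if all(inst[idx[v]] == s for v, s in assignment.items()))

    for z in product(*(states[v] for v in Z)):
        z_assign = dict(zip(Z, z))
        if prob(z_assign) <= 0:
            continue                      # the constraint is vacuous when Pr(z) = 0
        for s in product(*(states[v] for v in S)):
            if prob({**z_assign, **dict(zip(S, s))}) <= 0:
                return False
    return True

# Toy example with V = {X, Y}: Pr(Y|X) > 0 fails since Pr(x1, y1) = 0 while Pr(x1) > 0,
# but the marginal constraint Pr(X) > 0 holds.
variables = ("X", "Y")
joint = {("x0", "y0"): 0.3, ("x0", "y1"): 0.2, ("x1", "y0"): 0.5, ("x1", "y1"): 0.0}
print(satisfies_positivity(joint, variables, ["Y"], ["X"]))  # False
print(satisfies_positivity(joint, variables, ["X"], []))     # True
```

Strict positivity corresponds to the special case S = V and Z = ∅, while the examples below use strictly weaker constraints.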
As positivity constraints become stronger, more causal effects become identifiable since the set of considered models becomes smaller. Consider the causal graph on the right in which all variables are observed, V = {X, Y, Z}. Without positivity constraints, CV = ∅, the causal effect of X on Y is not identifiable. However, it becomes identifiable given strict positivity, CV = {Pr(X, Y, Z) > 0}, leading to the identifying formula Prx(y) = Σ_z Pr(y|x, z) Pr(z). This causal effect is also identifiable under the weaker positivity constraint CV = {Pr(X|Z) > 0}, which implies Pr(X) > 0, so strict positivity is not necessary for identifiability even though it is typically assumed for this folklore result. (Footnote 2: This weaker positivity constraint is sufficient to make the identifying formula well-defined since Pr(y|x, z) Pr(z) in the formula is equal to zero when Pr(z) = 0, and is computable when Pr(z) > 0; that is, the conditional probability Pr(y|x, z) is well-defined if Pr(x|z) > 0.) This is an example where strict positivity may be assumed for technical convenience only as it may facilitate the application of some identifiability techniques like the do-calculus [2].

3.2 Functional Dependencies

A variable X in a causal graph is said to functionally depend on its parents P if its distribution is deterministic: Pr(x|p) ∈ {0, 1} for every instantiation x, p. Variable X is also said to be functional in this case. In this work, we assume qualitative functional dependencies: we do not know the distribution Pr(X|P), we only know that it is deterministic. (Footnote 3: We assume that root variables cannot be functional as such variables can be removed from the causal graph.)

The table on the right shows two variables B and C that both have A as their parent. Variable C is functional but variable B is not. The CPT for variable C will be called a functional CPT in this case.

A  B  C  Pr(B|A)  Pr(C|A)
0  0  0  0.2      0
0  1  1  0.8      1
1  0  0  0.6      1
1  1  1  0.4      0

Functional CPTs are also known as (causal) mechanisms and are expressed using structural equations in structural causal models (SCMs) [33-35]. By definition, in an SCM, every non-root variable is assumed to be functional (when noise variables are represented explicitly in the causal graph). Qualitative functional dependencies are a longstanding concept. For example, they are common in relational databases, see, e.g., [36, 37], and their relevance to probabilistic reasoning had been brought up early in [22, Ch. 3]. One example of a (qualitative) functional dependency is that different countries have different driving ages, so we know that Driving Age functionally depends on Country even though we may not know the specific driving age for each country. Another example is that a letter grade for a class is functionally dependent on the student's weighted average even though we may not know the scheme for converting a weighted average to a letter grade.

In this work, we assume that we are given a causal graph G in which some variables W have been designated as functional. The presence of functional variables further restricts the set of distributions Pr that we consider when checking identifiability. This leads to a more refined problem that we call functional identifiability (F-identifiability), which depends on four elements.

Definition 6. We call ⟨G, V, CV, W⟩ an F-identifiability tuple when G is a DAG, V is its set of observed variables, CV is a set of positivity constraints, and W is a set of functional variables in G.

Definition 7 (F-Identifiability). Let ⟨G, V, CV, W⟩ be an F-identifiability tuple. The causal effect of X on Y is F-identifiable wrt ⟨G, V, CV, W⟩ if Pr1x(y) = Pr2x(y) for any pair of distributions Pr1 and Pr2 which are induced by G, and that satisfy Pr1(V) = Pr2(V) and the positivity constraints CV, and in which variables W functionally depend on their parents.
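To connect Definition 7 with the table above: the condition that the variables in W functionally depend on their parents simply says that each of their CPTs contains only 0/1 entries. The sketch below is illustrative only (it encodes the CPTs of B and C from the table, with rows indexed by the child's state and columns by A); it is not part of the paper's formal development.

```python
import numpy as np

def is_functional_cpt(cpt):
    """A CPT is functional iff every entry is 0 or 1, i.e., every conditional
    distribution over the child variable is degenerate."""
    return bool(np.all(np.isclose(cpt, 0.0) | np.isclose(cpt, 1.0)))

pr_B_given_A = np.array([[0.2, 0.6],    # Pr(B=0|A=0), Pr(B=0|A=1)
                         [0.8, 0.4]])   # Pr(B=1|A=0), Pr(B=1|A=1)
pr_C_given_A = np.array([[0.0, 1.0],    # Pr(C=0|A=0), Pr(C=0|A=1)
                         [1.0, 0.0]])   # Pr(C=1|A=0), Pr(C=1|A=1)

print(is_functional_cpt(pr_B_given_A))  # False: B is not functional
print(is_functional_cpt(pr_C_given_A))  # True: C is a function of A
```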
Both CV and W represent constraints on the models (CBNs) we consider when checking identifiability, and these two types of constraints may contradict each other. We next define two notions that characterize some important interactions between positivity constraints and functional variables.

Definition 8. Let ⟨G, V, CV, W⟩ be an F-identifiability tuple. Then CV and W are consistent if there exists a parameterization for G which induces a distribution satisfying CV and in which W functionally depend on their parents. Moreover, CV and W are separable if W ∩ vars(CV) = ∅.

If CV is inconsistent with W then the set of distributions Pr considered in Definition 7 is empty and, hence, the causal effect is not well defined (and trivially identifiable according to Definition 7). As such, one would usually want to ensure such consistency. Here are some examples of positivity constraints that are always consistent with a set of functional variables W: positivity on each treatment variable, i.e., {Pr(X) > 0, X ∈ X}; positivity on the set of non-functional treatments, i.e., {Pr(X \ W) > 0}; positivity on all non-functional variables, i.e., {Pr(V \ W) > 0}. All these examples are special cases of the following condition. For a functional variable W ∈ W, let HW be variables that intercept all directed paths from non-functional variables to W (such HW may not be unique). If none of the positivity constraints in CV mentions both W and HW, then CV and W are guaranteed to be consistent (see Proposition 25 in Appendix C). Separability is a stronger condition and it intuitively implies that the positivity constraints will not rule out any possible functions for the variables in W. We need such a condition for one of the results we present later. Some examples of positivity constraints that are separable from W are {Pr(X \ W) > 0} and {Pr(V \ W) > 0}. Studying the interactions between positivity constraints and functional variables, as we did in this section, will prove helpful later when utilizing existing identifiability algorithms (which require positivity constraints) for testing functional identifiability.

4 Functional Elimination and Projection

Our approach for testing identifiability under functional dependencies will be based on eliminating functional variables from the causal graph, which may be followed by invoking the project-ID algorithm on the resulting graph. This can be subtle though since the described process will not work for every functional variable, as we discuss in the next section. Moreover, one needs to handle the interaction between positivity constraints and functional variables carefully. The first step though is to formalize the process of eliminating a functional variable and to study the associated guarantees. Eliminating variables from a probabilistic model is a well studied operation, also known as marginalization; see, e.g., [38-40]. When eliminating variable X from a model that represents distribution Pr(Z), the goal is to obtain a model that represents the marginal distribution Pr(Y) = Σ_x Pr(x, Y) where Y = Z \ {X}.
Elimination can also be applied to a DAG G that represents conditional independencies I, leading to a new DAG G′ that represents independencies I′ that are implied by I. In fact, the projection operation of [27, 28] we discussed earlier can be understood in these terms. We next propose an operation that eliminates functional variables from a DAG and that comes with stronger guarantees compared to earlier elimination operations as far as preserving independencies.

Definition 9. The functional elimination of a variable X from a DAG G yields a new DAG attained by adding an edge from each parent of X to each child of X and then removing X from G. (Footnote 4: Appendix B extends this definition to Causal Bayesian networks (i.e., updating both CPTs and causal graph).)

For convenience, we sometimes say elimination to mean functional elimination when the context is clear. From the viewpoint of independence relations, functional elimination is not sound if the eliminated variable is not functional. In particular, the DAG G′ that results from this elimination process may satisfy independencies (identified by d-separation) that do not hold in the original DAG G. As we show later, however, every independence implied by G′ must be implied by G if the eliminated variable is functional. In the context of SCMs, functional elimination may be interpreted as replacing the eliminated variable X by its function in all structural equations that contain X. Functional elimination applies in broader contexts than SCMs though. Eliminating multiple functional variables in any order yields the same DAG (see Proposition 22 in Appendix B). For example, eliminating variables {C, D} from the DAG in Figure 2a yields the DAG in Figure 2c whether we use the order π1 = C, D or the order π2 = D, C.

Functional elimination preserves independencies that hold in the original DAG and which are not preserved by other elimination methods, including projection as defined in [27, 28]. These independencies are captured using the notion of D-separation [41, 42], which is more refined than the classical notion of d-separation [43, 44] (uppercase D- versus lowercase d-). The original definition of D-separation can be found in [42]. We provide a simpler definition next, stated as Proposition 10, as the equivalence between the two definitions is not immediate.

[Figure 2: Contrasting projection with functional projection. C, D are functional. Hidden variables are circled. Panels: (a) original DAG; (b) projection of (a) on A, B, G, H, I; (c) result of functionally eliminating C, D from (a); (d) projection of (c) on A, B, G, H, I.]

Proposition 10. Let X, Y, Z be disjoint variable sets and W be a set of functional variables in DAG G. Then X and Y are D-separated by Z in ⟨G, W⟩ iff X and Y are d-separated by Z′ in G, where Z′ is obtained as follows. Initially, Z′ = Z. Repeat the next step until Z′ stops changing: add to Z′ every variable in W whose parents are in Z′.

To illustrate the difference between d-separation and D-separation, consider again the DAG in Figure 2a and assume that variables C, D are functional. Variables G and I are not d-separated by A, but they are D-separated by A. That is, there are distributions which are induced by the DAG in Figure 2a and in which G and I are not independent given A. However, G and I are independent given A in every induced distribution in which the variables C, D are functionally determined by their parents.
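Both Definition 9 and Proposition 10 admit a direct implementation. The sketch below is an illustration under our own representation choices (a DAG stored as a mapping from each node to its parent set), not code from the paper; the toy DAG at the end is hypothetical and is not the DAG of Figure 2a.

```python
def functional_eliminate(parents, x):
    """Definition 9: add an edge from each parent of x to each child of x, then
    remove x. Sound (with respect to independencies) only when x is functional."""
    new = {v: set(ps) for v, ps in parents.items() if v != x}
    for v, ps in new.items():
        if x in ps:                    # v was a child of x
            ps.discard(x)
            ps.update(parents[x])      # v inherits x's parents
    return new

def functional_closure(parents, functional, Z):
    """Proposition 10: grow Z into Z' by repeatedly adding any functional variable
    whose parents are already in Z'. D-separation by Z in (G, W) then reduces to
    ordinary d-separation by Z' in G. Functional roots are excluded (footnote 3)."""
    Zc = set(Z)
    changed = True
    while changed:
        changed = False
        for w in functional:
            if w not in Zc and parents[w] and parents[w] <= Zc:
                Zc.add(w)
                changed = True
    return Zc

# Hypothetical toy DAG: A -> C -> G and A -> D -> I, with C and D functional.
parents = {"A": set(), "C": {"A"}, "D": {"A"}, "G": {"C"}, "I": {"D"}}
print(functional_eliminate(parents, "C"))              # G's parent becomes A
print(functional_closure(parents, {"C", "D"}, {"A"}))  # {'A', 'C', 'D'}
```

The closure computed by the second function is what allows a standard d-separation routine to answer D-separation queries without modification.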
Functional elimination preserves D-separation in the following sense.

Theorem 11. Consider a DAG G with functional variables W. Let G′ be the result of functionally eliminating variables W′ ⊆ W from G. For any disjoint sets X, Y, Z in G′, X and Y are D-separated by Z in ⟨G, W⟩ iff X and Y are D-separated by Z in ⟨G′, W \ W′⟩.

We now define the operation of functional projection, which augments the original projection operation in [27, 28] in the presence of functional dependencies.

Definition 12. Let G be a DAG, V be its observed variables, and W be its hidden functional variables (W ∩ V = ∅). The functional projection of G on V is a DAG obtained by functionally eliminating variables W from G and then projecting the resulting DAG on variables V.

We will now contrast functional projection and classical projection using the causal graph in Figure 2a, assuming that the observed variables are V = {A, B, G, H, I} and the functional variables are W = {C, D}. Applying classical projection to this causal graph yields the causal graph in Figure 2b. To apply functional projection, we first functionally eliminate C, D from Figure 2a, which yields Figure 2c, then project Figure 2c on variables V, which yields the causal graph in Figure 2d. So we now need to contrast Figure 2b (classical projection) with Figure 2d (functional projection). The latter is a strict subset of the former as it is missing two bidirected edges. One implication of this is that variables G and I are not d-separated by A in Figure 2b because they are not d-separated in Figure 2a. However, they are D-separated in Figure 2a and hence they are d-separated in Figure 2d. So functional projection yielded a DAG that exhibits more independencies. Again, this is because G and I are D-separated by A in the original DAG, a fact that is not visible to projection but is visible to (and exploitable by) functional projection. An important corollary of functional projection is the following. Suppose all functional variables are hidden; then two observed variables are D-separated in the causal graph G iff they are d-separated in the functional-projected graph G′. This shows that such D-separations in G appear as classical d-separations in G′, which allows us to feed G′ into existing identifiability algorithms as we show later. This is a key enabler of some results we shall present next on testing functional identifiability.

5 Causal Identification with Functional Dependencies

[Figure 3: B is functional. Panels: (a) causal graph; (b) projection.]

Consider the causal graph G in Figure 3a and let V = {A, X, Y} be its observed variables. According to Definition 4 of identifiability, the causal effect of X on Y is not identifiable with respect to ⟨G, V, CV⟩ where CV = {Pr(A, X, Y) > 0}. We can show this by projecting the causal graph G on the observed variables V, which yields the causal graph G′ in Figure 3b, and then applying the ID algorithm to G′, which returns FAIL. Suppose now that the hidden variable B is known to be functional. According to Definition 7 of F-identifiability, this additional knowledge reduces the number of considered models, so it actually renders the causal effect identifiable; the identifying formula is Prx(y) = Σ_a Pr(a) Pr(y|a, x), as we show later. Hence, an unidentifiable causal effect became identifiable in light of knowledge that some variable is functional, even without knowing the structural equations for this variable. The question now is: How do we algorithmically test F-identifiability?

[Figure 4: Variables A, B, C, F, X, Y are observed. Variables D, E are functional (and hidden). Panels: (a) causal graph; (b) projection of (a); (c) functional projection of (a); (d) functional elimination of F; (e) functional elimination of B.]
We will propose two techniques for this purpose, the first of which is geared towards exploiting existing algorithms for classical identifiability. This technique is based on eliminating functional variables from the causal graph while preserving F-identifiability, with the goal of getting to a point where F-identifiability becomes equivalent to classical identifiability. If we reach this point, we can use existing algorithms for classical identifiability, like the ID algorithm, to test F-identifiability. This can be subtle though since hidden functional variables behave differently from observed ones. We start with the following result.

Theorem 13. Let ⟨G, V, CV, W⟩ be an F-identifiability tuple. If G′ is the result of functionally eliminating the hidden functional variables (W \ V) from G, then the causal effect of X on Y is F-identifiable wrt ⟨G, V, CV, W⟩ iff it is F-identifiable wrt ⟨G′, V, CV, V ∩ W⟩.

An immediate corollary of this theorem is that if all functional variables are hidden, then we can reduce the question of F-identifiability to a question of identifiability since V ∩ W = ∅, so F-identifiability wrt ⟨G′, V, CV, V ∩ W = ∅⟩ collapses into identifiability wrt ⟨G′, V, CV⟩.

Corollary 14. Let ⟨G, V, CV, W⟩ be an F-identifiability tuple where CV = {Pr(V) > 0} and W are all hidden. If G′ is the result of functionally projecting G on variables V, then the causal effect of X on Y is F-identifiable wrt ⟨G, V, CV, W⟩ iff it is identifiable wrt ⟨G′, V, CV⟩. (Footnote 5: We are requiring the positivity constraint Pr(V) > 0 as the projection operation in [28] requires it. If the projection operation only requires a weaker positivity constraint C′V, we can replace CV by C′V in Corollary 14.)

This corollary suggests a method for using the ID algorithm, which is popular for testing identifiability, to establish F-identifiability by coupling ID with functional projection instead of classical projection. Consider the causal graph G in Figure 4a with observed variables V = {A, B, C, F, X, Y}. The causal effect of X on Y is not identifiable under Pr(V) > 0: projecting G on observed variables V yields the causal graph G′ in Figure 4b and the ID algorithm produces FAIL on G′. Suppose now that the hidden variables {D, E} are functional. To test whether the causal effect is F-identifiable using Corollary 14, we functionally project G on the observed variables V, which yields the causal graph G′ in Figure 4c. Applying the ID algorithm to G′ produces the following identifying formula:

Prx(y) = Σ_{b,f} Pr(f|b, x) Σ_{a,c,x′} Pr(y|a, b, c, f, x′) Pr(a, b, c, x′),

so Prx(y) is F-identifiable. We stress again that Corollary 14 and the corresponding F-identifiability algorithm apply only when all functional variables are hidden. We now treat the case when some of the functional variables are observed. The subtlety here is that, unlike hidden functional variables, eliminating an observed functional variable does not always preserve F-identifiability. However, the following result identifies conditions that guarantee the preservation of F-identifiability in this case. If all observed functional variables satisfy these conditions, then we can again reduce F-identifiability into identifiability so we can exploit existing methods for identifiability like the ID algorithm and do-calculus.

Theorem 15. Let ⟨G, V, CV, W⟩ be an F-identifiability tuple. Let Z be a set of observed functional variables that are neither treatments nor outcomes, are separable from CV, and that have observed parents.
If G′ is the result of functionally eliminating variables Z from G, then the causal effect of X on Y is F-identifiable wrt ⟨G, V, CV, W⟩ iff it is F-identifiable wrt ⟨G′, V \ Z, CV, W \ Z⟩.

We now have the following important corollary of Theorems 13 & 15, which subsumes Corollary 14.

Corollary 16. Let ⟨G, V, CV, W⟩ be an F-identifiability tuple where CV = {Pr(V \ W) > 0} and every variable in W ∩ V satisfies the conditions of Theorem 15. If G′ is the result of functionally projecting G on V \ W, then the causal effect of X on Y is F-identifiable wrt ⟨G, V, CV, W⟩ iff it is identifiable wrt ⟨G′, V \ W, CV⟩.

Consider again the causal effect of X on Y in graph G of Figure 4a with observed variables V = {A, B, C, F, X, Y}. Suppose now that the observed variable F is also functional (in addition to the hidden functional variables D, E) and assume Pr(A, B, C, X, Y) > 0. Using Corollary 16, we can functionally project G on A, B, C, X, Y to yield the causal graph G′ in Figure 4d, which reduces F-identifiability on G to classical identifiability on G′. Since strict positivity holds in G′, we can apply any existing identifiability algorithm and conclude that the causal effect is not identifiable. For another scenario, suppose that the observed variable B (instead of F) is functional and we have Pr(A, C, F, X, Y) > 0. Again, using Corollary 16, we functionally project G onto A, C, F, X, Y to yield the causal graph G′ in Figure 4e, which reduces F-identifiability on G to classical identifiability on G′. If we apply the ID algorithm to G′ we get the identifying formula (which we denote as Eq. 1):

Prx(y) = Σ_{a,f} Pr(f|a, x) Σ_{c,x′} Pr(y|a, c, f, x′) Pr(a, c, x′).    (Eq. 1)

In both scenarios above, we were able to test F-identifiability using an existing algorithm for identifiability. Corollary 16 (and Theorem 15) has yet another key application: it can help us pinpoint observations that are not essential for identifiability. To illustrate, consider the second scenario above where the observed variable B is functional in the causal graph G of Figure 4a. The fact that Corollary 16 allowed us to eliminate variable B from G implies that observing this variable is not needed for rendering the causal effect F-identifiable and, hence, is not needed for computing the causal effect. This can be seen by examining the identifying formula (Eq. 1), which does not contain variable B. This application of Corollary 16 can be quite significant in practice, especially when some variables are expensive to measure (observe), or when they may raise privacy concerns; see, e.g., [45, 46].

Theorems 13 & 15 are more far-reaching than what the above discussion may suggest. In particular, even if we cannot eliminate every (observed) functional variable using these theorems, we may still be able to reduce F-identifiability to identifiability due to the following result.

Theorem 17. Let ⟨G, V, CV, W⟩ be an F-identifiability tuple. If every functional variable has at least one hidden parent, then a causal effect of X on Y is F-identifiable wrt ⟨G, V, CV, W⟩ iff it is identifiable wrt ⟨G, V, CV⟩.

That is, if we still have functional variables in the causal graph after applying Theorems 13 & 15, and if each such variable has at least one hidden parent, then F-identifiability is equivalent to identifiability. The method we presented thus far for testing F-identifiability is based on eliminating functional variables from the causal graph, followed by applying existing tools for causal effect identification such as the project-ID algorithm and the do-calculus; a rough sketch is given below.
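The sketch below is an illustration only: `project` and `id_algorithm` are hypothetical stand-ins for an implementation of the projection operation of [27, 28] and the ID algorithm of [10], and separability and consistency of the positivity constraints are assumed rather than checked. It reuses `functional_eliminate` from the earlier sketch and strings Theorems 13, 15 and 17 together.

```python
def f_identify(G, V, W, X, Y, project, id_algorithm):
    """Elimination-based reduction of F-identifiability to classical identifiability.
    G maps each node to its parent set; V (observed), W (functional), X (treatments)
    and Y (outcomes) are sets of node names. Returns whatever `id_algorithm` returns
    (an identifying formula or FAIL), or None if the reduction does not apply."""
    for w in sorted(W - V):                     # Theorem 13: hidden functional variables
        G = functional_eliminate(G, w)
    leftover = set()
    for w in sorted(W & V):                     # Theorem 15: observed functional variables
        if w not in X | Y and G[w] <= V:        # not a treatment/outcome, observed parents
            G = functional_eliminate(G, w)      # (separability from CV assumed, not checked)
            V = V - {w}
        else:
            leftover.add(w)
    if any(not (G[w] - V) for w in leftover):   # Theorem 17 needs a hidden parent for each
        return None                             # reduction to identifiability not established
    return id_algorithm(project(G, V), X, Y)    # classical project-ID on the reduced graph
```

This is only a skeleton: in particular, it does not verify the positivity constraints required by Corollaries 14 and 16, which must hold for the final call to the ID algorithm to be meaningful.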
This F-identifiability method is complete if every observed functional variable either satisfies the conditions of Theorem 15 or has at least one hidden parent that is not functional. This elimination-based method not only tests identifiability but also provides an identifying formula if the causal effect turns out to be identifiable.

We next present another technique for reducing F-identifiability to identifiability. This method is more general and much more direct than the previous one, but it does not allow us to fully exploit some existing tools like the ID algorithm due to the positivity assumptions they make. The new method is based on pretending that some of the hidden functional variables are actually observed and is inspired by Proposition 10, which reduces D-separation to d-separation using a similar technique.

Theorem 18. Let ⟨G, V, CV, W⟩ be an F-identifiability tuple where CV = {Pr(X) > 0, X ∈ X}. A causal effect of X on Y is F-identifiable wrt ⟨G, V, CV, W⟩ iff it is identifiable wrt ⟨G, V′, CV⟩ where V′ is obtained as follows. Initially, V′ = V. Repeat until V′ stops changing: add to V′ a functional variable from W if its parents are in V′.

Consider the causal effect of X on Y in graph G of Figure 4a and suppose the observed variables are V = {A, B, C, X, Y}, the functional variables are {D, E, F}, and we have Pr(X) > 0. By Theorem 18, the causal effect of X on Y is F-identifiable iff it is identifiable in G while pretending that variables V′ = {A, B, C, D, E, F, X, Y} are all observed. In this case, the causal effect is not identifiable, but we cannot obtain this answer by applying an identifiability algorithm that requires positivity constraints which are stronger than Pr(X) > 0. If we have stronger positivity constraints that imply Pr(X) > 0 for every X ∈ X, then only the if part of Theorem 18 will hold, assuming CV and W are consistent. That is, confirming identifiability wrt ⟨G, V′, CV⟩ will confirm F-identifiability wrt ⟨G, V, CV, W⟩, but if identifiability is not confirmed then F-identifiability may still hold. This suggests that, to fully exploit the power of Theorem 18, one would need a new class of identifiability algorithms that can operate under the weakest possible positivity constraints.

6 Conclusion

We studied the identification of causal effects in the presence of a particular type of knowledge called functional dependencies. This augments earlier works that considered other types of knowledge such as context-specific independence. Our contributions include formalizing the notion of functional identifiability; the introduction of an operation for eliminating functional variables from a causal graph that comes with stronger guarantees compared to earlier elimination methods; and the employment (under some conditions) of existing algorithms, such as the ID algorithm, for testing functional identifiability and for obtaining identifying formulas. We also provided a complete reduction of functional identifiability to classical identifiability under very weak positivity constraints, and showed how our results can be used to reduce the number of variables needed in observational data.

Acknowledgements

We wish to thank Scott Mueller, Jin Tian, and anonymous reviewers for providing valuable feedback on earlier versions of this paper. This work has been partially supported by ONR grant N000142212501.

References

[1] Judea Pearl and Dana Mackenzie. The Book of Why: The New Science of Cause and Effect. Basic Books, 2018.

[2] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, second edition, 2009.
[3] Judea Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–688, 1995.

[4] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. Second edition. Adaptive Computation and Machine Learning. MIT Press, 2000.

[5] Guido W. Imbens and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, 2015.

[6] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, 2017.

[7] Miguel A. Hernán and James M. Robins. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC, 2020.

[8] Judea Pearl. [Bayesian analysis in expert systems]: Comment: Graphical models, causality and intervention. Statistical Science, 8(3):266–269, 1993.

[9] Yimin Huang and Marco Valtorta. Identifiability in causal Bayesian networks: A sound and complete algorithm. In AAAI, pages 1149–1154. AAAI Press, 2006.

[10] Ilya Shpitser and Judea Pearl. Identification of joint interventional distributions in recursive semi-Markovian causal models. In AAAI, pages 1219–1226. AAAI Press, 2006.

[11] Marco Zaffalon, Alessandro Antonucci, and Rafael Cabañas. Causal expectation-maximisation. In WHY Workshop, NeurIPS, 2021. https://arxiv.org/abs/2011.02912.

[12] Marco Zaffalon, Alessandro Antonucci, Rafael Cabañas, and David Huber. Approximating counterfactual bounds while fusing observational, biased and randomised data sources. International Journal of Approximate Reasoning, 162:109023, 2023.

[13] Adnan Darwiche. Causal inference with tractable circuits. In WHY Workshop, NeurIPS, 2021. https://arxiv.org/abs/2202.02891.

[14] David Huber, Yizuo Chen, Alessandro Antonucci, Adnan Darwiche, and Marco Zaffalon. Tractable bounding of counterfactual queries by knowledge compilation. Sixth Workshop on Tractable Probabilistic Modeling @ UAI 2023, 2023. URL https://tractable-probabilistic-modeling.github.io/tpm2023/papers.

[15] Yonghan Jung, Jin Tian, and Elias Bareinboim. Estimating causal effects using weighting-based estimators. In AAAI, pages 10186–10193. AAAI Press, 2020.

[16] Yonghan Jung, Jin Tian, and Elias Bareinboim. Learning causal effects via weighted empirical risk minimization. In NeurIPS, 2020.

[17] Yonghan Jung, Jin Tian, and Elias Bareinboim. Estimating identifiable causal effects through double machine learning. In AAAI, pages 12113–12122. AAAI Press, 2021.

[18] Yonghan Jung, Jin Tian, and Elias Bareinboim. Estimating identifiable causal effects on Markov equivalence class through double machine learning. In ICML, volume 139 of Proceedings of Machine Learning Research, pages 5168–5179. PMLR, 2021.

[19] Sanghack Lee, Juan D. Correa, and Elias Bareinboim. Identifiability from a combination of observations and experiments. In AAAI, pages 13677–13680. AAAI Press, 2020.

[20] Amir Mohammad Abouei, Ehsan Mokhtarian, and Negar Kiyavash. s-ID: Causal effect identification in a sub-population. In AAAI, pages 20302–20310. AAAI Press, 2024.

[21] Santtu Tikka, Antti Hyttinen, and Juha Karvanen. Identifying causal effects via context-specific independence relations. In NeurIPS, pages 2800–2810, 2019.

[22] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[23] Adnan Darwiche. An advance on variable elimination with applications to tensor-based computation. In ECAI, volume 325 of Frontiers in Artificial Intelligence and Applications, pages 2559–2568. IOS Press, 2020.
[24] Yizuo Chen, Arthur Choi, and Adnan Darwiche. Supervised learning with background knowledge. In 10th International Conference on Probabilistic Graphical Models (PGM), 2020.

[25] Yizuo Chen and Adnan Darwiche. On the definition and computation of causal treewidth. In UAI, Proceedings of the 38th Conference on Uncertainty in Artificial Intelligence, 2022.

[26] Yunqiu Han, Yizuo Chen, and Adnan Darwiche. On the complexity of counterfactual reasoning. In IJCAI, pages 5676–5684. ijcai.org, 2023.

[27] Thomas S. Verma. Graphical aspects of causal models. Technical Report R-191, 1993.

[28] Jin Tian and Judea Pearl. On the testable implications of causal models with hidden variables. In UAI, pages 519–527. Morgan Kaufmann, 2002.

[29] Jin Tian and Judea Pearl. On the identification of causal effects. Technical Report R-290-L, 2003.

[30] Manabu Kuroki and Masami Miyakawa. Identifiability criteria for causal effects of joint interventions. Journal of the Japan Statistical Society, 29(2):105–117, 1999.

[31] Yaroslav Kivva, Ehsan Mokhtarian, Jalal Etesami, and Negar Kiyavash. Revisiting the general identifiability problem. In UAI, volume 180 of Proceedings of Machine Learning Research, pages 1022–1030. PMLR, 2022.

[32] Inwoo Hwang, Yesong Choe, Yeahoon Kwon, and Sanghack Lee. On positivity condition for causal inference. In ICML. OpenReview.net, 2024.

[33] Alexander Balke and Judea Pearl. Counterfactuals and policy analysis in structural models. In UAI, pages 11–18. Morgan Kaufmann, 1995.

[34] David Galles and Judea Pearl. An axiomatic characterization of causal counterfactuals. Foundations of Science, 3(1):151–182, 1998.

[35] Joseph Y. Halpern. Axiomatizing causal reasoning. Journal of Artificial Intelligence Research, 12:317–337, 2000.

[36] Terry A. Halpin and Tony Morgan. Information Modeling and Relational Databases (2nd edition). Morgan Kaufmann, 2008.

[37] C. J. Date. Database Design and Relational Theory: Normal Forms and All That Jazz. O'Reilly, 2012.

[38] Nevin L. Zhang and David L. Poole. Exploiting causal independence in Bayesian network inference. Journal of Artificial Intelligence Research, 5:301–328, 1996.

[39] Rina Dechter. Bucket elimination: A unifying framework for probabilistic inference. In Proceedings of the Twelfth Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 211–219, 1996.

[40] Adnan Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 2009.

[41] Dan Geiger and Judea Pearl. On the logic of causal models. In UAI, pages 3–14. North-Holland, 1988.

[42] Dan Geiger, Thomas Verma, and Judea Pearl. Identifying independence in Bayesian networks. Networks, 20(5):507–534, 1990.

[43] Judea Pearl. Fusion, propagation, and structuring in belief networks. Artificial Intelligence, 29(3):241–288, 1986.

[44] Thomas Verma and Judea Pearl. Causal networks: semantics and expressiveness. In UAI, pages 69–78. North-Holland, 1988.

[45] Santtu Tikka and Juha Karvanen. Enhancing identification of causal effects by pruning. Journal of Machine Learning Research, 18:194:1–194:23, 2017.

[46] Benito van der Zander, Maciej Liskiewicz, and Johannes Textor. Separators and adjustment sets in causal graphs: Complete criteria and an algorithmic framework. Artificial Intelligence, 270:1–40, 2019.

[47] Ilya Shpitser and Judea Pearl. Complete identification methods for the causal hierarchy. Journal of Machine Learning Research, 9:1941–1979, 2008.

[48] Adnan Darwiche. An advance on variable elimination with applications to tensor-based computation. In Proceedings of the 24th European Conference on Artificial Intelligence (ECAI), 2020.
[49] Adnan Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3):280–305, 2003.

A More On Projection and ID Algorithm

As mentioned in the main paper, the project-ID algorithm involves two steps: the projection operation and the ID algorithm. We review more technical details of each step in this section.

A.1 Projection

The projection [27–29] of G onto V constructs a new DAG G′ over variables V as follows. Initially, DAG G′ contains variables V but no edges. Then, for every pair of variables X, Y ∈ V, an edge is added from X to Y in G′ if X is a parent of Y in G, or if there exists a directed path from X to Y in G such that none of the internal nodes on the path is in V. A bidirected edge X <-> Y is further added between every pair of variables X and Y in G′ if there exists a divergent path between X and Y in G such that none of the internal nodes on the path is in V. (Footnote 6: A divergent path between X and Y is a path of the form X ← · · · U · · · → Y.) For example, the projection of the DAG in Figure 1a onto A, C, D, X1, X2, Y yields Figure 1c. A bidirected edge X <-> Y is compact notation for X ← H → Y where H is an auxiliary hidden variable. Hence, the projected DAG in Figure 1c can be interpreted as a classical DAG but with additional, hidden root variables. The projection operation is guaranteed to produce a DAG G′ in which hidden variables are all roots and each has exactly two children. Graphs that satisfy this property are called semi-Markovian and can be fed as inputs to the ID algorithm for testing identifiability [10]. Moreover, projection preserves some properties of G, such as d-separation [27] among V, which guarantees that identifiability is preserved when working with G′ instead of G [28].

A.2 ID Algorithm

After obtaining a projected causal graph, we can apply the ID algorithm for identifying causal effects [10, 47]. The algorithm returns either an identifying formula if the causal effect is identifiable or FAIL otherwise. The algorithm is sound since each line of the algorithm can be proved with basic probability rules and do-calculus. The algorithm is also complete since a causal graph must contain a hedge, a graphical structure that induces the unidentifiability, if the algorithm returns FAIL. The algorithm, however, is only sound and complete under certain positivity constraints, which are weaker but more subtle than strict positivity (Pr(V) > 0). The positivity constraints required by ID can be summarized as follows (X are the treatment variables): (1) Pr(X|P) > 0 where P = Parents(X) \ X, and (2) Pr(Z) > 0 for all quantities Pr(S|Z) considered by the ID algorithm. The second constraint depends on a particular run of the ID algorithm and can be interpreted as follows. First, if the ID algorithm returns FAIL then the causal effect is not identifiable even under the strict positivity constraint Pr(V) > 0. However, if the ID algorithm returns identifiable then the causal effect is identifiable under the above constraints, which are now well defined given a particular run of the ID algorithm. We illustrate with an example next. Consider the causal graph on the right, which contains observed variables {A, B, C, X, Y}. Suppose we are interested in the causal effect of X on Y; applying the ID algorithm returns the following identifying formula:

Prx(y) = Σ_{a,b,c} Pr(c|a, b, x) Σ_{x′} Pr(y|a, b, c, x′) Pr(a, b, x′).
The positivity constraint extracted from this run of the algorithm is Pr(a, b, c, x) > 0 for all a, b, c, x. That is, we can only safely declare the causal effect identifiable based on the ID algorithm if this positivity constraint is satisfied.

B Functional Elimination for CBNs

The functional elimination in Definition 9 removes functional variables from a DAG G and yields another DAG G′ on the remaining variables. We showed in the main paper that functional elimination preserves D-separation. Here we extend the notion of functional elimination to Causal Bayesian networks (CBNs), which contain not only a causal graph (DAG) but also CPTs. We show that the (extended) functional elimination preserves the marginal distribution on the remaining variables. That is, given any CBN with causal graph G, we can construct another CBN with causal graph G′ such that the two CBNs induce the same distribution on the variables in G′, where G′ is the result of eliminating functional variables from G. Moreover, we show that the functional elimination operation further preserves the causal effects, which makes it applicable to causal identification. This extended version of functional elimination and the corresponding results will be used for the proofs in Appendix C.

Recall that a CBN ⟨G, F⟩ contains a causal graph G and a set of CPTs F. We first extend the definition of functional elimination (Definition 9) from DAGs to CBNs.

Definition 19. The functional elimination of a functional variable X from a CBN ⟨G, F⟩ yields another CBN ⟨G′, F′⟩ obtained as follows. The DAG G′ is obtained from G by Definition 9. For each child C of X, its CPT in F′ is Σ_X fX fC, where fX, fC are the corresponding CPTs in F.

We first show that the new CPTs produced by Definition 19 are well-defined.

Proposition 20. Let fX and fY be the CPTs for variables X and Y in a CBN; then Σ_X fX fY is a valid CPT for Y.

The next proposition shows that functional elimination preserves the functional dependencies.

Proposition 21. Let M′ be the CBN resulting from functionally eliminating a functional variable from a CBN M. Then each variable (from M′) is functional in M′ if it is functional in M.

The next result shows that the order of functional elimination does not matter.

Proposition 22. Let M be a CBN and π1, π2 be two variable orders over a set of functional variables W. Then functionally eliminating W from M according to π1 and π2 yields the same CBN.

The next result shows that eliminating functional variables preserves the marginal distribution.

Theorem 23. Consider a CBN M which induces distribution Pr. Let M′ be the result of functionally eliminating a set of functional variables W from M, and let Pr′ be the distribution it induces. Then Pr′ = Σ_W Pr.

One key property of functional elimination is that it preserves the interventional distribution over the remaining variables. This property allows us to eliminate functional variables from a causal graph and estimate the causal effects in the resulting graph.

Theorem 24. Let M′ be the CBN over variables V resulting from functionally eliminating a set of functional variables W from a CBN M. Then M and M′ attain the same Prx(V) for any X ⊆ V.
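For concreteness, the CPT update in Definition 19 is a sum-product over the eliminated variable, which is conveniently expressed with einsum. The sketch below is illustrative only (the shapes and numbers are made up): it eliminates a functional variable X with parent A from the CPT of a child C whose other parent is B.

```python
import numpy as np

# fX(x | a): X is functional, so every column is one-hot. Shape (|X|, |A|).
f_X = np.array([[1.0, 0.0],
                [0.0, 1.0]])
# fC(c | x, b): CPT of a child C of X with another parent B. Shape (|C|, |X|, |B|).
f_C = np.array([[[0.9, 0.3], [0.2, 0.7]],
                [[0.1, 0.7], [0.8, 0.3]]])

# Definition 19: the new CPT of C is sum_X fX * fC, now conditioned on (A, B).
new_f_C = np.einsum('xa,cxb->cab', f_X, f_C)
print(new_f_C.sum(axis=0))  # every conditional distribution still sums to 1 (Proposition 20)
```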
C Proofs

The proofs of the results are ordered slightly differently from the order in which they appear in the main body of the paper.

Proof of Proposition 5

Proof. Our goal is to construct two different parameterizations F1 and F2 that induce the same Pr(V) but different Prx(y). This is done by first creating a parameterization F which contains strictly positive CPTs for all variables, and then constructing F1 and F2 based on F. Let P be the directed path from X = Xi to Y, denoted X → Z → · · · → Y, which does not contain any treatment variables other than X. Let PX be the parents of X in G. For each node M on the path, let PM be the parents of M except for the parent that lies on P. Moreover, for each variable M on P, we will only modify the conditional probability for a single state m′ of M, where x′ ∈ x is the treated state of X. Let ε be an arbitrarily small constant (close to 0); we next show the modifications for the CPTs in F1.

f1(x|pX) = 0 if x = x′, and 1/(|X| - 1) otherwise.

f1(z|x, pZ) = 1 - ε if x = x′ and z = z′; ε/(|Z| - 1) if x = x′ and z ≠ z′; ε if x ≠ x′ and z = z′; (1 - ε)/(|Z| - 1) if x ≠ x′ and z ≠ z′.

For every variable T ∉ {X, Z} which has parent Q on the path P, we assign

f1(t|q, pT) = 1 - ε if q = q′ and t = t′; ε/(|T| - 1) if q = q′ and t ≠ t′; ε if q ≠ q′ and t = t′; (1 - ε)/(|T| - 1) if q ≠ q′ and t ≠ t′.

We assign the same CPTs for X and all variables T ∉ {X, Z} but a different CPT for Z in F2:

f2(z|x, pZ) = ε if x = x′ and z = z′; (1 - ε)/(|Z| - 1) if x = x′ and z ≠ z′; ε if x ≠ x′ and z = z′; (1 - ε)/(|Z| - 1) if x ≠ x′ and z ≠ z′.

The two parameterizations F1 and F2 induce the same Pr(V), where Pr(v) = 0 if x′ ∈ v and Pr(v) > 0 otherwise. We next show that the parameterizations satisfy each positivity constraint Pr(S|Z) > 0 as long as it does not imply Pr(X) > 0. We first show that X ∈ S implies Pr(X) > 0. This is because Pr(S) = Σ_z Pr(S|z) Pr(z) and there must exist some instantiation z where Pr(z) > 0 and, by the constraint, Pr(S|z) > 0. This implies Pr(S) > 0 and therefore Pr(X) > 0. Hence, CV does not contain a constraint Pr(S|Z) > 0 where X ∈ S. Suppose X ∈ Z; then Pr(z) > 0 if and only if x′ ∉ z. Moreover, since Pr(v) > 0 whenever x′ ∉ v, it is guaranteed that Pr(S, z) > 0 when Pr(z) > 0, which implies Pr(S|Z) > 0. Finally, suppose X ∉ (S ∪ Z); then Pr(S, Z) = Σ_x Pr(S, Z, x) > 0. Hence, the positivity constraint is satisfied by both parameterizations. By construction, Pr1 and Pr2 induce different values for the causal effect Prx(y) since the probability of Y = y under the treatment do(X = x′) will be different for the two parameterizations.

Proposition 25. Let G be a causal graph and V be its observed variables. A set of functional variables W is consistent with positivity constraints CV if no single constraint in CV mentions both W ∈ W and a set HW that intercepts all directed paths from non-functional variables to W.

Proof of Proposition 25. We construct a parameterization F and show that the distribution Pr induced by F satisfies CV, which ensures the consistency. The states of each variable V are represented in the form (sV, p1, . . . , pm) where sV and the pi (i ∈ {1, . . . , m}) are all binary indicators (0 or 1). Specifically, each pi corresponds to a functional descendant path of V, defined as follows: a functional descendant path of V is a directed path that starts with V and in which all variables on the path (excluding V) are functional. If V does not have any functional descendant paths, then the state of V is simply represented as (sV). We next show how to assign CPTs for each variable in the causal graph G based on whether the variable is functional. For each non-functional variable, we assign a uniform distribution. For each functional variable W whose parents are T1, . . . , Tn and whose functional descendant paths are P1, . . .
Proposition 25. Let G be a causal graph and V be its observed variables. A set of functional variables W is consistent with positivity constraints CV if no single constraint in CV mentions both W ∈ W and a set H_W that intercepts all directed paths from non-functional variables to W.

Proof of Proposition 25. We construct a parameterization F and show that the distribution Pr induced by F satisfies CV, which ensures the consistency. The states of each variable V are represented in the form (s_V, p_1, ..., p_m), where s_V and the p_i (i ∈ {1, ..., m}) are all binary indicators (0 or 1). Specifically, each p_i corresponds to a functional descendant path of V, defined as follows: a functional descendant path of V is a directed path that starts with V and in which all variables on the path (excluding V) are functional. If V does not have any functional descendant paths, then its state is simply represented as (s_V). We next show how to assign CPTs for each variable in the causal graph G based on whether the variable is functional. For each non-functional variable, we assign a uniform distribution. For each functional variable W whose parents are T1, ..., Tn and whose functional descendant paths are P1, ..., Pm, we assign the CPT f_W as follows:

s_W = p^{T1}_{Ind(T1,W)} ⊕ ⋯ ⊕ p^{Tn}_{Ind(Tn,W)}
p_j = p^{T1}_{1,j} ⊕ ⋯ ⊕ p^{Tn}_{n,j}   for j = 1, ..., m,

where Ind(Ti, W) denotes the index assigned to the path {(Ti, W)} (which contains a single edge) in the state of Ti, and p^{Ti}_{i,j} denotes the indicator in the state of Ti for the functional descendant path P′ that contains the functional descendant path Pj, i.e., P′ = {(Ti, W)} ∪ Pj. For simplicity, we call a set of variables H_W that satisfies the condition in the proposition a functional ancestor set of W. We show that Pr(S, Z) > 0 for each positivity constraint of the form Pr(S|Z) > 0. Let W ⊆ S ∪ Z be the subset of functional variables. Since S ∪ Z does not contain any functional ancestor set of W for each W ∈ W, it follows that there exist directed paths from a set of non-functional variables A′_W to W that are unblocked by M = S ∪ Z \ {W} and contain only functional variables (excluding A′_W). We can further assume that A′_W is chosen such that the set A_W = M ∪ A′_W forms a valid functional ancestor set for W. We next show that for any state w of W and instantiation m of M, there exists at least one instantiation a of A′_W such that Pr(w, m, a) > 0. Let P^W denote the set of all directed paths from A_W to W that do not contain A_W (except for the first node on the path). Let P^W_1 ⊆ P^W be the paths that start with a variable in M, and P^W_2 ⊆ P^W be the other paths, which start with a variable in A′_W. Moreover, for any path P, let pathval(P) be the binary indicator (e.g., p_1) for P in the state of P(0) (the first variable on P). Since the value assignments for pathval(P) are independent for different P's, we can always find some instantiation a of A′_W such that the following equality holds given w and m:

⊕_{P2 ∈ P^W_2} pathval(P2) = s_W ⊕ ⊕_{P1 ∈ P^W_1} pathval(P1).

We next assign values for the other path indicators of a such that the indicators for the functional descendant paths in the state w are set correctly. In particular, for each functional descendant path P of W, let P be the set of functional descendant paths of A_W that do not contain A_W (except for the first node on the path) and that contain P as a sub-path. Let P_1 ⊆ P be the paths that start with a variable in M, and P_2 ⊆ P be the other paths, which start with a variable in A′_W. Again, since all the indicators for paths in P are independent, we can assign the indicators for a of A′_W such that

⊕_{P2 ∈ P_2} pathval(P2) = pathval(P) ⊕ ⊕_{P1 ∈ P_1} pathval(P1).

Finally, we combine the cases for each individual W ∈ W by forming the union A = ∪_{W ∈ W} A_W. Since all the functional descendant paths we considered for the different W's are disjoint, we can always find an assignment a for A that is consistent with the functional dependencies (does not produce any zero probabilities). Consequently, there must exist some full instantiation (u, v) compatible with s, z, a such that Pr(u, v) > 0, which implies Pr(s, z) > 0.

Proof of Proposition 20

Proof. Suppose Y is not a child of X in the CBN; then Σ_X f_X f_Y = f_Y (Σ_X f_X), which is guaranteed to be a CPT for Y. Suppose Y is a child of X. Let P_X denote the parents of X and P_Y denote the parents of Y excluding X. The new factor g = Σ_X f_X f_Y is defined over P_X ∪ P_Y ∪ {Y}. Consider each instantiation p_X and p_Y; then Σ_y g(p_X, p_Y, y) = Σ_y Σ_x f_X(x|p_X) f_Y(y|p_Y, x) = Σ_x f_X(x|p_X) Σ_y f_Y(y|p_Y, x) = 1. Hence, g is a CPT for Y.

Proof of Proposition 21

Proof. Let X be the functional variable that is functionally eliminated. By definition, the elimination only affects the CPTs for the children of X.
Hence, any functional variable that is not a child of X remains functional. For each child C of X that is functional, the new CPT Σ_X f_X f_C only contains values that are either 0 or 1, since both f_X and f_C are functional.

Proof of Proposition 22

Proof. First note that π2 can always be obtained from π1 by a sequence of transpositions, where each transposition swaps two adjacent variables in the first sequence. Let π be an elimination order and let π′ be the elimination order resulting from swapping πi = X and πi+1 = Y in π, i.e.,

π = (..., X, Y, ...)    π′ = (..., Y, X, ...).

We show that functional elimination according to π and according to π′ yields the same CBN, which can be applied inductively to conclude that elimination according to π1 and π2 yields the same CBN. Since π and π′ agree on the same elimination order up to X, they yield the same CBN before eliminating variables X, Y. It thus suffices to show that the CBNs resulting from eliminating (X, Y) and from eliminating (Y, X) are the same. Let ⟨G, Pr⟩ be the CBN before eliminating variables X, Y. Suppose X and Y do not belong to the same family (a family contains a variable and its parents); then the eliminations of X and Y are independent and the order of elimination does not matter. Suppose X and Y belong to the same family; then they are either parent and child or co-parents (see Footnote 7). WLOG, suppose X is a parent of Y. Eliminating (X, Y) and eliminating (Y, X) yield the same causal graph, which is defined as follows: each child C of Y has parents P_X ∪ P_Y ∪ P_C \ {X, Y}, and any other child C of X has parents P_X ∪ P_C \ {X}. We next consider the CPTs. For each common child C of X and Y, its CPT resulting from eliminating (X, Y) is f¹_C = Σ_Y (Σ_X f_C f_X)(Σ_X f_Y f_X), and the CPT resulting from eliminating (Y, X) is f²_C = Σ_X f_X (Σ_Y f_C f_Y). Since X is a parent of Y, we have Y ∉ f_X and

f²_C = Σ_X f_X (Σ_Y f_C f_Y) = Σ_Y Σ_X f_X f_C f_Y = Σ_Y (Σ_X f_X f_C)(Σ_X f_X f_Y) = f¹_C,

where the last step uses the fact that f_X is functional (cf. [48, Corollary 1]). We next consider the case when C is a child of Y but not a child of X. The CPT for C resulting from eliminating (X, Y) is f¹_C = Σ_Y f_C (Σ_X f_Y f_X), and the CPT resulting from eliminating (Y, X) is f²_C = Σ_X f_X (Σ_Y f_C f_Y). Again, since X is a parent of Y, we have Y ∉ f_X and

f²_C = Σ_X f_X (Σ_Y f_C f_Y) = Σ_Y Σ_X f_X f_C f_Y = Σ_Y f_C (Σ_X f_X f_Y) = f¹_C.

We finally consider the case when C is a child of X but not a child of Y. Regardless of the order on X and Y, the CPT for C resulting from eliminating X and Y is Σ_X f_X f_C. We next consider the case when X and Y are co-parents. Regardless of the order on X and Y, the causal graph resulting from the elimination satisfies the following properties: (1) for each common child C of X and Y, the parents of C are P_X ∪ P_Y ∪ P_C \ {X, Y}; (2) for each C that is a child of X but not a child of Y, the parents of C are P_X ∪ P_C \ {X}; (3) for each C that is a child of Y but not a child of X, the parents of C are P_Y ∪ P_C \ {Y}. We next consider the CPTs. For each common child C of X and Y, its CPT resulting from eliminating (X, Y) is f¹_C = Σ_Y f_Y (Σ_X f_X f_C), and the CPT resulting from eliminating (Y, X) is f²_C = Σ_X f_X (Σ_Y f_Y f_C). Since X and Y are not parent and child, we have X ∉ f_Y, Y ∉ f_X, and

f¹_C = Σ_Y f_Y (Σ_X f_X f_C) = Σ_X Σ_Y f_Y f_X f_C = Σ_X f_X (Σ_Y f_Y f_C) = f²_C.

For each C that is a child of X but not a child of Y, regardless of the order on X and Y, the CPT for C resulting from eliminating variables X and Y is Σ_X f_X f_C. A similar result holds for each C that is a child of Y but not a child of X.
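Several steps in the preceding derivation, and in the proof of Theorem 23 below, rely on the fact that a sum over a functional variable factorizes across the factors that mention it (the property invoked via [48, Corollary 1]). The following is a minimal numerical check of that identity, with assumed factor shapes and random values.

```python
import numpy as np

# Check (with assumed shapes and random values) that when f_X is a *functional* CPT,
#   sum_X f_X * g * h  =  (sum_X f_X * g) * (sum_X f_X * h).
rng = np.random.default_rng(0)

nA, nX, nB, nC = 3, 3, 2, 4
f_X = np.eye(nX)[rng.integers(nX, size=nA)]   # functional CPT f_X(X | A): one-hot rows
g = rng.random((nX, nB))                      # arbitrary factor mentioning X
h = rng.random((nX, nC))                      # another arbitrary factor mentioning X

lhs = np.einsum('ax,xb,xc->abc', f_X, g, h)              # sum_X f_X g h
rhs = np.einsum('ab,ac->abc',
                np.einsum('ax,xb->ab', f_X, g),          # sum_X f_X g
                np.einsum('ax,xc->ac', f_X, h))          # sum_X f_X h
assert np.allclose(lhs, rhs)                             # fails in general for a non-functional f_X
```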
Proof of Theorem 23

Proof. It suffices to show that Pr′ = Σ_X Pr when we eliminate a single variable X. Let F denote the set of CPTs for M. Since f_X is a functional CPT for X, we can replicate f_X in F, which yields a new CPT set (replication) F̃ that induces the same distribution as F; see details in [48, Theorem 4]. Specifically, we pair the CPT of each child C of X with an extra copy of f_X, denoted (f_X, f_C), which yields a list of pairs (f_X, f_C1), ..., (f_X, f_Ck), where C1, ..., Ck are the children of X. Functionally eliminating X from F̃ then yields

Σ_X Pr = Σ_X Π F = Σ_X Π F̃ = (Π H) (Σ_X f_X f_C1) ⋯ (Σ_X f_X f_Ck)   [48, Corollary 1]
       = (Π H) f′_C1 ⋯ f′_Ck = Pr′,

where H are the CPTs in F that do not contain X and each f′_Ci is the CPT of child Ci in M′. (Footnote 7: X and Y are co-parents if they have the same child.)

Proof of Theorem 24

Lemma 26. Consider a CBN ⟨G, F⟩ and its mutilated CBN ⟨Gx, Fx⟩ under do(x). Let W be a functional variable not in X, and let ⟨G′, F′⟩ and ⟨G′x, F′x⟩ be the results of functionally eliminating W from ⟨G, F⟩ and ⟨Gx, Fx⟩, respectively. Then ⟨G′x, F′x⟩ is the mutilated CBN for ⟨G′, F′⟩.

Proof. First observe that the children of W in G and Gx can only differ on the variables in X. Let C1 be the children of W in both G and Gx, and let C2 be the children of W in G but not in Gx. By the definition of mutilated CBN, W has the same set of parents and CPT in ⟨G, F⟩ and ⟨Gx, Fx⟩. Similarly, each child C ∈ C1 has the same set of parents and CPT in ⟨G, F⟩ and ⟨Gx, Fx⟩. Hence, eliminating W yields the same set of parents and CPT for each C ∈ C1 in ⟨G′, F′⟩ and ⟨G′x, F′x⟩. We next consider the set of parents and CPT for each child C ∈ C2. Since W is not a parent of C in ⟨Gx, Fx⟩, variable C has the same set of parents and CPT in ⟨G′x, F′x⟩ as in ⟨Gx, Fx⟩. The exact same set of parents (empty) and CPT will be assigned to C in the mutilated CBN for ⟨G′, F′⟩.

Proof. (Theorem 24) Consider a CBN ⟨G, F⟩ and its mutilated CBN ⟨Gx, Fx⟩. Let Pr and Prx be the distributions induced by F and Fx over the variables V, respectively. By Lemma 26, we can eliminate each W ∈ W inductively from ⟨G, F⟩ and ⟨Gx, Fx⟩ and obtain ⟨G′, F′⟩ and its mutilated CBN ⟨G′x, F′x⟩. By Theorem 23, the distribution induced by F′x is exactly Σ_W Prx, i.e., Prx(V′); hence M and M′ attain the same causal effect Prx(V′).

Proof of Proposition 10

Proof. First note that the extended set Z′ contains Z and all variables that are functionally determined by Z. Consider any path P between some X ∈ X and Y ∈ Y. We show that P is blocked by Z′ iff it is blocked by Z according to the definition in [42]. We first show the if-part. Suppose there is a convergent valve (see Footnote 8) for variable W that is closed when conditioned on Z; then the valve is still closed when conditioned on Z′ unless the parents of W are in Z′. However, the path P will be blocked in the latter case since the parents of W must have sequential/divergent valves. Suppose there is a sequential/divergent valve for variable W that is closed when conditioned on Z according to [42]; then W must be in Z′ since it is functionally determined by Z. Hence, the valve is also closed when conditioned on Z′. We next show the only-if part. Suppose a convergent valve for variable W is closed when conditioned on Z′; then none of Z is a descendant of W since Z′ is a superset of Z. Suppose a sequential/divergent valve for variable W is closed when conditioned on Z′; then W is functionally determined by Z by the construction of Z′. Thus, the valve is closed according to [42].

Proof of Theorem 11

Proof. By induction, it suffices to show that X and Y are D-separated by Z in ⟨G, W⟩ iff they are D-separated by Z in ⟨G′, W′⟩, where G′ is the result of functionally eliminating a single variable T ∈ W from G and W′ = W \ {T}. We first show the contrapositive of the if-part.
Suppose X and Y are not D-separated by Z in ⟨G, W⟩. By the completeness of D-separation, there exists a parameterization F on G such that (X ⊥̸ Y | Z) holds under F. If we eliminate T from the CBN ⟨G, F⟩, we obtain another CBN ⟨G′, F′⟩, where F′ is the parameterization for G′. By Theorem 23, the marginal probabilities are preserved for the variables in G′, which include X, Y, Z. Hence, (X ⊥̸ Y | Z) holds under F′, and X and Y are not D-separated by Z in ⟨G′, W′⟩. Next consider the contrapositive of the only-if part. Suppose X and Y are not D-separated by Z in ⟨G′, W′⟩; then there exists a parameterization F′ of G′ such that (X ⊥̸ Y | Z) holds under F′, by the completeness of D-separation. We construct a parameterization F for G such that F′ is the parameterization of G′ which results from eliminating T from the CBN ⟨G, F⟩. This is sufficient to show that X and Y are not D-separated by Z in ⟨G, W⟩, since the marginals are preserved by Theorem 23.

Construction Method. Let P_T and C_T denote the parents and children of T in G. Our construction assumes that the cardinality of T is the number of instantiations of its parents P_T. That is, there is a one-to-one correspondence between the states of T and the instantiations of P_T, and we use α(t) to denote the instantiation p_T corresponding to state t. The functional CPT for T is assigned as f_T(t|p_T) = 1 if α(t) = p_T and f_T(t|p_T) = 0 otherwise, for each instantiation p_T of P_T. Now consider each child C ∈ C_T that has parents P_C (excluding T) and T in G. It immediately follows from Definition 9 that C has parents P_T ∪ P_C in G′. (Footnote 8: See [40, Ch. 4] for more details on convergent, divergent and sequential valves.) We next construct the CPT f_C in F based on its CPT f′_C in F′. Consider each parent instantiation (t, p_C), where t is a state of T and p_C is an instantiation of P_C. If α(t) is consistent with p_C, assign f_C(c|t, p_C) = f′_C(c|α(t), p_C) for each state c (see Footnote 9). Otherwise, assign any functional distribution for f_C(C|t, p_C). The construction ensures that the constructed CPT f_T for T is functional, and that the functional dependencies among other variables are preserved. In particular, for each child C of T, the constructed CPT f_C is functional iff f′_C is functional. This construction method will be reused later in other proofs. We now just need to show that the CBN ⟨G′, F′⟩ is the result of eliminating T from the (constructed) CBN ⟨G, F⟩. In particular, we need to check that the CPT for each child C ∈ C_T in F′ is correctly computed from the constructed CPTs in F. For each instantiation (p_T, p_C) and state c of C,

f′_C(c|p_T, p_C) = f_C(c|t∗, p_C) = f_T(t∗|p_T) f_C(c|t∗, p_C) = Σ_t f_T(t|p_T) f_C(c|t, p_C),

where t∗ is the state of T such that α(t∗) = p_T.
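A minimal sketch of this construction method, on an assumed chain P → T → C (so that G′ is P → C after eliminating T); the sizes and the CPT f′_C below are illustrative assumptions.

```python
import numpy as np

# Sketch of the construction method from the proof of Theorem 11 on an assumed
# chain P -> T -> C.  T gets one state per instantiation of its parent P, with
# the one-to-one mapping alpha(t) = t.
rng = np.random.default_rng(1)

nP, nC = 4, 3
fprime_C = rng.dirichlet(np.ones(nC), size=nP)   # f'_C(C | P) in the eliminated graph G'

nT = nP
f_T = np.eye(nP, nT)                             # f_T(t | p) = 1 iff alpha(t) = p (functional)
f_C = fprime_C.copy()                            # f_C(c | t) = f'_C(c | alpha(t))

# Eliminating T from the constructed CBN recovers f'_C:  sum_t f_T(t|p) f_C(c|t) = f'_C(c|p).
assert np.allclose(f_T @ f_C, fprime_C)
```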
Proof of Theorem 13

Proof. We prove the theorem by induction. It suffices to show the following statement: for each causal graph G with observed variables V and functional variables W, the causal effect Prx(Y) is F-identifiable wrt ⟨G, V, CV, W⟩ iff it is F-identifiable wrt ⟨G′, V, CV, W′⟩, where G′ is the result of functionally eliminating some hidden functional variable T ∈ W and W′ = W \ {T}. We first show the contrapositive of the if-part. Suppose Prx(Y) is not F-identifiable wrt ⟨G, V, CV, W⟩; then there exist two CBNs ⟨G, F1⟩ and ⟨G, F2⟩ which induce distributions Pr1, Pr2 such that Pr1(V) = Pr2(V) but Pr1x(Y) ≠ Pr2x(Y). Let ⟨G′, F′1⟩ and ⟨G′, F′2⟩ be the results of eliminating T ∉ V from ⟨G, F1⟩ and ⟨G, F2⟩; the two CBNs attain the same marginal distribution on V but different causal effects, by Theorem 23 and Theorem 24. Hence, Prx(Y) is not F-identifiable wrt ⟨G′, V, CV, W′⟩ either. We next show the contrapositive of the only-if part. Suppose Prx(Y) is not F-identifiable wrt ⟨G′, V, CV, W′⟩; then there exist two CBNs ⟨G′, F′1⟩ and ⟨G′, F′2⟩ which induce distributions Pr′1, Pr′2 such that Pr′1(V) = Pr′2(V) but Pr′1x(Y) ≠ Pr′2x(Y). We can obtain ⟨G, F1⟩ and ⟨G, F2⟩ by considering again the construction method in the proof of Theorem 11, where we assign a one-to-one mapping for T and adopt the CPTs from F′1 and F′2 for the children of T. This way, ⟨G′, F′1⟩ and ⟨G′, F′2⟩ become the results of eliminating T from the constructed ⟨G, F1⟩ and ⟨G, F2⟩. Since T ∉ V, Pr1(V) = Pr′1(V) = Pr′2(V) = Pr2(V) by Theorem 23, and Pr1x(Y) = Pr′1x(Y) ≠ Pr′2x(Y) = Pr2x(Y) by Theorem 24. Hence, Prx(Y) is not F-identifiable wrt ⟨G, V, CV, W⟩ either.

Proof of Theorem 15

Proof. Since we only functionally eliminate variables that have observed parents, it is guaranteed that each Z ∈ Z has observed parents when it is eliminated. By induction, it suffices to show that Prx(Y) is F-identifiable wrt ⟨G, V, CV, W⟩ iff it is F-identifiable wrt ⟨G′, V′, CV, W′⟩, where G′ is the result of eliminating a single functional variable Z ∈ W with observed parents from G, V′ = V \ {Z}, and W′ = W \ {Z}. We first show the contrapositive of the if-part. Suppose Prx(Y) is not F-identifiable wrt ⟨G, V, CV, W⟩; then there exist two CBNs ⟨G, F1⟩ and ⟨G, F2⟩ which induce distributions Pr1, Pr2 where Pr1(V) = Pr2(V) but Pr1x(Y) ≠ Pr2x(Y). Let ⟨G′, F′1⟩ and ⟨G′, F′2⟩ be the results of eliminating Z from ⟨G, F1⟩ and ⟨G, F2⟩; the two CBNs induce the same marginal distribution Pr′1(V′) = Pr′2(V′) by Theorem 23 but different causal effects Pr′1x(Y) ≠ Pr′2x(Y) by Theorem 24. Hence, Prx(Y) is not F-identifiable wrt ⟨G′, V′, CV, W′⟩. We now consider the contrapositive of the only-if part. Suppose Prx(Y) is not F-identifiable wrt ⟨G′, V′, CV, W′⟩; then there exist two CBNs ⟨G′, F′1⟩ and ⟨G′, F′2⟩ which induce distributions Pr′1, Pr′2 such that Pr′1(V′) = Pr′2(V′) but Pr′1x(Y) ≠ Pr′2x(Y). We again consider the construction method from the proof of Theorem 11, which produces two CBNs ⟨G, F1⟩ and ⟨G, F2⟩. (Footnote 9: For clarity, we use the notation | to separate a variable and its parents in a CPT.) Moreover, ⟨G′, F′1⟩ and ⟨G′, F′2⟩ are the results of eliminating Z from ⟨G, F1⟩ and ⟨G, F2⟩. It is guaranteed that the two constructed CBNs produce different causal effects, Pr1x(Y) = Pr′1x(Y) ≠ Pr′2x(Y) = Pr2x(Y), by Theorem 24. We need to show that ⟨G, F1⟩ and ⟨G, F2⟩ induce the same distribution over the variables V = V′ ∪ {Z}. Consider any instantiation (v′, z) of V. Since F1 and F2 assign the same one-to-one mapping α between Z and its parents in G, it is guaranteed that Pr1(v′, z) = Pr2(v′, z) = 0 except for the single state z∗ where α(z∗) = p, where p is the parent instantiation of Z consistent with v′. By construction, Pr1(v′, z∗, u) = Pr′1(v′, u) for every instantiation u of the hidden variables U. Similarly, Pr2(v′, z∗, u) = Pr′2(v′, u) for every such instantiation. It then follows that Pr1(v′, z∗) = Σ_u Pr1(v′, z∗, u) = Σ_u Pr′1(v′, u) = Pr′1(v′) = Pr′2(v′) = Σ_u Pr′2(v′, u) = Σ_u Pr2(v′, z∗, u) = Pr2(v′, z∗). This means Prx(Y) is not F-identifiable wrt ⟨G, V, CV, W⟩ either.

Proof of Theorem 17

Lemma 27. Let G be a causal graph, V be its observed variables, and W be its functional variables. Let Z be a non-descendant of W that has at least one hidden parent; then a causal effect is F-identifiable wrt ⟨G, V, CV, W⟩ iff it is F-identifiable wrt ⟨G, V, CV, W ∪ {Z}⟩.

Proof. Let W′ denote the set W ∪ {Z}.
The only-if part holds immediately from the fact that every distribution that can possibly be induced from ⟨G, W′⟩ can also be induced from ⟨G, W⟩. We next consider the contrapositive of the if-part. Suppose a causal effect is not F-identifiable wrt ⟨G, V, CV, W⟩; then there exist two CBNs ⟨G, F1⟩ and ⟨G, F2⟩ which induce distributions Pr1, Pr2 such that Pr1(V) = Pr2(V) but Pr1x(Y) ≠ Pr2x(Y). We next construct ⟨G, F′1⟩ and ⟨G, F′2⟩ which constitute an example of unidentifiability wrt ⟨G, V, CV, W′⟩. In particular, the CPTs for Z need to be functional in the constructed CBNs. WLOG, we show the construction of ⟨G, F′1⟩ from ⟨G, F1⟩, which involves two steps (the construction of ⟨G, F′2⟩ from ⟨G, F2⟩ follows the same procedure). Let P be the parents of Z. The first step constructs a CBN ⟨G″, F″1⟩ based on the known method that transforms any (non-functional) CPT into a functional CPT. This is done by adding an auxiliary hidden root parent U for Z whose states correspond to the possible functions from P to Z. The CPTs for U and Z are assigned accordingly such that f_Z = Σ_U f″_U f″_Z, where f″_U and f″_Z are the constructed CPTs in F″1 (see Footnote 10). It follows that F1 and F″1 induce the same distribution over V since Π F1 = Σ_U Π F″1. The causal effect is also preserved since summing out U is independent of the other CPTs in the mutilated CBN for ⟨G″, F″1⟩. Our second step involves converting the CBN ⟨G″, F″1⟩ (constructed in the first step) into the CBN ⟨G, F′1⟩ over the original graph G. Let T ∈ P be a hidden parent of Z in G. We merge the auxiliary parent U and T into a new variable T′ and substitute it for T in G, i.e., T′ has the same parents and children as T. T′ is constructed as the Cartesian product of U and T: each state of T′ is represented as a pair (u, t), where u is a state of U and t is a state of T. We are ready to assign new CPTs for T′ and its children. For each instantiation p_T of T's parents and each state (u, t) of T′, we assign the CPT for T′ in F′1 as f′_{T′}((u, t)|p_T) = f″_U(u) f″_T(t|p_T). Next consider each child C of T that has parents P_C (excluding T). For each instantiation p_C of P_C and each state (u, t) of T′, we assign the CPT for C in F′1 as f′_C(c|p_C, (u, t)) = f″_C(c|p_C, u, t) for each state c, where the argument u is vacuous unless C = Z (the only child of U in G″). Note that f′_C is functional iff f″_C is functional. Hence, the CPTs for W′ are all functional in F′1. We need to show that ⟨G, F′1⟩ preserves the distribution on V and the causal effect from ⟨G″, F″1⟩. Let U′ be the hidden variables in ⟨G, F′1⟩ and U″ be the hidden variables in ⟨G″, F″1⟩. The distribution on V is preserved since there is a one-to-one correspondence between each instantiation (v, u′) in ⟨G, F′1⟩ and each instantiation (v, u″) in ⟨G″, F″1⟩, where the two instantiations agree on v and are assigned the same probability, i.e., Pr′1(v, u′) = Pr″1(v, u″). Hence, Pr′1(v) = Σ_{u′} Pr′1(v, u′) = Σ_{u″} Pr″1(v, u″) = Pr″1(v) for every instantiation v. The preservation of the causal effect can be shown similarly, but on the mutilated CBNs. Thus, ⟨G, F′1⟩ preserves both the distribution on V and the causal effect from ⟨G, F1⟩. Similarly, we can construct ⟨G, F′2⟩ which preserves the distribution on V and the causal effect from ⟨G, F2⟩. The two CBNs ⟨G, F′1⟩ and ⟨G, F′2⟩ constitute an example of unidentifiability wrt ⟨G, V, CV, W′⟩.

(Footnote 10: Each state u of U corresponds to a function γ_u which maps each instantiation p of P to some state γ_u(p) of Z. The variable U thus has |Z|^|P| states since there are a total of |Z|^|P| possible functions from P to Z. For each instantiation (z, u, p), the functional CPT for Z is defined as f″_Z(z|u, p) = 1 if z = γ_u(p), and f″_Z(z|u, p) = 0 otherwise. The CPT for U is assigned as f″_U(u) = Π_p f_Z(γ_u(p)|p), where p ranges over the instantiations of P.)
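The functionalization step described in Footnote 10 can be sketched numerically as follows; the sizes and the CPT f_Z below are illustrative assumptions.

```python
import itertools
import numpy as np

# Sketch of the functionalization step (Footnote 10): an arbitrary CPT f_Z(Z | P) is
# replaced by a functional CPT with an auxiliary hidden root U whose states enumerate
# all functions from the instantiations of P to the states of Z.
rng = np.random.default_rng(2)

nP, nZ = 3, 2                                     # number of parent instantiations / states of Z
f_Z = rng.dirichlet(np.ones(nZ), size=nP)         # f_Z(z | p), rows indexed by p

# Each state u of U is a function gamma_u: p -> z, so U has nZ ** nP states.
functions = list(itertools.product(range(nZ), repeat=nP))
f_U = np.array([np.prod([f_Z[p, gamma[p]] for p in range(nP)]) for gamma in functions])
fprime_Z = np.array([[[1.0 if z == gamma[p] else 0.0 for z in range(nZ)]
                      for p in range(nP)] for gamma in functions])   # f'_Z(z | u, p)

# Summing out U recovers the original CPT:  sum_u f_U(u) f'_Z(z | u, p) = f_Z(z | p).
assert np.isclose(f_U.sum(), 1.0)
assert np.allclose(np.einsum('u,upz->pz', f_U, fprime_Z), f_Z)
```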
Proof of Theorem 17. We prove the theorem by induction. We first order all the functional variables in a bottom-up order. Let W_i denote the ith functional variable in the order and W(i) denote the functional variables that are ordered before W_i (including W_i). It follows that we can go over each W_i in the order and show that a causal effect is F-identifiable wrt ⟨G, V, CV, W(i−1)⟩ iff it is F-identifiable wrt ⟨G, V, CV, W(i)⟩, by Lemma 27. Since F-identifiability wrt ⟨G, V, CV, W(0)⟩ is equivalent to identifiability wrt ⟨G, V, CV⟩, we conclude that the causal effect is F-identifiable wrt ⟨G, V, CV, W⟩ iff it is identifiable wrt ⟨G, V, CV⟩.

Proof of Theorem 18

The proof of the theorem is organized as follows. We start with a lemma (Lemma 28) that allows us to modify the CPT of a variable when the marginal probability over its parents contains zero entries. We then show a main lemma (Lemma 29) that allows us to reduce F-identifiability to identifiability when every functional variable is observed or has a hidden parent. We finally prove the theorem based on the main lemma and previous theorems.

Lemma 28. Consider two CBNs that share the same causal graph and induce the distributions Pr1 and Pr2. Suppose the CPTs of the two CBNs only differ on f_X(X, P). Then Pr1(p) = 0 iff Pr2(p) = 0 for all instantiations p of P.

Proof. Let f_Y and g_Y denote the CPT for Y in the first and second CBN, respectively. Let An(P) be the ancestors of the variables in P (including P). If we eliminate all variables other than An(P), then we obtain the factor F1 = Π_{Y ∈ An(P)} f_Y for the first CBN and F2 = Π_{Y ∈ An(P)} g_Y for the second CBN. Since all CPTs are the same for the variables in An(P), it is guaranteed that F1 = F2. If we further eliminate the variables other than P from F1 and F2, we obtain the marginal distributions Pr1(P) = proj_P(F1) and Pr2(P) = proj_P(F2), where proj_P denotes the projection operation that sums out the variables other than P from a factor. Hence, Pr1(P) = Pr2(P), which concludes the proof.

Lemma 29. If a causal effect is F-identifiable wrt ⟨G, V, CV, W⟩ but is not identifiable wrt ⟨G, V, CV⟩, then there must exist at least one functional variable that is hidden and whose parents are all observed.

Proof. The lemma is the same as saying that if every functional variable is observed or has a hidden parent, then F-identifiability is equivalent to identifiability. We go over each functional variable Wi ∈ W in a bottom-up order Π and show the following inductive statement: a causal effect Prx(Y) is F-identifiable wrt ⟨G, V, CV, W(i)⟩ iff it is F-identifiable wrt ⟨G, V, CV, W(i−1)⟩, where W(i) is the subset of variables in W that are ordered before Wi (and including Wi) in Π. Note that W(0) = ∅ and F-identifiability wrt ⟨G, V, CV, W(0)⟩ collapses into identifiability wrt ⟨G, V, CV⟩. The if-part follows from the definitions of identifiability and F-identifiability. We next consider the contrapositive of the only-if part. Let Z be the functional variable in W that is considered in the current inductive step. Let ⟨G, F1⟩ and ⟨G, F2⟩ be the two CBNs inducing distributions Pr1 and Pr2 that constitute the unidentifiability, i.e., Pr1(V) = Pr2(V) and Pr1x(Y) ≠ Pr2x(Y). Our goal is to construct two CBNs ⟨G, F‴1⟩ and ⟨G, F‴2⟩, which induce distributions Pr‴1, Pr‴2 and contain functional CPTs for Z, such that Pr‴1(V) = Pr‴2(V) and Pr‴1x(Y) ≠ Pr‴2x(Y). If Z has a hidden parent, we directly employ Lemma 27 to construct the two CBNs. We next consider the case when Z is observed and has observed parents.
By default, we use f_Z and g_Z to denote the CPT for Z in F1 and F2, respectively. The following three steps are used to construct an instance of unidentifiability.

First Step: we construct ⟨G, F′1⟩ and ⟨G, F′2⟩ by modifying the CPTs for Z. Let P be the parents of Z in G. For each instantiation p of P where Pr1(p) = Pr2(p) = 0, we modify the entries f′_Z(Z|p) and g′_Z(Z|p) of the CPTs f′_Z and g′_Z in F′1 and F′2 as follows. Since Pr1x(Y) ≠ Pr2x(Y), there exists an instantiation y such that Pr1x(y) ≠ Pr2x(y). WLOG, assume Pr1x(y) > Pr2x(y). Since Pr1x(y) is computed as the marginal probability of y in the mutilated CBN for ⟨G, F1⟩, it can be expressed in the form of a network polynomial as shown in [49, 40]. If we treat the CPT entries of f_Z(Z|p) as unknowns, then we can write Pr1x(y) as

Pr1x(y) = α0 + α1 f_Z(z1|p) + ⋯ + αk f_Z(zk|p),

where α0, α1, ..., αk are constants and z1, ..., zk are the states of variable Z. Similarly, we can write Pr2x(y) as

Pr2x(y) = β0 + β1 g_Z(z1|p) + ⋯ + βk g_Z(zk|p).

Let αi be the maximum value among α1, ..., αk and βj be the minimum value among β1, ..., βk; our construction assigns f′_Z(zi|p) = 1 and g′_Z(zj|p) = 1. By construction, it is guaranteed that Pr′1x(y) − Pr′2x(y) ≥ Pr1x(y) − Pr2x(y) > 0, where Pr′1x(y), Pr′2x(y) denote the causal effects under the updated CPTs f′_Z(Z|p) and g′_Z(Z|p). We repeat the above procedure for all such p where Pr1(p) = 0, which yields the new CBNs ⟨G, F′1⟩ and ⟨G, F′2⟩ in which f′_Z(Z|p) and g′_Z(Z|p) are functional whenever Pr1(p) = 0. We next show that F′1 and F′2 (with the updated f′_Z and g′_Z) constitute an example of unidentifiability. Pr′1x(Y) ≠ Pr′2x(Y), since Pr′1x(y) > Pr′2x(y) for the particular instantiation y. We are left to show that the distributions Pr′1 and Pr′2 induced by ⟨G, F′1⟩ and ⟨G, F′2⟩ are the same over the observed variables V. Consider each instantiation v of V and p of P where p is consistent with v. If Pr1(p) = 0, then Pr′1(p) = Pr′2(p) = 0 by Lemma 28, and thus Pr′1(v) = Pr′2(v) = 0. Otherwise, Pr′1(v) = Pr1(v) = Pr2(v) = Pr′2(v), since none of the CPT entries consistent with v were modified.

Second Step: we construct ⟨G′, F″1⟩, ⟨G′, F″2⟩ from ⟨G, F′1⟩, ⟨G, F′2⟩ by introducing an auxiliary root parent for Z and assigning a functional CPT for Z. We add a root variable, denoted R, to be an auxiliary parent of Z, which specifies all possible functions from P′ to Z, where P′ contains all instantiations p′ of P for which Pr′1(p′) = Pr′2(p′) > 0. Each state r of R corresponds to a function φ_r which maps each instantiation p′ ∈ P′ to some state φ_r(p′) of Z. The variable R thus has |Z|^|P′| states since there are a total of |Z|^|P′| possible functions from P′ to Z. For each instantiation (z, r, p): if p ∈ P′, we define f″_Z(z|r, p) = 1 if z = φ_r(p) and f″_Z(z|r, p) = 0 otherwise; if p ∉ P′, we define f″_Z(z|r, p) = f′_Z(z|p). The CPT for R is assigned as f″_R(r) = Π_{p ∈ P′} f′_Z(φ_r(p)|p). Moreover, we remove all the states r of R where f″_R(r) = 0, which ensures f″_R(r) > 0 for all remaining r. We assign the CPT g″_Z in F″2 in a similar way. Note that f″_R = g″_R, since f′_Z(φ_r(p)|p) = Pr′1(φ_r(p)|p) = Pr′2(φ_r(p)|p) = g′_Z(φ_r(p)|p) for each p ∈ P′ (where Pr′1(p) = Pr′2(p) > 0). We next show that ⟨G′, F″1⟩ and ⟨G′, F″2⟩ constitute the unidentifiability. One key observation is that f′_Z(z|p) = Σ_r f″_R(r) f″_Z(z|p, r) and g′_Z(z|p) = Σ_r g″_R(r) g″_Z(z|p, r). Consider each instantiation (v, r), which contains the instantiation p of P and the state z of Z. Suppose Pr′1(p) = 0; then Pr″1(p) = 0, since the marginal over V is preserved in Pr″1.
Hence, Pr″1(v, r) = Pr″2(v, r) = 0. Suppose Pr′1(p) ≠ 0; then Pr″1(v, r) = (Π_{V ∈ V\{Z}} f′_V) f″_R(r) when z = φ_r(p), and Pr″1(v, r) = 0 otherwise. Similarly, Pr″2(v, r) = (Π_{V ∈ V\{Z}} g′_V) g″_R(r) when z = φ_r(p), and Pr″2(v, r) = 0 otherwise. In both cases, Pr″1(v, r) = Pr″2(v, r), since F″1 and F″2 assign the same function φ_r to each state r of R, and f′_V = g′_V for all V ∈ V \ {Z}. To see that Pr″1x(Y) = Pr′1x(Y) and Pr″2x(Y) = Pr′2x(Y), note that summing out R from Pr″1x(V, R) and Pr″2x(V, R) yields Pr′1x(V) and Pr′2x(V).

Third Step: we construct ⟨G, F‴1⟩, ⟨G, F‴2⟩ from ⟨G′, F″1⟩, ⟨G′, F″2⟩ by merging the auxiliary root variable R with an observed parent T of Z. We merge R and T into a new node T′ and substitute it for T in G, i.e., T′ has the same parents and children as T in G. Specifically, T′ is constructed as the Cartesian product of R and T: each state of T′ can be represented as (r, t), where r is a state of R and t is a state of T. We then assign the CPT f‴_{T′}(T′|P_T) in F‴1 as follows. For each parent instantiation p_T and each state (r, t) of T′, f‴_{T′}((r, t)|p_T) = f″_R(r) f″_T(t|p_T). For each child C of T that has parents P_C (excluding T), the CPT for C in F‴1 is assigned as f‴_C(c|p_C, (r, t)) = f″_C(c|p_C, r, t), where the argument r is vacuous unless C = Z (the only child of R in G′). Similarly, we assign the CPTs g‴_{T′} and g‴_C in F‴2. The distributions over the observed variables are preserved since there is a one-to-one correspondence between the instantiations over V ∪ {R} in ⟨G′, F″1⟩ and the instantiations over V in ⟨G, F‴1⟩. We next consider the causal effect. Suppose T is neither a treatment nor an outcome variable; then the merging does not affect the causal effect, by the one-to-one correspondence between instantiations, and Pr‴1x(Y) = Pr″1x(Y) ≠ Pr″2x(Y) = Pr‴2x(Y). Suppose T is an outcome variable. Since Pr″1x(Y) ≠ Pr″2x(Y), there exists an instantiation (y′, r, t), where Y′ = Y \ {T}, such that Pr″1x(y′, r, t) ≠ Pr″2x(y′, r, t). This implies Pr‴1x(y′, (r, t)) ≠ Pr‴2x(y′, (r, t)) for the particular instantiation y′ and the state (r, t) of T′. Suppose T is the treatment variable X. Since Pr″1x(Y) ≠ Pr″2x(Y), there exists an instantiation (y, r) such that Pr″1x(y, r) ≠ Pr″2x(y, r). Moreover, Pr″1x(r) = Pr″1(r) = Pr″2(r) = Pr″2x(r) > 0 (otherwise, Pr″1x(y, r) = Pr″2x(y, r) = 0). This implies Pr″1x(y|r) ≠ Pr″2x(y|r). Since R is a root in the mutilated CBN, Pr″1(x,r)(y) = Pr″1x(y|r) ≠ Pr″2x(y|r) = Pr″2(x,r)(y). We now consider the treatment do(T′ = (r, x)) on G instead of the treatment do(R = r, X = x) on G′. We have Pr‴1((r,x))(y) = Pr″1(r,x)(y) ≠ Pr″2(r,x)(y) = Pr‴2((r,x))(y) for the particular state (r, x) of T′. Moreover, Pr‴1((r, x)) = Pr″1(r, x) = Pr″1(r) Pr″1(x) > 0 by the positivity assumption on Pr″1(x). Thus, the positivity still holds for Pr‴1, and similarly for Pr‴2.

Proof of Theorem 18. Let H = V∗ \ V be the set of hidden functional variables that are functionally determined by V. By Theorem 13, the causal effect is F-identifiable wrt ⟨G, V, CV, W⟩ iff it is F-identifiable wrt ⟨G′, V, CV, W \ H⟩, where G′ is the result of functionally eliminating H from G. By construction, every variable in H has parents in V∗. Hence, by Theorem 15, it is F-identifiable wrt ⟨G′, V, CV, W \ H⟩ iff it is F-identifiable wrt ⟨G, V∗, CV, W⟩. If we consider each functional variable W ∈ W, it is either in V∗ or has some parent that is not in V∗ (otherwise, W would have been added to V∗). By Lemma 29, it is F-identifiable wrt ⟨G, V∗, CV, W⟩ iff it is identifiable wrt ⟨G, V∗, CV⟩.

NeurIPS Paper Checklist

1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes] Justification: Our abstract and introduction accurately summarize our goal of studying causal-effect identifiability with functional dependencies. Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: The limitations have been discussed in the end of Section 5. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: Also proofs have been provided in the Appendix. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. 
The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [NA] Justification: This paper does not include experiments. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: This paper does not include experiments. Guidelines: The answer NA means that paper does not include experiments requiring code. 
Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [NA] Justification: This paper does not include experiments. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [NA] Justification: This paper does not include experiments. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). 
If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [NA] Justification: This paper does not include experiments. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: We checked the code of ethics. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: We do not see any direct societal impacts of the work. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. 
If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: This paper does not release any data or model. Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [NA] Justification: This paper does not use existing assets. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: This paper does not release new assets. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. 
Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: This paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: This paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.