# Qualitative Mechanism Independence

Oliver E. Richardson, Dept. of Computer Science, Cornell University, Ithaca NY 14853, oli@cs.cornell.edu
Spencer Peters, Dept. of Computer Science, Cornell University, Ithaca NY 14853, speters@cs.cornell.edu
Joseph Y. Halpern, Dept. of Computer Science, Cornell University, Ithaca NY 14853, halpern@cs.cornell.edu

*38th Conference on Neural Information Processing Systems (NeurIPS 2024).*

**Abstract.** We define what it means for a joint probability distribution to be (QIM-)compatible with a set of independent causal mechanisms at a qualitative level, or, more precisely, with a directed hypergraph $\mathcal{A}$, which is the qualitative structure of a probabilistic dependency graph (PDG). When $\mathcal{A}$ represents a qualitative Bayesian network, QIM-compatibility with $\mathcal{A}$ reduces to satisfying the appropriate conditional independencies. But giving semantics to hypergraphs using QIM-compatibility lets us do much more. For one thing, we can capture functional dependencies. For another, QIM-compatibility captures important aspects of causality: we can use compatibility to understand cyclic causal graphs, and to demonstrate compatibility is, essentially, to produce a causal model. Finally, compatibility has deep connections to information theory. Applying compatibility to cyclic structures helps to clarify a longstanding conceptual issue in information theory.

## 1 Introduction

The structure of a (standard) probabilistic graphical model (like a Bayesian network or Markov random field) encodes a set of conditional independencies among variables. This is useful because it enables a compact description of probability distributions that have those independencies; it also lets us use graphs as a visual language for describing important qualitative properties of a probabilistic world. Yet these kinds of independencies are not the only important qualitative aspects of a probability measure. In this paper, we study a natural generalization of standard graphical model structures that can describe far more than conditional independence.

For example, another qualitative aspect of a probability distribution is that of functional dependence, which is also exploited across computer science to enable compact representations and simplify probabilistic analysis. Acyclic causal models, for instance, specify a distribution via a probability over contexts (the values of variables whose causes are viewed as outside the model), and a collection of equations (i.e., functional dependencies) [18]. And in deep learning, a popular class of models called normalizing flows [25, 12] specify a distribution by composing a fixed distribution over some latent space, say a standard normal distribution, with a function (i.e., a functional dependence) fit to observational data.

Functional dependence and independence are deeply related and interacting notions. For instance, if $B$ is a function of $A$ (written $A \to B$) and $A$ is independent of $C$ (written $A \perp C$), then $B$ and $C$ are also independent ($B \perp C$).¹ Moreover, dependence can be written in terms of independence: $Y$ is a function of $X$ if and only if $Y$ is conditionally independent of itself given $X$ (i.e., $X \to Y$ iff $Y \perp Y \mid X$). Traditional graph-based languages such as Bayesian networks (BNs) and Markov random fields (MRFs) cannot capture these relationships. Indeed, the graphoid axioms (which describe BNs and MRFs) [21] and axioms for conditional independence [17] do not even consider statements like $A \perp A$ to be syntactically valid. Yet such statements are perfectly meaningful, and reflect a deep relationship between independence, dependence, and generalizations of both notions (grounded in information theory, a point we will soon revisit).

¹This well-known fact (Lemma 10) is formalized and proved in Appendix A, where all proofs can be found.
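These interactions are easy to check numerically. The sketch below is ours, not from the paper; the toy distribution and helper functions are purely illustrative. It verifies both facts for a small discrete joint in which $B := A$ (so $A \to B$) and $C$ is independent of $A$:

```python
import math

# Toy joint over (A, B, C): B := A, so A -> B holds, and C is independent of A.
pA, pC = {0: 0.3, 1: 0.7}, {0: 0.6, 1: 0.4}
joint = {(a, a, c): pA[a] * pC[c] for a in (0, 1) for c in (0, 1)}

def marginal(dist, idxs):
    out = {}
    for w, p in dist.items():
        k = tuple(w[i] for i in idxs)
        out[k] = out.get(k, 0.0) + p
    return out

def MI(dist, i, j):  # mutual information I(i ; j), in bits; zero iff independent
    pij, pi, pj = marginal(dist, (i, j)), marginal(dist, (i,)), marginal(dist, (j,))
    return sum(p * math.log2(p / (pi[(x,)] * pj[(y,)]))
               for (x, y), p in pij.items() if p > 0)

def condH(dist, tgt, given):  # conditional entropy H(tgt | given); zero iff given -> tgt
    Hj = -sum(p * math.log2(p) for p in marginal(dist, given + tgt).values() if p > 0)
    Hg = -sum(p * math.log2(p) for p in marginal(dist, given).values() if p > 0)
    return Hj - Hg

print(MI(joint, 0, 2), MI(joint, 1, 2))  # I(A;C) = 0, and hence I(B;C) = 0
print(condH(joint, (1,), (0,)))          # H(B|A) = 0: the dependence A -> B
# Note also that I(Y;Y|X) = H(Y|X), so Y _|_ Y | X holds exactly when X -> Y.
```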
This paper provides a simple yet expressive graphical language for describing qualitative structure, such as dependence and independence, in probability distributions. The idea is to specify the inputs and outputs of a set of independent mechanisms: processes by which some target variables $T$ are determined as a (possibly randomized) function of some source variables $S$. This idea generalizes intuition going back to Pearl [18] by allowing, for example, two mechanisms to share a target variable. So at a qualitative level, the modeler specifies not a (directed) graph, but a (directed) hypergraph.

If we were interested in a concrete probabilistic model, we would also need to annotate this hypergraph with quantitative information describing the mechanisms. For directed acyclic graphs, there are two standard approaches: supply conditional probability distributions (cpds) to get a BN, or supply equations to get a causal model. Correspondingly, there are two approaches to probabilistic modeling based on hypergraphs. The analogue of the first approach (supplying a probability $P(T \mid S)$ for each mechanism) leads to the notion of a probabilistic dependency graph (PDG) [23, 22, 24]. The analogue of the second approach (supplying an equation describing $T$ as a function of $S$ and independent random noise) leads to a novel generalization of a causal model (Definition 4). Models of either kind are of interest to us only insofar as they explain how hypergraphs encode qualitative aspects of probability.

Qualitative information in a PDG was characterized by Richardson and Halpern [23] using a scoring function that, despite having some attractive properties, lacks justification and has not been fully understood. In particular, the PDG formalism does not appear to answer a basic question: what does it mean for a distribution to be compatible with a directed hypergraph structure?

We develop precisely such a notion (Definition 2) of compatibility between a distribution $\mu$ and a directed hypergraph qualitatively representing a collection of independent mechanisms: qualitative independent-mechanism compatibility or, for short, simply (QIM-)compatibility. This definition allows us to use directed hypergraphs as a language for specifying structure in probability distributions, of which the semantics of qualitative BNs are a special case (Theorem 1). Yet QIM-compatibility can do far more than represent conditional independencies in acyclic networks. For one thing, it can encode arbitrary functional dependencies (Theorem 2); for another, it gives meaningful semantics to cyclic models.

Indeed, compatibility lets us go well beyond capturing dependence and independence. The fact that Pearl [18] views causal models as representing independent mechanisms suggests that there might be a connection to causality. In fact, there is. A witness that a distribution $\mu$ is compatible with a hypergraph $\mathcal{A}$ is an extended distribution $\tilde\mu$ that is nearly equivalent to (and guarantees the existence of) a causal model that explains $\mu$ with dependency structure $\mathcal{A}$ (Propositions 3 to 5). As we shall see, thinking in terms of witnesses and compatibility allows us to tie together causality, dependence, and independence.
Perhaps surprisingly, compatibility also has deep connections with information theory (Section 4). The conditional independencies of a BN can be viewed as a certain kind of information-theoretic constraint. Our notion of compatibility with a hypergraph $\mathcal{A}$ turns out to imply a generalization of this constraint (closely related to the qualitative PDG scoring function) that is meaningful for all hypergraphs (Theorem 7). Applied to cyclic models, it yields a causally inspired notion of pairwise interaction that clarifies some important misunderstandings in information theory (Examples 5 and 6).

Saying that one approach to qualitative graphical modeling has connections to so many different notions is a rather bold claim. We spend the rest of the paper justifying it.

## 2 Qualitative Independent-Mechanism (QIM) Compatibility

In this section, we present the central definition of our paper: a way of making precise Pearl's notion of "independent mechanisms", used to motivate Bayesian networks from a causal perspective. Pearl [19, p. 22] states that each parent-child relationship in a causal Bayesian network represents "a stable and autonomous physical mechanism". But, technically speaking, a parent-child relationship only partially describes the mechanism. Instead, the autonomous mechanism that determines the child is really represented by that child's joint relationship with all its parents. So, the qualitative aspect of a mechanism is best represented as a directed hyperarc [5], which can have multiple sources.

**Definition 1.** A directed hypergraph (or simply a hypergraph, since all our hypergraphs will be directed) consists of a set $N$ of nodes and a set $A$ of directed hyperedges, or hyperarcs; each hyperarc $a \in A$ is associated with a set $S_a \subseteq N$ of source nodes and a set $T_a \subseteq N$ of target nodes. We write $S \xrightarrow{a} T \in \mathcal{A}$ to specify a hyperarc $a \in A$ together with its sources $S = S_a$ and targets $T = T_a$.

Nodes that are neither a source nor a target of any hyperarc will seldom have any effect on our constructions; the other nodes can be recovered from the hyperarcs (by selecting $N := \bigcup_{a \in A} S_a \cup T_a$). Thus, we often leave $N$ implicit, referring to the hypergraph simply as $\mathcal{A}$.

Following the graphical models literature, we are interested in hypergraphs whose nodes represent variables, so that each $X \in N$ will ultimately be associated with a (for simplicity, finite) set $\mathcal{V}(X)$ of possible values. However, one should not think of $\mathcal{V}$ as part of the information carried by the hypergraph. It makes perfect sense to say that $X$ and $Y$ are independent without specifying the possible values of $X$ and $Y$. Of course, when we talk concretely about a distribution $\mu$ on a set of variables $\mathbf{X} = (N, \mathcal{V})$, those variables must have possible values, but the qualitative properties of $\mu$, such as independence, can be expressed purely in terms of $N$, without reference to $\mathcal{V}$.
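Definition 1 translates directly into a small data structure. The sketch below is our own encoding (names such as `Hyperarc` are illustrative); it also builds the graph-to-hypergraph conversion $\mathcal{A}_G$ used before Theorem 1 below:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hyperarc:
    label: str
    sources: frozenset
    targets: frozenset

def nodes(arcs):
    """Recover the (implicit) node set N from the hyperarcs."""
    return set().union(*({*a.sources, *a.targets} for a in arcs))

def hypergraph_of_graph(edges):
    """A_G: one hyperarc per vertex u, from its parents Pa_G(u) to {u}."""
    verts = {v for e in edges for v in e}
    return [Hyperarc(u, frozenset(v for (v, w) in edges if w == u), frozenset({u}))
            for u in sorted(verts)]

for a in hypergraph_of_graph([("X", "Y"), ("Y", "Z")]):  # the chain X -> Y -> Z
    print(set(a.sources) or "{}", "->", set(a.targets))
```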
Intuitively, we expect a joint distribution $\mu(\mathbf{X})$ to be qualitatively compatible with a set of independent mechanisms (whose structure is given by a hypergraph $\mathcal{A}$) if there is a mechanistic explanation of how each target arises as a function of the variable(s) on which it depends and independent random noise. This is made precise by the following definition.

**Definition 2 (QIM-compatibility).** Let $\mathbf{X}$ and $\mathbf{Y}$ be (possibly identical) sets of variables, and $\mathcal{A} = \{S_a \xrightarrow{a} T_a\}_{a \in A}$ be a hypergraph with nodes $\mathbf{X}$. We say a distribution $\mu(\mathbf{Y})$ is qualitatively independent-mechanism compatible, or (QIM-)compatible, with $\mathcal{A}$ (symbolically: $\mu \models \mathcal{A}$) iff there exists an extended distribution $\tilde\mu(\mathbf{Y}\,\mathbf{X}\,\mathbf{U}_{\mathcal{A}})$ of $\mu(\mathbf{Y})$ to $\mathbf{X}$ and to $\mathbf{U}_{\mathcal{A}} = \{U_a\}_{a \in A}$, an additional set of noise variables (one variable per hyperarc), according to which:

(a) the variables $\mathbf{Y}$ are distributed according to $\mu$ (i.e., $\tilde\mu(\mathbf{Y}) = \mu(\mathbf{Y})$),
(b) the variables $\mathbf{U}_{\mathcal{A}}$ are mutually independent (i.e., $\tilde\mu(\mathbf{U}_{\mathcal{A}}) = \prod_{a \in A} \tilde\mu(U_a)$), and
(c) the target variable(s) $T_a$ of each hyperarc $a \in A$ are determined by $U_a$ and the source variable(s) $S_a$ (i.e., $\forall a \in A.\ \tilde\mu \models (S_a, U_a) \to T_a$).

We call such a distribution $\tilde\mu(\mathbf{X}\,\mathbf{Y}\,\mathbf{U}_{\mathcal{A}})$ a witness that $\mu$ is QIM-compatible with $\mathcal{A}$.

While Definition 2 requires the noise variables $\{U_a\}_{a \in A}$ to be independent of one another, note that they need not be independent of any variables in $\mathbf{X}$. In particular, $U_a$ may not be independent of $S_a$, and so the situation can diverge from what one would expect from a randomized algorithm, whose randomness $U$ is assumed to be independent of its input $S$. Furthermore, the variables in $\mathbf{U}$ may not be independent of one another conditional on the value of some $X \in \mathbf{X}$.

**Example 1.** $\mu(X, Y)$ is compatible with $\mathcal{A} = \{\emptyset \xrightarrow{1} \{X\},\ \emptyset \xrightarrow{2} \{Y\}\}$ (depicted in PDG notation as $X$ and $Y$ each receiving a single sourceless arrow) iff $X$ and $Y$ are independent, i.e., $\mu(X, Y) = \mu(X)\mu(Y)$. For if $U_1$ and $U_2$ are independent and respectively determine $X$ and $Y$, then $X$ and $Y$ must also be independent.

This is a simple illustration of a more general phenomenon: when $\mathcal{A}$ describes the structure of a Bayesian network (BN), then QIM-compatibility with $\mathcal{A}$ coincides with satisfying the independencies of that BN (which are given, equivalently, by the ordered Markov properties [14], factoring as a product of probability tables, or d-separation [6]). To state the general result (Theorem 1), we must first clarify how the graphs of standard graphical and causal models give rise to directed hypergraphs.

Suppose that $G = (V, E)$ is a graph, whose edges may be directed or undirected. Given a vertex $u \in V$, write $\mathrm{Pa}_G(u) := \{v : (v, u) \in E\}$ for the set of vertices that can influence $u$. There is a natural way to interpret the graph $G$ as giving rise to a set of mechanisms: one for each variable $u$, which determines the value of $u$ based on the values of the variables on which $u$ can depend. Formally, let $\mathcal{A}_G := \{\mathrm{Pa}_G(u) \xrightarrow{u} \{u\}\}_{u \in V}$ be the hypergraph corresponding to the graph $G$.

**Theorem 1.** If $G$ is a directed acyclic graph and $\mathcal{I}(G)$ consists of the independencies of its corresponding Bayesian network, then $\mu \models \mathcal{A}_G$ if and only if $\mu$ satisfies $\mathcal{I}(G)$.

Theorem 1 shows that, for hypergraphs that correspond to directed acyclic graphs (dags), our definition of compatibility reduces exactly to the well-understood independencies of BNs. This means that QIM-compatibility, a notion based on the independence of causal mechanisms, gives us a very different way of characterizing these independencies, one that can be generalized to a much larger class of graphical models that includes, for example, cyclic variants [1]. Moreover, QIM-compatibility can capture properties other than independence. As the next example shows, it can capture determinism.

**Example 2.** If $\mathcal{A} = \{\emptyset \xrightarrow{1} \{X\},\ \emptyset \xrightarrow{2} \{X\}\}$ consists of just two hyperarcs pointing to a single variable $X$, then a distribution $\mu(X)$ is QIM-compatible with $\mathcal{A}$ iff $\mu$ places all mass on a single value $x \in \mathcal{V}(X)$. Intuitively, if two independent coins always give the same answer (the value of $X$), then neither coin can be random.

This simple example shows that we can capture determinism with multiple hyperarcs pointing to the same variable. Such hypergraphs do not correspond to graphs; recall that in a BN, two arrows pointing to $X$ (e.g., $Y \to X$ and $Z \to X$) represent a single mechanism by which $X$ is jointly determined (by $Y$ and $Z$), rather than two distinct mechanisms.
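For small state spaces, conditions (a)-(c) of Definition 2 can be checked for a candidate witness by brute-force enumeration. Here is an illustrative sketch (our own encoding, with distributions as dictionaries from assignment tuples to probabilities), applied to a witness for Example 1:

```python
import math

def marginal(dist, names, keep):
    """Marginalize {assignment-tuple: prob} (coordinates named by `names`)."""
    idx = [names.index(v) for v in keep]
    out = {}
    for w, p in dist.items():
        k = tuple(w[i] for i in idx)
        out[k] = out.get(k, 0.0) + p
    return out

def is_witness(ext, names, mu, mu_names, arcs, noise):
    # (a) the marginal on the modeled variables equals mu
    if any(abs(marginal(ext, names, mu_names).get(k, 0) - p) > 1e-9
           for k, p in mu.items()):
        return False
    # (b) the noise variables are mutually independent
    for key, p in marginal(ext, names, noise).items():
        prod = math.prod(marginal(ext, names, [u])[(key[i],)]
                         for i, u in enumerate(noise))
        if abs(p - prod) > 1e-9:
            return False
    # (c) each hyperarc's targets are a function of (its noise, its sources)
    for (src, tgt), u in zip(arcs, noise):
        seen = {}
        m = marginal(ext, names, list(src) + [u] + list(tgt))
        for key, p in m.items():
            st, t = key[:len(src) + 1], key[len(src) + 1:]
            if p > 1e-9 and seen.setdefault(st, t) != t:
                return False
    return True

# Example 1: X and Y independent; take U1 := X and U2 := Y as the noise.
pX, pY = {0: 0.5, 1: 0.5}, {0: 0.2, 1: 0.8}
ext = {(x, y, x, y): pX[x] * pY[y] for x in (0, 1) for y in (0, 1)}
mu = {(x, y): pX[x] * pY[y] for x in (0, 1) for y in (0, 1)}
print(is_witness(ext, ["X", "Y", "U1", "U2"], mu, ["X", "Y"],
                 arcs=[((), ("X",)), ((), ("Y",))], noise=["U1", "U2"]))  # True
```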
Given a hypergraph $\mathcal{A} = (N, A)$, $X, Y \in N$, and a natural number $n \ge 0$, let $\mathcal{A}^{(+n)}_{X \to Y}$ denote the hypergraph that results from augmenting $\mathcal{A}$ with $n$ additional (distinct) hyperarcs from $X$ to $Y$.

**Theorem 2.**
(a) $\mu$ is compatible with $\mathcal{A}$ and satisfies the functional dependence $X \to Y$ if and only if $\mu \models \mathcal{A}^{(+n)}_{X \to Y}$ for all $n \ge 0$.
(b) if $\mathcal{A} = \mathcal{A}_G$ for a dag $G$, then $\mu$ is compatible with $\mathcal{A}$ and satisfies $X \to Y$ if and only if $\mu \models \mathcal{A}^{(+1)}_{X \to Y}$.
(c) if there is some $a \in A$ such that $S_a = \emptyset$ and $X \in T_a$, then $\mu$ is compatible with $\mathcal{A}$ and satisfies $X \to Y$ iff $\mu \models \mathcal{A}^{(+2)}_{X \to Y}$.

Based on the intuition given after Example 2, it may seem unnecessary to ever add more than two parallel hyperarcs to ensure functional dependence in part (a). However, this intuition implicitly assumes that the randomness $U_1$ and $U_2$ of the two mechanisms is independent conditional on $X$, which may not be the case. See Appendix D for counterexamples.

Finally, as mentioned above, QIM-compatibility gives meaning to cyclic structures, a topic that we will revisit often in Sections 3 and 4. We start with a simple example.

**Example 3.** Every $\mu(X, Y)$ is compatible with the 2-cycle $X \rightleftarrows Y$, because every distribution is compatible with the chain $X \to Y$, and a mechanism with no inputs is a special case of one that can depend on $Y$.

The logic above is an instance of an important reasoning principle, which we develop in Appendix B. Although the 2-cycle in Example 3 is straightforward, generalizing it even slightly to a 3-cycle raises a not-so-straightforward question, whose answer will turn out to have surprisingly broad implications.

**Example 4.** Which distributions $\mu(X, Y, Z)$ are compatible with the 3-cycle $X \to Y \to Z \to X$? By the reasoning above, among them must be all distributions consistent with a linear chain $X \to Y \to Z$. Thus, any distribution in which two variables are conditionally independent given the third is compatible with the 3-cycle. Are there distributions that are not compatible with this hypergraph? It is not obvious. We return to this in Section 4.

Because QIM-compatibility applies to cyclic structures, one might wonder if it also captures the independencies of undirected models. Our definition of $\mathcal{A}_G$, as is common, implicitly identifies an undirected edge $A - B$ with the pair $\{A \to B,\ B \to A\}$ of directed edges; in this way, it naturally converts even an undirected graph $G$ to a (directed) hypergraph. Compatibility with $\mathcal{A}_G$, however, does not coincide with any of the standard Markov properties corresponding to $G$ [13]. This may appear to be a flaw in Definition 2, but it is unavoidable (see Appendix B) if we wish to also capture causality, as we do in the next section.
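Before moving on to causality, here is one concrete realization of the witness promised by Example 3. The sketch is ours (it assumes $\mu$ is given with exact `Fraction` probabilities and only positive entries): we take $U_1 := X$ for the mechanism $Y \to X$, and let $U_2$ be the cell of a uniform grid on $[0,1)$ obtained by overlaying the conditional CDFs of $Y$ given each value of $X$, so that $U_2$ is uniform independently of $U_1 = X$, while $(X, U_2)$ determines $Y$:

```python
from collections import defaultdict
from fractions import Fraction

def two_cycle_witness(mu):
    """mu: {(x, y): Fraction}, positive entries only.  Returns a witness
    (a dict over (x, y, u1, u2)) for the 2-cycle {Y -> X, X -> Y}."""
    pX = defaultdict(Fraction)
    for (x, _), p in mu.items():
        pX[x] += p
    ys = sorted({y for (_, y) in mu})
    cuts, cdf = {Fraction(0), Fraction(1)}, {}
    for x in pX:                      # conditional CDF of Y given each x,
        acc = Fraction(0)             # with all breakpoints merged into one grid
        for y in ys:
            acc += mu.get((x, y), Fraction(0)) / pX[x]
            cdf[(x, y)] = acc
            cuts.add(acc)
    grid = sorted(cuts)
    witness = defaultdict(Fraction)
    for (x, y), p in mu.items():
        lo, hi = cdf[(x, y)] - p / pX[x], cdf[(x, y)]
        for a, b in zip(grid, grid[1:]):
            if lo <= a and b <= hi:   # grid cell (a, b) lies inside Y=y's band
                witness[(x, y, x, (a, b))] += pX[x] * (b - a)
    return witness

mu = {(0, 0): Fraction(1, 2), (0, 1): Fraction(1, 4), (1, 1): Fraction(1, 4)}
w = two_cycle_witness(mu)
print(sum(w.values()) == 1, len(w))   # True 4
```

Each grid cell has total probability equal to its length, independent of $X$; so conditions (a)-(c) of Definition 2 all hold.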
## 3 QIM-Compatibility and Causality

Recall that in the definition of QIM-compatibility, each hyperarc represents an independent mechanism. Equations in a causal model are also viewed as representing independent mechanisms. This suggests a possible connection between the two formalisms, which we now explore. We will show that QIM-compatibility with $\mathcal{A}$ means exactly that a distribution can be generated by a causal model with the corresponding dependency structure (Section 3.1). Moreover, such causal models and QIM-compatibility witnesses are themselves closely related (Section 3.2). In this section, we establish a causal grounding for QIM-compatibility. To do so, we must first review some standard definitions.

**Definition 3** (Pearl [19]). A structural equations model (SEM) is a tuple $M = (\mathbf{U}, \mathbf{V}, F)$, where $\mathbf{U}$ is a set of exogenous variables; $\mathbf{V}$ is a set of endogenous variables (disjoint from $\mathbf{U}$); and $F = \{f_Y\}_{Y \in \mathbf{V}}$ associates to each endogenous variable $Y$ an equation $f_Y : \mathcal{V}(\mathbf{U} \cup \mathbf{V} \setminus \{Y\}) \to \mathcal{V}(Y)$ that determines its value as a function of the other variables.

In a SEM $M$, a variable $X \in \mathbf{V}$ does not depend on $Y \in \mathbf{V} \cup \mathbf{U}$ if $f_X(\ldots, y, \ldots) = f_X(\ldots, y', \ldots)$ for all $y, y' \in \mathcal{V}(Y)$. Let the parents $\mathrm{Pa}_M(X)$ of $X$ be the set of variables on which $X$ depends. $M$ is acyclic iff $\mathrm{Pa}_M(X) \cap \mathbf{V} = \mathrm{Pa}_G(X)$ for some dag $G$ with vertices $\mathbf{V}$. In an acyclic SEM, it is easy to see that a setting of the exogenous variables determines the values of the endogenous variables (symbolically: $M \models \mathbf{U} \to \mathbf{V}$).

A probabilistic SEM (PSEM) $\mathcal{M} = (M, P)$ is a SEM, together with a probability $P$ over the exogenous variables. When $M \models \mathbf{U} \to \mathbf{V}$, the distribution $P(\mathbf{U})$ extends uniquely to a distribution over $\mathcal{V}(\mathbf{V} \cup \mathbf{U})$. A cyclic PSEM, however, may induce more than one such distribution, or none at all. In general, a PSEM $\mathcal{M}$ induces a (possibly empty) convex set of distributions over $\mathcal{V}(\mathbf{U} \cup \mathbf{V})$. This set is defined by two (linear) constraints: the equations $F$ must hold with probability 1, and the marginal probability over $\mathbf{U}$ must equal $P$. Given a PSEM $\mathcal{M}$, let $\{\{\mathcal{M}\}\}$ consist of all joint distributions $\nu(\mathbf{U}, \mathbf{V})$ that satisfy the two constraints above; this set captures the behavior of $\mathcal{M}$ in the absence of interventions. A joint distribution $\mu(\mathbf{X})$ over $\mathbf{X} \subseteq \mathbf{V} \cup \mathbf{U}$ can arise from a (P)SEM $\mathcal{M}$ iff there is some $\nu \in \{\{\mathcal{M}\}\}$ whose marginal on $\mathbf{X}$ is $\mu$.

We now review the syntax of a language for describing causality. A basic causal formula is one of the form $[\mathbf{Y} \leftarrow \mathbf{y}]\varphi$, where $\varphi$ is a Boolean expression over the endogenous variables $\mathbf{V}$, $\mathbf{Y} \subseteq \mathbf{V}$ is a subset of them, and $\mathbf{y} \in \mathcal{V}(\mathbf{Y})$. The language then consists of all Boolean combinations of basic formulas. In a causal model $M$ and context $\mathbf{u} \in \mathcal{V}(\mathbf{U})$, a Boolean expression $\varphi$ over $\mathbf{V}$ is true iff it holds for all $(\mathbf{u}, \mathbf{x}) \in \mathcal{V}(\mathbf{U}, \mathbf{V})$ consistent with the equations of $M$. Basic causal formulas are then given semantics by $(M, \mathbf{u}) \models [\mathbf{Y} \leftarrow \mathbf{y}]\varphi$ iff $(M_{\mathbf{Y} \leftarrow \mathbf{y}}, \mathbf{u}) \models \varphi$, where $M_{\mathbf{Y} \leftarrow \mathbf{y}}$ is the result of changing each $f_Y$, for $Y \in \mathbf{Y}$, to the constant function $s \mapsto \mathbf{y}[Y]$, which returns (on all inputs $s$) the value of $Y$ in the joint setting $\mathbf{y}$. The dual formula $\langle \mathbf{Y} \leftarrow \mathbf{y} \rangle \varphi := \neg[\mathbf{Y} \leftarrow \mathbf{y}]\neg\varphi$ is equivalent to $[\mathbf{Y} \leftarrow \mathbf{y}]\varphi$ in SEMs where each context $\mathbf{u}$ induces a unique setting of the endogenous variables [7]. A PSEM $\mathcal{M} = (M, P)$ assigns probabilities to causal formulas according to $\Pr_{\mathcal{M}}(\varphi) := P(\{\mathbf{u} \in \mathcal{V}(\mathbf{U}) : (M, \mathbf{u}) \models \varphi\})$.

Some authors assume that for each variable $X$, there is a special independent noise exogenous variable $U_X$ on which only the equation $f_X$ can depend; we call a PSEM $(M, P)$ randomized if it contains such exogenous variables that are mutually independent according to $P$, and fully randomized if all its exogenous variables are of this form. Randomized PSEMs are clearly a special class of PSEMs, but note also that every PSEM can be converted to an equivalent randomized PSEM by extending it with additional dummy variables $\{U_X\}_{X \in \mathbf{V}}$ that can take only a single value. Thus, we do not lose expressive power by using randomized PSEMs. In fact, qualitatively, randomized PSEMs are more expressive: they can encode independence.
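For acyclic models, the semantics of a randomized PSEM reduces to forward sampling. A minimal sketch, in our own representation (the two-variable model is an illustrative toy, not an example from the paper):

```python
import random

# A minimal acyclic randomized PSEM: each endogenous Y has an equation
# f_Y(parents, U_Y) and an independent noise variable U_Y.
def sample(equations, noise, order):
    """equations: {Y: (parent names, f_Y)}; noise: {Y: sampler for U_Y};
    order: a topological order of the endogenous variables."""
    u = {y: noise[y]() for y in equations}        # mutually independent draws
    v = {}
    for y in order:                               # M |= U -> V: acyclicity
        parents, f = equations[y]                 # makes the solution unique
        v[y] = f(tuple(v[p] for p in parents), u[y])
    return v

# X := U_X (a fair bit); Y := X XOR U_Y, where U_Y is 1 with probability 0.1.
equations = {"X": ((), lambda pa, u: u),
             "Y": (("X",), lambda pa, u: pa[0] ^ u)}
noise = {"X": lambda: random.randint(0, 1),
         "Y": lambda: int(random.random() < 0.1)}
samples = [sample(equations, noise, ["X", "Y"]) for _ in range(100_000)]
print(sum(s["Y"] for s in samples) / len(samples))  # ~0.5 (X fair, so Y fair)
```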
### 3.1 The Equivalence Between QIM-Compatibility and Randomized PSEMs

We are now equipped to formally describe the connection between QIM-compatibility and causality. At a high level, this connection should be unsurprising: witnesses and causal models both relate dependency structures to distributions, but in "opposite directions". QIM-compatibility starts with distributions and asks what dependency structures they are compatible with. Causal models, on the other hand, are explicit (quantitative) representations of dependency structures that give rise to sets of distributions. We now show that the existence of a causal model coincides with the existence of a witness. We start by showing this for the hypergraphs generated by graphs (like Bayesian networks, except possibly cyclic), which we show correspond to fully randomized causal models (Proposition 3). We then give a natural generalization of a causal model that exactly captures QIM-compatibility with an arbitrary hypergraph (Proposition 4). In both cases, the high-level result is the same: $\mu \models \mathcal{A}$ iff there is a causal model that has dependency structure $\mathcal{A}$ and that gives rise to $\mu$.

More precisely, a randomized causal model $\mathcal{M}$ has dependency structure $\mathcal{A}$ iff there is a 1-1 correspondence between $a \in A$ and the equations of $\mathcal{M}$, such that the equation $f_a$ produces a value of $T_a$ and depends only on $S_a$ and $U_a$. The definition above emphasizes the hypergraph; for readers interested in causality, here is an equivalent one that emphasizes the causal model: $\mathcal{M}$ is of dependency structure $\mathcal{A}$ iff the targets of $\mathcal{A}$ are disjoint singletons corresponding to the elements of $\mathbf{V}$ (so $\mathcal{A} = \{S_Y \to \{Y\}\}_{Y \in \mathbf{V}}$), and $\mathrm{Pa}_{\mathcal{M}}(Y) \subseteq S_Y \cup \{U_Y\}$ for all $Y \in \mathbf{V}$. We start by presenting the result in the case where $\mathcal{A}$ corresponds to a directed graph.

**Proposition 3.** Given a graph $G$ and a distribution $\mu$, $\mu \models \mathcal{A}_G$ iff there exists a fully randomized PSEM of dependency structure $\mathcal{A}_G$ from which $\mu$ can arise.

In other words, compatibility with a hypergraph corresponding to a graph means arising from a fully randomized PSEM of the appropriate dependency structure. In light of this, Theorem 1 can be viewed as formalizing a phenomenon that seems to be almost universally implicitly understood: every acyclic fully randomized SEM induces a distribution with the independencies of the corresponding Bayesian network. Conversely, every distribution with those independencies arises from such a causal model. Both halves have been recognized before. Druzdzel and Simon [4, Theorem 1] arguably establish one direction of the correspondence (turning a BN into a causal model), but their statement of the result obscures the possibility of a converse.² Pearl's causal Markov condition [20, Theorem 1], on the other hand, is closely related to that converse (as will be made explicit by our Proposition 3). Yet, to the best of our knowledge, the two results have not before been combined and recognized as an equivalent characterization of a BN's conditional independencies.

Like before, QIM-compatibility allows us to go much further. It is easy to extend Proposition 3 to the dependency structures of all randomized PSEMs. But what happens if $\mathcal{A}$ contains hyperarcs with overlapping targets? Here the correspondence starts to break down, for a simple reason: by definition, there is at most one equation per variable in a (P)SEM; thus, no PSEM can have dependency structure $\mathcal{A}$. Nevertheless, the correspondence between witnesses and causal models persists if we simply drop the (traditional) requirement that $F$ is indexed by $\mathbf{V}$. This leads us to consider a natural generalization of a (randomized) PSEM that has an arbitrary set of equations, not just one per variable.

²Indeed, Druzdzel and Simon [4] state that "a causal structure does not necessarily imply independences", suggesting that they did not realize that their result could be used to characterize BN independencies.
**Definition 4.** Let $(N, \mathcal{A})$ be a hypergraph. A generalized randomized PSEM $\mathcal{M} = (\mathbf{X}, \mathbf{U}, F, P)$ with structure $\mathcal{A}$ consists of sets of variables $\mathbf{X}$ and $\mathbf{U} = \{U_a\}_{a \in A}$, together with a set of functions $F = \{f_a : \mathcal{V}(S_a) \times \mathcal{V}(U_a) \to \mathcal{V}(T_a)\}_{a \in A}$, and a probability $P_a$ over each independent noise variable $U_a$. The meanings of $\{\{\mathcal{M}\}\}$ and "can arise" are the same as for a PSEM.

**Proposition 4.** $\mu \models \mathcal{A}$ iff there exists a generalized randomized PSEM with structure $\mathcal{A}$ from which $\mu$ can arise.

Generalized randomized PSEMs can capture functional dependencies, and constraints. For instance, an equality (say $X = Y$) can be encoded in a generalized randomized PSEM with a second equation for $X$. Indeed, we believe that generalized randomized PSEMs can capture a wide class of constraints, and are closely related to causal models with constraints [2], a discussion we defer to future work.

### 3.2 Interventions and the Correspondence Between Witnesses and Causal Models

We have seen that QIM-compatibility with $\mathcal{A}$ (i.e., the existence of a witness $\tilde\mu$) coincides exactly with the existence of a causal model $\mathcal{M}$ from which a distribution can arise. But which witnesses correspond to which causal models? The answer to this question will be critical to extend the correspondence we have given so that it can deal with interventions. Different causal models may give rise to the same distribution, yet handle interventions differently.

There are two directions of the correspondence. Given a randomized PSEM $\mathcal{M}$, distributions arising from it are compatible with its dependency structure, and the corresponding witnesses are exactly the distributions in $\{\{\mathcal{M}\}\}$ (see Appendix E). In particular, if $\mathcal{M}$ is acyclic, there is a unique witness. The converse is more interesting: how can we turn a witness into a causal model?

**Construction 5.** Given a witness $\tilde\mu$ to compatibility with a hypergraph $\mathcal{A}$ with disjoint targets, construct a PSEM according to the following (non-deterministic) procedure. Take $\mathbf{V} := \bigcup_{a \in A} T_a$, $\mathbf{U} := \mathbf{U}_{\mathcal{A}} \cup (\mathbf{X} \setminus \mathbf{V})$, and $P(\mathbf{U}) := \tilde\mu(\mathbf{U})$. For each $X \in \mathbf{V}$, there is a unique $a_X \in A$ whose targets $T_{a_X}$ contain $X$. Since $\tilde\mu \models (U_{a_X}, S_{a_X}) \to T_{a_X}$ (this is just property (c) in Definition 2), $X \in T_{a_X}$ must also be a function of $S_{a_X}$ and $U_{a_X}$; take $f_X$ to be such a function. More precisely, for each $u \in \mathcal{V}(U_{a_X})$ and $s \in \mathcal{V}(S_{a_X})$ for which $\tilde\mu(U_{a_X}{=}u, S_{a_X}{=}s) > 0$, there is a unique $t \in \mathcal{V}(T_{a_X})$ such that $\tilde\mu(u, s, t) > 0$. In this case, set $f_X(u, s, \ldots) := t[X]$. If $\tilde\mu(U_{a_X}{=}u, S_{a_X}{=}s) = 0$, then $f_X(u, s, \ldots)$ can be an arbitrary function of $u$ and $s$. Let $\mathrm{PSEMs}_{\mathcal{A}}(\tilde\mu)$ denote the set of PSEMs that can result.

It is clear from Construction 5 that $\mathrm{PSEMs}_{\mathcal{A}}(\tilde\mu)$ is always nonempty, and is a singleton iff $\tilde\mu(u, s) > 0$ for every $a \in A$ and $(u, s) \in \mathcal{V}(U_a, S_a)$. A witness with this property exists when $\mu$ is positive (i.e., $\mu(\mathbf{X}{=}\mathbf{x}) > 0$ for all $\mathbf{x} \in \mathcal{V}(\mathbf{X})$), in which case the construction gives a unique causal model.
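Construction 5 is easy to carry out mechanically for finite witnesses. A sketch under our own encoding (a witness as a dictionary over assignment tuples; names are illustrative), applied to the Example-1 witness from earlier:

```python
def equations_from_witness(witness, names, arcs, noise):
    """Construction 5 as code: read off an equation for each target variable
    from a witness with disjoint targets.  `arcs` is a list of (sources,
    targets) aligned with the list `noise` of noise-variable names."""
    idx = {v: i for i, v in enumerate(names)}
    fs = {}
    for (src, tgt), u in zip(arcs, noise):
        table = {}
        for w, p in witness.items():
            if p > 0:
                key = (w[idx[u]],) + tuple(w[idx[s]] for s in src)
                val = tuple(w[idx[t]] for t in tgt)
                # property (c) of Definition 2 guarantees uniqueness here
                assert table.setdefault(key, val) == val, "not a witness"
        for j, x in enumerate(tgt):
            fs[x] = {k: v[j] for k, v in table.items()}
    return fs  # each f_X as a lookup table on (u, *sources); inputs of
               # probability zero may be filled in arbitrarily

# The Example-1 witness (U1 := X, U2 := Y) yields the equations X := U1, Y := U2.
ext = {(x, y, x, y): 0.25 for x in (0, 1) for y in (0, 1)}
print(equations_from_witness(ext, ["X", "Y", "U1", "U2"],
                             [((), ("X",)), ((), ("Y",))], ["U1", "U2"]))
```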
Conversely, we have seen that an acyclic model $\mathcal{M}$ gives rise to a unique witness. So, in the simplest cases, models $\mathcal{M}$ with structure $\mathcal{A}$ and witnesses $\tilde\mu$ to compatibility with $\mathcal{A}$ are equivalent. But there are two important caveats.

1. A causal model $\mathcal{M}$ can contain more information than a witness $\tilde\mu$ if some events have probability zero. For instance, $\tilde\mu$ could be a point mass on a single joint outcome $\omega$ of all variables that satisfies the equations of $\mathcal{M}$. But $\mathcal{M}$ cannot be reconstructed uniquely from $\tilde\mu$ because there may be many causal models for which $\omega$ is a solution.
2. A witness $\tilde\mu$ can contain more information than a causal model $\mathcal{M}$ if $\mathcal{M}$ is cyclic. For example, suppose that $M$ consists of two variables, $X$ and $X'$, and equations $f_X(X') = X'$ and $f_{X'}(X) = X$. In this case, $\tilde\mu$ cannot be reconstructed from $\mathcal{M}$, because $\mathcal{M}$ does not contain information about the distribution of $X$.

These two caveats appear to be very different, but they fit together in a surprisingly elegant way.

**Proposition 5.** If $\tilde\mu(\mathbf{X}, \mathbf{U}_{\mathcal{A}})$ is a witness for QIM-compatibility with $\mathcal{A}$ and $\mathcal{M}$ is a PSEM with dependency structure $\mathcal{A}$, then $\tilde\mu \in \{\{\mathcal{M}\}\}$ if and only if $\mathcal{M} \in \mathrm{PSEMs}_{\mathcal{A}}(\tilde\mu)$.

Equivalently, this means that $\mathrm{PSEMs}_{\mathcal{A}}(\tilde\mu)$, the possible outputs of Construction 5, are precisely the randomized PSEMs of dependency structure $\mathcal{A}$ that can give rise to $\mu$. This is already substantial evidence that causal models $\mathcal{M} \in \mathrm{PSEMs}_{\mathcal{A}}(\tilde\mu)$ are closely related to the QIM-compatibility witness $\tilde\mu$. But everything we have seen so far describes only the correspondence in the absence of intervention, a setting in which many causal models are indistinguishable. Yet the correspondence goes deeper; it extends to interventions. This claim may seem dubious, as the obvious distinction between observing and doing is a fundamental principle of causality. What could a witness, which is purely probabilistic, have to say about intervention? In a randomized PSEM $\mathcal{M}$, we can define an event

$$\mathrm{do}_M(\mathbf{X}{=}\mathbf{x}) := \bigcap_{X \in \mathbf{X}}\ \bigcap_{\mathbf{s}}\ \big\{ f_X(U_X, \mathbf{s}) = \mathbf{x}[X] \big\}, \quad \text{where } \mathbf{x}[X] \text{ is the value of } X \text{ in } \mathbf{x}, \tag{1}$$

and the inner intersection ranges over settings $\mathbf{s}$ of the remaining inputs to $f_X$. This is intuitively the event in which the randomness is such that $\mathbf{X} = \mathbf{x}$ regardless of the values of the parent variables. As we now show, conditioning $\tilde\mu$ on $\mathrm{do}_M(\mathbf{X}{=}\mathbf{x})$ has the effect of intervention, at least as long as the noise variables $\mathbf{U}_{\mathcal{A}} = \{U_a\}_{a \in A}$ are independent of the other exogenous variables $\mathbf{U} \setminus \mathbf{U}_{\mathcal{A}}$, in addition to one another (e.g., when $\mathcal{M}$ is fully randomized).

**Theorem 6.** Suppose that $\tilde\mu$ is a QIM witness for $\mathcal{A}$, that $\mathcal{M} = (\mathbf{U}, \mathbf{V}, F, P) \in \mathrm{PSEMs}_{\mathcal{A}}(\tilde\mu)$ is a corresponding PSEM, and that the noise variables $\mathbf{U}_{\mathcal{A}} = \{U_X\}_{X \in \mathbf{V}}$ are independent of the other exogenous variables $\mathbf{U} \setminus \mathbf{U}_{\mathcal{A}}$. For all $\mathbf{X} \subseteq \mathbf{V}$ and $\mathbf{x} \in \mathcal{V}(\mathbf{X})$, if $\tilde\mu(\mathrm{do}_M(\mathbf{X}{=}\mathbf{x})) > 0$, then
(a) $\tilde\mu(\mathbf{V} \mid \mathrm{do}_M(\mathbf{X}{=}\mathbf{x}))$ can arise from $\mathcal{M}_{\mathbf{X} \leftarrow \mathbf{x}}$;
(b) for all events $\varphi \subseteq \mathcal{V}(\mathbf{V})$,
$$\Pr_{\mathcal{M}}\big([\mathbf{X} \leftarrow \mathbf{x}]\varphi\big) \;\le\; \tilde\mu\big(\varphi \mid \mathrm{do}_M(\mathbf{X}{=}\mathbf{x})\big) \;\le\; \Pr_{\mathcal{M}}\big(\langle \mathbf{X} \leftarrow \mathbf{x}\rangle \varphi\big),$$
and all three are equal when $M \models \mathbf{U} \to \mathbf{V}$ (such as when $M$ is acyclic).

Theorem 6 shows that the relationship between witnesses and causal models extends to interventions. Even when $\mathrm{do}_M(\mathbf{X}{=}\mathbf{x})$ has probability zero, it is always possible to find a nearly equivalent setting where the bounds of the theorem apply.³ Intervention and conditioning are conceptually very different, so it may seem surprising that conditioning can have the effect of intervention (and also that Pearl's do(·) notation actually corresponds to an event [9]). We emphasize that the conditioning (on $\mathrm{do}_M(\mathbf{X}{=}\mathbf{x})$) is on the randomness $\{U_X\}_{X \in \mathbf{X}}$ and not $\mathbf{X}$ itself; intervening on $\mathbf{X}{=}\mathbf{x}$ is indeed fundamentally different from conditioning on $\mathbf{X}{=}\mathbf{x}$.

³More precisely, for all $\epsilon > 0$, there exists some $\mathcal{M}'$ that differs from $\mathcal{M}$ on the probabilities of all causal formulas by at most $\epsilon$, and a distribution $\tilde\mu'$ that is $\epsilon$-close to $\tilde\mu$, such that $\tilde\mu'(\mathrm{do}_{M'}(\mathbf{X}{=}\mathbf{x})) > 0$.
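The following self-contained sketch illustrates Theorem 6 on a toy acyclic model; the model, probabilities, and names are our own choices, not from the paper. The noise variable $U_X$ can either copy the parent $Z$ or override it; conditioning on the override event $\mathrm{do}_M(X{=}1)$ reproduces the interventional distribution exactly, while conditioning on the observation $X = 1$ does not:

```python
import itertools
from fractions import Fraction as F

# Exogenous: U_Z is a fair bit; U_X either copies Z or overrides it;
# U_Y flips Y with probability 1/10.  All three are mutually independent.
P_UZ = {0: F(1, 2), 1: F(1, 2)}
P_UX = {"follow": F(1, 2), "set0": F(1, 4), "set1": F(1, 4)}
P_UY = {0: F(9, 10), 1: F(1, 10)}

f_X = lambda ux, z: {"follow": z, "set0": 0, "set1": 1}[ux]
f_Y = lambda uy, x, z: (x & z) ^ uy

def joint(intervene_x=None):
    """Joint over (uz, ux, uy, z, x, y); setting intervene_x builds M_{X <- x}."""
    table = {}
    for uz, ux, uy in itertools.product(P_UZ, P_UX, P_UY):
        z = uz
        x = f_X(ux, z) if intervene_x is None else intervene_x
        table[(uz, ux, uy, z, x, f_Y(uy, x, z))] = P_UZ[uz] * P_UX[ux] * P_UY[uy]
    return table

def prob(table, pred):
    return sum(p for w, p in table.items() if pred(w))

mu = joint()                        # the unique witness of this acyclic model
do_x1 = lambda w: w[1] == "set1"    # Eq. (1): f_X(U_X, z) = 1 for every z
print(prob(mu, lambda w: do_x1(w) and w[5] == 1) / prob(mu, do_x1))  # 1/2
print(prob(joint(1), lambda w: w[5] == 1))                           # 1/2: equal
print(prob(mu, lambda w: w[4] == 1 and w[5] == 1)
      / prob(mu, lambda w: w[4] == 1))             # 7/10: observing != doing
```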
## 4 QIM-Compatibility and Information Theory

The fact that the dependency structure of a (causal) Bayesian network describes the independencies of the distribution it induces is fundamental to both causality and probability. It makes explicit the distributional consequences of BN structure. Yet, despite substantial interest [1], generalizing the BN case to more complex (e.g., cyclic) dependency structures remains largely an open problem. In Section 4.1, we generalize the BN case by providing an information-theoretic constraint (capable of capturing conditional independence, functional dependence, and more) on the distributions that can arise from an arbitrary dependency structure.

This connection between causality and information theory has implications for both fields. It grounds the cyclic dependency structures found in causality in concrete constraints on the distributions they represent. At the same time, it allows us to resolve longstanding confusion about structure in information theory, clarifying the meaning of the so-called "interaction information", and recasting a standard counterexample to substantiate the claim it was intended to oppose.

In Section 4.2, we strengthen this connection. Using entropy to measure distance to (in)dependence, we develop a scoring function to measure how far a distribution is from being QIM-compatible with a given dependency structure. This function turns out to have an intimate relationship with the qualitative PDG scoring function IDef, which we use to show that our information-theoretic constraints degrade gracefully on near-compatible distributions.

[Figure 1: the information profile $\mathbf{I}_\mu$ of three variables, drawn as an information diagram whose central cell is the interaction information $I(X; Y; Z)$.]

We now review the critical information-theoretic concepts and their relationships to (in)dependence (see Appendix C.1 for a full primer). Conditional entropy $H_\mu(Y \mid X)$ measures how far $\mu$ is from satisfying the functional dependency $X \to Y$. Conditional mutual information $I_\mu(Y; Z \mid X)$ measures how far $\mu$ is from satisfying the conditional independence $Y \perp Z \mid X$. Linear combinations of these quantities (for $X, Y, Z \subseteq \mathbf{X}$) can be viewed as the inner product between a coefficient vector $\mathbf{v}$ and a $(2^{|\mathbf{X}|} - 1)$-dimensional vector $\mathbf{I}_\mu$ that we will call the information profile of $\mu$. For three variables, the components of this vector are illustrated in Figure 1. It is not hard to see that an arbitrary conjunction of (conditional) (in)dependencies can be expressed as a constraint $\mathbf{I}_\mu \cdot \mathbf{v} \le 0$, for some appropriate choice of vector $\mathbf{v}$.

We now formally introduce the qualitative PDG scoring function IDef, which interprets a hypergraph structure $\mathcal{A}$ as a function of the form $\mathbf{I}_\mu \cdot \mathbf{v}_{\mathcal{A}}$. This information deficiency, given by

$$\mathrm{IDef}_{\mathcal{A}}(\mu) \;=\; \mathbf{I}_\mu \cdot \mathbf{v}_{\mathcal{A}} \;:=\; -H_\mu(\mathbf{X}) + \sum_{a \in A} H_\mu(T_a \mid S_a), \tag{2}$$

is the difference between the number of bits needed to (independently) specify the randomness in $\mu$ along the hyperarcs of $\mathcal{A}$, and the number of bits needed to specify a sample of $\mu$ according to its own structure ($H_\mu(\mathbf{X})$). While IDef has some nice properties⁴, it can also behave unintuitively in some cases; for instance, it can be negative. Clearly, it does not measure how close $\mu$ is to being structurally compatible with $\mathcal{A}$, in general. Nevertheless, there is still a fundamental relationship between IDef and QIM-compatibility, as we now show.

⁴It captures BN independencies and the dependencies of Theorem 2, reduces to maximum entropy for the empty hypergraph, and combines with the quantitative PDG scoring function [23] to capture factor graphs.
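Here is a minimal implementation of IDef as reconstructed in (2) (the encoding of distributions and hypergraphs is ours), sanity-checked on the chain $\emptyset \to X$, $X \to Y$, $Y \to Z$: it vanishes on a Markov distribution, and is positive on one that violates $X \perp Z \mid Y$, which by Theorem 7 below can therefore not be compatible with the chain.

```python
import math

def marg(joint, names, keep):
    idx = [names.index(v) for v in keep]
    out = {}
    for w, p in joint.items():
        k = tuple(w[i] for i in idx)
        out[k] = out.get(k, 0.0) + p
    return out

def H(dist):  # Shannon entropy, in bits
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def IDef(joint, names, arcs):
    """Eq. (2): sum over arcs of H(T_a | S_a), minus H of the whole joint."""
    cond = sum(H(marg(joint, names, list(s) + list(t)))
               - H(marg(joint, names, list(s))) for s, t in arcs)
    return cond - H(joint)

chain = [((), ("X",)), (("X",), ("Y",)), (("Y",), ("Z",))]
markov = {(x, x, x): 0.5 for x in (0, 1)}                   # X = Y = Z: Markov
broken = {(x, y, x): 0.25 for x in (0, 1) for y in (0, 1)}  # Z = X, Y independent
print(IDef(markov, ["X", "Y", "Z"], chain))  # 0.0: the chain independencies hold
print(IDef(broken, ["X", "Y", "Z"], chain))  # 1.0 > 0: X _|_ Z | Y fails
```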
### 4.1 A Necessary Condition for QIM-Compatibility

What constraints does QIM-compatibility with $\mathcal{A}$ place on a distribution $\mu$? When $G$ is a dag, we have seen that if $\mu \models \mathcal{A}_G$, then $\mu$ must satisfy the independencies of the corresponding Bayesian network (Theorem 1); we have also seen that additional hyperarcs impose functional dependencies (Theorem 2). But these results apply only when $\mathcal{A}$ is of a very special form. More generally, $\mu \models \mathcal{A}$ implies that $\mu$ can arise from some randomized causal model whose equations have dependency structure $\mathcal{A}$ (Propositions 3 and 4). Still, unless $\mathcal{A}$ has a particularly special form, it is not obvious whether or not this says something about $\mu$.

The primary result of this section is an information-theoretic bound (Theorem 7) that generalizes most of the concrete consequences of QIM-compatibility we have seen so far (Theorems 1 and 2). The result is a connection between information theory and causality; it yields an information-theoretic test for complex causal dependency structures, and enables causal notions of structure to dispel misconceptions in information theory.

**Theorem 7.** If $\mu \models \mathcal{A}$, then $\mathrm{IDef}_{\mathcal{A}}(\mu) \le 0$.

Theorem 7 applies to all hypergraphs, and subsumes every general-purpose technique we know of for proving that $\mu \not\models \mathcal{A}$. Indeed, the negative directions of Theorems 1 and 2 are immediate consequences of it. To illustrate some of its subtler implications, we return to the 3-cycle in Example 4.

**Example 5.** It is easy to see (e.g., by inspecting Figure 1) that
$$\mathrm{IDef}_{3\text{-cycle}}(\mu) = H_\mu(Y \mid X) + H_\mu(Z \mid Y) + H_\mu(X \mid Z) - H_\mu(XYZ) = -I_\mu(X; Y; Z).$$
Theorem 7 therefore tells us that a distribution $\mu$ that is QIM-compatible with the 3-cycle cannot have negative interaction information $I_\mu(X; Y; Z)$. What does this mean? When $I(X; Y; Z) < 0$, conditioning on one variable causes the other two to share more information than they did before. The most extreme instance is $\mu_{\mathrm{xor}}$, the distribution in which two variables are independent and the third is their parity. It seems intuitively clear that $\mu_{\mathrm{xor}}$ cannot arise from the 3-cycle, a causal model with only pairwise dependencies. This is difficult to prove directly, but is an immediate consequence of Theorem 7.

For many, there is an intuition that $I(X; Y; Z) < 0$ should require a fundamentally 3-way interaction between the variables, and should not arise through pairwise interactions alone [10]. This has been a source of conflict [26, 16, 15, 3], because traditional ways of making "pairwise interactions" precise (e.g., maximum entropy subject to pairwise marginal constraints, and pairwise factorization) do not ensure that $I(X; Y; Z) \ge 0$. But QIM-compatibility does. One can verify by enumeration that the 3-cycle is the most expressive causal structure with no joint dependencies, and we have already proven that QIM-compatibility with that hypergraph implies non-negative interaction information.
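Concretely, one can verify Example 5's numbers in a few lines (our encoding; $\mu_{\mathrm{xor}}$ is taken over three uniform bits):

```python
import math

def marg(joint, keep):  # joint over coordinates (X, Y, Z)
    out = {}
    for w, p in joint.items():
        k = tuple(w[i] for i in keep)
        out[k] = out.get(k, 0.0) + p
    return out

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def IDef_3cycle(mu):  # H(Y|X) + H(Z|Y) + H(X|Z) - H(XYZ)
    arcs = [((0,), (0, 1)), ((1,), (1, 2)), ((2,), (2, 0))]
    return sum(H(marg(mu, st)) - H(marg(mu, s)) for s, st in arcs) - H(mu)

xor = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}
print(IDef_3cycle(xor))  # 1.0 > 0, violating Theorem 7's bound, so mu_xor is
                         # not compatible; equivalently I(X;Y;Z) = -1 < 0.
```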
QIM-compatibility has another, even more noteworthy, clarifying effect on information theory. There is a school of thought that contends that all structural information in $\mu(\mathbf{X})$ is captured by its information profile $\mathbf{I}_\mu$. This position has fallen out of favor in some communities due to standard counterexamples: distributions that have intuitively different structures yet share an information profile [11]. However, with structure explicated by compatibility, the prototypical counterexample of this kind suddenly supports the very notion it was meant to challenge, suggesting in an unexpected way that the information profile may yet capture the essence of probabilistic structure.

**Example 6.** Let $A$, $B$, and $C$ be variables with $\mathcal{V}(A), \mathcal{V}(B), \mathcal{V}(C) = \{0, 1\}^2$. Using independent fair coin flips $X_1$, $X_2$, and $X_3$, define two joint distributions, $P$ and $Q$, over $A, B, C$ as follows. Define $P$ by selecting $A := (X_1, X_2)$, $B := (X_2, X_3)$, and $C := (X_3, X_1)$. Define $Q$ by selecting $A := (X_1, X_2)$, $B := (X_1, X_3)$, and $C := (X_1, X_2 \oplus X_3)$.

Structurally, $P$ and $Q$ appear to be very different. According to $P$, the first components of the three variables $(A, B, C)$ are independent, yet they are identical according to $Q$. Moreover, $P$ has only simple pairwise interactions between the variables, while $Q$ has $\mu_{\mathrm{xor}}$ (a clear 3-way interaction) embedded within it. Yet $P$ and $Q$ have identical information profiles: in both cases, each of $\{A, B, C\}$ is determined by the values of the other two, each pair shares one bit of information given the third, and $I(A; B; C) = 0$.

This example has been used to argue that multivariate Shannon information does not take into account important structural differences between distributions [11]. We are now in a position to give a novel and particularly persuasive response, by appealing to QIM-compatibility. Unsurprisingly, $P$ is compatible with the 3-cycle; it clearly consists of 2-way interactions, as each pair of variables shares a bit. But, counterintuitively, the distribution $Q$ is also compatible with the 3-cycle! (The reader is encouraged to verify that $U_1 = X_3 \oplus X_1$, $U_2 = X_2$, and $U_3 = X_3$ serve as a witness.) To emphasize: this is despite the fact that $Q$ is just $\mu_{\mathrm{xor}}$ (which is certainly not compatible with the 3-cycle) together with a seemingly irrelevant random bit $X_1$. By the results of Section 3, this means there is a causal model without joint dependence giving rise to $Q$; so, despite appearances, $Q$ does not require a 3-way interaction. Indeed, $P$ and $Q$ are QIM-compatible with precisely the same hypergraphs over $\{A, B, C\}$, suggesting that they don't have a structural difference after all.

In light of Example 6, one might reasonably conjecture that the converse of Theorem 7 holds. Unfortunately, it does not (see Appendix C.3); the quantity $\mathrm{IDef}_{\mathcal{A}}(\mu)$ does not completely determine whether or not $\mu \models \mathcal{A}$. We now pursue a new (entropy-based) scoring function that does. This will allow us to generalize Theorem 7 to distributions that are only near-compatible with $\mathcal{A}$.

### 4.2 A Scoring Function for QIM-Compatibility

Here is a function that measures how far a distribution $\mu$ is from being QIM-compatible with $\mathcal{A}$:

$$\mathrm{QIMInc}_{\mathcal{A}}(\mu) := \inf_{\substack{\nu(\mathbf{U},\, \mathbf{X}):\\ \nu(\mathbf{X}) = \mu(\mathbf{X})}} \bigg[\, {-H_\nu(\mathbf{U})} + \sum_{a \in A} H_\nu(U_a) + \sum_{a \in A} H_\nu(T_a \mid S_a, U_a) \bigg]. \tag{3}$$

QIMInc is a direct translation of Definition 2 (a-c); it measures the (optimal) quality of an extended distribution $\nu$ as a witness. The infimum restricts the search to $\nu$ satisfying (a), the first two terms measure $\nu$'s discrepancy with (b), and the last term measures $\nu$'s discrepancy with (c). Therefore:

**Proposition 8.** $\mathrm{QIMInc}_{\mathcal{A}}(\mu) \ge 0$, with equality iff $\mu \models \mathcal{A}$.

Although they seem to be very different, QIMInc and IDef turn out to be closely related. In fact, modulo the infimum, $\mathrm{QIMInc}_{\mathcal{A}}$ is a special case of IDef: not for the hypergraph $\mathcal{A}$, but rather for a transformed one $\mathcal{A}'$ that models the noise variables explicitly. To construct $\mathcal{A}'$ from $\mathcal{A}$, add new nodes $\mathbf{U} = \{U_a\}_{a \in A}$, and replace each hyperarc $S_a \xrightarrow{a} T_a$ with the pair of hyperarcs $\emptyset \to U_a$ and $S_a\, U_a \to T_a$. Finally, add one additional hyperarc $\mathbf{U} \to \mathbf{X}$. (Intuitively, this hyperarc creates functional dependencies in the spirit of Theorem 2.) Indeed, unwinding (2) for $\mathcal{A}'$ gives $\mathrm{IDef}_{\mathcal{A}'}(\nu) = \sum_a H_\nu(U_a) + \sum_a H_\nu(T_a \mid S_a, U_a) + H_\nu(\mathbf{X} \mid \mathbf{U}) - H_\nu(\mathbf{X}, \mathbf{U})$, whose last two terms combine to $-H_\nu(\mathbf{U})$, recovering the objective of (3).

With these definitions in place, we can state a theorem that bounds QIMInc above and below with information deficiencies (Theorem 9). The lower bound generalizes Theorem 7 by giving an upper limit on $\mathrm{IDef}_{\mathcal{A}}(\mu)$ even for distributions $\mu$ that are not QIM-compatible with $\mathcal{A}$. The upper bound is tight in general, and shows that $\mathrm{QIMInc}_{\mathcal{A}}$ can be equivalently defined as a minimization over $\mathrm{IDef}_{\mathcal{A}'}$.
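The transformation $\mathcal{A} \mapsto \mathcal{A}'$ is purely syntactic. A sketch under our own encoding (the noise names `U0, U1, ...` are illustrative), applied to the 2-cycle:

```python
def noise_extended(arcs):
    """Build A' from A as described above: each arc S_a -> T_a becomes
    {} -> U_a together with (S_a, U_a) -> T_a, plus one final arc U -> X
    tying all the noise variables to all the original nodes."""
    us = [f"U{i}" for i in range(len(arcs))]
    xs = sorted({v for s, t in arcs for v in (*s, *t)})
    new = []
    for (src, tgt), u in zip(arcs, us):
        new.append(((), (u,)))            # {} -> U_a
        new.append(((*src, u), tgt))      # (S_a, U_a) -> T_a
    new.append((tuple(us), tuple(xs)))    # U -> X
    return new

for s, t in noise_extended([(("X",), ("Y",)), (("Y",), ("X",))]):
    print(s, "->", t)
```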
**Theorem 9.**
(a) If $(\mathbf{X}, \mathcal{A})$ is a hypergraph, $\mu(\mathbf{X})$ is a distribution, and $\nu(\mathbf{X}, \mathbf{U})$ is an extension of $\mu$ to additional variables $\mathbf{U} = \{U_a\}_{a \in A}$ indexed by $A$, then
$$\mathrm{IDef}_{\mathcal{A}}(\mu) \;\le\; \mathrm{QIMInc}_{\mathcal{A}}(\mu) \;\le\; \mathrm{IDef}_{\mathcal{A}'}(\nu).$$
(b) For all $\mu$ and $\mathcal{A}$, there is a choice of $\nu$ that achieves the upper bound. That is,
$$\mathrm{QIMInc}_{\mathcal{A}}(\mu) = \min \big\{\, \mathrm{IDef}_{\mathcal{A}'}(\nu) \;:\; \nu \in \Delta\mathcal{V}(\mathbf{X}, \mathbf{U}),\ \nu(\mathbf{X}) = \mu(\mathbf{X}) \,\big\}.$$

The semantics of PDGs are based on the idea of measuring (and resolving) inconsistency, which is defined as a minimization over IDef (plus a term that captures relevant concrete probabilistic information). Thus, Theorem 9(b) tells us that QIM-compatibility (with $\mathcal{A}$) can be captured with a qualitative PDG (namely, $\mathcal{A}'$). It follows that our notion of QIM-compatibility can be viewed as a special case of the semantics of PDGs, one that, as we have shown, has a causal interpretation.

## 5 Discussion and Conclusions

We have shown how directed hypergraphs can be used to represent structural aspects of distributions. Moreover, they can do so in a way that generalizes conditional independencies and functional dependencies, and has deep connections to causality and information theory. This notion of QIM-compatibility can be captured with PDGs, and also partially explains the qualitative foundations of these models. Still, many questions remain open.

Perhaps the most important open problem is that of computing whether or not a given distribution $\mu$ is QIM-compatible with a directed hypergraph $\mathcal{A}$. We have implemented a rudimentary approach (based on solving problem (3) to calculate QIMInc) that works in practice for small examples, but that approach scales poorly, and its correctness has not yet been proved. Even representing a distribution $\mu$ over $n$ variables requires $\Omega(2^n)$ space in general, and a candidate witness $\tilde\mu$ is even bigger: if all variables are binary, $|A| = m$, and $|S_a|, |T_a| \le k$ for all $a \in A$, then a direct implementation of (3) is a non-convex optimization problem with at most $2^{\,n + mk2^k}$ variables. Even accepting the (substantial) cost of representing extended distributions, we do not have a bound on the time needed to solve the optimization problem. There are more compact ways of representing the joint distributions $\mu$ used in practice (by assuming (in)dependencies), but we do not know if such independence assumptions make it easier to determine whether $\mu \models \mathcal{A}$ for arbitrary $\mathcal{A}$. But computing $\mathrm{IDef}_{\mathcal{A}}(\mu)$ can be much easier.⁵ We suspect that Theorem 7, a nontrivial condition for QIM-compatibility that requires only computing $\mathrm{IDef}_{\mathcal{A}}(\mu)$, could play a critical role in designing such an inference procedure.

Another major open problem is that of more precisely understanding the implications of QIM-compatibility in cyclic models. We do not yet know, for example, whether the same set of distributions are QIM-compatible with the clockwise and counter-clockwise 3-cycles. As mentioned in Section 3, our notion of QIM-compatibility has led us to a generalization of a standard causal model (Definition 4). A proper investigation of this novel modeling tool (which we have not attempted in this paper) would include concrete motivating examples, a careful account of interventions and counterfactuals in this general setting, and results situating these causal models among other generalizations of causal models in the literature. We hope to address these questions in future work.

⁵The complexity of calculating $\mathrm{IDef}_{\mathcal{A}}(\mu)$ is typically dominated by the difficulty of calculating the joint entropy $H(\mu)$. It can be difficult to compute $H(\mu)$ in some cases (e.g., for undirected models), but in others (e.g., for Bayesian networks or clique trees) the same assumptions that enable a compact representation of $\mu$ also make it easy to calculate $H(\mu)$.
## Acknowledgments and Disclosure of Funding

We would like to thank the reviewers for useful discussion and helpful feedback, such as the pointer to Druzdzel and Simon [4], and for asking us to expand on the complexity of inference. Thank you to Matt MacDermott for identifying a bug in a prior version of Theorem 6, and to Matthias Georg Mayer for catching several low-level issues with the presentation. The work of Halpern and Richardson was supported in part by AFOSR grant FA2386-21-1-4029, MURI grant W911NF-19-1-0217, ARO grant W911NF-22-1-0061, and NSF grant FMitF-2319186. S.P. is supported in part by the NSF under Grants Nos. CCF-2122230 and CCF-2312296, a Packard Foundation Fellowship, and a generous gift from Google.

## References

[1] C. Baier, C. Dubslaff, H. Hermanns, and N. Käfer. On the foundations of cycles in Bayesian networks. In Lecture Notes in Computer Science, pages 343-363. Springer Nature Switzerland, 2022. doi: 10.1007/978-3-031-22337-2_17.
[2] S. Beckers, J. Y. Halpern, and C. Hitchcock. Causal models with constraints, 2023.
[3] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, New York, 1991.
[4] M. J. Druzdzel and H. A. Simon. Causality in Bayesian belief networks. In Conference on Uncertainty in Artificial Intelligence, 1993. URL https://api.semanticscholar.org/CorpusID:14801431.
[5] G. Gallo, G. Longo, S. Pallottino, and S. Nguyen. Directed hypergraphs and applications. Discrete Applied Mathematics, 42(2):177-201, 1993. doi: 10.1016/0166-218X(93)90045-P.
[6] D. Geiger, T. Verma, and J. Pearl. Identifying independence in Bayesian networks. Networks, 20(5):507-534, 1990. doi: 10.1002/net.3230200504.
[7] J. Y. Halpern. Axiomatizing causal reasoning. Journal of Artificial Intelligence Research, 12:317-337, 2000.
[8] D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1(Oct):49-75, 2000.
[9] C. Hitchcock. Causal models. In E. N. Zalta and U. Nodelman, editors, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Summer 2024 edition, 2024.
[10] R. James. (Stumbling blocks) on the road to understanding multivariate information theory. Discrete Information Theory package documentation, 2018. URL https://dit.readthedocs.io/en/latest/stumbling.html.
[11] R. G. James and J. P. Crutchfield. Multivariate dependence beyond Shannon information. Entropy, 19(10), 2017. doi: 10.3390/e19100531.
[12] I. Kobyzev, S. J. Prince, and M. A. Brubaker. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):3964-3979, 2021. doi: 10.1109/tpami.2020.2992934.
[13] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[14] S. L. Lauritzen, A. P. Dawid, B. N. Larsen, and H.-G. Leimer. Independence properties of directed Markov fields. Networks, 20(5):491-505, 1990. doi: 10.1002/net.3230200503.
[15] leonbloy (https://math.stackexchange.com/users/312/leonbloy). Conditioning reduces mutual information. Mathematics Stack Exchange, 2015. URL https://math.stackexchange.com/q/1219753 (version: 2015-04-04).
[16] D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
[17] P. Naumov and B. Nicholls. R.e. axiomatization of conditional independence, 2013.
[18] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000.
[19] J. Pearl. Causality. Cambridge University Press, 2009.
[20] J. Pearl. Causal inference in statistics: An overview. Statistics Surveys, 3:96-146, 2009. doi: 10.1214/09-SS057.
[21] J. Pearl and A. Paz. Graphoids: A graph-based logic for reasoning about relevance relations. In Advances in Artificial Intelligence, Vol. II, 1987.
[22] O. E. Richardson. Loss as the inconsistency of a probabilistic dependency graph: Choose your model, not your loss function. In AISTATS, PMLR 151, 2022.
[23] O. E. Richardson and J. Y. Halpern. Probabilistic dependency graphs. In Proc. Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), pages 12174-12181, 2021.
[24] O. E. Richardson, J. Y. Halpern, and C. De Sa. Inference for probabilistic dependency graphs. In Uncertainty in Artificial Intelligence, pages 1741-1751. PMLR, 2023.
[25] E. G. Tabak and E. Vanden-Eijnden. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217-233, 2010.
[26] P. L. Williams and R. D. Beer. Nonnegative decomposition of multivariate information, 2010.
The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: This is a theoretical work and as such we take mathematical precision very seriously. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [NA] Justification: This paper is theoretical in nature and does not include experimental results. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. 
If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: This paper is theoretical in nature and there is no associated data or code. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. 
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [NA]

Justification: This paper is theoretical and does not involve any training or testing of models.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [NA]

Justification: This paper is theoretical and does not contain experiments.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [NA]

Justification: This paper does not have experiments, computational or otherwise.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).
9. Code of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?

Answer: [Yes]

Justification: To the best of our understanding, since this paper focuses on purely theoretical questions regarding graphical languages and the structure of probability distributions, there are no substantive ethical concerns to address.

Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA]

Justification: This paper is best viewed as basic theoretical research and is therefore far from direct societal impacts.

Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: We do not have and have not released any data or models.

Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [NA]

Justification: We use no external assets.

Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA]

Justification: We do not release new assets.

Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: The paper does not involve crowdsourcing nor research with human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: The paper does not involve crowdsourcing nor research with human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.

A Proofs

We begin with a de-randomization construction that will be useful for the proofs.

A.1 From CPDs to Distributions over Functions

Compare two objects: a cpd p(Y | X), and a distribution q(Y^X) over functions g : V(X) → V(Y). The latter is significantly larger: if |V(X)| = |V(Y)| = N, then q is an N^N-dimensional object, while p has dimension only N^2. A choice of distribution q(Y^X) corresponds to a unique choice of cpd p(Y | X), according to

p(Y=y | X=x) := q(Y^X(x) = y).

Claim 1.
1. The definition above in fact yields a cpd, i.e., Σ_y p(Y=y | X=x) = 1 for all x ∈ V(X).
2. This definition of p(Y | X) is the conditional marginal of any joint distribution µ(X, Y, Y^X) satisfying µ(Y^X) = q and µ(Y = Y^X(X)) = 1.

Both p and q give probabilistic information about Y conditioned on X. But q(Y^X) contains strictly more information. Not only does it specify the distribution over Y given X=x, but it also contains counterfactual information about the distribution of Y if X were equal to x′, conditioned on the fact that, in reality, X=x. Is there a natural construction that goes in the opposite direction, intuitively making as many independence assumptions as possible? It turns out there is:

q(Y^X = g) := ∏_{x ∈ V(X)} p(Y=g(x) | X=x).

Think of Y^X as a collection of variables {Y^x : x ∈ V(X)} describing the value of the function for each input, so that q is a joint distribution over them. This construction simply asks that these variables be independent. Specifying a distribution with these independencies amounts to a choice of marginal distribution q(Y^x) for each x ∈ V(X), and hence is essentially a function of type V(X) → ∆V(Y), the same as p. In addition, if we apply the previous construction, we recover p, since:

q(Y^X(x) = y) = Σ_{g : V(X)→V(Y)} 1[g(x) = y] ∏_{x′ ∈ V(X)} p(Y=g(x′) | X=x′)
             = Σ_{g : V(X)→V(Y)} 1[g(x) = y] p(Y=g(x) | X=x) ∏_{x′ ≠ x} p(Y=g(x′) | X=x′)
             = p(Y=y | X=x) Σ_{g : V(X)→V(Y)} 1[g(x) = y] ∏_{x′ ≠ x} p(Y=g(x′) | X=x′)
             = p(Y=y | X=x) Σ_{g : V(X)\{x} → V(Y)} ∏_{x′ ∈ V(X)\{x}} p(Y=g(x′) | X=x′)
             = p(Y=y | X=x).

The final equality holds because the remainder of the terms can be viewed as the probability of selecting any function from V(X)\{x} to V(Y), under an analogous measure; thus, it equals 1. This will be a useful construction for us in general.
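The construction above is easy to make concrete. The following is a minimal sketch of ours (the dictionary encoding of cpds and the helper names are invented, not from the paper): it builds the "maximally independent" distribution q(Y^X) from an arbitrary cpd p(Y | X), and checks numerically that marginalizing q back down recovers p, exactly as in the displayed derivation.

```python
import itertools
from math import prod

# Sketch of the Appendix A.1 construction. A function g : V_X -> V_Y is
# encoded as a tuple of outputs, one per x in V_X; the encoding is ours.

V_X = ['x0', 'x1']
V_Y = ['y0', 'y1', 'y2']

p = {'x0': {'y0': 0.2, 'y1': 0.5, 'y2': 0.3},   # an arbitrary cpd p(Y|X)
     'x1': {'y0': 0.6, 'y1': 0.1, 'y2': 0.3}}

functions = list(itertools.product(V_Y, repeat=len(V_X)))  # |V_Y|^|V_X| of them

def q(g):
    """The 'maximally independent' q(Y^X = g) = prod_x p(Y=g(x) | X=x)."""
    return prod(p[x][y] for x, y in zip(V_X, g))

assert abs(sum(q(g) for g in functions) - 1.0) < 1e-12   # q is a distribution

# Recovering the cpd: q(Y^X(x) = y) should equal p(Y=y | X=x).
for i, x in enumerate(V_X):
    for y in V_Y:
        recovered = sum(q(g) for g in functions if g[i] == y)
        assert abs(recovered - p[x][y]) < 1e-12
print("recovered p(Y|X) from q(Y^X)")
```

Note the size gap the text describes: p is stored in |V_X| · |V_Y| numbers, while q ranges over |V_Y|^|V_X| functions.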
A.2 Results on (In)dependence

Lemma 10. Suppose X_1, ..., X_n are variables, Y_1, ..., Y_n are sets, and for each i ∈ {1, ..., n} we have a function f_i : V(X_i) → Y_i. If X_1, ..., X_n are mutually independent (according to a joint distribution µ), then so are f_1(X_1), ..., f_n(X_n).

Proof. This is an intuitive fact, but we provide a proof for completeness. Explicitly, mutual independence of X_1, ..., X_n means that, for all joint settings x = (x_1, ..., x_n), we have µ(X_1=x_1, ..., X_n=x_n) = ∏_{i=1}^n µ(X_i=x_i). So, for any joint setting y = (y_1, ..., y_n) ∈ Y_1 × ⋯ × Y_n, writing f(x) := (f_1(x_1), ..., f_n(x_n)), we have

µ(f_1(X_1)=y_1, ..., f_n(X_n)=y_n) = µ({x : f(x) = y})
  = Σ_{(x_1,...,x_n) ∈ V(X_1,...,X_n) : f_1(x_1)=y_1, ..., f_n(x_n)=y_n} µ(X_1=x_1, ..., X_n=x_n)
  = Σ_{x_1 ∈ V(X_1) : f_1(x_1)=y_1} ⋯ Σ_{x_n ∈ V(X_n) : f_n(x_n)=y_n} µ(X_1=x_1, ..., X_n=x_n)
  = Σ_{x_1 : f_1(x_1)=y_1} ⋯ Σ_{x_n : f_n(x_n)=y_n} ∏_{i=1}^n µ(X_i=x_i)
  = ∏_{i=1}^n µ(f_i(X_i) = y_i). □

Lemma 11 (properties of determination).
1. If ν |= A → B and ν |= A → C, then ν |= A → (B, C).
2. If ν |= A → B and ν |= B → C, then ν |= A → C.

Proof. Here ν |= A → B means that there exists a function f : V(A) → V(B) such that ν(f(A) = B) = 1, i.e., the event f(A) = B occurs with probability 1.

1. Let f : V(A) → V(B) and g : V(A) → V(C) be such that ν(f(A) = B) = 1 = ν(g(A) = C). Since both events happen with probability 1, so must the event [f(A) = B and g(A) = C]. Thus the event (f(A), g(A)) = (B, C) occurs with probability 1. Therefore, ν |= A → (B, C).
2. The same ideas, but faster: we have f : V(A) → V(B) as before, and g : V(B) → V(C), such that the events f(A) = B and g(B) = C occur with probability 1. By the same logic, their conjunction holds with probability 1, and hence the event g(f(A)) = C occurs with probability 1. So ν |= A → C. □

Theorem 1. If G is a directed acyclic graph and I(G) consists of the independencies of its corresponding Bayesian network, then µ |= A_G if and only if µ satisfies I(G).

Proof. Label the vertices of G = (N, E) by natural numbers that form a topological sort of G; that is, without loss of generality, suppose N = [n] := {1, 2, ..., n}, and i < j whenever (i → j) ∈ E. By the definition of A_G, the arcs A_G = {S_i → i}_{i=1}^n are also indexed by integers. Finally, write X = (X_1, ..., X_n) for the variables corresponding to N, over which µ is defined.

(⇒). Suppose µ |= A_G. This means there is an extension µ̃(X, U) of µ(X) to additional independent variables U = (U_1, ..., U_n), such that µ̃ |= (S_i, U_i) → X_i for all i ∈ [n]. First, we claim that if µ̃ is such a witness, then µ̃ |= (U_1, ..., U_k) → (X_1, ..., X_k) for all k ∈ [n], and so in particular, µ̃ |= U → X. This follows from QIM-compatibility's condition (c) and the fact that G is acyclic, by induction. In more detail: the base case of k = 0 holds vacuously. Suppose that µ̃ |= (U_1, ..., U_k) → (X_1, ..., X_k) for some k < n. Now, condition (c) of Definition 2 says µ̃ |= (S_{k+1}, U_{k+1}) → X_{k+1}. Because the variables are sorted in topological order, the parent variables S_{k+1} are a subset of {X_1, ..., X_k}, which are determined by (U_1, ..., U_k) by the induction hypothesis; at the same time, clearly µ̃ |= (U_1, ..., U_{k+1}) → U_{k+1} as well. So, by two instances of Lemma 11, µ̃ |= (U_1, ..., U_{k+1}) → X_{k+1}. Combining this with our inductive hypothesis, we find that µ̃ |= (U_1, ..., U_{k+1}) → (X_1, ..., X_{k+1}). So, by induction, µ̃ |= (U_1, ..., U_k) → (X_1, ..., X_k) for all k ∈ [n], and in particular, µ̃ |= U → X.

With this in mind, we now return to proving that µ has the required independencies. It suffices to show that µ(X) = ∏_{i=1}^n µ(X_i | S_i). We do so by showing that, for all k ∈ [n], µ(X_1, ..., X_k) = µ(X_1, ..., X_{k-1}) µ(X_k | S_k). By QIM-compatibility witness condition (c), we know that µ̃ |= (S_k, U_k) → X_k, and so there exists a function f_k : V(S_k) × V(U_k) → V(X_k) for which the event f_k(S_k, U_k) = X_k occurs with probability 1.
Since µ̃ |= (U_1, ..., U_{k-1}) → (X_1, ..., X_{k-1}), and U_k is independent of (U_1, ..., U_{k-1}), it follows from Lemma 10 that µ̃ |= (X_1, ..., X_{k-1}) ⊥ U_k. Thus

µ̃(X_1, ..., X_{k-1}, X_k) = Σ_{u ∈ V(U_k)} µ̃(X_1, ..., X_{k-1}) µ̃(U_k = u) 1[X_k = f_k(S_k, u)]
                         = µ̃(X_1, ..., X_{k-1}) Σ_{u ∈ V(U_k)} µ̃(U_k = u) 1[X_k = f_k(S_k, u)].

Observe that the quantity on the right, including the sum, is a function of X_k and S_k, but of no other variables; let φ(X_k, S_k) denote this quantity. Because µ̃ is a probability distribution, we know that φ(X_k, S_k) must be the conditional probability of X_k given X_1, ..., X_{k-1}, and it depends only on the variables S_k. Thus µ(X_1, ..., X_k) = µ(X_1, ..., X_{k-1}) µ(X_k | S_k). Therefore µ(X) factors as required by the BN G, meaning that µ has the independencies specified by G. (See Koller & Friedman, Theorem 3.2, for instance.)

(⇐). Suppose µ satisfies the independencies of G, meaning that each node is conditionally independent of its non-descendants given its parents. We now repeatedly apply the construction of Appendix A.1 to construct a QIM-compatibility witness. Specifically, for k ∈ {1, ..., n}, let U_k be a variable whose values V(U_k) := V(X_k)^{V(S_k)} are functions from values of X_k's parents to values of X_k. Let U denote the joint variable (U_1, ..., U_n), and observe that a setting g = (g_1, ..., g_n) of U uniquely picks out a value of X, by evaluating each function in order. Let's call this function f : V(U) → V(X). To be more precise, we now construct f(g) inductively. The first component we must produce is X_1, but since X_1 has no parents, g_1 effectively describes a single value of X_1, so we define the first component f(g)[X_1] to be that value. More generally, assuming that we have already defined the components X_1, ..., X_{i-1}, among which are the variables S_i on which X_i depends, we can determine the value of X_i; formally, this means defining f(g)[X_i] := g_i(f(g)[S_i]), which, by our inductive assumption, is well-defined. Note that, for all g ∈ V(U) and x ∈ V(X), the function f is characterized by the property

f(g) = x  if and only if  for all i ∈ [n], g_i(x[S_i]) = x[X_i].    (4)

To quickly verify this: if f(g) = x, then in particular, for each i ∈ [n], x[X_i] = f(g)[X_i] = g_i(x[S_i]) by the definition above. Conversely, if the right-hand side of (4) holds, then we can prove f(g) = x by induction over our construction of f: if f(g)[X_j] = x[X_j] for all j < i, then f(g)[X_i] = g_i(f(g)[S_i]) = g_i(x[S_i]) = x[X_i].

Next, we define an unconditional probability over each U_i according to

µ_i(U_i = g) := ∏_{s ∈ V(S_i)} µ(X_i = g(s) | S_i = s),

which, as verified in Appendix A.1, is indeed a probability distribution, and has the property that µ_i(U_i(s) = x) = µ(X_i = x | S_i = s) for all x ∈ V(X_i) and s ∈ V(S_i). By taking an independent combination (tensor product) of these unconditional distributions, we obtain a joint distribution µ̃(U) = ∏_{i=1}^n µ_i(U_i). Finally, we extend this distribution to a full joint distribution µ̃(U, X) via the pushforward of µ̃(U) through the function f defined by induction above. In this distribution, each X_i is determined by U_i and S_i. By construction, the variables U are mutually independent (giving Definition 2(b)), and satisfy (S_k, U_k) → X_k for all k ∈ [n] (Definition 2(c)). It remains only to verify that the marginal of µ̃ on the variables X is the original distribution µ (Definition 2(a)). Here is where we rely on the fact that µ satisfies the independencies of G, which means that we can factor µ(X) as µ(X) = ∏_{i=1}^n µ(X_i | S_i).
µ̃(X = x) = Σ_{g ∈ V(U)} µ̃(U = g) δ_{f(g)}(x)
  = Σ_{(g_1,...,g_n) ∈ V(U)} 1[x = f(g)] ∏_{i=1}^n µ_i(U_i = g_i)
  = Σ_{(g_1,...,g_n) ∈ V(U)} 1[∀i ∈ [n]. g_i(x[S_i]) = x[X_i]] ∏_{i=1}^n µ_i(U_i = g_i)    [by (4)]
  = ∏_{i=1}^n Σ_{g ∈ V(U_i)} 1[g(x[S_i]) = x[X_i]] µ_i(U_i = g)
  = ∏_{i=1}^n µ_i({g : g(s_i) = x_i})    where x_i := x[X_i], s_i := x[S_i]
  = ∏_{i=1}^n µ_i(U_i(s_i) = x_i)
  = ∏_{i=1}^n µ(X_i = x_i | S_i = s_i)
  = µ(X = x).

Therefore, when µ satisfies the independencies of a BN G, it is QIM-compatible with A_G. □

Before we move on to proving the other results in the paper, we first illustrate how this relatively substantial first half of the proof of Theorem 1 can be dramatically simplified by relying on two information-theoretic arguments.

Alternate, information-based proof. (⇒). Let G be a dag. If µ |= A_G, then by Theorem 7, IDef_{A_G}(µ) ≤ 0. In the appendix of [23], it is shown that IDef_{A_G}(µ) ≥ 0, with equality iff µ satisfies the BN's independencies. Thus µ must satisfy the appropriate independencies. □
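To make the witness construction in the (⇐) direction concrete, here is a small numerical sketch. It is ours, not the paper's (all encodings of cpds and functions are invented), and it treats the two-node BN G : X1 → X2, whose independencies every µ(X1, X2) satisfies. It builds µ̃(U, X) exactly as above and checks condition (a), that the marginal on X recovers µ; conditions (b) and (c) hold by construction.

```python
import itertools
from math import prod

VX = [0, 1]
mu = {(x1, x2): p for (x1, x2), p in
      zip(itertools.product(VX, VX), [0.32, 0.08, 0.15, 0.45])}

mu_X1 = {x1: sum(mu[(x1, x2)] for x2 in VX) for x1 in VX}
cpd_X2 = {x1: {x2: mu[(x1, x2)] / mu_X1[x1] for x2 in VX} for x1 in VX}

# U1 is a "function" from the empty context, i.e. just a value of X1.
# U2 is a function V(X1) -> V(X2), encoded as the pair (g(0), g(1)), with the
# product ("maximally independent") distribution from Appendix A.1.
mu1 = {u1: mu_X1[u1] for u1 in VX}
mu2 = {g: prod(cpd_X2[x1][g[x1]] for x1 in VX)
       for g in itertools.product(VX, repeat=2)}

# Pushforward through f : (u1, g) |-> (x1, x2) = (u1, g(u1)).
# (b): U1, U2 independent since tilde_mu(U) = mu1 x mu2.
# (c): X1 = U1 and X2 = g(X1) hold with probability 1 by construction.
tilde_mu = {(u1, g, u1, g[u1]): mu1[u1] * mu2[g]
            for u1 in mu1 for g in mu2}

# (a): the marginal on (X1, X2) is the original mu.
for (x1, x2), pxy in mu.items():
    marg = sum(p for (u1, g, a, b), p in tilde_mu.items() if (a, b) == (x1, x2))
    assert abs(marg - pxy) < 1e-12
print("witness marginal matches mu")
```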
Theorem 2. Write A (+n) X→Y for the hypergraph A augmented with n additional parallel copies of the arc X → Y. Then:
(a) µ |= A and µ |= X → Y if and only if µ |= A (+n) X→Y for all n ≥ 0.
(b) If A = A_G for a dag G, then µ |= A and µ |= X → Y if and only if µ |= A (+1) X→Y.
(c) If there is some a ∈ A such that S_a = ∅ and X ∈ T_a, then µ |= A and µ |= X → Y if and only if µ |= A (+2) X→Y.

Proof. (a). The forward direction is straightforward. Suppose that µ |= A and µ |= X → Y. The former condition gives us a witness ν(X, U), in which U = {U_a}_{a∈A} are mutually independent variables, indexed by A, that determine their respective arcs. Extend ν in the unique way to n additional constant variables U_1, ..., U_n, each of which can take on only one value. We claim that this extended distribution ν′, which we conflate with ν because it is not meaningfully different, is a witness to µ |= A (+n) X→Y. Since µ |= X → Y, it must also be that ν |= X → Y, and it follows that ν |= (X, U_i) → Y for all i ∈ {1, ..., n}, demonstrating that the new requirements imposed on ν′ by Definition 2(c) hold. (The remainder of the requirements for condition (c), namely that ν |= (S_a, U_a) → T_a for a ∈ A, still hold because ν′ is an extension of ν, which we know has this property.) Finally, since U are mutually independent and each U_i is a constant (and hence independent of everything), the variables U′ := U ∪ {U_i}_{i=1}^n are also mutually independent. Thus ν′ (or, more precisely, an isomorphic extension of it to additional trivial variables) is a witness of µ |= A (+n) X→Y.

The reverse direction is difficult to prove directly, yet it is a straightforward application of Theorem 7. Suppose that µ |= A (+n) X→Y for all n ≥ 0. By Theorem 7, we know that

0 ≥ IDef_{A (+n) X→Y}(µ) = IDef_A(µ) + n H_µ(Y|X).

Because IDef_A(µ) is bounded below (by −log |V(X)|, since IDef_A(µ) ≥ −H_µ(X)), it cannot be the case that H_µ(Y|X) > 0; otherwise, the inequality above would fail for large n (specifically, for n > log |V(X)| / H_µ(Y|X)). By Gibbs' inequality, H_µ(Y|X) is non-negative, and thus it must be the case that H_µ(Y|X) = 0. Thus µ |= X → Y. It is also true that µ |= A, by monotonicity (Proposition 13).

(b). Now suppose A = A_G for some dag G. The forward direction of the equivalence is strictly weaker than the one we already proved in part (a): there we showed µ |= A (+n) X→Y for all n ≥ 0, and here we need it only for n = 1. The reverse direction is what's interesting. As before, we take a significant shortcut by using Theorem 7. Suppose µ |= A (+1) X→Y. In this case, where A = A_G, it was shown by Richardson and Halpern [23] that IDef_A(µ) ≥ 0. It follows that

0 ≥ IDef_{A (+1) X→Y}(µ) = IDef_A(µ) + H_µ(Y|X) ≥ H_µ(Y|X) ≥ 0

(the left inequality by Theorem 7), and thus H_µ(Y|X) = 0, meaning that µ |= X → Y, as promised. As before, we also have µ |= A by monotonicity.

(c). As in part (b), the forward direction is a special case of the forward direction of part (a), and it remains only to prove the reverse direction. Equipped with the additional information that A contains an arc a with S_a = ∅ and X ∈ T_a, suppose that µ |= A (+2) X→Y. By monotonicity, this means µ |= A, and also that µ |= {∅ → X} (+2) X→Y; let A′ denote this latter hypergraph. Once again by appeal to Theorem 7, we have that

0 ≥ IDef_{A′}(µ) = −H_µ(X, Y) + H_µ(X) + 2 H_µ(Y|X) = H_µ(Y|X) ≥ 0.

It follows that H_µ(Y|X) = 0, and thus µ |= X → Y. As mentioned above, we also know that µ |= A, and thus µ |= A and µ |= X → Y, as promised. □

A.3 Causality Results of Section 3

Proposition 3. Given a graph G and a distribution µ, µ |= A_G iff there exists a fully randomized PSEM of dependency structure A_G from which µ can arise.

Proof. (⇒). Suppose µ |= A_G. Thus there exists some witness µ̃(X, U) to this fact, satisfying conditions (a–c) of Definition 2. Because A_G is partitional, the elements of PSEMs_{A_G}(µ̃) are ordinary (i.e., not generalized) randomized PSEMs. We claim that every M = (M, P) ∈ PSEMs_{A_G}(µ̃) is a randomized PSEM from which µ can arise, which moreover has the property that Pa_M(Y) ⊆ Pa_G(Y) ∪ {U_Y} for all Y ∈ X.

- The hyperarcs of A_G correspond to the vertices of G, which in turn correspond to the variables in X; thus U = {U_X}_{X∈X}. By property (b) of QIM-compatibility witnesses (Definition 2), these variables {U_X}_{X∈X} are mutually independent according to µ̃. Furthermore, because M = (M, P) ∈ PSEMs_{A_G}(µ̃), we know that µ̃(U) = P, and thus the variables in U must be mutually independent according to P. By construction, in causal models M ∈ PSEMs_{A_G}(µ̃), the equation f_Y can depend only on S_Y = Pa_G(Y) ⊆ X and U_Y. So, in particular, f_Y does not depend on U_X for X ≠ Y. Altogether, we have shown that M contains exogenous variables {U_X}_{X∈X} that are mutually independent according to P, and that f_Y does not depend on U_X when X ≠ Y. Thus, M is a randomized PSEM.
- By condition (a) on QIM-compatibility witnesses (Definition 2), we know that µ̃(X) = µ. By Proposition 5(a), we know that µ̃ ∈ {{M}}. Together, the previous two sentences mean that µ can arise from M.
- Finally, as mentioned in the first bullet item, the equation f_Y in M can depend only on S_Y = Pa_G(Y) and on U_Y. Thus Pa_M(Y) ⊆ Pa_G(Y) ∪ {U_Y} for all Y ∈ X.

Under the assumption that µ |= A_G, we have now shown that there exists a randomized causal model M from which µ can arise, with the property that Pa_M(Y) ⊆ Pa_G(Y) ∪ {U_Y} for all Y ∈ X.

(⇐). Conversely, suppose there is a randomized PSEM M = (M = (Y, U, F), P) with the property that Pa_M(Y) ⊆ Pa_G(Y) ∪ {U_Y} for all Y, from which µ can arise. The last clause means there exists some ν ∈ {{M}} such that ν(X) = µ. We claim that this ν is a witness to µ |= A_G. We already know that condition (a) of being a QIM-compatibility witness is satisfied, since ν(X) = µ. Condition (b) holds because of the assumption that {U_X}_{X∈X} are mutually independent in the distribution P of a randomized PSEM (and the fact that ν(U) = P, since ν ∈ {{M}}). Finally, we must show (c): for each Y ∈ X, ν |= (Pa_G(Y) ∪ {U_Y}) → Y. Since ν ∈ {{M}}, we know that M's equation for Y holds with probability 1 in ν, and so it must be the case that ν |= Pa_M(Y) → Y. Note that, in general, if A ⊆ B and A → C, then B → C. By assumption, Pa_M(Y) ⊆ Pa_G(Y) ∪ {U_Y}, and thus ν |= (Pa_G(Y) ∪ {U_Y}) → Y. Thus ν satisfies all conditions (a–c) for a QIM-compatibility witness, and hence µ |= A_G. □
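The de-randomization at the heart of Propositions 3–5 is itself a tiny algorithm: from a witness one can read off the equations of a PSEM directly. The helper below is a hypothetical sketch of ours (the dictionary encoding of ν and all index conventions are invented), implementing the rule, used again in the proof of Proposition 5(b), that f_X(s, u) is the unique x with ν(s, u, x) > 0.

```python
# Extracting a PSEM equation f_X from a witness nu(X, U), encoded (our choice)
# as a dict from full joint settings (tuples) to probabilities. Witness
# condition (c) guarantees that, for each (s, u) with nu(s, u) > 0, exactly
# one value x of the target has positive probability. Where nu(s, u) = 0,
# f_X may be chosen freely; that freedom is exactly the non-determinism in
# the choice of M within PSEMs_A(nu).

def extract_equation(nu, target_idx, source_idxs, noise_idx):
    f = {}
    for setting, prob in nu.items():
        if prob == 0:
            continue
        key = (tuple(setting[i] for i in source_idxs), setting[noise_idx])
        x = setting[target_idx]
        assert f.setdefault(key, x) == x, "not a functional dependence"
    return f  # a dict (s, u) -> x, defined nu-almost-surely
```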
Proposition 4. µ |= A iff there exists a generalized randomized PSEM with structure A from which µ can arise.

Proof. (⇒). Suppose µ |= A, meaning there exists a witness ν(X, U) with property Definition 2(c): for all a ∈ A, there is a functional dependence (S_a, U_a) → T_a. Thus, there is some set of functions F with these types that holds with probability 1 according to ν. Meanwhile, by Definition 2(b), the variables U are mutually independent according to ν, so defining P_a(U_a) := ν(U_a), we have ν(U) = ∏_{a∈A} P_a(U_a). Together, the previous two conditions (non-deterministically) define a generalized randomized PSEM M of shape A for which ν ∈ {{M}}. Finally, by Definition 2(a), we know that µ can arise from M.

(⇐). Conversely, suppose there is a generalized randomized PSEM M of shape A from which µ(X) can arise. Thus, there is some ν ∈ {{M}} whose marginal on X is µ. We claim that this ν is also a witness that µ |= A. The marginal constraint from Definition 2(a) is clearly satisfied. Condition (b) is immediate as well, because ν(U) = ∏_a P_a(U_a). Finally, condition (c) is satisfied because the equations of M hold with probability 1, ensuring the appropriate functional dependencies. □

Proposition 5. If µ̃(X, U_A) is a witness for QIM-compatibility with A and M is a PSEM with dependency structure A, then µ̃ ∈ {{M}} if and only if M ∈ PSEMs_A(µ̃).

Proof. (a) is straightforward. Suppose M ∈ PSEMs_A(ν). By construction, the equations of M reflect functional dependencies in ν, and hence hold with probability 1 (see footnote 6). Furthermore, the distribution P(U) in every M ∈ PSEMs_A(ν) is equal to ν(U). These two facts demonstrate that ν satisfies the two constraints required for membership in {{M}}.

(b). We do the two directions separately. First, suppose M ∈ PSEMs_A(ν). We have already shown (in part (a)) that ν ∈ {{M}}. The construction of PSEMs_A(ν) depends on the hypergraph A (even if the dependence is not explicit in our notation) in such a way that f_X does not depend on any variables beyond U_{a_X} and S_{a_X}. Thus, Pa_M(X) ⊆ S_{a_X} ∪ {U_{a_X}}.

Conversely, suppose M = (X, U, F) is a PSEM satisfying ν ∈ {{M}} and Pa_M(X) ⊆ S_{a_X} ∪ {U_{a_X}}. We would like to show that M ∈ PSEMs_A(ν). Because ν ∈ {{M}}, we know that the distribution P(U) over the exogenous variables of the PSEM M is equal to ν(U), matching the first part of our construction. What remains is to show that the equations F are consistent with our transformation. Choose any X ∈ X. Because A is subpartitional, there is a unique a_X ∈ A such that X ∈ T_{a_X}. Now choose any values s ∈ V(S_{a_X}) and u ∈ V(U_{a_X}). If ν(s, u) > 0, then we know there is a unique value x ∈ V(X) such that ν(s, u, x) > 0. Since M's equation for X, namely f_X, depends only on s and u, and holds with probability 1, we know that f_X(s, u) = x, as required. On the other hand, if ν(s, u) = 0, then any choice of f_X(s, u) is consistent with our procedure. Since this is true for all X, and for all possible inputs to the equation f_X, we conclude that the equations F can arise from the procedure described in the main text, and therefore M ∈ PSEMs_A(ν). □

[Footnote 6: When the probability of some combination of source variables is zero, there is typically more than one choice of functions that holds with probability 1; the choice of functions is essentially the choice of M ∈ PSEMs_A(ν).]
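Theorem 6 below contrasts conditioning on the event do_M(X=x) with passing to the intervened model M_{X←x}. This contrast can be checked by brute force on a toy model. The sketch below is entirely ours (the model, with equations X = U_X and Y = X ⊕ U_Y and fair coins U_X, U_Y, is invented); in it, the two procedures produce the same distribution over the endogenous variables, as part (a) of the theorem predicts.

```python
import itertools

# (i) Condition the witness on the forcing event do(X=1). Since f_X = U_X,
# the event "U_X forces X=1 regardless of its other inputs" is just U_X = 1.
cond = {}
for ux, uy in itertools.product([0, 1], repeat=2):   # each u has prob 1/4
    if ux == 1:                      # the event do(X=1)
        x, y = ux, ux ^ uy           # solve the original equations
        cond[(x, y)] = cond.get((x, y), 0) + 0.25
total = sum(cond.values())
cond = {vy: p / total for vy, p in cond.items()}     # normalize

# (ii) Run the intervened model M_{X<-1}: X := 1, Y = X xor U_Y, with a fresh
# copy of P(U) left untouched.
intervened = {}
for ux, uy in itertools.product([0, 1], repeat=2):
    x = 1                            # intervened equation
    y = x ^ uy
    intervened[(x, y)] = intervened.get((x, y), 0) + 0.25

assert cond == intervened            # both are {(1,0): 0.5, (1,1): 0.5}
print("mu(V | do(X=1)) arises from M_{X<-1}")
```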
Theorem 6. Suppose that µ̃ is a QIM witness for A, that M = (U, V, F, P) ∈ PSEMs_A(µ̃) is a corresponding PSEM, and that the noise variables U_A = {U_X}_{X∈V} are independent of the other exogenous variables U \ U_A. For all X ⊆ V and x ∈ V(X), if µ̃(do_M(X=x)) > 0, then:
(a) µ̃(V | do_M(X=x)) can arise from M_{X←x};
(b) for all events φ ⊆ V(V),

Pr_M([X←x]φ) ≤ µ̃(φ | do_M(X=x)) ≤ Pr_M(⟨X←x⟩φ),

and all three are equal when M |= U → V (such as when M is acyclic).

Proof. (part a). Because we have assumed µ̃(do_M(X=x)) > 0, the conditional distribution

µ̃ | do_M(X=x) = µ̃(U, X) ∏_{X′∈X} 1[∀s. f_{X′}(U_{X′}, s) = x[X′]] / µ̃(do_M(X=x))

is defined. By assumption, M ∈ PSEMs_A(µ̃) and µ̃ is a witness to the fact that µ |= A. Thus, by Proposition 5, we know that µ̃ ∈ {{M}}. So, in particular, all equations of M hold for all joint settings (u, v) ∈ V(U ∪ V) in the support of µ̃. But the support of the conditional distribution µ̃ | do_M(X=x) is a subset of the support of µ̃, so all equations of M also hold in the conditioned distribution. Furthermore, the event do_M(X=x) is the event in which, for each X ∈ X, the variable U_X takes on a value such that f_X(..., U_X, ...) = x[X] no matter the values of its other inputs. Thus the equations corresponding to X ← x also hold with probability 1 in µ̃ | do_M(X=x). This shows that all equations of M_{X←x} hold with probability 1 in µ̃ | do_M(X=x).

However, the marginal distribution µ̃(U | do_M(X=x)) over U is typically not equal to P(U); after all, we have collapsed the distribution of the variables U_X := {U_X : X ∈ X}. So, in general, µ̃ | do_M(X=x) ∉ {{M_{X←x}}}. However, as we now show, there exists a different distribution µ′ ∈ {{M_{X←x}}} such that µ′(V) = µ̃(V | do_M(X=x)).

Let U_0 := U \ U_X. We define µ′ according to

µ′(V, U_X, U_0) := µ̃(V, U_0 | do_M(X=x)) P(U_X).

The distribution µ′ satisfies three critical properties:

1. By construction, the marginal of µ′ on V is µ′(V) = µ̃(V | do_M(X=x)).

2. At the same time, the marginal of µ′ on the exogenous variables is

µ′(U) = µ′(U_X, U_0)
  = Σ_{v ∈ V(V)} µ′(v, U_X, U_0)
  = Σ_{v ∈ V(V)} µ̃(v, U_0 | do_M(X=x)) P(U_X)    [definition of µ′]
  = P(U_X) µ̃(U_0 | do_M(X=x))
  = P(U_X) P(U_0 | do_M(X=x))    [because µ̃(U) = P(U), since µ̃ ∈ {{M}}]
  = P(U_X) P(U_0)    [since do_M(X=x) depends only on U_X, while U_X and U_0 are independent (by the witness condition)]
  = P(U_X, U_0).    [same reason]

3. Finally, µ′ satisfies all equations of M_{X←x}. It satisfies the equations for the variables X because X = x holds with probability 1. At the same time, the equations of M_{X←x} corresponding to the other variables, say f_Z for Z ∈ V \ X, also hold with probability one. This is because the marginal µ′(U_Z, V) is shared with the distribution µ̃ | do_M(X=x), and that distribution satisfies these equations. (It suffices to show that they share this particular marginal because the equations for Z do not depend on U_X.)

Together, properties 2 and 3 show that µ′ ∈ {{M_{X←x}}}, while property 1 shows that µ̃(V | do_M(X=x)) can arise from M_{X←x}. This establishes part (a).

(part b). We will again make use of the distribution µ′ defined in part (a), and its three critical properties listed above. Given a setting u ∈ V(U) of the exogenous variables, let

F_{X←x}(u) := { ω ∈ V(V) : ∀X ∈ X. ω[X] = x[X], and ∀Y ∈ V \ X. ω[Y] = f_Y(ω[V \ {Y}], u) }

denote the set of joint settings of the endogenous variables that are consistent with the equations of M_{X←x}. By definition,

(M, u) |= [X←x]φ  iff  every ω ∈ F_{X←x}(u) satisfies φ  iff  F_{X←x}(u) ⊆ φ;

that is, [X←x]φ holds at u iff φ holds at all points that satisfy the equations of M_{X←x}. Since, given u, the distribution µ′(· | u) is supported only on such points (property 3), in this case µ′(φ | u) = 1. Conversely, if µ′(φ | u) > 0, then there must exist some ω ∈ F_{X←x}(u) satisfying φ, and thus (M, u) |= ⟨X←x⟩φ. Putting both of these observations together, and with a bit more care in the symbolic manipulation, we find that:

Pr_M([X←x]φ) = P({u ∈ V(U) : (M, u) |= [X←x]φ})
  = Σ_{u ∈ V(U)} P(u) 1[F_{X←x}(u) ⊆ φ]
  ≤ Σ_{u ∈ V(U)} P(u) µ′(φ | u) = µ′(φ) = µ̃(φ | do_M(X=x))
  ≤ Σ_{u ∈ V(U)} P(u) 1[F_{X←x}(u) ∩ φ ≠ ∅]
  = P({u ∈ V(U) : (M, u) |= ⟨X←x⟩φ}) = Pr_M(⟨X←x⟩φ),

as desired.
Finally, if µ̃ |= U → V, then F_{X←x}(u) is a singleton for all u, and hence "φ holds for all ω ∈ F_{X←x}(u)" and "φ holds for some ω ∈ F_{X←x}(u)" are equivalent. So, in this case, (M, u) |= [X←x]φ iff (M, u) |= ⟨X←x⟩φ, and thus the probabilities of the two formulas are the same; both must therefore equal µ̃(φ | do_M(X=x)), which we have shown lies between them. □

A.4 Information Theoretic Results of Section 4

To prove Theorem 7 and Theorem 9(a), we will need the following lemma.

Lemma 12. Consider a set of variables Y = {Y_1, ..., Y_n}, and another (set of) variable(s) X. Every joint distribution µ(X, Y) over the values of X and Y satisfies

Σ_{i=1}^n I_µ(X ; Y_i) ≤ I_µ(X ; Y) + Σ_{i=1}^n H_µ(Y_i) − H_µ(Y).

Proof. Since there is only one joint distribution in scope, we omit the subscript µ, writing I(·) instead of I_µ(·) and H(·) instead of H_µ(·), in the body of this proof. The following fact will also be very useful:

I(A; B, C) = I(A; C) + I(A; B | C)    (the chain rule for mutual information). (5)

We prove the lemma by induction on n. In the base case (n = 1), we must show that I(X; Y_1) ≤ I(X; Y_1) + H(Y_1) − H(Y_1), which is an obvious tautology. Now, suppose inductively that

Σ_{i=1}^k I(X ; Y_i) ≤ I(X ; Y_{1:k}) + Σ_{i=1}^k H(Y_i) − H(Y_{1:k})    (IH_k)

for some k < n, where Y_{1:k} = (Y_1, ..., Y_k). We now prove that the analogue for k + 1 also holds. Some calculation reveals:

I(X; Y_{k+1}) ≤ I(X, Y_{1:k}; Y_{k+1})    [chain rule (5), plus non-negativity of conditional mutual information]
  = I(X; Y_{k+1} | Y_{1:k}) + I(Y_{1:k}; Y_{k+1})    [by chain rule (5)]
  = I(X; Y_{1:k+1}) − I(X; Y_{1:k}) + H(Y_{1:k}) + H(Y_{k+1}) − H(Y_{1:k+1}).    [left: one more chain rule (5); right: definition of mutual information]

Observe: adding this inequality to our inductive hypothesis (IH_k) yields (IH_{k+1})! So, by induction, the lemma holds for all n. □

Theorem 7. If µ |= A, then IDef_A(µ) ≤ 0.

Proof. Suppose that µ |= A, meaning that there is a witness ν(X, U) that extends µ and has properties (a–c) of Definition 2. For each hyperarc a, since ν |= (S_a, U_a) → T_a, we have H_ν(T_a | S_a, U_a) = 0, and so

H_µ(T_a | S_a) = H_ν(T_a | S_a, U_a) + I_ν(T_a; U_a | S_a) = I_ν(T_a; U_a | S_a).

Thus, we compute

Σ_{a∈A} H_µ(T_a | S_a) = Σ_{a∈A} I_ν(U_a; T_a | S_a)
  = Σ_{a∈A} [ I_ν(U_a; T_a, S_a) − I_ν(U_a; S_a) ]    [by chain rule (5)]
  ≤ Σ_{a∈A} I_ν(U_a; T_a, S_a)    [since I_ν(U_a ; S_a) ≥ 0]
  ≤ Σ_{a∈A} I_ν(U_a; X)    [since (S_a, T_a) ⊆ X]
  ≤ I_ν(X; U) + Σ_{a∈A} H_ν(U_a) − H_ν(U)    [by Lemma 12]
  = I_ν(X; U)    [since U are independent, per condition (b) of Definition 2]
  ≤ H_ν(X) = H_µ(X).    [per condition (a) of Definition 2]

Thus, IDef_A(µ) = Σ_{a∈A} H_µ(T_a | S_a) − H_µ(X) ≤ 0. □
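Theorem 7 is easy to explore numerically. The sketch below uses helper names of our own to compute IDef_A(µ) = Σ_a H_µ(T_a | S_a) − H_µ(X) for the 3-cycle of Example 4, first on a distribution that is QIM-compatible with it (so the bound IDef ≤ 0 must hold), and then on the xor distribution, where IDef = 1 > 0 certifies incompatibility by the contrapositive of the theorem.

```python
import itertools
from math import log2

# Small discrete distributions are encoded (our choice) as dicts from joint
# settings to probabilities; variables are addressed by coordinate index.

def H(mu, idxs):
    """Entropy (in bits) of the marginal of mu on the given coordinates."""
    marg = {}
    for setting, p in mu.items():
        key = tuple(setting[i] for i in idxs)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def IDef(mu, arcs, all_idxs):
    # each arc is (sources, targets); H(T|S) = H(T,S) - H(S)
    return sum(H(mu, s + t) - H(mu, s) for s, t in arcs) - H(mu, all_idxs)

three_cycle = [((0,), (1,)), ((1,), (2,)), ((2,), (0,))]  # X->Y, Y->Z, Z->X

# X uniform, Y = X, Z = Y xor a coin: compatible with the chain X -> Y -> Z,
# hence (by monotonicity) with the 3-cycle, so IDef must be <= 0.
mu_chain = {}
for x, c in itertools.product([0, 1], repeat=2):
    key = (x, x, x ^ c)
    mu_chain[key] = mu_chain.get(key, 0.0) + 0.25
print(IDef(mu_chain, three_cycle, (0, 1, 2)))   # 0.0  (<= 0, as required)

# A and C fair coins, B = A xor C: IDef = +1 > 0, so by Theorem 7 this
# distribution cannot be QIM-compatible with the 3-cycle.
mu_xor = {(a, a ^ c, c): 0.25 for a, c in itertools.product([0, 1], repeat=2)}
print(IDef(mu_xor, three_cycle, (0, 1, 2)))     # 1.0
```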
Proposition 8. QIMInc_A(µ) ≥ 0, with equality iff µ |= A.

Proof. The first term in the definition of QIMInc can be written as

−H_ν(U) + Σ_{a∈A} H_ν(U_a) = E_ν[ log( ν(U) / ∏_{a∈A} ν(U_a) ) ],

and is therefore the relative entropy between ν(U) and the independent product distribution ∏_{a∈A} ν(U_a). Thus, it is non-negative. The remaining terms of QIMInc_A(µ) are all conditional entropies, and hence non-negative as well. Thus QIMInc_A(µ) ≥ 0.

Now, suppose µ is compatible with A, i.e., there exists some ν(U, X) such that (a) ν(X) = µ(X), (b) H_ν(T_a | S_a, U_a) = 0 for all a ∈ A, and (d) {U_a}_{a∈A} are mutually independent. Then clearly ν satisfies the constraint under the infimum, and every H_ν(T_a | S_a, U_a) is zero. It is also immediate that the final term is zero as well, because it equals ID(ν(U) ‖ ∏_a ν(U_a)), and ν(U) = ∏_a ν(U_a), per the definition of mutual independence. Thus, ν witnesses that QIMInc_A(µ) = 0.

Conversely, suppose QIMInc_A(µ) = 0. Because the feasible set is closed and bounded, and the objective is continuous, the infimum is achieved by some joint distribution ν(X, U) with marginal µ(X). In this distribution ν, we know that every H_ν(T_a | S_a, U_a) = 0 and that ID(ν(U) ‖ ∏_a ν(U_a)) = 0, because if any of these terms were positive, then the sum would be positive as well. So ν satisfies (a) and (b) by definition. And, because relative entropy is zero iff its arguments are identical, we have ν(U) = ∏_a ν(U_a), so the U_a's are mutually independent, and ν satisfies (d) as well. □

Theorem 9.
(a) If (X, A) is a hypergraph, µ(X) is a distribution, and ν(X, U) is an extension of µ to additional variables U = {U_a}_{a∈A} indexed by A, then

IDef_A(µ) ≤ QIMInc_A(µ) ≤ IDef_Â(ν),

where Â denotes the extended hypergraph over (X, U), with arcs ∅ → U_a and (S_a, U_a) → T_a for each a ∈ A, together with U → X.
(b) For all µ and A, there is a choice of ν that achieves the upper bound. That is,

QIMInc_A(µ) = min { IDef_Â(ν) : ν ∈ ∆V(X, U), ν(X) = µ(X) }.

Proof. Part (a). The left-hand inequality (IDef_A(µ) ≤ QIMInc_A(µ)) is a strengthening of the argument used to prove Theorem 7. Specifically, let ν* be a minimizer of the optimization problem defining QIMInc_A(µ). We calculate

QIMInc_A(µ) − IDef_A(µ)
  = [ Σ_{a∈A} H_ν*(T_a | S_a, U_a) − H_ν*(U) + Σ_{a∈A} H_ν*(U_a) ] − [ Σ_{a∈A} H_µ(T_a | S_a) − H_µ(X) ]
  = Σ_{a∈A} [ H_ν*(T_a | S_a, U_a) − H_ν*(T_a | S_a) ] + H_µ(X) − H_ν*(U) + Σ_{a∈A} H_ν*(U_a)
  = −Σ_{a∈A} I_ν*(T_a; U_a | S_a) + H_µ(X) − H_ν*(U) + Σ_{a∈A} H_ν*(U_a).

The argument given in the first five lines of the proof of Theorem 7 gives us a particularly convenient bound for the first group of terms:

Σ_{a∈A} I_ν*(U_a; T_a | S_a) ≤ I_ν*(X; U) + Σ_{a∈A} H_ν*(U_a) − H_ν*(U).

Substituting this into our previous expression, we have:

QIMInc_A(µ) − IDef_A(µ) ≥ −I_ν*(X; U) − Σ_{a∈A} H_ν*(U_a) + H_ν*(U) + H_µ(X) − H_ν*(U) + Σ_{a∈A} H_ν*(U_a)
  = H_µ(X) − I_ν*(X; U) ≥ 0.

The final inequality holds because of our assumption that the marginal ν*(X) equals µ(X). Thus, IDef_A(µ) ≤ QIMInc_A(µ), as promised.

We now turn to the right-hand inequality, and part (b) of the theorem. Recall that ν* is a minimizer of the optimization problem defining QIMInc_A(µ). For the right inequality (QIMInc_A(µ) ≤ IDef_Â(ν)) of part (a), observe that

IDef_Â(ν) = −H_ν(X, U) + Σ_{a∈A} H_ν(U_a) + Σ_{a∈A} H_ν(T_a | S_a, U_a) + H_ν(X | U)
  = −H_ν(U) + Σ_{a∈A} H_ν(U_a) + Σ_{a∈A} H_ν(T_a | S_a, U_a)
  ≥ −H_ν*(U) + Σ_{a∈A} H_ν*(U_a) + Σ_{a∈A} H_ν*(T_a | S_a, U_a) = QIMInc_A(µ).

This proves the right-hand inequality of part (a). Moreover, because this inequality holds with equality when ν = ν* is a minimizer of this quantity (subject to having marginal µ(X)), we have shown part (b) as well. □

B Monotonicity and Undirected Graphical Models

The fact that (quantitative) PDG inconsistency is monotonic is a powerful reasoning principle that can be used to prove many important inequalities [22]. In this section, we develop a related principle for QIM-compatibility. Here is a direct but not very useful analogue: if A ⊆ A′ and µ |= A′, conclude µ |= A. After all, if µ is consistent with a set of independent causal mechanisms, then surely it is consistent with a causal picture in which only a subset of those mechanisms are present and independent. There is a sense in which BNs and MRFs are also monotonic, but in the opposite direction: adding edges to a graph results in a weaker independence statement. We will soon see why.

Since we use directed hypergraphs, there is actually a finer notion of monotonicity at play. Inputs and outputs play opposite roles, and they are naturally monotonic in opposite directions. If there is an obvious way to regard an element of B as an element of B′ (abbreviated B ↪ B′), and A′ ↪ A, then a function f : A → B can be regarded as one of type A′ → B′. (This is depicted in the figure to the right.) The same principle applies in our setting. If X and Z are sets of variables and X ⊆ Z, then V(Z) ↪ V(X), by restriction.
It follows, for example, that any mechanism by which X determines (Y, Y′) can be viewed as a mechanism by which (X, X′) determines Y. The general phenomenon is captured by the following definition.

Definition 6. If A = {S_a → T_a}_{a∈A}, A′ = {S′_a → T′_a}_{a∈A′}, and there is an injective map ι : A′ → A such that T′_a ⊆ T_{ι(a)} and S_{ι(a)} ⊆ S′_a for all a ∈ A′, then A′ is a weakening of A (written A ⪰ A′).

Proposition 13. If A ⪰ A′ and µ |= A, then µ |= A′.

Proposition 13 is strictly stronger than the simple monotonicity mentioned at the beginning of the section, because a hyperarc with no targets is vacuous, so removing all targets of a hyperarc is equivalent to deleting it. It also explains why BNs and MRFs are arguably anti-monotonic: adding an edge X → Y to a graph G means adding X to the sources of the hyperarc whose target is Y in A_G. As mentioned in the main body of the paper, the far more important consequence of this result is that it helps us begin to understand what QIM-compatibility means for cyclic hypergraphs. For the reader's convenience, we now restate the examples in the main text, which are really about monotonicity.

Example 3. Every µ(X, Y) is compatible with the 2-cycle {X → Y, Y → X}. This is because this cycle is weaker than a hypergraph that can already represent any distribution: {X → Y, Y → X} is a weakening of A_{X→Y} = {∅ → X, X → Y}, obtained by growing the source of the arc ∅ → X.

Example 4. Which µ(X, Y, Z) are compatible with the 3-cycle X → Y → Z → X, shown on the right? By monotonicity, among them must be all distributions compatible with a linear chain X → Y → Z. Thus, any distribution in which two of the variables are conditionally independent given the third is compatible with the 3-cycle. Are there any distributions that are not compatible with this hypergraph? It is not obvious. We return to this in Section 4.

Because QIM-compatibility applies to cyclic structures, one might wonder if it also captures the independencies of undirected models. Undirected edges A–B are commonly identified with a (cyclic) pair of directed edges {A → B, B → A}, as we have implicitly done in defining A_G. In this way, undirected graphs, too, naturally correspond to directed hypergraphs. For example, G = A–B–C corresponds to the hypergraph A_G shown on the left. Compatibility with A_G, however, does not coincide with any of the standard Markov properties corresponding to G [13]. This may appear to be a flaw in Definition 2 (QIM-compatibility), but it is unavoidable: while both BNs and MRFs are monotonic, it is impossible to capture both classes with a monotonic definition.

Theorem 14. It is possible to define a relation |= between distributions µ and directed hypergraphs A satisfying any two, but not all three, of the following.
(monotonicity) If µ |= A and A ⪰ A′, then µ |= A′.
(positive BN capture) If µ satisfies the independencies I(G) of a dag G, then µ |= A_G.
(negative MRF capture) If µ |= A_G for an undirected graph G, then µ has one of the Markov properties with respect to G.

The proof is a direct and easy-to-visualize application of monotonicity (Proposition 13). Assume monotonicity and positive BN capture. Let µ_xor(A, B, C) be the joint distribution in which A and C are independent fair coins, and B = A ⊕ C is their parity. Since µ_xor satisfies the independencies of the dag A → B ← C, positive BN capture gives µ_xor |= A_{A→B←C}; and since A_{A–B–C} is a weakening of A_{A→B←C} (grow the sources of the arcs ∅ → A and ∅ → C to B), monotonicity then gives µ_xor |= A_{A–B–C}. But µ_xor does not satisfy A ⊥ C | B, which every Markov property of A–B–C requires, so negative MRF capture fails.
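The distribution µ_xor in this argument is small enough to check by hand, but here is a short numerical verification as well (the encoding of the distribution is ours): it confirms that A ⊥ C holds, so µ_xor satisfies I(A → B ← C), while A ⊥ C | B fails, so µ_xor has none of the Markov properties of A–B–C.

```python
import itertools

# mu_xor: A, C independent fair coins, B = A xor C (our tuple encoding).
mu = {(a, a ^ c, c): 0.25 for a, c in itertools.product([0, 1], repeat=2)}

def marg(idxs):
    out = {}
    for s, p in mu.items():
        key = tuple(s[i] for i in idxs)
        out[key] = out.get(key, 0.0) + p
    return out

# A _|_ C: the (A, C) marginal is the product of its marginals.
pA, pC, pAC = marg((0,)), marg((2,)), marg((0, 2))
assert all(abs(pAC[(a, c)] - pA[(a,)] * pC[(c,)]) < 1e-12
           for a, c in itertools.product([0, 1], repeat=2))

# A _|_ C | B fails: given B = 0 we have C = A, so P(A=0, C=1 | B=0) = 0,
# while P(A=0 | B=0) * P(C=1 | B=0) = 1/4.
pB0 = marg((1,))[(0,)]
p_joint = mu.get((0, 0, 1), 0.0) / pB0
p_prod = ((mu.get((0, 0, 0), 0.0) + mu.get((0, 0, 1), 0.0)) / pB0
          * (mu.get((0, 0, 1), 0.0) + mu.get((1, 0, 1), 0.0)) / pB0)
print(p_joint, p_prod)   # 0.0 0.25 -- A and C are dependent given B
```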
We emphasize that Theorem 14 has implications for the qualitative semantics of any graphical model (even if one were to reject the definition of QIM-compatibility). We now look into the implications for some lesser-known graphical models, which may appear not to comply with Theorem 14.

Dependency Networks. To readers familiar with dependency networks (DNs) [8], Theorem 14 may raise some conceptual issues. When G is an undirected graph, A_G is the structure of a consistent DN. The semantics of such a DN, which intuitively describe an independent mechanism on each hyperarc, coincide with the MRFs for G (at least for positive distributions). In more detail, DN semantics are given by the fixed point of a Markov chain that repeatedly generates independent samples along the hyperarcs of A_G, for some (typically cyclic) directed graph G. The precise definition requires an order in which to do the sampling. Although this choice doesn't matter for the consistent DNs that represent MRFs, it does in general. With a fixed sampling order, the DN semantics is monotonic and captures MRFs, but can represent only those BNs for which that order is a topological sort.

C Information Theory, PDGs, and QIM-Compatibility

C.1 More Detailed Primer on Information Theory

We now expand on the fundamental information quantities introduced at the beginning of Section 4. Let µ be a probability distribution, and let X, Y, Z be (sets of) discrete random variables. The entropy of X is the uncertainty in X, when it is distributed according to µ, as measured by the number of bits of information needed (in expectation) to determine it, if the distribution µ is known. It is given by

H_µ(X) := Σ_{x ∈ V(X)} µ(X=x) log( 1 / µ(X=x) ) = E_µ[ −log µ(X) ],

and it has a few very important properties; chief among them, H_µ(X) is non-negative, and equal to zero iff X is a constant according to µ. The joint entropy H(X, Y) is just the entropy of the combined variable (X, Y), whose values are pairs (x, y) for x ∈ V(X), y ∈ V(Y); this is the same as the entropy of the variable X ∪ Y when X and Y are themselves sets of variables. The conditional entropy of Y given X measures the uncertainty present in Y if one knows the value of X (think: the information in Y but not X), and is equivalently defined as any of the following three quantities:

H_µ(Y | X) := E_µ[ log( 1 / µ(Y | X) ) ] = H_µ(X, Y) − H_µ(X) = E_{x∼µ(X)}[ H_{µ|X=x}(Y) ].

The mutual information I(X; Y), and its conditional variant I(X; Y | Z), are given, respectively, by

I_µ(X; Y) := E_µ[ log( µ(X, Y) / (µ(X) µ(Y)) ) ]  and  I_µ(X; Y | Z) := E_µ[ log( µ(X, Y, Z) µ(Z) / (µ(X, Z) µ(Y, Z)) ) ].

The former is non-negative and equal to zero iff µ |= X ⊥ Y, and the latter is non-negative and equal to zero iff µ |= X ⊥ Y | Z. All of these quantities are purely structural, or qualitative, in the sense that they are invariant to relabelings of values.

Just as conditional entropy can be written as a linear combination of unconditional entropies, so too can conditional mutual information be written as a linear combination of unconditional mutual informations: I(X; Y | Z) = I(X; (Y, Z)) − I(X; Z). Thus the conditional quantities are easily derived from the unconditional ones. At the same time, the unconditional versions are clearly special cases of the conditional ones; for example, H_µ(X) is the special case of H_µ(X | Z) in which Z is a constant (e.g., Z = ∅). Furthermore, entropy and mutual information are also interdefinable, each generated by linear combinations of the other. It is easy to verify that I_µ(X; Y) = H_µ(X) + H_µ(Y) − H_µ(X, Y) and I_µ(X; Y | Z) = H_µ(X | Z) + H_µ(Y | Z) − H_µ(X, Y | Z), and thus mutual information is derived from entropy. Yet, on the other hand, I_µ(Y; Y) = H_µ(Y) and I_µ(Y; Y | X) = H_µ(Y | X); thus entropy is a special case of mutual information.
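These identities are mechanical to verify; the short script below (the helper functions are ours) checks, on an arbitrary positive joint distribution over three binary variables, that the expectations defining I(X; Y) and I(X; Y | Z) agree with the corresponding linear combinations of entropies.

```python
import itertools
from math import log2

weights = [3, 1, 4, 1, 5, 9, 2, 6]      # an arbitrary positive distribution
mu = {s: w / sum(weights)
      for s, w in zip(itertools.product([0, 1], repeat=3), weights)}

def marg(idxs):
    m = {}
    for s, p in mu.items():
        key = tuple(s[i] for i in idxs)
        m[key] = m.get(key, 0.0) + p
    return m

def H(idxs):
    return -sum(p * log2(p) for p in marg(idxs).values() if p > 0)

# I(X;Y) from its defining expectation, vs. H(X) + H(Y) - H(X,Y):
pX, pY = marg((0,)), marg((1,))
mi = sum(p * log2(p / (pX[(s[0],)] * pY[(s[1],)]))
         for s, p in marg((0, 1)).items())
assert abs(mi - (H((0,)) + H((1,)) - H((0, 1)))) < 1e-9

# I(X;Y|Z) from its defining expectation, vs. H(X,Z)+H(Y,Z)-H(X,Y,Z)-H(Z):
pXZ, pYZ, pZ = marg((0, 2)), marg((1, 2)), marg((2,))
cmi = sum(p * log2(p * pZ[(s[2],)] / (pXZ[(s[0], s[2])] * pYZ[(s[1], s[2])]))
          for s, p in mu.items())
assert abs(cmi - (H((0, 2)) + H((1, 2)) - H((0, 1, 2)) - H((2,)))) < 1e-9
print("identities verified")
```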
C.2 Structural Deficiency: More Motivation, and Examples

To build intuition for IDef, which characterizes our bounds in Section 4, we now visualize the vector v_A for various example hypergraphs. Subfigures 2a, 2b, and 2c show how adding hyperarcs makes distributions more deterministic. When A is the empty hypergraph, IDef reduces to negative entropy, and so prefers distributions that are maximally uncertain (e.g., Subfigures 2a and 2d). For this empty A, all distributions µ have IDef_A(µ) ≤ 0. In the definition of IDef, each hyperarc X → Y is compiled to a cost H(Y|X) for uncertainty in Y given X. One can see this visually in Figure 2, as a red crescent that is added to the information profile as we move from 2d to 2e to 2f.

[Figure 2: Illustrations of the structural deficiency IDef_A, drawn underneath the associated hypergraphs. Each circle represents a variable; an area in the intersection of circles {C_j} but outside of circles {D_k} corresponds to information that is shared between all of the C_j's, but not in any D_k. Variation of a candidate distribution µ in a green area makes its qualitative fit better (according to IDef), while variation in a red area makes its qualitative fit worse; grey is neutral. Only the boxed structures in blue, whose IDef can be seen as measuring distance to a particular set of (conditional) independencies, are expressible as BNs.]

Some hypergraphs (see Subfigures 2b and 2h) are indiscriminate, in the sense that every distribution gets the same score (of zero, because a point mass δ always has IDef_A(δ) = 0). Such a graph has a structure such that any distribution can be precisely encoded by the process in (b). As shown here, and also in Richardson and Halpern [23], IDef can also indicate independencies and conditional independencies, illustrated respectively in Subfigures 2g and 2i. For more complex structures, the structural information deficiency IDef can represent more than independence and dependence. The cyclic structures in Examples 3 and 4 correspond to the structural deficiencies pictured in Subfigures 2f and 2j, respectively, which are functions that encourage shared information between the three variables.

C.3 Counter-Examples to the Converse of Theorem 7

In light of Example 6 and its connections to IDef through Theorem 7, one might hope this criterion is not just a bound, but a precise characterization of the distributions that are QIM-compatible with the 3-cycle. Unfortunately, it is not: the converse of Theorem 7 is false.

Example 7. Suppose µ(X, Y, Z) = Unif(X, Z) δ_id(Y | X) and A = {∅ → X, ∅ → Y}, where all variables are binary. Then IDef_A(µ) = 0, satisfying the bound of Theorem 7; but X and Y are not independent (Y is a copy of X), and compatibility with A would require X ⊥ Y, so µ is not QIM-compatible with A.

Here is another counter-example, of a very different kind.

Example 8. Suppose A, B, C are binary variables. It can be shown by enumeration (see appendix) that no distribution supported on seven of the eight possible joint settings of V(A, B, C) can be QIM-compatible with the 3-cycle A₃. Yet it is easy to find examples of such distributions µ that have positive interaction information I(A; B; C), and thus IDef_{A₃}(µ) ≤ 0 for such distributions.

D QIM-Compatibility Constructions and Counterexamples

We now give a counterexample to a simpler, previously conjectured strengthening of Theorem 2, in which part (a) is an if-and-only-if. This may be surprising. In the unconditional case, it is true that two arcs {∅ →₁ X, ∅ →₂ X} precisely encode that X is a constant, as illustrated by Example 2. The following, slightly more general result is an immediate corollary of Theorem 2(c).

Proposition 15. µ |= A ∪ {∅ →₁ X, ∅ →₂ X} if and only if µ |= A and µ |= ∅ → X.
One can be forgiven for imagining that the conditional case would be analogous: that QIM-compatibility with a hypergraph that has two parallel arcs from X to Y would imply that Y is a function of X. But this is not the case. Furthermore, our counterexample also shows that neither of the two properties we consider in the main text (requiring that A be partitional, or requiring that the QIM-compatibility with µ be even) is enough to ensure this. That is, there are partitional hypergraphs containing two parallel arcs X →₁ Y and X →₂ Y with which µ is evenly compatible, yet in which Y is not a function of X.

Example 9. We construct a witness of QIM-compatibility, in which Y is not a function of X, for the hypergraph A consisting of one arc Y → X and n parallel arcs X → Y; for n = 3, this disproves the analogue of Theorem 2 for the partitional context A equal to the 2-cycle. Let (U_0, U_1, ..., U_n) be n + 1 mutually independent fair coins, and let A be one more independent fair coin. For notational convenience, define the random vector U⃗ := (U_1, ..., U_n), consisting of all of these variables except U_0. Then define variables X and Y according to:

X := (A ⊕ U_1, ..., A ⊕ U_n, U_0 ⊕ U_1, ..., U_0 ⊕ U_n) = (A ⊕ U⃗, U_0 ⊕ U⃗)
Y := (A, U_0 ⊕ U_1, ..., U_0 ⊕ U_n) = (A, U_0 ⊕ U⃗),

where ⊕ is exclusive-or, and the operation Z ⊕ U⃗ is element-wise xor (i.e., addition in F₂ⁿ), after implicitly converting the scalar Z to a vector by taking n copies of it. Call the resulting distribution ν(X, Y, U). We now show that ν witnesses that its marginal on (X, Y) is QIM-compatible with A, which is straightforward.

(b) U = (U_0, ..., U_n) are mutually independent by assumption.
(c.0) Y = (A, B⃗) and U_0 together determine X, according to:

g(A, B⃗, U_0) := (A ⊕ U_0 ⊕ B⃗, B⃗) = (A ⊕ U_0 ⊕ U_0 ⊕ U⃗, U_0 ⊕ U⃗)    [since B⃗ = U_0 ⊕ U⃗]
             = (A ⊕ U⃗, U_0 ⊕ U⃗) = X.

(c.1–n) For i ∈ {1, ..., n}, U_i and X = (V⃗, B⃗) together determine Y, according to:

f_i(V⃗, B⃗, U_i) := (V_i ⊕ U_i, B⃗) = (A ⊕ U_i ⊕ U_i, U_0 ⊕ U⃗) = (A, U_0 ⊕ U⃗) = Y.

In addition, this distribution ν(U, X, Y) satisfies condition (d):

ν(X, Y | U) = (1/2) 1[g(Y, U_0) = X] ∏_{i=1}^n 1[f_i(X, U_i) = Y],

since, for every joint setting of U, there are two possible values of (X, Y), corresponding to the two values of A, and each occurs with probability 1/2. Thus, we have constructed a distribution witnessing the fact that µ(X, Y) |=ᵉ A. Yet observe that X alone does not determine Y in this distribution, because X alone is not enough to determine A (without also knowing some U_i).

For those who are interested, observe that the bound of Theorem 7 tells us that we must satisfy

0 ≥ IDef_A(µ) = −H_µ(X, Y) + n H_µ(Y | X) + H_µ(X | Y) = −I_µ(X; Y) + (n − 1) H_µ(Y | X).

Indeed, this distribution has information profile H(X | Y) = 1 bit, I(X; Y) = n bits, and H(Y | X) = 1 bit, and so IDef_A(µ) = −1 bit. Intuitively, this one missing bit corresponds to the value of the coin A, which is not determined by the structure of A.
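Example 9 is also easy to verify mechanically. The following sketch (our encoding, for n = 3) enumerates the joint distribution, confirms the stated information profile H(X | Y) = H(Y | X) = 1 and I(X; Y) = n, and spot-checks the determinations used in the witness conditions.

```python
import itertools
from math import log2

n = 3
nu = {}
for bits in itertools.product([0, 1], repeat=n + 2):
    a, u0, u = bits[0], bits[1], bits[2:]       # the coins A, U_0, U_1..U_n
    x = tuple(a ^ ui for ui in u) + tuple(u0 ^ ui for ui in u)  # (A+U, U0+U)
    y = (a,) + tuple(u0 ^ ui for ui in u)                       # (A, U0+U)
    nu[(bits, x, y)] = 2.0 ** -(n + 2)

def H(proj):
    """Entropy of a projection of the joint distribution."""
    m = {}
    for k, p in nu.items():
        m[proj(k)] = m.get(proj(k), 0.0) + p
    return -sum(p * log2(p) for p in m.values())

HX, HY = H(lambda k: k[1]), H(lambda k: k[2])
HXY = H(lambda k: (k[1], k[2]))
print(HXY - HX, HXY - HY, HX + HY - HXY)  # H(Y|X)=1.0, H(X|Y)=1.0, I(X;Y)=3.0

# Witness condition (c): (X, U_1) determines Y (so Y is not a function of X
# only because A is hidden), and (Y, U_0) determines X.
f1, g = {}, {}
for (bits, x, y), p in nu.items():
    assert f1.setdefault((x, bits[2]), y) == y   # bits[2] is U_1
    assert g.setdefault((y, bits[1]), x) == x    # bits[1] is U_0
print("determinations hold")
```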
E From Causal Models to Witnesses

We now return to the easy direction of the correspondence between QIM-compatibility witnesses and causal models, mentioned at the beginning of Section 3.2. Given a (generalized) randomized PSEM M, we now show that every distribution ν ∈ {{M}} is a QIM-compatibility witness, showing that the appropriate marginal of ν is QIM-compatible with the hypergraph A_M. More formally:

Proposition 16. If M = (M = (U, V, F), P) is a randomized PSEM, then every ν ∈ {{M}} witnesses the QIM-compatibility of its marginal on the endogenous variables with the dependency structure of M. That is, for all ν ∈ {{M}}, ν(V) |= A_M.

The proof is straightforward: by definition, if ν ∈ {{M}}, then it must satisfy the equations, and so it automatically fulfills condition (c). Condition (a) is also satisfied trivially, by assumption: the distribution we are considering is defined to be a marginal of ν. Finally, (b) is also satisfied by construction: we assumed that U_A = {U_a}_{a∈A} are independent.