# Entropic Causal Inference

Murat Kocaoglu, Alexandros G. Dimakis, Sriram Vishwanath (Department of Electrical and Computer Engineering, The University of Texas at Austin, USA) and Babak Hassibi (Department of Electrical Engineering, California Institute of Technology, USA)

**Abstract.** We consider the problem of identifying the causal direction between two discrete random variables using observational data. Unlike previous work, we keep the most general functional model but make an assumption on the unobserved exogenous variable: inspired by Occam's razor, we assume that the exogenous variable is simple in the true causal direction. We quantify simplicity using Rényi entropy. Our main result is that, under natural assumptions, if the exogenous variable has low $H_0$ entropy (cardinality) in the true direction, it must have high $H_0$ entropy in the wrong direction. We establish several algorithmic hardness results about estimating the minimum-entropy exogenous variable. We show that the problem of finding the exogenous variable with minimum $H_1$ entropy (Shannon entropy) is equivalent to the problem of finding the minimum joint entropy given n marginal distributions, also known as the minimum entropy coupling problem. We propose an efficient greedy algorithm for the minimum entropy coupling problem that, for n = 2, provably finds a local optimum. This gives a greedy algorithm for finding the exogenous variable with minimum Shannon entropy. Our greedy entropy-based causal inference algorithm has performance similar to state-of-the-art additive noise models on real datasets. One advantage of our approach is that we make no use of the values of the random variables, only their distributions. Our method can therefore be used for causal inference on both ordinal and categorical data, unlike additive noise models.

## 1 Introduction

Causality has been studied under several frameworks, including potential outcomes (Rubin 1974) and structural equation modeling (Pearl 2009). Under the Pearlian framework (Pearl 2009), it is possible to discover some causal directions between variables using only observational data with conditional independence tests. The PC algorithm (Spirtes, Glymour, and Scheines 2001) and its variants fully characterize which causal directions can be learned in the general case. For large graphs, the GES algorithm (Chickering 2002) provides a score-based test to greedily identify the highest-scoring causal graph given the data. Unfortunately, these approaches do not guarantee the recovery of the true causal direction between every pair of variables, since the data could typically have been generated by several statistically equivalent causal graphs.

A general solution to the causal inference problem is to conduct experiments, also called interventions. An intervention forces the value of a variable without affecting the other system variables. This removes the effect of its causes, effectively creating a new causal graph. These changes in the causal graph create a post-interventional distribution among the variables, which can be used to identify some additional causal relations in the original graph. The procedure can be applied repeatedly to fully identify any causal graph (Hauser and Bühlmann 2012a; Hauser and Bühlmann 2012b; Hyttinen, Eberhardt, and Hoyer 2013; Shanmugam et al. 2015).
For many problems, it can be very difficult to create interventions, since they require additional experiments after the original data collection. Researchers would still like to discover causal relations between variables using only observational data, so-called data-driven causality. Several recent works (Chen et al. 2014; Shajarisales et al. 2015) have developed such methods. To be able to draw any conclusions about causal directions in this case, additional assumptions must be made about the mechanisms that generate the data.

In this paper we focus on the simplest causal discovery problem, which involves only two variables. The two causal graphs X → Y and Y → X are statistically indistinguishable, so conditional independence tests cannot make any causal inference from observational data without interventions. Statistical indistinguishability follows easily from the fact that any joint distribution on two variables p(x, y) can be factorized both as p(x)p(y|x) and as p(y)p(x|y). The most popular assumption for two-variable data-driven causality is the additive noise model (ANM) (Shimizu et al. 2006). In an ANM, any outside factor is assumed to affect the effect variable additively, which leads to the equation Y = f(X) + E with E ⊥ X. Although restrictive, this assumption leads to strong theoretical guarantees in terms of identifiability and provides state-of-the-art accuracy on real datasets. (Shimizu et al. 2006) showed that if f is linear and the noise is non-Gaussian, the causal direction is identifiable. (Hoyer et al. 2008) showed that when f is non-linear, irrespective of the noise, identifiability holds for a non-adversarial choice of system parameters. (Peters, Janzing, and Schölkopf 2011) extended ANMs to discrete variables.

Another approach is to exploit the postulate that the cause and the mechanism are, in general, independently assigned by nature. The notion of independence here is vague, and one needs to assign maps, or conditional distributions, to random variables in order to argue about the independence of cause and mechanism. In this direction, an information-geometry-based approach has been suggested (Janzing et al. 2012). Independence of cause and mechanism is captured by treating the log-slope of the function as a random variable and assuming that it is independent of the cause. In the case of a deterministic relation Y = f(X), there are theoretical guarantees on identifiability. However, this assumption is restrictive for real data.

Previous work exploited these two ideas, additive noise and independence of cause and mechanism, to draw data-driven causal conclusions about problems in a diverse range of areas from astronomy to neuroscience (Shajarisales et al. 2015), (Schölkopf et al. 2015). (Shajarisales et al. 2015) uses the same idea that cause and effect are independent for the time series of a linear filter, and suggests the spectral independence criterion, which is robust to time shifts. (Chen et al. 2014) uses kernel-space embeddings with the assumption that the cause distribution p(x) and the mechanism p(y|x) are selected independently in order to distinguish cause from effect. An exception to these frameworks is to train a binary classifier on the joint distribution of (X, Y) to classify distributions into those that come from the causal model X → Y and those from Y → X (Lopez-Paz et al. 2015; Lopez-Paz, Muandet, and Recht 2015; Lopez-Paz and Oquab 2016).
However, it is not clear what the correct set of assumptions is for showing an identifiability result with this approach. As noted by (Chen et al. 2014), although conceptually proposed before, using the Kolmogorov complexity of the factorizations of the joint distribution, p(y|x)p(x) and p(x|y)p(y), as a criterion for deciding the causal direction has not been used successfully until now.

The use of information theory as a tool for causal discovery is currently gaining increasing attention, through different approaches: for time-series data, for example, Granger causality and directed information can be used (Granger 1969; Etesami and Kiyavash 2016; Quinn, Kiyavash, and Coleman 2015; Kontoyiannis and Skoularidou 2016); see also (MCI 2016). However, researchers have not used entropy as a measure of simplicity in the causal discovery literature, probably because the entropies H(Y|X) and H(X|Y) do not give us any more information than H(X) and H(Y), due to the symmetry $H(Y) + H(X|Y) = H(X) + H(Y|X)$. In our work, as we will explain, we minimize H(E), which initially sounds similar to, but is fundamentally different from, H(Y|X). Entropy has found some additional uses in the causality literature recently: in (Gao et al. 2016), the authors use the maximum mutual information between X and Y to quantify the causal strength of a known causal graph.

The work that is closest to ours in spirit is (Mooij et al. 2010), which also drops the additive noise assumption. Their approach and setup are different in many ways: the authors work with continuous data, and to handle this generic form they have to make strong assumptions on the exogenous variable, the function, and the distribution of the cause. Mooij et al. assume a standard Gaussian exogenous variable, a Gaussian mixture prior for the cause, and a Gaussian process prior for the function.

### 1.1 Our Contributions

In this paper, we propose a novel approach to the causal identifiability problem for discrete variables. Similar to (Mooij et al. 2010), we keep the most general functional model but only place an assumption on the exogenous (background) variable. Based on Occam's razor, we employ a simplicity assumption on the unobserved exogenous variable. We use Rényi entropy, defined for a random variable X with state probabilities $p_i$ as
$$H_a(X) = \frac{1}{1-a} \log \Big( \sum_i p_i^a \Big).$$
We focus on two special cases of Rényi entropy: $H_0$, which corresponds to the logarithm of the number of states, and $H_1$, which corresponds to Shannon entropy, but our framework can be extended. Specifically, if the true causal direction is X → Y, then the random variable Y is an arbitrary function of X and an exogenous variable E, i.e., Y = f(X, E), where E is independent of the cause X. Our key assumption is that the exogenous variable E is simple, i.e., has low Rényi entropy. The postulate is that for any model in the wrong direction, $X = \tilde{f}(Y, \tilde{E})$, the exogenous variable $\tilde{E}$ has high Rényi entropy. We are able to prove this result for the $H_0$ special case of Rényi entropy, assuming generic distributions for X, Y. Furthermore, we empirically show that using $H_1$ (Shannon entropy) we obtain practical causality tests that work with high probability on synthetic datasets and that slightly outperform the previous state of the art on real datasets. Our assumption is an entropic interpretation of Occam's razor, motivated by what E represents in the causal model.
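As a concrete illustration of these definitions, the following minimal sketch (using NumPy; the function name `renyi_entropy` is illustrative, not from the paper's implementation) computes $H_a$ for a probability vector, with $H_0$ and $H_1$ handled as the limiting cases:

```python
import numpy as np

def renyi_entropy(p, a):
    """Rényi entropy H_a of a probability vector p, in bits.

    H_0 is the log of the support size and H_1 is Shannon entropy; both arise
    as limits of H_a(X) = 1/(1-a) * log2(sum_i p_i^a).
    """
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # drop zero-probability states
    if a == 0:                          # H_0: log-cardinality of the support
        return np.log2(len(p))
    if a == 1:                          # H_1: Shannon entropy (limit a -> 1)
        return -np.sum(p * np.log2(p))
    return np.log2(np.sum(p ** a)) / (1.0 - a)

# Example: a distribution on 4 states that is "simple" in the H_1 sense
p = [0.94, 0.02, 0.02, 0.02]
print(renyi_entropy(p, 0))   # H_0 = log2(4) = 2 bits
print(renyi_entropy(p, 1))   # H_1 ≈ 0.42 bits
```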
The exogenous variable captures the combined effect of all the variables not included in the system model that affect the distribution of Y. Our causal assumption can be stated as "there should not be too much complexity not included in the causal model." For a → 1, i.e., Shannon entropy, $H(X) + H(E)$ and $H(Y) + H(\tilde{E})$ are the numbers of random bits required to generate the input of the causal models X → Y and Y → X, respectively. The simplest explanation of an observed joint distribution, i.e., the direction that requires nature to generate the smaller number of random bits, is selected as the true causal model. More precisely, we have the following:

**Assumption 1.** The entropy of the exogenous variable E is small in the true causal direction.

The notions of simplicity that we consider are $H_0$, which is log-cardinality, and $H_1$, which is Shannon entropy. One significant advantage of using Shannon entropy as a simplicity metric is that it can be estimated more robustly in the presence of measurement errors, unlike the cardinality $H_0$. We prove an identifiability result for $H_0$ entropy, i.e., the cardinality of E: if the probability values are not adversarially chosen, for most functions the true causal direction is identifiable under Assumption 1. Based on experimental evidence, we conjecture that a similar identifiability result holds for Shannon entropy $H_1$.

To use our framework, we need algorithms that explain a dataset by finding an exogenous variable E with minimum cardinality ($H_0$) and with minimum Shannon entropy ($H_1$). Since the entropies of X and Y can be very different, any metric for determining the true causal direction cannot consider only the entropy of the exogenous variable without incorporating the entropy of the cause. We therefore explain the exogenous variable in both directions and declare the causal direction to be the one with the smaller total entropy, $H_a(X) + H_a(E)$ versus $H_a(Y) + H_a(\tilde{E})$. Our method can be applied with any Rényi entropy $H_a$, but in this paper we only use a = 0 and a = 1. Unfortunately, minimizing $H_0(E)$ seems very hard for real datasets, since it offers no noise robustness. For Shannon entropy we can do much better on real data. The first step toward a practical algorithm is showing that the minimum-$H_1$ explanation is equivalent to the following problem: for n random variables with given marginal distributions, find a joint distribution with minimum Shannon entropy that is consistent with the given marginals. This problem is called minimum Shannon entropy coupling and is known to be NP-hard (Kovacevic, Stanojevic, and Senk 2012). We propose a greedy approximation algorithm for this problem that empirically performs very well. We also prove that, for n = 2, our algorithm always produces a local minimum.

In summary, our contributions in this paper include the following (for a version with proofs, see https://arxiv.org/abs/1611.04035):

- We show identifiability for generic low-entropy causal models under Assumption 1 with $H_0$.
- We show that the problems of identifying the minimum-cardinality ($H_0$) exogenous variable and identifying the minimum-Shannon-entropy ($H_1$) exogenous variable given a joint distribution are both NP-hard.
- We design a novel greedy algorithm for the minimum entropy coupling problem, which turns out to be equivalent to the problem of finding the exogenous variable with minimum $H_1$ entropy.
- We empirically validate the conjecture that the causal direction is identifiable under Assumption 1 with $H_1$, using experiments on synthetic datasets.
- We empirically show that our causal inference algorithm based on Shannon entropy minimization performs slightly better than the best existing algorithms on a real causal dataset. Interestingly, our algorithm uses only the probability distributions rather than the actual values of the random variables, and hence is applicable to categorical variables.

### 1.2 Background and Notation

A tuple $\mathcal{M} = (\mathcal{X}, \mathcal{U}, \mathcal{F}, D, p)$ is a causal model when (1) $\mathcal{F} = \{f_i\}$ is a set of deterministic functions; (2) $\mathcal{X} = \{X_i\}$ is the set of endogenous (observed) variables and $\mathcal{U} = \{U_i\}$ is the set of exogenous (latent) variables, with $X_i = f_i(\mathrm{Pa}_i, U_i)$ for all i, where $\mathrm{Pa}_i$ are the endogenous parents and $U_i$ is the exogenous parent of $X_i$ in the directed acyclic graph D; (3) the variables in $\mathcal{U}$ are mutually independent with respect to p. The observable variable set $\mathcal{X}$ has a joint distribution implied by the distributions of $\mathcal{U}$ and the functional relations $f_i$; D is then a Bayesian network for the induced joint distribution of the endogenous variables. A standard assumption of Pearl's model, causal sufficiency, is also used here: every exogenous variable is a direct parent of at most one endogenous variable.

In this paper, we consider a simple two-variable causal system that contains only two endogenous variables X, Y. Assume X causes Y, which is represented as X → Y. The model is then determined by a single exogenous variable E and a function f, with Y = f(X, E). The probability distributions of X and E, together with f, determine the distribution of Y. This model is denoted by the tuple $\mathcal{M} = (\{X, Y\}, E, f, X \to Y, p_{X,E})$. Notice that we do not assign an exogenous variable to X, since it is the source node in the graph.

We denote the set $\{1, 2, \ldots, n\}$ by [n]. A sum $\sum_i x_i$ runs over every possible index, and log refers to the logarithm base 2. For two variables X, Y, $\mathbf{P}_{Y|X}$ and $\mathbf{P}_{X|Y}$ denote the conditional probability distribution matrices, i.e., $\mathbf{P}_{Y|X}(i, j) = p(y = i \mid x = j)$ and $\mathbf{P}_{X|Y}(i, j) = p(x = i \mid y = j)$. The statistical independence of two random variables X and E is denoted by X ⊥ E. For notational convenience, the probability distribution of a random variable X is written as p(x) as well as $p_X(x)$; $\mathbf{x}$ denotes the distribution of X in vector form, i.e., $\mathbf{x}_i = \mathbf{x}(i) = P(X = i)$. The (n−1)-simplex is the set of points $\mathbf{x}$ in n-dimensional Euclidean space that satisfy $\sum_i \mathbf{x}(i) = 1$. card(·) denotes the cardinality of a set.

## 2 Causal Model with Minimum Cardinality Exogenous Variable

Consider the causal model $\mathcal{M} = (\{X, Y\}, E_0, f_0, X \to Y, p_{X,E_0})$. The task is to identify the underlying causal graph X → Y using independent, identically distributed samples $\{(x_i, y_i)\}_i$. Assuming causal sufficiency, this task reduces to deciding whether X causes Y or Y causes X. To isolate the identifiability problem from estimation errors due to finite samples, we assume that the joint distribution of (X, Y) is available. Most proofs are deferred to the Appendix.

One way to establish that X causes Y would be to show that, although there exist a function f and a random variable E with Y = f(X, E) and X ⊥ E, there is no function-variable pair $(g, \tilde{E})$ such that $X = g(Y, \tilde{E})$ with $Y \perp \tilde{E}$. However, without more assumptions this is not possible: for any joint distribution one can find valid causal models for both X → Y and Y → X. This is widely known, although for completeness we provide a proof (Lemma 4 in the Appendix). Even when the true causal graph is known, one can create different constructions of (f, E) with Y = f(X, E) and X ⊥ E; there is no way to distinguish the true causal model among them.
However, even though we cannot recover the actual function and the exogenous variable, we can still show identifiability. First, we give an equivalent characterization of a causal model on two variables.

**Definition 1 (Block Partition Matrices).** Consider a matrix $\mathbf{M} \in \{0, 1\}^{n^2 \times m}$. Let $\mathbf{m}_{i,j}$ denote the $(i + (j-1)n)$-th row of $\mathbf{M}$, and let $S_{i,j} = \{k \in [m] : \mathbf{m}_{i,j}(k) \neq 0\}$. $\mathbf{M}$ is called a block partition matrix if it belongs to
$$\mathcal{C} := \Big\{\mathbf{M} \in \{0,1\}^{n^2 \times m} : \bigcup_{i \in [n]} S_{i,j} = [m], \; S_{i,j} \cap S_{l,j} = \emptyset \ \ \forall i \neq l \Big\}.$$
$\mathcal{C}$ thus consists of the 0–1 matrices with $n^2$ rows and m columns in which each block of n rows corresponds to a partition of the set [m]. We make the following key observation:

**Lemma 1.** Given discrete random variables X, Y with distribution p(x, y), a causal model $\mathcal{M} = (\{X, Y\}, E, f, X \to Y, p_{X,E})$ with E ⊥ X and card(E) = m exists if and only if there exist $\mathbf{M} \in \mathcal{C}$ and $\mathbf{e} \in \mathbb{R}_+^m$ with $\sum_i \mathbf{e}(i) = 1$ satisfying $\mathrm{vec}(\mathbf{P}_{Y|X}) = \mathbf{M}\mathbf{e}$.

In other words, the existence of a causal model for X → Y is equivalent to the existence of a block partition matrix $\mathbf{M}$ and a vector $\mathbf{e}$ of the proper dimensions with $\mathrm{vec}(\mathbf{P}_{Y|X}) = \mathbf{M}\mathbf{e}$. For simplicity, assume $|\mathcal{X}| = |\mathcal{Y}| = n$; we later remove this constraint. We first show that any joint distribution can be explained using a variable E with n(n−1)+1 states.

**Lemma 2 (Upper Bound on Minimum Cardinality of E).** Let $X \in \mathcal{X}$, $Y \in \mathcal{Y}$ be two random variables with joint probability distribution $p_{X,Y}(x, y)$, where $|\mathcal{X}| = |\mathcal{Y}| = n$. Then there exists a causal model Y = f(X, E) with X ⊥ E that induces $p_{X,Y}$, where E has support size n(n−1)+1.

We can show that, if the columns of $\mathbf{P}_{Y|X}$ are uniformly sampled points in the (n−1)-dimensional simplex, then n(n−1) states are also necessary for E (see Proposition 3 in the Appendix). This shows that, unless designed by nature through the causal mechanism, the exogenous variable cannot have small cardinality. Based on this observation, the hope is to prove that in the wrong causal direction, i.e., when we find an $\tilde{E} \perp Y$ such that $X = g(Y, \tilde{E})$ for some g, the exogenous variable $\tilde{E}$ must have large cardinality. In the next section, we show that this is indeed true under mild conditions on f.

### 2.1 Identifiability for H0 Entropy

In a causal system Y = f(X, E), nature chooses the random variables X, E and the function f, and the conditional probability distributions are then determined by these choices. We are interested in the cardinality of the variable $\tilde{E} \perp Y$ in the wrong causal direction $X = g(Y, \tilde{E})$. Considering $\mathbf{P}_{X|Y}$, we can show that the same lower bound of n(n−1) still holds even though nature now chooses E and X randomly, rather than choosing the columns of $\mathbf{P}_{X|Y}$ directly. A mild assumption on f is needed to avoid degenerate cases (for counterexamples, see the Appendix).

**Definition 2 (Generic Function).** Let Y = f(X, E), where the variables X, Y, E have supports $\mathcal{X}, \mathcal{Y}, \mathcal{E}$, respectively. Let $S_{y,x} = f_x^{-1}(y) \subseteq \mathcal{E}$ be the inverse image of y for the given x, i.e., $S_{y,x} = \{e \in \mathcal{E} : y = f(x, e)\}$. A function f is called "generic" if for each triple $(x_1, x_2, y)$ with $x_1 \neq x_2$ we have $f_{x_1}^{-1}(y) \neq f_{x_2}^{-1}(y)$, and for every pair (x, y) we have $f_x^{-1}(y) \neq \emptyset$.

In other words, f is generic if the y-th row of the $x_1$-th block of the matrix $\mathbf{M}$ in the decomposition $\mathrm{vec}(\mathbf{P}_{Y|X}) = \mathbf{M}\mathbf{e}$ is different from the y-th row of the $x_2$-th block, and both are nonzero. This is not a restrictive condition; for example, if the values p(y|x) are all different, no two rows of $\mathbf{M}$ can be the same. For any given conditional distribution, if the probabilities are perturbed by arbitrarily small continuous noise, the corresponding f will be generic almost surely.
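As a small concrete illustration of Definition 2 and the decomposition of Lemma 1 (a sketch with illustrative function names, not an algorithm used later in the paper), one can build $\mathbf{M}$ directly from a function table and check genericity:

```python
import numpy as np

def block_partition_matrix(f, n, m):
    """Lemma 1 sketch: M[(j)*n + i, k] = 1 iff f(x=j, e=k) = i (0-indexed states).

    Each block of n rows (one block per value of x) partitions the e-states,
    so vec(P_{Y|X}) = M @ e_dist recovers the conditional distribution of Y given X.
    """
    M = np.zeros((n * n, m), dtype=int)
    for j in range(n):          # x state
        for k in range(m):      # e state
            i = f[j][k]         # resulting y state
            M[j * n + i, k] = 1
    return M

def is_generic(f, n, m):
    """Definition 2 sketch: every inverse image f_x^{-1}(y) is nonempty, and the
    inverse images of the same y differ across distinct values of x."""
    inv = [[frozenset(k for k in range(m) if f[j][k] == i) for i in range(n)]
           for j in range(n)]
    nonempty = all(inv[j][i] for j in range(n) for i in range(n))
    distinct = all(inv[j1][i] != inv[j2][i]
                   for i in range(n) for j1 in range(n) for j2 in range(j1 + 1, n))
    return nonempty and distinct

# Toy example: n = 2 states for X and Y, m = 3 states for E
f = [[0, 1, 1],     # f(x=0, e=0..2)
     [0, 0, 1]]     # f(x=1, e=0..2)
e_dist = np.array([0.5, 0.3, 0.2])
M = block_partition_matrix(f, 2, 3)
print(M @ e_dist)          # vec(P_{Y|X}) = [p(y=0|x=0), p(y=1|x=0), p(y=0|x=1), p(y=1|x=1)]
print(is_generic(f, 2, 3)) # True: all inverse images are nonempty and differ across x
```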
We now have the following main identifiability result:

**Theorem 1 (Identifiability).** Consider the causal model $\mathcal{M} = (\{X, Y\}, E_0, f_0, X \to Y, p_{X,E_0})$, where the random variables X, Y have n states, $E_0 \perp X$ has θ states, and $f_0$ is a generic function. If the distributions of X and $E_0$ are uniformly randomly selected from the (n−1)- and (θ−1)-simplices, then with probability 1, any $\tilde{E} \perp Y$ that satisfies $X = g(Y, \tilde{E})$ for some deterministic function g has cardinality at least n(n−1).

Theorem 1 implies that the causal direction is identifiable when the exogenous variable has cardinality less than n(n−1):

**Corollary 1.** Assume that there exists an algorithm $\mathcal{A}$ that, given n random variables $\{Z_i\}_{i \in [n]}$ with distributions $\{p_i\}_{i \in [n]}$, each with n states, outputs the distribution of the random variable E with minimum cardinality and functions $\{f_i\}_{i \in [n]}$ with $Z_i = f_i(E)$. Consider the causal pair X → Y where Y = f(X, E_0). Assume that the cardinality of $E_0$ is less than n(n−1) and that f is generic. Then $\mathcal{A}$ can be used to identify the true causal direction with probability 1, if the distributions of X and $E_0$ are selected uniformly at random from the proper-dimensional simplices.

*Proof.* Feed the sets of conditional distributions $\{P(Y \mid X = i) : i \in [n]\}$ and $\{P(X \mid Y = i) : i \in [n]\}$ to $\mathcal{A}$ to obtain E and $\tilde{E}$, respectively. By Theorem 1, with probability 1, $\mathcal{A}$ identifies an $\tilde{E}$ with $\mathrm{card}(\tilde{E}) \geq n(n-1)$. Then, since $\mathrm{card}(E) \leq \mathrm{card}(E_0) < \mathrm{card}(\tilde{E})$, comparing the cardinalities gives the true direction.

Corollary 1 gives an algorithm for finding the true causal direction: estimate the variables E, $\tilde{E}$ with minimum $H_0$ entropy and declare X → Y if $|\tilde{E}| > |E|$, and Y → X if $|\tilde{E}| < |E|$. The result easily extends to the case where X and Y are allowed to have different numbers of states:

**Proposition 1 (Inference algorithm).** Suppose X → Y. Let $X \in \mathcal{X}$, $Y \in \mathcal{Y}$ with $|\mathcal{X}| = n$, $|\mathcal{Y}| = m$. Assume that $\mathcal{A}$ is an algorithm that finds the exogenous variables E, $\tilde{E}$ with minimum cardinality. Then, if the underlying exogenous variable $E_0$ satisfies $|E_0| < n(m-1)$, with probability 1 we have $|X| + |E| < |Y| + |\tilde{E}|$.

The proof follows from Corollary 1 by extending the proof of Theorem 1 to different cardinalities for X and Y. Unfortunately, it turns out that no efficient algorithm $\mathcal{A}$ exists unless P = NP:

**Theorem 2.** Given a conditional distribution matrix $\mathbf{P}_{Y|X}$, identifying the variable E ⊥ X with minimum support size such that there exists a function f with Y = f(X, E) is NP-hard.

The hardness of this problem leads us to search for alternative approaches.

## 3 Causal Model with Minimum H1 Entropy

In this section, we propose a way to identify the causal model that explains the observational data with minimum Shannon entropy ("entropy", for short). The entropy of a causal model is measured by the number of random bits required to generate its input. In the causal graph X → Y, where Y = f(X, E), we identify the exogenous variable E ⊥ X with minimum entropy. We show that this corresponds to a known problem that has been shown to be NP-hard; we then propose a greedy algorithm.

Notice that H(E) is different from the conditional entropy H(Y|X). Certainly, since Y = f(X, E), we have $H(Y|X) \leq H(E)$. The key point is that, since E is forced to be independent of X, H(E) cannot in general be lowered to H(Y|X). To see this, write $H(Y|X) = \sum_i p_X(i) H(Y \mid X = i)$, whereas, since the conditional distribution of Y given X = i is the same as the distribution of $f_i(E)$ for some function $f_i$, we have $H(E) \geq \max_i H(Y \mid X = i)$.

### 3.1 Finding E with Minimum Entropy

Consider the equation Y = f(X, E) with X ⊥ E. Let $f_x : \mathcal{E} \to \mathcal{Y}$ be the function mapping E to Y when X = x, i.e., $f_x(E) := f(x, E)$. Then $P(Y = y \mid X = x) = P(f_x(E) = y \mid X = x) = P(f_x(E) = y)$, where the last equality follows from the fact that X ⊥ E.
Thus, we can treat the conditional distributions $P(Y \mid X = x)$ as distributions that emerge by applying some function $f_x$ to an unobserved variable E. The problem of identifying the minimum-entropy E given the joint distribution p(x, y) then becomes equivalent to the following: given the distributions of the variables $f_i(E)$, find the minimum-entropy distribution (the distribution of E) such that there exist functions $f_i$ that map this distribution to the observed distributions of $Y \mid X = i$. It can be shown that $H(E) \geq H(f_1(E), f_2(E), \ldots, f_n(E))$. Regarding each $f_i(E)$ as a random variable $U_i$, the best such lower bound on H(E) is obtained by minimizing $H(U_1, U_2, \ldots, U_n)$, and we can show that one can always construct an E that achieves this minimum. Thus, the problem of finding the exogenous variable E with minimum entropy given the joint distribution p(x, y) is equivalent to finding the minimum-entropy joint distribution of the random variables $U_i = (Y \mid X = i)$, given the marginal distributions $p(Y \mid X = i)$:

**Theorem 3 (Minimum Entropy Causal Model).** Assume that there exists an algorithm $\mathcal{A}$ that, given n random variables $\{Z_i\}_{i \in [n]}$ with distributions $\{p_i\}_{i \in [n]}$, each with n states, outputs the minimum-entropy joint distribution over the $Z_i$ consistent with the given marginals. Then $\mathcal{A}$ can be used to find the causal model $\mathcal{M} = (\{X, Y\}, E, f, X \to Y, p_{X,E})$ with minimum input entropy, given any joint distribution $p_{X,Y}$.

The problem of minimizing the joint entropy subject to marginal constraints is non-convex. In fact, it is shown in (Kovacevic, Stanojevic, and Senk 2012) that minimizing the joint entropy of a set of variables given their marginals is NP-hard. Thus we have the following corollary:

**Corollary 2.** Finding the causal model $\mathcal{M} = (\{X, Y\}, E, f, X \to Y, p_{E,X})$ with minimum H(E) that induces a given distribution p(x, y) is NP-hard.

For this problem, we propose a greedy algorithm. Using entropy rather than cardinality to identify E, even though both problems turn out to be NP-hard, is useful because entropy is more robust to noise in the data: in real data, we estimate the probability values from samples, and noise is unavoidable.

### 3.2 A Conjecture on Identifiability with H1 Entropy

We have the following conjecture, supported by the artificial- and real-data experiments in Section 4.

**Conjecture 1.** Consider the causal model $\mathcal{M} = (\{X, Y\}, E, f, X \to Y, p_{X,E})$, where the discrete random variables X, Y have n states and E ⊥ X has θ states. If the distribution of X is uniformly randomly selected from the (n−1)-dimensional simplex, the distribution of E is uniformly selected from the probability distributions that satisfy $H_1(E) \leq \log n + O(1)$, and f is randomly selected from all functions $f : [n] \times [\theta] \to [n]$, then with high probability, any $\tilde{E} \perp Y$ that satisfies $X = g(Y, \tilde{E})$ for some deterministic g entails $H(X) + H(E) < H(Y) + H(\tilde{E})$.

**Proposition 2 (Assuming Conjecture 1).** Assume there exists an algorithm $\mathcal{A}$ that, given n random variables $\{Z_i\}_{i \in [n]}$ with distributions $\{p_i\}_{i \in [n]}$, each with n states, outputs the distribution of the random variable E with minimum entropy and functions $\{f_i\}_{i \in [n]}$ with $Z_i = f_i(E)$. Consider the causal pair X → Y where Y = f(X, E_0), the cardinality of $E_0$ is cn for some constant c, and f is selected randomly. Then $\mathcal{A}$ can be used to identify the true causal direction with high probability, if the distributions of X and $E_0$ are uniformly random samples from the proper-dimensional simplices.
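Before presenting the greedy algorithm of Section 3.3, the reduction behind Theorem 3 can be made concrete with a short sketch (illustrative NumPy code; the helper names are ours): the marginals handed to the minimum entropy coupling problem are exactly the conditional distributions $p(Y \mid X = i)$, i.e., the columns of $\mathbf{P}_{Y|X}$, and $\max_i H(Y \mid X = i)$ is the trivial lower bound on H(E) noted above.

```python
import numpy as np

def shannon(p):
    """Shannon entropy (in bits) of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def coupling_marginals(p_xy):
    """From a joint distribution matrix p_xy (rows = x states, cols = y states),
    return the conditionals p(Y | X = i) as rows; these are the distributions
    U_i = (Y | X = i) fed to the minimum entropy coupling problem (equivalently,
    the columns of P_{Y|X})."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)
    return p_xy / p_x

# Toy joint distribution on 3 x 3 states
p_xy = np.array([[0.20, 0.05, 0.05],
                 [0.05, 0.25, 0.05],
                 [0.05, 0.05, 0.25]])
U = coupling_marginals(p_xy)
# Any valid exogenous E for X -> Y satisfies H(E) >= max_i H(Y | X = i),
# since each conditional is the distribution of a deterministic function of E.
print(max(shannon(row) for row in U))
```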
### 3.3 Greedy Entropy Minimization Algorithm

Given m discrete random variables, each with n states, we provide a heuristic algorithm for minimizing their joint entropy given their marginal distributions. The main idea is the following: each marginal probability constraint must be satisfied; for example, for two variables with distributions $p_1, p_2$, the i-th row of the joint distribution matrix must sum to $p_1(i)$. The contribution of a probability mass to the joint entropy only increases when the mass is divided into smaller pieces: $-p_1(i) \log p_1(i) \leq -a \log a - b \log b$ when $p_1(i) = a + b$ with $a, b \geq 0$. Thus, we try to keep large probability masses intact to ensure that their contribution to the joint entropy is minimized.

We propose Algorithm 1. The sorting step is only there to simplify the presentation; hence, although the algorithm as stated runs in time $O(m^2 n^2 \log n)$, this can easily be reduced to $O(\max(mn \log n, m^2 n))$ by dropping the sorting step. The algorithm simply proceeds by removing the largest probability mass it can at each round, which ensures that the large probability masses remain intact.

```
Algorithm 1: Joint Entropy Minimization
Input: marginal distributions of m variables, each with n states,
       in matrix form M = [p_1^T; p_2^T; ...; p_m^T]
 1: e <- [ ]
 2: sort each row of M in decreasing order
 3: find the minimum of the per-row maxima: r <- min_i p_i(1)
 4: while r > 0 do
 5:     e <- [e, r]
 6:     update the maximum of each row: p_i(1) <- p_i(1) - r, for all i
 7:     sort each row of M in decreasing order
 8:     r <- min_i p_i(1)
 9: end while
10: return e
```

One can easily construct the joint distribution itself with a small variant: instead of sorting, at each step find $r = \min_i \{\max_j p_i(j)\}$ and assign the mass r to the joint-distribution entry with coordinates $(a_1, \ldots, a_m)$, where $a_i = \arg\max_j p_i(j)$.

**Lemma 3.** Greedy entropy minimization outputs a point with entropy at most $\log m + \log n$.

**Figure 1:** (a) Performance of the greedy joint entropy minimization algorithm: for each value of n, n distributions, each with n states, are randomly generated; the plot shows the average, minimum, and maximum gap $H^*(X_1, \ldots, X_n) - \max_i H(X_i)$ against $\log_2 n$, and the minimum joint entropy obtained by the greedy algorithm is at most 1 bit above the largest marginal entropy $\max_i H(X_i)$. (b) Identifiability with entropy (artificial data): we generate distributions of X, Y by randomly selecting f, X, E; the probability of success, i.e., the fraction of trials with $H(X, E) < H(Y, \tilde{E})$, is plotted against $H(E)/\log n$ for n = 4 to 128; larger n drives the probability of success to 1 when $H(E) \leq \log n$, supporting Conjecture 1. (c) Real-data performance: accuracy versus decision rate, where the decision rate is the fraction of samples for which the algorithm declares a causal direction; a decision is made when $|H(X, E) - H(Y, \tilde{E})| > t \log_2 n$, where t determines the decision rate; 68% and 95% confidence intervals are also shown.

Lemma 3 follows from the fact that the algorithm returns a support of size at most m(n−1)+1.
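A direct Python rendering of Algorithm 1 follows (a sketch using NumPy; helper names are illustrative). The sorting step is replaced by a direct argmax, in line with the remark above that sorting only simplifies the presentation:

```python
import numpy as np

def greedy_min_joint_entropy(marginals):
    """Algorithm 1 sketch: greedily build the probability masses e of a joint
    distribution consistent with the given marginals.

    `marginals` is an (m x n) array whose rows are the m marginal distributions.
    Returns the joint probability masses; their Shannon entropy is the greedy
    estimate of the minimum joint entropy.
    """
    M = np.array(marginals, dtype=float)
    e = []
    r = M.max(axis=1).min()            # minimum over rows of each row's maximum
    while r > 1e-12:                   # numerical tolerance instead of r > 0
        e.append(r)
        idx = M.argmax(axis=1)         # current maximum of every row
        M[np.arange(M.shape[0]), idx] -= r
        r = M.max(axis=1).min()
    return np.array(e)

def shannon(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Example with two marginals (m = 2, n = 3)
p1 = [0.5, 0.3, 0.2]
p2 = [0.6, 0.3, 0.1]
e = greedy_min_joint_entropy([p1, p2])
print(e, e.sum())                                  # masses sum to 1
print(shannon(e), max(shannon(p1), shannon(p2)))   # greedy entropy vs. trivial lower bound
```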
We also prove that, when there are two variables, each with n states, the algorithm returns a point that satisfies the KKT conditions of the optimization problem, which implies that it is a local optimum (see Proposition 4 in the Appendix).

## 4 Experiments

In this section, we test the performance of our algorithms on real and artificial data. First, we test the greedy entropy minimization algorithm and show that it performs close to the trivial lower bound. Then, we test our conjecture on identifiability using entropy. Lastly, we test our entropy-minimization-based causal identification technique on real data.

To test our algorithms, we sample points in the proper-dimensional simplices, which correspond to distributions for X and E. The distribution of X is sampled uniformly from the simplex: it is well known that the vector $[x_i / Z]_i$ is uniformly distributed over the simplex if the $x_i$ are i.i.d. exponential random variables with parameter 1 and $Z = \sum_i x_i$ (Onn and Weissman 2011). To sample low-entropy distributions for E, we use a heavy-tailed distribution for each coordinate instead of the exponential one. Specifically, we use $[e_i / Z]_i$, where the $e_i$ are i.i.d. log-normal random variables with parameter σ and $Z = \sum_i e_i$. We observe that this allows us to sample a variety of distributions with small entropy.

**Performance of greedy entropy minimization:** We sample distributions for n random variables $\{X_i\}_{i \in [n]}$, each with n states, and apply Algorithm 1 to minimize their joint entropy. We compare our greedy joint entropy minimization algorithm against the simple lower bound $\max_i H(X_i)$. Figure 1a shows the average, maximum, and minimum excess bits relative to this lower bound. Contrary to the pessimistic bound of log n bits, the joint entropy found by the greedy algorithm is at most 1 bit above $\max_i H(X_i)$ for the given range of n.

**Verifying the entropy-based identifiability conjecture:** Here we empirically verify Conjecture 1. The distributions for X are uniformly randomly sampled from the simplex in n dimensions. We also select f randomly (see the implementation details). For the log-normal parameter σ used to sample the distribution of E from the n(n−1)-dimensional simplex, we sweep the integer values from 2 to 8, which allows us to obtain distribution samples from different regimes; we only consider the samples that satisfy $H(E) \leq \log n$. After sampling E, X, and f, we compute the corresponding $\mathbf{P}_{Y|X}$ and $\mathbf{P}_{X|Y}$ for Y = f(X, E), and apply greedy entropy minimization to the columns of these induced conditional distributions to obtain the estimates E and $\tilde{E}$ for the two causal models Y = f(X, E) and $X = g(Y, \tilde{E})$, respectively. Figure 1b shows the probability of success, i.e., the fraction of samples that satisfy $H(X) + H(E) < H(Y) + H(\tilde{E})$. As n increases, the probability of success converges to 1 when $H(E) \leq \log n$, which supports the conjecture.

**Experiments on real cause-effect pairs:** We test our entropy-based causal inference algorithm on the Cause-Effect Pairs repository (Mooij et al. 2016a). ANMs have been reported to achieve an accuracy of 63% with a confidence interval of 10% on this dataset (Mooij et al. 2016b); we use the same binomial confidence intervals as in (Clopper and Pearson 1934). The cause-effect pairs show very different characteristics: from the scatter plots, one can observe that they are a mix of continuous and discrete variables.
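The decision procedure used in both the synthetic and the real-data experiments can be sketched as follows (illustrative code reusing `shannon` and `greedy_min_joint_entropy` from the sketch in Section 3.3; the threshold parameter t corresponds to the decision-rate threshold of Figure 1c):

```python
import numpy as np

def entropic_causal_direction(p_xy, t=0.0):
    """Given a joint distribution matrix p_xy (rows = x states, cols = y states),
    estimate minimum-entropy exogenous variables for both directions with the
    greedy coupling algorithm, and compare H(X) + H(E) with H(Y) + H(E~).
    Abstains unless the gap exceeds t * log2(n)."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    cond_y_given_x = p_xy / p_x[:, None]        # rows are p(Y | X = i)
    cond_x_given_y = (p_xy / p_y[None, :]).T    # rows are p(X | Y = j)

    h_fwd = shannon(p_x) + shannon(greedy_min_joint_entropy(cond_y_given_x))
    h_bwd = shannon(p_y) + shannon(greedy_min_joint_entropy(cond_x_given_y))

    n = max(p_xy.shape)
    if abs(h_fwd - h_bwd) < t * np.log2(n):
        return "no decision"
    return "X -> Y" if h_fwd < h_bwd else "Y -> X"

# Usage: any estimated joint distribution (e.g., from quantized samples) works.
p_xy = np.array([[0.30, 0.05, 0.05],
                 [0.02, 0.28, 0.02],
                 [0.03, 0.05, 0.20]])
print(entropic_causal_direction(p_xy, t=0.05))
```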
The challenge in applying our framework to this dataset is choosing the right quantization: a small number of quantization levels may lose information about the joint distribution, while a very large number of states may be computationally hard to work with. We pick the same number of states for both X and Y, and use a uniform quantization that ensures each state of the variables has 10 samples on average. From the samples, we estimate the conditional transition matrices $\mathbf{P}_{Y|X}$ and $\mathbf{P}_{X|Y}$ and feed their columns to the greedy entropy minimization algorithm (Algorithm 1), which outputs an approximation of the smallest-entropy exogenous variable. We then compare $H(X, E)$ and $H(Y, \tilde{E})$ and, based on Conjecture 1, declare the model with the smaller input entropy to be the true model. For a causal pair, we make a decision only if $|H(X, E) - H(Y, \tilde{E})| \geq t \log n$ for a threshold parameter t, which determines the decision rate. Accuracy becomes unstable at very small decision rates, since the number of evaluated pairs becomes too small. At a 100% decision rate, the algorithm achieves 64.21% accuracy, which is slightly better than the 63% performance of ANMs reported in (Mooij et al. 2016b). In addition, our algorithm uses only probability values, and is therefore applicable to categorical as well as ordinal variables.

## Acknowledgements

This work has been supported by NSF grants CCF-1344179, CCF-1344364, CCF-1407278, CCF-1422549, NSF-1564167, and NSF-1559997, and by ARO YIP W911NF-14-1-0258. The work of Babak Hassibi has been supported in part by the National Science Foundation under grants CNS-0932428, CCF-1018927, CCF-1423663, and CCF-1409204, by a grant from Qualcomm Inc., by NASA's Jet Propulsion Laboratory through the President and Director's Fund, and by King Abdullah University of Science and Technology.

## References

- Chen, Z.; Zhang, K.; Chan, L.; and Schölkopf, B. 2014. Causal discovery via reproducing kernel Hilbert space embeddings. Neural Computation 26:1484–1517.
- Chickering, D. M. 2002. Optimal structure identification with greedy search. Journal of Machine Learning Research 3:507–554.
- Clopper, C., and Pearson, E. S. 1934. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 26:404–413.
- Etesami, J., and Kiyavash, N. 2016. Discovering influence structure. In IEEE ISIT.
- Gao, W.; Kannan, S.; Oh, S.; and Viswanath, P. 2016. Conditional dependence via Shannon capacity: Axioms, estimators and applications. In ICML 2016.
- Granger, C. W. 1969. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society 424–438.
- Hauser, A., and Bühlmann, P. 2012a. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research 13(1):2409–2464.
- Hauser, A., and Bühlmann, P. 2012b. Two optimal strategies for active learning of causal networks from interventional data. In Proceedings of the Sixth European Workshop on Probabilistic Graphical Models.
- Hoyer, P. O.; Janzing, D.; Mooij, J.; Peters, J.; and Schölkopf, B. 2008. Nonlinear causal discovery with additive noise models. In Proceedings of NIPS 2008.
- Hyttinen, A.; Eberhardt, F.; and Hoyer, P. 2013. Experiment selection for causal discovery. Journal of Machine Learning Research 14:3041–3071.
- Janzing, D.; Mooij, J.; Zhang, K.; Lemeire, J.; Zscheischler, J.; Daniušis, P.; Steudel, B.; and Schölkopf, B. 2012. Information-geometric approach to inferring causal directions. Artificial Intelligence 182–183:1–31.
- Kontoyiannis, I., and Skoularidou, M. 2016. Estimating the directed information and testing for causality. IEEE Transactions on Information Theory 62:6053–6067.
- Kovacevic, M.; Stanojevic, I.; and Senk, V. 2012. On the hardness of entropy minimization and related problems. In IEEE Information Theory Workshop.
- Lopez-Paz, D., and Oquab, M. 2016. Revisiting classifier two-sample tests. arXiv preprint.
- Lopez-Paz, D.; Muandet, K.; Schölkopf, B.; and Tolstikhin, I. 2015. Towards a learning theory of cause-effect inference. In Proceedings of ICML 2015.
- Lopez-Paz, D.; Muandet, K.; and Recht, B. 2015. The randomized causation coefficient. Journal of Machine Learning Research 16:2901–2907.
- MCI. 2016. Munich Workshop on Causal Inference and Information Theory 2016. https://www.lnt.ei.tum.de/events/munich-workshopon-causal-inference-and-information-theory-2016/. Accessed: 2016-09-14.
- Mooij, J. M.; Stegle, O.; Janzing, D.; Zhang, K.; and Schölkopf, B. 2010. Probabilistic latent variable models for distinguishing between cause and effect. In Proceedings of NIPS 2010.
- Mooij, J. M.; Janzing, D.; Zscheischler, J.; and Schölkopf, B. 2016a. Cause-Effect Pairs repository. [Online].
- Mooij, J. M.; Peters, J.; Janzing, D.; Zscheischler, J.; and Schölkopf, B. 2016b. Distinguishing cause from effect using observational data: Methods and benchmarks. Journal of Machine Learning Research 17(32):1–102.
- Onn, S., and Weissman, I. 2011. Generating uniform random vectors over a simplex with implications to the volume of a certain polytope and to multivariate extremes. Annals of Operations Research 189:331–342.
- Pearl, J. 2009. Causality: Models, Reasoning and Inference. Cambridge University Press.
- Peters, J.; Janzing, D.; and Schölkopf, B. 2011. Causal inference on discrete data using additive noise models. IEEE Transactions on Pattern Analysis and Machine Intelligence 33:2436–2450.
- Quinn, C.; Kiyavash, N.; and Coleman, T. 2015. Directed information graphs. IEEE Transactions on Information Theory 61:6887–6909.
- Rubin, D. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66(5):688–701.
- Schölkopf, B.; Hogg, D. W.; Wang, D.; Foreman-Mackey, D.; Janzing, D.; Simon-Gabriel, C.-J.; and Peters, J. 2015. Removing systematic errors for exoplanet search via latent causes. In Proceedings of the 32nd International Conference on Machine Learning.
- Shajarisales, N.; Janzing, D.; Schölkopf, B.; and Besserve, M. 2015. Telling cause from effect in deterministic linear dynamical systems. In Proceedings of the 32nd International Conference on Machine Learning.
- Shanmugam, K.; Kocaoglu, M.; Dimakis, A.; and Vishwanath, S. 2015. Learning causal graphs with small interventions. In NIPS 2015.
- Shimizu, S.; Hoyer, P. O.; Hyvärinen, A.; and Kerminen, A. J. 2006. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research 7:2003–2030.
- Spirtes, P.; Glymour, C.; and Scheines, R. 2001. Causation, Prediction, and Search. A Bradford Book.