# nonlinear_causal_inference_using_gaussianity_measures__e2fd9913.pdf Journal of Machine Learning Research 17 (2016) 1-39 Submitted 9/14; Revised 1/16; Published 4/16 Non-linear Causal Inference using Gaussianity Measures Daniel Hern andez-Lobato daniel.hernandez@uam.es Universidad Aut onoma de Madrid Calle Francisco Tom as y Valiente 11, Madrid 28049, Spain Pablo Morales-Mombiela pablo.morales@estudiante.uam.es Quantitative Risk Research Calle Faraday 7, Madrid 28049, Spain David Lopez-Paz dlp@fb.com Facebook AI Research 6 rue Menars, Paris 75002, France Alberto Su arez alberto.suarez@uam.es Universidad Aut onoma de Madrid Calle Francisco Tom as y Valiente 11, Madrid 28049, Spain Editor: Isabelle Guyon and Alexander Statnikov We provide theoretical and empirical evidence for a type of asymmetry between causes and effects that is present when these are related via linear models contaminated with additive non-Gaussian noise. Assuming that the causes and the effects have the same distribution, we show that the distribution of the residuals of a linear fit in the anti-causal direction is closer to a Gaussian than the distribution of the residuals in the causal direction. This Gaussianization effect is characterized by reduction of the magnitude of the high-order cumulants and by an increment of the differential entropy of the residuals. The problem of non-linear causal inference is addressed by performing an embedding in an expanded feature space, in which the relation between causes and effects can be assumed to be linear. The effectiveness of a method to discriminate between causes and effects based on this type of asymmetry is illustrated in a variety of experiments using different measures of Gaussianity. The proposed method is shown to be competitive with state-of-the-art techniques for causal inference. Keywords: causal inference, Gaussianity of the residuals, cause-effect pairs 1. Introduction The inference of causal relationships from data is one of the current areas of interest in the artificial intelligence community, e.g. (Chen et al., 2014; Janzing et al., 2012; Morales Mombiela et al., 2013). The reason for this surge of interest is that discovering the causal . The vast majority of the work was done while being at the Max Planck Institute for Intelligent Systems and at Cambridge University. c 2016 Daniel Hern andez-Lobato, Pablo Morales-Mombiela, David Lopez-Paz and Alberto Su arez. Hern andez-Lobato, Morales-Mombiela, Lopez-Paz and Su arez structure of a complex system provides an explicit description of the mechanisms that generate the data, and allows us to understand the consequences of interventions in the system (Pearl, 2000). More precisely, automatic causal inference can be used to determine how modifications of the value of certain relevant variables (the causes) influence the values of other related variables (the effects). Therefore, understanding cause-effect relations is of paramount importance to control the behavior of complex systems and has applications in industrial processes, medicine, genetics, economics, social sciences or meteorology. Causal relations can be determined in complex systems in three different ways. First, they can be inferred from domain knowledge provided by an expert, and incorporated in an ad-hoc manner in the description of the system. Second, they can be discovered by performing interventions in the system. These are controlled experiments in which one or several variables of the system are forced to take particular values. Interventions constitute a primary tool for identifying causal relationships. However, in many situations they are unethical, expensive, or technically infeasible. Third, they can be estimated using causal discovery algorithms that use as input purely uncontrolled and static data. This last approach for causal discovery has recently received much attention from the machine learning community (Shimizu et al., 2006; Hoyer et al., 2009; Zhang and Hyv arinen, 2009). These methods assume a particular model for the mapping mechanisms that link causes to effects. By specifying particular conditions on the mapping mechanism and the distributions of the cause and noise variables, the causal direction becomes identifiable Chen et al. (2014). For instance, Hoyer et al. (2009) assume that the effect is a non-linear transformation of the cause plus some independent additive noise. A potential drawback of these methods is that the assumptions made by the particular model considered could be unrealistic for the data under study. In this paper we propose a general method for causal inference that belongs to the third of the categories described above. Specifically, we assume that the cause and the effect variables have the same distribution and are linked by a linear relationship contaminated with non-Gaussian noise. For the univariate case we prove that, under these assumptions, the magnitude of the cumulants of the residuals of order higher than two is smaller for the linear fit in the anti-causal direction than in the causal one. Since the Gaussian is the only distribution whose cumulants of order higher than 2 are zero, statistical tests based on measures of Gaussianity can be used for causal inference. An antecedent of this result is the observation that, when cause and effect have the same distribution, the residuals of a fit in the anti-causal direction have higher entropy than in the causal direction (Hyv arinen and Smith, 2013; Kpotufe et al., 2014). Since the residuals of the causal and anti-causal linear models have the same variance and the Gaussian is the distribution that maximizes the entropy for a fixed variance, this means that the distribution of the latter is more Gaussian than the former. For multivariate cause-effect pairs that have the same distribution and are related by a linear model with additive non-Gaussian noise the proof given by Hyv arinen and Smith (2013) and Kpotufe et al. (2014) can be extended to show that the entropy of the vector of residuals of a linear fit in the anti-causal direction is larger than the corresponding residuals of a linear fit in the causal direction. We conjecture that also in this case there is a reduction of the magnitude of the tensor cumulants of the anti-causal multivariate residuals and provide some numerical evidence of this effect in two dimensions. Non-linear Causal Inference using Gaussianity Measures The problem of non-linear causal inference is addressed by embedding the original problem in an expanded feature space. We then make the assumption that the non-linear relation between causes and effects in the original space is linear in the expanded feature space. The computations required to make inference on the causal direction based on this embedding can be readily carried out using kernel methods. In summary, the proposed method for causal inference proceeds by first making a transformation of the original variables so that causes and effects have the same distribution. Then we perform kernel ridge regression in both the causal and the anti-causal directions. The dependence between causes and effects, which is non-linear in the original space, is assumed to be linear in the kernel-induced feature space. A statistical test is then used to quantify the degree of similarity between the distributions of these residuals and a Gaussian distribution with the same variance. Finally, the direction in which the residuals are less Gaussian is identified as the causal one. The performance of this method is evaluated in both synthetic and real-world causeeffect pairs. From the results obtained it is apparent that the anti-causal residuals of a linear fit in the expanded feature space are more Gaussian than the causal residuals. In general, it is difficult to estimate the entropy from a finite sample (Beirlant et al., 1997). Empirical estimators of high order cumulants involve high order moments, which means they often have large variance. As an alternative, we propose to use statistical tests based on the energy distance to characterize the Gaussianization effect for the residuals of linear fits in the causal and anti-causal directions. Tests based on the energy distance were analyzed in depth by Sz ekely and Rizzo (2005). They have been shown to be related to homogeneity tests based on embeddings in a Reproducing Kernel Hilbert Space (Gretton et al., 2012). An advantage of energy distance-based statistics is that they can be readily estimated from a sample by computing expectations of pairwise Euclidean distances. The energy distance generally provides better results than the entropy or cumulant-based Gaussianity measures. In the problems investigated, the accuracy of the proposed method, using the energy distance to the Gaussian, is comparable to other state-of-the-art techniques for causal discovery. The rest of the paper is organized as follows: Section 2 illustrates that, under certain conditions, the residuals of a linear regression fit are closer to a Gaussian in the anti-causal direction than in the causal one, based on a reduction of the high-order cumulants and on an increment of the entropy. This section considers both the univariate and multivariate cases. Section 3 adopts a kernel approach to carry out a feature expansion that can be used to detect non-linear causal relationships. We also show here how to compute the residuals in the expanded feature space, and how to choose the different hyper-parameters of the proposed method. Section 4 contains a detailed description of the implementation. In section 6 we present the results of an empirical assessment of the proposed method in both synthetic and real-world cause-effect data pairs. Finally, Section 7 summarizes the conclusions and puts forth some ideas for future research. 2. Asymmetry Based on the Non-Gaussianity of the Residuals of Linear Models Let X and Y be two random variables that are causally related. The direction of the causal relation is not known. Our goal is to determine whether X causes Y, i.e., X Y or, Hern andez-Lobato, Morales-Mombiela, Lopez-Paz and Su arez alternatively Y causes X, i.e., Y X. For this purpose, we exploit an asymmetry between causes an effects. This type of asymmetry can be uncovered using statistical tests that measure the non-Gaussianity of the residuals of linear regression models obtained from fits in the causal and in the anti-causal direction. To motivate the methodology that we have developed, we will proceed in stepwise manner. First we analyze a special case in one dimension: We assume that X and Y have the same distribution and are related via a linear model contaminated with additive i.i.d. non-Gaussian noise. The noise is independent of the cause. Under these assumptions we show that the distribution of the residuals of a linear fit in the incorrect (anti-causal) direction is closer to a Gaussian distribution than the distribution of the residuals in the correct (causal) direction. For this, we use an argument based on the reduction of the magnitude of the cumulants of order higher than 2. The cumulants are defined as the derivatives of the logarithm of the moment-generating function evaluated at zero (Cornish and Fisher, 1938; Mc Cullagh, 1987). The Gaussianization effect can be characterized also in terms of an increase of the entropy. The proof is based on the results of Hyv arinen and Smith (2013), which are extended in this paper to the multivariate case. In particular, we show that the entropy of the residuals of a linear fit in the anti-causal direction is larger or equal than the entropy of the residuals of a linear fit in the causal direction. Since the Gaussian it the distribution that has maximum entropy, given a particular covariance matrix, an increase of the entropy of the residuals means that their distribution becomes closer to the Gaussian. Finally, we note that it is easy to guarantee that X and Y have the same distribution in the case that these variables are unidimensional and continuous. To this end we only have to transform one of the variables (typically the cause random variable) using the probability integral transform, as described in Section 4. However, after the data have been transformed, the relation between the variables will no longer be linear in general. Thus, to address non-linear cause-effect problems involving univariate random variables the linear model is formulated in an expanded feature space, where the multivariate analysis of the Gaussianization effect is also applicable. In this feature space all the computations required for causal inference can be formulated in terms of kernels. This can be used to detect nonlinear causal relations in the original input space and allows for an efficient implementation of the method. The only assumption is that the non-linear relation in the original input space is linear in the expanded feature space induced by the selected kernel. 2.1 Analysis of the Univariate Case Based on Cumulants Let X and Y be one-dimensional random variables that have the same distribution. Without further loss of generality, we will assume that they have zero mean and unit variance. Let x = (x1, . . . , x N)T and y = (y1, . . . , y N)T be N paired samples drawn i.i.d. from P(X, Y). Assume that the causal direction is X Y and that the measurements are related by a linear model yi = wxi + ϵi , ϵi xi , i , (1) where w = corr(X, Y) [ 1, 1] and ϵi is independent i.d. non-Gaussian additive noise. Non-linear Causal Inference using Gaussianity Measures A linear model in the opposite direction, i.e., Y X, can be built using least squares xi = wyi + ϵi, (2) where w = corr(Y, X) is the same coefficient as in the previous model. The residuals of this reversed linear model are defined as ϵi = xi wyi. Following an argument similar to that of Hern andez-Lobato et al. (2011) we show that the residuals { ϵi}N i=1 in the anti-causal direction are more Gaussian than the residuals {ϵi}N i=1 in the actual causal direction X Y based on a reduction of the magnitude of the cumulants. The proof is based on establishing a relation between the cumulants of the distribution of the residuals in both the causal and the anti-causal direction. First, we show that κn(yi), the n-th order cumulant of Y, can be expressed in terms of κn(ϵi), the n-th order cumulant of the residuals: κn(yi) = wnκn(xi) + κn(ϵi) = wnκn(yi) + κn(ϵi) = 1 1 wn κn(ϵi) . (3) To derive this relation we have used (1), that xi and yi have the same distribution (and hence have the same cumulants), and standard properties of cumulants (Cornish and Fisher, 1938; Mc Cullagh, 1987). Furthermore, κn( ϵi) = κn(xi wyi) = κn(xi w2xi wϵi) = (1 w2)nκn(xi) + ( w)nκn(ϵi) = (1 w2)nκn(yi) + ( w)nκn(ϵi) = (1 w2)n 1 wn κn(ϵi) + ( w)nκn(ϵi) = cn(w)κn(ϵi) , (4) where we have used the definition of ϵi and (3) In Figure 1 the value of cn(w) = (1 w2)n 1 wn + ( w)n. (5) is displayed as a function of w [ 1, 1]. Note that c1(w) = c2(w) = 1 independently of the value of w. This means that the mean and the variance of the residuals are the same in both the causal and anti-causal directions. For n > 2, |cn(w)| 1 with equality only for w = 0 and w = 1. The result is that the high-order cumulants of the residuals in the anti-causal direction are smaller in magnitude that the corresponding cumulants in the causal direction. Using the observation that all the cumulants of the Gaussian distribution of order higher than two are zero (Marcinkiewicz, 1938), we conclude that the distribution of the residuals in the anti-causal direction is closer to the Gaussian distribution than in the causal direction. In summary, we can infer the causal direction by (i) fitting a linear model in each possible direction, i.e., X Y and Y X, and (ii) carrying out statistical tests to detect the level of Gaussianity of the two corresponding residuals. The direction in which the residuals are less Gaussian is expected to be the correct one. 2.2 Analysis of the Multivariate Case Based on Cumulants In this section we argue that the Gaussianization effect of the residuals in the anti-causal direction also takes place when the two random variables X and Y are multidimensional. Hern andez-Lobato, Morales-Mombiela, Lopez-Paz and Su arez 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 Coefficient Values for Odd Cumulants c1 c3 c5 c7 c9 c11 c13 1.0 0.5 0.0 0.5 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Coefficient Values for Even Cumulants c2 c4 c6 c8 c10 c12 c14 Figure 1: Values of the function cn( ) as a function of w for each cumulant number n (odd in the left plot, even in the right plot). All values for cn( ) lie in the interval [ 1, 1]. We will assume that these variables follow the same distribution and, without further loss of generality, that they have been whitened (i.e., they have a zero mean vector and the identity matrix as the covariance matrix). Let X = (x1, . . . , x N)T and Y = (y1, . . . , y N)T be N paired samples drawn i.i.d. from P(X, Y). In this case, the model assumed for the actual causal relation is yi = Axi + ϵi , ϵi xi , i , (6) where A = corr(Y, X) is a d d matrix of model coefficients and ϵi is i.i.d. non-Gaussian additive noise. The model in the anti-causal direction is the one that results from the least squares fit: xi = Ayi + ϵi , (7) where we have defined ϵi = xi Ayi and A = corr(X, Y) = AT. As in the univariate case, we start by expressing the cumulants of Y in terms of the cumulants of the residuals. However, the cumulants are now tensors (Mc Cullagh, 1987): κn(yi) = κn(Axi) + κn(ϵi) . (8) In what follows, the notation vect( ) stands for the vectorization of a tensor. For example, in the case of a tensor T with dimensions d d d vect(T) = (T1,1,1, T2,1,1, , Td,1,1, T1,2,1, , Td,d,d)T. Using this notation we obtain vect(κn(yi)) = Anvect(κn(xi)) + vect(κn(ϵ)) = (I An) 1vect(κn(ϵ)) , (9) Non-linear Causal Inference using Gaussianity Measures where An = A A A A, n times, is computed using the Kronecker matrix product. To derive this expression we have used (6), the fact that Y and X are equally distributed and hence have the same cumulants. We also have used the properties of the tensor cumulants vect(κn(Axi)) = Anvect(κn(xi)), where the powers of the matrix A are computed using the Kronecker product (Mc Cullagh, 1987). Similarly, for the reversed linear model κn( ϵi) = κn(xi ATyi) = κn(xi ATAxi ATϵi) = κn((I ATA)xi ATϵi) = κn((I ATA)xi) + κn( ATϵi) . (10) Using again the notation for the vectorized tensor cumulants vect(κn( ϵi)) = (I ATA)nvect(κn(xi)) + ( 1)n(AT)nvect(κn(ϵi)) = (I ATA)n(I An) 1vect(κn(ϵi)) + ( 1)n(AT)nvect(κn(ϵi)) = (I ATA)n(I An) 1 + ( 1)n(AT)n vect(κn(ϵi)) , (11) where the powers of matrices are computed using the Kronecker product as well, and where we have used (9) and that Y and X are equally distributed and have the same cumulants. We now give some evidence to support that the magnitude of vect(κn( ϵi)) is smaller than the magnitude of vect(κn(ϵi)) in terms of the ℓ2-norm, for cumulants of order higher than 2. That is, the tensors corresponding to high-order cumulants become closer to a tensor with all its components equal to zero. For this, we introduce the following definition: Definition 1 The operator norm of a matrix M induced by the ℓp vector norm is ||M||op = min{c 0 : ||Mv||p c||v||p , v}, where || ||p denotes the ℓp-norm for vectors. The consequence is that ||M||op ||Mv||p/||v||p, v. This means that ||M||p can be understood as a measure of the size of the matrix M. In the case of the ℓ2-norm, the operator norm of a matrix M is equal to its largest singular value or, equivalently, to the square root of the largest eigenvalue of MTM. Let Mn = (I ATA)n(I An) 1 + ( 1)n(AT)n. That is, Mn is the matrix that relates the cumulants of order n of the residuals in the causal and anti-causal directions in (11). We now evaluate ||Mn||op, and show that in most cases its value is smaller than one for high-order cumulants κn( ), leading to a Gaussianization of the residuals in the anti-causal direction. From (11) and the definition given above, we know that ||Mn||op ||vect(κn( ϵi))||2/||vect(κn(ϵi))||2. This means that if ||Mn||op < 1 the cumulants of the residuals in the incorrect causal direction are shrunk to the origin. Because the multivariate Gaussian distribution has all cumulants of order higher than two equal to zero (Mc Cullagh, 1987), this translates into a distribution for the residuals in the anti-causal direction that is closer to the Gaussian distribution. In the causal direction, we have that E[yy T ] = E[(Axi+ϵi)(Axi+ϵi)T] = AAT +C = I, where C is the positive definite covariance matrix of the actual residuals1. Thus, AAT = I C and hence the singular values of A, denoted σ1, . . . , σd, satisfy 0 σi = 1 αi 1, where αi is the corresponding positive eigenvalue of C. Assume that A is symmetric (this 1. We assume such matrix exists and that it is positive definite. Hern andez-Lobato, Morales-Mombiela, Lopez-Paz and Su arez 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 Figure 2: Contour curves of the values of ||Mn||op for d = 2 and for n > 2 as a function of λ1 and λ2, i.e., the two eigenvalues of A. A is assumed to be symmetric. Similar results are obtained for higher-order cumulants. also means that Mn is symmetric). Denote by λ1, . . . , λd to the eigenvalues of A. That is, λ2 i and 0 λ2 i 1, i = 1, . . . , d. For a fixed cumulant of order n we have that ||Mn||op = max v S h 1 λ2 vj i 1 Qn i=1 λvi + ( 1)n n Y where S = {1, . . . , d}n, | | denotes absolute value, and we have employed standard properties of the Kronecker product about eigenvalues and eigenvectors (Laub, 2004). Note that this expression does not depend on the eigenvectors of A, but only on its eigenvalues. Figure 2 shows, for symmetric A, the value of ||Mn||op for n = 3, . . . , 8, and d = 2 when the two eigenvalues of A range in the interval ( 1, 1). We observe that ||Mn||op is always smaller than one. As described before, this will lead to a reduction in the ℓ2-norm of the cumulants in the anti-causal direction due to (11), and will in consequence produce a Gaussianization effect on the distribution of the residuals. For n 2 it can be readily shown that ||M1||op = ||M2||op = 1. In general, the matrix A need not be symmetric. In this case, ||M1||op = ||M2||op = 1 as well. However, the evaluation of ||Mn||op for n > 2 is more difficult, but feasible for small n. Non-linear Causal Inference using Gaussianity Measures 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 Figure 3: Contour curves of the values of ||Mn||op for d = 2 and for n > 2 as a function of σ1 and σ2, i.e., the two singular values of A. A is not assumed to be symmetric. The singular vectors of A are chosen at random. A dashed blue line highlights the boundary of the region where ||Mn||op is strictly smaller than one. Figure 3 displays the values of ||Mn||op, for d = 2, as the two singular values of A, σ1 and σ2, vary in the interval (0, 1). The left singular vectors and the right singular vectors of A are chosen at random. In this figure a dashed blue line highlights the boundary of the region where ||Mn||op is strictly smaller than one. We observe that for most values of σ1 and σ2, ||Mn||op is smaller than one, leading to a Gaussianization effect in the distribution of the residuals in the anti-causal direction. However, for some singular values, ||Mn||op is strictly larger than one. Of course, this does not mean that there is not such a Gaussianization effect also in those cases. The definition given for ||Mn||op assumes that all potential vectors v represent valid cumulants of a probability distribution, which need not be the case in practice. For example, it is well known that cumulants exhibit some form of symmetry (Mc Cullagh, 1987). This can be seen in the second order cumulant, which is a covariance matrix. The consequence is that ||Mn||op is simply an upper bound on the reduction of the ℓ2-norm of the cumulants in the anti-causal model. Thus, we also expect a Gaussianization effect to occur also for these cases. Furthermore, the numerical simulations presented in Section 6 provide evidence of this effect for asymmetric A. Hern andez-Lobato, Morales-Mombiela, Lopez-Paz and Su arez Figure 4: (left) Value of ||M2||op, for d = 2, as a function of σ1 and σ2, i.e., the two singular values of A. (right) Actual ratio between ||vect(κ2(ϵi))||2 and ||vect(κ2( ϵi))||2, for d = 2, as a function of σ1 and σ2. A is not assumed to be symmetric. The singular vectors of A are chosen at random. The fact that ||Mn||op is only an upper bound is illustrated in Figure 4. This figure considers the particular case of the second cumulant κ2( ), which can be analyzed in detail. On the left plot the value of ||M2||op is displayed as a function of σ1 and σ2, the two singular values of A. We observe that ||M2||op takes values that are larger than one. In this case it is possible to evaluate in closed form the ℓ2-norm of vect(κ2(ϵi)) and vect(κ2( ϵi)), i.e., the vectors that contain the second order cumulant of the residuals in each direction. In particular, it is well known that the second order cumulant is equal to the covariance matrix (Mc Cullagh, 1987). In the causal direction, the covariance matrix of the residuals is C = I AAT, as shown in the previous paragraphs. The covariance matrix of the residuals in the anti-causal direction, denoted by C, can be computed similarly. Namely, C = I ATA. These two matrices, i.e., C and C, respectively give k2(ϵi) and k2( ϵi). Furthermore, they have the same singular values. This means that ||vect(k2(ϵi))||2/||vect(k2( ϵi))||2 = 1, as illustrated by the right plot in Figure 4. Thus, ||M2||op is simply an upper bound on the actual reduction of the ℓ2-norm of the second order cumulant of the residuals in the anticausal direction. The same behavior is expected for ||Mn||op, with n > 2. In consequence, one should expect that the cumulants of the distribution of the residuals of a model fitted in the anti-causal direction are smaller in magnitude. This will lead to an increased level of Gaussianity measured in terms of a reduction of the magnitude of the high-order cumulants. 2.3 Analysis of the Multivariate Case Based on Information Theory The analysis of the multivariate case carried out in the previous section is illustrative. Nevertheless, it does not prove that the distribution of the residuals in the anti-casual direction is more Gaussian based on a reduction of the magnitude of the cumulants. Further evidence of this increased level of Gaussianity can be obtained based on an increase of the entropy obtained by using information theory. In this case we also assume that the multidimensional variables X and Y follow the same distribution, but unlike in the previous Non-linear Causal Inference using Gaussianity Measures section, they need not be whitened, only centered. Here we closely follow Section 2.4 of the work by Hyv arinen and Smith (2013) and extend their results to the multivariate case. Under the assumptions specified earlier, the model in the causal direction is yi = Axi + ϵi, with xi ϵi and A = Cov(Y, X)Cov(X, X) 1. Similarly, the model in the anti-causal direction is xi = Ayi + ϵi with A = Cov(X, Y)Cov(Y, Y) 1. By making use of the causal model it is possible to show that ϵi = (I AA)xi Aϵi, where I is the identity matrix. Thus, the following equations are satisfied: xi yi = A I I AA A Let H(xi, yi) be the entropy of the joint distribution of the two random variables associated with samples xi and yi. Because (13) is a linear transformation, we can use the entropy transformation formula (Hyv arinen and Smith, 2013) to get that H(xi, yi) = H(xi, ϵi) + log |det P|, where det P = det I = 1. Thus, we have that H(xi, yi) = H(xi, ϵi). Conversely, if we use (14) we have H(yi, ϵi) = H(xi, ϵi) + log |det P|, where det P = det A det( A (I AA)A 1I) = det I = 1, under the assumption that A is invertible. The result is that H(xi, yi) = H(yi, ϵi) = H(xi, ϵi). Denote the mutual information between the cause and the noise with I(xi, ϵi). Similarly, let I(yi, ϵi) be the mutual information between the random variables corresponding to the observations yi and ϵi. Then, I(xi, ϵi) I(yi, ϵi) = H(xi) + H(ϵi) H(xi, ϵi) H(yi) H( ϵi) + H(yi, ϵi) = H(xi) + H(ϵi) H(yi) H( ϵi) . (15) Furthermore, from the actual causal model assumed we know that I(xi, ϵi) = 0. By contrast, we have that I(yi, ϵi) 0, since both yi and ϵi depend on xi and ϵi. We also know that H(xi) = H(yi) because we have made the hypothesis that both X and Y follow the same distribution. The result is that: H(ϵi) H( ϵi) , (16) with equality iffthe residuals are Gaussian. We note that an alternative but equivalent way to obtain this last result is to consider a multivariate version of Lemma 1 by Kpotufe et al. (2014), under the assumption of the same distribution for the cause and the effect. In particular, even though Kpotufe et al. (2014) assume univariate random variables, their work can be easily generalized to multiple variables. Although the random variables corresponding to ϵi and ϵi have both zero mean, they need not have the same covariance matrix. Denote with Cov(ϵi) and Cov( ϵi) to these matrices and let ˆϵi and ˆ ϵi be the whitened residuals (i.e., the residuals multiplied by the Cholesky factor of the inverse of the corresponding covariance matrix). Then, H(ˆϵi) H(ˆ ϵi) 1 2 log |det Cov( ϵi)| + 1 2 log |det Cov(ϵi)| . (17) Hern andez-Lobato, Morales-Mombiela, Lopez-Paz and Su arez As shown in Appendix A, although not equal, the matrices Cov(ϵi) and Cov( ϵi) have the same determinant. Thus, the two determinants cancel in the equation above. This gives, H(ˆϵi) H(ˆ ϵi) . (18) The consequence is that the entropy of the whitened residuals in the anti-causal direction is expected to be higher or equal than the entropy of the whitened residuals in the causal direction. Because the Gaussian distribution is the continuous distribution with the highest entropy for a fixed covariance matrix, we conclude that the level of Gaussianity of the residuals in the anti-causal direction, measured in terms of differential entropy, has to be larger or equal to the level of Gaussianity of the residuals in the causal direction. In summary, if the causal relation between the two identically distributed random variables X and Y is linear the residuals of a least squares fit in the anti-causal direction are more Gaussian than those of a linear fit in the causal direction. This Gaussianization can be characterized by a reduction of the magnitude of the corresponding high-order cumulants (although we have not formally proved this, we have provided some evidence that this is the case) and by an increase of the entropy (we have proved this in this section). A causal inference method can take advantage of this asymmetry to determine the causal direction. In particular, statistical tests based on measures of Gaussianity can be used for this purpose. However and importantly, these measures of Gaussianity need not be estimates of the differential entropy or the high-order cumulants. In particular, the entropy is a quantity that is particularly difficult to estimate in practice (Beirlant et al., 1997). The same occurs with the high-order cumulants. Their estimators involve high-order moments, and hence suffer from high variance. When the distribution of the residuals in (6) is Gaussian, the causal direction cannot be identified. In this case, it is possible to show that the distribution of the reversed residuals ϵi, the cause xi, and the effect yi, is Gaussian as a consequence of (9) and (11). This non-identifiability agrees with the general result of Shimizu et al. (2006), which indicates that non-Gaussian distributions are strictly required in the disturbance variables to carry out causal inference in linear models with additive independent noise. Finally, the fact that the Gaussianization effect is also expected in the multivariate case suggests a method to address causal inference problems in which the relationship between cause and effect is non-linear: It consists in mapping each observation xi and yi to a vector in an expanded feature space. One can them assume that the non-linear relation in the original input space is linear in the expanded feature space and compute the residuals of kernel ridge regressions in both directions. The direction in which the residuals are less Gaussian is identified as the causal one. 3. A Feature Expansion to Address Non-linear Causal Inference Problems We now proceed to relax the assumption that the causal relationship between the unidimensional random variables X and Y is linear. For this purpose, instead of working in the original space in which the samples {(xi, yi)}N i=1 are observed, we will assume that the model is linear in some expanded feature space {(φ(xi), φ(yi))}N i=1 for some mapping function φ( ) : R Rd. Importantly, this map preserves the property that if xi and yi are Non-linear Causal Inference using Gaussianity Measures equally distributed, so will be φ(xi) and φ(yi). According to the analysis presented in the previous section, the residuals of a linear model in the expanded space should be more Gaussian in the anti-causal direction than the residuals of a linear model in the causal direction, based on an increment of the differential entropy and on a reduction of the magnitude of the cumulants. The assumption we make is that the non-linear relation between X and Y in the original input space is linear in the expanded feature space. In this section we focus on obtaining the normalized residuals of linear models formulated in the expanded feature space. For this purpose, we assume that a kernel function k( , ) can be used to evaluate dot products in the expanded feature space. In particular, k(xi, xj) = φ(xi)Tφ(xj) and k(yi, yj) = φ(yi)Tφ(yj) for arbitrary xi and xj and yi and yj. Furthermore, we will not assume in general that φ(xi) and φ(yi) have been whitened, only centered. Whitening is a linear transformation which is not expected to affect to the level of Gaussianity of the residuals. However, once these residuals have been obtained they will be whitened in the expanded feature space. Later on we describe how to center the data in the expanded feature space. For now on, we will assume this step has already been done. 3.1 Non-linear Model Description and Fitting Process Assume that the relation between X and Y is linear in an expanded feature space φ(yi) = Aφ(xi) + ϵi , ϵi xi, (19) where ϵi is i.i.d. non-Gaussian additive noise. Given N paired observations {(xi, yi)}N i=1 drawn i.i.d. from P(X, Y), define the matrices Φx = (φ(x1), . . . , φ(x N)) and Φy = (φ(y1), . . . , φ(y N)) of size d N. The estimate of A that minimizes the sum of squared errors is ˆA = ΓΣ 1, where Γ = ΦyΦT x and Σ = ΦxΦT x . Unfortunately, when d > N, where d is the number of variables in the feature expansion, the matrix Σ 1 does not exist and ˆA is not unique. This means that there is an infinite number of solutions for ˆA with zero squared error. To avoid the indetermination described above and also to alleviate over-fitting, we propose a regularized estimator. Namely, 1 2||φ(yi) Aφ(xi)||2 2 + τ 1 2||A||2 F , (20) where || ||2 denotes the ℓ2-norm and || ||F denotes the Frobenius norm. In this last expression τ > 0 is a parameter that controls the amount of regularization. The minimizer of (20) is ˆA = ΓΣ 1, where Γ = ΦyΦT x and Σ = τI+ΦxΦT x . The larger the value of τ, the closer the entries of ˆA are to zero. Furthermore, using the matrix inversion lemma we have that Σ 1 = τ 1I τ 1Φx V 1ΦT x , where V = (τI + Kx,x) 1 and Kx,x = ΦT x Φx is a kernel matrix whose entries are given by k(xi, xj). After some algebra it is possible to show that ˆA = ΓΣ 1 = Φy VΦT x , (21) which depends only on the matrix V. This matrix be computed with cost O(N3). We note that the estimate obtained in (21) coincides with the kernel conditional embedding operator described by Song et al. (2013) for mapping conditional distributions into infinite dimensional feature spaces using kernels. Hern andez-Lobato, Morales-Mombiela, Lopez-Paz and Su arez 3.2 Obtaining the Matrix of Inner Products of the Residuals A first step towards obtaining the whitened residuals in feature space (which will be required for the estimation of their level of Gaussianity) is to compute the matrix of inner products of these residuals (kernel matrix). For this, we define ϵi = φ(yi) ˆAφ(xi). Thus, ϵT i ϵj = h φ(yi) ˆAφ(xi) i T h φ(yj) ˆAφ(xj) i = φ(yi)Tφ(yj) φ(yi)T ˆAφ(xj) φ(xi)T ˆAφ(yj) + φ(xi)T ˆA ˆAφ(xj) , (22) for two arbitrary residuals ϵi and ϵj in feature space. In general, if we denote with Kϵ to the matrix whose entries are given by ϵT i ϵj and define Ky,y = ΦT y Φy, we have that Kϵ = Ky,y Ky,y VKx,x Kx,x VKy,y + Kx,x VKy,y VKx,x , (23) where we have used the definition of ˆA in (21). This expression only depends on the kernel matrices Kx,x and Ky,y and the matrix V, and can be computed with cost O(N3). 3.3 Centering the Input Data and Centering and Whitening the Residuals An assumption made in Section 2 was that the samples of the random variables X and Y are centered, i.e., they have zero mean. In this section we show how to carry out this centering process in feature space. Furthermore, we also show how to center the residuals of the fitting process, which are also whitened. Whitening is a standard procedure in which the data are transformed to have the identity matrix as the covariance matrix. It also corresponds to projecting the data onto all the principal components, and scaling them to have unit standard deviation. We show how to center the data in feature space. For this, we follow Sch olkopf et al. (1997) and work with: φ(xi) = φ(xi) 1 j=1 φ(xj) , φ(yi) = φ(yi) 1 j=1 φ(yj) . (24) The consequence is that now the kernel matrices Kx,x and Ky,y are replaced by Kx,x = Kx,x 1NKx,x Kx,x1N + 1NKx,x1N , Ky,y = Ky,y 1NKy,y Ky,y1N + 1NKy,y1N , (25) where 1N is a N N matrix with all entries equal to 1/N. The residuals can be centered also in a similar way. Namely, Kϵ = Kϵ 1NKϵ Kϵ1N + 1NKϵ1N. We now explain the whitening of the residuals, which are now assumed to be centered. This process involves the computation of the eigenvalues and eigenvectors of the d d covariance matrix C of the residuals. This is done as in kernel PCA (Sch olkopf et al., 1997). Denote by ϵi to the centered residuals. The covariance matrix is C = N 1 PN i=1 ϵi ϵT i . The eigenvector expansion implies that Cvi = λivi, where vi denotes the i-th eigenvector and λi the i-th eigenvalue. The consequence is that N 1 PN k=1 ϵk ϵT k vi = λivi. Thus, the eigenvectors can be expressed as a combination of the residuals. Namely, vi = PN j=1 bi,j ϵj, Non-linear Causal Inference using Gaussianity Measures where bi,j = N 1 ϵT j vi. Substituting this result in the previous equation we have that N 1 PN k=1 ϵk ϵT k PN j=1 bi,j ϵj = λi PN j=1 bi,j ϵj. When we multiply both sides by ϵT l we obtain N 1 PN k=1 ϵT l ϵk ϵT k PN j=1 bi,j ϵj = λi PN j=1 bi,j ϵT l ϵj, for l = 1, . . . d, which is written in terms of kernels as Kϵ Kϵbi = λi N Kϵbi, where bi = (bi,1, . . . , bi,N)T. A solution to this problem is found by solving the eigenvalue problem Kϵbi = λi Nbi. We also require that the eigenvectors have unit norm. Thus, 1 = v T i vi = PN j=1 PN k=1 bi,jbi,k ϵT j ϵk = b T i Kϵbi = λi Nb T i bi, which means that bi has norm 1/ λi N. Consider now that bi is one eigenvector of Kϵ. Then, bi = 1/ λi N bi. Similarly, let λi be an eigenvalue of Kϵ. Then λi = λi/N. In summary, λi and bi,j, with i = 1, . . . , N and j = 1, . . . , N can be found with cost O(N3) by finding the eigendecomposition of Kϵ. The whitening process is carried out by projecting each residual ϵk onto each eigenvector vi and then multiplying by 1/ λi. The corresponding i-th component for the k-th residual, denoted by Zk,i, is Zk,i = 1/ λiv T i ϵk = 1/ λi PN j=1 bi,j ϵT j ϵk, and in consequence, the whitened residuals are Z = KϵBD = NBD 1 = N B, where B is a matrix whose columns contain each bi, B is a matrix whose columns contain each bi and D is a diagonal matrix whose entries are equal to 1/ λi. Each row of Z now contains the whitened residuals. 3.4 Inferring the Most Likely Causal Direction After having trained the model and obtained the matrix of whitened residuals Z in each direction, a suitable Gaussianity test can be used to determine the correct causal relation between the variables X and Y. Given the theoretical results of Section 2 one may be tempted to use tests based on entropy or cumulants estimation. Such tests may perform poorly in practice due to the difficulty of estimating high-order cumulants or differential entropy. In particular, the estimators of the cumulants involve high-order moments and hence, suffer from high variance. As a consequence, in our experiments we use a statistical test for Gaussianity based on the energy distance (Sz ekely and Rizzo, 2005), which has good power, is robust to noise, and does not have any adjustable hyper-parameters. Furthermore, in Appendix B we motivate that in the anti-causal direction one should also expect a smaller energy distance to the Gaussian distribution. Assume X and Y are two independent random variables whose probability distribution functions are F( ) and G( ). The energy distance between these distributions is defined as D2(F, G) = 2E[||X Y||] E[||X X ||] E[||Y Y ||] , (26) where || || denotes some norm, typically the ℓ2-norm; X and X are independent and identically distributed (i.i.d.); Y and Y are i.i.d; and E denotes expected value. The energy distance satisfies all axioms of a metric and hence characterizes the equality of distributions. Namely, D2(F, G) = 0 if and only if F = G. Furthermore, in the case of univariate random variables the energy distance is twice the Cram er-von Mises distance given by R (F(x) G(x))2 dx. Assume X = (x1, . . . , x N)T is a matrix that contains N random samples (one per each row of the matrix) from a d dimensional random variable with probability density f. The statistic described for testing for Gaussianity, i.e., H0 : f = N( |0, I) vs H1 : f = N( |0, I), Hern andez-Lobato, Morales-Mombiela, Lopez-Paz and Su arez that is described by Sz ekely and Rizzo (2005) is: Energy(X) = N j=1 E[||xj Y||] E[||Y Y ||] 1 j,k=1 ||xj xk|| where Y and Y are independent random variables distributed as N( |0, I) and E denotes expected value. Furthermore, the required expectations with respect to the Gaussian random variables Y and Y can be efficiently computed as described by Sz ekely and Rizzo (2005). The idea is that if f is similar to a Gaussian density N( |0, I), then Energy(X) is close to zero. Conversely, the null hypothesis H0 is rejected for large values of Energy(X). The data to test for Gaussianity is in our case Z, i.e., the matrix of whitened residuals, which has size N N. Thus, the whitened residuals have N dimensions. The direct introduction of these residuals into a statistical test for Gaussianity is not expected to provide meaningful results, as a consequence of the high dimensionality. Furthermore, in our experiments we have observed that it is often the case that a large part of the total variance is explained by the first principal component (see the supplementary material for evidence supporting this). That is, λi, i.e., the eigenvalue associated to the i-th principal component, is almost negligible for i 2. Additionally, we motivate in Appendix C that one should also obtain more Gaussian residuals, after projecting the data onto the first principal component, in terms of a reduction of the magnitude of the high-order cumulants. Thus, in practice, we consider only the first principal component of the estimated residuals in feature space. This is the component i with the largest associated eigenvalue λi. We denote such N-dimensional vector by z. Let zx y be the vector of coefficients of the first principal component of the residuals in feature space when the linear fit is performed in the direction X Y. Let zy x be the vector of coefficients of the first principal component of the residuals in feature space when the linear fit is carried out in the direction Y X. We define the measure of Gaussianization of the residuals as G = Energy(zx y)/N Energy(zy x)/N, where Energy( ) computes the statistic of the energy distance test for Gaussianity described above. Note that we divide each statistic by N to cancel the corresponding factor that is considered in (27). Since in this test larger values for the statistic corresponds to larger deviations from Gaussianity, if G > 0 the direction X Y is expected to be more likely the causal direction. Otherwise, the direction Y X is preferred. The variance of G will depend on the sample size N. Thus, ideally one should use the difference between the p-values associated to each statistic as the confidence in the decision taken. Unfortunately, computing these p-values is expensive since the distribution of the statistic under the null hypothesis must be approximated via random sampling. In our experiments we measure the confidence of the decision in terms of the absolute value of G, which is faster to obtain and we have found to perform well in practice. 3.5 Parameter Tuning and Error Evaluation Assume that a squared exponential kernel is employed in the method described above. This means that k(xi, xj) = exp γ(xi xj)2 , where γ > 0 is the bandwidth of the kernel. The same is assumed for k(yi, yj). Therefore, two hyper-parameters require adjustment Non-linear Causal Inference using Gaussianity Measures in the method described. These are the ridge regression regularization parameter τ and the kernel bandwidth γ. They must be tuned in some way to produce the best possible fit in each direction. The method chosen to guarantee this is a grid search guided by a 10-fold cross-validation procedure, which requires computing the squared prediction error over unseen data. In this section we detail how to evaluate these errors. Assume that M new paired data instances are available for validation. Let the two matrices Φynew = (φ(ynew 1 ), . . . , φ(ynew M )) and Φxnew = (φ(xnew 1 ), . . . , φ(xnew M )) summarize these data. Define ϵnew i = φ(ynew i ) ˆAφ(xnew i ). After some algebra, it is possible to show that the sum of squared errors for the new instances is: i=1 (ϵnew i )Tϵnew i = trace Kynew,ynew Kynew,y VKT xnew,x Kxnew,x VKT ynew,y + Kxnew,x VKy,y VKT xnew,x where Kynew,ynew = ΦT ynewΦynew, Kynew,y = ΦT ynewΦy and Kx,xnew = ΦT x Φxnew. Of course, the new data must be centered before computing the error estimate. This process is similar to the one described in the previous section. In particular, centering can be simply carried out by working with the modified kernel matrices: Kxnew,x = Kxnew,x MNKx,x Kxnew,x1N + MNKx,x1N , Kynew,y = Kynew,y MNKy,y Kynew,y1N + MNKy,y1N , Kynew,ynew = Kynew,ynew MNKy,ynew Kynew,y MT N + MNKy,y MT N , (29) where MN is a matrix of size M N with all components equal to 1/N. In this process, the averages employed for the centering step are computed using only the observed data. A disadvantage of the squared error is that it strongly depends on the kernel bandwidth parameter γ. This makes it difficult to choose this hyper-parameter in terms of such a performance measure. A better approach is to choose both γ and τ in terms of the explained variance by the model. This is obtained as follows: Explained-Variance = 1 E/MVarynew, where E denotes the squared prediction error and Varynew the variance of the targets. The computation of the error E is done as described previously and Varynew is simply the average of the diagonal entries in Kynew,ynew. 3.6 Finding Pre-images for Illustrative Purposes The kernel method described above expresses its solution as feature maps of the original data points. Since the feature map φ( ) is usually non-linear, we cannot guarantee the existence of a pre-image under φ( ). That is, a point y such that φ(y) = ˆAφ(x), for some input point x. An alternative to amend this issue is to find approximate pre-images, which can be useful to make predictions or plotting results (Sch olkopf and Smola, 2002). In this section we describe how to find this approximate pre-images. Assume that we have a new data instance xnew for which we would like to know the associated target value ynew, after our kernel model has been fitted. The predicted value in Hern andez-Lobato, Morales-Mombiela, Lopez-Paz and Su arez feature space is: φ(ynew) = Φy VΦT x φ(xnew) = Φy Vkx,xnew = i=1 αiφ(yi) , (30) where kx,xnew contains the kernel evaluations between each entry in x (i.e., the observed samples of the random variable X) and the new instance. Finally, each αi is given by a component of the vector Vkx,xnew. The approximate pre-image of φ(ynew), ynew, is found by solving the following optimization problem: ynew = arg min u ||φ(ynew) φ(u)||2 2 = arg min u 2αTky,u + k(u, u) , (31) where ky,u is a vector with the kernel values between each yi and u, and k(u, u) = φ(u)Tφ(u). This is a non-linear optimization problem than can be solved approximately using standard techniques such as gradient descent. In particular, the computation of the gradient of ky,u with respect to u is very simple in the case of the squared exponential kernel. 4. Data Transformation and Detailed Causal Inference Algorithm The method for causal inference described in the previous section relies on the fact that both random variables X and Y are equally distributed. In particular, if this is the case, φ(xi) and φ(yi), i.e., the maps of xi and yi in the expanded feature space, will also be equally distributed. This means that under such circumstances one should expect residuals that are more Gaussian in the anti-causal direction due to a reduction of the magnitude of the high order cumulants and an increment of the differential entropy. The requirement that X and Y are equally distributed can be easily fulfilled in the case of continuous univariate data by transforming x, the samples of X, to have the same empirical distribution as y, the samples of Y. Consider x = (x1, . . . , x N)T and y = (y1, . . . , y N)T to be N paired samples of X and Y, respectively. To guarantee the same distribution for these samples we only have to replace x by x, where each component of x, xi, is given by xi = ˆF 1 y ( ˆFx(xi)), with ˆF 1 y ( ) the empirical quantile distribution function of the random variable Y, estimated using y. Similarly, ˆFx( ) is the empirical cumulative distribution function of X, estimated using x. This operation is known as the probability integral transform. One may wonder why should x be transformed instead of y. The reason is that by transforming x the additive noise hypothesis made in (1) and (6) is preserved. In particular, we have that yi = f( ˆF 1 x ( ˆFy( xi))) + ϵi. On the other hand, if y is transformed instead, the additive noise model will generally not be valid anymore. More precisely, the transformation that computes y in such a way that it is distributed as x is yi = ˆF 1 x ( ˆFy(yi)), i. Thus, under this transformation we have that yi = ˆF 1 x ( ˆFy(f(xi) + ϵi)), which will lead to the violation of the additive noise model. Of course, transforming x requires the knowledge of the causal direction. In practice, we will transform both x and y and consider that the correct transformation is the one that leads to the highest level of Gaussianization of the residuals in the feature space, after fitting the model in each direction. That is, the transformation that leads to the highest value of G is expected to be the correct one. We expect that when y is transformed instead Non-linear Causal Inference using Gaussianity Measures of x, the Gaussianization effect of the residuals is not as high as when x is transformed, as a consequence of the violation of the additive noise model. This will allow to determine the causal direction. We do not have a theoretical result confirming this statement, but the good results obtained in Section 6.1 indicate that this is the case. The details of the complete causal inference algorithm proposed are given in Algorithm 1. Besides a causal direction, e.g., X Y or Y X, this algorithm also outputs a confidence level in the decision made which is defined as max(|G x|, |G y|), where G x = Energy(z x y)/N Energy(zy x)/N denotes the estimated level of Gaussianization of the residuals when x is transformed to have the same distribution as y. Similarly, G y = Energy(z y x)/N Energy(z y x)/N denotes the estimated level of Gaussianization of the residuals when x and y are swapped and y is transformed to have the same distribution as x. Here z x y contains the first principal component of the residuals in the expanded feature space when trying to predict y using x. The same applies for zy x, z y x and zx y. However, the residuals are obtained this time when trying to predict x using y, when trying to predict x using y and when trying to predict y using x, respectively. Recall that the reason for keeping only the first principal component of the residuals is described in Section 3.4. Assume |G x| > |G y|. In this case we prefer the transformation of x to guarantee that the cause and the effect have the same distribution. The reason is that it leads to a higher level of Gaussianization of the residuals, as estimated by the energy statistical test. Now consider that G x > 0. We prefer the direction X Y because the residuals of a fit in that direction are less Gaussian and hence have a higher value of the statistic of the energy test. By contrast, if G x < 0 we prefer the direction Y X for the same reason. In the case that |G x| < |G y| the reasoning is the same and we prefer the transformation of y. However, because we have swapped x and y for computing G y, the decision is the opposite as the previous one. Namely, if G y > 0 we prefer the direction Y X and otherwise we prefer the direction X Y. The confidence in the decision (i.e., the estimated level of Gaussianization) is always measured by max(|G x|, |G y|). The algorithm uses a squared exponential kernel with bandwidth parameter γ and the actual matrices ˆA x y and ˆAy x, of potentially infinite dimensions, need not be evaluated in closed form in practice. As indicated in Section 3, all computations are carried out efficiently with cost O(N3) using inner products, which are evaluated in terms of the corresponding kernel function. All hyper-parameters, i.e., τ and γ, are chosen using a grid search method guided by a 10-fold cross-validation process. This search maximizes the explained variance of the left-out data and 10 potential values are considered for both τ and γ. 5. Related Work The Gaussianity of residuals was first employed for causal inference by Hern andez-Lobato et al. (2011). These authors analyze auto-regressive (AR) processes and show that a similar asymmetry as the one described in this paper can be used to determine the temporal direction of a time series in the presence of non-Gaussian noise. Namely, when fitting an AR process to a reversed time series, the residuals obtained follow a distribution that is closer to a Gaussian distribution. Nevertheless, unlike the work described here, the method proposed by Hern andez-Lobato et al. (2011) cannot be used to tackle multidimensional or non-linear causal inference problems. In their work, Hern andez-Lobato et al. (2011) show Hern andez-Lobato, Morales-Mombiela, Lopez-Paz and Su arez Algorithm 1: Causal Inference Based on the Gaussianity of the Residuals (GR-AN) Data: Paired samples x and y from the random variables X and Y. Result: An estimated causal direction alongside with a confidence level. 1 Standardize x and y to have zero mean and unit variance; 2 Transform x to compute x ; // This guarantees that x is distributed as y. 3 ˆA x y Fit Model( x, y) ; // This also finds the hyper-parameters τ and γ. 4 z x y Obtain Residuals( x, y, ˆA x y) ; // First PCA component in feature space. 5 ˆAy x Fit Model(y, x) ; // Fit the model in the other direction 6 zy x Obtain Residuals(y, x, ˆAy x) ; // First PCA component in feature space. 7 G x Energy(z x y)/N Energy(zy x)/N ; // Get the Gaussianization level. 8 Swap x and y and repeat lines 2-7 of the algorithm to compute G y. 9 if |G x| > |G y| then 10 if G x > 0 then 11 Output: X Y with confidence |G x| 13 Output: Y X with confidence |G x| 16 if G y > 0 then 17 Output: Y X with confidence |G y| 19 Output: X Y with confidence |G y| some advantages of using statistical tests based on measures of Gaussianity to determine the temporal direction of a time series, as a practical alternative to statistical tests based on the independence of the cause and the residual. The motivation for these advantages is that the former tests are one-sample tests while the later ones are two-sample tests. The previous paper is extended by Morales-Mombiela et al. (2013) to consider multidimensional AR processes. However, this work lacks a theoretical result that guarantees that the residuals obtained when fitting a vectorial AR process in the reversed (antichronological) direction will follow a distribution closer to a Gaussian distribution. In spite of this issue, extensive experiments with simulated data suggest the validity of such conjecture. Furthermore, a series of experiments show the superior results of the proposed rule to determine the direction of time, which is based on measures of Gaussianity, and compared with other state-of-the-art methods based on tests of independence. The problem of causal inference under continuous-valued data has also been analyzed by Shimizu et al. (2006). The authors propose a method called LINGAM that can identify the causal order of several variables when assuming that (a) the data generating process is linear, (b) there are no unobserved co-founders, and (c) the disturbance variables have non Gaussian distributions with non-zero variances. These assumptions are required because LINGAM relies on the use of Independent Component Analysis (ICA). More specifically, let x denote a vector that contains the variables we would like to determine the causal Non-linear Causal Inference using Gaussianity Measures order of. LINGAM assumes that x = Bx + e, where B is a matrix that can be permuted to strict lower triangularity if one knows the actual causal ordering in x, and e is a vector of non-Gaussian independent disturbance variables. Solving for x, one gets x = Ae, where A = (I B) 1. The A matrix can be inferred using ICA. Furthermore, given an estimate of A, B can be obtained to find the corresponding connection strengths among the observed variables, which can then be used to determine the true causal ordering. LINGAM has been extended to consider linear relations among groups of variables (Entner and Hoyer, 2012; Kawahara et al., 2012). In real-world data, causal relationships tend to be non-linear, a fact that questions the usefulness of linear methods. Hoyer et al. (2009) show that a basic linear framework for causal inference can be generalized to non-linear models. For non-linear models with additive noise, almost any non-linearities (invertible or not) will typically yield identifiable models. In particular, Hoyer et al. (2009) assume that yi = f(xi)+ϵi, where f( ) is a possibly non-linear function, xi is the cause variable, and ϵi is some independent and random noise. The proposed causal inference mechanism consists in performing a non-linear regression on the data to get an estimate of f( ), ˆf( ), and then calculate the corresponding residuals ˆϵi = yi ˆf(xi). Then, one may test whether ˆϵi is independent of xi or not. The same process is repeated in the other direction. The direction with the highest level of independence is chosen as the causal one. In practice, the estimate ˆf( ) is obtained using Gaussian processes for regression, and the HSIC test (Gretton et al., 2008) is used as the independence criterion. This method has obtained good performance results (Janzing et al., 2012) and it has been extended to address problems where the model is yi = h(f(xi) + ϵi), for some invertible function h( ) (Zhang and Hyv arinen, 2009). A practical difficulty is however that such a model is significantly harder to fit to the data. In the work by Mooij et al. (2010), a method for causal inference is proposed based on a latent variable model, used to incorporate the effects of un-observed noise. In this context, it is considered that the effect variable is a function of the cause variable and an independent noise term, not necessarily additive, that is, yi = f(xi, ϵi), where xi is the cause variable and ϵi is some independent and random noise. The causal direction is then inferred using standard Bayesian model selection. In particular, the preferred direction is the one under which the corresponding model has the largest marginal likelihood, where the marginal likelihood is understood as a proxy for the Kolmogorov complexity. This method suffers from several implementation difficulties, including the intractability of the marginal likelihood computation. However, it has shown encouraging results on synthetic and real-world data. Janzing et al. (2010) consider the problem of inferring linear causal relations among multi-dimensional variables. The key point here is to use an asymmetry between the distributions of the cause and the effect that occurs if the covariance matrix of the cause and the matrix mapping the cause to the effect are independently chosen. This method exhibits the nice property that applies to both deterministic and stochastic causal relations, provided that the dimensionality of the involved random variables is sufficiently high. The method assumes that yi = Axi + ϵi, where xi is the cause and ϵi is additive noise. Namely, denote with ˆΣ to the empirical covariance matrix of the variables in each xi. Given an estimate of A, ˆA, the method computes x y = log trace( ˆA ˆΣ ˆAT) log trace( ˆΣ)+log trace( ˆA ˆAT)+d, where d is the dimension of xi. This process is repeated to compute y x where xi and Hern andez-Lobato, Morales-Mombiela, Lopez-Paz and Su arez yi are swapped. The asymmetry described states that x y should be close to zero while y x should not. Thus, if | x y| > | y x|, xi is expected to be the cause. Otherwise, the variables in yi are predicted to be cause instead. Finally, a kernelized version of this method is also described by Chen et al. (2013). Most of the methods introduced in this section assume some form of noise in the generative process of the effect. Thus, their use is not justified in the case of noiseless data. Janzing et al. (2012) describe a method to deal with these situations. In particular, the method makes use of information geometry to identify an asymmetry that can be used for causal inference. The asymmetry relies on the idea that the marginal distribution of the cause variable, denoted by p(x), is expected to be chosen independently from the mapping mechanism producing the effect variable, denoted by the conditional distribution p(y|x). Independence is defined here as orthogonality in the information space, which allows to describe a dependence that occurs between p(y) and p(x|y) in the anti-causal direction. This dependence can be then used to determine the causal order. A nice property of this method is that this asymmetry between the cause and the effect becomes very simple if both random variables are deterministically related. Remarkably, the method also performs very well in noisy scenarios, although no theoretical guarantees are provided in this case. A similar method for causal inference to the last one is described by Chen et al. (2014). These authors also consider that p(x) and p(y|x) fulfill some sort of independence condition, and that this independence condition does not hold for the anti-causal direction. Based on this, they define an uncorrelatedness criterion between p(x) and p(y|x), and show an asymmetry between the cause and the effect in terms of a certain complexity metric on p(x) and p(y|x), which is less than the same complexity metric on p(y) and p(x|y). The complexity metric is calculated in terms of a reproducing kernel Hilbert space embedding (EMD) of probability distributions. Based on the complexity metric, the authors propose an efficient kernel-based algorithm for causal discovery. In Section 2.3 we have shown that in the multivariate case one should expect higher entropies in the anti-causal direction. Similar results have been obtained in the case of nonlinear relations and the univariate data case (Hyv arinen and Smith, 2013; Kpotufe et al., 2014). Assume x, y R and the actual causal model to be y = f(x) + d, with x d and f( ) an arbitrary function. Let e be the residual of a fit performed in the anti-causal direction. Section 5.2 of the work by Hyv arinen and Smith (2013) shows that the likelihood ratio R of each model (i.e., the model fitted in the causal direction and the model fitted in the anti-causal direction) converges in the presence of infinite data to the difference between the sum of the entropies of the independent variable and the residual in each direction. Namely, R H(x) H(d/σd) + H(y) + H(e/σe) + log σd log σe, where σd and σe denote the standard deviation of the errors in each direction. If R > 0, the causal direction is chosen. By contrast, if R < 0 the anti-causal direction is preferred. The process of evaluating R involves the estimation of the entropies of four univariate random variables, i.e., x, d, y and e and the standard deviation of the errors d and e, which need not be equal. The non-linear functions are estimated as in the work by Hoyer et al. (2009) using a Gaussian process. The entropies are obtained using a maximum entropy approximation under the hypothesis that the distributions of these variables are not far from Gaussian (Hyv arinen, 1998). The resulting method is called non-linear maximum entropy (NLME). A practical difficulty is however that the estimation of the entropy is a very difficult task, Non-linear Causal Inference using Gaussianity Measures even in one dimension (Beirlant et al., 1997). Thus, the NLME method is adapted in an ad-hoc manner with the aim of obtaining better results in certain difficult situations with sparse residuals. More precisely, if H(x) and H(y) are ignored and Laplacian residuals are assumed R log σe log σd. That is, the model with the minimum error is preferred. The errors are estimated however in terms of the absolute deviations (because of the Laplacian assumption). This method is called mean absolute deviation (MAD). Finally, Kpotufe et al. (2014) show the consistency of the noise additive model, give a formal proof for R 0 (see Lemma 1), and propose to estimate H(x), H(y), H(d) and H(e) using kernel density estimators. Note that if x and y are equally distributed, H(x) = H(y) and the condition R 0 implies H(e) H(d). Nevertheless, σd and σe are in general different (see the supplementary material for an illustrative example). This means that in the approach of Hyv arinen and Smith (2013) and Kpotufe et al. (2014) it is not possible to make a decision directly on the basis of a Gaussianization effect on the residuals. The proposed method GR-AN, introduced in Section 4, differs from the approaches described in the previous paragraph in that it does not have to deal with the estimation of four univariate entropies, which can be a particularly difficult task. By contrast, it relies on statistical tests of deviation from Gaussianity to infer the causal direction. Furthermore, the tests employed in our method need not be directly related to entropy estimation. This is particularly the case of the energy test suggested in Section 3.4. Not having to estimate differential entropies is an advantage of our method confirmed by the results that are obtained in the experiments section. In particular, we have empirically observed that that GR-AN performs better than the two methods for causal inference NLME and MAD that have been described in the previous paragraph. GR-AN also performs better than GR-ENT, a method that uses, instead of statistical tests of Gaussianity, a non-parametric estimator of the entropy (Singh et al., 2003). 6. Experiments We carry out experiments to validate the method proposed in this paper, and empirically verify that the model residuals in the anti-causal direction are more Gaussian that the model residuals in the causal direction due to a reduction of the high-order cumulants and an increment of the differential entropy. From now on, we refer to our method as GR-AN (Gaussianity of the Residuals under Additive Noise). Furthermore, we compare the performance of GR-AN with four other approaches from the literature on causal inference, reviewed in Section 5. First, LINGAM (Shimizu et al., 2006), a method which assumes an additive noise model, but looks for independence between the cause and the residuals. Second, IR-AN (Independence of the Residuals under Additive Noise), by Hoyer et al. (2009). Third, a method based on information geometry, IGCI (Janzing et al., 2012). Fourth, a method based on Reproducing Kernel Hilbert Space Embeddings (EMD) of probability distributions (Chen et al., 2014). Fifth, the two methods for non-linear causal inference based on entropy estimation described by Hyv arinen and Smith (2013), NLME and MAD. Sixth, the same GR-AN method, but where we omit the transformation to guarantee that the random variables X and Y are equally distributed. This method is called GR-AN . Last, we also compare results with two variants of GR-AN that are not based on the energy distance to measure the level of Gaussianity of the residuals. These are GR-K4, which uses Hern andez-Lobato, Morales-Mombiela, Lopez-Paz and Su arez the empirical estimate of the fourth cumulant (kurtosis) to determine the causal direction (it chooses the direction with the largest estimated fourth cumulant); and GR-ENT, which uses a non-parametric estimator of the entropy (Singh et al., 2003) to determine the causal direction (the direction with the smallest entropy is preferred); The hyper-parameters of the different methods are set as follows. In LINGAM, we use the parameters recommended by the implementation provided by the authors. In IR-AN, NLME and MAD we employ a Gaussian process whose hyper-parameters are found by type II maximum likelihood. Furthermore, in IR-AN the HSIC test is used to assess independence between the causes and the residuals. In NLME the entropy estimator is the one described by Hyv arinen (1998). In IGCI, we test different normalizations (uniform and Gaussian) and different criteria (entropy or integral) and report the best observed result. In EMD and synthetic data, we follow Chen et al. (2014) to select the hyper-parameters. In EMD and real-world data, we evaluate different hyper-parameters and report the results for the best combination found. In GR-AN, GR-K4 and GR-ENT the hyper-parameters are found via cross-validation, as described in Section 4. The number of neighbors in the entropy estimator of GR-ENT is set to 10, a value that we have observed to give a good trade-off between bias and variance. Finally, in GR-AN, GR-K4, and GR-ENT we transform the data so that both variables are equally distributed, as indicated in Section 4. The confidence in the decision is computed as indicated by Janzing et al. (2012). More precisely, in LINGAM the confidence is given by the maximum absolute value of the entries in the connection strength matrix B. In IGCI we employ the absolute value of the difference between the corresponding estimates (entropy or integral) in each direction. In IR-AN the confidence level is obtained as the maximum of the two p-values of the HSIC test. In EMD we use the absolute value of the difference between the estimates of the corresponding complexity metric in each direction, as described in (Chen et al., 2014). In NLME and MAD the confidence level is given by the absolute value of the difference between the outputs of the entropy estimators in each direction (Hyv arinen and Smith, 2013). In GRK4 we use the absolute difference between the estimated fourth cumulants. In GR-ENT we use the absolute difference between the estimates of the entropy. Finally, in GR-AN we follow the details given in Section 4 to estimate the confidence in the decision. To guarantee the exact reproducibility of the different experiments described in this paper, the source-code for all methods and data sets is available in the public repository https://bitbucket.org/dhernand/gr_causal_inference. 6.1 Experiments with Synthetic Data We carry out a first batch of experiments on synthetic data. In these experiments, we employ the four causal mechanisms that map X to Y described by Chen et al. (2014). They involve linear and non-linear functions, and additive and multiplicative noise effects: M1: yi = 0.8xi + ϵi. M2: yi = xiϵi. M3: yi = 0.3x3 i + ϵi. M4: yi = atan(xi)3 + ϵi. Non-linear Causal Inference using Gaussianity Measures The noise ϵi can follow four different types of distributions: (i) A generalized Gaussian distribution with shape parameter equal to 10 (an example of a sub-Gaussian distribution); (ii) a Laplace distribution (an example of a super-Gaussian distribution); (iii) a Gaussian distribution; and (iv) a bimodal distribution with density p(ϵi) = 0.5N(ϵi|m, s)+0.5N(ϵi| m, s), where m = .63 and s = .1. The Laplace distribution and the Gaussian distribution are adjusted to have the same variance as the generalized Gaussian distribution. The bimodal distribution already has the same variance as the generalized Gaussian distribution. As indicated by Chen et al. (2014), in these experiments, the samples from the cause variable X are generated from three potential distributions: 2π exp{ x2/2}. p2(x) = 1 2 0.5π exp{ (x + 1)2/0.5} + 1 2 0.5π exp{ (x 1)2/0.5}. p3(x) = 1 4 0.5π exp{ (x+1.5)2/0.5}+ 1 2 0.5π exp{ x2/0.5}+ 1 4 0.5π exp{ (x 1.5)2/0.5}. These are unimodal, bimodal, and trimodal distributions, respectively. Figure 5 displays a representative example of the plots of different combinations of distributions and mapping mechanisms when the noise follows a generalized Gaussian distribution with shape parameter equal to 10. The plots for Laplace, Gaussian or bimodal distributed noise look similar to these ones. The assumptions made by proposed method, i.e., GR-AN, are valid in the case of all the causal mechanisms, except for M2, which considers multiplicative noise, and in the case of all cause distributions, p1, p2 and p3. The only type of noise that violates the assumptions made by GR-AN is the case of Gaussian noise. In particular, under Gaussian noise GR-AN cannot infer the causal direction using Gaussianity measures because the actual residuals are already Gaussian. The average results of each method on 100 repetitions of each potential causal mechanism, distribution for the effect, and noise distribution are displayed in Table 1. The size of each paired samples of X and Y is set to 500 in these experiments. We observe that when the assumptions made by the proposed method, GR-AN, are satisfied, it identifies the causal direction on a very high fraction of the 100 repetitions considered. However, when these assumptions are not valid, e.g., in the case of the M2 causal mechanism, which has multiplicative noise, the performance worsens. The same happens when the distribution of the residuals is Gaussian. In these experiments, LINGAM tends to fail when the causal relation is strongly non-linear. This is the case of the causal mechanism M3. LINGAM also has problems when all the independent variables are Gaussian. Furthermore, all methods generally fail in the case of independent Gaussian variables that are linearly related. This corresponds to the causal mechanism M1, the distribution p1(x) for the cause, and the Gaussian distribution for the noise. The reason for this is that in this particular scenario the causal direction is not identifiable (Shimizu et al., 2006). IGCI and EMD sometimes fail in the case of the causal mechanism M1 and M4. However, they typically correctly identify the causal direction in the case of the mechanism M2, which has non-additive noise, and where the other methods tend to fail. Finally, IR-AN performs slightly better than GR-AN, especially in the case of additive Gaussian noise, where GR-AN is unable to identify the causal direction. MAD provides very bad results for some of the mechanism considered, i.e., M1 and M4. Surprisingly this is the case even for Laplacian additive noise, which is Hern andez-Lobato, Morales-Mombiela, Lopez-Paz and Su arez 10 5 0 5 10 10 5 0 5 10 10 5 0 5 10 10 5 0 5 10 10 5 0 5 10 10 5 0 5 10 Figure 5: Plots of different distributions and mechanisms when the noise follows a generalized Gaussian distribution with shape parameter equal to 10. the hypothesis made by MAD. This bad behavior is probably a consequence of ignoring the entropies of X and Y in this method. NLME and GR-ENT give worse results than GR-AN in some particular cases, e.g., p1 and the causal mechanism M3. We believe this is related to the difficulty of estimating differential entropies in practice. GR-AN performs very similar to GR-AN. This indicates that in practice one may ignore the transformation that guarantees that X and Y are equally distributed. GR-K4 also gives similar results to GR-AN, probably because in these experiments the tails of the residuals are not very heavy. An overall comparison of the different methods evaluated is shown in Figure 6. This figure displays several radar charts that indicate the average accuracy of each method for the different types of noise considered and for each mechanism M1, M2, M3 and M4. In particular, for a given method and a given type of noise, the radius of each portion of the pie is proportional to the corresponding average accuracy of the method across the distributions p1, p2 and p3 for the cause. The pie at the bottom corresponds to 100% accuracy for each causal mechanism. The conclusions derived from this figure are similar to the ones obtained Non-linear Causal Inference using Gaussianity Measures from Table 1. In particular, IR-AN performs very well, except for multiplicative noise (M3), closely followed by GR-AN, GR-AN , and GR-K4, which give similar results. The methods perform very poorly in the case of additive Gaussian noise, since they cannot infer the actual causal direction in that situation. NLME and GR-ENT have problems in the case of the causal mechanism M3 and MAD in the case of the mechanisms M1 and M4. LINGAM also performs bad in the case of M3 and IGCI and EMD have problems in the case of the mechanisms M1 and M4. IGCI, EMD and MAD are the only methods performing well in the case of M2, the mechanism with non-additive noise. We have repeated these experiments for other samples sizes, e.g., 100, 200 300 and 1000. The results obtained are very similar to the ones reported here, except when the number of samples is small and equal to 100. In that case NLME performs slightly better than the proposed approach GR-AN, probably because with 100 samples it is very difficult to accurately estimate the non-linear transformation that is required to guarantee that X and Y are equally distributed. These results of these additional experiments are found in the supplementary material. In summary, the good results provided by GR-AN and its variants in the experiments described indicate that (i) when the assumptions made by GR-AN are valid, the method has a good performance and (ii) there is indeed a Gaussianization effect in the residuals when the model is fitted under the anti-causal direction. Because GR-AN also performs well in these experiments, this indicates that a Gaussianization of the residuals may happen even when X and Y do not follow the same distribution. We give further evidence of the Gaussianization of the distribution of the residuals obtained when fitting the model under the anti-causal direction. For this, we analyze in detail three particular cases of GR-AN corresponding to the causal mechanism M3, the distribution p2(x) for the cause and each of the three types of additive noise considered. Namely, generalized Gaussian noise, Laplacian noise and bimodal noise. Figure 7 shows the predicted pre-images for new data instances when the model has been fitted in the causal ( X Y) and the anti-causal (Y X) direction alongside with a histogram of the first principal component of the residuals in feature space. A Gaussian approximation is also displayed as a solid black line on top of the histogram. In this case x, i.e., the samples of X, have been transformed to be equally distributed to y, i.e., the samples from Y. We observe that the distribution of the residuals in the anti-causal direction (Y X) is more similar to a Gaussian distribution. Furthermore, for the direction X Y the statistic of the energy based Gaussianity test for the first principal component of the residuals is respectively 4.02, 4.64 and 11.46, for generalized Gaussian, Laplacian and bimodal noise. Recall that the larger the value the larger the deviation from Gaussianity. In the case of the direction Y X, the energy statistic associated to the residuals is 0.97, 0.68 and 1.31, respectively. When y is transformed to have the same distribution as x similar results are observed (results not shown). However, the Gaussianization effect is not as strong as in this case, probably because it leads to the violation of the additive noise assumption. In summary, the figure displayed illustrates in detail the Gaussianization effect of the residuals when fitting the model in the anti-causal direction. Hern andez-Lobato, Morales-Mombiela, Lopez-Paz and Su arez Noise Generalized Gaussian Laplacian Gaussian Bimodal Algorithm Mechanism Mechanism Mechanism Mechanism M1 M2 M3 M4 M1 M2 M3 M4 M1 M2 M3 M4 M1 M2 M3 M4 p1(x) LINGAM 100% 1% 7% 93% 100% 0% 26% 89% 58% 0% 11% 12% 100% 0% 3% 100% IGCI 28% 100% 100% 34% 73% 100% 100% 96% 49% 100% 100% 63% 29% 98% 100% 44% EMD 34% 96% 100% 43% 70% 100% 100% 77% 48% 100% 100% 50% 50% 99% 100% 55% IR-AN 100% 31% 100% 99% 99% 26% 100% 97% 45% 34% 100% 75% 100% 46% 100% 100% NLME 100% 31% 22% 100% 100% 20% 15% 100% 45% 22% 8% 96% 100% 42% 17% 100% MAD 0% 94% 100% 0% 100% 99% 100% 100% 50% 99% 100% 97% 0% 92% 97% 0% GR-AN 100% 46% 95% 100% 100% 44% 80% 100% 48% 47% 6% 20% 100% 51% 100% 100% GR-AN 100% 0% 94% 100% 100% 0% 94% 100% 45% 0% 1% 17% 100% 0% 100% 100% GR-K4 100% 70% 94% 100% 99% 51% 96% 100% 52% 56% 2% 19% 100% 53% 100% 100% GR-ENT 100% 71% 76% 97% 93% 52% 51% 93% 41% 58% 26% 29% 100% 50% 84% 99% p2(x) LINGAM 100% 32% 68% 100% 100% 30% 99% 100% 100% 13% 91% 100% 100% 53% 91% 100% IGCI 40% 97% 100% 71% 98% 100% 100% 99% 72% 100% 100% 97% 39% 99% 100% 54% EMD 96% 99% 100% 98% 96% 100% 100% 98% 94% 100% 100% 90% 92% 95% 100% 95% IR-AN 100% 55% 100% 100% 100% 42% 100% 100% 100% 44% 100% 100% 100% 43% 100% 100% NLME 100% 47% 100% 100% 95% 36% 100% 100% 99% 36% 100% 100% 100% 38% 100% 100% MAD 0% 100% 100% 13% 5% 100% 100% 100% 2% 100% 100% 95% 0% 100% 85% 0% GR-AN 100% 46% 100% 100% 98% 32% 96% 100% 29% 40% 16% 54% 100% 32% 100% 100% GR-AN 100% 0% 94% 100% 100% 0% 65% 100% 39% 0% 0% 6% 100% 0% 100% 100% GR-K4 100% 50% 100% 100% 87% 44% 100% 98% 26% 51% 23% 47% 100% 30% 100% 100% GR-ENT 96% 56% 90% 98% 78% 40% 73% 92% 46% 45% 42% 48% 100% 36% 97% 100% p3(x) LINGAM 100% 0% 19% 100% 100% 0% 98% 100% 92% 0% 56% 67% 100% 8% 7% 100% IGCI 76% 100% 100% 83% 100% 100% 100% 100% 92% 100% 100% 97% 67% 100% 100% 87% EMD 90% 100% 100% 94% 98% 100% 100% 100% 92% 100% 100% 97% 96% 100% 100% 94% IR-AN 100% 40% 100% 100% 100% 26% 100% 100% 96% 35% 100% 100% 100% 45% 100% 100% NLME 100% 44% 100% 100% 100% 26% 100% 100% 74% 34% 98% 100% 100% 43% 99% 100% MAD 0% 97% 100% 1% 100% 100% 100% 100% 38% 98% 100% 98% 0% 95% 99% 0% GR-AN 100% 52% 90% 100% 100% 58% 100% 100% 46% 53% 36% 45% 100% 54% 98% 100% GR-AN 100% 0% 100% 100% 100% 0% 100% 100% 51% 0% 12% 26% 100% 0% 100% 100% GR-K4 100% 64% 94% 100% 99% 47% 100% 97% 43% 54% 15% 24% 100% 57% 100% 100% GR-ENT 100% 60% 42% 95% 94% 54% 86% 91% 47% 48% 48% 32% 100% 47% 43% 99% Table 1: Accuracy on the synthetic data for each method, causal mechanisms and type of noise. Non-linear Causal Inference using Gaussianity Measures Figure 6: Radar charts showing the average accuracy of each method for the different types of noise considered and for each mechanism M1, M2, M3 and M4. For a particular method and type of noise, the radius of each portion of the pie is proportional to the corresponding average accuracy of the method across the distributions p1, p2 and p3 for the cause. The pie at the bottom corresponds to 100% accuracy for each mechanism. 6.2 Experiments with Real Cause-effect Pairs A second batch of experiments is performed on the cause-effect pairs from the Cha Learn challenge2. This challenge contains 8073 cause-effect data pairs with a labeled causal direction. From these pairs, we consider a subset for our experiments. In particular, we select the 184 pairs that have (i) at least 500 samples, and (ii) a fraction of repeated instances for each random variable of at most 1%. The first criterion guarantees that there is enough data to make a decision with high confidence. The second criterion removes the pairs with discrete random variables, motivated by the transformation required by the GR-AN method to guarantee the equal distribution of X and Y. In particular, this transformation cannot be carried out on discrete data. Another advantage is that this filtering process of the data facilitates the experiments since several of the methods considered in the comparison (i.e., GR-AN, GR-ENT, GR-K4, IR-AN, NLME, MAD and EMD) are computationally very expensive. More precisely, they have a cubic cost with respect to the number of samples and they require tuning several hyper-parameters. The consequence is that evaluating these methods on the 8073 pairs available is therefore not feasible. 2. See https://www.codalab.org/competitions/1381 for more information. Hern andez-Lobato, Morales-Mombiela, Lopez-Paz and Su arez Causal Direction ( X Y) Anti-causal Direction (Y X) Gen. Gaussian 3 1 1 2 3 4 Fit of the Model Residuals First Principal Component 0.0 0.2 0.4 First Principal Component 2 1 0 1 2 3 0.0 0.2 0.4 Fit of the Model Residuals First Principal Component 0.0 0.2 0.4 0.6 First Principal Component 3 1 1 2 3 4 0.0 0.2 0.4 Fit of the Model Residuals First Principal Component 0.0 0.2 0.4 First Principal Component 0.0 0.2 0.4 Figure 7: (left column) Predicted pre-images obtained in the casual direction X Y alongside with a histogram of the first principal component of the residuals in feature space. A Gaussian fit is displayed as a solid black line. Results are shown for each type of additive noise considered. (right column) Same plots for the anti-causal direction Y X. Using these 184 pairs we evaluate each of the methods considered in the previous section and report the corresponding accuracy as a function of the decisions made. In these experiments we sample at random 500 instances from each cause-effect pair. This is a standard number of samples that has been previously employed by other authors in their experiments with cause-effect pairs (Janzing et al., 2012). Furthermore, a threshold value is fixed and the obtained confidence in the decision by each method is compared to such threshold. Only if the confidence is above the threshold value, the cause-effect pair is considered in the evaluation of the accuracy of the corresponding method. A summary of the results is displayed in Figure 8. This figure shows for each method, as a fraction of the decisions made, the accuracy on the filtered data sets on which the confidence on the decision is above the threshold value. A gray area has been drawn to indicate accuracy values that are not statistically different from random guessing (accuracy equal to 50%) using a bino- Non-linear Causal Inference using Gaussianity Measures mial test (p-value above 5%). We observe that IR-AN obtains the best results, followed by GR-AN, GR-AN , GR-K4, GR-ENT, IGCI, NLME and EMD. NLME, IGCI and EMD perform worse than GR-AN and GR-AN when a high number of decisions are made. The differences in performance between IR-AN and GR-AN, when 100% of the decisions are made, are not statistically significant (a paired t-test returns a p-value equal to 25%). The fact that the performance of GR-AN is similar to the performance of GR-AN also indicates that there is some Gaussianization of the residuals even though the two random variables X and Y are not equally distributed. We observe that the results of LINGAM and MAD are not statistically different from random guessing. This remarks the importance of nonlinear models and questions the practical utility of the MAD method. In these experiments, GR-ENT and GR-K4 perform worse than GR-AN, which remarks the benefits of using the energy distance to estimate the deviation from the Gaussian distribution, as a practical alternative to entropy or cumulant based measures. In summary, the results displayed in Figure 8 confirm that the level of Gaussianity of the residuals, estimated using statistical tests, is a useful metric that can be used to identify the causal order of two random variables. Furthermore, this figure also validates the theoretical results obtained in Section 2 which state that one should expect residuals whose distribution is closer to the Gaussian distribution when performing a fit in the anti-causal direction. In the supplementary material we include additional results for other sample sizes. Namely, 100, 200 and 300 samples. The results obtained are similar to the ones reported in Figure 8. However, the differences between GR-ENT, NLME, GR-AN and GR-AN are smaller. Furthermore, when the number of samples is small (i.e., equal to 100) GRAN performs worse than NLME, probably because with such a small number of samples it is difficult to estimate the non-linear transformation that guarantees that X and Y are equality distributed. In this section we have also evaluated the different methods compared in the previous experiments on a random subset of 184 cause-effect pairs chosen across the 8073 pairs of the Cha Learn challenge (results not shown). In this case, the ranking of the curves obtained looks similar to the ranking displayed in Figure 8, i.e., IR-AN performs best followed by GR-AN, GR-AN , NLME and EMD. However, all methods obtain worse results in general and none of them, except IR-AN, perform significantly different from random guessing. Finally, we also have evaluated the different methods in a subset of 82 cause-effect pairs extracted from the T ubingen cause-effect pairs3. We only considered those pairs with scalar cause and effect. The results obtained are displayed in Figure 9. In this case, the performance of the different methods is worse than the one displayed in Figure 8. Only IR-AN, IGCI and MAD perform significantly better than random guessing. Furthermore, GR-AN and GR-AN do not perform well in this set of cause-effect pairs. This is also the case of NLME. We believe that the reason for this bad performance is that in most of these pairs some of the variables take discrete or repeated values. In the case of GR-AN this makes infeasible to transform the two random variables, X and Y, so that they are equally distributed. Furthermore, the discrete random variables may have a strong impact in the tests for Gaussianity and in the estimation of the differential entropy. This could explain the bad performance of GR-AN , NLME and GR-ENT. 3. See http://webdav.tuebingen.mpg.de/cause-effect/ for more details. Hern andez-Lobato, Morales-Mombiela, Lopez-Paz and Su arez 0 20 40 60 80 100 0 20 40 60 80 100 184 Filtered Cause Effect Pairs Extracted from the Cha Learn Challenge Fraction of Decisions LINGAM IGCI EMD IR AN MAD NLME GR AN GR AN* GR K4 GR ENT Figure 8: Accuracy of each method, as a fraction of the decisions made, on the 184 filtered cause-effect pairs extracted from the Cha Learn challenge. The number of samples of each pair is equal to 500. Best seen in color. In summary, the results reported in this section have shown that in some cause-effect pairs, when the assumptions made by the proposed method are satisfied, there is indeed a Gaussianization effect in the residuals obtained when fitting the model in the anti-causal direction, and this asymmetry is useful to carry out causal inference on both synthetic and real inference problems. Our experiments also show that the transformation employed to guarantee that X and Y are equally distributed can be ignored in some cases without decreasing the performance. This indicates that our statement about the increased level of Gaussianity of the residuals, in terms of the increase of the entropy and the reduction of the high-order cumulants, may be true under more general assumptions. 7. Conclusions In this paper we have shown that in the case of cause-effect pairs with additive non-Gaussian noise there is an asymmetry that can be used for causal inference. In particular, assuming that the cause and the effect are equally distributed random variables, that are linearly related, the residuals of a least squares fit in the anti-causal direction are more Gaussian than the residuals of a linear fit in the causal direction due a reduction of the magnitude of the high-order cumulants. Furthermore, by extending the results of Hyv arinen and Smith (2013) based on information theory, we have shown that this Gaussianization effect is also present when the two random variables are multivariate due to an increment of the Non-linear Causal Inference using Gaussianity Measures 0 20 40 60 80 100 0 20 40 60 80 100 82 Cause Effect Pairs Extracted from the Tuebingen Database Fraction of Decisions LINGAM IGCI EMD IR AN MAD NLME GR AN GR AN* GR K4 GR ENT Figure 9: Accuracy of each method, as a fraction of the decisions made, on the 82 causeeffect pairs extracted from the Tuebingen database. Best seen in color. differential entropy. This motivates the use of kernel methods to work in an expanded feature space. This enables addressing non-linear cause-effect inference problems using simple techniques. Specifically, kernel methods allow to fit a linear model in an expanded feature space which will be non-linear in the original input space. Taking advantage of the asymmetry described, we have designed a method for non-linear causal inference, GR-AN (Gaussianity of the Residuals under Additive Noise). The method consists in computing the residuals of a linear model in an expanded feature space in both directions, i.e., X Y and Y X. The expected causal direction is the one in which the residuals appear to be more Gaussian (i.e., the magnitude of the high-order cumulants is reduced and the entropy is increased). Thus, a suitable statistical test that measures the level of non-Gaussianity of the residuals can be used to determine the causal direction. In principle, one may be tempted to use statistical tests based on entropy or cumulant estimation. However, our experiments show that one can obtain better results by using a test based on an energy distance to quantify the Gaussianity of the residuals. In particular, entropy estimation is an arguably difficult task and the estimators of the cumulants involve high-order moments which can lead to high variance. The effectiveness of the proposed method GR-AN has been illustrated in both synthetic and real-world causal inference problems. We have shown that in certain problems GRAN is competitive with state-of-the-art methods and that it performs better than related methods based on entropy estimation (Hyv arinen and Smith, 2013). The entropy can be understood as a measure of non-Gaussianity. Nevertheless, it is very difficult to estimate Hern andez-Lobato, Morales-Mombiela, Lopez-Paz and Su arez in practice. By contrast, the statistical test employed by GR-AR is not directly related to entropy estimation. This may explain the improvements observed. A limitation of the current formulation of GR-AN is that the distributions of the cause and the effect have to be equal. In the case of continuous univariate variables finding a transformation to make this possible is straightforward. Additionally, our experiments show that such a transformation can be side-stepped in some cases without a deterioration in performance. In any case, further research is needed to extend this analysis to remove this restriction. Finally, the performance of GR-AN on cause-effect pairs with discretized values is rather poor. We believe this is due to the fact that in this case, finding a transformation so that the cause and the effect are equally distributed is infeasible. Furthermore, the discretization process probably has a strong impact on the Gaussianity tests. Further evidence that make these observations more plausible is the fact that discretization has also a strong negative impact in the performance of the methods based on entropy estimation. Acknowledgments Daniel Hern andez-Lobato and Alberto Su arez gratefully acknowledge the use of the facilities of Centro de Computacin Cientfica (CCC) at Universidad Aut onoma de Madrid. These authors also acknowledge financial support from the Spanish Plan Nacional I+D+i, Grants TIN2013-42351-P and TIN2015-70308-REDT, and from Comunidad de Madrid, Grant S2013/ICE-2845 CASI-CAM-CM. David Lopez-Paz acknowledges support from Fundaci on la Caixa. Appendix A. In this appendix we show that if X and Y follow the same distribution and they have been centered, then the determinant of the covariance matrix of the random variable corresponding to ϵi, denoted with Cov(ϵi), coincides with the determinant of the covariance matrix corresponding to the random variable ϵi, denoted with Cov( ϵi). From the causal model, i.e., yi = Axi + ϵi, we have that: Cov(Y) = ACov(X)AT + Cov(ϵi) . (32) Since X and Y follow the same distribution we have that Cov(Y) = Cov(X). Furthermore, we know from the causal model that A = Cov(Y, X)Cov(X) 1. Then, Cov(ϵi) = Cov(X) Cov(Y, X)Cov(X) 1Cov(X, Y) . (33) Non-linear Causal Inference using Gaussianity Measures In the case of ϵi we know that the relation ϵi = (I AA)xi Aϵi must be satisfied, where A = Cov(X, Y)Cov(Y) 1 = Cov(X, Y)Cov(X) 1. Thus, we have that: Cov( ϵi) = (I AA)Cov(X)(I AT AT) + ACov(ϵi) AT = Cov(X) Cov(X)AT AT AACov(X) + AACov(X)AT AT + ACov(X) AT ACov(Y, X)Cov(X) 1Cov(X, Y) AT = Cov(X) Cov(X)AT AT AACov(X) + ACov(X) AT = Cov(X) Cov(X, Y)Cov X 1Cov(Y, X) Cov(X, Y)Cov X 1Cov(Y, X) + Cov(X, Y)Cov X 1Cov(Y, X) = Cov(X) Cov(X, Y)Cov(X) 1Cov(Y, X) . (34) By the matrix determinant theorem we have that det Cov( ϵi) = det Cov(ϵi). See (Murphy, 2012, p. 117) for further details. Appendix B. In this Appendix we motivate that, if the distribution of the residuals is not Gaussian, but is close to Gaussian, one should also expected more Gaussian residuals in the anti-causal direction in terms of the energy distance described in Section 3.4. For simplicity we will consider the univariate case. We use the fact that the energy distance in the one-dimensional case is the squared distance between the cumulative distribution functions of the residuals and a Gaussian distribution (Sz ekely and Rizzo, 2013). Thus, h F(x) Φ(x) i2 dx , D2 = Z [F(x) Φ(x)]2 dx , (35) where D2 and D2 are the energy distances to the Gaussian distribution in the anti-causal and the causal direction respectively; F(x) and F(x) are the c.d.f. of the residuals in the anti-causal and the causal direction, respectively; and finally, Φ(x) is the c.d.f. of a standard Gaussian. One should expect that D2 D2. To motivate this, we use the Gram-Charlier series and compute an expansion of F(x) and F(x) around the standard Gaussian distribution (Patel and Read, 1996). Such an expansion only converges in the case of distributions that are close to be Gaussian (see Section 17.6.6a of (Cram er, 1946) for further details). Namely, F(x) = Φ(x) φ(x) a3 3! H2(x) + a4 4! H3(x) + , F(x) = Φ(x) φ(x) a3 3! H2(x) + a4 4! H3(x) + , (36) where φ(x) is the p.d.f. of a standard Gaussian, Hn(x) are Hermite polynomials and an and an are coefficients that depend on the cumulants, e.g., a3 = κ3, a4 = κ4, a3 = κ3, a4 = κ4. Note, however, that coefficients an and an for n > 5 depend on combinations of Hern andez-Lobato, Morales-Mombiela, Lopez-Paz and Su arez the cumulants. Using such an expansion we find: κ2 3 36E[H2(x)2φ(x)] + κ2 4 576E[H3(x)2φ(x)] , (37) where E[ ] denotes expectation with respect to a standard Gaussian and we have truncated the Gram-Charlier expansion after n = 4. Truncation of the Gram-Charlier expansion after n = 4 is a standard procedure that is often done in the ICA literature for entropy approximation. See for example Section 5.5.1 of (Hyv arinen et al., 2004). We have also used the fact that E[H3(x)H2(x)φ(x)] = 0. The same approach can be followed in the case of D2, the energy distance in the causal direction. The consequence is that D2 κ2 3/36 E[H2(x)2φ(x)] + κ2 4/576 E[H3(x)2φ(x)]. Finally, the fact that one should expect D2 D2 is obtained by noting that κn = cnκn, where cn is some constant that lies in the interval ( 1, 1), as indicated in Section 2.1. We expect that this result extends to the multivariate case. Appendix C. In this Appendix we motivate that one should expect also more Gaussian residuals in the anti-causal direction, based on a reduction of the cumulants, when the residuals in feature space are projected onto the first principal component. That is, when they are multiplied by the first eigenvector of the covariance matrix of the residuals, and scaled by the corresponding eigenvalue. Recall from Section 2.2 that these covariance matrices are C = I AAT and C = I ATA, in the causal and anti-causal direction respectively. Note that both matrices have the same eigenvalues. If A is symmetric we have that both C and C have the same matrix of eigenvectors P. Let pn 1 be the first eigenvector multiplied n times using the Kronecker product. The cumulants in the anti-causal and the causal direction, after projecting the data onto the first eigenvector are κproj n = (pn 1)TMnvect( κn) = c(pn 1)Tvect( κn) and κproj n = (pn 1)Tvect(κn), respectively, where Mn is the matrix that relates the cumulants in the causal and the anticausal direction (see Section 2.2) and c is one of the eigenvalues of Mn. In particular, if A is symmetric, it is not difficult to show that pn 1 is one of the eigenvectors of Mn. Furthermore, we also showed in that case that ||Mn||op < 1 for n 3 (see Section 2.2). The consequence is that c ( 1, 1), which combined with the fact that ||pn 1|| = 1 leads to smaller cumulants in magnitude in the case of the projected residuals in the anti-causal direction. If A is not symmetric we motivate that one should also expect more Gaussian residuals in the anti-causal direction due to a reduction in the magnitude of the cumulants. For this, we derive a smaller upper bound on their magnitude. This smaller upper bound is based on an argument that uses the operator norm of vectors. Definition 2 The operator norm of a vector w induced by the ℓp norm is ||w||op = min{c 0 : ||w Tv||p c||v||p, v}. The consequence is that ||w||op ||w Tv||p/||v||p, v. Thus, the smallest the operator norm of w, the smallest the expected value obtained when multiplying any vector by the Non-linear Causal Inference using Gaussianity Measures vector w. Furthermore, it is clear that ||w||op = ||w||2, in the case of the ℓ2-norm. From the previous paragraph, in the anti-causal direction we have || κproj n ||2 = ||( pn 1)TMnvect(κn)||2, where p1 is the first eigenvector of C, while in the causal direction we have ||κproj n ||2 = ||(pn 1)Tvect(κn)||2, where p1 is the first eigenvector of C. Thus, because the norm of each vector pn 1 and pn 1 is one, we have that ||pn 1||op = 1. However, because we expect Mn, to reduce the norm of ( pn 1)T, as motivated in Section 2.2, ||( pn 1)TMn||op < 1 should follow. This is expected to lead to smaller cumulants in magnitude in the anti-causal direction. J. Beirlant, E. J. Dudewicz, L. Gy orfi, and E. C. Van Der Meulen. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6(1):17 39, 1997. Z. Chen, K. Zhang, and L. Chan. Nonlinear causal discovery for high dimensional data: A kernelized trace method. In IEEE 13th International Conference on Data Mining, pages 1003 1008, 2013. Z. Chen, K. Zhang, L. Chan, and B. Sch olkopf. Causal discovery via reproducing kernel Hilbert space embeddings. Neural Computation, 26(7):1484 1517, 2014. E. A. Cornish and R. A. Fisher. Moments and cumulants in the specification of distributions. Revue de l Institut International de Statistique / Review of the International Statistical Institute, 5(4):307 320, 1938. H. Cram er. Mathematical methods of statistics. Ph D thesis, 1946. D. Entner and P. O. Hoyer. Estimating a causal order among groups of variables in linear models. In Internatinal Conference on Artificial Neural Networks, pages 84 91. 2012. A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Sch olkopf, and A. J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585 592. 2008. A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Sch olkopf, and A. Smola. A kernel twosample test. Journal of Machine Learning Research, 13:723 773, 2012. J. M. Hern andez-Lobato, P. Morales-Mombiela, and A. Su arez. Gaussianity measures for detecting the direction of causal time series. In International Joint Conference on Artificial Intelligence, pages 1318 1323, 2011. P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Sch olkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems 21, pages 689 696, 2009. A. Hyv arinen. New approximations of differential entropy for independent component analysis and projection pursuit. In Advances in Neural Information Processing Systems 10, pages 273 279. MIT Press, 1998. Hern andez-Lobato, Morales-Mombiela, Lopez-Paz and Su arez A. Hyv arinen and S. M. Smith. Pairwise likelihood ratios for estimation of non-Gaussian structural equation models. Journal of Machine Learning Research, 14(1):111 152, 2013. A. Hyv arinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, 2004. D. Janzing, P. O. Hoyer, and B. Sch olkopf. Telling cause from effect based on highdimensional observations. In International Conference on Machine Learning, pages 479 486, 2010. D. Janzing, J. M. Mooij, K. Zhang, J. Lemeire, J. Zscheischler, P. Daniuˇsis, B. Steudel, and B. Sch olkopf. Information-geometric approach to inferring causal directions. Artificial Intelligence, 182-183:1 31, 2012. Y. Kawahara, K. Bollen, S. Shimizu, and T. Washio. Group Li NGAM: Linear non-Gaussian acyclic models for sets of variables. 2012. ar Xiv:1006.5041. S. Kpotufe, E. Sgouritsa, D. Janzing, and B. Sch olkopf. Consistency of causal inference under the additive noise model. In International Conference on Machine Learning, pages 478 486, 2014. A. J. Laub. Matrix Analysis For Scientists And Engineers. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2004. ISBN 0898715768. J. T. Marcinkiewicz. Sur une propri et e de la loi de gauss. Mathematische Zeitschrift, 44: 612 618, 1938. P. Mc Cullagh. Tensor methods in statistics. Chapman and Hall, 1987. J. M. Mooij, O. Stegle, D. Janzing, K. Zhang, and B. Sch olkopf. Probabilistic latent variable models for distinguishing between cause and effect. In Advances in Neural Information Processing Systems 23, pages 1687 1695. 2010. P. Morales-Mombiela, D. Hern andez-Lobato, and A. Su arez. Statistical tests for the detection of the arrow of time in vector autoregressive models. In International Joint Conference on Artificial Intelligence, 2013. K. Murphy. Machine Learning: a Probabilistic Perspective. The MIT Press, 2012. J. K. Patel and C. B. Read. Handbook of the normal distribution, volume 150. CRC Press, 1996. J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, NY, USA, 2000. ISBN 0-521-77362-8. B. Sch olkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2002. ISBN 0262194759. B. Sch olkopf, A. Smola, and K.-R. M uller. Kernel principal component analysis. In International Conference on Artificial Neural Networks, pages 583 588. Springer, 1997. Non-linear Causal Inference using Gaussianity Measures S. Shimizu, P. O. Hoyer, A. Hyv arinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003 2030, 2006. H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk. Nearest neighbor estimates of entropy. American journal of mathematical and management sciences, 23(3-4):301 321, 2003. L. Song, K. Fukumizu, and A. Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. Signal Processing Magazine, IEEE, 30:98 111, 2013. G. J. Sz ekely and M. L. Rizzo. A new test for multivariate normality. Journal of Multivariate Analysis, 93(1):58 80, 2005. G. J. Sz ekely and M. L. Rizzo. Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249 1272, 2013. K. Zhang and A. Hyv arinen. On the identifiability of the post-nonlinear causal model. In International Conference on Uncertainty in Artificial Intelligence, pages 647 655, 2009.