OPTIMAL TRANSPORT FOR CAUSAL DISCOVERY

Published as a conference paper at ICLR 2022

Ruibo Tu, KTH Royal Institute of Technology, ruibo@kth.se
Kun Zhang, Carnegie Mellon University & Mohamed bin Zayed University of Artificial Intelligence, kunz1@cmu.edu
Hedvig Kjellström, KTH Royal Institute of Technology & Silo AI, hedvig@kth.se
Cheng Zhang, Microsoft Research, Cheng.Zhang@microsoft.com

ABSTRACT

To determine causal relationships between two variables, approaches based on functional causal models (FCMs) have been proposed by properly restricting model classes; however, their performance is sensitive to the model assumptions, which makes them difficult to use in practice. In this paper, we provide a novel dynamical-system view of FCMs and propose a new framework for identifying the causal direction in the bivariate case. We first show the connection between FCMs and optimal transport, and then study optimal transport under the constraints of FCMs. Furthermore, by exploiting the dynamical interpretation of optimal transport under the FCM constraints, we determine the corresponding underlying dynamical process of static cause-effect pair data. This provides a new dimension for describing static causal discovery tasks while enjoying more freedom for modeling the quantitative causal influences. In particular, we show that additive noise models (ANMs) correspond to volume-preserving pressureless flows. Consequently, based on the divergence of their velocity fields, we introduce a criterion for determining the causal direction. With this criterion, we propose a novel optimal transport-based algorithm for ANMs which is robust to the choice of models, and we extend it to post-nonlinear models. Our method demonstrates state-of-the-art results on both synthetic and causal discovery benchmark datasets.

1 INTRODUCTION

Determining causal relationships between two variables is a fundamental and challenging causal discovery task (Janzing et al., 2012). Conventional constraint-based and score-based causal discovery methods identify causal structures only up to Markov equivalence classes (Spirtes et al., 2001), in which some causal relationships remain undetermined. To address this challenge, properly constrained functional causal models (FCMs) have been proposed. FCMs represent the effect as a function of its cause and independent noise and can help identify the causal direction between two variables by imposing substantial structural constraints on model classes, such as additive noise models (ANMs) (Shimizu et al., 2006; Hoyer et al., 2008) and post-nonlinear models (PNLs) (Zhang and Hyvärinen, 2009b). While some of these models, such as PNLs, are highly flexible, the constraints are still restrictive and difficult to interpret and relax. Inevitably, the performance of these methods is sensitive to model assumptions and optimization algorithms, especially in real-world applications.

To handle these issues, we consider FCMs from a dynamical-system view. By augmenting the static causal discovery task with a time dimension, we interpret FCMs as dynamical causal processes under the least action principle (Arnol'd, 2013). The new interpretation connects FCMs with a large class of models in dynamical systems. It then provides more freedom to model causal influences, possibilities to derive new causal discovery criteria, and a potential direction to generalize causal models with identifiable causal direction.
In particular, we exploit the above idea by leveraging the intrinsic connection between FCMs and optimal transport. Optimal transport was originally introduced by Monge (1781) and has been applied in a large range of applications, not only because it is a natural way to describe moving particles (Ambrosio et al., 2012) but also because of recent improvements in its computational methods (Cuturi, 2013; Kolouri et al., 2019). Recently, it has also been widely applied to generative models for measuring the distance between probability distributions (Arjovsky et al., 2017; Kolouri et al., 2018; Genevay et al., 2018). Among the different optimal transport definitions, the L2 Wasserstein distance has found extensive applications in statistics (Rachev and Rüschendorf, 1998), functional analysis (Barthe, 1998), and other areas (McCann, 1997; Otto, 1997). The dynamical formulation of the L2 Wasserstein distance was introduced by Benamou and Brenier (2000) for relaxing the computational costs. We find that, in the context of the dynamical formulation, FCMs can be connected with optimal transport. Furthermore, with the dynamical interpretation of optimal transport, one can naturally understand FCMs from a dynamical-system view, which makes it possible to derive new criteria to identify the causal direction. Moreover, it also enables us to develop practical algorithms with optimal transport for static causal discovery tasks without learning a regression model. Our main contributions are:

1. Dynamical interpretation of FCMs in the bivariate case. We provide dynamical interpretations of optimal transport under the constraints of FCMs. Furthermore, we introduce a time variable, determine the underlying dynamical process under the least action principle (Arnol'd, 2013) for the static bivariate causal discovery task, and characterize properties of the corresponding dynamical systems (Sec. 3.1 and Sec. 3.2).

2. A criterion for determining causal relationships between two variables. We study the corresponding dynamical systems of FCMs and prove that ANMs correspond to volume-preserving pressureless flows. Moreover, based on the divergence of their velocity fields, we propose a criterion for determining causal relationships and show that under the identifiability conditions of ANMs it is a valid criterion for ANMs, which can be extended to PNLs directly (Sec. 3.2).

3. An optimal transport-based approach (DIVOT) for distinguishing cause from effect between two variables. DIVOT inherits the advantages of one-dimensional optimal transport. It can be computed efficiently and does not require independence tests, learning a regression model, or deriving likelihood functions for complicated distributions. Experimental results show that our method is robust to the choice of models and has a promising performance compared with the state-of-the-art methods on both synthetic and real cause-effect pair datasets (Sec. 4 and Sec. 6).

2 PRELIMINARIES

Optimal transport: the underdetermined Jacobian problem. We mainly follow the notations and the definitions of (Benamou and Brenier, 2000). Suppose that two (probability) density functions, $p_0(x)$ and $p_T(x)$ with $x \in \mathbb{R}^d$, are non-negative and bounded with total mass one. The transfer of $p_0(x)$ to $p_T(x)$ is realized with a smooth one-to-one map $M: \mathbb{R}^d \to \mathbb{R}^d$. The Jacobian problem is to find $M$ that satisfies the Jacobian equation,
$$p_0(x_0) = p_T(M(x_0))\,\big|\det\big(\nabla M(x_0)\big)\big|,$$
where $x_T = M(x_0)$, $\nabla$ is the gradient in vector calculus, and $\det(\cdot)$ denotes the determinant.
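As a quick sanity check of the Jacobian equation, the following snippet verifies it numerically for a hand-picked one-dimensional example; the specific densities and the map are illustrative assumptions, not taken from the paper.

```python
# Numerical check of the Jacobian equation p0(x0) = pT(M(x0)) |det(grad M(x0))| in 1D,
# with p0 = N(0, 1) and the map M(x) = 2x + 1, whose push-forward density is pT = N(1, 2^2).
import numpy as np
from scipy.stats import norm

M = lambda x: 2.0 * x + 1.0       # smooth one-to-one map
dM = 2.0                          # |det(grad M)| reduces to |M'(x)| in 1D

x0 = np.linspace(-3.0, 3.0, 7)
lhs = norm.pdf(x0, loc=0.0, scale=1.0)               # p0(x0)
rhs = norm.pdf(M(x0), loc=1.0, scale=2.0) * abs(dM)  # pT(M(x0)) |M'(x0)|
assert np.allclose(lhs, rhs)
```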
This is an underdetermined problem, as many maps can be solutions. A natural way to resolve it is to choose the optimal one, e.g., the one with the lowest cost. A common cost function is the Lp Wasserstein distance.

Lp Wasserstein distance and its one-dimensional closed-form solution. The Lp Wasserstein distance between $p_0$ and $p_T$, denoted by $W_p(p_0, p_T)$, is defined by
$$W_p(p_0, p_T)^p = \inf_M \int |M(x_0) - x_0|^p\, p_0(x_0)\, dx_0,$$
where $p \ge 1$ (Kantorovich, 1948). In this work, we mainly use the square of the L2 Wasserstein distance, denoted by $W_2^2$. Moreover, the one-dimensional (1D) Lp Wasserstein distance has a closed-form solution; e.g., the 1D optimal solution of $W_2^2$ is $M = P_T^{-1} \circ P_0$, where $P_0$ and $P_T$ are the cumulative distribution functions of $p_0$ and $p_T$, and $\circ$ represents function composition. In practice, the 1D optimal solution can be computed with the average square distance between the sorted samples from $p_0$ and $p_T$ (Kolouri et al., 2019).

Functional causal models. FCMs represent the effect $Y$ as a function $f(\cdot)$ of the direct cause $X$ and independent noise $E_y$, where the function $f$ describes the causal influence of $X$ on $Y$, and $E_y$ is the exogenous variable/noise. Without any additional assumption on the functional classes, the causal direction is not identifiable (Hyvärinen and Pajunen, 1999; Zhang et al., 2015a). Roughly speaking, this is because, given a variable pair $(X, Y)$, one can always construct $Y = f(X, E_y)$ and another different FCM, $X = \tilde{f}(Y, E_x)$, such that both of them have independent noise (Hyvärinen and Pajunen, 1999; Zhang et al., 2015a). Several works further introduce proper assumptions on model classes, which guarantee that the independence of cause and noise only holds in the causal direction, e.g.,
$$\text{ANM:}\quad Y = g(X) + E_y; \qquad (1)$$
$$\text{PNL:}\quad Y = h(g(X) + E_y), \qquad (2)$$
where $g$ and $h$ are nonlinear functions and $h$ is invertible.

3 DYNAMICAL INTERPRETATION OF FUNCTIONAL CAUSAL MODELS

We first show the connection between FCMs and optimal transport in Sec. 3.1. In Sec. 3.2, we further elaborate the analogy between the optimal transport problem and the causal direction determination problem. We then study optimal transport under the constraints of FCMs, show the corresponding dynamical systems of FCMs, and characterize the properties of such systems.

3.1 A REFORMULATION OF FUNCTIONAL CAUSAL MODELS

As introduced in Sec. 2, FCMs are used to approximate the true data generation process. Given the FCM $Y = f(X, E_y)$, we rewrite it in the vector form,
$$x_T = \begin{bmatrix} X \\ Y \end{bmatrix} = \begin{bmatrix} E_x \\ f(X, E_y) \end{bmatrix} = M(x_0), \qquad (3)$$
where $x_0, x_T \in \mathbb{R}^2$, their probability densities $p_0, p_T \ge 0$, and $M: \mathbb{R}^2 \to \mathbb{R}^2$. As an analogy to the mass transfer scenario (Monge, 1781), we consider the samples of the independent noise $E_x$ and $E_y$ as the particles of materials and regard the map $M$ in Eqn. (3) as a special transformation of the independent noise samples.

Figure 1: Panel (a) illustrates the trajectories of the pressureless potential flow corresponding to general FCMs, while panel (b) shows the trajectories of the volume-preserving pressureless potential flow corresponding to ANMs. In panel (b), the trajectories have zero velocities on the Ex/X axis and move along straight lines that are parallel to each other when having the same values of X/Ex.
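As a concrete illustration of Eqn. (3), the following sketch builds the map $M$ for an assumed ANM $f(X, E_y) = g(X) + E_y$ and pushes noise samples from the $E_x$-$E_y$ plane to observations in the $X$-$Y$ plane; the mechanism $g$ and the noise distributions are illustrative assumptions.

```python
# The FCM-induced map M of Eqn. (3) for an assumed ANM: x0 = (Ex, Ey) -> xT = (Ex, g(Ex) + Ey).
import numpy as np

rng = np.random.default_rng(0)
n = 500
ex = rng.uniform(-1.0, 1.0, n)      # noise of the cause; under the map constraint, X = Ex
ey = rng.uniform(0.0, 1.0, n)       # independent noise of the effect
g = lambda x: np.sin(4.0 * x)       # illustrative causal mechanism

x0 = np.stack([ex, ey], axis=1)             # unmeasured initial positions at time 0
xT = np.stack([ex, g(ex) + ey], axis=1)     # observed positions xT = M(x0) at time T
```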
As shown in Fig. 1, one can consider the data points as being transferred from their original positions (which are unmeasured) in the $E_x$-$E_y$ plane at time 0 to the observed positions in the $X$-$Y$ plane at time $T$. Such a transformation treats the transfer as a dynamical process which moves the unmeasured independent noise $x_0 = [E_x, E_y]^\top$ (where $\top$ denotes the transpose of vectors or matrices) and consequently leads to the observations $x_T = [X, Y]^\top$. From the perspective of FCM-based causal discovery approaches, causal influences are represented by FCMs, which represent the effect as a function of its direct cause and an unmeasured noise satisfying the FCM constraints: (i) The map constraint: the values of $X$ are determined by the values of its corresponding noise, i.e., $X = E_x$, while the values of the effect depend on the cause $X$ and the noise $E_y$; (ii) The independence constraint: the noise terms are independent, i.e., $E_x$ is independent of $E_y$.

Note that the optimal transport map $M$ with the minimal Lp Wasserstein distance is not necessarily the one in Eqn. (3), because it has no information about the FCM constraints or the true data generation process. In other words, given two sample sets of $x_0$ and $x_T$, the couplings given by optimal transport are not necessarily the ones generated from the ground-truth FCM.

3.2 DYNAMICAL INTERPRETATION OF FCMS: OPTIMAL TRANSPORT UNDER THE FCM CONSTRAINTS

In this section, we jointly consider the causality and the optimality of the maps in the Jacobian problem. This provides both a causal sense of the transformation and a dynamical view of FCMs. We first recap the dynamical formulation of the L2 Wasserstein distance, study such dynamical systems under the FCM constraints, and then show their properties under the FCM and ANM constraints.

Dynamical L2 Wasserstein distance. Benamou and Brenier (2000) formulate the L2 Monge-Kantorovich problem as a convex space-time minimization problem in a continuum mechanics framework. Fixing a time interval $[0, T]$, they introduce the concepts of the smooth time-dependent density $\rho(t, x_t) \ge 0$ and the velocity field $v(t, x_t)$. When they are clear from context, we denote them by $\rho$ and $v$. Because we are considering the bivariate case, $x_t \in \mathbb{R}^2$ and $v \in \mathbb{R}^2$. Then, they give the dynamical formulation of $W_2^2$:
$$W_2^2(p_0, p_T) = \inf_{\rho, v} \int_0^T \!\!\int \rho(t, x_t)\, |v(t, x_t)|^2\, dx_t\, dt, \qquad (4)$$
subject to
the initial and final conditions: $\rho(0, \cdot) = p_0$, $\rho(T, \cdot) = p_T$;
the continuity equation: $\partial_t \rho + \nabla \cdot (\rho v) = 0$,
where $\nabla\cdot$ denotes the divergence in vector calculus. They show that minimizing the objective function in the optimization problem (4) is equivalent to finding the dynamical system with the least action (Arnol'd, 2013) and prove that the solutions of (4) are pressureless potential flows, of which the fluid particles are not subject to any pressure or force and the trajectories are determined given their initial positions and velocities or given their initial and final positions. Suppose that $M^*$ is the solution given by $W_2^2$. The corresponding flows follow the time evolution equation,
$$x_t = x_0 + \frac{t}{T}\, v(t, x_t), \quad \text{where } v(t, x_t) = v(0, x_0) = M^*(x_0) - x_0 \text{ and } t \in [0, T]. \qquad (5)$$
The time evolution equation shows that $x_t$ is a convex combination of $x_0$ and $M^*(x_0)$ and that the velocity fields do not depend on time. As an analogy between the optimal transport problem and causal direction determination, the density $\rho$ and the velocity $v$ of moving particles can be considered as the probability density and the velocity of changing values of data points.
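A minimal sketch of the time evolution equation (5), assuming the initial and final positions have already been matched (e.g., by the map of Eqn. (3)); the data here are illustrative placeholders.

```python
# Displacement interpolation of Eqn. (5): each matched point moves on a straight line with a
# time-independent velocity, so xt is a convex combination of x0 and M*(x0).
import numpy as np

def interpolate(x0, xT, t, T=1.0):
    """Positions at time t of the pressureless potential flow transporting x0 to xT = M*(x0)."""
    v = xT - x0                       # v(t, xt) = v(0, x0) = M*(x0) - x0
    return x0 + (t / T) * v           # Eqn. (5)

rng = np.random.default_rng(0)
x0 = rng.uniform(size=(100, 2))                                       # matched (Ex, Ey) positions
xT = np.stack([x0[:, 0], np.sin(4 * x0[:, 0]) + x0[:, 1]], axis=1)    # matched (X, Y) positions
x_half = interpolate(x0, xT, t=0.5)                                   # interpolation at t = T/2
```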
Moreover, the dynamical interpretation of the L2 Wasserstein distance introduces a time variable and provides a natural time interpolation $\rho(t, x_t)$ of $\rho_0$ and $\rho_T$ together with the velocity field $v(t, x_t)$. Similarly, we can also obtain the natural time interpolation $p(t, x_t)$ between $p_0$ and $p_T$ as well as the velocity field $v(t, x_t)$ under the least action principle (Arnol'd, 2013), which is the dynamical interpretation of FCMs.

Dynamical L2 Wasserstein distance under the FCM constraints. First, we introduce the FCM constraints in the context of the dynamical L2 Wasserstein distance. According to the time evolution equation (5), we know that the velocity is fully determined by the initial and final values of $x_t$. We first consider FCM constraint (i). For the initial and final values of the cause, the value of its observation $X$ is equal to its noise value $E_x$ in Eqn. (3). Consequently, denoting $v = [v_x, v_y]^\top$, where $v_x$ and $v_y$ represent the velocities along the $E_x/X$-axis and $E_y/Y$-axis respectively as shown in Fig. 1, we have $v_x(t, x_t) = 0$ for all $t \in [0, T]$ and $x_t \in \mathbb{R}$. As for FCM constraint (ii), it implies that the initial noise of cause and effect, corresponding to $E_x$ and $E_y$ at time 0, are independent. Therefore, we have the FCM constraints for the dynamical L2 Wasserstein distance:
(I) The map constraint: $v(t, x_t) = [v_x(t, x_t), v_y(t, y_t)]^\top$, and $\forall x_t, y_t \in \mathbb{R},\ t \in [0, T]$, $v_x(t, x_t) = 0$;
(II) The independence constraint: the two random variables (cause and effect) at the initial time have the joint probability density function $p_0(x_0)$ and they are independent.

Second, we characterize the properties of the dynamical L2 Wasserstein distance under constraints (I) and (II). According to Eqn. (3), the form of $M^*$ is determined as
$$M^*(x_0) = \begin{bmatrix} E_x \\ f(E_x, E_y) \end{bmatrix}; \qquad (6)$$
otherwise, the FCM constraints would be violated. Moreover, $W_2^2$ under the FCM constraints can be computed with the one-dimensional $W_2^2$, as shown in Prop. 1 (the derivation is in App. D).

Proposition 1. Under constraints (I) and (II), the square of the L2 Wasserstein distance between $p_0$ and $p_T$ is
$$W_2^2(p_0, p_T) = \mathbb{E}_{E_x}\!\left[ W_2^2\big(p(E_y), p(Y \mid E_x)\big) \right]. \qquad (7)$$

Furthermore, we consider the constraints of ANMs and characterize the corresponding dynamical systems for now, which can be directly extended to PNLs as mentioned in Sec. 4.2. Based on constraints (I) and (II), we further introduce the ANM constraint, (III) the effect is the sum of the noise and a nonlinear function of the cause as defined in Eqn. (1).

Theorem 1 (Zero divergence of the velocity field). Under constraints (I) and (II), the dynamical systems given by the L2 Wasserstein distance are pressureless flows. Further, under ANM constraint (III), they become volume-preserving pressureless flows, of which the divergence of the velocity field, $v(t, x_t) = [v_x(t, x_t), v_y(t, y_t)]^\top$, satisfies
$$\operatorname{div} v(t, x_t) = \frac{\partial v_x(t, x_t)}{\partial x_t} + \frac{\partial v_y(t, y_t)}{\partial y_t} = 0, \quad \forall t \in [0, T],\; x_t, y_t \in \mathbb{R},$$
where $\operatorname{div}$ is the divergence operator in vector calculus.

Thm. 1 determines the corresponding dynamical systems of FCMs and ANMs by analyzing their densities and velocity fields (the details of the proof are in App. D) and shows an essential property of the corresponding dynamical systems of ANMs. This property indicates a potential criterion for causal direction determination: the divergence of the velocity field is zero everywhere in the causal direction, while it may not always hold in the reverse direction. Next, we will verify the criterion rigorously, propose an algorithm based on it, and show the extension to the PNL cases.
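To make Thm. 1 tangible, the following JAX check (illustrative, not part of the paper's code) evaluates the divergence of the time-0 velocity field $v(x_0) = M^*(x_0) - x_0 = [0,\, f(E_x, E_y) - E_y]^\top$ induced by a given FCM; it vanishes exactly when $f$ is additive in the noise, as the theorem states.

```python
# Divergence of the time-0 velocity field under constraints (I)-(II): div v = df/dEy - 1,
# which is identically zero for an ANM and nonzero for a non-additive FCM in general.
import jax
import jax.numpy as jnp

def divergence(f):
    def v(x0):                                    # x0 = [Ex, Ey]
        ex, ey = x0
        return jnp.array([0.0, f(ex, ey) - ey])   # map constraint: vx = 0
    return lambda x0: jnp.trace(jax.jacfwd(v)(x0))

anm = lambda ex, ey: jnp.sin(4.0 * ex) + ey              # additive noise model
non_anm = lambda ex, ey: jnp.sin(4.0 * ex) * (1.0 + ey)  # multiplicative noise, not an ANM

x0 = jnp.array([0.3, 0.7])
print(divergence(anm)(x0))      # 0.0 everywhere: volume-preserving pressureless flow
print(divergence(non_anm)(x0))  # nonzero in general
```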
4 CAUSAL DIRECTION DETERMINATION WITH OPTIMAL TRANSPORT

In this section, we define a divergence measure as a criterion for determining the causal direction between two variables for ANMs. Based on the criterion, we then provide an algorithm to identify the causal direction, named the divergence measure with optimal transport (DIVOT).

4.1 DIVERGENCE MEASURE AS A CAUSAL DISCOVERY CRITERION

We first define the divergence measure and then show that it is a valid criterion for identifying the causal direction in the bivariate case under the identifiability conditions of ANMs, as shown in Prop. 2 (the proof is in App. D).

Proposition 2 (Divergence measure as a causal discovery criterion). Define the divergence measure
$$D(v) = \int_{\mathbb{R}^2} |\operatorname{div} v|^2\, p_0(x_0)\, dx_0 = \mathbb{E}_{x_0}\!\left[\, |\operatorname{div} v|^2 \right], \qquad (8)$$
where $v = M^*(x_0) - x_0$. Suppose that constraints (I), (II), (III), and the identifiability conditions of ANMs (Hoyer et al., 2008) are satisfied. The divergence measure of the corresponding dynamical system satisfies $D(v) = 0$ if and only if $X$ is the direct cause of $Y$.

Nevertheless, there are some challenges in computing the divergence measure. For example, we need to solve a two-dimensional optimal transport problem and compute the derivative of a velocity field for all samples. Another challenge of computing $M^*$ is that we in general have no information about $x_0$. Such issues are all solved with DIVOT.

4.2 PROPOSED METHOD: DIVOT

We first provide an overview of the algorithm, DIVOT, for determining the causal direction between two variables and then introduce the four steps to compute the divergence measure.

Overview of DIVOT. Alg. 1 is based on the divergence-measure criterion and determines the causal direction between two variables. Given data of $X$ and $Y$, we want to infer whether $X$ causes $Y$ ($X \to Y$) or $Y$ causes $X$ ($Y \to X$). We compute the divergence measure in both directions. In practice, the noise can have different variance in different applications, and a larger variance can lead to a larger measure value with finite samples. We therefore normalize the variance-based measure value by the estimated noise variance; the direction with the smaller normalized measure value is then taken as the causal direction. In App. A, we provide the modified algorithms for including the independent case and the significance of the results with bootstrapping.

Algorithm 1: DIVOT: divergence measure with optimal transport for causal direction determination.
Input: data {(x_i, y_i)} and noise distribution p(E; θ0). Output: X → Y or Y → X.
 1  Def DIVOT({(x_i, y_i)}, p(E; θ0)):
 2      data, θ = {(x_i, y_i)}, θ0
 3      L_{X→Y} = Div(data, p(E; θ))
 4      data, θ = {(y_i, x_i)}, θ0
 5      L_{Y→X} = Div(data, p(E; θ))
 6      if L_{Y→X} < L_{X→Y} then
 7          return Y → X
 8      else
 9          return X → Y
10  Def Div(data, p(E; θ)):
11      noise = Sampling(p(E; θ))
12      data_s, noise_s = OT(data, noise)
13      loss = min_θ VarDiv(data_s, noise_s)
14      return loss / Var(p(E; θ*))
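A minimal Python skeleton of Alg. 1, assuming a `divergence_measure(cause, effect, theta0)` routine that performs Lines 11-13 (noise sampling, 1D optimal transport coupling, and minimization of the variance-based measure over θ) and returns the loss together with the fitted θ*; with the reparameterization of Eqn. (9), the estimated noise variance in Line 14 is simply θ*².

```python
# Top-level DIVOT flow (Alg. 1): compare normalized divergence measures in both directions.
import numpy as np

def divot(x, y, divergence_measure, theta0=1.0):
    """Return 'X->Y' or 'Y->X' for paired samples x, y."""
    def div(cause, effect):
        loss, theta_star = divergence_measure(cause, effect, theta0)  # Lines 11-13
        return loss / theta_star**2            # Line 14: normalize by estimated noise variance
    l_xy = div(x, y)                           # hypothesis X -> Y
    l_yx = div(y, x)                           # hypothesis Y -> X
    return 'Y->X' if l_yx < l_xy else 'X->Y'
```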
Noise data generation: the first step of computing the divergence measure. To compute the divergence measure, we need to know the velocity field $v$ as defined in the time evolution equation (5). This requires the couplings of the data of $x_0 = [E_x, E_y]^\top$ and $x_T = [X, Y]^\top$. But in the bivariate causal discovery task, only the data of $x_T$ are given. Therefore, as shown in Line 11 of Alg. 1, we first deal with the lack of the noise data of $x_0$, denoted by $\{(e^i_x, e^i_y)\}$. To obtain the noise data, we may assume a multivariate probability distribution of $x_0$ with the density $p_0(E_x, E_y)$ and then sample data from it, represented by $(e^i_x, e^i_y) \sim p_0(E_x, E_y)$. Fortunately, due to the FCM constraints, we know that $p_0(E_x, E_y) = p_0(X, E_y) = p(X)\,p(E_y)$. So we only need to assume the probability distribution of $E_y$ and parameterize it with $\theta$, denoted by $p(E_y; \theta)$. Suppose that the dataset of $x_T$ with $N$ samples is given, denoted by $\{(x^i, y^i)\}_N$. We first sample a dataset of $E_y$ with sample size $N$, denoted by $\{e^i_y\}_N$; e.g., in the experiments of this work, we use the simplified reparameterization trick,
$$e^i_y = f^{\text{noise}}_\theta(e^{\text{source}}_y) = \theta \cdot e^{\text{source}}_y, \quad \text{with } e^{\text{source}}_y \sim \mathcal{N}(0, 1) \text{ or } \mathcal{U}(0, 1), \qquad (9)$$
where $f^{\text{noise}}_\theta$ is a monotonic function and $e^{\text{source}}_y$ is sampled from a standard normal distribution or a uniform distribution. As the monotonic $f^{\text{noise}}_\theta$ can be very flexible, we can represent flexible noise distributions with monotonic neural networks as in (Huang et al., 2018). Next, we randomly match the data of $X$, denoted by $\{x^i\}_N$, with $\{e^i_y\}_N$, which gives $\{(x^i, e^i_y)\}_N$ as the data of $x_0$.

Optimal transport finds the couplings of the observations and the generated noise. Computing the divergence measure requires the couplings of the data of $x_0$ and $x_T$ because $v = M^*(x_0) - x_0$; in other words, given the data of $x_0$ and $x_T$, we need to solve a two-dimensional optimal transport problem, which gives $M^*$ and the couplings as in Line 12 of Alg. 1. According to Prop. 1, solving the two-dimensional optimal transport problem under the FCM constraints is equivalent to solving one-dimensional optimal transport problems. More specifically, the expectation in Eqn. (7) can be computed with the Monte Carlo estimator by using $N$ samples of $E_x$, denoted by $\{e^i_x\}_N$, and then the two-dimensional $W_2^2$ in Eqn. (7) is computed with $N$ one-dimensional $W_2^2$, i.e., $W_2^2(p_0, p_T) \approx \frac{1}{N}\sum_i W_2^2\big(p(E_y), p(Y \mid E_x = e^i_x)\big)$. Importantly, solving the one-dimensional optimal transport problem, or computing the one-dimensional $W_2^2$, can be implemented with a sorting operation as in (Kolouri et al., 2019). Therefore, the couplings are found as follows: for each value of $E_x$, we collect the corresponding values of $E_y$ and $Y$ from $\{(e^i_x, e^i_y)\}_N$ and $\{(x^i, y^i)\}_N$, and then match the sorted values of $E_y$ and $Y$.
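A minimal sketch of this coupling step (Line 12 of Alg. 1), assuming $X$ takes finitely many values so that the samples with $E_x = X = x$ can be grouped directly; for continuous $X$, the paper's App. F instead groups samples into batches of nearby positions.

```python
# 1D optimal transport by sorting: for each position X = x, match the sorted Y samples with
# the sorted generated-noise samples Ey; the velocities are then vy = Y - Ey pairwise.
import numpy as np

def couple_sorted(x, y, ey):
    """Return, per position X = x, the coupled (sorted Ey, sorted Y) sample vectors."""
    pairs = []
    for value in np.unique(x):
        mask = (x == value)
        pairs.append((value, np.sort(ey[mask]), np.sort(y[mask])))
    return pairs
```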
Variance-based divergence measure. Suppose that the previous step has found the couplings, i.e., given any sample $(e^i_x, e^i_y)$, we have its corresponding $(x^i, y^i)$. Then, we compute the divergence of the velocity field as in Line 13 of Alg. 1. According to the definition, $|\operatorname{div} v|^2 = \big(\frac{d v_y}{d y}\big)^2$. Since all the $e^i_y$ are matched with $y^i$, we have $v^i_y = y^i - e^i_y$. A straightforward way to approximate the derivative is to use the nearest-neighbour pair $(e^{\hat i}_x, e^{\hat i}_y)$ and $(x^{\hat i}, y^{\hat i})$ and approximate it with
$$\frac{d}{dy} v_y \Big|_{y = y^i} \approx \frac{(y^i - e^i_y) - (y^{\hat i} - e^{\hat i}_y)}{y^i - y^{\hat i}};$$
however, this suffers from the following issues (especially in the few-sample scenario): (a) the denominator is in general a small number, and the distance to the nearest neighbour can be large in the few-sample case, which makes the computation unstable and inaccurate; (b) the deviation on the X-axis makes the approximation a biased estimate, especially when the gradient of $g(X) + E_y$ at $X = x$ is large. Therefore, we propose the variance-based divergence measure, which works better in practice.

It is straightforward to see that, under constraints (I) and (II), the value of the divergence measure of an ANM is zero if and only if the value of the variance-based divergence measure is zero, which is defined as
$$D_{\text{var}}(v) = \mathbb{E}_X\big[\, \mathbb{V}[V_{y|X} \mid X]\,\big], \qquad (10)$$
where $\mathbb{V}[V_{y|X} \mid X]$ represents the conditional variance of the velocity field $V_y$ (a random variable) at position $X$ at the initial time. For example, $\mathbb{V}[V_{y|x} \mid X = x]$ represents the variance of all the velocities $v_y$ at the positions where $X = x$ at the initial time. Let $\{(x^i, y^i)\}_N$ denote the cause-effect pair dataset with sample size $N$, and let $\{v^i_{y|x}\}_{N_x}$ denote all the velocities $v^i_y$ at position $X = x$ at the initial time, where the sample size is $N_x$ and their mean value is $\bar{v}_{y|x}$. Then, $\mathbb{V}[V_{y|x} \mid X = x] = \sum_{i=1}^{N_x} (v^i_{y|x} - \bar{v}_{y|x})^2 / (N_x - 1)$, and
$$D_{\text{var}}(v) \approx \frac{1}{N} \sum_{x \in \{x^i\}_N} \frac{\big\| \operatorname{sort}(\vec{y}_x) - \operatorname{sort}(\vec{e}_y) - \operatorname{ave}(\vec{y}_x - \vec{e}_y) \big\|_2^2}{N_x - 1}, \qquad (11)$$
where $\{x^i\}_N$ represents the set of all the values of $X$ and its sample size is $N$; $\vec{y}_x$ is the vector of the $Y$ samples where $X = x$; $\vec{e}_y$ is the vector of the $E_y$ samples where $E_x = x$; $\operatorname{sort}(\cdot)$ sorts a vector; $\operatorname{ave}(\cdot)$ computes the vector mean; and $\|\cdot\|_2^2$ is the square of the $\ell_2$ norm.

Minimization w.r.t. θ. Given the data $\{(x^i, y^i)\}_N$ and the density $p(E_y; \theta)$, we can compute the divergence measure with the variance-based method. Note that we only initialize $\theta$ with some random value, so $p(E_y; \theta)$ is not necessarily the true distribution and may even be significantly different from the true one, which can lead to a wrong value of the divergence measure. Therefore, as shown in Line 13 of Alg. 1, we minimize the divergence measure w.r.t. $\theta$. For the minimization, one can derive the gradient w.r.t. $\theta$ in a simple parameterization case such as (9), while in the complex case one can use auto-differentiation. According to Prop. 2, the divergence measure in the causal direction is zero if and only if $p(E_y; \theta^*)$ with the optimal parameter $\theta^*$ is the true noise distribution a.e., implied by the identifiability of ANMs. In this work, we used autograd and RMSProp (gradient descent) of JAX for the minimization.

Extension to PNLs. The measure (11) can be directly extended to the PNL cases. Since $h$ in the PNL (2) is an invertible function, by considering $h^{-1}(Y)$ as a new random variable, $h^{-1}(Y) = g(X) + E_y$ is an ANM. Thus, for PNLs, under the identifiability conditions of PNLs (Zhang and Hyvärinen, 2009b), Prop. 2 still holds; and we only need to replace $\vec{y}_x$ in (11) with $f^{\text{PNL}}_\omega(\vec{y}_x)$, where $f^{\text{PNL}}_\omega$ is an invertible function, e.g., the simplified version of (Zhang et al., 2015a),
$$f^{\text{PNL}}_\omega(\vec{y}_x) = \vec{y}_x + \omega_a \tanh(\omega_b\, \vec{y}_x + \omega_c), \qquad (12)$$
where $\omega_a$, $\omega_b$, and $\omega_c$ are positive scalars. Moreover, $f^{\text{PNL}}_\omega$ can also be a monotonic neural network, such as (Huang et al., 2018).

Extension to the multivariate case. For simplicity and clarity of the paper, we focus on elaborating the connection between FCMs and dynamical systems and on developing the theoretical basis and the method in the bivariate case, which are essential for the further development of FCM-based causal discovery methods. In the case of multiple variables, one can use a constraint-based method to find the causal skeleton (the undirected causal graph) and then use the extension of our method for the edge orientation, similar to (Zhang and Hyvärinen, 2009b; Monti et al., 2020; Khemakhem et al., 2021). See App. B for details.
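Putting the pieces of Sec. 4.2 together, the following is a hedged sketch of the variance-based measure (11) with the reparameterization of Eqn. (9) and a plain gradient-descent loop over θ (the paper uses JAX's autograd with RMSProp; the toy data, the bin construction, and the optimizer here are illustrative assumptions).

```python
# Variance-based divergence measure (Eqn. 11) and its minimization over theta (Line 13 of Alg. 1).
import jax
import jax.numpy as jnp
import numpy as np

def variance_divergence(theta, y_bins, e_source_bins):
    """Average over positions X = x of the sample variance of the matched velocities vy = Y - Ey."""
    total = 0.0
    for y_bin, e_src in zip(y_bins, e_source_bins):
        ey = theta * jnp.sort(e_src)          # Eqn. (9): Ey = theta * e_source, sorted coupling
        vy = jnp.sort(y_bin) - ey             # velocities from the sorted 1D optimal transport
        total += jnp.sum((vy - vy.mean()) ** 2) / (len(y_bin) - 1)
    return total / len(y_bins)

# toy usage: y_bins[k] holds the Y samples observed at the k-th X position
rng = np.random.default_rng(0)
y_bins = [jnp.asarray(np.sin(4 * x) + 0.3 * rng.normal(size=50)) for x in np.linspace(-1, 1, 20)]
e_source_bins = [jnp.asarray(rng.normal(size=50)) for _ in y_bins]

theta, lr = 1.0, 0.05
grad_fn = jax.grad(variance_divergence)
for _ in range(200):                          # plain gradient descent, for illustration only
    theta -= lr * grad_fn(theta, y_bins, e_source_bins)
loss = variance_divergence(theta, y_bins, e_source_bins) / theta ** 2   # normalized (Line 14)
```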
(Figure 2 — panel (a): accuracy on different ANM datasets as a function of sample size; panel (b): robustness to prior misspecification with uniform, beta, and normal choices of $p(E_y; \theta)$ for $Y = 0.1((2.5X)^3 - X) + E_y$ and $Y = f_{\text{piece}}(X) + E_y$.)

Figure 2: Performance on synthetic data generated from different ANMs with different sample sizes. $f_{\text{piece}}$ is a discontinuous function: $f_{\text{piece}}(x) = 0.5x^3 - x$ for $x \le 0$ and $f_{\text{piece}}(x) = 1 - 0.5x^3 + x$ for $x > 0$. In panel (a), each result represents the percentage of correct results on datasets of a given sample size generated from an ANM. Panel (b) shows the results of applying DIVOT with different types of distributions $p(E_y; \theta)$ to the datasets with uniform noise $E_y \sim \mathcal{U}(0, 1)$.

5 RELATED WORK

There are mainly two types of causal discovery methods for static causal direction determination between two variables. The first type introduces model assumptions to achieve the identifiability of the causal direction. Most of such methods are based on LiNGAM (Shimizu et al., 2006) and ANMs (Hoyer et al., 2008; Mooij et al., 2009). Some of them are based on more general models, e.g., PNLs using an MLP for representing nonlinear functions (PNL-MLP) (Zhang and Hyvärinen, 2009b) and PNLs using a warped Gaussian process and a mixture of Gaussians noise (PNL-WGP-MoG) (Zhang et al., 2015a). Recently, Khemakhem et al. (2021) proposed an autoregressive flow-based model (CAREFL), whose assumption is more general than ANMs and stricter than PNLs. These methods commonly apply (non)linear regression to learn the function $g$ in (1) (or together with $h$ in (2)), and then test the independence between the independent noise $E_y$ in (1) (the residual) and the cause $X$. The independence test is commonly the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2005); however, as argued by Yamada and Sugiyama (2010), the kernel width limits its practical use and its common heuristic value limits the flexibility of the function approximation. Moreover, there are other criteria proposed in the first type of methods, such as the likelihood ratio (LLR) (Hyvärinen and Smith, 2013), the maximum likelihood-based criterion (MML), regression error-based causal inference (RECI) (Blöbaum et al., 2018), and mutual information (Zhang and Hyvärinen, 2009a; Yamada and Sugiyama, 2010). Nevertheless, the common issue of the first type of methods is that they restrict the model classes and that their performance is sensitive to model assumptions. The second type of methods achieves identifiability by proposing other principles instead of restricting model classes, such as exogeneity (Zhang et al., 2015b), the randomness of exogenous variables (Entropic) (Compton et al., 2021), and independent causal mechanisms, e.g., GPI (Stegle et al., 2010) and IGCI (Janzing et al., 2012).

6 EXPERIMENTS

We demonstrate and evaluate our method on synthetic and real-world cause-effect pair data (Mooij et al., 2016). Moreover, we also provide the experiments and the discussion of our method in the presence of unknown confounding in App. C. The details about the experiments are in App. F.

Synthetic data. We evaluate DIVOT with Gaussian noise as in (9) on datasets with different (non)linear functions and sample sizes. We generate synthetic data with the ANMs: 1) $Y = X + E_y$; 2) $Y = 0.1(2.5X)^3 - 0.1X + E_y$; 3) $Y = \sin(4X) + E_y$; 4) if $X < 0$, $Y = 0.5X^3 - X + E_y$, and if $X \ge 0$, $Y = 1 - 0.5X^3 + X + E_y$, where $X \sim \mathcal{U}(-1, 1)$ and $E_y \sim \mathcal{U}(0, 1)$ are uniformly distributed. For each ANM, we generate datasets with the sample sizes 10, 25, 50, 100, 200, and 500. For each sample size, 100 different datasets are generated.
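A minimal sketch of the synthetic ANM data generation described above (the seed and the function names are illustrative assumptions).

```python
# Generate cause-effect pairs from the four ANMs used in the synthetic experiments.
import numpy as np

def f_piece(x):
    return np.where(x < 0, 0.5 * x**3 - x, 1 - 0.5 * x**3 + x)

MECHANISMS = {
    "linear":    lambda x: x,
    "cubic":     lambda x: 0.1 * (2.5 * x)**3 - 0.1 * x,
    "sine":      lambda x: np.sin(4 * x),
    "piecewise": f_piece,
}

def generate_anm(name, n, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, n)          # X ~ U(-1, 1)
    ey = rng.uniform(0.0, 1.0, n)          # Ey ~ U(0, 1)
    return x, MECHANISMS[name](x) + ey     # Y = g(X) + Ey

x, y = generate_anm("sine", 500)
```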
Fig. 2a shows that DIVOT consistently recovers the causal direction in all the cases. Since DIVOT imposes no smoothness constraint on functions, it can deal with case 4), which is a discontinuous function. We compared DIVOT with the results of CAREFL (Khemakhem et al., 2021), RECI (Blöbaum et al., 2018), and other benchmark methods in the Appendix. As shown in Fig. 7 of App. F, our method performs better than the others. Moreover, we also show the robustness of DIVOT to prior misspecification, as in the experiments of CAREFL (Khemakhem et al., 2021). The synthetic data are generated with uniform noise $E_y \sim \mathcal{U}(0, 1)$, where the $p(E_y; \theta)$ of DIVOT is either the uniform distribution $\mathcal{U}(0, 1)$, the beta distribution $\mathcal{B}(a = 0.5, b = 0.5)$, or the standard normal distribution $\mathcal{N}(0, 1)$. As shown in Fig. 2b, DIVOT with the misspecified noise distributions performs similarly to the one using the correct class of distributions.

Table 1: Percentage (%) of recovering the true causal direction on the Tübingen cause-effect pair datasets (Mooij et al., 2016), compared with the reported results of (Stegle et al., 2010) in the upper table and (Zhang et al., 2015a) in the lower table.

Ours (PNL)   Ours (ANM)   LiNGAM   ANM-Gauss   ANM-MML   ANM-HSIC   PNL      GPI-HSIC   GPI-MML   IGCI
78.9 ± 3.9   71 ± 4       62 ± 3   45 ± 3      68 ± 1    68 ± 3     67 ± 4   62 ± 4     72 ± 2    76 ± 1

Ours (PNL)   Ours (ANM)   ANM   WGP-Gauss   WGP-MoG   PNL-MLP   GPI   IGCI
76.6 ± 4.5   67 ± 3       63    67          73        70        72    73

Tübingen cause-effect pair dataset. We apply DIVOT to the Tübingen cause-effect pair dataset (Mooij et al., 2016). This is a collection of real-world cause-effect pairs. We use the variance-based divergence measure (11) and parameterize $p(E_y; \theta)$ as in (9). As for the PNL extension, we implement (12) without the positivity constraint (in practice, imposing no constraint on $\omega$ also performs well in this simple formulation). Moreover, we found that, given a value of $X$ or $Y$ in these datasets, there are often few samples. Thus, we consider a range of values of $X$, which may introduce bias into the divergence measure; hence we also use a linear debiasing function for reducing the bias. Then, we minimize the variance-based divergence measure over the parameters with the autograd of JAX. More details about the debiasing function and the optimization can be found in App. F. To compare with the results reported in other works, we use a maximum sample size of $N = 500$ as in (Stegle et al., 2010; Zhang et al., 2015a) and run all the experiments with 3 random seeds as in (Stegle et al., 2010), though DIVOT is efficient enough for datasets with larger sample sizes, as shown in the Appendix. In addition, we normalize the data and select the samples within 2 standard deviations, because DIVOT, like optimal transport in general, is sensitive to outliers. As shown in Tab. 1, our proposed DIVOT, especially the extension for PNLs, outperforms the other methods. The reported results in Tab. 1 are taken from (Stegle et al., 2010) and (Zhang et al., 2015a), and our results are based on the same datasets as theirs, i.e., with 68 and 77 cause-effect pairs respectively. Moreover, entropic causal inference (Compton et al., 2021) is reported with 64.21% accuracy; CAREFL (Khemakhem et al., 2021) is reported with 73% accuracy on 108 pairs; RECI (Blöbaum et al., 2018) is reported with 76% weighted accuracy on 100 pairs. From Tab. 1, we can also see that the other ANM/PNL-based methods are more sensitive to the choice of noise distributions.
As shown in Tab. 1, the ANM/PNL with Gaussian noise performs worse than the one with a more complex distribution (mixture of Gaussians) or the one combined with a more stable measure, such as the HSIC independence test. In contrast, the PNL extension of DIVOT with Gaussian noise achieves a state-of-the-art result.

7 CONCLUSION

In this paper, we provide a new dynamical-system perspective of FCMs in the context of identifying causal relationships in the bivariate case. We first demonstrate the connection between FCMs and optimal transport and study the dynamical systems of optimal transport under the constraints of FCMs. We then show that FCMs correspond to pressureless potential flows and that ANMs, as a special case, correspond to pressureless potential flows whose velocity field divergence is equal to zero. Based on these findings, we propose a divergence measure-based criterion for causal discovery and provide an efficient optimal transport-based algorithm, DIVOT, for identifying the causal direction between two variables. The experimental results on both synthetic and real datasets show that, compared with the state-of-the-art methods, DIVOT achieves state-of-the-art results and has promising properties in terms of flexibility, efficiency, and robustness to prior misspecification. We hope that the connection between FCMs and optimal transport has the potential to help understand general FCMs from the dynamical perspective and to inspire more generic causal discovery methods.

ACKNOWLEDGEMENTS

KZ would like to acknowledge the support by the National Institutes of Health (NIH) under Contract R01HL159805, by the NSF-Convergence Accelerator Track-D award 2134901, and by the United States Air Force under Contract No. FA8650-17-C7715. RT would like to acknowledge the funding support of the Swedish e-Science Research Centre.

REFERENCES

L. Ambrosio, M. Colombo, G. De Philippis, and A. Figalli. Existence of Eulerian solutions to the semigeostrophic equations in physical space: the 2-dimensional periodic case. Communications in Partial Differential Equations, 37(12):2209-2227, 2012.

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214-223. PMLR, 06-11 Aug 2017.

V. I. Arnol'd. Mathematical methods of classical mechanics, volume 60. Springer Science & Business Media, 2013.

F. Barthe. Optimal Young's inequality and its converse: a simple proof. Geometric & Functional Analysis GAFA, 8(2):234-242, 1998.

J.-D. Benamou and Y. Brenier. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375-393, 2000.

P. Blöbaum, D. Janzing, T. Washio, S. Shimizu, and B. Schölkopf. Cause-effect inference by comparing regression errors. In International Conference on Artificial Intelligence and Statistics, pages 900-909. PMLR, 2018.

J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

R. T. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud. Neural ordinary differential equations. arXiv preprint arXiv:1806.07366, 2018.

D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507-554, 2002.

S. Compton, M. Kocaoglu, K. Greenewald, and D. Katz.
Entropic causal inference: Identifiability and finite sample results. arXiv preprint arXiv:2101.03501, 2021.

M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.

A. Genevay, G. Peyré, and M. Cuturi. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pages 1608-1617. PMLR, 2018.

A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In International Conference on Algorithmic Learning Theory, pages 63-77. Springer, 2005.

P. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. Advances in Neural Information Processing Systems, 21:689-696, 2008.

C.-W. Huang, D. Krueger, A. Lacoste, and A. Courville. Neural autoregressive flows. In International Conference on Machine Learning, pages 2078-2087. PMLR, 2018.

A. Hyvärinen and P. Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429-439, 1999.

A. Hyvärinen and S. M. Smith. Pairwise likelihood ratios for estimation of non-Gaussian structural equation models. Journal of Machine Learning Research, 14(Jan):111-152, 2013.

D. Janzing and B. Schölkopf. Detecting non-causal artifacts in multivariate linear regression models. In International Conference on Machine Learning, pages 2245-2253. PMLR, 2018.

D. Janzing, J. Mooij, K. Zhang, J. Lemeire, J. Zscheischler, P. Daniušis, B. Steudel, and B. Schölkopf. Information-geometric approach to inferring causal directions. Artificial Intelligence, 182:1-31, 2012.

L. V. Kantorovich. On a problem of Monge. Uspekhi Mat. Nauk., 1948.

I. Khemakhem, R. Monti, R. Leech, and A. Hyvärinen. Causal autoregressive flows. In International Conference on Artificial Intelligence and Statistics, pages 3520-3528. PMLR, 2021.

S. Kolouri, G. K. Rohde, and H. Hoffmann. Sliced Wasserstein distance for learning Gaussian mixture models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3427-3436, 2018.

S. Kolouri, K. Nadjahi, U. Simsekli, R. Badeau, and G. Rohde. Generalized sliced Wasserstein distances. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

A. A. Mastakouri, B. Schölkopf, and D. Janzing. Necessary and sufficient conditions for causal feature selection in time series with latent common causes. In International Conference on Machine Learning, pages 7502-7511. PMLR, 2021.

R. J. McCann. A convexity principle for interacting gases. Advances in Mathematics, 128(1):153-179, 1997.

G. Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de l'Académie Royale des Sciences de Paris, 1781.

R. P. Monti, K. Zhang, and A. Hyvärinen. Causal discovery with general non-linear relationships using non-linear ICA. In Uncertainty in Artificial Intelligence, pages 186-195. PMLR, 2020.

J. Mooij, D. Janzing, J. Peters, and B. Schölkopf. Regression by dependence minimization and its application to causal inference in additive noise models. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 745-752, 2009.

J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf. Distinguishing cause from effect using observational data: methods and benchmarks.
The Journal of Machine Learning Research, 17(1):1103-1204, 2016.

F. Otto. Viscous fingering: an optimal bound on the growth rate of the mixing zone. SIAM Journal on Applied Mathematics, 57(4):982-990, 1997.

S. T. Rachev and L. Rüschendorf. Mass Transportation Problems: Volume I: Theory, volume 1. Springer Science & Business Media, 1998.

S. Shimizu, P. O. Hoyer, A. Hyvärinen, A. Kerminen, and M. Jordan. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(10), 2006.

L. N. Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464-472. IEEE, 2017.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search, 2nd edn., vol. 1, 2001.

O. Stegle, D. Janzing, K. Zhang, J. M. Mooij, and B. Schölkopf. Probabilistic latent variable models for distinguishing between cause and effect. Advances in Neural Information Processing Systems, 23:1687-1695, 2010.

T. Tao. Matrix identities as derivatives of determinant identities, 2013. URL https://terrytao.wordpress.com/2013/01/13/matrix-identities-as-derivatives-of-determinant-identities/.

M. Yamada and M. Sugiyama. Dependence minimizing regression with model selection for non-linear causal inference under non-Gaussian noise. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 24, 2010.

K. Zhang and A. Hyvärinen. Causality discovery with additive disturbances: An information-theoretical perspective. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 570-585. Springer, 2009a.

K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 647-655, 2009b.

K. Zhang, Z. Wang, J. Zhang, and B. Schölkopf. On estimation of functional causal models: general results and application to the post-nonlinear causal model. ACM Transactions on Intelligent Systems and Technology (TIST), 7(2):1-22, 2015a.

K. Zhang, J. Zhang, and B. Schölkopf. Distinguishing cause from effect based on exogeneity. arXiv preprint arXiv:1504.05651, 2015b.

In App. A, we provide the modification of Alg. 1 for including the independent case and the significance of the results. In App. B, we provide the outlook of the extension to the multivariate case. In App. C, we provide the analysis, the discussion, and the experimental results of our method in the presence of unknown confounding. In App. D, we provide the proofs of Prop. 1, the extension of Prop. 1, Prop. 2, and Thm. 1. In App. E, we include the identifiability conditions of ANMs in (Hoyer et al., 2008). In App. F, we introduce the details of the DIVOT implementation and the experiments:
App. F.1: introduce the concept, position, for computing the divergence measure (11);
App. F.2: show a potential problem of computing the variance-based divergence measure in the finite/limited-sample case and introduce batches of data to deal with the problem;
App. F.3: show that using batches can introduce bias and lead to wrong causal relationships (especially in the few-sample case) and then introduce debiasing functions to reduce the bias, which are used for the experiments on real-world data in Sec. 6;
App. F.4: introduce the optimization methods used in the experiments on synthetic and real-world data in Sec. 6 and show the convexity of the objective function, the divergence measure (11), under the construction in Sec. 6;
App. F.5: show the robustness of DIVOT to prior misspecification;
App. F.6: compare DIVOT with other benchmark methods;
App. F.7: show the efficiency of DIVOT with its running time.

A MODIFICATION OF ALG. 1 FOR INCLUDING THE INDEPENDENT CASE AND THE SIGNIFICANCE OF THE RESULTS

The proposed measure is able to deal with the independent case without relying on other methods or tests after adapting the output conditions of the algorithm. The values of our proposed measure in the independent case are zero in both directions, while in the causal case the measure is zero in one direction and non-zero in the other direction. The modified output conditions are shown in Tab. 2. To determine whether the value of our measure is zero or not, one could, in practice, follow a similar way as (Blöbaum et al., 2018) and choose a threshold for real-world applications.

Table 2: Modified output conditions for the independent case.

Output of DIVOT       Div(X→Y) = 0    Div(Y→X) = 0
X, Y independent      True            True
X → Y                 True            False
Y → X                 False           True

However, different applications may have different thresholds. For example, in practice, when the noise distributions have different variance in the finite/few-sample scenario, a larger variance can lead to a larger measure value. Although one can handle this problem well by normalizing the variance-based measure value with the estimated noise variance, deriving a statistical test for the finite/few-sample case is still the ideal way, which is another nontrivial task without assuming the type of noise distributions and will be future work of our method.

Bootstrapping for the significance of the results. Instead of testing whether a measure value is significantly zero or non-zero, we suggest using a bootstrapping method and testing whether the two measure values are significantly different (one could also test whether the difference between the two measure values is significantly zero or not):
Step 1. Use bootstrapping (resampling with replacement) to get B (e.g., 50) bootstrapping datasets;
Step 2. Compute the two measure values in the two directions for each bootstrapping dataset;
Step 3. Apply a two-sample test (T-test) to check whether the mean of the measure value in one direction is significantly different from the one in the other direction;
Step 4. If the two measure values are significantly different, we conclude that the smaller one is in the causal direction; if they are not significantly different, we conclude that it is the independent case.

Moreover, we did experiments for the case where the causal relationship is so weak that the FCM with a causal relationship is similar to the independent case (e.g., when the coefficient of the direct cause is extremely small in the linear case, it is close to the independent case where the measure values are close). We generate four datasets with 1000 samples for the four FCMs in the synthetic data experiments in Sec. 6 with a weighting factor $w$: $Y = w \cdot f(X) + E_y$, and then use the bootstrapping method and the T-test. Tab. 3 shows the p-values, where small values indicate significantly different measure values. With a significance level of 0.05, our method can tell the difference between the two measures in the causal case when $w > 0.02$.
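A hedged sketch of the bootstrapping procedure above (Steps 1-4), assuming a `divergence_measure(cause, effect)` routine like the ones sketched in Sec. 4.2 is available; the choice of SciPy's two-sample t-test is an illustrative implementation detail.

```python
# Bootstrap the two directional measures and test whether they differ significantly.
import numpy as np
from scipy.stats import ttest_ind

def bootstrap_decision(x, y, divergence_measure, B=50, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    d_xy, d_yx = [], []
    for _ in range(B):
        idx = rng.integers(0, n, n)                        # Step 1: resample with replacement
        d_xy.append(divergence_measure(x[idx], y[idx]))    # Step 2: measure for X -> Y
        d_yx.append(divergence_measure(y[idx], x[idx]))    #         and for Y -> X
    p_value = ttest_ind(d_xy, d_yx).pvalue                 # Step 3: two-sample t-test
    if p_value >= alpha:                                   # Step 4: not significantly different
        return "independent", p_value
    return ("X->Y" if np.mean(d_xy) < np.mean(d_yx) else "Y->X"), p_value
```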
Table 3: Experimental results of testing the significance of the results.

p-values:
w     0.01    0.02    0.03     0.04        0.05
M1    0.155   0.038   0.0006   2.429e-08   3.091e-11
M2    0.251   0.070   0.0003   0.0001      1.935e-07
M3    0.390   0.119   0.0196   0.0003      9.578e-08
M4    0.408   0.098   0.0037   2.571e-05   2.346e-11

Decisions:
w     0.01    0.02      0.03      0.04      0.05
M1    indep   X cause   X cause   X cause   X cause
M2    indep   indep     X cause   X cause   X cause
M3    indep   indep     X cause   X cause   X cause
M4    indep   indep     X cause   X cause   X cause

B EXTENSION TO THE MULTIVARIATE CASE

The problem of the naive solution to the multivariate case. A direct extension to the multivariate case is as in the extensions of other bivariate causal discovery methods, such as (Khemakhem et al., 2021), (Zhang and Hyvärinen, 2009b), and (Monti et al., 2020). One can first apply constraint-/score-based methods to get a causal skeleton, an undirected causal graph, and then use the extension of our measure for finding all the causal directions. A problem of this extension in some of the other bivariate works assuming causal sufficiency is that they directly apply their bivariate methods to each edge of the causal skeleton without considering the DAG structure. This can lead to wrong results, especially in the case where they disregard the common parent/confounder of an edge.

Extending our method directly to the multivariate case. We provide the extensions of Prop. 1 and the variance-based measure, which can be used for computing the measure value considering the DAG structure and then orienting the edges based on the causal skeleton in the multivariate case. Our extension to the multivariate case has the following properties. Given the causal skeleton,
1. it distinguishes Markov equivalence classes under the identifiability conditions of ANMs (or PNLs);
2. it still benefits from the closed-form 1D optimal transport solution and does not have the computational issues of high-dimensional optimal transport methods, because of the FCM constraints;
3. it provides a score for which the causal structure attains the minimum value among all DAGs of the causal skeleton; moreover, the measure value of a DAG is the summation of the measure values of all its causal modules/conditionals/mechanisms.

First, we show the derivation of the extension of Prop. 1 for ANMs, given which the derivation in the PNL case is straightforward (one can consider what we did for Eqn. (12) in our paper). The proposition shows how to efficiently compute the L2 Wasserstein distance between high-dimensional distributions under the FCM constraints. Suppose that the general ANM is $X_i = g_i(PA_i) + E_i$, where $i = 1, \ldots, m$, $E_i$ is the noise term of $X_i$, and $PA_i$ denotes the parent variables of $X_i$. The square of the L2 Wasserstein distance is
$$W_2^2(p_0, p_T) = \sum_{i=1}^{m} \mathbb{E}_{PA_i}\!\left[ W_2^2\big(p(E_i), p(X_i \mid PA_i)\big) \right],$$
of which the derivation is shown in App. D.2. Second, as a direct implication of Thm. 1, given the couplings of $E_i$ and $X_i$, the corresponding dynamical system has zero divergence of its velocity field; in other words, the corresponding dynamical system which moves the samples of $x_0$ to the samples of $x_T$ under ANM constraints has zero divergence on each dimension. And the variance-based measure of the causal module of $X_i$ is
$$\frac{1}{N}\sum_{k \in \{\text{samples of } PA_i\}_N} \frac{\big\| \operatorname{sort}(\vec{x}_{i|PA_i = k}) - \operatorname{sort}(\vec{e}_i) - \operatorname{ave}(\vec{x}_{i|PA_i = k} - \vec{e}_i) \big\|_2^2}{N_k - 1},$$
where $\vec{x}_{i|PA_i = k}$ is the data vector whose elements are the $X_i$ values of the samples with $PA_i$ taking the value $k$; $\vec{e}_i$ is the data vector whose elements are the generated noise samples; and $N_k$ represents the length of the vector $\vec{x}_{i|PA_i = k}$ or $\vec{e}_i$.
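A hedged sketch of the per-module measure above, assuming the parent configuration $PA_i$ is discrete so samples can be grouped by $PA_i = k$ directly (continuous parents would need binning, as in App. F); the score of a candidate DAG is then the sum of this quantity over its modules.

```python
# Variance-based measure of one causal module p(X_i | PA_i), given generated noise samples e_i.
import numpy as np

def module_measure(parent_config, xi, ei):
    """parent_config[j] labels the PA_i configuration of sample j; xi, ei are 1D sample arrays."""
    total, groups = 0.0, 0
    for k in np.unique(parent_config):
        mask = (parent_config == k)
        if mask.sum() < 2:
            continue                                   # a variance needs at least two samples
        v = np.sort(xi[mask]) - np.sort(ei[mask])      # velocities from the sorted 1D coupling
        total += np.sum((v - v.mean())**2) / (mask.sum() - 1)
        groups += 1
    return total / max(groups, 1)                      # average over parent configurations
```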
Next, as the direct extension, one could enumerate all possible DAGs of the causal skeleton and compute their measure values, of which the minimum corresponds to the causal graph. Because the causal skeleton is given, it must be the case that one of the two variables of an edge is the cause and the other one is the effect. So the enumerated graphs fall into two situations: 1) all the edges are correctly oriented; 2) the causal direction of at least one edge is wrong, such that the measure value of at least one causal module is significantly larger than the correct one (note that considering a child as the direct cause leads to an increase of the measure value, while omitting a cause does not necessarily increase the measure value of the causal module). Therefore, we can simply choose the graph with the minimum measure value as the causal one. As mentioned in the paper, it is also very important to develop practical algorithms for large-scale real-world problems, and there are some points for future work to further explore:
1. Testing the significance of the results in the multivariate case, i.e., whether the minimum value is significantly smaller than the others. One could apply bootstrapping to the dataset and obtain a p-value for the measure value. One could also apply bootstrapping for each causal module; however, this may run into the multiple statistical testing issue with family-wise errors.
2. Developing an efficient search algorithm in the multivariate case without relying on constraint-/score-based methods. This requires analysing the measure value of a causal module in more situations (e.g., omitting a parent, involving an independent variable/non-child descendant/non-parent ancestor, or a case mixing the mentioned factors) and using the characteristics to develop an efficient search algorithm similar to the greedy search algorithm (Chickering, 2002).

C UNKNOWN CONFOUNDING

Unknown confounding has different influences on the results of our method in different situations: 1) the independent case, where the two variables are independent; 2) the causal case, where there is a causal relationship between the two variables. Depending on how the unknown confounding influences the pair of variables, it can make our method
1. reverse the direction of the result in the causal case;
2. disregard the causal relationship in the causal case;
3. introduce an extraneous causal relationship in the independent case.

To understand these results, we can analyze some toy examples intuitively. One extreme independent case is $X = U$ and $Y = E_y + U$, where $U$ is the unobserved confounder, $E_y$ is the noise of $Y$, and $X$ has no noise. Then our method will show that $X$ is the cause of $Y$, even though there is no causal relationship. Similarly, suppose that $U$ and the noise of $X$ and $Y$ have the same distribution with variance equal to 1, and that $X = E_x + 1000U$ and $Y = E_y + U$. Then the distribution of $X$ can be dominated by $U$, which leads to the wrong result for the same reason as in the extreme case. The causal case follows the same reasoning. For example, for $X = E_x + U$ and $Y = X + E_y + 1000U$, the distribution of $Y$ can be dominated by $U$. Therefore, the FCM is close to $X = E_x + U$ and $Y = 1000U$, which leads to reversing the causal direction in the result.
Moreover, when $X = E_x + 1000U$ and $Y = X + E_y + 1000U$, both distributions of $X$ and $Y$ can be dominated by $U$; hence, it is close to $X = 1000U$ and $Y = 1000U$, which leads to disregarding the causal relationship in the result. Nevertheless, in practice, the impact of unknown confounding is more complex, and all factors can be mixed together with the impact of finite samples and the function properties. Therefore, we provide experiments based on synthetic datasets. From the experimental results, we also find that the strength of the confounding and the difference of the confounding strength on the two variables are two important factors resulting in wrong results of our method.

Experimental results for the unknown confounder. We generate a dataset with 1000 samples for each FCM: FCM1) $X = U$ and $Y = U$; FCM2) $X = E_x + w_x U$ and $Y = E_y + w_y U$; FCM3) $X = E_x + w_x U$ and $Y = f(X) + E_y + w_y U$, where $E_x$ and $E_y$ are the noise terms, $U$ is the unknown confounder, and $w_x$ and $w_y$ are the coefficients of $U$ representing the confounding strength; we used $f(X) = X$ and $f(X) = \sin(4X)$ in the experiments. As for the data generation, $E_x$, $E_y$, and $U$ follow the uniform distribution $\mathcal{U}(0, 1)$, and we vary the coefficients $w_x$ and $w_y$. Instead of testing whether a result is significantly close to zero, we use bootstrapping to resample 50 datasets for each generated dataset and then apply a two-sample test to check whether the measure values in the two different directions are significantly different. Although the measure value is, in theory, zero in the causal direction when there is no unknown confounder, with finite samples the measure value will be larger than zero in practice. Deciding whether a measure value is close to zero requires either a threshold, which can vary in different applications, or a statistical test, which is nontrivial to derive for the measure without assuming the type of noise distributions and can be future work of our method. Thus, we test whether the two measure values are significantly different; if so, we conclude that the smaller one is in the causal direction; if they are not significantly different, we conclude that it is the independent case. As for the experimental results, when the p-value is close to zero, it means that the two measure values are significantly different. In general, one could take a significance level of 0.05 to make a conclusion; we used 0.05 for the experiments. The experimental result of FCM1 is that the p-value is equal to 1.0, which means that the measure values in the two directions are not significantly different. As for the experimental results of FCM2, we found that in the independent case, when the coefficients $w_x$ and $w_y$ are the same, we get the correct results; when the coefficients are different, the method gives wrong results. In different applications/scenarios, there are different tolerance ranges for our method such that, when the difference of the confounding coefficients is within the range, we can still get the correct results even if the confounding coefficients are different.
Table 4: The experimental results of FCM2 (the independent case) in the presence of unknown confounding.

p-values:
wx\wy   0.1        1.0        10
0.1     0.354      1.57e-39   9.20e-05
1.0     -          0.159      3.20e-49
10      -          -          0.451

Conclusions:
wx\wy   0.1      1.0      10
0.1     indep    causal   causal
1.0     -        indep    causal
10      -        -        indep

As for the experimental results of FCM3, we found that in the causal case the confounding strength is the more influential factor, which differs from the independent case. Moreover, we found that when the nonlinear function f(X) is non-monotonic, such as sin(4X), the number of correct results is larger than in the linear case. Because a non-monotonic function itself introduces a type of asymmetry that can indicate the causal direction for our method, non-monotonic functions can be easier to handle than monotonic ones.

Table 5: The experimental results of FCM3 (the causal case) with f(X) = X in the presence of unknown confounding.

p-values:
wx\wy   0.1        1.0        10         100
0.1     8.56e-56   2.98e-36   0.157      1.42e-06
1.0     1.22e-49   0.00027    9.60e-49   3.26e-51
10      0.002      7.02e-05   0.0018     0.007
100     0.385      0.398      0.577      0.70

Conclusions:
wx\wy   0.1       1.0       10        100
0.1     X cause   X cause   indep     X cause
1.0     X cause   X cause   Y cause   Y cause
10      Y cause   Y cause   Y cause   X cause
100     indep     indep     indep     indep

Table 6: The experimental results of FCM3 (the causal case) with f(X) = sin(4X) in the presence of unknown confounding.

p-values:
wx\wy   0.1        1.0        10         100
0.1     1.42e-06   1.66e-68   5.62e-16   2.91e-09
1.0     5.68e-76   4.70e-81   5.62e-16   6.94e-51
10      2.17e-55   2.70e-47   1.36e-51   0.50
100     2.93e-06   2.70e-47   9.81e-24   0.08

Conclusions:
wx\wy   0.1       1.0       10        100
0.1     X cause   X cause   X cause   X cause
1.0     X cause   X cause   X cause   Y cause
10      Y cause   Y cause   X cause   indep
100     Y cause   Y cause   X cause   indep

Future work for dealing with unknown confounders. Causal discovery in the presence of unknown confounding is still an open problem, but there have been some promising results in recent years, such as (Janzing and Schölkopf, 2018; Mastakouri et al., 2021). In particular, (Mastakouri et al., 2021) shows identifiability results on time-series data in the presence of memoryless unknown confounders, where the confounder does not have an autocorrelation effect. A potential research direction with our framework is that, by recovering the corresponding dynamical process of a static causal discovery problem while accounting for memoryless confounding, we could determine the causal direction between two variables in the presence of memoryless unknown confounders.

D PROOFS AND DERIVATION

D.1 DERIVATION OF PROP. 1

Under constraints (I) and (II),

W_2^2(p_0, p_T) = ∫ |M^*(x_0) − x_0|^2 p_0(x_0) dx_0                              (definition of W_2^2)
               = ∫∫ (f(E_x, E_y) − E_y)^2 p(E_y) dE_y p(E_x) dE_x                 (Eqn. (6) and the independence of E_x and E_y)
               = E_{E_x}[ W_2^2(p(E_y), p(Y | E_x)) ].

D.2 PROOF OF THE EXTENSION OF PROP. 1

Suppose that in the multivariate ANM, x_0 = [E_1, ..., E_m]^T, x_T = [X_1, ..., X_m]^T, and X_i = g_i(PA_i) + E_i, where PA_i denotes the parents of X_i.
Then,

W_2^2(p_0, p_T) = ∫ |M^*(x_0) − x_0|^2 p_0(x_0) dx_0
= ∫ |M^*([E_1, .., E_m]^T) − [E_1, .., E_m]^T|^2 p(E_1, ..., E_m) dx_0
= Σ_{i=1}^m ∫ |M^*_i([E_1, .., E_m]^T) − E_i|^2 p(E_1, ..., E_m) dx_0            (where M^*_i denotes the i-th element of M^*)
= Σ_{i=1}^m ∫∫ |M^*_i([E_1, .., E_m]^T) − E_i|^2 p(E_i) dE_i p(E_{1:m\i}) dE_{1:m\i}   (where 1:m\i denotes 1, ..., i−1, i+1, ..., m, and the independence of the noise implies p(E_1, ..., E_m) = p(E_i) p(E_{1:m\i}))
= Σ_{i=1}^m ∫∫ |f(E_i, E_{1:m\i}) − E_i|^2 p(E_i) dE_i p(E_{1:m\i}) dE_{1:m\i}
= Σ_{i=1}^m ∫∫ |f̃(E_i, M^*_{1:m\i}(E_{1:m\i})) − E_i|^2 p(E_i) dE_i p(E_{1:m\i}) dE_{1:m\i}
= Σ_{i=1}^m ∫∫ |f̃(E_i, X_{1:m\i}) − E_i|^2 p(E_i) dE_i p(E_{1:m\i}) dE_{1:m\i}   (applying the change-of-variable formula)
= Σ_{i=1}^m ∫∫ |f̃(E_i, X_{1:m\i}) − E_i|^2 p(E_i) dE_i p(X_{1:m\i}) dX_{1:m\i} |det(J)|^{−1}
= Σ_{i=1}^m ∫∫ |f̃(E_i, X_{1:m\i}) − E_i|^2 p(E_i) dE_i p(X_{1:m\i}) dX_{1:m\i}   (since |det(J)|^{−1} = 1 for ANMs)
= Σ_{i=1}^m E_{PA_i}[ ∫ |f̃(E_i, PA_i) − E_i|^2 p(E_i) dE_i ]                      (since, for an ANM, f̃(E_i, X_{1:m\i}) = g_i(PA_i) + E_i depends on X_{1:m\i} only through PA_i)
= Σ_{i=1}^m E_{PA_i}[ W_2^2(p(E_i), p(X_i | PA_i)) ].

D.3 PROOF OF THM. 1

Part I. We first derive the time interpolation p(t, x_t) of p_0 and p_T under the dynamical FCM constraints. Suppose that the FCM is Y = f(X, E_y). For the dynamical formulation of the L2 Wasserstein distance under constraints (I) and (II), according to the Jacobian equation,

p(t, x_t) = p_0(x_0) / |det(J_{x_0}^{x_t})|,   t ∈ (0, T),

where x_t, x_0 ∈ R^2 and J_{x_0}^{x_t} is the Jacobian matrix of x_t w.r.t. x_0. Moreover,

∂ log p(t, x_t)/∂t = −(∂/∂t) log |det(J_{x_0}^{x_t})|,

where, with x_t = x_0 + (t/T)(M^*(x_0) − x_0),

J_{x_0}^{x_t} = I + (t/T)(∇M^* − I) =
[ 1                 0
  (t/T) ∂f/∂X      1 + (t/T)(∂f/∂E_y − 1) ].

According to Jacobi's formula and (Tao, 2013),

(d/dt) |det(A(t))| = ± det(A(t)) tr(A(t)^{−1} dA(t)/dt) = ± det(A(t)) tr(det(A(t))^{−1} adj(A(t)) dA(t)/dt) = ± Σ_{i,j} [adj(A(t))^T ⊙ dA(t)/dt]_{ij},

where ⊙ is element-wise multiplication, adj(A) is the adjugate matrix of A, and the sign ± is the same as the sign of det(A(t)). Furthermore, since under the structural constraints (I) and (II) the map M^* is given by Eqn. (6), we replace A(t) with J_{x_0}^{x_t} and use

dJ_{x_0}^{x_t}/dt = (1/T)(∇M^* − I) =
[ 0              0
  (1/T) ∂f/∂X   (1/T)(∂f/∂E_y − 1) ],

which gives

(d/dt) |det(J_{x_0}^{x_t})| = ± (1/T)(∂f/∂E_y − 1),

where the sign is the same as that of det(J_{x_0}^{x_t}).

Part II. We then prove the property of the corresponding dynamical systems of ANMs stated in Thm. 1. Because ∂ log p(t, x_t)/∂t = −(∂/∂t) log |det(J_{x_0}^{x_t})|, and under the conditions of Thm. 1 we have ∂f/∂E_y = 1, it is clear that (d/dt)|det(J_{x_0}^{x_t})| = 0 and |det(J_{x_0}^{x_t})| ≠ 0 for all t ∈ [0, T]. Furthermore, according to the theorem of the instantaneous change of variables (Chen et al., 2018), which is a variant of the Fokker-Planck equation, we know that

d log p(x(t))/dt = −tr(∇v) = −div v.

Therefore, div v = 0.

D.4 PROOF OF PROP. 2

Necessary direction. We prove the necessary direction by showing that, given that X is the direct cause of Y in an ANM, D(v) = 0. In the causal direction, because of the time evolution equation (5), it is sufficient to check the divergence of the velocity field at time 0: the divergence being zero everywhere at time 0 implies that it is zero everywhere for t ∈ [0, T]. By Thm. 1, div v = 0 at time 0. Therefore, D(v) = 0.

Sufficient direction. We prove the sufficient direction by showing a contradiction with the identifiability of ANMs. Given that D(v) = 0 under constraints (I), (II), (III) and the identifiability conditions of ANMs, there is an ANM of the form of Eqn. (1) that is consistent with the data distribution and has independent noise. By the identifiability of ANMs shown by (Hoyer et al., 2008), there is no ANM with Y → X that is simultaneously consistent with the data distribution and has independent noise. Therefore, it can only be the case that X is the cause of Y. A small numerical illustration of the divergence-free property used here is given below.
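This is a minimal numerical check of the divergence-free property of Thm. 1 (not part of the paper's proofs), assuming the straight-line interpolation x_t = x_0 + (t/T)(M^*(x_0) − x_0); the ANM function g below is an arbitrary illustrative choice taken from Fig. 3.

import jax
import jax.numpy as jnp

T = 1.0
g = lambda x: 0.1 * (2.5 * x) ** 3 - 0.1 * x   # illustrative ANM function

def transport_map(x0):
    # ANM transport map (Ex, Ey) -> (X, Y) with X = Ex and Y = g(X) + Ey
    ex, ey = x0[0], x0[1]
    return jnp.stack([ex, g(ex) + ey])

def interpolate(x0, t):
    # straight-line (pressureless) interpolation between x_0 and M*(x_0)
    return x0 + (t / T) * (transport_map(x0) - x0)

def velocity_divergence(x0, t):
    # v(t, x_t) = (M*(x_0) - x_0) / T; differentiate w.r.t. x_t via the chain rule
    dv_dx0 = jax.jacfwd(lambda z: (transport_map(z) - z) / T)(x0)
    dxt_dx0 = jax.jacfwd(lambda z: interpolate(z, t))(x0)
    return jnp.trace(dv_dx0 @ jnp.linalg.inv(dxt_dx0))

x0 = jnp.array([0.3, -0.2])             # an arbitrary (Ex, Ey)
print(velocity_divergence(x0, 0.7))     # ~0 up to floating-point error, as Thm. 1 predicts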
Moreover, one concern is that, in the non-causal direction, the model might not be an ANM and yet have a divergence measure equal to zero; this would be problematic when the divergence measure is applied in practice. However, we show that in general this does not happen, by proving that under weak assumptions D(v) = 0 implies that the model is an ANM. First, D(v) = 0 implies div v = 0 for all x_0 with positive probability density p(x_0) > 0; in the following we write div v = 0 for short. Second, div v = 0 implies ∂p(t, x_t)/∂t = 0 a.e. according to Thm. 1. Consequently, (d/dt)|det(J_{x_0}^{x_t})| = 0, which leads to ∂f/∂E_y = 1 a.e., where Y = f(X, E_y), according to the proof of Thm. 1. Therefore, under the assumptions that 1) p(E_y) is positive on a continuous range of E_y, and 2) within this range, ∂f/∂E_y = 1 a.e. implies that it holds everywhere, we have Y = E_y + C, where C is a quantity that does not change with E_y, e.g., a function of X or a constant. Hence, under assumptions 1) and 2), D(v) = 0 implies Y = E_y + g(X).

E IDENTIFIABILITY CONDITIONS OF ANMS IN (HOYER ET AL., 2008)

Because the identifiability conditions of ANMs in (Hoyer et al., 2008) are important and necessary for our Thm. 1, we include them here. Let the joint probability density of x and y be given by p(x, y) = p_n(y − f(x)) p_x(x), where p_n, p_x are probability densities on R. If there is a backward model of the same form, i.e., p(x, y) = p̃_n(x − g(y)) p_y(y), then, denoting ν := log p_n and ξ := log p_x, the triple (f, p_x, p_n) must satisfy the following differential equation for all x, y with ν''(y − f(x)) f'(x) ≠ 0:

ξ''' = ξ'' ( −ν''' f'/ν'' + f''/f' ) − 2 ν'' f'' f' + ν' f''' + ν' ν''' f'' f'/ν'' − ν' (f'')^2/f',

where we have skipped the arguments y − f(x), x, and x for ν, ξ, and f and their derivatives, respectively. Moreover, if for a fixed pair (f, ν) there exists y ∈ R such that ν''(y − f(x)) f'(x) ≠ 0 for all but a countable set of points x ∈ R, then the set of all p_x for which p has a backward model is contained in a 3-dimensional affine space.

F EXPERIMENTS AND DETAILS OF DIVOT

For the synthetic data experiments in Sec. 6, we visualize the generated data with sample size 500 in Fig. 3. For determining causal direction, we use the variance-based divergence measure in Eqn. (11).

Figure 3: Synthetic data generated from ANMs with different (non)linear functions. From left to right, they are (1) Y = X + Ey; (2) Y = 0.1(2.5X)^3 − 0.1X + Ey; (3) Y = sin(4X) + Ey; (4) Y = 0.5X^3 − X + Ey if X < 0, and Y = 1 − 0.5X^3 + X + Ey otherwise, where X ∼ U(−1, 1) and Ey ∼ U(0, 1) are uniform distributions.

The experimental setup is the same for experiments with different ANMs. For each ANM, we run experiments with different sample sizes: 10, 25, 50, 100, 200, and 500. For each sample size, 100 different datasets are generated. For all the synthetic data experiments in Sec. 6, we use batches for computing Eqn. (11) without using debiasing functions. In the following, we introduce the positions (App. F.1), batches (App. F.2), debiasing functions (App. F.3), and optimization details (App. F.4) of DIVOT, and we show its robustness to prior misspecification (App. F.5), further comparisons with other benchmark methods (App. F.6), and its efficiency (App. F.7). A short data-generation sketch for the ANMs of Fig. 3 is given below.
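This is our own re-implementation of the data generation for the four ANMs of Fig. 3; the restored minus signs in functions (2) and (4) follow our reading of the caption and are assumptions.

import numpy as np

rng = np.random.default_rng(0)

def f_piecewise(x):
    # function (4) of Fig. 3, as we read the caption
    return np.where(x < 0, 0.5 * x**3 - x, 1 - 0.5 * x**3 + x)

anm_functions = [
    lambda x: x,                               # (1) linear
    lambda x: 0.1 * (2.5 * x) ** 3 - 0.1 * x,  # (2) cubic
    lambda x: np.sin(4 * x),                   # (3) sine
    f_piecewise,                               # (4) piecewise cubic
]

def sample_anm(g, n=500):
    x = rng.uniform(-1.0, 1.0, size=n)   # X ~ U(-1, 1)
    ey = rng.uniform(0.0, 1.0, size=n)   # Ey ~ U(0, 1)
    return x, g(x) + ey

datasets = [sample_anm(g) for g in anm_functions]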
Figure 4: The influence of the batch size and the debiasing function on causal direction determination in panel (a) and density estimation in panel (b). Panel (a): the influence of batch sizes on determining causal direction and the improvement with a linear debiasing function. Panel (b): the influence of batch sizes on estimating p(Ey; θ) and the improvement with a linear debiasing function; the ground-truth θ is 1. The synthetic data (with sample size 100) are generated from the linear ANM Y = X + Ey, where X ∼ U(−1, 1) and Ey ∼ U(0, 1) are uniform distributions. In panels (a) and (b), the batch size is represented as a ratio of the sample size, e.g., 0.1 means that the batch size is 0.1 × 100; the red triangle represents experiments using only the variance-based divergence measure Dvar(v); the blue star represents using Dvar(v) together with a linear debiasing function. The error bars in panel (b) represent standard deviations.

F.1 NUMBER OF POSITIONS FOR COMPUTING THE VARIANCE-BASED DIVERGENCE MEASURE

Suppose a dataset {(x_i, y_i)}_{i=1}^N with sample size N. For computing the divergence measure, we call x ∈ {x_i} a position, and for each position x we need to specify the corresponding vector ỹ_x, of which the number of elements is N_x, and generated samples ẽ_y. In practice, it is not necessary to use all N positions for computing Eqn. (11). For the synthetic data experiments in Sec. 6, when N > 50 we choose 50 positions out of N; otherwise, we choose all N positions. To select the positions (when N > 50), we first find the maximal and minimal values of {x_i}, compute the interval length (max({x_i}) − min({x_i}))/50, and then choose a position every such length (or its nearest available position in the data). We can then compute Eqn. (11) for causal direction determination.

F.2 BATCHES FOR THE FINITE (OR LIMITED) SAMPLE SCENARIO

In the finite sample scenario, it can happen that there are no other data points exactly at a position x. Therefore, we use the neighbors of x as a batch of data for computing Eqn. (11) and treat all the data in a batch as having the same value x. We represent the batch size as a percentage of the total number N of samples. For the synthetic data experiments in Sec. 6, we use batch size 0.4 for datasets with sample size 10; 0.2 for sample sizes 25 and 50; 0.15 for sample sizes 100 and 200; and 0.05 for sample size 500. Note that the batch size is not larger than 0.2 except for the extremely small datasets with 10 samples, because using larger batch sizes introduces bias by concatenating data at different positions within a batch. For the experiments in Fig. 4a, we generate a synthetic dataset with sample size 100 and use all 100 positions for computing Eqn. (11). We can see that choosing a proper batch size can increase the accuracy of DIVOT by increasing the number of samples at a position; however, increasing the batch size beyond 0.2 decreases the accuracy, because concatenating data at different positions leads to a noise distribution different from the ground-truth p(Ey). Thus, as shown in Fig. 4b, the estimation of the noise distribution becomes increasingly worse as the batch size grows. A sketch of the position and batch selection is given below.
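The position and batch selection of App. F.1-F.2 could be implemented roughly as follows; this is a sketch based on our reading of the text, not the released code.

import numpy as np

def choose_positions(x, max_positions=50):
    # Equally spaced target positions over [min(x), max(x)], snapped to the
    # nearest available data point, as described in App. F.1.
    if len(x) <= max_positions:
        return np.sort(x)
    grid = np.linspace(x.min(), x.max(), max_positions)
    return np.array([x[np.argmin(np.abs(x - g))] for g in grid])

def batch_at(x, y, position, batch_ratio=0.2):
    # The batch consists of the nearest neighbours of `position` in x,
    # all treated as if they shared the same position (App. F.2).
    k = max(1, int(round(batch_ratio * len(x))))
    idx = np.argsort(np.abs(x - position))[:k]
    return y[idx]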
Such bias can be problematic, especially in the few-sample case where we have to choose a larger batch size for computing the divergence measure. Therefore, we use a debiasing function so that we can use a larger batch size, and hence more samples per position, in the few-sample case without sacrificing the performance of DIVOT.

Figure 5: As a by-product, debiasing functions can be used for estimating the nonlinear function of an ANM in Eqn. (1). We optimize the parameters of a neural network representing a nonlinear debiasing function and the parameter of a uniform distribution representing the hypothesized noise distribution. Panel (a): the debiasing function and g(x) in the ANM; the ground-truth noise distribution corresponds to θ = 1 and the estimated parameter θ̂ is 1.066. Panel (b): the samples of (x_i, y_i^debias), where, given a dataset {(x_i, y_i)}, y_i^debias = y_i − g_d(x_i).

F.3 DEBIASING FUNCTIONS FOR THE FEW-SAMPLE SCENARIO

Because the mentioned bias is introduced by omitting the position information, we reintroduce this information with a debiasing function g_d(x) to reduce the bias, modifying Eqn. (11) as

‖ sort(ỹ_x^debias) − sort(ẽ_y) − ave(ỹ_x^debias − ẽ_y) ‖^2 / (N_x − 1),   ẽ_y^i ∼ p(E_y; θ),   ỹ_x^debias = ỹ_x − g_d(x; w),   (13)

where g_d(x; w) is a (non)linear function parameterized by w; e.g., in our experiments the linear debiasing function is g_d(x) = w·x. As shown in Fig. 4, using the linear debiasing function preserves not only the accuracy when using a batch size larger than 0.2 but also the correctness of the density estimation. For a more complicated scenario, e.g., an ANM with a nonlinear function, one can use a neural network as the debiasing function. We found that, as a by-product, a sufficiently flexible debiasing function can be used for estimating g(x) of the ANM in Eqn. (1). As shown in Fig. 5, we generate 500 synthetic data points from an ANM with the nonlinear function Y = g(X) + Ey = 0.1(2.5X)^3 − 0.1X + Ey, where X ∼ U(−1, 1) and Ey ∼ U(−0.5, 0.5) are uniformly distributed. The learned g_d(x) is close to the ground-truth g(x). Because this is not the main focus of the work, we refer the readers to the Jupyter notebook in the supplementary materials for the implementation details. Nevertheless, the purpose of using debiasing functions is not to estimate g(x) accurately but to reduce the bias introduced by using a large batch size. This means that when we use a restrictive class of g_d(x), there can be a noticeable difference between g_d(x) and g(x). But for the purpose of causal direction determination, the divergence measure does not require specifying the nonlinear functional form of the ANM, and DIVOT, being based on optimal transport, does not require estimating g(x) either. Therefore, DIVOT can still determine the causal direction correctly in such cases.

F.4 OPTIMIZATION METHODS AND THE CONVEXITY OF THE OBJECTIVE FUNCTIONS OF DIVOT

In this section, we first introduce the optimization method of the synthetic experiments in Sec. 6 and the convexity of the objective function; we then introduce the optimization method of DIVOT in the real-world data experiments in Sec. 6, whose divergence measure uses a linear debiasing function and standard Gaussian noise. For the synthetic data experiments, we minimize the variance-based divergence measure in Eqn. (11) as the objective function, whose parameter is θ. A sketch of this optimization is given below.
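As an illustration only (not the released implementation), the debiased measure in the style of Eqn. (13) at one position, and a joint gradient loop over θ and w, might look as follows in JAX; the helper names and the plain joint update are our own simplifications of the alternating updates described in the paper.

import jax
import jax.numpy as jnp

def debiased_term(params, x_batch, y_batch, e_source):
    # Eqn. (13)-style term at one position: linear debiasing g_d(x) = w * x and a
    # linear noise parameterization e_y = theta * e_source (cf. Eqn. (14)).
    # Assumes len(e_source) == len(y_batch).
    theta, w = params
    y_debias = y_batch - w * x_batch
    e_y = theta * e_source                       # samples from p(Ey; theta)
    diff = jnp.sort(y_debias) - jnp.sort(e_y)    # 1-D optimal transport by sorting
    diff = diff - jnp.mean(y_debias - e_y)       # subtract ave(y_debias - e_y)
    return jnp.sum(diff ** 2) / (y_batch.shape[0] - 1)

def loss(params, batches, e_source):
    # Sum the per-position terms over all selected positions/batches.
    return sum(debiased_term(params, xb, yb, e_source) for xb, yb in batches)

# Tiny synthetic example: linear ANM Y = X + Ey with Ey ~ U(0, 1).
x = jax.random.uniform(jax.random.PRNGKey(0), (100,), minval=-1.0, maxval=1.0)
y = x + jax.random.uniform(jax.random.PRNGKey(1), (100,))
e_source = jax.random.uniform(jax.random.PRNGKey(2), (20,))   # source noise samples
idx = jnp.argsort(jnp.abs(x - 0.0))[:20]                      # one batch around position 0
batches = [(x[idx], y[idx])]

grad_fn = jax.grad(loss)
params = jnp.array([1.0, 0.0])                    # (theta, w)
for _ in range(200):                              # plain gradient descent, step size 1
    params = params - 1.0 * grad_fn(params, batches, e_source)
    params = params.at[0].set(jnp.maximum(params[0], 1e-6))   # keep theta positive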
Because when θ > 0 the linear function f_θ^noise(ẽ_y^source) does not change the ordering of the sorted ẽ_y^source, we have

‖ sort(ỹ_x) − f_θ^noise(sort(ẽ_y^source)) − ave(ỹ_x − f_θ^noise(ẽ_y^source)) ‖^2 = ‖ sort(ỹ_x) − θ sort(ẽ_y^source) − ave(ỹ_x − θ ẽ_y^source) ‖^2,
with f_θ^noise(ẽ_y^source) = θ ẽ_y^source,   θ > 0,   ẽ_y^source ∼ U(0, 1).   (14)

It is clear that the divergence measure in Eqn. (14) is convex in θ for θ > 0. The convexity is achieved by parameterizing the noise distribution with a linear function f^noise, and there are many existing toolboxes for solving the minimization problem. For the synthetic data experiments, we used gradient descent to find the optimal θ*. The gradient jax.grad(loss) is computed with the autograd of JAX (Bradbury et al., 2018). We update θ by specifying a step size sz and setting θ := θ − jax.grad(loss) · sz. If after an update θ < 0 (which has never happened), we set θ to a positive number close to zero. We used sz = 1 for all the synthetic data experiments. Moreover, because of the convexity, one can also use a one-step update method to obtain the best parameter θ* directly by finding the root of the gradient function of Eqn. (14): we simply specify a range for θ and find θ* by binary search such that the gradient at θ* is equal to zero. The experiments of DIVOT for ANMs on the real-world data also used the one-step update for θ, with the range of θ specified as [0, 100]. In addition, we need to optimize over the parameter w of the linear debiasing function, for which we use gradient descent to find the best parameter w*; we update θ every 10 updates of w. For the PNL extension of DIVOT, we used a cyclic learning rate (Smith, 2017) with gradient descent, and we update θ every 10 updates of w and ω in (12). The program of DIVOT terminates when the divergence measure converges. If a more complicated scenario requires f_θ^noise to be a nonlinear function, one may need to use gradient descent instead of the one-step update. We find that using the standard Gaussian distribution with the linear f_θ^noise is sufficient for DIVOT to obtain promising results in the experiments on the Tübingen datasets, which indicates that DIVOT is robust to the choice of models. We then investigate the robustness of DIVOT to prior misspecification.

F.5 ROBUSTNESS TO PRIOR MISSPECIFICATION

We use different hypothesized noise distributions in DIVOT to test its robustness to the misspecification of the noise distribution. The synthetic data are generated as in Sec. 6 with uniform noise Ey ∼ U(0, 1). In DIVOT, we use one of three hypothesized distributions: the uniform distribution U(0, 1), the beta distribution B(a = 0.5, b = 0.5), and the standard normal distribution N(0, 1). As shown in Fig. 6, DIVOT with the misspecified noise distributions has performance similar to that with the correct class of distributions, which shows the robustness of DIVOT.

Figure 6: Robustness to prior misspecification: the performance of DIVOT with different hypothesized distributions of p(Ey; θ) (Uniform, Beta, and Normal) on the datasets with uniform noise Ey ∼ U(0, 1); the panels plot performance against the sample size for the different ANMs, including Y = 0.1((2.5X)^3 − X) + Ey and Y = f_piece(X) + Ey.

F.6 COMPARISON WITH RESULTS OF BENCHMARK METHODS

We compare DIVOT with other benchmark methods, such as ANM (Hoyer et al., 2008), CAREFL (Khemakhem et al., 2021), RECI (Blöbaum et al., 2018), and LLR (Hyvärinen and Smith, 2013).
As in (Khemakhem et al., 2021), the synthetic data are generated with the Laplace distribution as the noise distribution. DIVOT uses batches without debiasing functions, and the data preprocessing is the same as in the real-world data experiments in Sec. 6. We run the experiments on each dataset 100 times. As shown in Fig. 7, for sample sizes larger than 100 the accuracy of DIVOT is 100%, which shows that DIVOT has promising results and performs better compared with the other methods, especially in the linear case.

Figure 7: Comparison of the results of DIVOT (the first row) with the reported results in (Khemakhem et al., 2021) (the second row). The figures in the second row are screenshots taken from (Khemakhem et al., 2021). The panels plot the proportion of correct decisions against the sample size; the DIVOT row includes the ANM Y = X + 0.5X^3 + Ey, and the second row shows the "Linear SEM" and "Additive noise SEM" results. The noise (Z2 and Ey) distribution of the synthetic data is a Laplace distribution. CAREFL represents the causal autoregressive flow model (Khemakhem et al., 2021); CAREFL-NS represents the causal autoregressive flow model without scaling; RECI represents regression error causal inference (Blöbaum et al., 2018); ANM (Hoyer et al., 2008) uses a Gaussian process, while ANM-NN uses a neural network; linear LR is the linear likelihood ratio method (Hyvärinen and Smith, 2013).

F.7 EFFICIENCY OF DIVOT

For common causal discovery tasks in the bivariate case, a sample size of 10000 is a large and challenging one, and causal discovery methods need to consider the efficiency of their algorithms especially in this large-sample scenario. We apply DIVOT with batches and no debiasing functions to synthetic data with sample size 10000, generated from Y = 0.1((2.5X)^3 − X) + Ey with Ey ∼ U(0, 1). As shown in Table 7, we test the running time of DIVOT for determining causal direction with different numbers of positions and different batch sizes. The experiments are run on a MacBook Pro (15-inch, 2018) with a 2.9 GHz 6-core Intel Core i9. Our implementation is based on JAX (Bradbury et al., 2018), which is released under the Apache License, and the running time is measured with the %timeit command.

Table 7: Running time of DIVOT on synthetic data with sample size 10000. DIVOT with batches and no debiasing function is tested with different numbers of positions and batch sizes.

batch size   50 positions           100 positions
0.001        240 ms ± 3.48 ms       431 ms ± 4.33 ms
0.01         7.71 s ± 475 ms        14.3 s ± 262 ms
0.1          1 min 7 s ± 1.81 s     2 min 19 s ± 2.11 s