# Nonlinear Causal Discovery with Latent Confounders

David Kaltenpoth¹  Jilles Vreeken¹

¹CISPA Helmholtz Center for Information Security, Germany. Correspondence to: David Kaltenpoth.

*Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).*

## Abstract

Causal discovery, the task of discovering the causal graph over a set of observed variables $X_1, \ldots, X_m$, is a challenging problem. One of its cornerstone assumptions is that of causal sufficiency: that all common causes of all measured variables have been observed. When it does not hold, causal discovery algorithms making this assumption return networks with many spurious edges. In this paper, we propose a nonlinear causal model involving hidden confounders. We show that it is identifiable from only the observed data and propose an efficient method for recovering this causal model. At the heart of our approach is a variational autoencoder which parametrizes both the causal interactions between observed variables as well as the influence of the unobserved confounders. Empirically, we show that it outperforms other state-of-the-art methods for causal discovery under latent confounding on synthetic and real-world data.

## 1. Introduction

Causal models are at the heart of science (Pearl, 2009). They are robust as they explain away spurious associations, interpretable as we can see what causes what, and actionable as they allow us to estimate effects. Furthermore, they generalize across different, often novel, environments, making them suitable for many machine learning tasks (Pearl, 2009). However, deriving causal models is challenging. Not only is structure learning NP-hard (Chickering et al., 2004), it is in fact impossible to learn a causal model from data without making further assumptions on the generating process (Pearl, 2009).

One of the cornerstone assumptions in causal discovery is that of causal sufficiency, the assumption that we have measured all common causes of all variables in our data. Whenever this assumption holds, all correlations between the observed variables can be explained. When causal sufficiency does not hold, however, methods that rely on it will not be able to adequately explain the observed correlations and hence return a causal network with many spurious edges. Because it is impossible to measure all relevant variables in most applications, detecting and adjusting causal effects for such latent confounders has become an important research theme (Ogarrio et al., 2016; Wang & Blei, 2019; Janzing & Schölkopf, 2018; Kaltenpoth & Vreeken, 2019; Ranganath & Perotte, 2018). Existing work, however, can only do so in restricted settings, e.g., for two high-dimensional variables (Janzing & Schölkopf, 2018) or when additional information is available, such as data from multiple environments, labels, time, or proxies of the latent confounders (Khemakhem et al., 2020).

In contrast, in this paper, we propose a new method for learning a causal directed acyclic graph (DAG) from observational data in the presence of latent confounders. Given data over variables $X_1, \ldots, X_m$, our goal is to determine which $X_i$ share a common cause $Z$, and to return a causal network over both observed variables and hidden confounders. To do so, we define a nonlinear structural causal model (SCM) with hidden confounders based on the post-nonlinear (PNL) model (Zhang & Hyvärinen, 2010; 2012).
We formally show that our model is identifiable both in the linear and in the strictly nonlinear case when the causal graph over $X$ is sparse. For discovery, we leverage recent advances in automatic differentiation (Abadi et al., 2016; Baydin et al., 2018) and present a novel method for causal discovery with hidden confounding based on variational autoencoders (Kingma & Welling, 2019). Specifically, the decoder is given by our structural causal model. To ensure that we obtain an acyclic causal network, we use a differentiable penalty based on counting the number of weighted cycles in the graph (Zheng et al., 2018; Yu et al., 2019). With our approach for Nonlinear Causal Discovery under Latent Confounding, NOCADILAC for short, we efficiently obtain a fully directed network over both the observed variables $X$ and their latent confounders $Z$. We show that our method is theoretically sound by proving that it is consistent in a special case.

We show through extensive experimental evaluation that our proposed method is not only theoretically sound but also does well in practice. On synthetic data, we significantly outperform a number of other state-of-the-art methods for causal discovery with and without the assumption of causal sufficiency: GFCI (Ogarrio et al., 2016), NOTEARS (Zheng et al., 2018), DAG-GNN (Yu et al., 2019), 3OFF2 (Affeldt et al., 2016), and DCD (Bhattacharya et al., 2021). We further illustrate the importance of explicitly modeling latent confounders by comparing NOCADILAC with DAG-GNN and DCD on highly nonlinear real-world protein signaling data.

We include full proofs of all results in Appendix B, but sketch the main ideas here. All code and results can be found on the authors' website: https://eda.rg.cispa.io/prj/fanta/

## 2. Preliminaries

This section introduces notation, defines the problem at hand, and introduces our causal model.

### 2.1. Notation

We denote observed variables by $X = (X_1, \ldots, X_m)$ and unobserved variables by $Z = (Z_1, \ldots, Z_l)$, which are governed by a joint distribution $P(X, Z)$. Noise variables are denoted by $\epsilon$, and we write $\epsilon_x$ and $\epsilon_z$ for noise variables affecting $X$, respectively $Z$. We assume that our noise follows a normal distribution $\epsilon \sim \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$, where $\sigma^2$ contains the variances of the exogenous variables.

We write directed acyclic graphs (DAGs) as $G = (V, E)$, where the nodes $V$ are comprised of the observed variables $X$ and (a subset of) the unobserved $Z$. The set of all parents of $Y$ in $G$ is given by $\mathrm{Pa}_G(Y)$. We assume that the distribution $P(X, Z)$ corresponds to a graph $G$ satisfying the faithfulness, sufficiency, and causal Markov conditions (Koller & Friedman, 2009). We also refer to the parents of $Y$ as the causes of $Y$, and write $Y_1 \to Y_2$ when $Y_1$ causes $Y_2$.

We denote nonlinear functions by $\tau, \nu$ and assume that they work element-wise, $\tau(y) = (\tau_1(y_1), \ldots, \tau_m(y_m))$, and that all nonlinearities $\tau_i, \nu_i$ are three times differentiable, invertible, and consequently strictly monotonic. We write $\mathrm{id}$ for the identity function, $\mathrm{id}(y) = y$. We denote matrices by capital letters $A, B, C$, their transposes by $A^\top, B^\top, C^\top$, and use $I$ to refer to the identity matrix. The entries of the matrix $A$ are $a_{ij}$. We further call an adjacency matrix $A$ acyclic if the graph $G$ induced by it is acyclic.

We can now informally state our problem as follows.

**Problem Statement (Informal).** Given data only over the observed variables $X$, recover the causal model over both $X$ and its unobserved confounders $Z$.
To formalize this problem, we next describe the structural causal models we consider.

### 2.2. Structural Causal Models

To describe our data-generating process, we begin by introducing structural causal models (SCMs, Pearl (2009)) for the observed variables $X$ when no latent confounders $Z$ are involved. If all causal relationships are linear, we can write the model as a standard linear SCM,

$$X = A^\top X + \epsilon \,, \qquad (1)$$

where each $\epsilon_i$ is independent of the parents $\mathrm{Pa}_G(X_i)$ of $X_i$. Since the non-zero entries of the matrix $A$ encode a DAG, there are no paths of length $m + 1$ in $G$, so that the matrix satisfies $A^{m+1} = 0$. In particular, the matrix $C := (I - A)^{-1} = \sum_{k=0}^{m} A^k$ exists. By rearranging terms, we can therefore equivalently write $X = C^\top \epsilon$.

To generalize this, we now include element-wise post-nonlinearities $\tau$ in the causal model (Taleb & Jutten, 1999),

$$X = \tau(C^\top \epsilon) \,. \qquad (2)$$

As we assume the $\tau$ to be invertible, it is straightforward to show that this model is equivalent to the well-studied post-nonlinear (PNL) causal model (Zhang & Hyvärinen, 2010); we refer the reader to Appendix A. The PNL model is suitable when sensors or measurements introduce nonlinear distortions in the observed variables. Moreover, this class of nonlinear causal models is known to permit identifiability of the causal DAG under general conditions (Zhang & Hyvärinen, 2012; Peters et al., 2014). These models are therefore appropriate for our setting, both due to their ease of computability (Yu et al., 2019) and because they facilitate theoretical tractability in the following.

Our next step in formalizing the problem is introducing latent factors in the structural causal model.

### 2.3. Confounders in SCMs

To describe the effects of latent confounders, we replace $X$ by $(X, Z)$ in the model of Eq. (1),

$$\begin{pmatrix} X \\ Z \end{pmatrix} = \begin{pmatrix} A_{XX}^\top & A_{XZ}^\top \\ 0 & 0 \end{pmatrix} \begin{pmatrix} X \\ Z \end{pmatrix} + \begin{pmatrix} \epsilon_x \\ \epsilon_z \end{pmatrix},$$

where we assume the matrices $A_{ZX}, A_{ZZ} = 0$, so that $\mathrm{Pa}(Z_j) = \emptyset$ for all $j$. Note that for Gaussian variables $Z$, this assumption is a natural one: if $Z_1 \to Z_2$ and $Z_2$ affects precisely two variables $X_1, X_2$, then replacing the edge $Z_1 \to Z_2$ with direct edges from $Z_1$ to $X_1, X_2$ and the correct parameters results in an indistinguishable distribution $P(X)$. We elaborate further on the need for this assumption in Section 3.1.

Figure 1: Graphical representation of the generating model in Eq. (4). The variables $Z$ and $\epsilon$ generate the correlated noise $H = B^\top Z + \epsilon$, which then generates $X = \tau(C^\top H)$. Shaded nodes $Z, \epsilon, H$ are unobserved, while unshaded nodes $X$ are observed.

We can therefore focus on the generating mechanism of $X$, which we write as

$$X = A^\top X + B^\top Z + \epsilon \,, \qquad (3)$$

where $A = A_{XX}$, $B = A_{XZ}$, and $\epsilon = \epsilon_x$. Equivalently, for $C = (I - A)^{-1}$ as above, we have $X = C^\top (B^\top Z + \epsilon)$. Adding latent confounders thus replaces the uncorrelated noise $\epsilon$ with correlated noise terms $B^\top Z + \epsilon$. It is in this spirit that we generalize our nonlinear model of Eq. (2) as

$$X = \tau(C^\top (B^\top Z + \epsilon)) \,, \qquad (4)$$

or equivalently $X = \nu(X) + B^\top Z + \epsilon$ with $\nu = \mathrm{id} - (C^\top)^{-1} \circ \tau^{-1}$. We graphically depict the data-generating model in Figure 1. Here, $H = B^\top Z + \epsilon$ represents the correlated noise obtained as a mixture of $Z$ and $\epsilon$.

Having introduced our causal model with latent confounders, we can state our problem formally.

**Problem Statement (Formal).** Given data from

$$X = \tau(C^\top(B^\top Z + \epsilon_x)) \,, \qquad Z \sim \mathcal{N}(0, \operatorname{diag}(\sigma_z^2)) \,, \qquad \epsilon_x \sim \mathcal{N}(0, \operatorname{diag}(\sigma_x^2)) \,,$$

recover the matrices $A, B$ capturing the dependencies between the observed variables (up to trivial indeterminacies).
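To make the generative process concrete, the following is a minimal numpy sketch of sampling from the model in Eq. (4). It is schematic rather than our TensorFlow implementation; the function name and the example parameters are ours.

```python
import numpy as np

def sample_confounded_pnl(A, B, tau, n, sigma_x=1.0, sigma_z=1.0, seed=0):
    """Draw n i.i.d. samples (rows) from X = tau(C^T (B^T Z + eps)), Eq. (4),
    where C = (I - A)^{-1}, A is the (m, m) acyclic adjacency matrix over X,
    and B is the (l, m) matrix of confounder effects Z -> X."""
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    Z = rng.normal(0.0, sigma_z, size=(n, B.shape[0]))   # independent confounders
    eps = rng.normal(0.0, sigma_x, size=(n, m))          # exogenous noise on X
    H = Z @ B + eps                                      # correlated noise H = B^T Z + eps
    C = np.linalg.inv(np.eye(m) - A)                     # C = (I - A)^{-1}
    return tau(H @ C)                                    # row-wise form of X = tau(C^T H)

# Example: one confounder driving all of X1..X4, one causal edge X1 -> X2,
# and a mildly nonlinear, invertible element-wise distortion tau.
A = np.zeros((4, 4)); A[0, 1] = 0.8
B = np.array([[0.9, 0.7, -0.6, 0.5]])
X = sample_confounded_pnl(A, B, tau=lambda y: y + 0.1 * y**3, n=1000)
```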
Here, by trivial indeterminacy we mean the following. We can write our problem as an overcomplete independent component analysis (OICA) problem (Comon, 1992),

$$X = \tau\!\left( \begin{pmatrix} C^\top B^\top & C^\top \end{pmatrix} \begin{pmatrix} Z \\ \epsilon \end{pmatrix} \right),$$

and by writing $Q = (C^\top B^\top \;\; C^\top)$ we know that $Q$ can only be identified up to permutation and scalar multiplication of its columns. That is, any matrix $\tilde{Q} = Q \Pi \Lambda$ can induce the same distribution $P(X)$ under rescaling of the noise variables $\epsilon$, where $\Pi$ is a permutation matrix and $\Lambda$ a diagonal matrix (Eriksson & Koivunen, 2004). Consequently, when we talk about uniqueness or identifiability of any of the parameters below, we mean uniqueness modulo such trivial indeterminacies. Since our exogenous noise variables $\epsilon$ are all Gaussian, additional non-trivial indeterminacies are possible (Eriksson & Koivunen, 2004; Hyvärinen et al., 2019). We therefore study conditions under which our causal model becomes identifiable in the next section.

## 3. Theory and Methodology

In this section, we first give identifiability guarantees for our causal model. We then describe our method for recovering the causal model over the observed $X$ and unobserved $Z$ and show that it is consistent in the linear setting.

### 3.1. Identifiability

In general, without further assumptions, our causal model is not identifiable. To see this, consider the model from Eq. (3) with $X = (X_1, \ldots, X_4)$ and one-dimensional $Z$,

$$X = A^\top X + B^\top Z + \epsilon \,.$$

Without further assumptions on $A$ or $B$, the model has many equivalent parametrizations. To give the simplest example in which we can prove uniqueness of the parameters, consider the model $X_i = b_i Z + \epsilon_i$. Then no $X_i$ has a causal influence on any other $X_j$, but all pairs $X_i, X_j$ are correlated with covariances $\sigma_{ij} := \operatorname{cov}(X_i, X_j) = b_i b_j$. If we try to fit a causal graph over only the variables $X$, we will find all variables to be connected. That is, the causal network over $X$ would form a complete network (Elidan et al., 2000). However, the pairwise correlations satisfy the following tetrad constraint for all distinct quadruples $i, j, u, v$ (Silva et al., 2006),

$$\sigma_{ij}\sigma_{uv} = \sigma_{iu}\sigma_{jv} \,,$$

so that the six pairwise correlations lie on a four-dimensional manifold. The inferred causal mechanisms are therefore correlated with each other, violating the principle of independent mechanisms (Janzing & Schölkopf, 2010).
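The tetrad constraint is directly checkable from sample covariances. Below is a small numpy sketch (helper name ours) illustrating that it holds, up to sampling noise, in the purely confounded example $X_i = b_i Z + \epsilon_i$.

```python
import numpy as np

def tetrad_residual(Sigma, i, j, u, v):
    """Residual of the tetrad constraint sigma_ij * sigma_uv = sigma_iu * sigma_jv;
    near-zero residuals over quadruples are the signature of a shared confounder."""
    return Sigma[i, j] * Sigma[u, v] - Sigma[i, u] * Sigma[j, v]

rng = np.random.default_rng(0)
b = np.array([0.9, 0.7, -0.6, 0.5])
X = rng.normal(size=(100_000, 1)) * b + 0.1 * rng.normal(size=(100_000, 4))
Sigma = np.cov(X, rowvar=False)
print(tetrad_residual(Sigma, 0, 1, 2, 3))   # approximately 0
```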
To achieve identifiability of our causal model in a more general setting, we make the following assumptions.

Figure 2: Example graphs illustrating our structural assumptions. (a) All observed variables are confounded by the same factor $Z_1$, and 6 edges exist between its children $S_1$. (b) Two different confounders affecting five nodes each, and one additional edge incoming to each of the sets. (c) As in (b), but this time $Z_2$ affects one of the nodes in $S_1$. (d) This graph violates the assumptions because $S_1$ has too many additional incoming edges.

**Assumption 1.** There exists a partition of the variables $X$ into $l$ disjoint sets $S_1, \ldots, S_l$ of sizes $|S_j| \geq 4$ such that for each variable $X_i \in S_j$ the direct causal effect $b_{ij}$ of $Z_j \to X_i$ is non-zero, $b_{ij} \neq 0$.

As in the example, this assumption guarantees that each $Z_j$ has an influence on a subset $S_j$ of the variables $X$ that is sufficiently large to recover its parameters $b_{ij}$. Of course, this would not avail us much if the overlaps between sets were too large, e.g., when two variables $Z_j \neq Z_k$ have exactly the same sets $S_j = S_k$ of downstream effects. To prevent such cases, we introduce our next assumption.

**Assumption 2.** There are at most $|S_j| - 4$ edges incoming to vertices in $S_j$, aside from the edges $Z_j \to S_j$.

This assumption ensures that different $Z_j, Z_k$ cannot have too much overlap in their $S_j$, either through direct connections $Z_j \to S_k$ or through indirect paths $Z_j \to S_j \to S_k$. That is, the sets $S_j$ are only weakly connected to each other, which ensures distinguishability between the effects of different $Z_j$. Note in particular that since models with $Z_j \to Z_k$ are indistinguishable from models where each node $X_i \in S_k$ also has an edge $Z_j \to X_i$, this assumption requires $Z$ to be jointly independent. Likewise, as we saw in the first example, there also cannot be too many connections between variables within $S_j$, as this makes it impossible to tell which correlations are due to $Z_j$ and which are due to causal effects of variables within $S_j$.

**Assumption 3.** For all distinct $X_i, X_j, X_u, X_v$ that are not conditionally independent given $Z$, and their covariances $\sigma_{rs}$, we have $\sigma_{ij}\sigma_{uv} \neq \sigma_{iu}\sigma_{jv}$.

This assumption guarantees that the weights in $A$ are not picked adversarially so as to make it seem as if the variables are purely confounded when they are not. It ensures that when we observe low-dimensional correlation structures that could be more readily explained by a latent confounder, as in the example above, these correlations are in fact due to such a latent confounder. When the non-zero weights of $A$ are sampled from a continuous probability distribution, this assumption holds with probability 1.

**Assumption 4.** Either $\tau = \mathrm{id}$ and $\epsilon_x \sim \mathcal{N}(0, \sigma_x^2 I)$, or $\tau_i$ is strictly nonlinear for all $i$, i.e., $\frac{d^2}{dy^2}\tau_i(y) \neq 0$ for all $y$.

While not necessary to determine the effects of the latent confounders $Z$, this assumption is necessary for the identification of the matrix $A$ determining the causal relationships between the observed variables $X$. The first case corresponds to the well-known identifiability of linear Gaussian causal models with equal noise variances (Peters & Bühlmann, 2012). The second case corresponds to the identifiability of PNL models (Peters et al., 2014).

With these assumptions, we can now state our main result.

**Theorem 1 (Identifiability).** Let the distribution $P(X, Z)$ be generated by the model in Eq. (4) and let Assumptions 1–4 hold. Then the matrices $B, C$ (and therefore $A$) are identifiable up to trivial indeterminacies.

*Proof Sketch.* Since all nonlinearities $\tau_i, \tau_j$ are invertible, the mutual information terms $I(X_i, X_j)$ can be constrained similarly to the covariances in the example. Further, due to sparsity, for each $Z_j$, each of its children $X_i$ is part of a quadruple $X_i, X_u, X_v, X_w$ permitting $b_{ij}$ to be inferred. Once the effect of $Z$ on $X$ is known, criteria for PNL models or equal-variance Gaussian models can be used, depending on $\tau$, to infer the remaining network over the observed $X$ (Peters et al., 2014; Peters & Bühlmann, 2012).

Note that we do not need to know the sets $S_j$ to determine $B$. Instead, the tetrad constraints fully determine the sets $S_j$ up to permutations of their indices. If we assume that $Z$ and $\tau$ are normalized and we have some domain knowledge, we can further get rid of the rescaling indeterminacy.
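To illustrate the quadruple-based estimation used in the proof, consider the purely confounded special case $X_k = b_k Z + \epsilon_k$, where $|b_i|$ has a closed form in the covariances. The sketch below is schematic (helper name ours); in the general model, the fourth variable of the quadruple additionally serves to test the tetrad constraint before trusting the estimate.

```python
import numpy as np

def abs_b_from_triple(Sigma, i, j, u):
    """In the purely confounded model X_k = b_k Z + eps_k with Var(Z) = 1,
    cov(X_j, X_k) = b_j b_k for j != k, so
    b_i^2 = (sigma_ij * sigma_iu) / sigma_ju, independent of the signs of b."""
    return np.sqrt(Sigma[i, j] * Sigma[i, u] / Sigma[j, u])

rng = np.random.default_rng(1)
b = np.array([0.9, 0.7, -0.6, 0.5])
X = rng.normal(size=(100_000, 1)) * b + 0.1 * rng.normal(size=(100_000, 4))
Sigma = np.cov(X, rowvar=False)
print(abs_b_from_triple(Sigma, 0, 1, 2))    # close to |b_0| = 0.9
```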
**Corollary 2.** Let the assumptions of Theorem 1 hold. Further, let all $\tau_i$ be increasing and satisfy $\tau_i(1) = 1$, let $Z_j \sim \mathcal{N}(0, 1)$, and for each $j$ let the sign of $b_{ij}$ be known for at least one $X_i \in S_j$. Then $B$ is identifiable up to permutations of its columns.

Next, we show that the model is identifiable with probability tending to one for large numbers of observed variables, even when the causal network is much less sparse than required by Assumption 2.

**Theorem 3 (Identifiability for Large Dense Graphs).** Let Assumptions 3 and 4 hold and let the true causal graph $G$ over $X, Z$ be sampled from a directed Erdős–Rényi model $ER(m + l, p)$ satisfying Assumption 1, with $m$ observed and $l$ unobserved nodes and edge probability $p < 1$. Then

$$\lim_{m \to \infty} P(A, B \text{ identifiable}) = 1 \,,$$

where the limit is over graphs with fixed topological order.

*Proof Sketch.* When $p < 1$, for any $Z_j$ with child $X_i$, for sufficiently large $m$ we are guaranteed to find a suitable quadruple $X_i, X_u, X_v, X_w$ to estimate $b_{ij}$. Given all $b_{ij}$, the entries of $A$ corresponding to incoming edges into $X_i$ are identifiable for the same reason as in Theorem 1.

Note that we are not interested in the identifiability of the nonlinearities $\tau_i$ themselves (Zhang & Hyvärinen, 2010; 2012). Since we are instead interested in discovering the underlying generating DAG $G$ as well as the effects of the latent confounder $Z$, this is not an issue for our purposes. As long as we can find any element-wise nonlinearity $\nu$ such that $\nu(X) \sim \mathcal{N}(0, \Sigma)$ for some $\Sigma$, we have achieved our goal. Next, we develop a method that achieves this task.

### 3.2. Learning with Autoencoders

In order to learn a nonlinear function $\nu$ such that $\nu(X)$ is normally distributed, we make use of variational autoencoders (VAEs) (Kingma & Welling, 2019). The goal of a VAE is to optimize the evidence lower bound (ELBO, Kingma & Welling (2019)), defined as

$$\mathcal{L}_{\mathrm{ELBO}} := \mathbb{E}_{q(H \mid X)}\left[\log p(X \mid H)\right] - D\left(q(H \mid X) \,\|\, p(H)\right),$$

where $D$ is the Kullback–Leibler divergence and $q(H \mid X)$ is a distribution to be optimized over. That is, we maximize a lower bound on the evidence $\log p(X)$,

$$\max_{q(H \mid X),\; p(X \mid H)} \mathcal{L}_{\mathrm{ELBO}} \leq \log p(X) \,,$$

over some class of distributions $q(H \mid X)$, called encoders, and $p(X \mid H)$, called decoders.

Given a set of variables $H$, we can use Eq. (4) as decoder and write $X = \tau(C^\top H)$, where $H$ takes on the role of $B^\top Z + \epsilon$. We can write the corresponding encoder as

$$H \sim \mathcal{N}\left((I - A^\top)\tau^{-1}(X) \,,\; \sigma_z^2(X)\, B^\top B + \sigma_\epsilon^2(X)\, I\right).$$

Equivalently, we can split $H$ into the effect of the latent $Z$ and the independent noise $\epsilon$ as $H = B^\top Z + \epsilon$, where

$$Z \sim \mathcal{N}(\mu_z(X), \sigma_z^2(X)\, I) \,, \quad \epsilon \sim \mathcal{N}(\mu_\epsilon(X), \sigma_\epsilon^2(X)\, I) \quad \text{s.t.} \quad B^\top \mu_z + \mu_\epsilon = (I - A^\top)\tau^{-1}(X) \,.$$

We show the full model in graphical form in Fig. 3.

In practice, to allow the nonlinearity $\tau$ to be learned jointly with its inverse $\tau^{-1}$, it needs to have a closed-form inverse and allow for easy gradient computation in the shared parameters $\theta$. Due to its piece-wise linearity, a stack of multiple PReLU functions applied to each variable is therefore a good candidate. We define

$$\tau_i(x_i; \theta_i) = \phi(\lambda_{i2}\, \phi(\lambda_{i1} x_i + \beta_{i1}; \alpha_{i1}) + \beta_{i2}; \alpha_{i2})$$

and $\tau(x; \theta) = (\tau_1(x_1; \theta_1), \ldots, \tau_m(x_m; \theta_m))$, where $\phi(y; \gamma) = [y]_+ - \gamma [-y]_+$ and $\theta$ is the vector containing all parameters $\{\lambda_{ij}, \alpha_{ij}, \beta_{ij}\}_{ij}$ (He et al., 2015).

To ensure that our learned model has a causal interpretation, we require the discovered adjacency matrix $A$ to be acyclic. To this end, we use the differentiable penalty $h(A)$ proposed by Zheng et al. (2018) and Yu et al. (2019),

$$h(A) := \operatorname{tr}\left((I + A \circ A / m)^m\right) - m \,,$$

which counts the weighted number of loops of each length in the graph $G$ with adjacency matrix $A$. Here $A \circ A$ denotes the Hadamard product, $(A \circ A)_{ij} = a_{ij}^2$. Therefore $h(A) = 0$ if and only if the graph described by the adjacency matrix $A$ is acyclic.
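As a concrete reference, here are minimal numpy versions of the acyclicity penalty and the PReLU stack (schematic, not our TensorFlow implementation; function names are ours).

```python
import numpy as np

def h(A):
    """Acyclicity penalty h(A) = tr((I + A∘A/m)^m) - m of Zheng et al. (2018);
    h(A) = 0 iff the weighted graph encoded by A has no directed cycles."""
    m = A.shape[0]
    return np.trace(np.linalg.matrix_power(np.eye(m) + A * A / m, m)) - m

def phi(y, gamma):
    """PReLU phi(y; gamma) = [y]_+ - gamma * [-y]_+; invertible for gamma > 0,
    with closed-form inverse phi^{-1}(x; gamma) = phi(x; 1/gamma)."""
    return np.maximum(y, 0.0) - gamma * np.maximum(-y, 0.0)

def tau_i(x, lam1, beta1, alpha1, lam2, beta2, alpha2):
    """Two-layer PReLU stack tau_i from the text; its inverse is obtained by
    unwinding the affine maps and applying phi(., 1/alpha) at each layer."""
    return phi(lam2 * phi(lam1 * x + beta1, alpha1) + beta2, alpha2)

assert abs(h(np.array([[0.0, 0.5], [0.0, 0.0]]))) < 1e-12   # DAG: no penalty
assert h(np.array([[0.0, 0.5], [0.5, 0.0]])) > 0            # 2-cycle: penalized
```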
In general, however, we not only care that $A$ is acyclic; we are particularly interested in finding sparse matrices $A$ and $B$ encoding simple and interpretable causal networks. Under such a sparsity constraint, our approach is consistent.

**Theorem 4 (Consistency under Sparsity).** Let $x^n$ be a sample generated from the model in Eq. (4) with $\tau = \mathrm{id}$ and let Assumptions 1–4 hold. Let $L$ be the score

$$L(x^n; A, B) := \mathcal{L}_{\mathrm{ELBO}} - \lambda_A \|A\|_0 - \lambda_B \|B\|_0 \,,$$

and let $\hat{A}, \hat{B}$ be its maximizers subject to acyclicity,

$$\hat{A}, \hat{B} = \operatorname*{arg\,max}_{A, B}\; L(x^n; A, B) \quad \text{s.t.} \quad h(A) = 0 \,.$$

Then for small enough $\sigma_\epsilon(X), \sigma_z(X)$, the score $L$ is consistent for recovering the matrices $A, B$ when $\lambda_A = \lambda_B = \log(n)/2$:

$$\lim_{n \to \infty} P(\hat{A} = A, \hat{B} = B) = 1 \,.$$

*Proof Sketch.* When $\tau = \mathrm{id}$ and $\sigma_z, \sigma_\epsilon \to 0$, $\mathcal{L}_{\mathrm{ELBO}}$ can be written, up to constants, as $-\frac{1}{2}\|(I - A^\top)X\|_F^2$, where $X$ is the data matrix of $x^n$. As before, $B$ can be discovered by finding appropriate quadruples, and the network can be approximated by using consistency results for linear Gaussian models with equal variance (Van de Geer & Bühlmann, 2013).

Figure 3: Proposed model architecture. The encoder outputs both the noise $\epsilon$ and the confounder $Z$. Learnable parameters include the adjacency matrix $A$ of the causal model over the observed variables as well as the matrix $B$ containing the influence of $Z$ on the observed nodes.

This result applies only to the linear case, for which the identifiability of overcomplete ICA has been established (Eriksson & Koivunen, 2004). In contrast, for nonlinear ICA, even the non-overcomplete case is riddled with difficulties, requiring additional assumptions of specific kinds of sparsity, independence of mechanisms, or auxiliary information (Zheng et al., 2022; Gresele et al., 2021; Khemakhem et al., 2020). To the best of our knowledge, these approaches do not (easily) extend to overcomplete nonlinear ICA, preventing us from providing better consistency results at this point in time.

In practice, the $\ell_0$ regularizers are not differentiable, so we follow the common practice of replacing them with $\ell_1$ norms. The overall learning problem, including the nonlinearities $\tau(\cdot\,; \theta)$ defined above, is therefore

$$\min_{A, B, \theta}\; f(x^n; A, B, \theta) := -\mathcal{L}_{\mathrm{ELBO}} + \lambda_A \|A\|_1 + \lambda_B \|B\|_1 \quad \text{s.t.} \quad h(A) = 0 \,.$$

To obtain a fully differentiable optimization target, we use the augmented Lagrangian approach (Bertsekas, 1997),

$$\mathcal{L}(A, B, \theta, \lambda) = f(A, B, \theta) + \lambda h(A) + \frac{\rho}{2} |h(A)|^2 \,,$$

which can be solved using dual ascent (Bertsekas, 1997). We include further details on the computation of the terms as well as the optimization in Appendix C; a schematic sketch of the dual-ascent loop follows below.
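The loop itself is short. The sketch below is schematic, with a black-box inner optimizer standing in for our Adam-based minimization, and uses the update rules and hyperparameters detailed in Appendix C.

```python
import numpy as np

def dual_ascent(minimize_lagrangian, A0, h, alpha=10.0, gamma=0.25,
                lam=0.0, rho=1.0, outer_iters=20):
    """Schematic augmented-Lagrangian loop for min f(A, ...) s.t. h(A) = 0.

    minimize_lagrangian(A, lam, rho) stands in for a black-box stochastic
    optimizer (e.g. Adam) minimizing f + lam*h(A) + rho/2*h(A)^2 over all
    parameters; h is the acyclicity penalty defined earlier."""
    A, h_prev = A0, np.inf
    for _ in range(outer_iters):
        A = minimize_lagrangian(A, lam, rho)  # inner primal minimization
        h_cur = h(A)
        lam += rho * h_cur                    # dual ascent on the multiplier
        if abs(h_cur) > gamma * abs(h_prev):  # slow progress: raise the penalty
            rho *= alpha
        h_prev = h_cur
    return A
```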
## 4. Related Work

Causal inference is arguably one of the most critical research areas in statistical inference and has recently gained much attention among researchers (Pearl, 2009). The existence of hidden confounders and selection bias makes it impossible, however, to infer causality from purely observational data without additional assumptions (Pearl, 2009).

When causal sufficiency is assumed, both constraint-based (Spirtes et al., 2000; Zhang, 2008) and score-based (Chickering, 2002; Scanagatta et al., 2015; Ramsey et al., 2017) methods can discover causal networks up to Markov equivalence. These methods are, however, based on discrete optimization and do not benefit from recent advances in automatic differentiation (Abadi et al., 2016; Baydin et al., 2018). To make use of these advances, Zheng et al. (2018) reformulate network inference as a continuous optimization problem by introducing a differentiable constraint measuring how many cycles a matrix contains. While initially designed for purely linear relationships, this approach has been generalized to permit nonlinear relationships (Zheng et al., 2020; Yu et al., 2019).

However, none of the aforementioned methods can distinguish between networks inside the Markov equivalence class (MEC). When additional assumptions are made, it becomes possible to distinguish between Markov equivalent networks. Based on the asymmetry between factorizations of the joint distribution in the causal and the anti-causal directions, it is possible to determine causal directions in both the bivariate (Zhang & Hyvärinen, 2012; Daniusis et al., 2012; Janzing et al., 2012) and the multivariate case (Peters & Bühlmann, 2012; Bühlmann et al., 2014; Mian et al., 2021).

In the presence of latent confounders, several methods, such as GFCI (Ogarrio et al., 2016; Colombo et al., 2012) or 3OFF2 (Affeldt et al., 2016) and convex-optimization-based approaches (Chandrasekaran et al., 2010), can find causal networks in which undirected edges indicate latent confounding. Specifically, Nested Markov Models (NMMs) (Shpitser et al., 2014; 2018; Richardson et al., 2017; Evans & Richardson, 2019) can sometimes provide identifiability of causal models with latent factors by using Verma constraints. The recent approach DCD by Bhattacharya et al. (2021) combines NMMs with the differentiable constraint by Zheng et al. (2018) to discover a partially directed causal network indicating which nodes are likely confounded. However, all of these methods return MECs. In particular, they cannot tell which nodes share a confounder, making their results difficult to interpret.

To address this issue, several approaches determine whether sets of variables share a confounder. Janzing & Schölkopf (2018) use geometric properties of high-dimensional regression problems to assess whether variables are confounded, while Kaltenpoth & Vreeken (2019) use the minimum description length principle (Grünwald, 2007) to test whether a confounder improves the compression of the observed data. In contrast, Wang & Blei (2019) and Ranganath & Perotte (2018) explicitly model latent confounders using factor models to adjust causal estimates for their presence. However, none of these approaches infer causal networks in the presence of confounders.

## 5. Experiments

In this section, we evaluate our method NOCADILAC empirically. We are interested in how well it (1) recovers the set of nodes affected by $Z$ and (2) recovers the entire causal network. We compare with other state-of-the-art methods for network discovery, both those that permit latent confounding, GFCI (Ogarrio et al., 2016), 3OFF2 (Affeldt et al., 2016), and DCD (Bhattacharya et al., 2021), and those assuming causal sufficiency, NOTEARS (Zheng et al., 2018) and DAG-GNN (Yu et al., 2019). We implemented NOCADILAC using TensorFlow (Abadi et al., 2016) and perform optimization using Adam (Kingma & Ba, 2014). For our competitors, we use the implementations provided by the authors. We make all code and results available in the supplement.

### 5.1. Synthetic Data

We first describe our data-generating process, followed by the metrics we use to evaluate our method. We then show the results of each competitor in multiple settings.

**DATA GENERATION**

We begin by giving an overview of our data generation process. First, we use the Erdős–Rényi model to generate a random DAG with $p = 0.3$. We then generate a corresponding adjacency matrix with $A_{ij} \sim U[-1, 1]$ for $(i, j) \in G$.
Our data-generating model is given by

$$x' = A^\top g(x') + \epsilon \,, \qquad g(x') = \mu + \alpha \odot f(x') \,,$$

where $f$ is uniformly sampled from linear, quadratic, cubic, exponential, logistic, or sinusoidal functions and $(\mu, \alpha) \sim U[-1, 1]^{2m}$ are i.i.d. The noise is $\epsilon \sim \mathcal{N}(0, 1)$. To generate the observed data $x$, we sample $x'$ from the above model and remove the $l$ source nodes $z = (x'_{k_1}, \ldots, x'_{k_l})$. In the interest of space, we here consider confounders $z$ of dimension $l = 1$, but include an analysis for varying $l$ in Appendix E. We use sample size $n = 2500$ and repeat each experiment 500 times.

Figure 4: Evaluation on synthetic networks of size 25. We show F1 scores for (a) network discovery and (b) confounded set recovery (higher is better), as well as (c) structural Hamming distance and (d) structural intervention distance (lower is better). Each figure shows the average score over increasing fractions of all datasets, sorted for each method from most confident to least. We see that NOCADILAC outperforms its competitors at all levels. In particular, in confounded set recovery (b), NOCADILAC performs better at its worst than its competitors at their best.

**EVALUATION METRICS**

To evaluate NOCADILAC, we consider the following metrics, chosen to capture different aspects of causal network discovery with confounders. To measure how well we recover the overall network, we use the Structural Hamming Distance (SHD) as well as the F1 score for network discovery, which we denote $F_1^{\mathrm{net}}$. As we are particularly concerned with discovering which nodes are confounded, we further consider the F1 score for confounded node recovery, denoted $F_1^{\mathrm{conf}}$, which measures how well we recover the children of $z$. Since all of the above metrics measure only the structural similarity between the true and recovered networks, we also use the Structural Intervention Distance (SID) (Peters & Bühlmann, 2013), which measures how many interventional distributions differ between the true and the recovered network.

Figure 5: Evaluation on synthetic networks of different sizes. We show F1 scores for (a) network discovery and (b) confounded set recovery (higher is better), as well as (c) structural Hamming distance and (d) structural intervention distance (lower is better).

To ensure that our evaluation also has practical significance, we do not look only at the average performance of each method over all generated datasets. Instead, we want a metric that is easy to compute solely from the observed data and that we can use to predict the quality of our discovered causal model. To do so, we consider the relationship between the confidence, defined as the improvement of the discovered model over the null model, and the performance measured by the above metrics. Formally, we define the confidence of a method as the relative gain in its score $L$ over the null model,

$$C = \frac{L(X; 0) - L(X; \theta^*)}{L(X; 0)} \,, \qquad (5)$$

where $L(X; \theta^*)$ is the score optimized by the method and $L(X; 0)$ corresponds to an empty model. We are therefore interested in the relationship between $C$ and the metrics above: a good method should obtain better scores where it is confident than where it is not.
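Both the confidence of Eq. (5) and the decision-rate curves used in the following paragraphs reduce to a few lines; a schematic numpy sketch (function names ours):

```python
import numpy as np

def confidence(score_fitted, score_null):
    """Relative gain of the fitted model over the empty model, Eq. (5):
    C = (L(X; 0) - L(X; theta*)) / L(X; 0)."""
    return (score_null - score_fitted) / score_null

def decision_rate_curve(confidences, metric_values):
    """Average metric over the top-k most confident datasets, for k = 1..N;
    this is the quantity on the y-axis of the decision-rate plots."""
    order = np.argsort(confidences)[::-1]                 # most confident first
    vals = np.asarray(metric_values, dtype=float)[order]
    return np.cumsum(vals) / np.arange(1, len(vals) + 1)
```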
**CONFIDENCE AND PERFORMANCE**

To evaluate the relationship between the confidence $C$ and each metric, for each competitor we order the generated datasets and recovered networks in descending order of their obtained confidence $C$. For NOCADILAC, DAG-GNN, NOTEARS, and DCD, we can use the definition of Eq. (5). However, since GFCI and 3OFF2 do not optimize a score, we order their decisions in the way most favorable to them. We include more details in Appendix D.

To show the relationship between confidences and each respective metric, we create decision-rate plots. They show the average performance over the top $k\%$ of datasets, sorted from most to least confident. We show these decision-rate plots in Fig. 4. For NOCADILAC, all metrics correlate strongly with $C$, suggesting that $C$ can be used to determine which network discoveries are more reliable than others. For NOTEARS and DAG-GNN, confidence correlates with $F_1^{\mathrm{net}}$, SID, and SHD, but since these methods are not able to find confounders, their $F_1^{\mathrm{conf}}$ score is zero throughout. Overall, NOCADILAC outperforms its competitors by a large margin on all metrics.

**PERFORMANCE FOR DIFFERENT NETWORK SIZES**

To see how well NOCADILAC performs for larger networks, we next test all methods on networks of sizes $m \in \{10, 25, 50, 100\}$. We show the results in Fig. 5. For all metrics, NOCADILAC outperforms its competitors by a large margin. While all methods show decreasing performance for increasing network size $m$, NOCADILAC is most robust against increasing network sizes. As the network size increases, the gap becomes smaller for the F1 scores. This is due primarily to the latent confounder affecting only a smaller fraction of variables as we increase the network size. Meanwhile, especially for SID, the gap becomes more prominent, as every incorrect edge between nodes simultaneously affects many other pairs of nodes.

### 5.2. REGED Benchmark Data

To evaluate NOCADILAC on data that violates our assumptions, we evaluate it on the REGED dataset (Guyon et al., 2008). This dataset contains re-simulated data created by matching the distribution of real microarray gene expression data. It contains $m = 999$ features and $n = 20{,}500$ non-i.i.d. samples. To introduce confounders, we cannot use the whole network to evaluate NOCADILAC. Instead, we generate subgraphs of different sizes containing nodes with common parents, whose data we hide from our methods. We include the full details in Appendix F.

Figure 6: Evaluation on REGED. F1 scores for (a) network discovery and (b) confounded set recovery (higher is better), as well as (c) structural Hamming distance and (d) structural intervention distance (lower is better).

We show decision-rate plots for all metrics in Fig. 6.
While all methods perform worse than on synthetic data, NOCADILAC nevertheless performs best by a large margin.

### 5.3. Case Study: Protein Signaling Network

Last, to see how well NOCADILAC works on real-world data, we evaluate it on the widely used Sachs dataset (Sachs et al., 2005) for protein signaling. It contains $n = 7466$ continuous measurements of a total of $m = 11$ phosphorylated proteins and phospholipids in human immune system cells. The consensus network contains 20 edges, which we show in Fig. 7a. Since the graph contains cycles, some mistakes are inevitable. To make the data appropriate for our setting, we remove the node PKC, with out-degree four, from the network. Note that the edge from PIP2 to PKC violates our assumption that latent confounders have no incoming edges. We show the benefit of explicitly modeling confounders by comparing against DAG-GNN and DCD.

Figure 7: Results on the Sachs dataset for (a) the consensus network, (b) NOCADILAC, (c) DAG-GNN, and (d) DCD. NOCADILAC discovers a confounder $Z$ capturing the influence of PKC (green edges). In contrast, DAG-GNN finds many edges between nodes influenced by PKC (red) and DCD contains indications of confounding for many pairs of nodes, but neither method can determine that nodes share a confounder. All methods make roughly the same number of errors regarding reversed edges (dashed), edges included only as indirect paths (dotted), and missing edges (gray).

We show the results of this experiment in Fig. 7. In Fig. 7b, we see that NOCADILAC automatically discovers a substitute latent factor $Z$ connected to the correct variables (green edges), thereby taking the place of PKC. For the overall network, only three edges are missing entirely (gray), while two are reversed (dashed), and another three are instead contained as paths of length 2 (dotted). This performance is similar to the results of other state-of-the-art methods on fully observed data (Yu et al., 2019). In Fig. 7c, we see the result of DAG-GNN. The absence of PKC from the observed data leads DAG-GNN to infer a large number of edges (red) between nodes originally connected to PKC, while making more mistakes over the remaining variables. We see a similar pattern in the results of DCD in Fig. 7d. Many pairs of children of PKC are considered to be potentially confounded (red), but so are many other pairs of variables that do not have PKC as a parent. It is also unclear which pairs share the same latent parent. Furthermore, DCD misses many more edges than either NOCADILAC or DAG-GNN. Overall, NOCADILAC produces more accurate and interpretable results than its competitors.

## 6. Conclusion

In this paper, we proposed a method for discovering a nonlinear causal model in the presence of latent confounders from observational data. We proved the identifiability of our causal model and proposed an effective approach to learn its parameters based on VAEs. We showed the theoretical soundness of our method by proving that the $\mathcal{L}_{\mathrm{ELBO}}$ with sparsity and acyclicity constraints forms a consistent scoring criterion when the generating model is linear. Empirical evaluation showed that NOCADILAC works well not only on linearly but also on nonlinearly generated data. On real-world data, we saw that, compared to its competitors, it produces both quantitatively better results on all metrics and qualitatively more interpretable models. For future work, an interesting avenue will be the combination of recent theoretical insights into nonlinear ICA with our present results to obtain better theoretical guarantees.

## References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A System for Large-Scale Machine Learning. In OSDI'16, pp. 265–283, 2016.

Affeldt, S., Verny, L., and Isambert, H. 3off2: A Network Reconstruction Algorithm Based on 2-point and 3-point Information Statistics. In BMC Bioinformatics, volume 17, pp. 149–165. BioMed Central, 2016.

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M.
Automatic Differentiation in Machine Learning: A Survey. Journal of Machine Learning Research, 18:1–43, 2018.

Bertsekas, D. P. Nonlinear Programming. Athena Scientific, 1997.

Bhattacharya, R., Nagarajan, T., Malinsky, D., and Shpitser, I. Differentiable Causal Discovery Under Unmeasured Confounding. In International Conference on Artificial Intelligence and Statistics, pp. 2314–2322. PMLR, 2021.

Bühlmann, P., Peters, J., Ernest, J., et al. CAM: Causal Additive Models, High-Dimensional Order Search and Penalized Regression. The Annals of Statistics, 42:2526–2556, 2014.

Chandrasekaran, V., Parrilo, P. A., and Willsky, A. S. Latent Variable Graphical Model Selection via Convex Optimization. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1610–1613. IEEE, 2010.

Chickering, D. M. Learning Equivalence Classes of Bayesian-Network Structures. The Journal of Machine Learning Research, 2:445–498, 2002.

Chickering, M., Heckerman, D., and Meek, C. Large-Sample Learning of Bayesian Networks Is NP-Hard. Journal of Machine Learning Research, 5, 2004.

Colombo, D., Maathuis, M. H., Kalisch, M., and Richardson, T. S. Learning High-Dimensional Directed Acyclic Graphs with Latent and Selection Variables. The Annals of Statistics, pp. 294–321, 2012.

Comon, P. Independent Component Analysis. 1992.

Daniusis, P., Janzing, D., Mooij, J., Zscheischler, J., Steudel, B., Zhang, K., and Schölkopf, B. Inferring Deterministic Causal Relations. arXiv preprint arXiv:1203.3475, 2012.

Elidan, G., Lotner, N., Friedman, N., and Koller, D. Discovering Hidden Variables: A Structure-Based Approach. Advances in Neural Information Processing Systems, 13, 2000.

Eriksson, J. and Koivunen, V. Identifiability, Separability, and Uniqueness of Linear ICA Models. IEEE Signal Processing Letters, 11(7):601–604, 2004.

Evans, R. J. and Richardson, T. S. Smooth, Identifiable Supermodels of Discrete DAG Models with Latent Variables. Bernoulli, 25(2):848–876, 2019.

Gresele, L., von Kügelgen, J., Stimper, V., Schölkopf, B., and Besserve, M. Independent Mechanism Analysis, a New Concept? Advances in Neural Information Processing Systems, 34:28233–28248, 2021.

Grünwald, P. D. The Minimum Description Length Principle. MIT Press, 2007.

Guyon, I., Aliferis, C. F., Cooper, G. F., Elisseeff, A., Pellet, J., Spirtes, P., and Statnikov, A. R. Design and Analysis of the Causation and Prediction Challenge. In Causation and Prediction Challenge at WCCI 2008, Hong Kong, June 1-6, 2008, volume 3 of JMLR Proceedings, pp. 1–33. JMLR.org, 2008.

He, K., Zhang, X., Ren, S., and Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.

Hyvärinen, A., Sasaki, H., and Turner, R. Nonlinear ICA Using Auxiliary Variables and Generalized Contrastive Learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 859–868. PMLR, 2019.

Janzing, D. and Schölkopf, B. Causal Inference Using the Algorithmic Markov Condition. IEEE Transactions on Information Theory, 56(10):5168–5194, 2010.

Janzing, D. and Schölkopf, B. Detecting Non-Causal Artifacts in Multivariate Linear Regression Models. In International Conference on Machine Learning, pp. 2245–2253. PMLR, 2018.

Janzing, D., Mooij, J., Zhang, K., Lemeire, J., Zscheischler, J., Daniušis, P., Steudel, B., and Schölkopf, B. Information-Geometric Approach to Inferring Causal Directions.
Artificial Intelligence, 182:1–31, 2012.

Kaltenpoth, D. and Vreeken, J. We Are Not Your Real Parents: Telling Causal from Confounded Using MDL. In Proceedings of the 2019 SIAM International Conference on Data Mining, pp. 199–207. SIAM, 2019.

Khemakhem, I., Kingma, D., Monti, R., and Hyvärinen, A. Variational Autoencoders and Nonlinear ICA: A Unifying Framework. In International Conference on Artificial Intelligence and Statistics, pp. 2207–2217. PMLR, 2020.

Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. An Introduction to Variational Autoencoders. arXiv preprint arXiv:1906.02691, 2019.

Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Mian, O. A., Marx, A., and Vreeken, J. Discovering Fully Oriented Causal Networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 8975–8982, 2021.

Ogarrio, J. M., Spirtes, P., and Ramsey, J. A Hybrid Causal Search Algorithm for Latent Variable Models. In Proceedings of the Eighth International Conference on Probabilistic Graphical Models, pp. 368–379, 2016.

Pearl, J. Causality: Models, Reasoning and Inference. Cambridge University Press, 2nd edition, 2009.

Peters, J. and Bühlmann, P. Identifiability of Gaussian Structural Equation Models with Equal Error Variances. arXiv preprint arXiv:1205.2536, 2012.

Peters, J. and Bühlmann, P. Structural Intervention Distance (SID) for Evaluating Causal Graphs. arXiv preprint arXiv:1306.1043, 2013.

Peters, J., Mooij, J. M., Janzing, D., Schölkopf, B., et al. Causal Discovery with Continuous Additive Noise Models. Journal of Machine Learning Research, 15:2009–2053, 2014.

Ramsey, J., Glymour, M., Sanchez-Romero, R., and Glymour, C. A Million Variables and More: The Fast Greedy Equivalence Search Algorithm for Learning High-Dimensional Graphical Causal Models, with an Application to Functional Magnetic Resonance Images. International Journal of Data Science and Analytics, 3:121–129, 2017.

Ranganath, R. and Perotte, A. Multiple Causal Inference with Latent Confounding. arXiv preprint arXiv:1805.08273, 2018.

Richardson, T. S., Evans, R. J., Robins, J. M., and Shpitser, I. Nested Markov Properties for Acyclic Directed Mixed Graphs. arXiv preprint arXiv:1701.06686, 2017.

Sachs, K., Perez, O., Pe'er, D., Lauffenburger, D. A., and Nolan, G. P. Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Science, 308(5721):523–529, 2005.

Scanagatta, M., de Campos, C. P., Corani, G., and Zaffalon, M. Learning Bayesian Networks with Thousands of Variables. In NIPS, pp. 1864–1872, 2015.

Shpitser, I., Evans, R. J., Richardson, T. S., and Robins, J. M. Introduction to Nested Markov Models. Behaviormetrika, 41(1):3–39, 2014.

Shpitser, I., Evans, R. J., and Richardson, T. S. Acyclic Linear SEMs Obey the Nested Markov Property. In Uncertainty in Artificial Intelligence, volume 2018. NIH Public Access, 2018.

Silva, R., Scheines, R., Glymour, C., Spirtes, P., and Chickering, D. M. Learning the Structure of Linear Latent Variable Models. Journal of Machine Learning Research, 7(2), 2006.

Spirtes, P., Glymour, C. N., and Scheines, R. Causation, Prediction, and Search. MIT Press, 2000.

Taleb, A. and Jutten, C. Source Separation in Post-Nonlinear Mixtures. IEEE Transactions on Signal Processing, 47(10):2807–2820, 1999.

Van de Geer, S. and Bühlmann, P.
ℓ0-penalized Maximum Likelihood for Sparse Directed Acyclic Graphs. The Annals of Statistics, 41(2):536–567, 2013.

Wang, Y. and Blei, D. M. The Blessings of Multiple Causes. Journal of the American Statistical Association, 114(528):1574–1596, 2019.

Yu, Y., Chen, J., Gao, T., and Yu, M. DAG-GNN: DAG Structure Learning with Graph Neural Networks. In International Conference on Machine Learning, pp. 7154–7163. PMLR, 2019.

Zhang, J. On the Completeness of Orientation Rules for Causal Discovery in the Presence of Latent Confounders and Selection Bias. Artificial Intelligence, 172(16-17):1873–1896, 2008.

Zhang, K. and Hyvärinen, A. Distinguishing Causes from Effects Using Nonlinear Acyclic Causal Models. In Causality: Objectives and Assessment, pp. 157–164. PMLR, 2010.

Zhang, K. and Hyvärinen, A. On the Identifiability of the Post-Nonlinear Causal Model. arXiv preprint arXiv:1205.2599, 2012.

Zheng, X., Aragam, B., Ravikumar, P. K., and Xing, E. P. DAGs with No Tears: Continuous Optimization for Structure Learning. Advances in Neural Information Processing Systems, 31, 2018.

Zheng, X., Dan, C., Aragam, B., Ravikumar, P., and Xing, E. P. Learning Sparse Nonparametric DAGs. In International Conference on Artificial Intelligence and Statistics, 2020.

Zheng, Y., Ng, I., and Zhang, K. On the Identifiability of Nonlinear ICA: Sparsity and Beyond. arXiv preprint arXiv:2206.07751, 2022.

## A. Why our Model is a PNL Causal Model

In this section, we briefly explain why our proposed model in Eq. (2) can be considered a post-nonlinear (PNL) causal model. Note first that the model as stated is a PNL ICA model (Taleb & Jutten, 1999). We now show that it can in fact be rewritten as a PNL causal model (Zhang & Hyvärinen, 2012). To do so, assume that the topological order is $X_1, \ldots, X_m$. We show for all $k = 1, \ldots, m$ that we can write

$$\tau_k^{-1}(X_k) = \sum_{j=1}^{k-1} \alpha_{jk}\, \tau_j^{-1}(X_j) + \epsilon_k \,.$$

For $k = 1$ this is trivially true, as the sum is empty. For the case $k > 1$, note that by Eq. (2) we have $\tau^{-1}(X) = C^\top \epsilon$, and hence $\tau^{-1}(X) = A^\top \tau^{-1}(X) + \epsilon$. Reading off the $k$-th coordinate gives the claim with $\alpha_{jk} = a_{jk}$, since only variables earlier in the topological order can be parents of $X_k$. Thus $X_k = \tau_k\big(\sum_{j<k} \alpha_{jk}\, \tau_j^{-1}(X_j) + \epsilon_k\big)$, which is exactly a PNL causal model.

## C. Optimization Details

The augmented Lagrangian is optimized by alternating the following dual-ascent updates:

$$A_k = \operatorname*{arg\,min}_{A, B, \theta}\; \mathcal{L}(A, B, \theta, \lambda_k) \,, \qquad \lambda_{k+1} = \lambda_k + \rho_k h(A_k) \,, \qquad \rho_{k+1} = \begin{cases} \alpha \rho_k & \text{if } |h(A_k)| > \gamma\, h(A_{k-1}) \\ \rho_k & \text{otherwise,} \end{cases}$$

where $\alpha > 1, \gamma < 1$ determine how quickly $\rho$ increases. The first of these equations can be solved using any black-box stochastic optimization algorithm readily available in machine learning toolboxes. For the other two equations, we use the commonly used hyperparameters $\alpha = 10$, $\gamma = 0.25$ (Yu et al., 2019).

## D. Evaluating GFCI and 3OFF2

As mentioned in the text, evaluating GFCI and 3OFF2 with our metrics is not straightforward. This is because both of them use only local scores to discern local structures, and these scores cannot be aggregated straightforwardly. We therefore sort the obtained scores of both GFCI and 3OFF2 in the best possible way for their evaluation. Even so, we see that NOCADILAC outperforms both of these methods on all metrics.

Further, GFCI does not return a unique directed graph but an equivalence class of graphical models. In particular, latent confounding is possible for many edges in the discovered networks, but only for very few of them is GFCI confident that they are confounded. To evaluate the returned equivalence classes, we therefore evaluate $F_1^{\mathrm{net}}$, $F_1^{\mathrm{conf}}$, and SID over all networks in these equivalence classes where feasible (which is infrequently the case), and over a random sample of 10,000 networks from these equivalence classes where it is not. We then take the average of these scores to evaluate GFCI, as this is the most reasonable evaluation of its results.
However, even taking the upper quartile of these scores does not change the fact that NOCADILAC consistently outperforms GFCI.

## E. Higher-dimensional Z

We next consider the effect of including multiple confounding factors $Z_i$ in our generated data, each influencing non-overlapping sets of 5 variables in a network of 50 variables total. We show the overall $F_1^{\mathrm{conf}}$ scores for one to five confounders in Table 1. We see that for one to three confounders, NOCADILAC performs at a consistent level. However, at four and five confounders, its performance decreases due to the increasing difficulty of learning a good model of such complexity.

In contrast, all of DCD, 3OFF2, and GFCI perform less well from the start, and their performance drops immediately upon adding a second latent confounder. The reason for this is instructive: since none of them infer a latent confounder but only indicate whether pairs of variables may be confounded, they have no way of distinguishing between sets of nodes confounded by different factors. We therefore ran a spectral clustering algorithm on the subgraphs containing only edges due to pairwise confounding, so as to return the correct number of confounded sets. Nevertheless, the returned subgraphs were of low quality, as reflected by the very low $F_1^{\mathrm{conf}}$ scores for each of the competitors.

| Method | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| NOCADILAC | 0.42 | 0.38 | 0.35 | 0.23 | 0.15 |
| DCD | 0.23 | 0.14 | 0.11 | 0.07 | 0.03 |
| 3OFF2 | 0.22 | 0.12 | 0.08 | 0.05 | 0.03 |
| GFCI | 0.21 | 0.07 | 0.03 | 0.02 | 0.01 |

Table 1: $F_1^{\mathrm{conf}}$ of NOCADILAC, DCD, 3OFF2, and GFCI for graphs with varying numbers of latent confounders (columns). Only NOCADILAC maintains its performance as the number of latent factors increases.

## F. Details on the REGED Setting

As mentioned in the text, we can only use part of the network provided by REGED at once if our goal is to introduce confounders. We therefore generate a number of subgraphs as follows. For every node with an out-degree of at least 3, we start by taking its children $C$. We then take the union of the (minimal) Markov blankets of the nodes in $C$ as our vertex set $V = \bigcup_{c \in C} \mathrm{MB}(c)$ and use the induced graph $G' = G[V]$. Note that even though we start with only $C$ as the set of confounded nodes, it is possible that the addition of the Markov blankets adds nodes with common parents that are not themselves in $V$. Our confounded set is therefore some larger $C' \supseteq C$, which is nevertheless known beforehand.
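The subgraph construction above can be sketched with networkx as follows; this is schematic rather than our exact preprocessing script, and the function names are ours.

```python
import networkx as nx

def markov_blanket(G, v):
    """Markov blanket of v in the DAG G: parents, children, and co-parents."""
    parents, children = set(G.predecessors(v)), set(G.successors(v))
    coparents = {p for c in children for p in G.predecessors(c)} - {v}
    return parents | children | coparents

def reged_subproblem(G, z, min_children=3):
    """Build one REGED subproblem: take the children C of a node z with
    out-degree >= min_children, collect V as the union of their Markov
    blankets, and return the induced subgraph with z itself hidden."""
    C = list(G.successors(z))
    if len(C) < min_children:
        return None
    V = set().union(*(markov_blanket(G, c) for c in C))
    return G.subgraph(V - {z}).copy()
```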