Causally Consistent Normalizing Flow

Qingyang Zhou1, Kangjie Lu2, Meng Xu1
1University of Waterloo, Ontario, Canada
2University of Minnesota, Minnesota, USA
qingyang.zhou@uwaterloo.ca, kjlu@umn.edu, meng.xu.cs@uwaterloo.ca

Causal inconsistency arises when the underlying causal graphs captured by generative models like normalizing flows are inconsistent with those specified in causal models like Structural Causal Models. This inconsistency can cause unwanted issues, including unfairness. Prior works that achieve causal consistency inevitably compromise the expressiveness of their models by disallowing hidden layers. In this work, we introduce a new approach: Causally Consistent Normalizing Flow (CCNF). To the best of our knowledge, CCNF is the first causally consistent generative model that can approximate any distribution with multiple layers. CCNF relies on two novel constructs: a sequential representation of SCMs and partial causal transformations. These constructs allow CCNF to inherently maintain causal consistency without sacrificing expressiveness. CCNF can handle all forms of causal inference tasks, including interventions and counterfactuals. Through experiments, we show that CCNF outperforms current approaches in causal inference. We also empirically validate the practical utility of CCNF by applying it to real-world datasets and show how CCNF addresses challenges like unfairness effectively.

Code: https://github.com/UWCSZhou/CCNF
Extended version: https://arxiv.org/abs/2412.12401

1 Introduction

Causal generative modeling refers to generative models (GMs) that utilize given causal models, such as structural causal models (SCMs), for data generation (Komanduri et al. 2024). It has been widely studied across different types of GMs, including VAEs (Yang et al. 2021), GANs (Kocaoglu et al. 2017), normalizing flows (NFs) (Javaloy, Martin, and Valera 2023), and diffusion models (Sanchez and Tsaftaris 2022).
However, most approaches share a common problem: they only approximate the causal relations instead of enforcing consistency between the causal graph induced by the GM and the causal graph in the given SCM. This problem is called the causal inconsistency problem in prior works (Javaloy, Martin, and Valera 2023) and will be discussed in detail in Section 3. It can lead to critical societal issues (e.g., the one in Section 2), which have yet to be adequately addressed.

Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Fortunately, recent works have proposed GMs that are causally consistent by design. Causal NF (Javaloy, Martin, and Valera 2023) and VACA (Sánchez-Martín, Rateike, and Valera 2022) ensure causal consistency by limiting their models to minimal structural complexity. More specifically, Causal NF restricts the model depth to zero, i.e., it eliminates middle layers in the NF, while VACA applies a similar restriction to its encoder structure. In other words, causal consistency is guaranteed at the expense of the utility of these models: the ability to approximate arbitrarily complex distributions of observations. For instance, the training objective of Causal NF is to minimize the discrepancy between the distributions of latent variables and distributions pre-selected by users (e.g., Gaussian). However, as shown in Figure 1a, a Causal NF trained on a nonlinear Simpson dataset is unable to accomplish this objective: the distribution of the third latent variable, highlighted in green in Figure 1a, deviates significantly from the Gaussian distribution.

In this paper, we introduce Causally Consistent Normalizing Flow (CCNF), a GM that is causally consistent by design without sacrificing utility, i.e., it can approximate arbitrarily complex distributions based on universal approximation theorems (details in Theorem 5.2).
A key innovation of CCNF is to translate an SCM into a sequence (details in Section 4). The sequential representation of an SCM eliminates the constraint of maximum layer depth without compromising causal consistency, enabling a more flexible model architecture in CCNF (details in Section 5). Subsequently, CCNF employs normalizing flows with partial causal transformations to effectively capture the causality in the data, as further elaborated in Section 4 and Section 5. We demonstrate that CCNF is inherently causally consistent and capable of performing causal inference tasks such as interventions and counterfactuals. To the best of our knowledge, CCNF is the first causally consistent GM that can approximate any distribution of observation variables across multiple layers. In comparison, CCNF outperforms existing models like Causal NF on similar tasks, as shown in Figure 1b. Additionally, CCNF proves effective in real-world applications, addressing significant issues such as unfairness. Applying CCNF to the German credit dataset (Hofmann 1994), we observe notable improvements: a reduction in individual unfairness from 9.00% to 0.00%, and an increase in overall accuracy from 73.00% to 75.80%.

The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

[Figure 1: Prior distributions of (a) causal NF and (b) CCNF]

Summary. This paper makes the following contributions:
- We propose a new sequential representation for SCMs and formally prove its ability to maintain causal consistency.
- Utilizing this sequential representation alongside partial causal transformations, we develop CCNF, a GM that guarantees causal consistency and excels at complex causal inference tasks.
- We empirically validate CCNF, demonstrating that it outperforms state-of-the-art causally consistent GMs on causal inference benchmarks. Furthermore, our real-world case study showcases the potential of CCNF in addressing critical issues like unfairness. CCNF is open-sourced at the code link above.
2 A Motivating Example

To articulate the causal inconsistency problem between GMs and SCMs, we present a motivating example modeling a simplified admission system.

A simplified admission system. The causal graph of a simplified admission system M is shown in Figure 2. It consists of four attributes: gender, age, score, and admission decision. In terms of causality, gender and age determine the distribution of score, and ideally, score solely determines the distribution of admission decision, ensuring that gender does not (and should not) directly affect admission. To illustrate, we assume the observations O of this admission system M can be generated by the equations in Table 1 under the SCM column. Here, the values of the independent variables u_i, where i ∈ {g, a, s, d}, are randomly sampled from predefined distributions. For instance, u_g is sampled from a distribution where gender is distributed equally. With distinct samples of u_i, this system generates unique observations representing O.

GMs based on the admission system. In this scenario, GMs learned from the observations O offer extensive capabilities (Harshvardhan et al. 2020). However, a GM may establish incorrect causal links, such as between gender and admission decision, as indicated by the red dashed line in Figure 2. This incorrect causality suggests that in the GM, different values of gender could lead to varied distributions of admission decision, even if the score is identical.

[Figure 2: The causal graph of an SCM describing an admission system, with direct causalities that are intended (black solid lines) and forbidden (red dashed line)]

Variable   SCM                  GM
gender     u_g                  u_g
age        u_a                  u_a
score      u_s + gender + age   u_s - gender + age
decision   f(u_d + score)       f(u_d + score + 2*gender)

Table 1: Comparison of generative equations between the actual SCM and the GM. The function f in the table stands for the sign function.
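The mismatch in Table 1 can be checked numerically. The sketch below is a toy simulation: the standard-normal noise and the 0/1 coding of gender are illustrative assumptions, not specified by the paper. It intervenes on score and measures how often flipping gender flips the decision:

```python
import numpy as np

# Toy simulation of Table 1 under the intervention Do(score = s).
# Assumptions (illustrative): u_d ~ N(0, 1), gender coded as 0/1,
# and f = sign as stated in Table 1.
rng = np.random.default_rng(0)
n = 100_000
f = np.sign
u_d = rng.normal(size=n)  # exogenous decision noise

s = 0.5  # intervened score value

# SCM: decision = f(u_d + score); gender has no remaining path to decision
scm_female = f(u_d + s)
scm_male = f(u_d + s)

# GM: decision = f(u_d + score + 2*gender); the direct edge survives Do(score)
gm_female = f(u_d + s + 2 * 0)
gm_male = f(u_d + s + 2 * 1)

print("SCM flip rate:", (scm_female != scm_male).mean())  # 0.0
print("GM flip rate:", (gm_female != gm_male).mean())     # ~0.3
```

Holding score fixed, the SCM's decisions are untouched by gender, while in the GM roughly 30% of applicants would see their decision flip when only gender changes.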
Consider the scenario where the underlying causal relationships of a GM are formulated as in the GM column of Table 1. We can verify that the data distributions generated by the GM are indistinguishable from those of M, despite their distinct functional forms. However, the GM and M are not causally consistent, which can result in significant consequences.

Unfairness due to causal inconsistency. In this example, causal inconsistency can cause an unfairness problem. Consider a scenario where users generate data instances with the same score. In the original system M, as gender has no direct impact on admission decision, users will observe that changing the gender attribute does not alter the distribution of admission decision. However, in the causally inconsistent GM, users will surprisingly observe that different values of gender could lead to different distributions of decision, even with the same score. Such gender bias could spark intense social debate, and the organization deploying the GM might face legal challenges (Thompson 2023).

[Figure 3: Causally consistent (a) and inconsistent (b) models of prior works. G = gender, A = age, S = score, D = decision; M stands for nodes in the middle layer]

Prior works and their drawbacks. Despite being applied to various architectures of GMs, prior works on guaranteed causally consistent GMs share a common intuition: since each attribute in an observation is influenced only by its parent attributes within the causal graph, the causality relation must be captured in the GM all at once, which means the layer depth must be zero. Figure 3a shows an example of this approach. In this scenario, if we keep the values of age and score constant and only mutate the value of gender, the distribution of admission decision remains unaffected.
This intuition indeed ensures causal consistency, but it has a notable shortcoming: it does not allow the incorporation of any middle layers. For example, mutating the value of gender in Figure 3b results in a change of admission decision due to the red dashed connections. The comparison between Figure 3b and Figure 3a illustrates how causal consistency is compromised even with the introduction of a single middle layer, while, on the other hand, forbidding middle layers significantly impairs the model's capacity for learning. However, as we will demonstrate later, CCNF maintains causal consistency even with multiple middle layers. Therefore, CCNF offers greater learning ability compared to previous models, enhancing practical applicability.

3 Preliminaries

In this section, we define basic concepts and related lemmas to set the stage for CCNF. All definitions and lemmas introduced in this section are consistent with prior works (Khemakhem et al. 2021; Papamakarios et al. 2021; Thost and Chen 2021).

Structural Causal Model (SCM)

Definition. A structural causal model (SCM) is a tuple M = (f, P_u) commonly used to represent causality. It describes the process by which a set of d endogenous (observed) random variables X = {X_1, ..., X_d} is generated from a corresponding set of exogenous (latent) random variables U = {U_1, ..., U_d}, associated with a set of predefined distributions P_u and a set of transfer functions f = {f_1, ..., f_d}. Typically, X and U have the same length, denoted d. The generation of X is governed by the equation:

X_i = f_i(X_{pa_i}, U_i),  i ∈ {1, ..., d}    (1)

where X_{pa_i} denotes the set of endogenous variables that directly influence X_i, i.e., the parents of X_i. In particular, pa_i represents the labels of the parent variables of X_i. Generally, we assume the exogenous variables U are mutually independent.

Graphical representation of an SCM: causal graph. A causal graph G = (V, E) with |V| = d nodes representing an SCM is a powerful tool to describe causality.
Each node in V corresponds to an endogenous random variable X_i, and each edge in E represents a causal link from a variable in X_{pa_i} to X_i. As is commonly assumed (Pearl 2009), the causal graph G is structured as a directed acyclic graph (DAG). For convenience, each node V ∈ V is assigned a label i ∈ L = {1, ..., d}, and we denote it as V_i. For a subset of labels A ⊆ L, we define V_A = {V_i | i ∈ A}. See Figure 4a for an illustration.

Causal inference and causal hierarchy. Causal inference denotes the data generation process by an SCM. According to the causal hierarchy (Pearl 2009), the generative process is classified into three distinct tiers: observations, interventions, and counterfactuals.

The observations process involves generating X unconditionally. This is straightforward: we sample U from the predefined distributions P_u and then compute X from U via Equation 1.

The interventions process involves generating X while setting X_i to a specific value a, often represented as Do(X_i = a). This requires modifying the SCM such that the equation for X_i in Equation 1 is replaced with X_i = a, creating a new SCM M_{Do(X_i = a)}. X is then generated by performing observations on the new SCM.

The counterfactuals process considers a specific data instance X where X_j = b, and aims to generate data instances supposing that X_j = b' ≠ b. This process first deduces U from X via Equation 1, then performs Do(X_j = b') to create a new SCM M_{Do(X_j = b')}. Subsequently, the data is generated by feeding U into M_{Do(X_j = b')}.

Causal Normalizing Flows

Normalizing flows. Normalizing flows constitute a set of generative models that express the probability of observed variables X in terms of U via the change-of-variables rule. In particular, given X = {X_1, ..., X_d} and U = {U_1, ..., U_d}, the probability of X is expressed as follows:

X = T_θ(U),  where U ∼ P_U    (2)

P_X(X) = P_U(T_θ^{-1}(X)) |det J_{T_θ^{-1}}(X)|    (3)

Here, T_θ represents a transformation that maps the exogenous variables U to the endogenous variables X.
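Equations 2 and 3 can be illustrated with a minimal one-dimensional affine flow (an illustrative sketch; the parameters a and b are arbitrary choices, not from the paper). The change-of-variables density is checked against the closed-form density of the transformed Gaussian:

```python
import math

# X = T(U) = a*U + b with U ~ N(0, 1), so T^{-1}(x) = (x - b)/a and
# |det J_{T^{-1}}| = 1/|a| (Eq. 3 in one dimension).
a, b = 2.0, 1.0

def p_u(u):
    # standard normal density
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

def p_x(x):
    # Eq. 3: P_X(x) = P_U(T^{-1}(x)) * |det J_{T^{-1}}(x)|
    return p_u((x - b) / a) / abs(a)

def normal_pdf(x, mu, sigma):
    # closed form: X ~ N(b, a^2)
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

for x in (-1.0, 0.0, 2.5):
    assert abs(p_x(x) - normal_pdf(x, b, a)) < 1e-12
print("Eq. 3 matches the N(1, 4) density")
```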
T_θ can be any transformation as long as it is differentiable and invertible, and it is often realized by a neural network with parameters θ. It is common to chain different transformations T_{θ_1}, ..., T_{θ_k} to form a larger transformation T_θ = T_{θ_k} ∘ ... ∘ T_{θ_2} ∘ T_{θ_1}. P_X denotes the probability of X and P_U denotes the probability of U, where P_X is the target and P_U usually follows a simple distribution. det J denotes the Jacobian determinant of a given function, which is T_θ^{-1} in this formula.

Multi-layer universal approximator. An NF serves as a multi-layer universal approximator, meaning that any P_X can be approximated by chaining a finite number of transformations. In comparison, a single-layer universal approximator can achieve the same objective with only one transformation. The assumption that an NF is single-layer universal is stronger than the assumption that it is multi-layer universal. Although an NF theoretically has the capability to be single-layer universal (Papamakarios et al. 2021), no concrete NF has been proven to be so. Instead, a recent study on the universality of coupling-based NFs proves that affine coupling flows like MAF (Papamakarios, Pavlakou, and Murray 2018) are multi-layer universal (Draxler et al. 2024).

Autoregressive normalizing flows. Autoregressive normalizing flows are a type of NF whose transformations are defined as follows: given two random variables U = {U_1, ..., U_d} and X = {X_1, ..., X_d}, the transformation of Equation 2 specializes to X_i = T_θ(U_i | X_{<i}). More details are in the extended version.

Measurement. In this experiment, we evaluate the given models based on two key metrics: accuracy and causal consistency. Accuracy is reflected by the KL distance between the captured prior distribution and the actual one, denoted KL(p_M ‖ p_θ). Causal consistency is reflected by Equation 7, as in Causal NF:

L(T_θ(X)) = ‖ ∇_x T_θ(X) ⊙ (1 − G) ‖²    (7)

Here, G represents the causal graph of a given SCM M as an adjacency matrix, and ∇_x T_θ(X) denotes the Jacobian matrix of T_θ(X).
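Equation 7 can be sketched numerically. The code below is illustrative: it uses a finite-difference Jacobian rather than autograd, and the adjacency matrix G is taken to include the diagonal so that each output may depend on its own input, details that are simplified relative to Causal NF's implementation:

```python
import numpy as np

def jacobian(T, x, eps=1e-6):
    # Central-difference Jacobian of T at x
    d = x.size
    J = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (T(x + e) - T(x - e)) / (2 * eps)
    return J

def consistency_loss(T, x, G):
    # Eq. 7: squared Jacobian entries falling outside the causal structure
    return float((((1.0 - G) * jacobian(T, x)) ** 2).sum())

# Chain graph x1 -> x2 -> x3; allowed dependencies are parents plus self
G = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 1]], dtype=float)

T_ok = lambda x: np.array([x[0], x[1] - x[0] ** 2, x[2] - np.sin(x[1])])
T_bad = lambda x: np.array([x[0], x[1] - x[0] ** 2, x[2] - x[0]])  # forbidden x1 -> x3

x = np.array([0.3, -1.2, 0.7])
print(consistency_loss(T_ok, x, G))   # 0.0: causally consistent
print(consistency_loss(T_bad, x, G))  # ~1.0: penalizes the illegal dependency
```

The loss vanishes exactly when every Jacobian entry outside the permitted structure is zero, which is the consistency criterion stated next.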
T_θ(X) is causally consistent with M iff L(T_θ(X)) = 0.

Result. The results are summarized in Table 2. In a nutshell, the practical results are consistent with the theoretical expectations. First, CCNF demonstrates causal consistency with the given SCM, as L(T_θ(X)) of CCNF is consistently 0. Second, state-of-the-art models can only maintain consistency by constraining their expressive power: CAREFL fails to meet causal consistency altogether, while VACA and Causal NF can only achieve it with a single layer (L = 1). These findings are consistent with the results reported in their respective papers.

Causal Inference Tasks

Experiment design. Since only Causal NF and VACA with one layer (L = 1) can ensure causal consistency, in this experiment we compare CCNF with Causal NF (L = 1) and VACA (L = 1) on causal inference tasks. We test the given models on representative synthetic datasets: the Nonlinear Triangle, Nonlinear Simpson, M-graph, Network, Backdoor, and Chain datasets, with 3 to 8 nodes, respectively. These causal structures are either from previous works (Javaloy, Martin, and Valera 2023; Sánchez-Martín, Rateike, and Valera 2022) or from practical applications, making them suitable for evaluating the performance of causal inference.

Measurement. We use three different measurements to evaluate the performance on causal inference tasks. For observations, the measurement is the KL distance; for interventions, we measure the maximum of the Maximum Mean Discrepancy (MMD) distance; for counterfactuals, we measure the Root Mean Square Deviation (RMSD) distance. More details are in the extended version.

Result. As shown in Table 3, CCNF demonstrates superior performance compared with previous works across nearly all datasets. This can be attributed to the ability of CCNF to capture complex causalities with additional middle layers, an ability absent in Causal NF and VACA.
Although CCNF requires approximately 130% more time than Causal NF for training and evaluation, the time spent remains within a reasonable range and is less than that of VACA. Overall, CCNF is a more practical choice for causal inference tasks compared to state-of-the-art tools.

Real-world Evaluation

Experiment design. Like prior works (Sánchez-Martín, Rateike, and Valera 2022; Javaloy, Martin, and Valera 2023), we select the German credit dataset as a representative example. Classifiers commonly utilize this dataset to predict the credit risk of a given applicant. If a classifier on the German credit dataset predicts the risk directly through the sex attribute, we say it has an unfairness problem. CCNF offers two methods to address such real-world issues: (1) building a fair classifier directly, or (2) strengthening the dataset through counterfactual data augmentation. Specifically, we first train CCNF on the German credit dataset. For any applicant, CCNF can then function as an unfairness-free classifier by setting the exogenous variable of the risk attribute to its mean value, which is 0 in our experiment. Additionally, CCNF can generate counterfactuals of the training set to create a data-augmented dataset. We also build SVM classifiers on the original German credit dataset and the augmented one, respectively, for comparison.

Measurement. We utilize individual fairness (Fleisher 2021) to evaluate the fairness of a classifier. Individual fairness is determined by examining whether changing the sex attribute of an applicant alters the risk level predicted by a classifier. Mathematically, for a test dataset with n instances, we define individual fairness as

(1/n) Σ_{i=1}^{n} |Risk(X^{(i)}_{sex=1}) − Risk(X^{(i)}_{sex=0})|
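The metric above amounts to counting prediction flips when only the sex attribute is toggled. A minimal sketch with hypothetical classifiers over a two-feature [sex, score] design (not the actual German credit features):

```python
import numpy as np

def individual_fairness(classifier, X, sex_col):
    # Fraction of instances whose predicted risk flips when sex is toggled
    X1, X0 = X.copy(), X.copy()
    X1[:, sex_col] = 1
    X0[:, sex_col] = 0
    return float(np.mean(classifier(X1) != classifier(X0)))

# Hypothetical classifiers: one uses sex directly, one ignores it
unfair = lambda X: (X[:, 1] + 2 * X[:, 0] > 0).astype(int)
fair = lambda X: (X[:, 1] > 0).astype(int)

rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(0, 2, 500), rng.normal(size=500)])

print(individual_fairness(unfair, X, sex_col=0))  # > 0: unfair
print(individual_fairness(fair, X, sex_col=0))    # 0.0: fair
```

A classifier that never reads the sex attribute scores exactly 0 on this metric; in the paper's experiment, CCNF achieves 0.00% on the German credit test set.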
                      L = 1                              L > 1
           CAREFL     Causal NF  VACA        CAREFL     Causal NF  VACA        CCNF
KL         0.01±0.03  0.00±0.00  2.96±0.08   0.00±0.00  0.00±0.00  2.62±0.08   0.00±0.00
L(Tθ(X))   0.20±0.04  0.00±0.00  0.00±0.00   0.32±0.09  0.16±0.05  0.15±0.01   0.00±0.00

Table 2: Causal consistency comparison between CCNF and prior works. The causally consistent models are marked in bold. KL is used to evaluate accuracy and L(Tθ(X)) is used to evaluate causal consistency.

Dataset        Model      KL          Inter. MMD   C.F. RMSD     Train (ms)   Evaluation (ms)
Nlin Triangle  CCNF       0.12±0.04   0.03±0.03    0.14±0.05     20.24±0.32   19.80±0.27
               Causal NF  0.37±0.00   0.10±0.02    0.79±0.07     6.03±0.07    5.09±0.10
               VACA       1.41±0.07   2.13±0.61    34.62±12.53   36.23±0.70   35.27±0.63
Nlin Simpson   CCNF       0.01±0.00   0.00±0.00    0.00±0.00     15.96±0.24   15.73±0.68
               Causal NF  0.25±0.00   0.04±0.01    0.00±0.00     6.30±0.13    5.53±0.30
               VACA       1.56±0.04   0.11±0.15    0.59±0.10     36.52±0.40   35.62±0.39
M-Graph        CCNF       0.01±0.00   0.00±0.00    0.04±0.01     9.13±0.11    8.36±0.18
               Causal NF  0.32±0.00   0.17±0.01    0.02±0.01     6.21±0.15    5.33±0.26
               VACA       1.83±0.01   0.12±0.01    1.19±0.03     40.50±1.18   39.89±0.72
Network        CCNF       0.55±0.13   0.01±0.01    0.25±0.06     19.94±0.32   19.44±0.86
               Causal NF  1.38±0.04   0.15±0.02    0.41±0.03     9.11±0.08    8.14±0.07
               VACA       1.67±0.03   0.38±0.12    13.20±0.50    38.36±0.36   37.59±0.36
Backdoor       CCNF       0.56±0.00   0.01±0.00    0.04±0.01     15.21±0.22   14.71±0.62
               Causal NF  0.96±0.00   0.08±0.02    0.04±0.00     6.30±0.23    5.45±0.40
               VACA       1.78±0.01   0.13±0.01    1.88±0.02     39.04±0.39   38.21±0.40
Chain          CCNF       0.28±0.02   0.00±0.00    0.03±0.00     26.10±0.34   24.78±0.29
               Causal NF  4.67±0.04   0.05±0.01    0.13±0.02     11.59±0.09   10.75±0.07
               VACA       1.52±0.14   0.22±0.03    1.26±0.26     45.69±0.72   44.66±0.66

Table 3: Causal inference task comparison between causally consistent models. In the header, Inter. means interventions and C.F. means counterfactuals.
The best results are marked in bold.

Name     Accuracy     F1           Fairness
SVM      73.00±0.00   82.12±0.00   9.00±0.00
SVM-CF   72.60±1.10   81.90±0.59   4.10±1.40
CCNF     75.80±2.22   84.34±2.22   0.00±0.00

Table 4: Real-world evaluation on the German credit dataset; every number is magnified 100 times.

Result. The results are summarized in Table 4. Overall, CCNF enhances fairness while maintaining accuracy in both methods. For the fairness problem, CCNF lowers the unfairness rate significantly. In particular, the NF classifier demonstrates superior accuracy and eliminates unfairness. These facts reveal that CCNF is suitable for real-world problems like unfairness without compromising accuracy.

7 Conclusion and Future Work

Causal inconsistency in generative models can lead to significant consequences. While some prior works cannot guarantee causal consistency, others achieve it only by limiting their depth. In this paper, we introduce CCNF, a novel causal GM that ensures causal consistency without limiting the depth, as rigorously demonstrated in Section 5. Furthermore, we elaborate on how to perform causal inference tasks in CCNF, demonstrating its proficiency and efficiency. Through synthetic experiments, we illustrate that CCNF outperforms state-of-the-art models in terms of accuracy across different causal tasks. To validate the real-world applicability of CCNF, we also apply it to a real-world dataset and address practical issues concerning unfairness. Our results indicate that CCNF can effectively mitigate such problems while maintaining accuracy and reliability. Overall, CCNF represents a significant advancement in combining generative models with causality, offering both theoretical guarantees of causal consistency and practical applicability in addressing real-world issues involving causality.

Future work. First, a major constraint of CCNF is the requirement of a well-defined causal graph, which is challenging to obtain in reality.
In the future, CCNF should handle such cases more gracefully, e.g., by supporting imperfect or non-DAG causal graphs. Second, while CCNF primarily focuses on causal inference tasks, its capabilities could be extended to other domains. For instance, by incorporating appropriate causalities into the SCM, CCNF could aid in causal discovery tasks. In the future, CCNF could be leveraged for a broader range of tasks.

Acknowledgements

This work is funded in part by NSERC (RGPIN-202203325) and research gifts from Amazon and Meta.

References

Crouse, M.; Abdelaziz, I.; Cornelio, C.; Thost, V.; Wu, L.; Forbus, K.; and Fokoue, A. 2019. Improving graph neural network representations of logical formulae with subgraph pooling. arXiv preprint arXiv:1911.06904.
Draxler, F.; Wahl, S.; Schnörr, C.; and Köthe, U. 2024. On the universality of coupling-based normalizing flows. arXiv preprint arXiv:2402.06578.
Fleisher, W. 2021. What's fair about individual fairness? In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 480-490.
Harshvardhan, G.; Gourisaria, M. K.; Pandey, M.; and Rautaray, S. S. 2020. A comprehensive survey and analysis of generative models in machine learning. Computer Science Review, 38: 100285.
Hofmann, H. 1994. Statlog (German Credit Data). UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5NC77.
Javaloy, A.; Martin, P. S.; and Valera, I. 2023. Causal normalizing flows: from theory to practice. In Thirty-seventh Conference on Neural Information Processing Systems.
Khemakhem, I.; Monti, R. P.; Leech, R.; and Hyvärinen, A. 2021. Causal Autoregressive Flows. arXiv:2011.02268.
Kocaoglu, M.; Snyder, C.; Dimakis, A. G.; and Vishwanath, S. 2017. CausalGAN: Learning Causal Implicit Generative Models with Adversarial Training. arXiv:1709.02023.
Komanduri, A.; Wu, X.; Wu, Y.; and Chen, F. 2024. From Identifiable Causal Representations to Controllable Counterfactual Generation: A Survey on Causal Generative Modeling. arXiv:2310.11011.
Papamakarios, G.; Nalisnick, E.; Rezende, D. J.; Mohamed, S.; and Lakshminarayanan, B. 2021. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57): 1-64.
Papamakarios, G.; Pavlakou, T.; and Murray, I. 2018. Masked Autoregressive Flow for Density Estimation. arXiv:1705.07057.
Pawlowski, N.; Coelho de Castro, D.; and Glocker, B. 2020. Deep structural causal models for tractable counterfactual inference. Advances in Neural Information Processing Systems, 33: 857-869.
Pearl, J. 2009. Causality. Cambridge University Press.
Pearl, J. 2012. The do-calculus revisited. arXiv preprint arXiv:1210.4852.
Ribeiro, F. D. S.; Xia, T.; Monteiro, M.; Pawlowski, N.; and Glocker, B. 2023. High Fidelity Image Counterfactuals with Probabilistic Causal Models. arXiv:2306.15764.
Sanchez, P.; and Tsaftaris, S. A. 2022. Diffusion causal models for counterfactual estimation. arXiv preprint arXiv:2202.10166.
Sánchez-Martín, P.; Rateike, M.; and Valera, I. 2022. VACA: Designing variational graph autoencoders for causal queries. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 8159-8168.
Thompson, E. 2023. Class-action lawsuit against Facebook claiming discrimination gets the green light. CBC.
Thost, V.; and Chen, J. 2021. Directed Acyclic Graph Neural Networks. arXiv:2101.07965.
Xi, Q.; and Bloem-Reddy, B. 2023. Indeterminacy in generative models: Characterization and strong identifiability. In International Conference on Artificial Intelligence and Statistics, 6912-6939. PMLR.
Xia, K.; Pan, Y.; and Bareinboim, E. 2022. Neural causal models for counterfactual identification and estimation. arXiv preprint arXiv:2210.00035.
Yang, M.; Liu, F.; Chen, Z.; Shen, X.; Hao, J.; and Wang, J. 2021. CausalVAE: Disentangled representation learning via neural structural causal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9593-9602.