# Federated Causality Learning with Explainable Adaptive Optimization

Dezhi Yang (1,2), Xintong He (3), Jun Wang (2)*, Guoxian Yu (1,2), Carlotta Domeniconi (4), Jinglin Zhang (5)

1. School of Software, Shandong University, Jinan, China
2. SDU-NTU Joint Centre for AI Research, Shandong University, Jinan, China
3. Department of Mathematics, National University of Singapore, Singapore
4. Department of Computer Science, George Mason University, VA, USA
5. School of Control Science and Engineering, Shandong University, Jinan, China

dzyang@mail.sdu.edu.cn, E0966399@u.nus.edu.sg, {kingjun, gxyu, jinglin.zhang}@sdu.edu.cn, carlotta@cs.gmu.edu

*Corresponding author. Copyright 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## Abstract

Discovering causality from observational data is a crucial task in various scientific domains. With the increasing awareness of privacy, data are not allowed to be exposed, and it is very hard to learn causal graphs from dispersed data, since these data may have different distributions. In this paper, we propose a federated causal discovery strategy (FedCausal) to learn the unified global causal graph from decentralized heterogeneous data. We design a global optimization formula that naturally aggregates the causal graphs learned from client data and constrains the acyclicity of the global graph without exposing local data. Unlike other federated causal learning algorithms, FedCausal unifies the local and global optimizations into a complete directed acyclic graph (DAG) learning process with a flexible optimization objective. We prove that this optimization objective has high interpretability and can adaptively handle both homogeneous and heterogeneous data. Experimental results on synthetic and real datasets show that FedCausal effectively deals with non-independently and identically distributed (non-IID) data and delivers superior performance.

## Introduction

Causal structure discovery is a fundamental and critical problem in many fields, such as economics (Koller and Friedman 2009) and biology (Sachs et al. 2005). Randomized controlled experiments are the gold standard for discovering causal structure, but they are often limited by cost and may even be prohibited by ethics. Causal discovery typically infers a directed acyclic graph (DAG) from observational data at a central site (Pearl 2009; Peters, Janzing, and Schölkopf 2017); this DAG encodes causal relationships between variables. However, with the increasing awareness of privacy and security, data are scattered among different clients and cannot be shared, which makes it difficult for canonical causal discovery algorithms to find reliable causal structure from limited client data.

Federated learning (FL), as a secure framework for cooperative training with multiple clients, learns a unified model from scattered data by exchanging model parameters or gradients among clients, without exposing local client data (McMahan et al. 2017). FL has made good progress in areas such as image classification and recommendation systems (Chai et al. 2020). However, recent FL algorithms are based on continuous optimization and cannot be directly applied to causal discovery algorithms, which have a combinatorial optimization flavor. For example, the constraint-based algorithms PC (Spirtes et al.
2000) and FCI (Spirtes, Meek, and Richardson 2013) use conditional independence tests between variables to judge whether a causal structure exists, while score-based algorithms such as GES (Chickering 2002) and the max-min hill-climbing algorithm (Tsamardinos, Brown, and Aliferis 2006) use score functions and heuristic search strategies to find the causal graphs with the best scores. Recent gradient-based causal discovery algorithms (Zheng et al. 2018; Yu et al. 2019) use smooth equality constraints instead of the discrete acyclicity constraint to discover causal structures through continuous optimization (i.e., the augmented Lagrangian method (Nemirovsky 1999)), and they provide the opportunity of learning causal graphs in a continuous manner within a federated framework. Subsequently, Lachapelle et al. (2019) and Zheng et al. (2020) introduced deep neural networks to deal with more complex causal models. However, these gradient-based algorithms need centralized data, which is not feasible under privacy protection.

For dispersed data, several distributed causal discovery methods have been proposed. Some of them make assumptions about parameters but lack generalization (Shimizu 2012; Xiong et al. 2021), and others share additional learning parameters but cause privacy leakage (Na and Yang 2010; Ye, Amini, and Zhou 2022). More importantly, they are often unable to deal with non-independently and identically distributed (non-IID) data among clients.

In this paper, we develop a general federated DAG learning strategy (FedCausal) to seek the global causal graph from horizontally partitioned data with different distributions. This strategy replaces the traditional weighted average on the server with a global optimization process that aggregates local causal graphs and naturally ensures the sparsity and acyclicity of the global causal graph. The local and global optimization processes together form an explainable adaptive optimization objective, which is consistent with the centralized optimization objective under statistical homogeneity. In addition, this explainable objective allows clients to flexibly learn local data distributions under statistical heterogeneity while seeking a uniform global graph on the server. Figure 1 shows the conceptual framework of FedCausal.

[Figure 1: Schematic framework of FedCausal on the non-parametric model. In each interaction, the clients optimize local causal models based on local data and send the first-layer parameters $\theta_k^{(1)}$ to the server. The server optimizes the global model to conform to the local distributions and explicitly constrains its acyclicity to obtain a global causal graph $G$. FedCausal then broadcasts the server parameters $\theta^{(1)}$ to the clients for the next round of interaction.]

Our main contributions can be outlined as follows:

- (i) We meticulously extend the centralized DAG learning framework to federated scenarios. FedCausal explicitly constrains the acyclicity of the global causal graph and optimizes it to conform to dispersed data, ensuring effective causal information interaction between the local models and the global model.
- (ii) FedCausal unifies the global aggregation optimization and the local DAG optimization, and formulates a complete causal graph learning process. We prove that its optimization process is consistent with centralized DAG optimization on homogeneous data and is suitable for heterogeneous data.
- (iii) We conduct experiments on synthetic and real datasets, and show that FedCausal is close to or even better than centralized learning on homogeneous data, and outputs more accurate and acyclic causal graphs than other methods (Ng and Zhang 2022; Gao et al. 2023) on heterogeneous data.

## Related Work

The number of DAGs is super-exponential in the number of variables. As such, learning discrete DAGs is generally NP-hard for traditional causal discovery algorithms (Spirtes et al. 2000; Chickering 2002; Bernstein et al. 2020). To avoid the difficult combinatorial optimization of the acyclicity constraint, NOTEARS (Zheng et al. 2018) transforms the acyclicity constraint into a smooth equality constraint and converts the causal discovery problem into a continuous optimization problem with efficient solvers. DAG-GNN (Yu et al. 2019) extends NOTEARS using graph neural networks to handle various variable types. GraN-DAG (Lachapelle et al. 2019) uses neural networks to deal with non-linear causal relationships. NOTEARS-MLP (Zheng et al. 2020) proposes a generalized function model and characterizes a non-parametric structural equation model of a DAG via partial derivatives. HetDAG (Liang et al. 2023) extends NOTEARS-MLP to learn the DAGs of attributed heterogeneous networks, and DARING (He et al. 2021) considers the independence of heterogeneous noises to improve robustness. MCSL (Ng et al. 2022) adopts a Gumbel-Sigmoid function to approximate the parameter matrix of a DAG as a binary adjacency matrix. However, the above approaches target centralized data, and their assumptions and strategies for identifying causal graphs cannot cope with statistically heterogeneous data. Building on the continuous optimization framework, we design a global optimization that flexibly learns the global causal graph by aggregating local causal graphs from decentralized data that may have different distributions.

Several attempts have been made to learn the causal structure from distributed datasets with different subsets of variables. Tillman, Danks, and Glymour (2008) estimated a local partial ancestral graph (PAG) from each dataset, and then found global PAGs on the complete set of variables with the same d-connection and d-separation relationships as the local PAGs. Triantafillou and Tsamardinos (2015) used a SAT solver in a similar process to improve scalability. However, these methods cannot uniquely identify the causal graph and suffer from a large indeterminacy. For horizontally distributed datasets, Gou, Jun, and Zhao (2007) and Na and Yang (2010) used a two-step process that independently estimates a Bayesian network from each client's dataset and then performs conditional independence testing and voting to obtain the final causal graph. These distributed methods use local datasets separately during training and directly share local model parameters, which results in poor performance and privacy leakage. Our FedCausal aggregates the global causal graph using a very small fraction of the parameters of the local models, effectively avoiding the risk of reconstructing local data from model parameters. It does not simply aggregate the local causal graphs, but adds a proximal term and a global optimization to enable sufficient information to be exchanged between the clients and the server.
The recent NOTEARS-ADMM (abbreviated as NO-ADMM) approach (Ng and Zhang 2022) adopts the alternating direction method of multipliers to decompose the continuous optimization objective of gradient-based methods, and learns the global causal structure by exchanging client model parameters. However, NO-ADMM requires all clients to participate in training and forces all local models to be consistent; it thus lacks flexibility and does not account for statistical heterogeneity. Although FedDAG (Gao et al. 2023) takes heterogeneous data into account, it directly masks local models with a global binary matrix. As a consequence, it cannot effectively update the masked model parameters and obtain accurate causal graphs. Both NO-ADMM and FedDAG impose overly stringent constraints on local models and suffer from performance degradation. FedCausal naturally combines local and global optimizations, forms an adaptive optimization process, and has an optimization objective consistent with centralized DAG learning under IID data, without additional assumptions. Compared with existing federated DAG learning methods, FedCausal is more interpretable and more suitable for non-IID data: it trains local models without strong restrictions while learning a unified global graph, and it accurately identifies the unique causal graph while conforming to the heterogeneous distributions of local data.

## Background

The causal discovery problem can be defined as follows: given observed data $X \in \mathbb{R}^{n \times d}$ over the variable set $V = \{V_1, \ldots, V_d\}$, we need to learn a DAG $G$ from $X$, where $n$ and $d$ denote the sample size and the number of variables, respectively. In the federated scenario, this problem becomes more complicated: let $K$ be the number of clients, $X = \{X_k\}_{k=1}^{K}$ the clients' local data, and $N = \{n_k\}_{k=1}^{K}$ the sample sizes of the clients' data; we need to learn a uniform causal graph $G$ without exposing the clients' data. The difficulty is that the clients' data may be non-IID. Even though existing DAG learning methods based on continuous optimization mostly focus on centralized data and cannot handle statistical heterogeneity, they are necessary building blocks for federated causal discovery, because most federated learning algorithms adopt continuous optimization. For this reason, we next introduce some existing continuous DAG learning strategies.

### Continuous Optimization for DAG Learning

NOTEARS uses an equality constraint to enforce the acyclicity of causal graphs and continuously optimizes a causal graph under a linear data model. NOTEARS defines a weight matrix $W \in \mathbb{R}^{d \times d}$ to represent the causal model, so that $W_{ij} = 0$ if and only if there is no edge from $V_i$ to $V_j$ in the real causal graph $G$. Then, it uses the equation $h(W) = \mathrm{tr}(e^{W \circ W}) - d = 0$ to replace the discrete constraint $G \in \mathrm{DAGs}$ (see (Zheng et al. 2018) for details). With this constraint, NOTEARS reconstructs the data from $W$ to minimize the reconstruction residual and designs the following constrained optimization problem:

$$\min_{W} L(W; X) = \frac{1}{2n}\|X - XW\|_2^2 + \lambda\|W\|_1 \quad \text{s.t. } h(W) = \mathrm{tr}(e^{W \circ W}) - d = 0 \tag{1}$$

where $\|W\|_1$ guarantees the sparsity of the causal graph and $\lambda$ is a hyper-parameter. An approximate solution is sought via the L-BFGS-B algorithm (Byrd et al. 1995).
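To make the smooth constraint concrete, here is a minimal sketch of $h(W)$ and its gradient, assuming NumPy and SciPy; this is an illustration, not the authors' implementation:

```python
# Minimal sketch of the NOTEARS acyclicity penalty h(W) = tr(e^{W∘W}) - d.
# Illustrative code, not from the paper's released implementation.
import numpy as np
from scipy.linalg import expm

def acyclicity(W):
    """Return h(W) and its gradient; h(W) = 0 iff W encodes a DAG."""
    d = W.shape[0]
    E = expm(W * W)            # matrix exponential of the Hadamard square
    h = np.trace(E) - d        # zero exactly when the weighted graph is acyclic
    grad = E.T * 2 * W         # dh/dW by the chain rule
    return h, grad

# A 2-cycle yields h > 0, while an upper-triangular (acyclic) W yields h ≈ 0.
W_cyc = np.array([[0.0, 0.8], [0.5, 0.0]])
W_dag = np.array([[0.0, 0.8], [0.0, 0.0]])
print(acyclicity(W_cyc)[0] > 0, abs(acyclicity(W_dag)[0]) < 1e-12)
```

Both the client-side and server-side objectives introduced below reuse this same penalty, only applied to different matrices.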
To extend NOTEARS to non-parametric models, Zheng et al. (2020) proposed NOTEARS-MLP, which learns the causal model via multi-layer perceptrons (MLPs) and uses partial derivatives to express the dependence between variables. NOTEARS-MLP learns a causal generation model $f = \{f_1, \ldots, f_d\}$, one per variable. It is easy to show that there is no edge from $V_j$ to $V_i$ if and only if the partial derivative $\partial_j f_i$ of $f_i$ with respect to $X_j$ is equal to $0$. So, NOTEARS-MLP uses the partial derivatives of $f$ to represent the parameter matrix $W(f)$: $[W(f)]_{ji} = \|\partial_j f_i\|_2$. Let $\theta = \{(A_1^{(1)}, \ldots, A_1^{(h)}), \ldots, (A_d^{(1)}, \ldots, A_d^{(h)})\}$ be the parameters of all MLPs, with $A_i^{(h)}$ the $h$-th layer parameters of the MLP corresponding to $f_i$. NOTEARS-MLP uses the first-layer parameters $\theta^{(1)} = \{A_1^{(1)}, \ldots, A_d^{(1)}\}$ of all MLPs to represent the weight matrix: $[W(\theta^{(1)})]_{ji} = \|A^{(1)}_{i,(:,j)}\|_2$. The constrained optimization problem of NOTEARS-MLP is:

$$\min_{\theta} L(\theta; X) = \frac{1}{2n}\big[\|X - \mathrm{mlp}(X; \theta)\|_2^2 + \lambda\|\theta\|_1\big] \quad \text{s.t. } h(\theta^{(1)}) = \mathrm{tr}\big(e^{W(\theta^{(1)}) \circ W(\theta^{(1)})}\big) - d = 0 \tag{2}$$

where $\mathrm{mlp}(X; \theta)$ is the reconstruction of the original data $X$ by all the MLP-parameterized causal generation models, and $\|\theta\|_1$ guarantees the sparsity of the causal model.

Both NOTEARS and NOTEARS-MLP require centralized IID data. In contrast, FedCausal targets decentralized non-IID data. It unifies local DAG learning on non-IID data into a global optimization framework, which improves the causality learning capability of client models under limited local data and ensures the unity and accuracy of the global causal graph under statistical heterogeneity. The global model of FedCausal corresponds to the unique causal graph that explains the causality between variables, while the local models learn the heterogeneous distributions of the clients' data.

## Proposed Methodology

### Optimization for Linear SEM

To learn causal graphs that conform to the data distribution, both centralized and federated DAG learning methods aim to minimize the reconstruction residual of the observational data while guaranteeing graph acyclicity. For a linear structural equation model (SEM), assume that there are $K$ clients holding $n = \sum_{k=1}^{K} n_k$ samples in total. If the decentralized data $\{X_k\}_{k=1}^{K}$ could be collected, the global weight matrix $W$ could be learned as follows:

$$W = \arg\min_{W} \sum_{k=1}^{K} \frac{1}{2n_k}\big[\|X_k - X_k W\|_2^2 + \lambda\|W\|_1\big] \quad \text{s.t. } \mathrm{tr}(e^{W \circ W}) - d = 0 \tag{3}$$

However, clients cannot expose their own data; each can only use its local data to optimize a local weight matrix $W_k \in \mathbb{R}^{d \times d}$. In addition, the acyclicity constraint term is applied to the local matrix $W_k$ rather than to the global $W$. A natural idea is to aggregate the local matrices into a global one with guaranteed sparsity and acyclicity. Without loss of generality, we assume that the probability of the observational data being distributed over the $k$-th client equals the frequency $n_k/n$ of samples appearing in it, given sufficient data. To measure the local empirical risks over possible data distributions, the global matrix should be the weighted average of the local matrices based on sample sizes: $W = \sum_{k=1}^{K} n_k W_k / n$. Unfortunately, clients may learn weight matrices with opposite causal relationships, which introduces a strong numerical bias into the global matrix and violates acyclicity. To solve this problem, we design a global constrained optimization to aggregate the local matrices and ensure global sparsity and acyclicity:

$$\min_{W} L(W) = \sum_{k=1}^{K} \frac{n_k}{n}\|W - W_k\|_2^2 \quad \text{s.t. } h(W) = \mathrm{tr}(e^{W \circ W}) - d = 0 \tag{4}$$

where $\frac{n_k}{n}\|W - W_k\|_2^2$ is the distribution loss term that constrains the global matrix to approach the local matrices. Eq. (4) thus measures and minimizes the local empirical risks, and its equality constraint explicitly enforces the acyclicity of the global causal graph.
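As an illustration of Eq. (4), the following hedged sketch takes gradient steps on the augmented Lagrangian of the server objective; the function name, fixed learning rate and step count are hypothetical choices, not taken from the paper:

```python
# Hedged sketch of the server-side solve of Eq. (4) under the augmented
# Lagrangian L(W) + (rho/2) h(W)^2 + alpha h(W). Illustrative, not the
# authors' implementation.
import numpy as np
from scipy.linalg import expm

def server_update(W, local_Ws, sizes, alpha, rho, lr=1e-2, steps=200):
    n, d = sum(sizes), W.shape[0]
    for _ in range(steps):
        E = expm(W * W)
        h = np.trace(E) - d
        grad_h = E.T * 2 * W                     # gradient of h(W)
        # gradient of the distribution loss sum_k (n_k/n) ||W - W_k||_2^2
        grad = sum(2 * (nk / n) * (W - Wk) for Wk, nk in zip(local_Ws, sizes))
        grad += (rho * h + alpha) * grad_h       # acyclicity penalty terms
        W = W - lr * grad
    return W
```

In the actual scheme, $\alpha$ and $\rho$ are updated across rounds as in standard augmented Lagrangian methods, so the acyclicity constraint is tightened progressively rather than enforced in one shot.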
Since the global matrix obtained by aggregating sparse local matrices is itself numerically sparse, we do not add a sparsity penalty term to Eq. (4). We broadcast the learned global matrix $W$ to all clients and further optimize their local matrices as follows:

$$\min_{W_k} L_k(W_k; X_k) = \frac{1}{2n_k}\|X_k - X_k W_k\|_2^2 + \lambda_1\|W_k\|_1 \quad \text{s.t. } h(W_k) = \mathrm{tr}(e^{W_k \circ W_k}) - d = 0 \tag{5}$$

Local matrix optimization still requires the sparsity penalty and the acyclicity constraint to prevent overfitting. In practice, we find that adding a penalty term $\lambda_2\|W_k - W\|_2^2$ to pull the local matrices toward the global one benefits the convergence of local optimization. This penalty is weaker than the equality constraint of NO-ADMM (Ng and Zhang 2022), which forces the local matrix to equal the global matrix, and is therefore more flexible and better suited to statistical heterogeneity. FedCausal first optimizes the local matrices $\{W_k\}_{k=1}^{K}$ at the clients using Eq. (5) and uploads them to the server. The server then optimizes the global matrix $W$ using Eq. (4) and distributes it to the local models. FedCausal repeats these local and global optimizations for a sufficiently large number of iterations, or stops when the global acyclicity constraint term falls below a pre-defined threshold.

### Optimization for Non-Parametric SEM

The non-parametric model is more complex than the linear model, since the model parameters and the clients' data distributions are unknown. Exchanging all model parameters between the clients and the server would not only cause excessive communication overhead, but would also force the local models to be identical, violating statistical heterogeneity. To address these issues, we adjust the global optimization and pass only a few parameters between the clients and the server. We denote the parameters of the local causal generation model by $\theta_k$, the parameters of its first layer by $\theta_k^{(1)}$, and those of the other layers by $\theta_k^{(-1)}$. The global optimization formula of FedCausal for the non-parametric model is:

$$\min_{\theta^{(1)}} L(\theta^{(1)}) = \sum_{k=1}^{K} \frac{n_k}{n}\|\theta^{(1)} - \theta_k^{(1)}\|_2^2 \quad \text{s.t. } h(\theta^{(1)}) = \mathrm{tr}\big(e^{W(\theta^{(1)}) \circ W(\theta^{(1)})}\big) - d = 0 \tag{6}$$

FedCausal receives only the first-layer parameters $\theta_k^{(1)}$ of the local models to optimize the global model. The unified global causal graph is derived from the partial derivatives of the first-layer global parameters $\theta^{(1)}$, so there is no need to send the full $\theta_k$. This design not only reduces the communication overhead compared to passing all parameters as in NO-ADMM (Ng and Zhang 2022), but also strengthens privacy protection, because it is difficult for attackers to reconstruct data from the first-layer parameters of a complex model. We then broadcast $\theta^{(1)}$ to all clients, replace their $\theta_k^{(1)}$, and optimize their $\theta_k$ with the following formula:

$$\min_{\theta_k} L_k(\theta_k; X_k) = \frac{1}{2n_k}\|X_k - \mathrm{mlp}(X_k; \theta_k)\|_2^2 + \lambda_1\|\theta_k^{(1)}\|_1 \quad \text{s.t. } h(\theta_k^{(1)}) = \mathrm{tr}\big(e^{W(\theta_k^{(1)}) \circ W(\theta_k^{(1)})}\big) - d = 0 \tag{7}$$
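A hedged PyTorch sketch of one client's update for Eq. (7) follows. The per-variable MLP architecture, optimizer settings, and the $\lambda_2$ proximal pull toward the broadcast $\theta^{(1)}$ (mirroring the penalty introduced above for the linear case; Algorithm 1 below instead directly overwrites $\theta_k^{(1)}$ with $\theta^{(1)}$) are illustrative assumptions, not the authors' code:

```python
# Hedged sketch of a client-side update for Eq. (7); names are hypothetical.
import torch
import torch.nn as nn

def build_client(d, m=10):
    # one MLP f_i per variable: reconstructs X_i from all d variables
    return [nn.Sequential(nn.Linear(d, m), nn.Sigmoid(), nn.Linear(m, 1))
            for _ in range(d)]

def client_update(mlps, Xk, theta1_global, alpha_k, rho_k,
                  lam1=0.01, lam2=0.01, lr=1e-3, epochs=100):
    d = Xk.shape[1]
    opt = torch.optim.Adam([p for f in mlps for p in f.parameters()], lr=lr)
    for _ in range(epochs):
        Xhat = torch.cat([f(Xk) for f in mlps], dim=1)
        # [W(theta^(1))]_{ji} = ||A^(1)_{i,(:,j)}||_2: column norms of first layers
        W = torch.stack([f[0].weight.norm(dim=0) for f in mlps], dim=1)
        h = torch.trace(torch.matrix_exp(W * W)) - d
        loss = ((Xk - Xhat) ** 2).sum() / (2 * Xk.shape[0])
        loss = loss + lam1 * sum(f[0].weight.abs().sum() for f in mlps)
        loss = loss + 0.5 * rho_k * h ** 2 + alpha_k * h
        # proximal pull of theta_k^(1) toward the broadcast global theta^(1)
        loss = loss + lam2 * sum((f[0].weight - g).pow(2).sum()
                                 for f, g in zip(mlps, theta1_global))
        opt.zero_grad()
        loss.backward()
        opt.step()
    # upload only the first-layer parameters theta_k^(1)
    return [f[0].weight.detach().clone() for f in mlps]
```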
FedCausal learns the global causal graph using Eq. (6), and allows clients to learn different causal models conforming to their non-IID local data using Eq. (7), thus providing high flexibility for causal discovery on non-IID data. Algorithm 1 gives the main process of FedCausal on non-parametric data, where $\alpha$ and $\rho$ are the Lagrangian multiplier and the penalty parameter, $h_{tol}$ and $\rho_{max}$ are the thresholds that control loop termination, and $\gamma$ and $\beta$ are the condition and the step size that control the updates of $\rho$ and $\alpha$.

```
Algorithm 1: FedCausal: Federated Causal Learning
Input:  observed data X = {X_1, ..., X_K}
Output: causal weight matrix W(θ^(1))
 1: initialize global parameters θ^(1), local parameters
    {θ_k = (θ_k^(1), θ_k^(-1))}_{k=1..K}, and the hyperparameter
    list {α, ρ, h_tol, ρ_max, γ, β}
 2: for t = 0, 1, 2, ... do
 3:   while ρ < ρ_max do
 4:     broadcast θ^(1), α, ρ:  θ_k^(1) = θ^(1), α_k = α, ρ_k = ρ
 5:     update local parameters {θ_k}_{k=1..K} using Eq. (7)
 6:     upload {θ_k^(1)}_{k=1..K} to the server
 7:     update global parameters θ^(1) using Eq. (6)
 8:     if h(θ^(1)) > γH then
 9:       ρ = βρ
10:     else
11:       break
12:     end if
13:   end while
14:   set H = h(θ^(1)), α = α + ρ h(θ^(1))
15:   if h(θ^(1)) <= h_tol or ρ >= ρ_max then
16:     break
17:   end if
18: end for
19: return W(θ^(1))
```

After obtaining the causal weight matrix $W(\theta^{(1)})$, we prune the edges with small weights ($< 0.3$) to obtain the final DAG.

Consider the case of $K$ client models, each with one hidden layer of $m$ units. FedCausal's clients take $O(n_k d^2 m + d^2 m + d^3)$ to compute the objective and the gradient. The computational complexity of FedDAG's clients is $O(n_k d^2 m + n_k d^2 + d^2 m + d^3)$, due to the additional mask operation. NO-ADMM does not compute the acyclicity term on the clients, so its client computational complexity is $O(n_k d^2 m + d^2 m)$, but it suffers from overfitted local models. Both FedCausal and NO-ADMM optimize the global graph on the server, so their server-side complexity is $O(K d^2 m + d^3)$. However, FedCausal only needs to process the first-layer parameters of the model, so it is usually faster than NO-ADMM, especially when the model has at least two layers. The communication overheads of FedCausal, NO-ADMM and FedDAG for a single round of interaction are $d^2 m$, $d^2 m + dm$ and $d^2$, respectively. As the model becomes more complex with more layers, the overhead of NO-ADMM increases sharply. FedDAG averages only local matrices and has lower server-side computation than NO-ADMM and FedCausal, but its global matrix may not uncover an accurate global causal graph, as we show in our experiments.

### Analysis of Explainable Adaptive Optimization

**Lemma 1.** Under the condition that local data are homogeneous and obey a linear SEM, the optimization objective of FedCausal is consistent with that of centralized DAG learning, i.e., the global matrix $W \in \arg\min_W \|X - XW\|_2^2/2n + \lambda\|W\|_1 + \rho h^2(W)/2 + \alpha h(W)$.

For the linear data model, the optimization target of the penalty term $\|W - W_k\|_2^2$ in Eq. (4) is $W \to W_k$, and the objective of Eq. (5) is $W_k \in \arg\min_{W_k} L_k(W_k; X_k) + \rho_k h^2(W_k)/2 + \alpha_k h(W_k)$, so the actual optimization objective enforced by $\|W - W_k\|_2^2$ is $W \in \arg\min_W L_k(W; X_k) + \rho_k h^2(W)/2 + \alpha_k h(W)$. We can then replace the penalty term in the global optimization (Eq. (4)) with this actual objective and obtain the following:

$$W \in \arg\min_{W} \sum_{k=1}^{K} \frac{n_k}{n}\Big[\frac{1}{2n_k}\|X_k - X_k W\|_2^2 + \lambda\|W\|_1\Big] + \frac{\rho}{2}h^2(W) + \alpha h(W) \tag{8}$$

Eq. (8) only preserves the global acyclicity constraint terms, because the local acyclicity constraints act on the local matrices and do not constrain the global matrix. By merging the terms in Eq. (8), we find that the optimization objective of the global matrix is $W \in \arg\min_W \|X - XW\|_2^2/2n + \lambda\|W\|_1 + \rho h^2(W)/2 + \alpha h(W)$. Obviously, this objective is consistent with the centralized optimization objective of NOTEARS, which means that FedCausal may achieve the same results as NOTEARS even though the data are scattered.

**Lemma 2.** Under the condition that local data are homogeneous and obey a non-parametric SEM, the optimization objective of FedCausal is consistent with that of centralized DAG learning, i.e., the global parameters $\theta^{(1)} \in \arg\min_{\theta^{(1)}} \|X - \mathrm{mlp}(X; \theta^{(1)}, \theta^{(-1)})\|_2^2/2n + \lambda\|\theta^{(1)}\|_1 + \rho h^2(\theta^{(1)})/2 + \alpha h(\theta^{(1)})$.
For the non-parametric data model, the optimization target of the penalty term $\|\theta^{(1)} - \theta_k^{(1)}\|_2^2$ in Eq. (6) is $\theta^{(1)} \to \theta_k^{(1)}$. Expressing the local parameters $\theta_k$ optimized by Eq. (7) as $(\theta_k^{(1)}, \theta_k^{(-1)})$, we obtain the actual optimization objective enforced by $\|\theta^{(1)} - \theta_k^{(1)}\|_2^2$ as $\theta^{(1)} \in \arg\min_{\theta^{(1)}} L_k(\theta^{(1)}, \theta_k^{(-1)}; X_k) + \rho_k h^2(\theta^{(1)})/2 + \alpha_k h(\theta^{(1)})$. Similarly, we can deduce the global optimization objective for the non-parametric data model as:

$$\theta^{(1)} \in \arg\min_{\theta^{(1)}} \sum_{k=1}^{K} \frac{n_k}{n}\Big[\frac{1}{2n_k}\|X_k - \mathrm{mlp}(X_k; \theta^{(1)}, \theta_k^{(-1)})\|_2^2 + \lambda\|\theta^{(1)}\|_1\Big] + \frac{\rho}{2}h^2(\theta^{(1)}) + \alpha h(\theta^{(1)}) \tag{9}$$

We consider two cases. On the one hand, if the clients' data are homogeneous, the local parameters $\theta_k^{(-1)}$ can be the same or very similar, so we can merge the terms in Eq. (9) to obtain the final optimization objective: $\theta^{(1)} \in \arg\min_{\theta^{(1)}} \|X - \mathrm{mlp}(X; \theta^{(1)}, \theta^{(-1)})\|_2^2/2n + \lambda\|\theta^{(1)}\|_1 + \rho h^2(\theta^{(1)})/2 + \alpha h(\theta^{(1)})$. This objective is again consistent with the centralized optimization objective of NOTEARS-MLP. On the other hand, if the clients hold decentralized heterogeneous data, Eq. (9) allows the clients to learn different $\theta_k^{(-1)}$ while still optimizing a unified $\theta^{(1)}$. In other words, regardless of whether the local data fit linear or nonlinear models, FedCausal learns different joint data distributions on the clients and optimizes a uniform causal graph on the server; it thus adapts to the case where clients have different causal generation models but share the same causal graph. In summary, FedCausal can flexibly adapt to statistically heterogeneous or homogeneous data, with an explainable justification.

## Experiments

We compare FedCausal against recent federated causal discovery algorithms, including FedDAG (Gao et al. 2023) and NO-ADMM (Ng and Zhang 2022). We also set up three baselines based on NOTEARS (Zheng et al. 2018): the first (NO-ALL) learns DAGs from all data by centralizing the local data; the second (NO-Avg) independently learns local causal graphs and combines them by weighted averaging; the third (NO-w/o Acy) is similar to our strategy but disregards the acyclicity constraint on the global causal graph. For nonlinear or heterogeneous data, we adopt the nonlinear or non-parametric versions of FedCausal, FedDAG and NO-ADMM, and replace NOTEARS with NOTEARS-MLP (Zheng et al. 2020) for NO-ALL, NO-Avg and NO-w/o Acy. For real datasets, we add two classic baselines, PC (Spirtes et al. 2000) and GES (Chickering 2002), which use all data but identify a completed partially directed acyclic graph (CPDAG) with undirected edges. We provide the hyperparameter settings of FedCausal and the other baselines in the supplementary file (Yang et al. 2023).

We study the performance of the compared methods on synthetic IID and non-IID datasets, as well as real datasets. We use the Erdős-Rényi (ER) and Scale-Free (SF) models to generate the ground-truth DAGs. We use the linear Gaussian (LG) model and the additive noise model with MLP (ANM-MLP) (Zheng et al. 2020), respectively, to synthesize linear and nonlinear data. For non-IID data, we fix the generated DAG and randomly sample models from the linear Gaussian model (LG), the additive noise model with MLP (ANM-MLP), the additive model with Gaussian processes (ADD-GP) (Bühlmann, Peters, and Ernest 2014) and the additive index model (MIM) (Yuan 2011) to generate data for different clients. We guarantee that, even if different clients choose the same model, their model parameters differ; a data-generation sketch is given below.
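For concreteness, the following is a minimal sketch of the IID linear-Gaussian setup described above (an ER ground-truth DAG, LG data, 10 clients with 200 samples each). The NetworkX-based helpers, edge-weight range and unit noise scale are illustrative assumptions, not the paper's exact generator:

```python
# Hedged sketch of the synthetic IID setup: ER DAG + linear Gaussian SEM.
import numpy as np
import networkx as nx

def simulate_er_dag(d, num_edges, rng):
    """Random ER graph, oriented along a random node ordering so it is acyclic."""
    G = nx.gnm_random_graph(d, num_edges, seed=int(rng.integers(2**31)))
    order = rng.permutation(d)                    # random topological ordering
    W = np.zeros((d, d))
    for u, v in G.edges:
        i, j = (u, v) if order[u] < order[v] else (v, u)
        W[i, j] = rng.choice([-1, 1]) * rng.uniform(0.5, 2.0)  # assumed weights
    return W

def simulate_lg(W, n, rng):
    """Ancestral sampling from the linear Gaussian SEM X_j = X W[:, j] + noise."""
    n_vars = W.shape[0]
    X = np.zeros((n, n_vars))
    topo = nx.topological_sort(nx.from_numpy_array(W, create_using=nx.DiGraph))
    for j in topo:
        X[:, j] = X @ W[:, j] + rng.normal(size=n)
    return X

rng = np.random.default_rng(0)
W_true = simulate_er_dag(d=10, num_edges=20, rng=rng)
clients = [simulate_lg(W_true, n=200, rng=rng) for _ in range(10)]  # 10 IID clients
```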
We use the same server (Ubuntu 18.04.5, Intel Xeon Gold 6248R and NVIDIA RTX 3090) to perform all experiments and report the structural Hamming distance (SHD), true positive rate (TPR) and false discovery rate (FDR) of the estimated DAGs, averaged over 10 random runs. A larger TPR indicates better performance, while the opposite holds for SHD and FDR. The code of FedCausal is shared at https://www.sdu-idea.cn/pubDetail?pubId=279.

### Results on IID Data

To test FedCausal on IID data, we set up 10 clients and generate 200 samples for each client with the linear (LG) and nonlinear (ANM-MLP) IID data models. We estimate randomly generated DAGs with {10, 20, 40, 80} variables. Based on the results of FedCausal and the baselines in Figure 2, we make the following observations:

[Figure 2: Results on linear and non-linear IID data, with panels (a) results on the linear model and (b) results on the nonlinear model, plotted against the number of variables {10, 20, 40, 80}. Solid lines correspond to federated methods and dotted lines to NOTEARS-based baselines.]

(i) FedCausal obtains the highest TPR on the linear model across different numbers of variables. NO-ALL learns the causal graph using all data, so its results can be taken as an upper bound. FedCausal even outperforms NO-ALL, probably because clients have sufficient data to learn local causal graphs under the simple linear model; in addition, the collaboration between clients works like an ensemble system that helps reinforce correct edges and remove erroneous ones. NO-ADMM achieves the lowest TPR and the highest FDR and SHD, since it does not consider sparsity and acyclicity on the clients, which leads to overfitted local graphs that fail to form a good global causal graph. FedDAG performs well on the linear model, approaching NO-ALL with few variables, but falls behind with more variables. NO-Avg performs almost as well as NO-ALL on the linear model, since the linear model is simple and clients can confidently learn local graphs without any collaboration.

(ii) FedCausal also achieves the lowest FDR and SHD values on the nonlinear model, clearly outperforming NO-ALL while achieving very close TPR values. This is because our global optimization aligns with the centralized optimization objective of NO-ALL, which uses all data, and federated learning lets clients collaborate with each other to reduce erroneous edges. For the same reason as in the linear case, NO-ADMM still performs poorly on the nonlinear model. FedDAG does not perform as well on the nonlinear model as it did on the linear one, because the Gumbel-Sigmoid function it uses is not suitable for dealing with nonlinear causality, which leads to biased results. We do not report the results of FedDAG on 80 nodes, since Gao et al. (2023) only provided the hyperparameters of FedDAG for 10, 20 and 40 nodes, and FedDAG is very sensitive to the hyperparameters and fails on 80 nodes. NO-Avg yields unreliable results with very high FDR and SHD values, even though its TPR is also high. This indicates that simply averaging local causal graphs may introduce more erroneous edges.

(iii) Our aggregation optimization effectively improves the performance of FedCausal.
Compared to NO-w/o Acy, FedCausal performs better on the linear model and obtains significantly lower FDR and SHD values on the nonlinear model. This is because our adaptive optimization not only measures the local empirical risk but also constrains the acyclicity of the global graph, while NO-w/o Acy disregards this constraint. FedCausal trains the global graph to conform to the local data distributions and eliminates the numerical bias from aggregation that may lead to a cyclic graph, thus effectively improving the global graph.

### Results on Non-IID Data

In this experiment, we again use 10 clients, 200 samples per client and graphs with {10, 20, 40, 80} variables. We randomly select models from LG, ANM-MLP, ADD-GP or MIM to generate non-IID data for the different clients. Figure 3 shows the results of FedCausal and the baselines.

[Figure 3: Results on linear and non-linear non-IID data, with panels (a) results on ER graphs and (b) results on SF graphs, plotted against the number of variables {10, 20, 40, 80}.]

We make some important observations:

(i) FedCausal shows the best performance among the compared methods on decentralized heterogeneous data. NO-ADMM pursues consistency across all local models and thus fails to identify the global causal graph in this practical and challenging non-IID setting. NO-ALL gets extremely high FDR values because the training data are mixed from multiple causal models, which prevents NO-ALL from accurately identifying causal relationships. This observation suggests that causal algorithms premised on centralized data cannot identify causal graphs from non-IID data. The federated causal discovery algorithms (FedCausal, FedDAG, and NO-w/o Acy) perform better on heterogeneous data than on homogeneous nonlinear data. This is because clients discover the same false causal relationships on IID data and reinforce these false edges during aggregation, whereas on non-IID data the clients' false edges offset each other. FedCausal has a much lower FDR than FedDAG and a higher TPR than NO-w/o Acy. The graph found by NO-Avg has very high TPR and FDR values, without guaranteed acyclicity.

(ii) FedCausal effectively improves the local causal graphs and guarantees acyclicity. NO-Avg cannot obtain a reliable and acyclic causal graph on non-IID data, as shown by its poor local results and high $h(\theta^{(1)})$ (a smaller $h(\theta^{(1)})$ indicates better acyclicity) in Table 1. This is because the clients of NO-Avg are trained independently, and the limited local data cannot support them in accurately learning the nonlinear and heterogeneous data models, which are more complex than linear ones. In contrast, by aggregating local results at each interaction, FedCausal, NO-w/o Acy and FedDAG significantly improve the quality of the local graphs. FedDAG learns local graphs masked by an approximately binary global matrix, which lets clients obtain very similar results with a global $h(\theta^{(1)})$ equal to 0. However, the masked model parameters cannot be effectively updated, so the model cannot correct erroneous edges, which limits its accuracy. In terms of client performance and global acyclicity, FedCausal is superior to NO-w/o Acy.
This proves the effectiveness of our adaptive optimization, which guarantees an acyclic global causal graph for the clients.

| Method (40 nodes) | TPR (%) | FDR (%) | $h(\theta^{(1)})$ |
|---|---|---|---|
| FedDAG | 82.4 ± 0.0 | 10.8 ± 0.3 | 0 |
| FedCausal | 92.4 ± 1.9 | 4.4 ± 2.8 | 4.2 × 10⁻¹² |
| NO-w/o Acy | 84.7 ± 3.4 | 8.3 ± 5.0 | 7.8 × 10⁻⁷ |
| NO-Avg | 77.0 ± 11.2 | 71.9 ± 6.0 | 2.2 × 10⁻¹ |

Table 1: Evaluation of the global acyclicity constraint term $h(\theta^{(1)})$ and of the local causal graphs (TPR, FDR) on 40-node graphs.

In addition, to study the communication efficiency of FedCausal and the other federated baselines, we show the performance trend of the algorithms as a function of the number of communication rounds in the supplementary file (Yang et al. 2023). The results show that FedCausal has the best communication efficiency. We also explore the robustness of FedCausal to different numbers of clients and local sample sizes, and present the results in the supplementary file (Yang et al. 2023). We observe that FedCausal is stable and outperforms the rivals in all cases.

### Results on a Real Dataset

We evaluate FedCausal on a protein signaling network based on expression levels of proteins and phospholipids (Sachs et al. 2005), which is widely accepted by the biology community. There are a total of n = 7466 samples, d = 11 variables, and 20 edges in the ground-truth DAG of this dataset (Sachs). We randomly select n = 7460 samples and distribute them evenly across 10 clients to mimic scattered data. The results are shown in Table 2.

| Method | Type | TPR | FDR | SHD | NNZ |
|---|---|---|---|---|---|
| PC | CP | 50.00 | 61.54 | 24 | 26 |
| GES | CP | 40.00 | 78.95 | 31 | 38 |
| FedCausal | DAG | 45.00 | 52.63 | 15 | 19 |
| NO-ALL | DAG | 45.00 | 57.14 | 18 | 21 |
| NO-w/o Acy | DAG | 40.00 | 69.23 | 24 | 26 |
| NO-Avg | DAG | 40.00 | 70.37 | 25 | 27 |
| FedDAG | DAG | 40.00 | 71.43 | 26 | 28 |
| NO-ADMM | DAG | 20.00 | 80.00 | 29 | 20 |

Table 2: Results on the Sachs dataset. DAG and CP refer to algorithms that identify a directed acyclic graph and a completed partially directed acyclic graph, respectively.

FedCausal achieves the best and most reliable results. PC and GES, as typical centralized causal discovery algorithms, both show good TPR but high SHD, because they can only identify CPDAGs. NO-ALL also uses all data to identify DAGs, so its results are better than those of most federated approaches. The federated methods NO-w/o Acy, NO-Avg, FedDAG and NO-ADMM are limited by the decentralized data and are outperformed by the centralized PC and NO-ALL. FedCausal not only outperforms the other federated methods but is even better than NO-ALL, which may be owed to the effective interaction between the clients and the server powered by our adaptive optimization.

## Conclusion

This paper introduces a federated approach (FedCausal) to learn a unified global causal graph from decentralized non-IID data. FedCausal uses an explainable and adaptive optimization process to coordinate clients to optimize local causal graphs based on their own data, and to learn a global graph with guaranteed acyclicity. Our analysis shows that the optimization objective of FedCausal under statistically homogeneous data is consistent with that of causal discovery algorithms for centralized data, and that FedCausal can flexibly learn DAGs from decentralized heterogeneous data. Experimental results confirm its effectiveness, generality and reliability on IID and non-IID data.

## Acknowledgements

This work is supported by NSFC (No. 62072380, 62031003 and 62272276), the National Key Research and Development Program of China (No. 2022YFC3502101), the Taishan Scholars Project Special Funding, the Xiaomi Young Talents Program, and the CAAI-Huawei MindSpore Open Fund.
## References

- Bernstein, D.; Saeed, B.; Squires, C.; and Uhler, C. 2020. Ordering-based causal structure learning in the presence of latent variables. In AISTATS, 4098-4108.
- Bühlmann, P.; Peters, J.; and Ernest, J. 2014. CAM: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6): 2526-2556.
- Byrd, R. H.; Lu, P.; Nocedal, J.; and Zhu, C. 1995. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5): 1190-1208.
- Chai, D.; Wang, L.; Chen, K.; and Yang, Q. 2020. Secure federated matrix factorization. IEEE Intelligent Systems, 36(5): 11-20.
- Chickering, D. M. 2002. Optimal structure identification with greedy search. JMLR, 3(11): 507-554.
- Gao, E.; Chen, J.; Shen, L.; Liu, T.; Gong, M.; and Bondell, H. 2023. FedDAG: Federated DAG Structure Learning. TMLR.
- Gou, K. X.; Jun, G. X.; and Zhao, Z. 2007. Learning Bayesian network structure from distributed homogeneous data. In ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, volume 3, 250-254. IEEE.
- He, Y.; Cui, P.; Shen, Z.; Xu, R.; Liu, F.; and Jiang, Y. 2021. DARING: Differentiable causal discovery with residual independence. In KDD, 596-605.
- Koller, D.; and Friedman, N. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.
- Lachapelle, S.; Brouillard, P.; Deleu, T.; and Lacoste-Julien, S. 2019. Gradient-Based Neural DAG Learning. In ICLR.
- Liang, J.; Wang, J.; Yu, G.; Guo, W.; Domeniconi, C.; and Guo, M. 2023. Directed Acyclic Graph Learning on Attributed Heterogeneous Network. IEEE Transactions on Knowledge and Data Engineering.
- McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; and y Arcas, B. A. 2017. Communication-efficient learning of deep networks from decentralized data. In AISTATS, 1273-1282.
- Na, Y.; and Yang, J. 2010. Distributed Bayesian network structure learning. In ISIE, 1607-1611.
- Nemirovsky, A. 1999. Optimization II. Numerical methods for nonlinear continuous optimization.
- Ng, I.; and Zhang, K. 2022. Towards federated Bayesian network structure learning with continuous optimization. In AISTATS, 8095-8111.
- Ng, I.; Zhu, S.; Fang, Z.; Li, H.; Chen, Z.; and Wang, J. 2022. Masked gradient-based causal structure learning. In SDM, 424-432.
- Pearl, J. 2009. Causality. Cambridge University Press.
- Peters, J.; Janzing, D.; and Schölkopf, B. 2017. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.
- Sachs, K.; Perez, O.; Pe'er, D.; Lauffenburger, D. A.; and Nolan, G. P. 2005. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721): 523-529.
- Shimizu, S. 2012. Joint estimation of linear non-Gaussian acyclic models. Neurocomputing, 81: 104-107.
- Spirtes, P.; Glymour, C. N.; Scheines, R.; and Heckerman, D. 2000. Causation, Prediction, and Search. MIT Press.
- Spirtes, P. L.; Meek, C.; and Richardson, T. S. 2013. Causal inference in the presence of latent variables and selection bias. arXiv preprint arXiv:1302.4983.
- Tillman, R. E.; Danks, D.; and Glymour, C. 2008. Integrating locally learned causal structures with overlapping variables. In NeurIPS, 1665-1672.
- Triantafillou, S.; and Tsamardinos, I. 2015. Constraint-based causal discovery from multiple interventions over overlapping variable sets. JMLR, 16(1): 2147-2205.
- Tsamardinos, I.; Brown, L. E.; and Aliferis, C. F. 2006. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1): 31-78.
- Xiong, R.; Koenecke, A.; Powell, M.; Shen, Z.; Vogelstein, J. T.; and Athey, S. 2021. Federated causal inference in heterogeneous observational data. arXiv preprint arXiv:2107.11732.
- Yang, D.; He, X.; Wang, J.; Yu, G.; Domeniconi, C.; and Zhang, J. 2023. Federated Causality Learning with Explainable Adaptive Optimization. arXiv:2312.05540.
- Ye, Q.; Amini, A. A.; and Zhou, Q. 2022. Distributed Learning of Generalized Linear Causal Networks. arXiv preprint arXiv:2201.09194.
- Yu, Y.; Chen, J.; Gao, T.; and Yu, M. 2019. DAG-GNN: DAG structure learning with graph neural networks. In ICML, 7154-7163.
- Yuan, M. 2011. On the identifiability of additive index models. Statistica Sinica, 1901-1911.
- Zheng, X.; Aragam, B.; Ravikumar, P.; and Xing, E. P. 2018. DAGs with NO TEARS: continuous optimization for structure learning. In NeurIPS, 9492-9503.
- Zheng, X.; Dan, C.; Aragam, B.; Ravikumar, P.; and Xing, E. 2020. Learning sparse nonparametric DAGs. In AISTATS, 3414-3425.