Published as a conference paper at ICLR 2024

FEDERATED CAUSAL DISCOVERY FROM HETEROGENEOUS DATA

Loka Li1, Ignavier Ng2, Gongxu Luo1, Biwei Huang3, Guangyi Chen1,2, Tongliang Liu1, Bin Gu1, Kun Zhang1,2
1 Mohamed bin Zayed University of Artificial Intelligence
2 Carnegie Mellon University
3 University of California San Diego
{longkang.li, kun.zhang}@mbzuai.ac.ae

ABSTRACT

Conventional causal discovery methods rely on centralized data, which is inconsistent with the decentralized nature of data in many real-world situations. This discrepancy has motivated the development of federated causal discovery (FCD) approaches. However, existing FCD methods may be limited by their potentially restrictive assumptions of identifiable functional causal models or homogeneous data distributions, narrowing their applicability in diverse scenarios. In this paper, we propose a novel FCD method attempting to accommodate arbitrary causal models and heterogeneous data. We first utilize a surrogate variable corresponding to the client index to account for the data heterogeneity across different clients. We then develop a federated conditional independence test (FCIT) for causal skeleton discovery and establish a federated independent change principle (FICP) to determine causal directions. These approaches involve constructing summary statistics as a proxy of the raw data to protect data privacy. Owing to their nonparametric properties, FCIT and FICP make no assumption about particular functional forms, thereby facilitating the handling of arbitrary causal models. We conduct extensive experiments on synthetic and real datasets to show the efficacy of our method. The code is available at https://github.com/lokali/FedCDH.git.

1 INTRODUCTION

Causal discovery aims to learn the causal structure from observational data, attracting significant attention from fields such as machine learning and artificial intelligence (Nogueira et al., 2021), healthcare (Shen et al., 2020), economics (Zhang & Chan, 2006), manufacturing (Vuković & Thalmann, 2022) and neuroscience (Tu et al., 2019). Recently, it has been facing new opportunities and challenges from the rapid growth of data volume. One of the key challenges is data decentralization. Traditionally, causal discovery is conducted at a centralized site where all data is gathered in one location. However, in real-world scenarios, data is often distributed across multiple parties, such as healthcare data across various hospitals (Kidd et al., 2022). Consequently, there has been increasing interest in federated causal discovery (FCD), which aims to uncover the underlying causal structure of decentralized data with privacy and security concerns.

Existing FCD methods from observational data can be generally classified as continuous-optimization-based, constraint-based, and score-based methods. Some continuous-optimization-based methods extend NOTEARS (Zheng et al., 2018) with federated strategies, such as NOTEARS-ADMM (Ng & Zhang, 2022), which relies on the ADMM (Boyd et al., 2011) optimization method, FedDAG (Gao et al., 2022), which employs the FedAvg (McMahan et al., 2017) technique, and FED-CD (Abyaneh et al., 2022), which utilizes belief aggregation. These methods might suffer from various technical issues, including convergence (Wei et al., 2020; Ng et al., 2022), nonconvexity (Ng et al., 2023), and sensitivity to data standardization (Reisach et al., 2021).
As for score-based methods, DARLIS (Ye et al., 2022) utilizes the distributed annealing (Arshad & Silaghi, 2004) strategy to search for the optimal graph, while PERI (Mian et al., 2023) aggregates the results of the local greedy equivalence search (GES) (Chickering, 2002) and chooses the worst-case regret for each iteration.

Table 1: Comparison of related works on federated causal discovery from observational data. Our FedCDH method can handle arbitrary functional causal models and heterogeneous data.

Method | Category | Data Distribution | Causal Model Assumption¹ | Identifiability Requirement | Federated Strategy
NOTEARS-ADMM (Ng & Zhang, 2022) | Optimization-based | Homogeneous | Linear Gaussian, EV | Yes | ADMM
NOTEARS-MLP-ADMM (Ng & Zhang, 2022) | Optimization-based | Homogeneous | Nonlinear Additive Noise | Yes | ADMM
FedDAG (Gao et al., 2022) | Optimization-based | Heterogeneous | Nonlinear Additive Noise | Yes | FedAvg
FED-CD (Abyaneh et al., 2022) | Optimization-based | Heterogeneous | Nonlinear Additive Noise | Yes | Belief Aggregation
DARLIS (Ye et al., 2022) | Score-based | Homogeneous | Generalized Linear | No | Distributed Annealing
PERI (Mian et al., 2023) | Score-based | Homogeneous | Linear Gaussian, NV | No | Voting
FedPC (Huang et al., 2022) | Constraint-based | Homogeneous | Linear Gaussian, NV | No | Voting
FedCDH (Ours) | Constraint-based | Heterogeneous | Arbitrary Functions | No | Summary Statistics

¹ The linear Gaussian model with equal noise variances (EV) (Peters & Bühlmann, 2014) and the nonlinear additive noise model (Hoyer et al., 2008) are identifiable, while the linear Gaussian model with non-equal noise variances (NV) is not identifiable. The generalized linear model and the arbitrary functional model are certainly not identifiable.

One constraint-based method, FEDC2SL (Wang et al., 2023), extends the χ² test to the federated setting; however, this method is restricted to discrete variables and is therefore not applicable to continuous variables. Other constraint-based methods, such as FedPC (Huang et al., 2022), aggregate the skeletons and directions obtained by the Peter-Clark (PC) algorithm (Spirtes et al., 2000) at each client via a voting mechanism. However, as shown in Table 1, most of these methods heavily rely on either identifiable functional causal models or homogeneous data distributions. These assumptions may be overly restrictive and difficult to satisfy in real-world scenarios, limiting their applicability. For instance, distribution shifts may often occur across different clients owing to different interventions, collection conditions, or domains, resulting in heterogeneous data. Please refer to Appendix A2 for further discussion of related works, including those on causal discovery, heterogeneous data, and FCD.

In this paper, we propose FedCDH, a novel constraint-based approach for Federated Causal Discovery from Heterogeneous data. The primary innovation of FedCDH lies in using summary statistics as a proxy for raw data during skeleton discovery and direction determination in a federated fashion. Specifically, to address heterogeneous data, we first introduce a surrogate variable corresponding to the client or domain index, allowing our method to model distribution changes. Unlike existing FCD methods that only leverage the data from different clients to increase the total sample size, we demonstrate how such data heterogeneity across different clients benefits the identification of causal directions and how to exploit it. Furthermore, we propose a federated conditional independence test (FCIT) for causal skeleton discovery, incorporating random features (Rahimi & Recht, 2007) to approximate the kernel matrix, which facilitates the construction of the covariance matrix.
Additionally, we develop a federated independent change principle (FICP) to determine causal directions, exploiting the causal asymmetry. FICP also employs random features to approximate the embeddings of heterogeneous conditional distributions for representing changing causal models. It is important to note that FCIT and FICP are non-parametric, making no assumption about specific functional forms, thus facilitating the handling of arbitrary causal models. To evaluate our method, we conduct extensive experiments on synthetic datasets, including linear Gaussian models and general functional models, and on real-world datasets, including the fMRI Hippocampus (Poldrack et al., 2015) and HK Stock Market (Huang et al., 2020) datasets. The significant performance improvements over other FCD methods demonstrate the superiority of our approach.

2 REVISITING CAUSAL DISCOVERY FROM HETEROGENEOUS DATA

In this section, we first provide an overview of causal discovery and some common assumptions, and then introduce the characterizations of conditional independence and independent change. This paper aims to extend these techniques from the centralized to the federated setting.

1) Causal Discovery with Changing Causal Models. Consider $d$ observable random variables denoted by $V = (V_1, \ldots, V_d)$ and $K$ clients, where each client corresponds to one unique domain. In this paper, we focus on horizontally-partitioned data (Samet & Miri, 2009), where each client holds a different subset of the total data samples while all clients share the same set of features. Let $k \in \{1, \ldots, K\}$ be the client index; the domain index likewise takes values in $\{1, \ldots, K\}$, since each client corresponds to one domain. Each client has $n_k$ samples, and in total there are $n = \sum_{k=1}^{K} n_k$ samples. The task of federated causal discovery is to recover the causal graph $G$ given the decentralized data matrix $V \in \mathbb{R}^{n \times d}$. When the data is homogeneous, the causal process for each variable $V_i$ can be represented by the following structural causal model (SCM): $V_i = f_i(\mathrm{PA}_i, \epsilon_i)$, where $f_i$ is the causal function, $\mathrm{PA}_i$ denotes the parents of $V_i$, $\epsilon_i$ is a noise term with non-zero variance, and the $\epsilon_i$'s are assumed to be mutually independent. When the data is heterogeneous, some causal models must be changing across different domains. The changes may be caused by variations of causal strengths or noise variances.
Therefore, we formulate the causal process for heterogeneous data as $V_i = f_i(\mathrm{PA}_i, \epsilon_i, \theta_i(\cdot), \psi(\cdot))$, where the argument of $\theta_i(\cdot)$ and $\psi(\cdot)$ is the domain index, regarded as an observed random variable, and the function $f_i$ or the distribution of the noise $\epsilon_i$ may differ or change across domains. Both $\psi(\cdot)$ and $\theta_i(\cdot)$ are unobserved domain-changing factors represented as functions of the domain index. $\psi(\cdot)$ is the set of pseudo confounders that influence the whole set of variables, and we assume there are $L$ such confounders, i.e., $\psi(\cdot) = \{\psi_l(\cdot)\}_{l=1}^{L}$; the minimum value of $L$ is $0$, meaning that there is no such latent confounder in the graph, while the maximum value is $\binom{d}{2} = \frac{d(d-1)}{2}$, meaning that each pair of observed variables has a hidden confounder. $\theta_i(\cdot)$ denotes the effective parameters of $V_i$ in the model, and we assume that $\theta_i(\cdot)$ is specific to $V_i$ and is independent of $\theta_j(\cdot)$ for any $i \neq j$. Both $\psi(\cdot)$ and $\theta_i(\cdot)$ take the domain index, a positive integer, as input and output a real number. Let $G_{obs}$ be the underlying causal graph over $V$, and $G_{aug}$ be the augmented graph over $V \cup \psi(\cdot) \cup \{\theta_i(\cdot)\}_{i=1}^{d}$. For causal discovery with changing causal models, we follow previous work such as CD-NOD (Huang et al., 2020) and make the following assumptions.

Assumption 1 (Pseudo Causal Sufficiency). There is no confounder within the dataset of any single domain, but the changes of different causal modules across different domains are allowed to be dependent.

Assumption 2 (Markov and Faithfulness). The joint distribution over $V \cup \psi(\cdot) \cup \{\theta_i(\cdot)\}_{i=1}^{d}$ is Markov and faithful to $G_{aug}$.

Figure 1: An illustration where the causal models of variables $V_i$ and $V_j$ change across domains. (a) The graph with unobserved domain-changing factors $\psi_l(\cdot)$, $\theta_i(\cdot)$ and $\theta_j(\cdot)$; (b) the simplified graph with the surrogate variable.

To remove the potential influence of confounders and recover causal relations across different domains, causal discovery can be conducted on the augmented graph $G_{aug}$ instead of $G_{obs}$. While $\psi(\cdot) \cup \{\theta_i(\cdot)\}_{i=1}^{d}$ are unobserved, the domain index is observed. Therefore, the domain index is introduced as a surrogate variable (Huang et al., 2020) for causal discovery from heterogeneous data. An illustration is given in Figure 1, where the augmented graph with the unobserved domain-changing variables $\psi(\cdot)$ and $\theta_i(\cdot)$ can be simplified into an augmented graph with just a surrogate variable. If there is an edge between the surrogate variable and an observed variable $V_i$ in $G_{aug}$, then the causal model related to $V_i$ changes across different domains; in other words, the data distribution of $V_i$ is heterogeneous across domains.

2) Characterization of Conditional Independence. Let $X, Y, Z$ be random variables or sets of random variables, with domains $\mathcal{X}, \mathcal{Y}, \mathcal{Z}$, respectively. Define a measurable and positive definite kernel $k_{\mathcal{X}}$, and denote the corresponding reproducing kernel Hilbert space (RKHS) by $\mathcal{H}_{\mathcal{X}}$. Similarly, we define $k_{\mathcal{Y}}$, $\mathcal{H}_{\mathcal{Y}}$, $k_{\mathcal{Z}}$ and $\mathcal{H}_{\mathcal{Z}}$. One of the most used characterizations of conditional independence (CI) is: $X \perp Y \mid Z$ if and only if $P_{XY|Z} = P_{X|Z} P_{Y|Z}$, or equivalently $P_{X|Y,Z} = P_{X|Z}$. Another characterization of CI is given in terms of the partial cross-covariance operator on the RKHS.

Lemma 1 (Characterization of CI with Partial Cross-covariance (Fukumizu et al., 2007)). Let $\ddot{X} \triangleq (X, Z)$, $k_{\ddot{X}} \triangleq k_{\mathcal{X}} k_{\mathcal{Z}}$, and let $\mathcal{H}_{\ddot{X}}$ be the RKHS corresponding to $k_{\ddot{X}}$. Assume that $\mathcal{H}_{\mathcal{X}} \subset L^2_X$, $\mathcal{H}_{\mathcal{Y}} \subset L^2_Y$, $\mathcal{H}_{\mathcal{Z}} \subset L^2_Z$. Further assume that $k_{\ddot{X}} k_{\mathcal{Y}}$ is a characteristic kernel on $(\mathcal{X} \times \mathcal{Z}) \times \mathcal{Y}$, and that $\mathcal{H}_{\mathcal{Z}} + \mathbb{R}$ (the direct sum of the two RKHSs) is dense in $L^2(P_Z)$.
Let $\Sigma_{\ddot{X}Y|Z}$ be the partial cross-covariance operator. Then
$\Sigma_{\ddot{X}Y|Z} = 0 \iff X \perp Y \mid Z$. (1)

Based on the above lemma, we further consider a different characterization of CI that enforces the uncorrelatedness of functions in suitable spaces, which may be intuitively more appealing. More details about the interpretation of $\Sigma_{\ddot{X}Y|Z}$, the definition of a characteristic kernel, and the uncorrelatedness-based characterization of CI are given in Appendix A3.1.

3) Characterization of Independent Change. The Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2007) is a statistical measure used to assess the independence between two random variables in the RKHS. We use the normalized HSIC to evaluate the independence of two changing causal models. The value of the normalized HSIC ranges from 0 to 1, and a smaller value indicates that the two changing causal models are more independent. Let $\hat{\Delta}_{X \to Y}$ be the normalized HSIC between $P(X)$ and $P(Y|X)$, and $\hat{\Delta}_{Y \to X}$ be the normalized HSIC between $P(Y)$ and $P(X|Y)$. Then, we can determine the causal direction between $X$ and $Y$ with the following lemma.

Lemma 2 (Independent Change Principle (Huang et al., 2020)). Let $X$ and $Y$ be two observed random variables. Assume that both $X$ and $Y$ have changing causal models (both of them are adjacent to the surrogate variable in $G_{aug}$). Then the causal direction between $X$ and $Y$ can be determined according to the following rules: i) if $\hat{\Delta}_{X \to Y} < \hat{\Delta}_{Y \to X}$, output the direction $X \to Y$; ii) if $\hat{\Delta}_{X \to Y} > \hat{\Delta}_{Y \to X}$, output the direction $Y \to X$.

More details about the definition and formulation of $\hat{\Delta}_{X \to Y}$ and $\hat{\Delta}_{Y \to X}$ are given in Appendix A3.2. It is important to note that, once the Gaussian kernel is utilized, the kernel-based conditional independence test (Zhang et al., 2012) and the kernel-based independent change principle (Huang et al., 2020) assume smoothness of the relationships among continuous variables.

3 FEDERATED CAUSAL DISCOVERY FROM HETEROGENEOUS DATA

In this section, we explain our proposed FedCDH method in detail. The overall framework of FedCDH is given in Figure 2. The two key submodules of our method are the federated conditional independence test (FCIT; Theorem 4 and Theorem 5) and the federated independent change principle (FICP; Theorem 6), which are presented in Section 3.1 and Section 3.2, respectively. We then illustrate how to construct the summary statistics and how to implement FCIT and FICP with summary statistics (Theorem 8) in Section 3.3. Last but not least, we discuss the communication costs and secure computations in Section 3.4. For the proofs of the theorems and lemmas, please refer to Appendix A4.

3.1 FEDERATED CONDITIONAL INDEPENDENCE TEST (FCIT)

1) Null Hypothesis. Consider the null and alternative hypotheses
$H_0: X \perp Y \mid Z$, $\quad H_1: X \not\perp Y \mid Z$. (2)

According to Eq. 1, we can measure conditional independence based on the RKHSs. Therefore, we equivalently rewrite the above hypotheses more explicitly as
$H_0: \|\Sigma_{\ddot{X}Y|Z}\|^2_{HS} = 0$, $\quad H_1: \|\Sigma_{\ddot{X}Y|Z}\|^2_{HS} > 0$. (3)

Note that the computational forms of the Hilbert-Schmidt norm and the Frobenius norm are the same; the difference is that the Hilbert-Schmidt norm is defined on an infinite-dimensional Hilbert space while the Frobenius norm is defined on a finite-dimensional Euclidean space.
We here consider the squared Frobenius norm of the empirical partial cross-covariance matrix as an approximation for the hypotheses, given as
$H_0: \|C_{\ddot{X}Y|Z}\|^2_F = 0$, $\quad H_1: \|C_{\ddot{X}Y|Z}\|^2_F > 0$, (4)
where $C_{\ddot{X}Y|Z} = \frac{1}{n}\sum_{i=1}^{n} \big[(A_i - E(A|Z))^{\top}(B_i - E(B|Z))\big]$ corresponds to the partial cross-covariance matrix with $n$ samples, $C_{\ddot{X}Y|Z} \in \mathbb{R}^{h \times h}$, $A = f(\ddot{X}) \in \mathbb{R}^{n \times h}$, $B = g(Y) \in \mathbb{R}^{n \times h}$, $\{f_j(\ddot{X})\}_{j=1}^{h} \subset \mathcal{F}_{\ddot{X}}$, and $\{g_j(Y)\}_{j=1}^{h} \subset \mathcal{F}_Y$. Here $n$ and $h$ denote the total number of samples over all clients and the number of hidden features or mapping functions, respectively. Since $\ddot{X} \triangleq (X, Z)$, for each function $f_j: \ddot{\mathcal{X}} \mapsto \mathcal{F}_{\ddot{X}}$ the input is $\ddot{X} \in \mathbb{R}^{n \times 2}$ and the output is $f_j(\ddot{X}) \in \mathbb{R}^{n}$. For each function $g_k: \mathcal{Y} \mapsto \mathcal{F}_Y$, the input is $Y \in \mathbb{R}^{n}$ and the output is $g_k(Y) \in \mathbb{R}^{n}$. Notice that $\mathcal{F}_{\ddot{X}}$ and $\mathcal{F}_Y$ are function spaces, which are set to be the support of the process $\sqrt{2}\cos(w\,\cdot + b)$, where $w$ follows a standard Gaussian distribution and $b$ follows a uniform distribution on $[0, 2\pi]$. We choose these specific spaces because in this paper we use random features to approximate the kernels.

Figure 2: Overall framework of FedCDH. Left: The clients send their sample sizes and local covariance tensors to the server for constructing the summary statistics; federated causal discovery is then carried out on the server. Right Top: Relying on the summary statistics, we propose two submodules, the federated conditional independence test and the federated independent change principle, for skeleton discovery and direction determination. Right Bottom: An example of FCD with three observed variables is illustrated, where the causal modules related to $V_2$ and $V_3$ are changing.

$E(A|Z)$ and $E(B|Z)$ could be non-linear functions of $Z$, which are difficult to estimate. Therefore, we approximate them with linear functions. Let $q(Z) \in \mathbb{R}^{n \times h}$ with $\{q_j(Z)\}_{j=1}^{h} \subset \mathcal{F}_Z$, where $\mathcal{F}_Z$ shares a similar function space with $\mathcal{F}_Y$. We can estimate $E(f_j|Z)$ with the linear ridge regression solution $u_j^{\top} q(Z)$ and estimate $E(g_j|Z)$ with $v_j^{\top} q(Z)$ under mild conditions (Sutherland & Schneider, 2015). We now give the following lemma.

Lemma 3 (Characterization of Conditional Independence). Let $f_j$ and $g_j$ be the functions defined for the variables $\ddot{X}$ and $Y$, respectively. Then $X \perp Y \mid Z$ is approximated by the following condition:
$E(\tilde{f}\tilde{g}) = 0, \quad \forall \tilde{f} \in \mathcal{F}_{\ddot{X}|Z} \text{ and } \tilde{g} \in \mathcal{F}_{Y|Z}$, (5)
where $\mathcal{F}_{\ddot{X}|Z} = \{\tilde{f} \mid \tilde{f}_j = f_j - u_j^{\top} q(Z),\ f_j \in \mathcal{F}_{\ddot{X}}\}$ and $\mathcal{F}_{Y|Z} = \{\tilde{g} \mid \tilde{g}_j = g_j - v_j^{\top} q(Z),\ g_j \in \mathcal{F}_Y\}$.

Let $\gamma$ be a small ridge parameter. According to Eq. 1 and Eq. 5, by ridge regression we obtain
$C_{\ddot{X}Y|Z} = C_{\ddot{X}Y} - C_{\ddot{X}Z}(C_{ZZ} + \gamma I)^{-1} C_{ZY}$. (6)

2) Test Statistic and Null Distribution. In order to ensure convergence to a non-degenerate distribution, we multiply the empirical estimate of the squared Frobenius norm by $n$ and use it as the test statistic $T_{CI} = n\,\|C_{\ddot{X}Y|Z}\|^2_F$. Let $\tilde{K}_{\ddot{X}|Z}$ be the centralized kernel matrix, given by $\tilde{K}_{\ddot{X}|Z} \triangleq H R_{\ddot{X}|Z} R_{\ddot{X}|Z}^{\top} H$, where $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$ and $R_{\ddot{X}|Z} \triangleq \tilde{f}(\ddot{X}) = f(\ddot{X}) - u^{\top} q(Z)$, which can be seen as the residual after kernel ridge regression. Here, $I$ refers to the $n \times n$ identity matrix and $\mathbf{1}$ denotes the vector of $n$ ones. We define $\tilde{K}_{Y|Z}$ similarly. Let $\lambda_{\ddot{X}|Z}$ and $\lambda_{Y|Z}$ be the eigenvalues of $\tilde{K}_{\ddot{X}|Z}$ and $\tilde{K}_{Y|Z}$, respectively.
Let $\{\alpha_{ij}\}$ denote i.i.d. standard Gaussian variables, so that $\{\alpha_{ij}^2\}$ are i.i.d. $\chi^2_1$ variables. Considering $n$ i.i.d. samples from the joint distribution $P_{\ddot{X}YZ}$, we have:

Theorem 4 (Federated Conditional Independence Test). Under the null hypothesis $H_0$ ($X$ and $Y$ are conditionally independent given $Z$), the test statistic
$T_{CI} \triangleq n\,\|C_{\ddot{X}Y|Z}\|^2_F$ (7)
has the asymptotic distribution $\hat{T}_{CI} \triangleq \sum_{i,j} \lambda_{\ddot{X}|Z,i}\, \lambda_{Y|Z,j}\, \alpha^2_{ij}$.

Although the defined test statistic is equivalent to that of the kernel-based conditional independence test (KCI) (Zhang et al., 2012), the asymptotic distributions take different forms. Please note that large-sample properties are needed when deriving the asymptotic distribution $\hat{T}_{CI}$ above; the proof is given in Appendix A4.2. Given that $X \perp Y \mid Z$, we can exploit the independence between $R_{\ddot{X}|Z}$ and $R_{Y|Z}$, which leads to the separation between $\lambda_{\ddot{X}|Z,i}$ and $\lambda_{Y|Z,j}$. We show that this separated form helps to approximate the null distribution in terms of a decomposable statistic, such as the covariance matrix. We approximate the null distribution with a two-parameter Gamma distribution, which is determined by the mean and variance. Under the hypothesis $H_0$ and given the sample $\mathcal{D}$, the distribution of $\hat{T}_{CI}$ can be approximated by the $\Gamma(\hat{k}, \hat{\theta})$ distribution: $P(t) = t^{\hat{k}-1} e^{-t/\hat{\theta}} / (\hat{\theta}^{\hat{k}}\, \Gamma(\hat{k}))$, where $\hat{k} = E^2(\hat{T}_{CI}|\mathcal{D}) / \mathrm{Var}(\hat{T}_{CI}|\mathcal{D})$ and $\hat{\theta} = \mathrm{Var}(\hat{T}_{CI}|\mathcal{D}) / E(\hat{T}_{CI}|\mathcal{D})$. We propose to approximate the null distribution with the mean and variance given in the following theorem.

Theorem 5 (Null Distribution Approximation). Under the null hypothesis $H_0$ ($X$ and $Y$ are conditionally independent given $Z$), we have
$E(\hat{T}_{CI}|\mathcal{D}) = \mathrm{tr}(C_{\ddot{X}|Z}) \cdot \mathrm{tr}(C_{Y|Z})$, $\quad \mathrm{Var}(\hat{T}_{CI}|\mathcal{D}) = 2\,\|C_{\ddot{X}|Z}\|^2_F\, \|C_{Y|Z}\|^2_F$, (8)
where $C_{\ddot{X}|Z} = \frac{1}{n} R_{\ddot{X}|Z}^{\top} H H R_{\ddot{X}|Z}$, $C_{Y|Z} = \frac{1}{n} R_{Y|Z}^{\top} H H R_{Y|Z}$, and $\mathrm{tr}(\cdot)$ denotes the trace operator.

For testing the conditional independence $X \perp Y \mid Z$, in this paper we only deal with scenarios where $X$ and $Y$ each contain a single variable, while $Z$ may contain a single variable, multiple variables, or be empty. When $Z$ is empty, the test becomes the federated unconditional independence test (FUIT) as a special case. We provide more details about FUIT in Appendix A5.

3.2 FEDERATED INDEPENDENT CHANGE PRINCIPLE (FICP)

As described in Lemma 2, we can use the independent change principle (ICP) to evaluate the dependence between two changing causal models. However, the existing ICP (Huang et al., 2020) heavily relies on the kernel matrix to calculate the normalized HSIC. This is challenging for decentralized data, because the off-diagonal entries of the kernel matrix require the raw data from different clients, which violates data privacy in federated learning. Motivated by this, we propose to estimate the normalized HSIC with the following theorem.

Theorem 6 (Federated Independent Change Principle). In order to check whether two causal models change independently across different domains, we can estimate the dependence by
$\hat{\Delta}_{X \to Y} = \frac{\|C_{X,\tilde{Y}}\|^2_F}{\mathrm{tr}(C_X)\,\mathrm{tr}(C_{\tilde{Y}})}$, $\quad \hat{\Delta}_{Y \to X} = \frac{\|C_{Y,\tilde{X}}\|^2_F}{\mathrm{tr}(C_Y)\,\mathrm{tr}(C_{\tilde{X}})}$, (9)
where $\tilde{Y} \triangleq (Y|X)$, $\tilde{X} \triangleq (X|Y)$, $C_{X,\tilde{Y}}$ and $C_{Y,\tilde{X}}$ are specially-designed covariance matrices, and $C_X$, $C_{\tilde{Y}}$, $C_Y$ and $C_{\tilde{X}}$ are specially-designed variance matrices.
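As a small illustration of how Eq. 9 combines with the rules of Lemma 2, the following sketch (our own illustration rather than the released FedCDH code; it assumes the specially-designed covariance and variance matrices have already been formed from the summary statistics of Section 3.3) computes the two dependence measures and outputs a direction.

```python
import numpy as np

def dependence_measure(C_cross, C_a, C_b):
    """Eq. 9: ||C_cross||_F^2 / (tr(C_a) * tr(C_b))."""
    return np.linalg.norm(C_cross, 'fro') ** 2 / (np.trace(C_a) * np.trace(C_b))

def ficp_direction(C_x_ytilde, C_x, C_ytilde, C_y_xtilde, C_y, C_xtilde):
    """Lemma 2: prefer the direction with the smaller dependence measure."""
    d_xy = dependence_measure(C_x_ytilde, C_x, C_ytilde)   # Delta_{X -> Y}
    d_yx = dependence_measure(C_y_xtilde, C_y, C_xtilde)   # Delta_{Y -> X}
    return 'X -> Y' if d_xy < d_yx else 'Y -> X'
```

The construction of these matrices from the covariance tensor follows Theorem 8 and Appendix A3.2.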
3.3 IMPLEMENTING FCIT AND FICP WITH SUMMARY STATISTICS

We now give more details about how to implement FCIT and FICP with summary statistics. The procedures at the clients and the server are shown in Algorithm 1. Each client needs to calculate its local sample size and covariance tensor, which are aggregated into summary statistics at the server. The summary statistics contain two parts: the total sample size $n$ and the covariance tensor $CT$. With the summary statistics as a proxy, we can substitute them for the raw data at each client when performing FCD. The global statistics are decomposable because they can be obtained by simply summing up the local ones, such as $n = \sum_{k=1}^{K} n_k$ and $CT = \frac{1}{n}\sum_{k=1}^{K} n_k\, CT_k$.

Algorithm 1 FedCDH: Federated Causal Discovery from Heterogeneous Data
Input: data matrix $D_k \in \mathbb{R}^{n_k \times d}$ at each client, $k \in \{1, \ldots, K\}$
Output: a causal graph $G$
Client executes:
1: (Summary Statistics Calculation) For each client $k$, use the local data $D_k$ to obtain the sample size $n_k$, calculate the covariance tensor $CT_k$, and send them to the server.
Server executes:
2: (Summary Statistics Construction) Construct the summary statistics by summing up the local sample sizes and the local covariance tensors: $n = \sum_{k=1}^{K} n_k$, $CT = \sum_{k=1}^{K} CT_k$.
3: (Augmented Graph Initialization) Build a complete undirected graph $G_0$ on the extended variable set consisting of the observed variables $V$ and the surrogate variable.
4: (Federated Conditional Independence Test) Conduct the federated conditional independence test based on the summary statistics, for skeleton discovery on the augmented graph and for direction determination with one changing causal module. In the end, obtain an intermediate graph $G_1$.
5: (Federated Independent Change Principle) Conduct the federated independent change principle based on the summary statistics, for direction determination with two changing causal modules. Ultimately, output the causal graph $G$.

Specifically, we incorporate random Fourier features (Rahimi & Recht, 2007), because they have shown competitive performance in approximating continuous shift-invariant kernels. According to the following lemma, we can derive a decomposable covariance matrix from an indecomposable kernel matrix via random features.

Lemma 7 (Estimating Covariance Matrix from Kernel Matrix). Assuming there are $n$ i.i.d. samples for the centralized kernel matrices $\tilde{K}_x$, $\tilde{K}_y$, $\tilde{K}_{x,y}$ and the covariance matrix $C_{x,y}$, we have
$\mathrm{tr}(\tilde{K}_{x,y}) \approx \mathrm{tr}(\tilde{\phi}_w(x)\, \tilde{\phi}_w(y)^{\top}) = \mathrm{tr}(\tilde{\phi}_w(y)^{\top} \tilde{\phi}_w(x)) = n\, \mathrm{tr}(C_{x,y})$,
$\mathrm{tr}(\tilde{K}_x \tilde{K}_y) \approx \mathrm{tr}(\tilde{\phi}_w(x)\, \tilde{\phi}_w(x)^{\top}\, \tilde{\phi}_w(y)\, \tilde{\phi}_w(y)^{\top}) = \|\tilde{\phi}_w(x)^{\top} \tilde{\phi}_w(y)\|^2 = n^2\, \|C_{x,y}\|^2$, (10)
where $x, y \in \mathbb{R}^{n}$; $\tilde{K}_x, \tilde{K}_y, \tilde{K}_{x,y} \in \mathbb{R}^{n \times n}$; $C_{x,y} \in \mathbb{R}^{h \times h}$; $\tilde{\phi}_w(x) \in \mathbb{R}^{n \times h}$ is the centralized random feature matrix, $\tilde{\phi}_w(x) = H \phi_w(x)$, with $\phi_w(x) \triangleq \sqrt{2/h}\,[\cos(w_1 x + b_1), \ldots, \cos(w_h x + b_h)]^{\top}$ and $\phi_w(x) \in \mathbb{R}^{n \times h}$, and similarly for $\phi_w(y)$ and $\tilde{\phi}_w(y)$. Here $w$ is drawn from $P(w)$ and $b$ is drawn uniformly from $[0, 2\pi]$.

In this paper, we use random features to approximate the Gaussian kernel for continuous variables and the delta kernel for discrete variables such as the surrogate variable. It is important to note that the surrogate variable is essentially a discrete variable (more specifically, a categorical variable, with no numerical order among different values), and a common approach to deal with such discrete variables is to use the delta kernel. Notice that $C_{x,y}$ denotes the covariance matrix for the variable sets $x$ and $y$, which is sample-wise decomposable because $C_{x,y} = \frac{1}{n}\sum_{k=1}^{K} n_k\, C_{x_k, y_k}$, where $C_{x_k, y_k}$ corresponds to the local covariance matrix of the variable sets $x_k$ and $y_k$ at the $k$-th client. Here, $x_k, y_k \in \mathbb{R}^{n_k}$ and $C_{x_k, y_k} \in \mathbb{R}^{h \times h}$. A minimal sketch of this client-side construction and server-side aggregation is given below.
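Concretely, the following hedged sketch (our own illustration, not the released FedCDH code; the feature dimension, bandwidth, and variable layout are placeholder choices) shows how each client could form centralized random-feature matrices and a local covariance tensor, how the server could aggregate them, and how the FCIT quantities of Theorems 4 and 5 could then be read off the aggregated blocks. For simplicity it centers features with local means (exact global centering would additionally require sharing local feature means), works with features of $X$ directly rather than of the augmented pair $(X, Z)$, and handles only a single conditioning variable.

```python
import numpy as np
from scipy.stats import gamma

def rff(x, W, b):
    """Random Fourier features approximating a Gaussian kernel; x: (n_k,)."""
    return np.sqrt(2.0 / len(W)) * np.cos(np.outer(x, W) + b)   # (n_k, h)

def client_summary(D, W, b):
    """Local summary statistics: sample size n_k and covariance tensor CT_k."""
    n_k, d_aug = D.shape                       # d_aug = d observed vars + surrogate
    Phi = np.stack([rff(D[:, i], W, b) for i in range(d_aug)])   # (d_aug, n_k, h)
    Phi -= Phi.mean(axis=1, keepdims=True)     # local centering (a simplification)
    CT_k = np.einsum('iak,jal->ijkl', Phi, Phi)                  # (d_aug, d_aug, h, h)
    return n_k, CT_k

def aggregate(summaries):
    """Server: n = sum_k n_k, CT = (1/n) sum_k CT_k (local tensors unnormalized)."""
    n = sum(nk for nk, _ in summaries)
    CT = sum(CTk for _, CTk in summaries) / n
    return n, CT

def fcit_pvalue(n, CT, x, y, z, ridge=1e-3):
    """FCIT for single variables x, y given a single conditioning variable z."""
    h = CT.shape[-1]
    inv_zz = np.linalg.inv(CT[z, z] + ridge * np.eye(h))
    C_xy_z = CT[x, y] - CT[x, z] @ inv_zz @ CT[z, y]   # Eq. 6, partial cross-cov.
    C_xx_z = CT[x, x] - CT[x, z] @ inv_zz @ CT[z, x]   # residual (auto-)covariances,
    C_yy_z = CT[y, y] - CT[y, z] @ inv_zz @ CT[z, y]   # up to the small ridge term
    T = n * np.linalg.norm(C_xy_z, 'fro') ** 2          # Eq. 7
    mean = np.trace(C_xx_z) * np.trace(C_yy_z)          # Eq. 8
    var = 2 * np.linalg.norm(C_xx_z, 'fro') ** 2 * np.linalg.norm(C_yy_z, 'fro') ** 2
    k_hat, theta_hat = mean ** 2 / var, var / mean       # Gamma approximation
    return 1.0 - gamma.cdf(T, a=k_hat, scale=theta_hat)

# toy usage: the server broadcasts shared random-feature parameters W, b
rng = np.random.default_rng(0)
h = 20
W, b = rng.standard_normal(h), rng.uniform(0, 2 * np.pi, h)
clients = [rng.standard_normal((100, 4)) for _ in range(5)]      # toy raw data
n, CT = aggregate([client_summary(D, W, b) for D in clients])
print(fcit_pvalue(n, CT, x=0, y=1, z=2))
```

Note that only $n_k$ and $CT_k$ leave each client; the raw samples are never transmitted.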
In the augmented graph, there are $d' = d + 1$ variables ($d$ observed variables and one surrogate variable); thus we can construct a global covariance tensor $CT \in \mathbb{R}^{d' \times d' \times h \times h}$ by summing up the local ones $CT_k \in \mathbb{R}^{d' \times d' \times h \times h}$.

Theorem 8 (Sufficiency of Summary Statistics). The summary statistics, consisting of the total sample size $n$ and the covariance tensor $CT$, are sufficient to represent all the statistics needed for FCD, including $T_{CI}$ in Eq. 7, $E(\hat{T}_{CI}|\mathcal{D})$ and $\mathrm{Var}(\hat{T}_{CI}|\mathcal{D})$ in Eq. 8, and $\hat{\Delta}_{X \to Y}$ and $\hat{\Delta}_{Y \to X}$ in Eq. 9.

According to the above theorem, with the total sample size $n$ and the global covariance tensor $CT$ at the server, it is sufficient to conduct FCIT and FICP in the FCD procedure. More details about the skeleton discovery and direction determination rules are given in Appendix A6.

3.4 COMMUNICATION COSTS AND SECURE COMPUTATIONS

We propose to construct summary statistics without directly sharing the raw data, which already preserves data privacy to some extent. The raw data is of size $\mathbb{R}^{n \times d'}$, where we assume $n \gg d', h$. The constructed covariance tensor has dimension $\mathbb{R}^{d' \times d' \times h \times h}$, which can significantly reduce the communication costs when the sample size $n$ is large enough and the hidden dimension $h$ is small enough. Furthermore, if each client is required not to directly share its local summary statistics, one can incorporate standard secure computation techniques, such as secure multiparty computation (Cramer et al., 2015), which allows different clients to collectively compute a function over their inputs while keeping them private, or homomorphic encryption (Acar et al., 2018), which enables complex mathematical operations to process encrypted data without compromising the encryption. Please refer to Goryczka & Xiong (2015) for more on secure computation. It is worth noting that some secure computation techniques can introduce significant computation overhead. To further enhance privacy protection and computational efficiency, it would be beneficial to further improve our proposed method, and we leave this for future exploration.

4 EXPERIMENTS

To evaluate the efficacy of our proposed method, we conduct extensive experiments on both synthetic and real-world datasets. For the synthetic datasets, we consider the linear Gaussian model and the general functional model to show that our method can handle arbitrary functional causal models. We ensure that all synthetic datasets have some changing causal models, meaning that they are heterogeneous. To show the wide applicability of our method, we also run two real-world datasets, the fMRI Hippocampus (Poldrack et al., 2015) and HK Stock Market (Huang et al., 2020) datasets.

Synthetic Datasets. The true DAGs are simulated by the Erdős-Rényi model (Erdős et al., 1960) with the number of edges equal to the number of variables. We randomly select 2 out of the $d$ variables to be changing across clients. For a changing causal model, we generate data according to the SCM $V_i = \sum_{V_j \in \mathrm{PA}_i} \hat{\sigma}^k_{ij} f_i(V_j) + \hat{\gamma}^k \epsilon_i$, where $V_j \in \mathrm{PA}_i$ is a direct cause of $V_i$. The causal strength $\hat{\sigma}^k_{ij}$ and the parameter $\hat{\gamma}^k$ change across clients with index $k$, and are uniformly sampled from $U(0.5, 2.5)$ and $U(1, 3)$, respectively. A minimal sketch of this generating process is given below.
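The following is a hedged sketch of such a generating process (our own illustration with an arbitrary three-variable chain; the released benchmark code may differ in the graph, functions, and sampling details).

```python
import numpy as np

def generate_client(n_k, rng):
    """One client's data from the chain V1 -> V2 -> V3, where the causal
    module of V3 changes across clients (changing strength and noise scale)."""
    V1 = rng.normal(0.0, 1.0, n_k)
    V2 = 0.8 * V1 + rng.normal(0.0, 1.0, n_k)            # fixed causal module
    sigma_k = rng.uniform(0.5, 2.5)                       # changing strength
    gamma_k = rng.uniform(1.0, 3.0)                       # changing noise scale
    V3 = sigma_k * np.tanh(V2) + gamma_k * rng.normal(0.0, 1.0, n_k)
    return np.column_stack([V1, V2, V3])

rng = np.random.default_rng(0)
K, n_k = 10, 100
client_data = [generate_client(n_k, rng) for _ in range(K)]   # heterogeneous blocks
```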
We separately generate the data for each domain with different causal models and then combine them together. For a fixed causal model, we generate data according to $V_i = \sum_{V_j \in \mathrm{PA}_i} \hat{\sigma}_{ij} f_i(V_j) + \epsilon_i$. We consider the linear Gaussian model with non-equal noise variances and the general functional model. For the linear Gaussian model, $f_i(V_j) = V_j$ and $\epsilon_i$ is sampled from a Gaussian distribution whose (non-equal) variance is uniformly sampled from $U(1, 2)$. For the general functional model, $f_i$ is randomly chosen from linear, square, sinc, and tanh functions, and $\epsilon_i$ follows either the uniform distribution $U(-0.5, 0.5)$ or the Gaussian distribution $N(0, 1)$. We compare our FedCDH method with other FCD baselines, namely NOTEARS-ADMM (for the linear case) (Ng & Zhang, 2022), NOTEARS-MLP-ADMM (for the non-linear case) (Ng & Zhang, 2022), FedDAG (Gao et al., 2022) and FedPC (Huang et al., 2022). We consider these baselines mainly because of their publicly available implementations. We evaluate both the undirected skeleton and the directed graph, denoted by Skeleton and Direction in the figures. We use the structural Hamming distance (SHD), F1 score, precision, and recall as evaluation criteria (a small sketch of how such metrics can be computed from adjacency matrices is given later in this section). We evaluate the number of variables $d \in \{6, 12, 18, 24, 30\}$ while fixing the other factors, such as $K = 10$ and $n_k = 100$. We set the number of clients $K \in \{2, 4, 8, 16, 32\}$ while fixing the others, such as $d = 6$ and $n_k = 100$. We let the sample size of one client $n_k \in \{25, 50, 100, 200, 400\}$ while fixing the other factors, such as $d = 6$ and $K = 10$. Following the settings of previous works such as Ng & Zhang (2022), we set the sample size of each client to be equal, although our method can handle both equal and unequal sample sizes per client. For each setting, we run 10 instances with different random seeds and report the means and standard deviations.

Figure 3: Results on the synthetic dataset with the linear Gaussian model. By rows, we evaluate a varying number of variables $d$, a varying number of clients $K$, and a varying number of samples $n_k$. By columns, we evaluate Skeleton F1 (higher is better), Skeleton SHD (lower is better), Direction F1 (higher is better) and Direction SHD (lower is better).

The results of F1 score and SHD are given in Figure 3 and Figure A3 for the two models, where our FedCDH method generally outperforms the other methods. Although we need large-sample properties in the proof of Theorem 4, in practice we only have finite samples; the experiment with varying sample sizes shows that the performance of our method improves as more samples become available. More analysis, including the implementation details, the results of precision and recall, the analysis of computational time, the hyperparameter study, the statistical significance test, and the evaluation on graph density, is provided in Appendix A7.

Real-world Datasets. We evaluate our method and the baselines on two real-world datasets, the fMRI Hippocampus (Poldrack et al., 2015) and HK Stock Market (Huang et al., 2020) datasets. (i) The fMRI Hippocampus dataset contains signals from $d = 6$ separate brain regions: perirhinal cortex (PRC), parahippocampal cortex (PHC), entorhinal cortex (ERC), subiculum (Sub), CA1, and CA3/Dentate Gyrus (DG), recorded in the resting state from the same person on 84 successive days. The records for each day can be regarded as one domain, and there are 518 samples for each domain. We select $n_k = 100$ samples for each day and select $K \in \{4, 8, 16, 32, 64\}$ days for evaluating a varying number of clients; we select $K = 10$ days and $n_k \in \{25, 50, 100, 200, 400\}$ samples for evaluating a varying number of samples. (ii) The HK Stock Market dataset contains $d = 10$ major stocks in the Hong Kong stock market, recording the daily closing prices from 10/09/2006 to 08/09/2010. Here one day can also be seen as one domain. We set the number of clients to $K \in \{2, 4, 6, 8, 10\}$ while randomly selecting $n_k = 100$ samples for each client.
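As the side note promised above, here is a hedged sketch (our illustration; the paper's exact evaluation code may differ, for example in how reversed edges are counted) of computing SHD and F1 for the directed graph and its skeleton from binary adjacency matrices.

```python
import numpy as np

def shd_f1(est, true):
    """est, true: binary (d, d) adjacency matrices (est[i, j] = 1 for i -> j).
    This simple SHD counts a reversed edge as two errors (one extra, one missing)."""
    shd = int(np.abs(est - true).sum())
    tp = int((est * true).sum())
    precision = tp / max(int(est.sum()), 1)
    recall = tp / max(int(true.sum()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return shd, f1

def skeleton(adj):
    """Undirected skeleton: symmetrize the directed adjacency matrix.
    (On symmetric matrices each differing edge contributes 2 to the SHD count.)"""
    return ((adj + adj.T) > 0).astype(int)

# Direction metrics on the DAGs, Skeleton metrics on their symmetrized versions:
# dir_shd, dir_f1 = shd_f1(est_dag, true_dag)
# skel_shd, skel_f1 = shd_f1(skeleton(est_dag), skeleton(true_dag))
```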
For the real-world experiments, all other settings follow the previous ones by default. More dataset information, implementation details, results, and analysis are provided in Appendix A8.

5 DISCUSSION AND CONCLUSION

Discussion. (i) Strengths: First of all, by formulating our summary statistics, the communication between the server and clients is restricted to a single round, thereby substantially reducing the number of communication rounds. This is a marked improvement over other baseline methods that necessitate iterative communications. Additionally, the utilization of a surrogate variable enhances our capability to handle heterogeneous data. Furthermore, leveraging the non-parametric characteristics of our proposed FCIT and FICP, our FedCDH method can adeptly manage arbitrary functional causal models. (ii) Limitations: Firstly, the efficiency of our summary statistics in reducing communication costs may not be considerable when the sample size $n$ is small or the hidden dimension $h$ is large. Secondly, our method is designed specifically for horizontally-partitioned federated data, hence it cannot be directly applied to vertically-partitioned federated data.

Conclusion. This paper has put forth a novel constraint-based federated causal discovery method called FedCDH, demonstrating broad applicability across arbitrary functional causal models and heterogeneous data. We construct summary statistics as a stand-in for raw data, ensuring the protection of data privacy. We further propose FCIT and FICP for skeleton discovery and direction determination. The extensive experiments, conducted on both synthetic and real-world datasets, underscore the superior performance of our method over other baseline methods. For future research, we will enhance our method to address more complex scenarios, such as vertically-partitioned data.

ACKNOWLEDGEMENT

This material is based upon work supported by the AI Research Institutes Program funded by the National Science Foundation under AI Institute for Societal Decision Making (AI-SDM), Award No. 2229881. The project is also partially supported by the National Institutes of Health (NIH) under Contract R01HL159805, and grants from Apple Inc., KDDI Research Inc., Quris AI, and Infinite Brain Technology.

REFERENCES

Amin Abyaneh, Nino Scherrer, Patrick Schwab, Stefan Bauer, Bernhard Schölkopf, and Arash Mehrjou. Fed-CD: Federated causal discovery from interventional and observational data. arXiv preprint arXiv:2211.03846, 2022.

Abbas Acar, Hidayet Aksu, A. Selcuk Uluagac, and Mauro Conti. A survey on homomorphic encryption schemes: Theory and implementation. ACM Computing Surveys (CSUR), 51(4):1-35, 2018.

Muhammad Arshad and Marius C. Silaghi. Distributed simulated annealing. Distributed Constraint Problem Solving and Reasoning in Multi-Agent Systems, 112, 2004.

Marcus Bendtsen. Regime aware learning. In Conference on Probabilistic Graphical Models, pp. 1-12. PMLR, 2016.

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.

Vince D. Calhoun, Robyn Miller, Godfrey Pearlson, and Tulay Adalı. The chronnectome: time-varying connectivity networks as the next frontier in fMRI data discovery. Neuron, 84(2):262-274, 2014.

Chandler Squires. causaldag: creation, manipulation, and learning of causal models. https://github.com/uhlerlab/causaldag, 2018.
Rong Chen, Krishnamoorthy Sivakumar, and Hillol Kargupta. Learning Bayesian network structure from distributed data. In Proceedings of the 2003 SIAM International Conference on Data Mining, pp. 284-288. SIAM, 2003.

David Maxwell Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507-554, 2002.

Ronald Cramer, Ivan Bjerre Damgård, et al. Secure multiparty computation. Cambridge University Press, 2015.

J. J. Daudin. Partial association measures and an application to qualitative regression. Biometrika, 67(3):581-590, 1980.

Dorit Dor and Michael Tarsi. A simple algorithm to construct a consistent extension of a partially oriented graph. Technical Report R-185, Cognitive Systems Laboratory, UCLA, 1992.

Paul Erdős, Alfréd Rényi, et al. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17-60, 1960.

Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Schölkopf. Kernel measures of conditional dependence. Advances in Neural Information Processing Systems, 20, 2007.

Erdun Gao, Junjia Chen, Li Shen, Tongliang Liu, Mingming Gong, and Howard Bondell. FedDAG: Federated DAG structure learning. Transactions on Machine Learning Research, 2022.

Slawomir Goryczka and Li Xiong. A comprehensive comparison of multiparty secure additions with differential privacy. IEEE Transactions on Dependable and Secure Computing, 14(5):463-477, 2015.

Kui Xiang Gou, Gong Xiu Jun, and Zheng Zhao. Learning Bayesian network structure from distributed homogeneous data. In Eighth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD 2007), volume 3, pp. 250-254. IEEE, 2007.

Arthur Gretton, Kenji Fukumizu, Choon Teo, Le Song, Bernhard Schölkopf, and Alex Smola. A kernel statistical test of independence. Advances in Neural Information Processing Systems, 20, 2007.

Patrik Hoyer, Dominik Janzing, Joris M. Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. Advances in Neural Information Processing Systems, 21, 2008.

Biwei Huang, Kun Zhang, and Bernhard Schölkopf. Identification of time-dependent causal model: A Gaussian process treatment. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

Biwei Huang, Kun Zhang, Jiji Zhang, Joseph Ramsey, Ruben Sanchez-Romero, Clark Glymour, and Bernhard Schölkopf. Causal discovery from heterogeneous/nonstationary data. The Journal of Machine Learning Research, 21(1):3482-3534, 2020.

Jianli Huang, Kui Yu, Xianjie Guo, Fuyuan Cao, and Jiye Liang. Towards privacy-aware causal structure learning in federated setting. arXiv preprint arXiv:2211.06919, 2022.

Brian Kidd, Kunbo Wang, Yanxun Xu, and Yang Ni. Federated learning for sparse Bayesian models with applications to electronic health records and genomics. In Pacific Symposium on Biocomputing 2023: Kohala Coast, Hawaii, USA, 3-7 January 2023, pp. 484-495. World Scientific, 2022.

Phillip Lippe, Taco Cohen, and Efstratios Gavves. Efficient neural causal discovery without acyclicity constraints. International Conference on Learning Representations, 2022.

Lars Lorch, Scott Sussex, Jonas Rothfuss, Andreas Krause, and Bernhard Schölkopf. Amortized inference for causal structure learning. Advances in Neural Information Processing Systems, 35:13104-13118, 2022.

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas.
Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273-1282. PMLR, 2017.

Osman Mian, David Kaltenpoth, Michael Kamp, and Jilles Vreeken. Nothing but regrets: privacy-preserving federated causal discovery. In International Conference on Artificial Intelligence and Statistics, pp. 8263-8278. PMLR, 2023.

Yongchan Na and Jihoon Yang. Distributed Bayesian network structure learning. In 2010 IEEE International Symposium on Industrial Electronics, pp. 1607-1611. IEEE, 2010.

Ignavier Ng and Kun Zhang. Towards federated Bayesian network structure learning with continuous optimization. In International Conference on Artificial Intelligence and Statistics, pp. 8095-8111. PMLR, 2022.

Ignavier Ng, Sébastien Lachapelle, Nan Rosemary Ke, Simon Lacoste-Julien, and Kun Zhang. On the convergence of continuous constrained optimization for structure learning. In International Conference on Artificial Intelligence and Statistics, 2022.

Ignavier Ng, Biwei Huang, and Kun Zhang. Structure learning with continuous optimization: A sober look and beyond. arXiv preprint arXiv:2304.02146, 2023.

Ana Rita Nogueira, João Gama, and Carlos Abreu Ferreira. Causal discovery in machine learning: Theories and applications. Journal of Dynamics & Games, 8(3):203, 2021.

Jonas Peters and Peter Bühlmann. Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101(1):219-228, 2014.

Russell A. Poldrack, Timothy O. Laumann, Oluwasanmi Koyejo, Brenda Gregory, Ashleigh Hover, Mei-Yen Chen, Krzysztof J. Gorgolewski, Jeffrey Luci, Sung Jun Joo, Ryan L. Boyd, et al. Long-term neural and physiological phenotyping of a single human. Nature Communications, 6(1):8885, 2015.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, 20, 2007.

Alexander Reisach, Christof Seiler, and Sebastian Weichwald. Beware of the simulated DAG! Causal discovery benchmarks may be easy to game. In Advances in Neural Information Processing Systems, 2021.

Basil Saeed, Snigdha Panigrahi, and Caroline Uhler. Causal structure discovery from distributions arising from mixtures of DAGs. In International Conference on Machine Learning, pp. 8336-8345. PMLR, 2020.

Saeed Samet and Ali Miri. Privacy-preserving Bayesian network for horizontally partitioned data. In 2009 International Conference on Computational Science and Engineering, volume 3, pp. 9-16. IEEE, 2009.

Xinpeng Shen, Sisi Ma, Prashanthi Vemuri, and Gyorgy Simon. Challenges and opportunities with causal discovery algorithms: application to Alzheimer's pathophysiology. Scientific Reports, 10(1):1-12, 2020.

Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvärinen, Antti Kerminen, and Michael Jordan. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(10), 2006.

Peter Spirtes. An anytime algorithm for causal inference. In International Workshop on Artificial Intelligence and Statistics, pp. 278-285. PMLR, 2001.

Peter Spirtes and Kun Zhang. Causal discovery and inference: concepts and recent methodological advances. In Applied Informatics, volume 3, pp. 1-28. Springer Open, 2016.

Peter Spirtes, Clark N. Glymour, Richard Scheines, and David Heckerman. Causation, Prediction, and Search. MIT Press, 2000.

Eric V. Strobl, Kun Zhang, and Shyam Visweswaran. Approximate kernel-based conditional independence tests for fast non-parametric causal discovery.
Journal of Causal Inference, 7(1), 2019.

Danica J. Sutherland and Jeff Schneider. On the error of random Fourier features. arXiv preprint arXiv:1506.02785, 2015.

Ruibo Tu, Kun Zhang, Bo Bertilson, Hedvig Kjellstrom, and Cheng Zhang. Neuropathic pain diagnosis simulator for causal discovery algorithm evaluation. Advances in Neural Information Processing Systems, 32, 2019.

Matej Vuković and Stefan Thalmann. Causal discovery in manufacturing: A structured literature review. Journal of Manufacturing and Materials Processing, 6(1):10, 2022.

Zhaoyu Wang, Pingchuan Ma, and Shuai Wang. Towards practical federated causal structure learning. arXiv preprint arXiv:2306.09433, 2023.

Dennis Wei, Tian Gao, and Yue Yu. DAGs with no fears: A closer look at continuous optimization for learning Bayesian networks. In Advances in Neural Information Processing Systems, 2020.

Robert F. Woolson. Wilcoxon signed-rank test. Wiley Encyclopedia of Clinical Trials, pp. 1-3, 2007.

Eric P. Xing, Wenjie Fu, and Le Song. A state-space mixed membership blockmodel for dynamic network tomography. The Annals of Applied Statistics, 4(2):535-566, 2010.

Qiang Yang, Yang Liu, Yong Cheng, Yan Kang, Tianjian Chen, and Han Yu. Federated learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 13(3):1-207, 2019.

Qiaoling Ye, Arash A. Amini, and Qing Zhou. Distributed learning of generalized linear causal networks. arXiv preprint arXiv:2201.09194, 2022.

Yue Yu, Jie Chen, Tian Gao, and Mo Yu. DAG-GNN: DAG structure learning with graph neural networks. In International Conference on Machine Learning, pp. 7154-7163. PMLR, 2019.

Yue Yu, Tian Gao, Naiyu Yin, and Qiang Ji. DAGs with no curl: An efficient DAG structure learning approach. In International Conference on Machine Learning, pp. 12156-12166. PMLR, 2021.

Kun Zhang and Lai-Wan Chan. Extensions of ICA for causality discovery in the Hong Kong stock market. In International Conference on Neural Information Processing, pp. 400-409. Springer, 2006.

Kun Zhang and Aapo Hyvärinen. On the identifiability of the post-nonlinear causal model. arXiv preprint arXiv:1205.2599, 2012.

Kun Zhang, Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Kernel-based conditional independence test and application in causal discovery. arXiv preprint arXiv:1202.3775, 2012.

Kun Zhang, Biwei Huang, Jiji Zhang, Bernhard Schölkopf, and Clark Glymour. Discovery and visualization of nonstationary causal models. arXiv preprint arXiv:1509.08056, 2015.

Xun Zheng, Bryon Aragam, Pradeep K. Ravikumar, and Eric P. Xing. DAGs with no tears: Continuous optimization for structure learning. Advances in Neural Information Processing Systems, 31, 2018.

Fangting Zhou, Kejun He, and Yang Ni. Causal discovery with heterogeneous observational data. In Uncertainty in Artificial Intelligence, pp. 2383-2393. PMLR, 2022.

Appendix for Federated Causal Discovery from Heterogeneous Data

Appendix organization:
A1 Summary of Symbols
A2 Related Works
A3 Details about the Characterization
A3.1 Characterization of Conditional Independence
A3.2 Characterization of Independent Change
A4 Proofs
A4.1 Proof of Lemma 3
A4.2 Proof of Theorem 4
A4.3 Proof of Theorem 5
A4.4 Proof of Theorem 6
A4.5 Proof of Lemma 7
A4.6 Proof of Theorem 8
A5 Details about Federated Unconditional Independence Test
A5.1 Null Hypothesis
A5.2 Null Distribution Approximation
A6 Details about Skeleton Discovery and Direction Determination
A6.1 Skeleton Discovery
A6.2 Direction Determination
A7 Details about the Experiments on Synthetic Datasets
A7.1 Implementation Details
A7.2 Analysis of F1 and SHD
A7.3 Results of Precision and Recall
A7.4 Results of Computational Time
A7.5 Hyperparameter Study
A7.6 Statistical Significance Test
A7.7 Evaluation on Dense Graph
A7.8 Evaluation on the Power of Conditional Independence Test
A7.9 Evaluation on the Order of Domain Indices
A8 Details about the Experiments on Real-world Dataset
A8.1 Details about fMRI Hippocampus Dataset
A8.2 Details about HK Stock Market Dataset
A1 SUMMARY OF SYMBOLS

In order to improve the readability of our paper, we summarize the most important symbols and their meanings throughout the paper, as shown in Table A1.

Table A1: Summary of symbols

Symbol | Meaning
$d$ | the number of observed variables
$n$ | the total sample size of all clients
$K$ | the number of clients
$V$ | the data matrix, $V \in \mathbb{R}^{n \times d}$
$k$ | the client index, $k \in \{1, \ldots, K\}$
$n_k$ | the sample size of the $k$-th client
domain index (surrogate variable) | takes values in $\{1, \ldots, K\}$
$V_i$ | the $i$-th variable, $i \in \{1, \ldots, d\}$
$f_i(\cdot)$ | the causal function of variable $V_i$
$\mathrm{PA}_i$ | the parents of $V_i$
$\epsilon_i$ | the noise term of $V_i$
$\psi$ | a set of pseudo confounders
$\theta_i$ | the effective parameter of $V_i$
$L$ | the number of pseudo confounders
$G, G_{obs}$ | the causal graph with $d$ variables
$G_{aug}$ | the augmented graph with $d+1$ variables
$X, Y, Z$ | sets of random variables
$k_{\mathcal{X}}, k_{\mathcal{Y}}, k_{\mathcal{Z}}$ | the positive definite kernels
$\mathcal{X}, \mathcal{Y}, \mathcal{Z}$ | the domains of the variables
$\mathcal{H}_{\mathcal{X}}, \mathcal{H}_{\mathcal{Y}}, \mathcal{H}_{\mathcal{Z}}$ | the reproducing kernel Hilbert spaces
$\ddot{X}$ | $\ddot{X} \triangleq (X, Z)$
$\hat{\Delta}$ | the normalized HSIC
$\Sigma$ | the cross-covariance operator (infinite-dimensional)
$C$ | the cross-covariance matrix (finite-dimensional)
$h$ | the number of hidden features/mapping functions
$f(\cdot), g(\cdot), q(\cdot)$ | the mapping functions
$\mathcal{F}$ | the function spaces
$u, v$ | the regression coefficients
$A$ | $A = f(\ddot{X}) \in \mathbb{R}^{n \times h}$
$B$ | $B = g(Y) \in \mathbb{R}^{n \times h}$
$\gamma$ | the ridge parameter
$I$ | the identity matrix
$\tilde{K}$ | the centralized kernel matrix
$H$ | the centralization matrix, $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$
$T_{CI}$ | the test statistic for conditional independence
$R$ | the residual of ridge regression
$\alpha$ | the standard Gaussian variable
$\alpha^2$ | the $\chi^2_1$ variable
$\lambda$ | the nonzero eigenvalues
$\hat{T}_{CI}$ | the asymptotic statistic
$\hat{k}, \hat{\theta}$ | the parameters of the Gamma distribution $\Gamma(\hat{k}, \hat{\theta})$
$C$ | the specially-designed covariance matrix
$\mathrm{tr}(\cdot)$ | the trace operator
$d'$ | $d' = d + 1$ (plus one surrogate variable)
$CT$ | the global covariance tensor, $CT \in \mathbb{R}^{d' \times d' \times h \times h}$
$CT_k$ | the local covariance tensor, $CT_k \in \mathbb{R}^{d' \times d' \times h \times h}$
$w$ | the coefficients for random features
$b$ | the intercepts for random features

A2 RELATED WORKS

Causal Discovery. In general, there are mainly three categories of methods for causal discovery (CD) from observed data (Spirtes & Zhang, 2016): constraint-based methods, score-based methods and function-based methods. Constraint-based methods utilize the conditional independence test (CIT) to learn a skeleton of the directed acyclic graph (DAG), and then orient the edges upon the skeleton. Such methods include the Peter-Clark (PC) algorithm (Spirtes & Zhang, 2016) and the Fast Causal Inference (FCI) algorithm (Spirtes, 2001). Some typical CIT methods include the kernel-based conditional independence test (Zhang et al., 2012) and the approximate kernel-based conditional independence test (Strobl et al., 2019). Score-based methods use a score function and a greedy search method to learn the DAG with the highest score by searching over all possible DAGs given the data, such as Greedy Equivalence Search (GES) (Chickering, 2002). Within the score-based category, there is a continuous-optimization-based subcategory attracting increasing attention. NOTEARS (Zheng et al., 2018) first reformulated the DAG learning process as a continuous optimization problem and solved it using a gradient-based method. NOTEARS is designed under the assumption of linear relations between variables. Subsequent works have extended NOTEARS to handle nonlinear cases via deep neural networks, such as DAG-GNN (Yu et al., 2019) and DAG-NoCurl (Yu et al., 2021). ENCO (Lippe et al., 2022) presents an efficient DAG discovery method for directed acyclic causal graphs utilizing both observational and interventional data. AVCI (Lorch et al., 2022) infers causal structure by performing amortized variational inference over an arbitrary data-generating distribution. These continuous-optimization-based methods might suffer from various technical issues, including convergence (Wei et al., 2020; Ng et al., 2022), nonconvexity (Ng et al., 2023), and sensitivity to data standardization (Reisach et al., 2021). Function-based methods rely on the causal asymmetry property, including the linear non-Gaussian acyclic model (LiNGAM) (Shimizu et al., 2006), the additive noise model (Hoyer et al., 2008), and the post-nonlinear causal model (Zhang & Hyvärinen, 2012).

Causal Discovery from Heterogeneous Data. Most of the causal discovery methods mentioned above assume that the data is independently and identically distributed (i.i.d.). However, in practical scenarios, distribution shift may occur across datasets, changing across different domains or over time, as featured by heterogeneous or non-stationary data (Huang et al., 2020). To tackle the issue of changing causal models, one may try to fit causal models on sliding windows for non-stationary data (Calhoun et al., 2014) and then compare them. Improved versions include the regime-aware learning algorithm, which learns a sequence of Bayesian networks that model a system with regime changes (Bendtsen, 2016).
Such methods may suffer from high estimation variance due to sample scarcity, large type II errors, and a large number of statistical tests. Some methods aim to estimate the time-varying causal model by making use of certain types of smoothness of the change (Huang et al., 2015), but they do not explicitly locate the changing causal modules. Several methods aim to model time-varying, time-delayed causal relations (Xing et al., 2010), which can be reduced to online parameter learning because the direction of the causal relations is given (i.e., the past influences the future). Moreover, most of these methods assume linear causal models, limiting their applicability to complex problems with nonlinear causal relations. In particular, a nonparametric constraint-based method for causal discovery from non-stationary or heterogeneous data, called CD-NOD (Huang et al., 2020), was recently proposed, in which a surrogate variable was introduced and the changes were written as smooth functions of the time or domain index. The first model-based method for heterogeneous data in the presence of cyclic causality and confounders, named CHOD (Zhou et al., 2022), was also proposed. Saeed et al. (2020) provided a graphical representation, via the mixture DAG, of distributions that arise as mixtures of causal DAGs.

Federated Causal Discovery. A two-step procedure was adopted by Gou et al. (2007) to learn a DAG from horizontally partitioned data, which first estimated the structures independently using each client's local dataset, and then applied further conditional independence tests. Instead of using statistical tests in the second step, a voting scheme was used to pick those edges identified by more than half of the clients (Na & Yang, 2010). These methods leverage only the final graphs independently estimated from each local dataset, which may lead to suboptimal performance as the information exchange is rather limited. Furthermore, Samet & Miri (2009) developed a privacy-preserving method based on secure multiparty computation, but it was limited to the discrete case. For vertically partitioned data, Yang et al. (2019) constructed an approximation to the score function in the discrete case and adopted secure multiparty computation. Chen et al. (2003) developed a four-step procedure that involves transmitting a subset of samples from each client to a central site, which may lead to privacy concerns. NOTEARS-ADMM (Ng & Zhang, 2022) and FedDAG (Gao et al., 2022) were proposed for federated causal discovery (FCD) based on continuous optimization methods. FedPC (Huang et al., 2022) was developed as a federated version of the classical PC algorithm; however, it was designed for homogeneous data, which may lead to poor performance on heterogeneous data. DARLIS (Ye et al., 2022) utilizes the distributed annealing (Arshad & Silaghi, 2004) strategy to search for the optimal graph, while PERI (Mian et al., 2023) aggregates the results of the local greedy equivalence search (GES) (Chickering, 2002) and chooses the worst-case regret for each iteration. Fed-CD (Abyaneh et al., 2022) was proposed for both observational and interventional data based on continuous optimization. FEDC2SL (Wang et al., 2023) extended the χ² test to the federated setting; however, this method is restricted to discrete variables and is therefore not applicable to continuous variables. Notice that most of these above-mentioned methods heavily rely on either identifiable functional causal models or homogeneous data distributions.
These assumptions may be overly restrictive and difficult to satisfy in real-world scenarios, limiting their applicability.
A3 DETAILS ABOUT THE CHARACTERIZATION
A3.1 CHARACTERIZATION OF CONDITIONAL INDEPENDENCE
In this section, we provide more details about the interpretation of $\Sigma_{\ddot{X}Y|Z}$ as formulated in Eq. 13 and the definition of a characteristic kernel as shown in Lemma 9, which are helpful for understanding Lemma 1 in the main paper. We then provide the uncorrelatedness-based characterization of CI in Lemma 10.
First of all, for the random vector $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$, the cross-covariance operator from $\mathcal{H}_Y$ to $\mathcal{H}_X$ is defined by the relation
$$\langle f, \Sigma_{XY} g \rangle_{\mathcal{H}_X} = \mathbb{E}_{XY}[f(X)g(Y)] - \mathbb{E}_X[f(X)]\,\mathbb{E}_Y[g(Y)], \qquad (11)$$
for all $f \in \mathcal{H}_X$ and $g \in \mathcal{H}_Y$. Furthermore, we define the partial cross-covariance operator as
$$\Sigma_{XY|Z} = \Sigma_{XY} - \Sigma_{XZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY}. \qquad (12)$$
If $\Sigma_{ZZ}$ is not invertible, use the right inverse instead of the inverse. We can intuitively interpret the operator $\Sigma_{XY|Z}$ as the partial cross-covariance between $\{f(X), f \in \mathcal{H}_X\}$ and $\{g(Y), g \in \mathcal{H}_Y\}$ given $\{q(Z), q \in \mathcal{H}_Z\}$.
Lemma 9 (Characteristic Kernel (Fukumizu et al., 2007)). A kernel $K_X$ is characteristic if the condition $\mathbb{E}_{X \sim P_X}[f(X)] = \mathbb{E}_{X \sim Q_X}[f(X)]$ ($\forall f \in \mathcal{H}_X$) implies $P_X = Q_X$, where $P_X$ and $Q_X$ are two probability distributions of $X$. The Gaussian kernel and the Laplacian kernel are characteristic kernels.
As shown in Lemma 1, if we use a characteristic kernel and define $\ddot{X} \triangleq (X, Z)$, the characterization of CI can be related to the partial cross-covariance as
$$\Sigma_{\ddot{X}Y|Z} = 0 \;\Longleftrightarrow\; X \perp\!\!\!\perp Y \mid Z, \quad \text{where} \quad \Sigma_{\ddot{X}Y|Z} = \Sigma_{\ddot{X}Y} - \Sigma_{\ddot{X}Z}\Sigma_{ZZ}^{-1}\Sigma_{ZY}. \qquad (13)$$
Similarly, we can intuitively interpret the operator $\Sigma_{\ddot{X}Y|Z}$ as the partial cross-covariance between $\{f(\ddot{X}), f \in \mathcal{H}_{\ddot{X}}\}$ and $\{g(Y), g \in \mathcal{H}_Y\}$ given $\{q(Z), q \in \mathcal{H}_Z\}$.
Based on Lemma 1, we further consider a different characterization of CI which enforces the uncorrelatedness of functions in suitable spaces, which may be intuitively more appealing. Denote the probability distribution of $X$ as $P_X$ and the joint distribution of $(X, Y)$ as $P_{XY}$. Let $L^2_X$ be the space of square-integrable functions of $X$ and $L^2_{XY}$ be that of $(X, Y)$; specifically, $L^2_X = \{f(X) \mid \mathbb{E}(f^2) < \infty\}$, and likewise for $L^2_{XY}$. In particular, consider the following constrained $L^2$ spaces:
$$\mathcal{S}_{\ddot{X}} \triangleq \{\tilde{f} \in L^2_{\ddot{X}} \mid \mathbb{E}(\tilde{f}\,|\,Z) = 0\}, \quad \mathcal{S}_{Y} \triangleq \{\tilde{g} \in L^2_{Y} \mid \mathbb{E}(\tilde{g}\,|\,Z) = 0\}, \quad \mathcal{S}'_{Y|Z} \triangleq \{g' \mid g' = g(Y) - \mathbb{E}(g\,|\,Z),\ g \in L^2_Y\}.$$
They can be constructed from the corresponding $L^2$ spaces via nonlinear regression. For example, for any function $f \in L^2_{\ddot{X}}$, the corresponding function $\tilde{f}$ is given by
$$\tilde{f}(\ddot{X}) = f(\ddot{X}) - \mathbb{E}(f\,|\,Z) = f(\ddot{X}) - \beta^*_f(Z), \qquad (15)$$
where $\beta^*_f(Z) \in L^2_Z$ is the regression function of $f(\ddot{X})$ on $Z$. We can then relate the characterization of CI from Lemma 1 to uncorrelatedness in the following lemma.
Lemma 10 (Characterization of CI based on Partial Association (Daudin, 1980)). Each of the following conditions is equivalent to $X \perp\!\!\!\perp Y \mid Z$:
(i) $\mathbb{E}(\tilde{f}\tilde{g}) = 0$, $\forall\, \tilde{f} \in \mathcal{S}_{\ddot{X}}$ and $\tilde{g} \in \mathcal{S}_{Y}$;
(ii) $\mathbb{E}(\tilde{f}g') = 0$, $\forall\, \tilde{f} \in \mathcal{S}_{\ddot{X}}$ and $g' \in \mathcal{S}'_{Y|Z}$;
(iii) $\mathbb{E}(\tilde{f}g) = 0$, $\forall\, \tilde{f} \in \mathcal{S}_{\ddot{X}}$ and $g \in L^2_{Y}$;
(iv) $\mathbb{E}(\tilde{f}g') = 0$, $\forall\, \tilde{f} \in \mathcal{S}_{\ddot{X}}$ and $g' = g(Y) - \mathbb{E}(g\,|\,Z)$ with $g \in L^2_{Y}$.
When $(X, Y, Z)$ are jointly Gaussian, independence is equivalent to uncorrelatedness; in other words, $X \perp\!\!\!\perp Y \mid Z$ is equivalent to the vanishing of the partial correlation coefficient $\rho_{XY|Z}$. We can regard Lemma 10 as a generalization of the partial-correlation-based characterization of CI. For example, condition (i) means that any residual function of $(X, Z)$ given $Z$ is uncorrelated with that of $(Y, Z)$ given $Z$.
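As a concrete finite-sample illustration of Eq. 12 and Eq. 13 (a minimal sketch under simplifying assumptions, not the paper's implementation), one can map each variable to a small nonlinear feature dictionary, estimate the feature covariances, and form the empirical partial cross-covariance; its norm should be small when $X \perp\!\!\!\perp Y \mid Z$ holds and noticeably larger otherwise. The feature dictionary and the ridge term `gamma` (used in place of the right inverse) are illustrative choices.

```python
import numpy as np

def feats(v):
    """A small illustrative feature dictionary standing in for functions in H (not a characteristic kernel)."""
    v = (v - v.mean()) / v.std()
    f = np.column_stack([v, v**2, np.sin(v), np.cos(v), np.tanh(v)])
    return f - f.mean(axis=0)                       # centred features

def partial_cross_cov(x, y, z, gamma=1e-3):
    """Empirical analogue of Eq. 13: C_{XddY} - C_{XddZ} (C_ZZ + gamma I)^{-1} C_ZY, with Xdd = (X, Z)."""
    n = len(x)
    Xdd = np.column_stack([feats(x), feats(z)])     # features of (X, Z)
    Y, Z = feats(y), feats(z)
    C = lambda a, b: a.T @ b / n                    # empirical covariance of centred feature blocks
    return C(Xdd, Y) - C(Xdd, Z) @ np.linalg.solve(C(Z, Z) + gamma * np.eye(Z.shape[1]), C(Z, Y))

rng = np.random.default_rng(0)
z = rng.normal(size=5000)
x = np.tanh(z) + 0.3 * rng.normal(size=5000)        # X <- Z
y = z**2 + 0.3 * rng.normal(size=5000)              # Y <- Z, so X is independent of Y given Z
print(np.linalg.norm(partial_cross_cov(x, y, z)))        # small under CI (up to estimation error)
print(np.linalg.norm(partial_cross_cov(x, x + y, z)))    # clearly larger when CI is violated
```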
Here we can observe the similarity between Lemma 1 and Lemma 10, except the only difference that Lemma 10 considers all functions in L2 spaces, while Lemma 1 exploits the spaces corresponding to some characteristic kernels. If we restrict the function f and g in condition (ii) to the spaces H X and HY, respectively, Lemma 10 is then reduced to Lemma 1. Based on the two lemmas mentioned above plus the Lemma 1, we could further derive Lemma 3 in our main paper. A3.2 CHARACTERIZATION OF INDEPENDENT CHANGE In Lemma 2 of the main paper, we provide the independent change principle (ICP) to evaluate the dependence between two changing causal models. Here, we give more details about the definition and the assigned value of normalized HSIC. A smaller value means being more independent. Definition 1 (Normalized HSIC (Fukumizu et al., 2007)). Given variables U and V , HSIC provides a measure for testing their statistical independence. An estimator of normalized HSIC is given as HSICN UV = tr( MU MV ) tr( MU) tr( MV ) , (17) where MU and MV are the centralized Gram matrices, MU HMUH, MV HMV H, H = I 1 n11T , I is n n identity matrix and 1 is vector of n ones. How to construct MU and MV will be explained in the corresponding cases below. To check whether two causal modules change independently across different domains, the dependence between P(X) and P(Y |X) and the dependence between P(Y ) and P(X|Y ) on the given data can be given by X Y = tr( MX MY |X) tr( MX) tr( MY |X) , Y X = tr( MY MX|Y ) tr( MY ) tr( MX|Y ) . (18) According to CD-NOD (Huang et al., 2020), instead of working with conditional distribution P(X|Y ) and P(Y |X), we could use the joint distribution P(X, Y ), which is simpler, for estimation. Here we use Y instead of Y to emphasize that in this constructed distribution X and Y are not symmetric. Then, the dependence values listed in Eq. 18 could be estimated by ˆ X Y = tr( MX MY X) tr( MX) tr( MY X) , ˆ Y X = tr( MY MXY ) tr( MY ) tr( MXY ) , (19) where MX HMXH, MX ˆµX| ˆµT X| . Similarly, we define MY , MY and ˆµY | . According to (Huang et al., 2020), we have ˆµX| ϕ( )(C + γI) 1C X, (20) where ˆµX| ϕ( )(C + γI) 1C X, ˆµX| , ϕ( ) Rn h, γ is a small ridge parameter, ϕ represents the feature map, and is the surrogate variable indicating different domains or clients. Similarly, we define MY , MY and ˆµY | . ˆµY | ϕ( )(C + γI) 1C Y . (21) Moreover, MY X HMY XH, MY X ˆµY X| ˆµT Y X| . Similarly, we define MXY , MXY and ˆµXY . ˆµY X| ϕ( )(C + γI) 1C ,(Y,X) ˆµXY | ϕ( )(C + γI) 1C ,(X,Y ), (22) Eq. 19 as formulated above is helpful to further derive Theorem 5 in our main paper. Here, we provide the proofs of the theorems and lemmas, including Lemma 3, Theorem 4, Theorem 5, Theorem 6, Lemma 7, and Theorem 8 in our main paper. Published as a conference paper at ICLR 2024 A4.1 PROOF OF LEMMA 3 Proof: We define the covariance matrix in the null hypothesis as C XY |Z = 1 n Pn i=1[( Ai E( A|Z))T (Bi E(B|Z))] which corresponds to the partial cross-covariance matrix with n samples, C XY |Z Rh h, A=f( X) Rn h, B=g(Y ) Rn h, {f j( X)|h j=1} F X, {gj(Y )|h j=1} FY . Notice that F X and FY are function spaces. n and h denote the number of total samples of all clients and the number of hidden features or mapping functions, respectively. Notice that E( A|Z) and E(B|Z) could be non-linear functions of Z which may be difficult to estimate. therefore, we would like to approximate them with linear functions. Let q(Z) Rn h, {qj(Z)|h j=1} FZ. 
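Before the formal statement below, the following is a minimal sketch (an illustration, not the released code) of this linear approximation: each feature column $f_j(\ddot{X})$ is ridge-regressed on the feature matrix $q(Z)$, and the residual plays the role of $f_j - \mathbb{E}(f_j|Z)$. The feature construction and the regularization strength are illustrative choices.

```python
import numpy as np

def ridge_residuals(A, Q, gamma=1e-2):
    """Residualize each column of A (features of Xdd) on Q (features q(Z)) via ridge regression.

    A, Q: (n, h) feature matrices with centred columns.
    Returns A - Q U, where U = (Q^T Q + gamma I)^{-1} Q^T A stacks the coefficients u_j.
    """
    h = Q.shape[1]
    U = np.linalg.solve(Q.T @ Q + gamma * np.eye(h), Q.T @ A)
    return A - Q @ U

# toy example: nonlinear dependence of the features on Z, removed by the fit that is linear in q(Z)
rng = np.random.default_rng(0)
n, h = 2000, 8
z = rng.normal(size=n)
Q = np.cos(np.outer(z, rng.normal(size=h)) + rng.uniform(0, 2 * np.pi, h))   # q(Z), e.g. random features
Q -= Q.mean(axis=0)
A = np.column_stack([np.sin(z), z**2, np.tanh(z)]) + 0.1 * rng.normal(size=(n, 3))
A -= A.mean(axis=0)
R = ridge_residuals(A, Q)
print(np.max(np.abs(Q.T @ R)) / n)    # residuals are nearly orthogonal to the span of q(Z)
```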
We could estimate E(f j|Z) with the ridge regression output u T j q(Z) under the mild conditions given below. Lemma 11. (Sutherland & Schneider, 2015) Consider performing ridge regression of f j on Z. Assume that (i) Pn i=1 f j i = 0, f j is defined on the domain of X; (ii) the empirical kernel matrix of Z, denoted by KZ, only has finite entries (i.e., KZ < ); (iii) the range of Z is compact, Z Rd Z. Then we have P h |ˆE(f j|Z) u T j q(Z)| ϵ i c0 ϵ2 e hϵ2c1, (23) where ˆE(f j|Z) is the estimate of E(f j|Z) by ridge regression, c0 and c1 are both constants that do not depend on the sample size n or the number of hidden dimensions or mapping functions h. The exponential rate with respect to h in the above lemma suggests we can approximate the output of ridge regression with a small number of hidden features. Moreover, we could similarly estimate E(gj|Z) with v T j q(Z), because we could guarantee that P h |ˆE(gj|Z) v T j q(Z)| ϵ i 0 for any fixed ϵ > 0 at an exponential rate with respect to h. Similar to the L2 spaces in condition (ii) of Lemma 10, we can consider the following condition to approximate conditional independence: E( f g) = 0, f F X|Z and g FY |Z, where F X|Z = { f | f j = f j E(f j|Z), f j F X}, FY |Z = { g | gj = gj E(gj|Z), gj FY }. According to Eq. 23, we could estimate E(f j|Z) and E(gj|Z) by u T j q(Z) and v T j q(Z), respectively. Thus, we can reformulate the function spaces as F X|Z = { f | f j = f j u T j q(Z), f j F X}, FY |Z = { g | gj = gj v T j q(Z), gj FY }. (25) Proof ends. A4.2 PROOF OF THEOREM 4 '( )) *(+) ,- ,-! " $|" | Figure A1: Given that X Y |Z, we could introduce the independence between R X|Z and RY |Z. Proof: Assume that there are n i.i.d. samples for X, Y, Z. Let K X|Z be the centralized kernel matrix, given by K X|Z R X|Z RT X|Z=HR X|ZRT X|ZH, where R X|Z f( X)=f( X) u T q(Z) which can be seen as the residual after ridge regression. Similarly, We could define KY |Z RY |Z RT Y |Z=HRY |ZRT Y |ZH and RY |Z g(Y )=g(Y ) v T q(Z). Accordingly, we let K XY |Z R X|Z RT Y |Z=HR X|ZRT Y |ZH. We set the test statistic as TCI=n C XY |Z 2 F , where C XY |Z RT X|Z RY |Z= 1 n RT X|ZHHRY |Z. Published as a conference paper at ICLR 2024 Let λ X|Z and λY |Z be the eigenvalues of K X|Z and KY |Z, respectively. Furthermore, we define the EVD decomposition K X|Z = V X|ZΛ X|ZV T X|Z, where Λ X|Z is the diagonal matrix containing non-negative eigenvalues λ X|Z,i. Similarly, we define KY |Z = VY |ZΛY |ZV T Y |Z with eigenvalues λY |Z,i. Let ψ X|Z = [ψ X|Z,1, ψ X|Z,2, . . . , ψ X|Z,n] VY |ZΛ1/2 Y |Z and ϕY |Z = [ϕY |Z,1, ϕY |Z,2, . . . , ϕY |Z,n] VY |ZΛ1/2 Y |Z. On the other hand, consider eigenvalues λ X|Z,i and eigenfunctions u X|Z,i of the kernel k X|Z w.r.t. the probablity measure with the density P( x), i.e., λ X|Z,i and u X|Z,i satisfy R k X|Z( x, x ) u X|Z,i( x) P( x)d x = λ X|Z,i u X|Z,i( x ), where we assume that u X|Z,i have unit variance, i.e., E[u2 X|Z,i( X)] = 1. Similarly, we define k Y |Z, λ Y |Z,i, and u Y |Z,i. Let {α1, . . . , αn2} denote i.i.d. standard Gaussian variables, and thus {α2 1, . . . , α2 n2} denote i.i.d. χ2 1 variables. Lemma 12 (Kernel-based Conditional Independence Test (Zhang et al., 2012)). Under the null hypothesis that X and Y are conditional independent given Z, we have that the test statistic TCI 1 n tr( K X|Z KY |Z) have the same asymptotic distribution as ˆTCI 1 n Pn2 k=1 λk α2 k, where λk are eigenvalues of ww T , w = [w1, . . . , wn], with the vector wt obtained by stacking Mt = [ψ X|Z,1( Xt), ψ X|Z,2( Xt), . . . 
, ψ X|Z,n( Xt)]T [ϕY |Z,1(Yt), ϕY |Z,2(Yt), . . . , ϕY |Z,n(Yt)]. In the above lemma, their test statistic is equivalent to ours, due to the fact that 1 n tr( K X|Z KY |Z) = 1 n tr( R X|Z( RT X|Z RY |Z RT Y |Z)) n tr(( RT X|Z RY |Z RT Y |Z) R X|Z) n RT X|Z RY |Z 2 F n n C XY |Z 2 F = n C XY |Z 2 F . However, their asymptotic distribution is different from ours. Based on their asymptotic distribution, we could go further. The first two rows of Eq. 26 hold true because of the commutative property of trace, namely, tr(AB) = BA, refer to Lemma 6 for more details. According to the formulation of R X|Z and RY |Z, we have ( f( X) = u T q(Z) + R X|Z g(Y ) = v T q(Z) + RY |Z. (27) Based on the above formulations, we could easily draw the causal graph as shown in Fig. A1. In particular, considering that X and Y are conditionally independent given Z, we could further determine that R X|Z and RY |Z are independent, namely, we have X Y |Z R X|Z RY |Z. (28) As f( X) and g(Y ) are uncorrelated, then E(wt) = 0. Furthermore, the covariance is Σ = C ov(wt) = E(wtw T t ), where w is defined in the same way as in Lemma 12. If R X|Z RY |Z, for k = i or l = j, we denote the non-diagonal (ND) entries of Σ as e ND, where e ND = E[ q λ X|Z,iλ Y |Z,jλ X|Z,kλ Y |Z,lu X|Z,iu Y |Z,ju X|Z,ku Y |Z,l] λ X|Z,iλ Y |Z,jλ X|Z,kλ Y |Z,l E[u X|Z,iu X|Z,k]E[u Y |Z,ju Y |Z,l] We then denote the diagonal entries of Σ as e D, where e D = λ X|Z,iλ Y |Z,j E[u2 X|Z,i]E[u2 Y |Z,j] = λ X|Z,iλ Y |Z,j, (30) Published as a conference paper at ICLR 2024 which are eigenvalues of Σ. According to (Zhang et al., 2012), 1 nλ X|Z,i converge in probability λ X|Z. Substituting all the results into the asymptotic distribution in Lemma 12, we can get the updated asymptotic distribution i,j=1 λ X|Z,iλY |Z,jα2 ij as β = n . (31) where β is the number of nonzero eigenvalues λ X|Z of the kernel matrices K X|Z. Consequently, TCI and ˆTCI have the same asymptotic distribution. Proof ends. A4.3 PROOF OF THEOREM 5 Proof: First of all, since α2 ij follow the χ2 distribution with one degree of freedom, thus we have E(α2 ij) = 1 and Var(α2 ij) = 2. According to the asymptotic distribution in Theorem 4 and the derivation of Lemma 7, we have E( ˆTCI|D) = 1 i,j λ X|Z,iλY |Z,j i λ X|Z,i X n2 tr( K X|Z) tr( KY |Z) n2 tr( R X|Z RT X|Z) tr( RY |Z RT Y |Z) n2 tr(n C X|Z) tr(n CY |Z) = tr(C X|Z) tr(CY |Z), where R X|Z and RY |Z are defined in the proof of Theorem 3 above. Therefore, E( ˆTCI|D) = tr(C X|Z) tr(CY |Z). Furthermore, α2 ij are independent variables across i and j, and notice that tr( K2 X|Z) = P i λ2 X|Z,i, and similarly tr( K2 Y |Z) = P i λ2 Y |Z,i. Based on the asymptotic distribution in Theorem 4, we have Var( ˆTCI|D) = 1 i,j λ2 X|Z,iλ2 Y |Z,j Var(α2 ij) i λ2 X|Z,i X j λ2 Y |Z,j n4 tr( K2 X|Z) tr( K2 Y |Z). Additionally, according to the similar rule as in Eq. 26, we have tr( K2 X|Z) = tr( R X|Z RT X|Z R X|Z RT X|Z) = tr( RT X|Z R X|Z RT X|Z R X|Z) = RT X|Z R X|Z 2 F = n C X|Z 2 F = n2 C X|Z 2 F . Published as a conference paper at ICLR 2024 Similarly, we have tr( K2 Y |Z) = n2 CY |Z 2 F . Substituting the results into the above formula- tion about variance, we have 2 n4 tr( K2 X|Z) tr( K2 Y |Z) = 2 n4 n2 C X|Z 2 F n2 CY |Z 2 F . Thus, Var( ˆTCI|D) = 2 C X|Z 2 F CY |Z 2 F . Proof ends. A4.4 PROOF OF THEOREM 6 Proof: According to the above-mentioned formulations, we have MX HMXH = ˆµX| ˆµT X| , ˆµX| H ˆµX| . 
Based on the rules of estimating covariance matrix from kernel matrix in Lemma 6, we have tr( MX) = tr( ˆµX| ˆµT X| ) = tr( ˆµT X| ˆµX| ) (35) = tr((Hϕ( )(C + γI) 1C X)T (Hϕ( )(C + γI) 1C X)) (36) = tr(CX (C + γI) 1ϕ( )T H Hϕ( )(C + γI) 1C X)) n tr(CX (C + γI) 1C (C + γI) 1C X) (37) n tr(C X). (38) Eq. 35 is obtained due to the trace property of the product of the matrices, as shown in Lemma 6. Eq. 36 is substituting from Eq. 20. Here we use Eq. 38 for simple notation. We can see that it can be represented with some combinations of different covariance matrices. Similarly, we have tr( MY ) = 1 n tr(CY (C + γI) 1C (C + γI) 1C Y ) = 1 n tr(C Y ). (39) Regarding the centralized Gram matrices for joint distribution, similarly we have tr( MY X) = 1 n tr(C(Y,X), (C + γI) 1C (C + γI) 1C ,(Y,X)) = 1 n tr(C Y ), tr( MXY ) = 1 n tr(C(X,Y ), (C + γI) 1C (C + γI) 1C ,(X,Y )) = 1 n tr(C X), (40) where tr( MY X) = tr( MXY ). Furthermore, based on Lemma 6 and Eq. 22, we have tr( MX MY X) = tr( ˆµX| ˆµT X| ˆµY X| ˆµT Y X| ) = tr( ˆµT X| ˆµY X| ˆµT Y X| ˆµX| ) (41) = ˆµT X| ˆµY X| 2 F = (Hϕ( )(C + γI) 1C X)T (Hϕ( )(C + γI) 1C ,(Y,X)) 2 F (42) = CX (C + γI) 1ϕ( )T H Hϕ( )(C + γI) 1C ,(Y,X) 2 F n CX (C + γI) 1C (C + γI) 1C ,(Y,X) 2 F (43) n2 C X, Y 2 F . (44) Eq. 41 is obtained due to the trace property of the product of the matrices, as shown in Lemma 6. Eq. 41 is substituting from Eq. 20 and Eq. 22. Here we use Eq. 44 for simple notation. We can see that it can be represented with some combinations of different covariance matrices. Similarly, we have tr( MY MXY ) = 1 n CY (C + γI) 1C (C + γI) 1C ,(X,Y ) 2 F n2 C Y, X 2 F . (45) Published as a conference paper at ICLR 2024 Substituting the equations above into Eq. 19, we have ˆ X Y = C X, Y 2 F tr(C X) tr(C Y ), ˆ Y X = C Y, X 2 F tr(C Y ) tr(C X). (46) Proof ends. A4.5 PROOF OF LEMMA 7 Proof: First of all, we incorporate random Fourier features to approximate the kernels, because they have shown competitive performances to approximate the continuous shift-invariant kernels. Lemma 13 (Random Features (Rahimi & Recht, 2007)). For a continuous shift-invariant kernel K(x, y) on R, we have: K(x, y) = Z R p(w)ejw(x y)dw = Ew[ζw(x)ζw(y)], (47) where ζw(x)ζw(y) is an unbiased estimate of K(x, y) when w is drawn from p(w). Since both the probability distribution p(w) and the kernel entry K(x, y) are real, the integral in Eq. 47 converges when the complex exponentials are replaced with cosines. Therefore, we may get a real-values mapping by: K(x, y) ϕw(x)T ϕw(y), 2 h[cos(w1x + b1), ..., cos(whx + bh)]T , 2 h[cos(w1y + b1), ..., cos(why + bh)]T , where w is drawn from p(w) and b is drawn uniformly from [0, 2π]. x, y, w, b R, and the randomized feature map ϕw : R Rh. The precise form of p(w) relies on the type of the shift-invariant kernel we would like to approximate. Here in this paper, we choose to approximate Gaussian kernel as one of the characteristic kernels, and thus set the probability distribution p(w) to the Gaussian one. Based on Eq. 48, we have tr( Kx,y) tr( ϕw(x) ϕw(y)T ), (49) where x, y Rn, Kx,y Rn n, ϕw(x) Rn h is the centralized random feature, ϕw(x)=Hϕw(x). 
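As a concrete illustration of the feature map in Eq. 48 (a minimal sketch under the Gaussian-kernel choice stated above; the bandwidth and number of features are illustrative), the following code builds random Fourier features, verifies that their inner products approximate the Gaussian kernel, and checks the centred-trace relation of Eq. 49:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Exact Gaussian kernel matrix K(x, y) = exp(-(x - y)^2 / (2 sigma^2))."""
    d = x[:, None] - y[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

def rff_map(x, h, sigma=1.0, seed=0):
    """Random Fourier features for the Gaussian kernel: phi_w(x) = sqrt(2/h) cos(w x + b),
    with w ~ N(0, 1/sigma^2) and b ~ Uniform[0, 2 pi]."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 1.0 / sigma, size=h)
    b = rng.uniform(0.0, 2 * np.pi, size=h)
    return np.sqrt(2.0 / h) * np.cos(np.outer(x, w) + b)

rng = np.random.default_rng(0)
n = 300
x, y = rng.normal(size=n), rng.normal(size=n)
K = gaussian_kernel(x, y)
for h in (10, 100, 1000):
    phi_x, phi_y = rff_map(x, h, seed=1), rff_map(y, h, seed=1)   # same (w, b) for both arguments
    print(h, np.max(np.abs(K - phi_x @ phi_y.T)))                 # error typically shrinks as h grows

H = np.eye(n) - np.ones((n, n)) / n                               # centring matrix
print(np.trace(H @ K @ H), np.trace((H @ phi_x) @ (H @ phi_y).T)) # Eq. 49: centred traces approximately agree
```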
Furthermore, benefiting from the commutative property of the trace of the product of two matrices, we have tr( ϕw(x) ϕw(y)T ) = tr( ϕw(y)T ϕw(x)), (50) Since each random feature is centralized, meaning the zero mean for each feature, therefore, we have: tr( ϕw(y)T ϕw(x)) = tr( 1 n Cx,y) = 1 n tr(Cx,y), (51) where Cx,y is the covariance matrix for variable x and y, Cx,y Rh h, h is the number of hidden features. For the second formulation, we have tr( Kx Ky) = tr[ ϕw(x) ϕw(x)T ϕw(y) ϕw(y)T ] = tr[ ϕw(x)( ϕw(x)T ϕw(y) ϕw(y)T )] = tr[( ϕw(x)T ϕw(y) ϕw(y)T ) ϕw(x)] = tr[ ϕw(x)T ϕw(y) ϕw(y)T ϕw(x)] = ϕw(x)T ϕw(y) 2 F = n Cx,y 2 F = n2 Cx,y 2 F . Published as a conference paper at ICLR 2024 Together with Eq. 49, Eq. 50, Eq. 51 and Eq. 52 formulated above, we could prove the Lemma 7 in the main paper. Proof ends. A4.6 PROOF OF THEOREM 8 Proof. The summary statistics contain two parts: total sample size n and covariance tensor CT Rd d h h. Let Cij T Rh h be the (i, j)-th entry of the covariance tensor, which denotes the covariance matrix of the i-th and the j-th variable. With the summary statistics as a proxy, we can substitute the raw data at each client. During the procedures of causal discovery, the needed statistics include TCI in Theorem 4, E( ˆTCI|D) and Var( ˆTCI|D) in Theorem 5, and ˆ X Y and ˆ Y X in Theorem 6. 1) Based on the Eq. (7) in the main paper, we have C XY |Z = C XY C XZ(CZZ + γI) 1CZY = C(X,Z),Y C(X,Z),Z(CZZ + γI) 1CZY (53) = (CXY + CZY ) (CXZ + CZZ)(CZZ + γI) 1CZY In this paper, we consider the scenarios where X and Y are single variables, and Z may be a single variable, a set of variables, or empty. Assuming that Z contains L variables. We have i=1 CZi Y , CXZ = i=1 CXZi, CZZ = j=1 CZi Zj, (54) where CXY , CZi Y , CXZi, and CZi Zj are the entries of the covariance tensor CT . According to Theorem 3, TCI n C XY |Z 2 F . Therefore, the summary statistics are sufficient to represent TCI. 2) Similar to Eq. 53, we have C X|Z = (CXX + 2CXZ + CZZ)(CXZ + CZZ)(CZZ + γI) 1(CXZ + CZZ) (55) CY |Z = CY Y CY Z(CZZ + γI) 1CZY . (56) Substituting Eq. 54 into Eq. 55 and Eq. 56, we can also conclude that the covariance tensor is sufficient to represent C X|Z and CY |Z. In other words, the summary statistics are sufficient to represent E( ˆTCI|D) and Var( ˆTCI|D). 3) As shown in section A4.3, we have ˆ X Y = C X, Y 2 F tr(C X) tr(C Y ), ˆ Y X = C Y, X 2 F tr(C Y ) tr(C X), (57) where each components can be represented as some combinations of covariance matrices, as shown in Eq. 37, Eq. 39, Eq. 40, Eq. 43, and Eq. 45. Therefore, the summary statistics are sufficient to represent ˆ X Y and ˆ Y X. 4) To sum up, we could conclude that: The summary statistics, consisting of total sample size n and covariance tensor CT , are sufficient to represent all the statistics needed for federated causal discovery. Proof ends. A5 DETAILS ABOUT FEDERATED UNCONDITIONAL INDEPENDENCE TEST Here, we provide more details about the federated unconditional independence test (FUIT), where the conditioning set Z is empty. Generally, this method follows similar theorems for federated conditional independent test (FCIT). Published as a conference paper at ICLR 2024 A5.1 NULL HYPOTHESIS Consider the null and alternative hypothesis H0 : X Y, H1 : X Y. (58) Similar to FCIT, we consider the squared Frobenius norm of the empirical covariance matrix as an approximation, given as H0 : C XY 2 F = 0, H1 : C XY 2 F > 0. (59) In this unconditional case, we set the test statistics as TUI n C XY 2 F , and give the following theorem. 
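Before stating the corresponding theorem, the following minimal sketch (illustrative, not the released implementation) shows how summary statistics in the spirit of Theorem 8 support the unconditional statistic $T_{UI}$ just defined: each client builds an $h \times h$ cross-covariance block from its local random features and sends it, together with its sample size, to the server, which pools the blocks and evaluates $T_{UI} = n\,\|\hat{C}_{XY}\|_F^2$. Pooling by a sample-size-weighted average and centring features locally are simplifying assumptions of this sketch.

```python
import numpy as np

def rff_map(v, w, b):
    """Shared random Fourier feature map; the server broadcasts (w, b) so all clients use the same features."""
    return np.sqrt(2.0 / len(w)) * np.cos(np.outer(v, w) + b)

def client_summary(x, y, w, b):
    """Client-side summary statistics: local sample size and an h x h cross-covariance block for (X, Y)."""
    fx, fy = rff_map(x, w, b), rff_map(y, w, b)
    fx, fy = fx - fx.mean(axis=0), fy - fy.mean(axis=0)          # local centring (simplifying assumption)
    return len(x), fx.T @ fy / len(x)

def server_tui(summaries):
    """Server-side pooling of the client blocks and evaluation of T_UI = n * ||C_XY||_F^2."""
    n = sum(nk for nk, _ in summaries)
    C_xy = sum(nk * Ck for nk, Ck in summaries) / n              # sample-size-weighted average
    return n * np.linalg.norm(C_xy, 'fro') ** 2

h, rng = 5, np.random.default_rng(0)
w, b = rng.normal(size=h), rng.uniform(0, 2 * np.pi, size=h)     # broadcast once by the server
clients_null = [(rng.normal(size=100), rng.normal(size=100)) for _ in range(10)]   # X independent of Y
clients_alt = []
for _ in range(10):
    x = rng.normal(size=100)
    clients_alt.append((x, np.sin(x) + 0.3 * rng.normal(size=100)))                # X and Y dependent
print(server_tui([client_summary(x, y, w, b) for x, y in clients_null]))   # relatively small under H0
print(server_tui([client_summary(x, y, w, b) for x, y in clients_alt]))    # noticeably larger under H1
```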
Theorem 14 (Federated Unconditional Independent Test). Under the null hypothesis H0 (X and Y are independent), the test statistic TUI n CXY 2 F , (60) has the asymptotic distribution i,j=1 λX,iλY,jα2 ij, where λX and λY are the eigenvalues of KX and KY , respectively. Here, the proof is similar to the proof of Theorem 3, thus we refer the readers to section A4.2 for more details. A5.2 NULL DISTRIBUTION APPROXIMATION We also approximate the null distribution with a two-parameter Gamma distribution, which is related to the mean and variance. Under the hypothesis H0 and given the sample D, the distribution of ˆTCI can be approximated by the Γ(κ, θ) distribution. Here we provide the theorem for null distribution approximation. Theorem 15 (Null Distribution Approximation). Under the null hypothesis H0 (X and Y are independent), we have E( ˆTUI|D) = tr(CX) tr(CY ), Var( ˆTUI|D) = 2 CX 2 F CY 2 F , (61) Here, the proof is similar to the proof of Theorem 4, thus we refer the readers to section A4.3 for more details. A6 DETAILS ABOUT SKELETON DISCOVERY AND DIRECTION DETERMINATION In this section, we will introduce how we do the skeleton discovery and direction determination during the process of federated causal discovery. All those steps are conducted on the server side. Our steps are similar to the previous method, such as CD-NOD (Huang et al., 2020), the core difference are that we develop and utilize our proposed federated conditional independent test (FCIT) and federated independent change principle (FICP). A6.1 SKELETON DISCOVERY. We first conduct skeleton discovery on the augmented graph. The extra surrogate variable is introduced in order to deal with the data heterogeneity across different clients. Lemma 16. Given the Assumptions 1, 2 and 3 in the main paper, for each Vi V , Vi and are not adjacent in the graph if and only if they are independent conditional on some subset of {Vj|j = i}. Published as a conference paper at ICLR 2024 Proof. If Vi s causal module is invariant, which means that P(Vi|PAi) remains the same for every value of , then Vi |PAi. Thus, if Vi and are not independent conditional on any subset of other variables, Vi s module changes with , which is represented by an edge between Vi and . Conversely, we assume that if Vi s module changes, which entails that Vi and are not independent given PAi, then Vi and are not independent given any other subset of V \{Vi}. Proof ends. Lemma 17. Given the Assumptions 1, 2 and 3 in the main paper, for every Vi, Vj V , Vi and Vj are not adjacent if and only if they are independent conditional on some subset of {Vl|l = i, l = j} { }. Proof. The if direction shown based on the faithfulness assumption on Gaug and the fact that {ψl( )}L l=1 {θi( )}d i=1 is a deterministic function of . The only if direction is proven by making use of the weak union property of conditional independence repeatedly, the fact that all {ψl( )}L l=1 and {θi( )}d i=1 are deterministic function of , the above three assumptions, and the properties of mutual information. Please refer to (Zhang et al., 2015) for more complete proof. With the given three assumptions in the main paper, we can do skeleton discovery. i) Augmented graph initialization. First of all, build a completely undirected graph on the extended variable set V { }, where V denotes the observed variables and is surrogate variable. ii) Changing module detection. For each edge Vi, conduct the federated conditional independence test or federated unconditional independent test. 
If they are conditionally independent or independent, remove the edge between them. Otherwise, keep the edge and orient Vi. iii) Skeleton discovery. Moreover, for each edge Vi Vj, also conduct the federated independence test or federated unconditional independent test. If they are conditionally independent or independent, remove the edge between them. In the procedures, how observed variables depend on surrogate variable is unknown and usually nonlinear, thus it is crucial to use a general and non-parametric conditional independent test method, which should also satisfy the federated learning constraints. Here, we utilize our proposed FCIT. A6.2 DIRECTION DETERMINATION. After obtaining the skeleton, we can go on with the causal direction determination. By introducing the surrogate variable , it does not only allow us to infer the skeleton, but also facilitate the direction determinations. For each variable Vi whose causal module is changing (i.e., Vi), in some ways we might determine the directions of every edge incident to Vi. Assume another variable Vj which is adjacent to Vi, then we can determine the directions via the following rules. i) Direction determination with one changing module. When Vj s causal module is not changing, we can see Vi Vj forms an unshielded triple. For practice purposes, we can take the direction between and Vi as Vi, since we let be the surrogate variable to indicate whether this causal module is changing or not. Then we can use the standard orientation rules (Spirtes et al., 2000) for unshielded triples to orient the edge between Vi and Vj. (1) If and Vi are independent conditional on some subset of {Vl|l = j} which is excluding Vj, then the triple forms a V-structure, thus we have Vi Vj. (2) If and Vi are independent conditional on some subset of {Vl|l = i} {Vj} which is including Vj, then we have Vi Vj. In the procedure, we apply our proposed FCIT. ii) Direction determination with two changing modules. When Vj s causal module is changing, we can see there is a special confounder between Vi Vj. First of all, as mentioned above, we can still orient Vi and Vj. Then, inspired by that P(cause) and P(effect|cause) change independently, we can identify the direction between Vi and Vj according to Lemma 1, and we apply our proposed FICP. Published as a conference paper at ICLR 2024 (a) Precision and recall on linear Gaussian model. (b) Precision and recall on general functional model. Figure A2: Results of the synthetic dataset on (a) linear Gaussian model and (b) general functional model. By rows in each subfigure, we evaluate varying number of variables d, varying number of clients K, and varying number of samples nk. By columns in each subfigure, we evaluate Skeleton Precision ( ), Skeleton Recall ( ), Direction Precision ( ) and Direction Recall ( ). Published as a conference paper at ICLR 2024 A7 DETAILS ABOUT THE EXPERIMENTS ON SYNTHETIC DATASETS More details about the synthetic datasets are explained in this section, including the implementation details in section A7.1, the results analysis of F1 and SHD in section A7.2, the complete results of precision and recall in section A7.3, the computational time analysis in section A7.4, the hyperparameter study on the number of hidden features h in section A7.5, the statistical significance test for the results in section A7.6, and the evaluation on dense graph in section A7.7. A7.1 IMPLEMENTATION DETAILS We provide the implementation details of our method and other baseline methods. 
- Fed DAG (Gao et al., 2022): Code is available at the authors' GitHub repository https://github.com/ErdunGAO/FedDAG. The hyperparameters are set to their defaults.
- NOTEARS-ADMM and NOTEARS-MLP-ADMM (Ng & Zhang, 2022): Code is available at the authors' GitHub repository https://github.com/ignavierng/notears-admm. The hyperparameters are set to their defaults; e.g., we set the threshold level to 0.1 for post-processing.
- Fed PC (Huang et al., 2022): Although there is no public implementation provided by the authors, considering that it is the only constraint-based method among all the existing works for federated causal discovery, we still compare with it. We reproduced it based on the Causal-learn package https://github.com/py-why/causal-learn. Following the paper, we set the voting rate to 30% and the significance level to 0.05.
- Fed CDH (Ours): Our method is developed based on CD-NOD (Huang et al., 2020) and KCI (Zhang et al., 2012), which are publicly available in the Causal-learn package https://github.com/py-why/causal-learn. We set the hyperparameter h to 5 and the significance level for FCIT to 0.05. Our source code is included in the Supplementary Materials.
For NOTEARS-ADMM, NOTEARS-MLP-ADMM, and Fed DAG, the output is a directed acyclic graph (DAG), while Fed PC and our Fed CDH may output a completed partially directed acyclic graph (CPDAG). To ease comparisons, we use the simple orientation rules (Dor & Tarsi, 1992) implemented in causaldag (Chandler Squires, 2018) to convert a CPDAG into a DAG. We evaluate both the undirected skeleton and the directed graph, denoted by Skeleton and Direction in the figures.
A7.2 ANALYSIS OF F1 AND SHD
We have provided the results of F1 and SHD in Figure 3 of the main paper and in Figure A3; here we provide further discussion and analysis. The results for the linear Gaussian model are given in Figure 3 and those for the general functional model are provided in Figure A3. According to the results, we observe that our Fed CDH method generally outperforms all other baselines across different criteria and settings. Focusing on our method on both models, when d increases, the F1 score decreases and the SHD increases for both skeletons and directions, indicating that FCD with more variables is more challenging. On the contrary, when K and nk increase, the F1 score grows and the SHD decreases, suggesting that more participating clients or samples contribute to better FCD performance. In the linear Gaussian model, NOTEARS-ADMM and Fed PC generally outperform Fed DAG; the reason may be that the former two methods were proposed for linear models while the latter was specifically proposed for nonlinear models. In the general functional model, Fed PC obtained the worst direction F1 score among all methods, possibly due to its strong assumptions of a linear model and homogeneous data. Fed DAG and NOTEARS-MLP-ADMM showed poor results in terms of SHD; the reasons may be two-fold: they assume a nonlinear identifiable model, which may not handle the general functional model well, and both are continuous-optimization-based methods, which might suffer from various issues such as convergence and nonconvexity.
Figure A3: Results of the synthetic dataset on the general functional model. By rows, we evaluate varying number of variables d, varying number of clients K, and varying number of samples nk.
By columns, we evaluate Skeleton F1 (↑), Skeleton SHD (↓), Direction F1 (↑) and Direction SHD (↓).
A7.3 RESULTS OF PRECISION AND RECALL
In the main paper, we only provided the results of F1 score and SHD due to the space limit. Here, we provide more results and analysis of the precision and the recall. The averages and standard deviations are exhibited in Figure A2. According to the results, our Fed CDH method generally outperformed all other baseline methods regarding the precision of both skeleton and direction. Moreover, in the linear Gaussian model, NOTEARS-ADMM generally achieved the best recall although it performed poorly in precision; the reason might be that NOTEARS-ADMM assumes a homogeneous data distribution, which may face challenges in scenarios with heterogeneous data. In the general functional model, when evaluating varying numbers of clients K and samples nk, Fed DAG performed best with respect to recall; however, neither Fed DAG nor NOTEARS-MLP-ADMM obtained satisfactory precision. The reason might be that both are continuous-optimization-based methods, which might suffer from various issues such as convergence and nonconvexity.
A7.4 RESULTS OF COMPUTATIONAL TIME
Existing works on federated causal discovery rarely evaluate computational time. It is usually difficult to measure the exact computational time in practice, owing to factors such as parallel computation across clients and the communication costs between the clients and the server. However, computational time is an important factor in assessing whether a federated causal discovery method can be used in practical scenarios. Therefore, in this section, to make fair comparisons, we evaluate the computational time of each method assuming that there is no parallel computation (i.e., we record the computational time at each client and at the server and simply add them up) and no extra communication cost (i.e., zero time cost for communication). We evaluate the different settings mentioned above, including varying number of variables d, varying number of clients K, and varying number of samples nk. We generate data according to the linear Gaussian model.
Table A2: Results of computational time for varying number of variables d, varying number of clients K, and varying number of samples nk. We report the average and standard deviation over 10 runs. This is the synthetic dataset based on the linear Gaussian model.
| Setting | Fed PC | NOTEARS-ADMM | Fed DAG | Fed CDH (Ours) |
|---|---|---|---|---|
| d = 6 | 3.87 ± 1.97s | 14.10 ± 1.89s | 136.92 ± 21.50s | 8.14 ± 2.47s |
| d = 12 | 32.01 ± 3.54s | 28.33 ± 2.46s | 321.84 ± 65.94s | 62.69 ± 7.77s |
| d = 18 | 39.58 ± 4.75s | 35.13 ± 2.89s | 398.27 ± 149.51s | 98.57 ± 9.23s |
| d = 24 | 84.05 ± 7.64s | 40.01 ± 2.94s | 715.80 ± 268.93s | 172.11 ± 18.18s |
| d = 30 | 94.03 ± 9.48s | 56.35 ± 3.91s | 1441.13 ± 519.04s | 232.35 ± 26.67s |
| K = 2 | 0.72 ± 0.24s | 7.04 ± 0.64s | 50.38 ± 11.29s | 3.88 ± 1.49s |
| K = 4 | 2.07 ± 0.73s | 9.07 ± 0.77s | 85.08 ± 15.68s | 5.24 ± 1.74s |
| K = 8 | 3.64 ± 1.54s | 10.80 ± 0.78s | 114.81 ± 29.67s | 8.01 ± 2.32s |
| K = 16 | 5.79 ± 2.59s | 19.40 ± 2.51s | 342.34 ± 62.28s | 12.60 ± 2.98s |
| K = 32 | 14.08 ± 4.44s | 30.56 ± 2.88s | 714.06 ± 137.31s | 20.30 ± 4.37s |
| nk = 25 | 0.48 ± 0.10s | 13.06 ± 1.91s | 125.77 ± 20.64s | 3.75 ± 1.29s |
| nk = 50 | 1.47 ± 0.64s | 13.75 ± 2.51s | 127.25 ± 20.38s | 5.74 ± 1.61s |
| nk = 100 | 3.87 ± 1.97s | 14.10 ± 1.89s | 136.92 ± 21.50s | 8.14 ± 2.47s |
| nk = 200 | 16.52 ± 3.63s | 14.68 ± 2.23s | 138.67 ± 31.91s | 13.78 ± 3.75s |
| nk = 400 | 51.10 ± 6.87s | 15.90 ± 2.54s | 140.37 ± 34.42s | 22.86 ± 4.55s |
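A minimal sketch of the timing protocol described above (illustrative; the function names are placeholders rather than APIs of the released code): per-client computation and server-side computation are timed separately and simply summed, which corresponds to assuming no parallelism and zero communication cost.

```python
import time

def timed(fn, *args):
    """Run fn(*args), returning (result, elapsed wall-clock seconds)."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0

def total_fcd_time(client_datasets, compute_local_summary, run_server_discovery):
    """Sequential accounting: sum of all client-side times plus the server-side time.

    compute_local_summary and run_server_discovery are hypothetical placeholders for the
    client and server routines of an FCD method, not the paper's actual API.
    """
    elapsed, summaries = 0.0, []
    for data in client_datasets:                    # no parallelism assumed
        s, dt = timed(compute_local_summary, data)
        summaries.append(s)
        elapsed += dt
    _, dt = timed(run_server_discovery, summaries)  # zero communication cost assumed
    return elapsed + dt
```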
For each setting, we run 10 instances, report the average and the standard deviation of the computational time. The results are exhibited in Table A2. According to the results, we could observe that among the four FCD methods, Fed DAG is the least efficient method with the largest time cost, because it uses a two-level structure to handle the heterogeneous data: the first level learns the edges and directions of the graph and communicates with the server to get the model information from other clients, while the second level approximates the mechanism among variables and personally updates on its own data to accommodate the data heterogeneity. Meanwhile, Fed PC, NOTEARS-ADMM and our Fed CDH are comparable. In the setting of varying variables, our method exhibited unsatisfactory performance among the three methods, because the other two methods, Fed PC and NOTEARS-ADMM, are mainly for homogeneous data. However, in the case of varying variables, NOTEARS-ADMM is the most ineffective method, because with the increasing of clients, more parameters (one client corresponds to one sub adjacency matrix which needs to be updated) should get involved in the optimization process, therefore, the total processing time can also increase by a large margin. In the scenario of varying samples, Fed PC is the slowest one among the three methods. A7.5 HYPERPARAMETER STUDY We conduct experiments on the hyperparameter, such as the number of mapping functions or hidden features h. Regarding the experiments in the main paper, we set h to 5 by default. Here in this section, we set h {5, 10, 15, 20, 25, 30}, d = 6, K = 10, nk = 100 and evaluate the performances. We generate data according to linear Gaussian model. We use the F1 score, the precision, the recall and the SHD for both skeleton and direction. We also report the runtime. We run 10 instances and report the average values. The experimental results are given in Table A3. According to the results, we could observe that with the number of hidden features h increasing, the performance of the direction is obviously getting better, while the performance of the skeleton may fluctuate a little bit. Theoretically, the more hidden features or a larger h we consider, the better performance of how closely the random features approximate the kernels should be. When the number of hidden features approaches infinity, the performance of random features and that of kernels should be almost the same. And the empirical results seem to be consistent with the theory, where a large h can lead to a higher F1 score and precision for the directed graph. Published as a conference paper at ICLR 2024 Table A3: Hyperparameter study on the number of hidden features h. We evaluate the F1 score, precision, recall, and SHD of both skeleton and direction. We report the average over 10 runs. This is the synthetic dataset based on linear Gaussian model. h Metrics Skeleton Direction Time F1 Precision Recall SHD F1 Precision Recall SHD 5 0.916 0.980 0.867 0.9 0.721 0.765 0.683 2.0 8.14s 10 0.916 0.980 0.867 0.9 0.747 0.810 0.700 2.0 8.87s 15 0.907 0.980 0.850 1.0 0.762 0.818 0.717 1.8 10.57s 20 0.889 0.980 0.833 1.2 0.767 0.833 0.717 1.8 12.72s 25 0.896 0.980 0.833 1.1 0.789 0.838 0.750 1.6 20.93s 30 0.896 0.980 0.833 1.1 0.825 0.873 0.783 1.4 37.60s Table A4: Test result of statistical significance of our Fed CDH method compared with other baseline methods. We report the p values via Wilcoxon signed-rank test (Woolson, 2007). This is the synthetic dataset based on linear Gaussian model. 
Parameters [Fed CDH vs. Fed PC] [Fed CDH vs. NOTEARS-ADMM] [Fed CDH vs. Fed DAG] d k n S-F1 S-SHD D-F1 D-SHD S-F1 S-SHD D-F1 D-SHD S-F1 S-SHD D-F1 D-SHD 6 10 100 0.00 0.05 0.01 0.12 0.00 0.01 0.11 0.10 0.00 0.01 0.01 0.01 12 10 100 0.00 0.01 0.01 0.01 0.00 0.00 0.15 0.00 0.00 0.00 0.11 0.00 18 10 100 0.00 0.01 0.00 0.01 0.00 0.00 0.03 0.00 0.00 0.00 0.02 0.00 24 10 100 0.01 0.01 0.00 0.01 0.00 0.00 0.01 0.00 0.01 0.01 0.01 0.00 30 10 100 0.00 0.01 0.00 0.01 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 6 2 100 0.00 0.00 0.01 0.01 0.01 0.00 0.21 0.01 0.00 0.00 0.03 0.00 6 4 100 0.00 0.01 0.00 0.01 0.01 0.01 0.01 0.00 0.00 0.01 0.01 0.00 6 8 100 0.00 0.00 0.01 0.02 0.02 0.01 0.03 0.02 0.00 0.00 0.09 0.00 6 16 100 0.00 0.01 0.01 0.02 0.00 0.00 0.10 0.03 0.00 0.00 0.07 0.00 6 32 100 0.00 0.00 0.00 0.00 0.00 0.00 0.04 0.01 0.00 0.00 0.03 0.00 6 10 25 0.00 0.00 0.01 0.01 0.01 0.01 0.26 0.02 0.00 0.00 0.03 0.00 6 10 50 0.00 0.01 0.01 0.00 0.01 0.00 0.99 0.03 0.00 0.00 0.02 0.00 6 10 200 0.00 0.01 0.01 0.02 0.00 0.00 0.03 0.02 0.00 0.00 0.11 0.01 6 10 400 0.00 0.01 0.01 0.01 0.01 0.00 0.03 0.01 0.00 0.01 0.01 0.00 Moreover, the computational time is also increasing. When h is smaller than 20, the runtime increases steadily. When h is greater than 20, the runtime goes up rapidly. Importantly, we could see that even when h is small, such as h = 5, the general performance of our method is still robust and competitive. A7.6 STATISTICAL SIGNIFICANCE TEST In order to show the statistical significance of our method compared with other baseline methods on the synthetic linear Gaussian model, we report the p values via Wilcoxon signed-rank test (Woolson, 2007), as shown in Table A4. For each baseline method, we evaluate four criteria: Skeleton F1 (SF1), Skeleton SHD (S-SHD), Direction F1 (D-F1), and Direction SHD (D-SHD). We set the significance level to 0.05. Those p values higher than 0.05 are underlined. From the results, we can see that the improvements of our method are statistically significant at 5% significance level in general. A7.7 EVALUATION ON DENSE GRAPH As shown in Figure 3 in the main paper, the true DAGs are simulated using the Erd os R enyi model (Erd os et al., 1960) with the number of edges equal to the number of variables. Here we consider a more dense graph with the number of edges are two times the number of variables. we evaluate on synthetic linear Gaussian model and general functional model, and record the F1 score and SHD for both skeleton and directed graphs. All other settings are following the previous ones by default. Published as a conference paper at ICLR 2024 Figure A4: We evaluate on synthetic linear Gaussian model (Top Row) and general functional model (Bottom Row) when the number of edges are two times the number of variables. By columns, we evaluate Skeleton F1 ( ), Skeleton SHD ( ), Direction F1 ( ) and Direction SHD ( ). According to the results as shown in Figure A4, we can see that our methods still outperformed other baselines in varying number of variables. Interestingly, when the generated graph is more dense, the performance of Fed PC will obviously go down for various number of variables. A7.8 EVALUATION ON THE POWER OF CONDITIONAL INDEPENDENCE TEST Here we added a new set of experiments to compare the power of our proposed federated conditional independence test and the centralized conditional independence test (i.e., kernel-based conditional independence test (Zhang et al., 2012)). 
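For reference, power here is the fraction of simulated datasets (generated under the alternative hypothesis) on which a test correctly rejects the null. A minimal sketch of how such a power curve can be estimated is given below; the data generator and test functions are placeholders for the generating process and tests described next, not actual APIs.

```python
import numpy as np

def estimate_power(run_test, simulate_data, sample_sizes, n_sims=1000, alpha=0.05):
    """Estimate test power: the fraction of simulated (dependent) datasets on which the test rejects H0.

    run_test(x, y, z) -> p-value, and simulate_data(n) -> (x, y, z) with X not independent of Y given Z;
    both are hypothetical placeholders to be filled in with the actual test and data-generating process.
    """
    power = {}
    for n in sample_sizes:
        rejections = sum(run_test(*simulate_data(n)) < alpha for _ in range(n_sims))
        power[n] = rejections / n_sims
    return power

# e.g. estimate_power(federated_cit, simulate_pnl_data, sample_sizes=[200, 400, 600, 800, 1000])
```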
We followed the previous paper (Zhang et al., 2012) and used the post-nonlinear model (Zhang & Hyvarinen, 2012) to generate data. Assume there are four variables $W$, $X$, $Y$, and $Z$, with $X = \hat{g}(\hat{f}(W) + \epsilon_X)$, $Y = \hat{g}(\hat{f}(W) + \epsilon_Y)$, and $Z$ independent of both $X$ and $Y$. $\hat{f}$ and $\hat{g}$ are functions randomly chosen from the linear, square, $\sin$ and $\tan$ functions. $\epsilon_X$, $\epsilon_Y$, $W$ and $Z$ are sampled from either the uniform distribution $U(-0.5, 0.5)$ or the Gaussian distribution $N(0, 1)$; $\epsilon_X$ and $\epsilon_Y$ are random noises. In this case, $X$ and $Y$ are dependent due to the shared component $W$. Since $Z$ is independent of both $X$ and $Y$, we have $X \not\perp\!\!\!\perp Y \mid Z$. Here we set the significance level to 0.05, and the total sample size varies over 200, 400, 600, 800 and 1000. For the federated CIT, we set the number of clients to 10, so each client has 20, 40, 60, 80, or 100 samples. We run 1000 simulations and record the power of the two tests. From the results in Figure A5, we can see that the power of our federated CIT is very close to that of the centralized CIT. In particular, when the sample size reaches 1000, both tests achieve power above 95%.
A7.9 EVALUATION ON THE ORDER OF DOMAIN INDICES
In this section, we aim to find out whether the order of domain indices impacts the results. Theoretically, there should be no impact when the index takes different values, because the domain index is essentially a discrete variable (more specifically, a categorical variable, with no numerical order among different values); a common approach to handle such a discrete variable is to use the delta kernel (based on the Kronecker delta function), and it is therefore reasonable to use random features to approximate the delta kernel for discrete variables. Empirically, we added a new set of experiments to evaluate whether the order of domain indices impacts the results. We take one set of domain indices and run our Fed CDH on the synthetic linear Gaussian model with varying number of variables d ∈ {6, 12, 18, 24, 30} while keeping K = 10 and nk = 100; other settings are the same as those in our main paper. Then, we randomly shuffle the indices of the different domains, denoted by "Fed CDH+Shuffle".
Figure A5: Comparison of the power of the federated conditional independence test and the centralized conditional independence test.
Figure A6: Evaluation on the order of domain indices on the linear Gaussian model. We evaluate varying number of variables d. By columns, we evaluate Skeleton F1 (↑), Skeleton SHD (↓), Direction F1 (↑) and Direction SHD (↓).
As shown in Figure A6, the performances for the two sets of domain indices are quite similar, and we may conclude that the particular values taken by the domain indices have no obvious impact on the results.
A8 DETAILS ABOUT THE EXPERIMENTS ON REAL-WORLD DATASETS
A8.1 DETAILS ABOUT THE FMRI HIPPOCAMPUS DATASET
We evaluate our method and the baselines on fMRI Hippocampus (Poldrack et al., 2015). The anatomical ground-truth directions are: PHC → ERC, PRC → ERC, ERC → DG, DG → CA1, CA1 → Sub, Sub → ERC and ERC → CA1. Generally, we follow a similar setting as in the experiments on synthetic datasets.
We use the structural Hamming distance (SHD) and the F1 score as evaluation criteria, and we measure both the undirected skeleton and the directed graph. Here, we consider varying number of clients K and varying number of samples per client nk. The results of F1 score and SHD are given in Figure A7. According to the results, we observe that our Fed CDH method generally outperformed all other baseline methods across all the criteria listed. The reason could be that our method is specifically designed for heterogeneous data, while some baseline methods, such as Fed PC and NOTEARS-MLP-ADMM, assume homogeneity; furthermore, our method can handle arbitrary functional causal models, unlike baseline methods such as Fed PC that assume linearity. Compared with our method, Fed DAG performed much worse; the reason might be its continuous-optimization nature, which might suffer from various issues such as convergence and nonconvexity.
Figure A7: Results on the real-world fMRI Hippocampus dataset (Poldrack et al., 2015). By rows, we evaluate varying number of clients K and varying number of samples nk. By columns, we evaluate Skeleton F1 (↑), Skeleton SHD (↓), Direction F1 (↑) and Direction SHD (↓).
Figure A8: Results on the real-world HK Stock Market dataset (Huang et al., 2020). We evaluate varying number of clients K; by columns, we evaluate Skeleton F1 (↑), Skeleton SHD (↓), Direction F1 (↑) and Direction SHD (↓).
A8.2 DETAILS ABOUT THE HK STOCK MARKET DATASET
We also evaluate on the HK stock market dataset (Huang et al., 2020) (see page 41 of that paper for more details about the dataset). The HK stock dataset contains 10 major stocks, which are daily closing prices from 10/09/2006 to 08/09/2010. The 10 stocks are Cheung Kong Holdings (1), Wharf (Holdings) Limited (2), HSBC Holdings plc (3), Hong Kong Electric Holdings Limited (4), Hang Seng Bank Ltd (5), Henderson Land Development Co. Limited (6), Sun Hung Kai Properties Limited (7), Swire Group (8), Cathay Pacific Airways Ltd (9), and Bank of China Hong Kong (Holdings) Ltd (10). Among these stocks, 3, 5, and 10 belong to the Hang Seng Finance Sub-index (HSF); 1, 8, and 9 belong to the Hang Seng Commerce and Industry Sub-index (HSC); 2, 6, and 7 belong to the Hang Seng Properties Sub-index (HSP); and 4 belongs to the Hang Seng Utilities Sub-index (HSU). Here one day can also be seen as one domain. We set the number of clients to K ∈ {2, 4, 6, 8, 10} and randomly select nk = 100 samples for each client. All other settings follow the previous ones by default.
The results are provided in Figure A8. According to the results, we can infer that our Fed CDH method also outperformed the other baseline methods across the different criteria. Similar to the analysis above, our method is tailored for heterogeneous data, in contrast to baseline methods like Fed PC and NOTEARS-MLP-ADMM, which assume homogeneity. Additionally, our approach is capable of handling arbitrary functional causal models, setting it apart from baseline methods like Fed PC that assume linearity. When compared to our method, Fed DAG exhibited significantly poorer performance, which could be attributed to its reliance on continuous optimization and the associated challenges such as convergence and nonconvexity.