# Collaborative Heterogeneous Causal Inference Beyond Meta-analysis

Tianyu Guo^1, Sai Praneeth Karimireddy^2, Michael I. Jordan^{1,2}

^1 Department of Statistics, UC Berkeley. ^2 Department of EECS, UC Berkeley. Correspondence to: Tianyu Guo.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Collaboration between different data centers is often challenged by heterogeneity across sites. To account for this heterogeneity, the state-of-the-art method re-weights the covariate distribution at each site to match the distribution of the target population. Nevertheless, this method still relies on traditional meta-analysis after adjusting for the distribution shift. This work proposes a collaborative inverse propensity score weighting estimator for causal inference with heterogeneous data. Instead of adjusting for the distribution shift separately at each site, we use weighted propensity score models to adjust for it collaboratively. Our method shows significant improvements over methods based on meta-analysis as heterogeneity increases. By incorporating outcome regression models, we prove asymptotic normality when the covariates have dimension d < 8. Our method preserves privacy at individual sites by implementing federated learning protocols.

1. Introduction

The booming of Federated Learning (FL) has drawn attention in medical and social sciences, where sharing datasets between data centers is often limited. However, FL research focuses mainly on prediction, whereas in causal inference prediction gets less attention and valid inference is the main focus. For example, meta-analysis takes the weighted mean of published estimators of the average treatment effect (ATE), and mainly focuses on choosing optimal weights and making inferences.

Given homogeneous data, how could federated learning help causal inference? The estimation of the ATE commonly incorporates nuisance prediction models, e.g., the propensity score model. Thanks to the homogeneity, we can use FL methods to train a shared propensity score model; each site then obtains its own ATE estimator, and finally the central server uses meta-analysis to take the weighted mean.

Nevertheless, given heterogeneous data, federated learning seems to play a negligible role. Since propensity score models differ between sites, training a shared model is meaningless. As a result, all existing methods fall within the scope of meta-analysis. For example, to estimate the ATE for a target site, Han et al. (2022) and Han et al. (2023b) consider using density ratios to re-weight source sites and summarizing estimates from source sites with meta-analysis.

We propose a novel method tailored for collaboration with heterogeneous data. Suppose we have K sites; denote the ATE by τ, the nuisance propensity model by e, and the site-wise weight by η_k. Instead of taking the weighted mean of site estimators afterward, we directly take the weighted mean of the nuisance models and compute $\hat\tau_k(\sum_{r=1}^K \eta_r \hat e_r)$ at each site k, which is by itself inconsistent. We then recover a consistent estimator $\hat\tau_{\mathrm{CLB}}$ by aggregating across all sites. Equations (1) and (2) summarize the previous estimators and ours:

$$\hat\tau_{\mathrm{homo}} = \sum_{k=1}^K \eta_k \hat\tau_k(\hat e_{\mathrm{FL}}), \qquad \hat\tau_{\mathrm{heter}} = \sum_{k=1}^K \eta_k \hat\tau_k(\hat e_k), \tag{1}$$

and we propose

$$\hat\tau_{\mathrm{CLB}} = \sum_{k=1}^K \hat\tau_k\Big(\sum_{r=1}^K \eta_r \hat e_r\Big). \tag{2}$$

Our method outperforms previous ones in several ways: first, it is the first method that allows collaboration across disjoint domains without additional assumptions; second, it achieves better accuracy than meta-analysis; third, it remains stable even as the heterogeneity between sites increases, which encourages collaboration from a broader range of sources. We provide theory and experiments to demonstrate these claims.

2. Problem Setup

We use S = [K] to denote the set of sites, with D^{(k)} being the dataset of site k.
Let Z be the binary treatment, X ∈ R^d the covariates with dimension d, and Y the outcome. Let Y(z) be the potential outcome under treatment z ∈ {1, 0}. Classical causal inference only copes with the biased sampling of Z. However, we need to cope with multiple sites. We first present a motivating example from meta-analysis to model the actual data-generating procedure.

Example 1 (Collaboration of Clinical Trials). Koesters et al. (2013) review clinical trials of Agomelatine, an antidepressant drug approved by the European Medicines Agency in 2009. The 13 included trials have different data sizes and demographic distributions. One study was carried out on individuals aged 60 or above; the remaining ones include all ages. Each study reports the mean difference in Hamilton Rating Scale for Depression (HRSD) scores between treatment and control groups.

The target group in Example 1 is the population of patients with depression. However, each clinical trial is a biased sample from the target population. Abstracting from this example, we propose a sampling-selecting framework for collaborative causal inference:

1. Sampling: Sample an individual i from the target distribution, and let S_i ∈ {∅} ∪ {(k, z) | k ∈ S, z ∈ {0, 1}} be the selection indicator. If S_i = (k, 1), individual i is selected into site k and treated. If S_i = ∅, the individual is eliminated from the dataset. Sample (X_i, Y_i(1), Y_i(0), S_i) i.i.d. from the target distribution according to Equation (3) and obtain the pooled dataset (4):

$$P(S = \emptyset \mid X) = e_\emptyset(X), \quad P(S = (k, z) \mid X) = e^{(k,z)}(X), \quad \text{with } e_\emptyset(X) + \sum_{k \in \mathcal{S}} \sum_{z \in \{0,1\}} e^{(k,z)}(X) = 1. \tag{3}$$

$$D_{\mathrm{Meta}} = \{(X_i, Y_i(1), Y_i(0), S_i) \mid i \in [N]\}. \tag{4}$$

2. Selecting: We use Z(S_i) to denote the treatment indicator corresponding to S_i. We follow the potential outcome framework and invoke the Stable Unit Treatment Value Assumption (SUTVA). Therefore, Y_i = Z(S_i) Y_i(1) + {1 − Z(S_i)} Y_i(0).
Split D_Meta into sites and treatment/control groups according to S and get

$$D^{(k)} = \{(X_i, Y_i, Z_i) \in D_{\mathrm{Meta}} \mid S_i = (k, Z_i)\}. \tag{5}$$

Furthermore, census data commonly reflects the target distribution of covariates. Therefore, we assume there is a public dataset D^{(t)} that contains covariate information:

$$D^{(t)} = \{X_i \mid X_i \text{ drawn i.i.d. from the target distribution}\}. \tag{6}$$

Figure 1 visualizes the data-generating process: sites select from the target distribution heterogeneously.

Figure 1: Visualization of the data-generating process. Each site selects from the target distribution in a different way. The selection indicator S = (k, z) describes the sampling mechanism of the Z = z group in site k. Specifically, S = (k, 1) represents the left-skewed distribution; (k, 0) represents the right-skewed distribution; (ℓ, 1) represents the undercoverage bias; ∅ represents the discarded data.

Xiong et al. (2022) assume that P(S = ∅) = 0 and consider the pooled dataset, which might conflict with the real world. For example, if all sites include fewer men in their datasets, we would have P(S = ∅ | men) > P(S = ∅ | other genders). In contrast, we allow S = ∅ to reflect the biased sampling of the pooled dataset ∪_{k=1}^K D^{(k)} from the target population.

One may object that there is no real sampling-selecting process, since each site collects data independently. A possible answer is to recall the well-accepted quasi-experiment framework (Cook et al., 2002). For example, to understand the effect of gender on an outcome, the quasi-experiment framework imagines that individual X_i first gets i.i.d. sampled and then gets "treated" by gender G_i, although there is no actual gender assignment process. The sampling-selecting framework extends the quasi-experiment to multiple-site settings.

The objective is to estimate the average causal effect on the target distribution, τ = E[Y(1) − Y(0)]. A foundation for identifying τ is Assumption 1.
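To make the sampling-selecting mechanism concrete, the following small simulation draws individuals from a target distribution, assigns each a selection indicator S ∈ {∅} ∪ {(k, z)}, and splits the survivors into site datasets. All functional forms (softmax selection scores, linear potential outcomes) are our own illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pooled(n, K=2):
    """Simulate the sampling-selecting framework of Eqs. (3)-(5).

    Selection scores and outcome models below are illustrative choices."""
    X = rng.normal(size=(n, 2))                          # covariates from the target
    Y1 = X @ np.array([1.0, 0.5]) + rng.normal(size=n)   # potential outcomes
    Y0 = X @ np.array([0.3, 0.2]) + rng.normal(size=n)
    # softmax scores over {empty} + {(k, z)}: 2K + 1 categories, as in Eq. (3)
    logits = np.stack(
        [np.zeros(n)]
        + [0.5 * z + 0.3 * (k + 1) * X[:, 0] for k in range(K) for z in (0, 1)],
        axis=1,
    )
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    S = np.array([rng.choice(2 * K + 1, p=pi) for pi in p])  # 0 encodes S = empty
    sites = []
    for k in range(K):
        for z in (0, 1):
            idx = S == 1 + 2 * k + z                    # category of S = (k, z)
            Y = z * Y1[idx] + (1 - z) * Y0[idx]         # SUTVA: observed outcome
            sites.append((k, z, X[idx], Y))
    return sites
```

Because the category S = ∅ carries positive probability, the pooled site sample is strictly smaller than the target draw, mirroring N_S < N above.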
Assumption 1 (Homogeneity and unconfoundedness). We have that

$$(Y(1), Y(0)) \perp S \mid X. \tag{7}$$

More than unconfounded treatment assignment, Assumption 1 also implies that the individual treatment effects are the same across sites. In Example 1, when fixing X for an individual i, if the effect of Agomelatine still varies across sites, collaboration is meaningless due to unmeasured confounders. Violation of Assumption 1 is sometimes termed anti-causal learning (Farahani et al., 2020).

Another foundation for identifying the causal effect is the overlap assumption. There are two kinds of overlap assumptions. Given an individual, Assumption 2 requires each site to select them with non-zero probability, whereas Assumption 3 only requires the overall selection probability to be non-zero.

Assumption 2 (Individual-Overlap). We have that

$$\min_{x,\,k,\,z}\; e^{(k,z)}(x) > c > 0.$$

Assumption 3 (Overall-Overlap). We have that

$$\min_{x,\,z}\; \sum_{k=1}^K e^{(k,z)}(x) > c > 0.$$

Revisiting Example 1, Assumption 1 is guaranteed by the experimental design and by the similar effect of the drug given sufficient demographic information. Assumption 2 fails since one site only includes senior patients, while Assumption 3 holds since the other sites collect data from all ages. We provide a counter-example showing that Assumptions 1 and 3 might not hold.

Example 2 (Collaboration of Observational Studies with Unmeasured Confounders). Betthäuser et al. (2023) review observational studies regarding the learning deficits of school-aged children during COVID-19. Among the 42 included observational studies, four are from middle-income countries and the remaining are from high-income countries. Over half of the studies do not collect covariates, using the difference in mean grades before and after the pandemic.

Since over half of the studies do not collect covariates, there are unmeasured confounders. Therefore, Assumption 1 is unlikely to hold.
Moreover, if the target distribution is school-aged children from the entire world, both Assumptions 2 and 3 fail, since low-income countries are missing from the studies. We suggest avoiding collaboration in this case.

We need some additional notation. Denote E[Y_1] by μ_1 and E[Y_0] by μ_0. Define $N_S = \sum_{k=1}^K N^{(k)}$, with N^{(k)} = |D^{(k)}| being the sample size of dataset k. Note that N_S < N since we drop the individuals with S = ∅.

2.1. Related Work

There are extensive attempts in the meta-analysis literature to cope with heterogeneity (Borenstein et al., 2007; 2010; Higgins et al., 2009). For example, by assuming that the average treatment effect follows a normal distribution across sites, many propose using random effects models (Hedges & Vevea, 1998; Riley et al., 2011) instead of fixed effects models (Tufanaru et al., 2015). There are other approaches, such as using site-specific information to conduct meta-regression (van Houwelingen et al., 2002; Glynn & Quinn, 2010) or using quasi-likelihood (Tufanaru et al., 2015). More recently, Cheng & Cai (2021) propose a penalized method for integrating heterogeneous causal effects. However, all of these methods need strong parametric assumptions on the heterogeneity. It is still necessary to rely on qualitative understandings of heterogeneity based on summary statistics (Stroup, 2000).

The causal inference literature also has a growing interest in collaboration, with concerns about external validity (Concato et al., 2000; Rothwell, 2005; Colnet et al., 2023). Yang & Ding (2020) propose a Rao-Blackwellization method for combining RCTs and observational studies with unmeasured confounders to improve estimation efficiency. Recently, more works try to incorporate federated learning in causal inference (Xiong et al., 2022; Han et al., 2023a; Guo et al., 2023; Vo et al., 2023). Vo et al. (2022) propose adaptive kernel methods under the causal graph model. Several works focus on inference. For example, Xiong et al. (2022) and Hu et al.
(2022) assume homogeneous models and propose a collaboration framework that avoids direct data merging. Han et al. (2022; 2023b) consider heterogeneous sample selection under parametric distribution shift assumptions. Nevertheless, most new methods still fall under the framework of meta-analysis.

As a broader interest, our work also uses double machine learning (Chernozhukov et al., 2018; Athey & Imbens, 2019). It extends the doubly robust estimator (Bang & Robins, 2005; Glynn & Quinn, 2010; Funk et al., 2011) to non-parametric and machine learning methods (Huang et al., 2006; Sugiyama et al., 2007b;a; Wager & Athey, 2017; Tibshirani, 1996). We adopt it in particular to mitigate the hardness of estimating the density ratio (Farahani et al., 2020; Härdle et al., 2004).

3. Collaborative Inverse Propensity Score Weighting

The inverse propensity score weighting (IPW) estimator plays a central role in causal inference. We generalize it to the collaborative setting, viewing e^{(k,z)}(X) = P(S = (k, z) | X) as a generalized version of the propensity score. We begin with the oracle propensity score models and then discuss how to estimate them.

3.1. The CLB-IPW estimator

As a benchmark, consider the method where each site calculates its own IPW estimator for the ATE and the server takes the weighted sum, which is the standard method in meta-analysis. Since we assume the propensity score models are correct, it is not necessary to use an L1 penalty as in Han et al. (2022). Define

$$\hat\tau_{\mathrm{Meta}} = \sum_{k=1}^K \eta^{(k)} \big(\hat\mu^{(k)}_{\mathrm{Meta},1} - \hat\mu^{(k)}_{\mathrm{Meta},0}\big), \quad \text{with}$$

$$\hat\mu^{(k)}_{\mathrm{Meta},1} = \frac{1}{\hat N^{(k)}_{\mathrm{Meta},1}} \sum_{i \in D^{(k)}} \frac{Z_i Y_i}{e^{(k,1)}(X_i)}, \tag{8}$$

$$\hat\mu^{(k)}_{\mathrm{Meta},0} = \frac{1}{\hat N^{(k)}_{\mathrm{Meta},0}} \sum_{i \in D^{(k)}} \frac{(1 - Z_i)\, Y_i}{e^{(k,0)}(X_i)}, \tag{9}$$

$$\hat N^{(k)}_{\mathrm{Meta},1} = \sum_{i \in D^{(k)}} \frac{Z_i}{e^{(k,1)}(X_i)}, \quad \text{and} \tag{10}$$

$$\hat N^{(k)}_{\mathrm{Meta},0} = \sum_{i \in D^{(k)}} \frac{1 - Z_i}{e^{(k,0)}(X_i)}. \tag{11}$$

Equations (8)-(11) take the Hájek form (Little & Rubin, 2019), in which we use a consistent estimator N̂ in place of the sample size. This achieves better numerical stability and smaller variance than directly using the sample size.
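The Hájek normalisation in Eqs. (8)-(11) amounts to self-normalised inverse weighting, which is a two-line computation; a minimal numpy sketch for one site (the function name is ours):

```python
import numpy as np

def hajek_ate(Y, Z, e1, e0):
    """Hajek-form IPW estimate of the ATE within one site (Eqs. 8-11).

    e1, e0 are the arrays e^{(k,1)}(X_i), e^{(k,0)}(X_i); the weight sums
    play the role of the estimated sample sizes N-hat in Eqs. (10)-(11)."""
    w1, w0 = Z / e1, (1 - Z) / e0
    mu1 = (w1 * Y).sum() / w1.sum()   # Eq. (8): normalised by N-hat, not n
    mu0 = (w0 * Y).sum() / w0.sum()   # Eq. (9)
    return mu1 - mu0
```

With constant propensity scores this reduces to a difference in group means, and the estimate is invariant to rescaling e1 and e0 by a common constant, which is why identification of e^{(k,z)} up to a constant factor is enough.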
More importantly, as we will show later, we can only identify e^{(k,z)}(X) up to a constant factor; the Hájek form frees us from this identifiability issue. The best choice of η^{(k)} is inverse-variance weighting: denoting Var(τ̂^{(k)}_Meta) = (σ^{(k)}_Meta)², the optimal weights satisfy η^{(k)} ∝ (σ^{(k)}_Meta)^{-2}. See Cheng & Cai (2021) for more discussion.

Meta-IPW is designed for review studies (Borenstein et al., 2007) rather than collaboration: each site must be able to obtain a valid estimator on its own. But a single site commonly suffers from under-coverage of the entire population. Revisiting Example 1, one site only runs experiments on older people. Due to this under-coverage, it can never produce a valid ATE estimator for the entire population, so it is impossible to incorporate it into the Meta-IPW estimator.

Alternatively, we introduce the CLB-IPW estimator. CLB-IPW directly takes the weighted mean of the heterogeneous propensity score functions. Specifically, to estimate μ_1, we use

$$\hat\mu^{(k)}_{\mathrm{CLB},1} = \frac{1}{\hat N^{(k)}_{\mathrm{CLB},1}} \sum_{i \in D^{(k)}} \frac{\eta^{(k)} Z_i Y_i}{\sum_{r=1}^K \eta^{(r)} e^{(r,1)}(X_i)}, \quad \text{with} \quad \hat N^{(k)}_{\mathrm{CLB},1} = \sum_{i \in D^{(k)}} \frac{\eta^{(k)} Z_i}{\sum_{r=1}^K \eta^{(r)} e^{(r,1)}(X_i)}. \tag{12}$$

We have that

$$E\big[\hat\mu^{(k)}_{\mathrm{CLB},1}\big] = E\left[\frac{\eta^{(k)} e^{(k,1)}(X)\, Y_1}{\sum_{r=1}^K \eta^{(r)} e^{(r,1)}(X)}\right],$$

which means that it is not consistent for μ_1 by itself. However, summing μ̂^{(k)}_{CLB,1} across k, we get

$$E\big[\hat\mu_{\mathrm{CLB},1}\big] = E\left[\frac{\sum_{k=1}^K \eta^{(k)} e^{(k,1)}(X)\, Y_1}{\sum_{r=1}^K \eta^{(r)} e^{(r,1)}(X)}\right] = E[Y_1] = \mu_1,$$

where the expectations are over the target distribution with the N̂-normalisation suppressed. This allows collaboration between disjoint domains. In Example 1, the site that only includes elders can compute μ̂^{(k)}_{CLB,1} without worrying about its under-coverage: given a young patient X from another site, we have e^{(k,1)}(X) = 0 but e^{(r,1)}(X) > 0 for some r ≠ k, which ensures a non-zero denominator. The estimators for μ_0 follow in the same manner; we relegate them to the appendix. We can compute τ̂_CLB in a fully federated way, as presented in Algorithm 1.

Algorithm 1 CLB-IPW Algorithm
Require: K datasets D^{(k)} as defined in Equation (5). Each site publishes its propensity score models e^{(k,1)}(X) and e^{(k,0)}(X).
1: for k = 1 to K do
2:   At site k, calculate μ̂^{(k)}_{CLB,1}, μ̂^{(k)}_{CLB,0}, N̂^{(k)}_{CLB,1}, and N̂^{(k)}_{CLB,0} according to Equation (12). Send them to the central server.
3: end for
4: Central server computes

$$\hat\tau_{\mathrm{CLB}} = \hat\mu_{\mathrm{CLB},1} - \hat\mu_{\mathrm{CLB},0}, \tag{13}$$

where μ̂_{CLB,1} is the average of the μ̂^{(k)}_{CLB,1} weighted by N̂^{(k)}_{CLB,1}, with μ̂_{CLB,0} following the same manner.

The best choice of η^{(k)} is data-dependent and thus cannot be obtained from one round of communication. Therefore, we suggest taking vanilla weights η^{(k)} = 1 for all k. Notice that

$$\sum_{k=1}^K e^{(k,1)}(X) = P(Z(S) = 1 \mid X), \tag{14}$$

which means that the vanilla weights match the propensity score for Z in the pooled dataset. More importantly, we find that the vanilla weights already make the CLB-IPW estimator uniformly better than the Meta-IPW estimator.

Proposition 1 (Meta-IPW Estimator). Given Assumptions 1 and 2, using inverse-variance weighting, as N → ∞, we have that

$$\sqrt{N}\,(\hat\tau_{\mathrm{Meta}} - \tau) \to_d N(0, v^2_{\mathrm{Meta}}), \quad v^2_{\mathrm{Meta}} = \left\{\sum_{k=1}^K E\left[\frac{(Y_1 - \mu_1)^2}{e^{(k,1)}(X)} + \frac{(Y_0 - \mu_0)^2}{e^{(k,0)}(X)}\right]^{-1}\right\}^{-1}.$$

Theorem 2 (CLB-IPW Estimator). Given Assumptions 1 and 3, using vanilla weights for CLB-IPW, as N → ∞, we have that

$$\sqrt{N}\,(\hat\tau_{\mathrm{CLB}} - \tau) \to_d N(0, v^2_{\mathrm{CLB}}), \tag{15}$$

$$v^2_{\mathrm{CLB}} = E\left[\frac{(Y_1 - \mu_1)^2}{\sum_{k=1}^K e^{(k,1)}(X)} + \frac{(Y_0 - \mu_0)^2}{\sum_{r=1}^K e^{(r,0)}(X)}\right].$$

Moreover, we have that v²_CLB ≤ v²_Meta.

There are two ways to understand why τ̂_CLB is better. First, Meta-IPW takes the weighted mean site-wise, whereas CLB-IPW takes the weighted mean individual-wise: given each individual X_i, CLB-IPW adaptively puts more weight on sites with larger e^{(k,z)}(X_i), whereas Meta-IPW uses the same weights for every X_i. Second, CLB-IPW utilizes coarser balancing scores (Imbens & Rubin, 2015). The balancing score is a generalization of the propensity score: any function of the covariates that is sufficient for adjusting the confounding between Z and Y. Meta-IPW uses P(S | X) as its inverse weights, and CLB-IPW uses P(Z(S) | X). Theorem 3 shows that they are both balancing scores.

Theorem 3.
We have that

$$(Y(1), Y(0)) \perp Z(S) \mid P(S \mid X), \quad \text{and} \quad (Y(1), Y(0)) \perp Z(S) \mid P(Z(S) \mid X).$$

Notice that P(S | X) carries an auxiliary site variable k(X) compared to P(Z(S) | X). But k(X) is superfluous, since it does not affect (Y_1, Y_0). As a result, CLB-IPW gains efficiency by maintaining a smaller model: a simpler model leaves fewer variables to adjust for, thus attaining better efficiency. Similar ideas occur extensively in the model selection literature (Raschka, 2020).

3.2. Estimation of propensity score models

We start from the identification of e^{(k,z)}(X). Since we have no information on the dropped set D_∅ = {i | S_i = ∅}, it is impossible to identify all parameters. For instance, multiplying N by a factor of 2 and dividing each e^{(k,z)}(X) by 2 would lead to the same observed distribution. However, identifiability is guaranteed up to a constant factor, and thanks to the Hájek form of our IPW estimators, identification up to a constant is enough.

Proposition 4. We have that

$$e^{(k,z)}(X) = r^{(k,z)}(X)\; P(S = (k, z) \mid S \neq \emptyset)\; P(S \neq \emptyset),$$

where r^{(k,z)}(X) = p(X | S = (k, z))/p(X) is the density ratio function, which is identifiable. Meanwhile, P(S = (k, z) | S ≠ ∅) is identifiable by taking N^{(k)}/N_S. Only P(S ≠ ∅) is not identifiable.

We focus on estimating the density ratio r^{(k,z)}(X). We suggest two methods from the large literature on density ratio estimation. Han et al. (2022) apply a parametric exponential tilting model: they assume that r^{(k,z)}(X) = exp(ψ(X)^⊤ γ^{(k,z)}) for a given representation function ψ (such as ψ(x) = x) and an unknown parameter γ^{(k,z)}. We can estimate γ^{(k,z)} through the method of moments, i.e., finding γ̂^{(k,z)} that solves

$$\sum_{i \in D^{(k)}} Z_i\, \psi(X_i) \;\propto\; \sum_{i \in D^{(t)}} \psi(X_i)\, \exp\big(\psi(X_i)^\top \gamma^{(k,z)}\big),$$

which is equivalent to entropy balancing (Zhao & Percival, 2017). Recently, motivated by matching (Abadie & Imbens, 2016) and K-nearest neighbours (Zhang et al., 2018), Lin et al. (2021) propose a minimax nonparametric way to estimate the density ratio.
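Before turning to that nonparametric alternative, note that the exponential-tilting moment equations above can be solved with a few Newton steps, since they are the first-order conditions of a convex log-partition problem. The self-normalised weights below reflect that γ is only pinned down up to the overall constant; this is our own illustrative solver, not the implementation of Han et al. (2022).

```python
import numpy as np

def tilt_gamma(X_site, X_target, psi=lambda x: x, iters=50):
    """Fit r(x) = exp(psi(x)^T gamma) by matching the tilted target-sample
    mean of psi to the site-sample mean of psi (entropy balancing)."""
    Ps, Pt = psi(X_site), psi(X_target)
    site_mean = Ps.mean(axis=0)
    gamma = np.zeros(Pt.shape[1])
    for _ in range(iters):
        w = np.exp(Pt @ gamma)
        w /= w.sum()                                     # self-normalised tilt weights
        m = Pt.T @ w                                     # tilted mean of psi
        cov = (Pt * w[:, None]).T @ Pt - np.outer(m, m)  # Jacobian of m(gamma)
        gamma += np.linalg.solve(cov + 1e-10 * np.eye(len(m)), site_mean - m)
    return gamma
```

As a sanity check, exponentially tilting a standard Gaussian by exp(γx) shifts its mean to γ, so when the site's covariates are a mean-shifted Gaussian the recovered γ should be close to the shift.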
Using their method, we have

$$\hat r^{(k,z)}(x) \;\propto\; W(x; D^{(t)}, D^{(k,z)}), \tag{16}$$

where W(x; D^{(t)}, D^{(k,z)}) is the number of units X_i in D^{(t)} such that x is closer to X_i than X_i's M-th nearest neighbour in D^{(k,z)}. Since our Hájek-form estimators only require r^{(k,z)} up to a constant, we omit the normalising factor involving N^{(t)} and M; see Lin et al. (2021) for the exact form. We have the following convergence rates for the two methods.

Proposition 5 (Point-wise error of density ratio estimation). Given x ∈ R^d, if the exponential tilting model is correctly specified, we have that

$$E\big[\,\big|\exp\big(\psi(x)^\top \hat\gamma^{(k,z)}\big) - r^{(k,z)}(x)\big|\,\big] = O(N^{-1/2}). \tag{17}$$

For the nonparametric method, we have that

$$E\big[\,\big|\hat r^{(k,z)}(x) - r^{(k,z)}(x)\big|\,\big] = O(N^{-1/(2+d)}). \tag{18}$$

4. Incorporating Outcome Models

Density ratio estimation is challenging and can easily fail under mis-specification or due to the curse of dimensionality. Therefore, it is essential to incorporate outcome models to mitigate the errors caused by density ratio estimation. To keep the structure consistent with Section 3, we first discuss how to incorporate outcome models into the estimator and then discuss how to learn the outcome models.

4.1. Decoupled AIPW estimator

The augmented inverse propensity score weighted (AIPW) estimator (Bang & Robins, 2005) employs Neyman orthogonality to construct an asymptotically normal estimator even if the nuisance models converge at slower rates. We introduce this idea to the collaboration setting.

How to use outcome models? Due to the biased selection of S, directly averaging fitted outcome models across all source data renders the estimator inconsistent. A natural idea is to use the inverse propensity score to adjust the distribution and take

$$\hat\tau_{\mathrm{adjust}} = \frac{1}{N_S} \sum_{k=1}^K \sum_{i \in D^{(k)}} \left[\frac{\hat m_1(X_i)}{\hat e^{(k,1)}(X_i)} - \frac{\hat m_0(X_i)}{\hat e^{(k,0)}(X_i)}\right].$$

This is the choice of Han et al. (2022). However, the consistency of τ̂_adjust depends substantially on the density ratio function, making the regression model useless. Alternatively, we make use of the public census dataset D^{(t)}. As discussed in Section 2, D^{(t)} provides public information about X in the target distribution.
Utilizing it, we propose a decoupled AIPW estimator:

$$\hat\tau_{\mathrm{AIPW}} = \frac{1}{N^{(t)}} \sum_{i \in D^{(t)}} \big[\hat m_1(X^{(t)}_i) - \hat m_0(X^{(t)}_i)\big] + \sum_{k=1}^K \hat\delta^{(k)}_{\mathrm{AIPW}}, \tag{19}$$

with δ̂^{(k)}_AIPW having two versions:

$$\hat\delta_{\mathrm{Meta\text{-}AIPW}} = \sum_{k=1}^K \eta^{(k)} \big[\hat\delta^{(k)}_{\mathrm{Meta\text{-}AIPW},1} - \hat\delta^{(k)}_{\mathrm{Meta\text{-}AIPW},0}\big],$$

$$\hat\delta_{\mathrm{CLB\text{-}AIPW}} = \sum_{k=1}^K \hat w^{(k)}_{\mathrm{CLB},1}\, \hat\delta^{(k)}_{\mathrm{CLB\text{-}AIPW},1} - \sum_{k=1}^K \hat w^{(k)}_{\mathrm{CLB},0}\, \hat\delta^{(k)}_{\mathrm{CLB\text{-}AIPW},0},$$

with $\hat w^{(k)}_{\mathrm{CLB},1} \propto \hat N^{(k)}_{\mathrm{CLB},1}$ and $\hat w^{(k)}_{\mathrm{CLB},0} \propto \hat N^{(k)}_{\mathrm{CLB},0}$. Here δ̂^{(k)}_{Meta-AIPW} and δ̂^{(k)}_{CLB-AIPW} are residual versions of the corresponding IPW estimators, replacing every Y with Y − m(X) in the formulas. We only present the formulas for the δ̂_1's and relegate the δ̂_0's to the appendix:

$$\hat\delta^{(k)}_{\mathrm{Meta\text{-}AIPW},1} = \frac{1}{\hat N^{(k)}_{\mathrm{Meta},1}} \sum_{i \in D^{(k)}} \frac{Z_i\, [Y_i - \hat m_1(X_i)]}{e^{(k,1)}(X_i)},$$

$$\hat\delta^{(k)}_{\mathrm{CLB\text{-}AIPW},1} = \frac{1}{\hat N^{(k)}_{\mathrm{CLB},1}} \sum_{i \in D^{(k)}} \frac{Z_i\, [Y_i - \hat m_1(X_i)]}{\sum_{r=1}^K e^{(r,1)}(X_i)}.$$

The proposed estimator computes the difference in means of the outcome models only on D^{(t)} and the correction terms only on the D^{(k)}'s. Though decoupled, it preserves the robustness of the AIPW estimator. We summarize its properties in Theorem 6.

Theorem 6. Suppose that

1. The estimated models m̂_1, m̂_0, and ê are independent¹ of D^{(t)} and the D^{(k)}'s.
2. They have convergence rates

$$E[\|\hat m_1 - m_1\|^2],\; E[\|\hat m_0 - m_0\|^2] = O(1/N^{\xi_m}), \tag{20}$$

and

$$E[\|\hat e - e\|^2] = O(1/N^{\xi_e}), \tag{21}$$

with ξ_m + ξ_e > 1/2.

3. The models ê, m̂, e, and m are bounded.

Further supposing that N^{(t)}/N_S → λ, we have that

$$\sqrt{N}\,(\hat\tau_{\mathrm{CLB\text{-}AIPW}} - \tau) \to_d N(0, v^2_{\mathrm{CLB\text{-}AIPW}}), \tag{22}$$

$$v^2_{\mathrm{CLB\text{-}AIPW}} = \lambda^{-1} E\big[[m_1(X) - m_0(X)]^2\big] - \lambda^{-1}\tau^2 + E\left[\frac{(Y_1 - m_1(X))^2}{P(Z(S) = 1 \mid X)} + \frac{(Y_0 - m_0(X))^2}{P(Z(S) = 0 \mid X)}\right].$$

¹We can achieve independence by using sample splitting; see Chernozhukov et al. (2018) for a more detailed discussion.

The assumptions in Theorem 6 are standard in the literature (Chernozhukov et al., 2018; Athey & Wager, 2020). If we use the K-NN density ratio estimation (Lin et al., 2021), we get ξ_e = 2/(2 + d). Therefore, taking any outcome model with ξ_m ≥ 1/2 − 2/(2 + d) guarantees the asymptotic normality of τ̂_CLB-AIPW.

4.2.
Estimation of outcome models

It is worth noting that the convergence rates in Equation (20) are averaged over the target population. To achieve low excess risk on the target population, we adopt the domain adaptation component of orthogonal statistical learning (Foster & Syrgkanis, 2020). Consider the loss function reweighted through inverse propensity scores:

$$L(m_1; \{D^{(k)}\}_{k \in \mathcal{S}}) = \sum_{k=1}^K L^{(k)}(m_1; D^{(k)}), \quad \text{with} \quad L^{(k)} = \sum_{i \in D^{(k)}} \frac{Z_i\, \ell(Y_i, m_1(X_i))}{\sum_{r=1}^K \hat e^{(r,1)}(X_i)}. \tag{23}$$

We want to compare it with training directly on the target distribution, i.e., using the loss function

$$\tilde L = \sum_{i=1}^N \ell(Y_i(1), m_1(X_i)). \tag{24}$$

Theorem 7. Suppose that

1. The estimated propensity score model ê(X) satisfies Equation (21).
2. Using loss function (24), m̂_1 satisfies Equation (20).

Then, using loss function (23), we have that

$$E[\|\hat m_1(X) - m_1(X)\|^2] \leq O(1/N^{\xi_m}) + O(1/N^{4\xi_e}). \tag{25}$$

4.3. Federated Learning Algorithm

The estimation of the outcome model requires federated learning. We can optimize the loss function using FedAvg (Li et al., 2020) or SCAFFOLD (Karimireddy et al., 2020). We present the full process, including computing τ̂_AIPW, in Algorithm 2.

Algorithm 2 CLB-AIPW Algorithm
Require: K datasets {D^{(k)}}_{k∈S} and D^{(t)}.
1: (Locally) estimate ê^{(k,z)}(X).
2: while not converged do
3:   Train models m_1 and m_0 using the FedAvg algorithm with the loss function in (23).
4: end while
5: (Locally) update the Y_i's by Y_i ← Y_i − m̂_{Z_i}(X_i).
6: Use Algorithm 1 to get $\sum_{k=1}^K \hat\delta^{(k)}_{\mathrm{CLB\text{-}AIPW}}$.
7: Construct the CLB-AIPW estimator using Equation (19).

As a result, using Algorithm 2 and combining Theorems 6 and 7, τ̂_CLB-AIPW is asymptotically normal given that ξ_m + ξ_e > 1/2 and 5ξ_e > 1. Using Proposition 5, it suffices to use the K-NN density ratio estimation method with d < 8 and find an outcome model with ξ_m ≥ 1/2 − 2/(2 + d). This avoids the problem of mis-specification of the exponential tilting model.
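The pipeline of Algorithm 2 can be sketched end-to-end in a few dozen lines. Here closed-form weighted least squares on a shared linear model stands in for FedAvg training, known propensity scores stand in for step 1, and covariates are one-dimensional for brevity; all of these modelling choices are illustrative, not the paper's.

```python
import numpy as np

def fit_outcome(sites, e_funcs, z):
    """Steps 2-4 of Algorithm 2, collapsed: minimise the reweighted loss of
    Eq. (23) for m_z via weighted least squares on a linear model.
    `sites[k] = (X, Y, Z)`; `e_funcs[r][z](X)` returns e^{(r,z)}(X)."""
    K = len(e_funcs)
    Xs, Ys, Ws = [], [], []
    for X, Y, Z in sites:
        pooled = sum(e_funcs[r][z](X) for r in range(K))  # sum_r e^{(r,z)}(X_i)
        keep = Z == z
        Xs.append(X[keep]); Ys.append(Y[keep]); Ws.append(1.0 / pooled[keep])
    X, Y, W = np.concatenate(Xs), np.concatenate(Ys), np.concatenate(Ws)
    A = np.column_stack([np.ones(len(X)), X])             # intercept + covariate
    b = np.linalg.solve(A.T @ (W[:, None] * A), A.T @ (W * Y))
    return lambda x: b[0] + b[1] * x

def clb_aipw(X_target, sites, e_funcs):
    """Steps 5-7: census-sample plug-in plus Hajek-weighted residual
    corrections, i.e. the decoupled Eq. (19) with vanilla weights."""
    m = {z: fit_outcome(sites, e_funcs, z) for z in (0, 1)}
    est = (m[1](X_target) - m[0](X_target)).mean()        # plug-in on D^(t)
    K = len(e_funcs)
    for z, sign in ((1, 1.0), (0, -1.0)):
        num = den = 0.0
        for X, Y, Z in sites:
            pooled = sum(e_funcs[r][z](X) for r in range(K))
            w = np.where(Z == z, 1.0, 0.0) / pooled
            num += (w * (Y - m[z](X))).sum()              # residual corrections
            den += w.sum()
        est += sign * num / den
    return est
```

When the outcome models fit the data exactly, the correction terms vanish and the estimate reduces to the census-sample plug-in, which is one way to see the robustness of the decoupled construction.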
It is worth noting that our discussion of AIPW is from the point of view of learning theory. If we adopt the classical double robustness framework, when the outcome model is correctly specified there is no need to adjust the distribution of the covariates: the AIPW estimator is asymptotically normal even when the propensity score model completely fails. We demonstrate this robustness in the simulation.

5. Experiments

5.1. Synthetic Dataset

We conduct experiments on a synthetic dataset. Using fixed sample sizes, we generate the data separately for the different sites. Consider three source datasets with N^{(k)} = 1000, 2000, 3000. The target dataset contains N^{(t)} = 10000 data points. Specifically, we generate the target distribution through X ~ N(μ^{(t)}, σ²I_3) with μ^{(t)} = 0.1 and σ = 2. In the source datasets, we fix the treatment assignment mechanism and take the true propensity score as P(Z^{(k)} = 1 | X^{(k)}) = 1/[1 + exp([1.2; 0.3; 1.2]^⊤ X^{(k)})].

Figure 2: KL-MSE curve of different estimators. The heterogeneity is measured through the mean KL-divergence between source and target datasets. The Meta-IPW estimator outperforms estimators from a single site, but its error increases with the heterogeneity. In contrast, the CLB-IPW and DML estimators remain stable as heterogeneity increases.

We take the true potential outcomes as Y(1) = [1.2; 1.8; 1.4]^⊤ X^{(k)} and Y(0) = [0.6; 0.7; 0.6]^⊤ X^{(k)}. We also choose normal distributions for the source datasets: X^{(k)} ~ N(μ^{(k)}, σ²I_3) with σ = 2. We use the mean KL divergence between the source datasets and the target dataset as a measure of the heterogeneity across sites, given by

$$\bar d_{\mathrm{KL}}\big(D^{(t)}, \{D^{(k)}\}_{k \in [3]}\big) = \frac{1}{3} \sum_{k=1}^{3} \frac{\|\mu^{(k)} - \mu^{(t)}\|^2}{2\sigma^2}.$$

We increase d̄_KL from 0 to 4. Fixing each d̄_KL, we choose the μ^{(k)} uniformly and randomly assign a negative sign to one of them.
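The heterogeneity measure above is just the closed-form KL divergence between equal-covariance Gaussians, averaged over sites:

```python
import numpy as np

def mean_kl(mu_target, site_means, sigma=2.0):
    """Mean KL divergence of Section 5.1: for N(mu_k, sigma^2 I) versus
    N(mu_target, sigma^2 I), KL = ||mu_k - mu_target||^2 / (2 sigma^2)."""
    site_means = np.atleast_2d(site_means)
    return float(np.mean(
        [np.sum((m - mu_target) ** 2) / (2.0 * sigma**2) for m in site_means]
    ))
```

For equal covariances the cross-entropy terms cancel and only the squared mean shift remains, which is why the experiment can dial heterogeneity up and down by moving the μ^{(k)} alone.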
In the estimation process, we use the exponential tilting model for density ratio estimation and a linear model for outcome regression. We calculate the mean squared error (MSE) of Meta-IPW, CLB-IPW, Meta-AIPW, and CLB-AIPW through 2000 Monte Carlo simulations, with four replications of different {μ^{(k)}}'s. Figure 2 shows the d̄_KL-MSE curve, with the IPW estimators of each single site marked with dotted lines. Although outperforming each individual site, the Meta-IPW estimator still suffers from increasing heterogeneity. In contrast, both CLB-IPW and the AIPW estimators remain stable when heterogeneity increases.

We further demonstrate the robustness of the AIPW estimator with four combinations of specifications of the propensity score (PS) and outcome (OM) models. We relegate the details of the mis-specified models to the appendix. Figure 3 shows the 95% confidence intervals of the Meta-IPW, CLB-IPW, Meta-AIPW, and CLB-AIPW estimators; we choose the case with the mean KL-distance equal to 3.

Figure 3: The 95% confidence intervals for the synthetic dataset, across four panels: true PS with true OM, true PS with false OM, false PS with true OM, and false PS with false OM. CLB-IPW shows smaller variance than Meta-IPW under all scenarios. The AIPW estimators remain consistent when either the PS or the OM model is correctly specified.

In all cases, the CLB-IPW estimator has tighter confidence intervals. When the propensity score model is mis-specified, both Meta-IPW and CLB-IPW fail due to incorrect weighting. In contrast, the AIPW estimators remain consistent as long as the outcome model is correct. When both models are mis-specified, there is no hope of obtaining a consistent result.

5.2. Real-world application

We present a real-world application of our method. Our data come from two studies about preventing the sharing of fake news during COVID-19. Roozenbeek et al. (2021) replicate the experiment of Pennycook et al. (2020) to study the effect of a nudge intervention on preventing the sharing of fake news.
Both studies sample participants according to the U.S. census through online platforms. The outcome is measured by the difference in sharing intentions between true and false headlines about COVID-19 (the truth discernment score). Pennycook et al. (2020) find that a simple accuracy reminder could increase the truth discernment score (τ̂ = 0.034, p < 0.001). Using the same design and analysis procedures, Roozenbeek et al. (2021) replicate their findings, though with a less significant effect size (τ̂ = 0.015, p ≈ 0.017). Although both studies try to sample from the target distribution and their heterogeneity is well-controlled, as suggested by Jin et al. (2023), we still use the exponential tilting method to adjust for the covariate shift. We adjust the distribution for the mean and variance of the Cognitive Reflection Test (CRT) score, the scientific knowledge quiz score, the Medical Maximizer-Minimizer Scale (MMS), the distribution of self-reported political leanings, gender, and age.

Figure 4: The 95% confidence intervals for the real dataset. The red dots mark the true effect sizes. τ_1 denotes the estimated causal effect in Pennycook et al. (2020), and τ_2 that in Roozenbeek et al. (2021). We find that the Meta-IPW, CLB-IPW, Meta-AIPW, and CLB-AIPW estimators have similar performance, with Meta-IPW and Meta-AIPW showing slightly larger effect sizes.

Figure 4 presents the 95% confidence intervals for the two datasets and the estimators. Because the two datasets are close, we find close results, but the CLB-IPW and AIPW estimators show slightly larger effect sizes, matching the conclusion of the original study.

6. Conclusion

In this work, we propose a collaborative inverse propensity score estimator suitable for heterogeneous data. Along the way, we introduce the sampling-selecting framework to describe the heterogeneity across sites.
We show that the CLB-IPW estimator outperforms meta-analysis-based estimators both in theory and in simulation. To account for the difficulty of density estimation, we borrow ideas from the AIPW and orthogonal statistical learning literature and provide the necessary convergence rates for the nuisance models. As a future direction, it is worthwhile to explore communication-efficient methods for the optimal weighting of propensity score models.

Impact Statement

Getting an unbiased sample from the target distribution is crucial to developing a machine learning model or making inferences. However, nearly all datasets suffer from under-representation or bias in sampling the target population. Collaboration between heterogeneous data sites is a big challenge, both methodologically and because of privacy concerns. Our work shows the possibility of collaboration that does not suffer from heterogeneity and protects privacy. We believe it can serve as a starting point for encouraging collaboration across extremely diverse sources. The whole community could benefit from broader data sources while having private information well preserved.

References

Abadie, A. and Imbens, G. W. Matching on the Estimated Propensity Score. Econometrica, 84(2):781-807, 2016. doi: 10.3982/ECTA11293.

Athey, S. and Imbens, G. Machine Learning Methods Economists Should Know About, March 2019. arXiv:1903.10075 [econ, stat].

Athey, S. and Wager, S. Policy Learning with Observational Data, September 2020. arXiv:1702.02896 [cs, econ, math, stat].

Bang, H. and Robins, J. M. Doubly Robust Estimation in Missing Data and Causal Inference Models. Biometrics, 61(4):962-973, 2005. doi: 10.1111/j.1541-0420.2005.00377.x.
Betthäuser, B. A., Bach-Mortensen, A. M., and Engzell, P. A systematic review and meta-analysis of the evidence on learning during the COVID-19 pandemic. Nature Human Behaviour, 7(3):375–385, 2023. doi: 10.1038/s41562-022-01506-4.

Borenstein, M., Hedges, L., and Rothstein, H. Introduction to Meta-Analysis. 2007.

Borenstein, M., Hedges, L. V., Higgins, J. P., and Rothstein, H. R. A basic introduction to fixed-effect and random-effects models for meta-analysis. Research Synthesis Methods, 1(2):97–111, 2010. doi: 10.1002/jrsm.12.

Cheng, D. and Cai, T. Adaptive combination of randomized and observational data, 2021. arXiv:2111.15012.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018. doi: 10.1111/ectj.12097.

Colnet, B., Mayer, I., Chen, G., Dieng, A., Li, R., Varoquaux, G., Vert, J.-P., Josse, J., and Yang, S. Causal inference methods for combining randomized trials and observational studies: a review, 2023. arXiv:2011.08047.

Concato, J., Shah, N., and Horwitz, R. I. Randomized, controlled trials, observational studies, and the hierarchy of research designs. New England Journal of Medicine, 342(25):1887–1892, 2000. doi: 10.1056/NEJM200006223422507.

Cook, T.
D., Campbell, D. T., and Shadish, W. Experimental and quasi-experimental designs for generalized causal inference, volume 1195. Houghton Mifflin, Boston, MA, 2002.

Farahani, A., Voghoei, S., Rasheed, K., and Arabnia, H. R. A brief review of domain adaptation, 2020. arXiv:2010.03978.

Foster, D. J. and Syrgkanis, V. Orthogonal statistical learning, 2020. arXiv:1901.09036.

Funk, M. J., Westreich, D., Wiesen, C., Stürmer, T., Brookhart, M. A., and Davidian, M. Doubly robust estimation of causal effects. American Journal of Epidemiology, 173(7):761–767, 2011. doi: 10.1093/aje/kwq439.

Glynn, A. N. and Quinn, K. M. An introduction to the augmented inverse propensity weighted estimator. Political Analysis, 18(1):36–56, 2010. doi: 10.1093/pan/mpp036.

Guo, Z., Li, X., Han, L., and Cai, T. Robust inference for federated meta-learning, 2023. arXiv:2301.00718.

Han, L., Hou, J., Cho, K., Duan, R., and Cai, T. Federated adaptive causal estimation (FACE) of target treatment effects, 2022. arXiv:2112.09313.

Han, L., Li, Y., Niknam, B. A., and Zubizarreta, J. R. Privacy-preserving, communication-efficient, and target-flexible hospital quality measurement, 2023a. arXiv:2203.00768.

Han, L., Shen, Z., and Zubizarreta, J. Multiply robust federated estimation of targeted average treatment effects, 2023b. arXiv:2309.12600.

Hedges, L. V.
and Vevea, J. L. Fixed- and random-effects models in meta-analysis. Psychological Methods, 3(4):486, 1998.

Higgins, J. P. T., Thompson, S. G., and Spiegelhalter, D. J. A re-evaluation of random-effects meta-analysis. Journal of the Royal Statistical Society Series A: Statistics in Society, 172(1):137–159, 2009. doi: 10.1111/j.1467-985X.2008.00552.x.

Hu, M., Shi, X., and Song, P. X.-K. Collaborative causal inference with a distributed data-sharing management, 2022. arXiv:2204.00857.

Huang, J., Gretton, A., Borgwardt, K., Schölkopf, B., and Smola, A. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, volume 19. MIT Press, 2006.

Härdle, W., Müller, M., Sperlich, S., Werwatz, A., et al. Nonparametric and semiparametric models, volume 1. Springer, 2004.

Imbens, G. W. and Rubin, D. B. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, 2015.

Jin, Y., Guo, K., and Rothenhäusler, D. Diagnosing the role of observable distribution shift in scientific replications, 2023. arXiv:2309.01056.

Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. SCAFFOLD: Stochastic controlled averaging for federated learning. In Proceedings of the 37th International Conference on Machine Learning, pp. 5132–5143. PMLR, 2020.

Koesters, M., Guaiana, G., Cipriani, A., Becker, T., and Barbui, C.
Agomelatine efficacy and acceptability revisited: systematic review and meta-analysis of published and unpublished randomised trials. British Journal of Psychiatry, 203(3):179–187, 2013. doi: 10.1192/bjp.bp.112.120196.

Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of FedAvg on non-IID data, 2020. arXiv:1907.02189.

Lin, Z., Ding, P., and Han, F. Estimation based on nearest neighbor matching: from density ratio to average treatment effect, 2021. arXiv:2112.13506.

Little, R. J. and Rubin, D. B. Statistical analysis with missing data, volume 793. John Wiley & Sons, 2019.

Pennycook, G., McPhetres, J., Zhang, Y., Lu, J. G., and Rand, D. G. Fighting COVID-19 misinformation on social media: Experimental evidence for a scalable accuracy-nudge intervention. Psychological Science, 31(7):770–780, 2020. doi: 10.1177/0956797620939054.

Raschka, S. Model evaluation, model selection, and algorithm selection in machine learning, 2020. arXiv:1811.12808.

Riley, R. D., Higgins, J. P. T., and Deeks, J. J. Interpretation of random effects meta-analyses. BMJ, 342:d549, 2011. doi: 10.1136/bmj.d549.

Roozenbeek, J., Freeman, A. L. J., and Linden, S. v. d. How accurate are accuracy-nudge interventions? A preregistered direct replication of Pennycook et al. (2020). Psychological Science, 32(7):1169–1178, 2021.
doi: 10.1177/09567976211024535.

Rothwell, P. M. External validity of randomised controlled trials: To whom do the results of this trial apply? The Lancet, 365(9453):82–93, 2005. doi: 10.1016/S0140-6736(04)17670-8.

Stroup, D. F. Meta-analysis of observational studies in epidemiology: A proposal for reporting. JAMA, 283(15):2008, 2000. doi: 10.1001/jama.283.15.2008.

Sugiyama, M., Krauledat, M., and Müller, K.-R. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(5), 2007a.

Sugiyama, M., Nakajima, S., Kashima, H., Buenau, P., and Kawanabe, M. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007b.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996. doi: 10.1111/j.2517-6161.1996.tb02080.x.

Tufanaru, C., Munn, Z., Stephenson, M., and Aromataris, E. Fixed or random effects meta-analysis? Common methodological issues in systematic reviews of effectiveness. JBI Evidence Implementation, 13(3):196, 2015. doi: 10.1097/XEB.0000000000000065.

van Houwelingen, H. C., Arends, L. R., and Stijnen, T.
Advanced methods in meta-analysis: multivariate approach and meta-regression. Statistics in Medicine, 21(4):589–624, 2002. doi: 10.1002/sim.1040.

Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

Vo, T. V., Bhattacharyya, A., Lee, Y., and Leong, T.-Y. An adaptive kernel approach to federated learning of heterogeneous causal effects. Advances in Neural Information Processing Systems, 35:24459–24473, 2022.

Vo, T. V., Lee, Y., and Leong, T.-Y. Federated learning of causal effects from incomplete observational data, 2023. arXiv:2308.13047.

Wager, S. and Athey, S. Estimation and inference of heterogeneous treatment effects using random forests, 2017. arXiv:1510.04342.

Xiong, R., Koenecke, A., Powell, M., Shen, Z., Vogelstein, J. T., and Athey, S. Federated causal inference in heterogeneous observational data, 2022. arXiv:2107.11732.

Yang, S. and Ding, P. Combining multiple observational data sources to estimate causal effects. Journal of the American Statistical Association, 115(531):1540–1554, 2020. doi: 10.1080/01621459.2019.1609973.

Zhang, S., Li, X., Zong, M., Zhu, X., and Wang, R. Efficient kNN classification with different numbers of nearest neighbors. IEEE Transactions on Neural Networks and Learning Systems, 29(5):1774–1785, 2018. doi: 10.1109/TNNLS.2017.2673241.

Zhao, Q. and Percival, D. Entropy balancing is doubly robust. Journal of Causal Inference, 5(1):20160010, 2017.
doi: 10.1515/jci-2016-0010. arXiv:1501.03571.

A.1. Preliminaries

Definition A.1. Given i.i.d. weights $w_i$ and outcomes $Y_i$, take their weighted sum $\hat{G} = \sum_{i=1}^n w_i Y_i$. We call an estimator of the "Hájek" type if it normalizes by $(\sum_{i=1}^n w_i)^{-1}$, and of the "Horvitz-Thompson" (HT) type if it normalizes by $(n \mathbb{E}[w])^{-1}$, i.e.,
$$\hat{\mu}_{\text{Hájek}} = \frac{1}{\sum_{i=1}^n w_i} \hat{G}, \qquad \hat{\mu}_{\text{HT}} = \frac{1}{n \mathbb{E}[w]} \hat{G}.$$

We begin by relating the asymptotic behavior of the Hájek-type IPW estimator to that of the HT type. Specifically, we have:

Lemma A.2. The Hájek-type weighted mean estimator is asymptotically equivalent to the centralized Horvitz-Thompson-type weighted mean estimator
$$\hat{\mu}_{\text{HT}} = \mu + \frac{1}{n \mathbb{E}[w]} \sum_{i=1}^n w_i (Y_i - \mu), \tag{26}$$
i.e., we have that $\sqrt{n}(\hat{\mu}_{\text{Hájek}} - \hat{\mu}_{\text{HT}}) = o_P(1)$.

Proof. We subtract $\mu$ from $\hat{\mu}_{\text{Hájek}}$ and get
$$\sqrt{n}(\hat{\mu}_{\text{Hájek}} - \mu) = \frac{\sqrt{n}}{\sum_{i=1}^n w_i} \sum_{i=1}^n w_i (Y_i - \mu) = \frac{1}{\sum_{i=1}^n w_i / n} \cdot \frac{1}{\sqrt{n}} \sum_{i=1}^n w_i (Y_i - \mu) = \frac{1}{\mathbb{E}[w]} \cdot \frac{1}{\sqrt{n}} \sum_{i=1}^n w_i (Y_i - \mu) + o_P(1) = \sqrt{n}(\hat{\mu}_{\text{HT}} - \mu) + o_P(1).$$
The step from the second to the third line combines the facts that $\sum_{i=1}^n w_i / n = \mathbb{E}[w] + o_P(1)$ and $\sum_{i=1}^n w_i (Y_i - \mu) / \sqrt{n} = O_P(1)$, by the law of large numbers and the CLT.

A.2. Proof of Proposition 1

We first define several useful intermediate quantities. We use $\hat{G}$ to denote un-normalized IPW summations and $\hat{N}$ to denote the estimated data sizes:
$$\hat{G}^{(k)}_{\text{Meta}} = \sum_{i \in D^{(k)}} \left[ \frac{Z_i Y_i}{e^{(k,1)}(X_i)} - \frac{(1 - Z_i) Y_i}{e^{(k,0)}(X_i)} \right], \quad \hat{G}^{(k)}_{\text{Meta},1} = \sum_{i \in D^{(k)}} \frac{Z_i Y_i}{e^{(k,1)}(X_i)}, \quad \hat{G}^{(k)}_{\text{Meta},0} = \sum_{i \in D^{(k)}} \frac{(1 - Z_i) Y_i}{e^{(k,0)}(X_i)}.$$
In the main paper, we use
$$\hat{\mu}^{(k)}_{\text{Meta},1} = \frac{1}{\hat{N}^{(k)}_{\text{Meta},1}} \hat{G}^{(k)}_{\text{Meta},1}, \qquad \hat{\mu}^{(k)}_{\text{Meta},0} = \frac{1}{\hat{N}^{(k)}_{\text{Meta},0}} \hat{G}^{(k)}_{\text{Meta},0}.$$

Proof. We first re-write $\hat{G}^{(k)}_{\text{Meta}}$ as
$$\hat{G}^{(k)}_{\text{Meta}} = \sum_i \left[ \frac{\mathbf{1}\{S_i = (k,1)\} Y_i}{e^{(k,1)}(X_i)} - \frac{\mathbf{1}\{S_i = (k,0)\} Y_i}{e^{(k,0)}(X_i)} \right].$$
Using Lemma A.2, we only need to consider
$$\hat{\tau}^{(k)}_{\text{Meta,HT}} - \tau = \frac{1}{n} \sum_i \left[ \frac{\mathbf{1}\{S_i = (k,1)\}(Y_i - \mu_1)}{e^{(k,1)}(X_i)} - \frac{\mathbf{1}\{S_i = (k,0)\}(Y_i - \mu_0)}{e^{(k,0)}(X_i)} \right].$$
Each summand has mean zero:
$$\mathbb{E}\left[ \frac{\mathbf{1}\{S = (k,1)\}(Y - \mu_1)}{e^{(k,1)}(X)} - \frac{\mathbf{1}\{S = (k,0)\}(Y - \mu_0)}{e^{(k,0)}(X)} \right] = \mathbb{E}\left[ \mathbb{E}\left[ \frac{P\{S = (k,1) \mid X\}(Y_1 - \mu_1)}{e^{(k,1)}(X)} - \frac{P\{S = (k,0) \mid X\}(Y_0 - \mu_0)}{e^{(k,0)}(X)} \,\Big|\, X \right] \right] = \mathbb{E}[Y_1 - \mu_1 - (Y_0 - \mu_0)] = 0.$$
We also have
$$\mathrm{Var}\left[ \frac{\mathbf{1}\{S = (k,1)\}(Y - \mu_1)}{e^{(k,1)}(X)} - \frac{\mathbf{1}\{S = (k,0)\}(Y - \mu_0)}{e^{(k,0)}(X)} \right] = \mathbb{E}\left[ \left( \frac{\mathbf{1}\{S = (k,1)\}(Y_1 - \mu_1)}{e^{(k,1)}(X)} - \frac{\mathbf{1}\{S = (k,0)\}(Y_0 - \mu_0)}{e^{(k,0)}(X)} \right)^2 \right] = \mathbb{E}\left[ \frac{\mathbf{1}\{S = (k,1)\}(Y_1 - \mu_1)^2}{e^{(k,1)}(X)^2} + \frac{\mathbf{1}\{S = (k,0)\}(Y_0 - \mu_0)^2}{e^{(k,0)}(X)^2} \right] = \mathbb{E}\left[ \frac{(Y_1 - \mu_1)^2}{e^{(k,1)}(X)} + \frac{(Y_0 - \mu_0)^2}{e^{(k,0)}(X)} \right].$$
Therefore, using the CLT, we get
$$\sqrt{N}(\hat{\tau}^{(k)}_{\text{Meta}} - \tau) \xrightarrow{d} N\big(0, (v^{(k)}_{\text{Meta}})^2\big), \tag{27}$$
$$(v^{(k)}_{\text{Meta}})^2 = \mathbb{E}\left[ \frac{(Y_1 - \mu_1)^2}{e^{(k,1)}(X)} + \frac{(Y_0 - \mu_0)^2}{e^{(k,0)}(X)} \right].$$
Therefore, we have
$$\sqrt{N}(\hat{\tau}_{\text{Meta}} - \tau) = \sum_{k=1}^K \eta^{(k)} \sqrt{N}(\hat{\tau}^{(k)}_{\text{Meta}} - \tau) \xrightarrow{d} N\left(0, \sum_{k=1}^K (\eta^{(k)})^2 (v^{(k)}_{\text{Meta}})^2\right),$$
and by the Cauchy-Schwarz inequality,
$$\sum_{k=1}^K (\eta^{(k)})^2 (v^{(k)}_{\text{Meta}})^2 \geq \left( \sum_{k=1}^K (v^{(k)}_{\text{Meta}})^{-2} \right)^{-1},$$
where the equality holds if and only if $\eta^{(k)} \propto (v^{(k)}_{\text{Meta}})^{-2}$.

A.3. Proof of Theorem 2

We first provide the entire formula for the CLB-IPW estimator. We define
$$\hat{G}^{(k)}_{\text{CLB},1} = \sum_{i \in D^{(k)}} \frac{Z_i Y_i}{\sum_{r=1}^K e^{(r,1)}(X_i)}, \qquad \hat{G}^{(k)}_{\text{CLB},0} = \sum_{i \in D^{(k)}} \frac{(1 - Z_i) Y_i}{\sum_{r=1}^K e^{(r,0)}(X_i)},$$
$$\hat{N}^{(k)}_{\text{CLB},1} = \sum_{i \in D^{(k)}} \frac{Z_i}{\sum_{r=1}^K e^{(r,1)}(X_i)}, \qquad \hat{N}^{(k)}_{\text{CLB},0} = \sum_{i \in D^{(k)}} \frac{1 - Z_i}{\sum_{r=1}^K e^{(r,0)}(X_i)}.$$
Then, we have
$$\hat{\tau}_{\text{CLB}} = \frac{\sum_{k=1}^K \hat{G}^{(k)}_{\text{CLB},1}}{\sum_{k=1}^K \hat{N}^{(k)}_{\text{CLB},1}} - \frac{\sum_{k=1}^K \hat{G}^{(k)}_{\text{CLB},0}}{\sum_{k=1}^K \hat{N}^{(k)}_{\text{CLB},0}},$$
where in the main paper we use
$$\hat{\mu}_{\text{CLB},1} = \frac{\sum_{k=1}^K \hat{G}^{(k)}_{\text{CLB},1}}{\sum_{k=1}^K \hat{N}^{(k)}_{\text{CLB},1}}, \qquad \hat{\mu}_{\text{CLB},0} = \frac{\sum_{k=1}^K \hat{G}^{(k)}_{\text{CLB},0}}{\sum_{k=1}^K \hat{N}^{(k)}_{\text{CLB},0}}.$$

Proof. We rewrite the formulas as
$$\hat{G}^{(k)}_{\text{CLB},1} = \sum_i \frac{\mathbf{1}\{S_i = (k,1)\} Y_i}{\sum_{r=1}^K e^{(r,1)}(X_i)}, \qquad \hat{G}^{(k)}_{\text{CLB},0} = \sum_i \frac{\mathbf{1}\{S_i = (k,0)\} Y_i}{\sum_{r=1}^K e^{(r,0)}(X_i)}. \tag{29}$$
As a result, we have
$$\sum_{k=1}^K \hat{G}^{(k)}_{\text{CLB},1} = \sum_i \frac{\mathbf{1}\{Z(S_i) = 1\} Y_i}{P(Z(S_i) = 1 \mid X_i)}, \qquad \sum_{k=1}^K \hat{G}^{(k)}_{\text{CLB},0} = \sum_i \frac{\mathbf{1}\{Z(S_i) = 0\} Y_i}{P(Z(S_i) = 0 \mid X_i)}.$$
Similarly, we get
$$\sum_{k=1}^K \hat{N}^{(k)}_{\text{CLB},1} = \sum_i \frac{\mathbf{1}\{Z(S_i) = 1\}}{P(Z(S_i) = 1 \mid X_i)}, \qquad \sum_{k=1}^K \hat{N}^{(k)}_{\text{CLB},0} = \sum_i \frac{\mathbf{1}\{Z(S_i) = 0\}}{P(Z(S_i) = 0 \mid X_i)}.$$
As a result, $\hat{N}_{\text{CLB},1}^{-1} \hat{G}_{\text{CLB},1} - \hat{N}_{\text{CLB},0}^{-1} \hat{G}_{\text{CLB},0}$ takes the form of a Hájek-type IPW estimator. Therefore, we can use Lemma A.2 and get the corresponding HT-type estimator, since we have
$$\mathbb{E}\left[ \frac{\mathbf{1}\{Z(S) = 1\}}{P(Z(S) = 1 \mid X)} \right] = \mathbb{E}\left[ \frac{P[Z(S) = 1 \mid X]}{P(Z(S) = 1 \mid X)} \right] = 1,$$
and the same result holds for the control group. The HT estimators satisfy
$$(\hat{\mu}_{\text{CLB},1,\text{HT}} - \hat{\mu}_{\text{CLB},0,\text{HT}}) - \tau = \frac{1}{N} \sum_i \left[ \frac{\mathbf{1}\{Z(S_i) = 1\}(Y_i - \mu_1)}{P(Z(S_i) = 1 \mid X_i)} - \frac{\mathbf{1}\{Z(S_i) = 0\}(Y_i - \mu_0)}{P(Z(S_i) = 0 \mid X_i)} \right].$$
We can apply the central limit theorem, since we have
$$\mathbb{E}\left[ \frac{\mathbf{1}\{Z(S) = 1\}(Y - \mu_1)}{P(Z(S) = 1 \mid X)} - \frac{\mathbf{1}\{Z(S) = 0\}(Y - \mu_0)}{P(Z(S) = 0 \mid X)} \right] = \mathbb{E}\left[ \frac{P(Z(S) = 1 \mid X)\,\mathbb{E}[Y_1 - \mu_1 \mid X]}{P(Z(S) = 1 \mid X)} - \frac{P(Z(S) = 0 \mid X)\,\mathbb{E}[Y_0 - \mu_0 \mid X]}{P(Z(S) = 0 \mid X)} \right] = \mathbb{E}\big[ \mathbb{E}[Y_1 - \mu_1 - Y_0 + \mu_0 \mid X] \big] = 0,$$
and
$$\mathrm{Var}\left[ \frac{\mathbf{1}\{Z(S) = 1\}(Y - \mu_1)}{P(Z(S) = 1 \mid X)} - \frac{\mathbf{1}\{Z(S) = 0\}(Y - \mu_0)}{P(Z(S) = 0 \mid X)} \right] = \mathbb{E}\left[ \left( \frac{\mathbf{1}\{Z(S) = 1\}(Y - \mu_1)}{P(Z(S) = 1 \mid X)} - \frac{\mathbf{1}\{Z(S) = 0\}(Y - \mu_0)}{P(Z(S) = 0 \mid X)} \right)^2 \right] = \mathbb{E}\left[ \frac{P(Z(S) = 1 \mid X)\,\mathbb{E}[(Y_1 - \mu_1)^2 \mid X]}{P(Z(S) = 1 \mid X)^2} + \frac{P(Z(S) = 0 \mid X)\,\mathbb{E}[(Y_0 - \mu_0)^2 \mid X]}{P(Z(S) = 0 \mid X)^2} \right] = \mathbb{E}\left[ \frac{(Y_1 - \mu_1)^2}{P(Z(S) = 1 \mid X)} + \frac{(Y_0 - \mu_0)^2}{P(Z(S) = 0 \mid X)} \right].$$
Using that $N_S / N \xrightarrow{P} P(S = \cdot)$, we get
$$\sqrt{N_S}\,(\hat{\tau}_{\text{CLB}} - \tau) \xrightarrow{d} N(0, v^2_{\text{CLB}}), \tag{30}$$
$$v^2_{\text{CLB}} = P(S = \cdot)\, \mathbb{E}\left[ \frac{(Y_1 - \mu_1)^2}{P(Z(S) = 1 \mid X)} + \frac{(Y_0 - \mu_0)^2}{P(Z(S) = 0 \mid X)} \right].$$
To compare $v^2_{\text{CLB}}$ and $v^2_{\text{Meta}}$, we first prove Lemma A.3.

Lemma A.3. The function $f(t_1, \ldots, t_K) = (t_1^{-1} + \cdots + t_K^{-1})^{-1}$ with $t_i > 0$, $i = 1, \ldots, K$, is concave.

Proof. We prove it directly by showing that its Hessian matrix is negative semi-definite. Denoting $\nabla^2 f = \{H_{kj}\}_{1 \leq k,j \leq K}$, we have
$$H_{kj} = \begin{cases} 2 t_k^{-4} \left( \sum_{r=1}^K t_r^{-1} \right)^{-3} - 2 t_k^{-3} \left( \sum_{r=1}^K t_r^{-1} \right)^{-2} & \text{if } k = j, \\ 2 t_k^{-2} t_j^{-2} \left( \sum_{r=1}^K t_r^{-1} \right)^{-3} & \text{if } k \neq j. \end{cases} \tag{32}$$
By taking out the common factor we get
$$\left( \sum_{r=1}^K t_r^{-1} \right)^3 \nabla^2 f(t_1, \ldots, t_K) = 2 \begin{pmatrix} t_1^{-2} \\ \vdots \\ t_K^{-2} \end{pmatrix} \begin{pmatrix} t_1^{-2} & \cdots & t_K^{-2} \end{pmatrix} - 2 \left( \sum_{r=1}^K t_r^{-1} \right) \mathrm{diag}\left( t_1^{-3}, t_2^{-3}, \ldots, t_K^{-3} \right).$$
The second term is negative definite.
The first term has only one non-zero eigenvalue, with corresponding eigenvector $v = (t_1^{-2}, \ldots, t_K^{-2})^\top$. We only need to verify that $v^\top \nabla^2 f\, v \leq 0$. We have
$$\left( \sum_{r=1}^K t_r^{-1} \right)^3 v^\top \nabla^2 f(t_1, \ldots, t_K)\, v = 2 \left( \sum_{k=1}^K t_k^{-4} \right)^2 - 2 \left( \sum_{k=1}^K t_k^{-1} \right) \left( \sum_{k=1}^K t_k^{-7} \right) \leq 0,$$
by the Cauchy-Schwarz inequality, since $t_k^{-4} = t_k^{-1/2} \cdot t_k^{-7/2}$.

for any $t > 0$ as $N \to \infty$. This proves that $\sqrt{N}\, \Delta_3 \xrightarrow{P} 0$. We can prove $\sqrt{N}\, \Delta_4 \xrightarrow{P} 0$ in the same manner. At last, for $\Delta_5$, we have
$$P\left( \sqrt{N}\, |\Delta_5| \geq t \right) \leq 2 \exp\left( - \frac{t^2 / 2}{\sum_{i \in D^{(t)}} \mathrm{Var}[\hat{m}_1(X_i) - m_1(X_i)] / N^{(t)} + M_Y t / \big(3 \sqrt{N^{(t)}}\big)} \right) \leq 2 \exp\left( - \frac{t^2 / 2}{\mathbb{E}\{[\hat{m}_1(X) - m_1(X)]^2\} + M_Y t / \big(3 \sqrt{N^{(t)}}\big)} \right) \leq 2 \exp\left( - \frac{t^2 / 2}{(N^{(t)})^{-2\xi_m} + (N^{(t)})^{-1/2} M_Y t / 3} \right) \to 0$$
for any $t > 0$ as $N \to \infty$. This proves that $\sqrt{N}\, \Delta_5 \xrightarrow{P} 0$. We can prove $\sqrt{N}\, \Delta_6 \xrightarrow{P} 0$ in the same manner. Combining $\Delta_1, \ldots, \Delta_6$ with Equation (37), we finish the proof of Lemma A.4.

By Lemma A.4, we only need to consider $\tilde{\tau}_{\text{CLB-AIPW}}$. Using the CLT, we have
$$\sqrt{N}(\tilde{\tau}_{\text{CLB-AIPW}} - \tau) \xrightarrow{d} N(0, v^2_{\text{CLB-AIPW}}), \tag{38}$$
$$\mathbb{E}(\tilde{\tau}_{\text{CLB-AIPW}}) = \mathbb{E}\left[ m_1(X^{(t)}) - m_0(X^{(t)}) \right] + \mathbb{E}\left[ \frac{Z(S)(Y_1 - m_1(X))}{P(Z(S) = 1 \mid X)} - \frac{(1 - Z(S))(Y_0 - m_0(X))}{P(Z(S) = 0 \mid X)} \right] = \mathbb{E}[Y_1 - Y_0],$$
$$v^2_{\text{CLB-AIPW}} = \mathrm{Var}\left[ \sqrt{N}(\tilde{\tau}_{\text{CLB-AIPW}} - \tau) \right] = \frac{N}{N^{(t)}} \mathrm{Var}\left[ m_1(X^{(t)}) - m_0(X^{(t)}) \right] + \mathrm{Var}\left[ \frac{Z(S)(Y_1 - m_1(X))}{P(Z(S) = 1 \mid X)} - \frac{(1 - Z(S))(Y_0 - m_0(X))}{P(Z(S) = 0 \mid X)} \right] = \lambda^{-1} \mathbb{E}\left[ [m_1(X) - m_0(X)]^2 \right] - \lambda^{-1} \tau^2 + \mathbb{E}\left[ \frac{(Y_1 - m_1(X))^2}{P(Z(S) = 1 \mid X)} + \frac{(Y_0 - m_0(X))^2}{P(Z(S) = 0 \mid X)} \right].$$
This proves Theorem 6.

A.7. Proof of Theorem 7

This is a direct result of Appendix B.2 in Foster & Syrgkanis (2020).

B. Experiments

B.1. Extra Details

For the incorrect scenario, using subscripts to denote the different dimensions of $X$, we let $X'_1 = X_1 X_2$, $X'_2 = X_2^2$, and $X'_3 = X_3 / \max\{1, X'_1\}$. We use $X'$ as the regressors for the misspecified propensity and outcome models.

B.2. Ablations

We provide the KL-MSE plots with misspecified models in Figure 5. All experiment settings are the same as in Figure 2, but we perturb the models. We also construct the false models with $X'$. The results show the same trend as Figure 2.
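For concreteness, the covariate misspecification map $X \mapsto X'$ from Appendix B.1 can be sketched as follows (the function and variable names are ours, and we assume the $\max$ in the third coordinate uses the transformed first coordinate $X'_1$):

```python
def misspecify(x):
    """Map raw covariates (x1, x2, x3) to the perturbed regressors X'.

    Sketch of the Appendix B.1 definition: X'_1 = X1*X2, X'_2 = X2^2,
    X'_3 = X3 / max(1, X'_1).  Names are illustrative, not from the paper's code.
    """
    x1, x2, x3 = x
    xp1 = x1 * x2                      # X'_1
    xp2 = x2 ** 2                      # X'_2
    xp3 = x3 / max(1.0, xp1)           # X'_3, denominator floored at 1
    return (xp1, xp2, xp3)

print(misspecify((2.0, 3.0, 4.0)))     # → (6.0, 9.0, 0.6666666666666666)
```

Fitting the propensity and outcome models on these transformed regressors, rather than on the raw $X$, produces the "false PS" and "false OM" scenarios in the ablations.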
It is worth noting that in Figure 5a, the AIPW estimator has variance similar to Meta-IPW when the KL distance is large. We attribute this to numerical instability, as we occasionally find divergent learned parameters due to extreme heterogeneity. The CLB-IPW estimator maintains low MSE against heterogeneity.

Figure 5: The mean squared error as a function of heterogeneity, for (a) true PS, false OM; (b) false PS, true OM; (c) false PS, false OM. We use $X'$ for all misspecified models. When both models fail to fit the data, there is no theoretical guarantee and all estimators have huge mean squared error; the better performance of Meta-IPW there is not meaningful.
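To make the contrast between the two weighting schemes concrete, here is a minimal, self-contained sketch (ours, not the paper's experiment code) of Meta-IPW in the style of Equation (1) versus CLB-IPW in the style of Equation (2), on a toy two-site problem with known propensity scores and identical covariate distributions across sites, so that both estimators are consistent. All parameters (site sizes, slopes, noise scale) are invented for illustration:

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def simulate_site(n, slope, rng):
    """One site: X ~ N(0,1), Z ~ Bernoulli(sigmoid(slope*x)), true ATE = 2."""
    out = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        z = 1 if rng.random() < sigmoid(slope * x) else 0
        y = 2.0 * z + x + rng.gauss(0.0, 0.5)
        out.append((x, z, y))
    return out

def hajek_ipw(data, prop):
    """Hajek-type IPW estimate of the ATE given a propensity function."""
    g1 = n1 = g0 = n0 = 0.0
    for x, z, y in data:
        e = prop(x)
        if z == 1:
            g1 += y / e
            n1 += 1.0 / e
        else:
            g0 += y / (1.0 - e)
            n0 += 1.0 / (1.0 - e)
    return g1 / n1 - g0 / n0

rng = random.Random(0)
slopes = [1.0, -1.0]                  # heterogeneous propensity models
sizes = [30000, 10000]
sites = [simulate_site(n, s, rng) for n, s in zip(sizes, slopes)]
eta = [n / sum(sizes) for n in sizes]  # sample-size weights

# Meta-analysis style (eq. 1): each site uses its OWN propensity model,
# then the central server takes a weighted mean of the site-level estimates.
tau_meta = sum(w * hajek_ipw(d, lambda x, s=s: sigmoid(s * x))
               for w, d, s in zip(eta, sites, slopes))

# Collaborative style (eq. 2): every site plugs in the WEIGHTED MIXTURE of
# propensity models; pooling the site-level sums gives one global Hajek IPW.
def mixture(x):
    return sum(w * sigmoid(s * x) for w, s in zip(eta, slopes))

pooled = [row for d in sites for row in d]
tau_clb = hajek_ipw(pooled, mixture)

print(round(tau_meta, 2), round(tau_clb, 2))  # both near the true ATE of 2
```

The key design point the sketch illustrates: the mixture propensity equals the pooled-data treatment probability only because the site weights match the site sample shares and the covariate distributions agree, which is the role the paper's sampling-selecting framework plays in general.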