# Prediction-powered Generalization of Causal Inferences

Ilker Demirel 1,2, Ahmed Alaa 3, Anthony Philippakis 2, David Sontag 1

Causal inferences from a randomized controlled trial (RCT) may not pertain to a target population where some effect modifiers have a different distribution. Prior work studies generalizing the results of a trial to a target population with no outcome but covariate data available. We show how the limited size of trials makes generalization a statistically infeasible task, as it requires estimating complex nuisance functions. We develop generalization algorithms that supplement the trial data with a prediction model learned from an additional observational study (OS), without making any assumptions on the OS. We theoretically and empirically show that our methods facilitate better generalization when the OS is high-quality, and remain robust when it is not, e.g., when it has unmeasured confounding.

1 MIT CSAIL. 2 Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard. 3 Department of Computational Precision Health, UC Berkeley and UCSF. Correspondence to: Ilker Demirel. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

## 1. Introduction

Experimental data from randomized controlled trials (RCTs) is the gold standard for causal inference, as various biases are avoided by design (Imbens & Rubin, 2015; Hernán & Robins, 2021). However, in addition to being time- and cost-intensive, RCTs often exhibit limited external validity, and their findings may not apply to a target population (Rothwell, 2005; Stuart et al., 2011). The generalizability of an RCT is compromised when baseline factors that influence prognosis (effect modifiers) have different distributions in the trial and target populations (Dahabreh et al., 2019) (see Figure 1). For instance, trials may consist of healthier individuals on average than routine clinical practice. Since overall health status likely affects the prognosis, it leads to confounding bias between the population-level effects in the trial and the target populations (Hernán et al., 2004).

Dahabreh et al. (2019; 2020) develop methods that use individual-level covariate, treatment, and outcome data from a trial and only covariate information from the target population to estimate causal quantities in the latter (generalization). In this work, we show how combining trial data with potentially biased observational data, e.g., found in electronic health records, can power better generalization.

Our Contributions We derive the generalization mean-squared error when an outcome model learned from the trial is used in the target sample to estimate an average causal effect in the target population, and probe how it increases when the trial is small and not representative of the target population (Section 3). Drawing inspiration from recent advances in using black-box models for valid statistical inference (Schuler et al., 2021; Angelopoulos et al., 2023), we develop prediction-powered estimators that leverage additional observational data without any assumptions on it and discuss when they lead to lower generalization error (Section 4). We simulate over a thousand data-generating processes and find that our estimators yield remarkable improvements when the observational data is high-quality and maintain baseline performance when it is not (Section 5).
Related Work There is growing interest in integrating data from trials and observational studies (OS) (Bareinboim & Pearl, 2016; Yang & Ding, 2019; NICE, 2022; Colnet et al., 2024). Schuler et al. (2021) and Liao et al. (2023) show how adjustment by the predictions of a model learned from an OS can increase power in analyzing a trial. Similarly, Guo et al. (2021) investigate how coupling trial data with control variates constructed in an OS may enable smaller-variance estimation of the average treatment effect (ATE) in the trial population. Hartman et al. (2015) and Degtiar et al. (2023) study generalization to a target population defined by the OS population or its union with the trial population. Han et al. (2023) study ATE estimation in a target population by incorporating data from multiple source populations, where the ATE is identifiable in all of the populations but differs across them. Oberst et al. (2023) review methods that combine ATE estimates from a trial and an OS to obtain a better hybrid estimate (Rosenman et al., 2020; Cheng & Cai, 2021; Yang et al., 2023). Kallus et al. (2018), Chen et al. (2021), and Hatt et al. (2022) consider the heterogeneity in effects and focus on the conditional ATE (CATE) function. Rosenman & Owen (2021) adopt a different angle and study more efficient trial design using data from an OS.

Another line of work studies benchmarking evidence from OSs (Forbes & Dahabreh, 2020). Hussain et al. (2022; 2023) and Demirel et al. (2024) develop falsification tests for the causal assumptions by comparing the findings of an OS and a trial. De Bartolomeis et al. (2024a;b) focus on quantifying the hidden confounding in an OS, and Karlsson & Krijthe (2024) show how one can detect hidden confounding using multiple OSs with a shared data-generating process.

In the works above, the target population of interest is taken as either the OS population or its union with the trial population. We consider a more general setting where the target population is defined separately from the trial and OS populations, so long as it consists of trial-eligible individuals. For instance, the target population can represent a subgroup in the trial with a small sample size. Our goal is to estimate population-level causal effects in the target population, for which only covariate information is available, by integrating data from a small trial and a large OS. We detail the necessary causal identification assumptions in Section 2, which crucially do not enforce any unverifiable conditions on the OS, and describe our estimators in Section 4. We do not go down the route of cooking up a recipe to combine real-valued estimates from the trial and the OS, nor do we promise guarantees on the granular CATE function, as the former offers poor flexibility in utilizing rich observational data and the latter replaces the causal assumptions on the OS with statistical assumptions on its bias function. Our approach lies somewhere in between: we fit an outcome function from the OS using flexible machine learning models, which can be subject to causal biases, and analyze how coupling it with trial data can power the estimation of a real-valued causal estimand in the target population.

## 2. Background

### 2.1. Notation and Objective

We consider a nested design where a trial is sampled from an underlying population of trial-eligible individuals.
Note that our methods can easily be extended to non-nested designs where the target sample is obtained separately, e.g., to represent a subgroup for which the trial sample alone cannot power statistically significant inference. We have access to an i.i.d. sample of observations $D = \{W_i\}_{i=1}^{n}$ with $W_i = (X_i, S_i, S_i A_i, S_i Y_i)$, where $X \in \mathcal{X}$ is a d-dimensional covariate vector, S is a binary trial participation indicator, A is a categorical treatment, and $Y \in \mathbb{R}$ is the outcome of interest. Only covariate data is available for non-participants (S = 0), while treatment and outcome data are also available for participants (S = 1).

Figure 1. Age influences both selection into the trial and the outcome, inducing confounding bias between the population-level mean potential outcomes in the trial and target populations. (In the illustration, the age distributions in the RCT and target populations differ: the RCT cohort is younger than the target cohort, and the outcome of interest is larger for younger patients, so the mean outcome in the RCT is larger than in the target population.)

The target population of interest is represented by non-participants. We denote by $D_1 \subset D$ the set of trial participants and by $D_0 \subset D$ the set of non-participants, with sizes $n_1 = \sum_{i=1}^{n} \mathbb{1}\{S_i = 1\}$ and $n_0 = \sum_{i=1}^{n} \mathbb{1}\{S_i = 0\}$, partitioning the composite sample D. Further, we denote by $P_1$ and $P_0$ the joint distributions of W in the underlying trial and target populations. For instance, $X \sim P_0$ represents a covariate drawn from the target distribution P(X | S = 0). We seek causal inference in the target population. Specifically, denoting by $Y^a$ the potential outcome under treatment A = a, we want to estimate the average potential outcomes in the target population,

$$\mu_a := E[Y^a \mid S = 0]. \quad (1)$$

A more common causal estimand, the average treatment effect, can be directly derived from the average potential outcomes (e.g., $\mu_1 - \mu_0$ in a binary treatment setting). Focusing on potential outcomes allows for simpler exposition, and they are of independent interest in many applications.

The challenge in estimating $\mu_a$ is three-fold. The first is obvious: no outcome data is available for non-participants. One can contemplate resorting to outcome data from the trial, which brings us to the second challenge. The potential outcome $Y^a$ can only be observed for those who received treatment A = a. When treatment assignment depends on unobserved factors that also affect the outcome, one risks confounding bias, which presents a non-trivial challenge in analyzing observational data (see Section 4). However, it is easily avoided in trials by randomized treatment assignment, and the average potential outcome in the trial population, $E[Y^a \mid S = 1]$, can be reliably estimated. The final challenge is that $E[Y^a \mid S = 0] \neq E[Y^a \mid S = 1]$ when there is confounding by trial participation, i.e., when effect modifiers have different distributions in the trial and target populations (see Figure 1). Therefore, one cannot generalize population-level effect estimates from a trial to the target population, but needs to adjust for confounding by trial participation. Next, we state the causal assumptions needed to estimate $\mu_a$ by incorporating outcome data from the trial.

### 2.2. Assumptions for Causal Inference

Assumption 2.1 (Consistency). $A = a \implies Y = Y^a$.

Assumption 2.2 (Mean ignorability of treatment assignment in trial). $E[Y^a \mid X, S = 1] = E[Y^a \mid X, S = 1, A = a]$.

Assumption 2.3 (Positivity of treatment assignment in trial). $P(A = a \mid X = x, S = 1) > 0$.
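To make this setup concrete, the following minimal sketch (a toy example of our own; the column names, the simple data-generating process, and the library choices are not from the paper) builds a composite sample D of the form above, splits it into the trial sample D1 and the target sample D0, and computes the arm-wise outcome mean among trial participants, which, as discussed next, is a reliable estimate of E[Y^a | S = 1] under Assumptions 2.1-2.3.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2_000

# Toy composite sample D = {(X_i, S_i, S_i*A_i, S_i*Y_i)}: one covariate X,
# trial participation S, and treatment A / outcome Y observed only when S = 1.
X = rng.uniform(-1, 1, size=n)
S = rng.binomial(1, 1 / (1 + np.exp(-2 * X)))                # selection depends on X only
A = np.where(S == 1, rng.binomial(1, 0.5, size=n), np.nan)   # randomized in the trial
Y = np.where(S == 1, 1.0 + X + 0.5 * np.nan_to_num(A) + rng.normal(0, 1, n), np.nan)

D = pd.DataFrame({"X": X, "S": S, "A": A, "Y": Y})
D1, D0 = D[D.S == 1], D[D.S == 0]                            # trial participants / target sample

# Within-trial estimate of E[Y^a | S = 1] for a = 1: arm-wise sample mean.
mu_trial_a1 = D1.loc[D1.A == 1, "Y"].mean()
print(f"n1 = {len(D1)}, n0 = {len(D0)}, within-trial mean for a=1: {mu_trial_a1:.3f}")
```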
Assumptions 2.1-2.3 are satisfied in an RCT by design, and they enable causal inference within the trial population, i.e., reliable estimation of $E[Y^a \mid S = 1]$.

Assumption 2.4 (Mean ignorability of trial participation). $E[Y^a \mid X] = E[Y^a \mid X, S = 1] = E[Y^a \mid X, S = 0]$.

Assumption 2.5 (Positivity of selection into trial). $P(S = 0 \mid X = x) > 0 \implies P(S = 1 \mid X = x) > 0$.

Assumption 2.4 requires that, within levels of measured covariates X, potential outcomes in the trial and target populations are the same on average. Assumption 2.5 ensures that every patient has a nonzero probability of participating in the trial, so we do not have to rely on pure extrapolation. Assumptions 2.4 and 2.5 transform the problem of generalizing the results of a trial into a covariate shift problem and allow one to identify $\mu_a$ as follows (Dahabreh et al., 2020):

$$\mu_a = E_{X \sim P_0}\big[E[Y^a \mid X, S = 0]\big] = E_{X \sim P_0}\big[E[Y^a \mid X, S = 1]\big] = E_{X \sim P_0}\big[E[Y \mid X, S = 1, A = a]\big], \quad (2)$$

where the last two steps follow from Assumptions 2.4-2.5 and 2.1-2.3, respectively. Note that (2) can be estimated using only covariate data from non-participants (S = 0) and covariate, treatment, and outcome data from the trial participants (S = 1).

## 3. Generalization Using Experimental Data

Dahabreh et al. (2020) propose estimators of (2) based on outcome functions, on weighting by the inverse of the trial participation probability, and doubly-robust (DR) ones. We focus on the outcome function approach as it more lucidly uncovers the limitations of generalization from trial data, and the synthetic results in Dahabreh et al. (2020) show that it outperforms the weighting-based approaches and performs on par with the DR ones (as we also verify in Appendix B.2). Our findings reveal how a predictive model trained on large-scale observational data could help. We define the mean outcome function in the trial population S = 1 under treatment A = a as

$$g_a(X) := E[Y \mid X, S = 1, A = a] \quad (3)$$
$$= E[Y^a \mid X, S = 1] \quad \text{(Assumptions 2.1-2.3)}$$
$$= E[Y^a \mid X]. \quad \text{(Assumptions 2.4-2.5)}$$

One can estimate $\hat{g}_a(X)$ from the trial sample $D_1$ and then average its predictions in the target sample $D_0$. This leads to the following outcome model (OM) estimator on the composite sample D:

$$\hat{\mu}_a^{\mathrm{OM}} = \frac{1}{n_0} \sum_{i=1}^{n} \mathbb{1}\{S_i = 0\}\, \hat{g}_a(X_i). \quad (4)$$

In the remainder of this section, we investigate when $\hat{\mu}_a^{\mathrm{OM}}$ is expected to have high mean-squared error (MSE). Our next result gives an approximation for the MSE in the special case where X is purely categorical, which provides perspective into the limitations of $\hat{\mu}_a^{\mathrm{OM}}$ for the more general case as well.

Proposition 3.1. Let X be a categorical covariate stratifying the population into K groups, denote by $n_{s=1,a,k}$ the number of trial participants from group X = k assigned to treatment A = a, and by $\sigma_{a,k}^2$ the variance of the outcome among such patients. Let us estimate the outcome function $g_a(X = k)$ with the sample mean of outcomes Y of participants in group X = k assigned to treatment A = a. Suppose that Assumptions 2.1-2.5 hold. When $n_0$ is large, the MSE of $\hat{\mu}_a^{\mathrm{OM}}$ in (4) can be approximated as

$$E\big[(\hat{\mu}_a^{\mathrm{OM}} - \mu_a)^2\big] \approx \sum_{k=1}^{K} p_{s=0}^2(k)\, \frac{\sigma_{a,k}^2}{n_{s=1,a,k}}, \quad (5)$$

where $p_{s=0}(k) := P(X = k \mid S = 0)$ is the proportion of patients from group X = k in the target population.

Proposition 3.1 reveals the key challenge in our endeavor. The reason a generalization procedure is needed in the first place is that some effect modifiers' distributions might differ in the trial and target populations (a small numerical illustration of the approximation in (5) is sketched below).
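Below is a small numerical illustration of Proposition 3.1 (a toy setup of our own, not from the paper): for a categorical covariate, it forms the plug-in estimate (4) from stratum means and compares its Monte Carlo MSE to the right-hand side of (5).

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3
p_target = np.array([0.2, 0.3, 0.5])    # p_{s=0}(k): group proportions in the target population
n_trial_ak = np.array([80, 40, 10])     # n_{s=1,a,k}: trial patients with A = a in each group
mu_ak = np.array([1.0, 2.0, 3.0])       # E[Y^a | X = k] (shared across populations, Asm. 2.4)
sigma_ak = np.array([1.0, 1.0, 2.0])    # outcome standard deviation per group

mu_a = p_target @ mu_ak                 # true mu_a = E[Y^a | S = 0]

def om_estimate():
    """Plug-in estimator (4): stratum means from the trial, weighted by target proportions."""
    g_hat = np.array([rng.normal(mu_ak[k], sigma_ak[k], n_trial_ak[k]).mean() for k in range(K)])
    return p_target @ g_hat             # large n0: target proportions treated as known

mc_mse = np.mean([(om_estimate() - mu_a) ** 2 for _ in range(20_000)])
approx_mse = np.sum(p_target ** 2 * sigma_ak ** 2 / n_trial_ak)   # right-hand side of (5)
print(f"Monte Carlo MSE: {mc_mse:.4f}   approximation (5): {approx_mse:.4f}")
```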
Reading off (5), one can see that when the trial is limited in representing patient profiles that are prevalent in the target population (small $n_{s=1,a,k}$, large $p_{s=0}(k)$), the MSE will be larger. That is, inference in the target population gets more challenging as it becomes more different from the trial population.

The insights from Proposition 3.1 extend to the case with continuous covariates and parametric estimators $\hat{g}_a(X) = g_a(X; \hat{\theta})$ (e.g., a random forest). Let us denote by $\mathcal{A}$ the algorithm that fits $\hat{\theta}$ from the trial sample (e.g., ridge regression), i.e., $\hat{\theta} = \mathcal{A}(D_1)$. As $D_1 \sim P_1$, we write $\hat{\theta} \sim \mathcal{A}(P_1)$ to refer to the randomness in estimating $\hat{\theta}$ from $D_1$. Next, we give an approximation for the MSE in the general case.

Theorem 3.2. Suppose that Assumptions 2.1-2.5 hold and consider a parametric estimator $\hat{g}_a(X) = g_a(X; \hat{\theta})$ for the outcome function. For large $n_0$, the MSE of $\hat{\mu}_a^{\mathrm{OM}}$ in (4) can be approximated as

$$E\big[(\hat{\mu}_a^{\mathrm{OM}} - \mu_a)^2\big] \approx \Big(E_{X \sim P_0}\big[\underbrace{E_{\hat{\theta} \sim \mathcal{A}(P_1)}[g_a(X; \hat{\theta})] - g_a(X)}_{=:\, \mathrm{SB}_g(X)}\big]\Big)^2 \quad (6)$$
$$\qquad + \mathrm{Var}_{\hat{\theta} \sim \mathcal{A}(P_1)}\big(E_{X \sim P_0}[g_a(X; \hat{\theta})]\big). \quad (7)$$

The first term, (6), is the statistical bias (SB). Crucially, it is obtained by integrating the bias function $\mathrm{SB}_g(X)$ over the target covariate distribution $P_{X \mid S=0}$, making $\hat{\mu}_a^{\mathrm{OM}}$ susceptible to weak overlap between the trial and target populations. Consider the case where $g_a(X; \hat{\theta})$ is misspecified or underfit. $\mathrm{SB}_g(X)$ will be larger where $P_{X \mid S=1}$ has little weight, since $g_a(X; \hat{\theta})$ is fit using the trial sample $D_1$. $\hat{\mu}_a^{\mathrm{OM}}$ may then suffer substantial bias if $P_{X \mid S=0}$ is large in covariate regions where the trial support is weak. It is therefore essential to ensure that $g_a(X; \hat{\theta})$ is rich enough to match the complexity of $g_a(X)$ and avoid a large bias term. However, in practice, one's ability to flexibly model $g_a(X; \hat{\theta})$ is severely limited, as the small size of trials (e.g., $n_1 \approx 200$) can lead to overfitting, i.e., an increase in the variance term (7). We empirically demonstrate this tradeoff in Appendix B.1.

## 4. Prediction-powered Generalization Using Experimental and Observational Data

Here, we study how integrating rich observational data with limited experimental data can make the generalization task more statistically feasible. We index by S = 2 the observational population with joint distribution $P_2$. We assume access to an i.i.d. sample of observations $(X_i, A_i, Y_i) \sim P_2$ and denote by $D_{2,a}$ the set of patients who received treatment A = a in the observational data. In the first step, we fit a predictive model $f_a : \mathcal{X} \to \mathbb{R}$ by minimizing the empirical mean-squared error in $D_{2,a}$ to approximate

$$E[Y \mid X, S = 2, A = a]. \quad (8)$$

Unlike the trial sample, large observational data can support parametrizing $f_a(X)$ with powerful machine learning models, allowing it to model complex functions. If one is willing to make Assumptions 2.1-2.5 for the observational study (OS), i.e., with S = 2 instead of S = 1, $\mu_a$ can be identified as $E_{X \sim P_0}[E[Y \mid X, S = 2, A = a]]$ via the same machinery as in (2). One could then apply $f_a(X)$ in the composite sample D to estimate $\mu_a$ as

$$\hat{\mu}_a^{\text{OS-OM}} = \frac{1}{n_0} \sum_{i=1}^{n} \mathbb{1}\{S_i = 0\}\, f_a(X_i), \quad (9)$$

analogous to (4). While Assumptions 2.1-2.5 are defensible for trials, some of them rarely hold for observational studies in practice, particularly the ignorability of treatment assignment (no unmeasured confounding). In contrast to most of the literature on causal inference using observational data, we take an extremely assumption-light approach, making no assumptions on the observational data.
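To illustrate this first step, the sketch below (hypothetical variable names and a toy confounded data-generating process of our own; any sufficiently flexible regressor could stand in for the gradient-boosting model) fits f_a on the treated arm of an observational sample and evaluates the naive plug-in estimator (9) on target covariates. Because the confounder U is hidden, the resulting estimate is biased, which motivates the corrections developed next.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)

# Hypothetical observational sample (S = 2) with a hidden confounder U,
# and target-sample covariates X_target (S = 0).
n_os, n_target, a = 50_000, 20_000, 1
X_os = rng.uniform(-1, 1, size=(n_os, 2))
U = rng.uniform(-1, 1, size=n_os)                         # unmeasured confounder
A_os = rng.binomial(1, 1 / (1 + np.exp(-3 * U)))          # treatment assignment depends on U
Y_os = np.sin(3 * X_os[:, 0]) + X_os[:, 1] + U + A_os + rng.normal(0, 0.5, n_os)
X_target = rng.uniform(-1, 1, size=(n_target, 2))

# Step 1: fit f_a by minimizing the empirical MSE on the treated OS arm D_{2,a}.
f_a = GradientBoostingRegressor().fit(X_os[A_os == a], Y_os[A_os == a])

# Naive estimator (9): average f_a over the target sample; biased here because U is hidden
# (in this toy setup the true mu_1 is 1.0, since U, X, and sin(3X) all have mean zero).
mu_hat_os_om = f_a.predict(X_target).mean()
print(f"naive OS-OM estimate of mu_{a}: {mu_hat_os_om:.3f}  (true value in this toy setup: 1.0)")
```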
We define the bias function as the difference between the outcome function in the trial, $g_a(X)$, and the observational predictor, $f_a(X)$:

$$b_a(X) := f_a(X) - g_a(X) \quad (10)$$
$$= \underbrace{f_a(X) - E[Y \mid X, S = 2, A = a]}_{\text{statistical bias}} + \underbrace{E[Y^a \mid X, S = 2, A = a] - E[Y^a \mid X, S = 2]}_{\text{confounding bias}} + \underbrace{E[Y^a \mid X, S = 2] - E[Y^a \mid X, S = 1]}_{\text{transportation bias}},$$

since $E[Y^a \mid X, S = 1] = g_a(X)$ (see (3)). The statistical bias term is related to fitting $f_a(X)$ using a finite sample, and it vanishes with more data given enough model capacity. Confounding and transportation biases, however, are the price of avoiding Assumptions 2.2 and 2.4 for the observational study (S = 2). They will not disappear even with infinite data from the observational population, rendering $\hat{\mu}_a^{\text{OS-OM}}$ in (9) an inconsistent estimator for $\mu_a = E[Y^a \mid S = 0]$, even when $f_a(X) = E[Y \mid X, S = 2, A = a]$.

In Sections 4.1 and 4.2, we derive two new identifications of $\mu_a$ that integrate the predictions of $f_a$ in a statistically valid way and discuss how they lead to more sample-efficient estimation in comparison to (2). We give regression function-based estimators, derive their MSEs, and compare them to that of the baseline in Theorem 3.2.

### 4.1. Additive Bias Correction to Predictive Model

We covered why using $f_a$ alone is unreliable. Nonetheless, it may carry useful signal we can exploit when coupled with trial data. First, using the trial sample $D_1$, we show how one can learn the bias function of the predictive model, $b_a(X) = f_a(X) - g_a(X)$. We then give an estimator for $\mu_a$ that uses the predictions $f_a(X)$ in the target sample by correcting them with their estimated bias, $\hat{b}_a(X)$. We formalize in Theorem 4.2 and Section 4.1.3 when it is more advantageous to construct an estimator of $\mu_a$ that relies on fitting the bias function $b_a(X)$ instead of the outcome function $g_a(X)$, such as the illustrative example depicted in Figure 2.

#### 4.1.1. Identification

We start by trivially writing

$$\mu_a = E[f_a(X) \mid S = 0] - E[f_a(X) - Y^a \mid S = 0]. \quad (11)$$

Figure 2. A biased predictor $f_a(X)$ can still capture higher-order polynomials, making its bias $b_a(X)$ easier to learn than $g_a(X)$. (Panels illustrate that estimating a complex $g_a(X)$ from a small RCT is statistically infeasible; an observational predictor $f_a(X)$ is biased in general; nonetheless, its bias $b_a(X)$ may be estimated more sample-efficiently from a small RCT.)

The first term can be estimated by averaging $f_a(X)$ in the target sample, which is generally biased for $\mu_a$. The second term removes this bias; however, as it contains a counterfactual variable, $Y^a$, it is not immediately clear how one would estimate it. Our next result shows that it can be identified without additional assumptions on $f_a$.

Lemma 4.1. Suppose that Assumptions 2.1-2.5 hold. Let $f_a : \mathcal{X} \to \mathbb{R}$ and define the error variable

$$Z := f_a(X) - Y. \quad (12)$$

Then $\mu_a$ can be identified as

$$\mu_a = E_{X \sim P_0}\big[f_a(X) - E[Z \mid X, S = 1, A = a]\big]. \quad (13)$$

Note that Z can be calculated for trial participants, and the second term in (13) can be estimated with covariate information from the target sample and covariate, treatment, and error information from the trial sample, as we cover next.

#### 4.1.2. Regression Function-based Estimation

By (10), (12), and (3), it is straightforward to see that $b_a(X) = E[Z \mid X, S = 1, A = a]$. That is, the identification in (13) is through the bias function in (10). We denote by $b_a(X; \hat{\gamma})$ a parametric fit obtained by regressing Z onto the covariates X in the trial sample and write the additive-bias-correction (ABC) estimator as

$$\hat{\mu}_a^{\mathrm{ABC}} = \frac{1}{n_0} \sum_{i=1}^{n} \mathbb{1}\{S_i = 0\}\, \big(f_a(X_i) - b_a(X_i; \hat{\gamma})\big). \quad (14)$$
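A minimal sketch of the resulting procedure follows (it parallels Algorithm 1 below; the polynomial-ridge choice for the bias model, the toy functions, and all names are our own illustration, not the paper's released code): compute the errors Z on the relevant trial arm, regress them on X, and debias the averaged predictions in the target sample as in (14).

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def abc_estimate(X1, Y1, X0, f_a, degree=2):
    """Additive-bias-correction estimator (14).

    X1, Y1: covariates/outcomes of trial participants who received treatment a (D_{1,a}).
    X0:     covariates of the target sample (D_0).
    f_a:    predictor learned from the observational study.
    """
    Z1 = f_a(X1) - Y1                                    # errors Z = f_a(X) - Y, eq. (12)
    bias_model = make_pipeline(PolynomialFeatures(degree), RidgeCV())
    bias_model.fit(X1, Z1)                               # b_a(X; gamma_hat) = E[Z | X, S=1, A=a]
    return np.mean(f_a(X0) - bias_model.predict(X0))     # eq. (14)

# Toy usage with a synthetic setup of our own, for illustration only.
rng = np.random.default_rng(3)
g_a = lambda x: np.sin(4 * x[:, 0])                      # complex outcome function
f_a = lambda x: np.sin(4 * x[:, 0]) + 0.3 * x[:, 0]      # biased predictor; simple bias 0.3*X
X1 = rng.uniform(-1, 1, size=(150, 1))
Y1 = g_a(X1) + rng.normal(0, 0.5, 150)
X0 = rng.uniform(-0.5, 1, size=(10_000, 1))              # shifted target covariates
print(f"ABC estimate: {abc_estimate(X1, Y1, X0, f_a):.3f}   true mu_a: {g_a(X0).mean():.3f}")
```

Theorem 4.2 next characterizes when fitting b_a(X) in this way is statistically easier than fitting g_a(X) directly.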
Theorem 4.2. Suppose that Assumptions 2.1-2.5 hold. For large $n_0$, the MSE of $\hat{\mu}_a^{\mathrm{ABC}}$ in (14) can be approximated as

$$E\big[(\hat{\mu}_a^{\mathrm{ABC}} - \mu_a)^2\big] \approx \Big(E_{X \sim P_0}\big[E_{\hat{\gamma} \sim \mathcal{A}(P_1)}[b_a(X; \hat{\gamma})] - b_a(X)\big]\Big)^2 \quad (15)$$
$$\qquad + \mathrm{Var}_{\hat{\gamma} \sim \mathcal{A}(P_1)}\big(E_{X \sim P_0}[b_a(X; \hat{\gamma})]\big). \quad (16)$$

Algorithm 1: Generalization via additive bias correction
Input: Sample D, predictor $f_a$, MSE optimizer $\mathcal{A}$
$D_{1,a} \subset D$: trial cohort ($S_i = 1$) with treatment $A_i = a$
for $W_i \in D_{1,a}$ do
  Calculate the prediction error $Z_i = f_a(X_i) - Y_i$
end for
Fit $b_a(X; \hat{\gamma})$ by minimizing the MSE for Z in $D_{1,a}$ using $\mathcal{A}$
Return $\hat{\mu}_a^{\mathrm{ABC}}$ in (14)

The significance of Theorem 4.2 is that the MSE of $\hat{\mu}_a^{\mathrm{ABC}}$ admits the same form as that of $\hat{\mu}_a^{\mathrm{OM}}$ in Theorem 3.2. The difference is that the MSE is governed by how well the bias function $b_a(X)$ is estimated instead of the outcome function $g_a(X)$. This result formalizes how leveraging a potentially biased observational predictor can be more viable for the generalization task. Consider the case in Figure 2 where $f_a(X)$ captures the higher-degree polynomials in $g_a(X)$, resulting in $b_a(X)$ being a low-degree polynomial. One can then fit a linear model with a few polynomial features for $b_a(X; \hat{\gamma})$, resulting in both a controlled bias (15) and variance (16) term. On the other hand, fitting $g_a(X; \hat{\theta})$ similarly will result in a large bias term (6). We provide a detailed discussion in the next section and empirically demonstrate how the bias ((6), (15)) and variance ((7), (16)) terms compare in Appendix B.1.

#### 4.1.3. Case Study: Polynomial Ridge Regression

The symmetry between the MSEs in Theorems 3.2 and 4.2 allows one to reason about the (comparative) performances of the outcome and bias function-based estimators in (4) and (14). To gain further insight, we study the polynomial ridge regression framework and describe the regime where estimating $b_a(X)$ is more feasible than $g_a(X)$ in the setting described below. We focus on finite-sample results, which are of significant interest given the limited size of trials.

We consider $X \in [-1, 1]$, denote by $L^2([-1, 1])$ the space of square-integrable functions (i.e., $\int_{-1}^{1} f^2(x)\, dx < \infty$) endowed with the inner product $\langle f, g \rangle = \int_{-1}^{1} f(x) g(x)\, dx$, and assume that $g_a, b_a \in L^2([-1, 1])$ with bounded norms $\|g_a\|, \|b_a\| \leq 1$. Finally, we assume the following generative equations:

$$Y_i = g_a(X_i) + \eta_i, \qquad Z_i = b_a(X_i) - \eta_i, \quad (17)$$

where $\eta_i \sim \mathcal{N}(0, \sigma^2)$ are zero-mean i.i.d. noise variables and $Z_i$ are the patient-wise error terms for the predictive model $f_a(X)$, defined previously in (12). Let us now define, for a generic function f, the empirical excess risk of a fit $\hat{f}$ obtained from a sample of size m as

$$R_m(\hat{f}, f) := \frac{1}{m} \sum_{i=1}^{m} \big(\hat{f}(X_i) - f(X_i)\big)^2, \quad (18)$$

which quantifies how far away the fit is from the true function. In the rest of this section, we study oracle upper bounds on $R_{n_1}(\hat{g}_a, g_a)$ and $R_{n_1}(\hat{b}_a, b_a)$ when $\hat{g}_a$ and $\hat{b}_a$ are fit via polynomial ridge regression in the trial sample $D_1$. To that end, let us now introduce the Legendre polynomials, which have convenient properties that facilitate clear exposition. We denote by $\varphi_k : [-1, 1] \to \mathbb{R}$ the k-th order normalized Legendre polynomial. The set $\{\varphi_k\}_{k=0}^{\infty}$ forms an orthonormal basis 2 for $L^2([-1, 1])$, meaning that any function $f \in L^2([-1, 1])$ can be uniquely represented as a linear combination of $\{\varphi_k\}_{k=0}^{\infty}$, allowing us to write

$$g_a(X) = \sum_{k=0}^{\infty} \lambda_k \varphi_k(X), \qquad b_a(X) = \sum_{k=0}^{\infty} \omega_k \varphi_k(X), \quad (19)$$

where $\lambda_k = \langle g_a, \varphi_k \rangle$ and $\omega_k = \langle b_a, \varphi_k \rangle$. In practice, one can fit $\hat{g}_a$ and $\hat{b}_a$ using Legendre polynomials up to degree $d' \in \mathbb{N}$ with ridge regularization to avoid overfitting to the trial sample.
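The sketch below (with toy functions g_a, b_a, and f_a of our own choosing) carries out exactly this fitting procedure: it builds Legendre features up to degree d', fits ridge regressions for both the outcome and the bias function on the same small trial sample, and compares their empirical excess risks (18), anticipating the bounds stated next.

```python
import numpy as np
from numpy.polynomial import legendre
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
n1, d_prime, sigma = 200, 5, 0.5

g_a = lambda x: np.cos(6 * x)          # complex outcome function (many Legendre coefficients)
b_a = lambda x: 0.5 * x                # simple, low-degree bias function
f_a = lambda x: g_a(x) + b_a(x)        # observational predictor f_a = g_a + b_a

X = rng.uniform(-1, 1, n1)
eta = rng.normal(0, sigma, n1)
Y = g_a(X) + eta                       # eq. (17)
Z = f_a(X) - Y                         # = b_a(X) - eta

def legendre_features(x, d):
    # Columns phi_0(x), ..., phi_d(x); unnormalized Legendre polynomials suffice for this sketch.
    return np.column_stack([legendre.legval(x, np.eye(d + 1)[k]) for k in range(d + 1)])

Phi = legendre_features(X, d_prime)
g_hat = Ridge(alpha=1.0).fit(Phi, Y)   # fit of the outcome function g_a
b_hat = Ridge(alpha=1.0).fit(Phi, Z)   # fit of the bias function b_a

def excess_risk(model, truth):
    return np.mean((model.predict(Phi) - truth(X)) ** 2)   # empirical excess risk, eq. (18)

print(f"excess risk of g_hat: {excess_risk(g_hat, g_a):.3f}   of b_hat: {excess_risk(b_hat, b_a):.3f}")
```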
Leaving the intermediary steps to Appendix A.2, we proceed to state the corresponding upper bounds on the expected empirical excess risks. Lemma 4.3 (Adopted from Wainwright (2019)). Let X [ 1, 1], ga, ba L2([ 1, 1]), ga , ba 1, and consider the generative equations in (17) with noise variance σ2. Denote by ˆga and ˆba the fits obtained by regressing Y and Z (see (12)) onto {ϕk(X)}d k=0 in trial sample D1 with an appropriately chosen ridge regularization penalty. We have EXi,ηi[Rn1(ˆga, ga)] σ2d /n1 + X k=d +1 λ2 k, (20) EXi,ηi[Rn1(ˆba, ba)] σ2d /n1 + X k=d +1 ω2 k. (21) The upper bounds in (20) and (21) share the first statistical error term, which grows with the number of polynomial features d . Comparing the second terms reveals that estimating the bias function is favorable when P k=d +1 ω2 k < P k=d +1 λ2 k. One would expect the preceding condition to hold in two scenarios, which we discuss next. The first one is when the observational predictor is high quality. Precisely, if fa(X) ga(X), then ba(X) 0, implying small values for ωk and sum of their squares. This is the same condition in Angelopoulos et al. (2023) for a black-box predictor to power better inference when coupled with a small amount of gold-standard data. In the context of causal inference, it would take the individual terms in (10) to be as small as possible to warrant fa(X) ga(X), which requires observational study to have negligible hidden confounding for treatment assignment and to be transportable conditioned on X. The second scenario is when ba mostly consists of lower degree polynomials, that is, wk 0 for k > d . This is a 2 ϕi, ϕj = δij, span({ϕk} k=0) = L2([ 1, 1]). relaxed and more general version of the key assumption in Kallus et al. (2018), which requires ba(X) to be linear in X. The idea is that even when fa(X) is biased, it can still capture complex structure, such as the higher order polynomials modeling rapid turns in ga(X), and make ba(X) considerably simpler, as illustrated in Figure 2. 4.2. Augmented Outcome Modeling Here we draw from Schuler et al. (2021); Liao et al. (2023) and leverage the observational model by using its predictions as an additional regressor while estimating the outcome function from the trial. Using fa(X) as a covariate still makes for an easier estimation task when ga(X) = fa(X) + ba(X) with fa(X) capturing most of the complexity in ga(X) and ba(X) is a simpler function. However, it has two advantages over the additive bias correction approach we discuss below. First is robustness when fa(X) does not carry useful information. For instance, let fa(X) be an independent noise term, η, for all X. Then the additive bias ba(X) = η ga(X) is just a noisier version of ga(X) and even more challenging to estimate. On the other hand, when the predictions fa(X) are used as a covariate, a good learning algorithm would just ignore it. We compare the two approaches robustness with synthetic experiments (see Figure 5). Second is the flexibility in how the predictions are utilized. Consider the illustrative example where ga(X) = fa(X)/2 and the bias of fa(X) can be corrected simply dividing it by two. On the other hand, additive bias ba(X) = ga(X) is identical to the outcome function and not easier to estimate. 4.2.1. IDENTIFICATION Let us define the augmented covariate vector as Xi := [X1 i , X2 i , . . . , Xd i , fa(Xi)], (22) where Xn i is the n-th original covariate out of d. We denote X X where X = X R. Lemma 4.4. Suppose that Assumptions 2.1-2.5 hold. 
Let fa : X R and define the augmented outcome function ha( X) := E[Y | X, S = 1, A = a]. (23) where X is defined in (22). µa can be identified as µa = EX P0[ha( X)]. (24) Note that X and X carry the same information and Assumptions 2.1-2.5 continue to hold for X. The identification in (24) thus follows from the same steps that lead to (2). Prediction-powered Generalization of Causal Inferences Algorithm 2 Generalization via augmented outcome model Input: Sample D, Predictor fa, MSE optimizer A D1,a D: Trial cohort (Si = 1) with treatment Ai = a for Wi D do Calculate the outcome prediction fa(Xi) Construct the augmented covariate vector Xi as in (22) end for Fit ha( X; ˆβ) by minimizing MSE for Y in D1,a using A Return ˆµAOM a in (25) 4.2.2. REGRESSION FUNCTION-BASED ESTIMATION Note that ha( X) = ha(X, fa(X)) = ga(X) and the only difference from the baseline approach in Section 3 is that we have an additional regressor, Xd+1 = fa(X). We denote by ha( Xi; ˆβ) the parametric fit for ha( Xi) and write the augmented outcome modeling (AOM) estimator as ˆµAOM a = 1 i=1 1 {Si = 0} ha( Xi; ˆβ). (25) The approximation to the MSE E[(ˆµAOM a µa)2] follows the same form with that of ˆµOM a in Theorem 3.2 with the augmented outcome function ha replacing ga. In the interest of space, we defer the precise statement to Appendix A.3. We close this section by mentioning two critical directions for future work to leverage observational data more efficiently: one related to modeling and the other to estimation. Representation-powered Outcome Modeling Instead of using a model s predictions fa(X), one can use the representations learned by the model as additional covariates in the trial (Johansson et al., 2016; Shalit et al., 2017). This approach also allows for extracting richer information from the observational data more flexibly, e.g., via unsupervised and multimodal learning methods. Doubly-robust Estimation For the prediction-powered identifications of µa in (13) and (24), we focused only on regression function-based estimators to demonstrate the advantages of our approach. In Appendix A.4, we give doublyrobust estimators that enjoy desirable properties such as asymptotic normality that enable the construction of confidence intervals (Chernozhukov et al., 2018; Kennedy, 2023). 5. Synthetic Experiments We simulate over a thousand different synthetic data generating processes with varying levels of complexity in the outcome function ga(X), confounding bias in the observational study, and trial size n1. We compare the root MSE (RMSE) Observational study Nested trial design Figure 3. Data-generating process used in simulated experiments. (Left.) X (observed) induces confounding by trial participation. (Right.) In the observational study, there is hidden confounding for treatment assignment due to U (unobserved). of our estimators (14) and (25), which combine experimental and observational data, to that of the baselines (4) and (9) which use them alone. Bias-variance terms (e.g., (15) and (16)) are presented in Appendix B.1. Further, we demonstrate the robustness of the augmented outcome modeling estimator in (25) over the additive bias correction estimator in (14). While the main results are concerned with outcomemodeling-based estimators, we present additional empirical results for the inverse propensity weighting and and doublyrobust estimators in Appendix B.2. Our code is available at https://github.com/demireal/ppci. 5.1. 
Data-generating Process We consider two covariates X, U [ 1, 1], a binary treatment strategy A {0, 1}, and a real-valued outcome Y R. We first describe the probabilistic model that generates the potential outcomes Y 0 and Y 1 conditioned on X and U. We then move on to explain the sampling mechanism in the nested trial design that generates the trial and target cohorts. Finally, we specify the patient sampling and treatment assignment mechanism underpinning the observational data we use to train a predictive model fa : [ 1, 1]2 R. Results of simpler experiments where the functions underlying the data-generating process are specified to be linear are presented in Appendix B.4. Generating (Potential) Outcomes We denote the full outcome model (FOM) for treatment A = a by FOMa : [ 1, 1]2 R. For a patient with covariates (Xi, Ui), the potential outcome is calculated as Y a i = FOMa(Xi, Ui). Note that we use the same outcome model for patients in the trial, target, and observational samples. We generate FOMa by sampling from a GP with mean function m(X, U) = 0 and kernel function k ((X, U), (X , U )) (Rasmussen et al., 2006). We create a composite kernel by adding a squared-exponential (SE) kernel to model the local variations and a linear kernel to model the trends in the Prediction-powered Generalization of Causal Inferences Figure 4. 100 different set of data-generating functions are sampled for each (l FOM1 x , αPA u , n1). We plot the RMSE averaged over 100 scenarios. Results are reported for four different numbers of polynomial features used to fit the underlying regression functions (if any). outcome. Precisely, we have k((X, U),(X , U ))= αFOMa x XX + αFOMa u UU (linear) + exp (X X )2 2(l FOMa x )2 (U U )2 2(l FOMa u )2 , (SE) (26) where αFOMa x , αFOMa u , l FOMa x , l FOMa u R+ are free parameters. We experiment with different values to simulate a diverse set of scenarios. For instance, a larger value for αFOMa u implies a stronger linear trend in FOMa(X, U) along U-axis. More details are given at the end of this section. Generating Trial and Target Samples We consider a nested study design and generate a composite trial-eligible patient cohort by sampling Xi, Ui Uniform[ 1, 1] independently. We denote by P(S = 1 | Xi, Ui) the probability of trial participation, which is generated as P(S = 1|Xi, Ui) = median n 1 1 + e LPS(Xi,Ui) , 0.1, 0.9 o . (27) where the logit function LPS(Xi, Ui) is sampled from a GP with the composite linear + SE kernel in (26) with parameters αPS x = 10, l PS x = 1, αPS u = 0, l PS u = + . The last two parameters effectively imply that the trial participation probability does not depend on U but X only, ensuring Assumption 2.4. Taking a median with 0.1 and 0.9 ensures Assumption 2.5. Trial participation is then sampled as Bernoulli(P(S = 1 | Xi, Ui)). Finally, for trial participants (Si = 1), the treatment assignment is sampled as Ai Bernoulli(0.5) and the observed outcome is generated as Y = FOMAi(Xi, Ui), which ensures that Assumptions 2.1-2.3 hold. Generating an Observational Sample An observational cohort is generated by sampling Xi, Ui Uniform[ 1, 1] independently. For each patient, treatments are sampled as Ai Bernoulli(P(A = 1|S = 2, Xi, Ui)), where the probability of treatment assignment is generated similar to (27) through a logit function LPA(X, U) sampled from a GP with parameters αPA x , αPA u , l PA x , l PA u R+. The observed outcomes are generated as Y = FOMAi(Xi, Ui). 
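Putting these pieces together, the following simplified sketch (our own condensed version with illustrative parameter values; the released code at the repository above is the authoritative implementation) samples an outcome surface and the two logit functions from the composite linear + squared-exponential kernel in (26), and generates trial participation and observational treatment assignment as described.

```python
import numpy as np

rng = np.random.default_rng(5)

def composite_kernel(P, Q, alpha_x, alpha_u, l_x, l_u):
    """Linear + squared-exponential kernel on (X, U) pairs, as in (26)."""
    lin = alpha_x * np.outer(P[:, 0], Q[:, 0]) + alpha_u * np.outer(P[:, 1], Q[:, 1])
    se = np.exp(-(P[:, 0, None] - Q[None, :, 0]) ** 2 / (2 * l_x ** 2)
                - (P[:, 1, None] - Q[None, :, 1]) ** 2 / (2 * l_u ** 2))
    return lin + se

n = 1_500
XU = rng.uniform(-1, 1, size=(n, 2))                      # (X_i, U_i) pairs

def gp_draw(alpha_x, alpha_u, l_x, l_u):
    # Draw the GP restricted to the sampled points as one correlated Gaussian vector.
    K = composite_kernel(XU, XU, alpha_x, alpha_u, l_x, l_u) + 1e-6 * np.eye(n)
    return rng.multivariate_normal(np.zeros(n), K)

FOM1 = gp_draw(1.0, 1.0, 0.5, 0.5)                        # potential outcomes Y^1_i = FOM_1(X_i, U_i)
logit_PS = gp_draw(10.0, 0.0, 1.0, np.inf)                # selection depends on X only (Asm. 2.4)
P_S = np.clip(1 / (1 + np.exp(-logit_PS)), 0.1, 0.9)      # median{., 0.1, 0.9} in (27) = clipping
S = rng.binomial(1, P_S)

logit_PA = gp_draw(1.0, 10.0, 0.5, 0.5)                   # OS treatment depends strongly on U
A_os = rng.binomial(1, 1 / (1 + np.exp(-logit_PA)))       # hidden confounding once U is concealed

mu1_target = FOM1[S == 0].mean()                          # ground-truth mu_1 = E[Y^1 | S = 0]
print(f"trial size: {S.sum()}, target size: {(S == 0).sum()}, true mu_1: {mu1_target:.3f}")
```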
Simulated Scenarios and GP Parameters We focus on the mean potential outcome under treatment A = 1 in the target population, µ1 = E[Y 1 | S = 0]. The sample size for the observational study (OS) and the target sample are set to 50, 000 and 20, 000, respectively. We experiment with different values for the parameters n1, l FOM1 x , and αPA u . We fit f1(X) from the OS with a neural network, and g1(X; ˆθ), b1(X; ˆγ), h1( X; ˆβ) are fit from the trial sample D1 via polynomial ridge regression with 5-fold cross-validation. We use trial sizes n1 {200, 1000} and l FOM1 x {0.5, 0.2}, where a smaller value leads FOM1(X, U) to change more quickly in response to X (i.e., consist of high order polynomials), thus resulting in a more complex outcome function g1(X). We provide examples in Appendix B.3. When learning f1(X) from the OS, we conceal U and experiment with (l PA u , αPA u ) {( , 0), (0.5, 0), (0.5, 10)}. In the first setting, P(A = 1|S = 2, X, U) does not depend on U, and there is no hidden confounding, which is intro- Prediction-powered Generalization of Causal Inferences duced when l PA u is changed to 0.5. Finally setting αPA u to 10 increases the weight of U in P(A = 1|S = 2, X, U), leading to a larger confounding bias. The preceding sets of hyperparameters lead to 2 2 3 = 12 combinations. For each combination, 100 different datagenerating functions were sampled from the GPs, leading to 1200 distinct scenarios. For each scenario, 100 independent runs were made where a new trial sample D1 was generated, and estimates for µ1 were calculated. An average RMSE is calculated for each scenario over 100 runs and for each combination over 100 scenarios, presented in Figure 4. 5.2. Discussion of Results We first discuss the results in Figure 1 and focus on the advantages of our prediction-powered estimators over the baselines. We then investigate Figure 2 which demonstrates why the augmented outcome modeling (AOM) is more robust than the additive bias correction (ABC) approach. Using the OS Alone As the hidden confounding in the OS increases, ˆµOS-OM 1 in (9), which directly applies f1(X) in the target population, suffers higher RMSE. Note that its performance does not improve with a larger trial size, as confounding bias does not result from a small sample size. Using the Trial Sample Alone The performance of ˆµOM 1 in (4) relies on fitting the outcome function g1(X) from the trial sample accurately. Therefore, it incurs higher RMSE when the trial is small and g1(X) is complex. The RMSE gets even worse when higher-order polynomials are used to fit ˆg1(X) in a small trial, due to the quickly increasing variance term (7) (see Appendix B.1). Combining Trial and Observational Data Predictionpowered estimators ˆµABC 1 and ˆµAOM 1 yield significant improvement over ˆµOM 1 when the trial is small, outcome function is complex, and the hidden confounding is small. This is because when f1(X) accurately estimates away most of the complex structure in g1(X), the resulting bias function has a small norm, i.e., b1(X) 0 (see Appendix B.3 for some examples), and is more feasible to fit from the small trial sample and generalize to the target population. Confounding in the OS leads to slightly worse RMSEs for ˆµABC 1 and ˆµAOM 1 , but they still compare favorably to ˆµOM 1 . Note that a larger hidden confounding need not impede benefiting from f1(X), so long as b1(X) consists of lower degree polynomials. 
The results in Figure 4 are averaged over many scenarios, and we provide examples in Appendix B.3 where b1(X) is also complex and no improvement over the baseline is achieved. Finally, when the trial is large enough to support fitting g1(X) with a higher order polynomial, ˆµOM 1 perform on par with ˆµABC 1 and ˆµAOM 1 . Figure 5. Convention same as Figure 4. The observational predictor is not trained on any data but generates i.i.d. noise for all X. Which Prediction-powered Estimator is Better? For the experiments in Figure 4, ˆµABC 1 and ˆµAOM 1 have similar performances. We demonstrate the robustness of ˆµAOM 1 in Figure 5, where f1(X) is just noise and b1(X) is harder to estimate than g1(X). While ˆµABC 1 suffers high RMSE, ˆµAOM 1 simply ignores f1(X) as a regressor and retains the performance of the baseline approach. 6. Concluding Remarks We investigated the statistical challenges of generalizing causal inferences from a randomized controlled trial to a target population whose characteristics differ from the trial. We showed how observational data could make generalization more statistically feasible without unrealistic assumptions. Through a diverse set of synthetic experiments, we verified the effectiveness of our methods. Future work includes investigating more flexible approaches to leverage observational data and exploring further experiment setups (e.g., with different kernels to simulate specific real-world settings) to gain further insight into the potentials and limitations of integrating experimental and observational evidence. Acknowledgements The authors thank the anonymous reviewers for their thoughtful comments and contributions to our experimental results during the discussion phase, and to Sontag Lab members for insightful discussions. ID was supported by funding from the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard. This study was supported in part by Office of Naval Research Award No. N00014-21-1-2807. Impact Statement Causal inference is crucial in medicine. The present manuscript contributes to the rapidly growing field of integrating gold-standard data from trials and real-world evidence. We propose statistically valid methods that can leverage large-scale observational data to power better generalization of causal effects from a trial to a target population. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Prediction-powered Generalization of Causal Inferences Angelopoulos, A. N., Bates, S., Fannjiang, C., Jordan, M. I., and Zrnic, T. Prediction-powered inference. Science, 382 (6671):669 674, 2023. doi: 10.1126/science.adi6000. Bareinboim, E. and Pearl, J. Causal inference and the datafusion problem. Proceedings of the National Academy of Sciences, 113(27):7345 7352, 2016. Chen, S., Zhang, B., and Ye, T. Minimax rates and adaptivity in combining experimental and observational data. ar Xiv preprint (2109.10522), September 2021. Cheng, D. and Cai, T. Adaptive combination of randomized and observational data. ar Xiv preprint (2111.15012), November 2021. Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters, 2018. Colnet, B., Mayer, I., Chen, G., Dieng, A., Li, R., Varoquaux, G., Vert, J.-P., Josse, J., and Yang, S. Causal inference methods for combining randomized trials and observational studies: a review. Statistical Science, 39(1): 165 191, 2024. 
Dahabreh, I. J., Robertson, S. E., Tchetgen, E. J., Stuart, E. A., and Hern an, M. A. Generalizing causal inferences from individuals in randomized trials to all trial-eligible individuals. Biometrics, 75(2):685 694, June 2019. Dahabreh, I. J., Robertson, S. E., Steingrimsson, J. A., Stuart, E. A., and Hernan, M. A. Extending inferences from a randomized trial to a new target population. Statistics in medicine, 39(14):1999 2014, 2020. De Bartolomeis, P., Abad, J., Donhauser, K., and Yang, F. Detecting critical treatment effect bias in small subgroups. ar Xiv preprint ar Xiv:2404.18905, 2024a. De Bartolomeis, P., Martinez, J. A., Donhauser, K., and Yang, F. Hidden yet quantifiable: A lower bound for confounding strength using randomized trials. In International Conference on Artificial Intelligence and Statistics, pp. 1045 1053. PMLR, 2024b. Degtiar, I., Layton, T., Wallace, J., and Rose, S. Conditional cross-design synthesis estimators for generalizability in medicaid. Biometrics, 2023. Demirel, I., De Brouwer, E., Hussain, Z. M., Oberst, M., Philippakis, A. A., and Sontag, D. Benchmarking observational studies with experimental data under rightcensoring. In International Conference on Artificial Intelligence and Statistics, pp. 4285 4293, 2024. Forbes, S. P. and Dahabreh, I. J. Benchmarking observational analyses against randomized trials: a review of studies assessing propensity score methods. Journal of general internal medicine, 35:1396 1404, 2020. Guo, W., Wang, S., Ding, P., Wang, Y., and Jordan, M. I. Multi-source causal inference using control variates. ar Xiv preprint ar Xiv:2103.16689, 2021. Han, L., Shen, Z., and Zubizarreta, J. Multiply robust federated estimation of targeted average treatment effects. Advances in Neural Information Processing Systems, 36: 70453 70482, 2023. Hartman, E., Grieve, R., Ramsahai, R., and Sekhon, J. S. From sample average treatment effect to population average treatment effect on the treated: combining experimental with observational studies to estimate population treatment effects. Journal of the Royal Statistical Society. Series A,, 178(3):757 778, June 2015. Hatt, T., Berrevoets, J., Curth, A., Feuerriegel, S., and van der Schaar, M. Combining observational and randomized data for estimating heterogeneous treatment effects. ar Xiv preprint ar Xiv:2202.12891, 2022. Hernan, M. A. and Robins, J. M. Causal Inference. CRC Press, Boca Raton, FL, February 2021. Hern an, M. A., Hern andez-D ıaz, S., and Robins, J. M. A structural approach to selection bias. Epidemiology, pp. 615 625, 2004. Hussain, Z., Oberst, M., Shih, M.-C., and Sontag, D. Falsification before extrapolation in causal effect estimation. arxiv preprint ar Xiv:2209.13708, 2022. Hussain, Z., Shih, M.-C., Oberst, M., Demirel, I., and Sontag, D. Falsification of internal and external validity in observational studies via conditional moment restrictions. In International Conference on Artificial Intelligence and Statistics, pp. 5869 5898, 2023. Imbens, G. W. and Rubin, D. B. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015. Johansson, F., Shalit, U., and Sontag, D. Learning representations for counterfactual inference. In International conference on machine learning, pp. 3020 3029. PMLR, 2016. Kallus, N., Puli, A. M., and Shalit, U. Removing hidden confounding by experimental grounding. Advances in neural information processing systems, 31, 2018. Karlsson, R. and Krijthe, J. 
Detecting hidden confounding in observational data using multiple environments. Prediction-powered Generalization of Causal Inferences Advances in Neural Information Processing Systems, 36, 2024. Kennedy, E. H. Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics, 17(2):3008 3049, 2023. Liao, L. D., Højbjerre-Frandsen, E., Hubbard, A. E., and Schuler, A. Prognostic adjustment with efficient estimators to unbiasedly leverage historical data in randomized trials. ar Xiv preprint ar Xiv:2305.19180, 2023. NICE. Nice real-world evidence framework, 2022. URL https://www.nice.org.uk/corporate/ ecd9/chapter/overview. Oberst, M., D Amour, A., Chen, M., Wang, Y., Sontag, D., and Yadlowsky, S. Understanding the risks and rewards of combining unbiased and possibly biased estimators, with applications to causal inference. ar Xiv preprint ar Xiv:2205.10467, 2023. Rasmussen, C. E., Williams, C. K., et al. Gaussian processes for machine learning, volume 1. Springer, 2006. Rosenman, E. T. and Owen, A. B. Designing experiments informed by observational studies. Journal of Causal Inference, 9(1):147 171, 2021. Rosenman, E. T., Basse, G., Owen, A. B., and Baiocchi, M. Combining observational and experimental datasets using shrinkage estimators. Biometrics, 2020. Rothwell, P. M. External validity of randomised controlled trials: to whom do the results of this trial apply? . The Lancet, 365(9453):82 93, 2005. Schuler, A., Walsh, D., Hall, D., Walsh, J., Fisher, C., for Alzheimer s Disease, C. P., Initiative, A. D. N., and Study, A. D. C. Increasing the efficiency of randomized trial estimates via linear adjustment for a prognostic score. The International Journal of Biostatistics, 18(2):329 356, 2021. Shalit, U., Johansson, F. D., and Sontag, D. Estimating individual treatment effect: generalization bounds and algorithms. In International conference on machine learning, pp. 3076 3085. PMLR, 2017. Stuart, E. A., Cole, S. R., Bradshaw, C. P., and Leaf, P. J. The use of propensity scores to assess the generalizability of results from randomized trials. Journal of the Royal Statistical Society: Series A (Statistics in Society), 174 (2):369 386, 2011. Wainwright, M. J. High-dimensional statistics: A nonasymptotic viewpoint, volume 48. Cambridge University Press, 2019. Yang, S. and Ding, P. Combining multiple observational data sources to estimate causal effects. Journal of the American Statistical Association, 2019. Yang, S., Gao, C., Zeng, D., and Wang, X. Elastic integrative analysis of randomised trial and real-world data for treatment heterogeneity estimation. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85 (3):575 596, 2023. Prediction-powered Generalization of Causal Inferences A. Proofs and Additional Results A.1. Deferred Proofs Proposition 3.1. Let X be a categorical covariate stratifying the population into K groups and denote by ns=1,a,k the number of trial participants from group X = k assigned to treatment A = a, and by σ2 a,k the variance of outcome among such patients. Let us estimate the outcome function ga(X = k) with the sample mean of outcomes Y of participants in group X = k assigned to treatment A = a. Suppose that Assumptions 2.1-2.5 hold. When n0 is large, the MSE of ˆµOM a in (4) can be approximated as E[(ˆµOM a µa)2] k=1 p2 s=0(k) σ2 a,k ns=1,a,k , (5) where ps=0(k) := P(X = k | S = 0) is the proportion of patients from group X = k in the target population. Proof. 
Let us denote by Ds=1,a,k D1 the set of trial participants in group k that received treatment A = a, with size ns=1,a,k = Pn i=1 1 {Xi = k, Si = 1, Ai = a}. We estimate the outcome model as ˆga(X = k) = Pn i=1 1 {Xi = k, Si = 1, Ai = a} Yi ns=1,a,k . (28) ˆga(X = k) is simply the sample mean of outcomes Yi in Ds=1,a,k and we have E [ˆga(X = k)] = E [Y | X = k, S = 1, A = a] = E [Y a | X = k, S = 1, A = a] = E [Y a | X = k, S = 1] = E [Y a | X = k, S = 0] (29) where the first equality follows from the unbiasedness of the sample mean and the rest from Assumptions 2.1, 2.2, and 2.4, respectively. When X is categorical and ˆga(X) is estimated via (28), (4) admits the following equivalent expression. k=1 ˆps=0(k)ˆga(k). (30) where ˆps=0(k) = Pn i=1 1{Si=0,Xi=k} Pn i=1 1{Si=0} is the proportion of patients in the target sample D0 from group k. Note that the target D0 and trial D1 samples are disjoint of each other. Since ˆps=0(k) is effectively calculated from the observations in the target sample D0 only, and similarly ˆga(k) from D1 only, ˆps=0(k) and ˆga(k) are independent. Following (30), we can then write E[ˆµOM a ] = E k=1 ˆps=0(k)ˆga(k) k=1 E [ˆps=0(k)ˆga(k)] k=1 E [ˆps=0(k)] E [ˆga(k)] k=1 ps=0(k)E [Y a | X = k, S = 0] (31) = E[Y a | S = 0] (law of total expectation) Prediction-powered Generalization of Causal Inferences where (31) follows from the unbiasedness of sample proportion and (29). Next, by the law of total variance we write Var ˆµOM a = ED0 Var D1(ˆµOM a | D0) + Var D0 ED1 ˆµOM a | D0 . (33) We start with the first term. ED0 Var D1(ˆµOM a | D0) = ED0 k=1 ˆps=0(k)ˆga(k) D0 k=1 ˆp2 s=0(k)Var D1 ˆga(k) D0 # k=1 ED0 ˆp2 s=0(k)Var D1 ˆga(k) D0 k=1 ED0 ˆp2 s=0(k) Var D1 (ˆga(k)) (35) ED0 [ˆps=0(k)]2 + Var D0 (ˆps=0(k)) | {z } n0 0 Var D1 (ˆga(k)) (36) k=1 p2 s=0(k) σ2 a,k ns=1,a,k , (37) where (34) holds since the participants in different groups are independent of each other and ˆp2 s=0(k) is no longer random after conditioning on D0. Similar to above, (35) holds since ˆga(k) is independent of the target sample D0 and therefore ˆp2 s=0(k). (36) follows after writing the variance of the sample proportion Var D0 (ˆps=0(k)) = ps=0(k) (1 ps=0(k)) For the second term we write Var D0 ED1 ˆµOM a | D0 = Var D0 k=1 ˆps=0(k)ˆga(k) | D0 k=1 ˆps=0(k)ED1 [ˆga(k) | D0] k=1 ˆps=0(k)ED1 [ˆga(k)] k=1 ˆps=0(k)E [Y a | X = k, S = 0] k=1 Var D0 (ˆps=0(k)) | {z } n0 0 E [Y a | X = k, S = 0]2 (39) where (39) follows again from (38). Combining (33), (37), and (40), we have, as n0 goes to infinity, k=1 p2 s=0(k) σ2 a,k ns=1,a,k (41) Prediction-powered Generalization of Causal Inferences Finally, we have E[(ˆµOM a µa)2] = E[(ˆµOM a µa)]2 | {z } 0 by (32) +Var ˆµOM a k=1 p2 s=0(k) σ2 a,k ns=1,a,k , (by (41)) and we are done. Theorem 3.2. Suppose that Assumptions 2.1-2.5 hold and consider a parametric estimator ˆga(X) = ga(X; ˆθ) for the outcome function. For large n0, the MSE of ˆµOM a in (4) can be approximated as E[(ˆµOM a µa)2] EX P0 Eˆθ A(P1)[ga(X; ˆθ)] ga(X) | {z } =: SBg(X) + Varˆθ A(P1) EX P0[ga(X; ˆθ)] . (7) E[ ˆµOM a µa 2] = E ˆµOM a µa 2 | {z } Bias2 + Var(ˆµOM a ) | {z } Variance We will start with the bias term. Note that, under Assumptions 2.1-2.5, we have µa = E[E[Y | X, S = 1, A = a] | S = 0] (see (2)) = E[ga(X) | S = 0] (by definition, see (3)) = EX P0[ga(X)]. (43) with a manipulation of notation at the final step. 
Once ˆθ is estimated from the trial sample D1 via an algorithm A, ˆµOM a is calculated by effectively taking a sample mean of ga(Xi; ˆθ) for the covariates Xi in the target sample D0. We can write, E[ˆµOM a ] = E i=1 1 {Si = 0} ga(Xi; ˆθ) Xi D0 ga(Xi; ˆθ) Xi D0 ga(Xi; ˆθ) ˆθ Xi D0 ga(Xi; ˆθ) h EX P0 h ga(X; ˆθ) ii = EX P0 h Eˆθ P1 h ga(X; ˆθ) ii . (44) since Xi D0 are i.i.d and independent of ˆθ. Combining (43) and (44) we write Bias = E ˆµOM a µa = EX P0 h Eˆθ A(P1)[ga(X; ˆθ)] ga(X) i . (45) Prediction-powered Generalization of Causal Inferences We continue with the variance term by invoking the law of total variance. Variance = Var ˆµOM a = Varˆθ A(P1),D0 Xi D0 ga(Xi; ˆθ) = Varˆθ A(P1) Xi D0 ga(Xi; ˆθ) + Eˆθ A(P1) Xi D0 ga(Xi; ˆθ) | {z } n0 0 Varˆθ A(P1) Xi D0 ga(Xi; ˆθ) = Varˆθ A(P1) EX P0 h ga(X; ˆθ) i , (46) where in the third equality, the variance of the sample mean vanishes for large n0. Last two steps follow since the target sample D0 and ˆθ are independent and Xi D0 are i.i.d. Plugging (45) and (46) into the definition of the MSE in (42), we are done. Lemma 4.1. Suppose that Assumptions 2.1-2.5 hold. Let fa : X R and define the error variable Z := fa(X) Y. (12) µa can be identified as µa = EX P0[fa(X) E[Z | X, S = 1, A = a] . (13) Proof. Recall that we have µa = E [fa(X) | S = 0] E [fa(X) Y a | S = 0] . (47) Note that the prediction model fa is fixed. Therefore we have E[fa(X) | S = 0] = EX P0[fa(X)], (48) which is only a change of notation. Further, conditioned on X, fa(X) is no more random, We can then write E[fa(X) Y a | S = 0] = EX P0 E[fa(X) Y a | X, S = 0] = EX P0 E[fa(X) Y a | X, S = 1] (49) = EX P0 E[fa(X) Y a | X, S = 1, A = a] (50) = EX P0 E[fa(X) Y | X, S = 1, A = a] (51) = EX P0 E[Z | X, S = 1, A = a] , (52) where (49) is due to Assumptions 2.4 and 2.5, (50) is due to Assumptions 2.2 and 2.3, and (51) is due to Assumption 2.1. Combining (47), (48), and (52) completes the proof. Theorem 4.2. Suppose that Assumptions 2.1-2.5 hold. For large n0, the MSE of ˆµABC a in (14) can be approximated as E[(ˆµABC a µa)2] EX P0 Eˆγ A(P1)[ba(X; ˆγ)] ba(X) 2 (15) + Varˆγ A(P1) EX P0[ba(X; ˆγ)] . (16) Prediction-powered Generalization of Causal Inferences Proof. We have, by Lemma 4.1, µa = E[fa(X) ba(X) | S = 0] = E[fa(X) ba(X) | S = 0]. (53) where ba(X) = E[fa(X) Y | X, S = 1, A = a]. We consider a parametric estimator ba(X; ˆγ) where γ is estimated from the instance-wise prediction errors fa(X) Y in the trial sample D1. We will follow the same steps in the proof of Theorem 3.2 for the most part. E[ ˆµABC a µa 2] = E ˆµABC a µa 2 | {z } Bias2 + Var(ˆµABC a ) | {z } Variance E[ˆµABC a ] = E i=1 1 {Si = 0} fa(Xi) ba(Xi; ˆγ) # Xi D0 fa(Xi) ba(Xi; ˆγ) Since the sample mean is unbiased, it follows that Xi D0 fa(Xi) = EX P0 [fa(X)] . (56) Next, via the same machinery that derives (44), we have Xi D0 ba(Xi, ˆγ) = EX P0 h Eˆγ P1 [ba(X; ˆγ)] i . (57) By (55), (56), and (57), we observe E[ˆµABC a ] = EX P0 h fa(X) Eˆγ P1 [ba(X; ˆγ)] i , (58) which leads to, in combination with (53) Bias = E ˆµABC a µa = EX P0 h ba(X) Eˆγ P1 [ba(X; ˆγ)] i . (59) We continue with the variance term. Variance = Var ˆµABC a Xi D0 fa(Xi) ba(Xi; ˆγ) Xi D0 fa(Xi) | {z } n0 0 + Varˆγ A(P1),D0 Xi D0 ba(Xi; ˆγ) Prediction-powered Generalization of Causal Inferences Varˆγ A(P1) (EX P0 [ba(X; ˆγ)]) . (60) The decomposition in third equality is due to the independence of the models predictions fa(X) and errors fa(X) Y (ˆγ is derived using the latter only). Finally, (60) follows through the same machinery that derives (46). 
We are done after plugging (59) and (60) into (54). A.2. Polynomial Ridge Regression We consider polynomial ridge regression in the trial sample D1 using Legendre polynomials up to degree d , to fit the outcome model and bias model estimates, ˆga and ˆba which results in the following fits with an appropriately chosen penalty parameter λ (Wainwright, 2019). ˆga argmin g F(d ) i=1 (Yi g(Xi))2 ) ˆba argmin b F(d ) i=1 (Zi b(Xi))2 ) k=0 βkϕk(X) | is the class of polynomials up to degree d with bounded norm. The results then follow from the oracle inequalities derived for the orthogonal basis approximation problem in Example 13.14, Section 13.3 of (Wainwright, 2019). A.3. MSE Approximation for the Augmented Outcome Modeling Approach Theorem A.1. Suppose that Assumptions 2.1-2.5 hold. For large n0, the MSE of ˆµAOM a in (25) can be approximated as E[(ˆµAOM a µa)2] EX P0 E ˆβ A(P1)[ha( X; ˆβ)] ha( X) 2 + Var ˆβ A(P1) EX P0[ha( X; ˆβ)] . (64) Proof. The proof follows from the same steps in the proof of Theorem 3.2. A.4. Doubly-Robust Estimation In order to leverage the prognostic model fa(X) in the analysis, we can proceed with two identifications of µa, (13) and (24), for which we considered estimators based only on regression functions ((14) and (25)). However, in practice, we can directly use the so-called doubly-robust (DR) estimators for (13) and (24), which have several desirable properties. In addition to a regression function component, DR estimators also have weighting function components, which, in our case, are the probability of trial enrollment, P(S = 1 | X), and the probability of treatment assignment in the trial, P(A = a | X, S = 1). Estimators based only on weighting models are also available but will not be covered here in the interest of space. (Dahabreh et al., 2020) derives a generic DR estimator for the functional EX P0[E[Y | X, S = 1, A = a]], (65) which we can directly adopt and use to estimate (24) and the second term of (13). Estimating the first term of (13) remains unchanged as the average of predictions fa(X) in the target sample. We make the following definitions. p = P(S = 1). (66) p(X) = P(S = 1 | X). (67) πa(X) = P(A = a | X, S = 1), (68) and denote by ˆp, ˆp(X), and ˆπa(X) their estimates. Note that in order to use DR estimators, one now needs to fit those functions using the composite sample D. Next we give the DR estimator for (13) and (24). ˆµDR-ABC a = 1 i=1 1 {Si = 0} fa(Xi) Prediction-powered Generalization of Causal Inferences + 1 n(1 ˆp) 1 {Si = 0}ˆba(Xi) + 1 {Si = 1, A = a} 1 ˆp(Xi) ˆp(Xi)ˆπa(Xi) Zi ˆba(Xi) . (69) ˆµa DR-PA = 1 n(1 ˆp) 1 {Si = 0} ˆha( Xi) + 1 {Si = 1, A = a} 1 ˆp(Xi) ˆp( Xi)ˆπa( Xi) Y ˆha( Xi) . (70) An essential property of DR estimators is that consistent estimation of either the regression or weighting functions guarantees consistent estimation of µa, hence the name doubly-robust . Beyond this DR property, however, they have other desirable properties (under certain regularity conditions or cross-fitting techniques (Chernozhukov et al., 2018)) such as asymptotical efficiency and normality, which enable one to construct confidence intervals beyond point prediction and allow for, e.g., calculating p-values and testing hypotheses. We refer the interested reader to (Kennedy, 2023) for a unifying overview of the theory around the DR estimators, their properties, and how to construct them for different estimands of interest. We present empirical results for the DR estimators in Appendix B.2. B. Additional Experimental Results B.1. 
B. Additional Experimental Results

B.1. Bias-Variance Tradeoff

We do not plot the bias and variance terms for µ̂^OS-OM_1. Its variance is negligible (≈ 0), since the observational predictor is applied directly to the target sample, nothing is fit from the small trial data, and the target sample is taken to be large. Almost all of its MSE (see Figure 4) therefore consists of the bias, which results from the hidden confounding introduced by concealing the U variable (see Figure 3).

In Figure 6, we see that the bias resulting from estimating the outcome function g_1(X) from the trial sample is very large with a small model. Although it decreases with a larger model, as expected, Figure 7 shows that the variance quickly explodes when the trial is small (n_1 = 200) and g_1(X) is complex and has high intrinsic variation. When n_1 = 1000, the variance terms decrease by a factor of 10, and the RMSE of µ̂^OM_1 becomes comparable to that of the prediction-powered estimators when higher-degree polynomials are fit for ĝ_1(X). Our approaches leveraging the additional predictor have smaller bias and variance terms, and the difference is most significant when the trial is small and the outcome function is complex. (A minimal simulation sketch of how such bias and variance estimates can be computed is given after Appendix B.3.)

Figure 6. Convention is the same as in Figure 4. The average squared bias of each estimator is plotted.

Figure 7. Convention is the same as in Figure 4. The average variance of each estimator is plotted.

B.2. IPW and DR-based Estimators

In Figure 8, we include the generalization RMSE for the doubly-robust (DR) and inverse propensity weighting (IPW) estimators. DR versions of our methods are given in Appendix A.4; the baseline IPW and DR estimators are detailed in Dahabreh et al. (2020). The IPW estimator performs the worst due to the high variance of the propensity weight estimates, and the DR estimators perform similarly to the outcome-model (OM) estimators. Dahabreh et al. (2020) report similar results. Finally, we note that as the trial sample size n_1 increases, the MSEs of the different estimators converge, as before.

Figure 8. Generalization RMSE with DR and IPW estimators included.

B.3. Example Cases

In Figures 9-12, we demonstrate several example cases. We plot the ground-truth functions for the synthetic data-generating processes, an example trial sample used to fit b̂_1(X), ĝ_1(X), and ĥ_1(X) (only the linear fits, i.e., 1st-degree polynomials, are plotted for simplicity), and the observational predictor f_1(X). We aim to demonstrate how the outcome function g_1(X) becomes more wiggly as ℓ^FOM1_x decreases, with more rapid turns representing higher-order polynomials. Further, we see that as the hidden confounding increases, i.e., as we move along the x-axis of the plots, the bias of the observational predictor, b_1(X) = g_1(X) − f_1(X), also increases and becomes a higher-norm function, decreasing the utility of leveraging observational data and increasing the RMSE. As noted earlier, one can see in Figures 11 and 12 that, for the cases with ℓ^FOM1_x = 0.2 (complex outcome function g_1(X)) and (ℓ^PA_u = 0.5, α^FOM1_u = 10) (large hidden confounding), the bias function b_1(X) is also a complex, high-norm function, and the RMSE of the prediction-powered approaches is not significantly better than that of the baseline estimator µ̂^OM_1.
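As referenced in Appendix B.1, the following is a minimal, illustrative sketch (not the paper's experiment code) of how the squared bias and variance of a generalization estimator can be estimated by repeated simulation; draw_trial and estimator are hypothetical callables standing in for a data-generating process and any of the estimators sketched above.

```python
import numpy as np

def mc_bias_variance(estimator, draw_trial, X0, mu_true, n_reps=100, seed=0):
    """Monte Carlo squared bias and variance of a generalization estimator.

    draw_trial(rng) -> (X1, Y1, A1)        # a fresh trial sample from the DGP
    estimator(X1, Y1, A1, X0) -> float     # e.g., an outcome-model or ABC estimator
    mu_true                                # ground-truth mean in the target population
    """
    rng = np.random.default_rng(seed)
    estimates = np.array([estimator(*draw_trial(rng), X0) for _ in range(n_reps)])
    squared_bias = (estimates.mean() - mu_true) ** 2
    variance = estimates.var(ddof=1)
    return squared_bias, variance
```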
B.4. Using (Generalized) Linear Models in the Data-generating Process

We sincerely thank Reviewer xwX2 for taking the time during the author-reviewer discussion phase to provide the initial codebase for the results presented here. Instead of generating the outcome and propensity score functions using GPs, we use polynomial models. The full outcome model under treatment A = 1 is specified as a 5th-order polynomial with parameters β. Precisely, we set

$$\mathrm{FOM}_1(X, U) = \beta_0 + \beta^X_1 X + \dots + \beta^X_5 X^5 + \gamma\big(\beta^U_1 U + \dots + \beta^U_5 U^5 + \beta^{XU}_1 XU + \dots + \beta^{XU}_5 (XU)^5\big), \tag{71}$$

and observe Y^1 = FOM_1(X, U) + ϵ, where ϵ ∼ N(0, σ²) for some σ ∈ ℝ_+, which we use to model the intrinsic variation in the outcome observations. Larger values of σ increase the risk of overfitting when the trial size n_1 is small. The β parameters characterize the complexity of the outcome function. The probability of selection into the trial and the probability of treatment assignment in the OS are modeled as follows:

$$P(S = 1 \mid X, U) = \frac{1}{1 + \exp\big(\lambda_0 + \lambda_1 X + \dots + \lambda_5 X^5\big)}, \tag{72}$$

$$P(A = 1 \mid S = 2, X, U) = \frac{1}{1 + \exp\big(\alpha_0 + \alpha^X_1 X + \dots + \alpha^X_5 X^5 + \gamma(\alpha^U_1 U + \dots + \alpha^U_5 U^5 + \alpha^{XU}_1 XU + \dots + \alpha^{XU}_5 (XU)^5)\big)}. \tag{73}$$

γ ∈ ℝ_+ determines the amount of hidden confounding in the OS, as U is concealed. Further, the λ parameters characterize how weak the overlap is between the trial and target samples. Briefly, larger values of the β, γ, λ, and σ parameters make the generalization task more challenging. In Table 1, we present the generalization MSEs under various settings. We always use α ∼ N(0, 1). We sample 100 ground-truth (α, β) parameter sets for each setting, make 100 independent runs for each ground truth, and present the average MSE values. We use both 1st- and 5th-order polynomials to fit the bias and outcome functions, b̂_1(X) and ĝ_1(X). (An illustrative simulation sketch of this data-generating process is given at the end of this appendix.)

| Setting | 1st-order fit: µ̂^ABC_1 | 1st-order fit: µ̂^OM_1 | 5th-order fit: µ̂^ABC_1 | 5th-order fit: µ̂^OM_1 |
| γ = 0, ϵ ∼ N(0, 0.1²), β ∼ N(0, 1), λ = 1 | .0001 | .0092 | .0002 | .0002 |
| γ = 1, ϵ ∼ N(0, 0.1²), β ∼ N(0, 1), λ = 1 | .0087 | .0183 | .0148 | .0148 |
| γ = 0, ϵ ∼ N(0, 2²), β ∼ N(0, 1), λ = 1 | .0044 | .0054 | .0066 | .0066 |
| γ = 0, ϵ ∼ N(0, 2²), β ∼ N(0, 2²), λ = 1 | .0044 | .0082 | .0066 | .0066 |
| γ = 0, ϵ ∼ N(0, 2²), β ∼ N(0, 2²), λ = 2 | .0048 | .0127 | .0151 | .0151 |
| γ = 1, ϵ ∼ N(0, 2²), β ∼ N(0, 2²), λ = 2 | .0060 | .0141 | .0139 | .0139 |

Table 1. Generalization MSEs using (generalized) linear models in the data-generating process.

Figure 9. Example case 1.

Figure 10. Example case 2.

Figure 11. Example case 3.

Figure 12. Example case 4.
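As referenced in Appendix B.4, below is a minimal sketch of one way to simulate the data-generating process in (71)-(73). The covariate and parameter distributions (uniform X, standard-normal draws scaled by the setting parameters) are illustrative assumptions; only the functional forms follow the equations above.

```python
import numpy as np

def poly5(x, coefs):
    """c_1 * x + c_2 * x^2 + ... + c_5 * x^5 for a length-5 coefficient vector."""
    return sum(c * x ** k for k, c in enumerate(coefs, start=1))

def sample_dgp(n, gamma=1.0, sigma=0.1, lam_scale=1.0, seed=0):
    """Illustrative simulation of the (generalized) linear DGP in (71)-(73)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=n)   # observed covariate (assumed range)
    U = rng.normal(size=n)               # variable that is concealed in the OS

    # Parameter draws; the experiments use alpha ~ N(0, 1) and vary the scale of beta.
    beta0, beta_x = rng.normal(), rng.normal(size=5)
    beta_u, beta_xu = rng.normal(size=5), rng.normal(size=5)
    lam0, lam_x = lam_scale * rng.normal(), lam_scale * rng.normal(size=5)
    alpha0, alpha_x = rng.normal(), rng.normal(size=5)
    alpha_u, alpha_xu = rng.normal(size=5), rng.normal(size=5)

    # Full outcome model under treatment A = 1, eq. (71), plus intrinsic noise.
    fom1 = beta0 + poly5(X, beta_x) + gamma * (poly5(U, beta_u) + poly5(X * U, beta_xu))
    Y1 = fom1 + rng.normal(scale=sigma, size=n)

    # Probability of selection into the trial, eq. (72).
    p_sel = 1.0 / (1.0 + np.exp(lam0 + poly5(X, lam_x)))
    S = rng.binomial(1, p_sel)

    # Treatment assignment in the OS, eq. (73); gamma > 0 induces hidden confounding
    # because U enters both the outcome and the assignment but is concealed.
    logit_os = alpha0 + poly5(X, alpha_x) + gamma * (poly5(U, alpha_u) + poly5(X * U, alpha_xu))
    A_os = rng.binomial(1, 1.0 / (1.0 + np.exp(np.clip(logit_os, -30, 30))))

    return X, U, Y1, S, A_os
```

Larger values of sigma, gamma, and lam_scale (and a larger scale for the beta draws) reproduce the harder settings in Table 1.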