# Adjustment Criteria for Generalizing Experimental Findings

Juan D. Correa (1), Jin Tian (2), Elias Bareinboim (1)

(1) Department of Computer Science, Purdue University, Indiana, USA. (2) Computer Science Department, Iowa State University, IA, USA. Correspondence to: Juan D. Correa.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Generalizing causal effects from a controlled experiment to settings beyond the particular study population is arguably one of the central tasks found in empirical circles. While a proper design and careful execution of the experiment would support, under mild conditions, the validity of inferences about the population in which the experiment was conducted, two challenges make the extrapolation step to different populations somewhat involved, namely, transportability and sampling selection bias. The former is concerned with disparities in the distributions and causal mechanisms between the domain (i.e., settings, population, environment) where the experiment is conducted and where the inferences are intended; the latter with distortions in the sample's proportions due to preferential selection of units into the study. In this paper, we investigate the assumptions and machinery necessary for using covariate adjustment to correct for the biases generated by both of these problems, and generalize experimental data to infer causal effects in a new domain. We derive complete graphical conditions to determine if a set of covariates is admissible for adjustment in this new setting. Building on the graphical characterization, we develop an efficient algorithm that enumerates all possible admissible sets with a polytime delay guarantee; this can be useful when some variables are preferred over others due to different costs or amenability to measurement.

## 1. Introduction

Scientific inferences in data-driven disciplines entail some understanding of the laws of nature and a web of cause and effect relationships. For instance, policy-makers aiming to improve the economic condition of a certain population need to understand how a tax increase would affect consumers' behavior and, in turn, economic activity; or, health scientists trying to develop a new treatment for prostate cancer would have to understand how their new drug interacts with the body and affects the cancer's progression (Pearl, 2000; Spirtes et al., 2001; Bareinboim and Pearl, 2016).

Controlled experimentation is one of the most pervasive methods to probe for such effects, deemed the gold standard for scientific research in empirical circles. The main idea is to generate a controlled environment where the behavior of an outcome variable can be observed under two regimes: one where a certain condition (e.g., drug A) is present and another where it isn't (placebo), under the ceteris paribus condition. If all other factors are held constant, intuitively, any difference in the outcome can be attributed to the action, i.e., to a causal relationship between them. In the medical sciences, this appears under the rubric of Randomized Controlled Trials (RCTs). In fact, the Food and Drug Administration (FDA) spends billions of dollars every year to support systematic, controlled, and large-scale experimentation (National Academy of Medicine, 2010).

Science is largely about generalization. Most experimental findings are intended to be generalized to a broader, or even different, target domain (in other words, population, setting, environment). In medicine, for instance, some of the most important pharmaceutical discoveries were first developed and tested using rats as subjects, while the goal was to use the results to treat humans.
In psychology, college students are usually the subjects of experimentation, so as to answer questions about human cognition, which, broadly speaking, includes subjects with and without exposure to higher education. In many machine learning settings, agents are trained by performing actions in simulated environments, where the goal is to deploy these systems in other, possibly real, environments that don't match the training ground. In all these settings, an extrapolation step from the causal distribution where the experiment was conducted to where the inference is intended is required. If the source distribution is such that its conclusions can be extrapolated to the target domain, the same is said to have external validity.

External validity has been considered one of the main research challenges by the current generation of data (empirical) scientists (Altman et al., 2001). In particular, we'll discuss two challenges that threaten external validity, namely, transportability and sampling selection bias.

Example 1. Greenhouse et al. discussed the challenges of generalizability in the risk of suicidality among pediatric antidepressant users. On investigating the causal relationship between antidepressant use and the risk of suicide attempt, the FDA performed several RCTs, finding that youths receiving antidepressants ($X$) had approximately twice the amount of suicidal thoughts and behaviors ($Y$) compared to the control groups. These results led to a new policy and the issue of a strict warning on the drugs' label. Surprisingly, following the warning, reports suggested a decrease in the number of prescriptions and an increase in suicidal events in the corresponding age groups. Furthermore, several observational studies found a decrease in the risk of suicide in patients being treated with the same antidepressants, even after adjusting for access to mental health-care and other confounding factors.
Some of the possible explanations for this discrepancy are:

1. Transportability: There is a mismatch between the study population and the general clinical population regarding ethnicity, race, and income (covariates named $E$).
2. Sampling selection bias: The FDA's studies sampled from a distinct population by excluding youths with elevated baseline risk for suicide ($B$) from their cohorts.

The problem of extrapolating experimental findings across domains that differ both in their distributions and inherent causal characteristics (e.g., rats to humans) is usually called transportability (Bareinboim and Pearl, 2016). Special cases of transportability are found in the literature under different rubrics, including lack of external validity (Campbell and Stanley, 1963; Manski, 2007), heterogeneity (Höfler et al., 2010) and meta-analysis (Glass, 1976; Hedges and Olkin, 1985). Issues of transportability can be represented graphically in a causal diagram by adding a special variable in the form of a square, $T$, which represents the unobserved disparity-generating factors. For instance, Fig. 1(a) represents the causal diagram of Example 1.

Sampling selection bias appears due to preferential exclusion of units from the sample. The data-gathering process will, therefore, reflect a distortion in the sample's proportions and, since the data is no longer a faithful representation of the underlying population, biased estimates will be produced regardless of the number of samples collected (even if the treatment is controlled). Different biases fall under the umbrella of sampling selection bias, including censoring, self-selection/volunteering and non-response (Hernán et al., 2004). Selection bias can be represented graphically through a special hollow node $S$; see Fig. 1(a). $S$ can be seen as an indicator where $S=1$ if a unit is included in the sample, and $S=0$ otherwise (Bareinboim and Pearl, 2012).
In this paper, our goal is to explicate the general principle that licenses extrapolation across settings when issues of transportability and selection bias are both present. We'll address this problem using the covariate adjustment technique (Pearl, 2000). Adjusting by a set of covariates is arguably the most widely used technique for causal effects estimation. Although usually used to control for confounding bias in observational data, it has recently been shown to be suitable when selection bias is present as well (Correa and Bareinboim, 2017; Correa et al., 2018).

Figure 1. Selection diagrams with $T$ and $S$ nodes indicating differences between populations and the sampling selection mechanism. (a) See Example 1. (b) See Example 2.

In this paper, we investigate the challenge of estimating causal effects when the input distribution is experimental, plagued with selection bias, and collected from a population that is structurally different from the one where the inferences are intended. We introduce a covariate adjustment formulation to overcome the issues due to both transportability and selection bias. More specifically, our contributions are as follows:

1. Generalization Adjustment Formula. We introduce a covariate adjustment formulation that uses selection-biased experimental data from a source population and unbiased data from a target population to produce an unbiased and valid estimand of a target causal effect.
2. Graphical Characterization. We prove a necessary and sufficient graphical condition for the admissibility of a set of covariates for this adjustment.
3. Algorithmic Characterization. We develop a complete algorithm that runs with polynomial delay and enumerates all sets suitable for adjustment according to the causal distribution and model, from which the researcher can pick with arbitrary criteria (e.g., low measurement cost, higher statistical precision).

## 2. Preliminaries and Related Work

Structural Causal Models.
The systematic analysis of transportability and selection bias requires a formal language where the characterization of the underlying data-generating model can be encoded explicitly. We use the language of Structural Causal Models (SCMs) (Pearl, 2000). Formally, an SCM $M$ is a 4-tuple $\langle \mathbf{U}, \mathbf{V}, F, P(\mathbf{u}) \rangle$, where $\mathbf{U}$ is a set of exogenous (latent) variables and $\mathbf{V}$ is a set of endogenous (measured) variables. $F$ represents a collection of functions such that each variable $V_i \in \mathbf{V}$ is determined by $f_i \in F$, where $f_i$ is a mapping from the respective domain of $U_i \cup Pa_i$ to $V_i$, $U_i \subseteq \mathbf{U}$, $Pa_i \subseteq \mathbf{V} \setminus \{V_i\}$, and the entire set $F$ forms a mapping from $\mathbf{U}$ to $\mathbf{V}$. Uncertainty is encoded through a probability distribution over the exogenous variables, $P(\mathbf{u})$. We will denote variables by capital letters, and their realized values by small letters. Sets of variables are denoted in bold.

Within the structural semantics, performing an action/intervention of setting $X=x$ is represented through the do-operator, $do(X=x)$, which encodes the operation of replacing the original equation of $X$ by the constant $x$, inducing a submodel $M_x$ and an experimental distribution $P_x(\mathbf{v})$. An experiment can be thought of as physically replacing this equation by assigning a treatment, instead of letting it occur naturally. The causal effect of $X$ on a set of variables $\mathbf{Y}$ is defined as $P_x(\mathbf{y})$, that is, the distribution over $\mathbf{Y}$ in the intervened model $M_x$. We will also use do-calculus to derive causal expressions from other causal quantities. For a detailed discussion of SCMs and do-calculus, we refer readers to Pearl (2000).

Every SCM $M$ induces a causal diagram $G$ represented as a directed acyclic graph where every variable $V_i \in \mathbf{V}$ is a vertex, and there exists a directed edge from every variable in $Pa_i$ to $V_i$. Also, for every pair $V_i, V_j \in \mathbf{V}$ such that $U_i \cap U_j \neq \emptyset$, there exists a bidirected edge between $V_i$ and $V_j$.
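To make the do-operator concrete, the following is a minimal sketch of a toy binary SCM evaluated by exact enumeration over the exogenous variables. The variable names, the structural functions, and the helpers `solve`, `do`, and `distribution` are all illustrative assumptions, not part of the paper.

```python
from itertools import product

# Hypothetical exogenous distribution P(u): independent binary noise terms.
P_U = {"u_z": 0.5, "u_x": 0.1, "u_y": 0.2}

# Hypothetical structural functions F, one per endogenous variable.
# Each maps (already-computed endogenous values, exogenous values) to a value.
F = {
    "Z": lambda v, u: u["u_z"],
    "X": lambda v, u: v["Z"] ^ u["u_x"],
    "Y": lambda v, u: (v["X"] | v["Z"]) ^ u["u_y"],
}
ORDER = ("Z", "X", "Y")  # a topological order of the induced diagram


def solve(functions, u):
    """Evaluate every endogenous variable for one setting of U."""
    v = {}
    for name in ORDER:
        v[name] = functions[name](v, u)
    return v


def do(functions, **assignments):
    """Return the submodel M_x: replace each intervened equation by a constant."""
    submodel = dict(functions)
    for name, value in assignments.items():
        submodel[name] = lambda v, u, value=value: value
    return submodel


def distribution(functions, target):
    """Exact distribution of one binary endogenous variable under P(u)."""
    dist = {0: 0.0, 1: 0.0}
    names = list(P_U)
    for values in product((0, 1), repeat=len(names)):
        u = dict(zip(names, values))
        weight = 1.0
        for n in names:
            weight *= P_U[n] if u[n] == 1 else 1.0 - P_U[n]
        dist[solve(functions, u)[target]] += weight
    return dist


# Causal effect P_{x=1}(Y = 1) in the submodel induced by do(X = 1).
p_do = distribution(do(F, X=1), "Y")[1]
```

Because `do` only swaps the equation for the intervened variable and leaves `P_U` untouched, the sketch mirrors the definition of the submodel $M_x$ given above.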
A distribution is said to be compatible with $G$ if it could be generated by an SCM that induces $G$. We denote as $G_{\overline{X}\underline{Z}}$ the graph resulting from removing all incoming edges to $X$ and all outgoing edges from $Z$ in $G$. We use typical graph-theoretic terminology with the abbreviations $Pa(\mathbf{C})$, $De(\mathbf{C})$, $An(\mathbf{C})$, which stand for the union of $\mathbf{C}$ and its parents, descendants, and ancestors, respectively. The expression $(X \perp Y \mid Z)_G$ denotes that $X$ is independent of $Y$ given $Z$ in the graph $G$ according to the d-separation criterion (Pearl, 2000) (the subscript $G$ may be omitted).

Transportability. Transportability theory is concerned with the conditions under which experimental data from one environment ($\pi$) can be used to establish a causal quantity in a different domain ($\pi^*$), where $\pi$ and $\pi^*$ are different but somewhat related domains; that is, assessing the causal effect of $X$ on $Y$ in the target domain (i.e., $P^*_x(y)$) using measurements over a set of variables under experiments in a different environment (i.e., $P_x(\mathbf{v})$). Different conditions were studied in the literature, for instance, in (Pearl and Bareinboim, 2011; Bareinboim et al., 2013; Bareinboim and Pearl, 2014).

The first critical component of any transportability analysis is to formally express the assumptions about the differences between the domains of interest. In particular, the overlapping of two causal diagrams is used to express such differences; the result is called a selection diagram.

Definition 1 (Selection Diagram (Bareinboim and Pearl, 2014)). Let $\langle M, M^* \rangle$ be a pair of SCMs relative to domains $\langle \pi, \pi^* \rangle$, sharing a diagram $G$. $\langle M, M^* \rangle$ induces a selection diagram $D$ consisting of $G$ plus extra variables $T_i$ with edge $T_i \to V_i$ whenever there might exist a discrepancy $f_i \neq f^*_i$ or $P(U_i) \neq P^*(U_i)$ between $M$ and $M^*$.
We employ special indicator variables $T$, drawn as squares, to represent differences between the source and target populations, pointing to the variables that are affected by unobserved factors (causal mechanism or distribution) that are distinct across settings (e.g., see Fig. 1). As for selection bias, we use an indicator variable $S$ (drawn round with a double border) that is pointed to by every variable that affects the process by which a unit is included in the data.

Covariate Adjustment. Adjusting by a set of covariates is arguably the most common technique used to identify causal effects from an observational distribution $P(\mathbf{v})$, namely:

Definition 2 (Adjustment (Pearl, 2000)). Given a causal diagram $G$ over variables $\mathbf{V}$ and sets $\mathbf{X}, \mathbf{Y}, \mathbf{Z} \subseteq \mathbf{V}$, the set $\mathbf{Z}$ is called a covariate adjustment for estimating the causal effect of $\mathbf{X}$ on $\mathbf{Y}$ (or, usually, just an adjustment) if, for every distribution $P(\mathbf{v})$ compatible with $G$, it holds that

$$P_{\mathbf{x}}(\mathbf{y}) = \sum_{\mathbf{z}} P(\mathbf{y} \mid \mathbf{x}, \mathbf{z}) P(\mathbf{z}). \tag{1}$$

In other words, the distribution $P(\mathbf{z})$ is used to re-weight the $\mathbf{z}$-specific distributions $P(\mathbf{y} \mid \mathbf{x}, \mathbf{z})$; for sets $\mathbf{Z}$ satisfying certain conditions (e.g., that would account for confounding bias), this mapping corresponds to the causal effect $P_{\mathbf{x}}(\mathbf{y})$. Several criteria have been developed to determine whether a set $\mathbf{Z}$ is admissible for adjustment (Shpitser et al., 2010; Perković et al., 2015; 2018), including the celebrated Backdoor criterion (Pearl, 1993; 2000; Pearl and Paz, 2013), namely:

Definition 3 (Backdoor Criterion). A set of variables $\mathbf{Z}$ satisfies the Backdoor Criterion relative to a pair of variables $(X, Y)$ in a causal diagram $G$ if: (i) no node in $\mathbf{Z}$ is a descendant of $X$, and (ii) $\mathbf{Z}$ blocks every path between $X$ and $Y$ that contains an arrow into $X$.

Intuitively, this criterion identifies the variables that, when conditioned on, block the back-door paths in the graph (those with arrows coming into $X$ that carry spurious correlation), while keeping the causal paths unperturbed.
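The re-weighting in Eq. (1) can be checked numerically. Below is a small sketch on a hypothetical three-variable model ($Z \to X$, $Z \to Y$, $X \to Y$, all binary) in which $\{Z\}$ satisfies the Backdoor Criterion; the probability tables and helper names are assumptions chosen only for illustration.

```python
from itertools import product

# Hypothetical model parameters (not from the paper).
P_Z = {0: 0.7, 1: 0.3}                      # P(Z = z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}             # P(X = 1 | Z = z)
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (0, 1): 0.5,  # P(Y = 1 | X = x, Z = z)
                 (1, 0): 0.3, (1, 1): 0.7}


def bern(p_one, value):
    """Probability of `value` under a Bernoulli with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint():
    """Observational joint P(z, x, y) implied by the tables above."""
    return {
        (z, x, y): P_Z[z] * bern(P_X1_GIVEN_Z[z], x) * bern(P_Y1_GIVEN_XZ[(x, z)], y)
        for z, x, y in product((0, 1), repeat=3)
    }


def mass(table, z=None, x=None, y=None):
    """Total probability of the entries matching the fixed coordinates."""
    return sum(
        p for (zz, xx, yy), p in table.items()
        if (z is None or zz == z) and (x is None or xx == x) and (y is None or yy == y)
    )


def adjustment(table, x, y):
    """Right-hand side of Eq. (1): sum_z P(y | x, z) P(z), from the joint only."""
    return sum(
        mass(table, z=z, x=x, y=y) / mass(table, z=z, x=x) * mass(table, z=z)
        for z in (0, 1)
    )


def naive(table, x, y):
    """Unadjusted conditional P(y | x), biased by the confounder Z."""
    return mass(table, x=x, y=y) / mass(table, x=x)


def truth(x, y):
    """P_x(y) computed directly from the mechanisms (the ground truth here)."""
    return sum(P_Z[z] * bern(P_Y1_GIVEN_XZ[(x, z)], y) for z in (0, 1))


table = joint()
adjusted, unadjusted, target = adjustment(table, 1, 1), naive(table, 1, 1), truth(1, 1)
```

In this toy model the adjusted value coincides with the interventional quantity, while the unadjusted conditional does not, which is the behavior Definition 2 describes for an admissible set.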
Covariate adjustment has been commonly used to control for confounding bias; nevertheless, some recent work demonstrated the validity of this technique to control for both confounding and selection biases (Correa and Bareinboim, 2017; Correa et al., 2018). As mentioned before, in the setting of this paper, the goal is not to control for confounding bias (solved by randomization), but for selection bias and transportability.

## 3. Generalizing Experimental Findings through Adjustment

A properly carried-out experiment will effectively control for confounding bias, and the resulting effect of the treatment $X$ on the outcome $Y$ will be valid for the population represented in the experiment, i.e., domain $\pi$. In most cases, as discussed earlier, the goal is not to make statements only about the units involved in the experiment, but to generalize the findings to a (usually much) larger and possibly different population (domain $\pi^*$). Invalid conclusions about the target population will be reached if the generalization biases are left uncontrolled. In other words, $P_x(y)$, obtained in $\pi$, may differ significantly from $P^*_x(y)$, the corresponding causal quantity for the target population $\pi^*$.

Recall that we consider two challenges related to the generalizability of experimental findings: transportability and selection bias. For instance, consider the selection diagram in Fig. 1(a) corresponding to the situation described in Example 1. Background factors ($E$) affect both the use of antidepressants ($X$) and the formation of suicidal thoughts and behaviors ($Y$). The transportability node $T$ pointing to $E$ encodes the assumption that there is a discrepancy in the distributions of background factors between the population from the study and the target group of youths. Baseline risk for suicide ($B$), which affects both $X$ and $Y$, also affects the inclusion of subjects into the randomized trials.
This selective sampling process is encoded in the graph through the edge from $B$ to the selection indicator $S$. The aim here is to obtain the effect $P^*_x(y)$ in domain $\pi^*$ (general population) from the data $P_x(y, b, e \mid S=1)$ coming from the domain $\pi$ (controlled groups).

In practice, experimental data from the source domain may be insufficient to identify the target effect. Still, it's not uncommon that non-experimental, unbiased data may be available in the target population, at least over some subset of the variables, $\mathbf{W}$ (i.e., $P^*(\mathbf{w})$). In these situations, the covariate adjustment technique provides a natural way of combining data from the two domains. For the model in Fig. 1(a), if $P^*(b, e)$ is available in the target population, then the target effect $P^*_x(y)$ can be computed by combining $P_x(y, b, e \mid S=1)$ with $P^*(b, e)$ in an adjustment expression, namely,

$$P^*_x(y) = \sum_{b,e} P_x(y \mid b, e, S=1) P^*(b, e), \tag{2}$$

which will be proved later on in this section (Thm. 1). A summary of this setting is provided in Fig. 2. In words, our task is:

Given qualitative causal assumptions in the form of a selection diagram $D$, and given data $P_x(\mathbf{v} \mid S=1)$ in domain $\pi$ and $P^*(\mathbf{w})$ in domain $\pi^*$, determine if $Q = P^*_x(\mathbf{y})$ is estimable by adjustment on a set $\mathbf{Z} \subseteq \mathbf{W} \subseteq \mathbf{V}$.

Specifically, we are looking for sufficient and necessary conditions to determine if it holds that

$$P^*_x(\mathbf{y}) = \sum_{\mathbf{z}} P_x(\mathbf{y} \mid \mathbf{z}, S=1) P^*(\mathbf{z}), \tag{3}$$

based on the assumptions encoded in a selection diagram $D$.

Figure 2. Summary of the task (see text for description). Inputs: $P_x(\mathbf{v} \mid S=1)$ (controlled environment, sample selection bias) and $P^*(\mathbf{w})$ (natural environment, no selection bias). Output: $P^*_x(\mathbf{y}) = \sum_{\mathbf{z}} P_x(\mathbf{y} \mid \mathbf{z}, S=1) P^*(\mathbf{z})$.

The right-hand side of Eq. (3) contains two terms corresponding to different distributions: the first is the experimental one from the source ($\pi$) that may be affected by selection bias; the second is the distribution over a set of covariates measured in the target domain ($\pi^*$).
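As an illustration of how the two data sources are combined, here is a sketch that evaluates the right-hand side of Eq. (2) on a hypothetical parameterization of the diagram described for Fig. 1(a): $T \to E$ is modeled by giving $E$ a different distribution in each domain, and $B \to S$ by an inclusion probability that depends on $B$. All numbers and function names are assumptions made for this example; the treatment is fixed by randomization, so the edges into $X$ play no role.

```python
from itertools import product

# Hypothetical parameters (not from the paper).
P_B1 = 0.3                            # P(B = 1), shared by both domains
P_E1_SOURCE = 0.2                     # P(E = 1) in the source domain
P_E1_TARGET = 0.6                     # P*(E = 1) in the target domain (T -> E)
P_S1_GIVEN_B = {0: 0.9, 1: 0.1}       # inclusion probability (B -> S)
P_Y1 = {                              # P(Y = 1 | do(x), b, e), shared mechanism
    (0, 0, 0): 0.1, (0, 0, 1): 0.2, (0, 1, 0): 0.3, (0, 1, 1): 0.4,
    (1, 0, 0): 0.2, (1, 0, 1): 0.4, (1, 1, 0): 0.5, (1, 1, 1): 0.9,
}


def bern(p_one, value):
    return p_one if value == 1 else 1.0 - p_one


def source_experiment(x):
    """Selection-biased experimental data P_x(b, e, y | S = 1) from the source."""
    raw = {
        (b, e, y): bern(P_B1, b) * bern(P_E1_SOURCE, e)
        * P_S1_GIVEN_B[b] * bern(P_Y1[(x, b, e)], y)
        for b, e, y in product((0, 1), repeat=3)
    }
    total = sum(raw.values())
    return {k: p / total for k, p in raw.items()}


def target_covariates():
    """Unbiased covariate data P*(b, e) from the target domain."""
    return {(b, e): bern(P_B1, b) * bern(P_E1_TARGET, e)
            for b, e in product((0, 1), repeat=2)}


def st_adjustment(x, y):
    """Right-hand side of Eq. (2), computed from the two data tables only."""
    data, p_star = source_experiment(x), target_covariates()
    result = 0.0
    for (b, e), weight in p_star.items():
        stratum = data[(b, e, 0)] + data[(b, e, 1)]
        result += data[(b, e, y)] / stratum * weight
    return result


def naive_source(x, y):
    """P_x(y | S = 1): the source result used without any correction."""
    data = source_experiment(x)
    return sum(p for (b, e, yy), p in data.items() if yy == y)


def target_truth(x, y):
    """P*_x(y) computed directly from the target mechanisms."""
    return sum(bern(P_B1, b) * bern(P_E1_TARGET, e) * bern(P_Y1[(x, b, e)], y)
               for b, e in product((0, 1), repeat=2))
```

Under these assumed tables, `st_adjustment(1, 1)` agrees with `target_truth(1, 1)`, whereas `naive_source(1, 1)` is noticeably lower because high-risk units (`B = 1`) are under-represented and the source has a different mix of `E`.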
One may surmise that it's possible to get away with adjusting only for pre-treatment covariates, as customary in backdoor problems. However, adjusting for descendants of the treatment may be required to account for selection bias. To witness, consider the following scenario.

Example 2. A randomized clinical trial is performed to measure the effect of a gene therapy ($X$) on a certain type of leukemia ($Y$). The selection diagram in Fig. 1(b) represents the corresponding causal model. One common side effect of $X$ is the decrease in blood cells ($Z_2$), which in turn can affect the development of symptoms such as anemia and serious infections ($Z_3$). These symptoms are also caused by other background factors such as genetics, age, and family history (say $Z_1$). Outside the study, these factors affect the propensity of individuals choosing the treatment, and the outcome. There are also unmeasured factors affecting people using the treatment and developing the symptoms ($X \leftrightarrow Z_3$) as well as latent variables that affect the symptoms and the outcome ($Z_3 \leftrightarrow Y$). Due to the development of severe symptoms, subjects may drop from the study or be unable to attend the follow-up consultations, resulting in their data being dropped from the study ($S = 0$). Considering only data from cases that did not drop out may lead to selection bias. Similarly, depending on the conditions of the study, the target population may differ in background factors compared to the units in the experiment. The possibility of such differences is accounted for by the transportability node $T$ pointing to $Z_1$. If one adjusts only for the set $\mathbf{Z} = \{Z_1\}$ to control for the transportability issue, there is still selection bias due to an active (open) path $S \leftarrow Z_3 \leftrightarrow Y$.

It seems $Z_3$ is needed if selection bias is to be controlled as well. However, adjusting for some descendant of $X$ may induce spurious correlation between $X$ and $Y$.
In this case, conditioning on $Z_3$ induces a non-causal correlation between $X$ and $Y$, through, e.g., $X \leftrightarrow Z_3 \leftrightarrow Y$.

For convenience, when considering a set $\mathbf{Z}$ and treatment $X$, let $\mathbf{Z}^{nd} = \mathbf{Z} \setminus De(X)$ denote the non-descendants of $X$ in $\mathbf{Z}$, and $\mathbf{Z}^{d} = \mathbf{Z} \cap De(X)$ denote the descendants of $X$. It turns out that conditioning on variables from $\mathbf{Z}^{d}$ that are independent of the outcome $Y$ given $\mathbf{Z}^{nd}$ in the experimental distribution does not introduce spurious correlation into the adjustment. On the other hand, we need to pay special attention to those variables in $\mathbf{Z}^{d}$ that are d-connected with $Y$ in the interventional graph $G_{\overline{X}}$ (given $X$), which we will denote as

$$\mathbf{Z}^{p} = \left\{ Z \in \mathbf{Z}^{d} \;\middle|\; (Z \not\perp Y \mid \mathbf{Z}^{nd}, X)_{G_{\overline{X}}} \right\}. \tag{4}$$

We now introduce a graphical condition to characterize the sets $\mathbf{Z}$ that yield valid adjustments for $Q = P^*_x(\mathbf{y})$, i.e.:

Definition 4 (Generalization Adjustment (st-adjustment) Criterion (singleton treatment)). Given a selection diagram $D$ with transportability and selection bias variables, respectively, $T$ and $S$, relative to domains $\pi$ and $\pi^*$, a treatment $X$, and disjoint sets $\mathbf{Y}, \mathbf{Z} \subseteq \mathbf{V}$, the set $\mathbf{Z}$ is said to satisfy the st-adjustment criterion relative to $(X, \mathbf{Y})$ in $D$ if

(i) The variables in $\mathbf{Z}^{p}$ are independent of the treatment given all other covariates, i.e., $(\mathbf{Z}^{p} \perp X \mid \mathbf{Z} \setminus \mathbf{Z}^{p})$.

(ii) The outcome is independent of the transportability nodes and the selection bias mechanism given the covariates and $X$, i.e., $(\mathbf{Y} \perp T, S \mid \mathbf{Z}, X)_{D_{\overline{X}}}$.

Since the variables in $\mathbf{Z}^{p}$ are correlated with the outcome (by definition), the first condition requires them to be independent of the treatment $X$, given the other covariates, so as to prevent spurious correlation or the disturbance of causal paths when employing such variables. The second condition accounts for the generalizability issues: it requires the outcome to be independent of the transportability ($T$) and selection bias ($S$) nodes in the effect specific to the levels of the set $\mathbf{Z}$; the criterion owes its name, st-adjustment, to this condition.
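Condition (ii) is a d-separation statement, so it can be checked mechanically. The sketch below uses the standard moralized-ancestral-graph test for d-separation in a DAG and applies it to an edge list reconstructed from the textual description of Fig. 1(a); that edge list, and the helper names, are assumptions for illustration. Bidirected edges are not handled here; they could be modeled by adding an explicit latent parent node.

```python
def ancestors(edges, nodes):
    """All nodes with a directed path into `nodes`, including `nodes` itself."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def d_separated(edges, xs, ys, zs):
    """True if xs and ys are d-separated given zs in the DAG `edges`."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors(edges, xs | ys | zs)          # ancestral subgraph
    nbrs = {n: set() for n in keep}
    parents = {n: set() for n in keep}
    for u, v in edges:
        if u in keep and v in keep:
            nbrs[u].add(v)
            nbrs[v].add(u)
            parents[v].add(u)
    for ps in parents.values():                    # moralize: marry co-parents
        for a in ps:
            nbrs[a] |= ps - {a}
    frontier = [x for x in xs if x not in zs]      # search avoiding zs
    seen = set(frontier)
    while frontier:
        n = frontier.pop()
        if n in ys:
            return False
        for m in nbrs[n]:
            if m not in zs and m not in seen:
                seen.add(m)
                frontier.append(m)
    return True


def remove_incoming(edges, targets):
    """Graph with all edges into `targets` removed (the 'overline' operation)."""
    return [(u, v) for u, v in edges if v not in targets]


# Edge list assumed from the description of Fig. 1(a) in the text.
FIG_1A = [("T", "E"), ("E", "X"), ("E", "Y"),
          ("B", "X"), ("B", "Y"), ("B", "S"), ("X", "Y")]
D_BAR_X = remove_incoming(FIG_1A, {"X"})


def condition_ii(covariates):
    """(Y independent of T, S | Z, X) in the graph with edges into X removed."""
    return d_separated(D_BAR_X, {"Y"}, {"T", "S"}, set(covariates) | {"X"})
```

With this assumed graph, `condition_ii({"B", "E"})` returns `True`, while dropping either covariate reopens a path from `Y` to `S` (through `B`) or to `T` (through `E`). Neither `B` nor `E` descends from `X`, so the set of problematic descendants in condition (i) is empty for this choice.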
In contrast to similar criteria, no condition is required for controlling confounding due to the experimental nature of the data. To build intuition on reading the conditions, consider the following examples:

Example 3. Recall the selection diagram in Fig. 1(a) and consider the set $\mathbf{Z} = \{B, E\}$. It turns out that $\mathbf{Z}^{p} = \emptyset$ since neither $B$ nor $E$ is a descendant of $X$, so the first condition is satisfied. For the second, one can immediately verify that $(Y \perp T, S \mid B, E, X)_{D_{\overline{X}}}$ holds.

Example 4. Consider the diagram in Fig. 1(b). For the set $\mathbf{Z} = \{Z_1\}$, condition (i) is trivially satisfied because $\mathbf{Z}^{p} = \emptyset$. However, there is an active path $S \leftarrow Z_3 \leftrightarrow Y$ that violates (ii). In fact, $Z_3$ needs to be included in $\mathbf{Z}$, but then $(Z_3 \not\perp X \mid Z_1)$ because of the directed path $X \to Z_2 \to Z_3$. We have to include $Z_2$ in $\mathbf{Z}$ to block this path, which leads to the same $\mathbf{Z}^{p}$, but now there is still a path $X \leftrightarrow Z_3$ that violates the first condition. It turns out that there is no set $\mathbf{Z}$ satisfying the criterion for this case. If the bidirected edge between $X$ and $Z_3$ (shown in red) were not present, $(Z_3 \perp X \mid Z_1, Z_2)$ would hold and (as we will show next)

$$P^*_x(y) = \sum_{z_1, z_2, z_3} P_x(y \mid z_1, z_2, z_3, S=1) P^*(z_1, z_2, z_3). \tag{5}$$

We show next that the st-adjustment criterion licenses, and is also necessary for, the extrapolation of causal findings from a source to a target domain through covariate adjustment on a set $\mathbf{Z}$ in the context of singleton treatments.

Theorem 1 (st-adjustment (singleton treatment)). Given a selection diagram $D$, a singleton $X$, and disjoint sets $\mathbf{Y}$ and $\mathbf{Z}$, the causal effect $P^*_x(\mathbf{y})$ is given by

$$P^*_x(\mathbf{y}) = \sum_{\mathbf{z}} P_x(\mathbf{y} \mid \mathbf{z}, S=1) P^*(\mathbf{z}) \tag{6}$$

if and only if $\mathbf{Z}$ satisfies the st-adjustment criterion relative to $(X, \mathbf{Y})$.

The proof for Thm. 1 will be given as a lemma (Lemma 1) after stating a more general st-adjustment theorem (Thm. 3) in the next section. All proofs are provided in the appendix.

Example 5. As discussed in Example 3, the set $\{B, E\}$ satisfies the st-adjustment criterion relative to $(X, Y)$ in the diagram Fig.
1(a), which implies that $P^*_x(y)$ is given by Eq. (2), following Thm. 1. In words, the assumptions encoded in $D$ license the extrapolation of the causal distribution (experiments on the effect of antidepressants on suicide risk carried out in RCTs, the source) to a target population consisting of the general clinical population of youths with depression, by combining the conditional effect segregated by each stratum of $B, E$ (baseline risk for suicide and background factors), re-weighted by the probability of each level of those variables as observed in the target domain.

Example 6. For a case such as Fig. 1(b), where no set $\mathbf{Z}$ satisfies the criterion, Thm. 1 states that for any model consistent with the assumptions in $D$, no adjustment in the form of Eq. (6) gives a correct estimation of the target effect.

## 4. Adjusting for Multiple Treatments

Even though controlling for one treatment variable at a time may be sufficient in some applications, in practice, there are settings where multiple factors need to be tested concurrently. In this section, we address more challenging settings involving causal effects of multiple treatment variables. For example, in online marketing, experiments are used to test the effectiveness of a combination of variables such as content position, media, and audience, on user interaction, clicks, or conversion. Due to the cost and the number of participating users required to carry out these experiments, it is desirable to be able to generalize them to alternative audiences and correct for sampling issues.

With multiple treatments, adjusting for the descendants of $\mathbf{X}$ may again induce spurious correlation between $\mathbf{X}$ and $\mathbf{Y}$. More attention is needed to the variables in $\mathbf{Z}^{p}$ (defined in Eq. (4)) and how they are related to the multiple treatments $\mathbf{X}$. Consider the two models in Fig. 3 and set $\mathbf{Z} = \{Z_1, Z_2, Z_3, Z_4\}$¹ leading to $\mathbf{Z}^{p} = \{Z_2, Z_4\}$.
Note that $\mathbf{Z}^{p}$ is not independent of $\mathbf{X} = \{X_1, X_2\}$ given $\mathbf{Z} \setminus \mathbf{Z}^{p} = \{Z_1, Z_3\}$ in either one of the diagrams, hence condition (i) of Def. 4 fails in both cases. Even so, there is a subtle difference between the two models: while adjusting for $\mathbf{Z}$ is not valid in Fig. 3(a), it is guaranteed to yield $P^*_{\mathbf{x}}(y)$ in Fig. 3(b). To witness, note that $P^*_{\mathbf{x}}(y)$ can be derived as

$$
\begin{aligned}
P^*_{\mathbf{x}}(y) &= P^*_{\mathbf{x}}(y) \sum_{z_1} P^*(z_1) && (7) \\
&= \sum_{z_1} P^*_{\mathbf{x}}(y \mid z_1) P^*(z_1) && (8) \\
&= \sum_{z_1, z_2} P^*_{\mathbf{x}}(y \mid z_1, z_2) P^*_{\mathbf{x}}(z_2 \mid z_1) P^*(z_1) && (9) \\
&= \sum_{z_1, z_2} P^*_{\mathbf{x}}(y \mid z_1, z_2) P^*(z_1, z_2) && (10) \\
&= \sum_{z_1, z_2, z_3} P^*_{\mathbf{x}}(y \mid z_1, z_2, z_3) P^*(z_1, z_2, z_3) && (11) \\
&= \sum_{\mathbf{z}} P^*_{\mathbf{x}}(y \mid \mathbf{z}) P^*_{\mathbf{x}}(z_4 \mid z_1, z_2, z_3) P^*(z_1, z_2, z_3) && (12) \\
&= \sum_{\mathbf{z}} P^*_{\mathbf{x}}(y \mid \mathbf{z}) P^*(\mathbf{z}) && (13) \\
&= \sum_{\mathbf{z}} P_{\mathbf{x}}(y \mid \mathbf{z}, S=1) P^*(\mathbf{z}) && (14)
\end{aligned}
$$

In the derivation above, we first introduced $Z_1$ into the adjustment (Eq. (7)) using the fact that it is independent of $Y$ given $\mathbf{X}$ in $D_{\overline{\mathbf{X}}}$, hence it does not introduce any spurious correlation (8). Next, we added $Z_2$ by conditioning (9), and since $X_2$ has no effect on $\{Z_1, Z_2\}$, $P_{\mathbf{x}}(z_2 \mid z_1) = P_{x_1}(z_2 \mid z_1)$. Also, given $Z_1$, $Z_2$ is independent of $X_1$, so no spurious correlation is added (10). Similarly, $Z_3$ is independent of $Y$ given the already introduced $\{Z_1, Z_2\}$ (11). Finally, $Z_4$ is independent of $\{X_1, X_2\}$ given $\{Z_1, Z_2, Z_3\}$ (13). After both $Z_2$ and $Z_4$ have been adjusted for, the outcome is independent of the selection mechanism $S$, and the causal effect can be expressed in the form of the st-adjustment (14). Remarkably, no other set $\mathbf{Z}$ is valid for adjustment in this model, and the steps described can only be performed in the given order.

As a matter of fact, the reason why $\mathbf{Z}$ will not work for Fig. 3(a) is that in the last step, we have a distribution $P_{\mathbf{x}}(Z_4 \mid Z_1, Z_2, Z_3)$ and since $X_1$ has a causal effect

¹ The two selection diagrams do not have $T$ nodes, meaning the populations are the same in the source and target domains, with only the selection bias issue occurring.

(a) No order over Z1, Z2, Z3, Z4 is suitable for adjustment. (b) Order Z1