A Proxy Variable View of Shared Confounding

Yixin Wang 1, David M. Blei 2

Causal inference from observational data can be biased by unobserved confounders. Confounders, the variables that affect both the treatments and the outcome, induce spurious non-causal correlations between the two. Without additional conditions, unobserved confounders generally make causal quantities hard to identify. In this paper, we focus on the setting where there are many treatments with shared confounding, and we study under what conditions causal identification is possible. The key observation is that we can view subsets of treatments as proxies of the unobserved confounder and identify the intervention distributions of the rest. Moreover, while existing identification formulas for proxy variables involve solving integral equations, we show that one can circumvent the need for such solutions by directly modeling the data. Finally, we extend these results to an expanded class of causal graphs, those with other confounders and selection variables.

1. Introduction

Causal inference from observational data can be biased by unobserved confounders. Confounders are variables that affect both the treatments and the outcome. When measured, we can account for them with adjustments (Pearl, 2009). But when unobserved, they open back-door paths that bias the causal inference; back-door adjustments are not possible.

Consider the following causal problem. How does a person's diet affect her body fat percentage? One confounder is lifestyle: someone with a healthy lifestyle will eat healthy foods such as boiled broccoli; but she will also exercise frequently, which lowers her body fat. When lifestyle is unobserved, the composition of diet will be correlated with body fat, regardless of its true causal effect. Compounding the difficulty, accurate measurements of lifestyle (the confounder) are difficult to obtain, e.g., requiring expensive real-time tracking of activities. Lifestyle is necessarily an unobserved confounder.

1 University of California, Berkeley. 2 Columbia University. Correspondence to: David M. Blei. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Here we focus on the setting where multiple treatments share the same unobserved confounder. The example fits into this setting. Each type of food (broccoli, burgers, granola bars, pizza, and so on) is a potential treatment for body fat. Further, each person's lifestyle affects multiple treatments, i.e., their consumption of multiple types of food. People with a healthy lifestyle eat broccoli and granola; people with an unhealthy lifestyle eat pizza and burgers. Thus the different foods share the same unobserved confounder: each person's lifestyle.

When multiple treatments share the same unobserved confounder, which causal quantities can be identified? How can we estimate them? These are the questions we address.

Begin with the causal graph of Figure 1a, where an unobserved confounder U (lifestyle) affects multiple treatments {A_1, . . . , A_m} (food choices) and an outcome Y (body fat). Further consider a subset of treatments C. We prove that, under suitable conditions, the intervention distribution P(y | do(a_C)) is identifiable. The key observation is that, under shared confounding, some treatments can serve as proxies of unobserved confounders (Miao et al., 2018; Kuroki & Pearl, 2014), enabling causal identification of other treatments. This observation helps identify the intervention distributions of subsets of treatments. Unlike prior work, we do not need to find two external proxies for the unobserved confounder; some treatments themselves can serve as proxies for other treatments.

We then turn to estimation. The identification formula we obtain requires solving an integral equation (Miao et al., 2018), which might be difficult.
We show that the deconfounder algorithm of Wang & Blei (2019a) can help bypass this requirement, producing correct causal estimates by directly modeling the data. With a simulation study, we demonstrate that the identification conditions we require are crucial for the algorithm to produce correct causal inferences. We note that, while we use the same algorithm, the theoretical setting considered here is different from that of Wang & Blei (2019a).

We finally generalize the identification and estimation results to an expanded class of graphs in Figure 2b. This class contains shared confounding, measured single-treatment confounders (that only affect one treatment), and selection on the unobservables. We establish identifiability as well as the applicability of the deconfounder in this larger class.

Figure 1. (a) Multiple treatments with shared confounding. (b) Proxy variables for an unobserved confounder (Miao et al., 2018). (Only the shaded nodes are observed.)

Contributions. The main contributions of this paper are identification and estimation results that target multiple treatments with shared confounding, allowing for certain types of selection bias. We derive conditions under which the intervention distributions of the treatments are identifiable, and further conditions under which the deconfounder algorithm can produce correct causal inferences. The key idea is to use some treatments as proxies of the unobserved confounder to identify the effects of other treatments. Rather than solving integral equations, the algorithm estimates the intervention distributions by directly modeling the data.

Related work. This work uses and extends causal identification with proxy variables (Kuroki & Pearl, 2014; Miao et al., 2018; Shi et al., 2020). While these works focus on a single treatment and a single outcome, we leverage the multiplicity of the treatments to establish causal identification.
With multiple treatments, the recent work of Miao et al. (2020) proposes two approaches to identifying the intervention distributions: the auxiliary variable approach and the null treatments approach. These approaches utilize the shared confounding structure by assuming that at least half of the confounded treatments do not causally affect the outcome and that the treatment-confounder distribution is identifiable from observational data. Our approach differs in how it leverages the shared confounding structure for causal identification. As the treatments share the same unobserved confounder, we view some treatments as proxies of the shared unobserved confounder in order to identify the effects of the other treatments.

A second body of related work is on causal inference with multiple treatments (Ranganath & Perotte, 2018; Heckerman, 2018; Janzing & Schölkopf, 2018; D'Amour, 2019b; Frot et al., 2017; Cevid et al., 2018; Wang et al., 2017; Tran & Blei, 2017; Wang & Blei, 2019a; Puli et al., 2020). While many of these works focus on developing algorithms, we focus on theoretical aspects of the problem. The deconfounder algorithm that we use was developed in Wang & Blei (2019a) and has been heavily debated and discussed (Ogburn et al., 2019; 2020; Imai & Jiang, 2019; Grimmer et al., 2020; D'Amour, 2019a;b; Wang & Blei, 2020; 2019b). Here we delineate settings and assumptions, different from those in Wang & Blei (2019a), under which the algorithm provides correct causal inferences. We also demonstrate that the effectiveness of the algorithm in practice relies on these assumptions.

The identification results in this paper differ from those in Wang & Blei (2019a). First, that work assumes the unobserved confounder is a deterministic function of the treatments; in contrast, we allow the substitute confounder to be random given the treatments.
Second, we establish identification by assuming the existence of a function of the treatments that does not affect the outcome; this assumption is not made in Wang & Blei (2019a). Finally, we extend the ideas to allow for selection bias (Bareinboim & Pearl, 2012), including selection driven by unobserved confounders.

We note that D'Amour (2019b) provides negative examples of causal identification, where some intervention distributions are not identifiable; it also suggests collecting additional proxy variables to resolve non-identification. The results below do not contradict those of D'Amour (2019b). Rather, we focus on the intervention distributions of subsets of the treatments; D'Amour (2019b) focuses on the intervention distributions of all the treatments. Further, the way we use proxy variables differs in that we use existing causes as proxy variables, as opposed to collecting additional proxies.

2. Multiple treatments & shared confounders

Consider a causal inference problem where multiple treatments of interest affect a single outcome. It deviates from classical causal inference, where the main interest is a single treatment and a single outcome. Figure 1a provides an example. There are m treatments A_1, . . . , A_m that all affect the outcome Y; and there is an unobserved confounder U that affects Y and the treatments. This graph exemplifies shared unobserved confounding, where U affects multiple treatments.

In this paper, the goal is to estimate the intervention distributions on subsets of treatments, P(Y | do(A_C = a_C)). It is the distribution of the outcome Y if we intervene on A_C ⊂ {A_1, . . . , A_m}, a (strict) subset. (E.g., if we are interested in each treatment individually, then each subset contains one treatment.) We will establish causal identification and then discuss an algorithm for estimation. Section 3 extends these results to an expanded class of graphs.

2.1.
Causal identification

An intervention distribution is identifiable if it can be written as a function of the observed data distribution (e.g., P(y, a_1, . . . , a_m) in Figure 1a) (Pearl, 2009). In Figure 1a, which intervention distributions can be identified? In this section we prove that, under suitable conditions, the intervention distributions of subsets of the treatments P(y | do(a_C)) are identifiable.

The starting point for causal identification with multiple treatments is the proxy variable strategy, which focuses on causal identification with a single treatment (Kuroki & Pearl, 2014; Miao et al., 2018). Consider the causal graph in Figure 1b: it has a single treatment A_1, an outcome Y, and an unobserved confounder U. The goal is to estimate the intervention distribution P(y | do(a_1)). There are some other variables in the graph too. A proxy X is an observable child of the unobserved confounder; a null proxy N is a proxy that does not affect the outcome. The theory around proxy variables says that the intervention distribution P(y | do(a_1)) is identifiable if (1) we observe two proxies of the unobserved confounder U and (2) one of the proxies is a null proxy (Miao et al., 2018). In particular, since N and X are observed, P(y | do(a_1)) is identifiable.

We leverage the idea of proxy variables to identify intervention distributions in Figure 1a, multiple treatments with shared unobserved confounding. The main idea is to use some treatments as proxies to identify the intervention distributions of other treatments. The benefit is that, with multiple treatments, we do not need to observe external proxy variables; rather, the treatments themselves serve as proxies. Nor do we need to observe a null proxy, one that does not affect the outcome (like N in Figure 1b); we only need to assume that there is a function of the treatments that does not affect the outcome. (We do not need to know this function either, just that at least one such function exists.)
In short, we can use the idea of the proxy but without collecting external data; we can work solely with the data about the treatments and the outcome.

We formally state the identification result. To repeat, assume the causal graph in Figure 1a with m treatments A_{1:m}, an outcome Y, and a shared unobserved confounder U. The goal is to identify the intervention distribution of a strict subset of the treatments P(y | do(a_C)). (We abbreviate P(y | do(a_C)) = P(y | do(A_C = a_C)).) Partition the m treatments into three sets: A_C is the set of treatments on which we intervene; A_X is the set of treatments we use as a proxy; A_N is the set of treatments such that there exists a function f(A_N) that can serve as a null proxy. (We discuss this assumption below.) The latter two sets mimic the proxy X and the null proxy N in the proxy variable strategy. The sets A_C, A_X, and A_N must be non-empty.

Assumption 1. There exists some function f and a set ∅ ≠ N ⊆ {1, . . . , m}\C such that:

1. The outcome Y does not depend on f(A_N):

f(A_N) ⊥ Y | U, A_C, A_X, (1)

where X = {1, . . . , m}\(C ∪ N) ≠ ∅.

2. The conditional distribution P(u | a_C, f(a_N)) is complete in f(a_N) for almost all a_C.

3. The conditional distribution P(f(a_N) | a_C, a_X) is complete in a_X for almost all a_C.

Assumption 1.1 posits that a set of treatments A_N exists such that some function of them, f(A_N), can serve as a null proxy (Eq. 1). Roughly, it requires that f(A_N) does not affect the outcome. It does not require that we know N or f(A_N), just that they exist.

When might this assumption be satisfied? First, suppose some of the multiple treatments do not affect the outcome. Then Assumption 1.1 reduces to the null proxy assumption (Kuroki & Pearl, 2014; Miao et al., 2018; D'Amour, 2019b). This might be plausible, e.g., in a genetic study or other setting where there are many treatments. Again, we do not need to know which treatments are null treatments.
Indeed, as long as two treatments are null, the theory below implies that the intervention distribution of each individual treatment is identifiable.

But this assumption goes beyond a restatement of the null proxy assumption. Suppose two (or more) treatments only affect the outcome as a bundle. Then the bundle can form the set N, and the function is one that is orthogonal to how they are combined. As a (silly) example, consider two of the treatments to be bread and butter. Suppose they must be served together to induce the joyfulness of food, but not individually. (If either is served alone, it has no effect on joyfulness one way or the other.) Then the function f(A_N) is the XOR of the bundle; the quantity (bread XOR butter) does not affect Y. Again, the function and set must only exist; we do not need to know them.

As a more serious example, consider that HDL cholesterol, LDL cholesterol, and triglycerides (TG) affect the risk of a heart attack through the ratios HDL/LDL and TG/HDL (Millán et al., 2009). Then HDL·LDL and TG·HDL are both examples of f(A_N) that do not affect Y. The existence of one of them suffices for Assumption 1.1. (We discuss this assumption in more technical detail in Appendix B.)

Assumption 1.2 and Assumption 1.3 are two completeness conditions on the true causal model; they are required by the proxy variable strategy (e.g., Conditions 2 and 3 of Miao et al. (2018)). Roughly, they require that the distributions of U corresponding to different values of f(A_N) are distinct, and that the distributions of f(A_N) corresponding to different values of A_X are also distinct. (Definition of "complete": the conditional distribution P(u | a_C, f(a_N)) is complete in f(a_N) for almost all a_C if, for any square-integrable function g and almost all a_C, ∫ g(u, a_C) P(u | a_C, f(a_N)) du = 0 for almost all f(a_N) if and only if g(u, a_C) = 0 for almost all u.)
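The two null-function examples above, the XOR of a bundled pair and the products orthogonal to the ratios, can be checked with a toy computation. The outcome functions below are hypothetical illustrations, not part of the formal development:

```python
# (1) Bundled treatments: the outcome responds only to (bread AND butter),
# so bread XOR butter is a null function f(A_N).
def joy(bread, butter):
    return 1.0 if (bread and butter) else 0.0

# Same outcome, different XOR: varying f(A_N) leaves Y untouched.
assert joy(0, 0) == joy(0, 1)
assert (0 ^ 0) != (0 ^ 1)

# (2) Ratio-driven outcomes: the risk depends only on HDL/LDL and TG/HDL,
# so the product HDL*LDL is a null function. Rescaling all three lipids by a
# common factor fixes both ratios (hence the risk) but changes the product.
def risk(hdl, ldl, tg):
    return 0.3 * (hdl / ldl) - 0.2 * (tg / hdl)

assert abs(risk(50, 100, 150) - risk(100, 200, 300)) < 1e-12
assert 50 * 100 != 100 * 200
```

In both cases the null function exists without our needing to single it out in advance, which is all that Assumption 1.1 asks for.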
The two assumptions are satisfied when we work with a causal model that satisfies the completeness condition. Many common models satisfy this condition; examples include exponential families (Newey & Powell, 2003), location-scale families (Hu & Shiu, 2018), and nonparametric regression models (Darolles et al., 2011). Completeness is a common assumption posited in nonparametric causal identification (Miao et al., 2018; Yang et al., 2017; D'Haultfoeuille, 2011); it is often used to guarantee the existence and the uniqueness of solutions to integral equations. Chen et al. (2014) provides a discussion of completeness.

Under Assumption 1, we can identify the intervention distribution of the subset of treatments A_C.

Theorem 1. (Causal identification under shared confounding) Assume the causal graph of Figure 1a. (The data does not need to be faithful to the graph; some edges can be missing.) Under Assumption 1, the intervention distribution of the treatments A_C is identifiable:

P(y | do(a_C)) = ∫ h(y, a_C, a_X) P(a_X) da_X (2)

for any solution h to the integral equation

P(y | a_C, f(a_N)) = ∫ h(y, a_C, a_X) P(a_X | a_C, f(a_N)) da_X. (3)

Moreover, a solution to Eq. 3 always exists under the weak regularity conditions in Appendix D.

Proof sketch. The proof relies on the partition of the m treatments: A_C as the treatments, A_X as the proxies, and A_N such that f(A_N) can be a null proxy. We then follow the proxy variable strategy to identify the intervention distributions of A_C, using A_X as a proxy and f(A_N) as a null proxy. We no longer have a null proxy like N in Figure 1b; all m treatments can affect the outcome. However, Assumption 1.1 allows f(A_N) to play the role of a null proxy. The full proof is in Appendix A.

Theorem 1 identifies the intervention distributions of subsets of the treatments A_C; it writes P(y | do(a_C)) as a function of the observed data distribution P(y, a_C, a_X, a_N).
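In the discrete case, the integral equation of Eq. 3 becomes a finite linear system, and Theorem 1 can be checked numerically. Below is a minimal sketch with binary variables, in the simplest instantiation where one treatment is null, so f(A_N) = A_N. All probability tables are hypothetical, and the observational quantities P(y | a_C, a_N) and P(a_X | a_C, a_N) are computed exactly from the model rather than estimated from samples:

```python
import numpy as np

# Binary U, A_C, A_X, A_N, Y; A_N does not enter P(y | a_C, a_X, u).
pU   = np.array([0.5, 0.5])                   # P(u)
pC_U = np.array([[0.6, 0.4], [0.1, 0.9]])     # P(a_C | u), rows indexed by u
pX_U = np.array([[0.7, 0.3], [0.3, 0.7]])     # P(a_X | u)
pN_U = np.array([[0.8, 0.2], [0.2, 0.8]])     # P(a_N | u)
pY1  = np.zeros((2, 2, 2))                    # P(Y=1 | a_C, a_X, u)
for ac in range(2):
    for ax in range(2):
        for u in range(2):
            pY1[ac, ax, u] = 0.1 + 0.3 * ac + 0.2 * ax + 0.35 * u

def true_do(ac):
    """Ground truth P(Y=1 | do(A_C=ac)): adjust for the (here known) U."""
    return sum(pU[u] * pX_U[u, ax] * pY1[ac, ax, u]
               for u in range(2) for ax in range(2))

def proxy_do(ac):
    """Theorem 1: solve the (here, matrix) integral equation using only
    quantities that are observable in principle; they are computed exactly
    from the model for this check."""
    M = np.zeros((2, 2))   # M[a_N, a_X] = P(a_X | a_C, a_N)
    b = np.zeros(2)        # b[a_N]      = P(Y=1 | a_C, a_N)
    for an in range(2):
        w = pU * pC_U[:, ac] * pN_U[:, an]    # w[u] proportional to P(u | a_C, a_N)
        w = w / w.sum()
        M[an] = w @ pX_U
        b[an] = sum(w[u] * pX_U[u, ax] * pY1[ac, ax, u]
                    for u in range(2) for ax in range(2))
    h = np.linalg.solve(M, b)                 # solves Eq. 3
    return h @ (pU @ pX_U)                    # Eq. 2: integrate h against P(a_X)

for ac in range(2):
    assert abs(true_do(ac) - proxy_do(ac)) < 1e-10
```

Completeness (Assumptions 1.2 and 1.3) surfaces here as invertibility of the matrix M: if the conditionals of A_X were identical across values of A_N, the linear system would be singular and h would not pin down the intervention distribution.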
In particular, it lets us identify the intervention distributions of individual treatments, P(y | do(a_i)), i = 1, . . . , m. By using the treatments themselves as proxies, Theorem 1 exemplifies how the multiplicity of the treatments enables causal identification under shared unobserved confounding.

2.2. Causal estimation with the deconfounder

Theorem 1 guarantees that the intervention distribution P(y | do(a_C)) is estimable from the observed data. However, it involves solving an integral equation (Eq. 3). This integral equation is hard to solve except in the simplest linear Gaussian case (Carrasco et al., 2007). How can we estimate P(y | do(a_C)) in practice?

We revisit the deconfounder algorithm of Wang & Blei (2019a). We show that the deconfounder correctly estimates the intervention distribution P(y | do(a_C)); it implicitly solves the integral equation in Eq. 3 by modeling the data. (This is an alternative justification of the algorithm from the one in Wang & Blei (2019a).)

We first review the algorithm. Given the treatments A_1, . . . , A_m and the outcome Y, the deconfounder proceeds in three steps:

1. Construct a substitute confounder. Based only on the (observed) treatments A_1, . . . , A_m, it first constructs a random variable Ẑ such that all the treatments are conditionally independent:

P̂(a_1, . . . , a_m, ẑ) = P̂(ẑ) ∏_{j=1}^m P̂(a_j | ẑ), (4)

where P̂(·) is consistent with the observed data: P(a_1, . . . , a_m) = ∫ P̂(a_1, . . . , a_m, ẑ) dẑ. The random variable Ẑ is called a substitute confounder; it does not necessarily coincide with the unobserved confounder U. The substitute is constructed using probabilistic models with local and global variables (Bishop, 2006), such as probabilistic PCA (Tipping & Bishop, 1999).

2. Fit an outcome model. The next step is to estimate how the outcome depends on the treatments and the substitute confounder, P̂(y | a_1, . . . , a_m, ẑ). This outcome model is fit to be consistent with the observed data:

P(y, a_1, . . . , a_m) = ∫ P̂(y | a_1, . . . , a_m, ẑ) P̂(a_1, . . . , a_m, ẑ) dẑ. (5)

Along with the first step, the deconfounder gives the joint distribution P̂(y, a_1, . . . , a_m, ẑ).

3. Estimate the intervention distribution. The final step estimates the intervention distribution P(y | do(a_C)) by integrating out the non-intervened treatments and the substitute confounder,

P̂(y | do(a_C)) = ∫ P̂(y | a_1, . . . , a_m, ẑ) P̂(a_{{1,...,m}\C}, ẑ) dẑ da_{{1,...,m}\C}. (6)

This is the estimate.

The correctness of the deconfounder. Note that many possible P̂(·)'s satisfy the deconfounder requirements (Eqs. 4 and 5); the algorithm outputs one such P̂. Under suitable conditions, we show that any such P̂ provides the correct causal estimate of P(y | do(a_C)).

Assumption 2. The deconfounder estimate P̂(y, a_1, . . . , a_m, ẑ) satisfies two conditions:

1. It is consistent with Assumption 1.1:

P̂(y | a_C, a_X, f(a_N), ẑ) = P̂(y | a_C, a_X, ẑ).

2. The conditional distribution P̂(ẑ | a_C, a_X) is complete in a_X for almost all a_C.

Assumption 2.1 roughly requires that there exist a function f and a subset of the treatments A_N such that f(A_N) does not affect the outcome in the deconfounder outcome model. (When the number of treatments goes to infinity, Assumption 2.1 reduces to Assumption 1.1.) We emphasize that f(A_N) is not involved in calculating the estimate (Eq. 6); it only appears in Assumption 2.1. Hence the correctness of the algorithm does not require specifying f(·) and A_N, just that they exist. Assumption 2.2 requires that the distributions of Ẑ corresponding to different values of A_X are distinct; it is a completeness condition similar to those in Assumption 1.

Now we state the correctness of the algorithm.

Theorem 2. (Correctness of the deconfounder under shared confounding) Assume the causal graph of Figure 1a.
Under Assumption 1, Assumption 2, and weak regularity conditions, the deconfounder provides correct estimates of the intervention distribution:

P̂(y | do(a_C)) = P(y | do(a_C)), (7)

where P̂(y | do(a_C)) is computed from Eq. 6.

Proof sketch. The proof of Theorem 2 relies on a key observation: the deconfounder implicitly solves the integral equation (Eq. 3) by modeling the observed data with P̂(y, a_1, . . . , a_m, ẑ). Assumption 2.2 guarantees that the deconfounder estimate can be written as

P̂(y | a_C, ẑ) = ∫ ĥ(y, a_C, a_X) P̂(a_X | ẑ) da_X (8)

under weak regularity conditions; this function ĥ(y, a_C, a_X) also solves the integral equation (Eq. 3). The deconfounder uses this solution to form an estimate of P(y | do(a_C)); this estimate is correct because of Theorem 1. The full proof is in Appendix C.

Theorem 2 justifies the deconfounder for multiple causal inference under shared confounding (Figure 1a). It proves that the deconfounder correctly estimates the intervention distributions when they are identifiable. This result complements Theorems 6-8 of Wang & Blei (2019a); it establishes identification and correctness by assuming there exists some function of the treatments that does not affect the outcome. In contrast, Theorems 6-8 of Wang & Blei (2019a) assume a consistent substitute confounder, i.e., that the substitute confounder is a deterministic function of the treatments. Their assumption is stronger; conditional on the treatments, Theorems 1 and 2 allow the substitute confounder to be random.

Theorem 2 also shows that we can leverage the deconfounder algorithm to put the proxy variable strategy into practice. While existing identification formulas for proxy variables involve solving integral equations (Miao et al., 2018), Theorem 2 shows how to circumvent this need by directly modeling the data and applying the deconfounder; it implicitly solves the integral equations. Section 4 illustrates these theorems with a linear example.

3.
An expanded class of causal graphs

We discussed causal identification and estimation when multiple treatments share the same unobserved confounder. We now extend these results to an expanded class of causal graphs, those with several other types of nodes and, in particular, those that include a selection variable (Bareinboim et al., 2014; Bareinboim & Pearl, 2012). Using the results in Section 2, we establish causal identification and estimate intervention distributions.

3.1. An expanded class of causal graphs

The expanded class of graphs is illustrated in Figure 2b. As above, there are m treatments A_{1:m} and an outcome Y. The goal is to estimate P(y | do(a_C)), where A_C ⊂ {A_1, . . . , A_m} is a subset of treatments on which we intervene. Apart from the treatments and outcome, the graph has other types of variables; Figure 2a contains a glossary.

Confounders. Confounders are parents of both the treatments and the outcome; they can be unobserved. In Figure 2b, for example, U^sng_i and U^mlt_i are confounders; they have arrows into the outcome Y and at least one of the treatments A_i. We differentiate between single-treatment and multi-treatment confounders. Single-treatment confounders like U^sng_i affect only one treatment; multi-treatment confounders like U^mlt_i affect two or more treatments.

Covariates. There are two types of covariates: treatment covariates and outcome covariates. Treatment covariates are parents of the treatments, but not the outcome; they can be unobserved. As with confounders, we differentiate between single-treatment covariates W^sng_i and multi-treatment covariates W^mlt_i. Outcome covariates like V are parents of the outcome but not the treatments; they do not affect any of the m treatments, and they can be unobserved.

Selection operator. Following Bareinboim & Pearl (2012), we introduce a selection operator S ∈ {0, 1} into the causal graph.
The value S = 1 indicates that an individual is selected; otherwise, S = 0. We only observe the outcomes of individuals with S = 1, but we may observe the treatments of unselected individuals. (E.g., consider a genome-wide association study where we collect an expensive-to-measure trait on a subset of the population but have genome data on a much larger set.) Note that Figure 2b allows selection to occur on the confounders.

3.2. Causal identification

We extend the results around causal identification and estimation under shared confounding (Theorems 1 and 2) to the expanded class of graphs. We first reduce the graph of Figure 2b to one close to the shared confounding case; then we handle the complications of selection bias.

Reduction to shared confounding. To reduce the graph of Figure 2b, we bundle all the unobserved multi-treatment confounders and null confounders {U^mlt, W^mlt} into a single unobserved confounder Z. This variable Z is shared by all the treatments, as in Figure 1a, and renders all the treatments conditionally independent. Moreover, it is sufficient to adjust for Z and the single-treatment confounders U^sng to estimate P(y | do(a_C)), because {U^mlt, W^mlt, U^sng} constitute an admissible set. We can thus equivalently identify the intervention distributions P(y | do(a_C)) of the graph of Figure 2b using the reduced graph of Figure 2c; it involves only the single-treatment confounders U^sng and a shared confounder Z. Below we formally state the validity of the reduction.

Lemma 3. (Validity of reduction) Assume the causal graph in Figure 2b. Adjusting for the multi-treatment confounders and null confounders in the graph of Figure 2b is equivalent to adjusting for the shared confounder in Figure 2c:

P(y | u^sng, u^mlt, w^mlt, a_1, . . . , a_m, s = 1) = P(y | u^sng, z, a_1, . . . , a_m, s = 1). (9)

Proof sketch. The proof uses a measure-theoretic argument to characterize the information contained in the variable Z of Figure 2c.
Roughly, the information in Z is the same as the information in all multi-treatment confounders, all null confounders, and some independent error:

σ(Z) = σ(U^mlt, W^mlt, ε_Z), (10)

where σ(·) denotes the σ-algebra of a random variable. The independent error ε_Z satisfies ε_Z ⊥ (Y, S, U^sng, U^mlt, W^mlt, A_1, . . . , A_m). Eq. 10 implies that conditioning on Z is equivalent to conditioning on (U^mlt, W^mlt, ε_Z); this leads to Eq. 9. The full proof is in Appendix E.

Causal identification on the reduced causal graph (Figure 2c). We have reduced the expanded class of graphs (Figure 2b) to one with shared confounding (Figure 2c). This reduction allows us to establish causal identification on the expanded class: we extend Theorem 1 from Figure 1a to Figure 2c, which, together with the reduction step (Lemma 3), leads to causal identification.

How can we identify the intervention distributions P(y | do(a_C)) on the reduced graph (Figure 2c)? Figure 2c has a confounder Z that is shared across all treatments. This structure is similar to the unobserved shared confounding of Figure 1a. In addition to the shared confounder Z, the reduced graph involves the single-treatment confounders U^sng and the selection operator S. We posit two assumptions on them to enable causal identification.

Assumption 3. The causal graph of Figure 2c satisfies the following conditions:

1. All single-treatment confounders U^sng_i are observed.

2. The selection operator S satisfies

S ⊥ (A, Y) | Z, U^sng. (11)

3. We observe the non-selection-biased distribution P(a_1, . . . , a_m, u^sng) and the selection-biased distribution P(y, u^sng, a_1, . . . , a_m | s = 1).

Assumption 3.1 requires that the confounders that affect the outcome and only one of the treatments be observed. It allows us to adjust for the confounding due to these single-treatment confounders. Assumption 3.2 roughly requires that selection can only occur on the confounders.
Assumption 3.3 requires access to the non-selection-biased distribution of the treatments and single-treatment confounders. It aligns with common conditions required for recovery under selection bias (e.g., Theorem 2 of Bareinboim et al. (2014)).

Figure 2. (a) Types of nodes: confounders (U^mlt, U^sng) have at least one treatment and the outcome as children; treatment covariates (W^mlt, W^sng) have at least one treatment, and only treatments, as children; outcome covariates (V) have only the outcome as a child. (b) The expanded class of causal graphs. S is the selection operator. (c) The reduced causal graph with shared confounding.

We next establish causal identification on the reduced causal graph, Figure 2c. We additionally make Assumption 4; it is a variant of Assumption 1 that involves the single-treatment confounders and the selection operator.

Assumption 4. There exists some function f and a set ∅ ≠ N ⊆ {1, . . . , m}\C such that:

1. The outcome Y does not causally depend on f(A_N):

f(A_N) ⊥ Y | Z, A_C, A_X, U^sng, S = 1, (12)

where X = {1, . . . , m}\(C ∪ N) ≠ ∅.

2. The conditional P(z | a_C, f(a_N), u^sng_C, s = 1) is complete in f(a_N) for almost all a_C and u^sng_C, where U^sng_C denotes the single-treatment confounders affecting A_C.

3. The conditional P(f(a_N) | a_C, a_X, u^sng_C, s = 1) is complete in a_X for almost all a_C and u^sng_C.

Under Assumption 3 and Assumption 4, we can identify the intervention distributions P(y | do(a_C)).

Lemma 4. Assume the causal graph of Figure 2c. Under Assumption 3 and Assumption 4, the intervention distribution of the treatments A_C is identifiable:

P(y | do(a_C)) = ∫∫ h(y, a_C, a_X, u^sng_C) P(a_X) P(u^sng_C) da_X du^sng_C (13)

for any solution h to the integral equation

P(y | a_C, f(a_N), u^sng_C, s = 1) = ∫ h(y, a_C, a_X, u^sng_C) P(a_X | a_C, f(a_N), u^sng_C, s = 1) da_X, (14)

where U^sng_C denotes the single-treatment confounders affecting A_C. Moreover, a solution to Eq. 14 always exists under the weak regularity conditions in Appendix D. (The proof is in Appendix F; it is similar to that of Theorem 1.)
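Assumption 3.2 can be illustrated with a small discrete computation in which selection depends only on the confounder. The probability tables below are hypothetical:

```python
import numpy as np

# U confounds A and Y; selection S depends only on U.
pU  = np.array([0.5, 0.5])      # P(u)
pA1 = np.array([0.3, 0.7])      # P(A=1 | u)
pS1 = np.array([0.3, 0.9])      # P(S=1 | u): e.g., healthy lifestyles respond more

def pY1(a, u):                  # P(Y=1 | a, u)
    return 0.2 + 0.3 * a + 0.4 * u

def p_y_given_a(a, selected=False):
    """P(Y=1 | A=a) in the full population, or P(Y=1 | A=a, S=1) if selected."""
    w = pU * (pA1 if a == 1 else 1 - pA1)   # unnormalized weight over u
    if selected:
        w = w * pS1                          # condition on S=1
    w = w / w.sum()
    return sum(w[u] * pY1(a, u) for u in range(2))

# Selection distorts the observed conditional of Y given A alone ...
assert abs(p_y_given_a(1, selected=True) - p_y_given_a(1)) > 0.05
# ... but P(y | a, u, s=1) = P(y | a, u) holds exactly, because S depends
# only on u: given the confounder, conditioning on S=1 changes nothing.
```

This is the pattern that Assumptions 3 and 4 exploit: the selection-biased conditionals that enter the integral equation (Eq. 14) condition on enough variables that selection becomes ignorable.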
Causal identification on the expanded class of causal graphs (Figure 2b). Based on the preceding analysis of the reduced graph, we establish the causal identification result on the expanded class of causal graphs.

Theorem 5. Assume the causal graph of Figure 2b. Under a variant of Assumption 3 and Assumption 4 (detailed in Appendix G), the intervention distribution of the treatments A_C is identifiable using Eq. 13 and Eq. 14. (The proof is in Appendix G.)

3.3. Causal estimation with the deconfounder

We finally extend the deconfounder to the expanded class of causal graphs (Figure 2b) with selection bias and prove its correctness. We build on the identification result of Theorem 5. We then show that the deconfounder provides correct causal estimates by implicitly solving the integral equation (Eq. 14). This argument is similar to the argument of Theorem 2.

The algorithm for the expanded class of graphs with selection bias extends the version described in Section 2.2. Specifically, Assumption 3.3 gives the algorithm access to both the non-selection-biased data P(a_1, . . . , a_m, u^sng) and the selection-biased data P(y, u^sng, a_1, . . . , a_m | s = 1). In this case, the algorithm outputs two estimates:

(1) P̂(a_1, . . . , a_m, u^sng, ẑ) = P̂(ẑ) P̂(u^sng | a_1, . . . , a_m, ẑ) ∏_{i=1}^m P̂(a_i | ẑ),

(2) P̂(y, a_1, . . . , a_m, u^sng, ẑ | s = 1).

We note that the former is constructed using only the treatments A_1, . . . , A_m and the single-treatment confounders U^sng. Moreover, both estimates must be consistent with the observed data:

∫ P̂(a_1, . . . , a_m, u^sng, ẑ) dẑ = P(a_1, . . . , a_m, u^sng),

∫ P̂(y, a_1, . . . , a_m, u^sng, ẑ | s = 1) dẑ = P(y, a_1, . . . , a_m, u^sng | s = 1).

We note that the substitute confounder Ẑ does not necessarily coincide with the true confounders U^mlt or the true null confounders W^mlt. Nor do P̂(a_1, . . . , a_m, u^sng, ẑ) and P̂(y, a_1, . . . , a_m, u^sng, ẑ | s = 1) need to be unique.
We will show that any Ẑ and P̂ that the algorithm outputs lead to a correct estimate of P̂(y | do(a_C)). Finally, the algorithm estimates

P̂(y | do(a_C)) = ∫∫ P̂(y | a_1, . . . , a_m, ẑ, u^sng_C, s = 1) P̂(a_{{1,...,m}\C}, ẑ) P(u^sng_C) dẑ da_{{1,...,m}\C},   (15)

where U^sng_C denotes the single-treatment confounders that affect the treatments A_C.

We now prove the correctness of the deconfounder on the expanded class of causal graphs. We make a variant of Assumption 2 and state the correctness result.

Assumption 5. The deconfounder outputs estimates P̂(y, a_1, . . . , a_m, u^sng, ẑ | s = 1) and P̂(a_1, . . . , a_m, u^sng, ẑ) that satisfy the following:

1. They are consistent with Assumption 3.1:

P̂(a_1, . . . , a_m | ẑ, u^sng, s = 1) = P̂(a_1, . . . , a_m | ẑ, u^sng).   (16)

2. They are consistent with Assumption 4.1:

P̂(y | a_C, a_X, f(a_N), ẑ, u^sng, s = 1) = P̂(y | a_C, a_X, ẑ, u^sng, s = 1).   (17)

3. The conditional P̂(ẑ | a_C, a_X, u^sng, s = 1) is complete in a_X for almost all a_C.

The conditional P̂(ẑ | a_C, a_X, u^sng, s = 1), Eq. 16, and Eq. 17 can all be computed from P̂(a_1, . . . , a_m, u^sng, ẑ) and P̂(y, a_1, . . . , a_m, u^sng, ẑ | s = 1).

Under these assumptions, Theorem 6 establishes the correctness of the deconfounder on causal graphs with certain types of selection bias.

Theorem 6. (Correctness of the deconfounder on the expanded class of causal graphs) Assume the causal graph of Figure 2b. Assume variants of Assumption 3 and Assumption 4 (detailed in Appendix H). Under Assumption 5 and weak regularity conditions, the deconfounder provides correct estimates of the intervention distribution:

P̂(y | do(a_C)) = P(y | do(a_C)).   (18)

(The proof is in Appendix H.)

4. Example: A linear causal model

We illustrate Theorems 5 and 6 in a linear causal model. Consider the meal/body-fat example. The treatments are ten types of food A_1, . . . , A_10; the outcome is a person's body fat Y. How does food consumption affect body fat?
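Eq. 15 is an average of the fitted outcome model over the substitute confounder and the non-intervened treatments. The following Monte Carlo sketch makes that operational in a toy linear-Gaussian setting; the outcome model, coefficients, and sampling distributions are all illustrative stand-ins, not the paper's estimator.

```python
import numpy as np

# Monte Carlo sketch of Eq. 15: average the fitted outcome model over draws of
# (a_{\C}, z_hat) from P_hat and of u^sng_C from P(u^sng_C).
rng = np.random.default_rng(2)

def outcome_model(a_c, a_rest, z, u):
    # Stand-in for the mean of P_hat(y | a_1..a_m, z_hat, u^sng_C, s=1);
    # all coefficients are illustrative.
    return 2.0 * a_c + 0.5 * a_rest + 1.5 * z + 0.8 * u

n = 100_000
z = rng.normal(size=n)                  # draws from P_hat(z_hat)
a_rest = z + 0.1 * rng.normal(size=n)   # draws from P_hat(a_{\C} | z_hat)
u = rng.normal(size=n)                  # draws from P(u^sng_C)

a_c = 1.0                               # intervention do(a_C = 1)
do_mean = outcome_model(a_c, a_rest, z, u).mean()
print(do_mean)                          # close to 2.0 up to Monte Carlo error
```

Because z, a_rest, and u are all drawn from their (marginal or model-implied) distributions rather than conditioned on a_C, the average isolates the causal contribution of a_C, which is the point of the do-calculation in Eq. 15.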
In this example, the individual's lifestyle U^mlt is a multi-treatment confounder. Whether a person is vegan, W^mlt, is a multi-treatment null confounder. Both U^mlt and W^mlt are unobserved. Whether one has easy access to good burger shops, U^sng, is a single-treatment confounder; it affects both burger consumption A_1 and body fat percentage Y; U^sng is observed. Finally, the observational data come from a survey with selection bias S; people with a healthy lifestyle are more likely to complete the survey. Every variable is associated with a disturbance term ε, drawn from a standard normal. Given these variables, suppose the real world is linear:

U^mlt = ε_{U^mlt},   U^sng = ε_{U^sng},   W^mlt = ε_{W^mlt},

A_1 = α_{A_1,U} U^mlt + α_{A_1,W} W^mlt + α_{A_1,U'} U^sng + ε_{A_1},

A_i = α_{A_i,U} U^mlt + α_{A_i,W} W^mlt + ε_{A_i},   i = 2, . . . , 10,

Y = Σ_{i=1}^{10} β_{A_i} A_i + β_{Y,U} U^mlt + β_{Y,U'} U^sng + ε_Y.

These equations describe the true causal model of the world. The confounders and null confounders {U^mlt, W^mlt} are unobserved. We are interested in the intervention distribution of the first two food categories, burger (A_1) and broccoli (A_2): P(y | do(a_1, a_2)). (We emphasize that we might be interested in any subset of the treatments.) This world satisfies the assumptions of Theorem 5. Even though the confounders U^mlt are unobserved, the intervention distribution P(y | do(a_1, a_2)) is identifiable.

Now consider a simple deconfounder. Fit a 2-D probabilistic principal component analysis (PPCA) to the data about food consumption {A_1, . . . , A_10}; we do not model the outcome Y. Following Wang & Blei (2019a), we also check the model to ensure it fits the distribution of the assigned treatments. (Let's assume that 2-D PPCA passes this check.) PPCA leads to a linear estimate of the substitute confounder:

Ẑ_1 = Σ_{i=1}^{10} γ_{1i} A_i + ε_{Ẑ_1},   Ẑ_2 = Σ_{i=1}^{10} γ_{2i} A_i + ε_{Ẑ_2},

for parameters γ_{1i} and γ_{2i} and Gaussian noise ε_{Ẑ_1}, ε_{Ẑ_2}. This substitute confounder Ẑ satisfies Assumption 5.
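The linear world above can be simulated directly, which also shows why adjustment matters: regressing Y on the treatments alone is biased by the omitted U^mlt and U^sng, while adjusting for them (here by oracle access, for illustration) removes the bias. All coefficient values below are illustrative assumptions.

```python
import numpy as np

# Simulate a linear world like the one above (illustrative coefficients).
rng = np.random.default_rng(3)
n, m = 20_000, 10
u_mlt = rng.normal(size=n)          # unobserved multi-treatment confounder (lifestyle)
w_mlt = rng.normal(size=n)          # unobserved null confounder (vegan; affects A only)
u_sng = rng.normal(size=n)          # observed single-treatment confounder (burger shops)

alpha_u = rng.uniform(0.5, 1.5, size=m)
alpha_w = rng.uniform(0.5, 1.5, size=m)
A = np.outer(u_mlt, alpha_u) + np.outer(w_mlt, alpha_w) + rng.normal(size=(n, m))
A[:, 0] += 0.7 * u_sng              # U^sng affects only A_1 (burgers)

beta = rng.uniform(-1.0, 1.0, size=m)
y = A @ beta + 1.2 * u_mlt + 0.9 * u_sng + rng.normal(size=n)

# Naive regression of y on A omits u_mlt and u_sng, so it is biased for beta.
beta_naive = np.linalg.lstsq(A, y, rcond=None)[0]

# Adjusting for the confounders removes the bias; the null confounder w_mlt
# need not be adjusted for, since it never affects y.
X = np.column_stack([A, u_mlt, u_sng])
beta_adj = np.linalg.lstsq(X, y, rcond=None)[0][:m]

print(np.abs(beta_naive - beta).max())   # noticeably nonzero
print(np.abs(beta_adj - beta).max())     # near zero
```

The deconfounder's role is to construct a substitute for u_mlt from the treatments alone; the oracle adjustment above only shows what a correct adjustment would achieve.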
Plausibly, the real world satisfies the variants of Assumption 3 and Assumption 4. These assumptions greenlight us to calculate the intervention distribution. We fit an outcome model using the substitute confounder Ẑ and calculate the intervention distribution using Eq. 15. Theorem 6 guarantees that this estimate is correct.

5. A Simulation Study

In this section, we see the identification results in action. We find that the identification conditions discussed in Sections 2 and 3 are crucial for producing correct causal estimates; the theoretical results and the conditions required by Theorems 1, 2, 5 and 6 are practically important.

Specifically, we consider the linear data-generating process of Section 4 with a one-dimensional U and three treatments A_1, A_2, A_3. We explore two configurations of the unobserved confounder U. In one configuration, U is normally distributed, and the resulting observational data satisfies the completeness condition in Assumption 1.2. Figure 3a shows that the root mean squared error (RMSE) of the deconfounder's average treatment effect (ATE) estimate stays low even when the confounding strength is high, while the RMSE of naive regression quickly blows up. In the second configuration, U is uniformly distributed; this results in an observational data distribution that violates the completeness condition (Assumption 1.2). Figure 3b shows that the deconfounder can no longer control for confounding in this setting. It produces causal estimates of consistently lower quality than naive regression.

Finally, we extend the simulation to study selection bias. Given a normally or uniformly distributed U, we generate observational data from the same linear model. We then introduce selection bias by selecting samples with probability ∝ N(U; 0, 0.5²) or ∝ Unif(U; 0, 0.5), respectively. We apply the deconfounder estimation algorithm. Figures 3c and 3d exhibit a phenomenon similar to the one above.
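The selection step can be made concrete with a short sketch (assuming a standard-normal U and acceptance probability ∝ N(U; 0, 0.5²), as in the normal configuration): rejection sampling tilts the retained population toward U near zero, which is exactly the distortion that biases naive estimates on the selected data.

```python
import numpy as np

# Sketch of the selection mechanism: keep each sample with probability
# proportional to N(U; 0, 0.5^2). For U ~ N(0, 1), the selected population
# is N(0, 0.2) (product of the two Gaussian densities, renormalized).
rng = np.random.default_rng(4)
U = rng.normal(size=200_000)
dens = np.exp(-U**2 / (2 * 0.5**2))          # unnormalized N(0, 0.5^2) density
keep = rng.uniform(size=U.size) < dens / dens.max()
U_sel = U[keep]

# Selection concentrates U: the selected variance drops from 1 to about 0.2.
print(U.var(), U_sel.var())
```

This is why recovery requires the extra access granted by Assumption 3.3: the selected data alone no longer reflects the population distribution of U.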
When the identification conditions hold, the deconfounder produces significant improvements in ATE estimation. When these conditions are violated, the deconfounder produces low-quality ATE estimates. Notice that under selection bias, the variance of the estimate tends to go down as the confounding strength goes up. We observe this phenomenon because stronger confounding makes it easier to infer the latent confounder U, reducing the variance of the estimate.

Figure 3. The deconfounder outperforms naive regression when the identification conditions are satisfied, but fails otherwise. (a) Assumption 1.2 holds. (b) Assumption 1.2 is violated. (c) Assumption 4.2 holds. (d) Assumption 4.2 is violated.

6. Discussion

We study causal identification and estimation when multiple treatments share the same unobserved confounder. By treating some treatments as proxies of the shared confounder, we can identify the intervention distributions of the other treatments. For an expanded class of causal graphs, we prove that the intervention distributions of subsets of treatments are identifiable. We further show that the deconfounder algorithm of Wang & Blei (2019a) makes valid inferences of these intervention distributions when causal identification holds. We demonstrate the practical relevance of these theoretical results in a simulation study, showing how violating the identification conditions can cause the deconfounder to fail in practice.

Acknowledgements

We thank Elias Bareinboim and Victor Veitch for their insightful comments on the manuscript. This work is supported by ONR N00014-17-1-2131, ONR N00014-15-1-2209, NIH 1U01MH115727-01, NSF CCF-1740833, DARPA SD2 FA8750-18-C-0130, Amazon, and the Simons Foundation.

References

Bareinboim, E. & Pearl, J. (2012). Controlling selection bias in causal inference. In Artificial Intelligence and Statistics (pp. 100–108).

Bareinboim, E., Tian, J., & Pearl, J. (2014).
Recovering from selection bias in causal and statistical inference. In AAAI Conference on Artificial Intelligence.

Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer New York.

Carrasco, M., Florens, J.-P., & Renault, E. (2007). Linear inverse problems in structural econometrics: estimation based on spectral decomposition and regularization. Handbook of Econometrics, 6, 5633–5751.

Chen, X., Chernozhukov, V., Lee, S., & Newey, W. K. (2014). Local identification of nonparametric and semiparametric models. Econometrica, 82(2), 785–809.

D'Amour, A. (2019a). Comment: Reflections on the deconfounder. Journal of the American Statistical Association, 114(528), 1597–1601.

D'Amour, A. (2019b). On multi-cause approaches to causal inference with unobserved confounding: Two cautionary failure cases and a promising alternative. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 3478–3486).

Darolles, S., Fan, Y., Florens, J.-P., & Renault, E. (2011). Nonparametric instrumental regression. Econometrica, 79(5), 1541–1565.

D'Haultfoeuille, X. (2011). On the completeness condition in nonparametric instrumental problems. Econometric Theory, 27(3), 460–471.

Frot, B., Nandy, P., & Maathuis, M. H. (2017). Learning directed acyclic graphs with hidden variables via latent Gaussian graphical model selection. arXiv preprint arXiv:1708.01151.

Grimmer, J., Knox, D., & Stewart, B. M. (2020). Naive regression requires weaker assumptions than factor models to adjust for multiple cause confounding. arXiv preprint arXiv:2007.12702.

Heckerman, D. (2018). Accounting for hidden common causes when inferring cause and effect from observational data. arXiv preprint arXiv:1801.00727.

Hu, Y. & Shiu, J.-L. (2018). Nonparametric identification using instrumental variables: sufficient conditions for completeness. Econometric Theory, 34(3), 659–693.

Imai, K. & Jiang, Z. (2019). Comment: The challenges of multiple causes. Journal of the American Statistical Association, 114(528), 1605–1610.
Janzing, D. & Schölkopf, B. (2018). Detecting confounding in multivariate linear models via spectral analysis. Journal of Causal Inference, 6(1).

Kuroki, M. & Pearl, J. (2014). Measurement bias and effect restoration in causal inference. Biometrika, 101(2), 423–437.

Miao, W., Geng, Z., & Tchetgen Tchetgen, E. J. (2018). Identifying causal effects with proxy variables of an unmeasured confounder. Biometrika, 105(4), 987–993.

Miao, W., Hu, W., Ogburn, E. L., & Zhou, X. (2020). Identifying effects of multiple treatments in the presence of unmeasured confounding. arXiv preprint arXiv:2011.04504.

Millán, J., Pintó, X., et al. (2009). Lipoprotein ratios: physiological significance and clinical usefulness in cardiovascular prevention. Vascular Health and Risk Management, 5, 757.

Newey, W. K. & Powell, J. L. (2003). Instrumental variable estimation of nonparametric models. Econometrica, 71(5), 1565–1578.

Ogburn, E. L., Shpitser, I., & Tchetgen, E. J. T. (2019). Comment on "Blessings of multiple causes". Journal of the American Statistical Association, 114(528), 1611–1615.

Ogburn, E. L., Shpitser, I., & Tchetgen, E. J. T. (2020). Counterexamples to "The blessings of multiple causes" by Wang and Blei.

Pearl, J. (2009). Causality. Cambridge University Press, 2nd edition.

Puli, A., Perotte, A., & Ranganath, R. (2020). Causal estimation with functional confounders. Advances in Neural Information Processing Systems, 33.

Ranganath, R. & Perotte, A. (2018). Multiple causal inference with latent confounding. arXiv preprint arXiv:1805.08273.

Shi, X., Miao, W., Nelson, J. C., & Tchetgen Tchetgen, E. J. (2020). Multiply robust causal inference with double-negative control adjustment for categorical unmeasured confounding. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(2), 521–540.

Tipping, M. E. & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611–622.

Tran, D. & Blei, D. M. (2017). Implicit causal models for genome-wide association studies. arXiv preprint arXiv:1710.10742.

Wang, J., Zhao, Q., Hastie, T., & Owen, A. B. (2017). Confounder adjustment in multiple hypothesis testing. Annals of Statistics, 45(5), 1863.

Wang, Y. & Blei, D. M. (2019a). The blessings of multiple causes. Journal of the American Statistical Association, 114(528), 1574–1596.

Wang, Y. & Blei, D. M. (2019b). The blessings of multiple causes: Rejoinder. Journal of the American Statistical Association, 114(528), 1616–1619.

Wang, Y. & Blei, D. M. (2020). Towards clarifying the theory of the deconfounder. arXiv preprint arXiv:2003.04948.

Yang, S., Wang, L., & Ding, P. (2017). Identification and estimation of causal effects with confounders subject to instrumental missingness. arXiv preprint arXiv:1702.03951.

Cevid, D., Bühlmann, P., & Meinshausen, N. (2018). Spectral deconfounding via perturbed sparse linear models.