Valid Causal Inference with (Some) Invalid Instruments

Jason Hartford¹  Victor Veitch²  Dhanya Sridhar³  Kevin Leyton-Brown¹

Abstract

Instrumental variable methods provide a powerful approach to estimating causal effects in the presence of unobserved confounding. But a key challenge when applying them is the reliance on untestable exclusion assumptions that rule out any relationship between the instrument variable and the response that is not mediated by the treatment. In this paper, we show how to perform consistent instrumental variable estimation despite violations of the exclusion assumption. In particular, we show that when one has multiple candidate instruments, only a majority of these candidates (or, more generally, the modal candidate-response relationship) needs to be valid to estimate the causal effect. Our approach uses an estimate of the modal prediction from an ensemble of instrumental variable estimators. The technique is simple to apply and is black-box in the sense that it may be used with any instrumental variable estimator as long as the treatment effect is identified for each valid instrument independently. As such, it is compatible with recent machine-learning-based estimators that allow for the estimation of conditional average treatment effects (CATE) on complex, high-dimensional data. Experimentally, we achieve accurate estimates of conditional average treatment effects using an ensemble of deep network-based estimators, including on a challenging simulated Mendelian randomization problem.

1. Introduction

Instrumental variable (IV) methods are a powerful approach for estimating treatment effects: they are robust to unobserved confounders and they are compatible with a variety of flexible nonlinear function approximators (see e.g.
Newey & Powell, 2003; Darolles et al., 2011; Hartford et al., 2017; Singh et al., 2019; Bennett et al., 2019; Muandet et al., 2020; Dikkala et al., 2020), thereby allowing nonlinear estimation of heterogeneous treatment effects.

*Equal contribution. ¹University of British Columbia, Vancouver, Canada. ²University of Chicago, Illinois, USA. ³Columbia University, New York, USA. Correspondence to: Jason Hartford. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

In order to use an IV approach, one must make three assumptions. The first, relevance, asserts that the treatment is not independent of the instrument. This assumption is relatively unproblematic, because it can be verified with data. The second assumption, unconfounded instrument, asserts that the instrument and outcome do not share any common causes. This assumption cannot be verified directly, but in some cases it can be justified via knowledge of the system; e.g., the instrument may be explicitly randomized or may be the result of some well understood random process. The final assumption, exclusion, asserts that the instrument's effect on the outcome is entirely mediated through the treatment. This assumption is even more problematic: not only can it not be verified directly, but it can be very difficult to rule out the possibility of direct effects between the instrument and the outcome variable. Indeed, there are prominent cases where purported instruments have been called into question for this reason. For example, in economics, the widely used judge fixed effects research design (Kling, 2006) uses random assignment of trial judges as instruments and leverages differences between different judges' propensities to incarcerate to infer the effect of incarceration on some economic outcome of interest (see Frandsen et al., 2019, for many recent examples).
Mueller-Smith (2015) points out that exclusion is violated if judges also hand out other forms of punishment (e.g., fines, a stern verbal warning, etc.) that are not observed. Similarly, in genetic epidemiology, Mendelian randomization (Davey Smith & Ebrahim, 2003) uses genetic variation to study the effects of some exposure on an outcome of interest. For example, given genetic markers that are known to be associated with a higher body mass index (BMI), we can estimate the effect of BMI on cardiovascular disease. However, this only holds if we are confident that the same genetic markers do not influence the risk of cardiovascular disease in any other ways. The possibility of such direct effects, referred to as horizontal pleiotropy in the genetic epidemiology literature, is regarded as a key challenge for Mendelian randomization (Hemani et al., 2018).

It is sometimes possible to identify many candidate instruments, each of which satisfies the relevance assumption; in such settings, demonstrating exclusion is usually the key challenge, though in principle unconfounded instrument could also be a challenge. For example, many such candidate instruments can be obtained in both the judge fixed effects and Mendelian randomization settings, where individual judges and genetic markers, respectively, are treated as different instruments. Rather than asking the modeler to gamble by choosing a single candidate about which to assert these untestable assumptions, this paper advocates making a weaker assumption about the whole set of candidates. Most intuitively, we can assume majority validity: that at least a majority of the candidate instruments satisfy all three assumptions, even if we do not know which candidates are valid and which are invalid. Or we can go further and make the still weaker assumption of modal validity: that the modal relationship between instruments and response is valid.
Observe that modal validity is a weaker condition because if a majority of candidate instruments are valid, the modal candidate-response relationship must be characterized by these valid instruments. Modal validity is satisfied if, as Tolstoy might have said, "All happy instruments are alike; each unhappy instrument is unhappy in its own way."

This paper introduces Mode IV, a robust instrumental variable technique. Mode IV allows the estimation of nonlinear causal effects and lets us estimate conditional average treatment effects that vary with observed covariates. It is simple to implement (it involves fitting an ensemble with a modal aggregation function) and is black-box in the sense that it is compatible with any valid IV estimator, which allows it to leverage any of the recent machine-learning-based IV estimators. Despite its simplicity, Mode IV has strong asymptotic guarantees: we show consistency and that, even on a worst-case distribution, it converges point-wise to an oracle solution at the same rate as the underlying estimators. We experimentally validated Mode IV using both a modified version of the demand simulation from Hartford et al. (2017) and a more realistic Mendelian randomization example modified from Hartwig et al. (2017). In both settings, even with data with a very low signal-to-noise ratio, we observed Mode IV to be robust to exclusion-restriction bias and to accurately recover conditional average treatment effects.

2. Related Work

Background on Instrumental Variables. We are interested in estimating the causal effect of some treatment variable, t, on some outcome of interest, y. The treatment effect is confounded by a set of observed covariates, x, and unobserved confounding factors, ε, which affect both y and t. With unobserved confounding, we cannot rely on conditioning to remove the effect of confounders; instead we use an instrumental variable, z, to identify the causal effect.
Instrumental variable estimation can be thought of as an inverse problem: we can directly identify the causal¹ effect of the instrument on both the treatment and the response, before asking the inverse question: what treatment-response mappings, f : t → y, could explain the difference between these two effects? The problem is identified if this question has a unique answer. If the true structural relationship is of the form y = f(t, x) + ε, one can show that

E[y | x, z] = ∫ f(t, x) dF(t | x, z),

where E[y | x, z] gives the instrument-response relationship, F(t | x, z) captures the instrument-treatment relationship, and the goal is to solve the inverse problem to find f(·). In the linear case, f(t, x) = βt + γx, so the integral on the right-hand side reduces to β E[t | x, z] + γx, and β can be estimated using linear regression of y on the predicted values of t given x and z from a first-stage regression. This procedure is known as Two-Stage Least Squares (Angrist & Pischke, 2008). More generally, the causal effect is identified if the integral equation has a unique solution for f (Newey & Powell, 2003).

Nonlinear IV. A number of recent approaches have leveraged this additive-confounders assumption to extend IV analysis beyond the linear setting. Newey & Powell (2003) and Darolles et al. (2011) proposed the first nonparametric procedures for estimating these structural equations, based on polynomial basis expansions. These methods relax the linearity requirement, but scale poorly in both the number of data points and the dimensionality of the data. To overcome these limitations, recent approaches have adapted deep neural networks for nonlinear IV analyses. DeepIV (Hartford et al., 2017) fits a first-stage conditional density estimate F̂(t | x, z) and uses it to solve the above integral equation. Both Bennett et al. (2019) and Dikkala et al.
(2020) adapt the generalized method of moments (Hansen, 1982) to the nonlinear setting by leveraging adversarial losses, while Singh et al. (2019) and Muandet et al. (2020) propose kernel-based procedures for estimation using two-stage and dual formulations of the problem, respectively. Puli & Ranganath (2020) showed conditions that allow IV inference with latent variable estimation techniques.

¹Strictly, non-causal instruments suffice, but identification and interpretation of the estimates can be more subtle (see Swanson & Hernán, 2018).

Inference with invalid instruments in linear settings. Much of the work on valid inference with invalid instruments is in the Mendelian randomization literature, where violations of the exclusion restriction are common. For a recent survey, see Hemani et al. (2018). There are two broad approaches to valid inference in the presence of bias introduced by invalid instruments: averaging over the bias, or eliminating the bias with ideas from robust statistics. In the first setting, valid inference is possible under the assumption that each instrument introduces a random bias, but that the mean of this process is zero (although this assumption can be relaxed; cf. Bowden et al., 2015; Kolesár et al., 2015). Then, the bias tends to zero as the number of instruments grows. Methods in this first broad class have the attractive property that they remain valid even if none of the instruments is valid, but they rely on strong assumptions that do not easily generalize to the nonlinear setting considered in this paper. The second class of approaches to valid inference assumes that some fraction of the instruments are valid and then uses the fact that biased instruments are outliers whose effect can be removed by leveraging robust estimators. For example, by assuming majority validity and constant linear treatment effects², Kang et al. (2016) and Guo et al.
(2018) show that it is possible to consistently estimate the treatment effect via a Lasso-style estimator that uses the sparsity induced by the ℓ1 penalty to remove invalid instruments. Under the same linearity and constant-effect assumptions, Hartwig et al. (2017) showed that one can estimate the treatment effect under modal validity by estimating the mode of a set of Wald estimators. In this paper, we use the same modal insight as Hartwig et al., but generalize the approach to a nonlinear setting, thereby removing the strong assumption of constant treatment effects. Finally, Kuang et al. (2020) recently showed that, under majority validity, it is possible to leverage structure learning techniques to produce a summary (valid) IV that can be plugged into downstream estimators. They focus on a setting with binary instruments, responses and confounders, whereas we aim for a generic method that places no constraints on the data-generating process beyond those necessary for identification.

Ensemble models. Ensembles are widely used in machine learning as a technique for improving prediction performance by reducing variance (Breiman, 1996) and combining the predictions of weak learners trained on non-uniformly sampled data (Freund & Schapire, 1995). These ensemble methods frequently use modal predictions via majority voting among classifiers, but they are designed to reduce variance. Both the median and mode of an ensemble of models have been explored as ways of improving robustness to outliers in the forecasting literature (Stock & Watson, 2004; Kourentzes et al., 2014), but we are not aware of any prior work that explicitly uses these aggregation techniques to eliminate bias from an ensemble.

Mode estimation. If a distribution admits a density, the mode is defined as the global maximum of the density function.
More generally, the mode can be defined as the limit of a sequence of modal intervals (intervals of width h that contain the largest proportion of probability mass), such that

x_mode = lim_{h→0} argmax_x F([x − h/2, x + h/2]).

These two definitions suggest two methods for estimating the mode from samples: one may either estimate the density function and then maximize the estimated function (Parzen, 1962), or search for midpoints of modal intervals from the empirical distribution function. To find modal intervals, one can either fix an interval width, h, and choose x to maximize the number of samples within the modal interval (Chernoff, 1964), or solve the dual problem by fixing the target number of samples that fall into the modal interval and minimizing h (Dalenius, 1965; Venter, 1967). We use this latter Dalenius-Venter approach, as the target number of samples can be parameterized by the number of valid instruments, thereby avoiding the need to select a kernel bandwidth h.

²That is, assuming that the true structural equation is some linear function of the treatment and invalid instruments, and that all units share the same treatment effect parameter, β.

In this paper, we assume we have access to a set of k candidate variables, Z = {z1, . . . , zk}, which are valid instrumental variables if they satisfy relevance, exclusion and unconfounded instrument, and are invalid otherwise. Denote the set of valid instruments V := {z_i : z_i ⊥̸ t, z_i ⊥ ε, z_i ⊥ y | x, t, ε}, and the set of invalid instruments I := Z \ V. We further assume that each valid instrument identifies the causal effect. In the additive-confounder setting, this amounts to assuming that the unobserved confounder's effect on y is additive, such that y = f(t, x, z_{i : i∈I}) + ε for some function f, and that

E[y | x, z_{i : i≠j}, z_j] = ∫ f(t, x, z_{i : i≠j}) dF(t | x, z_{i : i≠j}, z_j)

has the same unique solution for each valid instrument z_j.
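The Dalenius-Venter idea described above (fix a count rather than a bandwidth) is easy to sketch in code. The following is a minimal illustration, not the paper's implementation; the function name and sample values are ours:

```python
import numpy as np

def dalenius_venter_mode(samples, v):
    """Midpoint of the tightest interval containing v of the samples.

    Fixing the sample count v (rather than an interval width h) is the
    dual, Dalenius-Venter formulation described above.
    """
    s = np.sort(np.asarray(samples, dtype=float))
    # Width of every window spanning v consecutive order statistics.
    widths = s[v - 1:] - s[: len(s) - v + 1]
    i = int(np.argmin(widths))
    return 0.5 * (s[i] + s[i + v - 1])

# Three samples cluster in [2.0, 3.0]; the outliers are ignored.
print(dalenius_venter_mode([0.5, 2.0, 2.5, 3.0, 5.0, 7.0], v=3))  # → 2.5
```

Because v is a count, the only tuning decision is how many estimates are believed to be valid, which is exactly the quantity the analyst is asked to bound below.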
The Mode IV procedure requires the analyst to specify a lower bound V ≥ 2 on the number of valid instruments, and then proceeds in three steps.

1. Fit an ensemble of k estimates of the conditional outcome, {f̂1, . . . , f̂k}, using a nonlinear IV procedure applied to each of the k instruments. Each f̂ is a function mapping treatment t and covariates x to an estimate of the effect of the treatment conditional on x.

2. For a given test point (t, x), select [l̂, û] as the smallest interval containing V of the estimates {f̂1(t, x), . . . , f̂k(t, x)}. Define Î_mode = {i : l̂ ≤ f̂_i(t, x) ≤ û} to be the indices of the instruments corresponding to estimates falling in the interval.

3. Return f̂_mode(t, x) = (1/|Î_mode|) Σ_{i∈Î_mode} f̂_i(t, x).

Figure 1 shows this procedure graphically. The idea is that the estimates from the valid instruments will tend to cluster around the true value of the effect, E[y | do(t), x]. We assume that the most common effect is a valid one; i.e., that the modal effect is valid. To estimate the mode, we look for the tightest cluster of points, which, by definition, are the points contained in Î_mode. Intuitively, each estimate in this interval should be approximately valid and hence approximates the modal effect. Finally, we average these estimates to reduce variance.

Figure 1. Example of the Mode IV algorithm with 7 candidates (4 valid and 3 invalid) from the biased demand simulation (see Section 4). The 7 estimators shown in the plot are each trained with a different candidate, and at every test point t, the mode of the 7 predictions is computed point-wise. The region highlighted in green contains the 3 predictions that formed part of the modal interval for each given input. The Mode IV prediction (the mean of the 3 closest predictions) is shown in solid green.
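Steps 2 and 3 above can be sketched at a single test point as follows. This is a minimal illustration under our own naming (`mode_iv_point`, and hypothetical prediction values); in practice the `estimates` would come from k fitted nonlinear IV models evaluated at (t, x):

```python
import numpy as np

def mode_iv_point(estimates, v):
    """Steps 2-3 of Mode IV at one test point (t, x).

    `estimates` holds the k per-instrument predictions f_hat_i(t, x);
    `v` is the analyst's lower bound on the number of valid instruments.
    """
    s = np.sort(np.asarray(estimates, dtype=float))
    # Step 2: tightest interval [l_hat, u_hat] containing v estimates.
    widths = s[v - 1:] - s[: len(s) - v + 1]
    i = int(np.argmin(widths))
    # Step 3: average the estimates inside the modal interval.
    return float(s[i : i + v].mean())

# Seven hypothetical predictions at one (t, x): four valid instruments
# cluster near the true effect; three invalid ones are biased.
preds = [1.9, 2.0, 2.0, 2.1, 4.5, -1.0, 3.2]
print(mode_iv_point(preds, v=4))  # ≈ 2.0
```

Note that the interval, and hence the subset of instruments averaged, is recomputed at every test point, which is what makes the procedure point-wise.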
The next theorem formalizes this intuition by showing that Mode IV asymptotically identifies and consistently estimates the causal effect.

Theorem 1. Fix a test point (t, x) and let f̂1, . . . , f̂k be estimators of the causal effect of t at x corresponding to k (possibly invalid) instruments; e.g., f̂_j = f̂_j(t, x). Denote the true effect as f* = E[y | do(t), x]. Suppose that:

1. (consistent estimators) f̂_j → f_j almost surely for each instrument. In particular, f_j = f* whenever the jth instrument is valid.

2. (modal validity) At least v of the instruments are valid, and no more than v − 1 of the invalid instruments agree on an effect. That is, v of the instruments yield the same estimand if and only if all of those instruments are valid.

Let [l̂, û] be the smallest interval containing v of the estimates and let Î_mode = {i : l̂ ≤ f̂_i ≤ û}. Then, almost surely,

Σ_{i∈Î_mode} ŵ_i f̂_i → f*,

where ŵ_i, w_i are any non-negative set of weights such that each ŵ_i → w_i a.s. and Σ_{i∈Î_mode} w_i = 1.

We defer all proofs to the supplementary material. Of course, the Mode IV procedure can be generalized to allow estimators of the mode that differ from the one used in Steps 2 and 3. The key advantage of the Dalenius-Venter modal estimator is that the optimal choice for its only hyper-parameter, V, does not depend on the distribution of the estimators at a given test point. By contrast, kernel density-based modal estimators require tuning a length-scale parameter, where the optimal choice may vary as a function of the test point, (t, x). It is also straightforward to implement³, and relatively insensitive to the choice of V. The procedure as a whole is, however, k times more computationally expensive than running a single estimation procedure at both training and test time.

Despite its simplicity, Mode IV has strong point-wise worst-case guarantees.
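As an illustrative end-to-end check of this consistency claim, the following toy simulation (our own, with hypothetical parameters) uses the single-instrument ratio estimator cov(z, y)/cov(z, t) as the base IV estimator in a linear setting, rather than the nonlinear estimators the paper targets. Three of seven candidate instruments violate exclusion through direct effects on y, so the naive average over all candidates is biased while the modal aggregate is not:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, true_effect = 200_000, 7, 2.0
eps = rng.normal(size=n)                  # unobserved confounder
Z = rng.normal(size=(n, k))               # 7 candidate instruments
t = 0.5 * Z.sum(axis=1) + eps + rng.normal(size=n)
# Candidates 4-6 violate exclusion via direct effects on y.
direct = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.5, 0.5])
y = true_effect * t + Z @ direct + 2.0 * eps + rng.normal(size=n)

def iv_ratio(y, t, z):
    """Single-instrument IV (Wald ratio) estimate: cov(z, y) / cov(z, t)."""
    return np.cov(z, y)[0, 1] / np.cov(z, t)[0, 1]

estimates = np.array([iv_ratio(y, t, Z[:, j]) for j in range(k)])

def mode_iv(estimates, v):
    """Mean of the tightest window of v per-instrument estimates."""
    s = np.sort(estimates)
    i = int(np.argmin(s[v - 1:] - s[: len(s) - v + 1]))
    return float(s[i : i + v].mean())

# The naive mean over all 7 candidates is biased; Mode IV is close to 2.0.
print(estimates.mean(), mode_iv(estimates, v=4))
```

The valid candidates' estimates cluster tightly around the true effect of 2.0, so the modal interval excludes the three biased candidates; the naive mean is pulled upward by them.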
Theorem 2 shows that if each estimate is bounded⁴, then even in the worst case where v − 1 invalid candidates all agree on an effect, Mode IV converges at the same rate as the underlying estimators to the solution of an oracle that uniformly averages the valid instruments. In particular, if the estimators achieve the parametric rate, 1/√n, in the number of instances n, then Mode IV also converges at 1/√n.

Theorem 2. For some test point (t, x), let Z = {f̂1, . . . , f̂k} be k estimates of the causal effect of t at x. Assume:

[Bounded estimates] Each estimate is bounded by some constants, [a_i, b_i].

[Convergent estimators] Each estimator converges in mean squared error at a rate n^{−r} (where r = 1/2 if the estimator achieves the parametric rate), and hence each estimator has