# Causal Inference through a Witness Protection Program

Journal of Machine Learning Research 17 (2016) 1-53. Submitted 4/15; Revised 8/15; Published 4/16.

Ricardo Silva (ricardo@stats.ucl.ac.uk), Department of Statistical Science and CSML, University College London, London WC1E 6BT, UK

Robin Evans (evans@stats.ox.ac.uk), Department of Statistics, University of Oxford, Oxford OX1 3TG, UK

Editor: Kevin Murphy

Abstract: One of the most fundamental problems in causal inference is the estimation of a causal effect when treatment and outcome are confounded. This is difficult in an observational study, because one has no direct evidence that all confounders have been adjusted for. We introduce a novel approach for estimating causal effects that exploits observational conditional independencies to suggest "weak" paths in an unknown causal graph. The widely used faithfulness condition of Spirtes et al. is relaxed to allow for varying degrees of path cancellations that imply conditional independencies but do not rule out the existence of confounding causal paths. The output is a posterior distribution over bounds on the average causal effect via a linear programming approach and Bayesian inference. We claim this approach should be used in regular practice as a complement to other tools in observational studies.

Keywords: causal inference, instrumental variables, Bayesian inference, linear programming

1. Contribution

We provide a new methodology for obtaining bounds on the average causal effect (ACE) of a treatment variable X on an outcome variable Y. We introduce methods for binary models and for linear continuous models. For binary variables, the ACE is defined as

E[Y | do(X = 1)] − E[Y | do(X = 0)] = P(Y = 1 | do(X = 1)) − P(Y = 1 | do(X = 0)),   (1)

where do(·) is the operator of Pearl (2000), denoting distributions where a set of variables has been intervened on by an external agent. In this paper, we assume the reader is familiar with the concept of causal graphs, the basics of the do operator, and the basics of causal discovery algorithms such as the PC algorithm of Spirtes et al. (2000). We provide a short summary for context in Section 2.

The ACE is in general not identifiable from observational data. We obtain upper and lower bounds on the ACE by exploiting a set of covariates, which we assume are not affected by X or Y, as justified by temporal ordering or other background assumptions. Such covariate sets are often found in real-world problems, and form the basis of many of the observational studies done in practice (Rosenbaum, 2002a). However, it is not obvious how to obtain the ACE as a function of the covariates. Our contribution modifies the results of Entner et al. (2013), who exploit conditional independence constraints to obtain point estimates of the ACE but rely on assumptions that might be unstable with finite sample sizes. Our modification provides a different interpretation of their search procedure, which we use to generate candidate instrumental variables (Manski, 2007).

Figure 1: (a) A generic causal graph where X and Y are confounded by some U. (b) The same system in (a) where X is intervened upon by an external agent. (c) A system where W and Y are independent given X. (d) A system where it is possible to use faithfulness to discover that U is sufficient to block all back-door paths between X and Y. (e) Here, U itself is not sufficient.
The linear programming approach of Dawid (2003), inspired by Balke and Pearl (1997) and further refined by Ramsahai (2012), is then modified to generate bounds on the ACE by introducing constraints on some causal paths, motivated as relaxations of Entner et al. (2013). The new setup can be computationally expensive, so we introduce further relaxations to the linear program to generate novel symbolic bounds, and a fast algorithm that sidesteps the full linear programming optimization. In Section 2, we discuss the background of the problem. In Section 3 we provide an overview of the methodology, which is divided into several subcomponents and described through Sections 4-8. Section 9 contains experiments with synthetic and real data.

2. Background: Instrumental Variables, Witnesses and Admissible Sets

Assuming X is a potential cause of Y, but not the opposite, a cartoon of the possibly complex real-world causal system containing X and Y is shown in Figure 1(a). U represents the universe of common causes of X and Y. In control and policy-making problems, we would like to know what will happen to the system when the distribution of X is overridden by some external agent (e.g., a doctor, a robot or an economist). The resulting modified system is depicted in Figure 1(b), and represents the family of distributions indexed by do(X = x): the graph in (a) has undergone a surgery that removes incoming edges to X. Spirtes et al. (2000) provide an account of the first graphical methods exploiting this idea, which are related to the overriding of structural equations proposed by Haavelmo (1943). Notice that if U is observed in the data set, then we can obtain the distribution P(Y = y | do(X = x)) by simply calculating Σu P(Y = y | X = x, U = u) P(U = u) (Spirtes et al., 2000). This was popularized by Pearl (2000) as the back-door adjustment. In general P(Y = y | do(X = x)) can be substantially different from P(Y = y | X = x).

The ACE can usually be estimated via a trial in which X is randomized: this is equivalent to estimating the conditional distribution of Y given X under data generated as in Figure 1(b). In contrast, in an observational study (Rosenbaum, 2002a) we obtain data generated by the system in Figure 1(a). If one believes all relevant confounders U have been recorded in the data, then the back-door adjustment can be used, though such completeness is uncommon. By postulating knowledge of the causal graph relating components of U, one can infer whether a measured subset of the causes of X and Y is enough (Pearl, 2000; VanderWeele and Shpitser, 2011; Pearl, 2009). Without knowledge of the causal graph, assumptions such as faithfulness (Spirtes et al., 2000) are used to infer it. The faithfulness assumption states that a conditional independence constraint in the observed distribution exists if and only if a corresponding structural independence exists in the underlying causal graph. For instance, observing the independence W ⊥ Y | X, and assuming faithfulness and the causal order, we can infer the causal graph of Figure 1(c); in all the other graphs this conditional independence is not implied. We deduce that no unmeasured confounders between X and Y exist. This simple procedure for identifying chains W → X → Y is useful in exploratory data analysis (Chen et al., 2007; Cooper, 1997), where a large number of possible causal relations X → Y are unquantified but can be screened off using observational data before experiments are performed.
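To make the back-door adjustment concrete, the following minimal Python sketch computes P(Y = 1 | do(X = x)) by averaging P(Y = 1 | X = x, U = u) over P(U = u) and contrasts it with the naive conditional P(Y = 1 | X = x). The joint table used here is purely illustrative and is not taken from the paper.

```python
import numpy as np

# Hypothetical joint distribution P(U, X, Y) over binary variables,
# indexed as p[u, x, y]; U confounds X and Y.
p = np.array([[[0.30, 0.05], [0.05, 0.10]],
              [[0.05, 0.10], [0.05, 0.30]]])
assert np.isclose(p.sum(), 1.0)

p_u = p.sum(axis=(1, 2))                    # P(U = u)
p_y1_given_xu = p[:, :, 1] / p.sum(axis=2)  # P(Y = 1 | X = x, U = u), shape (u, x)

# Back-door adjustment: P(Y = 1 | do(X = x)) = sum_u P(Y = 1 | x, u) P(u)
p_do = (p_y1_given_xu * p_u[:, None]).sum(axis=0)

# Naive conditional: P(Y = 1 | X = x)
p_x = p.sum(axis=(0, 2))
p_cond = p[:, :, 1].sum(axis=0) / p_x

print("ACE via adjustment:", p_do[1] - p_do[0])
print("Naive contrast    :", p_cond[1] - p_cond[0])
```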
The purpose of using faithfulness is to be able to sometimes identify such quantities. Entner et al. (2013) generalize the discovery of chain models to situations where a non-empty set of covariates is necessary to block all back-doors. Suppose W is a set of covariates which are known not to be effects of either X or Y, and we want to find an admissible set contained in W: a set of observed variables which we can use for back-door adjustment to obtain P(Y = y | do(X = x)). Entner et al.'s Rule 1 states the following:

Rule 1: If there exists a variable W ∈ W and a set Z ⊆ W\{W} such that (i) W ⊥̸ Y | Z and (ii) W ⊥ Y | Z ∪ {X}, then infer that Z is an admissible set.

Entner et al. (2013) also provide ways of identifying zero effects with a Rule 2. For simplicity of presentation, for now we assume that the effect of interest was already identified as non-zero. Section 7 discusses the case of zero effects.

A point estimate of the ACE can then be found using Z. Given that (W, Z) satisfies Rule 1, we call W a witness for the admissible set Z. The model in Figure 1(c) can be identified with Rule 1, where W is the witness and Z = ∅. In this case, for binary models the so-called naive functional P(Y = 1 | X = 1) − P(Y = 1 | X = 0) will provide the correct ACE. If U is observable in Figure 1(d), then it can be identified as an admissible set for witness W. Notice that in Figure 1(a), taking U as a scalar, it is not possible to find a witness since there are no remaining variables. Also, if in Figure 1(e) our covariate set W is {W, U}, then no witness can be found since U cannot be blocked. Hence, it is possible for a procedure based on Rule 1 to answer "I don't know", even when a back-door adjustment would be possible if one knew the causal graph. However, using the faithfulness assumption alone one cannot do better: Rule 1 is complete for non-zero effects if no further information is available (Entner et al., 2013).

Despite its appeal, the faithfulness assumption is not without difficulties. Even if unfaithful distributions can be ruled out as pathological under seemingly reasonable conditions (Meek, 1995), distributions which lie close to (but not on) an unfaithful model may in practice be indistinguishable from distributions within that unfaithful model at finite sample sizes. To appreciate these complications, consider the structure in Figure 1(d) with U unobservable and the remaining (observable) variables binary. Here W is randomized but X is not, and we would like to know the ACE of X on Y¹. W is sometimes known as an instrumental variable (IV), and we call Figure 1(d) the standard IV structure (SIV); the distinctive features here being the constraints W ⊥ U and W ⊥ Y | {X, U}, statements which include latent variables. If this structure is known, optimal bounds LSIV ≤ E[Y | do(X = 1)] − E[Y | do(X = 0)] ≤ USIV can be obtained without further assumptions, and estimated using only observational data over the binary variables W, X and Y (Balke and Pearl, 1997). This structure cannot be found using faithfulness, as the only independence constraints involve a latent variable. However, there exist distributions faithful to the IV structure which may nevertheless appear to satisfy the Markov property for the structure W → X → Y; in practice this can occur at any finite sample size (Robins et al., 2003).
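As a sketch of how the premises of Rule 1 might be checked in practice, the following Python fragment enumerates candidate (witness, admissible set) pairs given a generic conditional-independence test. The function name ci_test and the column labels 'X' and 'Y' are placeholders for this illustration, not part of the paper's implementation.

```python
from itertools import combinations

def find_witnesses(covariates, ci_test, max_set_size=3):
    """Enumerate (witness W, admissible set Z) pairs satisfying Rule 1:
    (i) W dependent on Y given Z, and (ii) W independent of Y given Z and X.
    `ci_test(a, b, cond)` is a user-supplied test returning True when a is
    judged independent of b given cond; 'X' and 'Y' name treatment and outcome."""
    pairs = []
    for w in covariates:
        others = [v for v in covariates if v != w]
        for size in range(min(max_set_size, len(others)) + 1):
            for z in combinations(others, size):
                dependent_given_z = not ci_test(w, 'Y', list(z))         # premise (i)
                independent_given_zx = ci_test(w, 'Y', list(z) + ['X'])  # premise (ii)
                if dependent_given_z and independent_given_zx:
                    pairs.append((w, set(z)))
    return pairs
```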
The true average causal effect may lie anywhere in the interval [LSIV, USIV], which can be rather wide even when W ⊥ Y | X, as shown by the following result:

Proposition 1. If W ⊥ Y | X and the model follows the causal structure of the standard IV graph, then USIV − LSIV = 1 − |P(X = 1 | W = 1) − P(X = 1 | W = 0)|.

All proofs in this manuscript are given in Appendix A. For a fixed joint distribution P(W, X, Y), the length of such an interval cannot be further minimized (Balke and Pearl, 1997). Notice that the length of the interval will depend on how strongly associated W and X are: W = X implies USIV − LSIV = 0 as expected, since this is the scenario of a perfect intervention. The scenario where W ⊥ X is analogous to not having any instrumental variable, and the length of the corresponding interval is 1.

1. A classical example is in non-compliance: suppose W is the assignment of a patient to either drug or placebo, X is whether the patient actually took the medicine or not, and Y is a measure of health status. The doctor controls W but not X. This problem is discussed by Pearl (2000) and Dawid (2003).

Thus, the true ACE may differ considerably from the naive functional supported by Entner et al.'s Rule 1, appropriate for the simpler structure W → X → Y but not for the standard IV structure. While we emphasize that this is a worst-case scenario analysis and by itself should not rule out faithfulness as a useful assumption, it is desirable to provide a method that gives greater control over violations of faithfulness.

In Section 4, we introduce the main algorithm, the Witness Protection Program. The core idea is (i) to invert the usage of Entner et al.'s Rule 1, so that pairs (W, Z) should provide an instrumental variable bounding method instead of a back-door adjustment; (ii) to express violations of faithfulness as bounded violations of local independence; and (iii) to find bounds on the ACE using a linear programming formulation. Unless stated otherwise, it is assumed that all observed variables are binary.

A simplified version of the algorithm is shown in Algorithm 1. This version assumes we know the distribution of the observed variables, P(W, X, Y), which simplifies the exposition of the method. The loops in Steps 2 and 3 are a search for pairs (W, Z) of witness-admissible sets that satisfy Entner et al.'s Rule 1, done by verifying independence constraints in the given joint distribution. If we assumed faithfulness, the job would be complete: we either obtain an empty set or the true ACE. This is essentially the contribution of Entner et al. (2013). However, we assume that faithfulness need not hold, and that all variables can be connected to each other in the causal graph, including a set U of hidden common causes of X and Y. At the same time, we cannot allow for arbitrary violations of faithfulness, as the presence of hidden common causes leads to only very weak constraints on the ACE. Instead, we allow for the expression of a subset of possible violations, expressed as "weak" edges on the fully connected causal graph of W, Z, X, Y and U. The meaning of a weak edge is given in detail in Section 4, and it is fully defined by a set of hyperparameters ℵ that needs to be provided to the algorithm, also explained in Section 4. Given ℵ, a generalization of the linear programming problem for instrumental variables described by Ramsahai (2012) can be used to find tight lower and upper bounds on the ACE.
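Proposition 1 ties the width of the standard IV interval directly to the strength of the instrument. A two-line illustration (the probabilities below are made up):

```python
def siv_interval_width(p_x1_given_w1, p_x1_given_w0):
    """Width of the standard IV ACE interval under W ⊥ Y | X (Proposition 1)."""
    return 1.0 - abs(p_x1_given_w1 - p_x1_given_w0)

# A strong instrument narrows the interval; an irrelevant one leaves it at width 1.
print(siv_interval_width(0.9, 0.1))   # 0.2
print(siv_interval_width(0.5, 0.5))   # 1.0
```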
As the approach provides each witness a degree of protection against faithfulness violations, using a linear program, we call this framework the Witness Protection Program (WPP). Thus, this procedure unifies back-door adjustments and (a generalization of) instrumental variable approaches in a single framework, while not requiring knowledge of the true causal graph and relying on assumptions weaker than faithfulness. This is the main message of the paper.

The output of the algorithm provides a set of lower/upper bounds on the ACE. If one could assume that ℵ is conservative (that is, the actual edges are weaker than the ones implied by the set of causal models (W, X, Y, Z, U) compatible with ℵ), then a tight interval containing the ACE will be given by the largest lower bound and the smallest upper bound. However, there are several practical issues that need to be solved, the main ones being: (i) we do not know P(W, X, Y) and hence it needs to be estimated from data; (ii) once statistical errors are introduced, it is not clear how to combine the different constraints implied by the algorithm; (iii) the computational cost of the procedure can be high, particularly if uncertainty estimates are required; (iv) we would like to have some results for continuous data; (v) the set of hyperparameters ℵ needs to be chosen somehow, and some objective criterion to choose them is important in practice.

input: A distribution P(W, X, Y); a set of relaxation parameters ℵ; covariate index set W and cause-effect indices X and Y;
output: A set of quadruplets (W, Z, LWZ, UWZ), where (W, Z) is a witness-admissible set pair and (LWZ, UWZ) are a lower and upper bound on the ACE, respectively;
1  R ← ∅;
2  for each W ∈ W do
3    for every admissible set Z ⊆ W\{W} identified by W do
4      (LWZ, UWZ) ← bounds on the ACE as given by P(W, X, Y, Z) and ℵ;
5      R ← R ∪ {(W, Z, LWZ, UWZ)};
6  return R

Algorithm 1: A simplified Witness Protection Program algorithm, assuming the observable distribution P(W, X, Y) is known.

Section 4.2 addresses points (i) and (ii) using a Bayesian approach. This requires a likelihood function. Since the latent variable model that includes U is not identifiable, we work directly on the marginal observable distribution under the constraints implied by the linear program. Independence constraints can be tested using Bayesian model selection, but optionally can be ignored in the linear programming step to provide a more stringent test of feasibility of ℵ, as the feasible region for the ACE might be empty if the tested independence does not fit the data well enough, even if it passes the test. An interpretation of this usage of independence tests is given in Section 4.3. We criticize naive uses of Bayesian inference for latent variable models in Section 4.4.

A convenient implication of using Bayesian inference is that credible intervals for the ACE bounds can be computed in a conceptually simple way, using Monte Carlo methods. However, numerically running a linear program for each sample is expensive. A fully analytical solution to the linear program is not known, but a solution to a relaxed version of it can be found by a much cheaper and more numerically stable iterative algorithm (compared to a black-box solver), given in Section 5. This addresses point (iii), but bounds are looser than those obtained with a numerical solver as a consequence. Point (iv) is partially addressed by Section 6, where we derive bounding methods for linear models. This complements Entner et al.
(2012), which relies on non-Gaussianity and faithfulness using conditions weaker than Rule 1. Conceptually the method can be adapted to discrete non-binary data without major modifications, although the presentation gets considerably more complicated. Treating continuous W is not a theoretical problem (at least by discretizing each W on demand while keeping Z continuous), although different estimation techniques and parametric assumptions would be required. Likewise, it is theoretically possible to get bounds on the cumulative distribution function of Y by dichotomizing it at different levels Y ≤ y, but we will not further discuss these generalizations in this paper.

Section 7 is a final note concerning the main procedure, where we discuss the possibility of exploiting Entner et al.'s Rule 2 for detecting zero effects. Although we do not further analyze this modification in our experiments, this section provides further insight on how WPP is related to sensitivity analysis methods for observational studies previously found in the literature. Finally, Section 8 is an extensive discussion of point (v), the choice of ℵ and the need to deal with possibly incoherent bounds, which also relates back to point (ii). While this discussion is orthogonal to the main algorithm, which takes ℵ as a given and is guaranteed to be at least as conservative as Entner et al. (2013), it is an important practical issue. This section also complements the discussion started in Section 7 on the relation between WPP and sensitivity analysis methods.

4. The Witness Protection Program

Let (W, Z) be any pair found by a search procedure that decides when Rule 1 holds. W will play the role of an instrumental variable, instead of being discarded. Conditional on Z, the lack of an edge W → Y can be justified by faithfulness (as W ⊥ Y | {X, Z}). For the same reason, there should not be any (conditional) dependence between W and a possible unmeasured common parent² U of X and Y. Hence, W ⊥ U and W ⊥ Y | {U, X} hold given Z. A standard IV bounding procedure such as that of Balke and Pearl (1997) can then be used conditional on each individual value z of Z, and then averaged over P(Z). That is, we can independently obtain lower and upper bounds {L(z), U(z)} for each value z, and bound the ACE by

Σz L(z) P(Z = z) ≤ E[Y | do(X = 1)] − E[Y | do(X = 0)] ≤ Σz U(z) P(Z = z),   (2)

since E[Y | do(X = 1)] − E[Y | do(X = 0)] = Σz (E[Y | do(X = 1), Z = z] − E[Y | do(X = 0), Z = z]) P(Z = z).

Under the assumption of faithfulness and the satisfiability of Rule 1, the calculation of the above interval is redundant, as Rule 1 allows the direct use of the back-door adjustment using Z. Our goal is to not enforce faithfulness, but to use Rule 1 as a motivation to allow for a subset of violations of faithfulness, rather than arbitrary violations. In what follows, assume Z is set to a particular value z and all references to distributions are implicitly assumed to be defined conditional on the event Z = z. That is, for simplicity of notation, we will neither represent nor condition on Z explicitly. The causal ordering where X and Y cannot precede any other variable is also assumed, as well as the causal ordering between X and Y.
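Equation (2) combines value-specific bounds by weighting them with P(Z = z); a minimal sketch of that combination step, with illustrative numbers only:

```python
import numpy as np

def combine_bounds(lower_by_z, upper_by_z, p_z):
    """Combine per-stratum ACE bounds {L(z), U(z)} into unconditional bounds,
    weighting by P(Z = z) as in Equation (2)."""
    p_z = np.asarray(p_z, dtype=float)
    assert np.isclose(p_z.sum(), 1.0)
    return float(np.dot(lower_by_z, p_z)), float(np.dot(upper_by_z, p_z))

# Example with two strata of Z (numbers are illustrative only).
L, U = combine_bounds([-0.3, 0.05], [0.4, 0.25], [0.6, 0.4])
print(L, U)   # -0.16 0.34
```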
Consider a standard parameterization of a directed acyclic graph (DAG) model, not necessarily causal, in terms of conditional probability tables (CPTs): let θ^V_v.p represent P(V = v | Par(V) = p), where V ∈ {W, X, Y, U} denotes both a random variable and a vertex in the corresponding DAG, and Par(V) is the corresponding set of parents of V.

2. In this manuscript, we will sometimes refer to U as a set of common parents, although we do not change our notation to bold face to reflect that.

Faithfulness violations occur when independence constraints among observables are not structural, but due to path cancellations. This means that parameter values are arranged so that W ⊥ Y | X holds, but paths connecting W and U, or W and Y, may exist, so that either W ⊥̸ U or W ⊥̸ Y | {U, X}. In this situation, some combination of the following should hold true:

P(Y = y | X = x, W = w, U = u) ≠ P(Y = y | X = x, U = u)
P(Y = y | X = x, W = w, U = u) ≠ P(Y = y | X = x, W = w)
P(X = x | W = w, U = u) ≠ P(X = x | W = w)
P(U = u | W = w) ≠ P(U = u),   (3)

for some {w, x, y, u} in the sample space of P(W, X, Y, U). For instance, if the second and third statements above are true, this implies the existence of an active path into X and Y via U, conditional on W³, such as X ← U → Y. If the first statement is true, this corresponds to an active path between W and Y that is not blocked by {X, U}. If the fourth statement is true, U and W are marginally dependent, with a corresponding active path. Notice that some combinations are still compatible with a model where W ⊥ U and W ⊥ Y | {U, X} hold: if the second statement in (3) is false, this does not mean that U is necessarily a common parent of X and Y. Such a family of models is observationally equivalent⁴ to one where U is independent of all variables.

When translating the conditions (3) into parameters {θ^V_v.p}, we need to define parent sets for each vertex, which only need to respect the partial causal ordering being assumed; similarly to the Prediction Algorithm of Spirtes et al. (2000), we do not need to fully specify a causal model in order to identify some of its interventional distributions. In our conditional probability table (CPT) factorization, we define Par(X) = {W, U} and Par(Y) = {W, X, U}. The joint distribution of {W, U} can be factorized arbitrarily: the causal directionality among W, U (and Z) is not relevant to the only interventional distribution of interest, do(X).

In the next subsection, we refine the parameterization of our model by introducing redundancies: we provide a parameterization for the latent variable model P(W, X, Y, U), the interventional distribution P(W, Y, U | do(X)) and the corresponding (latent-free) marginals P(W, X, Y) and P(W, Y | do(X)). These parameters cannot vary fully independently of each other. It is this fact that will allow us to bound the ACE using only P(W, X, Y).

4.1 Encoding Faithfulness Relaxations with Linear Constraints

We define a relaxation of faithfulness as any set of assumptions that allows the relations in (3) to be true, but not necessarily in an arbitrary way: this means that while the left-hand and right-hand sides of each entry of (3) are indeed different, their difference is bounded either in absolute value or by ratios. Without such restrictions, (3) will only imply uninteresting bounds, as discussed in our presentation of Proposition 1.

3. That is, a path that d-connects X and Y and includes U, conditional on W; it is into X (and Y) because the edge linking X to the path points to X. See Spirtes et al. (2000) and Pearl (2000) for formal definitions and more examples.

4. Meaning a family of models where P(W, X, Y) satisfies the same constraints.
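For intuition about the conditions in (3), the sketch below takes a fully specified joint distribution over (W, X, Y, U) and reports which of the four equalities fail. It only illustrates the definitions; it is not part of the WPP procedure, and the function name is an assumption of this example.

```python
import numpy as np

def faithfulness_violation_flags(p, tol=1e-9):
    """Given a full joint p[w, x, y, u] over binary W, X, Y and discrete U,
    report which inequalities in (3) hold somewhere in the sample space."""
    p = np.asarray(p, dtype=float)
    p_wxu = p.sum(axis=2)                                         # P(W, X, U)
    p_y_wxu = p / p_wxu[:, :, None, :]                            # P(Y | W, X, U)
    p_y_xu = p.sum(axis=0) / p_wxu.sum(axis=0)[:, None, :]        # P(Y | X, U)
    p_y_wx = p.sum(axis=3) / p_wxu.sum(axis=2)[:, :, None]        # P(Y | W, X)
    p_x_wu = p_wxu / p_wxu.sum(axis=1)[:, None, :]                # P(X | W, U)
    p_x_w = p_wxu.sum(axis=2) / p_wxu.sum(axis=(1, 2))[:, None]   # P(X | W)
    p_u_w = p_wxu.sum(axis=1) / p_wxu.sum(axis=(1, 2))[:, None]   # P(U | W)
    p_u = p.sum(axis=(0, 1, 2))                                   # P(U)
    return {
        "P(Y|X,W,U) != P(Y|X,U)": not np.allclose(p_y_wxu, p_y_xu[None], atol=tol),
        "P(Y|X,W,U) != P(Y|X,W)": not np.allclose(p_y_wxu, p_y_wx[:, :, :, None], atol=tol),
        "P(X|W,U) != P(X|W)":     not np.allclose(p_x_wu, p_x_w[:, :, None], atol=tol),
        "P(U|W) != P(U)":         not np.allclose(p_u_w, p_u[None], atol=tol),
    }
```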
Figure 2: A visual depiction of the family of assumptions introduced in our framework. Dashed edges correspond to conditional dependencies that are constrained according to free parameters, displayed along each corresponding edge. This is motivated by observing W ⊥ Y | X.

Consider the following parameterization of the distribution of {W, X, Y, U} under the observational and interventional regimes, and their respective marginals obtained by integrating U away⁵. Again we condition everywhere on a particular value z of Z but, for simplicity of presentation, we suppress this from our notation, since it is not crucial to the developments in this section:

ζ*yx.w ≡ P(Y = y, X = x | W = w, U)
ζyx.w ≡ ΣU P(Y = y, X = x | W = w, U) P(U | W = w) = P(Y = y, X = x | W = w)
η*xw ≡ P(Y = 1 | X = x, W = w, U)
ηxw ≡ ΣU P(Y = 1 | X = x, W = w, U) P(U | W = w) = P(Y = 1 | do(X = x), W = w)
δ*w ≡ P(X = 1 | W = w, U)
δw ≡ ΣU P(X = 1 | W = w, U) P(U | W = w) = P(X = 1 | W = w).

Under this encoding, the ACE is given by

η11 P(W = 1) + η10 P(W = 0) − η01 P(W = 1) − η00 P(W = 0).   (4)

Notice that we do not explicitly parameterize the marginal of U, for reasons that will become clear later. We introduce the following assumptions, as illustrated by Figure 2:

|η*x1 − η*x0| ≤ ϵw   (5)
|η*xw − P(Y = 1 | X = x, W = w)| ≤ ϵy   (6)
|δ*w − P(X = 1 | W = w)| ≤ ϵx   (7)
β P(U) ≤ P(U | W = w) ≤ β̄ P(U).   (8)

5. Notice from the development in this section that U is not necessarily a scalar, nor discrete.

Setting ϵw = 0, β = β̄ = 1 recovers the standard IV structure. Further assuming ϵy = ϵx = 0 recovers the chain structure W → X → Y. Under this parameterization, in the case ϵy = ϵx = 1, β = β̄ = 1, Ramsahai (2012), extending Dawid (2003), used linear programming to obtain bounds on the ACE. We will briefly describe the four main steps of the framework of Dawid (2003), and refer to the cited papers for more details of their implementation. For now, assume that ζyx.w and P(W = w) are known constants, that is, treat P(W, X, Y) as known. This assumption will be dropped later. Dawid's formulation of a bounding procedure for the ACE is as follows.

Step 1. Notice that parameters {η*xw} take values in a 4-dimensional polytope. Find the extreme points of this polytope. Do the same for {δ*w}. In particular, for ϵw = ϵy = 1, the polytope of feasible values for the four-dimensional vector (η*00, η*01, η*10, η*11) is the unit hypercube [0, 1]⁴, a polytope with a total of 16 vertices (0, 0, 0, 0), (0, 0, 0, 1), ..., (1, 1, 1, 1). Dawid (2003) covered the case ϵw = 0, where a two-dimensional vector {η*x} replaces {η*xw}. In Ramsahai (2012), the case 0 ≤ ϵw < 1 is also covered: some of the corners in [0, 1]⁴ disappear and are replaced by others. The case where ϵw = ϵx = ϵy = 1 is vacuous, in the sense that the consecutive steps cannot infer non-trivial constraints on the ACE.

Step 2. Find the extreme points of the joint space {ζ*yx.w} × {η*xw} by mapping them from the extreme points of {δ*w} × {η*xw}, since ζ*1x.w = (δ*w)^x (1 − δ*w)^(1−x) η*xw, and similarly for ζ*0x.w. The extreme points of the joint space {δ*w} × {η*xw} are just the combination of the extreme points of each space.
Some combinations δ*w × η*xw map to the same ζ*yx.w, while the mapping from a given δ*w × η*xw to η*xw is just the trivial projection. At this stage, we obtain all the extreme points of the polytope {ζ*yx.w} × {η*xw} that are entailed by the factorization of P(W, X, Y, U) and our constraints.

Step 3. Using the extreme points of the joint space {ζ*yx.w} × {η*xw}, find the dual polytope of this space in terms of linear inequalities. Points in this polytope are convex combinations of {ζ*yx.w} × {η*xw}, shown by Dawid (2003) to correspond to the marginalizations over some P(U), and each P(U) corresponds to some point in the polytope. This results in constraints over {ζyx.w} × {ηxw}. This is the core step in Dawid (2003): points in the polytope {ζyx.w} × {ηxw} correspond to different marginalizations of U according to different P(U). Describing the polytope in terms of inequalities provides all feasible distributions that result from marginalizing U according to some P(U). Because we included both ζ*yx.w and η*xw in the same space, this will tie together P(Y, X | W) and P(Y | do(X), W).

Step 4. Finally, maximize/minimize (4) with respect to {ηxw} subject to the constraints found in Step 3 to obtain upper/lower bounds on the ACE.

Allowing for the case where ϵx < 1 or ϵy < 1 is just a matter of changing the first step, where box constraints are set on each individual parameter as a function of the known P(Y = y, X = x | W = w), prior to the mapping in Step 2. The resulting constraints are now implicitly non-linear in P(Y = y, X = x | W = w), but at this stage this does not matter, as the distribution of the observables is treated as a constant. That is, each resulting constraint in Step 3 is a linear function of {ηxw} and a multilinear function of {{ζyx.w}, ϵx, ϵy, ϵw, β, β̄, P(W)}, as discussed in Section 5. Within the objective function (4), the only decision variables are {ηxw}, and hence Step 4 still sets up a linear programming problem even if there are multiplicative interactions between {ζyx.w} and other parameters.

To allow for the case β < 1 < β̄, we substitute every occurrence of ζyx.w due to the dualization in Step 3 above⁶ by κyx.w ≡ ΣU ζ*yx.w P(U); notice the difference between κyx.w and ζyx.w. Likewise, we substitute every occurrence of ηxw in the constraints by ωxw ≡ ΣU η*xw P(U). Instead of plugging in constants for the values of κyx.w and turning the crank of a linear programming solver, we treat {κyx.w} (and {ωxw}) as unknowns, linking them to observables and ηxw by the constraints

ηxw/β̄ ≤ ωxw ≤ min(1, ηxw/β),   ζyx.w/β̄ ≤ κyx.w ≤ ζyx.w/β,   (9)
Σyx κyx.w = 1.   (10)

Finally, the steps requiring finding extreme points and converting between representations of a polytope can be easily implemented using a package such as Polymake⁷ or the scdd package⁸ for R. Once bounds are obtained for each particular value of Z, Equation (2) is used to obtain the unconditional bounds, assuming P(Z) is known. In Section 8, we provide some guidance on how to choose the free parameters of the relaxation. However, it is relevant to point out that any choice of ϵw ≥ 0, ϵy ≥ 0, ϵx ≥ 0, 0 ≤ β ≤ 1 ≤ β̄ is guaranteed to provide bounds that are at least as conservative as the back-door adjusted point estimator of Entner et al. (2013), which is always covered by the bounds. Background knowledge, after a user is suggested a witness and admissible set, can also be used to set relaxation parameters.
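The linear programming idea can be made concrete in the special case of the standard IV model (ϵw = 0, β = β̄ = 1, ϵx = ϵy = 1), where the bounds of Balke and Pearl (1997) can be recovered with an off-the-shelf LP solver. The sketch below uses the response-function parameterization rather than the vertex-enumeration and dualization route of Steps 1-4, and the input format is an assumption of this example; it is meant only to make the optimization step tangible.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def iv_ace_bounds(p_yx_given_w):
    """Balke-Pearl style bounds on the ACE for the standard IV structure with
    binary W, X, Y.  p_yx_given_w[w][(y, x)] is the observed P(Y = y, X = x | W = w)."""
    # Response-function parameterization: r_x maps w -> x, r_y maps x -> y.
    x_maps = list(product([0, 1], repeat=2))   # x = x_map[w]
    y_maps = list(product([0, 1], repeat=2))   # y = y_map[x]
    types = list(product(x_maps, y_maps))      # 16 joint types, one LP variable each

    # Equality constraints matching the observed P(Y = y, X = x | W = w).
    A_eq, b_eq = [], []
    for w in (0, 1):
        for x in (0, 1):
            for y in (0, 1):
                row = [1.0 if (xm[w] == x and ym[x] == y) else 0.0
                       for xm, ym in types]
                A_eq.append(row)
                b_eq.append(p_yx_given_w[w][(y, x)])
    A_eq.append([1.0] * len(types))            # probabilities sum to one
    b_eq.append(1.0)

    # Objective: ACE = sum over types of q * (y_map(1) - y_map(0)).
    c = np.array([ym[1] - ym[0] for _, ym in types], dtype=float)
    lower = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
    upper = linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
    if not (lower.success and upper.success):
        return None, None   # infeasible: the distribution violates the IV constraints
    return lower.fun, -upper.fun
```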
So far, the linear programming problem formulated through Steps 1-4 assumes one has already identified an appropriate witness W and admissible set Z, and that the joint distribution P(W, X, Y, Z) is known. In the next section, we discuss how this procedure is integrated with statistical inference for P(W, X, Y, Z) and the search procedure of Entner et al. (2013).

6. Notice the subtlety: the values of P(y, x | w) appear within the extreme points of {ζ*yx.w} × {η*xw}, but here we are only concerned with the symbols ζyx.w emerging from convex combinations of ζ*yx.w. In the original formulation of Dawid (2003), κyx.w = P(y, x | w) is satisfied, because P(U) = P(U | W) is assumed, but in our case in general this will not be true. Hence, the need for a different symbol. Ramsahai (2012) deals with the P(U) = P(U | W) relaxation in a different way, by conditioning on each value of W, but his ACE intervals always include zero.

7. http://www.polymake.org

8. http://cran.r-project.org/

input: A binary data matrix D; a set of relaxation parameters ℵ; a covariate index set W and cause-effect indices X and Y;
output: A set of triplets (W, Z, B), where (W, Z) is a witness-admissible set pair contained in W and B is a distribution over lower/upper bounds on the ACE implied by the pair;
1  R ← ∅;
2  for each W ∈ W do
3    for every admissible set Z ⊆ W\{W} identified by W given D do
4      B ← posterior over lower/upper bounds on the ACE as given by (W, Z, X, Y, D, ℵ);
5      if there is no evidence in B to falsify the (W, Z, ℵ) model then
6        R ← R ∪ {(W, Z, B)};
10 return R

Algorithm 2: The outline of the Witness Protection Program algorithm.

4.2 Bayesian Learning and Result Summarization

In the previous section, we treated (the conditional) ζyx.w and P(W = w) as known. A common practice is to replace them by plug-in estimators (and in the case of a non-empty admissible set Z, an estimate of P(Z) is also necessary). Such models can also be falsified, as the constraints generated are typically only supported by a strict subset of the probability simplex. In principle, one could fit parameters without constraints, and test the model by a direct check of satisfiability of the inequalities using the plug-in values. However, this does not take into account the uncertainty in the estimation. For the standard IV model, Ramsahai and Lauritzen (2011) discuss a proper way of testing such models in a frequentist sense.

Our models can be considerably more complicated. Recall that constraints will depend on the extreme points of the {ζ*yx.w} parameters. As implied by (6) and (7), extreme points will be functions of ζyx.w. Writing the constraints fully in terms of the observed distribution will reveal non-linear relationships. We approach the problem in a Bayesian way. We will assume first that the dimensionality of Z is modest (say, 10 or less), as this is the case in most applications of faithfulness to causal discovery. We parameterize ζ^z_yxw ≡ P(Y = y, X = x, W = w | Z = z) as a full 2 × 2 × 2 contingency table⁹. In the context of the linear programming problem of the previous section, for a given z, we have ζyx.w = ζyxw/P(W = w), P(W = w) = Σyx ζyxw. Given that the dimensionality of the problem is modest, we assign to each three-variate distribution P(Y, X, W | Z = z) an independent Dirichlet prior for every possible assignment of Z, constrained by the inequalities implied by the model (and renormalized accordingly).
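A minimal sketch of the rejection-sampling step just described, for a single stratum of Z: draws from the unconstrained Dirichlet posterior are kept only if they satisfy the model's constraints. The feasibility check is passed in as a user-supplied function (for example, the box-constraint check of Section 5 or the full linear program); the function and argument names here are assumptions of this illustration.

```python
import numpy as np

def constrained_dirichlet_posterior(counts, is_feasible, n_samples=1000,
                                    prior=1.0, max_tries=200000, rng=None):
    """Sample from a Dirichlet posterior over the flattened 2x2x2 table
    P(Y, X, W | Z = z), keeping only draws that satisfy the relaxed model.
    `counts` is the 8-vector of cell counts for this stratum of Z."""
    rng = np.random.default_rng(rng)
    alpha = np.asarray(counts, dtype=float) + prior
    kept, tries = [], 0
    while len(kept) < n_samples and tries < max_tries:
        theta = rng.dirichlet(alpha)
        tries += 1
        if is_feasible(theta):
            kept.append(theta)
    rejection_rate = 1.0 - len(kept) / max(tries, 1)
    return np.array(kept), rejection_rate
```

The returned rejection rate is also the quantity used by the falsification rule of thumb discussed below: a very high rejection rate signals a bad fit of the constrained model.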
9. That is, we allow for dependence between W and Y given {X, Z}, interpreting the decision of independence used in Rule 1 as being only an indicator of approximate independence.

The posterior is also an 8-dimensional constrained Dirichlet distribution, where we use rejection sampling to obtain a posterior sample by proposing from the unconstrained Dirichlet. A Dirichlet prior is assigned to P(Z). Using a sample from the posterior of P(Z) and a sample (for each possible value z) from the posterior of P(Y, X, W | Z = z), we obtain a sample upper and lower bound for the ACE by just running the linear program for each sample of {ζ^z_yxw} and {P(Z = z)}.

The full algorithm is shown in Algorithm 2, where ℵ ≡ {ϵw, ϵx, ϵy, β, β̄}. The search procedure is left unspecified, as different existing approaches can be plugged into this step. See Entner et al. (2013) for a discussion. In Section 9 we deal with small dimensional problems only, using the brute-force approach of performing an exhaustive search for Z. In practice, brute force can still be valuable by using a method such as discrete PCA (Buntine and Jakulin, 2004) to reduce W\{W} to a small set of binary variables. To decide whether the premises in Rule 1 hold, we merely perform Bayesian model selection with the BDeu score (Buntine, 1991) between the full graph {W → X, W → Y, X → Y} (conditional on Z) and the graph with the edge W → Y removed.

Step 5 in Algorithm 2 is a falsification test. Since the data might provide a bad fit to the constraints entailed by the model¹⁰, we opt not to accept every pair (W, Z) that passes Rule 1. One possibility is to calculate the posterior distribution of the model where constraints are enforced, and compare it against the posterior of the saturated model given by the unconstrained contingency table. This requires another prior over the constraint hypothesis and the calculation of the corresponding marginal likelihoods. As an alternative approach, we adopt the pragmatic rule of thumb suggested by Richardson et al. (2011): sample M samples from the {ζ^z_yxw} posterior given the unconstrained model, and check the proportion of values that are rejected. If more than 95% of them are rejected, we take this as an indication that the proposed model provides a bad fit and reject the given choice of (W, Z).

The final result provides a set of posterior distributions over bounds, possibly contradictory, which should be summarized as appropriate. One possibility is to check the union of all intervals or, as a simpler alternative, report the lowest of the lower bound estimates and the highest of the upper bound estimates using a point estimate for each bound:

1. for each (W, Z) in R, calculate the posterior expected value of the lower and upper bounds¹¹;
2. report the interval L ≤ ACE ≤ U, where L is the minimum of the lower bounds and U the maximum of the upper bounds.

In our experiments, we use a different summary. As we calculate the log-marginal posteriors M1, M2, M3, M4 for the hypotheses W ⊥̸ Y | Z, W ⊥ Y | Z, W ⊥ Y | Z ∪ {X} and W ⊥̸ Y | Z ∪ {X}, respectively, we use the score

(M1 − M2) + (M3 − M4)   (11)

to assess the quality of the bounds obtained with the corresponding witness-admissible set pair.

10. This is a result of not enforcing W ⊥ Y | Z ∪ {X} in ζ^z_yxw.

11. Alternatively to using the expected posterior estimator for the lower/upper bounds, one can, for instance, report the 0.025 quantile of the marginal lower bound distribution and the 0.975 quantile of the marginal upper bound distribution.
Notice, however, that this does not give a 0.95 credible interval over ACE intervals, as the lower bound and the upper bound are dependent in the posterior.

M1 − M2 and M3 − M4 are the log-posterior odds for the models required by the premises of Rule 1 against the respective alternatives. Just reporting the posterior of each (W, Z) model to rank witness-admissible set pairs would not be entirely appropriate, as we are comparing models for different random variables. Adding the log-posteriors, instead of a different combination, is done for the sake of simplicity, contrasted to the idea of comparing the posterior of W → X → Y against the other seven combinations of edges involving {W, X, Y}. Given the score, we then report the corresponding top-scoring interval and evaluation metric based on this criterion. The reason for reporting a single representative interval is our belief that it is more productive to accommodate most of the (possibly large) discrepancies among estimated ACE intervals in the selection of ℵ, as done in Section 8. By selecting a single baseline with a unique lower/upper bound pair, it is simpler to communicate uncertainty, as we can then provide credible intervals or full distributions for the selected lower/upper bounds¹².

4.3 A Note on Weak Dependencies

As we briefly mentioned in the previous section, our parameterization {ζ^z_yxw} does not enforce the independence condition W ⊥ Y | Z ∪ {X} required by Rule 1. Our general goal is to let WPP accept near independencies, in which the symbol ⊥ in practice denotes weak dependence¹³. We do not define what a weak dependence should mean, except for the general guideline that some agreed measure of conditional association should be small. Our pragmatic view of WPP is that Rule 1, when supported by weak dependencies, should be used as a motivation for the constraints in Section 4.1. That is, one makes the assumption that weak dependencies are not generated by arbitrary near-path cancellations, reflecting the belief that very weak associations should correspond to weak direct causal effects (and, where this is unacceptable, WPP should either be adapted to exclude relevant cases, or not be used). At the same time, users of WPP do not need to accept this view, as the method does not change under the usual interpretation of ⊥.

However, it is worthwhile to point out that computational gains can be obtained by using a parameterization that encodes the independence: that is, if we change our parameterization to enforce the independence constraint W ⊥ Y | Z ∪ {X}, then there is no need to perform the check in line 5 of Algorithm 2, as the model is compatible with the (conditional on Z) chain W → X → Y regardless of ℵ. One can then generate full posterior distribution bounds only for those witness-admissible sets for which uncertainty estimates are required. The validity of point estimates of a bound is guaranteed by running a single linear program on a point estimate of the distribution implied by the Bayesian network W → X → Y (for every instance of Z), as no further constraints in the observable distribution will exist.

12. One should not confuse credible intervals with ACE intervals, as these are two separate concepts: each lower or upper bound is a function of the unknown P(W, X, Y, Z) and needs to be estimated. There is posterior uncertainty over each lower/upper bound, as in any problem where a functional of a distribution needs to be estimated.
So the posterior distribution and the corresponding credible intervals over the ACE intervals are perfectly well-defined, as in any standard Bayesian inference problem.

13. The procedure that decides conditional independencies in Section 4.2 is a method for testing exact independencies, although the prior on the independence assumption regulates how strong the evidence in the data should be for independence to be accepted.

[Figure 3, panels (a)-(c): histograms titled "ACE distribution, mean = 0.29", "ACE distribution, mean = 0.05" and "ACE distribution, mean = 0.07"; horizontal axis labelled "effect".]

Figure 3: Posterior over the ACE obtained by three different priors conditioned on a synthetic data set of size 1,000,000. Posterior computed by running 1,000,000 iterations of Gibbs sampling. The (independent) priors for θ^Y_1.xu and θ^X_x.wu are Beta(α, α), while θ^U_u is given a Dirichlet(α, α, α, α). We set α = 0.1, 1, 10 for the cases shown in (a), (b) and (c), respectively. The vertical red line shows the true ACE, while the population IV bounds are shown with gray lines. As the prior gets less informative (moving from (c) to (a)), the erratic shape of the posterior distribution also shows the effect of bad Gibbs sampling mixing. Even with a very large data set, the concentration of the posterior is highly dependent on the concentration of the prior.

That is, if one wants to enforce the independence constraints, Line 4 of Algorithm 2 can be modified to directly generate just point estimates of the bounds for any witness-admissible set pair where a full posterior distribution is not required, and the falsification test in Step 5 (with the costly polytope construction) can also be skipped.

4.4 A Note on Unidentifiability

An alternative to bounding the ACE or using back-door adjustments is to put priors directly on the latent variable model for {W, X, Y, U}. Using the standard IV model as an example, we can define parameters θ^Y_y.xu ≡ P(Y = y | X = x, U = u), θ^X_x.wu ≡ P(X = x | W = w, U = u) and θ^U_u ≡ P(U = u), on which priors are imposed¹⁴. No complicated procedure for generating constraints in the observable marginal is necessary, and the approach provides point estimates of the ACE instead of bounds. This sounds too good to be true, and indeed it is: results strongly depend on the prior, regardless of sample size.

To illustrate this, consider a simulation from a standard IV model (Figure 1(d)), with Z = ∅ and U an unobservable discrete variable with 4 levels. We generated a model by setting P(W = w) = 0.5 and sampling parameters θ^Y_1.xu and θ^X_1.wu from the uniform [0, 1] distribution, while the 4-dimensional vector θ^U_u comes from a Dirichlet(1, 1, 1, 1). The resulting model had an ACE of 0.20, with a wide IV interval [−0.50, 0.38] as given by the method of Balke and Pearl (1997). Narrower intervals can only be obtained by making more assumptions: there is no free lunch. However, as in this case where WPP cannot identify any witness, one might put priors on the latent variable model to get a point estimate, such as the posterior expected value of the ACE.

14. P(W = w) is not necessary, as the standard IV bounds (Balke and Pearl, 1997) do not depend on it.
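To make the simulation setup concrete, the sketch below draws one such model and computes its true ACE. The seed is arbitrary and the resulting numbers will not reproduce the figures quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Standard IV structure W -> X -> Y with a 4-level latent confounder U of X and Y.
n_u = 4
p_u = rng.dirichlet(np.ones(n_u))       # theta^U_u ~ Dirichlet(1, 1, 1, 1)
p_x1_wu = rng.uniform(size=(2, n_u))    # theta^X_1.wu = P(X = 1 | W = w, U = u)
p_y1_xu = rng.uniform(size=(2, n_u))    # theta^Y_1.xu = P(Y = 1 | X = x, U = u)
p_w1 = 0.5

# True ACE: E[Y | do(X = 1)] - E[Y | do(X = 0)] = sum_u (theta^Y_1.1u - theta^Y_1.0u) P(u)
true_ace = float(np.dot(p_y1_xu[1] - p_y1_xu[0], p_u))
print("true ACE:", true_ace)
```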
To illustrate the pitfalls of this approach, we perform Bayesian inference by putting priors directly on the CPT parameters of the latent variable model, assuming we know the correct number of levels of U. Figure 3 shows some results with a few different choices of priors. The sample size is large enough that the posterior is essentially entirely within the population bounds and the estimation of P(W, X, Y, Z) is itself nearly exact. The posterior over the ACE covers a much narrower area than the IV interval, and its behavior is erratic. This is not to say that informative priors on a latent variable model cannot produce important results. For instance, Steenland and Greenland (2004) discuss how empirical priors on smoking habits among blue-collar workers were used in their epidemiological question: the causal effect of the occupational hazard of silica exposure on lung cancer incidence among industrial sand workers. Smoking is a confounding factor given the evidence that smoking and occupation are associated. The issue was that smoking was unrecorded among the workers, and so priors on the latent variable relationship to the observables were necessary. Notice, however, that this informative prior is essentially a way of performing a back-door adjustment when the adjustment set Z and treatment-outcome pair {X, Y} are not simultaneously measured within the same subjects. When latent variables are "unknown unknowns", a prior on P(Y | X, U) may be hard to justify. Richardson et al. (2011) discuss further issues with priors over latent variable models as a way of obtaining ACE point estimates, one alternative being the separation of identifiable and unidentifiable parameters to make transparent the effect of prior (mis)specification.

5. Algebraic Bounds and the Back-substitution Algorithm

Posterior sampling is expensive within the context of Bayesian WPP: constructing the dual polytope for possibly millions of instantiations of the problem is time consuming, even if each problem is small. Moreover, the numerical procedure described in Section 4 does not provide any insight into how the different free parameters {ϵw, ϵx, ϵy, β, β̄} interact to produce bounds, unlike the analytical bounds available in the standard IV case. Ramsahai (2012) derives analytical bounds under (5) given a fixed, numerical value of ϵw. We know of no previous analytical bounds as an algebraic function of ϵw.

5.1 Algebraic Bounds

We derive a set of bounds whose validity is proved by three theorems. The first theorem derives separate upper and lower bounds on ωxw using all the assumptions except Equation (5); this means constraints which do not link distributions under different values of W = w. The second theorem derives linear constraints on {ωxw} using (5) and more elementary constraints. Our final result constructs less straightforward bounds, again using Equation (5) as the main assumption. As before, assume we are implicitly conditioning on some Z = z everywhere.

We introduce the notation

L^YU_xw ≡ max(P(Y = 1 | X = x, W = w) − ϵy, 0)
U^YU_xw ≡ min(P(Y = 1 | X = x, W = w) + ϵy, 1)
L^XU_w ≡ max(P(X = 1 | W = w) − ϵx, 0)
U^XU_w ≡ min(P(X = 1 | W = w) + ϵx, 1)

and define L ≡ min{L^YU_xw} and U ≡ max{U^YU_xw}. Moreover, some further redundant notation is used to simplify the description of the constraints:

δ*1.w ≡ δ*w,  δ*0.w ≡ 1 − δ*w,
L^XU_1w ≡ L^XU_w,  L^XU_0w ≡ 1 − U^XU_w,  U^XU_1w ≡ U^XU_w,  U^XU_0w ≡ 1 − L^XU_w,

and, following Ramsahai (2012), for any x ∈ {0, 1}, we define x′ as the complementary binary value (i.e. x′ = 1 − x).
The same convention applies to pairs {w, w′}. Finally, define χx.w ≡ ΣU P(X = x | W = w, U) P(U) = κ1x.w + κ0x.w.

Theorem 2. The following constraints are entailed by the assumptions expressed in Equations (6), (7) and (8):

ωxw ≤ κ1x.w + U^YU_xw (κ0x′.w + κ1x′.w),  ωxw ≤ κ1x.w / L^XU_xw,  ωxw ≤ 1 − κ0x.w / U^XU_xw   (12)
ωxw ≥ κ1x.w + L^YU_xw (κ0x′.w + κ1x′.w),  ωxw ≥ κ1x.w / U^XU_xw,  ωxw ≥ 1 − κ0x.w / L^XU_xw   (13)

Theorem 3. The following constraints are entailed by the assumptions expressed in Equations (5), (6), (7) and (8):

ωxw ≤ (κ1x.w′ + ϵw(κ0x.w′ + κ1x.w′)) / L^XU_xw′,  ωxw ≤ 1 − (κ0x.w′ − ϵw(κ0x.w′ + κ1x.w′)) / U^XU_xw′   (14)
ωxw ≥ (κ1x.w′ − ϵw(κ0x.w′ + κ1x.w′)) / U^XU_xw′,  ωxw ≥ 1 − (κ0x.w′ + ϵw(κ0x.w′ + κ1x.w′)) / L^XU_xw′   (15)

ωxw − ωxw′ U^XU_x′w ≤ κ1x.w + ϵw(κ0x′.w + κ1x′.w)
ωxw − ωxw′ L^XU_x′w ≥ κ1x.w − ϵw(κ0x′.w + κ1x′.w)
ωxw − ωxw′ U^XU_x′w ≥ 1 − κ0x.w − U^XU_x′w − ϵw(κ0x′.w + κ1x′.w)
ωxw − ωxw′ L^XU_x′w ≤ 1 − κ0x.w − L^XU_x′w + ϵw(κ0x′.w + κ1x′.w)
ωxw − ωxw′ ≤ ϵw
ωxw − ωxw′ ≥ −ϵw   (16)

Theorem 4. The following constraints are entailed by the assumptions expressed in Equations (5), (6), (7) and (8):

κ1x .w + κ1x.w + κ1x.w κ1x .w + χx w(U + L + 2ϵw) L
κ1x .w + κ1x.w + κ1x.w κ1x .w + 2χx wϵw + χx w (U + L) L   (17)
κ1x .w + κ1x.w + κ1x .w + κ1x.w + χx w (U + L) 2ϵwχx w U
κ1x .w + κ1x.w + κ1x .w + κ1x.w χx w(2ϵw U L) U   (18)
ωxw + ωx w ωx w κ1x .w + κ1x.w κ1x .w + κ1x.w χxw (U + L + 2ϵw) + L
ωxw + ωx w ωx w κ1x .w + κ1x.w κ1x .w + κ1x.w 2χxw ϵw χxw(U + L) + L
ωxw + ωx w ωx w κ1x .w + κ1x.w + κ1x .w + κ1x.w χxw(U + L) + 2ϵwχxw + U
ωxw + ωx w ωx w κ1x .w + κ1x.w + κ1x .w + κ1x.w + χxw (2ϵw U L) + U   (19)

Although at first such relations seem considerably more complex than those given by Ramsahai (2012), on closer inspection they illustrate qualitative aspects of our free parameters. For instance, consider ωxw ≥ κ1x.w + L^YU_xw (κ0x′.w + κ1x′.w), one of the instances of (13). If ϵy = 1 and β = β̄ = 1, then L^YU_xw = 0 and this relation collapses to ηxw ≥ ζ1x.w, one of the original relations found by Balke and Pearl (1997) for the standard IV model. Decreasing ϵy will linearly increase L^YU_xw only after ϵy ≤ P(Y = 1 | X = x, W = w), tightening the corresponding lower bound given by this equation. Consider now ωxw ≤ 1 − (κ0x.w′ − ϵw(κ0x.w′ + κ1x.w′)) / U^XU_xw′. If also ϵw = 0 and ϵx = 1, from this inequality it follows that ηxw ≤ 1 − ζ0x.w′. This is another of the standard IV inequalities (Balke and Pearl, 1997). Equation (5) implies |ωx′w − ωx′w′| ≤ ϵw, and as such by setting ϵw = 0 we have that

ωxw + ωx w ωx w κ1x .w + κ1x.w κ1x .w + κ1x.w χxw (U + L + 2ϵw) + L   (20)

implies ηxw η1x.w + η1x.w η1x .w η0x.w , one of the most complex relationships in Balke and Pearl (1997). Further geometric intuition about the structure of the binary standard IV model is given by Richardson and Robins (2010).

These bounds are not tight, in the sense that we opt not to fully exploit all possible algebraic combinations for some results, such as (20): there we use L ≤ η*xw ≤ U and 0 ≤ δ*w ≤ 1 instead of all possible combinations resulting from (6) and (7). The proof idea in Appendix A can be further refined, at the expense of clarity. Because our derivation is a further relaxation, our final bounds are more conservative (that is, looser).

5.2 Efficient Optimization and Falsification Tests

Besides providing insight into the structure of the problem, the algebraic bounds give an efficient way of checking whether a proposed parameter vector {ζyxw} is valid in Step 5 of Algorithm 2, as well as of finding the ACE bounds: we can now use back-substitution on the symbolic set of constraints to find box constraints Lxw ≤ ωxw ≤ Uxw.
The proposed parameter will be rejected whenever an upper bound is smaller than a lower bound, and (4) can be trivially optimized conditioning only on the box constraints; this is yet another relaxation, added on top of the ones used to generate the algebraic inequalities. We initialize by intersecting all algebraic box constraints (of which (12) and (14) are examples); next we refine these by scanning relations ωxw ≤ a ωxw′ + c (the family given by (16)) in lexicographical order, and tightening the bounds on ωxw using the current upper and lower bounds on ωxw′ where possible. We then identify constraints Lxww′ ≤ ωxw − ωxw′ ≤ Uxww′, starting from −ϵw ≤ ωxw − ωxw′ ≤ ϵw and the existing bounds, and plug them into relations ωxw + ωx′w − ωx′w′ ≤ c (as exemplified by (20)) to get refined bounds on ωxw as functions of (Lx′ww′, Ux′ww′). We iterate this until convergence, which is guaranteed since lower/upper bounds never decrease/increase at any iteration. This back-substitution of inequalities follows the spirit of message passing, in the sense that we iteratively update quantities of interest (intervals bounding the decision variables of the linear program) based on a small subset of other quantities, and it can be orders of magnitude more efficient than the fully numerical solution, while not increasing the width of the intervals by too much. In Section 9, we provide evidence for this claim. The back-substitution method is used in our experiments, combined with the fully numerical linear programming approach as explained in Section 9. The full method is given in Algorithm 3.

6. Linear Models

Entner et al. (2012) present a variant of their methodology for linear non-Gaussian models. The main difference is that in this case no witness variable W is necessary: it is possible to validate Z as an admissible set from a regression of X on Z and of Y on {X, Z}. Faithfulness is not necessary under some non-Gaussianity assumptions, although not all of these assumptions are testable without faithfulness, and assumptions of parameter stability are still necessary for constraints other than independence constraints.

In this section, we adapt WPP to linear models. Vanishing partial correlations are used in the premise of Rule 1 instead of independence constraints, even if variables are non-Gaussian. The computation of the ACE bounds is vastly simplified. It complements Entner et al. (2012) in the sense that, although the method does not provide point estimates of the ACE and it might fail to identify some admissible sets, it does not require either faithfulness or non-Gaussianity¹⁵. Only second-order moments are necessary in the construction of the bound, although nonparametric linear models for testing partial correlations or sampling from the posterior distribution of covariance matrices might be necessary.

15. Even if variables are clearly non-Gaussian, the residuals of their regression on the admissible set might be close to Gaussian; this is, after all, the motivation for Gaussian likelihoods in most regression models, parametric or not.
input: Distributions {ζyx.w} and {P(W = w)};
output: Lower and upper bounds (Lxw, Uxw) for every ωxw
1  Find the tightest lower and upper bounds (Lxw, Uxw) for each ωxw using inequalities (12), (13), (14), (15), (17) and (18);
2  Let L^ϵw_xw and U^ϵw_xw be lower/upper bounds on ωxw − ωxw′;
3  for each pair (x, w) ∈ {0, 1}² do
4    L^ϵw_xw ← −ϵw;
5    U^ϵw_xw ← ϵw;
7  while TRUE do
8    for each relation ωxw ≤ b ωxw′ + c in (16) do
9      U^ϵw_xw ← min{U^ϵw_xw, (b − 1)Lxw′ + c}
11   for each relation ωxw ≥ b ωxw′ + c in (16) do
12     L^ϵw_xw ← max{L^ϵw_xw, (b − 1)Uxw′ + c}
14   for each relation ωxw + ωx′w − ωx′w′ ≤ c in (19) do
15     Uxw ← min{Uxw, c − L^ϵw_x′w}
17   for each relation ωxw − (ωx′w − ωx′w′) ≤ c in (19) do
18     Uxw ← min{Uxw, c + U^ϵw_x′w}
20   for each relation ωxw + ωx′w − ωx′w′ ≥ c in (19) do
21     Lxw ← max{Lxw, c − U^ϵw_x′w}
23   for each relation ωxw − (ωx′w − ωx′w′) ≥ c in (19) do
24     Lxw ← max{Lxw, c + L^ϵw_x′w}
26   if no changes in {(Lxw, Uxw)} then break
30 return (Lxw, Uxw) for each (x, w) ∈ {0, 1}²

Algorithm 3: The iterative back-substitution procedure for bounding Lxw ≤ ωxw ≤ Uxw for all combinations of x and w in {0, 1}².

6.1 A Bounding Procedure for Linear Models

Consider for now the linear model case with an empty admissible set Z:

X = a W + Ux
Y = b X + c W + Uy,   (21)

where {W, Ux, Uy} are assumed to be zero-mean variables, and {Ux, Uy} are unobservable. The case with non-empty Z is analogous and is discussed in the next section. The ACE is given by b. We denote by sww, swx, swy, ... the corresponding variances/covariances of {W, Ux, Uy}. Moreover, let the variances of {W, Ux, Uy} be set such that each element of {W, X, Y} has unit variance, and denote by ρwx, ρwy, ρxy the corresponding correlations of {W, X, Y}. Notice that sww = 1, and no assumptions about Gaussianity are being made. As before, we assume for now that ρwx, ρwy, ρxy are known constants, and we would like to bound b as a function of this observable correlation matrix. The implied correlation matrix of model (21) needs to match the observable correlation matrix:

ρwx = a + swx   (22)
ρwy = bρwx + c + swy   (23)
ρxx = 1 = a² + 2a swx + sxx   (24)
ρxy = b + cρwx + a swy + sxy   (25)
ρyy = 1 = b² + 2bcρwx + c² + syy + 2[b(a swy + sxy) + c swy],   (26)

where the above identities follow directly from (21). The feasible values of the parameters are given by the intersection of the above and

−ϵc ≤ c ≤ ϵc   (27)
−ϵwx ≤ swx ≤ ϵwx,  −ϵwy ≤ swy ≤ ϵwy,  −ϵxy ≤ sxy ≤ ϵxy   (28)
0 ≤ sxx ≤ 1,  0 ≤ syy ≤ 1.   (29)

We ignore the positive semidefiniteness requirement of the covariance matrix of {W, Ux, Uy} for simplicity. The set of constraints can be simplified as follows:

Theorem 5. Assume¹⁶ ρwy = ρwx ρxy. If an assignment of values to {a, b, c, swx, swy, sxy, sxx, syy} satisfies (27)-(29), then it satisfies (22)-(26) if and only if it satisfies the following:

ρwy = bρwx + c + swy   (30)
ρxy − ϵxy − Ua swy ≤ b + cρwx ≤ ρxy + ϵxy − La swy,  if swy ≥ 0
ρxy − ϵxy − La swy ≤ b + cρwx ≤ ρxy + ϵxy − Ua swy,  if swy < 0   (31)
b² + 2bcρwx + c² − 2(bρxy + cρwy) ≤ 0   (32)

where La ≡ max(min(0, 2ρwx), ρwx − ϵwx) and Ua ≡ min(max(0, 2ρwx), ρwx + ϵwx).

This means that optimizing b subject to constraints (27)-(32) is a convex program on {b, c, swy} (conditioned on the sign of swy). Notice that, because of the assumption ρwy = ρwx ρxy, the system is always satisfiable. It can nevertheless rule out some of the possible values of b (e.g. b = ρxy = ρwy/ρwx if ϵxy = ϵwy = 0 or ϵc = ϵwy = 0). Given {ρwx, ρxy} (and setting ρwy = ρwx ρxy), we can find an upper bound for the ACE by maximizing b under the constraints (27)-(32) and 0 ≤ swy ≤ ϵwy, followed by maximization under the condition −ϵwy ≤ swy ≤ 0. The upper bound is the maximum of the two conditional maxima. The lower bound is derived in an analogous way.

16. This assumption can be dropped, but the proof of Theorem 5 gets more complicated.
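As a rough illustration of the bounding procedure of this subsection, the sketch below scans a grid over (b, c), uses (30) to eliminate swy, and keeps the values of b for which constraints (27)-(32) can be satisfied. It is only meant to convey the structure of the problem: a proper implementation would solve the two convex programs described above, and the grid ranges and function name are ad hoc choices of this example, not from the paper.

```python
import numpy as np

def linear_ace_bounds(rho_wx, rho_xy, eps_c, eps_wy, eps_xy, eps_wx, grid=201):
    """Illustrative grid search for bounds on the causal coefficient b in the
    linear model of Section 6.1, under rho_wy = rho_wx * rho_xy."""
    rho_wy = rho_wx * rho_xy
    L_a = max(min(0.0, 2 * rho_wx), rho_wx - eps_wx)
    U_a = min(max(0.0, 2 * rho_wx), rho_wx + eps_wx)
    feasible_b = []
    for b in np.linspace(-1.5, 1.5, grid):
        for c in np.linspace(-eps_c, eps_c, grid):
            s_wy = rho_wy - b * rho_wx - c                    # from (30)
            if abs(s_wy) > eps_wy:
                continue
            lo_a, hi_a = (U_a, L_a) if s_wy >= 0 else (L_a, U_a)
            if not (rho_xy - eps_xy - lo_a * s_wy <= b + c * rho_wx
                    <= rho_xy + eps_xy - hi_a * s_wy):        # constraint (31)
                continue
            if b * b + 2 * b * c * rho_wx + c * c - 2 * (b * rho_xy + c * rho_wy) > 0:
                continue                                      # constraint (32)
            feasible_b.append(b)
            break
    return (min(feasible_b), max(feasible_b)) if feasible_b else (None, None)
```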
The upper bound is the maximum of the two conditional maxima. The lower bound is derived in an analogous way.

16. This assumption can be dropped, but the proof of Theorem 5 gets more complicated.

6.2 Algorithm for Gaussian Copula Models

One general model family in which vanishing partial correlations are closely connected to independence is the Gaussian copula (Elidan, 2013; Nelsen, 2007). Consider the following generative model:

V* ~ N(0, R),    Vi = Fi⁻¹(Φ(V*i)),  i = 1, 2, . . . , p,   (33)

where V* is a p-dimensional random vector generated according to a Gaussian with p × p correlation matrix R, Fi(·) is some arbitrary cumulative distribution function (CDF), and Φ(·) is the standard Gaussian CDF. In continuous distributions, Fi(·) is invertible, and the Markov properties of V* are preserved in the distribution of V. See Harris and Drton (2013) for a discussion of Gaussian copula models in the context of causal inference, in particular for structure learning using the PC algorithm (Spirtes et al., 2000). Causal structure and effects are defined for V* as in a typical linear causal system. Conditional independencies can be tested by copula-based measures, such as Spearman's rank correlation, by testing for the corresponding vanishing partial correlations.

Given a target treatment X ∈ V and outcome Y ∈ V, we are interested in bounding the ACE of X* on Y*, the Gaussian variables underlying the possibly non-Gaussian X and Y. For simplicity, we search for admissible sets Z with corresponding witness W using a Gaussian copula correlation matrix estimate R̂. The unobserved data for V* is then, for simplicity, assumed to have zero empirical mean and empirical covariance matrix R̂. We score models entailing independence of some V*i and V*j given V*Z by scoring two Gaussian networks, G1 ≡ {V*Z → V*i, V*Z → V*j} against G2 ≡ {V*Z → V*i, V*Z → V*j, V*i → V*j}. This is analogous to the binary case, where here we use the corrected BGe score (Kuipers et al., 2014). Notice this test is approximate, as R̂ is used as a surrogate for the empirical covariance matrix of the unobserved data V*, which is required by BGe17.

For a given (W, Z) accepted by Rule 1, we calculate the empirical residual (rank) correlation matrix obtained by regressing W* on Z*, X* on W* and Z*, and Y* on X* and Z*, so that the partial (residual) correlation of W* and Y* given X* and Z* is zero. Regression of subsets of V* on other subsets is done by standard regression using R̂: let {σ̂ww.z, σ̂xx.z, σ̂yy.z} be the residual variances of the regression of {W*, X*, Y*} on Z* as defined by R̂. The resulting residual covariance matrix is scaled to unit variance, and the method in Section 6.1 is used to generate scaled bounds Lbs ≤ b_standardized ≤ Ubs, which are converted into bounds on the ACE in the original scale as [√(σ̂xx σ̂yy) Lbs, √(σ̂xx σ̂yy) Ubs]. The algorithm is basically the same as Algorithm 2, except we report only point estimates for the bounds instead of posteriors, and no falsification step is necessary (Step 5 of Algorithm 2), as the model cannot be falsified given the accepted conditional independence.

17. Alternatively, one could test for the corresponding vanishing partial correlations in the empirical Spearman rank correlation matrix, as suggested by Harris and Drton (2013), at a particular significance level α. However, this only provides p-values, which are not ideal for sorting witness/admissible sets by a score, as p-values measure only the surprise of seeing the observed data under a constraint. This does not measure strength of dependence, nor does it provide a posterior over models.
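The following minimal R sketch illustrates the two computational steps just described, under the assumptions stated in the lead-in comments: residual correlations of {W, X, Y} given Z are obtained from a copula correlation matrix estimate by a Schur complement, and the feasible values of the standardized coefficient b are then found by a simple grid feasibility check against the constraints of Theorem 5. The function names, the grid resolutions, and the use of a grid rather than a convex solver are illustrative choices, not the paper's implementation.

```r
# Sketch of Sections 6.1-6.2: residual correlations given Z, then bounds on b.
residual_corr <- function(Rhat, w, x, y, z) {
  idx <- c(w, x, y)
  S <- if (length(z) == 0) Rhat[idx, idx] else
    Rhat[idx, idx] - Rhat[idx, z, drop = FALSE] %*%
      solve(Rhat[z, z]) %*% Rhat[z, idx, drop = FALSE]
  D <- diag(1 / sqrt(diag(S)))
  list(corr = D %*% S %*% D,             # residual correlations, unit variance
       var_x = S[2, 2], var_y = S[3, 3]) # residual variances for rescaling
}

linear_bounds <- function(rho_wx, rho_xy, eps_c, eps_wx, eps_wy, eps_xy,
                          b_grid = seq(-1, 1, by = 0.005)) {
  rho_wy <- rho_wx * rho_xy                       # vanishing partial correlation premise
  La <- max(min(0, 2 * rho_wx), rho_wx - eps_wx)  # bounds on a (Theorem 5)
  Ua <- min(max(0, 2 * rho_wx), rho_wx + eps_wx)
  feasible <- c()
  for (b in b_grid) for (cc in seq(-eps_c, eps_c, length.out = 21)) {
    swy <- rho_wy - b * rho_wx - cc               # constraint (30)
    if (abs(swy) > eps_wy) next
    lhs <- b + cc * rho_wx
    ok31 <- if (swy >= 0)
      lhs >= rho_xy - eps_xy - Ua * swy && lhs <= rho_xy + eps_xy - La * swy
    else
      lhs >= rho_xy - eps_xy - La * swy && lhs <= rho_xy + eps_xy - Ua * swy
    ok32 <- (b^2 + 2 * b * cc * rho_wx + cc^2 - 2 * (b * rho_xy + cc * rho_wy)) <= 0
    if (ok31 && ok32) feasible <- c(feasible, b)
  }
  if (length(feasible) == 0) return(c(lower = NA, upper = NA))
  c(lower = min(feasible), upper = max(feasible))
}
```

In this sketch, the standardized bounds returned by `linear_bounds` would then be multiplied by the square root of the product of the residual variances returned by `residual_corr`, mirroring the rescaling step described above.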
A fully Bayesian version of this approach is conceptually simple, although nonparametric modeling of {Fi(·)} (the so-called nonparanormal model) might require Markov chain Monte Carlo methods, and computing marginal likelihoods is computationally very intensive.

7. Zero Effects

Under faithfulness, the premise of Rule 1 will not hold in a system where X is not a cause of Y. The result is that no conclusion about the ACE can be drawn, but identifiability can still be achieved by other means. Rule 2 (Entner et al., 2013) covers all identifiable cases where X is not a cause of Y:

Rule 2a: If there exists a set Z ⊆ W such that X and Y are independent given Z, then infer an ACE of zero.

Rule 2b: If there exists a variable W ∈ W and a set Z ⊆ W\{W} such that: (i) W and X are dependent given Z; (ii) W and Y are independent given Z, then infer an ACE of zero.

Rule 2a is a direct application of faithfulness, while Rule 2b essentially corresponds to the unshielded collider check in the FCI algorithm of Spirtes et al. (2000). Figure 4 illustrates the paths that can be weakened under Rules 2a and 2b, excluding any a priori restrictions on X → Y, since this is the relation that we want to bound given conditions on other paths. It is clear that for 2a not much of interest can be said beyond this: any association that cancels the causal effect of X on Y should be due to a corresponding association generated by unmeasured confounders. If such a strong contribution from confounders does not exist, then we should not expect the ACE to be strong18.

Rule 2b seems more interesting. However, as suggested by Figure 4(b), we are enforcing fewer structural constraints in the linear program compared to Rule 1, as there is nothing in Rule 2b that motivates weak confounding effects. The reason is that Rule 2b concerns the removal of X → Y, which corresponds to the effect we want to bound as a consequence of assumptions elsewhere (instead of assuming a priori, say, |ACE| ≤ ϵ′ for some new hyperparameter ϵ′). One possibility is that, for a pair (W, Z) that satisfies Rule 2b, we perform the standard WPP bounding with ϵx = ϵy = 1 and, if desired, the added constraint |ACE| ≤ ϵ′ assumed given the firing of Rule 2b. Another possibility is to exploit Rule 2 to learn something about the possible effects of W on Y: in this case, we condition on constraints |η*_{0w} − η*_{1w}| ≤ ϵ to derive bounds on the direct effect of W on Y (Cai et al., 2008). In the context of our ACE problem, it might suggest information about ϵw that can be reused for another suggested pair (W′, Z′), but this will require further assumptions or tests, as the differences between Z and Z′ make this transfer of information non-trivial. For the rest of the paper, we will ignore the use of Rule 2 for simplicity. In the next section, however, we will consider the implications of having different pairs of witness/admissible set as a way of learning information about our hyperparameters.

18. This is not to say that such an observation has little scientific value. A similar statement is that a strong association between X and Y should be indicative of some causal effect, in the absence of a set of confounders that could fully explain this association. Simple as it is, this type of reasoning has long been explored in observational studies (Cornfield et al., 1959), and it is essentially what is behind Rosenbaum's sensitivity analysis methods (Rosenbaum, 2002a).
Our point is that the linear programming approach for this setup is trivial.

Figure 4: An illustration of conditional dependencies to be weakened under the acceptance of Rule 2. (a) Unmeasured confounding between X and Y is considered when these two variables are (conditionally) independent (given a possibly non-empty set Z). (b) (Conditional) observable independence of W and Y is used to suggest that W and Y have bounded dependence conditioned on U, as well as weak dependence between W and U. Notice that there is no weakening of the effects {U → X, U → Y}.

8. Choosing Relaxation Parameters

The free parameters ℵ ≡ {ϵw, ϵx, ϵy, β, β̄} do not have a unique, clear-cut, domain-free procedure by which they can be calibrated. However, as we briefly discussed in Section 4, it is useful to state explicitly the following simple guarantee of WPP:

Corollary 6 Given that W and Y are dependent given Z and independent given Z ∪ {X}, the WPP population bounds on the ACE will always include the back-door adjusted population ACE based on Z.

Proof The proof follows directly by plugging the quantities ϵw = ϵy = ϵx = 0, β = β̄ = 1 into the analytical bounds of Section 5.1, which gives the tightest bounds on the ACE (generalized to accommodate a background set Z): a single point, which is also the functional obtained by the back-door adjustment.

The implication is that, regardless of the choice of free parameters, the result is guaranteed to be more conservative than the one obtained using the faithfulness assumption. This does not mean that a judicious choice of relaxation parameters is of secondary importance. Ideally, domain knowledge should be used: given a witness and admissible set, an expert decides which relaxations are reasonable. This is domain dependent, and might not be easier than choosing an admissible set from background knowledge. As an alternative, this section covers more generic methods for choosing relaxation parameters. Two main approaches are discussed:

• ℵ is deduced from the outcome of a sensitivity analysis procedure: given a particular interval length L, we derive a quantification of the faithfulness violations (represented by ℵ) required to generate causal models compatible with the observational data and an interval of length L containing the ACE. This is covered in Section 8.1;

• exploit the multiplicity of solutions (pairs of candidate witness/admissible sets) usually provided by Rule 1 to learn about the extent of possible faithfulness violations. Combine the multiple solutions with constraints or prior distributions for ℵ to obtain estimates of the relaxation parameters. This is covered in Section 8.2.

8.1 Choice by Grid Search Conditioned on Acceptable Information Loss

One pragmatic default method is to first ask how wide an ACE interval can be so that the result is still useful for the goals of the analysis (e.g., sorting possible control variables X as candidates for a lab experiment based on lower bounds on the ACE). Let L be the interval width the analyst is willing to accept. Set ϵw = ϵx = ϵy = kϵ and β = c, β̄ = 1/c, for some pair (kϵ, c) such that 0 ≤ kϵ < 1 and 0 < c ≤ 1, and let (kϵ, c) vary over a grid of values. For each witness/admissible set candidate pair, pick the (kϵ, c) choice(s) entailing interval(s) of length closest to L. In case of more than one solution, summarize them by a criterion such as the union of the intervals. This methodology provides an explicit trade-off between the length of the interval and the tightness of the assumptions.
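The grid search just described is simple enough to sketch directly. In the minimal R sketch below, `wpp_interval` is a hypothetical stand-in for whatever routine computes the posterior expected WPP interval for a given witness/admissible set and relaxation setting; it is not part of the paper's code, and the grids shown match the ranges used later in Section 9.2.

```r
# A minimal sketch of the Section 8.1 grid search over (k_eps, c).
choose_relaxation <- function(dat, witness, zset, L_target, wpp_interval,
                              k_grid = seq(0.05, 0.30, by = 0.05),
                              c_grid = c(0.9, 1)) {
  settings <- expand.grid(k_eps = k_grid, c = c_grid)
  ints <- vector("list", nrow(settings))
  widths <- numeric(nrow(settings))
  for (i in seq_len(nrow(settings))) {
    k <- settings$k_eps[i]; cc <- settings$c[i]
    # eps_w = eps_x = eps_y = k, beta_low = cc, beta_up = 1/cc
    ints[[i]] <- wpp_interval(dat, witness, zset,
                              eps_w = k, eps_x = k, eps_y = k,
                              beta_low = cc, beta_up = 1 / cc)
    widths[i] <- ints[[i]][2] - ints[[i]][1]
  }
  best <- which(abs(widths - L_target) == min(abs(widths - L_target)))
  # if more than one setting ties, summarize by the union of the intervals
  c(lower = min(sapply(ints[best], `[`, 1)),
    upper = max(sapply(ints[best], `[`, 2)))
}
```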
Notice that, starting from the back-door adjusted point estimator of Entner et al. (2013), it is not obvious how this trade-off could be obtained: that is, how to build an interval around the back-door point estimate that can be interpreted as bounds under an acceptable amount of information loss. WPP provides a principled way of building such an interval, with the resulting assumptions on ℵ being explicitly revealed as a by-product. If the analyst believes that the resulting values of ℵ are not strict enough, and no substantive knowledge exists that allows particular parameters to be tightened up, then one either has to concede that wider intervals are necessary or find other means of identifying the ACE without the faithfulness assumption19. In the experiments in Section 9.2, we define a parameter space of kϵ ∈ {0.05, 0.10, . . . , 0.30} and c ∈ {0.9, 1}. More than one interval of approximately the same width is identified. For instance, the configurations (kϵ = 0.25, c = 1) and (kϵ = 0.05, c = 0.9) both produced intervals of approximately length 0.30.

8.2 Linking Selection on the Observables to Selection on the Unobservables

Observational studies cannot be carried out without making assumptions that are untestable given the data at hand. There will always be degrees of freedom that must be chosen, even if such choices are open to criticism. The game is to provide a language in which to express assumptions in as transparent a manner as possible. Our view on priors for the latent variable model (Section 4.4) is that such prior knowledge is far too difficult to justify when the interpretation of U is unclear. Moreover, putting a prior on a parameter such as P(Y = 1 | X = x, W = w, U = u), so that this prior is bounded by the constraint |P(Y = 1 | X = x, W = w, U = u) − P(Y = 1 | X = x, W = w)| ≤ ϵw, has no clear advantage over WPP: a specification of the shape of this prior is still necessary and may have undesirable side effects; it has no computational advantages, as constraints will now have to be dealt with within a Markov chain Monte Carlo procedure; it provides no insight on how constraints are related to one another (Section 5); it still suggests a point estimate that should not be trusted lightly, and posterior bounds which cannot be interpreted as data-driven bounds; and it still requires a choice of ϵw. That is not to say that subjective priors on the relationship between U and the observables cannot be exploited, but the level of abstraction at which they need to be specified should have advantages when compared to the latent variable model approach. For instance, Altonji et al. (2005) introduced a framework to deal with violations of the IV assumptions (in the context of linear models).

19. That is, expert knowledge should of course still be invoked to decide whether the resulting relaxation is plausible or not (and hence, whether the resulting interval is believable), although communication by sensitivity analysis might facilitate discussion and criticism of the study. In his rejoinder to the discussion of (Rosenbaum, 2002b), Rosenbaum points out that the sensitivity analysis procedure just states the logical outcome of the structural assumptions: the deviation of (say) P(Y = 1 | X = x, W = w) from P(Y = 1 | X = x, W = w, U = u) required to explain a given magnitude of variation of plausible ACEs is not imposed a priori by expert knowledge, but deduced.
Their main idea is to linearly decompose the (observational) dependence of W and Z, and the (causal) dependence of Y and Z, as two signal-plus-noise decompositions, and to assume that the dependence among the signals allows one to infer the dependence among the noise terms. In this linear case, the dependence among noise terms gives the association between W and Y through unmeasured confounders. The constraint given by the assumption can then be used to infer bounds on the (differential) ACE. The details are not straightforward, but the justification for the assumption is indirectly derived by assuming that Z is chosen by a sampling mechanism that picks covariates from the space of confounders U, so that |Z| and |U| are large. The core idea is that the dependence between the covariates which are observed (i.e., Z) and the other variables (W, X, Y) should tell us something about the impact of the unmeasured confounders. Their method is presented for linear models only, and the justification requires a very large |Z|.

We introduce a very different method inspired by the same general principle, but exploiting the special structure of our procedure. Instead of relying on linearity and a fixed set of covariates, consider the following postulate: the variability of back-door adjusted ACE estimators based on different admissible sets, as implied by Rule 1, should provide some information about the extent of faithfulness violations in the given domain. In what follows, let ℵ be simplified so that ℵ ≡ {ϵw, ϵxy, β̄}, where ϵxy ≡ ϵx = ϵy and β̄ = 1/β. The task then is to choose the three parameters in this set.

8.2.1 Method 1: Tightest ACE Coverage

Let {(W1, Z1), . . . , (Wk, Zk)} be the set of all pairs found by WPP. Let Ai be the ACE calculated by the back-door adjustment on Zi, which will be the true ACE if faithfulness holds (we assume for now that the joint distribution of the observables is known, so no statistical uncertainty plays a role yet). Let (Li(ℵ), Ui(ℵ)) be the corresponding lower and upper bounds implied by (Wi, Zi) and ℵ. Finally, let i* ∈ {1, 2, . . . , k} be the index of a witness/admissible set that will be our reference pair20 for the final bounds on the ACE, once we choose ℵ. The idea is simple: minimize (ϵw, ϵxy, β̄) subject to Aj ∈ [Li*(ℵ), Ui*(ℵ)] for 1 ≤ j ≤ k. This is a multi-objective minimization problem, of which we can return the Pareto frontier. Because this is a small-dimensional problem in which high precision is not needed, a simple grid search suffices, as performed in Section 9. Given that the Pareto frontier is likely to contain multiple points, we can report all intervals implied by each possible choice of ℵ (a small sketch of this search is given below).

20. The score in Equation (11) is used to pick i*.
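The following minimal R sketch illustrates the Tightest ACE Coverage search as a grid search followed by Pareto filtering. The routine `wpp_interval_at`, which is assumed to return the bounds [Li*(ℵ), Ui*(ℵ)] for the reference pair at a given relaxation setting, is a hypothetical stand-in, as are the grids and names.

```r
# Sketch of TAC (Section 8.2.1): feasible relaxation settings cover every
# back-door ACE A_1, ..., A_k; return the non-dominated (Pareto) settings.
tac_frontier <- function(A, wpp_interval_at,
                         eps_w_grid  = c(0.01, seq(0.05, 0.50, by = 0.05)),
                         eps_xy_grid = c(0.01, seq(0.05, 0.50, by = 0.05)),
                         beta_grid   = seq(1, 1.20, by = 0.05)) {
  grid <- expand.grid(eps_w = eps_w_grid, eps_xy = eps_xy_grid, beta = beta_grid)
  covers <- apply(grid, 1, function(g) {
    b <- wpp_interval_at(g["eps_w"], g["eps_xy"], g["beta"])
    all(A >= b[1] & A <= b[2])
  })
  feas <- grid[covers, ]
  if (nrow(feas) == 0) return(feas)
  # keep settings not dominated coordinate-wise by another feasible setting
  dominated <- sapply(seq_len(nrow(feas)), function(i) {
    any(sapply(seq_len(nrow(feas)), function(j) {
      j != i && all(feas[j, ] <= feas[i, ]) && any(feas[j, ] < feas[i, ])
    }))
  })
  feas[!dominated, ]
}
```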
Alternatively, we can provide a summary of the resulting bounds, such as the union of the intervals. The rationale is as follows: if faithfulness is true, then all Ai will collapse to the same value, which implies that all bounds will collapse to a single point. Differences among the Ai are the result of faithfulness violations, and we explain the contradictions via our ℵ parameters. Contradictions in constraints entailed by faithfulness have been exploited before to achieve robust causal inference, as in the Conservative PC algorithm of Ramsey et al. (2006). To the best of our knowledge, we provide here the first algorithm for accommodating faithfulness contradictions in a space of constraints other than conditional independence constraints among observables.

For the real case where the observable joint distribution needs to be estimated from data, one simple alternative is just to use empirical estimates of the unconstrained joint. In one sense, this provides a conservative choice of ℵ, as one could modify the constraints in the minimization of (ϵw, ϵxy, β̄) to require instead a less stringent criterion: that (say) the 95% credible interval for each Aj overlaps with credible intervals for [Li*(ℵ), Ui*(ℵ)]. Credible intervals can be obtained as a function of the posterior distribution of the parameters of the joint of {X, Y, W1, . . . , Wk} ∪ ⋃_{i=1}^{k} Zi, where a prior over the multivariate binary distributions is subject to the WPP constraints for (Wi*, Zi*) and the independence constraints for all other pairs, at each candidate value of ℵ. This is very costly, and better sampling procedures than the off-the-shelf rejection sampler would be necessary. We adopt the simple conservative approach with plug-in estimators instead.

8.2.2 Method 2: Bayesian Learning of Relaxation Parameters

A criticism of the Tightest ACE Coverage method is that it does not take into account the size of k: for k = 1, it will return ϵw = ϵxy = 0, for instance. Judgment is necessary on whether k is large enough to trust the results of this analysis. Alternatively, one may cast the problem of learning ℵ as yet another Bayesian learning problem, with {(W1, Z1), . . . , (Wk, Zk)} providing evidence for ℵ according to some reasonable definition of a likelihood function. In what follows, we again assume that the joint distribution of the observables is given, so that the back-door ACE functionals A1, A2, . . . , Ak given by Rule 1 are observable. As in the previous section, in our implementation we just plug in the empirical distribution of the observables, but more sophisticated approaches accounting for the uncertainty in this estimate can in principle be constructed.

The principle is: if we allow for many ways in which faithfulness violations might be detected by contradictory results, but contradictions are not found, then this should be evidence that faithfulness violations do not exist. If contradictions are small (i.e., ACEs implied by different back-door adjustments are close), faithfulness violations should be small. Our uncertainty should decrease as more opportunities for contradictions are allowed. In particular, we want the posterior to converge to the single values ϵw = 0, ϵxy = 0 and β̄ = 1 as the number of witness/admissible set pairs increases and they agree on the same value.

We start by defining

dw ≡ max over i, x, z of |P(Y = 1 | X = x, Wi = 1, Zi = z, U) − P(Y = 1 | X = x, Wi = 0, Zi = z, U)|,

for i = 1, 2, . . . , k. An analogous definition is given for dxy and dβ. Next, we define the likelihood function for {dw, dxy, dβ} under the data set {A1, A2, . . . , Ak} as follows:

P(A1, A2, . . . , Ak | dw, dxy, dβ) = ∏_{i=1}^{k} pU[L(dw,dxy,dβ), U(dw,dxy,dβ)](Ai),   (34)

where pU[a,b](·) is the density of the uniform distribution on [a, b]. The functions L(dw, dxy, dβ) and U(dw, dxy, dβ) are the respective lower and upper bounds implied by the WPP constraints parameterized by {dw, dxy, dβ} and the (given) joint distribution of the observables. A uniform (discrete) prior for {dw, dxy, dβ} is given over a pre-defined grid of values for these parameters21. We then choose a set of {ϵw, ϵxy, β̄} as the high posterior density region defined by sorting all {dw, dxy, dβ} in decreasing order of posterior mass and picking the minimal set that adds up to at least 95% of the posterior mass.
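A minimal R sketch of this selection rule follows, assuming a routine `wpp_interval_at(dw, dxy, dbeta)` that returns the bounds [L, U] implied by those relaxation values and the plugged-in joint of the observables; the routine, the grids, and the variable names are illustrative, not the paper's implementation.

```r
# Sketch of Section 8.2.2: grid posterior over relaxation values under the
# uniform likelihood of Equation (34), then a 95% high posterior density set.
hpd_relaxation <- function(A, wpp_interval_at,
                           dw_grid    = c(0.01, seq(0.05, 0.50, by = 0.05)),
                           dxy_grid   = c(0.01, seq(0.05, 0.50, by = 0.05)),
                           dbeta_grid = seq(1, 1.20, by = 0.05),
                           mass = 0.95) {
  grid <- expand.grid(dw = dw_grid, dxy = dxy_grid, dbeta = dbeta_grid)
  loglik <- apply(grid, 1, function(g) {
    b <- wpp_interval_at(g["dw"], g["dxy"], g["dbeta"])
    if (all(A >= b[1] & A <= b[2]))
      -length(A) * log(b[2] - b[1])   # product of uniform densities, Equation (34)
    else
      -Inf                            # some back-door ACE falls outside [L, U]
  })
  post <- exp(loglik - max(loglik))   # uniform prior over the grid
  post <- post / sum(post)
  ord <- order(post, decreasing = TRUE)
  keep <- ord[seq_len(which(cumsum(post[ord]) >= mass)[1])]
  cbind(grid[keep, ], posterior = post[keep])
}
```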
We summarize the implied set of bounds as necessary; see Section 9. There is no reason why a uniform prior and the uniform likelihood (34) should be the only choices. Our motivation is that the chosen likelihood function penalizes parameters that imply wide intervals, while remaining agnostic about the position of each ACE within the bounds. More importantly, the penalization increases as k increases, making the posterior more peaked. It does, however, force all intervals of equal length to be distinguished based on the prior only. Priors matter in applied work, but in our experiments we choose the uniform prior for its simplicity. We leave the discussion of other choices of likelihood and priors for future work.

A criticism of Equation (34) is that the pairs in the set {(W1, Z1), . . . , (Wk, Zk)} might have much overlap (in the sense that the same witness may appear in many pairs, and the intersection among the {Zi} may be large). As such, the multiplication in Equation (34) provides overconfident posteriors, as pairs are treated as independent pieces of information about the relaxation parameters. More free parameters accounting for the dependence of the {Ai} given {dw, dxy, dβ} should be added. However, while we remove a class of irrelevant pairs (any (Wj, Zj) such that there is some i ≠ j where Zi ⊆ Zj and Wi = Wj), in this work we ignore more complex adjustments for simplicity.

9. Experiments

In this section, we start with a comparison of the back-substitution algorithm of Section 5.2 against the fully numerical procedure, which generates constraints using standard algorithms for changing between polytope representations. We then perform studies with synthetic data, comparing different back-door estimation algorithms against WPP. Finally, we perform studies with real data sets.

21. See Section 9. Using a pre-defined discretization simplifies computation, as no MCMC is required and we do not need high precision in estimating the relaxation parameters. A continuous space would also pose challenges to an MCMC approach, as the posterior can be flat in regions where different parameter settings imply intervals of the same length U(dw, dxy, dβ) − L(dw, dxy, dβ).

9.1 Empirical Investigation of the Back-substitution Algorithm

We compare the back-substitution algorithm introduced in Section 5.2 with the fully numerical algorithm. The comparison is done in two ways: (i) computational cost, as measured by the wall-clock time taken to generate 100 samples by rejection sampling; (ii) width of the generated intervals. As discussed in Section 5.2, bounds obtained by the back-substitution algorithm are at least as wide as those of the numerical algorithm, barring rounding problems22.

We ran two batches of 1000 trials each, varying the level of the relaxation parameters. In the first batch, we set ϵx = ϵy = ϵw = 0.2, β = 0.9 and β̄ = 1.1. In the second batch, we change the parameters so that β = β̄ = 1. Experiments were run on an Intel Xeon E5-1650 at 3.20GHz. Models were simulated according to the structure W → X → Y, sampling each conditional probability of a vertex being equal to 1 given its parent from the uniform (0, 1) distribution. The numerical procedure of converting extreme points to linear inequalities was done using the package rcdd, an R wrapper for cddlib by Komei Fukuda. Inference is done by rejection sampling, requiring 100 samples per trial. We fix the number of iterations of the back-substitution method to 4, which is more than enough to achieve convergence.
All code was written in R. For the first batch, the average time difference between the fully numerical method and the back-substitution algorithm was 1 second, standard deviation (s.d.) 0.34. The ratio between times had a mean of 203 (s.d. 82). Even with a more specialized implementation of the polytope dualization step23, two orders of magnitude of difference seem hard to remove by better coding. Concerning interval widths, the mean difference was 0.15 (s.d. 0.06), meaning that the back-substitution intervals are on average 0.15 units wider (upper bound minus lower bound) than those of the numerical method, under this choice of relaxation parameters and averaged over problems generated according to our simulation scheme. There is a correlation between the width difference (the gap) and the interval width given by the numerical method, implying that differences tend to be larger when bounds are looser: the gap between methods was as small as 0.04 for a fully numerical interval of width 0.19, and as large as 0.23 for a fully numerical interval of width 0.49. For the case where β = β̄ = 1, the average time difference was 0.92 (s.d. 0.24), the ratio 152 (s.d. 54.3), and the interval width difference 0.09 (s.d. 0.03); the gap was as small as 0.005 for a fully numerical interval of width 0.09, and as large as 0.17 for a fully numerical interval of width 0.23.

9.2 Synthetic Studies

We describe a set of synthetic studies for binary data where, for procedures that estimate ACE intervals, we assess the trade-off between their correctness (that is, how far from the true ACEs the intervals are, for a suitable definition of distance) and their informativeness (how wide the intervals are).

22. We did not use rational arithmetic in the polytope generator in order to speed it up; consequently, about 1% of the time we observed numerical problems. Those were excluded from the statistics reported in this section.

23. One advantage of the analytical bounds, as used by the back-substitution method, is that it is easy to express them as matrix operations over all Monte Carlo samples, while the polytope construction requires iterations over the samples.

In the synthetic study setup, we compare our method against NE1 and NE2, two naïve point estimators defined by back-door adjustment on the whole set of available covariates W and on the empty set, respectively. The former is widely used in practice, even when there is no causal basis for doing so (Pearl, 2009). The point estimator of Entner et al. (2013), based solely on the faithfulness assumption, is also assessed. We generate problems where conditioning on the whole set W is guaranteed to give incorrect estimates. In detail: we generate graphs where W ≡ {Z1, Z2, . . . , Z8}. Four independent latent variables L1, . . . , L4 are added as parents of each of {Z5, . . . , Z8}; L1 is also a parent of X, and L2 a parent of Y. L3 and L4 are each randomly assigned to be a parent of either X or Y, but not both. {Z5, . . . , Z8} have no other parents. The graph over Z1, . . . , Z4 is chosen by adding edges uniformly at random according to a fixed topological order. As a consequence, using the full set W for back-door adjustment is always incorrect, as at least four paths X ← L1 → Zi ← L2 → Y are active for i = 5, 6, 7, 8.
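The back-door adjusted ACE underlying NE1, NE2, the faithfulness estimator, and the functionals Ai of Section 8.2 can be computed directly from a binary data set. The minimal R sketch below is illustrative (column names and helper names are not from the paper's code).

```r
# Back-door adjusted ACE for binary data:
# ACE = sum_z [P(Y=1|X=1,Z=z) - P(Y=1|X=0,Z=z)] P(Z=z).
backdoor_ace <- function(dat, x = "X", y = "Y", zset = character(0)) {
  if (length(zset) == 0) {
    return(mean(dat[[y]][dat[[x]] == 1]) - mean(dat[[y]][dat[[x]] == 0]))
  }
  z_id <- interaction(dat[zset], drop = TRUE)      # one stratum per configuration of Z
  ace <- 0
  for (lev in levels(z_id)) {
    in_z <- z_id == lev
    p_z  <- mean(in_z)
    p1   <- mean(dat[[y]][in_z & dat[[x]] == 1])   # P(Y=1 | X=1, Z=z)
    p0   <- mean(dat[[y]][in_z & dat[[x]] == 0])   # P(Y=1 | X=0, Z=z)
    ace  <- ace + (p1 - p0) * p_z
  }
  ace
}
# Example: NE1 adjusts on all covariates, NE2 on the empty set:
# ne1 <- backdoor_ace(dat, zset = paste0("Z", 1:8)); ne2 <- backdoor_ace(dat)
```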
The conditional probabilities of a vertex given its parents are generated by a logistic regression model with pairwise interactions, where parameters are sampled according to a zero-mean Gaussian with standard deviation 20 / (number of parents). Parameter values are further bounded, so that if the generated value is greater than 0.975 or less than 0.025, it is resampled uniformly in [0.950, 0.975] or [0.025, 0.050], respectively.

We analyze two variations: in the first, it is guaranteed that at least one valid witness/admissible set pair exists; in the second, all latent variables in the graph are also set as common parents of X and Y, so no valid witness exists. We divide each variation into two subcases: in the first, "hard" subcase, parameters are chosen (by rejection sampling, proposing from the model described in the previous paragraph) so that NE1 has a bias of at least 0.1 in the population; in the second, no such selection is enforced, and as such our exchangeable parameter sampling scheme makes the problem relatively easy.

We summarize each WPP interval by the posterior expected value of the lower and upper bounds. In general WPP returns more than one bound: we select the upper/lower bound corresponding to the (W, Z) pair which maximizes the score described at the end of Section 4.2. A BDeu prior with an equivalent sample size of 10 was used. Our main evaluation metric for an estimate is the Euclidean distance (henceforth, "error") between the true ACE and the closest point in the given estimate, whether the estimate is a point or an interval. For methods that provide point estimates (NE1, NE2, and faithfulness), this is just the absolute value of the difference between the true ACE and the estimated ACE. For WPP, the error of the interval [L, U] is zero if the true ACE lies in this interval. We report the error average and the error tail mass at 0.1, the latter meaning the proportion of cases where the error exceeds 0.1. Moreover, the faithfulness estimator is defined by averaging over all estimated ACEs as given by the accepted admissible sets in each problem.

As discussed in Section 8.1, WPP can be understood as providing a trade-off between information loss and accuracy. For instance, while the trivial interval [−1, 1] will always have zero error, it is not an interesting solution. We assess the trade-off by running simulations at different levels of kϵ, where ϵw = ϵy = ϵx = kϵ. We also have two configurations for {β, β̄}: we set them at either β = β̄ = 1 or β = 0.9, β̄ = 1.1.

For the cases where no witness exists, Entner's Rule 1 should theoretically report no solution. Entner et al. (2013) used stringent thresholds for deciding when the two conditions
Under this scheme, we calculate the posterior expected value of the contingency table and, using this single point, calculate the bounds using the fully numerical method. This is not guaranteed to work: the point estimator using the analytical bounds might lie outside the polytope given by the full set of constraints. If this situation is detected, we revert to calculating the bounds using the analytical method. The gains in interval length reduction using the full numerical method are relatively modest (e.g., at kϵ = 0.20, the average interval width reduced from 0.30 to 0.24) but depending on the application they might make a sizeable difference. We simulate 100 data sets for each one of the four cases (hard case/easy case, with theoretical solution/without theoretical solution), 5000 points per data set, 1000 Monte Carlo samples per decision. Results for the point estimators (NE1, NE2, faithfulness) are obtained using the population contingency tables. Results are summarized in Table 1. The first observation is that at very low levels of kϵ we increase the ability to reject all witness candidates: this is due mostly not because Rule 1 never fires, but because the falsification rule of WPP (which does not enforce independence constraints) rejects the proposed witnesses found by Rule 1. The trade-offset by WPP is quite stable, where larger intervals are indeed associated with smaller error. The point estimates vary in quality, being particularly bad in the situation where no witness should theoretically exist. The set-up where β = 0.9, β = 1.1 is especially uninformative compared to β = β = 1. At kϵ = 0.2, we obtain interval widths around 0.50. As Manski (2007) emphasizes, this is the price for making fewer assumptions. Even there, they typically cover only about 25% of the interval [ 1, 1] of a priori possibilities for the ACE. 9.2.1 Selection of Relaxation Parameters We performed an automated choice of relaxation parameters applying the methods in Section 8.2 to the same synthetic data sets. For each data set and each parameter choice method, we obtain a set B of intervals defined by a lower/upper bound. We summarize B in two ways: the tightest bound, meaning we choose the narrowest interval in B; the loosest bound, defined as the interval where the lower (upper) bound is the smallest lower (largest upper) bound in B. We then report results for each of the four synthetic case scenarios and each of the two methods: the Tightest ACE Coverage (TAC) method from Section 8.2.1 and the high posterior density (HPD) method of Section 8.2.2. Each parameter ϵw and ϵxy = ϵx = ϵy was allowed to assume values in the discretized grid {0.01, 0.05, 0.10, . . . , 0.50}. Parameter β = β = 1/β was allowed to take values in {1, 1.05, . . . , 1.20}. Results are summarized in Table 2. 
Hard, Solvable: NE1 = (0.12, 1.00), NE2 = (0.02, 0.03)
kϵ     Found   Faith.         WPP1           Width1   WPP2           Width2
0.05   0.74    (0.03, 0.05)   (0.02, 0.05)   0.05     (0.00, 0.00)   0.34
0.10   0.94    (0.04, 0.05)   (0.01, 0.01)   0.11     (0.00, 0.00)   0.41
0.15   0.99    (0.04, 0.05)   (0.01, 0.02)   0.16     (0.00, 0.00)   0.46
0.20   1.00    (0.05, 0.05)   (0.01, 0.01)   0.24     (0.00, 0.00)   0.53
0.25   1.00    (0.05, 0.07)   (0.00, 0.00)   0.32     (0.00, 0.00)   0.60
0.30   1.00    (0.05, 0.10)   (0.00, 0.00)   0.41     (0.00, 0.00)   0.69

Easy, Solvable: NE1 = (0.01, 0.01), NE2 = (0.07, 0.24)
kϵ     Found   Faith.         WPP1           Width1   WPP2           Width2
0.05   0.81    (0.03, 0.02)   (0.02, 0.04)   0.04     (0.00, 0.01)   0.34
0.10   0.99    (0.02, 0.02)   (0.01, 0.02)   0.09     (0.00, 0.00)   0.40
0.15   1.00    (0.02, 0.01)   (0.00, 0.00)   0.17     (0.00, 0.00)   0.46
0.20   1.00    (0.02, 0.01)   (0.00, 0.00)   0.24     (0.00, 0.00)   0.54
0.25   1.00    (0.02, 0.01)   (0.00, 0.00)   0.32     (0.00, 0.00)   0.61
0.30   1.00    (0.02, 0.01)   (0.00, 0.00)   0.41     (0.00, 0.00)   0.67

Hard, Not Solvable: NE1 = (0.16, 1.00), NE2 = (0.20, 0.88)
kϵ     Found   Faith.         WPP1           Width1   WPP2           Width2
0.05   0.67    (0.20, 0.90)   (0.17, 0.76)   0.06     (0.04, 0.14)   0.32
0.10   0.91    (0.19, 0.91)   (0.13, 0.63)   0.10     (0.02, 0.07)   0.39
0.15   0.97    (0.19, 0.92)   (0.10, 0.41)   0.18     (0.01, 0.03)   0.45
0.20   0.99    (0.19, 0.95)   (0.07, 0.25)   0.24     (0.01, 0.01)   0.51
0.25   1.00    (0.19, 0.96)   (0.03, 0.13)   0.31     (0.00, 0.00)   0.58
0.30   1.00    (0.19, 0.96)   (0.02, 0.06)   0.39     (0.00, 0.00)   0.66

Easy, Not Solvable: NE1 = (0.09, 0.32), NE2 = (0.14, 0.56)
kϵ     Found   Faith.         WPP1           Width1   WPP2           Width2
0.05   0.68    (0.13, 0.51)   (0.10, 0.37)   0.05     (0.02, 0.07)   0.33
0.10   0.97    (0.12, 0.53)   (0.08, 0.28)   0.10     (0.01, 0.05)   0.39
0.15   1.00    (0.12, 0.52)   (0.05, 0.17)   0.16     (0.01, 0.03)   0.46
0.20   1.00    (0.12, 0.53)   (0.03, 0.08)   0.23     (0.01, 0.03)   0.52
0.25   1.00    (0.12, 0.48)   (0.02, 0.05)   0.31     (0.00, 0.02)   0.59
0.30   1.00    (0.12, 0.48)   (0.01, 0.04)   0.39     (0.00, 0.01)   0.65

Table 1: Summary of the outcome of the synthetic studies. Columns labeled WPP1 refer to results obtained for β = β̄ = 1, while WPP2 refers to the case β = 0.9, β̄ = 1.1. The first column is the level at which we set the remaining parameters, ϵx = ϵy = ϵw = kϵ. The second column is the frequency with which a WPP solution was found among the 100 runs. For each particular method (NE1, NE2, Faithfulness and WPP) we report the pair (error average, error tail mass at 0.1), as explained in the main text. The Faithfulness estimator is the back-door adjustment obtained by using as the admissible set the same set found by WPP1. Averages are taken only over the cases where a witness/admissible set pair has been found. The columns following each WPP result give the median width of the respective WPP interval across the 100 runs.

                        Tightest                              Loosest
                        TAC              HPD                  TAC              HPD
Case                    error    width   error      width     error    width   error      width
Hard, Solvable          0.004    0.18    0.004      0.18      0.002    0.24    0.00009    0.40
Easy, Solvable          0.002    0.13    0.002      0.13      0.001    0.20    0.002      0.26
Hard, Not Solvable      0.12     0.14    0.12       0.14      0.10     0.20    0.07       0.33
Easy, Not Solvable      0.07     0.14    0.07       0.14      0.05     0.20    0.04       0.35

Table 2: Applying the criteria for choosing relaxation parameters from Section 8.2 to the four synthetic case scenarios. Error is the average error, as formalized for Table 1. Width is the average width over all 100 subcases of the respective study. Tightest and Loosest are the two criteria for summarizing a set of intervals, as explained in the main text.

Comparing this against Table 1, results seem to be slightly worse than WPP1 at the same interval width, but without making prior assumptions on β̄. Compared to WPP2, overall widths are much smaller. The HPD method agrees with TAC on the tightest interval, as our choice of prior will always imply a posterior mode on the TAC solution.
The loosest interval for HPD will always be at least as wide as the loosest for TAC, as the 95% posterior mass that generates the intervals includes the Pareto frontier and possibly many other candidates. In our simulations, the reduction in error for HPD with the loosest bound came with a nontrivial increase in the length of the corresponding intervals. While we do not explicitly advocate one method over the other, the HPD method can be used to classify problems as harder than others by assessing how much of the posterior mass of the hyperparameters is not on the Pareto frontier. In Figure 5, we visualize the marginal posterior distribution of {dxy, dw} for two synthetic problems in the easy/solvable case, where in one problem the tightest interval failed to cover the true ACE, while in the other the ACE was correctly accounted for.

9.3 Influenza Study

Our empirical study concerns the effect of influenza vaccination on a patient being later hospitalized with chest problems. X = 1 means the patient got a flu shot, and Y = 1 indicates the patient was hospitalized. A negative ACE therefore suggests a desirable vaccine. The study was originally discussed by McDonald et al. (1992). Shots were not randomized, but doctors were randomly assigned to receive a reminder letter encouraging them to inoculate their patients, an event recorded as the binary variable GRP. This suggests the standard IV model in Figure 1(d), with W = GRP and U unobservable. That is, W and U are independent because W is randomized, and there are reasonable justifications to believe there is no direct effect of the letter randomization on patient hospitalization. Richardson et al. (2011) and Hirano et al. (2000) provide further discussion.

Figure 5: Marginal posterior distribution for {dxy, dw} in two problem instances. Darker values represent smaller probabilities. In instance (a), the length of the tightest interval was 0.32 and did not contain the true ACE (but the error was still < 0.01). In instance (b), the length of the tightest interval was 0.11 and did contain the true ACE. Parameter dw does not seem to be as influential conditional on dxy, and the uniform prior allows for little variability in the dw posterior away from the mode.

From this randomization, it is possible to directly estimate the ACE24 of W on Y: −0.01. This is called an intention-to-treat (ITT) analysis (Rothman et al., 2008), as it is based on the treatment assigned by randomization and not on the variable of interest (X), which is not randomized. While the ITT can be used for policy making, the ACE of X on Y would be a more relevant result, as it reveals features of the vaccine that are not dependent on the encouragement design. X and Y can be confounded, as X is not controlled. For instance, the patient's choice to be vaccinated might be caused by her general health status, which will be a factor for hospitalization in the future. The data contain records of 2,681 patients, with some demographic indicators (age, sex and race) and some historical medical data (for instance, whether the patient is diabetic). A total of 9 covariates is available. Using the bounds of Balke and Pearl (1997) and the observed frequencies produces an interval of [−0.23, 0.64] for the ACE.

24. Notice that while the ACE might be small, this does not mean that on another scale, such as odds ratios, the results do not reveal an important effect. This depends on the domain.
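The standard IV bounds used for this comparison can be computed directly from the conditional frequencies ζyx.w; the minimal R sketch below follows the expressions rederived in the proof of Proposition 1 in Appendix A, and the array layout and names are illustrative choices of this sketch.

```r
# Sketch of the standard binary IV (Balke-Pearl) bounds on the ACE.
# zeta[y+1, x+1, w+1] = P(Y=y, X=x | W=w), estimated by observed frequencies.
iv_bounds <- function(zeta) {
  z <- function(y, x, w) zeta[y + 1, x + 1, w + 1]
  up0 <- min(1 - z(0,0,0), 1 - z(0,0,1),
             z(0,1,0) + z(1,0,0) + z(1,0,1) + z(1,1,1),
             z(1,0,0) + z(1,1,0) + z(0,1,1) + z(1,0,1))
  lo0 <- max(z(1,0,1), z(1,0,0),
             z(1,0,0) + z(1,1,0) - z(0,0,1) - z(1,1,1),
            -z(0,0,0) - z(1,1,0) + z(1,0,1) + z(1,1,1))
  up1 <- min(1 - z(0,1,1), 1 - z(0,1,0),
             z(1,0,0) + z(1,1,0) + z(0,0,1) + z(1,1,1),
             z(0,0,0) + z(1,1,0) + z(1,0,1) + z(1,1,1))
  lo1 <- max(z(1,1,1), z(1,1,0),
            -z(0,1,0) - z(1,0,0) + z(1,0,1) + z(1,1,1),
             z(1,0,0) + z(1,1,0) - z(0,1,1) - z(1,0,1))
  c(lower = lo1 - up0, upper = up1 - lo0)   # bounds on eta_1 - eta_0
}
```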
WPP could not validate GRP as a witness for any admissible set. Instead, when forbidding GRP from being included in an admissible set (since the theory says GRP cannot be a common direct cause of vaccination and hospitalization), WPP selected as the highest-scoring pair the witness DM (patient had a history of diabetes prior to vaccination), with an admissible set composed of AGE (dichotomized as "60 or less years old" and "above 60") and SEX. Choosing, as an illustration, ϵw = ϵy = ϵx = 0.2 and β = 0.9, β̄ = 1.1, we obtain the posterior expected interval [−0.10, 0.17]. This does not mean the vaccine is more likely to be bad (positive ACE) than good: the posterior distribution is over bounds, not over points, and is completely agnostic about the distribution within the bounds. Notice that even though we allow for full dependence between all of our variables, the bounds are stricter than in the standard IV model due to the weakening of hidden confounder effects postulated by observing conditional independencies. It is also interesting that two demographic variables ended up being chosen by Rule 1, instead of other indicators of past diseases.

When allowing GRP to be included in an admissible set, the pair (DM, {AGE, SEX}) is ranked second among all pairs that satisfy Rule 1, with first place given by RENAL as the witness (history of renal complications), with the admissible set being GRP, COPD (history of pulmonary disease), and SEX. In this case, the expected posterior interval was approximately the same, [−0.08, 0.16]. It is worth mentioning that, even though this pair scored highest by our criterion (which measures the posterior probability of each premise of Rule 1), it is clear that the fit of this model is not as good as the one with DM as the witness, as measured by the much larger proportion of rejected samples when generating the posterior distribution. This suggests future work on how to rank such models. In Figure 6 we show a scatter plot of the posterior distribution over lower and upper bounds on the influenza vaccination effect, where DM is the witness. In Figures 7(a) and (b) we show kernel density estimates, based on the Monte Carlo samples, for the cases where DM and RENAL are the witnesses, respectively. While the witnesses were tested using the analytical bounds, the final set of samples shown here was generated with the fully numerical optimization procedure, which is quite expensive.

We also analyze how to select ℵ = {ϵw, ϵxy = ϵx = ϵy, β̄ = 1/β} using the Tightest ACE Coverage (TAC) method of Section 8.2.1. The motivation is that this is a domain with overall weak dependencies among the variables. From one point of view, this is bad, as instruments will be weak and generate wide intervals (as suggested by Proposition 1). From another perspective, this suggests that the effect of hidden confounders may also be weak. A total of 48 witness/admissible set pairs were proposed by WPP via Rule 1. The TAC Pareto frontier, using the same parameter space as in Section 9.2.1, included only two possibilities, (ϵxy = 0.05, ϵw = 0.01, β̄ = 1) and (ϵxy = 0.01, ϵw = 0.01, β̄ = 1.05). Using the empirical distribution as an estimator of the joint of the observables, the respective ACE intervals were [−0.01, 0.01] and [−0.02, 0.02].
Although the sign of the ACE is not determined from the data, the WPP procedure suggests that the magnitude of the ACE is no greater than 0.02, which is by itself of interest.

Figure 6: Scatterplot of the joint posterior distribution of lower bounds and upper bounds, with a Pearson correlation coefficient of 0.71.

9.4 Linear Models

In this section, we assess the use of the method in the linear case. Independently, we also introduce a complementary way of summarizing the outcome of a WPP analysis. We first check the performance of the method in a small synthetic study. We generate models following the same pattern as the hard/solvable case of Section 9.2, the ACE being the coefficient of X in the equation for Y. Each conditional model for a variable V*i given its parents is generated by sampling its coefficients from independent standard Gaussians, sampling the variance of the error term from a uniform [0, 0.5] distribution, and then rescaling the coefficients such that the marginal variance of V*i is 1. Observable data are then generated by transforming each V*i to follow a gamma distribution with mean and variance equal to 2. We generated 100 data sets with a sample size of 1000 each. We perform experiments25 setting all hyperparameters to ϵc = ϵwx = ϵwy = ϵxy = 0.2 and to ϵc = ϵwx = ϵwy = ϵxy = 0.1. Estimates of the Gaussian copula correlation matrices are obtained using the function huge.npn from the R package huge to transform the data, of which we then compute the empirical correlation matrix. We obtained average errors of 0.04 for the method with parameters set at 0.2, and 0.07 for parameters set at 0.1. The average lengths of the proposed intervals were 0.5 and 0.26, respectively. For comparison, the population errors for the two naïve estimators were 0.23 and 0.18.

25. The test for conditional independencies is done with the corrected BGe score (Kuipers et al., 2014), as discussed in Section 6.2. The hyperparameters are a prior of 0.5 for the independence constraint hypothesis, and an inverse Wishart prior with ν ≡ p + 2 degrees of freedom and a scale matrix given by the p × p identity matrix multiplied by ν, where p is the number of variables in the test.

Figure 7: In (a), the marginal densities for the lower bound (red) and upper bound (blue) on the ACE, smoothed kernel density estimates based on 5000 Monte Carlo samples (posterior mean bounds [−0.10, 0.17]). Bounds were derived using DM as the witness. In (b), a similar plot using RENAL as the witness (posterior mean bounds [−0.08, 0.16]).

We performed an empirical study with the 1976 Panel Study of Income Dynamics. The study uses data from 1975, assessing the income of couples in 1976. Our outcome variable Y is the wife's reported wage at the time of the 1976 interview, and the treatment X is the number of years of the wife's previous labor market experience. The data were discussed by Mroz (1987) and can be obtained from the R package AER (Kleiber and Zeileis, 2008). The covariate set W includes a combination of discrete and continuous variables, such as the husband's wages, the number of children, and whether the wife went to college. We infer a Gaussian copula correlation matrix using the extended rank likelihood method of Hoff (2007) with the R package sbgcop, which can deal with discrete and continuous variables but requires expensive MCMC sampling.
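The data-generating transformation used in the synthetic linear study above is easy to reproduce; the minimal R sketch below pushes a latent linear Gaussian system through equation (33) with gamma marginals of mean and variance 2 (shape 2, scale 1). The structural coefficients and error variances shown are illustrative choices, not the exact randomization scheme of the paper.

```r
# Sketch of the synthetic copula data generation of Section 9.4.
set.seed(1)
p <- 4
B <- matrix(0, p, p)                 # lower-triangular coefficients, fixed causal order
B[2, 1] <- 0.8; B[3, 2] <- 0.7; B[4, 3] <- 0.6
err_sd <- sqrt(runif(p, 0, 0.5))

n <- 1000
Vstar <- matrix(0, n, p)
for (j in seq_len(p)) {
  mean_j <- Vstar %*% B[j, ]                  # parents occupy earlier columns
  Vstar[, j] <- mean_j + rnorm(n, sd = err_sd[j])
  Vstar[, j] <- Vstar[, j] / sd(Vstar[, j])   # rescale to (roughly) unit marginal variance
}

# Equation (33): V_i = F_i^{-1}(Phi(V*_i)) with F_i a Gamma(shape = 2, scale = 1) CDF
V <- qgamma(pnorm(Vstar), shape = 2, scale = 1)

# Recover the copula correlation matrix via the nonparanormal transformation,
# as in the synthetic study (requires the 'huge' package):
# Rhat <- cor(huge::huge.npn(V))
```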
Notice that conditional independencies among the discrete observable elements of V ≡ {X, Y} ∪ W do not follow from conditional independencies among the unobservable Gaussian variables V*. We nevertheless test Rule 1 among V* using the estimated copula correlation matrix and the relatively high prior of 0.5 for the hypothesis of independence, for any given independence assessment. The sample size is 753, with 17 covariates26. We then make the (strong) assumption that work experience in 1975 is not a cause of any other variable in the covariate set.

26. We removed two covariates from the original data set: the indicator of economic participation, which is a deterministic function of other covariates; and the estimated wage of the wife in 1975, which for that year was not available directly via self-report. In order to speed up the search algorithm, for each witness candidate W, the space of variables to test for an admissible set is composed of the 10 covariates most strongly associated with W, measured by the absolute value of the corresponding copula correlation matrix entry.

Setting all relaxation parameters ϵc = ϵwx = ϵwy = ϵxy to 0.1, we obtain in Figure 8(a) all corresponding intervals, with black dots representing the corresponding estimated ACE for the chosen admissible set. An explanation of all variables can be found in the documentation of package AER (Kleiber and Zeileis, 2008). Recall that the units here are given in the latent Gaussian space, where each V*i is a non-linear transformation of the corresponding Vi, as explained in Section 6.2. This analysis reveals two clear clusters of behavior, which internally show little variability but are very different from one another, even accounting for a violation of 0.1. This illustrates possible ways of communicating the output of a WPP analysis so that issues with assumptions and data can be raised. In this case, the two clusters of intervals differ in one variable in the admissible set: the variable HOURS is present in the cases where the intervals are closer to zero. This variable measures the number of work hours of the wife in 1975, and is partially embedded in the definition of the experience level measured in 1975. By removing all admissible sets that include the HOURS variable, we obtain the summary given as Figure 8(b). This type of visualization step can be used to flag major contradictions that cannot be easily explained by allowing mild violations of faithfulness, but which might suggest problematic measurements to be reconsidered in the analysis.

10. Conclusion

Our model provides a novel compromise between point estimators given by the faithfulness assumption and bounds based on instrumental variables. We believe such an approach should become a standard item in the toolbox of methodologies for observational studies, as it provides a means to draw conclusions from a complementary set of assumptions. Ongoing updates of the software for WPP are provided as part of the R package CausalFX, available from the Comprehensive R Archive Network27 and GitHub28. A snapshot of the code used in this paper is available at http://www.homepages.ucl.ac.uk/~ucgtrbd/wpp. In particular, unlike Bayesian approaches that put priors directly on the parameters of the unidentifiable latent variable model P(Y, X, W, U | Z), the constrained Dirichlet prior on the observed distribution does not suffer from massive sensitivity to the choice of hyperparameters. When a strongly informative prior is lacking, WPP keeps inference more honest by focusing on bounds.
While it is tempting to look for an alternative that will provide a point estimate of the ACE, it is also important to have a method that trades off information for fewer assumptions. WPP provides a framework in which to express such assumptions. The brute-force search used in the implementation of Rule 1 can be substituted by other combinatorial search procedures and dimensionality reduction methods. Entner et al. (2013) provide alternatives by borrowing ideas from the PC algorithm, for instance. The package CausalFX implements the idea discussed briefly in Section 9.4, where for each witness candidate W we pre-select a small set of candidates from W\{W} and perform a brute-force search for admissible sets within this candidate set only. Pre-selection in CausalFX 1.0 is done by first sorting all Z ∈ W\{W} according to the empirical mutual information

27. https://cran.r-project.org/web/packages/CausalFX/index.html
28. https://github.com/rbas2015/CausalFX
of Z and W given X, and then picking the top K candidates in descending value of mutual information (the heuristic being that we should look first at paths W - X - Z that are "strong"). K is chosen such that enumerating 2^K candidate admissible sets is possible within the available computer resources. Although this restricted search procedure might miss some admissible sets, it has the advantage of avoiding sensitivity to the propagation of statistical mistakes that creates difficulties for the PC algorithm and similar methods. We emphasize that the credible intervals obtained by the procedure are conditioned on the search results, discarding uncertainty coming from the choices of witnesses, admissible sets and relaxation parameters. Ideally, uncertainty concerning the outcome of the Rule 1 search should also be taken into account. An approach analogous to that of Friedman and Koller (2003) is necessary, which we leave as future work.

Figure 8: The diagrams depict some of the ACE intervals obtained for the linear model of the impact of a married woman's work experience up to 1975 on her salary in 1976. On the y-axis, we show the witness in brackets, followed by all variables in the admissible set; the x-axis shows the point estimates of the interval for the ACE using ϵc = ϵwx = ϵwy = ϵxy = 0.1. Black dots are the corresponding point estimates of the ACE using the back-door method. All variable names are explained in the documentation of package AER (Kleiber and Zeileis, 2008). In (a), we allow all other recorded variables into the covariate set W from which witnesses and admissible sets are generated. In (b), we remove HOURS from the pool of possible covariates.

As further future work, we will look at a generalization of the procedure beyond relaxations of chain structures W → X → Y. Much of the machinery developed here, including Entner et al.'s rules, can be adapted to the case where the causal ordering is unknown: starting from the algorithm of Mani et al. (2006) to search for Y-structures, it is possible to generalize Rule 1 to setups where we have an outcome variable Y that needs to be controlled, but where there is no covariate X known not to be a cause of other covariates. Mooij and Cremers (2015) investigate the robustness of the faithfulness condition in this setup. Finally, the techniques used to derive the symbolic bounds in Section 5 may prove useful in a more general context, and complement other methods for finding subsets of useful constraints, such as the graphical approach of Evans (2012).

Acknowledgments

We thank McDonald, Hui and Tierney for their flu vaccine data, and the anonymous reviewers for their many suggestions to improve our paper. Much of this work was done while RS was hosted by the Department of Statistics at the University of Oxford. Parts of this work were previously published in the 2014 Neural Information Processing Systems Conference (Silva and Evans, 2014). Sections 6, 7 and 8 are completely new, and all remaining sections have new material added, including the appendix.

Appendix A. Proofs

In this Appendix, we prove the results mentioned in the main text.

A.1 Basic Results

We divide the proofs into four main sections. The first section provides the basic methods, including how classical results in instrumental variable bounding can be rederived.
The second and third sections contain proofs for the most complex types of bounds. Finally, the fourth section covers the linear continuous case.

Proof of Proposition 1

In the standard IV case, simple analytical bounds are known for P(Y = y | do(X = x)) (Balke and Pearl, 1997; Dawid, 2003):

max{ ζ_{10.1},  ζ_{10.0},  ζ_{10.0} + ζ_{11.0} − ζ_{00.1} − ζ_{11.1},  −ζ_{00.0} − ζ_{11.0} + ζ_{10.1} + ζ_{11.1} }
  ≤ η_0 ≤
min{ 1 − ζ_{00.0},  1 − ζ_{00.1},  ζ_{01.0} + ζ_{10.0} + ζ_{10.1} + ζ_{11.1},  ζ_{10.0} + ζ_{11.0} + ζ_{01.1} + ζ_{10.1} }

max{ ζ_{11.1},  ζ_{11.0},  −ζ_{01.0} − ζ_{10.0} + ζ_{10.1} + ζ_{11.1},  ζ_{10.0} + ζ_{11.0} − ζ_{01.1} − ζ_{10.1} }
  ≤ η_1 ≤
min{ 1 − ζ_{01.1},  1 − ζ_{01.0},  ζ_{10.0} + ζ_{11.0} + ζ_{00.1} + ζ_{11.1},  ζ_{00.0} + ζ_{11.0} + ζ_{10.1} + ζ_{11.1} }

where η_x ≡ P(Y = 1 | do(X = x)) and ζ_{yx.w} ≡ P(Y = y, X = x | W = w). Define also α_x ≡ P(Y = 1 | X = x) and β_w ≡ P(X = 1 | W = w), so that

ζ_{yx.w} = α_x^{I(y=1)} (1 − α_x)^{I(y=0)} β_w^{I(x=1)} (1 − β_w)^{I(x=0)},        (35)

where I(·) is the indicator function returning 1 or 0 depending on whether its argument is true or false, respectively.

Assume for now that β_1 ≥ β_0, that is, P(X = 1 | W = 1) ≥ P(X = 1 | W = 0). We will first show that 1 − ζ_{00.0} ≤ min{1 − ζ_{00.1}, ζ_{01.0} + ζ_{10.0} + ζ_{10.1} + ζ_{11.1}, ζ_{10.0} + ζ_{11.0} + ζ_{01.1} + ζ_{10.1}}. That 1 − ζ_{00.0} ≤ 1 − ζ_{00.1} follows directly from relationship (35) and the assumptions W ⊥ Y | X and β_1 ≥ β_0:

(1 − ζ_{00.0}) − (1 − ζ_{00.1}) = −(1 − α_0)(1 − β_0) + (1 − α_0)(1 − β_1) = (1 − α_0)(β_0 − β_1) ≤ 0.

Now consider (1 − ζ_{00.0}) − (ζ_{01.0} + ζ_{10.0} + ζ_{10.1} + ζ_{11.1}). This is equal to

(1 − (1 − α_0)(1 − β_0)) − ((1 − α_1)β_0 + α_0(1 − β_0) + α_0(1 − β_1) + α_1β_1)
  = (β_0 + α_0(1 − β_0)) − (β_0 − α_1β_0 + α_0(1 − β_0) + α_0 − α_0β_1 + α_1β_1)
  = α_1(β_0 − β_1) − α_0(1 − β_1) ≤ 0.

Analogously, we can show that 1 − ζ_{00.0} ≤ ζ_{10.0} + ζ_{11.0} + ζ_{01.1} + ζ_{10.1}. Tedious but analogous manipulations lead to the overall conclusion that, under β_1 ≥ β_0,

1 − ζ_{00.0} = min{ 1 − ζ_{00.0},  1 − ζ_{00.1},  ζ_{01.0} + ζ_{10.0} + ζ_{10.1} + ζ_{11.1},  ζ_{10.0} + ζ_{11.0} + ζ_{01.1} + ζ_{10.1} }
ζ_{10.0} = max{ ζ_{10.1},  ζ_{10.0},  ζ_{10.0} + ζ_{11.0} − ζ_{00.1} − ζ_{11.1},  −ζ_{00.0} − ζ_{11.0} + ζ_{10.1} + ζ_{11.1} }
1 − ζ_{01.1} = min{ 1 − ζ_{01.1},  1 − ζ_{01.0},  ζ_{10.0} + ζ_{11.0} + ζ_{00.1} + ζ_{11.1},  ζ_{00.0} + ζ_{11.0} + ζ_{10.1} + ζ_{11.1} }
ζ_{11.1} = max{ ζ_{11.1},  ζ_{11.0},  −ζ_{01.0} − ζ_{10.0} + ζ_{10.1} + ζ_{11.1},  ζ_{10.0} + ζ_{11.0} − ζ_{01.1} − ζ_{10.1} }

The upper bound on the ACE η_1 − η_0 is obtained by subtracting the lower bound on η_0 from the upper bound on η_1. That is, η_1 − η_0 ≤ (1 − ζ_{01.1}) − ζ_{10.0} ≡ U_SIV. Similarly, η_1 − η_0 ≥ ζ_{11.1} − (1 − ζ_{00.0}) ≡ L_SIV. It follows that U_SIV − L_SIV = 1 − (P(X = 1 | W = 1) − P(X = 1 | W = 0)). Finally, assuming β_1 ≤ β_0 gives by symmetry the interval width 1 − (P(X = 1 | W = 0) − P(X = 1 | W = 1)), implying that the width in the general case is 1 − |P(X = 1 | W = 1) − P(X = 1 | W = 0)|.

Now we will prove the main theorems stated in Section 5. To facilitate reading, we repeat here the notation used in the description of the constraints, with a few additions, as well as the identities mapping the different parameter spaces and the corresponding assumptions exploited in the derivations. We start with the basic notation:

ζ*_{yx.w} ≡ P(Y = y, X = x | W = w, U)
ζ_{yx.w} ≡ Σ_U P(Y = y, X = x | W = w, U) P(U | W = w) = P(Y = y, X = x | W = w)
κ_{yx.w} ≡ Σ_U P(Y = y, X = x | W = w, U) P(U)
η*_{xw} ≡ P(Y = 1 | X = x, W = w, U)
η_{xw} ≡ Σ_U P(Y = 1 | X = x, W = w, U) P(U | W = w) = P(Y = 1 | do(X = x), W = w)
ω_{xw} ≡ Σ_U P(Y = 1 | X = x, W = w, U) P(U)
δ*_w ≡ P(X = 1 | W = w, U)
δ_w ≡ Σ_U P(X = 1 | W = w, U) P(U | W = w) = P(X = 1 | W = w) = ζ_{11.w} + ζ_{01.w}
χ_{x.w} ≡ Σ_U P(X = x | W = w, U) P(U) = κ_{1x.w} + κ_{0x.w}

The explicit relationship between the parameters describing the latent variable model is:

ζ*_{00.0} = (1 − η*_{00})(1 − δ*_0)    ζ*_{01.0} = (1 − η*_{10})δ*_0    ζ*_{10.0} = η*_{00}(1 − δ*_0)    ζ*_{11.0} = η*_{10}δ*_0
ζ*_{00.1} = (1 − η*_{01})(1 − δ*_1)    ζ*_{01.1} = (1 − η*_{11})δ*_1    ζ*_{10.1} = η*_{01}(1 − δ*_1)    ζ*_{11.1} = η*_{11}δ*_1

All upper bound constants U^{YU}_{xw} and U^{XU}_{xw} are assumed to be positive. For L^{XU}_{xw} = 0 and c ≥ 0, all ratios c/L^{XU}_{xw} are defined to be positive infinite. In what follows, we define the standard IV model as the one which obeys the exogeneity of W and the exclusion restriction, that is, the model following the directed acyclic graph {W → X → Y, X ← U → Y}. All variables are binary, and the goal is to bound the average causal effect (ACE) of X on Y given a non-descendant W and a possible (set of) confounder(s) U of X and Y.
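For reference, the classical bounds in Proposition 1 are straightforward to compute once the table ζ_{yx.w} is available. The sketch below is a minimal illustration, not part of any released package: it evaluates the max/min expressions above and returns the implied ACE interval, whose width equals 1 − |P(X = 1 | W = 1) − P(X = 1 | W = 0)| as shown in the proof.

```python
def iv_bounds(zeta):
    """Balke-Pearl/Dawid bounds for the binary standard IV model.
    zeta[(y, x, w)] = P(Y=y, X=x | W=w); returns bounds on
    eta_x = P(Y=1 | do(X=x)) and on the ACE eta_1 - eta_0."""
    z = lambda y, x, w: zeta[(y, x, w)]
    eta0_lo = max(z(1, 0, 1), z(1, 0, 0),
                  z(1, 0, 0) + z(1, 1, 0) - z(0, 0, 1) - z(1, 1, 1),
                  -z(0, 0, 0) - z(1, 1, 0) + z(1, 0, 1) + z(1, 1, 1))
    eta0_hi = min(1 - z(0, 0, 0), 1 - z(0, 0, 1),
                  z(0, 1, 0) + z(1, 0, 0) + z(1, 0, 1) + z(1, 1, 1),
                  z(1, 0, 0) + z(1, 1, 0) + z(0, 1, 1) + z(1, 0, 1))
    eta1_lo = max(z(1, 1, 1), z(1, 1, 0),
                  -z(0, 1, 0) - z(1, 0, 0) + z(1, 0, 1) + z(1, 1, 1),
                  z(1, 0, 0) + z(1, 1, 0) - z(0, 1, 1) - z(1, 0, 1))
    eta1_hi = min(1 - z(0, 1, 1), 1 - z(0, 1, 0),
                  z(1, 0, 0) + z(1, 1, 0) + z(0, 0, 1) + z(1, 1, 1),
                  z(0, 0, 0) + z(1, 1, 0) + z(1, 0, 1) + z(1, 1, 1))
    return (eta0_lo, eta0_hi), (eta1_lo, eta1_hi), (eta1_lo - eta0_hi, eta1_hi - eta0_lo)
```

Feeding in a ζ table generated from any model satisfying W ⊥ Y | X reproduces the interval width derived above.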
Proof of Theorem 2

Start with the relationship between η*_{xw} and its upper bound:

η*_{xw} ≤ U^{YU}_{xw}
⟹ η*_{xw}(1 − (1 − δ*_{x'.w})) ≤ U^{YU}_{xw} δ*_{x'.w}        (multiply both sides by δ*_{x'.w})
⟹ ω_{xw} − κ_{1x.w} ≤ U^{YU}_{xw} χ_{x'.w}                    (marginalize over P(U))
⟹ ω_{xw} ≤ κ_{1x.w} + U^{YU}_{xw}(κ_{0x'.w} + κ_{1x'.w}),

and an analogous series of steps gives ω_{xw} ≥ κ_{1x.w} + L^{YU}_{xw}(κ_{0x'.w} + κ_{1x'.w}). Notice that such bounds will depend on how tight ϵ_y is. As an illustration of its implications, consider the derived identity ζ*_{0x.w} = (1 − η*_{xw})δ*_{x.w}:

1 − η*_{xw} = ζ*_{0x.w}/δ*_{x.w}
⟹ 1 − η*_{xw} ≥ ζ*_{0x.w}
⟹ η*_{xw} ≤ 1 − ζ*_{0x.w} = ζ*_{1x.w} + ζ*_{0x'.w} + ζ*_{1x'.w}
⟹ ω_{xw} ≤ κ_{1x.w} + κ_{0x'.w} + κ_{1x'.w}.

It follows from U^{YU}_{xw} ≤ 1 that the derived bound ω_{xw} ≤ κ_{1x.w} + U^{YU}_{xw}(κ_{0x'.w} + κ_{1x'.w}) is at least as tight as the one obtained via η*_{xw} ≤ 1 − ζ*_{0x.w}. Notice also that the standard IV bound η_{xw} ≤ 1 − ζ_{0x.w} (Balke and Pearl, 1997; Dawid, 2003) is a special case, obtained for ϵ_y = 0, β = β̄ = 1.

For the next bounds, consider

δ*_{x.w} ≤ U^{XU}_{xw}
⟹ η*_{xw} δ*_{x.w} ≤ U^{XU}_{xw} η*_{xw}
⟹ κ_{1x.w} ≤ U^{XU}_{xw} ω_{xw}                               (marginalize over P(U))
⟹ ω_{xw} ≥ κ_{1x.w}/U^{XU}_{xw},

where the bound ω_{xw} ≤ κ_{1x.w}/L^{XU}_{xw} can be obtained analogously. The corresponding bound for the standard IV model (with a possible direct effect W → Y) is η_{xw} ≥ ζ_{1x.w}, obtained again by choosing ϵ_x = 1, β = β̄ = 1. The corresponding bound ω_{xw} ≥ κ_{1x.w} is looser whenever U^{XU}_{xw} < 1. Notice that if L^{XU}_{xw} = 0, the upper bound is defined as infinite.

Finally, the last bounds are similar to the initial ones, but as a function of ϵ_x instead of ϵ_y:

δ*_{x.w} ≤ U^{XU}_{xw}
⟹ (1 − η*_{xw})δ*_{x.w} ≤ U^{XU}_{xw}(1 − η*_{xw})
⟹ κ_{0x.w} ≤ U^{XU}_{xw}(1 − ω_{xw})                          (marginalize over P(U))
⟹ ω_{xw} ≤ 1 − κ_{0x.w}/U^{XU}_{xw}.

The lower bound ω_{xw} ≥ 1 − κ_{0x.w}/L^{XU}_{xw} is obtained analogously, and is implied to be minus infinite if L^{XU}_{xw} = 0.

Proof of Theorem 3

We start with the following derivation:

η*_{xw} ≥ η*_{xw'} − ϵ_w
⟹ η*_{xw} δ*_{x.w'} ≥ η*_{xw'} δ*_{x.w'} − ϵ_w δ*_{x.w'}
⟹ η*_{xw} U^{XU}_{xw'} ≥ η*_{xw'} δ*_{x.w'} − ϵ_w δ*_{x.w'}        (use U^{XU}_{xw'} ≥ δ*_{x.w'})
⟹ ω_{xw} U^{XU}_{xw'} ≥ κ_{1x.w'} − ϵ_w χ_{x.w'}                    (marginalize over P(U))
⟹ ω_{xw} ≥ (κ_{1x.w'} − ϵ_w(κ_{0x.w'} + κ_{1x.w'}))/U^{XU}_{xw'}.

Analogously, starting from η*_{xw} ≤ η*_{xw'} + ϵ_w, we obtain ω_{xw} ≤ (κ_{1x.w'} + ϵ_w(κ_{0x.w'} + κ_{1x.w'}))/L^{XU}_{xw'}. Notice that for the special case ϵ_w = 0 and U^{XU}_{xw'} = 1, we obtain the corresponding lower bound ω_{xw} ≥ κ_{1x.w'} that relates ω and κ across different values of W. The result corresponding to the upper bound η_{xw} ≤ 1 − ζ_{0x.w'} can be obtained as follows:

η*_{xw'} ≥ η*_{xw} − ϵ_w
⟹ 1 − η*_{xw'} ≤ (1 − η*_{xw}) + ϵ_w
⟹ (1 − η*_{xw'}) δ*_{x.w'} ≤ (1 − η*_{xw}) δ*_{x.w'} + ϵ_w δ*_{x.w'}
⟹ (1 − η*_{xw'}) δ*_{x.w'} ≤ (1 − η*_{xw}) U^{XU}_{xw'} + ϵ_w δ*_{x.w'}
⟹ κ_{0x.w'} ≤ (1 − ω_{xw}) U^{XU}_{xw'} + ϵ_w χ_{x.w'}              (marginalize over P(U))
⟹ ω_{xw} ≤ 1 − (κ_{0x.w'} − ϵ_w(κ_{0x.w'} + κ_{1x.w'}))/U^{XU}_{xw'},

with the corresponding lower bound (non-trivial for L^{XU}_{xw'} > 0) given by ω_{xw} ≥ 1 − (κ_{0x.w'} + ϵ_w(κ_{0x.w'} + κ_{1x.w'}))/L^{XU}_{xw'}.

The final block of relationships can be derived as follows:

η*_{xw} ≤ η*_{xw'} + ϵ_w
⟹ η*_{xw} δ*_{x'.w} ≤ η*_{xw'} δ*_{x'.w} + ϵ_w δ*_{x'.w}
⟹ η*_{xw}(1 − (1 − δ*_{x'.w})) ≤ η*_{xw'} δ*_{x'.w} + ϵ_w δ*_{x'.w}
⟹ η*_{xw} − η*_{xw} δ*_{x.w} ≤ η*_{xw'} U^{XU}_{x'w} + ϵ_w δ*_{x'.w}        (use U^{XU}_{x'w} ≥ δ*_{x'.w})
⟹ ω_{xw} − κ_{1x.w} ≤ ω_{xw'} U^{XU}_{x'w} + ϵ_w χ_{x'.w}                   (marginalize over P(U))
⟹ ω_{xw} ≤ ω_{xw'} U^{XU}_{x'w} + κ_{1x.w} + ϵ_w(κ_{0x'.w} + κ_{1x'.w}),

with the lower bound ω_{xw} ≥ ω_{xw'} L^{XU}_{x'w} + κ_{1x.w} − ϵ_w(κ_{0x'.w} + κ_{1x'.w}) derived analogously.
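Before continuing with the remaining blocks of the proof, it is worth noting that derivations of this kind are easy to sanity-check numerically. The sketch below is an illustration only: it samples a small discrete latent-variable model and computes U^{YU}_{xw} directly from the sampled η* (in the method itself this constant comes from the relaxation parameters), then verifies the first Theorem 2 inequality ω_{xw} ≤ κ_{1x.w} + U^{YU}_{xw}(κ_{0x'.w} + κ_{1x'.w}) on random draws.

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    pu = rng.dirichlet(np.ones(3))              # P(U), with U taking 3 values
    eta = rng.uniform(size=(2, 2, 3))           # eta*[x, w, u] = P(Y=1 | X=x, W=w, U=u)
    delta = rng.uniform(size=(2, 3))            # delta*[w, u]  = P(X=1 | W=w, U=u)
    for x in (0, 1):
        for w in (0, 1):
            d_x_w = delta[w] if x == 1 else 1.0 - delta[w]    # P(X=x  | W=w, U=.)
            d_xp_w = 1.0 - d_x_w                              # P(X=x' | W=w, U=.)
            kappa_1x_w = float(np.sum(eta[x, w] * d_x_w * pu))
            chi_xp_w = float(np.sum(d_xp_w * pu))             # kappa_{0x'.w} + kappa_{1x'.w}
            omega_xw = float(np.sum(eta[x, w] * pu))
            U_YU = float(eta[x, w].max())       # a valid upper bound on eta*_{xw} over U
            assert omega_xw <= kappa_1x_w + U_YU * chi_xp_w + 1e-12
print("Theorem 2 upper bound held on all sampled models.")
```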
Moreover,

η*_{xw} ≥ η*_{xw'} − ϵ_w
⟹ (1 − η*_{xw}) δ*_{x'.w} ≤ (1 − η*_{xw'}) δ*_{x'.w} + ϵ_w δ*_{x'.w}
⟹ (1 − η*_{xw})(1 − (1 − δ*_{x'.w})) ≤ (1 − η*_{xw'}) U^{XU}_{x'w} + ϵ_w δ*_{x'.w}
⟹ (1 − ω_{xw}) − κ_{0x.w} ≤ (1 − ω_{xw'}) U^{XU}_{x'w} + ϵ_w χ_{x'.w}        (marginalize over P(U))
⟹ ω_{xw} ≥ ω_{xw'} U^{XU}_{x'w} + 1 − κ_{0x.w} − U^{XU}_{x'w} − ϵ_w(κ_{0x'.w} + κ_{1x'.w}),

and the corresponding ω_{xw} ≤ ω_{xw'} L^{XU}_{x'w} + 1 − κ_{0x.w} − L^{XU}_{x'w} + ϵ_w(κ_{0x'.w} + κ_{1x'.w}). The last two relationships follow immediately from the definition of ϵ_w.

The constraints found so far collapse to some of the constraints found in the standard IV model (Balke and Pearl, 1997; Dawid, 2003) given ϵ_w = 0, β = β̄ = 1. Namely,

η_{xw} ≤ 1 − ζ_{0x.w}        η_{xw} ≤ 1 − ζ_{0x.w'}
η_{xw} ≥ ζ_{1x.w}            η_{xw} ≥ ζ_{1x.w'}

However, none of the constraints found so far has a counterpart among the following:

η_{xw} ≤ ζ_{0x'.w} + ζ_{1x.w} + ζ_{1x'.w'} + ζ_{1x.w'}
η_{xw} ≤ ζ_{0x'.w'} + ζ_{1x.w'} + ζ_{1x'.w} + ζ_{1x.w}
η_{xw} ≥ ζ_{1x.w} + ζ_{1x'.w} − ζ_{0x.w'} − ζ_{1x'.w'}
η_{xw} ≥ ζ_{1x.w'} + ζ_{1x'.w'} − ζ_{0x.w} − ζ_{1x'.w}

These constraints have the distinctive property of being functions of both P(Y = y, X = x | W = w) and P(Y = y, X = x | W = w') simultaneously. So far, we have only used the basic identities and constraints, without attempting to derive constraints that are not a direct application of such identities. In the framework of Dawid (2003) and Ramsahai (2012), it is clear that general linear combinations of functions of {δ*_{x.w} η*_{xw}, δ*_{x.w}, η*_{xw}} can generate constraints on the observable quantities ζ_{yx.w} and the causal quantities of interest, η_{xw}. We need to encompass these possibilities in such a way that we obtain a framework for generating symbolic constraints as a function of {ϵ_w, ϵ_y, ϵ_x, β, β̄}.

One of the difficulties in exploiting a black-box polytope package for this purpose is the structure of the process, which exploits the constraints in Section 4 by first finding the extreme points of the feasible region of {δ*_w}, {η*_{xw}}. If we use the constraints

|η*_{x1} − η*_{x0}| ≤ ϵ_w        0 ≤ η*_{xw} ≤ 1,

then, assuming 0 < ϵ_w < 1, we always obtain the following six extreme points:

(0, 0)    (0, ϵ_w)    (ϵ_w, 0)    (1 − ϵ_w, 1)    (1, 1 − ϵ_w)    (1, 1)

In general, however, once we introduce constraints L^{YU}_{xw} ≤ η*_{xw} ≤ U^{YU}_{xw}, the number of extreme points will vary. Moreover, when multiplied with the extreme points of the space of δ*_1 × δ*_0, the resulting extreme points of ζ*_{yx.w} might be included in or excluded from the polytope depending on the relationship among {ϵ_w, ϵ_x, ϵ_y} and the observable P(Y, X | W). Numerically, this is not a problem (barring numerical instabilities, which do occur with a nontrivial frequency). Algebraically, this makes the problem considerably more complicated.29 Instead, in what follows we define a simpler framework that will not give tight constraints, but will shed light on the relationship between constraints, observable probabilities and the ϵ parameters. This will also be useful to scale up the full Witness Protection Program, as discussed in the main paper.

29. As a counterpart, imagine we defined a polytope through the matrix inequality Ax ≤ b. If we wanted to obtain its extreme point representation as an algebraic function of the entries of matrix A and vector b, this would be a complicated problem, since we cannot assume we know the magnitudes and signs of the entries.

A.2 Methodology for Cross-W Constraints

Consider the standard IV model again, i.e., where W is exogenous and has no direct effect on Y. So far, we have not replicated anything such as, e.g., η_1 ≤ ζ_{00.0} + ζ_{11.0} + ζ_{10.1} + ζ_{11.1}. We call this a cross-W constraint, as it relates observables under different values of W ∈ {0, 1}. These constraints are important when considering a weakening of the effect W → Y. The recipe for deriving them is as follows. Consider the template

δ*_0 f_1(η*_0, η*_1) + δ*_1 f_2(η*_0, η*_1) + f_3(η*_0, η*_1) ≥ 0        (36)

such that the f_i(·, ·) are linear. Linearity is imposed so that this function corresponds to a linear function of {ζ*, η*, δ*}, of which expectations will give observed probabilities or interventional probabilities. We require that evaluating this expression at each of the four extreme points of the joint space (δ*_0, δ*_1) ∈ {0, 1}^2 translates into one of the basic constraints 1 − η*_i ≥ 0 or η*_i ≥ 0, i ∈ {0, 1}. This implies that any combination of {δ*_0, δ*_1, η*_0, η*_1} will satisfy (36) (more on that later).
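This recipe is mechanical enough to automate. The sympy sketch below is a minimal illustration (it is not the procedure used to generate the results in the paper): it fixes one of the corner assignments, imposes that the template matches the assigned basic constraint identically in (η*_0, η*_1) at each corner of (δ*_0, δ*_1) ∈ {0, 1}^2, and solves for the coefficients of f_1, f_2, f_3. This particular assignment reproduces the example worked out below.

```python
import sympy as sp

e0, e1 = sp.symbols('eta0 eta1')                    # eta*_0, eta*_1
coeffs = sp.symbols('a1 b1 c1 a2 b2 c2 a3 b3 c3')
a1, b1, c1, a2, b2, c2, a3, b3, c3 = coeffs
f1 = a1*e0 + b1*e1 + c1
f2 = a2*e0 + b2*e1 + c2
f3 = a3*e0 + b3*e1 + c3

# One of the 8 corner assignments discussed in the text:
# (0,0) -> eta0 >= 0, (1,1) -> 1 - eta0 >= 0, (1,0) -> eta1 >= 0, (0,1) -> 1 - eta1 >= 0.
corner_targets = {(0, 0): e0, (1, 1): 1 - e0, (1, 0): e1, (0, 1): 1 - e1}

equations = []
for (d0, d1), target in corner_targets.items():
    expr = d0*f1 + d1*f2 + f3 - target              # must vanish identically in eta0, eta1
    equations.extend(sp.Poly(expr, e0, e1).coeffs())
sol = sp.solve(equations, coeffs, dict=True)[0]
print(sp.simplify(f1.subs(sol)), sp.simplify(f2.subs(sol)), sp.simplify(f3.subs(sol)))
```

Running it yields f_1 = η*_1 − η*_0, f_2 = 1 − η*_0 − η*_1 and f_3 = η*_0, matching the example derived by hand below.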
Given a choice of basic constraint (say, η*_1 ≥ 0), and setting δ*_0 = δ*_1 = 0, this immediately identifies f_3(·, ·). We assign to the case δ*_0 = δ*_1 = 1 the complementary constraint (in this case, η*_1 ≤ 1). This leaves two choices for assigning the remaining constraints. Why do we associate the δ*_0 = δ*_1 = 1 case with the complementary constraint? Let us parameterize each function as f_i(η*_0, η*_1) ≡ a_i η*_0 + b_i η*_1 + c_i. Let a_3 = q, where either q = 1 (case η*_0 ≥ 0) or q = −1 (case 1 − η*_0 ≥ 0). Without loss of generality, assume that the case (δ*_0 = 1, δ*_1 = 0) is associated with the complementary constraint, in which the coefficient of η*_0 should be −q. For the other two cases, the coefficient of η*_0 should be 0 by construction. We get the system

a_3 = q        a_1 + a_3 = −q        a_2 + a_3 = 0        a_1 + a_2 + a_3 = 0

This system has no solution. Assume instead that δ*_0 = δ*_1 = 1 is associated with the complementary constraint, in which the coefficient of η*_0 should be −q. The system is now

a_3 = q        a_1 + a_3 = 0        a_2 + a_3 = 0        a_1 + a_2 + a_3 = −q

This system always has the solution a_1 = a_2 = −q. We do have freedom with b_1, b_2, b_3, which means we can choose to allocate the remaining two cases in two different ways.

Lemma 7 Consider the constraints derived by the above procedure. Then any choice of (δ*_0, δ*_1, η*_0, η*_1) ∈ [0, 1]^4 will satisfy these constraints.

Proof Without loss of generality, let f_3(η*_0, η*_1) = qη*_0 + (1 − q)/2, q ∈ {−1, 1}. That is, a_3 = q, b_3 = 0, c_3 = (1 − q)/2. This implies a_1 = a_2 = −q (as above). Associating (δ*_0 = 1, δ*_1 = 0) with η*_1 ≥ 0 gives {b_1 = 1, c_1 = (q − 1)/2} and, consequently, associating (δ*_0 = 0, δ*_1 = 1) with 1 − η*_1 ≥ 0 implies {b_2 = −1, c_2 = (1 + q)/2}. Plugging this into the expression δ*_0 f_1(η*_0, η*_1) + δ*_1 f_2(η*_0, η*_1) + f_3(η*_0, η*_1), we get

δ*_0(−qη*_0 + η*_1 + (q − 1)/2) + δ*_1(−qη*_0 − η*_1 + (1 + q)/2) + qη*_0 + (1 − q)/2
  = η*_0(q − (δ*_0 + δ*_1)q) + η*_1(δ*_0 − δ*_1) + δ*_0(q − 1)/2 + δ*_1(1 + q)/2 + (1 − q)/2
  = η*_0(q − (δ*_0 + δ*_1)q) + η*_1(δ*_0 − δ*_1) + (−q + (δ*_0 + δ*_1)q)/2 + (δ*_1 − δ*_0 + 1)/2
  = q((δ*_1 + δ*_0) − 1)(1 − 2η*_0)/2 + ((δ*_1 − δ*_0)(1 − 2η*_1) + 1)/2
  = (δ*_1 + δ*_0 − 1)s/2 + (δ*_1 − δ*_0)t/2 + 1/2,

where s ≡ q(1 − 2η*_0) ∈ [−1, 1] and t ≡ (1 − 2η*_1) ∈ [−1, 1]. The expression is affine in (s, t), so its minimum over [−1, 1]^2 is attained at one of the four extreme points s, t ∈ {−1, +1}; evaluating there, we get δ*_0, δ*_1, 1 − δ*_0 and 1 − δ*_1, all of which are non-negative.

The procedure derives 8 bounds: 4 cases obtained by associating f_3 with either η*_x ≥ 0 or 1 − η*_x ≥ 0, x ∈ {0, 1}, and, for each of these cases, 2 subcases obtained by assigning (δ*_0 = 1, δ*_1 = 0) to either η*_{x'} ≥ 0 or 1 − η*_{x'} ≥ 0. Now, for an illustration of one case:

Deriving a constraint for the standard IV model, example: f_3(η*_0, η*_1) ≡ η*_0 ≥ 0

Associate η*_1 ≥ 0 with the assignment (δ*_0 = 1, δ*_1 = 0) (implying that we associate η*_1 ≤ 1 with the assignment (δ*_0 = 0, δ*_1 = 1) and η*_0 ≤ 1 with (δ*_0 = 1, δ*_1 = 1)). This uniquely gives f_1(η*_0, η*_1) = η*_1 − η*_0 and f_2(η*_0, η*_1) = −η*_1 − η*_0 + 1.
The resulting expression is

δ*_0(η*_1 − η*_0) + δ*_1(−η*_1 − η*_0 + 1) + η*_0 ≥ 0,

from which we can verify that the assignment (δ*_0 = 1, δ*_1 = 1) gives η*_0 ≤ 1. Now, we need to take the expectation of the above with respect to U to obtain observables ζ and causal distributions η. However, first we need some rearrangement so that we match η*_0 with the corresponding (1 − δ*_w), and so on:

η*_1(δ*_0 − δ*_1) + η*_0(1 − δ*_0 − δ*_1) + δ*_1 ≥ 0
η*_1(δ*_0 − δ*_1) + η*_0((1 − δ*_0) + (1 − δ*_1) − 1) + δ*_1 ≥ 0
ζ*_{11.0} − ζ*_{11.1} + ζ*_{10.0} + ζ*_{10.1} − η*_0 + ζ*_{01.1} + ζ*_{11.1} ≥ 0

Taking expectations and rearranging, we have

η_0 ≤ ζ_{11.0} + ζ_{10.0} + ζ_{10.1} + ζ_{01.1},

rediscovering one of the IV bounds for η_0. Choosing instead to associate η*_1 ≥ 0 with the assignment (δ*_0 = 0, δ*_1 = 1) gives

η_0 ≤ ζ_{11.1} + ζ_{10.1} + ζ_{10.0} + ζ_{01.0}.

Basically, the effect of choosing one of the two subcases within any case is to switch ζ_{yx.w} with ζ_{yx.w'}.

A.3 Deriving Cross-W Constraints

What is left is a generalization of the above under the condition |η*_{xw} − η*_{xw'}| ≤ ϵ_w, w ≠ w', instead of η*_{xw} = η*_{xw'}. In this situation, we exploit the constraint L ≤ η*_{xw} ≤ U instead of 0 ≤ η*_{xw} ≤ 1 or L^{YU}_{xw} ≤ η*_{xw} ≤ U^{YU}_{xw}, where L ≡ min_{xw}{L^{YU}_{xw}} and U ≡ max_{xw}{U^{YU}_{xw}}. Using L^{YU}_{xw} ≤ η*_{xw} ≤ U^{YU}_{xw} directly would complicate things considerably. Also, we will not derive here the analogue of the proof of Lemma 7 for the case where (η*_0, η*_1) ∈ [L, U]^2, as it is analogous but requires a more complicated notation.

Proof of Theorem 4

We demonstrate this through two special cases.

General Model, Special Case 1: f_3(η*_{0w}, η*_{1w}) ≡ η*_{xw} − L ≥ 0

There are two modifications. First, we perform the same associations as before, but with respect to L ≤ η*_{xw} ≤ U instead of 0 ≤ η*_x ≤ 1. Second, before we take expectations, we swap some of the η*_{xw} with η*_{xw'}, up to an error ϵ_w. Following the same sequence as in the example for the IV model, we get the resulting expression (where x' ∈ {0, 1}\{x}):

δ*_w(η*_{x'w} − η*_{xw}) + δ*_{w'}(−η*_{x'w} − η*_{xw} + U + L) + η*_{xw} − L ≥ 0,

from which we can verify that the assignment (δ*_w = 1, δ*_{w'} = 1) gives U − η*_{xw} ≥ 0. Now, we need to take the expectation of the above with respect to U to obtain observables κ and causal effects ω. The difficulty now is that terms in which an η*_{·w} factor multiplies δ*_{w'} have no observable counterpart under expectation. We get around this by replacing each η*_{·w} multiplying δ*_{w'} with the corresponding η*_{·w'}, using the correction η*_{·w} ≤ η*_{·w'} + ϵ_w:

δ*_w(η*_{x'w} − η*_{xw}) + δ*_{w'}(−η*_{x'w'} + ϵ_w − η*_{xw'} + ϵ_w + U + L) + η*_{xw} − L ≥ 0
η*_{x'w}δ*_w + η*_{xw}(1 − δ*_w) − η*_{x'w'}δ*_{w'} − η*_{xw'}δ*_{w'} + δ*_{w'}(U + L + 2ϵ_w) − L ≥ 0

Now, the case x = 1 gives

η*_{0w}δ*_w + η*_{1w}(1 − δ*_w) − η*_{0w'}δ*_{w'} − η*_{1w'}δ*_{w'} + δ*_{w'}(U + L + 2ϵ_w) − L ≥ 0
η*_{0w}(1 − (1 − δ*_w)) + η*_{1w}(1 − δ*_w) − η*_{0w'}(1 − (1 − δ*_{w'})) − η*_{1w'}δ*_{w'} + δ*_{w'}(U + L + 2ϵ_w) − L ≥ 0

Taking expectations (and writing χ_w ≡ χ_{1.w} = κ_{11.w} + κ_{01.w}):

ω_{0w} − κ_{10.w} + ω_{1w} − κ_{11.w} − ω_{0w'} + κ_{10.w'} − κ_{11.w'} + χ_{w'}(U + L + 2ϵ_w) − L ≥ 0        (37)

Notice that for β = β̄ = 1, L = 0, U = 1 and ϵ_w = 0, this implies η_{xw} = η_{xw'} and (37) collapses to

η_{0w} − ζ_{10.w} + η_{1w} − ζ_{11.w} − η_{0w'} + ζ_{10.w'} − ζ_{11.w'} + δ_{w'} ≥ 0
η_{1w} ≥ ζ_{10.w} + ζ_{11.w} − ζ_{10.w'} − ζ_{01.w'},

which is one of the lower bounds one obtains under the standard IV model. The case x = 0 is analogous and gives

ω_{0w'} ≤ κ_{11.w} + κ_{10.w} + κ_{10.w'} − κ_{11.w'} + χ_{w'}(U + L + 2ϵ_w) − L        (38)

The next subcase is obtained by exchanging the assignment of (δ*_w, δ*_{w'}) to the other constraints. We obtain the inequality

δ*_{w'}(η*_{x'w} − η*_{xw}) + δ*_w(−η*_{x'w} − η*_{xw} + U + L) + η*_{xw} − L ≥ 0,

which, from an analogous sequence of steps, leads to

δ*_{w'}(η*_{x'w'} + ϵ_w − η*_{xw'} + ϵ_w) + δ*_w(−η*_{x'w} − η*_{xw} + U + L) + η*_{xw} − L ≥ 0
η*_{x'w'}δ*_{w'} − η*_{xw'}δ*_{w'} + 2δ*_{w'}ϵ_w − η*_{x'w}δ*_w + η*_{xw}(1 − δ*_w) + δ*_w(U + L) − L ≥ 0
For x = 1, this gives

η*_{0w'}δ*_{w'} − η*_{1w'}δ*_{w'} + 2δ*_{w'}ϵ_w − η*_{0w}δ*_w + η*_{1w}(1 − δ*_w) + δ*_w(U + L) − L ≥ 0
η*_{0w'}(1 − (1 − δ*_{w'})) − η*_{1w'}δ*_{w'} − η*_{0w}(1 − (1 − δ*_w)) + η*_{1w}(1 − δ*_w) + 2δ*_{w'}ϵ_w + δ*_w(U + L) − L ≥ 0

Taking expectations,

ω_{0w'} − κ_{10.w'} − κ_{11.w'} − ω_{0w} + κ_{10.w} + ω_{1w} − κ_{11.w} + 2χ_{w'}ϵ_w + χ_w(U + L) − L ≥ 0        (39)

For x = 0,

η*_{1w'}δ*_{w'} − η*_{0w'}δ*_{w'} + 2δ*_{w'}ϵ_w − η*_{1w}δ*_w + η*_{0w}(1 − δ*_w) + δ*_w(U + L) − L ≥ 0
η*_{1w'}δ*_{w'} − η*_{0w'}(1 − (1 − δ*_{w'})) − η*_{1w}δ*_w + η*_{0w}(1 − δ*_w) + 2δ*_{w'}ϵ_w + δ*_w(U + L) − L ≥ 0
κ_{11.w'} − ω_{0w'} + κ_{10.w'} − κ_{11.w} + κ_{10.w} + 2χ_{w'}ϵ_w + χ_w(U + L) − L ≥ 0
ω_{0w'} ≤ κ_{11.w'} + κ_{10.w'} − κ_{11.w} + κ_{10.w} + 2χ_{w'}ϵ_w + χ_w(U + L) − L        (40)

General Model, Special Case 2: f_3(η*_{0w}, η*_{1w}) ≡ U − η*_{xw} ≥ 0

Associate η*_{x'w} − L ≥ 0 with the assignment (δ*_w = 1, δ*_{w'} = 0) (implying that we associate U − η*_{x'w} ≥ 0 with the assignment (δ*_w = 0, δ*_{w'} = 1) and η*_{xw} − L ≥ 0 with (δ*_w = 1, δ*_{w'} = 1)). The resulting expression is

δ*_w(η*_{x'w} + η*_{xw} − U − L) + δ*_{w'}(−η*_{x'w} + η*_{xw}) + U − η*_{xw} ≥ 0.

Following the same line of reasoning as before, we get for x = 1:

ω_{0w} − ω_{0w'} − ω_{1w} − κ_{10.w} + κ_{11.w} + κ_{10.w'} + κ_{11.w'} − χ_w(U + L) + 2ϵ_wχ_{w'} + U ≥ 0        (41)

and for x = 0:

ω_{0w'} ≥ −κ_{11.w} + κ_{10.w} + κ_{11.w'} + κ_{10.w'} + χ_w(U + L) − 2ϵ_wχ_{w'} − U        (42)

With the complementary assignment, we start with the relationship

δ*_{w'}(η*_{x'w} + η*_{xw} − U − L) + δ*_w(−η*_{x'w} + η*_{xw}) + U − η*_{xw} ≥ 0,

which leads to

ω_{0w'} − ω_{0w} − ω_{1w} − κ_{10.w'} + κ_{11.w'} + κ_{10.w} + κ_{11.w} + χ_{w'}(2ϵ_w − U − L) + U ≥ 0        (43)
ω_{0w'} ≥ −κ_{11.w'} + κ_{10.w'} + κ_{11.w} + κ_{10.w} − χ_{w'}(2ϵ_w − U − L) − U        (44)

Notice that the bounds obtained are asymmetric in x, i.e., we derive different bounds for ω_{0w} and ω_{1w}. Symmetry is readily obtained by the same derivation with δ*_w interpreted as P(X = 0 | W = w, U) and x swapped with x'.

A.4 Linear Case

Our final proof refers to results introduced in Section 6.

Proof of Theorem 5

Variable s_{xx} appears only in Equation (24) and in the inequalities 0 ≤ s_{xx} ≤ 1. The intersection of these relationships is satisfiable if and only if 0 ≤ a^2 + 2a·s_{wx} ≤ 1 is satisfiable. Moreover, s_{wx} appears only in Equation (22). Solving this equation for s_{wx} and plugging the result into 0 ≤ a^2 + 2a·s_{wx} ≤ 1, we obtain 0 ≤ −a^2 + 2aρ_{wx} ≤ 1. The quadratic expression in a achieves a unique maximum at a⋆ = ρ_{wx}, implying −a⋆^2 + 2a⋆ρ_{wx} = ρ^2_{wx} ≤ 1. We can then drop the inequality −a^2 + 2aρ_{wx} ≤ 1, as it is always satisfied. The remaining inequality is a^2 − 2aρ_{wx} ≤ 0, and the set of values of a satisfying it is either the interval [2ρ_{wx}, 0] or [0, 2ρ_{wx}], depending on the sign of ρ_{wx}. This can be written as

min(0, 2ρ_{wx}) ≤ a ≤ max(0, 2ρ_{wx}).        (45)

The intersection of Equation (22) and −ϵ_{wx} ≤ s_{wx} ≤ ϵ_{wx} is satisfiable only if ρ_{wx} − ϵ_{wx} ≤ a ≤ ρ_{wx} + ϵ_{wx}. Combining this interval with (45), we obtain the inequality

max(min(0, 2ρ_{wx}), ρ_{wx} − ϵ_{wx}) ≤ a ≤ min(max(0, 2ρ_{wx}), ρ_{wx} + ϵ_{wx}),        (46)

with L_a and U_a defined as the lower and upper bounds, respectively, of the interval above. Since a now appears only in (46) and in Equation (25), and assuming s_{wy} ≥ 0, the intersection of the two is satisfiable if and only if

ρ_{xy} − U_a·s_{wy} ≤ b + cρ_{wx} + s_{xy} ≤ ρ_{xy} − L_a·s_{wy}.        (47)

Equation (31) follows from s_{xy} not appearing anywhere else except in the relationship −ϵ_{xy} ≤ s_{xy} ≤ ϵ_{xy}, and from also considering the case s_{wy} < 0. Equation (26) together with the relationship 0 ≤ s_{yy} ≤ 1 is satisfiable if and only if

0 ≤ b^2 + 2bcρ_{wx} + c^2 + 2[b(a·s_{wy} + s_{xy}) + c·s_{wy}] ≤ 1        (48)

is satisfiable. From Equation (25) we have a·s_{wy} + s_{xy} = ρ_{xy} − b − cρ_{wx}, and from Equation (23) we have s_{wy} = ρ_{wy} − bρ_{wx} − c. Making these substitutions into (48), we get 0 ≤ −b^2 − 2bcρ_{wx} − c^2 + 2(bρ_{xy} + cρ_{wy}) ≤ 1. The quadratic function in (b, c) has a unique maximum. Assuming for now that c is unconstrained, taking the derivatives of −b^2 − 2bcρ_{wx} − c^2 + 2(bρ_{xy} + cρ_{wy}) with respect to b and c and setting them to zero, we obtain the stationary point

b⋆ = (ρ_{xy} − ρ_{wx}ρ_{wy})/(1 − ρ^2_{wx}),    c⋆ = (ρ_{wy} − ρ_{wx}ρ_{xy})/(1 − ρ^2_{wx}),        (49)

and by assumption it follows that c⋆ = 0.
Plugging c⋆ = 0 into −b^2 − 2bcρ_{wx} − c^2 + 2(bρ_{xy} + cρ_{wy}), we get b(2ρ_{xy} − b) ≤ ρ^2_{xy} ≤ 1. So it is sufficient to satisfy

b^2 + 2bcρ_{wx} + c^2 − 2(bρ_{xy} + cρ_{wy}) ≤ 0.        (50)
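The interval (46) and the role of the witness independence in (49) are easy to check numerically. The sketch below is an illustration using only the quantities visible in this proof: it computes L_a and U_a from (46), and verifies that when the witness independence holds exactly (so that ρ_{wy} = ρ_{wx}ρ_{xy}) the stationary point (49) indeed gives c⋆ = 0.

```python
def a_interval(rho_wx, eps_wx):
    """Feasible interval [L_a, U_a] for the coefficient a, Equation (46)."""
    lo = max(min(0.0, 2 * rho_wx), rho_wx - eps_wx)
    hi = min(max(0.0, 2 * rho_wx), rho_wx + eps_wx)
    return lo, hi

def stationary_point(rho_wx, rho_wy, rho_xy):
    """Stationary point (b*, c*) of Equation (49)."""
    denom = 1.0 - rho_wx ** 2
    return (rho_xy - rho_wx * rho_wy) / denom, (rho_wy - rho_wx * rho_xy) / denom

# Example: the witness independence W _||_ Y | X implies rho_wy = rho_wx * rho_xy,
# hence c* = 0, as used at the end of the proof.
rho_wx, rho_xy = 0.4, 0.3
b_star, c_star = stationary_point(rho_wx, rho_wx * rho_xy, rho_xy)
print(a_interval(rho_wx, eps_wx=0.1), b_star, abs(c_star) < 1e-12)
```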
References

J. Altonji, T. Elder, and C. Taber. Selection on observed and unobserved variables: Assessing the effectiveness of Catholic schools. Journal of Political Economy, 113:151–184, 2005.
A. Balke and J. Pearl. Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association, 92:1171–1176, 1997.
W. Buntine. Theory refinement on Bayesian networks. Proceedings of the 7th Conference on Uncertainty in Artificial Intelligence (UAI 1991), pages 52–60, 1991.
W. Buntine and A. Jakulin. Applying discrete PCA in data analysis. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI 2004), pages 59–66, 2004.
Z. Cai, M. Kuroki, J. Pearl, and J. Tian. Bounds on direct effects in the presence of confounded intermediate variables. Biometrics, 64:695–701, 2008.
L. Chen, F. Emmert-Streib, and J. D. Storey. Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biology, 8:R219, 2007.
G. Cooper. A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Data Mining and Knowledge Discovery, 2, 1997.
J. Cornfield, W. Haenszel, E. Hammond, A. Lilienfeld, M. Shimkin, and E. Wynder. Smoking and lung cancer: recent evidence and a discussion of some questions. Journal of the National Cancer Institute, 22:173–203, 1959.
A. P. Dawid. Causal inference using influence diagrams: the problem of partial compliance. In P. J. Green, N. L. Hjort, and S. Richardson, editors, Highly Structured Stochastic Systems, pages 45–65. Oxford University Press, 2003.
G. Elidan. Copulas in machine learning. Copulae in Mathematical and Quantitative Finance, Lecture Notes in Statistics, pages 39–60, 2013.
D. Entner, P. O. Hoyer, and P. Spirtes. Statistical test for consistent estimation of causal effects in linear non-Gaussian models. Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS 2012), pages 364–372, 2012.
D. Entner, P. Hoyer, and P. Spirtes. Data-driven covariate selection for nonparametric estimation of causal effects. Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS 2013), pages 256–264, 2013.
R. Evans. Graphical methods for inequality constraints in marginalized DAGs. Proceedings of the 22nd Workshop on Machine Learning and Signal Processing, 2012.
N. Friedman and D. Koller. Being Bayesian about network structure: a Bayesian approach to structure discovery in Bayesian networks. Machine Learning Journal, 50:95–126, 2003.
T. Haavelmo. The statistical implications of a system of simultaneous equations. Econometrica, 11:1–12, 1943.
N. Harris and M. Drton. PC algorithm for nonparanormal graphical models. Journal of Machine Learning Research, 14:3365–3383, 2013.
K. Hirano, G. Imbens, D. Rubin, and X.-H. Zhou. Assessing the effect of an influenza vaccine in an encouragement design. Biostatistics, 1:69–88, 2000.
P. Hoff. Extending the rank likelihood for semiparametric copula estimation. Annals of Applied Statistics, 1:265–283, 2007.
C. Kleiber and A. Zeileis. Applied Econometrics with R. Springer-Verlag, 2008.
J. Kuipers, G. Moffa, and D. Heckerman. Addendum on the scoring of Gaussian directed acyclic graphical models. Annals of Statistics, 42:1689–1691, 2014.
S. Mani, G. Cooper, and P. Spirtes. A theoretical study of Y structures for causal discovery. Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI 2006), pages 314–323, 2006.
C. Manski. Identification for Prediction and Decision. Harvard University Press, 2007.
C. McDonald, S. Hiu, and W. Tierney. Effects of computer reminders for influenza vaccination on morbidity during influenza epidemics. MD Computing, 9:304–312, 1992.
C. Meek. Strong completeness and faithfulness in Bayesian networks. Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI 1995), pages 411–418, 1995.
J. Mooij and J. Cremers. An empirical study of one of the simplest causal prediction algorithms. Proceedings of the Advances in Causal Inference Workshop, UAI 2015, 2015.
T. Mroz. The sensitivity of an empirical model of married women's hours of work to economic and statistical assumptions. Econometrica, 55:765–799, 1987.
R. Nelsen. An Introduction to Copulas. Springer-Verlag, 2007.
J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.
J. Pearl. Myth, confusion, and science in causal analysis. UCLA Cognitive Systems Laboratory, Technical Report R-348, 2009.
R. Ramsahai. Causal bounds and observable constraints for non-deterministic models. Journal of Machine Learning Research, 13:829–848, 2012.
R. Ramsahai and S. Lauritzen. Likelihood analysis of the binary instrumental variable model. Biometrika, 98:987–994, 2011.
J. Ramsey, P. Spirtes, and J. Zhang. Adjacency-faithfulness and conservative causal inference. Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI 2006), pages 401–408, 2006.
T. Richardson and J. Robins. Analysis of the binary instrumental variable model. In R. Dechter, H. Geffner, and J. Y. Halpern, editors, Heuristics, Probability and Causality: A Tribute to Judea Pearl, pages 415–444. College Publications, 2010.
T. Richardson, R. Evans, and J. Robins. Transparent parameterizations of models for potential outcomes. In J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith, and M. West, editors, Bayesian Statistics 9, pages 569–610. Oxford University Press, 2011.
J. Robins, R. Scheines, P. Spirtes, and L. Wasserman. Uniform consistency in causal inference. Biometrika, 90:491–515, 2003.
P. Rosenbaum. Observational Studies. Springer-Verlag, 2002a.
P. Rosenbaum. Covariance adjustment in randomized experiments and observational studies. Statistical Science, 17(3):286–327, 2002b.
K. Rothman, S. Greenland, and T. Lash. Modern Epidemiology. Wolters Kluwer, 2008.
R. Silva and R. Evans. Causal inference through a witness protection program. Advances in Neural Information Processing Systems, 27:298–306, 2014.
P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. Cambridge University Press, 2000.
K. Steenland and S. Greenland. Monte Carlo sensitivity analysis and Bayesian analysis of smoking as an unmeasured confounder in a study of silica and lung cancer. American Journal of Epidemiology, 160:384–392, 2004.
T. VanderWeele and I. Shpitser. A new criterion for confounder selection. Biometrics, 64:1406–1413, 2011.