Causal Inference through a Witness Protection Program

Ricardo Silva
Department of Statistical Science and CSML, University College London
ricardo@stats.ucl.ac.uk

Robin Evans
Department of Statistics, University of Oxford
evans@stats.ox.ac.uk

One of the most fundamental problems in causal inference is the estimation of a causal effect when variables are confounded. This is difficult in an observational study because one has no direct evidence that all confounders have been adjusted for. We introduce a novel approach for estimating causal effects that exploits observational conditional independencies to suggest weak paths in an unknown causal graph. The widely used faithfulness condition of Spirtes et al. is relaxed to allow for varying degrees of path cancellation that imply conditional independencies but do not rule out the existence of confounding causal paths. The outcome is a posterior distribution over bounds on the average causal effect, obtained via a linear programming approach and Bayesian inference. We claim this approach should be used in regular practice to complement other default tools in observational studies.

1 Contribution

We provide a new methodology to bound the average causal effect (ACE) of a variable X on a variable Y. For binary variables, the ACE is defined as

E[Y | do(X = 1)] − E[Y | do(X = 0)] = P(Y = 1 | do(X = 1)) − P(Y = 1 | do(X = 0)),   (1)

where do(·) is the operator of Pearl [14], denoting distributions where a set of variables has been intervened upon by an external agent. In the interest of space, we assume the reader is familiar with the concept of causal graphs, the basics of the do operator, and the basics of causal discovery algorithms such as the PC algorithm of Spirtes et al. [22]. We provide a short summary for context in Section 2.

The ACE is in general not identifiable from observational data. We obtain upper and lower bounds on the ACE by exploiting a set of (binary) covariates, which we also assume are not effects of X or Y (justified by temporal ordering or other background assumptions). Such covariate sets are often found in real-world problems, and form the basis of most observational studies done in practice [21]. However, it is not obvious how to obtain the ACE as a function of the covariates. Our contribution modifies the results of Entner et al. [6], who exploit conditional independence constraints to obtain point estimates of the ACE, relying on assumptions that might be unstable in practice. Our modification provides a different interpretation of their search procedure, which we use to generate candidate instrumental variables [11]. The linear programming approach of Dawid [5] and Ramsahai [16] is then modified to generate bounds on the ACE by introducing constraints on some causal paths, motivated as relaxations of [6]. The new setup can be computationally expensive, so we introduce further relaxations to the linear program to generate novel symbolic bounds, and a fast algorithm that sidesteps the full linear programming optimization with some simple, message-passing-like steps.

Figure 1: (a) A generic causal graph where X and Y are confounded by some U. (b) The same system as in (a), where X is intervened upon by an external agent. (c) A system where W and Y are independent given X. (d) A system where it is possible to use faithfulness to discover that U is sufficient to block all back-door paths between X and Y. (e) Here, U itself is not sufficient.
Section 2 introduces the background of the problem and Section 3 our methodology. Section 4 discusses an analytical approximation of the main results, and a way by which this provides scaling-up possibilities for the approach. Section 5 contains experiments with synthetic and real data.

2 Background: Instrumental Variables, Witnesses and Admissible Sets

Assuming X is a potential cause of Y, but not the opposite, a cartoon of the causal system containing X and Y is shown in Figure 1(a). U represents the universe of common causes of X and Y. In control and policy-making problems, we would like to know what happens to the system when the distribution of X is overridden by some external agent (e.g., a doctor, a robot or an economist). The resulting modified system is depicted in Figure 1(b), and represents the family of distributions indexed by do(X = x): the graph in (a) has undergone a surgery that wipes out the edges into X, as originally discussed by [22] in the context of graphical models. Notice that if U is observed in the dataset, then we can obtain the distribution P(Y = y | do(X = x)) by simply calculating Σ_u P(Y = y | X = x, U = u) P(U = u) [22]. This was popularized by [14] as the back-door adjustment. In general, P(Y = y | do(X = x)) can be vastly different from P(Y = y | X = x).

The ACE is simple to estimate in a randomized trial: this follows from estimating the conditional distribution of Y given X under data generated as in Figure 1(b). In contrast, in an observational study [21] we obtain data generated by the system in Figure 1(a). If one believes all relevant confounders U have been recorded in the data, then back-door adjustment can be used, though such completeness is uncommon. By postulating knowledge of the causal graph relating the components of U, one can infer whether a measured subset of the causes of X and Y is enough [14, 23, 15]. Without knowledge of the causal graph, assumptions such as faithfulness [22] are used to infer it. The faithfulness assumption states that a conditional independence constraint in the observed distribution exists if and only if a corresponding structural independence exists in the underlying causal graph. For instance, observing the independence W ⊥ Y | X, and assuming faithfulness and the causal order, we can infer the causal graph of Figure 1(c); in none of the other graphs is this conditional independence implied. We deduce that no unmeasured confounders between X and Y exist. This simple procedure for identifying chains W → X → Y is useful in exploratory data analysis [4], where a large number of possible causal relations X → Y are unquantified but can be screened using observational data before experiments are performed. The idea is that faithfulness sometimes allows such quantities to be identified.

Entner et al. [6] generalize the discovery of chain models to situations where a non-empty set of covariates is necessary to block all back-door paths. Suppose W is a set of covariates which are known not to be effects of either X or Y, and we want to find an admissible set contained in W: a set of observed variables which we can use for back-door adjustment to get P(Y = y | do(X = x)). Entner et al.'s Rule 1 states the following:

Rule 1: If there exists a variable W ∈ W and a set Z ⊆ W\{W} such that (i) W ⊥̸ Y | Z and (ii) W ⊥ Y | Z ∪ {X}, then infer that Z is an admissible set.

A point estimate of the ACE can then be found using Z. Given that (W, Z) satisfies¹ Rule 1, we call W a witness for the admissible set Z.
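As a toy illustration of the back-door adjustment that Rule 1 licenses, the following sketch (all numbers invented for illustration) compares the adjusted estimate with the naïve conditional contrast on a distribution with the structure of Figure 1(d), treating U as observed and using it as the admissible set, with W as the witness.

```python
# Hypothetical joint distribution over binary (U, W, X, Y), factored as
# P(U) P(W) P(X | W, U) P(Y | X, U): the structure of Figure 1(d) with U observed.
p_u = {0: 0.6, 1: 0.4}
p_w = {0: 0.5, 1: 0.5}
p_x1_given_wu = {(0, 0): 0.2, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.9}  # P(X=1 | W=w, U=u)
p_y1_given_xu = {(0, 0): 0.3, (0, 1): 0.7, (1, 0): 0.5, (1, 1): 0.9}  # P(Y=1 | X=x, U=u)

def joint(u, w, x, y):
    """P(U=u, W=w, X=x, Y=y) under the toy model."""
    px1 = p_x1_given_wu[(w, u)]
    py1 = p_y1_given_xu[(x, u)]
    return (p_u[u] * p_w[w] * (px1 if x == 1 else 1 - px1)
            * (py1 if y == 1 else 1 - py1))

def p_y1_given_x(x):
    """Naive conditional P(Y=1 | X=x), which ignores confounding by U."""
    num = sum(joint(u, w, x, 1) for u in (0, 1) for w in (0, 1))
    den = sum(joint(u, w, x, y) for u in (0, 1) for w in (0, 1) for y in (0, 1))
    return num / den

def p_y1_do_x(x):
    """Back-door adjusted P(Y=1 | do(X=x)) = sum_u P(Y=1 | X=x, U=u) P(U=u)."""
    return sum(p_y1_given_xu[(x, u)] * p_u[u] for u in (0, 1))

naive_ace = p_y1_given_x(1) - p_y1_given_x(0)
adjusted_ace = p_y1_do_x(1) - p_y1_do_x(0)
print(f"naive contrast: {naive_ace:.3f}   back-door adjusted ACE: {adjusted_ace:.3f}")
```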
The model in Figure 1(c) can be identified with Rule 1, where W is the witness and Z = ∅. In this case, the so-called naïve estimator² P(Y = 1 | X = 1) − P(Y = 1 | X = 0) will provide the correct ACE. If U is observable in Figure 1(d), then it can be identified as an admissible set for witness W. Notice that in Figure 1(a), taking U as a scalar, it is not possible to find a witness, since there are no remaining variables. Also, if in Figure 1(e) our covariate set W is {W, U}, then no witness can be found, since U cannot be blocked. Hence, it is possible for a procedure based on Rule 1 to answer "I don't know whether an admissible set exists" even when a back-door adjustment would be possible if one knew the causal graph. However, using the faithfulness assumption alone one cannot do better: Rule 1 is complete for non-zero effects without more information [6].

Despite its appeal, the faithfulness assumption is not without difficulties. Even if unfaithful distributions can be ruled out as pathological under seemingly reasonable conditions [13], distributions which lie close to (but not on) a simpler model may in practice be indistinguishable from distributions within that simpler model at finite sample sizes. To appreciate these complications, consider the structure in Figure 1(d) with U unobservable. Here W is randomized but X is not, and we would like to know the ACE of X on Y³. W is sometimes known as an instrumental variable (IV), and we call Figure 1(d) the standard IV structure; if this structure is known, optimal bounds L_IV ≤ ACE ≤ U_IV can be obtained without further assumptions, using only observational data over the binary variables W, X and Y [1]. There exist distributions faithful to the IV structure which at finite sample sizes may nevertheless appear to satisfy the Markov property of the structure W → X → Y; in practice this can occur at any finite sample size [20]. The true average causal effect may lie anywhere in the interval [L_IV, U_IV] (which can be rather wide), and may differ considerably from the naïve estimate appropriate for the simpler structure. While we emphasize that this is a worst-case analysis and by itself should not rule out faithfulness as a useful assumption, it is desirable to provide a method that gives greater control over violations of faithfulness.

3 Methodology: the Witness Protection Program

The core of our idea is (i) to invert the usage of Entner's Rule 1, so that pairs (W, Z) provide an instrumental variable bounding method instead of a back-door adjustment; (ii) to express violations of faithfulness as bounded violations of local independence; and (iii) to find bounds on the ACE using a linear programming formulation.

Let (W, Z) be any pair found by a search procedure that decides when Rule 1 holds. W will play the role of an instrumental variable, instead of being discarded. A standard IV bounding procedure such as [1] can be used conditionally on each individual value z of Z, and the resulting bounds then averaged over P(Z). The lack of an edge W → Y given Z can be justified by faithfulness (as W ⊥ Y | Z ∪ {X}). For the same reason, there should be no (conditional) dependence between W and a possible unmeasured common parent of X and Y. However, assuming faithfulness itself is not interesting, as a back-door adjustment could then be obtained directly. Allowing unconstrained dependencies induced by an edge W → Y and by edges between W and U (in either direction) is also a non-starter, as all bounds will then be vacuous [16].
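As a preview of the bounding-then-averaging step, the sketch below applies, within each stratum Z = z, only two simple inequalities that hold under the standard IV structure, P(Y = 1 | do(X = x), Z = z) ≥ P(Y = 1, X = x | W = w, Z = z) and P(Y = 1 | do(X = x), Z = z) ≤ 1 − P(Y = 0, X = x | W = w, Z = z), and then averages the resulting bounds over P(Z). It is a hedged stand-in for the full linear programming machinery developed below, and every numerical input is invented for illustration.

```python
# Toy generative model with a single binary covariate Z and a latent U; all numbers are placeholders.
p_z = {0: 0.7, 1: 0.3}                      # P(Z = z)
p_u1_given_z = {0: 0.5, 1: 0.2}             # P(U = 1 | Z = z), U latent
p_x1_given_wu = {(0, 0): 0.15, (0, 1): 0.40, (1, 0): 0.60, (1, 1): 0.85}  # P(X=1 | W=w, U=u)
p_y1_given_xu = {(0, 0): 0.25, (0, 1): 0.60, (1, 0): 0.45, (1, 1): 0.90}  # P(Y=1 | X=x, U=u)

def p_yx_given_wz(y, x, w, z):
    """Observable P(Y=y, X=x | W=w, Z=z), obtained by marginalizing the latent U."""
    total = 0.0
    for u in (0, 1):
        pu = p_u1_given_z[z] if u == 1 else 1.0 - p_u1_given_z[z]
        px = p_x1_given_wu[(w, u)] if x == 1 else 1.0 - p_x1_given_wu[(w, u)]
        py = p_y1_given_xu[(x, u)] if y == 1 else 1.0 - p_y1_given_xu[(x, u)]
        total += pu * px * py
    return total

def iv_bounds_given_z(x, z):
    """Simple IV bounds on P(Y=1 | do(X=x)) within stratum Z=z:
    max_w P(Y=1, X=x | W=w, Z=z) <= ... <= min_w 1 - P(Y=0, X=x | W=w, Z=z)."""
    lower = max(p_yx_given_wz(1, x, w, z) for w in (0, 1))
    upper = min(1.0 - p_yx_given_wz(0, x, w, z) for w in (0, 1))
    return lower, upper

# Bound the ACE by bounding each interventional probability per stratum, then averaging over P(Z).
ace_lower = sum(p_z[z] * (iv_bounds_given_z(1, z)[0] - iv_bounds_given_z(0, z)[1]) for z in (0, 1))
ace_upper = sum(p_z[z] * (iv_bounds_given_z(1, z)[1] - iv_bounds_given_z(0, z)[0]) for z in (0, 1))
print(f"ACE bounds from the simple IV inequalities: [{ace_lower:.3f}, {ace_upper:.3f}]")
```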
Consider instead the (partial) parameterization in Table 1 of the joint distribution of {W, X, Y, U}, where U is latent and not necessarily a scalar. For simplicity of presentation, assume we are conditioning everywhere on a particular value z of Z, which we suppress from the notation, as it is not crucial to the developments in this section. Under this notation, the ACE is given by

η_{11} P(W = 1) + η_{10} P(W = 0) − η_{01} P(W = 1) − η_{00} P(W = 0).   (2)

¹ The work in [6] also aims at identifying zero effects, with a "Rule 2". For simplicity we assume that the effect of interest was already identified as non-zero.

² Sometimes we use the word "estimator" to mean a functional of the probability distribution instead of a statistical estimator that is a function of samples of this distribution. Context should make it clear when we refer to an actual statistic or to a functional.

³ A classical example is non-compliance: suppose W is the assignment of a patient to either drug or placebo, X is whether the patient actually took the medicine or not, and Y is a measure of health status. The doctor controls W but not X. This problem is discussed by [14] and [5].

ζ*_{yx.w} ≡ P(Y = y, X = x | W = w, U);   ζ_{yx.w} ≡ Σ_U ζ*_{yx.w} P(U | W = w) = P(Y = y, X = x | W = w)

η*_{xw} ≡ P(Y = 1 | X = x, W = w, U);   η_{xw} ≡ Σ_U η*_{xw} P(U | W = w) = P(Y = 1 | do(X = x), W = w)

δ*_w ≡ P(X = 1 | W = w, U);   δ_w ≡ Σ_U δ*_w P(U | W = w) = P(X = 1 | W = w)

Table 1: A partial parameterization of a causal DAG model over {U, W, X, Y}.

Notice that such parameters cannot be functionally independent, and this is precisely what we will exploit. We now introduce the following assumptions:

|η*_{x1} − η*_{x0}| ≤ ϵ_w   (3)

|η*_{xw} − P(Y = 1 | X = x, W = w)| ≤ ϵ_y   (4)

|δ*_w − P(X = 1 | W = w)| ≤ ϵ_x   (5)

β P(U) ≤ P(U | W = w) ≤ β̄ P(U).   (6)

Setting ϵ_w = 0 and β = β̄ = 1 recovers the standard IV structure. Further assuming ϵ_y = ϵ_x = 0 recovers the chain structure W → X → Y. Deviation from these values corresponds to a violation of faithfulness, as the premises of Rule 1 can then only be satisfied by enforcing functional relationships among the conditional probability tables of each vertex. Using this parameterization in the case ϵ_y = ϵ_x = 1, β = β̄ = 1, Ramsahai [16], extending [5], used the following linear programming construction to obtain bounds on the ACE (for now, assume that ζ_{yx.w} and P(W = w) are known constants):

1. There is a 4-dimensional polytope in which the parameters {η*_{xw}} can take values: for ϵ_w = ϵ_y = 1, this is the unit hypercube [0, 1]⁴. Find the extreme points of this polytope (up to 12 points for the case where ϵ_w > 0). Do the same for {δ*_w}.
2. Find the extreme points of the joint space {ζ*_{yx.w}} by mapping them from the points in {δ*_w} × {η*_{xw}}, since ζ*_{1x.w} = (δ*_w)^x (1 − δ*_w)^{1−x} η*_{xw} and ζ*_{0x.w} = (δ*_w)^x (1 − δ*_w)^{1−x} (1 − η*_{xw}).
3. Using the extreme points of the 12-dimensional joint space {ζ*_{yx.w}} ∪ {η*_{xw}}, find the dual polytope of this space in terms of linear inequalities. Points in this polytope are convex combinations of the extreme points, shown by [5] to correspond to marginalizations over some arbitrary P(U). This results in constraints over {ζ_{yx.w}} ∪ {η_{xw}}.
4. Maximize/minimize (2) with respect to {η_{xw}} subject to the constraints found in Step 3 to obtain upper/lower bounds on the ACE.

Allowing for the case where ϵ_x < 1 or ϵ_y < 1 is just a matter of changing the first step, where box constraints are set on each individual parameter as a function of the known P(Y = y, X = x | W = w), prior to the mapping in Step 2.
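For concreteness, a minimal sketch of this modified first step is given below: it only builds the box constraints implied by (4) and (5), clipped to [0, 1]; the band constraint (3) would then be intersected with these boxes before the extreme points are enumerated. The observed conditionals and ϵ values are placeholders.

```python
from itertools import product

# Placeholder observed conditionals; in practice these come from the data (or a posterior sample).
p_y1_given_xw = {(0, 0): 0.30, (0, 1): 0.35, (1, 0): 0.55, (1, 1): 0.60}  # P(Y=1 | X=x, W=w)
p_x1_given_w = {0: 0.40, 1: 0.70}                                         # P(X=1 | W=w)
eps_y, eps_x = 0.2, 0.2

def clip01(v):
    return max(0.0, min(1.0, v))

# From (4): each eta*_{xw} is confined to [P(Y=1|x,w) - eps_y, P(Y=1|x,w) + eps_y], clipped to [0, 1].
eta_star_box = {(x, w): (clip01(p_y1_given_xw[(x, w)] - eps_y),
                         clip01(p_y1_given_xw[(x, w)] + eps_y))
                for x, w in product((0, 1), repeat=2)}

# From (5): each delta*_w is confined to [P(X=1|w) - eps_x, P(X=1|w) + eps_x], clipped to [0, 1].
delta_star_box = {w: (clip01(p_x1_given_w[w] - eps_x), clip01(p_x1_given_w[w] + eps_x))
                  for w in (0, 1)}

print("eta* boxes:", eta_star_box)
print("delta* boxes:", delta_star_box)
```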
The constraints that result from this construction are now implicitly non-linear in P(Y = y, X = x | W = w), but at this stage this does not matter, as these quantities are treated as constants. To allow for the case β < 1 < β̄, use exactly the same procedure, but substitute every occurrence of ζ_{yx.w} in the constraints by κ_{yx.w} ≡ Σ_U ζ*_{yx.w} P(U); notice the difference between κ_{yx.w} and ζ_{yx.w}. Likewise, substitute every occurrence of η_{xw} in the constraints by ω_{xw} ≡ Σ_U η*_{xw} P(U). Instead of plugging in constants for the values of κ_{yx.w} and turning the crank of a linear programming solver, we first treat {κ_{yx.w}} (and {ω_{xw}}) as unknowns, linking them to the observables and to η_{xw} by the constraints ζ_{yx.w}/β̄ ≤ κ_{yx.w} ≤ ζ_{yx.w}/β, Σ_{yx} κ_{yx.w} = 1 and η_{xw}/β̄ ≤ ω_{xw} ≤ η_{xw}/β. Finally, the method can be easily implemented using a package such as Polymake (http://www.polymake.org) or SCDD for R. More details are given in the Supplementary Material.

In this paper we will not discuss in detail how to choose the free parameters of the relaxation. Any choice of ϵ_w ≥ 0, ϵ_y ≥ 0, ϵ_x ≥ 0, 0 ≤ β ≤ 1 ≤ β̄ is guaranteed to provide bounds that are at least as conservative as the back-door adjusted point estimator of [6], which is always covered by the bounds. Background knowledge can also be used here, once a witness and an admissible set have been suggested to the user. In Section 5 we experiment with a few choices of default parameters. To keep the focus, in what follows we discuss only computational aspects; we develop a framework for choosing relaxation parameters in the Supplementary Material, and expect to extend it in follow-up publications. As the approach provides the witness a degree of protection against faithfulness violations, using a linear program, we call this framework the Witness Protection Program (WPP).

    input:  binary data matrix D; set of relaxation parameters θ; covariate index set W; cause-effect indices X and Y
    output: a list L of pairs (witness, admissible set) contained in W
    L ← ∅
    for each W ∈ W do
        for every admissible set Z ⊆ W\{W} identified by W and θ given D do
            B ← posterior over upper/lower bounds on the ACE as given by (W, Z, X, Y, D, θ)
            if there is no evidence in B to falsify the (W, Z, θ) model then
                L ← L ∪ {B}
            end
        end
    end
    return L

Algorithm 1: The outline of the Witness Protection Program algorithm.
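A minimal executable skeleton mirroring Algorithm 1 is sketched below; the helper functions (admissible-set search, posterior bound computation, falsification check) are hypothetical stubs standing in for the components described above and detailed in the following subsection, not part of the released implementation.

```python
from typing import Iterable, List, Tuple

def find_admissible_sets(D, witness, candidates, theta) -> Iterable[Tuple[int, ...]]:
    """Stub: return candidate admissible sets Z for this witness, e.g. by deciding
    Rule 1 with Bayesian model selection. Hypothetical placeholder."""
    return []

def posterior_over_bounds(D, witness, Z, X, Y, theta):
    """Stub: posterior samples of (lower, upper) ACE bounds for the pair (witness, Z)."""
    return []

def is_falsified(bound_posterior) -> bool:
    """Stub: flag the (witness, Z, theta) model if the constrained posterior is almost
    never satisfiable (see the rejection-sampling check in Section 3.1)."""
    return False

def witness_protection_program(D, theta, covariates: List[int], X: int, Y: int):
    """Outer loop of Algorithm 1: collect the posteriors over bounds for every
    (witness, admissible set) pair that is not falsified."""
    results = []
    for witness in covariates:
        others = [c for c in covariates if c != witness]
        for Z in find_admissible_sets(D, witness, others, theta):
            bounds = posterior_over_bounds(D, witness, Z, X, Y, theta)
            if not is_falsified(bounds):
                results.append((witness, Z, bounds))
    return results
```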
3.1 Bayesian Learning

The previous section treated ζ_{yx.w} and P(W = w) as known. A common practice is to replace them by plug-in estimators (and in the case of a non-empty admissible set Z, an estimate of P(Z) is also necessary). Such models can also be falsified, as the constraints generated are typically supported by only a strict subset of the probability simplex. In principle, one could fit the parameters without constraints and test the model by directly checking the satisfiability of the inequalities at the plug-in values. However, this does not take into account the uncertainty in the estimation. For the standard IV model, [17] discuss a proper way of testing such models in a frequentist sense. Our models can be considerably more complicated. Recall that the constraints depend on the extreme points of the {ζ*_{yx.w}} parameters and, as implied by (4) and (5), these extreme points are functions of ζ_{yx.w}. Writing the constraints fully in terms of the observed distribution therefore reveals non-linear relationships. We instead approach the problem in a Bayesian way. We assume first that the dimensionality of Z is modest (say, 10 or less), as this is the case in most applications of faithfulness to causal discovery. We parameterize P(Y, X, W | Z) as a full 2 × 2 × 2 contingency table⁴.

⁴ That is, we allow for dependence between W and Y given {X, Z}, interpreting the decision of independence used in Rule 1 as being only an indicator of approximate independence.

Given that the dimensionality of the problem is modest, we assign to each three-variate distribution P(Y, X, W | Z = z) an independent Dirichlet prior, for every possible assignment of Z, constrained by the inequalities implied by the corresponding polytopes. The posterior is then an 8-dimensional constrained Dirichlet distribution, from which we obtain samples by rejection sampling, proposing from the unconstrained Dirichlet. A Dirichlet prior can also be assigned to P(Z). Using a sample from the posterior of P(Z) and a sample (for each possible value z) from the posterior of P(Y, X, W | Z = z), we obtain one sampled upper and lower bound for the ACE. The full algorithm is shown in Algorithm 1. The search procedure is left unspecified, as different existing approaches can be plugged into this step; see [6] for a discussion. In Section 5 we deal with small-dimensional problems only, using the brute-force approach of performing an exhaustive search for Z. In practice, brute force can still be valuable after a method such as discrete PCA [3] is used to reduce W\{W} to a small set of binary variables. To decide whether the premises in Rule 1 hold, we merely perform Bayesian model selection with the BDeu score [2] between the full graph {W → X, W → Y, X → Y} (conditional on Z) and the graph with the edge W → Y removed. Our falsification test in Step 5 of Algorithm 1 is a simple and pragmatic one: our initial trial of rejection sampling proposes M samples, and if more than 95% of them are rejected, we take this as an indication that the proposed model provides a bad fit. The final result is a set of posterior distributions over bounds, possibly contradictory, which should be summarized as appropriate. Section 5 provides an example.
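A minimal sketch of this rejection-sampling step is given below, assuming a generic `satisfies_constraints` predicate standing in for the polytope check of the chosen (witness, admissible set, θ) model; the counts, the trivially-true predicate and the prior strength are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def satisfies_constraints(table_yxw):
    """Hypothetical stand-in for the polytope check: return True iff the proposed
    P(Y, X, W | Z=z) table is consistent with the constraints of the chosen relaxation."""
    return True  # placeholder

def posterior_sample_constrained(counts, prior=1.0, max_tries=1000):
    """Sample one P(Y, X, W | Z=z) table from the Dirichlet posterior restricted to the
    constrained region, by proposing from the unconstrained Dirichlet and rejecting."""
    alpha = np.asarray(counts, dtype=float).ravel() + prior  # 8 cells for binary (Y, X, W)
    rejections = 0
    for _ in range(max_tries):
        proposal = rng.dirichlet(alpha).reshape(2, 2, 2)
        if satisfies_constraints(proposal):
            return proposal, rejections
        rejections += 1
    return None, rejections  # a very high rejection rate is evidence against the model

# Made-up counts n(Y=y, X=x, W=w) for one stratum Z = z.
counts = np.array([[[30, 10], [20, 15]], [[12, 25], [8, 40]]])
table, rejections = posterior_sample_constrained(counts)
print("accepted after", rejections, "rejections")
```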
4 Algebraic Bounds and the Back-substitution Algorithm

Posterior sampling is expensive within the context of Bayesian WPP: constructing the dual polytope for possibly millions of instantiations of the problem is time-consuming, even if each individual problem is small. Moreover, the numerical procedure described in Section 3 does not provide any insight into how the different free parameters {ϵ_w, ϵ_y, ϵ_x, β, β̄} interact to produce bounds, unlike the analytical bounds available in the standard IV case. [16] derives analytical bounds under (3) for a fixed numerical value of ϵ_w; we know of no previous analytical bounds given as an algebraic function of ϵ_w. In the Supplementary Material we provide a series of algebraic bounds as functions of our free parameters. Due to limited space, we show only some of the bounds in Table 2; they illustrate qualitative aspects of our free parameters.

ω_{xw} ≥ κ_{1x.w} + L^{YU}_{xw} (κ_{0x'.w} + κ_{1x'.w})   (7)

ω_{xw} ≤ 1 − (κ_{0x.w'} − ϵ_w (κ_{0x.w'} + κ_{1x.w'})) / U^{XU}_{xw'}   (8)

ω_{xw} ≤ ω_{xw'} + U^{XU}_{xw'} − κ_{1x.w'} + ϵ_w (κ_{0x'.w'} + κ_{1x'.w'})   (9)

ω_{xw} + ω_{x'w'} − ω_{x'w} ≥ κ_{1x'.w} + κ_{1x.w} − κ_{1x'.w'} + κ_{1x.w'} − χ_{xw'} (Ū − L̄ + 2ϵ_w) + L̄   (10)

Table 2: Some of the algebraic bounds found by symbolic manipulation of linear inequalities. Notation: x, w ∈ {0, 1}, with x' ≡ 1 − x and w' ≡ 1 − w the complementary values; L^{YU}_{xw} ≡ max(0, P(Y = 1 | X = x, W = w) − ϵ_y) and U^{YU}_{xw} ≡ min(1, P(Y = 1 | X = x, W = w) + ϵ_y); L^{XU}_{xw} ≡ max(0, P(X = x | W = w) − ϵ_x), with U^{XU}_{xw} defined accordingly. Finally, Ū ≡ max_{x,w} U^{YU}_{xw}, L̄ ≡ min_{x,w} L^{YU}_{xw} and χ_{xw} ≡ κ_{1x.w} + κ_{0x.w}. The full set of bounds, with proofs, can be found in the Supplementary Material.

For instance, if ϵ_y = 1 and β = β̄ = 1, then L^{YU}_{xw} = 0 and (7) collapses to η_{xw} ≥ ζ_{1x.w}, one of the original relations found by [1] for the standard IV model. Decreasing ϵ_y linearly increases L^{YU}_{xw}, tightening the corresponding lower bound in (7). If also ϵ_w = 0 and ϵ_x = 1, it follows from (8) that η_{xw} ≤ 1 − ζ_{0x.w'}. Equation (3) implies ω_{x'w'} − ω_{x'w} ≤ ϵ_w, and as such, by setting ϵ_w = 0, we have that (10) implies η_{xw} ≥ ζ_{1x.w} + ζ_{1x'.w} − ζ_{1x'.w'} − ζ_{0x.w'}, one of the most complex relationships in [1]. Further geometric intuition about the structure of the binary standard IV model is given by [19]. These bounds are not tight, in the sense that we opted not to fully exploit all possible algebraic combinations in some results, such as (10): there we use L̄ ≤ η*_{xw} ≤ Ū and 0 ≤ δ*_w ≤ 1 instead of all possible combinations resulting from (4) and (5). The proof idea in the Supplementary Material can be further refined, at the expense of clarity. Because our derivation is a further relaxation, the implied bounds are more conservative (i.e., wider).

Besides providing insight into the structure of the problem, this gives a very efficient way of checking whether a proposed parameter vector {ζ_{yx.w}} is valid, as well as of finding the bounds: use back-substitution on the symbolic set of constraints to find box constraints L_{xw} ≤ ω_{xw} ≤ U_{xw}. The proposed parameter is rejected whenever some upper bound is smaller than the corresponding lower bound, and (2) can be trivially optimized subject only to the box constraints; this is yet another relaxation, added on top of the ones used to generate the algebraic inequalities. We initialize by intersecting all algebraic box constraints (of which (7) and (8) are examples); next we refine these by scanning, in lexicographic order, relations that link one ω_{xw} to another, such as (9), tightening the bounds on ω_{xw} using the current upper and lower bounds on the other parameter where possible. We then identify constraints of the form L_{xww'} ≤ ω_{xw} − ω_{xw'} ≤ U_{xww'}, starting from −ϵ_w ≤ ω_{xw} − ω_{xw'} ≤ ϵ_w and the existing bounds, and plug these into relations over sums such as ω_{xw} + ω_{x'w'} − ω_{x'w} (as exemplified by (10)) to get refined bounds on ω_{xw} as functions of (L_{x'ww'}, U_{x'ww'}). We iterate this until convergence, which is guaranteed since bounds never widen at any iteration. This back-substitution of inequalities follows the spirit of message passing, and it is an order of magnitude more efficient than the fully numerical solution, while not increasing the width of the bounds by much. We provide evidence for this claim in the Supplementary Material. In our experiments in Section 5, the back-substitution method was used in the testing stage of WPP; after collecting posterior samples, we calculated the posterior expected value of the contingency tables and ran the numerical procedure to obtain the final tight bound⁵.
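A generic sketch of the interval-tightening loop described above is given below; the initial boxes and the pairwise relations are placeholder stand-ins for the algebraic constraints of Table 2, and relations involving three parameters, such as (10), would be handled analogously.

```python
from itertools import product

# Box constraints for omega[(x, w)], x, w in {0, 1}; placeholder numbers standing in
# for the intersection of (7)-(8)-type algebraic box bounds.
lower = {(x, w): 0.10 for x, w in product((0, 1), repeat=2)}
upper = {(x, w): 0.90 for x, w in product((0, 1), repeat=2)}

# Pairwise relations omega[i] >= omega[j] - c (placeholders; c = eps_w corresponds to
# relation (3) averaged over P(U)).
eps_w = 0.2
pair_ge = [((x, w), (x, 1 - w), eps_w) for x, w in product((0, 1), repeat=2)]

changed = True
while changed:                     # iterate to a fixed point; bounds never widen
    changed = False
    for i, j, c in pair_ge:
        new_lower = max(lower[i], lower[j] - c)   # tighten lower bound of omega[i]
        new_upper = min(upper[j], upper[i] + c)   # same relation read as omega[j] <= omega[i] + c
        if new_lower > lower[i] + 1e-12:
            lower[i], changed = new_lower, True
        if new_upper < upper[j] - 1e-12:
            upper[j], changed = new_upper, True

infeasible = any(lower[k] > upper[k] for k in lower)
print("boxes:", {k: (round(lower[k], 3), round(upper[k], 3)) for k in lower})
print("infeasible (proposal would be rejected):", infeasible)
```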
5 Experiments

We describe a set of synthetic studies, followed by one study with the influenza data discussed by [9, 18]. In the synthetic study setup, we compare our method against NE1 and NE2, two naïve point estimators defined by back-door adjustment on the whole of W and on the empty set, respectively. The former is widely used in practice, even when there is no causal basis for doing so [15]. The point estimator of [6], based solely on the faithfulness assumption, is also assessed. We generate problems where conditioning on the whole set W is guaranteed to give incorrect estimates⁶. Here, |W| = 8.

We analyze two variations: in one, it is guaranteed that at least one valid witness/admissible set pair exists; in the other, latent variables in the graph are common parents also of X and Y, so no valid witness exists. We divide each variation into two subcases: in the first, "hard" subcase, parameters are chosen (by rejection sampling) so that NE1 has a bias of at least 0.1 in the population; in the second, no such selection is made, and as such our exchangeable parameter sampling scheme makes the problem relatively easy.

We summarize each WPP bound by the posterior expected values of the lower and upper bounds. In general WPP returns more than one bound: we select the upper/lower bound corresponding to the (W, Z) pair for which the sum of the BDeu scores of the hypotheses W ⊥̸ Y | Z and W ⊥ Y | Z ∪ {X} is highest. Our main evaluation metric for an estimate is the Euclidean distance (henceforth, "error") between the true ACE and the closest point in the given estimate, whether the estimate is a point or an interval. For methods that provide point estimates (NE1, NE2, and faithfulness), this is just the absolute value of the difference between the true ACE and the estimated ACE. For WPP, the error of the interval [L, U] is zero if the true ACE lies in this interval. We report the error average and the error tail mass at 0.1, the latter meaning the proportion of cases where the error exceeds 0.1. The comparison is not straightforward, since the trivial interval [−1, 1] always has zero error according to this definition. This is a trade-off, to be set according to an agreed level of information loss, measured by the width of the resulting intervals; this is discussed in the Supplementary Material.

We run simulations at two levels of parameters: β = 0.9, β̄ = 1.1, and the same configuration except for β = β̄ = 1. The former gives somewhat wide intervals; as Manski emphasizes [11], this is the price of making fewer assumptions. For the cases where no witness exists, Entner's Rule 1 should theoretically report no solution. In [6], stringent thresholds for accepting the two conditions of Rule 1 are adopted. Instead we take a more relaxed approach, using a uniform prior on the hypothesis of independence and a BDeu prior with an effective sample size of 10. As such, due to the nature of our parameter randomization, the method will almost always (typically more than 90% of the time) propose at least one witness. Given this theoretical failure, for the problems where no exact solution exists we assess how sensitive the methods are when conclusions are drawn from approximate independencies instead of exact ones. We simulate 100 datasets for each of the four cases (hard/easy, with/without a theoretical solution), with 5000 data points per dataset and 1000 Monte Carlo samples per decision. Results are summarized in Table 3 for the case ϵ_w = ϵ_x = ϵ_y = 0.2, β = 0.9, β̄ = 1.1.

⁵ Sometimes, however, the expected contingency table given by the back-substitution method falls outside the feasible region of the fully specified linear program; this is expected to happen from time to time, as the analytical bounds are looser. In such a situation, we report the bounds given by the back-substitution samples.

⁶ In detail: we generate graphs where W ≡ {Z1, Z2, . . . , Z8}. Four independent latent variables L1, . . . , L4 are added as parents of each of {Z5, . . . , Z8}; L1 is also a parent of X, and L2 a parent of Y. L3 and L4 are each randomly assigned to be a parent of either X or Y, but not both. {Z5, . . . , Z8} have no other parents. The graph over Z1, . . . , Z4 is chosen by adding edges uniformly at random according to the lexicographic order. In consequence, using the full set W for back-door adjustment is always incorrect, as at least the four paths X ← L1 → Zi ← L2 → Y, for i = 5, 6, 7, 8, become active once Zi is conditioned upon. The conditional probabilities of a vertex given its parents are generated by a logistic regression model with pairwise interactions, whose parameters are sampled from a zero-mean Gaussian with standard deviation 10/(number of parents). Parameter values are truncated so that all conditional probabilities lie between 0.025 and 0.975.
| Case (β = 1, β̄ = 1) | NE1 | NE2 | Faith. | WPP | Width |
|---|---|---|---|---|---|
| Hard/Solvable | 0.12 / 1.00 | 0.02 / 0.03 | 0.05 / 0.05 | 0.01 / 0.01 | 0.24 |
| Easy/Solvable | 0.01 / 0.01 | 0.07 / 0.24 | 0.02 / 0.01 | 0.00 / 0.00 | 0.24 |
| Hard/Unsolvable | 0.16 / 1.00 | 0.20 / 0.88 | 0.19 / 0.95 | 0.07 / 0.25 | 0.24 |
| Easy/Unsolvable | 0.09 / 0.32 | 0.14 / 0.56 | 0.12 / 0.53 | 0.03 / 0.08 | 0.23 |

Table 3: Summary of the outcome of the synthetic studies. Each entry for a particular method is a pair (error average / error tail mass at 0.1), as explained in the main text. The last column is the median width of the WPP interval. In a similar experiment with β = 0.9, β̄ = 1.1, WPP achieves nearly zero error, with interval widths around 0.50. A much more detailed table covering many other cases is provided in the Supplementary Material.

Notice that WPP is quite stable, while the other methods have strengths and weaknesses depending on the setup. For the unsolvable cases, we average over the approximately 99% of cases where some solution was reported (in theory, no conditional independences hold and no solution should be reported), but WPP shows empirical robustness with respect to the true ACE in these cases.

Our empirical study concerns the effect of influenza vaccination on a patient being hospitalized later on with chest problems. X = 1 means the patient got a flu shot, and Y = 1 indicates that the patient was hospitalized. A negative ACE therefore suggests a desirable vaccine. The study was originally discussed by [12]. Shots were not randomized, but doctors were randomly assigned to receive a reminder letter to encourage their patients to be inoculated; this assignment is recorded as the variable GRP. This suggests the standard IV model of Figure 1(d), with W = GRP and U unobservable. Using the bounds of [1] and the observed frequencies gives the interval [−0.23, 0.64] for the ACE. WPP could not validate GRP as a witness, instead returning as the highest-scoring pair the witness DM (whether the patient had a history of diabetes prior to vaccination) with an admissible set composed of AGE (dichotomized at 60 years) and SEX. Here we excluded GRP as a possible member of an admissible set, under the assumption that it cannot be a common cause of X and Y. Choosing ϵ_w = ϵ_y = ϵ_x = 0.2 and β = 0.9, β̄ = 1.1, we obtain the posterior expected interval [−0.10, 0.17]. This does not mean the vaccine is more likely to be bad (positive ACE) than good: the posterior distribution is over bounds, not over points, and is completely agnostic about the distribution within the bounds. Notice that even though we allow for full dependence among all of our variables, the bounds are considerably tighter than in the standard IV model, owing to the weakening of hidden confounder effects postulated after observing the conditional independences. Posterior plots and a sensitivity analysis are included in the Supplementary Material; for further discussion see [18, 9].
6 Conclusion

Our framework provides a novel compromise between point estimators given by the faithfulness assumption and bounds based on instrumental variables. We believe such an approach should become a standard item in the toolbox of anyone who needs to perform an observational study. R code is available at http://www.homepages.ucl.ac.uk/~ucgtrbd/wpp. Unlike risky Bayesian approaches that put priors directly on the parameters of the unidentifiable latent variable model P(Y, X, W, U | Z), the constrained Dirichlet prior does not suffer from massive sensitivity to the choice of hyperparameters, as discussed at length by [18] and in the Supplementary Material. By focusing on bounds, WPP keeps inference more honest, providing a compromise between a method purely based on faithfulness and purely theory-driven analyses that overlook competing models suggested by independence constraints.

As future work, we will look at generalizations of the procedure beyond relaxations of the chain structure W → X → Y. Much of the machinery developed here, including Entner's Rules, can be adapted to the case where the causal ordering is unknown: the search for Y-structures [10] generalizes the chain-structure search to this case. We will also look into ways of suggesting plausible values for the relaxation parameters, already touched upon in the Supplementary Material. Finally, the techniques used to derive the symbolic bounds in Section 4 may prove useful in a more general context and complement other methods for finding subsets of useful constraints, such as the information-theoretic approach of [8] and the graphical approach of [7].

Acknowledgements. We thank McDonald, Hiu and Tierney for their flu vaccine data, and the anonymous reviewers for their valuable feedback.

References

[1] A. Balke and J. Pearl. Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association, pages 1171–1176, 1997.
[2] W. Buntine. Theory refinement on Bayesian networks. Proceedings of the 7th Conference on Uncertainty in Artificial Intelligence (UAI 1991), pages 52–60, 1991.
[3] W. Buntine and A. Jakulin. Applying discrete PCA in data analysis. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI 2004), pages 59–66, 2004.
[4] L. Chen, F. Emmert-Streib, and J. D. Storey. Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biology, 8:R219, 2007.
[5] A. P. Dawid. Causal inference using influence diagrams: the problem of partial compliance. In P. J. Green, N. L. Hjort, and S. Richardson, editors, Highly Structured Stochastic Systems, pages 45–65. Oxford University Press, 2003.
[6] D. Entner, P. Hoyer, and P. Spirtes. Data-driven covariate selection for nonparametric estimation of causal effects. JMLR W&CP: AISTATS 2013, 31:256–264, 2013.
[7] R. Evans. Graphical methods for inequality constraints in marginalized DAGs. Proceedings of the 22nd Workshop on Machine Learning and Signal Processing, 2012.
[8] P. Geiger, D. Janzing, and B. Schölkopf. Estimating causal effects by bounding confounding. Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pages 240–249, 2014.
[9] K. Hirano, G. Imbens, D. Rubin, and X.-H. Zhou. Assessing the effect of an influenza vaccine in an encouragement design. Biostatistics, 1:69–88, 2000.
[10] S. Mani, G. Cooper, and P. Spirtes. A theoretical study of Y structures for causal discovery. Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI 2006), pages 314–323, 2006.
[11] C. Manski. Identification for Prediction and Decision. Harvard University Press, 2007.
[12] C. McDonald, S. Hiu, and W. Tierney. Effects of computer reminders for influenza vaccination on morbidity during influenza epidemics. MD Computing, 9:304–312, 1992.
[13] C. Meek. Strong completeness and faithfulness in Bayesian networks. Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI 1995), pages 411–418, 1995.
[14] J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.
[15] J. Pearl. Myth, confusion, and science in causal analysis. UCLA Cognitive Systems Laboratory, Technical Report R-348, 2009.
[16] R. Ramsahai. Causal bounds and observable constraints for non-deterministic models. Journal of Machine Learning Research, pages 829–848, 2012.
[17] R. Ramsahai and S. Lauritzen. Likelihood analysis of the binary instrumental variable model. Biometrika, 98:987–994, 2011.
[18] T. Richardson, R. Evans, and J. Robins. Transparent parameterizations of models for potential outcomes. In J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith, and M. West, editors, Bayesian Statistics 9, pages 569–610. Oxford University Press, 2011.
[19] T. Richardson and J. Robins. Analysis of the binary instrumental variable model. In R. Dechter, H. Geffner, and J. Y. Halpern, editors, Heuristics, Probability and Causality: A Tribute to Judea Pearl, pages 415–444. College Publications, 2010.
[20] J. Robins, R. Scheines, P. Spirtes, and L. Wasserman. Uniform consistency in causal inference. Biometrika, 90:491–515, 2003.
[21] P. Rosenbaum. Observational Studies. Springer-Verlag, 2002.
[22] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. Cambridge University Press, 2000.
[23] T. VanderWeele and I. Shpitser. A new criterion for confounder selection. Biometrics, 64:1406–1413, 2011.