# Bounding Causal Effects on Continuous Outcome

Junzhe Zhang, Elias Bareinboim
Causal Artificial Intelligence Laboratory, Columbia University
{junzhez,eb}@cs.columbia.edu

Abstract

We investigate the problem of bounding causal effects from experimental studies in which treatment assignment is randomized but subject compliance is imperfect. It is well known that under such conditions the actual causal effects are not point-identifiable due to uncontrollable unobserved confounding. In their seminal work, Balke and Pearl (1994) derived the tightest bounds over the causal effects in this setting by employing a computer algebra program to derive analytic expressions. However, their approach assumes that the primary outcome is discrete and finite, and solving such a program can become intractable when high-dimensional context variables are present. In this paper, we present novel non-parametric methods to bound causal effects on a continuous outcome from studies with imperfect compliance. These bounds generalize to settings with a high-dimensional context.

Introduction

One of the most common methods for policy learning used throughout the empirical sciences is randomization of the treatment assignment. This method is considered the gold standard within many disciplines and can be traced back, at least, to Fisher (Fisher 1935) and Neyman (Neyman 1923). Whenever human subjects are at the center of the experiment, unfortunately, issues of non-compliance arise: subjects do not necessarily follow the experimental protocol and end up doing what they want. It is well understood that under such conditions confounding bias will emerge. For instance, subjects who did not comply with the treatment assignment may be precisely those who would have responded adversely to the treatment. Therefore, the actual causal effect of the treatment, when it is applied uniformly to the population, might be substantially less effective than the data reveals. To cope with this bias, analysts may resort to exploiting theoretical assumptions about the interactions between compliance and response (Wright 1928; Angrist, Imbens, and Rubin 1996).

The problem of identifying causal effects from observed data together with causal assumptions about the data-generating mechanisms, represented in the form of a directed acyclic causal diagram (Pearl 2000, Ch. 1.2), has been extensively studied in the causal inference literature. Several criteria and algorithms have been developed (Pearl 2000; Spirtes, Glymour, and Scheines 2000; Bareinboim and Pearl 2016). For example, a criterion called back-door (Pearl 2000, Ch. 3.2.2) permits one to determine whether causal effects can be obtained by covariate adjustment and subsequent inverse probability weighting. This condition is also known as conditional ignorability or unconfoundedness (Rosenbaum and Rubin 1983). Efficient estimators were developed based on the propensity score (Rosenbaum and Rubin 1983; Bang and Robins 2005) and off-policy learning (Dudík, Langford, and Li 2011; Li, Munos, and Szepesvári 2015; Munos et al. 2016; Thomas and Brunskill 2016). By and large, the combination of causal assumptions and observational data does not always allow one to point-identify the causal effect; such effects are called non-identifiable.
That is, there exists more than one parametrization of the target effect that is compatible with the same observational data and qualitative assumptions (Pearl 2000, Def. 3.2.2). A causal effect is partially identifiable if it is not identifiable, but the set of its possible values is smaller than the original parameter space. Inferring the treatment effect in partially identifiable settings has been a target of growing interest in causal inference (Balke and Pearl 1995; Chickering and Pearl 1996; Richardson et al. 2014; Cinelli et al. 2019) and, more recently, in machine learning (Kallus and Zhou 2018; Kallus, Puli, and Shalit 2018). Among these works, two approaches are often employed: (1) bounds are derived for the target effect under minimal assumptions; or (2) additional untestable assumptions are invoked under which the causal effect is identifiable, and then sensitivity analysis is conducted to assess how the target causal effect varies as the untestable assumptions are changed. This paper focuses on the bounding approach.

Robins (1989) and Manski (1990) derived the first informative bounds over causal effects from studies with imperfect compliance, under a set of non-parametric assumptions called instrumental variables. In their seminal work (Balke and Pearl 1994a, 1997), Balke and Pearl improved these earlier results by employing a computer algebra program to derive analytic expressions of the causal bounds, which are provably optimal. Despite the optimality guarantees provided in their treatment, significant challenges remain in performing partial identification of causal effects in the presence of instrumental variables. First, their approach assumes the outcome is discrete and finite, which is often not the case in practical applications. Second, in settings with a high-dimensional context, solving the formulated program is often intractable due to computational and sample complexity issues.

The goal of this paper is to overcome these challenges. We investigate the partial identification of the causal effect on a continuous outcome in the presence of instrumental variables and a high-dimensional context. More specifically, our contributions are as follows. (1) We identify a set of novel non-parametric assumptions that explicate the inherent independence relationships among the latent counterfactual variables (also called potential outcomes) when instrumental variables are present. (2) Using the proposed model, we formulate linear programs that bound the target causal effect on the continuous outcome from studies with imperfect compliance, and show that the resulting bounds are provably optimal. (3) We provide efficient estimation procedures for the derived bounds from finite observational samples with a high-dimensional context. Finally, we apply the derived causal bounds to various bandit learning algorithms (Gittins 1979), showing that they consistently improve the convergence toward the optimal treatment. Our results are validated on the International Stroke Trial data (Carolei et al. 1997). Given the space constraints, all proofs are provided in the full technical report (Zhang and Bareinboim 2020).

Preliminaries

In this section, we introduce the basic notation and definitions used throughout the paper. We use capital letters to denote variables (X) and small letters for their values (x). Let 𝒳 stand for the domain of X and |𝒳| for its dimension.
We use P(x) to represent the probability P(X = x). The basic semantical framework of our analysis rests on structural causal models (SCMs) (Pearl 2000, Ch. 7). An SCM M is a tuple ⟨U, V, F, P(u)⟩, where U is a set of exogenous (unobserved) variables and V is a set of endogenous (observed) variables. F is a set of structural functions, where f_Vi ∈ F decides the values of V_i ∈ V, taking as argument a combination of other endogenous and exogenous variables (i.e., V_i ← f_Vi(Pa_Vi, U_Vi), Pa_Vi ⊆ V, U_Vi ⊆ U). The values of U are drawn from the distribution P(u), inducing an observational distribution P(v) over V. Each SCM is associated with a causal diagram in the form of a directed acyclic graph G, where nodes represent variables and arrows stand for functional relationships (e.g., see Fig. 1). By convention, whenever clear from the context, the exogenous U are left implicit. A bi-directed arrow between V_i and V_j indicates the existence of an unobserved confounder (UC) U_k affecting both V_i and V_j, i.e., V_i ← U_k → V_j.

An intervention on a set of endogenous variables X, denoted by do(x), is an operation where the values of X are set to constants x, regardless of how they were ordinarily determined (through the functions {f_X : X ∈ X}). For an SCM M, let M_x be the modified sub-model of M under intervention do(x). The potential outcome of Y to intervention do(x), denoted by Y_x(u), is the solution for Y with U = u in the sub-model M_x; it can be read as the counterfactual sentence "the value that Y would have obtained in situation U = u, had X been x." Statistically, averaging u over the distribution P(u) leads to the interventional distribution P(y_x). For a detailed survey of structural causal models, we refer readers to (Pearl 2000, Ch. 7).

Figure 1: Causal diagram of the instrumental variable (IV) model: Z represents the (randomized) treatment assigned, X the treatment actually received, and Y the outcome; panel (a) depicts the observational regime generating P(x, y | z) and panel (b) the interventional regime defining E[Y_x].

One fundamental problem in causal inference is to estimate P(y_x) from the combination of the observational distribution P(v) and the causal diagram G. An interventional distribution P(y_x) is identifiable from G if for any pair of SCMs M1 and M2 compatible with G, P_M1(y_x) = P_M2(y_x) whenever P_M1(v) = P_M2(v) (Pearl 2000, pp. 77). In other words, P(y_x) is non-identifiable if there exists a pair of SCMs that give rise to the same P(v) and G but induce different distributions P(y_x).

New Bounds on Causal Effects

We will focus on a special type of SCM called the instrumental variable (IV) model, which represents experimental studies with imperfect compliance (Pearl 2000, Ch. 8.2). Fig. 1a shows the causal diagram of the IV model, where Z represents the (randomized) treatment assigned, X the treatment actually received, and Y the observed outcome; the exogenous variables U summarize the unknown factors about an individual subject that affect both X and Y. The values of Y are continuous, decided by a function y ← f_Y(x, u) bounded in [0, 1]. We assume Z, X are both finite. For each Z = z, the value of X is decided by an unknown mechanism x ← f_X(z, u). The data collected from the studies are summarized by the observational distribution P(x, y | z). Given P(x, y | z), we are interested in inferring the expected outcome of Y when performing a treatment do(x), i.e., the causal effect E[Y_x]. Fig. 1 graphically describes this learning setting. Unfortunately, the non-identifiability of the treatment effect E[Y_x] from the surrogate Z, in the presence of UCs between X and Y, was shown in (Bareinboim and Pearl 2012).
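To make the confounding issue concrete, the following sketch simulates one hypothetical parameterization of the IV model in Fig. 1a; the functional forms and coefficients are our own illustrative assumptions, not taken from the paper. The naive observational contrast computed from P(x, y | z) differs from the causal contrast E[Y_1] − E[Y_0], which is recoverable below only because we control the generating SCM.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical IV model (Fig. 1a): Z is randomized, the latent U confounds X and Y,
# and compliance with the assignment Z is imperfect.
z = rng.integers(0, 2, size=n)                        # randomized treatment assignment
u = rng.uniform(size=n)                               # unobserved confounder
p_x = 0.3 + 0.4 * z + 0.3 * (u > 0.5)                 # compliance depends on both z and u
x = (rng.uniform(size=n) < p_x).astype(int)           # treatment actually received

def outcome(x_val):
    # Continuous outcome in [0, 1], increasing in both the treatment and the confounder.
    return np.clip(0.2 + 0.3 * x_val + 0.5 * u + 0.05 * rng.normal(size=n), 0.0, 1.0)

y = outcome(x)

# Naive observational contrast E[Y | X=1] - E[Y | X=0]: biased by the confounder U.
naive = y[x == 1].mean() - y[x == 0].mean()
# Causal contrast E[Y_1] - E[Y_0]: computable here only because we control the SCM.
causal = (outcome(np.ones(n)) - outcome(np.zeros(n))).mean()
print(f"naive: {naive:.3f}  causal: {causal:.3f}")    # the two do not coincide
```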
To overcome this challenge, we will consider the problem of partial identification in IV models (Manski 2003). Instead of pin-pointing the target quantity, the goal of partial identification is to derive bounds on the parameter space of the causal effect E[Y_x] from the observational data P(x, y | z), called causal bounds.

Restricted Instrumental Variable Models

Let M_IV[P(x, y | z)] denote the set of IV models of Fig. 1a that are compatible with the distribution P(x, y | z); that is, for any M ∈ M_IV[P(x, y | z)], P_M(x, y | z) = P(x, y | z). We could derive causal bounds E[Y_x] ∈ [l_x, h_x] by solving the following optimization problems:

l_x = min_{M ∈ M_OBS} E_M[Y_x],   h_x = max_{M ∈ M_OBS} E_M[Y_x],   where M_OBS = M_IV[P(x, y | z)].   (1)

The challenge of solving Eq. (1) is that the parametric forms of the exogenous variables U and the structural functions F are not explicitly specified. M_IV[P(x, y | z)] could be infinitely large, making it hard to derive the bounds [l_x, h_x]. We will provide efficient methods to overcome this challenge. In particular, we propose a new non-parametric representation for the pair U, F. Let Y_𝒳 denote the vector of potential responses (Y_{x_0}, ..., Y_{x_{|𝒳|−1}}), where each x_i ∈ 𝒳, and let y_𝒳 be its realization; X_𝒵 = x_𝒵 is similarly defined. We first describe a family of IV models where U, F are fully specified from the counterfactuals X_𝒵, Y_𝒳.

Definition 1 (Restricted IV (RIV) Model). A restricted instrumental variable model is an SCM ⟨U, V, F, P(u)⟩ of Fig. 1a where V = {X, Y, Z} and U = {X_𝒵, Y_𝒳}. Given a vector x_𝒵, x_z is the element of x_𝒵 at the index z_i = z; similarly, y_x is the element of y_𝒳 at x_i = x. The values of X, Y are decided by functions f_X, f_Y ∈ F defined as:

x ← f_X(z, x_𝒵) = x_z,   y ← f_Y(x, y_𝒳) = y_x.   (2)

The Y_{x_i} are mutually independent given X_𝒵, i.e., for any x_i,

Y_{x_i} ⫫ {Y_{x_j} : x_j ≠ x_i} | X_𝒵.   (3)

At first glance, the conditional independence among Y_𝒳 in Def. 1 may be surprising, since it seems to impose additional constraints on the exogenous U of the original IV model. We will show in the sequel that this restriction in fact captures the natural properties of the optimization problem in Eq. (1). Fig. 2a shows the graphical representation of a RIV model. The square labelled with |𝒳| indicates that there are |𝒳| nodes Y_{x_i} of this kind. The counterfactuals X_𝒵, Y_𝒳 are the exogenous variables U affecting the treatment X and the outcome Y, respectively. X_𝒵 and Y_𝒳 are correlated; each potential reward node Y_{x_i} is d-separated from the other nodes Y_{x_j}, x_j ≠ x_i, given X_𝒵 (Pearl 2000, Def. 1.2.3). The counterfactual distribution P(x_𝒵, y_𝒳) can therefore be written as P(x_𝒵, y_𝒳) = P(x_𝒵) ∏_{x_i ∈ 𝒳} P(y_{x_i} | x_𝒵).

Let M_RIV[P(x, y | z)] denote the set of RIV models compatible with P(x, y | z). We will show that solving Eq. (1) is equivalent to optimizing E[Y_x] over the feasible region M_OBS = M_RIV[P(x, y | z)].

Theorem 1. Given P(x, y | z), for any IV model M1 ∈ M_IV[P(x, y | z)], there exists an IV model M2 ∈ M_RIV[P(x, y | z)] such that (s.t.) E_M1[Y_x] = E_M2[Y_x], and vice versa.

Thm. 1 says that for any IV model M of Fig. 1a inducing the observational data P(x, y | z), we can always reduce it to a RIV model in M_RIV[P(x, y | z)] while preserving its treatment effects E[Y_x]. Optimizing Eq. (1) within the feasible region M_OBS = M_RIV[P(x, y | z)] thus yields causal bounds E[Y_x] ∈ [l_x, h_x]. The sharpness of [l_x, h_x] follows immediately from Def. 1, i.e., there exist SCMs M1, M2 ∈ M_IV[P(x, y | z)] such that E_M1[Y_x] = l_x and E_M2[Y_x] = h_x.

Figure 2: Causal diagrams of (a) a RIV model where the Y_{x_i} are mutually independent given X_𝒵, and (b) an unrestricted RIV model. The square labelled with |𝒳| represents |𝒳| nodes of which only a single example Y_{x_i} is shown explicitly.
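As a concrete reading of Def. 1, consider binary Z and X. The sketch below (our own illustration, echoing the standard canonical-partition view rather than code from the paper) enumerates the four values of X_𝒵 = (X_{z=0}, X_{z=1}), the compliance types, and shows how the structural functions of Eq. (2) simply read off the coordinate selected by the realized z (or x).

```python
from itertools import product

# Def. 1 with binary Z and X: X_Z = (X_{z=0}, X_{z=1}) ranges over the four canonical
# compliance types: (0,0) never-taker, (0,1) complier, (1,0) defier, (1,1) always-taker.
X_DOM, Z_DOM = (0, 1), (0, 1)
compliance_types = list(product(X_DOM, repeat=len(Z_DOM)))

def f_X(z, x_Z):
    """x <- f_X(z, X_Z) = x_z  (Eq. 2): the received treatment is the z-th coordinate."""
    return x_Z[z]

def f_Y(x, y_X):
    """y <- f_Y(x, Y_X) = y_x  (Eq. 2): the outcome is the potential outcome of the received x."""
    return y_X[x]

for t in compliance_types:
    print(t, "->", [f_X(z, t) for z in Z_DOM])
```

Eq. (3) then adds the requirement that, conditional on this compliance type, the potential outcomes Y_{x_0} and Y_{x_1} are independent of each other.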
A Linear Program Formulation

We now turn our attention to solving the optimization problem in Eq. (1) within the feasible region M_OBS = M_RIV[P(x, y | z)]. We use the probabilities P(x_𝒵) and the products E[Y_{x_i} | x_𝒵] P(x_𝒵) as the unknown parameters. Basic probabilistic properties and y ← f_Y(x, u) ∈ [0, 1] imply:

Σ_{x_𝒵} P(x_𝒵) = 1,   0 ≤ E[Y_{x_i} | x_𝒵] P(x_𝒵) ≤ P(x_𝒵) ≤ 1.   (4)

By Def. 1, P(x, y | z) can be written as linear combinations of P(x_𝒵) and E[Y_{x_i} | x_𝒵] P(x_𝒵):

P(x | z) = Σ_{x_𝒵} 𝟙{x_z = x} P(x_𝒵),   (5)
E[Y | x, z] P(x | z) = Σ_{x_𝒵} 𝟙{x_z = x} E[Y_x | x_𝒵] P(x_𝒵),   (6)

where 𝟙{·} is an indicator function. Similarly, the causal effect E[Y_x] can be written as a linear function: E[Y_x] = Σ_{x_𝒵} E[Y_x | x_𝒵] P(x_𝒵). Eq. (1) is thus reducible to linear programs (LPs) optimizing the objective E[Y_x] subject to the probabilistic constraints in Eq. (4) and the observational constraints in Eqs. (5) and (6), i.e.,

l_x = min E[Y_x],   h_x = max E[Y_x],   subject to Eqs. (4) to (6).   (7)

Solving such a linear program leads to a valid causal bound over the expected reward E[Y_x].

Comparison with Existing Methods

The idea of modeling the exogenous variables U and the functional relationships in F through their projection onto the latent counterfactuals X_𝒵, Y_𝒳 has been explored in the literature, including the canonical partition (Balke and Pearl 1994b) and principal stratification (Frangakis and Rubin 2002). These methods can be seen as a RIV model without the independence restriction among Y_𝒳 (see Fig. 2b). For discrete X, Y, Z, representing P(x_𝒵, y_𝒳) in this unrestricted model requires a probability table of size O(|𝒳|^|𝒵| |𝒴|^|𝒳|), while a RIV model (Def. 1) requires a table of size only O(|𝒳|^{|𝒵|+1} |𝒴|), removing the exponential dependence on |𝒳|. In addition, our representation does not require a particular parametric form for the outcome (e.g., that Y is discrete). This allows one to derive causal bounds on the continuous outcome by employing standard LP methods, and the resulting bounds are provably optimal.

Contextual Settings

We now study the contextual IV (CIV) model shown in Fig. 3a, where an additional context C is observed. We denote by do(π(x | c)) a stochastic intervention where the value of the treatment is decided following a conditional distribution π(x | c) mapping from C to X. The expected reward E[Y_{π(x|c)}] induced by do(π(x | c)) is given by

E[Y_{π(x|c)}] = Σ_{x,c} E[Y_x | c] π(x | c) P(c).   (8)

We are interested in inferring the causal effect E[Y_{π(x|c)}] from the observational distribution P(x, y, c | z). The non-identifiability of this learning setting, shown in Fig. 3, has been acknowledged in the causal inference literature (Tian 2008; Correa and Bareinboim 2019). We thus consider the partial identification problem of bounding E[Y_{π(x|c)}] from P(x, y, c | z) in CIV models. Among the quantities in Eq. (8), π(x | c) and P(c) are provided. It is thus sufficient to bound the conditional causal effect E[Y_x | c]. Let M_CIV[P(x, y, c | z)] denote the set of contextual IV models that are compatible with the observational distribution P(x, y, c | z). We show that the formulation of Def. 1 is also applicable in the contextual setting.

Theorem 2. Given P(x, y, c | z) and a fixed context C = c, for any CIV model M1 ∈ M_CIV[P(x, y, c | z)], there exists an IV model M2 ∈ M_RIV[P(x, y | c, z)] such that E_M1[Y_x | c] = E_M2[Y_x], and vice versa.

Thm. 2 implies that bounding E[Y_x | c] from P(x, y, c | z) in CIV models is equivalent to bounding E[Y_x] from the observational data P(x, y | c, z) in RIV models. For any M ∈ M_CIV[P(x, y, c | z)], one could always translate it into a solution of Eq. (1) within the feasible region M_RIV[P(x, y | c, z)].
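Concretely, the LP in Eq. (7), and by Thm. 2 its per-context counterpart, can be handed to an off-the-shelf solver. The following is a minimal sketch assuming binary Z and X, using scipy's linprog; the observational numbers at the bottom are illustrative placeholders, not estimates from the paper's data.

```python
import numpy as np
from scipy.optimize import linprog

def causal_bounds(p_xz, ey_xz, target_x):
    """Bounds on E[Y_{target_x}] via the LP in Eq. (7), for binary Z and X.

    p_xz[x, z]  = P(X=x | Z=z)       (observed)
    ey_xz[x, z] = E[Y | X=x, Z=z]    (observed, Y in [0, 1])
    Decision variables: q[t] = P(X_Z = t) over the 4 compliance types t = (x_{z=0}, x_{z=1}),
    and r[x, t] = E[Y_x | X_Z = t] * P(X_Z = t)   (the unknowns of Eqs. 4-6).
    """
    types = [(0, 0), (0, 1), (1, 0), (1, 1)]
    nq, nvar = len(types), len(types) + 2 * len(types)   # 4 q's + 8 r's = 12 variables
    q_idx = lambda t: types.index(t)
    r_idx = lambda x, t: nq + x * len(types) + types.index(t)

    A_eq, b_eq = [], []
    row = np.zeros(nvar); row[:nq] = 1.0                  # sum_t q[t] = 1   (Eq. 4)
    A_eq.append(row); b_eq.append(1.0)
    for x in (0, 1):
        for z in (0, 1):
            row = np.zeros(nvar)                          # Eq. (5): P(x|z) = sum_{t: t[z]=x} q[t]
            for t in types:
                if t[z] == x:
                    row[q_idx(t)] = 1.0
            A_eq.append(row); b_eq.append(p_xz[x, z])
            row = np.zeros(nvar)                          # Eq. (6): E[Y|x,z]P(x|z) = sum_{t: t[z]=x} r[x,t]
            for t in types:
                if t[z] == x:
                    row[r_idx(x, t)] = 1.0
            A_eq.append(row); b_eq.append(ey_xz[x, z] * p_xz[x, z])

    A_ub, b_ub = [], []                                   # Eq. (4): r[x,t] <= q[t]
    for x in (0, 1):
        for t in types:
            row = np.zeros(nvar)
            row[r_idx(x, t)], row[q_idx(t)] = 1.0, -1.0
            A_ub.append(row); b_ub.append(0.0)

    c = np.zeros(nvar)                                    # objective: E[Y_x] = sum_t r[target_x, t]
    for t in types:
        c[r_idx(target_x, t)] = 1.0
    box = [(0, 1)] * nvar
    lo = linprog(c, A_ub, b_ub, A_eq, b_eq, bounds=box).fun
    hi = -linprog(-c, A_ub, b_ub, A_eq, b_eq, bounds=box).fun
    return lo, hi

# Hypothetical observational quantities (placeholders for illustration only).
p_xz = np.array([[0.7, 0.2], [0.3, 0.8]])                 # rows: x, columns: z
ey_xz = np.array([[0.343, 0.200], [0.900, 0.775]])
print(causal_bounds(p_xz, ey_xz, target_x=1))             # [l_1, h_1]
```

The decision variables are the four probabilities P(X_𝒵 = t) over compliance types together with the eight products E[Y_x | t] P(t); Eqs. (5) and (6) become equality constraints, while Eq. (4) becomes the box and ordering constraints.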
Let E[Y_x | c] ∈ [l_x(c), h_x(c)] denote the solutions of Eq. (1) with M_OBS = M_RIV[P(x, y | c, z)]. Causal bounds E[Y_{π(x|c)}] ∈ [l_π, h_π] are then computable from [l_x(c), h_x(c)] following Eq. (8).

Theorem 3. Given P(x, y, c | z), there exist CIV models M1, M2 ∈ M_CIV[P(x, y, c | z)] such that E_M1[Y_{π(x|c)}] = l_π and E_M2[Y_{π(x|c)}] = h_π.

Thm. 3 guarantees that [l_π, h_π] is optimal in CIV models. Suppose there existed a bound E[Y_{π(x|c)}] ∈ [l'_π, h'_π] strictly contained in [l_π, h_π]. By Thm. 3, we can always find CIV models M1, M2 compatible with P(x, y, c | z) whose E[Y_{π(x|c)}] lie outside [l'_π, h'_π], which is a contradiction.

The conditional bound E[Y_x | c] ∈ [l_x(c), h_x(c)] is obtainable by solving Eq. (7) with the observational constraints P(x | z), E[Y | x, z] P(x | z) in Eqs. (5) and (6) replaced by the conditional quantities P(x | c, z), E[Y | x, c, z] P(x | c, z). Approximating the solution to Eq. (7) generalizes the natural bounds of (Manski 1990) to the contextual setting.

Theorem 4. For a CIV model M, given P(x, y, c | z), max_z l_π(z) ≤ E[Y_{π(x|c)}] ≤ min_z h_π(z), where

l_π(z) = Σ_{x,c} E[Y | x, c, z] π(x | c) P(x, c | z),
h_π(z) = l_π(z) + Σ_{x,c} π(X ≠ x | c) P(x, c | z).

Figure 3: Causal diagram of the contextual instrumental variable (CIV) model: Z represents the (randomized) treatment assigned, X the treatment actually received, and Y the observed outcome; C is an additional observed context. Panel (c) depicts the observational regime generating P(x, y, c | z) and panel (d) the interventional regime defining E[Y_{π(x|c)}].

Estimation in High-Dimensional Context

The causal bounds developed so far are functions of the observational distribution, which is identifiable from the sampling process, and so they can generally be consistently estimated. However, computational and sample complexity challenges can arise when the context C is high-dimensional. In this section, we introduce robust estimation procedures to circumvent this issue.

We first assume that models ˆP(x | c, z) and Ê[Y 𝟙{X = x} | c, z] of the observational quantities P(x | c, z) and E[Y 𝟙{X = x} | c, z] are provided. Let {X_i, Y_i, C_i, Z_i}_{i=1}^n denote finite samples drawn from an observational distribution P(x, y, c, z). For a sampled instance with context C_i, we can compute the functions l_x(C_i), h_x(C_i) by solving the LPs in Eq. (7) with the observational constraints P(x | z), E[Y | x, z] P(x | z) in Eqs. (5) and (6) replaced by ˆP(x | C_i, z) and Ê[Y 𝟙{X = x} | C_i, z]. Since we consider only a fixed C = C_i, Eq. (7) can be solved efficiently regardless of the dimensionality of the context C. Thm. 2 ensures that the conditional causal effect E[Y_x | C_i] ∈ [l_x(C_i), h_x(C_i)]. We can then obtain causal bounds E[Y_{π(x|c)}] ∈ [l_π, h_π] by averaging over the finite samples. Formally, the empirical estimates ˆl_π, ˆh_π of the causal bounds l_π, h_π are given by:

ˆl_π = (1/n) Σ_{i=1}^n Σ_x π(x | C_i) l_x(C_i),   ˆh_π = (1/n) Σ_{i=1}^n Σ_x π(x | C_i) h_x(C_i).

Lemma 1. If ˆP(x | c, z) and Ê[Y 𝟙{X = x} | c, z] are models of P(x | c, z) and E[Y 𝟙{X = x} | c, z], then ˆl_π, ˆh_π are consistent estimators of l_π, h_π.

When models of P(x | c, z) and E[Y 𝟙{X = x} | c, z] are not available, one can obtain an efficient approximation of the causal bounds using empirical estimates of the natural bounds in Thm. 4. For any Z = z, let n(z) = Σ_{i=1}^n 𝟙{Z_i = z}. The estimators ˆl_π(z), ˆh_π(z) are defined as follows:

ˆl_π(z) = (1/n(z)) Σ_{i=1}^n Y_i 𝟙{Z_i = z} π(X_i | C_i),   ˆh_π(z) = ˆl_π(z) + (1/n(z)) Σ_{i=1}^n 𝟙{Z_i = z} π(X ≠ X_i | C_i).

Lemma 2. ˆl_π(z), ˆh_π(z) are consistent estimators of the functions l_π(z), h_π(z) defined in Thm. 4.

The causal effect E[Y_{π(x|c)}] can then be bounded from the finite samples {X_i, Y_i, C_i, Z_i}_{i=1}^n by the inequalities max_z ˆl_π(z) ≤ E[Y_{π(x|c)}] ≤ min_z ˆh_π(z).
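The estimators of Lem. 2 require only the observational samples and the evaluated policy. Below is a minimal sketch of our own implementation under those assumptions; `pi` is the user-supplied policy π(x | c), and the data at the bottom are random placeholders rather than IST samples.

```python
import numpy as np

def natural_bound_estimates(X, Y, C, Z, pi):
    """Empirical natural bounds from Thm. 4 / Lem. 2.

    X, Y, C, Z: 1-D arrays of n observational samples, with Y in [0, 1].
    pi(x, c):   probability of assigning treatment x under context c (the policy pi(x | c)).
    Returns (max_z l_hat(z), min_z h_hat(z)), bounding E[Y_{pi(x|c)}].
    """
    l_hats, h_hats = [], []
    for z in np.unique(Z):
        mask = (Z == z)
        n_z = mask.sum()
        w = np.array([pi(x, c) for x, c in zip(X[mask], C[mask])])  # pi(X_i | C_i)
        l_hat = np.sum(Y[mask] * w) / n_z                           # \hat{l}_pi(z)
        h_hat = l_hat + np.sum(1.0 - w) / n_z                       # \hat{h}_pi(z): add pi(X != X_i | C_i)
        l_hats.append(l_hat)
        h_hats.append(h_hat)
    return max(l_hats), min(h_hats)

# Illustrative use with random placeholder data and a uniform policy over binary treatments.
rng = np.random.default_rng(0)
n = 1000
Z, C = rng.integers(0, 2, n), rng.integers(0, 2, n)
X, Y = rng.integers(0, 2, n), rng.uniform(size=n)
print(natural_bound_estimates(X, Y, C, Z, pi=lambda x, c: 0.5))
```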
Bandit Algorithms with Causal Bounds

The causal bounds derived so far may seem uninformative, since they do not immediately identify the optimal treatment in IV models. We will show that this is not the case. More specifically, we introduce a systematic procedure to incorporate causal bounds into online bandit algorithms (Auer, Cesa-Bianchi, and Fischer 2002; Sen, Shanmugam, and Shakkottai 2018; Audibert and Bubeck 2010) for identifying the optimal treatment. Our analysis reveals that causal bounds consistently improve the performance of bandit algorithms in various learning settings.

For the IV model of Fig. 1a, we denote by µ_x the expected reward E[Y_x] of performing a treatment do(x). Let µ* = max_x µ_x and let x* denote the optimal treatment (so µ_{x*} = µ*). A bandit algorithm learns the optimal treatment through repeated episodes of experiments t = 1, ..., T. At each episode t, the algorithm performs an intervention do(X_t) and observes an outcome Y_t. The cumulative regret E[R_T] after T episodes is defined by E[R_T] = Tµ* − Σ_{t=1}^T E[Y_t], i.e., the loss due to the fact that the algorithm does not always play the optimal arm. A desirable asymptotic property of an algorithm is lim_{T→∞} E[R_T]/T = 0, meaning that the procedure converges and finds the optimal treatment x*.

We also consider the contextual IV model of Fig. 3a. Let Π be a finite set of candidate policies {π_1, ..., π_N}. Let µ_π = E[Y_{π(x|c)}] and let π* denote the optimal policy, π* = argmax_{π_i ∈ Π} µ_{π_i}. At each episode t = 1, ..., T, a bandit algorithm has access to a context C_t, picks a policy π_t, assigns a treatment X_t following π_t(x | C_t), and observes a reward Y_t. Observational data P(x, y, c | z) are provided prior to the experiments. Similarly, the cumulative regret E[R^Π_T] after T episodes is defined as E[R^Π_T] = Tµ_{π*} − Σ_{t=1}^T E[Y_t]. We will assess and compare the performance of bandit algorithms in terms of these cumulative regrets.

Causal UCB

Our methods follow the celebrated principle of optimism in the face of uncertainty (OFU). This principle leads to efficient bandit strategies in the form of upper confidence bound (UCB) algorithms (Auer, Cesa-Bianchi, and Fischer 2002). We next describe the UCB strategy in bandit models with a set of candidate policies Π (an arm choice x ∈ 𝒳 can be seen as a degenerate policy). At each round t, a UCB agent constructs a set of plausible models M_t that are consistent with the data. The agent then identifies the most favorable model in M_t and prescribes the optimal policy for the identified model. In practice, this decision is determined by concentration bounds over the possible models. An upper (lower) confidence bound U_π(t) (L_π(t)) for a policy π at time t is the maximum (minimum) expected reward of π over the models in M_t. The agent then prescribes a policy based on U_π(t), L_π(t) and observes a reward.

We now incorporate causal bounds using the OFU principle; the resulting strategy is called Causal-UCB (for short, UCBc). Let µ^M_π denote the expected reward of a policy π ∈ Π in model M. At each round t, we obtain a subset M^c_t of M_t by removing models inconsistent with the causal bounds, i.e., M^c_t = {M ∈ M_t : µ^M_π ∈ [l_π, h_π]}. The agent then prescribes a decision that is optimal for the most favorable model in the subset M^c_t.

Algorithm 1: Causal-UCB (UCBc)
1: Input: Causal bounds {[l_π, h_π]}_{π ∈ Π}.
2: for all t do
3:   For each π ∈ Π, compute Ū_π(t), L̄_π(t) as:
       Ū_π(t) = max{min{U_π(t), h_π}, l_π},   L̄_π(t) = max{min{L_π(t), h_π}, l_π},   (9)
     where U_π(t), L_π(t) are, respectively, the upper and lower confidence bounds for policy π.
4:   Play an arm do(X_t ∼ π_t) and observe Y_t.
5: end for
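The clipping step of Alg. 1 (Eq. 9) is straightforward to implement. The sketch below uses dictionaries keyed by policy (or arm), which is our own representation rather than anything prescribed in the paper; the numeric values are hypothetical.

```python
def clip_confidence_bounds(U, L, causal_bounds):
    """Eq. (9): clip the confidence interval [L_pi(t), U_pi(t)] of each policy to its causal bound [l_pi, h_pi].

    U, L: dicts mapping each policy (or arm) to its current upper/lower confidence bound.
    causal_bounds: dict mapping each policy to its causal bound (l_pi, h_pi).
    """
    U_bar, L_bar = {}, {}
    for pi, (l, h) in causal_bounds.items():
        U_bar[pi] = max(min(U[pi], h), l)
        L_bar[pi] = max(min(L[pi], h), l)
    return U_bar, L_bar

# Hypothetical example: arm "x1" has an optimistic UCB of 0.90, but its causal upper bound is 0.55,
# so its clipped index falls below the clipped lower bound of arm "x0" and it can be discarded early.
U = {"x0": 0.75, "x1": 0.90}
L = {"x0": 0.55, "x1": 0.40}
bounds = {"x0": (0.59, 0.65), "x1": (0.40, 0.55)}
print(clip_confidence_bounds(U, L, bounds))
```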
When the causal bounds are informative and significantly reduce the complexity of the search space M_t, a UCBc agent is expected to outperform the standard (non-causal) method. Alg. 1 describes an implementation of UCBc in bandit settings. At trial t, we clip the confidence bounds U_π(t), L_π(t) using the causal bound [l_π, h_π]. The agent then proceeds ordinarily with the clipped bounds Ū_π(t), L̄_π(t). Likewise, UCBc in the IV models of Fig. 1a follows Alg. 1 with π replaced by an arm x ∈ 𝒳. For the remainder of this section, we apply the UCBc strategy (Alg. 1) to state-of-the-art bandit algorithms, showing its consistent improvements across various tasks. We refer interested readers to the technical report (Zhang and Bareinboim 2020) for detailed implementations.

Multi-Armed Bandits

We start with the IV models of Fig. 1a. (Cappé et al. 2013) proposed the kl-UCB procedure, which is applicable to bandit settings with bounded rewards. It computes a confidence bound U_x(t) for each arm x using Cramér's theorem. The agent then plays the arm X_t with the largest U_x(t). Let ∆_x = µ* − µ_x and let 𝒳_- be the subset {x ∈ 𝒳 : µ_x < µ*}. kl-UCB guarantees a bound on the cumulative regret of:

E[R_T] ≤ Σ_{x ∈ 𝒳_-} (∆_x / kl(µ_x, µ*)) log(T) + o(log(T)),   (10)

where kl(µ_x, µ*) = µ_x log(µ_x/µ*) + (1 − µ_x) log((1 − µ_x)/(1 − µ*)), i.e., the Kullback-Leibler divergence between Bernoulli distributions with means µ_x and µ*.

We apply the Causal-UCB strategy (Alg. 1) to kl-UCB by replacing the policy π with an arm x ∈ 𝒳, and denote the resulting algorithm kl-UCBc. At trial t, a kl-UCBc agent pulls the arm X_t with the largest clipped confidence bound Ū_x(t), where Ū_x(t) is obtained following Eq. (9). Let 𝒳_{h_x ≥ c} denote the set {x ∈ 𝒳 : h_x ≥ c}. kl-UCBc guarantees the following asymptotic regret bound:

Theorem 5. Given E[Y_x] ∈ [l_x, h_x], the regret E[R_T] of kl-UCBc after T ≥ 3 is bounded by

E[R_T] ≤ Σ_{x ∈ 𝒳_{h_x ≥ µ*}} (∆_x / kl(µ_x, µ*)) log(T) + o(log(T)).

| SCM | IV model (Fig. 1) | IV model (Fig. 1) | Contextual IV (Fig. 3) |
|---|---|---|---|
| Task | Cumulative regret E[R_T] | Best-arm identification | Cumulative regret E[R^Π_T] |
| Standard | O(Σ_{x ∈ 𝒳_-} (∆_x / kl(µ_x, µ*)) log T) | O(Σ_{x ∈ 𝒳} ∆_x^{-2} log(δ^{-1} \|𝒳\| log(∆_x^{-2}))) | O(C λ(Π_-) M² log T) |
| Causal | O(Σ_{x ∈ 𝒳_{h_x ≥ µ*}} (∆_x / kl(µ_x, µ*)) log T) | O(Σ_{x ∈ 𝒳_{h_x ≥ µ_{1,2}}} ∆_x^{-2} log(δ^{-1} \|𝒳\| log(∆_x^{-2}))) | O(C λ(Π_{h_π ≥ µ_{π*}}) M² log T) |

Table 1: Summary of the bandit results presented in this paper. SCM gives the causal diagram of the corresponding learning setting, and Task the objective that the agent aims to optimize. Standard stands for the asymptotics of the standard UCB-style algorithms, Causal for the proposed strategy leveraging the observational distribution.

The regret bound in Thm. 5 is guaranteed to be smaller than Eq. (10) if 𝒳_{h_x ≥ µ*} is strictly contained in 𝒳_-. This result coincides with the optimal regret of B-kl-UCB (Zhang and Bareinboim 2017) when bounds over the expected reward are provided. Since ∆_x / kl(µ_x, µ*) ≤ 1/(2∆_x), the improvement of kl-UCBc is most significant when the gaps ∆_x of the arms x ∈ 𝒳_- with h_x < µ* are small, i.e., close to zero.
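For reference, here is a sketch of the kl-UCB index computed by bisection, followed by the clipping of Eq. (9) used by kl-UCBc. The exploration level follows the usual kl-UCB recipe and the numbers are illustrative, so this is not necessarily the exact variant evaluated in the paper.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    """kl(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q)) between Bernoulli means."""
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, n_pulls, t, c=0.0):
    """Largest q >= mu_hat with n_pulls * kl(mu_hat, q) <= log(t) + c*log(log(t)) (bisection)."""
    level = (np.log(t) + c * np.log(max(np.log(t), 1.0))) / n_pulls
    lo, hi = mu_hat, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if kl_bernoulli(mu_hat, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

# kl-UCBc simply clips this index to the causal bound [l_x, h_x] before selecting the arm (Eq. 9).
U_x = kl_ucb_index(mu_hat=0.62, n_pulls=40, t=500)
l_x, h_x = 0.59, 0.65                                     # hypothetical causal bound for this arm
print(U_x, max(min(U_x, h_x), l_x))
```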
Best Arm Identification

We also consider the setting of pure exploration in IV models (Mannor and Tsitsiklis 2004). Rather than looking at the cumulative regret, we are concerned with a PAC-style ("probably approximately correct") bound on the sample complexity of identifying the optimal treatment. In (Jamieson and Nowak 2014), a sampling strategy called lil'LUCB was provided, which finds the optimal arm with probability at least 1 − ((2 + ϵ)/(ϵ/2)) (δ/log(1 + ϵ))^{1+ϵ} in at most O(Σ_{x ∈ 𝒳} ∆_x^{-2} log(δ^{-1} |𝒳| log(∆_x^{-2}))) trials, for any ϵ ∈ (0, 1) and δ ∈ (0, log(1 + ϵ)/e). We apply Alg. 1 to lil'LUCB and denote the resulting algorithm lil'LUCBc. Assume (without loss of generality) that the arms are ordered such that µ_1 > µ_2 ≥ ⋯ ≥ µ_N (so µ* = µ_1). Let µ_{1,2} = (µ_1 + µ_2)/2. The following theorem provides the sample complexity analysis of lil'LUCBc.

Theorem 6. Given E[Y_x] ∈ [l_x, h_x], with probability (w.p.) at least 1 − ((2 + ϵ)/(ϵ/2)) (δ/log(1 + ϵ))^{1+ϵ}, lil'LUCBc returns the optimal treatment x* within O(Σ_{x ∈ 𝒳_{h_x ≥ µ_{1,2}}} ∆_x^{-2} log(δ^{-1} |𝒳| log(∆_x^{-2}))) samples.

Compared with lil'LUCB, the sample complexity bound in Thm. 6 is tighter if 𝒳_{h_x ≥ µ_{1,2}} is strictly contained in 𝒳.

Contextual Bandits

Finally, we consider regret minimization in the contextual IV models of Fig. 3a. (Sen, Shanmugam, and Shakkottai 2018) proposed a UCB-style algorithm for contextual bandits, called D-UCB, for the case where a set of stochastic policies π ∈ Π is provided, i.e., π(x | c) > 0. At each round t, D-UCB estimates U_π(t) for each policy π using concentration bounds of the clipped importance sampling estimator (Sen, Shanmugam, and Shakkottai 2018), and applies the policy with the highest estimate U_π(t). Let M(π_i, π_j) denote the log divergence between two arbitrary policies π_i, π_j (Sen, Shanmugam, and Shakkottai 2018, Def. 2), and let M = max_{π_i, π_j ∈ Π} M(π_i, π_j). For any Π' ⊆ Π, let the policies in Π' be ordered such that µ_{π_1} ≥ µ_{π_2} ≥ ⋯ ≥ µ_{π_n}, where n = |Π'|. For any π ∈ Π', let ∆_π = µ_{π*} − µ_π. We define the function λ(Π') = ∆_{π_n} γ(∆_{π_n}) + Σ_{i=1}^{n−1} ∆_{π_i} (γ(∆_{π_i}) − γ(∆_{π_{i+1}})), where γ(∆) = log₂(6/∆)/∆². Let Π_- denote the set of sub-optimal policies {π ∈ Π : µ_π < µ_{π*}}. D-UCB obtains an asymptotic regret bound as follows:

E[R^Π_T] ≤ C λ(Π_-) M² log(T) + o(log(T)),   (11)

where C is a constant. We apply the causal strategy (Alg. 1) to D-UCB and denote the new procedure D-UCBc. Let Π_{h_π ≥ c} be the set of sub-optimal policies {π ∈ Π_- : h_π ≥ c}. We now provide the regret bound of D-UCBc.

Theorem 7. Given E[Y_{π(x|c)}] ∈ [l_π, h_π], the regret E[R^Π_T] of D-UCBc after T ≥ 2 is bounded by

E[R^Π_T] ≤ C λ(Π_{h_π ≥ µ_{π*}}) M² log(T) + o(log(T)).   (12)

Compared with Eq. (11), the bound in Thm. 7 differs only in the hardness measure λ(Π_{h_π ≥ µ_{π*}}). The following lemma guarantees that λ(Π_{h_π ≥ µ_{π*}}) is never larger than λ(Π_-).

Lemma 3. For any Π_1 ⊆ Π_2 ⊆ Π, λ(Π_1) ≤ λ(Π_2). If ∆_{π_1} < ⋯ < ∆_{π_{|Π_-|}}, then λ(Π_1) < λ(Π_2).

Thm. 7, together with Lem. 3, says that D-UCBc improves over D-UCB if there exists some sub-optimal π with h_π < µ_{π*} (given that the µ_π, π ∈ Π, are not all equal). We summarize in Table 1 the results discussed in this section. Standard stands for the asymptotics of the standard algorithms and Causal for our strategy leveraging the causal bounds obtained from the observations. The interesting aspect is that the causal approach is guaranteed to rival the standard algorithms, even when the observations are biased towards the wrong decision. If the causal bounds are informative (e.g., h_x < µ*), our approach can eliminate a sub-optimal arm x (or policy π) early during the trials, thus outperforming the standard methods. Such improvements could be significant in more difficult instances, when the gap ∆_x between x and x* is small, e.g., close to zero.
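To get a feel for the hardness measure λ(·) and the monotonicity stated in Lem. 3, the sketch below implements γ and λ exactly as reconstructed above (the gap values are arbitrary placeholders). Dropping a sub-optimal policy, e.g. one ruled out because h_π < µ_{π*}, can only decrease λ.

```python
import numpy as np

def gamma(delta):
    # gamma(Delta) = log2(6 / Delta) / Delta^2
    return np.log2(6.0 / delta) / delta**2

def hardness(gaps):
    """lambda(Pi') for a set of sub-optimal policies with gaps Delta_pi, as defined above."""
    d = np.sort(np.asarray(gaps, dtype=float))        # Delta_{pi_1} <= ... <= Delta_{pi_n}
    lam = d[-1] * gamma(d[-1])
    lam += sum(d[i] * (gamma(d[i]) - gamma(d[i + 1])) for i in range(len(d) - 1))
    return lam

# Removing the policy with the smallest gap (the costliest one to rule out by sampling)
# strictly lowers the hardness measure, consistent with Lem. 3.
print(hardness([0.1, 0.2, 0.4]), hardness([0.2, 0.4]))
```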
Experiments: International Stroke Trials

We now use a real-world dataset to investigate the performance of the proposed bandit strategies. Specifically, we study the International Stroke Trial (IST) (Carolei et al. 1997), focusing on the effect of the aspirin allocation X on a composite score Y, a continuous value in [0, 1] predicting the likelihood of a patient's recovery. We also measure the pre-treatment attributes U, including age, gender, and conscious state. To emulate unobserved confounding, as discussed in the paper, we filter the IST data following an inclusion rule f(x | z, u) and hide some columns of U. We repeat this procedure and generate observational samples using 4 different inclusion rules {f(x | z_i, u)}_{i=1,...,4}. To evaluate the performance of the different bandit strategies, we make bootstrap estimates of the patients' true responses from the IST data. For details on the experimental setups, we refer readers to the full technical report (Zhang and Bareinboim 2020).

Figure 4: Simulations comparing solvers that are causally enhanced (kl-UCBc, lil'LUCBc, D-UCBc) and standard (kl-UCB, lil'LUCB, and D-UCB) on the International Stroke Trial data: (a) multi-armed bandits, (b) best arm identification, (c) contextual bandits. Graphs are rendered in high resolution and can be zoomed in.

Multi-Armed Bandits. The expected rewards of not giving (X = 0) and giving aspirin (X = 1) are µ_0 = 0.6201 and µ_1 = 0.6948, respectively, suggesting an increased chance of recovery from aspirin. The causal bounds µ_0 ∈ [0.5905, 0.6506] and µ_1 ∈ [0.4839, 0.7527], estimated using all n = 9650 samples, do not permit identification of the optimal treatment, since one interval is contained in the other. We deploy a kl-UCBc agent provided with these causal bounds. For comparison, we also include a standard kl-UCB agent, a kl-UCBc agent with causal bounds estimated from only 300 samples, and a kl-UCB agent warm-started with the empirical estimates of E[Y | x, z]. The results (Fig. 4a) reveal a significant difference in the cumulative regret (CR) between kl-UCB (CR = 38.63) and kl-UCBc (CR = 0.07); the 300-sample kl-UCBc agent (CR = 37.94) coincides with kl-UCB, since n = 300 samples are not sufficient for obtaining an informative estimate of the causal bounds (Thm. 5); the warm-started kl-UCB agent (CR = 373.42) performs worst among all strategies due to unobserved confounding.

Best Arm ID. We also run the lil'LUCBc agent with empirical bounds estimated from n = 9711 observations (lil'LUCBc-all) to efficiently identify the optimal treatment (X = 1). For comparison, we include the standard lil'LUCB, lil'LUCBc with empirical bounds obtained from n = 100 observations (lil'LUCBc-100), and lil'LUCB warm-started with estimates of E[Y | x] from the observations (lil'LUCBo). The stopping times T of the algorithms are compared in Fig. 4b. We immediately note a dramatic difference in the sample complexities experienced by lil'LUCBc-all (T = 4.4 × 10³) compared to lil'LUCB (T = 6.5 × 10³) and lil'LUCBc-100 (T = 6.5 × 10³). lil'LUCBo (T = 8.2 × 10³) performs worst among all algorithms.

Contextual Bandits. Suppose we now have access to a context C = {sex}. Our goal is to find the optimal treatment among two policies π_0(x | c) and π_1(x | c). We also include experiments with more involved candidate policies in the technical report (Zhang and Bareinboim 2020). We estimate causal bounds over µ_π using n = 9650 observational samples and provide them to a D-UCBc agent. For comparison, we include the standard D-UCB, a D-UCBc agent with causal bounds estimated from n = 300 samples, and D-UCB seeded with the confounded observational samples. The cumulative regrets (CR) of each strategy are measured and compared in Fig. 4c. The analysis reveals a significant difference in CR experienced by D-UCBc (CR = 0.07) compared to D-UCB (CR = 111.8) and the 300-sample D-UCBc variant (CR = 110.8). Unsurprisingly, D-UCB seeded with the confounded observations (CR = 153.9) performs worst among all strategies. These results corroborate our findings: useful information can be extracted from confounded, passively collected data to improve the performance of a learning agent.
The causal approaches (e.g., kl-UCBc) dominate the standard, non-causal methods (kl-UCB) given sufficient observational samples. When the observational sample is too small to yield informative bounds, our approaches still rival the standard methods, i.e., no negative transfer occurs.

Conclusions

In this paper, we investigated the problem of bounding causal effects from experimental studies in which treatment assignment is randomized but subject compliance is imperfect. Under such conditions, the actual causal effects are not identifiable due to uncontrollable confounding. In particular, we derived informative bounds over the causal effect and accounted for challenging issues due to a high-dimensional context and the lack of discreteness of the outcome. We incorporated these bounds into UCB-like algorithms and proved that the causal approach, leveraging observational data, consistently dominates non-causal, state-of-the-art procedures. We hope that our framework can be useful in practical settings since, even though imperfect, observational data contains causal information and is abundantly available.

Acknowledgements

The authors were partially supported by grants from NSF IIS-1704352 and IIS-1750807 (CAREER).

References

Angrist, J.; Imbens, G.; and Rubin, D. 1996. Identification of causal effects using instrumental variables (with Comments). Journal of the American Statistical Association 91(434): 444-472.
Audibert, J.-Y.; and Bubeck, S. 2010. Best arm identification in multi-armed bandits. In COLT - 23th Conference on Learning Theory, 13 p.
Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3): 235-256.
Balke, A.; and Pearl, J. 1994a. Counterfactual Probabilities: Computational Methods, Bounds, and Applications. In de Mantaras, R. L.; and Poole, D., eds., Uncertainty in Artificial Intelligence 10, 46-54. San Mateo, CA: Morgan Kaufmann.
Balke, A.; and Pearl, J. 1994b. Counterfactual Probabilities: Computational Methods, Bounds and Applications. In Proceedings of the Tenth International Conference on Uncertainty in Artificial Intelligence, UAI'94, 46-54.
Balke, A.; and Pearl, J. 1995. Counterfactuals and policy analysis in structural models. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 11-18.
Balke, A.; and Pearl, J. 1997. Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association 92(439): 1172-1176.
Bang, H.; and Robins, J. M. 2005. Doubly robust estimation in missing data and causal inference models. Biometrics 61(4): 962-973.
Bareinboim, E.; and Pearl, J. 2012. Causal inference by surrogate experiments: z-identifiability. In de Freitas, N.; and Murphy, K., eds., Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, 113-120. Corvallis, OR: AUAI Press.
Bareinboim, E.; and Pearl, J. 2016. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences 113: 7345-7352.
Cappé, O.; Garivier, A.; Maillard, O.-A.; Munos, R.; Stoltz, G.; et al. 2013. Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics 41(3): 1516-1541.
Carolei, A.; et al. 1997. The International Stroke Trial (IST): a randomized trial of aspirin, subcutaneous heparin, both, or neither among 19435 patients with acute ischaemic stroke. The Lancet 349: 1569-1581.
Chickering, D.; and Pearl, J. 1996. A clinician's apprentice for analyzing non-compliance.
In Proceedings of the Twelfth National Conference on Artificial Intelligence, Volume II, 1269-1276. Menlo Park, CA: MIT Press.
Cinelli, C.; Kumor, D.; Chen, B.; Pearl, J.; and Bareinboim, E. 2019. Sensitivity Analysis of Linear Structural Causal Models. In Chaudhuri, K.; and Salakhutdinov, R., eds., Proceedings of the 36th International Conference on Machine Learning, volume 97, 1252-1261. Long Beach, CA: PMLR.
Correa, J.; and Bareinboim, E. 2019. From Statistical Transportability to Estimating the Effect of Stochastic Interventions. In Kraus, S., ed., Proceedings of the 28th International Joint Conference on Artificial Intelligence, 1661-1667. Macao, China: International Joint Conferences on Artificial Intelligence Organization.
Dudík, M.; Langford, J.; and Li, L. 2011. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, 1097-1104. Omnipress.
Fisher, R. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd.
Frangakis, C.; and Rubin, D. 2002. Principal Stratification in Causal Inference. Biometrics 58(1): 21-29.
Gittins, J. C. 1979. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B (Methodological): 148-177.
Jamieson, K.; and Nowak, R. 2014. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In Information Sciences and Systems (CISS), 1-6. IEEE.
Kallus, N.; Puli, A. M.; and Shalit, U. 2018. Removing Hidden Confounding by Experimental Grounding. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31, 10911-10920. Curran Associates, Inc.
Kallus, N.; and Zhou, A. 2018. Confounding-Robust Policy Improvement. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31, 9289-9299. Curran Associates, Inc.
Li, L.; Munos, R.; and Szepesvári, C. 2015. Toward Minimax Off-policy Value Estimation. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics.
Mannor, S.; and Tsitsiklis, J. N. 2004. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research 5(Jun): 623-648.
Manski, C. 1990. Nonparametric bounds on treatment effects. American Economic Review, Papers and Proceedings 80: 319-323.
Manski, C. F. 2003. Partial Identification of Probability Distributions. Springer Science & Business Media.
Munos, R.; Stepleton, T.; Harutyunyan, A.; and Bellemare, M. 2016. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems.
Neyman, J. 1923. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science 5(4): 465-480.
Pearl, J. 2000. Causality: Models, Reasoning, and Inference. New York: Cambridge University Press. 2nd edition, 2009.
Richardson, A.; Hudgens, M. G.; Gilbert, P. B.; and Fine, J. P. 2014. Nonparametric bounds and sensitivity analysis of treatment effects. Statistical Science 29(4): 596.
Robins, J. 1989. The analysis of randomized and nonrandomized AIDS treatment trials using a new approach to causal inference in longitudinal studies. In Sechrest, L.; Freeman, H.; and Mulley, A., eds., Health Service Research Methodology: A Focus on AIDS, 113-159.
Washington, D.C.: NCHSR, U.S. Public Health Service.
Rosenbaum, P.; and Rubin, D. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70: 41-55.
Sen, R.; Shanmugam, K.; and Shakkottai, S. 2018. Contextual Bandits with Stochastic Experts. In International Conference on Artificial Intelligence and Statistics, 852-861.
Spirtes, P.; Glymour, C. N.; and Scheines, R. 2000. Causation, Prediction, and Search. MIT Press.
Thomas, P.; and Brunskill, E. 2016. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, 2139-2148.
Tian, J. 2008. Identifying dynamic sequential plans. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence.
Wright, P. 1928. The Tariff on Animal and Vegetable Oils. New York, NY: The MacMillan Company.
Zhang, J.; and Bareinboim, E. 2017. Transfer learning in multi-armed bandits: a causal approach. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, 1340-1346. AAAI Press.
Zhang, J.; and Bareinboim, E. 2020. Bounding Causal Effects on Continuous Outcome. URL https://causalai.net/r61.pdf. (Date accessed: 2021-02-15).