# structural_causal_bandits_with_nonmanipulable_variables__c67443d7.pdf The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19) Structural Causal Bandits with Non-Manipulable Variables Sanghack Lee, Elias Bareinboim Department of Computer Science Purdue University West Lafayette, IN 47907 {lee2995,eb}@purdue.edu Causal knowledge is sought after throughout data-driven fields due to its explanatory power and potential value to inform decision-making. If the targeted system is well-understood in terms of its causal components, one is able to design more precise and surgical interventions so as to bring certain desired outcomes about. The idea of leveraging the causal understanding of a system to improve decision-making has been studied in the literature under the rubric of structural causal bandits (Lee and Bareinboim, 2018). In this setting, (1) pulling an arm corresponds to performing a causal intervention on a set of variables, while (2) the associated rewards are governed by the underlying causal mechanisms. One key assumption of this work is that any observed variable (X) in the system is manipulable, which means that intervening and making do(X = x) is always realizable. In many real-world scenarios, however, this is a too stringent requirement. For instance, while scientific evidence may support that obesity shortens life, it s not feasible to manipulate obesity directly, but, for example, by decreasing the amount of soda consumption (Pearl, 2018). In this paper, we study a relaxed version of the structural causal bandit problem when not all variables are manipulable. Specifically, we develop a procedure that takes as argument partially specified causal knowledge and identifies the possibly-optimal arms in structural bandits with non-manipulable variables. We further introduce an algorithm that uncovers non-trivial dependence structure among the possibly-optimal arms. Finally, we corroborate our findings with simulations, which shows that MAB solvers enhanced with causal knowledge and leveraging the newly discovered dependence structure among arms consistently outperform causal-insensitive solvers. Introduction Causal knowledge is deemed highly valuable and desirable across a wide range of disciplines, including in the sciences, engineering, and businesses. The reasons for such a prominence are various. First, causal knowledge entails explanatory power, which epitomizes the very goal of science i.e., opening Nature s black box and explaining how reality works, or how it could be understood in terms of more elementary, and interpretable components. Second, and more pragmatically, purposeful agents (AI systems, Copyright c 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. policy-makers, physicians) are better equipped for designing cleaner, more precise, and more effective interventions once the underlying causal mechanisms are well-understood (Bareinboim and Pearl 2016; Pearl and Mackenzie 2018; Lee and Bareinboim 2018). For instance, economists strive to understand the root causes of poverty, which could allow the design of new policies (i.e., causal interventions) to improve the population s socioeconomic status (SES). A considerable body of evidence was accumulated for many decades, notably by the University of Chicago s Professor, and Nobel Prize laureate, James Heckman, who demonstrated the effects of early education on families SES, among other indicators (Heckman 2006). The understanding following this causal link translated to the larger support of early childhood education, and a push for new policies; see, for an example, Obama s one billion dollar investment (The White House, Office of the Press Secretary 2014). There are abundant of such cases in public health as well e.g., evidence supports that tobacco smoking is one of the determinant factors of lung cancer (Cornfield 1951; U.S. Department of Health and Human Services 2014), or obesity is responsible to shortening life expectancy (Flegal, Graubard, and Williamson 2005; Pearl 2018). Despite all the impressive results achieved so far, the treatment of how to use causal knowledge to support decisionmaking is still in its infancy. If our goal is to build a more automated, data-driven society, and intelligent systems that can reason and act autonomously, we need to move from a heuristic understanding of the interplay between causal knowledge and decision-making (which currently relies on the insights of highly skilled scientists) to a more fundamental understanding of the principles that allow one to translate causal evidence into a more robust decision-making process. In order to realize this goal, we start by highlighting the rich literature on automated decision-making in AI, including the setting known as multi-armed bandits (MABs, for short) (Robbins 1952; Lai and Robbins 1985). We ll use the MAB framework as the baseline of our analysis. In a typical MAB instance, there are multiple arms (or actions) to play at each time step. Pulling an arm at each step returns a stochastic reward from the underlying, unknown distribution. Algorithms for the bandit problem attempt to minimize the cumulative regret, while having to cope with an exploration and exploitation trade-off. This setting has been popular in experimental circles since many real-world situations can be cast as MAB instances, including clinical trials, online advertisement placement, network routing, news article recommendation, just to cite a few. Remarkably, many algorithms with attractive properties have been developed, including the celebrated Upper Confidence Bound (Garivier and Capp e 2011; Auer, Cesa-Bianchi, and Fischer 2002; Capp e et al. 2013) and Thompson sampling (Thompson 1933; Chapelle and Li 2011; Kaufmann, Korda, and Munos 2012). One critical assumption used throughout these algorithms, and their multiple extensions, is that the arms are usually considered independent. Researchers have started to explore the dependence across arms to make these solvers applicable to a broader set of applications where the independence across arms is clearly unattainable (Dani, Hayes, and Kakade 2008; Magureanu, Combes, and Proutiere 2014). Even though not explicitly acknowledged until recently, MAB solvers inherently estimate causal effects, or properties of these effects. To witness, note that (1) pulling an arm corresponds to intervening on a set of variables and setting them to specific values (from whatever natural regime the decision used to be based on, to a new compulsory policy determined by the solver); and, (2) the rewards associated with these arms are governed by the underlying causal mechanisms (Bottou et al. 2013; Bareinboim, Forney, and Pearl 2015; Lattimore, Lattimore, and Reid 2016; Zhang and Bareinboim 2017; Forney, Pearl, and Bareinboim 2017; Sen et al. 2017). Recently, Lee and Bareinboim (2018) combined these observations (i.e., that MABs are inherently causal and that the dependence across arms carries decision-making value) and introduced what they called the structural causal bandits problem (SCM-MAB), where a Structural Causal Model (SCM) (Pearl 2000) is used to describe the underlying causal system. The dependency among arms naturally arises due to the causal relationships among the endogenous and exogenous variables. Given partial knowledge about the underlying SCM, the authors characterized two structural properties computable from any SCM-MAB, i.e., i) arms with the same reward distribution using do-calculus constraints, and ii) under what topological conditions one arm can be optimal. For concreteness, we consider the SCM-MAB instance shown in Fig. 1, where Y represents the reward, X and Z represent two variables that can be manipulated, and the dashed-bidirected arrow, following standard notation, represents the existence of a latent common cause (exogenous) affecting both X and Y . There are four sets of variables that can be intervened upon in this case (intervention sets), i.e., { , {X}, {Z}, {X, Z}}, which totals 9 arms if we assume binary variables. The first property reveals that P(y|do(z)) = P(y|do(x, z)), which means that intervening on X and Z simultaneously is regarded as wasteful one should always prefer intervening on do(Z) over do(X, Z). Intuitively, all effects of X on Y are mediated by Z in this case, which means X has no effect on Y whenever we intervene on Z. The second property characterizes that maxz µz maxx µx where µx = E[Y |do(x)] holds true in any model, which implies that do(Z) should be always preferred over do(X). This implies that three arms, do( ), do(z = 0), and do(z = 1), relating to the non-dominated, Figure 1: A causal model where Y is the outcome variable. minimal intervention sets, and {Z}, are contenders to be optimal relative to the underlying SCM compatible with G.1 Such a minimal set that can be optimal when intervened upon was termed a possibly-optimal minimal intervention set, or POMIS, in (Lee and Bareinboim 2018). In many settings, however, it is not the case that every observed variable is manipulable. For the graph discussed earlier, imagine that X, Z, and Y correspond to diet, cholesterol level, and heart failure. Changing the cholesterol directly is not a conceivable physical manipulation, despite the obvious effect of cholesterol in heart failure. Given that Z is not manipulable, then intervening on X, e.g., diet, may lead to the desired outcome. Clearly, POMISs under non-manipulable variables will not be, in general, a subset of the POMISs with only manipulable variables. We ll show in the paper that do(X), which was dominated by do(Z), should be regarded as a POMIS whenever Z is not manipulable. It is critical, therefore, to be able to identify the POMISs from a causal graph under an arbitrary set of non-manipulable variables. Previous work considered identifying the POMISs based on the graph structure, which allowed a more precise decisionmaking by the corresponding MAB solver. Still, the dependence structure across arms was not fully exploited. More sophisticated relations among arms can be learned through the do-calculus (Pearl 2000, Ch. 3). For instance, the identity P(y|do(x)) = P x P(y|z, x )P(x ) holds in Fig. 1, which is known as the front-door formula. The front-door implies that the distribution of the arm do(X = x) can be learned from do( ), i.e., without having to directly intervene on X. More generally, an interventional distribution can be expressed in terms of the other distributions, which implies new learning opportunities we aim to leverage. Specifically, our contributions are as follow: 1. We start by formalizing the SCM-MAB problem under non-manipulability constraints using the language of structural causality. We formally characterize POMIS with manipulability constraints, i.e., the collection of interventions that are possibly optimal. 2. We develop a new algorithm that derives an expression for the arm s distribution given an arbitrary set of interventional distributions (other arms). We use this result to enhance standard MAB algorithms with causal capabilities, which will now exploit the structural properties of the underlying causal graph and the relations across interventional distributions, improving their efficiency. Simulations corroborate our intuition solvers that leverage causal knowledge can operate orders of magnitude more data-efficiently than their non-causal counterparts. 1Note that this implies that not intervening, and letting the system operate in its natural state, can yield an optimal reward. Preliminaries We follow the convention in the field and use a capital letter as a variable, X, and the corresponding lower case, x, as its realization. Bold face is used for a set of variables or values, X or x, blackboard bold for a set of sets, X. We denote by XX the domain of X. We also use caligraphies for graphs and models. The basic semantical framework of our analysis relies on Structural Causal Models (SCM, Pearl, 2000), i.e., Definition 1 (SCM). A structural causal model (SCM) M is a 4-tuple U, V, F, P(U) where: U is a set of exogenous (unobserved) variables, which are determined by factors outside of the model; V is a set {Vi}n i=1 of endogenous (observed) variables that are determined by variables in U V; F is a set of structural functions {fi}n i=1 where each fi is a process by which Vi is assigned a value vi fi(pai, ui) in response to the current values of PAi V and Ui U; P(U) is a distribution over the exogenous variables U. A SCM induces observational and interventional probability distributions, P(v) = P i P(vi|pai, ui)P(u), and P(v\t|do(t)) = P i|Vi T P(vi|pai, ui)P(u). For short, we write P(y|do(x)) as Px(y). Each SCM is associated with a causal graph G = V, E , where there are two types of edges in E directed edges such as Vi Vj indicating direct functional dependence, i.e., Vi is used to define fj F; bidirected edges such as Vi Vj indicating the existence of an unobserved confounder (UC, for short), i.e., there exists U U which appears in both fi and fj. We use family relations, pa, ch, an, de to denote parents, children, ancestors, and descendants of its argument. With Pa, Ch, An, De, we include the arguments, e.g., An(W) = an(W) {W}. Given a set of variables as argument, we use the union of individual outputs, e.g., An(W) = S W W An(W). Note that pa(Vi) = PAi. We denote by GX a subgraph where edges onto X are removed. The do-calculus (Pearl 1995) consists of three rules relating observational and interventional distributions that obey the topology of the causal graph G. For example, Px,z(y) = Pz(y) (Fig. 1) can be inferred due to (Y X|Z)GX,Z, which is a d-separation statement between Y and X given Z in the subgraph GX,Z. We refer readers to (Pearl 2000, Sec. 3.4) for a more detailed discussion on SCMs and do-calculus. A K-armed bandit problem consists of K independent arms such that their reward distributions are independent to each other. In this setting, the task is often minimizing Reg T , the cumulative regret after T rounds. This is the difference between a maximum expected cumulative reward and an expected cumulative reward of a MAB algorithm, Reg T = Tµ PT t=1 E[Yat], where at is the arm played at time t, Yat is a random variable associated with the arm, and µ is the optimal expected reward. In a SCM-MAB setting, each arm corresponds to intervening on a set of variables. Given G and Y , all arms are {x XX | X V\{Y }} with P(Yx), the distribution of a reward variable with arm x, is equivalent to an interventional distribution Px(Y ), and µx = E [Y |do(x)]. We denote by µx the best expected reward by intervening on X, i.e., µx = maxx XX µx. SCM-MAB with Non-manipulability In this section, we generalize SCM-MABs by allowing nonmanipulability constraints, and then characterize the POMIS under such constraints. We start by denoting a set of nonmanipulable variables by N V \ {Y }, where the reward variable Y is assumed to be non-manipulable. For simplicity, we override the definitions of MIS and POMIS by incorporating N into them, and limiting the intervention set X to be X V\{Y }\N instead of X V\{Y }. First, we define Minimal Intervention Sets, which represent a non-redundant set of intervention sets. Definition 2 (Minimal Intervention Set (MIS)). Given G, Y , N , a set of variables X V \ {Y } \ N is said to be a minimal intervention set if there is no X X such that µx = µx for every SCM conforming to G where x XX that is consistent with x. We denote by MN G,Y a set of MISs given G, Y , N where we omit N if N = . The following relationship can be easily derived from the definition, MN G,Y = {W MG,Y | W N = }. MISs can be identified by checking whether any two probability distributions are equal (through Rule 3 of do-calculus). Next, we define a subset of the MISs that can lead to an optimal reward. Definition 3 (Possibly-Optimal Minimal Intervention Set (POMIS)). Given G, Y , N , let X MN G,Y . If there exists a SCM conforming to G such that µx > W MN G,Y \{X}µw , then X is a possibly-optimal minimal intervention set with respect to G, Y , N . We similarly denote by PN G,Y , a set of POMISs given G, Y , N . Lee and Bareinboim (2018) introduced two key concepts that will be instrumental to our analysis, i.e., minimal unobserved confounders territory (MUCT) and interventional border (IB). MUCT is a minimal set of variables in An(Y ), which includes Y and is closed under descendants and unobserved confounders given a causal graph. For example, the MUCT in Fig. 2a includes all the variables as B is confounded with Y , A is confounded with B, and C is a descendant of B (or A). The IB is defined as the parents of the MUCT excluding the MUCT itself, which is an empty set in this case. With GA, the MUCT is {B, C, Y } and the IB is {A} as shown in Fig. 2b. It was shown that X V\{Y } is a POMIS if and only if IB in GX is X. For example, {A, C} is a POMIS in Fig. 2d while {B} is not in Fig. 2c, which indicates that intervening on {A, C} is preferred to {B}. Characterizing POMIS with Non-manipulability Identifying all the POMISs given non-manipulable variables is non-trivial as the unconstrained POMIS (i.e., N = ) cannot be, in general, applicable by filtering out intervention sets containing only manipulable variables, i.e., G,Y ,N PN G,Y = {X P G,Y | X N = }. This contrasts with MISs in which a set of constrained MISs is a mere subset of unconstrained MISs. Although the equality does not hold in general, the feasible subset of unconstrained POMISs is related to POMIS with constraints as shown below: Proposition 1. If X PG,Y and X N= , then X PN G,Y . Figure 2: MUCT (thick, red) and IB (blue) on manipulated graphs. Given G and Y , POMISs are , {A}, and {A, C}. Simply put, if an arm do(x) is optimal among all arms without constraints, then the arm is still optimal among the manipulable subset of arms. However, this proposition only answers the intersection between PG,Y and PN G,Y . At this point, the natural question is how to systematically find the POMISs that are not unconstrained POMISs. Recall that in the front-door graph, we noted that {X} is a POMIS when {Z} is non-manipulable, while this isn t the case in the unconstrained setting. To explain this phenomenon, we note that X, which affects Y through Z, can be seen as directly connected to Y in the constrained setting; clearly, Z should be disregarded since it no longer can disturb the pathway from X to Y . Focusing only on a subset of variables in a causal graph and their causal relationship lies at the heart of causal modeling. Latent projection (projection, for short) (Verma and Pearl 1990; Verma 1992) is an operation that induces a causal graph over a subset of the variables from the original graph while maintaining some key topological properties of the original causal structure (Tian 2002). In general, the causal graph at hand is already the result of the projection of some unknown, more involved causal structure. For example, the front-door graph (Fig. 1) with the cholesterol example might be the projection of a larger graph such as in Fig. 3, where the shaded nodes were marginalized out. At an intuitive level, without constraints, POMISs in the front-door graph can be understood as POMISs in the larger graph where the shaded nodes are not manipulable. It would be natural to ask whether we can obtain PN G,Y indirectly from PH,Y , where H is the projection of G onto V\N. We formally describe how a projection of a causal graph G onto V\N is constructed next. Let ˆG be a DAG that explicitly represents the unobserved confounders. We initialize a graph H = V \ N, , then proceed to add 1. a directed edge between Vi and Vj if Vi Vj G or there exists a directed path from Vi to Vj where all nonend nodes in the path between them are in N. 2. a bidirected edge between Vi and Vj if Vi Vj G; or there exists directed paths from an unobserved confounder to Vi and Vj in ˆG where all non-end nodes are in N. Let G[V ] be a causal graph obtained by projecting G onto Figure 3: A causal graph where its projection onto {X, Y , Z} yields the front-door graph (Fig. 1). Figure 4: Causal graphs after marginalizing out A, B, and C from Fig. 2a, respectively, with added edges in thick, red. V . We will show PN G,Y = PH,Y through two propositions that guarantees that the optimality of an arm will be preserved during i) the projection from G to H, and ii) the other way around. Proposition 2. Given a SCM M1 = M = V, U, F, P(U) , there exists a SCM M2 = M[V\N] = V \ N, U, F , P(U) such that P 1 x(y)=P 2 x(y) for any X, Y V\N and Y = . Proof. Let M = M[V\{W }] where W N. The functions {f X}X ch(W )G can be used without modifications in M . We modify functions of the children of W. For every Q ch(W)G, we devise f Q with f Q and f W . By the projection s construction, PA Q = (PAQ\{W}) PAW and let U Q = UQ UW so that f Q takes pa W and u W as argument in addition to pa Q\{W} and u Q, which conforms to G[V\{W }]. Hence, f Q is simply f Q with w computed inside, f Q pa Q, u Q = f Q pa Q\{W}, f W (pa W , u W ), u Q . This guarantees that M[V\{W }] and M yield the same observational and interventional distributions over V \ {W}. The above procedure can be iteratively applied to the rest of variables in N to prove the equivalence of M[V\N] and M with respect to the distributions they yield over V \ N. Fig. 4a is an example where marginalizing out A from Fig. 2a induces bidirected edges between B and C (due to the UC between A and B and A C), between C and Y since C A Y . 2 Based on Prop. 2, C and Y take the UC between A and B to compute a (the value A would take) inside. This explains all three bidirected edges in Fig. 4a. Proposition 3. Given a causal diagram G, let H = G[V\N]. For a SCM M1 = M[V\N] = V \ N, U, F , P(U) conforming to H, there exists a SCM M2 = M = V, U, F, P(U) that conforms to G such that P 1 x(y) = P 2 x(y), for any X, Y V \ N and Y = . 2Although B A Y implies a bidirected edge between B and Y , we do not represent multiple edges of the same type. Proof. Consider a simple case where constructing M of G from M of G[V\{W }]. Without loss of generality, assume that U U appears in at most two functions. Otherwise, we can decompose U into multiple U s and change U and P(U) accordingly. We similarly reuse f X for any X ch(W)G to define F. We need to define f W and f Q for Q Q = ch(W)G. By definition, endogenous variables to be used in f W and {f Q}Q Q are pa(W)G and pa(Q)G, respectively. Next, we determine Ui for V Q {W} as follows. First, any U U that appears only in a variable V (i.e., U U V ) will be in UV . Now consider U used in two variables, Vi and Vj, in G[V\{W }]. If a bidirected edge between Vi and Vj exists in G, every U U i U j will also be included in Ui and Uj. Otherwise, there are two (non-mutually exclusive) cases: i) both Vi and Vj are children of W in G or ii) Vi is confounded with W and Vj is a child of W in G (or vice versa). In both cases, simply put U into UW . This construction guarantees that all U are modeled in M conforming to G. Then, we define f W be any one-to-one function so that its argument can be passed to its children through f 1 W (w). Finally, f Q is devised to reuse f Q with f 1 W as f Q pa Q, u Q = f Q pa Q \ {W}, f 1 W (w), u Q . The procedure can be sequentially applied to construct M for G from M for H. We now revisit Fig. 4a for another example. We focus on how the unobserved variables will be assigned. Let UCs be UBC, UCY , and UBY . Unobserved variables other than UCs will be reused as described above. First, f B = f B which already takes UBC and UBY . f Y takes UBY since B Y in G. Since two red bidirected edges do not appear in G, f A will take both UBC and UCY , and f C and f Y become irrelevant to two red UCs. Then, we can observe that A is confounded with B via UBC, and B and Y are confounded with UBY . However, UCY will be served as a variable-specific hidden variable for A, which we do not explicitly represent. The construction conforms to G. Theorem 4. Given a causal diagram G = V, E , a reward variable Y V, and a set of non-manipulable variables N V\{Y }, let H be the projection of G onto V\N. Then, PN G,Y = PH,Y Proof. This follows from Prop. 2 and Prop. 3. In words, the projection onto the manipulable variables preserves the causal structure among them with respect to POMIS. Fig. 4 illustrates three projected graphs of G in Fig. 2a with {A}, {B}, and {C} marginalized out, respectively. Applying the (unconstrained) POMIS identification procedure (Lee and Bareinboim 2018) on the graphs yields: P{A} G,Y = { , {B}, {C}}, P{B} G,Y = { , {A}, {A, C}}, and P{C} G,Y = { , {A}, {A, B}}. This section studied the problem of which intervention sets can be optimal. Once the POMIS under constraints are formally characterized, the arms to be played by a MAB algorithm can be refined and its performance improved. Exploiting the POMISs Structure In this section, we note that arms exhibited interesting dependence structure that we will try to exploit, which follows from the underlying causal structure. The previous section investigated two types of dependence among arms (equality and partial-orderedness), which helped identifying POMIS arms, {x XX|X PN G,Y }. In this section, we explore more sophisticated dependence among arms that allows one arm s reward distribution to be inferred from pulling another arm. Following the convention, we note that when an arm x is pulled, a full realization of the other variables in the system is observed, e.g., v = v1, v2, . . . , vn , which is sampled from the joint distribution Px(v) associated with the pulled arm. Recall the causal model in Fig. 1 when Z is non-manipulable. There are three POMIS arms do( ), do(x=0), and do(x=1), where the two do(x) arms reward distributions can be expressed as Px(y) = P x P(y|z, x )P(x ), which, as noted, is called the front-door formula. Given such an expression, Px(y) can be estimated not only with a trivial expression Px(y) = P v\{y} Px(v), from the experimental samples obtained by pulling the arm do(x), but also with the front-door formula with observational samples from P(v), in this case, by playing the do-nothing arm. This possibility of leveraging samples from other arms is not an exclusive phenomenon to the model in Fig. 1, but can be cast in much broader terms. Given a general graph G, under what conditions can a probability term (a reward distribution, in particular) be turned into an expression with terms of other distributions? In the causal inference literature, the problem of identifiability (Pearl 2000; Tian and Pearl 2002) asks whether the experimental distribution of interest can be uniquely estimated from the observational distribution (do-nothing intervention). Given a causal graph and P(V), there exists a complete algorithm which outputs an expression, if exists, for a quantity of interest, Px(y) in this case. A more general problem, called z-identifiability (Bareinboim and Pearl 2012), asks whether one can identify Px(y) when a set of experiments (i.e., distributions) are available, {PZ | Z Z}, for some Z V. That is, there exists a set of manipulable variables Z, and experiments on every subset of manipulable variables are available. In the SCM-MAB setting with non-manipulability constraints, it is natural to take advantage of z-identifiability where Z = V\{Y }\N. Although it is feasible, in a sense, that the resulting expression will not include any of N, the expression may include terms related to experiments on non-POMISs. Consider Fig. 4a for an example: an expression containing Pb,c will not be useful since arms {do(b, c)}b XB,c XC will not be played at all since {B, C} PN G,Y . Hence, there exists a need to extend the treatment given to z-identifiability to account for an arbitrary set of experiments, {PZ}Z Z, where Z is a subset of the superset of V. We developed a generalized z-identifiability algorithm (Alg. 1), which we call z2ID. Using this algorithm, we can identify whether an arm s reward distribution can be estimated from the other arms information. Before explaining the algorithm, we introduce a few notations. We denote Algorithm 1 z2ID 1: function Z2ID(G, y, x, Z) 2: if Z and X = then 3: return P Si CC(G) sub-z2ID(G, si, v\si, Z) 4: else return sub-z2ID(G, y, x, Z) 5: function SUB-Z2ID(G, y, x, Z) 6: for Z Z s.t. Z X and Z ZZ Z X do 7: return ID(G \ Z, y, x \ Z, Pz) if not FAIL 8: if V \ An(Y)G = then 9: return sub-z2ID(G[An(Y)G], y, x An(Y)G, {Z An(Y)G}Z Z) 10: if W (V\X)\An(Y)GX = then 11: return sub-z2ID(G, y, x w, Z) 12: if |CC(G \ X)| > 1 then 13: return P Si CC(G\X) sub-z2ID(G, si, v\si, Z) 14: return FAIL by G [W], a vertex-induced subgraph of G by W (not to confuse with G[V ]). Also, G\W = G [V\W]. CC(G) represents C-components (Tian and Pearl 2002) of G, which is a family of sets of endogenous variables where each set consists of variables that are maximally connected via bidirected edges. z2ID first examines whether the query of interest is an observational quantity where only interventional distributions are available (Line 2). Then, the procedure tries to identify the quantity by decomposing it into multiple interventional quantities using C-component factorization (Tian and Pearl 2002) (Line 3). This part, especially, differs from variants of identifiability algorithms where an observational distribution P(V) is taken for granted. Then, z2ID recursively transforms the query Px(y) into different forms using well-known equalities that can be derived from basic probability operations and do-calculus, and it tries to reduce Px (y) to an expression only made of Pz whenever Z X and Z Z. This can be understood as solving an identifiability problem where the query is Qx \z(y) = Px (y), and Q(v) = Pz(v) is regarded as an available observational distribution. We show next that the procedure is indeed sound, i.e., any expression that it returns is a valid equality with respect to the underlying structural causal model. Theorem 5 (soundness). Whenever z2ID returns an expression for Px(y), it is correct. Proof. This follows from the previous identifiability results (Tian and Pearl 2002; Shpitser and Pearl 2006; Bareinboim and Pearl 2012). Specifically, Lines 2 3 correspond to Cfactorization (Tian and Pearl 2002). Lines 7 8 follow from soundness of ID since Px(y) = Px\z,z(y) which is identifying Qx\z(y) with Q = Pz in G\Z (Bareinboim and Pearl 2012). Lines 9 12 are based on do-calculus. Finally, Line 13 14 follows from Lemma 3 of (Shpitser and Pearl 2006). For concreteness, consider a causal graph G in Fig. 2a. With N = {A}, the POMISs are { , {B}, {C}} (as shown in Fig. 4a above). The algorithm z2ID will returns expressions Algorithm 2 b MVWA 1: function BMVWA(D, {ˆθ1, ˆθ2, . . . , ˆθm}, B) Input: D: samples {Dx}x A; {ˆθ1, ˆθ2, . . . , ˆθm}: m estimators; B: the number of bootstraps 2: for b 1, . . . , B do 3: D(b) {D(b) x }x A bootstrap samples 4: ( m i=1) ˆθ(b) i ˆθi(D(b)) evaluate expressions 5: ˆΣ Cov(ˆθ1, ˆθ2, . . . , ˆθm) 6: w arg minw w T ˆΣw such that Pm i=1 wi = 1, wi 0 7: return Pm i=1 wiˆθi(D), w T ˆΣw for each arm s distribution as follows: a,b,c Pb(c|a)Pc(a, b, y) (1) a,c P(c|a, b) P b P(y|a, b , c)P(a, b ) (2) a,b P(y|a, b, c)P(a, b) (3) Pc(y) = P a Pb(y|a, c)Pb(a) (4) Note that there are two equations for Pc(y), Eqs. (3) and (4) the former suggests that Pc can be estimated with samples from the do-nothing intervention, while the second says that it can be estimated from samples from Pb. Moreover, b in Eq. (4) can be any realization within XB. This means that such a mapping can be viewed as two separate sources of data, i.e., P a Pb=0(y|a, c)Pb=0(a) and P a Pb=1(y|a, c)Pb=1(a). The failing condition of z2ID implies that it encountered a (decomposed) probability term that cannot be reduced to any of the available distributions individually. It is an open problem, and outside of the scope of this work, to prove the necessity of z2ID, which requires to show how the available distributions cannot answer the query collectively. Integrating Expressions into MAB Algorithms Given that we have multiple expressions for each arm s distribution, it is crucial to find a principled way of combining them. Let θ = Px(y) be a quantity of interest, and there be m dependent estimators ˆθ = {ˆθi}m i=1 corresponding to expressions relating to θ. They can be combined using a weighted sum, ˆθ = Pm i=1 wiˆθi with Pm i=1 wi = 1 and ( 1 i m) wi 0 (i.e., convex combination). Given that every ˆθi is unbiased, we may focus on minimizing the variance of ˆθ, Var(ˆθ) = w TΣw, where Σ = Cov(ˆθ, ˆθ). Weights resulting minimum variance can be obtained through solving a constrained quadratic program. Bootstrap is a well-known technique to estimate the variance of an estimator (Efron and Tibshirani 1993). Given data D of size n, sampling n instances from the data, with replacement, yields a bootstrap sample. By repeating the procedure B times, we obtain bootstrap samples {D(b)}B b=1, which allow us to estimate sample variance for an estimator. Given multiple datasets, we bootstrap them simultaneously, and obtain bootstrap estimates for each estimator so as to compute a covariance matrix Σ. We then are able to compute weights that yield a minimum variance estimator. A pseudo-code (Alg. 2) describes the aforementioned proce- dure, named bootstrap-based minimum variance weighted average (b MVWA). We introduce two algorithms z2-TS and z2-KL-UCB (Alg. 3), generalizations of the classic TS and KL-UCB algorithms, respectively, that incorporate b MVWA so that information about other arms rewards can be exploited to infer an arm s expected reward more precisely via expressions retrieved from z2ID. We first explain z2-TS. Given a SCMMAB instance with non-manipulable variables (G, Y , and N), the algorithm first computes corresponding POMISs (Line 2) and obtain estimators for arms relevant to the POMISs (Line 4). For each arm x, b MVWA returns a minimum variance estimate and its variance. Then, it finds a pair of parameters for a beta distribution whose mean and variance correspond to ˆθx and ˆs2 x, and a sample is drawn from this distribution as an approximate posterior sample.3 Similarly to TS, an arm is chosen based on posterior samples. Next we describe z2-KL-UCB. The algorithm follows the same initialization steps, and computes each arm s mean payout and its variance through b MVWA. Conventional KLUCB computes the upper-confidence bound of the expected reward of an arm x with a non-decreasing function f and how many times the arm is pulled, Nx. It infers ˆNx (Line 18), the number of effective samples for each arm x, rather than using the actual number of arms played, Nx. One property of the bootstrap is that if the number of samples is too small, bootstrap will fail, in the sense that the bootstrap distribution will not approximate the target distribution. In practice, the aforementioned approach will be deferred until enough number of samples is observed. As a rule of thumb, we do not use an expression (i.e., estimator) unless its support is covered (i.e., every possible value with respect to each term in the expression has positive mass). For example, Eq. (1) will be evaluated if all 8 and 16 values of Pb(c|a) and Pc(a, b, y) exist, respectively (assuming that the variables are binary). Furthermore, with a binary reward variable, we utilize a beta distribution, Beta(αx + 1, βx + 1), for the mean and variance of a trivial estimator, e.g. ˆθi = Px(y), where αx and βx are the number of Yx being 1 and 0, respectively, at round t. Empirical Evaluation In this section, we evaluate the performance (cumulative regret, CR for short) of SCM-MAB algorithms under different strategies so as to assess the effect of employing POMISbased arm refinement as well as the dependence structure among arms. In particular, we compare the following strategies: Brute-force (BF), which intervenes on every combination of manipulable variables, MIS MN G,Y ; POMIS PN G,Y ; POMIS+, which is POMIS enhanced with z2ID. Formally, note that the following relationships hold: BF MIS POMIS = POMIS+ relative to their arms. Combined with two base MAB algorithms, TS and KL-UCB, we compared total eight settings, where only POMIS+ employs z2-TS and z2-KL-UCB. 3One might simply use a weighted average of one of bootstrap estimates, w T ˆθ(b) for some b. Algorithm 3 Bernoulli TS and KL-UCB with z2ID 1: function Z2-TS(G, Y , N, T) 2: Z PN G,Y 3: A {x XX | X Z} 4: ˆθx {Px(y)} {z2ID(G, y, x, Z )}Z Z\{X} for x A 5: D {Dx = }x A 6: for t in 1, . . . , T do 7: for x A do 8: ˆθx, ˆs2 x b MVWA(D, ˆθx) 9: Find ˆαx, ˆβx such that Beta(ˆαx, ˆβx) matching ˆθx, ˆs2 x 10: θx Beta(ˆαx, ˆβx) 11: x arg maxx A θx 12: Sample v by do(x ) and append v to Dx 13: function Z2-KL-UCB(G, Y , N, T, f ln(t)+3 ln(ln(t))) 14: Initialize Z, A, {ˆθx}x A, D 15: ( x A) Sample v by do(x), and append v to Dx 16: for t in |A|, . . . , T do 17: ˆθx, ˆs2 x b MVWA(D, ˆθx) for x A 18: ˆ Nx ˆθx(1 ˆθx)/ˆs2 x; ˆt P 19: µ= sup µ [0, 1]:KL(ˆθx, µ) f(ˆt) x A 20: x arg maxx A µx 21: Sample v by do(x ), and append v to Dx We simulated three SCM-MAB instances, i.e., the frontdoor setting (Fig. 1) and two models following Fig. 2a and Fig. 5d. The time horizon T is set to 10,000, which is enough to observe the performance difference; simulations are repeated 1,000 times, and the number of bootstraps B is set to 500. Detailed descriptions and expressions generated by z2ID are provided in the full technical report (Lee and Bareinboim 2019). Fig. 5 illustrates experimental results and Table 1 summarizes the average cumulative regrets. We discuss these results in the sequel. (The Front-door Graph with N={Z}) The results are shown in Fig. 5a. Due to its simplicity, BF, MIS, and POMIS are all the same in this setting.4 A gap between POMIS and POMIS+ indicates how well expressions reduce uncertainty of arms expected rewards so that the underlying MAB algorithm plays the optimal arm more often (and suboptimal arms less often). POMIS+ reduces CRs 34.5% (TS) and 39.2% (KL-UCB) compared to POMIS. (Causal Model for Fig. 2a with N={A}) Results are shown in Fig. 5b. MIS and POMIS have the same performance as their arms are the same. The gain for (PO)MIS compared to BF is due to the fact that there are four less arms corresponding to intervening on {B, C}. z2ID helps gain to reduce cumulative regret for both MAB algorithms. Reduction in CRs by employing POMIS+ compared to POMIS are 46.2% and 61.0% for TS and KL-UCB, respectively. Such a high reduction rate is achieved in part because expressions construct reciprocal relationships among arms (see Eqs. (1) 4If two strategies share the same set of arms, then their results will be exactly the same since we assigned a unique random seed for each of 1,000 Monte Carlo simulations so as to reproduce the same experimental results. 0 5k 10k Trials Cumulative Regret 0 5k 10k Trials Cumulative Regret 0 5k 10k Trials Cumulative Regret POMIS+ POMIS MIS BF Figure 5: (a,b,c) Simulated results for settings in Figs. 1, 2a, and 5d, where the average cumulative regret (95% confidence interval) is shown relative to the four strategies discussed in the paper (BF, MIS, POMIS, POMIS+). The results are also shown for the corresponding TS (solid) and KL-UCB (dashed) solvers. (d) The causal graph used in (c) with N = {A, C}. Fig. 5a Fig. 5b Fig. 5c TS POMIS+ 29.08 50.29 121.54 POMIS 44.40 93.53 161.54 MIS 212.24 BF 183.15 313.39 UCB POMIS+ 59.30 74.93 247.05 POMIS 97.52 192.39 359.40 MIS 469.30 BF 328.68 657.11 Table 1: Average cumulative regret at T = 10, 000 to (4)) except that Pb(y) is not expressed with Pc(v). (Causal Model for Fig. 5d) The results are shown in Fig. 5c. This graph clearly demonstrates the advantage of having a smaller number of arms (BF 27, MIS 19, POMIS 15). In this setting, POMIS+ makes use of 24 expressions. However, the larger number of arms (compared to the previous tasks) and larger number of variables involved in formulas, e.g., a probability term P(y|a, b, d, e) in some formula, make POMIS+ difficult to take full advantage of every expression in a given time horizon T = 10, 000. The gap between POMIS and POMIS+ will be more visible with a larger time horizon. POMIS+ reduced CRs 24.8% and 31.3% for TS and KL-UCB, respectively, compared to those for POMIS. In sum, our experiments corroborate our theoretical findings MAB algorithms are benefited from playing a smaller number of qualified arms (POMIS) and precise estimations of arms rewards (z2ID). As per our observations, z2ID will be less rewarding if the gap between rewards of the optimal and sub-optimal arms is big enough so that they can easily be distinguished before the MAB algorithm takes advantage of available expressions. Further, the performance gain will not be outstanding if the aforementioned gap is relatively small and the given time horizon is limited, or if there are many arms with low rewards that are not sufficiently played to utilize related expressions. Conclusions We studied the problem of structural causal bandits that asks whether (and how) a decision-maker should intervene in a causal system so as to optimize a particular measure. Our results generalize the previous treatment given to the problem (Lee and Bareinboim 2018), which assumed that all observed variables in the system are manipulable (i.e., could be intervened upon). For example, while cholesterol has an indisputable effect on heart failure, it s not the case that one could physically manipulate cholesterol, but perhaps she could intervene on another variable, say, maybe diet, which could accomplish the same goal of decreasing the likelihood of heart failure. Formally, we characterized two crucial properties in structural bandits with manipulability constraints, i.e.: (1) the possibly-optimal arms (POMIS), and (2) the topological relationship between such arms. We further developed an identification algorithm that outputs an expression for an (interventional) distribution providing a way to fuse an arbitrary set of experiments so that each arm s reward distribution can be estimated from samples of other arms. We equipped existing MAB algorithms with such capabilities, which are now able to play only a qualified subset of the arms while more accurately estimating their expected rewards. Following the current debate on the topic (Pearl 2018), we hope that our results will help to move the discussion from whether a non-manipulable variable causes another to how a decision-maker in the real world should intervene (or not intervene) in the system and be able to optimize a socially agreed target outcome (e.g., survival, happiness, wealth). Acknowledgments This research is supported in parts by grants from NSF IIS1704352, and IIS-1750807 (CAREER), IBM Research, and Adobe Research. References Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finitetime analysis of the multiarmed bandit problem. Machine Learning 47(2/3):235 256. Bareinboim, E., and Pearl, J. 2012. Causal inference by surrogate experiments: z-identifiability. In de Freitas, N., and Murphy, K., eds., Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, 113 120. Corvallis, OR: AUAI Press. Bareinboim, E., and Pearl, J. 2016. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences 113:7345 7352. Bareinboim, E.; Forney, A.; and Pearl, J. 2015. Bandits with unobserved confounders: A causal approach. In Advances in Neural Information Processing Systems, 1342 1350. Bottou, L.; Peters, J.; Charles, D. X.; Chickering, M.; Portugaly, E.; Ray, D.; Simard, P. Y.; and Snelson, E. 2013. Counterfactual reasoning and learning systems: the example of computational advertising. Journal of Machine Learning Research 14(1):3207 3260. Capp e, O.; Garivier, A.; Maillard, O.-A.; Munos, R.; Stoltz, G.; et al. 2013. Kullback leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics 41(3):1516 1541. Chapelle, O., and Li, L. 2011. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, 2249 2257. Cornfield, J. 1951. A method of estimating comparative rates from clinical data; applications to cancer of the lung, breast, and cervix. Journal of the National Cancer Institute 11:1269 1275. Dani, V.; Hayes, T. P.; and Kakade, S. M. 2008. Stochastic linear optimization under bandit feedback. In Proceedings of Conference On Learning Theory (COLT), 355 366. Efron, B., and Tibshirani, R. J. 1993. An Introduction to the Bootstrap. Number 57 in Monographs on Statistics and Applied Probability. Chapman & Hall/CRC. Flegal, K. M.; Graubard, B. I.; and Williamson, D. F. 2005. Excess deaths associated with underweight, overweight, and obesity. 293(15):1861 1867. Forney, A.; Pearl, J.; and Bareinboim, E. 2017. Counterfactual data-fusion for online reinforcement learners. In International Conference on Machine Learning, 1156 1164. Garivier, A., and Capp e, O. 2011. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th annual Conference On Learning Theory, 359 376. Heckman, J. J. 2006. Skill formation and the economics of investing in disadvantaged children. Science 312(5782):1900 1902. Kaufmann, E.; Korda, N.; and Munos, R. 2012. Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory, 199 213. Lai, T., and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1):4 22. Lattimore, F.; Lattimore, T.; and Reid, M. D. 2016. Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems 29. 1181 1189. Lee, S., and Bareinboim, E. 2018. Structural causal bandits: Where to intervene? In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, forthcoming. Lee, S., and Bareinboim, E. 2019. Structural causal bandits with non-manipulable variables. Technical Report R-40, Purdue AI Lab, Department of Computer Science, Purdue University. Magureanu, S.; Combes, R.; and Proutiere, A. 2014. Lipschitz bandits: Regret lower bound and optimal algorithms. In Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, 975 999. Barcelona, Spain: PMLR. Pearl, J., and Mackenzie, D. 2018. The Book of Why: The New Science of Cause and Effect. Basic Books. Pearl, J. 1995. Causal diagrams for empirical research. Biometrika 82(4):669 688. Pearl, J. 2000. Causality: Models, Reasoning, and Inference. New York: Cambridge University Press. Pearl, J. 2018. Does obesity shorten life? or is it the soda? on non-manipulable causes. Technical Report R-483, Forthcoming, Journal of Causal Inference, Department of Computer Science, University of California, Los Angeles, CA. Robbins, H. 1952. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. 58(5):527 535. Sen, R.; Shanmugam, K.; Dimakis, A. G.; and Shakkottai, S. 2017. Identifying best interventions through online importance sampling. In International Conference on Machine Learning, 3057 3066. Shpitser, I., and Pearl, J. 2006. Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of The Twenty-First National Conference on Artificial Intelligence, 1219 1226. AAAI Press. The White House, Office of the Press Secretary. 2014. Fact sheet: Invest in us: The white house summit on early childhood education. Press Release. Thompson, W. R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285 294. Tian, J., and Pearl, J. 2002. A general identification condition for causal effects. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI 2002), 567 573. Menlo Park, CA: AAAI Press/The MIT Press. Tian, J. 2002. Studies in Causal Reasoning and Learning. Ph.D. Dissertation, Computer Science Department, University of California, Los Angeles, CA. U.S. Department of Health and Human Services. 2014. The health consequences of smoking 50 years of progress: A report of the surgeon general. Technical report, U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, National Center for Chronic Disease Prevention and Health Promotion, Office on Smoking and Health, Atlanta, GA. Verma, T., and Pearl, J. 1990. Equivalence and synthesis of causal models. In Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence (UAI 1990), 220 227. Verma, T. 1992. Invariant properties of causal models. Technical Report R-134, Cognitive Systems Laboratory, Department of Computer Science, UCLA. Zhang, J., and Bareinboim, E. 2017. Transfer learning in multi-armed bandits: A causal approach. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, 1340 1346.