# Structural Causal Bandits: Where to Intervene?

Sanghack Lee, Department of Computer Science, Purdue University (lee2995@purdue.edu); Elias Bareinboim, Department of Computer Science, Purdue University (eb@purdue.edu)

We study the problem of identifying the best action in a sequential decision-making setting when the reward distributions of the arms exhibit a non-trivial dependence structure, which is governed by the underlying causal model of the domain where the agent is deployed. In this setting, playing an arm corresponds to intervening on a set of variables and setting them to specific values. In this paper, we show that whenever the underlying causal model is not taken into account during the decision-making process, the standard strategies of simultaneously intervening on all variables or on all the subsets of the variables may, in general, lead to suboptimal policies, regardless of the number of interventions performed by the agent in the environment. We formally acknowledge this phenomenon and investigate structural properties implied by the underlying causal model, which lead to a complete characterization of the relationships between the arms' distributions. We leverage this characterization to build a new algorithm that takes as input a causal structure and finds a minimal, sound, and complete set of qualified arms that an agent should play to maximize its expected reward. We empirically demonstrate that the new strategy learns an optimal policy and leads to orders of magnitude faster convergence rates when compared with its causal-insensitive counterparts.

1 Introduction

The multi-armed bandit (MAB) problem is one of the prototypical settings studied in the sequential decision-making literature [Lai and Robbins, 1985, Even-Dar et al., 2006, Bubeck and Cesa-Bianchi, 2012]. An agent needs to decide which arm to pull and receives a corresponding reward at each time step, while keeping the goal of maximizing its cumulative reward in the long run. The challenge is the inherent trade-off between exploiting known arms and exploring new reward opportunities [Sutton and Barto, 1998, Szepesvári, 2010]. There is a wide range of assumptions underlying MABs, but in most of the traditional settings the arms' rewards are assumed to be independent, which means that knowing the reward distribution of one arm has no implication for the rewards of the other arms. Many strategies were developed to solve this problem, including classic algorithms such as ε-greedy, variants of UCB [Auer et al., 2002, Cappé et al., 2013], and Thompson sampling [Thompson, 1933]. Recently, the existence of some non-trivial dependencies among arms has been acknowledged in the literature and studied under the rubric of structured bandits, which include settings such as linear [Dani et al., 2008], combinatorial [Cesa-Bianchi and Lugosi, 2012], unimodal [Combes and Proutiere, 2014], and Lipschitz [Magureanu et al., 2014], just to name a few. For example, a linear (or combinatorial) bandit imposes that an action x_t ∈ R^d (or {0, 1}^d) at a time step t incurs a cost θ_t^⊤ x_t, where θ_t is a loss vector chosen by, e.g., an adversary. In this case, an index-based MAB algorithm, oblivious to the structural properties, can be suboptimal.
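To make the structured-bandit contrast concrete, the following minimal sketch (ours; the arm features, θ, and noise level are purely illustrative) shows how a shared parameter vector ties all arm rewards together in a linear bandit, so that pulls of any arm are informative about every other arm:

```python
import numpy as np

# Minimal sketch of the structure a linear bandit exploits: every arm's mean
# reward is x^T theta for one unknown theta, so arms are not independent.
rng = np.random.default_rng(0)
theta = np.array([0.8, -0.3])                            # unknown vector
arms = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])    # arm feature vectors

# Pull each arm a few times, fit theta by least squares; the fit then scores
# *all* arms at once, including ones pulled rarely.
X = np.repeat(arms, 5, axis=0)
y = X @ theta + rng.normal(0, 0.1, len(X))               # noisy rewards x^T theta
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(arms @ theta_hat)                                  # estimated mean per arm
```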
In another line of investigation, rich environments with complex dependency structures are modeled explicitly through the use of causal graphs, where nodes represent decision and outcome variables, and directed edges represent the direct influence of one variable on another [Pearl, 2000].

Figure 1: MAB problems as directed acyclic graphs, where U is an unobserved variable: (a) a standard MAB and (b) an IV-MAB. Panels (c) and (d) plot cumulative regrets and the probability of selecting an optimal arm when a MAB algorithm intervenes on X1 and X2 simultaneously (All-at-once) or on all subsets of {X1, X2} for the IV-MAB. The IV-MAB is also used in the experimental section (see Appendix D [Lee and Bareinboim, 2018] for its parametrization).

Despite the apparent connection between MABs and causality, only recently has causal reasoning been incorporated into the design of MAB algorithms. For instance, Bareinboim et al. [2015] first explored the connection between causal models with unobserved confounders (UCs) and reinforcement learning, where latent factors affect both the reward distribution and the player's intuition. The key observation used in the paper is that while standard MAB algorithms optimize based on the do-distribution (formally written as E[Y|do(X)] or E[Y_x]), the simplest type of counterfactual, this approach is dominated by another strategy using a more detailed counterfactual as the basis of the optimization process (i.e., E[Y_x | X = x′]); this general strategy was called the regret decision criterion (RDC). This strategy was later extended to handle counterfactual distributions of higher dimensionality by Forney et al. [2017]. Further, Lattimore et al. [2016] and Sen et al. [2017] studied the problem of best arm identification through importance weighting, where information on how playing arms influences the direct causes (parents, in causal terminology) of a reward variable is available. Zhang and Bareinboim [2017] leveraged causal graphs to solve the problem of off-policy evaluation in the presence of UCs. They noted that whenever UCs are present, traditional off-policy methods can be arbitrarily biased, leading to linear regret. They then showed how to solve the off-policy evaluation problem by incorporating causal bounds into the decision-making procedure.1 Overall, these works showed different aspects of the same phenomenon: whenever UCs are present in the real world, the expected guarantees provided by standard methods are no longer valid, which translates to an inability to converge to any reasonable policy. They then showed that convergence can be restored once the causal structure is acknowledged and used during the decision-making process. In this paper, we focus on the challenge of identifying the best action in MABs where the arms correspond to interventions on an arbitrary causal graph, including when latent variables confound the observed relations (i.e., semi-Markovian causal models). To understand this challenge, we first note that a standard MAB can be seen as the simple causal model shown in Fig. 1a, where X represents an arm (with K different values), Y the reward variable, and U the unobserved variable that generates the randomness of Y.2 After a sufficiently large number of pulls of X (chosen by the specific algorithm), Y's average reward can be determined with high confidence.
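As a minimal sketch of this view (our illustrative f_Y and P(U), not a model from the paper), the standard MAB of Fig. 1a can be simulated by drawing the latent U and evaluating Y ← f_Y(U, x) under do(X = x):

```python
import numpy as np

# Sketch of Fig. 1a as an SCM: pulling arm x means do(X=x), and Y <- f_Y(U, X)
# with the latent U supplying the randomness. f_Y below is an assumed stand-in.
rng = np.random.default_rng(1)

def f_Y(u, x):
    # reward mechanism: an arm-dependent threshold on the latent draw
    return float(u < [0.3, 0.6, 0.5][x])

def pull(x, n=10000):
    # estimate E[Y | do(X=x)] by repeated pulls of the same arm
    return np.mean([f_Y(rng.uniform(), x) for _ in range(n)])

print([round(pull(x), 2) for x in range(3)])   # approx. [0.3, 0.6, 0.5]
```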
Whenever a set of UCs affects more than one observed variable, however, novel, non-trivial challenges arise. To witness, consider the more involved MAB structure shown in Fig. 1b, where an unobserved confounder U affects both the action variable X1 and the reward Y. A naive approach for an algorithm to play such a bandit would be to pull arms in a combinatorial manner, i.e., combining both variables (X1 × X2) so that the arms are D(X1) × D(X2), where D(X) is the domain of X. One may surmise that this is a valid strategy, albeit not the most efficient one. Somewhat unexpectedly, however, Fig. 1c shows that this is not the case: the optimal action comes from pulling X2 and ignoring X1, while pulling {X1, X2} together leads to subpar cumulative rewards (regardless of the number of iterations), since this strategy simply cannot pull the optimal arm (Fig. 1d). After all, if one is oblivious to the causal structure and decides to take all intervenable variables as one (in this case, X1 × X2), indiscriminately, one may be doomed to learn a suboptimal policy.

1 On another line of investigation, Ortega and Braun [2014] introduced a generalized version of Thompson sampling applied to the problem of adaptive control.

2 In causal notation, Y ← f_Y(U, X), which means that Y's value is determined by X and the realization of the latent variable U. If f_Y is linear, we would have a (stochastic) linear bandit. Our results do not constrain the types of structural functions, as is usual within nonparametric causal inference [Pearl, 2000, Ch. 7].

In this paper, we investigate this phenomenon and, more broadly, causal MABs with a non-trivial dependency structure between the arms. More specifically, our contributions are as follows: (1) We formulate the SCM-MAB problem, a structured multi-armed bandit instance within the causal framework. We then derive the structural properties of a SCM-MAB, which are computable from any causal model, including equivalence among arms based on do-calculus [Pearl, 1995], and partial orderedness among sets of variables associated with arms with regard to the maximum rewards achievable. (2) We characterize a special set of variables called POMIS (possibly-optimal minimal intervention set), which is worth intervening on based on the aforementioned partial orders. We then introduce an algorithm that identifies the complete set of POMISs, so that only the subset of arms associated with them needs to be explored by a MAB algorithm. Simulations corroborate our findings.

Figure 2: A bandit space with various dimensions, e.g., i.i.d./stochastic versus adversarial, Markovian, linear, Lipschitz (not all dimensions are shown).

Big picture. The multi-armed bandit is a rich setting in which a huge number of variants have been studied in the literature. Different aspects of the decision-making process have been analyzed and well understood in the last decades, including different functional forms (e.g., linear, Lipschitz, Gaussian process), types of feedback experienced by the agent (bandit, semi-bandit, full), and the adversarial or i.i.d. nature of the interactions, just to cite some of the most popular ones. Our study of SCM-MABs puts the causal dimension front and center in the map. In particular, we fully acknowledge the existence of a causal structure among the underlying variables (whenever not known a priori, see Footnote 3), and leverage the qualitative relations among them. This is in clear contrast with the prevailing practice, which is more quantitative and, almost invariably, oblivious to the underlying causal structure (as shown in Fig. 1a). We outline in Fig. 2 an initial map that shows the relationship between these dimensions; our goal here is not to be exhaustive, nor prescriptive, but to help give some perspective. In this paper, we study bandits with no constraints over the underlying functional form (nonparametric, in causality language), i.i.d. stochastic rewards, and an explicit causal structure acknowledged by the agent.
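Returning to Fig. 1b, the failure of the all-at-once strategy can be made concrete with a toy parametrization (ours, chosen only to exhibit the phenomenon; the paper's actual parametrization is in Appendix D): U ~ Bern(0.5) confounds X1 and Y, with X2 → X1 → Y.

```python
import itertools
import numpy as np

# Illustrative (assumed) parametrization of Fig. 1b: U confounds X1 and Y.
rng = np.random.default_rng(2)

def sample(do):
    u = rng.integers(2)
    x2 = do.get('X2', rng.integers(2))
    x1 = do.get('X1', x2 ^ (1 - u))    # natural mechanism unless intervened on
    return x1 ^ u                      # reward Y

def mu(do, n=20000):
    return np.mean([sample(do) for _ in range(n)])

# All-at-once: every (x1, x2) arm caps out near 0.5 ...
print(max(mu({'X1': a, 'X2': b}) for a, b in itertools.product((0, 1), (0, 1))))
# ... while do(X2=0) alone lets X1 respond to U and attains roughly 1.0.
print(mu({'X2': 0}))
```

Forcing X1 severs its natural dependence on U; leaving X1 free under do(X2=0) lets the system exploit the confounder, exactly the behavior shown in Figs. 1c and 1d.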
Preliminaries: notation and structural causal models. We follow the notation used in the causal inference literature. A capital letter is used for a variable or a mathematical object, and the domain of X is denoted by D(X). A bold capital letter denotes a set of variables, e.g., X = {X_i}_{i=1}^n, while a lowercase letter x ∈ D(X) is a value assigned to X, and x ∈ D(X) = ×_{X∈X} D(X). We denote by x[W] the values of x corresponding to W ∩ X. A graph G = ⟨V, E⟩ is a pair of vertices V and edges E. We adopt the family relationships pa, ch, an, and de to denote the parents, children, ancestors, and descendants of a given variable; Pa, Ch, An, and De extend pa, ch, an, and de by including the argument in the result, e.g., Pa(X)_G = pa(X)_G ∪ {X}. With a set of variables as argument, pa(X)_G = ∪_{X∈X} pa(X)_G, and similarly for the other relations. We denote by V(G) the set of variables in G. G[V′], for V′ ⊆ V(G), is a vertex-induced subgraph where all edges among V′ are preserved. We define G\X as G[V(G)\X] for X ⊆ V(G). We adopt the language of Structural Causal Models (SCMs) [Pearl, 2000, Ch. 7]. An SCM M is a tuple ⟨U, V, F, P(U)⟩, where U is a set of exogenous (unobserved or latent) variables and V is a set of endogenous (observed) variables. F is a set of deterministic functions F = {f_i}, where f_i determines the value of V_i ∈ V based on endogenous variables PA_i ⊆ V\{V_i} and exogenous variables U_i ⊆ U, that is, v_i ← f_i(pa_i, u_i). P(U) is a joint distribution over the exogenous variables. A causal diagram G = ⟨V, E⟩, associated with M, is a tuple of vertices V (the endogenous variables) and edges E, where a directed edge V_i → V_j ∈ E if V_i ∈ PA_j, and a bidirected edge between V_i and V_j if they share an unobserved confounder, i.e., U_i ∩ U_j ≠ ∅. Note that pa(V_i)_G corresponds to PA_i. The probability of Y = y when X is held fixed at x (i.e., intervened on) is denoted by P(y|do(x)), where intervention on X is graphically represented by G_X̄, the graph G with incoming edges onto X removed. We denote by CC(X)_G the c-component of G that contains X, where a c-component is a maximal set of vertices connected with bidirected edges [Tian and Pearl, 2002]. We define CC(X)_G = ∪_{X∈X} CC(X)_G. For a more detailed discussion of the properties of SCMs, we refer readers to [Pearl, 2000, Bareinboim and Pearl, 2016]. For all the proofs and appendices, please refer to the full technical report [Lee and Bareinboim, 2018].

Figure 3: (a–d) Causal graphs such that µ_x = µ_{x,z}, and (e) non-dominated arms:

| arms | (a) | (b) | (c) | (d) |
|------|-----|-----|-----|-----|
| ∅    |     | ✓   |     | ✓   |
| D(X) | ✓   | ✓   | ✓   | ✓   |
| D(Z) |     |     | ✓   | ✓   |
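A sketch (ours, for illustration only; the class name, encoding, and the toy Markovian chain below are not from the paper) of how an SCM M = ⟨U, V, F, P(U)⟩ and its do() operator might be represented in code:

```python
import numpy as np

# A plain-data sketch of an SCM M = <U, V, F, P(U)>.
class SCM:
    def __init__(self, p_u, functions, order):
        self.p_u = p_u          # P(U): draws a dict of exogenous values
        self.f = functions      # F: maps V_i to f_i(endogenous, exogenous)
        self.order = order      # a topological order over V

    def sample(self, do=None, rng=np.random.default_rng()):
        do, v, u = do or {}, {}, self.p_u(rng)
        for name in self.order:  # v_i <- f_i(pa_i, u_i), unless intervened on
            v[name] = do[name] if name in do else self.f[name](v, u)
        return v

# An illustrative Markovian chain Z -> X -> Y with independent noises:
m = SCM(p_u=lambda rng: {'UZ': rng.integers(2), 'UY': rng.integers(2)},
        functions={'Z': lambda v, u: u['UZ'],
                   'X': lambda v, u: v['Z'],
                   'Y': lambda v, u: v['X'] ^ u['UY']},
        order=['Z', 'X', 'Y'])
# E[Y | do(X=1)] estimated by sampling:
print(np.mean([m.sample(do={'X': 1})['Y'] for _ in range(5000)]))
```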
2 Multi-armed bandits with structural causal models

We recall that MABs consider a sequential decision-making setting where pulling one of the K available arms at each round gives the player a stochastic reward from an unknown distribution associated with the corresponding arm. The goal is to minimize (maximize) the cumulative regret (reward) after T rounds. The mean reward of an arm a is denoted by µ_a, and the maximal reward is µ* = max_{1≤a≤K} µ_a. We focus on the cumulative regret, Reg_T = Tµ* − Σ_{t=1}^T E[Y_{A_t}] = Σ_{a=1}^K Δ_a E[T_a(T)], where A_t is the arm played at time t, T_a(t) is the number of times arm a has been played after t rounds, and Δ_a = µ* − µ_a. We can now explicitly connect a MAB instance to its SCM counterpart. Let M be an SCM ⟨U, V, F, P(U)⟩ and let Y ∈ V be a reward variable, where D(Y) ⊆ R. The bandit contains arms {x ∈ D(X) | X ⊆ V\{Y}}, the set of all possible interventions on endogenous variables other than the reward variable. Each arm A_x (or simply x) is associated with a reward distribution P(Y|do(x)), where its mean reward µ_x is E[Y|do(x)]. We call this setting a SCM-MAB, which is fully represented by the pair ⟨M, Y⟩. Throughout this paper, we assume that the causal graph G of M is fully accessible to the agent,3 although its parametrization is unknown: that is, an agent facing a SCM-MAB ⟨M, Y⟩ plays arms with knowledge of G and Y, but not of F and P(U). For simplicity, we denote the information provided to an agent playing a SCM-MAB by ⟦G, Y⟧. We now investigate some key structural properties that follow from the causal structure G of the SCM-MAB.

Property 1 (Equivalence among arms). We start by noting that do-calculus [Pearl, 1995] provides rules to evaluate invariances in the interventional space. In particular, we focus here on Rule 3, which ascertains the condition under which a set of interventions has no effect on the outcome variable, i.e., P(y|do(x, z), w) = P(y|do(x), w). Since arms correspond to interventions (including the null intervention) and there is no contextual information, we consider examining P(y|do(x, z)) = P(y|do(x)) through Y ⊥⊥ Z | X in G_{X̄Z̄}, which implies µ_{x,z} = µ_x. If valid, this condition implies that it is sufficient to play only one arm among the arms in the equivalence class.

Definition 1 (Minimal Intervention Set (MIS)). A set of variables X ⊆ V\{Y} is said to be a minimal intervention set relative to ⟦G, Y⟧ if there is no X′ ⊊ X such that µ_{x[X′]} = µ_x for every SCM conforming to G.

For instance, the MISs corresponding to the causal graphs in Fig. 3 are {∅, {X}, {Z}}, which do not include {X, Z} since µ_x = µ_{x,z}. The MISs are determined without considering the UCs in a causal graph. The empty set and all singletons in an(Y)_G are MISs for G with respect to Y. The task of finding the best arm among all possible arms can thus be reduced to a search within the MISs.

Proposition 1 (Minimality). A set of variables X ⊆ V\{Y} is a minimal intervention set for G with respect to Y if and only if X ⊆ an(Y)_{G_X̄}.

All the MISs given ⟦G, Y⟧ can be determined without explicitly enumerating 2^{V\{Y}} while checking the condition in Prop. 1. We provide an efficient recursive algorithm enumerating the complete set of MISs given G and Y (Appendix A), which runs in O(mn²), where m is the number of MISs.

3 In settings where this is not the case, one can spend the first interactions with the environment to learn the causal graph G from observational [Spirtes et al., 2001] or experimental data [Kocaoglu et al., 2017].
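Prop. 1 yields a direct membership test; a small sketch (our graph encoding: a dict mapping each vertex to its set of parents) checks X ⊆ an(Y)_{G_X̄} on the chain Z → X → Y of Fig. 3:

```python
# MIS test of Prop. 1: X is a MIS iff X is contained in an(Y) computed in
# G_{X-bar}, the graph with all edges into X removed.
def ancestors(pa, y):
    seen, stack = set(), [y]
    while stack:
        for p in pa.get(stack.pop(), ()):  # walk parent links transitively
            if p not in seen:
                seen.add(p); stack.append(p)
    return seen                            # an(y), excluding y itself

def is_mis(pa, y, xs):
    cut = {v: (set() if v in xs else set(ps)) for v, ps in pa.items()}
    return set(xs) <= ancestors(cut, y)

# Fig. 3's directed skeleton Z -> X -> Y:
pa = {'Z': set(), 'X': {'Z'}, 'Y': {'X'}}
print([is_mis(pa, 'Y', s) for s in [set(), {'X'}, {'Z'}, {'X', 'Z'}]])
# -> [True, True, True, False]: {X, Z} is excluded since mu_x = mu_{x,z}
```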
Property 2 (Partial orders among arms). We now explore the partial orders among subsets of V\{Y} within the MISs. Given the causal diagram G, it is possible that intervening on some variables is always at least as good as intervening on another set of variables (regardless of the parametrization of the underlying model). Formally, there can be two different sets of variables W, Z ⊆ V\{Y} such that max_{w∈D(W)} µ_w ≤ max_{z∈D(Z)} µ_z in every possible SCM conforming to G. If that is the case, it would be unnecessary (and possibly harmful in terms of sample efficiency) to play the arms D(W). We next define the Possibly-Optimal MIS, which incorporates this partial orderedness among subsets of V\{Y} into the MIS; we denote by µ_{x*} the optimal value attainable by intervening on X ⊆ V\{Y} in a given SCM.

Definition 2 (Possibly-Optimal Minimal Intervention Set (POMIS)). Given information ⟦G, Y⟧, let X be a MIS. If there exists an SCM conforming to G such that µ_{x*} > µ_{z*} for every Z ∈ Z\{X}, where Z is the set of MISs with respect to G and Y, then X is a possibly-optimal minimal intervention set with respect to the information ⟦G, Y⟧.

Intuitively, one may believe that the best action is to intervene on the direct causes (parents) of the reward variable Y, since this would entail the highest degree of controllability of Y within the system. This, in fact, holds true if Y is not confounded with any of its ancestors, which includes the case where no unobserved confounders are present in the system (i.e., Markovian models).

Proposition 2. Given information ⟦G, Y⟧, if Y is not confounded with an(Y)_G via unobserved confounders, then pa(Y)_G is the only POMIS.

Corollary 3 (Markovian POMIS). Given ⟦G, Y⟧, if G is Markovian, then pa(Y)_G is the only POMIS.

For instance, in Fig. 3a, {{X}} is the set of POMISs. Whenever unobserved confounders (UCs) are present,4 on the other hand, the analysis becomes more involved. To witness, let us analyze the maximum achievable rewards of the MISs in the other causal diagrams in Fig. 3. We start with Fig. 3b and note that µ_{z*} ≤ µ_{x*}, since µ_{z*} = Σ_x µ_x P(x|do(z*)) ≤ Σ_x µ_{x*} P(x|do(z*)) = µ_{x*}. On the other hand, µ_∅ is not comparable to µ_{x*}. For a concrete example, consider an SCM where the domains of all variables are {0, 1}. Let U be the UC between Y and Z with P(U = 1) = 0.5. Let f_Z(u) = 1 − u, f_X(z) = z, and f_Y(x, u) = x ⊕ u, where ⊕ is the exclusive-or function. If X is not intervened on, x will be 1 − u, yielding y = 1 in both cases u = 0 and u = 1, so that µ_∅ = 1. However, if X is intervened on and set to either 0 or 1, y will be 1 only half the time, since P(U = 1) = 0.5, which results in µ_{x*} = 0.5. We also provide in Appendix A an SCM such that µ_∅ < µ_{x*} holds true. This model (µ_∅ > µ_{x*}) illustrates an interesting phenomenon: allowing a UC to affect Y freely may lead to a higher reward, a benefit that may be broken upon intervention. We now consider the different confounding structure shown in Fig. 3c (similar to Fig. 1b), where the variable Z lies outside of the influence of the UC associated with Y. In this case, intervening on Z leads to a higher reward, µ_{z*} ≥ µ_∅. To witness, note that µ_∅ = Σ_z E[Y|z] P(z) = Σ_z µ_z P(z) ≤ Σ_z µ_{z*} P(z) = µ_{z*}. However, µ_{z*} and µ_{x*} are incomparable, which is shown through two models provided in Appendix A. Finally, we can add the confounders of the two previous models, which is shown in Fig. 3d. In this case, all three of µ_{x*}, µ_{z*}, and µ_∅ are incomparable. One can imagine scenarios where the influence of the UCs is weak enough that the corresponding models produce results similar to Figs. 3a to 3c. It is clear that the interplay between the location of the intervened variable, the outcome variable, and the UCs entails non-trivial interactions and consequences in terms of the reward.
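The SCM just described can be checked exactly by enumerating U; a short sketch confirms µ_∅ = 1 while any intervention on X (or Z) yields 0.5:

```python
# Exact check of the Fig. 3b example: f_Z(u)=1-u, f_X(z)=z, f_Y(x,u)=x XOR u,
# with P(U=1)=0.5. Enumerating U verifies mu_empty = 1 > mu_x* = 0.5.
def mu(do):
    total = 0.0
    for u in (0, 1):                       # each case has probability 0.5
        z = do.get('Z', 1 - u)
        x = do.get('X', z)
        total += 0.5 * (x ^ u)
    return total

print(mu({}))                              # mu_empty = 1.0
print(max(mu({'X': x}) for x in (0, 1)))   # mu_x*    = 0.5
print(max(mu({'Z': z}) for z in (0, 1)))   # mu_z*    = 0.5
```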
The table in Fig. 3e highlights the arms that are contenders to generate the highest rewards in each model (i.e., each such arm sets a POMIS to specific values), while intervening on a non-POMIS represents a waste of resources. Interestingly, the only parent of Y, i.e., X, is not dominated by any other arm in any of the scenarios discussed. In words, this suggests that the intuition of controlling variables closer to Y is not entirely lost even when UCs are present; the parents are not always the only POMIS, but they are certainly one of them. Given that more complex mechanisms cannot, in general, be ruled out, performing experiments would be required to identify the best arm. Still, the results of the table guarantee that the search can be refined so that MAB solvers can discard arms that cannot lead to profitable outcomes, and converge faster to playing the optimal arm.

4 Recall that unobserved confounders are represented in the graph as bidirected dashed edges.

Figure 4: Causal graphs where pink and blue nodes are the MUCT and IB, respectively. (Rightmost) A schematic showing an exploration order over the subsets ∅, {Z}, {X}, {W}, {X, W}, {X, Z}, {Z, W}, {X, Z, W} of variables.

3 Graphical characterization of POMIS

Our goal in this section is to graphically characterize POMISs. We will leverage the discussion in the previous section and note that UCs connected to a reward variable affect the reward distributions in such a way that intervening on a variable outside the coverage of those UCs (including the case of no UC) can be optimal, e.g., {X} for Fig. 3a, ∅ for Figs. 3b and 3d, and {Z} for Fig. 3c. We introduce two graphical concepts to help characterize this property.

Definition 3 (Unobserved-Confounders' Territory). Given information ⟦G, Y⟧, let H be G[An(Y)_G]. A set of variables T ⊆ V(H) containing Y is called a UC-territory on G with respect to Y if De(T)_H = T and CC(T)_H = T. A UC-territory T is said to be minimal if no T′ ⊊ T is a UC-territory.

A minimal UC-territory (MUCT) for G and Y can be constructed by extending a set of variables, starting from {Y}, alternately updating the set with the c-component and the descendants of the set.

Definition 4 (Interventional Border). Let T be a minimal UC-territory on G with respect to Y. Then, X = pa(T)_G \ T is called an interventional border for G with respect to Y.

The interventional border (IB) essentially encompasses the parents of the MUCT. For concreteness, consider Fig. 4a, and note that {W, X, Y, Z} is the MUCT for the causal graph with respect to Y, and the IB is {S, T} (marked in pink and blue in the graph, respectively). As its name suggests, a MUCT is a set of endogenous variables governed by a set of UCs where at least one UC is adjacent to the reward variable. Specifically, the reward is determined by the values of: (1) the UCs governing the MUCT; (2) a set of unobserved variables (other than the UCs), each of which affects an endogenous variable in the MUCT; and (3) the IB. In other words, there is no UC interplaying between the MUCT and its outside, so that µ_x = E[Y|x], where x is a value assigned to the IB X. We now connect MUCT and IB with POMIS. Let MUCT(G, Y) and IB(G, Y) be, respectively, the MUCT and IB given ⟦G, Y⟧.

Proposition 4. IB(G, Y) is a POMIS given ⟦G, Y⟧.

The main strategy of the proof is to construct an SCM M where intervening on any variable in MUCT(G, Y) causes a significant loss of reward. It seems that MUCT and IB can only identify a single POMIS given ⟦G, Y⟧. However, they, in fact, serve as basic units to identify all POMISs.
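The fixed-point construction just described translates directly into code; a sketch (our encoding: `pa` maps each vertex to its parents, `bi` holds bidirected edges as unordered pairs) computes the MUCT and IB, here on the IV graph of Fig. 3c:

```python
# MUCT/IB construction: within H = G[An(Y)], grow T from {Y} by alternately
# closing under descendants and c-components; then IB = pa(T) \ T.
def ancestors(pa, y):
    seen, stack = set(), [y]
    while stack:
        for p in pa.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p); stack.append(p)
    return seen

def muct_ib(pa, bi, y):
    h = ancestors(pa, y) | {y}                 # vertices of H = G[An(Y)]
    t = {y}
    while True:
        # close under descendants within H ...
        grown = t | {v for v in h if ancestors(pa, v) & t}
        # ... then under bidirected (c-component) connectivity within H
        grown |= {w for e in bi for w in e if e <= h and e & grown}
        if grown == t:
            break
        t = grown
    ib = {p for v in t for p in pa.get(v, ()) if p not in t}
    return t, ib

# Fig. 3c, the IV graph: Z -> X -> Y with X and Y confounded.
pa = {'Z': set(), 'X': {'Z'}, 'Y': {'X'}}
bi = {frozenset({'X', 'Y'})}
print(muct_ib(pa, bi, 'Y'))                    # ({'X', 'Y'}, {'Z'})
```

Consistent with the text, the MUCT absorbs X through the UC, and the resulting IB {Z} is exactly the intervention lying outside the UC's coverage.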
Proposition 5. Given ⟦G, Y⟧, IB(G_W̄, Y) is a POMIS, for any W ⊆ V\{Y}.

Prop. 5 generalizes Prop. 4 to the case W ≠ ∅ while taking care of UCs between MUCT(G_W̄, Y) and its outside in the original causal graph G. See Fig. 4d for an instance, where IB(G_W̄, Y) = {W, T}. Intervening on W cuts the influence of S and of the UC between W and X, while still allowing the UC to affect X.5 Similarly, one can see in Fig. 4b that IB(G_X̄, Y) = {T, W, X}, where intervening on X lets Y be the only element of the MUCT, making its parents an interventional border, hence, a POMIS. Note that pa(Y)_G is always a POMIS since, with pa(Y)_G intervened on, the MUCT is {Y} and the IB is pa(Y)_G. With Prop. 5, one can enumerate the POMISs given ⟦G, Y⟧ by considering all subsets of V\{Y}. We show in the sequel that this strategy encompasses all the POMISs.

Theorem 6. Given ⟦G, Y⟧, X ⊆ V\{Y} is a POMIS if and only if IB(G_X̄, Y) = X.

5 Note that exogenous variables that do not affect more than one endogenous variable (i.e., non-UCs) are not explicitly represented in the graph.

Algorithm 1: Enumerating all POMISs with ⟦G, Y⟧
1: function POMISs(G, Y)
2:   T, X = MUCT(G, Y), IB(G, Y); H = G_X̄[T ∪ X]
3:   return {X} ∪ subPOMISs(H, Y, reversed(topological-sort(H)) ∩ (T \ {Y}), ∅)
4: function subPOMISs(G, Y, π, O)
5:   P = ∅
6:   for π_i ∈ π do
7:     T, X, π′, O′ = MUCT(G_π̄ᵢ, Y), IB(G_π̄ᵢ, Y), π_{i+1:|π|} ∩ T, O ∪ π_{1:i−1}
8:     if X ∩ O′ = ∅ then
9:       P = P ∪ {X} ∪ (subPOMISs(G_X̄[T ∪ X], Y, π′, O′) if π′ ≠ ∅ else ∅)
10:  return P

Algorithm 2: POMIS-based kl-UCB
1: function POMIS-kl-UCB(B, G, Y, f, T)
2:   Input: B, a SCM-MAB; G, a causal diagram; Y, a reward variable
3:   A = ∪_{X∈POMISs(G,Y)} D(X)
4:   kl-UCB(B, A, f, T)

Thm. 6 provides a necessary and sufficient graphical condition for a set of variables to be a POMIS given ⟦G, Y⟧. This characterization allows one to determine all the arms in a SCM-MAB that are worth intervening on and, therefore, to be free from pulling the other, unnecessary arms.

4 Algorithmic characterization of POMIS

Although the graphical characterization provides a means to enumerate the complete set of POMISs given ⟦G, Y⟧, a naive implementation requires time exponential in |V|. We construct an efficient algorithm (Alg. 1) that enumerates all the POMISs based on Props. 7 and 8 below and the graphical characterization introduced in the previous section (Thm. 6).

Proposition 7. Let T and X be MUCT(G_W̄, Y) and IB(G_W̄, Y), respectively, relative to G and Y. Then, for any Z ⊆ V\T, MUCT(G_{X̄∪Z̄}, Y) = T and IB(G_{X̄∪Z̄}, Y) = X.

Proposition 8. Let H = G_X̄[T ∪ X], where T and X are the MUCT and IB given ⟦G_W̄, Y⟧, respectively. Then, for any W′ ⊆ T\{Y}, H_W̄′ and G_{W̄∪W̄′} yield the same MUCT and IB with respect to Y.

Prop. 7 allows one to avoid examining G_W̄ for every W ⊆ V\{Y}. Prop. 8 characterizes the recursive nature of MUCT and IB, whereby the identification of POMISs can be evaluated on subgraphs. Based on these results, we design a recursive algorithm (Alg. 1) that explores subsets of V\{Y} in a certain order. See Fig. 4e for an example where the subsets of {X, Z, W} are connected based on the set-inclusion relationship and an order of variables, e.g., (X, Z, W). That is, there exists a directed edge between two sets if (i) one set is larger than the other by one variable and (ii) that variable's index (as in the order) is larger than the indices of the variables in the smaller set. The diagram traces how the algorithm explores the subsets following the edges, while effectively skipping nodes.
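Below is our Python rendering of Alg. 1 (a sketch reconstructed from the pseudocode, not the authors' released implementation), reusing the `muct_ib` and `ancestors` helpers from the previous sketch; on the IV graph of Fig. 3c it returns {Z} and {X}:

```python
# Sketch of Alg. 1; assumes muct_ib() and ancestors() from above are in scope.
def cut(pa, bi, w):
    # G_{W-bar}: drop directed and bidirected edges into W
    return ({v: (set() if v in w else set(ps)) for v, ps in pa.items()},
            {e for e in bi if not (e & w)})

def induced(pa, bi, vs):
    # vertex-induced subgraph G[vs]
    return ({v: pa[v] & vs for v in vs}, {e for e in bi if e <= vs})

def rev_topo(pa):
    out, placed = [], set()
    while len(out) < len(pa):             # place parents before children ...
        for v in pa:
            if v not in placed and pa[v] <= placed:
                out.append(v); placed.add(v)
    return list(reversed(out))            # ... then reverse the order

def pomiss(pa, bi, y):
    t, x = muct_ib(pa, bi, y)
    hp, hb = induced(*cut(pa, bi, x), t | x)          # H = G_{X-bar}[T u X]
    order = [v for v in rev_topo(hp) if v in t - {y}]
    return {frozenset(x)} | sub_pomiss(hp, hb, y, order, set())

def sub_pomiss(pa, bi, y, order, o):
    p = set()
    for i, v in enumerate(order):
        t, x = muct_ib(*cut(pa, bi, {v}), y)
        order2 = [w for w in order[i + 1:] if w in t]
        o2 = o | set(order[:i])
        if not (x & o2):                  # Line 8: skip overlapping branches
            p.add(frozenset(x))
            if order2:
                p |= sub_pomiss(*induced(*cut(pa, bi, x), t | x), y, order2, o2)
    return p

pa = {'Z': set(), 'X': {'Z'}, 'Y': {'X'}}
bi = {frozenset({'X', 'Y'})}
print(pomiss(pa, bi, 'Y'))                # {frozenset({'Z'}), frozenset({'X'})}
```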
Given G and Y, POMISs (Alg. 1) computes a POMIS, i.e., IB(G, Y). Then, the recursive procedure subPOMISs is called with an order of variables (Line 3). subPOMISs examines candidate POMISs by intervening on one additional variable of the given graph at a time (Lines 6–9). If the IB (X in Line 7) of such an intervened graph intersects with O′ (a set of variables that should be considered in other branches), then no subsequent call is made (Line 8). Otherwise, a subsequent subPOMISs call takes as arguments a MUCT-IB induced subgraph (Prop. 8), a refined order, and a set of variables not to be intervened on in the given branch. For clarity, we provide a detailed working example in Appendix C with Fig. 4a, where the algorithm explores only four intervened graphs (G, G_X̄, G_Z̄, G_W̄) and generates the complete set of POMISs {{S, T}, {T, W}, {T, W, X}}.

Theorem 9 (Soundness and Completeness). Given information ⟦G, Y⟧, the algorithm POMISs (Alg. 1) returns all, and only, POMISs.

The POMISs algorithm can be combined with a MAB algorithm, such as kl-UCB, creating a simple yet effective SCM-MAB solver (see Alg. 2). kl-UCB satisfies lim sup_{T→∞} E[Reg_T]/ln T ≤ Σ_{x: µ_x<µ*} (µ* − µ_x)/KL(µ_x, µ*), where KL is the Kullback-Leibler divergence between two Bernoulli distributions [Garivier and Cappé, 2011]. It is clear that the reduction in the size of the arm set will lower the upper bound on the corresponding cumulative regret.

Figure 5: Comparisons across tasks (columns) with cumulative regrets (top) and optimal-arm selection probability (bottom), with TS as solid and kl-UCB as dashed lines. Best viewed in color.

5 Experiments

In this section, we present empirical results demonstrating that the selection of arms based on POMISs makes standard MAB solvers converge faster to an optimal arm. We employ two popular MAB solvers: kl-UCB, which enjoys cumulative regret growing logarithmically with the number of rounds [Cappé et al., 2013], and Thompson sampling (TS, Thompson [1933]), which has strong empirical performance [Kaufmann et al., 2012]. We considered four strategies for selecting arms: POMISs, MISs, Brute-force, and All-at-once, where Brute-force evaluates all combinations of arms, ∪_{X⊆V\{Y}} D(X), and All-at-once intervenes on all variables simultaneously, D(V\{Y}), oblivious to the causal structure and any knowledge about the action space. The performance of the eight (4 × 2) algorithms is evaluated on three different SCM-MAB instances (the detailed parametrizations are provided in Appendix D). We set the horizon large enough so as to observe near convergence, and repeat each simulation 300 times. We plot (i) the average cumulative regret (CR) along with its standard deviation and (ii) the probability of an optimal arm being selected, averaged over the repeated tests (OAP).6,7

6 All the code is available at https://github.com/sanghack81/SCMMAB-NIPS2018

7 One may surmise that combinatorial bandit (CB) algorithms can be used to solve SCM-MAB instances by noting that an intervention can be encoded as a binary vector, where each dimension in the vector corresponds to intervening on a single variable with a specific value. However, the two settings invoke very different sets of assumptions, which makes their solvers difficult to compare in a reasonably fair way. For instance, the current generation of CB algorithms is oblivious to the underlying causal structure, which makes them resemble very closely the Brute-force strategy, the worst possible method for SCM-MABs. Further, the assumption of linearity is arguably among the most popular considered by CB solvers. The corresponding algorithms, however, would be unable to learn the arms' rewards properly, since a SCM-MAB is nonparametric, making no assumption about the underlying structural mechanisms. These are just a few immediate examples of the mismatches between the current generations of algorithms for causal and combinatorial bandits.
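A sketch of Alg. 2's arm-set construction together with a Bernoulli kl-UCB index (the bisection solver and binary domains are our illustrative choices; `pomiss` refers to the sketch in Section 4 above):

```python
import numpy as np
from itertools import product

# Sketch of Alg. 2: restrict the arm set to interventions on POMISs, then
# hand it to a standard solver such as kl-UCB.
def arms(pomis_sets, dom=(0, 1)):
    # A = union over POMISs X of D(X), one dict {var: value} per arm
    return [dict(zip(sorted(s), vals))
            for s in pomis_sets for vals in product(dom, repeat=len(s))]

def kl(p, q, eps=1e-12):
    # KL divergence between Bernoulli(p) and Bernoulli(q)
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ucb_index(mean, n, t):
    # largest q with n * KL(mean, q) <= log t, found by bisection
    lo, hi = mean, 1.0
    for _ in range(30):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if n * kl(mean, mid) <= np.log(t) else (lo, mid)
    return lo

# e.g., for the IV graph, the POMISs {X} and {Z} yield four arms in total:
print(arms({frozenset({'X'}), frozenset({'Z'})}))
# at round t, a kl-UCB agent plays argmax_a kl_ucb_index(mu_hat[a], n[a], t)
```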
Task 1: We start by analyzing a Markovian model. We note that, by Cor. 3, searching for the arms within the parent set is sufficient in this case. The numbers of arms for POMISs, MISs, Brute-force, and All-at-once are 4, 49, 81, and 16, respectively. Note that there are 4 optimal arms within the All-at-once arms: for instance, if the optimal parent configuration is X1 = x1, X2 = x2, this strategy also includes the combinations with Z1 = z1, Z2 = z2, ∀z1, z2. The simulated results are shown in Fig. 5a. The CRs at round 1000 with kl-UCB are 3.0, 48.0, 72, and 12 (in that order), and all strategies were able to find the optimal arms by this time. POMIS and All-at-once first reached 95% OAP at rounds 20 and 66, respectively. There are two interesting observations at this point. First, at an early stage, the OAP for MISs is smaller than for Brute-force, since MIS has only 1 optimal arm among 49 arms, while Brute-force has 9 among 81. The advantage of employing MIS over Brute-force is only observed after a sufficiently large number of plays. More interestingly, POMIS and All-at-once have the same ratio of optimal to non-optimal arms (1:3 versus 4:12); however, POMIS dominates All-at-once, since the agent can learn more about the mean reward of the optimal arm while playing non-optimal arms less. Naturally, this translates into less variability and additional certainty about the optimal arm, even in Markovian settings.

Task 2: We consider the setting known as the instrumental variable (IV) model, which was shown in Fig. 3c. The optimal arm in this simulation is setting Z = 0. The numbers of arms for the four strategies are 4, 5, 9, and 4, respectively. The results are shown in Fig. 5b. Since the All-at-once strategy only considers non-optimal arms (i.e., pulling Z and X together), it incurs linear regret without ever selecting an optimal arm (0%). The CRs (and OAPs) at round 1000 with TS are POMIS 16.1 (98.67%), MIS 21.4 (99.00%), Brute-force 42.9 (93.33%), and All-at-once 272.1 (0%). At round 5000, where Brute-force has nearly converged, the ratio of the CRs for Brute-force and POMIS is 54.2/18.1 = 2.99 ≈ 2.67 = (9 − 1)/(4 − 1). POMIS, MIS, and Brute-force first hit 95% OAP at rounds 172, 214, and 435, respectively.

Task 3: Finally, we study the more involved scenario shown in Fig. 4a. In this case, the optimal arm is intervening on {S, T}, which means that the system should follow its natural flow of UCs, something All-at-once is unable to do. There are 16, 75, 243, and 32 arms for the strategies (in that order). The results are shown in Fig. 5c. The CRs (and OAPs) at round 10000 with TS are POMIS 91.4 (99.0%), MIS 472.4 (97.0%), Brute-force 1469.0 (85.0%), and All-at-once 2784.8 (0%). Similarly, the ratio (at round 10000) is 1469.0/91.4 = 16.07 ≈ 16.13 = (243 − 1)/(16 − 1), which is expected to increase, since Brute-force has not yet converged at this point. Only POMIS and MIS achieved 95% OAP, first at 684 and 3544 steps, respectively.
We start by noticing that the reduction in CR is approximately proportional to the reduction in the number of non-optimal arms played under (PO)MIS by the corresponding algorithm, which makes the POMIS-based solver the clear winner throughout the simulations. It is still conceivable that the number of arms examined by All-at-once could be smaller than for POMIS in a specific SCM-MAB instance, which would entail a lower CR for the former. However, such a lower CR in some instances does not constitute any sort of assurance, since arms excluded from All-at-once, but included in POMIS, can be optimal in some SCM-MAB instance conforming to ⟦G, Y⟧. Furthermore, a POMIS-based strategy always dominates the corresponding MIS and Brute-force ones. These observations together suggest that, in practice, a POMIS-based strategy should be preferred, given that it will always converge and will usually be faster than its counterparts. Remarkably, there is an interesting trade-off between having knowledge of the causal structure and not knowing the corresponding dependency structure among the arms, thereby potentially incurring linear regret (All-at-once) or exponential slowdown (Brute-force). In practice, for the cases in which the causal structure is unknown, the pulls of the arms themselves can be used as experiments and can be coupled with efficient strategies to simultaneously learn the causal structure [Kocaoglu et al., 2017].

6 Conclusions

We studied the problem of deciding whether an agent should perform a causal intervention and, if so, which variables it should intervene upon. The problem was formalized using the logic of structural causal models (SCMs) and a new type of multi-armed bandit called the SCM-MAB. We started by noting that whenever the agent cannot measure all the variables in the environment (i.e., unobserved confounders exist), standard MAB algorithms that are oblivious to the underlying causal structure may not converge, regardless of the number of interventions performed in the environment. (We note that the causal structure can easily be learned in a typical MAB setting, since the agent always has interventional capabilities.) We introduced a novel decision-making strategy based on properties following from the do-calculus, which allows the removal of redundant arms, and on the partial orders among the sets of variables existent in the underlying causal system, which leads to an understanding of the maximum achievable reward of each interventional set. Leveraging this new strategy based on possibly-optimal minimal intervention sets (POMISs), we developed an algorithm that decides whether (and, if so, where) interventions should be performed in the underlying system. Finally, we showed through simulations that this causally-sensible strategy is more efficient and more robust than its non-causal counterparts. We hope that the formal machinery and the algorithms developed here can help decision-makers make more principled and efficient decisions.

Acknowledgments

This research is supported in part by grants from IBM Research, Adobe Research, NSF IIS-1704352, and IIS-1750807 (CAREER).

References

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2/3):235–256, 2002.

Elias Bareinboim and Judea Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352, 2016.

Elias Bareinboim, Andrew Forney, and Judea Pearl.
Bandits with unobserved confounders: A causal approach. In Advances in Neural Information Processing Systems 28, pages 1342–1350, 2015.

Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.

Nicolò Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.

Richard Combes and Alexandre Proutiere. Unimodal bandits: Regret lower bounds and optimal algorithms. In International Conference on Machine Learning, pages 521–529, 2014.

Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Proceedings of the Conference on Learning Theory (COLT), pages 355–366, 2008.

Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079–1105, 2006.

Andrew Forney, Judea Pearl, and Elias Bareinboim. Counterfactual data-fusion for online reinforcement learners. In Proceedings of the 34th International Conference on Machine Learning, pages 1156–1164, 2017.

Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pages 359–376, 2011.

Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory, pages 199–213, 2012.

Murat Kocaoglu, Karthikeyan Shanmugam, and Elias Bareinboim. Experimental design for learning causal graphs with latent variables. In Advances in Neural Information Processing Systems 30, pages 7021–7031, 2017.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Finnian Lattimore, Tor Lattimore, and Mark D. Reid. Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems 29, pages 1181–1189, 2016.

Sanghack Lee and Elias Bareinboim. Structural causal bandits: Where to intervene? Technical Report R-36, Purdue AI Lab, Department of Computer Science, Purdue University, 2018.

Stefan Magureanu, Richard Combes, and Alexandre Proutiere. Lipschitz bandits: Regret lower bounds and optimal algorithms. In Proceedings of the 27th Conference on Learning Theory, pages 975–999, 2014.

Pedro A. Ortega and Daniel A. Braun. Generalized Thompson sampling for sequential decision-making and causal inference. Complex Adaptive Systems Modeling, 2(2), 2014.

Judea Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–688, 1995.

Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000. Second ed., 2009.

Rajat Sen, Karthikeyan Shanmugam, Alexandros G. Dimakis, and Sanjay Shakkottai. Identifying best interventions through online importance sampling. In Proceedings of the 34th International Conference on Machine Learning, pages 3057–3066, 2017.

Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. A Bradford Book, 2001.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
Csaba Szepesvári. Algorithms for Reinforcement Learning. Morgan and Claypool, 2010.

William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Jin Tian and Judea Pearl. A general identification condition for causal effects. In Proceedings of the Eighteenth National Conference on Artificial Intelligence, pages 567–573, 2002.

Junzhe Zhang and Elias Bareinboim. Transfer learning in multi-armed bandits: A causal approach. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), pages 1340–1346, 2017.