# Causal Contextual Bandits with Targeted Interventions

Published as a conference paper at ICLR 2022

Chandrasekar Subramanian1,2, Balaraman Ravindran1,2
1 Robert Bosch Centre for Data Science and Artificial Intelligence
2 Department of Computer Science and Engineering
Indian Institute of Technology Madras, Chennai, India
sekarnet@gmail.com, ravi@cse.iitm.ac.in

ABSTRACT

We study a contextual bandit setting where the learning agent has the ability to perform interventions on targeted subsets of the population, in addition to possessing qualitative causal side-information. This novel formalism captures intricacies in real-world scenarios such as software product experimentation, where targeted experiments can be conducted. However, this fundamentally changes the set of options available to the agent compared to standard contextual bandit settings, necessitating new techniques. This is also the first work that integrates causal side-information in a contextual bandit setting, where the agent aims to learn a policy that maps contexts to arms (as opposed to just identifying one best arm). We propose a new algorithm, which we show empirically performs better than baselines on experiments that use purely synthetic data and on real world-inspired experiments. We also prove a bound on regret that theoretically guards performance.

1 INTRODUCTION

Contextual bandits have been used as natural frameworks to model interactive decision-making scenarios such as recommendation systems (Liu et al., 2018), marketing campaign allocation (Sawant et al., 2018) and more (Bouneffouf & Rish, 2019). In this framework, the learning agent repeatedly interacts with an environment with the aim of learning a near-optimal decision policy that maps a context space to a set of actions (also referred to as arms or interventions1). In the standard stochastic variant, in each round of interaction, the agent observes a context from the environment and, once the agent chooses an action, the environment returns a sampled reward that is a function of the current context and chosen action (Lattimore & Szepesvári, 2020). The agent's objective is to minimize some meaningful measure of closeness of the policy to optimality (see Lattimore & Szepesvári (2020) for some standard definitions).

One of the key issues in the wide application of contextual bandits (and reinforcement learning in general (Dulac-Arnold et al., 2021)) is their need for a large number of samples, which are costly to actualize in practice. For example, each arm might correspond to performing a product experiment on users or to conducting a specific medical intervention. However, we show that there are nuances in real-world situations which, though not fully captured by the standard formulations of contextual bandits, allow us, if modeled and leveraged, to build methods that can improve the rate at which good policies can be identified.

A motivating example Consider a sales-assistance software agent that is learning to suggest a campaign that is optimal for a given sales lead (defined by a set of features). With software products, there is often an opportunity to conduct experiments with variants of the software on defined subsets of the population of users (each specified by a set of characteristics, or context variables) to learn about resulting metrics; see Google (2021) for an example. We can think of these experiments as constituting a training phase.
More specifically, the agent, for example, can conduct a campaign (i.e., the intervention) on a specific subset of users (for example, the context value os = iOS could define a subset) and observe the outcome. Thus, instead of necessarily performing an intervention on a randomly sampled user coming from the natural population distribution, the agent can instead choose to target the intervention on a randomly selected user with specific characteristics (as defined by an assignment of context variable values). This type of intervention, which we call a targeted intervention, fundamentally changes the set of options that the agent has in every training round. In addition, the training phase often happens in a more lenient environment where the agent might have access to auxiliary context variables (such as IT-spend) that are unavailable in the evaluation phase. Further, there is also causal side-information sometimes available. For instance, we might know that email-subject causes open-email, and not the other way around. Encoding this qualitative side-information as a causal graph can help the agent make use of this structure. Lattimore et al. (2016) and Yabe et al. (2018) demonstrate this in the best-arm identification case. After training, there is an evaluation phase (e.g., when the agent is deployed), where the agent observes the main context variables and decides an action; its regret performance is measured at this point.

1 We use the term intervention here because actions or arms in a (contextual) bandit setting can be interpreted as Pearl do() interventions on a causal model (Zhang & Bareinboim, 2017; Lattimore et al., 2016). See Pearl (2009; 2019) for more discussion on the do() operation.

Our framework Our framework captures the above intricacies (which are not restricted to software product experimentation). The formal treatment is in Section 2; we present an overview here. The agent has T rounds of interaction that act as the training phase, followed by a (T + 1)-th round on which it is evaluated. In each of the T training rounds, it has the choice to perform either a targeted intervention or a standard interaction. Further, in addition to the main context variables, the agent can observe a (possibly empty) set of additional auxiliary context variables during training; these auxiliary context variables are not observable during evaluation. All variables are made available to the agent at the end of every training round (this is similar to Lattimore et al. (2016)). The agent also has access to a causal graph that models the qualitative causal relationships between the variables; there are very few assumptions made on this graph (discussed in Section 2.1). The causal graph allows a factorized representation of the joint distribution of the variables and, as a result, enables information leakage, i.e., updating beliefs about several interventions after every single intervention. Also, importantly, we allow context variables to be categorical, and therefore the usual assumptions that enable generalization of the learned policy across contexts, such as linearity (e.g., in Dimakopoulou et al. (2019)), become invalid. The agent's problem is one of sample allocation: how to allocate samples across the T training rounds so as to learn a policy that minimizes regret in the (T + 1)-th round that represents the evaluation phase.
In the evaluation phase, the agent observes the main context variables from the environment, against which it chooses an action and receives a reward, much like in a standard contextual bandit setting. Since targeted interventions are restricted to the training phase, this introduces a difference between the training and evaluation phases. The agent's challenge, therefore, is to learn policies that minimize regret in the evaluation phase using samples it collects in the training phase (from a combination of standard interactions and targeted interventions). The algorithm we propose utilizes a novel entropy-like measure called Unc that guides this sample allocation in a way that also exploits the information leakage.

1.1 CONTRIBUTIONS

In this paper, we formalize this modified setting, which we call "causal contextual bandits with targeted interventions", provide a novel algorithm, and show both theoretical and experimental results. Specifically, our contributions are:

- We formalize the more nuanced contextual bandit setting described above (Sec. 2). This is the first work that we know of that formulates a contextual bandit setting with causal side-information. This is also the first paper we are aware of that introduces targeted interventions in a contextual bandit setting.
- We propose a new algorithm based on minimizing a novel entropy-like measure (Sec. 3.1).
- We prove a bound on its regret, providing a theoretical guard on its performance (Sec. 3.2).
- We show results of experiments that use purely synthetic data (Sec. 4.1). The results demonstrate that our algorithm performs better than baselines.
- We also run experiments that are inspired by proprietary data from Freshworks Inc. (Sec. 4.2), providing evidence of our algorithm's performance in more realistic scenarios as well.

Our motivation comes from real-world settings where experiments are costly; therefore, we are interested in empirical behavior when the training budget T is relatively small.

1.2 RELATED WORK

Causal bandits have been studied in the literature recently (Lattimore et al., 2016; Sen et al., 2017; Yabe et al., 2018; Lu et al., 2020); they leverage causal side-information to transfer knowledge across interventions. They, however, have been studied only in a best-arm identification setting, with each arm modeled as an intervention on a set of variables; the objective is to learn a single best arm. Our setting differs significantly since the agent attempts to learn a policy (by which we mean a mapping from contexts to arms) instead of a single best arm. Therefore, while the agent can perform targeted interventions that specify both the context and action, it is still attempting to learn an optimal context-action mapping.

The contextual bandit literature is well-studied (see Zhou (2016) and Lattimore & Szepesvári (2020) for surveys), and has taken various approaches to enable knowledge transfer across contexts. For example, Dimakopoulou et al. (2019) assume the expected reward is a linear function of the context, while Agarwal et al. (2014) make assumptions about the existence of a certain oracle. However, there has not been work that has looked at contextual bandits which utilize causal side-information as a way to transfer knowledge across contexts, or considered targeted interventions in the space of options for the agent.
Our method utilizes the information leakage that the causal graph provides not just to learn about other interventions after every intervention, but also to be smarter about the choice of targeted interventions to conduct.

Another related area is active learning (Settles, 2009). In active learning, a supervised learning agent gets the option to actively query an oracle to obtain the true labels for certain data points. However, it is in the supervised learning setting where the agent receives true-label feedback, whereas in our setting the agent only receives bandit feedback (that is, only for the action that was taken). Nevertheless, our work can be thought of as infusing some elements of the active learning problem into the contextual bandit setting by providing the agent the ability to perform targeted interventions.

There has been some work investigating the intersection of causality and bandits with the aim of transfer learning. Zhang & Bareinboim (2017) study transfer of knowledge from offline data in the presence of unobserved confounders, in a non-contextual setting. As can be seen, this differs significantly from our setting.

2 FORMALISM

We assume that the underlying environment is modeled as a causal model M, which is defined by a directed acyclic graph G over all variables, i.e., the variable to be intervened upon (X), the reward variable (Y), and the set of context variables (C), and a joint probability distribution P that factorizes over G (Pearl, 2009; Koller & Friedman, 2009). C is partitioned into the set of main context variables (Ctar) and the set of auxiliary context variables (Cother). G is sometimes called the causal graph or causal diagram of M. Each variable takes on a finite, known set of values; note that this is quite generic, and allows for categorical variables. An intervention do(X = x) on M involves removing all arrows from the parents of X into X, and setting X = x (Pearl, 2009). The agent has access only to G and not to M; therefore, the agent is not given any knowledge a priori about the underlying conditional probability distributions (CPDs) of the variables. We specify a targeted intervention X = x conditioned on context Ctar = ctar succinctly by (x, ctar). While standard interventions (Pearl, 2009) result in a distribution of the form P(. | do(x)), targeted interventions result in P(. | do(x), Ctar = ctar). Table 1 summarizes the key notation used in this paper.

Table 1: Summary of key notation

- X: the variable that will be intervened upon.
- Y: the reward variable.
- Ctar, Cother: the set of main context variables and the set of auxiliary context variables, respectively; the set of all context variables is C = Ctar ∪ Cother = {..., Ci, ...}.
- Capital letters: a random variable; e.g., C1 or X.
- Small letters: a random variable's value; e.g., c1 or x.
- Small bold font: a vector of random variable values; e.g., ctar denotes a specific choice of values taken by the variables in Ctar.
- (x, ctar): specifies the targeted intervention X = x on the subset defined by Ctar = ctar.
- P̂, Ê: estimates of the distribution P and the expectation E based on current beliefs.
- θV|paV: beliefs (vector of Dirichlet parameters) about the parameters of P(V | paV); θ(v)V|paV is the entry corresponding to V = v.
- val(V), val(V): the set of values taken by the variable V, and by the set of variables V, respectively.
- NV, NV: the cardinalities of val(V) and val(V), respectively.
- paV: the value of the variables in PAV, the parents of V. For Y, we let PAY denote its parents excluding X, to keep the proof easier to read.
- T: the number of training rounds.
- α: the fraction of rounds in Phase 1 of training.
- aB (restriction, written as a subscript): if a is an assignment of values to A, then aB is the assignment of those values to the respective variables B ⊆ A.
- Set operations: for readability, we use them on vectors as well, where there is no ambiguity.

The agent is allowed T training rounds of interaction with the environment, at the end of which it aims to have learned a policy ϕ̂ : val(Ctar) → val(X). Specifically, in each of the T training rounds, the agent can choose to either:

- (Standard interaction) Observe context ctar ∼ P(Ctar), choose intervention x, and observe (y, cother) ∼ P(Y, Cother | do(x), ctar), or
- (Targeted intervention) Choose intervention x and a target subset given by context values Ctar = ctar, and observe (y, cother) ∼ P(Y, Cother | do(x), ctar),

where, when there is no ambiguity, we use P interchangeably to mean either the joint distribution or a marginal distribution. Note that in both modes of interaction above, the intervention is only on X; they differ, however, in whether the agent observes the context values from the environment or whether it chooses the context values on which to condition the intervention.

After training, the agent is evaluated in the (T + 1)-th round. Here, the agent is presented a query context ctar ∼ P(Ctar), to which it responds with an action x = ϕ̂(ctar) using the learned policy, and receives a reward y ∼ P(Y | do(x), ctar) from the environment. The objective of the agent is to learn a policy ϕ̂ that minimizes simple regret, which is defined as:

$$\mathrm{Regret} \;\triangleq\; \sum_{c_{tar}} \left[\mu^{*}_{c_{tar}} - \hat{\mu}_{c_{tar}}\right] P(c_{tar}) \;=\; \sum_{c_{tar}} \mathrm{Regret}(c_{tar})\, P(c_{tar})$$

where ϕ* is an optimal policy, µ*_ctar ≜ E[Y | do(ϕ*(ctar)), ctar] and µ̂_ctar ≜ E[Y | do(ϕ̂(ctar)), ctar].

2.1 ASSUMPTIONS

We are interested in DAGs G where there is an edge X → Y and no other directed path between X and Y; that is, X directly affects Y, and there is no context variable caused by X. This models the natural situation we encounter in settings such as a recommendation system, where the click-through rate (Y) of a user is directly affected by the user's features (the context) and the recommendation (X); the recommendation does not cause any contexts, so it does not have a path to Y through them. We assume that there are no unobserved confounders (UCs); but purely interactive bandit settings, such as ours, are robust to direct UCs between X and Y since actions can be interpreted as sampling from the P(. | do(x)) distribution (Bareinboim et al., 2015; Guo et al., 2020). We make an additional assumption to simplify the factorization in Section 3.2: (A1) {C : C confounds some C′ ∈ Ctar and Y} ⊆ Ctar. A sufficient condition for (A1) to be true is if Ctar is ancestral (i.e., Ctar contains all its ancestors). This assumption does not preclude UCs (for example, the graph used in the synthetic experiments in Section 4.1 could have a UC between X and Y), and can be relaxed in the future.
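To make the two interaction modes concrete, the following minimal Python sketch instantiates them on the toy graph used later in Section 4.1 (C1 → C0, C0 → X, C0 → Y, X → Y). The CPD numbers and all function names here are illustrative assumptions of ours, not taken from the paper's released code.

```python
# Illustrative sketch of the two training-round interaction modes from Section 2,
# instantiated on the toy graph of Section 4.1 (C1 -> C0, C0 -> X, C0 -> Y, X -> Y).
# All CPD numbers and names below are made up for illustration.
import random

P_C1 = {0: 0.8, 1: 0.2}                       # P(C1)
P_C0_given_C1 = {0: {0: 0.7, 1: 0.3},         # P(C0 | C1)
                 1: {0: 0.2, 1: 0.8}}
P_Y1_given_X_C0 = {(0, 0): 0.1, (0, 1): 0.5,  # P(Y = 1 | X, C0); setting x directly
                   (1, 0): 0.6, (1, 1): 0.3}  # corresponds to do(X = x), which cuts C0 -> X

def sample(dist):
    vals = list(dist.keys())
    return random.choices(vals, weights=[dist[v] for v in vals])[0]

def standard_interaction(choose_x):
    """Observe c_tar ~ P(C_tar), choose x, then observe (y, c_other)."""
    c1 = sample(P_C1)                          # context revealed by the environment
    x = choose_x(c1)
    c0 = sample(P_C0_given_C1[c1])             # auxiliary context, observed after the round
    y = int(random.random() < P_Y1_given_X_C0[(x, c0)])
    return c1, x, c0, y

def targeted_intervention(x, c1):
    """Choose both x and the target subset C_tar = c1, then observe (y, c_other)."""
    c0 = sample(P_C0_given_C1[c1])             # c_other still follows P(C_other | c_tar)
    y = int(random.random() < P_Y1_given_X_C0[(x, c0)])
    return c1, x, c0, y

# e.g., one Phase-1-style round: standard_interaction(lambda c1: random.choice([0, 1]))
```

In both modes the auxiliary context C0 is revealed only at the end of the round, matching the protocol above; the targeted mode simply replaces the environment's draw of Ctar with the agent's choice.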
3 SOLUTION APPROACH

3.1 ALGORITHM (UNC CCB)

Algorithm 1a: Training phase of Unc CCB
Data: Graph G; fraction α of Phase 1 rounds.
Initialization: For all V ∈ C ∪ {Y}, set θV|paV = (1, ..., 1), for all values of paV.
1  (Phase 1) for t = 1, ..., αT do
2      Observe context ctar ∼ P(Ctar)
3      Choose x ∈ val(X) uniformly at random
4      Observe (y, cother) ∼ P(Y, Cother | do(x), ctar)
5      for V in C do
6          updateBeliefs(V, cPAV, cV)    // denote ctar ∪ cother by c
8      updateBeliefs(Y, (cPAY, x), y)
10 (Phase 2) for t = αT + 1, ..., T do
11     Choose (x, ctar) as: argmin over x ∈ val(X), ctar ∈ val(Ctar) of Σ over x′ ∈ val(X), c′tar ∈ val(Ctar) of Unc[ E[Y | do(x′), c′tar] | x, ctar ]
12     Do targeted intervention (x, ctar) and observe (y, cother) ∼ P(Y, Cother | do(x), ctar)
13     for V in Cother do
14         updateBeliefs(V, cPAV, cV)
16     updateBeliefs(Y, (x, cPAY), y)
Result: Final set of beliefs for all V, paV: (..., θV|paV, ...)
18 Procedure updateBeliefs(V, paV, v):
19     θ(v)V|paV ← θ(v)V|paV + 1

Algorithm 1b: Evaluation phase of Unc CCB
Data: Graph G; learned beliefs (..., θV|paV, ...); user context to be decided, ctar.
1  for every V, paV do
2      for v ∈ val(V) do
3          Set P̂(V = v | paV) = θ(v)V|paV / Σv′ θ(v′)V|paV
4      end
6  for x ∈ val(X) do
7      Compute ψ̂(x, ctar) ≜ Ê[Y | do(x), ctar] using P̂ in Equation (2)
Result: Return ϕ̂(ctar) ≜ argmaxx ψ̂(x, ctar)

We call our proposed algorithm Unc CCB. The training phase of Unc CCB (given as Algorithm 1a) consists of two phases. In each round of Phase 1 (a fraction α of the rounds), it observes a context from the environment, then uniformly at random chooses an action, and finally observes the remaining variables. The observed values are used to perform standard Bayesian updates over the beliefs about the parameters of all relevant CPDs. Since we do not assume any a priori beliefs for the agent, this uniform exploration helps the agent enter Phase 2 with reasonable starting beliefs.

In Phase 2 (the remaining (1 − α) fraction of the rounds) of training, the agent tries to allocate targeted interventions optimally. To this end, it needs to trade off between exploring new contexts and gaining more knowledge about already-explored contexts, while taking into account the information leakage resulting from the shared pathways in the causal graph (or equivalently, shared CPDs in the factorized representation of P; Appendix E discusses a related subcase). The novel, entropy-like Unc measure (discussed below) helps guide the agent in this decision in every round. Algorithm 1b specifies the evaluation phase of Unc CCB, where the agent is evaluated on simple regret.
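The following sketch shows, for the same toy graph, one way the belief bookkeeping of Algorithms 1a/1b could be implemented in Python: Dirichlet pseudo-counts per CPD entry, the updateBeliefs increment (Line 19), the posterior-mean estimate P̂ (Algorithm 1b, Line 3), and the estimate ψ̂(x, ctar) obtained by summing out Cother as in Equation (2). The names and data layout are our own illustrative choices, not the authors' implementation.

```python
# Sketch of the belief bookkeeping behind Algorithms 1a/1b for the Section 4.1 toy graph:
# Dirichlet counts per CPD entry, the updateBeliefs step, and the evaluation-phase
# estimate psi_hat(x, c_tar) = E_hat[Y | do(x), c_tar] via the factorization (Eq. 2).
from collections import defaultdict

N_X, N_C0 = 2, 2
theta_C0 = defaultdict(lambda: [1.0, 1.0])      # theta_{C0 | c1}, indexed by c1
theta_Y = defaultdict(lambda: [1.0, 1.0])       # theta_{Y | x, c0}, indexed by (x, c0)

def update_beliefs(theta, parents, value):
    """Line 19 of Algorithm 1a: add one pseudo-count to the observed entry."""
    theta[parents][value] += 1.0

def p_hat(theta, parents, value):
    """Algorithm 1b, line 3: posterior-mean estimate of the CPD entry."""
    counts = theta[parents]
    return counts[value] / sum(counts)

def psi_hat(x, c1):
    """E_hat[Y | do(x), c1] obtained by summing out C0, as in Equation (2)."""
    return sum(p_hat(theta_Y, (x, c0), 1) * p_hat(theta_C0, c1, c0)
               for c0 in range(N_C0))

def policy(c1):
    """Algorithm 1b result: pick the arm with the highest estimated reward."""
    return max(range(N_X), key=lambda x: psi_hat(x, c1))

# After a training round that produced (c1, x, c0, y):
# update_beliefs(theta_C0, c1, c0); update_beliefs(theta_Y, (x, c0), y)
```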
Intuition behind the Unc measure As mentioned before, performing a targeted intervention given by (x, ctar) would allow the agent to update its beliefs about the effects of other targeted interventions (x′, c′tar). Intuitively, we would want the algorithm to allocate more samples to those targeted interventions that are most informative about the most valuable (in terms of reward Y) targeted interventions. Unc captures this intuition by providing a measure of the effect on the agent's knowledge of E[Y | do(x′), c′tar] when a targeted intervention (x, ctar) is performed. The agent then aggregates this over all possible (x′, c′tar), providing a measure of the overall resulting knowledge from targeted intervention (x, ctar); this helps guide its sample allocation (see Line 11 of Algorithm 1a).

Definition of the Unc measure We model the conditional distribution P(V | paV) for any variable V as a categorical distribution whose parameters are sampled from a Dirichlet distribution. That is, P(V | paV) = Cat(V; b1, ..., br), where (b1, ..., br) ∼ Dir(θV|paV). Here θV|paV is a vector of length, say, r. Let θV|paV[i] denote the i-th entry of θV|paV. We define an object called Ent that captures a measure of our knowledge of the CPD:

$$\mathrm{Ent}(P(V \mid pa_V)) \;\triangleq\; -\sum_{i} \left[ \frac{\theta_{V|pa_V}[i]}{\sum_j \theta_{V|pa_V}[j]} \, \ln \frac{\theta_{V|pa_V}[i]}{\sum_j \theta_{V|pa_V}[j]} \right]$$

Let the parents of V be PAV = (U1, ..., Up), and suppose they take a particular set of values paV = (u1, ..., up). We say that the CPD P(V | paV) is unaffected by the targeted intervention (x, ctar) if there exists i ∈ {1, ..., p} such that either (1) Ui is X and ui ≠ x, or (2) Ui ∈ Ctar and the value that ctar assigns to Ui differs from ui. In other words, the CPD P(V | paV) is unaffected by (x, ctar) if doing this targeted intervention and observing all variables (x, ctar, cother) does not enable us to update beliefs about P(V | paV) using knowledge of G. Now, define

$$\mathrm{Ent}(P(V \mid pa_V) \mid x, c_{tar}) \;\triangleq\; \begin{cases} \mathrm{Ent}(P(V \mid pa_V)), & \text{if } P(V \mid pa_V) \text{ is unaffected by } (x, c_{tar}) \\ \mathrm{Ent}_{new}(P(V \mid pa_V)), & \text{otherwise} \end{cases}$$

where Entnew is computed by averaging the resulting Ent values over the possible belief updates the agent might make after performing (x, ctar). That is,

$$\mathrm{Ent}_{new}(P(V \mid pa_V)) \;=\; \frac{1}{r} \sum_{i} \mathrm{Ent}\big(\mathrm{Cat}(b'_1, ..., b'_r)\big), \quad \text{where } (b'_1, ..., b'_r) \sim \mathrm{Dir}\big(..., \theta_{V|pa_V}[i-1],\ \theta_{V|pa_V}[i] + 1,\ \theta_{V|pa_V}[i+1], ...\big)$$

Letting c′ denote c′tar ∪ c′other, we can now define Unc, which captures our knowledge of E[Y | do(x′), c′tar] if we perform some other targeted intervention given by (x, ctar):

$$\mathrm{Unc}\big[\, E[Y \mid do(x'), c'_{tar}] \;\big|\; x, c_{tar} \big] \;\triangleq\; \sum_{c'_{other} \in val(C_{other})} \left[ \sum_{V \in C_{other}} \mathrm{Ent}\big(P(V \mid c'_{PA_V}) \mid x, c_{tar}\big) \;+\; \mathrm{Ent}\big(P(Y \mid x', c'_{PA_Y}) \mid x, c_{tar}\big) \right] \hat{P}(c')\, \hat{E}[Y \mid c', do(x')]$$
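A small sketch of how the quantities above can be computed for the Section 4.1 toy graph (Ctar = {C1}, Cother = {C0}) is given below. It follows our reading of the (reconstructed) definitions, with Ent as the entropy of the posterior-mean CPD, Ent_new as the average of Ent over the r possible single-count updates, and the unaffectedness test applied to the parents, so treat it as illustrative rather than the authors' exact implementation; all names are ours.

```python
# Sketch of the Unc computation for the toy graph C1 -> C0, C0 -> Y, X -> Y,
# following our reading of the Ent / Ent_new / Unc definitions above.
import math
from collections import defaultdict

def p_hat(theta, parents, value):
    counts = theta[parents]
    return counts[value] / sum(counts)

def ent(counts):
    """Entropy of the posterior-mean categorical distribution given Dirichlet counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts)

def ent_new(counts):
    """Average Ent over the r possible single pseudo-count updates."""
    r = len(counts)
    return sum(ent(counts[:i] + [counts[i] + 1.0] + counts[i + 1:]) for i in range(r)) / r

def ent_given(counts, unaffected):
    return ent(counts) if unaffected else ent_new(counts)

def unc(x_p, c1_p, x, c1, theta_C1, theta_C0, theta_Y):
    """Unc[ E[Y | do(x_p), c1_p] | x, c1 ] for the toy graph."""
    total = 0.0
    for c0 in range(2):                                     # sum over c_other = C0
        # P(C0 | c1_p) is unaffected unless the intervention targets the same c1_p
        e_c0 = ent_given(theta_C0[c1_p], unaffected=(c1 != c1_p))
        # P(Y | x_p, c0) is unaffected unless the intervention plays the same arm x_p
        e_y = ent_given(theta_Y[(x_p, c0)], unaffected=(x != x_p))
        value = (p_hat(theta_C1, (), c1_p) * p_hat(theta_C0, c1_p, c0)
                 * p_hat(theta_Y, (x_p, c0), 1))            # P_hat(c') * E_hat[Y | c', do(x_p)]
        total += (e_c0 + e_y) * value
    return total

theta_C1 = defaultdict(lambda: [1.0, 1.0])
theta_C0 = defaultdict(lambda: [1.0, 1.0])
theta_Y = defaultdict(lambda: [1.0, 1.0])
# Line 11 of Algorithm 1a then picks the (x, c1) minimizing the aggregate:
total_unc = sum(unc(x_p, c1_p, 1, 0, theta_C1, theta_C0, theta_Y)
                for x_p in range(2) for c1_p in range(2))
```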
3.2 REGRET BOUND

Theorem 3.1. For any 0 < δ < 1, with probability at least 1 − δ,

$$\mathrm{Regret} \;\leq\; 3\,\mathbb{E}_{pa_Y, c_{tar}}\!\left[\sqrt{\frac{2}{\frac{\alpha T}{N_X} P(pa_Y, c_{tar}) - \epsilon^{T}_{X,PA_Y}}\,\ln\frac{2 N_X (N_{\mathbf{C}} + |\mathbf{C}|)}{\delta}}\right] \;+\; 3\sum_{C \in C_{other}} \mathbb{E}_{pa_C, c_{tar}}\!\left[\sqrt{\frac{2}{\alpha T\, P(pa_C, c_{tar}) - \epsilon^{T}_{PA_C}}\,\ln\frac{2 (N_{\mathbf{C}} + |\mathbf{C}|)}{\delta}}\right] \tag{1}$$

where

$$\epsilon^{T}_{PA_C} = \sqrt{\frac{\alpha T}{2}\,\ln\frac{N_{PA_C} (N_{\mathbf{C}} + |\mathbf{C}|)}{\delta}}, \qquad \epsilon^{T}_{X,PA_Y} = \sqrt{\frac{\alpha T}{2}\,\ln\frac{N_X N_{PA_Y} (N_{\mathbf{C}} + |\mathbf{C}|)}{\delta}}$$

Proof. Due to space constraints, we only provide a proof sketch here. The full proof is provided in Appendix A as part of the Supplementary Text. For readability of the proof, we assume that Y and all variables in C are binary, whereas X can take on any finite set of values. Under Assumption (A1) described in Section 2.1, we can factorize as follows:

$$E[Y \mid do(x), c_{tar}] \;=\; \sum_{c_{other} \in val(C_{other})} P(Y = 1 \mid x, c_{PA_Y}) \prod_{c \in c_{other}} P(C = c \mid c_{PA_C}) \tag{2}$$

where c denotes ctar ∪ cother. The proof first bounds the errors of the estimates P̂(Y = 1 | x, cPAY) and P̂(C = c | cPAC) with high probability using union bounds, concentration bounds, and the fact that the number of rounds in which the agent observes any set of context values is lower bounded by the number of times it sees it in Phase 1. We also make use of the fact that all variables are observed after every round. We then aggregate these bounds on the individual CPD estimates to bound the overall estimate Ê[Y | do(x), ctar], and then finally bound regret by utilizing the fact that the algorithm chooses arms based on the estimated Ê.

Discussion Theorem 3.1 bounds the regret (with high probability). Note that T → ∞ implies Regret → 0, which shows that the regret bound has desirable limiting behavior.2 Further, regret is inversely related to T, similar to other simple-regret algorithms such as in Lattimore et al. (2016). However, other terms that are a function of the space of possible interventions (NC NX) are different since we are in a contextual bandit setting (whereas Lattimore et al. (2016) is non-contextual); more specifically, regret is related proportionally to √(NX NC ln(NX NC)); in fact, our regret bound is strictly tighter than this. We prove the bound primarily to provide a theoretical guard on regret; we do not claim improved regret bounds since there is no comparable simple-regret contextual bandit regret bound. We show improved performance over baselines empirically in Section 4.

2 There is a technical requirement to keep the denominator in Eq (1) greater than 0; however, it is easy to see that this only requires that T > constant, so there is no impact on asymptotic behavior.

4 EXPERIMENTAL EVALUATION

Since there is no previously studied setting that directly maps to ours, we adapt existing multi-armed bandit algorithms to make use of causal side-information and targeted interventions in a variety of ways to form a set of baselines against which we compare our algorithm's performance. The baseline algorithms are the following. Std TS and Std UniExp follow the standard bandit interaction protocol during training: observe context, choose action, receive reward. They treat each ctar as a separate problem and learn an optimal arm for each; they differ in that Std TS uses Thompson Sampling (Agrawal & Goyal, 2012) for each ctar problem, whereas Std UniExp does uniform exploration over val(X). The other set of algorithms are TargInt TS, TargInt UniExp and TargInt TS UniExp. These treat the space val(X) × val(Ctar) as the set of arms for targeted interventions. TargInt TS performs Thompson Sampling over this space, and TargInt UniExp performs uniform exploration over this space. TargInt TS UniExp is a variation of TargInt TS that, with some probability,3 chooses a random targeted intervention; this infuses a controlled degree of uniform exploration into TargInt TS. All five algorithms update CPD beliefs after each round. No baseline is strictly stronger than another; their relative performance can vary based on the specific setting. We also compare against a non-causal version of Std TS, which we call NonCausal Std TS, to provide one non-causal baseline (i.e., one which does not utilize the causal side-information). The evaluation phase for each baseline algorithm is analogous to Algorithm 1b: given a ctar, the action with the highest expected reward, based on beliefs at the end of the training phase, is chosen.

3 We choose this probability to be 0.5 for the plots. There is more discussion on this in Appendix F.
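For concreteness, one plausible reading of the TargInt TS and TargInt TS UniExp baselines is sketched below: each (x, ctar) pair is treated as an arm with a Beta posterior over the mean of the binary reward, arms are chosen by Thompson Sampling, and uniform exploration is mixed in with probability λ. The class and method names are ours, and the exact bookkeeping in the released code may differ (in particular, all baselines additionally update the CPD beliefs used in the evaluation phase).

```python
# Our illustrative sketch of the TargInt TS / TargInt TS UniExp arm selection.
import random

class TargIntTS:
    def __init__(self, xs, c_tars, lam=0.0):
        self.arms = [(x, c) for x in xs for c in c_tars]    # arm space val(X) x val(C_tar)
        self.ab = {arm: [1.0, 1.0] for arm in self.arms}    # Beta(1, 1) priors per arm
        self.lam = lam                                      # lam = 0.5 in the plots, 0 for plain TS

    def choose(self):
        if random.random() < self.lam:                      # uniform-exploration mix
            return random.choice(self.arms)
        samples = {arm: random.betavariate(a, b) for arm, (a, b) in self.ab.items()}
        return max(samples, key=samples.get)                # Thompson draw

    def update(self, arm, y):                               # y is the binary reward
        self.ab[arm][0] += y
        self.ab[arm][1] += 1 - y
```

In each training round the chosen (x, ctar) would be executed as a targeted intervention, and the resulting full observation would also update the CPD beliefs (as in the sketch after Algorithm 1b) that the baseline uses at evaluation time.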
4.1 PURELY SYNTHETIC DATA

We consider a causal model M whose causal graph G consists of the following edges: C1 → C0, C0 → X, C0 → Y, X → Y. We assume that Ctar = {C1} and that Cother = {C0}. We choose the fraction of Phase 1 rounds α = 0.5, as we empirically found that to be a good choice. More details and code are in the Supplementary Material. Additional experiments, analyzing the reason why our algorithm performs better than baselines, are in Appendix B.

Experiment 1 Given the wide range of possible settings arising from different parameterizations of M, the first experiment seeks to answer the question: what is the expected performance of our algorithm over a set of representative settings? Each setting can be interpreted naturally when, for example, we think of Ctar as the set of user features that a recommendation agent observes post-deployment. The settings are chosen to capture the intuition that, typically, the agent sees high-value contexts (i.e., contexts for which, if the agent learns the optimal action, it can fetch high rewards) relatively less frequently, say, 20% of the time, but there can be variation in the number of different Ctar values over which that probability mass is spread and in how risky they are (for example, very low rewards for non-optimal actions). In each run, the agent is presented with a randomly selected setting from this representative set. Results are averaged over 300 independent runs; error bars display 1.96 standard errors. For details on the specific parameterizations, refer to Appendix D.

Figure 1 provides the results of this experiment. The axes are normalized since the representative set of settings could have different ranges for T and for regret. We see that our algorithm performs better than all baselines, especially for lower values of T. This, in fact, is a recurring theme: our algorithm's performance improvement is more pronounced in low T-budget situations, which, as we stated in Section 1.1, is what we are interested in. Further, Appendix C contains plots for two individual settings from the set, one where Std TS and Std UniExp perform significantly better than TargInt TS, TargInt UniExp and TargInt TS UniExp, and another where it is the converse; in contrast, our algorithm performs close to the best baseline in both. The intuition behind this is that the Unc measure incentivizes exploring new context values (due to its entropy-like nature) while weighting more heavily the contexts that are likely to yield more rewards (based on our current estimates). Also, as expected, NonCausal Std TS performs the worst as it ignores the causal graph.

Figure 1: Mean regrets for Experiment 1 (Section 4.1). Regret is normalized to [0, 1], and T is shown as a fraction of NX NCtar.

Experiment 2 To ensure that the results are not biased due to our choice of the representative set, the second experiment asks the question: what happens to performance when we directly randomize the parameters of the CPDs in each run, subject to realistic constraints? Specifically, in each run, we (1) randomly pick an i ∈ {1, ..., ⌊NC1/2⌋}, (2) distribute 20% of the probability mass randomly over the C1 values {1, ..., i}, and (3) distribute the remaining 80% of the mass over the C1 values {i+1, ..., NC1}. The c1 values in the 20% bucket are higher-value, while those in the 80% bucket are lower-value. Intuitively, this captures the commonly observed 80-20 pattern (for example, 20% of the users contribute around 80% of the revenue), but randomizes the other aspects; this gives an estimate of how the algorithms would perform in expectation. The results are averaged over 100 independent runs; error bars display 1.96 standard errors. Figure 2 shows that our algorithm performs better than all baselines in this experiment, with a more pronounced improvement for the smaller values of T. For instance, when T = 24, our algorithm's mean regret is around 35% lower than that of the next best baseline.

Figure 2: Mean regrets for Experiment 2 (Section 4.1).
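The P(C1) randomization described in Experiment 2 can be sketched as follows. Indexing starts at 0 here, the helper names are ours, and the remaining CPDs (which determine which bucket is high-value) follow the released configurations rather than this sketch.

```python
# Sketch of the P(C1) randomization used in Experiment 2: pick a split point i, put 20%
# of the probability mass on the first i values of C1 and 80% on the rest, spreading the
# mass randomly within each bucket.
import random

def random_simplex(k):
    w = [random.random() for _ in range(k)]
    s = sum(w)
    return [v / s for v in w]

def randomize_p_c1(n_c1):
    i = random.randint(1, n_c1 // 2)                   # size of the high-value bucket
    high = [0.2 * p for p in random_simplex(i)]        # 20% of mass: high-value contexts
    low = [0.8 * p for p in random_simplex(n_c1 - i)]  # 80% of mass: low-value contexts
    return high + low                                  # P(C1 = 0), ..., P(C1 = n_c1 - 1)

# e.g., randomize_p_c1(10) returns a length-10 probability vector summing to 1.
```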
4.2 CRM SALES DATA-INSPIRED EXPERIMENT

Experiment 3 This experiment seeks to answer the question: how does the algorithm perform in realistic scenarios? We set up the experiment inspired by a real-world scenario. Consider a bandit agent that could assist salespeople by learning to decide how many outgoing calls to make in an ongoing deal, given just the type of deal (new business, etc.) and the size of the customer (small, medium, etc.), so as to maximize a reward metric (which we call the "expected deal value"). Deal type affects the reward via the source of the lead (for example, chat). The trained agent would be deployed internally or to the company's clients, where it would generally not have access to the lead source. The causal graph relating these variables is given in Figure 3a, which was obtained using domain knowledge. Deal type and Customer size are Ctar, Lead source is Cother, # outgoing calls is X, and Exp. deal value is Y. The parameters of each CPD corresponding to this causal graph were calibrated using proprietary sales data from Freshworks Inc., after some adjustments using domain knowledge (for example, where there were coverage gaps). Note that the algorithms use only the CPDs and not this raw data; in any case, we make the dataset available in anonymized form at https://doi.org/10.5281/zenodo.5540348 under the CC BY-NC 4.0 license. Further, we also do a warm start for all algorithms to account for the fact that, in practice, there are often some starting beliefs about the CPDs from past data, domain knowledge, etc., which can be encoded into the agent's prior beliefs.4 Specifically, we ran pure random exploration for 15 rounds at the beginning of each algorithm and updated all CPD beliefs; this simulates a warm start. Due to this, we used α = 0 for our algorithm. Results are averaged over 50 independent runs; error bars display 1.96 standard errors. Figure 3 shows that our algorithm performs better than all baselines in this realistic setting as well. The Supplementary Material provides the code.

4 Other works such as Dimakopoulou et al. (2019) and Liu et al. (2018) have used warm starting as well.

Figure 3: CRM sales data-inspired experiments (Section 4.2). (a) Causal graph. (b) Mean regrets.

5 CONCLUSION AND FUTURE DIRECTIONS

This work presented a contextual bandit formulation that captures real-world nuances such as the ability to conduct targeted interventions and the presence of causal side-information, along with a novel algorithm that exploits this to achieve improved sample efficiency. In addition to synthetic experiments, we also performed real world-inspired experiments set up using actual CRM sales data. A useful direction of future investigation is the development of algorithms for when G is unknown or partially known. Another important direction is the development of methods robust to distributional shifts. Distributional shifts have been studied for prediction tasks (Magliacane et al., 2018; Gong et al., 2016) and for causal effects (Bareinboim & Pearl, 2014; Correa & Bareinboim, 2020); it is an interesting question to study this in our setting, when G remains the same between training and evaluation but M changes.

ACKNOWLEDGEMENTS

This work was partly funded by Freshworks Inc. through a research grant to Balaraman Ravindran and Chandrasekar Subramanian.

REPRODUCIBILITY STATEMENT

The full proof of Theorem 3.1 is provided in an appendix at the end of this PDF file (after the references) as part of the Supplementary Text. The source code of all the experiments in the main paper, along with a README on how to run them, is provided as part of the Supplementary Materials in a file named Supplemental Code Paper1203.zip. Further, even though the algorithms use only the CPDs and not the raw data, the dataset used to calibrate the real world-inspired experiments (see Section 4.2 of the main paper) is made available in anonymized form at https://doi.org/10.5281/zenodo.5540348 under the CC BY-NC 4.0 license.
The other experiments were synthetic and did not involve the use of any dataset.

ETHICS STATEMENT

This work is a foundational work that provides an improved mathematical framework and algorithm for contextual bandits. Some experiments were set up utilizing data from Freshworks Inc.; the data is released in anonymized form with the consent of the company (the link to the dataset is given in the main paper). To the best of our knowledge, we do not believe there were any ethical issues associated with the development of this work. Further, given the nature of the work as foundational and as introducing a new algorithm (and not specific to an application), we do not foresee any specific potential negative ethical issues created by this work. However, we do point out that researchers applying this method to their specific applications should adhere to ethical standards of their own (e.g., by avoiding targeting interventions on subpopulations based on racial attributes).

REFERENCES

Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of PMLR, pp. 1638-1646, 2014.

Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of PMLR, pp. 39.1-39.26, 2012.

Elias Bareinboim and Judea Pearl. Transportability from multiple environments with limited experiments: Completeness results. In Proceedings of the 27th International Conference on Neural Information Processing Systems, volume 1, pp. 280-288, 2014.

Elias Bareinboim, Andrew Forney, and Judea Pearl. Bandits with unobserved confounders: A causal approach. In Proceedings of the 28th International Conference on Neural Information Processing Systems, volume 1, pp. 1342-1350, 2015.

Djallel Bouneffouf and Irina Rish. A survey on practical applications of multi-armed and contextual bandits. arXiv:1904.10040, 2019.

Juan D. Correa and Elias Bareinboim. A Calculus for Stochastic Interventions: Causal Effect Identification and Surrogate Experiments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020.

Maria Dimakopoulou, Zhengyuan Zhou, Susan Athey, and Guido Imbens. Balanced Linear Contextual Bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3445-3453, 2019.

Gabriel Dulac-Arnold, Nir Levine, Daniel J. Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester. Challenges of real-world reinforcement learning: definitions, benchmarks and analysis. Machine Learning, Apr 2021.

Mingming Gong, Kun Zhang, Tongliang Liu, Dacheng Tao, Clark Glymour, and Bernhard Schölkopf. Domain Adaptation with Conditional Transferable Components. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, pp. 2839-2848, 2016.

Google. Targeting overview. https://support.google.com/optimize/answer/6283420, 2021. Accessed: 2021-10-02.

Ruocheng Guo, Lu Cheng, Jundong Li, P. Richard Hahn, and Huan Liu. A survey of learning causality with data: Problems and methods. ACM Comput. Surv., 53(4), July 2020.

Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Finnian Lattimore, Tor Lattimore, and Mark D. Reid. Causal bandits: Learning good interventions via causal inference.
In Proceedings of the 30th International Conference on Neural Information Processing Systems, volume 29, pp. 1189-1197, 2016.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

Bo Liu, Ying Wei, Yu Zhang, Zhixian Yan, and Qiang Yang. Transferable contextual bandit for cross-domain recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, pp. 3619-3626, 2018.

Yangyi Lu, Amirhossein Meisami, Ambuj Tewari, and William Yan. Regret analysis of bandit problems with causal background knowledge. In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), volume 124 of PMLR, pp. 141-150, 2020.

Sara Magliacane, Thijs Van Ommen, Tom Claassen, Stephan Bongers, Joris M. Mooij, and Philip Versteeg. Domain adaptation by using causal inference to predict invariant conditional distributions. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 10869-10879, 2018.

Judea Pearl. Causality. Cambridge University Press, 2nd edition, 2009.

Judea Pearl. On the Interpretation of do(x). Journal of Causal Inference, 7(1), 2019.

Neela Sawant, Chitti Babu Namballa, Narayanan Sadagopan, and Houssam Nassif. Contextual multi-armed bandits for causal marketing. arXiv:1810.01859, 2018.

Rajat Sen, Karthikeyan Shanmugam, Alexandros G. Dimakis, and Sanjay Shakkottai. Identifying best interventions through online importance sampling. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of PMLR, pp. 3057-3066, 2017.

Burr Settles. Active learning literature survey. Technical Report 1648, University of Wisconsin-Madison, 2009.

Akihiro Yabe, Daisuke Hatano, Hanna Sumita, Shinji Ito, Naonori Kakimura, Takuro Fukunaga, and Ken-ichi Kawarabayashi. Causal Bandits with Propagating Inference. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of PMLR, pp. 5512-5520, 2018.

Junzhe Zhang and Elias Bareinboim. Transfer learning in multi-armed bandits: A causal approach. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pp. 1340-1346, 2017.

Li Zhou. A survey on contextual multi-armed bandits. arXiv:1508.03326, 2016.

A APPENDIX (SUPPLEMENTARY TEXT): PROOF OF THEOREM 3.1

This is the full version of the proof sketch presented in Section 3.2 of the paper. For readability of the proof, we assume that Y and all variables in C are binary, whereas X can have any finite set of values. In Sections A.1 through A.5, we derive the expression for bounding Regret(ctar), which is the simple regret given Ctar = ctar. Then, in Section A.6, we derive the final expression for bounding the overall Regret. In addition to the notation in Table 1 of the main paper, we use a few other notational conveniences to keep the proof more readable:

- c denotes ctar ∪ cother.
- aB is shortened to b when it is unambiguous what a is; for example, cPAV is shortened to paV when it is obvious that c is the current set of values taken by all the context variables.
- We write P(V = v | paV) interchangeably with P(v | paV), since we use small letters (e.g., v) to denote the value taken by the random variable denoted by the respective capital letter (e.g., V).
Now, the factorization under Assumption (A1) can be written in this simpler notation as:

$$E[Y \mid do(x), c_{tar}] \;=\; \sum_{c_{other} \in val(C_{other})} P(Y = 1 \mid x, pa_Y) \prod_{c \in c_{other}} P(c \mid pa_C)$$

Recollect that we use set operations on vectors as well, wherever there is no ambiguity.

A.1 BOUNDING THE ERROR OF THE P(Y = 1 | x, paY) ESTIMATES

Let paY denote the value taken by PAY (i.e., Y's parents), excluding x. Say, for a given (x, paY),

$$P\left[\, |\hat{P}(Y = 1 \mid x, pa_Y) - P(Y = 1 \mid x, pa_Y)| \geq \epsilon_{x, pa_Y} \right] \;\leq\; \delta_{x, pa_Y}$$

Using the union bound, we can rewrite this as

$$P\left[\, \exists x:\ |\hat{P}(Y = 1 \mid X, pa_Y) - P(Y = 1 \mid X, pa_Y)| \geq \epsilon_{X, pa_Y} \right] \;\leq\; \delta_{X, pa_Y}$$

In other words, with probability at least 1 − δX,paY, the following event EX,paY is true:

$$\forall x:\quad |\hat{P}(Y = 1 \mid X, pa_Y) - P(Y = 1 \mid X, pa_Y)| \;\leq\; \epsilon_{X, pa_Y}$$

We derive expressions for ϵX,paY and δX,paY in Lemma A.2.

A.2 BOUNDING THE ERROR OF THE P(C = c | PAC) ESTIMATES

Now, say, for a C ∈ C (taking value C = c), and for a given paC,

$$P\left[\, |\hat{P}(c \mid pa_C) - P(c \mid pa_C)| \geq \epsilon_{c|pa_C} \right] \;\leq\; \delta_{c|pa_C}$$

Note that P̂(0 | paC) + P̂(1 | paC) = 1 and P(0 | paC) + P(1 | paC) = 1. Now, say the actual (signed) errors in the estimates are given by ϵ̃c, that is, ϵ̃c ≜ P̂(c | paC) − P(c | paC). Therefore,

$$\hat{P}(0 \mid pa_C) + \hat{P}(1 \mid pa_C) = 1 \;\Rightarrow\; P(0 \mid pa_C) + \tilde{\epsilon}_{0|pa_C} + P(1 \mid pa_C) + \tilde{\epsilon}_{1|pa_C} = 1 \;\Rightarrow\; \tilde{\epsilon}_{0|pa_C} + \tilde{\epsilon}_{1|pa_C} = 0$$

Now, this means that |ϵ̃0|paC| ≤ a if and only if |ϵ̃1|paC| ≤ a. Therefore, the event |ϵ̃0|paC| ≤ a is the same event as the event |ϵ̃1|paC| ≤ a. Therefore, given a paC, the following is true:

$$P\left[\, \forall c:\ |\tilde{\epsilon}_{c|pa_C}| \leq \epsilon_{C|pa_C} \right] \;\geq\; 1 - \delta_{C|pa_C}$$

which is the same as

$$P\left[\, \forall c:\ |\hat{P}(c \mid pa_C) - P(c \mid pa_C)| \leq \epsilon_{C|pa_C} \right] \;\geq\; 1 - \delta_{C|pa_C}$$

where δC|paC is a function of ϵC|paC. Note that we now have δC|paC instead of δc|paC. In other words, given a paC, with probability at least 1 − δC|paC, the following event EC|paC is true:

$$\forall c:\quad |\hat{P}(C = c \mid pa_C) - P(C = c \mid pa_C)| \;\leq\; \epsilon_{C|pa_C}$$

We derive expressions for δC|paC and ϵC|paC in Lemma A.1.

A.3 PUTTING THEM TOGETHER FOR THE OVERALL BOUND ON REGRET(ctar)

Using the union bound, for any ctar, with probability at least 1 − ΣpaY δX,paY − ΣpaC δC|paC, all of the events EX,paY and EC|paC are true simultaneously, for all paY and for all paC, C ∈ (C \ PAY). Let us call this event E. In other words, under event E, all estimates are bounded.

Let aalg ≜ ϕ̂(ctar) denote the action chosen by the algorithm in the evaluation phase when presented with ctar, and let a* denote an optimal action for ctar. Therefore, given Ctar = ctar, under event E,

$$E[Y \mid do(a_{alg}), c_{tar}] \;=\; \sum_{c_{other}} P(Y = 1 \mid a_{alg}, pa_Y) \prod_{c \in c_{other}} P(c \mid pa_C) \qquad \text{(factorization due to } G \text{, as earlier)}$$

$$\geq\; \sum_{c_{other}} P(Y = 1 \mid a_{alg}, pa_Y) \prod_{c \in c_{other}} \{\hat{P}(c \mid pa_C) - \epsilon_{C|pa_C}\} \qquad \text{(since the } \tilde{\epsilon}_{c|pa_C} \text{ are the actual errors)}$$

$$\geq\; \sum_{c_{other}} \{\hat{P}(Y = 1 \mid a_{alg}, pa_Y) - \epsilon_{X, pa_Y}\} \prod_{c \in c_{other}} \{\hat{P}(c \mid pa_C) - \epsilon_{C|pa_C}\} \qquad \text{(due to event } E\text{)}$$

Applying the algebra discussed separately in Section A.3.1 to the last expression, replacing P̂(Y = 1 | do(aalg), ctar) by Ê[Y | do(aalg), ctar] (since Y is binary), using the fact that Ê[Y | do(aalg), ctar] ≥ Ê[Y | do(a*), ctar] (due to the proposed algorithm, which chooses aalg by maximizing Ê), and then continuing with similar steps in the opposite direction to pass from the estimates back to the true quantities for a*, we get

$$E[Y \mid do(a_{alg}), c_{tar}] \;\geq\; E[Y \mid do(a^{*}), c_{tar}] \;-\; 2 \sum_{pa_Y} P(pa_Y \mid c_{tar})\, \epsilon_{X, pa_Y} \;-\; 3 \sum_{C \in C_{other}} \sum_{pa_C} P(pa_C \mid c_{tar})\, \epsilon_{C|pa_C}$$

Defining

$$\bar{\epsilon}_X \;\triangleq\; \sum_{pa_Y} P(pa_Y \mid c_{tar})\, \epsilon_{X, pa_Y} \qquad \text{and} \qquad \bar{\epsilon}_C \;\triangleq\; \sum_{pa_C} P(pa_C \mid c_{tar})\, \epsilon_{C|pa_C}$$

we get that E[Y | do(aalg), ctar] ≥ E[Y | do(a*), ctar] − 2ϵ̄X − 3 ΣC∈Cother ϵ̄C. Therefore, under event E (that is, with probability at least 1 − ΣpaY δX,paY − ΣpaC δC|paC), for any given ctar,

$$\mathrm{Regret}(c_{tar}) \;=\; E[Y \mid do(a^{*}), c_{tar}] - E[Y \mid do(a_{alg}), c_{tar}] \;\leq\; 2\bar{\epsilon}_X + 3 \sum_{C \in C_{other}} \bar{\epsilon}_C \tag{3}$$

We derive expressions for each of these terms in Sections A.4 and A.5, and then arrive at the final bound in Section A.6.
A.3.1 ALGEBRA FOR THE INTERMEDIATE EXPRESSION

We wish to show that

$$\sum_{c_{other}} \{\hat{P}(Y = 1 \mid x, pa_Y) - \epsilon_{X, pa_Y}\} \prod_{c \in c_{other}} \{\hat{P}(c \mid pa_C) - \epsilon_{C|pa_C}\} \;\geq\; \hat{P}(Y = 1 \mid do(x), c_{tar}) \;-\; \sum_{c_{other}} P(c_{other} \mid c_{tar}) \Big[ \epsilon_{X, pa_Y} + \sum_{C \in C_{other}} \epsilon_{C|pa_C} \Big]$$

Let us denote the expression on the left-hand side by (A). Let ci ∈ cother be chosen in a reverse topological order; this ensures that a child is chosen before any of its parents. Let Ci be the corresponding variable in Cother. Keeping the sum over the values of the remaining variables in Cother \ {Ci} implicit, we can write

$$(A) \;=\; \Big[ \{\hat{P}(Y = 1 \mid x, pa_Y) - \epsilon_{X, pa_Y}\}\{\hat{P}(c_i \mid pa_{C_i}) - \epsilon_{c_i|pa_{C_i}}\} \Big] \prod_{c' \in c_{other} \setminus c_i} \{\hat{P}(c' \mid pa_{C'}) - \epsilon_{c'|pa_{C'}}\}$$

$$=\; \Big[ \hat{P}(Y = 1 \mid x, pa_Y)\hat{P}(c_i \mid pa_{C_i}) - \epsilon_{c_i|pa_{C_i}}\hat{P}(Y = 1 \mid x, pa_Y) - \epsilon_{X, pa_Y}\hat{P}(c_i \mid pa_{C_i}) + \epsilon_{X, pa_Y}\epsilon_{c_i|pa_{C_i}} \Big] \prod_{c' \in c_{other} \setminus c_i} \{\hat{P}(c' \mid pa_{C'}) - \epsilon_{c'|pa_{C'}}\}$$

$$\geq\; \Big[ \hat{Q}_{C_i}[Y = 1 \mid x, pa_{C_i}] - \epsilon_{C_i|pa_{C_i}} - \hat{\epsilon}_{X, pa_{C_i}} \Big] \prod_{c' \in c_{other} \setminus c_i} \{\hat{P}(c' \mid pa_{C'}) - \epsilon_{c'|pa_{C'}}\}$$

where

$$\hat{Q}_{\mathbf{D}}[Y = 1 \mid x, c] \;\triangleq\; \sum_{\mathbf{d}} \hat{P}(Y = 1 \mid x, pa_Y) \prod_{d \in \mathbf{d}} \hat{P}(d \mid pa_D, c) \qquad \text{and} \qquad \hat{\epsilon}_{X, pa_{C_i}} \;\triangleq\; \sum_{c_i} \epsilon_{X, pa_Y}\, P(c_i \mid pa_{C_i})$$

These steps make use of the facts that ϵci|paCi ≤ ϵCi|paCi, that ϵ̃0|paCi = −ϵ̃1|paCi, and that P̂ ≤ 1. We can keep reapplying the above steps in reverse topological order until we exhaust all ci, and we will get

$$(A) \;\geq\; \hat{Q}_{C_{other}}[Y = 1 \mid x, c_{tar}] \;-\; \sum_{c_{other}} P(c_{other} \mid c_{tar}) \Big[ \epsilon_{X, pa_Y} + \sum_{C \in C_{other}} \epsilon_{C|pa_C} \Big]$$

Noting that Q̂Cother[Y = 1 | x, ctar] = P̂(Y = 1 | do(x), ctar) gives us the desired inequality.

A.4 EXPRESSIONS FOR δC|paC AND ϵC|paC

Lemma A.1. Given a paC and conditioned on the event ET_C|PAC (defined below), with probability at least 1 − δC|paC,

$$\forall c:\quad |\hat{P}(c \mid pa_C) - P(c \mid pa_C)| \;\leq\; \sqrt{ \frac{2}{T' P(pa_C, c_{tar}) - \epsilon^{T}_{PA_C}} \, \ln \frac{2}{\delta_{C|pa_C}} }$$

Proof. Let TpaC be the set of time indices during training when PAC = paC was chosen/seen by the algorithm, and let T′paC = |TpaC|. Now, for a C that is observed and not chosen,5 note that our estimate P̂(C = 1 | paC) is computed as (θ(1)C|paC + 1)/(T′paC + 2). We approximate6 this as θ(1)C|paC / T′paC. Using Hoeffding's inequality, for any c,

$$P\left[\, |\hat{P}(C = c \mid pa_C) - P(C = c \mid pa_C)| \geq \epsilon_{C|pa_C} \right] \;\leq\; 2 \exp\!\left( -\frac{T'_{pa_C}}{2} (\epsilon_{C|pa_C})^2 \right)$$

5 This means C ∈ C during Phase 1 of the algorithm, or C ∈ Cother in Phase 2.
6 It is easy to see that this does not change the asymptotic behavior of the bound.

But since, as already observed, the two events (corresponding to c = 0 and c = 1) are the same, we have

$$P\left[\, \forall c:\ |\hat{P}(c \mid pa_C) - P(c \mid pa_C)| \leq \epsilon_{C|pa_C} \right] \;\geq\; 1 - 2 \exp\!\left( -\frac{T'_{pa_C}}{2} (\epsilon_{C|pa_C})^2 \right)$$

Therefore, we let δC|paC ≜ 2 exp( −(T′paC/2)(ϵC|paC)² ). Now, let T′ ≜ αT be the number of rounds of Phase 1. Let T′C|paC be the number of times C was updated in Phase 1 as a result of encountering PAC = paC. Note that the mean of T′C|paC is T′ P(ctar) P(paC | ctar) = T′ P(paC, ctar). By Hoeffding's inequality,

$$P\left[\, T'_{C|pa_C} \geq T' P(pa_C, c_{tar}) - \epsilon^{T}_{pa_C} \right] \;\geq\; 1 - \exp\!\left( -\frac{2 (\epsilon^{T}_{pa_C})^2}{T'} \right)$$

By the union bound, we have (let us call the corresponding event ET_C|PAC),

$$P\left[\, \forall pa_C:\ T'_{C|pa_C} \geq T' P(pa_C, c_{tar}) - \epsilon^{T}_{PA_C} \right] \;\geq\; 1 - N_{PA_C} \exp\!\left( -\frac{2 (\epsilon^{T}_{PA_C})^2}{T'} \right)$$

where NPAC = |val(PAC)|. Letting

$$\delta^{T}_{C|PA_C} \;\triangleq\; N_{PA_C} \exp\!\left( -\frac{2 (\epsilon^{T}_{PA_C})^2}{T'} \right) \qquad \text{we get} \qquad \epsilon^{T}_{PA_C} = \sqrt{ \frac{T'}{2} \ln \frac{N_{PA_C}}{\delta^{T}_{C|PA_C}} }$$

Note that T′paC ≥ T′C|paC. Therefore, given that ET_C|PAC is true, we have

$$P\left[\, \forall c:\ |\hat{P}(c \mid pa_C) - P(c \mid pa_C)| \leq \epsilon_{C|pa_C} \right] \;\geq\; 1 - 2 \exp\!\left( -\frac{T'_{C|pa_C}}{2} (\epsilon_{C|pa_C})^2 \right) \;\geq\; 1 - 2 \exp\!\left( -\frac{T' P(pa_C, c_{tar}) - \epsilon^{T}_{C|PA_C}}{2} (\epsilon_{C|pa_C})^2 \right)$$

Defining δC|paC as

$$\delta_{C|pa_C} \;\triangleq\; 2 \exp\!\left( -\frac{T' P(pa_C, c_{tar}) - \epsilon^{T}_{C|PA_C}}{2} (\epsilon_{C|pa_C})^2 \right) \qquad \text{we get} \qquad \epsilon_{C|pa_C} = \sqrt{ \frac{2}{T' P(pa_C, c_{tar}) - \epsilon^{T}_{C|PA_C}} \ln \frac{2}{\delta_{C|pa_C}} } \tag{4}$$

A.5 EXPRESSIONS FOR δX,paY, ϵX,paY, AND ϵT_X,PAY

Lemma A.2. Given a paY and conditioned on the event ET_X,PAY (defined below), with probability at least 1 − δX,paY,

$$\forall x:\quad |\hat{P}(Y = 1 \mid x, pa_Y) - P(Y = 1 \mid x, pa_Y)| \;\leq\; \sqrt{ \frac{2}{\frac{T'}{N_X} P(pa_Y, c_{tar}) - \epsilon^{T}_{X, PA_Y}} \, \ln \frac{2 N_X}{\delta_{X, pa_Y}} }$$

Proof. As before, let Tx,paY be the set of time indices during training when (X, PAY) = (x, paY) was chosen/seen by the algorithm, and let T′x,paY = |Tx,paY|. As before, note that our estimate P̂(Y = 1 | x, paY) is computed as (θ(1)Y|x,paY + 1)/(T′x,paY + 2), which we can approximate as θ(1)Y|x,paY / T′x,paY.
Now, by Hoeffding's inequality,

$$P\left[\, |\hat{P}(Y = 1 \mid x, pa_Y) - P(Y = 1 \mid x, pa_Y)| \geq \epsilon_{x, pa_Y} \right] \;\leq\; 2 \exp\!\left( -\frac{T'_{x, pa_Y}}{2} (\epsilon_{x, pa_Y})^2 \right)$$

During Phase 1, let T̃x,paY be the random variable denoting the number of times that (X, PAY) = (x, paY) was seen/chosen. Note that, due to the nature of Phase 1 (i.e., each x is chosen with probability 1/NX), the mean of T̃x,paY is T′ P(paY, ctar) (1/NX). Therefore, using the union bound and Hoeffding's inequality,

$$P\left[\, \forall (x, pa_Y):\ \tilde{T}_{x, pa_Y} \geq \frac{T'}{N_X} P(pa_Y, c_{tar}) - \epsilon^{T}_{X, PA_Y} \right] \;\geq\; 1 - N_X N_{PA_Y} \exp\!\left( -\frac{2 (\epsilon^{T}_{X, PA_Y})^2}{T'} \right)$$

Since T′x,paY ≥ T̃x,paY for all (x, paY), we have (let us call this event ET_X,PAY),

$$P\left[\, \forall (x, pa_Y):\ T'_{x, pa_Y} \geq \frac{T'}{N_X} P(pa_Y, c_{tar}) - \epsilon^{T}_{X, PA_Y} \right] \;\geq\; 1 - N_X N_{PA_Y} \exp\!\left( -\frac{2 (\epsilon^{T}_{X, PA_Y})^2}{T'} \right)$$

Letting

$$\delta^{T}_{X, PA_Y} \;\triangleq\; N_X N_{PA_Y} \exp\!\left( -\frac{2 (\epsilon^{T}_{X, PA_Y})^2}{T'} \right) \qquad \text{we have} \qquad \epsilon^{T}_{X, PA_Y} = \sqrt{ \frac{T'}{2} \ln \frac{N_X N_{PA_Y}}{\delta^{T}_{X, PA_Y}} }$$

Now, we can get that, for a given paY (let us call the corresponding event EX,paY),

$$P\left[\, \forall x:\ |\hat{P}(Y = 1 \mid x, pa_Y) - P(Y = 1 \mid x, pa_Y)| \leq \epsilon_{X, pa_Y} \right] \;\geq\; 1 - \sum_x 2 \exp\!\left( -\frac{T'_{x, pa_Y}}{2} (\epsilon_{X, pa_Y})^2 \right) \;\geq\; 1 - 2 N_X \exp\!\left( -\frac{\frac{T'}{N_X} P(pa_Y, c_{tar}) - \epsilon^{T}_{X, PA_Y}}{2} (\epsilon_{X, pa_Y})^2 \right)$$

Defining

$$\delta_{X, pa_Y} \;\triangleq\; 2 N_X \exp\!\left( -\frac{\frac{T'}{N_X} P(pa_Y, c_{tar}) - \epsilon^{T}_{X, PA_Y}}{2} (\epsilon_{X, pa_Y})^2 \right) \qquad \text{we get} \qquad \epsilon_{X, pa_Y} = \sqrt{ \frac{2}{\frac{T'}{N_X} P(pa_Y, c_{tar}) - \epsilon^{T}_{X, PA_Y}} \ln \frac{2 N_X}{\delta_{X, pa_Y}} } \tag{5}$$

A.6 DERIVING THE FINAL REGRET BOUND

The probability that the events E, ET_PAC (for all C ∈ C) and ET_X,PAY are all simultaneously true is at least

$$1 \;-\; \sum_{pa_Y} \delta_{X, pa_Y} \;-\; \delta^{T}_{X, PA_Y} \;-\; \sum_{C} \Big( \sum_{pa_C} \delta_{C|pa_C} + \delta^{T}_{PA_C} \Big)$$

Substituting Equations (4) and (5) back into Equation (3), we get that, with the above probability, for a given ctar,

$$\mathrm{Regret}(c_{tar}) \;\leq\; 3 \left[ \sum_{pa_Y} P(pa_Y \mid c_{tar}) \sqrt{ \frac{2}{\frac{T'}{N_X} P(pa_Y, c_{tar}) - \epsilon^{T}_{X, PA_Y}} \ln \frac{2 N_X}{\delta_{X, pa_Y}} } \;+\; \sum_{C \in C_{other}} \sum_{pa_C} P(pa_C \mid c_{tar}) \sqrt{ \frac{2}{T' P(pa_C, c_{tar}) - \epsilon^{T}_{PA_C}} \ln \frac{2}{\delta_{C|pa_C}} } \right]$$

with ϵT_C|PAC and ϵT_X,PAY as derived above. We set δ̃ = δX,paY = δT_X,PAY = δC|paC = δT_PAC for all C, paY, paC. Thus, for a given ctar, with probability at least 1 − NPAY δ̃ − δ̃ − NC\PAY δ̃ − |C \ PAY| δ̃,

$$\mathrm{Regret}(c_{tar}) \;\leq\; 3 \left[ \sum_{pa_Y} P(pa_Y \mid c_{tar}) \sqrt{ \frac{2}{\frac{T'}{N_X} P(pa_Y, c_{tar}) - \epsilon^{T}_{X, PA_Y}} \ln \frac{2 N_X}{\tilde{\delta}} } \;+\; \sum_{C \in C_{other}} \sum_{pa_C} P(pa_C \mid c_{tar}) \sqrt{ \frac{2}{T' P(pa_C, c_{tar}) - \epsilon^{T}_{PA_C}} \ln \frac{2}{\tilde{\delta}} } \right]$$

where

$$\epsilon^{T}_{PA_C} = \sqrt{ \frac{T'}{2} \ln \frac{N_{PA_C}}{\tilde{\delta}} }, \qquad \epsilon^{T}_{X, PA_Y} = \sqrt{ \frac{T'}{2} \ln \frac{N_X N_{PA_Y}}{\tilde{\delta}} }$$

Now, noting that NPAY δ̃ + δ̃ + NC\PAY δ̃ + |C \ PAY| δ̃ ≤ (NC + |C|) δ̃, we set δ = (NC + |C|) δ̃. So we get that, with probability at least 1 − δ,

$$\mathrm{Regret}(c_{tar}) \;\leq\; 3 \sum_{pa_Y} P(pa_Y \mid c_{tar}) \sqrt{ \frac{2}{\frac{\alpha T}{N_X} P(pa_Y, c_{tar}) - \epsilon^{T}_{X, PA_Y}} \ln \frac{2 N_X (N_{\mathbf{C}} + |\mathbf{C}|)}{\delta} } \;+\; 3 \sum_{C \in C_{other}} \sum_{pa_C} P(pa_C \mid c_{tar}) \sqrt{ \frac{2}{\alpha T P(pa_C, c_{tar}) - \epsilon^{T}_{PA_C}} \ln \frac{2 (N_{\mathbf{C}} + |\mathbf{C}|)}{\delta} } \tag{6}$$

where

$$\epsilon^{T}_{PA_C} = \sqrt{ \frac{\alpha T}{2} \ln \frac{N_{PA_C} (N_{\mathbf{C}} + |\mathbf{C}|)}{\delta} }, \qquad \epsilon^{T}_{X, PA_Y} = \sqrt{ \frac{\alpha T}{2} \ln \frac{N_X N_{PA_Y} (N_{\mathbf{C}} + |\mathbf{C}|)}{\delta} }$$

Substituting Equation (6) into the definition of Regret in Section 2 of the main paper, we get the expression in Theorem 3.1.

B APPENDIX (SUPPLEMENTARY TEXT): ADDITIONAL EXPERIMENT: UNDERSTANDING THE REASON FOR BETTER PERFORMANCE

Figure 4: Frequency of choosing or encountering each value of Ctar under (a) Std UniExp, (b) TargInt UniExp, and (c) our proposed algorithm. Highlighted in red are the high-value contexts (i.e., contexts for which learning the right actions provides higher expected rewards).

In this analysis, we try to understand why our algorithm exhibits better performance than baselines. We had pointed out intuitively that the Unc measure trades off between exploring new contexts and learning more about explored contexts based on (a) current knowledge of the various context-action pairs, and (b) current estimates of the value (i.e., rewards obtainable) of context-action pairs. This section attempts to zoom into this behavior. Specifically, we consider one of the settings (setting #1) from the representative set used in Experiment 1; it has NCtar = 10, all equally likely, but Ctar ∈ {0, 1} are more valuable than the others (i.e., learning the right actions for these contexts can give higher rewards); see code (config setting1.py) for more details.
We zoom into the case when T = 15. We do 500 independent runs, count the number of times every possible Ctar value is encountered or chosen, and plot the frequencies. We compare our algorithm's behavior with two representative baselines: Std UniExp (which does standard contextual bandit interactions) and TargInt UniExp (which does targeted interventions); the effect is similar for the other two baselines as well. As seen in Figure 4, our algorithm chooses the higher-value contexts (Ctar ∈ {0, 1}) with relatively higher frequency than the baselines, while still ensuring good exploration of the other contexts, in line with the intuition discussed in the main paper.

C APPENDIX (SUPPLEMENTARY TEXT): MORE DETAILS ABOUT EXPERIMENT 1

In the discussion about Experiment 1 in the main paper, it was mentioned that Appendix C contains plots for two individual settings from the representative set: one where Std TS and Std UniExp perform significantly better than TargInt TS, TargInt UniExp and TargInt TS UniExp, and another where it is the converse, while our algorithm performs close to the best baseline in both. Figure 5 presents those plots. Each result is based on 50 runs.

Figure 5: Different baselines outperform each other in two different settings; however, our algorithm performs close to the best baseline in both. (a) TargInt TS, TargInt UniExp and TargInt TS UniExp perform better than the other baselines. (b) Std TS and Std UniExp perform better than the other baselines.

D APPENDIX (SUPPLEMENTARY TEXT): REGARDING PARAMETERIZATIONS USED IN THE EXPERIMENTS

For the specific parameterizations of all settings used in all the experiments, refer to the README file in Supplemental Code Paper1203.zip, provided as part of the Supplemental Material.

E APPENDIX (SUPPLEMENTARY TEXT): MORE DISCUSSION ON THE CASE WHERE Cother = ∅

If Cother = ∅ (that is, if it is empty), it might appear as if there would not be any information leakage. However, this is not true, and information leakage can still exist. For instance, consider the same graph as used in Section 4.1, but now suppose that Ctar = {C1, C0}, whereas Cother = ∅. Consider two targeted interventions, (x, c0, c1) and (x, c0, c1′), where c1 ≠ c1′. Note that the distribution of Y after these two targeted interventions would, respectively, be

P(Y | do(x), (c0, c1)) = P(Y | x, c0)
P(Y | do(x), (c0, c1′)) = P(Y | x, c0)

The common term P(Y | x, c0) enables information leakage between these two targeted interventions. To see this, note that if the agent conducts the targeted intervention (x, c0, c1), then it is able to update its beliefs about P(Y | x, c0), which improves its estimates for (x, c0, c1′) as well due to the shared term. So Cother = ∅ does not preclude information leakage.

Instead, suppose that we are considering the interventions (x, c0, c1) and (x, c0′, c1), where c0 ≠ c0′. In this case, information-leakage gains will not be possible. However, our algorithm would still be able to achieve better performance compared to baselines by balancing between exploring new (x, ctar) pairs and learning more about already-explored valuable (x, ctar) pairs. To see this, note that in the definition of Unc in Section 3.1, the information component is weighted by the value component (the last two terms, namely, P̂(c′) Ê[Y | c′, do(x′)]).
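The shared-CPD argument above can be illustrated numerically with a few lines of Python: both targeted interventions estimate the same P(Y | x, c0), so a belief update triggered by one of them moves the estimate used for the other. The bookkeeping below is our illustrative sketch, not the released code.

```python
# Numeric illustration of the information leakage discussed above (C_tar = {C1, C0},
# C_other empty): the interventions (x, c0, c1) and (x, c0, c1') share the CPD P(Y | x, c0),
# so updating beliefs from one improves the estimate for the other.
from collections import defaultdict

theta_Y = defaultdict(lambda: [1.0, 1.0])   # Dirichlet counts [#(Y=0), #(Y=1)] for P(Y | x, c0)

def e_hat(x, c0):
    """E_hat[Y | do(x), (c0, c1)] = P_hat(Y = 1 | x, c0), independent of c1."""
    n0, n1 = theta_Y[(x, c0)]
    return n1 / (n0 + n1)

x, c0, c1, c1_alt = 1, 0, 0, 1              # c1 and c1_alt never enter the estimate
print(e_hat(x, c0))                         # prior estimate, shared by both contexts
theta_Y[(x, c0)][1] += 1                    # observe Y = 1 after intervening with (x, c0, c1)
print(e_hat(x, c0))                         # the estimate for (x, c0, c1_alt) has moved as well
```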
F APPENDIX (SUPPLEMENTARY TEXT): REGARDING THE CHOICE OF THE PROBABILITY OF UNIFORM EXPLORATION IN TARGINT TS UNIEXP

Thompson Sampling aims to minimize cumulative regret in a best-arm identification setting (Agrawal & Goyal, 2012). For cumulative regret minimization, it performs better than simple exploration algorithms such as ϵ-greedy, by controlling the exploration based on the agent's current beliefs. However, as discussed in the main part of the paper, in our setting, the agent is faced with the task of learning a policy, while balancing exploration of new contexts and learning more about already-explored contexts. Thus, apart from having standard Thompson Sampling as a baseline (TargInt TS), we also create a variation of it (called TargInt TS UniExp) where, in each round, the algorithm, with some probability λ, chooses a targeted intervention uniformly at random (from the space val(X) × val(Ctar)) for exploration; with probability 1 − λ, it chooses the targeted intervention given by Thompson Sampling. This probability λ allows us to directly control the degree of uniform exploration.

In Experiment 1 (Section 4.1), we chose the probability of uniform exploration λ = 0.5 for TargInt TS UniExp. This was to act as a midpoint between the TargInt TS algorithm (attempting to optimize cumulative regret) and TargInt UniExp (which does uniform exploration). We noted that our algorithm continues to perform better than all baselines. Figure 6 shows that this continues to be true even when λ in TargInt TS UniExp is not as large as 0.5, by setting λ = 0.2.

Figure 6: Experiment 1 results when the probability of uniform exploration in TargInt TS UniExp was chosen to be 0.2 (instead of 0.5). Our algorithm continues to perform better than all baselines.