# Adaptively Exploiting d-Separators with Causal Bandits

Blair Bilodeau* (University of Toronto), Linbo Wang (University of Toronto), Daniel M. Roy (University of Toronto)

*Correspondence to blair.bilodeau@mail.utoronto.ca. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

Multi-armed bandit problems provide a framework to identify the optimal intervention over a sequence of repeated experiments. Without additional assumptions, minimax optimal performance (measured by cumulative regret) is well understood. With access to additional observed variables that d-separate the intervention from the outcome (i.e., they are a d-separator), recent causal bandit algorithms provably incur less regret. However, in practice it is desirable to be agnostic to whether observed variables are a d-separator. Ideally, an algorithm should be adaptive; that is, perform nearly as well as an algorithm with oracle knowledge of the presence or absence of a d-separator. In this work, we formalize and study this notion of adaptivity, and provide a novel algorithm that simultaneously achieves (a) optimal regret when a d-separator is observed, improving on classical minimax algorithms, and (b) significantly smaller regret than recent causal bandit algorithms when the observed variables are not a d-separator. Crucially, our algorithm does not require any oracle knowledge of whether a d-separator is observed. We also generalize this adaptivity to other conditions, such as the front-door criterion.

1 Introduction

Given a set of interventions (actions) for a specific experiment, we are interested in learning the best one with respect to some outcome of interest. Without knowledge of the specific causal structure relating the observed variables, this task is impossible from solely observational data (cf. Theorem 4.3.2 of [29]). Instead, we seek the most efficient way to sequentially choose interventions for i.i.d. repetitions of the experiment, where the main challenge is that we cannot observe the counterfactual effect of interventions we did not choose. Without any structural assumptions beyond i.i.d., one can always learn the best intervention with high confidence by performing each intervention a sufficient number of times [10]. In the presence of additional structure, such as a causal graph on the observables, this strategy may result in performing suboptimal interventions unnecessarily often. However, the presence of such structure is often unverifiable, and incorrectly supposing that it exists may catastrophically mislead the experimenter. Thus, a fundamental question arises: Can we avoid strong, unverifiable assumptions while simultaneously performing fewer harmful interventions when advantageous structure exists?

A natural framework in which to study this question is that of (multi-armed) bandit problems: over a sequence of interactions with the environment, the experimenter chooses an action using their experience of the previous interactions, and then observes the reward of the chosen action. The goal is to achieve performance comparable with what would have been achieved if the experimenter had chosen the (unknown) optimal action in each interaction. Formally, performance is measured by regret, which is the difference between the cumulative reward of the optimal action and the cumulative reward incurred by the experimenter.
In this partial-information setting, regret induces the classical trade-off between exploration (choosing potentially suboptimal actions to learn whether they are optimal) and exploitation (choosing the action that empirically appears best). In contrast, other measures of performance (e.g., only identifying the average treatment effect or the best action at the end of all interactions) do not penalize the experimenter for performing suboptimal actions during exploration, and consequently are insufficient to study our question of interest.

For an action set A and a time horizon T, the minimax optimal regret for bandits without any assumptions on the data (worst-case) is O(√(|A| T)) [6], and is achieved by many algorithms [19]. Recently, Lu et al. [24] showed that, under additional causal structure, a new algorithm (C-UCB) can achieve improved regret. In particular, if the experimenter has access to a variable Z taking values in a finite set Z that d-separates [29] the intervention and the reward, as well as the interventional distribution of Z for each a ∈ A, C-UCB achieves O(√(|Z| T)) regret. However, as we show, the performance of C-UCB when the d-separation assumption fails is orders of magnitude worse than that of UCB. Is strict adaptation possible? That is, is there an algorithm that recovers the guarantee of C-UCB when Z is a d-separator and the guarantee of UCB in all other environments, without advance knowledge of whether Z is a d-separator?

As of yet, there is no general theory of adaptivity in the bandit setting. The closest we have to a general method is the Corral algorithm and its offspring [2, 4]. Corral uses online mirror descent to combine base bandit algorithms, but requires that each of the base algorithms is stable when operating on importance-weighted observations. Unfortunately, while UCB is stable, simulations reveal this is not the case for C-UCB. This presents a barrier to adapting to causal structure via Corral-like techniques, and raises the question of whether there is a new way to achieve adaptivity.

Contributions. We introduce the conditionally benign property for bandit environments: informally, there exists a random variable Z such that the conditional distribution of the reward given Z is the same for each action a ∈ A. We show that the conditionally benign property is (a) strictly weaker than the assumption of Lu et al. [24] (in their proofs, they actually assume that all causal parents of the reward are observed); (b) equivalent to Z being a d-separator when A is the set of all interventions; and (c) implied by the front-door criterion [29] when A is the set of all interventions except the null intervention (i.e., a pure observation). We then prove that any algorithm that achieves optimal worst-case regret must incur suboptimal regret for some conditionally benign environment, and hence strict adaptation to the conditionally benign property is impossible. Despite this, we introduce the hypothesis-tested adaptive causal upper confidence bound algorithm (HAC-UCB), which provably (a) achieves nonvacuous (sublinear in T) regret at all times without any assumptions, (b) recovers the improved performance of C-UCB for conditionally benign environments, and (c) performs as well as UCB in certain environments where C-UCB and related algorithms (such as those studied in [25, 27]) incur linear regret.
Empirically, we observe these performance improvements on simulated data.

Impact. Recently, multiple works have developed causal bandit algorithms that achieve improved performance in the presence of advantageous causal relationships (initiated by Bareinboim et al. [7] and Lattimore et al. [17]; see Section 7 for more literature). Further, the last decade has seen a flurry of work in bandits on designing algorithms that recover worst-case regret bounds while simultaneously performing significantly better in advantageous settings, without requiring advance knowledge of which case holds [e.g., 9, 31, 34, 26, 1]. However, to the best of our knowledge, no existing work studies algorithms that achieve adaptive regret guarantees with respect to causal structure. The present work provides a framework that expands the study of adaptive decision making to the rich domain of causal inference.

2 Preliminaries

2.1 Problem Setting

We consider a general extension of the usual bandit setting where, in addition to a reward corresponding to the action played, the experimenter observes some additional variables after choosing their action; we call this the post-action context. This is distinct from the contextual bandit problem, where the experimenter has access to side-information before choosing their action. Let Y = [0, 1] be the reward space (our results hold for Y = ℝ using sub-Gaussian rewards with bounded mean, at the expense of constants), let Z be a finite set of values for the post-action context to take, and let P(Z × Y) denote the set of joint probability distributions on Z × Y. For any p ∈ P(Z × Y) and (Z × Y)-valued random variable (Z, Y), let p(Z) and p(Y | Z) denote the marginal and conditional distributions respectively. Let E_p and P_p denote the expectation and probability operators under p.

The stochastic bandit problem with post-action contexts proceeds as follows. For each round t ∈ [T], the experimenter selects A_t ∈ A while simultaneously {(Z_t(a), Y_t(a)) : a ∈ A} are independently sampled from the environment, which is any family of distributions ν = {ν_a : a ∈ A} ∈ P(Z × Y)^A indexed by the action set. The experimenter only observes (Z_t(A_t), Y_t(A_t)) and receives reward Y_t(A_t). From a causal perspective, (Z_t(a), Y_t(a))_{a∈A} corresponds to the potential outcome vector, and under causal consistency, (Z_t(A_t), Y_t(A_t)) corresponds to the observed data (Z_t, Y_t) under the chosen intervention A_t. The observed history up to round t is the random variable H_t = (A_s, Z_s(A_s), Y_s(A_s))_{s∈[t]}. A policy is a sequence of measurable maps from the observed history to the action set, denoted π = (π_t)_{t∈[T]} ∈ Π(A, Z, T), where each π_t : (A × Z × Y)^{t−1} → A. The experimenter chooses a policy in advance and selects their action on each round according to A_t = π_t(H_{t−1}). Clearly, an environment ν and a policy π together define a joint probability distribution on (A_t, Z_t(a), Y_t(a))_{t∈[T], a∈A} (which includes the counterfactuals not seen by the player). Let E_{ν,π} denote expectation under this joint distribution. The performance of a policy π under an environment ν is quantified by the regret

R_{ν,π}(T) = T · max_{a∈A} E_{ν_a}[Y] − E_{ν,π}[ Σ_{t∈[T]} Y_t(A_t) ].
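To make this protocol concrete, here is a minimal simulation sketch in Python (an illustrative toy with made-up distributions and names, not the paper's code): an environment is a table of distributions ν_a over Z × Y, a policy maps the history to an action, and regret is measured against the best mean reward.

```python
import random

# Toy environment: for each action a, nu[a] maps (z, y) outcomes to probabilities.
# Purely illustrative; the probability tables and names are invented.
nu = {
    0: {("z0", 1): 0.6, ("z0", 0): 0.3, ("z1", 1): 0.05, ("z1", 0): 0.05},
    1: {("z0", 1): 0.1, ("z0", 0): 0.1, ("z1", 1): 0.4, ("z1", 0): 0.4},
}

def sample(dist):
    """Sample one (z, y) pair from a dict of outcome probabilities."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

def mean_reward(dist):
    return sum(p * y for (_, y), p in dist.items())

def uniform_policy(history, actions):
    """Placeholder policy: ignores the history and picks uniformly at random."""
    return random.choice(actions)

T = 1000
actions = list(nu)
history, total_reward = [], 0.0
for t in range(T):
    a = uniform_policy(history, actions)   # A_t = pi_t(H_{t-1})
    z, y = sample(nu[a])                   # observe (Z_t(A_t), Y_t(A_t))
    history.append((a, z, y))
    total_reward += y

best_mean = max(mean_reward(dist) for dist in nu.values())
regret = T * best_mean - total_reward      # empirical analogue of R_{nu,pi}(T)
print(f"empirical regret of the uniform policy over T={T} rounds: {regret:.1f}")
```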
2.2 Specific Algorithms

Classical bandit algorithms take |A| and T as inputs. The dependence on T can often be dropped using the doubling trick or more sophisticated techniques such as decreasing learning rates, but we do not focus on these refinements in this work, instead allowing T as an input. In order to account for the additional information provided by the post-action context, we also consider algorithms that take |Z| as an input. By restricting dependence to only the cardinalities of A and Z, we explicitly suppose that there is no additional structure to exploit on these spaces; much work in the bandit literature has focused on such structure through linear or Lipschitz rewards, but we defer these extensions to future work in favour of focusing on adaptivity. For notational simplicity, we denote algorithmic dependence on A or Z even though the dependence is actually only through their cardinality (i.e., the labellings of items in the sets are arbitrary).

In the causal bandit literature [17, 24, 27], it is common to suppose that the algorithm also receives distributional information relating actions to intermediate variables. In particular, if the (unknown) environment is ν, prior work supposes that the algorithm has access to the (interventional) marginal distributions ν(Z) = {ν_a(Z) : a ∈ A}. In this work, we suppose instead that the algorithm has access to a collection of approximate marginal distributions ν̃(Z) = {ν̃_a(Z) : a ∈ A}; for example, these could be an estimate of ν(Z) that was learned offline. Ideally, ν̃(Z) will be close to ν(Z), but our novel method is entirely adaptive to this assumption: regardless of how well ν̃(Z) approximates ν(Z), HAC-UCB incurs sublinear regret.

We now introduce additional notation to define the algorithms of interest in this work. Suppose that A, Z, ν̃(Z), and T are all fixed in advance, as well as a confidence parameter δ = δ_T ∈ (0, 1). For each t ∈ [T], z ∈ Z, and a ∈ A, define the number of the first t rounds on which z was observed by T^Z_t(z) = 1 ∨ Σ_{s=1}^t I{Z_s(A_s) = z} (where a ∨ b = max{a, b}), and similarly the number of rounds on which a was chosen by T^A_t(a) = 1 ∨ Σ_{s=1}^t I{A_s = a}. Further, define the empirical mean estimate for the reward under the distribution induced by choosing action a as μ̂^A_t(a) = [T^A_t(a)]^{−1} Σ_{s=1}^t Y_s(A_s) I{A_s = a}, and the empirical conditional mean estimate for the reward given that z was observed as μ̂^Z_t(z) = [T^Z_t(z)]^{−1} Σ_{s=1}^t Y_s(A_s) I{Z_s(A_s) = z}. Define

UCB^A_t(a) = μ̂^A_t(a) + √(log(2/δ) / (2 T^A_t(a))),
UCB^Z_t(z) = μ̂^Z_t(z) + √(log(2/δ) / (2 T^Z_t(z))),
ŨCB_t(a) = Σ_{z∈Z} UCB^Z_t(z) P̃_a[Z = z],

where P̃_a denotes probability under the supplied marginal ν̃_a(Z). Using these objects, we define three algorithms, each of which produces actions that are H_t-measurable. The upper confidence bound algorithm (UCB, [6]) is defined by A^UCB_{t+1} ∈ arg max_{a∈A} UCB^A_t(a), and the causal upper confidence bound algorithm (C-UCB, [24]) is defined by A^C_{t+1} ∈ arg max_{a∈A} ŨCB_t(a), where ties are broken using some predetermined ordering on A. Finally, we define a new combination of these two methods, which we call the hypothesis-tested adaptive causal upper confidence bound algorithm (HAC-UCB) and describe precisely in Algorithm 1; we denote its actions by A^HAC_t.

Algorithm 1: HAC-UCB(A, Z, T, ν̃(Z))
1. Play each a ∈ A for ⌈4√T / |A|⌉ rounds, and let ν̂_a(Z) be the MLE of ν_a(Z).
2. If max_{a∈A, z∈Z} |P̃_a[Z = z] − P̂_a[Z = z]| > 2 T^{−1/4} √(|A| |Z| log T), replace ν̃(Z) ← ν̂(Z).
3. Play each a ∈ A for ⌈√T / |A|⌉ rounds.
4. Set flag = True.
5. While t ≤ T:
   - /* Check if either of the two conditions fail */ For each a ∈ A, test whether the UCB index UCB^A_{t−1}(a) and the causal index ŨCB_{t−1}(a) (built from the conditional indices UCB^Z_{t−1}(z) weighted by P̃_a[Z = z]) are consistent with one another up to their confidence widths plus a slack of √(|A| |Z| log T) · T^{−1/4}; if the test fails for some a ∈ A, set flag = False.
   - If flag: /* If the conditions pass, play C-UCB */ play C-UCB, i.e., A^HAC_t = A^C_t.
   - Else: /* If the conditions ever fail, play UCB forever */ set A^HAC_s = A^UCB_s for all remaining rounds s.
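As an illustration of the index computations underlying UCB and C-UCB above, here is a small sketch (our own hypothetical code with invented names, not the authors' implementation; it omits HAC-UCB's hypothesis test):

```python
import math
from collections import defaultdict

def ucb_indices(history, actions, zs, p_tilde, delta):
    """Return the per-action UCB index and the C-UCB (causal) index.

    history: list of observed (a, z, y) triples.
    p_tilde: p_tilde[a][z] is the given (approximate) marginal P~_a[Z = z].
    """
    n_a, sum_a = defaultdict(int), defaultdict(float)   # T^A_t(a) and reward sums
    n_z, sum_z = defaultdict(int), defaultdict(float)   # T^Z_t(z) and reward sums
    for a, z, y in history:
        n_a[a] += 1; sum_a[a] += y
        n_z[z] += 1; sum_z[z] += y

    bonus = lambda n: math.sqrt(math.log(2.0 / delta) / (2.0 * max(n, 1)))
    ucb_A = {a: sum_a[a] / max(n_a[a], 1) + bonus(n_a[a]) for a in actions}
    ucb_Z = {z: sum_z[z] / max(n_z[z], 1) + bonus(n_z[z]) for z in zs}
    # C-UCB index: optimistic conditional estimates averaged under the given marginals.
    ucb_causal = {a: sum(ucb_Z[z] * p_tilde[a][z] for z in zs) for a in actions}
    return ucb_A, ucb_causal

# UCB plays argmax_a of ucb_A; C-UCB plays argmax_a of ucb_causal.
```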
Heuristically, HAC-UCB has an initial exploration period to ensure that ν̃(Z) is sufficiently accurate; if not, it is replaced with the maximum likelihood estimate (MLE) of the marginals. HAC-UCB then optimistically plays C-UCB until there is sufficient evidence that the environment is not conditionally benign. The switch from C-UCB to UCB is decided by a hypothesis test performed on each round, which uses the confidence intervals that would hold if the environment were conditionally benign. When the arm mean estimates of UCB and C-UCB disagree, this provides evidence that the environment is not conditionally benign, and the evidence is considered sufficient to switch when the size of the disagreement is large compared to the size of the confidence intervals themselves. As we illustrate in the proofs of the regret bounds, with high probability this test will not induce a switch for a conditionally benign environment, and it will sufficiently limit the regret incurred by C-UCB if the environment is not conditionally benign.

3 Conditionally Benign Property

We now formalize the main property that HAC-UCB will adaptively exploit.

Definition 3.1. An environment ν ∈ P(Z × Y)^A is conditionally benign if and only if there exists p ∈ P(Z × Y) such that for each a ∈ A, ν_a(Z) ≪ p(Z) and ν_a(Y | Z) = p(Y | Z) p-a.s.

This definition does not require any causal terminology to define or use for regret bounds, but we now instantiate it for the causal setting. For a collection of finite random variables V and a (potentially) continuous random variable Y, let P_V be the set of all joint probability distributions with strictly positive marginal probabilities on V. Fix a DAG G on (V, Y) such that Y is a leaf, and fix two disjoint sets Z ⊆ V and A ⊆ V such that Pa_G(A) ⊆ V \ Z. Let A be the set of all possible do-interventions on A, and for each a ∈ A let p_a denote the interventional distribution (Definition A.5). This structure suggests a graphical analogue of the conditionally benign property.

Definition 3.2. For any DAG G and A′ ⊆ A, (G, A′) is conditionally benign if and only if for all p ∈ P_V that are Markov relative to G, {p_a(Z, Y) : a ∈ A′} is conditionally benign.

We now connect the conditionally benign property to d-separation (Definition A.6) and the front-door criterion (Definition A.8). All proofs are deferred to Appendix A, along with standard notation and definitions from the causal literature.

Theorem 3.3. Z d-separates Y from A on G if and only if (G, A) is conditionally benign.

This equivalence is a strict specialization of the conditionally benign property to the causal setting. In particular, to define the conditionally benign property, we need not require that all possible interventions be allowed. Define A_0 to be A with the null (observational) intervention removed, and let G_Ā denote G with all edges directed into A removed.

Theorem 3.4. Z d-separates Y from A on G_Ā if and only if (G, A_0) is conditionally benign.

The benefits of discarding the null intervention are demonstrated by the following fact.

Lemma 3.5. Z d-separates Y from A on G_Ā if Z satisfies the front-door criterion for (A, Y) on G.
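For intuition, the conditionally benign property can be checked directly on a small finite environment by comparing the conditional reward distributions across actions. A toy numeric sketch (the probability tables are invented for illustration):

```python
# Each nu[a] is a distribution over (z, y) pairs, written as a dict of probabilities.
# Definition 3.1 asks that the conditional law of Y given Z = z agree across all
# actions that give z positive mass, even if the marginals over Z differ.
nu = {
    "a1": {("z0", 0): 0.56, ("z0", 1): 0.24, ("z1", 0): 0.04, ("z1", 1): 0.16},
    "a2": {("z0", 0): 0.14, ("z0", 1): 0.06, ("z1", 0): 0.16, ("z1", 1): 0.64},
}
# Both actions satisfy nu_a(Y=1 | Z=z0) = 0.3 and nu_a(Y=1 | Z=z1) = 0.8,
# while inducing very different marginals over Z, so this environment is benign.

def conditional(dist, z):
    mass = {y: p for (zz, y), p in dist.items() if zz == z}
    total = sum(mass.values())
    return {y: p / total for y, p in mass.items()} if total > 0 else None

def close(d1, d2, tol=1e-9):
    return all(abs(d1.get(y, 0.0) - d2.get(y, 0.0)) <= tol for y in set(d1) | set(d2))

zs = {z for dist in nu.values() for (z, _) in dist}
benign = True
for z in zs:
    conds = [c for a in nu if (c := conditional(nu[a], z)) is not None]
    if any(not close(conds[0], c) for c in conds[1:]):
        benign = False
print("conditionally benign:", benign)   # True for the tables above
```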
We visualize the preceding results in Figure 1. In graph (a), Z d-separates the intervention from the reward, and hence any Markov relative distribution (Definition A.3) will induce a conditionally benign environment. Graph (b) corresponds to a setting where one cannot hope to always improve performance due to the direct effect of the intervention on the reward, and consequently the environment need not be conditionally benign. In graph (c), the presence of the unobserved confounder U means that Z does not d-separate the intervention from the reward. However, if the null intervention is not considered, the arrow from U to A is never applicable, and hence any Markov relative distribution on the modified DAG will induce a conditionally benign environment. Specifically, graph (c) satisfies the front-door criterion, revealing that the conditionally benign property captures that this setting is still benign for decision-making, even though the conditions assumed by Lu et al. [24] do not hold. Finally, in graph (d), the unobserved confounder U once again violates d-separation, but the front-door criterion is also not satisfied because of the back-door path from Z to Y. Hence, even discarding the null intervention does not guarantee that the environment will be conditionally benign.

Figure 1: DAGs to illustrate the conditionally benign property. A is the intervention, Z is the post-action context, Y is the reward, and U is an unobserved variable. (G, A) is conditionally benign for (a), but only (G, A_0) is conditionally benign for (c). For (b) and (d) the environment need not be conditionally benign.

4 Analysis of Bandit Algorithms

We now study the impact of the conditionally benign property on regret. All proofs are deferred to Appendix B. First, recall the standard regret bound for UCB, with constants tuned to rewards in [0, 1].

Theorem 4.1 (Theorem 7.2 of [19]). For all A, Z, T, and ν ∈ P(Z × Y)^A, if δ = 2/T^2 then

R_{ν,UCB}(T) ≤ 2|A| + 4√(2|A| T log T).

Second, we generalize the main result of Lu et al. [24] by relaxing two assumptions: we only require that the environment is conditionally benign, and we allow for approximate marginal distributions using the following definition. Later, this will enable us to trade off the approximation error of ν̃(Z) against online estimation of ν(Z) in order to ensure that HAC-UCB always incurs sublinear regret.

Definition 4.2. For any ε ≥ 0, ν(Z) and ν̃(Z) are ε-close if max_{a∈A, z∈Z} |P_{ν_a}[Z = z] − P̃_a[Z = z]| ≤ ε.

Theorem 4.3 (Refined Theorem 1 of [24]). For all ε > 0, A, Z, T, and conditionally benign ν ∈ P(Z × Y)^A, if ν(Z) and ν̃(Z) are ε-close and δ = 2/T^2, then

R_{ν,C}(T) ≤ 2|Z| + 6√(|Z| T log T) + (log T)√(2T) + 2ε(1 + ⋯).

Remark 4.4. Lu et al. [24] assume that ε = 0. Our result implies that an approximation error of ε = √(|Z|/T) is sufficient to achieve the optimal rate.

Next, we motivate our introduction of a new algorithm by showing that C-UCB can catastrophically fail when the environment is not conditionally benign, incurring regret that is linear in T (which is as bad as possible for bounded rewards).

Theorem 4.5. For every A and Z with |A| ≥ 2, there exists ν ∈ P(Z × Y)^A such that, even if ν̃(Z) = ν(Z), for all possible settings of the confidence parameters δ_T, R_{ν,C}(T) = Ω(T).

Remark 4.6. This lower bound is specifically for C-UCB. However, any algorithm that relies on eliminating actions from consideration via assumption rather than data is susceptible to such an issue. In particular, this result is easily modified to apply to C-TS [24] and C-UCB-2 [27]. Other causal algorithms (e.g., Parallel Bandit [17]) also intuitively suffer from the issue that our construction exploits, although a different argument must be made since these rely on different causal structure.

We now state our regret upper bound for our new algorithm, HAC-UCB. In Theorem 6.2, we will show it is impossible to always achieve the optimal regret without knowledge of whether a d-separator is observed, but the following theorem shows some adaptivity is always possible. Crucially, HAC-UCB achieves sublinear regret without any assumptions on ν or ν̃(Z). For a more detailed breakdown of the constants, see Eq. (B.9).
Theorem 4.7 (Main Result). For all A, Z, T ≥ 25|A|^2, ν ∈ P(Z × Y)^A, and ν̃(Z) ∈ P(Z)^A,

R_{ν,HAC}(T) ≤ 4|A| + 11 T^{3/4} (log T) √(|A| |Z|) + 15 √((|A| + |Z|) T log T) + 5 (log T)√(2T).

For all ε ≤ T^{−1/4} √(|A| |Z| log T), if ν is conditionally benign and ν(Z) and ν̃(Z) are ε-close, then

R_{ν,HAC}(T) ≤ 4|A| + 2|Z| + 6√(|Z| T log T) + 4(log T)√(2T).

It is an open problem whether the dependence on T^{3/4} is tight. In Theorem 6.2, we will show that it is impossible to obtain worst-case regret of size √(|A| T) while still achieving improved regret on conditionally benign environments. However, it may be possible to improve the dependence on T, and the role of logarithmic factors in how much improvement is possible remains to be understood. Towards improving this result, we now show that there exists an environment that forces C-UCB to incur linear regret, yet on which HAC-UCB will switch to following UCB (and hence incurs at worst √(|A| T log T) regret). That is, HAC-UCB recovers the improved performance of C-UCB when the conditionally benign property holds, is never worse than C-UCB, and optimally outperforms C-UCB in some settings.

Theorem 4.8. There exists a constant C such that for any A and Z with |A| ≥ 2, there exists ν ∈ P(Z × Y)^A so that for any δ_T used for C-UCB with ν̃(Z) = ν(Z), R_{ν,C}(T) = Ω(T), and if ν̃(Z) = ν(Z) is used for HAC-UCB then

R_{ν,HAC}(T) ≤ C ( |A| + |Z| + √(|A| T log T) + (log T)√(2T) ).

Remark 4.9. Theorem 4.8 could be stated with ν̃(Z) only ε-close to ν(Z), but for simplicity we have supposed ε = 0 to highlight the role of the conditionally benign property.

5 Simulations

We now study the empirical performance of these algorithms in two key settings, corresponding to a conditionally benign environment and the lower bound environment from Theorem 4.5. We compare our algorithm HAC-UCB with UCB [5], C-UCB [24], C-UCB-2 [27], and Corral (for which we use the learning rate and algorithm prescribed by [2] with the epoch-based, scale-sensitive UCB prescribed by [4]); for all algorithms, we use the parameters that are optimal as prescribed by existing theory. To focus solely on the impact of the conditionally benign property, we set ν̃(Z) = ν(Z). The results of this section are a representative simulation demonstrating empirically that (a) for worst-case environments, both C-UCB and C-UCB-2 incur linear regret, while HAC-UCB successfully switches, incurring sublinear regret and competing with Corral and UCB, and (b) for conditionally benign environments, HAC-UCB and C-UCB enjoy improved performance compared to UCB, Corral, and C-UCB-2, all three of which have regret growing like √(|A| T). Implementation details are available in Appendix D and code can be found at https://github.com/blairbilodeau/adaptive-causal-bandits.

5.1 Conditionally Benign Environment

First, we consider a conditionally benign environment. Taking the gap Δ = √(|A| (log T)/T), the fixed conditional distribution for Z = {0, 1} is Y | Z ∼ Ber(1/2 + (1 − Z)Δ). Then, for a small ε (we take ε = 0.0005), we set P_{ν_1}[Z = 0] = 1 − ε and P_{ν_a}[Z = 0] = ε for all other a ∈ A \ {1}. Thus, a* = 1, and the actions are separated by Δ. In summary, each z ∈ Z has positive probability of being observed, yet each action nearly deterministically fixes Z.
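For reference, this environment can be generated as follows; the sketch is our own reading of the description above (the function and variable names are hypothetical, and this is not the released repository code):

```python
import math
import random

def make_cond_benign_env(num_actions, T, eps=0.0005):
    """Return (sample, best_mean) for the conditionally benign environment of Section 5.1.

    Y | Z ~ Ber(1/2 + (1 - Z) * Delta), with Delta = sqrt(|A| log T / T).
    Action 1 puts mass 1 - eps on Z = 0; every other action puts mass eps on Z = 0.
    """
    delta = math.sqrt(num_actions * math.log(T) / T)

    def p_z0(action):
        return 1.0 - eps if action == 1 else eps

    def sample(action):
        z = 0 if random.random() < p_z0(action) else 1
        y = 1 if random.random() < 0.5 + (1 - z) * delta else 0
        return z, y   # (Z_t(a), Y_t(a))

    # Mean reward of action a: P_a[Z=0](1/2 + Delta) + P_a[Z=1](1/2).
    best_mean = max(0.5 + p_z0(a) * delta for a in range(1, num_actions + 1))
    return sample, best_mean

sample, best_mean = make_cond_benign_env(num_actions=20, T=3000)
print(sample(1), sample(2), round(best_mean, 4))
```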
In Figure 2 (left panel) we observe three main effects: (a) C-UCB and HAC-UCB perform similarly (their regret curves overlap), both achieving much smaller regret that remains unchanged by increasing |A|; (b) UCB grows at the worst-case rate of roughly √(|A| T log T), not taking advantage of the conditionally benign property; and (c) neither Corral nor C-UCB-2 realizes the benefits of the conditionally benign property, since their regret increases with |A| and empirically they perform worse than UCB. We note that the x-axis starts at T = 500 to satisfy the minor requirement of T > |A|^2.

Figure 2: Regret when the conditionally benign property holds (left panel, "Conditionally Benign Environment", |A| = 20, |Z| = 2) and when it fails (right panel, "Worst Case Environment", |A| = 20, |Z| = 2). Both panels compare UCB, Corral, C-UCB, C-UCB-2, and HAC-UCB.

5.2 Worst Case Environment

Second, we consider an environment that is not conditionally benign. We use the same general environment from our lower bound in Theorem 4.5, where the causal algorithms learn a biased estimate of which z ∈ {0, 1} has the higher conditional mean (since Z is not a d-separator), and consequently they concentrate onto a bad action. In Figure 2 (right panel), we again observe three main effects: (a) both C-UCB and C-UCB-2 incur linear regret, as prescribed by Theorem 4.5; (b) HAC-UCB achieves sublinear regret, although it is worse than UCB (we show in Theorem 6.2 that optimal adaptivity is impossible); and (c) Corral does not suffer linear regret, but appears to still do worse than HAC-UCB for large T.
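As a usage sketch tying the pieces together (again our own illustrative code, not the released implementation), any index policy can be run against such a sampler while recording cumulative regret; curves like those in Figure 2 are produced this way. The example below runs plain UCB and assumes a sampler like make_cond_benign_env from the earlier sketch:

```python
import math
from collections import defaultdict

def run_ucb(sample, best_mean, num_actions, T, delta=None):
    """Play UCB for T rounds on an environment sampler; return the cumulative regret path."""
    delta = delta if delta is not None else 2.0 / T**2
    counts, sums = defaultdict(int), defaultdict(float)
    cum_regret, regret_path = 0.0, []
    for t in range(1, T + 1):
        # UCB index; unplayed actions get an infinite bonus so each is tried once.
        def index(a):
            if counts[a] == 0:
                return float("inf")
            return sums[a] / counts[a] + math.sqrt(math.log(2.0 / delta) / (2.0 * counts[a]))
        a = max(range(1, num_actions + 1), key=index)
        _, y = sample(a)
        counts[a] += 1
        sums[a] += y
        cum_regret += best_mean - y   # regret increment relative to the optimal mean reward
        regret_path.append(cum_regret)
    return regret_path

# Example (reusing make_cond_benign_env from the sketch above):
# sample, best_mean = make_cond_benign_env(num_actions=20, T=3000)
# path = run_ucb(sample, best_mean, num_actions=20, T=3000)
```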
6 Adaptivity for Causal Structures

Thus far we have analyzed the regret of various algorithms in two cases: environments that either are or are not conditionally benign. Minimax algorithms (UCB) fail to achieve smaller regret for conditionally benign environments, while causal algorithms (C-UCB) fail catastrophically in environments that are not conditionally benign. In this section, we formalize the notion of strict adaptivity (adaptive minimax optimality) with respect to the conditionally benign property, show that it is impossible to be adaptively minimax optimal with respect to the conditionally benign property, and discuss a relaxed notion of optimal adaptivity based on Pareto optimality.

6.1 Generic Algorithms

In order to describe adaptivity (and its impossibility) in the stochastic bandit problem with post-action contexts, we require a higher level of abstraction than a policy, which we achieve with algorithms. It is possible to define algorithms and the corresponding notion of adaptivity in abstract generality. However, we take the same perspective as Bilodeau et al. [8] (who define adaptive minimax optimality with respect to relaxations of the i.i.d. assumption), and sacrifice some generality by defining algorithms using the specific objects we study in this work. Formally, an algorithm is any map from the problem-specific inputs to the space of compatible policies, denoted by a : (A, Z, T, ν̃(Z)) ↦ a(A, Z, T, ν̃(Z)) ∈ Π(A, Z, T). We denote the set of all algorithms by A_C-MAB, and the subset of algorithms that are constant in (Z, ν̃(Z)) by A_MAB; this subset contains the classical bandit algorithms that are agnostic to knowledge of post-action contexts, or more specifically, do not exploit causal structure.

6.2 Adaptive Minimax Optimality

Let p : P(Z × Y)^A → {0, 1} encode whether a given environment satisfies a certain property of interest; in this work, it is always the indicator of whether ν is conditionally benign. Further, for every q ∈ P(Z)^A, denote the set of all environments with marginals q by Π_{A,Z}(q) = {ν ∈ P(Z × Y)^A : ν(Z) = q}. There are multiple ways one could define optimal adaptivity; we propose the following notion of strict adaptivity, which requires the experimenter to do as well as they possibly could have if they had access to p(ν) in advance, but without this knowledge.

Definition 6.1. An algorithm a ∈ A_C-MAB is adaptively minimax optimal with respect to p if and only if there exists C > 0 such that for all A, Z, q ∈ P(Z)^A, and T,

sup_{ν ∈ Π_{A,Z}(q)} R_{ν, a(A,Z,q,T)}(T) ≤ C inf_{π ∈ Π(A,Z,T)} sup_{ν ∈ Π_{A,Z}(q)} R_{ν,π}(T)   (6.1)

and

sup_{ν ∈ Π_{A,Z}(q) : p(ν)=1} R_{ν, a(A,Z,q,T)}(T) ≤ C inf_{π ∈ Π(A,Z,T)} sup_{ν ∈ Π_{A,Z}(q) : p(ν)=1} R_{ν,π}(T).   (6.2)

6.3 Impossibility of Strict Adaptivity

We now show it is impossible for any algorithm to always realize the benefits of the conditionally benign property while also recovering the worst-case rate of √(|A| T) (e.g., Theorems 9.1 and 15.2 of [19]), even when the algorithm has access to the true marginals. Our proof strategy is a modification of the finite-time lower bounds from Section 16.2 of Lattimore and Szepesvári [19]. Notably, the lower bounds of Lu et al. [24] already imply that any algorithm that does not take advantage of causal structure cannot be adaptively minimax optimal. We prove a significantly stronger result: even algorithms that use Z and ν̃(Z) = ν(Z) cannot be adaptively minimax optimal!

Theorem 6.2. Let a ∈ A_C-MAB be such that there exists C > 0 such that for all A, Z, and T,

sup_{ν ∈ P(Z × Y)^A} R_{ν, a(A,Z,ν(Z),T)}(T) ≤ C √(|A| T).

Then there exists a constant C′ such that for all A, Z, and T ≥ |A|, there exists a conditionally benign ν ∈ P(Z × Y)^A with

R_{ν, a(A,Z,ν(Z),T)}(T) ≥ C′ ⋯

6.4 Pareto-Adaptive Minimax Optimality

In light of this impossibility result, it is of interest to characterize relaxations of strict adaptivity that are achievable. Koolen [15] and Lattimore [18] introduced the Pareto frontier of regret, which, when applied to the conditionally benign property, is the set of all tuples of regret guarantees such that improving the regret in conditionally benign environments would necessarily force the worst-case regret to increase. We propose that it is desirable for an algorithm to do as well as possible in the worst case, subject to always realizing smaller regret on conditionally benign environments. Formally, let A*_C-MAB be the subset of a ∈ A_C-MAB that satisfy Eq. (6.2) for some constant C.

Definition 6.3. An algorithm a* ∈ A*_C-MAB is Pareto-adaptively minimax optimal with respect to p if and only if there exists C > 0 such that for all a ∈ A*_C-MAB, A, Z, q ∈ P(Z)^A, and T,

sup_{ν ∈ Π_{A,Z}(q)} R_{ν, a*(A,Z,q,T)}(T) ≤ C sup_{ν ∈ Π_{A,Z}(q)} R_{ν, a(A,Z,q,T)}(T).

It remains an open problem to prove whether HAC-UCB is Pareto-adaptively minimax optimal, and more generally to identify the Pareto frontier for the causal bandit problem.

7 Related Work

Kocaoglu et al. [14] and Lindgren et al. [23] efficiently used interventions to learn a causal graph under standard causal assumptions [29]. Hyttinen et al. [13] identified the minimal number of experiments needed to learn the underlying causal structure of variables under only a linear structural assumption. Lee and Bareinboim [20, 22, 21] identified the minimal set of interventions that permit causal effect identification in the presence of known causal structure, while Kumor et al. [16] studied analogues of the back-door condition for identifying causal effects with unobserved confounders.
Bareinboim et al. [7] introduced causal bandits with a binary motivating example to demonstrate that empirically better performance can be achieved by exploiting a specific, known causal structure. Lattimore et al. [17] and Yabe et al. [35] studied best-arm identification, where the experimenter does not incur any penalty for exploration rounds. Given knowledge of the causal graph informing the interventions and response, they separately proved that exponential improvements in the dependence of the regret on the action set are possible, provided the underlying distribution on the causal graph is sufficiently "balanced". Sen et al. [32] obtained an instance-dependent regret bound under causal assumptions, but obtained the wrong dependence on the arm gap (Δ^{−2} rather than Δ^{−1}), and consequently in the worst case the dependence on T may still dwarf the structural benefits. Sen et al. [33] studied an alternative type of intervention, where rather than fixing a node, only the conditional probabilities are changed. This notion is easily stated in our notation, since we allow for abstract families of distributions (indexed by abstract interventions) to define an environment. However, they focused on distribution-dependent guarantees under stronger causal assumptions, and hence our results are not directly comparable. All of the above regret bounds rely heavily on assumptions about the causal graph, and without such assumptions the presumed information learned from non-intervening rounds can catastrophically mislead the experimenter in exactly the same way that C-UCB suffers in our Theorem 4.5. Hence, it remains an interesting open problem to study adaptivity in each of these variations of the causal bandit setting, and our work provides a stepping stone to do so.

Prior to the present work, Lu et al. [24] had already been extended in multiple directions. de Kroon et al. [12] observed that C-UCB can be reduced to requiring only a separating set, but only proved that the regret is no worse than that of UCB if a separating set is observed. The authors remark that a causal discovery algorithm could in principle be used to learn the separating set online, but observed in their experiments that they obtain biased estimates, and hence there are no convergence guarantees. Lu et al. [25] replaced knowledge of the causal graph with the assumption that the causal graph has a tree structure, and incorporated the cost of learning the tree into the full regret bound. Nair et al. [27] provided an instance-dependent regret bound for an alternative algorithm to C-UCB, which they call C-UCB-2, in the presence of the full causal graph. While they demonstrated empirically that C-UCB-2 outperforms C-UCB for certain instances, we find that C-UCB-2 performs much worse when a d-separator is observed, and the provable linear lower bound (Theorem 4.5) also applies to C-UCB-2 when there are no observed d-separators.

8 Discussion

We have demonstrated that the improved regret possible when a d-separator is observed can also be realized in the multi-armed bandit problem by requiring only certain conditional independencies, which we have formalized using the conditionally benign property. We proved that it is impossible to optimally adapt to this property, but provided a new algorithm (HAC-UCB) that simultaneously recovers the improved regret for conditionally benign environments and significantly improves on prior work when the conditionally benign property does not hold.
Crucially, our algorithm requires no more assumptions about the world than vanilla bandit algorithms. We expect our results to spur future work on (a) improved adaptation to the conditionally benign property, (b) relaxations of the conditionally benign property for which optimal adaptation is possible, and (c) adaptation in more general partial feedback settings. In practice, HAC-UCB will be most useful in settings with a large action space and intermediate variables that may plausibly satisfy the conditionally benign property. In passing, we mention that one such example is learning the causal effect of genome edits (interventions) on disease phenotypes. Here, the post-action context could be gene expressions that are sometimes assumed to be a d-separator (e.g., [3]). We leave the implementation of our algorithm in clinical settings and collaboration with practitioners for future work.

Acknowledgements

BB is supported by an NSERC Canada Graduate Scholarship and the Vector Institute. DMR is supported in part by an NSERC Discovery Grant, Canada CIFAR AI Chair funding through the Vector Institute, and an Ontario Early Researcher Award. This material is also based upon work supported by the United States Air Force under Contract No. FA850-19-C-0511. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force. Part of this work was done while BB and DMR were visiting the Simons Institute for the Theory of Computing. We thank Zachary Lipton for suggesting to consider simultaneously adapting to estimation of the marginal distributions, and we thank Chandler Squires, Vasilis Syrgkanis, and Angela Zhou for helpful discussions.

References

[1] Y. Abbasi-Yadkori, P. L. Bartlett, V. Gabillon, A. Malek, and M. Valko. Best of both worlds: Stochastic and adversarial best-arm identification. In Proceedings of the 31st Annual Conference on Learning Theory, 2018.
[2] A. Agarwal, H. Luo, B. Neyshabur, and R. E. Schapire. Corralling a band of bandit algorithms. In Proceedings of the 30th Annual Conference on Learning Theory, 2017.
[3] H. F. Ainsworth and H. J. Cordell. Using gene expression data to identify causal pathways between genotype and phenotype in a complex disease: Application to genetic analysis workshop 19. BMC Proceedings, 10(49), 2016.
[4] R. Arora, T. V. Marinov, and M. Mohri. Corralling stochastic bandit algorithms. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, 2021.
[5] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
[6] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[7] E. Bareinboim, A. Forney, and J. Pearl. Bandits with unobserved confounders: A causal approach. In Advances in Neural Information Processing Systems 28, 2015.
[8] B. Bilodeau, J. Negrea, and D. M. Roy. Relaxing the i.i.d. assumption: Adaptively minimax optimal regret via root-entropic regularization, 2020. arXiv:2007.06552.
[9] S. Bubeck and A. Slivkins. The best of both worlds: Stochastic and adversarial bandits. In Proceedings of the 25th Conference on Learning Theory, 2012.
[10] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandits problems. In Proceedings of the 20th International Conference on Algorithmic Learning Theory, 2009.
[11] C. L. Canonne. A short note on learning discrete distributions, 2020. arXiv:2002.11457.
[12] A. A. de Kroon, D. Belgrave, and J. M. Mooij. Causal bandits without prior knowledge using separating sets. In Proceedings of the 1st Conference on Causal Learning and Reasoning, 2021.
[13] A. Hyttinen, F. Eberhardt, and P. O. Hoyer. Experiment selection for causal discovery. Journal of Machine Learning Research, 14:3041–3071, 2013.
[14] M. Kocaoglu, A. G. Dimakis, and S. Vishwanath. Cost-optimal learning of causal graphs. In Proceedings of the 34th International Conference on Machine Learning, 2017.
[15] W. M. Koolen. The Pareto regret frontier. In Advances in Neural Information Processing Systems 27, 2013.
[16] D. Kumor, J. Zhang, and E. Bareinboim. Sequential causal imitation learning with unobserved confounders. In Advances in Neural Information Processing Systems 35, 2021.
[17] F. Lattimore, T. Lattimore, and M. D. Reid. Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems 30, 2016.
[18] T. Lattimore. The Pareto regret frontier for bandits. In Advances in Neural Information Processing Systems 29, 2015.
[19] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
[20] S. Lee and E. Bareinboim. Structural causal bandits with non-manipulable variables. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019.
[21] S. Lee and E. Bareinboim. Causal effect identifiability under partial-observability. In Proceedings of the 37th International Conference on Machine Learning, 2020.
[22] S. Lee and E. Bareinboim. Characterizing optimal mixed policies: Where to intervene and what to observe. In Advances in Neural Information Processing Systems 34, 2020.
[23] E. M. Lindgren, M. Kocaoglu, A. G. Dimakis, and S. Vishwanath. Experimental design for cost-aware learning of causal graphs. In Advances in Neural Information Processing Systems 32, 2018.
[24] Y. Lu, A. Meisami, A. Tewari, and Z. Yan. Regret analysis of bandit problems with causal background knowledge. In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence, 2020.
[25] Y. Lu, A. Meisami, and A. Tewari. Causal bandits with unknown graph structure. In Advances in Neural Information Processing Systems 35, 2021.
[26] T. Lykouris, V. Mirrokni, and R. Paes Leme. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 2018.
[27] V. Nair, V. Patil, and G. Sinha. Budgeted and non-budgeted causal bandits. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, 2021.
[28] F. Orabona. A modern introduction to online learning, 2019. arXiv:1912.13213.
[29] J. Pearl. Causality. Cambridge University Press, 2nd edition, 2009.
[30] J. Pearl, M. Glymour, and N. P. Jewell. Causal Inference in Statistics: A Primer. Wiley, 2016.
[31] Y. Seldin and A. Slivkins. One practical algorithm for both stochastic and adversarial bandits. In Proceedings of the 31st International Conference on Machine Learning, 2014.
[32] R. Sen, K. Shanmugam, M. Kocaoglu, A. Dimakis, and S. Shakkottai. Contextual bandits with latent confounders: An NMF approach. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.
[33] R. Sen, K. Shanmugam, A. G. Dimakis, and S. Shakkottai. Identifying best interventions through online importance sampling. In Proceedings of the 34th International Conference on Machine Learning, 2017.
[34] T. S. Thune and Y. Seldin. Adaptation to easy data in prediction with limited advice. In Advances in Neural Information Processing Systems 31, 2018.
[35] A. Yabe, D. Hatano, H. Sumita, S. Ito, N. Kakimura, T. Fukunaga, and K.-i. Kawarabayashi. Causal bandits with propagating inference. In Proceedings of the 35th International Conference on Machine Learning, 2018.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [No] We do not see any negative societal impacts of these theoretical results.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A] All computations were done on CPU on a personal laptop computer.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]