# Additive Causal Bandits with Unknown Graph

Alan Malek¹, Virginia Aglietti¹, Silvia Chiappa¹

¹DeepMind, London, UK. Correspondence to: Alan Malek. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

We explore algorithms to select actions in the causal bandit setting, where the learner can choose to intervene on a set of random variables related by a causal graph, and the learner sequentially chooses interventions and observes a sample from the interventional distribution. The learner's goal is to quickly find the intervention, among all interventions on observable variables, that maximizes the expectation of an outcome variable. We depart from previous literature by assuming no knowledge of the causal graph except that latent confounders between the outcome and its ancestors are not present. We first show that the unknown-graph problem can be exponentially hard in the number of parents of the outcome. To remedy this, we adopt an additional additive assumption on the outcome which allows us to solve the problem by casting it as an additive combinatorial linear bandit problem with full-bandit feedback. We propose a novel action-elimination algorithm for this setting, show how to apply this algorithm to the causal bandit problem, provide sample complexity bounds, and empirically validate our findings on a suite of randomly generated causal models, effectively showing that one does not need to explicitly learn the parents of the outcome to identify the best intervention.

## 1. Introduction

What setting of our factory production system should we choose to maximize efficiency? Which nutrients would induce a maximal crop yield increase? What combination of drugs and dosages would optimize patient outcomes? All these questions ask which variables and values would optimize the causal effect on an outcome Y. In a system of variables X[K] = {X_1, ..., X_K} and Y, these questions can be phrased as asking which set X ⊆ X[K] and values x ∈ supp(X) would produce an optimal outcome under the intervention do(X = x), whose effect is to alter the observational distribution p(X[K], Y) describing the existing relationships between X[K] and Y by setting X to the fixed value x. Indicating with p(Y | do(X = x)) the distribution of Y under such an intervention, answering these questions is equivalent to solving the optimization problem

max_{X ⊆ X[K], x ∈ supp(X)} E[Y | do(X = x)].

The causal bandit problem, proposed in Lattimore et al. (2016), is an extension of the multi-armed bandit problem to the setting where many variables can be intervened on and a causal graph G is used to describe the causal relationships among X[K] and Y. The learner solves this optimization problem by repeatedly choosing an intervention (X, x), also called an action, and observing a sample from p(Y | do(X = x)). A naïve approach to the problem would be to treat each of the combinatorially many actions as independent and run a typical bandit algorithm. Instead, the causal graph enables us to reason about acting on sub-parts of the system and exploit the causal structure to reduce the action set A. For example, consider a system of variables X_1, X_2, and Y with causal graph X_1 → X_2 → Y. Fact 1.1 below tells us that intervening on pa(Y), the parents of Y (or direct causes, i.e. the variables with an edge into Y), can always produce an expectation as high as the best intervention on any other set.
As pa(Y) = {X_2}, this means that the optimization problem can be simplified to max_{x_2 ∈ supp(X_2)} E[Y | do(X_2 = x_2)].

Causal bandits have been explored under various assumptions on G, A, and the interventional distribution. Lattimore et al. (2016) assumed full knowledge of G and of the distribution of pa(Y) under any intervention in A and proposed an algorithm that selects all actions at the start to minimize a lower bound on sample complexity. Lu et al. (2020) proposed a UCB algorithm for the cumulative regret setting which exploits the observation that one does not need confidence bounds on individual actions but rather on how the actions affect pa(Y). Lu et al. (2021) were the first to consider an unspecified graph, and instead considered the special case of a tree G with a singleton pa(Y), which allowed pa(Y) to be identified via binary search. De Kroon et al. (2022) did not place constraints on the structure of G and instead used causal discovery algorithms to learn a separating set (i.e. a set that d-separates the variables we may intervene on from Y), which thereby allowed them to decompose the problem into learning the effect of the actions on the separating set and learning the distribution of Y conditioned on the separating set. This algorithm can be viewed as a generalization of previous causal bandit algorithms, which effectively use pa(Y) as the separating set. Bilodeau et al. (2022) developed an algorithm that performs optimally when a given set is a separating set but falls back to a normal bandit algorithm otherwise. With the exception of Bilodeau et al. (2022), De Kroon et al. (2022), and Maiti et al. (2022), these works assumed no latent confounders (i.e. latent common causes), and many assumed that the learner could make arbitrary interventions on all variables excluding Y. Xiong & Chen (2023) were the first to consider the PAC (i.e. the fixed-confidence) setting and, similar to this work, provide instance-dependent PAC bounds. They study the known-graph case (potentially with unobserved variables), whereas we focus on the unknown-graph case with the additional assumptions that (i) there are no unobserved confounders between Y and its ancestors, and (ii) we may intervene on all variables in X[K] (which is a common assumption in the causal bandit literature, see Lee & Bareinboim (2018); Lu et al. (2020)). We refer to this problem as the causal bandit with an unknown graph (CBUG) problem.

While the second assumption is common in the literature, the first is a relaxation of the typical no-latent-confounders assumption and natural in several situations. For example, if one has a list that contains the parents of Y (e.g. built from domain experts or a noisy causal discovery algorithm), our assumption is still satisfied for arbitrary joint distributions on these variables (even with latent confounding between them and between other variables). Finally, this confounding assumption implies that the optimal intervention set is pa(Y), as demonstrated in Lee & Bareinboim (2018, Proposition 2).

Fact 1.1. If there are no latent confounders between Y and any of its ancestors, then

max_{X ⊆ X[K], x ∈ supp(X)} E[Y | do(X = x)] ≤ max_{x′ ∈ supp(pa(Y))} E[Y | do(pa(Y) = x′)].

This fact suggests that we can solve the CBUG problem by first learning pa(Y) and then searching for the optimal values in supp(pa(Y)); we refer to this approach as parents-first.
Instead, since a global intervention do(X[K] = x) cuts all incoming edges into X[K] in G, so that only the edges from pa(Y) to Y remain, it suffices to use global interventions and avoid learning the parents altogether.

Fact 1.2. Under the conditions of Fact 1.1, E[Y | do(X[K] = x)] = E[Y | do(pa(Y) = x′)], where x′ are the values of x limited to pa(Y).

Using the two above facts, the CBUG problem can be recast as a regression problem over supp(X[K]). In other words, when there are no latent confounders between Y and any of its ancestors and when we may intervene on all variables, the optimal intervention may be found by a global intervention, which circumvents the need to know the causal graph. To our surprise, this observation does not seem to have been used to design algorithms before.

Unfortunately, even with this simplification, the CBUG problem can be intractable. In Section 2, we show that, for any algorithm, there exists a problem instance with sample complexity exponential in the size of pa(Y); therefore, we need additional structural assumptions to make the problem tractable. Turning towards the causal inference literature, we see that the most common structural assumption is that the outcome is a noisy additive function of its parents, the discrete analog of the linear assumption. It has been developed into its own theory (Hastie, 2017) and is used throughout the social and biomedical sciences: see e.g. Imbens & Rubin (2015, Chapter 13), Bühlmann et al. (2014), and more recently Maeda & Shimizu (2021) for examples. This assumption leads us to define the additive CBUG (a-CBUG) problem, where we additionally assume that Y is an additive function of pa(Y) plus a random term. We emphasize that additive outcome settings are of significant interest to the causal inference community.

As developed in Section 3, the key implication of the additive assumption is that the a-CBUG problem can be recast as a linear bandit problem (Lattimore & Szepesvári (2020, Chapter 19) provides a good introduction) where the action set is combinatorial and we only have full-bandit feedback, meaning that we never observe the effect of individual variables on the outcome and only observe a single sample of the outcome from p(Y | do(X = x)). By considering the specific problem of a-CBUG, we have naturally arrived at the additive combinatorial linear bandit problem with full-bandit feedback. To the best of our knowledge, we are the first to consider this problem, which extends previous causal bandit settings in two ways: (1) the action set is combinatorial, whereas most prior pure-exploration linear bandits cannot scale to combinatorial actions, and (2) we only have full-bandit feedback, meaning that we can never observe the individual additive components of Y and must infer them from only their sum. Existing pure-exploration linear bandit algorithms either have complexities that scale with the number of actions or cannot exploit the structure of the problem; hence, in Section 4, we propose a novel action-elimination algorithm that alternates between selecting actions that approximately solve an optimal design problem with decreasing tolerances and using the resulting observations to eliminate suboptimal actions.
Noting that storing a combinatorial action set and solving optimal design problems are generally intractable, we solve both computational challenges by restricting A to marginal action sets that decompose over variables, which also allows for an easy approximation of the optimal design problem. We name our algorithm marginal optimal design linear bandit (MODL). We analyze the algorithm in the PAC setting and provide one of the first instance-dependent analyses of the sample complexity (with Xiong & Chen (2023) being the only other, to the best of our knowledge). Finally, in Section 5, we show that MODL performs well for a-CBUG problems and, in particular, significantly outperforms the parents-first approach while being only slightly behind an oracle version of the algorithm that knows pa(Y).

## 2. Additive Causal Bandits with Unknown Graphs (a-CBUG)

This section gives a formal definition of the CBUG problem, presents a lower bound showing that any algorithm for solving this problem must have exponential dependence on the number of parents, and introduces the additive outcome assumption.

Notation. We refer to sets of variables or values using bold face, e.g. X and x, respectively. We use a subscript k to indicate the variable number, superscripts i or j to indicate a discrete value, and superscripts n or t to indicate the sample or round number. For example, x^t_k is the value of X_k for the t-th sample, and θ^i_k is a parameter corresponding to X_k's i-th value. Finally, [n] := {1, ..., n} and x_[n] indicates a sequence of sets of values.

### 2.1. CBUG Problem Formulation

We assume a system formed by random variables V = X[K] ∪ {Y}, where X[K] = {X_1, ..., X_K}, supp(X_k) = {1, ..., M_k} (i.e. each X_k has finite integer support), and Y is an outcome of interest that can be real-valued or discrete. The variables are causally related by an acyclic causal graph G with associated observational distribution p(V). The learner acts by selecting a set X ⊆ X[K] and values x ∈ supp(X) and performing the intervention do(X = x), which corresponds to replacing p(V) with the interventional distribution p(V \ X | do(X = x)) resulting from removing all incoming edges into X from G and fixing the value of X to x. When there are no unobserved confounders (i.e. latent common causes between variables in V, usually represented by bidirected edges in G),

p(V) = p(Y | pa(Y)) ∏_{k=1}^K p(X_k | pa(X_k)),

and the interventional distribution can be expressed as

p(V \ X | do(X = x)) = p(Y | pa(Y)) ∏_{k : X_k ∉ X} p(X_k | pa(X_k)) δ_{X=x},

with δ_{X=x} a delta function centered at x.

The learner's goal in causal bandits is to find the set X and values x which result in the greatest expectation of Y under the interventional distribution, denoted E[Y | do(X = x)]. The learner accomplishes this task by interacting with the environment sequentially, choosing, at every round t, an intervention (X^t, x^t) (also called an action) and obtaining a sample from p(Y | do(X^t = x^t)). Without loss of generality, we assume pa(Y) = {X_1, ..., X_{P_Y}}, where P_Y is the number of parents of Y (of course, the learner does not know this ordering). We consider the causal bandit problem in the setting where G is unknown and the learner must find, for ϵ > 0 and δ ∈ (0, 1), an (ϵ, δ)-PAC solution (X̂, x̂) satisfying

P( max_{X, x} E[Y | do(X = x)] − E[Y | do(X̂ = x̂)] ≤ ϵ ) ≥ 1 − δ

in as few rounds (or samples; we use the two interchangeably) as possible. We refer to this problem as the causal bandit with an unknown graph (CBUG) problem.
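To make the notion of sampling from an interventional distribution concrete, the following minimal sketch (not taken from the paper's code release) simulates a toy SCM with X_1 → X_2 → Y and X_1 → Y; the structural mechanisms, noise scale, and variable names are made up purely for illustration.

```python
import numpy as np

# Toy SCM with X1 -> X2 -> Y and X1 -> Y.  All mechanisms below are
# hypothetical and only illustrate what sampling from p(Y | do(X = x)) means.
rng = np.random.default_rng(0)

def sample_y(do=None, n=1):
    """Draw n samples of Y under the intervention encoded by `do`.

    `do` maps variable names ('X1', 'X2') to fixed values; an intervened
    variable ignores its structural equation (its incoming edges are cut).
    """
    do = do or {}
    x1 = np.full(n, do['X1']) if 'X1' in do else rng.integers(1, 4, size=n)
    x2 = np.full(n, do['X2']) if 'X2' in do else (x1 % 3) + 1
    # Outcome mechanism: Y depends only on its parents (X1, X2) plus noise.
    return 0.5 * x1 + 1.0 * x2 + rng.normal(0.0, 1.0, size=n)

# Observational mean vs. mean under the global intervention do(X1=2, X2=3).
print(sample_y(n=10_000).mean(), sample_y(do={'X1': 2, 'X2': 3}, n=10_000).mean())
```

A global intervention fixes every X_k at once, so only the mechanism of Y is left intact; this is exactly the situation exploited by Fact 1.2.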
### 2.2. Global Intervention Approach

Recall that the CBUG problem assumes no latent confounders between Y and any of its ancestors (i.e. variables with a directed, or causal, path into Y) and that we can simultaneously intervene on all observable variables except for Y, i.e. we can make global interventions. As discussed previously, global interventions are a common model in the literature. They also model settings where statistical units are expensive relative to the cost of intervening on additional variables for a single unit, which implies that the sample complexity (the number of units used) is the key quantity to minimize, not the number of variables intervened on.

As stated in Fact 1.1, these two assumptions imply that pa(Y) is the optimal intervention set. Thus, a natural solution to the CBUG problem would be to first find pa(Y) and then search for the optimal value in supp(pa(Y)). Instead, we make the key observation that a global intervention do(X[K] = x) cuts all incoming edges into X[K] in G, leaving only the edges from pa(Y) to Y, which implies that performing a global intervention is equivalent to intervening on pa(Y); precisely, E[Y | do(X[K] = x)] = E[Y | do(pa(Y) = x′)], where x′ are the values of x for pa(Y) (recall Fact 1.2 above). This claim can be proved by invoking rule 3 of do-calculus (Pearl, 2000), which, in this case, states that E[Y | do(X[K])] = E[Y | do(pa(Y))] since Y ⊥ X[K] \ pa(Y) | pa(Y) in G_{X̄[K]}, where G_{X̄[K]} is the graph G with all incoming edges into X[K] removed. We can therefore restrict the problem to finding x̂ ∈ supp(X[K]) that satisfies

P( max_{x ∈ supp(X[K])} E[Y | do(X[K] = x)] − E[Y | do(X[K] = x̂)] ≤ ϵ ) ≥ 1 − δ,

meaning that learning the parents is, in some sense, optional.

### 2.3. The CBUG Problem is Exponentially Hard

While using global interventions saves us from having to consider all intervention sets in the powerset of X[K], the CBUG problem can be exponentially hard in P_Y. For a fixed δ and ϵ, let x* ∈ supp(pa(Y)) be fixed and unknown, and let E[Y | do(pa(Y) = x)] = 0 + ϵ·1{x = x*}: the expectation is flat except at the single value x*, where it is equal to ϵ. We can strengthen the example by choosing p such that, for any intervention do(X′ = x′) with x′ agreeing with x* on the intervened parents, we have p(X_k = x*_k | do(X′ = x′)) = 0 for all X_k ∉ X′. Only interventions whose values agree with x* on all of pa(Y) can provide information about x*, so this problem is difficult: the learner has to try actions blindly until one containing x* is found. We assume that Y | do(X = x) is 1-sub-Gaussian for any intervention.

Obtaining an upper bound for this problem is easy. Consider the algorithm that picks an ordering of all values x^1, x^2, ... in supp(X[K]) uniformly at random. Beginning at t = 1, it collects O(ϵ^{−2} log(1/δ)) samples from p(Y | do(X[K] = x^t)) and tests E[Y | do(X = x)] ≥ ϵ against the null hypothesis E[Y | do(X = x)] = 0. If the null hypothesis is rejected, then x^t is optimal and the algorithm terminates; otherwise, the algorithm moves on to t + 1. The sample complexity is O(|supp(pa(Y))| ϵ^{−2} log(1/δ)): since P(x* agrees with x^t on pa(Y)) = 1/|supp(pa(Y))|, we have to test, on average, O(|supp(pa(Y))|) actions before stumbling upon one containing x*. This naïve algorithm matches the following lower bound, with proof in Appendix C.

Theorem 2.1. There is an instance of the CBUG problem such that any (ϵ, δ)-PAC algorithm must take Ω(|supp(pa(Y))| ϵ^{−2} log(1/δ)) samples in expectation.
### 2.4. Additive Outcome Assumption

As the example from the previous section illustrates, a problem where information about the optimal intervention is hyper-localized (i.e. where one only learns about the optimal intervention by trying it) is information-theoretically difficult. Therefore, we need some assumptions to make the problem tractable and, as discussed in the introduction, turn to the additive assumption from the causal inference literature (Bühlmann et al., 2014).

Assumption 2.2 (Additive Outcome). There exist functions f_1, ..., f_{P_Y} and a σ²-sub-Gaussian random variable η such that Y = Σ_{k=1}^{P_Y} f_k(X_k) + η.

This assumption implies that the causal effect of pa(Y) on Y decomposes into the sum of individual effects from each parent, which results in an additive CBUG (a-CBUG) problem. We focus on the homoscedastic case where η is an i.i.d. σ²-sub-Gaussian random variable, by far the most common assumption in the literature.

## 3. Pure-Exploration Linear Bandits

This section defines the additive combinatorial linear bandit with full-bandit feedback problem and shows how a-CBUG is a special case. With the additive outcome assumption, the CBUG problem can be cast as a pure-exploration linear bandit problem with a combinatorial action set A. The linear bandit problem is a sequential decision problem where, at round t, the learner chooses an action x_t ∈ A and observes y_t = ⟨x_t, θ*⟩ + ϵ_t, where ϵ_t is a zero-mean σ²-sub-Gaussian random variable and θ* ∈ R^d is a fixed but unknown parameter. The goal of the learner is to find an (ϵ, δ)-PAC action in as few rounds/samples as possible.

We cast a-CBUG as a linear bandit problem using one-hot encoding. For k ∈ [K], let e_k(i) be the ith unit vector in R^{M_k}, and for x ∈ supp(X[K]), define e(x) = (e_1(x_1), ..., e_K(x_K)) to be the concatenation of the one-hot vectors, which produces a mapping from supp(X[K]) to R^d with d := Σ_k M_k. Defining the vector θ* = (f_1(1), f_1(2), ..., f_K(M_K)), we obtain E[Y | do(X[K] = x)] = ⟨θ*, e(x)⟩. Therefore, the goal is to find argmax_{x ∈ supp(X[K])} ⟨θ*, e(x)⟩. Note that the terms of θ* corresponding to X_k ∉ pa(Y) are zero, so θ* is sparse. It is important to note that we only have full-bandit feedback because we observe ⟨θ*, e(x)⟩ and not the individual f_k components. Even though the action set A has a combinatorial structure with size ∏_{k=1}^K M_k, as it is the Cartesian product of choosing one value for each variable, the additive assumption allows us to consider the dimension-d linear problem instead.

Casting a-CBUG as a linear bandit problem enables us to borrow from the extensive literature on pure-exploration linear bandits. One successful approach has been to treat the action selection as an optimal experimental design problem: that is, selecting actions to reveal as much information as possible about a hidden parameter vector estimated through regression. While these optimal design problems tend to be intractable except for special cases, we show how to approximate our action set to avoid these difficulties. This approach was pioneered by Soare et al. (2014), who had the insight that the optimal design problem should optimize for learning the gaps between actions. Improvements in sample complexity were made by Xu et al. (2018) and Tao et al. (2018) by using a different estimator and a different design approximation strategy, respectively, while Fiez et al. (2019) considered the more general problem of transductive experimental design. A survey of optimal design in linear bandits can be found in Lattimore & Szepesvári (2020, Chapter 22).
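To make the casting above concrete, the following sketch (ours, not the released code) builds the one-hot embedding e(x) and evaluates ⟨θ*, e(x)⟩ for a made-up instance with three variables, where the second variable is a non-parent and therefore has zero entries in θ*.

```python
import numpy as np

# One-hot embedding e(x) for the a-CBUG-to-linear-bandit reduction.
# M[k] is the support size of variable k; x is a tuple with x_k in {1,...,M_k}.
def embed(x, M):
    e = np.zeros(sum(M))
    offset = 0
    for xk, Mk in zip(x, M):
        e[offset + (xk - 1)] = 1.0       # one-hot block for this variable
        offset += Mk
    return e

M = (3, 2, 4)                                     # d = 3 + 2 + 4 = 9
theta_star = np.concatenate([[0.1, 0.7, 0.3],     # f_1(1..3)
                             [0.0, 0.0],          # non-parent: all zeros
                             [0.2, 0.9, 0.4, 0.1]])  # f_3(1..4)
x = (2, 1, 3)
print(embed(x, M) @ theta_star)                   # = f_1(2) + f_3(3) = 1.1
```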
Unfortunately, all of the pure-exploration linear bandit algorithms above have complexity that is linear in the number of actions and are therefore not tractable for our combinatorial action space. Another line of work (Chen et al., 2014; Gabillon et al., 2011) has the same additive action structure as us but assumed semi-bandit feedback, i.e. where noisy observations of the individual f_k(x_k) are possible. Because we only observe y_t = Σ_{k=1}^{P_Y} f_k(x^t_k) + η, we are in the full-bandit setting and cannot use these algorithms either. The only work we are aware of in the pure-exploration combinatorial linear bandit with full-bandit feedback setting is by Du et al. (2021), who claimed the first efficient algorithm for this setting. Their approach uses a pre-sampling step to select a subset of the actions of size O(poly(d)) and then runs the algorithm of (Constantinou & Dawid, 2017). The resulting algorithm is fairly complex (requiring multiple sub-procedures, including an entropy mirror-descent stage) and requires finding a size-d subset of actions with rank d. In our case, the rank of any subset of actions is at most d − 1, so we cannot use this algorithm. Further, their algorithm is general purpose and cannot fully exploit the structure of our action space. Hence, we created a new algorithm, introduced in the following section.

## 4. Marginal Optimal Design Linear Bandit

Given data {(x_t, y_t)}_{t=1}^n, we need to learn about the unknown parameter vector θ* in a way that lets us quantify the uncertainty. With V_n = Σ_{t≤n} x_t x_t^⊤ denoting the data covariance and V_n^− its pseudoinverse, we use the ordinary least squares (OLS) estimator, θ̂ = V_n^− Σ_{t≤n} x_t y_t, which has the following Azuma-style confidence interval (see, e.g., Soare et al. (2014)):

Lemma 4.1. Let θ̂ be the OLS estimator calculated from data x_[n] with covariance matrix V_n. For any z ∈ R^d, δ ∈ (0, 1), and σ²-sub-Gaussian η,

P( ⟨θ̂ − θ*, z⟩ ≥ √( 2σ² ‖z‖²_{V_n^−} log(1/δ) ) ) ≤ δ.

This proof, as well as all other omitted proofs, is given in Appendix C. The lemma requires x_[n] to not be a function of the data. The main challenge in using this inequality is that we need to solve for a sequence of covariates that minimizes the bound.

Following Lattimore & Szepesvári (2020, Chapter 22), we propose an action-elimination algorithm that proceeds in phases. Each phase begins with a set S of plausibly best actions and a desired tolerance γ. The algorithm chooses actions to optimize the upper bound in Lemma 4.1, then uses this guarantee to prune S. The tolerance is then decreased before the next phase. Phases repeat until an optimal action is identified or all ϵ-suboptimal actions have been removed.

There are two main computational difficulties. First, the number of actions, |A| = ∏_k M_k, is very large, which makes the pruning step potentially intractable. We need algorithms that scale with the exponentially smaller ambient dimension d = Σ_k M_k. Second, the action selection (known as an optimal design problem) is a combinatorial optimization problem and generally difficult. We solve both problems at once by limiting S to sets that decompose over variables, defined below.

Marginal Action Sets. We say that a set of actions S ⊆ A = supp(X[K]) is marginal if there exist S_k ⊆ supp(X_k), k = 1, ..., K, such that S = {(j_1, j_2, ..., j_K) : j_1 ∈ S_1, j_2 ∈ S_2, ..., j_K ∈ S_K}. In other words, S consists of the Cartesian product of S_1, ..., S_K. Marginal action sets are intuitive: if we have eliminated, say, X_k = j as a good action, then we should never consider any action where X_k = j.
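As a small illustration of why marginal action sets are cheap to store and manipulate (a sketch under our own naming, not from the paper's code): only the per-variable lists S_k need to be kept, even though they implicitly encode their full Cartesian product.

```python
from itertools import product

# A marginal action set is stored as one list of surviving values per variable.
S = [[1, 3], [2, 5], [1, 2, 4]]     # S_1, S_2, S_3 (made-up values)

stored = sum(len(Sk) for Sk in S)   # storage ~ sum_k |S_k| values
encoded = 1
for Sk in S:
    encoded *= len(Sk)              # ... yet it encodes prod_k |S_k| actions
print(stored, encoded)              # 7 values stored, 12 actions encoded

def contains(action, S):
    # membership tests also decompose across variables
    return all(a in Sk for a, Sk in zip(action, S))

print(contains((1, 2, 4), S), contains((2, 2, 4), S))   # True False
# the full product is only ever enumerated lazily, if at all:
first_three = [a for _, a in zip(range(3), product(*S))]
```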
Marginal action sets solve the combinatorial action set problem, since such sets can be represented by Σ_k M_k binary values.

Optimal Design. In linear bandits, our goal is to choose actions x_t to reveal as much as possible about the optimal action. While the most obvious goal would be to choose actions to minimize a bound on ⟨θ̂ − θ*, e(x)⟩ simultaneously for all actions x ∈ S, estimating the gaps between actions x and x′, defined as Δ(x, x′) := ⟨θ*, e(x) − e(x′)⟩, is more efficient (Fiez et al., 2019). Defining V_n := Σ_{t=1}^n e(x_t) e(x_t)^⊤, the optimal design problem is

argmin_{x_[n]} max_{x, x′ ∈ S} ‖e(x) − e(x′)‖²_{V_n^−}.   (1)

Generally, the optimal design problem (1) is intractable (Xu et al., 2018), and the state-of-the-art algorithms are linear in |S| (Allen-Zhu et al., 2021). Fortunately, marginal action sets afford a computationally simple solution with an easy-to-calculate upper bound.

Lemma 4.2. Assume that S is marginal, and let x_[n] be any sequence of actions that is uniform in every marginal, i.e. for every k, |Σ_{t=1}^n 1{x^t_k = i} − Σ_{t=1}^n 1{x^t_k = j}| ≤ 1 for all i, j ∈ S_k. With V_n as the covariance matrix of x_[n], we have

max_{x, x′ ∈ S} ‖e(x) − e(x′)‖²_{V_n^−} ≤ Σ_k 2|S_k| / (n − |S_k|).

Roughly, the proof proceeds by noting that V_n can be written as a diagonal matrix of counts plus cross terms, both of which are positive semi-definite. We can upper bound the total expression by the norm defined only with the diagonal terms, which permits a particularly simple form of ‖e(x) − e(x′)‖²_{V_n^−} that we can explicitly calculate for uniform sequences of marginals.

Remark 4.3. The embedding that we use for the linear bandit is not full rank. For example, for any vector v ∈ R^K with 1^⊤v = 0, the null space of V_n includes (v_1 1(M_1), ..., v_K 1(M_K)) (where 1(n) is the all-ones vector of length n). However, what is important is that the projection of the null space onto the coordinates in S_k is always in the all-ones direction, which allows us to calculate unbiased estimates of the gaps, even though we may not be able to identify θ*. This insight provides another reason why Eq. (1) is the correct optimization objective.

### 4.1. Deriving the Elimination Algorithm

At each phase of the algorithm, we have an action set S and an error tolerance γ, and we wish to find a set of actions R to remove from S that can be guaranteed to be suboptimal. The crux is that we can only calculate the empirical gaps Δ̂(x, x′) := ⟨θ̂, e(x) − e(x′)⟩; thus we need to bound the error Δ̂(x, x′) − Δ(x, x′). Suppose that we choose x_[n] according to Lemma 4.2; Lemma 4.1 then guarantees that

⟨θ̂ − θ*, e(x) − e(x′)⟩ ≤ √( 2σ² ‖e(x) − e(x′)‖²_{V_n^−} log(1/δ) )

holds with high probability for all x, x′ ∈ S. This means that it suffices to take n = ⌈ (4σ² Σ_k |S_k| / γ²) log(L/δ) ⌉ if we want to ensure that ⟨θ̂ − θ*, e(x) − e(x′)⟩ ≤ γ.

The usual choice in the bandit literature is R = {x ∈ S : ∃ x′ ∈ S s.t. Δ̂(x′, x) ≥ γ}. Letting x* be the optimal action and x′ the action certifying that x ∈ R, we see that

⟨θ*, e(x*) − e(x)⟩ ≥ ⟨θ*, e(x′) − e(x)⟩ ≥ ⟨θ̂, e(x′) − e(x)⟩ − γ ≥ 0

for all x ∈ R. Using the inequality in the other direction, any x ∈ S \ R must have

⟨θ*, e(x*) − e(x)⟩ ≤ ⟨θ* − θ̂, e(x*) − e(x)⟩ + ⟨θ̂, e(x*) − e(x)⟩ ≤ 2γ.

This rejection procedure is guaranteed, with probability at least 1 − δ, to eliminate every action that is more than 2γ-suboptimal and to never eliminate the optimal action.

The reader may notice that S \ R is not necessarily marginal with this choice of R, even if S is marginal. Instead, we want an R such that 1) S \ R is marginal, and 2) R is as large as possible. Such a marginal-preserving rejection rule is necessary for the tractability of Eq. (1).
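The following sketch puts Lemma 4.2 and Lemma 4.1 together as one data-collection phase: play a sequence that is uniform in every marginal, regress with the pseudoinverse of the covariance, and compute the confidence half-width for a gap direction. Here `embed` is the one-hot map sketched in Section 3 and `sample_y` is a hypothetical environment sampler; all names are ours rather than the released code's.

```python
import numpy as np

def collect_phase(S, M, n, embed, sample_y, rng):
    """One sketch of a data-collection phase.

    S: list of surviving value lists per variable; M: full support sizes;
    embed(x, M): one-hot embedding; sample_y(x): draws Y | do(X[K] = x).
    """
    K = len(S)
    acts = np.empty((n, K), dtype=int)
    for k, Sk in enumerate(S):
        # round-robin over the surviving values: counts differ by at most 1
        col = np.tile(np.array(Sk), -(-n // len(Sk)))[:n]
        rng.shuffle(col)                 # order within a variable is irrelevant
        acts[:, k] = col
    X = np.stack([embed(tuple(a), M) for a in acts])   # (n, d) design matrix
    y = np.array([sample_y(tuple(a)) for a in acts])   # full-bandit feedback
    V_pinv = np.linalg.pinv(X.T @ X)                   # V_n^-  (rank-deficient V_n)
    theta_hat = V_pinv @ (X.T @ y)                     # OLS estimate
    return theta_hat, V_pinv

def gap_width(V_pinv, z, sigma2, delta):
    # Lemma 4.1-style half-width for <theta_hat, z> with z = e(x) - e(x');
    # Lemma 4.2 bounds z @ V_pinv @ z for the uniform-in-marginal design above.
    return np.sqrt(2.0 * sigma2 * (z @ V_pinv @ z) * np.log(1.0 / delta))
```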
Since we require S := S \ R to be marginal, we can define Rk := Sk\S k to be the values removed from Xk s marginal. How must we constrain Rk so that, for every x R, there must be some x S with ˆ (x, x ) γ? We use the fact that gaps decompose by variables: if we define ˆ i,j k := ˆθi k ˆθj k and ˆ j k := maxi ˆθi k ˆθj k, then the gap between x and x decomposes as ˆ (x, x ) = P k [K] ˆ xk,x k k . Taking x = x , we have that ˆ (x, x ) = P k ˆ xk k . Thus, we may only include i Rk if, for all x S with xk = i, we can guaran- Algorithm 1 MODL Input: δ > 0, ϵ > 0, X[K], σ2, B Optional: upper bound PY on PY Sk(1) [Mk] for k = 1, . . . , K L log 2BK for ℓ= 1, . . . , L do γ(ℓ) ϵ K 2L ℓ+1 k Sk(ℓ)| γ(ℓ)2 log L Choose x1, . . . , xn supp(X[K]) using Lemma 4.2 Collect yt P(Y | do(X[K] = xt)) t n Update Vn, ˆθ(ℓ) = V n P Calculate empirical gaps ˆ j k for k = 1, . . . , K do Sk(ℓ+ 1) n j Sk(ℓ) : ˆ j k < γ(ℓ) o k 1 {|Sk(ℓ)| = 1} if ˆPY (ℓ+ 1) = K or ˆPY (ℓ) PY then Break end if end for Return arg maxx S ˆθ(ℓ), e(x) and ˆθ(ℓ) tee that ˆ (x, x ) = P k ˆ xk k γ. Using x (i) to denote x with the kth value set to i, it is easy to check that ˆ (x (i), x ) = ˆ i k, which implies that we can only include i in Rk if ˆ i k γ. 4.2. The MODL Algorithm With the optimal design and rejection procedures derived, we can present the marginal optimal design linear bandit (MODL) algorithm and its sample complexity bound. MODL proceeds in phases ℓ= 1, . . . , L, and in each phase it solves an XY-optimal design problem, using the results of Lemma 4.2 with error γ = ϵ2L ℓ, ensuring that γ = ϵ 2 by the time the algorithm terminates. The algorithm uses the rejection rule of Section 4.1 to maintain a marginal action set S. We also consider the case when PY is provided, which allows termination once PY variables have their optimal value identified. The intuition is that θ1 k, . . . , θMk k are approximately equal for all for Xk pa(Y ), so the algorithm is not able to limit Sk to a single value. Pseudocode is provided in Algorithm 1. We remark that restricting to marginal action sets does not eliminate any action. Instead, this restriction potentially reduces the number of actions that can be eliminated by requiring that the set of remaining actions be expanded to the smallest marginal action set containing it. In essence, marginal action sets allow us to trade-off some statistical efficiency for computational tractability. Additive Causal Bandits with Unknown Graph samples needed 10 variables 30 variables 10 variables PY unknown 30 variables PY unknown 0 20 5 10 0 20 P1 MODL Oracle number of parents Figure 1. Sample complexity and average gaps versus number of parents of Y . We analyzed the expected sample complexity of the algorithm and present an upper bound in Theorem 4.4. We find the typical sum-of-reciprocal-squared-gaps dependence common to best-arm-identification problems, O P i,k( i k ϵ) 2 , however, instead of a sum over the combinatorial action set, the sum is over the all gaps for individual variables, which is the sample complexity one would expect if each variable could be played independently. In other words, despite only having full-bandit feedback, we obtain the sample complexity as if we had semi-bandit feedback. 
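For concreteness, the per-variable elimination step used inside Algorithm 1 (remove value j from S_k whenever Δ̂^j_k ≥ γ) can be sketched as follows; the estimates below are made-up numbers and the helper name is ours.

```python
import numpy as np

def prune_marginals(theta_hat_blocks, S, gamma):
    """theta_hat_blocks: one array of per-value estimates per variable
    (1-indexed values); S: surviving value lists; gamma: tolerance."""
    new_S = []
    for k, Sk in enumerate(S):
        est = {j: theta_hat_blocks[k][j - 1] for j in Sk}
        best = max(est.values())
        # keep value j only if its empirical gap to the best value is below gamma
        new_S.append([j for j in Sk if best - est[j] < gamma])
    return new_S

theta_blocks = [np.array([0.1, 0.8, 0.3]), np.array([0.0, 0.05])]
print(prune_marginals(theta_blocks, [[1, 2, 3], [1, 2]], gamma=0.4))
# -> [[2], [1, 2]]: variable 1 collapses to a single value,
#    variable 2 looks like a non-parent and keeps all of its values
```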
A substantial part of the complexity comes from the Σ_{k ∉ pa(Y)} M_k (ϵ/K)^{−2} term, which arises because the non-parents are the most difficult: to differentiate between the case when a variable is a non-parent and the case when it has a single value that is ϵ/K better than the rest, all values must be estimated to within ϵ/K. For the known-P_Y case, the (Δ^i_k ∨ (ϵ/K))^{−2} term is replaced by (Δ^i_k ∨ Δ_min ∨ (ϵ/K))^{−2}, which could be substantially smaller if Δ_min ≫ ϵ/K. We see this reduction because we no longer need to identify the non-parents, but rather can terminate once the minimum gap among the parents is found.

Theorem 4.4. Algorithm 1 is (ϵ, δ)-PAC. The expected sample complexity has an upper bound of

O( σ² log( log(BK/ϵ)/δ ) Σ_k Σ_{i ∈ [M_k]} ( Δ^i_k ∨ Δ_min ∨ (ϵ/K) )^{−2} ),

where Δ_min = min_{k ≤ P_Y} min_{i ∈ [M_k]} Δ^i_k is the minimum gap in the parents in the case when the number of parents P_Y is provided, and Δ_min = 0 otherwise.

Due mostly to the additive assumption, the sample complexity contains Σ_k M_k terms of order O((Δ ∨ ϵ)^{−2}), which is the same order as the sample complexity of running a separate bandit algorithm for each variable, despite only observing the sum of rewards. In contrast, a naïve approach which, ignoring the structure, simply uses a best-arm-identification algorithm over the combinatorial action set would have ∏_k M_k terms in the complexity bound, of order

Σ_{j_1=1}^{M_1} ··· Σ_{j_K=1}^{M_K} ( (Σ_k Δ^{j_k}_k) ∨ ϵ )^{−2}.

Recovering the Parents of Y. With simple assumptions on the f_k, we may recover a good estimate of pa(Y) upon termination of the algorithm. Using the parameter estimates returned by the algorithm, we define p̂a(Y) to be all the nodes X_k for which, for some phase ℓ, there exist i, j ∈ S_k(ℓ) with |θ̂^i_k(ℓ) − θ̂^j_k(ℓ)| > 2γ(ℓ). This formula follows the intuition that, for a non-parent k, all θ^j_k are identically equal and thus the θ̂^j_k(ℓ) should be within the error tolerance γ(ℓ) of each other. We can show that this method works with high probability, provided an identifiability condition holds. Without any identifiability assumptions, no algorithm can be guaranteed to recover the parents.

Theorem 4.5. Assume that there is some ϵ_min > 0 such that, for all k ≤ P_Y, there exist i, i′ ∈ [M_k] with |f_k(i) − f_k(i′)| ≥ ϵ_min. Let x̂ and θ̂ be the output of Algorithm 1 run with ϵ ≤ ϵ_min and δ > 0. Then p̂a(Y) as defined above satisfies P(p̂a(Y) = pa(Y)) ≥ 1 − δ. Furthermore, the intervention x̂_{p̂a(Y)} := {X_i = x̂_i : X_i ∈ p̂a(Y)}, which is x̂ limited to p̂a(Y), is (ϵ, δ)-PAC.

## 5. Experiments

This section presents an empirical evaluation of the MODL algorithm on a collection of randomly generated causal additive models.¹ Additional experiments studying the effect of graph structure and the sensitivity to violations of the additive outcome assumption can be found in Appendix B.

¹Code has been released at https://github.com/deepmind/additive_cbug.

Baselines. Since we are the first to consider the general setting of unknown G without assumptions on its structure, it is difficult to compare MODL to other algorithms in the causal bandit literature. The closest algorithms are those of De Kroon et al. (2022) and Bilodeau et al. (2022) with the separating set taken to be all intervenable random variables. In this setting, these algorithms reduce to a multi-armed bandit on the full, product action space. As the number of actions is exponential in the number of variables, we were only able to include this baseline for experiments with few variables. Since these algorithms were designed for the cumulative regret setting, we have implemented a version using Successive Elimination (SE) (Even-Dar et al., 2006).
We also compare MODL to (i) a parents-first (P1) method, which first performs hypothesis testing to find an approximate parent set p̂a(Y) and then runs Algorithm 1 with X[K] = p̂a(Y) (i.e. considering p̂a(Y) as the intervention set), and (ii) an oracle method, which runs Algorithm 1 with X[K] = pa(Y). Fact 1.2 guarantees that intervening on pa(Y) alone is sufficient to solve the problem; therefore, the difference in performance between MODL and the oracle method quantifies the value of knowing the parents. Comparing MODL with the P1 method answers whether spending samples to explicitly learn the parents is efficient.

For learning pa(Y), we were not able to find any suitable algorithm in the literature that exploits the ability to intervene on all of X[K]. Thus, we designed our own algorithm for finding an approximate parent set p̂a(Y) using global interventions. Let x⁰ ∈ supp(X[K]) be some fixed intervention; for each k in some random order, we enumerate j ∈ [M_k] and test the null hypothesis E[Y | do(X[K] = x⁰)] = E[Y | do(X_k = j, X[K] \ {X_k} = x⁰_{−k})]; X_k is added to p̂a(Y) only if we find a j for which the null hypothesis is rejected. We terminate early if p̂a(Y) is large enough to meet a bound on P_Y. Provided that, for all X_k ∈ pa(Y), there exist values i, j with |f_k(i) − f_k(j)| ≥ ϵ, this algorithm is guaranteed to find pa(Y) with high probability. Pseudocode and a complexity bound are provided in Appendix A.

Experimental Set-up. We performed the evaluation on randomly sampled structural causal models (SCMs) generated as follows. The causal graph, excluding Y, was a directed acyclic graph sampled from the Erdős-Rényi model with degree 3 and K − 1 variables. We randomly chose a set of variables of size P_Y as the parents of Y; then each variable topologically greater than Y was independently set as a child of Y with probability 0.5. To sample the conditional probability distributions, we chose M_k uniformly between specified upper and lower bounds and generated the conditional probability distribution for each X_k by sampling p(X_k = j | pa(X_k) = x) ∼ Beta(2, 5) independently for all j and x. Finally, we generated f_k(j) = B·W^j_k, with B = 5 and W^j_k sampled i.i.d. from Beta(2, 5), and set η to a standard normal variable. If X_j had Y as a parent, we used the same construction but with Y rounded to an integer.

Results. Using ϵ = 1/2 and δ = 0.1 for our (ϵ, δ)-PAC criterion, we considered a variety of settings of P_Y, K, and upper and lower bounds for M_k. Each point in all graphs corresponds to the average over 20 different SCMs sampled using the process described above and 50 independent runs of the methods on independently generated data.

Figure 2. Sample complexities including SE (4 and 6 variables, P_Y unknown, versus number of parents).

In Figure 2, the sample complexity and the average gaps are plotted for the Successive Elimination baseline (in red) as well as the MODL, parents-first, and oracle methods. The sample complexities of the SE baseline are almost too large to be comparable (roughly 200 times those of the other methods, which all appear comparatively as zero), and SE does not scale to more than a few variables. The sample complexity decreases with the number of parents, as a greater portion of arms are able to be eliminated. Figure 1 plots the same quantities, without SE, for more interesting numbers of variables in four different combinations of K ∈ {10, 30} and known/unknown P_Y (the lower and upper bounds for M_k are 3 and 6).
As predicted by Theorem 4.4, the sample complexity decreases with P_Y: many samples are required to distinguish between a non-parent and a potential parent with gaps close to ϵ, so the complexity increases with the number of non-parents. Overall, the performance of MODL is much closer to the performance of the oracle method than to that of the P1 method. We see that the performances coincide when P_Y = K, since MODL and the oracle method become the same algorithm. We also note that the P1 method does not benefit from P_Y = K. The P1 method also has consistently higher gaps (since, on occasion, it fails to identify a parent, which causes a large error), but all gaps are well within the desired error tolerance of ϵ = 1/2. See the appendix for more figures: e.g. Figure 3 plots the sample complexity for K = 30 versus the support sizes M_k.

To summarize, we found that, across all the settings we investigated, MODL was substantially better than the P1 method, and its performance (in terms of the gap of the final action and the sample complexity) was closer to the performance of the oracle method than to the performance of the P1 method. Hence, we conclude that the penalty of not knowing the parents is relatively small and much smaller than the cost of learning the parents first.

## 6. Discussion

In this paper, we have proposed an approach to solving the causal bandit problem under the general setting of an unknown causal graph (CBUG). Using the key insight that having no latent confounding between Y and any of its ancestors implies that a global intervention is equivalent to an intervention on the optimal set pa(Y), we showed that an additive outcome assumption allows us to solve the CBUG problem as a combinatorial linear bandit. Limiting our algorithm to marginal action sets alleviated the computational burden by providing an easy approximation to the optimal design problem and a factorization of the action set.

Two immediate directions for improving our algorithm and analysis concern the quality of approximation in the rejection procedure. A rejection procedure that outputs a marginal action set must reject fewer points than an unconstrained procedure. Are there principled ways of interpolating between marginal and full action sets, perhaps using unions of marginal sets, which would let us trade off computation against a larger number of rejected actions? The bound presented in this paper decomposes by variable, but a more nuanced rejection procedure and analysis should involve how the gaps between variables relate. Other possible extensions include: (i) relaxing the additive outcome assumption, for example by adding interaction terms, (ii) investigating what assumptions allow for algorithms that adapt to sparsity, and (iii) considering the linear and continuous case. More generally, how can our method generalize to the case of latent confounders between Y and its ancestors?

Finally, we contrast our setting with the most common setting of causal inference, where the goal is to learn the causal graph with as few experiments as possible. This setting is typical in the infinite-data case, so the problem difficulty is measured in the number of experiments. As we are in the finite-sample case, the number of statistical units (i.e. samples or rounds) is the most important quantity. With this distinction in mind, intervening on many variables is justified so long as we can reduce the sample complexity.
However, budgeted versions of the causal bandit problem have been considered (Nair et al., 2021) and are another interesting direction.

## Acknowledgements

We wish to thank Eleni Sgouritsa for help in designing the code and technical support, and Yasin Abbasi-Yadkori for technical feedback and many fruitful discussions.

## References

Allen-Zhu, Z., Li, Y., Singh, A., and Wang, Y. Near-optimal discrete optimization for experimental design: A regret minimization approach. Mathematical Programming, 186:439–478, 2021.

Bilodeau, B., Wang, L., and Roy, D. M. Adaptively exploiting d-separators with causal bandits. In Advances in Neural Information Processing Systems, 2022.

Bühlmann, P., Peters, J., and Ernest, J. CAM: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6):2526–2556, 2014.

Chen, S., Lin, T., King, I., Lyu, M. R., and Chen, W. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, volume 27, 2014.

Constantinou, P. and Dawid, A. P. Extended conditional independence and applications in causal inference. Annals of Statistics, 45(6):2618–2653, 2017.

De Kroon, A., Mooij, J., and Belgrave, D. Causal bandits without prior knowledge using separating sets. In Conference on Causal Learning and Reasoning, pp. 407–427. PMLR, 2022.

Du, S. S., Kakade, S. M., Wang, R., and Yang, L. F. Is a good representation sufficient for sample efficient reinforcement learning? In International Conference on Learning Representations, 2020.

Du, Y., Kuroki, Y., and Chen, W. Combinatorial pure exploration with full-bandit or partial linear feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 7262–7270, 2021.

Even-Dar, E., Mannor, S., Mansour, Y., and Mahadevan, S. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(6):1079–1105, 2006.

Fiez, T., Jain, L., Jamieson, K. G., and Ratliff, L. Sequential experimental design for transductive linear bandits. In Advances in Neural Information Processing Systems, volume 32, 2019.

Gabillon, V., Ghavamzadeh, M., Lazaric, A., and Bubeck, S. Multi-bandit best arm identification. In Advances in Neural Information Processing Systems, volume 24, 2011.

Hastie, T. J. Generalized additive models. In Statistical Models in S, pp. 249–307. Routledge, 2017.

Imbens, G. W. and Rubin, D. B. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

Lattimore, F., Lattimore, T., and Reid, M. D. Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems, pp. 1181–1189, 2016.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.

Lee, S. and Bareinboim, E. Structural causal bandits: Where to intervene? In Advances in Neural Information Processing Systems, pp. 2568–2578, 2018.

Lu, Y., Meisami, A., Tewari, A., and Yan, W. Regret analysis of bandit problems with causal background knowledge. In Conference on Uncertainty in Artificial Intelligence, pp. 141–150. PMLR, 2020.

Lu, Y., Meisami, A., and Tewari, A. Causal bandits with unknown graph structure. In Advances in Neural Information Processing Systems, volume 34, pp. 24817–24828, 2021.

Maeda, T. N. and Shimizu, S. Causal additive models with unobserved variables. In Uncertainty in Artificial Intelligence, pp. 97–106, 2021.
Maiti, A., Nair, V., and Sinha, G. A causal bandit approach to learning good atomic interventions in presence of unobserved confounders. In Uncertainty in Artificial Intelligence, pp. 1328–1338. PMLR, 2022.

Nair, V., Patil, V., and Sinha, G. Budgeted and non-budgeted causal bandits. In International Conference on Artificial Intelligence and Statistics, pp. 2017–2025. PMLR, 2021.

Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.

Soare, M., Lazaric, A., and Munos, R. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, volume 27, 2014.

Tao, C., Blanco, S., and Zhou, Y. Best arm identification in linear bandits with linear dimension dependency. In International Conference on Machine Learning, pp. 4877–4886, 2018.

Xiong, N. and Chen, W. Combinatorial pure exploration of causal bandits. In The Eleventh International Conference on Learning Representations, 2023.

Xu, L., Honda, J., and Sugiyama, M. A fully adaptive algorithm for pure exploration in linear bandits. In International Conference on Artificial Intelligence and Statistics, pp. 843–851, 2018.

## A. A simple test for the parents of Y

In the experiments, we compared MODL with an algorithm that first learns the parents of Y and then runs a bandit algorithm on the remaining variables. The algorithm is intuitively simple: we construct confidence intervals of half-width ϵ/2 for E[Y | do(X_k = j, X_{−k} = x⁰_{−k})] for all values of j, and if the intersection of the confidence intervals is empty, then one of the values of X_k is statistically significantly different from the others. Because we have intervened to fix all the other variables, this difference must be because X_k is a parent of Y. Formally, we have the following lemma.

Lemma A.1. Let x⁰ ∈ supp(X[K]) be fixed, and let x⁰_{−k} be x⁰ with the kth variable's value removed. For δ ∈ (0, 1) and ϵ > 0, assume that Y^1_k, ..., Y^{n_k}_k ∼ p(Y | do(X_k = j, X_{−k} = x⁰_{−k})) with

n_k ≥ ⌈ (8σ²/ϵ²) log(2K|supp(X_k)|/δ) ⌉.

Then, for all 1 ≤ k ≤ K and j ∈ supp(X_k) simultaneously,

| (1/n_k) Σ_{i=1}^{n_k} Y^i_k − E[Y | do(X_k = j, X_{−k} = x⁰_{−k})] | ≤ ϵ/2

with probability at least 1 − δ; that is, by the union bound, all the confidence intervals are simultaneously correct with probability at least 1 − δ.

Proof. With the n_k defined in the lemma, we can verify that

√( (2σ²/n_k) log(2K|supp(X_k)|/δ) ) ≤ ϵ/2.

Thus, Lemma 4.1 implies that

P( | (1/n_k) Σ_{i=1}^{n_k} Y^i_k − E[Y | do(X_k = j, X_{−k} = x⁰_{−k})] | ≤ ϵ/2 ) ≥ 1 − δ/(2K|supp(X_k)|).

Finally, by the union bound, all the confidence intervals are simultaneously correct with probability at least 1 − δ, as claimed.

Under the event that the bounds are all correct, X_k is a parent of Y if there exist two intervals that do not overlap; in this case, the means of the two interventions must be different with high probability. See Algorithm 2 for pseudocode. In addition to having few false positives, we can argue that the algorithm has few false negatives, as presented in the following lemma.

Algorithm 2: Finding pa(Y)
Given: ϵ > 0, δ ∈ (0, 1), σ², x⁰ ∈ supp(X[K])
  p̂a(Y) ← ∅
  for k = 1, ..., K do
    C_k ← R
    n ← ⌈ (8σ²/ϵ²) log(2K|supp(X_k)|/δ) ⌉
    for j = 1, ..., M_k do
      Collect y_1, ..., y_n ∼ p(Y | do(X_k = j, X_{−k} = x⁰_{−k}))
      C_k ← C_k ∩ [ (1/n) Σ_{i=1}^n y_i − ϵ/2, (1/n) Σ_{i=1}^n y_i + ϵ/2 ]
      if C_k = ∅ then
        p̂a(Y) ← p̂a(Y) ∪ {X_k}
        Skip the rest of the tests for X_k
      end if
    end for
  end for
  Return p̂a(Y)

Lemma A.2. Assume that, for all X_k ∈ pa(Y), there exist i, j ∈ supp(X_k) such that |f_k(i) − f_k(j)| ≥ ϵ. Then, with probability 1 − δ, Algorithm 2 correctly recovers the parents.
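A compact sketch of the test that Algorithm 2 performs for a single variable, assuming a hypothetical sampler `sample_y(k, j, n)` for p(Y | do(X_k = j, X_{−k} = x⁰_{−k})) is available (this helper and all names are ours, not part of the released code):

```python
import numpy as np

def is_parent(sample_y, k, M_k, K, eps, delta, sigma2):
    """Return True if some value of X_k statistically shifts E[Y]."""
    n = int(np.ceil(8 * sigma2 / eps**2 * np.log(2 * K * M_k / delta)))
    lo, hi = -np.inf, np.inf              # running intersection of intervals
    for j in range(1, M_k + 1):
        mean = sample_y(k, j, n).mean()
        lo, hi = max(lo, mean - eps / 2), min(hi, mean + eps / 2)
        if lo > hi:                       # intervals are disjoint:
            return True                   # some value of X_k changes E[Y]
    return False
```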
Figure 3. Sample complexity versus lower bound on M_k (10 and 30 parents, P_Y known and unknown; methods: P1, MODL, Oracle).

Figure 4. Sample complexity and average gap versus average graph degree, with 30 variables and 10 parents.

## B. Additional Experiments

This section holds additional experiments under the same set-up as in Section 5. Figure 3 plots the sample complexity for K = 30 versus the support sizes M_k, where the x-axis specifies the lower bound on M_k (the upper bound is 3 larger). As above, the performance of MODL is much closer to the performance of the oracle method than to the performance of the P1 method, and is almost identical for P_Y = K = 30. Figure 4 plots the sample complexity as the degree of the sampled graph changes, when P_Y = 10 and K = 30. Unsurprisingly, the degree has very little effect on the sample complexity, confirming our intuition from causal inference that intervening on all parents renders the rest of the causal graph unimportant.

Figure 5 confirms our suspicion that the linear bandit is very sensitive to model misspecification. We generated non-linear data by using the outcome model

Y = Σ_{k=1}^{P_Y} f_k(X_k) + α B M_max^{−4} ( X_{i_1} X_{i_2} X_{i_3} X_{i_4} + X_{j_1} X_{j_2} X_{j_3} + X_{k_1} X_{k_2} )

for indices i_1, i_2, i_3, i_4, j_1, j_2, j_3, and k_1, k_2 randomly chosen (without replacement) from [P_Y], where M_max is the upper bound on the support size (6 in this case). This model was chosen to resemble the effect of adding interaction terms that are the product of several variables. The leading coefficient is chosen to keep the maximum interaction term roughly αB, so choosing α ∈ [0, 1] keeps the scale of the interaction terms roughly equivalent to the additive terms. Despite this scaling, we still find that the performance is very sensitive to model mismatch, as illustrated in Figure 5. We also note that the parents-first approach is much more sensitive to model misspecification, at least in the average gap, because the misspecification increases the probability of the parents being misspecified, which in turn causes a large error. MODL and the oracle algorithm are more immune to this effect.

Figure 5. Sample complexity and average gap versus model misspecification. The x-axis details the coefficient of a multiplicative nonlinear term in the expected response function.

## C. Proofs

Proof of Theorem 2.1. This lower bound can be shown by invoking Theorem 33.5 from Lattimore & Szepesvári (2020). For any x ∈ supp(X[K]), we define q(x) to be the values of x on pa(Y); in particular, the rewards of any two actions x and x′ are equal if q(x) = q(x′). We invoke the theorem for a bandit with |supp(X[K])| arms, one for each action, and with a set E of bandit environments indexed by x_pa ∈ supp(pa(Y)); for the bandit ν(x_pa) ∈ E corresponding to x_pa, we set Y | do(X = x) ∼ Bernoulli(1/2 + ϵ·1{q(x) = x_pa}). We can explicitly calculate E_alt(ν(x_pa)) = {ν(x′_pa) : x′_pa ≠ x_pa}, which is the set of bandit environments with a different optimal arm than ν's. We also use ν_x to indicate the reward distribution of action x under bandit ν.
Then, examining the terms in the theorem, we need to calculate

c(ν) = max_{α ∈ Δ(supp(X[K]))} min_{ν′ ∈ E_alt(ν)} Σ_{x ∈ supp(X[K])} α_x D(ν_x, ν′_x),

where Δ(supp(X[K])) denotes the set of distributions over supp(X[K]) and D(·, ·) is the relative entropy. If ν′ ∈ E_alt(ν), then the x′_pa corresponding to ν′ is different from the x_pa corresponding to ν. Fixing one such x′_pa, we see that D(ν_x, ν′_x) is zero for all x whose q(x) does not distinguish the two environments. We also use the fact that D(ν_x, ν′_x) = O(ϵ²) when ϵ is small. Hence

min_{ν′ ∈ E_alt(ν)} Σ_{x ∈ supp(X[K])} α_x D(ν_x, ν′_x) = min_{x′_pa ≠ x_pa ∈ supp(pa(Y))} O(ϵ²) α_{x′_pa},

where x_pa corresponds to ν and α_{x′_pa} denotes the mass α places on actions x with q(x) = x′_pa. Taking the max over α, we see that any maximizing α must spread mass evenly across all x′_pa ∈ supp(pa(Y)), which leads to c(ν) = O(ϵ²/|supp(pa(Y))|). Combining these calculations with the theorem, we find that

E[τ] ≥ Ω( |supp(pa(Y))| ϵ^{−2} log(1/δ) ),

where E[τ] is the expected stopping time of any sound (i.e. (ϵ, δ)-PAC) algorithm, as claimed. We could also follow the techniques from the lower bound of Du et al. (2020) and reduce the index-query problem to the CBUG problem.

Proof of Lemma 4.2. Let x_1, ..., x_n be a sequence of actions. To calculate V_n, we write

V_n = Σ_{i=1}^n (e_1(x^i_1), ..., e_K(x^i_K)) (e_1(x^i_1), ..., e_K(x^i_K))^⊤ = Σ_{i=1}^n Σ_{j, j′} e_j(x^i_j) e_{j′}(x^i_{j′})^⊤ = D_n + C_n,

where D_n is the matrix of on-diagonal blocks and C_n contains all the off-diagonal terms; both are positive semi-definite. Using the fact that, if A and B are PSD matrices, then x^⊤(A + B)^− x ≤ x^⊤ A^− x, we can upper bound ‖e(x) − e(x′)‖²_{V_n^−} by ‖e(x) − e(x′)‖²_{D_n^−}. Letting N^j_k = Σ_{t≤n} 1{x^t_k = j} be the number of rounds in which X_k took on the jth value, we can show that

D_n = Σ_{t≤n} diag( e_1(x^t_1), ..., e_K(x^t_K) ) = diag( N^1_1, N^2_1, ..., N^{M_1}_1, N^1_2, ..., N^{M_K}_K );

D_n is the diagonal matrix of the counts of the values. We can upper bound the optimal design problem over x_[n] with another problem over counts of values that appear in S and add to n. Formally, this set is

N(S, n) := { (N^1_1, N^2_1, ..., N^{M_K}_K) : ∀k, N^j_k = 0 if j ∉ S_k and Σ_j N^j_k = n }.

We can easily check that, for any x, x′ ∈ S,

‖e(x) − e(x′)‖²_{D_n^−} = (e(x) − e(x′))^⊤ D_n^− (e(x) − e(x′)) = Σ_k ( 1/N^{x_k}_k + 1/N^{x′_k}_k ) 1{x_k ≠ x′_k}.

Thus, the XY-optimal design has an upper bound

x_{XY}(S, n) = argmin_{{N^j_k} ∈ N(S,n)} max_{x, x′ ∈ S} ‖e(x) − e(x′)‖²_{D_n^−} = argmin_{{N^j_k} ∈ N(S,n)} Σ_k max_{j, j′ ∈ S_k} ( 1/N^j_k + 1/N^{j′}_k ) = argmin_{{N^j_k} ∈ N(S,n)} Σ_k 2 / min_{j ∈ S_k} N^j_k.

We can solve the problem in closed form: for every variable k, allocate the N^j_k evenly among all j ∈ S_k. Because |S_k| may not divide n, we may have rounding errors and can only guarantee that N^j_k ∈ [⌊n/|S_k|⌋, ⌈n/|S_k|⌉], which results in an objective value of Σ_k 2/⌊n/|S_k|⌋ ≤ Σ_k 2|S_k|/(n − |S_k|).

Theorem 4.4. Algorithm 1 is (ϵ, δ)-PAC. The expected sample complexity is

O( σ² log( log(BK/ϵ)/δ ) [ Σ_{k ∈ pa(Y)} Σ_{i ∈ [M_k]} ( Δ^i_k ∨ (ϵ/K) )^{−2} + Σ_{k ∉ pa(Y)} M_k (ϵ/K)^{−2} ] ).

If the number of parents P_Y is provided, the complexity is instead

O( σ² log( log(BK/ϵ)/δ ) Σ_k Σ_{i ∈ [M_k]} ( Δ^i_k ∨ Δ_min ∨ (ϵ/K) )^{−2} ),

where Δ_min = min_{k ≤ P_Y} min_{i ∈ [M_k]} Δ^i_k is the minimum gap in the parents.

Proof of Theorem 4.4. Recall that MODL alternates between two stages, data collection and action elimination, and uses an exponentially decreasing error tolerance. Let S_k(ℓ), γ(ℓ), and n_ℓ be the corresponding values during phase ℓ. It is easy to verify that γ(ℓ) ≥ B/2 for ℓ = 1 and γ(ℓ) = ϵ/(2K) for ℓ = L. We will show that, with the chosen n_ℓ,

P( ⟨θ̂ − θ*, e(x) − e(x′)⟩ ≤ γ(ℓ) ∀ x, x′ ∈ S(ℓ), ℓ ∈ [L] ) ≥ 1 − δ.

That is, with high probability, all the confidence intervals used by the algorithm are correct. Let S_k(ℓ) be the kth marginal of S during phase ℓ of the algorithm.
By choosing

n_ℓ = ⌈ (4σ² Σ_k |S_k(ℓ)| / γ(ℓ)²) log(L/δ) ⌉,

Lemma 4.2 guarantees that max_{x, x′ ∈ S(ℓ)} ‖e(x) − e(x′)‖²_{V_n^−} ≤ Σ_k 2|S_k(ℓ)|/(n_ℓ − |S_k(ℓ)|), and Lemma 4.1 provides

⟨θ̂ − θ*, e(x) − e(x′)⟩ ≤ √( 2σ² ‖e(x) − e(x′)‖²_{V_n^−} log(L/δ) ) ≤ √( 4σ² Σ_k |S_k(ℓ)| log(L/δ) / n_ℓ ) ≤ γ(ℓ)

with probability at least 1 − δ/L for all x, x′ ∈ S(ℓ) simultaneously; a union bound over the L phases gives the claim. At each stage ℓ, the elimination algorithm only eliminates an x for which there exists an x′ with ⟨θ̂, e(x′) − e(x)⟩ ≥ γ(ℓ). If such an x′ exists, then we can conclude that

⟨θ*, e(x*) − e(x)⟩ ≥ ⟨θ*, e(x′) − e(x)⟩ ≥ ⟨θ̂, e(x′) − e(x)⟩ − γ(ℓ) ≥ 0.

Hence, we have shown that, for all stages ℓ, the algorithm never eliminates an optimal action.

The last step in checking correctness is to show that the returned action is at most ϵ-suboptimal. In round ℓ = L, Lemma 4.1 guarantees that ⟨θ̂ − θ*, e(x) − e(x′)⟩ ≤ γ(L) for all x, x′ in S(L). Applying this to the action x̂ returned by the algorithm and using the fact that, under the event that the confidence intervals are correct, x* ∈ S(L), we have

⟨θ*, e(x*) − e(x̂)⟩ ≤ ⟨θ* − θ̂, e(x*) − e(x̂)⟩ + ⟨θ̂, e(x*) − e(x̂)⟩ ≤ ϵ,

where the last step used the fact that each variable's error was controlled to ϵ/(2K) and that x̂ maximizes ⟨θ̂(L), e(x)⟩ over S(L). The total sample complexity is

Σ_{ℓ=1}^L Σ_{k=1}^K |S_k(ℓ)| (4σ²/γ(ℓ)²) log(L/δ).

We now look to bound Σ_{ℓ=1}^L Σ_{k=1}^K |S_k(ℓ)|/γ(ℓ)² in terms of instance-dependent quantities. With no bound on P_Y, the complexity decomposes over the values of each variable. Let G = {γ_1 > ... > γ_L = ϵ/(2K)} be the set of tolerances used by the algorithm. Setting c = 4σ² log(L/δ) and using the fact that the Δ̂^i_k are all unbiased, value i of variable k contributes c/γ² for every phase with tolerance γ ∈ G in which it has not yet been eliminated, i.e. for every γ ≥ Δ^i_k ∨ (ϵ/K), so its expected contribution is at most

Σ_{γ ∈ G : γ ≥ Δ^i_k ∨ (ϵ/K)} c/γ² ≤ 4c / ( 3 (Δ^i_k ∨ (ϵ/K))² ),

where we summed the geometric series. Summing over all values of all variables gives

Σ_{k ∈ pa(Y)} Σ_{i ∈ [M_k]} 4c / ( 3 (Δ^i_k ∨ (ϵ/K))² ) + Σ_{k ∉ pa(Y)} M_k · 4cK² / (3ϵ²).

To summarize, the sample complexity is the sum of the sample complexity for the parents and a linear term for the non-parents. Intuitively, it is very difficult to differentiate between a parent whose effect is of size roughly ϵ and a non-parent, so we expect to see these terms in the sample complexity.

Recall that Δ_min = min_{k ∈ pa(Y)} min_{i ∈ [M_k]} Δ^i_k. When the number of parents is given, the algorithm will terminate as soon as γ(ℓ) < Δ_min. Hence, the sample complexity can be decomposed into the samples needed to learn the parents,

Σ_{k ∈ pa(Y)} Σ_{i=1}^{M_k} 4c / ( 3 (Δ^i_k ∨ (ϵ/K))² ),

and the extra samples from all the remaining variables,

Σ_{γ ∈ G : γ ≥ Δ_min ∨ (ϵ/K)} Σ_{k ∉ pa(Y)} M_k c/γ² = Σ_{k ∉ pa(Y)} M_k c Σ_{γ ∈ G : γ ≥ Δ_min ∨ (ϵ/K)} 1/γ² ≤ Σ_{k ∉ pa(Y)} M_k · 4c / ( 3 (Δ_min ∨ (ϵ/K))² ).

Hence, the total sample complexity is

O( c [ Σ_{k ∈ pa(Y)} Σ_{i ∈ [M_k]} ( Δ^i_k ∨ (ϵ/K) )^{−2} + Σ_{k ∉ pa(Y)} M_k ( Δ_min ∨ (ϵ/K) )^{−2} ] ).

Proof of Theorem 4.5. We need to prove two things: first, that all parents are discovered, and second, that no erroneous parents are included. Throughout, we condition on the event that the confidence intervals are all correct, which happens with probability at least 1 − δ, as demonstrated in the proof of Theorem 4.4. Recalling that θ̂^i_k(ℓ) and γ(ℓ) are the respective quantities at phase ℓ of the algorithm, we first argue that non-parents are not added to p̂a(Y). Consider some non-parent k > P_Y. The concentration inequality from Lemma 4.1, applied to actions x, x′ that differ only in that the first corresponds to X_k = i and the second to X_k = j, yields ⟨θ̂ − θ*, e(x) − e(x′)⟩ ≤ γ(ℓ), i.e. θ̂^i_k − θ̂^j_k ≤ θ^i_k − θ^j_k + γ(ℓ). Taken with the fact that θ^i_k − θ^j_k = 0 for all i, j ∈ [M_k], we have |θ̂^i_k(ℓ) − θ̂^j_k(ℓ)| ≤ 2γ(ℓ), so k is not added to p̂a(Y) on the event that the confidence intervals are correct.

To show that all parents are included, we consider two cases. First, assume that the algorithm does not terminate early, so γ(L) = ϵ/(2K). For every k ≤ P_Y, the algorithm must have found the i, i′ satisfying |f_k(i) − f_k(i′)| ≥ ϵ_min ≥ ϵ ≥ 2γ(L), so k is included in p̂a(Y). The second case is when the algorithm terminates early. In this case, there must be P_Y variables with only one value remaining.
These variables must be parents since, with high probability, no non-parents are added. Under the event that p̂a(Y) = pa(Y), Fact 1.2 implies that the expected responses of the actions x̂ and x̂_{p̂a(Y)} are equal, completing the proof.