# Additive Causal Bandits with Unknown Graph

Alan Malek¹, Virginia Aglietti¹, Silvia Chiappa¹

¹DeepMind, London, UK. Correspondence to: Alan Malek. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

We explore algorithms to select actions in the causal bandit setting, where the learner can choose to intervene on a set of random variables related by a causal graph, and the learner sequentially chooses interventions and observes a sample from the interventional distribution. The learner's goal is to quickly find the intervention, among all interventions on observable variables, that maximizes the expectation of an outcome variable. We depart from previous literature by assuming no knowledge of the causal graph except that latent confounders between the outcome and its ancestors are not present. We first show that the unknown-graph problem can be exponentially hard in the number of parents of the outcome. To remedy this, we adopt an additional additive assumption on the outcome which allows us to solve the problem by casting it as an additive combinatorial linear bandit problem with full-bandit feedback. We propose a novel action-elimination algorithm for this setting, show how to apply this algorithm to the causal bandit problem, provide sample complexity bounds, and empirically validate our findings on a suite of randomly generated causal models, effectively showing that one does not need to explicitly learn the parents of the outcome to identify the best intervention.

## 1. Introduction

What setting of our factory production system should we choose to maximize efficiency? Which nutrients would induce a maximal crop yield increase? What combination of drugs and dosages would optimize patient outcomes? All these questions ask which variables and values would optimize the causal effect on an outcome Y. In a system of variables X[K] = {X_1, ..., X_K} and Y, these questions can be phrased as asking which set X ⊆ X[K] and values x ∈ supp(X) would produce an optimal outcome under the intervention do(X = x), whose effect is to alter the observational distribution p(X[K], Y) describing the existing relationships between X[K] and Y by setting X to the fixed value x. Indicating with p(Y | do(X = x)) the distribution of Y under such an intervention, answering these questions is equivalent to solving the optimization problem

max_{X ⊆ X[K], x ∈ supp(X)} E[Y | do(X = x)].

The causal bandit problem, proposed in Lattimore et al. (2016), is an extension of the multi-armed bandit problem to the setting where many variables can be intervened on and a causal graph G is used to describe the causal relationships among X[K] and Y. The learner solves this optimization problem by repeatedly choosing an intervention (X, x), also called an action, and observing a sample from p(Y | do(X = x)). A naïve approach to the problem would be to treat each of the combinatorially many actions as independent and run a typical bandit algorithm. Instead, the causal graph enables us to reason about acting on sub-parts of the system and exploit the causal structure to reduce the action set A. For example, consider a system of variables X_1, X_2, and Y with causal graph X_1 → X_2 → Y. Fact 1.1 below tells us that intervening on pa(Y), the parents of Y (or direct causes, i.e. the variables with an edge into Y), can always produce an expectation as high as the best intervention on any other set.
As pa(Y) = {X_2}, this means that the optimization problem can be simplified to max_{x_2 ∈ supp(X_2)} E[Y | do(X_2 = x_2)].

Causal bandits have been explored under various assumptions on G, A, and the interventional distribution. Lattimore et al. (2016) assumed full knowledge of G and of the distribution of pa(Y) under any intervention in A and proposed an algorithm that selects all actions at the start to minimize a lower bound on sample complexity. Lu et al. (2020) proposed a UCB algorithm for the cumulative regret setting which exploits the observation that one does not need confidence bounds on individual actions but rather on how the actions affect pa(Y). Lu et al. (2021) were the first to consider an unspecified graph, and instead considered the special case of a tree G with a singleton pa(Y), which allowed pa(Y) to be identified via binary search. De Kroon et al. (2022) did not place constraints on the structure of G and instead used causal discovery algorithms to learn a separating set (i.e. a set that d-separates the variables we may intervene on from Y), which thereby allowed them to decompose the problem into learning the effect of the actions on the separating set and learning the distribution of Y conditioned on the separating set. This algorithm can be viewed as a generalization of previous causal bandit algorithms, which effectively use pa(Y) as the separating set. Bilodeau et al. (2022) developed an algorithm that performs optimally when a given set is a separating set but falls back to a normal bandit algorithm otherwise. With the exception of Bilodeau et al. (2022), De Kroon et al. (2022), and Maiti et al. (2022), these works assumed no latent confounders (i.e. latent common causes), and many assumed that the learner could make arbitrary interventions on all variables excluding Y. Xiong & Chen (2023) were the first to consider the PAC (i.e. the fixed-confidence) setting and, similar to this work, provide instance-dependent PAC bounds. They study the known-graph case (potentially with unobserved variables), whereas we focus on the unknown-graph case with the additional assumptions that (i) there are no unobserved confounders between Y and its ancestors, and (ii) we may intervene on all variables in X[K] (which is a common assumption in the causal bandit literature, see Lee & Bareinboim (2018); Lu et al. (2020)). We refer to this problem as the causal bandit with an unknown graph (CBUG) problem.

While the second assumption is common in the literature, the first is a relaxation of the typical no-latent-confounders assumption and natural in several situations. For example, if one has a list that contains the parents of Y (e.g. built from domain experts or a noisy causal discovery algorithm), our assumption is still satisfied for arbitrary joint distributions on these variables (even with latent confounding between them and between other variables). Finally, this confounding assumption implies that the optimal intervention set is pa(Y), as demonstrated in Lee & Bareinboim (2018, Proposition 2).

Fact 1.1. If there are no latent confounders between Y and any of its ancestors, then

max_{X ⊆ X[K], x ∈ supp(X)} E[Y | do(X = x)] ≤ max_{x′ ∈ supp(pa(Y))} E[Y | do(pa(Y) = x′)].

This fact suggests that we can solve the CBUG problem by first learning pa(Y) and then searching for the optimal values in supp(pa(Y)); we refer to this approach as parents-first.
Instead, since a global intervention do(X[K] = x) cuts all incoming edges into X[K] in G, so that only the edges from pa(Y) to Y remain, it suffices to use global interventions and avoid learning the parents altogether.

Fact 1.2. Under the conditions of Fact 1.1, E[Y | do(X[K] = x)] = E[Y | do(pa(Y) = x′)], where x′ are the values of x limited to pa(Y).

Using the two above facts, the CBUG problem can be recast as a regression problem over supp(X[K]). In other words, when there are no latent confounders between Y and any of its ancestors and when we may intervene on all variables, the optimal intervention may be found by a global intervention, which circumvents the need to know the causal graph. To our surprise, this observation does not seem to have been used to design algorithms before.

Unfortunately, even with this simplification, the CBUG problem can be intractable. In Section 2, we show that, for any algorithm, there exists a problem instance with sample complexity exponential in the size of pa(Y); therefore, we need additional structural assumptions to make the problem tractable. Turning towards the causal inference literature, we see that the most common structural assumption is that the outcome is a noisy additive function of its parents, the discrete analog of the linear assumption. It has been developed into its own theory (Hastie, 2017) and is used throughout the social and biomedical sciences: see e.g. Imbens & Rubin (2015, Chapter 13), Bühlmann et al. (2014), and more recently Maeda & Shimizu (2021) for examples. This assumption leads us to define the additive CBUG (a-CBUG) problem, where we additionally assume that Y is an additive function of pa(Y) plus a random term. We emphasize that additive outcome settings are of significant interest to the causal inference community.

As developed in Section 3, the key implication of the additive assumption is that the a-CBUG problem can be recast as a linear bandit problem (Lattimore & Szepesvári (2020, Chapter 19) provides a good introduction) where the action set is combinatorial and we only have full-bandit feedback, meaning that we never observe the effect of individual variables on the outcome and only observe a single sample of the outcome from p(Y | do(X = x)). By considering the specific problem of a-CBUG, we have naturally arrived at the additive combinatorial linear bandit problem with full-bandit feedback. To the best of our knowledge, we are the first to consider this problem, which extends previous causal bandit settings in two ways: (1) the action set is combinatorial, whereas most prior pure-exploration linear bandits cannot scale to combinatorial actions, and (2) we only have full-bandit feedback, meaning that we can never observe the individual additive components of Y and must infer them from only their sum. Existing pure-exploration linear bandit algorithms either have complexities that scale with the number of actions or cannot exploit the structure of the problem; hence, in Section 4, we propose a novel action-elimination algorithm that alternates between selecting actions that approximately solve an optimal design problem with decreasing tolerances and using the resulting observations to eliminate suboptimal actions.
Noting that storing a combinatorial action set and solving optimal design problems are generally intractable, we solve both computational challenges by restricting A to marginal action sets that decompose over variables, which also allows for an easy approximation of the optimal design problem. We name our algorithm marginal optimal design linear bandit (MODL). We analyze the algorithm in the PAC setting and provide one of the first instance-dependent analyses of the sample complexity (with Xiong & Chen (2023) being the only other, to the best of our knowledge). Finally, in Section 5, we show that MODL performs well for a-CBUG problems and, in particular, significantly outperforms the parents-first approach while being only slightly behind an oracle version of the algorithm that knows pa(Y).

## 2. Additive Causal Bandits with Unknown Graphs (a-CBUG)

This section gives a formal definition of the CBUG problem, presents a lower bound showing that any algorithm for solving this problem must have exponential dependence on the number of parents, and introduces the additive outcome assumption.

Notation. We refer to sets of variables or values using bold face, e.g. X and x, respectively. We use a subscript k to indicate the variable number, superscripts i or j to indicate a discrete value, and superscripts n or t to indicate the sample or round number. For example, x^t_k is the value of X_k for the t-th sample, and θ^i_k is a parameter corresponding to X_k's i-th value. Finally, [n] := {1, ..., n} and x_[n] indicates a sequence of sets of values.

### 2.1. CBUG Problem Formulation

We assume a system formed by random variables V = X[K] ∪ {Y}, where X[K] = {X_1, ..., X_K}, supp(X_k) = {1, ..., M_k} (i.e. each X_k has finite integer support), and Y is an outcome of interest that can be real-valued or discrete. The variables are causally related by an acyclic causal graph G with associated observational distribution p(V). The learner acts by selecting a set X ⊆ X[K] and values x ∈ supp(X) and performing the intervention do(X = x), which corresponds to replacing p(V) with the interventional distribution p(V \ X | do(X = x)) resulting from removing all incoming edges into X from G and fixing the value of X to x. When there are no unobserved confounders (i.e. latent common causes between variables in V, usually represented by bidirected edges in G),

p(V) = p(Y | pa(Y)) ∏_{k=1}^K p(X_k | pa(X_k)),

and the interventional distribution can be expressed as

p(V \ X | do(X = x)) = p(Y | pa(Y)) ∏_{k : X_k ∉ X} p(X_k | pa(X_k)) δ_{X=x},

with δ_{X=x} a delta function centered at x.

The learner's goal in causal bandits is to find the set X and values x which result in the greatest expectation of Y under the interventional distribution, denoted E[Y | do(X = x)]. The learner accomplishes this task by interacting with the environment sequentially, choosing, at every round t, an intervention (X^t, x^t) (also called an action) and obtaining a sample from p(Y | do(X^t = x^t)). Without loss of generality, we assume pa(Y) = {X_1, ..., X_{P_Y}}, where P_Y is the number of parents of Y (of course, the learner does not know this ordering). We consider the causal bandit problem in the setting where G is unknown and the learner must find, for ϵ > 0 and δ ∈ (0, 1), an (ϵ, δ)-PAC solution (X̂, x̂) satisfying

P( max_{X, x} E[Y | do(X = x)] − E[Y | do(X̂ = x̂)] ≤ ϵ ) ≥ 1 − δ

in as few rounds (or samples; we use the two interchangeably) as possible. We refer to this problem as the causal bandit with an unknown graph (CBUG) problem.
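To make the notion of sampling from an interventional distribution concrete, the following minimal sketch (not taken from the paper's code release) simulates a toy SCM with X_1 → X_2 → Y and X_1 → Y; the structural mechanisms, noise scale, and variable names are made up purely for illustration.

```python
import numpy as np

# Toy SCM with X1 -> X2 -> Y and X1 -> Y.  All mechanisms below are
# hypothetical and only illustrate what sampling from p(Y | do(X = x)) means.
rng = np.random.default_rng(0)

def sample_y(do=None, n=1):
    """Draw n samples of Y under the intervention encoded by `do`.

    `do` maps variable names ('X1', 'X2') to fixed values; an intervened
    variable ignores its structural equation (its incoming edges are cut).
    """
    do = do or {}
    x1 = np.full(n, do['X1']) if 'X1' in do else rng.integers(1, 4, size=n)
    x2 = np.full(n, do['X2']) if 'X2' in do else (x1 % 3) + 1
    # Outcome mechanism: Y depends only on its parents (X1, X2) plus noise.
    return 0.5 * x1 + 1.0 * x2 + rng.normal(0.0, 1.0, size=n)

# Observational mean vs. mean under the global intervention do(X1=2, X2=3).
print(sample_y(n=10_000).mean(), sample_y(do={'X1': 2, 'X2': 3}, n=10_000).mean())
```

A global intervention fixes every X_k at once, so only the mechanism of Y is left intact; this is exactly the situation exploited by Fact 1.2.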
### 2.2. Global Intervention Approach

Recall that the CBUG problem assumes no latent confounders between Y and any of its ancestors (i.e. variables with a directed, or causal, path into Y) and that we can simultaneously intervene on all observable variables except for Y, i.e. we can make global interventions. As discussed previously, global interventions are a common model in the literature. They also model settings where statistical units are expensive relative to the cost of intervening on additional variables for a single unit, which implies that the sample complexity (the number of units used) is the key quantity to minimize, not the number of variables intervened on.

As stated in Fact 1.1, these two assumptions imply that pa(Y) is the optimal intervention set. Thus, a natural solution to the CBUG problem would be to first find pa(Y) and then search for the optimal value in supp(pa(Y)). Instead, we make the key observation that a global intervention do(X[K] = x) cuts all incoming edges into X[K] in G, leaving only the edges from pa(Y) to Y, which implies that performing a global intervention is equivalent to intervening on pa(Y); precisely, E[Y | do(X[K] = x)] = E[Y | do(pa(Y) = x′)], where x′ are the values of x for pa(Y) (recall Fact 1.2 above). This claim can be proved by invoking rule 3 of do-calculus (Pearl, 2000), which, in this case, states that E[Y | do(X[K])] = E[Y | do(pa(Y))] since Y ⊥ X[K] \ pa(Y) | pa(Y) in G_{X̄[K]}, where G_{X̄[K]} is the graph G with all incoming edges into X[K] removed. We can therefore restrict the problem to finding x̂ ∈ supp(X[K]) that satisfies

P( max_{x ∈ supp(X[K])} E[Y | do(X[K] = x)] − E[Y | do(X[K] = x̂)] ≤ ϵ ) ≥ 1 − δ,

meaning that learning the parents is, in some sense, optional.

### 2.3. The CBUG Problem is Exponentially Hard

While using global interventions saves us from having to consider all intervention sets in the powerset of X[K], the CBUG problem can be exponentially hard in P_Y. For a fixed δ and ϵ, let x* ∈ supp(pa(Y)) be fixed and unknown, and let E[Y | do(pa(Y) = x)] = 0 + ϵ·1{x = x*}: the expectation is flat except at the single value x*, where it is equal to ϵ. We can strengthen the example by choosing p such that, for any intervention do(X′ = x′) with x′ agreeing with x* on the intervened parents, we have p(X_k = x*_k | do(X′ = x′)) = 0 for all X_k ∉ X′. Only interventions whose values agree with x* on all of pa(Y) can provide information about x*, so this problem is difficult: the learner has to try actions blindly until one containing x* is found. We assume that Y | do(X = x) is 1-sub-Gaussian for any intervention.

Obtaining an upper bound for this problem is easy. Consider the algorithm that picks an ordering of all values x^1, x^2, ... in supp(X[K]) uniformly at random. Beginning at t = 1, it collects O(ϵ^{−2} log(1/δ)) samples from p(Y | do(X[K] = x^t)) and tests E[Y | do(X = x)] ≥ ϵ against the null hypothesis E[Y | do(X = x)] = 0. If the null hypothesis is rejected, then x^t is optimal and the algorithm terminates; otherwise, the algorithm moves on to t + 1. The sample complexity is O(|supp(pa(Y))| ϵ^{−2} log(1/δ)): since P(x* agrees with x^t on pa(Y)) = 1/|supp(pa(Y))|, we have to test, on average, O(|supp(pa(Y))|) actions before stumbling upon one containing x*. This naïve algorithm matches the following lower bound, with proof in Appendix C.

Theorem 2.1. There is an instance of the CBUG problem such that any (ϵ, δ)-PAC algorithm must take Ω(|supp(pa(Y))| ϵ^{−2} log(1/δ)) samples in expectation.
### 2.4. Additive Outcome Assumption

As the example from the previous section illustrates, a problem where information about the optimal intervention is hyper-localized (i.e. where one only learns about the optimal intervention by trying it) is information-theoretically difficult. Therefore, we need some assumptions to make the problem tractable and, as discussed in the introduction, turn to the additive assumption from the causal inference literature (Bühlmann et al., 2014).

Assumption 2.2 (Additive Outcome). There exist functions f_1, ..., f_{P_Y} and a σ²-sub-Gaussian random variable η such that Y = Σ_{k=1}^{P_Y} f_k(X_k) + η.

This assumption implies that the causal effect of pa(Y) on Y decomposes into the sum of individual effects from each parent, which results in an additive CBUG (a-CBUG) problem. We focus on the homoscedastic case where η is an i.i.d. σ²-sub-Gaussian random variable, by far the most common assumption in the literature.

## 3. Pure-Exploration Linear Bandits

This section defines the additive combinatorial linear bandit with full-bandit feedback problem and shows how a-CBUG is a special case. With the additive outcome assumption, the CBUG problem can be cast as a pure-exploration linear bandit problem with a combinatorial action set A. The linear bandit problem is a sequential decision problem where, at round t, the learner chooses an action x_t ∈ A and observes y_t = ⟨x_t, θ*⟩ + ϵ_t, where ϵ_t is a zero-mean σ²-sub-Gaussian random variable and θ* ∈ R^d is a fixed but unknown parameter. The goal of the learner is to find an (ϵ, δ)-PAC action in as few rounds/samples as possible.

We cast a-CBUG as a linear bandit problem using one-hot encoding. For k ∈ [K], let e_k(i) be the ith unit vector in R^{M_k}, and for x ∈ supp(X[K]), define e(x) = (e_1(x_1), ..., e_K(x_K)) to be the concatenation of the one-hot vectors, which produces a mapping from supp(X[K]) to R^d with d := Σ_k M_k. Defining the vector θ* = (f_1(1), f_1(2), ..., f_K(M_K)), we obtain E[Y | do(X[K] = x)] = ⟨θ*, e(x)⟩. Therefore, the goal is to find argmax_{x ∈ supp(X[K])} ⟨θ*, e(x)⟩. Note that the terms of θ* corresponding to X_k ∉ pa(Y) are zero, so θ* is sparse. It is important to note that we only have full-bandit feedback because we observe ⟨θ*, e(x)⟩ and not the individual f_k components. Even though the action set A has a combinatorial structure with size ∏_{k=1}^K M_k, as it is the Cartesian product of choosing one value for each variable, the additive assumption allows us to consider the dimension-d linear problem instead.

Casting a-CBUG as a linear bandit problem enables us to borrow from the extensive literature on pure-exploration linear bandits. One successful approach has been to treat the action selection as an optimal experimental design problem: that is, selecting actions to reveal as much information as possible about a hidden parameter vector estimated through regression. While these optimal design problems tend to be intractable except for special cases, we show how to approximate our action set to avoid these difficulties. This approach was pioneered by Soare et al. (2014), who had the insight that the optimal design problem should optimize for learning the gaps between actions. Improvements in sample complexity were made by Xu et al. (2018) and Tao et al. (2018) by using a different estimator and a different design approximation strategy, respectively, while Fiez et al. (2019) considered the more general problem of transductive experimental design. A survey of optimal design in linear bandits can be found in Lattimore & Szepesvári (2020, Chapter 22).
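To make the casting above concrete, the following sketch (ours, not the released code) builds the one-hot embedding e(x) and evaluates ⟨θ*, e(x)⟩ for a made-up instance with three variables, where the second variable is a non-parent and therefore has zero entries in θ*.

```python
import numpy as np

# One-hot embedding e(x) for the a-CBUG-to-linear-bandit reduction.
# M[k] is the support size of variable k; x is a tuple with x_k in {1,...,M_k}.
def embed(x, M):
    e = np.zeros(sum(M))
    offset = 0
    for xk, Mk in zip(x, M):
        e[offset + (xk - 1)] = 1.0       # one-hot block for this variable
        offset += Mk
    return e

M = (3, 2, 4)                                     # d = 3 + 2 + 4 = 9
theta_star = np.concatenate([[0.1, 0.7, 0.3],     # f_1(1..3)
                             [0.0, 0.0],          # non-parent: all zeros
                             [0.2, 0.9, 0.4, 0.1]])  # f_3(1..4)
x = (2, 1, 3)
print(embed(x, M) @ theta_star)                   # = f_1(2) + f_3(3) = 1.1
```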
Unfortunately, all of the pure-exploration linear bandit algorithms above have complexity that is linear in the number of actions and are therefore not tractable for our combinatorial action space. Another line of work (Chen et al., 2014; Gabillon et al., 2011) has the same additive action structure as us but assumed semi-bandit feedback, i.e. where noisy observations of the individual f_k(x_k) are possible. Because we only observe y_t = Σ_{k=1}^{P_Y} f_k(x^t_k) + η, we are in the full-bandit setting and cannot use these algorithms either. The only work we are aware of in the pure-exploration combinatorial linear bandit with full-bandit feedback setting is by Du et al. (2021), who claimed the first efficient algorithm for this setting. Their approach uses a pre-sampling step to select a subset of the actions of size O(poly(d)) and then runs the algorithm of (Constantinou & Dawid, 2017). The resulting algorithm is fairly complex (requiring multiple sub-procedures, including an entropy mirror-descent stage) and requires finding a size-d subset of actions with rank d. In our case, the rank of any subset of actions is at most d − 1, so we cannot use this algorithm. Further, their algorithm is general purpose and cannot fully exploit the structure of our action space. Hence, we created a new algorithm, introduced in the following section.

## 4. Marginal Optimal Design Linear Bandit

Given data {(x_t, y_t)}_{t=1}^n, we need to learn about the unknown parameter vector θ* in a way that lets us quantify the uncertainty. With V_n = Σ_{t≤n} x_t x_t^⊤ denoting the data covariance and V_n^− its pseudoinverse, we use the ordinary least squares (OLS) estimator, θ̂ = V_n^− Σ_{t≤n} x_t y_t, which has the following Azuma-style confidence interval (see, e.g., Soare et al. (2014)):

Lemma 4.1. Let θ̂ be the OLS estimator calculated from data x_[n] with covariance matrix V_n. For any z ∈ R^d, δ ∈ (0, 1), and σ²-sub-Gaussian η,

P( ⟨θ̂ − θ*, z⟩ ≥ √( 2σ² ‖z‖²_{V_n^−} log(1/δ) ) ) ≤ δ.

This proof, as well as all other omitted proofs, is given in Appendix C. The lemma requires x_[n] to not be a function of the data. The main challenge in using this inequality is that we need to solve for a sequence of covariates that minimizes the bound.

Following Lattimore & Szepesvári (2020, Chapter 22), we propose an action-elimination algorithm that proceeds in phases. Each phase begins with a set S of plausibly best actions and a desired tolerance γ. The algorithm chooses actions to optimize the upper bound in Lemma 4.1, then uses this guarantee to prune S. The tolerance is then decreased before the next phase. Phases repeat until an optimal action is identified or all ϵ-suboptimal actions have been removed.

There are two main computational difficulties. First, the number of actions, |A| = ∏_k M_k, is very large, which makes the pruning step potentially intractable. We need algorithms that scale with the exponentially smaller ambient dimension d = Σ_k M_k. Second, the action selection (known as an optimal design problem) is a combinatorial optimization problem and generally difficult. We solve both problems at once by limiting S to sets that decompose over variables, defined below.

Marginal Action Sets. We say that a set of actions S ⊆ A = supp(X[K]) is marginal if there exist S_k ⊆ supp(X_k), k = 1, ..., K, such that S = {(j_1, j_2, ..., j_K) : j_1 ∈ S_1, j_2 ∈ S_2, ..., j_K ∈ S_K}. In other words, S consists of the Cartesian product of S_1, ..., S_K. Marginal action sets are intuitive: if we have eliminated, say, X_k = j as a good action, then we should never consider any action where X_k = j.
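As a small illustration of why marginal action sets are cheap to store and manipulate (a sketch under our own naming, not from the paper's code): only the per-variable lists S_k need to be kept, even though they implicitly encode their full Cartesian product.

```python
from itertools import product

# A marginal action set is stored as one list of surviving values per variable.
S = [[1, 3], [2, 5], [1, 2, 4]]     # S_1, S_2, S_3 (made-up values)

stored = sum(len(Sk) for Sk in S)   # storage ~ sum_k |S_k| values
encoded = 1
for Sk in S:
    encoded *= len(Sk)              # ... yet it encodes prod_k |S_k| actions
print(stored, encoded)              # 7 values stored, 12 actions encoded

def contains(action, S):
    # membership tests also decompose across variables
    return all(a in Sk for a, Sk in zip(action, S))

print(contains((1, 2, 4), S), contains((2, 2, 4), S))   # True False
# the full product is only ever enumerated lazily, if at all:
first_three = [a for _, a in zip(range(3), product(*S))]
```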
Marginal action sets solve the combinatorial action set problem, since such sets can be represented by Σ_k M_k binary values.

Optimal Design. In linear bandits, our goal is to choose actions x_t to reveal as much as possible about the optimal action. While the most obvious goal would be to choose actions to minimize a bound on ⟨θ̂ − θ*, e(x)⟩ simultaneously for all actions x ∈ S, estimating the gaps between actions x and x′, defined as Δ(x, x′) := ⟨θ*, e(x) − e(x′)⟩, is more efficient (Fiez et al., 2019). Defining V_n := Σ_{t=1}^n e(x_t) e(x_t)^⊤, the optimal design problem is

argmin_{x_[n]} max_{x, x′ ∈ S} ‖e(x) − e(x′)‖²_{V_n^−}.   (1)

Generally, the optimal design problem (1) is intractable (Xu et al., 2018), and the state-of-the-art algorithms are linear in |S| (Allen-Zhu et al., 2021). Fortunately, marginal action sets afford a computationally simple solution with an easy-to-calculate upper bound.

Lemma 4.2. Assume that S is marginal, and let x_[n] be any sequence of actions that is uniform in every marginal, i.e. for every k, |Σ_{t=1}^n 1{x^t_k = i} − Σ_{t=1}^n 1{x^t_k = j}| ≤ 1 for all i, j ∈ S_k. With V_n as the covariance matrix of x_[n], we have

max_{x, x′ ∈ S} ‖e(x) − e(x′)‖²_{V_n^−} ≤ Σ_k 2|S_k| / (n − |S_k|).

Roughly, the proof proceeds by noting that V_n can be written as a diagonal matrix of counts plus cross terms, both of which are positive semi-definite. We can upper bound the total expression by the norm defined only with the diagonal terms, which permits a particularly simple form of ‖e(x) − e(x′)‖²_{V_n^−} that we can explicitly calculate for uniform sequences of marginals.

Remark 4.3. The embedding that we use for the linear bandit is not full rank. For example, for any vector v ∈ R^K with 1^⊤v = 0, the null space of V_n includes (v_1 1(M_1), ..., v_K 1(M_K)) (where 1(n) is the all-ones vector of length n). However, what is important is that the projection of the null space onto the coordinates in S_k is always in the all-ones direction, which allows us to calculate unbiased estimates of the gaps, even though we may not be able to identify θ*. This insight provides another reason why Eq. (1) is the correct optimization objective.

### 4.1. Deriving the Elimination Algorithm

At each phase of the algorithm, we have an action set S and an error tolerance γ, and we wish to find a set of actions R to remove from S that can be guaranteed to be suboptimal. The crux is that we can only calculate the empirical gaps Δ̂(x, x′) := ⟨θ̂, e(x) − e(x′)⟩; thus we need to bound the error Δ̂(x, x′) − Δ(x, x′). Suppose that we choose x_[n] according to Lemma 4.2; Lemma 4.1 then guarantees that

⟨θ̂ − θ*, e(x) − e(x′)⟩ ≤ √( 2σ² ‖e(x) − e(x′)‖²_{V_n^−} log(1/δ) )

holds with high probability for all x, x′ ∈ S. This means that it suffices to take n = ⌈ (4σ² Σ_k |S_k| / γ²) log(L/δ) ⌉ if we want to ensure that ⟨θ̂ − θ*, e(x) − e(x′)⟩ ≤ γ.

The usual choice in the bandit literature is R = {x ∈ S : ∃ x′ ∈ S s.t. Δ̂(x′, x) ≥ γ}. Letting x* be the optimal action and x′ the action certifying that x ∈ R, we see that

⟨θ*, e(x*) − e(x)⟩ ≥ ⟨θ*, e(x′) − e(x)⟩ ≥ ⟨θ̂, e(x′) − e(x)⟩ − γ ≥ 0

for all x ∈ R. Using the inequality in the other direction, any x ∈ S \ R must have

⟨θ*, e(x*) − e(x)⟩ ≤ ⟨θ* − θ̂, e(x*) − e(x)⟩ + ⟨θ̂, e(x*) − e(x)⟩ ≤ 2γ.

This rejection procedure is guaranteed, with probability at least 1 − δ, to eliminate every action that is more than 2γ-suboptimal and to never eliminate the optimal action.

The reader may notice that S \ R is not necessarily marginal with this choice of R, even if S is marginal. Instead, we want an R such that 1) S \ R is marginal, and 2) R is as large as possible. Such a marginal-preserving rejection rule is necessary for the tractability of Eq. (1).
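The following sketch puts Lemma 4.2 and Lemma 4.1 together as one data-collection phase: play a sequence that is uniform in every marginal, regress with the pseudoinverse of the covariance, and compute the confidence half-width for a gap direction. Here `embed` is the one-hot map sketched in Section 3 and `sample_y` is a hypothetical environment sampler; all names are ours rather than the released code's.

```python
import numpy as np

def collect_phase(S, M, n, embed, sample_y, rng):
    """One sketch of a data-collection phase.

    S: list of surviving value lists per variable; M: full support sizes;
    embed(x, M): one-hot embedding; sample_y(x): draws Y | do(X[K] = x).
    """
    K = len(S)
    acts = np.empty((n, K), dtype=int)
    for k, Sk in enumerate(S):
        # round-robin over the surviving values: counts differ by at most 1
        col = np.tile(np.array(Sk), -(-n // len(Sk)))[:n]
        rng.shuffle(col)                 # order within a variable is irrelevant
        acts[:, k] = col
    X = np.stack([embed(tuple(a), M) for a in acts])   # (n, d) design matrix
    y = np.array([sample_y(tuple(a)) for a in acts])   # full-bandit feedback
    V_pinv = np.linalg.pinv(X.T @ X)                   # V_n^-  (rank-deficient V_n)
    theta_hat = V_pinv @ (X.T @ y)                     # OLS estimate
    return theta_hat, V_pinv

def gap_width(V_pinv, z, sigma2, delta):
    # Lemma 4.1-style half-width for <theta_hat, z> with z = e(x) - e(x');
    # Lemma 4.2 bounds z @ V_pinv @ z for the uniform-in-marginal design above.
    return np.sqrt(2.0 * sigma2 * (z @ V_pinv @ z) * np.log(1.0 / delta))
```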
Since we require S := S \ R to be marginal, we can define Rk := Sk\S k to be the values removed from Xk s marginal. How must we constrain Rk so that, for every x R, there must be some x S with ˆ (x, x ) γ? We use the fact that gaps decompose by variables: if we define ˆ i,j k := ˆθi k ˆθj k and ˆ j k := maxi ˆθi k ˆθj k, then the gap between x and x decomposes as ˆ (x, x ) = P k [K] ˆ xk,x k k . Taking x = x , we have that ˆ (x, x ) = P k ˆ xk k . Thus, we may only include i Rk if, for all x S with xk = i, we can guaran- Algorithm 1 MODL Input: δ > 0, ϵ > 0, X[K], σ2, B Optional: upper bound PY on PY Sk(1) [Mk] for k = 1, . . . , K L log 2BK for ℓ= 1, . . . , L do γ(ℓ) ϵ K 2L ℓ+1 k Sk(ℓ)| γ(ℓ)2 log L Choose x1, . . . , xn supp(X[K]) using Lemma 4.2 Collect yt P(Y | do(X[K] = xt)) t n Update Vn, ˆθ(ℓ) = V n P Calculate empirical gaps ˆ j k for k = 1, . . . , K do Sk(ℓ+ 1) n j Sk(ℓ) : ˆ j k < γ(ℓ) o k 1 {|Sk(ℓ)| = 1} if ˆPY (ℓ+ 1) = K or ˆPY (ℓ) PY then Break end if end for Return arg maxx S ˆθ(ℓ), e(x) and ˆθ(ℓ) tee that ˆ (x, x ) = P k ˆ xk k γ. Using x (i) to denote x with the kth value set to i, it is easy to check that ˆ (x (i), x ) = ˆ i k, which implies that we can only include i in Rk if ˆ i k γ. 4.2. The MODL Algorithm With the optimal design and rejection procedures derived, we can present the marginal optimal design linear bandit (MODL) algorithm and its sample complexity bound. MODL proceeds in phases ℓ= 1, . . . , L, and in each phase it solves an XY-optimal design problem, using the results of Lemma 4.2 with error γ = ϵ2L ℓ, ensuring that γ = ϵ 2 by the time the algorithm terminates. The algorithm uses the rejection rule of Section 4.1 to maintain a marginal action set S. We also consider the case when PY is provided, which allows termination once PY variables have their optimal value identified. The intuition is that θ1 k, . . . , θMk k are approximately equal for all for Xk pa(Y ), so the algorithm is not able to limit Sk to a single value. Pseudocode is provided in Algorithm 1. We remark that restricting to marginal action sets does not eliminate any action. Instead, this restriction potentially reduces the number of actions that can be eliminated by requiring that the set of remaining actions be expanded to the smallest marginal action set containing it. In essence, marginal action sets allow us to trade-off some statistical efficiency for computational tractability. Additive Causal Bandits with Unknown Graph samples needed 10 variables 30 variables 10 variables PY unknown 30 variables PY unknown 0 20 5 10 0 20 P1 MODL Oracle number of parents Figure 1. Sample complexity and average gaps versus number of parents of Y . We analyzed the expected sample complexity of the algorithm and present an upper bound in Theorem 4.4. We find the typical sum-of-reciprocal-squared-gaps dependence common to best-arm-identification problems, O P i,k( i k ϵ) 2 , however, instead of a sum over the combinatorial action set, the sum is over the all gaps for individual variables, which is the sample complexity one would expect if each variable could be played independently. In other words, despite only having full-bandit feedback, we obtain the sample complexity as if we had semi-bandit feedback. 
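For concreteness, the per-variable elimination step used inside Algorithm 1 (remove value j from S_k whenever Δ̂^j_k ≥ γ) can be sketched as follows; the estimates below are made-up numbers and the helper name is ours.

```python
import numpy as np

def prune_marginals(theta_hat_blocks, S, gamma):
    """theta_hat_blocks: one array of per-value estimates per variable
    (1-indexed values); S: surviving value lists; gamma: tolerance."""
    new_S = []
    for k, Sk in enumerate(S):
        est = {j: theta_hat_blocks[k][j - 1] for j in Sk}
        best = max(est.values())
        # keep value j only if its empirical gap to the best value is below gamma
        new_S.append([j for j in Sk if best - est[j] < gamma])
    return new_S

theta_blocks = [np.array([0.1, 0.8, 0.3]), np.array([0.0, 0.05])]
print(prune_marginals(theta_blocks, [[1, 2, 3], [1, 2]], gamma=0.4))
# -> [[2], [1, 2]]: variable 1 collapses to a single value,
#    variable 2 looks like a non-parent and keeps all of its values
```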
A substantial part of the complexity comes from the Σ_{k ∉ pa(Y)} M_k (ϵ/K)^{−2} term, which arises because the non-parents are the most difficult: to differentiate between the case when a variable is a non-parent and the case when it has a single value that is ϵ/K better than the rest, all values must be estimated to within ϵ/K. For the known-P_Y case, the (Δ^i_k ∨ (ϵ/K))^{−2} term is replaced by (Δ^i_k ∨ Δ_min ∨ (ϵ/K))^{−2}, which could be substantially smaller if Δ_min ≫ ϵ/K. We see this reduction because we no longer need to identify the non-parents, but rather can terminate once the minimum gap among the parents is found.

Theorem 4.4. Algorithm 1 is (ϵ, δ)-PAC. The expected sample complexity has an upper bound of

O( σ² log( log(BK/ϵ)/δ ) Σ_k Σ_{i ∈ [M_k]} ( Δ^i_k ∨ Δ_min ∨ (ϵ/K) )^{−2} ),

where Δ_min = min_{k ≤ P_Y} min_{i ∈ [M_k]} Δ^i_k is the minimum gap in the parents in the case when the number of parents P_Y is provided, and Δ_min = 0 otherwise.

Due mostly to the additive assumption, the sample complexity contains Σ_k M_k terms of order O((Δ ∨ ϵ)^{−2}), which is the same order as the sample complexity of running a separate bandit algorithm for each variable, despite only observing the sum of rewards. In contrast, a naïve approach which, ignoring the structure, simply uses a best-arm-identification algorithm over the combinatorial action set would have ∏_k M_k terms in the complexity bound, of order

Σ_{j_1=1}^{M_1} ··· Σ_{j_K=1}^{M_K} ( (Σ_k Δ^{j_k}_k) ∨ ϵ )^{−2}.

Recovering the Parents of Y. With simple assumptions on the f_k, we may recover a good estimate of pa(Y) upon termination of the algorithm. Using the parameter estimates returned by the algorithm, we define p̂a(Y) to be all the nodes X_k for which, for some phase ℓ, there exist i, j ∈ S_k(ℓ) with |θ̂^i_k(ℓ) − θ̂^j_k(ℓ)| > 2γ(ℓ). This formula follows the intuition that, for a non-parent k, all θ^j_k are identically equal and thus the θ̂^j_k(ℓ) should be within the error tolerance γ(ℓ) of each other. We can show that this method works with high probability, provided an identifiability condition holds. Without any identifiability assumptions, no algorithm can be guaranteed to recover the parents.

Theorem 4.5. Assume that there is some ϵ_min > 0 such that, for all k ≤ P_Y, there exist i, i′ ∈ [M_k] with |f_k(i) − f_k(i′)| ≥ ϵ_min. Let x̂ and θ̂ be the output of Algorithm 1 run with ϵ ≤ ϵ_min and δ > 0. Then p̂a(Y) as defined above satisfies P(p̂a(Y) = pa(Y)) ≥ 1 − δ. Furthermore, the intervention x̂_{p̂a(Y)} := {X_i = x̂_i : X_i ∈ p̂a(Y)}, which is x̂ limited to p̂a(Y), is (ϵ, δ)-PAC.

## 5. Experiments

This section presents an empirical evaluation of the MODL algorithm on a collection of randomly generated causal additive models.¹ Additional experiments studying the effect of graph structure and the sensitivity to violations of the additive outcome assumption can be found in Appendix B.

¹Code has been released at https://github.com/deepmind/additive_cbug.

Baselines. Since we are the first to consider the general setting of unknown G without assumptions on its structure, it is difficult to compare MODL to other algorithms in the causal bandit literature. The closest algorithms are those of De Kroon et al. (2022) and Bilodeau et al. (2022) with the separating set taken to be all intervenable random variables. In this setting, these algorithms reduce to a multi-armed bandit on the full, product action space. As the number of actions is exponential in the number of variables, we were only able to include this baseline for experiments with few variables. Since these algorithms were designed for the cumulative regret setting, we have implemented a version using Successive Elimination (SE) (Even-Dar et al., 2006).
We also compare MODL to (i) a parents-first (P1) method, which first performs hypothesis testing to find an approximate parent set p̂a(Y) and then runs Algorithm 1 with X[K] = p̂a(Y) (i.e. considering p̂a(Y) as the intervention set), and (ii) an oracle method, which runs Algorithm 1 with X[K] = pa(Y). Fact 1.2 guarantees that intervening on pa(Y) alone is sufficient to solve the problem; therefore, the difference in performance between MODL and the oracle method quantifies the value of knowing the parents. Comparing MODL with the P1 method answers whether spending samples to explicitly learn the parents is efficient.

For learning pa(Y), we were not able to find any suitable algorithm in the literature that exploits the ability to intervene on all of X[K]. Thus, we designed our own algorithm for finding an approximate parent set p̂a(Y) using global interventions. Let x⁰ ∈ supp(X[K]) be some fixed intervention; for each k in some random order, we enumerate j ∈ [M_k] and test the null hypothesis E[Y | do(X[K] = x⁰)] = E[Y | do(X_k = j, X[K] \ {X_k} = x⁰_{−k})]; X_k is added to p̂a(Y) only if we find a j for which the null hypothesis is rejected. We terminate early if p̂a(Y) is large enough to meet a bound on P_Y. Provided that, for all X_k ∈ pa(Y), there exist values i, j with |f_k(i) − f_k(j)| ≥ ϵ, this algorithm is guaranteed to find pa(Y) with high probability. Pseudocode and a complexity bound are provided in Appendix A.

Experimental Set-up. We performed the evaluation on randomly sampled structural causal models (SCMs) generated as follows. The causal graph, excluding Y, was a directed acyclic graph sampled from the Erdős-Rényi model with degree 3 and K − 1 variables. We randomly chose a set of variables of size P_Y as the parents of Y; then each variable topologically greater than Y was independently set as a child of Y with probability 0.5. To sample the conditional probability distributions, we chose M_k uniformly between specified upper and lower bounds and generated the conditional probability distribution for each X_k by sampling p(X_k = j | pa(X_k) = x) ∼ Beta(2, 5) independently for all j and x. Finally, we generated f_k(j) = B·W^j_k, with B = 5 and W^j_k sampled i.i.d. from Beta(2, 5), and set η to a standard normal variable. If X_j had Y as a parent, we used the same construction but with Y rounded to an integer.

Results. Using ϵ = 1/2 and δ = 0.1 for our (ϵ, δ)-PAC criterion, we considered a variety of settings of P_Y, K, and upper and lower bounds for M_k. Each point in all graphs corresponds to the average over 20 different SCMs sampled using the process described above and 50 independent runs of the methods on independently generated data.

Figure 2. Sample complexities including SE (4 and 6 variables, P_Y unknown, versus number of parents).

In Figure 2, the sample complexity and the average gaps are plotted for the Successive Elimination baseline (in red) as well as the MODL, parents-first, and oracle methods. The sample complexities of the SE baseline are almost too large to be comparable (roughly 200 times those of the other methods, which all appear comparatively as zero), and SE does not scale to more than a few variables. The sample complexity decreases with the number of parents, as a greater portion of arms are able to be eliminated. Figure 1 plots the same quantities, without SE, for more interesting numbers of variables in four different combinations of K ∈ {10, 30} and known/unknown P_Y (the lower and upper bounds for M_k are 3 and 6).
As predicted by Theorem 4.4, the sample complexity decreases with P_Y: many samples are required to distinguish between a non-parent and a potential parent with gaps close to ϵ, so the complexity increases with the number of non-parents. Overall, the performance of MODL is much closer to the performance of the oracle method than to that of the P1 method. We see that the performances coincide when P_Y = K, since MODL and the oracle method become the same algorithm. We also note that the P1 method does not benefit from P_Y = K. The P1 method also has consistently higher gaps (since, on occasion, it fails to identify a parent, which causes a large error), but all gaps are well within the desired error tolerance of ϵ = 1/2. See the appendix for more figures: e.g. Figure 3 plots the sample complexity for K = 30 versus the support sizes M_k.

To summarize, we found that, across all the settings we investigated, MODL was substantially better than the P1 method, and its performance (in terms of the gap of the final action and the sample complexity) was closer to the performance of the oracle method than to the performance of the P1 method. Hence, we conclude that the penalty of not knowing the parents is relatively small and much smaller than the cost of learning the parents first.

## 6. Discussion

In this paper, we have proposed an approach to solving the causal bandit problem under the general setting of an unknown causal graph (CBUG). Using the key insight that having no latent confounding between Y and any of its ancestors implies that a global intervention is equivalent to an intervention on the optimal set pa(Y), we showed that an additive outcome assumption allows us to solve the CBUG problem as a combinatorial linear bandit. Limiting our algorithm to marginal action sets alleviated the computational burden by providing an easy approximation to the optimal design problem and a factorization of the action set.

Two immediate directions for improving our algorithm and analysis concern the quality of approximation in the rejection procedure. A rejection procedure that outputs a marginal action set must reject fewer points than an unconstrained procedure. Are there principled ways of interpolating between marginal and full action sets, perhaps using unions of marginal sets, which would let us trade off computation against a larger number of rejected actions? The bound presented in this paper decomposes by variable, but a more nuanced rejection procedure and analysis should involve how the gaps between variables relate. Other possible extensions include: (i) relaxing the additive outcome assumption, for example by adding interaction terms, (ii) investigating what assumptions allow for algorithms that adapt to sparsity, and (iii) considering the linear and continuous case. More generally, how can our method generalize to the case of latent confounders between Y and its ancestors?

Finally, we contrast our setting with the most common setting of causal inference, where the goal is to learn the causal graph with as few experiments as possible. This setting is typical in the infinite-data case, so the problem difficulty is measured in the number of experiments. As we are in the finite-sample case, the number of statistical units (i.e. samples or rounds) is the most important quantity. With this distinction in mind, intervening on many variables is justified so long as we can reduce the sample complexity.
However, budgeted versions of the causal bandit problem have been considered (Nair et al., 2021) and are another interesting direction.

## Acknowledgements

We wish to thank Eleni Sgouritsa for help in designing the code and technical support, and Yasin Abbasi-Yadkori for technical feedback and many fruitful discussions.

## References

Allen-Zhu, Z., Li, Y., Singh, A., and Wang, Y. Near-optimal discrete optimization for experimental design: A regret minimization approach. Mathematical Programming, 186:439–478, 2021.

Bilodeau, B., Wang, L., and Roy, D. M. Adaptively exploiting d-separators with causal bandits. In Advances in Neural Information Processing Systems, 2022.

Bühlmann, P., Peters, J., and Ernest, J. CAM: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6):2526–2556, 2014.

Chen, S., Lin, T., King, I., Lyu, M. R., and Chen, W. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, volume 27, 2014.

Constantinou, P. and Dawid, A. P. Extended conditional independence and applications in causal inference. Annals of Statistics, 45(6):2618–2653, 2017.

De Kroon, A., Mooij, J., and Belgrave, D. Causal bandits without prior knowledge using separating sets. In Conference on Causal Learning and Reasoning, pp. 407–427. PMLR, 2022.

Du, S. S., Kakade, S. M., Wang, R., and Yang, L. F. Is a good representation sufficient for sample efficient reinforcement learning? In International Conference on Learning Representations, 2020.

Du, Y., Kuroki, Y., and Chen, W. Combinatorial pure exploration with full-bandit or partial linear feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 7262–7270, 2021.

Even-Dar, E., Mannor, S., Mansour, Y., and Mahadevan, S. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(6):1079–1105, 2006.

Fiez, T., Jain, L., Jamieson, K. G., and Ratliff, L. Sequential experimental design for transductive linear bandits. In Advances in Neural Information Processing Systems, volume 32, 2019.

Gabillon, V., Ghavamzadeh, M., Lazaric, A., and Bubeck, S. Multi-bandit best arm identification. In Advances in Neural Information Processing Systems, volume 24, 2011.

Hastie, T. J. Generalized additive models. In Statistical Models in S, pp. 249–307. Routledge, 2017.

Imbens, G. W. and Rubin, D. B. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

Lattimore, F., Lattimore, T., and Reid, M. D. Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems, pp. 1181–1189, 2016.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.

Lee, S. and Bareinboim, E. Structural causal bandits: Where to intervene? In Advances in Neural Information Processing Systems, pp. 2568–2578, 2018.

Lu, Y., Meisami, A., Tewari, A., and Yan, W. Regret analysis of bandit problems with causal background knowledge. In Conference on Uncertainty in Artificial Intelligence, pp. 141–150. PMLR, 2020.

Lu, Y., Meisami, A., and Tewari, A. Causal bandits with unknown graph structure. In Advances in Neural Information Processing Systems, volume 34, pp. 24817–24828, 2021.

Maeda, T. N. and Shimizu, S. Causal additive models with unobserved variables. In Uncertainty in Artificial Intelligence, pp. 97–106, 2021.
Maiti, A., Nair, V., and Sinha, G. A causal bandit approach to learning good atomic interventions in presence of unobserved confounders. In Uncertainty in Artificial Intelligence, pp. 1328–1338. PMLR, 2022.

Nair, V., Patil, V., and Sinha, G. Budgeted and non-budgeted causal bandits. In International Conference on Artificial Intelligence and Statistics, pp. 2017–2025. PMLR, 2021.

Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.

Soare, M., Lazaric, A., and Munos, R. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, volume 27, 2014.

Tao, C., Blanco, S., and Zhou, Y. Best arm identification in linear bandits with linear dimension dependency. In International Conference on Machine Learning, pp. 4877–4886, 2018.

Xiong, N. and Chen, W. Combinatorial pure exploration of causal bandits. In The Eleventh International Conference on Learning Representations, 2023.

Xu, L., Honda, J., and Sugiyama, M. A fully adaptive algorithm for pure exploration in linear bandits. In International Conference on Artificial Intelligence and Statistics, pp. 843–851, 2018.

## A. A simple test for the parents of Y

In the experiments, we compared MODL with an algorithm that first learns the parents of Y and then runs a bandit algorithm on the remaining variables. The algorithm is intuitively simple: we construct confidence intervals of half-width ϵ/2 for E[Y | do(X_k = j, X_{−k} = x⁰_{−k})] for all values of j, and if the intersection of the confidence intervals is empty, then one of the values of X_k is statistically significantly different from the others. Because we have intervened to fix all the other variables, this difference must be because X_k is a parent of Y. Formally, we have the following lemma.

Lemma A.1. Let x⁰ ∈ supp(X[K]) be fixed, and let x⁰_{−k} be x⁰ with the kth variable's value removed. For δ ∈ (0, 1) and ϵ > 0, assume that Y^1_k, ..., Y^{n_k}_k ∼ p(Y | do(X_k = j, X_{−k} = x⁰_{−k})) with

n_k ≥ ⌈ (8σ²/ϵ²) log(2K|supp(X_k)|/δ) ⌉.

Then, for all 1 ≤ k ≤ K and j ∈ supp(X_k) simultaneously,

| (1/n_k) Σ_{i=1}^{n_k} Y^i_k − E[Y | do(X_k = j, X_{−k} = x⁰_{−k})] | ≤ ϵ/2

with probability at least 1 − δ; that is, by the union bound, all the confidence intervals are simultaneously correct with probability at least 1 − δ.

Proof. With the n_k defined in the lemma, we can verify that

√( (2σ²/n_k) log(2K|supp(X_k)|/δ) ) ≤ ϵ/2.

Thus, Lemma 4.1 implies that

P( | (1/n_k) Σ_{i=1}^{n_k} Y^i_k − E[Y | do(X_k = j, X_{−k} = x⁰_{−k})] | ≤ ϵ/2 ) ≥ 1 − δ/(2K|supp(X_k)|).

Finally, by the union bound, all the confidence intervals are simultaneously correct with probability at least 1 − δ, as claimed.

Under the event that the bounds are all correct, X_k is a parent of Y if there exist two intervals that do not overlap; in this case, the means of the two interventions must be different with high probability. See Algorithm 2 for pseudocode. In addition to having few false positives, we can argue that the algorithm has few false negatives, as presented in the following lemma.

Algorithm 2: Finding pa(Y)
Given: ϵ > 0, δ ∈ (0, 1), σ², x⁰ ∈ supp(X[K])
  p̂a(Y) ← ∅
  for k = 1, ..., K do
    C_k ← R
    n ← ⌈ (8σ²/ϵ²) log(2K|supp(X_k)|/δ) ⌉
    for j = 1, ..., M_k do
      Collect y_1, ..., y_n ∼ p(Y | do(X_k = j, X_{−k} = x⁰_{−k}))
      C_k ← C_k ∩ [ (1/n) Σ_{i=1}^n y_i − ϵ/2, (1/n) Σ_{i=1}^n y_i + ϵ/2 ]
      if C_k = ∅ then
        p̂a(Y) ← p̂a(Y) ∪ {X_k}
        Skip the rest of the tests for X_k
      end if
    end for
  end for
  Return p̂a(Y)

Lemma A.2. Assume that, for all X_k ∈ pa(Y), there exist i, j ∈ supp(X_k) such that |f_k(i) − f_k(j)| ≥ ϵ. Then, with probability 1 − δ, Algorithm 2 correctly recovers the parents.
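A compact sketch of the test that Algorithm 2 performs for a single variable, assuming a hypothetical sampler `sample_y(k, j, n)` for p(Y | do(X_k = j, X_{−k} = x⁰_{−k})) is available (this helper and all names are ours, not part of the released code):

```python
import numpy as np

def is_parent(sample_y, k, M_k, K, eps, delta, sigma2):
    """Return True if some value of X_k statistically shifts E[Y]."""
    n = int(np.ceil(8 * sigma2 / eps**2 * np.log(2 * K * M_k / delta)))
    lo, hi = -np.inf, np.inf              # running intersection of intervals
    for j in range(1, M_k + 1):
        mean = sample_y(k, j, n).mean()
        lo, hi = max(lo, mean - eps / 2), min(hi, mean + eps / 2)
        if lo > hi:                       # intervals are disjoint:
            return True                   # some value of X_k changes E[Y]
    return False
```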
Figure 3. Sample complexity versus lower bound on M_k (10 and 30 parents, P_Y known and unknown; methods: P1, MODL, Oracle).

Figure 4. Sample complexity and average gap versus average graph degree, with 30 variables and 10 parents.

## B. Additional Experiments

This section holds additional experiments under the same set-up as in Section 5. Figure 3 plots the sample complexity for K = 30 versus the support sizes M_k, where the x-axis specifies the lower bound on M_k (the upper bound is 3 larger). As above, the performance of MODL is much closer to the performance of the oracle method than to the performance of the P1 method, and is almost identical for P_Y = K = 30. Figure 4 plots the sample complexity as the degree of the sampled graph changes, when P_Y = 10 and K = 30. Unsurprisingly, the degree has very little effect on the sample complexity, confirming our intuition from causal inference that intervening on all parents renders the rest of the causal graph unimportant.

Figure 5 confirms our suspicion that the linear bandit is very sensitive to model misspecification. We generated non-linear data by using the outcome model

Y = Σ_{k=1}^{P_Y} f_k(X_k) + α B M_max^{−4} ( X_{i_1} X_{i_2} X_{i_3} X_{i_4} + X_{j_1} X_{j_2} X_{j_3} + X_{k_1} X_{k_2} )

for indices i_1, i_2, i_3, i_4, j_1, j_2, j_3, and k_1, k_2 randomly chosen (without replacement) from [P_Y], where M_max is the upper bound on the support size (6 in this case). This model was chosen to resemble the effect of adding interaction terms that are the product of several variables. The leading coefficient is chosen to keep the maximum interaction term roughly αB, so choosing α ∈ [0, 1] keeps the scale of the interaction terms roughly equivalent to the additive terms. Despite this scaling, we still find that the performance is very sensitive to model mismatch, as illustrated in Figure 5. We also note that the parents-first approach is much more sensitive to model misspecification, at least in the average gap, because the misspecification increases the probability of the parents being misspecified, which in turn causes a large error. MODL and the oracle algorithm are more immune to this effect.

Figure 5. Sample complexity and average gap versus model misspecification. The x-axis details the coefficient of a multiplicative nonlinear term in the expected response function.

## C. Proofs

Proof of Theorem 2.1. This lower bound can be shown by invoking Theorem 33.5 from Lattimore & Szepesvári (2020). For any x ∈ supp(X[K]), we define q(x) to be the values of x on pa(Y); in particular, the rewards of any two actions x and x′ are equal if q(x) = q(x′). We invoke the theorem for a bandit with |supp(X[K])| arms, one for each action, and with a set E of bandit environments indexed by x_pa ∈ supp(pa(Y)); for the bandit ν(x_pa) ∈ E corresponding to x_pa, we set Y | do(X = x) ∼ Bernoulli(1/2 + ϵ·1{q(x) = x_pa}). We can explicitly calculate E_alt(ν(x_pa)) = {ν(x′_pa) : x′_pa ≠ x_pa}, which is the set of bandit environments with a different optimal arm than ν's. We also use ν_x to indicate the reward distribution of action x under bandit ν.
Then, examining the terms in the theorem, we need to calculate

c(ν) = max_{α ∈ Δ(supp(X[K]))} min_{ν′ ∈ E_alt(ν)} Σ_{x ∈ supp(X[K])} α_x D(ν_x, ν′_x),

where Δ(supp(X[K])) denotes the set of distributions over supp(X[K]) and D(·, ·) is the relative entropy. If ν′ ∈ E_alt(ν), then the x′_pa corresponding to ν′ is different from the x_pa corresponding to ν. Fixing one such x′_pa, we see that D(ν_x, ν′_x) is zero for all x whose q(x) does not distinguish the two environments. We also use the fact that D(ν_x, ν′_x) = O(ϵ²) when ϵ is small. Hence

min_{ν′ ∈ E_alt(ν)} Σ_{x ∈ supp(X[K])} α_x D(ν_x, ν′_x) = min_{x′_pa ≠ x_pa ∈ supp(pa(Y))} O(ϵ²) α_{x′_pa},

where x_pa corresponds to ν and α_{x′_pa} denotes the mass α places on actions x with q(x) = x′_pa. Taking the max over α, we see that any maximizing α must spread mass evenly across all x′_pa ∈ supp(pa(Y)), which leads to c(ν) = O(ϵ²/|supp(pa(Y))|). Combining these calculations with the theorem, we find that

E[τ] ≥ Ω( |supp(pa(Y))| ϵ^{−2} log(1/δ) ),

where E[τ] is the expected stopping time of any sound (i.e. (ϵ, δ)-PAC) algorithm, as claimed. We could also follow the techniques from the lower bound of Du et al. (2020) and reduce the index-query problem to the CBUG problem.

Proof of Lemma 4.2. Let x_1, ..., x_n be a sequence of actions. To calculate V_n, we write

V_n = Σ_{i=1}^n (e_1(x^i_1), ..., e_K(x^i_K)) (e_1(x^i_1), ..., e_K(x^i_K))^⊤ = Σ_{i=1}^n Σ_{j, j′} e_j(x^i_j) e_{j′}(x^i_{j′})^⊤ = D_n + C_n,

where D_n is the matrix of on-diagonal blocks and C_n contains all the off-diagonal terms; both are positive semi-definite. Using the fact that, if A and B are PSD matrices, then x^⊤(A + B)^− x ≤ x^⊤ A^− x, we can upper bound ‖e(x) − e(x′)‖²_{V_n^−} by ‖e(x) − e(x′)‖²_{D_n^−}. Letting N^j_k = Σ_{t≤n} 1{x^t_k = j} be the number of rounds in which X_k took on the jth value, we can show that

D_n = Σ_{t≤n} diag( e_1(x^t_1), ..., e_K(x^t_K) ) = diag( N^1_1, N^2_1, ..., N^{M_1}_1, N^1_2, ..., N^{M_K}_K );

D_n is the diagonal matrix of the counts of the values. We can upper bound the optimal design problem over x_[n] with another problem over counts of values that appear in S and add to n. Formally, this set is

N(S, n) := { (N^1_1, N^2_1, ..., N^{M_K}_K) : ∀k, N^j_k = 0 if j ∉ S_k and Σ_j N^j_k = n }.

We can easily check that, for any x, x′ ∈ S,

‖e(x) − e(x′)‖²_{D_n^−} = (e(x) − e(x′))^⊤ D_n^− (e(x) − e(x′)) = Σ_k ( 1/N^{x_k}_k + 1/N^{x′_k}_k ) 1{x_k ≠ x′_k}.

Thus, the XY-optimal design has an upper bound

x_{XY}(S, n) = argmin_{{N^j_k} ∈ N(S,n)} max_{x, x′ ∈ S} ‖e(x) − e(x′)‖²_{D_n^−} = argmin_{{N^j_k} ∈ N(S,n)} Σ_k max_{j, j′ ∈ S_k} ( 1/N^j_k + 1/N^{j′}_k ) = argmin_{{N^j_k} ∈ N(S,n)} Σ_k 2 / min_{j ∈ S_k} N^j_k.

We can solve the problem in closed form: for every variable k, allocate the N^j_k evenly among all j ∈ S_k. Because |S_k| may not divide n, we may have rounding errors and can only guarantee that N^j_k ∈ [⌊n/|S_k|⌋, ⌈n/|S_k|⌉], which results in an objective value of Σ_k 2/⌊n/|S_k|⌋ ≤ Σ_k 2|S_k|/(n − |S_k|).

Theorem 4.4. Algorithm 1 is (ϵ, δ)-PAC. The expected sample complexity is

O( σ² log( log(BK/ϵ)/δ ) [ Σ_{k ∈ pa(Y)} Σ_{i ∈ [M_k]} ( Δ^i_k ∨ (ϵ/K) )^{−2} + Σ_{k ∉ pa(Y)} M_k (ϵ/K)^{−2} ] ).

If the number of parents P_Y is provided, the complexity is instead

O( σ² log( log(BK/ϵ)/δ ) Σ_k Σ_{i ∈ [M_k]} ( Δ^i_k ∨ Δ_min ∨ (ϵ/K) )^{−2} ),

where Δ_min = min_{k ≤ P_Y} min_{i ∈ [M_k]} Δ^i_k is the minimum gap in the parents.

Proof of Theorem 4.4. Recall that MODL alternates between two stages, data collection and action elimination, and uses an exponentially decreasing error tolerance. Let S_k(ℓ), γ(ℓ), and n_ℓ be the corresponding values during phase ℓ. It is easy to verify that γ(ℓ) ≥ B/2 for ℓ = 1 and γ(ℓ) = ϵ/(2K) for ℓ = L. We will show that, with the chosen n_ℓ,

P( ⟨θ̂ − θ*, e(x) − e(x′)⟩ ≤ γ(ℓ) ∀ x, x′ ∈ S(ℓ), ℓ ∈ [L] ) ≥ 1 − δ.

That is, with high probability, all the confidence intervals used by the algorithm are correct. Let S_k(ℓ) be the kth marginal of S during phase ℓ of the algorithm.
By choosing

n_ℓ = ⌈ (4σ² Σ_k |S_k(ℓ)| / γ(ℓ)²) log(L/δ) ⌉,

Lemma 4.2 guarantees that max_{x, x′ ∈ S(ℓ)} ‖e(x) − e(x′)‖²_{V_n^−} ≤ Σ_k 2|S_k(ℓ)|/(n_ℓ − |S_k(ℓ)|), and Lemma 4.1 provides

⟨θ̂ − θ*, e(x) − e(x′)⟩ ≤ √( 2σ² ‖e(x) − e(x′)‖²_{V_n^−} log(L/δ) ) ≤ √( 4σ² Σ_k |S_k(ℓ)| log(L/δ) / n_ℓ ) ≤ γ(ℓ)

with probability at least 1 − δ/L for all x, x′ ∈ S(ℓ) simultaneously; a union bound over the L phases gives the claim. At each stage ℓ, the elimination algorithm only eliminates an x for which there exists an x′ with ⟨θ̂, e(x′) − e(x)⟩ ≥ γ(ℓ). If such an x′ exists, then we can conclude that

⟨θ*, e(x*) − e(x)⟩ ≥ ⟨θ*, e(x′) − e(x)⟩ ≥ ⟨θ̂, e(x′) − e(x)⟩ − γ(ℓ) ≥ 0.

Hence, we have shown that, for all stages ℓ, the algorithm never eliminates an optimal action.

The last step in checking correctness is to show that the returned action is at most ϵ-suboptimal. In round ℓ = L, Lemma 4.1 guarantees that ⟨θ̂ − θ*, e(x) − e(x′)⟩ ≤ γ(L) for all x, x′ in S(L). Applying this to the action x̂ returned by the algorithm and using the fact that, under the event that the confidence intervals are correct, x* ∈ S(L), we have

⟨θ*, e(x*) − e(x̂)⟩ ≤ ⟨θ* − θ̂, e(x*) − e(x̂)⟩ + ⟨θ̂, e(x*) − e(x̂)⟩ ≤ ϵ,

where the last step used the fact that each variable's error was controlled to ϵ/(2K) and that x̂ maximizes ⟨θ̂(L), e(x)⟩ over S(L). The total sample complexity is

Σ_{ℓ=1}^L Σ_{k=1}^K |S_k(ℓ)| (4σ²/γ(ℓ)²) log(L/δ).

We now look to bound Σ_{ℓ=1}^L Σ_{k=1}^K |S_k(ℓ)|/γ(ℓ)² in terms of instance-dependent quantities. With no bound on P_Y, the complexity decomposes over the values of each variable. Let G = {γ_1 > ... > γ_L = ϵ/(2K)} be the set of tolerances used by the algorithm. Setting c = 4σ² log(L/δ) and using the fact that the Δ̂^i_k are all unbiased, value i of variable k contributes c/γ² for every phase with tolerance γ ∈ G in which it has not yet been eliminated, i.e. for every γ ≥ Δ^i_k ∨ (ϵ/K), so its expected contribution is at most

Σ_{γ ∈ G : γ ≥ Δ^i_k ∨ (ϵ/K)} c/γ² ≤ 4c / ( 3 (Δ^i_k ∨ (ϵ/K))² ),

where we summed the geometric series. Summing over all values of all variables gives

Σ_{k ∈ pa(Y)} Σ_{i ∈ [M_k]} 4c / ( 3 (Δ^i_k ∨ (ϵ/K))² ) + Σ_{k ∉ pa(Y)} M_k · 4cK² / (3ϵ²).

To summarize, the sample complexity is the sum of the sample complexity for the parents and a linear term for the non-parents. Intuitively, it is very difficult to differentiate between a parent whose effect is of size roughly ϵ and a non-parent, so we expect to see these terms in the sample complexity.

Recall that Δ_min = min_{k ∈ pa(Y)} min_{i ∈ [M_k]} Δ^i_k. When the number of parents is given, the algorithm will terminate as soon as γ(ℓ) < Δ_min. Hence, the sample complexity can be decomposed into the samples needed to learn the parents,

Σ_{k ∈ pa(Y)} Σ_{i=1}^{M_k} 4c / ( 3 (Δ^i_k ∨ (ϵ/K))² ),

and the extra samples from all the remaining variables,

Σ_{γ ∈ G : γ ≥ Δ_min ∨ (ϵ/K)} Σ_{k ∉ pa(Y)} M_k c/γ² = Σ_{k ∉ pa(Y)} M_k c Σ_{γ ∈ G : γ ≥ Δ_min ∨ (ϵ/K)} 1/γ² ≤ Σ_{k ∉ pa(Y)} M_k · 4c / ( 3 (Δ_min ∨ (ϵ/K))² ).

Hence, the total sample complexity is

O( c [ Σ_{k ∈ pa(Y)} Σ_{i ∈ [M_k]} ( Δ^i_k ∨ (ϵ/K) )^{−2} + Σ_{k ∉ pa(Y)} M_k ( Δ_min ∨ (ϵ/K) )^{−2} ] ).

Proof of Theorem 4.5. We need to prove two things: first, that all parents are discovered, and second, that no erroneous parents are included. Throughout, we condition on the event that the confidence intervals are all correct, which happens with probability at least 1 − δ, as demonstrated in the proof of Theorem 4.4. Recalling that θ̂^i_k(ℓ) and γ(ℓ) are the respective quantities at phase ℓ of the algorithm, we first argue that non-parents are not added to p̂a(Y). Consider some non-parent k > P_Y. The concentration inequality from Lemma 4.1, applied to actions x, x′ that differ only in that the first corresponds to X_k = i and the second to X_k = j, yields ⟨θ̂ − θ*, e(x) − e(x′)⟩ ≤ γ(ℓ), i.e. θ̂^i_k − θ̂^j_k ≤ θ^i_k − θ^j_k + γ(ℓ). Taken with the fact that θ^i_k − θ^j_k = 0 for all i, j ∈ [M_k], we have |θ̂^i_k(ℓ) − θ̂^j_k(ℓ)| ≤ 2γ(ℓ), so k is not added to p̂a(Y) on the event that the confidence intervals are correct.

To show that all parents are included, we consider two cases. First, assume that the algorithm does not terminate early, so γ(L) = ϵ/(2K). For every k ≤ P_Y, the algorithm must have found the i, i′ satisfying |f_k(i) − f_k(i′)| ≥ ϵ_min ≥ ϵ ≥ 2γ(L), so k is included in p̂a(Y). The second case is when the algorithm terminates early. In this case, there must be P_Y variables with only one value remaining.
These variables must be parents since, with high probability, no non-parents are added. Under the event that p̂a(Y) = pa(Y), Fact 1.2 implies that the expected responses of the actions x̂ and x̂_{p̂a(Y)} are equal, completing the proof.