# Causal Bandits with Unknown Graph Structure

Yangyi Lu, Department of Statistics, University of Michigan (yylu@umich.edu); Amirhossein Meisami, Adobe Inc. (meisami@adobe.com); Ambuj Tewari, Department of Statistics, University of Michigan (tewaria@umich.edu)

**Abstract.** In causal bandit problems, the action set consists of interventions on variables of a causal graph. Several researchers have recently studied such bandit problems and pointed out their practical applications. However, all existing works rely on a restrictive and impractical assumption: the learner is given full knowledge of the causal graph structure upfront. In this paper, we develop novel causal bandit algorithms that do not require knowing the causal graph. Our algorithms work well for causal trees, causal forests, and a general class of causal graphs. Under mild conditions, the regret guarantees of our algorithms greatly improve upon those of standard multi-armed bandit (MAB) algorithms. Lastly, we prove that our mild conditions are necessary: without them, one cannot do better than standard MAB algorithms.

## 1 Introduction

The multi-armed bandit (MAB) problem is one of the classic models of sequential decision making (Auer et al., 2002; Agrawal and Goyal, 2012, 2013a). Statistical measures such as regret and sample complexity quantify how fast learning algorithms achieve near-optimal performance in bandit problems. However, without further assumptions, both the regret and the sample complexity of MAB problems necessarily scale with the number of actions. To address problems with a large action set, researchers have studied various types of structured bandit problems, where additional assumptions are made on the structure of the reward distributions of the actions. Algorithms for structured bandit problems exploit the dependency among arms to reduce regret or sample complexity. Examples of structured bandit problems include linear bandits (Abbasi-Yadkori et al., 2011; Agrawal and Goyal, 2013b), sparse linear bandits (Abbasi-Yadkori et al., 2012), and combinatorial bandits (Cesa-Bianchi and Lugosi, 2012; Combes et al., 2015).

In this paper, we study a different kind of structured bandit problem: causal bandits. In this setting, actions are interventions on variables of a causal graph. Many real-world problems can be modeled via causal bandits. In healthcare applications, a physician adaptively adjusts the dosages of multiple drugs to achieve a desirable clinical outcome (Liu et al., 2020). In email campaign problems, marketers adjust features of commercial emails to attract more customers and convert them into loyal buyers (Lu et al., 2019; Nair et al., 2021). Genetic engineering also involves the direct manipulation of one or more genes using biotechnology, such as changing the genetic makeup of cells to produce improved organisms (Wikipedia contributors, 2021).

Recently there has been a flurry of works on causal bandits (Lattimore et al., 2016; Sen et al., 2017; Lee and Bareinboim, 2018; Lu et al., 2019; Lee and Bareinboim, 2019; Nair et al., 2021) showing how to achieve simple or cumulative regret that does not scale with the size of the action set. However, a major drawback of existing work is that it requires significant prior knowledge: all existing works assume the underlying causal graph is given upfront, and some regret analyses even assume that certain probabilities of the causal model are known. In practice, these are all strong assumptions.
In this paper, our goal is to develop causal bandit algorithms that 1) do not require prior knowledge of the causal graph and 2) achieve stronger worst-case regret guarantees than non-causal algorithms such as Upper Confidence Bound (UCB) and Thompson Sampling (TS), whose regret often scales at least polynomially with the number of nodes $n$ in the causal graph. Unfortunately, this goal cannot be achieved for general causal graphs. Consider a causal graph consisting of isolated variables, where the reward directly depends on one of them. In the worst case, there is no hope of doing better than standard algorithms, since no meaningful causal relations among variables can be exploited. In this paper, we study the classes of causal graphs on which the goal can be achieved.

**Our Contributions.** We summarize our contributions below.

1. We first study causal bandit problems where the unknown causal graph is a directed tree or a causal forest. This setting has wide applications in biology and epidemiology (Greenewald et al., 2019; Burgos et al., 2008; Kontou et al., 2016; Pavlopoulos et al., 2018). We design a novel algorithm, Central Node UCB (CN-UCB), that simultaneously exploits the reward signal and the tree structure to efficiently find the direct cause of the reward, and then applies the UCB algorithm on a reduced intervention set corresponding to the direct cause.

2. Theoretically, we show that under certain identifiability assumptions, the regret of our algorithm scales only logarithmically with the number of nodes $n$ in the causal graph. To our knowledge, this is the first regret guarantee for an unknown causal graph that provably outperforms standard MAB algorithms. We complement our positive result with lower bounds showing that the identifiability assumptions are necessary.

3. Furthermore, we generalize CN-UCB to a more general class of graphs that includes causal trees, causal forests, proper interval graphs, etc. Our algorithm first constructs undirected clique (junction) trees and again simultaneously exploits the reward signal and the junction-tree structure to efficiently find the direct cause of the reward. We also extend our regret guarantees to this class of graphs.

In many scenarios, our algorithms do not recover the full underlying causal graph structure. Our results therefore deliver an important conceptual message: exact causal graph recovery is not necessary in causal bandits, since the main target is to maximize the reward.

## 2 Related work

The causal bandit framework was proposed by Lattimore et al. (2016) and has been studied in various settings since then. In the absence of confounders, Lattimore et al. (2016) and Sen et al. (2017) studied the best arm identification problem assuming that the exact causal graph and the way interventions influence the direct causes of the reward variable are given. Lu et al. (2019) and Nair et al. (2021) proposed efficient algorithms that minimize the cumulative regret under the same assumptions. When confounders exist, Lee and Bareinboim (2018, 2019) developed effective ways to reduce the intervention set using the causal graph before applying a standard bandit algorithm such as UCB. Even though the performance of the above works improves upon that of standard MAB algorithms, they all make the strong assumption of knowing the causal graph structure in advance. In our setting, the underlying causal graph is not known.
A natural approach is then to first learn the causal graph through interventions. Many intervention design methods have been developed for causal graph learning under different assumptions (He and Geng, 2008; Hyttinen et al., 2013; Shanmugam et al., 2015; Kocaoglu et al., 2017; Lindgren et al., 2018; Greenewald et al., 2019; Squires et al., 2020). However, this approach is not sample efficient, because recovering the full causal graph is not necessary for maximizing rewards. de Kroon et al. (2020) tackled the unknown-graph setting using separating set ideas. However, their method implicitly requires the existence of a set of non-intervenable variables that d-separate the interventions and the reward. Moreover, their regret bound does not improve upon non-causal algorithms such as UCB.

In this paper, we take a different approach that uses the reward signal to efficiently learn the direct causes of the reward. Our approach is inspired by Greenewald et al. (2019), who proposed a set of central node algorithms that can recover the causal tree structure within $O(\log n)$ single-node interventions. Squires et al. (2020) extended the central node idea to learn a general class of causal graphs by constructing junction trees and clique graphs. Following these works, our causal bandit algorithms also adaptively perform interventions on central nodes to learn the direct causes of the reward variable. However, our algorithms differ from theirs in that we also take the reward signal into account.

## 3 Preliminaries

In this section, we follow the notation and terminology of Lattimore et al. (2016) and Greenewald et al. (2019) for describing causal models and causal bandit problems.

### 3.1 Causal Models

A causal model consists of a directed acyclic graph (DAG) $D$ over a set of random variables $\mathbf{X} = \{X_1, \ldots, X_n\}$ and a joint distribution $P$ that factorizes over $D$. The parents (children) of a variable $X_i$ in graph $D$, denoted by $Pa_D(X_i)$ ($Ch_D(X_i)$), are the set of variables $X_j$ such that there is a directed edge from $X_j$ to $X_i$ (from $X_i$ to $X_j$) in $D$. The ancestors (descendants) of a variable $X_i$, denoted by $An_D(X_i)$ ($De_D(X_i)$), are the set of variables $X_j$ such that there is a directed path from $X_j$ to $X_i$ (from $X_i$ to $X_j$) in $D$. Without loss of generality, we assume the domain of every $X_i$ is $Dom(X_i) = [K] := \{1, \ldots, K\}$. For every $X_i$, we write $N_D(X_i)$ for the set of neighbors of $X_i$ in $D$, i.e., the variables $X_j$ such that there is an edge between $X_j$ and $X_i$ regardless of direction. The maximum degree of an undirected graph $G$ is denoted by $d_{\max}(G)$. Throughout, we denote the true causal graph by $D$, use $V(\cdot)$ for the vertex set of a graph, and define $\mathrm{skeleton}(\cdot)$ as the undirected graph obtained by replacing the arrows of a directed graph with undirected edges.

**Definition 1 (Directed Tree and Causal Tree).** A directed tree is a DAG whose underlying undirected graph is a tree and all of whose edges point away from the root. A causal tree is a causal model whose underlying causal graph is a directed tree.

For a node $X_i$ on a directed or undirected tree $D$ and a neighbor $Y \in N_D(X_i)$, we write $B_D^{X_i:Y}$ for the set of nodes that can be reached from $Y$ through some path on the graph (regardless of the directions of the edges along the path) once the edge between $X_i$ and $Y$ is removed from $D$. Note that the neighbor $Y$ itself is always included in the branch $B_D^{X_i:Y}$.

**Definition 2 (Central Node (Greenewald et al., 2019)).** A central node $v_c$ of an undirected tree $T$ with respect to a distribution $q$ over the nodes is one for which $\max_{X_j \in N_T(v_c)} q(B_T^{v_c:X_j}) \le 1/2$. At least one such $v_c$ is guaranteed to exist for any distribution $q$ (Jordan, 1869; Greenewald et al., 2019). Informally, a central node $v_c$ guarantees that, under $q(\cdot)$, the weight of every branch around $v_c$ is at most $1/2$.
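Definition 2 suggests a direct search procedure: check the branch weights around each candidate node. Below is a minimal Python sketch using networkx; the function name and interface are our own illustration, not code from the paper:

```python
import networkx as nx

def central_node(tree: nx.Graph, q: dict):
    """Find a central node (Definition 2): a node v such that every branch
    hanging off v carries at most half of the total q-weight."""
    total = sum(q.values())
    for v in tree.nodes:
        pruned = tree.copy()
        pruned.remove_node(v)
        # Each connected component of tree - {v} is a branch B_T^{v:y}.
        if all(sum(q[u] for u in comp) <= total / 2
               for comp in nx.connected_components(pruned)):
            return v
    raise ValueError("no central node found; is `tree` a tree and q >= 0?")

# Example: on a path of 5 nodes with uniform weights, the middle node wins.
print(central_node(nx.path_graph(5), {i: 1 / 5 for i in range(5)}))  # -> 2
```

A linear-time variant walks from an arbitrary node toward its heaviest branch until the condition holds; the quadratic scan above simply keeps the sketch short.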
**Definition 3 (Essential Graph).** The class of causal DAGs that encode the same set of conditional independences is called a Markov equivalence class; we denote the Markov equivalence class of a DAG $D$ by $[D]$. The essential graph of $D$, denoted $E(D)$, has the same skeleton as $D$, with a directed edge $X_i \to X_j$ if this edge direction holds for all DAGs in $[D]$, and an undirected edge otherwise. The chain components of $E(D)$, denoted $CC(E(D))$, are the connected components of $E(D)$ after removing all directed edges. Every chain component of $E(D)$ is a chordal graph (Andersson et al., 1997). A DAG whose essential graph has a single chain component is called a moral DAG (Greenewald et al., 2019).

**Definition 4 (Causal Forest (Greenewald et al., 2019)).** A causal graph is a causal forest if each chain component of its essential graph is a tree.

Many widely used causal DAGs, including causal trees and bipartite causal graphs, are examples of causal forests (Greenewald et al., 2019). Applications of bipartite graphs range from biological, biomedical, and biomolecular networks to epidemiological networks (Burgos et al., 2008; Kontou et al., 2016; Pavlopoulos et al., 2018).

### 3.2 Causal Bandit Problems

In causal bandit problems, the action set consists of interventions (Lattimore et al., 2016). An intervention on node $X$ removes all edges from $Pa_D(X)$ to $X$ and results in a post-intervention distribution, denoted $P(\{X\}^c \mid do(X = x))$, over $\{X\}^c := \mathbf{X} \setminus \{X\}$. The empty intervention is denoted $do()$. The reward variable $R$ is real-valued, and for simplicity we assume the reward is directly influenced by only one of the variables in $\mathbf{X}$, which we call the reward generating variable, denoted $X_R$. (For the case of multiple reward generating variables, one can repeatedly run our proposed algorithms and recover them one by one; we discuss this setting in Section C.) The learner does not know the identity of $X_R$. Since there is only one $X_R$ in our setting, the optimal intervention is contained in the set of single-node interventions
$$\mathcal{A} = \{do(X = x) \mid X \in \mathbf{X},\ x \in [K]\}.$$
Thus, we focus on this intervention set, with $|\mathcal{A}| = nK$, throughout the paper. We denote the expected reward of intervention $a = do(X = x)$ by $\mu_a = E[R \mid a]$. Then $a^* = \arg\max_{a \in \mathcal{A}} \mu_a$ is the optimal action, and we assume $\mu_a \in [0, 1]$ for all $a$. A random reward for $a = do(X = x)$ is generated as $R \mid a = \mu_a + \varepsilon$, where $\varepsilon$ is 1-sub-Gaussian. At every round $t$, the learner pulls $a_t = do(X_t = x_t)$ based on the knowledge from previous rounds and observes a random reward $R_t$ together with the values of all $X \in \mathbf{X}$. The objective of the learner is to minimize the cumulative regret $R_T = T\mu_{a^*} - \sum_{t=1}^{T} \mu_{a_t}$ without knowing the causal graph $D$.
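To make the interaction protocol concrete, here is a minimal Python sketch; the `env` and `policy` interfaces are hypothetical illustrations of ours, not part of the paper:

```python
def run_causal_bandit(env, policy, T):
    """Causal bandit protocol sketch: each round, the learner chooses an
    intervention do(X = x), then observes the values of all variables
    together with the reward R_t (one post-intervention sample)."""
    cum_reward = 0.0
    for t in range(T):
        X, x = policy.select(t)            # pick an intervention do(X = x)
        values, r = env.intervene(X, x)    # values of all variables + reward
        policy.update(X, x, values, r)     # learner may use the full sample
        cum_reward += r
    return cum_reward
```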
We make the following assumptions.

**Assumption 1.** The following three causal assumptions hold:
- Causal sufficiency: for every pair of observed variables, all their common causes are also observed.
- Causal Markov condition: every node in the causal graph is conditionally independent of its non-descendants given its parents.
- Causal faithfulness condition: the set of independence relations derived from the causal Markov condition is exactly the set of independence relations that hold.

Assumption 1 is commonly made in the causal discovery literature (Peters et al., 2012; Hyttinen et al., 2013; Eberhardt, 2017; Greenewald et al., 2019). Equivalently, there are no latent common causes, and the correspondence between d-separations and conditional independences is one-to-one.

**Assumption 2 (Causal Effect Identifiability).** There exists an $\varepsilon > 0$ such that for any edge $X_i \to X_j$ in graph $D$, we have $|P(X_j = x \mid do(X_i = x')) - P(X_j = x)| > \varepsilon$ for some $x \in Dom(X_j)$ and $x' \in Dom(X_i)$.

Assumption 2 is necessary. It states that when there is a direct causal relation between two variables in the graph, the strength of the causal effect cannot be arbitrarily small. A similar version of this assumption was made by Greenewald et al. (2019) in their causal graph learning algorithms. We show the necessity of this assumption in Section 6. Intuitively, without it, causal relations among variables cannot be determined from finitely many interventional samples, and $\Omega(\sqrt{nKT})$ worst-case regret is the best one can hope for.

**Assumption 3 (Reward Identifiability).** For every $X \in An(X_R)$, there exists $x \in [K]$ such that $|E[R \mid do()] - E[R \mid do(X = x)]| \ge \Delta$ for some universal constant $\Delta > 0$.

Assumption 3 is also necessary. It guarantees a difference in expected reward between the observational distribution and the distribution after an intervention on an ancestor of $X_R$. In Section 6, we prove that without this assumption the worst-case regret is again lower bounded by $\Omega(\sqrt{nKT})$.

In the following sections, we describe our causal bandit algorithms, which focus on intervention design to minimize regret. Like most intervention design works, our algorithms take the essential graph $E(D)$ and the observational probabilities over $\mathbf{X}$ (denoted $P(\mathbf{X})$) as input; these can be estimated from sufficient observational data (He and Geng, 2008; Greenewald et al., 2019; Squires et al., 2020), which is usually much cheaper to obtain than interventional data (Greenewald et al., 2019).

## 4 CN-UCB for trees and forests

We begin by introducing our algorithm CN-UCB (Algorithm 1) for the case where the causal graph $D$ is a directed tree. Before diving into details, we summarize our result:

**Theorem 1 (Regret Guarantee for Algorithm 1).** Define $B = \max\left\{\frac{32}{\Delta^2}\log\frac{8nK}{\delta},\ \frac{2}{\varepsilon^2}\log\frac{8n^2K^2}{\delta}\right\}$ and $T_1 = KB(2 + d)\log_2 n$, where $0 < \delta < 1$. Suppose we run CN-UCB with $T \ge T_1$ interventions in total and let $T_2 := T - T_1$. Then with probability at least $1 - \delta$, we have
$$R_T = \tilde{O}\left(K \max\left\{\frac{1}{\Delta^2}, \frac{1}{\varepsilon^2}\right\} d(\log n)^2 + \sqrt{KT_2}\right),$$
where $d := d_{\max}(\mathrm{skeleton}(D))$ and $\tilde{O}$ ignores poly-log terms not involving $n$.

From the above result, we see that especially for large $n$ and small maximum degree $d$, Algorithm 1 outperforms standard MAB approaches, which incur $\tilde{O}(\sqrt{nKT})$ regret.
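As a quick numerical aid, the exploration budgets in Theorem 1 can be computed as below. This is a sketch that assumes our reading of the constants in $B$ above; all names are ours:

```python
import math

def cnucb_budgets(n, K, d, delta, Delta, eps):
    """Compute B and T1 from Theorem 1 (sketch; constants per our reading)."""
    B = math.ceil(max((32 / Delta**2) * math.log(8 * n * K / delta),
                      (2 / eps**2) * math.log(8 * n**2 * K**2 / delta)))
    T1 = K * B * (2 + d) * math.ceil(math.log2(n))
    return B, T1

# Example: n = 128 nodes, K = 2 values per node, maximum degree d = 3.
print(cnucb_budgets(n=128, K=2, d=3, delta=0.05, Delta=0.2, eps=0.1))
```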
### 4.1 Description of CN-UCB

Our method contains three main stages. There is no v-structure in a directed tree, so the essential graph obtained from observational data is simply the tree skeleton $T_0$, which we take as the input to Algorithm 1. In stage 1, Algorithm 1 calls Algorithm 3 to find a directed subtree $\tilde{T}_0$ that contains $X_R$ by performing interventions on the central node (which may change over time) of the tree skeleton. In stage 2, Algorithm 1 calls Algorithm 4 to find the reward generating node $X_R$ via central node interventions on $\tilde{T}_0$. We prove that $X_R$ is identified correctly with high probability in the first two stages, so a UCB algorithm can then be applied on the reduced intervention set $\mathcal{A}_R = \{do(X_R = k) \mid k = 1, \ldots, K\}$ for the remaining rounds in stage 3. In the remainder of this section, we explain our algorithm design for each stage in detail and use the example in Figure 1 to show how CN-UCB proceeds step by step.

[Figure 1: Causal tree example for the CN-UCB procedure. Panels: (a) true causal tree; (b) input tree skeleton; (c) stage 1 ($t = 0$): intervention on central node $X_2$; (d) stage 1 ($t = 0$) intervention result; (e) stage 1 ($t = 1$): intervention on central node $X_3$; (f) stage 1 ($t = 1$) intervention result; (g) stage 1 output: directed subtree; (h) stage 2 ($t = 0$): intervention on central node $X_3$; (i) stage 2 ($t = 0$): interventions on the children of $X_3$; (j) stage 2 output: only $X_7$ remains.]

At the beginning of CN-UCB, we collect reward data under the empty intervention to obtain the observational reward mean estimate $\hat{R}$, which is used in later steps to check whether a variable is an ancestor of the true reward generating variable $X_R$. In our example, the true causal graph and its essential graph are displayed in Figures 1a and 1b, and the true reward generating variable is $X_7$.

**Algorithm 1** Central Node Upper Confidence Bound (CN-UCB)
1: Input: tree skeleton $T_0$, $K$, $\Delta$, $\varepsilon$, $B$, $T_2$, observational probabilities $P(\mathbf{X})$.
2: Perform $do()$ for $B$ times; collect $R_1, \ldots, R_B$.
3: Obtain the reward estimate for the empty intervention $do()$: $\hat{R} \leftarrow \frac{1}{B}\sum_{b=1}^{B} R_b$.
4: Stage 1: find a directed subtree that contains $X_R$: $\tilde{T}_0 \leftarrow$ FindSubtree($T_0$, $K$, $B$, $\varepsilon$, $\Delta$, $\hat{R}$). // call Algorithm 3 in Section B.
5: Stage 2: find the key node that generates the reward: $X_R \leftarrow$ FindKeyNode($\tilde{T}_0$, $K$, $B$, $\Delta$, $\hat{R}$). // call Algorithm 4 in Section B.
6: Stage 3: apply the UCB algorithm on $\mathcal{A}_R = \{do(X_R = k) \mid k = 1, \ldots, K\}$ for $T_2$ rounds.

**Stage 1.** The goal of this stage is to find a directed subtree that contains $X_R$ within $O(\log n)$ single-node interventions. We achieve this by adaptively intervening on a central node that may change over time. To illustrate the main idea, we make the following claim about single-node interventions.

**Claim 1 (Two outcomes of single-node interventions on a variable $X$).** 1) Whether $X$ is an ancestor of $X_R$ can be determined. 2) The directed subtree rooted at $X$ can be found.

For the first outcome, we obtain reward estimates under interventions on $X$. If the difference between any of these reward estimates and $\hat{R}$ is larger than a threshold, Assumption 3 guarantees that $X$ is an ancestor of $X_R$. For the second outcome, we estimate the interventional probabilities $\hat{P}(Y = y \mid do(X = x))$ for all $Y \in N_{T_0}(X)$ and $y, x \in [K]$. Comparing these quantities with the corresponding observational probabilities $P(Y = y)$, the directions of the edges between $X$ and its neighbors can be determined thanks to Assumption 2. Note that in a tree structure, at most one of the neighbors has a causal effect on $X$, while the others are all causally influenced by $X$. Once the edges between $X$ and its neighbors are oriented, all other edges in the subtree rooted at $X$ can be oriented using the property that each variable has at most one parent in a tree. Thus, if we learn that $X$ is indeed an ancestor of $X_R$, a directed subtree is obtained immediately. Otherwise, we conclude that none of the variables in the directed subtree rooted at $X$ can be $X_R$.
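A minimal sketch of the reward-based ancestor test behind outcome 1 of Claim 1 follows. The `sample_reward` callback and the threshold $\Delta/2$ are our illustrative choices; the paper's actual tests are in its Algorithms 3 and 4 (Section B):

```python
import numpy as np

def is_ancestor_of_reward(sample_reward, X, K, B, R_hat, Delta):
    """Declare X an ancestor of X_R if, for some value k, the mean reward
    over B samples under do(X = k) deviates from the observational estimate
    R_hat by more than Delta / 2 (Assumption 3 separates the two cases)."""
    for k in range(1, K + 1):
        mu_hat = np.mean([sample_reward(X, k) for _ in range(B)])
        if abs(mu_hat - R_hat) > Delta / 2:
            return True
    return False
```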
A key question is then: how do we adaptively select which variables to intervene on? Selecting variables arbitrarily may easily require $O(n)$ single-node interventions. For example, consider the chain graph $X_1 \to X_2 \to \cdots \to X_n$ where the reward generating variable is $X_1$. If we start by intervening on $X_n$ and each time move one node backward toward $X_1$, we need $O(n)$ single-node interventions in total to find the directed subtree containing $X_1$. To overcome this issue, we adopt the central node idea of Greenewald et al. (2019). Intervening on a central node $v_c(t)$ guarantees that we eliminate at least half of the remaining weight from future searching, because the directed subtree rooted at $v_c(t)$ carries at least half of the weight under the current weighting function $q_t(\cdot)$. As long as $v_c(t)$ is not found to be an ancestor of $X_R$, the weights $q_{t+1}(\cdot)$ of the variables in the directed subtree rooted at $v_c(t)$ are set to zero. Thus, stage 1 finishes within $O(\log n)$ single-node interventions on $T_0$.

For the example in Figure 1, CN-UCB takes the tree skeleton in Figure 1b as input and identifies $X_2$ as the central node. As Figure 1d shows, the directed subtree rooted at $X_2$ is crossed out of the candidate set for $X_R$, since intervening on $X_2$ shows that $X_2$ is not an ancestor of the true $X_R$. CN-UCB then identifies $X_3$ as the central node over the remaining variables (Figure 1e). Interventions on $X_3$ reveal that $X_3$ is indeed an ancestor of $X_7$; thus $X_1$ can be crossed out, because $X_1$ is upstream of $X_3$ and cannot be the true $X_R$ (Figure 1f). Finally, stage 1 outputs the directed subtree in Figure 1g, rooted at $X_3$.

**Stage 2.** The goal of this stage is to identify $X_R$ within the directed subtree $\tilde{T}_0$ using $O(\log n)$ rounds of central node interventions. As in stage 1, we determine whether a node $X$ is an ancestor of $X_R$ by intervening on it. If we find a variable $X$ that is not an ancestor of $X_R$ according to its reward estimates, all nodes in the subtree rooted at $X$ are eliminated from future searching. Otherwise, we continue by intervening on the children of $X$. Since $X_R$ is unique, at most one child $Y \in Ch(X)$ is an ancestor of $X_R$. If such a $Y$ exists, we repeat the above procedure on the directed subtree rooted at $Y$ and set the weights of all other variables to zero. Lastly, if none of the children $Y \in Ch(X)$ appears to be an ancestor of $X_R$, then $X$ itself must be $X_R$. We again intervene on the central node at the beginning of each round of Algorithm 4 and intervene on its children if necessary. By the definition of a central node, in every round we either eliminate at least half of the nodes or finish the search for $X_R$. Thus, stage 2 finishes within $O(d \log_2 n)$ single-node interventions on $\tilde{T}_0$, since the central node in each round has at most $d$ children.

For our example, stage 2 identifies $X_3$ as the central node, as shown in Figure 1h. With high probability we conclude that $X_3$ is indeed an ancestor of $X_7 := X_R$, so stage 2 continues by intervening on its children $X_6$ and $X_7$ (Figure 1i). The intervention results show that the reward test also fires on $X_7$ (which is $X_R$ itself), so $X_3$ and $X_6$ can be removed. Finally, stage 2 outputs $X_7$ with high probability (Figure 1j).

**Stage 3.** Once the first two stages output a variable, we simply apply the UCB algorithm over the reduced intervention set defined in Algorithm 1.
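Stage 3 is a standard UCB run over the $K$ interventions on $X_R$. A minimal UCB1-style sketch (our own illustration, reusing the hypothetical `sample_reward` callback from above):

```python
import math
import numpy as np

def ucb_stage3(sample_reward, x_r, K, T2):
    """Stage 3 sketch: UCB1 over the reduced arm set {do(X_R = k)}, k = 1..K."""
    counts = np.zeros(K)
    means = np.zeros(K)
    for t in range(1, T2 + 1):
        if t <= K:
            k = t - 1                          # pull each arm once first
        else:
            bonus = np.sqrt(2 * math.log(t) / counts)
            k = int(np.argmax(means + bonus))  # optimism in the face of uncertainty
        r = sample_reward(x_r, k + 1)          # intervene do(X_R = k + 1)
        counts[k] += 1
        means[k] += (r - means[k]) / counts[k]
    return means
```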
### 4.2 Extension of CN-UCB to causal forests

Generalizing CN-UCB to causal forests as defined in Definition 4 is straightforward. (Note that in our setting, a causal forest is a type of causal graph; this should not be confused with the causal forest machine learning method, which is related to random forests.) The causal forest version of CN-UCB is presented in Algorithm 5 (Section B). We simply run stages 1 and 2 of Algorithm 1 on every chain component (a tree) of the causal forest's essential graph until stage 2 finds the reward generating variable $X_R$ with high probability. Then a standard UCB algorithm is applied on the reduced intervention set corresponding to $X_R$. The regret guarantee for Algorithm 5 is given in Theorem 2.

**Theorem 2 (Regret Guarantee for Algorithm 5).** Define $C(D)$ as the number of chain components of $E(D)$. Suppose we run Algorithm 5 (Section B) with $T \ge 2KB(d + C(D))\log_2 n =: T_1$ total interventions and let $T_2 = T - T_1$, where $0 < \delta < 1$ and $B = \max\left\{\frac{32}{\Delta^2}\log\frac{8nK}{\delta},\ \frac{2}{\varepsilon^2}\log\frac{8n^2K^2}{\delta}\right\}$. Then under Assumptions 1, 2, and 3, with probability at least $1 - \delta$, we have
$$R_T = \tilde{O}\left(K \max\left\{\frac{1}{\Delta^2}, \frac{1}{\varepsilon^2}\right\}(d + C(D))(\log n)^2 + \sqrt{KT_2}\right),$$
where $\tilde{O}$ ignores poly-log terms not involving $n$ and $d := \max_{T_0 \in CC(E(D))} d_{\max}(T_0)$.

## 5 Extension of CN-UCB to a general class of causal graphs

In this section, we extend CN-UCB to more general causal graphs. Our approach searches for the reward generating node $X_R$ within each undirected chain component of $E(D)$. In order to use the central node idea, we construct tree structures on each component.

### 5.1 Preliminaries for the general graph class

Before describing our algorithm, we review the definitions of clique (junction) trees and clique graphs of an undirected graph $G$, following Squires et al. (2020).

**Definition 5 (Clique).** A clique $C \subseteq V(G)$ is a subset of nodes in which every pair of nodes is connected by an edge. A clique $C$ is maximal if for every $v \in V(G) \setminus C$, $C \cup \{v\}$ is not a clique. The set of maximal cliques of $G$ is denoted by $\mathcal{C}(G)$, and the clique number of $G$ is $\omega(G) = \max_{C \in \mathcal{C}(G)} |C|$, where $|C|$ denotes the number of nodes in clique $C$. We use $|\mathcal{C}(G)|$ to denote the number of cliques in $\mathcal{C}(G)$.

**Definition 6 (Clique Tree (Junction Tree)).** A clique tree $T_G$ for $G$ is a tree with the maximal cliques $\mathcal{C}(G)$ as vertices that satisfies the junction tree property: for every $v \in V(G)$, the induced subgraph on the set of cliques containing $v$ is connected (a subtree). We denote the set of clique trees of $G$ by $\mathcal{T}(G)$.

**Definition 7 (Clique Graph).** The clique graph $\Gamma_G$ is the graph union of $\mathcal{T}(G)$, i.e., $V(\Gamma_G) = \mathcal{C}(G)$ and $U(\Gamma_G) = \bigcup_{T \in \mathcal{T}(G)} U(T)$, where $U(\cdot)$ denotes the set of undirected edges.
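Definitions 5 and 6 can be made concrete for chordal graphs: a maximum-weight spanning tree over clique intersections yields a clique tree satisfying the junction tree property. A minimal networkx sketch, our own illustration rather than the paper's construction:

```python
import networkx as nx

def clique_tree(G: nx.Graph) -> nx.Graph:
    """Build a clique tree (Definition 6) of a chordal graph G.

    Nodes of the returned tree are frozensets of vertices (maximal cliques);
    for chordal graphs, a maximum-weight spanning tree of the clique
    intersection graph satisfies the junction tree property.
    """
    cliques = [frozenset(c) for c in nx.find_cliques(G)]
    T = nx.Graph()
    T.add_nodes_from(cliques)
    for i, c1 in enumerate(cliques):
        for c2 in cliques[i + 1:]:
            w = len(c1 & c2)
            if w > 0:
                T.add_edge(c1, c2, weight=w)
    return nx.maximum_spanning_tree(T)
```

Recent versions of networkx also ship a `junction_tree` routine, which can serve as a cross-check on this sketch.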
**Definition 8 (Directed Clique Tree).** A directed clique tree $T_D$ of a moral DAG $D$ has the same vertices and adjacencies as a clique tree $T_G$ of $G = \mathrm{skeleton}(D)$. We use asterisks as wildcards for edge marks, e.g., $X_i \,{\ast}\text{--}{\ast}\, X_j$ denotes either $X_i \to X_j$ or $X_i \leftarrow X_j$. For each ordered pair of adjacent cliques $C_1, C_2$, we orient the edge mark at $C_2$ as follows: 1) $C_1 \,{\ast}\!\to C_2$ if for $v_{12} \in C_1 \cap C_2$ and $v_2 \in C_2 \setminus C_1$ we have $v_{12} \to v_2$ in the DAG $D$; or 2) $C_1 \,{\ast}\text{--}\, C_2$ otherwise.

**Definition 9 (Directed Clique Graph).** The directed clique graph $\Gamma_D$ of a moral DAG $D$ is the graph union of all directed clique trees of $D$. The graph union procedure is the same as in Definition 8.

**Definition 10 (Intersection Comparable (Squires et al., 2020)).** A pair of edges $C_1 - C_2$ and $C_2 - C_3$ in a clique graph is intersection comparable if $C_1 \cap C_2 \subseteq C_2 \cap C_3$ or $C_1 \cap C_2 \supseteq C_2 \cap C_3$. A clique graph is intersection-incomparable if the intersection-comparable relation holds for no pair of its edges.

Intersection-incomparable chordal graphs were introduced as "uniquely representable chordal graphs" in Kumar and Madhavan (2002). This class includes many familiar graph classes such as causal trees, causal forests, and proper interval graphs (Squires et al., 2020). In this section, we design our algorithm for general causal graphs; if the clique graph $\Gamma_G$ of every chain component $G \in CC(E(D))$ is intersection-incomparable, we prove that our algorithm achieves an improved regret guarantee.

### 5.2 CN-UCB for general causal graphs

Our approach for causal bandit problems with general causal graphs is similar to the previous settings: we find a subset of variables $\mathbf{X}_{\mathrm{sub}}$ that contains the true reward generating node $X_R$ with high probability and then apply a UCB algorithm to interventions over $\mathbf{X}_{\mathrm{sub}}$. Our method is presented in Algorithm 2.

**Algorithm 2** CN-UCB for General Causal Graphs
1: Input: essential graph $E(D)$, $K$, $\Delta$, $\varepsilon$, $B$, $T_2$, observational probabilities $P(\mathbf{X})$.
2: Initialize: found $\leftarrow$ False.
3: Perform $do()$ for $B$ times; collect reward data $R_1, \ldots, R_B$; $\hat{R} \leftarrow \frac{1}{B}\sum_{b=1}^{B} R_b$.
4: for $G \in CC(E(D))$ do
5: Stage 1: create a junction tree $T_G$ and find a directed sub-junction-tree that contains $X_R$: $\tilde{T}_G \leftarrow$ FindSubJunctionTree($G$, $T_G$, $K$, $B$, $\varepsilon$, $\Delta$, $\hat{R}$). // call Algorithm 7 in Section B.
6: if $\tilde{T}_G$ is not empty then
7: Stage 2: find a clique in $\tilde{T}_G$ that contains $X_R$: $C_0 \leftarrow$ FindKeyClique($\tilde{T}_G$, $K$, $B$, $\Delta$, $\hat{R}$). // call Algorithm 8 in Section B.
8: Stage 3: apply the UCB algorithm on $\mathcal{A}_{C_0} = \{do(X = k) \mid X \in V(C_0),\ k = 1, \ldots, K\}$ for $T_2$ rounds.
9: return Finished.
10: end if
11: end for
12: return No clique containing $X_R$ was found; apply the UCB algorithm on the entire action space $\mathcal{A} = \{do(X = x) \mid X \in \mathbf{X},\ x \in [K]\}$.

We search for $X_R$ in every chain component $G \in CC(E(D))$ separately. In stage 1, we create a junction tree $T_G$ and search for a directed sub-junction-tree $\tilde{T}_G$ that contains $X_R$ via Algorithm 7. If such a $\tilde{T}_G$ exists, we keep searching in stage 2 for a clique $C_0$ in $\tilde{T}_G$ that contains $X_R$ via Algorithm 8, and then apply a UCB algorithm to single-node interventions over the variables in $C_0$. Otherwise, we repeat the above procedure on the next chain component.

Stages 1 and 2 of Algorithm 2 also use the idea of central node interventions. Different from the previous settings, however, Algorithms 7 and 8 proceed by intervening on the variables of a central clique that changes over time, and every central clique intervention eliminates at least half of the cliques of $T_G$ or $\tilde{T}_G$ from future searching. A central clique of a junction tree is defined in the same way as a central node of a tree: we simply treat the cliques of the junction tree as the "nodes" in Definition 2.

**Remark on the intersection-incomparable property.** We now explain why we need the intersection-incomparable property, like other central-node-based algorithms that involve constructing clique graphs (Squires et al., 2020). In general, a directed junction tree $T$ may have v-structures over cliques, which can prevent us from finding a directed sub-junction-tree containing $X_R$ by intervening on variables in only $O(\log |T|)$ cliques, where $|T|$ denotes the number of cliques of $T$. The reason is that a clique may have more than one parent clique in $T$, so we cannot eliminate more than half of the cliques by intervening on the central clique. In total, we may then need to perform $O(n)$ single-node interventions to find $X_R$ and hence incur $O(n)$ regret. However, for every chain component $G$ of the essential graph, if its clique graph $\Gamma_G$ is intersection-incomparable, then no junction tree $T_G$ of $G$ has a v-structure over cliques (Squires et al., 2020), i.e., every clique has at most one parent clique.
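Definition 10 is easy to check directly. A small sketch (our own helper; clique-graph nodes are frozensets of vertices, e.g., as produced by the clique-tree sketch above):

```python
def is_intersection_incomparable(clique_graph):
    """Check Definition 10 on an undirected clique graph (networkx Graph whose
    nodes are frozensets): no pair of adjacent edges C1-C2 and C2-C3 may have
    nested intersections, i.e. C1&C2 contained in C2&C3 or vice versa."""
    for c2 in clique_graph.nodes:
        nbrs = list(clique_graph.neighbors(c2))
        for i, c1 in enumerate(nbrs):
            for c3 in nbrs[i + 1:]:
                s12, s23 = c1 & c2, c2 & c3
                if s12 <= s23 or s23 <= s12:  # frozenset subset tests
                    return False
    return True
```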
Algorithm 2 accepts any graph as input, whether or not the intersection-incomparable property holds. If the property does not hold, Algorithm 2 may output nothing at line 5 or line 7 for every component $G \in CC(E(D))$ (by design, it never outputs an incorrect sub-junction-tree or clique). Thus, if the learner finds that Algorithm 2 outputs nothing after line 5 or 7 for all components, a standard UCB algorithm can be used on the entire action set (see line 12 of Algorithm 2).

**Theorem 3 (Regret guarantee for Algorithm 2 when $\Gamma_G$ is intersection-incomparable for every $G \in CC(E(D))$).** Let $0 < \delta < 1$ and $B = \max\left\{\frac{32}{\Delta^2}\log\frac{8nK}{\delta},\ \frac{2}{\varepsilon^2}\log\frac{8n^2K^2}{\delta}\right\}$. Suppose we run Algorithm 2 for
$$T \ge B + KB\log_2 n\left(\omega(G_R) + d(T_{G_R})\,\omega(G_R) + \sum_{G \in CC(E(D))}\omega(G)\right) =: T_1$$
total interventions and let $T_2 := T - T_1$. Then under Assumptions 1, 2, and 3, with probability at least $1 - \delta$, we have
$$R_T = \tilde{O}\left(K \max\left\{\frac{1}{\Delta^2}, \frac{1}{\varepsilon^2}\right\}\left(d(T_{G_R})\,\omega(G_R) + \sum_{G \in CC(E(D))}\omega(G)\right)\log n \,\log C_{\max} + \sqrt{KT_2}\right),$$
where $d(T_{G_R})$ denotes the maximum degree of the junction tree $T_{G_R}$, $G_R$ denotes the chain component containing the true $X_R$, and $C_{\max} := \max_{G \in CC(E(D))} |\mathcal{C}(G)|$. Here $\tilde{O}(\cdot)$ ignores poly-log terms not involving $n$ or the number of cliques.

The above regret bound shows that our algorithm outperforms standard MAB algorithms especially when $n$ is large and the degree $d(T_{G_R})$ and the clique numbers $\omega(G)$ are small. Moreover, the above result reduces to Theorem 2 when $D$ is a causal forest, because $\omega(G) = 2$ for tree-structured chain components.

## 6 Lower bounds

In this section we show that without Assumption 2 or 3, any algorithm incurs $\Omega(\sqrt{nKT})$ regret in the worst case, which is exponentially worse than our results in terms of $n$.

**Definition 11 ($nK$-arm Gaussian Causal Bandit Class).** For a bandit instance in the $nK$-arm Gaussian causal bandit class, the actions are single-node interventions over $n$ variables $\mathbf{X} = \{X_1, \ldots, X_n\}$: $\mathcal{A} = \{do(X_i = x) \mid x \in [K],\ i = 1, \ldots, n\}$ with $|\mathcal{A}| = nK$. The reward of every action is Gaussian distributed and is directly influenced by one of the variables in $\mathbf{X}$.

**Theorem 4 (Minimax Lower Bound Without Assumption 2).** Let $\mathcal{E}$ be a set of $nK$-armed Gaussian causal bandits (Definition 11) whose instances do not satisfy Assumption 2. Then the minimax regret satisfies $\inf_{\pi \in \Pi} \sup_{\nu \in \mathcal{E}} E[R_T(\pi, \nu)] = \Omega(\sqrt{nKT})$, where $\Pi$ denotes the set of all policies.

**Theorem 5 (Minimax Lower Bound When Assumption 2 Holds but Assumption 3 Does Not).** Let $\mathcal{E}$ be a set of $nK$-armed Gaussian causal bandits (Definition 11) whose instances satisfy Assumption 2 but not Assumption 3. Then the minimax regret satisfies $\inf_{\pi \in \Pi} \sup_{\nu \in \mathcal{E}} E[R_T(\pi, \nu)] = \Omega(\sqrt{nKT})$, where $\Pi$ denotes the set of all policies.

## 7 Discussion

In this paper, we studied causal bandit problems with unknown causal graph structure. We proposed an efficient algorithm, CN-UCB, for the causal tree setting. The regret of CN-UCB scales logarithmically with the number of variables $n$ when the size of the intervention set is $nK$. CN-UCB was then extended to broader settings where the underlying causal graph is a causal forest or belongs to a general class of graphs. For the latter two settings, our regret again scales only logarithmically with $n$. Lastly, we provided lower bound results justifying the necessity of our assumptions on causal effect identifiability and reward identifiability.
Our approaches are the first causal bandit algorithms that do not rely on knowing the causal graph in advance and can still outperform standard MAB algorithms. There are several directions for future work. First, one can generalize CN-UCB to the multiple reward generating node setting, i.e., where more than one variable directly influences the reward. We expect this can be done by repeatedly applying stages 1 and 2 of CN-UCB to find all the reward generating nodes one by one and then applying a standard MAB algorithm on the reduced intervention set. In this setting, it is natural to consider interventions that act on more than one variable. Second, it would be interesting to develop instance-dependent regret bounds, for example by estimating the probabilities over the causal graph within CN-UCB. Lastly, one can also develop efficient algorithms that do not need to know the causal graph when confounders exist.

**Acknowledgements.** We thank our NeurIPS reviewers and meta-reviewer for helpful suggestions to improve the paper.

**Funding transparency statement.** Funding (financial activities supporting the submitted work): NSF CAREER grant IIS-1452099 and an Adobe Data Science Research Award. Competing interests (financial activities outside the submitted work): none.

## References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In NIPS, volume 11, pages 2312–2320.

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2012). Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pages 1–9. PMLR.

Agrawal, S. and Goyal, N. (2012). Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39.1–39.26.

Agrawal, S. and Goyal, N. (2013a). Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99–107. PMLR.

Agrawal, S. and Goyal, N. (2013b). Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135.

Andersson, S. A., Madigan, D., Perlman, M. D., et al. (1997). A characterization of Markov equivalence classes for acyclic digraphs. Annals of Statistics, 25(2):505–541.

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256.

Burgos, E., Ceva, H., Hernández, L., Perazzo, R. P., Devoto, M., and Medan, D. (2008). Two classes of bipartite networks: nested biological and social systems. Physical Review E, 78(4):046113.

Cesa-Bianchi, N. and Lugosi, G. (2012). Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422.

Combes, R., Talebi, M. S., Proutiere, A., and Lelarge, M. (2015). Combinatorial bandits revisited. arXiv preprint arXiv:1502.03475.

de Kroon, A. A., Belgrave, D., and Mooij, J. M. (2020). Causal discovery for causal bandits utilizing separating sets. arXiv preprint arXiv:2009.07916.

Eberhardt, F. (2017). Introduction to the foundations of causal discovery. International Journal of Data Science and Analytics, 3(2):81–91.

Greenewald, K., Katz, D., Shanmugam, K., Magliacane, S., Kocaoglu, M., Adsera, E. B., and Bresler, G. (2019). Sample efficient active learning of causal trees. In Advances in Neural Information Processing Systems, pages 14302–14312.

He, Y.-B. and Geng, Z. (2008). Active learning of causal networks with intervention experiments and optimal designs.
Journal of Machine Learning Research, 9(Nov):2523–2547.

Hyttinen, A., Eberhardt, F., and Hoyer, P. O. (2013). Experiment selection for causal discovery. Journal of Machine Learning Research, 14(1):3041–3071.

Jordan, C. (1869). Sur les assemblages de lignes. Journal für die reine und angewandte Mathematik, 70:185–190.

Kocaoglu, M., Dimakis, A., and Vishwanath, S. (2017). Cost-optimal learning of causal graphs. In International Conference on Machine Learning, pages 1875–1884. PMLR.

Kontou, P. I., Pavlopoulou, A., Dimou, N. L., Pavlopoulos, G. A., and Bagos, P. G. (2016). Network analysis of genes and their association with diseases. Gene, 590(1):68–78.

Kumar, P. S. and Madhavan, C. V. (2002). Clique tree generalization and new subclasses of chordal graphs. Discrete Applied Mathematics, 117(1-3):109–131.

Lattimore, F., Lattimore, T., and Reid, M. D. (2016). Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems, pages 1181–1189.

Lattimore, T. and Szepesvári, C. (2018). Bandit Algorithms. Preprint.

Lee, S. and Bareinboim, E. (2018). Structural causal bandits: Where to intervene? In Advances in Neural Information Processing Systems, pages 2568–2578.

Lee, S. and Bareinboim, E. (2019). Structural causal bandits with non-manipulable variables. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4164–4172.

Lindgren, E. M., Kocaoglu, M., Dimakis, A. G., and Vishwanath, S. (2018). Experimental design for cost-aware learning of causal graphs. arXiv preprint arXiv:1810.11867.

Liu, S., See, K. C., Ngiam, K. Y., Celi, L. A., Sun, X., and Feng, M. (2020). Reinforcement learning for clinical decision support in critical care: comprehensive review. Journal of Medical Internet Research, 22(7):e18477.

Lu, Y., Meisami, A., Tewari, A., and Yan, Z. (2019). Regret analysis of causal bandit problems. arXiv preprint arXiv:1910.04938.

Nair, V., Patil, V., and Sinha, G. (2021). Budgeted and non-budgeted causal bandits. In International Conference on Artificial Intelligence and Statistics, pages 2017–2025. PMLR.

Pavlopoulos, G. A., Kontou, P. I., Pavlopoulou, A., Bouyioukos, C., Markou, E., and Bagos, P. G. (2018). Bipartite graphs in systems biology and medicine: a survey of methods and applications. GigaScience, 7(4):giy014.

Peters, J., Mooij, J., Janzing, D., and Schölkopf, B. (2012). Identifiability of causal graphs using functional models. arXiv preprint arXiv:1202.3757.

Sen, R., Shanmugam, K., Dimakis, A. G., and Shakkottai, S. (2017). Identifying best interventions through online importance sampling. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3057–3066. JMLR.org.

Shanmugam, K., Kocaoglu, M., Dimakis, A. G., and Vishwanath, S. (2015). Learning causal graphs with small interventions. arXiv preprint arXiv:1511.00041.

Squires, C., Magliacane, S., Greenewald, K., Katz, D., Kocaoglu, M., and Shanmugam, K. (2020). Active structure learning of causal DAGs via directed clique trees. arXiv preprint arXiv:2011.00641.

Wikipedia contributors (2021). Genetic engineering - Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Genetic_engineering&oldid=1018922014. [Online; accessed 26 April 2021].