Causal Bandits with Propagating Inference

Akihiro Yabe¹, Daisuke Hatano², Hanna Sumita³, Shinji Ito¹, Naonori Kakimura⁴, Takuro Fukunaga², Ken-ichi Kawarabayashi⁵

Abstract

Bandit is a framework for designing sequential experiments, where a learner selects an arm $A \in \mathcal{A}$ and obtains an observation corresponding to $A$ in each experiment. Theoretically, the tight regret lower-bound for the general bandit is polynomial with respect to the number of arms $|\mathcal{A}|$; thus, to overcome this bound, the bandit problem with side-information is often considered. Recently, a bandit framework over a causal graph was introduced, where the structure of the causal graph is available as side-information and the arms are identified with interventions on the causal graph. Existing algorithms for the causal bandit overcame the $\Omega(\sqrt{|\mathcal{A}|/T})$ simple-regret lower-bound; however, those algorithms work only when the interventions $\mathcal{A}$ are localized around a single node (i.e., an intervention propagates only to its neighbors). We propose a novel causal bandit algorithm for an arbitrary set of interventions, which can propagate throughout the causal graph. We also show that it achieves an $O(\sqrt{\gamma \log(|\mathcal{A}|T)/T})$ regret bound, where $\gamma$ is determined by the structure of the causal graph. In particular, if the maximum in-degree of the causal graph is a constant, then $\gamma = O(N^2)$, where $N$ is the number of nodes.

¹NEC Corporation, Japan, ²RIKEN AIP, Japan, ³Tokyo Metropolitan University, Japan, ⁴Keio University, Japan, ⁵National Institute of Informatics, Japan. Correspondence to: Akihiro Yabe.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

1. Introduction

Multi-armed bandit has been widely recognized as a standard framework for modeling online learning with a limited number of observations. In each round of the bandit problem, a learner chooses an arm $A$ from given candidates $\mathcal{A}$ and obtains a corresponding observation. Since observation is limited, the learner must adopt an efficient strategy for exploring the optimal arm $A^* \in \mathcal{A}$. The efficiency of a strategy is measured by regret, and the theoretically tight lower-bound scales as $\Omega(\sqrt{|\mathcal{A}|})$ with respect to the number of arms $|\mathcal{A}|$ in the general multi-armed bandit setting. Thus, in order to improve upon this lower bound, one requires additional information in the bandit setting. For example, contextual bandit (Agarwal et al., 2014; Auer et al., 2002) is a well-known class of bandit problems with side information on domain-expert knowledge. For this setting, there is a regret bound of $O(\sqrt{\log |\mathcal{A}|})$, logarithmic with respect to the number of arms. In this paper, we also achieve an $O(\sqrt{\log |\mathcal{A}|})$ regret bound for a novel class of bandit problems with side information. To this end, let us introduce our bandit setting in detail.

The causal graph (Pearl, 2009) is a well-known tool for modeling a variety of real problems, including computational advertising (Bottou et al., 2013), genetics (Meinshausen et al., 2016), agriculture (Splawa-Neyman et al., 1990), and marketing (Kim et al., 2008). Based on causal graph discovery studies (Eberhardt et al., 2005; Hauser & Bühlmann, 2014; Hu et al., 2014; Shanmugam et al., 2015), Lattimore et al. (2016) recently introduced the causal bandit framework. They consider the problem of finding the best intervention, i.e., the one that causes the most desirable propagation of a probability distribution over a given causal graph, with a limited number of experiments $T$.

In this setting, the arms are identified with interventions $\mathcal{A}$ on the causal graph. A set of binary random variables $V_1, V_2, \ldots, V_N$ is associated with the nodes $v_1, v_2, \ldots, v_N$ of the causal graph. At each round of an experiment, a learner selects an intervention $A \in \mathcal{A} \subseteq \{0, 1, *\}^N$, which enforces the realization of a variable $V_i$ to be $A_i$ whenever $A_i \in \{0, 1\}$. The effect of the intervention then propagates throughout the causal graph through the edges, and a realization $\omega \in \{0,1\}^N$ over all nodes is observed after the propagation. The goal of the causal bandit problem is to control the realization of a target variable $V_N$ with an optimal intervention.
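To make this interaction concrete, the following is a minimal sketch of the simple-regret protocol described above. It is our own illustration rather than code from the paper, and the environment and learner interfaces (`env.experiment`, `env.mu`, `learner.choose`, `learner.update`, `learner.recommend`) are hypothetical placeholders.

```python
# Illustrative sketch of the causal-bandit protocol; all interfaces are hypothetical.
def run_simple_regret_protocol(env, arms, learner, T):
    """Run T experiments, then return the recommended arm and its simple regret."""
    for t in range(T):
        A = learner.choose(arms, t)    # pick an intervention A in {0, 1, *}^N
        omega = env.experiment(A)      # realization omega in {0,1}^N after propagation
        learner.update(A, omega)       # update the learner's internal estimates
    A_hat = learner.recommend(arms)    # recommendation once the budget of T rounds is spent
    regret = max(env.mu(A) for A in arms) - env.mu(A_hat)  # mu(A) = Prob(V_N = 1 | do(A))
    return A_hat, regret
```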
Figure 1 is an illustrative example of the causal bandit problem. In the figure, the four nodes on the right represent a consumer decision-making model in e-commerce borrowed from (Kim et al., 2008). This model assumes that customers make a decision to purchase based on their perceived risk in an online transaction (e.g., a defective product), the consumer's trust of a web vendor, and the perceived benefit of e-commerce (e.g., increased convenience). Consumer trust influences perceived risk. Here, we consider controlling customers' behavior by two kinds of advertising, which correspond to adding two nodes (Ad A and Ad B) to be intervened into the model. Ad A can change only the reliability of the website; that is, it can influence the decisions of customers only in an indirect way through the middle nodes. In contrast, Ad B can change the perceived benefit. The aim is to increase the number of purchases by consumers by choosing an effective advertisement. This is indeed a bandit problem over a causal graph.

The work in (Lattimore et al., 2016) considered the causal bandit problem of minimizing simple regret and offered an improved regret bound over the aforementioned tight lower-bound $\Omega(\sqrt{|\mathcal{A}|/T})$ (Audibert & Bubeck, 2010, Theorem 4) for the general bandit setting (Audibert & Bubeck, 2010; Gabillon et al., 2012). Sen et al. (2017) extended this study by incorporating smooth interventions, and they provided a new regret bound parameterized by the performance gap between the optimal and sub-optimal arms. This parameterized bound comes from a technique developed for the general multi-armed bandit problem (Audibert & Bubeck, 2010). These analyses, however, only work for a special class of interventions with known true parameters. Indeed, they only consider localized interventions.

Main contribution This paper proposes the first algorithm for the causal bandit problem with an arbitrary set of interventions (which can propagate throughout the causal graph) and with a theoretically guaranteed simple-regret bound. The bound is $O(\sqrt{\gamma \log(|\mathcal{A}|T)/T})$, where $\gamma$ is a parameter bounded on the basis of the graph structure. In particular, $\gamma = O(N^2)$ if the maximum in-degree of the causal graph is bounded by a constant, where $N$ is the number of nodes.

The major difficulty in dealing with arbitrary interventions comes from the accumulation and propagation of estimation error. Existing studies consider interventions that only affect the parents $P_k$ of a single node $v_k$. To estimate the relationship between $P_k$ and $V_k$ in this setting, we could apply an efficient importance sampling algorithm (Bottou et al., 2013; Lattimore et al., 2016). On the other hand, when we intervene on arbitrary nodes, the intervention can affect the probabilistic propagation mechanism in any part of the causal graph.
Hence, we cannot directly control the realization of intermediate nodes when designing efficient experiments.

The proposed algorithm consists of two steps. First, the preprocessing step is devoted to estimating parameters for designing the efficient experiments used in the main step. More precisely, we focus on the estimation of parameters with bounded relative error. By truncating small parameters that are negligible but tend to have large relative error, we manage to avoid the accumulation of estimation error. In the main step, we apply an importance sampling approach introduced in (Lattimore et al., 2016; Sen et al., 2017) on the basis of the estimated parameters with a guaranteed relative error. This step allows us to estimate parameters with bounded absolute error, which results in the desired regret bound. Owing to space limitations, all proofs are omitted; they can be found in the full version of this paper (Yabe et al., 2018).

Figure 1. Simple example of a causal graph (node labels include Perceived risk, Consumer trust, and Perceived benefit).

Related studies Minimizing simple regret in bandit problems is called the best-arm identification (Gabillon et al., 2012; Kaufmann et al., 2016) or pure exploration (Bottou et al., 2013) problem, and it has been extensively studied in the machine learning research community. The inference of a causal graph structure is also well studied and can be classified into causal graph discovery and causal inference: causal graph discovery (Eberhardt et al., 2005; Hauser & Bühlmann, 2014; Hu et al., 2014; Shanmugam et al., 2015) considers efficient experiments for determining the structure of a causal graph, while causal inference (Mooij et al., 2016; Pearl, 2009; Shimizu et al., 2011; Spirtes & Glymour, 1991) aims to determine the graph structure only from historical data without additional experiments. The causal bandit problem designs experiments without using historical data, which makes it more closely related to the causal graph discovery studies.

2. Causal bandit problem

This section introduces the causal bandit problem proposed by Lattimore et al. (2016). Let $G = (V, E)$ be a directed acyclic graph (DAG) with a node set $V = \{v_1, v_2, \ldots, v_N\}$ and a (directed) edge set $E$. Let $(v_i, v_j)$ denote an edge from $v_i$ to $v_j$. Without loss of generality, we suppose that the nodes in $V$ are topologically sorted so that no edge from $v_i$ to $v_j$ exists if $i \geq j$. For each $n = 1, \ldots, N$, let $P_n$ denote the index set of the parents of $v_n$, i.e., $P_n = \{i \in \{1, \ldots, n-1\} : (v_i, v_n) \in E\}$. We then define $\overline{P}_n = P_n \cup \{n\}$.

Each node $v_n \in V$ is associated with a random variable $V_n$, which takes a value in $\{0, 1\}$. The distribution of $V_n$ is influenced by the variables associated with the parents of $v_n$ (unless $V_n$ is intervened, as described below). For each $\pi \in \{0,1\}^{\overline{P}_n}$, the parameter $\alpha_n(\pi)$ defined below characterizes the distribution of $V_n$ given the realizations of its parents:
$$\alpha_n(\pi) := \mathrm{Prob}\left(V_n = \pi_n \mid V_i = \pi_i \text{ for all } i \in P_n,\ v_n \text{ is not intervened}\right).$$
That is to say, if the parents $v_i$ for $i \in P_n$ are realized as $\pi_i$, then $V_n = \pi_n$ with probability $\alpha_n(\pi)$, and $V_n = 1 - \pi_n$ with probability $1 - \alpha_n(\pi)$.

Together with the DAG, we are also given a set $\mathcal{A}$ of interventions. Each intervention is identified with a vector $A \in \{*, 0, 1\}^N$, where $A_n \neq *$ implies that $V_n$ is intervened and that the realization of $V_n$ is fixed to $A_n$. Let $\pi \in \{0,1\}^{\overline{P}_n}$.
Given an intervention $A \in \mathcal{A}$ and realizations $\pi_i$ over the parents $i \in P_n$, the probability that $V_n = \pi_n$ holds is then determined as follows:
$$\mathrm{Prob}\left(V_n = \pi_n \mid V_i = \pi_i \text{ for all } i \in P_n,\ \mathrm{do}(A)\right) = \begin{cases} \alpha_n(\pi) & \text{if } A_n = *, \\ 1 & \text{if } A_n = \pi_n, \\ 0 & \text{if } A_n = 1 - \pi_n. \end{cases}$$
This equality, together with the structure of the causal graph $G$, completely determines the joint distribution over the variables $V_1, V_2, \ldots, V_N$ under an arbitrary intervention $A \in \mathcal{A}$.

In the causal bandit problem, we are given a DAG $G = (V, E)$ and a set $\mathcal{A}$ of interventions. However, the parameters $\alpha_n$ ($n = 1, \ldots, N$) are not known. Our ideal goal is then to find an intervention $A^* \in \mathcal{A}$ that maximizes the probability $\mu(A)$ of realizing $V_N = 1$, where $\mu : \mathcal{A} \to [0, 1]$ is defined by $\mu(A) := \mathrm{Prob}(V_N = 1 \mid \mathrm{do}(A))$ for each $A \in \mathcal{A}$. For this purpose, we discuss algorithms of the following form. First, the algorithm estimates $\mu(A)$ ($A \in \mathcal{A}$) from $T$ experimental trials, where each experiment consists of the application of an intervention and the observation of a realization $\pi \in \{0,1\}^N$ over all nodes; let $\hat\mu(A)$ denote the estimate of $\mu(A)$. Second, the algorithm selects the intervention $\hat{A}$ that maximizes $\hat\mu$. We evaluate the efficiency of such an algorithm by the simple regret $R_T$ defined as
$$R_T = \mu(A^*) - \mathbb{E}[\mu(\hat{A})].$$
Note that, even if the algorithm is deterministic, $\hat{A}$ is stochastic since the observations obtained in each experiment are produced by a stochastic process. In this paper, we assume that $N \geq 3$ and $T \geq 2$ for ease of technical discussion.
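To make the definitions above concrete, here is a minimal sketch, written by us and not taken from the paper, of sampling one realization under a hard intervention and of computing $\mu(A)$ by brute force. For convenience it stores $\mathrm{Prob}(V_n = 1 \mid \text{parents})$ per parent pattern, which equals $\alpha_n(\pi)$ when $\pi_n = 1$ and $1 - \alpha_n(\pi)$ when $\pi_n = 0$; nodes are 0-indexed with the last node as the target.

```python
# Minimal sketch of the causal model of Section 2 (our illustration, not the paper's code).
import itertools
import random

def sample_realization(parents, alpha, A, rng=random):
    """Sample omega in {0,1}^N by propagating through the DAG under do(A)."""
    N = len(parents)
    omega = [0] * N
    for n in range(N):                      # topological order: parents come first
        if A[n] == '*':
            pa = tuple(omega[i] for i in parents[n])
            omega[n] = 1 if rng.random() < alpha[n][pa] else 0
        else:
            omega[n] = A[n]                 # hard intervention fixes the value
    return omega

def mu_exact(parents, alpha, A):
    """Brute-force Prob(V_N = 1 | do(A)); exponential in N, for tiny graphs only."""
    N = len(parents)
    total = 0.0
    for bits in itertools.product([0, 1], repeat=N):
        if bits[N - 1] != 1:
            continue                        # only realizations with V_N = 1 contribute
        if any(A[n] != '*' and bits[n] != A[n] for n in range(N)):
            continue                        # inconsistent with the intervention
        p = 1.0
        for n in range(N):
            if A[n] != '*':
                continue                    # intervened nodes contribute no factor
            pa = tuple(bits[i] for i in parents[n])
            q = alpha[n][pa]                # Prob(V_n = 1 | parents)
            p *= q if bits[n] == 1 else 1.0 - q
        total += p
    return total

# Toy example: v1 -> v2 -> v3 (target), plus v1 -> v3.
parents = [[], [0], [0, 1]]
alpha = [{(): 0.5},
         {(0,): 0.2, (1,): 0.9},
         {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.4, (1, 1): 0.95}]
print(mu_exact(parents, alpha, ('*', 1, '*')))   # mu(A) for the intervention do(V_2 = 1)
```

The brute-force computation of $\mu$ is exponential in $N$ and is shown only to make the definition concrete; the algorithm below cannot use it because the parameters $\alpha_n$ are unknown.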
3. Proposed Algorithm

In this section, we propose an algorithm for the causal bandit problem and present a regret bound for it. Let $C_n = 2^{|P_n|}$ for each $n = 1, \ldots, N$, and let $C = \sum_{n=1}^{N} C_n$. For $S \subseteq S' \subseteq [1, N]$ and $\pi \in \{0,1\}^{S'}$, let $\pi_S$ denote the restriction of $\pi$ onto $S$.

3.1. Outline of the proposed algorithm

Recall that the purpose of the causal bandit problem is to identify an intervention $A^*$ that maximizes $\mu(A^*)$. This task is trivial if $\alpha_n$ is known for all $n = 1, \ldots, N$, because $\mu(A)$ can then be calculated for all $A \in \mathcal{A}$. Let $B(A) = \{\pi \in \{0,1\}^N \mid \pi_i = A_i \text{ if } A_i \neq *,\ \pi_N = 1\}$, and for $n \in [1, N]$, let $I_{n,A}$ denote the set of nodes in $[1, n]$ that are not intervened by $A$, i.e., $I_{n,A} := \{m \in [1, n] \mid A_m = *\}$. Then $\mu(A)$ can be represented as
$$\mu(A) = \sum_{\pi \in B(A)} \prod_{n \in I_{N,A}} \alpha_n(\pi_{\overline{P}_n}).$$
Therefore, in order to compute $\mu$ approximately, our algorithm estimates $\alpha_n$ ($n = 1, \ldots, N$).

In order to estimate $\alpha_n$ efficiently, we need to manipulate the random variables associated with the parents of $v_n$. More concretely, to estimate $\alpha_n(\pi)$ for $\pi \in \{0,1\}^{\overline{P}_n}$, we require samples whose realization $\omega \in \{0,1\}^N$ satisfies $\pi_i = \omega_i$ over the parents $i \in P_n$ of $v_n$. For $n = 1, 2, \ldots, N$, $\pi \in \{0,1\}^{P_n}$, and $A \in \mathcal{A}$, we thus introduce the additional quantity $\beta_n(\pi, A)$ denoting the probability of realizing $\omega$ with $\omega_{P_n} = \pi$ under a given intervention $A$. More precisely, we define
$$\beta_n(\pi, A) := \begin{cases} \mathrm{Prob}\left(V_m = \pi_m \text{ for all } m \in P_n \mid \mathrm{do}(A)\right) & \text{if } A_n = *, \\ 0 & \text{otherwise}. \end{cases}$$

Our algorithm consists of two phases. The first phase estimates $\beta_n$ ($n = 1, \ldots, N$), and the second phase estimates $\alpha_n$ ($n = 1, \ldots, N$). The algorithm spends $T/3$ experiments in the first phase and $2T/3$ experiments in the second phase. In the rest of this section, we explain these phases and then present a regret bound for the algorithm.

3.2. First Phase: Estimation of β

Here we introduce the phase that estimates $\beta_n$ for all $n = 1, \ldots, N$. The pseudo-code of this phase is described in Algorithm 1. Algorithm 1 requires a non-negative number $\lambda$ as a parameter, which will be set to $C^3/N$. We perform $T/3$ experiments in this phase.

Before explaining the details of Algorithm 1, we note that $\beta_n$ can be calculated from $\alpha_1, \ldots, \alpha_{n-1}$. For $\pi \in \{0,1\}^{P_n}$ and $A \in \mathcal{A}$, let $B_n(\pi, A) := \{\pi' \in \{0,1\}^{n-1} \mid \pi'_i = \pi_i \text{ if } i \in P_n,\ \pi'_i = A_i \text{ if } A_i \neq *\}$ denote the set of realizations over $V_1, V_2, \ldots, V_{n-1}$ that are consistent with the realization $\pi$ over $P_n$ and with the intervention $A$. If $A_n = *$, then $\beta_n(\pi, A)$ is described as
$$\beta_n(\pi, A) = \sum_{\pi' \in B_n(\pi, A)} \prod_{m \in I_{n-1,A}} \alpha_m(\pi'_{\overline{P}_m}). \qquad (1)$$

Algorithm 1 Estimation of β
Require: $\lambda$
Ensure: $\hat\beta_n$ ($n = 1, \ldots, N$) and $D$
1: $G_n \leftarrow \emptyset$ for $n = 1, 2, \ldots, N$
2: for $n = 1, 2, \ldots, N$ do
3:   for $\pi \in \{0,1\}^{P_n}$ do
4:     Calculate $\hat\beta_n(\pi, A)$ for each $A \in \mathcal{A}$ by (2)
5:     Calculate $\hat{A}_{n,\pi} = \mathrm{argmax}_{A \in \mathcal{A}} \hat\beta_n(\pi, A)$
6:     $t_n(\pi) \leftarrow 0$ and $t'_n(\pi) \leftarrow 0$
7:     for $j = 1, \ldots, T/(3C)$ do
8:       Conduct an experiment with $\hat{A}_{n,\pi}$ and let $\omega \in \{0,1\}^N$ be the obtained result
9:       $t_n(\pi) \leftarrow t_n(\pi) + 1$ if $\omega_i = \pi_i$ for all $i \in P_n$
10:      $t'_n(\pi) \leftarrow t'_n(\pi) + 1$ if $\omega_i = \pi_i$ for all $i \in P_n$ and $\omega_n = 1$
11:    end for
12:    for $k = 0, 1$ do
13:      Extend $\pi$ to $\pi' \in \{0,1\}^{\overline{P}_n}$ with $\pi'_n = k$
14:      Compute $\check\alpha'_n(\pi')$ by (3)
15:      If (4) holds, then $G_n \leftarrow G_n \cup \{\pi'\}$
16:      Compute $\check\alpha_n(\pi')$ by (5)
17:    end for
18:  end for
19: end for
20: Compute $H_n$ and $D_n$ ($n = 1, 2, \ldots, N$) by (6) and (7)
21: return $\hat\beta_n$ ($n = 1, 2, \ldots, N$) and $D = \{D_n \mid n = 1, 2, \ldots, N\}$

Algorithm 2 Estimation of α
Require: $\hat\beta_n$ ($n = 1, \ldots, N$) and $D$
Ensure: $\hat\alpha_n$ ($n = 1, \ldots, N$)
1: for $n = 1, 2, \ldots, N$ and each $\pi \in \{0,1\}^{P_n}$ do
2:   $t'_n(\pi) \leftarrow 0$ and $t''_n(\pi) \leftarrow 0$
3:   Calculate $\hat{A}_{n,\pi} := \mathrm{argmax}_{A \in \mathcal{A}} \hat\beta_n(\pi, A)$
4:   for $j = 1, \ldots, T/(3C)$ do
5:     Conduct an experiment with $\hat{A}_{n,\pi}$ and let $\omega \in \{0,1\}^N$ be the obtained result
6:     for $m = 1, \ldots, N$ with $(\hat{A}_{n,\pi})_m = *$ do
7:       $t'_m(\omega_{P_m}) \leftarrow t'_m(\omega_{P_m}) + 1$
8:       $t''_m(\omega_{P_m}) \leftarrow t''_m(\omega_{P_m}) + 1$ if $\omega_m = 1$
9:     end for
10:  end for
11: end for
12: Compute an optimal solution $\hat\eta$ for (8)
13: for $t = 1, 2, \ldots, T/3$ do
14:   Sample $A_t$ from $U(\hat\eta)$
15:   Conduct an experiment with $A_t$ and let $\omega \in \{0,1\}^N$ be the obtained realization
16:   for $n = 1, \ldots, N$ with $(A_t)_n = *$ do
17:     $t'_n(\omega_{P_n}) \leftarrow t'_n(\omega_{P_n}) + 1$
18:     $t''_n(\omega_{P_n}) \leftarrow t''_n(\omega_{P_n}) + 1$ if $\omega_n = 1$
19:   end for
20: end for
21: for $n = 1, 2, \ldots, N$ and $\pi \in \{0,1\}^{\overline{P}_n}$ do
22:   Compute $\hat\alpha'_n(\pi)$ by (9) and $\hat\alpha_n(\pi)$ by (10)
23: end for
24: return $\hat\alpha_n$

Algorithm 3 Causal Bandit
1: Apply Algorithm 1 with $\lambda = C^3/N$ to obtain $\hat\beta_n$ ($n = 1, \ldots, N$) and $D$
2: Apply Algorithm 2 to obtain $\hat\alpha_n$ ($n = 1, \ldots, N$)
3: Calculate $\hat\mu(A)$ for each $A \in \mathcal{A}$ by (11)
4: return $\hat{A} := \mathrm{argmax}_{A \in \mathcal{A}} \hat\mu(A)$

Algorithm 1 consists of $N$ iterations. The $n$-th iteration computes the following objects: an estimate $\hat\beta_n$ of $\beta_n$, an intervention $\hat{A}_{n,\pi} \in \mathcal{A}$ for each $\pi \in \{0,1\}^{P_n}$, an estimate $\check\alpha_n$ of $\alpha_n$, and a set $G_n \subseteq \{0,1\}^{\overline{P}_n}$. We remark that the $\check\alpha_n$ in Algorithm 1 are used only for computing the estimates $\hat\beta_n$ and are not used for estimating $\mu$; an estimate of $\alpha_n$ is computed in the next phase of our algorithm.

At the beginning of the $n$-th iteration, we compute $\hat\beta_n(\pi, A)$ for each $\pi \in \{0,1\}^{P_n}$ and $A \in \mathcal{A}$ by substituting $\check\alpha_m$ for $\alpha_m$ in (1):
$$\hat\beta_n(\pi, A) = \sum_{\pi' \in B_n(\pi, A)} \prod_{m \in I_{n-1,A}} \check\alpha_m(\pi'_{\overline{P}_m}). \qquad (2)$$
Note that this $\hat\beta_n(\pi, A)$ can be computed once $\check\alpha_m$ ($m = 1, \ldots, n-1$) are available. For each $\pi \in \{0,1\}^{P_n}$, we then identify an intervention $\hat{A}_{n,\pi}$ that attains $\max_{A \in \mathcal{A}} \hat\beta_n(\pi, A)$. Using $\hat{A}_{n,\pi}$, we compute $\check\alpha_n(\pi')$ as follows, where $\pi' \in \{0,1\}^{\overline{P}_n}$ is an extension of $\pi$. We conduct $T/(3C)$ experiments with $\hat{A}_{n,\pi}$. Let $t_n(\pi)$ be the number of experiments among these $T/(3C)$ experiments in which the obtained realization $\omega \in \{0,1\}^N$ satisfies $\omega_i = \pi_i$ for each $i \in P_n$, and let $t'_n(\pi)$ be the number of experiments counted in $t_n(\pi)$ in which $\omega_n = 1$ also holds. We then compute $\check\alpha'_n(\pi')$ by
$$\check\alpha'_n(\pi') = \begin{cases} t'_n(\pi)/t_n(\pi) & \text{if } \pi'_n = 1, \\ 1 - t'_n(\pi)/t_n(\pi) & \text{if } \pi'_n = 0. \end{cases} \qquad (3)$$
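The counting step in (3) can be written compactly as in the following sketch (ours; the `run_experiment` callback is a hypothetical stand-in for conducting one experiment):

```python
def estimate_alpha_check(run_experiment, A_hat, n, parent_idx, pi, num_trials):
    """Counting estimate (3): Prob(V_n = 1 | V_i = pi_i for i in P_n) under do(A_hat).

    run_experiment(A) is a hypothetical callback returning one realization omega
    (a list of 0/1 values); parent_idx lists the indices in P_n, and pi lists the
    desired parent values in the same order.
    """
    t, t_prime = 0, 0
    for _ in range(num_trials):              # num_trials plays the role of T/(3C)
        omega = run_experiment(A_hat)
        if all(omega[i] == p for i, p in zip(parent_idx, pi)):
            t += 1
            t_prime += omega[n]
    # The ratio is alpha-check' for pi'_n = 1; one minus the ratio gives pi'_n = 0.
    return None if t == 0 else t_prime / t
```

The case $t_n(\pi) = 0$ is returned as `None` here only to keep the sketch total; in the algorithm, such poorly observed patterns are exactly the ones handled by the truncation sets $G_n$ and $H_n$ introduced next.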
The vector $\pi' \in \{0,1\}^{\overline{P}_n}$ is added to $G_n$ if
$$\check\alpha'_n(\pi')\,\hat\beta_n(\pi, \hat{A}_{n,\pi}) \leq 2e\,S(\lambda), \qquad (4)$$
where $S(\lambda)$ is defined as
$$S(\lambda) := \frac{12\lambda N^2 C \log T}{T}.$$
This $G_n$ collects those $\pi' \in \{0,1\}^{\overline{P}_n}$ for which $\check\alpha'_n(\pi')$ is too small to estimate $\alpha_n(\pi')$ with sufficient accuracy. Then $\check\alpha_n(\pi')$ is determined by replacing $\check\alpha'_n(\pi')$ with $0$ for $\pi' \in G_n$:
$$\check\alpha_n(\pi') = \begin{cases} \check\alpha'_n(\pi') & \text{if } \pi' \notin G_n, \\ 0 & \text{otherwise}. \end{cases} \qquad (5)$$
This replacement contributes to reducing the relative estimation error of $\hat\beta_{n'}$ in the subsequent iterations ($n' = n+1, \ldots, N$).

After iterating over all $n = 1, 2, \ldots, N$, the algorithm computes $H_n$ and $D_n$ ($n = 1, 2, \ldots, N$) defined by
$$H_n = \left\{\pi \in \{0,1\}^{\overline{P}_n} \,\middle|\, \hat\beta_n(\pi_{P_n}, \hat{A}_{n,\pi_{P_n}}) \leq 8e\,C^2 S(\lambda)\right\}, \qquad (6)$$
$$D_n = G_n \cup H_n. \qquad (7)$$
This $D_n$ contributes to bounding the absolute error of the estimation of $\hat\beta_n(\pi_{P_n})$ for $\pi \notin D_n$. The algorithm returns the estimates $\hat\beta_n$ and the family $D := \{D_n \mid n = 1, 2, \ldots, N\}$.

3.3. Second Phase: Estimation of α

In this phase, our algorithm computes an estimate $\hat\alpha_n$ of $\alpha_n$ for all $n = 1, \ldots, N$. The pseudo-code for this phase is given in Algorithm 2. As input, it receives $\hat\beta_n$ ($n = 1, \ldots, N$) and $D$ from Algorithm 1.

Algorithm 2 consists of two parts. The first part conducts $T/(3C)$ experiments with $\hat{A}_{n,\pi}$ (computed from $\hat\beta_n(\pi, A)$, $A \in \mathcal{A}$) for each $n = 1, \ldots, N$ and $\pi \in \{0,1\}^{P_n}$. This is the same process used to compute $\check\alpha'_n$ in Algorithm 1.

Let $D'_n := \{\pi \in \{0,1\}^{P_n} \mid \pi^0, \pi^1 \in D_n\}$, where $\pi^k$ is the extension of $\pi \in \{0,1\}^{P_n}$ onto $\{0,1\}^{\overline{P}_n}$ with $\pi^k_n = k$. Let us define the set $J_n := \{0,1\}^{P_n} \setminus D'_n$ and the constant $r_{n,\pi} := \hat\beta_n(\pi, \hat{A}_{n,\pi})/C$ for each $n = 1, \ldots, N$ and $\pi \in \{0,1\}^{P_n}$. In the second part, the algorithm solves the following optimization problem:
$$\min_{\eta \in [0,1]^{\mathcal{A}}} \max_{A \in \mathcal{A}} \sum_{n=1}^{N} \sum_{\pi \in J_n} \frac{\hat\beta_n^2(\pi, A)}{\sum_{A' \in \mathcal{A}} \eta_{A'} \hat\beta_n(\pi, A') + r_{n,\pi}} \quad \text{subject to} \quad \sum_{A \in \mathcal{A}} \eta_A = 1. \qquad (8)$$
Note that, for each $n = 1, 2, \ldots, N$, we have $\pi \in J_n$ only if $\hat\beta_n(\pi, \hat{A}_{n,\pi}) > 0$, according to Line 20 of Algorithm 1. Thus the denominator is positive for every $\pi \in J_n$, and the above optimization problem is well-defined. Let $\hat\eta$ be an optimal solution of (8), and consider the distribution $U(\hat\eta)$ over $\mathcal{A}$ that generates $A$ with probability $\hat\eta_A$. The second part samples interventions according to $U(\hat\eta)$ and uses them to conduct experiments, $T/3$ times (an illustrative sketch of this design-and-sampling step is given at the end of Section 3.4). For each $n = 1, \ldots, N$ and $\pi \in \{0,1\}^{P_n}$, the algorithm counts the number $t'_n(\pi)$ (resp., $t''_n(\pi)$) of experiments that result in $\omega \in \{0,1\}^N$ with $\omega_{P_n} = \pi$ (resp., $\omega_{P_n} = \pi$ and $\omega_n = 1$). Then $\hat\alpha'_n(\pi)$ ($n = 1, \ldots, N$, $\pi \in \{0,1\}^{\overline{P}_n}$) is defined by
$$\hat\alpha'_n(\pi) = \begin{cases} t''_n(\pi_{P_n})/t'_n(\pi_{P_n}) & \text{if } \pi_n = 1, \\ 1 - t''_n(\pi_{P_n})/t'_n(\pi_{P_n}) & \text{if } \pi_n = 0. \end{cases} \qquad (9)$$
The output $\hat\alpha_n$ is defined by
$$\hat\alpha_n(\pi) = \begin{cases} \hat\alpha'_n(\pi) & \text{if } \pi \notin D_n, \\ 0 & \text{otherwise}. \end{cases} \qquad (10)$$

3.4. Regret bound

The pseudo-code of our entire algorithm is provided in Algorithm 3. It computes an estimate $\hat\beta$ of $\beta$ by Algorithm 1 and then computes $\hat\alpha$ by Algorithm 2. It then computes an estimate $\hat\mu$ of $\mu$ by
$$\hat\mu(A) = \sum_{\pi \in B(A)} \prod_{n \in I_{N,A}} \hat\alpha_n(\pi_{\overline{P}_n}) \qquad (11)$$
for each $A \in \mathcal{A}$. The algorithm returns an intervention $\hat{A} \in \mathcal{A}$ that maximizes $\hat\mu$.

Let us define $\gamma^*$ as the optimum value of the following problem:
$$\gamma^* := \min_{\eta \in [0,1]^{\mathcal{A}}} \max_{A \in \mathcal{A}} \sum_{n=1}^{N} \sum_{\pi \in \{0,1\}^{P_n} : \beta_n(\pi, A) > 0} \frac{\beta_n^2(\pi, A)}{\sum_{A' \in \mathcal{A}} \eta_{A'} \beta_n(\pi, A')} \quad \text{subject to} \quad \sum_{A \in \mathcal{A}} \eta_A = 1. \qquad (12)$$
The regret bound of Algorithm 3 is parameterized by the optimum value $\gamma^*$:

Theorem 1. The regret $R_T$ of Algorithm 3 satisfies
$$R_T = O\left(\sqrt{\frac{\max\{\gamma^*, N\} \log(|\mathcal{A}|T)}{T}}\right).$$

The notation $O(\cdot)$ is used here under the assumption that $N$ is sufficiently small with respect to $T$ but not negligible. The optimum value $\gamma^*$ is bounded as follows, where $|A|$ denotes the number of nodes intervened by $A$, i.e., $|A| := |\{n \in [1, N] : A_n \in \{0,1\}\}|$:

Proposition 2. It holds that $N - \min_{A \in \mathcal{A}} |A| \leq \gamma^* \leq \min\{NC,\ N|\mathcal{A}|\}$.
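As a concrete illustration of the second phase, the following sketch (ours, under simplifying assumptions, and not the authors' implementation) approximates the design problem (8) with a generic nonlinear solver and then samples interventions from $U(\hat\eta)$. The matrix `B` holds $\hat\beta_n(\pi, A)$ for the surviving pairs $(n, \pi)$ with $\pi \in J_n$ (rows) and the arms $A \in \mathcal{A}$ (columns), `r` holds $r_{n,\pi}$, and the constant used for `r` in the toy example is a placeholder for the paper's $C$.

```python
# Sketch of the design step (8) and the importance-sampling draw of Algorithm 2.
import numpy as np
from scipy.optimize import minimize

def solve_design(B, r):
    """Approximately solve (8): min over eta in the simplex of the worst-case sum."""
    K, M = B.shape
    def objective(eta):
        denom = B @ eta + r                       # sum_{A'} eta_{A'} beta_hat + r_{n,pi}
        return np.max(np.sum(B ** 2 / denom[:, None], axis=0))
    cons = [{"type": "eq", "fun": lambda eta: eta.sum() - 1.0}]
    res = minimize(objective, np.full(M, 1.0 / M), method="SLSQP",
                   bounds=[(0.0, 1.0)] * M, constraints=cons)
    eta = np.clip(res.x, 0.0, None)               # guard against tiny negatives
    return eta / eta.sum()

def sample_arms(eta_hat, num_rounds, rng=None):
    """Draw arm indices A_t ~ U(eta_hat) for the second part of Algorithm 2."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(len(eta_hat), size=num_rounds, p=eta_hat)

# Toy usage with 3 surviving (n, pi) pairs and 4 arms.
B = np.array([[0.6, 0.1, 0.3, 0.2],
              [0.2, 0.7, 0.1, 0.4],
              [0.1, 0.2, 0.8, 0.3]])
r = B.max(axis=1) / 4     # r_{n,pi} = beta_hat(pi, A_hat_{n,pi}) / C, with C replaced by a toy value 4
eta_hat = solve_design(B, r)
arms = sample_arms(eta_hat, num_rounds=10)
```

In the experiments of Section 5, the authors report replacing this step in their implementation with the simpler choice $\eta_A = 1/C$ for $A = \hat{A}_{n,\pi}$, which avoids solving (8) altogether.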
Since the lower-bound for the general best-arm identification problem is $\Omega(\sqrt{|\mathcal{A}|/T})$ (Audibert & Bubeck, 2010, Theorem 4), our algorithm provides a better regret bound when the number of interventions $|\mathcal{A}|$ is large compared with $\gamma^* \leq NC$, which depends only on the causal graph structure.

Remark 3. We presented Algorithms 1, 2, and 3 for the setting in which every $\alpha_n(\pi)$ is unknown. However, our algorithms can be applied, with minor modifications, even when $\alpha_n(\pi)$ is known for some $n = 1, \ldots, N$ and $\pi \in \{0,1\}^{\overline{P}_n}$. In this case, we let $C$ denote the number of unknown $\alpha_n(\pi)$. The modified algorithm simply skips the experiments for estimating the known $\alpha_n(\pi)$, and we can define $\hat\beta_n(\pi_{P_n}, A) = 0$ for such $n$ and $\pi$. We then redefine $\gamma^*$ by replacing the corresponding $\beta_n(\pi_{P_n}, A)$ with $0$ in (12), and the bound in Theorem 1 remains valid for this decreased $\gamma^*$. In particular, we can recover the regret bound in (Lattimore et al., 2016, Theorem 3) as follows:

Corollary 4. Suppose that $\alpha_n(\pi)$ is known for every $n < N$ and $\pi \in \{0,1\}^{\overline{P}_n}$. Then the regret $R_T$ of Algorithm 3 satisfies $R_T = O(\sqrt{\gamma^* \log(|\mathcal{A}|T)/T})$, where
$$\gamma^* = \min_{\eta \in [0,1]^{\mathcal{A}}} \max_{A \in \mathcal{A}} \sum_{\pi \in \{0,1\}^{P_N} : \beta_N(\pi, A) > 0} \frac{\beta_N^2(\pi, A)}{\sum_{A' \in \mathcal{A}} \eta_{A'} \beta_N(\pi, A')} \quad \text{subject to} \quad \sum_{A \in \mathcal{A}} \eta_A = 1.$$

Remark 5. Our problem setting is often called hard intervention, in which an intervention directly fixes the realization of a node $v_n$ to $A_n \in \{0,1\}$. In contrast, Sen et al. (2017) introduced the soft intervention model, in which an intervention changes the conditional probability $\alpha_n$ of a node $v_n$. They in fact considered a simple case in which the graph has a single node $v_k$ such that $P_N = P_k \cup \{k\}$ and whose conditional probability can be controlled by soft interventions, and they proved a parameterized regret bound. We remark here that their model can be implemented by the hard intervention model with an arbitrary set of interventions; the details of this implementation are presented in our full paper.

4. Regret Analysis

This section presents an approach for proving Theorem 1. Complete proofs of all the statements are presented in the full version (Yabe et al., 2018).

4.1. Accuracy of Algorithm 1

For $n = 1, 2, \ldots, N$, let $\check\alpha_n$ and $\check\alpha'_n$ be the stochastic estimates computed in Algorithm 1, and let $\hat{A}_{n,\pi} := \mathrm{argmax}_{A \in \mathcal{A}} \hat\beta_n(\pi, A)$ be the intervention determined from the estimate $\hat\beta_n$. Let $G$ be defined by $G = \{G_n \mid n = 1, \ldots, N\}$. Using $G$, we define $\alpha_{n,G}$ and $\beta_{n,G}$ as follows. For each $n \in [1, N]$ and $\pi \in \{0,1\}^{\overline{P}_n}$, we define $\alpha_{n,G}(\pi)$ by
$$\alpha_{n,G}(\pi) = \begin{cases} \alpha_n(\pi) & \text{if } \pi \notin G_n, \\ 0 & \text{otherwise}. \end{cases}$$
For each $n \in [1, N]$, $\pi \in \{0,1\}^{P_n}$, and $A \in \mathcal{A}$, we define $\beta_{n,G}(\pi, A)$ by
$$\beta_{n,G}(\pi, A) = \sum_{\pi' \in B_n(\pi, A)} \prod_{m \in I_{n-1,A}} \alpha_{m,G}(\pi'_{\overline{P}_m}).$$
Thus $\alpha_{n,G}(\pi)$ is obtained from $\alpha_n(\pi)$ by truncating its value to $0$ if $\pi \in G_n$, and $\beta_{n,G}$ is defined from $\alpha_{n,G}$. We define $\alpha_{n,D}$ and $\beta_{n,D}$ in the same way. Since $G_n \subseteq D_n$, we observe that $\beta_{n,D}(\pi, A) \leq \beta_{n,G}(\pi, A) \leq \beta_n(\pi, A)$. Similarly, for $A \in \mathcal{A}$, we define $\mu_D(A)$ by
$$\mu_D(A) = \sum_{\pi \in B(A)} \prod_{m \in I_{N,A}} \alpha_{m,D}(\pi_{\overline{P}_m}). \qquad (13)$$

The following proposition gives the error bounds for the outputs $\hat\beta_n$ and $D$ of Algorithm 1.

Proposition 6. Let $\hat\beta_n$ and $D$ be the outputs of Algorithm 1 with parameter $\lambda \geq 1$. Then the following hold with a probability of at least $1 - 6C/T$: for every $n \in [1, N]$, $\pi \in \{0,1\}^{\overline{P}_n} \setminus D_n$ with $\bar\pi = \pi_{P_n}$, and $A \in \mathcal{A}$,
$$\tfrac{1}{e}\,\beta_{n,D}(\bar\pi, A) \leq \hat\beta_n(\bar\pi, A) \leq e\,\beta_n(\bar\pi, A), \qquad (14)$$
$$\alpha_n(\pi)\,\beta_n(\bar\pi, \hat{A}_{n,\bar\pi}) \geq S(\lambda), \qquad (15)$$
$$\beta_n(\bar\pi, A) \leq e\,\hat\beta_n(\bar\pi, A) + e\,\hat\beta_n(\bar\pi, \hat{A}_{n,\bar\pi})/C, \qquad (16)$$
$$\mu(A) - \mu_D(A) \leq 8e^2(C^3 + C)\,S(\lambda). \qquad (17)$$

We prepare the following three lemmas to prove Proposition 6. The first lemma is an application of Chernoff's bound, and it bounds the relative error of the estimate $\check\alpha'_n$.

Lemma 7. Let $n \in [1, N]$, $\pi \in \{0,1\}^{\overline{P}_n}$, and $\bar\pi = \pi_{P_n}$.
(i) If $\alpha_n(\pi)\,\beta_n(\bar\pi, \hat{A}_{n,\bar\pi}) \leq S(\lambda)$, then the following holds with a probability of at least $1 - 2/T$:
$$\check\alpha'_n(\pi)\,\beta_n(\bar\pi, \hat{A}_{n,\bar\pi}) \leq 2S(\lambda).$$
(ii) If $\alpha_n(\pi)\,\beta_n(\bar\pi, \hat{A}_{n,\bar\pi}) \geq S(\lambda)$, then the following holds with a probability of at least $1 - 3/T$:
$$\left(1 - \tfrac{1}{N}\right)\alpha_n(\pi) \leq \check\alpha'_n(\pi) \leq \left(1 + \tfrac{1}{N}\right)\alpha_n(\pi).$$

The second lemma bounds the gap produced by the truncation of $\alpha_n$ that is used to introduce $\alpha_{n,G}$ and $\alpha_{n,D}$. We use the notation $H'_n := \{\pi_{P_n} \mid \pi \in H_n\}$.

Lemma 8. (i) Let $n \in [1, N]$ and $\pi \in \{0,1\}^{P_n}$. For every $A \in \mathcal{A}$, it holds that
$$\beta_n(\pi, A) - \beta_{n,G}(\pi, A) \leq \sum_{m \in I_{n-1,A}} \sum_{\pi' \in G_m} \max_{A' \in \mathcal{A}} \alpha_m(\pi')\,\beta_{m,G}(\pi'_{P_m}, A').$$
(ii) For every $A \in \mathcal{A}$, it holds that
$$\sum_{m \in I_{N,A}} \sum_{\pi \in G_m} \max_{A' \in \mathcal{A}} \alpha_m(\pi)\,\beta_{m,G}(\pi_{P_m}, A') \leq \sum_{m \in I_{N,A}} \sum_{\pi' \in H'_m} \max_{A' \in \mathcal{A}} \beta_{m,G}(\pi', A').$$

The third lemma bounds the relative error of $\hat\beta_n$. The statement can be proven by induction on $n$ on the basis of Lemma 7.

Lemma 9. The following hold for every $n = 1, 2, \ldots, N$, $\pi \in \{0,1\}^{\overline{P}_n}$ with $\bar\pi = \pi_{P_n}$, and $A \in \mathcal{A}$ with a probability of at least $1 - 6C/T$:
$$\tfrac{1}{e}\,\beta_{n,G}(\bar\pi, A) \leq \hat\beta_n(\bar\pi, A) \leq e\,\beta_{n,G}(\bar\pi, A),$$
$$\pi \in G_n \quad \text{if } \alpha_n(\pi)\,\beta_n(\bar\pi, \hat{A}_{n,\bar\pi}) < S(\lambda),$$
$$\left(1 - \tfrac{1}{N}\right)\alpha_n(\pi) \leq \check\alpha'_n(\pi) \leq \left(1 + \tfrac{1}{N}\right)\alpha_n(\pi) \quad \text{if } \alpha_n(\pi)\,\beta_n(\bar\pi, \hat{A}_{n,\bar\pi}) \geq S(\lambda).$$

Proposition 6 is then proven on the basis of Lemmas 7–9.

4.2. Accuracy of Algorithm 2

This subsection bounds the gap between the true value $\mu(A)$ and its estimate $\hat\mu(A)$ given by Algorithm 2, assuming that the input of Algorithm 2, namely the output of Algorithm 1, satisfies the conditions in Proposition 6.

Proposition 10. Suppose that $\lambda \geq 1$ and that $\hat\beta_n$ and $D$ satisfy (14), (15), (16), and (17). Let $\hat\alpha_n$ be the output of Algorithm 2, and let $\hat\mu$ be defined by (11). Then the following holds for every $A \in \mathcal{A}$ with a probability of at least $1 - (10C+2)/T$:
$$|\mu(A) - \hat\mu(A)| \leq \sqrt{\frac{2e^6\gamma^* \log(|\mathcal{A}|T)}{T}} + \frac{8e^2 C^3 \log T}{\lambda T} + 8e^2(C^3 + C)\,S(\lambda).$$

Recall that $I_{N,A} := \{m \in [1, N] \mid A_m = *\}$ for $A \in \mathcal{A}$. For $n \in [1, N]$ and $\pi \in \{0,1\}^{\overline{P}_n}$, let $\Delta\alpha_n(\pi) := \hat\alpha_n(\pi) - \alpha_{n,D}(\pi)$. For $A \in \mathcal{A}$ and $J \subseteq I_{N,A}$, we define $f_J(A)$ by
$$f_J(A) = \sum_{\pi \in B(A)} \prod_{m \in I_{N,A} \setminus J} \alpha_{m,D}(\pi_{\overline{P}_m}) \prod_{n \in J} \Delta\alpha_n(\pi_{\overline{P}_n}).$$
Observe that $f_J(A)$ is obtained from the definition (13) of $\mu_D$ by replacing $\alpha_{n,D}(\pi_{\overline{P}_n})$ with $\Delta\alpha_n(\pi_{\overline{P}_n})$ for $n \in J$. Recall also that $\hat\mu(A)$ is obtained from the definition of $\mu_D$ by replacing $\alpha_{n,D}(\pi)$ with $\hat\alpha_n(\pi)$ for all $n \in I_{N,A}$. Based on these relationships, we have the following lemma:

Lemma 11. For $A \in \mathcal{A}$, it holds that
$$\mu_D(A) = f_{\emptyset}(A), \qquad \hat\mu(A) = \sum_{J \subseteq I_{N,A}} f_J(A). \qquad (18)$$

For $j \in I_{N,A}$, let $f_j(A) := f_{\{j\}}(A)$. We provide probabilistic bounds for the linear terms ($|J| = 1$) and the superlinear terms ($|J| \geq 2$) in (18) separately, using Hoeffding's inequality.

Lemma 12. Suppose that (14), (15), and (16) hold.
(i) The following holds with a probability of at least $1 - (C+2)/T$:
$$\left|\sum_{j \in I_{N,A}} f_j(A)\right| \leq \sqrt{\frac{2e^6\gamma^* \log(|\mathcal{A}|T)}{T}}.$$
(ii) The following holds with a probability of at least $1 - 9C/T$:
$$\sum_{J \subseteq I_{N,A} : |J| \geq 2} |f_J(A)| \leq \frac{8e^2 C^3 \log T}{\lambda T}.$$

The above two lemmas imply Proposition 10.

4.3. Proof of Theorem 1

We present a sketch of the proof of Theorem 1 on the basis of Propositions 6 and 10. Putting $\lambda = C^3/N$, by Propositions 2, 6, and 10, the following holds for every $A \in \mathcal{A}$ with a probability of at least $1 - (16C+2)/T$:
$$|\mu(A) - \hat\mu(A)| \leq \sqrt{\frac{8e^6 \max\{\gamma^*, N\} \log(|\mathcal{A}|T)}{T}} + \frac{192e^2 N C^7 \log T}{T}.$$
Let $A^* = \mathrm{argmax}_{A \in \mathcal{A}} \mu(A)$ and $\hat{A} = \mathrm{argmax}_{A \in \mathcal{A}} \hat\mu(A)$. Then it holds that
$$\mu(A^*) - \mu(\hat{A}) \leq |\mu(A^*) - \hat\mu(A^*)| + |\mu(\hat{A}) - \hat\mu(\hat{A})| = O\left(\sqrt{\frac{\max\{\gamma^*, N\} \log(|\mathcal{A}|T)}{T}}\right).$$
This implies the desired regret bound.

5. Experiments

We now demonstrate the performance of the proposed algorithm through experimental evaluations and compare it with a baseline algorithm (Audibert & Bubeck, 2010), which was proposed for the general bandit problem and thus cannot take advantage of the known causal graph structure.
Figure 2. The average regret over synthetic and real-world instances: (i) Synthetic (h = 5), (ii) Alarm instance, (iii) Water instance. Each panel plots the average regret $R_T$ against the number of rounds $T$ for the proposed algorithm and the baseline with budgets $b = 2, 4, 8$.

Instances We evaluated the algorithms on both synthetic and real-world instances. The detailed experimental setting is presented in the full version (Yabe et al., 2018). Recall that an instance of the causal bandit problem consists of a DAG $G$, an intervention set $\mathcal{A}$, and $\alpha_n$ ($n = 1, \ldots, N$). In the synthetic instances, the DAG $G$ is a directed complete binary tree of height 4; hence the number of nodes is $N = 2^5 - 1 = 31$, and the number of uncertain parameters is $C = 2^2 \cdot (2^4 - 1) + 2^0 \cdot 2^4 = 76$. In the real-world instances, the DAG $G$ is constructed from the Alarm and the Water data sets in the Bayesian Network Repository¹. The numbers $N$ of nodes in the DAGs constructed from the Alarm and Water data sets are 37 and 32, and the numbers $C$ of uncertain parameters are 116 and 248, respectively. For each $G$, we consider interventions on the leaves that fix exactly $b \leq N$ leaf nodes to 1 and the other leaves to 0. We call the parameter $b$ the budget; the number of interventions $|\mathcal{A}|$ is then controlled by the budget. For each $n \in \{1, \ldots, N\}$ and $\pi \in \{0,1\}^{\overline{P}_n}$, we generate $\alpha_n(\pi)$ from the uniform distribution over $[0, 1]$. For each of these instances, we executed the algorithms 10 times and compared their average regrets.

¹ http://www.cs.huji.ac.il/~galel/Repository/

Implementation of the proposed algorithm The algorithm given in Section 3 is designed conservatively in order to obtain the theoretical regret bound (Theorem 1), and there is room to modify it to be more efficient in practice, although the theoretical regret bound may no longer hold. In our implementation, we introduced the following three modifications. First, while Algorithm 2 discards the samples obtained for computing $\check\alpha$ in Algorithm 1 in order to maintain the independence between $\hat\beta$ and $\hat\alpha$, our implementation also uses these samples in Algorithm 2. Next, we disable the truncation mechanism of Algorithm 1 by setting $\lambda = 0$. We expect these two modifications to make the estimates more accurate. Finally, instead of solving (8), we set $\eta_A = 1/C$ if $A = \hat{A}_{n,\pi}$ for some $n \in [1, N]$ and $\pi \in \{0,1\}^{P_n}$, and $\eta_A = 0$ otherwise. Since solving (8) is time-consuming, this modification makes the algorithm faster.

Experimental results Figure 2(i) shows the average regret over the synthetic instances against the number of rounds $T \in \{C, 2C, \ldots, 9C\}$. Figures 2(ii) and (iii) respectively illustrate the average regret for the real-world instances constructed from the Alarm and the Water data sets. The results show that the proposed algorithm outperforms the baseline on every instance. In particular, the gap is remarkably large ($> 0.2$) on the Alarm data set (ii) with a large number of interventions ($b = 4, 8$, corresponding to $|\mathcal{A}| = 793$ and $3796$, respectively) and a small number of samples ($T \leq 4C = 464$). In these cases, the baseline cannot apply every intervention even once. On the other hand, the regret of the proposed algorithm grows only slowly with respect to the number of arms $|\mathcal{A}|$ on all instances. Thus the proposed algorithm attains small regret even when the number of interventions $|\mathcal{A}|$ is 30 times larger than the number of experiments $T$.
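For concreteness, the following sketch (our reconstruction of the synthetic setup described above, not the authors' code) builds such an instance: a directed complete binary tree whose edges point toward the root (the target node), uniformly random $\alpha_n$, and budget-$b$ interventions that fix the leaves; all function names are ours.

```python
# Sketch of a synthetic causal-bandit instance as described in Section 5 (our reconstruction).
import itertools
import random

def binary_tree_instance(height=4, seed=0):
    """Directed complete binary tree with edges pointing toward the root (the target).

    Returns (parents, alpha): parents[n] lists the parent indices of node n (nodes are
    topologically ordered, root last), and alpha[n] maps each parent-value tuple to
    Prob(V_n = 1 | parents), drawn uniformly from [0, 1].
    """
    rng = random.Random(seed)
    parents = [[] for _ in range(2 ** height)]   # the leaves have no parents
    level_start, level_size = 0, 2 ** height
    while level_size > 1:                        # merge pairs level by level up to the root
        for j in range(level_size // 2):
            parents.append([level_start + 2 * j, level_start + 2 * j + 1])
        level_start += level_size
        level_size //= 2
    alpha = [{pa: rng.random() for pa in itertools.product([0, 1], repeat=len(p))}
             for p in parents]
    return parents, alpha

def budget_interventions(num_nodes, leaves, b):
    """All arms that intervene on every leaf, setting exactly b leaves to 1."""
    arms = []
    for ones in itertools.combinations(leaves, b):
        A = ['*'] * num_nodes
        for v in leaves:
            A[v] = 1 if v in ones else 0
        arms.append(tuple(A))
    return arms

parents, alpha = binary_tree_instance(height=4)
leaves = [n for n, p in enumerate(parents) if not p]
arms = budget_interventions(len(parents), leaves, b=2)
print(len(parents), len(arms))                   # 31 nodes, C(16, 2) = 120 arms
```

Together with the sampling sketch given after Section 2, this provides a small end-to-end test bed for comparing intervention-selection strategies.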
6. Conclusion

In this paper, we proposed the first algorithm for the general causal bandit problem, whereas existing algorithms could deal only with localized interventions, and we proved a novel regret bound $O(\sqrt{\gamma \log(|\mathcal{A}|T)/T})$, which is logarithmic with respect to the number of arms. Our experimental results show that the proposed algorithm is applicable to systems where the number of interventions $|\mathcal{A}|$ is much larger than $T$. One important future research direction would be to prove a gap-dependent bound, as Sen et al. (2017) have proven for localized interventions. Another research direction, mentioned in (Lattimore et al., 2016), would be to incorporate a causal discovery algorithm so as to estimate the structure of the causal graph, which is currently assumed to be known in advance.

Acknowledgements

Daisuke Hatano, Hanna Sumita, Naonori Kakimura, Takuro Fukunaga, and Ken-ichi Kawarabayashi are supported by JST ERATO Kawarabayashi Large Graph Project, Grant Number JPMJER1201, Japan.

References

Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646, 2014.

Audibert, J.-Y. and Bubeck, S. Best arm identification in multi-armed bandits. In The 23rd Conference on Learning Theory, pp. 41–53, 2010.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.

Eberhardt, F., Glymour, C., and Scheines, R. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among n variables. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pp. 178–184. AUAI Press, 2005.

Gabillon, V., Ghavamzadeh, M., and Lazaric, A. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems, pp. 3212–3220, 2012.

Hauser, A. and Bühlmann, P. Two optimal strategies for active learning of causal models from interventional data. International Journal of Approximate Reasoning, 55(4):926–939, 2014.

Hu, H., Li, Z., and Vetta, A. R. Randomized experimental design for causal graph discovery. In Advances in Neural Information Processing Systems, pp. 2339–2347, 2014.

Kaufmann, E., Cappé, O., and Garivier, A. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.

Kim, D. J., Ferrin, D. L., and Rao, H. R. A trust-based consumer decision-making model in electronic commerce: The role of trust, perceived risk, and their antecedents. Decision Support Systems, 44(2):544–564, 2008.

Lattimore, F., Lattimore, T., and Reid, M. D. Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems, pp. 1181–1189, 2016.

Meinshausen, N., Hauser, A., Mooij, J. M., Peters, J., Versteeg, P., and Bühlmann, P. Methods for causal inference from gene perturbation experiments and validation. Proceedings of the National Academy of Sciences, 113(27):7361–7368, 2016.
Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J., and Schölkopf, B. Distinguishing cause from effect using observational data: Methods and benchmarks. The Journal of Machine Learning Research, 17(1):1103–1204, 2016.

Pearl, J. Causality. Cambridge University Press, 2009.

Sen, R., Shanmugam, K., Dimakis, A. G., and Shakkottai, S. Identifying best interventions through online importance sampling. In International Conference on Machine Learning, pp. 3057–3066, 2017.

Shanmugam, K., Kocaoglu, M., Dimakis, A. G., and Vishwanath, S. Learning causal graphs with small interventions. In Advances in Neural Information Processing Systems, pp. 3195–3203, 2015.

Shimizu, S., Inazumi, T., Sogawa, Y., Hyvärinen, A., Kawahara, Y., Washio, T., Hoyer, P. O., and Bollen, K. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12(Apr):1225–1248, 2011.

Spirtes, P. and Glymour, C. An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review, 9(1):62–72, 1991.

Splawa-Neyman, J., Dabrowska, D. M., and Speed, T. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, pp. 465–472, 1990.

Yabe, A., Hatano, D., Sumita, H., Ito, S., Kakimura, N., Fukunaga, T., and Kawarabayashi, K. Causal bandits with propagating inference. arXiv preprint arXiv:1806.02252, 2018.