Causal Bandits with Propagating Inference

Akihiro Yabe¹, Daisuke Hatano², Hanna Sumita³, Shinji Ito¹, Naonori Kakimura⁴, Takuro Fukunaga², Ken-ichi Kawarabayashi⁵

Abstract

Bandit is a framework for designing sequential experiments, where a learner selects an arm $A \in \mathcal{A}$ and obtains an observation corresponding to $A$ in each experiment. Theoretically, the tight regret lower-bound for the general bandit is polynomial with respect to the number of arms $|\mathcal{A}|$; thus, to overcome this bound, the bandit problem with side-information is often considered. Recently, a bandit framework over a causal graph was introduced, where the structure of the causal graph is available as side-information and the arms are identified with interventions on the causal graph. Existing algorithms for the causal bandit overcame the $\Omega(\sqrt{|\mathcal{A}|/T})$ simple-regret lower-bound; however, those algorithms work only when the interventions $\mathcal{A}$ are localized around a single node (i.e., an intervention propagates only to its neighbors). We propose a novel causal bandit algorithm for an arbitrary set of interventions, which can propagate throughout the causal graph. We also show that it achieves an $O(\sqrt{\gamma \log(|\mathcal{A}|T)/T})$ regret bound, where $\gamma$ is determined by the structure of the causal graph. In particular, if the maximum in-degree of the causal graph is a constant, then $\gamma = O(N^2)$, where $N$ is the number of nodes.

¹NEC Corporation, Japan, ²RIKEN AIP, Japan, ³Tokyo Metropolitan University, Japan, ⁴Keio University, Japan, ⁵National Institute of Informatics, Japan. Correspondence to: Akihiro Yabe.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

1. Introduction

Multi-armed bandit has been widely recognized as a standard framework for modeling online learning with a limited number of observations. In each round of the bandit problem, a learner chooses an arm $A$ from given candidates $\mathcal{A}$ and obtains a corresponding observation. Since observation is limited, the learner must adopt an efficient strategy for exploring the optimal arm $A^* \in \mathcal{A}$. The efficiency of a strategy is measured by regret, and the theoretically tight lower-bound scales as $\Omega(\sqrt{|\mathcal{A}|})$ with respect to the number of arms $|\mathcal{A}|$ in the general multi-armed bandit setting. Thus, in order to improve upon this lower bound, one requires additional information in the bandit setting. For example, contextual bandit (Agarwal et al., 2014; Auer et al., 2002) is a well-known class of bandit problems with side information on domain-expert knowledge. For this setting, there is a regret bound of $O(\sqrt{\log |\mathcal{A}|})$, logarithmic with respect to the number of arms. In this paper, we also achieve an $O(\sqrt{\log |\mathcal{A}|})$ regret bound for a novel class of bandit problems with side information. To this end, let us introduce our bandit setting in detail.

The causal graph (Pearl, 2009) is a well-known tool for modeling a variety of real problems, including computational advertising (Bottou et al., 2013), genetics (Meinshausen et al., 2016), agriculture (Splawa-Neyman et al., 1990), and marketing (Kim et al., 2008). Based on causal graph discovery studies (Eberhardt et al., 2005; Hauser & Bühlmann, 2014; Hu et al., 2014; Shanmugam et al., 2015), Lattimore et al. (2016) recently introduced the causal bandit framework. They consider the problem of finding the best intervention, i.e., the one that causes the most desirable propagation of a probability distribution over a given causal graph, with a limited number of experiments $T$.

In this setting, the arms are identified with interventions $\mathcal{A}$ on the causal graph. A set of binary random variables $V_1, V_2, \ldots, V_N$ is associated with the nodes $v_1, v_2, \ldots, v_N$ of the causal graph. At each round of an experiment, a learner selects an intervention $A \in \mathcal{A} \subseteq \{0, 1, *\}^N$, which enforces the realization of a variable $V_i$ to be $A_i$ whenever $A_i \in \{0, 1\}$. The effect of the intervention then propagates throughout the causal graph through the edges, and a realization $\omega \in \{0,1\}^N$ over all nodes is observed after the propagation. The goal of the causal bandit problem is to control the realization of a target variable $V_N$ with an optimal intervention.
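To make this interaction concrete, the following is a minimal sketch of the simple-regret protocol described above. It is our own illustration rather than code from the paper, and the environment and learner interfaces (`env.experiment`, `env.mu`, `learner.choose`, `learner.update`, `learner.recommend`) are hypothetical placeholders.

```python
# Illustrative sketch of the causal-bandit protocol; all interfaces are hypothetical.
def run_simple_regret_protocol(env, arms, learner, T):
    """Run T experiments, then return the recommended arm and its simple regret."""
    for t in range(T):
        A = learner.choose(arms, t)    # pick an intervention A in {0, 1, *}^N
        omega = env.experiment(A)      # realization omega in {0,1}^N after propagation
        learner.update(A, omega)       # update the learner's internal estimates
    A_hat = learner.recommend(arms)    # recommendation once the budget of T rounds is spent
    regret = max(env.mu(A) for A in arms) - env.mu(A_hat)  # mu(A) = Prob(V_N = 1 | do(A))
    return A_hat, regret
```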
Figure 1 is an illustrative example of the causal bandit problem. In the figure, the four nodes on the right represent a consumer decision-making model in e-commerce borrowed from (Kim et al., 2008). This model assumes that customers make a decision to purchase based on their perceived risk in an online transaction (e.g., a defective product), the consumer's trust of a web vendor, and the perceived benefit of e-commerce (e.g., increased convenience). Consumer trust influences perceived risk. Here, we consider controlling customers' behavior by two kinds of advertising, which correspond to adding two nodes (Ad A and Ad B) to be intervened into the model. Ad A can change only the reliability of the website; that is, it can influence the decisions of customers only in an indirect way through the middle nodes. In contrast, Ad B can change the perceived benefit. The aim is to increase the number of purchases by consumers by choosing an effective advertisement. This is indeed a bandit problem over a causal graph.

The work in (Lattimore et al., 2016) considered the causal bandit problem of minimizing simple regret and offered an improved regret bound over the aforementioned tight lower-bound $\Omega(\sqrt{|\mathcal{A}|/T})$ (Audibert & Bubeck, 2010, Theorem 4) for the general bandit setting (Audibert & Bubeck, 2010; Gabillon et al., 2012). Sen et al. (2017) extended this study by incorporating smooth interventions, and they provided a new regret bound parameterized by the performance gap between the optimal and sub-optimal arms. This parameterized bound comes from a technique developed for the general multi-armed bandit problem (Audibert & Bubeck, 2010). These analyses, however, only work for a special class of interventions with known true parameters. Indeed, they only consider localized interventions.

Main contribution This paper proposes the first algorithm for the causal bandit problem with an arbitrary set of interventions (which can propagate throughout the causal graph) and with a theoretically guaranteed simple-regret bound. The bound is $O(\sqrt{\gamma \log(|\mathcal{A}|T)/T})$, where $\gamma$ is a parameter bounded on the basis of the graph structure. In particular, $\gamma = O(N^2)$ if the maximum in-degree of the causal graph is bounded by a constant, where $N$ is the number of nodes.

The major difficulty in dealing with arbitrary interventions comes from the accumulation and propagation of estimation error. Existing studies consider interventions that only affect the parents $P_k$ of a single node $v_k$. To estimate the relationship between $P_k$ and $V_k$ in this setting, we could apply an efficient importance sampling algorithm (Bottou et al., 2013; Lattimore et al., 2016). On the other hand, when we intervene on arbitrary nodes, the intervention can affect the probabilistic propagation mechanism in any part of the causal graph.
Hence, we cannot directly control the realization of intermediate nodes when designing efficient experiments.

The proposed algorithm consists of two steps. First, the preprocessing step is devoted to estimating parameters for designing the efficient experiments used in the main step. More precisely, we focus on the estimation of parameters with bounded relative error. By truncating small parameters that are negligible but tend to have large relative error, we manage to avoid the accumulation of estimation error. In the main step, we apply an importance sampling approach introduced in (Lattimore et al., 2016; Sen et al., 2017) on the basis of the estimated parameters with a guaranteed relative error. This step allows us to estimate parameters with bounded absolute error, which results in the desired regret bound. Owing to space limitations, all proofs are omitted; they can be found in the full version of this paper (Yabe et al., 2018).

Figure 1. Simple example of a causal graph (node labels include Perceived risk, Consumer trust, and Perceived benefit).

Related studies Minimizing simple regret in bandit problems is called the best-arm identification (Gabillon et al., 2012; Kaufmann et al., 2016) or pure exploration (Bottou et al., 2013) problem, and it has been extensively studied in the machine learning research community. The inference of a causal graph structure is also well studied and can be classified into causal graph discovery and causal inference: causal graph discovery (Eberhardt et al., 2005; Hauser & Bühlmann, 2014; Hu et al., 2014; Shanmugam et al., 2015) considers efficient experiments for determining the structure of a causal graph, while causal inference (Mooij et al., 2016; Pearl, 2009; Shimizu et al., 2011; Spirtes & Glymour, 1991) aims to determine the graph structure only from historical data without additional experiments. The causal bandit problem designs experiments without using historical data, which makes it more closely related to the causal graph discovery studies.

2. Causal bandit problem

This section introduces the causal bandit problem proposed by Lattimore et al. (2016). Let $G = (V, E)$ be a directed acyclic graph (DAG) with a node set $V = \{v_1, v_2, \ldots, v_N\}$ and a (directed) edge set $E$. Let $(v_i, v_j)$ denote an edge from $v_i$ to $v_j$. Without loss of generality, we suppose that the nodes in $V$ are topologically sorted so that no edge from $v_i$ to $v_j$ exists if $i \geq j$. For each $n = 1, \ldots, N$, let $P_n$ denote the index set of the parents of $v_n$, i.e., $P_n = \{i \in \{1, \ldots, n-1\} : (v_i, v_n) \in E\}$. We then define $\overline{P}_n = P_n \cup \{n\}$.

Each node $v_n \in V$ is associated with a random variable $V_n$, which takes a value in $\{0, 1\}$. The distribution of $V_n$ is influenced by the variables associated with the parents of $v_n$ (unless $V_n$ is intervened, as described below). For each $\pi \in \{0,1\}^{\overline{P}_n}$, the parameter $\alpha_n(\pi)$ defined below characterizes the distribution of $V_n$ given the realizations of its parents:
$$\alpha_n(\pi) := \mathrm{Prob}\left(V_n = \pi_n \mid V_i = \pi_i \text{ for all } i \in P_n,\ v_n \text{ is not intervened}\right).$$
That is to say, if the parents $v_i$ for $i \in P_n$ are realized as $\pi_i$, then $V_n = \pi_n$ with probability $\alpha_n(\pi)$, and $V_n = 1 - \pi_n$ with probability $1 - \alpha_n(\pi)$.

Together with the DAG, we are also given a set $\mathcal{A}$ of interventions. Each intervention is identified with a vector $A \in \{*, 0, 1\}^N$, where $A_n \neq *$ implies that $V_n$ is intervened and that the realization of $V_n$ is fixed to $A_n$. Let $\pi \in \{0,1\}^{\overline{P}_n}$.
Given an intervention $A \in \mathcal{A}$ and realizations $\pi_i$ over the parents $i \in P_n$, the probability that $V_n = \pi_n$ holds is then determined as follows:
$$\mathrm{Prob}\left(V_n = \pi_n \mid V_i = \pi_i \text{ for all } i \in P_n,\ \mathrm{do}(A)\right) = \begin{cases} \alpha_n(\pi) & \text{if } A_n = *, \\ 1 & \text{if } A_n = \pi_n, \\ 0 & \text{if } A_n = 1 - \pi_n. \end{cases}$$
This equality, together with the structure of the causal graph $G$, completely determines the joint distribution over the variables $V_1, V_2, \ldots, V_N$ under an arbitrary intervention $A \in \mathcal{A}$.

In the causal bandit problem, we are given a DAG $G = (V, E)$ and a set $\mathcal{A}$ of interventions. However, the parameters $\alpha_n$ ($n = 1, \ldots, N$) are not known. Our ideal goal is then to find an intervention $A^* \in \mathcal{A}$ that maximizes the probability $\mu(A)$ of realizing $V_N = 1$, where $\mu : \mathcal{A} \to [0, 1]$ is defined by $\mu(A) := \mathrm{Prob}(V_N = 1 \mid \mathrm{do}(A))$ for each $A \in \mathcal{A}$. For this purpose, we discuss algorithms of the following form. First, the algorithm estimates $\mu(A)$ ($A \in \mathcal{A}$) from $T$ experimental trials, where each experiment consists of the application of an intervention and the observation of a realization $\pi \in \{0,1\}^N$ over all nodes; let $\hat\mu(A)$ denote the estimate of $\mu(A)$. Second, the algorithm selects the intervention $\hat{A}$ that maximizes $\hat\mu$. We evaluate the efficiency of such an algorithm by the simple regret $R_T$ defined as
$$R_T = \mu(A^*) - \mathbb{E}[\mu(\hat{A})].$$
Note that, even if the algorithm is deterministic, $\hat{A}$ is stochastic since the observations obtained in each experiment are produced by a stochastic process. In this paper, we assume that $N \geq 3$ and $T \geq 2$ for ease of technical discussion.
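To make the definitions above concrete, here is a minimal sketch, written by us and not taken from the paper, of sampling one realization under a hard intervention and of computing $\mu(A)$ by brute force. For convenience it stores $\mathrm{Prob}(V_n = 1 \mid \text{parents})$ per parent pattern, which equals $\alpha_n(\pi)$ when $\pi_n = 1$ and $1 - \alpha_n(\pi)$ when $\pi_n = 0$; nodes are 0-indexed with the last node as the target.

```python
# Minimal sketch of the causal model of Section 2 (our illustration, not the paper's code).
import itertools
import random

def sample_realization(parents, alpha, A, rng=random):
    """Sample omega in {0,1}^N by propagating through the DAG under do(A)."""
    N = len(parents)
    omega = [0] * N
    for n in range(N):                      # topological order: parents come first
        if A[n] == '*':
            pa = tuple(omega[i] for i in parents[n])
            omega[n] = 1 if rng.random() < alpha[n][pa] else 0
        else:
            omega[n] = A[n]                 # hard intervention fixes the value
    return omega

def mu_exact(parents, alpha, A):
    """Brute-force Prob(V_N = 1 | do(A)); exponential in N, for tiny graphs only."""
    N = len(parents)
    total = 0.0
    for bits in itertools.product([0, 1], repeat=N):
        if bits[N - 1] != 1:
            continue                        # only realizations with V_N = 1 contribute
        if any(A[n] != '*' and bits[n] != A[n] for n in range(N)):
            continue                        # inconsistent with the intervention
        p = 1.0
        for n in range(N):
            if A[n] != '*':
                continue                    # intervened nodes contribute no factor
            pa = tuple(bits[i] for i in parents[n])
            q = alpha[n][pa]                # Prob(V_n = 1 | parents)
            p *= q if bits[n] == 1 else 1.0 - q
        total += p
    return total

# Toy example: v1 -> v2 -> v3 (target), plus v1 -> v3.
parents = [[], [0], [0, 1]]
alpha = [{(): 0.5},
         {(0,): 0.2, (1,): 0.9},
         {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.4, (1, 1): 0.95}]
print(mu_exact(parents, alpha, ('*', 1, '*')))   # mu(A) for the intervention do(V_2 = 1)
```

The brute-force computation of $\mu$ is exponential in $N$ and is shown only to make the definition concrete; the algorithm below cannot use it because the parameters $\alpha_n$ are unknown.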
3. Proposed Algorithm

In this section, we propose an algorithm for the causal bandit problem and present a regret bound for it. Let $C_n = 2^{|P_n|}$ for each $n = 1, \ldots, N$, and let $C = \sum_{n=1}^{N} C_n$. For $S \subseteq S' \subseteq [1, N]$ and $\pi \in \{0,1\}^{S'}$, let $\pi_S$ denote the restriction of $\pi$ onto $S$.

3.1. Outline of the proposed algorithm

Recall that the purpose of the causal bandit problem is to identify an intervention $A^*$ that maximizes $\mu(A^*)$. This task is trivial if $\alpha_n$ is known for all $n = 1, \ldots, N$, because $\mu(A)$ can then be calculated for all $A \in \mathcal{A}$. Let $B(A) = \{\pi \in \{0,1\}^N \mid \pi_i = A_i \text{ if } A_i \neq *,\ \pi_N = 1\}$, and for $n \in [1, N]$, let $I_{n,A}$ denote the set of nodes in $[1, n]$ that are not intervened by $A$, i.e., $I_{n,A} := \{m \in [1, n] \mid A_m = *\}$. Then $\mu(A)$ can be represented as
$$\mu(A) = \sum_{\pi \in B(A)} \prod_{n \in I_{N,A}} \alpha_n(\pi_{\overline{P}_n}).$$
Therefore, in order to compute $\mu$ approximately, our algorithm estimates $\alpha_n$ ($n = 1, \ldots, N$).

In order to estimate $\alpha_n$ efficiently, we need to manipulate the random variables associated with the parents of $v_n$. More concretely, to estimate $\alpha_n(\pi)$ for $\pi \in \{0,1\}^{\overline{P}_n}$, we require samples whose realization $\omega \in \{0,1\}^N$ satisfies $\pi_i = \omega_i$ over the parents $i \in P_n$ of $v_n$. For $n = 1, 2, \ldots, N$, $\pi \in \{0,1\}^{P_n}$, and $A \in \mathcal{A}$, we thus introduce the additional quantity $\beta_n(\pi, A)$ denoting the probability of realizing $\omega$ with $\omega_{P_n} = \pi$ under a given intervention $A$. More precisely, we define
$$\beta_n(\pi, A) := \begin{cases} \mathrm{Prob}\left(V_m = \pi_m \text{ for all } m \in P_n \mid \mathrm{do}(A)\right) & \text{if } A_n = *, \\ 0 & \text{otherwise}. \end{cases}$$

Our algorithm consists of two phases. The first phase estimates $\beta_n$ ($n = 1, \ldots, N$), and the second phase estimates $\alpha_n$ ($n = 1, \ldots, N$). The algorithm spends $T/3$ experiments in the first phase and $2T/3$ experiments in the second phase. In the rest of this section, we explain these phases and then present a regret bound for the algorithm.

3.2. First Phase: Estimation of β

Here we introduce the phase that estimates $\beta_n$ for all $n = 1, \ldots, N$. The pseudo-code of this phase is described in Algorithm 1. Algorithm 1 requires a non-negative number $\lambda$ as a parameter, which will be set to $C^3/N$. We perform $T/3$ experiments in this phase.

Before explaining the details of Algorithm 1, we note that $\beta_n$ can be calculated from $\alpha_1, \ldots, \alpha_{n-1}$. For $\pi \in \{0,1\}^{P_n}$ and $A \in \mathcal{A}$, let $B_n(\pi, A) := \{\pi' \in \{0,1\}^{n-1} \mid \pi'_i = \pi_i \text{ if } i \in P_n,\ \pi'_i = A_i \text{ if } A_i \neq *\}$ denote the set of realizations over $V_1, V_2, \ldots, V_{n-1}$ that are consistent with the realization $\pi$ over $P_n$ and with the intervention $A$. If $A_n = *$, then $\beta_n(\pi, A)$ is described as
$$\beta_n(\pi, A) = \sum_{\pi' \in B_n(\pi, A)} \prod_{m \in I_{n-1,A}} \alpha_m(\pi'_{\overline{P}_m}). \qquad (1)$$

Algorithm 1 Estimation of β
Require: $\lambda$
Ensure: $\hat\beta_n$ ($n = 1, \ldots, N$) and $D$
1: $G_n \leftarrow \emptyset$ for $n = 1, 2, \ldots, N$
2: for $n = 1, 2, \ldots, N$ do
3:   for $\pi \in \{0,1\}^{P_n}$ do
4:     Calculate $\hat\beta_n(\pi, A)$ for each $A \in \mathcal{A}$ by (2)
5:     Calculate $\hat{A}_{n,\pi} = \mathrm{argmax}_{A \in \mathcal{A}} \hat\beta_n(\pi, A)$
6:     $t_n(\pi) \leftarrow 0$ and $t'_n(\pi) \leftarrow 0$
7:     for $j = 1, \ldots, T/(3C)$ do
8:       Conduct an experiment with $\hat{A}_{n,\pi}$ and let $\omega \in \{0,1\}^N$ be the obtained result
9:       $t_n(\pi) \leftarrow t_n(\pi) + 1$ if $\omega_i = \pi_i$ for all $i \in P_n$
10:      $t'_n(\pi) \leftarrow t'_n(\pi) + 1$ if $\omega_i = \pi_i$ for all $i \in P_n$ and $\omega_n = 1$
11:    end for
12:    for $k = 0, 1$ do
13:      Extend $\pi$ to $\pi' \in \{0,1\}^{\overline{P}_n}$ with $\pi'_n = k$
14:      Compute $\check\alpha'_n(\pi')$ by (3)
15:      If (4) holds, then $G_n \leftarrow G_n \cup \{\pi'\}$
16:      Compute $\check\alpha_n(\pi')$ by (5)
17:    end for
18:  end for
19: end for
20: Compute $H_n$ and $D_n$ ($n = 1, 2, \ldots, N$) by (6) and (7)
21: return $\hat\beta_n$ ($n = 1, 2, \ldots, N$) and $D = \{D_n \mid n = 1, 2, \ldots, N\}$

Algorithm 2 Estimation of α
Require: $\hat\beta_n$ ($n = 1, \ldots, N$) and $D$
Ensure: $\hat\alpha_n$ ($n = 1, \ldots, N$)
1: for $n = 1, 2, \ldots, N$ and each $\pi \in \{0,1\}^{P_n}$ do
2:   $t'_n(\pi) \leftarrow 0$ and $t''_n(\pi) \leftarrow 0$
3:   Calculate $\hat{A}_{n,\pi} := \mathrm{argmax}_{A \in \mathcal{A}} \hat\beta_n(\pi, A)$
4:   for $j = 1, \ldots, T/(3C)$ do
5:     Conduct an experiment with $\hat{A}_{n,\pi}$ and let $\omega \in \{0,1\}^N$ be the obtained result
6:     for $m = 1, \ldots, N$ with $(\hat{A}_{n,\pi})_m = *$ do
7:       $t'_m(\omega_{P_m}) \leftarrow t'_m(\omega_{P_m}) + 1$
8:       $t''_m(\omega_{P_m}) \leftarrow t''_m(\omega_{P_m}) + 1$ if $\omega_m = 1$
9:     end for
10:  end for
11: end for
12: Compute an optimal solution $\hat\eta$ for (8)
13: for $t = 1, 2, \ldots, T/3$ do
14:   Sample $A_t$ from $U(\hat\eta)$
15:   Conduct an experiment with $A_t$ and let $\omega \in \{0,1\}^N$ be the obtained realization
16:   for $n = 1, \ldots, N$ with $(A_t)_n = *$ do
17:     $t'_n(\omega_{P_n}) \leftarrow t'_n(\omega_{P_n}) + 1$
18:     $t''_n(\omega_{P_n}) \leftarrow t''_n(\omega_{P_n}) + 1$ if $\omega_n = 1$
19:   end for
20: end for
21: for $n = 1, 2, \ldots, N$ and $\pi \in \{0,1\}^{\overline{P}_n}$ do
22:   Compute $\hat\alpha'_n(\pi)$ by (9) and $\hat\alpha_n(\pi)$ by (10)
23: end for
24: return $\hat\alpha_n$

Algorithm 3 Causal Bandit
1: Apply Algorithm 1 with $\lambda = C^3/N$ to obtain $\hat\beta_n$ ($n = 1, \ldots, N$) and $D$
2: Apply Algorithm 2 to obtain $\hat\alpha_n$ ($n = 1, \ldots, N$)
3: Calculate $\hat\mu(A)$ for each $A \in \mathcal{A}$ by (11)
4: return $\hat{A} := \mathrm{argmax}_{A \in \mathcal{A}} \hat\mu(A)$

Algorithm 1 consists of $N$ iterations. The $n$-th iteration computes the following objects: an estimate $\hat\beta_n$ of $\beta_n$, an intervention $\hat{A}_{n,\pi} \in \mathcal{A}$ for each $\pi \in \{0,1\}^{P_n}$, an estimate $\check\alpha_n$ of $\alpha_n$, and a set $G_n \subseteq \{0,1\}^{\overline{P}_n}$. We remark that the $\check\alpha_n$ in Algorithm 1 are used only for computing the estimates $\hat\beta_n$ and are not used for estimating $\mu$; an estimate of $\alpha_n$ is computed in the next phase of our algorithm.

At the beginning of the $n$-th iteration, we compute $\hat\beta_n(\pi, A)$ for each $\pi \in \{0,1\}^{P_n}$ and $A \in \mathcal{A}$ by substituting $\check\alpha_m$ for $\alpha_m$ in (1):
$$\hat\beta_n(\pi, A) = \sum_{\pi' \in B_n(\pi, A)} \prod_{m \in I_{n-1,A}} \check\alpha_m(\pi'_{\overline{P}_m}). \qquad (2)$$
Note that this $\hat\beta_n(\pi, A)$ can be computed once $\check\alpha_m$ ($m = 1, \ldots, n-1$) are available. For each $\pi \in \{0,1\}^{P_n}$, we then identify an intervention $\hat{A}_{n,\pi}$ that attains $\max_{A \in \mathcal{A}} \hat\beta_n(\pi, A)$. Using $\hat{A}_{n,\pi}$, we compute $\check\alpha_n(\pi')$ as follows, where $\pi' \in \{0,1\}^{\overline{P}_n}$ is an extension of $\pi$. We conduct $T/(3C)$ experiments with $\hat{A}_{n,\pi}$. Let $t_n(\pi)$ be the number of experiments among these $T/(3C)$ experiments in which the obtained realization $\omega \in \{0,1\}^N$ satisfies $\omega_i = \pi_i$ for each $i \in P_n$, and let $t'_n(\pi)$ be the number of experiments counted in $t_n(\pi)$ in which $\omega_n = 1$ also holds. We then compute $\check\alpha'_n(\pi')$ by
$$\check\alpha'_n(\pi') = \begin{cases} t'_n(\pi)/t_n(\pi) & \text{if } \pi'_n = 1, \\ 1 - t'_n(\pi)/t_n(\pi) & \text{if } \pi'_n = 0. \end{cases} \qquad (3)$$
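The counting step in (3) can be written compactly as in the following sketch (ours; the `run_experiment` callback is a hypothetical stand-in for conducting one experiment):

```python
def estimate_alpha_check(run_experiment, A_hat, n, parent_idx, pi, num_trials):
    """Counting estimate (3): Prob(V_n = 1 | V_i = pi_i for i in P_n) under do(A_hat).

    run_experiment(A) is a hypothetical callback returning one realization omega
    (a list of 0/1 values); parent_idx lists the indices in P_n, and pi lists the
    desired parent values in the same order.
    """
    t, t_prime = 0, 0
    for _ in range(num_trials):              # num_trials plays the role of T/(3C)
        omega = run_experiment(A_hat)
        if all(omega[i] == p for i, p in zip(parent_idx, pi)):
            t += 1
            t_prime += omega[n]
    # The ratio is alpha-check' for pi'_n = 1; one minus the ratio gives pi'_n = 0.
    return None if t == 0 else t_prime / t
```

The case $t_n(\pi) = 0$ is returned as `None` here only to keep the sketch total; in the algorithm, such poorly observed patterns are exactly the ones handled by the truncation sets $G_n$ and $H_n$ introduced next.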
The vector $\pi' \in \{0,1\}^{\overline{P}_n}$ is added to $G_n$ if
$$\check\alpha'_n(\pi')\,\hat\beta_n(\pi, \hat{A}_{n,\pi}) \leq 2e\,S(\lambda), \qquad (4)$$
where $S(\lambda)$ is defined as
$$S(\lambda) := \frac{12\lambda N^2 C \log T}{T}.$$
This $G_n$ collects those $\pi' \in \{0,1\}^{\overline{P}_n}$ for which $\check\alpha'_n(\pi')$ is too small to estimate $\alpha_n(\pi')$ with sufficient accuracy. Then $\check\alpha_n(\pi')$ is determined by replacing $\check\alpha'_n(\pi')$ with $0$ for $\pi' \in G_n$:
$$\check\alpha_n(\pi') = \begin{cases} \check\alpha'_n(\pi') & \text{if } \pi' \notin G_n, \\ 0 & \text{otherwise}. \end{cases} \qquad (5)$$
This replacement contributes to reducing the relative estimation error of $\hat\beta_{n'}$ in the subsequent iterations ($n' = n+1, \ldots, N$).

After iterating over all $n = 1, 2, \ldots, N$, the algorithm computes $H_n$ and $D_n$ ($n = 1, 2, \ldots, N$) defined by
$$H_n = \left\{\pi \in \{0,1\}^{\overline{P}_n} \,\middle|\, \hat\beta_n(\pi_{P_n}, \hat{A}_{n,\pi_{P_n}}) \leq 8e\,C^2 S(\lambda)\right\}, \qquad (6)$$
$$D_n = G_n \cup H_n. \qquad (7)$$
This $D_n$ contributes to bounding the absolute error of the estimation of $\hat\beta_n(\pi_{P_n})$ for $\pi \notin D_n$. The algorithm returns the estimates $\hat\beta_n$ and the family $D := \{D_n \mid n = 1, 2, \ldots, N\}$.

3.3. Second Phase: Estimation of α

In this phase, our algorithm computes an estimate $\hat\alpha_n$ of $\alpha_n$ for all $n = 1, \ldots, N$. The pseudo-code for this phase is given in Algorithm 2. As input, it receives $\hat\beta_n$ ($n = 1, \ldots, N$) and $D$ from Algorithm 1.

Algorithm 2 consists of two parts. The first part conducts $T/(3C)$ experiments with $\hat{A}_{n,\pi}$ (computed from $\hat\beta_n(\pi, A)$, $A \in \mathcal{A}$) for each $n = 1, \ldots, N$ and $\pi \in \{0,1\}^{P_n}$. This is the same process used to compute $\check\alpha'_n$ in Algorithm 1.

Let $D'_n := \{\pi \in \{0,1\}^{P_n} \mid \pi^0, \pi^1 \in D_n\}$, where $\pi^k$ is the extension of $\pi \in \{0,1\}^{P_n}$ onto $\{0,1\}^{\overline{P}_n}$ with $\pi^k_n = k$. Let us define the set $J_n := \{0,1\}^{P_n} \setminus D'_n$ and the constant $r_{n,\pi} := \hat\beta_n(\pi, \hat{A}_{n,\pi})/C$ for each $n = 1, \ldots, N$ and $\pi \in \{0,1\}^{P_n}$. In the second part, the algorithm solves the following optimization problem:
$$\min_{\eta \in [0,1]^{\mathcal{A}}} \max_{A \in \mathcal{A}} \sum_{n=1}^{N} \sum_{\pi \in J_n} \frac{\hat\beta_n^2(\pi, A)}{\sum_{A' \in \mathcal{A}} \eta_{A'} \hat\beta_n(\pi, A') + r_{n,\pi}} \quad \text{subject to} \quad \sum_{A \in \mathcal{A}} \eta_A = 1. \qquad (8)$$
Note that, for each $n = 1, 2, \ldots, N$, we have $\pi \in J_n$ only if $\hat\beta_n(\pi, \hat{A}_{n,\pi}) > 0$, according to Line 20 of Algorithm 1. Thus the denominator is positive for every $\pi \in J_n$, and the above optimization problem is well-defined. Let $\hat\eta$ be an optimal solution of (8), and consider the distribution $U(\hat\eta)$ over $\mathcal{A}$ that generates $A$ with probability $\hat\eta_A$. The second part samples interventions according to $U(\hat\eta)$ and uses them to conduct experiments, $T/3$ times (an illustrative sketch of this design-and-sampling step is given at the end of Section 3.4). For each $n = 1, \ldots, N$ and $\pi \in \{0,1\}^{P_n}$, the algorithm counts the number $t'_n(\pi)$ (resp., $t''_n(\pi)$) of experiments that result in $\omega \in \{0,1\}^N$ with $\omega_{P_n} = \pi$ (resp., $\omega_{P_n} = \pi$ and $\omega_n = 1$). Then $\hat\alpha'_n(\pi)$ ($n = 1, \ldots, N$, $\pi \in \{0,1\}^{\overline{P}_n}$) is defined by
$$\hat\alpha'_n(\pi) = \begin{cases} t''_n(\pi_{P_n})/t'_n(\pi_{P_n}) & \text{if } \pi_n = 1, \\ 1 - t''_n(\pi_{P_n})/t'_n(\pi_{P_n}) & \text{if } \pi_n = 0. \end{cases} \qquad (9)$$
The output $\hat\alpha_n$ is defined by
$$\hat\alpha_n(\pi) = \begin{cases} \hat\alpha'_n(\pi) & \text{if } \pi \notin D_n, \\ 0 & \text{otherwise}. \end{cases} \qquad (10)$$

3.4. Regret bound

The pseudo-code of our entire algorithm is provided in Algorithm 3. It computes an estimate $\hat\beta$ of $\beta$ by Algorithm 1 and then computes $\hat\alpha$ by Algorithm 2. It then computes an estimate $\hat\mu$ of $\mu$ by
$$\hat\mu(A) = \sum_{\pi \in B(A)} \prod_{n \in I_{N,A}} \hat\alpha_n(\pi_{\overline{P}_n}) \qquad (11)$$
for each $A \in \mathcal{A}$. The algorithm returns an intervention $\hat{A} \in \mathcal{A}$ that maximizes $\hat\mu$.

Let us define $\gamma^*$ as the optimum value of the following problem:
$$\gamma^* := \min_{\eta \in [0,1]^{\mathcal{A}}} \max_{A \in \mathcal{A}} \sum_{n=1}^{N} \sum_{\pi \in \{0,1\}^{P_n} : \beta_n(\pi, A) > 0} \frac{\beta_n^2(\pi, A)}{\sum_{A' \in \mathcal{A}} \eta_{A'} \beta_n(\pi, A')} \quad \text{subject to} \quad \sum_{A \in \mathcal{A}} \eta_A = 1. \qquad (12)$$
The regret bound of Algorithm 3 is parameterized by the optimum value $\gamma^*$:

Theorem 1. The regret $R_T$ of Algorithm 3 satisfies
$$R_T = O\left(\sqrt{\frac{\max\{\gamma^*, N\} \log(|\mathcal{A}|T)}{T}}\right).$$

The notation $O(\cdot)$ is used here under the assumption that $N$ is sufficiently small with respect to $T$ but not negligible. The optimum value $\gamma^*$ is bounded as follows, where $|A|$ denotes the number of nodes intervened by $A$, i.e., $|A| := |\{n \in [1, N] : A_n \in \{0,1\}\}|$:

Proposition 2. It holds that $N - \min_{A \in \mathcal{A}} |A| \leq \gamma^* \leq \min\{NC,\ N|\mathcal{A}|\}$.
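As a concrete illustration of the second phase, the following sketch (ours, under simplifying assumptions, and not the authors' implementation) approximates the design problem (8) with a generic nonlinear solver and then samples interventions from $U(\hat\eta)$. The matrix `B` holds $\hat\beta_n(\pi, A)$ for the surviving pairs $(n, \pi)$ with $\pi \in J_n$ (rows) and the arms $A \in \mathcal{A}$ (columns), `r` holds $r_{n,\pi}$, and the constant used for `r` in the toy example is a placeholder for the paper's $C$.

```python
# Sketch of the design step (8) and the importance-sampling draw of Algorithm 2.
import numpy as np
from scipy.optimize import minimize

def solve_design(B, r):
    """Approximately solve (8): min over eta in the simplex of the worst-case sum."""
    K, M = B.shape
    def objective(eta):
        denom = B @ eta + r                       # sum_{A'} eta_{A'} beta_hat + r_{n,pi}
        return np.max(np.sum(B ** 2 / denom[:, None], axis=0))
    cons = [{"type": "eq", "fun": lambda eta: eta.sum() - 1.0}]
    res = minimize(objective, np.full(M, 1.0 / M), method="SLSQP",
                   bounds=[(0.0, 1.0)] * M, constraints=cons)
    eta = np.clip(res.x, 0.0, None)               # guard against tiny negatives
    return eta / eta.sum()

def sample_arms(eta_hat, num_rounds, rng=None):
    """Draw arm indices A_t ~ U(eta_hat) for the second part of Algorithm 2."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(len(eta_hat), size=num_rounds, p=eta_hat)

# Toy usage with 3 surviving (n, pi) pairs and 4 arms.
B = np.array([[0.6, 0.1, 0.3, 0.2],
              [0.2, 0.7, 0.1, 0.4],
              [0.1, 0.2, 0.8, 0.3]])
r = B.max(axis=1) / 4     # r_{n,pi} = beta_hat(pi, A_hat_{n,pi}) / C, with C replaced by a toy value 4
eta_hat = solve_design(B, r)
arms = sample_arms(eta_hat, num_rounds=10)
```

In the experiments of Section 5, the authors report replacing this step in their implementation with the simpler choice $\eta_A = 1/C$ for $A = \hat{A}_{n,\pi}$, which avoids solving (8) altogether.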
Since the lower-bound for the general best-arm identification problem is $\Omega(\sqrt{|\mathcal{A}|/T})$ (Audibert & Bubeck, 2010, Theorem 4), our algorithm provides a better regret bound when the number of interventions $|\mathcal{A}|$ is large compared with $\gamma^* \leq NC$, which depends only on the causal graph structure.

Remark 3. We presented Algorithms 1, 2, and 3 for the setting in which every $\alpha_n(\pi)$ is unknown. However, our algorithms can be applied, with minor modifications, even when $\alpha_n(\pi)$ is known for some $n = 1, \ldots, N$ and $\pi \in \{0,1\}^{\overline{P}_n}$. In this case, we let $C$ denote the number of unknown $\alpha_n(\pi)$. The modified algorithm simply skips the experiments for estimating the known $\alpha_n(\pi)$, and we can define $\hat\beta_n(\pi_{P_n}, A) = 0$ for such $n$ and $\pi$. We then redefine $\gamma^*$ by replacing the corresponding $\beta_n(\pi_{P_n}, A)$ with $0$ in (12), and the bound in Theorem 1 remains valid for this decreased $\gamma^*$. In particular, we can recover the regret bound in (Lattimore et al., 2016, Theorem 3) as follows:

Corollary 4. Suppose that $\alpha_n(\pi)$ is known for every $n < N$ and $\pi \in \{0,1\}^{\overline{P}_n}$. Then the regret $R_T$ of Algorithm 3 satisfies $R_T = O(\sqrt{\gamma^* \log(|\mathcal{A}|T)/T})$, where
$$\gamma^* = \min_{\eta \in [0,1]^{\mathcal{A}}} \max_{A \in \mathcal{A}} \sum_{\pi \in \{0,1\}^{P_N} : \beta_N(\pi, A) > 0} \frac{\beta_N^2(\pi, A)}{\sum_{A' \in \mathcal{A}} \eta_{A'} \beta_N(\pi, A')} \quad \text{subject to} \quad \sum_{A \in \mathcal{A}} \eta_A = 1.$$

Remark 5. Our problem setting is often called hard intervention, in which an intervention directly fixes the realization of a node $v_n$ to $A_n \in \{0,1\}$. In contrast, Sen et al. (2017) introduced the soft intervention model, in which an intervention changes the conditional probability $\alpha_n$ of a node $v_n$. They in fact considered a simple case in which the graph has a single node $v_k$ such that $P_N = P_k \cup \{k\}$ and whose conditional probability can be controlled by soft interventions, and they proved a parameterized regret bound. We remark here that their model can be implemented by the hard intervention model with an arbitrary set of interventions; the details of this implementation are presented in our full paper.

4. Regret Analysis

This section presents an approach for proving Theorem 1. Complete proofs of all the statements are presented in the full version (Yabe et al., 2018).

4.1. Accuracy of Algorithm 1

For $n = 1, 2, \ldots, N$, let $\check\alpha_n$ and $\check\alpha'_n$ be the stochastic estimates computed in Algorithm 1, and let $\hat{A}_{n,\pi} := \mathrm{argmax}_{A \in \mathcal{A}} \hat\beta_n(\pi, A)$ be the intervention determined from the estimate $\hat\beta_n$. Let $G$ be defined by $G = \{G_n \mid n = 1, \ldots, N\}$. Using $G$, we define $\alpha_{n,G}$ and $\beta_{n,G}$ as follows. For each $n \in [1, N]$ and $\pi \in \{0,1\}^{\overline{P}_n}$, we define $\alpha_{n,G}(\pi)$ by
$$\alpha_{n,G}(\pi) = \begin{cases} \alpha_n(\pi) & \text{if } \pi \notin G_n, \\ 0 & \text{otherwise}. \end{cases}$$
For each $n \in [1, N]$, $\pi \in \{0,1\}^{P_n}$, and $A \in \mathcal{A}$, we define $\beta_{n,G}(\pi, A)$ by
$$\beta_{n,G}(\pi, A) = \sum_{\pi' \in B_n(\pi, A)} \prod_{m \in I_{n-1,A}} \alpha_{m,G}(\pi'_{\overline{P}_m}).$$
Thus $\alpha_{n,G}(\pi)$ is obtained from $\alpha_n(\pi)$ by truncating its value to $0$ if $\pi \in G_n$, and $\beta_{n,G}$ is defined from $\alpha_{n,G}$. We define $\alpha_{n,D}$ and $\beta_{n,D}$ in the same way. Since $G_n \subseteq D_n$, we observe that $\beta_{n,D}(\pi, A) \leq \beta_{n,G}(\pi, A) \leq \beta_n(\pi, A)$. Similarly, for $A \in \mathcal{A}$, we define $\mu_D(A)$ by
$$\mu_D(A) = \sum_{\pi \in B(A)} \prod_{m \in I_{N,A}} \alpha_{m,D}(\pi_{\overline{P}_m}). \qquad (13)$$

The following proposition gives the error bounds for the outputs $\hat\beta_n$ and $D$ of Algorithm 1.

Proposition 6. Let $\hat\beta_n$ and $D$ be the outputs of Algorithm 1 with parameter $\lambda \geq 1$. Then the following hold with a probability of at least $1 - 6C/T$: for every $n \in [1, N]$, $\pi \in \{0,1\}^{\overline{P}_n} \setminus D_n$ with $\bar\pi = \pi_{P_n}$, and $A \in \mathcal{A}$,
$$\tfrac{1}{e}\,\beta_{n,D}(\bar\pi, A) \leq \hat\beta_n(\bar\pi, A) \leq e\,\beta_n(\bar\pi, A), \qquad (14)$$
$$\alpha_n(\pi)\,\beta_n(\bar\pi, \hat{A}_{n,\bar\pi}) \geq S(\lambda), \qquad (15)$$
$$\beta_n(\bar\pi, A) \leq e\,\hat\beta_n(\bar\pi, A) + e\,\hat\beta_n(\bar\pi, \hat{A}_{n,\bar\pi})/C, \qquad (16)$$
$$\mu(A) - \mu_D(A) \leq 8e^2(C^3 + C)\,S(\lambda). \qquad (17)$$

We prepare the following three lemmas to prove Proposition 6. The first lemma is an application of Chernoff's bound, and it bounds the relative error of the estimate $\check\alpha'_n$.

Lemma 7. Let $n \in [1, N]$, $\pi \in \{0,1\}^{\overline{P}_n}$, and $\bar\pi = \pi_{P_n}$.
(i) If $\alpha_n(\pi)\,\beta_n(\bar\pi, \hat{A}_{n,\bar\pi}) \leq S(\lambda)$, then the following holds with a probability of at least $1 - 2/T$:
$$\check\alpha'_n(\pi)\,\beta_n(\bar\pi, \hat{A}_{n,\bar\pi}) \leq 2S(\lambda).$$
(ii) If $\alpha_n(\pi)\,\beta_n(\bar\pi, \hat{A}_{n,\bar\pi}) \geq S(\lambda)$, then the following holds with a probability of at least $1 - 3/T$:
$$\left(1 - \tfrac{1}{N}\right)\alpha_n(\pi) \leq \check\alpha'_n(\pi) \leq \left(1 + \tfrac{1}{N}\right)\alpha_n(\pi).$$

The second lemma bounds the gap produced by the truncation of $\alpha_n$ that is used to introduce $\alpha_{n,G}$ and $\alpha_{n,D}$. We use the notation $H'_n := \{\pi_{P_n} \mid \pi \in H_n\}$.

Lemma 8. (i) Let $n \in [1, N]$ and $\pi \in \{0,1\}^{P_n}$. For every $A \in \mathcal{A}$, it holds that
$$\beta_n(\pi, A) - \beta_{n,G}(\pi, A) \leq \sum_{m \in I_{n-1,A}} \sum_{\pi' \in G_m} \max_{A' \in \mathcal{A}} \alpha_m(\pi')\,\beta_{m,G}(\pi'_{P_m}, A').$$
(ii) For every $A \in \mathcal{A}$, it holds that
$$\sum_{m \in I_{N,A}} \sum_{\pi \in G_m} \max_{A' \in \mathcal{A}} \alpha_m(\pi)\,\beta_{m,G}(\pi_{P_m}, A') \leq \sum_{m \in I_{N,A}} \sum_{\pi' \in H'_m} \max_{A' \in \mathcal{A}} \beta_{m,G}(\pi', A').$$

The third lemma bounds the relative error of $\hat\beta_n$. The statement can be proven by induction on $n$ on the basis of Lemma 7.

Lemma 9. The following hold for every $n = 1, 2, \ldots, N$, $\pi \in \{0,1\}^{\overline{P}_n}$ with $\bar\pi = \pi_{P_n}$, and $A \in \mathcal{A}$ with a probability of at least $1 - 6C/T$:
$$\tfrac{1}{e}\,\beta_{n,G}(\bar\pi, A) \leq \hat\beta_n(\bar\pi, A) \leq e\,\beta_{n,G}(\bar\pi, A),$$
$$\pi \in G_n \quad \text{if } \alpha_n(\pi)\,\beta_n(\bar\pi, \hat{A}_{n,\bar\pi}) < S(\lambda),$$
$$\left(1 - \tfrac{1}{N}\right)\alpha_n(\pi) \leq \check\alpha'_n(\pi) \leq \left(1 + \tfrac{1}{N}\right)\alpha_n(\pi) \quad \text{if } \alpha_n(\pi)\,\beta_n(\bar\pi, \hat{A}_{n,\bar\pi}) \geq S(\lambda).$$

Proposition 6 is then proven on the basis of Lemmas 7–9.

4.2. Accuracy of Algorithm 2

This subsection bounds the gap between the true value $\mu(A)$ and its estimate $\hat\mu(A)$ given by Algorithm 2, assuming that the input of Algorithm 2, namely the output of Algorithm 1, satisfies the conditions in Proposition 6.

Proposition 10. Suppose that $\lambda \geq 1$ and that $\hat\beta_n$ and $D$ satisfy (14), (15), (16), and (17). Let $\hat\alpha_n$ be the output of Algorithm 2, and let $\hat\mu$ be defined by (11). Then the following holds for every $A \in \mathcal{A}$ with a probability of at least $1 - (10C+2)/T$:
$$|\mu(A) - \hat\mu(A)| \leq \sqrt{\frac{2e^6\gamma^* \log(|\mathcal{A}|T)}{T}} + \frac{8e^2 C^3 \log T}{\lambda T} + 8e^2(C^3 + C)\,S(\lambda).$$

Recall that $I_{N,A} := \{m \in [1, N] \mid A_m = *\}$ for $A \in \mathcal{A}$. For $n \in [1, N]$ and $\pi \in \{0,1\}^{\overline{P}_n}$, let $\Delta\alpha_n(\pi) := \hat\alpha_n(\pi) - \alpha_{n,D}(\pi)$. For $A \in \mathcal{A}$ and $J \subseteq I_{N,A}$, we define $f_J(A)$ by
$$f_J(A) = \sum_{\pi \in B(A)} \prod_{m \in I_{N,A} \setminus J} \alpha_{m,D}(\pi_{\overline{P}_m}) \prod_{n \in J} \Delta\alpha_n(\pi_{\overline{P}_n}).$$
Observe that $f_J(A)$ is obtained from the definition (13) of $\mu_D$ by replacing $\alpha_{n,D}(\pi_{\overline{P}_n})$ with $\Delta\alpha_n(\pi_{\overline{P}_n})$ for $n \in J$. Recall also that $\hat\mu(A)$ is obtained from the definition of $\mu_D$ by replacing $\alpha_{n,D}(\pi)$ with $\hat\alpha_n(\pi)$ for all $n \in I_{N,A}$. Based on these relationships, we have the following lemma:

Lemma 11. For $A \in \mathcal{A}$, it holds that
$$\mu_D(A) = f_{\emptyset}(A), \qquad \hat\mu(A) = \sum_{J \subseteq I_{N,A}} f_J(A). \qquad (18)$$

For $j \in I_{N,A}$, let $f_j(A) := f_{\{j\}}(A)$. We provide probabilistic bounds for the linear terms ($|J| = 1$) and the superlinear terms ($|J| \geq 2$) in (18) separately, using Hoeffding's inequality.

Lemma 12. Suppose that (14), (15), and (16) hold.
(i) The following holds with a probability of at least $1 - (C+2)/T$:
$$\left|\sum_{j \in I_{N,A}} f_j(A)\right| \leq \sqrt{\frac{2e^6\gamma^* \log(|\mathcal{A}|T)}{T}}.$$
(ii) The following holds with a probability of at least $1 - 9C/T$:
$$\sum_{J \subseteq I_{N,A} : |J| \geq 2} |f_J(A)| \leq \frac{8e^2 C^3 \log T}{\lambda T}.$$

The above two lemmas imply Proposition 10.

4.3. Proof of Theorem 1

We present a sketch of the proof of Theorem 1 on the basis of Propositions 6 and 10. Putting $\lambda = C^3/N$, by Propositions 2, 6, and 10, the following holds for every $A \in \mathcal{A}$ with a probability of at least $1 - (16C+2)/T$:
$$|\mu(A) - \hat\mu(A)| \leq \sqrt{\frac{8e^6 \max\{\gamma^*, N\} \log(|\mathcal{A}|T)}{T}} + \frac{192e^2 N C^7 \log T}{T}.$$
Let $A^* = \mathrm{argmax}_{A \in \mathcal{A}} \mu(A)$ and $\hat{A} = \mathrm{argmax}_{A \in \mathcal{A}} \hat\mu(A)$. Then it holds that
$$\mu(A^*) - \mu(\hat{A}) \leq |\mu(A^*) - \hat\mu(A^*)| + |\mu(\hat{A}) - \hat\mu(\hat{A})| = O\left(\sqrt{\frac{\max\{\gamma^*, N\} \log(|\mathcal{A}|T)}{T}}\right).$$
This implies the desired regret bound.

5. Experiments

We now demonstrate the performance of the proposed algorithm through experimental evaluations and compare it with a baseline algorithm (Audibert & Bubeck, 2010), which was proposed for the general bandit problem and thus cannot take advantage of the known causal graph structure.
Figure 2. The average regret over synthetic and real-world instances: (i) Synthetic (h = 5), (ii) Alarm instance, (iii) Water instance. Each panel plots the average regret $R_T$ against the number of rounds $T$ for the proposed algorithm and the baseline with budgets $b = 2, 4, 8$.

Instances We evaluated the algorithms on both synthetic and real-world instances. The detailed experimental setting is presented in the full version (Yabe et al., 2018). Recall that an instance of the causal bandit problem consists of a DAG $G$, an intervention set $\mathcal{A}$, and $\alpha_n$ ($n = 1, \ldots, N$). In the synthetic instances, the DAG $G$ is a directed complete binary tree of height 4; hence the number of nodes is $N = 2^5 - 1 = 31$, and the number of uncertain parameters is $C = 2^2 \cdot (2^4 - 1) + 2^0 \cdot 2^4 = 76$. In the real-world instances, the DAG $G$ is constructed from the Alarm and the Water data sets in the Bayesian Network Repository¹. The numbers $N$ of nodes in the DAGs constructed from the Alarm and Water data sets are 37 and 32, and the numbers $C$ of uncertain parameters are 116 and 248, respectively. For each $G$, we consider interventions on the leaves that fix exactly $b \leq N$ leaf nodes to 1 and the other leaves to 0. We call the parameter $b$ the budget; the number of interventions $|\mathcal{A}|$ is then controlled by the budget. For each $n \in \{1, \ldots, N\}$ and $\pi \in \{0,1\}^{\overline{P}_n}$, we generate $\alpha_n(\pi)$ from the uniform distribution over $[0, 1]$. For each of these instances, we executed the algorithms 10 times and compared their average regrets.

¹ http://www.cs.huji.ac.il/~galel/Repository/

Implementation of the proposed algorithm The algorithm given in Section 3 is designed conservatively in order to obtain the theoretical regret bound (Theorem 1), and there is room to modify it to be more efficient in practice, although the theoretical regret bound may no longer hold. In our implementation, we introduced the following three modifications. First, while Algorithm 2 discards the samples obtained for computing $\check\alpha$ in Algorithm 1 in order to maintain the independence between $\hat\beta$ and $\hat\alpha$, our implementation also uses these samples in Algorithm 2. Next, we disable the truncation mechanism of Algorithm 1 by setting $\lambda = 0$. We expect these two modifications to make the estimates more accurate. Finally, instead of solving (8), we set $\eta_A = 1/C$ if $A = \hat{A}_{n,\pi}$ for some $n \in [1, N]$ and $\pi \in \{0,1\}^{P_n}$, and $\eta_A = 0$ otherwise. Since solving (8) is time-consuming, this modification makes the algorithm faster.

Experimental results Figure 2(i) shows the average regret over the synthetic instances against the number of rounds $T \in \{C, 2C, \ldots, 9C\}$. Figures 2(ii) and (iii) respectively illustrate the average regret for the real-world instances constructed from the Alarm and the Water data sets. The results show that the proposed algorithm outperforms the baseline on every instance. In particular, the gap is remarkably large ($> 0.2$) on the Alarm data set (ii) with a large number of interventions ($b = 4, 8$, corresponding to $|\mathcal{A}| = 793$ and $3796$, respectively) and a small number of samples ($T \leq 4C = 464$). In these cases, the baseline cannot apply every intervention even once. On the other hand, the regret of the proposed algorithm grows only slowly with respect to the number of arms $|\mathcal{A}|$ on all instances. Thus the proposed algorithm attains small regret even when the number of interventions $|\mathcal{A}|$ is 30 times larger than the number of experiments $T$.
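For concreteness, the following sketch (our reconstruction of the synthetic setup described above, not the authors' code) builds such an instance: a directed complete binary tree whose edges point toward the root (the target node), uniformly random $\alpha_n$, and budget-$b$ interventions that fix the leaves; all function names are ours.

```python
# Sketch of a synthetic causal-bandit instance as described in Section 5 (our reconstruction).
import itertools
import random

def binary_tree_instance(height=4, seed=0):
    """Directed complete binary tree with edges pointing toward the root (the target).

    Returns (parents, alpha): parents[n] lists the parent indices of node n (nodes are
    topologically ordered, root last), and alpha[n] maps each parent-value tuple to
    Prob(V_n = 1 | parents), drawn uniformly from [0, 1].
    """
    rng = random.Random(seed)
    parents = [[] for _ in range(2 ** height)]   # the leaves have no parents
    level_start, level_size = 0, 2 ** height
    while level_size > 1:                        # merge pairs level by level up to the root
        for j in range(level_size // 2):
            parents.append([level_start + 2 * j, level_start + 2 * j + 1])
        level_start += level_size
        level_size //= 2
    alpha = [{pa: rng.random() for pa in itertools.product([0, 1], repeat=len(p))}
             for p in parents]
    return parents, alpha

def budget_interventions(num_nodes, leaves, b):
    """All arms that intervene on every leaf, setting exactly b leaves to 1."""
    arms = []
    for ones in itertools.combinations(leaves, b):
        A = ['*'] * num_nodes
        for v in leaves:
            A[v] = 1 if v in ones else 0
        arms.append(tuple(A))
    return arms

parents, alpha = binary_tree_instance(height=4)
leaves = [n for n, p in enumerate(parents) if not p]
arms = budget_interventions(len(parents), leaves, b=2)
print(len(parents), len(arms))                   # 31 nodes, C(16, 2) = 120 arms
```

Together with the sampling sketch given after Section 2, this provides a small end-to-end test bed for comparing intervention-selection strategies.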
6. Conclusion

In this paper, we proposed the first algorithm for the general causal bandit problem, whereas existing algorithms could deal only with localized interventions, and we proved a novel regret bound $O(\sqrt{\gamma \log(|\mathcal{A}|T)/T})$, which is logarithmic with respect to the number of arms. Our experimental results show that the proposed algorithm is applicable to systems where the number of interventions $|\mathcal{A}|$ is much larger than $T$. One important future research direction would be to prove a gap-dependent bound, as Sen et al. (2017) have proven for localized interventions. Another research direction, mentioned in (Lattimore et al., 2016), would be to incorporate a causal discovery algorithm so as to estimate the structure of the causal graph, which is currently assumed to be known in advance.

Acknowledgements

Daisuke Hatano, Hanna Sumita, Naonori Kakimura, Takuro Fukunaga, and Ken-ichi Kawarabayashi are supported by JST ERATO Kawarabayashi Large Graph Project, Grant Number JPMJER1201, Japan.

References

Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646, 2014.

Audibert, J.-Y. and Bubeck, S. Best arm identification in multi-armed bandits. In The 23rd Conference on Learning Theory, pp. 41–53, 2010.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.

Eberhardt, F., Glymour, C., and Scheines, R. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among n variables. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pp. 178–184. AUAI Press, 2005.

Gabillon, V., Ghavamzadeh, M., and Lazaric, A. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems, pp. 3212–3220, 2012.

Hauser, A. and Bühlmann, P. Two optimal strategies for active learning of causal models from interventional data. International Journal of Approximate Reasoning, 55(4):926–939, 2014.

Hu, H., Li, Z., and Vetta, A. R. Randomized experimental design for causal graph discovery. In Advances in Neural Information Processing Systems, pp. 2339–2347, 2014.

Kaufmann, E., Cappé, O., and Garivier, A. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.

Kim, D. J., Ferrin, D. L., and Rao, H. R. A trust-based consumer decision-making model in electronic commerce: The role of trust, perceived risk, and their antecedents. Decision Support Systems, 44(2):544–564, 2008.

Lattimore, F., Lattimore, T., and Reid, M. D. Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems, pp. 1181–1189, 2016.

Meinshausen, N., Hauser, A., Mooij, J. M., Peters, J., Versteeg, P., and Bühlmann, P. Methods for causal inference from gene perturbation experiments and validation. Proceedings of the National Academy of Sciences, 113(27):7361–7368, 2016.
Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J., and Schölkopf, B. Distinguishing cause from effect using observational data: Methods and benchmarks. The Journal of Machine Learning Research, 17(1):1103–1204, 2016.

Pearl, J. Causality. Cambridge University Press, 2009.

Sen, R., Shanmugam, K., Dimakis, A. G., and Shakkottai, S. Identifying best interventions through online importance sampling. In International Conference on Machine Learning, pp. 3057–3066, 2017.

Shanmugam, K., Kocaoglu, M., Dimakis, A. G., and Vishwanath, S. Learning causal graphs with small interventions. In Advances in Neural Information Processing Systems, pp. 3195–3203, 2015.

Shimizu, S., Inazumi, T., Sogawa, Y., Hyvärinen, A., Kawahara, Y., Washio, T., Hoyer, P. O., and Bollen, K. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12(Apr):1225–1248, 2011.

Spirtes, P. and Glymour, C. An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review, 9(1):62–72, 1991.

Splawa-Neyman, J., Dabrowska, D. M., and Speed, T. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, pp. 465–472, 1990.

Yabe, A., Hatano, D., Sumita, H., Ito, S., Kakimura, N., Fukunaga, T., and Kawarabayashi, K. Causal bandits with propagating inference. arXiv preprint arXiv:1806.02252, 2018.