# Model-Based Causal Bayesian Optimization

Published as a conference paper at ICLR 2023

Scott Sussex (ETH Zürich, scott.sussex@inf.ethz.ch), Anastasiia Makarova (ETH Zürich), Andreas Krause (ETH Zürich)

How should we intervene on an unknown structural causal model to maximize a downstream variable of interest? This optimization of the output of a system of interconnected variables, also known as causal Bayesian optimization (CBO), has important applications in medicine, ecology, and manufacturing. Standard Bayesian optimization algorithms fail to effectively leverage the underlying causal structure. Existing CBO approaches assume noiseless measurements and do not come with guarantees. We propose model-based causal Bayesian optimization (MCBO), an algorithm that learns a full system model instead of only modeling intervention-reward pairs. MCBO propagates epistemic uncertainty about the causal mechanisms through the graph and trades off exploration and exploitation via the optimism principle. We bound its cumulative regret and obtain the first non-asymptotic bounds for CBO. Unlike in standard Bayesian optimization, our acquisition function cannot be evaluated in closed form, so we show how the reparameterization trick can be used to apply gradient-based optimizers. Empirically, we find that MCBO compares favorably with existing state-of-the-art approaches.

1 INTRODUCTION

Many applications, such as drug and material discovery, robotics, agriculture, and automated machine learning, require optimizing an unknown function that is expensive to evaluate. Bayesian optimization (BO) is an efficient framework for sequential optimization of such objectives (Močkus, 1975). The key idea in BO is to quantify uncertainty in the unknown function via a probabilistic model, and then use this model to navigate a trade-off between selecting inputs where the function output is favourable (exploitation) and selecting inputs that reduce uncertainty about the function (exploration). While most standard BO methods focus on a black-box setup (Figure 1a), in practice we often have more structure on the unknown function that can be used to improve sample efficiency. In this paper, we exploit structural knowledge in the form of a causal graph specified by a directed acyclic graph (DAG). In particular, we assume that actions can be modeled as interventions on a structural causal model (SCM) (Pearl, 2009) that contains the reward (function output) as a variable (Figure 1b). While we assume the graph structure to be known, we consider the functional relations in the SCM to be unknown. All variables in the SCM are observed along with the reward after each action. This causal BO setting has important potential applications, such as optimizing medical and ecological interventions (Aglietti et al., 2020b). For illustrative purposes, consider the example of an agronomist trying to find the optimal nitrogen fertilizer schedule for maximizing crop yield, described in Figure 1: the concentration of nitrogen in the soil at one timestep causes its concentration at later timesteps. To exploit the causal graph structure for optimization, we propose model-based causal Bayesian optimization (MCBO). MCBO explicitly models the full SCM and the accompanying uncertainty about all SCM components. This allows our algorithm to select interventions based on an optimistic strategy similar to that used by the upper confidence bound algorithm (Srinivas et al., 2010).
We show that this strategy leads to the first CBO algorithm with a cumulative regret guarantee. For a practical algorithm, maximizing the upper confidence bound in our setting is computationally more difficult, because uncertainty in all system components must be propagated through the entire estimated SCM to the reward variable. We show that an application of the reparameterization trick allows MCBO to be practically implemented with common gradient-based optimizers. Empirically, Published as a conference paper at ICLR 2023 Bayesian Optimization Causal Bayesian Optimization Figure 1: A visual comparison between the modelling assumptions of BO vs CBO. Circular nodes represent observed variables, squares represent action inputs and Y is the reward. Algorithms select a before observing X and Y . (a) In standard BO, the DAG has the structure shown regardless of the problem structure. (b) The DAG corresponding to our stylised agronomy example, where we aim to maximize crop yield Y . CBO takes this DAG as input for designing actions. X0 is an unmodifiable starting property of the soil, and X1 . . . X3 are the measured amounts of Nitrogen in the soil at different timesteps. Each observation is modelled with its own Gaussian process. a1 . . . a3 are possible interventions involving adding Nitrogen fertilizer to the soil. MCBO achieves competitive performance on existing CBO benchmarks and a related setting called function network BO (Astudillo & Frazier, 2021b). Contributions We introduce MCBO, a model-based algorithm for causal Bayesian optimization than can be applied with very generic classes of interventions. Using MCBO we prove the first sublinear cumulative regret bound for CBO. We show how the bound scales depending on the graph structure. We demonstrate that CBO can lead to a potentially exponential improvement in cumulative regret, with respect to the number of actions, compared to standard BO. By an application of the reparameterization technique, we show how our algorithm can be efficiently implemented with popular gradient-based optimizers. We evaluate MCBO on existing CBO benchmarks and the related setting of function network BO. Our results show that MCBO performs favorably compared to methods designed specifically for these tasks. 2 BACKGROUND AND PROBLEM STATEMENT We consider the problem of an agent interacting with an SCM for T rounds in order to maximize the value of a particular target variable. We start with introducing SCMs and the kinds of interventions an agent can perform on an SCM. In the following, we denote with [m] the set of integers {0, . . . , m}. Structural Causal Models An SCM is described by a tuple G, Y, X, F , Ω of the following elements: G is a known DAG; Y is the reward variable of interest; X = {Xi}m 1 i=0 is a set of observed random variables; the set F = {fi}m i=0 defines the functional relations between these variables; and Ω= {Ωi}m i=0 is a set of independent noise variables with zero-mean and known distribution. We use the notation Y and Xm interchangeably and assume the elements of X are topologically ordered, i.e., X0 is a root and Xm is a leaf. We use the notation pai {0, . . . , m} for the indices of the parents of the ith node, and Zi = {Xj}j pai for the parents of the ith node. We sometimes use Xi to refer to both the ith node and the ith random variable. 
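To make this notation concrete, here is a minimal sketch (not from the paper's code) of representing $\mathcal{G}$ by parent sets and computing a topological order; the edges below are an illustrative stand-in loosely following the agronomy example of Figure 1(b).

```python
# Minimal sketch: representing the DAG G of Figure 1(b) by parent sets and
# computing a topological order. Node m (here index 4) is the reward Y.
# The specific edges below are an illustrative assumption, not taken verbatim
# from the paper's figure.

parents = {
    0: [],        # X0: starting soil property (root, unmodifiable)
    1: [0],       # X1: soil nitrogen at timestep 1
    2: [1],       # X2: soil nitrogen at timestep 2
    3: [2],       # X3: soil nitrogen at timestep 3
    4: [0, 3],    # Y = X4: crop yield (reward node)
}

def topological_order(parents):
    """Kahn's algorithm over the parent-set representation."""
    children = {i: [] for i in parents}
    indegree = {i: len(p) for i, p in parents.items()}
    for i, ps in parents.items():
        for p in ps:
            children[p].append(i)
    frontier = [i for i, d in indegree.items() if d == 0]
    order = []
    while frontier:
        node = frontier.pop()
        order.append(node)
        for c in children[node]:
            indegree[c] -= 1
            if indegree[c] == 0:
                frontier.append(c)
    assert len(order) == len(parents), "graph must be acyclic"
    return order

print(topological_order(parents))  # e.g. [0, 1, 2, 3, 4]
```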
Published as a conference paper at ICLR 2023 Each Xi is generated according to the function fi : Zi Xi, taking the parent nodes Zi of Xi as input: xi = fi(zi) + ωi, where lowercase denotes a realization of the corresponding random variable. The reward is a scalar xm R. An observation Xi is defined over a compact set xi Xi Rd, and its parents are defined over Zi = Q j pai Xj for i [m 1]. Interventions At each interaction round, the agent performs an intervention on the SCM. In this work, we consider two types of intervention models. We consider a soft intervention model (Eberhardt & Scheines, 2007) where interventions are parameterized by controllable action variables. Let Ai Rq denote the compact space of an action ai and A be the space of all actions a = {ai}m i=0. We represent the actions as additional nodes in G (see Fig. 1): ai is a parent of only Xi, and hence an additional input to fi. Since fi is unknown, in our soft intervention model, the agent does not know apriori the functional effect of ai on Xi. A simple example of a soft intervention is a shift intervention xi = fi(zi, ai)+ωi = gi(zi)+ai +ωi for some function gi. A shift intervention might occur in our example of adding Nitrogen fertilizer to soil and then measuring the total soil Nitrogen concentration. While our theoretical results will focus on data obtained via soft interventions, our experiments also consider two other data sources. First, we consider a hard intervention model: hard interventions (often referred to as do-interventions) modify the targeted variable to a specific distribution independently of the variable s parents. For example, a doctor sets the dosage of a patient s medication, which fixes the dosage to a specific value (Aglietti et al., 2020b). Second, a special case under both intervention models is the collection of observational data, which is when no intervention is performed on the system. In the soft intervention model, not intervening on node i is equivalent to setting ai = 0. An example would be not applying any Nitrogen fertilizer to the soil. In practice, the agent may have access to some previous observational data before its first interaction with the system. In the following, we introduce the problem setup under the soft intervention model and then adapt it to the hard intervention model. Constraints on interventions In many applications, we may not be able to intervene on all nodes simultaneously. For example, a farmer may only have the capacity to apply fertilizer at two out of three possible time windows in Fig. 1. This results in an action space with cardinality constraints, written as a = {ai}m i=0 : i=0 1[ai =0] c, c 1 Problem statement We consider the problem of an agent sequentially interacting with an SCM, with known DAG G and a fixed but unknown set of functions F = {fi}m i=1 with fi : Zi Ai Xi. At round t we select actions a:,t = {ai,t}m i=0 and obtain observations x:,t = {xi,t}m i=0, where we add an additional subscript to denote the round of interaction. The action ai,t and the observation xi,t are related by xi,t = fi(zi,t, ai,t) + ωi,t, i [m]. (2) If i corresponds to a root node, the parent vector zi,t denotes an empty vector, and the output of fi only depends on the action ai,t. Since we cannot intervene on the target variable Xm, we fix am = 0. The reward is given by yt = fm(zm,t, am,t) + ωm,t, (3) which implicitly depends on the whole intervention vector a:,t. 
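As a concrete illustration of Eqs. (2) and (3), the following sketch forward-samples a small SCM under shift-type soft interventions. The mechanisms $g_i$, the noise scale, and the graph are illustrative assumptions, not the paper's benchmark systems; the only structure taken from the text is the topological sweep and the convention $a_m = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)

parents = {0: [], 1: [0], 2: [1], 3: [2], 4: [0, 3]}   # reward node m = 4
m = 4

# Illustrative stand-in mechanisms g_i(z_i); the true f_i are unknown to the agent.
def g(i, z):
    return np.tanh(np.sum(z)) if len(z) else 0.0

def sample_scm(a, noise_scale=0.1):
    """One rollout of Eq. (2): x_i = f_i(z_i, a_i) + w_i, with a shift intervention
    f_i(z_i, a_i) = g_i(z_i) + a_i. Returns all observations and the reward y."""
    x = {}
    for i in sorted(parents):                      # nodes are topologically ordered
        z = np.array([x[j] for j in parents[i]])
        omega = noise_scale * rng.standard_normal()
        x[i] = g(i, z) + a[i] + omega
    return x, x[m]                                 # reward y = x_m (Eq. (3))

a = {0: 0.0, 1: 0.5, 2: 0.0, 3: 0.2, 4: 0.0}       # a_m fixed to 0 (no reward intervention)
observations, y = sample_scm(a)
print(y)
```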
We define the action that maximizes the expected reward by a = arg max a A E[y|a], (4) where, unless otherwise stated, expectations are taken over noise ω. Performance metric Our agent s goal is to design a sequence of interventions {a:,t}T t=0 that achieves a high average expected reward. We hence study the cumulative regret (Lattimore & Szepesv ari, 2020) over a time horizon T : h E[y|a ] E[y|a:,t] i . (5) Published as a conference paper at ICLR 2023 A sublinear growth rate of RT with T implies vanishing average regret: RT /T 0 as T . As an alternative to cumulative regret, one can also study the simple regret E[y|a ] E[y|a T ]. The most appropriate metric depends on the application. In the Nitrogen fertilizer example, cumulative regret might be preferable because we care about obtaining high crop yields across all years, not just in one final year. Index notation Let xi,t = [xi,t,1, . . . , xi,t,d]T denote a vector where xi,t,l indicates indexing the component l [d] of the tth timepoint of the observations at the node i [m]. For functions with vector output, e.g., fi : Zi Xi, we sometimes consider notation with additional input to fi that indicates the output dimension: fi(z, a) = [fi(z, a, 1), . . . , fi(z, a, d)]T . Regularity assumptions We consider standard smoothness assumptions for the unknown functions fi : S Xi defined over a compact domain S (Srinivas et al., 2010). In particular, for each node i [m], we assume that fi( ) belongs to a reproducing kernel Hilbert space (RKHS) Hki, a space of smooth functions defined on S = Zi Ai. This means that fi Hki is induced by a kernel function ki : S S R where S = S [d]1. We also assume that ki(s, s ) 1 for every s, s S2. Moreover, the RKHS norm of fi( ) is assumed to be bounded fi ki Bi for some fixed constant Bi > 0. Finally, to ensure the compactness of the domains Zi, we assume that the noise ω is bounded, i.e., ωi [ 1, 1]d.3 Problem statement under hard interventions Under a hard intervention model, instead of selecting an action a in Eq. (4), the agent must select both a set of intervention targets I I P([m 1]) and their values a I AI A. For hard intervention we can rewrite Eq. (2) as xi = ai if i I fi(zi) + ωi otherwise, i [m], (6) where fi is unknown and employs the same regularity assumptions. Further constraints similar to Eq. (1) can be placed on either the intervention nodes I or action values a I. Finally, observational data corresponds to the empty intervention set I = . 2.1 RELATED WORK Optimal decision-making in SCMs has been the subject of several recent works, for example, in the bandit setting (Lattimore et al., 2016; Bareinboim et al., 2015). Aglietti et al. (2020b) introduce causal BO (CBO) focusing on hard interventions. CBO considers unobserved confounding and uses the do-calculus to estimate Y given I, a I using both observational data and interventional data with the same intervention targets I. Aglietti et al. (2020a) extend CBO to make use of data obtained from hard interventions with different intervention targets. While both methods use do-calculus to estimate causal effects, they do not learn the full system model. Branchini et al. (2022) and Alabed & Yoneki (2022) extend these works to explore the CBO setting in the case of unknown DAG. Function network BO (FNBO) (Astudillo & Frazier, 2021b) is similar to the CBO setup with soft interventions and the proposed algorithm uses an expected improvement acquisition function to select actions. 
MCBO generalizes the FNBO setup to a richer class of problems since the causal model formalism allows for modelling hard interventions. Moreover, in contrast to MCBO, FNBO assumes the system is noiseless which might be a restrictive assumption in practice. Kusakawa et al. (2021) study stochastic function networks in the special case of a chain graph. Similar to MCBO, they develop a UCB-based acquisition function and provide an accompanying cumulative regret guarantee. However, they do not employ a reparameterization trick to show how to optimize the acquisition function. Kusakawa et al. (2021) do not perform any empirical study of the proposed UCB-based method, but evaluate expected improvement and credible interval methods 1For vector-valued functions coming from an RKHS we consider a scalar-valued function where the output index is part of the function input, as described in Curi et al. (2020) Appendix F. 2This is known as the bounded variance property, and it holds for many common kernels. 3This assumption can be relaxed to ωi being sub-Gaussian using similar techniques to Curi et al. (2020) Appendix I. Though sub-Gaussian noise includes distributions with unbounded support, Curi et al. (2020) provide high probability bounds on the domain of Zi. Published as a conference paper at ICLR 2023 also developed for the chain graph setting. Both FNBO and CBO are part of a wider research direction on using intermediate observations from the computation of the unknown function to improve the sample efficiency of BO algorithms (Astudillo & Frazier, 2021a). In concurrent work, Varici et al. (2022) consider a similar setup and prove a cumulative regret guarantee for a UCB-based algorithm with soft interventions. However, they focus only on linear SCMs and a binary action space for each intervention target, while MCBO applies to non-linear SCMs and a continuous action space. The function class studied in this paper is similar to that of deep Gaussian processes (GPs) (Damianou & Lawrence, 2013) in that MCBO models Y as a composition of GPs given a. Deep GPs, however, do not make use of intermediate system variables and do not compose GPs according to a causal graph structure. Our use of the reparameterization trick to practically implement an upper confidence bound acquisition function (Srinivas et al., 2010) in MCBO is inspired by Curi et al. (2020), who apply ideas from BO to design an algorithm for sample efficient reinforcement learning. 3 ALGORITHM In this section, we propose the MCBO algorithm, describing the probabilistic model and acquisition function used. We first introduce MCBO under the soft intervention setup and then describe how to adapt it to hard interventions. Statistical model We use Gaussian processes (GPs) for learning the RKHS functions f0, . . . , fm from observations. Our regularity assumptions permit the construction of confidence bounds using these GP models with priors associated with the RKHS kernels. We refer to Rasmussen (2003) for more background on the relation between GPs and RKHS functions. For all i [m], let µi,0 and σ2 i,0 denote the prior mean and variance functions for each fi, respectively. Since ωi is bounded, it is also subgaussian and we denote the variance by ρ2 i . The corresponding posterior GP mean and variance, denoted by µi,t and σ2 i,t respectively, are computed based on the previous evaluations Dt = {z:,1:t, a:,1:t, x:,1:t}. 
In particular, for each function fi( , , l) defined by the given kernel ki and output component l: µi,t(zi, ai, l) = ki,t(zi, ai, l) (Kt + ρ2 i I) 1vec(xi,1:t) , (7) σ2 i,t(zi, ai, l) = ki((zi, ai, l); (zi, ai, l)) ki,t(zi, ai, l) (Kt + ρ2 i I) 1ki,t(zi, ai, l) , (8) where I denotes the identity matrix, vec(xi,1:t) = [xi,1,1, xi,1,2, . . . , xi,t,d] and for (t1, l), (t2, l ) [(1, 1), (1, 2), . . . , (t, d)]: [Kt](t1,l),(t2,l ) = ki((zi,t1,l, ai,t1,l, l); (zi,t2,l , ai,t2,l , l )), ki,t(zi, ai, l) = [ki((zi,1,1, ai,1,1, 1); (zi, ai, l)), . . . , ki((zi,t,d, ai,t,d, d); (zi, ai, l))] . We write µi,t = [µi,t( , 1), . . . , µi,t( , d)]T and similarly for σi,t. We give more background on the posterior updates of vector-valued GPs in Appendix A.1. At time t, the known set Mt of statistically plausible functions F = { fi}m i=0 (functions that lie inside the confidence interval given by the posterior of each GP) is defined as: Mt = F = { fi}m i=0, s.t. i : fi Hki, fi ki Bi, and fi(zi, ai) µi,t 1(zi, ai) βi,tσi,t 1(zi, ai), zi Zi, ai Ai Here, βi,t is a parameter that ensures the validity of the confidence bounds. Some examples of concentration inequalities under similar regularity assumptions as well as explicit forms for βi,t can be found in Chowdhury & Gopalan (2019) and Srinivas et al. (2010). In the following, we set βi,t = βt for all i such that the confidence bounds in Eq. (9) are still valid. Published as a conference paper at ICLR 2023 Algorithm 1 Model-based Causal BO (MCBO) Require: Parameters {βt}t 1, G, Ω, kernel functions ki and prior means µi,0 = 0 i [m] 1: for t = 1, 2, . . . do 2: Construct confidence bounds as in Eq. (9) 3: Select at arg maxa A maxη( ) E[y | { f}, a] as in Eq. (12) 4: Observe samples {zi,t, xi,t}m i=0 5: Use Dt to update posterior {µi,t( ), σ2 i,t( )}m i=0 as in Eqs. (7) and (8) 6: end for Algorithm 2 Model-based Causal BO with Hard Interventions (MCBO) Require: Parameters {βt}t 1, G, Ω, kernel functions ki and prior means µi,0 = 0 i [m] 1: for t = 1, 2, . . . do 2: Construct confidence bounds as in Eq. (9) 3: Select I, a I arg max I,a I maxη E[y | { f}, do(XI = a I)] 4: Observe samples {zi,t, xi,t}m i=0 5: Use Dt to update posterior {µi,t( ), σ2 i,t( )}m i=0 as in Eqs. (7) and (8) 6: end for Acquisition function At each round t, interventions are selected based on maximizing an acquisition function. Our acquisition function is based on the upper confidence bound acquisition function (Srinivas et al., 2010). That is, we optimistically pick interventions that yield the highest expected return among all system models that are still plausible given past observations: a:,t = arg max a A max F Mt Eω h y | F , a i . (10) Note that Eq. (10) is not amenable to commonly used optimization techniques, due to the maximization over a set of functions with bounded RKHS norm. Therefore, following Curi et al. (2020), we use the reparameterization trick to write any fi F Mt using a function ηi : Zi Ai [ 1, 1]di as: fi,t( zi, ai) = µi,t 1( zi, ai) + βtσi,t 1( zi, ai) ηi( zi, ai), (11) where xi = fi( zi, ai) + ωi denotes observations from simulating actions in one of the plausible models, and not necessarily the true model. denotes the elementwise multiplication of vectors. This reparametrization allows for rewriting our acquisition function in terms of η : Z A [ 1, 1]|X|: a:,t = arg max a A max η( ) Eω h y | F , a i , s.t. F = { fi,t} in Eq. (11). (12) Intuitively, the variables η allow for choosing optimistic but plausible models given the confidence bounds. 
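To illustrate how Eq. (11) is used, the sketch below propagates the optimistic mechanisms $\tilde f_{i,t} = \mu_{i,t-1} + \beta_t \sigma_{i,t-1} \odot \eta_i$ through the graph in topological order for the noiseless case, with each $\eta_i$ a constant in $[-1, 1]$ (the constant parameterization is also the one used in Appendix A.3 for noiseless settings). The posterior mean and standard deviation functions here are simple stand-ins, not fitted GPs.

```python
import numpy as np

parents = {0: [], 1: [0], 2: [1], 3: [0, 2]}   # node 3 is the reward Y
beta = 0.5

# Illustrative stand-ins for the GP posterior mean/std of each f_i; in MCBO these
# come from the fitted models in Eqs. (7)-(8).
def mu(i, z, a):
    return np.tanh(np.sum(z) + a)

def sigma(i, z, a):
    return 0.2 / (1.0 + np.sum(z**2) + a**2)

def optimistic_reward(a, eta):
    """Propagate x_i = mu_i(z_i, a_i) + beta * sigma_i(z_i, a_i) * eta_i
    (Eq. (11), noiseless case) through the DAG and return the reward node."""
    x = {}
    for i in sorted(parents):
        z = np.array([x[j] for j in parents[i]])
        x[i] = mu(i, z, a[i]) + beta * sigma(i, z, a[i]) * eta[i]
    return x[max(parents)]

a = {0: 0.4, 1: 0.0, 2: 0.1, 3: 0.0}
eta = {i: 1.0 for i in parents}        # each eta_i in [-1, 1]; here fully optimistic
print(optimistic_reward(a, eta))
```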
In practice, the function η can be parameterized by, for example, a neural network, and then standard optimization techniques are applied. For the theory, we assume access to an oracle providing the global optimum of Eq. (12). In practice, such an oracle may be computationally infeasible due to the non-convexity of Eq. (12). We discuss heuristics for approximating this oracle in Appendix A.3. Algorithm 1 summarizes our model-based causal BO approach. We note that for the special case of an SCM following the DAG of Fig. 1(a), our algorithm and the associated guarantees reduce to standard BO (Srinivas et al., 2010).

Hard interventions MCBO also naturally generalizes to hard interventions (Algorithm 2). In our experiments, we perform the combinatorial optimization over the set of nodes I by enumeration, because |I| is not large for the instances we consider. We apply the notion of a minimal intervention set from Lee & Bareinboim (2019) to prune sets of intervention targets that contain redundant interventions, resulting in a smaller set to optimize over.

4 THEORETICAL ANALYSIS

This section describes the convergence guarantees for MCBO under a soft intervention model, giving the first sublinear cumulative regret bounds for causal BO. We start by introducing additional technical assumptions required for the analysis.

Assumption 1 (Functional relations). All $f_i \in \mathcal{F}$ are $L_f$-Lipschitz continuous.

Assumption 2 (Continuity). For all $i, t$, the functions $\mu_{i,t}$ and $\sigma_{i,t}$ are $L_\mu$- and $L_\sigma$-Lipschitz continuous, respectively.

Assumption 3 (Calibrated model). All statistical models are calibrated w.r.t. $\mathcal{F}$, so that for all $i, t$ there exists a sequence $\beta_t \in \mathbb{R}_{>0}$ such that, with probability at least $1 - \delta$, for all $z_i \in \mathcal{Z}_i$, $a_i \in \mathcal{A}_i$ we have $|f_i(z_i, a_i) - \mu_{i,t-1}(z_i, a_i)| \leq \beta_t \sigma_{i,t-1}(z_i, a_i)$, element-wise.

Assumption 1 follows directly from the regularity assumptions of Section 2. Assumption 2 holds if the RKHS of each $f_i$ has a Lipschitz continuous kernel (see Curi et al. (2020), Appendix G). Assumption 2 restricts the convergence guarantees to soft interventions that affect their target variable in a smooth way, meaning that our analysis does not directly apply to the hard intervention model. We nevertheless experimentally demonstrate the effectiveness of MCBO in non-smooth settings, such as CBO with hard interventions. Assumption 3 holds when the $i$th GP prior uses the same kernel as the RKHS of $f_i$ and $\beta_t$ is sufficiently large to ensure that the confidence bounds in Eq. (9) hold.

In the DAG $\mathcal{G}$ over nodes $\{X_i\}_{i=0}^m$, let $N$ denote the maximum distance from a root node to $X_m$: $N = \max_i \mathrm{dist}(X_i, X_m)$, where $\mathrm{dist}(\cdot, \cdot)$ is measured as the number of edges in the longest path from a node $X_i$ to the reward node $X_m$. Let $K$ denote the maximum number of parents of any variable in $\mathcal{G}$: $K = \max_i |\mathrm{pa}(i)|$. The following theorem bounds the performance of MCBO in terms of cumulative regret.

Theorem 1. Consider the optimization problem in Eq. (4) with an SCM satisfying Assumptions 1-3, where $\mathcal{G}$ is known but $\mathcal{F}$ is unknown. Then for all $T \geq 1$, with probability at least $1 - \delta$, the cumulative regret of Algorithm 1 is bounded by
$$R_T \leq \mathcal{O}\!\left( L_f^N L_\sigma^N \beta_T^N K^N m \sqrt{T \gamma_T} \right).$$
Here, $\gamma_T = \max_i \gamma_{i,T}$, where the node-specific $\gamma_{i,T}$ denotes the maximum information gain at time $T$ commonly used in standard regret guarantees for Bayesian optimization (Srinivas et al., 2010). This maximum information gain is known to be sublinear in $T$ for many common kernels, such as linear and squared exponential kernels, resulting in overall sublinear regret for MCBO.
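The graph quantities $N$ and $K$ appearing in Theorem 1 are straightforward to compute from the parent sets; the following is an illustrative sketch, not taken from the paper's implementation.

```python
from functools import lru_cache

parents = {0: [], 1: [0], 2: [1], 3: [2], 4: [0, 3]}   # node 4 is the reward X_m
m = max(parents)

children = {i: [j for j, ps in parents.items() if i in ps] for i in parents}

@lru_cache(maxsize=None)
def longest_path_to_reward(i):
    """Number of edges on the longest directed path from node i to X_m."""
    if i == m:
        return 0
    downstream = [longest_path_to_reward(c) for c in children[i]]
    return 1 + max(downstream) if downstream else float("-inf")  # -inf: X_m unreachable

N = max(longest_path_to_reward(i) for i in parents if i != m)   # depth N in Theorem 1
K = max(len(ps) for ps in parents.values())                     # max number of parents K
print(N, K)
```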
We refer to Appendix A.2.3 for the proof. Theorem 1 demonstrates that the use of the graph structure in MCBO results in a potentially exponential improvement in how the cumulative regret scales with the number of actions $m$. Standard Bayesian optimization as in Fig. 1(a), which makes no use of the graph structure, incurs cumulative regret exponential in $m$ (Srinivas et al., 2010) when using a squared exponential kernel. When all $X_i$ in MCBO are modeled with squared exponential kernels that model output components independently, we have $\gamma_T = \mathcal{O}\big(d(Kd + q)(\log T)^{Kd+q+1}\big)$, resulting in a cumulative regret that is exponential in $K$ and exponential in $N$. However, $m$ can be much larger than both $K$ and $N$. For several common graphs, the exponential scaling in $N$ and $K$ is significantly more favorable than exponential scaling in $m$. Consider the case of $\mathcal{G}$ having the binary-tree-like structure in appendix Fig. 3 (binary tree), where $N = \log(m)$ and $K = 2$. In such settings, the cumulative regret of MCBO has only polynomial dependence on $m$. We further discuss the bound in Theorem 1 for specific kernels in Appendix A.2.3 and discuss the dependence of $\beta_T$ on $T$ in Appendix A.2.4.

5 EXPERIMENTS

In this section, we empirically evaluate MCBO on six problems taken from previous CBO or function network papers (Aglietti et al., 2020b; Astudillo & Frazier, 2021b). The DAGs corresponding to each task are given in Fig. 3 of the appendix. We provide an open-source implementation of MCBO at https://github.com/ssethz/mcbo.

Figure 2 (average reward vs. round for Toy Graph, PSAGraph, Ackley, and the noisy Dropwave, Rosenbrock, and Alpine2 tasks): (a, b) EICBO does not achieve monotonically improving average reward on CBO tasks, while MCBO achieves high average reward on both tasks. (c) On the noiseless Ackley function networks task, EIFN also does not achieve monotonically improving average reward. (d, e, f) On a range of noisy function networks tasks, MCBO is at least as good as, and often better than, the EIFN and GP-UCB baselines.

Baselines We compare our approach with three baselines: (i) Expected Improvement for Function Networks (EIFN) (Astudillo & Frazier, 2021b); (ii) Causal Bayesian Optimization (EICBO) (Aglietti et al., 2020b); and (iii) the standard upper confidence bound (GP-UCB) (Srinivas et al., 2010), which models the objective given interventions with a single GP (see Fig. 1a). To enable a fair comparison, we only apply EICBO to the hard intervention setting, since it was designed specifically for a hard intervention model.

Experimental setup We report the average reward as a function of the number of system interventions performed. The average reward at time $T$ is the average of $\mathbb{E}_\omega[Y \mid a_t]$ over rounds $t = 0, \ldots, T$, and it is directly inversely related to cumulative regret in that high average expected reward is equivalent to low cumulative regret. This matches the performance metric studied in our analysis.
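To make this metric concrete, the sketch below turns a sequence of per-round expected rewards into the average reward curve reported in Figure 2 and the corresponding cumulative regret of Eq. (5); the numbers are illustrative placeholders, and in practice each $\mathbb{E}_\omega[Y \mid a_t]$ is estimated by rolling out the system.

```python
import numpy as np

# Illustrative per-round expected rewards E_omega[Y | a_t] for T = 8 rounds,
# and the optimal value E_omega[Y | a*] (both would be estimated by Monte Carlo
# rollouts of the system in practice).
expected_rewards = np.array([0.10, 0.25, 0.40, 0.55, 0.60, 0.62, 0.63, 0.63])
optimal_value = 0.65

rounds = np.arange(1, len(expected_rewards) + 1)
average_reward = np.cumsum(expected_rewards) / rounds             # what Figure 2 plots
cumulative_regret = np.cumsum(optimal_value - expected_rewards)   # R_T in Eq. (5)

# High average reward <=> low cumulative regret:
# R_T = T * optimal_value - sum of the per-round expected rewards.
print(average_reward[-1], cumulative_regret[-1])
```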
Average expected reward and cumulative regret are natural metrics for many real applications, like crop management, in which we want consistently high yield, and the healthcare-inspired setting we study in these experiments, where we seek good treatment outcomes for more than a single patient. In the appendix, we show experiments measuring the best expected reward of any action previously chosen, which is more similar to an inverse of the simple regret. We report mean performance over 20 random seeds, with error bars showing σ/ 20 where σ is the standard deviation across the repeats. All figures that are referenced but not in the main paper can be found in the appendix. For the guarantees in Theorem 1 to hold, {βt}T t=0 must be chosen so that the model is calibrated at all time-steps as in Eq. (9). In practice, we select a single β such that βt = β, t. Choosing β too pessimistically will result in high regret, as demonstrated by the dependence of the guarantee on βN. For GP-UCB and MCBO, β is chosen by cross-validation across tasks, as described in the appendix. Toy Experiment First, we evaluate on the synthetic Toy Graph setting from Aglietti et al. (2020b). Toy Graph is a hard intervention CBO task where I = { , {0}, {1}}. All methods start with 10 observational samples and samples from 2 random interventions on each I I. When EICBO obtains interventional data, it obtains noiseless samples because it is not designed for the noisy setting. By noiseless samples, we mean that for action a EICBO observes Eω[Y | a]. Other methods obtain single samples from the distribution Y, X | a. Published as a conference paper at ICLR 2023 In Toy Graph, I = {1} is the optimal target, but to efficiently learn the optimal a{1}, the agent must generalize from both observational data and interventional data on other targets I = {0}. EICBO models Y given a I with a separate GP for every I and is consequently only able to make use of observational data and interventional data with the same target. Since MCBO learns a full system model, it incorporates all observations into the learned model, even when interventions do not match. Figure 2 (a) shows that the average reward of MCBO (Algorithm 2) is favorable compared to the baselines. Both baselines are built on expected improvement, which will continue to explore even after high-reward solutions are found. This explains the non-monotonic average reward of EICBO. Healthcare Experiment PSAGraph is inspired by the DAG from a real healthcare setting (Ferro et al., 2015) and also benchmarked by Aglietti et al. (2020b). The agent intervenes by prescribing statins and/or aspirin while specifying the dosage to control prostate-specific antigen (PSA) levels. Here I = { , {2}, {3}, {2, 3}} and all interventions are hard interventions. Initial sample sizes are the same as for Toy Graph. Figure 2 (b) again shows EICBO having a nonmonotonic average reward and strong comparable performance of MCBO. Noiseless Function Networks In addition to the hard intervention setting, we evaluate MCBO and the baselines on four tasks from Astudillo & Frazier (2021b). All systems have up to six nodes and varying graph structures. In function networks, actions can affect multiple system variables, and system variables can be children of multiple actions. Function networks are deterministic, so ω = 0. MCBO (Algorithm 1) can be applied directly in this setting, and the guarantees are also easily transferable. 
Like in Astudillo & Frazier (2021b), there are no constraints (besides bounded domain) on actions, and the agent is initialized with 2A + 1 samples from random actions, where A is the number of action nodes. Figure 2 (c) and Figure 4 (a,b,c) show that MCBO achieves competitive average reward on all tasks. EIFN is better on Dropwave. Meanwhile, MCBO is substantially better than EIFN on the Ackley and Rosenbrock tasks. Overall, there are not sufficiently many tasks established in the literature to conclusively say which properties might make a task favor EIFN over MCBO. However, this would be interesting to understand in future work and likely relates to the wider conversation in BO comparing expected improvement and UCB algorithms (Merrill et al., 2021). We find that the naive GP-UCB approach, which does not use the graph structure, generally performs poorly, especially on problems with larger graphs like Alpine2. On Ackley, EIFN does not achieve monotonically improving average reward, which is not unexpected given that it is based upon expected improvement. Noisy Function Networks We modify three of the function networks settings to include an additive zero-mean Gaussian noise at every system variable, making ω non-zero. EIFN is designed for deterministic function networks and has no convergence guarantees in this setting. Results in terms of average (Figure 2 d,e,f) and best reward (Figure 6 g, h, i) are comparable to the noiseless case, with MCBO and EIFN both performing well compared to GP-UCB. 6 CONCLUSION This paper introduces MCBO, a principled model-based approach to solving Bayesian optimization problems over structural causal models. Our approach explicitly models all variables in the system and propagates epistemic uncertainty through the model to select interventions based on the optimism principle. This allows MCBO to solve global optimization tasks in systems that have known causal structure with improved sample efficiency compared to prior works. We prove the first non-asymptotic convergence guarantees for an algorithm solving the causal Bayesian optimization problem and demonstrate that its theoretical advantages are reflected in strong empirical performance. Future work might consider how to apply the method to large graphs, where the sets of all possible discrete intervention targets cannot be efficiently enumerated. Published as a conference paper at ICLR 2023 ACKNOWLEDGMENTS AND DISCLOSURE OF FUNDING We thank Lars Lorch, Parnian Kassraie and Ilnura Usmanova for their feedback and anonymous reviewers for their helpful comments. This research was supported by the Swiss National Science Foundation under NCCR Automation, grant agreement 51NF40 180545, and by the European Research Council (ERC) under the European Union s Horizon grant 815943. Published as a conference paper at ICLR 2023 Virginia Aglietti, Theodoros Damoulas, Mauricio Alvarez, and Javier Gonz alez. Multi-task Causal Learning with Gaussian Processes. In Advances in Neural Information Processing Systems (Neur IPS), 2020a. Virginia Aglietti, Xiaoyu Lu, Andrei Paleyes, and Javier Gonz alez. Causal Bayesian Optimization. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2020b. Sami Alabed and Eiko Yoneki. Bo Graph: Structured Bayesian Optimization From Logs for Expensive Systems with Many Parameters. In Proceedings of the 2nd European Workshop on Machine Learning and Systems, 2022. Raul Astudillo and Peter I Frazier. 
Thinking inside the box: A tutorial on grey-box bayesian optimization. In 2021 Winter Simulation Conference (WSC). IEEE, 2021a. Raul Astudillo and Peter I. Frazier. Bayesian Optimization of Function Networks. In Advances in Neural Information Processing Systems (Neur IPS), 2021b. Maximilian Balandat, Brian Karrer, Daniel R. Jiang, Samuel Daulton, Benjamin Letham, Andrew Gordon Wilson, and Eytan Bakshy. Bo Torch: A Framework for Efficient Monte-Carlo Bayesian Optimization. In Advances in Neural Information Processing Systems (Neur IPS), 2020. Elias Bareinboim, Andrew Forney, and Judea Pearl. Bandits with unobserved confounders: A causal approach. In Advances in Neural Information Processing Systems (Neur IPS), 2015. Nicola Branchini, Virginia Aglietti, Neil Dhir, and Theodoros Damoulas. Causal entropy optimization. ar Xiv:2208.10981, 2022. Sayak Ray Chowdhury and Aditya Gopalan. Online learning in kernelized markov decision processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019. Sebastian Curi, Felix Berkenkamp, and Andreas Krause. Efficient Model-Based Reinforcement Learning through Optimistic Policy Search and Planning. In Advances in Neural Information Processing Systems (Neur IPS), 2020. Andreas Damianou and Neil D. Lawrence. Deep Gaussian processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2013. Frederick Eberhardt and Richard Scheines. Interventions and causal inference. Philosophy of science, 74(5):981 995, 2007. Ana Ferro, Francisco Pina, Milton Severo, Pedro Dias, Francisco Botelho, and Nuno Lunet. Use of statins and serum levels of prostate specific antigen. Acta Urol ogica Portuguesa, 32(2):71 77, 2015. Shunya Kusakawa, Shion Takeno, Yu Inatsu, Kentaro Kutsukake, Shogo Iwazaki, Takashi Nakano, Toru Ujihara, Masayuki Karasuyama, and Ichiro Takeuchi. Bayesian Optimization for Cascadetype Multi-stage Processes. ar Xiv preprint, 2021. Finnian Lattimore, Tor Lattimore, and Mark D Reid. Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems (Neur IPS), 2016. Tor Lattimore and Csaba Szepesv ari. Bandit algorithms. Cambridge University Press, 2020. Sanghack Lee and Elias Bareinboim. Structural causal bandits with non-manipulable variables. In AAAI Conference on Artificial Intelligence, 2019. Erich Merrill, Alan Fern, Xiaoli Fern, and Nima Dolatnia. An empirical study of bayesian optimization: Acquisition versus partition. Journal of Machine Learning Research, 2021. Jonas Moˇckus. On Bayesian methods for seeking the extremum. In Optimization Techniques, 1975. Judea Pearl. Causality. Cambridge University Press, 2009. Published as a conference paper at ICLR 2023 Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer school on machine learning, pp. 63 71. Springer, 2003. Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of International Conference on Machine Learning (ICML)), 2010. Burak Varici, Karthikeyan Shanmugam, Prasanna Sattigeri, and Ali Tajer. Causal bandits for linear structural equation models. ar Xiv preprint ar Xiv:2208.12764, 2022. Published as a conference paper at ICLR 2023 A APPENDIX MODEL-BASED CAUSAL BAYESIAN OPTIMIZATION A.1 BACKGROUND ON GPS Here we give more background on the vector-valued GP models we use. 
We will drop the i-node index and consider the modelling of a single function with scalar output f : A Z R in RKHS Hk with kernel k : S S R where S = (Z A). We assume measurement noise with variance ρ2. Assuming prior mean and variance µ0, σ0 and dataset Dt = {z1:t, a1:t, x1:t} where xt R, we get a posterior GP mean and variance given by updates µt(z, a) = kt(z, a) Kt + ρ2I 1 x1:t , (13) σ2 t (z, a) = k((z, a); (z, a)) kt(z, a) Kt + ρ2I 1 kt(z, a) , (14) where (Kt)t1,t2 = k((zt1, at1); (zt2, at2)), kt(z, a) = [k((z1, a1); (z, a)) . . . , k((zt, at); (z, a))] , and I is the identity matrix. This follows the standard GP update given in Rasmussen (2003). In this work, we model functions with vector-valued outputs f : A Z X Rd (we will continue to drop the i index). For this we follow Chowdhury & Gopalan (2019) and use a scalar-output GP that takes the component of the output vector under consideration as an input. That is, we model f( , l) (the lth output component of the function under study) where f = [f( , 1), . . . , f( , d)]. We again assume that f is from RKHS Hk but with kernel k : S S R where S = (Z A [d]). Assuming prior mean and variance µ0, σ0 and dataset Dt = {z1:t, a1:t, x1:t} where xt Rd, we get a posterior GP mean and variance given by updates µt(z, a, l) = kt(z, a, l) (Kt + ρ2I) 1vec(x1:t) , (15) σ2 t (z, a, l) = k((z, a, l); (z, a, l)) kt(z, a, l) (Kt + ρ2I) 1kt(z, a, l) , (16) where vec(x1:t) = [x1,1, x1,2, . . . , xt,d] and for (t1, l), (t2, l ) [(1, 1), (1, 2), . . . , (t, d)]: [Kt](t1,l),(t2,l ) = k((zt1,l, at1,l, l); (zt2,l , at2,l , l )), kt(z, a, l) = [k((z1,1, a1,1, 1); (z, a, l)), . . . , k((zt,d, at,d, d); (z, a, l))] . This follows the posterior update of Chowdhury & Gopalan (2019). The key idea is to use a single scalar-output GP with kernel k for modeling all output components, but introduce the component index as part of the input space. This requires introducing the notation vec(x1:t) and ordering all observations, of all time points and all components, in a single index, since the GP update for component l considers observations of all other components. Under this GP model the components of f need not be independent if kernel k is designed to model dependency between the components. In our work, we use one of these vector-valued GP models for each individual random variable in our causal model, leading to the reintroduction of the additional i index in Eqs. (7) and (8). A.2 PROOFS FOR THE THEORETICAL ANALYSIS Our analysis closely follows Curi et al. (2020), particularly the proofs in their Appendix D, where they prove similar guarantees for a model-based reinforcement learning problem. In contrast to Curi et al. (2020), which models RL transition dynamics with a single GP for all timesteps, MCBO uses independent GPs for modeling the functional relation {f1, . . . , fm} and uses the causal graph G to determine the input and output variables of these functions. This section is organized as follows. In Appendix A.2.1, we discuss a notion of model complexity ΓT similar to the one introduced in the RL setting by Curi et al. (2020). We then bound the cumulative regret in terms of ΓT in Appendix A.2.2. Finally, in Appendix A.2.3, we prove Theorem 1 by connecting our notion of model complexity with the maximum information gain of a GP model. The norm notation refers to ℓ2-norm if no additional notation is given. We let {ai,t Ai, zi,t Zi}i,t>0 denote the set of actions chosen by MCBO and the realizations of the parents of node i, respectively. 
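The scalar GP update of Eqs. (13) and (14) above can be written in a few lines; the sketch below uses a squared exponential kernel purely for illustration and is not tied to the paper's implementation.

```python
import numpy as np

def se_kernel(S1, S2, lengthscale=1.0):
    """Squared exponential kernel k(s, s') evaluated between the rows of S1 and S2."""
    d2 = np.sum(S1**2, 1)[:, None] + np.sum(S2**2, 1)[None, :] - 2 * S1 @ S2.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(S_train, x_train, S_query, rho=0.1):
    """Posterior mean/variance of Eqs. (13)-(14): inputs s = (z, a) stacked as rows,
    scalar targets x, observation-noise standard deviation rho."""
    K = se_kernel(S_train, S_train)
    k_q = se_kernel(S_train, S_query)                  # k_t(z, a) for each query point
    A = np.linalg.solve(K + rho**2 * np.eye(len(S_train)),
                        np.column_stack([x_train, k_q]))
    mean = k_q.T @ A[:, 0]                             # Eq. (13)
    var = 1.0 - np.sum(k_q * A[:, 1:], axis=0)         # Eq. (14); k(s, s) = 1 for SE kernel
    return mean, var

rng = np.random.default_rng(0)
S_train = rng.uniform(-1, 1, size=(10, 3))             # e.g. (z, a) concatenated
x_train = np.sin(S_train.sum(axis=1)) + 0.1 * rng.standard_normal(10)
mean, var = gp_posterior(S_train, x_train, S_query=rng.uniform(-1, 1, size=(5, 3)))
print(mean, var)
```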
Published as a conference paper at ICLR 2023 A.2.1 MODEL COMPLEXITY The number of samples needed to learn a low-regret action is related to the number of samples needed to learn all GP models in our SCM. This is analogous to the classic BO setting (Srinivas et al., 2010). We quantify the model complexity of our entire model class MT as ΓT = max (z,a) A {Z A}T σi,t 1(zi,t, ai,t) 2. (17) This measure of model complexity closely relates to the maximum information gain γT used for proving regret guarantees in BO (Srinivas et al., 2010). For a single GP model, γT is the maximum information gain about the unknown f that can be obtained from noisy evaluations of the f at fixed inputs (see Eq. (39)). Later we show that ΓT can be bounded by a sum of the information gains for all m GPs. It is worth noting that Equation 17 may be a loose notion of model complexity because it assumes we can independently choose every zi, but for many graphs, there could be overlap in the zi, zj for i = j (two nodes could have a shared parent). A.2.2 ANALYSIS IN TERMS OF GENERAL MODEL COMPLEXITY ΓT In this section, we will prove a theorem similar to Theorem 1 but in terms of the model complexity we define in Eq. (17). Note that this version of the theorem does not require that µ and σ come from a GP model with independent outputs, but any model such that Assumptions 1 3 are satisfied. In later sections, when using a GP model we bound ΓT in terms of the maximum information gain of the m GPs used to get Theorem 1. We will use the function Σi,t( ) to represent a matrix of all zeros except the values of the diagonal, given by σi,t( ). A a result, σi,t( ) = diag (Σi,t( )) Theorem 2. Consider the optimization problem in Eq. (4) with SCM satisfying Assumptions 1 3 where G is known but F is unknown. Then, for all T 1, with probability at least 1 δ, the regret of Algorithm 1 is bounded by RT O LN f LN σ βN T KNp We first sketch the proof steps. In Lemma 1 we show that with, high probability, there exists some set of functions η that allows the reparameterized plausible SCM model in Eq. (11) to match the true SCM. Recall the mechanism of the ground-truth SCM in Eq. (2) xi,t = fi(zi,t, ai,t) + ωi,t, i {0, . . . , m}, (2) and the mechanism of the optimistic SCM model using the reparameterization of Eq. (11) xi,t = fi( zi,t, ai,t) + ωi,t (18) = µi,t 1( zi,t, ai,t) + βtΣi,t 1( zi,t, ai,t)ηi( zi,t, ai,t) + ωi,t, i {0, . . . , m}. (19) In Lemma 2, Lemma 3 and Corollary 1 we bound the instantaneous regret (regret at some specific timepoint t) by bounding the difference in SCM output, for the same action input, assuming the true SCM vs the optimistic reparameterized SCM. Then in Lemma 5 we use the intermediate result of Lemma 4 to show that our bound on instantaneous regret implies a bound on cumulative regret. In Lemma 1, for convenience, we will drop the explicit dependence of all quantities on t. Lemma 1. Assume some fixed set of actions a is chosen at any timepoint t. Under Assumption 3, for any x generated by the true SCM Eq. (2), with probability at least 1 δ there exists a set of functions η = {ηi}m i=0, where ηi : Zi Ai [ 1, 1]d, such that x = x if i ωi = ωi. Proof. Since ω = ω we only need to prove that there exists some η such that i fi(zi, ai) = µi( zi, ai) + βΣi( zi, ai)ηi. Published as a conference paper at ICLR 2023 By Assumption 3, with probability 1 δ, for all i = 0, . . . , m, we have an elementwise bound |fi(zi, ai) µi(zi, ai)| βiσi(zi,t, ai,t). 
Thus for each zi,t, ai,t there exists a vector ηi with values in [ 1, 1]d such that fi(zi, ai) = µi(zi, ai) + βΣi(zi, ai)ηi. Note that this is not quite what we need because the RHS contains zi and not zi. We will now use an inductive argument on i that constructs each ηi sequentially from i = 0 to m. Base case: we must prove that for i = 0 we have f0(z0, a0) = µ0( z0, a0) + βtΣ0( z0, a0)η0. We know z0 = z0 (both are the empty vector) and a = a by the assumption of a fixed action. Then there exists some vector η0 such that f0(z0, a0) = µ0(z0, a0) + βΣ0(z0, a0)η0 = µ0( z0, a0) + βΣ0( z0, a0)η0. Let η0( ) be the function that outputs the vector η0 given input z0, a0 and the base case is proven. Now assume the inductive hypothesis: j < i we have fj(zj, aj) = µj( zj, aj) + βΣj( zj, aj)ηj. We want to show that this implies fi(zi, ai) = µi( zi, ai) + βΣi( zi, ai)ηi. We know ai = ai by the assumption of a fixed action. zi = zi because zi = [ xpai[1], . . . , xpai[|pai|]]T and we selected each ηj such that xj = xj. Then there exists some vector ηi such that fi(zi, ai) = µi(zi, ai) + βΣi(zi, ai))ηi = µi( zi, ai) + βΣi( zi, ai))ηi. Let ηi( ) output the vector ηi given input zi, ai and the inductive step is proven. Lemma 2. Under Assumption 3, with probability at least (1 δ) t 0 the instantaneous regret rt is bounded by rt = E[y|F , a ] E[y|F , a:,t] E h y| Ft, a:,t i E[y|F , a:,t]. (20) Proof. The result follows directly from, E[y|F , a ] E h y| Ft, a:,t i . (21) This is true by the definition of Ft, a:,t as the argmax of Eq. (12) and that with probability at least (1 δ) we have F MT . We now show how the observations under the true and optimistic dynamics differ for a fixed noise sequence ω = ω and the fixed action ai,t at any time t. Lemma 3. Under Assumptions 1 3, let Lf,t = 1 + Lf + 2βt Lσ. Then, for all iterations t > 0, any functions ηi : Rpi Rqi [ 1, 1]di and any sequence of ωi with ωi = ωi (for all i), we have xm,t xm,t 2βt KN LN f,t i=0 σi,t 1(zi,t, ai,t) (22) Proof. We prove by induction on i. Base case. Consider the base case i = 0. Because the nodes are topologically ordered we will have pa0 = . Its realization, therefore, depends only on the chosen action. Formally, we assume z0 = , x0 = f0(z0, a0) + ω0 and x0 = f0( z0, a0) + ω0. Since ω0 = ω0, x0,t x0,t = f0(z0,t, a0,t) + ω0,t µ0,t 1(z0,t, a0,t) βtΣ0,t 1(z0,t, a0,t)η0(z0,t, a0,t) ω0 f0(z0,t, a0,t) µ0,t 1(z0,t, a0,t) + βtΣ0,t 1(z0,t, a0)η0(z0,t, a0,t) 2βt Σ0,t 1(z0,t, a0,t) In the following, we omit the dependence on the action a, e.g., using fi(zi,t) instead of fi(zi,t, ai,t) since we assume the actions to be the same for the process generating xi,t and xi,t. Induction step. Now assuming that xi 1,t xi 1,t 2βt KNi 1 LNi 1 f,t Pi 1 j=0 σj,t 1(zj,t) we prove a similar result for the ith node. 
1 = fi(zi,t) + ωi,t µi,t 1( zi,t) βtΣi,t 1( zi,t)ηi( zi,t) ωi,t Published as a conference paper at ICLR 2023 2 = fi(zi,t) µi,t 1( zi,t) βtΣi,t 1( zi,t)ηi( zi,t) + fi( zi,t) fi( zi,t) 3 = fi( zi,t) µi,t 1( zi,t) βtΣi,t 1( zi,t)ηi( zi,t) + fi(zi,t) fi( zi,t) 4 fi( zi,t) µi,t 1( zi,t) + βtΣi,t 1( zi,t)ηi( zi,t) + fi(zi,t) fi( zi,t) 5 βt σi,t 1( zi,t) + βt σi,t 1( zi,t) + Lf zi,t zi,t 6 = 2βt σi,t 1( zi,t) + Lf zi,t zi,t 7 = 2βt σi,t 1( zi,t) + σi,t 1(zi,t) σi,t 1(zi,t) + Lf zi,t zi,t 8 2βt ( σi,t 1(zi,t) + Lσ zi,t zi,t ) + Lf zi,t zi,t 9 2βt σi,t 1(zi,t) + (1 + Lf + 2βt Lσ) zi,t zi,t 10 2βt σi,t 1(zi,t) + (1 + Lf + 2βt Lσ) X j pai xi,t xi,t 11 2βt σi,t 1(zi,t) + (1 + Lf + 2βt Lσ) X j pai 2βt KNj(1 + Lf + 2βt Lσ | {z } =: Lf h=0 σh,t 1(zh,t) 12 2βt KNi LNi f,t j=0 σj,t 1(zj,t) (23) where 1 follows the dynamics Eqs. (2) and (11). In 2 , we assume the noise to be equal and add and subtract the same term. In 3 and 4 , we reorder terms and apply the triangle inequality. In 5 and 6 , we rely on the calibrated uncertainty and Lipschitz dynamics, then collect terms and use diagonality of the matrix Σi,t 1( ). In 7 and 8 , we add and subtract the same term and use the Lipschitz continuity of σi,t 1. Finally, in 9 , we add 1 to ensure that we can later upper bound this term by taking the exponential of it. 10 applies the triangle inequality. 11 follows the inductive hypothesis, and 12 is due to the depth of at least one parent j being Nj = Ni 1. Now we will relate this bound on the observations to a bound on yt when selecting actions according to MCBO in both the optimistic and true dynamics. Corollary 1. Under the assumptions of Lemma 3, for any sequence of ηi [ 1, 1]di, θ D, and t 1 we have that E h yt| Ft, a:,t i E[yt|F , a:,t] 2βt KN LN f,t Eω= ω i=0 σi,t 1(zi,t, ai,t) Proof. This follows from Lemma 3. Y is just the final observation so E h yt| Ft, a:,t i E[yt|F , a:,t] = E[ xm,t xm,t | a:,t] 2βt KN LN f,t E i=0 σi,t 1(zi,t, ai,t) Published as a conference paper at ICLR 2023 Lemma 4. Under Assumption 3, let LY,t = 2βt LN f,t. Then, with probability at least (1 δ) it holds for all t 0 that r2 t L2 Y,t K2Nm E i=0 σi,t 1(zi,t, ai,t) 2 2 rt E h yt| Ft, at i E[yt|F , at] (26) 2βt KN LN f,t E i=0 σi,t 1(zi,t, ai,t) i=0 σi,t 1(zi,t, ai,t) r2 t L2 Y,t K2N i=0 σi,t 1(zi,t, ai,t) L2 Y,t K2NE i=0 σi,t 1(zi,t, ai,t) L2 Y,t K2Nm E i=0 σi,t 1(zi,t, ai,t) 2 2 The last two lines are Jensen s inequality. Now we bound cumulative regret RT . Lemma 5. Under the assumption of Assumptions 1 3, with probability at least (1 δ) it holds for all t 0 that R2 T TL2 Y,T m i=0 σi,t(zi,t, ai,t)2 2 2 2 TL2 Y,T m i=0 σi,t(zi,t, ai,t)2 2 2 where 1 is due to Jensen s inequality and 2 follows Lemma 4. Similar to the equivalent lemma in Curi et al. (2020), this bound is dependent on the data observed by the iteration t, making it hard to interpret in a general case. To this end, we further provide the worst-case bound dependent on the model complexity ΓT . Lemma 6. Under Assumptions 1 3, with probability at least (1 δ) it holds for all t 0 that R2 T TL2 Y,T K2NmΓT (34) Proof. Substituting in Eq. (17) we have σi,t(zi,t, ai,t)2 2 # and the result follows. Taking square roots and substituting in for LY,T in terms of βT , Lf and Lσ in Eq. (34) concludes the proof for Theorem 2. Published as a conference paper at ICLR 2023 A.2.3 PROOF OF THEOREM 1 We can bound ΓT in Theorem 2 to get a bound that depends on the specific GP model used for each fi. We can show that for many commonly used kernels MCBO achieves sublinear (in T) regret. 
Theorem 1. Consider the optimization problem in Eq. (4) with SCM satisfying Assumptions 1 3 where G is known but F is unknown. Then for all T 1, with probability at least 1 δ, the cumulative regret of Algorithm 1 is bounded by RT O LN f LN σ βN T KNm p Proof. Step 1. Some preliminaries relating mutual information IT (xi,1:T , fi,1:T ) and maximum information gain γT In the following, we consider the information gain for the node i, i.e., for xi,1:T Rd T and fi,1:T = [fi(zi,1, ai,1), . . . , fi(zi,T , ai,T )] evaluated at points Ai = {zi,1:T , ai,1:T }, Ai Zi Ai. In this step, for simplicity, when clear from context we will omit the i-notation in the derivation for the information gain. Mutual information IT (x1:T , f1:T ) is then defined as: I(x1:T , f1:T ) = H(x1:T ) H(x1:T |f1:T ), where H( ) denotes entropy. For fitting the GPs, the following models are used: xt|ft N(ft(zt, at), ρ2I) and xt|x1:t 1, z1:t, a1:t N(µt 1(zt, at), ρI + Σt 1(zt, at)), where I Rd d, and µt 1(zt, at) = [µt 1(zt, at, 1), . . . , µt 1(zt, at, d)] , σt 1(zt, at) = diag(Σt 1(zt, at)) = [σt 1(zt, at, 1), . . . , σt 1(zt, at, d)] . Our setup assumes that the components {xt,l}d l=0 are independent of each other given zt, at. Therefore from Srinivas et al. (2010) we know that the mutual information for the ith GP model is: I(x1:T , f1:T ) = H(x1:T ) H(x1:T |f1:T ) = 1 l=1 ln 1 + σ2 t 1(zt, at, l) because per component: I(xi,1:T,l, fi,1:T,l) = 1 t=1 ln 1 + σ2 t 1(zt, at, l) Then, by the definition of maximum information gain γi,T,l per node i in the graph: γi,T,l := max Ai {Zi Ai}T I(xi,1:T,l, fi,1:T,l) = max Ai {Zi Ai}T 1 2 t=1 ln 1 + σ2 i,t 1(zt, at, l) Accordingly, we can write the maximum information gain between xi,1:T and fi,1:T as follows: γi,T := max Ai {Zi Ai}T I(xi,1:T , fi,1:T ) = max Ai {Zi Ai}T 1 2 l=1 ln 1 + σ2 i,t 1(zt, at, l) l=1 max Ai {Zi Ai}T 1 2 t=1 ln 1 + σ2 i,t 1(zt, at, l) l=1 γi,T,l (40) Note: Bounds for γi,T,l and γi,T . Upper bounds on γi,T,l are provided in Srinivas et al. (2010) for widely used kernels and scale sublinearly in T. We use pi = d |pa(i)| to represent the size Published as a conference paper at ICLR 2023 of the z-input to a GP. Recall that q is the length of each action vector i.e. Ai Rq. For the linear kernel γi,T,l = O((pi + q) log T), and for the squared exponential kernel γi,T,l = O((pi + q)(log T)pi+q+1). We can use Eq. (39) to give bounds on γi,T too e.g. if all GPs use independent linear kernels for each output component γi,T = O(di(pi+q) log T), and if all GPs use independent squared exponential kernels for each output component γi,T = O(d(pi + q)(log T)pi+q+1). Step 2. A bound for model complexity ΓT . Here we bound the model complexity Eq. (17) 1 ln(1 + ρ 2 i )γi,T . (41) For readability, in the following, we denote max A ( ) := max A: A= i Ai: i: Ai {Zi Ai}T ( ). ΓT = max {Z A}T i=0 σi,t 1(zi, ai) 2 2 i=0 σi,t 1(zi, ai) 2 2 t=1 σi,t 1(zi, ai) 2 2 t=1 σi,t 1(zi, ai) 2 2 l=1 σi,(t 1)(zi, ai, l) 2 2 2 ln(1 + ρ 2 i ) max Ai 1 2 l=1 ln 1 + σi,(t 1)(zi, ai, l) | {z } maximum information gain Eq. (39) 2 ln(1 + ρ 2 i )γi,T 7 = O(mγT ), where 1 bounds A and Z with a box. 2 is from the max over a sum. 3 is due to the assumption of xi,t being independent of Aj, j = i, conditioned on Ai. 4 is due to Jensen s inequality. 5 is due to the fact that for any s2 [0, ρ 2] we can bound s2 ρ 2 ln(1+ρ 2) ln(1+s2) Srinivas et al. (2010). This also holds for function s2( ) := ρ 2 i σ2 i,(t 1),l( ) since ρ 2 i σ2 i,(t 1),d( ) ρ 2 i k( , ) ρ 2 i because ki( , ) < 1 i (bounded kernel assumption). 
6 is due to Eqs. (36) and (39). Finally, in 7 we define γT := maxi γi,T being the maximum value of the maximum information gains over graph nodes. Plugging this upper bound on ΓT into Theorem 2 completes the proof. Note: Sublinearity w.r.t. T of maximum information gain γT . Upper bounds on γT will often scale sublinearly in T. This follows from γi,T scaling sublinearly in T for many popularly used kernels (see previous note). In particular, a linear kernel leads to γT = O (d(Kd + q) log T) and a squared exponential kernel leads to γT = O d(Kd + q)(log T)Kd+q+1 (assuming output components are independent) since maxi pi = Kd. A.2.4 DEPENDENCE OF βT ON T FOR PARTICULAR KERNELS Note that for Assumption 3 to hold under our RKHS assumptions, βT might depend on T. For a single observed variable corresponding to node i at time t, for Assumption 3 to hold we must have βT = O Bi + ρi d γi,t (Chowdhury & Gopalan (2019) equation 9). For Assumption 3 to hold at Published as a conference paper at ICLR 2023 time t for all i we can apply a union bound and see that it is sufficient for βT = O B + ρ where B = maxi Bi and ρ = maxi ρi. In the regret bound of Theorem 1, βT appears raised to the power N. For γT corresponding to the linear kernel and squared exponential kernel, this will still lead to sublinear regret regardless of N because γT will be only logarithmic in T. However, for a Mat ern kernel, where the best known bound γT = O (p(p T)clog(p T)) with 0 < c < 1, the cumulative regret bound will not be sublinear if N and c are sufficiently large. A similar phenomena with the Mat ern kernel appears in the guarantees of Curi et al. (2020) which use GP models in model-based reinforcement learning. A.3 MAXIMIZING THE ACQUISITION FUNCTION Our theoretical results assume access to an oracle that can maximize Eq. (10). Here we discuss how we approximate this oracle in practice. In noiseless settings, instead of parameterizing each ηi as a neural network, we can parameterize it as a constant. This is because with no noise, the inputs to ηi (zi and ai) given a are fixed. This keeps the space of parameters to optimize over small, meaning we can use an identical optimization procedure to that used in EIFN by Astudillo & Frazier (2021b) which is also an out-of-the-box optimizer in the Bo Torch package Balandat et al. (2020). For noisy settings where each ηi : Ai Zi R is a neural network, we use our own optimizer. For each initialization of η parameters, we perform stochastic gradient descent to optimize both the ηi parameters and a. We can do this because Eq. (12) is differentiable with respect to both a and the parameters of each ηi. After running stochastic gradient descent on many random initializations we will have many solution candidates. We select the candidate with the highest acquisition function value. We use a large number of different random initializations because the acquisition function may be very non-convex. Other approaches, such as those considered in Curi et al. (2020) for model-based reinforcement learning, could also be adapted to optimize our acquisition function. When parameterizing each ηi with a neural network, we always use a two layer feed-forward network with a Re Lu non-linearity, To map the output into [ 1, 1] we put the output of the network through an element-wise Sigmoid. In all noisy environments we estimate the expectation in the acquisition function (Eq. (12)) using a Monte Carlo estimate with 32 repeats for each gradient step. 
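Below is a minimal sketch of this optimization procedure: constant $\eta_i$ parameters and actions $a$ are jointly optimized by gradient ascent on a Monte Carlo estimate of Eq. (12) (32 samples, as above), over several random restarts, keeping the best candidate. The stand-in posterior mean and standard deviation functions, the toy chain graph, the learning rate, and the box constraint on $a$ are all illustrative assumptions rather than the paper's settings.

```python
import torch

torch.manual_seed(0)
parents = {0: [], 1: [0], 2: [1]}          # toy chain graph; node 2 is the reward Y
beta, n_mc, n_restarts, n_steps = 0.5, 32, 8, 200

# Differentiable stand-ins for the GP posterior mean/std of each f_i.
def mu(i, z, a):
    return torch.tanh(z.sum(-1) + a)

def sigma(i, z, a):
    return 0.2 / (1.0 + z.pow(2).sum(-1) + a.pow(2))

def acquisition(a, eta):
    """Monte Carlo estimate of E_omega[y | optimistic model, a] (Eq. (12)),
    with each eta_i parameterized as a constant mapped into [-1, 1] via tanh."""
    x = {}
    for i in sorted(parents):
        z = (torch.stack([x[j] for j in parents[i]], dim=-1)
             if parents[i] else torch.zeros(n_mc, 0))
        omega = 0.1 * torch.randn(n_mc)
        x[i] = mu(i, z, a[i]) + beta * sigma(i, z, a[i]) * torch.tanh(eta[i]) + omega
    return x[max(parents)].mean()

best_val, best_a = -float("inf"), None
for _ in range(n_restarts):                               # many random initializations
    a = torch.rand(len(parents), requires_grad=True)
    eta = torch.randn(len(parents), requires_grad=True)
    opt = torch.optim.Adam([a, eta], lr=0.05)
    for _ in range(n_steps):                              # stochastic gradient ascent on Eq. (12)
        opt.zero_grad()
        loss = -acquisition(a, eta)
        loss.backward()
        opt.step()
        with torch.no_grad():
            a.clamp_(0.0, 1.0)                            # keep actions in a box domain
    val = acquisition(a, eta).item()
    if val > best_val:
        best_val, best_a = val, a.detach().clone()

print(best_a, best_val)
```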
For noisy Dropwave we use 128 repeats because the environment is particularly noisy compared to the other noisy environments.

A.4 EXPERIMENTAL SETUP

Here we give additional notes explaining the details of our experimental setup. The cross-validation for selecting β is performed across β ∈ {0.05, 0.5, 5}. For a given performance metric and task (take, for example, average reward on Dropwave), we select the β that on average performs best across all other tasks for the same performance metric, denoted β*. We then report the results in terms of average reward for running the algorithm (GP-UCB or MCBO) on Dropwave with β = β*. A small sketch of this selection rule is given below, after Figure 4. For MCBO and EIFN, we use identical values for all shared hyperparameters (e.g., GP kernels) and use the original hyperparameters from Astudillo & Frazier (2021b). CBO methods do not have hyperparameters comparable with these methods because CBO methods do not model individual system observations (see Section 2.1). We run the original EICBO source code effectively unmodified (Aglietti et al., 2020b).

For CBO tasks, if the agent selects no intervention (observational data) at a given timestep, we give the agent, for any algorithm, a single observational sample. This differs from Aglietti et al. (2020b), where 20 observational samples are given. There is no noisy version of Ackley because adding noise can make the environment unstable by producing very large or very small rewards. For all environments and all Xi, the noisy environment adds unit-Gaussian noise, except Dropwave, where we scale this noise by 0.1 to make the environment more stable.

Figure 3: The DAGs corresponding to each task in the experiments, except binary tree, which is used for illustrative purposes. Circles are observation or reward variables. Squares are actions in the function network setting. In hard intervention CBO, nodes with a dashed border can be intervened upon. All nodes represent a scalar random variable.

Figure 4: All average reward plots not used in the main paper (Dropwave, Rosenbrock, and Alpine2; MCBO vs. EIFN vs. GP-UCB, average reward over 100 rounds). All settings are noiseless. When applying MCBO to function networks we allow am to be nonzero, to match the function networks setting.
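As referenced above, the β cross-validation is a leave-one-task-out selection. A minimal sketch of this rule follows; the `scores` table layout and function name are illustrative, not part of our released code.

```python
def select_beta(scores, held_out_task, betas=(0.05, 0.5, 5.0)):
    """Leave-one-task-out selection: pick the beta that performs best on
    average across all tasks *except* the one we are about to report on.
    scores[beta][task] holds the metric (e.g. average reward) for that run."""
    def mean_excluding(beta):
        vals = [v for task, v in scores[beta].items() if task != held_out_task]
        return sum(vals) / len(vals)
    return max(betas, key=mean_excluding)

# Example: report Dropwave using the beta chosen on all other tasks.
# beta_star = select_beta(scores, held_out_task="Dropwave")
```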
Figure 5: An ablation across β where we plot average reward for all tasks (Toy Graph, PSAGraph, Ackley, Dropwave, Rosenbrock, Alpine2, and the noisy Dropwave, Rosenbrock, and Alpine2 variants; MCBO with β ∈ {0.05, 0.5, 5} vs. EIFN).

A.5 FURTHER EXPERIMENTAL ANALYSIS

Average expected reward In Figure 4 we show the deterministic function network plots not included in the main paper. In Figure 5 we show the average reward for all β considered in the cross-validation.

Best expected reward Here we also report the best expected reward as a function of the number of rounds T. The best expected reward at time T is defined as
$$\max_{a \in \{a_t\}_{t=0}^{T}} \mathbb{E}_{\omega}[Y \mid a].$$
This is similar to, but not directly inversely related to, simple regret. Simple-regret algorithms require an inference procedure to select a final action after T rounds of exploration, and this procedure could be computationally expensive in the CBO setting. For example, a reasonable choice would be to report a final action that maximizes a lower confidence bound on the objective in Eq. (4); however, a closed-form expression for this does not exist for CBO. Best expected reward instead assumes access to an oracle that computes the highest expected reward of any action the algorithm has previously chosen. Our plotting of the best expected reward is consistent with previous work on CBO (Aglietti et al., 2020b). A small sketch of how this curve can be computed from the action history is given after Figure 6 below.

When performing the same cross-validation procedure as for the average expected reward case, we find that the performance of MCBO varies drastically across tasks. This is shown in Figure 6. Selecting β is known to be a difficulty with UCB-based methods (Merrill et al., 2021). Figure 7 shows the performance in terms of best reward for all 3 possibilities for β.

Figure 6: All best reward plots (Toy Graph, PSAGraph, Ackley, Dropwave, Rosenbrock, Alpine2, and the noisy Dropwave, Rosenbrock, and Alpine2 variants; CBO tasks compare MCBO against EICBO and EIFN, function network tasks compare MCBO against EIFN and GP-UCB).
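As referenced above, the best expected reward curve is simply a running maximum over the actions chosen so far. A minimal sketch, assuming an oracle `expected_reward(a)` that returns $\mathbb{E}_{\omega}[Y \mid a]$ (both names are illustrative):

```python
def best_expected_reward_curve(actions, expected_reward):
    """Running maximum of E_omega[Y | a] over all actions chosen so far.
    `expected_reward` is an oracle, e.g. many noise-free simulator rollouts."""
    curve, best = [], float("-inf")
    for a in actions:
        best = max(best, expected_reward(a))
        curve.append(best)
    return curve
```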
When selecting just the best β among the 3 tried, we find that MCBO is very strong on all tasks (with the exception of best reward on Toy Graph, where it underperforms the baselines). The optimal β can vary by an order of magnitude or more across tasks in our case. This is likely because our settings have very different functional relationships and graph structures. As discussed in Section 4, this can be understood from our cumulative regret guarantee in Theorem 1. The plots of average and best scores for other values of β (Figs. 5 and 7) suggest that β can be made larger for MCBO to trade off exploration vs. exploitation, depending on whether simple or cumulative regret is more of a priority. Meanwhile, expected improvement methods are focused almost entirely on exploration.

Runtimes In Figure 8 we show the total runtime of each method over all 100 rounds. EIFN and MCBO run on equivalent hardware, which has 4 times more RAM than the hardware used for GP-UCB and EICBO. MCBO generally has a much longer runtime than EIFN in noisy settings, where the ηi are parameterized by neural networks. Generally in BO, because the unknown function is often assumed expensive to evaluate, we are less concerned about the time required to optimize the acquisition function and more concerned with notions of statistical efficiency such as regret. Our implementation for noisy settings could likely be sped up with more parallelism to make its runtime more comparable to EIFN's. In noiseless settings, where the optimization methods used are roughly equivalent between EIFN and MCBO, the two methods have comparable runtimes.

Figure 7: An ablation across β where we plot best reward for all tasks (Toy Graph, PSAGraph, Ackley, Dropwave, Rosenbrock, Alpine2, and the noisy Dropwave, Rosenbrock, and Alpine2 variants; MCBO with β ∈ {0.05, 0.5, 5} vs. EIFN).

Figure 8: Runtimes of experiments for (a) a CBO setting (PSAGraph: MCBO, EICBO, EIFN), (b) a noiseless function network setting (Alpine2: MCBO, EIFN, GP-UCB), and (c) a noisy function network setting (Alpine2, noisy: MCBO, EIFN, GP-UCB).

Non-monotonic regret of EICBO In Fig. 2 (a, b) we found that EICBO had non-monotonic average reward, which can likely be explained by its use of an expected improvement acquisition function. We wanted to verify that the non-monotonic average reward was due to the algorithm and not to differences in our setup compared to Aglietti et al. (2020b). Besides evaluating in terms of average reward instead of best reward, compared to Aglietti et al.
(2020b) we also used smaller initial sample sizes (random samples the agent obtains before t = 0) of observational data. This was to make the setting more challenging, since we found that with extra observational data, Toy Graph and PSAGraph could be solved with very few samples. When reproducing the experiments of Fig. 2 (a, b) with additional starting observational data, we found the same qualitative result (non-monotonicity) for the average reward. Note that this result is not inconsistent with the experimental results of Aglietti et al. (2020b), since they evaluate only in terms of best reward.