Published as a conference paper at ICLR 2024

ADVERSARIAL CAUSAL BAYESIAN OPTIMIZATION

Scott Sussex, ETH Zürich (scott.sussex@inf.ethz.ch)
Pier Giuseppe Sessa, ETH Zürich
Anastasiia Makarova, ETH Zürich
Andreas Krause, ETH Zürich

ABSTRACT

In Causal Bayesian Optimization (CBO), an agent intervenes on a structural causal model with known graph but unknown mechanisms to maximize a downstream reward variable. In this paper, we consider the generalization where other agents or external events also intervene on the system, which is key for enabling adaptiveness to non-stationarities such as weather changes, market forces, or adversaries. We formalize this generalization of CBO as Adversarial Causal Bayesian Optimization (ACBO) and introduce the first algorithm for ACBO with bounded regret: Causal Bayesian Optimization with Multiplicative Weights (CBO-MW). Our approach combines a classical online learning strategy with causal modeling of the rewards. To achieve this, it computes optimistic counterfactual reward estimates by propagating uncertainty through the causal graph. We derive regret bounds for CBO-MW that naturally depend on graph-related quantities. We further propose a scalable implementation for the case of combinatorial interventions and submodular rewards. Empirically, CBO-MW outperforms non-causal and non-adversarial Bayesian optimization methods on synthetic environments and environments based on real-world data. Our experiments include a realistic demonstration of how CBO-MW can be used to learn users' demand patterns in a shared mobility system and reposition vehicles in strategic areas.

1 INTRODUCTION

How can a scientist efficiently optimize an unknown function that is expensive to evaluate? This problem arises in automated machine learning, drug discovery and agriculture. Bayesian optimization (BO) encompasses an array of algorithms for sequentially optimizing unknown functions (Močkus, 1975). Classical BO algorithms treat the unknown function mostly as a black box and make minimal structural assumptions. By incorporating more domain knowledge about the unknown function into the algorithm, one can hope to optimize the function using fewer evaluations.

A recent line of work on causal Bayesian optimization (CBO) (Aglietti et al., 2020) integrates a structural causal model (SCM) into BO methods. It is assumed that actions are interventions on some set of observed variables, which are causally related to each other and to a reward variable through a known causal graph (Fig. 1b) but unknown mechanisms. Many important BO problems have this structure. For example, managing supply in a Shared Mobility System (SMS) involves intervening on the distribution of bikes and scooters across the city. Importantly, Sussex et al. (2022) show that a BO approach leveraging the additional structure of CBO can achieve exponentially lower regret in terms of the action space size.

Most CBO methods to date assume that the system is completely stationary across interactions and that only one agent interacts with the system. However, it is often desirable to incorporate the influence of external events. For example, in an SMS the day's demand is highly non-stationary, and can only be fully observed at the day's end. We would like an algorithm that adapts to the variability in these external events. In this work, we incorporate external events into CBO by introducing a novel adversarial CBO (ACBO) setup, illustrated in Fig. 1c.
Crucially, in ACBO the downstream reward is explicitly influenced by potentially adversarial interventions on certain nodes in the causal graph (identified using dashed nodes in Fig. 1c) that can only be observed a-posteriori. For this general setting, we propose a novel algorithm, CBO with multiplicative weights (CBO-MW), and prove a regret guarantee using a stronger (but natural) notion of regret than the one used in existing CBO works. For settings where the number of intervention targets is large, we propose a distributed version of CBO-MW which is computationally efficient and can achieve approximate regret guarantees under some additional submodularity assumptions on the reward. Finally, we find empirical evidence that CBO-MW outperforms relevant prior work in adversarial versions of previously studied CBO benchmarks and in learning to re-balance units in an SMS simulation based upon real data.

Figure 1: Visualizations of Bayesian Optimization (BO), Causal BO (CBO), and the Adversarial CBO (ACBO) introduced in this work. Agents must select actions a that maximize rewards Y. (a) Standard BO assumes the simplest DAG from actions to reward, regardless of the problem structure. (b) CBO incorporates side observations, e.g., X_0, X_1, and causal domain knowledge. Each observation could be modeled with a separate model. (c) In ACBO (this work), we additionally model the impact of external events (weather, perturbations, other players' actions, etc.) that cannot be controlled, but can only be observed a-posteriori. These are depicted as dashed blue nodes and can directly affect the reward node, like a′_Y, or indirectly affect it by perturbing upstream variables, like a′_0, a′_1.

2 BACKGROUND AND PROBLEM STATEMENT

We consider the problem of an agent interacting with an SCM for T rounds in order to maximize the value of a reward variable. We start by introducing SCMs and the soft intervention model used in this work, and then define the adversarial sequential decision-making problem we study. In the following, we denote by [m] the set of integers {0, . . . , m}.

Structural Causal Models. Our SCM is described by a tuple ⟨G, Y, X, F, Ω⟩ of the following elements: G is a known DAG; Y is the reward variable; X = {X_i}_{i=0}^{m−1} is a set of observed scalar random variables; the set F = {f_i}_{i=0}^{m} defines the unknown functional relations between these variables; and Ω = {Ω_i}_{i=0}^{m} is a set of independent noise variables with zero mean and known distribution. We use the notation Y and X_m interchangeably and assume the elements of X are topologically ordered, i.e., X_0 is a root and X_m is a leaf. We denote by pa_i ⊆ {0, . . . , m} the indices of the parents of the i-th node, and use the notation Z_i = {X_j}_{j ∈ pa_i} for the parents of this node. We sometimes use X_i to refer both to the i-th node and to the i-th random variable. Each X_i is generated according to the function f_i : Z_i → X_i, taking the parent nodes Z_i of X_i as input: x_i = f_i(z_i) + ω_i, where lowercase denotes a realization of the corresponding random variable. The reward is a scalar x_m ∈ [0, 1], while each observation x_i takes values in a compact set X_i ⊂ ℝ, and its parents are defined over Z_i = ∏_{j ∈ pa_i} X_j for i ∈ [m−1].¹
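To make the data-generating process above concrete, here is a minimal forward-sampling sketch of such an SCM (our own illustration, not code from the paper): nodes are visited in topological order, and each node applies its mechanism f_i to the realized values of its parents and adds independent zero-mean noise. The mechanisms, noise scale, and the toy three-node chain are hypothetical.

```python
import numpy as np

def sample_scm(mechanisms, parents, noise_scale=0.05, rng=None):
    """Forward-sample x_0, ..., x_m in topological order.

    mechanisms[i] maps the vector of parent values z_i to f_i(z_i);
    parents[i] lists the indices of node i's parents. The last node is
    the reward Y = X_m. Mechanisms and noise here are illustrative.
    """
    rng = rng if rng is not None else np.random.default_rng()
    x = np.zeros(len(mechanisms))
    for i in range(len(mechanisms)):          # topological order assumed
        z_i = x[parents[i]]                   # realized parent values
        x[i] = mechanisms[i](z_i) + noise_scale * rng.standard_normal()
    return x

# Toy chain X0 -> X1 -> Y with hand-picked (hypothetical) mechanisms.
mechanisms = [lambda z: 1.0,                  # root node f_0
              lambda z: np.sin(3.0 * z[0]),   # f_1(X0)
              lambda z: z[0] ** 2]            # reward f_2(X1) = Y
parents = [[], [0], [1]]
print(sample_scm(mechanisms, parents))
```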
Interventions. In our setup, an agent and an adversary both perform interventions on the SCM.² We consider a soft intervention model (Eberhardt and Scheines, 2007) where interventions are parameterized by controllable action variables. A simple example of a soft intervention is a shift intervention, where actions affect their outputs additively (Zhang et al., 2021).

¹ Here we consider scalar observations for ease of presentation, but we note that the methodology and analysis can be easily extended to vector observations as in Sussex et al. (2022).
² Our framework allows for potentially multiple adversaries, but since we consider everything from a single player's perspective, it is sufficient to combine all the other agents into a single adversary.

First, consider the agent and its action variables a = {a_i}_{i=0}^{m}. Each action a_i is a real number chosen from some finite set; that is, the space A_i of action a_i satisfies A_i ⊂ [0, 1] with |A_i| = K_i for some K_i ∈ ℕ. Let A be the space of all actions a = {a_i}_{i=0}^{m}. We represent the actions as additional nodes in G (see Fig. 1): a_i is a parent of only X_i, and hence an additional input to f_i. Since f_i is unknown, the agent does not know a priori the functional effect of a_i on X_i. Not intervening on a node X_i can be considered equivalent to selecting a_i = 0. For nodes that cannot be intervened on by our agent, we set K_i = 1 and do not include the action in diagrams, meaning that without loss of generality we consider the number of action variables to be equal to the number of nodes m.³ For the adversary we consider the same intervention model but denote their actions by a′, with each a′_i defined over A′_i ⊂ [0, 1], where |A′_i| = K′_i and K′_i is not necessarily equal to K_i. According to the causal graph, actions a, a′ induce a realization of the graph nodes:

    x_i = f_i(z_i, a_i, a′_i) + ω_i,   i ∈ [m].   (1)

If an index i corresponds to a root node, the parent vector z_i denotes an empty vector, and the output of f_i only depends on the actions.

Problem statement. Over multiple rounds, the agent and adversary intervene simultaneously on the SCM, with known DAG G and fixed but unknown functions F = {f_i}_{i=0}^{m} with f_i : Z_i × A_i × A′_i → X_i. At round t the agent selects actions a_{:,t} = {a_{i,t}}_{i=0}^{m} and obtains observations x_{:,t} = {x_{i,t}}_{i=0}^{m}, where the additional subscript denotes the round of interaction. When obtaining observations, the agent also observes which actions the adversary chose, a′_{:,t} = {a′_{i,t}}_{i=0}^{m}. We assume the adversary does not have the power to know a_{:,t} when selecting a′_{:,t}, but only has access to the history of interactions until round t and knowledge of the agent's algorithm. The agent obtains a reward given by

    y_t = f_m(z_{m,t}, a_{m,t}, a′_{m,t}) + ω_{m,t},   (2)

which implicitly depends on the whole action vector a_{:,t} and the adversary's actions a′_{:,t}. The agent's goal is to select a sequence of actions that maximizes its cumulative expected reward Σ_{t=1}^{T} r(a_{:,t}, a′_{:,t}), where r(a_{:,t}, a′_{:,t}) = E[y_t | a_{:,t}, a′_{:,t}] and expectations are taken over ω unless otherwise stated. The challenge for the agent lies in knowing a priori neither the causal model (i.e., the functions F) nor the sequence of adversarial actions {a′_{:,t}}_{t=1}^{T}.

Performance metric. After T timesteps, we measure the performance of the agent via the notion of regret:

    R(T) = max_{a ∈ A} Σ_{t=1}^{T} r(a, a′_{:,t}) − Σ_{t=1}^{T} r(a_{:,t}, a′_{:,t}),   (3)

i.e., the difference between the best cumulative expected reward obtainable by playing a single fixed action if the adversary's action sequence and F were known in hindsight, and the agent's cumulative expected reward.
We seek to design algorithms for the agent that are no-regret, meaning that R(T)/T → 0 as T → ∞ for any sequence {a′_{:,t}}. We emphasize that while we use the term "adversary", our regret notion encompasses all strategies that the adversary could use to select actions; this might include cooperative agents or mechanism non-stationarities. For simplicity, we consider only adversary actions observed after the agent chooses actions. Our methods can be extended to also consider adversary actions observed before the agent chooses actions, i.e., a context. This results in learning a policy that returns actions depending on the context, rather than just learning a fixed action. This extension is straightforward and we briefly discuss it in Appendix A.2.

Regularity assumptions. We consider standard smoothness assumptions for the unknown functions f_i : S → X_i defined over a compact domain S (Srinivas et al., 2010). In particular, for each node i ∈ [m], we assume that f_i(·) belongs to a reproducing kernel Hilbert space (RKHS) H_{k_i}, a space of smooth functions defined on S = Z_i × A_i × A′_i. This means that f_i ∈ H_{k_i} is induced by a kernel function k_i : S × S → ℝ. We also assume that k_i(s, s′) ≤ 1 for every s, s′ ∈ S.⁴ Moreover, the RKHS norm of f_i(·) is assumed to be bounded, ‖f_i‖_{k_i} ≤ B_i, for some fixed constant B_i > 0. Finally, to ensure the compactness of the domains Z_i, we assume that the noise ω is bounded, i.e., ω_i ∈ [−1, 1].

³ There may be constraints on the actions our agent can take. We refer the reader to Sussex et al. (2022) for how our setup can be extended to handle constraints.
⁴ This is known as the bounded variance property, and it holds for many common kernels.

3 RELATED WORK

Causal Bayesian optimization. Several recent works study how to perform Bayesian optimization on systems with an underlying causal graph. Aglietti et al. (2020) propose the first CBO setting with hard interventions and an algorithm that uses the do-calculus to generalise from observational to interventional data, even in settings with unobserved confounding. Astudillo and Frazier (2021) consider a noiseless setting with soft interventions (known as a function network) where a full system model is learnt and an expected improvement objective is used to select interventions. Sussex et al. (2022) propose MCBO, an algorithm with theoretical guarantees that can be used with both hard and soft interventions. MCBO propagates epistemic uncertainty about causal mechanisms through the graph, balancing exploration and exploitation using the optimism principle (Srinivas et al., 2010). Causal bandits, which similarly incorporate causal graph knowledge into the bandit setting, usually consider discrete actions with categorical observations (Lattimore et al., 2016) or linear mechanisms with continuous observations (Varici et al., 2022). All of these methods consider only stationary environments and do not account for possible adversaries.

Bayesian optimization in non-i.i.d. settings. Multiple works study how to develop robust strategies against shifts in uncontrollable covariates. They study notions of worst-case adversarial robustness (Bogunovic et al., 2018), distributional robustness (Kirschner et al., 2020; Nguyen et al., 2020), robust mixed strategies (Sessa et al., 2020a), and risk-aversion to uncontrollable environmental parameters (Makarova et al., 2021; Iwazaki et al., 2021). Non-stationarity is studied in the canonical BO setup by Kirschner et al. (2020); Nguyen et al. (2020) and in the CBO setup by Aglietti et al. (2021).
However, these methods do not accommodate adversaries in the system, e.g., multiple agents that we cannot control. A foundation for our work is the GP-MW algorithm (Sessa et al., 2019), which studies learning in unknown multi-agent games and is a special case of our setting. We compare further with GP-MW in Section 5. Another special case of CBO-MW with a specific graph is STACKELUCB (Sessa et al., 2021), designed for playing unknown Stackelberg games with multiple types of opponent (see Appendix A.1.1).

4 METHOD

In this section, we introduce the methodology for the proposed CBO-MW algorithm.

4.1 CALIBRATED MODELS

An important component of our approach is the use of calibrated uncertainty models to learn the functions F, as done in Sussex et al. (2022). At the end of each round t, we use the dataset D_t = {z_{:,1:t}, a_{:,1:t}, a′_{:,1:t}, x_{:,1:t}} of past interventions to fit a separate model for every node in the system. CBO-MW can be applied with any set of models that have calibrated uncertainty. That is, for every node i at time t the model has a mean function µ_{i,t} and variance function σ_{i,t} (learnt from D_t) that accurately capture any epistemic uncertainty about the true model.

Assumption 1 (Calibrated model). All statistical models are calibrated w.r.t. F, so that for all i, t there exists a sequence β_t ∈ ℝ_{>0} such that, with probability at least 1 − δ, for all (z_i, a_i, a′_i) ∈ Z_i × A_i × A′_i we have |f_i(z_i, a_i, a′_i) − µ_{i,t−1}(z_i, a_i, a′_i)| ≤ β_t σ_{i,t−1}(z_i, a_i, a′_i), element-wise.

If the models are calibrated, we can form confidence bounds that contain the true system model with high probability. This is done by combining separate confidence bounds for the mechanism at each node. At time t, the known set M_t of statistically plausible functions F̃ = {f̃_i}_{i=0}^{m} is defined as

    M_t = { F̃ = {f̃_i}_{i=0}^{m}  s.t.  ∀i:  f̃_i ∈ H_{k_i},  ‖f̃_i‖_{k_i} ≤ B_i,  and
            |f̃_i(z_i, a_i, a′_i) − µ_{i,t−1}(z_i, a_i, a′_i)| ≤ β_t σ_{i,t−1}(z_i, a_i, a′_i)  ∀ z_i ∈ Z_i, a_i ∈ A_i, a′_i ∈ A′_i }.   (4)

GP models. Gaussian process (GP) models can be used to model epistemic uncertainty. These are the model class we study in our analysis (Section 5), where we also give explicit forms for β_t that satisfy Assumption 1. For all i ∈ [m], let µ_{i,0} and σ²_{i,0} denote the prior mean and variance functions for each f_i, respectively. Since ω_i is bounded, it is also sub-Gaussian, and we denote its variance proxy by b_i². The corresponding posterior GP mean and variance, denoted by µ_{i,t} and σ²_{i,t} respectively, are computed based on the previous evaluations D_t:

    µ_{i,t}(s_i) = k_t(s_i)ᵀ (K_t + b_i² I_t)⁻¹ x_{i,1:t},   (5)
    σ²_{i,t}(s_i) = k_i(s_i, s_i) − k_t(s_i)ᵀ (K_t + b_i² I_t)⁻¹ k_t(s_i),   (6)

where s_i = (z_i, a_i, a′_i) denotes a query input, s_{i,j} = (z_{i,j}, a_{i,j}, a′_{i,j}) are the past inputs, k_t(s_i) = [k_i(s_i, s_{i,j})]_{j=1}^{t}, and K_t = [k_i(s_{i,j}, s_{i,j′})]_{j,j′} is the kernel matrix.
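As a concrete illustration of the GP posterior in Eqs. (5)-(6) and the confidence bounds of Assumption 1, the sketch below fits an exact GP for a single node with a squared exponential kernel and returns µ ± β_t σ. It is our own minimal example in plain NumPy, not the paper's implementation; the kernel lengthscale, noise level, and β_t value are illustrative.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """Squared exponential kernel matrix k(A, B); rows are inputs."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_confidence(S_train, x_train, S_query, beta_t, noise_var=0.01):
    """Posterior mean/std at S_query (Eqs. 5-6) and the confidence band
    mu +/- beta_t * sigma of Assumption 1. Rows of S_* stack the GP
    inputs s_i = (z_i, a_i, a'_i) for one node."""
    K = rbf(S_train, S_train) + noise_var * np.eye(len(S_train))
    K_inv = np.linalg.inv(K)
    k_star = rbf(S_query, S_train)                       # k_t(s)
    mu = k_star @ K_inv @ x_train                        # Eq. (5)
    var = rbf(S_query, S_query).diagonal() - np.einsum(
        "ij,jk,ik->i", k_star, K_inv, k_star)            # Eq. (6)
    sigma = np.sqrt(np.clip(var, 0.0, None))
    return mu, sigma, mu + beta_t * sigma, mu - beta_t * sigma

rng = np.random.default_rng(0)
S = rng.uniform(size=(20, 3))                            # (z_i, a_i, a'_i)
x = np.sin(3 * S[:, 0]) + 0.1 * rng.standard_normal(20)  # node samples
mu, sigma, ucb, lcb = gp_confidence(S, x, rng.uniform(size=(5, 3)), beta_t=2.0)
print(ucb - lcb)                                         # width 2*beta_t*sigma
```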
4.2 THE CBO-MW ALGORITHM

Algorithm 1 Causal Bayesian Optimization with Multiplicative Weights (CBO-MW)
Require: parameters τ, {β_t}_{t≥1}, G, Ω, kernel functions k_i and prior means µ_{i,0} = 0 for all i ∈ [m]
1: Initialize w^1 = (1/|A|)(1, . . . , 1) ∈ ℝ^{|A|}
2: for t = 1, 2, . . . do
3:   Sample a_t ∼ w^t
4:   Observe samples {z_{i,t}, x_{i,t}, a′_{i,t}}_{i=0}^{m}
5:   Update posteriors {µ_{i,t}(·), σ²_{i,t}(·)}_{i=0}^{m}
6:   for a ∈ A do
7:     Compute UCB^G_t(a, a′_t) using Algorithm 2
8:     ŷ^a_t = min(1, UCB^G_t(a, a′_t))
9:     w^{t+1}_a ∝ w^t_a exp(τ ŷ^a_t)
10:  end for
11: end for

We can now present the proposed CBO-MW algorithm. Our approach is based upon the classic multiplicative weights (MW) method (Littlestone and Warmuth, 1994; Freund and Schapire, 1997), widely used in adversarial online learning. Indeed, ACBO can be seen as a specific structured online learning problem. At time t, CBO-MW maintains a weight w^t_a for every possible intervention a ∈ A such that Σ_a w^t_a = 1 and uses these weights to sample the chosen intervention, i.e., a_t ∼ w^t. Contrary to standard CBO (where algorithms can choose actions deterministically), in adversarial environments such as ACBO randomization is necessary to achieve no-regret; see, e.g., Cesa-Bianchi and Lugosi (2006). Then, CBO-MW updates the weights at the end of each round based upon the action a′_t that the adversary took and the observations x_t.

If the mechanism between actions, adversary actions, and rewards were completely known (i.e., the function r(·) in our setup), a standard MW strategy would update the weight of every a according to w^{t+1}_a ∝ w^t_a exp(τ r(a, a′_t)), where τ is a tunable learning rate. In particular, r(a, a′_t) is the counterfactual of what would have happened, in expectation over the noise, had a′_t remained fixed but the algorithm selected a instead of a_t. However, in ACBO the system is unknown and thus such counterfactual information is not readily available. On the other hand, as outlined in the previous section, we can build and update calibrated models around the unknown mechanisms and then estimate counterfactual quantities from them. Specifically, CBO-MW utilizes the confidence set M_t to compute an optimistic estimate of r(a, a′_t):

    UCB^G_t(a, a′) = max_{F̃ ∈ M_t} E_ω[ y | F̃, a, a′ ].   (7)

Given an action a, an opponent action a′_t and the confidence set M_t, UCB^G_t(a, a′_t) represents the highest expected return among all system models in this confidence set. CBO-MW uses such estimates to update the weights in place of the true but unknown counterfactuals r(a, a′_t). Computing UCB^G_t(a, a′) is challenging, since our confidence set M_t consists of m different models and one must propagate epistemic uncertainty through all models in the system, from actions to rewards. Because mechanisms can be non-monotonic and nonlinear, one cannot simply independently maximize the output of every mechanism. We thus defer this task to an algorithmic subroutine (the causal UCB oracle) which we describe in the next subsection. CBO-MW is summarized in Algorithm 1.

We note that CBO-MW strictly generalizes the GP-MW algorithm of Sessa et al. (2019), which was the first to propose combining MW with optimistic counterfactual reward estimates. However, GP-MW considers a trivial causal graph with only a target node, and thus a single GP model. For this simpler model one can compute UCB^G_t in closed form, but any causal structure in the reward model is ignored. In Section 5 and in our experiments we show that CBO-MW can significantly outperform GP-MW both theoretically and experimentally.
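The multiplicative-weights component of Algorithm 1 (Lines 3 and 6-9) can be sketched as follows. This is a simplified illustration, not the paper's code: `ucb_oracle` stands in for the causal UCB oracle of Section 4.3 and is assumed to return UCB^G_t(a, a′_t) for a given action index; class and method names are our own choices.

```python
import numpy as np

class MWLearner:
    """Multiplicative-weights action selection over a finite action set,
    updated with optimistic counterfactual reward estimates
    (Lines 3 and 6-9 of Algorithm 1)."""

    def __init__(self, n_actions, tau, rng=None):
        self.w = np.ones(n_actions) / n_actions           # w^1 uniform
        self.tau = tau                                    # learning rate
        self.rng = rng if rng is not None else np.random.default_rng()

    def sample_action(self):
        """Line 3: a_t ~ w^t."""
        return self.rng.choice(len(self.w), p=self.w / self.w.sum())

    def update(self, ucb_oracle, adversary_action):
        """Lines 6-9: exponentiate the clipped optimistic counterfactuals,
        holding the observed adversary action fixed."""
        y_hat = np.array([min(1.0, ucb_oracle(a, adversary_action))
                          for a in range(len(self.w))])
        self.w *= np.exp(self.tau * y_hat)
        self.w /= self.w.sum()                            # renormalize
```

In each round the agent would sample an action, observe x_t and a′_t, refit the node models, and then call update with the refreshed oracle.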
4.3 CAUSAL UCB ORACLE

The problem in Eq. (7) is not amenable to commonly used optimization techniques, due to the maximization over a set of functions with bounded RKHS norm. Therefore, similarly to Sussex et al. (2022), we make use of the reparameterization trick to write any f̃_i ∈ F̃ ∈ M_t using a function η_i : Z_i × A_i × A′_i → [−1, 1] as

    f̃_{i,t}(z̃_i, a_i, a′_i) = µ_{i,t−1}(z̃_i, a_i, a′_i) + β_t σ_{i,t−1}(z̃_i, a_i, a′_i) η_i(z̃_i, a_i, a′_i),   (8)

where x̃_i = f̃_i(z̃_i, a_i, a′_i) + ω_i denotes observations obtained by simulating actions in one of the plausible models, not necessarily the true model. The validity of this reparameterization comes directly from the definition of M_t in Eq. (4) and the range of η_i. The reparameterization allows for rewriting UCB^G_t in terms of η : Z × A × A′ → [−1, 1]^{|X|}:

    UCB^G_t(a, a′) = max_{η(·)} E_ω[ y | F̃, a, a′ ],  s.t.  F̃ = {f̃_{i,t}} as in Eq. (8).   (9)

Algorithm 2 Causal UCB Oracle
Require: neural networks η, actions a, a′, model posteriors µ, σ, parameter β_t, number of repeats N_rep
1: Initialize SOLUTIONS = ∅
2: for j = 1, . . . , N_rep do
3:   Randomly initialize the weights of each η_i ∈ η
4:   UCB^G_{t,j} = max_{η(·)} E[y | F̃, a, a′], computed via stochastic gradient ascent on η
5:   SOLUTIONS = SOLUTIONS ∪ {UCB^G_{t,j}}
6: end for
7: return max(SOLUTIONS)

In practice, we can parameterize η, for example with neural networks, and maximize this objective using stochastic gradient ascent, as described in Algorithm 2. The reparameterization trick simplifies the optimization problem because we go from optimizing over functions with an awkward constraint (F̃ ∈ M_t) to a much simpler constraint (η only needs outputs in [−1, 1]). Since the optimization problem is still non-convex, we deploy multiple random re-initializations of the η parameters.
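Below is a minimal sketch of the reparameterized optimistic objective of Eqs. (8)-(9). It is our own simplification of Algorithm 2: each η_i is a per-node constant in [−1, 1] rather than a neural network, the expectation over ω is approximated by dropping the zero-mean noise, and the maximization uses random search instead of stochastic gradient ascent. `mu[i]` and `sigma[i]` are assumed to be callables returning the node-i posterior mean and standard deviation.

```python
import numpy as np

def optimistic_reward(eta, mu, sigma, parents, beta, action, adv_action):
    """Propagate (a, a') through one plausible model
    f~_i = mu_i + beta * sigma_i * eta_i  (Eq. 8) and return the value at
    the leaf (reward) node. Simplifications: eta_i is a per-node constant
    in [-1, 1] rather than a learned function, and the zero-mean noise is
    dropped instead of taking the expectation over omega."""
    x = np.zeros(len(mu))
    for i in range(len(mu)):
        s_i = np.concatenate([x[parents[i]], [action[i], adv_action[i]]])
        x[i] = mu[i](s_i) + beta * sigma[i](s_i) * eta[i]
    return x[-1]

def causal_ucb(mu, sigma, parents, beta, action, adv_action,
               n_samples=256, rng=None):
    """Crude stand-in for Algorithm 2: maximize Eq. (9) over eta by
    random search instead of gradient ascent on network parameters."""
    rng = rng if rng is not None else np.random.default_rng()
    best = -np.inf
    for _ in range(n_samples):
        eta = rng.uniform(-1.0, 1.0, size=len(mu))
        best = max(best, optimistic_reward(eta, mu, sigma, parents,
                                           beta, action, adv_action))
    return best
```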
5 THEORETICAL ANALYSIS

Here we analyse the theoretical performance of CBO-MW and provide a bound on its regret as a function of the underlying causal graph. For our analysis, we make some additional technical assumptions. First, we assume all f_i ∈ F are L_f-Lipschitz continuous; this follows directly from the regularity assumptions of Section 2. Second, we assume that for all i, t, the functions µ_{i,t}, σ_{i,t} are L_µ- and L_σ-Lipschitz continuous, respectively. This holds if the RKHS of each f_i has a Lipschitz continuous kernel (see Curi et al. (2020), Appendix G). Finally, we assume the causal UCB oracle can always compute UCB^G_t(a, a′) (Eq. (9)) exactly.

In addition, to show how the regret guarantee depends on the specific GP hyperparameters used, we use a notion of model complexity for each node i:

    γ_{i,T} := max_{A_i ⊂ (Z_i × A_i × A′_i)^T} I(x_{i,1:T}; f_i),   (10)

where I is the mutual information and the observations x_{i,1:T} implicitly depend on the GP inputs A_i. This is analogous to the maximum information gain used in the analysis of standard BO algorithms (Srinivas et al., 2010). We also define

    γ_T = max_i γ_{i,T}   (11)

as the worst-case maximum information gain across all nodes in the system. Finally, we define two properties of the causal graph structure that the regret guarantee will depend on. In the DAG G over nodes {X_i}_{i=0}^{m}, let Δ denote the maximum number of parents of any variable in G: Δ = max_i |pa(i)|. Then let N denote the maximum distance from a root node to X_m: N = max_i dist(X_i, X_m), where dist(·, ·) is the number of edges in the longest path from a node X_i to X_m. We can then prove the following theorem on the regret of CBO-MW.

Theorem 1. Fix δ ∈ (0, 1). If actions are played according to CBO-MW with β_t = B + √(2(γ_{t−1} + log(m/δ))) and τ = √((8 log |A|)/T), then with probability at least 1 − δ,

    R(T) = O( √(T log |A|) + √(T log(2/δ)) + (B + √(γ_T + log(m/δ)))^{N+1} N L_σ^N L_f^N m √(T γ_T) ),

where B = max_i B_i and γ_T = max_i γ_{i,T}; that is, γ_T is the worst-case maximum information gain of any of the GP models.

Theorem 1, whose proof is deferred to the appendix, shows that CBO-MW is no-regret for a variety of common kernel functions, for example linear and squared exponential kernels. This is because even when accounting for the dependence of γ_T on T, the bound is still sublinear in T. We discuss the dependence of γ_T on T for specific kernels in Appendix A.1.2.

Comparison with GP-MW. We can use Theorem 1 to demonstrate that the use of graph structure in CBO-MW results in a potentially exponential improvement in the rate of regret compared to GP-MW (Sessa et al., 2019) with respect to the number of action variables m. Consider the graph in Fig. 4b (see Appendix) for illustration. When all X_i in CBO-MW are modeled with squared exponential kernels, we have γ_T = O((Δ + 2)(log T)^{Δ+3}). This results in a cumulative regret that is exponential in Δ and N. Instead, GP-MW uses a single high-dimensional GP (Fig. 4a), implying γ_T = O((log T)^m) for a squared exponential kernel. Note that m ≥ Δ + N, and thus, for several common graphs, the exponential scaling in N and Δ can be significantly more favorable than the exponential scaling in m. Specifically, for the binary tree of Fig. 4b, where N = log(m) and Δ = 2, the cumulative regret of CBO-MW has only polynomial dependence on m. Furthermore, in addition to this favorable scaling of the regret in m, the model class considered by CBO-MW is considerably larger than that of GP-MW, because CBO-MW can model systems where the reward depends on actions through a composition of GPs based on G, rather than a single GP.

6 COMPUTATIONAL CONSIDERATIONS IN LARGER ACTION SPACES

The computational complexity of CBO-MW is dominated by the computation of the counterfactual expected reward estimate for each possible intervention a ∈ A (Line 7 in Algorithm 1). In many situations, even with a large number of action variables m, |A| may still be small and thus CBO-MW feasible to implement. A notable example is when there exist constraints on the possible interventions, such as only being able to intervene on at most a few nodes simultaneously. In the worst case, though, |A| grows exponentially with m and thus CBO-MW may not be computationally feasible. In this section we show how prior knowledge of the problem structure can be used to modify CBO-MW to be computationally efficient even in settings with huge action spaces. We note that the computational efficiency of CBO-MW is not affected by the number of possible adversary interventions |A′|, because these are only observed a-posteriori (in fact, A′ need not be finite).

The general idea consists of decomposing CBO-MW into a decentralized algorithm, which we call D-CBO-MW, that orchestrates multiple sub-agents. First recall that A can be decomposed into m smaller action spaces so that A = A_1 × · · · × A_m. We then consider m independent agents, where each agent i performs CBO-MW but only over the action space A_i. Importantly, the observed actions of the other agents are treated as part of A′ for that agent. Moreover, all agents utilize the same calibrated model M_t at each round. We provide full pseudo-code and theory in the appendix. Our approach is inspired by Sessa et al. (2021), who propose a distributed analog of the GP-MW algorithm. In Appendix B we show that, under specific assumptions on the reward function, D-CBO-MW provably enjoys an approximate version of the guarantees in Theorem 1. Namely, we study a setting where r is a submodular and monotone increasing function of a for any given set of adversary actions. Submodularity is a diminishing-returns property (see the formal definition in the appendix) widely exploited in a variety of domains to derive efficient methods with approximation guarantees; see, e.g., Marden and Wierman (2013); Sessa et al. (2021); Paccagnan and Marden (2022). Similarly, we exploit submodularity in the overall reward's structure to parallelize the computation over the possible interventions in our causal graph; a schematic of one such round is sketched below. In our experiments, we study rebalancing an SMS where CBO-MW is not applicable due to a combinatorial action space, but D-CBO-MW is efficient and achieves good performance.
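As referenced above, here is a structural sketch of one round of the decentralized scheme behind D-CBO-MW. It is our own simplification, reusing the MWLearner sketch from Section 4.2: each agent keeps MW weights over its own sub-action space and updates them with optimistic counterfactual estimates in which the other agents' just-played actions and the adversary's action are held fixed; `causal_ucb` stands in for the shared causal UCB oracle.

```python
import numpy as np

def dcbo_mw_round(learners, causal_ucb, adversary_action):
    """One round of the decentralized scheme of Section 6. `learners` is
    a list of MWLearner instances (one per intervenable node, see the
    sketch in Section 4.2); `causal_ucb(a, a_adv)` is the shared
    optimistic reward estimate for a joint action vector a."""
    joint = np.array([lrn.sample_action() for lrn in learners])
    for i, lrn in enumerate(learners):
        def ucb_i(a_i, a_adv, i=i):
            a = joint.copy()
            a[i] = a_i                 # vary only agent i's sub-action
            return causal_ucb(a, a_adv)
        # The other agents' realized actions play the role of additional
        # adversary inputs for agent i; all agents share the same model.
        lrn.update(ucb_i, adversary_action)
    return joint
```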
Figure 2: CBO-MW achieves sublinear regret and high sample efficiency on 3 function networks in the presence of adversaries. Non-adversarial methods (GP-UCB and MCBO) achieve linear regret and non-causal methods (GP-MW) are less sample efficient. [Panels: Alpine-Penny, Alpine-Perturb, Dropwave-Penny; x-axis: Round, y-axis: Cumulative Regret; methods: CBO-MW (Ours), MCBO, GP-UCB, GP-MW.]

7 EXPERIMENTS

We evaluate CBO-MW and D-CBO-MW on various synthetic problems and on a simulator of rebalancing an SMS based on real data. The goal of the experiments is to understand how the causal modelling and the adaptability to external factors in CBO-MW affect performance compared to other BO methods that are missing one or both of these components. All methods are evaluated over 10 repeats, with mean and standard error reported for different time horizons.

7.1 FUNCTION NETWORKS

Networks. We evaluate CBO-MW on 8 diverse environments. As a base, we take 4 examples of function networks from Astudillo and Frazier (2021). Function networks is a noiseless CBO setup with no adversaries. To study an adversarial setup, we modify each environment by adding adversary inputs in 2 ways: Penny and Perturb. In Penny, an adversary can affect a key node with a dynamic similar to the classic matching pennies game. In Perturb, the adversary can perturb some of the agent's interventions. The exact way in which the adversary's actions affect the environment is unknown, and the actions themselves can only be observed a-posteriori. In each round, the adversary has a 20% chance to play randomly and an 80% chance to try to minimize the agent's reward, using full knowledge of the agent's strategy and the environment. The causal graph and structural equation model for each environment are given in Appendix C.

Baselines. We compare the performance of CBO-MW (Algorithm 1) with GP-MW (Sessa et al., 2019), which does not exploit the causal structure. We additionally compare against the non-adversarial baselines GP-UCB (Srinivas et al., 2010) and MCBO (Sussex et al., 2022), which uses a causal model but cannot account for adversaries.

Results. We give results from 3 of the 8 environments in Fig. 2; the others can be found in Appendix C. In general, across the 8 environments, MCBO and GP-UCB obtain linear regret, while the regret of GP-MW and CBO-MW grows sublinearly, consistent with Theorem 1. CBO-MW has the strongest or joint-strongest performance on 7 out of the 8 environments. The setting where CBO-MW is not strongest involves a dense graph where the worst-case GP sparsity is the same as in GP-MW, which is consistent with our theory. In settings such as Alpine, where the graph is highly sparse, we observe the greatest improvements from using CBO-MW.

7.2 LEARNING TO REBALANCE SHARED MOBILITY SYSTEMS (SMSS)

We evaluate D-CBO-MW on the task of rebalancing an SMS, a setting introduced in Sessa et al. (2021). The goal is to allocate bikes to city locations (using relocation trucks, overnight) to maximize the number of bike trips in the subsequent day.
The action space is combinatorial because each of the 5 trucks can independently reallocate units to a new depot. This makes the approaches studied in Section 7.1 computationally infeasible, so we deploy D-CBO-MW. We expect the reward Y_t (total trips given the unit allocation) to be monotone submodular, because adding an extra unit to a depot should increase total trips but with decreasing marginal benefit, as discussed in Sessa et al. (2021). The goal is to compare D-CBO-MW to a distributed version of GP-MW and to understand whether the use of a causal graph can improve sample efficiency.

Figure 3: Learning to rebalance an SMS. (a) Green dots represent demand for trips, black crosses are depot locations, and blue circles are scaled in size to the frequency with which D-CBO-MW assigns bikes to that depot (averaged across all repeats and days). D-CBO-MW assigns bikes to strategic locations near high demand. (b) D-CBO-MW fulfills more total trips across over 200 days compared to D-GP-MW, which does not use a causal reward model. [Panel (b) axes: Day vs. Avg. Daily Trips (Normalized); curves: D-CBO-MW (Ours), Random, D-GP-MW.]

Setup and trips simulator. A simulator is constructed using historical data from an SMS in Louisville, KY (Lou, 2021). Before each day t, all 40 bikes in the system are redistributed across 116 depots. This is done by 5 trucks, each of which can redistribute 8 bikes at one depot, meaning that truck i's action is a_i ∈ [116]. The demand for trips corresponds to real trip data from Lou (2021). After each day, we observe weather and demand data from the previous day, which are highly non-stationary and thus correspond to adversarial interventions a′ according to our model. Our simulator is identical to the one of Sessa et al. (2021), except that we exclude weekends for simplicity. We give more details in Appendix C.

We compare three methods on the SMS simulator. First, in RANDOM, each truck places its bikes at a depot chosen uniformly at random. Second, D-GP-MW modifies GP-MW using the same ideas as those presented in Section 6. It is a special case of D-CBO-MW using the graph in Fig. 1(a); that is, a single GP is used to predict Y_t given a_t, a′_t. Finally, we evaluate D-CBO-MW, which utilizes a more structured causal graph exploiting the reward structure. Based on historical data, we cluster the depots into regions such that trips don't frequently occur across two different regions (e.g., when such regions are far apart). Then, our graph models the trips starting in each region using only the bike allocations to depots in that region; Y_t is computed by summing the trips across all regions (see Fig. 8(b) in the appendix for an illustration). This system model uses many low-dimensional GPs instead of the single high-dimensional GP used by D-GP-MW.

Results. Fig. 3(a) displays the allocation strategy of D-CBO-MW. We observe that D-CBO-MW learns the underlying demand patterns and allocates bikes in strategic areas where the demand (green dots) is higher. This plot is zoomed in for readability; the allocation strategy over all depots is shown in Appendix Fig. 9. Moreover, in Fig. 3(b) we see that D-CBO-MW is significantly more sample efficient than D-GP-MW. This improvement is largely attributed to the early rounds, where D-CBO-MW learns the demand patterns much faster than D-GP-MW because it learns smaller-dimensional GPs.
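To illustrate the regional reward decomposition used by D-CBO-MW on the SMS task (Fig. 8(b)), here is a schematic sketch (ours, with placeholder per-region models): trips are predicted per region from only that region's depot allocation and then summed to obtain Y_t.

```python
import numpy as np

def predicted_total_trips(allocation, region_of_depot, region_models):
    """Schematic of the D-CBO-MW reward graph for the SMS task: trips
    are predicted per region from only that region's depot allocation,
    then summed to give Y (Fig. 8(b)). `region_models` maps a region id
    to any low-dimensional per-region predictor (e.g. a GP posterior
    mean); here they are linear placeholders."""
    total = 0.0
    for r in np.unique(region_of_depot):
        alloc_r = allocation[region_of_depot == r]   # bikes in region r
        total += region_models[r](alloc_r)           # regional trips X_r
    return total                                     # Y = sum over regions

# Toy example: 6 depots split into 2 regions, placeholder models.
allocation = np.array([8.0, 0.0, 8.0, 8.0, 0.0, 16.0])
region_of_depot = np.array([0, 0, 0, 1, 1, 1])
region_models = {0: lambda a: 0.5 * a.sum(), 1: lambda a: 0.3 * a.sum()}
print(predicted_total_trips(allocation, region_of_depot, region_models))
```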
8 CONCLUSION

We introduce CBO-MW, the first principled approach to causal Bayesian optimization in non-stationary and potentially multi-agent environments. We prove a sublinear regret guarantee for CBO-MW and demonstrate a potentially exponential improvement in regret, in terms of the number of possible intervention targets, compared to state-of-the-art methods. We further propose a distributed version of our approach, D-CBO-MW, that can scale to large action spaces and achieves approximate regret guarantees when rewards are monotone submodular. Empirically, our algorithms outperform existing non-causal and non-adversarial methods on synthetic function-network tasks and on an SMS rebalancing simulator based on real data.

REPRODUCIBILITY STATEMENT

Attached to this submission we include code (https://github.com/ssethz/acbo) that implements CBO-MW, D-CBO-MW and all baselines seen in the experiments. It includes code for all function-network environments and the SMS setting, which are also detailed in the appendix. All datasets used in the SMS setting are also included at this URL. The appendix includes information on our experimental protocol, for example how we selected the hyperparameters for the experiments shown in the paper. All technical assumptions are given in the main paper, and complete proofs of all theoretical results are given in the appendix. Pseudocode for CBO-MW is given in the main paper and pseudocode for D-CBO-MW is given in the appendix.

ACKNOWLEDGEMENTS

We thank Lars Lorch for his feedback on the draft of this paper. This research was supported by the Swiss National Science Foundation under NCCR Automation, grant agreement 51NF40 180545, by the European Research Council (ERC) under the European Union's Horizon grant 815943, and by ELSA (European Lighthouse on Secure and Safe AI) funded by the European Union under grant agreement No. 101070617.

REFERENCES

Jonas Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques, 1975.

Virginia Aglietti, Xiaoyu Lu, Andrei Paleyes, and Javier González. Causal Bayesian Optimization. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.

Scott Sussex, Anastasia Makarova, and Andreas Krause. Model-based causal Bayesian optimization. In International Conference on Learning Representations (ICLR), 2022.

Frederick Eberhardt and Richard Scheines. Interventions and causal inference. Philosophy of Science, 74(5):981-995, 2007.

Jiaqi Zhang, Chandler Squires, and Caroline Uhler. Matching a desired causal state via shift interventions. Advances in Neural Information Processing Systems, 34:19923-19934, 2021.

Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the International Conference on Machine Learning (ICML), 2010.

Raul Astudillo and Peter I. Frazier. Bayesian Optimization of Function Networks. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

Finnian Lattimore, Tor Lattimore, and Mark D. Reid. Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems (NeurIPS), 2016.

Burak Varici, Karthikeyan Shanmugam, Prasanna Sattigeri, and Ali Tajer. Causal bandits for linear structural equation models. arXiv preprint arXiv:2208.12764, 2022.

Ilija Bogunovic, Jonathan Scarlett, Stefanie Jegelka, and Volkan Cevher. Adversarially robust optimization with Gaussian processes. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Johannes Kirschner, Ilija Bogunovic, Stefanie Jegelka, and Andreas Krause.
Distributionally robust Bayesian optimization. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.

Thanh Nguyen, Sunil Gupta, Huong Ha, Santu Rana, and Svetha Venkatesh. Distributionally robust Bayesian quadrature optimization. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.

Pier Giuseppe Sessa, Ilija Bogunovic, Maryam Kamgarpour, and Andreas Krause. Mixed strategies for robust optimization of unknown objectives. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2020a.

Anastasia Makarova, Ilnura Usmanova, Ilija Bogunovic, and Andreas Krause. Risk-averse heteroscedastic Bayesian optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

Shogo Iwazaki, Yu Inatsu, and Ichiro Takeuchi. Mean-variance analysis in Bayesian optimization under uncertainty. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2021.

Virginia Aglietti, Neil Dhir, Javier González, and Theodoros Damoulas. Dynamic causal Bayesian optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

Pier Giuseppe Sessa, Ilija Bogunovic, Maryam Kamgarpour, and Andreas Krause. No-regret learning in unknown games with correlated payoffs. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Pier Giuseppe Sessa, Ilija Bogunovic, Andreas Krause, and Maryam Kamgarpour. Online submodular resource allocation with applications to rebalancing shared mobility systems. In Proceedings of the International Conference on Machine Learning (ICML), 2021.

Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212-261, 1994.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Sebastian Curi, Felix Berkenkamp, and Andreas Krause. Efficient model-based reinforcement learning through optimistic policy search and planning. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Jason R. Marden and Adam Wierman. Distributed welfare games. Operations Research, 61(1):155-168, 2013.

Dario Paccagnan and Jason R. Marden. Utility design for distributed resource allocation, Part II: Applications to submodular, covering, and supermodular problems. IEEE Transactions on Automatic Control, 67:618-632, 2022.

Louisville Kentucky Open Data, 2021. URL https://data.louisvilleky.gov/dataset/dockless-vehicles.

Pier Giuseppe Sessa, Ilija Bogunovic, Maryam Kamgarpour, and Andreas Krause. Learning to play sequential games versus unknown opponents. In Advances in Neural Information Processing Systems (NeurIPS), 2020b.

Pier Giuseppe Sessa, Ilija Bogunovic, Andreas Krause, and Maryam Kamgarpour. Contextual games: Multi-agent learning with side information. Advances in Neural Information Processing Systems (NeurIPS), 33:21912-21922, 2020c.

An Bian, Kfir Levy, Andreas Krause, and Joachim M. Buhmann. Continuous DR-submodular maximization: Structure and algorithms. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017.

Francis Bach. Submodular functions: from discrete to continuous domains. Mathematical Programming, 175:419-459, 2019.

Local Weather Forecast, News and Conditions | Weather Underground. URL https://www.wunderground.com/.
Published as a conference paper at ICLR 2024 ADVERSARIAL CAUSAL BAYESIAN OPTIMIZATION A.1 PROOF OF THEOREM 1 We start by proving that there exists a sequence of βt that allow Assumption 1 to hold. Lemma 1 (Node Confidence Lemma ). Let Hki be a RKHS with underlying kernel function ki. Consider an unknown function fi : Zi A i A i Xi in Hki such that f ki Bi, and the sampling model xi,t = f(zi,t, ai,t, a i,t)+ωt where ωt is b-sub-Gaussian (with independence between times). By setting βt = Bi + b p 2 (γt 1 + log(m/δ)) , the following holds with probability at least 1 δ: |µt 1(zi, ai, ai ) fi(zi, ai, a i)| βtσi,t 1(zi, ai, a i) , zi, ai, a i Z i Ai A i, t 1, i [m], where µt 1( ) and σt 1( ) are given in Eq. (5), Eq. (6) and γt 1 is the maximum information gain defined in (11). Proof. Lemma 1 follows directly from Sessa et al. (2019, Lemma 1) after applying a union bound so that the statement holds for all m GP models in our system. Now we give an upper and lower confidence bound for r at each t. Note that since in our setup all ωi are assumed bounded in [0, 1], we can use b = 1 Lemma 2 (Reward Confidence Lemma). Choose δ [0, 1] and then choose the sequence {βt}T t=1 according to Lemma 1. a, ai A A , t we have with probability 1 δ UCBG t (a, a ) Ct(δ) r(a, a ) UCBG t (a, a t) where we define Ct(δ) as Ct(δ) = LY,t N v u u tm Eω σi,t 1(zi,t, ai,t, a i,t) 2 # We define LY,t = 2βt(1 + Lf + 2βt Lσ)N. Proof. The RHS follows firstly from Assumption 1 meaning that with probability at least 1 δ, f { ft}. The RHS then follows directly from the definition of UCBG in Eq. (7). The LHS follows directly from Sussex et al. (2022, Lemma 4). The addition of the adversary actions does not change the analysis in this lemma because for a given t, a t is fixed for both the true model f and the model in { f} that leads to the upper confidence bound. Now we can prove the main theorem. Choose δ [0, 1] and then choose the sequence {βt}T t=1 according to Lemma 1 so that Assumption 1 holds with probability 1 δ 2. First recall that regret is defined as t=1 r( a, a t) t=1 r(at, a t) where a = arg max PT t=1 r( a, a t). Now we can say that with probability at least 1 δ Published as a conference paper at ICLR 2024 t=1 min{1, UCBG t ( a, a t)} UCBG t (at, a t) Ct (δ/2) (12) t=1 min{1, UCBG t ( a, a t)} t=1 UCBG t (at, a t) + t=1 Ct (δ/2) (13) where the first line follows from Lemma 2. Evaluating the last term we see t=1 Ct (δ/2) t=1 Ct (δ/2)2 (14) t=1 L2 Y,t 2NEω σi,t 1(zi,t, ai,t, a i,t) 2 # 3 = O LY,T Np TmγT . (16) 1 comes from AM-QM inequality and 2 comes from plugging in for Ct( δ 2). Finally, 3 comes from LT,t being weakly increasing in t and from using the same analysis as (Sussex et al., 2022, Theorem 1) to upper bound the sum of σs with γT . Moreover, we can upper bound the first two terms of Eq. (13) using a standard regret bound for the MW update rule (Line 9 in CBO-MW), e.g. from (Cesa-Bianchi and Lugosi, 2006, Corollary 4.2). Indeed, with probability 1 δ 2 it holds: t=1 min{1, UCBG( a, a t)} t=1 UCBG(at, a t) = O p T log |A| + p T log(2/δ) . (17) Using the union bound on the two different probabilistic events discussed so far (Assumption 1 and Eq. (17)) we can say that with probability at least 1 δ T log |A| + p T log(2/δ) + LY,T Np Substituting in for LY,t and βt gives the result of Theorem 1. A.1.1 STACKELUCB AS A SPECIAL CASE In Fig. 4 and Section 5 we give a comparison between GP-MW and CBO-MW, and see that GP-MW can be seen as a special case of CBO-MW with a one-node causal graph. STACKELUCB Sessa et al. 
(2020b) is another online learning algorithm that can be seen as a special case of CBO-MW. We show the graph in Fig. 5. In STACKELUCB an agent plays an unknown Stackelberg game against a changing opponent which at time t has representation a 0,t. After the agent selects an action a0,t, the opponent sees this action and responds based upon a fixed but unknown response mechanism for that opponent: their response is X0,t = f0(a 0,t, a0,t). Then, the response, game outcome Yt, and opponent identity are revealed to the agent. The sequence of opponent identities, i.e. {a 0,t}T t=1, can be selected potentially adversarially based on knowledge of the agent s strategy. A.1.2 DEPENDENCE OF βT ON T FOR PARTICULAR KERNELS In Theorem 1, there is a dependence of the bound on γT . If γT scales with at worst O T , then the overall bound will not be in T, resulting in CBO-MW being no-regret. γT will depend on the Published as a conference paper at ICLR 2024 a2 a3 a1 a0 a 2 a 3 a 1 a 0 X2 X3 X1 X0 Figure 4: An example of a binary tree graph to compare GP-MW and CBO-MW. In (a) we see GP-MW ignores all observations and models Y given all agent and adversary actions, resulting a single high-dimensional GP. Meanwhile in (b), for the same task, CBO-MW fits a model for every observation resulting in multiple low-dimensional GP models. Figure 5: The causal graph corresponding to STACKELUCB, a special case of our setup. The opponent action X0 is played based on the opponent type a 0 and as a response to the agent s action a0. The reward Y depends on the agent and opponent s action. kernel of the GP used to model each system component. For simplicity, in this section we ll assume that the same kernel is used for modelling all nodes, though if different kernels are used one just needs to consider the most complex (worst scaling of γT with T) For γT corresponding to the linear kernel and squared exponential kernel, we have sublinear regret regardless of N because γT will be only logarithmic in T. In particular, a linear kernel leads to γT = O (( + 1) log T) and a squared exponential kernel leads to γT = O ( + 1)(log T) +2 . However, for a Matérn kernel, where the best known bound is γT = O ((T)clog(T)) with hyperparameter 0 < c < 1, the cumulative regret bound will not be sublinear if N and c are sufficiently large. A similar phenomena with the Matérn kernel appears in the guarantees of Curi et al. (2020) which use GP models in model-based reinforcement learning and in Sussex et al. (2022). A.2 INCORPORATING CONTEXTS INTO CBO-MW Our approach can be easily extended to a setting where a t, the adversary actions at time t, are observed before the agent chooses at. In this setting the adversary actions could be thought of as contexts. Our method for computing UCBG can be plugged into the algorithm of Sessa et al. (2020c). Their algorithm maintains a separate set of weights for every context unless two contexts are close correspond to a similar expected reward for all possible at. Published as a conference paper at ICLR 2024 Algorithm 3 Multiplicative Weights Update (MWU) Require: set Ai where |Ai| = Ki, learning rate η 1: Initialize w1 = 1 Ki (1, . . . 
, 1) RKi 2: function PLAY_ACTION 3: p = w 1/ PKi j=1 w[j] 4: a p 5: return a // sample action 6: end function 7: 8: function UPDATE(f( )) 9: f = min(1, [f(a)]a Ai) RK // rewards vector 10: w = w exp (ηf) // MW update 11: return 12: end function Algorithm 4 Distributed Causal Bayesian Optimization Multiplicative Weights (D-CBO-MW) Require: parameters τ, {βt}t 1, G, Ω, kernel functions ki and prior means µi,0 = 0, i [m] 1: Algoi MWU(Ai), i = 1, . . . , m, Algorithm 3 // initialize agents 2: for t = 1, 2, . . . do 3: Algoi.PLAY_ACTION(), i = 1, . . . , m // sample actions and perform interventions 4: Observe samples {zi,t, xi,t, a i,t}m i=0 5: Update posteriors {µi,t( ), σ2 i,t( )}m i=0 6: for i 1, . . . , m do 7: for ai Ai do 8: Compute ucbi,t(ai) = UCBG t (ai, a i,t, a t) for ai Ai using Algorithm 2 9: end for 10: Algoi.UPDATE(ucbi,t( )), i = 1, . . . , m 11: end for 12: end for B AN EFFICIENT VARIANT OF CBO-MW FOR LARGE ACTION SPACES B.1 THE DISTRIBUTED CBO-MW (D-CBO-MW) ALGORITHM For ease of readability in the pseudocode we pull-out some key functionality of the MW algorithm, which we put into a class called MWU described in Algorithm 3. Each agent is an instance of this class, i.e., each agent maintains its own set of weights over its actions, and updates to these instances are coordinated by D-CBO-MW as described in Algorithm 4. Conceptually it can be thought of as m instances of CBO-MW, where the action spaces of other agents are absorbed into A . While for presenting the algorithm we always decompose A as A1 Am and have each agent control the intervention on one node, one could in practice choose to decompose the action space in a different way. A simple example would be having agent one control A1 A2 and thus designing the intervention for both X1, X2 When computing the UCBG for a single agent i [m], we will find the following notation convenient. Let ai,t Ai be the action chosen by agent i at time t. a i,t A i A are the actions chosen at time t by all agents except agent i. Note that since the subspaces of the action space each agent controls are non-overlapping, A = Ai A i. When agent i chooses actions in Ai, for convenience we will from now on represent it as a vector in A with 0 at all indexes the agent cannot intervene. We do similarly for a i,t. Then we use the notation UCBG t (ai,t, a i,t, a t) = UCBG t (ai,t + a j,t, a t). B.2 APPROXIMATION GUARANTEES FOR MONOTONE SUBMODULAR REWARDS In this section we show that D-CBO-MW enjoys provable approximation guarantees in case of monotone and DR-submodular rewards. Both such notions are defined below. Published as a conference paper at ICLR 2024 Definition 1 (Monotone Function). A function f : A Rm R is monotone if for x y, Definition 2 (DR-Submodularity, (Bian et al., 2017)). A function f : A Rm R is DRsubmodular if, for all x y A, i [m], k 0 such that (x + kei) and (y + kei) A, f(x + kei) f(x) f(y + kei) f(y). DR-submodularity is a generalization of the more common notion of a submodular set function to continuous domains Bach (2019). For our analysis, in particular, we assume that for every a :,t A , the reward function r(a:,t, a :,t) is monotone DR-submodular in a:,t. We consider ACBO as a game played among all the distributed agents and the adversary. Our guarantees are then based on the results of Sessa et al. (2021) and depend on the following notions of average and worst-case game curvature. Definition 3 (Game Curvature). Consider a sequence of adversary actions a 1, . . . , a T . 
We define the average and worst-case game curvature as cavg({a :,t}T t=1) = 1 inf i PT t=1[ r(2amax, a :,t)]i PT t=1[ r(0, a :,t)]i [0, 1], cwc({a :,t}T t=1) = 1 inf t,i [ r(2amax, a :,t)]i [ r(0, a :,t)]i [0, 1], where amax = amax1 and amax is the largest intervention value a single agent can assign at any index. If 2amax is outside the domain of r, then the definition can be applied to a monotone extension of r over Rm. A definition of game curvature for non-differentiable r is given in the appendix of Sessa et al. (2021). Curvature measures how close r( , a :,t) is to being linear in the agents actions, with cavg = cwc = 0 coinciding with a completely linear function. The closer r( , a :,t) is to linear, generally the easier the function is to optimize in a distributed way, because a linear function is separable in its inputs. cwc is the worst-case curvature of r( , a :,t) across all rounds t, while cavg is the average curvature across rounds. The curvature of r( , a :,t) will change with t because a :,t will change across rounds. We will without loss of generality assume that r(0, a ) = 0 for all a . If this did not hold then r(0, a ) could simply be subtracted from all observations to make it true. Then, we can show the following. Theorem 2. Consider the setting of Theorem 1 but with the additional monotone submodularity and curvature assumptions made in this section. Assume |Ai| = K for all i. Then, if actions are played according to D-CBO-MW with βt = O B + p γt 1 + log (m/δ) and η = p 8 log(K)/T then with probability at least 1 δ, t=1 r(at, a t) α OPT m O B + p γT + log(m/δ) N+1 NLN σ LN f m p T log K + p T log(2m/δ) , with α = max n 1 cavg({a :,t}T t=1), 1 + cwc({a :,t}T t=1) 1o and OPT = maxa A PT t=1 r(a:,t, a :,t) is the expected reward achieved by playing the best fixed interventions in hindsight. B and γT are defined the same as in Theorem 1. Proof. We will find useful the notation of the regret of an individual agent i [m]. We will consider the regret of each agent to be not in terms of the reward r but in terms of the feedback that agent receives: the UCB. We therefore define Ri(T) = max a t=1 UCBG t (a, a i,t, a t) t=1 UCBG t (ai,t, a i,t, a t). Published as a conference paper at ICLR 2024 It can be thought of as the regret of agent i compared to the best fixed action in hindsight, given that the actions of all other agents are fixed, and the agent is trying to maximize the sum of UCBs. Using the above definitions, and provided that UCBG t ( ) are a valid upper confidence bound functions on the reward (according to Lemma 2), we can directly apply (Sessa et al., 2021, Theorem 1). This shows that with probability 1 δ 2, the overall reward obtained by D-CBO-MW is lower bounded by t=1 r(at, a t) α OPT m where α is defined in Theorem 2 and Ct(δ/2) is as defined in Lemma 2. We note that Sessa et al. (2021) stated their theorem for the case where the UCB was computed using a single GP model (our setting with only a reward node and parent actions), however the proof is identical when any method to compute the UCB is used with an accompanying confidence bound in the form of Lemma 2. Then, we can obtain the bound in the theorem statement by further bounding the agents regrets Ri(T). Indeed, because each agent updates its strategy (Line 10 in Algorithm 4) using the MW rule (Algorithm 3), we can bound each Ri(T) with probability 1 δ 2m using Eq. (17). We can also substitute in for Ct(δ/2) using Lemma 2. Via a union bound we get our final result with probability 1 δ. 
C EXPERIMENTS C.1 FUNCTION NETWORKS We evaluate CBO-MW on 8 environments that have been modified from existing function networks Astudillo and Frazier (2021). These have been previously used to study CBO algorithms Sussex et al. (2022). We modify each environment in two ways (Perturb and Penny) in order to incorporate an adversary, resulting in 8 environments total. For all environments we tried to make the fewest modifications possible when incorporating the adversary in a way that made the game nontrivial while maintaining the spirit of the original function network. Perturb environments allow the adversary to modify some of the input actions of the agents. Penny environments incorporate some element of the classic matching pennies game into the environment. If node Xi has an adversary action parent, part of the mechanism for generating Xi will involve multiplying another parent by the adversary action. Because the adversary can select negative actions (see below), this results in a dynamic similar to matching pennies. Throughout the setup and theory we assume that there is one action per node for simplicity. For many of the function networks experiments there may be 0 or more than one action per node. Similar theoretical results for this case can also be shown using our analysis. In all environments we map the discrete action space [0, K 1] to the continuous domain of the original function network. Using a as the continuous action at a single node and a as the discrete action input at that node, we always use mapping a = a K 1 0.5 C1 + C2, where C1, C2 defines some environment specific linear map that usually maps to the input space of the original function network in Astudillo and Frazier (2021). We use the same mapping for adversary actions in Perturb environments For the adversary s actions in Penny environments, we use a more complicated mapping from a to a to ensure that the adversary normally cannot select 0, which would result in a trivial game that the adversary can solve by always playing a = 0. If K is even we use K 1(1 2ϵ)ϵ 0.5 C1 + C2, where ϵ = 0.05. If K is odd we use a = a + 0.5 K 1 (1 2ϵ)ϵ 0.5 C1 + C2, where ϵ = 0.05. Published as a conference paper at ICLR 2024 Dropwave-Penny Dropwave-Perturb Alpine-Penny a0 a1 a 2 a3 Alpine-Perturb a0 a1 a2 a3 a 0 a 1 a 2 Rosenbrock-Penny a0 a1 a2 a3 Rosenbrock-Perturb a0 a1 a2 a3 Ackley-Penny a0 a1 a2 a3 Ackley-Perturb a0 a 0 a1 a 1 a2 a3 Figure 6: The DAGs corresponding to each task in the function networks experiments. DAGs illustrating the casual structure of each environment are shown in Fig. 6. For some environments, we made the number of nodes smaller than the environment s counterpart in Astudillo and Frazier (2021) for computational reasons. We give the full SCM for each environment at the end of this subsection. To select hyperparamaters for each method we perform a small hyperparameter sweep to select the best hyperparameters, before fixing the best hyperparameters and running 10 repeats on fresh seeds. On all repeats the agent is initialized with 2m + 1 samples of uniformly randomly sampled a and a , where m is the number of action nodes. Fig. 7 shows the regret plots for environments not already shown in the main paper. As discussed, we find that standard non-adversarial BO algorithms often fail to achieve sublinear regret and that CBOMW is often the most sample efficient compared to GP-MW. Only on Ackley-Penny is GP-MW obtaining a lower regret. This can be understood by our theory. Ackley-Penny is not at all sparse. 
Figure 7: Regret plots for the function networks not already shown in Fig. 2. Each panel shows cumulative regret over 100 rounds on Dropwave-Perturb, Rosenbrock-Penny, Rosenbrock-Perturb, Ackley-Penny, and Ackley-Perturb for CBO-MW (Ours), MCBO, GP-UCB, and GP-MW.

Here we systematically list the SCM for each environment.

Dropwave-Penny. $a \in [0, 2]^2$, $a' \in [-1, 1]^1$. We have
$$Y = \frac{\cos(3X_0)}{2 + 0.5 X_0^2}\, a'_0.$$
This is the original Dropwave from Astudillo and Frazier (2021) but with a modified input space and a matching-pennies dynamic on the final node.

Dropwave-Perturb. $a \in [-10.24, 10.24]^2$, $a' \in [-10.24/5, 10.24/5]^1$. Like many Perturb systems, the adversary has a smaller domain than the agent to prevent it from being too strong. We have
$$Y = \frac{\cos(3X_0)}{2 + 0.5 X_0^2}.$$
This is the original Dropwave from Astudillo and Frazier (2021) but with one of the actions being perturbed by the adversary.

Alpine-Penny. $a \in [0, 10]^4$, $a' \in [1, 11]^1$. We have $X_0 = a_0 \sin(a_0)$. For nodes $X_i$ with an adversary action parent we have $X_i = a'_i \sin(a'_i) X_{i-1}$, and for nodes influenced by the agent we have $X_i = a_i \sin(a_i) X_{i-1}$. This is the original Alpine2 from Astudillo and Frazier (2021) but with an adversary controlling one of the nodes instead of the agent. We shift the adversary's action space so that it cannot zero the output with a fixed action.

Alpine-Perturb. $a \in [0, 10]^4$, $a' \in [0, 2]^3$. Let $\bar{a}_i = a_i + a'_i$ when $X_i$ has an adversarial action input, and $\bar{a}_i = a_i$ otherwise. We have $X_0 = \bar{a}_0 \sin(\bar{a}_0)$ and $X_i = \bar{a}_i \sin(\bar{a}_i) X_{i-1}$.

Rosenbrock-Penny. $a \in [0, 1]^4$, $a' \in [0, 1]^2$. We have
$$X_0 = -100(a_1 - a_0^2)^2 - (1 - a_0)^2 + 10, \qquad X_i = -100(a_{i+1} - a_i^2)^2 - (1 - a_i)^2 + 10 + X_{i-1}\, \bar{a}'_i,$$
where $\bar{a}'_i = 1$ if there is no adversary over node $i$ and otherwise $\bar{a}'_i = a'_i$.

Rosenbrock-Perturb. $a \in [-2, 2]^4$, $a' \in [-1, 1]^2$. Let $\bar{a}_i = a_i + a'_i$ when $X_i$ has an adversarial action input, and $\bar{a}_i = a_i$ otherwise. We have
$$X_0 = -100(\bar{a}_1 - \bar{a}_0^2)^2 - (1 - \bar{a}_0)^2 + 10, \qquad X_i = -100(\bar{a}_{i+1} - \bar{a}_i^2)^2 - (1 - \bar{a}_i)^2 + 10 + X_{i-1}.$$

Ackley-Penny. $a \in [-2, 2]^4$, $a' \in [-1, 1]^1$. Let $\bar{a}_i = a_i + a'_i$ when $X_i$ has an adversarial action input, and $\bar{a}_i = a_i$ otherwise. We have $X_1 = \sum_i \cos(2\pi \bar{a}_i)$ and
$$Y = 20\exp(-0.2\sqrt{X_0}) + e^{X_1}.$$

Ackley-Perturb. $a \in [-2, 2]^4$, $a' \in [-1, 1]^2$. We have $X_1 = \sum_i \cos(2\pi a_i)$ and
$$Y = 20\, a'_0 \exp(-0.2\sqrt{X_0}) + e^{X_1}.$$
This is the original Ackley from Astudillo and Frazier (2021) but with a matching-pennies dynamic on the final node.
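As an illustration of how these chain-structured SCMs compose, here is a minimal sketch of the Alpine-Penny mechanisms as listed above, with the noise variables omitted. The index of the adversary-controlled node is an illustrative assumption, since only its existence is specified above.

```python
import numpy as np

def alpine_penny(a, a_adv, adversary_node=2):
    """Simulate the Alpine-Penny chain X_0 -> X_1 -> ... -> Y (noise omitted).

    a:              agent actions in [0, 10], one per agent-controlled node
    a_adv:          the single adversary action (domain shifted to [1, 11])
    adversary_node: index of the adversary-controlled node (illustrative choice)
    """
    x = a[0] * np.sin(a[0])                 # root node X_0 = a_0 sin(a_0)
    agent_idx = 1
    n_nodes = len(a) + 1                    # agent nodes plus one adversary node
    for i in range(1, n_nodes):
        if i == adversary_node:
            x = a_adv * np.sin(a_adv) * x   # adversary-controlled mechanism
        else:
            x = a[agent_idx] * np.sin(a[agent_idx]) * x
            agent_idx += 1
    return x                                # reward Y is the final node in the chain

# Example: 4 agent actions in [0, 10] and an adversary action in [1, 11].
y = alpine_penny(a=np.array([2.0, 5.0, 7.5, 9.0]), a_adv=4.0)
```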
C.2 SHARED MOBILITY SYSTEM

We use the same SMS simulator as Sessa et al. (2021) and largely refer to that work for simulator details, unless otherwise specified. The simulator uses real demand data amalgamated from several SMS sources in Louisville, Kentucky (Lou, 2021). We treat the system as a single SMS in which all transport units are identical. A single trip taken in the Louisville data at a specific time is treated as a single unit of demand in the simulator. The demand is fulfilled if the location of the demand (where the trip started in the dataset) is within a certain distance (0.8 km Euclidean distance) of a depot containing at least one bike. If the demand for a trip is satisfied, a single trip starting from the depot is counted and the bike is transported to the depot nearest to the end location of the trip in the dataset, after removing the bike from the system for 30 minutes to simulate the trip having non-zero duration. The simulator's use of real trip data means that the geographical and temporal distribution of demand, and its relation to the weather, is realistic.

The depots are not in the original trip data but are constructed from the trip data using a procedure similar to Sessa et al. (2021). The start location of every trip taken over the year is clustered via k-means, and clusters that are very close together are removed. This leaves 116 depots where bikes can be left. We consider a system with 40 bikes, which are distributed initially by 5 trucks, each of which places all of its bikes at a single depot. We further allocate depots to regions. These are constructed using trip data across the whole year and a heuristic that clusters depots into regions such that there is a low chance that any given trip starts in one region and ends in another. As shown in Fig. 8, this leads to nearby depots often being in the same region, which is reasonable. We obtain 15 regions R1 to R15.

Figure 8: (a) The allocation of depots to regions. Depots of the same colour belong to the same region. (b) For our empirical evaluation of D-CBO-MW we use a graph that computes trips for each region and then sums these to get total trips. Here we show a simplified version for 2 regions R1 and R2. The total trips in the first region $X^{R_1}$ depend only on the number of bikes allocated to each depot in R1, given by $X^{R_1}_1, \ldots, X^{R_1}_{|R_1|}$. This reduces the dimensionality of the GPs used in the model. For simplicity we have removed the agent's action nodes from the graph, since the relationship between actions and bike allocations is a fixed known mapping.

Agent action $a_i$ is a one-hot vector of length 116 indicating the depot at which truck $i$'s bikes are placed. We obtain 3 measurements for $a'_t$ at the end of each day $t$: the day's average temperature, rainfall, and total demand (including unmet demand). These are part of $a'$ because they are out of our agent's control and are not sampled i.i.d. across days; the agent must adaptively respond to these observations over time. Weather data is the real weather from that day, obtained from Loc. Observations $x^r_i$ give the number of bikes at day start in depot $i$ within region $r$, and $x^r$ is the total number of fulfilled trips that started in region $r$. The reward $Y$ is then the total number of trips in a day. All observations are normalized to lie in [0, 1].

In Sessa et al. (2021), two separate GPs are used to model weekday and weekend demand. For simplicity we use a simulator that skips weekends, and therefore we do not need a separate model for the two types of day. No matter the algorithm used, the first 10 days of the year use the RANDOM strategy to gather exploratory data that initializes the GP models for the $x^r$. The graph used by D-CBO-MW is given in Fig. 8(b). The relationship between the bike allocations and the bike distribution at day start $\{X^r_:\}_{r=R_1}^{R_{15}}$ is a fixed known function. The mechanism mapping the starting bike distribution in region $r$ ($X^r_:$) and the adversary actions $a'$ (weather and demand) to the total trips in region $r$ over the day ($X^r$) is an unknown function that must be learnt for each $r$. The relationship between total trips $Y$ and its parents is known (a sum over parents).
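To make the fixed, known part of the graph concrete, here is a minimal sketch of how the trucks' one-hot depot choices can be mapped to the per-depot and per-region bike counts that feed the regional GPs. The equal bikes-per-truck split and the function names are illustrative assumptions; the paper only fixes the totals (40 bikes, 5 trucks, 116 depots).

```python
import numpy as np

def bike_allocation(truck_actions, n_depots=116, bikes_per_truck=8):
    """Map each truck's one-hot depot choice to a per-depot bike count.

    truck_actions: list of one-hot vectors of length n_depots, one per truck
                   (with 5 trucks and 40 bikes, an even split gives 8 bikes per
                   truck -- an illustrative assumption)
    """
    counts = np.zeros(n_depots)
    for action in truck_actions:
        counts += bikes_per_truck * np.asarray(action)
    return counts

def region_features(depot_counts, depot_to_region):
    """Group per-depot counts into the per-region vectors X^r_: used as GP inputs.

    depot_to_region: array giving the region index of each depot
                     (the k-means/heuristic assignment described above)
    """
    regions = np.unique(depot_to_region)
    return {r: depot_counts[depot_to_region == r] for r in regions}
```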
For this kind of graph the output of the Causal UCB Oracle (Algorithm 2) will always set η = 1, because more trips in any region result in more total trips. For computational efficiency we therefore implement the Causal UCB Oracle to set η to 1 instead of optimizing over η.

Figure 9: Green dots represent demand for trips, black crosses are depot locations, and blue circles are scaled in size to the frequency with which D-CBO-MW assigns bikes to that depot (averaged across all repeats).
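As a concrete illustration of this simplification, the sketch below computes the optimistic estimate of total daily trips for the SMS graph by taking the upper end of each regional confidence interval (η fixed to 1) and summing, rather than optimizing over η. The GP interface and variable names are illustrative placeholders, not the released implementation.

```python
import numpy as np

def sms_ucb(regional_gps, region_features, adversary_obs, beta):
    """Optimistic estimate of total daily trips for the SMS causal graph.

    Because total trips Y is a known sum of regional trip counts, and each
    regional count can only increase Y, the optimistic choice within every
    regional confidence interval is its upper end (eta fixed to 1).

    regional_gps:    list of fitted GP models, one per region; each is assumed
                     to expose predict(x) returning a scalar posterior mean and
                     standard deviation for a single input
    region_features: list of per-region bike-allocation feature vectors
    adversary_obs:   observed adversary variables (weather, demand) for the day
    beta:            confidence-width parameter beta_t
    """
    total_ucb = 0.0
    for gp, feats in zip(regional_gps, region_features):
        inputs = np.concatenate([feats, adversary_obs])  # regional GP input
        mean, std = gp.predict(inputs)                   # posterior for X^r
        total_ucb += mean + beta * std                   # upper end of interval
    return total_ucb
```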