# dynamic_causal_bayesian_optimization__e8133749.pdf Dynamic Causal Bayesian Optimization Virginia Aglietti University of Warwick The Alan Turing Institute V.Aglietti@warwick.ac.uk Neil Dhir The Alan Turing Institute ndhir@turing.ac.uk Javier González Microsoft Research Cambridge Gonzalez.Javier@microsoft.com Theodoros Damoulas University of Warwick The Alan Turing Institute T.Damoulas@warwick.ac.uk This paper studies the problem of performing a sequence of optimal interventions in a causal dynamical system where both the target variable of interest and the inputs evolve over time. This problem arises in a variety of domains e.g. system biology and operational research. Dynamic Causal Bayesian Optimization (DCBO) brings together ideas from sequential decision making, causal inference and Gaussian process (GP) emulation. DCBO is useful in scenarios where all causal effects in a graph are changing over time. At every time step DCBO identifies a local optimal intervention by integrating both observational and past interventional data collected from the system. We give theoretical results detailing how one can transfer interventional information across time steps and define a dynamic causal GP model which can be used to quantify uncertainty and find optimal interventions in practice. We demonstrate how DCBO identifies optimal interventions faster than competing approaches in multiple settings and applications. 1 Introduction Causal Dimension f(X = x, t) ABO Temporal Dimension f(do(Xs,t = xs,t), I0:t 1) f(do(Xs = xs)) CBO f(X = x) BO Figure 1: DAG representation of a dynamic causal global optimisation (DCGO) problem (a) and the DAG considered when using CBO, ABO or BO to address the same problem. Shaded nodes gives observed variables while the arrows represent causal effects. Solving decision making problems in a variety of domains requires understanding of cause-effect relationships in a system. This can be obtained by experimentation. However, deciding how to intervene at every point in time is particularly complex in dynamical systems, due to the evolving nature of causal effects. For instance, companies need to decide how to allocate scarce resources across different quarters. In system biology, scientists need to identify genes to knockout at specific points in time. This paper describes a probabilistic framework that finds such optimal interventions over time. Focusing on a specific example, consider a setting in which Yt denotes the unemployment-rate of an economy at time t, Zt is the economic growth and Xt the inflation rate. Fig. 1a depicts the causal graph [26] representing an agent s understanding of the causal links between these variables. The agent aims to determine, at each time step t {0, 1, 2}, the optimal action to perform in order to minimize the current unemployment rate Denotes equal contribution 35th Conference on Neural Information Processing Systems (Neur IPS 2021). Yt while accounting for the intervention cost. The investigator could frame this setting as a sequence of global optimization problems and find the solutions by resorting to Causal Bayesian Optimization [CBO, 2]. CBO extends Bayesian Optimization [BO, 30] to cases in which the variable to optimize is part of a causal model where a sequence of interventions can be performed. However, CBO does not account for the system s temporal evolution thus breaking the time dependency structure existing among variables (Fig. 1b). This will lead to sub-optimal solutions, especially in non-stationary scenarios. The same would happen when using Adaptive Bayesian Optimization [ABO, 25] (Fig. 1c) or BO (Fig. 1d). ABO captures the time dependency of the objective function but neither considers the causal structure among inputs nor their temporal evolution. BO disregards both the temporal and the causal structure. Our setting differs from both reinforcement learning (RL) and the multi-armed bandits setting (MAB). Differently from MAB we consider interventions on continuous variables where the dynamic target variable has a non-stationary interventional distribution. In addition, compared to RL, we do not model the state dynamics explicitly and allow the agent to perform a number of explorative interventions which do not change the underlying state of the system, before selecting the optimal action. We discuss these points further in 1.1. Dynamic Causal Bayesian Optimization2, henceforth DCBO, accounts for both the causal relationships among input variables and the causality between inputs and outputs which might evolve over time. DCBO integrates CBO with dynamic Bayesian networks (DBN), offering a novel approach for decision making under uncertainty within dynamical systems. DBN [19] are commonly used in time-series modelling and carry dependence assumptions that do not imply causation. Instead, in probabilistic causal models [27], which form the basis for the CBO framework, graphs are buildt around causal information and allow us to reason about the effects of different interventions. By combining CBO with DBNs, the proposed methodology finds an optimal sequence of interventions which accounts for the causal temporal dynamics of the system. In addition, DCBO takes into account past optimal interventions and transfers this information across time, thus identifying the optimal intervention faster than competing approaches and at a lower cost. We make the following contributions: We formulate a new class of optimization problems called Dynamic Causal Global Optimization (DCGO) where the objective functions account for the temporal causal dynamics among variables. We give theoretical results demonstrating how interventional information can be transferred across time-steps depending on the topology of the causal graph. Exploiting our theoretical results, we solve the optimization problem with DCBO. At every time step, DCBO constructs surrogate models for different intervention sets by integrating various sources of data while accounting for past interventions. We analyze DCBO performance in a variety of settings comparing against CBO, ABO and BO. 1.1 Related Work Dynamic Optimization Optimization in dynamic environments has been studied in the context of evolutionary algorithms [14, 16]. More recently, other optimization techniques [28, 32, 10] have been adapted to dynamic settings, see e.g. [9] for a review. Focusing on BO, the literature on dynamic settings [3, 7, 25] is limited. The dynamic BO framework closest to this work is given by Nyikosa et al. [25] and focuses on functions defined on continuous spaces that follow a more complex behaviour than a simple Markov model. ABO treats the inputs as fixed and not as random variables, thereby disregarding their temporal evolution and, more importantly, breaking their causal dependencies. Causal Optimization Causal BO [CBO, 2] focuses instead on the causal aspect of optimization and solves the problem of finding an optimal intervention in a DAG by modelling the intervention functions with single GPs or a multi-task GP model [1]. CBO disregards the existence of a temporal evolution in both the inputs and the output variable, treating them as i.i.d. overtime. While disregarding time significantly simplifies the problem, it prevents the identification of an optimal intervention at every t. Bandits and RL In the broader decision-making literature, causal relationships have been considered in the context of bandits [4, 20 22] and reinforcement learning [23, 8, 13, 36, 24]. In these cases, actions or arms, correspond to interventions on a causal graph where there exists complex relationships between the agent s decisions and the received rewards. While dynamic settings have been considered in acausal bandit algorithms [5, 33, 35], causal MAB have focused on static settings. Dynamic settings are instead considered by RL algorithms and formalized through Markov decision processes (MDP). 2A Python implementation is available at: https://github.com/neildhir/DCBO. In the current formulation, DCBO does not consider an MDP as we do not have a notion of state and therefore do not require an explicit model of its dynamics. The system is fully specified by the causal model. As in BO, we focus on identifying a set of time-indexed optimal actions rather than an optimal policy. We allow the agent to perform explorative interventions that do not lead to state transitions. More importantly, differently from both MAB and RL, we allow for the integration of both observational and interventional data. An expanded discussion on the reason why DCBO should be used and the links between DCBO, CBO, ABO and RL is included in the supplement ( 8). 2 Background and Problem Statement Let random variables and values be denoted by upper-case and lower-case letters respectively. Vectors are represented shown in bold. do(X = x) represents an intervention on X whose value is set to x. p(Y | X = x) represents an observational distribution and p(Y | do(X = x)) represents an interventional distribution. This is the distribution of Y obtained by intervening on X and fixing its value to x in the data generating mechanism (see Fig. 2), irrespective of the values of its parents. Evaluating p(Y | do(X = x)) requires real interventions while p(Y | X = x) only requires observing the system. DO and DI denote observational and interventional datasets respectively. Consider a structural causal model (SCM) defined in Definition 1. Definition 1. (Structural Causal Model) [27, p. 203]. A structural causal model M is a triple U, V, F) where U is a set of background variables (also called exogenous), that are determined by factors outside of the model. V is a set {V1, V2, . . . , V|V|} of observable variables (also called endogenous), that are determined by variables in the model (i.e., determined by variables in U V). F is a set of functions {f1, f2, . . . , fn} such that each fi is a mapping from the respective domains of Ui Pa(Vi) to Vi, where Ui U and Pa(Vi) V \ Vi and the entire set F forms a mapping from U to V. In other words, each {fi vi fi(Pa(vi) , ui) | i = 1, . . . , n}, assigns a value to Vi that depends on the values of the select set of variables (Ui Pa(Vi)). M is associated to a directed acyclic graph (DAG) G, in which each node corresponds to a variable and the directed edges point from members of Pa(Vi) and Ui to Vi. We assume G to be known and leave the integration with causal discovery [15] methods for future work. Within V, we distinguish between three different types of variables: non-manipulative variables C, which cannot be modified, treatment variables X that can be set to specific values and output variable Y that represents the agent s outcome of interest. Exploiting the rules of do-calculus [27] one can compute p(Y | do(X = x)) using observational data. This often involves evaluating intractable integrals which can be approximated by using observational data to get a Monte Carlo estimate bp(Y | do(X = x)) p(Y | do(X = x)). These approximations will be consistent when the number of samples drawn from p(V) is large. Causality in time One can encode the existence of causal mechanisms across time steps by explicitly representing these relationships with edges in an extended graph denoted by G0:T . For instance, the DAG in Fig. 1(a) can be seen as one of the DAGs in Fig. 1(b) propagated in time. The DAG in Fig. 1(a) captures both the causal structure existing across time steps and the causal mechanism within every time-slice t [19]. In order to reason about interventions that are implemented in a sequential manner, that is at time t we decide which intervention to perform in the system and so define: Definition 2. Mt is the SCM at time step t defined as Mt = U0:t, V0:t, F0:t where 0 : t denotes the union of the corresponding variables or functions up to time t (see Fig. 2). V0:t includes X0:t = Xt, Y0:t = Yt and C0:t = Ct C0:t 1. The functions in F0:t corresponding to intervened variables are replaced by constant values while the exogenous variables related to them are excluded from U0:t. Definition 3. Gt is the causal graph associated to Mt. In Gt, the incoming edges in variables intervened at 0 : t 1 are mutilated while intervened variables are represented by deterministic nodes (squares) see Fig. 2. Dynamic Causal Global Optimization (DCGO) The goal of this work is to find a sequence of interventions, optimizing a target variable, at each time step, in a causal DAG. Given Gt and Mt, at every time step t, we wish to optimize Yt by intervening on a subset of the manipulative variables Xt. The optimal intervention variables X s,t and intervention levels x s,t are given by: X s,t, x s,t = arg min Xs,t P(Xt),xs D(Xs,t) E[Yt | do(Xs,t = xs,t) , 1t>0 I0:t 1] (1) X0 = f X0(ǫX0) Z0 = f Z0(X0, ǫZ0) Y0 = f Y0(Z0, ǫY0) F0 = {f X0, f Z0, f Y0} U0 = {ǫX0, ǫZ0, ǫY0} X0 = {X0, Z0} X0 = f X0(ǫX0) Z0 = f I Z0( ) = z0 Y0 = f Y0(z0, ǫY0) X1 = f X1(X0, ǫX0) Z1 = f Z1(z0, X1, ǫZ1) Y1 = f Y1(Y0, Z1, ǫY1) F0:1 = {f X0, f I Z0, f Y0,X1 , f Z1, f Y1} U0:1 = {ǫX0, ǫY0, ǫX1, ǫZ1, ǫY1} X0:1 = {X1, Z1} C0:1 = {X0, Y0} Y0:1 = Y1 X0 = f X0(ǫX0) Z0 = f I Z0( ) = z0 Y0 = f Y0(z0, ǫY0) X1 = f I X1( ) = x1 Z1 = f Z1(z0, x1, ǫZ1) Y1 = f Y1(Y0, Z1, ǫY1) X2 = f X2(x1, ǫX2) Z2 = f Z2(Z1, X2, ǫZ2) Y2 = f Y2(Y1, Z2, ǫY2) F0:2 = {f X0, f I Z0, f Y0, f I X1, f Z1, f Y1, f X2, f Z2, f Y2} U0:2 = {ǫX0, ǫY0, ǫZ1, ǫY1, ǫX2, ǫZ2, ǫY2} X0:2 = {X2, Z2} C0:2 = {X0, Y0, Z1, Y1} I0 = (Z0, z0) = arg min Xs,0 P(X0), xs D(Xs,0) E [Y0 | do(Xs,0 = xs,0)] I1 = (X1, x1) = arg min Xs,1 P(X1), xs D(Xs,1) E [Y1 | do(Xs,1 = xs,1), I0] I2 = (Z2, z2) = arg min Xs,2 P(X2), xs D(Xs,2) E [Y2 | do(Xs,2 = xs,2), I1, I0] Figure 2: Structural equation models considered by DCBO at every time step t {0, 1, 2}. Exogenous noise variables ϵi are depicted here but are omitted in the remainder of the paper, to avoid clutter. For every t, Gt is a mutilated version of Gt 1 reflecting the optimal intervention implemented in the system at 0 : t 1 which are represented by squares. The SCM functions in F0:t corresponding to the intervened variables are set to constant values. The exogenous variables that only related to the intervened variables are excluded from Ut. C0:t is given by the set {Ct C0:t 1 Y0:t 1 X0:t 1}. where I0:t 1 = St 1 i=0 do X s,i = x s,i denotes previous interventions, 1t>0 is the indicator function and P(Xt) is the power set of Xt. D(Xs,t) represents the interventional domain of Xs,t. In the sequel we denote the previously intervened variables by IV 0:t 1 = St 1 i=0 X s,i and implemented intervention levels by IL 0:t 1 = St 1 i=0 x s,i. The cost of each intervention is given by cost(Xs,t, xs,t). In order to solve the problem in Eq. (1) we make the following assumptions : Assumptions 1. Denote by G(t) the causal graph including variables at time t in G0:T and let Y PT t = Pa(Yt) Y0:t 1 be the set of variables in G0:T that are both parents of Yt and targets at previous time step. Let the set Y PNT t = Pa(Yt) \Y PT t be the complement and denote by f Yt( ) the functional mapping for Yt in Mt. We make the following assumptions: 1. Invariance of causal structure: G(t) = G(0), t > 0. 2. Additivity of f Yt( ) that is Yt = f Yt(Pa(Yt)) + ϵ with f Yt(Pa(Yt)) = f Y Y (Y PT t ) + f NY Y (Y PNT t ) where f Y Y and f NY Y are two generic unknown functions and ϵ N(0, σ2). 3. Absence of unobserved confounders in G0:T . Assumption (3) implies the absence of unobserved confounders at every time step. For instance, this is the case in Fig. 1a. Still in this DAG, Assumption (2) implies f Yt(Pa(Yt)) = f Y Y (Yt 1) + f NY Y (Zt) + ϵYt, t > 0. Finally, Assumption (1) implies the existence of the same variables at every time step and a constant orientation of the edges among them for t > 0. Notice that Assumptions 1 imply invariance of the causal structure within each time-slice, i.e. the structure, edges and vertices, concerning the nodes with the same time index. This means that, across time steps, both the graph and the functional relationships can change. Therefore, not only can the causal effects change significantly across time steps, but also the input dimensionality of the causal functions we model, might change. For instance, in the DAG of Fig. 3(c), the target function for Y2 has dimensionality 3 and a function f Yt( ) that is completely different from the one assumed for Y1 that has only two parents. We can thus model a wide variety of settings and causal effects despite this assumption. Furthermore, even though we assume an additive structure for the functional relationship on Y , the use of GPs allow us to have flexible models with highly non-linear causal effects across different graph structures. In the causality literature, GP models are well established and have shown good performances compared to parametric linear and non-linear models (see e.g. [31, 34, 37]). The sum of GPs gives a flexible and computationally tractable model that can be used to capture highly non-linear causal effects while helping with interpretability [12, 11]. 3 Methodology In this section we introduce Dynamic Causal Bayesian Optimization (DCBO), a novel methodology addressing the problem in Eq. (1). We first study the correlation among objective functions for two consecutive time steps and use it to derive a recursion formula that, based on the topology of the graph, expresses the causal effects at time t as a function of previously implemented interventions (see square nodes in Fig. 2). Exploiting these results, we develop a new surrogate model for the objective functions that can be used within a CBO framework to find the optimal sequence of interventions. This model enables the integration of observational data and interventional data, collected at previous time-steps and interventional data collected at time t, thereby accelerating the identification of the current optimal intervention. 3.1 Characterization of the time structure in a DAG with time dependent variables The following result provides a theoretical foundation for the dynamic causal GP model introduced later. In particular, it derives a recursion formula allowing us to express the objective function at time t as a function of the objective functions corresponding to the optimal interventions at previous time steps. The proof is given in the appendix ( 2). Definition 4. Consider a DAG G0:T and the objective function E[Yt | do(Xs,t = xs,t) , I0:t 1] for a generic time step t {0, . . . , T}. Denote by Y PT t = (Pa(Yt) Y0:t 1) the parents of Yt that are targets at previous time steps and by Y PNT t = Pa(Yt) \Y PT t the remaining parents. For any Xs,t P(Xt) and IV 0:t 1 X0:t 1 we define the following sets: XPY s,t = Xs,t Pa(Yt) includes the variables in Xs,t that are parents of Yt. IPY 0:t 1 = IV 0:t 1 Pa(Yt) includes the variables in IV 0:t 1 that are parents of Yt. W Pa(Yt) such that Pa(Yt) = (Pa(Yt) Y0:t 1) XPY s,t IPY 0:t 1 W. W includes variables that are parents of Yt but are not targets nor intervened variables. The values of I0:t 1, XPY s,t, IPY 0:t 1 and W will be denoted by i, x PY, i PY and w respectively. Theorem 1. Time operator. Consider a DAG G0:T and the related SCM satisfying Assumptions 1. It is possible to prove that, Xs,t P(Xt), the intervention function fs,t(x) = E[Yt | do(Xs,t = x) , 1t>0 I0:t 1] with fs,t(x) : D(Xs,t) R can be written as: fs,t(x) = f Y Y (f ) + Ep(w|do(Xs,t=x),i) f NY Y (x PY, i PY, w) (2) where f = {E Yi| do X s,i = x s,i , I0:i 1 }Yi Y PT t that is the set of previously observed optimal targets that are parents of Yt. f Y Y denotes the function mapping Y PT t to Yt and f NY Y represents the function mapping Y PNT t to Yt. Eq. (2) reduces to Ep(w| do(Xs,t=x),i) f NY Y (x PY, i PY, w) when Yt does not depend on previous targets. This is the setting considered in CBO that can be thus seen as a particular instance of DCBO. Exploiting Assumptions (1), it is possible to further expand the second term in Eq. (2) to get the following expression. A proof is given in the supplement ( 2). Corollary 1. Given Assumptions 1, we can write: Ep(w| do(Xs,t=x),i) f NY Y (x PY, i PY, w) = Ep(U0:t) f NY Y (x PY, i PY, {C(W)}W W) (3) where p(U0:t) is the distribution for the exogenous variables up to time t and C(W) is given by: f W (u W , x PW, i PW) if R = f W (u W , x PW, i PW, r) if R Xs,t IV 0:t 1 f W (u W , x PW, i PW, C(R)) if R Xs,t IV 0:t 1 where f W represents the functional mapping for W in the SCM and u W is the set of exogenous variables with edges into W. x PW and i PW are the values corresponding to XPW s,t and IPW 0:t 1 which in turn represent the subset of variables in Xs,t and IV 0:t 1 that are parents of W. Finally r is the value of R = Pa(W) \(XPY s,t IPW 0:t 1). Examples for Eq. (2): For the DAG in Fig. 1(a), at time t = 1 and with IV 0:t 1 = {Z0}, we have E[Y | do(Z1 = z) , I0] = f Y Y (y 0) + f NY Y (z). Indeed in this case W = , x PY = z and f = {y 0 = E[Y0| do(Z0 = z0)]}. Still at t = 1 and with IV 0:t 1 = {Z0}, the objective function for Xs,t = {X1} can be written as f Y Y (y 0) + Ep(z1| do(X1=x),I0) f NY Y (z1) as W = {Z1}. All derivations for these expressions and alternative graphs are given in the supplement ( 2). 3.2 Restricting the search space The search space for the problem in Eq. (1) grows exponentially with |Xt| thus slowing down the identification of the optimal intervention when G includes more than a few nodes. Indeed, a naive approach of finding X s,t at t = 0, . . . , T would be to explore the 2|Xt| sets in P(Xt) at every t and keep 2|Xt| models for the objective functions. In the static setting, CBO reduces the search space by exploiting the results in [21]. In particular, it identifies a subset of variables M P(X) worth intervening on thus reducing the size of the exploration set to 2|M|. In our dynamic setting, the objective functions change at every time step depending on the previously implemented interventions and one would need to recompute M at every t. However, it is possible to show that, given Assumptions 1, the search space remains constant over time. Denote by Mt the set M at time t and let M0 represent the set at t = 0 which corresponds to M computed in CBO. For t > 0 it is possible to prove that: Proposition 3.1. MIS in time. If Assumptions 1 are satisfied, Mt = M0 for t > 0. 3.3 Dynamic Causal GP model Here we introduce the Dynamic Causal GP model that is used as a surrogate model for the objective functions in Eq. (1). The prior parameters are constructed by exploiting the recursion in Eq. (2). At each time step t, the agent explores the sets in Mt P(Xt) by selecting the next intervention to be the one maximizing a given acquisition function. The DCBO algorithm is shown in Algorithm 1. Prior Surrogate Model At each time step t and for each Xs,t Mt, we place a GP prior on the objective function fs,t(x) = E[Yt| do(Xs,t = x) , 1t>0 I0:t 1]. We construct the prior parameters exploiting the recursive expression in Eq. (2): fs,t(x) GP(ms,t(x), ks,t(x, x )) where ms,t(x) = E h f Y Y (f ) + b E[f NY Y (x PY, i PY, w)] i ks,t(x, x ) = k RBF(x, x ) + σs,t(x)σs,t(x ) with σs,t(x) = q V[f Y Y (f ) + ˆE f NY Y (x PY, i PY, w) and k RBF(x, x ) := exp( ||x x ||2 2l2 ) represents the radial basis function kernel [29]. We have it that ˆE f NY Y (x PY, i PY, w) = ˆEp(w| do(Xs,t=x),i) f NY Y (x PY, i PY, w) represents the expected value of f NY Y (x PY, i PY, w) with respect to p(w | do(Xs,t = x) , i) which is estimated via the do-calculus using observational data. The outer expectation in ms,t(x) and the variance in σs,t(x) are computed with respect to p(f Y Y , f NY Y ) which is also estimated using observational data. In this work we model f Y Y , f NY Y and all functions in the SCM by independent GPs. Both ms,t(x) and σs,t(x) can be equivalently written by exploiting the equivalence in Eq. (3). In both cases, this prior construction allows the integration of three different types of data: observational data, interventional data collected at time t and the optimal interventional data points collected in the past. The former is used to estimate the SCM model and p(w | do(Xs,t = x) , i) via the rules of do-calculus. The optimal interventional data points at 0 : t 1 determine the shift f Y Y (f ) while the interventional data collected at time t are used to update the prior distribution on fs,t(x). Similar prior constructions were previously considered in static settings [2, 1] where only observational and interventional data at the current time step were used. The additional shift term appears here as there exists causal dynamics in the target variables and the objective function is affected by previous decisions. Fig. 2 in the appendix shows a synthetic example in which accounting for the dynamic aspect in the prior formulation leads to a more accurate GP posterior compared to the baselines, especially when the the optimum location changes across time steps. Algorithm 1: DCBO Data: DO, {DI s,t=0}s {0,...,|M0|}, G0:T , H. Result: Optimal intervention path {X s,t, x s,t, y t }T t=1 Initialise: M, DI 0 and initial optimal DI = . for t = 0, . . . , T do 1. Initialise dynamic causal GP models for all Xs,t Mt using DI ,t 1 if t > 0. 2. Initialise interventional dataset {DI s,t}s {0,...,|Mt|} for h = 1, . . . , H do 1. Compute EIs,t(x) for each Xs,t Mt. 2. Obtain (s , α ) 3. Intervene and augment DI s=s ,t 4. Update posterior for fs=s ,t end 3. Return the optimal intervention (X s,t, x s,t) 4. Append optimal interventional data DI ,t = DI ,t 1 ((X s,t, x s,t), y t ) end Likelihood Let DI s,t = (XI, YI s,t) be the set of interventional data points collected for Xs,t with XI being a vector of intervention values and YI s,t representing the corresponding vector of observed target values. As in standard BO we assume each ys,t in YI s,t to be a noisy observation of the function fs,t(x) that is ys,t(x) = fs,t(x) + ϵs,t with ϵs,t N(0, σ2) for s {1, . . . , |Mt|} and t {0, . . . , T}. In compact form, the joint likelihood function for DI s,t is p(YI s,t | fs,t, σ2) = N(fs,t(XI), σ2I). Acquisition Function Given our surrogate models at time t, the agent selects the interventions to implement by solving a Causal Bayesian Optimization problem [2]. The agent explores the sets in Mt and decides where to intervene by maximizing the Causal Expected Improvement (EI). Denote by y t the optimal observed target value in {YI s,t}|Mt| s=1 that is the optimal observed target across all intervention sets at time t. The Causal EI is given by EIs,t(x) = Ep(ys,t)[max(ys,t y t , 0)]/cost(Xs,t, xs,t). Let α1, . . . , α|Mt| be solutions of the optimization of EIs,t(x) for each set in Mt and α := max{α1, . . . , α|Mt|}. The next best intervention to explore at time t is given by s = argmaxs {1, ,|Mt|} αs. Therefore, the set-value pair to intervene on is (s , α ). At every t, the agent implement H explorative interventions in the system which are selected by maximizing the Causal EI. Once the budget H is exhausted, the agent implements what we call the decision intervention It, that is the optimal intervention found at the current time step, and move forward to a new optimization at t + 1 carrying the information in y 0:t 1. The parameter H determines the level of exploration of the system and acts as a budget for the CBO algorithm. Its value is determined by the agent and is generally problem specific. Posterior Surrogate Model For any set Xs,t Mt, the posterior distribution p(fs,t | DI s,t) can be derived analytically via standard GP updates. p(fs,t | DI s,t) will also be a GP with parameters ms,t(x | DI s,t) = ms,t(x) + ks,t(x, XI)[ks,t(XI, XI) + σ2I](YI s,t ms,t(XI) and ks,t(x, x | DI s,t) = ks,t(x, x ) ks,t(x, XI)[ks,t(XI, XI) + σ2I]ks,t(XI, x ). 4 Experiments We evaluate the performance of DCBO in a variety of synthetic and real world settings with DAGs given in Fig. 3. We first run the algorithm for a stationary setting where both the graph structure and the SCM do not change over time (STAT.). We then consider a scenario characterised by increased observation noise (NOISY) for the manipulative variables and a settings where observational data are missing at some time steps (MISS.). Still assuming stationarity, we then test the algorithm in a DAG where there are multivariate interventions in Mt (MULTIV.). Finally, we run DCBO for a non-stationary graph where both the SCM and the DAG change over time (NONSTAT.). To conclude, we use DCBO to optimize the unemployment rate of a closed economy (DAG in Fig. 3d, ECON.) and to find the optimal intervention in a system of ordinary differential equation modelling a real predator-prey system (DAG in Fig. 3e, ODE). We provide a discussion on the applicability of DCBO to real-world problems in 7 of the supplement together with all implementation details. Baselines We compare against the algorithms in Fig. 1. Note that, by constructions, ABO and BO intervene on all manipulative variables while DCBO and CBO explore only Mt at every t. In addition, both DCBO and ABO reduce to CBO and BO at the first time step. We assume the availability of an observational dataset DO and set a unit intervention cost for all variables. Performance metric We run all experiments for 10 replicates and show the average convergence path at every time step. We then compute the values of a modified gap metric3 across time steps and with standard errors across replicates. The metric is defined as Gt = y(x s,t) y(xinit) y y(xinit) + H H(x s,t) H where y( ) represents the evaluation of the objective function, y is the global minimum, and xinit and x s,t are the first and best evaluated point, respectively. The term H H(x s,t) H with H(x s,t) denoting the number of explorative trials needed to reach x s,t captures the speed of the optimization. This term is equal to zero when the algorithm is not converged and equal to (H 1)/H when the algorithm converges at the first trial. We have 0 Gt 1 with higher values denoting better performances. For each method we also show the average percentage of replicates where the optimal intervention set X s,t is identified. Y0 Y1 Y2 W0 W1 W2 (a) MULTIV. (c) NONSTAT. Figure 3: DAGs used in the experimental sections for the real ( 4.2) and synthetic data ( 4.1). 4.1 Synthetic Experiments Stationary DAG and SCM (STAT.) We run the algorithms for the DAG in Fig. 1(a) with T = 3 and N = 10. For t > 0, DCBO converges to the optimal value faster than competing approaches (see Fig. 2 in the supplement, right panel, 3rd row). DCBO identifies the optimal intervention set in 93% of the replicates (Table 2) and reaches the highest average gap metric (Table 1). In this experiment the location of the optimum changes significantly both in terms of optimal set and intervention value when going from t = 0 to t = 1. This information is incorporated by DCBO through the prior dependency on y 0:t 1. In addition, ABO performance improves over time as it accumulates interventional data and uses them to fit the temporal dimension of the surrogate model. This benefits ABO in a stationary setting but might penalise it in non-stationary setting where the objective functions change significantly. Noisy manipulative variables (NOISY): The benefit of using DCBO becomes more apparent when the manipulative variables observations are noisy while the evolution of the target variable is more accurately detected. In this case both the convergence of DCBO and CBO are slowed down by noisy observations which are diluting the information provided by the do-calculus making the priors less informative. However, the DCBO prior dependency on y 0:t 1 allows it to correctly identify the shift in the target variable thus improving the prior accuracy and the speed-up of the algorithm (Fig. 4). Missing observational data (MISS.) Incorporating dynamic information in the surrogate model allows us to efficiently optimise a target variable even in setting where observational data are missing. 3This metric is a modified version of the one used in [18]. 0 5 10 15 cost(Xs,t, xs,t) 0 5 10 15 cost(Xs,t, xs,t) 0 5 10 15 cost(Xs,t, xs,t) 5 t = 2 DCBO CBO ABO BO E Yt | do(X s,t = x s,t) Figure 4: Experiment NOISY. Convergence of DCBO and competing methods across replicates. The dashed black line (- - -) gives the optimal outcome y t , t. Shaded areas are one standard deviation. We consider the DAG in Fig. 1(a) with T = 6, N = 10 for the first three time steps and N = 0 afterwards. DCBO uses the observational distributions learned with data from the first three time steps to construct the prior for t > 3. On the contrary, CBO uses the standard prior for t > 3. In this setting DCBO consistently outperforms CBO at every time step. However, ABO performance improves over time and outperforms DCBO starting from t = 4 due to its ability to exploit all interventional data collected over time (see Fig. 3 in the supplement). Multivariate intervention sets (MULTIV.) When the optimal intervention set is multivariate, both DCBO and CBO convergence speed worsen. For instance, for the DAG in Fig. 3a, |M| = 5 thus both CBO and DCBO will have to perform more explorative interventions before finding the optimum. At the same time, ABO and BO consider interventions only on {Wt, Xt, Zt}, t and need to explore an even higher intervention space. The performance of all methods decrease in this case (Table 1) but DCBO still identifies the optimal intervention set in 93% of the replicates (Table 2). Independent manipulative variables (IND.): Having to explore multiple intervention sets significantly penalises DCBO and CBO when there is no causal relationship among manipulative variables which are also the only parents of the target. This is the case for the DAG in Fig. 3b where the optimal intervention is {Xt, Zt} at every time step. In this case, exploring M and propagating uncertainty in the causal prior slows down DCBO convergence and decreases both its performance (Table 1) and capability to identify the optimal intervention set (Table 2). Non-stationary DAG and SCM (NONSTAT.): DCBO outperforms all approaches in non-stationary settings where both the DAG and the SCM change overtime see Fig. 3c. Indeed, DCBO can timely incorporate changes in the system via the dynamic causal prior construction while CBO, BO and ABO need to perform several interventions before accurately learning the new objective functions. Table 1: Average Gt across 10 replicates and time steps. See Fig. 1 for a summary of the baselines. Higher values are better. The best result for each experiment in bold. Standard errors in brackets. Synthetic data Real data STAT. MISS. NOISY MULTIV. IND. NONSTAT. ECON. ODE DCBO 0.88 0.84 0.75 0.49 0.48 0.69 0.64 0.67 (0.00) (0.01) (0.00) (0.01) (0.04) (0.00) (0.01) (0.00) CBO 0.70 0.70 0.51 0.48 0.47 0.61 0.61 0.65 (0.01) (0.02) (0.02) (0.09) (0.07) (0.00) (0.01) (0.00) ABO 0.56 0.49 0.49 0.39 0.54 0.38 0.57 0.48 (0.01) (0.02) (0.04) (0.21) (0.01) (0.02) (0.02) (0.01) BO 0.54 0.48 0.38 0.35 0.50 0.38 0.50 0.44 (0.02) (0.03) (0.05) (0.08) (0.01) (0.03) (0.01) (0.03) 4.2 Real experiments Real-World Economic data (ECON.) We use DCBO to minimize the unemployment rate Ut of a closed economy. We consider its causal relationships with economic growth (Gt), inflation rate (Rt) and fiscal policy (Tt)4. Inspired by the economic example in [17] we consider the DAG in 4The causality between economic variables is oversimplified in this example thus the results cannot be used to guide public policy and are only meant to showcase how DCBO can be used within a real application. Table 2: Average % of replicates across time steps for which X s,t is identified. See Fig. 1 for a summary of the baselines. Higher values are better. The best result for each experiment in bold. Synthetic data Real data STAT. MISS. NOISY MULTIV. IND. NONSTAT. ECON. ODE DCBO 93.00 58.00 100.00 93.00 93.00 100.00 86.67 33.3 CBO 90.00 85.00 90.00 90.0 90.00 100.00 93.33 33.3 ABO 0.00 0.00 0.00 0.00 100.00 0.00 66.67 0.00 BO 0.00 0.00 0.00 0.00 100.00 0.00 66.67 0.00 Fig. 3d where Rt and Tt are considered manipulative variables we need to intervene on to minimize log(Ut) at every time step. Time series data for 10 countries5 are used to construct a non-parametric simulator and to compute the causal prior for both DCBO and CBO. DCBO convergences to the optimal intervention faster than competing approaches (see Table 1 and Fig. 6 in the appendix). The optimal sequence of interventions found in this experiment is equal to {(T0, R0) = (9.38, 2.00), (T1, R1) = (0.53, 6.00), (T2) = (0.012)} which is consistent with domain knowledge. Planktonic predator prey community in a chemostat (ODE) We investigate a biological system in which two species interact, one as a predator and the other as prey, with the goal of identifying the intervention reducing the concentration of dead animals in the chemostat see Dt in Fig. 3e. We use the system of ordinary differential equations (ODE) given by [6] as our SCM and construct the DAG by rolling out the temporal variable dependencies in the ODE while removing graph cycles. Observational data are provided in [6] and are use to compute the dynamic causal prior. DCBO outperforms competing methods in term of average gap metric and identifies the optimum faster (Table 1). Additional details can be found in the supplement ( 6). 5 Conclusions We consider the problem of finding a sequence of optimal interventions in a causal graph where causal temporal dependencies exist between variables. We propose the Dynamic Causal Bayesian Optimization (DCBO) algorithm which finds the optimal intervention at every time step by intervening in the system according to a causal acquisition function. Importantly, for each possible intervention we propose to use a surrogate model that incorporates information from previous interventions implemented in the system. This is constructed by exploiting theoretical results establishing the correlation structure among objective functions for two consecutive time steps as a function of the topology of the causal graph. We discuss the DCBO performance in a variety of setting characterized by different DAG properties and stationarity assumptions. Future work will focus on extending our theoretical results to more general DAG structures thus allowing for unobserved confounders and a changing DAG topology within each time step. In addition, we will work on combining the proposed framework with a causal discovery algorithm so as to account for uncertainty in the graph structure. Acknowledgements This work was supported by the EPSRC grant EP/L016710/1, The Alan Turing Institute under EPSRC grant EP/N510129/1, the Defence and Security Programme at The Alan Turing Institute, funded by the UK Government and the Lloyds Register Foundation programme on Data Centric Engineering through the London Air Quality project. TD acknowledges support from UKRI Turing AI Fellowship (EP/V02678X/1). 5Data were downloaded from https://www.data.oecd.org/ [Accessed: 01/04/2021]. All details in the supplement. [1] Aglietti, V., Damoulas, T., Álvarez, M., and González, J. Multi-task causal learning with gaussian processes. In Advances in Neural Information Processing Systems, volume 33, pp. 6293 6304, 2020. [2] Aglietti, V., Lu, X., Paleyes, A., and González, J. Causal Bayesian Optimization. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 3155 3164. PMLR, 26 28 Aug 2020. [3] Azimi, J., Jalali, A., and Fern, X. Dynamic batch bayesian optimization. ar Xiv preprint ar Xiv:1110.3347, 2011. [4] Bareinboim, E., Forney, A., and Pearl, J. Bandits with unobserved confounders: A causal approach. Advances in Neural Information Processing Systems, 28:1342 1350, 2015. [5] Besbes, O., Gur, Y., and Zeevi, A. Stochastic multi-armed-bandit problem with non-stationary rewards. Advances in neural information processing systems, 27:199 207, 2014. [6] Blasius, B., Rudolf, L., Weithoff, G., Gaedke, U., and Fussmann, G. F. Long-term cyclic persistence in an experimental predator prey system. Nature, 577(7789):226 230, 2020. [7] Bogunovic, I., Scarlett, J., and Cevher, V. Time-varying gaussian process bandit optimization. In Artificial Intelligence and Statistics, pp. 314 323. PMLR, 2016. [8] Buesing, L., Weber, T., Zwols, Y., Racaniere, S., Guez, A., Lespiau, J.-B., and Heess, N. Woulda, coulda, shoulda: Counterfactually-guided policy search. ar Xiv preprint ar Xiv:1811.06272, 2018. [9] Cruz, C., González, J. R., and Pelta, D. A. Optimization in dynamic environments: a survey on problems, methods and measures. Soft Computing, 15(7):1427 1448, 2011. [10] De, M. K., Slawomir, N. J., and Mark, B. Stochastic diffusion search: Partial function evaluation in swarm intelligence dynamic optimisation. In Stigmergic optimization, pp. 185 207. Springer, 2006. [11] Duvenaud, D. Automatic model construction with Gaussian processes. Ph D thesis, University of Cambridge, 2014. [12] Duvenaud, D., Nickisch, H., and Rasmussen, C. E. Additive Gaussian Processes. ar Xiv preprint ar Xiv:1112.4394, 2011. [13] Foerster, J., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. [14] Fogel, L. J., Owens, A. J., and Walsh, M. J. Artificial intelligence through simulated evolution. 1966. [15] Glymour, C., Zhang, K., and Spirtes, P. Review of causal discovery methods based on graphical models. Frontiers in Genetics, 10, 2019. [16] Goldberg, D. E. and Smith, R. E. Nonstationary function optimization using genetic algorithms with dominance and diploidy. In Genetic algorithms and their applications: proceedings of the second International Conference on Genetic Algorithms: July 28-31, 1987 at the Massachusetts Institute of Technology, Cambridge, MA. Hillsdale, NJ: L. Erlhaum Associates, 1987., 1987. [17] Huang, B., Zhang, K., Gong, M., and Glymour, C. Causal discovery and forecasting in nonstationary environments with state-space models. In International Conference on Machine Learning, pp. 2901 2910. PMLR, 2019. [18] Huang, D., Allen, T. T., Notz, W. I., and Zeng, N. Global optimization of stochastic black-box systems via sequential kriging meta-models. Journal of global optimization, 34(3):441 466, 2006. [19] Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press, 2009. ISBN 0262013193. [20] Lattimore, F., Lattimore, T., and Reid, M. D. Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems, pp. 1181 1189, 2016. [21] Lee, S. and Bareinboim, E. Structural causal bandits: where to intervene? Advances in Neural Information Processing Systems 31, 31, 2018. [22] Lee, S. and Bareinboim, E. Structural causal bandits with non-manipulable variables. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4164 4172, 2019. [23] Lu, C., Schölkopf, B., and Hernández-Lobato, J. M. Deconfounding reinforcement learning in observational settings. ar Xiv preprint ar Xiv:1812.10576, 2018. [24] Madumal, P., Miller, T., Sonenberg, L., and Vetere, F. Explainable reinforcement learning through a causal lens. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 2493 2500, 2020. [25] Nyikosa, F. M., Osborne, M. A., and Roberts, S. J. Bayesian optimization for dynamic problems, 2018. [26] Pearl, J. Causal diagrams for empirical research. Biometrika, 82(4):669 688, 1995. [27] Pearl, J. Causality: models, reasoning and inference, volume 29. Springer, 2000. [28] Pelta, D., Cruz, C., and Verdegay, J. L. Simple control rules in a cooperative system for dynamic optimisation problems. International Journal of General Systems, 38(7):701 717, 2009. [29] Rasmussen, C. E. Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63 71. Springer, 2003. [30] Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and De Freitas, N. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148 175, 2015. [31] Silva, R. and Gramacy, R. B. Gaussian Process Structural Equation Models with Latent Variables. ar Xiv preprint ar Xiv:1002.4802, 2010. [32] Trojanowski, K. and Wierzcho n, S. T. Immune-based algorithms for dynamic optimization. Information Sciences, 179(10):1495 1515, 2009. [33] Villar, S. S., Bowden, J., and Wason, J. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical science: a review journal of the Institute of Mathematical Statistics, 30(2):199, 2015. [34] Witty, S., Takatsu, K., Jensen, D., and Mansinghka, V. Causal Inference using Gaussian Processes with Structured Latent Confounders. In International Conference on Machine Learning, pp. 10313 10323. PMLR, 2020. [35] Wu, Q., Iyer, N., and Wang, H. Learning contextual bandits in a non-stationary environment. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 495 504, 2018. [36] Zhang, J. and Bareinboim, E. Near-optimal reinforcement learning in dynamic treatment regimes. In Advances in Neural Information Processing Systems 32 pre-proceedings (Neur IPS), 2019. [37] Zhang, K., Schölkopf, B., and Janzing, D. Invariant Gaussian Process Latent Variable Models and Application in Causal Discovery. ar Xiv preprint ar Xiv:1203.3534, 2012.