Constrained Causal Bayesian Optimization

Virginia Aglietti¹  Alan Malek¹  Ira Ktena¹  Silvia Chiappa¹

¹DeepMind, London, UK. Correspondence to: Virginia Aglietti. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023.

Abstract

We propose constrained causal Bayesian optimization (cCBO), an approach for finding interventions in a known causal graph that optimize a target variable under some constraints. cCBO first reduces the search space by exploiting the graph structure and, if available, an observational dataset; it then solves the restricted optimization problem by modelling target and constraint quantities with Gaussian processes and by sequentially selecting interventions via a constrained expected improvement acquisition function. We propose different surrogate models that integrate observational and interventional data while capturing the correlation among effects with increasing levels of sophistication. We evaluate cCBO on artificial and real-world causal graphs, showing a successful trade-off between fast convergence and the percentage of feasible interventions.

1. Introduction

The problem of understanding which interventions in a system optimize a target variable is of relevance in many scientific disciplines, including biology, medicine, and the social sciences. Often, the investigator might want to solve this problem while also ensuring that the interventions satisfy some constraints. As an example, consider the protein-signalling network of Fig. 1(a), which describes the causal pathways linking several phosphorylated proteins and phospholipids in human immune system cells, including the mitogen-activated protein kinases Erk studied in cancer therapy (Frémin & Meloche, 2010; Sachs et al., 2005). An investigator might wish to find which variables among Mek, PKC, PKA and Akt to perturb, and to which levels, in order to minimize Erk, under the constraints of not inhibiting PKA and PKC, which play an important role in the functioning of healthy cells. As another example, consider the graph of Fig. 1(b), which describes the causal relationships between prostate specific antigen (PSA) and its risk factors, some of which might be due to an existing policy at the time of the study (Ferro et al., 2015). An investigator might wish to understand which variables among Aspirin, Statin, and Calories Intake (CI) to fix, and to which values, in order to minimize PSA, under the constraint of keeping BMI below a certain value.

[Figure 1: (a) Protein-signaling network and (b) causal PSA graph. Red, grey and pink nodes indicate target, intervenable, and constrained variables respectively.]

In this paper, we propose constrained causal Bayesian optimization (cCBO), a sequential approach for efficiently solving such constrained optimization problems in the setting of a known causal graph. cCBO builds on the literature on causal Bayesian optimization (Aglietti et al., 2020), constrained Bayesian optimization (Gardner et al., 2014), and multi-task learning with Gaussian processes (GPs) (Alvarez et al., 2011). Our contributions are summarized as follows:

- We provide a formalization of the constrained optimization problem using the framework of structural causal models (SCMs).
- We introduce a novel theory for reducing the search space of interventions that eliminates intervention sets of higher cardinality.
- We propose different GP surrogate models for the unknown target and constraint quantities that leverage observational data, interventional data and the SCM structure, capturing correlations with increasing levels of sophistication. This enables uncertainty quantification, thereby speeding up the identification of an optimal intervention while limiting the number of infeasible interventions, i.e. interventions that do not satisfy the constraints.
- We propose a constrained expected improvement acquisition function for sequentially selecting interventions based on both output optimization and constraint satisfaction. This enables finding the optimal solution with a smaller number of interventions.
- We evaluate cCBO on synthetic and real-world graphs with different SCM characteristics, showing how it successfully trades off achieving fast convergence and collecting a high percentage of feasible interventions.

2. Constrained Causal Global Optimization

We consider a system of observable random variables V with target variable Y ∈ V and intervenable variables I ⊆ V\Y (to simplify the notation, we write Y to denote the set {Y}, and similarly throughout the paper), and the problem of finding an intervention set X ⊆ I and intervention values x that optimize the expectation of Y while ensuring that the expectations of constrained variables C ⊆ V\Y are above/below certain values.

We assume that the system can be described by a structural causal model (SCM) (Pearl, 2000; Pearl et al., 2016) M = ⟨V, U, F, p(U)⟩, where U is a set of exogenous unobserved random variables with distribution p(U) = ∏_{U∈U} p(U), and F = {f_V}_{V∈V} is a set of deterministic functions such that V = f_V(pa(V), U_V), with pa(V) ⊆ V\V and U_V ⊆ U, for all V ∈ V. M has an associated causal graph G with nodes V, a directed edge from V to W if V ∈ pa(W) (in which case V is called a parent of W), and a bi-directed edge between V and W if U_V ∩ U_W ≠ ∅ (U_V ∩ U_W is an unobserved confounder between V and W). We assume that G is acyclic, namely that there are no directed paths starting and ending at the same node (a directed path is a sequence of linked nodes whose edges are directed and point from preceding towards following nodes in the sequence). Given a causal graph G, we say that M is compatible with G if all edges that are in the causal graph associated with M are also in G. We refer to the joint distribution of V determined by p(U), which we denote by p(V), as the observational distribution, and to an observation from p(V) as an observational data sample.

An intervention on V ∈ I setting its value to v, denoted by do(V = v), corresponds to modifying M by replacing f_V(pa(V), U_V) with v. We refer to the joint distribution of V in the modified SCM under such an intervention, indicated by p_{do(V=v)}(V), as the interventional distribution, and to an observation from it as an interventional data sample. Notice that, in the case of no unobserved confounders, p_{do(V=v)}(V) = ∏_{W∈V\V} p(W | pa(W)) δ_{V=v}, with δ_{V=v} denoting a delta function centered at v. An intervention on a set of variables is defined similarly. Let μ^Y_{do(X=x)} := E_{p_{do(X=x)}}[Y] denote the expectation of Y w.r.t. the interventional distribution (this expectation is commonly indicated with E[Y | do(X = x)] (Pearl, 2000)); we refer to it as the target effect.
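To make this notation concrete, the following is a minimal Python sketch of an SCM, a hard do-intervention, and a Monte Carlo estimate of a target and a constraint effect. The structural equations follow the SYNTHETIC-1 SCM reported in Appendix C.1 (X = U_X, Z = exp(−X) + U_Z, Y = cos(Z) − exp(−Z/20) + U_Y, with standard-normal exogenous noise); the signs are reconstructed from the extracted text and the helper names are illustrative rather than taken from the released cCBO code base.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n, do=None):
    """Sample n observations from a SYNTHETIC-1-style SCM, optionally under do(.)."""
    do = do or {}
    u = {v: rng.normal(size=n) for v in ("X", "Z", "Y")}  # p(U): independent N(0, 1)
    x = do.get("X", u["X"])                               # X = U_X unless intervened
    z = do.get("Z", np.exp(-x) + u["Z"])                  # Z = exp(-X) + U_Z unless intervened
    y = np.cos(z) - np.exp(-z / 20.0) + u["Y"]            # Y = cos(Z) - exp(-Z/20) + U_Y
    return {"X": x * np.ones(n), "Z": z * np.ones(n), "Y": y}

def interventional_mean(var, do, n=100_000):
    """Monte Carlo estimate of mu^var_{do(.)} = E_{p_do}[var]."""
    return sample_scm(n, do=do)[var].mean()

# Target and constraint effects for the intervention do(X = x).
x_val = 1.0
mu_y = interventional_mean("Y", {"X": x_val})   # target effect     mu^Y_{do(X=x)}
mu_z = interventional_mean("Z", {"X": x_val})   # constraint effect mu^Z_{do(X=x)}
print(f"do(X={x_val}): est. mu_Y = {mu_y:.3f}, est. mu_Z = {mu_z:.3f}")
```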
The goal of the investigator is to find a set X ⊆ I and values x that optimize the target effect while ensuring that μ^C_{do(X=x)}, which we refer to as the constraint effect, is greater/smaller than a threshold λ_C, for each C ∈ C ⊆ V\Y. Target and constraint effects are unknown. The constraint for a variable X ∈ C ∩ X can be enforced directly by restricting the range of intervention values D(X) to be in accordance with the threshold. Instead, the constraints for C_X := C\(C ∩ X) need to be included in the optimization problem. The constrained optimization problem can thus be formalized as follows.

Definition 2.1 (cCGO Problem). The constrained causal global optimization (cCGO) problem is the problem of finding a tuple (X*, x*) such that

  (X*, x*) = argmin_{X ∈ P(I), x ∈ D(X)} μ^Y_{do(X=x)}   s.t.   μ^{C_X}_{do(X=x)} ≥ λ^{C_X},    (1)

where P(I) indicates the power set of I (excluding ∅), D(X) := ×_{X∈X} D(X), and μ^{C_X}_{do(X=x)} and λ^{C_X} denote the constraint effects and the corresponding thresholds on all variables in C_X. Depending on the application, the investigator might instead be interested in maximizing μ^Y_{do(X=x)} and/or in ensuring that some or all constraint effects are smaller than the thresholds; in such cases Eq. (1) would need to be adjusted accordingly. The cCGO problem extends the causal global optimization (CGO) problem defined in Aglietti et al. (2020) to incorporate constraints.

3. Constrained Causal Bayesian Optimization

We propose to solve the cCGO problem using the constrained causal Bayesian optimization (cCBO) method summarized in Algorithm 1, which assumes a known causal graph G, continuous variables V, and full observation of the system after an intervention. First, the search space is reduced to a subset M̃_{C∪Y,G} of P(I) via the cMISReduce procedure described in Section 3.1. Then, the restricted cCGO problem is solved by modelling, for each X ∈ M̃_{C∪Y,G}, the unknown target and constraint effects μ_{do(X=x)} := (μ^Y_{do(X=x)}, μ^{C_X}_{do(X=x)}) using Gaussian processes (GPs) (Rasmussen & Williams, 2006) g_X(x) := {g^V_X(x)}_{V∈C_X∪Y}, as described in Section 3.2, and via the following sequential strategy. At each trial t = 1, ..., T: (1) a set of intervenable variables X_t and intervention values x_t are selected via a constrained expected improvement (cEI) acquisition function that accounts for both the target and constraint effects, as described in Section 3.3; (2) a set of S interventional data samples is obtained and used to compute a sample mean estimate μ̂_{do(X_t=x_t)} of μ_{do(X_t=x_t)}; (3) (x_t, μ̂_{do(X_t=x_t)}) is added to the interventional dataset D^I_{X_t} of X_t; (4) the posterior distribution of g_{X_t}(x), denoted by τ(g_{X_t} | D^I_{X_t}), is updated. Once the maximum number of trials T is reached, a tuple (X*, x*) giving the minimum feasible target effect value in D^I := {D^I_X}_{X∈I} is returned.

Algorithm 1: cCBO
  Inputs: G, I, C, Y, D^I := {D^I_X}_{X∈I}, D^O, λ^C, S, T
  M̃_{C∪Y,G} ← cMISReduce(G, I, C, Y, D^O)
  Initialise GPs g_X(x), for all X ∈ M̃_{C∪Y,G}, with D^I_X and D^O
  for t = 1, ..., T do
    1. Select intervention (X_t, x_t) with cEI
    2. Obtain samples {c^(s)_{X_t}, y^(s)}_{s=1}^S from the distribution p_{do(X_t=x_t)}(C_{X_t}, Y), and use them to compute the sample mean estimate μ̂_{do(X_t=x_t)}
    3. Add (x_t, μ̂_{do(X_t=x_t)}) to D^I_{X_t}
    4. Update τ(g_{X_t} | D^I_{X_t})
  end for
  Output: (X*, x*) with minimum feasible μ̂^Y_{do(X*=x*)} over D^I

3.1. Reducing the Search Space

The cardinality of the power set, |P(I)|, grows exponentially with the cardinality of I, thus solving the cCGO problem by exploring the entire set could be prohibitively expensive. Even when |P(I)| is small, reducing the search space simplifies the problem, as a smaller number of comparisons between constraint and target effects is required.
In this section, we propose a procedure, which we refer to as cMISReduce, that leverages invariances of the target and constraint effects w.r.t. different intervention sets to eliminate intervention sets of higher cardinality. This is achieved by extending the theory of minimal intervention sets of Lee & Bareinboim (2018) to account for the presence of constraints (a summary of the procedure and all proofs are given in Appendix A). The search space is first reduced from P(I) to the set of constrained minimal intervention sets (cMISs) relative to (C ∪ Y, G), denoted by M_{C∪Y,G}.

Definition 3.1 (cMIS). A set X ⊆ I is said to be a cMIS relative to (C ∪ Y, G) if there is no X′ ⊂ X with C_{X′} = C_X such that μ^W_{do(X=x)} = μ^W_{do(X′=x′)}, where x′ indicates the subset of x corresponding to the variables X′, for all x ∈ D(X), all W ∈ C_X ∪ Y, and all SCMs compatible with G.

The following theorem justifies this reduction.

Theorem 3.2 (Sufficiency of M_{C∪Y,G}). M_{C∪Y,G} contains a solution of the cCGO problem (if a solution exists), for all SCMs compatible with G.

[Figure 2: (a) Causal graph with I = {A, D, E} and C = {C, D, E}. (b) Causal graph with I = C = {X, Z}.]

Let an(W, G) be the set of ancestors of W in G, i.e. the nodes with a directed path to W; for a set of nodes W, an(W, G) := ∪_{W∈W} an(W, G); and let G_X̄ be the graph with all incoming edges onto all elements of X removed. The following proposition gives a graphical criterion for identifying M_{C∪Y,G}.

Proposition 3.3 (Characterization of cMIS). X ⊆ I is a cMIS relative to (C ∪ Y, G) ⟺ X ⊆ an(C ∪ Y, G_X̄) ∪ (C ∩ X).

If an observational dataset D^O is available, the search space is further reduced from M_{C∪Y,G} to M̃_{C∪Y,G} by checking, for each X ∈ M_{C∪Y,G}, whether all reducible variables in C_X are null-feasible.

Definition 3.4 (Reducibility). C ∈ C_X is reducible if μ^C_{do(X=x)} = μ^C := E_p[C], for all x ∈ D(X).

Definition 3.5 (Null-feasibility). C ∈ C_X is null-feasible if μ^C ≥ λ_C.

If X ∩ an(C, G_X̄) = ∅, then C ∈ C_X is reducible. Indeed, if X ∩ an(C, G_X̄) = ∅, by Lemma A.1 in Appendix A (with W₁ = X, W₂ = C, W₃ = ∅) we have C ⊥_{G_X̄} X, which implies μ^C_{do(X=x)} = μ^C by rule 3 of do-calculus (Pearl, 2000). If a reducible C ∈ C_X is not null-feasible, X is removed from the search space, as it is not a solution of the cCGO problem. If instead C is null-feasible, all X′ ⊃ X in M_{C∪Y,G} that are of the form X′ = X ∪ X₁ with X₁ ∩ an(C_{X′}\C ∪ Y, G_X̄′) = ∅, and for which (i) C_{X′} = C_X or (ii) all variables in C_X\C_{X′} are reducible and null-feasible, are eliminated from the search space. Indeed, in such cases, if X′ were a solution of the cCGO problem, then X would also be a solution as: (a) for all W ∈ (C_X\C)\(C ∩ X₁) ∪ Y, μ^W_{do(X′=x′)} = μ^W_{do(X=x)} (Lemma A.1 in Appendix A, with W₁ = X₁, W₂ = C_{X′}\C ∪ Y, W₃ = X, gives C_{X′}\C ∪ Y ⊥_{G_X̄′} X₁ | X, which implies μ^W_{do(X′=x′)} = μ^W_{do(X=x)} by rule 3 of do-calculus); (b) for all W ∈ C ∩ X₁, the constraint effects are satisfied for X, as C ∩ X₁ = C_X\C_{X′} and these variables are reducible and null-feasible by (ii).

Example. We describe the search space reduction for the causal graph of Fig. 2(a). In this graph, M_{C∪Y,G} = P(I) = {{A}, {D}, {E}, {A, D}, {A, E}, {D, E}, {A, D, E}}, as X ⊆ an(C ∪ Y, G_X̄) ∪ (C ∩ X) for all X ∈ P(I) (e.g. {A} ⊆ an(C ∪ Y, G_Ā) ∪ (C ∩ {A})). Consider X = {D, E} ∈ M_{C∪Y,G} and the constrained variable C. X ∩ an(C, G_X̄) = ∅, and therefore C is reducible. If C is not null-feasible, X is removed from the search space. If instead C is null-feasible, the set X′ = {A, D, E} ⊃ X in M_{C∪Y,G} is of the form X′ = X ∪ {A} with A ∉ an(C_{X′}\C ∪ Y, G_X̄′); therefore X′ is removed from the search space. All other sets in M_{C∪Y,G} have constrained variables that are non-reducible and thus need to be included in M̃_{C∪Y,G}.
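As an illustration of the graphical criterion in Proposition 3.3, the sketch below checks whether each intervention set is a cMIS for the chain graph of Fig. 2(b) (X → Z → Y, with I = C = {X, Z}), for which Appendix C.1 states that every set in P(I) is a cMIS. It is a minimal reconstruction using networkx, not the cMISReduce implementation from the released code, and the helper names are ours.

```python
from itertools import chain, combinations
import networkx as nx

def mutilated(g: nx.DiGraph, x: set) -> nx.DiGraph:
    """G_bar{X}: remove all edges incoming into nodes of X."""
    g_x = g.copy()
    g_x.remove_edges_from([(u, v) for (u, v) in g.edges if v in x])
    return g_x

def ancestors(g: nx.DiGraph, nodes: set) -> set:
    """an(W, G): union of the ancestors of each node in W (excluding the nodes themselves)."""
    return set().union(*(nx.ancestors(g, w) for w in nodes))

def is_cmis(g: nx.DiGraph, x: set, c: set, y: str) -> bool:
    """Proposition 3.3: X is a cMIS relative to (C u Y, G) iff X is a subset of
    an(C u Y, G_bar{X}) u (C n X)."""
    g_x = mutilated(g, x)
    return x <= ancestors(g_x, c | {y}) | (c & x)

# Fig. 2(b): X -> Z -> Y, with intervenable and constrained variables I = C = {X, Z}.
graph = nx.DiGraph([("X", "Z"), ("Z", "Y")])
I, C, Y = {"X", "Z"}, {"X", "Z"}, "Y"
power_set = [set(s) for s in chain.from_iterable(combinations(I, r) for r in range(1, len(I) + 1))]
print({frozenset(s): is_cmis(graph, s, C, Y) for s in power_set})
# Expected: every set in P(I) is a cMIS, i.e. M_{C u Y, G} = P(I) for this graph (Appendix C.1).
```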
3.2. Gaussian Process Surrogate Models

For each X ∈ M̃_{C∪Y,G} and V ∈ C_X ∪ Y, we model μ^V_{do(X=x)} with a GP g^V_X(x) ~ GP(m^V_X(x), S^V_X(x, x′)), as GPs allow constructing flexible surrogate models while enabling uncertainty quantification and closed-form updates. We propose one single-task GP (STGP) and two multi-task GPs (MTGP and G-MTGP), which capture the correlation across effects with increasing levels of sophistication. For V ≠ W, the single-task GP treats g^V_X(x) and g^W_X(x′) as independent, while the multi-task GPs model their correlation via a covariance matrix S^{V,W}_X(x₁, x₂), with (i, j)-th element given by E[g^V_X(x_i) g^W_X(x_j)] − E[g^V_X(x_i)] E[g^W_X(x_j)], either by assuming a common latent structure among the GPs (MTGP) or, for the setting of no unobserved confounders and under the assumption V = f_V(pa(V)) + U_V with p(U_V) = N(0, σ²_V), by explicitly exploiting the SCM structure (G-MTGP). We propose different prior parameter constructions for STGP and MTGP, including one that leverages the availability of an observational dataset D^O (STGP+ and MTGP+). The different GP constructions allow the investigator to model the target and constraint effects both in settings where no information about the system is available and black-box models are preferred (STGP and MTGP), and in settings in which one can leverage different sources of information and integrate them in a structured prior formulation that quantifies uncertainty in a principled way (STGP+, MTGP+, and G-MTGP).

For all surrogate models, after an intervention (X, x) is selected by the acquisition function, as described in Section 3.3, a set of S interventional data samples {c^(s)_X, y^(s)}_{s=1}^S from p_{do(X=x)}(C_X, Y) is obtained. For each V ∈ C_X ∪ Y, this set is used to form a sample mean estimate μ̂^V_{do(X=x)} of μ^V_{do(X=x)}, which is treated as a noisy realization of g^V_X(x) with additive Gaussian noise. The tuple (x, μ̂_{do(X=x)}) is then added to D^I_X and the posterior distribution τ(g_X | D^I_X) is computed via standard GP updates (Rasmussen & Williams, 2006). Full details are given in Appendix B.

STGP. For the single-task GP, we either assume m^V_X(x) = 0 and a radial basis function (RBF) kernel S^V_X(x, x′) = σ²_f exp(−||x − x′||² / (2l²)), with (σ²_f, l) hyper-parameters, or the prior construction proposed in CBO (Aglietti et al., 2020). In the latter case, D^O is used to obtain estimates of the target and constraint effects, which are then used as prior mean functions (we refer to this variant as STGP+).

MTGP. Our first multi-task GP, inspired by the linear coregionalization model of Alvarez et al. (2011), assumes that the target and constraint GPs are linear combinations of shared independent GPs, i.e. g^V_X(x) = Σ_{q=1}^Q a^V_{X,q} u_{X,q}(x) with u_{X,q}(x) ~ GP(m_{X,q}(x), S_{X,q}(x, x′)). In this case, the variance and covariance terms across functions and intervention values are given by S^V_X(x, x′) = Σ_{q=1}^Q (a^V_{X,q})² S_{X,q}(x, x′) and S^{V,W}_X(x, x′) = Σ_{q=1}^Q a^V_{X,q} a^W_{X,q} S_{X,q}(x, x′), respectively. The scalar parameters a^V_{X,q} are learned with a standard type-2 maximum likelihood approach, together with the kernel hyper-parameters. We either consider m_{X,q}(x) = 0 and an RBF kernel for each S_{X,q}(x, x′), or the prior construction using D^O as in STGP+ (we refer to this variant as MTGP+).
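The following is a minimal numpy sketch of the coregionalization covariance used by the MTGP surrogate: since each effect is a linear combination of Q shared latent GPs, the joint covariance over (effect, input) pairs is Σ_q a_q a_qᵀ ⊗ S_q. It is an illustration under simple assumptions (zero means, RBF latent kernels, fixed coefficients) rather than the paper's implementation, where the coefficients and kernel hyper-parameters are learned by type-2 maximum likelihood.

```python
import numpy as np

def rbf(x1, x2, sigma_f=1.0, length=1.0):
    """RBF kernel sigma_f^2 * exp(-||x - x'||^2 / (2 l^2)) on 1-d inputs."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return sigma_f**2 * np.exp(-d2 / (2.0 * length**2))

def lcm_covariance(x, coeffs, lengths):
    """Linear coregionalization covariance: sum_q (a_q a_q^T) kron S_q(x, x).

    coeffs: array of shape (Q, M) with entries a^V_{X,q} for M effects (target + constraints).
    lengths: Q lengthscales, one per latent GP u_{X,q}.
    Returns an (M * n) x (M * n) covariance over all (effect, input) pairs.
    """
    covs = [np.kron(np.outer(a_q, a_q), rbf(x, x, length=l)) for a_q, l in zip(coeffs, lengths)]
    return np.sum(covs, axis=0)

# Two effects (e.g. target Y and one constraint C) sharing Q = 2 latent GPs.
x = np.linspace(-1.0, 1.0, 25)
A = np.array([[1.0, 0.8],    # a^Y_{X,1}, a^C_{X,1}
              [0.3, -0.5]])  # a^Y_{X,2}, a^C_{X,2}
K = lcm_covariance(x, A, lengths=[0.5, 1.5])

# Draw correlated prior samples of (g^Y_X, g^C_X) from N(0, K).
rng = np.random.default_rng(0)
sample = rng.multivariate_normal(np.zeros(K.shape[0]), K + 1e-8 * np.eye(K.shape[0]))
g_y, g_c = sample[:x.size], sample[x.size:]
print("prior corr(g_Y, g_C) over the grid:", np.corrcoef(g_y, g_c)[0, 1].round(2))
```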
G-MTGP. For the setting of no unobserved confounders and under the assumption V = f_V(pa(V)) + U_V with p(U_V) = N(0, σ²_V), we propose to model each f_V as an independent GP with an RBF kernel, f_V(pa(v)) ~ GP(0, S^V_RBF(pa(v), pa(v)′)), where pa(v) denotes a value taken by pa(V), and to fit it using D^O. Consider the intervention do(X = x) and let U^{an(V)}_X denote the subset of U corresponding to the ancestors of V in G that are not d-separated from V by X, and similarly for f^{an(V)}_X. We can write V as an explicit function of U^{an(V)}_X and f^{an(V)}_X by recursively replacing parents with their functional form in the modified SCM under do(X = x). Taking the expectation w.r.t. U^{an(V)}_X therefore gives g^V_X(x) as a function of f^{an(V)}_X. For example, for the causal graph in Fig. 2(b) with X = U_X, Z = f_Z(X) + U_Z and Y = f_Y(Z) + U_Y, we can write Y = f_Y(f_Z(X) + U_Z) + U_Y, which gives g^Y_X(x) = E_{p(U_Z)}[f_Y(f_Z(x) + U_Z)] and g^Z_X(x) = f_Z(x). We can then obtain realizations {g^{V,(s)}_X}_{s=1}^S of g^V_X by using samples of f^{an(V)}_X and U^{an(V)}_X, and use them to form an approximation of S^{V,W}_X(x₁, x₂) as

  (1/S) Σ_{s=1}^S g^{V,(s)}_X(x₁) g^{W,(s)}_X(x₂) − [(1/S) Σ_{s=1}^S g^{V,(s)}_X(x₁)] [(1/S) Σ_{s=1}^S g^{W,(s)}_X(x₂)],

and similarly for S^V_X(x, x′) and m^V_X(x).

3.3. Acquisition Functions

To select interventions accounting for both the target and constraint effects, we propose acquisition functions based on those used for constrained BO (Gardner et al., 2014) and noisy BO (Letham et al., 2019). We define the constrained expected improvement (cEI) per unit of intervention cost at an input point x as

  cEI_X(x) := E_{τ(g_X | D^I_X)}[ (max(0, g*_Y − g^Y_X(x)) / |X|) · 1{g^{C_X}_X ≥ λ^{C_X}} ],

where g*_Y is the minimum feasible value attained by g^Y_X across the interventional dataset D^I, and 1{g^{C_X}_X ≥ λ^{C_X}} is an indicator variable equal to one if g^{C_X}_X ≥ λ^{C_X} and to zero otherwise. The division by |X| is due to assuming an intervention cost for X equal to its cardinality. Alternative costs can be considered as long as they are greater than zero.

STGP. When using a single-task GP as surrogate model, we can exploit the factorization τ(g_X | D^I_X) = τ(g^Y_X | D^I_X) ∏_{C∈C_X} τ(g^C_X | D^I_X) to obtain

  cEI_X(x) = (1/|X|) E_{τ(g^Y_X | D^I_X)}[max(0, g*_Y − g^Y_X(x))] ∏_{C∈C_X} P(g^C_X ≥ λ_C), with P(g^C_X ≥ λ_C) = 1 − Φ((λ_C − m^C_X(x)) / √(S^C_X(x, x))),

where Φ(·) denotes the CDF of a standard Gaussian distribution. In the case of noiseless observations of g^V_X(x), g*_Y is known, thus we can compute the first term in closed form as

  (g*_Y − m^Y_X(x)) Φ(z) + √(S^Y_X(x, x)) φ(z), with z = (g*_Y − m^Y_X(x)) / √(S^Y_X(x, x)),

where φ(·) denotes the PDF of a standard Gaussian distribution. In the case of noisy observations of g^V_X(x), g*_Y is unknown, as we observe noisy values of the target effects. We can get an estimate of g*_Y using samples from τ(g_X | D^I_X) for all X ∈ M̃_{C∪Y,G}, and use the estimate to compute cEI_X(x) in closed form using the terms derived above. For every X ∈ M̃_{C∪Y,G}, the values of cEI_X(x) obtained with different g*_Y are then averaged to integrate out the uncertainty on the optimal feasible value observed across all intervention sets.

MTGP. When using a multi-task model, cEI_X(x) cannot be computed in closed form as τ(g_X | D^I_X) does not factorize. We thus compute it via Monte Carlo integration, with a procedure similar to the one used in the noisy STGP setting.
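Under the single-task model, the factorization above gives a closed form that is straightforward to code. The sketch below assumes the noiseless case (known best feasible value g*_Y), Gaussian posteriors summarized by their means and variances, and constraints of the form g^C_X ≥ λ_C; it follows the formulas in this section but is not the released implementation, and the function name is ours.

```python
import numpy as np
from scipy.stats import norm

def constrained_ei(m_y, s2_y, m_c, s2_c, lambdas, g_best, cost):
    """Closed-form cEI for a single-task GP surrogate (minimization of the target Y).

    m_y, s2_y: posterior mean/variance of g^Y_X at the candidate x.
    m_c, s2_c: posterior means/variances of g^C_X for the constraints in C_X.
    lambdas:   thresholds lambda_C (feasibility taken here as g^C_X >= lambda_C).
    g_best:    best feasible target value observed so far (g*_Y).
    cost:      intervention cost, e.g. |X|.
    """
    s_y = np.sqrt(s2_y)
    z = (g_best - m_y) / s_y
    ei = (g_best - m_y) * norm.cdf(z) + s_y * norm.pdf(z)   # E[max(0, g*_Y - g^Y_X(x))]
    prob_feasible = np.prod(1.0 - norm.cdf(
        (np.asarray(lambdas) - np.asarray(m_c)) / np.sqrt(np.asarray(s2_c))
    ))                                                      # prod_C P(g^C_X >= lambda_C)
    return ei * prob_feasible / cost

# Example: one constraint, candidate with posterior N(-0.2, 0.3^2) for Y and N(1.5, 0.5^2) for C.
print(constrained_ei(m_y=-0.2, s2_y=0.09, m_c=[1.5], s2_c=[0.25],
                     lambdas=[1.0], g_best=0.0, cost=1))
```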
3.4. Computational Aspects

The computational cost of cCBO is dominated by the algebraic operations needed to compute the posterior parameters of the GP models of the target and constraint effects. Let N_I denote the largest among the cardinalities of the interventional datasets collected for the sets in M̃_{C∪Y,G}, i.e. N_I = max_{X ∈ M̃_{C∪Y,G}} |D^I_X|, and let M denote the highest among the numbers of target and constraint effects for the sets in M̃_{C∪Y,G}, i.e. M = max_{X ∈ M̃_{C∪Y,G}} (1 + |C_X|). As the GPs corresponding to different intervention sets are updated independently, the computational cost scales as O(N_I³) when using a single-task model and as O(M³N_I³) when using a multi-task model. Independent GP updates also imply linear scaling of the computational complexity with respect to the cardinality of M̃_{C∪Y,G}. Therefore, a larger G might induce a higher number of sets to explore, but does not induce higher computational complexity per set.

In terms of convergence to the true global optimum, cCBO inherits the properties of BO algorithms. While any alternative acquisition function for constrained BO could be used within cCBO, in this work we resort to a constrained expected improvement function due to its computational tractability. The expected improvement acquisition function was shown to have strong theoretical guarantees (Vazquez & Bect, 2010; Bull, 2011) while performing well in practice (Snoek et al., 2012). However, performance guarantees have yet to be established for the constrained version.

3.5. Related Work

cCBO is related to the vast literature on constrained BO methods, which find feasible solutions either by using an acquisition function that accounts for the probability of feasibility (Gardner et al., 2014; Gelbart et al., 2014; Griffiths & Hernández-Lobato, 2017; Hernández-Lobato et al., 2014; Keane et al., 2008; Schonlau et al., 1998; Sóbester et al., 2014; Tran et al., 2019), by transforming the constrained problem into an unconstrained one (Ariafar et al., 2019; Picheny et al., 2016), or by exploiting trust regions (Eriksson & Poloczek, 2021). While these works disregard the causal aspect of the optimization and model the unknown functions independently, multi-task surrogate models have been considered by multi-objective BO methods (Dai et al., 2020; Feliot et al., 2017; Hakhamaneshi et al., 2021; Mathern et al., 2021; Swersky et al., 2013) and, more recently, in the context of safe BO (Bergmann & Graichen, 2020; Berkenkamp et al., 2016; 2021; Kirschner et al., 2019; Sui et al., 2015; 2018). cCBO is also related to works combining causality with decision-making frameworks, which have mainly focused on finding optimal interventions using observational data (Atan et al., 2018; Håkansson et al., 2020; Zhang et al., 2012) or on designing interventions for causal discovery (Tigas et al., 2022). The idea of identifying optimal interventions through sequential experimentation has been explored in causal bandits (Lattimore et al., 2016), causal reinforcement learning (Zhang, 2020) and, more recently, in CBO (Aglietti et al., 2020) and model-based causal BO (MCBO; Sussex et al., 2023). All these works tackle unconstrained settings and disregard the effects that interventions optimizing a target variable might have on the constrained variables.

4. Experiments

We evaluate cCBO with surrogate models STGP, STGP+, MTGP, MTGP+, and G-MTGP on the causal graphs of Fig. 2(b) (SYNTHETIC-1), Fig. 2(a) (SYNTHETIC-2), Fig. 1(b) (HEALTH), and Fig. 1(a) (PROTEIN-SIGNALING). We assume an initial D^I that includes one point per intervention set, and consider different settings with respect to the SCM characteristics, null-feasibility, |M̃_{C∪Y,G}|, and N^O = |D^O|.
See Appendix C for full experimental details; code for reproducing the experiments is available at https://github.com/deepmind/ccbo. To the best of our knowledge, there are no other constrained optimization methods in the literature that exploit causal structure. Therefore, we compare to the closest method aiming at solving the non-causal constrained global optimization (cGO) problem, namely the constrained BO algorithm (cBO) proposed by Gardner et al. (2014). cBO intervenes on all variables in I simultaneously, models the target and constraint effects via independent GPs, and selects interventions by maximizing the constrained EI acquisition function.

[Figure 3: SYNTHETIC-1 with NO = 500 (top row) and NO = 100 (bottom row) and λZ = 2. Left: Causal graphs. Center: Convergence to the cCGO (solid red line) and cGO (dotted red line) optima. Lines give average results across different initializations of D^I. Shaded areas represent standard deviation. Right: Average percentage of feasible interventions collected over trials.]

For each model, we show the average convergence to the optimal target effect (± standard deviation, given by shaded areas) and the mean percentage of feasible interventions collected over trials, across 20 initializations of D^I and for varying values of N^O. The combination of these two metrics allows us to understand how each method balances convergence speed and feasibility. To gain more insight into this balance, we also include in the analysis CBO, an algorithm that randomly picks interventions (RANDOM) and, for SYNTHETIC-1 and SYNTHETIC-2, MCBO (see Appendix C). Notice that, as CBO and MCBO solve the CGO problem, a comparison with them is only relevant for cases in which the CGO and cCGO optima are equal.

4.1. Synthetic Causal Graphs

SYNTHETIC-1. For the SYNTHETIC-1 graph with λ_X = 1, λ_Z ∈ {2, 10}, and N^O = 500, 100, 10 observational data samples from the SCM in Appendix C.1, we obtain M̃_{C∪Y,G} = {{X}, {Z}}. With N^O = 500 and λ_Z = 2, using a surrogate model that exploits D^O when constructing the prior GP parameters for the functions in g_X(x), as in STGP+, MTGP+ and G-MTGP, leads to faster average convergence (Fig. 3, top row). The convergence speed is further improved when capturing the covariance structure among the target and constraint effects, as done by MTGP+ and G-MTGP. The high convergence speed of STGP+, MTGP+ and G-MTGP is associated with a higher percentage of feasible interventions collected over trials compared to the other methods. Similar results are obtained with N^O = 100 and λ_Z = 2 (Fig. 3, bottom row) and with N^O = 10 and λ_Z = 2 (Fig. 9 in Appendix C.1). In these cases, the lower cardinality of D^O leads to a less accurate estimation of the effects, which affects the prior mean functions of STGP+, MTGP+ and G-MTGP and the kernel function of G-MTGP. In turn, this leads to a lower number of feasible interventions and a lower convergence speed, particularly for G-MTGP. As a consequence, MTGP+ outperforms all other models in this setting.
By intervening on all constrained variables (I = C) with intervention domains that are in accordance with the threshold values, cBO only collects feasible interventions. However, it converges to the higher cGO optimum, as the causal structure is disregarded. Interestingly, the results obtained with N^O = 500 and λ_Z = 10 (Fig. 10, top row, in Appendix C.1), for which the cCGO and CGO optima are equal, making CBO, MCBO and cCBO comparable, show that both CBO and MCBO converge at a slower pace compared to MTGP+ and G-MTGP while collecting a lower number of feasible interventions. Indeed, CBO and MCBO disregard the values taken by the constrained variables. Finally, when λ_Z = 10, G-MTGP and MTGP+ show fast convergence and a high percentage of feasible interventions collected over trials. This is the case even when N^O = 100 (Fig. 11 in Appendix C.1).

SYNTHETIC-2. For the SYNTHETIC-2 graph with λ_C = 10 for all C ∈ C, and N^O = 500, 100, 10 observational data samples from the SCM in Appendix C.2, we obtain null-feasibility for all C ∈ C and thus M̃_{C∪Y,G} = P(I)\{{A, D, E}}.

[Figure 4: SYNTHETIC-2 with NO = 500 (top row) and NO = 100 (bottom row). Left: Causal graphs. Center: Convergence to the cCGO (solid red line) and cGO (dotted red line) optima. Lines give average results across different initializations of D^I. Shaded areas represent standard deviation. Right: Average percentage of feasible interventions collected over trials.]

As in SYNTHETIC-1, when N^O = 500, using a prior GP construction that exploits D^O leads to faster convergence (Fig. 4, top row). In particular, the accurate estimation of the target and constraint effects with D^O, which are used as prior mean functions, favours the prior formulation of STGP+, which identifies the optimal interventions after 5 trials. However, by disregarding the correlation among the effects, STGP+ collects a higher percentage of infeasible interventions compared to MTGP+ and G-MTGP. In this setting, accurate estimation of the functions in the SCM translates to better uncertainty estimation around the effects given by the kernel function of G-MTGP. Therefore, G-MTGP successfully trades off improvement and feasibility, showing a similar convergence to MTGP+ but the highest percentage of feasible interventions collected. With N^O = 100 (Fig. 4, bottom row) and N^O = 10 (Fig. 12 of Appendix C.2), the estimation of the target and constraint effects and of the functions in the SCM from D^O deteriorates, leading MTGP+, which learns the correlations directly from D^I, to outperform all other methods. As in SYNTHETIC-1, intervening on all variables in I simultaneously blocks the propagation of causal effects in the graph, thus leading cBO to achieve a sub-optimal solution compared to cCBO and a lower number of feasible interventions compared to G-MTGP.

4.2. Real-world Causal Graphs

HEALTH. For the HEALTH graph, we use the SCM in Ferro et al. (2015) (given in Appendix C.3). We consider the constraint that BMI must be lower than 25, which is considered the maximum healthy level.
When both N^O = 100 and N^O = 10, BMI is not null-feasible, thus all sets in M̃_{C∪Y,G} = {{CI}, {Aspirin, CI}, {Statin, CI}, {Aspirin, Statin, CI}} include CI. In this setting, I is the optimal intervention set, and therefore the cCGO and cGO optima coincide. When N^O = 100, cBO is significantly slower than G-MTGP and, by disregarding causal information, collects a lower percentage of feasible interventions over trials (Fig. 5, top row). Even though the effects are simple linear functions, the stochasticity in D^O determined by p(U) is high (Fig. 15 of Appendix C.3) compared to the previous examples and obscures their estimation. This leads to less accurate prior mean functions for G-MTGP, STGP+ and MTGP+, which translates to a lower convergence speed for the latter two models. Despite the less accurate prior mean functions, G-MTGP achieves the fastest convergence and the highest percentage of feasible interventions (100%) by capturing the correlation among target and constraint effects induced by the SCM, thus properly quantifying uncertainty around them. Collecting 100% feasible interventions is particularly important in this setting, as infeasible interventions might negatively affect patients' health status. While G-MTGP performs well in settings with N^O = 100, it does not reach convergence when N^O = 10 (Fig. 5, bottom row). Indeed, the prior mean and kernel function estimation from D^O deteriorates when the size of the observational dataset is very small, leading to an incorrect uncertainty quantification around the effects and preventing the exploration of the interventional space. This is also observed for STGP+ and MTGP+, as their prior mean function is affected by the incorrect estimation of the SCM functions and thus of the target and constraint effects. This leads cBO to outperform all other methods in this setting.

[Figure 5: HEALTH with NO = 100 (top row) and NO = 10 (bottom row). Left: Causal graphs. Center: Convergence to the cCGO (solid red line) and cGO (dotted red line, coinciding with the solid red line in this experiment) optima. Lines give average results across different initializations of D^I. Shaded areas represent standard deviation. Right: Average percentage of feasible interventions collected over trials.]

PROTEIN-SIGNALING. For the PROTEIN-SIGNALING graph, we use the observational dataset from Sachs et al. (2005) to construct an SCM (see Appendix C.4). Given N^O = 100, 10 observational data samples from the SCM, all constrained variables are null-feasible, giving M̃_{C∪Y,G} = {{PKC}, {PKA}, {Mek}, {PKC, PKA}, {PKC, Mek}, {PKA, Mek}}. All methods converge to the same optimum (Fig. 6). Indeed, the target effect achieved by intervening on I is equal to the one achieved by intervening on the optimal intervention set {PKA, Mek} (this is the result of Erk being independent of PKC and Akt given Mek and PKA). In addition, in this example C_I = ∅, thus by intervening on I with interventional domains that are in accordance with the threshold values cBO achieves 100% average feasibility. Overall, cBO performs comparably to G-MTGP and MTGP+ (Fig. 6, top row)
when N^O = 100, while CBO is faster but selects more infeasible interventions (Fig. 18 in Appendix C.4). Despite achieving slower convergence than the single-task models, MTGP+ and G-MTGP select a higher percentage of feasible interventions (99.13%), with G-MTGP converging slightly faster than MTGP+. The feasibility aspect is again particularly important, as every non-feasible intervention results in the inhibition of PKA, which can impede the healthy functioning of the cell. Hence, a method that yields greater than 99% feasible interventions at the cost of a few (~10) additional interventions (MTGP+ and G-MTGP) is deemed preferable compared to methods that explore the intervention space more aggressively but yield a lower number of feasible interventions (CBO, STGP and STGP+). As in the other experiments, the convergence performance of G-MTGP and MTGP+ slightly deteriorates when N^O = 10 (Fig. 6, bottom row) due to an incorrect uncertainty quantification around the effects, which prevents the exploration of the interventional space. As in the results for N^O = 100, notice how cBO achieves 100% average feasibility, as all constrained variables are intervened on (C ⊆ I, C_I = ∅) with interventional ranges that are in accordance with the threshold values. The algorithm quickly identifies the cGO optimum, which is in this case equal to the cCGO optimum.

5. Discussion

In this paper, we introduced the cCBO approach for identifying interventions optimizing a target effect under constraints. We proposed different GP surrogate models for the target and constraint effects that leverage observational data D^O and the structure of the SCM. Our results show that incorporating D^O in the GP prior construction leads to faster identification of optimal interventions and a higher percentage of feasible interventions selected. They also show that further performance improvements can be obtained using multi-task GP models, which capture the correlation among target and constraint effects. Accounting for the SCM structure in the covariance matrix, as done by G-MTGP, is especially beneficial with more complex correlation structures or with a high number of effects. We found G-MTGP to successfully trade off improvement and feasibility, achieving fast convergence while collecting the highest percentage of feasible interventions. However, as G-MTGP exploits the fitted SCM functions in the computation of the prior parameters, when these functions cannot be learned accurately MTGP+ is preferable, as it can learn the correlation directly from D^I.

Our search space reduction procedure can be thought of as placing an intervention cost for each X ∈ P(I) that is equal to its cardinality. This procedure could be modified using different cost structures or augmented with budget constraints. For example, one could exclude from P(I) the intervention sets not satisfying a budget and then proceed by excluding sets according to the proposed procedure.
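As a simple illustration of the budget-constrained variant suggested above, the following sketch filters the power set of I by a cardinality-based intervention cost before any further reduction is applied; it is a toy example with made-up helper names, not part of the proposed procedure.

```python
from itertools import chain, combinations

def power_set(variables):
    """All non-empty intervention sets in P(I)."""
    return [set(s) for s in chain.from_iterable(
        combinations(variables, r) for r in range(1, len(variables) + 1))]

def within_budget(candidates, cost_fn, budget):
    """Keep only intervention sets whose cost does not exceed the budget."""
    return [x for x in candidates if cost_fn(x) <= budget]

I = {"Aspirin", "Statin", "CI"}
# Cardinality cost, as assumed by cCBO; any positive cost function could be used instead.
print(within_budget(power_set(I), cost_fn=len, budget=2))
```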
[Figure 6: PROTEIN-SIGNALING with NO = 100 (top row) and NO = 10 (bottom row). Left: Causal graphs. Center: Convergence to the cCGO (solid red line) and cGO (dotted red line, coinciding with the solid red line in this experiment) optima. Lines give average results across different initializations of D^I. Shaded areas represent standard deviation. Right: Average percentage of feasible interventions collected over trials.]

cCBO requires knowledge of the true causal graph underlying the system of interest, an assumption that might not be satisfied in practice. Using cCBO with an incorrect graph G̃ could lead to inaccurate GP prior parameters and an invalid search space reduction. Indeed, there is no guarantee that M̃_{C∪Y,G̃} would contain the optimal intervention set when the independence and ancestor-type relationships encoded in G̃ differ from those in G. Extending the methodology to deal with settings characterized by a misspecified or unknown G, similarly to e.g. Branchini et al. (2023), represents an important future direction. From a computational perspective, cCBO might suffer from poor scalability when |C| or the number of collected interventional data samples is high. In the latter case, sparse GP methods, e.g. inducing inputs (Titsias, 2009), could be used to significantly reduce the computational complexity of both single-task and multi-task GPs. Alternatively, one could consider sharing information across the surrogate models of different sets in M̃_{C∪Y,G} in order to reduce the total number of interventions. Evaluating cCBO in real-world settings where |C| is high would require simulators characterized by a high number of variables. In terms of convergence properties, while the constrained EI does not currently provide theoretical guarantees, one could extend cCBO to use a constrained acquisition function achieving asymptotic convergence, e.g. GP-UCB (Srinivas et al., 2009; Lu & Paulson, 2022), or a non-myopic acquisition function to avoid exploration issues (Lam & Willcox, 2017). Another future direction for safety-critical applications would be to extend the current framework to deal with safety constraints, in order to guarantee that the number of feasible interventions collected never falls below a critical value. Finally, while cCBO focuses on hard interventions in which variables are set to specific values, an interesting extension would be to consider more general soft interventions.

Acknowledgements

The authors would like to thank Alexis Bellot for his valuable feedback.

References

Aglietti, V., Lu, X., Paleyes, A., and González, J. Causal Bayesian optimization. In International Conference on Artificial Intelligence and Statistics, pp. 3155-3164, 2020.

Alvarez, M. A., Rosasco, L., and Lawrence, N. D. Kernels for vector-valued functions: A review. arXiv preprint arXiv:1106.6251, 2011.

Ariafar, S., Coll-Font, J., Brooks, D. H., and Dy, J. G.
ADMMBO: Bayesian optimization with unknown constraints using ADMM. Journal of Machine Learning Research, 20(123):1 26, 2019. Atan, O., Jordon, J., and Van der Schaar, M. Deep-treat: Learning optimal personalized treatments from observational data using neural networks. In AAAI Conference on Artificial Intelligence, pp. 2071 2078, 2018. Bergmann, D. and Graichen, K. Safe Bayesian optimization under unknown constraints. In IEEE Conference on Decision and Control, pp. 3592 3597, 2020. Berkenkamp, F., Schoellig, A. P., and Krause, A. Safe controller optimization for quadrotors with Gaussian processes. In IEEE International Conference on Robotics and Automation, pp. 491 496, 2016. Berkenkamp, F., Krause, A., and Schoellig, A. P. Bayesian optimization with safety constraints: Safe and automatic Constrained Causal Bayesian Optimization parameter tuning in robotics. Machine Learning, pp. 1 35, 2021. Branchini, N., Aglietti, V., Dhir, N., and Damoulas, T. Causal entropy optimization. In International Conference on Artificial Intelligence and Statistics, pp. 8586 8605, 2023. Bull, A. D. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12(10), 2011. Curi, S., Berkenkamp, F., and Krause, A. Efficient modelbased reinforcement learning through optimistic policy search and planning. Advances in Neural Information Processing Systems, 33:14156 14170, 2020. Dai, S., Song, J., and Yue, Y. Multi-task Bayesian optimization via Gaussian process upper confidence bound. In ICML 2020 Workshop on Real World Experiment Design and Active Learning, 2020. Eriksson, D. and Poloczek, M. Scalable constrained Bayesian optimization. In International Conference on Artificial Intelligence and Statistics, pp. 730 738, 2021. Feliot, P., Bect, J., and Vazquez, E. A Bayesian approach to constrained single-and multi-objective optimization. Journal of Global Optimization, 67(1-2):97 133, 2017. Ferro, A., Pina, F., Severo, M., Dias, P., Botelho, F., and Lunet, N. Use of statins and serum levels of prostate specific antigen. Acta Urol ogica Portuguesa, 32(2):71 77, 2015. Fr emin, C. and Meloche, S. From basic research to clinical development of MEK1/2 inhibitors for cancer therapy. Journal of Hematology & Oncology, 3(1):1 11, 2010. Gardner, J. R., Kusner, M. J., Xu, Z. E., Weinberger, K. Q., and Cunningham, J. P. Bayesian optimization with inequality constraints. In International Conference on Machine Learning, pp. 937 945, 2014. Gelbart, M. A., Snoek, J., and Adams, R. P. Bayesian optimization with unknown constraints. ar Xiv preprint ar Xiv:1403.5607, 2014. Griffiths, R.-R. and Hern andez-Lobato, J. M. Constrained Bayesian optimization for automatic chemical design. ar Xiv preprint ar Xiv:1709.05501, 2017. H akansson, S., Lindblom, V., Gottesman, O., and Johansson, F. D. Learning to search efficiently for causally nearoptimal treatments. In Advances in Neural Information Processing Systems, volume 33, pp. 1333 1344, 2020. Hakhamaneshi, K., Abbeel, P., Stojanovic, V., and Grover, A. JUMBO: Scalable multi-task Bayesian optimization using offline data. ar Xiv preprint ar Xiv:2106.00942, 2021. Hern andez-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. Predictive entropy search for efficient global optimization of black-box functions. ar Xiv preprint ar Xiv:1406.2541, 2014. Keane, A., Forrester, A., and Sobester, A. Engineering design via surrogate modelling: a practical guide. American Institute of Aeronautics and Astronautics, Inc., 2008. 
Kirschner, J., Mutny, M., Hiller, N., Ischebeck, R., and Krause, A. Adaptive and safe Bayesian optimization in high dimensions via one-dimensional subspaces. In International Conference on Machine Learning, pp. 3429 3438, 2019. Lam, R. and Willcox, K. Lookahead bayesian optimization with inequality constraints. In Advances in Neural Information Processing Systems, 2017. Lattimore, F., Lattimore, T., and Reid, M. D. Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems, 2016. Lee, S. and Bareinboim, E. Structural causal bandits: where to intervene? In Advances in Neural Information Processing Systems, pp. 2568 2578, 2018. Letham, B., Karrer, B., Ottoni, G., and Bakshy, E. Constrained Bayesian optimization with noisy experiments. Bayesian Analysis, 14(2):495 519, 2019. Lu, C. and Paulson, J. A. No-regret bayesian optimization with unknown equality and inequality constraints using exact penalty functions. IFAC-Papers On Line, 55(7):895 902, 2022. Mathern, A., Steinholtz, O. S., Sj oberg, A., Onnheim, M., Ek, K., Rempling, R., Gustavsson, E., and Jirstrand, M. Multi-objective constrained Bayesian optimization for structural design. Structural and Multidisciplinary Optimization, 63(2):689 701, 2021. Pearl, J. Causality: Models, Reasoning and Inference, volume 29. Springer, 2000. Pearl, J., Glymour, M., and Jewell, N. P. Causal Inference in Statistics: A Primer. John Wiley & Sons, 2016. Picheny, V., Gramacy, R. B., Wild, S., and Digabel, S. L. Bayesian optimization under mixed constraints with a slack-variable augmented Lagrangian. In Advances in Neural Information Processing Systems, pp. 1443 1451, 2016. Constrained Causal Bayesian Optimization Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. The MIT Press, 2006. Sachs, K., Perez, O., Pe er, D., Lauffenburger, D. A., and Nolan, G. P. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721): 523 529, 2005. Schonlau, M., Welch, W. J., and Jones, D. R. Global versus local search in constrained optimization of computer models. Lecture Notes-Monograph Series, pp. 11 25, 1998. Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, volume 25, 2012. S obester, A., Forrester, A. I., Toal, D. J., Tresidder, E., and Tucker, S. Engineering design applications of surrogateassisted optimization techniques. Optimization and Engineering, 15(1):243 265, 2014. Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. ar Xiv preprint ar Xiv:0912.3995, 2009. Sui, Y., Gotovos, A., Burdick, J., and Krause, A. Safe exploration for optimization with Gaussian processes. In International Conference on Machine Learning, pp. 997 1005, 2015. Sui, Y., Zhuang, V., Burdick, J., and Yue, Y. Stagewise safe Bayesian optimization with Gaussian processes. In International Conference on Machine Learning, pp. 4781 4789, 2018. Sussex, S., Makarova, A., and Krause, A. Model-based causal Bayesian optimization. In International Conference on Learning Representations, 2023. Swersky, K., Snoek, J., and Adams, R. P. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, 2013. Tigas, P., Annadani, Y., Jesson, A., Sch olkopf, B., Gal, Y., and Bauer, S. Interventions, where and how? Experimental design for causal models at scale. 
In Advances in Neural Information Processing Systems, 2022. Titsias, M. Variational learning of inducing variables in sparse gaussian processes. In Artificial Intelligence and Statistics, pp. 567 574, 2009. Tran, A., Sun, J., Furlan, J. M., Pagalthivarthi, K. V., Visintainer, R. J., and Wang, Y. p BO-2GP-3B: A batch parallel known/unknown constrained Bayesian optimization with feasibility classification and its applications in computational fluid dynamics. Computer Methods in Applied Mechanics and Engineering, 347:827 852, 2019. Vazquez, E. and Bect, J. Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. Journal of Statistical Planning and Inference, 140(11):3088 3095, 2010. Zhang, B., Tsiatis, A. A., Laber, E. B., and Davidian, M. A robust method for estimating optimal treatment regimes. Biometrics, 68(4):1010 1018, 2012. Zhang, J. Designing optimal dynamic treatment regimes: A causal reinforcement learning approach. In International Conference on Machine Learning, pp. 11012 11022, 2020. Constrained Causal Bayesian Optimization A. Reducing the Search Space Recall the definition of c MIS relative to (C Y, G). Definition 3.1 (c MIS). A set X I is said to be a c MIS relative to (C Y, G) if there is no X X with CX = CX such that µW do(X=x) = µW do(X =x ), where x indicates the subset of x corresponding to variables X , x D(X), W CX Y , and SCM compatible with G. Also recall that MC Y,G is the subset of PI whose elements are c MISs relative to (C Y , G), that an(W, G) denotes the set of ancestors of W in G, that an(W , G) := W W an(W, G), and that GX is the graph with all incoming edges onto all elements of X removed. Proposition 3.3 (Characterization of c MIS). X I is a c MIS relative to (C Y , G) X an(C Y, GX) (C X). Proof. (= ) We first prove that if X an(C Y, GX) (C X) then X is a c MIS relative to (C Y, G). X an(C Y, GX) (C X) implies that, for any X X, the set Q = X\X is also a subset of an(C Y, GX) (C X). Therefore (i) Q an(C Y, GX), or (ii) Q C X, or (iii) Q includes both variables in an(C Y, GX) and C X. In case (ii), Q C and, as X = Q X , we obtain C X = C (Q X ) = (C Q) (C X ) = Q (C X ) C X implying CX CX which violates the requirement CX = CX of Definition 3.1. In cases (i) and (iii), Q includes at least one variable, say Q, with at least one directed path, say from Q to W C Y , that is not passing through X (as the incoming edges into X are removed in GX). Consider a SCM with V = P|pa(V )| i=1 pa(V )i + UV , where pa(V )i is the i-th parent, and UV N(0, 1). In this SCM, µW do(X =x ) = ax + bµQ do(X =x ) where a and b are two positive constants, while µW do(Q=q,X =x ) = ax + bq, so that taking q = µQ do(X =x ) + 1 gives µW do(Q=q,X =x ) > µW do(X =x ). Therefore, for any X X we can construct a SCM such that µW do(X=x) > µW do(X =x ) for at least one W C Y . As there is no X X such that µW do(X=x) = µW do(X =x ) W C Y and SCM compatible with G, X satisfies the requirements of Definition 3.1 for being a c MIS relative to (C Y, G). ( =) We now prove that if X is a c MIS relative to (C Y, G) then X an(C Y, GX) (C X). If X were not a subset of an(C Y, GX) (C X), then we could define the non empty set Q = X\ X (an(C Y, GX) (C X)) and the set X = X\Q X such that (1) CX = CX and (2) all effects on variables in C Y for X and X are equal, contradicting X being a c MIS relative to (C Y, G). 
Condition (1) would hold as X (an(C Y, GX) (C X)) = (X an(C Y, GX)) (X C) implies that we can express Q as Q = X\(B (X C)) for B = X an(C Y, GX), showing that Q does not include variables in X that are in C, and therefore that CX = CX . Condition (2) would follow from the fact that the non-overlapping sets Q and C Y satisfy (C Y ) GXQ | X as: (i) there could not be causal paths from any variable in Q to any variable in (C Y ) in GX (as Q does not contain the ancestors of (C Y ) by definition), (ii) non-causal frontdoor paths would be closed as the incoming edges into the colliders that might be included in X would be removed in GX, and (iii) all backdoor paths would be removed in GX. Therefore, by the Rule 3 of do-calculus, we would have µW do(Q=q,X =x ) = µW do(X =x ) W C Y . Theorem 3.2 (Sufficiency of MC Y,G). MC Y,G contains a solution of the c CGO problem (if a solution exists), SCM compatible with G. Proof. Consider a set X that satisfies Eq. (1). By Definition 3.1, if X / MC Y,G there exists a set X X with CX = CX such that µY do(X=x) = µY do(X =x ) and µC do(X=x) = µC do(X =x ), C CX and therefore X also satisfies Eq. (1). If X / MC Y,G, we can apply a similar reasoning, and proceed until we find a set for which there is no subset that gives the same effects, which therefore is in MC Y,G. Lemma A.1. For any disjoint set of variables W1, W2, W3, if W1 an(W2, GW1,W3) = then W2 GW1,W3W1 | W3. Proof. W1 an(W2, GW1,W3) = implies that there are no directed paths from (any element of) W1 to (any element of) W2 in GW1,W3. In addition, all backdoor paths from W1 to W2 in G are removed in GW1,W3. Therefore, all paths from W1 to W2 in GW1,W3 are non-directed frontdoor paths. As W3 cannot be a collider on paths from W1 to W2 in GW1,W3 (W3 does not have incoming edges in GW1,W3), these paths cannot be opened by conditioning on W3. In conclusion, all paths from W1 to W2 are closed in GW1,W3 given W3, therefore W2 GW1,W3W1 | W3. Let n MC Y,G(X, C) := {X MC Y,G : X X, X = X X1 with X1 an(CX\C Y, GX ) = and CX = CX or X an(C , GX) = and µC λC for all C CX\CX }; let n MC Y,G(X) := n S C CX,X an(C,GX)= [X, n MC Y,G(X, C)]µC λC o with [X, n MC Y,G(X, C)]µC λC = n MC Y,G(X, C) if µC λC and X otherwise. Constrained Causal Bayesian Optimization Table 1: Summary of the prior mean m V X(x) and kernel SV X(x, x ) parameters associated to the different surrogate models. {g V,(s) X }S s=1 denote a set of S realisations of g V X obtained by sampling from p(U) and the posterior distributions of f V given DO, ˆσV do(X=x) = 1 S PS s=1 g V,(s) X (x) m V X(x 2 , σ2 f and l are the kernel hyper-parameters, and SX,q(x, x ) is the kernel function for u X,q with associated scalar coefficient a Vk X,q. Prior parameters for g V X m V X(x) SV X(x, x ) c BO 0 σ2 f exp( ||x x ||2 CBO 0 σ2 f exp( ||x x ||2 STGP 0 σ2 f exp( ||x x ||2 STGP+ 1 S PS s=1 g V,(s) X (x) σ2 f exp( ||x x ||2 2l2 ) + ˆσV do(X=x) ˆσV do(X=x ) MTGP 0 PQ q=1 a V X,q 2 SX,q(x, x ) MTGP+ 1 S PS s=1 g V,(s) X (x) PQ q=1 a V X,q 2 SX,q(x, x ) G-MTGP 1 S PS s=1 g V,(s) X (x) 1 S PS s=1 g V,(s) X (x)g V,(s) X (x ) 1 S PS s=1 g V,(s) X (x) 1 S PS s=1 g V,(s) X (x ) Theorem A.2 (Sufficiency of n MC Y,G). n MC Y,G := MC Y,G\ S X MC Y,G n MC Y,G(X) contains a solution of the c CGO problem (if a solution exists), SCM compatible with G. Proof. Consider a set X MC Y,G that satisfies Eq. (1). If X n MC Y,G, then X S X MC Y,G n MC Y,G(X), and therefore there exist at least one set X MC Y,G and at least one C CX with X an(C, GX) = (implying µC do(X=x) = µC). 
If µC < λC then [X, n MC Y,G(X, C)]µC λC = X, and thus X X implying that CX CX and therefore that X does not satisfies Eq. (1) as µC < λC. If instead µC λC, X n MC Y,G(X, C), and thus X X, X = X X1 with X1 an(CX\C Y, GX ) = . In addition, X has either the same constraint set of X (i.e. CX = CX) or is such that its additional constraints are satisfied (X an(C , GX) and µC λC for all C CX\CX ). Therefore X also satisfies Eq. (1). If X n MC Y,G we can apply a similar reasoning and proceed until we find a set in n MC Y,G. The c MISReduce procedure is given below. c MISReduce. First construct MC Y,G as: MC Y,G = , X PI, if X an(C Y, GX) (C X) add X to MC Y,G. Then, set n MC Y,G to MC Y,G. If DO = , X MC Y,G, C CX with X an(C, GX) = , if C is not null-feasible (according to an estimate ˆµC of µC) remove X from n MC Y,G. If instead C is null-feasible, remove from n MC Y,G all X X with X = X X1 where X1 an(CX\C Y, GX ) = and such that CX = CX or X an(C , GX) = and µC λC for all C CX\CX . B. Baselines and Surrogate Models In the experimental section we compare the performance of c CBO using different surrogate models against CBO, c BO and, for SYNTHETIC-1 and SYNTHETIC-2, also against For the surrogate models of CBO, c BO, and STGP we assume a zero prior mean function m V X(x) = 0 and an RBF kernel SV X(x, x ) = σ2 f exp( ||x x ||2 2l2 ) for all X PI and V CX Y . RBF kernels are also considered for the kernel functions SX,q(x, x ), q = 1, . . . , Q, associated to the latent GPs of MTGP and MTGP+. Notice that the surrogate model parameters for c BO, CBO and STGP are equal. Indeed, these methods only differ in terms of search space and acquisition functions used. c BO solves an non-causal constrained global optimization problem therefore considers only one intervention set, i.e. X = I, models the associated target µY do(I=x) and constraint effects µCI do(I=x) independently and selects interventions via the constrained expected improvement acquisition function (Gardner et al., 2014). CBO uses the same surrogate models construction but explores the intervention sets included in n MC Y,G via a standard expected improvement acquisition function thus solving an unconstrained optimization problem. Finally, STGP considers the surrogate model constructions of CBO and c BO but explores the sets included in n MC Y,G selecting interventions via a constrained expected improvement acquisition function. For all surrogate models exploiting DO in the GP prior mean functions (STGP+, MTGP+ and G-MTGP), we compute m V X(x) for all X n MC Y,G and V CX Y by averaging the values of V obtained by sampling from the fitted SCM where the functions for X are fixed to Constrained Causal Bayesian Optimization 3 2 1 0 1 2 3 X 0 5 10 15 Z Figure 7: SYNTHETIC-1 with NO = 500. Scatter plots for the observational data together with interventional ranges D(X) (top) and D(Z) (bottom). x. Specifically, we model each function f V with a GP f V (pa(v)) GP(0, SV (pa(v), pa(v) )), where pa(v) denotes a value taken by pa(V ). We use an RBF kernel defined as SV (pa(v), pa(v) ) = σ2 f exp( ||pa(v) pa(v) ||2 2l2 ). We assume a Gaussian likelihood for DO and compute the posterior distribution for all f V in closed form via standard GP updates. We then sample from p(U) and these posteriors to obtain a set of samples {g V,(s) X }S s=1 for all X n MC Y,G. The mean functions for the target and constraint effects are then obtained as m V X(x) = 1 S PS s=1 g V,(s) X with V = Y and V = C, C C respectively. 
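The construction above — fit each f_V on D^O with a GP, then propagate samples through the fitted SCM under do(X = x) and average — can be sketched as follows for the Fig. 2(b) chain X → Z → Y. It is a simplified illustration with fixed RBF hyper-parameters (set to 1, mirroring the initialization described in Appendix C but without the subsequent type-2 ML optimization) and unit observation noise; for brevity it propagates the GP posterior means of f_Z and f_Y together with S = 10 exogenous-noise samples, whereas the paper samples the posteriors of the f_V functions as well in order to also build the kernel. The SCM signs are reconstructed from Appendix C.1, the interventional range D(X) = [-3, 2] is taken from there, and the helper names are ours.

```python
import numpy as np

def rbf(a, b, sigma_f=1.0, length=1.0):
    """RBF kernel sigma_f^2 * exp(-(a - b)^2 / (2 l^2)) on 1-d inputs."""
    return sigma_f**2 * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * length**2))

def gp_posterior_mean(x_train, y_train, x_test, noise=1.0):
    """Posterior mean of a zero-mean GP with RBF kernel and Gaussian likelihood."""
    k = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    return rbf(x_test, x_train) @ np.linalg.solve(k, y_train)

# Observational data D^O from the SYNTHETIC-1-style SCM (signs reconstructed from Appendix C.1).
rng = np.random.default_rng(0)
n_obs = 200
x_obs = rng.normal(size=n_obs)                                        # X = U_X
z_obs = np.exp(-x_obs) + rng.normal(size=n_obs)                       # Z = f_Z(X) + U_Z
y_obs = np.cos(z_obs) - np.exp(-z_obs / 20) + rng.normal(size=n_obs)  # Y = f_Y(Z) + U_Y

# Prior mean for g^Y_X over D(X): m^Y_X(x) ~= (1/S) sum_s fhat_Y(fhat_Z(x) + U_Z^(s)).
S = 10
x_grid = np.linspace(-3.0, 2.0, 50)                  # interventional range D(X) from Appendix C.1
f_z_hat = gp_posterior_mean(x_obs, z_obs, x_grid)    # fitted f_Z evaluated at the grid
u_z = rng.normal(size=(S, 1))                        # samples of the exogenous noise U_Z
m_y = gp_posterior_mean(z_obs, y_obs,
                        (f_z_hat[None, :] + u_z).ravel()).reshape(S, -1).mean(axis=0)
print(m_y[:5].round(3))
```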
A similar procedure is used to compute the kernel function $S^{V,W}_X(x, x')$ for G-MTGP. In particular, given the samples $\{g^{V,(s)}_X\}_{s=1}^S$ and $\{g^{W,(s)}_X\}_{s=1}^S$ for each pair of variables $(V, W)$ in $C \cup Y$, we compute the cross-covariance between the associated effects as $\frac{1}{S}\sum_{s=1}^S g^{V,(s)}_X(x)\, g^{W,(s)}_X(x') - \frac{1}{S}\sum_{s=1}^S g^{V,(s)}_X(x) \cdot \frac{1}{S}\sum_{s=1}^S g^{W,(s)}_X(x')$. See Table 1 for a summary of the prior parameters used for each surrogate model.

Table 1: Summary of the prior mean $m^V_X(x)$ and kernel $S^V_X(x, x')$ parameters associated to the different surrogate models. $\{g^{V,(s)}_X\}_{s=1}^S$ denotes a set of $S$ realisations of $g^V_X$ obtained by sampling from $p(U)$ and the posterior distributions of $f_V$ given $D^O$, $\hat{\sigma}^V_{do(X=x)} = \sqrt{\frac{1}{S}\sum_{s=1}^S (g^{V,(s)}_X(x) - m^V_X(x))^2}$, $\sigma^2_f$ and $l$ are the kernel hyper-parameters, and $S_{X,q}(x, x')$ is the kernel function for $u_{X,q}$ with associated scalar coefficient $a^V_{X,q}$.

cBO:     $m^V_X(x) = 0$;  $S^V_X(x, x') = \sigma^2_f \exp(-\|x - x'\|^2/(2l^2))$
CBO:     $m^V_X(x) = 0$;  $S^V_X(x, x') = \sigma^2_f \exp(-\|x - x'\|^2/(2l^2))$
STGP:    $m^V_X(x) = 0$;  $S^V_X(x, x') = \sigma^2_f \exp(-\|x - x'\|^2/(2l^2))$
STGP+:   $m^V_X(x) = \frac{1}{S}\sum_{s=1}^S g^{V,(s)}_X(x)$;  $S^V_X(x, x') = \sigma^2_f \exp(-\|x - x'\|^2/(2l^2)) + \hat{\sigma}^V_{do(X=x)}\,\hat{\sigma}^V_{do(X=x')}$
MTGP:    $m^V_X(x) = 0$;  $S^V_X(x, x') = \sum_{q=1}^Q (a^V_{X,q})^2\, S_{X,q}(x, x')$
MTGP+:   $m^V_X(x) = \frac{1}{S}\sum_{s=1}^S g^{V,(s)}_X(x)$;  $S^V_X(x, x') = \sum_{q=1}^Q (a^V_{X,q})^2\, S_{X,q}(x, x')$
G-MTGP:  $m^V_X(x) = \frac{1}{S}\sum_{s=1}^S g^{V,(s)}_X(x)$;  $S^V_X(x, x') = \frac{1}{S}\sum_{s=1}^S g^{V,(s)}_X(x)\, g^{V,(s)}_X(x') - \frac{1}{S}\sum_{s=1}^S g^{V,(s)}_X(x) \cdot \frac{1}{S}\sum_{s=1}^S g^{V,(s)}_X(x')$

Similarly to CBO, MCBO solves an unconstrained optimization problem. Instead of explicitly modelling the target effect, MCBO assumes each $V \in V$ to be of the form $V = f_V(pa_G(V), A_V) + U_V$, where $A_V$ is a set of action variables whose values can be set by the investigator. It then places a vector-valued GP prior on the functions $\{f_V\}_{V \in V}$ and exploits the posterior distribution of this GP, together with a reparameterization trick introduced by Curi et al. (2020), to derive confidence bounds for the target effect. These bounds are then used within an upper confidence bound acquisition function. Notice that, by explicitly modelling the functions in the SCM, MCBO cannot deal with settings in which there are unobserved confounders.

C. Experimental Details and Additional Results

For all experiments we initialize the kernel hyper-parameters $(\sigma^2_f, l)$ of the surrogate models to 1 and optimize them with a standard type-2 maximum likelihood approach. When sampling from the modified SCM is required in order to compute the prior mean or kernel functions we use S = 10 samples. The same number of samples is used when the computation of the acquisition function requires a Monte Carlo approximation.

Figure 8: SYNTHETIC-2 with NO = 500. Scatter plots of the observational data together with the interventional ranges D(A) (top), D(D) (center) and D(E) (bottom).

Figure 9: SYNTHETIC-1 with NO = 10 and λZ = 2. Top: convergence to the cCGO (solid red line) and cGO (dotted red line) optima. Lines give average results across different initializations of $D^I$; shaded areas represent standard deviations. Bottom: average percentage of feasible interventions collected over trials.

C.1. SYNTHETIC-1

For the causal graph in Fig. 2(b) with I = C = {X, Z} we consider the SCM $X = U_X$, $Z = \exp(-X) + U_Z$, $Y = \cos(Z) - \exp(-Z/20) + U_Y$, with $U_X, U_Z, U_Y \sim N(0, 1)$. We set the interventional ranges to D(X) = [-3, 2] and D(Z) = [-1, 1], the constraint thresholds to (λX, λZ) = (1, 2) for Fig. 3, Fig. 9 and Fig. 10 (bottom row) and to (λX, λZ) = (1, 10) for Fig. 10 (top row) and Fig. 11, and require the constraint effects to be lower than the thresholds. Notice that, even when NO is high, there exists a mismatch between the observational and interventional ranges.
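For concreteness, a short sketch that samples the SYNTHETIC-1 SCM as written above (with the signs as reconstructed here) and estimates interventional effects by Monte Carlo; the function name and the do-dictionary interface are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_synthetic1(n, do=None):
    """Draw n samples from the SYNTHETIC-1 SCM; `do` maps variable names to fixed values."""
    do = do or {}
    U_X, U_Z, U_Y = rng.standard_normal((3, n))
    X = np.full(n, do["X"]) if "X" in do else U_X
    Z = np.full(n, do["Z"]) if "Z" in do else np.exp(-X) + U_Z
    Y = np.cos(Z) - np.exp(-Z / 20) + U_Y
    return {"X": X, "Z": Z, "Y": Y}

# Monte Carlo estimates of the target and constraint effects under do(Z = 0.5):
samples = sample_synthetic1(10_000, do={"Z": 0.5})
print(samples["Y"].mean(), samples["X"].mean())   # mu^Y_{do(Z=0.5)}, mu^X_{do(Z=0.5)}
```

Note that under do(Z = z) the constrained variable X keeps its observational distribution, which is the reducibility property exploited below.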
Figure 7: SYNTHETIC-1 with NO = 500. Scatter plots of the observational data together with the interventional ranges D(X) (top) and D(Z) (bottom).

Figure 7 shows how the interventional ranges are only partially covered by the observational data, especially for X. Therefore, estimating the target and constraint effects with $D^O$ for intervention values outside of the observational ranges, say for X = 2, might lead to inaccurate results, which translate into unreliable prior parameters for the surrogate models exploiting $D^O$. A combination of interventional and observational data is needed in order to quantify uncertainty around the effects and to choose the intervention to perform next while ensuring satisfaction of the constraints, and thus to efficiently identify an optimal feasible intervention.

Figure 10: SYNTHETIC-1 with NO = 500 and λZ = 10 (top row), and with NO = 500 and λZ = 2 (bottom row). Left: convergence to the cCGO (solid red line), CGO (dash-dotted red line) and cGO (dotted red line) optima. Right: average percentage of feasible interventions collected over trials.

Figure 11: SYNTHETIC-1 with NO = 100 and λZ = 10. Top: convergence to the cCGO (solid red line) and cGO (dotted red line) optima. Bottom: average percentage of feasible interventions collected over trials.

In this graph, $P_I = \{\{X\}, \{Z\}, \{X, Z\}\}$, and $M_{C \cup Y, G} = P_I$ as $X \subseteq an(\{X, Z, Y\}, G_{\overline{X}}) \cup (\{X, Z\} \cap X)$ for all $X \in P_I$. The set $X = \{Z\}$ has only one constrained variable, $X$, that is reducible (as $\{Z\} \cap an(X, G_{\overline{\{Z\}}}) = \emptyset$) and null-feasible for NO = 500, 100, 10. We can thus exclude $X' = \{X, Z\} \supset \{Z\}$ from $M_{C \cup Y, G}$: indeed, $\{X, Z\}$ is of the form $X' = \{Z\} \cup X_1$ with $X_1 = \{X\}$ and $X \notin an(Y, G_{\overline{X'}})$. We therefore obtain $\widetilde{M}_{C \cup Y, G} = \{\{X\}, \{Z\}\}$.

Fig. 10 (top row) shows the convergence results for NO = 500 and λZ = 10, for which the cCGO and CGO problems have the same optimum. Notice how both CBO and MCBO converge to the optimum at a slower pace compared to STGP+, MTGP+ and G-MTGP. Indeed, CBO and MCBO only take into account the target value observed after performing each intervention. CBO also models each effect individually and, by disregarding the values of the constraints, needs to perform more interventions before identifying an optimal one. More importantly, MTGP+, STGP+ and G-MTGP collect more than 99% of feasible interventions over trials. This is a critical aspect of the method, as in real-world applications the investigator might not know whether an optimal solution is in a feasible region or not. Using cCBO in such settings does not slow down the identification of an optimal solution, but rather improves it, while efficiently restricting the exploration regions. For completeness, in Fig. 10 (bottom row) we also report the comparison with CBO, MCBO, and RANDOM for NO = 500 and λZ = 2.
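The feasibility percentages discussed above are driven by how interventions are selected. For reference, a minimal sketch of the constrained expected improvement weighting of Gardner et al. (2014) under Gaussian surrogate posteriors follows; it is a generic illustration, not the exact acquisition code used for cCBO.

```python
import numpy as np
from scipy.stats import norm

def constrained_ei(mu_y, sd_y, mu_c, sd_c, lam, y_best):
    """Constrained expected improvement for a minimisation problem.

    mu_y, sd_y : posterior mean/std of the target effect at a candidate intervention value
    mu_c, sd_c : arrays with the posterior means/stds of each constraint effect at that value
    lam        : array of thresholds, feasibility meaning constraint effect <= lam
    y_best     : best feasible target value observed so far
    """
    sd_y = max(sd_y, 1e-12)
    z = (y_best - mu_y) / sd_y
    ei = (y_best - mu_y) * norm.cdf(z) + sd_y * norm.pdf(z)        # standard EI (minimisation)
    prob_feasible = np.prod(norm.cdf((lam - mu_c) / np.maximum(sd_c, 1e-12)))
    return ei * prob_feasible                                      # EI weighted by feasibility
```

Candidates with a low posterior probability of satisfying the constraints are down-weighted, which is what keeps the percentage of infeasible interventions small.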
Figure 12: SYNTHETIC-2 with NO = 10. Top: convergence to the cCGO (solid red line) and cGO (dotted red line) optima. Bottom: average percentage of feasible interventions collected over trials.

C.2. SYNTHETIC-2

For the causal graph in Fig. 2(a) with I = {A, D, E} and C = {C, D, E} we consider the SCM $A = U_A$, $B = U_B$, $C = \exp(-A)/5 + U_C$, $D = \cos(B) + C/10 + U_D$, $E = \exp(-C)/10 + U_E$, $Y = \cos(D) - D/5 + \sin(E) - E/4 + U_Y$, with $U_A, U_B, U_C, U_D, U_E, U_Y \sim N(0, 1)$. We set the interventional ranges to D(A) = [-5, 5], D(D) = [-1, 1] and D(E) = [-1, 1], the constraint thresholds to λC = 10, λD = 10, λE = 10, and require all constraint effects to be smaller than the thresholds. For NO = 500, 100, 10 all variables in C are null-feasible, giving $\widetilde{M}_{C \cup Y, G} = P_I \setminus \{\{A, D, E\}\} = \{\{A\}, \{D\}, \{E\}, \{A, D\}, \{A, E\}, \{D, E\}\}$ (see the example in Section 3.1). As in SYNTHETIC-1, even when NO is high there exists a mismatch between the observational and interventional ranges (Fig. 8) for all intervention sets in $\widetilde{M}_{C \cup Y, G}$.

Fig. 12 shows the convergence plots and the associated percentage of feasible interventions when NO = 10. As discussed in Section 5, when the functions in the SCM cannot be learned accurately due to limited or noisy observational data, MTGP+ should be preferred. Indeed, a lower value of NO leads to a less accurate estimation of the target and constraint effects from $D^O$, which negatively affects the prior mean functions of STGP+, MTGP+ and G-MTGP. G-MTGP is further penalized in this case, as its kernel functions are also computed by exploiting the fitted SCM functions. As a consequence, we observe slower convergence and a lower percentage of feasible interventions collected when using G-MTGP. In particular, notice how the inaccurate estimation of the constraint effects translates into a significantly lower convergence speed for G-MTGP compared to the settings in which NO is higher. On the contrary, MTGP+ successfully trades off feasibility and improvement in these experiments, reaching convergence while collecting a high percentage of feasible interventions over trials. By breaking the existing causal relationships and intervening on all variables in I, cBO blocks the propagation of causal effects in the graph and converges to a solution (cGO) that is sub-optimal with respect to the one achieved by all cCBO instances. Finally, Fig. 13 shows how, in settings where the cCGO and CGO optima differ, CBO, MCBO, and RANDOM converge to a solution which is lower than the cCGO one. In addition, by disregarding the constraints, CBO, MCBO, and RANDOM collect a high number of infeasible interventions over trials.
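A short sketch of the null-feasibility check for SYNTHETIC-2, sampling the SCM as written above (signs as reconstructed here) and comparing the observational constraint means with the thresholds; the function and variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_synthetic2(n):
    """Observational samples from the SYNTHETIC-2 SCM as written above."""
    U = rng.standard_normal((6, n))
    A, B = U[0], U[1]
    C = np.exp(-A) / 5 + U[2]
    D = np.cos(B) + C / 10 + U[3]
    E = np.exp(-C) / 10 + U[4]
    Y = np.cos(D) - D / 5 + np.sin(E) - E / 4 + U[5]
    return {"A": A, "B": B, "C": C, "D": D, "E": E, "Y": Y}

# Null-feasibility: are the observational constraint means below their thresholds?
obs = sample_synthetic2(500)
thresholds = {"C": 10.0, "D": 10.0, "E": 10.0}
for var, lam in thresholds.items():
    print(var, obs[var].mean(), obs[var].mean() <= lam)
```

With thresholds of 10 and observational means well below 1, all of C, D and E come out null-feasible, matching the reduction stated above.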
C.3. HEALTH

For the causal graph in Fig. 1(b) with I = {Statin, Aspirin, CI} and C = {BMI}, the SCM is taken from Ferro et al. (2015) and can be written as $Age = U_{Age}$, $CI = U_{CI}$, $BMR = 1500 + 10\,U_{BMR}$, $Height = 175 + 10\,U_{Height}$, $Weight = (BMR + 6.8\,Age - 5\,Height)/13.7 + CI \cdot 150/7716$, $BMI = Weight/(Height/100)^2$, $Aspirin = \sigma(-8 + 0.1\,Age + 0.03\,BMI)$, $Statin = \sigma(-13 + 0.1\,Age + 0.2\,BMI)$, $PSA = 6.8 + 0.04\,Age - 0.15\,BMI - 0.60\,Statin + 0.55\,Aspirin + \sigma(2.2 - 0.05\,Age + 0.01\,BMI - 0.04\,Statin + 0.02\,Aspirin) + U_{PSA}$, with $U_{Age} \sim U(55, 75)$, $U_{CI} \sim U(-100, 100)$, $U_{BMR} \sim tN(-1, 2)$, $U_{Height} \sim tN(-0.5, 0.5)$, $U_{PSA} \sim N(0, 0.4)$, where $U(\cdot, \cdot)$ denotes a uniform distribution, $tN(a, b)$ a standard Gaussian distribution truncated between $a$ and $b$, and $\sigma(\cdot)$ the sigmoidal transformation defined as $\sigma(x) = 1/(1 + \exp(-x))$. We set the interventional ranges to D(Aspirin) = [0, 1], D(Statin) = [0, 1], and D(CI) = [-400, 400], and require BMI to be lower than 25. For NO = 100, these interventional ranges are only partially covered by the observational data (Fig. 14), thus significantly complicating the estimation of the effects with $D^O$, especially when the stochasticity determined by $p(U)$ is high (see the scatter plots for CI, Aspirin, Statin, and PSA in Fig. 15).

Figure 13: SYNTHETIC-2 with NO = 500. Top: convergence to the cCGO (solid red line), CGO (dash-dotted red line) and cGO (dotted red line) optima. Bottom: average percentage of feasible interventions collected over trials.

Figure 14: HEALTH with NO = 100. Scatter plots of the observational data together with the interventional ranges D(CI) (top), D(Statin) (center) and D(Aspirin) (bottom).

Figure 15: HEALTH. NO = 100 observational data samples from the SCM in Section C.3.

We have $P_I = \{\{Aspirin\}, \{Statin\}, \{CI\}, \{Aspirin, Statin\}, \{Aspirin, CI\}, \{Statin, CI\}, \{Aspirin, Statin, CI\}\}$, which is equal to $M_{C \cup Y, G}$. BMI is reducible for $X = \{Aspirin\}$, $X = \{Statin\}$ and $X = \{Aspirin, Statin\}$ as $X \cap an(BMI, G_{\overline{X}}) = \emptyset$. Given an observational dataset of size NO = 100, 10, $\hat{\mu}^{BMI} \approx 26$, making BMI not null-feasible. We therefore obtain $\widetilde{M}_{C \cup Y, G} = \{\{CI\}, \{Aspirin, CI\}, \{Statin, CI\}, \{Aspirin, Statin, CI\}\}$. For completeness, we report the full comparison with CBO and RANDOM in Fig. 16. As in the previous experiments, the cCGO and CGO optima differ, and thus CBO and RANDOM converge, at a much slower pace, to different values while collecting a large number of infeasible interventions.

Figure 16: HEALTH with NO = 100. Top: convergence to the cCGO (solid red line), CGO (dash-dotted red line) and cGO (dotted red line) optima. Bottom: average percentage of feasible interventions collected over trials.

C.4. PROTEIN-SIGNALING

For the causal graph of Fig. 1(a) (we exclude the Plcγ phospho-protein and the PIP2 and PIP3 phospho-lipids, as these form a separate graph with no connections to Erk) with I = {PKC, PKA, Mek, Akt} and C = {PKC, PKA}, Sachs et al. (2005) give a dataset but no SCM. Thus we first fit an SCM by placing a GP prior with zero mean and RBF kernel on each $f_V$, $V \in V$, using the observational data. We assume a Gaussian distribution with zero mean and unit variance for PKC. We use the learned SCM to generate observational and interventional data. As in the previous experiments, the interventional ranges are only partially covered by NO = 100 observational samples (Fig. 17), which complicates the estimation of the effects with $D^O$.
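A sketch of the node-wise GP fit used to build the PROTEIN-SIGNALING SCM, here with scikit-learn; the paper does not specify the library, noise level or hyper-parameter treatment, so those choices, the helper name, and the example parent set for Erk are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def fit_node_function(parents, values):
    """Fit one SCM function f_V from observational data: zero-mean GP with an RBF kernel.

    parents : array of shape (n_samples, n_parents) with the observed parent values of V
    values  : array of shape (n_samples,) with the observed values of V
    """
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)  # alpha ~ obs. noise
    gp.fit(parents, values)
    return gp

# Hypothetical usage, assuming `data` holds the observational columns and that Mek and PKA
# are parents of Erk in the chosen graph:
#   erk_gp = fit_node_function(np.column_stack([data["Mek"], data["PKA"]]), data["Erk"])
#   erk_gp.sample_y(...) can then provide function realisations when simulating the learned SCM.
```

Sampling such node-wise posteriors, together with draws for the root variables, is what generates the observational and interventional data used in this experiment.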
The power set of I is given by $P_I = \{\{PKC\}, \{PKA\}, \{Mek\}, \{Akt\}, \{PKC, PKA\}, \{PKC, Mek\}, \{PKC, Akt\}, \{PKA, Mek\}, \{PKA, Akt\}, \{Mek, Akt\}, \{PKC, PKA, Mek\}, \{PKC, PKA, Akt\}, \{PKA, Mek, Akt\}\}$. Note that, for $X \in \{\{Akt\}, \{PKC, Akt\}, \{PKA, Akt\}, \{Mek, Akt\}, \{PKC, PKA, Akt\}, \{PKA, Mek, Akt\}\}$, we have $X \not\subseteq an(\{PKC, PKA, Erk\}, G_{\overline{X}}) \cup (\{PKC, PKA\} \cap X)$, thus we can exclude these sets from $P_I$ and obtain $M_{C \cup Y, G} = P_I \setminus \{\{Akt\}, \{PKC, Akt\}, \{PKA, Akt\}, \{Mek, Akt\}, \{PKC, PKA, Akt\}, \{PKA, Mek, Akt\}\}$. Given $D^O$ with size NO = 100, 10, all constrained variables are null-feasible. The set $X = \{PKC, PKA\}$ in $M_{C \cup Y, G}$ has only one constrained variable, PKC, that is reducible (as $X \cap an(PKC, G_{\overline{X}}) = \emptyset$) and null-feasible for NO = 100, 10. We can thus exclude $X' = \{PKC, PKA, Mek\} \supset X$ from $M_{C \cup Y, G}$, as it is of the form $X' = X \cup \{Mek\}$ with $PKC \notin an(Y, G_{\overline{X'}})$. We therefore obtain $\widetilde{M}_{C \cup Y, G} = M_{C \cup Y, G} \setminus \{\{PKC, PKA, Mek\}\}$. In addition, as the CGO optimum equals the cCGO optimum, CBO and RANDOM converge to the same value achieved by cCBO (Fig. 18, top row) but collect a significantly higher number of infeasible interventions (Fig. 18, bottom row).

Figure 17: PROTEIN-SIGNALING with NO = 100. Scatter plots of the observational data together with the interventional ranges D(Akt) (top row), D(Mek) (second row), D(PKA) (third row) and D(PKC) (bottom row).

Figure 18: PROTEIN-SIGNALING with NO = 100. Top: convergence to the cCGO (solid red line), CGO (dash-dotted red line) and cGO (dotted red line) optima. Bottom: percentage of feasible interventions collected over trials.
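Finally, a sketch of how a selected intervention is evaluated in these experiments (step 2 of Algorithm 1): draw S samples from the interventional distribution of the learned SCM and form sample-mean estimates of the target and constraint effects. The sampler interface is an assumption, and feasibility is written here as effects below their thresholds, as in the SYNTHETIC and HEALTH experiments; the inequality would be flipped for constraints that must stay above a level.

```python
import numpy as np

def evaluate_intervention(sample_do, x, thresholds, S=10):
    """Monte Carlo estimates of the effects of a chosen intervention do(X = x).

    sample_do : callable (x, S) -> dict {variable: array of S samples} drawn from the
                interventional distribution, e.g. by simulating the learned SCM.
    thresholds: dict {constrained variable: lambda}; feasible iff estimated effect <= lambda.
    """
    draws = sample_do(x, S)
    mu_y = draws["Y"].mean()                                  # sample-mean target effect
    mu_c = {c: draws[c].mean() for c in thresholds}           # sample-mean constraint effects
    feasible = all(mu_c[c] <= lam for c, lam in thresholds.items())
    return mu_y, mu_c, feasible
```

Averaging the feasibility flag over all interventions collected in a trial gives the percentage of feasible interventions reported in the bottom panels of the convergence figures.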