# Multi-task Causal Learning with Gaussian Processes

Virginia Aglietti, University of Warwick and The Alan Turing Institute, V.Aglietti@warwick.ac.uk
Theodoros Damoulas, University of Warwick and The Alan Turing Institute, T.Damoulas@warwick.ac.uk
Mauricio A. Álvarez, University of Sheffield, Mauricio.Alvarez@sheffield.ac.uk
Javier González, Microsoft Research Cambridge, Gonzalez.Javier@microsoft.com

## Abstract

This paper studies the problem of learning the correlation structure of a set of intervention functions defined on the directed acyclic graph (DAG) of a causal model. This is useful when we are interested in jointly learning the causal effects of interventions on different subsets of variables in a DAG, which is common in fields such as healthcare or operations research. We propose the first multi-task causal Gaussian process (GP) model, which we call DAG-GP, that allows for information sharing across continuous interventions and across experiments on different variables. DAG-GP accommodates different assumptions in terms of data availability and captures the correlation between functions lying in input spaces of different dimensionality via a well-defined integral operator. We give theoretical results detailing when and how the DAG-GP model can be formulated depending on the DAG. We test both the quality of its predictions and its calibrated uncertainties. Compared to single-task models, DAG-GP achieves the best fitting performance in a variety of real and synthetic settings. In addition, it helps to select optimal interventions faster than competing approaches when used within sequential decision-making frameworks, such as active learning or Bayesian optimization.

## 1 Introduction

Solving decision-making problems in a variety of domains, such as healthcare, systems biology, or operations research, often requires experimentation.
By performing interventions one can understand how a system behaves when an action is taken and thus infer the cause-effect relationships of a phenomenon. Experiments are especially useful when observational causal inference methods do not provide accurate estimation of the causal effects. For instance, in healthcare, drugs are tested in randomized clinical trials before commercialization. Biologists might want to understand how genes interact in a cell once one of them is knocked out. Finally, engineers investigate the impact of design changes on complex physical systems by conducting experiments on digital twins [34]. Experiments in these scenarios are usually expensive, time-consuming, and, especially for field experiments, they may present ethical issues. Therefore, researchers generally have to trade off cost, time, and other practical considerations to decide which experiments to conduct, if any, to learn about the system.

Consider the causal graph in Fig. 1, which describes how crop yield $Y$ is affected by soil fumigants $X$ and the level of the eel-worm population at different times $\mathbf{Z} = \{Z_1, Z_2, Z_3\}$ [11, 27]. By performing a set of experiments, the investigator aims at learning the intervention functions relating the expected crop yield to each possible intervention set and level. Naïvely, one could achieve that by modelling each intervention function separately. However, this approach would disregard the correlation structure existing across experimental outputs and would increase the computational complexity of the problem. Indeed, the intervention functions are correlated, and each experiment carries information about the yield we would obtain by performing alternative interventions in the graph.

*34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.*
For instance, observing the yield when running an experiment on the intervention set $\{X, Z_1\}$, setting it to the intervention value $\{x, z_1\}$, provides information about the yield we would get from intervening only on $X$ or on $\{X, Z_1, Z_2, Z_3\}$. This paper studies how to jointly model such intervention functions so as to transfer knowledge across different experimental setups and to integrate observational and interventional data. The model proposed here enables proper uncertainty quantification of the causal effects, thus allowing the definition of optimal experimental design strategies.

Figure 1: DAG for the crop yield. Nodes denote variables (soil fumigants $X$; eel-worm populations $Z_1$ at time $t-1$, $Z_2$ at time $t$, $Z_3$ at time $t+1$; crop yield $Y$), arrows represent causal effects, and dashed edges indicate unobserved confounders.

### 1.1 Motivation and Contributions

The framework proposed in this work combines causal inference with multi-task learning via Gaussian processes (GPs, [30]). Probabilistic causal models are commonly used in disciplines where explicit experimentation may be difficult, and the do-calculus [27] enables prediction of the effect of an intervention without performing the experiment. In do-calculus, different intervention functions are modelled individually and there is no information shared across experiments. Modelling the correlation across experiments is crucial, especially when the number of observational data points is limited and experiments on some variables cannot be performed. Multi-task GP methods have been extensively used to model non-trivial correlations between outputs [4]. However, to the best of our knowledge, this is the first work focusing on intervention functions, possibly of different dimensionality, defined on a causal graph.
In particular, we make the following contributions:

- We give theoretical results detailing when and how a causal multi-task model for the experimental outputs can be developed depending on the topology of the DAG of a causal model.
- Exploiting our theoretical results, we develop a joint probabilistic model for all intervention functions, henceforth named DAG-GP, which flexibly accommodates different assumptions in terms of data availability, both observational and interventional.
- We demonstrate how DAG-GP achieves the best fitting performance in a variety of experimental settings while enabling proper uncertainty quantification, and thus optimal decision making when used within Active Learning (AL) and Bayesian Optimization (BO).

### 1.2 Related work

While there exists an extensive literature on multi-task learning with GPs [9, 4] and on causality [28, 17], the literature on causal multi-task learning is very limited. In the causality literature, studies have focused on observational causal inference and on the problem of transferring the causal effect of one given variable across environments [29, 6-8]. Several works have focused on domain adaptation problems [31, 26, 35], where data for a source domain are given and the task is to predict the distribution of a target variable in a target domain. Closer to our work, [2] developed a linear coregionalization model for learning individual treatment effects from observational data. While [2] is the first paper conceptualizing causal inference as a multi-task learning problem, its focus is on modelling the correlation across intervention levels for a single intervention function corresponding to a dichotomous intervention variable. Finally, [24] studied the problem of identification of the causal effect of one intervention set in terms of the available observational and experimental distributions.
Focusing on the problem of identification, [24] neither provides a model for these distributions nor addresses transfer across interventions. While one could repeat their procedure to get an expression for all possible intervention sets in the causal graph, this would not allow expressing all causal effects via a shared interventional distribution, which is the focus of this paper. In addition, this paper focuses on settings where all causal effects are identifiable and interventional data are available. In these settings, the procedure in [24] would simplify and would either return the experimental output values, when these are available, or compute the causal effects via do-calculus. Differently from these previous works, this paper focuses on transfer within a single environment, across experiments and across intervention levels. The set of functions we wish to learn have continuous input spaces of different dimensionality. Therefore, capturing their correlation requires placing a probabilistic model over the inputs which enables mapping between input spaces. The DAG, which we assume to be known and which is not available in standard multi-task settings, allows us to define such a model. Therefore, existing multi-output GP models are not applicable to our problem.

Our work is also related to the literature on causal decision making. Studies in this field have focused on multi-armed bandit problems [5, 21, 25, 22] and reinforcement learning settings [10, 14] where arms or actions correspond to interventions on a DAG. More recently, [1] proposed a Causal Bayesian Optimization (CBO) framework solving the problem of finding an optimal intervention in a DAG by modelling the intervention functions with GPs. Interestingly, the authors proposed a GP prior construction, the so-called causal prior, enabling the integration of observational and interventional data.
More importantly, in CBO each function is modelled independently and their correlation is not accounted for when exploring the intervention space. This paper overcomes this limitation by introducing a multi-task model for experimental outputs. Finally, in the causal literature there has been a growing interest in experimental design algorithms to learn causal graphs [19, 18, 16] or the observational distributions in a graph [32]. Here we use our multi-task model within an AL framework so as to efficiently learn the experimental outputs in a causal graph.

## 2 Background and problem setup

Consider a probabilistic structural causal model (SCM) [28] consisting of a directed acyclic graph (DAG) $\mathcal{G}$ and a four-tuple $\langle \mathbf{U}, \mathbf{V}, F, P(\mathbf{U}) \rangle$, where $\mathbf{U}$ is a set of independent exogenous background variables distributed according to the probability distribution $P(\mathbf{U})$, $\mathbf{V}$ is a set of observed endogenous variables, and $F = \{f_1, \dots, f_{|\mathbf{V}|}\}$ is a set of functions such that $v_i = f_i(\mathrm{Pa}_i, u_i)$, with $\mathrm{Pa}_i = \mathrm{Pa}(V_i)$ denoting the parents of $V_i$. $\mathcal{G}$¹ encodes our knowledge of the existing causal mechanisms among $\mathbf{V}$. Within $\mathbf{V}$, we distinguish between two different types of variables: treatment variables $\mathbf{X}$ that can be manipulated and set to specific values,² and output variables $Y$ that represent the agent's outcomes of interest. Given $\mathcal{G}$, we denote the interventional distribution for two disjoint sets in $\mathbf{V}$, say $\mathbf{X}$ and $Y$, by $P(Y|do(\mathbf{X} = \mathbf{x}))$. This is the distribution of $Y$ obtained by intervening on $\mathbf{X}$ and fixing its value to $\mathbf{x}$ in the data-generating mechanism, irrespective of the values of its parents. The interventional distribution differs from the observational distribution, which is denoted by $P(Y|\mathbf{X} = \mathbf{x})$. Under some identifiability conditions [15], do-calculus allows the estimation of interventional distributions, and thus causal effects, from observational distributions [27]. In this paper, we assume the causal effect of $\mathbf{X}_s$ on $Y$ to be identifiable for all $\mathbf{X}_s \in \mathcal{P}(\mathbf{X})$, with $\mathcal{P}(\mathbf{X})$ denoting the power set of $\mathbf{X}$.
### 2.1 Problem setup

Consider a DAG $\mathcal{G}$ and the related SCM. Define the set of intervention functions for $Y$ in $\mathcal{G}$ as:

$$\mathcal{T} = \{t_s(\mathbf{x})\}_{s=1}^{|\mathcal{P}(\mathbf{X})|}, \qquad t_s(\mathbf{x}) = \mathbb{E}_{p(Y|do(\mathbf{X}_s = \mathbf{x}))}[Y] = \mathbb{E}[Y|do(\mathbf{X}_s = \mathbf{x})], \tag{1}$$

with $\mathbf{X}_s \in \mathcal{P}(\mathbf{X})$, where $\mathcal{P}(\mathbf{X})$ is the power set of $\mathbf{X}$ minus the empty set,³ and $\mathbf{x} \in D(\mathbf{X}_s)$, where $D(\mathbf{X}_s) = \times_{X \in \mathbf{X}_s} D(X)$ with $D(X)$ denoting the interventional domain of $X$. Let $\mathcal{D}^O = \{\mathbf{x}_n, y_n\}_{n=1}^N$, with $\mathbf{x}_n \in \mathbb{R}^{|\mathbf{X}|}$ and $y_n \in \mathbb{R}$, be an observational dataset of size $N$ from this SCM. Consider an interventional dataset $\mathcal{D}^I = (\mathbf{X}^I, \mathbf{Y}^I)$, with $\mathbf{X}^I = \bigcup_s \{\mathbf{x}^I_{si}\}_{i=1}^{N^I_s}$ and $\mathbf{Y}^I = \bigcup_s \{y^I_{si}\}_{i=1}^{N^I_s}$ denoting the intervention levels and the function values observed from previously run experiments across sets in $\mathcal{P}(\mathbf{X})$. $N^I_s$ represents the number of experimental outputs observed for the intervention set $\mathbf{X}_s$. Our goal is to define a joint prior distribution $p(\mathcal{T})$ and compute the posterior $p(\mathcal{T}|\mathcal{D}^I)$ so as to make probabilistic predictions for $\mathcal{T}$ at some unobserved intervention sets and levels.

¹As mentioned above, in this paper we assume $\mathcal{G}$ to be known. However, one could run a causal discovery algorithm as a pre-processing step or use interventional data to discriminate among graphs within the Markov equivalence class. We leave this for future research.
²This setting can be extended to include non-manipulative variables. See [23] for a definition of such nodes.
³We exclude the empty set as it corresponds to the observational distribution $t_\varnothing(\mathbf{x}) = \mathbb{E}[Y]$.

## 3 Multi-task learning of intervention functions

In this section we address the following question: can we develop a joint model for the functions $\mathcal{T}$ in a causal graph and thus transfer information across experiments? To answer this question we study the correlation among functions in $\mathcal{T}$, which varies with the topology of $\mathcal{G}$. Inspired by previous work on latent force models [3], we show how any function in $\mathcal{T}$ can be written as an integral transformation of some base function $f$, also defined starting from $\mathcal{G}$, via some integral operator $\mathcal{L}_s$ such that $t_s(\mathbf{x}) = \mathcal{L}_s(f)(\mathbf{x})$ for all $\mathbf{X}_s \in \mathcal{P}(\mathbf{X})$.
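To make the setup above concrete, the following sketch (our own illustration, not the paper's code) instantiates a toy SCM for a chain DAG X → Z → Y and estimates an intervention function $t_s$ by Monte Carlo via graph mutilation. The structural equations and coefficients are assumed purely for illustration.

```python
import numpy as np

# Hypothetical SCM for the chain X -> Z -> Y; equations are illustrative
# assumptions: Z = 2X + eps_z, Y = 3Z + eps_y, with standard normal noise.

def sample_scm(n, rng, do=None):
    """Sample n joint draws; `do` maps variable names to fixed intervention
    values (graph mutilation: an intervened node ignores its parents)."""
    do = do or {}
    X = np.full(n, do["X"]) if "X" in do else rng.normal(0.0, 1.0, n)
    Z = np.full(n, do["Z"]) if "Z" in do else 2.0 * X + rng.normal(0.0, 1.0, n)
    Y = 3.0 * Z + rng.normal(0.0, 1.0, n)
    return X, Z, Y

def t(do, n=200_000, seed=0):
    """Monte Carlo estimate of the intervention function E[Y | do(...)]."""
    rng = np.random.default_rng(seed)
    _, _, Y = sample_scm(n, rng, do=do)
    return Y.mean()
```

For this linear SCM the two intervention functions in $\mathcal{T}$ are $t_X(x) = 6x$ and $t_Z(z) = 3z$, so `t({"X": 1.0})` and `t({"Z": 2.0})` both estimate the value 6.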
We first characterize the latent structure among experimental outputs and provide an explicit expression for both $f$ and $\mathcal{L}_s$ for each intervention set (§3.1). Based on the properties of $\mathcal{G}$, we clarify when this function exists. Exploiting these results, we detail a new model to learn $\mathcal{T}$, which we call the DAG-GP model (§3.2). In DAG-GP we place a GP prior on $f$ and propagate our prior assumptions through the remaining part of the graph to analytically derive a joint distribution over the elements in $\mathcal{T}$. The resulting prior distribution incorporates the causal structure and enables the integration of observational and interventional data.

### 3.1 Characterization of the latent structure in a DAG

The following results provide a theoretical foundation for the multi-task causal GP model introduced later. In particular, they characterize when $f$ and $\mathcal{L}_s$ exist and how to compute them, thus fully characterizing when transfer across experiments is possible. All proofs are given in the appendix.

**Definition 3.1.** Consider a DAG $\mathcal{G}$ where the treatment variables are denoted by $\mathbf{X}$. Let $\mathbf{C}$ be the set of variables directly confounded with $Y$, $\mathbf{C}^N$ be the set of variables in $\mathbf{C}$ that are not colliders,⁴ and $\mathbf{I}$ be the set $\mathrm{Pa}(Y)$. For each $\mathbf{X}_s \in \mathcal{P}(\mathbf{X})$ we define the following sets:

- $\mathbf{I}^N_s = \mathbf{I} \setminus (\mathbf{X}_s \cap \mathbf{I})$, the set of variables in $\mathbf{I}$ not included in $\mathbf{X}_s$;
- $\mathbf{C}^I_s = \mathbf{C}^N \cap \mathbf{X}_s$, the set of variables in $\mathbf{C}$ which are included in $\mathbf{X}_s$ and are not colliders;
- $\mathbf{C}^N_s = \mathbf{C}^N \setminus \mathbf{C}^I_s$, the set of variables in $\mathbf{C}$ that are neither included in $\mathbf{X}_s$ nor colliders.

In the following theorem, $\mathbf{v}^N_s$ gives the values of the variables in the set $\mathbf{I}^N_s$, while $\mathbf{c}$ represents the values of the set $\mathbf{C}^N$, which are partitioned into $\mathbf{c}^N_s$ and $\mathbf{c}^I_s$ depending on the set $\mathbf{X}_s$ under consideration.

**Theorem 3.1 (Causal operator).** Consider a causal graph $\mathcal{G}$ and a related SCM where the output variable and the treatment variables are denoted by $Y$ and $\mathbf{X}$ respectively. Denote by $\mathbf{C}$ the set of variables in $\mathcal{G}$ that are directly confounded with $Y$, and let $\mathbf{I}$ be the set $\mathrm{Pa}(Y)$.
Assume that $\mathbf{C}$ does not include nodes that have both unconfounded incoming and outgoing edges. It is possible to prove that, for all $\mathbf{X}_s \in \mathcal{P}(\mathbf{X})$, the intervention function $t_s(\mathbf{x}): D(\mathbf{X}_s) \to \mathbb{R}$ can be written as $t_s(\mathbf{x}) = \mathcal{L}_s(f)(\mathbf{x})$, where

$$\mathcal{L}_s(f)(\mathbf{x}) = \int\!\!\int \pi_s(\mathbf{x}, (\mathbf{v}^N_s, \mathbf{c}))\, f(\mathbf{v}, \mathbf{c})\, d\mathbf{v}^N_s\, d\mathbf{c}, \tag{2}$$

with $f(\mathbf{v}, \mathbf{c}) = \mathbb{E}[Y|do(\mathbf{I} = \mathbf{v}), \mathbf{C}^N = \mathbf{c}]$ representing a shared latent function and $\pi_s(\mathbf{x}, (\mathbf{v}^N_s, \mathbf{c})) = p(\mathbf{c}^I_s|\mathbf{c}^N_s)\, p(\mathbf{v}^N_s, \mathbf{c}^N_s|do(\mathbf{X}_s = \mathbf{x}))$ giving the integrating measure for the set $\mathbf{X}_s$.

We call $\mathcal{L}_s(f)(\mathbf{x})$ the causal operator, $\mathbf{I} \cup \mathbf{C}$ the base set, $f(\mathbf{v}, \mathbf{c})$ the base function, and $\pi_s(\cdot, \cdot)$ the integrating measure of the set $\mathbf{X}_s$. A simple limiting case arises when the DAG does not include variables directly confounded with $Y$, or when $\mathbf{C}$ only includes colliders. In this case $\mathbf{C}^N = \varnothing$ and the base function is included in $\mathcal{T}$. Theorem 3.1 provides a mechanism to reconstruct all causal effects emerging from $\mathcal{P}(\mathbf{X})$ using the base function as a "driving force". In particular, the integrating measures can be seen as Green's functions incorporating the DAG structure [3].

Examples of $\pi_s(\mathbf{x}, (\mathbf{v}^N_s, \mathbf{c}))$: For the DAG in Fig. 4(a), the integrating measure for $t_X(x)$ is $\pi_X(x, (\mathbf{v}^N_X, \mathbf{c})) = p(z|do(X = x))$, as $\mathbf{I}^N_X = \{Z\} \setminus (\{X\} \cap \{Z\}) = \{Z\}$, $\mathbf{C}^I_X = \varnothing$ and $\mathbf{C}^N_X = \varnothing$; thus $\mathbf{v}^N_s = z$, while $\mathbf{c}^I_s$ and $\mathbf{c}^N_s$ are not defined. Similar expressions can be derived for the DAG in Fig. 4(b). Focusing on the computation of $t_B(b) = \mathbb{E}[Y|do(B = b)]$, we have $\mathbf{I}^N_B = \{E, D\} \setminus (\{B\} \cap \{E, D\}) = \{E, D\}$, $\mathbf{C}^I_B = \{B\}$ and $\mathbf{C}^N_B = \{A\}$. We can then write $\pi_B(b, (\mathbf{v}^N_B, \mathbf{c})) = p(b'|a)\, p(d, e, a|do(B = b)) = p(b')\, p(d, e, a|do(B = b))$. Full details for all DAGs used in this paper are given in the appendix.

⁴Variables in $\mathbf{C}$ causally influenced by $X$ and $Y$.

Figure 2: Posterior mean and variance for $t_X(x)$ in the DAG of Fig. 4(a) (without the red edge). In both plots, $m_X(\cdot)$ and $K_X(\cdot, \cdot)$ give the posterior mean and standard deviation respectively. Left: Comparison between the DAG-GP model and a single-task GP model (GP).
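Theorem 3.1 can be sketched numerically in the simplest unconfounded case, the chain X → Z → Y: there $\mathbf{C} = \varnothing$, the base set is $\mathbf{I} = \mathrm{Pa}(Y) = \{Z\}$, the base function is $f(z) = \mathbb{E}[Y|do(Z = z)]$, and the integrating measure for $t_X$ is $\pi_X(x, z) = p(z|do(X = x))$. The structural equations below (Z = 2X + noise, Y = 3Z + noise) are our own illustrative assumptions, not taken from the paper.

```python
import numpy as np

def f(z):
    # Base function for this SCM: E[Y | do(Z = z)] = 3z.
    return 3.0 * z

def t_X_via_operator(x, n=200_000, seed=0):
    """t_X(x) = L_X(f)(x): average f over draws from p(z | do(X = x))."""
    rng = np.random.default_rng(seed)
    z = 2.0 * x + rng.normal(0.0, 1.0, n)  # p(z | do(X = x)) = N(2x, 1)
    return f(z).mean()

def t_X_direct(x, n=200_000, seed=1):
    """Reference value: simulate the intervened SCM end to end."""
    rng = np.random.default_rng(seed)
    z = 2.0 * x + rng.normal(0.0, 1.0, n)
    y = 3.0 * z + rng.normal(0.0, 1.0, n)
    return y.mean()
```

Both estimators agree on $t_X(x) = 6x$ for this SCM: integrating the base function against the integrating measure recovers the intervention function, exactly as Eq. (2) states.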
DAG-GP captures the behaviour of $t_X(x)$ in areas where $\mathcal{D}^I$ is not available (see the area around $x = 2$) while reducing the uncertainty via transfer, thanks to the available data for $z$ (see appendix). Right: Comparison between DAG-GP with the causal prior (DAG-GP+) and with a standard prior with zero mean and RBF kernel (DAG-GP). In addition to transfer, DAG-GP+ captures the behaviour of $t_X(x)$ in areas where $\mathcal{D}^O$ (black) is available (see the region $[-2, 0]$) while inflating the uncertainty in areas with no observational data.

While these results can be further generalized to select $\mathbf{I}$ different from $\mathrm{Pa}(Y)$, this choice is particularly useful due to the following result.

**Corollary 3.1 (Minimality of $\mathbf{I}$).** The smallest set $\mathbf{I}$ for which Eq. (2) holds is given by $\mathrm{Pa}(Y)$.

The dimensionality of $\mathbf{I}$ when chosen as $\mathrm{Pa}(Y)$ has properties that have been previously studied in the literature. In the context of optimization [1], it corresponds to the so-called causal intrinsic dimensionality, which refers to the effective dimensionality of the space in which a function is optimized when causal information is available. The existence of $f$ depends on the properties of the nodes in $\mathbf{C}$, which also represents the smallest set for which Eq. (2) holds (§1.4 in the supplement).

**Theorem 3.2 (Existence of $f$).** If $\mathbf{C}$ includes nodes that have both unconfounded incoming and outgoing edges, the function $f$ does not exist.

When $f$ does not exist, full transfer across all functions in $\mathcal{T}$ is not possible (DAGs with red edges in Fig. 4). However, these results enable a model for partial transfer across a subset of $\mathcal{T}$ (§2 of the supplement).

### 3.2 The DAG-GP model

Next, we introduce the DAG-GP model based on the results from the previous section.

Model likelihood: Let $\mathcal{D}^I = (\mathbf{X}^I, \mathbf{Y}^I)$ be the interventional dataset defined in Section 2.1. Denote by $\mathbf{T}^I$ the collection of intervention vector-valued functions computed at $\mathbf{X}^I$.
Each entry $y^I_{si}$ in $\mathbf{Y}^I$ is assumed to be a noisy observation of the corresponding function $t_s$ at $\mathbf{x}^I_i$:

$$y^I_{si} = t_s(\mathbf{x}^I_i) + \epsilon_{si}, \quad \text{for } s = 1, \dots, |\mathcal{P}(\mathbf{X})| \text{ and } i = 1, \dots, N^I_s, \tag{3}$$

with $\epsilon_{si} \sim \mathcal{N}(0, \sigma^2)$. In compact form, the joint likelihood function is $p(\mathbf{Y}^I|\mathbf{T}^I, \sigma^2) = \mathcal{N}(\mathbf{T}^I, \sigma^2 \mathbf{I})$.

Prior distribution on $\mathcal{T}$: To define a joint prior on the set of intervention functions, $p(\mathcal{T})$, we take the following steps. First, we follow [1] to place a causal prior on $f$, the base function of the DAG. Second, we propagate this prior on $f$ through all elements in $\mathcal{T}$ via the causal operator in Eq. (2).

Step 1, causal prior on the base function: The key idea of the causal prior, already used in [1], is to use the observational dataset $\mathcal{D}^O$ and the do-calculus to construct the prior mean and variance of a GP that is used to model an intervention function. Our aim is to compute such a prior for the causal effect of the base set $\mathbf{I} \cup \mathbf{C}$ on $Y$. The causal prior has the benefit of carrying causal information, but at the expense of requiring $\mathcal{D}^O$ to estimate the causal effect. Any sensible prior can be used in this step, so the availability of $\mathcal{D}^O$ is not strictly necessary. However, in this paper we stick to the causal prior since it provides an explicit way of combining experimental and observational data. For simplicity we use $\mathbf{b} = (\mathbf{v}, \mathbf{c})$ to denote in compact form the values of the variables in the base set, $\mathbf{I} = \mathbf{v}$ and $\mathbf{C} = \mathbf{c}$. Using do-calculus we can compute $\hat{f}(\mathbf{b}) = \hat{f}(\mathbf{v}, \mathbf{c}) = \hat{\mathbb{E}}[Y|do(\mathbf{I} = \mathbf{v}), \mathbf{c}]$ and $\hat{\sigma}(\mathbf{b}) = \hat{\sigma}(\mathbf{v}, \mathbf{c}) = \hat{\mathbb{V}}[Y|do(\mathbf{I} = \mathbf{v}), \mathbf{c}]^{1/2}$, where $\hat{\mathbb{V}}$ and $\hat{\mathbb{E}}$ represent the variance and expectation of the causal effects estimated from $\mathcal{D}^O$. The causal prior is defined as:

$$f(\mathbf{b}) \sim \mathcal{GP}(m(\mathbf{b}), K(\mathbf{b}, \mathbf{b}')), \quad m(\mathbf{b}) = \hat{f}(\mathbf{b}), \quad K(\mathbf{b}, \mathbf{b}') = k_{\text{RBF}}(\mathbf{b}, \mathbf{b}') + \hat{\sigma}(\mathbf{b})\hat{\sigma}(\mathbf{b}'),$$

where $m(\mathbf{b})$ and $K(\mathbf{b}, \mathbf{b}')$ represent the prior mean and covariance respectively. The term $k_{\text{RBF}}(\mathbf{b}, \mathbf{b}') := \sigma^2_f \exp(-\|\mathbf{b} - \mathbf{b}'\|^2 / (2l^2))$ denotes the radial basis function (RBF) kernel, which is added to provide more flexibility to the model.
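A minimal sketch of this construction on a 1-D base set: `fhat` and `sighat` stand in for the do-calculus estimates $\hat{\mathbb{E}}$ and $\hat{\mathbb{V}}^{1/2}$ computed from $\mathcal{D}^O$, but the concrete callables passed in below are hypothetical placeholders, not estimates from real data.

```python
import numpy as np

def k_rbf(b1, b2, sigma_f=1.0, l=1.0):
    """RBF kernel evaluated elementwise on arrays of inputs."""
    return sigma_f**2 * np.exp(-((b1 - b2) ** 2) / (2.0 * l**2))

def causal_prior(b_grid, fhat, sighat):
    """Causal prior on a grid: mean m(b) = fhat(b) and covariance
    K(b, b') = k_RBF(b, b') + sighat(b) * sighat(b')."""
    m = fhat(b_grid)
    B1, B2 = np.meshgrid(b_grid, b_grid, indexing="ij")
    K = k_rbf(B1, B2) + np.outer(sighat(b_grid), sighat(b_grid))
    return m, K
```

The rank-one term $\hat{\sigma}(\mathbf{b})\hat{\sigma}(\mathbf{b}')$ inflates the prior variance exactly where the do-calculus estimate is itself uncertain, and since both summands are positive semi-definite, $K$ remains a valid covariance.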
Step 2, propagating the distribution to all elements in $\mathcal{T}$: In Section 3.1 we showed how, for all $\mathbf{X}_s \in \mathcal{P}(\mathbf{X})$, $t_s(\mathbf{x}) = \mathcal{L}_s(f)(\mathbf{x})$, with $f$ given by the intervention function defined in Theorem 3.1. By linearity of the causal operator, placing a GP prior on $f$ induces a well-defined joint GP prior distribution on $\mathcal{T}$. In particular, for each $\mathbf{X}_s \in \mathcal{P}(\mathbf{X})$, we have $t_s(\mathbf{x}) \sim \mathcal{GP}(m_s(\mathbf{x}), k_s(\mathbf{x}, \mathbf{x}'))$ with:

$$m_s(\mathbf{x}) = \int\!\!\int m(\mathbf{b})\, \pi_s(\mathbf{x}, \mathbf{b}_s)\, d\mathbf{b}_s, \tag{4}$$

$$k_s(\mathbf{x}, \mathbf{x}') = \int\!\!\int K(\mathbf{b}, \mathbf{b}')\, \pi_s(\mathbf{x}, \mathbf{b}_s)\, \pi_s(\mathbf{x}', \mathbf{b}'_s)\, d\mathbf{b}_s\, d\mathbf{b}'_s, \tag{5}$$

where $\mathbf{b}_s = (\mathbf{v}^N_s, \mathbf{c})$ is the subset of $\mathbf{b}$ including only the $\mathbf{v}$ values corresponding to the set $\mathbf{I}^N_s$. Let $\mathcal{D}$ be a finite set of inputs for the functions in $\mathcal{T}$, that is $\mathcal{D} = \bigcup_s \{\mathbf{x}_{si}\}_{i=1}^M$. $\mathcal{T}$ computed at $\mathcal{D}$ follows a multivariate Gaussian distribution, that is $\mathbf{T}_\mathcal{D} \sim \mathcal{N}(m_\mathcal{T}(\mathcal{D}), K_\mathcal{T}(\mathcal{D}, \mathcal{D}))$, with $K_\mathcal{T}(\mathcal{D}, \mathcal{D}) = (K_\mathcal{T}(\mathbf{x}, \mathbf{x}'))_{\mathbf{x} \in \mathcal{D}, \mathbf{x}' \in \mathcal{D}}$ and $m_\mathcal{T}(\mathcal{D}) = (m_\mathcal{T}(\mathbf{x}))_{\mathbf{x} \in \mathcal{D}}$. In particular, for two generic data points $\mathbf{x}_{si}, \mathbf{x}_{s'j} \in \mathcal{D}$, with $s$ and $s'$ denoting two distinct functions, we have $m_\mathcal{T}(\mathbf{x}_{si}) = \mathbb{E}[t_s(\mathbf{x}_i)] = m_s(\mathbf{x}_i)$ and $K_\mathcal{T}(\mathbf{x}_{si}, \mathbf{x}_{s'j}) = \mathrm{Cov}[t_s(\mathbf{x}_i), t_{s'}(\mathbf{x}_j)]$. When computing the covariance function across intervention sets and intervention levels we differentiate between two cases. When both $t_s$ and $t_{s'}$ are different from $f$, we have:

$$\mathrm{Cov}[t_s(\mathbf{x}_i), t_{s'}(\mathbf{x}_j)] = \int\!\!\int K(\mathbf{b}, \mathbf{b}')\, \pi_s(\mathbf{x}_i, \mathbf{b}_s)\, \pi_{s'}(\mathbf{x}_j, \mathbf{b}'_{s'})\, d\mathbf{b}_s\, d\mathbf{b}'_{s'}.$$

If one of the two functions equals $f$, this expression further reduces to:

$$\mathrm{Cov}[t_s(\mathbf{x}_i), t_{s'}(\mathbf{x}_j)] = \int K(\mathbf{b}, \mathbf{b}')\, \pi_{s'}(\mathbf{x}_j, \mathbf{b}'_{s'})\, d\mathbf{b}'_{s'}.$$

Note that the integrating measures $\pi_s(\cdot, \cdot)$ and $\pi_{s'}(\cdot, \cdot)$ allow computing the covariance between points that are defined on spaces of possibly different dimensionality, a scenario that traditional multi-output GP models are unable to handle. The prior $p(\mathcal{T})$ makes it possible to merge different data types and to account for the natural correlation structure among interventions defined by the topology of the DAG. For this reason we call this formulation the DAG-GP model. The parameters in Eqs. (4)-(5) can be computed in closed form only when $K(\mathbf{b}, \mathbf{b}')$ is an RBF kernel and the integrating measures are assumed to be Gaussian distributions.
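The propagation step can be sketched by Monte Carlo for the chain X → Z → Y, where the integrating measure $\pi_X(x, z) = p(z|do(X = x))$ is assumed Gaussian, here $\mathcal{N}(2x, 1)$ (an illustrative assumption). `m` plays the role of the prior mean of the base function, and the RBF below stands in for the prior kernel $K$ restricted to the base set.

```python
import numpy as np

def k_rbf(z1, z2, l=1.0):
    return np.exp(-((z1 - z2) ** 2) / (2.0 * l**2))

def induced_prior(x, x2, m, n=100_000, seed=0):
    """Monte Carlo estimates of m_X(x) (Eq. 4) and k_X(x, x') (Eq. 5)."""
    rng = np.random.default_rng(seed)
    z = 2.0 * x + rng.normal(0.0, 1.0, n)    # draws from pi_X(x, .)
    z2 = 2.0 * x2 + rng.normal(0.0, 1.0, n)  # independent draws from pi_X(x', .)
    m_x = m(z).mean()                        # E_{pi_X}[m(b)]
    k_xx2 = k_rbf(z, z2).mean()              # double integral via paired independent draws
    return m_x, k_xx2
```

With $m(b) = b$, the induced mean is $m_X(x) \approx 2x$: the prior mean of $t_X$ is the prior mean of the base function pushed through the integrating measure, which is what Eq. (4) expresses.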
In all other cases, one needs to resort to numerical approximations, e.g. Monte Carlo integration, in order to compute the parameters of each $t_s(\mathbf{x})$.

Posterior distribution on $\mathcal{T}$: The posterior distribution $p(\mathbf{T}_\mathcal{D}|\mathcal{D}^I)$ can be derived analytically via standard GP updates. For any set $\mathcal{D}$, $p(\mathbf{T}_\mathcal{D}|\mathcal{D}^I)$ will be Gaussian with parameters $m_{\mathcal{T}|\mathcal{D}^I}(\mathcal{D}) = m_\mathcal{T}(\mathcal{D}) + K_\mathcal{T}(\mathcal{D}, \mathbf{X}^I)[K_\mathcal{T}(\mathbf{X}^I, \mathbf{X}^I) + \sigma^2\mathbf{I}]^{-1}(\mathbf{T}^I - m_\mathcal{T}(\mathbf{X}^I))$ and $K_{\mathcal{T}|\mathcal{D}^I}(\mathcal{D}, \mathcal{D}) = K_\mathcal{T}(\mathcal{D}, \mathcal{D}) - K_\mathcal{T}(\mathcal{D}, \mathbf{X}^I)[K_\mathcal{T}(\mathbf{X}^I, \mathbf{X}^I) + \sigma^2\mathbf{I}]^{-1}K_\mathcal{T}(\mathbf{X}^I, \mathcal{D})$. See Fig. 2 for an illustration of the DAG-GP model. The time complexity of the algorithm is $O(N^3)$, with $N$ denoting the size of $\mathcal{D}^I$. This complexity can be reduced by resorting to sparse GP approximations, e.g. inducing-point approximations.

## 4 A helicopter view

Different variations of the DAG-GP model can be considered depending on the availability of both observational data $\mathcal{D}^O$ and interventional data $\mathcal{D}^I$ (Fig. 3). Our goal here is not to be exhaustive, nor prescriptive, but to help give some perspective. When $\mathcal{D}^I$ is not available, do-calculus is the only way to learn $\mathcal{T}$, which in turn requires $\mathcal{D}^O$.

Figure 3: Models for learning the intervention functions $\mathcal{T}$ defined on a DAG. The do-calculus allows estimating $\mathcal{T}$ when only observational data are available. When interventional data are also available, one can use a single-task model (denoted by GP) or a multi-task model (denoted by DAG-GP). When both data types are available, one can combine them using the causal prior parameters, represented by $m^+(\cdot)$ and $k^+(\cdot, \cdot)$. The resulting models are denoted by GP+ and DAG-GP+.

Figure 4: Examples of DAGs (in black) for which $f$ exists and the DAG-GP model can be formulated.
The red edges, if added, prevent the identification of $f$, making transfer via DAG-GP impossible.

When both data types are unavailable, learning $\mathcal{T}$ via a probabilistic model is not possible unless the causal effects can be transported from an alternative population. In this case, mechanistic models based on physical knowledge of the process under investigation are the only option. When $\mathcal{D}^I$ is available, one can consider a single-task or a multi-task model. If $f$ does not exist, a single GP model needs to be considered for each intervention function. Depending on the availability of $\mathcal{D}^O$, integrating observational data into the prior distribution (denoted by GP+) or adopting a standard prior (denoted by GP) are the two alternatives. In both cases, the experimental information is not shared across functions, and learning $\mathcal{T}$ requires intervening on all sets in $\mathcal{P}(\mathbf{X})$. When instead $f$ exists, DAG-GP can be used to transfer interventional information and, depending on $\mathcal{D}^O$, also to incorporate observational information a priori (DAG-GP+).

## 5 Experiments

This section evaluates the performance of the DAG-GP model on two synthetic settings and on a real-world healthcare application (Fig. 4). We first learn $\mathcal{T}$ with fixed observational and interventional data (§5.1) and then use the DAG-GP model to solve active learning (AL) (§5.2) and Bayesian optimization (BO) (§5.3) problems.⁵ Implementation details are given in the supplement.

Baselines: We run our algorithm both with (DAG-GP+) and without (DAG-GP) the causal prior and compare against the alternative models described in Fig. 3. Note that we do not compare against alternative multi-task GP models because, as mentioned in Section 1.2, the models existing in the literature cannot be straightforwardly applied to our problem. In addition, given that we assume full identifiability of the causal effects and availability of $\mathcal{D}^I$, the mean results for GP+ correspond to the results we would get by applying the gID procedure in [24] (see §1.2 for a discussion of this method).
Performance measures: We run all models with different initialisations of $\mathcal{D}^I$ and different sizes of $\mathcal{D}^O$. We report root mean square error (RMSE) performance together with standard errors across replicates. For the AL experiments we show the RMSE evolution as the size of $\mathcal{D}^I$ increases. For the BO experiments we report the convergence performance to the global optimum.

⁵Code and data for all the experiments are provided at https://github.com/VirgiAgl/DAG-GP.

Table 1: RMSE performances across 10 initializations of $\mathcal{D}^I$ (standard errors in brackets; the first five columns use $N = 30$, the last five $N = 100$, with $N$ the size of $\mathcal{D}^O$; "do" stands for the do-calculus, for which no standard error applies). See Fig. 3 for details on the compared methods.

| | DAG-GP+ | DAG-GP | GP+ | GP | do | DAG-GP+ | DAG-GP | GP+ | GP | do |
|------|-------------|-------------|-------------|-------------|------|-------------|-------------|-------------|-------------|------|
| DAG1 | 0.46 (0.06) | 0.57 (0.09) | 0.60 (0.2) | 0.77 (0.27) | 0.70 | 0.43 (0.05) | 0.57 (0.08) | 0.45 (0.05) | 0.77 (0.27) | 0.52 |
| DAG2 | 0.44 (0.1) | 0.45 (0.13) | 0.62 (0.10) | 1.26 (0.11) | 1.40 | 0.36 (0.09) | 0.41 (0.12) | 0.58 (0.07) | 1.28 (0.11) | 1.41 |
| DAG3 | 0.05 (0.04) | 0.44 (0.12) | 0.23 (0.03) | 0.89 (0.23) | 0.18 | 0.06 (0.04) | 0.44 (0.12) | 0.48 (0.06) | 0.89 (0.23) | 0.23 |

Figure 5: AL results. Convergence of the RMSE across functions in $\mathcal{T}$ and across replicates as more experiments are collected. DAG-GP+ is our algorithm with the causal prior, while DAG-GP is our algorithm with a standard prior. "# interventions" is the number of experiments for each $\mathbf{X}_s$. Shaded areas give standard deviations. See Fig. 3 for details on the compared methods.

### 5.1 Learning $\mathcal{T}$ from data

We test the algorithm on the DAGs in Fig. 4 and refer to them as (a) DAG1, (b) DAG2 and (c) DAG3. DAG3 is taken from [33] and [13] and is used to model the causal effect of statin drugs on the levels of prostate-specific antigen (PSA). We consider the nodes $\{A, C\}$ in DAG2 and $\{$age, BMI, cancer$\}$ in DAG3 to be non-manipulative. We set the size of $\mathcal{D}^I$ to $5 \cdot |\mathcal{T}|$ for DAG1 ($|\mathcal{T}| = 2$), to $3 \cdot |\mathcal{T}|$ for DAG2 ($|\mathcal{T}| = 6$) and to $|\mathcal{T}|$ for DAG3 ($|\mathcal{T}| = 3$). As expected, GP+ outperforms GP by incorporating the information in $\mathcal{D}^O$ (Tab. 1).
Interestingly, GP+ also outperforms DAG-GP in DAG3 when $N = 30$ and in DAG1 when $N = 100$. This depends on the effect that $\mathcal{D}^O$ has, through its size $N$ and its coverage of the interventional domains, on both the causal prior and the estimation of the integrating measures. Lower $N$ and coverage imply not only a less precise estimation of the do-calculus quantities but also a worse estimation of the integrating measures, and thus a lower transfer of information. Higher $N$ and coverage imply more accurate estimation of the causal prior parameters and enhanced transfer of information across experiments. In addition, the way $\mathcal{D}^O$ affects the performance is specific to the DAG structure and to the distribution of the exogenous variables in the SCM. More importantly, Tab. 1 shows how DAG-GP+ consistently outperforms all competing methods by successfully integrating different data sources and transferring interventional information across the functions in $\mathcal{T}$. Differently from competing methods, these results hold across different $N$ and $\mathcal{D}^I$ values, making DAG-GP+ a robust default choice for any application.

### 5.2 DAG-GP as surrogate model in Active Learning

The goal of AL is to design a sequence of function evaluations in order to learn a target function as quickly as possible. Denote by $\mathcal{D}$ a set of inputs for the functions in $\mathcal{T}$, that is $\mathcal{D} = \bigcup_s \mathcal{D}_s$ with $\mathcal{D}_s \subset D(\mathbf{X}_s)$, and let $\mathcal{A} \subset \mathcal{D}$ be a set of size $k$. We would like to select $\mathcal{A}$, that is, select both the functions to be observed and the evaluation locations, so as to maximize the reduction in entropy at the remaining unobserved locations:

$$\mathcal{A}^* = \arg\max_{\mathcal{A}: |\mathcal{A}| = k} H(\mathcal{T}(\mathcal{D} \setminus \mathcal{A})) - H(\mathcal{T}(\mathcal{D} \setminus \mathcal{A})|\mathcal{T}(\mathcal{A})),$$

where $\mathcal{T}(\mathcal{D} \setminus \mathcal{A})$ denotes the set of functions $\mathcal{T}$ evaluated at $\mathcal{D} \setminus \mathcal{A}$, $\mathcal{T}(\mathcal{D} \setminus \mathcal{A})|\mathcal{T}(\mathcal{A})$ gives the distribution of $\mathcal{T}$ at $\mathcal{D} \setminus \mathcal{A}$ given that we have observed $\mathcal{T}(\mathcal{A})$, and $H(\cdot)$ represents the entropy. While this problem is NP-complete, Krause et al. [20] proposed an efficient greedy algorithm providing an approximation of $\mathcal{A}^*$.
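Under a Gaussian posterior, each candidate's entropy is monotone in its predictive variance, so the greedy selection reduces to repeatedly picking the highest-variance candidate and conditioning on it. A minimal sketch (our illustration; the kernel and noise level in the usage below are assumed):

```python
import numpy as np

def greedy_al(K, k, noise=1e-4):
    """K: prior covariance over all candidate (intervention set, level)
    pairs. Returns the indices of the k greedily selected evaluations."""
    K = K.copy()
    chosen = []
    for _ in range(k):
        var = np.diag(K).copy()
        var[chosen] = -np.inf          # never reselect a point
        j = int(np.argmax(var))
        chosen.append(j)
        kj = K[:, j].copy()
        # Gaussian conditioning on a noisy observation at j (rank-one update).
        K = K - np.outer(kj, kj) / (K[j, j] + noise)
    return chosen
```

On an RBF covariance over a 1-D grid, the selected points spread out across the domain, since conditioning on an observation collapses the variance in its neighbourhood.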
This algorithm starts with $\mathcal{A} = \varnothing$ and solves the problem sequentially by selecting, at every step $j$, the point

$$\mathbf{x}_{sj} = \arg\max_{\mathbf{x}_{sj} \in \mathcal{D} \setminus \mathcal{A}} H(t_s(\mathbf{x})|\mathcal{A}) - H(t_s(\mathbf{x})|\mathcal{D} \setminus (\mathcal{A} \cup \mathbf{x}_{sj})).$$

Figure 6: BO results. Convergence of the CBO algorithm to the global optimum ($\mathbb{E}[Y|do(\mathbf{X}_s = \mathbf{x})]$) when our algorithm is used as a surrogate model with (DAG-GP+) and without (DAG-GP) the causal prior. See the supplement for standard deviations across replicates.

In order to select the next intervention level and intervention set, while properly accounting for uncertainty reduction, one can use the DAG-GP model for $\mathcal{T}$. Fig. 5 shows the RMSE performance as more interventional data are collected and the DAG-GP model is used within the AL algorithm proposed by [20]. Across different $N$ settings, DAG-GP+ converges to the lowest RMSE faster than competing methods by collecting evaluations in areas where: (i) $\mathcal{D}^O$ does not provide information, and (ii) the predictive variance is not reduced by the experimental information transferred from the other interventions. As mentioned before, $\mathcal{D}^O$ impacts the causal prior parameters via the do-calculus computations. When the latter are less precise, because of lower $N$ or lower coverage of the interventional domains, the model variances for DAG-GP+ and GP+ are inflated. Therefore, when DAG-GP+ or GP+ are used as surrogate models, the interventions are collected mainly in areas where $\mathcal{D}^O$ is not observed, thus slowing down the exploration of the interventional domains and the convergence to the minimum RMSE (Fig. 5, DAG2, $N = 100$). See Section 5 in the supplement for more details on the use of the DAG-GP model within AL.

### 5.3 DAG-GP as surrogate model in Bayesian optimization

The goal of BO is to optimize a function that is costly to evaluate, and for which an explicit functional form is not available, by making a series of function evaluations. In a recent work, [1] introduced the CBO algorithm, which finds the intervention optimising a target variable in a causal graph.
In order to find the optimal intervention, CBO places a single-task GP model on each of the intervention functions in a DAG. By modeling these functions independently, CBO does not account for their correlation when exploring the intervention space (see Section 6 in the supplement for more details). Replacing the independent surrogate models used by CBO with the DAG-GP framework significantly speeds up the convergence to the global optimum. This is shown in Fig. 6, where DAG-GP (with and without the causal prior) is compared against the single-task models.

6 Conclusions

This paper addresses the problem of modelling the correlation structure of a set of intervention functions defined on the DAG of a causal model. We propose the DAG-GP model, which is based on a theoretical analysis of the DAG structure and allows sharing experimental information across interventions while integrating observational and interventional data via do-calculus. Our results demonstrate how DAG-GP outperforms competing approaches in terms of fitting performance. In addition, our experiments show how integrating decision making algorithms with the DAG-GP model is crucial when designing optimal experiments, as DAG-GP accounts for the uncertainty reduction obtained by transferring interventional data. Future work will extend the DAG-GP model to allow for the transfer of experimental information across environments whose DAGs are partially different. In addition, we will focus on combining the proposed framework with a causal discovery algorithm so as to account for uncertainty in the graph structure.

Broader Impact

Computing causal effects is an integral part of scientific inquiry, spanning a wide range of questions such as understanding behaviour in online systems, assessing the effect of social policies, or investigating the risk factors for diseases.
By combining the theory of causality with machine learning techniques, Causal Machine Learning algorithms have the potential to strongly impact society and businesses by answering what-if questions, enabling policy evaluation, and allowing for data-driven decision making in real-world contexts. The algorithm proposed in this paper falls into this category and focuses on addressing causal questions in a fast and accurate way. As shown in the experiments, when used within decision making algorithms, the DAG-GP model has the potential to speed up the learning process and to enable optimal experimentation decisions by accounting for the multiple causal connections existing in the process under investigation and their cross-correlations. Our algorithm can be used by practitioners in several domains. For instance, it can be used to learn about the impact of environmental variables on coral calcification [12] or to analyse the effects of drugs on cancer antigens [13]. In terms of methodology, while the DAG-GP model represents a step towards a better model for automated decision making, it is based on the crucial assumption of knowing the causal graph. Learning the intervention functions of an incorrect causal graph might lead to incorrect inference and sub-optimal decisions. Therefore, more work needs to be done to account for the uncertainty in the graph structure.

Acknowledgements

This work was supported by the EPSRC grant EP/L016710/1, The Alan Turing Institute under EPSRC grant EP/N510129/1, and the Lloyds Register Foundation programme on Data Centric Engineering. MAA has been financed by the EPSRC Research Projects EP/R034303/1 and EP/T00343X/1. MAA has also been supported by the Rosetrees Trust (ref: A2501).

References

[1] Aglietti, V., Lu, X. L., Paleyes, A., and González, J. (2020). Causal Bayesian Optimization. In Artificial Intelligence and Statistics.
[2] Alaa, A. M. and Van der Schaar, M. (2017).
Bayesian inference of individualized treatment effects using multi-task Gaussian processes. In Advances in Neural Information Processing Systems, pages 3424–3432.
[3] Álvarez, M., Luengo, D., and Lawrence, N. D. (2009). Latent force models. In Artificial Intelligence and Statistics, pages 9–16.
[4] Álvarez, M. A., Rosasco, L., Lawrence, N. D., et al. (2012). Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266.
[5] Bareinboim, E., Forney, A., and Pearl, J. (2015). Bandits with unobserved confounders: A causal approach. In Advances in Neural Information Processing Systems, pages 1342–1350.
[6] Bareinboim, E. and Pearl, J. (2012). Causal inference by surrogate experiments: z-identifiability. arXiv preprint arXiv:1210.4842.
[7] Bareinboim, E. and Pearl, J. (2013). Meta-transportability of causal effects: A formal approach. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 135–143.
[8] Bareinboim, E. and Pearl, J. (2014). Transportability from multiple environments with limited experiments: Completeness results. In Advances in Neural Information Processing Systems, pages 280–288.
[9] Bonilla, E. V., Chai, K. M., and Williams, C. (2008). Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems, pages 153–160.
[10] Buesing, L., Weber, T., Zwols, Y., Racaniere, S., Guez, A., Lespiau, J.-B., and Heess, N. (2018). Woulda, coulda, shoulda: Counterfactually-guided policy search. arXiv preprint arXiv:1811.06272.
[11] Cochran, W. and Cox, G. (1957). Experimental Design. John Wiley and Sons, Inc., New York, NY.
[12] Courtney, T. A., Lebrato, M., Bates, N. R., Collins, A., De Putron, S. J., Garley, R., Johnson, R., Molinero, J.-C., Noyes, T. J., Sabine, C. L., et al. (2017). Environmental controls on modern scleractinian coral and reef-scale calcification. Science Advances, 3(11):e1701356.
[13] Ferro, A., Pina, F., Severo, M., Dias, P., Botelho, F., and Lunet, N. (2015). Use of statins and serum levels of prostate specific antigen. Acta Urológica Portuguesa, 32(2):71–77.
[14] Foerster, J. N., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. (2018). Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence.
[15] Galles, D. and Pearl, J. (2013). Testing identifiability of causal effects. arXiv preprint arXiv:1302.4948.
[16] Greenewald, K., Katz, D., Shanmugam, K., Magliacane, S., Kocaoglu, M., Adsera, E. B., and Bresler, G. (2019). Sample efficient active learning of causal trees. In Advances in Neural Information Processing Systems, pages 14279–14289.
[17] Guo, R., Cheng, L., Li, J., Hahn, P. R., and Liu, H. (2018). A survey of learning causality with data: Problems and methods. arXiv preprint arXiv:1809.09337.
[18] Hauser, A. and Bühlmann, P. (2014). Two optimal strategies for active learning of causal models from interventional data. International Journal of Approximate Reasoning, 55(4):926–939.
[19] He, Y.-B. and Geng, Z. (2008). Active learning of causal networks with intervention experiments and optimal designs. Journal of Machine Learning Research, 9(Nov):2523–2547.
[20] Krause, A., Singh, A., and Guestrin, C. (2008). Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9(Feb):235–284.
[21] Lattimore, F., Lattimore, T., and Reid, M. D. (2016). Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems, pages 1181–1189.
[22] Lee, S. and Bareinboim, E. (2018). Structural causal bandits: where to intervene? In Advances in Neural Information Processing Systems, pages 2568–2578.
[23] Lee, S. and Bareinboim, E. (2019). Structural causal bandits with non-manipulable variables.
Technical Report R-40, Purdue AI Lab, Department of Computer Science, Purdue University.
[24] Lee, S., Correa, J. D., and Bareinboim, E. (2020). General identifiability with arbitrary surrogate experiments. In Uncertainty in Artificial Intelligence, pages 389–398. PMLR.
[25] Lu, C., Schölkopf, B., and Hernández-Lobato, J. M. (2018). Deconfounding reinforcement learning in observational settings. arXiv preprint arXiv:1812.10576.
[26] Magliacane, S., van Ommen, T., Claassen, T., Bongers, S., Versteeg, P., and Mooij, J. M. (2018). Domain adaptation by using causal inference to predict invariant conditional distributions. In Advances in Neural Information Processing Systems, pages 10846–10856.
[27] Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4):669–688.
[28] Pearl, J. (2000). Causality: models, reasoning and inference, volume 29. Springer.
[29] Pearl, J. and Bareinboim, E. (2011). Transportability of causal and statistical relations: A formal approach. In Twenty-Fifth AAAI Conference on Artificial Intelligence.
[30] Rasmussen, C. E. (2003). Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer.
[31] Rojas-Carulla, M., Schölkopf, B., Turner, R., and Peters, J. (2018). Invariant models for causal transfer learning. The Journal of Machine Learning Research, 19(1):1309–1342.
[32] Rubenstein, P. K., Tolstikhin, I., Hennig, P., and Schölkopf, B. (2017). Probabilistic active learning of functions in structural causal models. arXiv preprint arXiv:1706.10234.
[33] Thompson, C. (2019). Causal graph analysis with the causalgraph procedure. https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2019/2998-2019.pdf.
[34] Ye, C., Butler, L., Bartek, C., Iangurazov, M., Lu, Q., Gregory, A., Girolami, M., and Middleton, C. (2019). A digital twin of bridges for structural health monitoring. In 12th International Workshop on Structural Health Monitoring 2019. Stanford University.
[35] Zhang, K., Schölkopf, B., Muandet, K., and Wang, Z. (2013). Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pages 819–827.