# bayesian_active_causal_discovery_with_multifidelity_experiments__e6949228.pdf

Bayesian Active Causal Discovery with Multi-Fidelity Experiments

Zeyu Zhang Gaoling School of Artificial Intelligence Renmin University of China zeyuzhang@ruc.edu.cn

Chaozhuo Li Microsoft Research Asia cli@microsoft.com

Xu Chen Gaoling School of Artificial Intelligence Renmin University of China xu.chen@ruc.edu.cn

Xing Xie Microsoft Research Asia xingx@microsoft.com

This paper studies the problem of active causal discovery when the experiments can be done based on multi-fidelity oracles, where higher fidelity experiments are more precise and expensive, while the lower ones are cheaper but less accurate. In this paper, we formally define the task of multi-fidelity active causal discovery, and design a probabilistic model for solving this problem. In specific, we first introduce a mutual-information based acquisition function to determine which variable should be intervened at which fidelity, and then a cascading model is proposed to capture the correlations between different fidelity oracles. Beyond the above basic framework, we also extend it to the batch intervention scenario. We find that the theoretical foundations behind the widely used and efficient greedy method do not hold in our problem. To solve this problem, we introduce a new concept called ϵ-submodular, and design a constraint based fidelity model to theoretically validate the greedy method. We conduct extensive experiments to demonstrate the effectiveness of our model.

1 Introduction

Causal discovery aims to learn the causal structure of a set of variables, which is fundamental for many real-world applications, including health caring [1], education [2] drug discovery [3] and protein synthesis [4]. In general, causal structure learning is an NP-hard problem [5], and purely based on the observational datasets, one cannot identify the unique causal structure, where the best result is discovering its Markov equivalence class [6].

To more accurately identify the unique causal structure, a promising direction is active causal discovery (ACD), where the model is allowed to actively intervene the causal structure to query key information for orienting the causal relations between different variables. For example, to study the causal relations between the drugs and diseases, one can conduct clinical tests via selectively administering the medicines to the patients. The key of active causal discovery is how to design effective experiments when the total cost (e.g., the number of experiments) is limited. To achieve this goal, recent years have witnessed many promising models. For example, Agrawal et al. [7] proposes to intervene on the variables which can orient as many as possible undirected edges. Tigas et al. [8]

Co-first authors. Corresponding author.

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

designs a mutual information based method to determine the variables and values to be intervened, and study both single and batch intervention scenarios.

While the above models have achieved remarkable successes, they only allow to query a single oracle (e.g., the real causal structure) for the experiments3. However, in many real-world applications, the experiments can be done via different methods. For example, to investigate the drug-disease causal relations, in addition to the clinical tests, one can also build simulators to obtain the medicine effects on the patients [9]. Usually, each experimental method corresponds to a unique oracle, and different oracles have various fidelites. Higher-fidelity experiments are more accurate but expensive, for example, administering drugs to the real patients. Lower-fidelity experiments are cheaper but inaccurate, for example, using patient simulators. These different fidelity oracles may offer better cost-benefit trade-offs, which, however, cannot be handled by existing active causal discovery models.

To bridge the above gap, in this paper, we formally define the task of active causal discovery with multi-fidelity oracles, where the model has to actively select which variables and values to intervene at which fidelities. This task is non-trivial due to the following reasons: to begin with, because of the introduction of multi-fidelity oracles, the model has to strategically choose the lower-cost and informative enough experiments to uncover the real causal structure, which needs our special designs. Then, given the experiment results with different fidelities, how to infer the real causal structure is also not easy, since the experiment results can be not produced from the oracle corresponding to the real causal structure. In addition, in practice, an efficient experiment should allow simultaneously intervening multiple variables [10]. However, how to extend our model to the batch intervention scenario is still not clear.

To overcome the above challenges, we design a Bayesian active causal discovery model, which is composed of two components. The first one is a mutual information (MI) based acquisition function. It aims to select the interventional variables, values and fidelities which are more informative for the real causal structure. The second one is a cascading fidelity model. In specific, we first regard the highest fidelity oracle as the real causal structure, and then a cascading model is built to correlate different fidelity oracles, so that the experiment results at one fidelity can be leveraged to infer the oracle at another fidelity. To achieve more efficient experiments, we also extend our model to the batch intervention scenario. Previously, the greedy method is demonstrated to be an efficient and effective strategy for batch intervention [8]. However, we found that, by allowing multi-fidelity oracles, the theoretical foundations behind the greedy method do not hold. For alleviating this problem, we introduce a new concept called ϵ-submodular, and design a constraint-based fidelity model to theoretically validate the greedy method.

The main contributions of this paper are summarized as follows: (1) we formally define the task of active causal discovery with multi-fidelity oracles, which, to our knowledge, is the first time in the field of causal discovery. (2) To solve the above task, we propose a Bayesian framework, which is composed of a mutual information based acquisition function and a cascading fidelity model. (3) To extend our framework to the batch intervention scenario, we introduce a constraint-based fidelity model, which provides theoretical guarantees for the efficient greedy method. (4) We conduct extensive experiments to demonstrate the effectiveness of our model.

2 Preliminaries

2.1 Structure Causal Model

Structure causal model (SCM) is an effective language for describing and learning the causal relations between different random variables [11]. Usually, SCM is composed of a causal graph and a set of structure equation models (SEM).

For the causal graph, we denote it by G = V, E , where V is the node set, and E is the adjacency matrix. Each node in V corresponds to a variable. Suppose there are d variables in our studied problem, then we use XV = [X1, X2, . . . , Xd] to denote the variable set. The adjacency matrix E {0, 1}d d describes the causal relations between different variables. Eij = 1 means that Xi is a

3In the following, we may interchangeably use oracle , causal structure and structure causal model to represent the underlying model for generating the results of the experiments.

parent of Xj, and there exists an edge from Xi to Xj, while Eij = 0 indicates that there is no edge between Xi and Xj.

For the structure equation models, we denote them by g = {g1, g2, ..., gd}, where each gi quantitatively describes the relation between Xi and its parents. Formally, we implement F with the commonly used additive noise models (ANM) [12], that is:

Xi = gi(pa(i); γi) + ϵi, ϵi N(0, σ2 i ), (1)

where γi is the parameter set of gi, and the noise term ϵi follows the Gaussian distribution with σ2 i as the variance. We denote the complete parameter set of g as θ = {γ, σ}, where γ = {γ1, γ2, . . . , γd} and σ = {σ1, σ2, . . . , σd}. Noted that, given the above equation, we can easily derive the distribution of XV , that is, p(XV ) = Qd i=1 N(gi(pa(i); γi), σ2 i ).

Based on the above formulation, given an observational dataset D = {xk}N k=1 p(XV ), causal discovery aims to learn the adjacency matrix E, or more generally, simultaneously identify E and the SEM parameter θ. In this paper, we focus on the general case, and denote ϕ = (θ, E).

2.2 Active Causal Discovery

Previous work has demonstrated that, purely based on the observational dataset, the real causal graph can only be identifiable to its Markov equivalence class. Active causal discovery holds the promise of identifying more accurate causal graph via designing interventional experiments.

Formally, an interventional experiment is represented by e = {(j, v)}, which means cutting all the causal relations pointing to Xj, and fixing the value of Xj as v. In causal learning, the experiment e can also be represented by do(Xj = v). Obviously, the distribution of XV is changed after the experiment e, and we denote the experiment-induced distribution by p(XV |do(Xj = v)). In practice, we cannot access the implementation of p, but can only observe the experiment result sampled from p(XV |do(Xj = v)). The key of active causal discovery is to design a series of experiments within limited budgets, such that the results can be better leveraged to identify ϕ.

2.3 Multi-Fidelity Active Causal Discovery

Existing ACD models mostly obtain the experiment results via interacting with the real SCM. However, in practice, the experiments can be done in different ways (e.g., real clinical tests or patient simulators). Each type of experiment corresponds to an underlying oracle, which produces the results of the experiments. Different oracles may provide better cost-benefit trade-offs for the experiment designs, which are failed to be considered by the previous work.

Formally, suppose we have M oracles, and the parameters of the ith oracle is denoted by ϕi. Let the experiment cost of the ith oracle be λi, and without loss of generality, we assume λ1 λ2 λM. Intuitively, if an oracle is more accurate (i.e., has higher fidelity), then it should be more expensive4. We regard the real SCM as the most accurate oracle, thus we set ϕM = ϕ. We denote all the oracle parameters and costs as Φ = {ϕ1, ϕ2, ..., ϕM} and Λ = {λ1, λ2, ..., λM}, respectively.

Due to the introduction of multi-fidelity oracles, the experiment in traditional active causal discovery is extended to be a triplet e = {(j, v), m}, where in addition to the intervention pair (j, v), the fidelity m should also be considered in the experiment design. We define the dataset for model training as D = {et, xt}T t=1, where et = {(j, v), m} indicates the distribution for generating xt. In general, {(j, v), m} means that xt is generated from pm(XV |do(Xj = v)), which is induced by ϕm and the intervention (j, v). If (j, v) = , then xt is an observational data, which is sampled from ϕm without any intervention. Finally, we define the task of multi-fidelity active causal discovery as follows:

Definition 1 (Multi-Fidelity Active Causal Discovery (MFACD)). Given M oracles with different costs Λ, we need to design a model f, which can strategically determine the intervention pair (j, v) and fidelity m to achieve better cost-benefit trade-off in terms of identifying the real SCM ϕM.

4Noted that if some oracle costs more than another one with higher fidelity, then this oracle is useless, since one may always choose to query the higher fidelity oracle.

3 The Licence Model

For solving the task of MFACD, we design a Bayesian framework called Licence (Multi-fidelity active learning for causal discovery), which is composed of two components. The first one is an acquisition function based on mutual information, which is responsible for designing the experiments. The second one is a cascaded fidelity model, which is designed to capture the correlation between different ϕi s. The experiment results obtained from the first component are leveraged to update the fidelity model, and ϕM in the fidelity model is the finally predicted result. At last, we introduce how to extend our model to the batch intervention scenario.

3.1 MI-based Acquisition Function

Intuitively, a better experiment should leverage little cost to reveal as much as possible information about the real SCM ϕM. Thus, we design the following acquisition function:

f(j, v, m) = I(x; ϕM|e, D)

where e = {(j, v), m} is the experiment to be designed. D is the dataset already collected, which will be enlarged after each experiment. I(x; ϕM|e, D) is the mutual information, which indicates that if we conduct experiment e, then how much information the experiment result may share with the target parameter ϕM. Obviously, we should select e which can lead to larger I(x; ϕM|e, D). λm is the cost of e. By dividing I(x; ϕM|e, D) with λm, we trade-off the experiment informativeness and cost. To determine (j, v, m), we derive an estimator for f(j, v, m) following the idea of Bayesian Active Learning by Disagreement (BALD) [13], that is:

f(j, v, m) = H(x|e, D) H(x|ϕM, e, D)

= Ep(x|e,D) log Ep(ϕm|e,D)[p(x|e, ϕm)] + Ep(ϕM|D) Ep(x|e,ϕM)[log p(x|e, ϕM)]

where p(x|e, D) and p(ϕm|e, D) correspond to the distributions of x and ϕm after observing D under the intervention (j, v). p(ϕM|D) is the posterior of ϕM after observing D. p(x|e, ϕm) is the probability of x based on ϕm intervened by (j, v). We approximate the expectation operator based Monte Carlo sampling. The detailed derivation process can be seen in the appendix.

Obviously, determining the best (j, v, m) equals to solving the following problem: {j , v , m } = arg max {j,v,m} f(j, v, m). (3)

In our task, the intervention value v is continuous. Following the previous work, we firstly learn the optimal v for each (j, m) pair based on Bayesian optimization (BO). Then, we compare all the results, and select the solution which can lead to the largest f(j, v, m). We present the detailed Bayesian optimization process in the appendix. It should be noted that one can also leverage more advanced BO methods for jointly learning (j, v, m) [14]. However, we do not find significant performance improvements by these models.

3.2 Cascaded Fidelity Model

Intuitively, the experiment results from different oracles may share common information. The samples at one fidelity may help to infer the oracles at other fidelities. To capture the correlations between different oracles, we build a cascaded probabilistic model (see Figure 1), where the oracles with different fidelities are successively connected, and the observed samples are only determined by their corresponding oracles.

Formally, to achieve more robust optimization, we first regard the discrete adjacency matrix E as the samples from Bernoulli distribution. In specific, we let Eij Bernoulli(σ(ST i T j)), where S, T RK d (K d) are two continuous matrices. S i and T j are the ith and jth columns of S and T, respectively. By replacing E with S and T, we revise the parameter set ϕ as (θ, S, T).

For each fidelity m, we assign a prior distribution of ϕm given ϕm 1 as follows:

p(ϕ1) e β f(S1,T1) N(0, I),

p(ϕm|ϕm 1) e β f(Sm,Tm) N(aϕm 1 + b, σ2I), m 2, (4)

Figure 1: The cascaded probabilistic model is shown above. Different fidelity oracles are successively connected, and the observed samples are only determined by their corresponding oracles.

where we add subscript m to indicate different fidelities. f(Sm, Tm) = Ep(E|Sm,Tm)[λ1 {trace e E d} + λ2 ||E||1] is a regularizer to encourage E to be a sparse and directed acyclic graph [15]. λ1, λ2, a, b and σ2 are hyper-parameters.

Since the real SCM is ϕM, we need to infer the posterior p(ϕM|D), where D is the initially observational dataset or experiment results. Directly computing p(ϕM|D) is not easy, since the dataset D may contain samples from different fidelity oracles. To solve this problem, we first obtain the joint distribution p(Φ|D), where Φ = {ϕ1, ϕ2, ..., ϕM} is the collection of all the oracle parameters. Then we derive p(ϕM|D) by marginalizing out {ϕ1, ϕ2, ..., ϕM 1}.

To efficiently compute and sample from p(Φ|D), we introduce a variational approximator q(Φ), which is specified as follows:

q(ϕ1) N(0, I),

qψm(ϕm|ϕm 1) N(cmϕm 1 + dm, σ2 m I), m 2,

q(Φ) = q(ϕ1)

m=2 q(ϕm|ϕm 1),

where ψm = {cm, dm, σm} is the set of learnable parameters, and we denote Ψ = {ψm}M m=2. According to the theory of variational inference [16], we maximize the following evidence lower bound (ELBO) [16] to learn Ψ:

ELBO =EΦ q(Φ) [log p(D|Φ) KL(q(Φ)||p(Φ))] ,

=EΦ q(Φ) [log p(D|Φ) log q(Φ) + log p(Φ)] , (6)

where the likelihood p(D|Φ) can be easily obtained based on equation (1). More detailed derivation on the ELBO can be seen in the appendix. Once we have learned Ψ, the posterior p(Φ|D) is approximated by q(Φ), and further, we have the following theory: Theorem 1. If p(Φ|D) q(Φ), then for any variable sets A, B Φ, p(A|B, D) q(A|B). As a special case p(ϕM|D) q(ϕM).

The proof of this theory is immediate, since

p(A|B, D) = p(A, B|D)

Φ A B p(Φ|D)dΦ A B R

Φ B p(Φ|D)dΦ B

Φ A B q(Φ)dΦ A B R

Φ B q(Φ)dΦ B

q(B) = q(A|B)

Let A = ϕM and B = , we have p(ϕM|D) q(ϕM). Based on this theory, we leverage q to replace p in (2) for easy sampling. Remark. According to the specification of qψm(ϕm|ϕm 1), we can easily demonstrate that q(Φ) follows Gaussian distribution. Such property enables us to use the reparameterization trick [17] to relate Φ with Ψ. Different from traditional variational models, in our objective, the adjacency matrix

E in f(Sm, Tm) is sampled from a discrete distribution, which cuts down the back-propagation signal. To solve this problem, we leverage gumbel-softmax to further associate E with (Sm, Tm) Φ. Since Φ can be further represented by Ψ, all the variables in (6) can be reparameterized by Ψ, which enables us to optimize it in an end to end manner. We present the detailed derivation process and the complete learning algorithm in the appendix.

3.3 Extension to Batch Intervention

In practice, simultaneously intervening multiple variables can be more efficient due to the lower frequency on interacting with the oracles. However, under the setting of batch intervention, the candidate intervention space exponentially increases with respect to the number of targets. Suppose we need to select c out of d variables for intervention, then the size of the candidate space is dc. Previous work found that the greedy strategy is a both efficient and effective method for the batch intervention scenario [8]. From the efficiency perspective, the greedy method only need to search in a cd-sized candidate space. From the effectiveness perspective, people demonstrate that the mutual information obtained by the greedy strategy is not worse than that of the optimal solution multiplied by (1 1

e) [18]. In the following, we introduce how to extend our model to the batch intervention scenario, and leverage the greedy strategy to solve our task.

Objective for batch intervention in MFACD. We use the following objective for batch intervention scenario: arg max {ei}n i=1 I({xi}n i=1; ϕM|{ei}n i=1, D),

i=1 λi C, (7)

where ei = {(ji, vi), mi} is an experiment, xi is the observed sample from the experiment ei, that is, xi pmi(XV |do(Xji = vi)). λi is the cost of experiment ei. This objective aims to design a series of experiments with budget C, which can reveal the information about ϕM as much as possible. The number of intervention targets n is not a fixed value, which is constrained by the total budget.

The greedy method for MFACD. The greedy method designs each experiment independently, and at each step, it selects the experiment which can maximize the average information gain. Following [19], the kth experiment is determined based on the following objective:

I({xi}k 1 i=1 xk; ϕM|{ei}k 1 i=1 ek, D) I({xi}k 1 i=1 ; ϕM|{ei}k 1 i=1 , D)) λm

m=1 λm + λk C,

where {ei}k 1 i=1 is the previously designed experiments, and is fixed when learning ek.

What s wrong with the greedy method. The theoretical foundations of the greedy method is demonstrated by the previous work [19] as follows: Theorem 2. If I(x; ϕM|e, D) is (1) submodular and (2) non-decreasing, then

I({xg i }n i=1; ϕM|{eg i }n i=1, D) (1 1

e)I({x i }n i=1; ϕM|{e i }n i=1, D), (9)

where {eg i }n i=1 is the solution obtained from the greedy method, and {e i }n i=1 is the optimal solution.

However, due to the introduction of multi-fidelity, in our model, I(x; ϕM|e, D) is actually not submodular. More specifically, in the proof of submodular, two samples xs and xt has to be independent given ϕM (e.g., see B.4 in [8]). In the single-fidelity setting, this requirement naturally holds, since ϕM is exactly the parameter used to sample xs and xt. However, when we introduce multi-fidelity, xs and xt may not independent given ϕM, since they are only directly influenced by their own oracles (see Figure 1).

An improved greedy method tailored to MFACD. To alleviate the above problem, in this section we design an improved greedy method tailored to our task. In specific, we first define several new concepts, and then build theories based on these concepts to inspire our model designs.

Definition 2 (ϵ-independent). For random variables A, B and C, if their mutual information satisfy I(A; B|C) ϵ, then we say A and B are ϵ-independent given C.

Definition 3 (ϵ-submodular). Suppose f( ) is a set function on Ω. If for any A, B Ω, A B and x Ω\B, f(A {x}) f(A) f(B {x}) f(B) ϵ, then we say f is ϵ-submodular on Ω.

Based on the above two definitions, we have:

Theorem 3. For any two experiments es and et, if the corresponding samples xs and xt are ϵ-independent given ϕM, {es, et} and D, then I( ; ϕM| , D) is ϵ-submodular.

Theorem 4. If I( ; ϕM| , D) is ϵ-submodular on X and non-decrease, for any i, j, λi

λj Bλ, then

I({xg i }n i=1; ϕM|{eg i }n i=1, D) (1 e 1 Bλ )I({x i }n i=1; ϕM|{e i }n i=1, D) B,

where B = ϵ Bλ Pn i=1(1 1 Bλn)i 1.

Following this theory, we improve objective (6) to a constaint-based ELBO as follows to capture the degree of independence between different experiment results:

max EΦ q(Φ) [log p(D|Φ) log q(Φ) + log p(Φ)] ,

{es,et} I(xs; xt|ϕM, {es, et}, D) ϵ. (10)

The proofs of the above theories are presented in the appendix, and similar to objective (2), we use Monte Carlo method to approximate the mutual information in objective (10).

4 Related Works

Bayesian Active Causal Discovery. Causal discovery [20 22] refers to recovering causality in a set of variables, especially trying to find a directed acyclic graph (DAG) that can represent the relationship between variables in a system. Active causal discovery was first proposed in [23, 24] with the assumption that the data is discrete-valued. In active causal discovery, the experimenter attempts to intervene on the variables in the system at each step, utilizes the interventional data to recover causal relation, and finally identifies the entire causal structure with minimal cost [25]. Many methods have been discussed in different settings over the past decades, including continuous linear Bayesian networks [26, 27], non-linear causal models [28], and large-scale causal models [8].

Multi-fidelity Settings. The fidelity commonly refers to how accurate the model or environment can be when providing information. Higher-fidelity models are more accuracy but cost much, while lower ones are less accurate but cheaper. In order to combine the strength of each model, multi-fidelity models are proposed to achieve an accurate model with lower costs. Multi-fidelity models can be divided into two main categories [29]: Multi-fidelity Surrogate Models (MFSM) [30 32] and Multi-fidelity Hierarchical Models (MFHM) [33 35]. In addition to optimization, multi-fidelity models can also be used for uncertainty propagation [36] and statistical inference [37]. Recent works also start to study the multi-fidelity settings when conducting Bayesian experiments [38 40], and adopt deep learning frameworks to solve corresponding problems.

Our paper makes a first step towards multi-fidelity active causal discovery, and solve the non-trivial challenges when combining the above two fields, which, to the best of our knowledge, is the first time in the causal inference domain.

5 Experiments

In this section, we conduct experiments to demonstrate the effectiveness of our model, where we focus on the following problems: (1) whether our model can achieve better performance than the previous ACD methods? (2) Whether the constraint in objective (10) in necessary? (3) How the DAG regularization coefficient influence the model performance? In the following, we first introduce the experiment setup, and then present and analyze the results.

Figure 2: Results of the overall performance on different datasets and budgets. Lower SHD, RMSE or larger AUPRC indicate better performances. We conduct each experiment for ten times, and report the average performances and error bars.

5.1 Experimental Setup

We experiment with three commonly used causal discovery datasets, including Erd os-Rényi graph (ER) [41], Scale-Free graph (SF) [42] and DREAM [43]. To demonstrate the effectiveness of our model, we compare it with AIT [44] and CBED [8], which are the recent state-of-the-art models in this field. Since they cannot select different fidelities, we design two variants for each of the baseline, that is, X-REAL and X-RANDOM, which means that the model always interacts with the ground truth oracle ϕM or randomly select the oracles. Here X is AIT or CBED. For the evaluation metrics, we use SHD [45], AUPRC [46] and RMSE to evaluate different models. The first two metrics aim to evaluate the accuracy of the learned topological structure, and the last one measures the performance of functional relations. For single intervention, we first generate several observational samples to initialize the model. Then, we indicate a total intervention budget, and let the model interact with the causal graph with different fidelities until the budget runs out. For each interaction, the model will provide an intervention, and correspondingly obtain a sample from the oracles, which is used to update the model. Finally, the model outputs the estimated causal graph, which is leveraged to evaluate the performance. For batch intervention, we indicate the total budget for each intervention step, and the model determines n interventions simultaneously, which are delivered to the oracles to obtain the samples. We present more detailed settings in the appendix.

5.2 Overall Performance

In this experiment, we evaluate the models under different total budgets, and we present the results on ER and SF datasets with 10 graph nodes. The experiments on DREAM and more nodes are presented in the appendix. From the results shown in Figure 2, we can see: as the total budget becomes larger, the performances of all the models tend to increase on both datasets. This is not surprising, since more experiment budgets enable us to conduct more interventions or query more accurate oracles, which can reveal more information about the ground truth and facilitate more accurate causal discovery. In most cases, our model can perform better than the baselines across different datasets, evaluation metrics and intervention budgets. These results demonstrate the effectiveness of our model. In specific, on the metrics of SHD, AUPRC and RMSE, our model can on average improve performance of the best baseline by about 27.74%, 82.35% and 22.74% on ER, and 17.69%, 60.27% and 21.62%

(a) Batch Intervention with 30 Budget Model SHD AUPRC (%) RMSE (%) AIT-REAL 20.10 2.26 17.74 0.32 3.11 0.00 AIT-RANDOM 19.92 2.81 17.57 0.43 3.53 0.01 CBED-REAL 20.12 2.17 20.25 0.19 3.51 0.00 CBED-RANDOM 19.34 1.72 24.32 0.47 3.74 0.01 Licence (w/o reg) 17.10 3.71 32.55 1.81 3.12 0.02 Licence 16.75 5.68 37.94 1.90 2.66 0.01

Figure 3: (a) The results of experiments on ER graph with 10 graph nodes under the batch intervention scenario. The average performance and error bars are provided. (b) The results of Licence model with different DAG regulation coefficient β s. The experiment is conducted based on ER graph with 10 graph nodes.

on SF, respectively. If we look more carefully, we find that, for both AIT and CBED, randomly querying the oracles can sometimes perform better than always interacting with the ground truth oracles. This result suggests that lower fidelity oracles can be helpful to trade-off the performance and cost. However, the random method is still suboptimal, and designing more principled and tailored strategies to select the fidelities is necessary, which is evidenced by the lowered performance of X-RANDOM than our model.

5.3 Necessity of the Mutual Information Constraint in Objective (10)

In our model, the mutual information constraint in objective (10) aims to make the greedy method validate. In this section, we study whether it is necessary by experiments. To achieve this goal, we introduce a variant of our model Licence (w/o reg), where we remove the mutual information constraint. We evaluate the models based on the dataset of ER with 10 graph nodes, and the total budget is set as 30. The results are presented in Figure 3(a). We can see, in some cases, X-RANDOM performs better than X-REAL , which is consistent with the above experiments, and demonstrates that always querying the ground truth oracle may not lead to better performance under limited budget. By comparing our model with the variant Licence (w/o reg), we find that our model can consistently achieve better performances on all the evaluation metrics. In specific, Licence can improve the performance of Licence (w/o reg) by about 2.05%, 16.57% and 14.92% on SHD, AUPRC and RMSE respectively. These results demonstrate that the mutual information constraint is necessary in our model, which empirically verifies the theories proposed in section 3.3.

5.4 Influence of the DAG regularization coefficient

In this section, we analysis the influence of the DAG regulation coefficient β in equation (4). The results are reported based on AUPRC. The coefficient β reflects the importance of DAG regulation when updating the model. As β increases, the DAGness is more emphasized for the causal graph. In this subsection, we conduct experiments for various β, ranging from 0.0 to 1.0, and the results are shown in Figure 3(b). We can see as β increases, the performance of AUPRC improves as well, and peaks at β = 0.8. That is probably because lower β will decrease the acyclic property of causal graphs, which is incompatible with the prior knowledge of true causal graphs. However, higher coefficient may also impact the optimization process, which leads to sub-optimal results. We think a trade-off between mild constraint for easy optimization and solid constraint for DAG property is supposed to take into consider for different tasks and settings.

6 Conclusion

This paper formally defines the task of active causal discovery with multi-fidelity oracles, which, to our knowledge, is the first time in the causal discovery domain. To solve this task, we propose a Bayesian framework, which is composed of a mutual information based acquisition function and a cascading fidelity model. We also extend our framework to the batch intervention scenario, and propose a constraint-based fidelity model to validate the efficient greedy method.

This paper actually makes an initial step toward considering different oracles in active causal discovery. There is much room left for improvement. To begin with, one can design more advanced batch intervention strategies, which can bypass the greedy method and does not need to introduce the mutual information constraint in the fidelity model. In addition, since the experiments in active causal discovery are conducted sequentially, and the former experiment results may influence the latter ones, it is interesting to consider the experiment designs as a Markov decision process, and leverage reinforcement learning to optimize the total information gains of all the experiments in a more principled manner.

ACKNOWLEDGMENTS

This work is supported in part by National Key R&D Program of China (2022ZD0120103), National Natural Science Foundation of China (No. 62102420), Beijing Outstanding Young Scientist Program NO. BJJWZYJH012019100020098, Intelligent Social Governance Platform, Major Innovation & Planning Interdisciplinary Platform for the "Double-First Class" Initiative, Renmin University of China, Public Computing Cloud, Renmin University of China, fund for building world-class universities (disciplines) of Renmin University of China, Intelligent Social Governance Platform.

[1] Jeffrey S Levin. Religion and health: Is there an association, is it valid, and is it causal? Social Science & Medicine, 38(11):1475 1482, 1994.

[2] Eric R Eide and Mark H Showalter. Methods matter: Improving causal inference in educational and social science research: A review article. Economics of Education Review, 31(5):744 748, 2012.

[3] Bengt Muthén and Hendricks C Brown. Estimating drug effects in the presence of placebo response: causal inference using growth mixture modeling. Statistics in medicine, 28(27): 3363 3385, 2009.

[4] Karen Sachs, Omar Perez, Dana Pe er, Douglas A Lauffenburger, and Garry P Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721): 523 529, 2005.

[5] David Maxwell Chickering. Learning bayesian networks is np-complete. Learning from data: Artificial intelligence and statistics V, pages 121 130, 1996.

[6] Thomas S Verma and Judea Pearl. Equivalence and synthesis of causal models. In Probabilistic and Causal Inference: The Works of Judea Pearl, pages 221 236. 2022.

[7] Raj Agrawal, Chandler Squires, Karren Yang, Karthikeyan Shanmugam, and Caroline Uhler. Abcd-strategy: Budgeted experimental design for targeted causal structure discovery. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3400 3409. PMLR, 2019.

[8] Panagiotis Tigas, Yashas Annadani, Andrew Jesson, Bernhard Schölkopf, Yarin Gal, and Stefan Bauer. Interventions, where and how? experimental design for causal models at scale. In Advances in Neural Information Processing Systems.

[9] Azam Peyvandipour, Nafiseh Saberian, Adib Shafi, Michele Donato, and Sorin Draghici. A novel computational approach for drug repurposing using systems biology. Bioinformatics, 34 (16):2817 2825, 2018.

[10] Yashas Annadani, Panagiotis Tigas, Desi R Ivanova, Andrew Jesson, Yarin Gal, Adam Foster, and Stefan Bauer. Differentiable multi-target causal bayesian experimental design. ar Xiv preprint ar Xiv:2302.10607, 2023.

[11] Judea Pearl and Dana Mackenzie. The book of why: the new science of cause and effect. Basic books, 2018.

[12] Patrik Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. Advances in neural information processing systems, 21, 2008.

[13] Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. ar Xiv preprint ar Xiv:1112.5745, 2011.

[14] Samuel Daulton, Xingchen Wan, David Eriksson, Maximilian Balandat, Michael A Osborne, and Eytan Bakshy. Bayesian optimization over discrete and mixed spaces via probabilistic reparameterization. ar Xiv preprint ar Xiv:2210.10199, 2022.

[15] Xun Zheng, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. Dags with no tears: Continuous optimization for structure learning. Advances in neural information processing systems, 31, 2018.

[16] David M Blei, Alp Kucukelbir, and Jon D Mc Auliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859 877, 2017.

[17] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013.

[18] Andreas Krause and Carlos E Guestrin. Near-optimal nonmyopic value of information in graphical models. ar Xiv preprint ar Xiv:1207.1394, 2012.

[19] Shibo Li, Jeff M Phillips, Xin Yu, Robert Kirby, and Shandian Zhe. Batch multi-fidelity active learning with budget constraints. Advances in Neural Information Processing Systems, 35: 995 1007, 2022.

[20] Christina Heinze-Deml, Marloes H Maathuis, and Nicolai Meinshausen. Causal structure learning. Annual Review of Statistics and Its Application, 5:371 391, 2018.

[21] Clark Glymour, Kun Zhang, and Peter Spirtes. Review of causal discovery methods based on graphical models. Frontiers in genetics, 10:524, 2019.

[22] Matthew J Vowels, Necati Cihan Camgoz, and Richard Bowden. D ya like dags? a survey on structure learning and causal discovery. ACM Computing Surveys, 55(4):1 36, 2022.

[23] Kevin P Murphy. Active learning of causal bayes net structure. Technical report, technical report, UC Berkeley, 2001.

[24] Simon Tong and Daphne Koller. Active learning for structure in bayesian networks. In International joint conference on artificial intelligence, volume 17, pages 863 869. Citeseer, 2001.

[25] Yang-Bo He and Zhi Geng. Active learning of causal networks with intervention experiments and optimal designs. Journal of Machine Learning Research, 9(Nov):2523 2547, 2008.

[26] Hyunghoon Cho, Bonnie Berger, and Jian Peng. Reconstructing causal biological networks through active learning. Plo S one, 11(3):e0150611, 2016.

[27] Robert Osazuwa Ness, Karen Sachs, Parag Mallick, and Olga Vitek. A bayesian active learning experimental design for inferring signaling networks. In Research in Computational Molecular Biology: 21st Annual International Conference, RECOMB 2017, Hong Kong, China, May 3-7, 2017, Proceedings 21, pages 134 156. Springer, 2017.

[28] Julius von Kügelgen, Paul K Rubenstein, Bernhard Schölkopf, and Adrian Weller. Optimal experimental design via bayesian optimization: active causal structure learning for gaussian process networks. ar Xiv preprint ar Xiv:1910.03962, 2019.

[29] M Giselle Fernández-Godino, Chanyoung Park, Nam-Ho Kim, and Raphael T Haftka. Review of multi-fidelity models. ar Xiv preprint ar Xiv:1609.07196, 2016.

[30] Alexander IJ Forrester, Neil W Bressloff, and Andy J Keane. Optimization using surrogate models and partially converged computational fluid dynamics simulations. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 462(2071):2177 2204, 2006.

[31] TD Robinson, Michael S Eldred, Karen E Willcox, and Robert Haimes. Surrogate-based optimization using multifidelity models with variable parameterization and corrected space mapping. AIAA journal, 46(11):2814 2822, 2008.

[32] Christopher C Fischer, Ramana V Grandhi, and Philip S Beran. Bayesian low-fidelity correction approach to multi-fidelity aerospace design. In 58th AIAA/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, page 0133, 2017.

[33] SA Burton and P Hajela. A variable-complexity approach to second-order reliability-based optimization. Structural and Multidisciplinary Optimization, 25(4):237 250, 2003.

[34] Akil Narayan, Claude Gittelson, and Dongbin Xiu. A stochastic collocation algorithm with multifidelity models. SIAM Journal on Scientific Computing, 36(2):A495 A521, 2014.

[35] Benjamin Peherstorfer, Tiangang Cui, Youssef Marzouk, and Karen Willcox. Multifidelity importance sampling. Computer Methods in Applied Mechanics and Engineering, 300:490 509, 2016.

[36] Leo Wai-Tsun Ng and Michael Eldred. Multifidelity uncertainty quantification using nonintrusive polynomial chaos and stochastic collocation. In 53rd aiaa/asme/asce/ahs/asc structures, structural dynamics and materials conference 20th aiaa/asme/ahs adaptive structures conference 14th aiaa, page 1852, 2012.

[37] Tiangang Cui, Youssef M Marzouk, and Karen E Willcox. Data-driven model reduction for the bayesian solution of inverse problems. International Journal for Numerical Methods in Engineering, 102(5):966 990, 2015.

[38] Shibo Li, Wei Xing, Robert Kirby, and Shandian Zhe. Multi-fidelity bayesian optimization via deep neural networks. Advances in Neural Information Processing Systems, 33:8521 8531, 2020.

[39] Shibo Li, Robert Kirby, and Shandian Zhe. Batch multi-fidelity bayesian optimization with deep auto-regressive networks. Advances in Neural Information Processing Systems, 34:25463 25475, 2021.

[40] Shibo Li, Jeff M Phillips, Xin Yu, Robert Kirby, and Shandian Zhe. Batch multi-fidelity active learning with budget constraints. Advances in Neural Information Processing Systems, 35: 995 1007, 2022.

[41] Paul Erd os, Alfréd Rényi, et al. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17 60, 1960.

[42] Lun Li, David Alderson, John C Doyle, and Walter Willinger. Towards a theory of scale-free graphs: Definition, properties, and implications. Internet Mathematics, 2(4):431 523, 2005.

[43] Thomas Schaffter, Daniel Marbach, and Dario Floreano. Genenetweaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics, 27(16): 2263 2270, 2011.

[44] Yashas Annadani, Jonas Rothfuss, Alexandre Lacoste, Nino Scherrer, Anirudh Goyal, Yoshua Bengio, and Stefan Bauer. Variational causal networks: Approximate bayesian inference over causal structures. ar Xiv preprint ar Xiv:2106.07635, 2021.

[45] Ioannis Tsamardinos, Laura E Brown, and Constantin F Aliferis. The max-min hill-climbing bayesian network structure learning algorithm. Machine learning, 65:31 78, 2006.

[46] Jesse Davis and Mark Goadrich. The relationship between precision-recall and roc curves. In Proceedings of the 23rd international conference on Machine learning, pages 233 240, 2006.

[47] Harold J Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. 1964.

[48] Carl Edward Rasmussen. Gaussian processes in machine learning. In Advanced Lectures on Machine Learning: ML Summer Schools 2003, Canberra, Australia, February 2-14, 2003, Tübingen, Germany, August 4-16, 2003, Revised Lectures, pages 63 71. Springer, 2004.

[49] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. ar Xiv preprint ar Xiv:0912.3995, 2009.

[50] Reuven Y Rubinstein and Dirk P Kroese. Simulation and the Monte Carlo method. John Wiley & Sons, 2016.

[51] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. ar Xiv preprint ar Xiv:1611.00712, 2016.

[52] Dimitri P Bertsekas. Constrained optimization and Lagrange multiplier methods. Academic press, 2014.

[53] Lars Lorch, Jonas Rothfuss, Bernhard Schölkopf, and Andreas Krause. Dibs: Differentiable bayesian structure learning. Advances in Neural Information Processing Systems, 34:24111 24123, 2021.

A Monte Carlo Approximation for f(j, v, m) 15

A.1 Derivation Process for f(j, v, m) . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

A.2 Sampling from p(ϕm|D) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

A.3 Sampling from p(ϕm|ϕM, D) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

A.4 Calculation of p(x|e,ϕm) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

B Bayesian Optimization for Determining (j , v , m ) 21

C Detailed Training Process of ELBO 22

C.1 Derivation Process of ELBO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

C.2 Estimation of ELBO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

C.3 Gaussian Reparameterization Trick . . . . . . . . . . . . . . . . . . . . . . . . . . 24

C.4 Gumbel-softmax Reparameterization Trick . . . . . . . . . . . . . . . . . . . . . 24

C.5 Optimization of ELBO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

D Training Process of Constraint based ELBO 25

E Proof of Theory 3 26

F Proof of Theory 4 26

G Algorithm 28

H More Experiments 28

H.1 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

H.1.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

H.1.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

H.1.3 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

H.2 Simulation of Oracles with Different Fidelities . . . . . . . . . . . . . . . . . . . . 30

H.3 Details of Configurations and Computation . . . . . . . . . . . . . . . . . . . . . 30

H.4 Experiments on DREAM Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

H.5 Experiments on More Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

H.6 Supplementary Experiments on MAE Metric . . . . . . . . . . . . . . . . . . . . 31

H.7 Supplementary Experiments on Different Oracle Settings . . . . . . . . . . . . . . 31

H.8 Supplementary Experiments on Regularization Coefficient λ . . . . . . . . . . . . 32

H.9 Supplementary Experiments on Ablation Studies . . . . . . . . . . . . . . . . . . 32

I Potentially Negative Social Impact 32

J Limitations 34

A Monte Carlo Approximation for f(j, v, m)

A.1 Derivation Process for f(j, v, m)

Considering that the mutual information is not directly tractable, we approximate f(j, v, m) by:

f(j, v, m) = 1 λm K1 L1

c1=1 p(x(k1,l1) m |ϕ(c1) m , e)

+ 1 λm K2 L2 C2

c2=1 log h p(x(k2,l2,c2) m |ϕ(k2) M , e) i ,

where e = {(j, v), m} is the experiment to be designed, ϕ(c1) m , ϕ(k1) m p(ϕm|D), x(k1,l1) m p(x|ϕ(k1) m , e), ϕ(k2) M p(ϕM|D), ϕ(k2,l2) m p(ϕm|ϕ(k2) M , D) and x(k2,l2,c2) m p(x|ϕ(k2,l2) m , e).

We present the detailed approximation process as follows:

f(j, v, m) = 1 λm I(x; ϕM|e, D)

= 1 λm [H(x|e, D) H(x|ϕM, e, D)]

Ep(x|e,D) log p(x|e, D) + Ep(ϕM|D) Ep(x|ϕM,e) [log p(x|e, ϕM)]

Ep(x|e,D) log Ep(ϕm|e,D) [p(x|e, ϕm)]

Ep(ϕM|D) Ep(x|e,ϕM) [log p(x|e, ϕM)]

| {z } F For part E, we can estimate it by

E = 1 λm K1 L1

c1=1 p(x(k1,l1) m |ϕ(c1) m , e)

where for the first expectation on p(ϕm|e, D), we first sample ϕ(k1) m from ϕ(k1) m p(ϕm|e, D) for K1 times, and then for each ϕ(k1) m , we sample x(k1,l1) m from x(k1,l1) m p(x|ϕ(k1) m , e) for L1 times. For the second expectation on p(ϕm|e, D), we sample ϕ(c1) m p(ϕm|e, D) for C1 times.

For part F, we have

F = 1 λm Ep(ϕM|D) Ep(x|ϕM,e) [log p(x|ϕM, e)]

= 1 λm Ep(ϕM|D)

Z p(x|ϕM, e) log p(x|ϕM, e) dx

= 1 λm Ep(ϕM|D)

Z Z p(x|ϕm, e)p(ϕm|ϕM) dϕm log p(x|ϕM, e) dx

= 1 λm Ep(ϕM|D,e)

Z Z p(x|ϕm, e)p(ϕm|ϕM) log p(x|ϕM, e) dx dϕm

= 1 λm Ep(ϕM|D,e)

Z Ep(x|ϕm,e) [p(ϕm|ϕM) log p(x|ϕM, e)] dϕm

= 1 λm Ep(ϕM|D,e) Ep(ϕm|ϕM) Ep(x|ϕm,e) [log p(x|ϕM, e)] .

It can be estimated by

1 λm K2 L2 C2

c2=1 log h p(x(k2,l2,c2) m |ϕ(k2) M , e) i ,

where for the expectation on p(ϕM|D, e), we sample ϕ(k2) M from ϕ(k2) M p(ϕM|e, D) for K2 times. For the expectation on p(ϕm|ϕM), for each ϕ(k2) M , we sample ϕ(k2,l2) m from ϕ(k2,l2) m p(ϕm|ϕ(k2) M ) for L2 times. For the expectation on p(x|ϕm, e), for each ϕ(k2) M and ϕ(k2,l2) m , we sample x(k2,l2,c2) m from x(k2,l2,c2) m p(x|ϕ(k2,l2) m , e) for C2 times.

Therefore, we can conclude that f(j, v, m) can be estimated by

f(j, v, m) = 1 λm K1 L1

c1=1 p(x(k1,l1) m |ϕ(c1) m , e)

+ 1 λm K2 L2 C2

c2=1 log h p(x(k2,l2,c2) m |ϕ(k2) M , e) i ,

where ϕ(c1) m , ϕ(k1) m p(ϕm|D), x(k1,l1) m p(x|ϕ(k1) m , e), ϕ(k2) M p(ϕM|D), ϕ(k2,l2) m p(ϕm|ϕ(k2) M , D) and x(k2,l2,c2) m p(x|ϕ(k2,l2) m , e).

Obviously, the above approximation of f(j, v, m) only depends on p(ϕm|D), p(ϕm|ϕM, D) and p(x|ϕm, e). In the next, we show how to sample from them in Section A.2, A.3 and A.4, respectively.

A.2 Sampling from p(ϕm|D)

Basically, sampling from the posterior of p( |D) is not easy. To solve this problem, as mentioned in the main paper, we introduce a variational probability q to approximate p . In specific, in order to sample from p(ϕm|D), we first obtain a sample ϕ1 from ϕ1 N(0, I), and then get ϕm from the distribution q(ϕm|ϕ1).

q(ϕm|ϕ1) = Z

ϕ2 q(ϕm, ϕm 1, . . . , ϕ2|ϕ1) dϕm 1 . . . dϕ2.

q(ϕm, ϕm 1, . . . , ϕ2|ϕ1) =

i=2 q(ϕi|ϕi 1)

q(ϕi|ϕi 1) = N(ciϕi 1 + di, σ2 i I),

we have q(ϕm|ϕ1) is a Gaussian distribution, which is easy for sampling.

In our model, ci and σ2 i I are diagonal matrices, which means that the dimensions in ϕi are independent with each other. We denote ϕi = [ϕi,1, ϕi,2 . . . ϕi,d], where ϕi,j is the jth element of ϕi. Then, we have

q(ϕm, ϕm 1, . . . , ϕ2|ϕ1) =

j=1 q(ϕm,j, ϕm 1,j, . . . , ϕ2,j|ϕ1,j).

So our target can be converted to calculate the probability q(ϕm,j, ϕm 1,j, . . . , ϕ2,j|ϕ1,j) for all dimensions 1 j d. Let ci,j, di,j and σ2 i,j be the jth element of ci, di and σ2 i , respectvely. We

assume that σi,j = q

c2 i 1,j + 1 σi 1,j (i 4) and σ3,j = σ2,j = e, where e is the hyper-parameter.

Suppose µi,j is the mean of the Gaussian distribution for q(ϕi,j|ϕi 1,j), that is,

µi,j = ci,jϕi 1,j + di,j (i 2),

then, the approximated joint distribution can be represented as

q(ϕm,j, ϕm 1,j, . . . , ϕ2,j|ϕ1,j) =

i=2 q(ϕi,j|ϕi 1,j)

1 2σ2 i,j (ϕi,j µi,j)2 .

Then, we integrate ϕ2,j, ϕ3,j, . . . , ϕm 1,j sequentially to obtain q(ϕm,j|ϕ1,j).

First of all, we integrate ϕ2,j for the joint distribution, where we have:

q(ϕm,j, ϕm 1,j, . . . , ϕ3,j|ϕ1,j)

= Z q(ϕm,j, ϕm 1,j, . . . , ϕ3,j, ϕ2,j|ϕ1,j) dϕ2,j

1 2σ2 i,j (ϕi,j µi,j)2 dϕ2,j

1 2σ2 i,j (ϕi,j µi,j)2 1

1 2σ2 3,j [ϕ3,j (c3,jϕ2,j+d3,j)]2

1 2σ2 2,j [ϕ2,j (w2ϕ1,j+d3,j)]2 dϕ2,j.

Denote c2,j = c2,j and d2,j = d2,j, and because of σ3,j = σ2,j, we have

q(ϕm,j, ϕm 1,j, . . . , ϕ3,j|ϕ1,j)

1 2σ2 i,j (ϕi,j µi,j)2 1

1 2σ2 3,j [ϕ3,j (c3,jϕ2,j+d3,j)]2

1 2σ2 3,j [ϕ2,j ( c2,jϕ1,j+ d2,j)] 2 dϕ2,j

1 2σ2 i,j (ϕi,j µi,j)2 1

n [ϕ3,j (c3,jϕ2,j + d3,j)]2 + ϕ2,j ( c2,jϕ1,j + d2,j) 2o

For S1, we have

S1 = [ϕ3,j (c3,jϕ2,j + d3,j)]2 + ϕ2,j ( c2,jϕ1,j + d2,j) 2

=ϕ2 3,j + c2 3,jϕ2 2,j + d2 3,j + 2d3,jc3,jϕ2,j 2c3,jϕ3,jϕ2,j 2d3,jϕ3,j + ϕ2 2,j + c2 2,jϕ2 1,j + d2 2,j + 2 d2,j c2,jϕ1,j 2 c2,jϕ1,j 2d3,jϕ3,j =(c2 3,j + 1)

ϕ2 2,j + 2(d3,jc3,j c3,jϕ3,j c2,jϕ1,j d2,j)

c2 3,j + 1 +

d3,jc3,j c3,jϕ3,j c2,jϕ1,j d2,j

(d3,jc3,j c3,jϕ3,j c2,jϕ1,j d2,j)2

c2 3,j + 1 + ϕ2 3,j + c2 2,jϕ2 1,j + d2 2,j + 2 d2,j c2,jϕ1,j 2d3,jϕ3,j

=(c2 3,j + 1)

ϕ2,j c3,jϕ3,j + c2,jϕ1,j + d2,j d3,jc3,j

+ c2 3,jϕ2 3,j + c2 2,jc2 3,jϕ2 1,j + d2 2,jc2 3,j + 2d3,j c2,jc3,jϕ1,j c2 2,jϕ2 1,j d2 2,j 2 d2,j c2,jϕ1,j c2 3,j + 1

+ ϕ2 3,j + c2 2,jϕ2 1,j + d2 2,j + 2 d2,j c2,jϕ1,j 2d3,jϕ3,j c2 3,jd2 3,j c2 3,jϕ2 3,j + 2c2 3,jd3,jϕ3,j c2 3,j + 1

+ 2c3,j d2,jd3,j 2c3,jϕ3,j c2,jϕ1,j 2c3,jϕ3,j d2,j + 2 d2,j c2,jc2 3,jϕ1,j 2d3,jc2 3,jϕ3,j c2 3,j + 1 .

Then we have

S1 =(c2 3,j + 1)

ϕ2,j c3,jϕ3,j + c2,jϕ1,j + d2,j d3,jc3,j

+ ϕ2 3,j 2( c2,jc3,jϕ1,j + d2,jc3,j + d3,j)ϕ3,j

c2,jc3,jϕ1,j + d2,jc3,j + d3,j 2 d2 3,j (c2 3,j + 1) c2 3,j + 1

=(c2 3,j + 1)

ϕ2,j c3,jϕ3,j + c2,jϕ1,j + d2,j d3,jc3,j

+ 1 c2 3,j + 1 ϕ3,j (c3,j c2,jϕ1,j + d2,jc3,j + d3,j) 2 d2 3,j.

Therefore we have q(ϕm,j, ϕm 1,j, . . . , ϕ3,j|ϕ1,j)

1 2σ2 i,j (ϕi,j µi,j)2

(c2 3,j +1)

ϕ2,j c3,j ϕ3,j + c2,j ϕ1,j + d2,j d3,j c3,j c2 3,j +1

1 2σ2 3,j (c2 3,j +1)[ϕ3,j (c3,j c2,jϕ1,j+ d2,jc3,j+d3,j)] 2 e

d2 3,j 2σ2 3,j 1 q

The S2 part is the integration form of ϕ2,j N( c3,jϕ3,j+ c2,jϕ1,j+ d2,j d3,jc3,j

c2 3,j+1 , σ2 3,j c2 3,j+1), which is equal to 1, so we have

q(ϕm,j, ϕm 1,j, . . . , ϕ3,j|ϕ1,j)

1 2σ2 i,j (ϕi,j µi,j)2 e

1 2σ2 3,j (c2 3,j +1)[ϕ3,j (c3,j c2,jϕ1,j+ d2,jc3,j+d3,j)] 2

d2 3,j 2σ2 3,j 1 q

c2 3,j + 1 .

We denote c3,j = c3,j c2,j and d3,j = d2,jc3,j + d3,j, and denote r2,j = e

d2 3,j 2σ2 3,j 1

c2 3,j+1, so we have

q(ϕm,j, ϕm 1,j, . . . , ϕ3,j|ϕ1,j)

1 2σ2 i,j (ϕi,j µi,j)2 e

1 2σ2 3,j (c2 3,j +1)[ϕ3,j ( c3,jϕ1,j+ d3,j)] 2 r2,j

1 2σ2 i,j (ϕi,j µi,j)2 1

1 2σ2 4 [ϕ4 (c4ϕ3,j+d4)]2

1 2σ2 3,j (c2 3,j +1)[ϕ3,j ( c3,jϕ1,j+ d3,j)] 2 r2,j

1 2σ2 i,j (ϕi,j µi,j)2 1

1 2σ2 4 [ϕ4 (c4ϕ3,j+d4)]2

1 2σ2 4 [ϕ3,j ( c3,jϕ1,j+ d3,j)] 2 r2,j

Similarly, then we integrate ϕ3,j

q(ϕm,j, ϕm 1,j, . . . , ϕ4|ϕ1,j)

= Z q(ϕm,j, ϕm 1,j, . . . , ϕ3,j|ϕ1,j) dϕ3,j

1 2σ2 i,j (ϕi,j µi,j)2 Z 1

1 2σ2 4 [ϕ4 (c4ϕ3,j+d4)]2

1 2σ2 4 [ϕ3,j ( c3,jϕ1,j+ d3,j)] 2 dϕ2,j.

The formulation is similar to the previous one, so we can utilize the process above to integrate succesively, and we finally obtain

q(ϕm,j|ϕ1,j) = Qm 1 i=2 ri,j

1 2σ2 m,j (c2 m,j +1)[ϕm,j ( cm,jϕ1,j+ dm,j)] 2 ,

which indicates

p(ϕm,j|ϕ1,j, D) Qm 1 i=2 ri,j

1 2σ2 m,j (c2 m,j +1)[ϕm,j ( cmϕ1,j+ dm,j)] 2 ,

where we have the iterative calculation by

d2 i+1,j 2σ2 i+1,j 1 q

c2 i+1,j + 1 ,

ci,j = ci,j ci 1,j, (i 3), di,j = di 1,jci,j + di,j, (i 3).

A.3 Sampling from p(ϕm|ϕM, D)

To sample from the distribution p(ϕm|ϕM, D), we first obtain a sample ϕ1 from the prior distribution (i.e., ϕ1 N(0, I)), then get ϕm from a consecutive sampling process:

ϕM 1 p(ϕM 1|ϕM, ϕ1, D), ϕM 2 p(ϕM 2|ϕM 1, ϕ1, D), ... ϕm p(ϕm|ϕm+1, ϕ1, D),

because of the Markov property in our cascaded model. So our target is obtaining the distributions p(ϕi 1|ϕi, ϕ1, D). For a certain p(ϕi 1|ϕi, ϕ1, D), according to the Bayes rule, we have

p(ϕi 1|ϕi, ϕ1, D) =p(ϕi|ϕi 1, ϕ1, D) p(ϕi 1|ϕ1, D)

p(ϕi|ϕ1, D) .

Similarly with the last section, we use non-bold symbols to represent one dimension of the multidimension parameters, where they are able to transfer independently, and finally construct the eventual parameters by concatenating, that is,

p(ϕi 1|ϕi, ϕ1, D) =

j=1 p(ϕi,j|ϕi 1,j, ϕ1,j, D).

So our target can be converted to calculate the probability p(ϕi,j|ϕi 1,j, ϕ1,j, D) for all dimensions 1 j d. According to the Markov property and the transportation probability, we have

p(ϕi,j|ϕi 1,j, ϕ1,j, D) q(ϕi,j|ϕi 1,j, ϕ1,j)

1 2σ2 i,j [ϕi,j (ciϕi 1,j+di,j)]2 .

According to the previous section, we have

p(ϕi,j|ϕ1,j, D) q(ϕi,j|ϕ1,j) = Qi 1 i=2 ri,j

1 2σ2 i,j (c2 i,j +1)[ϕi,j ( ciϕ1,j+ di)] 2 ,

p(ϕi 1,j|ϕ1,j, D) q(ϕi 1,j|ϕ1,j) = Qi 2 i=2 ri,j

1 2σ2 i 1,j (c2 i 1,j +1)[ϕi 1,j ( ci 1,jϕ1,j+ di 1,j)] 2 .

Then we have

p(ϕi 1,j|ϕi,j, ϕ1,j, D) q(ϕi,j|ϕi 1,j, ϕ1,j) q(ϕi 1,j|ϕ1,j)

q(ϕi,j|ϕ1,j)

2πσi,j ri 1 e

1 2σ2 i,j [ϕi,j (ci,jϕi 1,j+di,j)]2 e

[ϕi 1,j ( ci 1,j ϕ1,j + di 1,j )]2

2σ2 i 1,j (c2 i 1,j +1)

[ϕi,j (ci,j ϕi 1,j +di,j )]2

2σ2 i,j (c2 i,j +1)

2σ2 i,j d2 i,j e

[ϕi,j (ci,j ϕi 1,j +di,j )]2

2σ2 i+1,j e

[ϕi,j (ci,j ϕi 1,j +di,j )]2

[ϕi 1,j ( ci 1,j ϕ1,j + di 1,j )]2

2σ2 i,j d2 i,j + [ϕi,j (ci,j ϕi 1,j +di,j )]2

1 2σ2 i,j n [ϕi,j (ci,jϕi 1,j + di,j)]2 + ϕi 1,j ( ci 1,jϕ1,j + di 1,j) 2o

Then we calculate the part C as

C = [ϕi,j (ci,jϕi 1,j + di,j)]2 + ϕi 1,j ( ci 1,jϕ1,j + di 1,j) 2

= ϕ2 i,j + (ci,jϕi 1,j + di,j)2 2(ci,jϕi 1,j + di,j)ϕi,j + ϕ2 i 1,j + ( wi 1ϕ1,j + di 1,j)2 2ϕi 1,j( ci 1,jϕ + di 1,j)

= ϕ2 i,j + c2 i,jϕ2 i 1,j + d2 i,j + 2di,jci,jϕi 1,j 2ci,jϕi 1,jϕi,j 2di,jϕi,j + ϕ2 i 1,j + c2 i 1,jϕ2 1,j + d2 i 1,j + 2 ci 1,j di 1,jϕ1,j 2 ci 1,jϕ1,jϕi 1,j 2 di 1,jϕi 1,j

= ( w2 i 1 + 1)

ϕi 1,j ci,jϕi,j + ci 1,jϕ1,j + di 1,j di,jci,j

where B does not include ϕi, which indicates

p(ϕi 1,j|ϕi,j, ϕ1,j, D) N(ci,jϕi,j + ci 1,jϕ1,j + di 1,j di,jwi 2

c2 i,j + 1 , σ2 i,j c2 i,j + 1).

A.4 Calculation of p(x|e,ϕm)

In this section, we will show how to calculate the graph probability p(x|e, ϕm). Remember the graph parameters ϕm = [θm; Sm; Tm], so we have

p(x|e, ϕm) = Z

E p(x|e, θm, E) p(E|e, Sm, Tm) d E

= EE p(E|Sm,Tm) [p(x|e, θm, E)] .

According to Monte Carlo sampling, we have

p(x|e, ϕm) = 1

l=1 p(x|e, θm, El),

where El[i, j] Bernoulli(σ(ST m[i] Tm[j])). In order to conduct intervention process, we change the jth column of El to zeros, and represent it with El. Moreover, we replace the jth element of x

with v, and get the result x. We change the jth element of ϵm with zero, and get the result ϵm. Then according the definition of causal graphs, we have

p(x|e, ϕm) = 1

l=1 N(x; f( x; El, γm), ϵm),

where f is the causal function that depends on the parameter γm.

B Bayesian Optimization for Determining (j , v , m )

We intend to find the best tuple for acquisition, that is,

(j , v , m ) = arg max (j,v,m) f(j, v, m).

We define the best interventional value v under interventional node j and fidelity m as

v (j, m) = arg max v f(j, v, m)

= arg max v fj,m(v).

where fj,m(v) is rewritten from f(j, v, m) under given j, m. Therefore, our task is calculating v (j, m) for j [d], m [M] with Bayesian optimization [47]. We utilize a Gaussian Process (GP) [48] to model surrogate function distributions for each v (j, m). We denote f GP(0, K(vi, vj)), and K(vi, vj) is the kernel of GP. We sequentially find vt and calculate fj,m(vt) to direct the process. According to GP, the previous t functions and the t + 1 function are multivariate Gaussian distribution, F1:t ft+1

N 0, Kt kt+1 k T t+1 K(vt+1, vt+1)

where we define

F1:t = [f1, f2, . . . , ft] ,

kt+1 = [K(vt+1, v1), K(vt+1, v2), . . . , K(vt+1, vt+1)]T ,

K(v1, v1) K(vt, v1) ... ... ... K(vt, v1) K(vt, vt)

Given previous t steps, we have the posterior probability is

p(ft+1|{(vi, fj,m(vi))}t i=1, vt+1) = N(µt(vt+1), σ2 t (vt+1)),

with the non-parametric means and variances

µt(vt+1) = k T t+1(K + I) 1F1:t, (12)

σ2 t (vt+1) = K(vv+1, vt+1) k T t+1(K + I) 1kt+1. (13)

We acquire the next vt+1 with GP-UCB [49] function

at+1(v) = µt(v) + βac q

vt+1 = arg max v at+1(v).

where βac is a hyper-parameter. Suppose the maximum of steps is T, the final output of function v (j, m) is

v (j, m) = arg max v µT (v).

Then we choose the best interventional node j and fidelity m by their best values under O(d M)

j , m = arg max j,m v (j, m),

v = v (j , m ).

C Detailed Training Process of ELBO

C.1 Derivation Process of ELBO

Because we use the distribution q(ϕm) to approximate the distribution p(ϕm), then we intend to minimize the distance between these two distributions optimize the parameters of q(ϕm), where we utilize KL divergence to measure the distance, that is,

Ψ = arg min Ψ KL[q(Φ||p(Φ|D)].

According to the variational inference, we have

KL [q(Φ)||p(Φ|D)]

= Z q(Φ) log q(Φ) p(Φ|D) dΦ

= Z q(Φ) log q(Φ) dΦ Z q(Φ) log p(Φ|D) dΦ

= EΦ q(Φ) [log q(Φ)] Z q(Φ) log p(Φ, D)

= EΦ q(Φ) [log q(Φ)] Z q(Φ) log p(Φ, D) dΦ + Z q(Φ) log p(D) dΦ

= EΦ q(Φ) [log q(Φ)] EΦ q(Φ) [log p(Φ, D)] + Z q(Φ) log p(D) dΦ

= EΦ q(Φ) [log q(Φ)] EΦ q(Φ) [log p(Φ, D)] | {z } ELBO

+ log p(D).

Because log p(D) is not related to Ψ, minimizing KL [q(Φ)||p(Φ|D)] is equivalent to maximizing the ELBO part, and we have

ELBO = EΦ q(Φ) [log p(Φ, D)] EΦ q(Φ) [log q(Φ)]

= EΦ q(Φ) [log p(D|Φ)] + EΦ q(Φ) [log p(Φ)] EΦ q(Φ) [log q(Φ)]

= EΦ q(Φ) [log p(D|Φ) log q(Φ) + log p(Φ)]

Above all, we can conclude that

Ψ = arg min Ψ KL[q(Φ||p(Φ|D)]

is equivalent to maximize evidence lower bound

Ψ = arg max Ψ ELBO

= arg max Ψ EΦ q(Φ) [log p(D|Φ) log q(Φ) + log p(Φ)] .

C.2 Estimation of ELBO

We represent the equation of ELBO as

ELBO = EΦ q(Φ) [log p(D|Φ) log q(Φ) + log p(Φ)]

= EΦ q(Φ) [log p(D|Φ)] | {z } A

EΦ q(Φ) [log q(Φ) log p(Φ)] | {z } B

For the part A, we have

A = EΦ q(Φ)

i=1 p(x(i)|j(i), v(i), m(i), Φ)

where N is the current number of samples in buffer. Then we have

i=1 p(x(i)|j(i), v(i), m(i), Φ)

i=1 log p(x(i)|j(i), v(i), m(i), Φ)

i=1 EΦ q(Φ) h log p(x(i)|j(i), v(i), m(i), Φ) i

i=1 Eϕm(i) q(ϕm(i)) h log p(x(i)|j(i), v(i), m(i), ϕm(i)) i .

Using Monte Carlo sampling [50], we can calculate the expectation by NS samples for each point.

j=1 log p(x(i)|j(i), v(i), m(i), ϕ(j) m(i)),

where we sample ϕ(j) m(i) q(ϕm(i)) with size NS.

Then we denote the distribution q(Φ) = N( µall, Σall), and similarly, we have p(Φ) = QM m=1 e β f(Sm,Tm) N(µall, Σall). Both the parameter µall, Σall can be represented by the parameters in Ψ, while µall and Σall are constant. Then we calculate part B

B =EΦ q(Φ) [log q(Φ) log p(Φ)]

Φ N( µall, Σall) log N( µall, Σall) QM m=1 e β f(Sm,Tm) N(µall, Σall) dΦ

Φ N( µall, Σall) log N( µall, Σall)

N(µall, Σall) dΦ + Z

Φ N( µall, Σall) log 1 QM m=1 e β f(Sm,Tm) dΦ

= KL[N( µall, Σall)||N(µall, Σall)] | {z } C

Φ N( µall, Σall) log

m=1 eβ f(Sm,Tm) dΦ

According to KL divergence of Gaussian distribution, we can calculate C in a close-form.

C =KL[N( µall, Σall)||N(µall, Σall)]

log ||Σall||

|| Σall|| d + tr(Σ 1 all Σall) + ( µall µall)T Σ 1 all( µall µall) .

Then we calculate D by the following steps:

Φ N(µall, Σall) log

m=1 eβ f(Sm,Tm) dΦ

Φ N(µall, Σall)

m=1 log eβ f(Sm,Tm) dΦ

Φ N(µall, Σall)

m=1 β f(Sm, Tm) dΦ

= β EΦ N(µall,Σall)

m=1 f(Sm, Tm)

Using Monte Carlo sampling, we can calculate the expectation by ND samples for each point.

m=1 f(S(i) m , T(i) m ).

m=1 Ep(E|S(i) m ,T(i) m ) λ1 tr e E d + λ2 ||E|| ,

where we samples Φ(i) N(µall, Σall) with size ND. Using Monte Carlo sampling again, we can calculate the expectation by NE samples.

λ1 tr e E d + λ2 ||E|| ,

where we samples E(j) p(E|S(i) m , T(i) m ) with size NE.

Finally, we obtain the estimation

j=1 log p(x(i)|j(i), v(i), m(i), ϕ(j) m(i))

log ||Σall||

|| Σall|| d + tr(Σ 1 all Σall) + ( µall µall)T Σ 1 all( µall µall)

λ1 tr e E d + λ2 ||E|| .

C.3 Gaussian Reparameterization Trick

In the last section, we derive the objection function for optimizing the model parameters, where we can use methods of the gradient decent to solve it. However, a significant problem rises due to the sampling process, because the gradient of model parameters can not pass backward from the naive sampling process(i.e., untraceable). Therefore, we use Gaussian reparameterization trick to make the Gaussian sampling process traceable.

In specific, we will demonstrate the traceable calculation of ϕ by Gaussian reparameterization trick. In order to sample ϕ N(µ, Σ), we first sample δ N(0, I) instead, and then obtain ϕ = µ + δ Σ. Therefore, the gradient can be traced from ϕ to µ and Σ. In specific, both µ and Σ can be represented with the function of learnable parameter Ψ.

C.4 Gumbel-softmax Reparameterization Trick

Besides of the Gaussian sampling process, the Bernoulli sampling in our equation is not traceable either, so we utilize Gumbel-softmax reparameterization trick to make it traceable.

We demonstrate the traceable calculation of E p(E|S, T) by Gumbel-max reparameterization trick. According to Gumbel-max [51], we have

Bernoulli(p) 1 [G1 + log p > G0 + log(1 p)] , G0, G1 Gumbel(0, 1).

Instead of using unit step function, we utilize sigmoid function

σ(G1 + log p > G0 + log(1 p)).

Therefore, we have

Ei,j = σ(Li,j + ST i Tj),

where Li,j L(0, 1). Therefore, we sample Li,j L(0, 1) instead, where L(0, 1) is logistic distribution, and calculate Ei,j = σ(Li,j + ST i Tj) to trace gradients. Specifically, both Si and Ti can be represented with the function of learnable parameter Ψ.

C.5 Optimization of ELBO

With the estimation and reparameterization trick, we are able to conduct gradient descent methods to optimize our parameters with the objection function

Ψ = arg max Ψ ELBO.

The format of stochastic gradient descent (SGD) is

Ψ Ψ + γ ELBO

where γ is the learning rate.

D Training Process of Constraint based ELBO

We intend to optimize our parameter with

Ψ = arg max Ψ EΦ q(Φ) [log p(D|Φ) log q(Φ) + log p(Φ)] ,

{es,et} I(xs; xt|ϕM, {es, et}, D) ϵ.

However, the objection has a constraint, which is hard to optimize with gradient descent methods. So we utilize Lagrange multiplier [52] to convert it to a constraint-free method:

Ψ = arg max Ψ EΦ q(Φ) [log p(D|Φ) log q(Φ) + log p(Φ)] + λ X

{es,et} I(xs; xt|ϕM, {es, et}, D),

where λ is the Lagrange multiplier. Then, we intend to calculate the constraint part.

First of all, we have

I(xs; xt|ϕM, {es, et}, D) =H(xs|ϕM, {es, et}, D) + H(xt|ϕM, {es, et}, D) H(xs, xt|ϕM, {es, et}, D) =H(xs|ϕM, es, D) + H(xt|ϕM, et, D) H(xs, xt|ϕM, {es, et}, D),

For the term H(xs|ϕM, es, D), we have

H(xs|ϕM, es, D) = Z p(xs|ϕM, es, D) log p(xs|ϕM, es, D) dxs

= Ep(xs|ϕM,es,D) [log p(xs|ϕM, es, D)] .

We use Monte Carlo sampling to estimate H(xs|ϕM, es, D), and we have

H(xs|ϕM, es, D) 1 K1 K2

k2=1 log p(x(k1,k2)|es, ϕM),

where we sample graphs ϕk1 m q(ϕm|es, ϕM), and obtain samples x(k1,k2) p(x|es, ϕk1 m ). Similarly, we can calculate

H(xt|ϕM, et, D) 1 K1 K2

k2=1 log p(x(k1,k2)|et, ϕM),

where we sample graphs ϕk1 m q(ϕm|et, ϕM), and obtain samples x(k1,k2) p(x|et, ϕk1 m ).

And we have

H(xs, xt|ϕM, {es, et}, D) 1 K1 K2 K3

k2 2=1 log p(x(k1,k1 2), x(k1,k2 2)|{es, et}, ϕM),

where we sample graphs ϕk1 m q(ϕm|{es, et}, ϕM), obtain samples x(k1,k1 2) p(x|es, ϕk1 m ), and obtain samples x(k1,k2 2) p(x|et, ϕk1 m ).

Therefore, we add constraint on the original loss function to obtained the estimation of constraint based ELBO, that is,

j=1 log p(x(i)|j(i), v(i), m(i), ϕ(j) m(i))

log ||Σall||

|| Σall|| d + tr(Σ 1 all Σall) + ( µall µall)T Σ 1 all( µall µall)

λ1 tr e E d + λ2 ||E|| + λ [I(xs; xt|ϕM, {es, et}, D)] .

E Proof of Theory 3

Proof. To begin with, we introduce two anchor variables x, e, indicating existing samples and experiments in the system, which are independent with the following experiments. Since xs, xt are ϵ-independent given ϕM, {es, et} and D, we have:

I(xs; xt|ϕM, {es, et}, D) = I(xs; xt|ϕM, x, e {es, et}, D) ϵ H(xs|ϕM, x, e {es, et}, D) + H(xt|ϕM, x, e {es, et}, D) H(xs, xt|ϕM, x, e {es, et}, D) ϵ,

Since I(xs; ϕM|x, e {es, et}, D) =H(xs|x, e {es, et}, D) H(xs|ϕM, x, e {es, et}, D) I(xt; ϕM|x, e {es, et}, D) =H(xt|x, e {et, et}, D) H(xt|ϕM, x, e {es, et}, D)

We have: I(xs; ϕM|x, e {es, et}, D) + I(xt; ϕM|x, e {es, et}, D) =H(xs|x, e {es, et}, D) + H(xt|x, e {et, et}, D) H(xs|ϕM, x, e {es, et}, D) H(xt|ϕM, x, e {es, et}, D) H(xs, xt|x, e {es, et}, D) H(xs, xt|ϕM, x, e {es, et}, D) ϵ =I(xs, xt; ϕM|x, e {es, et}, D) ϵ.

According to the basic mutual information property I(A, B; C) I(B; C) = I(A; C|B), we have:

I(x xs; ϕM|e {es, et}, D) I(x; ϕM|e {es, et}, D) +I(x xt; ϕM|e {es, et}, D) I(x; ϕM|e {es, et}, D) I(x {xt, xs}; ϕM|e {es, et}, D) I(x; ϕM|e {es, et}, D) ϵ.

Thus, we have: I(x xs; ϕM|e {es, et}, D) + I(x xt; ϕM|e {es, et}, D) I(x {xt, xs}; ϕM|e {es, et}, D) + I(x; ϕM|e {es, et}, D) ϵ.

Since different experiments are independent, we have:

I(x xs; ϕM|e {es}, D) + I(x xt; ϕM|e {et}, D) I(x {xt, xs}; ϕM|e {es, et}, D) + I(x; ϕM|e, D) ϵ.

Thus, I( ; ϕM| , D) is ϵ-submodular.

F Proof of Theory 4

For clear presentation, we denote g({ei}n i=1) = I({xi}n i=1; ϕM|{ei}n i=1, D), then we need to solve the following problem:

arg max {ei}n i=1 g({ei}n i=1), (14)

Suppose S = {e i }n i=1 is the optimal solution for objective (14), and the results of the greedy method is S = {ei}n i=1, where the experiments are sequentially determined from e1 to en. We denote S1:j = {ei}j i=1, and (e|S1:j) = g(S1:j e) g(S1:j), according to the greedy method, we have:

ej+1 = arg max e (e|S1:j)

where λe is the cost of experiment e.

Based on all the above notations, we have:

g(S ) g(S S1:j) = g(S1:j) + g(S1:j e 1) g(S1:j) + g(S1:j e 1 e 2) g(S1:j e 1) + . . . + g(S1:j {e 1, ..., e n}) g(X1:i {e 1, ..., e n 1})

= g(S1:j) +

g(S1:j {e 1, ..., e k}) g(X1:i {e 1, ..., e k 1})

k=1 [g(S1:j {e k}) g(S1:j) + ϵ]

= g(S1:j) +

k=1 [ ({e k}|S1:j) + ϵ] ,

where the first inequality holds because of the non-decreasing property, and the second inequality holds because of the ϵ-submodular property.

Since ej+1 = arg maxe (e|S1:j)

λe , we have (e|S1:j)

λe (ej+1|S1:j)

λej+1 for any e, thus (e|S1:j)

λe λej+1 (ej+1|S1:j) Bλ (ej+1|S1:j). By bringing this result into the above equation, we have:

g(S ) g(S1:j) +

k=1 [ ({e k}|S1:j) + ϵ]

k=1 [Bλ (ej+1|S1:j) + ϵ]

= g(S1:j) + n Bλ (ej+1|S1:j) + nϵ

Let Tj = g(S ) g(S1:j), we have:

Tj Tj+1 = g(S1:j+1) g(S1:j) = (ej+1|S1:j) Tj nϵ

Tn (1 1 n Bλ )Tn 1 + ϵ

Bλ [(1 1 n Bλ )]2Tn 2 + (1 1 n Bλ ) ϵ

... [(1 1 n Bλ )]n T0 + [(1 1 n Bλ )]n 1 ϵ

Bλ + ... + ϵ

Let B = [(1 1 n Bλ )]n 1 ϵ

Bλ + ... + ϵ Bλ = ϵ Bλ Pn i=1[(1 1 n Bλ )]i 1, and considering that [(1

1 n Bλ )]n = e 1 Bλ , we have:

g(S ) g(S1:n) e 1 Bλ g(S ) + B

Thus, we have g(S1:n) (1 e 1 Bλ )g(S ) B.

Algorithm 1: Algorithm of Licence for Single Intervention Scenario

Input: Variable set XV , number of oracles M, cost of oracles Λ, observational data DO, total budget C, and learning rate η. Output: Causal graph ϕM.

1 Initialize the model parameter Ψ .

2 Optimize Ψ with the training process of ELBO under DO.

3 Initialize DI = .

4 while Budget C does not run out do

5 Initialize j , m , v and let ζ = .

6 for (j, m) in {1, 2, . . . , d} {1, 2, . . . , M} do

7 Calculate v (j, m) with BO.

8 if f(j, v (j, m), m) > ζ then

9 Update j j, m m and v v (j, m).

10 Update ζ f(j, v (j, m), m).

13 Subtract the budget with C C λm .

14 Acquire (j , v , m ) towards the true causal graph to obtain x pm(XV |do(Xj = v)).

15 Update DI DI {x }.

16 Optimize Ψ with training process of ELBO under DO DI.

18 Sample ϕM from p(ϕM|D)

19 return Causal graph ϕM.

G Algorithm

The algorithm for Licence method for single interventiion scenario is shown in Algorithm 1. Moreover, the algorithm for Licence method for batch interventiion scenario is shown in Algorithm 2.

H More Experiments

H.1 Experimental Settings

H.1.1 Datasets

The details of our experimental datasets are presented as follows:

Erd os-Rényi (ER) [41] graph is a random graph introduced by Paul Erd os and Alfréd Rényi. For ER graph, a graph with n vertices is generated by connecting each pair of vertices with a probability p.

Scale-Free (SF) [42] graph is a type of random graph that has a degree distribution following power law. A small number of vertices in SF graph own a large number of edges, while the vast majority of vertices have relatively few edges.

DREAM [43] is the abbreviation for Dialogue for Reverse Engineering Assessments and Methods, which can estimate the reverse quality that causal discovery methods perform. Specifically, we use a biological graph generator Gene Net Weaver for our experiments, which is a real-word public dataset.

H.1.2 Baselines

The details of experimental baselines are demonstrated as follows. We utilize Di BS [53] as our basic graph representation component. For acquisition methods, we use AIT and CBED and obtain the query tuples of node and value.

AIT [44] is an active learning method that utilize f-score to select intervention queries.

Algorithm 2: Algorithm of Licence for Batch Intervention Scenario

Input: Variable set XV , number of oracles M, cost of oracles Λ, observational data DO, total batch experiment step T, total budget C, and learning rate η. Output: Causal graph ϕM.

1 Initialize the model parameter Ψ .

2 Optimize Ψ with training process of constraint based ELBO under DO.

3 Initialize BI =

4 for t in 1, 2, . . . , T do

5 while Budget C does not run out do

6 Initialize j , m , v and let ζ = .

7 for (j, m) in {1, 2, . . . , d} {1, 2, . . . , M} do

8 Calculate v (j, m) with BO.

9 if f(j, v (j, m), m) > ζ then

10 Update j j, m m and v v (j, m).

11 Update ζ f(j, v (j, m), m).

14 Subtract the budget with C C λm .

15 Update BI BI {(j , v , m )}.

17 Acquire BI towards the true causal graph to obtain {x pm(XV |do(Xj = v))}(j,v,m) BI.

18 Update DI DI {x }(j,v,m) BI.

19 Optimize Ψ with training process of constraint based ELBO under DO DI.

21 Sample ϕM from p(ϕM|D)

22 return Causal graph ϕM.

CBED [8] is based on the calculation of mutual information (MI), which intend to select intervention queries with maximal MI scores after obtaining new samples under current queries.

For the batch intervention scenario, we extend above methods with greedy strategy, which can promise an lower bound for approximation with submodular property. For choosing the fidelities to query, we use two circumstances, i.e., REAL and RANDOM.

REAL fidelity means the model always choose the highest fidelity to conduct experiments. This strategy is aligned with classic causal discovery under active learning paradigm without multi-fidelity settings, which can just choose the most accurate samples to conduct discovery process.

RANDOM fidelity means the model choose different fidelities randomly with uniform probability.

H.1.3 Metrics

The details of experimental metrics are demonstrated as follows. We utilize SHD and AUPRC to reflect the topological structure discovering performance, and design RMSE to reflex the predicting performance of functional relations.

SHD [45] is the abbreviation for Structural Hamming Distance, and it estimate the topological structure by counting the number of different edges on adjacency matrix. We calculate the expectation of SHD under multiple graph samplings.

AUPRC [46] is the area under precision-recall curve, where we consider entities on the adjacency matrix as binary classification problem. The AUPRC is also under the expectation for multiple graph sampling.

RMSE is designed for estimating the performance of grasping functional relations. We obtain several samples from the true causal graph, and let our model and the true causal function to conduct forward process respectively, then calculate the RMSE between the two results. We calculate RMSE by sampling graphs for multiple times.

Table 1: The left table demonstrate the details of the configuration of device and platform. The right table shows the details of time cost on computation.

Name Details CPU Intel Xeon Platinum 8350C 2.60GHz GPU RTX A5000 (24GB) Memory 42GB RAM Python Version 3.8 Java Version 1.8.0 (Necessary for DREAM)

Model Time (secs) AIT-REAL 7.686 AIT-RANDOM 7.451 CBED-REAL 7.998 CBED-RANDOM 7.989 Licence 8.320

Table 2: The details of experimental settings.

Name Explanation Value budget The total budget for interventional experiments, (i.e., C). 10/20/30/40/50 oracle number The number of oracles, (i.e., M) 3 oracle cost The cost for each oracle, (i.e., Λ) 2, 8, 32 oracle noise The extra additive noise for each oracle. 0.04, 0.02, 0.00 observation number The number of observational samples. 1000 expect edge number The number of expect edges. 2 additive noise The value of additive noise during data generations. 0.01

H.2 Simulation of Oracles with Different Fidelities

For a given intervention (j, v), suppose we have M oracles {ϕ1, ϕ2, ..., ϕM}, then the experiment results {xj,v,1, xj,v,2, ..., xj,v,M} are specified as follows:

xj,v,m = xj,v,M + δm, δm N(0, σm),

where xj,v,M is the ground truth, which can be directly obtained from the datasets. Since xj,v,m is correlated with xj,v,M by the first line, their underlying oracles ϕm and ϕM are correlated in our simulation. In our experiment, we set δ1 > δ2 > ... > δM = 0. Suppose the cost of ϕm as λm, then we set λ1 < λ2 < ... < λM.

H.3 Details of Configurations and Computation

The details of the configurations of device and platform are demonstrate in Table 1(left). We will show the details of the time cost on computation. We measure the time cost on the generation of each intervention per fidelity for all models, and the results are shown in Figure 1(right). We find that our method cost a little more than the baselines, which is probably due to the more complex sampling process in our model.

We also show the details of experimental settings for our overall experiments in Table 2. We carefully tune the hyper-parameters for baselines and our model, and the final values can be obtained in the configuration file in our codes.

H.4 Experiments on DREAM Dataset

We conduct experiments on a real-world biological dataset, called DREAM. Note that, DREAM does not support the calculation of RMSE, because of the lack of interface in this real-world dataset. We use two sub-datasets Ecoli and Yeast as our true causal graphs. The results are shown in Figure 4. We find that our model outperforms that other baselines on both Ecoli and Yeast, and both single and batch intervention scenario.

H.5 Experiments on More Nodes

In this section, we conduct further experiments on datasets with more nodes. We extend the number of nodes from 10 to 20, and experiment on the ER graph. The results are shown in Figure 3. We find that our model is still effective on the scenario of more nodes, and is better than baselines.

Figure 4: The performance among models on DREAM datasets with different datasets and budgets. Lower SHD, RMSE indicate better performances. We conduct each experiment for ten times, and report the average performances and error bars.

Table 3: SHD results of 20 nodes graphs on different budgets. Lower SHD indicates better performances. We conduct each experiment for ten times, and report average performances and error bars.

Model Budget(10) Budget(20) Budget(30) Budget(40) Budget(50) AIT-REAL 63.36 4.89 64.36 5.18 64.53 6.83 63.28 4.86 64.35 5.19 AIT-RANDOM 63.62 4.61 62.16 5.75 64.60 5.23 66.87 6.47 63.53 5.27 Di BS-REAL 63.58 6.35 61.50 7.69 63.50 6.86 63.56 6.34 61.45 7.69 Di BS-RANDOM 63.68 6.77 65.07 6.41 63.91 7.14 63.99 4.46 63.86 3.00 Licence 49.67 11.64 49.61 8.08 55.68 8.63 51.34 11.24 51.36 9.11

H.6 Supplementary Experiments on MAE Metric

We further compare our model with the baselines based on the Mean Absolute Error (MAE) metric. The experiments are conducted based on ER with different total budgets. The results are presented in Table 4.

Table 4: Results of the metric MAE (%).

Model Budget(10) Budget(20) Budget(30) Budget(40) Budget(50) AIT-REAL 3.46 0.01 2.43 0.01 2.63 0.01 3.46 0.01 2.42 0.01 AIT-RANDOM 3.71 0.02 2.42 0.01 2.82 0.01 2.68 0.00 2.54 0.01 CBED-REAL 3.73 0.01 2.56 0.00 2.54 0.00 3.69 0.01 2.52 0.00 CBED-RANDOM 4.00 0.02 2.53 0.00 2.88 0.00 3.14 0.01 2.70 0.01 Licence 2.06 0.00 2.20 0.01 1.70 0.00 1.77 0.00 2.07 0.00

The results indicate that our model surpasses the baselines in terms of MAE. This further provides evidence that the superior performance of our model is a general conclusion.

H.7 Supplementary Experiments on Different Oracle Settings

To demonstrate that our model is generally effective for different costand noisy-levels. We conduct experiments based on different sets of oracles. In specific, the experiments are conducted based on the following settings in Table 5. The results are presented in Figure 5.

From the results, we can see that our model can always perform better than the baselines on different sets of oracles.

Table 5: Settings for different cost and noise levels.

Setting 1 Setting 2 Oracle (m) cost (λm) noise (σm) Oracle (m) cost (λm) noise (σm) 1 2 0.04 1 2 0.04 2 8 0.02 2 8 0.02 3 32 0 3 16 0 Setting 3 Setting 4 Oracle (m) cost (λm) noise (σm) Oracle (m) cost (λm) noise (σm) 1 2 0.04 1 2 0.04 2 4 0.02 2 8 0.02 3 32 0 3 32 0 Setting 5 Setting 6 Oracle (m) cost (λm) noise (σm) Oracle (m) cost (λm) noise (σm) 1 2 0.08 1 2 0.04 2 8 0.02 2 8 0.03 3 32 0 3 32 0

H.8 Supplementary Experiments on Regularization Coefficient λ

In our model, the ϵ-independent constraint in Equation 10 is an important contribution. In the optimization process, we convert it to the objective. We study the influence of the coefficient λ by tuning it in the range of [10 5,10 6,10 7,10 8,10 9]. The results are presented in Figure 6.

From the results, we can see the performances of our model varies as we set different λ s. In most cases, the best performance is achieved when λ is moderated (not too large or too small).

H.9 Supplementary Experiments on Ablation Studies

To study whether the correlation modeling between different oracles are necessary, we first build a variant of our model by regarding different oracles as independent components, that is, removing the links between different ϕ s in Figure 1, and then compare our model with such variant. The results are presented in Table 6.

Table 6: Results of the Licence and Licence without the cascaded relation.

Metrics Model Budget(10) Budget(20) Budget(30) Budget(40) Budget(50)

SHD Licence (w/o rel) 14.61 2.30 15.25 2.74 15.47 3.74 15.17 4.30 18.52 4.65 Licence 14.67 2.98 14.73 2.04 14.29 3.20 14.83 3.24 15.02 3.01

AUPRC (%) Licence (w/o rel) 28.96 1.28 35.05 1.35 31.74 1.50 37.85 2.03 33.70 4.36 Licence 35.75 2.13 25.12 2.01 41.79 1.71 40.89 2.92 41.76 3.19

RMSE (%) Licence (w/o rel) 3.34 0.04 3.10 0.02 2.46 0.00 2.61 0.00 2.83 0.00 Licence 2.82 0.01 2.75 0.00 2.40 0.00 2.68 0.01 2.69 0.01

We can see, in most cases, our model can achieve better performance than its variant without modeling the correlations between different oracles.

I Potentially Negative Social Impact

Causal discovery focuses on understanding causal relationships between variables. While causal discovery has the potential to bring about positive social impacts, it is important to consider both the positive and negative implications of its applications. In this response, I will focus on the negative impact of causal discovery.

Reductionism and Oversimplification. Causal discovery techniques often aim to identify simple cause-and-effect relationships. However, complex social phenomena often involve a multitude of interconnected factors, making it difficult to capture the full complexity of the system. Relying solely on causal discovery may lead to oversimplification and reductionism, neglecting the nuanced interactions between variables.

Figure 5: Results of different settings of oracles in terms of cost and noise.

Figure 6: Results of extensive experiments on regularization λ.

Ethical Concerns. Causal discovery can involve analyzing sensitive data, such as personal information or medical records. If not handled carefully, the use of this data can raise significant ethical concerns related to privacy, consent, and potential discrimination. Improper handling of data could lead to violations of privacy and unfair treatment of individuals or groups.

Overreliance on Correlation. Causal discovery often relies on identifying statistical correlations between variables. However, correlation does not imply causation, and there is a risk of mistakenly inferring causal relationships based solely on correlation. Overreliance on such methods can lead to erroneous conclusions, leading to misguided decision-making and ineffective interventions.

Social Bias and Inequality. Causal discovery relies on the data used for analysis, which can reflect existing biases and inequalities present in society. If the data used is biased, the causal relationships discovered may perpetuate or exacerbate existing social inequalities. Causal discovery methods need to be sensitive to potential biases and strive for fairness and inclusivity in both data collection and analysis.

In conclusion, while causal discovery holds promise in understanding complex systems, it is crucial to consider its potential negative impacts. Oversimplification, ethical concerns, overreliance on correlation, and social bias are all factors that need to be addressed to ensure responsible and beneficial applications of causal discovery. It is essential to approach this field with caution and incorporate broader societal considerations to mitigate the negative impacts and harness its potential for positive social change.

J Limitations

In this section, we analyze the limitations of our work, including sub-optimum of greedy method, estimation of mutual information, and scale of causal graph.

Sub-optimum of greedy method. For the whole process of active causal discovery, the interventional data will be acquired successively in the greedy manner. Therefore, even if the strategy for acquisition is the optimal for each current step, the entire trajectory of causal discovery is sub-optimal. A possible solution is finding the best acquisition trajectory by reinforcement learning.

Estimation of mutual information. For different circumstances, the costs, accuracy and data scale can be various. Therefore, the scale of mutual information can be affected as well. So it is important to adjust hyper-parameters accordingly.

Scale of causal graph. It is a classic problem for causal discovery that most existing methods suffer from the difficulty in extending to large-scale graphs. The efficiency and effectiveness are supposed to be further improved, and we will optimize our model as well.

[1] Jeffrey S Levin. Religion and health: Is there an association, is it valid, and is it causal? Social Science & Medicine, 38(11):1475 1482, 1994.

[2] Eric R Eide and Mark H Showalter. Methods matter: Improving causal inference in educational and social science research: A review article. Economics of Education Review, 31(5):744 748, 2012.

[3] Bengt Muthén and Hendricks C Brown. Estimating drug effects in the presence of placebo response: causal inference using growth mixture modeling. Statistics in medicine, 28(27): 3363 3385, 2009.

[4] Karen Sachs, Omar Perez, Dana Pe er, Douglas A Lauffenburger, and Garry P Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721): 523 529, 2005.

[5] David Maxwell Chickering. Learning bayesian networks is np-complete. Learning from data: Artificial intelligence and statistics V, pages 121 130, 1996.

[6] Thomas S Verma and Judea Pearl. Equivalence and synthesis of causal models. In Probabilistic and Causal Inference: The Works of Judea Pearl, pages 221 236. 2022.

[7] Raj Agrawal, Chandler Squires, Karren Yang, Karthikeyan Shanmugam, and Caroline Uhler. Abcd-strategy: Budgeted experimental design for targeted causal structure discovery. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3400 3409. PMLR, 2019.

[8] Panagiotis Tigas, Yashas Annadani, Andrew Jesson, Bernhard Schölkopf, Yarin Gal, and Stefan Bauer. Interventions, where and how? experimental design for causal models at scale. In Advances in Neural Information Processing Systems.

[9] Azam Peyvandipour, Nafiseh Saberian, Adib Shafi, Michele Donato, and Sorin Draghici. A novel computational approach for drug repurposing using systems biology. Bioinformatics, 34 (16):2817 2825, 2018.

[10] Yashas Annadani, Panagiotis Tigas, Desi R Ivanova, Andrew Jesson, Yarin Gal, Adam Foster, and Stefan Bauer. Differentiable multi-target causal bayesian experimental design. ar Xiv preprint ar Xiv:2302.10607, 2023.

[11] Judea Pearl and Dana Mackenzie. The book of why: the new science of cause and effect. Basic books, 2018.

[12] Patrik Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. Advances in neural information processing systems, 21, 2008.

[13] Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. ar Xiv preprint ar Xiv:1112.5745, 2011.

[14] Samuel Daulton, Xingchen Wan, David Eriksson, Maximilian Balandat, Michael A Osborne, and Eytan Bakshy. Bayesian optimization over discrete and mixed spaces via probabilistic reparameterization. ar Xiv preprint ar Xiv:2210.10199, 2022.

[15] Xun Zheng, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. Dags with no tears: Continuous optimization for structure learning. Advances in neural information processing systems, 31, 2018.

[16] David M Blei, Alp Kucukelbir, and Jon D Mc Auliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859 877, 2017.

[17] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013.

[18] Andreas Krause and Carlos E Guestrin. Near-optimal nonmyopic value of information in graphical models. ar Xiv preprint ar Xiv:1207.1394, 2012.

[19] Shibo Li, Jeff M Phillips, Xin Yu, Robert Kirby, and Shandian Zhe. Batch multi-fidelity active learning with budget constraints. Advances in Neural Information Processing Systems, 35: 995 1007, 2022.

[20] Christina Heinze-Deml, Marloes H Maathuis, and Nicolai Meinshausen. Causal structure learning. Annual Review of Statistics and Its Application, 5:371 391, 2018.

[21] Clark Glymour, Kun Zhang, and Peter Spirtes. Review of causal discovery methods based on graphical models. Frontiers in genetics, 10:524, 2019.

[22] Matthew J Vowels, Necati Cihan Camgoz, and Richard Bowden. D ya like dags? a survey on structure learning and causal discovery. ACM Computing Surveys, 55(4):1 36, 2022.

[23] Kevin P Murphy. Active learning of causal bayes net structure. Technical report, technical report, UC Berkeley, 2001.

[24] Simon Tong and Daphne Koller. Active learning for structure in bayesian networks. In International joint conference on artificial intelligence, volume 17, pages 863 869. Citeseer, 2001.

[25] Yang-Bo He and Zhi Geng. Active learning of causal networks with intervention experiments and optimal designs. Journal of Machine Learning Research, 9(Nov):2523 2547, 2008.

[26] Hyunghoon Cho, Bonnie Berger, and Jian Peng. Reconstructing causal biological networks through active learning. Plo S one, 11(3):e0150611, 2016.

[27] Robert Osazuwa Ness, Karen Sachs, Parag Mallick, and Olga Vitek. A bayesian active learning experimental design for inferring signaling networks. In Research in Computational Molecular Biology: 21st Annual International Conference, RECOMB 2017, Hong Kong, China, May 3-7, 2017, Proceedings 21, pages 134 156. Springer, 2017.

[28] Julius von Kügelgen, Paul K Rubenstein, Bernhard Schölkopf, and Adrian Weller. Optimal experimental design via bayesian optimization: active causal structure learning for gaussian process networks. ar Xiv preprint ar Xiv:1910.03962, 2019.

[29] M Giselle Fernández-Godino, Chanyoung Park, Nam-Ho Kim, and Raphael T Haftka. Review of multi-fidelity models. ar Xiv preprint ar Xiv:1609.07196, 2016.

[30] Alexander IJ Forrester, Neil W Bressloff, and Andy J Keane. Optimization using surrogate models and partially converged computational fluid dynamics simulations. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 462(2071):2177 2204, 2006.

[31] TD Robinson, Michael S Eldred, Karen E Willcox, and Robert Haimes. Surrogate-based optimization using multifidelity models with variable parameterization and corrected space mapping. AIAA journal, 46(11):2814 2822, 2008.

[32] Christopher C Fischer, Ramana V Grandhi, and Philip S Beran. Bayesian low-fidelity correction approach to multi-fidelity aerospace design. In 58th AIAA/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, page 0133, 2017.

[33] SA Burton and P Hajela. A variable-complexity approach to second-order reliability-based optimization. Structural and Multidisciplinary Optimization, 25(4):237 250, 2003.

[34] Akil Narayan, Claude Gittelson, and Dongbin Xiu. A stochastic collocation algorithm with multifidelity models. SIAM Journal on Scientific Computing, 36(2):A495 A521, 2014.

[35] Benjamin Peherstorfer, Tiangang Cui, Youssef Marzouk, and Karen Willcox. Multifidelity importance sampling. Computer Methods in Applied Mechanics and Engineering, 300:490 509, 2016.

[36] Leo Wai-Tsun Ng and Michael Eldred. Multifidelity uncertainty quantification using nonintrusive polynomial chaos and stochastic collocation. In 53rd aiaa/asme/asce/ahs/asc structures, structural dynamics and materials conference 20th aiaa/asme/ahs adaptive structures conference 14th aiaa, page 1852, 2012.

[37] Tiangang Cui, Youssef M Marzouk, and Karen E Willcox. Data-driven model reduction for the bayesian solution of inverse problems. International Journal for Numerical Methods in Engineering, 102(5):966 990, 2015.

[38] Shibo Li, Wei Xing, Robert Kirby, and Shandian Zhe. Multi-fidelity bayesian optimization via deep neural networks. Advances in Neural Information Processing Systems, 33:8521 8531, 2020.

[39] Shibo Li, Robert Kirby, and Shandian Zhe. Batch multi-fidelity bayesian optimization with deep auto-regressive networks. Advances in Neural Information Processing Systems, 34:25463 25475, 2021.

[40] Shibo Li, Jeff M Phillips, Xin Yu, Robert Kirby, and Shandian Zhe. Batch multi-fidelity active learning with budget constraints. Advances in Neural Information Processing Systems, 35: 995 1007, 2022.

[41] Paul Erd os, Alfréd Rényi, et al. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17 60, 1960.

[42] Lun Li, David Alderson, John C Doyle, and Walter Willinger. Towards a theory of scale-free graphs: Definition, properties, and implications. Internet Mathematics, 2(4):431 523, 2005.

[43] Thomas Schaffter, Daniel Marbach, and Dario Floreano. Genenetweaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics, 27(16): 2263 2270, 2011.

[44] Yashas Annadani, Jonas Rothfuss, Alexandre Lacoste, Nino Scherrer, Anirudh Goyal, Yoshua Bengio, and Stefan Bauer. Variational causal networks: Approximate bayesian inference over causal structures. ar Xiv preprint ar Xiv:2106.07635, 2021.

[45] Ioannis Tsamardinos, Laura E Brown, and Constantin F Aliferis. The max-min hill-climbing bayesian network structure learning algorithm. Machine learning, 65:31 78, 2006.

[46] Jesse Davis and Mark Goadrich. The relationship between precision-recall and roc curves. In Proceedings of the 23rd international conference on Machine learning, pages 233 240, 2006.

[47] Harold J Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. 1964.

[48] Carl Edward Rasmussen. Gaussian processes in machine learning. In Advanced Lectures on Machine Learning: ML Summer Schools 2003, Canberra, Australia, February 2-14, 2003, Tübingen, Germany, August 4-16, 2003, Revised Lectures, pages 63 71. Springer, 2004.

[49] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. ar Xiv preprint ar Xiv:0912.3995, 2009.

[50] Reuven Y Rubinstein and Dirk P Kroese. Simulation and the Monte Carlo method. John Wiley & Sons, 2016.

[51] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. ar Xiv preprint ar Xiv:1611.00712, 2016.

[52] Dimitri P Bertsekas. Constrained optimization and Lagrange multiplier methods. Academic press, 2014.

[53] Lars Lorch, Jonas Rothfuss, Bernhard Schölkopf, and Andreas Krause. Dibs: Differentiable bayesian structure learning. Advances in Neural Information Processing Systems, 34:24111 24123, 2021.