# Bayesian Adaptive Calibration and Optimal Design

Rafael Oliveira (CSIRO's Data61, Sydney, Australia), Dino Sejdinovic (University of Adelaide, Adelaide, Australia), David Howard (CSIRO's Data61, Brisbane, Australia), Edwin V. Bonilla (CSIRO's Data61, Sydney, Australia)

The process of calibrating computer models of natural phenomena is essential for applications in the physical sciences, where plenty of domain knowledge can be embedded into simulations and then calibrated against real observations. Current machine learning approaches, however, mostly rely on rerunning simulations over a fixed set of designs available in the observed data, potentially neglecting informative correlations across the design space and requiring a large number of simulations. Instead, we consider the calibration process from the perspective of Bayesian adaptive experimental design and propose a data-efficient algorithm to run maximally informative simulations within a batch-sequential process. At each round, the algorithm jointly estimates the parameters of the posterior distribution and optimal designs by maximising a variational lower bound of the expected information gain. The simulator is modelled as a sample from a Gaussian process, which allows us to correlate simulations and observed data with the unknown calibration parameters. We show the benefits of our method when compared to related approaches across synthetic and real-data problems.

## 1 Introduction

In many scientific and engineering disciplines, computer simulation models form an essential part of the process of predicting and reasoning about complex phenomena, especially when real data is scarce. These simulation models depend on inputs set by the user, commonly referred to as designs, and on a number of parameters representing unknown physical quantities, known as calibration parameters. The problem of setting these parameters so as to closely match observations of the real phenomenon is known as the calibration of computer models [1]. The seminal work by Kennedy and O'Hagan [1] introduced the Bayesian framework for calibration of simulation models, using Gaussian processes [2] to account for the differences between the model and reality, as well as for uncertainty in the calibration parameters. While the simulator is an essential tool when obtaining real data is expensive or unfeasible, each run of a simulator may itself involve significant computational resources, especially in applications such as climate science or complex engineering systems. In this situation, it is imperative to run simulations at carefully chosen settings of designs as well as of calibration inputs, using current knowledge to optimise resource use [3-5].

In this contribution, we bridge Bayesian calibration with adaptive experimental design [6] and use information-theoretic criteria [7] to guide the selection of simulation settings so that they are most informative about the true value of the calibration parameters. We refer to our approach as BACON (Bayesian Adaptive Calibration and Optimal desigN). BACON allows computational resources to be focused on simulations that provide the most value in terms of reducing epistemic uncertainty. Importantly, in contrast to prior work, it optimises designs jointly with calibration inputs in order to capture informative correlations across both spaces.
Experimental results on synthetic experiments and a robotic gripper design problem demonstrate the benefits of BACON compared to competitive baselines in terms of computational savings and the quality of the estimated posterior under similar computational constraints.

*(Corresponding author: rafael.dossantosdeoliveira@data61.csiro.au. 38th Conference on Neural Information Processing Systems (NeurIPS 2024).)*

## 2 Problem formulation

Let $f: \mathcal{X} \to \mathcal{Y}$ represent a mapping of experimental designs $x \in \mathcal{X}$ to the outcomes of a physical process $f(x) \in \mathcal{Y} \subseteq \mathbb{R}$. We are given a set of observed outcomes $y_R = [y_1, \dots, y_R]^\mathsf{T}$ and their associated designs $X_R := \{x_i\}_{i=1}^R \subset \mathcal{X}$. Observations are corrupted by noise as $y_i = f(x_i) + \nu_i$, where $\nu_i \sim \mathcal{N}(0, \sigma_\nu^2)$ is zero-mean Gaussian noise, for $i \in \{1, \dots, R\}$. In addition, we have access to the output of a computer model $h: \mathcal{X} \times \Theta \to \mathbb{R}$ given a design input and simulation parameters. Given an optimal setting of the calibration parameters $\theta^* \in \Theta$, the simulator $h(x, \theta^*)$ can be used to approximate the outcomes of the real physical process $f(x)$. However, $\theta^*$ is unknown, and evaluations of the simulator $h$ are costly, though cheaper than executing real experiments evaluating $f$. Our task is to optimally estimate $\theta^*$ given the real data $y_R$, outputs of the simulator $h$, and a prior distribution $p(\theta^*)$ representing initial assumptions about $\theta^*$.

More concretely, let $\hat{y}_S := [h(\hat{x}_i, \hat{\theta}_i)]_{i=1}^S$ represent simulated outcomes for a set of designs $\hat{X}_S := \{\hat{x}_i\}_{i=1}^S \subset \mathcal{X}$ and simulation parameters $\hat{\Theta}_S := \{\hat{\theta}_i\}_{i=1}^S \subset \Theta$. Given the cost of running simulations, we will associate the simulator $h$ with a latent function (usually referred to as an emulator) drawn from a Gaussian process (GP) prior and assume that simulation outputs and real data follow a joint probability distribution $p(y_R, \hat{y}_S, \theta^*)$. In this setting, the Bayesian experimental design objective is to propose a sequence of simulations which will maximise the expected information gain (EIG) about $\theta^*$:
$$
\mathrm{EIG}(\hat{X}_S, \hat{\Theta}_S) := H(p(\theta^*|y_R)) - \mathbb{E}_{p(\hat{y}_S|\hat{X}_S, \hat{\Theta}_S, y_R)}[H(p(\theta^*|y_R, \hat{y}_S))] = \mathbb{E}_{p(\hat{y}_S|\hat{X}_S, \hat{\Theta}_S, y_R)}[D_{\mathrm{KL}}(p(\theta^*|y_R, \hat{y}_S)\,\Vert\,p(\theta^*|y_R))] = I(\theta^*; \hat{y}_S \mid y_R, \hat{X}_S, \hat{\Theta}_S), \qquad (1)
$$
where $H(\cdot)$ represents the entropy of a probability distribution, $D_{\mathrm{KL}}(\cdot\,\Vert\,\cdot)$ denotes the Kullback-Leibler divergence, and $I(\theta^*; \hat{y}_S \mid y_R)$ is the mutual information between $\theta^*$ and the simulator output $\hat{y}_S$ given the real observations $y_R$ and the simulator inputs to be optimised. We note here that, in our setting, the real observations $y_R$ are always fixed. Therefore, intuitively, the EIG above captures the reduction in uncertainty that will be obtained when selecting $(\hat{X}_S, \hat{\Theta}_S)$, averaged over all the possible outcomes $\hat{y}_S$.

## 3 Related work

Our work derives a Bayesian adaptive experimental design (BAED) approach to the problem of calibration. Therefore, in the following, we briefly discuss the current literature on these two main research areas.

### 3.1 Adaptive experimental design

The problem of experimental design has a long history [8], spanning from classical fixed design patterns to modern adaptive approaches [9]. Optimal experimental design consists of selecting experiments which will maximise some criterion involving a measure of the utility of the experiment and its associated costs [10]. Under the Bayesian formulation, uncertainty in the outcomes of the process is considered, and the optimality of a design is measured in terms of its expected utility [11].
Information theory then allows us to quantify information gain as a utility function, which is commonly applied in modern approaches to Bayesian experimental design [12]. The estimation of posterior distributions becomes a computational bottleneck for information-theoretic Bayesian frameworks. Recent work has therefore focused on addressing the difficulties in estimating the expected information gain by means of, e.g., variational inference [13], density-ratio estimation [14], importance sampling [15], and the learning of efficient policies to propose designs [16, 17]. These methods, however, usually assume that the simulator is known and inexpensive to evaluate. In contrast, in our setting the simulations themselves are modelled as expensive experiments, and we apply Gaussian process models as emulators to capture uncertainty over the black-box simulator. In addition, traditional BAED approaches assume that the prior is trivial to sample from and to evaluate densities of, while in our case the starting prior is $p(\theta^*|y_R)$, which is likely non-trivial. We refer the reader to the recent review on modern Bayesian methods for experimental design by Rainforth et al. [18] for further details on BAED.

### 3.2 Active learning for calibration

Experimental design approaches generally aim at the selection of designs for physical experiments, whereas we are concerned with the problem of running optimal simulated experiments for model calibration in the presence of real data. For the case where simulations are resource-intensive, a few methods have been derived based on the Bayesian calibration framework proposed by Kennedy and O'Hagan [1]. Busby and Feraille [19] present an algorithm to learn GP emulators for a simulator, which can then be combined with Bayesian inference algorithms, such as Markov chain Monte Carlo [20], to provide a posterior distribution over parameters. In their approach, the optimised variables are solely the calibration parameters, and the selection criterion is based on minimising the integrated mean-square error of the GP predictions. Many other approaches can be applied to this setting by modelling the simulator or its associated likelihood function as a GP, including Bayesian optimisation [3, 21, 22] and methods for adaptive Bayesian quadrature [23, 24]. Besides GPs, other algorithms focusing on the selection of calibration parameters have been derived using ensembles of neural networks [25] and deep reinforcement learning [26]. These frameworks, however, do not allow for the selection of simulation design points, usually keeping them co-located with the real data. Allowing design-point decisions to be included, Leatherman et al. [4] presented approaches for combined simulation and physical experimental design following geometric and prediction-error-based criteria, though using an offline, non-sequential framework. More recently, Marmin and Filippone [5] derived a deep Gaussian process [27] framework for Bayesian calibration problems and discussed an application with experimental design among other examples. Their experimental design approach to calibration was based on choosing simulations that maximally reduce the variational posterior variance over the calibration parameters, as measured by the derivatives of the evidence lower bound with respect to (w.r.t.) variance parameters. In contrast, we aim to directly maximise the information gain w.r.t. the unknown calibration parameters.
## 4 Gaussian processes for Bayesian calibration

To estimate information gain, we need a probabilistic model which can correlate simulations with real data and the unknown parameters $\theta^*$. Ideally, the model needs to allow for computationally tractable conditioning on the parameters $\theta^*$ and account for the discrepancy between real and simulated data. Hence, we follow the Bayesian calibration approach in Kennedy and O'Hagan [1] and model:
$$
f(x) = \rho\, h(x, \theta^*) + \varepsilon(x), \quad x \in \mathcal{X},\ \theta^* \sim p(\theta^*), \qquad (2)
$$
where $\varepsilon: \mathcal{X} \to \mathbb{R}$ represents the error (or discrepancy) between simulations and real outcomes, and $\rho \in \mathbb{R}$ accounts for possible differences in scale. We place Gaussian process priors on the simulator $h \sim \mathcal{GP}(0, \hat{k})$ and on the error function $\varepsilon \sim \mathcal{GP}(0, k_\varepsilon)$.

### 4.1 Bi-fidelity exact Gaussian process model

Since both $h$ and $\varepsilon$ are GPs, simulations and real outcomes can be jointly modelled as a single Gaussian process. In fact, both the simulator $h$ and the true function $f$ can be seen as different levels of fidelity of the same underlying process, with $h$ representing a coarser version of $f$. Namely, let $s \in \mathcal{S} := \{0, 1\}$ denote a fidelity parameter. The combined model is then given by:
$$
\hat{f}(x, \theta, s) := \begin{cases} h(x, \theta), & s = 0 \\ \rho\, h(x, \theta) + \varepsilon(x), & s = 1, \end{cases} \qquad (3)
$$
such that $f(x) = \hat{f}(x, \theta^*, 1)$ and $h(\hat{x}, \hat{\theta}) = \hat{f}(\hat{x}, \hat{\theta}, 0)$, for any $x, \hat{x} \in \mathcal{X}$ and $\hat{\theta} \in \Theta$. As a result, for arbitrary points in the joint space $z, z' \in \mathcal{Z} := \mathcal{X} \times \Theta \times \mathcal{S}$, the following covariance function parameterises the combined GP model $\hat{f} \sim \mathcal{GP}(0, k)$:
$$
k(z, z') := k_\rho(s, s')\, \hat{k}((x, \theta), (x', \theta')) + s s'\, k_\varepsilon(x, x'), \qquad (4)
$$
where $k_\rho(s, s') := (1 + s(\rho - 1))(1 + s'(\rho - 1))$, $z := (x, \theta, s)$, and $z' := (x', \theta', s')$. Therefore, any set of real and simulated evaluations is jointly normally distributed under the combined GP model.

### 4.2 Joint probabilistic model and predictions

Let $Z_R := Z_R(\theta^*) := [(x_i, \theta^*, 1)]_{i=1}^R$ represent the set of partially observed inputs for the real data $y_R$, and let $\hat{Z}_S := [(\hat{x}_i, \hat{\theta}_i, 0)]_{i=1}^S$ denote the current set of simulation inputs for the observations $\hat{y}_S$. Under the GP prior, the joint probability model $p(\hat{y}_S, y_R, \theta^*)$ can be decomposed as:
$$
p(\hat{y}_S, y_R, \theta^*) = p(\hat{y}_S, y_R|\theta^*)\, p(\theta^*) = \int p(\hat{y}_S|\hat{\mathbf{f}})\, p(y_R|\hat{\mathbf{f}}, \theta^*)\, p(\hat{\mathbf{f}}|\theta^*)\, p(\theta^*)\, \mathrm{d}\hat{\mathbf{f}}, \qquad (5)
$$
where $\hat{\mathbf{f}} := \hat{f}(Z(\theta^*)) \in \mathbb{R}^{R+S}$, and $Z(\theta^*) := \{Z_R(\theta^*), \hat{Z}_S\}$ corresponds to the full set of inputs. The GP prior then allows us to model real and simulated outcomes jointly as a Gaussian random vector $\hat{\mathbf{f}}$:
$$
\hat{\mathbf{f}}\,|\,\theta^* \sim \mathcal{N}(0, K(\theta^*)), \qquad (6)
$$
where $K(\theta^*) := k(Z(\theta^*), Z(\theta^*)) = [k(z, z')]_{z, z' \in Z(\theta^*)}$ denotes the prior covariance matrix. Assuming a Gaussian noise model for the real observations, $y = \rho\, h(x, \theta^*) + \varepsilon(x) + \nu$ with $\nu \sim \mathcal{N}(0, \sigma_\nu^2)$, the marginal distribution over the observations $\mathbf{y} := [y_R^\mathsf{T}, \hat{y}_S^\mathsf{T}]^\mathsf{T}$ is available in closed form as:
$$
p(\hat{y}_S, y_R|\theta^*) = \mathcal{N}(\mathbf{y}; 0, K(\theta^*) + \Sigma_y), \qquad (7)
$$
where $\Sigma_y$ denotes the covariance matrix of the observation noise, i.e., $[\Sigma_y]_{ii} = \sigma_\nu^2$ for any $z_i$ with $s_i = 1$, and $[\Sigma_y]_{ij} = 0$ elsewhere (in practice, we add a small nugget term to the diagonal of the noise covariance matrix for numerical stability).

Under the GP assumptions, we can make predictions about $\hat{y} = h(\hat{x}, \hat{\theta})$ at any pair $(\hat{x}, \hat{\theta}) \in \mathcal{X} \times \Theta$. Conditioning on $\theta^*$ and a dataset $\mathcal{D}_t := \{X_R, y_R, \hat{X}_t, \hat{\Theta}_t, \hat{y}_t\}$, let $Z_t(\theta^*) := \{Z_R(\theta^*), \hat{Z}_t\}$ denote the set of inputs up to time $t$ conditional on $\theta^*$, and $\mathbf{y}_t$ the corresponding outputs. We then have that:
$$
p(\hat{y}|\theta^*, \hat{x}, \hat{\theta}, \mathcal{D}_t) = \mathcal{N}(\hat{y}; \mu_t(\hat{z}; \theta^*), \sigma_t^2(\hat{z}; \theta^*)), \qquad (8)
$$
for $\hat{z} := (\hat{x}, \hat{\theta})$, where:
$$
\mu_t(\hat{z}; \theta^*) := k_t(\hat{z}; \theta^*)^\mathsf{T} (K_t(\theta^*) + \Sigma_{y_t})^{-1} \mathbf{y}_t \qquad (9)
$$
$$
k_t(\hat{z}, \hat{z}'; \theta^*) := k(\hat{z}, \hat{z}') - k_t(\hat{z}; \theta^*)^\mathsf{T} (K_t(\theta^*) + \Sigma_{y_t})^{-1} k_t(\hat{z}'; \theta^*) \qquad (10)
$$
$$
\sigma_t^2(\hat{z}; \theta^*) := k_t(\hat{z}, \hat{z}; \theta^*), \qquad (11)
$$
with $k_t(\hat{z}; \theta^*) := k(Z_t(\theta^*), \hat{z})$ and $K_t(\theta^*) := k(Z_t(\theta^*), Z_t(\theta^*))$. We next describe how to apply this model to derive a Bayesian adaptive calibration algorithm.
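To make the combined covariance in Eq. (4) concrete, below is a minimal PyTorch sketch. It is illustrative only: the released implementation builds on BoTorch kernels (see Appendix A), whereas here both $\hat{k}$ and $k_\varepsilon$ are stood in for by a simple squared-exponential kernel, and the input layout (design dimensions first, then calibration parameters, then the fidelity flag) is an assumption.

```python
import torch

def rbf_kernel(a, b, lengthscale=1.0):
    # Squared-exponential kernel; an illustrative stand-in for k_hat and k_eps.
    sqdist = torch.cdist(a / lengthscale, b / lengthscale) ** 2
    return torch.exp(-0.5 * sqdist)

def combined_kernel(z1, z2, rho=1.0, dim_x=2):
    """Bi-fidelity covariance k(z, z') from Eq. (4).

    Each row of z1 and z2 is a stacked input (x, theta, s), where s = 0 marks a
    simulation and s = 1 marks a real observation.
    """
    x1, th1, s1 = z1[:, :dim_x], z1[:, dim_x:-1], z1[:, -1:]
    x2, th2, s2 = z2[:, :dim_x], z2[:, dim_x:-1], z2[:, -1:]
    # k_rho(s, s') = (1 + s (rho - 1)) (1 + s' (rho - 1)) rescales the simulator term.
    k_rho = (1.0 + s1 * (rho - 1.0)) @ (1.0 + s2 * (rho - 1.0)).T
    # Simulator kernel k_hat over the joint (x, theta) space.
    k_sim = rbf_kernel(torch.cat([x1, th1], dim=-1), torch.cat([x2, th2], dim=-1))
    # Discrepancy kernel k_eps only correlates real-data points (s = s' = 1).
    k_err = (s1 @ s2.T) * rbf_kernel(x1, x2)
    return k_rho * k_sim + k_err
```

Evaluating such a kernel over the stacked inputs $Z(\theta^*)$ yields the covariance matrix $K(\theta^*)$ used in Eqs. (6) and (7).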
## 5 Bayesian adaptive calibration and optimal design

In this section, we describe an approach to designing experiments for the calibration of computer models that incorporates information gathered during the experiments iteratively. We refer to these types of designs as adaptive. Thus, we consider the sequential design of experiments setting, where at each iteration $t \in \mathbb{N}$ we optimise:
$$
\mathrm{EIG}_t(\hat{x}, \hat{\theta}) := I(\theta^*; \hat{y} \mid \hat{x}, \hat{\theta}, \mathcal{D}_{t-1}) = H(p(\theta^*|\mathcal{D}_{t-1})) - \mathbb{E}_{\hat{y} \sim p(\hat{y}|\hat{x}, \hat{\theta}, \mathcal{D}_{t-1})}[H(p(\theta^*|\hat{y}, \hat{x}, \hat{\theta}, \mathcal{D}_{t-1}))] = \mathbb{E}_{p(\hat{y}, \theta^*|\hat{x}, \hat{\theta}, \mathcal{D}_{t-1})}\left[\log \frac{p(\theta^*|\hat{y}, \hat{x}, \hat{\theta}, \mathcal{D}_{t-1})}{p(\theta^*|\mathcal{D}_{t-1})}\right], \qquad (12)
$$
given the dataset $\mathcal{D}_{t-1} := \{X_R, y_R, \hat{X}_{t-1}, \hat{\Theta}_{t-1}, \hat{y}_{t-1}\}$ of observations. Given that the expected information gain is submodular [28], a sequential approach allows us to get close enough (usually within a factor of at least $1 - 1/e$ [29]) to the optimal EIG over the whole experiment, while also allowing algorithmic decisions to adapt to the current estimates of $p(\theta^*|\mathcal{D}_t)$.

In general, computing the full EIG objective (1), or its sequential version (12), is intractable, as that requires estimating the true posterior and its density conditioned on sampled data. Note that both $p(\theta^*|\hat{y}, \hat{x}, \hat{\theta}, \mathcal{D}_{t-1})$ and $p(\hat{y}, \theta^*|\hat{x}, \hat{\theta}, \mathcal{D}_{t-1})$ depend on the posterior $p(\theta^*|\mathcal{D}_{t-1})$, as:
$$
p(\theta^*|\hat{y}, \hat{x}, \hat{\theta}, \mathcal{D}_{t-1}) = \frac{p(\hat{y}, \theta^*|\hat{x}, \hat{\theta}, \mathcal{D}_{t-1})}{p(\hat{y}|\hat{x}, \hat{\theta}, \mathcal{D}_{t-1})} \qquad (13)
$$
$$
p(\hat{y}, \theta^*|\hat{x}, \hat{\theta}, \mathcal{D}_{t-1}) = p(\hat{y}|\theta^*, \hat{x}, \hat{\theta}, \mathcal{D}_{t-1})\, p(\theta^*|\mathcal{D}_{t-1}), \qquad (14)
$$
where the conditional predictive density $p(\hat{y}|\theta^*, \hat{x}, \hat{\theta}, \mathcal{D}_{t-1})$ is Gaussian and available in closed form (Eq. 8). Clearly, in general, the true posterior is intractable, since $p(\theta^*|\mathcal{D}_t) = p(\mathcal{D}_t|\theta^*)\, p(\theta^*) / p(\mathcal{D}_t)$ and $p(\mathcal{D}_t) = \int_\Theta p(\mathcal{D}_t|\theta^*)\, p(\theta^*)\, \mathrm{d}\theta^*$ involves integration over the entire parameter space $\Theta$, which can be high dimensional and involve highly non-linear operations, such as computing inverse covariances. In addition, the marginal predictive density $p(\hat{y}|\hat{x}, \hat{\theta}, \mathcal{D}_{t-1}) = \int_\Theta p(\hat{y}, \theta^*|\hat{x}, \hat{\theta}, \mathcal{D}_{t-1})\, \mathrm{d}\theta^*$ is usually intractable for the same reasons.

### 5.1 Variational EIG lower bound

Following Foster et al. [13], we replace the EIG by a variational objective which does not require the true posterior density over $\theta^*$. This formulation allows us to jointly estimate an approximation to the posterior and select optimal design points $\hat{x}$ and simulation parameters $\hat{\theta}$. Applying the variational lower bound by Barber and Agakov [30] to Eq. 12 yields the following alternative to the EIG:
$$
\widehat{\mathrm{EIG}}_t(\hat{x}, \hat{\theta}, q) := \mathbb{E}_{p(\hat{y}, \theta^*|\hat{x}, \hat{\theta}, \mathcal{D}_{t-1})}\left[\log \frac{q(\theta^*|\hat{y}, \hat{x}, \hat{\theta})}{p(\theta^*|\mathcal{D}_{t-1})}\right] \leq \mathrm{EIG}_t(\hat{x}, \hat{\theta}), \qquad (15)
$$
where $q(\theta^*|\hat{y}, \hat{x}, \hat{\theta})$ is any conditional probability density model. The gap is given by the expected Kullback-Leibler (KL) divergence between the true and the variational posterior [13, Sec. A.1]:
$$
\mathrm{EIG}_t(\hat{x}, \hat{\theta}) - \widehat{\mathrm{EIG}}_t(\hat{x}, \hat{\theta}, q) = \mathbb{E}_{p(\hat{y}|\hat{x}, \hat{\theta}, \mathcal{D}_{t-1})}[D_{\mathrm{KL}}(p(\theta^*|\mathcal{D}_{t-1}, \hat{y})\,\Vert\,q(\theta^*|\hat{y}))] \geq 0. \qquad (16)
$$
Maximising the variational EIG lower bound w.r.t. the variational distribution $q$ then provides us with an approximation to $p(\theta^*|\hat{y}, \hat{x}, \hat{\theta}, \mathcal{D}_{t-1})$. Therefore, we can simultaneously obtain maximally informative designs and optimal variational posteriors by jointly optimising the EIG lower bound w.r.t. the simulator inputs and the variational distribution as:
$$
\hat{x}_t, \hat{\theta}_t, q_t \in \operatorname*{argmax}_{\hat{x} \in \mathcal{X}, \hat{\theta} \in \Theta, q \in \mathcal{Q}} \widehat{\mathrm{EIG}}_t(\hat{x}, \hat{\theta}, q) = \operatorname*{argmax}_{\hat{x} \in \mathcal{X}, \hat{\theta} \in \Theta, q \in \mathcal{Q}} \mathbb{E}_{p(\hat{y}, \theta^*|\hat{x}, \hat{\theta}, \mathcal{D}_{t-1})}[\log q(\theta^*|\hat{y})], \qquad (17)
$$
given a suitable variational family $\mathcal{Q}$ of conditional distributions.
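As a concrete illustration of the objective in Eq. (17), the sketch below estimates the lower bound by Monte Carlo. It assumes hypothetical helpers that are not part of the paper's code: `posterior_samples` holding draws of $\theta^*$ from $p(\theta^*|\mathcal{D}_{t-1})$, `gp_conditional(x_hat, theta_hat, theta_star)` returning the mean and standard deviation of the Gaussian predictive in Eq. (8), and a conditional density model `q` exposing `log_prob`. It mirrors the structure of the objective rather than the exact implementation.

```python
import torch

def eig_lower_bound(x_hat, theta_hat, q, posterior_samples, gp_conditional, n_samples=256):
    """Monte Carlo estimate of E_{p(y,theta*)}[log q(theta* | y, x_hat, theta_hat)] (Eq. 17).

    The dropped term E[log p(theta*|D_{t-1})] is constant w.r.t. (x_hat, theta_hat, q),
    so it does not change the maximiser of the lower bound in Eq. (15).
    """
    idx = torch.randint(posterior_samples.shape[0], (n_samples,))
    theta_star = posterior_samples[idx]                        # theta* ~ p(theta*|D_{t-1})
    mu, sigma = gp_conditional(x_hat, theta_hat, theta_star)   # Gaussian predictive of Eq. (8)
    y_hat = mu + sigma * torch.randn_like(mu)                  # reparameterised sample of y_hat
    return q.log_prob(theta_star, y_hat, x_hat, theta_hat).mean()

# Joint optimisation (Eq. 17) can then follow a standard gradient-ascent loop, e.g.:
#   opt = torch.optim.Adam([x_hat, theta_hat] + list(q.parameters()), lr=1e-2)
#   loss = -eig_lower_bound(x_hat, theta_hat, q, samples, gp_conditional)
#   loss.backward(); opt.step()
```

The reparameterised sample of $\hat{y}$ lets gradients flow back through the GP predictive to the candidate inputs, which is what enables optimising designs and the variational model jointly.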
Note that, in this formulation, we only need samples from the posterior $p(\theta^*|\mathcal{D}_{t-1})$ to estimate the expectation above, which can be approximated via Monte Carlo, without requiring densities other than that of the variational model $q$.

### 5.2 Algorithm

Algorithm 1 summarises the method we propose, which we name Bayesian Adaptive Calibration and Optimal desigN (BACON). The algorithm starts with an initial dataset $\mathcal{D}_0$ containing the real data (and possibly previously available simulation data) and an estimate of the posterior given the initial data, $p(\theta^*|\mathcal{D}_0)$. Posterior estimates in BACON can be represented by samples obtained via Markov chain Monte Carlo (MCMC) or variational inference over the GP model and the currently available data $\mathcal{D}_t$. Note that we only need samples from the previous posterior to estimate the expectation in Eq. 17, with no need to directly evaluate its probability densities. Each iteration starts by optimising the variational EIG lower bound using the objective in Eq. 17 to jointly select an optimal design $\hat{x}_t$, simulation parameters $\hat{\theta}_t$ and a variational posterior $q_t$. Given the new design $\hat{x}_t$, we run the simulation with the chosen parameters $\hat{\theta}_t$, observing a new outcome $\hat{y}_t$. The calibration posterior $p_t(\theta^*)$ and the GP model are then updated with the new data, potentially including a re-estimation of the GP hyper-parameters via, for example, maximum likelihood estimation. The process then repeats given the updated GP and posterior for up to a given number of iterations $T$. At the end, a final posterior $p_T(\theta^*) = p(\theta^*|y_R, \hat{y}_T)$ and a conditional density model $q_T$ are obtained. (To avoid notation clutter, we will at times write $q(\theta^*|\hat{y})$ to denote $q(\theta^*|\hat{y}, \hat{x}, \hat{\theta})$, as the dependence on the inputs $(\hat{x}, \hat{\theta})$ remains implicit through the conditioning on $\hat{y}$.)

**Algorithm 1: BACON**
- **Input:** $\mathcal{D}_0 := \{X_R, y_R\}$ {real data}; $p_0(\theta^*) := p(\theta^*|\mathcal{D}_0)$ {MCMC or VI prior distribution}
- **for** $t \in \{1, \dots, T\}$ **do**
  - $\hat{x}_t, \hat{\theta}_t, q_t \leftarrow \operatorname{argmax}_{\hat{x}, \hat{\theta}, q} \mathbb{E}_{p_{t-1}(\hat{y}, \theta^*|\hat{x}, \hat{\theta})}[\log q(\theta^*|\hat{y})]$ {optimise $\widehat{\mathrm{EIG}}_t$}
  - $\hat{y}_t := h(\hat{x}_t, \hat{\theta}_t)$ {run simulation}
  - $\mathcal{D}_t := \mathcal{D}_{t-1} \cup \{\hat{x}_t, \hat{\theta}_t, \hat{y}_t\}$ {update GP model}
  - $p_t(\theta^*) = p(\theta^*|\mathcal{D}_t)$ {update posterior via MCMC or VI}
- **end for**
- **Output:** $p_T(\theta^*)$ {final posterior}

### 5.3 Variational posteriors

Any conditional probability density model $q(\theta^*|\hat{y})$ estimating probability densities over the parameter space $\Theta$ given an observation $\hat{y}$ could suit our method. In the following, we describe two possible parameterisations for this model. The first facilitates marginalising latent inputs in GP regression [31, 32], while the second better captures multi-modality in the posterior.

**Conditional Gaussian models.** Assuming we can approximate $p(\theta^*|\mathcal{D}_t)$ as a Gaussian, we can construct a variational conditional density model as:
$$
q_\phi(\theta^*|\hat{y}, \hat{x}, \hat{\theta}) := \mathcal{N}(\theta^*; m_\phi(\hat{y}, \hat{x}, \hat{\theta}), \Sigma_\phi(\hat{y}, \hat{x}, \hat{\theta})), \qquad (18)
$$
where $m_\phi$ and $\Sigma_\phi$ are given by parametric models, such as neural networks, with parameters $\phi$. To ensure $\Sigma_\phi(\cdot)$ is positive-definite, it can be parameterised by its Cholesky decomposition $\Sigma_\phi(\cdot) = L_\phi(\cdot) L_\phi(\cdot)^\mathsf{T}$, where $L_\phi(\cdot)$ is a lower-triangular matrix with positive diagonal entries.
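A minimal PyTorch sketch of this conditional Gaussian parameterisation is shown below. The network width and the simple concatenation of the conditioning inputs are illustrative assumptions rather than the exact architecture used in the paper's experiments.

```python
import torch
import torch.nn as nn

class ConditionalGaussian(nn.Module):
    """Conditional Gaussian density q_phi(theta* | y_hat, x_hat, theta_hat) of Eq. (18),
    with the covariance parameterised via its Cholesky factor L_phi."""

    def __init__(self, cond_dim, theta_dim, hidden=64):
        super().__init__()
        self.theta_dim = theta_dim
        n_tril = theta_dim * (theta_dim + 1) // 2
        self.net = nn.Sequential(
            nn.Linear(cond_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, theta_dim + n_tril),
        )

    def forward(self, cond):
        out = self.net(cond)
        mean, chol_flat = out[..., : self.theta_dim], out[..., self.theta_dim :]
        L = torch.zeros(*cond.shape[:-1], self.theta_dim, self.theta_dim, device=cond.device)
        rows, cols = torch.tril_indices(self.theta_dim, self.theta_dim)
        L[..., rows, cols] = chol_flat
        diag = torch.arange(self.theta_dim)
        # Softplus keeps the diagonal positive, so L L^T is a valid covariance matrix.
        L[..., diag, diag] = nn.functional.softplus(L[..., diag, diag]) + 1e-5
        return torch.distributions.MultivariateNormal(mean, scale_tril=L)

    def log_prob(self, theta_star, y_hat, x_hat, theta_hat):
        cond = torch.cat([y_hat, x_hat, theta_hat], dim=-1)
        return self.forward(cond).log_prob(theta_star)
```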
**Conditional normalising flows.** Normalising flows [33] apply the change-of-variables formula to derive composable, invertible transformations $g_w$ of a fixed base distribution $p_0$:
$$
g_w(\xi_0) := g_w^{(K)} \circ \dots \circ g_w^{(1)}(\xi_0), \quad \xi_0 \sim p_0. \qquad (19)
$$
The log-probability density of a point $\xi = g_w(\xi_0)$ under this model can be calculated as:
$$
\log p_K(\xi; w) = \log p_0(\xi_0) - \sum_{j=1}^{K} \log \left|\det J_w^{(j)}(\xi_{j-1})\right|,
$$
where $\xi_0 := g_w^{-1}(\xi)$, $\xi_j := g_w^{(j)}(\xi_{j-1})$, and $J_w^{(j)}$ is the Jacobian matrix of the $j$th transform $g_w^{(j)}$, for $j \in \{1, \dots, K\}$. Several invertible flow architectures have been proposed in the literature, including radial and planar flows [33], autoregressive models [34-36] and models based on splines [37]. To derive a conditional density model $q_\phi(\theta^*|\hat{y})$, conditional normalising flows map the original flow parameters $w$ via a neural network model $r_\phi: \hat{y} \mapsto w$ [38, 39]. The resulting variational conditional density model is then given by:
$$
\log q_\phi(\theta^*|\hat{y}, \hat{x}, \hat{\theta}) = \log p_K(\theta^*; r_\phi(\hat{y}, \hat{x}, \hat{\theta})). \qquad (20)
$$

### 5.4 Batch parallel evaluations

Often simulations can be run in parallel by spawning multiple processes on a single machine or over a high-performance computing cluster. In this case, proposing batches with multiple simulation inputs can be more effective than running single simulations in sequence. Optimising the EIG w.r.t. a batch of inputs $\mathcal{B} := \{\hat{x}_i, \hat{\theta}_i\}_{i=1}^B$, instead of single points, we obtain a batch version of Algorithm 1. In this case, we seek a batch that maximises the mutual information between the parameters $\theta^*$ and the resulting simulation outcomes, i.e.:
$$
\mathrm{EIG}_t(\mathcal{B}) = I(\theta^*; \{\hat{y}_i\}_{i=1}^B \mid \mathcal{B}, \mathcal{D}_{t-1}) \geq \mathbb{E}_{p(\{\hat{y}_i\}_{i=1}^B, \theta^*|\mathcal{B}, \mathcal{D}_{t-1})}\left[\log \frac{q(\theta^*|\{\hat{y}_i\}_{i=1}^B)}{p(\theta^*|\mathcal{D}_{t-1})}\right]. \qquad (21)
$$
We optimise this objective with variational models that accept multiple conditioning observations, $q(\theta^*|\hat{y}_1, \dots, \hat{y}_B)$. In practice, this simply amounts to replacing the single conditioning entries to the models in Sec. 5.3 by the concatenated batch or a permutation-invariant deep set encoding [16, 40], as sketched below.
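The following is a minimal PyTorch sketch of such a permutation-invariant (deep set) encoder. The layer widths and the 8-dimensional embedding mirror the architecture described in Appendix A, but the exact composition of each set element is an assumption here.

```python
import torch
import torch.nn as nn

class SetEncoder(nn.Module):
    """Permutation-invariant encoder for a batch of proposed simulations.

    Each element (e.g. the concatenation of x_hat_i, theta_hat_i and y_hat_i) is embedded
    independently and the embeddings are summed, so the resulting context vector is
    invariant to the ordering of the batch and accepts batches of arbitrary size B.
    """

    def __init__(self, elem_dim, emb_dim=8, hidden=32):
        super().__init__()
        self.phi = nn.Sequential(
            nn.Linear(elem_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, emb_dim),
        )

    def forward(self, elements):
        # elements: (..., B, elem_dim) -> context: (..., emb_dim)
        return self.phi(elements).sum(dim=-2)

# The summed embedding is then passed as the conditioning input of the variational
# model, i.e. q(theta* | y_1, ..., y_B) conditions on SetEncoder([x_i, theta_i, y_i]_{i=1..B}).
```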
**Figure 1** (panels: (a) MAP error, (c) $D_{\mathrm{KL}}(p_t \Vert p^*)$, (d) posterior): Experimental results on synthetic data where the target posterior $p^*$ is unimodal. The first three plots show estimates of performance metrics as a function of the number of simulations run (not including the initial data). Estimates were computed based on the posterior estimates available to each method during its run, with random using $p(\theta^*)$, D-optimal and BACON using MCMC posteriors, and IMSPE using a Dirac delta on the MAP estimate (reverse KL undefined, not shown) as its posterior estimate. Results are averaged over 10 trials, and shaded areas indicate 1 standard deviation. The rightmost plot shows the target posterior, with the true $\theta^*$ indicated by a star.

**Figure 2** (panels: (a) MAP error, (c) $D_{\mathrm{KL}}(p_t \Vert p^*)$, (d) posterior): Experimental results on synthetic data where the target posterior $p^*$ is bimodal. See Fig. 1 for details, with the exception that the rightmost plot now shows the bimodal target posterior.

## 6 Experiments

In this section, we present experimental results on synthetic and real-data problems, evaluating the proposed variational Bayesian adaptive calibration framework against baselines. Further experimental details can be found in Appendix A and in our code repository (https://github.com/csiro-funml/bacon).

**Performance metrics.** We evaluated each method against a set of performance metrics, which we now describe. The maximum-a-posteriori (MAP) error measures the distance between the mode of the variational distribution and the true parameters $\theta^*$. To measure the quality of the learnt model in predicting real outcomes, we also evaluated the root mean square error (RMSE) between the expected GP predictions under the learnt variational distribution and real outcomes, $\sqrt{\frac{1}{N}\sum_{i=1}^N \big(\mathbb{E}_{q(\theta^*)}[\mu(x_i^*; \theta^*)] - y_i^*\big)^2}$, where $y_i^* = f(x_i^*) + \nu_i^*$ are observations of the real process over a set of test points $\{x_i^*\}_{i=1}^N \subset \mathcal{X}$ placed on a uniform grid over the design space.

**Information gain.** Lastly, we also evaluated two sample-based estimates of the KL divergence [41]. Namely, $D_{\mathrm{KL}}(p_T \Vert p_0)$ corresponds to the KL divergence between the final MCMC posterior (given all simulations and real data) and the initial one (given only the real data and an initial set of randomised simulations), both estimated over the learnt GP model. The column $D_{\mathrm{KL}}(p_T \Vert p^*)$ indicates the KL divergence between the final MCMC posterior $p_T$ and the posterior $p^*$ with full knowledge of the simulator, which can be cheaply evaluated in this synthetic scenario. The average of $D_{\mathrm{KL}}(p_T \Vert p_0)$ is an indicator of the expected information gain (1) of an algorithm, given that it is the expected relative entropy across the possible trajectories of observations. In contrast, $D_{\mathrm{KL}}(p_T \Vert p^*)$ indicates how far the estimates are from the best possible posterior obtainable with a model that is given the available real data and (a potentially infinite amount of) simulations.

**Table 1:** Results for the 2+2D synthetic problem after $T = 50$ iterations (batch size $B = 4$). Here $D_{\mathrm{KL}}(p_T \Vert p_0)$ corresponds to the KL divergence between the final posterior (estimated after each algorithm's run with all the data it collected) and the starting one (higher is better), while $D_{\mathrm{KL}}(p_T \Vert p^*)$ is the KL between the final posterior and the posterior with full knowledge of the simulator $p^*$ (lower is better). All posteriors were sampled via MCMC using 4000 samples. Averages and standard deviations were estimated from 10 independent runs.

(a) Unimodal posterior

| Method | $D_{\mathrm{KL}}(p_T \Vert p_0)$ (higher is better) | $D_{\mathrm{KL}}(p_T \Vert p^*)$ (lower is better) |
|---|---|---|
| BACON | 1.00 ± 0.06 | 0.76 ± 0.13 |
| IMSPE | 0.89 ± 0.11 | 1.05 ± 0.19 |
| D-optim. | 0.42 ± 0.11 | 1.09 ± 0.15 |
| Random | 0.62 ± 0.07 | 1.18 ± 0.13 |
| VBMC | n/a | 0.53 ± 0.02 |

(b) Bimodal posterior

| Method | $D_{\mathrm{KL}}(p_T \Vert p_0)$ (higher is better) | $D_{\mathrm{KL}}(p_T \Vert p^*)$ (lower is better) |
|---|---|---|
| BACON | 0.40 ± 0.03 | 0.45 ± 0.06 |
| IMSPE | 0.19 ± 0.04 | 0.70 ± 0.07 |
| D-optim. | 0.07 ± 0.02 | 0.94 ± 0.03 |
| Random | 0.28 ± 0.07 | 0.54 ± 0.07 |
| VBMC | n/a | 0.49 ± 0.13 |

### 6.1 Baselines

Our algorithmic baselines were chosen to illustrate the main approaches currently available in the literature. All baselines are implemented as sequential methods, in the sense that their GP models are updated with the latest batch of observations before proceeding to the next iteration.

**Random search.** This baseline samples simulation designs $\hat{x}_t \sim \mathcal{U}(\mathcal{X})$ from a uniform distribution over the design space $\mathcal{X}$ and calibration parameters from the prior $\hat{\theta}_t \sim p(\theta^*)$.

**IMSPE with MAP estimates.** The integrated mean squared prediction error (IMSPE) [42] criterion chooses designs $\hat{x}_t$ and calibration parameters $\hat{\theta}_t$ by minimising the GP prediction error:
$$
\mathrm{IMSPE}_t(\hat{z}) := \int_{\mathcal{Z}} \mathbb{E}[(\hat{f}(z) - \mu_{t+1}(z; \theta^*))^2 \mid \hat{f}(\hat{z}), \mathcal{D}_t]\, \mathrm{d}z = \int_{\mathcal{Z}} \sigma_{t+1}^2(z; \theta^* \mid \mathcal{D}_t, \hat{f}(\hat{z}))\, \mathrm{d}z. \qquad (22)
$$
The posterior MAP estimate $\theta_t^* \in \operatorname{argmax}_\theta p(\theta|\mathcal{D}_{t-1})$ is used as a point estimate for the true $\theta^*$. The integral is approximated as a sum over a uniform grid of designs and samples from the calibration prior (the original paper proposed analytic solutions to Eq. 22 tailored to specific kernels; we instead keep our codebase generic to work with different kernels and therefore opt for a numerical approximation), making IMSPE equivalent to Active Learning Cohn [43] and also a form of A-optimality [28].

**D-optimal designs.** We provide experimental results with an additional baseline following a D-optimality criterion, a classic experimental design objective. Optimal candidate designs according to this criterion are points of maximum uncertainty according to the model [28]. If we model the simulator as the unknown variable of interest, this corresponds to selecting designs where the Gaussian predictive distribution $p(\hat{y}|\hat{x}, \hat{\theta}, \mathcal{D}_{t-1})$ has maximum entropy. This approach, therefore, simply attempts to collect an informative set of simulations according to the GP prior over the simulator $h$ only, without considering the information in the real data. Running D-optimality on $\theta^*$, instead, would lead back to the EIG criterion we use.
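A minimal sketch of the D-optimal baseline's selection rule is shown below. It assumes a hypothetical helper `gp_predictive_var` that returns the GP predictive variance of the simulator emulator at each candidate input, and selects a batch greedily, ignoring within-batch correlations; it is not the exact implementation used in the experiments.

```python
import torch

def d_optimal_batch(candidates, gp_predictive_var, batch_size=4):
    """Greedy D-optimal baseline: choose the candidate simulator inputs (x_hat, theta_hat)
    whose GP predictive distribution has the largest entropy.

    For a Gaussian predictive, the entropy is 0.5 * log(2 * pi * e * var), i.e. monotone
    in the variance, so ranking candidates by predictive variance is sufficient.
    """
    var = gp_predictive_var(candidates)               # shape: (num_candidates,)
    top = torch.topk(var, k=batch_size).indices
    return candidates[top]
```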
**Variational Bayesian Monte Carlo (VBMC).** Acerbi [44] presents an adaptive Bayesian quadrature method to learn posterior distributions over models with black-box likelihood functions. The method estimates the posterior $p(\theta^*|y_R, h)$ by modelling the log-joint $\log p(y_R, \theta^*|h)$ as a sample from a Gaussian process. VBMC then learns a variational posterior approximation by maximising a lower confidence bound over the ELBO given by the GP estimates. Calibration parameter queries $\hat{\theta}_t$ are obtained by optimising quadrature-based acquisition functions. Regarding design points, simulations are always run on the set of real design points $X_R$ in the observed data, which is fixed.

**Table 2:** Results on the location finding problem after $T = 30$ iterations with $B = 4$, $R = 20$ real data points and an initial set of 20 simulations. Estimates were averaged over 10 independent runs.

| Method | $D_{\mathrm{KL}}(p_T \Vert p_0)$ (higher is better) | $D_{\mathrm{KL}}(p_T \Vert p^*)$ (lower is better) |
|---|---|---|
| BACON | 0.37 ± 0.09 | 0.07 ± 0.06 |
| IMSPE | 0.22 ± 0.11 | 0.45 ± 0.21 |
| D-optimal | 0.21 ± 0.08 | 0.23 ± 0.10 |
| Random | 0.32 ± 0.09 | 0.20 ± 0.14 |
| VBMC | n/a | 5.48 ± 1.66 |

### 6.2 Synthetic experiments

For this experiment, we sampled a function $\hat{f} \sim \mathcal{GP}(0, k)$ to use as our simulator and compared the different algorithms. Following a sparse GP approach [45], a function sampled from a GP can be approximated as $\hat{f}(z) \approx k(z, Z_M) K_M^{-1} u_M$, where $u_M \sim \mathcal{N}(\hat{u}_M, \Sigma_M)$ is a sample from an $M$-dimensional Gaussian and $Z_M := \{z_i\}_{i=1}^M \subset \mathcal{X} \times \Theta \times \{0, 1\}$, for a given $M$. As the number of points $M \to \infty$, if the pseudo-inputs $Z_M$ form a dense set, the approximate $\hat{f}$ should converge in distribution to a sample from the Gaussian process $\mathcal{GP}(0, k)$. In our case, to sample $Z_M$, we sample designs from a uniform distribution over the design space, calibration parameters from the prior, and fidelities from a Bernoulli distribution with parameter set to 0.5. We also set $\hat{u}_M := 0$ and $\Sigma_M := K_M = k(Z_M, Z_M)$.

We repeatedly run a loop of $T$ iterations for each algorithm, with different random seeds. We run each algorithm for $T := 50$ iterations using a batch of $B := 4$ designs per iteration. Each of the methods using GP approximations for the simulator is initialised with 20 observations and $R = 5$ real data points. To configure VBMC, we allow it to run an equivalent maximum number of objective function evaluations. The design space is set as the 2-dimensional unit box $\mathcal{X} := [0, 1]^2$ and the true parameters are sampled from a standard normal prior $p(\theta^*) := \mathcal{N}(\theta^*; 0, I)$, also over a 2D space, totalling a 4-dimensional problem space.

Results are presented in Fig. 1 and 2. Fig. 1 shows a case where the GP-sampled simulator led to a unimodal target posterior. In this case, we see that BACON is able to achieve fast convergence in terms of MAP estimates and KL divergence towards the target posterior, while IMSPE dominates in terms of simulator approximation error as measured by the RMSE. As the posterior is unimodal and quite concentrated around the true parameter, it is natural that a method relying on MAP estimates, such as IMSPE, would perform well. In contrast, when the posterior is multimodal, as shown in the bimodal case in Fig. 2, MAP estimates are no longer necessarily reliable, as they might get stuck on a non-informative mode, leading to biased estimates for IMSPE and a significant drop in performance.
Lastly, note that D-optimal and random designs can also lead to an RMSE approaching the lowest achievable (as determined by the noise level, $\sigma_\nu = 0.5$) in some circumstances. However, these approaches do not directly provide posterior approximations and may fail in more complex scenarios.

In terms of final posterior estimates, Table 1 shows that VBMC estimates reach the closest to the full-knowledge target posterior $p^*$ in the unimodal case, while BACON is able to surpass the other GP-emulation approaches in terms of information gain. For the bimodal case, however, we see that BACON gains an advantage over VBMC. Recall that VBMC relies on a variational mixture of Gaussian distributions, while BACON applies conditional normalising flows for its posterior approximations, which leads to increased flexibility. In addition, despite the slightly worse performance than VBMC in the unimodal case, BACON also provides a GP model that can be used as an emulator for the simulator (and to approximate the real process), while VBMC's focus is on approximating the log-likelihood.

### 6.3 Finding the location of hidden sources

We consider the problem of finding the location of 2 hidden sources in a 2D environment, following the setting in Foster et al. [16]. We are provided with $R = 20$ initial measurements and an initial set of $S = 20$ randomised simulations, without knowledge of the true parameters with which the data was generated. Sources are sampled from a standard normal, the design space is limited to the unit box, and noise is sampled with $\sigma_\nu = 0.5$. Our results are presented in Table 2, showing a similar tendency of higher information gain for our method, and a very low KL w.r.t. $p^*$. Note that a higher information gain indicates a more informative posterior, whose entropy will be much lower relative to the starting distribution, compared to the other methods. In addition, the ideal $p^*$, which a GP-based posterior should converge to in the limit of infinite data, is not known by the methods, only $p_0$. Therefore, besides obtaining maximally informative data, we have shown that BACON is also efficient in approximating posteriors over black-box simulators, while also learning a GP emulator.

**Figure 3** (panels: (a) platform, (b) real grasp, (c) simulation): Soft-robotics grasping experiment. We calibrate a soft-materials simulator against real data from physical grasping trials on an automated experimentation platform.

**Table 3:** Soft-robotics simulator calibration final results after $T = 10$ iterations with $B = 16$ points per batch. The target posterior $p^*$ was inferred using a large set of 1024 random simulations uniformly covering the design and parameter space. Performance was averaged over 4 independent runs.

| Method | $D_{\mathrm{KL}}(p_T \Vert p^*)$ (lower is better) |
|---|---|
| BACON | 1.32 ± 0.05 |
| IMSPE | 1.56 ± 0.08 |
| D-optimal | 1.50 ± 0.05 |
| Random | 1.48 ± 0.07 |

### 6.4 Soft-robotic grasping simulator calibration

For this experiment, we are provided with a dataset containing $R = 10$ real measurements of the peak grasping force of soft robotic gripper designs on a range of test objects (see Fig. 3). The gripper designs follow a fin-ray pattern parameterised by 9 geometric parameters [46], and we are interested in estimating 2 unknown physics parameters, the Young's modulus of elasticity and the coefficient of static friction with the objects. To simulate the gripper designs, we use the SOFA framework [47] to reproduce the grasping scenario and provide an estimate of the peak grasping force.
In particular, for this paper, we focus on the grasping of a spherical object, which provides a simpler geometry and lower discrepancy with respect to the real data measurements compared to more complex objects. This experiment provides us with a benchmark where simulations are expensive, taking from minutes to a few hours (depending on mesh resolution) on a high-performance computing platform. Therefore, it is important to choose a minimal number of informative simulations. Our results are shown in Table 3. Each algorithm was initialised with a set of 123 random simulations and run for $T = 10$ iterations. The results show that BACON achieves the closest approximation to the target posterior. IMSPE highly concentrated its parameter choices around its posterior mode estimate, while the other baselines were too spread out, both leading to inferior posterior approximations (see Fig. 4 in the appendix) and showing the advantage of BACON's joint optimisation and inference.

## 7 Conclusion, limitations and future work

We have developed BACON, a Bayesian approach that carries out parameter calibration of computer models and optimal design of experiments jointly. It does so by optimising an information-theoretic criterion so that input designs and calibration parameters are selected to be maximally informative about the optimal parameters. Our method provides a full posterior over optimal calibration parameters as well as an accurate Gaussian process based estimate of the computer model (i.e., an emulator). One of the main limitations of the presented framework, however, is scalability to large datasets, due to the cubic computational complexity of exact inference with GPs. A potential extension with scalable sparse variational GP models [48] using a conditional distribution model for the inducing points is discussed in Sec. B.2. We emphasise that our proposed method is still applicable to many practical settings, where the problem constraints do not demand a very large number of simulation samples. Lastly, we also note that the method can be adapted to work with vector-valued observations by the use of multi-output GP models [49]. Further discussion of limitations and future work can be found in the appendix (see Appendix B and C).

**Acknowledgements.** This project was supported by resources and expertise provided by CSIRO IMT Scientific Computing. We are also grateful for the support of CSIRO's Data61 soft-robotics team, especially Josh Pinskier, Xing Wang, Lois Liow, Sarah Baldwin, James Brett and Vinoth Viswanathan, in the experimental data collection and simulation setup for the soft-robotics calibration problem.

**References**

[1] Marc C. Kennedy and Anthony O'Hagan. Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(3):425-464, 2001.

[2] Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, 2006.

[3] Michael U. Gutmann and Jukka Corander. Bayesian optimization for likelihood-free inference of simulator-based statistical models. Journal of Machine Learning Research, 17, 2016.

[4] Erin R. Leatherman, Angela M. Dean, and Thomas J. Santner. Designing combined physical and computer experiments to maximize prediction accuracy. Computational Statistics and Data Analysis, 113:346-362, 2017.

[5] Sébastien Marmin and Maurizio Filippone. Deep Gaussian processes for calibration of computer models (with discussion). Bayesian Analysis, 17(4):1301-1350, 2022.
[6] Tom Rainforth, Adam Foster, Desi R. Ivanova, and Freddie Bickford Smith. Modern Bayesian Experimental Design. Statistical Science, 39(1):100 114, 2024. [7] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2005. [8] Ronald A. Fisher. The design of experiments. Oliver & Boyd, Oxford, England, 1935. [9] Stewart Greenhill, Santu Rana, Sunil Gupta, Pratibha Vellanki, and Svetha Venkatesh. Bayesian optimization for adaptive experimental design: A review. IEEE Access, 8:13937 13948, 2020. [10] J. Kiefer. Optimum experimental designs. Journal of the Royal Statistical Society. Series B (Methodological), 21(2):272 319, 1959. [11] Kathryn Chaloner and Isabella Verdinelli. Bayesian experimental design: A review. Statistical Science, 10(3):273 304, 1995. [12] Elizabeth G. Ryan, Christopher C. Drovandi, James M. Mcgree, and Anthony N. Pettitt. A review of modern computational algorithms for Bayesian optimal design. International Statistical Review, 84(1):128 154, 2016. [13] Adam Foster, Martin Jankowiak, Eli Bingham, Paul Horsfall, Yee Whye Teh, Tom Rainforth, and Noah Goodman. Variational Bayesian optimal experimental design. In 33rd Conference on Neural Information Processing Systems (Neur IPS 2019), Vancouver, Canada, 2019. [14] Steven Kleinegesse and Michael U. Gutmann. Efficient Bayesian experimental design for implicit models. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), Proceedings of Machine Learning Research, Naha, Okinawa, Japan, 2019. PMLR. [15] Joakim Beck, Ben Mansour Dia, Luis FR Espath, Quan Long, and Raul Tempone. Fast Bayesian experimental design: Laplace-based importance sampling for the expected information gain. Computer Methods in Applied Mechanics and Engineering, 334:523 553, 2018. [16] Adam Foster, Desi R. Ivanova, Ilyas Malik, and Tom Rainforth. Deep Adaptive Design: Amortizing sequential Bayesian experimental design. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning (ICML 2021), volume 139 of Proceedings of Machine Learning Research, pages 3384 3395. PMLR, 2021. [17] Tom Blau, Edwin V. Bonilla, Iadine Chades, and Amir Dezfouli. Optimizing sequential experimental design with deep reinforcement learning. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, Baltimore, Maryland, USA, 2022. PMLR. [18] Tom Rainforth, Adam Foster, Desi R Ivanova, and Freddie Bickford Smith. Modern bayesian experimental design. ar Xiv preprint ar Xiv:2302.14545, 2023. [19] D. Busby and M. Feraille. Adaptive design of experiments for calibration of complex simulators - An application to uncertainty quantification of a mature oil field. Journal of Physics: Conference Series, 135, 2008. [20] Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5 43, 2003. [21] Soumalya Sarkar, Sudeepta Mondal, Michael Joly, Matthew E. Lynch, Shaunak D. Bopardikar, Ranadip Acharya, and Paris Perdikaris. Multifidelity and multiscale Bayesian framework for high-dimensional engineering design and calibration. Journal of Mechanical Design, 141(12), 2019. [22] Rafael Oliveira, Lionel Ott, and Fabio Ramos. No-regret approximate inference via Bayesian optimisation. In 37th Conference on Uncertainty in Artificial Intelligence (UAI). PMLR, 2021. 
[23] Luigi Acerbi. Variational Bayesian Monte Carlo with noisy likelihoods. In H Larochelle, M Ranzato, R Hadsell, M F Balcan, and H Lin, editors, 34th Conference on Neural Information Processing Systems (Neur IPS 2020), volume 33, pages 8211 8222, 2020. [24] Marko J arvenp a a, Michael U. Gutmann, Aki Vehtari, and Pekka Marttinen. Parallel Gaussian process surrogate Bayesian inference with noisy likelihood evaluations. Bayesian Analysis, 2020. [25] Mucahit Cevik, Mehmet Ali Ergun, Natasha K Stout, Amy Trentham-Dietz, Mark Craven, and Oguzhan Alagoz. Using Active Learning for Speeding up Calibration in Simulation Models. Medical decision making: an international journal of the Society for Medical Decision Making, 36(5):581 593, 2016. [26] Yuan Tian, Manuel Arias Chao, Chetan Kulkarni, Kai Goebel, and Olga Fink. Real-time model calibration with deep reinforcement learning. Mechanical Systems and Signal Processing, 165 (July 2021):108284, 2022. [27] Andreas C. Damianou and Neil D. Lawrence. Deep Gaussian processes. Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), 31, 2013. [28] Andreas Krause, Ajit Singh, and Carlos Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9:235 284, 2008. [29] Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42: 427 486, 2011. [30] David Barber and Felix Agakov. The im algorithm: A variational approach to information maximization. In Proceedings of the 16th International Conference on Neural Information Processing Systems, NIPS 03, page 201 208, Cambridge, MA, USA, 2003. MIT Press. [31] Patrick Dallaire, Camille Besse, and Brahim Chaib-Draa. An approximate inference with Gaussian process to latent functions from uncertain data. Neurocomputing, 74:1945 1955, 2011. [32] Andreas C. Damianou, Michalis K Titsias, and Neil D Lawrence. Variational inference for latent variables and uncertain inputs in Gaussian processes. Journal of Machine Learning Research, 17(1):1 62, 2016. [33] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning (ICML 2015), volume 2, pages 1530 1538, Lille, France, 2015. [34] Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29, 2016. [35] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, 2017. [36] Chin Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. In 35th International Conference on Machine Learning (ICML 2018), volume 5, pages 3309 3324, Stockholm, Sweden, 2018. [37] Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch e-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, 2019. [38] Christina Winkler, Daniel Worrall, Emiel Hoogeboom, and Max Welling. 
Learning likelihoods with conditional normalizing flows. ar Xiv preprint ar Xiv:1912.00042, 2019. [39] Abdelrahman Abdelhamed, Marcus A Brubaker, and Michael S Brown. Noise flow: Noise modeling with conditional normalizing flows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3165 3173, 2019. [40] Manzil Zaheer, Satwik Kottur, Siamak Ravanbhakhsh, Barnab as P oczos, Ruslan Salakhutdinov, and Alexander J. Smola. Deep sets. In Advances in Neural Information Processing Systems, volume 30, pages 3392 3402, 2017. [41] Zolt an Szab o. Information theoretical estimators toolbox. Journal of Machine Learning Research, 15:283 287, 2014. [42] Scott Koermer, Justin Loda, Aaron Noble, and Robert B. Gramacy. Active Learning for Simulator Calibration. ar Xiv, 2023. URL http://arxiv.org/abs/2301.10228. [43] Annie Sauer, Robert B. Gramacy, and David Higdon. Active Learning for Deep Gaussian Process Surrogates. Technometrics, 65(1):1 39, 2022. ISSN 15372723. doi: 10.1080/00401706. 2021.2008505. URL https://doi.org/10.1080/00401706.2021.2008505. [44] Luigi Acerbi. Variational Bayesian Monte Carlo. In 32nd Conference on Neural Information Processing Systems (Neur IPS 2018), Montr eal, Canada, 2018. [45] Edward Snelson and Zoubin Ghahramani. Sparse gaussian processes using pseudo-inputs. In Y. Weiss, B. Sch olkopf, and J. Platt, editors, Advances in Neural Information Processing Systems, volume 18, 2005. [46] Xing Wang, Bing Wang, Joshua Pinskier, Yue Xie, James Brett, Richard Scalzo, and David Howard. Fin-bayes: A multi-objective bayesian optimization framework for soft robotic fingers. Soft Robotics, 2024. [47] Franc ois Faure, Christian Duriez, Herv e Delingette, J er emie Allard, Benjamin Gilles, St ephanie Marchesseau, Hugo Talbot, Hadrien Courtecuisse, Guillaume Bousquet, Igor Peterlik, and St ephane Cotin. SOFA: A Multi-Model Framework for Interactive Physical Simulation. In Yohan Payan, editor, Soft Tissue Biomechanical Modeling for Computer Assisted Surgery, volume 11 of Studies in Mechanobiology, Tissue Engineering and Biomaterials, pages 283 321. Springer, June 2012. URL https://inria.hal.science/hal-00681539. [48] Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), Clearwater Beach, Florida, USA, 2009. [49] Mauricio A Alvarez, Lorenzo Rosasco, Neil D Lawrence, et al. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195 266, 2012. [50] Eli Bingham, Jonathan P. Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D. Goodman. Pyro: Deep Universal Probabilistic Programming. Journal of Machine Learning Research, 2018. [51] Maximilian Balandat, Brian Karrer, Daniel R. Jiang, Samuel Daulton, Benjamin Letham, Andrew Gordon Wilson, and Eytan Bakshy. Bo Torch: A Framework for Efficient Monte-Carlo Bayesian Optimization. In Advances in Neural Information Processing Systems 33, 2020. URL http://arxiv.org/abs/1910.06403. [52] Matthew D. Hoffman and Andrew Gelman. The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1593 1623, 2014. [53] Michalis Titsias and Neil Lawrence. Bayesian Gaussian process latent variable model. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 844 851, 2010. 
[54] Vidhi Lalchand, Aditya Ravuri, and Neil D. Lawrence. Generalised GPLVM with stochastic variational inference. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 7841-7864. PMLR, 2022.

[55] James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI'13, pages 282-290, Arlington, Virginia, USA, 2013. AUAI Press.

## A Additional details on the experiments

For all experiments, we use conditional normalising flows as the variational model for BACON. Our implementation of BACON and of most of the baselines, except for VBMC (for which we used its authors' Python implementation at https://github.com/acerbilab/pyvbmc), is based on Pyro probabilistic programming models [50]. Gaussian process modelling code is based on BoTorch (https://botorch.org) [51]. The flow architecture is chosen for each synthetic-data problem by running hyper-parameter tuning with a simplified version of the problem. Most Gaussian process models are parameterised with Matérn kernels [2, Ch. 4] and constant or zero mean functions. Pyro's MCMC with its default no-U-turn (NUTS) sampler [52] was applied to obtain samples from $p(\theta^*|\mathcal{D}_{t-1})$ at each iteration $t$. KL divergences are computed from samples using a nearest-neighbours estimator implemented in the information theoretical estimators (ITE) package (https://bitbucket.org/szzoli/ite-in-python) [41].

### A.1 Synthetic GP problem

The GP prior was set with $\hat{k}$ given by a squared exponential kernel and $k_\varepsilon$ given by a Matérn kernel with smoothness parameter set to 2.5 [2]. The conditional normalising flow was configured with 2 layers of neural spline flows [37]. Batches of arbitrary size are used for conditioning via a permutation-invariant set encoder, similar to Blau et al. [17], with a 2-layer, 32-units-wide, fully-connected hyperbolic tangent neural network passing through a summation at the end. Gradient-based optimisation is run using Adam with a learning rate of $10^{-3}$ for the flow parameters and 0.05 for the simulation design points, both using cosine annealing with warm restarts as a learning rate scheduler. 256 samples were subsampled from the MCMC posterior to estimate expectations for both this and the location-finding problem.

**Algorithm with split training.** For the synthetic GP problem, we provide more detailed pseudocode of our algorithmic implementation using an option for training the conditional normalising flow and optimising the designs separately. Specifically, we applied MCMC to estimate our posteriors and had a flexible optimisation loop, where we had the option to separate the training of the conditional normalising flow model from the optimisation of the design points, as shown in Algorithm 2. This approach can make the algorithm more stable, though at the cost of a longer runtime.

**Algorithm 2: BACON (split training)**
- **Input:** $\mathcal{D}_0 := \{X_R, y_R\}$ {real data}
- **for** $t \in \{1, \dots, T\}$ **do**
  - $\mu_{t-1}, k_{t-1} \leftarrow \mathrm{UpdateGP}(\mathcal{D}_{t-1})$
  - $\{\theta_i^*\}_{i=1}^{S_A} \sim$ MCMC on $p(\theta^*|\mathcal{D}_{t-1}) \propto p(\theta^*)\, \mathcal{N}\big(y_R;\, \mu_{t-1}(Z_R(\theta^*); \theta^*),\, \Sigma_{t-1}(Z_R(\theta^*); \theta^*) + \sigma_\nu^2 I\big)$
  - $\hat{p}_{t-1} := \frac{1}{S_A}\sum_{i=1}^{S_A} \delta_{\theta_i^*}$
  - $q_t \leftarrow \mathrm{TrainFlow}(\hat{p}_{t-1}, \mathcal{D}_{t-1})$
  - $\{\hat{x}_{t,i}, \hat{\theta}_{t,i}\}_{i=1}^B \leftarrow \mathrm{OptimiseDesigns}(q_t, \hat{p}_{t-1}, \mathcal{D}_{t-1})$
  - $\hat{y}_{t,i} := h(\hat{x}_{t,i}, \hat{\theta}_{t,i})$, in parallel for $i \in \{1, \dots, B\}$ {run batch of simulations}
  - $\mathcal{D}_t := \mathcal{D}_{t-1} \cup \{\hat{x}_{t,i}, \hat{\theta}_{t,i}, \hat{y}_{t,i}\}_{i=1}^B$ {update GP dataset}
- **end for**
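The MCMC step above can be implemented with Pyro's NUTS sampler. The sketch below is illustrative only: it assumes a hypothetical helper `gp_log_marginal(theta)` returning the GP log-marginal likelihood $\log \mathcal{N}(y_R; \mu_{t-1}(Z_R(\theta); \theta), \Sigma_{t-1}(Z_R(\theta); \theta) + \sigma_\nu^2 I)$ and a standard normal prior, and it is not the exact model specification used in the released code.

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS

def calibration_model(gp_log_marginal, theta_dim=2):
    # Standard normal prior p(theta*) over the calibration parameters.
    theta = pyro.sample("theta", dist.Normal(torch.zeros(theta_dim), 1.0).to_event(1))
    # GP marginal likelihood of the real data given theta*, added as a log-factor.
    pyro.factor("gp_marginal_loglik", gp_log_marginal(theta))

def sample_calibration_posterior(gp_log_marginal, num_samples=4000, warmup_steps=500):
    kernel = NUTS(lambda: calibration_model(gp_log_marginal))
    mcmc = MCMC(kernel, num_samples=num_samples, warmup_steps=warmup_steps)
    mcmc.run()
    return mcmc.get_samples()["theta"]  # S_A samples forming the empirical estimate p_hat
```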
The split-training option was only applied to the GP-based synthetic experiments, while for the other experiments we ran the full joint optimisation over both the simulation inputs $(\hat{x}, \hat{\theta})$ and the variational parameters of the conditional model $q$. Algorithms 3 and 4 detail the TrainFlow and OptimiseDesigns subroutines used by Algorithm 2, where $\epsilon \in [0, 1]$ mixes the current posterior estimate with the prior $p$ and $\eta$ denotes a step size.

**Algorithm 3: TrainFlow**
- **Input:** $\hat{p}_t$, $\mathcal{D}_t$
- **for** $n \in \{1, \dots, N\}$ **do**
  - $\{\hat{\theta}_i\}_{i=1}^B \sim (1 - \epsilon)\,\hat{p}_t + \epsilon\, p$
  - $\{\hat{x}_i\}_{i=1}^B \sim \mathcal{U}(\mathcal{X})$
  - $\{\theta_i^*\}_{i=1}^S \sim \hat{p}_t$
  - $\{\hat{y}_{i,j}\}_{i,j=1}^{S,B} \sim \mathcal{N}\big(\mu_t(\{\hat{x}_i, \hat{\theta}_i\}_{i=1}^B; \{\theta_i^*\}_{i=1}^S),\, \Sigma_t(\{\hat{x}_i, \hat{\theta}_i\}_{i=1}^B; \{\theta_i^*\}_{i=1}^S)\big)$
  - $\phi \leftarrow \phi + \frac{\eta}{S}\sum_{i=1}^S \nabla_\phi \log q_\phi\big(\theta_i^* \mid \mathcal{D}_t \cup \{\hat{x}_j, \hat{\theta}_j, \hat{y}_{i,j}\}_{j=1}^B\big)$
- **end for**
- **Output:** $q_\phi$

**Algorithm 4: OptimiseDesigns**
- **Input:** $q_t$, $\hat{p}_{t-1}$, $\mathcal{D}_{t-1}$
- $\hat{\Theta} = \{\hat{\theta}_i\}_{i=1}^B \sim (1 - \epsilon)\,\hat{p}_{t-1} + \epsilon\, p$
- $\hat{X} = \{\hat{x}_i\}_{i=1}^B \sim \mathcal{U}(\mathcal{X})$
- **for** $n \in \{1, \dots, N\}$ **do**
  - $\{\theta_i^*\}_{i=1}^S \sim \hat{p}_{t-1}$
  - $\{\hat{y}_i\}_{i=1}^S \sim \mathcal{N}\big(\mu_{t-1}(\hat{X}, \hat{\Theta}; \{\theta_i^*\}_{i=1}^S),\, \Sigma_{t-1}(\hat{X}, \hat{\Theta}; \{\theta_i^*\}_{i=1}^S)\big)$
  - $(\hat{X}, \hat{\Theta}) \leftarrow (\hat{X}, \hat{\Theta}) + \frac{\eta}{S}\sum_{i=1}^S \nabla_{\hat{X}, \hat{\Theta}} \log q_t\big(\theta_i^* \mid \mathcal{D}_{t-1} \cup \{\hat{X}, \hat{\Theta}, \hat{y}_i\}\big)$
- **end for**
- **Output:** $\{\hat{x}_i, \hat{\theta}_i\}_{i=1}^B$

### A.2 Location finding problem

For this experiment we used more up-to-date Zuko (https://zuko.readthedocs.io/stable/) implementations of the conditional normalising flow models, which were again set as neural spline flows [37] combined with a set encoder to condition on arbitrary batch sizes. Further architectural details can be found in our code repository. 256 samples were subsampled from the MCMC posterior at each iteration to estimate expectations for the EIG lower bound computations. The simulation kernel $\hat{k}$ was a Matérn 2.5 kernel. For this experiment we did not model the error term, leaving it with a zero kernel, since data is generated directly from the simulator with no further error component, only Gaussian noise with a standard deviation of 0.5. Final KL estimates were computed using the maximum-a-posteriori hyper-parameters of the GP model learnt with the random search approach, to minimise biases in the estimate of $D_{\mathrm{KL}}(p_T \Vert p_0)$ due to differing GP hyper-parameters across baselines.

### A.3 Soft-robotics simulation problem

The prior for the calibration parameters $p(\theta^*)$ in this experiment consisted of a 2-dimensional standard normal transformed through a composition of a sigmoid and an affine transform, providing a smooth uniform distribution over a pre-specified range for the calibration parameters. Such a smooth approximation allows gradients to be computed near the edges of the parameter space while not allowing the optimisation to take the calibration parameter candidates outside the uniform prior boundaries, since these would be placed at infinity in the normalised space. The conditional normalising flow model used Zuko's implementation of neural spline flows with 10 transform layers. The set encoder consisted of a 2-layer, fully connected, 32-unit-wide neural network encoding each input into an 8-dimensional output, which was then summed and passed through as the context input to condition the flow. Adam was again used for optimisation, with a learning rate of 0.001 for the flow and 0.05 for the simulation inputs. Monte Carlo expectation estimates used 256 samples from the current MCMC posterior at each joint optimisation step.

### A.4 Hyper-parameter tuning

Besides the GP hyper-parameters (e.g., lengthscales, noise variance, etc.), which had to be tuned for the non-GP-based problems, there are optimisation settings (i.e., step sizes, scheduling rates, etc.), conditional density model hyper-parameters (i.e., the normalising flow architecture), and other algorithmic settings, e.g., the designs batch size $B$.
The latter is dependent on the available computing resources (e.g., the number of CPU cores or compute nodes for simulations in a high-performance computing system). We tuned optimisation settings and architectural parameters for the conditional normalising flows via Bayesian optimisation with short runs (e.g., 10-20 iterations) on the synthetic problem. However, depending on the number of parameters, a simpler approach, like grid search, might be enough. GP hyper-parameters were optimised online via maximum a posteriori estimation after each iteration's batch update. Further implementation details can be found in our code repository (https://github.com/csiro-funml/bacon).

**Figure 4** (panels: (a) reference posterior $p^*$; (b) BACON final posterior; (c) IMSPE baseline final posterior; (d) D-optimal baseline final posterior; (e) random baseline final posterior): Final posterior approximations $p(\theta^*|\mathcal{D}_T)$ and simulation parameter $\hat{\theta}$ choices (red crosses) by each method for the soft-robotics simulator calibration problem after one of the runs. The target/reference posterior (a) was inferred using a large number (1024) of simulations following a Latin hypercube pattern over the combined design space $\mathcal{X}$ and calibration parameter space $\Theta$, and a uniform prior $p(\theta)$ over the same range as the smooth uniform prior used by the algorithms. The posteriors are plotted as a 2D histogram over the normalised range (after an affine and sigmoid transform), which the algorithms used for optimisation. The KL divergences in Table 3 are computed with respect to this reference posterior. Also note that the simulation parameters $\hat{\theta}$ in the plot correspond to different algorithmic choices for design inputs $\hat{x}$, which are 9-dimensional variables not plotted here.

## B Extensions of the proposed approach

In the following, we present two extensions to deal with limitations of the current approach. Firstly, we can amortise inference over the calibration posterior by reusing the learnt conditional distribution models as priors, instead of having to run, for example, MCMC. Secondly, we present derivations for a scalable sparse-GP version of our method.

### B.1 Amortisation

We use a conditional variational distribution model for $q(\theta^*|\hat{y})$. The main advantage of training a conditional model is that, once new data $\hat{y}_t$ is observed, we readily obtain an approximation to the new posterior as $p(\theta^*|\mathcal{D}_t) = p(\theta^*|\hat{y}_t, \hat{x}_t, \hat{\theta}_t, \mathcal{D}_{t-1}) \approx q_t(\theta^*|\hat{y}_t)$. There is, therefore, potential to reuse the variational posterior as the prior for the next iteration, with all the optimisation concentrated within a single loop.

**Approximate objective.** We are still left with terms dependent on the posterior from the previous iteration, $p(\theta^*|\mathcal{D}_{t-1})$, in Eq. 15. Firstly, however, note that the denominator inside the expectation is constant w.r.t. the optimisation variables, not affecting the maximiser. Secondly, we may replace the joint predictive distribution $p(\hat{y}, \theta^*|\hat{x}, \hat{\theta}, \mathcal{D}_{t-1})$ by an approximation using the previous optimal variational posterior $q_{t-1}$ as:
$$
p(\hat{y}, \theta^*|\hat{x}, \hat{\theta}, \mathcal{D}_{t-1}) \approx q_{t-1}(\hat{y}, \theta^*|\hat{x}, \hat{\theta}) := p(\hat{y}|\theta^*, \hat{x}, \hat{\theta}, \mathcal{D}_{t-1})\, q_{t-1}(\theta^*), \qquad (23)
$$
where $q_{t-1}(\theta^*) := q_{t-1}(\theta^*|\hat{y}_{t-1}) \approx p(\theta^*|\mathcal{D}_{t-1})$. The following objective then approximately shares the same set of maximisers as the variational lower bound $\widehat{\mathrm{EIG}}_t(\hat{x}, \hat{\theta}, q)$:
$$
\hat{x}_t, \hat{\theta}_t, q_t \in \operatorname*{argmax}_{\hat{x} \in \mathcal{X}, \hat{\theta} \in \Theta, q \in \mathcal{Q}} \mathbb{E}_{q_{t-1}(\hat{y}, \theta^*|\hat{x}, \hat{\theta})}[\log q(\theta^*|\hat{y})]. \qquad (24)
$$
In practice, reusing the variational conditional posterior may tend to degenerate the approximation over time.
B.2 Conditional sparse models for large datasets

Computing the variational EIG requires evaluating expectations with respect to the posterior predictive distribution p(ŷ|θ*, x̂, θ̂, D_t). Note, however, that, as θ* appears inside a matrix inversion in the GP predictive (Eq. 8), each sample of p(ŷ|θ*, x̂, θ̂, D_t) incurs an O(N_t³) computational cost, where N_t := R + t is the number of data points at iteration t ∈ N. This cost may quickly become prohibitive for reasonably large datasets, which are easily obtainable in batch settings (Sec. 5.4), rendering EIG computations infeasible. To scale our method to large amounts of data, we therefore need GP models that reduce this computational complexity while still allowing us to obtain reasonable EIG estimates.

B.2.1 Variational sparse GP approximation

We consider an augmentation of the original GP model which allows us to sparsify its covariance matrix, reducing the computational complexity of GP predictions. Following the variational sparse GP approach [48], let u := f̂(Z_u) ∈ R^M denote a vector of M inducing variables representing unknown function values at a given set of pseudo-inputs Z_u. The joint distribution between the observations y, the function values f̂ := f̂(Z(θ*)), the inducing variables u and the unknown parameters θ* can be written as:

p(y, f̂, u, θ*) = p(y, f̂, u | θ*) p(θ*) = p(y | f̂) p(f̂ | u, θ*) p(u) p(θ*) ,   (25)

where

p(y | f̂) = N(y; f̂, Σ_y),  p(f̂ | u, θ*) = N(f̂; K_f̂u(θ*) K_uu^{-1} u, K_f̂f̂(θ*) − K_f̂u(θ*) K_uu^{-1} K_uf̂(θ*)) ,   (26)

and p(u) = N(u; 0, K_uu), using the notation shortcuts K_uu := k(Z_u, Z_u), K_f̂u(θ*) := k(Z(θ*), Z_u), and K_f̂f̂(θ*) := k(Z(θ*), Z(θ*)). We may now formulate an evidence lower bound (ELBO) based on the joint variational density q(f̂, u, θ*) as:

log p(y) = E_{q(f̂, u, θ*)}[log (p(y, f̂, u, θ*) / q(f̂, u, θ*))] + D_KL(q(f̂, u, θ*) || p(f̂, u, θ* | y)) ≥ E_{q(f̂, u, θ*)}[log (p(y, f̂, u, θ*) / q(f̂, u, θ*))] .   (27)

Since D_KL(q(f̂, u, θ*) || p(f̂, u, θ* | y)) ≥ 0, with equality if and only if q(f̂, u, θ*) = p(f̂, u, θ* | y), maximising the ELBO above w.r.t. q provides us with an approximation to the joint posterior. Choosing q(f̂, u, θ*) := p(f̂ | u, θ*) q(u, θ*) simplifies the ELBO to [53]:

log p(y) ≥ E_{q(f̂, u, θ*)}[log (p(y | f̂) p(u) p(θ*) / q(u, θ*))] .   (28)

Sparse variational GP approaches can reduce the computational complexity of Bayesian inference on GPs to O(NM²) or even O(M³) [48, 54], where N is the number of data points.
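To make the cost saving concrete, a minimal sketch of the sparse conditional in Eq. 26, marginalised over a Gaussian q(u), is given below; the kernel matrices are assumed to be precomputed for a given θ*, and only the marginal (diagonal) predictive variances are returned, so the cost is O(NM²) rather than O(N³).

```python
import torch

def sparse_conditional(Kuu, Kfu, kff_diag, m_u, S_u, jitter=1e-6):
    """Mean/variance of f_hat under q(f_hat) = integral p(f_hat|u) q(u) du,
    with p(f_hat|u) as in Eq. (26) and q(u) = N(m_u, S_u).

    Kuu: (M, M) inducing covariance, Kfu: (N, M) theta-dependent cross-covariance,
    kff_diag: (N,) prior variances. All inputs are placeholder kernel evaluations.
    """
    M = Kuu.shape[0]
    L = torch.linalg.cholesky(Kuu + jitter * torch.eye(M, dtype=Kuu.dtype))
    A = torch.cholesky_solve(Kfu.T, L)                        # Kuu^{-1} Kuf, (M, N)
    mean = (Kfu @ torch.cholesky_solve(m_u.unsqueeze(-1), L)).squeeze(-1)
    # Nystrom correction plus propagated inducing covariance (diagonal only).
    qff_diag = (Kfu * A.T).sum(-1)                            # diag(Kfu Kuu^{-1} Kuf)
    var = kff_diag - qff_diag + ((A.T @ S_u) * A.T).sum(-1)
    return mean, var
```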
B.2.2 Structure of the joint variational posterior

If we were to take a mean-field approach, setting q(u, θ*) := q(u) q(θ*), the ELBO above would simplify further, leading to a few computational advantages, as explored by Bayesian GP-LVM methods [53, 32, 54]. However, in our experimental design context, this approach leads to a few issues. Firstly, using the mean-field posterior as a replacement for our joint posterior breaks the dependence between ŷ and θ*, so that their mutual information (a.k.a. the EIG) becomes zero regardless of the design inputs x̂ and θ̂. Secondly, although u and θ* are independent according to their priors (Eq. 25), they become dependent when conditioned on the data. In fact, the true posterior over u given the data and the true parameters θ* is exactly Gaussian:

p(u | D_t, θ*) = N(u; µ_t(Z_u; θ*), k_t(Z_u, Z_u; θ*)) ,   (29)

where µ_t(·; θ*) and k_t(·, ·; θ*) are given by Eq. 9 and Eq. 10, respectively. Note, however, that the posterior over θ* will not, in general, be Gaussian for a non-linear kernel k. Therefore, it makes more sense for us to model q(u, θ*) := q(u|θ*) q(θ*). Moreover, learning a Gaussian conditional model over u and a flexible variational distribution over θ* should be enough to allow us to recover the true posterior, since p(u, θ* | D_t) = p(u | D_t, θ*) p(θ* | D_t).

Optimal variational inducing-point distribution. Given θ* ∈ Θ, we have a standard sparse GP model. The optimal variational inducing-point distribution is then available in closed form following standard results [48] as:

q*(u | θ*) = N(u; µ_u(θ*), Σ_u(θ*)) ,   (30)

where the distribution parameters are:

µ_u(θ) := K_uu (K_uu + Ψ_2(θ))^{-1} Ψ_1(θ)^T y   (31)
Σ_u(θ) := K_uu (K_uu + Ψ_2(θ))^{-1} K_uu ,   (32)

and the conditional Ψ matrices are given by:

Ψ_1(θ) := Σ_y^{-1} K_f̂u(θ)   (33)
Ψ_2(θ) := K_uf̂(θ) Σ_y^{-1} K_f̂u(θ) ,   (34)

for θ ∈ Θ. The computational cost of sampling predictions with this model then reduces from O(N³) to O(NM²).

Parametric variational inducing distribution. To further reduce the computational cost of predictions, we may accept a sub-optimal conditional variational inducing-point distribution given by a parametric model:

q_ζ(u | θ*) := N(u; m_ζ(θ*), Σ_ζ(θ*)) ,   (35)

following the architecture in Sec. 5.3. This formulation allows us to approximate the evidence lower bound in Eq. 28 w.r.t. q(u|θ*) via mini-batching [see 55]. To do so, we approximate the f̂_i := f̂(z_i) via conditionally independent samples given u, for i ∈ {1, . . . , N}. As a result, the data-dependent term in Eq. 28 decomposes as a sum which is amenable to mini-batching:

E_{q_ζ(f̂, u|θ*)}[log p(y | f̂)] ≈ Σ_{i=1}^{N} E_{q_ζ(f̂_i, u|θ*)}[log p(y_i | f̂_i)] ,   (36)

where q_ζ(f̂_i, u | θ*) = p(f̂_i | u, θ*) q_ζ(u | θ*). The variational parameters ζ then need to be optimised within a second optimisation loop, after the data update in Algorithm 1, w.r.t.:

ℓ_t(ζ) := E_{q_t(θ*)}[ Σ_{i=1}^{N_t} E_{q_ζ(f̂(z_i), u|θ*)}[log p(y_i | f̂(z_i))] ] − E_{q_t(θ*)}[D_KL(q_ζ(u | θ*) || p(u))] .   (37)

Although the GP update is no longer available in closed form, we gain computational efficiency for large volumes of data. Applying mini-batches of size L ≪ N to Eq. 37 results in a computational cost of O(LM²) (or O(M³), if M > L), which is smaller than the O(NM²) cost of the optimal variational distribution q*(u | θ*).
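For illustration, a direct implementation of the closed-form update in Eqs. 30 to 34 could look as follows, assuming an isotropic noise covariance Σ_y = σ²_ν I; this is a sketch of the standard result rather than the exact code used in our experiments.

```python
import torch

def optimal_inducing_distribution(Kuu, Kfu, y, noise_var, jitter=1e-6):
    """Closed-form q*(u|theta) = N(mu_u, Sigma_u) from Eqs. (30)-(34),
    assuming Sigma_y = noise_var * I (isotropic Gaussian noise).

    Kuu: (M, M) inducing covariance, Kfu: (N, M) = K_{f u}(theta), y: (N,) observations.
    """
    M = Kuu.shape[0]
    eye = torch.eye(M, dtype=Kuu.dtype)
    Psi1 = Kfu / noise_var                 # Eq. (33): Sigma_y^{-1} K_{f u}(theta)
    Psi2 = Kfu.T @ Kfu / noise_var         # Eq. (34): K_{u f} Sigma_y^{-1} K_{f u}
    L = torch.linalg.cholesky(Kuu + Psi2 + jitter * eye)
    # Solve (Kuu + Psi2)^{-1} [Psi1^T y, Kuu] in one Cholesky solve.
    rhs = torch.cat([(Psi1.T @ y).unsqueeze(-1), Kuu], dim=-1)
    sol = torch.cholesky_solve(rhs, L)
    mu_u = Kuu @ sol[:, 0]                 # Eq. (31)
    Sigma_u = Kuu @ sol[:, 1:]             # Eq. (32)
    return mu_u, Sigma_u
```

In practice, the θ-dependence enters only through K_f̂u(θ), so the same routine can be reused inside the conditional model q(u|θ*) for different calibration parameter samples.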
C Further discussion on limitations

High-dimensional settings. The dimensionality of our search space is the combined dimensionality of the design space X and the calibration parameter space Θ, which can be large in practical applications. In general, in higher dimensions, one should expect the algorithm to require a larger number of iterations to find suitable posterior approximations, due to the possible increase in complexity of the posterior. The analysis of such complexity, however, is problem-dependent and outside the scope of this work. In addition, note that the per-iteration runtime is not directly affected, since the cost of inference is dominated by sampling from the GP, whose runtime complexity is cubic in the number of data points due to a matrix inversion operation, while being only linear in the dimensionality.

Gaussian assumptions. We make Gaussian assumptions when modelling the simulator and the approximation errors, which can be seen as restrictive for some applications. However, if the errors are sub-Gaussian (i.e., their tail probabilities decay at least as fast as those of a Gaussian), as is the case for bounded errors, we conjecture that a GP model can still be a suitable surrogate, as it would not underestimate the error uncertainty. If the error function is instead sampled from some form of heavy-tailed stochastic process (e.g., a Student-t process), the GP would tend to underestimate uncertainty, leading to possibly optimistic EIG estimates that make the algorithm under-explore the search space. Changing from a GP model to another type of stochastic process model that can capture heavier tails would be possible, though it would require significant changes to the algorithm's predictive equations. We believe, however, that most real-world cases would present errors which are at least bounded (and therefore sub-Gaussian) with respect to the simulations.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: Experimental results confirm the main claims in the introduction and abstract.
Guidelines:
The answer NA means that the abstract and introduction do not include the claims made in the paper.
The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: Limitations are discussed in the conclusion section and in Appendix C.
Guidelines:
The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
The authors are encouraged to create a separate "Limitations" section in their paper.
The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper.
The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [NA]
Justification: Theoretical results have not been derived for this paper.
Guidelines:
The answer NA means that the paper does not include theoretical results.
All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
All assumptions should be clearly stated or referenced in the statement of any theorems.
The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: Code is included, though soft-robotics experiment data is protected.
Guidelines:
The answer NA means that the paper does not include experiments.
If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example
(a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
(b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
(c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
(d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: Code (included with submission) will be made public for most of the results, except soft-robotics data, which is subject to internal restrictions.
Guidelines:
The answer NA means that the paper does not include experiments requiring code.
Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
While we encourage the release of code and data, we understand that this might not be possible, so "No" is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: Main details are provided, and the code has been made publicly available.
Guidelines:
The answer NA means that the paper does not include experiments.
The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: Standard deviations are reported with every performance plot and table.
Guidelines:
The answer NA means that the paper does not include experiments.
The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)
The assumptions made should be given (e.g., Normally distributed errors).
It should be clear whether the error bar is the standard deviation or the standard error of the mean.
It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [No]
Justification: Full experiment details will be provided for the camera-ready version.
Guidelines:
The answer NA means that the paper does not include experiments.
The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.
The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: No data subject to the NeurIPS Code of Ethics has been used in this work.
Guidelines:
The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [No]
Justification: This work is of a theoretical nature, introducing new methods for a general class of applications, potentially in science and engineering.
Guidelines:
The answer NA means that there is no societal impact of the work performed.
If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [No]
Justification: There are currently no plans to release any dataset other than synthetic data.
Guidelines:
The answer NA means that the paper poses no such risks.
Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: This project is mostly based on open-source code, which is credited accordingly.
Guidelines:
The answer NA means that the paper does not use existing assets.
The authors should cite the original paper that produced the code package or dataset.
The authors should state which version of the asset is used and, if possible, include a URL.
The name of the license (e.g., CC-BY 4.0) should be included for each asset.
For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
If this information is not available online, the authors are encouraged to reach out to the asset's creators.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [No]
Justification: Except for code to reproduce experiments, no new assets are introduced.
Guidelines:
The answer NA means that the paper does not release new assets.
Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
The paper should discuss whether and how consent was obtained from people whose asset is used.
At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: This paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: This paper does not involve research with human subjects.
Guidelines:
The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.