# bayesian_experimental_design_via_contrastive_diffusions__63596600.pdf

Published as a conference paper at ICLR 2025

BAYESIAN EXPERIMENTAL DESIGN VIA CONTRASTIVE DIFFUSIONS

Jacopo Iollo1,2,3 Christophe Heinkelé2 Pierre Alliez3 Florence Forbes1

1: Université Grenoble Alpes, Inria, CNRS, G-INP, France, name.surname@inria.fr 2: Cerema, Endsum-Strasbourg, France, christophe.heinkele@cerema.fr 3: Université Côte d Azur, Inria, France, name.surname@inria.fr

Bayesian Optimal Experimental Design (BOED) is a powerful tool to reduce the cost of running a sequence of experiments. When based on the Expected Information Gain (EIG), design optimization corresponds to the maximization of some intractable expected contrast between prior and posterior distributions. Scaling this maximization to high dimensional and complex settings has been an issue due to BOED inherent computational complexity. In this work, we introduce a pooled posterior distribution with cost-effective sampling properties and provide a tractable access to the EIG contrast maximization via a new EIG gradient expression. Diffusion-based samplers are used to compute the dynamics of the pooled posterior and ideas from bi-level optimization are leveraged to derive an efficient joint sampling-optimization loop. The resulting efficiency gain allows to extend BOED to the well-tested generative capabilities of diffusion models. By incorporating generative models into the BOED framework, we expand its scope and its use in scenarios that were previously impractical. Numerical experiments and comparison with state-of-the-art methods show the potential of the approach.

1 INTRODUCTION

Designing optimal experiments can be critical in numerous applied contexts where experiments are constrained in terms of resources or more generally costly and limited. In this work, design is assumed to be characterized by some continuous parameters ξ E Rd, which refers to the experimental part, such as the choice of a measurement location, that can be controlled to optimize the experimental outcome. We consider a Bayesian setting in which the parameters of interest is θ Θ Rm and design is optimized to maximize the information gain on θ. Bayesian optimal experimental design (BOED) is not a new topic in statistics, see e.g. Chaloner and Verdinelli (1995); Sebastiani and Wynn (2000); Amzal et al. (2006) but has recently gained new interest with the use of machine learning techniques, see Rainforth et al. (2024); Huan et al. (2024) for recent reviews. The most common approach consists of maximizing the so-called expected information gain (EIG), which is a mutual information criterion that accounts for information via the Shannon s entropy. Let p(θ) denote a prior probability distribution and p(y|θ, ξ) a likelihood defining the observation y Y generating process. The prior is assumed to be independent on ξ and p(y|θ, ξ) available in closed-form. To our knowledge, all previous BOED approaches also assume that the prior is available in closed-form, a setting that we refer to as density-based BOED. In this work, by making BOED more computationally efficient, we open the first access to diffusion-based generative models and introduce data-based BOED when the prior is only available through samples. This broadens the scope of problems that can be tackled to a wide range of inverse problems (Daras et al., 2024).

The EIG, denoted below by I, admits several equivalent expressions, see e.g. Foster et al. (2019). It can be written as the expected loss in entropy when accounting for an observation y at ξ (eq. (1)) or as a mutual information (MI) or expected Kullback-Leibler (KL) divergence (eq. (2)). Denoting pξ(θ, y) = p(θ, y|ξ) the joint distribution of (θ, Y ) and using p(θ, y|ξ) = p(θ|y, ξ)p(y|ξ) = p(y|θ, ξ)p(θ), it comes, I(ξ) = Ep(y|ξ)[H(p(θ)) H(p(θ|Y , ξ)] (1)

= Ep(y|ξ) [KL(p(θ|Y , ξ), p(θ))] = MI(pξ) , (2)

Published as a conference paper at ICLR 2025

where random variables are indicated with uppercase letters, Ep( )[ ] or Ep[ ] denotes the expectation with respect to p and H(p(θ)) = Ep(θ)[log p(θ)] is the entropy of p. The joint distribution pξ completely determines all other distributions, marginal (prior) and conditional (posterior) distributions, so that the mutual information, which is the KL between the joint and the product of its marginal distributions, can be written as a function of pξ P(Θ Y) only. In the following P(Θ Y), resp. P(Θ), resp. P(Y), denotes the set of probability measures on Θ Y, resp. Θ, resp. Y.

In BOED, we look for ξ satisfying

ξ arg max ξ Rd I(ξ) = arg max ξ Rd MI(pξ) . (3)

The above optimization is usually referred to as static design optimization. The main challenge in EIG-based BOED is that both the EIG and its gradient with respect to ξ are doubly intractable. Their respective expressions involve an expectation of an intractable integrand over a posterior distribution which is itself not straightforward to sample from. The posterior distribution is generally only accessible through an iterative algorithm providing approximate samples. In practice, the inference problem is further complicated as design optimization is considered in a sequential context, in which a series of experiments is planned sequentially and each successive design has to be accounted for. In order to remove the integrand intractability issue, solutions have been proposed which optimize an EIG lower bound (Foster et al., 2019). This lower bound can be expressed as an expectation of a tractable integrand and becomes tight with increased simulation budgets. The remaining posterior sampling issue has then been solved in different ways. A set of approaches consists of approximating the problematic posterior distribution, either with variational techniques (Foster et al., 2019) or with efficient sequential Monte Carlo (SMC) sampling (Iollo et al., 2024; Drovandi et al., 2013). Other approaches avoid posterior estimation, using reinforcement learning (RL) and off-line policy learning to bypass the need for sampling (Foster et al., 2021; Ivanova et al., 2021; Blau et al., 2022). However, some studies have shown that estimating the posterior was beneficial, e.g. Iollo et al. (2024) and Ivanova et al. (2024), which improves on Foster et al. (2021) by introducing posterior estimation steps in order to refine the learned policy. In addition, one should keep in mind that posterior inference is central in BOED as the ultimate goal is not design per se but to gain information on the parameter of interest. This is challenging especially in a sequential context. Previous attempts that provide both candidate design and estimates of the posterior distribution, such as Foster et al. (2019); Iollo et al. (2024), are thus essentially in 2 alternating stages, approximate design optimization being dependent on approximate posterior sampling and vice-versa.

In this work, we propose a novel 1-stage approach which leverages a sampling-as-optimization setting (Korba and Salim, 2022; Marion et al., 2025) where sampling is seen as an optimization task over the space of probability distributions. We introduce a new EIG gradient expression (Section 3), which highlights the EIG gradient as a function of both the design and some sampling outcome. This fits into a bi-level optimization framework adapted to BOED in Section 4. So doing, at each step, both an estimation of the optimal design and samples from the current posterior distribution can be provided in a single loop described in Section 5. It results an efficient procedure that can handle both traditional density-based samplers and data-based samplers such as provided by the highly successful diffusion-based generative models. The resulting efficiency gain enables BOED applications at significantly larger scales than previously feasible, including inpainting problems ranging from computer vision to protein engineering and MRI (Quan et al., 2024; Yang et al., 2019; Aali et al., 2023). Figure 1 is an illustration on a 28 28 image θ reconstruction problem from 7 7 sub-images centered at locations ξ to be selected, details in Section 6. For simpler notation, we first present our approach in the static design case. Adaptation to the sequential case is specified in Section 6 and all numerical examples are in the sequential setting.

Figure 1: 28 28 Image θ (1st column) reconstruction from seven 7 7 sub-images y = Aξθ + η centered at seven central pixels ξ (designs) selected sequentially. Optimized vs. random designs: measured outcome y (2nd vs. 3rd column) and parameter θ estimates (reconstruction) with highest weights (upper vs. lower sub-row).

Published as a conference paper at ICLR 2025

2 RELATED WORK

We focus on gradient-based BOED for continuous problems. Applying a first-order method to solve (3) requires computing gradients of the EIG I, which are no more tractable than I itself. Gradient-based BOED is generally based on stochastic gradient-type algorithms (see Section 4.3.2. in Huan et al. (2024)). This requires in principle unbiased gradient estimators, although stochastic approximation solutions using biased oracles have also been investigated, see e.g. Demidovich et al. (2023); Liu and Tajbakhsh (2024). To meet this requirement, most stochastic gradient-based approaches start from an EIG lower bound that yields tractable unbiased gradient estimators. More specifically, EIG lower bounds have usually the advantage to remove the nested expectation issue, see e.g. Foster et al. (2019). In contrast, very few approaches focus on direct EIG gradient estimators. To our knowledge, this is only the case in Goda et al. (2022) and Ao and Li (2024). Goda et al. (2022) propose an unbiased estimator of the EIG gradient using a randomized version of a multilevel nested Monte Carlo (MLMC) estimator from Rhee and Glynn (2015). A different estimator is proposed by Ao and Li (2024), who use MCMC samplers leading to biased estimators, for which the authors show empirically that the bias could be made negligible. In this work, we first show, in Section 3, that their two apparently different solutions actually only differ in the way the intractable posterior distribution is approximated. We then propose a third way to compute EIG gradients that is more computationally efficient and scales better to larger data volumes and sequential design contexts. This new expression makes use of a distribution that we introduce and name the pooled posterior distribution. This latter distribution has interesting sampling features that allow us to leverage score-based sampling techniques and connect to the so-called implicit diffusion framework of Marion et al. (2025). Our single loop procedure in Section 5 is inspired by Marion et al. (2025) and other recent developments in bi-level optimization (Yang et al., 2021; Dagréou et al., 2022; Hong et al., 2023). However, these latter settings do not cover doubly intractable objectives such as the EIG, which requires both appropriate gradient estimators and sampling operators, see our Sections 3 and 4. In BOED, efficient single loop procedures have been proposed by Foster et al. (2020) but they rely heavily on variational approximations, which may limit accuracy in scenarios with complex posterior distributions.

3 POOLED-POSTERIOR ESTIMATION OF THE EIG GRADIENT

Efficient EIG gradient estimators are central for accurate scalable BOED. Gradients derived from the reparameterization trick are often preferred, over the ones obtained with score-based techniques, as they have been reported to exhibit lower variance (Xu et al., 2019).

EIG gradient via a reparametrization trick. Assuming p(y|θ, ξ) is such that Y can be rewritten as Y = Tξ,θ(U) with Tξ,θ invertible so that U = T 1 ξ,θ(Y ) and U is a random variable independent on θ and ξ with a tractable distribution p U(U). The existence of Tξ,θ is straightforward if the direct model corresponds to an additive Gaussian noise as the transformation is then linear in U. Results exist to guarantee the existence of such a transformation in more general situations (Papamakarios et al., 2021). Using this change of variable, two expressions of the EIG gradient, (5) and (6) below, can be derived. Detailed steps are given in Appendix A. With pξ denoting the joint distribution p(θ, y|ξ), g a quantity related to the score g(ξ, y, θ, θ ) = ξ log p(Tξ,θ(u)|θ , ξ)|u=T 1 ξ,θ(y) and

denoting h(ξ, y, θ, θ ) = ξp(Tξ,θ(u)|θ , ξ)|u=T 1 ξ,θ(y), a first expression is

g(ξ, Y , θ, θ) Ep(θ ) h(ξ, Y , θ, θ )

Ep(θ ) p(Y |θ , ξ)

Considering importance sampling formulations for the second term of (4), with an importance distribution q P(Θ), potentially depending on y, θ and ξ, further leads to

ξI(ξ) = Epξ

g(ξ, Y , θ, θ) Eq(θ |Y ,θ,ξ) h p(θ ) q(θ |Y ,θ,ξ) h(ξ, Y , θ, θ ) i

Eq(θ |Y ,θ,ξ) h p(θ ) q(θ |Y ,θ,ξ) p(Y |θ , ξ) i

In Goda et al. (2022), this latter expression is used in a randomized MLMC procedure with q set to a Laplace approximation of the posterior distribution, without justification for this specific choice of q.

Published as a conference paper at ICLR 2025

It results an estimator which is not unbiased but can be de-biased following Rhee and Glynn (2015). Alternatively, a second expression of the EIG gradient is the starting point of Ao and Li (2024),

ξI(ξ) =Epξ g(ξ, Y , θ, θ) Ep(θ |Y ,ξ) g(ξ, Y , θ, θ ) . (6)

It follows a nested Monte Carlo estimator (30) given in Appendix A, using samples {(yi, θi)}i=1:N from the joint pξ and for each yi, samples {θ i,j}j=1:M from an MCMC procedure approximating the intractable posterior p(θ |yi, ξ). Interestingly, expression (6) can also be recovered by setting the importance proposal q(θ |y, θ, ξ) to p(θ |y, ξ) in (5), which provides a clear justification of why the choice of q made in Goda et al. (2022) is relevant. Approaches by Goda et al. (2022) and Ao and Li (2024) thus mainly differ in their choice of approximations for the posterior distribution. Using a Laplace approximation as in Goda et al. (2022) is relevant only if the posterior is unimodal, which may not be the case in practice. The MCMC version of Ao and Li (2024) is then potentially more general but also more costly as it requires running N times a MCMC sampler, targeting each time a different posterior p(θ|yi, ξ). In the next paragraph, we introduce the pooled posterior distribution and derive another, more computationally efficient, gradient expression.

Importance sampling EIG gradient estimator with a pooled posterior proposal. In their work, Ao and Li (2024) consider only static design, which hides the fact that for more realistic sequential design contexts, their solution is not tractable due to its computational complexity. Their solution faces the standard issue of nested estimation (Rainforth et al., 2018). To avoid this issue we propose to use an importance sampling expression for the second term in (6), which has the advantage to move the dependence on y (and θ) from the sampling part to the integrand part. We consider a proposal distribution q P(θ) that does not depend on Y nor θ. It comes,

ξI(ξ) = Epξ

g(ξ, Y , θ, θ) Eq(θ |ξ)

p(θ |Y , ξ)

q(θ |ξ) g(ξ, Y , θ, θ ) , (7)

and an approximate gradient can be obtained as

g(ξ, yi, θi, θi) Eq(θ |ξ)

p(θ |yi, ξ)

q(θ |ξ) g(ξ, yi, θi, θ ) . (8)

The second term in (8) still requires N importance sampling approximations whose quality depends on the choice of the proposal distribution q. The ideal proposal q is easy to simulate, with computable weights at least up to a constant, and so that q and the multiple target distributions p( |yi, ξ) are not too far apart. Given N samples {(θi, yi)}i=1:N from pξ, we propose thus to take q = qξ,N where qξ,N is the following logarithmic pooling or geometric mixture, with PN i=1 νi = 1,

i=1 p(θ|yi, ξ)νi p(θ)

i=1 p(yi|θ, ξ)νi . (9)

We refer to qξ,N as the pooled posterior distribution, defined in a more general way in (13). It allows to assess the effect of a candidate design ξ on samples from the prior. It differs from a standard posterior as no real data y obtained by running the experiment ξ is available during the optimization. We only have access to samples {(θi, yi)}i=1:N from the joint pξ. The pooled posterior can be seen as a distribution that takes into account all possible outcomes of a candidate experiment ξ given the samples {θi}i=1:N from the prior. This choice of qξ,N is justified in Appendix B, using Lemma 2, proved therein. Lemma 2 shows that, for PN i=1νi = 1, qξ,N is the distribution q that minimizes the weighted sum of the KLs against each posterior p(θ|yi, ξ), i.e. PN i=1νi KL(q, p(θ|yi, ξ)), leading to an efficient importance sampling proposal. It follows our new gradient estimator,

g(ξ, yi, θi, θi) 1

j=1 wi,j g(ξ, yi, θi, θ j)

where {(θi, yi)}i=1:N follow pξ, {θ i}j=1:M follow qξ,N and wi,j = p(θ j|yi,ξ) qξ,N(θ j) denotes the importance sampling weight. When this fraction can only be evaluated up to a constant, we consider

Published as a conference paper at ICLR 2025

self normalized importance sampling (SNIS) using p, qξ,N the unnormalized versions of p and qξ,N,

wi,j = p(θ j|yi, ξ) qξ,N(θ j) = p(yi|θ j, ξ) QN ℓ=1 p(yℓ|θ j, ξ)νℓ and wi,j = wi,j PM j=1 wi,j . (11)

Although with a reduced computational cost, computing gradients with (10) still requires an iterative sampling algorithm ideally run for a large number of iterations to reach satisfying approximations of the joint pξ and the pooled posterior qξ,N. In static design, sampling from the joint is not generally difficult as the prior and the likelihood are assumed available but this becomes problematic in sequential design, as further detailed in Section 6.1. Sequential design is the setting to be kept in mind in this paper and in practice, the exact distributions are rarely reached. To assess the impact on gradient approximations, it is convenient to introduce, as in Marion et al. (2025), gradient operators. In the next section, we show how to adapt the formalism of Marion et al. (2025) to our BOED task.

4 EIG OPTIMIZATION THROUGH SAMPLING

To maximize the EIG using its gradient estimator (10), samples are needed from both the joint distribution pξ and the pooled posterior proposal qξ,N. If handled naively, it results a computationally expensive nested sampling-optimization loop where new samples from both distributions need to be generated at every update of the design parameter ξ. To derive more efficient procedures, we propose to adapt to BOED the framework of Marion et al. (2025) that integrates sampling and optimization into a single bi-level optimization loop. To do so, the EIG gradient ξI(ξ) has first to be expressed as a function Γ of three key components: the joint distribution pξ, the proposal distribution q, and the design parameter ξ itself. Our choice of the pooled posterior as proposal distribution q is then justified for its interesting sampling properties and the concept of sampling operator of Marion et al. (2025) is generalized to efficiently generate the samples needed to estimate our EIG gradient via Γ.

Estimation of gradients through sampling. Denote by Γ a function from P(Θ Y) P(Θ) Rd to Rd, defined as,

Γ(p, q, ξ) = Ep

g(ξ, Y , θ, θ) Eq

q(θ ) g(ξ, Y , θ, θ ) . (12)

Expression (7) shows that ξI(ξ) = Γ(pξ, q, ξ), where q is a distribution q(θ |ξ) on θ possibly depending on ξ. The gradient estimator (10) corresponds then to ξI(ξ) Γ(ˆpξ, ˆqξ,N, ξ), where ˆpξ = PN i=1 δ(θi,yi) and ˆqξ,N = PM j=1 δθ j. In general, sampling from pξ, or its sequential counterpart, and qξ,N is challenging and only possible through an iterative procedure. However, an interesting feature of our pooled posterior is that it does not add additional sampling difficulties.

Pooled posterior distribution. More generally (details in Appendix B), we define,

qξ,ρ(θ) exp (Eρ[log p(θ|Y , ξ)]) (13)

where ρ is a measure on Y. When ρ(y) = PN i=1νiδyi(y) with PN i=1νi = 1, we recover qξ,ρ(θ) = qξ,N(θ) in (9). The special structure of the pooled posterior allows to sample from it using the same algorithmic structure to sample from a single posterior p(θ|y, ξ). Indeed, the score of qξ,ρ(θ) is linked to the posterior score,

θ log qξ,ρ(θ) = Eρ[ θ log p(θ|Y , ξ)] , (14)

which for qξ,N simplifies into PN i=1 νi θ log p(θ|yi, ξ). In practice, we consider the operation of sampling as the output of a stochastic process iterating a so-called sampling operator.

Iterative sampling operators. Iterative sampling operators, as introduced in Marion et al. (2025), are mappings to a space of probabilities. In our BOED setting, we consider two such operators. The first one is defined, for each ξ, through a sequence over s of functions from P(Θ Y) to P(Θ Y) and denoted by ΣY ,θ s (p, ξ). Sampling is defined as the outcome in the limit s or for some finite s = S of the following process starting from p(0) P(Θ Y) and iterating

p(s+1) = ΣY ,θ s (p(s), ξ) , (15)

Published as a conference paper at ICLR 2025

Figure 2: Source localisation example. Prior (left) and pooled posterior (right) samples at experiment k. Final ξ k (orange cross) at the end of the optimization sequence ξ0, , ξT (blue crosses). This optimization "contrasts" the two distributions by making the pooled posterior "as different as possible" from the prior.

where p(s) can be explicit, e.g. p(s) is a Gaussian distribution with parameters depending on s, or represented by a random variable Xs p(s). For example, in density-based BOED, for Xs = (Ys, θs), we consider the Euler discretization of the Langevin diffusion converging to pξ

y(s+1) = y(s) γs y V (y(s), θ(s), ξ) + p

θ(s+1) = θ(s) γs θV (y(s), θ(s), ξ) + p

2γs Bθ,s. (16)

where By,s and Bθ,s are realizations of independent standard Gaussian variables, γs is a step-size and V (y, θ, ξ) = log p(θ) log p(y|θ, ξ) is the pξ potential, pξ(y, θ) exp ( V (y, θ, ξ)). The dynamics induced lead to samples from pξ for s . In the following, we will thus use the notation ΣY ,θ s to mean that we have access to samples from p(s+1), which is equivalent to apply ΣY ,θ s to an empirical version of p(s) built from samples {(y(s) i , θ(s) i )}i=1:N. Similarly, we can produce samples {θ (s+1) j }j=1:M from the pooled posterior qξ,N, using its score expression, via the updating,

θ (s+1) = θ (s) γ s

i=1 νi θV (yi, θ (s), ξ) + p

2γ s Bθ ,s . (17)

For a sampling operator of the pooled posterior general form (13), we need to extend the definition in Marion et al. (2025) by adding a dependence on some distribution ρ P(Y) for the conditioning part. The second sampling operator is defined, for some given ξ and ρ, through a sequence over s of parameterized functions from P(Θ) to P(Θ) and denoted by Σθ s (q, ξ, ρ). The sampling operator is defined as the outcome of the following process starting from q(0) P(Θ) and iterating

q(s+1) = Σθ s (q(s), ξ, ρ) . (18)

For instance, when ρ = PN i=1 νiδyi, q(s+1) = Σθ s (q(s), ξ, ρ) can then be a shorthand for (17).

5 SINGLE LOOP CONTRASTIVE EIG OPTIMIZATION

The perspective of optimization through sampling leads naturally to a nested loop procedure. An inner loop is performed to reach good approximations of pξ and qξ,N using two samplers as specified in Section 4 and summarized in the nested loop Algorithm 1. Considering sampling as an optimization over the space of distributions (Marion et al., 2025), a more efficient single loop procedure can be derived. As illustrated in the single loop Algorithm 2, at each optimization step over ξ, the sampling operators are applied only once using the current ξ, which is updated in turn, etc. Sampling operators can be derived from traditional density-based sampling, like in (16) and (17), where an expression of the target distribution is required to compute the score, and also from data-based sampling where only training samples are available. In the latter case, conditional score-based generative models have emerged as a very active field of research. We explicit below how a recent such framework proposed by Dou and Song (2024) can be used in our setting.

Published as a conference paper at ICLR 2025

Data-based samplers. In BOED, we are interested in sampling from a conditional distribution p(θ|y, ξ) with the following objectives. First we need to sample from the pooled posterior qξ,N(θ) which requires conditioning on the observation y. Second, in sequential design problems (see section 6.1 and Appendix D), we need to condition on the history of observations Dk 1 and produce samples from p(θ|Dk 1). Both these issues can be tackled in the framework of diffusion models for inverse problems (Daras et al., 2024). When the likelihood corresponds to a linear measurement Y with Y = Aξθ + η and η N(0, Σ), inspiring recent attempts, such as (Corenflos et al., 2025; Cardoso et al., 2024), have addressed the problem of sampling efficiently from p(θ|y, ξ) using only the pre-trained score sϕ(θ, t) of a diffusion model, without the need for any kind of retraining (see Appendix C for details). For conditional sampling, this would mean running an SDE with a conditional score θ log pt(θ(t)|y, ξ), which is intractable. See (43) and Appendix C.2. As a solution, Dou and Song (2024) propose a method named FPS that approximates pt(θ(t)|y, ξ) by pt(θ(t)|y(t), ξ) with y(t) the noised observation at time t. As the score θlog pt(θ(t)|y(t), ξ) can be written as θlog pt(θ(t))+ θlog pt(y(t)|θ(t), ξ), we can leverage the learned score sϕ(θ(t), t) and the closed form of θ log pt(y(t)|θ(t), ξ) to sample approximately from p(θ|y, ξ) using a backward SDE with the approximate score, see (45) in Appendix C.2. This allows to sample efficiently from p(θ|Dk 1) and, using θ log qξ,N(θ )=PN i=1νi θ log p(θ |yi, ξ), from the pooled posterior with the extension of (45) below, where y(t) i is the noised yi at time t of the forward SDE,

2 θ (t) β(t)

i=1 νi θ log pt(θ (t)|y(t) i , ξ)

β(t)d Bt , (19)

where t above is now flowing backwards from infinity to t=0. In practice, (19) is solved approximately using a numerical discretization and an initialization of the process with θ (T ) N(0, I) for some large finite T. An additional resampling SMC-like step can also be added as explained in Appendix E. The approach allows to handle new sequential data-based BOED tasks as illustrated in Section 6.3.

Algorithm 1:Nested-loop optimization Result: Optimal design ξ

Initialisation: ξ0 Rd for t=0:T-1 (outer ξ optimization loop) do

p(0) t p0 and q(0) t q0 for s=0:S-1 (pξ inner sampling) do

p(s+1) t = ΣY ,θ s (p(s) t , ξt) end ˆpξt p(S) t ˆρt ˆpξt(y) (ˆpξt marginal over y) for s =1:S -1 (qξ,ρ inner sampling) do

q(s +1) t = Σθ s (q(s ) t , ξt, ˆρt) end

ˆqξt q(S ) t Compute ξI(ξt)=Γ(ˆpξt, ˆqξt, ξt) in (12) Update ξt with SGD or another optimizer end return ξT ;

Algorithm 2: Single loop optimization Result: Optimal design ξ

Initialisation: ξ0 Rd, p(0) p0, q(0) q0 for t=0:T-1(sampling-optimization loop) do

p(t+1) = ΣY ,θ t (p(t), ξt) ˆρt+1 p(t+1) y (p(t+1) marginal over y) q(t+1) = Σθ t (q(t), ξt, ˆρt+1) Compute ξI(ξt)=Γ(p(t+1), q(t+1), ξt) in (12) Update ξt with SGD or another optimizer end return ξT ;

Measure 1 2 3 4 5 6

Co Diff .227 .338 .528 .673 .789 .826 Random .168 .275 .350 .391 .421 .463

Table 1: Co Diff and random reconstruction quality comparison with SSIM, in [-1,1], the higher the better.

Density-based samplers. Among density-based samplers, we can mention score-based MCMC samplers, including Langevin dynamics via the Unadjusted Langevin Algorithm (ULA) and Metropolis Adjusted Langevin Algorithm (MALA) (Roberts and Tweedie, 1996), Hamiltonian Monte Carlo (HMC) samplers (Hoffman and Gelman, 2014). In Section 6.2, an illustration is given with Langevin and sequential Monte Carlo (SMC) to handle a sequential density-based BOED task.

Contrastive Optimization. Optimizing ξ using the gradient expression (10) encourages to select a ξ that gives either high probability p(yi|θi, ξ) to samples (θi, yi) from pξ or low probability

Published as a conference paper at ICLR 2025

p(yj|θ j, ξ) to samples θ j from qξ,N. This contrastive behaviour is also visible in (2) where the EIG is defined as the mean over the experiment outcomes of the KL between posterior and prior distributions. The pooled posterior qξ,N is then used as a proxy to the intractable posterior, to perform this contrastive optimization. Figure 2 provides a visualization of this contrastive behavior in the source localization example of Section 6.2. It corresponds to set the next design ξ to a value that eliminates the most parameter θ values (right plot) among the possible ones a priori (left plot). This is analogous to Noise Constrastive Estimation (Gutmann and Hyvärinen, 2010) methods where model parameters are computed so that the data samples are as different as possible from the noise samples. Additional illustrations are given in Appendix Figure 6.

6 NUMERICAL EXPERIMENTS

Two sequential density-based (Section 6.2) and data-based (Section 6.3) BOED examples are considered to illustrate that our method extends to the sequential case in both settings.

6.1 SEQUENTIAL BAYESIAN EXPERIMENTAL DESIGN

In the sequential setting, a sequence of K experiments is planned while gradually accounting for the successively collected data. At step k, we wish to pick the best design ξk given previous outcomes Dk 1 = {(y1, ξ1), . . . , (yk 1, ξk 1)}. The expected information gain in this scenario is given by:

Ik(ξ, Dk 1) = Ep(y|ξ,Dk 1) [KL(p(θ|Y , ξ, Dk 1), p(θ|Dk 1))] ,

where p(θ|Dk 1) and p(θ|y, ξ, Dk 1) act respectively as prior and posterior analogues to the static case (2). See Appendix D for more detailed explanations. The main difference is that we no longer have direct access to samples from the step k prior p(θ|Dk 1). However, as p(θ|Dk 1) p(θ) Qk 1 n=1 p(yn|θ, ξn) and p(θ|y, ξ, Dk 1) p(θ)p(y|θ, ξ) Qk 1 n=1 p(yn|θ, ξn) we can still compute the score of these distributions and run sampling operators similar to (15) and (18). To emphasize their dependence on Dk 1, they are denoted by ΣY ,θ|Dk 1 s (p(s), ξ) and Σθ |Dk 1 s (q(s), ξ, ρ). Examples of these operators are provided in (20) and (21) in Section 6.2.

Evaluation metrics and comparison. We refer to our method as Co Diff. In Section 6.2, comparison is provided with other recent approaches, namely a reinforcement learning-based approach RL-BOED from Blau et al. (2022), the variational prior contrastive estimation VPCE of Foster et al. (2020) and a recent approach named PASOA (Iollo et al., 2024) based on tempered sequential Monte Carlo samplers. We also compare with a non tempered version of this latter approach (SMC) and with a random baseline, where the observations {y1, , y K} are simulated with designs generated randomly. More details about these methods are given in Appendix F.2. To compare methods in terms of information gains, we use the sequential prior contrastive estimation (SPCE) and sequential nested Monte Carlo (SNMC) bounds introduced in Foster et al. (2021) and used in Blau et al. (2022). These quantities allow to compare methods on the produced design sequences only, via their [SPCE, SNMC] intervals which contain the total EIG. Their expressions are given in Appendix F.1. We also provide the L2 Wasserstein distance between the produced samples and the true parameter θ. For methods that do not provide posterior estimations or poor quality ones (RL-BOED, VPCE, Random), we compute Wasserstein distances on posterior samples obtained by using tempered SMC on their design and observation sequences. In contrast, SMC Wasserstein distances are computed on the SMC posterior samples. In Section 6.3, our evaluation is mainly qualitative. The previous methods do not apply and we are not aware of existing attempts that could handle such a generative setting.

6.2 SOURCES LOCATION FINDING

We present a source localization example inspired by Foster et al. (2021); Blau et al. (2022). The setup involves C sources in R2, with unknown positions θ = {θ1, . . . , θC}. The challenge is to determine optimal measurement locations to accurately infer the sources positions. When a measurement is taken at location ξ R2, the signal strength is defined as µ(θ, ξ) = b + PC c=1 αc m+ θc ξ 2 2 where αc, b, and m are predefined constants. We assume a standard Gaussian prior for each source location, θc N(0, I2), and model the likelihood as log-normal: (log y | θ, ξ) N(log µ(θ, ξ), σ), with σ representing the standard deviation. For this experiment, we set C = 2, α1 = α2 = 1, m = 10 4,

Published as a conference paper at ICLR 2025

b = 10 1, σ = 0.5, and plan K = 30 sequential design optimizations. In the notation of the single loop Algorithm 2, we consider ΣY ,θ|Dk 1 t (p(t), ξt) and Σθ |Dk 1 t (q(t), ξt, ˆρt+1) operators that correspond respectively to the update of batch samples of size N = 200 and M = 200 {(y(t) i , θ(t) i )}i=1:N and {θ

(t) j }j=1:M with ˆρt+1 = PN i=1 νiδy(t+1) i using Langevin diffusions. Making use of the availability of the likelihood in this example, sampling from it, is straightforward and sampling operator iterations simplify into, for i = 1 : N and j = 1 : M

θ(t+1) i = θ(t) i + γt θ log p(θ(t) i |Dk 1) + p

2γt Bθ,t and y(t+1) i p(y|θ(t+1) i , ξt) (20)

θ (t+1) j = θ (t) j + γ t

i=1 νi θ log p(θ (t) j |y(t+1) i , ξt, Dk 1) + p

2γ t Bθ ,t . (21)

In practice, Langevin diffusion can get trapped in local minima and causes the sampling to be too slow to keep pace with the optimization process. To address this, we augment the Langevin diffusion with the Diffusive Gibbs (Di GS) MCMC kernel proposed by Chen et al. (2024). Di GS is an auxiliary variable MCMC method where the auxiliary variable X is a noisy version of the original variable X. Di GS enhances mixing and helps escape local modes by alternately sampling from the distributions p( x|x), which introduces noise via Gaussian convolution, and p(x| x), which denoises the sample back to the original space using a score-based update (here a Langevin diffusion). With 400 total samples, each measurement step takes 2.9 s. This number of samples is insightful as it is usually the amount of samples one can afford to compute in the diffusion models of Section 6.3. The whole experiment is repeated 100 times with random source locations each time. Figure 3 shows, with respect to k, the median for SPCE, the L2 Wasserstein distances between weighted samples and the true source locations and SNMC. Co Diff clearly outperforms all other methods, with significant improvement, both in terms of information gain and posterior estimation. It improves by 30% the non-myopic RL-BOED results on SPCE and provides much higher SNMC. The L2 Wasserstein distance is two order of magnitude lower, suggesting the higher quality of our measurements.

Figure 3: Source location. Median and standard error over 100 rollouts for SPCE, L2 Wasserstein distance (log-scale), SNMC with respect to number of experiments k. Number of samples N+M=400.

6.3 IMAGE RECONSTRUCTION WITH DIFFUSION MODELS

We build an artificial experimental design task to illustrate the ability of our method to handle design parameters related to inverse problems with a high dimensional parameter θ. We consider the task of recovering an hidden image from only partial observations of its pixels. The image to be recovered is denoted by θ. An experiment corresponds to the choice of a pixel ξ around which an observation mask is centered and the image becomes visible. The measured observation y is then a masked version of θ. The likelihood derives from the model Y = Aξθ + η where Aξ is a square mask centered at ξ and η some Gaussian variable. For the image prior, we consider a diffusion model trained for generation of the MNIST dataset (Le Cun et al., 1998). The goal is thus to select sequentially the best central pixel locations for 7 7 masks so as to reconstruct an entire 28 28 MNIST image in the smallest number of experiments. The smaller the mask the more interesting it becomes to optimally select the mask centers. Algorithm 2 is used with diffusion-based sampling operators specified in Appendix F.2.2. The gain in optimizing the mask placements is illustrated in Figure 1 and Appendix Figure 9. It is confirmed quantitatively in Table 1, which reports reconstruction quality as measured by the structural similarity index measure (SSIM) (Wang et al., 2004), details in Appendix F.2.2. Progressive reconstructions are shown in Figures 4, 7 and 8. The digit to be recovered is shown in the

Published as a conference paper at ICLR 2025

1st column. The successively selected masks are shown (red line squares) in the 2nd column with the resulting gradually discovered part of the image. The reconstruction per se can be estimated from the posterior samples shown in the last 16 columns. At each experiment, the upper sub-row shows the 16 most-likely reconstructed images, while the lower sub-row shows the 16 less probable ones. As the number of experiments increases the posterior samples gradually concentrate on the right digit.

Figure 4: Image reconstruction. First 6 experiments (rows): image ground truth, measurement at experiment k, samples from current prior p(θ|Dk 1), with best (resp. worst) weights in upper (resp. lower) sub-row. The samples incorporate past measurement information as the procedure advances. Each design steps takes 7.3s

7 CONCLUSION

We presented a new approach, Co Diff, to gradient-based BOED that allows very efficient implementations. The performance was illustrated in a traditional density-based setting with superior accuracy and lower computational cost compared to state-of-the-art methods. In addition, the possibility of our method to also handle data-based sampling represents, to our knowledge, the first extension of BOED to diffusion-based generative models. By integrating the highly successful framework of diffusion models for our sampling operators, we were able to optimize a design parameter ξ concurrently with the diffusion process. This was illustrated in a new application for BOED involving high dimensional image parameters. The foundation of our approach lies on a new EIG gradient estimator, bi-level optimization, conditional diffusion models and their application to inverse problems. Thanks to this advancement, there are as many new potential applications of BOED as there are trained diffusion models for specific inverse problem tasks. Current limitations include that Co Diff remains a greedy approach, that it requires an explicit expression of the likelihood and that when using diffusions to address inverse problems only linear forward models are currently handled. However, the non-linear setting is an active field of research, and advancements in this area could be directly applied to our framework. The applicability of our method could also be extended by considering settings with no explicit expression of the likelihood and investigating simulation-based inference such as developed by Ivanova et al. (2021); Kleinegesse and Gutmann (2021); Kleinegesse et al. (2020). In addition, although in density-based BOED, we have shown that greedy approaches could outperform long-sighted reinforcement learning procedures, in a data-based setting, it would be interesting to investigate an extension to non myopic approaches such as Iqbal et al. (2024).

Published as a conference paper at ICLR 2025

ACKNOWLEDGMENTS

The authors thank the reviewers and area chair for their interesting and useful comments and the Inria Challenge project ROAD-AI for partial funding. This work was performed using HPC/AI resources from GENCI-IDRIS (Grant 2023-AD011014217R1). Pierre Alliez is supported by the French government, through the 3IA Côte d Azur Investments in the Future project managed by the National Research Agency ANR-19-P3IA-0002.

Aali, A., Arvinte, M., Kumar, S., and Tamir, J. I. (2023). Solving inverse problems with score-based generative priors learned from noisy data. In 2023 57th Asilomar Conference on Signals, Systems, and Computers, pages 837 843. IEEE.

Alquier, P. (2024). User-friendly Introduction to PAC-Bayes Bounds. Foundations and Trends in Machine Learning, 17(2):174 303.

Amari, S.-i. (2007). Integration of Stochastic Models by Minimizing α-Divergence. Neural Computation, 19(10):2780 2796.

Amzal, B., Bois, F., Parent, E., and Robert, C. P. (2006). Bayesian-Optimal Design via Interacting Particle Systems. Journal of the American Statistical Association, 101(474):773 785.

Anderson, B. D. (1982). Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313 326.

Ao, Z. and Li, J. (2024). On Estimating the Gradient of the Expected Information Gain in Bayesian Experimental Design. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 20311 20319.

Babuschkin, I., Baumli, K., Bell, A., Bhupatiraju, S., Bruce, J., Buchlovsky, P., Budden, D., Cai, T., Clark, A., Danihelka, I., Dedieu, A., Fantacci, C., Godwin, J., Jones, C., Hemsley, R., Hennigan, T., Hessel, M., Hou, S., Kapturowski, S., Keck, T., Kemaev, I., King, M., Kunesch, M., Martens, L., Merzic, H., Mikulik, V., Norman, T., Papamakarios, G., Quan, J., Ring, R., Ruiz, F., Sanchez, A., Schneider, R., Sezener, E., Spencer, S., Srinivasan, S., Stokowiec, W., Wang, L., Zhou, G., and Viola, F. (2020). The Deep Mind JAX Ecosystem.

Blau, T., Bonilla, E. V., Chades, I., and Dezfouli, A. (2022). Optimizing sequential experimental design with deep reinforcement learning. In Proceedings of the 39th International Conference on Machine Learning (ICML), volume 162, pages 2107 2128. PMLR.

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., Vander Plas, J., Wanderman-Milne, S., and Zhang, Q. (2020). JAX: composable transformations of Python+Num Py programs.

Cardoso, G., el idrissi, Y. J., Corff, S. L., and Moulines, E. (2024). Monte Carlo guided Denoising Diffusion models for Bayesian linear inverse problems. In The Twelfth International Conference on Learning Representations (ICLR).

Carvalho, L., Villela, D., Coelho, F., and Bastos, L. (2022). Bayesian Inference for the Weights in Logarithmic Pooling. Bayesian Analysis, 1(1):1 29.

Chaloner, K. and Verdinelli, I. (1995). Bayesian experimental design: A review. Statistical Science, 10(3):273 304.

Chatterjee, S. and Diaconis, P. (2018). The sample size required in importance sampling. The Annals of Applied Probability, 28(2):1099 1135.

Chen, W., Zhang, M., Paige, B., Hernández-Lobato, J. M., and Barber, D. (2024). Diffusive Gibbs sampling. In Proceedings of the 41st International Conference on Machine Learning (ICML), volume 235, pages 7731 7747. PMLR.

Published as a conference paper at ICLR 2025

Corenflos, A., Zhao, Z., Särkkä, S., Sjölund, J., and Schön, T. B. (2025). Conditioning diffusion models by explicit forward-backward bridging. In The 28th International conference on Artificial Intelligence and Statistics (AISTATS).

Dagréou, M., Ablin, P., Vaiter, S., and Moreau, T. (2022). A framework for bilevel optimization that enables stochastic and global variance reduction algorithms. In Advances in Neural Information Processing Systems.

Daras, G., Chung, H., Lai, C.-H., Mitsufuji, Y., Milanfar, P., Dimakis, A. G., Ye, C., and Delbracio, M. (2024). A survey on diffusion models for inverse problems. https://giannisdaras.github.io/publications/diffusion_survey.pdf.

Demidovich, Y., Malinovsky, G., Sokolov, I., and Richtárik, P. (2023). A Guide Through the Zoo of Biased SGD. In Thirty-seventh Conference on Neural Information Processing Systems.

Dhariwal, P. and Nichol, A. (2021). Diffusion models beat GANx on image synthesis. Advances in neural information processing systems, 34:8780 8794.

Donsker, M. and Varadhan, S. (1976). Asymptotic evaluation of certain Markov process expectations for large time III. Communications on Pure and Applied Mathematics, 29(4):389 461.

Dou, Z. and Song, Y. (2024). Diffusion posterior sampling for linear inverse problem solving: A filtering perspective. In The Twelfth International Conference on Learning Representations (ICLR).

Drovandi, C. C., Mc Gree, J., and Pettitt, A. N. (2013). Sequential Monte Carlo for Bayesian sequentially designed experiments for discrete data. Computational Statistics & Data Analysis, 57(1):320 335.

Efron, B. (2011). Tweedie s formula and selection bias. Journal of the American Statistical Association, 106(496):1602 1614.

Foster, A., Ivanova, D. R., Malik, I., and Rainforth, T. (2021). Deep Adaptive Design: Amortizing Sequential Bayesian Experimental Design. In Proceedings of the 38th International Conference on Machine Learning (ICML), volume 161, pages 3384 3395. PMLR.

Foster, A., Jankowiak, M., Bingham, E., Horsfall, P., Teh, Y. W., Rainforth, T., and Goodman, N. (2019). Variational Bayesian Optimal Experimental Design. In Advances in Neural Information Processing Systems, pages 14059 14070.

Foster, A., Jankowiak, M., O Meara, M., Teh, Y. W., and Rainforth, T. (2020). A Unified Stochastic Gradient Approach to Designing Bayesian-Optimal Experiments. In Proceedings of the 23rd International Conference in Artificial Intelligence and Statistics, volume 108, pages 2959 2969.

Goda, T., Hironaka, T., Kitade, W., and Foster, A. (2022). Unbiased MLMC Stochastic Gradient Based Optimization of Bayesian Experimental Designs. SIAM Journal on Scientific Computing, 44(1):A286 A311.

Gumbel, E. (1961). Bivariate logistic distribution. Journal of American Statistical Association, pages 335 349.

Gutmann, M. and Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 297 304. JMLR Workshop and Conference Proceedings.

Hoffman, M. D. and Gelman, A. (2014). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(47):1593 1623.

Hong, M., Wai, H., Wang, Z., and Yang, Z. (2023). A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic. SIAM Journal on Optimization, 33:147 180.

Huan, X., Jagalur, J., and Marzouk, Y. (2024). Optimal experimental design: Formulations and computations. Acta Numerica, 33:715 840.

Published as a conference paper at ICLR 2025

Hyvärinen, A. (2005). Estimation of Non-Normalized Statistical Models by Score Matching. Journal of Machine Learning Research, 6(24):695 709.

Iollo, J., Heinkelé, C., Alliez, P., and Forbes, F. (2024). PASOAPArticle ba Sed Bayesian Optimal Adaptive design. In Proceedings of the 41st International Conference on Machine Learning (ICML), volume 235, pages 21020 21046. PMLR.

Iqbal, S., Corenflos, A., Särkkä, S., and Abdulsamad, H. (2024). Nesting particle filters for experimental design in dynamical systems. In Proceedings of the 41st International Conference on Machine Learning (ICML), volume 235. PMLR.

Ivanova, D. R., Foster, A., Kleinegesse, S., Gutmann, M. U., and Rainforth, T. (2021). Implicit deep adaptive design: Policy-based experimental design without likelihoods. Advances in neural information processing systems, 34:25785 25798.

Ivanova, D. R., Hedman, M., Guan, C., and Rainforth, T. (2024). Step-DAD: Semi-Amortized Policy-Based Bayesian Experimental Design. ICLR 2024 Workshop on Data-centric Machine Learning Research (DMLR).

Kingma, D. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diega, CA, USA.

Kleinegesse, S., Drovandi, C., and Gutmann, M. U. (2020). Sequential Bayesian Experimental Design for Implicit Models via Mutual Information. Bayesian Analysis, 16:773 802.

Kleinegesse, S. and Gutmann, M. U. (2021). Gradient-based Bayesian Experimental Design for Implicit Models using Mutual Information Lower Bounds. Ar Xiv, abs/2105.04379.

Knoblauch, J., Jewson, J., and Damoulas, T. (2022). An Optimization-Centric View on Bayes Rule: Reviewing and Generalizing Variational Inference. Journal of Machine Learning Research, 23(1).

Korba, A. and Salim, A. (2022). Sampling as first-order optimization over a space of probability measures. Tutorial at the 39th International Conference on Machine Learning (ICML).

Kullback, S. (1959). Information Theory and Statistics. Wiley.

Le Cun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324.

Liu, Y. and Tajbakhsh, S. D. (2024). Stochastic Optimization Algorithms for Problems with Controllable Biased Oracles. https://arxiv.org/abs/2306.07810.

Malik, H. J. and Abraham, B. (1973). Multivariate logistic distributions. Annals of Statistics, 3:588 590.

Marion, P., Korba, A., Bartlett, P., Blondel, M., Bortoli, V. D., Doucet, A., Llinares-López, F., Paquette, C., and Berthet, Q. (2025). Implicit diffusion: Efficient optimization through stochastic sampling. In The 28th International conference on Artificial Intelligence and Statistics (AISTATS).

Minka, T. (2005). Divergence measures and message passing. Technical report, Research, Microsoft.

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. (2021). Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(1).

Quan, W., Chen, J., Liu, Y., Yan, D.-M., and Wonka, P. (2024). Deep learning-based image and video inpainting: A survey. International Journal of Computer Vision, 132(7):2367 2400.

Rainforth, T., Cornish, R., Yang, H., Warrington, A., and Wood, F. (2018). On nesting Monte Carlo estimators. In International Conference on Machine Learning (ICML), pages 4267 4276. PMLR.

Rainforth, T., Foster, A., Ivanova, D. R., and Bickford Smith, F. (2024). Modern Bayesian Experimental Design. Statistical Science, 39(1):100 114.

Published as a conference paper at ICLR 2025

Rhee, C.-h. and Glynn, P. W. (2015). Unbiased estimation with square root convergence for SDE models. Operations Research, 63(5):1026 1043.

Roberts, G. O. and Tweedie, R. L. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341 363.

Sebastiani, P. and Wynn, H. P. (2000). Maximum entropy sampling and optimal Bayesian experimental design. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. (2021). Scorebased generative modeling through stochastic differential equations. In The Ninth International Conference on Learning Representations (ICLR).

Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. (2004). Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing, 13(4):600 612.

Xu, M., Quiroz, M., Kohn, R., and Sisson, S. A. (2019). Variance reduction properties of the reparameterization trick. In The 22nd international conference on artificial intelligence and statistics, pages 2711 2720. PMLR.

Yang, J., Ji, K., and Liang, Y. (2021). Provably Faster Algorithms for Bilevel Optimization. In Advances in Neural Information Processing Systems.

Yang, K. K., Wu, Z., and Arnold, F. H. (2019). Machine-learning-guided directed evolution for protein engineering. Nature methods, 16(8):687 694.

Published as a conference paper at ICLR 2025

A TWO EXPRESSIONS FOR THE EIG GRADIENT

Both approaches presented below, that of Goda et al. (2022) and Ao and Li (2024), start from EIG gradient expressions derived using a reparameterization trick. Using the change of variable Y = Tξ,θ(U), we can derive the following expression for the EIG gradient,

ξI(ξ) =Ep U(u)p(θ) [ ξ log p(Tξ,θ(U)|θ; ξ)] Ep U(u)p(θ) [ ξ log p(Tξ,θ(U)|ξ)] . (22)

The first term in (22) involves only the known likelihood and is generally not problematic. For the second term, we can use,

ξ log p(Tξ,θ(U)|ξ) = ξp(Tξ,θ(U)|ξ)

p(Tξ,θ(U)|ξ) (23)

with p(Tξ,θ(U)|ξ) = Ep(θ ) p(Tξ,θ(U)|θ , ξ) and

ξp(Tξ,θ(U)|ξ) = ξEp(θ ) p(Tξ,θ(U)|θ , ξ)

= Ep(θ ) ξ p(Tξ,θ(U)|θ , ξ) (24)

= Ep(θ ) p(Tξ,θ(U)|θ , ξ) ξ log p(Tξ,θ(U)|θ , ξ) . (25)

Subsequently, two expressions of the EIG gradient can be derived depending on which of (24) or (25) is used. Using (24) and p(Tξ,θ(U)|ξ) = Ep(θ ) p(Tξ,θ(U)|θ , ξ) , it comes

ξI(ξ) =Ep U(u)p(θ)

ξ log p(Tξ,θ(U)|θ; ξ) Ep(θ ) ξ p(Tξ,θ(U)|θ , ξ)

Ep(θ ) p(Tξ,θ(U)|θ , ξ)

g(ξ, Y , θ, θ) Ep(θ ) h(ξ, Y , θ, θ )

Ep(θ ) p(Y |θ , ξ)

with g(ξ, y, θ, θ )= ξlog p(Tξ,θ(u)|θ ,ξ)|u=T 1 ξ,θ(y) and h(ξ, y, θ, θ )= ξp(Tξ,θ(u)|θ , ξ)|u=T 1 ξ,θ(y).

Considering, in the second term, an additional importance distribution q(θ |y, θ, ξ) leads to the expression used in Goda et al. (2022),

ξI(ξ) = Epξ

g(ξ, Y , θ, θ) Eq(θ |Y ,θ,ξ) h p(θ ) q(θ |Y ,θ,ξ)h(ξ, Y , θ, θ ) i

Eq(θ |Y ,θ,ξ) h p(θ ) q(θ |Y ,θ,ξ)p(Y |θ , ξ) i

It can be used to derive estimators of the form,

g(ξ, yi, θi, θi)

1 M PM j=1 p(θ i,j) q(θ i,j|yi,θi,ξ)h(ξ, yi, θi, θ i,j)

1 M PM j=1 p(θ i,j) q(θ i,j|yi,θi,ξ)p(yi|θ i,j, ξ)

where {(yi, θi)}i=1:N are simulated from the joint distribution pξ and for each i = 1 : N, {θ i,j}j=1:M is a sample from q( |yi, θi, ξ). Goda et al. (2022) use (28) with N = 1. Even with perfect sampling, this estimator is not unbiased due to the ratio in the second term but can be de-biased following Rhee and Glynn (2015). The randomized MLMC procedure of Rhee and Glynn (2015) is a post-hoc general procedure that can be more generally applied to de-bias a sequence of possibly biased estimators, provided the estimators are consistent.

Alternatively, using (25) instead, another expression of the EIG gradient can be derived. Replacing (25) in (23), it comes,

ξ log p(Tξ,θ(U)|ξ) = Ep(θ )

p(Tξ,θ(U)|θ , ξ) ξ log p(Tξ,θ(U)|θ , ξ)

p(Tξ,θ(U)|ξ)

= Ep(θ |Tξ,θ(U),ξ) ξ log p(Tξ,θ(U)|θ , ξ) ,

Published as a conference paper at ICLR 2025

which, with the definition of g above, leads to

ξI(ξ) =Epξ g(ξ, Y , θ, θ) Ep(θ |Y ,ξ) g(ξ, Y , θ, θ ) . (29)

This alternative expression (29) is the starting point of Ao and Li (2024), who subsequently use the following estimator,

g(ξ, yi, θi, θi) Eq(θ |yi,ξ) g(ξ, yi, θi, θ ) ,

where {(yi, θi)}i=1:N is as before a sample from the joint distribution pξ and where for each yi, q(θ |yi, ξ) is a tractable approximation of the intractable posterior p(θ |yi, ξ). More specifically, Ao and Li (2024) propose to approximate each posterior distribution by q(θ |yi, ξ) = 1 M PM j=1 δθ i,j, using a sample {θ i,j}j=1:M from an MCMC procedure. It follows the nested Monte Carlo estimator below,

g(ξ, yi, θi, θi) 1

j=1 g(ξ, yi, θi, θi,j)

B LOGARITHMIC POOLING AS A GOOD IMPORTANCE SAMPLING PROPOSAL

When considering importance sampling with a proposal distribution q and a target distribution p, Chatterjee and Diaconis (2018) proved that under certain conditions, the number of simulation draws required for both importance sampling and self normalized importance sampling (SNIS) estimators to have small L1 error with high probability was roughly exp(KL(p, q)), see Theorem 1.2 in Chatterjee and Diaconis (2018) for SNIS. Similarly, selecting a proposal distribution which minimizes the importance sampling estimator variance is equivalent to finding a distribution with small χ2-distance to p, see e.g. Appendix E of Minka (2005). More generally, finding a good proposal q is linked to the problem of minimizing α-divergences or f-divergence between p and q, which are jointly convex in p and q, see Minka (2005). In this work, we consider KL(q, p) as a measure of proximity between p and q. This choice is ultimately arbitrary but has the advantage of leading to an interpretable proposal with interesting sampling properties. To justify the pooled posterior qξ,N in (9) and its use in (10), we then use Lemma 2 below to show that for PN i=1 νi = 1, the distribution q that minimizes the weighted sum of the KL against each posterior p(θ|yi, ξ), i.e. PN i=1 νi KL(q, p(θ|yi, ξ)) is

i=1 p(yi|θ, ξ)νi (31)

i=1 p(θ|yi, ξ)νi , (32)

which is the logarithmic pooling (or geometric mixture) of the respective posterior distributions p(θ|yi, ξ). Lemma 2 results from an application of a lemma mentioned by Alquier (2024) (Lemma 2.2 therein), and recalled below in Lemma 1. This Lemma 1 has been known since Kullback (Kullback, 1959) in the case of a finite parameter space Θ, but the general case is due to Donsker and Varadhan (Donsker and Varadhan, 1976). Recall that P(Θ) denotes the set of probability measures on Θ and p a given probability measure in P(Θ).

Lemma 1 (Donsker and Varadhan s variational formula) For any measurable, bounded function f : Θ R, the supremum with respect to q P(Θ) of

Eq [f(θ)] KL(q, p)

is the following Gibbs measure pf defined by its density with respect to p,

dpf = exp(f(θ)) Ep [exp(f(θ))]dp .

The following Lemma 2 is an application of Lemma 1.

Published as a conference paper at ICLR 2025

Lemma 2 For a given probability measure p P(Θ) and a measure ρ on Y (not necessarily a probability measure), define for any probability measure q P(Θ)

ℓ(q) = Eq [EY ρ [log p(Y |θ)]] KL(q, p) . (33)

It results from the Donsker and Varadhan s variational formula Lemma 1 that the supremum of ℓ(q) with respect to q is reached for the Gibbs measure q defined by its density with respect to p,

q (θ) p(θ) exp(EY ρ[log p(Y |θ)]) .

In addition maximizing ℓis equivalent to minimising

EY ρ [KL(q(θ), p(θ|Y ))]

which means that q is the measure that minimizes the KL to each p(θ|y) on average with respect to y.

Proof of Lemma 2 The expression of q results from a direct application of Lemma 1 to f(θ) = EY ρ [log p(Y |θ)] assuming it is measurable and bounded as a function of θ (to be checked in practice). The second part results from rewriting ℓas

ℓ(q) = EY ρ

log(p(Y |θ)p(θ))

= EY ρ [KL(q, p(θ|Y )] + EY ρ [log p(Y )] .

Example: as already mentioned our pooled posterior qξ,N corresponds to the application of this result to ρ = PN i=1 νiδyi with PN i=1 νi = 1. This latter result with PN i=1 νi = 1 can also be recovered from a more general result by Amari (2007), which is stated for any α-divergence (see Theorem 2 of Amari (2007) with α = 1). The continuous weight version is also mentioned by Amari (2007) (Theorem 4), referring to other papers for the proof. We provided here a simple proof for the KL (α = 1) case.

Remark 1: If ρ = δy, or ρ = PN i=1 δyi, we recover the standard variational formulation of the posterior distribution (see e.g. Table 1 in (Knoblauch et al., 2022)). The posterior distribution p(θ|y1, . . . , y N) differs from the logarithmic pooling (for which the weights νi sum to 1) in the relative weight given to the prior. The result is valid for very general ℓnot necessarily expressed as an expectation.

Remark 2: Regarding logarithmic pooling, the result is similar to a result in Carvalho et al. (2022) (Remark 3.1 therein) by showing that, in the case of the sum of the KL, PN i=1 KL(q, p(θ|yi)), the optimal pooling weights are equal, νi = 1

Remark 3: The pooled posterior distribution can also be recovered as a constrained mean field solution. Indeed, it is easy to show that q is also the measure that minimizes the KL between the joint distribution and a product form approximation where one of the factor is fixed to ρ(y),

q = arg min q P(Θ) KL(q(θ)ρ(y), p(θ, Y )) .

C DIFFUSION-BASED GENERATIVE MODELS

C.1 DENOISING DIFFUSION MODELS

Given a distribution p0 P(Θ) only available through a set of samples of θ, diffusion models are based on the addition of noise to the available samples in such a manner that allows to learn the reverse process that "denoises" the samples. This learned process can then be exploited to generate new samples by denoising random noise samples until we get back to the original data distribution. As an appropriate noising process, in our experiments we ran the Variance Preserving SDE from Dhariwal and Nichol (2021):

d θ (t) = β(t)

2 θ (t)dt + p

β(t)d Bt (34)

Published as a conference paper at ICLR 2025

where β(t) > 0 is a linear noise schedule that controls the amount of noise added at time t. Solving SDE (34) leads to

( θ (t)| θ (0)) N θ (0) exp( 1

0 β(s)ds), (1 exp( Z t

0 β(s)ds))I , (35)

which can be written as

θ (t) = αt θ (0) +

1 αtϵ with αt = exp( Z t

0 β(s)ds) and (36)

where ϵ N(0, I) is a standard Gaussian random variable. Samples from p0 are transformed to samples approximately from a standard Gaussian distribution after some large time T.

The reverse denoising process can then be written as the reverse of the diffusion process (34), which as stated by Anderson (1982) is:

dθ(t) = β(t)

2 θ(t) β(t) θ log pt(θ(t)) dt + p

β(t)d Bt , (37)

where t flows backwards from infinity to t=0 and pt is the distribution of θ (t) from (34). In practice, the process is started at some large finite T assuming that θ(T ) N(0, I). In this Appendix, we rather consider increasing time t from 0 to T, using that (37) can be equivalenty written as,

dθ(t) = β(T t)

2 θ(t) + β(T t) θ log p T t(θ(t)) dt + p

β(T t)d Bt , (38)

which is now initialized with θ(0) N(0, I). Solving this reverse SDE, the distribution of θ(T ) is closed to p0 for large T, which allows approximate sampling from p0.

The score function θ log pt(θ) of the noisy data distribution at time t is intractable and is then estimated by learning a neural network sϕ(θ, t) with parameters ϕ. Score matching (Hyvärinen, 2005) is a method to train sϕ by minimizing the following loss:

Ept(θ) ||sϕ(θ, t) θ log pt(θ)||2 . (39)

As pt(θ) is still unknown and only samples from pt(θ| θ (0)) in (35) are available, Song et al. (2021) rewrite this loss function as:

Et U[0,T ]Ep0(θ(0))Ept(θ|θ(0)) h λ(t)||sϕ(θ, t) θ log pt(θ|θ(0))||2i (40)

where λ(t) > 0 is a weighting function that allows to focus more on certain timesteps than others. It is common to take λ(t) inversely proportional to the variance of (35) at time t.

Once the neural network sϕ has been trained by minimizing (40), it can be used to generate new samples approximately distributed as the target distribution p0 by running a numerical scheme on the reverse SDE (38). By running for example the Euler-Maruyama scheme on (38), we get the following update step for the reverse process:

θ(t+ t) = θ(t) + β(T t)

2 θ(t) t + β(T t)sϕ(θ(t), T t) t + p

β(T t) tϵ . (41)

We can then generate samples approximately from p0 by running the reverse process (41) with a small enough t.

C.2 CONDITIONAL DIFFUSION MODELS

Conditional diffusion models arise when, for some measurement y, we want to produce samples from some conditional distribution p0(θ|y). Sampling from conditional distributions is a problem that arises in inverse problems. When using diffusion models, numerous solutions have been investigated as mentioned in a very recent review (Daras et al., 2024). We specify in this section the approach adopted for our applications. With the application to experimental design in mind, we assume here that Y = Aξθ + η (42)

Published as a conference paper at ICLR 2025

where η N(0, σ2I) is the measurement noise, Aξ is the operator that represents the experiment at ξ.

Sampling from the conditional distribution p(θ|y, ξ) can be done by running the reverse diffusion process on the conditional SDE:

dθ(t) = β(T t)

2 θ(t) + β(T t) θ log p T t(θ(t)|y, ξ) dt + p

β(T t)d Bt, (43)

with the usual score θ log p T t(θ(t)) replaced by the conditionnal score θ log p T t(θ(t)|y, ξ). The main objective of conditional SDE is to generate samples from the conditional distribution p(θ|y, ξ) without retraining a new neural network sϕ for the new conditional score. Writing in terms of the forward process, the conditional score can be written using:

θ log pt( θ (t)|y, ξ) = θ log pt(y| θ (t), ξ) + θ log pt( θ (t)) (44)

and we can leverage a pre-computed neural network sϕ that was trained to estimate the

score θ log pt( θ (t)) in the unconditional case. If we know how to evaluate the first term

θ log pt(y| θ (t), ξ), we can then run the reverse process (43) to generate samples from the conditional distribution p(θ|y, ξ). Unfortunately, this term does not have a closed form expression.

As a solution, Dou and Song (2024) propose to approximate the intractable θ log pt(y| θ (t), ξ) by

the tractable θ log pt(y(t)| θ (t), ξ) where y(t) is a noisy version of y at time t. Then, the following backward SDE can be run to generate samples from the conditional distribution p(θ|y, ξ):

dθ(t) = β(T t)

2 θ(t) + β(T t) θ log p T t(y(T t)|θ(t), ξ) + β(T t) θ log p T t(θ(t)) dt

β(T t)d Bt (45)

The sequence of noisy y(t) can be generated with a noising process like (36),

y(t) = αty +

1 αt Aξϵ with αt = exp( Z t

0 β(s)ds) , (46)

which using the forward model (42) can be written as:

y(t) = Aξ θ (t) + αtη . (47)

We can then evaluate θ log pt(y(t)| θ (t), ξ) as :

θ log pt(y(t)| θ (t), ξ) = 1 σ2 αt AT ξ (y(t) Aξ θ (t)) . (48)

C.3 GRADIENT ESTIMATION

When sampling is performed with a finite time horizon diffusion model, the first iterations of the corresponding sampling operator may provide too noisy samples which would result in gradient estimations with little information. One solution proposed by Marion et al. (2025) is to use a queuing trick which requires to store in memory a queue of samples. The memory burden is high and only acceptable for a limited number of particles. We propose a simpler solution, which consists in using Tweedie s formula (Efron, 2011) to perform a one-shot backward step, replacing a potentially too

noisy θ(t) by the conditional mean E[ θ (0)| θ (T t) = θ(t)], which can be interpreted as its prediction at time 0. The Tweedie s formula provides this prediction in close-form,

ˆθ (t) = E[ θ (0)| θ (T t) = θ(t)] = θ(t) + (1 αT t) θ log p T t(θ(t)) αT t . (49)

Gradients are then computed using the ˆθ (t) s values, while θ(t) is updated into θ(t+1) from the

backward SDE, as mentioned above. This is only really impactful for small t as for large t, ˆθ (t) and θ(t) get closer. This extra computation does not add cost as (49) uses a score value that is already computed for (45).

Published as a conference paper at ICLR 2025

D SEQUENTIAL BAYESIAN EXPERIMENTAL DESIGN

In this framework, experimental conditions are determined sequentially, making use of measurements that are gradually made. This sequential view is referred to as sequential or iterated design. In a sequential setting, we assume that we plan a sequence of K experiments. For each experiment, we wish to pick the best ξk using the data that has already been observed Dk 1 = {(y1, ξ1), . . . , (yk 1, ξk 1)}. Given this design, we conduct an experiment using ξk and obtain outcome yk. Both ξk and yk are then added to Dk 1 for a new set Dk = Dk 1 (yk, ξk). After each step, our belief about θ is updated and summarised by the current posterior p(θ|Dk), which acts as the next prior at step k + 1. When the observations are assumed conditionally independent, it comes,

p(θ|Dk) p(θ)

n=1 p(yn|θ, ξn) (50)

p(y, θ|ξ, Dk 1) p(θ) p(y|θ, ξ)

n=1 p(yn|θ, ξn) . (51)

A greedy design can be seen as choosing each design ξk as if it was the last one. This means that ξk is chosen as ξ k the value that maximizes

ξ k = arg max ξ Ik(ξ, Dk 1)

Ik(ξ, Dk 1) = Epk ξ

log pk ξ(θ, Y ) p(Y |ξ, Dk 1) p(θ|Dk 1)

= MI(pk ξ) (52)

with pk ξ denoting the joint distribution p(y, θ|ξ, Dk 1) = p(y|θ, ξ) p(θ|Dk 1). Distribution pk ξ involves the current prior p(θ|Dk 1), which is not available in closed-form and is not straightforward to sample from. Distribution pk ξ can be written as a Gibbs distribution by defining the potential Vk as

pk ξ(y, θ) exp ( Vk(y, θ, ξ))

with Vk(y, θ, ξ) = log p(θ) log p(y|θ, ξ)

n=1 log p(yn|θ, ξn)

= V (y, θ, ξ) + Vk(θ) ,

where V (y, θ, ξ) has been already defined in Section 4. Note that the marginal in θ of pk ξ is the posterior at step k 1 or equivalently the current prior p(θ|Dk 1) and the marginal in y is

p(y|ξ, Dk 1) = Ep(θ|Dk 1) [p(y|θ, Dk 1)] .

Once a new ξk is computed and a new observation yk is performed, the posterior at step k is p(θ|yk, ξk, Dk 1) which is the conditional distribution of p(yk, θ|ξk, Dk 1).

E SEQUENTIAL MONTE CARLO (SMC)-STYLE RESAMPLING

SMC is an essential addition when dealing with sequential BOED. In density-based BOED, it has been already exploited in the sequential context showing a real improvement in the quality of the generated samples (Iollo et al., 2024). SMC is also useful in simpler static cases as it can improve the quality of the generated θ and contrastive θ samples, that in turn improves the accuracy of the gradient estimator (30). A particularly central step in SMC is the resampling step, first recalled below in the density-based case. Using our framework, is it also possible to derive a SMC-style resampling scheme in the data-based case. This is becoming a popular strategy in the context of generative models (Dou and Song, 2024; Cardoso et al., 2024).

Published as a conference paper at ICLR 2025

Density-based BOED. In static density-based BOED, the prior p(θ) and the likelihood p(y|θ, ξ) are available in closed-form. In the sequential experiment context, we want to generate N samples θ1, . . . , θN from the current prior p(θ|Dk 1) and M samples θ 1, . . . , θ M from the pooled posterior qξ,N(θ ). As both p(θ|Dk 1) and qξ,N(θ ) QN i=1 p(θ |yi, ξ, Dk 1)νi can be evaluated up to a normalizing constant, it is straightforward to extend the sampling operators of Section 4 and add a resampling step to the samples θ1, . . . , θN and θ 1, . . . , θ M with weights wi and w j:

wi = wi PN i=1 wi with wi = p(θi)

w j = w j PM j=1 w j with w j = q(θ j)

where p and q are the unnormalized versions of p(θ|Dk 1) and qξ,N(θ ) respectively.

Data-based BOED. In the setting of data-based BOED, we assume access to a conditional diffusion model that allows to generate samples from p(θ|Dk 1) and qξ,N(θ ). The resampling scheme proposed in (Dou and Song, 2024) can be used as is, to improve the quality of the samples from p(θ|Dk 1) as this is a usual conditional distribution. The resampling scheme is based on the FPS update: θ(t) j is first moved using the backward SDE into θ(t+1) j according to p(θ(t+1)|θ(t), y(T t 1), ξ, Dk 1), which satisfies

p(θ(t+1)|θ(t), y(T t 1), ξ, Dk 1) pt(θ(t+1)|θ(t), Dk 1) p(y(T t 1)|θ(t+1), ξ) (53)

where pt(θ(t+1)|θ(t), Dk 1) is given in closed form by the unconditional diffusion model and p(y(T t 1)|θ(t+1), ξ) is given by (47). As both these distributions are Gaussian, p(θ(t+1)|θ(t), y(T t 1), ξ, Dk 1) can be written in closed form and resampling weights can be written as: wi = wi PN i=1 wi with wi = p(y(T t 1)|θ(t) i , ξ) (54)

where p(y(T t 1)|θ(t) i , ξ) is tractable (see Dou and Song (2024) for more details).

For the pooled posterior qξ,N(θ ) QN i=1 p(θ |yi, ξ, Dk 1)νi, update (53) takes the form:

i=1 p(θ(t+1)|θ(t), y(T t 1) i , ξ, Dk 1)νi pt(θ(t+1)|θ(t), Dk 1)

i=1 p(y(T t 1) i |θ(t+1), ξ)νi

p(θ(t+1)|θ(t), Dk 1) p(y(T t 1) i |θ(t+1), ξ) νi

which leads to the following resampling weights:

w j = w j PM j=1 w j with w j =

i=1 p(y(T t 1) i |θ (t) j , ξ)νi .

F NUMERICAL EXPERIMENTS

F.1 SEQUENTIAL PRIOR CONTRASTIVE ESTIMATION (SPCE) AND SEQUENTIAL NESTED MONTE CARLO (SNMC) CRITERIA

The SPCE introduced by Foster et al. (2021) is a tractable quantity to assess the design sequence quality. For a number K of experiments, DK = {(y1, ξ1), , (y K, ξK)} and L contrastive variables, SPCE is defined as

SPCE(ξ1, , ξK) = E K Q

k=1 p(yk|ξk,θ0) L Q

k=1 p(Yk|θ0, ξk)

k=1 p(Yk|θℓ, ξk)

Published as a conference paper at ICLR 2025

SPCE is a lower bound of the total EIG which is the expected information gained from the entire sequence of design parameters ξ1, . . . , ξK and it becomes tight when L tends to . In addition, SPCE has the advantage to use only samples from the prior p(θ) and not from the successive posterior distributions. It makes it a fair criterion to compare methods on design sequences only. Considering a true parameter value denoted by θ , given a sequence of design values {ξk}k=1:K, observations {yk}k=1:K are simulated using p(y|θ , ξk) respectively. Therefore, for a given Dk, the corresponding SPCE is estimated numerically by sampling θ1, , θL from the prior,

SPCE(DK) = 1

k=1 p(yk|θ , ξk)

k=1 p(yk|θ , ξk) + L P

k=1 p(yk|θi ℓ, ξk)

Similarly, an upper bound on the total EIG has also been introduced by Foster et al. (2021) and named the Sequential nested Monte Carlo (SNMC) criterion,

SNMC(ξ1, , ξK) = E K Q

k=1 p(yk|ξk,θ0) L Q

k=1 p(Yk|θ0, ξk)

k=1 p(Yk|θℓ, ξk)

As shown in Foster et al. (2021) (Appendix A), SPCE increases with L to reach the total EIG when L at a rate O(L 1) of convergence. It is also shown in Foster et al. (2021) that for a given L, SPCE is bounded by log(L + 1) while the upper bound SNMC below is potentially unbounded. As in Blau et al. (2022), if we use L = 107 to compute SPCE and SNMC, the bound is log(L + 1) = 16.12 for SPCE. In practice this does not impact the numerical methods comparison as the intervals [SPCE, SNMC] containing the total EIG remain clearly distinct.

F.2 IMPLEMENTATION DETAILS

F.2.1 SOURCE EXAMPLE

For VPCE (Foster et al., 2020) and RL-BOED (Blau et al., 2022), we used the code available at github.com/csiro-mlai/RL-BOED, using the settings recommended therein to reproduce the results in the respective papers. VPCE optimizes an EIG lower bound in a myopic manner estimating posterior distributions with variational approximations. RL-BOED is a non-myopic approach which does not provide posterior distributions. From the obtained sequences of observations and design values, we computed SPCE and SNMC as explained above and retrieved the same results as in their respective papers. For PASOA and SMC procedures, we used the code available at github.com/iolloj/pasoa. PASOA is a myopic approach, optimizing an EIG lower bound using sequential Monte Carlo (SMC) samplers and tempering to also provide posterior estimations. The method refered to as SMC is a variant without tempering.

For Co Diff, the νi s in the pooled posterior distribution were set to νi = 1 N . The current prior and posterior distributions at experimental step k were initialized using respectively the prior and posterior samples at step k 1. Design optimization was performed using the Adam optimizer with an exponential learning rate decay schedule with initial learning rate 10 2 and decay rate 0.98. The Langevin step-size in the Di GS method Chen et al. (2024) was set to 10 2. The joint optimization-sampling loop was run for 5000 steps. Figure 5 shows samples from the current prior p(θ|Dk 1), which gradually concentrate around the true sources as k increases. The additional Figure 6 shows, at some intermediate step k, samples from the current prior p(θ|Dk 1) and from the pooled posterior distribution in comparison, to illustrate its contrastive nature.

F.2.2 MNIST EXAMPLE

In this example, the likelihood easily derives from Y = Aξθ + η, where Y , θ and η are familiarly seen as arrays of pixels. The transformation Tξ,θ is simply Tξ,θ(U) = Aξθ + U with U = η a Gaussian variable. However, to fit in our theoretical framework, images need to be treated as mappings over a continuous spatial domain, so that ξ can be seen as a continuous parameter and Aξθ be differentiable with respect to ξ. This is common in image processing and analysis, where

Published as a conference paper at ICLR 2025

Figure 5: Source localization example. Experiments 0 (prior samples), 4, 8 and 12. As new design locations are selected (orange crosses), samples concentrate to the true sources (red crosses). Samples with lower weights in blue, higher weights in yellow.

2D images are often regarded as mappings over a continuous 2D domain (see e.g. Rein van den Boomgaard and Leo Dorst, 2021 Lecture Notes) with x1 and x2 axes. That is, for (x1, x2) R2, ξ = (ξ1, ξ2) R2, we consider that an array of pixels θ is a discrete sampled representation of a 2D function θ(x1, x2) and more generally define, using abusively the same notation for continuous and sampled representations,

Y (x1, x2) = Aξ(θ(x1, x2)) + η(x1, x2)

where Aξ( ) is a masking operator depending on some length h and defined by

Aξ(θ(x1, x2)) = θ(x1, x2) if (x1, x2) Sξ,h = 0 otherwise

with Sξ,h = {(x1, x2) R2, ξ1 h 1 x1 ξ1 + h 1, ξ2 h 2 x2 ξ2 + h 2, denoting by 1 and 2 the sampling distances along the x1 and x2 axes.

To be able to derive with respect to ξ, we then need to consider a smooth version µξ,s of Aξ. This is classically done by convoluting with a 2D Gaussian kernel Gs, e.g. the product of two 1D Gaussian kernels with positive scales s = (s1, s2). In practice, this consists in smoothing the sharp borders of the mask, noting that µξ,s Aξ when s 0,

µξ,s(x1, x2) = (Aξ(θ) Gs)(x1, x2)

= (A0(θ) Gs)(x1 ξ1, x2 ξ2) = µ0,s(x1 ξ1, x2 ξ2)

where the second equality uses that Aξ(θ(x1, x2)) = A0(θ(x1 ξ1, x2 ξ2)). It follows that, no matter what A0(θ) is, such a convolution with Gs makes µ0,s into a function that is continuous and

Published as a conference paper at ICLR 2025

Figure 6: Several source localisation examples. Prior (left) and pooled posterior (right) samples at experiment k. Final ξ k value (orange cross) at the end of the optimization sequence ξ0, , ξT (blue crosses). This optimization "contrasts" the two distributions by making the pooled posterior "as different as possible" from the prior.

Published as a conference paper at ICLR 2025

infinitely differentiable in its arguments. Then, it comes, for i = 1, 2,

ξi (x1, x2) = µ0,s

ξi (x1 ξ1, x2 ξ2)

xi (x1 ξ1, x2 ξ2)

(x1 ξ1, x2 ξ2) .

These latter derivatives are all that is needed to define the gradients in our developments. If A0 is a Heaviside step function, its convolution with a distribution leads to the cumulative density function (cdf) of this distribution. The same developments are valid by replacing the Gaussian kernel by a bivariate logistic distribution L with mean 0 and scales s = (s1, s2). The multivariate logistic distribution (Gumbel, 1961; Malik and Abraham, 1973) generalizes the univariate logistic distribution. Its pdf is L2(x1, x2; s) = 2! exp( x1/s1 x2/s2) s1s2(1+exp( x1/s1)+exp( x2/s2))3 . Its cdf is closed-form and is

1 1+exp( x1/s1)+exp( x2/s2). In practice, we simply consider a product of two independent univariate

logistic distributions with mean 0 and scale si, i = 1, 2, L1(xi; si) = exp( xi/si) si(1+exp( xi/si))2 . The cdf of a 1D logistic distribution is the sigmoid function S(xi; si) = 1 1+exp( xi/si). The convolution of a Heaviside step function with such a logistic distribution is thus a smooth sigmoid. In the MNIST example, the mask length is set to h = 7, with sampling distances 1 = 2 = 1 and we use a 2D product logistic kernel with s1 = s2 = 0.1. Using that the 2D mask Aξ can be written as the following product (H(x1 ξ1+h) + H(ξ1+h x1) 1) (H(x2 ξ2+h) + H(ξ2+h x2) 1), where H is the 1D Heaviside step function, it follows that the smooth µξ,s is

µξ,s(x1, x2)=(S(x1 ξ1+h; s1) + S(ξ1+h x1; s1) 1) (S(x2 ξ2+h; s2) + S(ξ2+h x2; s2) 1) .

For the numerical example of Section 6.3, we used the MNIST dataset (Le Cun et al., 1998), the time varying SDE (34) with a noise schedule β(t) = bmin + (bmin bmax)(t t0)/(T t0) (with bmax = 5, bmin = 0.2, t0 = 0, T = 2). The training of the usual score matching was done for 3000 epochs with a batch size of 256 and using Adam optimizer Kingma and Ba (2015). We used gradient clipping and the training was done on a single A100 GPU.

Update equations for the sampling operators were derived from SDE (19) for the contrastive samples of the pooled posterior qξ,N(θ ) and (45) for samples from the current prior p(θ|Dk 1), where Dk 1 can be added in the conditioning part without difficulty. Those updates are equivalent to (55) and (53) respectively. The resampling weights were computed as in Section E.

Figures 7 and 8 show additional image reconstruction processes. The digit to be recovered is shown in the first column. The successively selected masks are shown (red line squares) in the second column with the resulting gradually discovered part of the image. The reconstruction per se can be estimated from the posterior samples shown in the last 16 columns. At each experiment, the upper sub-row shows the 16 most-likely reconstructed images, while the lower sub-row shows the 16 less-probable ones. As the number of experiments increases the posterior samples gradually concentrate on the right digit.

Figure 9 then shows that design optimization is effective by showing better outcomes when masks locations are optimized (second column) than when masks are selected at random centers (third column). The highest posterior weight samples in the last 14 columns also clearly show more resemblance with the true digit in the optimized case. The superior performance of design optimization is confirmed quantitatively in Table 1, which reports the reconstruction quality as measured by the structural similarity index measure (SSIM) (Wang et al., 2004), for both Co Diff and random design. 20 ground truth digit images are randomly selected and the SSIM is computed for the Co Diff and random reconstructions, after each successive experiment out of 6. Table 1 reports the median SSIM over the 20 selected digits. The SSIM is a decimal value between -1 and 1, where 1 indicates perfect similarity, 0 indicates no similarity, and -1 indicates perfect anti-correlation.

Published as a conference paper at ICLR 2025

Figure 7: Image reconstruction. First 7 experiments (rows): image ground truth, measurement at experiment k, samples from current prior p(θ|Dk 1), with best (resp. worst) weights in upper (resp. lower) sub-row. The samples incorporate past measurement information as the procedure advances.

F.3 HARDWARE DETAILS

The source example 6.2 can be run locally. It was tested on an Apple M1 Pro 16Gb chip but faster running times can be achieved on GPU. The MNIST example 6.3 was run on a single A100 80Gb GPU.

F.4 SOFTWARE DETAILS

Our code is implemented in Jax Bradbury et al. (2020) and uses Flax as a Neural Network library and Optax as optimization one Babuschkin et al. (2020). The code is available at https://github.com/jcopo/Contrastive Diffusions.

Published as a conference paper at ICLR 2025

Figure 8: Image reconstruction. First 7 experiments (rows): image ground truth, measurement at experiment k, samples from current prior p(θ|Dk 1), with best (resp. worst) weights in upper (resp. lower) sub-row. The samples incorporate past measurement information as the procedure advances.

Figure 9: Image θ (1st column) reconstruction from 7 sub-images y = Aξθ +η selected sequentially at 7 central pixel ξ. Optimized vs. random designs: measured outcome y (2nd vs. 3rd column) and parameter θ estimates (reconstruction) with highest weights (upper vs. lower sub-row).