# pasoa_particle_based_bayesian_optimal_adaptive_design__0662756b.pdf

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

Jacopo Iollo 1 2 3 Christophe Heinkelé 3 Pierre Alliez 2 Florence Forbes 1

We propose a new procedure named PASOA, for Bayesian experimental design, that performs sequential design optimization by simultaneously providing accurate estimates of successive posterior distributions for parameter inference. The sequential design process is carried out via a contrastive estimation principle, using stochastic optimization and Sequential Monte Carlo (SMC) samplers to maximise the Expected Information Gain (EIG). As larger information gains are obtained for larger distances between successive posterior distributions, this EIG objective may worsen classical SMC performance. To handle this issue, tempering is proposed to have both a large information gain and an accurate SMC sampling, that we show is crucial for performance. This novel combination of stochastic optimization and tempered SMC allows to jointly handle design optimization and parameter inference. We provide a proof that the obtained optimal design estimators benefit from some consistency property. Numerical experiments confirm the potential of the approach, which outperforms other recent existing procedures.

1. Introduction

A design refers to some experimental conditions required to perform an experiment and get observations from the phenomenon under study. Assuming that such a design is characterized by some parameters denoted by ξ, simple examples of ξ are coordinates in a 2D space or a frequency at which we wish to measure a quantity of interest. The overall goal of experimental design can be summarized as the acquisition of good quality data, which is of

1Université Grenoble Alpes, Inria, CNRS, GINP, France 2Université Côte d Azur, Inria, France 3Cerema, Endsum-Strasbourg, France. Correspondence to: Jacopo Iollo <jacopo.iollo@inria.fr>, Florence Forbes <florence.forbes@inria.fr>.

Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

increasing importance in numerous scientific and industrial applications, see e.g. (Ryan et al., 2016; Rainforth et al., 2024). Quality can be defined from numerous view points, allocating resources for information gathering, improving precision and/or prediction or reducing experimental costs. In this work, we assume that the desired designs are continuous parameters ξ E that maximize information on some parameters of interest θ Θ Rm. In Bayesian Optimal Experimental Design (BOED), a Bayesian model is specified over θ (Chaloner & Verdinelli, 1995; Sebastiani & Wynn, 2000; Amzal et al., 2006; Ryan et al., 2016; Rainforth et al., 2024). The criterion defining the optimality of a design may then depend on the application (Ryan et al., 2016; Kleinegesse & Gutmann, 2021) but the most used of them is the expected information gain (EIG). It targets designs that most reduce the entropy of the posterior distribution over θ, or equivalently that maximize the Kullback-Leibler divergence between the posterior and the prior, and this in expectation over all possible experimental outcomes, see eq. (1,2). The Bayesian formulation is particularly convenient in sequential contexts, where we wish to exploit the results of previous experiments to guide the design of future ones. Unfortunately, computing and optimizing the EIG sequentially requires sampling from an evolving sequence of intractable posterior distributions. We propose to use a sequential Monte-Carlo (SMC) approach for an efficient sequential sampling (Del Moral et al., 2006; Doucet & Lee, 2018; Naesseth et al., 2019; Chopin & Papaspiliopoulos, 2020; Dai et al., 2022) and in the spirit of some original and recent approaches (Huan & Marzouk, 2014; Kleinegesse & Gutmann, 2021; Foster et al., 2019; 2020; 2021; Blau et al., 2022), we adopt a stochastic optimization approach to optimize the EIG. The originality of our approach is to use SMC in conjunction with stochastic optimization to efficiently sample the relevant quantities and estimate the required noisy gradients. However, naive SMC is potentially problematic. In sequential EIG-based optimization, larger gains are obtained for larger distances between successive posterior distributions, which is a particularly bad scenario for SMC performance. A solution is to consider distribution tempering or annealing (Neal, 2001; Del Moral et al., 2006), which creates a path of intermediate closer distributions for a more accurate SMC sampling (Figures 1 and 2). We name our approach PASOA, for Particle ba Sed Bayesian Optimal Adaptive design, where

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

Figure 1. Source location example at steps 5, 7, 9, 11. Over design steps, particles concentrate faster to true sources (red crosses) with PASOA (2nd line) than with SMC (1st line). Lower particle weights in blue, higher in yellow.

particles refer to samples featuring an evolving Bayesian posterior information that adaptively guides the successive design estimation. To our knowledge, such a combination, of stochastic optimization and SMC or tempered SMC, has not been proposed before for BOED. It allows to benefit from both techniques advantages. Stochastic optimization provides a more scalable way to handle sequential design optimization, while the principled SMC framework allows to efficiently perform an accurate parameter inference by reusing samples from past experiments. This high quality inference is exploited in turn via stochastic optimization to yield informative design estimators coming with good theoretical features. We show that these estimators are consistent and converge in probability to the optimal designs. On the numerical side, we verify their practical performance on benchmark design problems. We observe significantly higher information gains than previous methods, including sophisticated and reinforcement learning approaches (Blau et al., 2022; Foster et al., 2020), with these gains leading, in turn, to improved posterior inference and parameter estimation. We thus provide the first theoretical result of this kind in the BOED literature, while also highlighting tempering as a mean to greatly enhance the performance of SMC in BOED.

2. Related work

Simulations required for EIG optimization can be obtained via Markov Chain Monte Carlo (MCMC) algorithms, see e.g. Kleinegesse & Gutmann (2020) for a static design. In a sequential design context, the necessity to do so at each step is very simulation intensive (Amzal et al., 2006; Kleinegesse et al., 2020). Typically, Kleinegesse et al. (2020) extend Kleinegesse & Gutmann (2020) to a sequential setting but shows up to only K = 4 or 5 experiments. As an alternative, SMC samplers have been used in previous work, e.g. Drovandi et al. (2014; 2013) use SMC but only for finite-valued design parameters, avoiding the problematic optimization by reducing it to a finite number of comparisons. Another attempt (Kuck

et al., 2006) proposes SMC to handle the optimisation part but restricts to static one-step design. In our work, we consider both the continuous design and sequential setting, which was referred to as an open question by Ryan et al. (2016). Recent ideas using stochastic optimization (Blau et al., 2022; Kleinegesse & Gutmann, 2021; Foster et al., 2020; 2021; 2019; Huan & Marzouk, 2014) have since then paved the way to efficient optimisation formulation of sequential BOED but without consideration of SMC techniques and somewhat neglecting posterior distribution inference. A solution explored by Foster et al. (2019; 2020) uses mean field variational approximations of the posterior distributions but with no guarantee on their quality, while Foster et al. (2021); Blau et al. (2022) bypass the need for such posterior estimation using a reinforcement learning formulation. In this work, we show that superior performance than these previous solutions can be obtained with SMC, from which we also derive new good features. We provide theoretical results and guarantees that have not been formulated before in modern BOED and we shed a new light on tempering by showing that it can be crucial in SMCbased sequential BOED, to face the contradiction between searching for the largest information gain and maintaining good sampling performance.

3. Sequential Bayesian Optimal Experimental Design (BOED)

The Bayesian framework is a unified way to account for prior information via a probability distribution p(θ), for uncertainties about the observations y through a distribution p(y|θ, ξ), and for a design criterion (also called utility function) F(ξ, θ, y) describing the experimental aims. The prior is assumed to be independent of ξ and p(y|θ, ξ) available in closed-form.

Expected Information Gain (EIG). There exists various utility functions F depending on the targeted task (Ryan et al., 2016; Kleinegesse & Gutmann, 2021). In this

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

work, we focus on parameter estimation and consider an information-based utility leading to the so-called expected information gain (EIG). The EIG, denoted by I, admits several equivalent expressions, see e.g. (Foster et al., 2019). It can be written as the expected loss in entropy when accounting for an observation y at ξ (eq. (1)) or as a mutual information or expected Kullback-Leibler divergence (eq. (2)). Using p(y, θ|ξ) = p(θ|y, ξ)p(y|ξ) = p(y|θ, ξ)p(θ),

I(ξ) = Ep(y|ξ)[H(p(θ)) H(p(θ|Y, ξ)] (1)

= Ep(y|ξ) [KL(p(θ|Y, ξ), p(θ))] (2)

where random variables are indicated with uppercase letters, Ep[ ] denotes the expectation with respect to p, KL the Kullback-Leibler divergence and H(p(θ)) = Ep(θ)[log p(θ)] is the entropy of p. We thus look for ξ

satisfying ξ arg max ξ Rd I(ξ) . (3)

Before optimizing I(ξ), evaluating I(ξ) is often difficult due to the intractability of p(θ|y, ξ) and p(y|ξ).

Sequential design. Solving (3) is a static or one-step design problem. A single ξ or multiple {ξ1, , ξK} are selected prior to any observation, measurements {y1, , y K} are made for these design parameters and the experiment is stopped. The prior p(θ) can be used to encode previous observations but in static design, the selected designs depend only on the model. In contrast, in sequential or iterated design, K experiments are planned sequentially to construct an adaptive strategy, meaning that for the kth experiment, the best ξk is selected taking into account the previous design parameters and associated observations Dk 1 = {(y1, ξ1), , (yk 1, ξk 1)}. Then, yk is measured at ξk and Dk is updated into Dk = Dk 1 (yk, ξk). This approach is referred to as greedy or myopic, in that the next design ξk is chosen as if it was the last one. Non-myopic approaches exist, using for instance reinforcement learning principles (Blau et al., 2022; Foster et al., 2021) but with another layer of complexity and performance that are not always superior, see (Blau et al., 2022) or our results in Section 7. In this work, we limit ourselves to the greedy approach, replacing in (1) or (2) the prior p(θ) by our current belief on θ, namely p(θ|Dk 1) = p(θ|y1, ξ1, , yk 1, ξk 1), and to solve iteratively for

ξ k arg max ξ Rd Ik(ξ), (4)

where Ik(ξ) = E[H(p(θ|Dk 1)) H(p(θ|Y, ξ,Dk 1))]

= E [KL(p(θ|Y, ξ,Dk 1), p(θ|Dk 1))] (5)

and E is with respect to p(y|ξ, Dk 1). Observations are assumed conditionally independent so that p(θ|Dk) p(θ) Qk i=1p(yi|θ, ξi)which also leads to

p(θ|Dk) p(θ|Dk 1) p(yk|θ, ξk) . (6)

EIG contrastive bound optimization. Going back to the optimization in (3), we focus on the continuous design case and assume that quantities are differentiable when needed and that gradients are well defined. This is generally not a restrictive assumption except for discrete design spaces that may require specific treatments, see e.g. (Blau et al., 2022; Drovandi et al., 2014). A standard gradient ascent algorithm would consist, at iteration t, of updating ξt+1 = ξt + γt ξI(ξ)|ξ=ξt with a stepsize γt, but in practice both I(ξ) and its gradient ξI(ξ) are intractable. However, they can both be expressed as expectations, which naturally leads to consider stochastic approximation approaches (Borkar, 2008), among which the most popular is the Stochastic Gradient (SG) algorithm. In a BOED setting, if ξI(ξ) is expressed as an expectation ξI(ξ) = E[f(ξ, X)] over a random variable X, the SG iteration writes ξt+1 = ξt + γt f(ξt, xt) with xt a realisation of X. This assumes that we can differentiate under the integral sign in (2). There exist different ways to differentiate, including the popular reparametrization trick, but SG for I(ξ) remains difficult to perform due to the intractability of the integrand in (2). An alternative has been proposed by Foster et al. (2019; 2020) referred to as a variational approach. It consists of optimizing a tractable lower bound of I(ξ) and computing ξ via an alternate maximization. We also consider such a bound, IP CE, introduced by Foster et al. (2020) and named the Prior Contrastive Estimation (PCE) bound. It is based on contrastive samples from L additional variables θℓ, for ℓ= 1 : L, distributed following the prior p(θ) as θ, which is rewritten as θ0. IP CE is defined as

IP CE(ξ) = Ep(y|θ0,ξ) QL ℓ=0 p(θℓ) [F(ξ, θ0, , θL, Y)] (7)

with F(ξ, θ0, , θL, y)=log p(y|θ0, ξ)

1 L+1 PL ℓ=0 p(y|θℓ, ξ) .

IP CE is a lower bound I(ξ) IP CE(ξ) and the bound is tight when L tends to (see Foster et al. (2020) for a proof). It is tractable as all expressions p(y|θℓ, ξ) are tractable, and its gradient requires only the gradient ξp(y|θ, ξ). Stochastic approximation can be applied to maximize IP CE as ξIP CE(ξ) can be expressed as an expectation via reparametrization. More specifically, we assume that there exists a transformation T ξ θ0 such that Y = T ξ θ0(U) with U U independent of ξ and θ0 and easy to simulate, e.g. U is a standard Gaussian. It follows under some mild conditions specified in the Appendix that

ξIP CE(ξ)=Ep(u)QL ℓ=0p(θℓ) h ξF(ξ,θ0, ,θL,T ξ θ0(U)) i (8)

A similar bound and its gradient can be derived for sequential design optimization (4). Using conditional independence (6), at each step k, p(θ) only needs to be

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

replaced by the current posterior p(θ|Dk 1),

Ik P CE(ξ)=

Ep(y|θ0,ξ))QL ℓ=0 p(θℓ|Dk 1)[F(ξ, θ0, , θL,Y)] (9)

ξIk P CE(ξ)=Ep(u)QL ℓ=0p(θℓ|Dk 1) h ξF(ξ,θ0, ,θL,T ξ θ0(U)) i .

The stochastic gradient algorithm is summarized in Algorithm 1, with the additional possibility to estimate gradients with minibatches of size Nt, in line 5.

Algorithm 1 SG with minibatches(Nt)t=1:T at step k to optimize Ik P CE in (9)

1: Set T iterations, ξ0, stepsizes (γt)t=1:T 2: while t T do 3: Sample θi ℓ,t p(θ|Dk 1), ℓ=0:L, i=1:Nt 4: Sample ui t p(u), for i = 1 : Nt

5: Set t+1= 1

i=1 ξF(ξ,θi 0,t, ,θi L,t,T ξ θi 0,t(ui t))|ξ=ξt

6: Update ξt+1 = ξt + γt t+1

7: return ξ k = ξT or a Polyak averaging value

Optimizing the contrastive bound Ik P CE requires then sampling θ0, , θL from p(θ|Dk 1) (Algorithm 1 line 3). This is more costly than sampling from the prior, as p(θ|Dk 1) is only know up to a normalizing constant (6), but can be efficiently dealt with using a SMC approach as detailed in the next section.

4. Tempered Sequential Monte Carlo

µλτ 1 µλτ µ0 = pk 1 µ1 = pk

Figure 2. Tempered SMC, a SMC step from pk 1 to pk (blue) performed in T intermediate tempering steps (red).

SMC samplers extend the idea of importance sampling by re-using samples from one distribution to another, and benefit from numerous theoretical results (Del Moral et al., 2006; Doucet & Lee, 2018; Naesseth et al., 2019; Chopin & Papaspiliopoulos, 2020; Dai et al., 2022). More specifically, denoting P(Θ) the set of probability measures on Θ, the goal is to provide samples (called particles) from a sequence of probability distributions {pk}k=1:K in P(Θ). To simplify, we deal with probability densities assuming absolute continuity with respect to the Lebesgue measure but the setting is more general, e.g. (Chopin & Papaspiliopoulos, 2020). A standard MCMC approach would require to build an ergodic kernel Mk and to run

Algorithm 2 Adaptive tempered SMC at step k

1: Set τ =0, λ0 =0, M, ESSmin, Mλ, resample 2: Sample θ1:M 0 p(θ|Dk 1) = µλ0(θ) 3: Set wi 0 = 1

M , i=1 : M and w1:M 0 ={w1 0, , w M 0 } 4: while λτ < 1 do

5: Set τ = τ + 1 6: Set θ 1:M τ 1 =resample(θ1:M τ 1, w1:M τ 1)) µλτ 1 7: Sample θi τ Mλτ 1( θ i τ 1, ) for i=1:M,

8: Set θ1:M τ = {θ1 τ, , θM τ } Resampling and Markov kernel step

9: Solve for γ, ( PM i=1 p(yk|θi τ ,ξk) γ) 2

PM i=1 p(yk|θi τ ,ξk) 2γ = ESSmin

10: Set λτ = λτ 1+γ Tempered step

11: Set wi τ = p(yk|θi τ, ξk)γ, wi τ = wi τ M P

12: Set w1:M τ = {w1 τ, , w M τ } Compute new weights

13: return θ1:M k = θ1:M τ , w1:M k = w1:M τ for a particle

approximation p M k = M P

i=1 wi kδθi k of p( |Dk) = µ1

a Markov chain from scratch for each pk. In contrast, SMC samplers provide the possibility to approximate pk recycling samples from pk 1. SMC samplers aim at propagating M particles θ1:M k 1 = {θ1 k 1, , θM k 1} and their corresponding weights w1:M k 1 = {w1 k 1, , w M k 1} in such a way that the empirical distribution p M k of the particles at time k converges to pk in some sense: meaning that for all integrable functions ϕ, Ep M k [ϕ(θ)] = PM i=1 wi kϕ(θi k) M Epk[ϕ(θ)]. However, as showed

by Agapiou et al. (2017), the number of particles M required for an accurate particle approximation p M k scales exponentially with the Kullback-Leibler distance between the proposal pk 1 and target pk distributions. For pk 1(θ) = p(θ|Dk 1) and pk(θ) = p(θ|Dk), this is problematic as EIG optimization aims at increasing this distance, see (4,5). Moving from p(θ|Dk 1) to p(θ|Dk) with just one SMC step might then yield poor results. A solution is to consider tempering (Neal, 2001; Del Moral et al., 2006; Syed et al., 2021) to move along a sequence of probability distributions interpolating between p(θ|Dk 1) and p(θ|Dk). A tempering path is a sequence of the form µλτ with 0 = λ0 < λτ < . . . < λT = 1 where µ0 = pk 1 and µ1 = pk (Figure 2). Usually, only the initial and final distributions are imposed. Intermediate distributions µλτ are not of interest so that the λτ s can be chosen as desired. A popular approach is to use what is know as the geometric path: µλτ (θ) pk 1(θ)1 λτ pk(θ)λτ , which using (6), takes the form µλτ (θ) p(θ|Dk 1) p(yk|θ, ξk)λτ or

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

equivalently µλτ (θ) µλτ 1(θ) p(yk|θ, ξk)λτ λτ 1. As setting the sequence λτ manually can be a challenging task with disappointing results, we follow the adaptive strategy proposed by Jasra et al. (2011). Given a userset threshold ESSmin interpreted as an effective sample size, at iteration τ of the tempered SMC procedure, given a current set of particles θ1:M τ , we set recursively λτ = λτ 1 + γ with γ the solution in [0, 1 λτ 1] of the

equation ( PM i=1 p(yk|θi τ ,ξk) γ) 2

PM i=1 p(yk|θi τ ,ξk) 2γ = ESSmin . If γ is not in

[0, 1 λτ 1], λτ is set to 1 and the tempering stops. This is a relatively simple task to solve with numerical root finding. This procedure guarantees that the SMC approximation error remains stable over iterations, see (Jasra et al., 2011) for details. It can be interpreted as a way to control the Chisquare pseudo-distance between the successive distributions, see (Chopin & Papaspiliopoulos, 2020) Proposition 17.2. Tempered SMC then requires like SMC an unbiased resampling scheme denoted by resample(θ1:M, w1:M). Resampling is the action of drawing randomly from a weighted sample, so as to obtain an unweighted sample. Several unbiased resampling schemes are listed in Chapter 9 of Chopin & Papaspiliopoulos (2020) and studied in e.g. (Gerber et al., 2019; Crisan & Doucet, 2002). The most standard one is multinomial resampling which draws samples independently according to their weights. Tempered SMC also requires a family of Markov kernels (Mλ)λ so that Mλ : Θ P(Θ) leaves µλ invariant. Tempering is illustrated in Figure 2 and in Algorithm 2.

5. Particle EIG contrastive bound approximation and optimization

Algorithms 1 and 2 can then be combined iteratively. We call the resulting algorithm PASOA. Algorithm 2 at step k provides a particle approximation of p(θ|Dk) used in Algorithm 1 line 3 at step k + 1 to optimize the next EIG contrastive bound Ik+1 P CE and get ξ k+1. A new yk+1 is then measured at ξ k+1, Dk+1 is set to Dk {yk+1, ξ k+1} and Algorithm 2 is used again to get a particle approximation of p(θ|Dk+1) etc. More specifically, the M weighted particles produced by Algorithm 2 are used to approximate the intractable p(θ|Dk) and to simplify the sampling of the L + 1 contrastive variables in line 3 of Algorithm 1 at step k + 1. In practice, line 3 can be performed in different ways. We propose the following one, which has good numerical and asymptotic properties (see Section 6). Due to the use of generally large numbers L (typically L = 200) of contrastive samples, expectations of interest, e.g. (7,8), are computed in a large dimensional space ΘL+1 even if Θ is of moderate dimension. A simple way to mitigate the dimension impact is to start from M = N(L + 1) particles using a SMC procedure on Θ and partition them into L + 1 disjoint subsets of N particles, denoted by

θ1:N k,ℓ, for ℓ= 0 : L, with their associated weights denoted by W 1:N k,ℓ with PN i=1 W i k,ℓ= 1. Seeing particles and weights as random variables, if the resampling procedure used in Algorithm 2 is so that the resampled values are independent conditionally on the previous particles, then the L + 1 collections of random variables denoted by ζN k,ℓ= {W i k,ℓ, θi k,ℓ}i=1:N are independent and identically distributed (i.i.d.) conditionally on the previous particles. For instance, multinomial resampling preserves conditional independence. From the weighted particles produced by Algorithm 2, we then derive L + 1 i.i.d. random probability measures P N k,ℓ= PN i=1 W i k,ℓδθi k,ℓ(δθ is the Dirac measure at θ) and the gradient estimate at each iteration t, Algorithm 1 line 5, is computed with θi 0,t, . . . , θi L,t values sampled independently from realizations p N k,ℓof the respective independent particle approximations P N k,ℓfor i = 1 : Nt and ℓ= 0 : L. We thus define P N k,ℓon Θ and use in our SG algorithm approximated distributions on ΘL+1 which are product measures L ℓ=0P N k,ℓ. It follows a SG procedure, detailed in the Appendix (Algorithm 3), which produces an estimator ξ k+1,N from the optimization of a particle approximation of Ik+1 P CE defined as

Ik+1,N P CE (ξ)=Ep(u)QL ℓ=0 p N k,ℓ(θℓ) h F(ξ,θ0, ,θL,T ξ θ0(U)) i (10)

The advantage of such a product form particle approximation over standard SMC lies in the following observation. Standard SMC would consist in using QL ℓ=0p(θℓ) P N k (θ0, , θL) where

ℓ=0 W i k,ℓ

δ(θi k,0, ,θi k,L) . (11)

As an alternative, with the same M weights and particles, our product form approach is using QL ℓ=0 p(θℓ) QL ℓ=0 p N k,ℓ(θℓ), which can be rewritten as a sum, i.e.

ℓ=0 p N k,ℓ=

ℓ=0 W iℓ k,ℓ

δ(θi0 k,0, ,θ i L k,L) . (12)

There are two main differences between (11) and (12). First, (12) is more statistically efficient: the θi k,ℓare i.i.d. and so all permutations (θi0 k,0, , θi L k,L) of these samples, for 1 i0, . . . , i L N, are identically distributed. Hence (12) averages over N L+1 tuples while its conventional counterpart (11) only averages over N tuples. This increase in tuple number leads to a decrease in estimators variance. With the same particles, product-form (12) makes the most out of every sample available. Second, in contrast to tuples in (11), those in (12) are not independent because the same θi k,ℓappears in many terms of the sum. As a consequence, using product form approximations requires to generalize

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

standard SMC results, which rely on the independence of the terms in the sum. In the next section, we take that into account to show the convergence of the quantities involved in PASOA.

6. Asymptotic properties

With the contrastive bound approach, to increase approximations quality, we can play with L the number of contrastive samples and N the number of particles. It has already been shown by Foster et al. (2020) that increasing L improves the quality of the IP CE bound that becomes tight, i.e. to the limit equal to the EIG. We then study the behavior of our approximations when the number N of particles tends to . The sequential design values produced can be seen as realizations of random estimators {ξ k+1,N}N 1 targeting points of maximum of random criterion functions {Ik+1,N P CE (ξ)}N 1. We use here the same notation for random quantities and their realizations. We first establish the convergence of the criterion functions adapting standard SMC techniques in Proposition 1, and then the convergence of their points of maximum, i.e. the consistency of the {ξ k+1,N}N 1 to the optimal designs in Proposition 2.

Convergence of product particle approximations. With the use of contrastive samples, expectations are computed in ΘL+1. Particles could then be considered as (L + 1)-dimensional elements in this space so that standard SMC convergence results could apply (Chopin & Papaspiliopoulos, 2020), Chapter 11. However, working in a (L + 1)-dimensional space is more challenging for SMC and not necessary for the type of convergence result we need. As importance sampling, SMC suffers from the weights variance which scales unfavorably with the dimension of the problem (Naesseth et al., 2019; Chopin & Papaspiliopoulos, 2020). In contrast, in Section 5, we define P N k,ℓon Θ and use in our SG algorithm (Algorithm 3 in Appendix) approximated distributions on ΘL+1 which are product measures L ℓ=0P N k,ℓ. Such products have been used in various settings but have been only recently studied in a more general way (Kuntz et al., 2022). However, theoretical results therein do not cover the use of product form estimators within SMC samplers. Other papers that consider structured SMC settings (Rebeschini & van Handel, 2015; Lindsten et al., 2017; Kuntz et al., 2024; Aitchison, 2019) differ from our work in key aspects further discussed in Appendix Section C.1.

Let Cb(ΘL+1) denote the set of functions ϕ : ΘL+1 R that are measurable and bounded, and let ||ϕ|| denote the supremum norm ||ϕ|| = supθ ΘL+1 |ϕ(θ)|. Using the notation and terminology of Chopin & Papaspiliopoulos (2020) and previous authors before them, we introduce

the potential function Gk,τ which in our setting is equal to Gk,τ(θ) = p(yk|θ, ξk)γ and used in Algorithm 2 to compute weights. We simplified the notation but γ may also depend on k and τ. More specifically, in standard tempering, γ is a preset constant but in Algorithm 2, we use an adaptive tempering where γ depends on previous particles. Although adaptive tempering schemes are often more challenging to study, e.g. (Salomone et al., 2018; Beskos et al., 2016; Del Moral et al., 2012), this is not problematic in our setting as γ 1 and all we require is Gk,τ to be bounded.

For our result, the assumption that p(yk|θ, ξk) is bounded and strictly positive is enough but more general results and cases are available in (Beskos et al., 2016; Jasra et al., 2011). The following proposition establishes the convergence in L2-norm, which also implies the convergence in L1-norm and in probability.

Following Chopin & Papaspiliopoulos (2020); Crisan & Doucet (2002), we can establish similarly Lp-norm (p > 2) and almost-sure convergence but convergence in probability is sufficient for consistency of the design estimators. We denote by ζN k,L the collection of random variables ζN k,L = {W i k,ℓ, θi k,ℓ}i=1:N,ℓ=0:L, and use the short notation p L+1 k [ϕ] for p L+1 k [ϕ] = EQL ℓ=0 p(θℓ|Dk)[ϕ(θ0, , θL)].

Proposition 1 (L2 convergence) Using Algorithm 2 with multinomial resampling and assuming that all potential functions Gk,τ are upper bounded, there exists a constant ck such that, for all functions ϕ Cb(ΘL+1),

ℓ=0 W iℓ k,ℓϕ(θi0 k,0, , θi L k,L) p L+1 k [ϕ]

ck c N,L ϕ 2

with c N,L = 1 (1 1 N )L+1 L+1

N and where the expectation is taken over all the realizations of the random tempered SMC method, or equivalently on ζN k,L.

The proof is detailed in Appendix and extends the proof in Chapter 11 of Chopin & Papaspiliopoulos (2020) which is itself mainly based on (Crisan & Doucet, 2002). Note that the first term in the outer expectation above corresponds to the expectation of ϕ with respect to L ℓ=0P N k,ℓ, which is a random variable as the P N k,ℓare random measures. For any ξ E, when ϕ is set to ϕξ(θ0, , θL) = f P CE(ξ, θ0, , θL),

with f P CE(ξ, θ0, , θL) = Ep(u) h F(ξ,θ0, ,θL,T ξ θ0(U)) i

assumed to be bounded on ΘL+1 for all ξ, then p L+1 k [ϕξ] = Ik+1 P CE(ξ), the PCE lower bound (9) we seek to optimize at step k + 1. Similarly, the expectation of ϕξ with respect to L ℓ=0P N k,ℓis Ik+1,N P CE (ξ) in (10). It follows from Proposition 1 that

n Ik+1,N P CE (ξ) Ik+1 P CE(ξ) o2 ck c N,L ϕ 2 ,

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

or in other words that for all ξ E, the sequence of random variables {Ik+1,N P CE (ξ)}N 1 converges in L2-norm to Ik+1 P CE(ξ) when N tends to . Using Chebyshev s inequality this also implies the convergence in probability, pointwise in ξ, that is for all ξ, for all ϵ > 0,

lim N pζN k,L

Ik+1,N P CE (ξ) Ik+1 P CE(ξ) ϵ = 0. (13)

To further comment this result and give more insight on the constants involved, we note that interestingly, if the normalized potential functions are bounded by 1 (see Appendix Section C.2 for details), the constant denoted by ck in Proposition 1 remains low and decreases when L increases. Otherwise, the constant may increase very fast with k and, even if k is relatively low in BOED, the bound may become uninformative for finite N. This is actually a well referenced behavior in SMC, see e.g. (Chopin & Papaspiliopoulos, 2020). In our work, we limited the presentation to simple conditions but with stronger assumptions, in particular on the Markov kernels, it is possible to limit the growing of ck over steps k, see Section 11.4. in (Chopin & Papaspiliopoulos, 2020).

Consistency of the sequential design estimators. Under additional assumptions, specified in Proposition 2, it follows from Proposition 1 that the design parameters found at each step of our procedure tend to the ideal ones.

Assume Ik+1 P CE(ξ) reaches its maximum in ξ k+1 and consider a sequence {ξ k+1,N}N 1 of estimators defined by

ξ k+1,N arg max ξ E Ik+1,N P CE (ξ) . (14)

Proposition 2 states that {ξ k+1,N}N 1 are consistent estimators of ξ k+1, i.e. that for all ϵ > 0,

lim N pζN k,L ξ k+1,N ξ k+1 ϵ = 0. (15)

The assumptions in Proposition 2 are in a simplified form. More general situations could be handled but would require more complex developments while not changing the general idea. To simplify the presentation, our first assumption (A1) is that the design space E is compact. The second assumption (A2) is to ensure that the point of maximum ξ k+1 is isolated or well-separated and that only design values in the neighborhood of ξ k+1 reaches values close to Ik+1 P CE(ξ k+1) (see Lemma 5 in Appendix). Assumption (A3) indicates that estimators ξ k+1,N could be points of near maximum only. The formulation of (A3) uses additional variables ρN that account for the fact that estimators ξ k+1,N are usually only approximations of the maxima in (14), as in practice the optimization task is solved numerically with a given precision that can be controlled in the sense that a

bound on the approximation error can be provided. If we assume that ξ k+1,N is an exact maximizer of Ik+1,N P CE (ξ), then (A3) is trivially satisfied with ρN = 0. In our setting, ξ k+1,N is obtained by a stochastic gradient algorithm for which we could exhibit the ρN sequence but this is a particular case left to the interested reader. This point of view, the assumptions and the technique of proof we are using, are similar to the ones used for establishing asymptotic properties of M-estimators (van der Vaart, 1998; van der Vaart & Wellner, 1996). A proof is given in Appendix, using the pointwise convergence (13) from Proposition 1 and several intermediate results.

Proposition 2 (Consistency) Assume

A1 E Rd is a compact set.

A2 For all ξ = ξ k+1, Ik+1 P CE(ξ) < Ik+1 P CE(ξ k+1)

A3 There exists a sequence {ρN}N 1 of positive random variables and a sequence of random variables {ξ k+1,N}N 1 in E that satisfy

ϵ > 0, lim N pζN k,L (ρN ϵ) = 0

lim inf N pζN k,L

Ik+1,N P CE (ξ k+1,N) Ik+1,N P CE (ξ k+1) ρN =1

Then the sequence {ξ k+1,N}N 1 is consistent, i.e. for all ϵ > 0,

lim N pζN k,L ξ k+1,N ξ k+1 ϵ = 0.

7. Numerical experiments

To benchmark our method in terms of information gained, we use the sequential prior contrastive estimation (SPCE) and sequential nested Monte Carlo (SNMC) bounds introduced in (Foster et al., 2021) and used in (Blau et al., 2022). For a number K of experiments leading to ξ1, , ξK, and L contrastive variables, SPCE and SNMC are respectively lower and upper bounds for the total EIG, i.e. the expected information gained from the entire sequence ξ1, , ξK and they become tight when L tends to . Their exact expressions are given in Appendix Sections D.1 and D.2. They have the advantage of using only samples from the prior p(θ) and not from the successive posterior distributions, thus not considering posterior approximation errors and making them convenient criteria to compare methods on design sequences only. Methods can be compared via their [SPCE, SNMC] intervals which contain the total EIG. However, one benefit of our approach is to also provide accurate posterior estimation for subsequent accurate parameter estimation. To assess this aspect, we use the L2 Wasserstein distance between the weighted

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

particles yielded by our method and the true parameter θ. We consider two other recent approaches, namely a reinforcement learning-based approach RL-BOED from (Blau et al., 2022) and the variational prior contrastive estimation VPCE of Foster et al. (2020). RL-BOED is a non-myopic approach, while VPCE involves iteratively optimizing IP CE in a myopic manner but estimating posterior distributions with variational inference. We also compare with a non tempered version of our approach and with a random baseline, where the observations {y1, , y K} are simulated with designs generated randomly. For methods that do not provide posterior estimations or poor quality ones (RL-BOED, VPCE, Random), we compute Wasserstein distances on posterior samples obtained by using tempered SMC on their design and observation sequences. In contrast, naive SMC Wasserstein distances are computed on the SMC posterior samples to better assess the impact of tempering via the comparison with PASOA.

Source location. We consider the 2D location finding experiment used in (Foster et al., 2021; Blau et al., 2022). It consists of S hidden sources in R2 whose locations θ = {θ1, , θS} are unknown. Each source emits a signal whose intensity attenuates according to the inverse-square law. The measured signal is the superposition of all these signals. The design problem is to choose where to make the measurements to best learn the source locations. If a measurement is performed at a point ξ R2, the signal strength is µ(θ, ξ) = b + PS s=1 αs m+||θs ξ||2 2 where αs, b and m are constants. A standard Gaussian prior is assumed for each θs N(0, I) and the likelihood is assumed lognormal, (log y|θ, ξ) N(log µ(θ, ξ), σ) with standard deviation σ. We set S = 2, α1 = α2 = 1, m = 10 4, b = 10 1, σ = 0.5 and we plan K = 30 successive design optimisations. The Markov kernel is that of a Metropolis Hasting scheme with a Gaussian proposal centered at the current particle with a variance set to the empirical variance of the particles. We use L = 200 contrastive variables for the Ik P CE bound. Algorithm 2 is used to get N = 100 simulations θ1:N ℓ of each contrastive variable. The Adam algorithm (Kingma & Ba, 2015) is then used with standard hyperparameters to perform the stochastic gradient.

The whole experiment is repeated 100 times but drawing source locations at random each time. Figure 3, column 1, shows, with respect to k, the median and standard error for SPCE, SNMC and the L2 Wasserstein distances between weighted particles and the true source locations.

A first observation is that design optimization leads to a significant improvement over the naive random baseline. For VPCE and RL-BOED we recover the results shown in Figure 1 of Blau et al. (2022). Our method leads to a significant improvement, both in terms of information gain and posterior estimation. It improves by 30% RL-BOED

results on SPCE and provides much higher SNMC. The L2 Wasserstein distance is two order of magnitude lower, suggesting the higher quality of our measurements.

Figure 3. Source location. Column 1: median and standard error over 100 rollouts for SPCE (top), SNMC (middle) and L2 Wasserstein distance (bottom) with respect to the number of experiments k. Column 2: impact of the number of particles (5K to 1M) on median SPCE, SNMC and Wasserstein distance for PASOA (plain) and SMC (dotted). Note the logarithmic scale.

Both plain SMC and PASOA outperform the other methods. Additional illustration of the benefit of tempering is visible in Figure 1 where PASOA better estimates the posterior concentrating around the true sources when SMC leads to more scattered particles and is less efficient in removing particles from non informative locations. Tempering allows to reduce the number of particles, outperforming SMC even with much fewer particles, e.g. 5e3 vs 106 for a lower computation time (Figure 3, column 2, and Table 1). The number of tempering steps tends to decrease with the number of experiments as more information is gathered (Appendix Figure 6). PASOA is also more robust to prior misspecification (Figure 7).

Constant Elasticity of Substitution (CES). In this other model (Blau et al., 2022; Foster et al., 2020), an agent compares two baskets of goods with 3 items each and gives a rating in [0, 1]. The design is a 6-dimensional vector representing quantities for each item in each basket. There are 3 parameters θ = (ρ, α = (α1, α2, α3), u) in dimension 5, which have to be recovered from the

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

Table 1. Impact of the number of particles on average computation times for Source location on a V100 GPU and on average total tempering steps for PASOA.

Particles 50 000 100 000 320 000 1 000 000 SMC 25.1s 0.1s 30.1s 0.1s 61.7 0.1s 130.1 0.2s PASOA 25.9s 0.1s 31.4s 0.1s 63.4 0.1s 134.4 0.2s Tempering steps 120.2 120.0 116.1 114.2

Figure 4. CES example. Median and standard error over 100 rollouts, with respect to the number of experiments k, for SPCE -plain , SNMC -dashed (left) and Wasserstein distance (right).

agent s ratings of different baskets. The design task is challenging since most basket configurations provide very little information. Figure 4 shows the obtained SPCE, SNMC and Wasserstein distance curves for K = 10 experiments. SMC-based methods outperform again RLBOED and VPCE, both in terms of design sequences and Wasserstein distances. For PASOA, the [SPCE, SNMC] intervals (Figure 4 left) are distinctly above the other ones. For clarity sake, the SMC [SPCE, SNMC] intervals are not plotted but shown in Appendix Figure 8.

For a global view of the methods, Table 2 summarizes the main features and running times, with more explanations in Appendix Section D.3. Additional experimental results and implementation details are also given in the Appendix. The chosen models are benchmarks used in BOED. A real-word application would be a great addition but is out of the scope of this paper.

Table 2. Main features of the compared methods. Column 5 shows training times required for amortization.

Method Posterior Amortized Non-myopic Training Time PASOA SMC

RL-BOED (Blau et al., 2022) CES: 20h Sources: 10h VPCE (Foster et al., 2020)

8. Conclusion

We have introduced a new Bayesian sequential design optimization algorithm which also provides posterior distribution estimates for parameter inference. The

procedure uses a tempering principle to handle the fact that maximizing information gain jeopardizes standard SMC samplers accuracy. Although greedy, our approach performs better than the long-sighted reinforcement learning approach of Blau et al. (2022). As already observed by Blau et al. (2022), a possible explanation is the use of posterior information in our optimisation process. Moreover, the lesser performance of VPCE, which uses suboptimal variational posterior approximations, suggests that the key to a good EIG optimization is an accurate posteriors estimation, which can be achieved via SMC and further improved with tempering. Accurately estimating posterior distributions seems more important and more critical than planning multiple steps ahead. In addition, when the design parameter is high dimensional, all methods are likely to suffer from the difficulty of searching in a high dimensional design space, but as a myopic approach PASOA is likely to suffer less than non myopic approaches for which the search space grows exponentially as the number of experiments increases. Nevertheless, investigating the possibility to extend our approach via a policy-based method would be interesting. Another difficulty may come from a high dimensional parameter. Augmenting the parameter dimension makes inference more difficult. If posterior inference fails, PASOA may loose its specific advantage and result in design of lesser quality. Methods that do not rely on these posterior estimation are less prone to this loss of design performance. However, the goal of experimental design is to provide informative data to infer and ultimately gain information on the parameter of interest. Even if other methods may then provide better SPCE values, this is useless if the posterior is impossible to estimate correctly. In practice, for PASOA, one might use a more sophisticated tuned Markov kernel and exploit the advantage of tempering that a good kernel allows to scale to higher dimensions, see Section 17.2.3 in (Chopin & Papaspiliopoulos, 2020). At last, investigating amortized simulation-based inference such as (Ivanova et al., 2021; Kleinegesse & Gutmann, 2021; Kleinegesse et al., 2020) to handle models which are only available through simulations would be an important next step for applications.

Acknowledgements

The authors thank the reviewers and area chair for their interesting and useful comments and the Inria Challenge project ROAD-AI for partial funding. This work was performed using HPC/AI resources from GENCI-IDRIS (Grant 2023-AD011014217R1). Pierre Alliez is supported by the French government, through the 3IA Côte d Azur Investments in the Future project managed by the National Research Agency ANR-19-P3IA-0002.

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

Agapiou, S., Papaspiliopoulos, O., Sanz-Alonso, D., and Stuart, A. M. Importance sampling: Intrinsic dimension and computational cost. Statistical Science, pp. 405 431, 2017.

Aitchison, L. Tensor Monte Carlo: Particle Methods for the GPU era. In Advances in Neural Information Processing Systems 2019, pp. 7146 7155, 2019.

Amzal, B., Bois, F., Parent, E., and Robert, C. P. Bayesian Optimal Design via Interacting Particle Systems. Journal of the American Statistical Association, 101(474):773 785, 2006.

Babuschkin, I., Baumli, K., Bell, A., Bhupatiraju, S., Bruce, J., Buchlovsky, P., Budden, D., Cai, T., Clark, A., Danihelka, I., Dedieu, A., Fantacci, C., Godwin, J., Jones, C., Hemsley, R., Hennigan, T., Hessel, M., Hou, S., Kapturowski, S., Keck, T., Kemaev, I., King, M., Kunesch, M., Martens, L., Merzic, H., Mikulik, V., Norman, T., Papamakarios, G., Quan, J., Ring, R., Ruiz, F., Sanchez, A., Schneider, R., Sezener, E., Spencer, S., Srinivasan, S., Stokowiec, W., Wang, L., Zhou, G., and Viola, F. The Deep Mind JAX Ecosystem, 2020.

Benassi, R., Bect, J., and Vazquez, E. Bayesian optimization using sequential monte carlo. In Learning and Intelligent Optimization. LION 2012. Lecture Notes in Computer Science, volume 7219, pp. 339 342. Springer, Berlin, Heidelberg, 11 2011.

Beskos, A., Jasra, A., Kantas, N., and Thiery, A. On the convergence of adaptive sequential Monte Carlo methods. Annals of Applied Probability, 26(2):1111 1146, April 2016.

Blau, T., Bonilla, E. V., Chades, I., and Dezfouli, A. Optimizing sequential experimental design with deep reinforcement learning. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pp. 2107 2128. PMLR, 17 23 Jul 2022.

Borkar, V. Stochastic approximation: a dynamical view point. Cambridge University Press, 2008.

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., Vander Plas, J., Wanderman-Milne, S., and Zhang, Q. JAX: composable transformations of Python+Num Py programs, 2020.

Chaloner, K. and Verdinelli, I. Bayesian experimental design: A review. Statistical Science, 10(3):273 304, August 1995.

Chopin, N. and Papaspiliopoulos, O. An introduction to sequential Monte Carlo, volume 4. Springer, 2020.

Crisan, D. and Doucet, A. A survey of convergence results on particle filtering methods for practitioners. IEEE Transactions on Signal Processing, 50(3):736 746, 2002.

Cuturi, M., Meng-Papaxanthos, L., Tian, Y., Bunne, C., Davis, G., and Teboul, O. Optimal Transport Tools (OTT): A JAX Toolbox for all things Wasserstein. ar Xiv preprint ar Xiv:2201.12324, 2022.

Dai, C., Heng, J., Jacob, P. E., and Whiteley, N. An invitation to sequential Monte Carlo samplers. Journal of the American Statistical Association, 117(539):1587 1600, June 2022.

Del Moral, P., Doucet, A., and Jasra, A. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411 436, 2006.

Del Moral, P., Doucet, A., and Jasra, A. On Adaptive Resampling Procedures for Sequential Monte Carlo Methods. Bernoulli, 18(1):252 278, 2012.

Doucet, A. and Lee, A. Sequential Monte Carlo methods. In Handbook of Graphical Models: 165 189. Ed. by M. Maathuis, M. Drton, S. L. Lauritzen, and M. Wainwrigth, 2018.

Drovandi, C., Mc Gree, J. M., and Pettitt, A. N. A sequential Monte Carlo algorithm to incorporate model uncertainty in Bayesian sequential design. Journal of Computational and Graphical Statistics, 23(1):3 24, March 2014.

Drovandi, C. C., Mc Gree, J., and Pettitt, A. N. Sequential Monte Carlo for Bayesian sequentially designed experiments for discrete data. Computational Statistics & Data Analysis, 57(1):320 335, 2013.

Foster, A., Jankowiak, M., Bingham, E., Horsfall, P., Teh, Y. W., Rainforth, T., and Goodman, N. Variational Bayesian Optimal Experimental Design. In Advances in Neural Information Processing Systems, pp. 14059 14070, 2019.

Foster, A., Jankowiak, M., O Meara, M., Teh, Y. W., and Rainforth, T. A Unified Stochastic Gradient Approach to Designing Bayesian-Optimal Experiments. In Proceedings of the 23rd International Conference in Artificial Intelligence and Statistics, volume 108, pp. 2959 2969, 2020.

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

Foster, A., Ivanova, D. R., Malik, I., and Rainforth, T. Deep Adaptive Design: Amortizing Sequential Bayesian Experimental Design. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pp. 3384 3395. PMLR, July 2021.

Gerber, M., Chopin, N., and Whiteley, N. Negative association, ordering and convergence of resampling methods. Annals of Statistics, 47(4):2236 2260, August 2019.

Henrández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. Predictive Entropy Search for Efficient Global Optimization of Black-Box Functions. In Advances in Neural Information Processing Systems, volume 1, pp. 918 926, 2014.

Huan, X. and Marzouk, Y. Gradient-based stochastic optimization methods in Bayesian experimental design. International Journal for Uncertainty Quantification, 4 (6):479 510, 2014.

Ivanova, D. R., Foster, A., Kleinegesse, S., Gutmann, M. U., and Rainforth, T. Implicit Deep Adaptive Design: Policy Based Experimental Design without Likelihoods. In Advances in Neural Information Processing Systems, volume 34, pp. 25785 25798, 2021.

Jasra, A., Stephens, D. A., Doucet, A., and Tsagaris, T. Inference for Lévy-driven stochastic volatility models via adaptive sequential Monte Carlo. Scandinavian Journal of Statistics, 38(1):1 22, 2011.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diega, CA, USA, 2015.

Kleinegesse, S. and Gutmann, M. U. Bayesian experimental design for implicit models by mutual information neural estimation. In Proceedings of the 37th International Conference on Machine Learning, volume 493, pp. 5316 5326. PMLR, July 2020.

Kleinegesse, S. and Gutmann, M. U. Gradient-based Bayesian Experimental Design for Implicit Models using Mutual Information Lower Bounds. Ar Xiv, abs/2105.04379, 2021.

Kleinegesse, S., Drovandi, C., and Gutmann, M. U. Sequential Bayesian Experimental Design for Implicit Models via Mutual Information. Bayesian Analysis, 16: 773 802, July 2020.

Kuck, H., de Freitas, N., and Doucet, A. SMC Samplers for Bayesian Optimal Nonlinear Design. In IEEE Nonlinear Statistical Signal Processing Workshop, pp. 99 102, 2006.

Kuntz, J., Crucinio, F., and Johansen, A. Product-form estimators: exploiting independence to scale up Monte Carlo. Statistics and Computing, 32(12), 2022.

Kuntz, J., Crucinio, F., and Johansen, A. The divide-andconquer sequential Monte Carlo algorithm: theoretical properties and limit theorems. Annals of Applied Probability, 34(1B):1469 1523, Feb. 2024.

Lao, J. and Louf, R. Blackjax: A sampling library for JAX. Astrophysics Source Code Library, 2022.

Lindsten, F., Johansen, A., Naesseth, C., Kirkpatrick, B., Schön, T., Aston, J., and Bouchard-Côté, A. Divideand-Conquer With Sequential Monte Carlo. Journal of Computational and Graphical Statistics, 26(2):445 458, 2017.

Moffat, H., Hainy, M., Papanikolaou, N., and Drovandi, C. Sequential experimental design for predator-prey functional response experiments. Journal of Royal Society Interface, 17(166), 2020.

Naesseth, C. A., Lindsten, F., and Schön, T. B. Elements of Sequential Monte Carlo. Foundations and Trends in Machine Learning, 12(3):307 392, 2019.

Neal, R. Annealed importance sampling. Statistics and Computing, 11:125 139, April 2001.

Rainforth, T., Foster, A., Ivanova, D. R., and Bickford Smith, F. Modern Bayesian Experimental Design. Statistical Science, 39(1):100 114, Feb. 2024.

Rebeschini, P. and van Handel, R. Can local particle filters beat the curse of dimensionality? The Annals of Applied Probability, 25(5):2809 2866, Oct. 2015.

Ryan, E. G., Drovandi, C. C., Mc Gree, J. M., and Pettitt, A. N. A review of modern computational algorithms for Bayesian optimal design. International Statistical Review, 84(1):128 154, April 2016.

Salomone, R., South, L., Drovandi, C. C., and Kroese, D. P. Unbiased and Consistent Nested Sampling via Sequential Monte Carlo. Ar Xiv, 2018.

Sebastiani, P. and Wynn, H. P. Maximum entropy sampling and optimal Bayesian experimental design. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(1):145 157, 2000.

Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951 2959, 2012.

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

Syed, S., Bouchard-Côté, A., Deligiannidis, G., and Doucet, A. Non-reversible parallel tempering: a scalable highly parallel MCMC scheme. Journal of the Royal Statistical Society (Series B), 84:321 350, 2021.

van der Vaart, A. W. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998.

van der Vaart, A. W. and Wellner, J. A. Weak Convergence and Empirical Process: With Applications to Statistics. springer, 1996.

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

In this appendix, we provide proofs for the theoretical results in our paper and additional details on the implementation and numerical experiments.

A. Particle EIG contrastive bound optimization

The algorithm below summarizes the procedure developed for design optimization. It is the result of the combination of Algorithms 1 and 2 in the main body. It results from the replacement in the former of the current posterior p(θ|Dk 1) by its particle approximation resulting from the adaptive tempering in Algorithm 2.

Algorithm 3 Particle EIG contrastive bound stochastic optimization at step k + 1

1: Set T iterations, ξ0, stepsizes (γt)t=1:T 2: Run tempered SMC Algorithm 2 (main body) at step k for M = N(L + 1), and set L + 1 independent particle approximations P N k,ℓ, for ℓ= 0 : L (see main paper Section 5 for details) 3: while t T do 4: Sample θi ℓ,t P N k,ℓ, for ℓ=0:L, i=1:Nt 5: Sample ui t p(u), for i = 1 : Nt

6: Set t+1= 1

i=1 ξF(ξ,θi 0t, ,θi Lt,T ξ θi 0t(ui t))|ξ=ξt

7: Update ξt+1 = ξt + γt t+1

8: return ξ k+1,N =ξT or a Polyak averaging value

B. Differentiation under the integral sign

We briefly recall standard conditions under which it is possible to exchange differentiation and expectation operators. In our setting this implies conditions that are also useful for other results in Section C.3. Indeed to carry out a stochastic gradient algorithm (Algorithms 1 and 3), we assume that all quantities are well defined. In particular,

ξIP CE(ξ)=Ep(u) QL ℓ=0 p(θℓ) h ξF(ξ,θ0, ,θL,T ξ θ0(U)) i . (16)

Denoting pu p L+1 θ the product probability distribution p(u) QL ℓ=0 p(θℓ) on U ΘL+1, sufficient conditions for (16) are that the function F is differentiable in ξ for pu p L+1 θ -almost all (u, θ0, , θL) and its gradient satisfies ξF(ξ,θ0, ,θL,T ξ θ0(u)) H(θ0, , θL, u) where H has a bounded expectation. In particular this implies the continuity

properties (ii) and (iii) required in Lemma 4 and 5 below.

In practice, it can also be useful to notice that the gradient required within the expectation (16) can be expressed using only the log-likelihood gradient. To simplify, in the expression below we denote T ξ θ0(u) by y. Using the definition of F, it comes,

ξF(ξ,θ0, ,θL, y) = ξ log p(y|ξ, θ0)

ℓ=0 wℓ(θ0, , θL, ξ, y) ξ log p(y|ξ, θℓ) (17)

where the wℓ(θ0, , θL, ξ, y) = p(y|ξ,θℓ) L P

ℓ =0 p(y|ξ,θℓ ) sum to 1 and act as weights. Thus we only need to compute for all θ0, θℓ, the

gradient ξ log p(T ξ θ0(u)|ξ, θℓ).

C. Proofs of main and intermediate results

We first specify the steps that lead to Proposition 1 in the main body. The overall goal is to show that products of particle approximations have similar convergence properties than single particle approximations.

C.1. Product form estimators

Product form estimators have been used in various places in the literature but only studied in a more formal and general way in some recent work (Kuntz et al., 2022). Their advantages are clearly highlighted in (Kuntz et al., 2022) but the

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

theoretical results therein do not cover the use of product form estimators within SMC samplers. In contrast, (Rebeschini & van Handel, 2015; Lindsten et al., 2017; Kuntz et al., 2024; Aitchison, 2019) consider SMC settings that may look similar to ours but differ in key aspects. Indeed, (Rebeschini & van Handel, 2015) proposed a so-called block particle filter consisting of a product of local smaller particle filters to approximate particle filters in high dimensional models. The last line of their Algorithm 2 clearly shows that the block particle filter is equivalent to running several independent particle filters on smaller dimensional spaces. In contrast, we run a single SMC from which we extract at each step a product form estimate of the target distribution. This allows to compute expectations and other inferential quantities in a more efficient way, using the advantages of product form estimators, notably their lower variance, as explained in (Kuntz et al., 2022). The Divide-and-Conquer SMC (Da C-SMC) methodology introduced in (Lindsten et al., 2017) and further studied in (Kuntz et al., 2024) targets a different objective, which is to generalize the sequential nature of standard SMC to a more general tree-structured execution flow of the sampler. Hierarchical decompositions of variables of interest into subsets are considered and product form estimators are used to recombined the splitted subsets of particles. Typical use cases include sets of variables exhibiting a graph dependence structure such as Markov random fields. In contrast, in our work, the execution flow remains sequential, corresponding to a tree that is a simple line, and we do not address complex parameter structures. At last, the Tensor Monte Carlo technique proposed in (Aitchison, 2019) shares similarity with Da C-SMC but focuses on computational aspects and does not cover as many theoretical properties. The work in (Aitchison, 2019) proposes to exploit dependence structures in variables and conditional independence properties to address the issue of computing the very large sums induced by product form estimators. This is not an issue we encounter as, in our work, the computation of very large sums is avoided via stochastic approximation.

The advantage of using a product of particle approximations justifies a new analysis of the SMC procedure. Interestingly, the product-form can also help to mitigate the growth of constant ck in the L2 bound in Proposition 1 (c τ in the proof below). Increasing L can allow to reduce ck to an extent that can make the bound informative for finite sample N. See the remark at the end of Section C.2.

C.2. Convergence of product form particle approximations

The result is intuitive and generalizes standard SMC proofs but as mentioned in the previous section, we believe it is not covered in previous work. It requires additional care as illustrated in Lemma 1 below. We follow the presentation in Chapter 11 of Chopin & Papaspiliopoulos (2020) which is mainly based on the presentation in (Crisan & Doucet, 2002). As mentioned in (Chopin & Papaspiliopoulos, 2020) it is not the most general presentation but it has the advantage of using standard tools while presenting the general idea. For simplicity, we only present the L2 convergence result, which is enough to prove Proposition 1.

To simplify the notation, we use τ as the time or step index and µτ, Mτ for , µλτ , Mλτ as in Algorithm 2, to denote the step and successive targets and Markov kernels, but this extends to k and pk, as we repeat the same process sequentially. For a SMC procedure on Θ, Gτ denotes then the so-called potential function involved in the weights normalization and Gτ = Gτ/(µτ 1Mτ 1(Gτ)) its normalization. For example, in our setting in Section 6, Gτ(θ) = Gk,τ(θ) = p(yk|θ, ξk)γ. Recall that for a function ϕ on Θ, we then have

µτ[ϕ] = Eµτ [ϕ] = µτ 1Mτ 1[ϕ Gτ] = Z Gτ(θ)ϕ(θ) Z Mτ 1(θ , θ) µτ 1(θ ) dθ dθ. (18)

As a particular case, we have

µτ(θℓ) = µτ[δθℓ] = Gτ(θℓ) Z Mτ 1(θ , θℓ) µτ 1(θ ) dθ dθ (19)

µτ 1Mτ 1[ Gτ] = µτ[1] = 1 (20)

We generalize these definitions and results on the product space ΘL+1 with ϕ being now a function of L + 1 variables. For a function ϕ on ΘL+1, we use notation Mτϕ for the function

Mτϕ : (θ0, , θL) Z Z ϕ(θ 0, , θ L)

ℓ=0 Mτ(θℓ, θ ℓ) dθ ℓ

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

Similarly, we denote µ L+1 τ [ϕ] = Eµ L+1 τ [ϕ] = R R ϕ(θ0, , θL) QL ℓ=0 µτ(θℓ) dθℓ . Defining,

(µτ 1Mτ 1) L+1[ϕ] = Z Z ϕ(θ0, , θL)

Z Mτ 1(θ ℓ, θℓ) µτ 1(θ ℓ) dθ ℓ

we can deduce from (18) (respectively via (19) and (20) that

µ L+1 τ [ϕ] = (µτ 1Mτ 1) L+1 "

(µτ 1Mτ 1) L+1 " L Y

= µ L+1 τ [1] = 1. (23)

The strategy is to establish Proposition 1 by decomposing the error at step τ as a sum of contributions from each step (sampling, reweighting, resampling, etc.) of the current and previous steps. We first need a general result on variances of sum of random variables.

Lemma 1 Let {Xi ℓ}ℓ=0:L,i=1:N a collection of i.i.d. random variables on X. For any function ϕ on X L+1 so that the variances exist we have, for all (i0, , i L) [1 : N]L+1,

V ar ϕ(Xi0 0 , , Xi L L ) = V ar ϕ(X1 0, , X1 L)

and we denote this common variance by V ar(ϕ1). We can then bound the following variance

i L=1 ϕ(Xi0 0 , , Xi L L )

N L+1 N L+1 (N 1)L+1 V ar(ϕ1)

Proof. Let us denote I = [1 : N], J = IL+1, j = (i0, , i L) for iℓ I, and use the short notation ϕj = ϕ(Xi0 0 , , Xi L L ). Then

i L=1 ϕ(Xi0 0 , , Xi L L )

j J ϕj) = X

j J V ar(ϕj) + X

j =j Cov(ϕj , ϕj)

= N L+1V ar(ϕ1) + X

j =j Cov(ϕj , ϕj) .

The second covariance term is a sum of N 2(L+1) N L+1 pairwise covariances among which a number are zero due to the independence of the {Xi ℓ}ℓ=0:L,i=1:N. To count them, we notice that for each j = (i0, , i L) J, Cov(ϕj, ϕ j) = 0 for every j J such that j = (i 0, , i L) with i ℓ I\{iℓ}. There are thus (N 1)L+1 such j . It follows that there are altogether N L+1(N 1)L+1 null covariance terms and N L+1(N L+1 (N 1)L+1 1) non null ones. For each non null covariance, the Cauchy-Schwartz inequality implies that Cov(ϕj , ϕj) p

V ar(ϕj )V ar(ϕj) = V ar(ϕ1), which concludes the proof.

We then recall some notation and definitions specific to our product form particle approximation. The tempered SMC procedure in Algorithm 2 produces M = N(L + 1) particles denoted by θ1:M τ at each step indexed by τ. When partitioning these particles into L + 1 disjoint subset we will denote by θ1:N ℓ,τ the N particles in each subset with ℓ= 0 : L + 1. We denote by ζN k,L the collection of random variables ζN k,L = {W i k,ℓ, θi k,ℓ}i=1:N,ℓ=0:L, and use the short notation p L+1 k [ϕ]

for p L+1 k [ϕ] = EQL ℓ=0 p(θℓ|Dk)[ϕ(θ0, , θL)]. We write Fτ 1 for the σ-algebra Fτ 1 = σ θ1:M τ 1 . Let Cb(ΘL+1)

denote the set of functions ϕ : ΘL+1 R that are measurable and bounded, and let ||ϕ|| denote the supremum norm ||ϕ|| = supθ ΘL+1 |ϕ(θ)|. The following lemma bounds the Monte Carlo error at step τ.

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

Lemma 2 (Monte Carlo error) Using Algorithm 2 with a multinomial resampling scheme, for all functions ϕ Cb(ΘL+1),

i L=1 ϕ(θi0 0,τ, , θi L L,τ)

ℓ=0 W iℓ ℓ,τ 1

Mτ 1ϕ(θi0 0,τ 1, , θi L L,τ 1)

c N,L ϕ 2 (24)

with c N,L = 1 1 1

N L+1 and where the expectation is taken over all the realizations of the random tempered SMC method, or equivalently on ζN k,L.

Proof. Multinomial resampling preserves conditional independence so that the particles θ1:N ℓ,τ are i.i.d. conditionally on Fτ 1 and their conditional distribution is

θi ℓ,τ|Fτ 1

i=1 W i ℓ,τ 1Mτ 1(θi ℓ,τ 1, ).

Thus for all i0, , i L,

h ϕ(θi0 0,τ, , θi L L,τ) | Fτ 1 i =

ℓ=0 W iℓ ℓ,τ 1

Mτ 1ϕ(θi0 0,τ 1, , θi L L,τ 1)

i L=1 ϕ(θi0 0,τ, , θi L L,τ) | Fτ 1

It follows that

i L=1 ϕ(θi0 0,τ, , θi L L,τ)

ℓ=0 W iℓ ℓ,τ 1

Mτ 1ϕ(θi0 0,τ 1, , θi L L,τ 1)

i L=1 ϕ(θi0 0,τ, , θi L L,τ) | Fτ 1

= 1 N 2(L+1) V ar

i L=1 ϕ(θi0 0,τ, , θi L L,τ) | Fτ 1

c N,L V ar ϕ(θ1 0,τ, , θ1 L,τ) | Fτ 1

c N,L EζN τ,L ϕ(θ1 0,τ, , θ1 L,τ)2 | Fτ 1 c N,L ϕ 2

The line before last above results from Lemma 1. Using the tower property, we get (24).

The constant c N,L is equivalent to L+1

N and tends to 0 when N tends to . The next Lemma involves the potential functions Gτ. Note that in Section 6, Gτ corresponds to Gk,τ(θ) = p(yk|θ, ξ)γ, where γ usually also depends on τ and k as it is found adaptively. However all we need is that the potential and then the likelihood is upper bounded in θ.

Lemma 3 (Weights normalization error) If Gτ is upper bounded, then for all functions ϕ Cb(ΘL+1),

ℓ=0 W iℓ ℓ,τ

ϕ(θi0 0,τ, , θi L L,τ) 1 N L+1

ℓ=0 Gτ(θiℓ ℓ,τ)

ϕ(θi0 0,τ, , θi L L,τ)

ϕ 2 EζN k,L

i=1 Gτ(θi ℓ,τ)

where Gτ = Gτ/(µτ 1Mτ 1(Gτ)).

Proof. By definition

W iℓ ℓ,τ = Gτ(θiℓ ℓ,τ) PN i=1 Gτ(θi ℓ,τ) = Gτ(θiℓ ℓ,τ) PN i=1 Gτ(θi ℓ,τ)

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

Thus the term in the square in the left hand side is

ℓ=0 W iℓ ℓ,τ

ϕ(θi0 0,τ, , θi L L,τ)

i=1 Gτ(θi ℓ,τ))

and the first factor is bounded by ϕ since N P

ℓ=0 W iℓ ℓ,τ

The two Lemmas can now be combined to establish the proposition below.

Proposition 1 (L2 convergence) Using Algorithm 2 with multinomial resampling and assuming that potential functions Gτ are upper bounded, there exists constants cτ, c τ so that, for all functions ϕ Cb(ΘL+1),

i L=1 ϕ(θi0 0,τ, , θi L L,τ) (µτ 1Mτ 1) L+1[ϕ]

cτ c N,L ϕ 2 (25)

ℓ=0 W iℓ ℓ,τ

ϕ(θi0 0,τ, , θi L L,τ) µ L+1 τ [ϕ]

c τ c N,L ϕ 2 (26)

with c N,L = 1 (1 1

Proof The proof works by induction on τ. Replacing µτ 1Mτ 1 by µ0, at τ = 0, (25) holds with c0 = 1. Assume (25) holds at step τ, we first show that (26) also holds at step τ. The left hand side in (26) can be decomposed into

ℓ=0 W iℓ ℓ,τ

ϕ(θi0 0,τ, , θi L L,τ) µ L+1 τ [ϕ] =

ℓ=0 W iℓ ℓ,τ

ϕ(θi0 0,τ, , θi L L,τ) 1 N L+1

ℓ=0 Gτ(θiℓ ℓ,τ)

ϕ(θi0 0,τ, , θi L L,τ)

ℓ=0 Gτ(θiℓ ℓ,τ)

ϕ(θi0 0,τ, , θi L L,τ) µ L+1 τ [ϕ] .

For all ϕ, when ϕ (θ0, , θL) = QL ℓ=0 Gτ(θℓ) ϕ(θ0, , θL), we have (µτ 1Mτ 1) L+1[ϕ ] = µ L+1 τ [ϕ] by (22). The

second term can then be bounded by cτc N,L ϕ 2 , applying (25) with ϕ . For the first term, we apply Lemma 3 and then (25) with ϕ(θ0, , θL) = QL ℓ=0 Gτ(θℓ) since then (µτ 1Mτ 1) L+1[ϕ] = µ L+1 τ [1] = 1 by (23). Finally we get (26) with c τ = 4cτ Gτ 2(L+1) where we have used that E[(X + Y )2] 2(E[X2] + E[Y 2]). We then show that (26) at step τ 1 implies (25) at step τ. Similarly, we have,

i L=1 ϕ(θi0 0,τ, , θi L L,τ) (µτ 1Mτ 1) L+1[ϕ] =

i L=1 ϕ(θi0 0,τ, , θi L L,τ)

ℓ=0 W iℓ ℓ,τ 1

Mτ 1ϕ(θi0 0,τ 1, , θi L L,τ 1)

ℓ=0 W iℓ ℓ,τ 1

Mτ 1ϕ(θi0 0,τ 1, , θi L L,τ 1) (µτ 1Mτ 1) L+1[ϕ] .

The first term can be bounded using Lemma 2 and for the second term by applying (26) at τ 1 to function Mτ 1ϕ defined in (21). Then (25) holds with cτ = 2(1 + c τ 1).

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

Proposition 1 in the paper is just the result above but reduced to the presentation of (26). To give more insight on the constants involved in Proposition 1, we note that the constant denoted by ck corresponds to c τ in the above proof. A well referenced behavior in SMC, see e.g. (Chopin & Papaspiliopoulos, 2020), is that this constant may increase very fast with k and become uninformative for finite N. Without considering tempering for simplicity, it comes that c0 = 1 and ck = 4(2 + ck 1) G 2(L+1) , where G denotes the upper bound in θ of Gk(θ) p(yk|θ, ξk), which is assumed here to be independent on the specific value of yk and ξk. It follows that ck = 2 Pk i=1(4 G 2(L+1) )i. Interestingly, with our product form approximation, if G < 1, ck remains low and decreases when L increases.

C.3. Consistency of the sequential design estimators

We specify the steps that lead to the proof of Proposition 2 in the main body. We first recall some notation and definitions. The sequential design values produced by Algorithm 3 above can be seen as realizations of random estimators {ξ k+1,N}N 1 targeting points of maximum of random criterion functions {Ik+1,N P CE (ξ)}N 1,

Ik+1,N P CE (ξ) = Ep(u) QL ℓ=0 p N k,ℓ(θℓ) h F(ξ,θ0, ,θL,T ξ θ0(U)) i .

A natural question is to study the limiting distributions of these random quantities when the number of particles N tends to infinity. The main property is the following Theorem 1, which results from a general result from M-estimator theory (van der Vaart, 1998; van der Vaart & Wellner, 1996). The two following lemmas are then useful to provide simpler sufficient conditions to satisfy the Theorem s assumptions and to establish our main result in Proposition 2 in the main body, also recalled below.

Theorem 1 (Theorem 5.7 in (van der Vaart, 1998)) Assume

(i) For all ϵ > 0,

lim N pζN k,L

sup ξ E |Ik+1,N P CE (ξ) Ik+1 P CE(ξ)| ϵ

(ii) For all ϵ > 0,

sup ξ ξ k+1 ϵ Ik+1 P CE(ξ) < Ik+1 P CE(ξ k+1) . (28)

(iii) There exists a sequence of positive random variables {ρN}N 1 and a sequence of random variables {ξ k+1,N}N 1 in E that satisfy ϵ > 0, lim N pζN k,L(ρN ϵ) = 0

lim inf N pζN k,L

Ik+1,N P CE (ξ k+1,N) Ik+1,N P CE (ξ k+1) ρN = 1.

Then the sequence of estimators {ξ k+1,N}N 1 is consistent, i.e. for all ϵ > 0,

lim N pζN k,L ξ k+1,N ξ k+1 ϵ = 0.

Proof. The proof is a special case of Theorem 5.7 in (van der Vaart, 1998). We reproduce it using our notation. In all the following proofs, to simplify, we drop the k + 1 notation, so that Ik+1 P CE(ξ) and Ik+1,N P CE (ξ) are now simply denoted by IP CE(ξ) and IN P CE(ξ) and their respective maximizers by {ξ } and {ξ N}. Since ξ maximizes IP CE(ξ), it comes that for all N 1,

0 IP CE(ξ ) IP CE(ξ N)

= (IP CE(ξ ) IN P CE(ξ )) + (IN P CE(ξ ) IN P CE(ξ N)) + (IN P CE(ξ N) IP CE(ξ N)) .

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

The first and third terms in the sum are bounded by sup ξ E |IP CE(ξ) IN P CE(ξ)| while the second term can be bounded

by ρN + (IN P CE(ξ ) IN P CE(ξ N) ρN) δ{IN P CE(ξ ) IN P CE(ξ N)>ρN} where δA is the indicator function which is 1 if A is satisfied and 0 otherwise. It follows that

0 IP CE(ξ ) IP CE(ξ N)

2 sup ξ E |IP CE(ξ) IN P CE(ξ)|, ρN, IN P CE(ξ ) IN P CE(ξ N) ρN δ{IN P CE(ξ ) IN P CE(ξ N)>ρN}

Assumptions (i) and (iii) imply that all three terms in the max tend to 0 in probability so that for all η > 0,

lim N pζN k,L(IP CE(ξ ) IP CE(ξ N) η) = 0. (29)

Then, for all ϵ > 0, using (ii), for all ξ satisfying ξ ξ ϵ, there exists η > 0 so that IP CE(ξ) IP CE(ξ ) η. This implies that { ξ N ξ ϵ} {IP CE(ξ N) IP CE(ξ ) η} and

pζN k,L( ξ N ξ ϵ) pζN k,L(IP CE(ξ N) IP CE(ξ ) η) = pζN k,L(IP CE(ξ ) IP CE(ξ N) η) .

The limit in N of this last term tends to 0 by (29), which concludes the proof.

In general, the uniform convergence in (i) is the most difficult assumption to check but in our setting, when E is compact, it is easily derived from previous assumptions and the pointwise convergence in probability ((iv) below), which can be derived from the L2 convergence in Proposition 1 (see comments in the main body). Recall the following definition,

f P CE(ξ, θ0, , θL) = Ep(u) h F(ξ,θ0, ,θL,T ξ θ0(U)) i

and the shortcut notation p L+1 k = LQ

ℓ=0 p(θℓ|Dk).

Lemma 4 Assume

(i) E Rd is a compact set.

(ii) Ik+1 P CE(ξ) is a continuous function in ξ.

(iii) f P CE(ξ, θ0, , θL) is a continuous function in ξ for p L+1 k -almost all (θ0, , θL).

(iv) Convergence in probability pointwise: For all ξ E and all ϵ > 0,

lim N pζN k,L

|Ik+1,N P CE (ξ) Ik+1 P CE(ξ)| ϵ = 0.

Then the uniform convergence in Theorem 1 (i) is satisfied, that is, for all ϵ > 0,

lim N pζN k,L

sup ξ E |Ik+1,N P CE (ξ) Ik+1 P CE(ξ)| ϵ

Proof. Continuous functions on a compact set are uniformly continuous. It follows from (ii) and (iii) that for all ϵ > 0, there exists η > 0 so that for all ξ E,

sup ξ ξ η |IP CE(ξ) IP CE(ξ )| ϵ (30)

and for p L+1 k -almost all (θ0, , θL),

sup ξ ξ η |f P CE(ξ, θ0, , θL) f P CE(ξ , θ0, , θL)| ϵ. (31)

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

Let B(ξ, η) be a ball centered at ξ with radius η. As E is compact, for all η > 0, it is possible to extract, from the cover set S

ξ E B(ξ, η), a finite subcover S

b=1:B B(ξ(b), η) so that E S

b=1:B B(ξ(b), η) and

sup ξ E |IN P CE(ξ) IP CE(ξ)| = max b=1:B sup ξ B(ξ(b),η) |IN P CE(ξ) IP CE(ξ)| . (32)

For all b = 1 : B, and all ξ B(ξ(b), η), we also have that

|IN P CE(ξ) IP CE(ξ)| |IN P CE(ξ) IN P CE(ξ(b))| + |IN P CE(ξ(b)) IP CE(ξ(b))| + |IP CE(ξ(b)) IP CE(ξ)|

which implies

sup ξ B(ξ(b),η) |IN P CE(ξ) IP CE(ξ)| sup ξ B(ξ(b),η)

n |IN P CE(ξ) IN P CE(ξ(b))| o

+ |IN P CE(ξ(b)) IP CE(ξ(b))|

+ sup ξ B(ξ(b),η)

n |IP CE(ξ(b)) IP CE(ξ)| o

For the first term in the right-hand side,

sup ξ B(ξ(b),η)

n |IN P CE(ξ) IN P CE(ξ(b))| o Ep N L+1 k

sup ξ ξ(b) η |f P CE(ξ, θ0, , θL) f P CE(ξ(b), θ0, , θL)|

By Proposition 1 (26), the right-hand side above tends in L2-norm and then in probability when N tends to to

sup ξ ξ(b) η |f P CE(ξ, θ0, , θL) f P CE(ξ(b), θ0, , θL)|

which is smaller than ϵ by (31). The second term in the right-hand side of (33) tends in probability to 0 by (iv) for all ξ(b). Finally, using (30), the third term in (33) is smaller than ϵ. To conclude, (32) implies the uniform convergence (27).

The following result gives simpler sufficient conditions for Assumption (ii) in Theorem 1 to hold.

Lemma 5 Assume

(i) E Rd is a compact set.

(ii) Ik+1 P CE(ξ) is a continuous function in ξ .

(iii) For all ξ = ξ k+1, Ik+1 P CE(ξ) < Ik+1 P CE(ξ k+1) .

Then Assumption (ii) of Theorem 1 is satisfied.

Proof. A continuous function reaches its maximum on a compact set. It follows that Ik+1 P CE(ξ) reaches its maximum on the compact subset E\B(ξ k+1, ϵ). Let ξϵ denote a value at which this maximum is reached. Then (iii) implies that Ik+1 P CE(ξϵ) < Ik+1 P CE(ξ k+1), which leads to (28).

Using Lemmas 4 and 5, we can now replace in Theorem 1 Assumptions (i) and (ii) by simpler conditions. It follows the Proposition 2 presented in the paper and recalled below.

Proposition 2 Assume

A1 E Rd is a compact set.

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

A2 For all ξ = ξ k+1, Ik+1 P CE(ξ) < Ik+1 P CE(ξ k+1)

A3 There exists a sequence of positive random variables {ρN}N 1 and a sequence of random variables {ξ k+1,N}N 1 in E that satisfy ϵ > 0, lim N pζN k,L(ρN ϵ) = 0

lim inf N pζN k,L

Ik+1,N P CE (ξ k+1,N) Ik+1,N P CE (ξ k+1) ρN = 1.

Then the sequence of estimators {ξ k+1,N}N 1 is consistent, i.e. for all ϵ > 0,

lim N pζN k,L ξ k+1,N ξ k+1 ϵ = 0.

Proof. With E compact, we can use Lemma 4. The continuity of Ik+1 P CE(ξ) and f P CE has been already assumed earlier as specified in Section B and Lemma 4 (iv) is a consequence of Proposition 1. It follows the uniform convergence property (i) in Theorem 1. Then (A1-2) and Lemma 5 imply (ii) in Theorem 1. With (A3), Theorem 1 leads to the result. Note that if we assume that ξ k+1,N is an exact maximizer of Ik+1,N P CE (ξ) then (A3) is trivially satisfied with ρN = 0.

D. Numerical experiments

D.1. Sequential prior contrastive estimation (SPCE) criterion

We specify the SPCE introduced by Foster et al. (2021) and used in our experiments and those of Blau et al. (2022) to assess the design sequence quality in our comparison. For a number K of experiments, DK = {(y1, ξ1), , (y K, ξK)} and L contrastive variables, SPCE is defined as

SPCE(ξ1, , ξK) = E K Q

k=1 p(yk|ξk,θ0) L Q

k=1 p(yk|θ0, ξk)

k=1 p(yk|θℓ, ξk)

SPCE is a lower bound of the total EIG which is the expected information gained from the entire sequence of design parameters ξ1, . . . , ξK and it becomes tight when L tends to . In addition, SPCE has the advantage to use only samples from the prior p(θ) and not from the successive posterior distributions. It makes it a fair criterion to compare methods on design sequences only. Considering a true parameter value denoted by θ , given a sequence of design values {ξk}k=1:K, observations {yk}k=1:K are simulated using p(y|θ , ξk) respectively. Therefore, for a given Dk, the corresponding SPCE is estimated numerically by sampling θ1, , θL from the prior,

SPCE(DK) = 1

k=1 p(yk|θ , ξk)

k=1 p(yk|θ , ξk) + L P

k=1 p(yk|θi ℓ, ξk)

As shown in (Foster et al., 2021) (Appendix A), SPCE increases with L to reach the total EIG I(ξ1, . . . , ξK) when L at a rate O(L 1) of convergence. More specifically, it is shown in (Foster et al., 2021) that

0 I(ξ1, . . . , ξK) SPCE(ξ1, . . . , ξK) C L + 1 (35)

where C = Ep(DK)p(θ|DK) h p(DK|θ)

p(DK) i 1 with the notation p(DK|θ) = KQ

k=1 p(yk|θ, ξk).

It is also shown in (Foster et al., 2021) that for a given L, SPCE is bounded by log(L + 1) while the upper bound SNMC below is potentially unbounded. As in (Blau et al., 2022), if we use L = 107 to compute SPCE and SNMC, the bound is log(L + 1) = 16.12 for SPCE. In practice this does not impact the numerical methods comparison as the intervals [SPCE, SNMC] containing the total EIG remain clearly distinct.

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

D.2. Sequential nested Monte Carlo (SNMC) criterion

Similarly, an upper bound on the total EIG, with similar features, has also been introduced by Foster et al. (2021). Its expression is very similar to that of SPCE, varying only through the sum in the denominator,

SNMC(ξ1, , ξK) = E K Q

k=1 p(yk|ξk,θ0) L Q

k=1 p(yk|θ0, ξk)

k=1 p(yk|θℓ, ξk)

D.3. Implementation details

For VPCE (Foster et al., 2020) and RL-BOED (Blau et al., 2022), we use the code available at github.com/csiro-mlai/RLBOED, using the settings recommended therein to reproduce the results in the respective papers. From the obtained sequences of observations and design values, we compute SPCE and SNMC as explained above and retrieve the same results as in their respective papers.

Our code is implemented in Jax (Bradbury et al., 2020) and available at github.com/iolloj/pasoa. Several packages are used through the repository. Namely, we used Optax (Babuschkin et al., 2020) to run Gradient Descents, the Sequential Monte Carlo part was heavily inspired and built using Kernels from Black Jax (Lao & Louf, 2022) and we used OTT (Cuturi et al., 2022) to compute Wasserstein distances.

Table 3 summarizes the main features and running times of the compared methods. The RL-BOED method has the advantage to be both non-myopic and amortized in the sense that a policy is learnt upfront and then used straightforwardly at each new experiment. It follows a much longer training time, which does not exist for the other methods. Note that in comparison the deployment times of all methods are neglible (see Table 1 in the paper). In contrast RL-BOED does not provide approximations for the posterior distributions.

Table 3. Main features and training times of the compared methods: the second column indicates whether a method also provides approximation of posterior distribution, the third if it is amortized and the fourth if it is non-myopic. The last column shows training times for the amortized method RL-BOED and a sequence of K experiments run on a single Nvidia V100 GPU, for the source finding and CES examples.

Method Posterior Amortized Non-myopic Training Time PASOA SMC

RL-BOED (Blau et al., 2022) CES: 20h Sources: 10h VPCE (Foster et al., 2020)

D.4. Hardware details

Our method can be run on a local machine and was tested on a Apple M1 Pro 16Gb chip. However, for a faster running time, each experiment was finally produced by running our method on a single Nvidia V100 GPU. One other advantage of tempering and of our PASOA method is that by reducing the number of needed particles for an accurate procedure, it lowers the hardware requirements for this method as it becomes feasible to run it on CPUs (see Table 1).

D.5. Checking the assumptions given in the theoretical results

Ideally, the models used in experiments should satisfy the assumptions appearing in our propositions. For the L2 convergence result (Proposition 1), the conditions are easy to check. Proposition 1 requires that the potential functions Gk,τ are bounded. It is sufficient to check that the likelihood p(y|θ, ξ) as a function of θ is bounded (main Section 6 before Proposition 1). For the source location model, the likelihood is log-normal and is bounded independently of θ, y, and ξ. For the CES example, the likelihood is a mixture given in equation (36), Section D.7 below where the last component is a logit-normal distribution.

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

Both p0 and p1 in equations (37) and (38) below are in [0, 1]. The only potentially problematic case may be when ση 0. In that case, 1 p0 tends to 0 and for p1 we can use the approximation below equation (38). It follows that the third term (1 p0 p1) q(y|θ, ξ) in (36) remains bounded.

For the consistency result (Proposition 2), conditions (A1) and (A2) can be stronger than necessary. Note that condition (A3) is not directly related to the model but to the optimization procedure and could be ignored. The important weaker condition is (ii) in Theorem 1. Similarly to consistency results in M-estimator theory (see e.g. van der Vaart (1998)), in our work, we assume that ξ k+1 is a global and unique maximum of Ik+1 P CE ((A2)). Condition (ii) actually states that ξ k+1 is in addition well separated (see Figure 5.2 of van der Vaart (1998) for an illustration of this notion). Lemma 5 gives sufficient conditions for (ii), which results in (A1) and (A2) in Proposition 2. (A1) is that the design space is compact and (A2) states that ξ k+1 is a unique global maximum (not necessary well separated). (A1) is easy to check but (A2) is strong and not usually easy to check. Both can be relaxed with additional technicalities, see section 5.2.1 of van der Vaart (1998). For the CES model, the design space is compact. For the source example, it can be restricted to [ X, X]2 without specific care, as e.g. in (Blau et al., 2022). For (A2), we have not found yet a general way to check this for the IP CE bound. Note though, that this unchecked assumption is common practice as it would require more care to talk about consistency if the maximum was not unique and global.

D.6. Source location example

For the 2D location finding example used in (Foster et al., 2021; Blau et al., 2022) and tested in the paper, with 2 sources, K = 30 successive design optimisations, and 100 repetitions of the experiment, the number of gradient steps was set to 5000 and the ESS for the SMC procedure to 0.9. Also, in practice stratified resampling was preferred to multinomial resampling. The latter has the advantage to considerably simplify the proofs and this mismatch between theory and practice is very common in the SMC literature. Figure 3 in the paper shows the SPCE, SNMC and the L2 Wasserstein distances between weighted particles and the true source locations, providing three quantitative assessment and comparison of methods. As an additional, visual assessment of the quality of the posterior approximation provided by our method, Figure 1 in the paper and Figure 5 below illustrate the evolution of the particles over the design steps, starting from a sample following the prior to a sample concentrating around the true source locations. In particular, the k = 0 step shows particles simulated according to the prior. In most use cases, plain SMC already gives better results than other reference methods. Figure 1 and Table 1 in the paper show that tempering allows to reduce the number of particles. In Figure 5 below, the source locations, indicated by red crosses in the plots, are chosen in a part of the space not well covered by the prior to illustrate the robustness of our approach to a potential prior misspecification. After some iterations PASOA is able to explore the parameter space to finally concentrate the posterior on the true source locations. In contrast, SMC may miss some of the sources when they are outside the prior mass or when there are too many of them. We suspect that prior misspecification is a typical very common feature that jeopardizes SMC performance while impacting much less PASOA. This is actually the same problem encountered with IS that tempering aims at solving. Similarly, Figure 7 shows, in terms of SPCE, SNMC and Wasserstein distance, that SMC is more robust with tempering. Figure 6 indicates the number of tempering steps taken on average. The median (over 100 rollouts) number of tempering steps varies from 14 to 2 and is globally decreasing, being under 5 after 15 experiments. The number of tempering steps reduces when the posterior concentrates and when adding new observations becomes less informative.

D.7. Constant Elasticity of Substitution example

This other model, used in (Blau et al., 2022; Foster et al., 2020), comes from behavioral economics. In this model, an agent compares two baskets of goods and gives a rating y on a sliding scale from 0 to 1. The goal is to design the two baskets of goods so as to infer the agent s utility function, which depends on some unknown parameters. The designs are 6-dimensional vectors ξ = (ξ1, ξ2) corresponding to the two baskets with 3 values each ξd = (ξd,1, ξd,2, ξd,3) [0 : 100]3 for d = 1, 2, which represent quantities for 3 items in each basket. There are 3 parameters θ = (ρ, α, u) in dimension 5, whose prior distributions are respectively ρ Beta(1, 1), α = (α1, α2, α3) Dirichlet(1, 1, 1) and log u N(1, 3).

The model likelihood is given by the following model that uses a subjective utility function U and two hyperparameters

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

Figure 5. PASOA evolution of particles (in purple) over some selected steps k. Particles correspond initially to a sample from the prior p(θ) and progressively evolve to a sample of particles located around the initially unknown true source positions indicated by red crosses. Green crosses indicate the optimal measurement locations ξ k obtained at each step k.

Figure 6. Source location example: median (over 100 rollouts) number of tempering steps over the number of experiments, with respect to the number of particles.

ϵ = 2 22 and τ = 0.005,

y = f(η, ϵ)

where η N(µη, σ2 η)

with µη = (U(ξ1) U(ξ2)) u

ση = (1 + ξ1 ξ2 ) τ u

For d = 1, 2, U(ξd) = α1ξρ d,1 + α2ξρ d,2 + α3ξρ d,3 1/ρ

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

Figure 7. Source location example. Prior misspecification: median (over 100 rollouts) (a) SPCE, (b) SNMC and (c) Wasserstein distances for SMC (blue) and PASOA (red).

where f(η, ϵ) takes it values in [ϵ, 1 ϵ] and is a censored sigmoid defined by

f(η, ϵ) = 1 ϵ if η logit(1 ϵ)

= ϵ if η logit(ϵ)

= (1 + exp( η)) 1 otherwise

with logit(y) = log(y/(1 y)). In other words, y is a censored logit-normal distribution with parameters µη and ση. Its density is a mixture

p(y|θ, ξ) = p0δϵ(y) + p1δ1 ϵ(y) + (1 p0 p1) q(y|θ, ξ) (36)

where q(y|θ, ξ) = 1 ση

2πy(1 y) exp( (logit(y) µη)2

2σ2η ) is the density of a logit-normal distribution and p0 and p1 are defined by the following logit-normal CDF values

p0 = q(y ϵ) = p(η logit(ϵ)) = F logit(ϵ) µη

p1 = 1 q(y 1 ϵ) = 1 p(η logit(1 ϵ)) = 1 F logit(1 ϵ) µη

with the last equalities involving the normal CDF values of variable η and the standard normal CDF F values. In practice, computing log p0 and log p1 may sometimes be numerically problematic when p0 or p1 become too small. Computing p0 or p1 involves computing lower and upper Gaussian tails. In this case, following (Foster et al., 2020), we use the following first order asymptotic approximation of the standard normal CDF, when x is large,

2π exp( x2/2).

and when x is small (negative)

F(x) = 1 F( x) 1 x

2π exp( x2/2).

Thus, denoting f the pdf of the standard normal distribution, log p0 log f(x) log( x) with x = logit(ϵ) µη

ση and log p1

log f(x) log(x) with x = logit(1 ϵ) µη

ση or to summarize both approximations when |x| is large, log f(x) log(|x|).

Implementation details, if not otherwise specified, are the same as for the source location example. We plan K = 10 successive design optimisations and repeat the whole experiment 100 times for varying values of the true parameters, for all methods, PASOA, SMC, RL-BOED, VPCE and the random design baseline. This is overall a more challenging example as the objective function has many suboptimal local maxima, and the stochastic gradient procedure may be more sensitive to initialization. The number of gradient steps was set to 5000. For the SMC procedure, the ESS was set to 0.9, the Markov kernel is that of a random walk Metropolis-Hasting scheme with prior transformations mapping the parameters to R4. The transformations used are respectively, u = log u, ρ = logit(ρ) = log ρ 1 ρ, and α 1 = log α1

α3 , α 2 = log α2

α3 , with the inverse

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

transformations being u = exp u , ρ = exp ρ

1+exp ρ , α1 = exp α 1 1+exp α 1+exp α 2 , α2 = exp α 2 1+exp α 1+exp α 2 and α3 = 1 1+exp α 1+exp α 2 . We use then L = 100 contrastive variables with each N = 500 simulations.

Figure 4 in the paper and Figures 8 and 9 below show, with respect to k, the median and standard error of the SPCE, SNMC and Wasserstein distances between weighted particles and the true parameters. We observe for all methods more variability in the repetitions for this example. In terms of total EIG, the difference with RL-BOED is not as large as in the source location example, but the difference remains large for Wasserstein distances. PASOA still shows better performance in terms of information gain as measured by SPCE and SNMC. In Figure 4 in the paper, we observe that in experiments 0-2, our approach temporarily loses its advantage over RL-BOED due to insufficiently refined particle approximations of the posteriors. However, this edge is regained in subsequent experiments as more information from the posteriors becomes available. Our better design sequences are also visible in the Wasserstein distance plot presented in Figure 4. Figure 8 left shows on the same plot the SPCE and SNMC curves. Without tempering, SMC gains an advantage only in the latter steps 7-9 in terms of information gained, while, in the Wasserstein distance plot presented in Figure 4, SMC shows better performance from the start. A possible explanation is that, as shown in Figure 9 below, RL-BOED performs better on parameter ρ at the expense of sacrificing precision on the others. Overall the Wasserstein distance for all parameters remains in favor of our methods but it may be that a better precision on ρ leads to a slightly higher information gain (Figure 8 left).

In our current tempering implementation, the Markov kernel is fixed to a standard Metropolis-Hastings scheme for all steps. It is out of the scope of this paper but possible directions for improvement include using more sophisticated kernels, such as Langevin or Hamiltonian Monte Carlo moves, as suggested in the Tuning parameters section p.1591 of Dai et al. (2022) and in references therein. More generally, a number of recommendations, as reviewed in (Dai et al., 2022), have been reported as efficient and could be investigated.

Figure 8. CES example. Median and standard error over 100 tollouts, with respect to the number of experiments k. The [SPCE, SNMC] intervals containing the totel EIG are plot, respectively with plain (SPCE lower bound) and dashed (SNMC upper bound) lines. Left: SMC (blue) vs RL-BOED (green) and VPCE (yellow). Right: SMC (blue) vs PASOA (red).

D.8. Non differentiable examples

When the model log-likelihood is not differentiable, either because the gradient is not available or difficult to compute, or because the design space is discrete, the stochastic gradient part of our method cannot be directly applied. However, we can still use the other parts by replacing the optimization step by either an exhaustive argmax, in the case of a finite design space, or by Bayesian optimization (Snoek et al., 2012; Henrández-Lobato et al., 2014; Benassi et al., 2011) which does not requires gradients. This can be seen as an advantage of myopic solutions, which allow such replacements to be easily performed. For each sequential optimization, the search space remains of reasonable size or dimension and does not increases exponentially. This is not the case for other policy-based approach, e.g. (Foster et al., 2021), which would involve a challenging high-dimensional Bayesian optimization of the policy parameters, or an exhaustive search in an exponentially increasing space with K.

PASOAPArticle ba Sed Bayesian Optimal Adaptive design

Figure 9. CES example. Median and standard error of Wasserstein distances for each parameter (ρ, α, u) separately.

To illustrate this situation, another benchmark example used in (Blau et al., 2022; Moffat et al., 2020) is the Prey population example. The design is a discrete variable. Instead of using stochastic gradient descent to optimize the Ik P CE bound at each step sequentially, we can compute it for every possible design and take the argmax. The same can be done to adapt VPCE to this discrete design space, while RL-BOED has the advantage to be applicable for both continuous and discrete spaces. Without the gradient part, our approach is similar to that of Moffat et al. (2020) but with an additional tempering, which was already reported to compare favorably to RL-BOED in Figure 4 of Blau et al. (2022).