Multi-fidelity Monte Carlo: a pseudo-marginal approach

Diana Cai, Department of Computer Science, Princeton University, dcai@cs.princeton.edu
Ryan P. Adams, Department of Computer Science, Princeton University, rpa@princeton.edu

Markov chain Monte Carlo (MCMC) is an established approach for uncertainty quantification and propagation in scientific applications. A key challenge in applying MCMC to scientific domains is computation: the target density of interest is often a function of expensive computations, such as a high-fidelity physical simulation, an intractable integral, or a slowly-converging iterative algorithm. Thus, using an MCMC algorithm with an expensive target density becomes impractical, as these expensive computations need to be evaluated at each iteration of the algorithm. In practice, these computations are often approximated via a cheaper, low-fidelity computation, leading to bias in the resulting target density. Multi-fidelity MCMC algorithms combine models of varying fidelities in order to obtain an approximate target density with lower computational cost. In this paper, we describe a class of asymptotically exact multi-fidelity MCMC algorithms for the setting where a sequence of models of increasing fidelity can be computed that approximates the expensive target density of interest. We take a pseudo-marginal MCMC approach for multi-fidelity inference that utilizes a cheaper, randomized-fidelity unbiased estimator of the target fidelity constructed via random truncation of a telescoping series of the low-fidelity sequence of models. Finally, we discuss and evaluate the proposed multi-fidelity MCMC approach on several applications, including log-Gaussian Cox process modeling, Bayesian ODE system identification, PDE-constrained optimization, and Gaussian process parameter inference.
1 Introduction

Simulation and computational modeling play a key role in science, engineering, economics, and many other areas. When these models are high-quality and accurate, they are important for scientific discovery, design, and data-driven decision making. However, the ability to accurately model complex physical phenomena often comes with a significant cost: many models involve expensive computations that then need to be evaluated repeatedly in, for instance, a sampling or optimization algorithm. Examples of model classes with expensive computations include intractable integrals or sums, expensive quantum simulations [43], expensive numerical simulations arising from partial differential equations (PDEs) [38], and large systems of ordinary differential equations (ODEs). In many situations, one has the ability to trade off computational cost against fidelity or accuracy in the result. Such a tradeoff might arise from the choice of discretization or the number of basis functions when solving a PDE, or the number of quadrature points when estimating an integral. It is often possible to leverage lower-fidelity models to help accelerate high-quality solutions, e.g., by using multigrid methods [23] for spatial discretizations. More generally, multi-fidelity methods combine multiple models of varying cost and fidelity to accelerate computational algorithms and have been applied to solving inverse problems [11, 24, 38], trust region optimization [1, 4, 15, 31, 39], Bayesian optimization [8, 21, 27, 28, 41, 45], Bayesian quadrature [17, 46], and sequential learning [22, 35].

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Figure 1: Examples of low-fidelity sequences of models. (a) Sequence of trapezoid quadrature estimates $I_k$, where $I_k$ is the trapezoid rule with $2^k$ trapezoids ($k = 1, 2, 3$).
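The trapezoid-rule sequence of Figure 1a is easy to reproduce; a minimal sketch in Python, where the integrand (with $\int_0^\pi \sin x \, dx = 2$) is an illustrative choice rather than one from the paper:

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoid rule with n trapezoids on [a, b]."""
    h = (b - a) / n
    interior = sum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * (f(a) + f(b)) + interior)

# Low-fidelity sequence I_k: the trapezoid rule with 2^k trapezoids.
# As the fidelity k grows, I_k converges to the true integral (here 2).
I = [trapezoid(math.sin, 0.0, math.pi, 2 ** k) for k in range(1, 9)]
errors = [abs(I_k - 2.0) for I_k in I]
```

The error shrinks monotonically with $k$ while the cost doubles at each level, which is exactly the cost/fidelity tradeoff such a sequence exposes.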
(b) Lotka-Volterra ODE solutions for prey $u(t)$ (blue) and predator $v(t)$ (red) using Euler's method with step size $dt = 1/k$.

One critically important tool for scientific and engineering computation is Markov chain Monte Carlo (MCMC), which is widely used for uncertainty quantification, optimization, and integration. MCMC methods are recipes for constructing a Markov chain with some desired target distribution as the limiting distribution. Pseudo-random numbers are used to simulate transitions of the Markov chain in order to produce samples from the target distribution. However, MCMC often becomes impractical for high-fidelity models, where a single step of the Markov chain may, for instance, involve a numerical simulation that takes hours or days to complete. Multi-fidelity methods for MCMC focus on constructing Markov chain transition operators that are sometimes able to use inexpensive low-fidelity evaluations instead of expensive high-fidelity evaluations. The goal is to increase the effective number of samples generated by the algorithm, given a constrained computational budget. A large focus of the multi-fidelity MCMC literature is on two-stage Metropolis-Hastings (M-H) methods [10, 14], which use a single low-fidelity model for early rejection of a proposed sample, thereby often short-circuiting the evaluation of the expensive, high-fidelity model. However, there are several limitations of two-stage multi-fidelity Monte Carlo. First, in many applications, a hierarchy of cheaper, low-fidelity models is available; for instance, in the case of integration, k-point quadrature estimates form a hierarchy of low-fidelity models, and in the case of a PDE, the hierarchy arises from varying the discretization. Thus, the two-stage approach does not fully utilize the availability of a hierarchy of fidelities and may be more suitable for settings where the high- and low-fidelity models are not hierarchically related, e.g., semi-empirical methods vs. Hartree-Fock in computational chemistry.
In addition, in such applications, there is often a limiting model of interest, such as a continuous function that the low-fidelity discretizations approximate. Two-stage MCMC does not asymptotically sample from this limiting target density and will at best sample from an approximation of the biased, high-fidelity posterior. Finally, the two-stage method does not generalize naturally to more sophisticated MCMC algorithms such as slice sampling and Hamiltonian Monte Carlo (HMC). We propose a class of multi-fidelity MCMC methods designed for applications with a hierarchy of low-fidelity models available. More specifically, we assume access to a sequence of low-fidelity models that converge to a perfect-fidelity model in the limit. Within an MCMC algorithm, we can approximate the perfect-fidelity target density with an unbiased estimator constructed from a randomized truncation of the infinite telescoping series of low-fidelity target densities. This class of multi-fidelity MCMC is an example of a pseudo-marginal MCMC (PM-MCMC) algorithm: the unbiased estimator guarantees that the algorithm is asymptotically exact, in that the limiting distribution recovers the perfect-fidelity target distribution as its marginal distribution. Our approach introduces the fidelity of a model as an auxiliary random variable that is evolved separately from the target variable according to its own conditional target distribution; this technique can be used in conjunction with any suitable MCMC update that leaves the conditional update for the target variable of interest invariant, such as M-H, slice sampling, elliptical slice sampling, or Hamiltonian Monte Carlo. We apply the pseudo-marginal multi-fidelity MCMC approach to several problems, including log-Gaussian Cox process modeling, Bayesian ODE system identification, PDE-constrained optimization, and Gaussian process parameter inference.

Related work.
Multi-fidelity MCMC methods are commonly applied in a two-stage procedure, where the goal is to reduce the computational cost of using a single expensive high-fidelity model by using a cheap low-fidelity model as a low-pass filter for a delayed acceptance/rejection algorithm [10, 11, 14]; see Peherstorfer et al. [36] for a survey. Higdon et al. [24] propose coupling a high-fidelity Markov chain with a low-fidelity Markov chain via a product chain. In contrast, our approach aims to sample from a perfect-fidelity target density while reducing computational cost; two-stage MCMC algorithms result in biased estimates with respect to this target density. A related class of methods is multilevel Monte Carlo [13, 18, 19, 44], which uses a hierarchy of multi-fidelity models for Monte Carlo estimation by expressing the expectation of a high-fidelity model as a telescoping sum of low-fidelity models. Dodwell et al. [13] use the M-H algorithm to form the multilevel Monte Carlo estimates, simulating from a separate Markov chain for each level of the telescoping sum. In practice, multilevel Monte Carlo requires choosing a finite number of fidelities, inducing bias in the estimator with respect to the (limiting) perfect-fidelity model. In contrast, our method uses a randomized fidelity within a single Markov chain with the perfect-fidelity model as the target. Our approach applies pseudo-marginal MCMC to multi-fidelity problems. There is a rich literature developing pseudo-marginal MCMC methods [2, 6] for so-called doubly-intractable likelihoods, which are likelihoods that are intractable to evaluate. Several approaches in the pseudo-marginal MCMC literature are particularly relevant to our work. The first are the PM-MCMC methods introduced by Lyne et al. [30], which describe a class of pseudo-marginal M-H methods that use Russian roulette estimators to obtain unbiased estimators of the likelihood.
However, this method samples the variable of interest jointly with the auxiliary randomness, which often leads to sticking. Alternatively, several methods have considered sampling the randomness separately. The idea of clamping random numbers is explored in depth by Andrieu et al. [3] and Murray and Graham [33], with the latter applying these ideas to pseudo-marginal slice sampling. In particular, our approach applies these ideas to the specific setting of multi-fidelity models, where the random fidelity is treated as an auxiliary variable. Finally, while our approach applies to doubly-intractable problems, we are also motivated by a larger class of multi-fidelity problems studied in the computational sciences that may not even be inference problems, such as quantum simulations and PDE-constrained optimization.

2 Multi-fidelity MCMC

Monte Carlo methods approximate integrals and sums that can be expressed as an expectation:

$$\mathbb{E}_{\pi}[h(\theta)] = \int h(\theta)\,\pi(\theta)\,d\theta \approx \frac{1}{T}\sum_{t=1}^{T} h(\theta^{(t)}), \quad \text{where } \theta^{(t)} \sim \pi, \qquad (1)$$

and where $\pi : \Theta \to \mathbb{R}_+$ is the target density that may only be known up to a constant, $h(\theta)$ is a function of interest, and $\{\theta^{(t)}\}_{t=1}^{T}$ are samples from $\pi$. Markov chain Monte Carlo methods are then used to generate samples $\theta^{(t)}$ from $\pi$ by simulating from a Markov chain with target $\pi$. In many settings, pointwise evaluations of the target function $\pi(\theta)$ are expensive or even intractable; from here on we will assume that the goal is to compute statistics of a quantity of interest $h(\theta)$ with respect to a perfect-fidelity target density $\pi_\infty(\theta)$. In practice, the estimate in Equation (1) is instead computed using a cheaper, low-fidelity density $\pi_k(\theta)$, where $k \in \mathbb{N} := \{1, 2, \ldots\}$. In particular, we consider settings where there is a sequence of low-fidelity densities available that converges to the target, i.e., $\pi_k(\theta) \to \pi_\infty(\theta)$ as $k \to \infty$. We assume that as k increases, the model becomes higher in fidelity (with respect to $\pi_\infty$) but more costly to evaluate, increasing in expense super-linearly with k.
For instance, $\pi_\infty$ could represent a target density that depends on an intractable integral, the solution of a PDE, the solution of a large system of ODEs, the solution of a large system of linear equations, or the minimizer of a function. Thus, a typical evaluation of $\pi_\infty$ requires an approximation at a fidelity k with a tolerable level of bias for a given computational budget. Here, increasing k could correspond to finer discretizations of differential equations, increasing numbers of quadrature points, or performing a larger number of iterations in a linear solver or optimization routine. In the multi-fidelity setting, the goal is to combine several models of varying fidelity within an MCMC algorithm to reduce the computational cost of estimating Equation (1). In this paper, we describe a class of MCMC algorithms that leverages the sequence of low-fidelity models $\pi_k$. Our strategy for multi-fidelity MCMC (MF-MCMC) will be to construct an unbiased estimator of $\pi_\infty(\theta)$ using random choices of the fidelity K and then to include K in the Markov chain as an auxiliary variable. By carefully constructing such a Markov chain, it will be possible to asymptotically estimate the functional in Equation (1) as though the samples were taken from the perfect-fidelity model; each step of the Markov chain will nevertheless only require a finite amount of computation. Finally, our approach allows us to essentially plug in any valid MCMC algorithm, and we apply this strategy to develop multi-fidelity variants of a number of MCMC algorithms, such as M-H and slice sampling.

2.1 Pseudo-marginal MCMC for the multi-fidelity setting

Pseudo-marginal MCMC [2, 6] is a class of auxiliary-variable MCMC algorithms that replaces the target density $\pi(\theta)$ with an estimator $\hat\pi(\theta)$ that is a function of a random variable. If the estimator is nonnegative and unbiased, i.e., for all $\theta \in \Theta$, $\hat\pi(\theta) \ge 0$ and $\mathbb{E}[\hat\pi(\theta)] = \pi(\theta)$, then MCMC transitions that use the estimator still have $\pi(\theta)$ as their invariant distribution.
This property is sometimes referred to as exact-approximate MCMC, as the transitions are approximate but the limiting distribution is exact. Estimators can be constructed from a variety of methods, including particle filtering [2]; our approach will use randomized series truncations, which have been considered in pseudo-marginal MCMC methods such as Lyne et al. [30], Georgoulas et al. [16], and Biron-Lattes et al. [7]. We now apply the pseudo-marginal approach to the multi-fidelity setting. Here the target density estimator arises from a random choice of the fidelity $K \in \mathbb{N}$ that is governed by a distribution $\mu$ on $\mathbb{N}$. We denote the estimator by $\hat\pi_K(\theta)$ to make the dependence on the random fidelity K explicit. The estimator is constructed such that it is unbiased with respect to $\mu$, i.e.,

$$\sum_{k=1}^{\infty} \mu(k)\,\hat\pi_k(\theta) = \pi_\infty(\theta). \qquad (2)$$

The distribution $\mu$ is also constructed by the user: ideally, the estimator $\hat\pi_K(\theta)$ will prefer smaller values of K while having sufficiently low variance as to allow the Markov chain to mix effectively. Thus the simulations can be run at inexpensive low fidelities, while the estimates will be as though the perfect-fidelity model were being used. The standard pseudo-marginal MCMC approach is to construct a Markov chain that has the following joint density as its stationary distribution:

$$\bar\pi(\theta, K) = \mu(K)\,\hat\pi_K(\theta). \qquad (3)$$

Observe that while Equation (3) does not depend on the perfect-fidelity target density $\pi_\infty$, it returns the desired marginal $\pi_\infty$ via Equation (2). As a concrete example, a pseudo-marginal M-H algorithm generates a new state $\theta'$ and fidelity $K'$ jointly, using $q(\theta'; \theta)$ as the proposal for $\theta'$ and $q(K'; K) = \mu(K')$ as the proposal distribution for the fidelity, and accepts/rejects the state according to

$$a = \frac{\bar\pi(\theta', K')\,q(\theta; \theta')\,q(K; K')}{\bar\pi(\theta, K)\,q(\theta'; \theta)\,q(K'; K)} = \frac{\hat\pi_{K'}(\theta')\,q(\theta; \theta')}{\hat\pi_K(\theta)\,q(\theta'; \theta)}, \qquad (4)$$

where the equality holds since the distribution terms for $K$ and $K'$ cancel.
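As a sketch of this joint proposal, one pseudo-marginal M-H step in Python; the estimator `pi_hat(theta, k)`, the Gaussian random-walk proposal, and its scale are illustrative assumptions rather than choices from the paper:

```python
import math
import random

def pm_mh_step(theta, k, pi_hat, mu_sample, prop_std=0.5):
    """One joint pseudo-marginal M-H step in the spirit of Eq. (4): propose
    theta' from a symmetric random walk and a fresh fidelity K' ~ mu; the
    mu terms cancel, so the ratio involves only the estimator values."""
    theta_prop = theta + random.gauss(0.0, prop_std)
    k_prop = mu_sample()                   # K' ~ mu, independent of the current K
    a = pi_hat(theta_prop, k_prop) / pi_hat(theta, k)
    if random.random() < min(1.0, a):
        return theta_prop, k_prop          # accept the state and fidelity jointly
    return theta, k

# Sanity-check target: an "estimator" that is exact and ignores k,
# proportional to a standard normal density.
pi_hat_demo = lambda th, k: math.exp(-0.5 * th * th)
```

With a noisy estimator, this joint scheme is exactly the variant that can get stuck on a lucky overestimate, which is why a separate fidelity update is attractive.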
Note that the right-hand side of Equation (4) is the standard M-H ratio, but the target density $\pi_\infty$ is replaced with the estimator $\hat\pi_K$. However, standard pseudo-marginal MCMC using joint proposals of the state and fidelity can get stuck when the estimator is noisy and fail to accept new states. Thus, we apply the approach of Murray and Graham [33], which augments the Markov chain to include the randomness of the estimator via a separate update; here the randomness of the estimator arises from the fidelity K. Concretely, we construct a Markov chain that simulates from Equation (3) by alternating sampling between the conditional target densities $\bar\pi(K \mid \theta)$ and $\bar\pi(\theta \mid K)$ (steps 5 and 6 of Algorithm 1, respectively). We refer to this strategy as multi-fidelity MCMC (MF-MCMC). By conditioning on K = k, the update for the state $\theta$ becomes a standard deterministic update applied to a low-fidelity model $\hat\pi_k(\theta)$, and any appropriate MCMC update can be used here, making it straightforward to use complex MCMC methods, such as slice sampling and HMC. Similarly, any suitable MCMC update for the fidelity K can be used with the conditional target $\bar\pi(K \mid \theta)$. Many techniques can be used to construct an unbiased estimator of $\pi_\infty$ with randomness K; we describe a general approach in the next section. However, it is generally difficult to guarantee that the estimator is nonnegative, as required by pseudo-marginal MCMC. One technique, considered by Lin et al. [29] and Lyne et al. [30], is to instead sample from the target distribution induced by the absolute value of the estimator and apply a sign correction to the final Monte Carlo estimate in Equation (1), an approach borrowed from the quantum Monte Carlo literature, where it is necessary for modeling fermionic particles. This approach has been applied to the M-H algorithm, but we note that it can be applied much more broadly, as we do in this work.
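The conditional fidelity update (step 5 of Algorithm 1) can be sketched as a random-walk M-H move on the positive integers; `pi_hat_abs` below is a hypothetical evaluator of $|\hat\pi_k(\theta)|$:

```python
import random

def update_fidelity(k, theta, pi_hat_abs, mu, step=1):
    """Random-walk M-H update of K leaving pi(K | theta) ∝ mu(K)|pi_hat_K(theta)|
    invariant; proposals that leave the support {1, 2, ...} are rejected."""
    k_prop = k + random.choice([-step, step])
    if k_prop < 1:
        return k
    a = (mu(k_prop) * pi_hat_abs(theta, k_prop)) / (mu(k) * pi_hat_abs(theta, k))
    return k_prop if random.random() < min(1.0, a) else k
```

When $|\hat\pi_k(\theta)|$ happens to be flat in $k$, the invariant distribution of this update reduces to $\mu$ itself, which gives a quick correctness check.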
In problems where the estimator may be negative, we sample from the conditional target distributions using the absolute value of the estimator $|\hat\pi_K(\theta)|$, and we denote these conditionals by $\bar\pi(K \mid \theta) \propto \mu(K)\,|\hat\pi_K(\theta)|$ and $\bar\pi(\theta \mid K = k) \propto |\hat\pi_k(\theta)|$. The estimate in Equation (1) is then corrected using the signs $\sigma(\theta, k)$ of the evaluations of $\hat\pi_k(\theta)$:

$$\int h(\theta)\,\pi_\infty(\theta)\,d\theta \approx \frac{\sum_{t=1}^{T} h(\theta^{(t)})\,\sigma(\theta^{(t)}, K^{(t)})}{\sum_{t=1}^{T} \sigma(\theta^{(t)}, K^{(t)})} =: \hat I_T, \qquad (5)$$

where $\{(\theta^{(t)}, K^{(t)})\}_{t=1}^{T}$ are the sampled values from the joint distribution $\bar\pi(\theta, K) \propto |\hat\pi_K(\theta)|\,\mu(K)$. Importantly, the sign-corrected estimate still asymptotically leads to the desired estimate of the functional of interest. Let $\sigma(\theta, k)$ denote the sign of the estimator, such that $\hat\pi_k(\theta) = \sigma(\theta, k)\,|\hat\pi_k(\theta)|$. The estimator $\hat I_T$ in Equation (5) is formed using a Monte Carlo estimate of the functional after expanding it into its joint distribution, i.e.,

$$\int h(\theta)\,\pi_\infty(\theta)\,d\theta = \int \sum_{k=1}^{\infty} h(\theta)\,\hat\pi_k(\theta)\,\mu(k)\,d\theta = \frac{\int \sum_{k=1}^{\infty} h(\theta)\,\sigma(\theta, k)\,\bar\pi(\theta, k)\,d\theta}{\int \sum_{k=1}^{\infty} \sigma(\theta, k)\,\bar\pi(\theta, k)\,d\theta}. \qquad (6)$$

The full multi-fidelity MCMC algorithm with sign correction is summarized in Algorithm 1. We note that while the Markov chain no longer converges to a target with the marginal $\pi_\infty$, the final estimate after sign correction (which is the downstream goal of interest) converges to the quantity of interest due to Equation (6). While this may seem limiting if one is interested in the posterior itself, useful unbiased posterior summaries may still be obtained via the functional, such as the posterior mean, variance, quantiles, and histograms that may be used to visualize marginal distributions.

3 Unbiased low-fidelity estimators via randomized truncations

In this section, we discuss how to construct an unbiased estimator of $\pi_\infty(\theta)$, given a sequence of low-fidelity likelihoods with the property that $\pi_k(\theta) \to \pi_\infty(\theta)$ as $k \to \infty$.
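The sign-corrected average in Equation (5) is simple to implement; a sketch, where `samples` is a hypothetical list of $(\theta, \sigma)$ pairs recorded along the chain:

```python
def sign_corrected_estimate(samples, h):
    """Eq. (5): a sign-weighted average of h over the chain's samples, where
    each sample carries the sign sigma(theta, K) of its estimator value."""
    num = sum(sign * h(theta) for theta, sign in samples)
    den = sum(sign for _, sign in samples)
    return num / den
```

When every sign is +1 this reduces to the ordinary Monte Carlo average; negative signs both downweight the numerator and shrink the denominator.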
This estimator has the property that it requires a finite amount of computation with probability one, and it also has a tunable amount of expected computation per estimate, i.e., it uses low-fidelity density evaluations to estimate the perfect-fidelity target density. The central idea of this estimator has been used for decades, going back to John von Neumann and Stanislaw Ulam. More recently, it has found use in applications of inference and optimization in related work such as Glynn and Rhee [20], Lyne et al. [30], Beatson and Adams [5], and Jacob et al. [26]. First note that we can express the perfect-fidelity model as a telescoping sum of low-fidelity models: let $\pi_0(\theta) = 0$ and write

$$\pi_\infty(\theta) = \sum_{k=1}^{\infty} \big(\pi_k(\theta) - \pi_{k-1}(\theta)\big). \qquad (7)$$

The estimator $\hat\pi_K$ is then constructed by taking a random truncation $K \sim \mu$ of the infinite telescoping series. The sampled terms in the sum are then reweighted to ensure the estimator remains unbiased:

$$\hat\pi_K(\theta) = \sum_{k=1}^{\infty} w_{k,K}\,\big(\pi_k(\theta) - \pi_{k-1}(\theta)\big). \qquad (8)$$

Two approaches are commonly used to ensure that the resulting estimator is unbiased: weighted single-term estimators and Russian roulette estimators. The single-term estimator [30] is constructed by importance sampling a term from the series in Equation (7): the truncation level is drawn as $K \sim \mu$, and the Kth term is used to form the estimate

$$\hat\pi_K(\theta) = \mu(K)^{-1}\big(\pi_K(\theta) - \pi_{K-1}(\theta)\big). \qquad (9)$$

Thus, the weight in Equation (8) is $w_{k,K} = \mu(k)^{-1}\,\mathbb{1}(K = k)$. In the Russian roulette estimator, the remaining terms in the series are reweighted by their survival probabilities, i.e., $w_{k,K} = \big(1 - \sum_{k'=1}^{k-1} \mu(k')\big)^{-1}\,\mathbb{1}(K \ge k)$. The distribution $\mu$ controls the number of terms in the estimator, and a good proposal distribution should match the tails of the sequence of low-fidelity densities [5, 30, 37].

Algorithm 1: Multi-fidelity Monte Carlo with sign correction
1: Input: initial state $\theta$ and fidelity K, truncation distribution $\mu$
2: for t = 1, . . . , T do
3: Given the current K and $\theta$, form the estimator $\hat\pi_K(\theta) = \sum_{k=1}^{K} w_{k,K}\,(\pi_k(\theta) - \pi_{k-1}(\theta))$
4: Save the sign $\sigma(\theta, K) = \mathrm{sign}(\hat\pi_K(\theta))$
5: Update the fidelity K, leaving invariant the target conditional $\bar\pi(K \mid \theta) \propto \mu(K)\,|\hat\pi_K(\theta)|$
6: Update the state $\theta$, leaving invariant the target conditional $\bar\pi(\theta \mid K = k) \propto |\hat\pi_k(\theta)|$
7: end for
8: Output: samples $\{(\theta^{(t)}, K^{(t)})\}_{t=1}^{T}$ and estimate $\hat I_T = \sum_{t=1}^{T} \sigma^{(t)} h(\theta^{(t)}) \big/ \sum_{t=1}^{T} \sigma^{(t)}$

The ability to use cheaper models is a key feature of multi-fidelity inference, and the low-fidelity estimator provides a means to reduce the computational cost of multi-fidelity Monte Carlo. However, these estimators are an example of a class of methods that explores a compute-variance tradeoff: computationally cheaper estimates lead to higher variability. The resulting increase in variance slows down the convergence of the MCMC procedure and could lead to an overall less efficient method due to a reduced effective sample size.

4 Summary of the multi-fidelity MCMC recipe

Here we summarize the recipe for constructing a multi-fidelity Markov chain Monte Carlo algorithm. First, identify a sequence of increasing-fidelity target densities with the property that their limit is the desired perfect-fidelity density. Low-fidelity densities should be cheap, with the cost rapidly increasing within the sequence. In the context of Bayesian inference, it may be appropriate to focus the multi-fidelity aspects on the likelihood term and construct the target densities via, e.g., $\pi_k(\theta; \mathcal{D}) \propto \pi_0(\theta)\,L_k(\theta; \mathcal{D})$, where $\pi_0$ is the prior, $L_k$ is a low-fidelity likelihood, and $\mathcal{D}$ is the set of observations. This likelihood-based version is what we use in several of the experiments. Next, introduce a truncation distribution $\mu$ on $\mathbb{N}$.
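The two weighting schemes from Section 3 (the single-term estimator of Eq. (9) and the Russian roulette estimator) can be sketched as follows; `pi_seq(theta, k)` is a hypothetical evaluator of $\pi_k(\theta)$, with $\pi_0 := 0$:

```python
def single_term_estimator(pi_seq, theta, mu_pmf, mu_sample):
    """Single-term estimator (Eq. 9): draw K ~ mu, then importance-weight the
    K-th difference of the telescoping series by 1/mu(K)."""
    K = mu_sample()
    prev = pi_seq(theta, K - 1) if K > 1 else 0.0
    return (pi_seq(theta, K) - prev) / mu_pmf(K), K

def russian_roulette_estimator(pi_seq, theta, mu_pmf, mu_sample):
    """Russian roulette estimator: keep every term up to K ~ mu, weighting
    term k by the inverse survival probability 1 / P(K >= k)."""
    K = mu_sample()
    total, cdf, prev = 0.0, 0.0, 0.0
    for k in range(1, K + 1):
        survival = 1.0 - cdf                 # P(K >= k) under mu
        total += (pi_seq(theta, k) - prev) / survival
        cdf += mu_pmf(k)
        prev = pi_seq(theta, k)
    return total, K
```

As an illustration of tail matching: for the toy sequence $\pi_k = 1 - 2^{-k}$ with $\mu(k) = 2^{-k}$, both estimators are unbiased for the limit $\pi_\infty = 1$, and the single-term version even has zero variance because the term differences exactly match the tails of $\mu$.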
This truncation distribution should be chosen to balance the expected cost against the variance of the resulting estimator; our overall goal is to mostly use cheap low-fidelity densities, but high-variance estimators will presumably damage the mixing time and/or the asymptotic variance. Initialize the Markov chain with a reasonable choice for $\theta$ and a draw of K from the distribution $\mu$. Each step of the Markov chain simulation consists of an update to $\theta$ given K and an update of K given $\theta$. The update of $\theta$ given K can be performed using any standard MCMC algorithm, e.g., M-H, slice sampling, or HMC, applied to the low-fidelity estimator. It is important to use the absolute value of the estimator and keep track of its sign. The update of K given $\theta$ is also flexible, but it is reasonable to construct the update so that only a few values of K are considered in each step, as each of those fidelities will need to be evaluated. By default, we consider a simple random walk on the positive integers for our experiments. After running a sufficient number of steps of the Markov chain, use the sign-corrected estimator in Equation (5) to compute the expectation of the function $h(\theta)$.

5 Experiments

In all experiments, we use a random-walk M-H update to sample from the conditional $K \mid \theta$, and a truncation distribution $\mu(K) = \mathrm{Geometric}(K; \gamma_0)$. Additional experimental details are in Appendix F.

Figure 2: Conjugate Gaussian model. Panels: (a) two-stage M-H, (b) multi-fidelity M-H, (c) M-H sampler estimate vs. cost, (d) single-fidelity slice, (e) multi-fidelity slice, (f) slice sampler estimate vs. cost. Left: histograms for M-H (a, b) and slice sampling (d, e). Right: comparison of the posterior standard deviation estimate vs. computation for the M-H (c) and slice sampling (f) methods. Average posterior mean computed over 4 different chains.
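Putting the recipe together end to end, a minimal sketch with M-H updates for both conditionals and the single-term estimator; the Gaussian sequence $\pi_k = N(0, 1 + 2^{-k})$ and all tuning constants are illustrative choices of this sketch, not the paper's experimental setup:

```python
import math
import random

def mf_mcmc(pi_seq, mu_pmf, h, T=30000, theta0=0.0, k0=1):
    """Minimal multi-fidelity M-H in the style of Algorithm 1: alternate a
    random-walk update of K | theta and of theta | K, then accumulate the
    sign-corrected estimate of E[h(theta)] under pi_infinity (Eq. 5)."""
    def pi_hat(theta, k):   # single-term estimator with its randomness fixed at k
        prev = pi_seq(theta, k - 1) if k > 1 else 0.0
        return (pi_seq(theta, k) - prev) / mu_pmf(k)

    theta, k = theta0, k0
    num = den = 0.0
    for _ in range(T):
        # Update K | theta, targeting pi(K | theta) ∝ mu(K)|pi_hat_K(theta)|.
        # (Writing "u * denom < numer" avoids dividing by a near-zero estimate.)
        kp = k + random.choice([-1, 1])
        if kp >= 1 and (random.random() * mu_pmf(k) * abs(pi_hat(theta, k))
                        < mu_pmf(kp) * abs(pi_hat(theta, kp))):
            k = kp
        # Update theta | K, targeting pi(theta | K = k) ∝ |pi_hat_k(theta)|.
        tp = theta + random.gauss(0.0, 0.5)
        if random.random() * abs(pi_hat(theta, k)) < abs(pi_hat(tp, k)):
            theta = tp
        s = math.copysign(1.0, pi_hat(theta, k))   # sign correction
        num += s * h(theta)
        den += s
    return num / den

# Illustrative run: pi_k = N(0, 1 + 2^{-k}) -> pi_inf = N(0, 1), so the
# sign-corrected estimate of E[theta^2] should approach 1.
random.seed(0)
pi_seq = lambda th, k: (math.exp(-th * th / (2.0 * (1.0 + 2.0 ** (-k))))
                        / math.sqrt(2.0 * math.pi * (1.0 + 2.0 ** (-k))))
est = mf_mcmc(pi_seq, lambda k: 0.5 ** k, lambda th: th * th)
```

Because the differences $\pi_k - \pi_{k-1}$ decay geometrically here, a geometric $\mu$ keeps the estimator's tails under control, matching the guidance above.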
5.1 Toy conjugate Gaussian models

In order to understand the behavior of MF-MCMC on a simple example of Bayesian inference, we first examine an example where the computational cost of evaluating the sequence of low-fidelity likelihoods does not increase with k. Consider a perfect-fidelity likelihood $L_\infty(\theta) = N(x; \theta, \sigma_\infty^2)$ and a low-fidelity likelihood $L_k(\theta) = N(x; \theta, \sigma_k^2)$, where $\sigma_k^2 \to \sigma_\infty^2$. The prior is $\pi_0(\theta) = N(\theta \mid 0, 1)$, and so a closed-form posterior density can be computed. Here we consider the sequence $\sigma_k^2 = 1 + 2/k^2$ and $\sigma_\infty^2 = 1$. In Figure 2 we compare the results of single-fidelity and multi-fidelity M-H and slice sampling, as well as the two-stage M-H algorithm summarized in Appendix C.4. We consider two two-stage M-H settings, with high and low fidelities of $\{k_{\mathrm{HF}}, k_{\mathrm{LF}}\} = \{1000, 10\}$ and $\{k_{\mathrm{HF}}, k_{\mathrm{LF}}\} = \{100, 5\}$. The histograms show the bias of each method after simulating 10,000 samples, and the solid gray curve denotes the exact posterior density. We also compute a measure of total cost and a running average of the estimate of the posterior standard deviation functional, where the dotted black line denotes the true value. The number of cost-adjusted likelihood evaluations was computed by upweighting each likelihood evaluation by its fidelity. Here the multi-fidelity methods typically converge to a similar value as the single high-fidelity methods but in fewer cost-adjusted likelihood evaluations.

5.2 Log-Gaussian Cox processes

We examine an application of MF-MCMC to the log-Gaussian Cox process (LGCP) model [32], where the perfect-fidelity model is a function of an integral and the lower-fidelity sequence of models arises from k-point quadrature estimates. Let $f \sim \mathcal{GP}(0, \kappa_\ell)$, where $\kappa_\ell(x, x') = \exp\big({-\frac{1}{2\ell^2}\|x - x'\|_2^2}\big)$ and where $\ell$ is a lengthscale hyperparameter. Consider an inhomogeneous Poisson process on $\mathcal{X} \subseteq \mathbb{R}^D$ with intensity $\lambda(x) = e^{f(x)}$. Given a random set of points $\{X_n\}_{n=1}^{N}$, the perfect-fidelity likelihood is

$$L_\infty(f) = \exp\Big(\int_{\mathcal{X}} \big(1 - e^{f(x)}\big)\,dx\Big) \prod_{n=1}^{N} e^{f(X_n)}. \qquad (10)$$

Typically, inference in the LGCP uses a grid-based approximation of Equation (10), where the points are binned into counts and modeled with a Poisson likelihood [12, 34, 42], resulting in a biased posterior. Because the likelihood depends on a high-dimensional latent Gaussian vector, we perform inference for f using the elliptical slice sampling (ESS) algorithm (see Appendix C.3). We approximate the integral in Equation (10) with a trapezoidal quadrature rule $I_k$, where the number of quadrature points is a linear function of k.

Figure 3: Coal mining disasters, 1850-1963. Left: posterior mean of the rate function at the observed data points. Right: posterior mean of the rate function at T = 1862 vs. computation.

Figure 4: Lotka-Volterra system parameter identification. (a) Sampled posterior trajectories under a fixed computational budget. (b) Posterior mean estimate vs. computational cost. The fidelity represents (a function of) the step size dt of the ODE solver.

We apply multi-fidelity and single-fidelity ESS algorithms to a coal mining disasters data set (Carlin et al. [9]). The data contain the dates of 191 coal mine explosions that killed ten or more men in Britain between March 15, 1851 and March 22, 1962. Figure 3 (left) shows the estimated mean intensity and standard deviation on the coal mining disasters data for one run of multi-fidelity ESS and single-fidelity ESS with k = 10, 100, 1000 quadrature points. In this plot, the high-fidelity (k = 1000) and multi-fidelity posterior mean and standard deviation estimates match well, and the bias in the lowest-fidelity (k = 10) estimate is apparent. We also computed the cost-adjusted number of likelihood evaluations performed in each iteration of MF-ESS and SF-ESS. Figure 3 (right) shows the average estimated mean intensity at the time step t = 1862 for the three models against the average cost-adjusted number of likelihood evaluations per iteration.
We observe that the multi-fidelity and high-fidelity estimates are close after many iterations of sampling, but the multi-fidelity estimate converges with less computation.

5.3 Bayesian ODE system identification

We now apply the MF-MCMC approach to Bayesian system identification for the Lotka-Volterra ODE. Let $u(t) \ge 0$ represent the population size of the prey species at time t, and $v(t) \ge 0$ the population size of the predator species. The dynamics of the two species given parameters $\alpha, \beta, \gamma, \delta \ge 0$ are given by a pair of first-order ODEs:

$$\frac{du}{dt} = (\alpha - \beta v)\,u = \alpha u - \beta u v, \qquad \frac{dv}{dt} = (-\gamma + \delta u)\,v = -\gamma v + \delta u v. \qquad (11)$$

Figure 5: PDE-constrained optimization with a linear heat equation. Estimate for $\alpha$ (left) and $\beta$ (right) vs. computation. The black dotted lines denote the true values of $\alpha$ and $\beta$.

System identification solves the inverse problem of estimating the parameters $\theta = (\alpha, \beta, \gamma, \delta)$ of the ODE system. Taking a Bayesian approach, we specify a noise model for the observed data and priors on the parameters, and we use MCMC to infer a distribution over the solution. For simplicity, we assume that the initial conditions are known and fix $\sigma = 0.25$. Define $z_n := (u(t_n), v(t_n))$ and let $z_1(\theta), \ldots, z_N(\theta)$ be the solutions to the Lotka-Volterra differential equations at times $t_1, \ldots, t_N$ given the initial conditions and the system parameters $\theta = (\alpha, \beta, \gamma, \delta)$. Suppose we have observations arising from $\log(y_n) = \log(z_n) + \epsilon_n$, where $\epsilon_n \sim N(0, \sigma^2 I)$. The low-fidelity likelihood is a function of a numerical solution of the ODE using a time step of size dt (Equation (F.3)). We compared the performance of a multi-fidelity elliptical slice sampler to single-fidelity elliptical slice samplers with step sizes $dt = 1\times10^{-5}, 1\times10^{-4}$. For the ODE solver, we considered both an Euler solver and a 4th-order Runge-Kutta solver (Figure F.1). Figure 4a shows 200 trajectories corresponding to parameters sampled from the posterior distributions under each method using the Euler solver.
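The Euler forward model used here is only a few lines; a sketch, with placeholder parameter values rather than posterior draws from the experiment:

```python
def lotka_volterra_euler(theta, z0, t_end, dt):
    """Forward-Euler integration of the Lotka-Volterra system in Eq. (11);
    a coarser step size dt gives a cheaper, lower-fidelity solution."""
    alpha, beta, gamma, delta = theta
    u, v = z0
    for _ in range(round(t_end / dt)):
        du = (alpha - beta * v) * u
        dv = (-gamma + delta * u) * v
        u, v = u + dt * du, v + dt * dv
    return u, v

# Refining dt shrinks the (first-order) discretization error, giving the
# fidelity hierarchy the sampler exploits.
theta = (1.0, 0.1, 1.5, 0.075)   # placeholder (alpha, beta, gamma, delta)
coarse = lotka_volterra_euler(theta, (10.0, 5.0), 1.0, 1e-2)
fine = lotka_volterra_euler(theta, (10.0, 5.0), 1.0, 1e-4)
```

Since Euler's method is first-order, halving dt roughly halves the error, so successive solutions form a convergent low-fidelity sequence.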
Figure 4b shows posterior mean estimates for each system parameter. We observe that the estimates from the multi-fidelity slice sampler approach the estimates reported by Howard [25] (black dotted lines) in less time than the single-fidelity samplers.

5.4 PDE-constrained optimization

We now consider global optimization of a PDE-constrained objective, where an expensive physical simulation is run repeatedly in an outer-loop problem. A common approach for global optimization is simulated annealing, which has been applied to constrained global optimization [40]. Consider a model for heat flow in a thin rod of length L with spatial coordinate $x \in [0, L]$. Let $u(x, t)$ represent the temperature in the rod at position x and time t, and let $u_\star$ represent a desired target temperature. The goal is to minimize a loss function subject to u satisfying a linear heat equation. This objective, along with an initial condition and homogeneous boundary conditions, can be summarized as:

$$\mathrm{minimize}_{u}\ \ \|u - u_\star\|_2^2 \quad \text{subject to} \quad \frac{\partial u}{\partial t} = \alpha \frac{\partial^2 u}{\partial x^2} + 2\beta u, \qquad (12)$$
$$u(x, 0) = \sin(\pi x / 2), \quad u(0, t) = u(L, t) = 0, \quad x \in [0, L],\ t \in [0, T],$$

where $\alpha, \beta > 0$ are the system parameters. The goal is to find $\theta = (\alpha, \beta)$ that minimizes the objective and satisfies the constraints. To solve the PDE, we discretize the domain into a grid of spacing $\Delta x$ and represent the second derivative using the central difference formula. This induces a system of ODEs that we solve numerically using a Tsitouras 5/4 Runge-Kutta method, setting $\Delta t = 0.4\,\Delta x^2$ so as to satisfy a CFL stability condition. Here the fidelity of the problem is given by the size of the spatial discretization $\Delta x$, which in turn controls $\Delta t$. We compared against two single-fidelity discretizations of the spatial coordinate, with $\Delta x = 5 \times 10^{-3}, 1 \times 10^{-2}$. The results are in Figure 5, where we plot two of the MF results with $\gamma_0 = 0.1, 0.25$. In these examples, the multi-fidelity estimates converge faster than the single-fidelity estimates in wall-clock time.
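A sketch of the discretized forward model: central differences in space, with explicit Euler time stepping standing in for the Tsitouras 5/4 method used in the experiment. The grid size, horizon, and the $\Delta t = 0.4\,\Delta x^2/\alpha$ restriction (which keeps the explicit scheme stable) are assumptions of this sketch:

```python
import math

def heat_step_euler(u, alpha, beta, dx, dt):
    """One explicit-Euler step of the semi-discrete system for
    u_t = alpha * u_xx + 2 * beta * u, with u = 0 held at both endpoints."""
    new = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        u_xx = (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (dx * dx)  # central difference
        new[i] = u[i] + dt * (alpha * u_xx + 2.0 * beta * u[i])
    return new

def solve_heat(alpha, beta, L=2.0, nx=41, t_end=0.1):
    """Integrate to t_end from u(x, 0) = sin(pi x / 2); the grid resolution
    nx sets the fidelity, and dt is tied to dx for stability."""
    dx = L / (nx - 1)
    dt = 0.4 * dx * dx / alpha          # keeps alpha * dt / dx^2 = 0.4 <= 0.5
    n = math.ceil(t_end / dt)
    dt = t_end / n
    u = [math.sin(math.pi * i * dx / 2.0) for i in range(nx)]
    for _ in range(n):
        u = heat_step_euler(u, alpha, beta, dx, dt)
    return u
```

For $\beta = 0$ the exact solution is $\sin(\pi x/2)\,e^{-\alpha (\pi/2)^2 t}$, which gives a convenient accuracy check at the midpoint $x = 1$.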
5.5 Gaussian process regression parameter inference

Let X ∈ R^{N×D}, and consider a Gaussian process regression model with a squared exponential kernel:

f ∼ GP(0, kθ),  y = f(X) + ϵ,  ϵ ∼ N(0, σ₀²),  kθ(x, x′) = exp(−‖x − x′‖₂² / (2θ²)),

Figure 6: Parameter inference in a Gaussian process regression model. (a) Histograms of the high-fidelity, low-fidelity, two-stage, and multi-fidelity posterior distributions of the parameter θ. (b) Posterior mean estimate vs. computational cost.

where we assume σ₀² is known. Let Σθ := [kθ(xi, xj)]_{i∈[N], j∈[N]}. In many applications of Gaussian process modeling, one is interested in integrating out the parameter θ using MCMC. Computing the posterior π(θ | X, y) is expensive because each evaluation of the likelihood p(y | X, θ) = N(y | 0, Σθ + σ₀²I) involves computing a determinant and solving a linear system with respect to the matrix Σθ + σ₀²I, which has an O(N³) computational cost with standard methods (e.g., a Cholesky decomposition). For simplicity, we only consider the fidelity of the solution to the linear system, but we note that the determinant can be handled using the approach described in Potapczynski et al. [37]. Additional derivations and details are in Appendix F.5. We generate synthetic data from the GP model with N = 100, σ₀² = 1, and lengthscale θ0 = 45. For the GP model, we use a log-normal prior on θ (Equation (F.5)) with parameters ν0 = 3.8 and ν1 = 0.03. In Figure 6, we compare single-fidelity, multi-fidelity, and two-stage M-H samplers. The low-fidelity likelihood sequence was constructed by computing the solution to the linear system using a preconditioned conjugate gradient solver run for k steps. The single-fidelity likelihoods were a high-fidelity likelihood (k = N) and a low-fidelity likelihood (k ≪ N), and the multi-fidelity M-H samplers used γ0 = 0.1. The two-stage M-H approach used high and low fidelities of k ∈ {100, 5}.
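A minimal sketch of this fidelity construction: truncate a conjugate gradient solve of (Σθ + σ₀²I)x = y after k iterations. The code uses plain (unpreconditioned) CG, whereas the paper's experiments use a preconditioned solver, and the kernel matrix below is an illustrative example rather than the experimental data.

```python
import numpy as np

def cg_solve(A, b, k):
    """Conjugate gradients run for at most k iterations; k indexes fidelity.
    Small k gives a cheap approximate solve; k = N is exact in exact arithmetic."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(k):
        Ap = A @ p
        step = rs / (p @ Ap)
        x = x + step * p
        r = r - step * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < 1e-12:  # converged early
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Low-fidelity (k = 5) vs. high-fidelity (k = N) solves of (Sigma + sigma0^2 I) x = y,
# with an illustrative squared exponential kernel matrix.
rng = np.random.default_rng(0)
N, theta, sigma0_sq = 100, 2.0, 1.0
xs = rng.uniform(0.0, 10.0, size=(N, 1))
Sigma = np.exp(-0.5 * (xs - xs.T) ** 2 / theta**2)
A = Sigma + sigma0_sq * np.eye(N)
y = rng.normal(size=N)
x_lo = cg_solve(A, y, k=5)   # cheap, approximate
x_hi = cg_solve(A, y, k=N)   # effectively exact
```

Each CG iteration costs one matrix-vector product, so k directly trades accuracy against computation, which is what makes it a natural fidelity parameter for the low-fidelity likelihood sequence.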
For all methods, we use an M-H sampler with T = 50,000 iterations. In the histograms, we observe that the high-fidelity, multi-fidelity, and two-stage approaches tend to produce similar posteriors, while the low-fidelity sampler has more noticeable bias with respect to the high-fidelity histogram. The estimate produced by the multi-fidelity samplers converged in fewer cost-adjusted likelihood evaluations than the high-fidelity and two-stage approaches.

6 Discussion and future work

In this work, we introduced a class of multi-fidelity MCMC algorithms that use a low-fidelity unbiased estimator to reduce the computational cost of sampling while maintaining the desired limiting target distribution of the Markov chain. In particular, we demonstrated the use of our framework with more advanced MCMC algorithms beyond M-H, such as slice sampling, and in additional settings such as optimization. Our results show a reduction in computation while producing accurate solutions compared with high-fidelity models, provided it is possible to construct a target estimator that is not too noisy. Many future directions remain. First, applying MF-MCMC to large-scale, expensive applications poses many computational challenges. Making the method more robust for specialized problems is important, especially if the estimator is heavy-tailed. Thus, constructing good proposal distributions matching properties of the low-fidelity target sequence is crucial, especially for application to high-dimensional problems. In addition, we have thus far focused on target densities in which there is a single computation whose fidelity is varied. However, in many settings, there may be target densities with multiple computations that converge at different rates; for example, the target density may include both an intractable integral and the solution of a linear system.
Our framework can be extended to that setting by adjusting the proposal distribution, and it would be useful to understand how these rates impact the convergence properties of the sampler.

Acknowledgments and Disclosure of Funding

This work was partially supported by NSF grants IIS-2007278 and OAC-2118201. D. Cai was supported in part by a Google Ph.D. Fellowship in Machine Learning.

References

[1] N. M. Alexandrov, J. E. Dennis, R. M. Lewis, and V. Torczon. A trust-region framework for managing the use of approximation models in optimization. Structural Optimization, 15(1):16–23, 1998.
[2] C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697–725, 2009.
[3] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
[4] E. Arian, M. Fahl, and E. W. Sachs. Trust-region proper orthogonal decomposition for flow control. In IEEE Conference on Decision and Control, 2000.
[5] A. Beatson and R. P. Adams. Efficient optimization of loops and limits with randomized telescoping sums. In International Conference on Machine Learning, pages 534–543. PMLR, 2019.
[6] M. A. Beaumont. Estimation of population growth or decline in genetically monitored populations. Genetics, 164(3):1139–1160, 2003.
[7] M. Biron-Lattes, A. Bouchard-Côté, and T. Campbell. Pseudo-marginal inference for CTMCs on infinite spaces via monotonic likelihood approximations. Journal of Computational and Graphical Statistics, 2022.
[8] L. Brevault, M. Balesdent, and A. Hebbal. Overview of Gaussian process based multi-fidelity techniques with variable relationship between fidelities. arXiv preprint arXiv:2006.16728, 2020.
[9] B. P. Carlin, A. E. Gelfand, and A. F. Smith. Hierarchical Bayesian analysis of changepoint problems.
Journal of the Royal Statistical Society: Series C (Applied Statistics), 41(2):389–405, 1992.
[10] J. A. Christen and C. Fox. Markov chain Monte Carlo using an approximation. Journal of Computational and Graphical Statistics, 14(4):795–810, 2005.
[11] T. Cui, Y. M. Marzouk, and K. E. Willcox. Data-driven model reduction for the Bayesian solution of inverse problems. International Journal for Numerical Methods in Engineering, 102(5):966–990, 2015.
[12] P. J. Diggle, P. Moraga, B. Rowlingson, and B. M. Taylor. Spatial and spatio-temporal log-Gaussian Cox processes: extending the geostatistical paradigm. Statistical Science, 28(4):542–563, 2013.
[13] T. J. Dodwell, C. Ketelsen, R. Scheichl, and A. L. Teckentrup. A hierarchical multilevel Markov chain Monte Carlo algorithm with applications to uncertainty quantification in subsurface flow. SIAM/ASA Journal on Uncertainty Quantification, 3(1):1075–1108, 2015.
[14] Y. Efendiev, T. Hou, and W. Luo. Preconditioning Markov chain Monte Carlo simulations using coarse-scale models. SIAM Journal on Scientific Computing, 28(2):776–803, 2006.
[15] M. Fahl and E. W. Sachs. Reduced order modelling approaches to PDE-constrained optimization based on proper orthogonal decomposition. In Large-Scale PDE-Constrained Optimization, pages 268–280. Springer, 2003.
[16] A. Georgoulas, J. Hillston, and G. Sanguinetti. Unbiased Bayesian inference for population Markov jump processes via random truncations. Statistics and Computing, 27(4):991–1002, 2017.
[17] A. Gessner, J. Gonzalez, and M. Mahsereci. Active multi-information source Bayesian quadrature. In Uncertainty in Artificial Intelligence, pages 712–721. PMLR, 2020.
[18] M. B. Giles. Multilevel Monte Carlo path simulation. Operations Research, 56(3):607–617, 2008.
[19] M. B. Giles. Multilevel Monte Carlo methods. Monte Carlo and Quasi-Monte Carlo Methods, pages 83–103, 2013.
[20] P. W. Glynn and C.-H. Rhee. Exact estimation for Markov chain equilibrium expectations.
Journal of Applied Probability, 51(A):377–389, 2014.
[21] R. B. Gramacy and H. K. Lee. Adaptive design and analysis of supercomputer experiments. Technometrics, 51(2):130–145, 2009.
[22] G. W. Gundersen, D. Cai, C. Zhou, B. E. Engelhardt, and R. P. Adams. Active multi-fidelity Bayesian online changepoint detection. In Uncertainty in Artificial Intelligence, pages 1916–1926. PMLR, 2021.
[23] W. Hackbusch. Multi-grid Methods and Applications, volume 4. Springer Science & Business Media, 2013.
[24] D. Higdon, H. Lee, and Z. Bi. A Bayesian approach to characterizing uncertainty in inverse problems using coarse and fine-scale information. IEEE Transactions on Signal Processing, 50(2):389–399, 2002.
[25] P. Howard. Modeling basics. Lecture Notes for Math, 442, 2009.
[26] P. E. Jacob, J. O'Leary, and Y. F. Atchadé. Unbiased Markov chain Monte Carlo methods with couplings. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(3):543–600, 2020.
[27] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.
[28] S. Li, W. Xing, R. Kirby, and S. Zhe. Multi-fidelity Bayesian optimization via deep neural networks. Advances in Neural Information Processing Systems, 33:8521–8531, 2020.
[29] L. Lin, K. Liu, and J. Sloan. A noisy Monte Carlo algorithm. Physical Review D, 61(7):074505, 2000.
[30] A.-M. Lyne, M. Girolami, Y. Atchadé, H. Strathmann, and D. Simpson. On Russian roulette estimates for Bayesian inference with doubly-intractable likelihoods. Statistical Science, 30(4):443–467, 2015.
[31] A. March and K. Willcox. Constrained multifidelity optimization using model calibration. Structural and Multidisciplinary Optimization, 46(1):93–109, 2012.
[32] J. Møller, A. R. Syversveen, and R. P. Waagepetersen. Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25(3):451–482, 1998.
[33] I. Murray and M. Graham. Pseudo-marginal slice sampling.
In Artificial Intelligence and Statistics, pages 911–919. PMLR, 2016.
[34] I. Murray, R. Adams, and D. MacKay. Elliptical slice sampling. In Artificial Intelligence and Statistics, pages 541–548. PMLR, 2010.
[35] A. Palizhati, S. B. Torrisi, M. Aykol, S. K. Suram, J. S. Hummelshøj, and J. H. Montoya. Agents for sequential learning using multiple-fidelity data. Scientific Reports, 12(1):1–13, 2022.
[36] B. Peherstorfer, K. Willcox, and M. Gunzburger. Survey of multifidelity methods in uncertainty propagation, inference, and optimization. SIAM Review, 60(3):550–591, 2018.
[37] A. Potapczynski, L. Wu, D. Biderman, G. Pleiss, and J. P. Cunningham. Bias-free scalable Gaussian processes via randomized truncations. In International Conference on Machine Learning, pages 8609–8619. PMLR, 2021.
[38] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Inferring solutions of differential equations using noisy multi-fidelity data. Journal of Computational Physics, 335:736–746, 2017.
[39] T. Robinson, M. S. Eldred, K. E. Willcox, and R. Haimes. Surrogate-based optimization using multifidelity models with variable parameterization and corrected space mapping. AIAA Journal, 46(11):2814–2822, 2008.
[40] H. E. Romeijn and R. L. Smith. Simulated annealing for constrained global optimization. Journal of Global Optimization, 5(2):101–126, 1994.
[41] J. Song, Y. Chen, and Y. Yue. A general framework for multi-fidelity Bayesian optimization with Gaussian processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3158–3167. PMLR, 2019.
[42] M. Teng, F. Nathoo, and T. D. Johnson. Bayesian computation for log-Gaussian Cox processes: a comparative analysis of methods. Journal of Statistical Computation and Simulation, 87(11):2227–2252, 2017.
[43] M. Troyer and U.-J. Wiese. Computational complexity and fundamental limitations to fermionic quantum Monte Carlo simulations. Physical Review Letters, 94(17):170201, 2005.
[44] D. J. Warne, T. P. Prescott, R. E.
Baker, and M. J. Simpson. Multifidelity multilevel Monte Carlo to accelerate approximate Bayesian parameter inference for partially observed stochastic processes. arXiv preprint arXiv:2110.14082, 2021.
[45] J. Wu, S. Toscano-Palmerin, P. I. Frazier, and A. G. Wilson. Practical multi-fidelity Bayesian optimization for hyperparameter tuning. In Uncertainty in Artificial Intelligence, pages 788–798. PMLR, 2020.
[46] X. Xi, F.-X. Briol, and M. Girolami. Bayesian quadrature for multiple related integrals. In International Conference on Machine Learning, pages 5373–5382. PMLR, 2018.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] We discuss this in the discussion section.
(c) Did you discuss any potential negative societal impacts of your work? [Yes] We include a section in Appendix G.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] We provide details in Appendix E.
(b) Did you include complete proofs of all theoretical results? [Yes] We provide details in Appendix E.
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See Appendix F.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Appendix F for full details.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] The plots are generated over multiple random seeds.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix F.
4.
If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [N/A]
(b) Did you mention the license of the assets? [N/A]
(c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]