Multi-fidelity Monte Carlo: a pseudo-marginal approach

Diana Cai, Department of Computer Science, Princeton University, dcai@cs.princeton.edu
Ryan P. Adams, Department of Computer Science, Princeton University, rpa@princeton.edu

Markov chain Monte Carlo (MCMC) is an established approach for uncertainty quantification and propagation in scientific applications. A key challenge in applying MCMC to scientific domains is computation: the target density of interest is often a function of expensive computations, such as a high-fidelity physical simulation, an intractable integral, or a slowly-converging iterative algorithm. Thus, using an MCMC algorithm with an expensive target density becomes impractical, as these expensive computations need to be evaluated at each iteration of the algorithm. In practice, these computations are often approximated via a cheaper, low-fidelity computation, leading to bias in the resulting target density. Multi-fidelity MCMC algorithms combine models of varying fidelities in order to obtain an approximate target density with lower computational cost. In this paper, we describe a class of asymptotically exact multi-fidelity MCMC algorithms for the setting where a sequence of models of increasing fidelity can be computed that approximates the expensive target density of interest. We take a pseudo-marginal MCMC approach for multi-fidelity inference that utilizes a cheaper, randomized-fidelity unbiased estimator of the target fidelity constructed via random truncation of a telescoping series of the low-fidelity sequence of models. Finally, we discuss and evaluate the proposed multi-fidelity MCMC approach on several applications, including log-Gaussian Cox process modeling, Bayesian ODE system identification, PDE-constrained optimization, and Gaussian process parameter inference.
1 Introduction

Simulation and computational modeling play a key role in science, engineering, economics, and many other areas. When these models are high-quality and accurate, they are important for scientific discovery, design, and data-driven decision making. However, the ability to accurately model complex physical phenomena often comes with a significant cost: many models involve expensive computations that then need to be evaluated repeatedly in, for instance, a sampling or optimization algorithm. Examples of model classes with expensive computations include intractable integrals or sums, expensive quantum simulations [43], expensive numerical simulations arising from partial differential equations (PDEs) [38], and large systems of ordinary differential equations (ODEs). In many situations, one has the ability to trade off computational cost against fidelity or accuracy in the result. Such a tradeoff might arise from the choice of discretization or the number of basis functions when solving a PDE, or the number of quadrature points when estimating an integral. It is often possible to leverage lower-fidelity models to help accelerate high-quality solutions, e.g., by using multigrid methods [23] for spatial discretizations. More generally, multi-fidelity methods combine multiple models of varying cost and fidelity to accelerate computational algorithms and have been applied to solving inverse problems [11, 24, 38], trust region optimization [1, 4, 15, 31, 39], Bayesian optimization [8, 21, 27, 28, 41, 45], Bayesian quadrature [17, 46], and sequential learning [22, 35].

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Figure 1: Examples of low-fidelity sequences of models. (a) Sequence of trapezoid quadrature estimates $I_k$, where $I_k$ is the trapezoid rule with $2^k$ trapezoids ($k = 1, 2, 3$).
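The trapezoid-rule sequence of Figure 1a is easy to reproduce; a minimal sketch in Python, where the integrand (with $\int_0^\pi \sin x \, dx = 2$) is an illustrative choice rather than one from the paper:

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoid rule with n trapezoids on [a, b]."""
    h = (b - a) / n
    interior = sum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * (f(a) + f(b)) + interior)

# Low-fidelity sequence I_k: the trapezoid rule with 2^k trapezoids.
# As the fidelity k grows, I_k converges to the true integral (here 2).
I = [trapezoid(math.sin, 0.0, math.pi, 2 ** k) for k in range(1, 9)]
errors = [abs(I_k - 2.0) for I_k in I]
```

The error shrinks monotonically with $k$ while the cost doubles at each level, which is exactly the cost/fidelity tradeoff such a sequence exposes.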
(b) Lotka-Volterra ODE solutions for prey $u(t)$ (blue) and predator $v(t)$ (red) using Euler's method with step size $dt = 1/k$.

One critically important tool for scientific and engineering computation is Markov chain Monte Carlo (MCMC), which is widely used for uncertainty quantification, optimization, and integration. MCMC methods are recipes for constructing a Markov chain with some desired target distribution as the limiting distribution. Pseudo-random numbers are used to simulate transitions of the Markov chain in order to produce samples from the target distribution. However, MCMC often becomes impractical for high-fidelity models, where a single step of the Markov chain may, for instance, involve a numerical simulation that takes hours or days to complete. Multi-fidelity methods for MCMC focus on constructing Markov chain transition operators that are sometimes able to use inexpensive low-fidelity evaluations instead of expensive high-fidelity evaluations. The goal is to increase the effective number of samples generated by the algorithm, given a constrained computational budget. A large focus of the multi-fidelity MCMC literature is on two-stage Metropolis-Hastings (M-H) methods [10, 14], which use a single low-fidelity model for early rejection of a proposed sample, thereby often short-circuiting the evaluation of the expensive, high-fidelity model. However, there are several limitations of two-stage multi-fidelity Monte Carlo. First, in many applications, a hierarchy of cheaper, low-fidelity models is available; for instance, in the case of integration, k-point quadrature estimates form a hierarchy of low-fidelity models, and in the case of a PDE, the hierarchy arises from varying the discretization. Thus, the two-stage approach does not fully utilize the availability of a hierarchy of fidelities and may be more suitable for settings where the high- and low-fidelity models are not hierarchically related, e.g., semi-empirical methods vs. Hartree-Fock in computational chemistry.
In addition, in such applications, there is often a limiting model of interest, such as a continuous function that the low-fidelity discretizations approximate. Two-stage MCMC does not asymptotically sample from this limiting target density and will at best sample from an approximation of the biased, high-fidelity posterior. Finally, the two-stage method does not generalize naturally to more sophisticated MCMC algorithms such as slice sampling and Hamiltonian Monte Carlo (HMC). We propose a class of multi-fidelity MCMC methods designed for applications with a hierarchy of low-fidelity models available. More specifically, we assume access to a sequence of low-fidelity models that converge to a perfect-fidelity model in the limit. Within an MCMC algorithm, we can approximate the perfect-fidelity target density with an unbiased estimator constructed from a randomized truncation of the infinite telescoping series of low-fidelity target densities. This class of multi-fidelity MCMC is an example of a pseudo-marginal MCMC (PM-MCMC) algorithm: the unbiased estimator guarantees that the algorithm is asymptotically exact, in that the limiting distribution recovers the perfect-fidelity target distribution as its marginal distribution. Our approach introduces the fidelity of a model as an auxiliary random variable that is evolved separately from the target variable according to its own conditional target distribution; this technique can be used in conjunction with any suitable MCMC update that leaves the conditional update for the target variable of interest invariant, such as M-H, slice sampling, elliptical slice sampling, or Hamiltonian Monte Carlo. We apply the pseudo-marginal multi-fidelity MCMC approach to several problems, including log-Gaussian Cox process modeling, Bayesian ODE system identification, PDE-constrained optimization, and Gaussian process parameter inference.

Related work.
Multi-fidelity MCMC methods are commonly applied in a two-stage procedure, where the goal is to reduce the computational cost of using a single expensive high-fidelity model by using a cheap low-fidelity model as a low-pass filter for a delayed acceptance/rejection algorithm [10, 11, 14]; see Peherstorfer et al. [36] for a survey. Higdon et al. [24] propose coupling a high-fidelity Markov chain with a low-fidelity Markov chain via a product chain. In contrast, our approach aims to sample from a perfect-fidelity target density while reducing computational cost; two-stage MCMC algorithms result in biased estimates with respect to this target density. A related class of methods is multilevel Monte Carlo [13, 18, 19, 44], which uses a hierarchy of multi-fidelity models for Monte Carlo estimation by expressing the expectation of a high-fidelity model as a telescoping sum of low-fidelity models. Dodwell et al. [13] use the M-H algorithm to form the multilevel Monte Carlo estimates, simulating from a separate Markov chain for each level of the telescoping sum. In practice, multilevel Monte Carlo requires choosing a finite number of fidelities, inducing bias in the estimator with respect to the (limiting) perfect-fidelity model. In contrast, our method uses a randomized fidelity within a single Markov chain with the perfect-fidelity model as the target. Our approach applies pseudo-marginal MCMC to multi-fidelity problems. There is a rich literature developing pseudo-marginal MCMC methods [2, 6] for so-called doubly-intractable likelihoods, which are likelihoods that are intractable to evaluate. Several approaches in the pseudo-marginal MCMC literature are particularly relevant to our work. The first are the PM-MCMC methods introduced by Lyne et al. [30], which describe a class of pseudo-marginal M-H methods that use Russian roulette estimators to obtain unbiased estimators of the likelihood.
However, this method samples the variable of interest jointly with the auxiliary randomness, which often leads to sticking. Alternatively, several methods have considered sampling the randomness separately. The idea of clamping random numbers is explored in depth by Andrieu et al. [3] and Murray and Graham [33], with the latter applying these ideas to pseudo-marginal slice sampling. In particular, our approach applies these ideas to the specific setting of multi-fidelity models, where the random fidelity is treated as an auxiliary variable. Finally, while our approach applies to doubly-intractable problems, we are also motivated by a larger class of multi-fidelity problems studied in the computational sciences that may not even be inference problems, such as quantum simulations and PDE-constrained optimization.

2 Multi-fidelity MCMC

Monte Carlo methods approximate integrals and sums that can be expressed as an expectation:

$$\mathbb{E}_{\pi}[h(\theta)] = \int h(\theta)\,\pi(\theta)\,d\theta \approx \frac{1}{T}\sum_{t=1}^{T} h(\theta^{(t)}), \quad \text{where } \theta^{(t)} \sim \pi, \qquad (1)$$

and where $\pi : \Theta \to \mathbb{R}_+$ is the target density that may only be known up to a constant, $h(\theta)$ is a function of interest, and $\{\theta^{(t)}\}_{t=1}^{T}$ are samples from $\pi$. Markov chain Monte Carlo methods are then used to generate samples $\theta^{(t)}$ from $\pi$ by simulating from a Markov chain with target $\pi$. In many settings, pointwise evaluations of the target function $\pi(\theta)$ are expensive or even intractable; from here on we will assume that the goal is to compute statistics of a quantity of interest $h(\theta)$ with respect to a perfect-fidelity target density $\pi_\infty(\theta)$. In practice, the estimate in Equation (1) is instead computed using a cheaper, low-fidelity density $\pi_k(\theta)$, where $k \in \mathbb{N} := \{1, 2, \ldots\}$. In particular, we consider settings where there is a sequence of low-fidelity densities available that converges to the target, i.e., $\pi_k(\theta) \to \pi_\infty(\theta)$ as $k \to \infty$. We assume that as k increases, the model becomes higher in fidelity (with respect to $\pi_\infty$) but more costly to evaluate, increasing in expense super-linearly with k.
For instance, $\pi_\infty$ could represent a target density that depends on an intractable integral, the solution of a PDE, the solution of a large system of ODEs, the solution of a large system of linear equations, or the minimizer of a function. Thus, a typical evaluation of $\pi_\infty$ requires an approximation at a fidelity k with a tolerable level of bias for a given computational budget. Here, increasing k could correspond to finer discretizations of differential equations, increasing numbers of quadrature points, or performing a larger number of iterations in a linear solver or optimization routine. In the multi-fidelity setting, the goal is to combine several models of varying fidelity within an MCMC algorithm to reduce the computational cost of estimating Equation (1). In this paper, we describe a class of MCMC algorithms that leverages the sequence of low-fidelity models $\pi_k$. Our strategy for multi-fidelity MCMC (MF-MCMC) will be to construct an unbiased estimator of $\pi_\infty(\theta)$ using random choices of the fidelity K and then to include K in the Markov chain as an auxiliary variable. By carefully constructing such a Markov chain, it will be possible to asymptotically estimate the functional in Equation (1) as though the samples were taken from the perfect-fidelity model; each step of the Markov chain will nevertheless only require a finite amount of computation. Finally, our approach allows us to essentially plug in any valid MCMC algorithm, and we apply this strategy to develop multi-fidelity variants of a number of MCMC algorithms, such as M-H and slice sampling.

2.1 Pseudo-marginal MCMC for the multi-fidelity setting

Pseudo-marginal MCMC [2, 6] is a class of auxiliary-variable MCMC algorithms that replaces the target density $\pi(\theta)$ with an estimator $\hat\pi(\theta)$ that is a function of a random variable. If the estimator is nonnegative and unbiased, i.e., for all $\theta \in \Theta$, $\hat\pi(\theta) \ge 0$ and $\mathbb{E}[\hat\pi(\theta)] = \pi(\theta)$, then MCMC transitions that use the estimator still have $\pi(\theta)$ as their invariant distribution.
This property is sometimes referred to as exact-approximate MCMC, as the transitions are approximate but the limiting distribution is exact. Estimators can be constructed from a variety of methods, including particle filtering [2]; our approach will use randomized series truncations, which have been considered in pseudo-marginal MCMC methods such as Lyne et al. [30], Georgoulas et al. [16], and Biron-Lattes et al. [7]. We now apply the pseudo-marginal approach to the multi-fidelity setting. Here the target density estimator arises from a random choice of the fidelity $K \in \mathbb{N}$ that is governed by a distribution $\mu$ on $\mathbb{N}$. We denote the estimator by $\hat\pi_K(\theta)$ to make the dependence on the random fidelity K explicit. The estimator is constructed such that it is unbiased with respect to $\mu$, i.e.,

$$\sum_{k=1}^{\infty} \mu(k)\,\hat\pi_k(\theta) = \pi_\infty(\theta). \qquad (2)$$

The distribution $\mu$ is also constructed by the user: ideally, the estimator $\hat\pi_K(\theta)$ will prefer smaller values of K while having sufficiently low variance as to allow the Markov chain to mix effectively. Thus the simulations can be run at inexpensive low fidelities, while the estimates will be as though the perfect-fidelity model were being used. The standard pseudo-marginal MCMC approach is to construct a Markov chain that has the following joint density as its stationary distribution:

$$\bar\pi(\theta, K) = \mu(K)\,\hat\pi_K(\theta). \qquad (3)$$

Observe that while Equation (3) does not depend on the perfect-fidelity target density $\pi_\infty$, it returns the desired marginal $\pi_\infty$ via Equation (2). As a concrete example, a pseudo-marginal M-H algorithm generates a new state $\theta'$ and fidelity $K'$ jointly, using $q(\theta'; \theta)$ as the proposal for $\theta'$ and $q(K'; K) = \mu(K')$ as the proposal distribution for the fidelity, and accepts/rejects the state according to

$$a = \frac{\bar\pi(\theta', K')\,q(\theta; \theta')\,q(K; K')}{\bar\pi(\theta, K)\,q(\theta'; \theta)\,q(K'; K)} = \frac{\hat\pi_{K'}(\theta')\,q(\theta; \theta')}{\hat\pi_K(\theta)\,q(\theta'; \theta)}, \qquad (4)$$

where the equality holds since the distribution terms for $K$ and $K'$ cancel.
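As a sketch of this joint proposal, one pseudo-marginal M-H step in Python; the estimator `pi_hat(theta, k)`, the Gaussian random-walk proposal, and its scale are illustrative assumptions rather than choices from the paper:

```python
import math
import random

def pm_mh_step(theta, k, pi_hat, mu_sample, prop_std=0.5):
    """One joint pseudo-marginal M-H step in the spirit of Eq. (4): propose
    theta' from a symmetric random walk and a fresh fidelity K' ~ mu; the
    mu terms cancel, so the ratio involves only the estimator values."""
    theta_prop = theta + random.gauss(0.0, prop_std)
    k_prop = mu_sample()                   # K' ~ mu, independent of the current K
    a = pi_hat(theta_prop, k_prop) / pi_hat(theta, k)
    if random.random() < min(1.0, a):
        return theta_prop, k_prop          # accept the state and fidelity jointly
    return theta, k

# Sanity-check target: an "estimator" that is exact and ignores k,
# proportional to a standard normal density.
pi_hat_demo = lambda th, k: math.exp(-0.5 * th * th)
```

With a noisy estimator, this joint scheme is exactly the variant that can get stuck on a lucky overestimate, which is why a separate fidelity update is attractive.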
Note that the right-hand side of Equation (4) is the standard M-H ratio, but the target density $\pi_\infty$ is replaced with the estimator $\hat\pi_K$. However, standard pseudo-marginal MCMC using joint proposals of the state and fidelity can get stuck when the estimator is noisy and fail to accept new states. Thus, we apply the approach of Murray and Graham [33], which augments the Markov chain to include the randomness of the estimator via a separate update; here the randomness of the estimator arises from the fidelity K. Concretely, we construct a Markov chain that simulates from Equation (3) by alternating sampling between the conditional target densities $\bar\pi(K \mid \theta)$ and $\bar\pi(\theta \mid K)$ (steps 5 and 6 of Algorithm 1, respectively). We refer to this strategy as multi-fidelity MCMC (MF-MCMC). By conditioning on K = k, the update for the state $\theta$ becomes a standard deterministic update applied to a low-fidelity model $\hat\pi_k(\theta)$, and any appropriate MCMC update can be used here, making it straightforward to use complex MCMC methods, such as slice sampling and HMC. Similarly, any suitable MCMC update for the fidelity K can be used with the conditional target $\bar\pi(K \mid \theta)$. Many techniques can be used to construct an unbiased estimator of $\pi_\infty$ with randomness K; we describe a general approach in the next section. However, it is generally difficult to guarantee that the estimator is nonnegative, as required by pseudo-marginal MCMC. One technique, considered by Lin et al. [29] and Lyne et al. [30], is to instead sample from the target distribution induced by the absolute value of the estimator and apply a sign correction to the final Monte Carlo estimate in Equation (1), an approach borrowed from the quantum Monte Carlo literature, where it is necessary for modeling fermionic particles. This approach has been applied to the M-H algorithm, but we note that it can be applied much more broadly, as we do in this work.
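The conditional fidelity update (step 5 of Algorithm 1) can be sketched as a random-walk M-H move on the positive integers; `pi_hat_abs` below is a hypothetical evaluator of $|\hat\pi_k(\theta)|$:

```python
import random

def update_fidelity(k, theta, pi_hat_abs, mu, step=1):
    """Random-walk M-H update of K leaving pi(K | theta) ∝ mu(K)|pi_hat_K(theta)|
    invariant; proposals that leave the support {1, 2, ...} are rejected."""
    k_prop = k + random.choice([-step, step])
    if k_prop < 1:
        return k
    a = (mu(k_prop) * pi_hat_abs(theta, k_prop)) / (mu(k) * pi_hat_abs(theta, k))
    return k_prop if random.random() < min(1.0, a) else k
```

When $|\hat\pi_k(\theta)|$ happens to be flat in $k$, the invariant distribution of this update reduces to $\mu$ itself, which gives a quick correctness check.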
In problems where the estimator may be negative, we sample from the conditional target distributions using the absolute value of the estimator $|\hat\pi_K(\theta)|$, and we denote these conditionals by $\bar\pi(K \mid \theta) \propto \mu(K)\,|\hat\pi_K(\theta)|$ and $\bar\pi(\theta \mid K = k) \propto |\hat\pi_k(\theta)|$. The estimate in Equation (1) is then corrected using the signs $\sigma(\theta, k)$ of the evaluations of $\hat\pi_k(\theta)$:

$$\int h(\theta)\,\pi_\infty(\theta)\,d\theta \approx \frac{\sum_{t=1}^{T} h(\theta^{(t)})\,\sigma(\theta^{(t)}, K^{(t)})}{\sum_{t=1}^{T} \sigma(\theta^{(t)}, K^{(t)})} =: \hat I_T, \qquad (5)$$

where $\{(\theta^{(t)}, K^{(t)})\}_{t=1}^{T}$ are the sampled values from the joint distribution $\bar\pi(\theta, K) \propto |\hat\pi_K(\theta)|\,\mu(K)$. Importantly, the sign-corrected estimate still asymptotically leads to the desired estimate of the functional of interest. Let $\sigma(\theta, k)$ denote the sign of the estimator, such that $\hat\pi_k(\theta) = \sigma(\theta, k)\,|\hat\pi_k(\theta)|$. The estimator $\hat I_T$ in Equation (5) is formed using a Monte Carlo estimate of the functional after expanding it into its joint distribution, i.e.,

$$\int h(\theta)\,\pi_\infty(\theta)\,d\theta = \int \sum_{k=1}^{\infty} h(\theta)\,\hat\pi_k(\theta)\,\mu(k)\,d\theta = \frac{\int \sum_{k=1}^{\infty} h(\theta)\,\sigma(\theta, k)\,\bar\pi(\theta, k)\,d\theta}{\int \sum_{k=1}^{\infty} \sigma(\theta, k)\,\bar\pi(\theta, k)\,d\theta}. \qquad (6)$$

The full multi-fidelity MCMC algorithm with sign correction is summarized in Algorithm 1. We note that while the Markov chain no longer converges to a target with the marginal $\pi_\infty$, the final estimate after sign correction (which is the downstream goal of interest) converges to the quantity of interest due to Equation (6). While this may seem limiting if one is interested in the posterior itself, useful unbiased posterior summaries may still be obtained via the functional, such as the posterior mean, variance, quantiles, and histograms that may be used to visualize marginal distributions.

3 Unbiased low-fidelity estimators via randomized truncations

In this section, we discuss how to construct an unbiased estimator of $\pi_\infty(\theta)$, given a sequence of low-fidelity likelihoods with the property that $\pi_k(\theta) \to \pi_\infty(\theta)$ as $k \to \infty$.
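The sign-corrected average in Equation (5) is simple to implement; a sketch, where `samples` is a hypothetical list of $(\theta, \sigma)$ pairs recorded along the chain:

```python
def sign_corrected_estimate(samples, h):
    """Eq. (5): a sign-weighted average of h over the chain's samples, where
    each sample carries the sign sigma(theta, K) of its estimator value."""
    num = sum(sign * h(theta) for theta, sign in samples)
    den = sum(sign for _, sign in samples)
    return num / den
```

When every sign is +1 this reduces to the ordinary Monte Carlo average; negative signs both downweight the numerator and shrink the denominator.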
This estimator has the property that it requires a finite amount of computation with probability one, and it also has a tunable amount of expected computation per estimate, i.e., it uses low-fidelity density evaluations to estimate the perfect-fidelity target density. The central idea of this estimator has been used for decades, going back to John von Neumann and Stanislaw Ulam. More recently, it has found use in applications of inference and optimization in related work such as Glynn and Rhee [20], Lyne et al. [30], Beatson and Adams [5], and Jacob et al. [26]. First note that we can express the perfect-fidelity model as a telescoping sum of low-fidelity models: let $\pi_0(\theta) = 0$ and write

$$\pi_\infty(\theta) = \sum_{k=1}^{\infty} \big(\pi_k(\theta) - \pi_{k-1}(\theta)\big). \qquad (7)$$

The estimator $\hat\pi_K$ is then constructed by taking a random truncation $K \sim \mu$ of the infinite telescoping series. The sampled terms in the sum are then reweighted to ensure the estimator remains unbiased:

$$\hat\pi_K(\theta) = \sum_{k=1}^{\infty} w_{k,K}\,\big(\pi_k(\theta) - \pi_{k-1}(\theta)\big). \qquad (8)$$

Two approaches are commonly used to ensure that the resulting estimator is unbiased: weighted single-term estimators and Russian roulette estimators. The single-term estimator [30] is constructed by importance sampling a term from the series in Equation (7): the truncation level is drawn as $K \sim \mu$, and the Kth term is used to form the estimate

$$\hat\pi_K(\theta) = \mu(K)^{-1}\big(\pi_K(\theta) - \pi_{K-1}(\theta)\big). \qquad (9)$$

Thus, the weight in Equation (8) is $w_{k,K} = \mu(k)^{-1}\,\mathbb{1}(K = k)$. In the Russian roulette estimator, the remaining terms in the series are reweighted by their survival probabilities, i.e., $w_{k,K} = \big(1 - \sum_{k'=1}^{k-1} \mu(k')\big)^{-1}\,\mathbb{1}(K \ge k)$. The distribution $\mu$ controls the number of terms in the estimator, and a good proposal distribution should match the tails of the sequence of low-fidelity densities [5, 30, 37].

Algorithm 1: Multi-fidelity Monte Carlo with sign correction
1: Input: initial state $\theta$ and fidelity K, truncation distribution $\mu$
2: for t = 1, . . . , T do
3: Given the current K and $\theta$, form the estimator $\hat\pi_K(\theta) = \sum_{k=1}^{K} w_{k,K}\,(\pi_k(\theta) - \pi_{k-1}(\theta))$
4: Save the sign $\sigma(\theta, K) = \mathrm{sign}(\hat\pi_K(\theta))$
5: Update the fidelity K, leaving invariant the target conditional $\bar\pi(K \mid \theta) \propto \mu(K)\,|\hat\pi_K(\theta)|$
6: Update the state $\theta$, leaving invariant the target conditional $\bar\pi(\theta \mid K = k) \propto |\hat\pi_k(\theta)|$
7: end for
8: Output: samples $\{(\theta^{(t)}, K^{(t)})\}_{t=1}^{T}$ and estimate $\hat I_T = \sum_{t=1}^{T} \sigma^{(t)} h(\theta^{(t)}) \big/ \sum_{t=1}^{T} \sigma^{(t)}$

The ability to use cheaper models is a key feature of multi-fidelity inference, and the low-fidelity estimator provides a means to reduce the computational cost of multi-fidelity Monte Carlo. However, these estimators are an example of a class of methods that explores a compute-variance tradeoff: computationally cheaper estimates lead to higher variability. The resulting increase in variance slows down the convergence of the MCMC procedure and could lead to an overall less efficient method due to a reduced effective sample size.

4 Summary of the multi-fidelity MCMC recipe

Here we summarize the recipe for constructing a multi-fidelity Markov chain Monte Carlo algorithm. First, identify a sequence of increasing-fidelity target densities with the property that their limit is the desired perfect-fidelity density. Low-fidelity densities should be cheap, with the cost rapidly increasing within the sequence. In the context of Bayesian inference, it may be appropriate to focus the multi-fidelity aspects on the likelihood term and construct the target densities via, e.g., $\pi_k(\theta; \mathcal{D}) \propto \pi_0(\theta)\,L_k(\theta; \mathcal{D})$, where $\pi_0$ is the prior, $L_k$ is a low-fidelity likelihood, and $\mathcal{D}$ is the set of observations. This likelihood-based version is what we use in several of the experiments. Next, introduce a truncation distribution $\mu$ on $\mathbb{N}$.
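The two weighting schemes from Section 3 (the single-term estimator of Eq. (9) and the Russian roulette estimator) can be sketched as follows; `pi_seq(theta, k)` is a hypothetical evaluator of $\pi_k(\theta)$, with $\pi_0 := 0$:

```python
def single_term_estimator(pi_seq, theta, mu_pmf, mu_sample):
    """Single-term estimator (Eq. 9): draw K ~ mu, then importance-weight the
    K-th difference of the telescoping series by 1/mu(K)."""
    K = mu_sample()
    prev = pi_seq(theta, K - 1) if K > 1 else 0.0
    return (pi_seq(theta, K) - prev) / mu_pmf(K), K

def russian_roulette_estimator(pi_seq, theta, mu_pmf, mu_sample):
    """Russian roulette estimator: keep every term up to K ~ mu, weighting
    term k by the inverse survival probability 1 / P(K >= k)."""
    K = mu_sample()
    total, cdf, prev = 0.0, 0.0, 0.0
    for k in range(1, K + 1):
        survival = 1.0 - cdf                 # P(K >= k) under mu
        total += (pi_seq(theta, k) - prev) / survival
        cdf += mu_pmf(k)
        prev = pi_seq(theta, k)
    return total, K
```

As an illustration of tail matching: for the toy sequence $\pi_k = 1 - 2^{-k}$ with $\mu(k) = 2^{-k}$, both estimators are unbiased for the limit $\pi_\infty = 1$, and the single-term version even has zero variance because the term differences exactly match the tails of $\mu$.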
This truncation distribution should be chosen to balance the expected cost against the variance of the resulting estimator; our overall goal is to mostly use cheap low-fidelity densities, but high-variance estimators will presumably damage the mixing time and/or the asymptotic variance. Initialize the Markov chain with a reasonable choice for $\theta$ and a draw of K from the distribution $\mu$. Each step of the Markov chain simulation consists of an update to $\theta$ given K and an update of K given $\theta$. The update of $\theta$ given K can be performed using any standard MCMC algorithm, e.g., M-H, slice sampling, or HMC, applied to the low-fidelity estimator. It is important to use the absolute value of the estimator and keep track of its sign. The update of K given $\theta$ is also flexible, but it is reasonable to construct the update so that only a few values of K are considered in each step, as each of those fidelities will need to be evaluated. By default, we consider a simple random walk on the positive integers for our experiments. After running a sufficient number of steps of the Markov chain, use the sign-corrected estimator in Equation (5) to compute the expectation of the function $h(\theta)$.

5 Experiments

In all experiments, we use a random-walk M-H update to sample from the conditional $K \mid \theta$, and a truncation distribution $\mu(K) = \mathrm{Geometric}(K; \gamma_0)$. Additional experimental details are in Appendix F.

Figure 2: Conjugate Gaussian model. Panels: (a) two-stage M-H, (b) multi-fidelity M-H, (c) M-H sampler estimate vs. cost, (d) single-fidelity slice, (e) multi-fidelity slice, (f) slice sampler estimate vs. cost. Left: histograms for M-H (a, b) and slice sampling (d, e). Right: comparison of the posterior standard deviation estimate vs. computation for the M-H (c) and slice sampling (f) methods. Average posterior mean computed over 4 different chains.
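Putting the recipe together end to end, a minimal sketch with M-H updates for both conditionals and the single-term estimator; the Gaussian sequence $\pi_k = N(0, 1 + 2^{-k})$ and all tuning constants are illustrative choices of this sketch, not the paper's experimental setup:

```python
import math
import random

def mf_mcmc(pi_seq, mu_pmf, h, T=30000, theta0=0.0, k0=1):
    """Minimal multi-fidelity M-H in the style of Algorithm 1: alternate a
    random-walk update of K | theta and of theta | K, then accumulate the
    sign-corrected estimate of E[h(theta)] under pi_infinity (Eq. 5)."""
    def pi_hat(theta, k):   # single-term estimator with its randomness fixed at k
        prev = pi_seq(theta, k - 1) if k > 1 else 0.0
        return (pi_seq(theta, k) - prev) / mu_pmf(k)

    theta, k = theta0, k0
    num = den = 0.0
    for _ in range(T):
        # Update K | theta, targeting pi(K | theta) ∝ mu(K)|pi_hat_K(theta)|.
        # (Writing "u * denom < numer" avoids dividing by a near-zero estimate.)
        kp = k + random.choice([-1, 1])
        if kp >= 1 and (random.random() * mu_pmf(k) * abs(pi_hat(theta, k))
                        < mu_pmf(kp) * abs(pi_hat(theta, kp))):
            k = kp
        # Update theta | K, targeting pi(theta | K = k) ∝ |pi_hat_k(theta)|.
        tp = theta + random.gauss(0.0, 0.5)
        if random.random() * abs(pi_hat(theta, k)) < abs(pi_hat(tp, k)):
            theta = tp
        s = math.copysign(1.0, pi_hat(theta, k))   # sign correction
        num += s * h(theta)
        den += s
    return num / den

# Illustrative run: pi_k = N(0, 1 + 2^{-k}) -> pi_inf = N(0, 1), so the
# sign-corrected estimate of E[theta^2] should approach 1.
random.seed(0)
pi_seq = lambda th, k: (math.exp(-th * th / (2.0 * (1.0 + 2.0 ** (-k))))
                        / math.sqrt(2.0 * math.pi * (1.0 + 2.0 ** (-k))))
est = mf_mcmc(pi_seq, lambda k: 0.5 ** k, lambda th: th * th)
```

Because the differences $\pi_k - \pi_{k-1}$ decay geometrically here, a geometric $\mu$ keeps the estimator's tails under control, matching the guidance above.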
5.1 Toy conjugate Gaussian models

In order to understand the behavior of MF-MCMC on a simple example of Bayesian inference, we first examine an example where the computational cost of evaluating the sequence of low-fidelity likelihoods does not increase with k. Consider a perfect-fidelity likelihood $L_\infty(\theta) = N(x; \theta, \sigma_\infty^2)$ and a low-fidelity likelihood $L_k(\theta) = N(x; \theta, \sigma_k^2)$, where $\sigma_k^2 \to \sigma_\infty^2$. The prior is $\pi_0(\theta) = N(\theta \mid 0, 1)$, and so a closed-form posterior density can be computed. Here we consider the sequence $\sigma_k^2 = 1 + 2/k^2$ and $\sigma_\infty^2 = 1$. In Figure 2 we compare the results of single-fidelity and multi-fidelity M-H and slice sampling, as well as the two-stage M-H algorithm summarized in Appendix C.4. We consider two two-stage M-H settings, with high and low fidelities of $\{k_{\mathrm{HF}}, k_{\mathrm{LF}}\} = \{1000, 10\}$ and $\{k_{\mathrm{HF}}, k_{\mathrm{LF}}\} = \{100, 5\}$. The histograms show the bias of each method after simulating 10,000 samples, and the solid gray curve denotes the exact posterior density. We also compute a measure of total cost and a running average of the estimate of the posterior standard deviation functional, where the dotted black line denotes the true value. The number of cost-adjusted likelihood evaluations was computed by upweighting each likelihood evaluation by its fidelity. Here the multi-fidelity methods typically converge to a similar value as the single high-fidelity methods but in fewer cost-adjusted likelihood evaluations.

5.2 Log-Gaussian Cox processes

We examine an application of MF-MCMC to the log-Gaussian Cox process (LGCP) model [32], where the perfect-fidelity model is a function of an integral and the lower-fidelity sequence of models arises from k-point quadrature estimates. Let $f \sim \mathcal{GP}(0, \kappa_\ell)$, where $\kappa_\ell(x, x') = \exp\big({-\frac{1}{2\ell^2}\|x - x'\|_2^2}\big)$ and where $\ell$ is a lengthscale hyperparameter. Consider an inhomogeneous Poisson process on $\mathcal{X} \subseteq \mathbb{R}^D$ with intensity $\lambda(x) = e^{f(x)}$. Given a random set of points $\{X_n\}_{n=1}^{N}$, the perfect-fidelity likelihood is

$$L_\infty(f) = \exp\Big(\int_{\mathcal{X}} \big(1 - e^{f(x)}\big)\,dx\Big) \prod_{n=1}^{N} e^{f(X_n)}. \qquad (10)$$

Typically, inference in the LGCP uses a grid-based approximation of Equation (10), where the points are binned into counts and modeled with a Poisson likelihood [12, 34, 42], resulting in a biased posterior. Because the likelihood depends on a high-dimensional latent Gaussian vector, we perform inference for f using the elliptical slice sampling (ESS) algorithm (see Appendix C.3). We approximate the integral in Equation (10) with a trapezoidal quadrature rule $I_k$, where the number of quadrature points is a linear function of k.

Figure 3: Coal mining disasters, 1850-1963. Left: posterior mean of the rate function at the observed data points. Right: posterior mean of the rate function at T = 1862 vs. computation.

Figure 4: Lotka-Volterra system parameter identification. (a) Sampled posterior trajectories under a fixed computational budget. (b) Posterior mean estimate vs. computational cost. The fidelity represents (a function of) the step size dt of the ODE solver.

We apply multi-fidelity and single-fidelity ESS algorithms to a coal mining disasters data set (Carlin et al. [9]). The data contain the dates of 191 coal mine explosions that killed ten or more men in Britain between March 15, 1851 and March 22, 1962. Figure 3 (left) shows the estimated mean intensity and standard deviation on the coal mining disasters data for one run of multi-fidelity ESS and single-fidelity ESS with k = 10, 100, 1000 quadrature points. In this plot, the high-fidelity (k = 1000) and multi-fidelity posterior mean and standard deviation estimates match well, and the bias in the lowest-fidelity (k = 10) estimate is apparent. We also computed the cost-adjusted number of likelihood evaluations performed in each iteration of MF-ESS and SF-ESS. Figure 3 (right) shows the average estimated mean intensity at the time step t = 1862 for the three models against the average cost-adjusted number of likelihood evaluations per iteration.
We observe that the multi-fidelity and high-fidelity estimates are close after many iterations of sampling, but the multi-fidelity estimate converges with less computation.

5.3 Bayesian ODE system identification

We now apply the MF-MCMC approach to Bayesian system identification for the Lotka-Volterra ODE. Let $u(t) \ge 0$ represent the population size of the prey species at time t, and $v(t) \ge 0$ the population size of the predator species. The dynamics of the two species given parameters $\alpha, \beta, \gamma, \delta \ge 0$ are given by a pair of first-order ODEs:

$$\frac{du}{dt} = (\alpha - \beta v)\,u = \alpha u - \beta u v, \qquad \frac{dv}{dt} = (-\gamma + \delta u)\,v = -\gamma v + \delta u v. \qquad (11)$$

Figure 5: PDE-constrained optimization with a linear heat equation. Estimate for $\alpha$ (left) and $\beta$ (right) vs. computation. The black dotted lines denote the true values of $\alpha$ and $\beta$.

System identification solves the inverse problem of estimating the parameters $\theta = (\alpha, \beta, \gamma, \delta)$ of the ODE system. Taking a Bayesian approach, we specify a noise model for the observed data and priors on the parameters, and we use MCMC to infer a distribution over the solution. For simplicity, we assume that the initial conditions are known and fix $\sigma = 0.25$. Define $z_n := (u(t_n), v(t_n))$ and let $z_1(\theta), \ldots, z_N(\theta)$ be the solutions to the Lotka-Volterra differential equations at times $t_1, \ldots, t_N$ given the initial conditions and the system parameters $\theta = (\alpha, \beta, \gamma, \delta)$. Suppose we have observations arising from $\log(y_n) = \log(z_n) + \epsilon_n$, where $\epsilon_n \sim N(0, \sigma^2 I)$. The low-fidelity likelihood is a function of a numerical solution of the ODE using a time step of size dt (Equation (F.3)). We compared the performance of a multi-fidelity elliptical slice sampler to single-fidelity elliptical slice samplers with step sizes $dt = 1\times10^{-5}, 1\times10^{-4}$. For the ODE solver, we considered both an Euler solver and a 4th-order Runge-Kutta solver (Figure F.1). Figure 4a shows 200 trajectories corresponding to parameters sampled from the posterior distributions under each method using the Euler solver.
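The Euler forward model used here is only a few lines; a sketch, with placeholder parameter values rather than posterior draws from the experiment:

```python
def lotka_volterra_euler(theta, z0, t_end, dt):
    """Forward-Euler integration of the Lotka-Volterra system in Eq. (11);
    a coarser step size dt gives a cheaper, lower-fidelity solution."""
    alpha, beta, gamma, delta = theta
    u, v = z0
    for _ in range(round(t_end / dt)):
        du = (alpha - beta * v) * u
        dv = (-gamma + delta * u) * v
        u, v = u + dt * du, v + dt * dv
    return u, v

# Refining dt shrinks the (first-order) discretization error, giving the
# fidelity hierarchy the sampler exploits.
theta = (1.0, 0.1, 1.5, 0.075)   # placeholder (alpha, beta, gamma, delta)
coarse = lotka_volterra_euler(theta, (10.0, 5.0), 1.0, 1e-2)
fine = lotka_volterra_euler(theta, (10.0, 5.0), 1.0, 1e-4)
```

Since Euler's method is first-order, halving dt roughly halves the error, so successive solutions form a convergent low-fidelity sequence.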
Figure 4b shows posterior mean estimates for each system parameter. We observe that the estimates from the multi-fidelity slice sampler approach the estimates reported by Howard [25] (black dotted lines) in less time than the single-fidelity samplers.

5.4 PDE-constrained optimization

We now consider global optimization of a PDE-constrained objective, where an expensive physical simulation is run repeatedly in an outer-loop problem. A common approach for global optimization is simulated annealing, which has been applied to constrained global optimization [40]. Consider a model for heat flow in a thin rod of length L with spatial coordinate $x \in [0, L]$. Let $u(x, t)$ represent the temperature in the rod at position x and time t, and let $u_\star$ represent a desired target temperature. The goal is to minimize a loss function subject to u satisfying a linear heat equation. This objective, along with an initial condition and homogeneous boundary conditions, can be summarized as:

$$\mathrm{minimize}_{u}\ \ \|u - u_\star\|_2^2 \quad \text{subject to} \quad \frac{\partial u}{\partial t} = \alpha \frac{\partial^2 u}{\partial x^2} + 2\beta u, \qquad (12)$$
$$u(x, 0) = \sin(\pi x / 2), \quad u(0, t) = u(L, t) = 0, \quad x \in [0, L],\ t \in [0, T],$$

where $\alpha, \beta > 0$ are the system parameters. The goal is to find $\theta = (\alpha, \beta)$ that minimizes the objective and satisfies the constraints. To solve the PDE, we discretize the domain into a grid of spacing $\Delta x$ and represent the second derivative using the central difference formula. This induces a system of ODEs that we solve numerically using a Tsitouras 5/4 Runge-Kutta method, setting $\Delta t = 0.4\,\Delta x^2$ so as to satisfy a CFL stability condition. Here the fidelity of the problem is given by the size of the spatial discretization $\Delta x$, which in turn controls $\Delta t$. We compared against two single-fidelity discretizations of the spatial coordinate, with $\Delta x = 5 \times 10^{-3}, 1 \times 10^{-2}$. The results are in Figure 5, where we plot two of the MF results with $\gamma_0 = 0.1, 0.25$. In these examples, the multi-fidelity estimates converge faster than the single-fidelity estimates in wall-clock time.
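A sketch of the discretized forward model: central differences in space, with explicit Euler time stepping standing in for the Tsitouras 5/4 method used in the experiment. The grid size, horizon, and the $\Delta t = 0.4\,\Delta x^2/\alpha$ restriction (which keeps the explicit scheme stable) are assumptions of this sketch:

```python
import math

def heat_step_euler(u, alpha, beta, dx, dt):
    """One explicit-Euler step of the semi-discrete system for
    u_t = alpha * u_xx + 2 * beta * u, with u = 0 held at both endpoints."""
    new = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        u_xx = (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (dx * dx)  # central difference
        new[i] = u[i] + dt * (alpha * u_xx + 2.0 * beta * u[i])
    return new

def solve_heat(alpha, beta, L=2.0, nx=41, t_end=0.1):
    """Integrate to t_end from u(x, 0) = sin(pi x / 2); the grid resolution
    nx sets the fidelity, and dt is tied to dx for stability."""
    dx = L / (nx - 1)
    dt = 0.4 * dx * dx / alpha          # keeps alpha * dt / dx^2 = 0.4 <= 0.5
    n = math.ceil(t_end / dt)
    dt = t_end / n
    u = [math.sin(math.pi * i * dx / 2.0) for i in range(nx)]
    for _ in range(n):
        u = heat_step_euler(u, alpha, beta, dx, dt)
    return u
```

For $\beta = 0$ the exact solution is $\sin(\pi x/2)\,e^{-\alpha (\pi/2)^2 t}$, which gives a convenient accuracy check at the midpoint $x = 1$.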
5.5 Gaussian process regression parameter inference

Let X ∈ R^{N×D}, and consider a Gaussian process regression model with a squared exponential kernel:

f ∼ GP(0, kθ),  y = f(X) + ϵ,  ϵ ∼ N(0, σ₀²),  kθ(x, x′) = exp(−‖x − x′‖₂² / (2θ²)),

Figure 6: Parameter inference in a Gaussian process regression model. (a) Histograms of the high-fidelity, low-fidelity, two-stage, and multi-fidelity posterior distributions of the parameter θ. (b) Posterior mean estimate vs. computational cost.

where we assume σ₀² is known. Let Σθ := [kθ(xi, xj)]_{i∈[N], j∈[N]}. In many applications of Gaussian process modeling, one is interested in integrating out the parameter θ using MCMC. Computing the posterior π(θ | X, y) is expensive because each evaluation of the likelihood p(y | X, θ) = N(y | 0, Σθ + σ₀²I) involves computing a determinant and solving a linear system with respect to the matrix Σθ + σ₀²I, which has an O(N³) computational cost with standard methods (e.g., a Cholesky decomposition). For simplicity, we only consider the fidelity of the solution to the linear system, but we note that the determinant can be handled using the approach described in Potapczynski et al. [37]. Additional derivations and details are in Appendix F.5. We generate synthetic data from the GP model with N = 100, σ₀² = 1, and lengthscale θ0 = 45. For the GP model, we use a log-normal prior on θ (Equation (F.5)) with parameters ν0 = 3.8 and ν1 = 0.03. In Figure 6, we compare single-fidelity, multi-fidelity, and two-stage M-H samplers. The low-fidelity likelihood sequence was constructed by computing the solution to the linear system using a preconditioned conjugate gradient solver run for k steps. The single-fidelity likelihoods were a high-fidelity likelihood (k = N) and a low-fidelity likelihood (k ≪ N), and the multi-fidelity M-H samplers used γ0 = 0.1. The two-stage M-H approach used high and low fidelities of k ∈ {100, 5}.
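A minimal sketch of this fidelity construction: truncate a conjugate gradient solve of (Σθ + σ₀²I)x = y after k iterations. The code uses plain (unpreconditioned) CG, whereas the paper's experiments use a preconditioned solver, and the kernel matrix below is an illustrative example rather than the experimental data.

```python
import numpy as np

def cg_solve(A, b, k):
    """Conjugate gradients run for at most k iterations; k indexes fidelity.
    Small k gives a cheap approximate solve; k = N is exact in exact arithmetic."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(k):
        Ap = A @ p
        step = rs / (p @ Ap)
        x = x + step * p
        r = r - step * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < 1e-12:  # converged early
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Low-fidelity (k = 5) vs. high-fidelity (k = N) solves of (Sigma + sigma0^2 I) x = y,
# with an illustrative squared exponential kernel matrix.
rng = np.random.default_rng(0)
N, theta, sigma0_sq = 100, 2.0, 1.0
xs = rng.uniform(0.0, 10.0, size=(N, 1))
Sigma = np.exp(-0.5 * (xs - xs.T) ** 2 / theta**2)
A = Sigma + sigma0_sq * np.eye(N)
y = rng.normal(size=N)
x_lo = cg_solve(A, y, k=5)   # cheap, approximate
x_hi = cg_solve(A, y, k=N)   # effectively exact
```

Each CG iteration costs one matrix-vector product, so k directly trades accuracy against computation, which is what makes it a natural fidelity parameter for the low-fidelity likelihood sequence.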
For all methods, we use an M-H sampler with T = 50,000 iterations. In the histograms, we observe that the high-fidelity, multi-fidelity, and two-stage approaches tend to produce similar posteriors, while the low-fidelity sampler has more noticeable bias with respect to the high-fidelity histogram. The estimate produced by the multi-fidelity samplers converged in fewer cost-adjusted likelihood evaluations than the high-fidelity and two-stage approaches.

6 Discussion and future work

In this work, we introduced a class of multi-fidelity MCMC algorithms that use a low-fidelity unbiased estimator to reduce the computational cost of sampling while maintaining the desired limiting target distribution of the Markov chain. In particular, we demonstrated the use of our framework with more advanced MCMC algorithms beyond M-H, such as slice sampling, and in additional settings such as optimization. Our results show a reduction in computation while producing accurate solutions compared with high-fidelity models, provided it is possible to construct a target estimator that is not too noisy. Many future directions remain. First, applying MF-MCMC to large-scale, expensive applications poses many computational challenges. Making the method more robust for specialized problems is important, especially if the estimator is heavy-tailed. Thus, constructing good proposal distributions matching properties of the low-fidelity target sequence is crucial, especially for application to high-dimensional problems. In addition, we have thus far focused on target densities in which there is a single computation whose fidelity is varied. However, in many settings, there may be target densities with multiple computations that converge at different rates; for example, the target density may include both an intractable integral and the solution of a linear system.
Our framework can be extended to that setting by adjusting the proposal distribution, and it would be useful to understand how these rates impact the convergence properties of the sampler.

Acknowledgments and Disclosure of Funding

This work was partially supported by NSF grants IIS-2007278 and OAC-2118201. D. Cai was supported in part by a Google Ph.D. Fellowship in Machine Learning.

References

[1] N. M. Alexandrov, J. E. Dennis, R. M. Lewis, and V. Torczon. A trust-region framework for managing the use of approximation models in optimization. Structural Optimization, 15(1):16–23, 1998.
[2] C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697–725, 2009.
[3] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
[4] E. Arian, M. Fahl, and E. W. Sachs. Trust-region proper orthogonal decomposition for flow control. In IEEE Conference on Decision and Control, 2000.
[5] A. Beatson and R. P. Adams. Efficient optimization of loops and limits with randomized telescoping sums. In International Conference on Machine Learning, pages 534–543. PMLR, 2019.
[6] M. A. Beaumont. Estimation of population growth or decline in genetically monitored populations. Genetics, 164(3):1139–1160, 2003.
[7] M. Biron-Lattes, A. Bouchard-Côté, and T. Campbell. Pseudo-marginal inference for CTMCs on infinite spaces via monotonic likelihood approximations. Journal of Computational and Graphical Statistics, 2022.
[8] L. Brevault, M. Balesdent, and A. Hebbal. Overview of Gaussian process based multi-fidelity techniques with variable relationship between fidelities. arXiv preprint arXiv:2006.16728, 2020.
[9] B. P. Carlin, A. E. Gelfand, and A. F. Smith. Hierarchical Bayesian analysis of changepoint problems.
Journal of the Royal Statistical Society: Series C (Applied Statistics), 41(2):389–405, 1992.
[10] J. A. Christen and C. Fox. Markov chain Monte Carlo using an approximation. Journal of Computational and Graphical Statistics, 14(4):795–810, 2005.
[11] T. Cui, Y. M. Marzouk, and K. E. Willcox. Data-driven model reduction for the Bayesian solution of inverse problems. International Journal for Numerical Methods in Engineering, 102(5):966–990, 2015.
[12] P. J. Diggle, P. Moraga, B. Rowlingson, and B. M. Taylor. Spatial and spatio-temporal log-Gaussian Cox processes: extending the geostatistical paradigm. Statistical Science, 28(4):542–563, 2013.
[13] T. J. Dodwell, C. Ketelsen, R. Scheichl, and A. L. Teckentrup. A hierarchical multilevel Markov chain Monte Carlo algorithm with applications to uncertainty quantification in subsurface flow. SIAM/ASA Journal on Uncertainty Quantification, 3(1):1075–1108, 2015.
[14] Y. Efendiev, T. Hou, and W. Luo. Preconditioning Markov chain Monte Carlo simulations using coarse-scale models. SIAM Journal on Scientific Computing, 28(2):776–803, 2006.
[15] M. Fahl and E. W. Sachs. Reduced order modelling approaches to PDE-constrained optimization based on proper orthogonal decomposition. In Large-Scale PDE-Constrained Optimization, pages 268–280. Springer, 2003.
[16] A. Georgoulas, J. Hillston, and G. Sanguinetti. Unbiased Bayesian inference for population Markov jump processes via random truncations. Statistics and Computing, 27(4):991–1002, 2017.
[17] A. Gessner, J. Gonzalez, and M. Mahsereci. Active multi-information source Bayesian quadrature. In Uncertainty in Artificial Intelligence, pages 712–721. PMLR, 2020.
[18] M. B. Giles. Multilevel Monte Carlo path simulation. Operations Research, 56(3):607–617, 2008.
[19] M. B. Giles. Multilevel Monte Carlo methods. Monte Carlo and Quasi-Monte Carlo Methods, pages 83–103, 2013.
[20] P. W. Glynn and C.-H. Rhee. Exact estimation for Markov chain equilibrium expectations.
Journal of Applied Probability, 51(A):377–389, 2014.
[21] R. B. Gramacy and H. K. Lee. Adaptive design and analysis of supercomputer experiments. Technometrics, 51(2):130–145, 2009.
[22] G. W. Gundersen, D. Cai, C. Zhou, B. E. Engelhardt, and R. P. Adams. Active multi-fidelity Bayesian online changepoint detection. In Uncertainty in Artificial Intelligence, pages 1916–1926. PMLR, 2021.
[23] W. Hackbusch. Multi-grid Methods and Applications, volume 4. Springer Science & Business Media, 2013.
[24] D. Higdon, H. Lee, and Z. Bi. A Bayesian approach to characterizing uncertainty in inverse problems using coarse and fine-scale information. IEEE Transactions on Signal Processing, 50(2):389–399, 2002.
[25] P. Howard. Modeling basics. Lecture Notes for Math, 442, 2009.
[26] P. E. Jacob, J. O'Leary, and Y. F. Atchadé. Unbiased Markov chain Monte Carlo methods with couplings. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(3):543–600, 2020.
[27] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.
[28] S. Li, W. Xing, R. Kirby, and S. Zhe. Multi-fidelity Bayesian optimization via deep neural networks. Advances in Neural Information Processing Systems, 33:8521–8531, 2020.
[29] L. Lin, K. Liu, and J. Sloan. A noisy Monte Carlo algorithm. Physical Review D, 61(7):074505, 2000.
[30] A.-M. Lyne, M. Girolami, Y. Atchadé, H. Strathmann, and D. Simpson. On Russian roulette estimates for Bayesian inference with doubly-intractable likelihoods. Statistical Science, 30(4):443–467, 2015.
[31] A. March and K. Willcox. Constrained multifidelity optimization using model calibration. Structural and Multidisciplinary Optimization, 46(1):93–109, 2012.
[32] J. Møller, A. R. Syversveen, and R. P. Waagepetersen. Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25(3):451–482, 1998.
[33] I. Murray and M. Graham. Pseudo-marginal slice sampling.
In Artificial Intelligence and Statistics, pages 911–919. PMLR, 2016.
[34] I. Murray, R. Adams, and D. MacKay. Elliptical slice sampling. In Artificial Intelligence and Statistics, pages 541–548. PMLR, 2010.
[35] A. Palizhati, S. B. Torrisi, M. Aykol, S. K. Suram, J. S. Hummelshøj, and J. H. Montoya. Agents for sequential learning using multiple-fidelity data. Scientific Reports, 12(1):1–13, 2022.
[36] B. Peherstorfer, K. Willcox, and M. Gunzburger. Survey of multifidelity methods in uncertainty propagation, inference, and optimization. SIAM Review, 60(3):550–591, 2018.
[37] A. Potapczynski, L. Wu, D. Biderman, G. Pleiss, and J. P. Cunningham. Bias-free scalable Gaussian processes via randomized truncations. In International Conference on Machine Learning, pages 8609–8619. PMLR, 2021.
[38] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Inferring solutions of differential equations using noisy multi-fidelity data. Journal of Computational Physics, 335:736–746, 2017.
[39] T. Robinson, M. S. Eldred, K. E. Willcox, and R. Haimes. Surrogate-based optimization using multifidelity models with variable parameterization and corrected space mapping. AIAA Journal, 46(11):2814–2822, 2008.
[40] H. E. Romeijn and R. L. Smith. Simulated annealing for constrained global optimization. Journal of Global Optimization, 5(2):101–126, 1994.
[41] J. Song, Y. Chen, and Y. Yue. A general framework for multi-fidelity Bayesian optimization with Gaussian processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3158–3167. PMLR, 2019.
[42] M. Teng, F. Nathoo, and T. D. Johnson. Bayesian computation for log-Gaussian Cox processes: a comparative analysis of methods. Journal of Statistical Computation and Simulation, 87(11):2227–2252, 2017.
[43] M. Troyer and U.-J. Wiese. Computational complexity and fundamental limitations to fermionic quantum Monte Carlo simulations. Physical Review Letters, 94(17):170201, 2005.
[44] D. J. Warne, T. P. Prescott, R. E.
Baker, and M. J. Simpson. Multifidelity multilevel Monte Carlo to accelerate approximate Bayesian parameter inference for partially observed stochastic processes. arXiv preprint arXiv:2110.14082, 2021.
[45] J. Wu, S. Toscano-Palmerin, P. I. Frazier, and A. G. Wilson. Practical multi-fidelity Bayesian optimization for hyperparameter tuning. In Uncertainty in Artificial Intelligence, pages 788–798. PMLR, 2020.
[46] X. Xi, F.-X. Briol, and M. Girolami. Bayesian quadrature for multiple related integrals. In International Conference on Machine Learning, pages 5373–5382. PMLR, 2018.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] We discuss this in the discussion section.
(c) Did you discuss any potential negative societal impacts of your work? [Yes] We include a section in Appendix G.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] We provide details in Appendix E.
(b) Did you include complete proofs of all theoretical results? [Yes] We provide details in Appendix E.
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See Appendix F.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Appendix F for full details.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] The plots are generated over multiple random seeds.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix F.
4.
If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [N/A]
(b) Did you mention the license of the assets? [N/A]
(c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]