# divideandconquer_fusion__c0db2f7f.pdf Journal of Machine Learning Research 24 (2023) 1-82 Submitted 10/21; Revised 5/23; Published 6/23 Divide-and-Conquer Fusion Ryan S.Y. Chan rchan@turing.ac.uk The Alan Turing Institute London, NW1 2DB, UK Murray Pollock murray.pollock@newcastle.ac.uk School of Mathematics, Statistics and Physics Newcastle University Newcastle, NE1 7RU, UK Adam M. Johansen a.m.johansen@warwick.ac.uk Department of Statistics University of Warwick Coventry, CV4 7AL, UK Gareth O. Roberts gareth.o.roberts@warwick.ac.uk Department of Statistics University of Warwick Coventry, CV4 7AL, UK Editor: Pierre Alquier Combining several (sample approximations of) distributions, which we term sub-posteriors, into a single distribution proportional to their product, is a common challenge. Occurring, for instance, in distributed big data problems, or when working under multi-party privacy constraints. Many existing approaches resort to approximating the individual subposteriors for practical necessity, then find either an analytical approximation or sample approximation of the resulting (product-pooled) posterior. The quality of the posterior approximation for these approaches is poor when the sub-posteriors fall out-with a narrow range of distributional form, such as being approximately Gaussian. Recently, a Fusion approach has been proposed which finds an exact Monte Carlo approximation of the posterior, circumventing the drawbacks of approximate approaches. Unfortunately, existing Fusion approaches have a number of computational limitations, particularly when unifying a large number of sub-posteriors. In this paper, we generalise the theory underpinning existing Fusion approaches, and embed the resulting methodology within a recursive divide-andconquer sequential Monte Carlo paradigm. This ultimately leads to a competitive Fusion approach, which is robust to increasing numbers of sub-posteriors. Keywords: Distributed computing, importance sampling, Markov chain Monte Carlo, sequential Monte Carlo, stochastic differential equations. c 2023 Ryan S.Y. Chan, Murray Pollock, Adam M. Johansen and Gareth O. Roberts. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-1274.html. Chan, Pollock, Johansen and Roberts 1. Introduction In this paper, we are interested in the following d-dimensional (product-pooled) target density (which we term the fusion density), f(x) f1(x) f C(x) = c=1 fc(x), (1) where x Rd, fc(x) for c {1, . . . , C} represent the individual densities which we wish to unify (termed sub-posteriors in deference to the fact that a major application of this technique will be the setting in which the posterior is proportional to the product of these factors), and C represents the total number of sub-posteriors. We assume that we have access to independent realisations from each sub-posterior, and that it is possible to evaluate each sub-posterior pointwise up to its normalising constant. Although typically, one would only have approximate samples from each sub-posterior, we will discuss later that neither of these assumptions form limiting factors for our methodology. The need to unify several (sample approximations of) distributions, over a common parameter space, into a single sample approximation of the distribution in the manner of (1) is surprisingly common. For instance, it arises classically in expert elicitation (Albert et al., 2012; Berger, 1980; Genest and Zidek, 1986) and meta-analysis (Fleiss, 1993). However, it has proven to be challenging methodologically in a number of modern settings due to problem specific constraints. These include when dealing with the privacy constraints of the individual sources (Yıldırım and Ermi s, 2019), in cases where the sheer number of sources is overwhelming, or if the networking constraints of the sources are truly distributed (Scott et al., 2016). This in turn has motivated a range of problem specific and pragmatic approximations. These approximations are invariably distributional, and imposed at the level of the individual source (for instance, the sub-posteriors being approximately Gaussian). Such approximations limit the applicability of methodological approaches to particular settings, and outside those settings the unified results can be poorly understood, and even misleading. We instead focus on developing methodology for an exact Monte Carlo approximation of the unified distribution (1) one which provides robust inference in a wide range of practical problems, and yet is amenable to use alongside any problem specific constraints. The majority of the recent methodological developments for representing or sampling from (1) have been focused on tackling distributed big data problems (see for instance Scott et al., 2016; Neiswanger et al., 2014; Wang and Dunson, 2013; Minsker et al., 2014; Srivastava et al., 2015; Nemeth and Sherlock, 2018). In this setting, due to its sheer size, the data is split across a number of cores (say C cores), inference is separately conducted on each core (often using MCMC), and then the respective methodologies attempt to unify the sample approximations of the distribution (as per (1), and typically using a convenient approximation). In this paper, we will compare our methodology with a number of the most popular approaches, and so will briefly describe these here. The Consensus Monte Carlo (CMC) approach of Scott et al. (2016) produces approximate samples from (1) by means of a weighted average of sub-posterior samples. It can be shown that CMC is exact when Divide-and-Conquer Fusion each sub-posterior is Gaussian, and can be useful in settings where each sub-posterior is approximately Gaussian, which is often the case in big data settings (Walker, 1969; Johnson, 1970; Le Cam, 1986; Van der Vaart, 1998; Le Cam and Yang, 2000). However, it has been shown to exhibit large bias in other settings (Wang and Dunson, 2013). Neiswanger et al. (2014) suggest a strategy (which we term the Kernel Density Estimate Monte Carlo (KDEMC) approach) based on using a kernel density estimate to approximate the subposterior densities, and in effect approximating (1) by implicitly sampling from the product of non-parametric density estimates. Finally, the Weierstrass sampler of Wang and Dunson (2013) provides an alternative method for approximating (1) by means of using the product of Weierstrass transforms for each sub-posterior. Interestingly, we find empirically that for a cheap and crude approximation of (1) then the (simplest) CMC approach outperforms all other methodologies, but in cases where accuracy is a concern then our (more computationally expensive) Fusion approach should be used. The Fusion approach (Dai et al., 2019, 2023) constructs a direct sample approximation of (1) itself, rather than seeking to obtain an adhoc approximation of f by combining approximations of the sub-posteriors. Underpinning the Fusion approach is the simple observation that if we sampled (independently) X(c) fc for c {1, . . . , C} then conditional on the event that X(1) = = X(C), we have that X(1) has density f given in (1). Clearly the difficulty with exploiting this observation is that we are conditioning on an event of probability 0. The Monte Carlo Fusion (MCF) approach of Dai et al. (2019) provides a framework for practically enforcing this conditioning. This is achieved by initialising C stochastic processes (independently from one another) using a single realisation from each sub-posterior (i.e. X(c) 0 fc for c {1, . . . , C} where the subscript is a temporal index, noting that X(1) 0 = = X(C) 0 ), evolving the processes in such a manner that (i) these processes coalesce at some fixed future time (i.e. X(1) T = = X(C) T ); and (ii), the common marginal distribution at the coalescence time, T, is f. By repeating this approach multiple times, MCF provides multiple i.i.d. draws from f. The Bayesian Fusion (BF) approach of Dai et al. (2023) re-examined the theoretical underpinnings of MCF by introducing a stochastic differential equation (SDE) describing the coalescence of the C stochastic processes, and exploited this theory together with methodology for sequential Monte Carlo (SMC) to gradually coalesce the stochastic processes. The resulting output of the BF approach is a number of correlated and weighted draws from f. BF is a far more practical and robust algorithm than MCF. A key advantage of BF over MCF is that it is possible to give considerable user guidance in its implementation. Although BF provides significant improvements over MCF, the applicability of the methodology is still limited by factors including: (i) the numbers of sub-posteriors being combined; (ii) the level of sub-posterior correlation; (iii) the dimensionality of the sub-posteriors; (iv) the degree to which the sub-posteriors conflict; and (v) the computational cost of the approach even when the user-specified tuning parameters are optimally chosen. In this paper, we make two key contributions to address the limitations of MCF and BF: (i) we significantly improve upon the computational efficiency of BF by allowing the user to incorporate Chan, Pollock, Johansen and Roberts global information about each sub-posterior within the SDE formulation, and unify subsets of the sub-posteriors at any one time we term this approach Generalised Bayesian Fusion (GBF), and present it in Section 2 and Algorithm 1; (ii) using the flexibility given by (i) in which sub-posteriors can be partially unified, we embed our GBF methodology within the divide-and-conquer paradigm of Lindsten et al. (2017), allowing the user to combine sub-posteriors in stages to recover the fusion density f. We term this Divide-and-Conquer Fusion (D&C-Fusion), and present it in Section 3 and Algorithm 2. The remainder of the paper is organised as follows: In Section 4 we present detailed guidance on implementing our GBF and D&C-Fusion approaches, and in particular choosing any tuning parameters. In Section 5 we present applications of our methodology for a variety of models, comparing them to competing approximate methodologies. We conclude by outlining a variety of ways or Fusion approach could be extended, and used in other application settings. All technical proofs and detailed calculations are collated in the appendices. Statistical computations for this paper were written in R (R Core Team, 2022), C++ and Rcpp (Eddelbuettel, 2013). All code developed for producing numerical results can be found on Git Hub at https://github.com/rchan26/DCFusion. 2. A generalisation of the Fusion approach In this section we develop theory and methodology to generalise and improve upon the BF approach of Dai et al. (2023), by incorporating information about the covariance of the sub-posteriors within the SDE formulation. For completeness in Appendix A we more fully outline the connections of our methodology to the earlier MCF and BF works, highlighting explicitly the advantages of our approach, but for ease of presentation here we instead present our approach directly. In this section we also consider the more abstract problem of sampling from the density f(C) Q c C fc, where C is an index set representing the subposteriors we want to unify, and we assume we can sample (independently) X(c) fc for c C. This abstraction is useful for the methodology we develop in Section 3. For the purposes of simplifying the subsequent notation, we denote by x(C) t R|C| d a vector composed of x(c) t Rd for c C (in particular, we have x(C) t := (x(c1) t , . . . , x (c|C|) t ), with ci denoting the ith element of the index set C). We further assume that for c {1, . . . , C}, fc is nowhere zero and everywhere differentiable, and that we can compute Ac(x) := log fc(x), Ac(x), and 2Ac(x) pointwise (where is the gradient operator and 2 is the Hessian). A fuller discussion of these assumptions is given in Appendix A, but note that they match those of the earlier works of Dai et al. (2019, 2023). We begin by describing the joint distribution of |C| coalescing stochastic processes on [0, T] that at time T have the common marginal f(C) Q c C fc. We term this the fusion measure, F. To aid in the development of the subsequent methodology, we require that the stochastic processes can be simulated, and so this is done by considering a Radon-Nikod ym correction of the so-called proposal measure (P), which is defined to be the probability law induced Divide-and-Conquer Fusion by |C| interacting d-dimensional parallel continuous-time Markov processes in [0, T] where each process is given by the SDE, d X(c) t = Xt X(c) t T t dt + Λ 1 2c d W (c) t , X(c) 0 := x(c) 0 fc, t [0, T], (2) where Λc are (positive semi-definite) user-specified matrices associated to sub-posterior fc for c C with Λ1/2 c being the (positive semi-definite) square root of Λc where Λ1/2 c Λ1/2 c = Λc. Note that for the purposes of our numerical simulations later we use the Schur decomposition. Furthermore, {W (c) t }c C denotes independent Brownian motions, and c C Λ 1 c X(c) t denoting the weighted average of the processes at time t. In practice we typically take Λc to be a user estimate of the covariance matrix of the sub-posterior, ˆΣc which can be computed using the available sub-posterior samples for fc thereby incorporating problem-specific information about covariance structure. We will see that the choice for these matrices influences the efficiency of the algorithm but not the target distribution itself and thus incurs no bias. Realisations of the proposal measure are denoted as X := { x(C) t , t [0, T]}. For the purposes of exposition, we defer discussion on the practical simulation of P to Section 2.1. Now, we let the Fusion measure F be simply the measure induced by the following Radon Nikod ym derivative: d F d P(X) ρ0 x(C) 0 Y 0 φc X(c) t dt , (3) where {X(c) t , t [0, T]} is a Brownian bridge from X(c) 0 := x(c) 0 fc to X(c) T := x(c) T with covariance matrix Λc and ρ0 x(C) 0 := exp ( x(C) 0 x(c) 0 ) Λ 1 c ( x(C) 0 x(c) 0 ) 2T c C Λ 1 c x(c) t and φc(x) := 1 2 log fc(x) Λc log fc(x) + Tr Λc 2 log fc(x) . (6) Now, considering the time T marginal of X F we (almost surely) have: Theorem 1 Under the fusion measure F, the ending points of the |C| interacting, parallel processes have a common value at time T, y(C) which has density f(C) and y(C) = x(c1) T = = x (c|C|) T almost surely. Chan, Pollock, Johansen and Roberts Proof See Appendix B. Theorem 1 suggests that we can simulate from the fusion target density f(C) by simulating X F and retaining the T time marginal, y(C). As suggested by the theory, we do so by means of simulating a number of proposals X P and accepting (or importance weighting) the terminal time marginal y(C) with probability proportional to the Radon-Nikod ym derivative in (3). As such, we need to consider: (i), how to simulate proposals from X P (outlined in Section 2.1); and (ii), how to compute the Radon-Nikod ym correction (3) (outlined in Section 2.2). We then present our proposed complete methodology in Section 2.3. We discuss possible extensions of our approach in Section 2.4. 2.1 Simulating from the Proposal Measure First, we consider how to simulate proposals from X P. We begin by noting that the initialisation of the proposal measure given by (2) at time t = 0 only requires independent draws from the |C| sub-posteriors that we wish to unify, which in this paper we assume we have access to. If independent sampling is not feasible, it is possible to obtain approximate sub-posterior samples using MCMC (see Dai et al. (2023, Section 3.6) for a discussion on the impacts of using approximate sub-posterior samples for Fusion). Further, although paths X P are infinite dimensional random variables (and so we cannot draw entire sample paths from P), it is sufficient for our needs to simulate (exactly) the paths at a finite collection of times provided we can ensure that we are able to simulate the path (exactly) at time T. For clarity, we only consider simulating X at times given by the following auxiliary temporal partition, P = {t0, t1, . . . , tn : 0 =: t0 < t1 < < tn := T}. (7) We let j := tj tj 1 and for notational simplicity, subscripts are suppressed when considering the processes at times given in the temporal partition. In particular, let x(c) j denote x(c) tj , and let x(C) j denote x(C) tj . We will see from the following proposition, that algorithmically, to simulate from P at the time points in P, we can simply initialise the |C| paths with x(c) 0 fc for c C and sequentially simulate from the Normal distributions given in Proposition 2a (for j {1, . . . , n 1}) and b (for j = n). The following proposition tells us how to simulate from the transition density of P: Proposition 2 Let C := (c1, . . . , c|C|) denote the index set representing the sub-posteriors we wish to unify, then if X satisfies (2), then under the proposal measure, P, we have (a) For s < t < T, X(C) t X(C) s = x(C) s N|C|d M (C) s,t , Vs,t , (8) Divide-and-Conquer Fusion where M (C) s,t R|C| d := M (c1) s,t , . . . , M (c|C|) s,t with M (c) s,t = T t T sx(c) s + t s T s xs, (9) Γ11 Γ12 . . . Γ1|C| Γ21 Γ22 . . . Γ2|C| ... ... ... ... Γ|C|1 Γ|C|2 . . . Γ|C||C| R|C|d |C|d, (10) where for i, j = 1, . . . , |C|, Γii = (t s)(T t) T s Λci + (t s)2 T s ΛC Rd d, (11) Γij = (t s)2 T s ΛC Rd d. (12) (b) For s < t = T, y(C) := x(c1) T = = x (c|C|) T Nd( xs, ΛC). (c) For each c C, the distribution of {X(c) q , s q t} given endpoints X(c) s = x(c) s and X(c) t = x(c) t is a Brownian bridge with covariance matrix Λc, so X(c) u x(c) s , x(c) t Nd (t q)x(c) s + (q s)x(c) t t s , (t q)(q s) Proof See Appendix C. As we can initialise a draw from P, and from Proposition 2 we can simulate from its transition density, we can now explicitly express the d(n|C| + 1)-dimensional density of the |C|d-dimensional Markov process at the (n + 1) time marginals given by the temporal partition under P, by iterative simulation from the transition density: h C x(C) 0 , . . . , x(C) n 1, y(C) f x(C) 0 j=1 N|C|d x(C) j M (C) j , Vj Nd y(C) x(C) n 1, ΛC , (14) where f x(C) 0 Q c C fc x(c) 0 , and Nd(x|µ, Σ) denotes the density of a d-dimensional Normal distribution (evaluated at x) with mean µ and covariance Σ. For notational convenience we let M (C) j = M (C) tj 1,tj and Vj = Vtj 1,tj. 2.2 Radon-Nikod ym correction of the Proposal Now, we direct our consideration to the second step: computing the Radon-Nikod ym correction of (3), given we have drawn our proposal from P restricted to the times given by the Chan, Pollock, Johansen and Roberts partition P. Factorising the Radon-Nikod ym derivative in (3) according to the temporal partition P, the d(n|C| + 1)-dimensional density under F is g C x(C) 0 , . . . , x(C) n 1, y(C) h C x(C) 0 , . . . , x(C) n 1, y(C) j=0 ρj, (15) where ρ0 is given in (4) and for j {1, . . . , n}, ρj x(C) j 1, x(C) j = Y φc X(c) t Φc )# (0, 1], (16) and where WΛc,j is the law of a Brownian bridge {X(c) t , t (tj 1, tj)} from Xtj 1 := x(c) j 1 to Xtj := x(c) j with covariance Λc, and Φc < is a constant such that φc(x) Φc for all x and each c C. We note that the terms Φc for c C (the global lower bounds of the respective φc for c C) in (16) can be absorbed into normalising constants and, hence, as we apply sequential Monte Carlo methodology, they need not be evaluated (as shown more explicitly in Section 2.3). Whilst ρ0 (given by (4)) can be computed easily, direct computation of ρj in (16) for j {1, . . . , n} is not possible as it requires evaluation of path integrals of Brownian motion. However, it is possible to construct non-negative unbiased estimators for (16) (with finite variance and computable in finite cost) in a similar fashion to Beskos et al. (2008); Fearnhead et al. (2008); Dai et al. (2019, 2023). To do so, we require for a given sample path X(c) [tj 1,tj] WΛc,j that we have upper and lower bounds for φc(X(c) t ) for each c C. In general, it is not possible to find global bounds for φc, so we follow the approach of Beskos et al. (2008) and Pollock et al. (2016) who noted that if we can bound a sample path X(c) [tj 1,tj] WΛc,j, then conditional on these layers (or bounds) of the sample path, then we will be able to find local upper and lower bounds of φc denoted U(c) j and L(c) j , respectively, such that φc(X(c) t ) [L(c) j , U(c) j ] for t [tj 1, tj]. In order to practically implement this, we need to simulate Brownian bridges jointly with a compact region which almost surely constrains their path (a mechanism for doing this is described in Pollock et al. (2016, Sections 7 and 8)). We now describe one approach for doing this. To achieve this, let Rc := Rc(X[tj 1,tj]) denote the layer information (i.e the compact region in which X(c) t is constrained in time [tj 1, tj]). We note that it is possible to partition the sample space into disjoint sets and simulate from associated distribution function (without having to sample the underlying path), Rc Rc. If Λc = Id, then we can simulate a layer to which X(c) t Rc for t [tj 1, tj] by using algorithms outlined in Pollock et al. (2016, Sections 7 and 8) (for instance Pollock et al. (2016, Algorithm 14)). In the case where Λc = Id, we can still simulate Rc by appealing to a suitable transformation (which we detail fully in Appendix F and in Algorithm 5). Furthermore, once we have simulated layer information for X(c) t for t [tj 1, tj], we can simulate the path at any required time marginals conditional on the simulated layer, X(c) t WΛc,j|Rc (via a transformation and applying for instance Pollock et al. (2016, Algorithm 15)). Divide-and-Conquer Fusion Although it is possible to find tight local bounds for φc in a problem specific manner by exploiting specific structure, there are some generic strategies that can be followed. In sufficiently regular settings one might construct the partition necessary by first partitioning the domain of φc and then looking at the pre-image of that partition under φc, thereby reducing the problem to a univariate one. Alternatively, it is helpful in practice to note that it is possible to find generic (less tight) bounds given by the following proposition: Proposition 3 For all c C and x Rc, we have φc (x) h L(c) j , U(c) j i , where L(c) j := 1 2 d P Λc , (17) U(c) j := 1 1 2c log fc ˆx(c) + max x Rc 2 c x ˆx(c) P Λc 2 + d P Λc # where d denotes the dimension of x, is the Euclidean norm, ˆx(c) is a user-specified point central to Rc, and where P Λc is a quantity such that P Λc max x Rc γ Λc 2 log fc (x) , (19) with γ denoting the matrix norm, defined as γ(A) := max x =0 Ax Proof See Appendix D. Once local bounds for φc are obtained, we can unbiasedly estimate ρj (16) for j {1, . . . , n} by letting j := tj tj 1 and computing aj ρj, where aj := exp P c C Φc j and ρj x(C) j 1, x(C) j := Y κc j e U(c) j j κc! p (κc|Rc) h U(c) j φc X(c) ξc,kc where Rc is the simulated layer information for the Brownian bridge sample path X(c) t WΛc,j from x(c) j 1 to x(c) j , L(c) j and U(c) j are constants such that L(c) j φ X(c) t U(c) j for all X(c) t WΛc,j|Rc, κc is a discrete random variable with conditional probabilities P[κc = kc|Rc] := p(κc|Rc) (which at this stage we allow to be arbitrary) and ξc,1, . . . , ξc,κc iid U[tj 1, tj] for all c C. Theorem 4 Let aj := exp PC c=1 Φc j , then for every j = 1, . . . , n, aj ρj is an unbiased estimator of ρj. In particular, we have ρj = E h E h E h E h aj ρj {Rc, X(c) [tj 1,tj], κc}c C i {Rc, X(c) [tj 1,tj]}c C, i {Rc}c C ii Chan, Pollock, Johansen and Roberts = E RE W| RE KE U [aj ρj] , (22) where (for readability) the expectation subscript denotes the law with which they are taken. Here, R denotes the law of {Rc Rc : c = 1, . . . , C}, W denotes the law of the C Brownian bridges {WΛc,j : c = 1, . . . , C}, K denotes the law of {κc : c = 1, . . . , C} and U denotes the law of {ξc,1, . . . , ξc,κc : c = 1, . . . , C} iid U[tj 1, tj]. Proof See Appendix E. We note that this unbiased estimator for ρj allows for significant flexibility in choosing the law K. Following the discussion in Dai et al. (2023, Appendix B), there are two natural choices of unbiased estimators that could be used by making particular choices for the distribution of the discrete random variable used to simulate κc for c C. We denote these ρ(a) j and ρ(b) j and are based, respectively, upon the GPE-1 and GPE-2 estimators of Fearnhead et al. (2008): Definition 5 (GPE-1 for ρj (16)): Choosing the law of κc Poi (U(c) j L(c) j ) j for c C leads to the following estimator: ρ(a) j x(C) j 1, x(C) j := Y U(c) j φc X(c) ξc,kc U(c) j L(c) j where exp{PC c=1 Φc j} ρ(a) j is an unbiased estimator for ρj. Definition 6 (GPE-2 for ρj (16)): Choosing the law of κc NB(γc, βc) for c C with γc := U(c) j j Z tj x(c) j 1 tj s j + x(c) j s tj 1 leads to the following estimator: ρ(b) j x(C) j 1, x(C) j := Y e U(c) j j κc j Γ(βc) (βc + γc)βc+κc Γ(βc + κc)ββc c γκc c h U(c) j φc X(c) ξc,kc i (25) where exp{PC c=1 Φc j} ρ(b) j is an unbiased estimator for ρj. The estimators ρ(a) j and ρ(b) j can be computed as detailed in Appendix F, and by means of Algorithm 5, by appealing to Dai et al. (2023, Algorithm 4 and Appendix. B). ρ(a) j and ρ(b) j have particularly desirable properties (by choosing L(c) j and U(c) j as in Proposition 3): Divide-and-Conquer Fusion Proposition 7 Let aj := exp{PC c=1 Φc j}, then aj ρ(a) j ( x(C), y(C)) and aj ρ(b) j ( x(C), y(C)) are unbiased estimators of ρj( x(C), y(C)) which are positive with finite variance. In addition, ρ(a) j ( x(C), y(C)) [0, 1]. Proof See Fearnhead et al. (2008). As we discuss in Section 2.3, the critical consideration when choosing the law K is to minimise the variance of the estimator. In our subsequent simulations, we will typically choose the GPE-2 estimator in Condition 6 as it has been empirically shown to have superior performance in Fearnhead et al. (2008, Section 5) and Dai et al. (2023, Section 3.5). Note that the mean run time for both the estimators ρ(a) j and ρ(b) j will be random, but will be finite and proportional to κc for a given layer Rc Rc. 2.3 Methodology As we outlined earlier in this section, Theorem 1 suggests that we can simulate from the fusion target density f(C) by simulating X F and retaining the T time marginal, y(C). This can be achieved by simulating a number of proposals X P and accepting (or importance weighting) the terminal time marginal y(C) with probability proportional to the Radon Nikod ym derivative in (3). We are now able to implement each of these steps (as discussed in Sections 2.1 and 2.2 respectively), but we have considerable freedom over the details of the methodological approach. The simplest approach is a rejection sampler: simulate a proposal from h C (14) (by utilising Proposition 2), accept this proposal with probability Qn j=0 ρj, and conditional on acceptance return y(C). As more fully discussed in Appendix A, this coincides methodologically with MCF if we set Λc = Id for c {1, . . . , C} (although the formulation is different). The benefit of such a rejection sampler is it returns i.i.d. draws from f(C). However, it suffers from several inefficiencies. In particular, we would expect the acceptance probabilities ρj given in (4) to decay geometrically with increasing number of sub-posteriors, |C|, as each term in this product is bounded by 1. Furthermore, the acceptance probability Qn j=0 ρj will typically decay exponentially with increasing T. Consequently, a rejection sampling approach for this problem will ultimately be impractical in many practical settings as it will have very small acceptance probabilities. Similarly, the naive importance sampling adaptation of this approach (in which the proposal of the rejection sampler are all retained with a un-normalised importance weight of Qn j=0 ρj) will ultimately suffer from the same issues of robustness in practice. Inspired by the importance sampling approach, the BF approach of Dai et al. (2023) introduced the auxiliary temporal partition P in order to simulate from g C using SMC: allowing for the gradual coalescence of the C stochastic processes. In particular, we can initialise an SMC algorithm by simulating N particles from the time 0 marginal in h C (which consists of composing |C| samples from each of the sub-posterior densities to obtain x(C) 0 ), Chan, Pollock, Johansen and Roberts and assigning them an initial un-normalised importance weight given by w 0,i := ρ0( x(C) 0,i ) for i {1, . . . , N}. This initial particle set constitutes an approximation to the time 0 marginal of g C, and can be sequentially propagated n times (i.e. |P| 1 times) through the temporal time mesh P by simulating x(C) j,i | x(C) j 1,i Nd M (C) j,i , Vj as per (9) and (10) in Proposition 2. In our SMC formulation at each iteration (j {1, . . . , n}) the un-normalised importance weight of every particle is updated by a factor of ρj( x(C) j 1,i, x(C) j,i ) as per (21). Upon normalisation, the resulting weighted particle set after the nth iteration is an approximation of both the time T marginal of g C and our fusion target f(C). In particular, i=1 w(C) n,i δy(C) i (dy). (26) As we remarked upon in Section 2.2, due to this normalisation of the particle set weights we can avoid the need to explicitly compute the constants Φc in (16), as they are simply constants which cancel. As is common in SMC, to avoid weight degeneracy in which the variance of the importance weights degrades rapidly in n, we employ a resampling strategy (see for instance Gerber et al. (2019) for a recent investigation of the properties of many resampling schemes). In particular, we monitor the particle set for weight degeneracy by estimating its effective sample size (ESS) Kong et al. (1994). If the ESS falls below some user-specified threshold then at the beginning of the next iteration we resample the particle set to get N equally weighted particles. In all of our simulations in the subsequent sections, we used residual resampling (Higuchi, 1997; Liu and Chen, 1998; Whitley, 1994). We term our resulting Fusion approach Generalised Bayesian Fusion (GBF) and summarise it in Algorithm 1. 2.4 Practical extensions of Generalised Bayesian Fusion We now consider the practicalities of Algorithm 1. To generalise the algorithm further (and make it amenable to a recursive divide-and-conquer approach as in Divide-and-Conquer Fusion), we assume we have access to M importance weighted realisations of each subposterior, {x(c) 0,k, w(c) k }M k=1 for c C. To initialise the algorithm, we start by compos- ing M initial weighted particles by pairing the draws from each sub-posterior { x(C) 0,k}M k=1, and compute the associated (un-normalised) partial weights {w(C) 0,k }M k=1 where w(C) 0,k := Q c C w(c) k ρ0( x(C) 0,k) for k = 1, . . . , M. If we have M = N, we resample to obtain N samples from each sub-posterior, otherwise, we choose to only resample if the ESS is below some user-specified threshold. Note that in the Input step of Algorithm 1, we may have access to different numbers of samples from each sub-posterior: say Mc importance weighted samples for sub-posterior fc (for c C). In order to compose our M partial proposals in Step 1b, there are a number of approaches we could take. As presented above, if Mc = M Divide-and-Conquer Fusion Algorithm 1 gbf(C, {{x(c) 0,i, w(c) i }M i=1, Λc}c C, N, P): Generalised Bayesian Fusion (GBF). 1. Initialisation (j = 0): (a) Input: Importance weighted realisations {x(c) 0,i, w(c) i }M i=1 for c C := (c1, . . . , c|C|), the user-specified matrices, {Λc : c C}, the number of particles required, N, and temporal partition P := {t0, t1, . . . , tn : 0 =: t0 < t1 < < tn := T}. (b) Compose the importance weighted realisations { x(C) 0,k, w(C) 0,k }M k=1 where w(C) 0,k := Q c C w(c) k ρ0( x(C) 0,k) for k {1, . . . , M} as per (4). (c) w(C) 0,k: For k in 1 to M, compute normalised weight w0,k = w(C) 0,k / PM k =1 w(C) 0,k . (d) g M 0 : Set g M 0 (d x(C) 0 ) := PM k=1 w(C) 0,k δ x(C) 0,k(d x(C) 0 ) (e) x(C) 0,i : If M = N, for i = 1, . . . , N, resample x(C) 0,i g M 0 and reset w(C) 0,i = 1 2. Iterative updates. For j {1, . . . , n}: (a) Resample: If the ESS := PN i=1 w(C) j 1,i 2 1 breaches the lower user-specified threshold, then for i = 1, . . . , N, resample x(C) j 1,i g N j 1 and reset w(C) j 1,i = 1 (b) For i in 1 to N, i. x(C) j,i : Simulate x(C) j,i Nd M (C) j,i , Vj as per Proposition 2. ii. w(C) j,i : Compute un-normalised weight w(C) j,i = w(C) j 1,i ρj( x(C) j 1,i, x(C) j,i ) as per (21) (using Algorithm 5). (c) w(C) j,i : For i in 1 to N, compute normalised weight w(C) j,i = w(C) j,i / PN k =1 w(C) j,k . (d) g N j : Set g N j (d x(C) j ) := PN i=1 w(C) j,i δ x(C) j,i (d x(C) j ). 3. Output: n x(C) 0,i , . . . , x(C) n 1,i, y(C) i , w(C) n,i o N i=1, where ˆf(C)(dy) := g N n (dy) f(C)(y)dy. for c C, we simply pair the sub-posterior draws index-wise. This is a basic merging strategy of the sub-posterior realisations and has the advantage that it can be implemented in O(M) cost (and if Mc = M for every c C one could simply sub-sample to obtain a common number of samples from each sub-posterior). However, as noted in Lindsten et al. (2017), while this approach has a low computational cost, it can lead to high variance when the product Q c C fc(x(c)) differs substantially from the corresponding marginal of f(C) which one might expect to be the case in our setting when the sub-posteriors disagree. We found this simple approach more than adequate in our simulations, but there are more sophisticated options available should they be required in still more challenging settings. Chan, Pollock, Johansen and Roberts In particular, as described in Lindsten et al. (2017, Section 4.1), at the expense of a computational cost O(Q c C Mc), one could instead compose all possible permutations of the samples from each sub-posterior before weighting and then resampling to reduce the number of points in the approximation back to a pre-specified number, arriving at a better approximation at a greater cost. They termed this approach mixture resampling and also detailed a lightweight mixture resampling approach in which more than one permutation, but not all possible permutations, are used and found it to work well; as noted by Kuntz et al. (2022) such a strategy can be connected directly with the theory of incomplete U-statistics and consequently one might hope to realise much of the benefit of mixture resampling at a much reduced cost (see e.g. Kong and Zheng (2021)). 3. A divide-and-conquer approach to Fusion A key drawback of the Monte Carlo Fusion and Bayesian Fusion approaches of Dai et al. (2019, 2023), and the Generalised Bayesian Fusion approach we introduced and outlined in Section 2, is that it lacks robustness with increasing number of sub-posteriors, |C|. This is unsurprising as the extended target and proposal densities (g C and h C) are d(n|C| + 1)- dimensional, and these become increasingly mismatched with increasing dimension. In particular, as a consequence of the definition of ρj in (16), the acceptance probability of any rejection-based scheme will decrease geometrically with increasing |C|. Fundamentally, importance sampling variants of this will not address this bottleneck. As presented both in Dai et al. (2019, 2023), Fusion is an example of a fork-and-join approach all of the sub-posteriors are unified in a single step. In particular, within the GBF framework of Section 2 we set C := {1, . . . , C}. This is illustrated in the tree diagram of Figure 1, where the leaves of the tree represent the available sub-posterior densities, the directed edges are used to illustrate the computational flow of MCF, and the root vertex of the tree is the desired fusion density, f (as given in (1)). Figure 1: A tree representation of the fork-and-join approach of Monte Carlo Fusion. As the goal of the methodology is to approximate f in (1), one could envision a recursive divide-and-conquer approach in which the sub-posteriors are combined in stages to recover f. There are a number of possible orderings in which we could combine sub-posteriors, and so we represent these orderings in tree diagrams, and term these hierarchies (see Figure 2). For instance as illustrated in Figure 2a, one approach would be to combine two sub-posteriors at a time (we term this a balanced-binary tree approach). In Figure 2a, the intermediate vertices represent intermediate (auxiliary) densities up to proportionality. Divide-and-Conquer Fusion The approximation of the distribution associated with any non-leaf vertex is obtained by an application of Fusion methodology to the densities of the children of that vertex. A balanced-binary tree approach is perhaps the most natural way to combine sub-posteriors in a truly distributed setting (where the simulation of each sub-posterior has been conducted separately, and so the inferences we wish to combine are distributed). Another approach is given in Figure 2b, whereby sub-posteriors are fused one at a time (which we term a progressive tree approach). This is perhaps the most natural approach for an online setting. We focus on applying GBF to these two natural hierarchies for the remainder of this paper, although other hierarchies are certainly possible within our framework, and there is no limitation in unifying more than two vertices at any level of a tree (as suggested by both Section 2 and Figure 1). (a) A balanced-binary tree. QC 1 c=1 fc (b) A progressive tree. Figure 2: Illustrative hierarchies for the Fusion problem of (1). From this recursive perspective, sample approximations of auxiliary densities obtained at one level of any tree are themselves treated as sub-posteriors at the next level up. As such, one can iteratively apply the Fusion methodology of Section 2, working through the levels of the tree from the leaves to the root, using at each stage the output of one step as the input for the subsequent step. An advantage of our divide-and-conquer approach is that as fewer sub-posteriors are combined at each stage, we avoid (at each stage) the rapidly diminishing and variable importance weights. A divide-and-conquer variant of Sequential Monte Carlo (D&C-SMC) was recently introduced in Lindsten et al. (2017). D&C-SMC generalises the classical SMC framework from sequences (or chains) to trees, such as those in Figures 1 and 2. The theoretical properties of D&C-SMC are increasingly-well characterized and include a strong law of large numbers, finite sample Lp errors bounds as well as a N-central limit theorem under mild conditions (see Kuntz et al. (2023)). We thus embed our GBF approach within a D&C-SMC algorithm to address the robustness of Fusion with increasing |C|, albeit this being a trade-offwith the cost of the repeated application of the methodology. In our recursive setting, we unify distributed sample approximations by operating on a tree of auxiliary Fusion densities. Let T = (V, E) denote a tree with vertices V and (directed) edge set E. Let Leaf(T) denote the leaves of the tree (which represent the sub-posteriors f1, . . . , f C), Root(T) denote the root of the tree (which represents f) and Ch(v) denote the children of vertex v V where Chan, Pollock, Johansen and Roberts Ch(t) = if t is a leaf. Let V = {v0, v1, . . . , v C, . . .} be the set of vertices, with v0 = Root(T), {v1, . . . , v C} = Leaf(T) and as many intermediate vertices as are required to specify the tree. For the purposes of utilising the methodology developed in Section 2, we define the following notation for non-leaf vertices (i.e. v / Leaf(T)): let Cv := u Ch(v)Cu denote the index set representing the sub-posteriors that we want to unify for vertex v / Leaf(T). In addition, to simplify the notation and avoid an unnecessary level of subscripts, we index densities and other quantities by v rather than Cv when it is clear what is intended. In particular, let Λv := ΛCv, x(v) t := x(Cv) t , x(v) t := x(Cv) t , y(v) := y(Cv) where y(v) fv := f(Cv) for v / Leaf(T). Let WΛv,j denote the law of a Brownian bridge {X(v) t , t [tj 1, tj]} with X(v) tj 1 := x(v) j 1 and X(v) tj := x(v) j with covariance Λv for j {1, . . . , n}. The extended target and proposal densities for vertex v / Leaf(T) are denoted gv := g Cv and hv := h Cv, respectively. Lastly, the importance sampling weights for v / Leaf(T) are given by ρ(v) 0 ( x(v)) := ρ0( x(Cv)) and ρ(v) j ( x(v), y(v)) := ρj( x(Cv), y(Cv)) for all j. To describe our Divide-and-Conquer Fusion (D&C-Fusion) approach, we specify an algorithm that is carried out at each vertex v V which leads to a recursive procedure; an initial call to D&C-Fusion(Root(V), . . .) carries out the overall approach. For v V, we define a procedure (as given in Algorithm 2), which returns a weighted particle set { x(v) 0,i , . . . , x(v) n 1,i, y(v) i , w(v) n,i}N i=1 where w(v) n,i denotes the normalised importance weight of particle i for vertex v V. From this particle set, we can take the marginal weighted samples for y(v) to approximate the fusion density fv Q u Ch(v) fu for vertex v V. Recall that the leaf vertices, vc for c {1, . . . , C}, represent each of the sub-posteriors. It is possible to additionally incorporate importance sampling for the leaf vertices, but for simplicity we assume that we have access to unweighted samples for the sub-posteriors. Therefore, at these leaf vertices, we simply sample from the sub-posteriors. If independent sampling is not feasible, one could use MCMC to obtain unweighted sample approximations at the leaves. Formal arguments (under appropriate regularity conditions) could in principle follow an approach analogous to that in Finke et al. (2020). If v is a non-leaf vertex, we simply call Algorithm 1 by inputting the importance weighted samples {y(u) i , w(u) i }N i=1 for u Ch(v). As in standard SMC, although the auxiliary distributions are defined on larger spaces we do not need to retain sampled values which are not subsequently used; to obtain a more computationally manageable algorithm, we can choose to retain only the final parameter space marginal at each vertex (i.e. only returning {y(v) i , w(v) i }N i=1) since we only require this to compute the importance weights in Algorithm 1 at each vertex v / Leaf(T). Note that in Algorithm 2, we allow the user to specify different temporal partitions at each node and level (i.e. {Pu}u Ch(v), Pv). As we explore fully in Section 4, when we develop guidance for user chosen tuning parameters, having this flexibility on the temporal partition can lead to a far more robust and efficient implementation of Algorithm 2. Divide-and-Conquer Fusion Algorithm 2 D&C-Fusion(v, N, P): Divide-and-Conquer Fusion (D&C-Fusion). Given: Sub-posteriors, {fu}u Leaf(T), and preconditioning matrices {Λu}u T. Input: Node in tree, v, the number of particles N, and (optionally) the temporal mesh partitions {Pu}u Ch(v), Pv. 1. For u Ch(v), (a) n x(u) 0,i , . . . , x(u) n 1,i, y(u) i , w(u) n,i o N i=1 D&C-Fusion(u, N, Pu). 2. If v Leaf(T), (a) For i = 1, . . . , N, sample y(v) i fv(y). (b) Output: { , y(v) i , 1 3. If v / Leaf(T), (a) If Pv is not inputted, apply guidance from Section 4.1 and Section 4.2. (b) Output: Call gbf(Ch(v), {{y(u) i , w(u) i }N i=1, Λu}u Ch(v), N, Pv). 4. Implementational guidance for Generalised Bayesian Fusion In this section we develop guidance for choosing the parameter T and the temporal partition P (and so n implicitly) for our Generalised Bayesian Fusion (GBF) approach (Algorithm 1), the guidance for which can be used directly at each node within our Divide-and-Conquer Fusion approach (Algorithm 2). As GBF is fundamentally a sequential Monte Carlo (SMC) algorithm, we want to choose these hyperparameters in such a way to ensure that the discrepancy between subsequent proposal and target distributions are not degenerate. For this reason, and in common with Dai et al. (2023, Section 3), we look at the incremental weight changes and study the current effective sample size (CESS) associated with these weights: PN i=1 ρj,i 2 PN i=1 ρ2 j,i for j = 1, . . . , n; CESS0 := PN i=1 ρ0,i 2 PN i=1 ρ2 0,i , (27) where ρ0,i and ρj,i are given in (4) and (21) respectively. In order to develop heuristics to choose hyper-parameters, we consider the idealised setting of combining multivariate Gaussian sub-posteriors with mean vector ac and covariance matrix b|C|Λc/m, for some b > 0, for c C. The target is f Nd( a, b|C|ΛC/m), where a := P c C Λ 1 c 1 P c C Λ 1 c ac and ΛC := P c C Λ 1 c 1. In BF, this idealised setting was used to help select T and n and, by imposing an additional assumption that the partition was a regular mesh, in turn P. In this section we instead develop guidance for T (see Section 4.1) in the more sophisticated GBF setting, and then Chan, Pollock, Johansen and Roberts in Section 4.2 investigate the more challenging selection of P without assumption on its regularity (i.e. permitting an irregular choice of mesh) and so we instead implicitly find n. These ideas can also be directly applied to improve BF itself, which we show later in our numerical results. In our idealised setting, the key consideration is the degree to which the sub-posteriors disagree with one another. To measure how significant the sub-posterior conflict is we define σ2 a := 1 c C (ac a) Λ 1 c (ac a). (28) We further consider the two following conditions in order to explore how the algorithm hyperparameters should change according to sub-posterior heterogeneity: Condition 8 SH(λ). The sub-posteriors obey the SH(λ) condition (for some constant λ > 0) if σ2 a = b(|C| 1)λ Remark 9 Interpretation of λ. Of course SH(λ) will always hold for some λ, and this condition can alternatively be interpreted as a definition of λ. We will be particularly interested in moderate values of λ close to 1 which will indicate only weak or no sub-posterior discrepancy. SH(λ) is a natural condition, arising for instance if m |C| of the data is randomly allocated to each sub-posteriors then σ2 a b mχ2 |C| 1 and have mean b(|C| 1) m . The λ for which SH(λ) holds is therefore χ2 |C| 1/(|C| 1) and therefore has mean 1 and variance 2/(|C| 1). Consequently for large |C|, we would expect λ to be close to 1. In this idealised i.i.d. case, these arguments duplicate classical ANOVA calculations. However the SH(λ) condition for moderate λ > 1 is also of interest indicating weak discrepancy between sub-posteriors. This would occur (for instance) if the data consisted of disjoint segments of a long ergodic stationary sequence with no long-range dependence where, in this case, λ is an estimate of the integrated auto-correlation time of the sequence. For this reason, the scenario λ < 1 would not normally occur (particularly for large |C|). In the examples later on, we will set λ = 1 as default, since this is the natural iid scenario. However, as noted above, if we suspect that there is weak discrepancy between the subposteriors, or there is some dependency between the subsets of data, we may also choose λ to be slightly greater than 1 or alternatively estimate it from the data. The defining characteristic of SH(λ) is that λ is stable for large data sizes (large m). However for stronger sub-posterior discrepancy, just as the power of ANOVA tests become larger for larger data sets, λ will become much larger with m where there is a systematic difference in the data distributions between sub-posteriors. Now SH(λ) will not adequately describe this dependence, and so we consider the following scenario instead: Divide-and-Conquer Fusion Condition 10 SSH(γ). The sub-posteriors obey the super sub-posterior heterogeneity SSH(γ) condition (for some constant γ > 0) if σ2 a = bγ. (30) As with SH(λ), this can alternatively be seen as a definition of γ. This setting can arise if the sub-posterior heterogeneity does not decay with data size m. Remark 11 Choice of b: In the case that the user-specified matrices {Λc}c C are chosen to be the estimated covariance matrices for each sub-posterior, then we would set b = m Therefore, the sub-posteriors fc Nd(ac, b|C| m Λc) have variance which closely matches the sub-posterior variance. In general, we want to choose b such that b|C| m Λc is close to the variance of sub-posterior fc for c C. We study empirically our choices of tuning parameter (T, n and P) in the idealised settings described by the SH(λ) condition (of Condition 8) and SSH(γ) condition (of Condition 10) in Sections I.1 I.2 respectively. Note that the implementational guidance we provide in this section is for the general application and tuning of GBF methodology. In many practical settings there will be additional constraints which require further modification to GBF. This includes settings where latency between cores is problematic, or in scenarios where functional evaluations of the sub-posterior densities fc are not available. In Appendix H we provide further direction on some of what we envisage to be the most common modifications. 4.1 Guidance for choosing T In this section, we develop guidance on selecting T for the two idealised settings, SH(λ) and SSH(γ), defined in Conditions 8 and 10, respectively. In each setting, by first specifying the lower bound on the initial effective sample size that we desire, we can compute a minimum value of T which should be used in Algorithm 1. As choosing a larger value of T typically results in more iterations in GBF, we suggest using the minimum value of T which is suggested. The time horizon T only directly affects the initial weighting given to each of the N particles through ρ0 in (4). Thus, to develop guidance for T we study CESS0 in (27): Theorem 12 Let fc Nd(ac, b|C| m Λc) for c C, then considering the initial conditional effective sample size CESS0 we have that as N , the following convergence in probability holds N 1CESS0 p exp m T |C| + b m T |C| + 2b Chan, Pollock, Johansen and Roberts Proof See Appendix G. The following corollary considers the effect of T on CESS0 in the SH(λ) and SSH(γ) settings: Corollary 13 If for some constant k1 > 0, T is chosen such that T b|C|3/2k1 m for some constant k1, then the following lower bounds on CESS0 hold: (a) If SH(λ) holds for some λ > 0, then lim N N 1CESS0 exp λ k2 1 d 2k2 1 (b) If SSH(γ) holds for some γ > 0, and T k2|C| 1 2 for some constant k2, then lim N N 1CESS0 exp bγ k1k2 d 2k2 1 Proof See Appendix G. We choose k1 and k2 by means of Remark 14, which in turn allows us to determine T. As required by Remark 14 we first set λ = 1 (see Remark 9), b (using Remark 11), and σ2 a as per (28). Remark 14 Choice of k1, k2: To choose k1 and k2, we first specify ζ (0, 1) to be a lower bound on the initial relative effective sample size that we would desire. We then can consider which situation that we are likely to be in, and then: 1. Under SH(λ), suppose we want to ensure N 1CESS0 is above ζ (0, 1), from (32), we have exp n λ k2 1 d 2k2 1 o = ζ, which implies we choose k1 = 2 ) log(ζ) . 2. Under SSH(γ), suppose we want to ensure N 1CESS0 is above ζ (0, 1), then from (33), we have k1k2 d 2k2 1 Recall that for SSH(γ), we must have T max n b|C|3/2k1 m , |C| 1 2 k2 o . Since we wish T to be small, we would like k1 and k2 to be small, and thus we set these two terms equal to each other and find k2 = b|C|k1 m . Substituting into (34), we then choose 2) log(ζ) . Given k1 and k2, T can be chosen such that T b|C|3/2k1 m if SH(λ) holds, and T max n b|C|3/2k1 m , |C| 1 2 k2 o if SSH(γ) holds. Typically we want to minimise iterations of Algorithm 1 Step 2, and so we choose the smallest T which satisfies the user-specified ζ (0, 1). Divide-and-Conquer Fusion 4.2 Guidance for choosing P In order to choose the temporal mesh P we consider two approaches, each of which is considered and optimised by means of our CESS of (27): i) by first fixing n and assuming a regular mesh (as in Dai et al. (2023)), we then optimise for n by reference to the maximally tolerable degradation of CESSj over any single iterate (see Section 4.2.1); (ii) by starting at t0 = 0 we decide on the placement of t1 such that we do not violate the maximally tolerable degradation of CESS1, and then iterate until we reach T, and so leading to a irregular (adaptive) mesh and implicitly choosing n (see Section 4.2.2). We summarise these two mesh constructions in Algorithms 3 and 4 To simplify the analysis of Algorithm 1, for which there is considerable flexibility in the choice of proposal distribution for our unbiased estimator of the importance weights (see Theorem 4 of Section 2.2), we assume that we have access to the optimal unbiased estimator. Fearnhead et al. (2008, Theorem 1) (and Dai et al. (2023, Appendix B)) show that the variance of the unbiased estimator aj ρj is minimised when p(κc|Rc) Poi(λc), where U(c) j φ X(c) t 2 dt for c C. With this choice the second moment is finite and E h (aj ρj)2i 1 < . In practice choosing this optimal distribution for K is not possible since the integral in (35) cannot be evaluated directly. This is why in Section 2.2 we choose alternative simulatable distributions (as described in Conditions 5 6), which try to match this optimal distribution. With this optimal choice, we establish the following theorem: Theorem 15 Let p(κc|Rc) in (21) be a Poisson distribution with intensity given in (35), for c C, and k3, k4 be positive constants. If lim j 0 is taken over sequences of j := tj tj 1 0 with tj tj 1 j := min ( b2|C|k3 E [νj] m2 , b2|C|k4 where νj := 1 x(c) j 1 ac Λ 1 c x(c) j 1 ac , (37) and the expectation E[νj] is taken over x(C) j 1, we have plim j 0plim N N 1CESSj e k3 k4, (38) where plim denotes a limit in probability. Proof See Appendix G. Chan, Pollock, Johansen and Roberts Remark 16 In Theorem 15, νj (as defined in (37)) describes the scaled/weighted average variation of the |C| trajectories of the distribution of their proposed update locations with respect to their individual sub-posterior means (i.e. describing how far x(c) j 1 is from ac). Since the GBF approach has |C| trajectories which are initialised from their respective subposterior distributions and coalesce to a common end point, this variation is mainly determined by a combination of: (i) how large the time horizon T is; (ii) how large the interval we are simulating over for this iteration (tj 1, tj]; and (iii) how much the sub-posteriors conflict which we determine by looking at the variation in their means as per (28). Given a weighted particle set from the (j 1)th iteration of the algorithm, { x(C) j 1,i, w(C) j 1,i}N i=1, a natural estimator for E [νj] is i=1 w(C) j 1,i x(c) j 1,i ac Λ 1 c x(c) j 1,i ac ! Following Theorem 15 and Remark 16 we now have the additional problem of specifying k3 and k4, and using the result to develop practical guidance. We do so by means of letting the user choose the meaning parameter ζ (0, 1), which is we define to be a lower bound on the conditional effective sample size that they would tolerate. We can then select k3 and k4 such that e k3 k4 = ζ and compute tj = min n T, tj 1 + j o (40) recursively at each iteration until j = n such that tn = T. Remark 17 Note that we expect to have very different performance with different choices of k3 and k4. For instance, we can obtain a very high CESSj by simply choosing k3 very small and set k4 = log(ζ ) k3, which ultimately leads to having very small intervals sizes j. Choosing small interval sizes may help computationally simulating ρj, but this comes at the cost of having more iterations of the algorithm, leading to an increased communication between the cores. Natural choices for jointly specifying k3 and k4 are ones which lead to the largest interval size which still satisfies N 1CESSj ζ (0, 1), as this minimises the number of iterations of Algorithm 1 Step 2. We now consider the previously outlined regular and irregular (adaptive) mesh selection of P in Section 4.2.1 and Section 4.2.2 respectively. 4.2.1 A regular mesh construction Imposing an additional assumption that the temporal partition P is regular simplifies Algorithm 1 as it avoids us having to dynamically compute (36) at each iteration of Step 2. In particular, j = for each j {1, . . . , n} where n = T/ (where x denotes the smallest integer greater than or equal to x). This simplification of regularity was suggested in Divide-and-Conquer Fusion Dai et al. (2023, Remark 6). They noted that for large datasets in which observations were randomly allocated to sub-posteriors, that one would expect sub-posterior heterogeneity to be small. Hence one would expect E[νj] to be small (of O(m 1)). In their simulations, Dai et al. (2023, Section 3 and 4) set k3 = k4 = 1 and := tj 1 tj = p (b2|C|k4)/(2m2d) for all j. The rationale presented for these choices in Dai et al. (2023, Remark 6) does not hold in full generality so in this section, we instead develop a more systematic way to construct a regular mesh. In particular, setting k3 = k4 as they suggest is sub-optimal. Given a user specified lower bound on CESSj that they would tolerate (i.e. some ζ (0, 1)), we want to minimise the number of iterations of Algorithm 1 Step 2. This is achieved with reference to Theorem 15 (and in particular (36)). In particular, we choose a combination of k3 and k4 such that: (i) exp{ k3 k4} ζ (i.e. CESSj for any j does not violate the chosen ζ ); and (ii), b2|C|k3 E[νj]m2 q 2m2d for each j. The difficulty here is that at each iteration, we need the average variation of the trajectories E[νj]. Of course, this is not possible directly and so an estimate [ E[νj] is computed as per the guidance of Remark 16. To ensure the chosen ζ is not violated at any iteration we follow the guidance of (36) by using a supremum over all intervals of this estimator (i.e. supj [ E[νj]). This choice allows us to specify k3 and k4, and so in turn n and P. Remark 18 For ease of practically implementing Algorithm 1, it is desirable to avoid any recursive definitions of n and P (i.e. they are specified prior to calling Algorithm 1 Step 2 where they are required). In this setting we would need to estimate supj [ E[νj] based upon only the initial (weighted) sub-posterior realisations { x(C) 0,i , w(C) 0,i }M i=1 obtained in Algorithm 1 Step 1b. Following Remark 16, we would expect E[νj] to be maximised at t = T (corresponding to (42)), but in some instance may also occur at t = 0 (corresponding to (43)). In most practical applications of GBF it will be at t = T as the proposal for the coalescence of the |C| stochastic processes has a Gaussian distribution with mean x(C) 0 with variance TΛC (as a consequence of Proposition 2 and considering s = 0 and t = T). On the other hand, if the sub-posterior means are very close together, the largest variation in the trajectories from their respective means could occur at the start of the bridge. As such, we propose taking the larger value of those two scenarios to arrive at the following approximation: sup j [ E [νj] max{Ψ1, Ψ2}, (41) i=1 w(C) 0,i 1 |C| x(C) 0,i ac Λ 1 c x(C) 0,i ac , (42) i=1 w(C) 0,i 1 |C| x(c) 0,i ac Λ 1 c x(c) 0,i ac , (43) Chan, Pollock, Johansen and Roberts and where x(C) 0,i is defined in (5) and w(C) 0,i are the initial particle weights given in Algorithm 1 Step 1b. Our approximation of supj [ E [νj] has obvious limitations: it may not be conservative enough to ensure the user chosen ζ is not breached; it may be too conservative and lead to choosing n too high. In practice we have found it to be a robust approximation. Once we have a suitable estimate of supj [ E [νj], we need to find a suitable choice for k3 and k4 to ensure that we always choose the RHS side of (36) (as that leads to a regular mesh) and satisfies ζ . As there are many combinations of k3 and k4 which can return a regular mesh, we aim to find the combination which returns the largest interval size. We can do this by means of the following proposition which considers the jth interval of the partition: Proposition 19 Considering the jth interval of P (i.e. [tj 1, tj]), given a user-specified threshold ζ (0, 1) and estimate [ E[νj] of E[νj], then the largest interval size which satisfies N 1CESSj ζ is given by [ E[νj] 2 m2 2b2|C|d 2 log(ζ ) s 2 log(ζ ) [ E[νj] 2 m2 2 4 log(ζ )2 Proof See Appendix G. Using Proposition 19, we can substitute our estimate of supj [ E[νj] into (44), and subse- quently compute the regular interval size := q 2m2d and hence n = T/ . In effect, here we are setting k4 = supj k4,j. This process is summarised in Algorithm 3. 4.2.2 An adaptive mesh construction Our presentation of Section 4.2.1 (as opposed to that of Dai et al. (2023)), naturally suggests an adaptive approach, leading to a partition P with an irregular mesh. Since the construction of the regular mesh is based upon the worst case scenario of the trajectory variation, this leads to an excessive resolution of P. In this section we will address this. Instead, suppose we are at the beginning of the jth iteration of Algorithm 1 Step 2. At this point we have in effect simulated our |C| stochastic processes up to time tj 1 < T. We can now consider the placement of the next point in the partition (i.e. min(tj, T)) with reference to the user chosen ζ (0, 1). In particular, we want the interval to be as large Divide-and-Conquer Fusion Algorithm 3 Computing regular mesh P. Input: Time T > 0 and importance weighted particles { x(C) 0,i , w(C) 0,i }M i=1. 1. Compute estimate of supj [ E[νj] as per (41). 2. Compute k4 using the estimate from Step 1 as per (44). 3. Compute := q 2m2d and let n = T/ . 4. For j {1, . . . , n}, let tj = min{T, tj 1 + }. 5. Output: P := {t0, . . . , tn}. as possible while ensuring that the CESSj does not degrade by more than ζ . To do this we can compute an estimate of E[νj] as per (39) and appeal to Proposition 19 in order to choose k4,j, and consequently the interval size j in order to set tj = min(tj 1 + j, T). Once we reach T we simply halt iterating Algorithm 1 Step 2. In contrast to the regular mesh construction in Section 4.2.1, we cannot compute the temporal mesh prior to Algorithm 1 Step 2. Therefore, the computation of the interval size for iteration j must be done immediately after Step 2a and prior to Step 2b of Algorithm 1. In this setting, the number of steps in Algorithm 1, n, is not known in advance. Given the construction of the regular mesh assumes the worst case interval in selecting the mesh size, we would expect that n would be lower in our adaptive approach. Indeed, we show this empirically in our later simulation studies. We summarise this approach in Algorithm 4. Algorithm 4 Computing adaptive mesh P (computing j at iteration j immediately after Algorithm 1 Step 2a). Input: Time T > 0 and importance weighted particles { x(C) j 1,i, w(C) j 1,i}N i=1. 1. Compute [ E[νj] as per (39). 2. Compute k4 with the estimate from Step 1 as per (44). 3. Compute tj = min T, tj 1 + q 4. Output: j := tj tj 1. In general, it is known that adaptively specifying the sequence of target distributions in sequential Monte Carlo type algorithms can introduce a small bias. In practice, this bias is typically small enough that it can be neglected unless unbiasedness is required specifically, for example to allow the algorithm to be embedded within another. If the small bias is a problem it can be eliminated at the expense of a doubling of the computational cost by first Chan, Pollock, Johansen and Roberts running an adaptive version of the algorithm and then running it a second time with the sequence of distributions fixed to those learned in the first run. 5. Examples In this section we consider a number of models applied to a variety datasets, and suppose the dataset is randomly split into C (disjoint) subsets. We compare the performance of our Fusion methodologies (GBF and D&C-Fusion) with other established (approximate) methodologies. To compare performance, we consider their computational run-times and Integrated Absolute Distance (IAD). To compute the IAD we average across each dimension the difference between the true target (fusion) density (f), and a kernel density estimate of the draws realised using a given methodology ( ˆf). In particular, Z ˆf(θj) f(θj) dθj [0, 1]. (45) In the case where the true marginal density is not available analytically, we take as a proxy for the target f a kernel density estimate of f (for instance, obtained by sampling from f or using the output of an MCMC run). As a benchmark for the target f we use Stan (Carpenter et al., 2017) to implement an MCMC sampler for the target posterior distribution using the full dataset. In implementing Fusion methodologies we use the GPE2 variants of Algorithm 2c and Algorithm 5 as before. Our implementation is as presented in Sections 2 and 3, following the guidance presented in Section 4 (but without the inclusion of any adaptions such as those presented in Appendix H). In implementing D&C-Fusion we use the balanced-binary tree hierarchy. For brevity of the main paper, all detailed derivations required for these specific examples have been put in Appendix J. The established methodologies we consider are Consensus Monte Carlo (CMC) (Scott et al., 2016) (implemented using the parallel MCMCcombine package in R available from https://github.com/cran/parallel MCMCcombine (Miroshnikov and Conlon, 2014)), the kernel density averaging approach of Neiswanger et al. (2014) (which we term KDEMC, and also implemented using the parallel MCMCcombine R package), and the Weierstrass Rejection Sampler (WRS) (Wang and Dunson, 2013) (implemented using their R code available at https://github.com/wwrechard/weierstrass). 5.1 Simulation studies In Appendix I, we study empirically the robustness of our Fusion algorithms in our two idealised settings the SH(λ) setting (Condition 8) and SSH(γ) setting (Condition 10). We consider a range of different hyperparameter choices and illustrate in both settings that utilising both the guidance for T and the mesh P, developed in Section 4, drastically improves the performance of both BF and GBF. By comparing the regular and adaptive meshes, we found that the adaptive mesh generally performed better as it provided similar Divide-and-Conquer Fusion performance to the regular mesh but at a much reduced computational cost. The full details of these experiments can be found in Sections I.1 and I.2. In Section I.3 we compare the performance of Fusion methodologies with increasing dimensionality and found that our GBF and D&C-Fusion approaches offer the best performance with regards to dimension. 5.2 Robust regression In this section we consider the Combined Cycle Power Plant dataset available from the UCI Machine Learning Repository (Kaya et al., 2012; T ufekci, 2014). The dataset comprises m = 9568 records of the net hourly electrical output of a combined cycle power plant over 6 years between 2006 and 2011, together with four (hourly averaged) ambient variables: temperature; ambient-pressure; relative-humidity; and, exhaust-vacuum. To model electrical output using the ambient variables, we use a robust regression model: yi t(ν, Xiβ, σ), i = 1, . . . , n, βj N1 µβj , σ2 βj , j = 0, . . . , p, where y Rn is the dependent variable (electrical output), X Rn (p+1) is the design matrix, β Rp+1 is the vector of predictor (ambient) variables which we want to perform inference on. For simplicity, we assume that ν, σ, µj and σ2 βj for j = 0, . . . , p are known. For our dataset p = 4, and so d = 5. We consider C {4, 8, 16, 32, 64, 128} cores, each of which is assigned a random split of the data. We use Stan to sample the sub-posteriors (with µj = 0 and σ2 βj = 10C for j = 0, . . . , p), which we will attempt to unify as in (1). We use the approximate CMC, KDEMC and WRS approaches to do this, together with our D&C-Fusion approach. In implementing D&C-Fusion we set N = 10000, ζ = 0.5, ζ = 0.05, and consider both the regular and adaptive mesh variants of the temporal partition, P. We resample if the ESS falls below 0.5N. The results are presented in Figure 3. Figure 3a clearly shows that of all the approaches considered, D&C-Fusion provides the highest quality and most reliable sample approximation for f, and is the most robust to increasing C. Although more expensive computationally, D&C-Fusion has a cost which grows at the same rate as the approximate methodologies considered. 5.3 Negative Binomial regression Here we consider the Bike Sharing dataset available on the UCI Machine Learning Repository (Fanaee-T and Gama, 2014). The dataset contains m = 17379 records of the total count of bikes on rental each hour, together with seven variables: seasonality (a categorical variable with four levels: spring, summer, autumn, winter); weekend (binary, taking value 1 if a weekend, and 0 if not); holiday; (binary, taking value 1 if a holiday, and 0 if not); rush-hour (binary, taking value 1 if recorded on a weekday between 7AM-9AM Chan, Pollock, Johansen and Roberts 0.0 0.1 0.2 0.3 0.4 0.5 Integrated Absolute Distance 2 3 4 5 6 7 0.0 0.1 0.2 0.3 0.4 0.5 D&C GBF (regular mesh) D&C GBF (adaptive mesh) CMC KDEMC WRS (a) Integrated Absolute Distance. log(Time elapsed in seconds, 2) 2 3 4 5 6 7 2 0 2 4 6 8 10 12 14 16 D&C GBF (regular mesh) D&C GBF (adaptive mesh) CMC KDEMC WRS (b) Computational cost. Figure 3: Comparison of competing methodologies to D&C-Fusion applied to a robust regression model using the power plant dataset (see Section 5.2). or 4PM-7PM, and 0 if not); weather (binary, taking value 1 if clear , and 0 if not); temperature (continuous); and, wind-speed (continuous). We use treatment contrast coding to encode the seasonality via three binary variables. To model the total count of bikes on rental, we use the following Negative binomial (NB) regression model: yi NB(µi, r), where log(µi) = Xiβ, i {1, . . . , n}, βj N1 µβj , σ2 βj , j {0, . . . , p}, where y Rn is our total count of bikes on rental, X Rn (p+1) is the design matrix, β Rp+1 is the vector of predictor variables. For simplicity, r, µβj , σ2 βj for j = 0, . . . , p are assumed known. For this data set p = 9, and so d = 10. As in Section 5.2, we split the dataset amongst C {4, 8, 16, 32, 64, 128} cores, and use Stan with µj = 0 and σ2 βj = 10C for j = 0, . . . , p, to recover the respective sub-posteriors. To implement D&C-Fusion we set N = 10000, ζ = 0.2, ζ = 0.05, and consider both regular and adaptive mesh variants of P, and resample if the ESS drops below 0.5N. The results in Figure 4 again show that, when contrasted with existing (approximate) approaches, D&C-Fusion provides the most accurate sample approximation, and is robust and consistent with increasing C. Divide-and-Conquer Fusion Integrated Absolute Distance 2 3 4 5 6 7 0.0 0.1 0.2 0.3 D&C GBF (regular mesh) D&C GBF (adaptive mesh) CMC KDEMC WRS (a) Integrated Absolute Distance. log(Time elapsed in seconds, 2) 2 3 4 5 6 7 2 0 2 4 6 8 10 12 14 16 18 D&C GBF (regular mesh) D&C GBF (adaptive mesh) CMC KDEMC WRS (b) Computational cost. Figure 4: Comparison of competing methodologies to D&C-Fusion applied to a Negative Binomial regression model using the bike sharing dataset (see Section 5.3). 5.4 Logistic regression In this section, we apply a logistic regression model to two different datasets (each of which highlight an aspect of Fusion): ( 1 with probability exp{x i β} 1+exp{x i β}, 0 otherwise. (46) 5.4.1 Small data Here we consider a small data size scenario (m = 1000), in which the data is simulated from a logistic regression model (46). This is a variant of Scott et al. (2016, Section 4.3), and is of interest as when the data is (randomly) split among the available cores both exact and approximate Fusion approaches struggle. This is due to the resulting sub-posteriors being naturally conflicting and lacking fully overlapping support with one another. Each record of the simulated design matrix contained four covariates in addition to an intercept. The ith entry of the design matrix is given by xi = [1, ζi,1, ζi,2, ζi,3, ζi,4] , where ζi,1, ζi,2, ζi,3, ζi,4 are random variables generated from a mixture density with a point-mass at zero (and so are either activated or not). In particular, we have for j {1, . . . , 4} that ζi,j pj N1(1, 1)+(1 pj)δ0. For this example we chose p1 = 0.2, p2 = 0.3, p3 = 0.5 and p4 = 0.01 (corresponding to a rarely activated covariate). Upon simulating the design matrix, binary observations were obtained by simulation using the parameters β = [ 3, 1.2, 0.5, 0.8, 3] . In total there were a relatively small number of positive responses (P i yi = 129). Chan, Pollock, Johansen and Roberts To conduct Fusion we first equally split the data between C {4, 8, 16, 32, 64} cores. We again use Stan with Gaussian prior distributions with mean 0 and variance C on each parameter to find a sample approximation of each sub-posterior. Together with the approximate methodologies, we implemented our D&C-Fusion approach with N = 10000, ζ = 0.2, ζ = 0.05, and both regular and adaptive temporal partition meshes. Here, we also consider applying Generalised Bayesian Fusion (GBF) (i.e. directly applying Algorithm 1 with C := {1, . . . , C} (which is equivalent to D&C-Fusion within a fork-and-join tree hierarchy, as per Figure 1)). We present the results in Figure 5. Considering Figure 5a, we see again that D&C-Fusion achieves the best sample approximation, and the quality of the sample approximation is robust to increasing C. Note that our divide-and-conquer framework offers significant gains, with D&C-Fusion outperforming GBF in terms of robustness with C (even with the same tuning parameter guidance being followed). Note that CMC outperforms all other approximate methodologies, which leaves the practitioner with a clear decision: if a cheap but approximate methodology is needed use CMC, but if accuracy is the goal then D&C-Fusion should be used. 0.0 0.1 0.2 0.3 0.4 0.5 0.6 Integrated Absolute Distance 0.0 0.1 0.2 0.3 0.4 0.5 0.6 D&C GBF (regular mesh) D&C GBF (adaptive mesh) GBF (adaptive mesh) CMC KDEMC WRS (a) Integrated Absolute Distance. log(Time elapsed in seconds, 2) 4 2 0 2 4 6 8 10 12 14 16 18 D&C GBF (regular mesh) D&C GBF (adaptive mesh) GBF (adaptive mesh) CMC KDEMC WRS (b) Computational cost. Figure 5: Comparison of competing methodologies to D&C-Fusion applied to a logistic regression model using a simulated small data set (see Section 5.4.1). 5.4.2 NYC Flights 2013 Data Finally, we study Fusion approaches for a logistic regression model (46) applied to the nycflights13 dataset (obtained from the nycflights13 R package available on CRAN (Wickham, 2021)). In this study we predict on-time arrival of airplanes, by creating binary observations for arrival-delay (taking the value 1 if the flight arrived 1 minute or more late, and 0 otherwise). We model this using p = 20 predictor variables (so d = 21). After removing any entries with NA values, in total the dataset was of size m = 327346. This dataset was split randomly across C {4, 8, 16, 32, 64, 128} cores, and we used Stan to Divide-and-Conquer Fusion find sample approximations of each sub-posterior (using Gaussian priors with mean 0 and variance C for each parameter). D&C-Fusion was implemented with N = 30000, ζ = 0.2 and ζ = 0.05. The results are shown in Figure 6. As before, D&C-Fusion provides the best sample approximation and is robust to increasing C, but comes at the expense of increased computational cost. Although approximate methodologies have been specifically developed to tackle Bayesian big-data problems, here we see that they struggle to recover f even in this idealised scenario. They additionally (and critically) lack robustness when scaled with C. 0.0 0.1 0.2 0.3 0.4 Integrated Absolute Distance 2 3 4 5 6 7 0.0 0.1 0.2 0.3 0.4 D&C GBF (regular mesh) D&C GBF (adaptive mesh) CMC KDEMC WRS (a) Integrated Absolute Distance. log(Time elapsed in seconds, 2) 2 3 4 5 6 7 2 0 2 4 6 8 10 12 14 16 18 20 22 D&C GBF (regular mesh) D&C GBF (adaptive mesh) CMC KDEMC WRS (b) Computational cost. Figure 6: Comparison of competing methodologies to D&C-Fusion applied to a logistic regression model using the nycflights13 dataset (see Section 5.4.2). To further compare the methodologies, we consider fixing C = 64 and varying the computational budget for each method by varying the sample size N in order to study the effect of increased computation on IAD. We again compute the IAD against the same benchmark for the target f that was used above (based upon N = 30000 samples using Stan). Our results are shown in Figure 7. The IAD of the approximate methodologies considered in Figure 7 had large variance, and so we run each of these methods 10 times and took an average of the IAD. To show the variability we also plot the minimum and maximum IAD achieved in the 10 runs. The longest run for each approximate methodology was one hour, or when it had become apparent that further computation was not improving IAD. As such, the CMC and KDEMC approaches were considered for a range of sample sizes from N = 500 to N = 200000, but KDEMC was only considered for N = 500 to N = 50000. For CMC and WRS, the average and variance of the IAD decreases with more computation, but both methods quickly reach a point where IAD no longer decreases. In Figure 7 we additionally plot a pink dashed line which is the minimum mean value IAD achieved for CMC, as this seems to the point which the IAD of CMC converges to. For KDEMC increased computation does not improve the average IAD or its variability. As such CMC is clearly the best of the approximate Chan, Pollock, Johansen and Roberts 6 5 4 3 2 1 0 1 log(Time elapsed in seconds, 2) log(Integrated Absolute Distance, 2) 2 0 2 4 6 8 10 12 14 16 18 6 5 4 3 2 1 0 1 D&C GBF (regular mesh) D&C GBF (adaptive mesh) CMC KDEMC WRS Figure 7: Integrated absolute distance against computational budget for competing methodologies to D&C-Fusion applied to a logistic regression model using the nycflight and fixing C = 64 (see Section 5.4.2). methods, with it achieving the lowest computational cost and lowest IAD. For D&C-Fusion, we considered N = 500 to N = 30000. We did not perform replicate runs for D&C-Fusion due to the comparative lack of variability in results for this methodology. Of course, being an exact methodology Monte Carlo error can be further decreased by simply increasing N, but does achieve better IAD than CMC for its increased computational cost. As there is a reasonably large number of data points on each core (m/64 5000) the sub-posteriors are approximately Gaussian, and hence CMC performs unsurprisingly well. Taking into account accuracy and computational budget, then CMC performs the best out of all approximate methodologies here. We are however left with the same conclusion: if the practitioner values accuracy, or they have a poor understanding of the biases induced by an approximate approach, then our D&C-Fusion methodology should be used. 6. Conclusion The Fusion approach to unifying sub-posteriors into a coherent sample approximation of the posterior (as in (1)), offers fundamental advantages over approximation based approaches. In particular, Fusion avoids imposing any distributional approximation on the sub-posteriors, and so is more robust to a wider range of models, and circumvents needing to understand the impact of imposed approximations on the unified posterior. To date, Fusion approaches have had impractical computational cost in realistic settings, lacking robustness when considering: the number of sub-posteriors being unified; when unifying highly correlated sub-posteriors; the dimensionality of the sub-posteriors; and when considering conflicting sub-posteriors. In this paper, we have substantially addressed the practical issues of Fusion approaches by means of several theoretical and methodological extensions. Divide-and-Conquer Fusion In Section 2 we introduced Generalised Bayesian Fusion (GBF), which is a sequential Monte Carlo algorithm that incorporates available global information for each sub-posterior in order to construct informative proposals. As shown in Section 5.1, GBF addresses the lack of robustness when the sub-posteriors have strong correlation structure. By embedding GBF within the Divide-and-Conquer Sequential Monte Carlo (D&C-SMC) framework (Lindsten et al., 2017; Kuntz et al., 2023) in Section 3, we introduced Divide-and-Conquer Fusion (D&C-Fusion), together with a number of tree hierarchies, which allow the sub-posteriors to be combined in stages to recover f. By using the provided guidance for selecting the hyperparameters required for the GBF approach (and developed in Section 4), we saw in Section I.3 that our D&C-Fusion approach was the most scalable Fusion approach to date with regards to dimension. In Section 5, we applied our D&C-Fusion methodology to a variety of models with realistic data sets and compared its performance with competing approximate methodologies. In all of these settings, our implementation of D&C-Fusion offered the best performance in terms of Integrated Absolute Distance to an appropriate benchmark, at a modest computational cost. Furthermore, the examples in Section 5 showed that D&C-Fusion is a robust approach to unifying large numbers of sub-posteriors. There are a number of interesting avenues for extending the work of this paper. Perhaps most interesting is to adapt the D&C-Fusion approach to constraints in practical settings. As discussed in the introduction, one particularly promising direction is when considering (1) under privacy constraints of the individual sources (Yıldırım and Ermi s, 2019). In this setting, we may have a number of parties that wish to combine their distributional analysis on a common parameter space and model but cannot reveal their distribution due to confidentiality. This of course requires careful modifications to our approach and is an active area of research of the authors, and motivates variant tree hierarchies in D&C-Fusion. Another application is when considering a truly distributed big data setting where we have much larger datasets than ones considered in Section 5. In such settings, we may consider a large number of sub-posteriors C since the computational benefit of parallelisation for a divide-and-conquer method is typically proportional to the number of available processors (Nemeth and Sherlock, 2018). Although our divide-and-conquer approach is scalable with C, communication between different cores is expensive in a parallel setting (Scott et al., 2016). We discussed several practical implementation considerations in Appendix H.1, which aim to limit the amount of communication between cores for Algorithm 1, but a considered implementation of these techniques have yet to be explored. To make our Fusion methodology more applicable to large data settings, it would be particularly interesting to investigate embedding a sub-sampling approach within the Fusion algorithms (akin to the approaches of Pollock et al. (2020); Bouchard-Cˆot e et al. (2018); Baker et al. (2019); Bierkens et al. (2019)). First steps in integrating sub-sampling into our D&C-Fusion are considered in Appendix H.2. We also note that there is a growing literature on implementing SMC approaches in parallel and distributed settings (see for instance Doucet and Lee (2018, Section 7.5.3)) which may also be interesting to integrate within Fusion. From a theoretical perspective, current Fusion methodologies only consider sub-posteriors on a common parameter space. One direction of interest is extending Fusion methodology to Chan, Pollock, Johansen and Roberts combine sub-posteriors with varying dimension. The Markov Melding framework of Goudie et al. (2019) where separate sub-models (potentially of differing dimension) are fitted to different data sources and then joined, is promising. In this setting, the tree hierarchies could be defined by the model itself. To mitigate computational robustness of Fusion with increasing dimension in this setting, it may be possible to further utilise the methodology in Lindsten et al. (2017). Acknowledgments We would like to thank Louis Aslett, Hector Mc Kimm, Krzysztof Latuszy nski, Nicolas Chopin and Hongsheng Dai for helpful discussions on aspects of the paper. This work was supported by the Engineering and Physical Sciences Research Council under grant numbers EP/K034154/1, EP/K014463/1, EP/N510129/1, EP/R034710/1, EP/R018561/1 and EP/T004134/1 and by The Alan Turing Institute Doctoral Studentship and two Alan Turing Institute programmes; the Lloyd s Register Foundation programme on Data-centric engineering and the UK Government s Defence and security programme. The data used in this paper are openly available from the following links: the code to reproduce the simulated data used for the simulation studies in Sections 5.1 and 5.4.1 can be found at https://github.com/rchan26/DCFusion; the Combined Cycle Power Plant dataset (Section 5.2) can be accessed via the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant; the Bike Sharing dataset (Section 5.3) can be accessed via the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset; the nycflights13 dataset (Section 5.4.2) is available through the nycflights13 R package on CRAN available at https://github.com/tidyverse/nycflights13. Divide-and-Conquer Fusion Appendix A. Connections with Monte Carlo Fusion and Bayesian Fusion In this appendix, we more explicitly draw connections with the earlier Monte Carlo Fusion (MCF) approach of Dai et al. (2019), and Bayesian Fusion (BF) approach of Dai et al. (2023). In particular, we outline how our Generalised Bayesian Fusion (GBF) approach, which we develop in Section 2, improves upon these approaches. We do so by considering several toy examples to illustrate the benefits of the algorithmic developments we have presented in this paper. Firstly, the theory and methodology developed in Section 2 admits the Monte Carlo Fusion Dai et al. (2019) and Bayesian Fusion (Dai et al., 2023) approaches as a special case and is established in the following corollaries: Corollary 20 Setting P := {0, T}, Λc = Id for c C := {1, . . . , C}, where Id is the identity matrix of dimension d and accepting a proposal y(C) as a sample from (1) with probability (ρ0 ρ(a) 1 )( x(C), y(C)) exp{P c C Φc T}, we recover the Monte Carlo Fusion approach of Dai et al. (2019, Algorithm 1). Corollary 21 Setting Λc = Id for c C := {1, . . . , C}, where Id is the identity matrix of dimension d, and applying the approach outlined in Algorithm 1 recovers the Bayesian Fusion approach of Dai et al. (2023, Algorithm 1). We note however that the MCF formulation to arrive at this algorithm is different and is based on the following proposition: Proposition 22 Suppose that pc is the transition density of a Markov chain on Rd with a stationary probability density proportional to f2 c . Then the (|C|+1)d-dimensional probability density proportional to the integrable function g MCF C x(C), y(C) := Y f2 c x(c) pc y(C) x(c) 1 fc(y(C)) admits marginal density f(C) Q c C fc for y(C) Rd. Proof By integrating out x(C), we have Z Rd g MCF C x(C), y(C) dx(c1) dx(c|C|) Rd f2 c x(1) pc y(C) x(c) 1 fc y(C) dx(c) # " f2 c y(C) Chan, Pollock, Johansen and Roberts c C fc y(C) = f(C) y(C) . (48) Hence, y(C) has marginal density f(C). Dai et al. (2019) exploited Proposition 22 by noting that if the index set C := {1, . . . , C}, then we recover the target fusion density f (as given in (1)). Since g MCF C will not typically be accessible directly, Dai et al. (2019) proposed sampling from g MCF C by constructing a suitable (|C|+1)d-dimensional proposal density (say, h C) for use within a rejection sampling algorithm (Dai et al., 2019, Algorithm 1), and then simply retaining the y(C) marginal of any accepted draw as a realisation of f(C). Dai et al. (2019) showed that if pc in Proposition 22 was chosen to be the transition density of a constant volatility Langevin diffusion at time T with invariant measure f2 c for each c C respectively, then for a (easily accessible) proposal h C constructed by sampling a single draw from each sub-posterior (x(c) fc for c C), and then a single Gaussian random variable parameterised by the sub-posterior realisations (corresponding to the y(C)-marginal), the acceptance probability was readily computable. Although the formulation of the approach was different, this corresponds algorithmically to a rejection sampling variant of GBF and setting Λc = Id for all c C with P := {0, T}. The advantage of the BF and GBF approaches is that our formulation allows for a general temporal mesh P. We have seen in Section 4 and Section 5.1 that the choice of T and P can drastically alter the performance of the algorithm and so a clear advantage of BF and GBF over MCF is having a greater flexibility in hyperparameter selection. By being restricted to choosing P := {0, T}, in many cases, it will not be possible to choose a T which is large enough for initialisation of the algorithm (i.e. to ensure CESS0 is large) and to choose T small enough for CESS1 to be sufficently large. This trade-offin choice of T in MCF is ultimately why it can fail in many practical settings. As discussed in the introduction, the existing MCF and BF approaches lack robustness in various key practical settings. For the remainder of this section, we re-visit these settings, and with the aid of illustrative examples, show that our new approach addresses these key bottlenecks. Since many of the key limitations are present with both approaches, we focus on comparing against MCF approach by setting P := {0, T} in Algorithm 1. We call this variant of GBF with P := {0, T} Generalised Monte Carlo Fusion (GMCF). In particular, in Section A.1 we consider the effect of increasing sub-posterior correlation, in Section A.2 we consider the robustness with increasing numbers of sub-posteriors, and in Section A.3 we consider how to address conflicting sub-posteriors. Throughout this section we use the GPE-2 estimator of ρj as given in Definition 6, and use the Trapezoidal rule to estimate the mean γc in (24) and fix βc = 10 for c C. To compare the methodology, we compute both the computational run-times of each methodology and a metric which we term the Integrated Absolute Distance (IAD) (45). Divide-and-Conquer Fusion A.1 Effect of correlation One of our key contributions in this paper was the generalisation of BF which incorporated covariance information of the sub-posteriors within our algorithm. In this example, we focus on the illustrative case in which we wish to recover a bi-variate Gaussian target distribution, f f1f2, where fc N2(0, Σ) with, Σ = 1.0 ρcorr ρcorr 1.0 As we are only considering combining two sub-posteriors in this section, we in effect consider only the GMCF approach. To study the impact of sub-posterior correlation on the robustness of MCF and GMCF (Algorithm 1 with P := {0, T}) we can simply consider varying the single parameter ρcorr, and compute the Effective Sample Size (ESS) per second averaged across 50 runs in order to compare the efficiency of each methodology. For simplicity, we assume we are able to sample directly from each sub-posterior, and for both methodologies we set T = 1. For the purposes of implementing GMCF, we simply set Λc = ˆΣc, where ˆΣc is the estimated covariance matrix from the sub-posterior samples for c = 1, 2 (and so in effect we have incorporated global information into our proposals), and use a particle set size of N = 10000. The results are presented in Figure 8, which clearly show that GMCF is robust to increasing sub-posterior correlation, and offers a significant computational advantage over MCF (which in this case exhibits a strong degradation in efficiency and performance). 0 200 400 600 800 1000 Correlation ESS / second 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0 200 400 600 800 1000 Figure 8: ESS per second (averaged over 50 runs) when contrasting Monte Carlo Fusion and Generalised Monte Carlo Fusion, along with increasing sub-posterior correlation, as per the example in Section A.1. A.2 Effect of hierarchy In our new formulation outlined in Section 2, we consider the more abstract setting of sampling from f(C) Q c C fc, where C is the index set of sub-posteriors which we want Chan, Pollock, Johansen and Roberts to unify. This abstraction of combining facilitates the recursive use of the algorithms to develop our Divide-and-Conquer Fusion (D&C-Fusion) approach in Section 3 - benefits of which are highlighted clearly in Section 5.4.1 (Here D&C-Fusion outperforms GBF as we increase C). In this example, we consider the illustrative case of attempting to recover a univariate standard Gaussian target distribution. In particular, we have f QC c=1 fc, where fc N(0, C) for c = 1, . . . , C. By simply varying C, we can study the robustness with increasing numbers of sub-posteriors of MCF (in effect the fork-and-join approach illustrated in Figure 1), and both our suggested versions of Divide-and-Conquer Generalised Monte Carlo Fusion (the balanced-binary tree approach illustrated in Figure 2a, and the progressive tree approach illustrated in Figure 2b). Note that in our chosen idealised setting, there is no advantage conferred with our embedded Generalised Monte Carlo Fusion methodology of Section 2, and so we are simply contrasting hierarchies. In all cases we use a particle set of size N = 10000 with resampling if ESS < N/2, set T = 1, use an appropriately scaled identity as the preconditioning (scalar) matrix, and average across 50 runs. The results are presented in Figure 9, which clearly show that, in contrast to the fork-and-join tree approach, both the balanced-binary tree and progressive tree approaches are robust in recovering the correct posterior distribution in the case of increasing C at the cost of modestly increased computational cost. 1 2 3 4 5 6 7 8 log(C, 2) 0.0 0.10 0.20 0.30 0.40 0.50 0.60 Integrated Absolute Distance fork and join balanced progressive (a) Integrated Absolute Distance. 1 2 3 4 5 6 7 8 log(C, 2) 1 2 3 4 5 6 7 8 9 10 11 12 log(Time elapsed in seconds, 2) fork and join balanced progressive (b) Computational cost. Figure 9: Illustrative comparison of the effect of using different hierarchies in Section A.2 (averaged over 50 runs). A.3 Dealing with conflicting sub-posteriors Directly unifying C conflicting sub-posteriors (sub-posteriors which have little common support and have high total-variation distance) using a fork-and-join approach as in MCF and Figure 1 is impractical. This can be understood with reference to (4) and (16), which indicates that importance weights will degrade rapidly in this setting. Divide-and-Conquer Fusion An approach to deal with conflicting sub-posteriors is to temper the sub-posteriors (to an inverse temperature β (0, 1] such that there is sufficient sub-posterior overlap), and then propose a suitable tree for which the recursive Divide-and-Conquer Generalised Monte Carlo Fusion approach we introduced in Section 3 could then be applied to recover (1). In particular, c=1 fβ c (x) One such generic tree is provided in Figure 10, in which the tempered sub-posteriors are first unified into 1/β N tempered posteriors, which are then again unified into f. QC c=1 fβ c QC c=1 fβ c Figure 10: Illustrative tree approach for the Fusion problem in the case of conflicting subposteriors as in Section A.3. 1/β copies of the C tempered (and over-lapping) sub-posteriors represent the leaves of the tree, which are unified into 1/β tempered versions of f (using a suitable tree and D&C-GMCF as in Section 3), and then unified again (using another tree, and D&C-GMCF) to recover f. To illustrate the advantage of our D&C-GMCF and tempering approach in the case of conflicting sub-posteriors, we consider the scenario of unifying two Gaussian sub-posteriors with the same variance (1), but with different mean ( µ). In particular, we have f f1f2 where f1 N( µ, 1) and f2 N(µ, 1). By simply increasing µ we can emulate increasingly conflicting sub-posteriors and study how MCF (which is equivalent to the forkand-join approach of Figure 1), behaves in terms of the IAD metric and computational time. We contrast this with our tempering approach, considering a range of temperatures 1/β {2, 4, 8, 16}, and then following the guidance of Figure 10. In particular, we use our D&C-GMCF approach to unify the tempered sub-posteriors with the balanced-binary approach of Figure 2a for both the first and second stage in Figure 10. In all cases, we use a particle set size of N = 10000 with resampling if ESS < N/2, set T = 1, and average across 50 runs. The results are presented in Figure 11, and show clearly that our D&C-GMCF approach is significantly more robust to conflicting sub-posteriors than the MCF approach where no tempering is applied. A natural trade-offarises when applying the tempering approach suggested, in that decreasing β results in tempered sub-posteriors which are less conflicting and are easier to combine, but there is an increased computational cost in recovering f as an increased number of levels are added to the resulting tree. Chan, Pollock, Johansen and Roberts 1 2 3 4 5 6 7 8 log(Difference in sub posterior means, 2) 0 0.2 0.4 0.6 0.8 1.0 Integrated Absolute Distance direct combination β = 1/2 β = 1/4 β = 1/8 β = 1/16 (a) Integrated Absolute Distance. 1 2 3 4 5 6 7 8 log(Difference in sub posterior means, 2) 0 2 4 6 8 10 12 log(Time elapsed in seconds, 2) direct combination β = 1/2 β = 1/4 β = 1/8 β = 1/16 (b) Computational cost. Figure 11: Illustrative comparison of using no tempering (solid line), and tempering at 4 different levels together with D&C-GMCF, to combat conflicting sub-posteriors as per Section A.3 (averaged over 50 runs). Appendix B. Proof of Theorem 1 Proof Following the approach of Dai et al. (2019, Appendix A), we begin by proving that the law of |C| independent Brownian motions initialised at x(c) 0 fc for c C and conditioned to coalesce at time T satisfies (2). Here, we use Doob h-transforms (Rogers and Williams, 2000, Chapter IV, Section 6.39) and define the following space-time harmonic function h t, x(C) t = Z Y 2π(T t)|Λc| exp (y x(c) t ) Λ 1 c (y x(c) t ) 2(T t) which represents the integrated density of coalescence at time T given the current state x(C) t . Then the |C| conditioned processes satisfy a SDE of the form, d X(C) t = Λ 1 2 d W (C) t + Λ log h(t, X(C) t ) dt, (51) where log h(t, x(C) t ) =: v(c1) t , . . . , v (c|C|) t is a collection of |C|d-dimensional vectors and 1 2c1 0d d . . . 0d d 0d d Λ 1 2c2 . . . 0d d ... ... ... ... 0d d 0d d . . . Λ Λc1 0d d . . . 0d d 0d d Λc2 . . . 0d d ... ... ... ... 0d d 0d d . . . Λc|C| 1 2c is the (positive semi-definite) square root of Λc where Λ 1 2c = Λc for c C, and 0d d denotes the d d matrix with all elements equal to 0. Divide-and-Conquer Fusion Considering the cth term and letting Λ 1 C = P c C Λ 1 c , then R Λ 1 c (y x(c) t ) T t 2π(T t)|Λc| exp y x(c) t Λ 1 c y x(c) t 2π(T t)|Λc| exp y x(c) t Λ 1 c y x(c) t R Λ 1 c y T t exp n (y xt) Λ 1 C (y xt) 2(T t) o dy R exp n (y xt) Λ 1 C (y xt) 2(T t) o dy Λ 1 c x(c) t T t = Λ 1 c xt x(c) t Consequently, we have log h(t, x(C) t ) = Λ 1 c1 xt x(c1) t T t , . . . , Λ 1 c|C| xt x (c|C|) t and (2) holds. Next, we show that under F this common value has density f. Since P is the measure for |C| coalesced Brownian motions (shown above), from (3), we can write F as d F(X) d P(X) ρ0 x(C) 0 Y 0 φc X(c) t dt c C fc x(c) 0 # (y(C) x(C) 0 ) Λ 1 C (y(C) x(C) 0 ) 2T ( x(C) 0 x(c) 0 ) Λ 1 c ( x(C) 0 x(c) 0 ) 2T 0 φc X(c) t dt c C fc x(c) 0 # (y(C) x(c) 0 ) Λ 1 c (y(C) x(c) 0 ) 2T 0 φc X(c) t dt , (53) where WΛ denotes the law of |C| independent Brownian bridges {X(c) t , t [0, T]}c C starting at X(c) 0 := x(c) 0 and ending at X(c) T := y(C) (with covariance Λc). Let g C( x(C) 0 , y(C)) denote the marginal distribution of F at x(C) 0 and x(C) T =: y(C), then we have g C( x(C) 0 , y(C)) Y h fc x(c) 0 i exp (y(C) x(c) 0 ) Λ 1 c (y(C) x(c) 0 ) 2T Chan, Pollock, Johansen and Roberts 0 φc X(c) t dt , f2 c x(c) 0 pc y(C) x(c) 0 1 fc y(C) pc y(C) x(c) 0 fc y(C) fc x(c) 0 exp (y(C) x(c) 0 ) Λ 1 c (y(C) x(c) 0 ) 2T 0 φc X(c) t dt . (55) Using the Dacunha-Castelle representation (Dacunha-Castelle and Florens-Zmirou, 1986, Lemma 1), this is the transition density density of a Langevin diffusion with covariance matrix Λc over time t [0, T]. Critically, this diffusion process has invariant density proportional to f2 c , so Z p y(C) x(c) 0 f2 c x(c) 0 dx(c) 0 = f2 c y(C) . By integrating out x(C) 0 in (54), we can see that g C( x(C) 0 , y(C)) admits f(C) as a marginal. Appendix C. Proof of Proposition 2 Proof Theorem a: We begin by deriving the joint density of X(C) t conditional on the state at time s, x(C) s . Firstly, consider the d(|C|+1) dimensional joint density of X(C) t and end-point y(C) conditional on x(C) s , which we denote as p1, then 2 log p1 = D1 + D2, where D1 is the log-density of y(C) conditional on x(C) s and given by (y(C) x(c) s ) Λ 1 c (y(C) x(c) s ) T s + k1 where k1 is a constant; D2 is the log-density of X(C) t conditional on x(C) s and y(C) (which is simply the log-density of |C| Brownian bridges with respective covariance matrices Λc for c C), given by T s (t s)(T t) T sy(C) T t T sy(C) T t Divide-and-Conquer Fusion where k2 is a constant. We therefore have 2 log p1 = (y(C) x(C) s ) Λ 1 C (y(C) x(C) s ) T s t s (T t)(T s)y(C) Λ 1 c y(C) 2 T ty(C) Λ 1 c x(c) t + 2 T sy(C) Λ 1 c x(c) s T s (t s)(T t) h y(C) Λ 1 C y(C) 2y(C) Λ 1 C x(C) s i + t s (T t)(T s)y(C) Λ 1 C y(C) 2 T ty(C) Λ 1 C x(C) t + 2 T sy(C) Λ 1 C x(C) s T s (t s)(T t) = 1 T s + t s (T t)(T s) y(C) Λ 1 C y(C) 2 T ty(C) Λ 1 C x(C) t T s (t s)(T t) h y(C) Λ 1 C y(C) 2y(C) Λ 1 C x(C) t i T s (t s)(T t) h (y(C) x(C) t ) Λ 1 C (y(C) x(C) t ) i 1 T t x(C) t Λ 1 C x(C) t T s (t s)(T t) where k3 and k4 are constants, and ΛC := P c C Λ 1 c 1. Next, we integrate out y(C) to obtain the d|C|-dimensional density of X(C) t conditional on x(C) s , which we denote p2: = 1 T t x(C) t Λ 1 C x(C) t + X T s (t s)(T t) = 1 T t x(C) t Λ 1 C x(C) t + X T s (t s)(T t) x(c) t Λ 1 c x(c) t 2 T t x(c) t Λ 1 c x(c) s Chan, Pollock, Johansen and Roberts where k5 and k6 are constants. Noting that x(C) t Λ 1 C x(C) t = c C Λ 1 c x(c) t c C Λ 1 c x(c) t c C Λ 1 c x(c) t c C Λ 1 c x(c) t i,j C x(i) t Λ 1 i ΛCΛ 1 j x(j) t . So we have, 2 log p2 = 1 T t i,j C x(i) t Λ 1 i ΛCΛ 1 j x(j) t + T s (t s)(T t) c C x(c) t Λ 1 c x(c) t c C x(c) t Λ 1 c x(c) s + k6 = T s (t s)(T t) c C x(c) t Λ 1 c x(c) t 1 T t c C x(c) t Λ 1 c ΛCΛ 1 c x(c) t x(i) t Λ 1 i ΛCΛ 1 j x(j) t 2 t s c C x(c) t Λ 1 c x(c) s + k6 = x t V 1 s,t x(C) t 2 t s x t L 1 x(C) s + k6 Σ 1 11 Σ 1 12 . . . Σ 1 1|C| Σ 1 21 Σ 1 22 . . . Σ 1 2|C| ... ... ... ... Σ 1 |C|1 Σ 1 |C|2 . . . Σ 1 |C||C| R|C|d |C|d, (56) Σ 1 ii = T s (t s)(T t)Λ 1 ci 1 T t Λ 1 ci ΛCΛ 1 ci Rd d, Σ 1 ij = 1 T t Λ 1 ci ΛCΛ 1 cj Rd d, for i, j = 1, . . . , |C|, and Λ 1 c1 0d d . . . 0d d 0d d Λ 1 c2 . . . 0d d ... ... ... ... 0d d 0d d . . . Λ 1 c|C| R|C|d |C|d, where 0d d is the d d with all elements zero. We finally complete the square to get 2 log p2 = x(C) t V 1 s,t x(C) t 2 x(C) t V 1 s,t M (C) s,t + k6, Divide-and-Conquer Fusion M (C) s,t = Vs,t L 1 x(C) s t s . Inverting V 1 s,t in (56), we obtain (10) and subsequently we can get the expression for M (c) s,t in (9) to prove the statement in part a of Theorem 1. For part b, for c C, the law of {X(c) t , t (0, T)} conditional on endpoints x(c) 0 and y(C) is that of a Brownian bridge. This statement in the theorem holds from the standard properties of Brownian bridges (with covariance matrix Λc). In particular, considering the distribution of X(c) q at an intermediate point q (s, t) given the positions X(c) s = x(c) s and X(c) t = x(c) t at times s and t respectively, then we have P Xq = w X(c) s = x(c) s , X(c) t = x(c) t P X(c) t = x(c) t X(c) s = x(c) s , Xq = w P Xq = w X(c) s = x(c) s P X(c) t = x(c) t Xq = w P Xq = w X(c) s = x(c) s (x(c) t w) Λ 1 c (x(c) t w) 2(t q) (w x(c) s ) Λ 1 c (w x(c) s ) 2(q s) and hence we arrive at the result in the statement. Appendix D. Proof of Proposition 3 Proof First note that we can rewrite (6) as follows, 1 2c log fc (x) 2 + Tr Λc 2 log fc (x) ! Let Rc := Rc X(c) [0,T] denote a compact subset of Rd for which X(c) t is constrained in time [0, T] for c C, then to bound the first term in (57), we first use the triangle inequality by noting 1 2c log fc(x) = max x Rc 1 2c log fc(ˆx(c)) + Λ 1 2c log fc(x) log fc(ˆx(c)) 1 2c log fc(ˆx(c)) + max x Rc 1 2c log fc(x) log fc(ˆx(c)) , where ˆx(c) is a user-specified point in Rd. Focusing now on bounding the second term of (58), then we express this as a line integral between x and ˆx(c) so 1 2c log fc(x) log fc(ˆx(c)) = max x Rc 2 c Z x ˆx(c) 0 Λc 2 log f(x + un)n du Chan, Pollock, Johansen and Roberts where u = x + un, where n is a unit-vector. We have 2 c Z x ˆx(c) 0 Λc log f(x + un)n du 2 c x ˆx(c) Λc 2 log f(x + un)n 2 c x ˆx(c) P Λc, (59) where P Λc is defined in (19). Putting together (58) and (59), we have 1 2c log fc(x) Λ 1 2c log fc ˆx(c) + max x Rc 2 c x ˆx(c) P Λc. Since for a matrix A Rd, Tr(A) d γ(A), we can bound the second term in (57) as follows: max x Rc Tr Λc 2 log fc(x) d P Λc, and hence we can bound φc as follows: max x Rc |φc (x)| 1 1 2c log fc ˆx(c) + max x Rc 2 c x ˆx(c) P Λc 2 + d P Λc # Noting that in (57) that the first term is squared, then the lower and upper bounds of φc (x) for x Rc are given by (17) and (18) respectively. Appendix E. Proof of Theorem 4 Proof Following in the style of Beskos et al. (2006, 2008); Fearnhead et al. (2008) and Dai et al. (2023, Appendix B), for j = 1, . . . , n, we have E RE W| RE KE U [aj ρj] = E RE W| RE KE U κc j e (U(c) j Φc) j κc! p (κc|Rc) U(c) j φc X(c) ξc,kc = E RE W| RE K κc j e (U(c) j Φc) j κc! p (κc|Rc) U(c) j φc X(c) t = E RE W| R kc j e (U(c) j Φc) j kc! p (kc|Rc) U(c) j φc X(c) t = E RE W| R c C e (U(c) j Φc) j kc j kc! p (kc|Rc) U(c) j φc X(c) t Divide-and-Conquer Fusion = E RE W| R c C e (U(c) j Φc) j exp U(c) j φc X(c) t dt φc X(c) t Φc dt and hence aj ρj is an unbiased estimator for ρj. Appendix F. Unbiased Estimation of ρj Computing ρ(a) 1 and ρ(b) 1 by means of layer information in the case where Λc = Id is detailed explicitly in Dai et al. (2023, Algorithm 4, Appendix B). In the case where Λc = Id, we simulate layers by appealing to a suitable transformation. In particular, we transform the start and end points of the Brownian bridge with transformation matrix Λ 1 2 c , letting z(c) j 1 := Λ 1 2 c x(c) j 1 and z(c) j := Λ 1 2 c x(c) j . The resulting Brownian bridge sample path, z(c) t := Λ 1 2 c X(c) t , has identity covariance structure and thus we can use existing methods for simulating layered Brownian bridge sample paths z(c) t with law WId from z(c) j 1 to z(c) j . By finding a bounding hyper cube for the reverse transformed bounds, we are able to find appropriate layer information for the case Λc = Id. We are now able with minimal modification to apply the approach of Dai et al. (2023), as given in Algorithm 5. Appendix G. Proof of Theorem 12, Corollary 13, Theorem 15, Proposition 19 Proof (Theorem 12) Considering the initial conditional effective sample size, CESS0, we have N 1CESS0 := N 1 PN i=1 ρ0,i 2 PN i=1 ρ2 0,i = E exp P c C ( x(C) 0 x(c) 0 ) Λ 1 c ( x(C) 0 x(c) 0 ) 2T E exp P c C ( x(C) 0 x(c) 0 ) Λ 1 c ( x(C) 0 x(c) 0 ) T = E e |C|σ2 where σ2 := 1 |C| P c C( x(C) 0 x(c) 0 ) Λ 1 c ( x(C) 0 x(c) 0 ) where x(c) 0 Nd(ac, b|C| m Λc). To get an expression for N 1CESS0, we begin by obtaining the moment generating function (mgf) Chan, Pollock, Johansen and Roberts Algorithm 5 Simulating ρj. (a) z(c) j 1, z(c) j : Transform the path, setting z(c) j 1 := Λ 1 2 c x(c) j 1, and z(c) j := Λ 1 2 c x(c) j . (b) Rc: Set Rc := Λ 1 2c R(z) c , where R(z) c R(z) c as per Pollock et al. (2016, Algorithm 14). (c) L(c) X , U(c) X : Compute lower and upper bounds, L(c) X and U(c) X , of φc(x) for x Rc (as per (17) and (18), or otherwise). (d) pc: Choose p( |Rc) using either GPE-1 (Condition 5) or GPE-2 (Condition 6). (e) κc, ξ: Simulate κc p( |Rc), and simulate ξc,1, . . . , ξc,κc U[tj 1, tj]. (f) z(c): Simulate z(c) ξc,1, . . . , z(c) ξc,κc WId|R(z) c as per Pollock et al. (2016, Algorithm 15). (g) X(c): Reverse transform the path, setting X(c) ξc,kc = Λ 1 2c z(c) ξc,kc for kc {1, . . . , κc}. κc j e U(c) j j κc! p (κc|Rc) U(c) j φc X(c) ξc,kc for σ2. First note c C (x(c) 0 a) Λ 1 c (x(c) 0 a) = σ2 + 1 c C ( x(C) 0 a) Λ 1 c ( x(C) 0 a). (61) Considering the term 1 |C| P c C(x(c) 0 a) Λ 1 c (x(c) 0 a) and letting Yc := Λ 1 2 c (x(c) 0 a), then Yc has mean Λ 1 2 c (ac a) and variance b|C| m Id. Hence q m b|C|Yc has mean q and variance Id, and so let c C (ac a) Λ 1 c (ac a) = m then m b|C| P c C Yc 2 χ2(|C|d, λ) distribution (i.e. m b|C| P c C(x(c) 0 a) Λ 1 c (x(c) 0 a) has a non-central χ2(|C|d, λ) distribution) with mgf M1(s) := exp n λs 1 2s o (1 2s) |C|d Secondly, consider 1 |C| P c C( x(C) 0 a) Λ 1 c ( x(C) 0 a) = 1 |C|( x(C) 0 a) Λ 1 C ( x(C) 0 a), where Λ 1 C := P c C Λ 1 c . Then since x(C) 0 Nd( a, b|C| m ΛC), then Z := q 2 ( x(C) 0 a) Divide-and-Conquer Fusion Nd(0, Id) and so Z 2 χ2 d (i.e. m b|C| P c C( x(C) 0 a) Λ 1 c ( x(C) 0 a) has χ2 d distribution) with mgf M2(s) := (1 2s) d From (61), we have c C (x(c) 0 a) Λ 1 c (x(c) 0 a) | {z } χ2(|C|d,λ) c C ( x(C) 0 a) Λ 1 c ( x(C) 0 a) | {z } χ2 d where λ = mσ2 a b . Therefore, using (62) and (63), the mgf for σ2 is given by Mσ2(s) = M1(sb m) = exp mσ2 as m 2sb 2 , where sb Given the mgf of σ2, then N 1CESS0 E e |C|σ2 T = Mσ2 |C| ( mσ2 a |C| 2T b m (|C| 1)d T b m (|C| 1)d Tm (|C| 1)d 1 + 2 |C|b Tm (|C| 1)d σ2 a T |C| + b ( σ2 a T |C| + 2b 1 + 2 |C|b Tm m T |C| + b m T |C| + 2b and so Theorem 12 immediately follows. Proof (Corollary 13) Under Condition 8, σ2 a = (|C| 1)λ m , so for the first term in (31), m T |C| + b m T |C| + 2b exp σ2 ab|C|2 Chan, Pollock, Johansen and Roberts exp b2|C|3λ where T b|C|3/2k1 m for some constant k1 > 0, and for the second term in (31), then |C|b Tm 2 (|C| 1)d with T b|C|3/2k1 m . Hence, under Condition 8 and choosing T b|C|3/2k1 m , combining the bounds from (65) and (66) gives (32). Under Condition 10, σ2 a = bγ, if we assume T b|C|3/2k1 m for some constant k1 > 0, and T |C| 1 2 k2 for some constant k2 > 0, then T and so we have m T |C| + b m T |C| + 2b Hence, under Condition 10 and choosing T such that T b|C|3/2k1 m and T |C| 1 2 k2, we can combine the bounds from (67) and (66) to obtain the bound in (33). Proof (Theorem 15) As N , we have N 1CESSj := N 1 PN i=1 ρj,i 2 PN i=1 ρ2 j,i N 1 PN i=1 aj ρj,i 2 N 1 PN i=1 (aj ρj,i)2 E h (aj ρj)2i, where aj := exp{P c C Φc j}. Since aj ρj is an unbiased estimate of ρj (see Theorem 4), then E [aj ρj] = Y φc X(c) t Φc )! Divide-and-Conquer Fusion tj 1 φc X(c) t )! where WΛ denotes the law of the collection of Brownian bridges {WΛc,j : c C} for each j. Note that under the optimal distribution for p(κc|Rc) (a Poisson distribution with intensity given in (35)), then E h (aj ρj)2i 1 (Fearnhead et al., 2008; Dai et al., 2023), so lim N N 1CESSj E [aj ρj]2 = tj 1 φc X(c) t )!#2 a2 j. If fc Nd(ac, b|C| m Λc), then φc(x) = 1 m b|C| 2 (x ac) Λ 1 c (x ac) md global lower bound Φc = 1 2 md b|C| (since the minimum of φc occurs at the mean, ac). Then by considering small intervals (tj 1, tj) and taking the limit of j := tj tj 1 0, then lim j 0 lim N N 1CESSj tj 1 φc X(c) t dt ) ξj, x(C) j 1 2 (x(c) j ac) Λ 1 c (x(c) j ac) ) ξj, x(C) j 1 2 (x(c) j ac) Λ 1 c (x(c) j ac) ) ξj, x(C) j 1 (by using a trapezoidal rule approximation of the integral and exploiting the use of small intervals) where lim j 0 and expectations are exchanged using the dominated convergence theorem (as the exponential term is bounded above by 1 and its expectation exists (Dai et al., 2023, Appendix C)). From (75) in Corollary 23, we note that x(c) j only depends x(c) j 1 through ξj and ζ(c) j for all c C, and we have x(c) j ξj, x(C) j 1 Nd E h x(c) j ξj, x(C) j 1 i , T tj and consequently, T tj c C (x(c) j ac) Λ 1 c (x(c) j ac) χ2(|C|d, λ j), with moment generating function Mj(s) := exp n λ js 1 2s o (1 2s) |C|d E h x(c) j ξj, x(C) j 1 i ac Λ 1 c E h x(c) j ξj, x(C) j 1 i ac Chan, Pollock, Johansen and Roberts 1 |C|σ2 tj, with σ2 tj := 1 E h x(c) j ξj, x(C) j 1 i ac Λ 1 c E h x(c) j ξj, x(C) j 1 i ac . Letting s = 1 2 m b|C| 2 T tj lim j 0 lim N N 1CESSj E E lim j 0 exp λ js 1 2s 2 (1 2s) |C|d lim j 0 exp 2 m2 b2C σ2 tj j (1 2s) |C|d. From (75), we have E h x(c) j ξj, x(C) j 1 i = " 2 j T tj 1 2 ξj + T tj T tj 1 x(c) j 1 + tj tj 1 T tj 1 xj 1, and so we have lim j 0 σ2 tj =: νj where νj is given in (37). Using Jensen s inequality, we can get lim j 0 lim N N 1CESSj lim j 0 2E [νj] m2 b2|C| j (1 2s) |C|d lim j 0 exp E [νj] m2 b2|C| j (1 2s) |C|d. (68) Consider the first term in (68), then taking the limit j 0 implies that s 0, and if j b2|C|k3 E[νj]m2 for some k3 > 0, then E [νj] m2 b2|C| j exp { k3} . (69) Similarly for the second term in (68), if j b2|C|k4 2 , we have (1 2s) |C|d exp {4s|C|d} exp { k4} . (70) Divide-and-Conquer Fusion Combining the bounds in (69) and (70), and taking the limit j 0 over sequences of tj tj 1 0, with (36), we arrive at the result given in the theorem. Proof (Proposition 19 Using Theorem 15, then for iteration j, we want to choose exp{ k3,j k4,j} = ζ (0, 1), and so k3,j = log(ζ ) k4,j. By substituting this into (36), we can choose the mesh size as ( b2|C|[ log(ζ ) k4,j] E[νj]m2 , b2|C|k4,j where k4,j < log(ζ ) (in order to ensure that k3,j > 0). Here, we want the largest interval which satisfies N 1CESSj ζ . This corresponds to choosing k4,j with b2|C|[ log(ζ ) k4,j] E[νj]m2 = b2|C|k4,j = b4|C|2[ log(ζ ) k4,j]2 E[νj]2m4 = b2|C|k4,j = [ log ζ k4,j]2 = E[νj]2m2 2b2|C|d k4,j = log ζ 2 + 2k4,j log ζ + k2 4,j = E[νj]2m2 2b2|C|d k4,j = k2 4,j + 2 log ζ E[νj]2m2 k4,j + log ζ 2 = 0. (72) Applying the quadratic formula to solve (72) gives 2b2|C|d 2 log(ζ ) r 2 log(ζ ) E[νj]2m2 2b2|C|d 2 4 log(ζ )2 Note that we have the constraints that 0 < k4,j < log(ζ ), and since from (72), we have k2 4,j + 2 log ζ E[νj]2m2 k4,j = log ζ 2, then we will always choose the smaller root and arrive at the statement of the theorem. Appendix H. Practical implementation considerations In many practical settings there will be additional constraints which require us to modify Algorithm 1 appropriately. Examples include settings where latency between cores is problematic, or in scenarios where functional evaluations of the sub-posterior densities fc are not available. In this section, consider several modifications to Algorithm 1 to make it more amenable to certain application areas. To clarify, the implementation of our methodology in examples presented in Section 5 do not exploit these modifications that we present below. Chan, Pollock, Johansen and Roberts H.1 Reducing communication between the cores For our GBF approach, we highlight two steps where communication between cores could be reduced. In particular, it is possible to limit the amount of communication necessary when initialising the particle set, and also when we propagate the particles in the iterative steps of the algorithm. In a distributed/parallel setting, it is desirable to reduce the number of communication between cores since there is a latency penalty for each communication leading to a more computationally expensive algorithm. In Algorithm 1 Step 1b, the particles are composed by pairing the sub-posterior draws index-wise to obtain { x(C) 0,i }M i=1 which requires a communication between the cores. To fully initialise the algorithm, we must assign importance weights to the particles which requires an additional two communications between the cores; namely a communication back to the individual cores to provide the weighted mean of the particles x0,i, and a communication between the cores to compute ρ0,i( x(C) 0, ) (since (4) can be decomposed into a product of |C| terms corresponding to the individual contributions from each sub-posterior. Following the approach of Dai et al. (2023, Section 3.7.1), let θ Rd be a weighted average of approximate modes (or means) of each sub-posterior. Noting that this can be computed in a single preprocessing step prior to initialisation, then we can modify the proposal mechanism for the initial draw to be from the density fc x(c) 0 exp (x(c) 0 θ) Λ 1 c (x(c) 0 θ) 2T fc x(c) 0 , (73) then by modifying the algorithm by replacing ρ0 with ( ( x(C) 0 θ) Λ 1 C ( x(C) 0 θ) 2T where Λ 1 C := (P c C Λ 1 c ), we can see that ϱ0 x(C) 0 Y c C fc x(c) 0 ρ0 x(C) 0 Y c C fc x(c) 0 . Since we subsequently re-normalise the importance weights, we do not need to compute any constant of proportionality for ϱ0. Adopting this approach means that we can sample from fc on each core independently and evaluate the modified importance weight without any further communication between the cores. This therefore reduces the number of communications required to initialise the particle set from three (in the original formulation) to two (since this approach does require one communication in order to compute θ). The modified initialisation is summarised in Algorithm 6. There is also scope to reduce the number of communications required to propagate the particle set in Algorithm 1 Step 2(b)i. To propagate the particles, there is a communication between the cores in order to compute M (C) j := M (C) tj 1,tj as per (9) since this requires the Divide-and-Conquer Fusion Algorithm 6 Particle set initialisation modification (to replace Algorithm 1 Step 1b). 1(b) For k in 1 to M, (i) x(C) 0,k: For c C, simulate x(c) 0,k fc (73). Set x(C) 0,k := (x(c1) 0,k , . . . , x (c|C|) 0,k ). (ii) Compute un-normalised weight w(C) 0,k := Q c C w(c) k ϱ0( x(C) 0,k) as per (74). current position of each of the |C| trajectories. Once we have computed this and propagated the samples, a further communication back to the cores would be necessary so that each core can compute their contribution to the ρj importance weight. Alternatively, we can utilise Corollary 23 so that each of the |C| processes can propagate their own individual particles to compose x(C) j . Corollary 23 Simulating x(C) j Nd M (C) j , Vj , the required transition from x(C) j 1 to x(C) j in Algorithm 1 Step 2(b)i, can be expressed as " 2 j T tj 1 2 ξj + T tj 2 η(c) j + M (c) j , (75) where ξj Nd(0, ΛC), η(c) j Nd(0, Λc) and M (c) j is the sub-vector of M (C) j corresponding to the cth component given by (9). Proof From Proposition a, we have x(C) j Nd M (C) j , Vj where M (C) j := M (C) tj 1,tj is given by (9) and Vj := Vtj 1,tj is given by (10). From (75), the mean and covariance matrix of x(C) j given x(C) j are also given by M (C) j and Vj as required. By using Corollary 23, we can see that the interaction between the |C| trajectories occurs through their weighted mean xj 1 at the previous iteration. This can be computed at the previous iteration, and we can communicate this along with the common Gaussian vector ξj at the same time. This therefore removes an unnecessary additional communication between the cores at every iteration, resulting in a much more efficient algorithm if latency is a concern. This approach is presented in Algorithm 7. Algorithm 7 Particle set propagation modification (to replace Algorithm 1 Step 2(b)i). 2(b)i. (A) For c C, simulate x(c) j,i |( xj 1,i, x(c) j 1,i) in (75). (B) Set x(C) j,i := (x(c1) j,i , . . . , x (c|C|) j,i ) and compute xj,i := (P c C Λ 1 c ) 1(P c C Λ 1 c x(c) j,i ). Chan, Pollock, Johansen and Roberts H.2 Alternative methods for updating the particle set weights In this paper, we have assumed that we have been able to compute functionals of each sub-posterior fc for c C, however there are many settings where it may be impractical or infeasible to do so. This may be case if there is some form of intractability of the subposteriors (see for instance Andrieu and Roberts (2009)), or maybe the evaluation of such quantities may be simply too computationally expensive (for instance in large data settings (Pollock et al., 2020; Bouchard-Cˆot e et al., 2018; Bierkens et al., 2019; Dai et al., 2023)). In these settings, we no longer are able to evaluate φc in (6) which is necessary to update the particle weights in the iterative steps of the BF algorithm. However, it is possible to consider alternative unbiased estimators for ρj in Step 2c. Corollary 24 (Dai et al., 2023, Corollary 3) The estimator ϱj x(C) j 1, x(C) j := Y κc j e U(c) X j κc! p (κc|Rc) U(c) X φc X(c) ξc,kc where φc is an unbiased estimator of φc and U(c) j is a constant such that φc(x) U(c) j for x Rc. Proof This follows directly from Theorem 4. The estimator ϱj in Corollary 24 can therefore be used as a substitute for ρj in Algorithm 1 Step 2c. However, we must be careful in constructing ϱj since its introduction typically increases the variance of the estimator which ultimately causes higher variance in the particle set weights in the BF algorithm. In particular, by using Corollary 24, the number of expected functional evaluations will change from K to K and so we must consider the growth in the ratio K /K as mc (Pollock et al., 2020; Dai et al., 2023). However, as noted, introducing an alternative unbiased estimator may be necessary to apply the BF approach to some settings. For instance, consider the example setting provided in Dai et al. (2023, Appendix E), where we have a large number of data points associated to each sub-posterior (i.e we have mc 1 data points for core c C) then computing φc in (6) is an expensive O(mc) operation. However, since φc is linear in terms terms of log fc(x) and 2 log fc(x), it is simple to construct an unbiased estimator φc for φc. In the setting, we also assume the sub-posteriors admit a structure with conditional independence and can be factorised as follows, i=1 li,c(x). (77) Then since φc is linear in terms of log li,c(x) and log li,c(x), then we could use the following naive unbiased estimator for φdl c : 2 log l I,c(x ) Λc log l J,c(x ) + Tr Λc 2 log l I,c(x ) , (78) Divide-and-Conquer Fusion where I, J iid U{1, . . . , mc}. Although using such an estimator has the advantage of having O(1) cost when evaluating, this comes at the cost of an O(mc) inflation in the expected number of evaluations when evaluating ϱj over ρj. However, following the approach of Pollock et al. (2020, Section 4) and Dai et al. (2023, Appendix E), we first want to suitable choose some control variates to construct our estimator, and compute log fc and 2 log fc at points close to either the mode of the sub-posterior, ˆxc, or the mode of the target posterior ˆx (where close means within O(m 1 2 c ) of the true respective modes). Computing these control variates will typically be one-time O(mc) computations. αI,c(x) := n [ log l I,c(x) log l I,c(x )], (79) HI,c(x) := n [ 2 log l I,c(x) 2 log l I,c(x )], (80) then since log fc(x) = Pmc i=1 log li,c(x), we have EA [ αI,c(x)] = αc(x), EA h HI,c(x) i = Hc(x). (81) where αc(x) := log fc(x) log fc(x ) and Hc(x) := 2 log fc(x) 2 log fc(x ) and A is the law of I U{1, . . . , n}. Noting that φc(x) in (6) can be re-expressed as 2 [αc(x) Λc(2 log fc(x ) + αc(x)) + Tr(Λc Hc(x))] + C , (82) where C := 1 2 log fc(x ) Λc log fc(x ) + Tr Λc 2 log fc(x ) , then this leads to the following unbiased estimator for φc: h αI,c(x) (2 log fc(x ) + αJ,c(x)) + Tr Λc HI,c(x) i + C , (83) where I, J iid U{1, . . . , mc}, i.e. if now we let A be the law of I, J iid U{1, . . . , mc}, we have EA h φc(x) i = φc(x). Here the evaluations of the constants log fc(x ) 2, Tr 2 log fc(x ) are of O(mc) cost, but they only need to be computed once prior to calling Algorithm 1. The unbiased estimator φc(x) uses only double draws from {1, . . . , mc}, although Pollock et al. (2020) notes that it would be possible to replace this by averaging over multiple draws (sampling from {1, . . . , mc} with replacement) which could have advantages of reducing the variance of the estimator at the cost of increasing the number of data points to evaluate at. Appendix I. Simulation studies In this section we study empirically the performance of our Fusion algorithms (Sections 2 and 3), and selection of tuning parameters (T, n and P as discussed in Section 4) in our Chan, Pollock, Johansen and Roberts two idealised key settings the SH(λ) setting (Condition 8) and SSH(γ) setting (Condition 10) described in Section 4. We do this in Sections I.1 and I.2 respectively. For simplicity, here we focus on BF and GBF, noting that GBF is simply D&C-Fusion with a fork-andjoin tree hierarchy (as in Figure 1). Finally, in Section I.3 we compare the performance of Fusion methodologies (including D&C-Fusion with a balanced-binary tree hierarchy) with increasing dimensionality. In Section 5, we consider more substantive examples using real data. Note that the earlier Bayesian Fusion approach is simply a special case of our GBF approach with Λc = Id for c C, and so comparison with this work is straight-forward. To compare the performance of different approaches we consider their computational cost (both the total run time, and n which represents the number of iterations of Algorithm 1 Step 2 and so is a proxy for the amount of communication between cores), and Integrated Absolute Distance (IAD) defined in (45). Throughout this section we use the GPE-2 estimator of ρj as given in Definition 6, and use the Trapezoidal rule to estimate the mean γc in (24) and set βc = 10 for c C. Code to run these simulation studies can be found at https://github.com/rchan26/DCFusion. I.1 Sub-posterior Homogeneity We first study the guidance developed for T and P in Section 4 for GBF (Algorithm 1) in the SH(λ) setting of Condition 8. Recall, this is the setting in which we are combining homogeneous sub-posteriors, and would naturally arise if a dataset was split randomly across several cores. To study this setting, we consider the idealised scenario of combining C = 10 bi-variate Gaussian sub-posteriors, with a range of data sizes from m = 1000 to m = 40000, which have been randomly split across the C = 10 cores. In particular, each sub-posterior has mean 0 = (0, 0) and variance C mΣ, where Σ = 1 ρ ρ 1 with ρ = 0.9. For this example, we apply both BF and GBF with a fixed particle set size of N = 10000. To verify the guidance for T and P, we consider varying T and P with increasing data size m, and the impact this has on CESS0 and CESSj (for j {1, . . . , n}). We consider the four following choices of T and P: 1. a fixed choice of T and n to obtain P (for GBF, T = 1 and n = 5, and for BF, T = 0.005 and n = 5), 2. using the recommended T from Section 4.1 and fixed n = 5 to obtain P, 3. using the recommended T and P using a regular mesh (as outlined in Algorithm 3 and Section 4.2.1), 4. using the recommended T and P using an adaptive mesh (as outlined in Algorithm 4 and Section 4.2.2). Divide-and-Conquer Fusion In implementing the BF and GBF, we set our lower tolerable bounds for the initial (CESS0) and the iterative (CESSj) conditional effective sample sizes to be 0.5N (i.e. we set ζ = ζ = 0.5), resampling if ESS falls below 0.5N. To summarise how one might practically use our guidance to choose T and P, we present our approach in Remark 25. Remark 25 We set the tuning parameters for BF and GBF (for the SH(λ) setting of Section I.1) as follows: 1. Following the guidance outlined in Remark 14, and with ζ = 0.5, λ = 1 and d = 2, we 2 ) log(ζ) 1.7. For GBF, Λc is the estimated covariance matrices for sub- posterior c {1, . . . , C}, so b = m C (see Remark 11), and we choose T = C 1 2 k1. For BF, Λc = Id for c {1, . . . , C}, so we have b = 1 and so we choose T = C3/2k1/m. 2. When using the regular mesh, we use Algorithm 3 to obtain P. First let ζ = 0.5 then for GBF we have b = m C , and so j = = q k4 2Cd for each j, where k4 is computed as per (44) and computing an estimate of the supremum of [ E[νj] as per (41). For BF, b = 1, so j = = q Ck4 2m2d for each j. 3. When using the adaptive mesh, we use Algorithm 4 to obtain j recursively at each iteration to construct P. We let ζ = 0.5 and for GBF (where b = m C ) we compute tj = min{T, tj 1 + j} where j = q k4 2Cd at each iteration of Algorithm 1, until we have tj = T. For the standard BF approach, note that b = 1 so we must compute Ck4 2m2d instead at each iteration. The conditional effective sample size of the GBF and (standard) BF approaches with increasing data size in this SH(λ) setting are shown in Figure 12. First considering the results from fixing T and n in Figure 12a, we can see that BF lacks robustness with increasing data size. Here CESS0 improves with increasing data size (m), which is due to the sub-posteriors becoming increasingly similar with m in this idealised scenario. However, as we increase m the fixed choice for T (and hence the size of the intervals) becomes increasingly inappropriate for the sub-posteriors, which leads to a degradation in average CESSj. In contrast, GBF incorporates global information about the sub-posteriors (i.e. the variance of the sub-posteriors), so there is no change in performance with m. Here there is a trade-offwith the choice of T: a small T leads to poor behaviour on initialisation (i.e. low CESS0), but good behaviour at each iteration (i.e. high average CESSj). Considering Figure 12b, we see that scaling T following the guidance developed in Section 4.1 immediately stabilises CESS0, although CESSj performance is still poor (n is too small). In Figure 12c and Figure 12d, we see that utilising both the guidance for T and the mesh P drastically improves the performance of both BF and GBF. In both cases GBF outperforms Chan, Pollock, Johansen and Roberts BF: it achieves higher average CESSj, and the variance of CESSj is lower. Given BF is a special case of GBF, this improvement can be ascribed to the use of estimated covariance matrices for Λc. In particular, this choice leads to a lower variance unbiased estimator for ρj, and an improved proposal hbf (14) for gbf (15). From Figure 12e we see that with BF that without our guidance on T and P, average IAD is poor, and the variance of the IAD is very large. In contrast, GBF with our guidance is robust across the different scenarios. Comparing the regular and adaptive meshes simply using CESS0 and CESSj would imply that the regular mesh is performing better (since it has slightly better CESSj), however the adaptive mesh is slightly more computationally efficient as shown by having a smaller mesh size, n, (illustrated in Figure 12f) and having a faster algorithm run-time (illustrated in Figure 12g). By looking at the IAD obtained for these approaches, we can see that we are able to obtain similar performance at a lower cost with the adaptive mesh construction. I.2 Sub-posterior Heterogeneity Now we study the guidance for T and P for GBF (Algorithm 1) developed in Section 4 in the SSH(γ) setting of Condition 10. This represents the setting where sub-posterior heterogeneity does not decay with data size. Here, we consider the scenario of combining C = 2 bi-variate Gaussian sub-posteriors, fc N µc, 2 mΣ , where µ1 = (0.25, 0.25) and µ2 = (0.25, 0.25) and Σ = 1 ρ ρ 1 , with ρ = 0.9. We again consider a range of data sizes, which ranges from m = 250 to m = 2500 and are randomly split between C = 2 cores. We apply BF and GBF with a fixed particle set size of N = 10000. In this setting as m increases the sub-posterior heterogeneity increases, which is a consequence of the sub-posteriors having diminishing overlapping support. In BF (where Λ1 = Λ2 = Id) this heterogeneity is not captured, and σ2 a = 0.125 irrespective of m. By contrast, the generalised approach is able to capture the heterogeneity with m with simply the inclusion of the estimated covariance matrices {Λc}c=1,2. As with the previous example in Section I.1, we will investigate the effect of varying T and P with m, and its impact upon CESS0 and CESSj. We consider the following choices for T and P: 1. a fixed choice of T and n to obtain P (for GBF, T = 2 and n = 5, and for BF, T = 0.01 and n = 5), 2. using the recommended T from Section 4.1 and fixed n = 5 to obtain P, 3. using the recommended T and P using a regular mesh (as outlined in Algorithm 3 and Section 4.2.1), 4. using the recommended T and P using a, adaptive mesh (as outlined in Algorithm 4 and Section 4.2.2). Divide-and-Conquer Fusion 1000 10000 20000 30000 40000 Data Sizes 0.0 0.2 0.4 0.6 0.8 1.0 CESS / N 1000 10000 20000 30000 40000 Data Sizes 0.0 0.2 0.4 0.6 0.8 1.0 CESS / N (a) Fixed user-specified T and n. 1000 10000 20000 30000 40000 Data Sizes 0.0 0.2 0.4 0.6 0.8 1.0 CESS / N 1000 10000 20000 30000 40000 Data Sizes 0.0 0.2 0.4 0.6 0.8 1.0 CESS / N (b) Recommended T and fixed n. Figure 12: Bivariate Gaussian example in SH(λ) setting with increasing data size. In Figures 12a, 12b, 12c, 12d solid lines denote initial CESS (CESS0), and dotted lines denote averaged CESS in subsequent iterations ( 1 n Pn j=1 CESSj), and crosses denote CESSj for each j {1, . . . , n}. Chan, Pollock, Johansen and Roberts 1000 10000 20000 30000 40000 Data Sizes 0.0 0.2 0.4 0.6 0.8 1.0 CESS / N 1000 10000 20000 30000 40000 Data Sizes 0.0 0.2 0.4 0.6 0.8 1.0 CESS / N (c) Recommended T and recommended regular mesh P. 1000 10000 20000 30000 40000 Data Sizes 0.0 0.2 0.4 0.6 0.8 1.0 CESS / N 1000 10000 20000 30000 40000 Data Sizes 0.0 0.2 0.4 0.6 0.8 1.0 CESS / N (d) Recommended T and recommended adaptive mesh P. 1000 10000 20000 30000 40000 Data Sizes 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 Integrated Absolute Distance Fixed T, fixed n Recommended T, fixed n Recommneded T, regular mesh Recommended T, adaptive mesh 1000 10000 20000 30000 40000 Data Sizes 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 Integrated Absolute Distance Fixed T, fixed n Recommended T, fixed n Recommneded T, regular mesh Recommended T, adaptive mesh (e) Integrated absolute distance: lines connect the mean IAD (averaged over ten runs) while the points denote the individual IAD achieved on each run. Figure 12: Bivariate Gaussian example in SH(λ) setting (continued). Divide-and-Conquer Fusion 1000 10000 20000 30000 40000 Data Sizes 0 100 200 n Regular mesh Adaptive mesh 1000 10000 20000 30000 40000 Data Sizes 0 100 200 n Regular mesh Adaptive mesh (f) Comparison of mesh sizes between regular and adaptive schemes. 1000 10000 20000 30000 40000 Data Sizes 2 3 4 5 6 7 8 9 10 11 log(Elapsed time in seconds) Fixed T, fixed n Recommended T, fixed n Recommneded T, regular mesh Recommended T, adaptive mesh 1000 10000 20000 30000 40000 Data Sizes 2 3 4 5 6 7 8 9 10 11 log(Elapsed time in seconds) Fixed T, fixed n Recommended T, fixed n Recommneded T, regular mesh Recommended T, adaptive mesh (g) Mean average computational run-times (based on ten runs). Figure 12: Bivariate Gaussian example in SH(λ) setting (continued). Chan, Pollock, Johansen and Roberts When applying the guidance, we set the lower tolerable bounds on the initial (CESS0) and iterative (CESSj) conditional effective sample sizes to be 0.5N (i.e. we set ζ = ζ = 0.5), and re-sample if ESS drops below 0.5N. Again, for helping with the practical interpretation of our extensive guidance for selecting T and P, we summarise our approach in Remark 26: Remark 26 We set the tuning parameters for BF and GBF (for the SSH(γ) setting of Section I.2) as follows: 1. We follow the guidance outlined in Remark 14, noting that ζ = 0.5 and d = 2. For GBF, Λc=1,2 are the estimated covariance matrices for each of the sub-posteriors, so b = m C (see Remark 11), and γ = mσ2 a/C (where σ2 a is estimated from the sub- posterior samples). Consequently, we can compute k1 = k2 = 2) log(ζ) , and choose T = C 1 2 k1. For BF, Λc=1,2 = Id, so b = 1 and γ = σ2 a, and so we can compute 2) log(ζ) and k2 = Ck1 m , and choose T = C3/2k1 m = C 1 2 k2. 2. When using the regular mesh, we use Algorithm 3 to obtain P. As ζ = 0.5, we have for GBF b = m C , and so j = = q k4 2Cd for each j where k4 is computed as per (44) (with supj [ E[νj] computed as per (41)). For BF we have b = 1, so j = = q Ck4 2m2d for each j. 3. When using the adaptive mesh, we use Algorithm 4 to obtain j recursively at each iteration to construct P. With ζ = 0.5 for the GBF (where b = m C ) we compute tj = min{T, tj 1 + j} where j = q k4 2Cd at each iteration of Algorithm 1 until we have tj = T. For BF with b = 1 we have instead j = q Ck4 2m2d at each iteration. CESS for BF and GBF with increasing m in this SSH(γ) setting are shown in Figure 13. We can immediately see that the SSH(γ) setting is much more challenging than the idealised SH(λ) setting of Section I.1 and Figure 12, which is unsurprising as in this case the sub-posteriors are becoming increasingly mismatched as data size increases. In Figure 13a, we see that fixing T and n is not ideal for either BF or GBF. As shown in Figure 13b, there is a positive effect for both BF and GBF in using our recommended scaling of T in the quality of the initialisation. In Figure 13c and Figure 13d, where both the guidance for T and P are implemented, we see a substantial improvement in the performance of both approaches with respect to CESS, with again our new GBF approach outperforming BF. In Figure 13c we see that the use of a regular mesh in choosing P, following our guidance, provides robust CESSj with low variance. Indeed, it appears to outperform the adaptive mesh approach for P (see Figure 13d). However, as discussed in Section 4.2.2, the regular mesh is overly conservative, and when we factor in the reduced number of iterations required Divide-and-Conquer Fusion in the adaptive case (Figure 13f), along with the overall reduction in computational cost (Figure 13g) for comparable IAD (Figure 13e), we see that the use of an adaptive mesh is preferable. 500 1000 1500 2000 2500 Data Sizes 0.0 0.2 0.4 0.6 0.8 1.0 CESS / N 500 1000 1500 2000 2500 Data Sizes 0.0 0.2 0.4 0.6 0.8 1.0 CESS / N (a) Fixed user-specified T and n. 500 1000 1500 2000 2500 Data Sizes 0.0 0.2 0.4 0.6 0.8 1.0 CESS / N 500 1000 1500 2000 2500 Data Sizes 0.0 0.2 0.4 0.6 0.8 1.0 CESS / N (b) Recommended T and fixed n. Figure 13: Bivariate Gaussian example in SSH(γ) setting with increasing data size. In Figures 13a, 13b, 13c, 13d solid lines denote initial CESS (CESS0), and dotted lines denote averaged CESS in subsequent iterations ( 1 n Pn j=1 CESSj), and crosses denote CESSj for each j {1, . . . , n}. Chan, Pollock, Johansen and Roberts 500 1000 1500 2000 2500 Data Sizes 0.0 0.2 0.4 0.6 0.8 1.0 CESS / N 500 1000 1500 2000 2500 Data Sizes 0.0 0.2 0.4 0.6 0.8 1.0 CESS / N (c) Recommended T and recommended regular mesh P. 500 1000 1500 2000 2500 Data Sizes 0.0 0.2 0.4 0.6 0.8 1.0 CESS / N 500 1000 1500 2000 2500 Data Sizes 0.0 0.2 0.4 0.6 0.8 1.0 CESS / N (d) Recommended T and recommended adaptive mesh P. 500 1000 1500 2000 2500 Data Sizes 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 Integrated Absolute Distance Fixed T, fixed n Recommended T, fixed n Recommneded T, regular mesh Recommended T, adaptive mesh 500 1000 1500 2000 2500 Data Sizes 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 Integrated Absolute Distance Fixed T, fixed n Recommended T, fixed n Recommneded T, regular mesh Recommended T, adaptive mesh (e) Integrated absolute distance: lines connect the mean IAD (averaged over ten runs) while the points denote the individual IAD achieved on each run. Figure 13: Bivariate Gaussian Example in SSH(γ) setting (continued). Divide-and-Conquer Fusion 500 1000 1500 2000 2500 Data Sizes 1 2 3 4 log(n, 10) Regular mesh Adaptive mesh 500 1000 1500 2000 2500 Data Sizes 1 2 3 4 log(n, 10) Regular mesh Adaptive mesh (f) Comparison of mesh sizes between regular and adaptive schemes. 500 1000 1500 2000 2500 Data Sizes 2 3 4 5 6 7 8 9 10 11 log(Elapsed time in seconds) Fixed T, fixed n Recommended T, fixed n Recommneded T, regular mesh Recommended T, adaptive mesh 500 1000 1500 2000 2500 Data Sizes 2 3 4 5 6 7 8 9 10 11 log(Elapsed time in seconds) Fixed T, fixed n Recommended T, fixed n Recommneded T, regular mesh Recommended T, adaptive mesh (g) Mean average computational run-times (based on ten runs). Figure 13: Bivariate Gaussian Example in SSH(γ) setting (continued). Chan, Pollock, Johansen and Roberts I.3 Dimension Scaling In this section we empirically study the performance of Fusion approaches (BF, GBF and D&C-Fusion) with increasing dimensionality. To do so we consider a d-dimensional multivariate Gaussian f QC c=1 fc, where we let C = 8 and fc Nd(0, CΣ), and where Σii = 1, for all i {1, . . . , d}, Σij = 0.9, for all i = j, (i, j) {1, . . . , d}, and simply vary d (in steps from d = 1 to d = 100). For BF and GBF we use an adaptive mesh for P, and for D&C-Fusion we consider both a regular and adaptive mesh for P with a balanced-binary tree hierarchy. In all cases we use the guidance developed in Section 4. As we are in the SH(λ) setting (the true sub-posterior means are the same), we set λ = 1. The lower bounds of the tolerable initial and iterative CESS are set to 0.05N (i.e. ζ = ζ = 0.05) and we resample if the ESS drops below 0.5N, where here we have N = 10000. The results are presented in Figure 14. 0 10 20 30 40 50 60 70 80 90 100 Dimension 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Integrated Absolute Distance D&C GBF (regular) D&C GBF (adaptive) GBF BF (a) Integrated Absolute Distance. 0 1 2 3 4 5 6 log(Dimension, 2) 2 3 4 5 6 7 8 9 10 11 12 13 14 log(Elapsed time in seconds, 2) D&C GBF (regular) D&C GBF (adaptive) GBF BF (b) Computational cost. Figure 14: Comparison of Fusion methodologies with increasing dimensionality (in the setting of Section I.3). In Figure 14a, lines connect the mean IAD (averaged over ten runs) while the points denote the individual IAD achieved on each run. As shown in Figure 14a, the performance of all Fusion methods degrades with increasing dimensionality: both in terms of the average IAD and also the variance. As our target exhibits high correlation between components, BF struggles here even in low dimensions, whereas the GBF and D&C-Fusion approaches we have developed in this paper offer much better scaling with dimension. D&C-Fusion comfortably outperforms existing Fusion approaches for even moderate dimensionality in terms of IAD and computational cost. Divide-and-Conquer Fusion Appendix J. Calculations for examples In this section, we provide the calculations necessary to implement the Fusion algorithms discussed in this paper. In particular, to implement Generalised Bayesian Fusion (Section 2) and Divide-and-Conquer Fusion (Section 3), we must be able to compute φc given in (6). This requires the computation of the first and second order derivatives of the log sub-posterior densities. Furthermore, it is necessary to compute bounds of φc. As noted in Section 2.2, if it is not possible to (or simply difficult to) find tight bounds for φc, we can use the general bounds given in Proposition 3. To use these general bounds, we must find a upper bound on the matrix norm of Λc 2 log fc(x) for x Rc (i.e. find P Λc given in (19)), which can be done by computing the matrix norm of the matrix which bounds the matrix Λc 2 log fc(x) element-wise. We note that in some cases, it may be easier to find the bound on the matrix norm of the Hessian of the transformed sub-posterior, f(z) c (z) where z := Λ 1 2 c x. In particular, rather than finding a bound in (19), we can focus on finding the bound P Λc max z R(z) c γ 2 log f(z) c (z) , (84) which is equivalent to finding the bound in (19). In Appendix F, we detailed how we can simulate ρj. In particular, in Algorithm 5, we perform a transformation on the space and in Step 1b, we compute the layer information R(z) c and so we can directly use this to find local element-wise bounds 2 log f(z) c (z) for z R(z) c . Therefore, to find P Λc, we just need to find bounds on the second order derivatives of the log-sub-posterior in the transformed space z := Λ 1 2 c x so that we can compute the matrix norm of the matrix which bounds 2 log f(z) c (z) element-wise. J.1 Logistic Regression In Section 5.4, we considered applying our Fusion methodologies to a logistic regression example with Gaussian prior distributions for the parameters. In particular, our sub-posterior densities were given by the posterior for Bayesian logistic regression with Nd(µj, Cσ2 βj) prior for βj for j = 0, . . . , p is given by fc(β) := π(β|y) = 2πCσ2 βj exp Chan, Pollock, Johansen and Roberts where X Rn (p+1) is the design matrix so Xiβ = β0+β1Xi1+ βp Xip. The log-posterior is given by log fc(β) = h Xiβ yi log 1 + eβXi i 2Cσ2 βj + constant. (86) The first derivative of the log-posterior with respect to βk for k = 0, . . . , p, is given by Xik yi Xike Xiβ Xik yi 1 1 + e Xiβ Cσ2 βk (87) and the second order derivatives of the log-posterior are given by 2 log fc(β) (1 + e Xiβ)2 1 Cσ2 βk , (88) 2 log fc(β) Xik Xile Xiβ (1 + e Xiβ)2 for k = l, (89) for k, l = 0, . . . , p. We can use these directly to compute φc given in (6). To compute the bounds of φc, we can utilise the bounds provided in Proposition 3 (or in (17) and (18)). To do so, we must be able to compute an upper bound of the matrix norm Λc 2 log fc(x) for x Rc where Rc denotes the simulated layer information, i.e. to compute (19). While this can be done by computing the matrix norm of the matrix which bounds the matrix Λc 2 log fc(x) element-wise, we noted above that it is typically easier to find bounds on the matrix norm of 2 log f(z) c (z) where z := Λ 1 2 c β, and instead we can focus on finding a bound in the transformed space, i.e. compute (84). In this logistic regression setting, let z = Λ 1 2 c β then the transformed posterior density is given by f(z) c (z) = π(β|X, y)|J|, (90) where J = Λ 1 2 c is the Jacobian matrix with elements Jij = zi 2 c,ij. We have log f(z) c (z) = log π(β|X, y) + log |J|. (91) Since β = Λ 1 2c z, we have f(z) c (z) := π(β|X, y) |Λ 1 2πCσ2 βj exp Divide-and-Conquer Fusion e Xi(Λ 1 2 c z) yi 1 + e Xi(Λ 1 2 c z) 2πCσ2 βj exp 1 2c z)j µj 2 c |, (92) log f(z) c (z) = 1 2c z) yi log 1 + e Xi(Λ 1 2 c z) 1 2c z)j µj 2Cσ2 βj + constant. We first note that since β = Λ 1 2c z, then βi = (Λ 1 2c z)i = P k Λ 1 2 ikzk. So we have 1 2c )z zk = zk 1 2c )ik (94) and also we have 1 2c z)i zk = zk 1 2 ijzj = Λ 1 2 ik (95) Using (94) and (95), then the first derivative of the log transformed posterior with respect to βk for k = 0, . . . , p, is given by log f(z) c (z) zk = 1 + e (XiΛ 1 2 c )z 1 2c z)j µj Cσ2 βj . (96) Then the second order derivatives are given by 2 log f(z) c (z) zk zl = 1 2c )ik(XΛ 1 2c )ile(XiΛ 1 2 c )z 1 + e(XiΛ 1 2 c )z 2 1 2 jl Cσ2 βj , (97) for k, l = 0, . . . , p. Chan, Pollock, Johansen and Roberts To find bounds for φc, we must now try to find bounds on the second derivatives given above and compute the matrix norm of the matrix made up of these bounds (which ultimately bounds 2 log f(z) c (z) element-wise). For this example, we can find global and lower bounds of the second derivatives. Note however, we typically will expect better performance with the local bounds on P Λc (84) (as this will typically lead to the expected number of points we need to evaluate while performing Poisson thinning, κc, to be lower) despite these bounds being slightly more expensive to compute in practice. J.1.1 Global bounds of P Λc We first note that ex (1+ex)2 1 4 for all x (and this maximum occurs at x = 0). We can utilise this to obtain a global bound: " 2 log f(z) c (z) zk zl 1 2c |ik |XΛ 1 2c |il 4 + 1 2 jl Cσ2 βj . (98) J.1.2 Local bounds of P Λc Local bounds can be obtained if we can find local bounds for G1(z) := e(XiΛ 1 2 c )z 1 + e(XiΛ 1 2 c )z 2 , (99) for i = 1, . . . , n. In that case, we have " 2 log f(z) c (z) zk zl 1 2c |ik |XΛ 1 2c |il max z R(z) {G1(z)} + 1 2 jl Cσ2 βj . (100) To compute maxz R(z) {G1(z)}, see Section J.3.2 and Algorithm 8 and set r = 1. J.2 Robust Regression In Section 5.2, we considered a robust regression example (using a student-t distribution) with Gaussian prior distributions for the parameters. In particular, our sub-posterior densities were given by the posterior for Bayesian robust regression with Nd(µj, Cσ2 βj) prior for βj for j = 0, . . . , p is given by fc(β) = π(β|X, y) := Divide-and-Conquer Fusion 2πCσ2 βj exp The log-posterior is given by log fc(β) = ν + 1 i=1 log 1 + 1 νσ2 (yi Xiβ)2 2Cσ2 βj + constant. (102) The first derivative of the log-posterior with respect to βk for k = 0, . . . , p is given by log π(β|X, y) νσ2 (yi Xiβ) 1 + 1 νσ2 (yi Xiβ)2 (βk µk) Xik(yi Xiβ) νσ2 + (yi Xiβ)2 (βk µk) Cσ2 βk , (103) and the second order derivatives of the log-posterior are given by 2 log π(β|X, y) β2 k = (ν + 1) X2 ik (yi Xiβ)2 νσ2 (νσ2 + (yi Xiβ)2)2 1 Cσ2 βk , (104) 2 log π(β|X, y) βk βl = (ν + 1) Xik Xil (yi Xiβ)2 νσ2 (νσ2 + (yi Xiβ)2)2 for k = l, (105) for k, l = 0, . . . , p. We can use these derivatives directly to compute φc given in (6). Following in the same approach as Section J.1.2, we can compute the bounds of φc (in (6)) by utilising the bounds provided in (17) and (18). As noted in Section J.1.2, we must be able to find an upper bound on the matrix norm of 2 log f(z) c (z) where z := Λ 1 2 c β, i.e. compute (84). To do so, we can compute the matrix norm of the matrix which bounds 2 log f(z) c (z) element-wise. Now, let z = Λ 1 2 c β then f(z) c (z) = π(β|X, y)|J|, where J = Λ 1 2 c is the Jacobian matrix, so we have f(z) c (z) := π(β|X, y) |Λ 1 2πCσ2 βj exp 1 2c z)j µj 2 c |. (106) Chan, Pollock, Johansen and Roberts log f(z) c (z) = ν + 1 1 2c z)j µj 2Cσ2 βj + constant. (107) Recall from (94) and (95), then The first derivative of the log transformed posterior with respect to βk for k = 0, . . . , p is given by log f(z) c (z) zk = ν + 1 2(XΛ 1 2 c )ik νσ2 1 2c z)j µj νσ2 + yi (XiΛ 1 2c z)j µj Cσ2 βj (108) Then the second order derivatives are given by 2 log f(z) c (z) zk zl = (ν + 1) 1 2c )ik(XΛ 1 2c )z 2 νσ2 ! 1 2c )z 2 + νσ2 !2 1 2 jl Cσ2 βj for k, l = 0, . . . , p. J.2.1 Global bounds of P Λc To compute P Λc for this example, first note that we can write 2 log f(z) c (z) zk zl = (ν + 1) 1 2c )ik(XΛ 1 Ei + b 2b (Ei + b)2 1 2 jl Cσ2 βj , (110) for k, l = 0, . . . , p, where b = νσ2 and Ei = yi (XiΛ 1 2c )z 2 . Now let K(Ei) = 1 Ei + b 2b (Ei + b)2 , Divide-and-Conquer Fusion then the derivative is given by K (Ei) = 1 (Ei + b)2 + 4b (Ei + b)3 . Setting K (Ei) = 0 gives Ei = 3b, and we have K(Ei = 3b) = 1 8b. So the supremum of the second derivative is given by " 2 log f(z) c (z) zk zl 1 2c |ik |XΛ 1 2 jl Cσ2 βj . (111) We can therefore use this to compute P Λc to compute bounds for φc as per (17) and (18). J.3 Negative Binomial Regression In Section 5.3, we considered a negative Binomial regression example with Gaussian prior distributions for the parameters. In particular, our sub-posterior densities were given by the posterior density with Nd(µj, Cσ2 βj) priors for βj for j = 0, . . . , p, is given by fc(β) := π(β|X, y) yi r µi + r 2πCσ2 βj exp yi!Γ(r) exp(Xiβ yi) rr (exp(Xiβ) + r)yi+r 2πCσ2 βj exp The log-posterior is given by log fc(β) = i=1 [Xiβ yi (yi + r) log (exp(Xiβ) + r)] 2Cσβj + constant. (113) The first order derivative of the log-posterior with respect to βk for k = 0, . . . , p, is given by Xikyi (yi + r)Xik exp(Xiβ) exp(Xiβ) + r Xik yi (yi + r) exp(Xiβ) exp(Xiβ) + r Cσ2 βk , (114) and the second order derivatives of the log-posterior are given by 2 log fc(β) (yi + r)r X2 ik exp(Xiβ) (exp(Xiβ) + r)2 1 Cσ2 βk , (115) Chan, Pollock, Johansen and Roberts 2 log fc(β) (yi + r)r Xik Xil exp(Xiβ) (exp(Xiβ) + r)2 for k = l, (116) for k, l = 0, . . . , p. We can use these directly to compute φc given in (6). Following in the same approach as Section J.1.2, we can compute the bounds of φc (in (6)) by utilising the bounds provided in (17) and (18). As noted in Section J.1.2, we must compute (84). To do so, we can compute the matrix norm of the matrix which bounds 2 log f(z) c (z) element-wise. We have f(z) c (z) := π(β|X, y) |Λ 1 1 2c z) + r yi+r 2πCσ2 βj exp 1 2c z)j µj 2 c |, (117) log f(z) c (z) = 1 2c )z yi (yi + r) log exp (XiΛ 1 2c )z + r 1 2c z)j µj 2Cσβj + constant. (118) Recall from (94) and (95) that we have (XiΛ 1 2 c )z zk = (XΛ 1 2c )ik and (Λ 1 2 c z)i zk = Λ 1 2 ik. Then first derivative of the log transformed posterior with respect to βk is given by log f(z) c (z) zk = yi (yi + r) exp (XiΛ 1 2c )z + r 1 2c z)j µj Cσ2 βj (119) and the second order derivatives are given by 2 log f(z) c (z) zk zl = (yi + r)r(XΛ 1 2c )ik(XΛ 1 2c )il exp (XiΛ 1 2c )z + r 2 1 2 jl Cσ2 βj . (120) Divide-and-Conquer Fusion To find bounds for rc, we must now try to find bounds on the second derivatives given above and compute the matrix norm of the matrix made up of these bounds (which ultimately bounds 2 log f(z) c (z) element-wise). For this example, we can find global and lower bounds of the second derivatives. Note however, we typically will expect better performance with the local bounds on P Λc (as this will typically lead to the expected number of points we need to evaluate while performing Poisson thinning, κc, to be lower) despite these bounds being slightly more expensive to compute in practice. J.3.1 Global bounds of P Λc Note that eax (eax+r)2 1 4r for all x (where a is some constant), so we can use this to obtain global bounds on the matrix norm in the transformed space. Note that this maximum occurs at x = 1 a log(r). To find global bounds, we can use " 2 log f(z) c (z) zk zl (yi + r)r|XΛ 1 2c |ik |XΛ 1 2c |il 4r + 1 2 jl Cσ2 βj . (121) J.3.2 Local bounds of P Λc Local bounds can be obtained if we can find local bounds for Gr(z) := exp (XiΛ 1 2c )z + r 2 (122) for z R(z) and i = 1, . . . , n. In that case, we have " 2 log f(z) c (z) zk zl (yi + r)r|XΛ 1 2c |ik |XΛ 1 2c |il max z R(z) {Gr(z)} + 1 2 jl Cσ2 βj . We can obtain bounds for Gr(z) by noting that exp(x) (r+exp(x))2 1 4r for all x and this maximum is attained at x = log(r). Further note that exp(x) (r+exp(x))2 1 4r is a uni-modal function (with mode at x = log(r) as noted). Now let Fi(z) := (XiΛ 1 2c )jzj, (124) then let F i := minz R(z) Fi(z) and F i := maxz R(z) Fi(z) denote the minimum and maximum of Fi(z) for z R(z) respectively. Then we note that this can simply be computed in Chan, Pollock, Johansen and Roberts with a linear cost with d. Now, noting that Fi(z) is linear and exp(x) (r+exp(x))2 1 4r is uni-modal, after computing F i and F i , there are two cases: 1. If we have log(r) [F i , F i ], then we know that for this hypercube R(z), we will attain the maximum 1 4r. 2. If log(r) / [F i , F i ], then the maximum of Gr(x) occurs at which ever point is the closest to log(r). Therefore local bounds can be obtained by minimising and maximising Fi(z) for z R(z). If this interval includes log(r), then the local maximum attains the global maximum, otherwise, the local maximum occurs at either of these intervals (which ever is closer to log(r)). This method for finding local bounds requires two optimisations of Fi(z), but we note that we can actually obtain the bounds by only performing one optimisation. In particular, we can evaluate Fi(z) at any arbitrary value ˆz R(z) (we can simply take this to be the centre of the hypercube). If we have Fi(ˆz) > log(r), then we just need only need minimise the function Fi(z), since if we have F i < log(r), then we know that log(r) [F i , F i ], so the global maximum is attained. If F i > log(r), then the maximum of G(x) just occurs at F i and we can avoid the need to maximise the function Fi(z). However, if conversely, we evaluate Fi(z) at z = ˆz and we have Fi(ˆz) < log(r), then we just need to maximise Fi(z) for z R(z) and apply the inverse of the same trick. To summarise, in order to find maxz R(z) {Gr(z)}, we can apply Algorithm 8. Algorithm 8 Computing the local bounds of Gr(z) given in (122) for z R(z). 1. Compute Fi(ˆz) at some arbitrary value ˆz R(z). 2. If Fi(ˆz) > log(r): (a) Compute F i := minz R(z) Fi(z). (b) maxz R(z) exp (XiΛ 1 2 c )z exp (XiΛ 1 2 c )z +r 2 1 4r if F i < log(r), G(F i ) = exp F i exp F i +r 2 otherwise. 3. Else (if Fi(ˆz) < log(r)): (a) Compute maxz R(z) Fi(z). (b) maxz R(z) exp (XiΛ 1 2 c )z exp (XiΛ 1 2 c )z +r 2 1 4r if F i > log(r), G(F i ) = exp F i exp F i +r 2 otherwise. Divide-and-Conquer Fusion Isabelle Albert, Sophie Donnet, Chantal Guihenneuc-Jouyaux, Samantha Low-Choy, Kerrie Mengersen, and Judith Rousseau. Combining Expert Opinions in Prior Elicitation. Bayesian Analysis, 7(3):503 532, 2012. Christophe Andrieu and Gareth O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697 725, 2009. Jack Baker, Paul Fearnhead, Emily B. Fox, and Christopher Nemeth. Control variates for stochastic gradient MCMC. Statistics and Computing, 29(3):599 615, 2019. James O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, New York, 1980. Alexandros Beskos, Omiros Papaspiliopoulos, and Gareth O. Roberts. Retrospective exact simulation of diffusion sample paths with applications. Bernoulli, 12(6):1077 1098, 2006. Alexandros Beskos, Omiros Papaspiliopoulos, and Gareth O. Roberts. A Factorisation of Diffusion Measure and Finite Sample Path Constructions. Methodology and Computing in Applied Probability, 10(1):85 104, 2008. Joris Bierkens, Paul Fearnhead, and Gareth O. Roberts. The Zig-Zag Process and Super Efficient Sampling for Bayesian Analysis of Big Data. The Annals of Statistics, 47(3): 1288 1320, 2019. Alexandre Bouchard-Cˆot e, Sebastian J. Vollmer, and Arnaud Doucet. The Bouncy Particle Sampler: A Non-Reversible Rejection-Free Markov Chain Monte Carlo Method. Journal of the American Statistical Association, 113(522):855 867, 2018. Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017. Didier Dacunha-Castelle and Danielle Florens-Zmirou. Estimation of the Coefficients of a Diffusion from Discrete Observations. Stochastics: An International Journal of Probability and Stochastic Processes, 19(4):263 284, 1986. Hongsheng Dai, Murray Pollock, and Gareth O. Roberts. Monte Carlo Fusion. Journal of Applied Probability, 56(1):174 191, 2019. Hongsheng Dai, Murray Pollock, and Gareth O. Roberts. Bayesian Fusion: Scalable unification of distributed statistical analyses. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(1):84 107, 2023. Arnaud Doucet and Anthony Lee. Sequential Monte Carlo Methods. In Handbook of Graphical Models, pages 165 188. CRC Press, 2018. Dirk Eddelbuettel. Seamless R and C++ Integration with Rcpp. Springer, New York, 2013. Chan, Pollock, Johansen and Roberts Hadi Fanaee-T and Joao Gama. Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 2(2):113 127, 2014. Paul Fearnhead, Omiros Papaspiliopoulos, and Gareth O. Roberts. Particle Filters for Partially Observed Diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4):755 777, 2008. Axel Finke, Arnaud Doucet, and Adam M. Johansen. Limit Theorems for Sequential MCMC Methods. Advances in Applied Probability, 52(2):377 403, 2020. Joseph L. Fleiss. The statistical basis of meta-analysis. Statistical Methods in Medical Research, 2(2):121 145, 1993. Christian Genest and James V. Zidek. Combining Probability Distributions: A Critique and an Annotated Bibliography. Statistical Science, 1(1):114 135, 1986. Mathieu Gerber, Nicolas Chopin, and Nick Whiteley. Negative association, ordering and convergence of resampling methods. Annals of Statistics, 47(4):2236 2260, 2019. Robert J.B. Goudie, Anne M. Presanis, David Lunn, Daniela De Angelis, and Lorenz Wernisch. Joining and splitting models with Markov Melding. Bayesian Analysis, 14 (1):81 109, 2019. Tomoyuki Higuchi. Monte Carlo filter using the genetic algorithm operators. Journal of Statistical Computation and Simulation, 59(1):1 23, 1997. Richard A. Johnson. Asymptotic expansions associated with posterior distributions. The Annals of Mathematical Statistics, pages 851 864, 1970. Heysem Kaya, Pınar T ufekci, and Fikret S. G urgen. Local and global learning methods for predicting power of a combined gas & steam turbine. In Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering ICETCEE, pages 13 18, 2012. Augustine Kong, Jun S. Liu, and Wing Hung Wong. Sequential Imputations and Bayesian Missing Data Problems. Journal of the American Statistical Association, 89(425):278 288, 1994. Xiangshun Kong and Wei Zheng. Design Based Incomplete U-Statistics. Statistica Sinica, 31:1593 1618, 2021. Juan Kuntz, Francesca R. Crucinio, and Adam M. Johansen. Product-form estimators: exploiting independence to scale up Monte Carlo. Statistics and Computing, 32(12):1 22, 2022. Juan Kuntz, Francesca R. Crucinio, and Adam M. Johansen. The divide-and-conquer sequential Monte Carlo algorithm: Theoretical properties and limit theorems. Annals of Applied Probability, 2023. In press. Lucien Le Cam. Asymptotic Methods in Statistical Decision Theory. Springer Science & Business Media, New York, 1986. Divide-and-Conquer Fusion Lucien Le Cam and Grace Lo Yang. Asymptotics in Statistics: Some Basic Concepts. Springer Science & Business Media, New York, 2000. Fredrik Lindsten, Adam M. Johansen, Christian A. Naesseth, Bonnie Kirkpatrick, Thomas B. Sch on, John A.D. Aston, and Alexandre Bouchard-Cˆot e. Divide-and-Conquer with Sequential Monte Carlo. Journal of Computational and Graphical Statistics, 26(2): 445 458, 2017. Jun S. Liu and Rong Chen. Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association, 93(443):1032 1044, 1998. Stanislav Minsker, Sanvesh Srivastava, Lizhen Lin, and David B. Dunson. Scalable and Robust Bayesian Inference via the Median Posterior. In International Conference on Machine Learning, pages 1656 1664, 2014. Alexey Miroshnikov and Erin M. Conlon. Parallel MCMCcombine: an R package for Bayesian Methods for Big Data and Analytics. Plo S one, 9(9):e108425, 2014. Willie Neiswanger, Chong Wang, and Eric P. Xing. Asymptotically Exact, Embarrassingly Parallel MCMC. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI 14, page 623 632, Arlington, Virginia, USA, 2014. AUAI Press. Christopher Nemeth and Chris Sherlock. Merging MCMC Subposteriors through Gaussianprocess Approximations. Bayesian Analysis, 13(2):507 530, 2018. Murray Pollock, Adam M. Johansen, and Gareth O. Roberts. On the exact and ε-strong simulation of (jump) diffusions. Bernoulli, 22(2):794 856, 2016. Murray Pollock, Paul Fearnhead, Adam M. Johansen, and Gareth O. Roberts. Quasistationary Monte Carlo and the Sca LE algorithm. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 82(5):1167 1221, 2020. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2022. URL http://www.R-project.org/. L. Chris G. Rogers and David Williams. Diffusions, Markov processes and martingales: Volume 2, Itˆo calculus, volume 2. Cambridge University Press, Cambridge, 2000. Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, Hugh A. Chipman, Edward I. George, and Robert E. Mc Culloch. Bayes and Big Data: The Consensus Monte Carlo Algorithm. International Journal of Management Science and Engineering Management, 11(2):78 88, 2016. Sanvesh Srivastava, Volkan Cevher, Quoc Dinh, and David B. Dunson. WASP: Scalable Bayes via barycenters of subset posteriors. In Artificial Intelligence and Statistics, pages 912 920, 2015. Pınar T ufekci. Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. International Journal of Electrical Power & Energy Systems, 60:126 140, 2014. Chan, Pollock, Johansen and Roberts A. W. Van der Vaart. Asymptotic Statistics. Cambridge University Press, Cambridge, 1998. Andrew M. Walker. On the Asymptotic Behaviour of Posterior Distributions. Journal of the Royal Statistical Society: Series B (Methodological), 31(1):80 88, 1969. Xiangyu Wang and David B. Dunson. Parallelizing MCMC via Weierstrass Sampler. Statistics e-print 1312.4605, ar Xiv, 2013. Darrell Whitley. A genetic algorithm tutorial. Statistics and Computing, 4(2):65 85, 1994. Hadley Wickham. nycflights13: Flights that Departed NYC in 2013, 2021. Sinan Yıldırım and Beyza Ermi s. Exact MCMC with differentially private moves. Statistics and Computing, 29(5):947 963, 2019.