Journal of Machine Learning Research 26 (2025) 1-38. Submitted 8/22; Revised 7/25; Published 10/25

Mixtures of Gaussian Process Experts with SMC2

Teemu Härkönen (teemu.h.harkonen@aalto.fi), Department of Electrical Engineering and Automation, Aalto University, Espoo, FI-02150, Finland; School of Engineering Sciences, LUT University, Yliopistonkatu 34, FI-53850 Lappeenranta, Finland

Sara Wade (sara.wade@ed.ac.uk), School of Mathematics, University of Edinburgh, Edinburgh, EH9 3FD, United Kingdom

Kody Law (kody.law@manchester.ac.uk), Department of Mathematics, University of Manchester, Manchester, M13 9PL, United Kingdom; KLAI Ltd, London, EC1V 2NX, UK

Lassi Roininen (lassi.roininen@lut.fi), School of Engineering Sciences, LUT University, Yliopistonkatu 34, FI-53850 Lappeenranta, Finland

Editor: Anthony Lee

Abstract

Gaussian processes are a key component of many flexible statistical and machine learning models. However, they exhibit cubic computational complexity and high memory requirements due to the need to invert and store a full covariance matrix. To circumvent this, mixtures of Gaussian process experts have been considered, where data points are assigned to independent experts, reducing the complexity by allowing inference based on smaller, local covariance matrices. Moreover, mixtures of Gaussian process experts substantially enrich the model's flexibility, allowing for behaviors such as non-stationarity, heteroscedasticity, and discontinuities. In this work, we construct a novel inference approach based on nested sequential Monte Carlo samplers to simultaneously infer both the gating network and Gaussian process expert parameters. This greatly improves inference compared to importance sampling, particularly in settings where a stationary Gaussian process is inappropriate, while still being thoroughly parallelizable.

Keywords: sequential Monte Carlo, SMC2, Gaussian processes, mixture of experts, classification

©2025 Teemu Härkönen, Sara Wade, Kody Law, and Lassi Roininen. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/22-0973.html.

1. Introduction

Gaussian processes (GPs) are a versatile tool for statistical modelling of data where the exact form of the mapping between input and output variables is unknown, and they have found a plethora of applications. Recent applications of GPs include prediction of battery lifetimes (Richardson et al., 2017), geometry optimization in theoretical chemistry (Denzel and Kästner, 2018), spectral peak structure prediction (Wakabayashi et al., 2018), tokamak plasma modelling (Ho et al., 2019), medical diagnosis using proteomic data (Yu et al., 2022), pile foundation design (Momeni et al., 2020), and computational biology (Otto et al., 2024), to name a few. However, straightforward application of GPs suffers from cubic computational complexity and memory constraints due to the need to invert and store a full covariance matrix of size N × N for N data points. Moreover, many applications present departures from the standard GP model, such as non-stationarity, heteroscedasticity, and non-normality. Methods such as non-stationary covariance functions or deep Gaussian processes are possible solutions for modelling such behaviour (Paciorek and Schervish, 2003; Porcu et al., 2007; Damianou and Lawrence, 2013; Heinonen et al., 2016).
However, they do not provide classification or clustering for individual data points or the input space. To overcome the above limitations while providing classification, mixtures of GP experts have been considered (e.g. Tresp, 2001; Rasmussen and Ghahramani, 2002; Meeds and Osindero, 2005). Mixtures of GP experts belong to the general class of mixtures of experts (MoEs) (see Masoudnia and Ebrahimpour, 2014; Gormley and Frühwirth-Schnatter, 2019, for reviews). The nomenclature has its origins in machine learning (Jordan and Jacobs, 1994), but such models appear in many different contexts, for instance in statistics, where they are often called dependent mixtures (MacEachern et al., 1999; Quintana et al., 2022; Wade and Inácio, 2025), and in econometrics, where they are referred to as smooth mixtures (Li et al., 2011; Norets, 2010; Villani et al., 2012). MoEs probabilistically partition the input space into regions and specify a local expert within each region. The gating network maps the experts to local regions of the input space, determining the partition of the data into independent sets. Local experts are fit independently to the data sets, where each expert is a conditional model for the output y given the input x, which is specifically a GP model in the case of mixtures of GP experts. This allows for both improved scalability, as each GP expert fit only requires inversion and storage of the local covariance matrices, and flexibility, as GP experts can capture different behaviors, e.g. smoothness and variability, within each region.

Inference for mixtures of GP experts is often performed via Markov chain Monte Carlo (MCMC), namely, Gibbs sampling (Meeds and Osindero, 2005; Rasmussen and Ghahramani, 2002; Gadd et al., 2020), by augmenting the model with a set of latent allocation variables denoting the labels of the expert associated with each data point. This provides asymptotically exact samples from the posterior; however, it often comes with a high computational cost due to slow mixing and long convergence times. For faster approximate inference, Yuan and Neubauer (2008) develop a variational inference scheme, and Tresp (2001) employs an expectation-maximization algorithm for maximum a posteriori (MAP) inference. A combination of variational inference, MAP, and sparse GPs (Quiñonero-Candela and Rasmussen, 2005) is developed in Nguyen and Bonilla (2014) to improve scalability to big data (e.g. training size of 10^5), and this is further expanded upon in Etienam et al. (2020) for fast one-pass approximate MAP inference.

Recently, Zhang and Williamson (2019) proposed an embarrassingly parallel approach for mixtures of GP experts based on importance sampling. Importantly, this allows for uncertainty quantification in the parameters and derived quantities, as opposed to MAP inference, and avoids the independence or distributional assumptions of variational inference. Moreover, in the proposal distribution, random partitions of the data points are generated, allowing for easily distributed GP likelihood computation. However, it is well known that naive importance sampling requires the number of importance samples to be exponential in the Kullback-Leibler (KL) divergence of the posterior from the proposal distribution (Chatterjee and Diaconis, 2018). We discuss how this number can be quite large for mixtures of GP experts due to the massive number of possible partitions, leading to poor inference, particularly when a stationary GP model is inappropriate.
To address this, we propose to improve upon the approach of Zhang and Williamson (2019) by using nested sequential Monte Carlo (SMC) samplers, as introduced by Chopin et al. (2013), as a way of inferring the model parameters from the posterior. We discuss and demonstrate how this provides a more accurate and robust approach, especially for complex target distributions resulting from complex behaviors of the data, such as non-stationarity, discontinuities, or heterogeneity. Furthermore, for complex target distributions our method can be much more computationally efficient, while still being thoroughly parallelizable. It is well known that SMC methods reduce importance sampling complexity from exponential to polynomial (Beskos et al., 2014). Some parallelizability must be sacrificed due to weight normalization, but all other stages of the algorithm can be executed independently and in parallel. There also exist versions of SMC that can combine independent asynchronous SMCs in lieu of a single monolithic SMC, provided that a minimum ensemble size is used for each one (Jasra et al., 2022; Liang et al., 2025). If more parallel capacity is available, then these methods could be leveraged in place of any of the inner parallel SMCs and/or the outer SMC in our method. Additionally, it is worth noting that several studies have previously proposed using SMC in various ways to sample from standard mixture models, such as Bouchard-Côté et al. (2017) for split-merge moves, amongst other approaches (MacEachern et al., 1999; Fearnhead, 2004; Fearnhead and Meligkotsidou, 2007; Mansinghka et al., 2007; Caron and Doucet, 2009; Carvalho et al., 2010; Ülker et al., 2010). SMC methods have also been proposed for GP inference, see for example Gramacy and Polson (2011); Svensson et al. (2015).

In summary, we formulate mixtures of GP experts such that nested SMC samplers can be used. The nested SMC samplers, or SMC2, allow the model inference to be separated into two parts: the sampling of the gating network parameters and the sampling of the GP parameters. We highlight that this allows for incorporation of uncertainty in the GP parameters, in contrast to the MAP estimates employed in Zhang and Williamson (2019). The use of SMC2 greatly extends the set of feasible target posterior distributions relative to the previously proposed embarrassingly parallel importance sampling of mixtures of GP experts. This is achieved through improved computational efficiency, allowing improved inference of the mixture of GP experts parameters. Alternatively, one can obtain inference of similar quality with fewer samples, due to the reduced KL divergence between successive proposal and target distributions in SMC2.

The rest of the paper is organized as follows. We formulate the statistical model in Section 2, which is followed by a review of importance sampling in Section 3. We present the proposed method by first introducing the inner SMC sampler in Section 4 and concluding with the nested SMC2 sampler in Section 5. In Section 6, we provide our construction of the predictive distribution, and results are presented in Section 7 for two illustrative one-dimensional data sets along with the one-dimensional motorcycle helmet acceleration, 3D NASA Langley glide-back booster simulation, and 4D Colorado precipitation data sets. We conclude with final remarks and a discussion in Section 8.

2. Statistical model
In a supervised setting, MoEs provide a flexible framework for modelling the conditional relationship between the outputs y and the inputs x, and two general approaches exist for defining MoEs, namely, generative and discriminative. Generative MoEs model (y, x) jointly, while discriminative MoEs model the conditional of interest directly. In this work, we focus on the discriminative approach, which has the advantage of avoiding modelling of, and placing any distributional assumptions on, the input variables. We assume that the data consist of a sample of N input variables X = {x_1, ..., x_N}, with x_i = (x_{i,1}, ..., x_{i,D}), and output variables Y = {y_1, ..., y_N}. For simplicity, we focus on the case when x_i ∈ R^D and y_i ∈ R, although extensions for other types of inputs and outputs (e.g. binary, categorical, counts) can also be constructed. MoEs assume

y_i | x_i, Ψ, Θ ∼ Σ_{k=1}^K p_k(x_i | Ψ) p(y_i | x_i, Θ_k),

where (p_1(x_i | Ψ), ..., p_K(x_i | Ψ)) =: p(x_i | Ψ) is the gating network, which takes values in the (K−1)-dimensional simplex, and the expert p(y_i | x_i, Θ_k) is a conditional model, with local relevance determined by p_k(x_i | Ψ) at the input location x_i. Here, Ψ := (ψ_1, ..., ψ_K) is a shorthand for the gating network parameters and Θ = (Θ_1, ..., Θ_K) contains the expert parameters. MoEs can be augmented with a set of latent allocation variables c_i ∈ {1, ..., K} and equivalently represented by

y_i | x_i, c_i = k, Θ_k ∼ p(y_i | x_i, Θ_k),    c_i | x_i, Ψ ∼ Categorical(p(x_i | Ψ)),

where c_i determines the partition of the data points into the K groups. We use C = (c_1, ..., c_N) to denote the vector of labels associated with each data point. Various proposals exist for different choices of the experts and gating networks, ranging from linear to nonlinear approaches, such as trees (Gramacy and Lee, 2008), neural networks (Etienam et al., 2020), and GPs (Tresp, 2001). In the following, we consider GP experts and define the weights through normalized kernels.

2.1 Experts

For the experts, we consider the nonparametric regression model:

y_i | x_i, c_i = k, f_k, σ_{k,ε}^2 ∼ N(f_k(x_i), σ_{k,ε}^2),

where f_k(·) is the unknown regression function mapping the inputs to the outputs and σ_{k,ε}^2 is the noise variance for the kth expert. We place a nonparametric GP prior (Williams and Rasmussen, 2006) on the regression function:

f_k ∼ GP(m_k, Σ(·, ·; θ_k)),

with m_k and Σ(·, ·; θ_k) denoting, respectively, a constant mean and the covariance function depending on the kernel hyperparameters θ_k. In the regression setting, the functions can be analytically marginalized, resulting in the likelihood:

Y_k | (X_k, m_k, θ_k) ∼ N(m_k, Σ(X_k, X_k; θ_k) + σ_{k,ε}^2 I),

where (X_k, Y_k) = {(x_i, y_i) : c_i = k} are the data points with label c_i = k, Σ(X_k, X_k; θ_k) denotes the covariance function evaluated at the inputs {x_i : c_i = k}, and I denotes the identity matrix. Extensions for multivariate or other types of outputs can be constructed through a generalized multivariate Gaussian process framework (Gadd et al., 2020). To simplify notation, we incorporate the noise variance within the covariance function and use an anisotropic squared exponential covariance function for each of the GPs, such that the ijth element of the covariance matrix is given by

[Σ(X_k, X_k; θ_k)]_{i,j} = σ_{k,ε}^2 δ(x_i − x_j) + σ_{k,f}^2 exp( −Σ_{d=1}^D (x_{i,d} − x_{j,d})^2 / (2 l_{k,d}^2) ),

where x_i, x_j ∈ X_k, σ_{k,f}^2 is the signal variance, and l_{k,d} is the length scale along the dth dimension, with δ(x) denoting the Kronecker delta function and θ_k = (σ_{k,ε}, σ_{k,f}, l_{k,1}, ..., l_{k,D}).
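To make the expert model concrete, the following minimal numpy sketch (not the authors' implementation) evaluates the anisotropic squared exponential covariance above, with the noise variance folded into the diagonal, and the corresponding Gaussian log marginal likelihood of a single expert. The 1/(2 l^2) scaling and the small jitter term are conventional assumptions.

```python
import numpy as np

def se_cov(X1, X2, sigma_f, lengthscales, sigma_eps=0.0):
    """Anisotropic squared exponential covariance; the noise term sigma_eps**2
    is added on the diagonal when the two input sets coincide (Kronecker delta)."""
    diff = (X1[:, None, :] - X2[None, :, :]) / lengthscales   # (n1, n2, D)
    K = sigma_f**2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))
    if sigma_eps > 0.0 and X1.shape == X2.shape and np.allclose(X1, X2):
        K = K + sigma_eps**2 * np.eye(X1.shape[0])
    return K

def gp_log_marginal_likelihood(X, y, m, sigma_f, lengthscales, sigma_eps):
    """log N(y | m, Sigma(X, X; theta)) for a single expert with constant mean m."""
    K = se_cov(X, X, sigma_f, lengthscales, sigma_eps)
    L = np.linalg.cholesky(K + 1e-10 * np.eye(len(y)))        # jitter for stability
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - m))
    return (-0.5 * (y - m) @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(y) * np.log(2.0 * np.pi))
```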
Furthermore, we use the notation Θ_k = (m_k, θ_k) for the kth GP parameters and Θ = (Θ_1, ..., Θ_K) for all the GP parameters. An example partition and its corresponding GP fits are illustrated in Figure 1.

Figure 1: An example of partitioned data with four clusters. The clustered data points are shown in blue, red, green, and gray for each cluster, with their respective GP means and 95% pointwise credible intervals in the corresponding color.

2.2 Gating networks

We define the gating network through the kernels {K_1(x_i | ψ_1), ..., K_K(x_i | ψ_K)}; specifically,

p_k(x_i) = K_k(x_i | ψ_k) / Σ_{k'=1}^K K_{k'}(x_i | ψ_{k'}).    (1)

To ensure the gating network p(x_i) = (p_1(x_i), ..., p_K(x_i))^T takes values in the (K−1)-dimensional simplex, we assume 0 < K_k(x | ψ_k) < ∞ for all x ∈ R^D. Examples of kernels used in the literature include the multinomial logistic model with K_k(x | ψ_k) = exp(ψ_k^T x) (Li et al., 2011) and the factorized kernel K_k(x | ψ_k) = v_k Π_{d=1}^D K_{k,d}(x_d | ψ_{k,d}) (Antoniano-Villalobos et al., 2014). In the latter, mixed types of inputs are allowed by specifying each K_{k,d}(x_d | ψ_{k,d}) to correspond to the kernel of a standard distribution, depending on the nature of the dth input. We emphasize that this does not correspond to any model assumptions on the input, which in fact may be fixed. In the following, we use weighted D-dimensional normal distributions with diagonal covariance matrices for the kernels, individually denoted by K_k:

K_k(x_i | ψ_k) := K_k(x_i | v_k, µ_k, Σ_k) = v_k N(x_i | µ_k, Σ_k),    (2)

where v_k denotes the weight of the component, with µ_k ∈ R^D and Σ_k ∈ R^{D×D} being the mean and a diagonal covariance matrix, respectively. As such, we define ψ_k := (v_k, µ_k, Σ_k)^T = (v_k, µ_{k,1}, ..., µ_{k,D}, σ_{k,1}^2, ..., σ_{k,D}^2)^T, where (σ_{k,1}^2, ..., σ_{k,D}^2) are the diagonal elements of the kth covariance matrix. Hard cluster assignments using the above kernel result in quadratic boundaries. More flexible kernel specifications could be used for nonlinear boundaries, such as mixtures of normals or the feature space method, where partitioning is done in a higher-dimensional space, producing non-linear partitions in the original input space (Filippone et al., 2008).

More generally, alternative specifications of the kernel may be employed. For example, Zhang and Williamson (2019) also construct the gating network through normalized kernels and consider full covariance matrices Σ_k, employing the normal-inverse-Wishart distribution as a prior for the means and covariances of the gating network. Instead, the factorized form of the kernel based on a diagonal covariance matrix in Eq. (2) results in reduced computational complexity relative to the dimension D of the inputs, from O(D^3) in the case of the full covariance matrix to O(D) in the factorized form. Furthermore, the factorized form allows for straightforward extensions to include multiple input types, as in Antoniano-Villalobos et al. (2014). Lastly, the inverse-Wishart prior can suffer from poor parametrization (Consonni and Veronese, 2001), with only a single parameter to control variability. We instead follow Gelman (2006) in using independent half-normal distributions for the variance parameters. Additionally, we use a normal distribution for the mean µ_k.
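As a concrete illustration of the gating network, the sketch below (an illustration under stated assumptions rather than the authors' code) evaluates the probabilities in Eqs. (1)-(2) for weighted diagonal-covariance normal kernels, normalizing in log space for numerical stability.

```python
import numpy as np
from scipy.stats import norm

def gating_probs(X, v, mu, sigma2):
    """p_k(x_i) from Eqs. (1)-(2) with weighted diagonal-covariance normal kernels.
    X: (N, D) inputs, v: (K,) weights, mu: (K, D) means, sigma2: (K, D) variances."""
    # log of the D-dimensional diagonal normal density, summed over dimensions
    log_k = norm.logpdf(X[:, None, :], loc=mu[None, :, :],
                        scale=np.sqrt(sigma2)[None, :, :]).sum(axis=-1)
    log_k = log_k + np.log(v)[None, :]                  # multiply by the weight v_k
    log_k = log_k - log_k.max(axis=1, keepdims=True)    # stabilize before normalizing
    w = np.exp(log_k)
    return w / w.sum(axis=1, keepdims=True)             # each row lies on the simplex
```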
For the weights, we employ a Dirichlet prior:

(v_1, ..., v_K) ∼ Dir(α/K, ..., α/K),

where α is the Dirichlet concentration parameter. The concentration parameter α can be used to enforce sparsity (Rousseau and Mengersen, 2011), as in sparse hierarchical MoEs. This helps to avoid overfitting, which has been noted to be prevalent in dense mixtures (Titsias and Likas, 2002; Mossavat and Amft, 2011; Iikubo et al., 2018; İrsoy and Alpaydın, 2021). Smaller concentration parameter values promote sparsity, and in the extreme case when α → 0, prior mass is concentrated on the vertices of the simplex and all weight is placed on a single component. Thus, K can be considered an upper bound on the number of components, and through appropriate selection of α, the sparsity-promoting prior allows the data to determine the number of components required. In standard mixtures, Rousseau and Mengersen (2011) show that asymptotically the extra components will be emptied, provided that α/K is less than half the number of parameters in a single mixture component. Alternatively, the Jeffreys prior for the multinomial distribution, with α = K/2 in our case, can be used as a simple non-informative choice (Grazian and Robert, 2018).

The Dirichlet-distributed weights can also be constructed through independent Gamma random variables; that is, ν_1, ..., ν_K are independent with ν_k ∼ Gamma(α/K, 1), and each weight is defined as v_k = ν_k / Σ_{k'=1}^K ν_{k'}. Notice that the normalization term Σ_{k'=1}^K ν_{k'} cancels out in our definition of the gating network, so we can rewrite Eq. (2) as:

K_k(x_i | ψ_k) := K_k(x_i | v_k, µ_k, Σ_k) = ν_k N(x_i | µ_k, Σ_k).

2.3 Mixture of GP experts

The above results in our statistical model being

Y_k | (X_k, m_k, θ_k) ∼ GP(m_k, Σ(X_k, X_k; θ_k)),
Θ_k ∼ π_0(Θ_k),
c_i | x_i, Ψ ∼ Categorical(p(x_i)),

where π_0(Θ_k) denotes the prior distribution for the GP parameters, which assumes independence across k = 1, ..., K, and

p_k(x_i) = ν_k N(x_i | µ_k, Σ_k) / Σ_{k'=1}^K ν_{k'} N(x_i | µ_{k'}, Σ_{k'}),

with ν_k ∼ Gamma(α/K, 1) and µ_{k,d} and σ_{k,d}^2 assigned independent normal and half-normal priors, respectively. A plate diagram of the model is provided in Figure 2.

Figure 2: Plate diagram showing relations between the data and model parameters for the MoE.

The prior distributions used for the one-dimensional and higher-dimensional data sets are presented in Table 1.

Table 1: Prior distributions for Ψ and Θ (e.g., m_k ∼ U(0, max Y) for the GP constant means, with half-normal priors for the standard deviation and length scale parameters). To encourage components that are well separated along the input space, the prior means for the gating network parameters µ_k are constructed using a linearly spaced grid, with G_k = (G_{k,1}, ..., G_{k,D}) denoting the kth grid point. For D = 1, we use G = 0.25 and σ = 0.25, with G = 0.05 and σ = 0.01 for the higher-dimensional examples.

3. Importance sampling

Zhang and Williamson (2019) propose an embarrassingly parallel algorithm based on importance sampling (IS) to estimate the posterior distribution of C and Ψ, using a MAP approximation for the GP parameters Θ, as well as the predictive distribution of the data, in parallel. For this, an importance distribution q(C, Ψ | X) is constructed which is easy to sample from. The importance distribution is chosen to be the prior distribution, such that q(C, Ψ | X) = π(C, Ψ | X).
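For intuition, a particle (C_j, Ψ_j) can be drawn from this proposal (equivalently, the prior) roughly as in the sketch below, which reuses gating_probs from the previous sketch. The particular prior locations and scales (mu0, sigma_mu, sigma_sd) are placeholders, not the values specified in Table 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_proposal_particle(X, K, alpha, mu0=0.5, sigma_mu=0.25, sigma_sd=0.25):
    """Draw gating-network parameters from their priors and allocations C from
    p(C | X, Psi), illustrating the Gamma construction of the weights (their
    normalization cancels in the gating network). Hyperparameters are placeholders."""
    N, D = X.shape
    nu = rng.gamma(alpha / K, 1.0, size=K)                        # nu_k ~ Gamma(alpha/K, 1)
    mu = rng.normal(mu0, sigma_mu, size=(K, D))                   # normal prior on the means
    sigma2 = np.abs(rng.normal(0.0, sigma_sd, size=(K, D)))**2    # half-normal std. deviations
    probs = gating_probs(X, nu, mu, sigma2)                       # nu_k can replace v_k directly
    C = np.array([rng.choice(K, p=p) for p in probs])             # c_i ~ Categorical(p(x_i))
    return C, (nu, mu, sigma2)
```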
Thus,

q(C, Ψ | X) = p(C | X, Ψ) π(Ψ) = π(Ψ) Π_{i=1}^N p(c_i | x_i, Ψ),  with  p(c_i = k | x_i, Ψ) = v_k N(x_i | µ_k, Σ_k) / Σ_{k'=1}^K v_{k'} N(x_i | µ_{k'}, Σ_{k'}).

Using the above, J particles are sampled in parallel from q(C, Ψ | X), and the weight ω_j for the jth particle (C_j, Ψ_j) in the IS estimation is given as

ω_j ∝ π(C_j, Ψ_j | X, Y) / q(C_j, Ψ_j | X) ∝ p(Y | X, C_j, Ψ_j) π(C_j, Ψ_j | X) / π(C_j, Ψ_j | X) = p(Y | X, C_j, Ψ_j) ≈ Π_{k=1}^K N(Y_{k,j} | m_{k,j}, Σ(X_k, X_k; θ_{k,j})),    (3)

where m_{k,j} and θ_{k,j} are MAP estimates. For a new input x*, the predictive mean and density of the output are computed by averaging according to the respective particle weights and gating network probabilities at the new location x*:

E[y* | x*, X, Y] = Σ_j ω_j Σ_k µ*_{k,j} p_k(x* | Ψ_j),
π(y* | x*, X, Y) = Σ_j ω_j Σ_k N(y* | µ*_{k,j}, σ*²_{k,j}) p_k(x* | Ψ_j),

where µ*_{k,j} and σ*²_{k,j} are the predictive means and variances of the associated K GPs for each of the J particles. For more details on the importance sampling approach, see Zhang and Williamson (2019).

As mentioned in the Introduction, the number of particles required for importance sampling is exponential in the KL divergence of the posterior from the proposal distribution (Chatterjee and Diaconis, 2018), that is,

J ≳ exp[D_KL(π(C, Ψ | X, Y) ∥ q(C, Ψ | X))].

As an extreme example, consider a uniform proposal distribution for C, with q(C | X, Ψ) = 1/K^N, obtained in the limiting case when α → ∞ and σ_{k,d}^2 → ∞. On the other hand, suppose the posterior on C is very concentrated on a single partition C*; due to invariance with respect to a relabelling of the components, the posterior in this case is uniform over the set C̄ of allocation vectors C that are equivalent to C* up to relabelling of the components. Thus, π(C | X, Y) = Σ_{C ∈ C̄} δ_C / |C̄|, where the size of C̄ is |C̄| = K!/(K − t̄)!, with t̄ denoting the number of clusters in C*. Here, the KL divergence is

D_KL(π(C, Ψ | X, Y) ∥ q(C, Ψ | X)) = N log K − log |C̄|,

and the number of particles required is J ≳ K^N / |C̄|. Even for a very small sample size, e.g. N = 10, K = 5, and t̄ = 2, this requires J ≳ 488,281, and when N = 20, this increases to J ≳ 4,768,371,582,031. While this is a purposefully constructed edge case, some insight on the behaviour for stationary data, in comparison to data that present departures such as non-stationarity, discontinuities, or heteroskedasticity, can be postulated. Stationary data can be modelled well over a large class of partitions, as all data can be effectively modelled via a single GP model, and partitioning into any reasonable number of subgroups will not degrade the fit too much. Therefore, sampling of the partitions can be done effectively using strategies such as importance sampling, as the distance between the proposal and posterior is smaller in comparison to non-stationary and discontinuous cases. Instead, the partitions play a more important part in modelling non-stationary and discontinuous data; the posteriors are concentrated on a smaller number of partitions, thus requiring more samples to obtain a suitable approximation.

This motivates us to use sequential Monte Carlo (SMC) methods, and specifically SMC2, to improve our estimation of the posterior or to retain the same level of quality with the same number of, or fewer, particles. SMC methods, also known as particle filtering and smoothing, are a standard tool in statistical signal processing (Särkkä, 2013) and provide a robust method for sampling posterior distributions (Chopin, 2002; Del Moral et al., 2006).
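The particle counts in the edge case above can be reproduced with a few lines; this back-of-the-envelope check assumes the relabelling count |C̄| = K!/(K − t̄)! used in the text.

```python
from math import factorial

def required_is_particles(K, N, t_bar):
    """Rough lower bound J >= exp(D_KL) = K**N / |C_bar| for the edge case above,
    with |C_bar| = K!/(K - t_bar)! relabellings of a partition with t_bar clusters."""
    c_bar = factorial(K) // factorial(K - t_bar)
    return K**N / c_bar

print(required_is_particles(5, 10, 2))   # about 4.9e5
print(required_is_particles(5, 20, 2))   # about 4.8e12
```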
In particular, we use sequential likelihood tempering to estimate the gating network parameters Ψ and the allocation variables C, while fitting K independent Gaussian processes to the partitioned data. This is in contrast to parameter estimation for a single GP using data tempering, such as by Gramacy and Polson (2011) and Svensson et al. (2015). In our case, likelihood tempering seems more appropriate, as the mixture configuration will depend heavily on new data, and dynamically adding data may require significant readjustments of the gating network. For further discussion on using tempering versus growing subsets of data, see Sections 17.3 and 17.3.4, in particular, of Chopin and Papaspiliopoulos (2020b) In addition, as SMC methods employ a collection of weighted particles, their application is a natural extension to the methodology presented in Zhang and Williamson (2019). Owing to the nested structure of the SMC2, we refer to the nested SMC samplers as the inner and outer SMC samplers. Below, we present the posterior distribution for the discriminative Mo E model and the inner SMC sampler for estimating the GP expert parameters for a fixed partition and gating network parameters C and Ψ. This is followed by the presentation of the outer SMC sampler for the gating network parameters and the partition. Together, these two samplers constitute our full nested SMC2 estimation scheme. 4. Inner SMC sampler Using the discriminative Mo E model in Section 2.3, the target posterior distribution for C, Ψ, and Θ is π(C, Ψ, Θ | X, Y ) p(Y | X, C, Θ)p(C | X, Ψ)π0(Ψ)π0(Θ), which can be marginalized with respect to the GP parameters Θ to yield π(C, Ψ | X, Y ) p(Y | X, C)p(C | X, Ψ)π0(Ψ). In this case, the likelihood p(Y | X, C) is given as p(Y | X, C) = p(Y | X, C, Θ)π0(Θ) dΘ = p (Yk | Xk, Θk) π0(Θk) dΘk, (4) with the log-likelihood of the individual GPs being log p(Yk | Xk, Θk) = 1 2(Yk mk)T Σ(Xk, Xk; θk) 1(Yk mk) 2 log |Σ(Xk, Xk; θk)| Nk Mixtures of Gaussian Process Experts with SMC2 where |Σ(Xk, Xk; θk)| is the determinant of the covariance matrix and Nk is the number of data points in the kth group. Estimating the marginalized likelihood p(Y | X, C) could be approached using MAP estimates, similarly to Eq. (3). However, the MAP estimates have certain disadvantages (Williams and Rasmussen, 2006, p. 136). First, given multiple modes, optimization is not guaranteed to converge to the MAP estimate. Second, the uncertainty for the predictions is misspecified. Lastly, the resulting marginal likelihood estimates are biased. To clearly illustrate this, consider the simplified case of K = 1 expert. We are interested in probing the posterior p(Θ | Y, X), and in particular computing the normalizing constant p(Y | X). The predictive distribution and marginal likelihood estimates are shown in Figures 3 and 4, respectively. Figure 3 shows the posterior distribution p(Θ | Y, X) in Eq. (4) as a function of the length scale and noise standard deviation parameters for a GP with 7 data points. The posterior distribution exhibits two local minima: one with smaller length scale and noise standard deviation parameters, and one with larger length scale and noise standard deviation. The corresponding GP predictive means and 95% predictive intervals are misspecified (first two plots in the bottom row of Figure 3). 
In contrast, the marginalized predictive mean and 95% predictive intervals (last plot in the bottom row of Figure 3) more closely follow the data-generating predictive mean and predictive intervals, showing the benefit of marginalization over a plug-in MAP approximation. Figure 4 shows the biases resulting from using either of the local minima as surrogates for marginalizing p(Θ | Y, X) together with the mean and 95% confidence intervals for the marginal likelihood estimates using SMC with varying particle amounts, M. The SMC samplers are run 500 times for each M. The marginal likelihood estimates provided by the SMC samplers show no bias and consistently converge towards the correct marginalized likelihood estimate. The marginalized likelihood estimate for p(Y | X) has been computed using high-accuracy numerical integration. The above points led us to construct the likelihood-tempered inner SMC sampler to consistently estimate p(Y | X, C). We build a sequence of intermediate tempered conditional posterior distributions for the parameters C and Ψ as follows πt(C, Ψ | X, Y ) pt(Y | X, C)p(C | X, Ψ)π0(Ψ), (5) pt(Y | X, C) := p(Y | X, C, Θ)κ(t)π0(Θ) dΘ = p (Yk | Xk, Θk)κ(t)π0(Θk) dΘk, (6) and where t stands for the time step of the SMC or SMC2 sampler and κ(t) is a strictly increasing sequence of parameters, 0 = κ(0) < < κ(t 1) < κ(t) < < κ(T) = 1, controlling the degree of tempering. We discuss the explicit construction of the tempering sequence, which depends on the outer SMC sampler, in the following Section. The inner SMC sampler allows us to unbiasedly estimate the intractable tempered marginal likelihood pt(Y | X, C) by an ensemble {Θ(t) m=1 for any outer SMC particle (C, Ψ), as follows. We have the tempered unnormalized posterior πt(Y, C, Ψ, Θ) = π(C, Ψ, Θ | X) p(Y | X, C, Θ)κ(s) κ(s 1). H ark onen, Wade, Law, and Roininen 1 2 3 4 5 l1;1 Figure 3: On top left, the posterior distribution p(Θ | Y, X) in Eq. (4) as a function of the noise standard deviation σ1,ε and length scale l1,1 for 7 data points from a single zero-mean Gaussian process expert with (σ1,ε, σ1,f, l1,1) = 0.11/2, 1, 1 . The local maxima of the posterior distribution are illustrated with blue and red stars. On top right, the 7 data points in blue together with the true predictive mean and 95% interval in black and gray. Below, from left to right, the predictive mean and 95% intervals corresponding to the short length scale local minimum, long length scale local minimum, and marginalization over the posterior distribution p(Θ | Y, X). Now, for any given (C, Ψ), we can construct a non-negative and unbiased estimator of the joint πt(Y, C, Ψ) as ˆπt(Y, C, Ψ) = ZM t (Y | X, C)π(C, Ψ | X) , (7) t (Y | X, C) := p(Y | X, C, Θ(s 1) m )κ(s) κ(s 1) , (8) is a non-negative and unbiased estimator (Moral, 2004; Dai et al., 2022) of pt(Y | X, C) in Eq. (6) with Θ(s) m being the output from an SMC with fixed C. In other words, the initial particles Θ(0) m are i.i.d from the prior distribution Θ(0) m π0(Θ). For subsequent iterations t = 1, . . . , T, the particles are obtained as Θ(t) (t) m , ), where Mt is an MCMC Mixtures of Gaussian Process Experts with SMC2 5 9 16 28 50 89 158 281 500 M p(Y j X) p SMC(Y j X) p MAP(Y j X) p LM(Y j X) Figure 4: The marginalized likelihood estimates p MAP(Y | X), p LM(Y | X), and p(Y | X) using the MAP estimate (blue line), long length scale local minimum (red line), and high-accuracy numerical integration (dashed red lines), respectively. 
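To fix ideas before the formal construction, the following sketch shows a generic likelihood-tempered SMC of the kind used for the inner sampler: particles are drawn from the prior, reweighted by p(Y | ·)^{κ(t)−κ(t−1)}, resampled, and mutated with a random-walk Metropolis kernel, while the running product of average incremental weights yields an estimate of the (log) marginal likelihood. The fixed tempering schedule, step size, and random-walk proposal are simplifying assumptions; the adaptive choices actually used are described in Section 5.

```python
import numpy as np

rng = np.random.default_rng(1)

def inner_smc(log_lik, log_prior, sample_prior, kappas, M=64, n_mcmc=3, step=0.1):
    """Likelihood-tempered SMC sketch. log_lik/log_prior map a parameter vector to a
    scalar, sample_prior() draws one parameter vector, and kappas is an increasing
    schedule with kappas[0] = 0 and kappas[-1] = 1. Returns particles and log Z."""
    theta = np.array([sample_prior() for _ in range(M)])
    loglik = np.array([log_lik(th) for th in theta])
    log_Z = 0.0
    for k_prev, k_new in zip(kappas[:-1], kappas[1:]):
        # incremental weights: p(Y | theta)^(kappa_t - kappa_{t-1})
        logw = (k_new - k_prev) * loglik
        log_Z += np.log(np.mean(np.exp(logw - logw.max()))) + logw.max()
        w = np.exp(logw - logw.max())
        w = w / w.sum()
        idx = rng.choice(M, size=M, p=w)                 # multinomial resampling
        theta, loglik = theta[idx], loglik[idx]
        for _ in range(n_mcmc):                          # random-walk MH mutation
            prop = theta + step * rng.standard_normal(theta.shape)
            prop_ll = np.array([log_lik(p) for p in prop])
            log_acc = (k_new * (prop_ll - loglik)
                       + np.array([log_prior(p) for p in prop])
                       - np.array([log_prior(t) for t in theta]))
            accept = np.log(rng.uniform(size=M)) < log_acc
            theta[accept], loglik[accept] = prop[accept], prop_ll[accept]
    return theta, log_Z
```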
The MAP estimate leads to overfitting and high likelihood for the training data. The mean p SMC(Y | X) and 95% confidence intervals given by repeated SMC runs with different particle amounts M shown in black and gray, respectively, match closely with numerical integration. kernel, Θ(t) m are resampled particles, and I(t) m Categorical(ω(t)) are resampling indices for m = 1, . . . , M, with ω(t) = (ω(t) 1 , . . . , ω(t) M ) and the individual particle weight are given as ω(t) m p(Y | X, C, Θ(t 1) m )κ(t) κ(t 1). Here Mt is an MCMC kernel which keeps pt(Θ | C, X, Y ) πt(Y, C, Ψ, Θ) invariant. Let A(t) denote I(1:t) 1:M and all the auxiliary variables of the SMC sampler for Θ. The distribution πt(C, Ψ, Θ(0:t) 1:M , A(t) | X, Y ) ZM t (Y | X, C)π(C, Ψ | X)Υ(Θ(0:t) 1:M , A(t) | C, X, Y ), (9) where Υ(Θ(0:t) 1:M , A(t) | C, X, Y ) is the distribution of the SMC sampler given C (up to time t), is such that for any m = 1, . . . , M and any t = 0, . . . , T, ϕ(C, Ψ, Θ) dπt(C, Ψ, Θ | X, Y ) = ϕ(C, Ψ, Θ(t) m ) dπt(C, Ψ, Θ(0:t) 1:M , A(t) | X, Y ). H ark onen, Wade, Law, and Roininen Figure 5: An illustration of the inner SMC sampler Υ(Θ(0:t) 1:M , A(t) | C, X, Y ). The SMC sampler uses M particles Θ(0:t) 1:M to construct an approximation for the posterior distribution of the GP expert parameters at each iteration corresponding to the current level of tempering κ(t). The inner SMC sampler provides an unbiased estimator for the marginal likelihood pt(Y | X, C) in Eq. (6) according to Eq. (8). The auxiliary parameters A(t) appear in the mutation and selection steps and are not used elsewhere or retained. In other words, it has the target of interest as its marginal. We note that the SMC sampler Υ(Θ(0:t) 1:M , A(t) | C, X, Y ) is independent of Ψ, hence its omission. We present a schematic in Figure 5 and pseudo-code in Algorithm 1 for the SMC sampler Υ(Θ(0:t) 1:M , A(t) | C, X, Y ). In the following Section, we combine the above inner SMC sampler with sampling of the gating network parameters Ψ and partition C. The inner SMC sampler presented in the previous Section is used as the MCMC proposal for the outer SMC sampler. Targeting the tempered extended posterior distribution πt(C, Ψ, Θ(0:t) 1:M , A(t), | X, Y ) using MCMC is called particle MCMC (PMCMC) and is a particular instance of the pseudo-marginal method (Andrieu and Roberts, 2009). When the MCMC method used is Metropolis-Hastings, it is called particle marginal Metropolis Hastings (PMMH) (Andrieu et al., 2010). In particular, letting n denote the index of the MCMC, the proposal consists of sampling (C, Ψ) Q((C, Ψ)n, ) and then (Θ(0:t) 1:M ) Υ(Θ(0:t) 1:M , A(t) | C , Ψ ). The proposal (C, Ψ, Θ(0:t) 1:M ) is then accepted with probability t (Y | X, C )π((C, Ψ) | X)Q((C, Ψ) , (C, Ψ)n) ZM t (Y | X, Cn)π((C, Ψ)n | X)Q((C, Ψ)n, (C, Ψ) ) Mixtures of Gaussian Process Experts with SMC2 Algorithm 1 SMC for fully Bayesian GP estimation. Initialize: Draw Θ(0) m π0(Θ). Set ω(0) m = 1 M for m = 1, . . . , M. Set ZM Estimation of {(Θ, ω)m}M m=1 pt(Θ | X, Y ) p(Y | X, Θ)κ(t)π0(Θ) and π(Y | X): while κ(s) < κ(t) do s = s + 1. Define ω(s) m := p(Y | X, Θ(s 1) m )κ(s) κ(s 1) s 1, following Eq. (8). Select new indices I(s) m according to ω(s) m Draw Θ(s) m Ms(Θ(s 1) m , ), where Ms is an MCMC kernel targeting ps(Θ | X, Y ). m = 1 M for m = 1, . . . , M. end while Algorithm 2 PMCMC step for pt(C, Ψ, Θ | X, Y ) p(Y | X, C, Θ)κ(t)p(C | X, Ψ)π0(Ψ, Θ) Begin with (C, Ψ, Θ(0:t) 1:M ) πt(C, Ψ, Θ(0:t) 1:M , A(t) | X, Y ) and ZM t (Y | X, C). Propose (C, Ψ) Q((C, Ψ), ). 
Propose Θ ,(0:t) 1:M pt(Θ | X, Y, C ) ZM t (Y | X, C ), using SMC Algorithm 1. C, Ψ, Θ(0:t) (and store ZM t (Y | X, C )) with probability t (Y | X, C )π((C, Ψ) | X)Q((C, Ψ) , (C, Ψ)) ZM t (Y | X, C)π((C, Ψ) | X)Q((C, Ψ), (C, Ψ) ) We denote the resulting MCMC kernel by Ht and note that by design πt Ht = πt. We target πt(C(t), Ψ(t), Θ(0:t) 1:M , A(t) | X, Y ) for t = 1, . . . , T using an SMC sampler (Del Moral et al., 2006), where the MCMC mutation at time t is given by the PMMH kernel Ht (indices t are introduced on the variables for convenience and to avoid confusion). This method is called SMC2 owing to the nested SMCs (Chopin et al., 2013). We present pseudo-code for the PMCMC proposal in Algorithm 2. An illustration of the PMCMC proposal flow is illustrated in Figure 6. In particular, we approximate Eq. (9) using the particles {C(t) j , (Θ(0:t) weights {w(t) j=1. Note that we do not need the auxiliary variables A(t) j , as they will always be integrated out. In fact, for estimates we also only need one Θ(t) j,m for each j, sampled uniformly from j,1, . . . Θ(t) , however in principle one can average over the ensemble, which is a Rao-Blackwellisation of the former. Furthermore, if we do not need to estimate functions of Θ, then we only need the normalizing constant estimates ZM t (Y | X, C(t) H ark onen, Wade, Law, and Roininen (C, !, ! Z)n (C, !, ! Z)n+1 PMCMC ! Z inner SMC accept/reject Figure 6: An illustration of one step of the particle Markov chain Monte Carlo (PM- CMC) method, for simulating from the posterior gating network and partition πt(C, Ψ | X, Y ). The inner SMC sampler Υ(Θ(0:t) 1:M , A(t) | C , X, Y ) as shown in Figure 5 delivers an unbiased estimator ZM t (Y | X, C ) in Eq. (8) of pt(Y | X, C ) according to Eq. (6), which is used to decide whether to accept the proposal, following Eq. (10). One or all of the M inner SMC particles Θ(0:t) 1:M,n (6) associated with ZM,n t (Y | X, Cn) can be retained for estimation of the joint πt(C, Ψ, Θ | X, Y ). along with the (C, Ψ)(t) j particles. For J particles, estimating the tempered posterior in Eq. (5) using the model on the extended state space given in Eq. (9), the particle weights are recursively related according to t (Y | X, Cj) ZM t 1(Y | X, Cj) pt(Y | X, Cj, Θ(t 1) m )κ(t) κ(t 1)w(t 1) where the initial particle weights are set as w(0) J . However, the recursive update to the particle weights degrades the sample distribution. This degradation can be quantified using the effective sample size (ESS): We determine κ(t) adaptively so that the relative reduction in the ESS is approximately a predefined learning rate, η. This adaptive scheduling allows one to control the sample degeneracy at each iteration t by keeping the change between the subsequent target posterior distributions πt 1(C, Ψ, Θ(0:t 1) 1:M , A(t 1) | X, Y ) and πt(C, Ψ, Θ(0:t) 1:M , A(t) | X, Y ) approximately constant. Overall, smaller learning rates allow for more aggressive exploration of Mixtures of Gaussian Process Experts with SMC2 the target distribution with fewer iterations T and higher learning rates lead to more iterations T, albeit with more similar subsequent target posterior distributions, resulting in easier inference. In Syed et al. (2024), this is made precise and it is shown that there is a sufficient density of tempering beyond which the algorithm behaves well, and below which it breaks catastrophically. We use η = 0.9, which delivers good performance for our problem. 
More specifically, the next value in the tempering sequence, κ(t), is determined by solving a minimization problem: κ(t) = arg min pt(Y | X, Cj, Θ(t 1) m )κt κ(t 1)w(t 1) where the numerator is the ESS of the current time step according to κt and with the constraint κ(t) > κ(t 1). In particular, we resample the particles after each iteration t of the algorithm for both the inner and outer SMC samplers. One can save a bit of variance by using weighted particles but this was not found to be significant for our numerical examples. Resampling introduces duplicates of the particles into the sample. The MCMC mutation with Ht counteracts this. In particular, we propose new Ψ 1:J particles by separately proposing the gating network means and standard deviations from the weights. We construct the random walk proposal for the means and standard deviations as (µ1:K, Σ1:K) j = (µ1:K, Σ1:K)j + ξµ,Σ,j where ξµ,Σ,j N(0, Σµ,Σ) with Σµ,Σ RNµ,Σ Nµ,Σ and Nµ,Σ = 2DK, being the weighted empirical covariance matrix of the J particles, (µ1:K, Σ1:K)1:J. We enforce positivity of the variance parameters by rejecting negative proposals due to our choice of prior distributions. The gating network weights, as in Lukens et al. (2020), are proposed as 1:K,j = log ν1:K,j + ξν,j where ξν,j N(0, Σν) with Σν RNν Nv and Nv = K, being the weighted empirical covariance matrix of the J particles of the weights. The proposal for the partition C given the proposed gating network parameters Ψ follows the prior, that is c i Categorical(p(xi | Ψ )) independently for i = 1, . . . , n. This leads to convenient cancellations in the acceptance probability. In cases when the clusters are not well separated by the inputs, we may need better proposals for the partitions. Possible alternatives are single-site Gibbs sampling (Rasmussen and Ghahramani, 2002) or split-merge moves which update multiple allocation variables simultaneously, first appearing in Jain and Neal (2004) and more recently in Bouchard-Cˆot e et al. (2017). For the inner SMC, we use similar MCMC updates as for the outer SMC. The proposals are constructed independently for each of the K experts. We propose new Θ k,j,m particles again with a random walk proposal given as k,j,m = Θk,j,m + ξΘk,j,m H ark onen, Wade, Law, and Roininen where ξΘk,j,m N(0, ΣΘk,j) with ΣΘk,j RNθ Nθ and Nθ = D + 3, being the weighted empirical covariance matrix of the M inner particles, (Θk,j,1, . . . , Θk,j,M), corresponding to the GP parameters of the kth expert. Positivity of the variance parameters is again enforced via our choice of prior distributions. We use a criterion to automatically adapt the number of MCMC steps taken to mutate the particles suggested by Chopin and Papaspiliopoulos (2020c). Note that the exact form presented below is defined only in the book s complementary software. Particularly, we perform MCMC steps until a relative decrease in a distance metric is below a threshold δ d(n) d(n 1) d(n 1) < δ, and 2 denotes Euclidean norm. Here, ρ denotes the set of parameters being updated in the MCMC, i.e. ρ = (Ψj, Cj) in the outer SMC and ρ = (Θk,j,m) for the inner SMC. For the nth MCMC step, ρ(n) i denotes the ith of Nρ particles for the inner SMCs, while for the outer SMC, ρ(n) i denotes the ith of Nρ unbiased estimates of the likelihood in Eq. (7). For the outer SMC, we monitor the change in the likelihood in Eq. (7) as it is invariant to label-switching. We present a schematic for the full SMC2 sampler in Figure 7. 
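The two outer-level ingredients just described, the adaptive choice of κ(t) and the PMMH mutation, can be sketched as follows. The bisection search for κ and the symmetric-proposal simplification in the acceptance ratio are assumptions made for illustration; in the paper the partition proposal follows the prior, which introduces additional cancellations in the acceptance probability.

```python
import numpy as np

rng = np.random.default_rng(2)

def next_kappa(loglik, kappa_prev, eta=0.9, tol=1e-6):
    """Choose the next tempering level so that the ESS of the incremental weights
    w_j ~ p(Y | ...)^(kappa - kappa_prev) is about eta * J (particles are assumed
    to have just been resampled), using bisection on (kappa_prev, 1]."""
    J = len(loglik)

    def ess(kappa):
        logw = (kappa - kappa_prev) * loglik
        w = np.exp(logw - logw.max())
        return w.sum() ** 2 / (w ** 2).sum()

    if ess(1.0) >= eta * J:            # we can jump straight to the untempered target
        return 1.0
    lo, hi = kappa_prev, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if ess(mid) > eta * J else (lo, mid)
    return 0.5 * (lo + hi)

def pmmh_step(C, Psi, log_Z, log_prior, propose, run_inner_smc):
    """One particle marginal Metropolis-Hastings move: propose (C', Psi'), rerun the
    inner SMC for an unbiased estimate of the tempered marginal likelihood, and
    accept with the pseudo-marginal ratio (the proposal is assumed symmetric here)."""
    C_new, Psi_new = propose(C, Psi)
    log_Z_new = run_inner_smc(C_new)
    log_acc = log_Z_new + log_prior(C_new, Psi_new) - log_Z - log_prior(C, Psi)
    if np.log(rng.uniform()) < log_acc:
        return C_new, Psi_new, log_Z_new
    return C, Psi, log_Z
```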
Before ending this section, it is relevant to reflect on the complexity of this method. Just as with naive importance sampling, the cost is ultimately dominated by the Gaussian process likelihoods with complexities proportional to O(KN3/K3), assuming N/K data points for each of the K experts, which can be distributed across the K experts. The likelihood needs to be estimated for each of M inner and J outer Θj,m particles at each time step t = 0, . . . , T. For the mutation PMCMC step, this must be done t times at step t, that is when the clustering C changes we must initialize the particles Θj,m π0(Θ) from the prior and rerun the inner SMC sampler. Therefore, the total number of likelihood estimations is MJ T t=0 t = O(T 2MJ), and the total cost is O(T 2MJN3/K2). This seems expensive, particularly if one makes the natural choice T, M, J N, see for example Chopin et al. (2013); Beskos et al. (2014); Chopin and Papaspiliopoulos (2020a), but the point is that, in comparison to naive importance sampling, N4 is much less than the number samples required in importance sampling, e.g. N4 KN in the example of Section 3. Moreover, many evaluations can be performed in parallel at either the level of the inner particles, the K independent GP experts, or by running parallel inner SMC samplers. In terms of memory, keeping track of each inner particle and their ancestor particles requires 2 + D GP kernel hyperparameters for each of of the K experts at each time step t = 0, . . . , T, resulting in a total of (2K + DK)MT numbers to be kept in memory. This is repeated for each of the J outer particles which, along with the inner particles, require 1+2D gating network parameters for each of the K experts and N labels for the partitions Cj again at each time step t = 0, . . . , T. This results in a total memory use of O((2K + DK)MT + (K + 2DK + N)JT). However, we do not use the intermediate results in our predictive inference. Thus, we are able to leave out T and simplify the memory complexity Mixtures of Gaussian Process Experts with SMC2 1 (C, !)(s+1) 2 (C, !)(s+1) 2 (C, !)(t) J (C, !)(s+1) J (C, !)(t) Figure 7: An illustration of the SMC2 sampler for mixtures of Gaussian process experts. The outer SMC sampler uses the J particles for the partitions C(t) j and gating network parameters Ψ(t) j . Each outer particle is associated with an inner SMC sampler Υ(Θ(0:t) 1:M , A(t) | C, X, Y ) which provides an unbiased estimate for the marginal likelihood pt(Y | X, Cj) in Eq. (6) according to Eq. (8). The unbiased marginal likelihood estimate can then be used as a PMMH proposal for the MCMC acceptance ratio in Eq. (10), as shown in Fig.6. 107 107:5 108 Likelihood evaluations Test log-likelihood 0 5 10 15 Likelihood evaluations #107 Wall time [s] Figure 8: Efficiency of the methods, measured in terms of the test log-likelihood score (a) and wall-clock time (b), as a function of the total number of likelihood evaluations. to O((2K + DK)M + (K + 2DK + N)J) and keep track of particles only at the current iteration. The efficiency of the methods in terms of number of likelihood evaluations is demonstrated in Figure 8. From the left panel it is clear that SMC2 converges quickly and with a small spread, whereas IS has a large spread and converges more slowly. IS is embarrassingly parallel, while there one expects a negligible hardware-dependent communication overhead for SMC2 in between likelihood evaluations. 
We consider likelihood evaluations to be a H ark onen, Wade, Law, and Roininen Algorithm 3 SMC2 sampled mixture of Gaussian process experts. Initialize: Set t = 0 and κ(t) = 0. Draw U (0) j p(C, Ψ, Θ1, . . . ΘM | X), 0 (Cj) = 1. Set initial weights w(0) = (w(0) 1 , . . . , w(0) J ), where w(0) Estimation of π(C, Ψ, Θ | X, Y ): while κ(t) < 1 do t = t + 1. Determine κ(t). Resample U (t 1) j given particles weights w(t), where w(t) j follow Eq. (11). Run one step of the inner SMC sampler (Algorithm 1) for U (t 1) j to obtain U (t) t (Cj) according to Eq. (8). Draw U (t) j Kt( U (t) j , ) PMCMC kernel (Algorithm 2) targeting πt in Eq. (9). end while Prediction: Compute the predictive distributions according to Eq. (12). fair hardware-agnostic proxy for wall-clock time, given that communication arises at higher order. The right panel shows median wall-clock time of the two methods as a function of the likelihood evaluations on a fixed computer architecture with 64 physical central processing unit cores of an AMD Ryzen Threadripper 3990X. Note that this architecture limits the parallelism of both methods equally. Here, SMC2 is slower by a factor of 2.2. For further comparisons and discussion on run time for IS versus alternative approaches, such as robust Bayesian Committee Machine (RBCM) (Deisenroth and Ng, 2015; Liu et al., 2018) and treed GP (Gramacy, 2007; Gramacy and Lee, 2008; Gramacy and Taddy, 2010), see the original study by Zhang and Williamson (2019). 6. Predictive distribution We construct the predictive distribution by computing the individual GP predictive distributions for each expert given the partition Cj and GP parameters Θj,m, for every particle. These are then weighted by the particle weights and the gating network with parameters Ψj. Specifically, for each of the K GPs, the predictive mean and variance for a given input x RD is given by k,j,m := E [y | x , X, Y, Cj, Ψj, mk,j,m, θk,j,m] = Σ Σ 1 X (Yk,j mk,j,m) + mk,j,m, Mixtures of Gaussian Process Experts with SMC2 k,j,m := Var [y | x , X, Y, Cj, Ψj, mk,j,m, θk,j,m] = Σ Σ Σ 1 where Σ = Σ(x , x ; θk,j,m), Σ = Σ(x , Xk,j; θk,j,m), and ΣX = Σ(Xk,j, Xk,j; θk,j,m) are used for brevity. The individual normal marginal predictive distributions are combined into a mixture distribution: π(y | c = k, x , X, Y, Cj, Ψj) = ωm N(y | Ey k,j,m, Vary which, in practice, would be evaluated over a grid of y values for each x value where prediction is desired. These mixture distributions are further compounded across the K experts according to π(y | x , X, Y, Cj, Ψj) = j | c = k, x , X, Y, Cj, Ψj)pk(x | Ψj), where the distributions are weighted according to the gating network defined in Eq. (1). Lastly, the distributions obtained for each (Cj, Ψj) particle are added together according to their particle weights as π(y | x , X, Y ) = wjπ(y | x , X, Y, Cj, Ψj). (12) Similarly, the predictive mean is computed as E[y | x , X, Y ] = Note that by resampling at each iteration t, the weights are wj = 1 J and ωi = 1 M . Pseudocode for the method is shown in Algorithm 3. 7. Numerical examples We demonstrate the method using multiple simulated and real-life one-dimensional data sets along with two multi-dimensional data sets. The simulated one-dimensional sets of data consist of 1) a continuous function with homoskedastic noise and 2) a discontinuous function with heteroskedastic noise. The first data set is generated using yk = sin(πxk) cos((πxk)3) + ε(xk) where ε(xk) N(0, 0.152) meaning that the measurement noise is homoskedastic. 
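For reference, the first synthetic data set and the normalization applied to all data sets can be generated as in the sketch below; the sample size and the uniform input design are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def first_synthetic_dataset(N=200):
    """First synthetic data set: y = sin(pi x) cos((pi x)^3) plus homoskedastic
    N(0, 0.15^2) noise; N and the uniform design on [0, 1] are assumptions."""
    x = rng.uniform(0.0, 1.0, size=N)
    y = np.sin(np.pi * x) * np.cos((np.pi * x) ** 3) + rng.normal(0.0, 0.15, size=N)
    return x[:, None], y

def normalize(X, Y):
    """Scale inputs to [0, 1]^D, shift outputs so that min Y = 0, and scale them
    to unit variance, as described in the text."""
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    Y = Y - Y.min()
    return X, Y / Y.std()
```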
In contrast, the second data set is generated by sin(60xk) 2 + ε1(xk), xk 0.3 10 + ε2(xk), 0.3 < xk 0.5 2 cos(4πxk) 10 + ε3(xk) 0.5 < xk 1 H ark onen, Wade, Law, and Roininen where ε1(xk) N(0, 0.52), ε2(xk) N(0, 0.252), and ε3(xk) N(0, 12) are used to generate heteroskedastic measurement errors. We also consider the motorcycle data set studied in Silverman (1985), which shows obvious heteroskedastic noise and non-stationarity, with areas exhibiting different behaviours. The data measures head acceleration in a simulated motorcycle accident for testing helmets; the measurements are practically constant initially, with a low noise level, which abruptly turns oscillating with high noise and gradually decaying to a possibly constant state while still having clearly higher measurement noise in comparison to the initial constant state. Lastly, we also study the 3D Langley glide-back booster simulation and 4D Colorado precipitation data sets. The Langley glide-back booster is a rocket booster designed by NASA to allow the booster to glide down from the atmosphere instead of requiring crashing it into the ocean. A more detailed description of the booster and the related data set is provided in Gramacy and Lee (2008). As in Gramacy and Lee (2008), we model the lift as the output variable and the input variables include the vehicle speed at atmospheric re-entry, angle of attack, and the sideslip angle which are referred to in the data set as Mach, α, and β, respectively. The data set consists of 3167 data points of which we use the subset of N = 1900 data points. The Colorado precipitation data set has been collected by the Colorado Climate Center and is available online. The data set consists of monthly total precipitation amounts measured at various weather stations located in Colorado. The monthly precipitation is modelled as the output variable with the input variables being the station longitude, latitude, and elevation, together with the measurement month. We use data measured over October December 2022 to showcase the distinct phenomenon of increased mean precipitation at higher elevations during winter months (Mahoney et al., 2015). Our data set corresponding to the period October December 2022 consists of 392 data points. For all data sets, the inputs and outputs are normalized such that X [0, 1]D, min Y = 0, and Var[Y ] = 1. The prior distributions for the gating network parameters Ψ and experts parameters Θ are summarized in Table 1. Specifically, we use normal and half-normal priors for the gating network means and variances, respectively, along with a uniform prior for the GP constant mean functions and half-normal priors for variance and length scale parameters of the GPs. To encourage well-separated components along the input space, the prior means for the gating network parameters are constructed using a linearly spaced grid along each dimension. For the synthetic data sets, we present the estimated predictive densities as 90% highest density regions in Figures 9 10 for both the Mo E with SMC2 (SMC2-Mo E) and the Mo E with IS (IS-Mo E), along with ground truth and the corresponding results obtained using a single GP, RBCM (Deisenroth and Ng, 2015; Liu et al., 2018), and treed GP (Gramacy, 2007; Gramacy and Lee, 2008; Gramacy and Taddy, 2010). For IS-Mo E and SMC2-Mo E, we employ an upper bound of K = 7 experts and consider different choices of the Dirichlet concentration parameter α = 0.1, 1, K/2. 
For IS-Mo E, we use the MAP approximation used in the original paper without any additional modifications such as stochastic sub-sampling. In all cases, the number of outer particles J, inner particles M, and time steps T are chosen so that MJT 2 is similar to the number of IS samples in order to have similar run times for IS and SMC2 and fair comparisons based on a fixed computational budget. The improvement in the estimated predictive densities with SMC2 is clearly evident. For the Mixtures of Gaussian Process Experts with SMC2 0 0.2 0.4 0.6 0.8 1 x (a) Ground truth. 0 0.2 0.4 0.6 0.8 1 x (b) SMC2-Mo E, α = 0.1. 0 0.2 0.4 0.6 0.8 1 x (c) SMC2-Mo E, α = 1. 0 0.2 0.4 0.6 0.8 1 x (d) SMC2-Mo E, α = K/2. 0 0.2 0.4 0.6 0.8 1 x (e) IS-Mo E, α = 0.1. 0 0.2 0.4 0.6 0.8 1 x (f) IS-Mo E, α = 1. 0 0.2 0.4 0.6 0.8 1 x (g) IS-Mo E, α = K/2. 0 0.2 0.4 0.6 0.8 1 x 0 0.2 0.4 0.6 0.8 1 x 0 0.2 0.4 0.6 0.8 1 x (j) treed GP. Figure 9: First synthetic data set: non-stationary function, homoskedastic noise. In a), the generated data points are shown (in blue) with a 90% highest density region of the underlying ground truth distribution in gray and the true function as a solid black line. In b), c), and d) the highest density region and posterior median for Mo E with SMC2 are shown for different values of the Dirichlet concentration parameter α = 0.1, 1, K/2. In e), f), and g) similar results are shown for IS-Mo E with the same Dirichlet concentration parameters. In the bottom row, h), i), and j) show the predictive distributions for a single GP, RBCM, and treed GP, respectively. H ark onen, Wade, Law, and Roininen 0 0.2 0.4 0.6 0.8 1 x (a) Ground truth. 0 0.2 0.4 0.6 0.8 1 x (b) SMC2-Mo E, α = 0.1. 0 0.2 0.4 0.6 0.8 1 x (c) SMC2-Mo E, α = 1. 0 0.2 0.4 0.6 0.8 1 x (d) SMC2-Mo E, α = K/2. 0 0.2 0.4 0.6 0.8 1 x (e) IS-Mo E, α = 0.1. 0 0.2 0.4 0.6 0.8 1 x (f) IS-Mo E, α = 1. 0 0.2 0.4 0.6 0.8 1 x (g) IS-Mo E, α = K/2. 0 0.2 0.4 0.6 0.8 1 x 0 0.2 0.4 0.6 0.8 1 x 0 0.2 0.4 0.6 0.8 1 x (j) treed GP. Figure 10: Second synthetic data set: discontinuous function, heteroskedastic noise. In a) the generated data points are shown (in blue) with a highest density region of the underlying ground truth distribution in gray and the true function as a solid black line. In b), c), and d) the highest density region and posterior median for Mo E with SMC2 are shown for different values of the Dirichlet concentration parameter α = 0.1, 1, K/2. In e), f), and g) similar results are shown for ISMo E with same Dirichlet concentration parameters. In the bottom row, h), i), and j) show the predictive distributions for a single GP, RBCM, and treed GP, respectively. Mixtures of Gaussian Process Experts with SMC2 L1 L2 L1 median LL SMC2-Mo E, α = 0.1 375.731 0.860 135.978 -296.381 SMC2-Mo E, α = 1 402.887 0.881 139.738 -298.903 SMC2-Mo E, α = K/2 403.566 0.893 138.866 -299.365 IS-Mo E, α = 0.1 588.068 1.042 154.580 -293.514 IS-Mo E, α = 1 599.560 1.073 159.100 -299.317 IS-Mo E, α = K/2 602.536 1.075 159.797 -297.154 GP 373.676 0.912 142.085 -298.689 RBCM 424.606 1.063 148.719 -297.406 Treed GP 363.797 0.855 134.044 -297.720 Table 2: First synthetic data set: non-stationary function, homoskedastic noise. Column L1 shows the L1 vector norm of the difference between the estimated distributions and the ground truth distribution used to generate the data set for a given method. 
Column L2 shows the L2 vector norm of the difference between the aforementioned distributions and column L1 median shows the L1 vector norm of the distance between the posterior predictive median and the ground truth median. Column LL displays predictive log-likelihood results. L1 L2 L1 median LL SMC2-Mo E, α = 0.1 509.876 2.368 46.418 -224.955 SMC2-Mo E, α = 1 530.417 2.406 46.407 -222.160 SMC2-Mo E, α = K/2 558.190 2.499 47.203 -225.534 IS-Mo E, α = 0.1 993.265 3.301 84.464 -258.943 IS-Mo E, α = 1 934.844 3.167 85.765 -248.291 IS-Mo E, α = K/2 925.779 3.133 85.209 -248.900 GP 1078.485 3.496 95.274 -316.766 RBCM 1217.425 3.770 187.151 -316.085 Treed GP 655.436 2.857 87.843 -246.613 Table 3: Second synthetic data set: discontinuous function, heteroskedastic noise. Column L1 shows the L1 vector norm of the difference between the estimated distributions and the ground truth distribution used to generate the data set for a given method. Column L2 shows the L2 vector norm of the difference between the aforementioned distributions and column L1 median shows the L1 vector norm of the distance between the posterior predictive median and the ground truth median. Column LL displays predictive log-likelihood results. first data set in Figure 9, IS produces disperse densities that are too wiggly on the left side, while SMC2 produces density estimates that better match the ground truth. Both the GP and RBCM, which is designed for distributed inference of a stationary GP model, also provide density estimates that are too wiggly on the left side, due to the non-stationary H ark onen, Wade, Law, and Roininen 0 0.2 0.4 0.6 0.8 1 x 1 2 3 4 5 6 7 Non-empty experts 0 0.2 0.4 0.6 0.8 1 x 1 2 3 4 5 6 7 Non-empty experts 0 0.2 0.4 0.6 0.8 1 x 1 2 3 4 5 6 7 Non-empty experts Figure 11: First synthetic data set: non-stationary function, homoskedastic noise. On the left, the highest density regions and median for SMC2-Mo E are shown in gray and black, along with the data points in blue, for different values of the Dirichlet concentration parameter α. In the middle column, the posterior distribution for the number of non-empty experts is shown. The last column shows the posterior similarity matrices summarizing the posterior over partitions. Values of α = 0.1, α = 1, and α = K/2 are used and correspond to the top, middle, and bottom rows, respectively. behavior of the underlying function. Instead, the treed GP, which is a mixture of GP experts with a gating network that partitions the input space into axis-aligned regions, is better able to recover the non-stationary behavior of the true function. However, we note that the treed GP R package (Gramacy, 2007) employs MCMC and thus, comes with an increased computational cost (see for example (Zhang and Williamson, 2019) who find that treed GP Mixtures of Gaussian Process Experts with SMC2 0 0.2 0.4 0.6 0.8 1 x 1 2 3 4 5 6 7 Non-empty experts 0 0.2 0.4 0.6 0.8 1 x 1 2 3 4 5 6 7 Non-empty experts 0 0.2 0.4 0.6 0.8 1 x 1 2 3 4 5 6 7 Non-empty experts Figure 12: Second synthetic data set: discontinuous function, heteroskedastic noise. On the left, the highest density regions and median for SMC2-Mo E are shown in gray and black, along with the data points in blue, for different values of the Dirichlet concentration parameter α. In the middle column, the posterior distribution for the number of non-empty experts is shown. The last column shows the posterior similarity matrices summarizing the posterior over partitions. 
We note that the SMC2 scheme developed here is general and could be used for other choices of gating networks, such as trees, for faster inference with treed GPs.

In the second data set, the partition plays a crucial role due to the discontinuous nature of the true function and heteroskedastic noise. As postulated, in this setting, the results of IS-MoE in Figure 10 are quite poor, while SMC2-MoE provides substantial improvements, recovering the ground truth well. The GP and RBCM either under-smooth or over-smooth due to the stationary assumption of the models. The results with the treed GP are better but highlight the difficulty of inferring the cut-off values determining the treed partition structure. We also report, in Tables 2 and 3 for the two synthetic data sets, the L1 and L2 vector norms between the computed predictive densities and the ground truth densities, the L1 vector norm between the obtained predictive median and the ground truth median, and the predictive log-likelihoods.

Due to the important role of the partition on predictions, we study and visualize uncertainty in the clustering C by constructing the posterior similarity matrix (PSM) (Gadd et al., 2020). The PSM is an N by N matrix whose elements represent the posterior probability that two data points are clustered together; that is, the (i, i')th element is given by p(ci = ci' | X, Y), which is approximated by the weighted average across the particles of the indicators that ci and ci' belong to the same expert (a short sketch of this computation is given below). For the synthetic data sets, Figures 11 and 12 illustrate the posterior for the number of non-empty experts (middle column) and the posterior similarity matrices (right column) in comparison to the highest density region (left column), with rows corresponding to different values of the Dirichlet concentration parameter α = 0.1, 1, K/2. The posterior similarity matrices also show how concentrated the sampled partitions Cj are, with a lack of more uncertain gray areas in the PSM indicating a more concentrated estimate of the partition.

A smaller value of α encourages sparsity and fewer clusters. In particular, when α is small, we obtain larger clusters, and thus increased computational complexity, but smoother estimates. On the other hand, when α is large, we obtain more clusters of small sizes and reduced computational costs, but estimates tend to be less smooth. In general, we recommend selecting K to be large enough (an upper bound on the number of clusters) and choosing α to balance computational cost and smoothness in the estimates.

Results for the motorcycle data set with SMC2-MoE and K = 7 are provided in Figure 13 and highlight the ability of the model to recover both the non-stationary behavior and heteroskedastic noise present in the data. Rows correspond to different values of the Dirichlet concentration parameter α, highlighting the trade-off between sparsity, computational cost, and smoothness.

For the NASA data, Figure 14 illustrates the predictive densities as highest density regions for a one-dimensional slice of the 3D input space, along with the posterior over the number of clusters and the posterior similarity matrix for SMC2-MoE and K = 27. In this case, a larger number of experts and a larger value of α are required to capture the ridge at Mach = 1 showing the change from subsonic to supersonic flow.
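To make the posterior similarity matrix construction described above concrete, a minimal sketch is given here; the array names are illustrative assumptions, with the expert allocations of the N data points under each of the J outer particles and the corresponding normalized particle weights assumed to be available.

```python
import numpy as np

def posterior_similarity_matrix(allocations, weights):
    """Sketch of the PSM: allocations[j, i] is the expert label of data point i under
    outer particle j, and weights[j] is that particle's normalized weight. Entry (i, k)
    approximates p(c_i = c_k | X, Y) as a weighted average of co-clustering indicators."""
    J, N = allocations.shape
    psm = np.zeros((N, N))
    for j in range(J):
        same_expert = allocations[j][:, None] == allocations[j][None, :]
        psm += weights[j] * same_expert
    return psm
```

Plotting such a matrix as an image produces summaries like those in the right-hand columns of Figures 11 and 12.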
Figure 13: Motorcycle data. On the left, the highest density regions and median are shown in gray and black along with the data points in blue for different values of the Dirichlet concentration parameter α. In the middle column, the posterior distribution for the number of non-empty experts is shown. The last column shows the posterior similarity matrices summarizing the posterior over partitions. Values of α = 0.1, α = 1, and α = K/2 are used and correspond to the top, middle, and bottom plots, respectively.

Figure 14: 3D NASA Langley glide-back booster simulation data. On the left, the highest density regions and median for a one-dimensional slice of the input space are shown in gray and black along with the data points in blue for different values of the Dirichlet concentration parameter α. In the middle column, the posterior distribution for the number of non-empty experts is shown. The last column shows the posterior similarity matrices summarizing the posterior over partitions. Values of α = 2.7, α = 6.75, and α = K/2 = 13.5 are used and correspond to the top, middle, and bottom plots, respectively.

Figure 15: 4D Colorado precipitation data. On the left, the highest density regions and median for a one-dimensional slice of the input space are shown in gray and black along with the data points in blue for different values of the Dirichlet concentration parameter α. In the middle column, the posterior distribution for the number of non-empty experts is shown. The last column shows the posterior similarity matrices summarizing the posterior over partitions. Values of α = 1.6, α = 4, and α = K/2 = 8 are used and correspond to the top, middle, and bottom plots, respectively.

Lastly, the highest density regions of the predictive densities, the posterior over the number of clusters, and the posterior similarity matrix for SMC2-MoE with K = 16 are presented in Figure 15 for the 4D Colorado precipitation data set, with the 392 data points superimposed over the predictive densities. The presented one-dimensional predictive distribution is computed for a one-dimensional slice of the 4D input space. The slice corresponds to a latitude of 39.00°, with a longitude range between the minimum and maximum longitude values present in the data set, at a time point corresponding to November, and with predictive elevation locations computed using linear interpolation based on the latitudes, longitudes, and elevations of the station locations in the data set. The results show a clear transition around x = 0.6, which corresponds to a longitude of approximately -105.586°. The higher-variation area at x ≲ 0.6 corresponds to areas of higher elevation with higher mean monthly precipitation (Mahoney et al., 2015).
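The prediction inputs along such a slice can be assembled as in the following minimal sketch, which uses SciPy's griddata for the linear interpolation of elevations; the function, the array names, and the numeric encoding of the time input are illustrative assumptions rather than the exact preprocessing used for Figure 15.

```python
import numpy as np
from scipy.interpolate import griddata

def slice_inputs(station_lat_lon, station_elev, latitude=39.00, n_points=200):
    """Build prediction inputs (latitude, longitude, elevation, month) along a
    fixed-latitude slice, interpolating elevation from the station locations (sketch)."""
    lon = np.linspace(station_lat_lon[:, 1].min(), station_lat_lon[:, 1].max(), n_points)
    lat = np.full(n_points, latitude)
    # Linear interpolation of elevation at the slice locations from the observed stations.
    elev = griddata(station_lat_lon, station_elev,
                    np.column_stack([lat, lon]), method="linear")
    month = np.full(n_points, 11)  # assumed numeric encoding of the November time point
    return np.column_stack([lat, lon, elev, month])
```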
8. Discussion

In this work, we introduced the use of SMC2 in the context of mixtures of Gaussian process experts to conduct full Bayesian inference. We have discussed and demonstrated how the previously proposed importance sampling based approach requires the number of importance samples to be exponential in the Kullback-Leibler divergence between the prior and posterior, which is computationally infeasible in many examples. This is particularly true for complex targets that demonstrate departures from standard GP models, such as non-stationarity, heteroskedasticity, and discontinuity. To overcome this, we extended the importance sampling based approach by using nested sequential Monte Carlo samplers to reduce the number of samples required while still being embarrassingly parallel. These benefits offset the additional computational complexity introduced by the nested sequential Monte Carlo samplers, particularly for complex target distributions. The inner SMC can also be replaced by optimization combined with a Laplace approximation to relieve some of the computational burden. This similarly improves the required sample size from exponential in the KL divergence, as for naive importance sampling, to polynomial in the KL divergence, as for an SMC sampler.

We have focused on gating networks defined through normalized kernels but highlight that the SMC2 scheme can also be used for other choices of gating networks, such as the tree-based construction of treed GPs. An advantage of the normalized kernel construction is that we can include priors on the kernel locations which encourage partitions associated to distinct regions. This is in contrast to Zhang and Williamson (2019), who assume identically distributed kernel locations, resulting in overlapping partitions in the prior and IS proposal. While our prior is constructed based on a grid of the input space (Table 1), repulsive priors could be employed in higher dimensions (Petralia et al., 2012). Another advantage is the data-driven choice of the number of clusters, through a sparsity-promoting prior on the weights. Our experiments highlight that the concentration parameter α in the sparsity-promoting prior can be selected to balance sparsity, computational cost, and smoothness. One might also consider an automatic approach for selecting or estimating α. In this case, the SMC framework should allow us to compute an unbiased estimator of the marginal likelihood given α, followed by a particle marginal Metropolis-Hastings update or stochastic gradient descent for empirical Bayes optimization; a schematic of such an update is sketched at the end of this section.

In future work, we aim to use mixtures of Gaussian process experts as priors for inverse problems such as in computed tomography or deconvolution. In computed tomography, mixtures of Gaussian processes could model areas corresponding to differing tissues in the human body or materials in inanimate objects; that is, this would be a flexible alternative to level set methods or edge-preserving priors, such as total variation priors. For deconvolution, mixtures could be used to improve results by enabling modelling of discontinuous functions. In these complex settings, efficient inference schemes for mixtures of GP experts are required. Additionally, it would be interesting to combine online GP methods (Csató and Opper, 2000; Ranganathan et al., 2011; Law and Zankin, 2022) with our methodology for computational speed-ups and to extend the approach to alternative likelihoods over the partitions.
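As a schematic of the automatic approach to α mentioned above, the following sketch illustrates a particle marginal Metropolis-Hastings update on the concentration parameter. It assumes a routine smc2_log_marginal_likelihood that returns the log of an unbiased SMC2 estimate of the marginal likelihood for a given α; this routine and all names here are illustrative assumptions, not part of the implementation used in this paper.

```python
import numpy as np

def pmmh_alpha(data, log_prior, smc2_log_marginal_likelihood,
               n_iter=100, alpha0=1.0, step=0.1, seed=0):
    """Schematic particle marginal Metropolis-Hastings chain on alpha, using an
    unbiased SMC2 estimate of the marginal likelihood (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    alpha = alpha0
    log_z = smc2_log_marginal_likelihood(alpha, data)
    chain = []
    for _ in range(n_iter):
        # Log-scale random-walk proposal keeps alpha positive.
        alpha_prop = np.exp(np.log(alpha) + step * rng.standard_normal())
        log_z_prop = smc2_log_marginal_likelihood(alpha_prop, data)
        # Accept/reject with the estimated marginal likelihoods in place of exact ones;
        # log(alpha_prop / alpha) is the Jacobian term of the log-scale proposal.
        log_ratio = (log_z_prop + log_prior(alpha_prop) + np.log(alpha_prop)
                     - log_z - log_prior(alpha) - np.log(alpha))
        if np.log(rng.uniform()) < log_ratio:
            alpha, log_z = alpha_prop, log_z_prop
        chain.append(alpha)
    return np.array(chain)
```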
Acknowledgments

TH and LR were supported by the Research Council of Finland (grant numbers 327734, 336787, 353094, and 359183). SW was supported by the Royal Society of Edinburgh (RSE) (grant number 69938).

References

Christophe Andrieu and Gareth O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697–725, 2009.

Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.

Isadora Antoniano-Villalobos, Sara Wade, and Stephen G. Walker. A Bayesian nonparametric regression model with normalized weights: a study of hippocampal atrophy in Alzheimer's disease. Journal of the American Statistical Association, 109(506):477–490, 2014.

Alexandros Beskos, Dan Crisan, and Ajay Jasra. On the stability of sequential Monte Carlo methods in high dimensions. The Annals of Applied Probability, 24(4):1396–1445, 2014.

Alexandre Bouchard-Côté, Arnaud Doucet, and Andrew Roth. Particle Gibbs split-merge sampling for Bayesian inference in mixture models. Journal of Machine Learning Research, 18(28):1–39, 2017.

François Caron and Arnaud Doucet. Bayesian nonparametric models on decomposable graphs. In Advances in Neural Information Processing Systems, volume 22. Curran Associates, Inc., 2009.

Carlos M. Carvalho, Hedibert F. Lopes, Nicholas G. Polson, and Matt A. Taddy. Particle learning for general mixtures. Bayesian Analysis, 5(4):709–740, 2010.

Sourav Chatterjee and Persi Diaconis. The sample size required in importance sampling. The Annals of Applied Probability, 28(2):1099–1135, 2018.

Nicolas Chopin. A sequential particle filter method for static models. Biometrika, 89(3):539–552, 2002.

Nicolas Chopin and Omiros Papaspiliopoulos. An Introduction to Sequential Monte Carlo, pages 357–370. Springer International Publishing, 2020a.

Nicolas Chopin and Omiros Papaspiliopoulos. An Introduction to Sequential Monte Carlo. Springer International Publishing, 2020b.

Nicolas Chopin and Omiros Papaspiliopoulos. An Introduction to Sequential Monte Carlo, page 332. Springer International Publishing, 2020c.

Nicolas Chopin, Pierre E. Jacob, and Omiros Papaspiliopoulos. SMC2: an efficient algorithm for sequential analysis of state space models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):397–426, 2013.

Guido Consonni and Piero Veronese. Conditionally reducible natural exponential families and enriched conjugate priors. Scandinavian Journal of Statistics, 28(2):377–406, 2001.

Lehel Csató and Manfred Opper. Sparse representation for Gaussian process models. In Advances in Neural Information Processing Systems, volume 13. MIT Press, 2000.

Chenguang Dai, Jeremy Heng, Pierre E. Jacob, and Nick Whiteley. An invitation to sequential Monte Carlo samplers. Journal of the American Statistical Association, 117(539):1587–1600, 2022.

Andreas Damianou and Neil D. Lawrence. Deep Gaussian processes. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, volume 31 of Proceedings of Machine Learning Research, pages 207–215. PMLR, 2013.

Marc Peter Deisenroth and Jun Wei Ng. Distributed Gaussian processes. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, volume 37, pages 1481–1490. JMLR.org, 2015.

Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.

Alexander Denzel and Johannes Kästner.
Gaussian process regression for geometry optimization. The Journal of Chemical Physics, 148(9):094114, 2018.

Clement Etienam, Kody Law, and Sara Wade. Ultra-fast deep mixtures of Gaussian process experts. arXiv preprint arXiv:2006.13309, 2020.

Paul Fearnhead. Particle filters for mixture models with an unknown number of components. Statistics and Computing, 14(1):11–21, 2004.

Paul Fearnhead and Loukia Meligkotsidou. Filtering methods for mixture models. Journal of Computational and Graphical Statistics, 16(3):586–607, 2007.

Maurizio Filippone, Francesco Camastra, Francesco Masulli, and Stefano Rovetta. A survey of kernel and spectral methods for clustering. Pattern Recognition, 41(1):176–190, 2008.

Charles Gadd, Sara Wade, and Alexis Boukouvalas. Enriched mixtures of generalised Gaussian process experts. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108, pages 3144–3154. PMLR, 2020.

Andrew Gelman. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, 1(3):515–534, 2006.

Isobel Claire Gormley and Sylvia Frühwirth-Schnatter. Mixture of experts models. Handbook of Mixture Analysis, pages 271–307, 2019.

Robert B. Gramacy. tgp: An R package for Bayesian nonstationary, semiparametric nonlinear regression and design by treed Gaussian process models. Journal of Statistical Software, 19(9):1–46, 2007.

Robert B. Gramacy and Herbert K. H. Lee. Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103(483):1119–1130, 2008.

Robert B. Gramacy and Nicholas G. Polson. Particle learning of Gaussian process models for sequential design and optimization. Journal of Computational and Graphical Statistics, 20(1):102–118, 2011.

Robert B. Gramacy and Matthew Taddy. Categorical inputs, sensitivity analysis, optimization and importance tempering with tgp version 2, an R package for treed Gaussian process models. Journal of Statistical Software, 33(6):1–48, 2010.

Clara Grazian and Christian P. Robert. Jeffreys priors for mixture estimation: properties and alternatives. Computational Statistics & Data Analysis, 121:149–163, 2018.

Markus Heinonen, Henrik Mannerström, Juho Rousu, Samuel Kaski, and Harri Lähdesmäki. Non-stationary Gaussian process regression with Hamiltonian Monte Carlo. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 732–740. PMLR, 2016.

A. Ho, J. Citrin, F. Auriemma, C. Bourdelle, F. J. Casson, Hyun-Tae Kim, P. Manas, G. Szepesi, H. Weisen, and JET Contributors. Application of Gaussian process regression to plasma turbulent transport model validation via integrated modelling. Nuclear Fusion, 59(5):056007, 2019.

Yuji Iikubo, Shunsuke Horii, and Toshiyasu Matsushima. Sparse Bayesian hierarchical mixture of experts and variational inference. In 2018 International Symposium on Information Theory and Its Applications (ISITA), pages 60–64, 2018.

Sonia Jain and Radford M. Neal. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13(1):158–182, 2004.

Ajay Jasra, Kody J. H. Law, and Fangyuan Yu. Unbiased filtering of a class of partially observed diffusions. Advances in Applied Probability, 54(3):661–687, 2022.

Michael I. Jordan and Robert A. Jacobs.
Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.

Kody J. H. Law and Vitaly Zankin. Sparse online variational Bayesian regression. SIAM/ASA Journal on Uncertainty Quantification, 10(3):1070–1100, 2022.

Feng Li, Mattias Villani, and Robert Kohn. Modeling conditional densities using finite smooth mixtures. Mixtures: Estimation and Applications, pages 123–144, 2011.

Xinzhu Liang, Joseph M. Lukens, Sanjaya Lohani, Brian T. Kirby, Thomas A. Searles, Xin Qiu, and Kody J. H. Law. Scalable Bayesian Monte Carlo: fast uncertainty estimation beyond deep ensembles. arXiv preprint arXiv:2505.13585, 2025.

Haitao Liu, Jianfei Cai, Yi Wang, and Yew-Soon Ong. Generalized robust Bayesian committee machine for large-scale Gaussian process regression. In Proceedings of Machine Learning Research, volume 80, pages 3131–3140. International Machine Learning Society (IMLS), 2018.

Joseph M. Lukens, Kody J. H. Law, Ajay Jasra, and Pavel Lougovski. A practical and efficient approach for Bayesian quantum state estimation. New Journal of Physics, 22(6):063038, 2020.

Steven N. MacEachern, Merlise Clyde, and Jun S. Liu. Sequential importance sampling for nonparametric Bayes models: the next generation. The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 27(2):251–267, 1999.

Kelly Mahoney, F. Martin Ralph, Klaus Wolter, Nolan Doesken, Michael Dettinger, Daniel Gottas, Timothy Coleman, and Allen White. Climatology of extreme daily precipitation in Colorado and its diverse spatial and seasonal variability. Journal of Hydrometeorology, 16(2):781–792, 2015.

Vikash K. Mansinghka, Daniel M. Roy, Ryan Rifkin, and Josh Tenenbaum. AClass: a simple, online, parallelizable algorithm for probabilistic classification. In Proceedings of Machine Learning Research, volume 2, pages 315–322. PMLR, 2007.

Saeed Masoudnia and Reza Ebrahimpour. Mixture of experts: a literature survey. Artificial Intelligence Review, 42(2):275–293, 2014.

Edward Meeds and Simon Osindero. An alternative infinite mixture of Gaussian process experts. Advances in Neural Information Processing Systems, 18:883–890, 2005.

Ehsan Momeni, Mohammad Bagher Dowlatshahi, Fereydoon Omidinasab, Harnedi Maizir, and Danial Jahed Armaghani. Gaussian process regression technique to estimate the pile bearing capacity. Arabian Journal for Science and Engineering, 45(10):8255–8267, 2020.

Pierre Del Moral. Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Springer New York, NY, 2004.

Iman Mossavat and Oliver Amft. Sparse Bayesian hierarchical mixture of experts. In 2011 IEEE Statistical Signal Processing Workshop (SSP), pages 653–656, 2011.

Trung Nguyen and Edwin Bonilla. Fast allocation of Gaussian process experts. In International Conference on Machine Learning, pages 145–153. PMLR, 2014.

Andriy Norets. Approximation of conditional densities by smooth mixtures of regressions. The Annals of Statistics, 38(3):1733–1766, 2010.

Dominik J. Otto, Cailin Jordan, Brennan Dury, Christine Dien, and Manu Setty. Quantifying cell-state densities in single-cell phenotypic landscapes using Mellon. Nature Methods, 21(7):1185–1195, 2024.

Christopher Paciorek and Mark Schervish. Nonstationary covariance functions for Gaussian process regression. In Advances in Neural Information Processing Systems, volume 16, pages 273–280. MIT Press, 2003.

Francesca Petralia, Vinayak Rao, and David Dunson. Repulsive mixtures.
Advances in Neural Information Processing Systems, 25, 2012.

Emilio Porcu, Jorge Mateu, and Moreno Bevilacqua. Covariance functions that are stationary or nonstationary in space and stationary in time. Statistica Neerlandica, 61(3):358–382, 2007.

Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.

Fernando A. Quintana, Peter Müller, Alejandro Jara, and Steven N. MacEachern. The dependent Dirichlet process and related models. Statistical Science, 37(1):24–41, 2022.

Ananth Ranganathan, Ming-Hsuan Yang, and Jeffrey Ho. Online sparse Gaussian process regression and its applications. IEEE Transactions on Image Processing, 20(2):391–404, 2011.

Carl E. Rasmussen and Zoubin Ghahramani. Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems 14, pages 881–888. MIT Press, 2002.

Robert R. Richardson, Michael A. Osborne, and David A. Howey. Gaussian process regression for forecasting battery state of health. Journal of Power Sources, 357:209–219, 2017.

Judith Rousseau and Kerrie Mengersen. Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):689–710, 2011.

Simo Särkkä. Bayesian Filtering and Smoothing. Cambridge University Press, 2013.

Bernard W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society. Series B (Methodological), 47(1):1–52, 1985.

Andreas Svensson, Johan Dahlin, and Thomas B. Schön. Marginalizing Gaussian process hyperparameters using sequential Monte Carlo. In 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 477–480. IEEE, 2015.

Saifuddin Syed, Alexandre Bouchard-Côté, Kevin Chern, and Arnaud Doucet. Optimised annealed sequential Monte Carlo samplers. arXiv preprint arXiv:2408.12057, 2024.

Michalis K. Titsias and Aristidis Likas. Mixture of experts classification using a hierarchical mixture model. Neural Computation, 14(9):2221–2244, 2002.

Volker Tresp. Mixtures of Gaussian processes. Advances in Neural Information Processing Systems, pages 654–660, 2001.

Yener Ülker, Bilge Günsel, and Taylan Cemgil. Sequential Monte Carlo samplers for Dirichlet process mixtures. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9, pages 876–883. PMLR, 2010.

Mattias Villani, Robert Kohn, and David J. Nott. Generalized smooth finite mixtures. Journal of Econometrics, 171(2):121–133, 2012.

Sara Wade and Vanda Inácio. Bayesian dependent mixture models: a predictive comparison and survey. Statistical Science, 40(1):81–108, 2025.

Yuki K. Wakabayashi, Takuma Otsuka, Yoshitaka Taniyasu, Hideki Yamamoto, and Hiroshi Sawada. Improved adaptive sampling method utilizing Gaussian process regression for prediction of spectral peak structures. Applied Physics Express, 11(11):112401, 2018.

Christopher K. I. Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning, volume 2. MIT Press, Cambridge, MA, 2006.

Weichang Yu, Sara Wade, Howard D. Bondell, and Lamiae Azizi. Non-stationary Gaussian process discriminant analysis with variable selection for high-dimensional functional data. Journal of Computational and Graphical Statistics, pages 1–31, 2022.
Chao Yuan and Claus Neubauer. Variational mixture of Gaussian process experts. Advances in Neural Information Processing Systems, 21, 2008.

Michael Minyi Zhang and Sinead A. Williamson. Embarrassingly parallel inference for Gaussian processes. Journal of Machine Learning Research, 20(169):1–26, 2019.

Ozan İrsoy and Ethem Alpaydın. Dropout regularization in hierarchical mixture of experts. Neurocomputing, 419:148–156, 2021.