# Gaussian Process Random Fields

David A. Moore and Stuart J. Russell
Computer Science Division, University of California, Berkeley
Berkeley, CA 94709
{dmoore, russell}@cs.berkeley.edu

Abstract

Gaussian processes have been successful in both supervised and unsupervised machine learning tasks, but their computational complexity has constrained practical applications. We introduce a new approximation for large-scale Gaussian processes, the Gaussian Process Random Field (GPRF), in which local GPs are coupled via pairwise potentials. The GPRF likelihood is a simple, tractable, and parallelizable approximation to the full GP marginal likelihood, enabling latent variable modeling and hyperparameter selection on large datasets. We demonstrate its effectiveness on synthetic spatial data as well as a real-world application to seismic event location.

1 Introduction

Many machine learning tasks can be framed as learning a function given noisy information about its inputs and outputs. In regression and classification, we are given inputs and asked to predict the outputs; by contrast, in latent variable modeling we are given a set of outputs and asked to reconstruct the inputs that could have produced them. Gaussian processes (GPs) are a flexible class of probability distributions on functions that allow us to approach function-learning problems from an appealingly principled and clean Bayesian perspective. Unfortunately, the time complexity of exact GP inference is $O(n^3)$, where n is the number of data points. This makes exact GP calculations infeasible for real-world datasets with n > 10000.

Many approximations have been proposed to escape this limitation. One particularly simple approximation is to partition the input space into smaller blocks, replacing a single large GP with a multitude of local ones. This gains tractability at the price of a potentially severe independence assumption.

In this paper we relax the strong independence assumptions of independent local GPs, proposing instead a Markov random field (MRF) of local GPs, which we call a Gaussian Process Random Field (GPRF). A GPRF couples local models via pairwise potentials that incorporate covariance information. This yields a surrogate for the full GP marginal likelihood that is simple to implement and can be tractably evaluated and optimized on large datasets, while still enforcing a smooth covariance structure. The task of approximating the marginal likelihood is motivated by unsupervised applications such as the GP latent variable model [1], but examining the predictions made by our model also yields a novel interpretation of the Bayesian Committee Machine [2].

We begin by reviewing GPs and MRFs, along with some existing approximation methods for large-scale GPs. In Section 3 we present the GPRF objective and examine its properties as an approximation to the full GP marginal likelihood. We then evaluate it on synthetic data as well as an application to seismic event location.

Figure 1: Predictive distributions on a toy regression problem. (a) Full GP. (b) Local GPs. (c) Bayesian committee machine.

2 Background

2.1 Gaussian processes

Gaussian processes [3] are distributions on real-valued functions. GPs are parameterized by a mean function $\mu_\theta(x)$, typically assumed without loss of generality to be $\mu(x) = 0$, and a covariance function (sometimes called a kernel) $k_\theta(x, x')$, with hyperparameters $\theta$.
A common choice is the squared exponential covariance, $k_{SE}(x, x') = \sigma_f^2 \exp\!\left(-\frac{1}{2}\|x - x'\|^2 / \ell^2\right)$, with hyperparameters $\sigma_f^2$ and $\ell$ specifying respectively a prior variance and a correlation lengthscale.

We say that a random function f(x) is Gaussian process distributed if, for any n input points X, the vector of function values $f = f(X)$ is multivariate Gaussian, $f \sim \mathcal{N}(0, k_\theta(X, X))$. In many applications we have access only to noisy observations $y = f + \varepsilon$ for some noise process $\varepsilon$. If the noise is iid Gaussian, i.e., $\varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)$, then the observations are themselves Gaussian, $y \sim \mathcal{N}(0, K_y)$, where $K_y = k_\theta(X, X) + \sigma_n^2 I$.

The most common application of GPs is to Bayesian regression [3], in which we attempt to predict the function values $f^*$ at test points $X^*$ via the conditional distribution given the training data, $p(f^* \mid y; X, X^*, \theta)$. Sometimes, however, we do not observe the training inputs X, or we observe them only partially or noisily. This setting is known as the Gaussian Process Latent Variable Model (GP-LVM) [1]; it uses GPs as a model for unsupervised learning and nonlinear dimensionality reduction. The GP-LVM setting typically involves multi-dimensional observations, $Y = (y^{(1)}, \ldots, y^{(D)})$, with each output dimension $y^{(d)}$ modeled as an independent Gaussian process. The input locations and/or hyperparameters are typically sought via maximization of the marginal likelihood

$$\mathcal{L}(X, \theta) = \log p(Y; X, \theta) = -\frac{D}{2}\log|K_y| - \frac{1}{2}\sum_{d=1}^{D} y^{(d)T} K_y^{-1} y^{(d)} + C = -\frac{D}{2}\log|K_y| - \frac{1}{2}\mathrm{tr}(K_y^{-1} Y Y^T) + C, \quad (1)$$

though some recent work [4, 5] attempts to recover an approximate posterior on X by maximizing a variational bound. Given a differentiable covariance function, this maximization is typically performed by gradient-based methods, although local maxima can be a significant concern because the marginal likelihood is generally non-convex.

2.2 Scalability and approximate inference

The main computational difficulty in GP methods is the need to invert or factor the kernel matrix $K_y$, which requires time cubic in n. In GP-LVM inference this must be done at every optimization step to evaluate (1) and its derivatives.

This complexity has inspired a number of approximations. The most commonly studied are inducing-point methods, in which the unknown function is represented by its values at a set of m inducing points, where m ≪ n. These points can be chosen by maximizing the marginal likelihood in a surrogate model [6, 7] or by minimizing the KL divergence between the approximate and exact GP posteriors [8]. Inference in such models can typically be done in $O(nm^2)$ time, but this comes at the price of reduced representational capacity: while smooth functions with long lengthscales may be compactly represented by a small number of inducing points, for quickly-varying functions with significant local structure it may be difficult to find any faithful representation more compact than the complete set of training observations.

A separate class of approximations, so-called local GP methods [3, 9, 10], involves partitioning the inputs into blocks of m points each, then modeling each block with an independent Gaussian process. If the partition is spatially local, this corresponds to a covariance function that imposes independence between function values in different regions of the input space. Computationally, each block requires only $O(m^3)$ time; the total time is linear in the number of blocks.
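To make these quantities concrete, the following is a minimal NumPy sketch (our illustration, not the authors' code) of the squared exponential kernel and the multi-output marginal likelihood (1), evaluated via a Cholesky factorization of $K_y$; all function and variable names are assumptions introduced for this example.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def k_se(X1, X2, sigma_f=1.0, ell=1.0):
    """Squared exponential covariance: sigma_f^2 exp(-||x - x'||^2 / (2 ell^2))."""
    sq_dists = (np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :]
                - 2.0 * X1 @ X2.T)
    return sigma_f**2 * np.exp(-0.5 * np.maximum(sq_dists, 0.0) / ell**2)

def gp_log_marginal_likelihood(X, Y, sigma_f=1.0, ell=1.0, sigma_n=0.1):
    """Log marginal likelihood (1) for D independent outputs Y (n x D) sharing one kernel."""
    n, D = Y.shape
    Ky = k_se(X, X, sigma_f, ell) + sigma_n**2 * np.eye(n)
    L, lower = cho_factor(Ky, lower=True)
    alpha = cho_solve((L, lower), Y)              # K_y^{-1} Y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))     # log |K_y|
    return (-0.5 * D * logdet
            - 0.5 * np.sum(Y * alpha)             # tr(K_y^{-1} Y Y^T)
            - 0.5 * n * D * np.log(2 * np.pi))
```

Re-forming and factoring $K_y$ at every optimization step is exactly the cubic bottleneck discussed above.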
Local approximations preserve short-lengthscale structure within each block, but their harsh independence assumptions can lead to predictive discontinuities and inaccurate uncertainties (Figure 1b). These assumptions are problematic for GP-LVM inference because the marginal likelihood becomes discontinuous at block boundaries. Nonetheless, local GPs sometimes work very well in practice, achieving results comparable to more sophisticated methods in a fraction of the time [11].

The Bayesian Committee Machine (BCM) [2] attempts to improve on independent local GPs by averaging the predictions of multiple GP experts. The model is formally equivalent to an inducing-point model in which the test points are the inducing points, i.e., it assumes that the training blocks are conditionally independent given the test data. The BCM can yield high-quality predictions that avoid the pitfalls of local GPs (Figure 1c), while maintaining scalability to very large datasets [12]. However, as a purely predictive approximation, it is unhelpful in the GP-LVM setting, where we are interested in the likelihood of our training set irrespective of any particular test data. The desire for a BCM-style approximation to the marginal likelihood was part of the motivation for the current work; in Section 3.2 we show that the GPRF proposed in this paper can be viewed as such a model.

Mixture-of-experts models [13, 14] extend the local GP concept in a different direction: instead of deterministically assigning points to GP models based on their spatial locations, they treat the assignments as unobserved random variables and do inference over them. This allows the model to adapt to different functional characteristics in different regions of the space, at the price of a more difficult inference task. We are not aware of mixture-of-experts models being applied in the GP-LVM setting, though this should in principle be possible.

Simple building blocks are often combined to create more complex approximations. The PIC approximation [15] blends a global inducing-point model with local block-diagonal covariances, thus capturing a mix of global and local structure, though with the same boundary discontinuities as in vanilla local GPs. A related approach is the use of covariance functions with compact support [16] to capture local variation in concert with global inducing points. [11] surveys and compares several approximate GP regression methods on synthetic and real-world datasets.

Finally, we note the similar title of [17], which is in fact orthogonal to the present work: they use a random field as a prior on input locations, whereas this paper defines a random field decomposition of the GP model itself, which may be combined with any prior on X.

2.3 Markov Random Fields

We recall some basic theory regarding Markov random fields (MRFs), also known as undirected graphical models [18]. A pairwise MRF consists of an undirected graph (V, E), along with node potentials $\psi_i$ and edge potentials $\psi_{ij}$, which define an energy function on a random vector y,

$$E(y) = \sum_{i \in V} \psi_i(y_i) + \sum_{(i,j) \in E} \psi_{ij}(y_i, y_j), \quad (2)$$

where y is partitioned into components $y_i$ identified with nodes in the graph. This energy in turn defines a probability density, the Gibbs distribution, given by $p(y) = \frac{1}{Z}\exp(-E(y))$, where $Z = \int \exp(-E(z))\, dz$ is a normalizing constant. Gaussian random fields are the special case of pairwise MRFs in which the Gibbs distribution is multivariate Gaussian.
Given a partition of y into sub-vectors $y_1, y_2, \ldots, y_M$, a zero-mean Gaussian distribution with covariance K and precision matrix $J = K^{-1}$ can be expressed by potentials

$$\psi_i(y_i) = \frac{1}{2} y_i^T J_{ii} y_i, \qquad \psi_{ij}(y_i, y_j) = y_i^T J_{ij} y_j, \quad (3)$$

where $J_{ij}$ is the submatrix of J corresponding to the sub-vectors $y_i$, $y_j$. The normalizing constant $Z = (2\pi)^{n/2}|K|^{1/2}$ involves the determinant of the covariance matrix. Since edges whose potentials are zero can be dropped without effect, the nonzero entries of the precision matrix can be seen as specifying the edges present in the graph.

3 Gaussian Process Random Fields

We consider a vector of n real-valued¹ observations $y \sim \mathcal{N}(0, K_y)$ modeled by a GP, where $K_y$ is implicitly a function of input locations X and hyperparameters $\theta$. Unless otherwise specified, all probabilities $p(y_i)$, $p(y_i, y_j)$, etc., refer to marginals of this full GP. We would like to perform gradient-based optimization on the marginal likelihood (1) with respect to X and/or $\theta$, but suppose that the cost of doing so directly is prohibitive. In order to proceed, we assume a partition $y = (y_1, y_2, \ldots, y_M)$ of the observations into M blocks of size at most m, with an implied corresponding partition $X = (X_1, X_2, \ldots, X_M)$ of the (perhaps unobserved) inputs. The source of this partition is not a focus of the current work; we might imagine that the blocks correspond to spatially local clusters of input points, assuming that we have noisy observations of the X values or at least a reasonable guess at an initialization. We let $K_{ij} = \mathrm{cov}_\theta(y_i, y_j)$ denote the appropriate submatrix of $K_y$, and $J_{ij}$ denote the corresponding submatrix of the precision matrix $J_y = K_y^{-1}$; note that $J_{ij} \neq (K_{ij})^{-1}$ in general.

¹The extension to multiple independent outputs is straightforward.

3.1 The GPRF Objective

Given the precision matrix $J_y$, we could use (3) to represent the full GP distribution in factored form as an MRF. This is not directly useful, since computing $J_y$ requires cubic time. Instead we propose approximating the marginal likelihood via a random field in which local GPs are connected by pairwise potentials. Given an edge set, which we will initially take to be the complete graph, $E = \{(i, j) \mid 1 \leq i < j \leq M\}$, our approximate objective is

$$q_{GPRF}(y; X, \theta) = \prod_{i=1}^{M} p(y_i) \prod_{(i,j) \in E} \frac{p(y_i, y_j)}{p(y_i)\,p(y_j)} = \prod_{i=1}^{M} p(y_i)^{1-|E_i|} \prod_{(i,j) \in E} p(y_i, y_j), \quad (4)$$

where $E_i$ denotes the neighbors of i in the graph, and $p(y_i)$ and $p(y_i, y_j)$ are marginal probabilities under the full GP; equivalently, they are the likelihoods of local GPs defined on the points $X_i$ and $X_i \cup X_j$ respectively. Note that these local likelihoods depend implicitly on X and $\theta$. Taking the log, we obtain the energy function of an unnormalized MRF,

$$\log q_{GPRF}(y; X, \theta) = \sum_{i=1}^{M} (1 - |E_i|) \log p(y_i) + \sum_{(i,j) \in E} \log p(y_i, y_j), \quad (5)$$

with potentials

$$\psi_i^{GPRF}(y_i) = (1 - |E_i|) \log p(y_i), \qquad \psi_{ij}^{GPRF}(y_i, y_j) = \log p(y_i, y_j). \quad (6)$$

We refer to the approximate objective (5) as $q_{GPRF}$ rather than $p_{GPRF}$ to emphasize that it is not in general a normalized probability density. It can be interpreted as a Bethe-type approximation [19], in which a joint density is approximated via overlapping pairwise marginals. In the special case that the full precision matrix $J_y$ induces a tree structure on the blocks of our partition, $q_{GPRF}$ recovers the exact marginal likelihood. (This is shown in the supplementary material.) In general this will not be the case, but in the spirit of loopy belief propagation [20], we consider the tree-structured case as an approximation for the general setting.
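As a concrete illustration, here is a minimal NumPy sketch of the GPRF objective (5); it is our own illustration rather than the authors' reference implementation, and the helper names and the `cov` callback are assumptions. Each term is simply the log-density of a zero-mean Gaussian over one block or one pair of blocks.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gaussian_logpdf(y, K):
    """Log-density of y under N(0, K)."""
    L, lower = cho_factor(K, lower=True)
    alpha = cho_solve((L, lower), y)
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))          # -1/2 log |K|
            - 0.5 * len(y) * np.log(2 * np.pi))

def gprf_log_likelihood(y_blocks, X_blocks, edges, cov):
    """GPRF objective (5): sum of unary and pairwise local GP log-likelihoods.

    y_blocks: list of M observation vectors; X_blocks: list of M input arrays;
    edges: list of (i, j) block pairs; cov(X): noisy covariance k_theta(X, X) + sigma_n^2 I.
    """
    M = len(y_blocks)
    degree = np.zeros(M, dtype=int)
    total = 0.0
    for (i, j) in edges:                          # pairwise terms log p(y_i, y_j)
        degree[i] += 1
        degree[j] += 1
        Xij = np.vstack([X_blocks[i], X_blocks[j]])
        yij = np.concatenate([y_blocks[i], y_blocks[j]])
        total += gaussian_logpdf(yij, cov(Xij))
    for i in range(M):                            # unary terms (1 - |E_i|) log p(y_i)
        total += (1 - degree[i]) * gaussian_logpdf(y_blocks[i], cov(X_blocks[i]))
    return total
```

With `edges` empty this reduces to independent local GPs; each term touches only a local subset of the data, so the loops parallelize trivially.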
Before further analyzing the nature of the approximation, we first observe that, as a sum of local Gaussian log-densities, the objective (5) is straightforward to implement and fast to evaluate. Each of the $O(M^2)$ pairwise densities requires $O((2m)^3) = O(m^3)$ time, for an overall complexity of $O(M^2 m^3) = O(n^2 m)$ when M = n/m. The quadratic dependence on n cannot be avoided by any algorithm that computes similarities between all pairs of training points; however, in practice we will consider local modifications in which E is something smaller than all pairs of blocks. For example, if each block is connected only to a fixed number of spatial neighbors, the complexity reduces to $O(nm^2)$, i.e., linear in n. In the special case where E is the empty set, we recover the exact likelihood of independent local GPs.

It is also straightforward to obtain the gradient of (5) with respect to hyperparameters $\theta$ and inputs X, by summing the gradients of the local densities. The likelihood and gradient for each term in the sum can be evaluated independently using only local subsets of the training data, enabling a simple parallel implementation.

Having seen that $q_{GPRF}$ can be optimized efficiently, it remains for us to argue its validity as a proxy for the full GP marginal likelihood. Due to space constraints we defer proofs to the supplementary material, though our results are not difficult. We first show that, like the full marginal likelihood (1), $q_{GPRF}$ has the form of a Gaussian distribution, but with a different precision matrix.

Theorem 1. The objective $q_{GPRF}$ has the form of an unnormalized Gaussian density with precision matrix $\tilde{J}$, whose blocks are given by

$$\tilde{J}_{ii} = K_{ii}^{-1} + \sum_{j \in E_i} \left( Q^{(ij)}_{11} - K_{ii}^{-1} \right), \qquad \tilde{J}_{ij} = \begin{cases} Q^{(ij)}_{12} & \text{if } (i,j) \in E \\ 0 & \text{otherwise,} \end{cases} \quad (7)$$

where $Q^{(ij)}$ is the local precision matrix defined as the inverse of the pairwise marginal covariance,

$$Q^{(ij)} = \begin{pmatrix} Q^{(ij)}_{11} & Q^{(ij)}_{12} \\ Q^{(ij)}_{21} & Q^{(ij)}_{22} \end{pmatrix} = \begin{pmatrix} K_{ii} & K_{ij} \\ K_{ji} & K_{jj} \end{pmatrix}^{-1}.$$

Although the Gaussian density represented by $q_{GPRF}$ is not in general normalized, we show that it is approximately normalized in a certain sense.

Theorem 2. The objective $q_{GPRF}$ is approximately normalized in the sense that the optimal value of the Bethe free energy [19],

$$F_B(b) = \sum_{i=1}^{M} \int_{y_i} b_i(y_i)\big((1-|E_i|)\ln b_i(y_i) - \psi_i(y_i)\big)\, dy_i + \sum_{(i,j)\in E} \int_{y_i, y_j} b_{ij}(y_i, y_j)\big(\ln b_{ij}(y_i, y_j) - \psi_{ij}(y_i, y_j)\big)\, dy_i\, dy_j, \quad (8)$$

the approximation to the normalizing constant found by loopy belief propagation, is precisely zero. Furthermore, this optimum is obtained when the pseudomarginals $b_i$, $b_{ij}$ are taken to be the true GP marginals $p_i$, $p_{ij}$.

This implies that loopy belief propagation run on a GPRF would recover the marginals of the true GP.

3.2 Predictive equivalence to the BCM

We have introduced $q_{GPRF}$ as a surrogate model for the training set (X, y); however, it is natural to extend the GPRF to make predictions at a set of test points $X^*$, by including the function values $f^* = f(X^*)$ as an (M+1)st block, with an edge to each of the training blocks. The resulting predictive distribution,

$$p_{GPRF}(f^* \mid y) \propto q_{GPRF}(f^*, y) = \left[ p(f^*) \prod_{i=1}^{M} \frac{p(y_i, f^*)}{p(y_i)\,p(f^*)} \right] \left[ \prod_{i=1}^{M} p(y_i) \prod_{(i,j)\in E} \frac{p(y_i, y_j)}{p(y_i)\,p(y_j)} \right] \propto p(f^*)^{1-M} \prod_{i=1}^{M} p(f^* \mid y_i), \quad (9)$$

corresponds exactly to the prediction of the Bayesian Committee Machine (BCM) [2]. This motivates the GPRF as a natural extension of the BCM as a model for the training set, providing an alternative to the standard transductive interpretation of the BCM.²

²The GPRF is still transductive, in the sense that adding a test block $f^*$ will change the marginal distribution on the training observations y, as can be seen explicitly in the precision matrix (7). The contribution of the GPRF is that it provides a reasonable model for the training-set likelihood even in the absence of test points.
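In the Gaussian case, the product form (9) is the familiar BCM combination rule: the combined predictive precision is the sum of the per-expert predictive precisions, minus (M−1) copies of the prior precision to correct for the prior being counted M times. The sketch below (our illustration, not the authors' code; all names are hypothetical) shows that combination given per-block predictive means and covariances from local GPs.

```python
import numpy as np

def bcm_combine(prior_cov, means, covs):
    """Combine M local GP predictions p(f*|y_i) = N(mu_i, S_i) as in Eq. (9).

    prior_cov: prior covariance K(X*, X*) of f*;
    means, covs: per-block predictive means and covariances.
    Returns the BCM predictive mean and covariance.
    """
    M = len(means)
    prior_prec = np.linalg.inv(prior_cov)
    # Sum expert precisions, subtracting the (M-1) extra copies of the prior.
    prec = -(M - 1) * prior_prec
    weighted_mean = np.zeros(prior_cov.shape[0])
    for mu_i, S_i in zip(means, covs):
        Si_inv = np.linalg.inv(S_i)
        prec += Si_inv
        weighted_mean += Si_inv @ mu_i
    cov = np.linalg.inv(prec)
    mean = cov @ weighted_mean
    return mean, cov
```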
A similar derivation shows that the conditional distribution of any block $y_i$ given all other blocks $y_{j \neq i}$ also takes the form of a BCM prediction, suggesting the possibility of pseudolikelihood training [21], i.e., directly optimizing the quality of BCM predictions on held-out blocks (not explored in this paper).

Figure 2: Inferred locations on synthetic data (n = 10000), colored by the first output dimension $y^{(1)}$. (a) Noisy observed locations: mean error 2.48. (b) Full GP: 0.21. (c) GPRF-100: 0.36 (grid cells shown). (d) FITC-500: 4.86 (inducing points shown; note the contraction of the space).

4 Experiments

4.1 Uniform Input Distribution

We first consider a 2D synthetic dataset intended to simulate spatial location tasks such as WiFi-SLAM [22] or seismic event location (below), in which we observe high-dimensional measurements but have only noisy information regarding the locations at which those measurements were taken. We sample n points uniformly from a square of side length $\sqrt{n}$ to generate the true inputs X, then sample 50-dimensional outputs Y from independent GPs with SE kernel $k(r) = \exp(-(r/\ell)^2)$ for $\ell = 6.0$ and noise standard deviation $\sigma_n = 0.1$. The observed points $X_{obs} \sim \mathcal{N}(X, \sigma_{obs}^2 I)$ arise by corrupting X with isotropic Gaussian noise of standard deviation $\sigma_{obs} = 2$. The parameters $\ell$, $\sigma_n$, and $\sigma_{obs}$ were chosen to generate problems with interesting short-lengthscale structure for which GP-LVM optimization could nontrivially improve the initial noisy locations. Figure 2a shows a typical sample from this model.

For local GPs and GPRFs, we take the spatial partition to be a grid with n/m cells, where m is the desired number of points per cell. The GPRF edge set E connects each cell to its eight neighbors (Figure 2c), yielding linear time complexity $O(nm^2)$ (a construction sketched in code below). During optimization, a practical choice is necessary: do we use a fixed partition of the points, or re-assign points to cells as they cross spatial boundaries? The latter corresponds to a coherent (block-diagonal) spatial covariance function, but introduces discontinuities into the marginal likelihood. In our experiments the GPRF was not sensitive to this choice, but local GPs performed more reliably with fixed spatial boundaries (in spite of the discontinuities), so we used this approach for all experiments.

For comparison, we also evaluate the Sparse GP-LVM, implemented in GPy [23], which uses the FITC approximation to the marginal likelihood [7]. (We also considered the Bayesian GP-LVM [4], but found it to be more resource-intensive with no meaningful difference in results on this problem.) Here the approximation parameter m is the number of inducing points. We ran L-BFGS optimization to recover maximum a posteriori (MAP) locations, or local optima thereof.

Figure 3a shows mean location error (Euclidean distance) for n = 10000 points; at this size it is tractable to compare directly to the full GP-LVM. The GPRF with a large block size (m = 1111, corresponding to a 3x3 grid) nearly matches the solution quality of the full GP while requiring less time, while the local methods are quite fast to converge but become stuck at inferior optima.
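For concreteness, here is a minimal sketch of the grid partition and eight-neighbor edge set described above, assuming 2D inputs and a target of roughly m points per cell; it is illustrative only and not the authors' code.

```python
import numpy as np

def grid_partition_and_edges(X, m):
    """Partition 2D points X (n x 2) into ~n/m grid cells and connect each
    cell to its (up to) eight grid neighbors."""
    n = X.shape[0]
    cells_per_side = max(1, int(np.sqrt(n / m)))
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Map each point to an integer cell index (ix, iy).
    ix = np.minimum(((X[:, 0] - lo[0]) / (hi[0] - lo[0] + 1e-12)
                     * cells_per_side).astype(int), cells_per_side - 1)
    iy = np.minimum(((X[:, 1] - lo[1]) / (hi[1] - lo[1] + 1e-12)
                     * cells_per_side).astype(int), cells_per_side - 1)
    cell_id = ix * cells_per_side + iy
    blocks = [np.where(cell_id == c)[0] for c in range(cells_per_side**2)]
    # Each undirected neighbor pair (horizontal, vertical, diagonal) listed once.
    edges = []
    for cx in range(cells_per_side):
        for cy in range(cells_per_side):
            for dx, dy in [(1, 0), (0, 1), (1, 1), (1, -1)]:
                nx, ny = cx + dx, cy + dy
                if 0 <= nx < cells_per_side and 0 <= ny < cells_per_side:
                    edges.append((cx * cells_per_side + cy,
                                  nx * cells_per_side + ny))
    return blocks, edges
```

The returned index blocks and edge list are what a GPRF objective such as the `gprf_log_likelihood` sketch above would consume, giving the $O(nm^2)$ behavior quoted in the text.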
The FITC optimization exhibits an interesting pathology: it initially moves towards a good solution but then diverges towards what turns out to be a contraction of the space (Figure 2d); we conjecture this is because there are not enough inducing points to faithfully represent the full GP distribution over the entire space. A partial fix is to allow FITC to jointly optimize over locations and the correlation lengthscale $\ell$; this yielded a biased lengthscale estimate $\hat{\ell} \approx 7.6$ but more accurate locations (FITC-500-$\ell$ in Figure 3a).

Figure 3: Results on synthetic data. (a) Mean location error over time for n = 10000, including comparison to the full GP. (b) Mean error at convergence as a function of n, with learned lengthscale. (c) Mean location error over time for n = 80000.

To evaluate scaling behavior, we next considered problems of increasing size, up to n = 80000.³ Out of generosity to FITC we allowed each method to learn its own preferred lengthscale. Figure 3b reports the solution quality at convergence, showing that even with an adaptive lengthscale, FITC requires increasingly many inducing points to compete in large spatial domains. This is intractable for larger problems due to $O(m^3)$ scaling; indeed, attempts to run at n > 55000 with 2000 inducing points exceeded 32GB of available memory. Recently, more sophisticated inducing-point methods have claimed scalability to very large datasets [24, 25], but they do so with m ≤ 1000; we expect that they would hit the same fundamental scaling constraints on problems that inherently require many inducing points.

On our largest synthetic problem, n = 80000, inducing-point approximations are intractable, as is the full GP-LVM. Local GPs converge more quickly than GPRFs of equal block size, but the GPRFs find higher-quality solutions (Figure 3c). After a short initial period, the best performance always belongs to a GPRF, and at the conclusion of 24 hours the best GPRF solution achieves mean error 42% lower than the best local solution (0.18 vs 0.31).

4.2 Seismic event location

We next consider an application to seismic event location, which formed the motivation for this work. Seismic waves can be viewed as high-dimensional vectors generated from an underlying three-dimensional manifold, namely the Earth's crust. Nearby events tend to generate similar waveforms; we can model this spatial correlation as a Gaussian process. Prior information regarding the event locations is available from traditional travel-time-based location systems [26], which produce an independent Gaussian uncertainty ellipse for each event.

A full probability model of seismic waveforms, accounting for background noise and performing joint alignment of arrival times, is beyond the scope of this paper. To focus specifically on the ability to approximate GP-LVM inference, we used real event locations but generated synthetic waveforms by sampling from a 50-output GP using a Matérn kernel [3] with $\nu = 3/2$ and a lengthscale of 40 km. We also generated observed location estimates $X_{obs}$ by corrupting the true locations with Gaussian noise of standard deviation 20 km in each dimension.
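For reference, a minimal sketch of the Matérn $\nu = 3/2$ covariance used to generate the synthetic waveforms (our illustration; the variance parameter and the assumption of a Euclidean distance in km are not specified above):

```python
import numpy as np
from scipy.spatial.distance import cdist

def matern32(X1, X2, ell=40.0, sigma_f=1.0):
    """Matern covariance with nu = 3/2:
    k(r) = sigma_f^2 (1 + sqrt(3) r / ell) exp(-sqrt(3) r / ell)."""
    r = cdist(X1, X2)                 # pairwise distances, e.g. in km
    s = np.sqrt(3.0) * r / ell
    return sigma_f**2 * (1.0 + s) * np.exp(-s)
```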
³The astute reader will wonder how we generated synthetic data on problems that are clearly too large for an exact GP. For these synthetic problems, as well as the seismic example below, the covariance matrix is relatively sparse, with only ~2% of entries corresponding to pairs of points within six kernel lengthscales of each other. By considering only these entries, we were able to draw samples using a sparse Cholesky factorization, although this required approximately 30 GB of RAM. Unfortunately, this approach does not straightforwardly extend to GP-LVM inference under the exact GP, since the standard expression for the derivative of the marginal likelihood,
$$\frac{\partial}{\partial x_i} \log p(y) = \frac{1}{2}\mathrm{tr}\!\left( \left( (K_y^{-1} y)(K_y^{-1} y)^T - K_y^{-1} \right) \frac{\partial K_y}{\partial x_i} \right),$$
involves the full precision matrix $K_y^{-1}$, which is not sparse in general. Bypassing this expression via automatic differentiation through the sparse Cholesky decomposition could perhaps allow exact GP-LVM inference to scale to somewhat larger problems.

Figure 4: Seismic event location task. (a) Event map for the seismic dataset. (b) Mean location error over time.

Given the observed waveforms and noisy locations, we are interested in recovering the latitude, longitude, and depth of each event. Our dataset consists of 107556 events detected at the Mankachi array station in Kazakhstan between 2004 and 2012. Figure 4a shows the event locations, colored to reflect a principal axis tree partition [27] into blocks of 400 points (tree construction time was negligible). The GPRF edge set contains all pairs of blocks for which any two points had initial locations within one kernel lengthscale (40 km) of each other (see the sketch below). We also evaluated longer-distance connections, but found that this relatively local edge set had the best performance/time tradeoff: eliminating edges not only speeds up each optimization step, but in some cases actually yielded faster per-step convergence (perhaps because denser edge sets tended to create large cliques, for which the pairwise GPRF objective is a poor approximation).

Figure 4b shows the quality of recovered locations as a function of computation time; we jointly optimized over event locations as well as two lengthscale parameters (surface distance and depth) and the noise variance $\sigma_n^2$. Local GPs perform quite well on this task, but the best GPRF achieves 7% lower mean error than the best local GP model (12.8 km vs 13.7 km, respectively) given equal time. An even better result can be obtained by using the results of a local GP optimization to initialize a GPRF. Using the same partition (m = 800) for both local GPs and the GPRF, this hybrid method gives the lowest final error (12.2 km) and is dominant across a wide range of wall-clock times, suggesting it as a promising practical approach for large GP-LVM optimizations.
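A minimal sketch of the distance-based edge construction just described (our illustration, not the authors' code): it assumes block membership from the principal-axis-tree partition and initial locations $X_{obs}$ expressed in a local Cartesian (km) frame, and uses a kd-tree radius query to find all point pairs within one lengthscale.

```python
import numpy as np
from scipy.spatial import cKDTree

def lengthscale_edges(X_obs, block_of_point, lengthscale=40.0):
    """Connect every pair of blocks containing at least one pair of points
    whose initial locations lie within one kernel lengthscale of each other."""
    tree = cKDTree(X_obs)
    edges = set()
    for a, b in tree.query_pairs(r=lengthscale):
        bi, bj = block_of_point[a], block_of_point[b]
        if bi != bj:
            edges.add((min(bi, bj), max(bi, bj)))
    return sorted(edges)
```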
5 Conclusions and Future Work

The Gaussian process random field is a tractable and effective surrogate for the GP marginal likelihood. It has the flavor of approximate inference methods such as loopy belief propagation, but can be analyzed precisely in terms of a deterministic approximation to the inverse covariance, and it provides a new training-time interpretation of the Bayesian Committee Machine. It is easy to implement and can be straightforwardly parallelized.

One direction for future work involves finding partitions for which a GPRF performs well, e.g., partitions that induce a block near-tree structure. A perhaps related question is identifying when the GPRF objective defines a normalizable probability distribution (beyond the case of an exact tree structure) and under what circumstances it is a good approximation to the exact GP likelihood. The evaluation in this paper focuses on spatial data; however, both local GPs and the BCM have been successfully applied to high-dimensional regression problems [11, 12], so exploring the effectiveness of the GPRF for dimensionality reduction tasks would also be interesting. Another useful avenue is to integrate the GPRF framework with other approximations: since the GPRF and inducing-point methods have complementary strengths (the GPRF is useful for modeling a function over a large space, while inducing points are useful when the density of available data in some region of the space exceeds what is necessary to represent the function), an integrated method might enable new applications for which neither approach alone would be sufficient.

Acknowledgements

We thank the anonymous reviewers for their helpful suggestions. This work was supported by DTRA grant #HDTRA-11110026, and by computing resources donated by Microsoft Research under an Azure for Research grant.

References

[1] Neil D. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. In Advances in Neural Information Processing Systems (NIPS), 2004.
[2] Volker Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719–2741, 2000.
[3] Carl Rasmussen and Chris Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[4] Michalis K. Titsias and Neil D. Lawrence. Bayesian Gaussian process latent variable model. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
[5] Andreas C. Damianou, Michalis K. Titsias, and Neil D. Lawrence. Variational inference for latent variables and uncertain inputs in Gaussian processes. Journal of Machine Learning Research (JMLR), 2015.
[6] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research (JMLR), 6:1939–1959, 2005.
[7] Neil D. Lawrence. Learning for larger datasets with the Gaussian process latent variable model. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2007.
[8] Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.
[9] Duy Nguyen-Tuong, Matthias Seeger, and Jan Peters. Model learning with local Gaussian process regression. Advanced Robotics, 23(15):2015–2034, 2009.
[10] Chiwoo Park, Jianhua Z. Huang, and Yu Ding. Domain decomposition approach for fast Gaussian process regression of large spatial data sets. Journal of Machine Learning Research (JMLR), 12:1697–1728, 2011.
[11] Krzysztof Chalupka, Christopher K. I. Williams, and Iain Murray. A framework for evaluating approximation methods for Gaussian process regression. Journal of Machine Learning Research (JMLR), 14:333–350, 2013.
[12] Marc Peter Deisenroth and Jun Wei Ng. Distributed Gaussian processes. In International Conference on Machine Learning (ICML), 2015.
[13] Carl Edward Rasmussen and Zoubin Ghahramani. Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems (NIPS), pages 881–888, 2002.
[14] Trung Nguyen and Edwin Bonilla. Fast allocation of Gaussian process experts. In International Conference on Machine Learning (ICML), pages 145–153, 2014.
[15] Edward Snelson and Zoubin Ghahramani. Local and global sparse Gaussian process approximations. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2007.
[16] Jarno Vanhatalo and Aki Vehtari. Modelling local and global phenomena with sparse Gaussian processes. In Uncertainty in Artificial Intelligence (UAI), 2008.
[17] Guoqiang Zhong, Wu-Jun Li, Dit-Yan Yeung, Xinwen Hou, and Cheng-Lin Liu. Gaussian process latent random field. In AAAI Conference on Artificial Intelligence, 2010.
[18] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[19] Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. Bethe free energy, Kikuchi approximations, and belief propagation algorithms. In Advances in Neural Information Processing Systems (NIPS), 13, 2001.
[20] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Uncertainty in Artificial Intelligence (UAI), pages 467–475, 1999.
[21] Julian Besag. Statistical analysis of non-lattice data. The Statistician, pages 179–195, 1975.
[22] Brian Ferris, Dieter Fox, and Neil D. Lawrence. WiFi-SLAM using Gaussian process latent variable models. In International Joint Conference on Artificial Intelligence (IJCAI), pages 2480–2485, 2007.
[23] The GPy authors. GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy, 2012–2015.
[24] James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence (UAI), page 282, 2013.
[25] Yarin Gal, Mark van der Wilk, and Carl Rasmussen. Distributed variational inference in sparse Gaussian process regression and latent variable models. In Advances in Neural Information Processing Systems (NIPS), 2014.
[26] International Seismological Centre. On-line Bulletin. Int. Seis. Cent., Thatcham, United Kingdom, 2015. http://www.isc.ac.uk.
[27] James McNames. A fast nearest-neighbor algorithm based on a principal axis search tree. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 23(9):964–976, 2001.