# Spatio-Temporal Variational Gaussian Processes

Oliver Hamelijnck (The Alan Turing Institute / University of Warwick, ohamelijnck@turing.ac.uk), William J. Wilkinson (Aalto University, william.wilkinson@aalto.fi), Niki A. Loppi (NVIDIA, nloppi@nvidia.com), Arno Solin (Aalto University, arno.solin@aalto.fi), Theodoros Damoulas (The Alan Turing Institute / University of Warwick, tdamoulas@turing.ac.uk)

We introduce a scalable approach to Gaussian process inference that combines spatio-temporal filtering with natural gradient variational inference, resulting in a non-conjugate GP method for multivariate data that scales linearly with respect to time. Our natural gradient approach enables application of parallel filtering and smoothing, further reducing the temporal span complexity to be logarithmic in the number of time steps. We derive a sparse approximation that constructs a state-space model over a reduced set of spatial inducing points, and show that for separable Markov kernels the full and sparse cases exactly recover the standard variational GP, whilst exhibiting favourable computational properties. To further improve the spatial scaling we propose a mean-field assumption of independence between spatial locations which, when coupled with sparsity and parallelisation, leads to an efficient and accurate method for large spatio-temporal problems.

## 1 Introduction

Most real-world processes occur across space and time, exhibit complex dependencies, and are observed through noisy, irregular samples. Take, for example, the task of modelling air pollution across a city. Such a task involves large amounts of noisy, partially observed data with strong seasonal effects governed by weather, traffic, human movement, etc. This setting motivates a probabilistic perspective, allowing for the incorporation of prior knowledge and the quantification of uncertainty. Gaussian processes (GPs, [38]) provide such a probabilistic modelling paradigm, but their inherent cubic computational scaling in the number of data, $N$, limits their applicability to spatio-temporal tasks. Arguably the most successful methods to address this issue are sparse GPs [37], which summarise the true GP posterior through a reduced set of $M$ inducing points and have dominant computational scaling $\mathcal{O}(NM^2)$, and spatio-temporal GPs [43], which rewrite the GP prior as a state-space model and use filtering to perform inference in $\mathcal{O}(Nd^3)$, where $d$ is the dimensionality of the state space. Sparse GPs and spatio-temporal GPs have been combined by constructing a Markovian system in which a set of spatial inducing points are tracked over time [24, 51]. However, existing methods for spatio-temporal GPs make approximations to the prior conditional model [24] or do not exploit natural gradients [45], meaning they do not provide the same inference and learning results as state-of-the-art variational GPs [26] in the presence of non-conjugate likelihoods or sparsity, which has hindered their widespread adoption.
Figure 1: A demonstration of the spatio-temporal sparse variational GP (ST-SVGP) applied to crime count data in New York. ST-SVGP tracks spatial points over time via spatio-temporal filtering. The colourmap is the posterior mean, and the red dots are spatial inducing points. The diagram shows the difference between how inducing points are treated in ST-SVGP and SVGP: inducing points are shared over the temporal domain, so ST-SVGP recovers the same spatio-temporal model as SVGP if the inducing points repeat over time; ST-SVGP's effective inducing point count therefore scales with the data, whilst its computational scaling remains linear in the number of time steps.

We introduce spatio-temporal variational GPs (ST-VGP), which provide the exact same results as standard variational GPs, whilst reducing the computational scaling in the temporal dimension from cubic to linear. ST-VGP is derived using a natural gradient variational inference approach based on filtering and smoothing. We also derive this method's sparse variant, and demonstrate how it enables the use of significantly more inducing points than the standard approach, leading to improved predictive performance. We then show how the spatio-temporal structure can be exploited even further to improve both the temporal and spatial scaling. We demonstrate for the first time how to apply parallel filtering and smoothing [41] to non-conjugate GPs to reduce the temporal (span) complexity to be logarithmic. We then reformulate the model to enable an efficient mean-field approximation across space, improving the complexity with respect to the number of spatial points. We analyse the practical performance and scalability of our proposed methods, demonstrating how they make it possible to apply GPs to large-scale spatio-temporal scenarios without sacrificing inference quality.

### 1.1 Related Work

GPs are commonly used for spatio-temporal modelling in both machine learning and spatial statistics [38, 22, 14, 6]. Many approaches to overcome their computational burden have been proposed, from nearest neighbours [17] to parallel algorithms on GPUs [48]. Within machine learning, the sparse GP approach is perhaps the most popular [37, 46], and is typically combined with mini-batching to allow training on massive datasets [26]. However, it fails in practical cases where the number of inducing points must grow with the size of the data, such as for time series [49]. When the data lie on a grid, separable kernels exhibit Kronecker structure which can be exploited for efficient inference [39]. This approach has been generalised to the partial grid setting [53], and to structured kernel interpolation (SKI, [52]), which requires only that the inducing points be on a grid. Generally, these approaches are limited to the conjugate case, although Laplace-based extensions exist [19]. Bruinsma et al. [10] present an approach to spatio-temporal modelling that performs an orthogonal projection of the data to enforce independence between the latent processes. It has been shown that variational GPs can be computed in linear time either by exploiting sparse precision structure [18] or via filtering and smoothing [11]. Other inference schemes such as Laplace and expectation propagation have also been proposed [35, 51]. In the spatio-temporal case, sparsity has been used in the spatial dimension [24, 43].
These methods historically suffered from the fact that (i) filtering was not amenable to fast automatic differentiation due to its recursive nature, and (ii) state-of-the-art inference schemes had not been developed to make them directly comparable to other methods. The first is no longer an issue, since many machine learning frameworks are now capable of efficiently differentiating recursive models [11]. We address the second point with this paper. A similar algorithm to ours that is also sparse in the temporal dimension has been developed [50, 2], and relevant properties of the spatio-temporal model presented here are also analysed in [45]. Fourier features [28] are an alternative approach to scalable GPs, but are not suited to very long time series with high variability due to the need for impractically many inducing features.

## 2 Background

We consider data lying on a spatio-temporal grid comprising input–output pairs, $\{\mathbf{X}^{(st)} \in \mathbb{R}^{N_t \times N_s \times D},\ \mathbf{Y}^{(st)} \in \mathbb{R}^{N_t \times N_s}\}$, where $N_t$ is the number of temporal points, $N_s$ the number of spatial points, and $D = 1 + D_s$ the input dimensionality (with $D_s$ being the number of spatial dimensions). We use $t$ and $\mathbf{s}$ to represent time and space respectively. The assumption of grid structure is relaxed via the introduction of sparse methods in Sec. 3, and by the natural handling of missing data. For consistency with the GP literature we let $\mathbf{X} = \mathrm{vec}(\mathbf{X}^{(st)}) \in \mathbb{R}^{N \times D}$, $\mathbf{Y} = \mathrm{vec}(\mathbf{Y}^{(st)}) \in \mathbb{R}^{N \times 1}$, where $N = N_t N_s$ is the total number of data points. We use the operator $\mathrm{vec}(\cdot)$ simply to convert data from a spatio-temporal grid into vector form, whilst keeping observations ordered by time and then space. For notational convenience we define $\mathbf{X}_{n,k} = \mathbf{X}^{(st)}_{n,k}$, $Y_{n,k} = \mathbf{Y}^{(st)}_{n,k}$, which indexes data at time index $n$ and spatial index $k$. We use $t_n$ to denote the $n$-th time point, $\mathbf{S} \in \mathbb{R}^{N_s \times D_s}$ to denote all spatial grid points, and $\mathbf{S}_k$ the $k$-th one. Let $f : \mathbb{R}^D \to \mathbb{R}$ be a random function with a zero-mean GP prior; then for a given likelihood $p(\mathbf{Y} \mid f(\mathbf{X}))$ the generative model is

$$f(\mathbf{x}) \sim \mathcal{GP}\big(0, \kappa(\mathbf{x}, \mathbf{x}')\big), \qquad \mathbf{Y} \mid f \sim \prod_{n=1}^{N_t} \prod_{k=1}^{N_s} p(Y_{n,k} \mid f_{n,k}), \qquad (1)$$

where $f_{n,k} = f(\mathbf{X}_{n,k})$, and we let $\mathbf{f}_n$ be the function values of all spatial points at time $t_n$. When the kernel $\kappa$ is evaluated at given inputs we write the corresponding Gram matrix as $\mathbf{K}_{\mathbf{XX}} = \kappa(\mathbf{X}, \mathbf{X})$. To make it explicit that $f$ takes spatio-temporal inputs we also abuse the notation slightly to write $f(\mathbf{x}) = f(t, \mathbf{s})$ and $\kappa(\mathbf{x}, \mathbf{x}') = \kappa(t, \mathbf{s}, t', \mathbf{s}')$. A summary of all notation used is provided in App. A. For Gaussian likelihoods the posterior, $p(f \mid \mathbf{Y})$, is available in closed form; otherwise approximations must be used. In either case, inference typically comes at a cubic cost of $\mathcal{O}(N_t^3 N_s^3)$.

### 2.1 State Space Spatio-Temporal Gaussian Processes

One method for handling the cubic scaling of GPs is to reformulate the prior in Eq. (1) as a state-space model, reducing the computational scaling to linear in the number of time points [43]. The enabling assumption is that the kernel is both Markovian and separable between time and space: $\kappa(t, \mathbf{s}, t', \mathbf{s}') = \kappa_t(t, t')\,\kappa_s(\mathbf{s}, \mathbf{s}')$. We use the term Markovian kernel to refer to a kernel which can be re-written in state-space form (see [44] for an overview). First, we write down the GP prior as a stochastic partial differential equation (SPDE, see [15]),

$$\frac{\partial f(t, \mathbf{s})}{\partial t} = \mathcal{A}_s\, f(t, \mathbf{s}) + \mathbf{L}_s\, w(t, \mathbf{s}),$$

where $w(t, \mathbf{s})$ is a (spatio-temporal) white noise process and $\mathcal{A}_s$ a suitable (pseudo-)differential operator [see 42].
By appropriately defining the model matrices and the white noise spectral density function, SPDEs of this form can represent a large class of separable and non-separable GP models. When the kernel is separable, this SPDE can be simplified to a finite-dimensional SDE [24] by marginalising to a finite set of spatial locations, $\mathbf{S} \in \mathbb{R}^{N_s \times D_s}$, giving

$$\mathrm{d}\bar{\mathbf{f}}(t) = \mathbf{F}\,\bar{\mathbf{f}}(t)\,\mathrm{d}t + \mathbf{L}\,\mathrm{d}\boldsymbol{\beta}(t),$$

where $\bar{\mathbf{f}}(t)$ is the Gaussian distributed state at the spatial points $\mathbf{S}$ at time $t$, with dimensionality $d = N_s d_t$, where $d_t$ is the dimensionality of the state-space model induced by $\kappa_t(\cdot,\cdot)$. Here $\mathrm{d}\boldsymbol{\beta}(t)$ has spectral density $\mathbf{Q}_c$, and the matrix $\mathbf{H}$ extracts the function value from the state: $\mathbf{f}_n = \mathbf{H}\,\bar{\mathbf{f}}(t_n)$. $\mathbf{F}$ and $\mathbf{L}$ are the feedback and noise effect matrices [42]. This simplification to an SDE is possible due to the independence between spatial points at time $t$ and all other time steps, given the current state [45]. This follows from the fact that for any separable kernel, $f(t, \mathbf{s})$ and $f(t', \mathbf{s}')$ are independent given $f(t', \mathbf{s})$ [36]. For a step size $\Delta_n = t_{n+1} - t_n$, the discrete-time model matrices are

$$\mathbf{A}_n = \mathbf{\Phi}(\mathbf{F}\Delta_n), \qquad \mathbf{Q}_n = \int_0^{\Delta_n} \mathbf{\Phi}\big(\mathbf{F}(\Delta_n - \tau)\big)\, \mathbf{L}\, \mathbf{Q}_c\, \mathbf{L}^\top\, \mathbf{\Phi}\big(\mathbf{F}(\Delta_n - \tau)\big)^\top \mathrm{d}\tau, \qquad (2)$$

where $\mathbf{\Phi}(\cdot)$ is the matrix exponential. The resulting discrete model is

$$\bar{\mathbf{f}}(t_{n+1}) = \mathbf{A}_n\, \bar{\mathbf{f}}(t_n) + \mathbf{q}_n, \qquad \mathbf{Y}_n \mid \bar{\mathbf{f}}(t_n) \sim p\big(\mathbf{Y}_n \mid \mathbf{H}\,\bar{\mathbf{f}}(t_n)\big), \qquad (3)$$

where $\mathbf{q}_n \sim \mathcal{N}(\mathbf{0}, \mathbf{Q}_n)$. If $p(\mathbf{Y}_n \mid \mathbf{H}\bar{\mathbf{f}}(t_n))$ is Gaussian then Kalman smoothing algorithms can be employed to perform inference in Eq. (3) in $\mathcal{O}(N_t d^3) = \mathcal{O}(N_t N_s^3 d_t^3)$.

**Markovian GPs with Spatial Sparsity** Sparse GPs re-define the GP prior over a smaller set of $M$ inducing points: let $\mathbf{u} = f(\mathbf{Z}) \in \mathbb{R}^{M \times 1}$ be the inducing variables at inducing locations $\mathbf{Z} \in \mathbb{R}^{M \times D}$; then the augmented prior is $p(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} \mid \mathbf{u})\, p(\mathbf{u})$, where $p(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \mathbf{0}, \mathbf{K}_{\mathbf{ZZ}})$, and with Gaussian conditional $p(\mathbf{f} \mid \mathbf{u})$. If the inducing points are placed on a spatio-temporal grid, with $\mathbf{Z}_s \in \mathbb{R}^{M_s \times D_s}$ being the spatial inducing locations, the conditional $p(\mathbf{f} \mid \mathbf{u})$ can be simplified to (see App. D)

$$p(\mathbf{f} \mid \mathbf{u}) = \mathcal{N}\Big(\mathbf{f} \,\Big|\, \big[\mathbf{I} \otimes \big(\mathbf{K}^{(s)}_{\mathbf{S}\mathbf{Z}_s} \mathbf{K}^{(s)\,-1}_{\mathbf{Z}_s\mathbf{Z}_s}\big)\big]\, \mathbf{u},\ \mathbf{K}^{(t)}_{\mathbf{tt}} \otimes \tilde{\mathbf{Q}}_s\Big), \qquad (4)$$

where $\tilde{\mathbf{Q}}_s = \mathbf{K}^{(s)}_{\mathbf{SS}} - \mathbf{K}^{(s)}_{\mathbf{S}\mathbf{Z}_s} \mathbf{K}^{(s)\,-1}_{\mathbf{Z}_s\mathbf{Z}_s} \mathbf{K}^{(s)}_{\mathbf{Z}_s\mathbf{S}}$ (see App. A for notational details). The fully independent training conditional (FITC) method [37] approximates the full conditional covariance with its diagonal, leading to the following convenient property: $q_{\text{FITC}}(\mathbf{f} \mid \mathbf{u}) = \prod_{n=1}^{N_t} q_{\text{FITC}}(\mathbf{f}_n \mid \mathbf{u}) = \prod_{n=1}^{N_t} q_{\text{FITC}}(\mathbf{f}_n \mid \mathbf{u}_n)$, where the last equality holds because $\mathbf{I} \otimes (\mathbf{K}^{(s)}_{\mathbf{S}\mathbf{Z}_s} \mathbf{K}^{(s)\,-1}_{\mathbf{Z}_s\mathbf{Z}_s})$ is block diagonal. This factorisation across time allows the model to be cast into the state-space form of Eq. (3), but where the state $\bar{\mathbf{f}}(t)$ is defined over the reduced set of spatial inducing points [24]. Inference can be performed in $\mathcal{O}(N_t M_s^3 d_t^3)$.
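To make the temporal part of this construction concrete, the sketch below (an illustrative example, not code from the paper's repository) builds the state-space form of a Matérn-3/2 kernel, the kernel used in the experiments of Sec. 5, and discretises it as in Eqs. (2)–(3). For stationary kernels the integral in Eq. (2) can be avoided via the stationary state covariance $\mathbf{P}_\infty$, since $\mathbf{Q}_n = \mathbf{P}_\infty - \mathbf{A}_n \mathbf{P}_\infty \mathbf{A}_n^\top$; the function names are assumptions made for this sketch.

```python
import jax.numpy as jnp
from jax.scipy.linalg import expm

def matern32_state_space(lengthscale, variance):
    """State-space form (F, L, Qc, H, Pinf) of a Matern-3/2 temporal kernel
    (see Sarkka & Solin [42]); the temporal state dimension is d_t = 2."""
    lam = jnp.sqrt(3.0) / lengthscale
    F = jnp.array([[0.0, 1.0],
                   [-lam**2, -2.0 * lam]])          # feedback matrix
    L = jnp.array([[0.0], [1.0]])                   # noise effect matrix
    Qc = jnp.array([[4.0 * lam**3 * variance]])     # white-noise spectral density
    H = jnp.array([[1.0, 0.0]])                     # measurement: extract f from the state
    Pinf = jnp.array([[variance, 0.0],
                      [0.0, lam**2 * variance]])    # stationary state covariance
    return F, L, Qc, H, Pinf

def discretise(F, Pinf, dt):
    """Discrete-time transition A_n and process noise Q_n for step size dt, Eq. (2)."""
    A = expm(F * dt)
    Q = Pinf - A @ Pinf @ A.T   # stationary-kernel shortcut for the integral in Eq. (2)
    return A, Q
```

Stacking $N_s$ (or, in the sparse case, $M_s$) copies of these temporal matrices via the Kronecker products introduced in Sec. 3 then yields the full spatio-temporal state-space model.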
### 2.2 Sparse Variational GPs

To perform approximate inference in the presence of sparsity or non-Gaussian likelihoods, variational methods cast inference as optimisation through minimisation of the Kullback–Leibler divergence (KLD) from the true posterior to the approximate posterior [8]. Although direct computation of the KLD is intractable, it can be rewritten as the maximisation of the evidence lower bound (ELBO). Unlike FITC, the sparse variational GP (SVGP, [46]) does not approximate the conditional $p(\mathbf{f} \mid \mathbf{u})$, but instead approximates the posterior as $q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} \mid \mathbf{u})\, q(\mathbf{u})$, where $q(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \mathbf{m}, \mathbf{P})$ is a Gaussian whose parameters are to be optimised. The SVGP ELBO is

$$\mathcal{L}_{\text{SVGP}} = \mathbb{E}_{q(\mathbf{u})}\Big[\mathbb{E}_{q(\mathbf{f} \mid \mathbf{u})}\big[\log p(\mathbf{Y} \mid \mathbf{f})\big]\Big] - \mathrm{KL}\big[q(\mathbf{u})\,\|\,p(\mathbf{u})\big], \qquad (5)$$

which can be computed in $\mathcal{O}(NM^2 + M^3)$. SVGP has many benefits over methods such as FITC, including: non-Gaussian likelihoods can be handled through quadrature or Monte Carlo approximations [27, 33]; it is applicable to big data through stochastic VI and mini-batching [26]; and the inducing locations are variationally protected, which prevents overfitting [7].

**Natural Gradients** Natural gradient descent calculates gradients in distribution space rather than parameter space, and has been shown to improve inference time and quality for variational GPs [26, 40]. A natural descent direction is obtained by scaling the standard gradient by the inverse of the Fisher information matrix, $-\mathbb{E}_{q(\cdot)}\big[\nabla^2 \log q(\cdot)\big]$ [5]. For a Gaussian approximate posterior, the natural gradient of a target $\mathcal{L}$ with respect to the natural parameters $\boldsymbol{\lambda}$ can be calculated without directly forming the Hessian, since it can be shown to be equivalent to the gradient with respect to the mean parameters $\boldsymbol{\mu} = [\mathbf{m},\ \mathbf{m}\mathbf{m}^\top + \mathbf{P}]$ [25]. The natural parameter update, with learning rate $\beta$, becomes

$$\boldsymbol{\lambda} \leftarrow \boldsymbol{\lambda} + \beta\, \frac{\partial \mathcal{L}}{\partial \boldsymbol{\mu}}. \qquad (6)$$

To update the approximate posterior, $\boldsymbol{\lambda}$ can simply be transformed to the moment parameterisation $[\mathbf{m}, \mathbf{P}]$. A table of mappings between the various parameterisations is given in App. G.

**CVI and the Approximate Likelihood** Khan and Lin [31] show that when the prior and approximate posterior are conjugate (as in the GP case), further elegant properties of exponential family distributions mean that Eq. (6) is equivalent to a two-step Bayesian update:

$$\tilde{\boldsymbol{\lambda}} \leftarrow (1 - \beta)\, \tilde{\boldsymbol{\lambda}} + \beta\, \frac{\partial\, \mathbb{E}_{q(\mathbf{f})}[\log p(\mathbf{Y} \mid \mathbf{f})]}{\partial \boldsymbol{\mu}}, \qquad \boldsymbol{\lambda} \leftarrow \boldsymbol{\eta} + \tilde{\boldsymbol{\lambda}}, \qquad (7)$$

where $\boldsymbol{\eta}$ are the natural parameters of the prior and $\tilde{\boldsymbol{\lambda}}$ are the natural parameters of the likelihood contribution. Letting $g = \mathbb{E}_{q}[\log p(\mathbf{Y} \mid \mathbf{f})]$, the gradients can be computed in terms of the mean and covariance via the chain rule: $\frac{\partial g}{\partial \boldsymbol{\mu}} = \big[\frac{\partial g}{\partial \mathbf{m}} - 2\, \frac{\partial g}{\partial \mathbf{P}}\, \mathbf{m},\ \frac{\partial g}{\partial \mathbf{P}}\big]$. Eq. (7) shows that, since the prior parameters $\boldsymbol{\eta}$ are known, natural gradient variational inference is completely characterised by updates to an approximate likelihood, which we denote $\mathcal{N}(\tilde{\mathbf{Y}} \mid \mathbf{f}, \tilde{\mathbf{V}})$, parameterised by covariance $\tilde{\mathbf{V}} = (-2\tilde{\boldsymbol{\lambda}}^{(2)})^{-1}$ and mean $\tilde{\mathbf{Y}} = \tilde{\mathbf{V}}\tilde{\boldsymbol{\lambda}}^{(1)}$ (see App. A). The approximate posterior then has the form

$$q(\mathbf{f}) = \frac{\mathcal{N}(\tilde{\mathbf{Y}} \mid \mathbf{f}, \tilde{\mathbf{V}})\, p(\mathbf{f})}{\int \mathcal{N}(\tilde{\mathbf{Y}} \mid \mathbf{f}, \tilde{\mathbf{V}})\, p(\mathbf{f})\, \mathrm{d}\mathbf{f}}. \qquad (8)$$

Computing $q(\mathbf{f})$ amounts to performing conjugate GP regression with the model prior and the approximate likelihood. This approach is called conjugate-computation variational inference (CVI, [31]). To re-emphasise that the CVI updates compute the exact same quantity as Eq. (6), we provide an alternative derivation in App. H by directly applying the chain rule to Eq. (6).
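As a concrete illustration of Eq. (7), the sketch below performs one damped CVI step on the Gaussian site (approximate likelihood) parameters for a factorised likelihood, so that all quantities are per data point and the site covariance is diagonal. It is a minimal sketch rather than the paper's implementation; the variable names are illustrative, and the gradients of the expected log-likelihood are assumed to be supplied (e.g. by automatic differentiation).

```python
import jax.numpy as jnp

def cvi_site_update(lam1, lam2, m, dE_dm, dE_dv, beta=0.1):
    """One natural-gradient (CVI) step on the site natural parameters, Eq. (7).

    lam1, lam2   -- current site natural parameters (one per data point)
    m            -- current marginal posterior means of q(f_{n,k})
    dE_dm, dE_dv -- gradients of E_q[log p(Y | f)] w.r.t. the marginal mean/variance
    """
    # Chain rule from (m, P) to the mean parameters mu = [m, m m^T + P]:
    dE_dmu1 = dE_dm - 2.0 * dE_dv * m
    dE_dmu2 = dE_dv
    # Damped update of the approximate-likelihood natural parameters:
    lam1 = (1.0 - beta) * lam1 + beta * dE_dmu1
    lam2 = (1.0 - beta) * lam2 + beta * dE_dmu2
    # Convert to the pseudo-observation parameterisation used in Eq. (8):
    V_tilde = -0.5 / lam2          # V~ = (-2 lam2)^{-1}, diagonal case
    Y_tilde = V_tilde * lam1       # Y~ = V~ lam1
    return lam1, lam2, Y_tilde, V_tilde
```

The second step of Eq. (7), $\boldsymbol{\lambda} \leftarrow \boldsymbol{\eta} + \tilde{\boldsymbol{\lambda}}$, is then carried out implicitly by running conjugate GP regression (in this paper, filtering and smoothing) against the pseudo-observations $(\tilde{\mathbf{Y}}, \tilde{\mathbf{V}})$.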
## 3 Spatio-Temporal Variational Gaussian Processes

In this section we introduce a spatio-temporal VGP that has linear complexity with respect to time whilst obtaining a variational posterior identical to that of the standard VGP. We then go on to derive this method's sparse variant, which gives the same posterior as SVGP when the inducing points are set similarly (i.e., on a spatio-temporal grid), but is capable of scaling to much larger values of $M$.

### 3.1 The Spatio-Temporal VGP ELBO

We first derive our proposed spatio-temporal VGP ELBO. We do this by exploiting the form of the approximate posterior after a natural gradient step in order to write the ELBO as a sum of three terms, each of which can be efficiently computed through filtering and smoothing. As shown in Sec. 2.2, after a natural gradient step the approximate posterior $q(\mathbf{f}) \propto \mathcal{N}(\tilde{\mathbf{Y}} \mid \mathbf{f}, \tilde{\mathbf{V}})\, p(\mathbf{f})$ decomposes as a Bayesian update applied to the model prior with an approximate likelihood. Following Chang et al. [11] we substitute Eq. (8) into the VGP ELBO:

$$\mathcal{L}_{\text{VGP}} = \mathbb{E}_{q(\mathbf{f})}\left[\log \frac{p(\mathbf{Y} \mid \mathbf{f})\, p(\mathbf{f})}{q(\mathbf{f})}\right] = \mathbb{E}_{q(\mathbf{f})}\left[\log \frac{p(\mathbf{Y} \mid \mathbf{f})\, p(\mathbf{f}) \int \mathcal{N}(\tilde{\mathbf{Y}} \mid \mathbf{f}, \tilde{\mathbf{V}})\, p(\mathbf{f})\, \mathrm{d}\mathbf{f}}{\mathcal{N}(\tilde{\mathbf{Y}} \mid \mathbf{f}, \tilde{\mathbf{V}})\, p(\mathbf{f})}\right]$$
$$= \sum_{n=1}^{N_t} \sum_{k=1}^{N_s} \mathbb{E}_{q(f_{n,k})}\big[\log p(Y_{n,k} \mid f_{n,k})\big] - \mathbb{E}_{q(\mathbf{f})}\big[\log \mathcal{N}(\tilde{\mathbf{Y}} \mid \mathbf{f}, \tilde{\mathbf{V}})\big] + \log \mathbb{E}_{p(\mathbf{f})}\big[\mathcal{N}(\tilde{\mathbf{Y}} \mid \mathbf{f}, \tilde{\mathbf{V}})\big]. \qquad (9)$$

The first term is the expected log likelihood, the second is the expected log approximate likelihood, and the final term is the log normalising constant of the approximate posterior, $\log p(\tilde{\mathbf{Y}}) = \log \mathbb{E}_{p(\mathbf{f})}\big[\mathcal{N}(\tilde{\mathbf{Y}} \mid \mathbf{f}, \tilde{\mathbf{V}})\big]$. Naïvely evaluating $\mathcal{L}_{\text{VGP}}$ requires $\mathcal{O}(N^3)$ computation for both the update to $q(\mathbf{f})$ and the marginal likelihood. We now show how to compute this with linear scaling in $N_t$.

We observe that after a natural gradient update, $\tilde{\mathbf{V}}$, the approximate likelihood covariance, has the same form as the gradient $\partial g / \partial \mathbf{P}$ because, as seen in Eq. (7), $\tilde{\boldsymbol{\lambda}}$ is only additively updated by $\partial g / \partial \boldsymbol{\mu}$. Since the expected likelihood, $\mathbb{E}_{q(\mathbf{f})}[\log p(\mathbf{Y} \mid \mathbf{f})]$, factorises across observations, $\partial g / \partial \mathbf{P}$ is diagonal and hence so is $\tilde{\mathbf{V}}$. The approximate likelihood therefore factorises in the same way as the true one:

$$\log \mathcal{N}(\tilde{\mathbf{Y}} \mid \mathbf{f}, \tilde{\mathbf{V}}) = \sum_{n=1}^{N_t} \sum_{k=1}^{N_s} \log \mathcal{N}(\tilde{Y}_{n,k} \mid f_{n,k}, \tilde{V}_{n,k}). \qquad (10)$$

We now turn our attention to computing the posterior and the marginal likelihood. Recall that if the kernel is separable between time and space, $\kappa(t, \mathbf{s}, t', \mathbf{s}') = \kappa_t(t, t')\, \kappa_s(\mathbf{s}, \mathbf{s}')$, then the GP prior can be re-written as Eq. (3). This separability property further results in the state-space model matrices having a convenient Kronecker structure,

$$\bar{\mathbf{f}}(t_{n+1}) = \big[\mathbf{I}_{N_s} \otimes \mathbf{A}^{(t)}_n\big]\, \bar{\mathbf{f}}(t_n) + \mathbf{q}_n, \qquad \tilde{\mathbf{Y}}_n \mid \bar{\mathbf{f}}(t_n) \sim p\big(\tilde{\mathbf{Y}}_n \mid \mathbf{H}\, \bar{\mathbf{f}}(t_n)\big), \qquad (11)$$

where $\mathbf{q}_n \sim \mathcal{N}(\mathbf{0}, \mathbf{K}^{(s)}_{\mathbf{SS}} \otimes \mathbf{Q}^{(t)}_n)$ and $\mathbf{H} = \mathbf{I}_{N_s} \otimes \mathbf{H}^{(t)}$. Here $\mathbf{A}^{(t)}_n \in \mathbb{R}^{d_t \times d_t}$, $\mathbf{Q}^{(t)}_n \in \mathbb{R}^{d_t \times d_t}$, and $\mathbf{H}^{(t)} \in \mathbb{R}^{1 \times d_t}$ are the transition matrix, process noise covariance, and measurement model of the SDE (see Sec. 2.1) induced by the kernel $\kappa_t(\cdot,\cdot)$, respectively. Because the GP prior is Markov and the approximate likelihood factorises across time, the approximate GP posterior is also Markov [45]. Hence the marginals $q(\mathbf{f}_n)$ can be computed through linear filtering and smoothing applied to Eq. (11). Furthermore, the marginal likelihood of a linear Gaussian state-space model, $p(\tilde{\mathbf{Y}}) = p(\tilde{\mathbf{Y}}_1) \prod_{n=2}^{N_t} p(\tilde{\mathbf{Y}}_n \mid \tilde{\mathbf{Y}}_{1:n-1})$, can be computed sequentially by running the forward filter, since $p(\tilde{\mathbf{Y}}_n \mid \tilde{\mathbf{Y}}_{1:n-1}) = \int p(\tilde{\mathbf{Y}}_n \mid \mathbf{H}\bar{\mathbf{f}}(t_n))\, p(\bar{\mathbf{f}}(t_n) \mid \tilde{\mathbf{Y}}_{1:n-1})\, \mathrm{d}\bar{\mathbf{f}}(t_n)$, where $p(\bar{\mathbf{f}}(t_n) \mid \tilde{\mathbf{Y}}_{1:n-1})$ is the predictive filtering distribution. By combining all of the above properties we can now write the ELBO as

$$\mathcal{L}_{\text{VGP}} = \sum_{n=1}^{N_t} \sum_{k=1}^{N_s} \mathbb{E}_{q(f_{n,k})}\big[\log p(Y_{n,k} \mid f_{n,k})\big] - \sum_{n=1}^{N_t} \sum_{k=1}^{N_s} \mathbb{E}_{q(f_{n,k})}\big[\log \mathcal{N}(\tilde{Y}_{n,k} \mid f_{n,k}, \tilde{V}_{n,k})\big] + \sum_{n=1}^{N_t} \log \mathbb{E}_{p(\bar{\mathbf{f}}(t_n) \mid \tilde{\mathbf{Y}}_{1:n-1})}\big[\mathcal{N}(\tilde{\mathbf{Y}}_n \mid \mathbf{H}\bar{\mathbf{f}}(t_n), \tilde{\mathbf{V}}_n)\big]. \qquad (12)$$

**Algorithm 1: Spatio-temporal sparse VGP.**
Input: data $\{\mathbf{X}, \mathbf{Y}\}$; initial parameters $\{\tilde{\mathbf{Y}}, \tilde{\mathbf{V}}\}$; learning rates $\{\beta, \rho\}$.
While the ELBO $\mathcal{L}$ has not converged:
1. CVI natural gradient step: $q(\mathbf{u}), \ell = \text{Alg. 2}(\tilde{\mathbf{Y}}, \tilde{\mathbf{V}})$; $E = \mathbb{E}_{q(\mathbf{u})}\big[\mathbb{E}_{p(\mathbf{f}\mid\mathbf{u})}[\log p(\mathbf{Y}\mid\mathbf{f})]\big]$; $\tilde{\boldsymbol{\lambda}} = (1-\beta)\tilde{\boldsymbol{\lambda}} + \beta\,\partial E/\partial\boldsymbol{\mu}$; $\tilde{\mathbf{V}} = (-2\tilde{\boldsymbol{\lambda}}^{(2)})^{-1}$, $\tilde{\mathbf{Y}} = \tilde{\mathbf{V}}\tilde{\boldsymbol{\lambda}}^{(1)}$.
2. Hyperparameter gradient step: $q(\mathbf{u}), \ell = \text{Alg. 2}(\tilde{\mathbf{Y}}, \tilde{\mathbf{V}})$; $E = \mathbb{E}_{q(\mathbf{u})}\big[\mathbb{E}_{p(\mathbf{f}\mid\mathbf{u})}[\log p(\mathbf{Y}\mid\mathbf{f})]\big]$; ELBO $\mathcal{L} = E - \mathbb{E}_{q(\mathbf{u})}[\log\mathcal{N}(\tilde{\mathbf{Y}}\mid\mathbf{u},\tilde{\mathbf{V}})] + \ell$; $\boldsymbol{\theta} = \boldsymbol{\theta} + \rho\,\partial\mathcal{L}/\partial\boldsymbol{\theta}$.

**Algorithm 2: Sparse spatio-temporal smoothing.**
Input: likelihood $\{\tilde{\mathbf{Y}}, \tilde{\mathbf{V}}\}$; space prior $\{\mathbf{K}^{(s)}_{\mathbf{Z}_s\mathbf{Z}_s}\}$; time prior $\{\mathbf{A}^{(t)}, \mathbf{Q}^{(t)}, \mathbf{H}^{(t)}\}$.
1. Construct model matrices: $\mathbf{A}_n = \mathbf{I}_{M_s}\otimes\mathbf{A}^{(t)}_n$, $\mathbf{Q}_n = \mathbf{K}^{(s)}_{\mathbf{Z}_s\mathbf{Z}_s}\otimes\mathbf{Q}^{(t)}_n$, $\mathbf{H} = \mathbf{I}_{M_s}\otimes\mathbf{H}^{(t)}$.
2. Filtering and smoothing: if using the parallel filter/smoother, $q(\mathbf{u}), \ell = \text{Alg. 4}(\tilde{\mathbf{Y}}, \tilde{\mathbf{V}}, \mathbf{A}, \mathbf{Q}, \mathbf{H})$; else $q(\mathbf{u}), \ell = \text{Alg. 3}(\tilde{\mathbf{Y}}, \tilde{\mathbf{V}}, \mathbf{A}, \mathbf{Q}, \mathbf{H})$.
3. Return the posterior marginals and log likelihood: $q(\mathbf{u}), \ell$.

This ELBO can be computed with linear scaling in $N_t$: $\mathcal{O}(N_t N_s^3 d_t^3)$.
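To illustrate how the final term of Eq. (12) is obtained in practice, the sketch below builds the Kronecker-structured state-space matrices of Eq. (11) (its sparse analogue in Sec. 3.3 simply uses $\mathbf{K}^{(s)}_{\mathbf{Z}_s\mathbf{Z}_s}$ in place of $\mathbf{K}^{(s)}_{\mathbf{SS}}$) and runs a sequential Kalman filter against the Gaussian pseudo-observations $(\tilde{\mathbf{Y}}, \tilde{\mathbf{V}})$, accumulating the one-step predictive log densities. This is a minimal sketch assuming evenly spaced time steps and a pseudo-observation at every step; it is not the paper's implementation, and the function names are illustrative.

```python
import jax.numpy as jnp
from jax import lax

def build_spatio_temporal_ssm(Ks, At, Qt, Ht):
    """Kronecker-structured matrices of Eq. (11); Ks is the spatial Gram matrix."""
    Ns = Ks.shape[0]
    A = jnp.kron(jnp.eye(Ns), At)
    Q = jnp.kron(Ks, Qt)
    H = jnp.kron(jnp.eye(Ns), Ht)
    return A, Q, H

def filter_log_marginal(A, Q, H, Y_tilde, V_tilde, m0, P0):
    """Kalman filter over the pseudo-observations; returns the sum of
    log p(Y~_n | Y~_{1:n-1}), i.e. the final term of Eq. (12)."""
    def step(carry, inp):
        m, P = carry
        y, V = inp
        mp, Pp = A @ m, A @ P @ A.T + Q               # predict
        S = H @ Pp @ H.T + V                          # innovation covariance
        K = jnp.linalg.solve(S.T, H @ Pp).T           # Kalman gain
        r = y - H @ mp                                # innovation
        m_new = mp + K @ r
        P_new = Pp - K @ S @ K.T
        _, logdet = jnp.linalg.slogdet(S)
        ll = -0.5 * (logdet + r @ jnp.linalg.solve(S, r)
                     + r.size * jnp.log(2.0 * jnp.pi))
        return (m_new, P_new), ll
    _, lls = lax.scan(step, (m0, P0), (Y_tilde, V_tilde))
    return jnp.sum(lls)
```

The marginals used in the first two ELBO terms additionally require a backward (Rauch–Tung–Striebel) smoothing pass, omitted here for brevity.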
We now show that the natural gradient step for updating the parameters of $\mathcal{N}(\tilde{\mathbf{Y}} \mid \mathbf{f}, \tilde{\mathbf{V}})$ can be computed with the same complexity.

### 3.2 Efficient Natural Gradient Updates

As discussed in Sec. 2.2, a natural gradient update to the posterior, $q(\mathbf{f}) \propto p(\mathbf{f})\, \mathcal{N}(\tilde{\mathbf{Y}} \mid \mathbf{f}, \tilde{\mathbf{V}})$, has superior convergence properties to gradient descent, and is completely characterised by an update to the approximate likelihood, $\mathcal{N}(\tilde{\mathbf{Y}} \mid \mathbf{f}, \tilde{\mathbf{V}})$, whose mean and covariance are the free parameters of the model and implicitly define the same variational parameters as VGP. Since the likelihood factorises across the data points, these updates only require computation of the marginal distribution $q(f_{n,k})$ to obtain $\mathbb{E}_{q(f_{n,k})}[\log p(Y_{n,k} \mid f_{n,k})]$ and its gradients. As we have shown, computation of the marginal posterior amounts to smoothing over the state, $\bar{\mathbf{f}} \sim \mathcal{N}(\bar{\mathbf{f}} \mid \bar{\mathbf{m}}, \bar{\mathbf{P}})$, with the model in Eq. (11). The time marginals are given by applying the measurement model to the state, $q(\mathbf{f}_n) = \mathcal{N}(\mathbf{f}_n \mid \mathbf{m}_n = \mathbf{H}\bar{\mathbf{m}}_n,\ \mathbf{P}_n = \mathbf{H}\bar{\mathbf{P}}_n\mathbf{H}^\top)$, after which $q(f_{n,k}) = \int q(\mathbf{f}_n)\, \mathrm{d}\mathbf{f}_{n,\neq k}$ can be obtained by integrating out the other spatial points. Given the marginal, Eq. (7) can be used to give the new likelihood parameters $\tilde{\mathbf{Y}}$ and $\tilde{\mathbf{V}}$. The full learning algorithm iterates this process alternately with a hyperparameter update via gradient descent applied to the ELBO, Eq. (12), and has computational complexity $\mathcal{O}(N_t N_s^3 d_t^3)$. We call this method the spatio-temporal variational GP (ST-VGP).

### 3.3 Spatial Sparsity: from $\mathcal{O}(N_t N_s^3 d_t^3)$ to $\mathcal{O}(N_t M_s^3 d_t^3)$

We now introduce spatial inducing points, $\mathbf{Z}_s$, in order to reduce the effective dimensionality of the state-space model. Whilst we maintain the same notation for consistency, it should be noted that the sparse model no longer requires the data to be on a spatio-temporal grid, only that the inducing points are. In this case, letting $q(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \mathbf{m}^{(u)}, \mathbf{P}^{(u)})$ be the sparse variational posterior, the marginal $q(\mathbf{f}_n) = \mathcal{N}(\mathbf{f}_n \mid \mathbf{m}_n, \mathbf{P}_n)$ only depends on $\mathbf{m}^{(u)}_n, \mathbf{P}^{(u)}_n$ due to the conditional independence property for separable kernels discussed in Sec. 2.1. We compute the posterior $q(\mathbf{u})$ via filtering and smoothing over the state $\bar{\mathbf{f}}(t)$ in a similar way to ST-VGP by setting

$$\mathbf{A}_n = \mathbf{I}_{M_s} \otimes \mathbf{A}^{(t)}_n, \qquad \mathbf{Q}_n = \mathbf{K}^{(s)}_{\mathbf{Z}_s\mathbf{Z}_s} \otimes \mathbf{Q}^{(t)}_n, \qquad \mathbf{H} = \mathbf{I}_{M_s} \otimes \mathbf{H}^{(t)}. \qquad (13)$$

Alg. 2 gives the smoothing algorithm. However, the natural gradient update, Eq. (7), now becomes

$$\tilde{\boldsymbol{\lambda}} \leftarrow (1 - \beta)\, \tilde{\boldsymbol{\lambda}} + \beta\, \frac{\partial\, \mathbb{E}_{q(\mathbf{u})}\big[\mathbb{E}_{p(\mathbf{f} \mid \mathbf{u})}[\log p(\mathbf{Y} \mid \mathbf{f})]\big]}{\partial \boldsymbol{\mu}^{(u)}}, \qquad (14)$$

which results in $\tilde{\boldsymbol{\lambda}}^{(2)}_n$, and hence also $\tilde{\mathbf{V}}_n$, being a dense matrix (i.e., $\tilde{\mathbf{V}}$ is block diagonal) due to the conditional mapping $p(\mathbf{f}_n \mid \mathbf{u}_n)$. Therefore the approximate likelihood for the sparse model factorises across time, but not space (see App. J for details): $\log \mathcal{N}(\tilde{\mathbf{Y}} \mid \mathbf{u}, \tilde{\mathbf{V}}) = \sum_{n=1}^{N_t} \log \mathcal{N}(\tilde{\mathbf{Y}}_n \mid \mathbf{u}_n, \tilde{\mathbf{V}}_n)$.

Figure 2: (a) Log wall-clock time in seconds, including any startup costs, across 7 synthetic spatio-temporal datasets with an increasing number of time steps, for ST-SVGP, MF-ST-SVGP, SVGP, and SKI (average across 5 runs). (b) Negative ELBO during training for the small-scale NYC-CRIME dataset, for ST-SVGP and SVGP with inducing points Z either held fixed or trained.
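The ingredient the natural gradient step needs at each data point is the expected log-likelihood under the marginal $q(f_{n,k})$ and its gradients with respect to the marginal mean and variance. The sketch below computes these for a Poisson likelihood (as used for the count data in Sec. 5) with Gauss–Hermite quadrature and automatic differentiation; it is an illustrative sketch rather than the paper's code, and the quadrature order and function names are assumptions.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.scipy.special import gammaln

# Gauss-Hermite nodes and weights (static, so plain numpy is fine).
GH_X, GH_W = np.polynomial.hermite.hermgauss(20)

def expected_log_lik(m, v, y):
    """E_{N(f | m, v)}[log Poisson(y | exp(f))] for a single (time, space) point."""
    f = m + jnp.sqrt(2.0 * v) * GH_X                  # quadrature nodes under N(m, v)
    log_p = y * f - jnp.exp(f) - gammaln(y + 1.0)     # Poisson log-density with log link
    return jnp.sum(GH_W * log_p) / jnp.sqrt(jnp.pi)

# Gradients w.r.t. the marginal mean and variance, as required by Eq. (7) / Eq. (14):
dE_dm, dE_dv = jax.grad(expected_log_lik, argnums=(0, 1))(1.2, 0.3, 4.0)
```

These gradients are exactly the quantities passed to the site update sketched after Sec. 2.2.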
**The Spatio-Temporal Sparse VGP ELBO** Adding inducing points in space is equivalent to placing the inducing points on a spatio-temporal grid (i.e., inducing points exist at all time steps), and hence the variational objective directly follows from $\mathcal{L}_{\text{SVGP}}$ using a similar argument to Sec. 3.1:

$$\mathcal{L}_{\text{ST-SVGP}} = \mathbb{E}_{q(\mathbf{f},\mathbf{u})}\left[\log \frac{p(\mathbf{Y} \mid \mathbf{f})\, p(\mathbf{f} \mid \mathbf{u})\, p(\mathbf{u}) \int \mathcal{N}(\tilde{\mathbf{Y}} \mid \mathbf{u}, \tilde{\mathbf{V}})\, p(\mathbf{u})\, \mathrm{d}\mathbf{u}}{\mathcal{N}(\tilde{\mathbf{Y}} \mid \mathbf{u}, \tilde{\mathbf{V}})\, p(\mathbf{f} \mid \mathbf{u})\, p(\mathbf{u})}\right]$$
$$= \sum_{n=1}^{N_t} \sum_{k=1}^{N_s} \mathbb{E}_{q(\mathbf{u}_n)}\Big[\mathbb{E}_{p(f_{n,k} \mid \mathbf{u}_n)}\big[\log p(Y_{n,k} \mid f_{n,k})\big]\Big] - \sum_{n=1}^{N_t} \mathbb{E}_{q(\mathbf{u}_n)}\big[\log \mathcal{N}(\tilde{\mathbf{Y}}_n \mid \mathbf{u}_n, \tilde{\mathbf{V}}_n)\big] + \sum_{n=1}^{N_t} \log \mathbb{E}_{p(\bar{\mathbf{f}}(t_n) \mid \tilde{\mathbf{Y}}_{1:n-1})}\big[\mathcal{N}(\tilde{\mathbf{Y}}_n \mid \mathbf{H}\bar{\mathbf{f}}(t_n), \tilde{\mathbf{V}}_n)\big], \qquad (15)$$

where the final term is given by the forward filter.

**Efficient Natural Gradient Updates** The marginal required to compute the ELBO and natural gradient, $q(f_{n,k}) = \iint p(\mathbf{f} \mid \mathbf{u})\, q(\mathbf{u})\, \mathrm{d}\mathbf{u}\, \mathrm{d}\mathbf{f}_{\neq n,k} = \iint p(\mathbf{f} \mid \mathbf{u}_n)\, q(\mathbf{u}_n)\, \mathrm{d}\mathbf{u}_n\, \mathrm{d}\mathbf{f}_{\neq n,k}$, is the predictive distribution at input $\mathbf{X}_{n,k}$ from the posterior $q(\mathbf{u})$. Because the inducing points have only been placed in space, this can be simplified through the Kronecker structure given by the state-space model. As shown in App. I, the marginal mean and covariance are

$$\mathbf{m}_{n,k} = \mathbf{K}^{(s)}_{\mathbf{S}_k\mathbf{Z}_s} \mathbf{K}^{(s)\,-1}_{\mathbf{Z}_s\mathbf{Z}_s}\, \mathbf{m}^{(u)}_n, \qquad \mathbf{P}_{n,k} = \mathbf{K}^{(s)}_{\mathbf{S}_k\mathbf{Z}_s} \mathbf{K}^{(s)\,-1}_{\mathbf{Z}_s\mathbf{Z}_s}\, \mathbf{P}^{(u)}_n\, \mathbf{K}^{(s)\,-1}_{\mathbf{Z}_s\mathbf{Z}_s} \mathbf{K}^{(s)}_{\mathbf{Z}_s\mathbf{S}_k} + \mathbf{K}^{(t)}_{\mathbf{X}_{n,k}\mathbf{X}_{n,k}} \Big(\mathbf{K}^{(s)}_{\mathbf{S}_k\mathbf{S}_k} - \mathbf{K}^{(s)}_{\mathbf{S}_k\mathbf{Z}_s} \mathbf{K}^{(s)\,-1}_{\mathbf{Z}_s\mathbf{Z}_s} \mathbf{K}^{(s)}_{\mathbf{Z}_s\mathbf{S}_k}\Big), \qquad (16)$$

where $\mathbf{m}^{(u)}_n = \mathbf{H}\bar{\mathbf{m}}_n$ and $\mathbf{P}^{(u)}_n = \mathbf{H}\bar{\mathbf{P}}_n\mathbf{H}^\top$ are given by filtering and smoothing. By combining the above properties we see that all the terms required for the natural gradient updates and hyperparameter learning can be computed efficiently in $\mathcal{O}(N_t M_s^3 d_t^3)$. We call this approach the spatio-temporal sparse variational GP (ST-SVGP). The full algorithm is given in Alg. 1.

## 4 Further Improving the Temporal and Spatial Scaling

We now propose two approaches to further improve the computational properties of ST-VGP and ST-SVGP. First, we show how parallel filtering and smoothing can be used for non-conjugate GP inference, which results in a theoretical span complexity of $\mathcal{O}(\log N_t\, d^3)$. We then present a spatial mean-field approximation, which can be used independently or in combination with sparsity.

### 4.1 Parallel Bayesian Filtering and Smoothing

The associative scan algorithm [9] uses a divide-and-conquer approach combined with parallelisation to convert $N$ sequential associative operations into $\log N$ sequential steps (for an operator $\bullet$, associativity implies $(a \bullet b) \bullet c = a \bullet (b \bullet c)$). This algorithm has been made applicable to conjugate Markov GPs by deriving a new form of the Kalman filtering and smoothing operations that are associative [41, 13]. We give the form of these associative operators in App. E and show, for the first time, how these methods can be adapted to the non-conjugate setting. This follows directly from the use of the CVI approach to natural gradient VI, which requires only conjugate computations, i.e., the linear filter and smoother. Consider the ST-SVGP ELBO:

$$\mathcal{L}_{\text{ST-SVGP}} = \underbrace{\mathbb{E}_{q(\mathbf{u})}\Big[\mathbb{E}_{p(\mathbf{f} \mid \mathbf{u})}\big[\log p(\mathbf{Y} \mid \mathbf{f})\big]\Big]}_{\text{factorises across time and space, compute in parallel}} - \underbrace{\mathbb{E}_{q(\mathbf{u})}\big[\log \mathcal{N}(\tilde{\mathbf{Y}} \mid \mathbf{u}, \tilde{\mathbf{V}})\big]}_{\text{factorises across time, compute in parallel}} + \underbrace{\log \mathbb{E}_{p(\mathbf{u})}\big[\mathcal{N}(\tilde{\mathbf{Y}} \mid \mathbf{u}, \tilde{\mathbf{V}})\big]}_{\text{compute with parallel filter}}.$$

The first two terms can be computed in parallel, since they decompose across time given the marginals $q(\mathbf{u}_n)$. The final term can be computed via the parallel filter, and the required marginals $q(\mathbf{u}_n)$ via the parallel smoother, which makes ST-SVGP a highly parallelisable algorithm.

Figure 3: Observations of PM10 at site HK6 (Hackney, Old Street), showing rich short-scale structure (top). Mean and 95% confidence of ST-SVGP trained with 30 spatial inducing points (totalling 64,770 inducing points) and SVGP with 2000 inducing points (minibatch size 100). Both models have similar training times. ST-SVGP captures the complex structure of the time series whereas SVGP smooths the data.
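To give a flavour of how associativity yields logarithmic span, the sketch below composes the affine-Gaussian maps $\bar{\mathbf{f}}(t_{n+1}) = \mathbf{A}\,\bar{\mathbf{f}}(t_n) + \mathbf{q}_n$ with `jax.lax.associative_scan`, recovering the prior marginals $\bar{\mathbf{m}}_n, \bar{\mathbf{P}}_n$ in $\mathcal{O}(\log N_t)$ span. This only illustrates the principle, under the assumption of constant $\mathbf{A}$ and $\mathbf{Q}$; the parallel filter and smoother actually used (Alg. 4, [41]) compose richer operators that also condition on the pseudo-observations (App. E).

```python
import jax
import jax.numpy as jnp

def affine_compose(e1, e2):
    """Associative composition of affine-Gaussian maps x -> A x + b + noise(0, Q),
    applying e1 first and then e2 (batched over a leading axis, as required by
    jax.lax.associative_scan)."""
    A1, b1, Q1 = e1
    A2, b2, Q2 = e2
    A = A2 @ A1
    b = jnp.einsum('...ij,...j->...i', A2, b1) + b2
    Q = A2 @ Q1 @ jnp.swapaxes(A2, -1, -2) + Q2
    return A, b, Q

def prior_marginals_parallel(A, Q, m0, P0, num_steps):
    """Prior marginal means/covariances of the state-space model via a parallel scan."""
    d = m0.shape[0]
    # Element 0 encodes the initial distribution; elements 1..N the transitions.
    As = jnp.concatenate([jnp.zeros((1, d, d)), jnp.tile(A, (num_steps, 1, 1))])
    bs = jnp.concatenate([m0[None], jnp.zeros((num_steps, d))])
    Qs = jnp.concatenate([P0[None], jnp.tile(Q, (num_steps, 1, 1))])
    _, ms, Ps = jax.lax.associative_scan(affine_compose, (As, bs, Qs))
    return ms, Ps   # ms[n], Ps[n] are the prior marginals at step n
```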
Alg. 4 gives the filtering and smoothing algorithm, and Alg. 2 shows how this method can be used in place of the sequential filter when performing inference. One drawback of the parallel filter is that when the state dimension is large, many of the available computational resources may be consumed by the arithmetic operations involved in a single filtering step, and the logarithmic scaling may be lost. Fortunately, the spatial mean-field approximation presented in Sec. 4.2 helps to alleviate this issue. In App. E we provide more details on the method as well as a detailed examination of its practical properties.

### 4.2 Spatial Mean-Field Approximation

We reconsider the state-space model for the spatio-temporal GP derived in Sec. 2.1, which has process noise $\mathbf{q}_n \sim \mathcal{N}(\mathbf{0}, \mathbf{K}^{(s)}_{\mathbf{SS}} \otimes \mathbf{Q}^{(t)}_n)$. This Kronecker structure implies that $N_s$ independent processes are linearly mixed using the spatial covariance $\mathbf{K}^{(s)}_{\mathbf{SS}}$. The linearity of this operation makes it possible to reformulate the model to include the mixing as part of the measurement, rather than the prior:

$$\bar{\mathbf{f}}(t_{n+1}) = \mathbf{A}_n\, \bar{\mathbf{f}}(t_n) + \mathbf{q}_n, \qquad \mathbf{Y}_n \mid \bar{\mathbf{f}}(t_n) \sim p\big(\mathbf{Y}_n \mid [\mathbf{C}^{(s)}_{\mathbf{SS}} \otimes \mathbf{H}^{(t)}]\, \bar{\mathbf{f}}(t_n)\big), \qquad (17)$$

where $\mathbf{q}_n \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_{N_s} \otimes \mathbf{Q}^{(t)}_n)$ and $\mathbf{C}^{(s)}_{\mathbf{SS}}$ is the Cholesky factor of $\mathbf{K}^{(s)}_{\mathbf{SS}}$ (see App. F for the derivation). This has the benefit that now both $\mathbf{A}_n$ and $\mathbf{Q}_n = \mathbf{I}_{N_s} \otimes \mathbf{Q}^{(t)}_n$ are block diagonal, such that under the prior the latent processes are fully independent. This enables a mean-field assumption between the $N_s$ latent posterior processes: $q(\bar{\mathbf{f}}(t)) \approx \prod_{k=1}^{N_s} q(\bar{\mathbf{f}}_k(t))$, where $\bar{\mathbf{f}}_k(t)$ is the $d_t$-dimensional state corresponding to spatial point $\mathbf{S}_k$. This approximation enforces block-diagonal structure in the state covariance, such that matrix operations acting on the full state can be avoided. Dependence between the latent processes is still captured via the measurement model (likelihood), and our experiments show that this approach still approximates the true posterior well (see Sec. 5 and App. K.6) whilst providing significant computational gains when $N_s$ is large.
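The reformulation of Eq. (17) amounts to a small change in how the model matrices are assembled, sketched below (illustrative helper name; compare with the construction used for Eq. (11)). Under the mean-field posterior, only the $N_s$ diagonal $d_t \times d_t$ blocks of the state covariance then need to be stored and updated.

```python
import jax.numpy as jnp

def mean_field_ssm(Kss, At, Qt, Ht):
    """State-space matrices of the reformulated model in Eq. (17): the spatial
    mixing C_ss = chol(K_ss) moves into the measurement model, leaving the prior
    as Ns independent d_t-dimensional processes (block-diagonal A and Q)."""
    Ns = Kss.shape[0]
    Css = jnp.linalg.cholesky(Kss)
    A = jnp.kron(jnp.eye(Ns), At)   # block diagonal: independent latent dynamics
    Q = jnp.kron(jnp.eye(Ns), Qt)   # block diagonal prior process noise
    H = jnp.kron(Css, Ht)           # spatial correlations enter through the measurement
    return A, Q, H
```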
## 5 Experiments

We examine the scalability and performance of ST-VGP and its variants. Throughout, we use a Matérn-3/2 kernel and optimise the hyperparameters by maximising the ELBO using Adam [32]. We use learning rates of $\rho = 0.01$, $\beta = 1$ in the conjugate case, and $\rho = 0.01$, $\beta = 0.1$ in the non-conjugate case. We use 5-fold cross-validation (i.e., an 80/20 train–test split), train for 500 iterations (except for AIR-QUALITY, where we train for 300), and report RMSE, negative log predictive density (NLPD, see App. K.1), and average per-iteration training times on CPU and GPU. When using a GPU, the parallel filter and smoother are used.

Table 1: NYC-CRIME (small) results. ST-SVGP = SVGP when Z is fixed.

| Train Z | Model   | RMSE        | NLPD        |
|---------|---------|-------------|-------------|
| fixed   | ST-SVGP | 3.02 ± 0.13 | 1.72 ± 0.04 |
| fixed   | SVGP    | 3.02 ± 0.13 | 1.72 ± 0.04 |
| trained | ST-SVGP | 2.79 ± 0.15 | 1.64 ± 0.04 |
| trained | SVGP    | 2.94 ± 0.12 | 1.65 ± 0.05 |

Table 2: Test performance on the NYC-CRIME (large) count dataset, with matched average run time in seconds.

| Model      | Time (CPU)   | Time (GPU)  | RMSE        | NLPD        |
|------------|--------------|-------------|-------------|-------------|
| ST-SVGP    | 20.86 ± 0.46 | 0.61 ± 0.00 | 2.77 ± 0.06 | 1.66 ± 0.02 |
| MF-ST-SVGP | 20.69 ± 0.86 | 0.32 ± 0.00 | 2.75 ± 0.04 | 1.63 ± 0.02 |
| SVGP-1500  | 12.67 ± 0.11 | 0.13 ± 0.00 | 3.20 ± 0.14 | 1.82 ± 0.05 |
| SVGP-3000  | 80.80 ± 3.42 | 0.45 ± 0.01 | 3.02 ± 0.18 | 1.76 ± 0.05 |

**Synthetic Experiment** We construct 7 toy datasets with rich temporal structure but smooth spatial structure (see App. K.2) and varying size, $N_t = 10, 82, 155, 227, 300, 500, 1000$, and construct a $500 \times 100$ grid that serves as a test set for all cases. As the dataset size increases we expect the predictive performance of all methods to improve at the expense of run time. We compare against SKI and SVGP (see Sec. 1.1). Fig. 2a shows that whilst SVGP becomes infeasible for more than 300 time steps, the ST-SVGP variants scale linearly with time and are faster than SKI (except for the very small datasets, in which the model compile time in JAX dominates). In App. K we show the test performance of all models.

**NYC-CRIME Count Dataset** We model crime numbers across New York City, USA (NYC), using daily complaint data from [1]. Crime data has seasonal trends and is spatially dependent. Accurate modelling can lead to more efficient allocation of police resources [20, 3]. We first consider a small subset of the data to highlight when our methods exactly recover SVGP. We bin the data from the 1st to the 10th of January 2014 ($N_t = 10$) into a spatial grid of size $30 \times 30$ and drop cells that do not intersect with land ($N_s = 447$, $N = 4470$). We run ST-SVGP and SVGP with inducing points initialised to the same locations. We plot the training ELBO in Fig. 2b and performance in Table 1. For fixed inducing points, both models have the same training curve and provide the same predictions (up to numerical differences). Optimising the inducing points improves both methods. A comparable inference method for non-conjugate likelihoods has not yet been developed for SKI. We next consider observations from January to July 2014, with daily binning ($N_t = 182$) and the same spatial grid ($N_s = 447$, $N = 81{,}354$). We run ST-SVGP and its mean-field variant (MF-ST-SVGP) with 30 spatial inducing points (equivalent to SVGP with $M = 30 \times 182 = 5460$). Table 2 shows that our methods outperform SVGP (with $M = 1500$, $M = 3000$ and batch sizes 1500, 3000 respectively) because they can afford more inducing points for the same computational budget.

**Regression: AIR-QUALITY** Finally, we model PM10 (µg/m³) air quality across London, UK. The measurements exhibit periodic fluctuations and highly irregular behaviour due to events like weather and traffic jams. Using hourly data from the London air quality network [29] between January 2019 and April 2019 ($N_t = 2159$), we drop sensors that are not within the London boundaries or that have more than 40% missing data ($N_s = 72$, $N = 155{,}448$). We run ST-SVGP and MF-ST-SVGP with 30 inducing points in space (equivalent to SVGP with $M = 30 \times 2159 = 64{,}770$ inducing points). To ensure the run times are comparable on both CPU and GPU, we run SVGP with 2000, 2500, 5000, and 8000 inducing points with mini-batch sizes of 600, 800, 2000, and 3000 respectively. We run SKI with $N_t$ temporal inducing points and 6 inducing points in each spatial dimension.

Table 3: AIR-QUALITY regression. ST-SVGP fits the fast-varying structure well, whereas SVGP smooths the data. Average run time and standard deviation in seconds are shown for a single training step. ST-SVGP and MF-ST-SVGP use 30 spatial inducing points, equivalent to SVGP with 30 × 2159 = 64,770 inducing points. The number of inducing points is chosen to make the run times comparable.

| Model (batch size) | Time (CPU)   | Time (GPU)  | RMSE         | NLPD         | Notes                      |
|--------------------|--------------|-------------|--------------|--------------|----------------------------|
| ST-SVGP            | 16.79 ± 0.63 | 4.47 ± 0.01 | 9.96 ± 0.56  | 8.29 ± 0.80  | full spatio-temporal model |
| MF-ST-SVGP         | 13.74 ± 0.49 | 0.85 ± 0.01 | 10.42 ± 0.63 | 8.52 ± 0.91  | with mean-field assumption |
| SVGP-2000 (600)    | 20.21 ± 0.28 | 0.17 ± 0.00 | 15.46 ± 0.44 | 12.93 ± 0.95 | baseline                   |
| SVGP-2500 (800)    | 40.90 ± 1.11 | 0.25 ± 0.00 | 15.53 ± 1.09 | 13.48 ± 1.85 | baseline                   |
| SVGP-5000 (2000)   | –            | 1.19 ± 0.00 | 14.20 ± 0.44 | 12.73 ± 0.73 | baseline                   |
| SVGP-8000 (3000)   | –            | 4.09 ± 0.01 | 13.83 ± 0.47 | 12.40 ± 0.75 | baseline                   |
| SKI                | 23.36 ± 1.01 | 3.61 ± 0.01 | 12.01 ± 0.55 | 10.32 ± 0.79 | baseline                   |

Our methods significantly outperform the SVGP baselines because they can afford considerably more inducing points.
As shown in Fig. 3, SVGP drastically smooths the data, whereas ST-SVGP fits the short-term structure well. The mean-field approach is significantly more efficient, especially when using the parallel algorithm, but we do observe a slight reduction in prediction quality.

## 6 Conclusions and Discussion

We have shown that variational inference and spatio-temporal filtering can be combined in a principled manner, introducing an approach to GP inference for spatio-temporal data that maintains the benefits of variational GPs whilst exhibiting favourable scaling properties. Our experiments confirm that ST-SVGP outperforms baseline methods because the effective number of inducing points grows with the temporal horizon, without introducing a significant computational burden. Crucially, this leads to improved predictive performance, because fast-varying temporal information is captured by the model. We demonstrated how to apply parallel filtering and smoothing in the non-conjugate GP case, but our empirical analysis identified a maximum state dimension of around $d \approx 50$, after which the sub-linear temporal scaling is lost. However, our proposed spatial mean-field approach alleviates this issue somewhat, making the combined algorithm extremely efficient even when both the number of time steps and the number of spatial points are large. The resemblance of our framework to the VGP approach suggests many potential extensions, such as multi-task models [4] and deep GPs [16]. We provide JAX code for all methods at https://github.com/AaltoML/spatio-temporal-GPs.

### 6.1 Limitations and Societal Impact

We believe our work takes an important step towards allowing sophisticated GP models to be run on both resource-constrained CPU machines and powerful GPUs, greatly expanding the usability of such models whilst also reducing unnecessary consumption of resources. However, when using predictions from such methods, the limitations of the model assumptions and potential inaccuracies when using approximate inference should be kept in mind, especially in cases such as crime rate monitoring, where actions based on biased or incorrect predictions can have harmful societal consequences.

**Acknowledgments and Disclosure of Funding** This work was supported by the Academy of Finland Flagship programme: Finnish Center for Artificial Intelligence (FCAI). Parts of this project were done while OH was visiting Finland as part of the FCAI Turing initiative, supported by HIIT/FCAI. OH acknowledges funding from The Alan Turing Institute PhD fellowship programme. AS and WJW acknowledge funding from the Academy of Finland (grant numbers 324345 and 339730). NAL contributed under the NVIDIA AI Technology Center (NVAITC) Finland program. TD acknowledges support from EPSRC (EP/T004134/1), a UKRI Turing AI Fellowship (EP/V02678X/1), and the Lloyd's Register Foundation programme on Data Centric Engineering. We acknowledge the computational resources provided by the Aalto Science-IT project and CSC IT Center for Science, Finland.

## References

[1] 2014–2015 crimes reported in all 5 boroughs of New York City. https://www.kaggle.com/adamschroeder/crimes-new-york-city.
[2] V. Adam, S. Eleftheriadis, A. Artemev, N. Durrande, and J. Hensman. Doubly sparse variational Gaussian processes. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), volume 108 of Proceedings of Machine Learning Research, pages 2874–2884. PMLR, 2020.
[3] V. Aglietti, T. Damoulas, and E. V. Bonilla. Efficient inference in multi-task Cox process models. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), volume 89 of Proceedings of Machine Learning Research, pages 537–546. PMLR, 2019.
[4] M. A. Álvarez and N. D. Lawrence. Computationally efficient convolved multiple output Gaussian processes. Journal of Machine Learning Research, 12:1459–1500, July 2011.
[5] S.-i. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[6] S. Banerjee. Hierarchical Modeling and Analysis for Spatial Data. Chapman & Hall/CRC, Boca Raton, FL, 2004.
[7] M. Bauer, M. van der Wilk, and C. E. Rasmussen. Understanding probabilistic sparse Gaussian process approximations. In Advances in Neural Information Processing Systems 29 (NIPS), pages 1533–1541. Curran Associates, Inc., 2016.
[8] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
[9] G. E. Blelloch. Scans as primitive parallel operations. IEEE Transactions on Computers, 38(11):1526–1538, 1989.
[10] W. Bruinsma, E. Perim, W. Tebbutt, S. Hosking, A. Solin, and R. Turner. Scalable exact inference in multi-output Gaussian processes. In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, pages 1190–1201. PMLR, 2020.
[11] P. E. Chang, W. J. Wilkinson, M. E. Khan, and A. Solin. Fast variational learning in state-space Gaussian process models. In 30th IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, Espoo, Finland, 2020. IEEE.
[12] R. Condit. Tropical Forest Census Plots. Springer-Verlag and R. G. Landes Company, Berlin, Germany, and Georgetown, Texas, 1998.
[13] A. Corenflos, Z. Zhao, and S. Särkkä. Gaussian process regression in logarithmic time. arXiv preprint arXiv:2102.09964, 2021.
[14] N. Cressie. Statistics for Spatio-Temporal Data. Wiley, Hoboken, NJ, 2011.
[15] G. Da Prato and J. Zabczyk. Stochastic Equations in Infinite Dimensions, volume 45 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge, 1992.
[16] A. Damianou and N. Lawrence. Deep Gaussian processes. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 31 of Proceedings of Machine Learning Research, pages 207–215. PMLR, 2013.
[17] A. Datta, S. Banerjee, A. O. Finley, and A. E. Gelfand. Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111(514):800–812, 2016.
[18] N. Durrande, V. Adam, L. Bordeaux, S. Eleftheriadis, and J. Hensman. Banded matrix operators for Gaussian Markov models in the automatic differentiation era. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), volume 89 of Proceedings of Machine Learning Research, pages 2780–2789. PMLR, 2019.
[19] S. Flaxman, A. Wilson, D. Neill, H. Nickisch, and A. Smola. Fast Kronecker inference in Gaussian processes with non-Gaussian likelihoods. In Proceedings of the 32nd International Conference on Machine Learning (ICML), volume 37 of Proceedings of Machine Learning Research, pages 607–616. PMLR, 2015.
[20] S. Flaxman, M. Chirico, P. Pereira, and C. Loeffler. Scalable high-resolution forecasting of sparse spatiotemporal events with kernel methods: a winning solution to the NIJ "Real-time crime forecasting challenge".
arXiv preprint arXiv:1801.02858, 2019.
[21] J. Gardner, G. Pleiss, K. Q. Weinberger, D. Bindel, and A. G. Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Advances in Neural Information Processing Systems 31 (NeurIPS), pages 7576–7586. Curran Associates, Inc., 2018.
[22] A. E. Gelfand and E. M. Schliep. Spatial statistics and Gaussian processes: A beautiful marriage. Spatial Statistics, 18:86–104, 2016. Spatial Statistics Avignon: Emerging Patterns.
[23] O. Hamelijnck, W. J. Wilkinson, N. A. Loppi, A. Solin, and T. Damoulas. Data for NeurIPS 2021 submission: Spatio-Temporal Variational Gaussian Processes, 2021. URL https://doi.org/10.5281/zenodo.4531304.
[24] J. Hartikainen, J. Riihimäki, and S. Särkkä. Sparse spatio-temporal Gaussian processes with general likelihoods. In Artificial Neural Networks and Machine Learning – ICANN 2011, pages 193–200, Berlin, Heidelberg, 2011. Springer.
[25] J. Hensman, M. Rattray, and N. D. Lawrence. Fast variational inference in the conjugate exponential family. In Advances in Neural Information Processing Systems 25 (NIPS), pages 2888–2896. Curran Associates, Inc., 2012.
[26] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI), pages 282–290. AUAI Press, 2013.
[27] J. Hensman, A. Matthews, and Z. Ghahramani. Scalable variational Gaussian process classification. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 38 of Proceedings of Machine Learning Research, pages 351–360. PMLR, 2015.
[28] J. Hensman, N. Durrande, and A. Solin. Variational Fourier features for Gaussian processes. Journal of Machine Learning Research, 18(1), 2017.
[29] Imperial College London. Londonair: London air quality network (LAQN). https://www.londonair.org.uk, 2020.
[30] E. J. Keogh and M. J. Pazzani. An indexing scheme for fast similarity search in large time series databases. In Proceedings of the 11th International Conference on Scientific and Statistical Database Management (SSDBM '99), page 56, USA, 1999. IEEE Computer Society.
[31] M. E. Khan and W. Lin. Conjugate-computation variational inference: Converting variational inference in non-conjugate models to inferences in conjugate models. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54 of Proceedings of Machine Learning Research, pages 878–887. PMLR, 2017.
[32] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[33] K. Krauth, E. V. Bonilla, K. Cutajar, and M. Filippone. AutoGP: Exploring the capabilities and limitations of Gaussian process models. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence (UAI). AUAI Press, 2017.
[34] A. G. d. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6, 2017.
[35] H. Nickisch, A. Solin, and A. Grigorievskiy. State space Gaussian processes with non-Gaussian likelihood. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pages 3789–3798. PMLR, 2018.
[36] A. O'Hagan. A Markov property for covariance structures. Statistics Research Report, 98:13, 1998.
[37] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, Dec. 2005.
[38] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, USA, 2006.
[39] Y. Saatçi. Scalable Inference for Structured Gaussian Process Models. PhD thesis, University of Cambridge, 2011.
[40] H. Salimbeni, S. Eleftheriadis, and J. Hensman. Natural gradients in practice: Non-conjugate variational inference in Gaussian process models. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 84 of Proceedings of Machine Learning Research, pages 689–697. PMLR, 2018.
[41] S. Särkkä and A. F. García-Fernández. Temporal parallelization of Bayesian smoothers. IEEE Transactions on Automatic Control, 66(1):299–306, 2021.
[42] S. Särkkä and A. Solin. Applied Stochastic Differential Equations. Cambridge University Press, 2019.
[43] S. Särkkä, A. Solin, and J. Hartikainen. Spatiotemporal learning via infinite-dimensional Bayesian filtering and smoothing. IEEE Signal Processing Magazine, 30(4):51–61, 2013.
[44] A. Solin. Stochastic Differential Equation Methods for Spatio-Temporal Gaussian Process Regression. Doctoral dissertation, Aalto University, Helsinki, Finland, 2016.
[45] W. Tebbutt, A. Solin, and R. E. Turner. Combining pseudo-point and state space approximations for sum-separable Gaussian processes. In Proceedings of the 37th Conference on Uncertainty in Artificial Intelligence (UAI), Proceedings of Machine Learning Research. PMLR, 2021.
[46] M. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 5 of Proceedings of Machine Learning Research, pages 567–574. PMLR, 2009.
[47] M. J. Wainwright and M. I. Jordan. Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc., Hanover, MA, USA, 2008.
[48] K. Wang, G. Pleiss, J. Gardner, S. Tyree, K. Q. Weinberger, and A. G. Wilson. Exact Gaussian processes on a million data points. In Advances in Neural Information Processing Systems 32 (NeurIPS), pages 14648–14659. Curran Associates, Inc., 2019.
[49] W. Wilkinson, M. Andersen, J. D. Reiss, D. Stowell, and A. Solin. End-to-end probabilistic inference for nonstationary audio analysis. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pages 6776–6785. PMLR, 2019.
[50] W. Wilkinson, A. Solin, and V. Adam. Sparse algorithms for Markovian Gaussian processes. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), Proceedings of Machine Learning Research, pages 1747–1755. PMLR, 2021.
[51] W. J. Wilkinson, P. E. Chang, M. R. Andersen, and A. Solin. State space expectation propagation: Efficient inference schemes for temporal Gaussian processes. In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research. PMLR, 2020.
[52] A. Wilson and H. Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In Proceedings of the 32nd International Conference on Machine Learning (ICML), volume 37 of Proceedings of Machine Learning Research, pages 1775–1784. PMLR, 2015.
[53] A. G. Wilson, E. Gilboa, A. Nehorai, and J. P. Cunningham.
Fast kernel learning for multidimensional pattern extrapolation. In Advances in Neural Information Processing Systems 27 (NIPS), pages 3626–3634. Curran Associates, Inc., 2014.