# Markovian Gaussian Process Variational Autoencoders

Harrison Zhu¹*, Carles Balsells-Rodas¹*, Yingzhen Li¹

*Equal contribution. ¹Imperial College London. Correspondence to: Harrison Zhu.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

## Abstract

Sequential VAEs have been successfully applied to many high-dimensional time series modelling problems, with many variant models relying on discrete-time mechanisms such as recurrent neural networks (RNNs). On the other hand, continuous-time methods have recently gained traction, especially in the context of irregularly-sampled time series, where they can handle the data better than discrete-time methods. One such class is Gaussian process variational autoencoders (GPVAEs), where the VAE prior is set as a Gaussian process (GP). However, a major limitation of GPVAEs is that they inherit the cubic computational cost of GPs, making them unattractive to practitioners. In this work, we leverage the equivalent discrete state-space representation of Markovian GPs to enable linear-time GPVAE training via Kalman filtering and smoothing. For our model, the Markovian GPVAE (MGPVAE), we show on a variety of high-dimensional temporal and spatiotemporal tasks that our method performs favourably compared to existing approaches whilst being computationally highly scalable.

Figure 1: Illustration of the MGPVAE model. (Top: filtering and smoothing operations) The state posterior $q_\phi(s_{1:T})$ is parameterised by encoder outputs and computed using filtering and smoothing. (Bottom: amortised posterior) At prediction time, posterior predictive distributions can be calculated at any $t$.

## 1. Introduction

Modelling multivariate time series data has extensive applications in, e.g., video and audio generation (Li & Mandt, 2018; Goel et al., 2022), climate data analysis (Ravuri et al., 2021) and finance (Sims, 1980). Among existing deep generative models for time series, a popular class of model is sequential variational autoencoders (VAEs) (Chung et al., 2015; Fraccaro et al., 2017; Fortuin et al., 2020), which extend VAEs (Kingma & Welling, 2014) to sequential data. Originally proposed for image generation, a VAE is a latent variable model which encodes data via an encoder to a low-dimensional latent space and then decodes via a decoder to reconstruct the original data. To extend VAEs to sequential data, the latent space must also capture temporal information (it is also technically possible to place temporal dynamics on the decoders (Chen et al., 2017), but in this work we focus on the latent variable dynamics). Sequential VAEs accomplish this by modelling the latent variables as a multivariate time series, where many existing approaches define a state-space model which governs the latent dynamics. These state-space model-based sequential VAE approaches can be classified into two subgroups:

**Discrete-time:** The first approach relies on building a discrete-time state-space model for the latent variables. The transition distribution is often parameterised by a recurrent neural network (RNN) such as Long Short-Term Memory (LSTM; Hochreiter & Schmidhuber (1997)) or Gated Recurrent Units (GRU; Cho et al. (2014)). Notable methods include the Variational Recurrent Neural Network (VRNN; Chung et al. (2015)) and the Kalman VAE (KVAE; Fraccaro et al. (2017)).
However, these approaches may suffer from training issues, such as vanishing gradients (Pascanu et al., 2013), and may struggle with irregularly-sampled time series data (Rubanova et al., 2019).

**Continuous-time:** The second approach involves continuous-time representations, where the latent space is modelled using a continuous-time dynamic model. A notable class of such methods is neural differential equations (Chen et al., 2018; Rubanova et al., 2019; Li et al., 2020; Kidger, 2022), which model the latent variables using a system of differential equations, described by its initial conditions, drift and diffusion. As remarked in Li et al. (2020), one can construct a neural stochastic differential equation (neural SDE) that can be interpreted as an infinite-dimensional noise VAE. However, although these models can flexibly handle irregularly-sampled data, they also require numerical solvers to solve the underlying latent processes, which may cause training difficulties (Park et al., 2021), memory issues (Chen et al., 2018; Li et al., 2020) and slow computation times. Similarly, but combining linear SDEs with Kalman filtering, the Continuous Recurrent Unit (CRU; Schirmer et al. (2022)) is an RNN that can also model data in continuous time. Finally, in the context of audio generation, S4-related models (Gu et al., 2021; Goel et al., 2022) also rely on continuous state spaces and have been shown to perform strongly.

Another line of continuous-time approaches, related to neural SDEs, treats the latent multivariate time series as a random function of time and models this random function as a tractable stochastic process: Gaussian Process Variational Autoencoders (GPVAEs; Casale et al. (2018); Pearce (2020); Fortuin et al. (2020); Ashman et al. (2020); Jazbec et al. (2021)) model the latent variables using Gaussian processes (GPs) (Rasmussen, 2003). Compared with dynamic model-based approaches, which focus on modelling the latent variable transitions (reflecting mainly local properties), the GP model for the latent variables better describes the global properties of the time series, such as smoothness and periodicity, when a suitable stationary kernel is chosen. Therefore GPVAEs may be better suited for, e.g., climate time series data, which clearly exhibits periodic behaviour. Unfortunately, GPVAEs are not directly applicable to long sequences, as they suffer from $\mathcal{O}(T^3)$ computational cost, so approximations need to be made. Indeed, Ashman et al. (2020); Fortuin et al. (2020); Jazbec et al. (2021) proposed variational approximations based on sparse Gaussian processes (Titsias, 2009; Hensman et al., 2013; 2015) or recognition networks (Fortuin et al., 2020) to improve the scalability of GPVAEs.

In this work we propose Markovian GPVAEs (MGPVAEs) to bridge the state-space model-based and stochastic process-based approaches to sequential VAEs, aiming to achieve the best of both worlds. Our approach is inspired by the key fact that, when the GP is defined over time, a large class of GPs can be written as linear SDEs (Särkkä & Solin, 2019), for which exact and unique solutions exist (Øksendal, 2003). As a result, there exists an equivalent discrete linear state-space representation of such GPs, and therefore the dynamic model for the latent variables has both discrete- and continuous-time representations.
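As a concrete illustration of this correspondence, below is a minimal NumPy/SciPy sketch (not the paper's own code) for a Matérn-3/2 kernel, following the standard construction in Särkkä & Solin (2019): the GP is written as a two-dimensional linear SDE, and exact transition matrices for arbitrary, possibly irregular, time steps are obtained in closed form. The function and variable names here are our own illustrative choices.

```python
import numpy as np
from scipy.linalg import expm

def make_matern32_ssm(lengthscale: float, variance: float):
    """State-space form of a Matern-3/2 GP: feedback matrix F of the linear SDE
    ds(t) = F s(t) dt + L dW(t), stationary state covariance Pinf, and
    measurement vector H such that f(t) = H s(t)."""
    lam = np.sqrt(3.0) / lengthscale
    F = np.array([[0.0, 1.0],
                  [-lam**2, -2.0 * lam]])
    Pinf = np.array([[variance, 0.0],
                     [0.0, lam**2 * variance]])
    H = np.array([[1.0, 0.0]])
    return F, Pinf, H

def discretise(F, Pinf, dt):
    """Exact discretisation over a step dt: s_{k+1} | s_k ~ N(A s_k, Q),
    with A = expm(F dt) and Q = Pinf - A Pinf A^T (Sarkka & Solin, 2019)."""
    A = expm(F * dt)
    Q = Pinf - A @ Pinf @ A.T
    return A, Q

# Irregular timestamps are handled exactly by recomputing (A, Q) per step.
F, Pinf, H = make_matern32_ssm(lengthscale=1.0, variance=1.0)
ts = np.array([0.0, 0.3, 1.1, 1.2, 2.7])
rng = np.random.default_rng(0)
s = rng.multivariate_normal(np.zeros(2), Pinf)   # initial state s(t_0) ~ N(0, Pinf)
samples = []
for dt in np.diff(ts):
    A, Q = discretise(F, Pinf, dt)
    s = A @ s + rng.multivariate_normal(np.zeros(2), Q)
    samples.append((H @ s).item())               # GP sample path values f(t_k)
```

Because the transition pair $(A_k, Q_k)$ depends only on the step size $\Delta t_k$, irregular sampling is handled exactly and no numerical ODE/SDE solver is required; this is the property the latent prior of MGPVAE exploits.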
This brings the following key advantages to the latent dynamic model of MGPVAE:

- The continuous-time representation allows the incorporation of inductive biases via the GP kernel design (e.g., smoothness, periodic and monotonic trends), to achieve better prediction results and training efficiency. It also enables modelling irregularly-sampled time series data.
- The equivalent discrete-time representation, which is linear, enables Kalman filtering and smoothing (Särkkä & Solin, 2019; Adam et al., 2020; Chang et al., 2020; Wilkinson et al., 2020; 2021; Hamelijnck et al., 2021), which computes the posterior distributions in $\mathcal{O}(T)$ time. As the observed data are assumed to come from non-linear transformations of the latent variables, we further apply site-based approximations (Chang et al., 2020) to the non-linear likelihood terms to enable analytic solutions for the filtering and smoothing procedures.

In our experiments, we study much longer sequences ($T \sim 100$) than many previous GPVAE and discrete-time works, which are typically only of the order of $T \sim 10$. We include a range of datasets that highlight different properties of MGPVAE compared to existing approaches:

- We deliver competitive performance compared to many existing methods on corrupted and irregularly-sampled video and robot action data, at a fraction of the cost of many existing models.
- We extend our work to spatiotemporal climate data, for which none of the discrete-time sequential VAEs is suited. We show that MGPVAE outperforms traditional GP and existing sparse GPVAE models in terms of both predictive performance and speed.

## 2. Background

Consider building generative models for high-dimensional time series (e.g., video data). Here an observed sequence of length $T$ is denoted as $Y_{t_1}, \ldots, Y_{t_T} \in \mathbb{R}^{D_y}$, where $t_i$ represents the timestamp of the $i$-th observation in the sequence. Note that in general $t_i \neq i$ for irregularly sampled time series. As the proposed MGPVAE has both discrete state-space model-based and stochastic process-based formulations, below we introduce these two types of sequential VAEs and the key relevant techniques.

**Sequential VAEs with state-space models:** Consider $t_i = i$ w.l.o.g., and assume the latent states $Z_{1:T} = (z^1_{1:T}, \ldots, z^L_{1:T}) \in \mathbb{R}^{T \times L}$ have $L$ latent dimensions. Then the generative model is defined as

$$p(Y_{1:T}, Z_{1:T}) = p(Z_{1:T}) \prod_{t=1}^{T} p(Y_t \mid Z_t), \quad (1)$$

where we choose $p(Y_t \mid Z_t) = \mathcal{N}(Y_t; \phi(Z_t), \sigma^2 I)$ to be a multivariate Gaussian distribution, and $\phi : \mathbb{R}^L \to \mathbb{R}^{D_y}$ is a decoder network that transforms the latent state to the Gaussian mean. The prior $p(Z_{1:T})$ is defined by the transition probabilities, e.g., $p(Z_{1:T}) = \prod_{t=1}^{T} p(Z_t \mid Z_{t-1})$.
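To make Eq. (1) concrete, the following is a minimal NumPy sketch of ancestral sampling from this generative model, assuming (purely for illustration) a linear-Gaussian Markov prior $p(Z_t \mid Z_{t-1}) = \mathcal{N}(Z_t; A Z_{t-1}, Q)$ and a toy decoder in place of a trained network; all names and settings below are our own assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
T, L, Dy, sigma = 50, 4, 16, 0.1     # sequence length, latent dim, data dim, noise std
A = 0.9 * np.eye(L)                  # illustrative latent transition matrix
Q = 0.1 * np.eye(L)                  # illustrative transition noise covariance
W = rng.normal(size=(Dy, L))         # random weights standing in for a trained decoder

def phi(z):
    """Toy non-linear decoder phi: R^L -> R^{D_y}."""
    return np.tanh(W @ z)

Z = np.zeros((T, L))
Y = np.zeros((T, Dy))
Z[0] = rng.multivariate_normal(np.zeros(L), Q)       # initial state p(Z_1)
Y[0] = phi(Z[0]) + sigma * rng.normal(size=Dy)       # p(Y_1 | Z_1) = N(phi(Z_1), sigma^2 I)
for t in range(1, T):
    Z[t] = rng.multivariate_normal(A @ Z[t-1], Q)    # Markov prior p(Z_t | Z_{t-1})
    Y[t] = phi(Z[t]) + sigma * rng.normal(size=Dy)   # likelihood p(Y_t | Z_t)
```

Sequential VAE variants differ chiefly in how $p(Z_t \mid Z_{t-1})$ is parameterised, e.g., by an RNN in VRNN, or, in MGPVAE, by the exactly discretised GP-SDE transitions sketched in Section 1.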