# Markovian Gaussian Process Variational Autoencoders

Harrison Zhu¹*, Carles Balsells-Rodas¹*, Yingzhen Li¹

*Equal contribution. ¹Imperial College London. Correspondence to: Harrison Zhu.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

## Abstract

Sequential VAEs have been successfully applied to many high-dimensional time series modelling problems, with many variant models relying on discrete-time mechanisms such as recurrent neural networks (RNNs). On the other hand, continuous-time methods have recently gained traction, especially in the context of irregularly-sampled time series, where they can handle the data better than discrete-time methods. One such class is Gaussian process variational autoencoders (GPVAEs), where the VAE prior is set as a Gaussian process (GP). However, a major limitation of GPVAEs is that they inherit the cubic computational cost of GPs, making them unattractive to practitioners. In this work, we leverage the equivalent discrete state-space representation of Markovian GPs to enable linear-time GPVAE training via Kalman filtering and smoothing. For our model, the Markovian GPVAE (MGPVAE), we show on a variety of high-dimensional temporal and spatiotemporal tasks that our method performs favourably compared to existing approaches whilst being computationally highly scalable.

Figure 1: Illustration of the MGPVAE model. (Top: filtering and smoothing operations) The state posterior $q_\phi(s_{1:T})$ is parameterised by encoder outputs and computed using filtering and smoothing. (Bottom: amortised posterior) At prediction time, posterior predictive distributions can be calculated at any $t$.

## 1. Introduction

Modelling multivariate time series data has extensive applications in, e.g., video and audio generation (Li & Mandt, 2018; Goel et al., 2022), climate data analysis (Ravuri et al., 2021) and finance (Sims, 1980). Among existing deep generative models for time series, a popular class of model is sequential variational autoencoders (VAEs) (Chung et al., 2015; Fraccaro et al., 2017; Fortuin et al., 2020), which extend VAEs (Kingma & Welling, 2014) to sequential data. Originally proposed for image generation, a VAE is a latent variable model which encodes data via an encoder to a low-dimensional latent space and then decodes via a decoder to reconstruct the original data. To extend VAEs to sequential data, the latent space must also capture temporal information (it is also technically possible to place temporal dynamics on the decoders (Chen et al., 2017), but in this work we focus on the latent variable dynamics). Sequential VAEs accomplish this by modelling the latent variables as a multivariate time series, where many existing approaches define a state-space model which governs the latent dynamics. These state-space model-based sequential VAE approaches can be classified into two subgroups:

**Discrete-time:** The first approach relies on building a discrete-time state-space model for the latent variables. The transition distribution is often parameterised by a recurrent neural network (RNN) such as Long Short-Term Memory (LSTM; Hochreiter & Schmidhuber (1997)) or Gated Recurrent Units (GRU; Cho et al. (2014)). Notable methods include the Variational Recurrent Neural Network (VRNN; Chung et al. (2015)) and the Kalman VAE (KVAE; Fraccaro et al. (2017)).
However, these approaches may suffer from training issues, such as vanishing gradients (Pascanu et al., 2013), and may struggle with irregularly-sampled time series data (Rubanova et al., 2019).

**Continuous-time:** The second approach involves continuous-time representations, where the latent space is modelled using a continuous-time dynamic model. A notable class of such methods is neural differential equations (Chen et al., 2018; Rubanova et al., 2019; Li et al., 2020; Kidger, 2022), which model the latent variables using a system of differential equations, described by its initial conditions, drift and diffusion. As remarked in Li et al. (2020), one can construct a neural stochastic differential equation (neural SDE) that can be interpreted as an infinite-dimensional noise VAE. However, although these models can flexibly handle irregularly-sampled data, they also require numerical solvers to solve the underlying latent processes, which may cause training difficulties (Park et al., 2021), memory issues (Chen et al., 2018; Li et al., 2020) and slow computation times. Similarly, but combining linear SDEs with Kalman filtering, the Continuous Recurrent Unit (CRU; Schirmer et al. (2022)) is an RNN that can also model data in continuous time. Finally, in the context of audio generation, S4-related models (Gu et al., 2021; Goel et al., 2022) also rely on continuous state spaces and have been shown to perform strongly.

Another line of continuous-time approaches, related to neural SDEs, treats the latent multivariate time series as a random function of time and models this random function as a tractable stochastic process: Gaussian Process Variational Autoencoders (GPVAEs; Casale et al. (2018); Pearce (2020); Fortuin et al. (2020); Ashman et al. (2020); Jazbec et al. (2021)) model the latent variables using Gaussian processes (GPs) (Rasmussen, 2003). Compared with dynamic model-based approaches, which focus on modelling the latent variable transitions (reflecting mainly local properties), the GP model for the latent variables better describes the global properties of the time series, such as smoothness and periodicity, when a suitable stationary kernel is chosen. Therefore GPVAEs may be better suited for, e.g., climate time series data, which clearly exhibits periodic behaviour. Unfortunately, GPVAEs are not directly applicable to long sequences, as they suffer from $\mathcal{O}(T^3)$ computational cost, so approximations need to be made. Indeed, Ashman et al. (2020); Fortuin et al. (2020); Jazbec et al. (2021) proposed variational approximations based on sparse Gaussian processes (Titsias, 2009; Hensman et al., 2013; 2015) or recognition networks (Fortuin et al., 2020) to improve the scalability of GPVAEs.

In this work we propose Markovian GPVAEs (MGPVAEs) to bridge the state-space model-based and stochastic process-based approaches to sequential VAEs, aiming to achieve the best of both worlds. Our approach is inspired by the key fact that, when the GP is defined over time, a large class of GPs can be written as linear SDEs (Särkkä & Solin, 2019), for which exact and unique solutions exist (Øksendal, 2003). As a result, there exists an equivalent discrete linear state-space representation of such GPs, and therefore the dynamic model for the latent variables has both discrete- and continuous-time representations.
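As a concrete illustration of this correspondence, below is a minimal NumPy/SciPy sketch (not the paper's own code) for a Matérn-3/2 kernel, following the standard construction in Särkkä & Solin (2019): the GP is written as a two-dimensional linear SDE, and exact transition matrices for arbitrary, possibly irregular, time steps are obtained in closed form. The function and variable names here are our own illustrative choices.

```python
import numpy as np
from scipy.linalg import expm

def make_matern32_ssm(lengthscale: float, variance: float):
    """State-space form of a Matern-3/2 GP: feedback matrix F of the linear SDE
    ds(t) = F s(t) dt + L dW(t), stationary state covariance Pinf, and
    measurement vector H such that f(t) = H s(t)."""
    lam = np.sqrt(3.0) / lengthscale
    F = np.array([[0.0, 1.0],
                  [-lam**2, -2.0 * lam]])
    Pinf = np.array([[variance, 0.0],
                     [0.0, lam**2 * variance]])
    H = np.array([[1.0, 0.0]])
    return F, Pinf, H

def discretise(F, Pinf, dt):
    """Exact discretisation over a step dt: s_{k+1} | s_k ~ N(A s_k, Q),
    with A = expm(F dt) and Q = Pinf - A Pinf A^T (Sarkka & Solin, 2019)."""
    A = expm(F * dt)
    Q = Pinf - A @ Pinf @ A.T
    return A, Q

# Irregular timestamps are handled exactly by recomputing (A, Q) per step.
F, Pinf, H = make_matern32_ssm(lengthscale=1.0, variance=1.0)
ts = np.array([0.0, 0.3, 1.1, 1.2, 2.7])
rng = np.random.default_rng(0)
s = rng.multivariate_normal(np.zeros(2), Pinf)   # initial state s(t_0) ~ N(0, Pinf)
samples = []
for dt in np.diff(ts):
    A, Q = discretise(F, Pinf, dt)
    s = A @ s + rng.multivariate_normal(np.zeros(2), Q)
    samples.append((H @ s).item())               # GP sample path values f(t_k)
```

Because the transition pair $(A_k, Q_k)$ depends only on the step size $\Delta t_k$, irregular sampling is handled exactly and no numerical ODE/SDE solver is required; this is the property the latent prior of MGPVAE exploits.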
This brings the following key advantages to the latent dynamic model of MGPVAE:

- The continuous-time representation allows the incorporation of inductive biases via the GP kernel design (e.g., smoothness, periodic and monotonic trends), to achieve better prediction results and training efficiency. It also enables modelling irregularly-sampled time series data.
- The equivalent discrete-time representation, which is linear, enables Kalman filtering and smoothing (Särkkä & Solin, 2019; Adam et al., 2020; Chang et al., 2020; Wilkinson et al., 2020; 2021; Hamelijnck et al., 2021), which computes the posterior distributions in $\mathcal{O}(T)$ time. As the observed data are assumed to come from non-linear transformations of the latent variables, we further apply site-based approximations (Chang et al., 2020) to the non-linear likelihood terms to enable analytic solutions for the filtering and smoothing procedures.

In our experiments, we study much longer sequences ($T \sim 100$) than many previous GPVAE and discrete-time works, which are typically only of the order of $T \sim 10$. We include a range of datasets that highlight different properties of MGPVAE compared to existing approaches:

- We deliver competitive performance compared to many existing methods on corrupted and irregularly-sampled video and robot action data, at a fraction of the cost of many existing models.
- We extend our work to spatiotemporal climate data, for which none of the discrete-time sequential VAEs is suited. We show that MGPVAE outperforms traditional GP and existing sparse GPVAE models in terms of both predictive performance and speed.

## 2. Background

Consider building generative models for high-dimensional time series (e.g., video data). Here an observed sequence of length $T$ is denoted as $Y_{t_1}, \ldots, Y_{t_T} \in \mathbb{R}^{D_y}$, where $t_i$ represents the timestamp of the $i$-th observation in the sequence. Note that in general $t_i \neq i$ for irregularly sampled time series. As the proposed MGPVAE has both discrete state-space model-based and stochastic process-based formulations, below we introduce these two types of sequential VAEs and the key relevant techniques.

**Sequential VAEs with state-space models:** Consider $t_i = i$ w.l.o.g., and assume the latent states $Z_{1:T} = (z^1_{1:T}, \ldots, z^L_{1:T}) \in \mathbb{R}^{T \times L}$ have $L$ latent dimensions. Then the generative model is defined as

$$p(Y_{1:T}, Z_{1:T}) = p(Z_{1:T}) \prod_{t=1}^{T} p(Y_t \mid Z_t), \quad (1)$$

where we choose $p(Y_t \mid Z_t) = \mathcal{N}(Y_t; \phi(Z_t), \sigma^2 I)$ to be a multivariate Gaussian distribution, and $\phi : \mathbb{R}^L \to \mathbb{R}^{D_y}$ is a decoder network that transforms the latent state to the Gaussian mean. The prior $p(Z_{1:T})$ is defined by the transition probabilities, e.g., $p(Z_{1:T}) = \prod_{t=1}^{T} p(Z_t \mid Z_{t-1})$.
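To make Eq. (1) concrete, the following is a minimal NumPy sketch of ancestral sampling from this generative model, assuming (purely for illustration) a linear-Gaussian Markov prior $p(Z_t \mid Z_{t-1}) = \mathcal{N}(Z_t; A Z_{t-1}, Q)$ and a toy decoder in place of a trained network; all names and settings below are our own assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
T, L, Dy, sigma = 50, 4, 16, 0.1     # sequence length, latent dim, data dim, noise std
A = 0.9 * np.eye(L)                  # illustrative latent transition matrix
Q = 0.1 * np.eye(L)                  # illustrative transition noise covariance
W = rng.normal(size=(Dy, L))         # random weights standing in for a trained decoder

def phi(z):
    """Toy non-linear decoder phi: R^L -> R^{D_y}."""
    return np.tanh(W @ z)

Z = np.zeros((T, L))
Y = np.zeros((T, Dy))
Z[0] = rng.multivariate_normal(np.zeros(L), Q)       # initial state p(Z_1)
Y[0] = phi(Z[0]) + sigma * rng.normal(size=Dy)       # p(Y_1 | Z_1) = N(phi(Z_1), sigma^2 I)
for t in range(1, T):
    Z[t] = rng.multivariate_normal(A @ Z[t-1], Q)    # Markov prior p(Z_t | Z_{t-1})
    Y[t] = phi(Z[t]) + sigma * rng.normal(size=Dy)   # likelihood p(Y_t | Z_t)
```

Sequential VAE variants differ chiefly in how $p(Z_t \mid Z_{t-1})$ is parameterised, e.g., by an RNN in VRNN, or, in MGPVAE, by the exactly discretised GP-SDE transitions sketched in Section 1.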