Continuous Latent Process Flows

Ruizhi Deng1,2  Marcus A. Brubaker1,3,4  Greg Mori1,2  Andreas M. Lehrmann1
1Borealis AI  2Simon Fraser University  3York University  4Vector Institute

This work was done during an internship at Borealis AI. Correspondence to wsdmdeng@gmail.com.
35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract

Partial observations of continuous time-series dynamics at arbitrary time stamps exist in many disciplines. Fitting this type of data using statistical models with continuous dynamics is not only promising at an intuitive level but also has practical benefits, including the ability to generate continuous trajectories and to perform inference on previously unseen time stamps. Despite exciting progress in this area, the existing models still face challenges in terms of their representation power and the quality of their variational approximations. We tackle these challenges with continuous latent process flows (CLPF), a principled architecture decoding continuous latent processes into continuous observable processes using a time-dependent normalizing flow driven by a stochastic differential equation. To optimize our model using maximum likelihood, we propose a novel piecewise construction of a variational posterior process and derive the corresponding variational lower bound using importance weighting of trajectories. An ablation study demonstrates the effectiveness of our contributions, and comparisons to state-of-the-art baselines show our model's favourable performance on both synthetic and real-world data.

1 Introduction

Sparse and irregular observations of continuous dynamics are common in many areas of science, including finance [15, 36], healthcare [16], and physics [30]. Time-series models driven by stochastic differential equations (SDEs) provide an elegant framework for this challenging scenario and have recently gained popularity in the machine learning community [11, 18, 24]. The SDEs are typically implemented by neural networks with trainable parameters, and the latent processes defined by the SDEs are then decoded into an observable space with complex structure. Due to the lack of closed-form transition densities for most SDEs, dedicated variational approximations have been developed to maximize the observational log-likelihoods [2, 18, 24].

Despite the progress of existing works, challenges remain when SDE-based models are applied to irregular time-series data. One major challenge is the model's representation power. The continuous-time flow process (CTFP; [11]) uses a series of invertible mappings continuously indexed by time to transform a simple Wiener process into a more complex stochastic process. The use of a simple latent process and invertible transformations permits CTFP to evaluate the exact likelihood of observations on any time grid efficiently, but it also limits the set of stochastic processes that CTFP can express to a specific form obtainable via Itô's Lemma, which excludes many commonly seen stochastic processes. Another practical constraint on representation power is the Lipschitz property of the transformations used in these models. The latent SDE model proposed by Hasan et al. [18] and CTFP both transform a latent stochastic process with constant variance into an observable one using injective mappings. Because many invertible neural network architectures are compositions of Lipschitz-continuous transformations, processes that can only be written as a non-Lipschitz transformation of a simple process, like geometric Brownian motion, cannot be expressed by these models unless specific non-Lipschitz decoders are chosen.
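As a concrete illustration of this last point (not part of the original paper), the short sketch below simulates geometric Brownian motion on an irregular time grid by pushing a Wiener process through an exponential map; the decoder w ↦ exp(·) is exactly the kind of non-Lipschitz transformation discussed above. The time grid and the parameters x0, mu, sigma are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Irregular time grid on [0, T] (hypothetical observation times).
T, n_obs = 2.0, 20
t = np.sort(rng.uniform(0.0, T, size=n_obs))

# Sample a Wiener process W_t at the irregular time stamps.
dt = np.diff(np.concatenate([[0.0], t]))
w = np.cumsum(rng.normal(0.0, np.sqrt(dt)))

# Geometric Brownian motion as a time-indexed transformation of W_t:
#   X_t = x0 * exp((mu - 0.5 * sigma**2) * t + sigma * W_t).
# The map w -> exp(sigma * w + ...) is not Lipschitz on the real line,
# which is why Lipschitz-constrained decoders struggle to represent it.
x0, mu, sigma = 1.0, 0.3, 0.5
x = x0 * np.exp((mu - 0.5 * sigma**2) * t + sigma * w)

print(np.c_[t, x])  # irregularly sampled GBM trajectory
```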
Figure 1: Overview. Our architecture uses a stochastic differential equation (SDE; left) to drive a time-dependent normalizing flow (NF; right). At times t1 and t2 (grey bars), the values of the SDE trajectories (colored trajectories on the left) serve as conditioning information for the decoding of a simple base process (grey trajectories on the right) into a complex observable process (colored trajectories on the right). The color gradient on the right shows the individual trajectories of this transformation, which is driven by an augmented neural ODE. Since all stochastic processes and mappings are time-continuous, we can model observed data as partial realizations of a continuous process, enabling modelling of continuous dynamics and inference on irregular time grids.

Apart from the model's representation power, variational inference is another challenge in training SDE-based models. The latent SDE model in the work of Li et al. [24] uses a principled method of variational approximation based on importance weighting of trajectories between a variational posterior and a prior process. The variational posterior process is constructed using a single SDE conditioned on all observations and is therefore restricted to be a Markov process. This approach may lack the flexibility to approximate the true posterior process well enough in complex inference tasks, e.g., in an online setting or with variable-length observation sequences.

In this work we propose Continuous Latent Process Flows (CLPFs; code available at https://github.com/BorealisAI/continuous-latent-process-flows), a model that is governed by latent dynamics defined by an expressive generic stochastic differential equation. Inspired by [11], we then use dynamic normalizing flows to decode each latent trajectory into a continuous observable process. Driven by different trajectories of the latent stochastic process continuously evolving with time, the dynamic normalizing flow can map a simple base process to a diverse class of observable processes. We illustrate this process in Fig. 1. This decoding is critical for the model to generate continuous trajectories and to be trained on observations on irregular time grids using a variational approximation. Good variational approximations and proper handling of complex inference tasks like online inference depend on a flexible variational posterior process. Therefore, we also propose a principled method of defining and sampling from a non-Markov variational posterior process that is based on a piecewise evaluation of SDEs and can adapt to new observations. The proposed model excels at fitting observations on irregular time grids, generalizing to observations on denser time grids, and generating trajectories that are continuous in time.

Contributions. In summary, we make the following contributions: (1) We propose a flow-based decoding of a generic SDE as a principled framework for continuous dynamics modeling of irregular time-series data. (2) We improve the variational approximation of the observational likelihood through a flexible non-Markov posterior process based on a piecewise evaluation of the underlying SDE. (3) We validate the effectiveness of our contributions in a series of ablation studies and comparisons to state-of-the-art time-series models on both synthetic and real-world datasets.
2 Preliminaries

2.1 Stochastic Differential Equations

SDEs can be viewed as a stochastic analogue of ordinary differential equations (ODEs) in the sense that, informally, $\frac{dZ_t}{dt} = \mu(Z_t, t) + \sigma(Z_t, t) \cdot \text{noise}$. Let Zt be a variable which continuously evolves with time. An m-dimensional SDE describing the stochastic dynamics of Zt usually takes the form

$dZ_t = \mu(Z_t, t)\,dt + \sigma(Z_t, t)\,dW_t$, (1)

where µ(Zt, t) is an m-dimensional vector, σ(Zt, t) is an m × k matrix, and Wt is a k-dimensional Wiener process. The solution of an SDE is a continuous-time stochastic process Zt that satisfies the integral equation $Z_t = Z_0 + \int_0^t \mu(Z_s, s)\,ds + \int_0^t \sigma(Z_s, s)\,dW_s$ with initial condition Z0, where the stochastic integral should be interpreted as a traditional Itô integral [27, Chapter 3.1]. For each sample trajectory ω of Wt, the stochastic process Zt maps ω to a different trajectory Zt(ω).

Latent Dynamics and Variational Bound. SDEs have been used as models of latent dynamics in a variety of contexts [2, 18, 24]. As closed-form finite-dimensional solutions to SDEs are rare, variational approximations are often used in practice. Li et al. [24] propose a principled way of re-weighting latent SDE trajectories for variational approximations using Girsanov's theorem [27, Chapter 8.6]. Specifically, consider a prior process and a variational posterior process in the interval [0, T] defined by the two stochastic differential equations $dZ_t = \mu_1(Z_t, t)\,dt + \sigma(Z_t, t)\,dW_t$ and $d\hat{Z}_t = \mu_2(\hat{Z}_t, t)\,dt + \sigma(\hat{Z}_t, t)\,dW_t$, respectively. Furthermore, let p(x|Zt) denote the probability of observing x conditioned on the trajectory of the latent process Zt in the interval [0, T]. If there exists a mapping $u : \mathbb{R}^m \times [0, T] \to \mathbb{R}^k$ such that $\sigma(z, t)\,u(z, t) = \mu_2(z, t) - \mu_1(z, t)$ and u satisfies Novikov's condition [27, Chapter 8.6], we obtain the variational lower bound

$\log p(x) = \log \mathbb{E}[p(x \mid Z_t)] = \log \mathbb{E}[p(x \mid \hat{Z}_t)\,M_T] \ge \mathbb{E}[\log p(x \mid \hat{Z}_t) + \log M_T]$, (2)

where $M_T = \exp\!\big(-\int_0^T \tfrac{1}{2}\|u(\hat{Z}_t, t)\|^2\,dt - \int_0^T u(\hat{Z}_t, t)^\top dW_t\big)$. See [24] for a formal proof.

2.2 Normalizing Flows

Normalizing flows [3, 8, 12, 13, 21, 22, 23, 28, 29, 31] employ a bijective mapping $f : \mathbb{R}^d \to \mathbb{R}^d$ to transform a random variable Y with a simple base distribution pY into a random variable X with a complex target distribution pX. We can sample from a normalizing flow by first sampling y ∼ pY and then transforming it to x = f(y). As a result of invertibility, normalizing flows can also be used for density estimation. Using the change-of-variables formula, we have $\log p_X(x) = \log p_Y(g(x)) + \log \left|\det \frac{\partial g}{\partial x}\right|$, where g is the inverse of f.

Continuous Indexing. More recently, normalizing flows have been augmented with a continuous index [6, 10, 11]. For instance, the continuous-time flow process (CTFP; [11]) models irregular observations of a continuous-time stochastic process. Specifically, CTFP transforms a simple d-dimensional Wiener process Wt into another continuous stochastic process Xt using the transformation Xt = f(Wt, t), where f(·, t) is an invertible mapping for each t.
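To make the bound in Eq. (2) concrete, the following minimal sketch (not part of the paper; the drifts mu1 and mu2, the shared diffusion sigma, and the Gaussian decoder are hypothetical one-dimensional stand-ins) simulates the posterior SDE with the Euler-Maruyama scheme, accumulates log M_T along each trajectory, and averages single-trajectory estimates of the right-hand side of Eq. (2). In an actual implementation the drifts would be neural networks and the discretization would be delegated to an SDE solver, as in Section 3.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D prior/posterior drifts and shared diffusion (hypothetical choices).
mu1 = lambda z, t: -z                  # prior drift
mu2 = lambda z, t: -z + 0.5 * (1 - t)  # posterior drift
sigma = lambda z, t: 0.4               # shared diffusion

def elbo_sample(x_obs, T=1.0, n_steps=200, z0=0.0):
    """One Monte Carlo sample of E[log p(x | Z_hat) + log M_T] from Eq. (2)."""
    dt = T / n_steps
    z, log_MT = z0, 0.0
    for i in range(n_steps):
        t = i * dt
        dW = rng.normal(0.0, np.sqrt(dt))
        u = (mu2(z, t) - mu1(z, t)) / sigma(z, t)  # sigma * u = mu2 - mu1
        log_MT += -0.5 * u**2 * dt - u * dW        # increments of log M_T
        z = z + mu2(z, t) * dt + sigma(z, t) * dW  # Euler-Maruyama step (posterior)
    # Hypothetical Gaussian decoder p(x | Z_T) = N(x; Z_T, 0.1^2).
    log_px = -0.5 * ((x_obs - z) / 0.1) ** 2 - np.log(0.1 * np.sqrt(2 * np.pi))
    return log_px + log_MT

estimates = [elbo_sample(x_obs=0.8) for _ in range(256)]
print("ELBO estimate:", np.mean(estimates))
```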
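Similarly, for the continuously indexed flows of Section 2.2, the sketch below (again not from the paper) uses a hypothetical time-indexed affine map f(w, t) = exp(a(t)) w + b(t), which is invertible in w for every t, to sample a CTFP-style process on an irregular grid and to evaluate its exact log-likelihood by combining the Wiener transition densities with the change-of-variables formula. The functions a and b are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical time-indexed invertible map f(w, t) and its inverse in w.
a = lambda t: 0.3 * t            # log-scale
b = lambda t: np.sin(t)          # shift
f = lambda w, t: np.exp(a(t)) * w + b(t)
f_inv = lambda x, t: (x - b(t)) * np.exp(-a(t))

def sample_ctfp(times):
    """Sample X_t = f(W_t, t) at the given (sorted, positive) time stamps."""
    dt = np.diff(np.concatenate([[0.0], times]))
    w = np.cumsum(rng.normal(0.0, np.sqrt(dt)))
    return f(w, times)

def log_likelihood(times, xs):
    """Exact log p(x_{t1}, ..., x_{tn}) via the change-of-variables formula."""
    w = f_inv(xs, times)                      # invert the flow at each time stamp
    dt = np.diff(np.concatenate([[0.0], times]))
    dw = np.diff(np.concatenate([[0.0], w]))
    # Wiener transition densities: p(W_ti | W_t{i-1}) = N(dw; 0, dt).
    log_pw = np.sum(-0.5 * dw**2 / dt - 0.5 * np.log(2 * np.pi * dt))
    # log |det df/dw| = a(t) for the affine map above.
    log_det = np.sum(a(times))
    return log_pw - log_det

t = np.sort(rng.uniform(0.1, 3.0, size=15))   # irregular time grid
x = sample_ctfp(t)
print("log-likelihood:", log_likelihood(t, x))
```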
Despite its benefit of exact log-likelihood computation for arbitrary finite-dimensional distributions, the expressive power of CTFP to model stochastic processes is limited in two respects: (1) An application of Itô's lemma [27, Chapter 4.2] shows that CTFP can only represent stochastic processes of the form

$df(W_t, t) = \left\{\frac{\partial f}{\partial t}(W_t, t) + \frac{1}{2}\operatorname{Tr}\big(H_w f(W_t, t)\big)\right\} dt + \big(\nabla_w f(W_t, t)\big)^\top dW_t$, (3)

where $H_w f$ is the Hessian matrix of f with respect to w and $\nabla_w f$ is the gradient. A variety of stochastic processes, from simple processes like the Ornstein-Uhlenbeck (OU) process to more complex non-Markov processes, fall outside of this limited class and cannot be learned using CTFP (see Appendix A for formal proofs). (2) Many normalizing flow architectures are compositions of Lipschitz-continuous transformations [7, 8, 17]. It is therefore challenging to model stochastic processes that are non-Lipschitz transformations of simple processes using CTFP without prior knowledge about the functional form of the observable processes and custom-tailored normalizing flows with non-Lipschitz transformations (see Appendix B for an example).
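To make limitation (1) concrete, consider a short worked instance of Eq. (3) (a standard Itô calculation, not reproduced from the paper's appendices). Choose the hypothetical decoder $f(w, t) = x_0 \exp\big(\sigma w + (\mu - \tfrac{1}{2}\sigma^2)t\big)$, so that $\frac{\partial f}{\partial t} = (\mu - \tfrac{1}{2}\sigma^2) f$, $\nabla_w f = \sigma f$, and $H_w f = \sigma^2 f$. Substituting into Eq. (3) with $X_t = f(W_t, t)$ gives

$dX_t = \big\{(\mu - \tfrac{1}{2}\sigma^2) X_t + \tfrac{1}{2}\sigma^2 X_t\big\}\,dt + \sigma X_t\,dW_t = \mu X_t\,dt + \sigma X_t\,dW_t$,

i.e., geometric Brownian motion. Matching the diffusion term of geometric Brownian motion forces $\nabla_w f = \sigma f$, so f must be exponential in w; this also illustrates limitation (2), since the exponential map is not Lipschitz on $\mathbb{R}$.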
A latent variant of CTFP is further augmented with a static latent variable to introduce non-Markovian behavior. It models continuous stochastic processes as Xt = f(Wt, t; Z), where Z is a latent variable with standard Gaussian distribution and f(·, ·; z) is a CTFP model that decodes each sample z of Z into a stochastic process with continuous trajectories. Latent CTFP can be used to estimate finite-dimensional distributions using a variational approximation. However, it is unclear how much a latent variable Z of finite dimension can improve CTFP's representation power.

Equipped with these tools, we can now describe our problem setting and model architecture. Let $\{(x_{t_i}, t_i)\}_{i=1}^{n}$ denote a sequence of d-dimensional observations sampled at arbitrary points in time, where xti and ti denote the value and time stamp of the observation, respectively. The observations are assumed to be partial realizations of a continuous-time stochastic process Xt. Our training objective is the maximization of the observational log-likelihood induced by Xt on a given time grid,

$\mathcal{L} = \log p_{X_{t_1}, \ldots, X_{t_n}}(x_{t_1}, \ldots, x_{t_n})$, (4)

for an inhomogeneous collection of sequences with varying lengths and time stamps. At test time, in addition to the maximization of log-likelihoods, we are also interested in sampling sparse, dense, or irregular trajectories with finite-dimensional distributions that conform to these log-likelihoods. We model this challenging scenario with Continuous Latent Process Flows (CLPF). In Section 3.1, we present our model in more detail. Training and inference methods are discussed in Section 3.2.

3.1 Continuous Latent Process Flows

A Continuous Latent Process Flow consists of two major components: an SDE describing the continuous latent dynamics of an observable stochastic process and a continuously indexed normalizing flow serving as a time-dependent decoder. The architecture of the normalizing flow itself can be specified in multiple ways, e.g., as an augmented neural ODE [14] or as a series of affine transformations [10]. The following paragraphs discuss the relationship between these components in more detail.

Continuous Latent Dynamics. Analogous to our overview in Section 2.1, we model the evolution of an m-dimensional time-continuous latent state Zt in the time interval [0, T] using a flexible stochastic differential equation driven by an m-dimensional Wiener process Wt,

$dZ_t = \mu_\gamma(Z_t, t)\,dt + \sigma_\gamma(Z_t, t)\,dW_t$, (5)

where γ denotes the (shared) learnable parameters of the drift function µ and variance function σ. In our experiments, we implement µ and σ using deep neural networks (see Appendix E for details). Importantly, the latent state Zt exists for each t ∈ [0, T] and can be sampled on any given time grid, which can be irregular and different for each sequence.

Time-Dependent Decoding. Latent variable models decode a latent state into an observable variable with a complex distribution. As an observed sequence $\{(x_{t_i}, t_i)\}_{i=1}^{n}$ is assumed to be a partial realization of a continuous-time stochastic process, continuous trajectories of the latent process Zt should be decoded into continuous trajectories of the observable process Xt, and not into discrete distributions. Following recent advances in dynamic normalizing flows [6, 10, 11], we model Xt as

$X_t = F_\theta(O_t; Z_t, t)$, (6)

where Ot is a d-dimensional stochastic process with closed-form transition density (more precisely, the conditional distribution $p_{O_{t_i} \mid O_{t_j}}(o_{t_i} \mid o_{t_j})$ must exist in closed form for any tj < ti) and Fθ(·; zt, t) is a normalizing flow parameterized by θ for any zt, t. The transformation Fθ decodes each sample path of Zt into a complex distribution over continuous trajectories Xt if Fθ is a continuous mapping and the sampled trajectories of the base process Ot are continuous with respect to time t. Unlike [11], who use a Wiener process as base process, we use the Ornstein-Uhlenbeck (OU) process, which has a stationary marginal distribution and bounded variance. As a result, the variance of the observation process does not increase due to the increase of variance in the base process and is primarily determined by the latent process Zt and the flow transformation Fθ.

Flow Architecture. The continuously indexed normalizing flow Fθ(·; zt, t) can be implemented in multiple ways. Deng et al. [11] use ANODE [14], defined as the solution to the initial value problem

$\frac{d}{d\tau}\begin{pmatrix} h(\tau) \\ a(\tau) \end{pmatrix} = \begin{pmatrix} f_\theta(h(\tau), a(\tau), \tau) \\ g_\theta(a(\tau), \tau) \end{pmatrix}, \qquad \begin{pmatrix} h(\tau_0) \\ a(\tau_0) \end{pmatrix} = \begin{pmatrix} o_t \\ (z_t, t)^\top \end{pmatrix}$, (7)

where $\tau \in [\tau_0, \tau_1]$, $h(\tau) \in \mathbb{R}^d$, $a(\tau) \in \mathbb{R}^{m+1}$, $f_\theta : \mathbb{R}^d \times \mathbb{R}^{m+1} \times [\tau_0, \tau_1] \to \mathbb{R}^d$, $g_\theta : \mathbb{R}^{m+1} \times [\tau_0, \tau_1] \to \mathbb{R}^{m+1}$, and Fθ is defined as the solution of h(τ) at τ = τ1. Note the difference between t and τ: while t ∈ [0, T] describes the continuous process dynamics, τ ∈ [τ0, τ1] describes the continuous time-dependent decoding at each time step t. Alternatively, Cornish et al. [10] propose a variant of continuously indexed normalizing flows based on a series of N affine transformations fi,

$h^{(i+1)}_t = f_i\big(h^{(i)}_t; z_t, t\big) = k\big(h^{(i)}_t \odot \exp(-u^{(i)}(z_t, t)) - v^{(i)}(z_t, t)\big)$, (8)

where $h^{(0)}_t = o_t$, $h^{(N)}_t = x_t$, u(i) and v(i) are unconstrained transformations, and k is an invertible mapping such as a residual flow [8].

Figure 2: Graphical Model of Generation and Inference ((a) Generation, (b) Inference). Red circles represent latent variables Zti. Unfilled blue circles represent samples Oti from the OU process at discrete time points. Blue diamonds in Fig. 2a indicate that each Xti is the result of a deterministic mapping of Zti and Oti. Filled blue circles in Fig. 2b represent observed variables Xti.

The temporal relationships among Zt, Ot, and Xt from a graphical-model point of view are shown in Fig. 2a.
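The following sketch (not the authors' implementation; a minimal one-dimensional toy in which the drift mu_gamma, diffusion sigma_gamma, conditioning networks u_net and v_net, and OU parameters are hypothetical stand-ins, and k in Eq. (8) is taken to be the identity) illustrates how a CLPF trajectory could be generated: the latent SDE of Eq. (5) is simulated with the Euler-Maruyama scheme, an OU base process is sampled exactly from its transition density, and each base value is decoded through a single conditional affine transformation in the spirit of Eq. (8).

```python
import numpy as np

rng = np.random.default_rng(3)

# --- Latent SDE (Eq. 5): hypothetical drift/diffusion standing in for neural nets.
mu_gamma = lambda z, t: -0.5 * z + np.sin(t)
sigma_gamma = lambda z, t: 0.3

def sample_latent_path(times, z0=0.0, n_substeps=20):
    """Euler-Maruyama simulation of Z_t at the requested (sorted) time stamps."""
    zs, z, t_prev = [], z0, 0.0
    for t in times:
        dt = (t - t_prev) / n_substeps
        for j in range(n_substeps):
            s = t_prev + j * dt
            z = z + mu_gamma(z, s) * dt + sigma_gamma(z, s) * rng.normal(0.0, np.sqrt(dt))
        zs.append(z)
        t_prev = t
    return np.array(zs)

# --- OU base process O_t with closed-form transitions (theta, sig_ou hypothetical).
theta, sig_ou = 1.0, 1.0

def sample_ou(times, o0=0.0):
    os_, o, t_prev = [], o0, 0.0
    for t in times:
        dt = t - t_prev
        mean = o * np.exp(-theta * dt)
        var = sig_ou**2 / (2 * theta) * (1 - np.exp(-2 * theta * dt))
        o = rng.normal(mean, np.sqrt(var))
        os_.append(o)
        t_prev = t
    return np.array(os_)

# --- One conditional affine layer of Eq. (8) with k = identity (hypothetical u, v).
u_net = lambda z, t: 0.5 * np.tanh(z + t)
v_net = lambda z, t: 0.2 * z

def decode(o, z, t):
    return o * np.exp(-u_net(z, t)) - v_net(z, t)   # x_t = F_theta(o_t; z_t, t)

times = np.sort(rng.uniform(0.0, 3.0, size=12))     # irregular time grid
z_path = sample_latent_path(times)
o_path = sample_ou(times)
x_path = decode(o_path, z_path, times)
print(np.c_[times, x_path])
```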
3.2 Training and Inference

With the model fully specified, we can now focus our attention on training and inference. Computing the observational log-likelihood (Eq. (4)) induced by a time-dependent decoding of an SDE (Eq. (6)) on an arbitrary time grid is challenging, because only a few SDEs have closed-form transition densities. Consequently, variational approximations are needed for flexible SDEs such as Eq. (5). We propose a principled way of approximating the observational log-likelihood with a variational lower bound based on a novel piecewise construction of the posterior distribution of the latent process. In summary, we first express the observational log-likelihood as a marginalization over piecewise factors conditioned on a latent trajectory, then approximate this intractable computation with a piecewise variational posterior process, and finally derive a variational lower bound for it.

Observational Log-Likelihood. The observational log-likelihood can be written as an expectation over latent trajectories of the conditional likelihood, which can be evaluated in closed form,

$\mathcal{L} = \log p_{X_{t_1},\ldots,X_{t_n}}(x_{t_1},\ldots,x_{t_n}) = \log \mathbb{E}_{\omega \sim W_t}\Big[ p_{X_{t_1},\ldots,X_{t_n} \mid Z_t}\big(x_{t_1},\ldots,x_{t_n} \mid Z_t(\omega)\big) \Big] = \log \mathbb{E}_{\omega \sim W_t}\Big[ \prod_{i=1}^{n} p_{X_{t_i} \mid X_{t_{i-1}}, Z_{t_i}, Z_{t_{i-1}}}\big(x_{t_i} \mid x_{t_{i-1}}, Z_{t_i}(\omega), Z_{t_{i-1}}(\omega)\big) \Big]$,

where Zt(ω) denotes a sample trajectory of Zt driven by ω ∼ Wt. For simplicity, and in this section only, we assume w.l.o.g. that Z0 and X0 are given. As a result of invertibility, the conditional likelihood terms $p_{X_{t_i} \mid X_{t_{i-1}}, Z_{t_i}, Z_{t_{i-1}}}$ can be computed using the change-of-variables formula,

$\log p_{X_{t_i} \mid X_{t_{i-1}}, Z_{t_i}, Z_{t_{i-1}}}\big(x_{t_i} \mid x_{t_{i-1}}, Z_{t_i}(\omega), Z_{t_{i-1}}(\omega)\big) = \log p_{O_{t_i} \mid O_{t_{i-1}}}(o_{t_i} \mid o_{t_{i-1}}) - \log \left| \det \frac{\partial F_\theta(o_{t_i}; Z_{t_i}(\omega), t_i)}{\partial o_{t_i}} \right|$,

where $o_{t_i} = F_\theta^{-1}(x_{t_i}; Z_{t_i}(\omega), t_i)$.

Piecewise Construction of Variational Posterior. Directly computing the observational log-likelihood is intractable, and we use a variational approximation during both training and inference. Good variational approximations rely on variational posteriors that are close enough to the true posterior of the latent trajectory conditioned on observations. Previous methods [24] use a single SDE to propose the variational posterior conditioned on all observations. Instead, we develop a more flexible method that can update the posterior process parameters when a new observation is seen and naturally adapts to different time grids. Our posterior process is not constrained by the Markov property of SDE solutions. Moreover, the proposed method serves as the basis for a principled approach to online inference tasks using variational posterior processes. Our construction makes use of a further decomposition of the observational log-likelihood based on the following facts: $\{W_{s+t} - W_s\}_{t \ge 0}$ is also a Wiener process for any $s \ge 0$, and the solution to Eq. (5) is a Markov process. Specifically, let $\{(\Omega^{(i)}, \mathcal{F}^{(i)}_{t_i - t_{i-1}}, P^{(i)})\}_{i=1}^{n}$ be a series of probability spaces on which n independent m-dimensional Wiener processes $W^{(i)}_t$ are defined. We can sample a complete trajectory of the Wiener process Wt in the interval [0, T] by sampling independent trajectories of length $t_i - t_{i-1}$ from $\Omega^{(i)}$ and adding them, i.e., ωt = Σ{i:ti