Latent SDEs on Homogeneous Spaces

Sebastian Zeng, Florian Graf, Roland Kwitt
University of Salzburg, Austria
{sebastian.zeng, florian.graf, roland.kwitt}@plus.ac.at

We consider the problem of variational Bayesian inference in a latent variable model where a (possibly complex) observed stochastic process is governed by the solution of a latent stochastic differential equation (SDE). Motivated by the challenges that arise when trying to learn an (almost arbitrary) latent neural SDE from data, such as efficient gradient computation, we take a step back and study a specific subclass instead. In our case, the SDE evolves on a homogeneous latent space and is induced by stochastic dynamics of the corresponding (matrix) Lie group. In learning problems, SDEs on the unit n-sphere are arguably the most relevant incarnation of this setup. Notably, for variational inference, the sphere not only facilitates using a truly uninformative prior, but we also obtain a particularly simple and intuitive expression for the Kullback-Leibler divergence between the approximate posterior and prior process in the evidence lower bound. Experiments demonstrate that a latent SDE of the proposed type can be learned efficiently by means of an existing one-step geometric Euler-Maruyama scheme. Despite restricting ourselves to a less rich class of SDEs, we achieve competitive or even state-of-the-art results on various time series interpolation/classification problems.

1 Introduction

Learning models for the dynamic behavior of a system from a collection of recorded observation sequences over time is of great interest in science and engineering. Applications can be found, e.g., in finance [5], physics [46], or population dynamics [2]. While the traditional approach of adjusting a prescribed differential equation (with typically a few parameters) to an observed system has a long history, the advent of neural ordinary differential equations (ODEs) [10] has opened the possibility of learning (almost) arbitrary dynamics by parameterizing the vector field of an ODE by a neural network. This has not only led to multiple works, e.g., [16; 27; 37; 60], elucidating various aspects of this class of models, but also to extensions towards stochastic differential equations (SDEs) [26; 30; 34; 58; 59], where one parametrizes the drift and diffusion vector fields. Such neural SDEs can, by construction, account for potential stochasticity in a system and offer increased flexibility, but they also come at the price of numerical instabilities, high computational/storage costs, or technical difficulties when backpropagating gradients through the typically more involved numerical integration schemes.

Figure 1: Sample paths from a prior (left) and posterior (right) process on the sphere. Initial values are marked.

While several recent developments to address these challenges exist, such as [30] or [26], it is important to note that neural ODEs or SDEs are most often used in settings where they serve as a model for some sort of (unobserved) latent dynamics. Such latent ODE or latent SDE models, cf. [10; 30; 48], hinge on the premise that the available observations are determined by an underlying continuous-time latent state (that might be subject to uncertainty). If one postulates that the observed dynamics are indeed governed by simpler latent dynamics that are easier to model, the question naturally arises whether
a smaller subclass of SDEs, which can be solved efficiently and with less technical difficulty, may suffice for accurately modeling real-world phenomena. Specifically promising are SDEs that arise from the action of a Lie group on a homogeneous space, as they come with (1) an intuitive geometric interpretation, (2) a rich theory [1; 12; 22], and (3) easy-to-implement numerical solvers with strong convergence guarantees [35; 36; 40; 47].

Contribution. In this work, we take a step back from trying to learn (in theory) arbitrary SDEs and instead consider the specific subclass mentioned above, i.e., SDEs that evolve on a homogeneous latent space as the result of a Lie group action. Numerical solutions to such SDEs are computed with a one-step geometric Euler-Maruyama scheme. The latter, combined with the assumption of a low latent space dimension, enables backpropagating gradients through the SDE solver during learning via the discretize-then-optimize strategy. In the variational Bayesian learning regime, we instantiate our construction by learning latent SDEs on the unit n-sphere, where the formula for the Kullback-Leibler (KL) divergence (within the evidence lower bound) between approximate posterior processes and an uninformative prior becomes surprisingly simple. Experiments on a variety of datasets provide empirical evidence that the subclass of considered SDEs is sufficiently expressive to achieve competitive or state-of-the-art performance on various tasks.

2 Related work

Our work is mainly related to advances in the neural ODE/SDE literature and recent developments in the area of numerical methods for solving SDEs on Lie groups.

Neural ODEs and SDEs. Since the introduction of neural ODEs in [10], many works have proposed extensions to the paradigm of parameterizing the vector fields of an ODE by neural networks, ranging from more expressive models [16] to higher-order variants [60]. As ODEs are determined by their initial condition, [27; 39] have also introduced variants that can adjust trajectories based on subsequent observations. Neural SDEs [19; 24; 34; 45; 54; 58; 59], in which an SDE's drift and diffusion coefficients are parametrized by neural networks, represent another natural extension of the neural ODE paradigm and can account for uncertainty in the distribution over paths. From a learning perspective, arguably the most common way of fitting neural ODEs/SDEs to data is the Bayesian learning paradigm, where learning is understood as posterior inference [29], although neural SDEs may also be trained as GANs [26]. In the Bayesian setting, one may choose to model uncertainty (1) only over the initial latent state, leading to latent ODEs [10; 48], or (2) additionally over all chronologically subsequent latent states in time, leading to latent SDEs [19; 30]. In the first case, one only needs to select a suitable approximate posterior over the initial latent state, whereas, in the second case, the approximate posterior is defined over paths (e.g., in earlier work [3] chosen as a Gaussian process). Overall, with recent progress along the lines of (memory-)efficient gradient computation and reversible SDE solvers [25; 30], neural SDEs have become a powerful approach to model real-world phenomena under uncertainty.

SDEs on matrix Lie groups. In the setting of this work, SDEs are constructed via the action of a matrix Lie group on the corresponding homogeneous space. In particular, an SDE on the Lie group translates into an SDE on the homogeneous space.
Hence, numerical integration schemes that retain the Lie group structure are particularly relevant. Somewhat surprisingly, despite a vast amount of literature on numerical integration schemes for ODEs that evolve in Lie groups and retain the Lie group structure under discretization, e.g., [21; 41; 42], similar schemes for SDEs have appeared only recently. One incarnation, which we will use later on, is the geometric Euler-Maruyama scheme from [35; 36] for Itô SDEs. For the latter, [47] established convergence in the strong sense of order 1. Stochastic Runge-Kutta Munthe-Kaas schemes [40] of higher order were introduced just recently, however, at the cost of being numerically more involved and perhaps less appealing from a learning perspective. Finally, we remark that [44] introduces neural SDEs on Riemannian manifolds. However, due to the general nature of their construction, it is uncertain whether this approach can be transferred to a Bayesian learning setting and whether the intricate optimization problem of minimizing a discrepancy between measures scales to larger, high-dimensional datasets.

3 Methodology

3.1 Preliminaries

In the problem setting of this work, we assume access to a repeated (and partially observed) record of the continuous state history of a random dynamical system in observable space over time. The collected information is available in the form of a dataset of N multivariate time series $x^1, \ldots, x^N$. In search of a reasonable model that describes the data-generating system, we consider a continuous stochastic process $X : \Omega \times [0, T] \to \mathbb{R}^d$, where each data sample refers to a continuous path $x^i : [0, T] \to \mathbb{R}^d$. This path is assumed to be part of a collection of i.i.d. realizations sampled from the path distribution of the process X. As is common in practice, we do not have access to the entire paths $x^i$, but only to observations $x^i(t) \in \mathbb{R}^d$ at a finite number of time points $t = t_k$ that can vary between the $x^i$ and from which we seek to learn X.

Fitting such a process to data can be framed as posterior inference in a Bayesian learning setting, where the data-generating process is understood as a latent variable model: first, a latent path $z^i$ is drawn from a parametric prior distribution $p_{\theta^*}(z)$ with true parameters $\theta^*$; second, an observation path $x^i$ is drawn from the conditional distribution $p_{\theta^*}(x \mid z^i)$. Essentially, this is the problem setting of [29]; however, the latent variables z, as well as the observations $x^i$, are path-valued, and the true posterior $p_\theta(z \mid x)$ is intractable in general. To efficiently perform variational inference (e.g., with respect to z) and to fit the model to observations, we choose a tractable approximate posterior $q_\phi(z \mid x)$ from some family of parametric distributions, a suitable prior $p_\theta(z)$, and seek to maximize the evidence lower bound (ELBO)

$$\log p_\theta(x^i) \;\geq\; \mathcal{L}(\theta, \phi; x^i) \;=\; -D_{\mathrm{KL}}\big(q_\phi(z \mid x^i)\,\|\,p_\theta(z)\big) + \mathbb{E}_{z \sim q_\phi(z \mid x^i)}\big[\log p_\theta(x^i \mid z)\big]\ . \qquad (1)$$

This optimization problem is well understood in the context of vector-valued random variables and can be implemented efficiently (provided the choice of prior and approximate posterior allows a reparameterization trick). However, the setting of a path-valued random latent variable comes with additional challenges.
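To make the training objective concrete, the following PyTorch sketch assembles a Monte Carlo estimate of Eq. (1) from a reconstruction log-likelihood and a KL term. The function and argument names (`elbo`, `log_px_z`, `kl_qp`, `kl_weight`) are ours for illustration, and the optional KL weight anticipates the weighted objectives used for auxiliary tasks below; this is a minimal sketch, not the reference implementation.

```python
import torch

def elbo(log_px_z: torch.Tensor, kl_qp: torch.Tensor, kl_weight: float = 1.0) -> torch.Tensor:
    """Monte Carlo estimate of the ELBO in Eq. (1), averaged over a batch of sequences.

    log_px_z : (batch,) reconstruction term log p_theta(x^i | z), evaluated at a
               single sample z ~ q_phi(z | x^i) (reparametrized, so gradients flow).
    kl_qp    : (batch,) KL divergence D_KL(q_phi(z | x^i) || p_theta(z)).
    """
    return (log_px_z - kl_weight * kl_qp).mean()

# toy usage with dummy values; in practice both terms come from the model
log_px_z = torch.randn(8)
kl_qp = torch.rand(8)
loss = -elbo(log_px_z, kl_qp, kl_weight=1e-3)   # maximizing the ELBO = minimizing its negative
```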
Parametric families of path distributions. To efficiently perform approximate variational inference, we need to select a prior and approximate posterior in a reasonable and tractable manner from parametric families of distributions. To this end, we consider path distributions [11] of stochastic processes that are implicitly defined as strong solutions to linear Itô SDEs of the form

$$dZ_t = f(Z_t, t)\,dt + \sigma(Z_t, t)\,dW_t,\quad t \in [0, T],\quad Z_0 \sim P\ . \qquad (2)$$

Here, W is a vector-valued Wiener process, f and σ denote vector-valued drift and matrix-valued diffusion coefficients, respectively, and $Z_0$ is a multivariate random variable that follows some distribution P. To ensure the existence of a unique strong solution to Eq. (2), we require that f, σ and $Z_0$ satisfy the assumptions of [57, Theorem 5.5]. As in the seminal work on neural ODEs [10], one may parameterize the drift and diffusion coefficients via neural networks, yielding neural SDEs. Given these prerequisites, maximizing the ELBO in Eq. (1) requires computing gradients with respect to θ and ϕ, which hinges on sample paths from the approximate posterior process, generated by some (possibly adaptive step-size) SDE solver. Unfortunately, as mentioned in Section 1, modeling (2) with flexible drift and diffusion coefficients comes at high computational and memory cost. In the following (Section 3.2), we introduce our construction for SDEs on homogeneous spaces, discuss a simple SDE solver, and elaborate on the specific form of the KL divergence term in the ELBO of Eq. (1). In Section 3.3, we then instantiate the SDE construction on the example of the rotation group SO(n) acting on the unit n-sphere.

Solving auxiliary tasks. While learning SDEs of the form outlined above is an unsupervised problem, the overall approach can easily be extended to solve an auxiliary supervised task. For instance, we may have access to one target variable $y^i$ (e.g., a class label) per time series, or to one target variable per time point and time series, i.e., $y^i_{t_1}, \ldots, y^i_{t_m}$ (e.g., a class label or some real-valued quantity per time point). In these scenarios, we additionally learn to map latent states z to target variables and extend the ELBO from Eq. (1) with an additive task-specific loss term. In practice, the overall optimization objective is then a weighted sum of (i) the KL divergence, (ii) the log-likelihood, and (iii) a task-specific supervised loss.

3.2 SDEs on homogeneous spaces

Drawing inspiration from manifold learning, e.g., [7; 14; 17], we assume the unobservable latent stochastic process Z evolves on a low-dimensional manifold $M \subset \mathbb{R}^n$. If M admits a transitive action by a Lie group G, i.e., if it is a homogeneous G-space, then variational Bayesian inference can be performed over a family of processes that are induced by the group action. Specifically, we consider one-point motions, cf. [31], i.e., processes in M of the form $Z_t = G_t \cdot Z_0$ for some stochastic process G in a matrix Lie group $\mathrm{G} \subseteq \mathrm{GL}(n)$ that acts by matrix-vector multiplication. We define this stochastic process G as the solution to the linear SDE

$$dG_t = V_0(t)\,G_t\,dt + \sum_{i=1}^m V_i\,G_t\,dw^i_t,\quad G_0 = I_n\ . \qquad (3)$$

Herein, the $w^i$ are independent scalar Wiener processes, $V_1, \ldots, V_m \in \mathfrak{g}$ are fixed (matrix-valued) tangent vectors in the Lie algebra $\mathfrak{g}$, and $I_n$ is the $n \times n$ identity matrix. The drift function $V_0 = K + K^\perp : [0, T] \to \mathbb{R}^{n \times n}$ consists of a tangential ($\mathfrak{g}$-valued) component K plus an orthogonal component $K^\perp$ called the pinning drift [36; 53]. It offsets the stochastic drift away from the Lie group G due to the quadratic variation of the Wiener processes $w^i$ and only depends on the noise components $V_1, \ldots, V_m$ (and the geometry of G).
In case the Lie group G is quadratic, i.e., defined by some fixed matrix P via the equation $G^\top P G = P$, the pinning drift $K^\perp$ can be deduced directly from the condition $d(G_t^\top P G_t) = 0$, which implies that

$$V_0(t) = K(t) + \frac{1}{2}\sum_{i=1}^m V_i^2\ ,\qquad\text{and}\qquad V_i^\top P + P V_i = 0\ . \qquad (4)$$

We include the main arguments of the derivation of Eq. (4) in Appendix E.2 and refer to [6] and [12] for broader details. Finally, given an initial value $Z_0 \sim P$ drawn from some distribution P on the manifold M, the SDE for G induces an SDE for the one-point motion $Z = G Z_0$ via $dZ_t = dG_t\,Z_0$. Therefore, Z solves

$$dZ_t = V_0(t)\,Z_t\,dt + \sum_{i=1}^m V_i\,Z_t\,dw^i_t,\quad Z_0 \sim P\ . \qquad (5)$$

Numerical solutions. To sample paths from a process Z defined by an SDE as in Eq. (5), we numerically solve the SDE with a one-step geometric Euler-Maruyama scheme [36] (listed in Appendix E). Importantly, this numerical integration scheme ensures that the approximate numerical solutions reside in the Lie group, has convergence of strong order 1 [47], and can be implemented in a few lines of code (see Appendix D). The idea of the scheme is to iteratively solve the SDE over intervals $[t_j, t_{j+1}]$ of length $\Delta t$, using the relation

$$Z_t = G_t Z_0 = (G_t G_{t_j}^{-1})\,Z_{t_j}\ . \qquad (6)$$

Notably, $G_t G_{t_j}^{-1}$ follows the same SDE as $G_t$, see Eq. (3), but with initial value $I_n$ at $t = t_j$ instead of at $t = 0$. Moreover, if $\Delta t$ is sufficiently small, then $G_t G_{t_j}^{-1}$ stays close to the identity and reparametrizing $G_t G_{t_j}^{-1} = \exp(\Omega_t)$ by the matrix (or Lie group) exponential is feasible. The resulting SDE in the Lie algebra is linear and can thus be solved efficiently with an Euler-Maruyama scheme. Specifically, $\Omega_{t_{j+1}} = \big(V_0(t_j) - \tfrac{1}{2}\sum_{i=1}^m V_i^2\big)\Delta t + \sum_{i=1}^m \Delta w^i_j\,V_i = K(t_j)\,\Delta t + \sum_{i=1}^m \Delta w^i_j\,V_i$, where $\Delta w^i_j \sim \mathcal{N}(0, \Delta t)$, and we can differentiate through the numerical solver by leveraging the well-known reparametrization trick for the Gaussian distribution.

Kullback-Leibler (KL) divergence. For approximate inference in the variational Bayesian learning setting of Section 3.1, we seek to maximize the ELBO in Eq. (1). In our specific setting, the prior $p_\theta(z)$ and approximate posterior distribution $q_\phi(z|x)$ of latent paths z are determined by SDEs of the form of Eq. (5). In case these SDEs have the same diffusion term, i.e., the same $V_i$, $i > 0$, the KL divergence $D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p_\theta(z))$ in the ELBO is finite and can be computed via Girsanov's theorem. Specifically, if $V_0^{\mathrm{prior}}$ and $V_0^{\mathrm{post}}$ denote the respective drift functions and $\Delta V_0 = V_0^{\mathrm{prior}} - V_0^{\mathrm{post}}$, then

$$D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p_\theta(z)\big) = \frac{1}{2}\int_0^T\!\!\int_M q_{Z_t}(z)\,\big(\Delta V_0(t)\,z\big)^\top \Sigma^+(z)\,\big(\Delta V_0(t)\,z\big)\,dz\,dt\ , \qquad (7)$$

where $q_{Z_t}$ is the marginal distribution of the approximate posterior process Z at time t and $\Sigma^+(z)$ is the pseudo-inverse of $\Sigma(z) = \sum_{i=1}^m V_i z z^\top V_i^\top$. For a derivation of this result, we refer to [43] and Appendix E.5.

3.3 SDEs on the unit n-sphere

The unit n-sphere $S^{n-1}$ frequently occurs in many learning settings, e.g., see [13; 15; 32; 33; 38; 56]. It is a homogeneous space in the context of the group SO(n) of rotations. For modeling time series data, $S^{n-1}$ is appealing due to its intrinsic connection to trigonometric functions and, in a Bayesian learning setting, it is appealing since compactness allows selecting the uniform distribution $U_{S^{n-1}}$ as an uninformative prior on the initial SDE state $Z_0$. The unit n-sphere also fits perfectly into the above framework, as SO(n) is the identity component of O(n), i.e., the quadratic Lie group defined by $P = I_n$. Its Lie algebra $\mathfrak{so}(n) = \{A \in \mathbb{R}^{n \times n} : A + A^\top = 0\}$ consists of skew-symmetric $n \times n$ matrices and has a basis $\{E_{kl} : 1 \le k < l \le n\}$ with $E_{kl} = e_k e_l^\top - e_l e_k^\top$, where $e_k$ is the k-th standard basis vector of $\mathbb{R}^n$.
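As a concrete illustration, the following minimal PyTorch sketch builds the skew-symmetric basis of so(n) and runs the one-step geometric Euler-Maruyama scheme on the unit sphere. It mirrors the fuller listings in Appendix D, but the function names (`so_n_basis`, `gem_step`) and the particular choices of drift K, noise scale, and step size are placeholder assumptions rather than values from the reference implementation.

```python
import torch

def so_n_basis(n: int) -> torch.Tensor:
    """Skew-symmetric basis {E_kl = e_k e_l^T - e_l e_k^T : 1 <= k < l <= n} of so(n),
    returned as a tensor of shape (n(n-1)/2, n, n)."""
    idx = torch.combinations(torch.arange(n))             # all index pairs (k, l) with k < l
    E = torch.zeros(len(idx), n, n)
    E[torch.arange(len(idx)), idx[:, 0], idx[:, 1]] = 1.0
    return E - E.transpose(1, 2)

def gem_step(z: torch.Tensor, K: torch.Tensor, V: torch.Tensor, dt: float) -> torch.Tensor:
    """One geometric Euler-Maruyama step: Omega = K*dt + sum_i dw_i V_i with dw_i ~ N(0, dt),
    then z <- exp(Omega) z (reparametrized Gaussian increments, so the step is differentiable)."""
    dw = torch.randn(V.shape[0]) * dt ** 0.5
    Omega = K * dt + torch.einsum('i,ikl->kl', dw, V)      # element of so(n)
    return torch.matrix_exp(Omega) @ z

# toy usage: n = 4, zero tangential drift, noise fields V_i = 0.1 * E_i (arbitrary choices)
n = 4
V = 0.1 * so_n_basis(n)
K = torch.zeros(n, n)
z = torch.nn.functional.normalize(torch.randn(n), dim=0)
for _ in range(100):
    z = gem_step(z, K, V, dt=0.01)
print(z.norm())                                            # remains 1 up to numerical precision
```

Because each Ω is skew-symmetric, exp(Ω) is a rotation, so the iterates stay on the unit sphere by construction rather than by projection.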
Importantly, if $\alpha_\theta \neq 0$, $\{V_i\}_{i=1}^{n(n-1)/2} = \{\alpha_\theta E_{kl}\}_{1 \le k < l \le n}$ […]

Using K > 1 polynomials yields better interpolation results, but at a loss of extrapolation quality, as the higher-order polynomials are less well-behaved for large t. Regarding the baselines, we do not compare against (f-)CRU and mTAND, as they are not directly applicable here. As only the image at the initial time point is used as model input, learning any reasonable attention mechanism (mTAND) or conditioning on observations spread out over the observation time interval (CRU) is impossible.

Runtime measurements. For a detailed runtime analysis across different approaches that model an underlying (stochastic) differential equation, we refer the reader to Appendix C.1.

Figure 4: Exemplary reconstructions on the Rotating MNIST data. Shown are the results (on one testing sequence) obtained by integrating forward from t = 0 (marked red) to t = 4, i.e., three times longer than what is observed (t ∈ [0, 1)) during training.

5 Discussion & Limitations

We have presented an approach to learning (neural) latent SDEs on homogeneous spaces within a Bayesian learning regime. Specifically, we focused on one incarnation of this construction, i.e., the unit n-sphere, due to its abundance in many learning problems. Our SDEs are, by design, less expressive than in earlier works. Yet, when modeling a latent variable, this limitation appears to be of minor importance (as shown in our experiments). We do, however, gain in terms of overall model complexity, as backpropagating through the SDE solver can be done efficiently (in a discretize-then-optimize manner) without having to resort to adjoint sensitivity methods or variable step-size solvers. While our approach could benefit from the latter, we leave this for future research. Overall, the class of neural SDEs considered in this work presents a viable alternative to existing neural SDE approaches that seek to learn (almost) arbitrary SDEs by fully parameterizing the vector fields of an SDE's drift and diffusion coefficients by neural networks. In our case, we do use neural networks as high-capacity function approximators, but in a setting where one prescribes the space on which the SDE evolves. Finally, whether it is reasonable to constrain the class of SDEs depends on the application type. In situations where one seeks to learn a model in observation space directly, approaches that allow more model flexibility may be preferable.

Acknowledgments

This work was supported by the Austrian Science Fund (FWF) under project P31799-N38 and the Land Salzburg within the EXDIGIT project 20204-WISS/263/6-6022 and projects 0102-F1901166-KZP, 20204-WISS/225/197-2019. Source code is available at https://github.com/plus-rkwitt/LatentSDEonHS.

References

[1] D. Applebaum. Probability on Compact Lie Groups. 1st edition. Vol. 70. Probability Theory and Stochastic Modelling. Springer, 2014.
[2] M. Arató. A famous nonlinear stochastic equation (Lotka-Volterra model with diffusion). In: Mathematical and Computer Modelling 38 (2003), pp. 709-726.
[3] C. Archambeau, M. Opper, Y. Shen, D. Cornford, and J. Shawe-Taylor. Variational Inference for Diffusion Processes. In: NIPS. 2007.
[4] P. Becker, H. Pandya, G. Gebhardt, C. Zhao, J. Taylor, and G. Neumann. Recurrent Kalman Networks: Factorized Inference in High-Dimensional Deep Feature Spaces. In: ICML. 2019.
[5] F. Black and M. Scholes. The Pricing of Options and Corporate Liabilities. In: Journal of Political Economy 81.3 (1973), pp. 637-654.
[6] R. W. Brockett. Lie Algebras and Lie Groups in Control Theory.
In: Geometric Methods in System Theory. Springer, 1973, pp. 43-82.
[7] B. C. Brown, A. L. Caterini, B. L. Ross, J. C. Cresswell, and G. Loaiza-Ganem. Verifying the Union of Manifolds Hypothesis for Image Data. In: ICLR. 2023.
[8] N. D. Cao and W. Aziz. The Power Spherical distribution. In: arXiv abs/2006.04437 (2020).
[9] F. P. Casale, A. V. Dalca, L. Saglietti, J. Listgarten, and N. Fusi. Gaussian Process Prior Variational Autoencoders. In: NeurIPS. 2018.
[10] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud. Neural Ordinary Differential Equations. In: NeurIPS. 2018.
[11] R. Chetrite and H. Touchette. Nonequilibrium Markov Processes Conditioned on Large Deviations. In: Annales Henri Poincaré 16 (2015), pp. 2005-2057.
[12] G. S. Chirikjian. Stochastic models, information theory, and Lie groups. In: Applied and numerical harmonic analysis. Boston, Basel, and Berlin: Birkhäuser, 2012, pp. 361-388.
[13] T. S. Cohen, M. Geiger, J. Köhler, and M. Welling. Spherical CNNs. In: ICLR. 2018.
[14] M. Connor, G. Canal, and C. Rozell. Variational Autoencoder with Learned Latent Structure. In: AISTATS. 2021.
[15] T. R. Davidson, L. Falorsi, N. De Cao, T. Kipf, and J. M. Tomczak. Hyperspherical Variational Autoencoders. In: UAI. 2018.
[16] E. Dupont, A. Doucet, and Y. W. Teh. Augmented Neural ODEs. In: NeurIPS. 2019.
[17] C. Fefferman, S. Mitter, and H. Narayanan. Testing the manifold hypothesis. In: Journal of the American Mathematical Society 29.4 (2016), pp. 983-1049.
[18] H. Fu, C. Li, X. Liu, J. Gao, A. Celikyilmaz, and L. Carin. Cyclical Annealing Schedule: A Simple Approach to Mitigating KL Vanishing. In: NAACL. 2019.
[19] A. Hasan, J. M. Pereira, S. Farsiu, and V. Tarokh. Identifying Latent Stochastic Differential Equations. In: IEEE Transactions on Signal Processing 70 (2022), pp. 89-104.
[20] M. Horn, M. Moor, C. Bock, B. Rieck, and K. Borgwardt. Set Functions for Time Series. In: ICML. 2020.
[21] A. Iserles, H. Z. Munthe-Kaas, S. P. Nørsett, and A. Zanna. Lie-group methods. In: Acta Numerica 9 (2000), pp. 215-365.
[22] K. Itô. Brownian motions in a Lie group. In: Proceedings of the Japan Academy, Series A, Mathematical Sciences 26.8 (1950).
[23] K. Itô. Stochastic Calculus. In: Lect. Notes Phys. 39 (1975). Ed. by H. Araki, pp. 218-223.
[24] J. Jia and A. R. Benson. Neural Jump Stochastic Differential Equations. In: NeurIPS. 2019.
[25] P. Kidger, J. Foster, X. Li, and T. Lyons. Efficient and Accurate Gradients for Neural SDEs. In: NeurIPS. 2021.
[26] P. Kidger, J. Foster, X. Li, H. Oberhauser, and T. Lyons. Neural SDEs as Infinite-Dimensional GANs. In: ICML. 2021.
[27] P. Kidger, J. Morrill, J. Foster, and T. Lyons. Neural Controlled Differential Equations for Irregular Time Series. In: NeurIPS. 2020.
[28] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In: ICLR. 2015.
[29] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In: ICLR. 2014.
[30] X. Li, T.-K. L. Wong, R. T. Chen, and D. Duvenaud. Scalable Gradients for Stochastic Differential Equations. In: AISTATS. 2020.
[31] M. Liao. Lévy processes in Lie groups. Providence, R.I.: Cambridge University Press, 2004.
[32] W. Liu, R. Lin, Z. Liu, L. Xiong, B. Schölkopf, and A. Weller. Learning with Hyperspherical Uniformity. In: AISTATS. 2021.
[33] W. Liu, Y. Zhang, X. Li, Z. Liu, B. Dai, T. Zhao, and L. Song. Deep Hyperspherical Learning. In: NeurIPS. 2017.
[34] X. Liu, T. Xiao, S. Si, Q. Cao, S. Kumar, and C.-J. Hsieh.
Neural SDE: Stabilizing Neural ODE Networks with Stochastic Noise. In: arXiv abs/1906.02355 (2019).
[35] G. Marjanovic, M. Piggott, and V. Solo. A Simple Approach to Numerical Methods for Stochastic Differential Equations in Lie Groups. In: CDC. 2015.
[36] G. Marjanovic and V. Solo. Numerical Methods for Stochastic Differential Equations in Matrix Lie Groups Made Simple. In: IEEE Transactions on Automatic Control 63.12 (2018), pp. 4035-4050.
[37] S. Massaroli, M. Poli, J. Park, A. Yamashita, and H. Asama. Dissecting Neural ODEs. In: NeurIPS. 2020.
[38] P. Mettes, E. van der Pol, and C. Snoek. Hyperspherical Prototype Networks. In: NeurIPS. 2019.
[39] J. Morrill, C. Salvi, P. Kidger, and J. Foster. Neural rough differential equations for long time series. In: ICML. 2021.
[40] M. Muniz, M. Ehrhardt, M. Günther, and R. Winkler. Higher strong order methods for linear Itô SDEs on matrix Lie groups. In: BIT Numerical Mathematics 62.4 (2022), pp. 1095-1119.
[41] H. Munthe-Kaas. High order Runge-Kutta methods on manifolds. In: Applied Numerical Mathematics 29.1 (1999). Proceedings of the NSF/CBMS Regional Conference on Numerical Analysis of Hamiltonian Differential Equations, pp. 115-127.
[42] H. Munthe-Kaas. Runge-Kutta methods on Lie groups. In: BIT Numerical Mathematics 38.1 (1998), pp. 92-111.
[43] M. Opper. Variational Inference for Stochastic Differential Equations. In: Annalen der Physik 531.3 (2019), p. 1800233.
[44] S. W. Park, H. Kim, K. Lee, and J. Kwon. Riemannian Neural SDE: Learning Stochastic Representations on Manifolds. In: NeurIPS. 2022.
[45] S. W. Park, K. Lee, and J. Kwon. Neural Markov Controlled SDE: Stochastic Optimization for Continuous-Time Data. In: ICLR. 2022.
[46] G. A. Pavliotis. Stochastic Processes and Applications. Springer, 2014.
[47] M. Piggott and V. Solo. Geometric Euler-Maruyama Schemes for Stochastic Differential Equations in SO(n) and SE(n). In: SIAM Journal on Numerical Analysis 54.4 (2016), pp. 2490-2516.
[48] Y. Rubanova, R. T. Chen, and D. Duvenaud. Latent ODEs for Irregularly-Sampled Time Series. In: NeurIPS. 2019.
[49] M. Schirmer, M. Eltayeb, S. Lessmann, and M. Rudolph. Modeling Irregular Time Series with Continuous Recurrent Units. In: ICML. 2022.
[50] S. N. Shukla and B. M. Marlin. Multi-Time Attention Networks for Irregularly Sampled Time Series. In: ICLR. 2021.
[51] I. Silva, G. Moody, D. J. Scott, L. A. Celi, and R. G. Mark. Predicting in-hospital mortality of ICU patients: The PhysioNet/Computing in Cardiology Challenge 2012. In: Computing in Cardiology 39 (2012), pp. 245-248.
[52] J. T. Smith, A. Warrington, and S. W. Linderman. Simplified State Space Layers for Sequence Modeling. In: ICLR. 2023.
[53] V. Solo. An approach to stochastic system identification in Riemannian manifolds. In: CDC. 2014.
[54] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In: ICLR. 2021.
[55] D. W. Stroock. On the growth of stochastic integrals. In: Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 18.4 (1971), pp. 340-344.
[56] C. Tan, Z. Gao, L. Wu, S. Li, and S. Z. Li. Hyperspherical Consistency Regularization. In: CVPR. 2022.
[57] B. Øksendal. Stochastic Differential Equations. Universitext. Springer, 1995.
[58] B. Tzen and M. Raginsky. Neural Stochastic Differential Equations: Deep Latent Gaussian Models in the Diffusion Limit.
In: arXiv abs/1905.09883 (2019).
[59] B. Tzen and M. Raginsky. Theoretical guarantees for sampling and inference in generative models with latent diffusions. In: COLT. 2019.
[60] Ç. Yildiz, M. Heinonen, and H. Lahdesmäki. ODE2VAE: Deep generative second order ODEs with Bayesian neural networks. In: NeurIPS. 2019.

Supplementary material

In this supplementary material, we report full dataset descriptions, list architecture details, present additional experiments, and report any left-out derivations.

A Datasets
  A.1 Human Activity
  A.2 Pendulum
  A.3 PhysioNet (2012)
  A.4 Rotating MNIST
B Architecture details
  B.1 Human Activity
  B.2 Pendulum regression & interpolation
  B.3 PhysioNet (2012)
  B.4 Rotating MNIST
  B.5 Computing infrastructure
C Additional empirical results
  C.1 Runtime measurements
  C.2 Stochastic Lorenz attractor
  C.3 Uncertainty quantification
D Implementation
E Theory
  E.1 Background on stochastic differential equations (SDEs)
  E.2 SDEs in quadratic matrix Lie groups
  E.3 SDEs in homogeneous spaces of quadratic matrix Lie groups
  E.4 SDEs on the unit n-sphere
  E.5 Kullback-Leibler (KL) divergences

Societal impact. This work mainly revolves around a novel type of neural SDE model. We envision no direct negative societal impact but remark that any application of such a model, e.g., in the context of highly sensitive personal data (such as ICU stays in PhysioNet (2012)), should be inspected carefully for potential biases. This is particularly relevant if missing measurements are interpolated and eventually used in support of treatment decisions. In this scenario, biases (most likely due to the training data) could disproportionately affect any minority group, for instance, as imputation results might be insufficiently accurate.

A Datasets

This section contains a description of each dataset we use in our experiments. While all datasets have been used extensively in the literature, reported results are often not directly comparable due to inconsistent preprocessing steps and/or subtle differences in terms of data partitioning as well as variable selection (e.g., for PhysioNet (2012)). Hence, we find it essential to highlight the exact data-loading and preprocessing implementations we use: for the Human Activity and PhysioNet (2012) experiments, we rely on publicly available code from [50] (which is based upon the reference implementation of [48]). For the Pendulum experiments, we rely on the reference implementation from [49], and for Rotating MNIST, we rely on the reference implementation of [60].
A.1 Human Activity This motion capture dataset5, used in Section 4.1, consists of time series collected from five individuals performing different activities. The time series are 12-dimensional, corresponding to 3D coordinates captured by four motion sensors. Following the preprocessing regime of [48] (who group multiple motions into one of seven activities), we have 6,554 sequences available with 211 unique time points overall and a fixed sequence length of 50 time points. The time range of each sequence is mapped to the time interval [0, 1], and no normalization of the coordinates is performed. The task is to assign each time point to one of the seven (motion/activity) categories (i.e., lying , lying down , sitting , sitting down , standing up from lying , standing up from sitting , standing up from sitting on the ground ). The dataset is split into 4,194 sequences for training, 1,311 for testing, and 1,049 for validation. Notably, the majority of time series have one constant label, with only 239 (18.2%) of the time series in the testing portion of the dataset containing actual label switches (e.g., from lying to standing up from lying ). A.2 Pendulum The pendulum dataset, used in Sections 4.1 and 4.2, contains algorithmically generated [4] (grayscale) pendulum images of size 24 24. For the regression task from [49], we are given time series of these pendulum images at 50 irregularly spaced time points with the objective to predict the pendulum position, parameterized by the (sine, cosine) of the pendulum angle. The time series exhibit noise in the pixels and in the pendulum position (in form of a deviation from the underlying physical model). For the interpolation task from [49], uncorrupted images are available, but we only observe a smaller fraction of the full time series (50% of time points randomly sampled per time series). The task is to interpolate images at the unobserved time points. Two exemplary time series (per task) from the training portion of the dataset are shown in Figure 5. For both tasks, we use 4,000 images for training, 1,000 images for testing and 1,000 images for validation. Figure 5: Illustration of one exemplary training time series from the (left) regression and (right) pendulum position interpolation task, respectively, from [49]. Note that the images shown here are inverted for better visualization. A.3 Physio Net (2012) The Physio Net (2012) [51] dataset has been used many times in the literature. However, a clean comparison is difficult as the preprocessing steps, as well as the split and variable selection, often differ in subtle details. Hence, we focus on one evaluation setup [50] and re-evaluate prior work precisely in this setting instead of adopting the reported performance measures. 5https://doi.org/10.24432/C57G8X The original challenge dataset contains time series for 12,000 ICU stays over a time span of 48 hours. 41 variables are available, out of which a small set is only measured at admission time (e.g., age, gender), and up to 37 variables are sparsely available within the 48-hour window. To the best of our knowledge, almost all prior works normalize each variable to [0, 1] (e.g., [48; 49; 50] and many more). In [50], and hence in our work as well, all 41 variables are used, and only the training portion of the full challenge dataset is considered. The latter consists of 8,000 time series, split into 80% training sequences and 20% testing sequences. 
Depending on the experimental regime in prior works, observations are either aggregated within 1 min [48; 50] or within 6 min [49] time slots. This yields 2,880 and 480 possible (unique) time points, respectively. Overall, the available measurements in this dataset are very sparse, i.e., only 2% of all potential measurements are available. While prior work typically focuses on mortality prediction, i.e., predicting one binary outcome per time series, we are interested in interpolating observations at unseen time points. To assess this interpolation capability of a model, we adopt the interpolation task of [50]. In this task, interpolation is understood as observing a fraction (p) of the (already few) available time points per sequence and interpolating observations at the remaining fraction (1 − p) of time points. From our perspective, this is considerably more challenging than the setting in [49] or [48], as the experimental setup in the two latter works corresponds to p = 1. Furthermore, the preprocessing regime in [50] slightly differs from [49] in two aspects: (1) in the way each variable is normalized and (2) in the number of variables used (41 in [50] and 37 in [49]). Regarding the normalization step: letting $x_1, \ldots, x_N$ denote all values of one variable in the training dataset, and $x_{\min}, x_{\max}$ denote the corresponding minimum and maximum values of that variable, [50] normalizes by $(x_i - x_{\min})/x_{\max}$, whereas [49] normalizes by $(x_i - x_{\min})/(x_{\max} - x_{\min})$. For a fair comparison, we re-ran all experiments with the preprocessing of [50].

A.4 Rotating MNIST

This dataset [9] consists of sequences of 28 × 28 grayscale images of gradually clockwise rotated versions of the handwritten digit "3". One complete rotation takes place over 16 evenly spaced time points. During training, one time point is consistently left out (according to the reference implementation of [60], this is the 4th time point, i.e., at time t = 3/16) and four additional time points are dropped randomly per sequence. Upon receiving the first image at time t = 0, the task is to interpolate the image at the left-out (4th) time point. Exemplary training sequences are shown in Figure 6. We exactly follow the dataset splitting protocol of [60], where 360 images are used for training, 360 for testing, and 36 for validation and hyperparameter selection.

Figure 6: Five (counted in rows) exemplary training sequences from the Rotating MNIST dataset.

B Architecture details

The three types of recognition networks (i.e., A, B, or C) we use throughout all experiments are shown in Figure 2. In the following, we specify the configuration of the networks (from Figure 2) per experiment, discuss the choice of loss function (in addition to the KL divergence terms from the ELBO), and provide details about the architecture(s) in works we compare to. Irrespective of the particular type of network that processes an input sequence, our parameterization of the SDE on the n-sphere remains the same across all experiments. In particular, given a representation of an input sequence, say h, we linearly map to (1) the parameters of the power spherical distribution, i.e., the posterior distribution on the initial state, and (2) the coefficients of K Chebyshev polynomials, cf. Eq. (11), that control the time evolution of $K^\phi$. Here, the superscript ϕ denotes the dependency of K on the (parameters of the) recognition network. As $K^\phi$ is skew-symmetric and has, in the case of SO(n), n(n−1)/2 free parameters, we linearly map to K · n(n−1)/2 coefficients overall.
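The following sketch illustrates one way to implement this mapping; since Eq. (11) is not reproduced here, the exact polynomial convention (Chebyshev polynomials of the first kind with [0, 1] rescaled to [−1, 1]) is an assumption, and all names (`DriftHead`, `coeff`, `basis`) are hypothetical rather than taken from the released code.

```python
import torch
import torch.nn as nn

class DriftHead(nn.Module):
    """Maps a sequence representation h to a time-varying skew-symmetric drift K^phi(t)
    via K coefficients per basis element of so(n) (an illustrative sketch, cf. Eq. (11))."""

    def __init__(self, h_dim: int, n: int, num_poly: int):
        super().__init__()
        self.num_poly = num_poly
        self.m = n * (n - 1) // 2                              # free parameters of so(n)
        self.coeff = nn.Linear(h_dim, num_poly * self.m)
        idx = torch.combinations(torch.arange(n))              # basis E_kl, 1 <= k < l <= n
        E = torch.zeros(self.m, n, n)
        E[torch.arange(self.m), idx[:, 0], idx[:, 1]] = 1.0
        self.register_buffer("basis", E - E.transpose(1, 2))

    def forward(self, h: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        """h: (batch, h_dim), t: (num_t,) in [0, 1]  ->  K(t): (batch, num_t, n, n)."""
        c = self.coeff(h).view(-1, self.num_poly, self.m)
        s = 2.0 * t - 1.0                                      # Chebyshev recurrence on [-1, 1]
        T = [torch.ones_like(s), s]
        for _ in range(2, self.num_poly):
            T.append(2.0 * s * T[-1] - T[-2])
        T = torch.stack(T[: self.num_poly], dim=0)             # (num_poly, num_t)
        a = torch.einsum('bkm,kt->btm', c, T)                  # time-dependent coefficients
        return torch.einsum('btm,mij->btij', a, self.basis)    # skew-symmetric by construction

# toy usage matching the settings below (K = 6 polynomials, n = 16)
head = DriftHead(h_dim=128, n=16, num_poly=6)
K_t = head(torch.randn(4, 128), torch.linspace(0, 1, 50))
print(K_t.shape)                                               # torch.Size([4, 50, 16, 16])
```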
Unless noted otherwise, we consistently use K = 6 polynomials and set n = 16. In all experiments with ODE/SDE type models, we use ADAM [28] for optimization, with initial learning rate set to 1e-3 and subject to a cyclic cosine learning rate schedule (cycle length of 60 epochs) [18]. Finally, when replacing our SDE on the n-sphere by an ODE in Rn (denoted as m TAND-ODE), we use an Euler scheme as ODE solver and compute gradients via the adjoint sensitivity method (implemented in the publicly available torchdiffeq package). B.1 Human Activity As recognition network we use type (B, ) from Figure 2 with an m TAND encoder (m TAND-Enc [50]) configured as in the publicly-available reference implementation6. When used as a standalone module (i.e., without any subsequent ODE/SDE), m TAND-Enc uses the outputs of its final GRU unit (which yields representations in R128 at all desired time points) to linearly map to class predictions. In our setting, we take the last hidden state of this GRU unit as input h R128 to the linear maps mentioned above. When replacing our SDE on the n-sphere by an ODE in Rn (denoted as m TANDODE), the last GRU state is linearly mapped to the location parameter of a Gaussian distribution and to the entries of a diagonal covariance matrix. For m TAND-ODE, the ODE is parameterized as in [48] by a 3-layer MLP (with ELU activations) with 32 hidden units, but the dimensionality of the latent space is increased to 16 (for a fair comparison). As in our approach, latent states are linearly mapped to class predictions. Notably, when comparing ODE-ODE [48] to m TAND-ODE, the only conceptual difference is in the choice of recognition network (i.e., an ODE in [48], vs. an m TAND encoder here). Loss components. In addition to the KL divergence terms in the ELBO, the cross-entropy loss is computed on class predictions at the available time points. For our approach, as well as m TAND-ODE, the selected weighting of the KL divergence terms is based on the validation set accuracy (and no KL annealing is performed). B.2 Pendulum regression & interpolation In the pendulum experiments of Section 4.1 and Section 4.2, we parameterize the recognition network by type (A, ) from Figure 2. Images are first fed through a convolutional neural network (CNN) with the same architecture as in [49] and the so obtained representations are then input to an m TAND encoder (m TAND-Enc). Different to the m TAND configuration in Appendix B.1, the final GRU unit is set to output representations h R32. For the regression task, latent states are decoded via the same two-layer MLP from [49]. For interpolation, we also use the convolutional decoder from [49]. Loss components. For the regression task, we compute the mean-squared-error (MSE) on the (sine, cosine) encodings of the pendulum angle and use a Gaussian log-likelihood on the image reconstructions. Although the images for the regression task are corrupted (by noise), we found that adding a reconstruction loss helped against overfitting. The weighting of the KL divergence terms in the ELBO is selected on the validation set (and no KL annealing is performed). For the interpolation task, the MSE term is obviously not present in the overall optimization objective. Remark on m TAND-Enc. 
While the m TAND-Enc approach from [50] fails in the experimental runs reported in [49] (see m TAND-Enc in Table 2), we highlight that simply using the outputs of the final GRU unit as input to the two-layer MLP that predicts the (sine, cosine) tuples worked remarkably well in our experiments. 6https://github.com/reml-lab/m TAN B.3 Physio Net (2012) In the Physio Net (2012) interpolation experiment, we parameterize our recognition network by type (B, ) from Figure 2 using representations h R32 as input to the linear layers implementing the parameters of the drift and the initial distribution. The m TAND encoder is configured as suggested in the m TAND reference implementation for this experiment. We further adjusted the reference implementation7 of (f-)CRU from [49] to use exactly the same data preprocessing pipeline as in [50] and configure (f-)CRU as suggested by the authors. Loss components. In addition to the KL divergence terms, the ELBO includes the Gaussian loglikelihood of reconstructed observations at the available time points (using a fixed standard deviation of 0.01). The weightings of the KL divergence terms is fixed to 1e-5, without any further tuning on a separate validation portion of the dataset (and no KL annealing is performed). B.4 Rotating MNIST In the rotating MNIST experiment, the recognition network is parameterized by type (C, ) from Figure 2. In particular, we rely on the convolutional encoder (and decoder) from [60], except that we only use the first image as input (whereas [60] also need the two subsequent frames for their momentum encoder). Unlike the other experiments, where the Chebyshev polynomials are evaluated on the time interval [0, 1], we deliberately set this range to [0, 20] in this experiment. This is motivated by the fact that we also seek to extrapolate beyond the time range observed during training, see Figure 4. Furthermore, we use only one polynomial (K = 1), which is reasonable due to the underlying simple dynamic of a constant-speed rotation; setting K > 1 only marginally changes the results listed in Table 5. Loss components. Gaussian log-likelihood is used on image reconstructions and weights for the KL divergences are selected based on the validation set (and no KL annealing is performed). B.5 Computing infrastructure All experiments were executed on systems running Ubuntu 22.04.2 LTS (kernel version 5.15.0-71-generic x86_64) with 128 GB of main memory and equipped with either NVIDIA Ge Force RTX 2080 Ti or Ge Force RTX 3090 Ti GPUs. All experiments are implemented in Py Torch (v1.13 and also tested on v2.0). In this hardware/system configuration, all runs reported in the tables of the main manuscript terminated within a 24-hour time window. C Additional empirical results C.1 Runtime measurements We compare the runtime of our method to two close alternatives, where we substitute our latent (neural) SDE on the unit n-sphere, by either a latent (neural) ODE [10; 48] or a latent (neural) SDE [30] in Rn. In other words, we exchange the model components for the latent dynamics in the approaches we use for Rotating MNIST and Physio Net (2012), see Section 4. This provides insights into runtime behavior as a function of the integration steps, moving from 16 steps on Rotating MNIST to 480 on Physio Net (2012) at a 6 min quantization level and to 2,880 steps on Physio Net (2012) at a 1 min quantization level. In all experiments, we exclusively use Euler s method as ODE/SDE solver with a fixed step size. Results are summarized in Table 6. 
In particular, we list the elapsed real time for a forward+backward pass (batch size 50), averaged over 1,000 iterations.

7 https://github.com/boschresearch/Continuous-Recurrent-Units

Table 6: Comparison of the elapsed real time (in seconds) of a forward+backward pass over a batch (of size 50) across different tasks/datasets (with varying numbers of integration steps). For each task, the overall model sizes (in terms of the number of parameters) are comparable. Listed values are averages over 1,000 iterations ± the corresponding standard deviations. We remark that the runtime of our approach when using (R^n, GL(n)) instead of (S^{n-1}, SO(n)) differs only marginally.

                              Rotating MNIST    PhysioNet (2012, 6 min)    PhysioNet (2012, 1 min)
Number of integration steps   16                480                        2,880
CNN/mTAND-ODE ([10; 48])      0.053 ± 0.004     0.926 ± 0.017              3.701 ± 0.033
CNN/mTAND-SDE ([30])          0.112 ± 0.009     2.075 ± 0.042              10.273 ± 0.173
Ours (S^{n-1}, SO(n))         0.055 ± 0.005     0.179 ± 0.008              2.693 ± 0.027

Overall, although our approach implements a latent SDE, the runtime per batch is on a par with (or better than) the latent ODE variant on Rotating MNIST and on PhysioNet (2012) at both quantization levels. Furthermore, we observe a substantially lower runtime of our approach compared to the (more flexible) latent SDE work of [30], which is closest in terms of modeling choice. It is, however, important to point out that the way our SDE evolves on the sphere (i.e., as a consequence of the group action) lends itself to a quite efficient implementation where large parts of the computation can be carried out as a single tensor operation (see Appendix D).

Figure 7: Training time comparison on Rotating MNIST for five different training runs per model. Bottom: training MSE vs. (accumulated) training time of each model. Top: training MSE (dashed) and standard deviations (shaded) vs. number of training epochs.

Aside from the direct runtime comparison above, we also point out that while a shorter (forward) runtime per batch ensures a faster inference time, it does not necessarily result in faster overall training time, as the convergence speed might differ among methods. To account for the latter, we compare the evolution of training losses in Figure 7 when training on the Rotating MNIST dataset. The upper plot shows training MSE vs. the number of epochs and reveals that our method needs fewer updates to converge, whereas the same model with the latent dynamics replaced by a latent ODE [10] or a latent SDE [30] in R^n converges at approximately the same rate in either case. The bottom plot accounts for the different runtimes per batch and reveals that our method needs only around half the training time of the latent SDE [30] variant in R^n. We want to point out, though, that runtime is, to a large extent, implementation-dependent, and therefore the presented results should be taken with caution. For the baselines, we used publicly available implementations of ODE/SDE solvers, i.e., https://github.com/rtqichen/torchdiffeq for the latent ODE [10] models and https://github.com/google-research/torchsde for the latent SDE [30] models.

Table 7: Relative timing results (forward+backward pass) on PhysioNet (2012) for both quantization levels (reported in the main manuscript).

              6 min    1 min
CRU           22       25
f-CRU         17       20
mTAND-Full    1        1

Comparison to non-ODE/SDE models.
Finally, we present a runtime comparison with methods that do not model an underlying (stochastic) differential equation, i.e., mTAND-Full and CRU. This relative runtime comparison was performed on the PhysioNet (2012) dataset, and the runtimes per batch are averaged over batches of size 50; results are summarized in Table 7. The PhysioNet (2012) dataset is particularly interesting concerning runtime, as it contains the most time points (480 potential time points for a quantization level of 6 minutes and 2,880 time points for a quantization level of 1 minute). All runtime experiments were performed on the same hardware (see Appendix B.5), using a GeForce RTX 2080 Ti GPU.

C.2 Stochastic Lorenz attractor

To illustrate that our model is expressive enough to model difficult dynamics, we replicate the experiment by [30] of fitting a stochastic Lorenz attractor. Figure 8 shows 75 posterior sample paths.

Figure 8: Shown are 75 posterior sample paths from our model on the stochastic Lorenz attractor data of [30].

C.3 Uncertainty quantification

In the main part of this manuscript, we primarily focused on the interpolation and regression/classification capabilities of our method on existing benchmark data. It is, however, worth pointing out that our generative model can be utilized for uncertainty quantification. We can sample from the probabilistic model outlined in Section 3 and use these samples to estimate the underlying distribution. While we did not evaluate the uncertainty quantification aspect as thoroughly as the regression/classification performance, we present preliminary qualitative results on the Pendulum angle regression experiment in Figure 9. This figure shows density estimates for the angle predictions on testing data. The spread of the angle predictions is (as one would expect) sensitive to the KL-divergence weight in the loss function. The present model was trained with a weight of 1e-3.

Figure 9: Top: Visualization of quality of fit and uncertainty for the first 12 testing time series in the pendulum regression task. Shown is the ground truth (in blue) and the predicted angle (sine). Shaded regions correspond to densities, estimated from quantiles of 1,000 draws from the posterior distribution. Bottom: Zoomed-in plot of the first time series.

D Implementation

To highlight that our approach can be implemented in a few lines of code, Listings 1, 2 and 3 provide a minimal working example (in PyTorch) of how to obtain solutions to an SDE on S^{n-1}. The only part left out in this example is the generation of the time evolution of K. To execute the code snippet in Listing 4, the reference implementation of the power spherical distribution⁸ from [8] as well as the einops package⁹ are needed.

Listing 1: SDE solver in the Lie algebra so(n).
```python
import torch
import torch.nn as nn
from einops import repeat

class PathSampler(nn.Module):
    def __init__(self, dim):
        super(PathSampler, self).__init__()
        self.register_buffer("skew_sym_basis", gen_basis(dim))
        self.alpha = nn.Parameter(torch.tensor(1./dim, dtype=torch.float32))

    def forward(self, Kt, delta):
        bs, nt = Kt.shape[0], Kt.shape[1]
        Et = repeat(self.skew_sym_basis, 'k a b -> bs nt k a b', bs=bs, nt=nt)
        noise = torch.randn(bs, nt, Et.size(2), 1, 1, device=Kt.device)
        Vt = (noise * Et * delta.sqrt() * self.alpha).sum(2)
        return Kt * delta + Vt
```

Listing 2: Generate basis of so(n).

```python
def gen_basis(n):
    r = int(n * (n - 1) / 2)
    B = torch.zeros(r, n, n)
    for nth, (i, j) in enumerate(torch.combinations(torch.arange(n))):
        B[nth, i, j] = 1.
    return B - B.transpose(1, 2)
```

Listing 3: Generate an SDE solution on S^{n-1}.

```python
from torch.distributions import Distribution
from torch import Tensor

def sample(t: Tensor, pzx: Distribution, Kt: Tensor, sampler: PathSampler):
    z0 = pzx.rsample().unsqueeze(-1)
    delta = torch.diff(t)[0]
    Ot = sampler(Kt, delta)
    Qt = torch.matrix_exp(Ot.contiguous())

    zi = [z0]
    for t_idx in range(0, len(t) - 1):
        Qtz = torch.matmul(Qt[:, t_idx, :, :], zi[-1])
        zi.append(Qtz)
    return torch.stack(zi, dim=1).squeeze()
```

⁸ https://github.com/nicola-decao/power_spherical
⁹ https://einops.rocks/

Listing 4: Example usage with dummy K, µ, κ.

```python
import torch
import torch.nn as nn
from power_spherical import PowerSpherical

n = 16                    # latent (sphere) dimension, cf. Appendix B
batch_size = 4
num_time_points = 100

device = 'cuda' if torch.cuda.is_available() else 'cpu'
t = torch.linspace(0, 1, num_time_points, device=device)
sampler = PathSampler(n).to(device)

mu = torch.randn(batch_size, n, device=device)      # batch of µ's
kappa = torch.rand(batch_size, 1, device=device)    # batch of κ's
pzx = PowerSpherical(
    torch.nn.functional.normalize(mu),
    kappa.squeeze())

# Kt just contains random skew-symmetric matrices in this example
Kt = torch.randn(batch_size, len(t), n, n, device=device)
Kt = Kt - Kt.transpose(2, 3)
zis = sample(t, pzx, Kt, sampler)
# zis is now a tensor of size (bs, num_time_points, n)
```

E Theory

In this part, we provide further details on the construction of SDEs on homogeneous spaces of quadratic matrix Lie groups G and derive the formula for the KL divergence between solutions of such SDEs that appeared in the main manuscript.

E.1 Background on stochastic differential equations (SDEs)

We first make some initial remarks on stochastic processes and on stochastic differential equations (SDEs) in $\mathbb{R}^n$ and introduce the concepts and notation used throughout this final chapter of the supplementary material. Given some probability space $(\Omega, \mathcal{F}, P)$, a stochastic process on [0, T] with state space $\mathbb{R}^n$ is a function $X : \Omega \times [0, T] \to \mathbb{R}^n$ such that for each point of time $t \in [0, T]$ the map $X_t = X(\cdot, t) : \Omega \to \mathbb{R}^n$, $\omega \mapsto X(\omega, t)$, is measurable. In other words, $\{X_t : t \in [0, T]\}$ is a collection of random variables $X_t$ indexed by t, cf. [66, Chapter I.1; 57, Chapter 2.1; 63, Chapter 1.1]. Conversely, each sample $\omega \in \Omega$ defines a function $X(\omega, \cdot) : [0, T] \to \mathbb{R}^n$, $t \mapsto X(\omega, t)$, called a realization or sample path. Thus, a stochastic process models randomness in the space of functions $[0, T] \to \mathbb{R}^n$, or in a "reasonable" subspace thereof. In more formal terms, the law or path distribution $P_X$ associated to a stochastic process X is defined as the push-forward of the measure P to this function space [66, Chapter I.3].
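Stated explicitly (a standard formulation added for completeness, not quoted from the manuscript): for a process with continuous sample paths, the path distribution is the push-forward of P under the map $\omega \mapsto X(\omega, \cdot)$, i.e.,

```latex
% Path distribution P_X as a push-forward measure on path space
P_X(A) \;=\; \big(X_{\#}P\big)(A) \;=\; P\big(\{\omega \in \Omega : X(\omega,\cdot) \in A\}\big),
\qquad A \subseteq C([0,T],\mathbb{R}^n)\ \text{measurable}.
```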
Arguably the most important stochastic process is the Wiener process W, which formalizes the physical concept of Brownian motion, i.e., the random motion of a small particle inside a fluid caused by thermal fluctuations; see [68, Chapter 3] for a heuristic approach to the topic. For stochastic processes X that are sufficiently well-behaved, one can define a stochastic integral $\int_0^t X_s\,dW_s$, called the Itô integral of X with respect to W, cf. [66, Chapter IV.2; 57, Chapter 3.1; 63, Chapter 3.2]. Notably, the result is again a stochastic process, so that one can implicitly define stochastic processes via an Itô integral equation of the form

$$X_t = X_0 + \int_0^t f(X_s, s)\,ds + \int_0^t \sigma(X_s, s)\,dW_s\ , \qquad (12)$$

often denoted as an Itô stochastic differential equation (SDE)

$$dX_t = f(X_t, t)\,dt + \sigma(X_t, t)\,dW_t,\quad X_0 = \xi\ . \qquad (13)$$

Here, $f : \mathbb{R}^n \times [0, T] \to \mathbb{R}^n$ is called the drift coefficient, $\sigma : \mathbb{R}^n \times [0, T] \to \mathbb{R}^{n \times m}$ is called the diffusion coefficient, and the initial value ξ is a random variable that is independent of the Wiener process W. Further information about different notions of solutions to an SDE, i.e., strong and weak solutions, as well as a comprehensive consideration of their existence and uniqueness, can be found in [66, Chapter 9; 57, Chapter 5; 63, Chapter 5]. Additionally, for a treatment of SDEs from a numerical perspective, we recommend [64].

In this work, we consider time series data as realizations of a stochastic process, i.e., as samples from a function-valued random variable as described in Section 3.1. To perform variational inference in such a setting, we utilize SDEs to construct function-valued distributions. Specifically, we consider non-autonomous linear Itô SDEs with multiplicative noise that evolve on a homogeneous space of a (quadratic) matrix Lie group. To ensure the existence and uniqueness of strong solutions, the drift and diffusion coefficients that appear in the considered SDEs satisfy the Lipschitz and growth conditions of [57, Theorem 5.2.1].

A remark on notation. As outlined in Section 3.4, we aim to learn the parameters ϕ of an approximate posterior. Thus, the SDEs we consider are not specified by an initial random variable $X_0 = \xi$, but by an initial distribution $P_{X_0}$ such that $X_0 \sim P_{X_0}$. Therefore, we utilize the notation

$$dX_t = f(X_t, t)\,dt + \sigma(X_t, t)\,dW_t\ ,\quad X_0 \sim P_{X_0}\ . \qquad (14)$$

A (strong) solution to such an SDE is then understood as a stochastic process X that (i) is adapted (see [66, Definition 4.2]), (ii) has continuous sample paths $t \mapsto X(\omega, t)$, (iii) almost surely satisfies the integral equation in Eq. (12), and (iv) admits a random variable $\xi \sim P_{X_0}$ that is independent of the Wiener process W such that $X_0 = \xi$ holds almost surely. In accordance with [57, Theorem 5.2.1], we demand that the initial random variable ξ has finite second moment; of course, distinct choices of $X_0 = \xi$ might define distinct strong solutions X.

E.2 SDEs in quadratic matrix Lie groups

Definition 1 (cf. [65]). Let $\mathbb{R}^{n \times n}$ be the vector space of real n × n matrices. We denote the general linear group, i.e., the group of invertible real n × n matrices equipped with the matrix product, as

$$\mathrm{GL}(n) = \{A \in \mathbb{R}^{n \times n} \mid \det A \neq 0\}\ . \qquad (15)$$

GL(n) is a Lie group with Lie algebra

$$\mathfrak{gl}(n) = \big(\mathbb{R}^{n \times n}, [\,\cdot\,,\,\cdot\,]\big)\ , \qquad (16)$$

i.e., $\mathbb{R}^{n \times n}$ together with the commutator bracket $(A, B) \mapsto [A, B] = AB - BA$. In the context of this work, we specifically consider quadratic matrix Lie groups.

Definition 2 (cf. [62]). Given some fixed matrix $P \in \mathbb{R}^{n \times n}$, the quadratic matrix Lie group G is defined as

$$\mathrm{G} = \mathrm{G}(P) = \{A \in \mathrm{GL}(n) : A^\top P A = P\}\ . \qquad (17)$$
Moreover, we consider non-autonomous linear Itô SDEs in quadratic matrix Lie groups $G$. The following lemma (Lemma 3) provides sufficient conditions which ensure that an SDE solution evolves in $G$. This will pave the way to introduce a class of SDEs on homogeneous spaces in Appendix E.3 via the Lie group action.

Lemma 3 (cf. [6, Chapter 4.1], [12]). Let $G$ be a stochastic process that solves the matrix Itô SDE
\[
dG_t = V_0(t) G_t \, dt + \sum_{i=1}^m V_i G_t \, dw_t^i, \quad G_0 = I_n, \tag{18}
\]
where $V_0 : [0, T] \to \mathbb{R}^{n \times n}$, $V_1, \ldots, V_m \in \mathbb{R}^{n \times n}$, and $w^1, \ldots, w^m$ are independent scalar-valued Wiener processes. The process $G$ stays in the quadratic matrix Lie group $G(P)$ defined by $P \in \mathbb{R}^{n \times n}$ if
(i) $V_0(t) = K(t) + \tfrac{1}{2} \sum_{i=1}^m V_i^2$, with $K : [0, T] \to \mathfrak{g}$, and
(ii) $V_1, \ldots, V_m \in \mathfrak{g}$.

Remark. Although the matrices $V_1, \ldots, V_m \in \mathfrak{g}$ can be chosen arbitrarily, it is natural to choose an orthogonal basis of the Lie algebra $\mathfrak{g}$. Then $m = \dim \mathfrak{g}$ and each direction in the vector space $\mathfrak{g}$ provides an independent contribution to the overall noise in Eq. (18).

Proof. The process $G$ stays in the quadratic matrix Lie group $G(P)$ if $G_t^\top P G_t$ is constant, i.e., if $d(G_t^\top P G_t) = 0$. The latter is to be understood in the Itô sense, i.e.,
\[
0 = d(G_t^\top P G_t) = dG_t^\top P G_t + G_t^\top P \, dG_t + dG_t^\top P \, dG_t
\]
\[
= G_t^\top \big( V_0^\top(t) P + P V_0(t) \big) G_t \, dt + \sum_{i=1}^m dw_t^i \, G_t^\top \big( V_i^\top P + P V_i \big) G_t + dt \sum_{i=1}^m G_t^\top V_i^\top P V_i G_t. \tag{19}
\]
Here, $dw_t^i \, dw_t^j = \delta_{ij} \, dt$, as $w^i$ and $w^j$ are independent scalar Wiener processes. Eq. (19) implies the following two conditions:
\[
V_i^\top P + P V_i = 0 \ \text{ for } i = 1, \ldots, m \quad \text{and} \quad V_0^\top(t) P + P V_0(t) = -\sum_{i=1}^m V_i^\top P V_i. \tag{20}
\]
As the following calculation shows, these conditions are satisfied if $V_1, \ldots, V_m \in \mathfrak{g}$ and $V_0(t) = K(t) + \tfrac{1}{2} \sum_{i=1}^m V_i^2$ where $K : [0, T] \to \mathfrak{g}$:
\[
V_0^\top(t) P + P V_0(t) = \underbrace{K^\top(t) P + P K(t)}_{=0} + \frac{1}{2} \sum_{i=1}^m \big( V_i^\top V_i^\top P + P V_i V_i \big) \tag{21}
\]
\[
= -\frac{1}{2} \sum_{i=1}^m \big( V_i^\top P V_i + V_i^\top P V_i \big) \tag{22}
\]
\[
= -\sum_{i=1}^m V_i^\top P V_i. \tag{23}
\]

Numerical solution. Algorithm 1 lists our choice of a numerical solver, which is the one-step geometric Euler-Maruyama scheme from [36] (see also [40, Section 3.1]). This scheme consists of three parts. First, we divide $[0, T]$ into subintervals $[t_j, t_{j+1}]$, starting with $t_0 = 0$ until $t_{j+1} = T$. Second, we solve Eq. (18) pathwise in $G$ over successive intervals $[t_j, t_{j+1}]$ by applying a McKean-Gangolli injection [12] in combination with a numerical solver. Third, the (approximate) path solution is eventually computed as $G_{j+1} = G_j \exp(\Omega_{j+1})$.

Algorithm 1 Geometric Euler-Maruyama Algorithm (g-EM) [36].
1: $t_0 \leftarrow 0$
2: $G_0 \leftarrow I_n$
3: repeat over successive subintervals $[t_j, t_{j+1}]$, $j \geq 0$
4:   $\delta_j \leftarrow t_{j+1} - t_j$
5:   $\delta w_j^i \sim \mathcal{N}(0, \delta_j)$, for $i = 1, \ldots, m$
6:   $\Omega_{j+1} \leftarrow \big( V_0(t_j) - \tfrac{1}{2} \sum_{i=1}^m V_i^2 \big) \delta_j + \sum_{i=1}^m \delta w_j^i V_i$   ▷ $\Omega_{j+1} \in \mathfrak{g}$
7:   $G_{j+1} \leftarrow G_j \exp(\Omega_{j+1})$   ▷ Matrix exp. ensures that $G_{j+1} \in G$
8: until $t_{j+1} = T$

Remark. We remark that in the original derivation of the algorithm in [36], the group acts from the right (as in Algorithm 1). However, the same derivation can be done for a left group action, which is what we use in our implementation and in the presentation within the main manuscript. Furthermore, under condition (i) from Lemma 3, we see that line 6 reduces to
\[
\Omega_{j+1} \leftarrow K(t_j) \delta_j + \sum_{i=1}^m \delta w_j^i V_i.
\]
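The following minimal sketch (ours, not the authors' reference implementation) runs Algorithm 1 for the case $P = I_n$, with the $V_i$ chosen as a scaled basis of $\mathfrak{so}(n)$ via gen_basis from Listing 2 and $V_0$ satisfying condition (i) of Lemma 3; the function name geometric_em and all concrete values are illustrative only.

import torch

def geometric_em(V0, Vs, t, n):
    """One-step geometric Euler-Maruyama (right-action variant, as in Algorithm 1).

    V0 : callable, V0(t) -> (n, n) drift matrix
    Vs : tensor of shape (m, n, n) holding the diffusion matrices V_1, ..., V_m
    t  : 1D tensor of time points t_0 < t_1 < ... < T
    """
    m = Vs.shape[0]
    V2sum = (Vs @ Vs).sum(dim=0)              # sum_i V_i^2 (Itô correction term)
    G = torch.eye(n)
    Gs = [G]
    for j in range(len(t) - 1):
        delta = t[j + 1] - t[j]
        dw = torch.randn(m) * delta.sqrt()    # δw_j^i ~ N(0, δ_j)
        Omega = (V0(t[j]) - 0.5 * V2sum) * delta + (dw[:, None, None] * Vs).sum(dim=0)
        G = G @ torch.matrix_exp(Omega)       # G_{j+1} = G_j exp(Ω_{j+1}) stays in G
        Gs.append(G)
    return torch.stack(Gs)

# Example: G = SO(3), with K constant and V_i a scaled basis of so(3).
n = 3
Vs = 0.3 * gen_basis(n)                        # gen_basis from Listing 2
K = torch.randn(n, n); K = K - K.T             # constant K(t) in so(3)
V0 = lambda s: K + 0.5 * (Vs @ Vs).sum(dim=0)  # condition (i) of Lemma 3
t = torch.linspace(0, 1, 50)
Gs = geometric_em(V0, Vs, t, n)
# Each G_j should be (numerically) orthogonal:
print(torch.allclose(Gs[-1].T @ Gs[-1], torch.eye(n), atol=1e-4))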
E.3 SDEs in homogeneous spaces of quadratic matrix Lie groups

In the following, we use our results of Appendix E.2 to define stochastic processes in a homogeneous space $M$ of a quadratic matrix Lie group $G$. The main idea is, given a starting point $Z_0 \in M$, to translate a path $G : [0, T] \to G$ into a path $Z = G \cdot Z_0$ in $M$ via the Lie group action. Notably, this construction can be done at the level of stochastic processes, i.e., a stochastic process $G : [0, T] \times \Omega \to G$ induces a stochastic process $Z = G \cdot Z_0 : [0, T] \times \Omega \to M$, called a one-point motion, see [31, Chapter 2]. In particular, if $Z_0$ follows some distribution $\mathbb{P}_{Z_0}$ in $M$ and $G$ solves an SDE of the form as in Eq. (18), then $Z$ solves $dZ_t = dG_t \cdot Z_0$, i.e.,
\[
dZ_t = V_0(t) Z_t \, dt + \sum_{i=1}^m V_i Z_t \, dw_t^i, \quad Z_0 \sim \mathbb{P}_{Z_0}. \tag{24}
\]
Notably, Eq. (3) guarantees that a solution to the SDE in Eq. (18) remains in the group $G$, and consequently a solution to Eq. (24) remains in $M$. As we only consider Lie group actions that are given by matrix-vector multiplication, we substitute the notation $G \cdot Z$ by the more concise $GZ$ from now on.

Remark. For subsequent results, it is convenient to represent Eq. (24) in the form
\[
dZ_t = V_0(t) Z_t \, dt + [V_1 Z_t, V_2 Z_t, \ldots, V_m Z_t] \, dW_t, \quad Z_0 \sim \mathbb{P}_{Z_0}, \tag{25}
\]
i.e., with diffusion coefficient $\sigma : (z, t) \mapsto [V_1 z, \ldots, V_m z] \in \mathbb{R}^{n \times m}$ and an $m$-dimensional Wiener process $W = [w^1, w^2, \ldots, w^m]^\top$. Notably, the matrix $\sigma(z, t) \sigma(z, t)^\top$ is given by
\[
\sigma(z, t) \sigma(z, t)^\top = \sum_{i=1}^m V_i z z^\top V_i^\top. \tag{26}
\]

Numerical solution. As $Z = GZ_0$, a numerical solution to Eq. (24) is easily obtained from the numerical approximation of $G$, i.e., the numerical solution to Eq. (18) via Algorithm 1. In fact, it suffices to replace line 2 by $Z_0 \sim \mathbb{P}_{Z_0}$ and line 7 by $Z_{j+1} \leftarrow \exp(\Omega_{j+1}) Z_j$. Notably, we do not need to compute the matrices $G_j$ but only the iterates $\Omega_j$.

E.4 SDEs on the unit n-sphere

Next, we consider the homogeneous space $(S^{n-1}, \mathrm{SO}(n))$, i.e., the unit $n$-sphere $S^{n-1}$ together with the action of the group $\mathrm{SO}(n)$ of rotations. Notably, the orthogonal group $\mathrm{O}(n)$ is the quadratic matrix Lie group $G(P)$ defined by $P = I_n$, i.e., $\mathrm{O}(n) = \{ A \in \mathbb{R}^{n \times n} \mid A^\top A = I_n \}$, and the special orthogonal group $\mathrm{SO}(n)$ is its connected component that contains the identity $I_n$. Thus, Lemma 3 ensures that a solution to Eq. (24) remains in $S^{n-1}$ if $K : [0, T] \to \mathfrak{so}(n)$ and $V_1, \ldots, V_m \in \mathfrak{so}(n)$. In the variational setting of the main manuscript, we consider stochastic processes that solve such an SDE for the choice of $\{V_i\}_{i=1}^m = \{\alpha (e_k e_l^\top - e_l e_k^\top)\}_{1 \leq k < l \leq n}$, where $e_1, \ldots, e_n$ denotes the standard basis of $\mathbb{R}^n$.
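To tie this choice back to Listings 1-3: $\alpha \cdot$gen_basis(n) produces exactly the matrices $\alpha (e_k e_l^\top - e_l e_k^\top)$ for $1 \leq k < l \leq n$ (in PathSampler, $\alpha$ is a learnable parameter). The sketch below (ours, with an arbitrary fixed $\alpha$) performs a single reduced g-EM step of the induced one-point motion and checks that the resulting state stays on $S^{n-1}$.

import torch

# The scaled so(n) basis {alpha * (e_k e_l^T - e_l e_k^T)}_{1<=k<l<=n} is exactly
# alpha * gen_basis(n) from Listing 2 (alpha is an arbitrary fixed value here).
n, alpha, delta = 3, 0.5, 0.01
Vs = alpha * gen_basis(n)                    # m = n(n-1)/2 skew-symmetric matrices
m = Vs.shape[0]

K = torch.randn(n, n); K = K - K.T           # K(t_j) in so(n)
dw = torch.randn(m) * delta ** 0.5           # δw_j^i ~ N(0, δ_j)
Omega = K * delta + (dw[:, None, None] * Vs).sum(dim=0)   # reduced line 6 of Alg. 1

z = torch.nn.functional.normalize(torch.randn(n), dim=0)  # Z_0 on S^{n-1}
z_next = torch.matrix_exp(Omega) @ z                       # Z_{j+1} = exp(Ω_{j+1}) Z_j
print(z_next.norm())                                        # ~1.0, i.e., still on the sphere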