# parallelintime_probabilistic_numerical_ode_solvers__5ca95b5e.pdf

Journal of Machine Learning Research 25 (2024) 1-27 Submitted 10/23; Revised 6/24; Published 7/24

Parallel-in-Time Probabilistic Numerical ODE Solvers

Nathanael Bosch nathanael.bosch@uni-tuebingen.de T ubingen AI Center, University of T ubingen

Adrien Corenﬂos adrien.corenflos.stats@gmail.com Department of Electrical Engineering and Automation, Aalto University

Fatemeh Yaghoobi fatemeh.yaghoobi@aalto.fi Department of Electrical Engineering and Automation, Aalto University

Filip Tronarp filip.tronarp@matstat.lu.se Center for Mathematical Sciences, Lund University

Philipp Hennig philipp.hennig@uni-tuebingen.de T ubingen AI Center, University of T ubingen

Simo S arkk a simo.sarkka@aalto.fi Department of Electrical Engineering and Automation, Aalto University

Editor: Ryan Adams

Probabilistic numerical solvers for ordinary diﬀerential equations (ODEs) treat the numerical simulation of dynamical systems as problems of Bayesian state estimation. Aside from producing posterior distributions over ODE solutions and thereby quantifying the numerical approximation error of the method itself, one less-often noted advantage of this formalism is the algorithmic ﬂexibility gained by formulating numerical simulation in the framework of Bayesian ﬁltering and smoothing. In this paper, we leverage this ﬂexibility and build on the time-parallel formulation of iterated extended Kalman smoothers to formulate a parallel-in-time probabilistic numerical ODE solver. Instead of simulating the dynamical system sequentially in time, as done by current probabilistic solvers, the proposed method processes all time steps in parallel and thereby reduces the computational complexity from linear to logarithmic in the number of time steps. We demonstrate the eﬀectiveness of our approach on a variety of ODEs and compare it to a range of both classic and probabilistic numerical ODE solvers. Keywords: probabilistic numerics, ordinary diﬀerential equations, numerical analysis, parallel-in-time methods, Bayesian ﬁltering and smoothing.

1. Introduction

Ordinary diﬀerential equations (ODEs) are used throughout the sciences to describe the evolution of dynamical systems over time. In machine learning, ODEs provide a continuous description of certain neural networks (Chen et al., 2018) and optimization procedures (Helmke et al., 2012; Su et al., 2016), and are used in generative modeling with normalizing ﬂows (Papamakarios et al., 2021) and diﬀusion models (Song et al., 2021), among others. Unfortunately, all but the simplest ODEs are too complex to be solved analytically. Therefore, numerical methods are required to obtain a solution. While a multitude of numerical solvers

2024 Nathanael Bosch, Adrien Corenﬂos, Fatemeh Yaghoobi, Filip Tronarp, Philipp Hennig and Simo S arkk a.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-1261.html.

Bosch, Corenflos, Yaghoobi, Tronarp, Hennig, S arkk a

has been developed over the last century (Hairer et al., 1993; Deuﬂhard and Bornemann, 2012; Butcher, 2016), most commonly-used methods do not provide a quantiﬁcation of their own inevitable numerical approximation error. Probabilistic numerics provides a framework for treating classic numerical problems as problems of probabilistic inference (Hennig et al., 2015; Oates and Sullivan, 2019; Hennig et al., 2022). In the context of ODEs, methods based on Gaussian process regression (Skilling, 1992; Hennig and Hauberg, 2014) and in particular Gauss Markov regression (Schober et al., 2019; Kersting et al., 2020; Tronarp et al., 2019) provide an eﬃcient and ﬂexible approach to compute posterior distributions over the solution of ODEs (Bosch et al., 2021; Kr amer and Hennig, 2024), and even partial diﬀerential equations (Kr amer et al., 2022; Bosch et al., 2023) and diﬀerential-algebraic equations (Bosch et al., 2022). These so-called ODE ﬁlters typically scale cubically in the ODE dimension (as do most implicit ODE solvers) and speciﬁc approximations enable linear scaling (shared by most explicit solvers) (Kr amer et al., 2022). But to date, their linear scaling with the number of time steps remains. For very large-scale simulations with very long time horizons, the sequential processing in time of most ODE solvers can become a bottleneck. This motivates the development of parallel-in-time methods: By leveraging the ever-increasing parallelization capabilities of modern computer hardware, parallel-in-time methods can achieve sub-linear scaling in the number of time steps (Gander, 2015). One well-known method of this kind is Parareal (Lions et al., 2001). It achieves temporal parallelism by combining an expensive, accurate solver with a cheap, coarse solver, in such a way that the ﬁne solver is only ever applied to individual time slices in a parallel manner, leading to a square-root scaling (in ideal conditions). But, due to its sequential coarse-grid solve, Parareal still has only limited concurrency (Gander and Vandewalle, 2007), and while it has recently been extended probabilistically by Pentland et al. (2021, 2022) to improve its performance and convergence, these methods do not provide probabilistic solutions to ODEs per se. In this paper, we leverage the time-parallel formulation of Gaussian ﬁlters and smoothers (S arkk a and Garc ıa-Fern andez, 2021; Yaghoobi et al., 2021, 2023) and develop a parallelin-time probabilistic numerical ODE solver. The paper is structured as follows. Section 2 formulates numerical ODE solutions as Bayesian state estimation problems and presents the established, sequential, ﬁltering-based probabilistic ODE solvers. Section 3 then presents our proposed parallel-in-time method; ﬁrst as exact inference for aﬃne ODEs, then as an iterative, approximate algorithm for general nonlinear ODEs. Section 4 then presents experiments on a variety of ODEs and compares the performance of our proposed method to that of existing (sequential) probabilistic and non-probabilistic ODE solvers. Finally, Section 5 concludes with a discussion of our results and an outlook on future work.

2. Numerical ODE Solutions as Bayesian State Estimation

Consider an initial value problem (IVP) given by a ﬁrst-order ODE of the form

y(t) = f(y(t), t), t [0, T], y(0) = y0, (1)

with vector ﬁeld f : Rd R Rd and initial value y0 Rd. To capture the numerical error that arises from temporal discretization, the quantity of interest in probabilistic numerics

Parallel-in-Time Probabilistic Numerical ODE Solvers

for ODEs is the probabilistic numerical ODE solution, deﬁned as

p y(t) y(0) = y0, { y(tn) = f(y(tn), tn)}N n=1 , (2)

for some prior p (y(t)) and with {tn}N n=1 [0, T] the chosen time-discretization. In the following, we pose the probabilistic numerical ODE solution as a problem of Bayesian state estimation, and we deﬁne the prior, likelihood, data, and approximate inference scheme. For a more detailed description of the transformation of an IVP into a Gauss Markov regression problem, refer to Tronarp et al. (2019).

2.1 Gauss Markov Process Prior

We model the solution y of the IVP with a ν-times integrated Wiener process prior (IWP(ν)). More precisely, let Y (t) = Y (0)(t), Y (1)(t), . . . , Y (ν)(t) be the solution of the following linear, time-invariant stochastic diﬀerential equation with Gaussian initial condition

d Y (i)(t) = Y (i+1)(t) dt, i = 0, . . . , ν 1, (3a)

d Y (ν)(t) = σ d W(t), (3b)

Y (0) N (µ0, Σ0) , (3c)

with initial mean and covariance µ0 Rd(ν+1), Σ0 Rd(ν+1) d(ν+1), diﬀusion coeﬃcient σ R+, and d-dimensional Wiener process W : R Rd. Then, Y (i) is chosen to model the i-th derivative of the IVP solution y. By construction, accessing the i-th derivative can be done by multiplying the state Y with a projection matrix Ei := Id ei, that is, Y (i)(t) = Ei Y (t). This continuous-time prior satisﬁes discrete transition densities (S arkk a and Solin, 2019)

Y (t + h) | Y (t) N Φ(h)Y (t), σ2Q(h) , (4)

with transition matrix and process noise covariance Φ(h), Q(h) Rd(ν+1) d(ν+1) and step h R+. For the IWP(ν) these can be computed in closed form (Kersting et al., 2020), as

Φ(h) = Id Φ(h), h Φ(h) i

ij = 1i=j hi j

(j i)!, (5a)

Q(h) = Id Q(h), h Q(h) i

ij = h2ν+1 i j

(2ν + 1 i j)(ν i)!(ν j)!. (5b)

Remark 1 (Alternative Gauss Markov priors). While ν-times integrated Wiener process priors have been the most common choice for ﬁltering-based probabilistic ODE solvers in recent years, the methodology is not limited to this choice. Alternatives include the ν-times integrated Ornstein Uhlenbeck process and the class of Mat ern processes, both of which have a similar continuous-time SDE representation as well as Gaussian transition densities in discrete time. Refer to Bosch et al. (2023); Tronarp et al. (2021); S arkk a and Solin (2019).

The initial distribution N(µ0, Σ0) is chosen such that it encodes the initial condition y(0) = y0. Furthermore, to improve the numerical stability and the quality of the posterior,

Bosch, Corenflos, Yaghoobi, Tronarp, Hennig, S arkk a

we initialize not only on the function value Y (0)(0) = y0, but also the higher order derivatives, that is, Y (i)(0) = diy

dti (0) for all i ν (Kr amer and Hennig, 2024). These terms can be eﬃciently computed via Taylor-mode automatic diﬀerentiation (Griewank, 2000; Bettencourt et al., 2019). As a result, we obtain an initial distribution with mean

µ0 = y0, dy

dt (0), . . . , dνy

dtν (0) T , (6)

and zero covariance Σ0 = 0, since the initial condition has to hold exactly.

2.2 Observation Model and Data

To relate the introduced Gauss Markov prior to the IVP problem from Equation (1), we deﬁne an observation model in terms of the information operator

Z[y](t) := y(t) f (y(t), t) . (7)

By construction, Z maps the true IVP solution y exactly to the zero function, that is, Z[y] 0. In terms of the continuous process Y , the information operator can be expressed as Z[Y ](t) = E1Y (t) f (E0Y (t), t) , (8)

where E0 and E1 are the projection matrices introduced in Section 2.1 which select the zeroth and ﬁrst derivative from the process Y , respectively. There again, if Y corresponds to the true IVP solution (and its true derivatives), then Z[Y ] 0. Conversely, inferring the true IVP solution requires conditioning the process Y (t) on Z(t) = 0 over the whole continuous interval t [0, T]. Since this is in general intractable, we instead condition Y (t) only on discrete observations Z(tn) = 0 on a grid T = {tn}N n=1. This leads to the Dirac likelihood model commonly used in ODE ﬁltering (Tronarp et al., 2019):

Z(tn) | Y (tn) δ Y (1)(tn) f Y (0)(tn), tn , (9)

with zero-valued data Z(tn) = 0 for all tn T.

Remark 2 (Information operators for other diﬀerential equation problems). Similar information operators can be deﬁned for other types of diﬀerential equations that are not exactly of the ﬁrst-order form as given in Equation (1), such as higher-order diﬀerential equations, Hamiltonian dynamics, or diﬀerential-algebraic equations (Bosch et al., 2022).

2.3 Discrete-Time Inference Problem

The combination of prior, likelihood, and data results in a Bayesian state estimation problem

Y (0) N (µ0, Σ0) , (10a)

Y (tn+1) | Y (tn) N (Φ(tn+1 tn)Y (tn), Q(tn+1 tn)) , (10b)

Z(tn) | Y (tn) δ Y (1)(tn) f Y (0)(tn), tn , (10c)

Parallel-in-Time Probabilistic Numerical ODE Solvers

with zero data Z(tn) = 0 for all tn T. The posterior distribution over Y (0)(t) then provides a probabilistic numerical ODE solution to the given IVP, as formulated in Equation (2). This is a standard nonlinear Gauss Markov regression problem, for which many approximate inference algorithms have previously been studied (S arkk a and Svensson, 2023). In the context of probabilistic ODE solvers, a popular approach for eﬃcient approximate inference is Gaussian ﬁltering and smoothing, where the solution is approximated with Gaussian distributions p Y (t) | {Z(tn) = 0}N n=1 N (µ(t), Σ(t)) . (11)

This is most commonly performed with extended Kalman ﬁltering (EKF) and smoothing (EKS) (Schober et al., 2019; Tronarp et al., 2019; Kersting et al., 2020); though other methods have been proposed, for example based on numerical quadrature (Kersting and Hennig, 2016) or particle ﬁltering (Tronarp et al., 2019). Iterated extended Kalman smoothing (e.g. Bell, 1994; S arkk a and Svensson, 2023) computes the maximum a posteriori estimate of the probabilistic numerical ODE solution (Tronarp et al., 2021). This will be the basis for the parallel-in-time ODE ﬁlter proposed in this work, explained in detail in Section 3.

2.4 Practical Considerations for Probabilistic Numerical ODE Solvers

While Bayesian state estimation methods such as the extended Kalman ﬁlter and smoother can, in principle, be directly applied to the formulated state estimation problem, there are a number of modiﬁcations and practical considerations that should be taken into account:

Square-root formulation: Gaussian ﬁlters often suﬀer from numerical stability issues when applied to the ODE inference problem deﬁned in Equation (10), in particular when using high orders and small steps, due to the ill-conditioning of the state transition covariance Q and numerical round-oﬀerror from ﬁnite precision arithmetic (Kr amer and Hennig, 2024). To alleviate these issues, probabilistic numerical ODE solvers are typically formulated in square-root form (Kr amer and Hennig, 2024); this is also the case for the proposed parallel-in-time method.

Preconditioned state transitions: Kr amer and Hennig (2024) suggest a coordinate change preconditioner to make the state transition matrices step-size independent and thereby improve the numerical stability of EKF-based probabilistic ODE solvers. This preconditioner is also used in this work.

Uncertainty calibration: The Gauss Markov prior as introduced in Section 2.1 has a free parameter, the diﬀusion σ, which directly inﬂuences the uncertainty estimates returned by the ODE ﬁlter, but not its mean estimates. In this paper, we compute a quasi-maximum likelihood estimate for the parameter σ post-hoc, as suggested by Tronarp et al. (2019).

Approximate linearization: Variants of the standard EKF/EKS-based inference have been proposed in which the linearization of the vector-ﬁeld is done only approximately. Approximating the Jacobian of the ODE vector ﬁeld with a zero matrix enables inference with a complexity which scales only linearly with the ODE dimension (Kr amer et al., 2022), while still providing polynomial convergence rates (Kersting

Bosch, Corenflos, Yaghoobi, Tronarp, Hennig, S arkk a

et al., 2020). A diagonal approximation of the Jacobian preserves the linear complexity, but improves the stability properties of the solver (Kr amer et al., 2022). In this work, we only consider the exact ﬁrst-order Taylor linearization.

Local error estimation and step-size adaptation: Rather than predeﬁning the time discretization grid, certain solvers employ an adaptive approach where the solver dynamically constructs the grid while controlling an internal estimate of the numerical error. Step-size adaptation based on local error estimates have been proposed for both classic (Hairer et al., 1993, Chapter II.4) and probabilistic ODE solvers (Schober et al., 2019; Bosch et al., 2021). On the other hand, global step-size selection is often employed in numerical boundary value problem (BVP) solvers (Ascher et al., 1995, Chapter 9), and has been extended to ﬁltering-based probabilistic BVP solvers (Kr amer and Hennig, 2021). For our purposes, we will focus on ﬁxed grids.

3. Parallel-in-Time Probabilistic Numerical ODE Solvers

This section develops the main method proposed in this paper: a parallel-in-time probabilistic numerical ODE solver.

3.1 Time-Parallel Exact Inference in Aﬃne Vector Fields

Let us ﬁrst consider the simple case: An initial value problem with aﬃne vector ﬁeld

y(t) = L(t)y(t) + d(t), t [0, T], y(0) = y0. (12)

The corresponding information model of the probabilistic solver is then also aﬃne, with

Z(t) | Y (t) δ (H(t)Y (t) d(t)) , (13a)

H(t) := E1 L(t)E0. (13b)

Let T = {tn}N n=1 [0, T] be a discrete time grid. To simplify the notation in the following, we will denote a function evaluated at time tn by a subscript n, that is Y (tn) =: Yn, except for the transition matrices where we will use Φn := Φ(tn+1 tn) and Qn := Q(tn+1 tn). Then, the Bayesian state estimation problem from Equation (10) reduces to inference of Y (t) in the model

Y0 N (µ0, Σ0) , (14a)

Yn+1 | Yn N (Φn Yn, Qn) , (14b)

Zn | Yn δ (Hn Yn dn) , (14c)

with zero data Zn = 0 for all n = 1, . . . , N. Since this is an aﬃne Gaussian state estimation problem, it can be solved exactly with Gaussian ﬁltering and smoothing (Kalman, 1960; Rauch et al., 1965; S arkk a and Svensson, 2023); see also (Tronarp et al., 2019, 2021) for explicit discussions of probabilistic numerical solvers for aﬃne ODEs. Recently, S arkk a and Garc ıa-Fern andez (2021) presented a time-parallel formulation of Bayesian ﬁltering and smoothing, as well as a concrete algorithm for exact linear Gaussian ﬁltering and smoothing which could be directly applied to the problem formulation in

Parallel-in-Time Probabilistic Numerical ODE Solvers

Equation (14). But as mentioned in Section 2.4, the resulting ODE solver might suﬀer from numerical instabilities. Therefore, we use the square-root formulation of the time-parallel linear Gaussian ﬁlter and smoother by Yaghoobi et al. (2023). In the following, we review the details of the algorithm.

3.1.1 Time-Parallel General Bayesian Filtering and Smoothing

First, we follow the presentation of S arkk a and Garc ıa-Fern andez (2021) and formulate Bayesian ﬁltering and smoothing as preﬁx sums. We deﬁne elements an = (fn, gn) with

fn(Yn | Yn 1) = p(Yn | Zn, Yn 1), (15a)

gn(Yn 1) = p(Zn | Yn 1), (15b)

where for n = 1 we have p(Y1 | Z1, Y0) = p(Y1 | Z1) and p(Z1 | Y0) = p(Z1), together with a binary ﬁltering operator f : (fi, gi) f (fj, gj) 7 (fij, gij) deﬁned by

fij(x | z) :=

R gj(y)fj(x | y)fi(y | z) dy R gj(y)fi(y | z) dy , (16a)

gij(z) := gi(z) Z gj(y)fi(y | z) dy. (16b)

Then, S arkk a and Garc ıa-Fern andez (2021, Theorem 3) show that f is associative and that

a1 f f an = p(Yn | Z1:n) p(Z1:n)

that is, the ﬁltering marginals and the marginal likelihood of the observations at step n are the results of a cumulative sum of the elements a1:n under f. Since the ﬁltering operator f is associative, this quantity can be computed in parallel with preﬁx-sum algorithms, such as the parallel scan algorithm by Blelloch (1989).

Remark 3 (On Preﬁx-Sums). Preﬁx sums, also known as cumulative sums or inclusive scans, play an important role in parallel computing. Their computation can be eﬃciently parallelized and, if enough parallel resources are available, their (span) computational cost can be reduced from linear to logarithmic in the number of elements. One such algorithm is the well-known parallel scan algorithm by Blelloch (1989) which, given N elements and N/2 processors, computes the preﬁx sum in 2 log2 N sequential steps with 2N 2 invocations of the binary operation. This algorithm is implemented in both tensorﬂow (Abadi et al., 2015) and JAX (Bradbury et al., 2018); the latter is used in this work.

The time-parallel smoothing step can be constructed similarly: We deﬁne elements bn = p(Yn | Z1:n, Yn+1), with b N = p(YN | Z1:N), and a binary smoothing operator bi s bj = bij, with

bij(x | z) = Z bi(x | y)bj(y | z) dy. (18)

Then, s is associative and the smoothing marginal at time step n is the result of a reverse cumulative sum of the elements bn:N under s (S arkk a and Garc ıa-Fern andez, 2021):

bn s s b N = p(Yn | Z1:N). (19)

Bosch, Corenflos, Yaghoobi, Tronarp, Hennig, S arkk a

Again, since the smoothing operator s is associative, this cumulative sum can be computed in parallel with a preﬁx-sum algorithm (Blelloch, 1989).

3.1.2 Time-Parallel Linear Gaussian Filtering in Square-Root Form

In the linear Gaussian case, the ﬁltering elements an = (fn, gn) can be parameterized by a set of parameters {An, bn, Cn, ηn, Jn} as follows:

fn(Yn | Yn 1) = p(Yn | Zn, Yn 1) = N (Yn; An Yn 1 + bn, Cn) , (20a)

gn(Yn 1) = p(Zn | Yn 1) NI (Yn 1; ηn, Jn) , (20b)

where NI denotes a Gaussian density parameterized in information form, that is, NI(x; η, J) = N(x; J 1η, J 1). The parameters {An, bn, Cn, ηn, Jn} can be computed explicitly from the given state-space model (S arkk a and Garc ıa-Fern andez, 2021, Lemma 7). But since probabilistic numerical ODE solvers require a numerically stable implementation of the underlying ﬁltering and smoothing algorithm (Kr amer and Hennig, 2024), we formulate the time-parallel linear Gaussian ﬁltering algorithm in square-root form, following Yaghoobi et al. (2023). To this end, let

M denote a left square-root of a positive semi-deﬁnite matrix M, that is,

M T = M; the matrix

M is sometimes also called a generalized Cholesky factor of M (S. Grewal and P. Andrews, 2014). To operate on square-root matrices, we also deﬁne the triangularization operator: Given a wide matrix M Rn m, m n, the triangularization operator tria(M) ﬁrst computes the QR decomposition of MT, that is, MT = QR, with wide orthonormal Q Rm n and square upper-triangular R Rn n, and then returns RT. This operator plays a central role in square-root ﬁltering algorithms as it enables the numerically stable addition of covariance matrices, provided square-roots are available. Given two positive semi-deﬁnite matrices A, B Rn n with square-roots

B, a square-root of the sum A + B can be computed as

A + B = tria

With these deﬁnitions in place, we brieﬂy review the time-parallel linear Gaussian ﬁltering algorithm in square-root form as provided by Yaghoobi et al. (2023) in the following.

Parameterization of the ﬁltering elements. Let m0 = µ0, P0 = Σ0, and mn = 0, Pn = 0 for all n 1, and deﬁne

m n = Φn 1mn 1, (22a) p

P n = tria Φn 1 Pn 1 Qn 1 . (22b)

Then, the square-root parameterization of the ﬁltering elements an is given by

An = (I Kn Hn)Φn 1, (23a)

bn = m n Kn Hnm n dn , (23b) p

Cn = Ψ22, (23c)

Sn 1dn, (23d) p

Jn = ΦT n 1HT n p

Sn T, (23e)

Parallel-in-Time Probabilistic Numerical ODE Solvers

where I is the identity matrix and Ψ22, Sn and Kn are deﬁned via

Ψ11 0 Ψ21 Ψ22

Sn = Ψ11, (24b)

Kn = Ψ21Ψ 1 11 . (24c)

For generality the formula includes an observation noise covariance Rn, but note that in the context of probabilistic ODE solvers we have a noiseless measurement model with Rn = 0.

Associative ﬁltering operator. Let ai, aj be two ﬁltering elements, parameterized in square-root form by ai = {Ai, bi, Ci, ηi, Ji} and aj = {Aj, bj, p

Jj}. Then, the associative ﬁltering operator f computes the ﬁltering element aij = ai f aj as

Aij = Aj Ai Aj p

CiΞ T 11 ΞT 21Ai, (25a)

bij = Aj I p

CiΞ T 11 ΞT 21 (bi + p

Ci Tηj) + bj, (25b) p

Cij = tria Aj CiΞ T 11 p

ηij = AT i I Ξ21Ξ 1 11 p

Jj Tbi + ηi, (25d) p

Jij = tria AT i Ξ22 Ji , (25e)

where Ξ11, Ξ21, Ξ22 are deﬁned via

Ξ11 0 Ξ21 Ξ22

See Yaghoobi et al. (2023) for the detailed derivation.

The ﬁltering marginals. By computing a cumulative sum of the elements a1:N with the binary operation f, we obtain the ﬁltering marginals at time n = 1, . . . , N as

p(Yn|Z1:n) = N Yn; mf n, P f n , with mf n := b1:n, q

This concludes the time-parallel linear Gaussian square-root ﬁlter.

3.1.3 Time-Parallel Linear Gaussian Smoothing in Square-Root Form

Similarly to the ﬁltering equations, the linear Gaussian smoothing can also be formulated in terms of smoothing elements bn and an associative operator s, and the smoothing marginals can also be computed with a parallel preﬁx-sum algorithm.

Parameterization of the smoothing elements. The smoothing elements bn can be described by a set of parameters {En, gn, Ln}, as

bn = p(Yn | Z1:n, Yn+1) = N Yn; En Yn+1 + gn, p

Ln T . (28)

Bosch, Corenflos, Yaghoobi, Tronarp, Hennig, S arkk a

The smoothing element parameters can be computed as

En = Π21Π 1 11 , (29a)

gn = mf n EnΦnmf n, (29b) p

Ln = Π22, (29c)

where I is the identity matrix and the matrices Π11, Π21, Π22 are deﬁned via

Π11 0 Π21 Π22

Associative smoothing operator. Given two smoothing elements bi, bj, parameterized in square-root form by bi = {Ei, gi, Li} and bj = {Ej, gj, p

Lj}, the associative smoothing operator s computes the smoothing element bij = bi s bj as

Eij = Ei Ej, (31a)

gij = Eigj + gi, (31b) p

Lij = tria Ei p

Lj Li . (31c)

The smoothing marginals. The smoothing marginals can then be retrieved from the reverse cumulative sum of the smoothing elements as

p(Yn | Z1:N) = N (Yn; ms n, P s n) , (32a)

ms n = gn:N, (32b) p

Ln:N. (32c)

Refer to Yaghoobi et al. (2023) for a thorough derivation. This concludes the time-parallel Rauch Tung Striebel smoother, which can can be used as a parallel-in-time probabilistic numerical ODE solver for aﬃne ODEs. The full algorithm is summarized in Algorithm 1.

Algorithm 1 Time-parallel Rauch Tung Striebel Smoother (Para RTS)

Input: Initial distribution (µ0, Σ0), linear transition models {(Φn, Qn)}N 1 n=0 , aﬃne observation models {(Hn, dn)}N n=1, data {Zn}N n=1. Output: Smoothing marginals p(Yn | Z1:N) = N(Yn; ms n, P s n)

1: function Para RTS((µ0, Σ0), {(Φn, Qn)}N n=1, {(Hn, dn)}N n=1, {Zn}N n=1)

2: Compute the ﬁltering elements: an = (An, bn, Cn, ηn, Jn) for all n = 1, . . . , N Eq. (23)

3: Run the time-parallel Kalman ﬁlter: An, bn, Cn, ηn, Jn N n=1 Associative Scan f, (an)N n=1

p (Yn | Z1:N) = N Yn; mf n, P f n N(Yn; bn, Cn) Filtering marginals

4: Compute the smoothing elements: bn = (En, gn, Ln) for all n = 0, . . . , N Eq. (28)

5: Run the time-parallel Rauch Tung Striebel smoother: En, gn, Ln N n=1 Reverse Associative Scan s, (bn)N n=1

Eq. (31) return N(Yn; gn, Ln) for all n = 1, . . . , N Smoothing marginals

Parallel-in-Time Probabilistic Numerical ODE Solvers

3.2 Time-Parallel Approximate Inference in Nonlinear Vector Fields

Let us now consider the general case: An IVP with nonlinear vector ﬁeld

y(t) = f(y(t), t), t [0, T], y(0) = y0. (33)

As established in Section 2, the corresponding state estimation problem is

Y0 N (µ0, Σ0) , (34a)

Yn+1 | Yn N (Φn Yn, Qn) , (34b)

Zn | Yn δ (E1Yn f (E0Yn, tn)) , (34c)

with temporal discretization T = {tn}N n=1 [0, T] and zero data Zn = 0 for all n = 1, . . . , N. In this section, we describe a parallel-in-time algorithm for solving this state estimation problem: the time-parallel iterated extended Kalman smoother.

3.2.1 Globally Linearizing the State-Space Model

To make inference tractable, we will linearize the whole state-space model along a reference trajectory. And since the observation model (speciﬁed in Equation (34c)) is the only nonlinear part of the state-space model, it is the only part that requires linearization. In this paper, we only consider linearization with a ﬁrst-order Taylor expansion, but other methods are possible; see Remarks 4 and 5. For any time-point tn T, we approximate the nonlinear observation model Equation (34c) with an aﬃne observation model by performing a ﬁrst-order Taylor series expansion around a linearization point ηn Rd(ν+1). We obtain the aﬃne model

Zn | Yn δ (Hn Yn dn) , (35)

with Hn and dn deﬁned as

Hn := E1 Fy(E0ηn, tn)E0, (36a)

dn := f(E0ηn, tn) Fy(E0ηn, tn)E0ηn, (36b)

where Fy denotes the Jacobian of f with respect to y. In iterated extended Kalman smoothing, this linearization is performed globally on all time steps simultaneously along a trajectory of linearization points {ηn}N n=1 Rd(ν+1). We obtain the following linearized inference problem:

Y0 N (µ0, Σ0) , (37a)

Yn+1 | Yn N (Φn Yn, Qn) , (37b)

Zn | Yn δ (Hn Yn dn) , (37c)

with zero data Zn = 0 for all n = 1, . . . , N. This is now a linear state-space model with linear Gaussian observations. It can therefore be solved exactly with the numerically stable, time-parallel Kalman ﬁlter and smoother presented in Section 3.1.

Bosch, Corenflos, Yaghoobi, Tronarp, Hennig, S arkk a

Remark 4 (Linearizing with approximate Jacobians (EK0 & Diagonal EK1)). To reduce the computational complexity with respect to the state dimension of the ODE, the vector ﬁeld can also be linearized with an approximate Jacobian. Established choices include Fy 0 and Fy diag( yf), which result in probabilistic ODE solvers known as the EK0 and Diagonal EK1, respectively. See Kr amer et al. (2022) for more details.

Remark 5 (Statistical linear regression). Statistical linear regression (SLR) is a more general framework for approximating conditional distributions with aﬃne Gaussian distributions, and many well-established ﬁlters can be understood as special cases of SLR. This includes notably the Taylor series expansion used in the EKF/EKS, but also sigma-point methods such as the unscented Kalman ﬁlter and smoother (Julier et al., 2000; Julier and Uhlmann, 2004; S arkk a, 2008), and more. For more information on SLR-based ﬁlters and smoothers refer to S arkk a and Svensson (2023, Chapter 9).

3.2.2 Iterated Extended Kalman Smoothing

The IEKS (Bell, 1994; S arkk a and Svensson, 2023) is an approximate Gaussian inference method for nonlinear state-space models, which iterates between linearizing the state-space model along the current best-guess trajectory and computing a new state trajectory estimate by solving the linearized model exactly. It can equivalently also be seen as an eﬃcient implementation of the Gauss Newton method, applied to maximizing the posterior density of the state trajectory (Bell, 1994). This also implies that the IEKS computes not just some Gaussian estimate, but the maximum a posteriori (MAP) estimate of the state trajectory. In the context of probabilistic numerical ODE solvers, the IEKS has been previously explored by Tronarp et al. (2021), and the resulting MAP estimate has been shown to satisfy polynomial convergence rates to the true ODE solution. Here, we formulate an IEKS-based probabilistic ODE solver in a parallel-in-time manner, by exploiting the time-parallel formulation of the Kalman ﬁlter and smoother from Section 3.1. The IEKS is an iterative algorithm, which starts with an initial guess of the state trajectory and then iterates between the following two steps:

1. Linearization step: Linearize the state-space model along the current best-guess trajectory. This can be done independently for each time step and is therefore fully parallelizable.

2. Linear smoothing step: Solve the resulting linear state-space model exactly with the time-parallel Kalman ﬁlter and smoother from Section 3.1.

The algorithm terminates when a stopping criterion is met, for example when the change in the MAP estimate between two iterations is suﬃciently small. A pseudo-code summary of the method is provided in Algorithm 2. As with the sequential ﬁltering-based probabilistic ODE solvers as presented in Section 2, the mean and covariance of the initial distribution Y (0) N (µ0, Σ0) are chosen such that µ0 corresponds to the exact solution of the ODE and its derivatives and Σ0 is set to zero; see also Kr amer and Hennig (2024). The initial state trajectory estimate {ηn}N n=0 is chosen to be constant, that is, ηn = µ0 for all n = 0, . . . , N. Note that since only E0ηn is required to perform the linearization, it could equivalently be set to ηn = [y0, 0, . . . , 0] for all n.

Parallel-in-Time Probabilistic Numerical ODE Solvers

Algorithm 2 Parallel-in-Time Probabilistic Numerical ODE Solver (Para IEKS)

Input: ODE-IVP (f, y0), prior transition model {(Φn, Qn)}N n=1, time grid {tn}N n=0 [0, T]. Output: Probabilistic numerical ODE solution p y(tn) y(0) = y0, { y(tm) = f(y(tm), tm)}N m=1 N (µn, Σn) for n = 1, . . . , N.

1: function Para IEKS((f, y0), {(Φn, Qn)}N n=1, {tn}N n=0)

2: µ0, Σ0 Compute Exact Initial State(f, y0), 0 With automatic diﬀerentiation

3: ηn µ0 for all n = 0, . . . , N Constant initial guess of state trajectory

4: while stopping criterion not met do

5: Linearize the state-space model as in Sec. 3.2.1 (fully parallelizable):

6: for n = 1, . . . , N do

7: Hn, dn Linearize Observation Model(f, ηn, tn)

8: Run the time-parallel RTS in the linearized model as in Sec. 3.1:

9: {µn, Σn}N n=1 Para RTS (µ0, Σ0), {(Φn, Qn)}N n=1, {(Hn, dn)}N n=1, {0}N n=1

10: Choose the mean trajectory as the next linearization trajectory:

11: ηn µn for all n = 1, . . . , N

12: return y(tn) N E0µn, E0Σn ET 0 for n = 1, . . . , N.

Remark 6 (Other choices for the initial state trajectory). In the sequential IEKS the initial state trajectory is typically computed with a standard sequential extended Rauch Tung Striebel smoother (Tronarp et al., 2021). But in the context of the time-parallel IEKS, we can not aﬀord to apply a sequential method ﬁrst as this would break the parallel-in-time nature of the algorithm and increase the computational complexity from logistic back to linear in the number of time points. One possible alternative choice is compute an initial state trajectory with a sequential method on a coarse time grid with O(log(N)) time points, similar to classic parallel-in-time methods such as Parareal (Lions et al., 2001). But, this introduces additional hyperparameters and it is only beneﬁcial if this trajectory is indeed better than a constant guess, which often requires a certain minimum number of time points. We therefore choose the constant initial guess for the state trajectory in this paper, and leave the exploration of more sophisticated initialization strategies for future work.

Finally, the stopping criterion should be chosen such that the algorithm terminates when the MAP estimate of the state trajectory has converged. In our experiments, we chose a combination of two criteria: (i) the change in the state trajectory estimate between two iterations is suﬃciently small, or (ii) the change in the objective value between two iterations is suﬃciently small, where the objective value is deﬁned as the negative log-density of the state trajectory under the prior:

V(η0:N) = 1

n=1 ηn Φnηn 1 2 Q 1 n . (38)

In our experiments, we use a relative tolerance of 10 13 for the ﬁrst criterion and absolute and relative tolerances of 10 9 and 10 6 for the second criterion, respectively. Note that this procedure is compatible with the uncertainty calibration mentioned in Section 2.4 as the mean estimator is agnostic to the value of σ (and therefore Q).

Bosch, Corenflos, Yaghoobi, Tronarp, Hennig, S arkk a

Remark 7 (Convergence of the time-parallel IEKS). The IEKS is equivalent to the Gauss Newton method for computing the MAP estimate of the state trajectory (Bell, 1994), and the MAP estimate has been shown to satisfy polynomial convergence rates to the true ODE solution, (Tronarp et al., 2021). And under mild conditions on the Jacobian of the vector ﬁeld, the MAP is a local optimum and the Gauss Newton method is locally convergent (Tronarp et al., 2021; Knoth, 1989). While therefore the initial guess of the state trajectory must be suﬃciently close to the true solution for the method to be guaranteed to converge, in our experiments we found that the constant initialization works well in practice; see Section 4 below. In addition, the stability of the method could be further improved by using a more sophisticated optimizer, such as the Levenberg Marquardt method or Gauss Newton with line search (S arkk a and Svensson, 2020), or the alternating method of multipliers (ADMM; Gao et al., 2019). We leave the exploration of these methods for future work.

3.3 Computational Complexity

Let Cs KS-step be the summed costs of a predict, update, and smoothing step in the sequential Kalman smoother, and let Cp KS-step be the corresponding cost in the time-parallel formulation (which diﬀers by a constant factor, i.e. Cp KS-step Cs KS-step). Let Clinearize be the cost of linearizing the vector ﬁeld at a single time point, which requires both evaluating the vector ﬁeld and computing its Jacobian. Then, on a grid with N time points, the cost of a standard sequential extended Kalman smoother is linear in N, with

Cs EKS = N Clinearize + N Cs KS-step. (39)

The computational cost of the time-parallel IEKS diﬀers in two ways: (i) the preﬁx-sum formulation of the Kalman smoother enables a time-parallel inference with logarithmic complexity, and (ii) the linearization is not done locally in a sequential manner but can be performed globally, fully in parallel. In the following we assume that we have at least N processors / threads / cores available (simply referred to as cores hereafter), which allow us to take perfect advantage of the temporal parallelization, and we disregard the eﬀect of the number of cores on all other parts of the algorithm (such as the inﬂuence of the number of cores on the computation time of matrix multiplications). The computational cost of the time-parallel IEKS run for k iterations is then

Cp IEKS = k Clinearize + 2 log2(N) Cp KS-step . (40)

Thus, the runtime of the time-parallel IEKS is logarithmic in the number of time points, provided a suﬃcient number of cores are available. To make the comparison to the sequential EKS more explicit, we consider the speedup of the time-parallel IEKS over the sequential EKS, deﬁned as S = Cs EKS/Cp IEKS. By simply inserting the computational costs from above and re-arranging the terms, it can be shown that the speedup is bounded by

k Clinearize + Cs KS-step Clinearize , N 2k log2(N) Clinearize + Cs KS-step Cp KS-step

This term highlights two diﬀerent regimes: If the cost of linearizing the vector ﬁeld is small compared to the cost of the Kalman smoother, then the speedup is approximately bounded

Parallel-in-Time Probabilistic Numerical ODE Solvers

by S N 2k log2N (since the sequential and parallel Kalman smoother steps have similar cost). On the other hand, if the cost of linearizing the vector ﬁeld is large compared to the cost of the Kalman smoother, then the beneﬁt of parallelizing the linearization step dominates and the speedup is bounded by S N

4. Experiments

This section investigates the utility and performance of the proposed parallel-in-time ODE ﬁlter on a range of experiments. It is structured as follows: First, Section 4.1 compares the performance of the parallel-in-time probabilistic numerical ODE solver to its sequential counterpart on multiple test problems. Section 4.2 then investigates the parallel scaling of the proposed method and evaluates it on a range of problem sizes and for diﬀerent GPUs. Finally, Section 4.3 benchmarks the proposed method against other well-established ODE solvers, including both classic and probabilistic numerical methods. In the following, we refer to all ﬁltering-based probabilistic numerical ODE solvers by their underlying ﬁltering and smoothing algorithms: the proposed parallel-in-time method is referred to as Para IEKS, the corresponding sequential method is simply IEKS, and the established sequential EKS-based method is referred to as EKS. And, whenever relevant, we write the smoothness of the chosen IWP prior into the method name, for example as Para IEKS(ν =1) or EKS(ν =2).

Implementation All experiments are implemented in the Python programming language with the JAX software framework (Bradbury et al., 2018). Reference solutions are computed with Sci Py (Virtanen et al., 2020) and Diﬀrax (Kidger, 2021). Unless speciﬁed otherwise, experiments are run on an NVIDIA V100 GPU. Code for the implementation and experiments is publicly available on Git Hub.1

4.1 The Parallel-in-time Solver Compared to its Sequential Version

We ﬁrst compare the proposed parallel-in-time ODE solver Para IEKS to its sequential counterpart: a probabilistic solver based on the sequential implementation IEKS. We evaluate the solvers on three test problems: The logistic ordinary diﬀerential equation

y(t) = y(t) (1 y(t)) , t [0, 10], y(0) = 0.01, (42)

an initial value problem based on the rigid body dynamics (Hairer et al., 1993)

2y2(t)y3(t) 1.25y1(t)y3(t) 0.5y1(t)y2(t)

, t [0, 20], y(0) =

and the Van der Pol oscillator (Van der Pol, 1920)

y(t) = y2(t) µ 1 y1(t)2 y2(t) y1(t) , t [0, 6.3], y(0) = 2 0

here in a non-stiﬀversion with parameter µ = 1.

1. https://github.com/nathanaelbosch/parallel-in-time-ode-filters

Bosch, Corenflos, Yaghoobi, Tronarp, Hennig, S arkk a

a. Logistic

b. Rigid Body

c. Van der Pol

Figure 1: Trajectories, errors, and error estimates computed by the parallel-in-time solver. Top row: ODE solution trajectories. Visually, all three test problems seem to be solved accurately. Bottom row: Numerical errors (lines) and error estimates (shaded area). Ideally, for good calibration, the error should be of similar magnitude than the error estimate. The posterior appears underconﬁdent on the logistic equation, and well-calibrated on the rigid body and Van der Pol problems.

We ﬁrst solve the three problems with Para IEKS on grids of sizes 30, 150, and 100, respectively for the logistic, rigid body, and Van der Pol problem, with a two-times integrated Wiener process prior. Reference solutions are computed with diﬀrax s Kvaerno5 solver using adaptive steps and very low tolerances τ{abs,rel} = 10 12 (Kidger, 2021; Kværnø, 2004). Figure 1 shows the resulting solution trajectories, together with numerical errors and error estimates. For these grid sizes, Para IEKS computes accurate solutions on all three problems. Regarding calibration, the posterior appears underconﬁdent for the logistic equation, as it overestimates the numerical error by more than one order of magnitude, but is reasonably conﬁdent on the on the rigid body and Van der Pol problems where the error estimate is of similar magnitude as the numerical error. Next, we investigate the performance of Para IEKS and compare it to its sequential implementation IEKS. We solve the three problems with both methods on a range of grid sizes, with both a oneand two-times integrated Wiener process prior. Reference solutions are computed with diﬀrax s Kvaerno5 solver using adaptive steps and very low tolerances (τabs = 10 16, τrel = 10 13). Figure 2 shows the achieved root-mean-square errors (RMSE) for diﬀerent grid sizes in a work-precision diagram. As expected, Para IEKS and IEKS always achieve the same error for each problem and grid size, as both versions use the same initialization, compute the same quantities, and only diﬀer in their ﬁlter and smoother implementation. However, the methods diﬀer signiﬁcantly in actual runtime, as shown in Figure 3. In our experiments on an NVIDIA V100 GPU, Para IEKS is always strictly faster than IEKS across all problems, grid sizes, and priors, and we observe speed-ups of multiple orders of magnitude. Thus, when working with a GPU, Para IEKS appears to be strictly superior to the sequential IEKS.

Parallel-in-Time Probabilistic Numerical ODE Solvers

10 9 10 6 10 3 100

a. Logistic

10 9 10 6 10 3 100

b. Rigid Body

10 9 10 7 10 5 10 3

c. Van der Pol

IEKS(ν=1) IEKS(ν=2) Para IEKS(ν=1) Para IEKS(ν=2)

Figure 2: The sequential and parallel IEKS compute numerically identical solutions. For all three problems and all considered grid sizes, the sequential IEKS and the parallel Para IEKS achieve (numerically) identical errors. This is expected, as both versions compute the same quantities and only diﬀer in their implementation.

Runtime [s]

a. Logistic b. Rigid Body c. Van der Pol

10 9 10 6 10 3 100

Parallel speed-up

10 9 10 6 10 3 100

10 9 10 7 10 5 10 3

IEKS(ν=1) IEKS(ν=2) Para IEKS(ν=1) Para IEKS(ν=2)

Figure 3: Work-precision diagrams for the sequential and parallel IEKS-based ODE solver. Top row: Runtime in seconds per error (lower-left is better). Bottom row: Speedup of the parallel Para IEKS over the sequential IEKS (higher is better). Across all problems, grid sizes, and priors, Para IEKS outperforms the sequential IEKS.

Bosch, Corenflos, Yaghoobi, Tronarp, Hennig, S arkk a

101 102 103 104 105

Number of gridpoints

Runtime [s]

a. ODE solver runtime benchmark

Dopri5 Kvaerno5 EKS Para IEKS N log(N) # CUDA cores (RTX 4090)

Number of CUDA cores

RTX 2080 Ti

GTX 1080 Ti

b. GPU comparison (N=10240)

Figure 4: The time-parallel Para IEKS shows logarithmic scaling in the grid size and beneﬁts from GPU improvements. In comparison, the sequential IEKS and the classic Dopri5 and Kvaerno5 solvers show the expected linear runtime complexity (left). The sequential methods also do not show relevant changes in runtime for GPUs with more CUDA cores, whereas Para IEKS s runtime improves (right).

4.2 Runtimes and Parallel Scaling of the Parallel-in-Time Solver

Next we evaluate how the runtime of the proposed method scales with respect to increasing problem sizes, as well as increased numbers of available CUDA cores. We compare Para IEKS to the established sequential EKS, as well as to two classic ODE solvers: the explicit Runge Kutta method Dopri5 (Dormand and Prince, 1980; Shampine, 1986), and the implicit Runge Kutta method Kvaerno5 (Kværnø, 2004); both provided by the Diﬀrax python package (Kidger, 2021). We apply all methods to the logistic ODE on a range of grid sizes, resulting from time discretizations with step sizes h = 20, 2 1, . . . , 2 14, as well as for multiple GPUs with varying numbers of CUDA cores. Figure 4 shows the results. First, we observe the expected logarithmic scaling of Para IEKS with respect to the grid size, for grids of size up to around 5 103 (Figure 4a). For larger grid sizes the runtime of Para IEKS starts to grow linearly. This behavior is expected: The NVIDIA RTX 4090 used in this experiment has 16384 CUDA cores, so for larger grids the ﬁlter and smoother pass can not be fully parallelized anymore and additional grid points need to be processed sequentially. Note also that for very large grids, the 24GB memory of the GPU can become a limiting factor. Nevertheless, for large grid sizes N > 103, Para IEKS achieves the lowest runtimes out of all the tested solvers, with up to an order of magnitude faster ODE solutions.

Parallel-in-Time Probabilistic Numerical ODE Solvers

Runtime [s]

a. Logistic b. Rigid Body c. Van der Pol

10 9 10 6 10 3 100

10 9 10 6 10 3 100

10 8 10 5 10 2

Implicit Euler Kvaerno3 Kvaerno5 EKS(ν=1) EKS(ν=2) IEKS(ν=1) IEKS(ν=2) Para IEKS(ν=1) Para IEKS(ν=2)

Figure 5: Benchmarking Para IEKS against other commonly used numerical ODE solvers. Top row: Work-precision diagrams showing runtimes per error for a range of diﬀerent ODE solvers (lower-left is better). Bottom row: Errors per speciﬁed grid size (lower-left is better). Per grid size, the closely related EKS, IEKS, and Para IEKS solvers often coincide; Kvaerno5 achieves the lowest error per step as it has the highest order. In terms of runtime, Para IEKS outperforms all other methods on medium-to-high accuracy settings due to its logarithmic time complexity.

Figure 4b. shows runtimes for diﬀerent GPUs with varying numbers of CUDA cores for a grid of size N = 10240. We observe that the sequential IEKS, Dopri5 and Kvaerno5 solvers do not show a beneﬁt from the improved GPU hardware, which is expected as the method does not leverage parallelization. On the other hand, the runtime of Para IEKS decreases as the number of CUDA cores increases, and we observe speed-ups of up to an order of magnitude by using a newer GPU with a larger number of CUDA cores.

4.3 Benchmarking the Parallel-in-time Probabilistic Numerical ODE Solver

Finally, we compare Para IEKS to a range of well-established implicit ODE solvers, including both classic and probabilistic numerical methods: we compare against the implicit Euler method (Implicit Euler) and the Kvaerno3 and Kvaerno5 solvers (Kværnø, 2004) provided by Diﬀrax (Kidger, 2021), as well as the sequential EKS with local linearization, which is one of the currently most popular probabilistic numerical ODE solvers. We evaluate all solvers on all three test problems, namely the logistic ODE, the rigid body problem, and the Van der Pol oscillator, as introduced in Section 4.1. Reference solutions are computed with diﬀrax s Kvaerno5 solver with adaptive steps and a very low error tolerance setting (τabs = 10 16, τrel = 10 13).

Bosch, Corenflos, Yaghoobi, Tronarp, Hennig, S arkk a

Runtime [s]

a. Logistic b. Rigid Body c. Van der Pol

Number of grid points

Number of iterations

102 103 104 105

Number of grid points

102 103 104 105

Number of grid points

Implicit Euler Kvaerno3 Kvaerno5 EKS(ν=1) EKS(ν=2) IEKS(ν=1) IEKS(ν=2) Para IEKS(ν=1) Para IEKS(ν=2)

Implicit Euler Kvaerno3 Kvaerno5 EKS(ν=1) EKS(ν=2) IEKS(ν=1) IEKS(ν=2) Para IEKS(ν=1) Para IEKS(ν=2)

Implicit Euler Kvaerno3 Kvaerno5 EKS(ν=1) EKS(ν=2) IEKS(ν=1) IEKS(ν=2) Para IEKS(ν=1) Para IEKS(ν=2)

Figure 6: Runtimes of the ODE solvers for each grid size, and number of IEKS iterations. While all sequential solvers demonstrate linear scaling with the number of grid points, Para IEKS shows sub-linear scaling up to a certain grid size (top). The number of IEKS iterations until convergence can vary with the grid size and the problem, but it seems that in many cases around 10 iterations suﬃce (bottom). The sequential methods solve the ODE in one sweep.

Figure 5 shows the results as work-precision diagrams. For small grid sizes (low accuracy), the logarithmic time complexity of Para IEKS seems to not be very relevant and it is outperformed by the non-iterated EKS. In the particular case of the logistic ODE, it further seems that on coarse grids the MAP estimate computed by Para IEKS (and by the sequential IEKS) diﬀers signiﬁcantly from the actual ODE solution, and thus the error of Para IEKS on coarse grids is high (lower left ﬁgure). However, for larger grid sizes (medium-to-high accuracy), Para IEKS outperforms both its sequential, non-iterated counterpart EKS, as well as the classic implicit methods. In particular, Para IEKS with IWP(2) prior often shows runtimes lower than those of the classic Kvaerno5 method for comparable errors, even though it has a lower order of convergence and it requires multiple iterations per solve. Figure 6 shows the runtimes per grid size of all methods, as well as the number of iterations performed by the IEKS. We again observe the initial logarithmic scaling of the runtime with increasing problem size, up to a threshold where the runtime starts scaling linearly, as previously shown in Section 4.2. Overall, the logarithmic time complexity of the proposed Para IEKS appears to be very beneﬁcial for high accuracy settings on GPUs and makes Para IEKS a very competitive ODE solver in this comparison.

Parallel-in-Time Probabilistic Numerical ODE Solvers

5. Conclusion

In this work, we have developed a parallel-in-time probabilistic numerical ODE solver. The method builds on iterated extended Kalman smoothing to compute the maximum a posteriori estimate of the probabilistic ODE solution, and by using the time-parallel formulation of the IEKS it is able to eﬃciently leverage modern parallel computer hardware such as GPUs to parallelize its computations. Given enough processors or cores, the proposed algorithm shares the logarithmic cost per time step of the parallel IEKS and the underlying parallel preﬁx-sum algorithm, as opposed to the linear time complexity of standard, sequentially-operating ODE solvers. We evaluated the performance of the proposed method in a number of experiments, and have seen that the proposed parallel-in-time solver can provide speed-ups of multiple orders of magnitude over the sequential IEKS-based solver. We also compared the proposed method to a range of well-established, both probabilistic and classical ODE solvers, and we have shown that the proposed parallel-in-time method is competitive with respect to the state-of-the-art in both accuracy and runtime. This work opens up a number of interesting avenues for future research in the intersection of probabilistic numerics and parallel-in-time methods. Potential opportunities for improvement include the investigation of other optimization algorithms, such as Levenberg Marquart or ADMM, or the usage of line search, all of which have been previously proposed for the sequential IEKS (S arkk a and Svensson, 2020; Gao et al., 2019). Furthermore, combining the solver with adaptive grid reﬁnement approaches could also signiﬁcantly improve its performance in practice. A diﬀerent avenue would be to extend the proposed method to other related diﬀerential equation problems for which sequentially-operating probabilistic numerical methods already exist, such as higher-order ODEs, diﬀerential-algebraic equations, or boundary value problems (Bosch et al., 2022; Kr amer and Hennig, 2021), or to apply the method to other types of probabilistic inference problems, such as parameter or latent-force inference in ODEs (Schmidt et al., 2021; Tronarp et al., 2022; Beck et al., 2024). Finally, the improved utilization of GPUs by our parallel-in-time method could be particularly beneﬁcial to applications in the ﬁeld of machine learning, where GPUs are often required to accelerate the computations of deep neural networks. In summary, the proposed parallel-in-time probabilistic numerical ODE solver not only advances the eﬃciency of probabilistic numerical ODE solvers, but also paves the way for a range of future research on parallel-in-time probabilistic numerical methods and their application across various scientiﬁc domains.

Acknowledgments

The authors gratefully acknowledge co-funding by the European Union (ERC, ANUBIS, 101123955. Views and opinions expressed are however those of the author(s) only and do not necessarily reﬂect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them). Philipp Hennig is a member of the Machine Learning Cluster of Excellence, funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany s Excellence Strategy EXC number 2064/1 Project number 390727645; he also gratefully acknowledges the German Federal Ministry of Education and Research (BMBF) through the T ubingen AI Center (FKZ: 01IS18039A); and funds from the Ministry of Science, Research

Bosch, Corenflos, Yaghoobi, Tronarp, Hennig, S arkk a

and Arts of the State of Baden-W urttemberg. The authors would like to thank Research Council of Finland for funding. Filip Tronarp was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Nathanael Bosch. The authors are grateful to Nicholas Kr amer for many valuable discussion and to Jonathan Schmidt for feedback on the manuscript.

Individual Contributions

The original idea for this article came independently from SS and from discussions between FT and NB. The joint project was initiated and coordinated by SS and PH. The methodology was developed by NB in collaboration with AC, FT, PH, and SS. The implementation is primarily due to NB, with help from AC. The experimental evaluation was done by NB with support from FT and PH. The ﬁrst version of the article was written by NB, after which all authors reviewed the manuscript.

M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Man e, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Vi egas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. Tensor Flow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorﬂow.org.

U. M. Ascher, R. M. M. Mattheij, and R. D. Russell. Numerical Solution of Boundary Value Problems for Ordinary Diﬀerential Equations. Society for Industrial and Applied Mathematics, 1995. doi: 10.1137/1.9781611971231.

J. Beck, N. Bosch, M. Deistler, K. L. Kadhim, J. H. Macke, P. Hennig, and P. Berens. Diﬀusion tempering improves parameter estimation with probabilistic integrators for ordinary diﬀerential equations. ar Xiv:2402.12231, 2024.

B. M. Bell. The iterated Kalman smoother as a Gauss Newton method. SIAM Journal on Optimization, 4(3):626 636, 1994.

J. Bettencourt, M. J. Johnson, and D. Duvenaud. Taylor-mode automatic diﬀerentiation for higher-order derivatives in JAX. In Program Transformations for ML Workshop at Neur IPS 2019, 2019.

G. Blelloch. Scans as primitive parallel operations. IEEE Transactions on Computers, 38 (11):1526 1538, 1989. doi: 10.1109/12.42122.

N. Bosch, P. Hennig, and F. Tronarp. Calibrated adaptive probabilistic ODE solvers. In A. Banerjee and K. Fukumizu, editors, Proceedings of The 24th International Conference

Parallel-in-Time Probabilistic Numerical ODE Solvers

on Artiﬁcial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 3466 3474. PMLR, 2021.

N. Bosch, F. Tronarp, and P. Hennig. Pick-and-mix information operators for probabilistic ODE solvers. In G. Camps-Valls, F. J. R. Ruiz, and I. Valera, editors, Proceedings of The 25th International Conference on Artiﬁcial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 10015 10027. PMLR, 2022.

N. Bosch, P. Hennig, and F. Tronarp. Probabilistic exponential integrators. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. Vander Plas, S. Wanderman-Milne, and Q. Zhang. JAX: composable transformations of Python+Num Py programs, 2018.

J. Butcher. Numerical Methods for Ordinary Diﬀerential Equations. Wiley, 2016. ISBN 9781119121503.

R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary diﬀerential equations. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

P. Deuﬂhard and F. Bornemann. Scientiﬁc computing with ordinary diﬀerential equations, volume 42. Springer Science & Business Media, 2012.

J. R. Dormand and P. J. Prince. A family of embedded Runge Kutta formulae. Journal of Computational and Applied Mathematics, 6:19 26, 1980.

M. J. Gander. 50 years of time parallel time integration. In T. Carraro, M. Geiger, S. K orkel, and R. Rannacher, editors, Multiple Shooting and Time Domain Decomposition Methods, pages 69 113, Cham, 2015. Springer International Publishing. ISBN 978-3-319-23321-5.

M. J. Gander and S. Vandewalle. Analysis of the parareal time-parallel time-integration method. SIAM Journal on Scientiﬁc Computing, 29(2):556 578, 2007. doi: 10.1137/ 05064607X.

R. Gao, F. Tronarp, and S. S arkk a. Iterated extended kalman smoother-based variable splitting for l1-regularized state estimation. IEEE Transactions on Signal Processing, 67 (19):5078 5092, 2019. doi: 10.1109/TSP.2019.2935868.

A. Griewank. Evaluating Derivatives: Principles and Techniques of Algorithmic Diﬀerentiation. Frontiers in applied mathematics. Society for Industrial and Applied Mathematics, 2000. ISBN 9780898714517.

E. Hairer, S. Norsett, and G. Wanner. Solving Ordinary Diﬀerential Equations I: Nonstiﬀ Problems, volume 8. Springer-Verlag, 1993. ISBN 978-3-540-56670-0. doi: 10.1007/ 978-3-540-78862-1.

Bosch, Corenflos, Yaghoobi, Tronarp, Hennig, S arkk a

U. Helmke, R. Brockett, and J. Moore. Optimization and Dynamical Systems. Communications and Control Engineering. Springer London, 2012. ISBN 9781447134671.

P. Hennig and S. Hauberg. Probabilistic solutions to diﬀerential equations and their application to riemannian statistics. In S. Kaski and J. Corander, editors, Proceedings of the Seventeenth International Conference on Artiﬁcial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 347 355. PMLR, 2014.

P. Hennig, M. A. Osborne, and M. Girolami. Probabilistic numerics and uncertainty in computations. Proceedings. Mathematical, physical, and engineering sciences, 471, 2015.

P. Hennig, M. A. Osborne, and H. P. Kersting. Probabilistic Numerics: Computation as Machine Learning. Cambridge University Press, 2022. doi: 10.1017/9781316681411.

S. Julier and J. Uhlmann. Unscented ﬁltering and nonlinear estimation. Proceedings of the IEEE, 92(3):401 422, 2004. doi: 10.1109/JPROC.2003.823141.

S. Julier, J. Uhlmann, and H. Durrant-Whyte. A new method for the nonlinear transformation of means and covariances in ﬁlters and estimators. IEEE Transactions on Automatic Control, 45(3):477 482, 2000. doi: 10.1109/9.847726.

R. E. Kalman. A new approach to linear ﬁltering and prediction problems. Journal of Basic Engineering, 82(1):35 45, 1960. ISSN 0021-9223. doi: 10.1115/1.3662552.

H. Kersting and P. Hennig. Active uncertainty calibration in Bayesian ODE solvers. In Proceedings of the 32nd Conference on Uncertainty in Artiﬁcial Intelligence (UAI), pages 309 318, 2016.

H. Kersting, T. J. Sullivan, and P. Hennig. Convergence rates of Gaussian ODE ﬁlters. Statistics and computing, 30(6):1791 1816, 2020.

P. Kidger. On Neural Diﬀerential Equations. Ph D thesis, University of Oxford, 2021.

O. Knoth. A globalization scheme for the generalized gauss-newton method. Numerische Mathematik, 56(6):591 607, Jun 1989. ISSN 0945-3245. doi: 10.1007/BF01396345.

N. Kr amer and P. Hennig. Linear-time probabilistic solution of boundary value problems. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, 2021.

N. Kr amer and P. Hennig. Stable implementation of probabilistic ode solvers. Journal of Machine Learning Research, 25(111):1 29, 2024.

N. Kr amer, N. Bosch, J. Schmidt, and P. Hennig. Probabilistic ODE solutions in millions of dimensions. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 11634 11649. PMLR, 2022.

Parallel-in-Time Probabilistic Numerical ODE Solvers

N. Kr amer, J. Schmidt, and P. Hennig. Probabilistic numerical method of lines for timedependent partial diﬀerential equations. In G. Camps-Valls, F. J. R. Ruiz, and I. Valera, editors, Proceedings of The 25th International Conference on Artiﬁcial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 625 639. PMLR, 2022.

A. Kværnø. Singly diagonally implicit Runge Kutta methods with an explicit ﬁrst stage. BIT Numerical Mathematics, 44(3):489 502, 2004.

J.-L. Lions, Y. Maday, and G. Turinici. R esolution d EDP par un sch ema en temps parar eel . Comptes Rendus de l Acad emie des Sciences - Series I - Mathematics, 332(7):661 668, 2001. ISSN 0764-4442.

C. J. Oates and T. J. Sullivan. A modern retrospective on probabilistic numerics. Statistics and Computing, 29, 2019.

G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan. Normalizing ﬂows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(1), 2021. ISSN 1532-4435.

K. Pentland, M. Tamborrino, D. Samaddar, and L. C. Appel. Stochastic parareal: An application of probabilistic methods to time-parallelization. SIAM Journal on Scientiﬁc Computing, 0(0):S82 S102, 2021. doi: 10.1137/21M1414231.

K. Pentland, M. Tamborrino, T. J. Sullivan, J. Buchanan, and L. C. Appel. GParareal: a time-parallel ODE solver using Gaussian process emulation. Statistics and Computing, 33 (1):23, 2022. ISSN 1573-1375. doi: 10.1007/s11222-022-10195-y.

H. E. Rauch, F. Tung, and C. T. Striebel. Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 3(8):1445 1450, 1965. ISSN 1533-385X. doi: 10.2514/3.3166.

M. S. Grewal and A. P. Andrews. Kalman ﬁltering. 2014. doi: 10.1002/9781118984987.

S. S arkk a and A. F. Garc ıa-Fern andez. Temporal parallelization of Bayesian smoothers. IEEE Transactions on Automatic Control, 66(1):299 306, 2021. doi: 10.1109/TAC.2020.2976316.

S. S arkk a and L. Svensson. Bayesian Filtering and Smoothing. Institute of Mathematical Statistics Textbooks. Cambridge University Press, 2023. ISBN 9781108912303.

J. Schmidt, N. Kr amer, and P. Hennig. A probabilistic state space model for joint inference from diﬀerential equations and data. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, 2021.

M. Schober, S. S arkk a, and P. Hennig. A probabilistic model for the numerical solution of initial value problems. Statistics and Computing, 29(1):99 122, 2019. ISSN 1573-1375. doi: 10.1007/s11222-017-9798-7.

L. F. Shampine. Some practical Runge Kutta formulas. Mathematics of Computation, 46 (173):135 150, 1986. doi: https://doi.org/10.2307/2008219.

Bosch, Corenflos, Yaghoobi, Tronarp, Hennig, S arkk a

J. Skilling. Bayesian Solution of Ordinary Diﬀerential Equations, pages 23 37. Springer, 1992. ISBN 978-94-017-2219-3. doi: 10.1007/978-94-017-2219-3 2.

Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic diﬀerential equations. In International Conference on Learning Representations, 2021.

W. Su, S. Boyd, and E. J. Cand es. A diﬀerential equation for modeling Nesterov s accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17(153): 1 43, 2016.

S. S arkk a. Unscented Rauch Tung Striebel smoother. IEEE Transactions on Automatic Control, 53(3):845 849, 2008. doi: 10.1109/TAC.2008.919531.

S. S arkk a and A. Solin. Applied Stochastic Diﬀerential Equations. Institute of Mathematical Statistics Textbooks. Cambridge University Press, 2019. doi: 10.1017/9781108186735.

S. S arkk a and L. Svensson. Levenberg-Marquardt and line-search extended Kalman smoothers. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5875 5879, 2020. doi: 10.1109/ICASSP40776. 2020.9054686.

F. Tronarp, H. Kersting, S. S arkk a, and P. Hennig. Probabilistic solutions to ordinary diﬀerential equations as nonlinear Bayesian ﬁltering: a new perspective. Statistics and Computing, 29(6):1297 1315, 2019.

F. Tronarp, S. S arkk a, and P. Hennig. Bayesian ODE solvers: The maximum a posteriori estimate. Statistics and Computing, 31(3):1 18, 2021.

F. Tronarp, N. Bosch, and P. Hennig. Fenrir: Physics-enhanced regression for initial value problems. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 21776 21794. PMLR, 2022.

B. Van der Pol. Theory of the amplitude of free and forced triode vibrations. Radio Review, 1:701 710, 1920.

P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, I. Polat, Y. Feng, E. W. Moore, J. Vander Plas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and Sci Py 1.0 Contributors. Sci Py 1.0: Fundamental Algorithms for Scientiﬁc Computing in Python. Nature Methods, 17:261 272, 2020. doi: 10.1038/ s41592-019-0686-2.

F. Yaghoobi, A. Corenﬂos, S. Hassan, and S. S arkk a. Parallel iterated extended and sigma-point Kalman smoothers. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5350 5354, 2021. doi: 10.1109/ICASSP39728.2021.9413364.

Parallel-in-Time Probabilistic Numerical ODE Solvers

F. Yaghoobi, A. Corenﬂos, S. Hassan, and S. S arkk a. Parallel square-root statistical linear regression for inference in nonlinear state space models. ar Xiv:2207.00426, 2023.