# Geometric Neural Diffusion Processes

Emile Mathieu (University of Cambridge), Vincent Dutordoir (University of Cambridge, Secondmind Labs), Michael J. Hutchinson (University of Oxford), Valentin De Bortoli (Center for Science of Data, ENS Ulm), Yee Whye Teh (University of Oxford), Richard E. Turner (University of Cambridge, Microsoft Research)

Denoising diffusion models have proven to be a flexible and effective paradigm for generative modelling. Their recent extension to infinite-dimensional Euclidean spaces has allowed for the modelling of stochastic processes. However, many problems in the natural sciences incorporate symmetries and involve data living in non-Euclidean spaces. In this work, we extend the framework of diffusion models to incorporate a series of geometric priors in infinite-dimensional modelling. We do so by a) constructing a noising process which admits, as limiting distribution, a geometric Gaussian process that transforms under the symmetry group of interest, and b) approximating the score with a neural network that is equivariant w.r.t. this group. We show that under these conditions, the generative functional model admits the same symmetry. We demonstrate the scalability and capacity of the model, using a novel Langevin-based conditional sampler, to fit complex scalar and vector fields, with Euclidean and spherical codomain, on synthetic and real-world weather data.

1 Introduction

Traditional denoising diffusion models are defined on finite-dimensional Euclidean spaces (Song and Ermon, 2019; Song et al., 2021; Ho et al., 2020; Dhariwal and Nichol, 2021). Extensions have recently been developed for more exotic distributions, such as those supported on Riemannian manifolds (De Bortoli et al., 2022; Huang et al., 2022; Lou and Ermon, 2023; Fishman et al., 2023), and on spaces of functions $f : \mathbb{R}^n \to \mathbb{R}^d$ (Dutordoir et al., 2022; Kerrigan et al., 2022; Lim et al., 2023a; Pidstrigach et al., 2023; Franzese et al., 2023; Bond-Taylor and Willcocks, 2023), i.e. stochastic processes. In this work, we extend diffusion models to further deal with distributions over functions that incorporate non-Euclidean geometry in two different ways. This investigation of geometry also naturally leads to the consideration of symmetries in these distributions, and as such we also present methods for incorporating these into diffusion models.

Firstly, we look at tensor fields. Tensor fields are geometric objects that assign to every point of a manifold a value that lives in some vector space $V$. Roughly speaking, these are functions of the form $f : \mathcal{M} \to V$. These objects are central to the study of physics, as they form a generic mathematical framework for modelling natural phenomena. Common examples include the pressure of a fluid in motion as $f : \mathbb{R}^3 \to \mathbb{R}$, wind over the Earth's surface as $f : \mathbb{S}^2 \to T\mathbb{S}^2$, where $T\mathbb{S}^2$ is the tangent space of the sphere, or the stress in a deformed object as $f : \mathrm{Object} \to T\mathbb{R}^3 \otimes T\mathbb{R}^3$, where $\otimes$ is the tensor product of the tangent spaces. Given the inherent symmetry in the laws of nature, these tensor fields transform in a way that preserves these symmetries. Any modelling of these laws may benefit from respecting these symmetries.

37th Conference on Neural Information Processing Systems (NeurIPS 2023). *Equal contribution.

Secondly, we look at fields with manifold codomain, and in particular at functions of the form $f : \mathbb{R} \to \mathcal{M}$. The challenge in dealing with manifold-valued outputs arises from the lack of vector-space structure.
In applications, these functions typically appear when modelling processes indexed by time that take values on a manifold. Examples include tracking the eye of a cyclone moving over the surface of the Earth, or modelling the joint angles of a robot as it performs tasks. The scarcity of data or noisy measurements of the physical process of interest motivates a probabilistic treatment of such phenomena, in addition to their functional nature. Arguably the most important framework for modelling stochastic processes is that of Gaussian processes (GPs) (Rasmussen, 2003), as they allow for exact or approximate posterior prediction (Titsias, 2009; Rahimi and Recht, 2007; Wilson et al., 2020). In particular, when choosing equivariant mean and kernel functions, GPs are invariant (i.e. stationary) (Holderrieth et al., 2021; Azangulov et al., 2022; Azangulov et al., 2023). Their limited modelling capacity and the difficulty of designing complex, problem-specific kernels motivate the development of neural processes (NPs) (Garnelo et al., 2018b), which learn to approximately model a conditional stochastic process directly from data. NPs have been extended to model translation-invariant (scalar) processes (Gordon et al., 2020) and more generic E(n)-invariant processes (Holderrieth et al., 2021). Yet, the Gaussian conditional assumption of standard NPs still limits their flexibility and prevents such models from fitting complex processes. Diffusion models provide a compelling alternative with significantly greater modelling flexibility.

In this work, we develop geometric neural diffusion processes, which incorporate geometric prior knowledge into functional diffusion models. Our contributions are three-fold: (a) We extend diffusion models to more generic function spaces (i.e. tensor fields, and functions $f : \mathcal{X} \to \mathcal{M}$) by defining a suitable noising process. (b) We incorporate group invariance of the distribution of the generative model by enforcing the covariance kernel and the score network to be group equivariant. (c) We propose a novel Langevin dynamics scheme for efficient conditional sampling.

2 Background

Denoising diffusion models. We briefly recall the key concepts behind diffusion models on $\mathbb{R}^d$ and refer the reader to Song et al. (2021) for a more detailed introduction. We consider a forward noising process $(Y_t)_{t \ge 0}$ defined by the following Stochastic Differential Equation (SDE)
$$\mathrm{d}Y_t = -\tfrac{1}{2} Y_t\, \mathrm{d}t + \mathrm{d}B_t, \qquad Y_0 \sim p_0, \qquad (1)$$
where $(B_t)_{t \ge 0}$ is a $d$-dimensional Brownian motion and $p_0$ is the data distribution. The process $(Y_t)_{t \ge 0}$ is simply an Ornstein–Uhlenbeck (OU) process, which converges at a geometric rate to $\mathrm{N}(0, \mathrm{I}_d)$ (Durmus and Moulines, 2017). Under mild conditions on $p_0$, the time-reversed process $(\bar{Y}_t)_{t \in [0,T]} = (Y_{T-t})_{t \in [0,T]}$ also satisfies an SDE (Cattiaux et al., 2021), given by
$$\mathrm{d}\bar{Y}_t = \{\tfrac{1}{2} \bar{Y}_t + \nabla \log p_{T-t}(\bar{Y}_t)\}\, \mathrm{d}t + \mathrm{d}\bar{B}_t, \qquad \bar{Y}_0 \sim p_T, \qquad (2)$$
where $p_t$ denotes the density of $Y_t$. Unfortunately, we cannot sample exactly from (2) as $p_T$ and the scores $(\nabla \log p_t)_{t \in [0,T]}$ are unavailable. First, $p_T$ is substituted with the limiting distribution $\mathrm{N}(0, \mathrm{I}_d)$, towards which it converges. Second, one can easily show (Kallenberg, 2021, Thm 5.1) that the score $\nabla \log p_t$ is the minimiser of $\ell_t(s) = \mathbb{E}[\| s(Y_t) - \nabla_{y_t} \log p_{t|0}(Y_t \mid Y_0) \|^2]$ over functions $s : [0, T] \times \mathbb{R}^d \to \mathbb{R}^d$, where the expectation is over the joint distribution of $(Y_0, Y_t)$; as such, the score can readily be approximated by a neural network $s_\theta(t, y_t)$ minimising this functional. Finally, a discretisation of (2) is performed (e.g. Euler–Maruyama) to obtain approximate samples from $p_0$.
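To make the reverse-time sampling procedure concrete, the sketch below discretises the OU noising process (1) and its time reversal (2) with Euler–Maruyama, using a toy Gaussian data distribution so that the score is available in closed form (in practice it is replaced by the learnt network $s_\theta$). All names, schedules and step sizes here are illustrative and not taken from the paper.

```python
import jax.numpy as jnp
from jax import random

T, n_steps, d = 5.0, 1000, 2
dt = T / n_steps

# Toy data distribution p_0 = N(mu0, sigma0^2 I); under (1) its noised marginal
# is p_t = N(mu0 e^{-t/2}, (1 - (1 - sigma0^2) e^{-t}) I), so the score is analytic.
mu0, sigma0 = jnp.array([2.0, -1.0]), 0.3

def score(t, y):
    mean_t = mu0 * jnp.exp(-0.5 * t)
    var_t = 1.0 - (1.0 - sigma0**2) * jnp.exp(-t)
    return -(y - mean_t) / var_t

# Reverse-time Euler--Maruyama for (2): start from the limiting N(0, I) and
# integrate dY = {Y/2 + score} dt + dB.
key = random.PRNGKey(0)
key, sub = random.split(key)
y = random.normal(sub, (2000, d))
for i in range(n_steps):
    t = T - i * dt
    key, sub = random.split(key)
    y = y + (0.5 * y + score(t, y)) * dt + jnp.sqrt(dt) * random.normal(sub, y.shape)

print(y.mean(0), y.std(0))  # approximately mu0 and sigma0, up to discretisation error
```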
Figure 1: Illustration of a vector field $f : \mathbb{R}^2 \to \mathbb{R}^2$ with representation $\rho(h) = h$ being steered by a group element $h = 90^\circ \in \mathrm{O}(2) \leq \mathrm{E}(2)$ (panels: $f(x)$, $f(g^{-1}x)$, $\rho(h)f(g^{-1}x)$).

Steerable fields. We now define steerable feature fields, which are collections of tensor fields. We focus on the Euclidean group $G = \mathrm{E}(d)$, whose elements $g$ admit a unique decomposition $g = uh$, where $h \in \mathrm{O}(d)$ is a $d \times d$ orthogonal matrix and $u \in \mathrm{T}(d)$ is a translation, which can be identified with an element of $\mathbb{R}^d$; for a vector $x \in \mathbb{R}^d$, $g \cdot x = hx + u$ denotes the action of $g$ on $x$, with $h$ acting on $x$ from the left by matrix multiplication. This special case simplifies the presentation; the extension to the general case is discussed in App. C.1. We are interested in learning a probabilistic model over functions of the form $f : \mathcal{X} \to \mathbb{R}^d$ such that a group $G$ acts on $\mathcal{X}$ and $\mathbb{R}^d$. We call a feature field a tuple $(f, \rho)$ with $f : \mathcal{X} \to \mathbb{R}^d$ a mapping from an input $x \in \mathcal{X}$ to a feature $f(x)$, with associated representation $\rho : G \to \mathrm{GL}(\mathbb{R}^d)$ (Scott and Serre, 1996). This feature field is said to be $G$-steerable if it transforms, for all $x \in \mathcal{X}$, $g \in G$, as $(g \cdot f)(x) = \rho(g) f(g^{-1} \cdot x)$. In this setting, the action of $\mathrm{E}(d) = \mathrm{T}(d) \rtimes \mathrm{O}(d)$ on the feature field $f$ yields $(g \cdot f)(x) = ((uh) \cdot f)(x) = \rho(h) f\big(h^{-1}(x - u)\big)$. Typical examples of feature fields include scalar fields with $\rho_{\mathrm{triv}}(g) \equiv 1$, transforming as $(g \cdot f)(x) = f(g^{-1} \cdot x)$, such as temperature fields, and vector or potential fields with $\rho_{\mathrm{Id}}(g) \equiv h$, transforming as $(g \cdot f)(x) = h f(g^{-1} \cdot x)$ as illustrated in Fig. 1, such as wind or force fields. For many natural phenomena, we do not a priori want to express a preference for a particular conformation of the feature field and thus want a prior $\mu$ to place the same density on all transformed fields: $\mu(g \cdot f) = \mu(f)$ for all $g \in G$. Leveraging this symmetry can drastically reduce the amount of data required to learn from and reduce training time.

3 Geometric neural diffusion processes: GEOMNDPS

3.1 Continuous diffusion on function spaces

We construct a diffusion model on functions $f : \mathcal{X} \to \mathcal{Y}$, with $\mathcal{Y} = \mathbb{R}^d$, by defining a diffusion model for every finite set of marginals. Most prior works on infinite-dimensional diffusions consider a noising process on the space of functions (Kerrigan et al., 2022; Pidstrigach et al., 2023; Lim et al., 2023b). In theory, this allows the model to define a consistent distribution over all the finite marginals of the process being modelled. In practice, however, only finite marginals can be modelled on a computer and the score function needs to be approximated, and it is at this step that consistency over the marginals is lost. The only work to stay fully consistent in implementation is Phillips et al. (2022), at the cost of limiting the functions that can be modelled to a finite-dimensional subspace. With this in mind, we eschew the technically laborious process of defining diffusions over the infinite-dimensional space and work solely on the finite marginals, following Dutordoir et al. (2022). We find that in practice consistency can be learned well from data (see Sec. 5), and this allows for more flexible choices of score network architecture and easier training.

Noising process. We assume we are given a data process $(Y_0(x))_{x \in \mathcal{X}}$. Given any $\mathbf{x} = (x_1, \ldots, x_n) \in \mathcal{X}^n$, we consider the following forward noising process $(Y_t(\mathbf{x}))_{t \ge 0} \triangleq (Y_t(x_1), \ldots, Y_t(x_n))_{t \ge 0} = (Y^1_t, \ldots, Y^n_t)_{t \ge 0} \in \mathcal{Y}^n$, defined by the following SDE
$$\mathrm{d}Y_t(\mathbf{x}) = \tfrac{1}{2}\{m(\mathbf{x}) - Y_t(\mathbf{x})\}\, \beta_t\, \mathrm{d}t + \beta_t^{1/2}\, K(\mathbf{x}, \mathbf{x})^{1/2}\, \mathrm{d}B_t, \qquad (3)$$
where $K(\mathbf{x}, \mathbf{x})_{i,j} = k(x_i, x_j)$ with $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ a kernel and $m : \mathcal{X} \to \mathcal{Y}$ a mean function.

The process $(Y_t(\mathbf{x}))_{t \ge 0}$ is a multivariate Ornstein–Uhlenbeck process with drift $b(t, \mathbf{x}, Y_t(\mathbf{x})) = m(\mathbf{x}) - Y_t(\mathbf{x})$ and diffusion coefficient $\sigma(t, \mathbf{x}, Y_t(\mathbf{x})) = K(\mathbf{x}, \mathbf{x})$, which converges at a geometric rate to $\mathrm{N}(m(\mathbf{x}), K(\mathbf{x}, \mathbf{x}))$. Using Phillips et al. (2022), it can be shown that this convergence extends to the process $(Y_t)_{t \ge 0}$, which converges to the Gaussian process with mean $m$ and kernel $k$, denoted $Y_\infty$. In the specific instance where $k(x, x') = \delta_x(x')$, the limiting process $Y_\infty$ is simply Gaussian white noise, whilst other choices such as the squared-exponential or Matérn kernel lead to the associated Gaussian limiting process $Y_\infty$. Note that the white noise setting is not covered by the existing theory of functional diffusion models, as a Hilbert space and a square-integrable kernel are required; see Kerrigan et al. (2022) for instance.
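The finite-marginal SDE (3) can be simulated directly. Below is a minimal Euler–Maruyama sketch, assuming a zero limiting mean, a squared-exponential kernel and a constant $\beta_t$; these are illustrative choices rather than the paper's exact configuration. The key point is that the injected noise is correlated across the inputs $\mathbf{x}$ through $K(\mathbf{x}, \mathbf{x})^{1/2}$.

```python
import jax.numpy as jnp
from jax import random

def se_kernel(x, lengthscale=0.5):
    d2 = (x[:, None] - x[None, :]) ** 2
    return jnp.exp(-0.5 * d2 / lengthscale**2) + 1e-6 * jnp.eye(x.shape[0])

x = jnp.linspace(-2.0, 2.0, 64)
K_half = jnp.linalg.cholesky(se_kernel(x))   # K(x, x)^{1/2} as a Cholesky factor
m = jnp.zeros_like(x)                        # limiting mean m(x) = 0
y = jnp.sin(3.0 * x)                         # one draw Y_0(x) from a stand-in data process
beta, T, n_steps = 5.0, 1.0, 200             # constant beta_t, for simplicity
dt = T / n_steps

key = random.PRNGKey(1)
for _ in range(n_steps):
    key, sub = random.split(key)
    noise = K_half @ random.normal(sub, x.shape)            # kernel-correlated noise
    y = y + 0.5 * (m - y) * beta * dt + jnp.sqrt(beta * dt) * noise
# y is now close to a draw from the limiting GP(m, k) evaluated at the inputs x.
```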
Denoising process. Under mild conditions on $Y_0$², the time-reversal process $(\bar{Y}_t(\mathbf{x}))_{t \ge 0}$ satisfies
$$\mathrm{d}\bar{Y}_t(\mathbf{x}) = \big\{\tfrac{1}{2}\big(m(\mathbf{x}) - \bar{Y}_t(\mathbf{x})\big) + K(\mathbf{x}, \mathbf{x})\, \nabla \log p_{T-t}(\bar{Y}_t(\mathbf{x}))\big\}\, \beta_{T-t}\, \mathrm{d}t + \beta_{T-t}^{1/2}\, K(\mathbf{x}, \mathbf{x})^{1/2}\, \mathrm{d}\bar{B}_t, \qquad (4)$$
with $\bar{Y}_0 \sim \mathrm{GP}(m, k)$ and $p_t$ the density of $Y_t(\mathbf{x})$ w.r.t. the Lebesgue measure. In practice, the $\nabla \log p_{T-t}$ term, known as the Stein score, is not tractable and must be approximated by a neural network. We then consider the generative stochastic process model defined by first sampling $\bar{Y}_0 \sim \mathrm{GP}(m, k)$ and then simulating the reverse diffusion (4) (e.g. via Euler–Maruyama discretisation).

² Haussmann and Pardoux (1986) assume Lipschitz continuity of the drift and volatility matrix, while Cattiaux et al. (2021, Thm 4.9) only assume a finite-entropy condition on the space of processes.

Manifold-valued outputs. So far we have defined our generative model with $\mathcal{Y} = \mathbb{R}^d$; we can readily extend the methodology to manifold-valued functional models using Riemannian diffusion models such as De Bortoli et al. (2022) and Huang et al. (2022), see App. C. One notable difference is that in the case where $\mathcal{Y}$ is a compact manifold, we replace the Ornstein–Uhlenbeck process by a Brownian motion, which targets the uniform distribution.

Training. As the reverse SDE (4) involves the preconditioned score $K(\mathbf{x}, \mathbf{x}) \nabla \log p_t$, we directly approximate it with a neural network $(Ks)_\theta : [0, T] \times \mathcal{X}^n \times \mathcal{Y}^n \to T\mathcal{Y}^n$, where $T\mathcal{Y}$ is the tangent bundle of $\mathcal{Y}$, see App. C. The conditional score of the noising process (3) is given by
$$\nabla_{Y_t} \log p_t(Y_t(\mathbf{x}) \mid Y_0(\mathbf{x})) = -\Sigma_{t|0}^{-1}(Y_t(\mathbf{x}) - m_{t|0}) = -\sigma_{t|0}^{-1} K(\mathbf{x}, \mathbf{x})^{-1/2} \varepsilon, \qquad (5)$$
since $Y_t = m_{t|0} + \Sigma_{t|0}^{1/2} \varepsilon$ with $\varepsilon \sim \mathrm{N}(0, \mathrm{I})$, and $\Sigma_{t|0} = \sigma_{t|0}^2 K$ with $\sigma_{t|0} = (1 - e^{-\int_0^t \beta(s)\,\mathrm{d}s})^{1/2}$, see App. B.1. We learn the preconditioned score $(Ks)_\theta$ by minimising the following denoising score matching (DSM) loss (Vincent et al., 2010), weighted by $\Lambda(t) = \sigma_{t|0}^2\, K K^\top$:
$$\mathcal{L}(\theta; \Lambda(t)) = \mathbb{E}\big[\| s_\theta(t, Y_t) - \nabla \log p_t(Y_t \mid Y_0) \|_{\Lambda(t)}^2\big] = \mathbb{E}\big[\| \sigma_{t|0}\, (Ks)_\theta(t, Y_t) + K^{1/2} \varepsilon \|_2^2\big], \qquad (6)$$
where $\|x\|_\Lambda^2 = x^\top \Lambda x$. Note that when targeting a unit-variance white noise, $K = \mathrm{I}_d$ and the loss (6) reverts to the DSM loss with weighting $\lambda(t) = 1/\sigma_{t|0}^2$ (Song et al., 2021). In App. B.3, we explore several preconditioning terms and associated weightings $\Lambda(t)$. Overall, we found the preconditioned score $K \nabla \log p_t$ parameterisation, in combination with the $\ell_2$ loss, to perform best, as shown by the ablation study in App. F.1.3.
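As a complement to the training description above, here is a minimal sketch of one Monte Carlo evaluation of the loss (6) for the $K$-preconditioned score. The closed-form conditional marginal from App. B.1 is used to generate $Y_t$; `score_net` is a placeholder for any network mapping $(t, \mathbf{x}, Y_t)$ to $\mathcal{Y}^n$, and the kernel, schedule and data below are illustrative.

```python
import jax.numpy as jnp
from jax import random

def dsm_loss(key, score_net, y0, x, K, t, B_t):
    """One-sample estimate of Eq. (6) (averaged rather than summed) for the
    K-preconditioned score, assuming a zero limiting mean m = 0."""
    n = y0.shape[0]
    L = jnp.linalg.cholesky(K + 1e-6 * jnp.eye(n))        # K^{1/2} as a Cholesky factor
    sigma_t = jnp.sqrt(1.0 - jnp.exp(-B_t))               # sigma_{t|0}
    eps = random.normal(key, y0.shape)
    y_t = jnp.exp(-0.5 * B_t) * y0 + sigma_t * (L @ eps)  # closed-form noised marginal
    pred = score_net(t, x, y_t)                           # approximates K grad log p_t
    return jnp.mean((sigma_t * pred + L @ eps) ** 2)      # || sigma (Ks)_theta + K^{1/2} eps ||^2

# Example call with an untrained stand-in "network".
key = random.PRNGKey(2)
x = jnp.linspace(-2.0, 2.0, 30)
K = jnp.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.5**2)
loss = dsm_loss(key, lambda t, x_, y: -y, jnp.sin(x), x, K, t=0.3, B_t=0.4)
```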
3.2 Invariant neural diffusion processes

In this section, we show how we can incorporate geometrical constraints into the functional diffusion model introduced in Sec. 3.1. In particular, given a group $G$, we aim to build a generative model over steerable feature fields as defined in Sec. 2.

Invariant process. A stochastic process $f \sim \mu$ is said to be $G$-invariant if $\mu(g \cdot A) = \mu(A)$ for any $g \in G$, with $\mu \in \mathcal{P}(C(\mathcal{X}, \mathcal{Y}))$, where $\mathcal{P}$ denotes the space of probability measures on the space of continuous functions and $A \subseteq C(\mathcal{X}, \mathcal{Y})$ is measurable. From a sample perspective, this means that, with input–output pairs $C = \{(x_i, y_i)\}_{i=1}^n$ and denoting the action of $G$ on this set as $g \cdot C \triangleq \{(g \cdot x_i, \rho(g) y_i)\}_{i=1}^n$, $f \sim \mu$ is $G$-invariant if and only if $g \cdot C$ has the same distribution as $C$. In what follows, we aim to derive sufficient conditions on the model introduced in Sec. 3 so that it satisfies this $G$-invariance property. First, we recall such a necessary and sufficient condition for Gaussian processes.

Proposition 3.1 (Invariant (stationary) Gaussian process (Holderrieth et al., 2021)). A Gaussian process $\mathrm{GP}(m, k)$ is $G$-invariant if and only if its mean $m$ and covariance $k$ are suitably $G$-equivariant, that is, for all $x, x' \in \mathcal{X}$, $g \in G$,
$$m(g \cdot x) = \rho(g) m(x) \quad \text{and} \quad k(g \cdot x, g \cdot x') = \rho(g) k(x, x') \rho(g)^\top. \qquad (7)$$

Trivial examples of E(n)-equivariant kernels include diagonal kernels $k = k_0 \cdot \mathrm{I}_d$ with $k_0$ invariant (Holderrieth et al., 2021), but see App. F.2 for non-trivial instances introduced by Macêdo and Castro (2010). Building on Prop. 3.1, we then state that our introduced neural diffusion process is also invariant if we additionally assume the score network to be $G$-equivariant.

Proposition 3.2 (Invariant neural diffusion process (Yim et al., 2023, Prop 3.6)). We denote by $(\bar{Y}_t(x))_{x \in \mathcal{X}, t \in [0,T]}$ the process induced by the time-reversal SDE (4), where the score is approximated by a score network $s_\theta : [0, T] \times \mathcal{X}^n \times \mathcal{Y}^n \to T\mathcal{Y}^n$ and the limiting process is given by $\mathcal{L}(\bar{Y}_0) = \mathrm{GP}(m, k)$. Assuming $m$ and $k$ are respectively $G$-equivariant per Prop. 3.1, if we additionally have that the score network is a $G$-equivariant vector field, i.e. $s_\theta(t, g \cdot \mathbf{x}, \rho(g)\mathbf{y}) = \rho(g)\, s_\theta(t, \mathbf{x}, \mathbf{y})$ for all $\mathbf{x} \in \mathcal{X}^n$, $g \in G$, then for any $t \in [0, T]$ the process $(\bar{Y}_t(x))_{x \in \mathcal{X}}$ is $G$-invariant.

This result can be proved in two ways, from the probability flow ODE perspective or directly in terms of the SDE via the Fokker–Planck equation, see App. D.2. In particular, when modelling an invariant scalar data process $(Y_0(x))_{x \in \mathcal{X}}$ such as a temperature field, we need the score network to satisfy the invariance constraint $s_\theta(t, g \cdot \mathbf{x}, \mathbf{y}) = s_\theta(t, \mathbf{x}, \mathbf{y})$.

Figure 2: Samples from equivariant neural diffusion processes conditioned on a context set $C$ (in red) and evaluated on a regular grid $x^*$ for scalar (Left) and 2D vector (Right) fields (panels: $f(x^*) \mid C$ and $f(x^*) \mid g \cdot C$). The same model is then conditioned on the transformed context $g \cdot C$, with group element $g$ being a translation of length 2 (Left) or a 90° rotation (Right).

Equivariant conditional process. Often precedence is given to modelling the predictive process given a set of observations $C = \{(x_c, y_c)\}_{c \in C}$. In this context, the conditional process (Pollard, 2002, p.117) inherits the symmetry of the prior process in the following sense. A stochastic process with distribution $\mu$ given a context $C$ is said to be conditionally $G$-equivariant if the conditional satisfies $\mu(A \mid g \cdot C) = \mu(g \cdot A \mid C)$ for any $g \in G$ and $A \subseteq C(\mathcal{X}, \mathcal{Y})$ measurable, as illustrated in Fig. 2.

Proposition 3.3 (Equivariant conditional process). Assume a stochastic process $f \sim \mu$ is $G$-invariant. Then the conditional process $f \mid C$ given a set of observations $C$ is $G$-equivariant.

Originally stated in Holderrieth et al. (2021) in the case where the process is over functions of the form $f : \mathbb{R}^n \to \mathbb{R}^d$ and the marginals have a density w.r.t. the Lebesgue measure, we prove Prop. 3.3 for stochastic processes over generic fields on manifolds in terms only of the measure of the process (App. D.3).
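To illustrate the equivariance condition (7) from Prop. 3.1 numerically, the sketch below checks it for a matrix-valued kernel of the form $k(x, x') = a(\|r\|)\,\mathrm{I} + b(\|r\|)\, r r^\top$ with $r = x - x'$. The curl-free and divergence-free kernels of Macêdo and Castro (2010) are of this form; the specific $a$ and $b$ below are illustrative rather than the exact App. D.1 normalisation.

```python
import jax.numpy as jnp
from jax import random

def kernel(x, x2, ell=1.0):
    # Matrix-valued kernel a(||r||) I + b(||r||) r r^T with an SE base profile.
    r = x - x2
    sq = jnp.sum(r**2)
    a = jnp.exp(-0.5 * sq / ell**2)
    b = jnp.exp(-0.5 * sq / ell**2) / ell**2
    return a * jnp.eye(2) + b * jnp.outer(r, r)

theta = 0.7
R = jnp.array([[jnp.cos(theta), -jnp.sin(theta)],
               [jnp.sin(theta),  jnp.cos(theta)]])   # rho(g) = h, a rotation in O(2)
u = jnp.array([3.0, -1.0])                           # translation part of g = uh
x, x2 = random.normal(random.PRNGKey(3), (2, 2))     # two random inputs in R^2

lhs = kernel(R @ x + u, R @ x2 + u)                  # k(g.x, g.x')
rhs = R @ kernel(x, x2) @ R.T                        # rho(g) k(x, x') rho(g)^T
print(jnp.max(jnp.abs(lhs - rhs)))                   # ~ 0 up to floating-point error
```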
3.3 Conditional sampling

Figure 3: Illustration of Langevin-corrected conditional sampling. The black line represents the noising process dynamics $(p_t)_{t \in [0,T]}$. The time-reversal (i.e. predictor) step is combined with a Langevin corrector step projecting back onto the dynamics.

There exist several methods to perform conditional sampling in diffusion models, such as replacement sampling, amortisation and conditional guidance, which we discuss in App. E.1. Here we propose a new method for sampling from exact conditional distributions of NDPs using only the score network for the joint distribution. Using the fact that the conditional score can be written as $\nabla_x \log p(x \mid y) = \nabla_x \log p(x, y) - \nabla_x \log p(y) = \nabla_x \log p(x, y)$, we can, for any point in diffusion time, conditionally sample using Langevin dynamics, following the SDE
$$\mathrm{d}Y_s = \tfrac{1}{2} K\, \nabla \log p_{T-t}(Y_s)\, \mathrm{d}s + K^{1/2}\, \mathrm{d}B_s,$$
by only applying the diffusion to the variables of interest and holding the others fixed. While we could sample directly at the final time, this proves difficult in practice. Similar to the motivation of Song and Ermon (2019), we sample along the reverse diffusion, taking a number of conditional Langevin steps at each time. In addition, we apply the forward noising SDE to the conditioning points at each step, as this puts the combined context and sampling set in a region where the score function is well learned during training. Our procedure is illustrated in Alg. 1. In App. E.1 we draw links with REPAINT of Lugmayr et al. (2022).

3.4 Likelihood evaluation

Similarly to Song et al. (2021), we can derive a deterministic ODE which has the same marginal density as the SDE (3), given by the probability flow Ordinary Differential Equation (ODE), see App. B. Once the score network is learnt, we can thus use it in conjunction with an ODE solver to compute the likelihood of the model. A perhaps more interesting task is to evaluate the predictive posterior likelihood $p(y^* \mid x^*, \{x_i, y_i\}_{i \in C})$ given a context set $\{x_i, y_i\}_{i \in C}$. A simple approach is to rely on the conditional probability rule and evaluate $p(y^* \mid x^*, \{x_i, y_i\}_{i \in C}) = p(y^*, \{y_i\}_{i \in C} \mid x^*, \{x_i\}_{i \in C}) / p(\{y_i\}_{i \in C} \mid \{x_i\}_{i \in C})$. This can be done by solving two probability flow ODEs: one over the joint evaluation and context set, and another only over the context set.
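For concreteness, below is a condensed sketch of the conditional sampler of Sec. 3.3 (cf. Alg. 1): the reverse diffusion is run on the target locations while, at every step, the context observations are re-noised with the forward marginal and a few Langevin corrector steps are taken that only move the target coordinates. `score_net` stands in for the learnt joint preconditioned score; a unit-variance white limiting process, $\beta(t) = 1$ and all step sizes are simplifying, illustrative choices.

```python
import jax.numpy as jnp
from jax import random

def conditional_sample(key, score_net, x_ctx, y_ctx, x_tgt,
                       T=1.0, n_steps=100, n_langevin=5, langevin_eps=1e-3):
    x = jnp.concatenate([x_ctx, x_tgt])
    n_c, n_t = x_ctx.shape[0], x_tgt.shape[0]
    dt = T / n_steps
    key, sub = random.split(key)
    y_tgt = random.normal(sub, (n_t,))        # start from the (white) limiting process

    for i in range(n_steps):
        t = T - i * dt
        B_t = t                                # B(t) = t since beta(t) = 1 here
        # Forward-noise the context so (context, target) lies where the score was trained.
        key, sub = random.split(key)
        y_c = jnp.exp(-0.5 * B_t) * y_ctx \
              + jnp.sqrt(1 - jnp.exp(-B_t)) * random.normal(sub, (n_c,))

        # Predictor: reverse-time Euler--Maruyama step on the target block only.
        key, sub = random.split(key)
        s = score_net(t, x, jnp.concatenate([y_c, y_tgt]))[n_c:]
        y_tgt = y_tgt + (0.5 * y_tgt + s) * dt + jnp.sqrt(dt) * random.normal(sub, (n_t,))

        # Corrector: Langevin steps targeting p_t(. | context), context held fixed.
        for _ in range(n_langevin):
            key, sub = random.split(key)
            s = score_net(t, x, jnp.concatenate([y_c, y_tgt]))[n_c:]
            y_tgt = y_tgt + 0.5 * langevin_eps * s \
                    + jnp.sqrt(langevin_eps) * random.normal(sub, (n_t,))
    return y_tgt
```

Holding the context block fixed while updating only the target block is what turns the joint score into a conditional sampler, per the identity $\nabla_x \log p(x \mid y) = \nabla_x \log p(x, y)$ above.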
4 Related work

Gaussian processes and the neural processes family. One important and powerful framework for constructing distributions over function spaces is that of Gaussian processes (Rasmussen, 2003). Yet, they are restricted in their modelling capacity, and when using exact inference they scale poorly with the number of datapoints. These problems can be partially alleviated by using neural processes (Kim et al., 2019; Garnelo et al., 2018b; Garnelo et al., 2018a; Jha et al., 2022; Louizos et al., 2019; Singh et al., 2019), although they also assume a Gaussian likelihood. Recently introduced autoregressive NPs (Bruinsma et al., 2023) alleviate this limitation, but they are disadvantaged by the fact that variables early in the autoregressive generation only have simple distributions (typically Gaussian). Finally, Dupont et al. (2022) model the weights of implicit neural representations using diffusion models.

Stationary stochastic processes. The most popular Gaussian process kernels (e.g. squared exponential, Matérn) are stationary, that is, they are translation invariant. These lead to invariant Gaussian processes, whose samples, when translated, have the same distribution as the original ones. This idea can be extended to the entire isometry group of Euclidean spaces (Holderrieth et al., 2021), allowing for the modelling of higher-order tensor fields, such as wind fields or incompressible fluid velocities (Macêdo and Castro, 2010). Later, Azangulov et al. (2022) and Azangulov et al. (2023) extended stationary kernels and Gaussian processes to a large class of non-Euclidean spaces, in particular all compact spaces and non-compact symmetric spaces. In the context of neural processes, Gordon et al. (2020) introduced CONVCNP to encode translation equivariance into the predictive process. They do so by embedding the context into a translation-equivariant functional representation, which is then decoded with a convolutional neural network. Holderrieth et al. (2021) later extended this idea to construct neural processes that are additionally equivariant w.r.t. rotations or subgroups thereof.

Spatial structure in diffusion models. A variety of approaches have also been proposed to incorporate spatial correlation in the noising process of finite-dimensional diffusion models by leveraging the multiscale structure of data (Jing et al., 2022; Guth et al., 2022; Ho et al., 2022; Saharia et al., 2021; Hoogeboom and Salimans, 2022; Rissanen et al., 2022). Our methodology can also be seen as a principled way to modify the forward dynamics in classical denoising diffusion models. Hence, our contribution can be understood in the light of recent advances in generative modelling on soft and cold denoising diffusion models (Daras et al., 2022; Bansal et al., 2022; Hoogeboom and Salimans, 2022). Several recent works explicitly introduce a covariance matrix in the Gaussian noise, either based on a choice of kernel (Biloš et al., 2022), on the Discrete Fourier Transform of images (Voleti et al., 2022), or via empirical second-order statistics (squared pairwise distances and the squared radius of gyration) for protein modelling (Ingraham et al., 2022). Alternatively, Guth et al. (2022) introduce correlation on images leveraging a wavelet basis.

Functional diffusion models. Infinite-dimensional diffusion models have been investigated in the Euclidean setting (Kerrigan et al., 2022; Pidstrigach et al., 2023; Lim et al., 2023b; Bond-Taylor and Willcocks, 2023; Hagemann et al., 2023; Franzese et al., 2023; Dutordoir et al., 2022; Phillips et al., 2022). Most of these works are based on an extension of diffusion model techniques (Song et al., 2021; Ho et al., 2020) to the infinite-dimensional space, leveraging tools from Cameron–Martin theory, such as the Feldman–Hájek theorem (Kerrigan et al., 2022; Pidstrigach et al., 2023), to define infinite-dimensional Gaussian measures and how they interact. We refer to Da Prato and Zabczyk (2014) for a thorough introduction to stochastic differential equations in infinite dimensions. Phillips et al. (2022) consider another approach, defining countable diffusion processes in a basis. All these approaches amount to learning a diffusion model with spatial structure. Note that this induced correlation is necessary for the theory of infinite-dimensional SDEs (Da Prato and Zabczyk, 2014) to be applied, but is not necessary to implement diffusion models (Dutordoir et al., 2022). Several approaches have been considered for conditional sampling.
Pidstrigach et al. (2023) and Bond-Taylor and Willcocks (2023) modify the reverse diffusion to introduce a guidance term, while Dutordoir et al. (2022) and Kerrigan et al. (2022) use the replacement method. Finally, Phillips et al. (2022) amortise the score function w.r.t. the conditioning context.

Figure 4: Prior and posterior samples (in blue) from the data process and the GeomNDP model (panels: Data Prior, Model Prior, Model Posterior, True Posterior), with context points in red and posterior mean in black.

5 Experimental results

5.1 1D regression over stationary scalar fields

We evaluate GEOMNDPS on several synthetic 1D regression datasets. We follow the same experimental setup as Bruinsma et al. (2020), which we detail in App. F.1. In short, it contains Gaussian (Squared Exponential (SE), MATÉRN(5/2), WEAKLY PERIODIC) and non-Gaussian (SAWTOOTH and MIXTURE) sample paths, where MIXTURE is a combination of the other four datasets with equal weight. Fig. 9 shows samples from each of these datasets. The Gaussian datasets are corrupted with observation noise with variance $\sigma^2 = 0.05^2$.

Table 1 reports the average log-likelihood $p(y^* \mid x^*, C)$ across 4096 test samples, where the context set size is uniformly sampled between 1 and 10 and the target has a fixed size of 50. All inputs $x_c, x^*$ are chosen uniformly within their input domain, which is [-2, 2] for the training data and the interpolation evaluation, and [2, 6] for the generalisation evaluation. We compare the performance of GeomNDP to a GP with the true hyperparameters (when available), a (convolutional) Gaussian NP (Bruinsma et al., 2020), a convolutional NP (Gordon et al., 2020) and a vanilla attention-based NDP (Dutordoir et al., 2022), which we reformulated in the continuous diffusion process framework to allow for log-likelihood evaluations and thus a fair comparison, denoted NDP*. We enforce translation invariance in the score network for GeomNDP by subtracting the centre of mass from the input $\mathbf{x}$, inducing stationary scalar fields.

On the GP datasets, the GNP, CONVNP and GeomNDP methods are able to fit the conditionals, perfectly matching the log-likelihood of the GP model. GNP's performance degrades on the non-Gaussian datasets as it is restricted by its conditional Gaussian assumption, whilst NDP methods still perform well, as illustrated in Fig. 4. In the bottom rows of Table 1, we assess the models' ability to generalise outside of the training input range $x \in [-2, 2]$, and evaluate them on a translated grid where context and target points are sampled from [2, 6]. Only convolutional NPs (GNP and CONVNP) and T(1)-GEOMNDP are able to model stationary processes and therefore to perform as well as in the interpolation task. The NDP*, on the contrary, drastically fails at this task.

Table 1: Mean test log-likelihood (TLL) (↑) ± 1 standard error, estimated over 4096 test samples. The statistically significant best non-GP model is in bold. * stands for a TLL below −10. NP baselines from Bruinsma et al. (2020). T(1)-GEOMNDP indicates our proposed method with a translation-invariant score.

|  | SE | MATÉRN(5/2) | WEAKLY PER. | SAWTOOTH | MIXTURE |
|---|---|---|---|---|---|
| INTERPOLATION |  |  |  |  |  |
| GP (OPTIMUM) | 0.70 ± 0.00 | 0.31 ± 0.00 | 0.32 ± 0.00 | - | - |
| T(1)-GEOMNDP | 0.72 ± 0.03 | 0.32 ± 0.03 | 0.38 ± 0.03 | 3.39 ± 0.04 | 0.64 ± 0.08 |
| NDP* | 0.71 ± 0.03 | 0.30 ± 0.03 | 0.37 ± 0.03 | 3.39 ± 0.04 | 0.64 ± 0.08 |
| GNP | 0.70 ± 0.01 | 0.30 ± 0.01 | 0.47 ± 0.01 | 0.42 ± 0.01 | 0.10 ± 0.02 |
| CONVNP | 0.46 ± 0.01 | 0.67 ± 0.01 | 1.02 ± 0.01 | 1.20 ± 0.01 | 0.50 ± 0.02 |
| GENERALISATION |  |  |  |  |  |
| GP (OPTIMUM) | 0.70 ± 0.00 | 0.31 ± 0.00 | 0.32 ± 0.00 | - | - |
| T(1)-GEOMNDP | 0.70 ± 0.02 | 0.31 ± 0.02 | 0.38 ± 0.03 | 3.39 ± 0.03 | 0.62 ± 0.02 |
| NDP* | * | * | * | * | * |
| GNP | 0.69 ± 0.01 | 0.30 ± 0.01 | 0.47 ± 0.01 | 0.42 ± 0.01 | 0.10 ± 0.02 |
| CONVNP | 0.46 ± 0.01 | 0.67 ± 0.01 | 1.02 ± 0.01 | 1.19 ± 0.01 | 0.53 ± 0.02 |

Figure 5 (Left), tabulated:

| MODEL | SE | CURL-FREE | DIV-FREE |
|---|---|---|---|
| GP | 0.56 ± 0.00 | 0.66 ± 0.00 | 0.66 ± 0.00 |
| NDP* | 0.55 ± 0.00 | 0.62 ± 0.01 | 0.62 ± 0.01 |
| E(2)-GEOMNDP | 0.56 ± 0.01 | 0.65 ± 0.01 | 0.66 ± 0.01 |
| GP (DIAG.) | 1.56 ± 0.00 | 1.47 ± 0.00 | 1.47 ± 0.00 |
| T(2)-CONVCNP | 1.71 ± 0.01 | 1.77 ± 0.01 | 1.76 ± 0.00 |
| E(2)-STEERCNP | 1.61 ± 0.00 | 1.57 ± 0.00 | 1.57 ± 0.01 |

Figure 5: Quantitative results for experiments on GP vector fields. Mean predictive log-likelihood (↑) and confidence interval estimated over 5 random seeds. Left: comparison with neural processes; statistically significant results are in bold. Right: ablation study when varying the number of training data samples (log-likelihood against number of training samples for E(2)-GEOMNDP).

Non-white kernels for the limiting process. The NDP methods in the above experiment target the white kernel $\mathbb{1}(x = x')$ in the limiting process. In App. F.1.3, we explore different choices for the limiting kernel, such as SE and periodic kernels with short and long lengthscales, along with several score parameterisations; see App. B.3 for a description of these. We observe that although choosing such kernels gives a head start to the training, it eventually yields slightly worse performance. We attribute this to the additional complexity of learning a non-diagonal covariance. Finally, across all datasets and limiting kernels, we found the preconditioned score $K \nabla \log p_t$ to result in the best performance.

Conditional sampling ablation. We employ the SE dataset to investigate various configurations of the conditional sampler, as we have access to the ground-truth conditional distribution through the GP posterior. In Fig. 11 we compute the Kullback–Leibler divergence between the samples generated by GEOMNDP and the actual conditional distribution across different conditional sampling settings. Our results demonstrate the importance of performing multiple Langevin dynamics steps during the conditional sampling process. Additionally, we observe that the choice of noising scheme for the context values $y_c$ has relatively little impact on the overall outcome.

5.2 Regression over Gaussian process vector fields

We now turn our attention to modelling equivariant vector fields. For this, we create datasets using samples from a two-dimensional zero-mean GP with one of the following E(2)-equivariant kernels: a diagonal Squared-Exponential (SE) kernel, a zero-curl (CURL-FREE) kernel and a zero-divergence (DIV-FREE) kernel, as described in App. D.1. We equip our model, GeomNDP, with an E(2)-equivariant score architecture based on steerable CNNs (Thomas et al., 2018; Weiler and Cesa, 2021). We compare to NDP* with a non-equivariant attention-based network (Dutordoir et al., 2022). We also evaluate two neural processes, a translation-equivariant CONVCNP (Gordon et al., 2020) and a $C_4 \ltimes \mathbb{R}^2 \leq \mathrm{E}(2)$-equivariant STEERCNP (Holderrieth et al., 2021). We also report the performance of the data-generating GP, and of the same GP but with a diagonal posterior covariance, GP (DIAG.). We measure the predictive log-likelihood of the data process samples under the model on a held-out test dataset. We observe in Fig. 5 (Left) that the CNPs' performance is limited by their diagonal predictive covariance assumption, and as such they cannot do better than the GP (DIAG.).
We also see that although NDP* is able to fit GP posteriors well, it does not reach the maximum log-likelihood value attained by the data GP, in contrast to its equivariant counterpart GEOMNDP. To further explore the gains brought by the built-in equivariance, we examine data efficiency in Fig. 5 (Right), and notice that E(2)-GEOMNDP requires few data samples to fit the data process, since the dimension of the (quotiented) state space is effectively reduced dramatically.

5.3 Global tropical cyclone trajectory prediction

Finally, we assess our model on a task where the domain of the stochastic process is a non-Euclidean manifold. We model the trajectories of cyclones over the Earth, treated as sample paths of the form $\mathbb{R} \to \mathbb{S}^2$ coming from a stochastic process. The data is drawn from the International Best Track Archive for Climate Stewardship (IBTrACS) Project, Version 4 (Knapp et al., 2010; Knapp et al., 2018) and preprocessed as per App. F.3, where details on the implementation of the score function, the ODE/SDE solvers used for sampling, and the baseline methods can be found.

Figure 6: Left: 1000 samples from the training data. Right: 1000 samples from the trained model.

Figure 7: (a) Interpolation, (b) Extrapolation. Top: examples of conditional trajectories sampled from the GeomNDP model. Blue: conditioned sections of the trajectory. Green: the actual trajectory of the cyclone. Red: conditional samples from the model. Purple: closest matching trajectories in the dataset to the conditioning data.

Fig. 6 shows cyclone trajectory samples from the data process and from a trained GEOMNDP model. We also demonstrate how such trajectories can be interpolated or extrapolated using the conditional sampling method detailed in Sec. 3.3. Such conditional sample paths are shown in Fig. 7. Additionally, we report in Table 2 the likelihood and MSE for a series of methods. The interpolation task involves conditioning on the first and last 20% of the cyclone trajectory and predicting intermediary positions. The extrapolation task involves conditioning on the first 40% of trajectories and predicting future positions. We see that the GPs (modelled as $f : \mathbb{R} \to \mathbb{R}^2$, one on latitude/longitude coordinates, the other via a stereographic projection, using a diagonal RBF kernel with hyperparameters fitted with maximum likelihood) fail drastically given the high non-Gaussianity of the data. In the interpolation task, the NDP performs as well as the GEOMNDP, but the additional geometric structure of modelling the outputs as living on the sphere appears to significantly help for extrapolation. See App. F.3 for more fine-grained results.

6 Discussion

In this work, we have extended diffusion models to model invariant stochastic processes over tensor fields. We did so by (a) constructing a continuous noising process over function spaces which correlates input samples via an equivariant kernel, and (b) parameterising the score with an equivariant neural network. We have empirically demonstrated the ability of our introduced model GEOMNDP to fit complex stochastic processes, and by encoding the symmetry of the problem at hand, we show that it is more data efficient and better able to generalise. We highlight below some current limitations and important research directions. First, evaluating the model is slow as it relies on costly SDE or ODE solvers, as do existing diffusion models. Second, targeting a white noise process appears to outperform targeting other Gaussian processes.
In future work, we would like to investigate the practical influence of different kernels. Third, strict invariance may sometimes be too strong, we thus suggest softening it by amortising the score network over extra spatial information available from the problem at hand. Table 2: Comparative results of different models on the cyclone dataset, comparing test set likelihood, interpolation likelihood and mean squared error (MSE), and extrapolation likelihood and mean squared error. These are estimated over 5 random seeds. We only report likelihoods of models defined w.r.t the uniform measure on S2. MODEL TEST DATA INTERPOLATION EXTRAPOLATION Likelihood Likelihood MSE (km) Likelihood MSE (km) GEOMNDP (R S2) 802 5 535 4 162 6 536 4 496 14 STEREOGRAPHIC GP (R R2/{0}) 393 3 266 3 2619 13 245 2 6587 55 NDP (R R2) - - 166 22 - 769 48 GP (R R2) - - 6852 41 - 8138 87 Acknowledgements We are grateful to Paul Rosa for helping with the proof, and to José Miguel Hernández-Lobato for useful discussions. We thank the hydra (Yadan, 2019), jax (Bradbury et al., 2018) and geomstats (Miolane et al., 2020) teams, as our library is built on these great libraries. Richard E. Turner and Emile Mathieu are supported by an EPSRC Prosperity Partnership EP/T005386/1 between Microsoft Research and the University of Cambridge. Michael J. Hutchinson is supported by the EPSRC Centre for Doctoral Training in Modern Statistics and Statistical Machine Learning (EP/S023151/1). I. Azangulov, A. Smolensky, A. Terenin, and V. Borovitskiy. Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the Compact Case, Aug. 2022. DOI: 10.48550/ar Xiv.2208.14960. Cited on pages 2, 6. I. Azangulov, A. Smolensky, A. Terenin, and V. Borovitskiy. Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces II: non-compact symmetric spaces, Apr. 2023. DOI: 10.48550/ar Xiv.2301.13088. Cited on pages 2, 6. A. Bansal, E. Borgnia, H.-M. Chu, J. S. Li, H. Kazemi, F. Huang, M. Goldblum, J. Geiping, and T. Goldstein. Cold diffusion: inverting arbitrary image transforms without noise. ar Xiv preprint ar Xiv:2208.09392, 2022. Cited on page 6. M. Biloš, K. Rasul, A. Schneider, Y. Nevmyvaka, and S. Günnemann. Modeling Temporal Data as Continuous Functions with Process Diffusion, 2022. URL: https://openreview.net/forum? id=Vm JKUyp Q8w R. Cited on page 6. S. Bond-Taylor and C. G. Willcocks. Infinite-diff: infinite resolution diffusion with subsampled mollified states. ar Xiv preprint ar Xiv:2303.18242, 2023. Cited on pages 1, 6. J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. Vander Plas, S. Wanderman-Milne, and Q. Zhang. JAX: composable transformations of Python+Num Py programs, version 0.2.5, 2018. URL: http://github.com/google/jax. Cited on pages 10, 13. W. Bruinsma, S. Markou, J. Requeima, A. Y. K. Foong, A. Vaughan, T. Andersson, A. Buonomo, S. Hosking, and R. E. Turner. Autoregressive Conditional Neural Processes. In International Conference on Learning Representations, Feb. 2023. URL: https://openreview.net/forum? id=OAs XFPBf TBh. Cited on page 6. W. Bruinsma, J. Requeima, A. Y. K. Foong, J. Gordon, and R. E. Turner. Gaussian Neural Processes. In 3rd Symposium on Advances in Approximate Bayesian Inference, Feb. 2020. Cited on pages 7, 14, 16. P. Cattiaux, G. Conforti, I. Gentil, and C. Léonard. Time reversal of diffusion processes under a finite entropy condition. ar Xiv preprint ar Xiv:2104.07708, 2021. Cited on pages 2, 3, 5. S. Chen, S. 
Chewi, J. Li, Y. Li, A. Salim, and A. R. Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. ar Xiv preprint ar Xiv:2209.11215, 2022. Cited on page 11. T. Cohen. Equivariant convolutional networks. Ph D thesis, 2021. Cited on page 4. G. Da Prato and J. Zabczyk. Stochastic equations in infinite dimensions. Cambridge university press, 2014. Cited on page 6. A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. J. R. Stat. Soc. Ser. B. Stat. Methodol., 79(3):651 676, 2017. ISSN: 1369-7412. DOI: 10.1111/rssb.12183. URL: https://doi.org/10.1111/rssb.12183. Cited on page 13. G. Daras, M. Delbracio, H. Talebi, A. G. Dimakis, and P. Milanfar. Soft diffusion: score matching for general corruptions. ar Xiv preprint ar Xiv:2209.05442, 2022. Cited on page 6. V. De Bortoli. Convergence of denoising diffusion models under the manifold hypothesis. ar Xiv preprint ar Xiv:2208.05314, 2022. Cited on pages 10, 11. V. De Bortoli, E. Mathieu, M. Hutchinson, J. Thornton, Y. W. Teh, and A. Doucet. Riemannian score-based generative modeling. ar Xiv preprint ar Xiv:2202.02763, 2022. Cited on pages 1, 4. V. De Bortoli, J. Thornton, J. Heng, and A. Doucet. Diffusion Schrödinger bridge with applications to score-based generative modeling. In Advances in Neural Information Processing Systems, 2021. Cited on pages 5, 18. P. Dhariwal and A. Nichol. Diffusion models beat GAN on image synthesis. ar Xiv preprint ar Xiv:2105.05233, 2021. Cited on page 1. E. Dupont, H. Kim, S. Eslami, D. Rezende, and D. Rosenbaum. From data to functa: your data point is a function and you should treat it like one. ar Xiv preprint ar Xiv:2201.12204, 2022. Cited on page 6. A. Durmus and E. Moulines. High-dimensional Bayesian inference via the Unadjusted Langevin Algorithm. Ar Xiv e-prints, May 2016. ar Xiv: 1605.01559 [math.ST]. Cited on page 4. A. Durmus and É. Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551 1587, 2017. ISSN: 1050-5164. DOI: 10.1214/16-AAP1238. URL: https://doi.org/10.1214/16-AAP1238. Cited on pages 2, 13. V. Dutordoir, A. Saul, Z. Ghahramani, and F. Simpson. Neural Diffusion Processes, June 2022. DOI: 10.48550/ar Xiv.2206.03992. Cited on pages 1, 3, 6 8, 14, 16 18. N. Fishman, L. Klarner, V. D. Bortoli, E. Mathieu, and M. Hutchinson. Diffusion models for constrained domains, 2023. ar Xiv: 2304.05364 [cs.LG]. Cited on page 1. G. Franzese, S. Rossi, D. Rossi, M. Heinonen, M. Filippone, and P. Michiardi. Continuous-time functional diffusion processes. ar Xiv preprint ar Xiv:2303.00800, 2023. Cited on pages 1, 6. M. Garnelo, D. Rosenbaum, C. Maddison, T. Ramalho, D. Saxton, M. Shanahan, Y. W. Teh, D. Rezende, and S. M. A. Eslami. Conditional neural processes. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1704 1713. PMLR, July 2018. URL: https: //proceedings.mlr.press/v80/garnelo18a.html. Cited on page 6. M. Garnelo, J. Schwarz, D. Rosenbaum, F. Viola, D. J. Rezende, S. Eslami, and Y. W. Teh. Neural processes. ar Xiv preprint ar Xiv:1807.01622, 2018. Cited on pages 2, 6. M. Geiger and T. Smidt. E3nn: euclidean neural networks, 2022. DOI: 10.48550/ARXIV.2207. 09453. URL: https://arxiv.org/abs/2207.09453. Cited on page 16. J. Gordon, W. P. Bruinsma, A. Y. K. Foong, J. Requeima, Y. Dubois, and R. E. Turner. 
Convolutional conditional neural processes. In International Conference on Learning Representations, 2020. URL: https://openreview.net/forum?id=Skey4e BYPS. Cited on pages 2, 6 8, 17. F. Guth, S. Coste, V. De Bortoli, and S. Mallat. Wavelet score-based generative modeling. ar Xiv preprint ar Xiv:2208.05003, 2022. Cited on page 6. P. Hagemann, L. Ruthotto, G. Steidl, and N. T. Yang. Multilevel diffusion: infinite dimensional score-based diffusion models for image generation. ar Xiv preprint ar Xiv:2303.04772, 2023. Cited on page 6. C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant. Array programming with Num Py. Nature, 585(7825):357 362, Sept. 2020. Cited on page 13. U. G. Haussmann and E. Pardoux. Time reversal of diffusions. The Annals of Probability, 14(4):1188 1205, 1986. Cited on page 3. J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 2020. Cited on pages 1, 6. J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23:47 1, 2022. Cited on page 6. P. Holderrieth, M. J. Hutchinson, and Y. W. Teh. Equivariant learning of stochastic fields: gaussian processes and steerable conditional neural processes. In International Conference on Machine Learning, 2021. Cited on pages 2, 4 6, 8, 17. E. Hoogeboom and T. Salimans. Blurring diffusion models. ar Xiv preprint ar Xiv:2209.05557, 2022. Cited on page 6. C.-W. Huang, M. Aghajohari, J. Bose, P. Panangaden, and A. C. Courville. Riemannian diffusion models. Advances in Neural Information Processing Systems, 35:2750 2761, 2022. Cited on pages 1, 4. J. D. Hunter. Matplotlib: a 2d graphics environment. Computing in Science & Engineering, 9(3):90 95, 2007. Cited on page 13. M. Hutchinson, A. Terenin, V. Borovitskiy, S. Takao, Y. Teh, and M. Deisenroth. Vector-valued gaussian processes on riemannian manifolds via gauge independent projected kernels. Advances in Neural Information Processing Systems, 34:17160 17169, 2021. Cited on page 5. J. Ingraham, M. Baranov, Z. Costello, V. Frappier, A. Ismail, S. Tie, W. Wang, V. Xue, F. Obermeyer, A. Beam, and G. Grigoryan. Illuminating protein space with a programmable generative model. bio Rxiv, 2022. DOI: 10.1101/2022.12.01.518682. eprint: https://www.biorxiv.org/ content/early/2022/12/02/2022.12.01.518682.full.pdf. URL: https://www. biorxiv.org/content/early/2022/12/02/2022.12.01.518682. Cited on page 6. S. Jha, D. Gong, X. Wang, R. E. Turner, and L. Yao. The neural process family: survey, applications and perspectives. ar Xiv preprint ar Xiv:2209.00517, 2022. Cited on page 6. B. Jing, G. Corso, R. Berlinghieri, and T. Jaakkola. Subspace diffusion generative models. ar Xiv preprint ar Xiv:2205.01490, 2022. Cited on page 6. O. Kallenberg. Foundations of Modern Probability. Probability Theory and Stochastic Modelling. Springer, 2021. ISBN: 978-3-030-61872-8. URL: https://books.google.co.uk/books?id= 6h Jezg EACAAJ. Cited on page 2. T. Karras, M. Aittala, T. Aila, and S. Laine. Elucidating the Design Space of Diffusion-Based Generative Models, Oct. 2022. DOI: 10.48550/ar Xiv.2206.00364. Cited on page 3. G. Kerrigan, J. Ley, and P. Smyth. 
Diffusion Generative Models in Infinite Dimensions, 2022. DOI: 10.48550/ar Xiv.2212.00886. Cited on pages 1, 3, 6. H. Kim, A. Mnih, J. Schwarz, M. Garnelo, A. Eslami, D. Rosenbaum, O. Vinyals, and Y. W. Teh. Attentive neural processes. In International Conference on Learning Representations, 2019. URL: https://openreview.net/forum?id=Sk E6Pj C9KX. Cited on page 6. D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. ar Xiv:1412.6980 [cs], 2015. URL: http://arxiv.org/abs/1412.6980. Cited on page 17. H. J. Knapp Kenneth R. Diamond, J. P. Kossin, M. C. Kruk, and C. J. I. Schreck. International Best Track Archive for Climate Stewardship (IBTr ACS) Project, Version 4. Technical report, NOAA National Centers for Environmental Information, 2018. DOI: https://doi.org/10.25921/ 82ty-9e16. Cited on pages 8, 17. K. R. Knapp, M. C. Kruk, D. H. Levinson, H. J. Diamond, and C. J. Neumann. The international best track archive for climate stewardship (ibtracs): unifying tropical cyclone data. Bulletin of the American Meteorological Society, 91(3):363 376, 2010. DOI: https://doi.org/10.1175/ 2009BAMS2755.1. Cited on pages 8, 17. J. Köhler, L. Klein, and F. Noé. Equivariant Flows: exact likelihood generative learning for symmetric densities. ar Xiv:2006.02425 [physics, stat], June 2020. URL: http://arxiv.org/abs/2006. 02425. Cited on page 6. Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, volume 86 of number 11, pages 2278 2324, 1998. URL: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.42.7665. Cited on page 17. H. Lee, J. Lu, and Y. Tan. Convergence of score-based generative modeling for general data distributions. In International Conference on Algorithmic Learning Theory, pages 946 985. PMLR, 2023. Cited on page 11. J. H. Lim, N. B. Kovachki, R. Baptista, C. Beckham, K. Azizzadenesheli, J. Kossaifi, V. Voleti, J. Song, K. Kreis, J. Kautz, C. Pal, A. Vahdat, and A. Anandkumar. Score-based diffusion models in function space, 2023. DOI: 10.48550/ARXIV.2302.07400. Cited on page 1. J. H. Lim, N. B. Kovachki, R. Baptista, C. Beckham, K. Azizzadenesheli, J. Kossaifi, V. Voleti, J. Song, K. Kreis, J. Kautz, C. Pal, A. Vahdat, and A. Anandkumar. Score-based diffusion models in function space, 2023. Cited on pages 3, 6. A. Lou and S. Ermon. Reflected diffusion models. ar Xiv preprint ar Xiv:2304.04740, 2023. Cited on page 1. C. Louizos, X. Shi, K. Schutte, and M. Welling. The functional neural process. Advances in Neural Information Processing Systems, 32, 2019. Cited on page 6. A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and L. Van Gool. Repaint: inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461 11471, 2022. Cited on pages 5, 8 11. I. Macêdo and R. Castro. Learning Divergence-Free and Curl-Free Vector Fields with Matrix-Valued Kernels. Pré-Publicações / A.: Pré-publicações. IMPA, 2010. Cited on pages 4, 6. W. Mc Kinney et al. Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference, volume 445, pages 51 56. Austin, TX, 2010. Cited on page 13. N. Miolane, N. Guigui, A. L. Brigant, J. Mathe, B. Hou, Y. Thanwerdas, S. Heyder, O. Peltre, N. Koep, H. Zaatiti, H. Hajri, Y. Cabanes, T. Gerald, P. Chauchat, C. Shewmake, D. Brooks, B. Kainz, C. Donnat, S. Holmes, and X. Pennec. 
Geomstats: a python package for riemannian geometry in machine learning. Journal of Machine Learning Research, 21(223):1 9, 2020. URL: http://jmlr.org/papers/v21/19-027.html. Cited on page 10. B. Øksendal. Stochastic differential equations. In Stochastic differential equations, pages 65 84. Springer, 2003. Cited on page 4. G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. ar Xiv preprint ar Xiv:1912.02762, 2019. Cited on page 6. A. Phillips, T. Seror, M. Hutchinson, V. De Bortoli, A. Doucet, and E. Mathieu. Spectral Diffusion Processes, Nov. 2022. URL: http://arxiv.org/abs/2209.14125. Cited on pages 3, 6, 1, 4, 5, 8. J. Pidstrigach, Y. Marzouk, S. Reich, and S. Wang. Infinite-Dimensional Diffusion Models for Function Spaces, Feb. 2023. URL: http://arxiv.org/abs/2302.10130. Cited on pages 1, 3, 6. D. Pollard. A User s Guide to Measure Theoretic Probability. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2002. ISBN: 978-0-521-00289-9. URL: https://books.google.co.uk/books?id=B7Ch-c2G21MC. Cited on page 5. A. Rahimi and B. Recht. Random features for large-scale kernel machines. Advances in neural information processing systems, 20, 2007. Cited on page 2. C. E. Rasmussen. Gaussian processes in machine learning. In Summer school on machine learning, pages 63 71. Springer, 2003. Cited on pages 2, 6. S. Rissanen, M. Heinonen, and A. Solin. Generative Modelling With Inverse Heat Dissipation, July 2022. DOI: 10.48550/ar Xiv.2206.13397. Cited on page 6. G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341 363, 1996. ISSN: 1350-7265. DOI: 10.2307/3318418. URL: https://doi.org/10.2307/3318418. Cited on page 13. C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi. Image super-resolution via iterative refinement. ar Xiv preprint ar Xiv:2104.07636, 2021. Cited on page 6. S. Särkkä and A. Solin. Applied Stochastic Differential Equations. Cambridge University Press, 1st edition, 2019. ISBN: 978-1-108-18673-5 978-1-316-51008-7 978-1-316-64946-6. DOI: 10. 1017/9781108186735. Cited on page 1. L. Scott and J. Serre. Linear Representations of Finite Groups. Graduate Texts in Mathematics. Springer New York, 1996. ISBN: 978-0-387-90190-9. URL: https://books.google.co.uk/ books?id=NCf Zgr54TJ4C. Cited on page 3. G. Singh, J. Yoon, Y. Son, and S. Ahn. Sequential neural processes. Advances in Neural Information Processing Systems, 32, 2019. Cited on page 6. Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, 2019. Cited on pages 1, 5, 8. Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. Cited on pages 1, 2, 4 6, 8. D. W. Stroock and S. S. Varadhan. Multidimensional diffusion processes. Springer, 2007. Cited on pages 11, 13. N. Thomas, T. Smidt, S. Kearnes, L. Yang, L. Li, K. Kohlhoff, and P. Riley. Tensor field networks: Rotationand translation-equivariant neural networks for 3D point clouds. ar Xiv:1802.08219 [cs], May 2018. URL: http://arxiv.org/abs/1802.08219. Cited on pages 8, 16. M. Titsias. Variational learning of inducing variables in sparse gaussian processes. In D. van Dyk and M. 
Welling, editors, Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 567 574, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA. PMLR, 16 18 Apr 2009. URL: https: //proceedings.mlr.press/v5/titsias09a.html. Cited on page 2. B. L. Trippe, J. Yim, D. Tischer, T. Broderick, D. Baker, R. Barzilay, and T. Jaakkola. Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem, June 2022. DOI: 10.48550/ar Xiv.2206.04119. Cited on page 8. G. Van Rossum and F. L. Drake Jr. Python reference manual. Centrum voor Wiskunde en Informatica Amsterdam, 1995. Cited on page 13. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017. Cited on page 16. T. Vincent, L. Risser, and P. Ciuciu. Spatially adaptive mixture modeling for analysis of f MRI time series. IEEE Trans. Med. Imag., 29(4):1059 1074, Apr. 2010. ISSN: 0278-0062. DOI: 10.1109/ TMI.2010.2042064. Cited on page 4. P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, I. Polat, Y. Feng, E. W. Moore, J. Vander Plas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and Sci Py 1.0 Contributors. Sci Py 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261 272, 2020. Cited on page 13. V. Voleti, C. Pal, and A. M. Oberman. Score-based Denoising Diffusion with Non-Isotropic Gaussian Noise Models, 2022. URL: https://openreview.net/forum?id=ig C8c JKcb0Q. Cited on page 6. M. Weiler, P. Forré, E. Verlinde, M. Welling, et al. Coordinate independent convolutional networks: isometry and gauge equivariant convolutions on riemannian manifolds, 2021. Cited on pages 4, 5, 7. M. Weiler and G. Cesa. General $E(2)$-Equivariant Steerable CNNs. ar Xiv:1911.08251 [cs, eess], Apr. 2021. URL: http://arxiv.org/abs/1911.08251. Cited on pages 8, 17. M. Weiler, M. Geiger, M. Welling, W. Boomsma, and T. S. Cohen. 3D steerable CNNs: Learning rotationally equivariant features in volumetric data. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL: https://proceedings.neurips.cc/ paper/2018/file/488e4104520c6aab692863cc1dba45af-Paper.pdf. Cited on pages 16, 17. J. Wilson, V. Borovitskiy, A. Terenin, P. Mostowsky, and M. Deisenroth. Efficiently sampling functions from gaussian process posteriors. In ICML, pages 10292 10302. PMLR, 2020. Cited on page 2. O. Yadan. Hydra - a framework for elegantly configuring complex applications. Github, 2019. Cited on pages 10, 13. J. Yim, B. L. Trippe, V. De Bortoli, E. Mathieu, A. Doucet, R. Barzilay, and T. Jaakkola. Se (3) diffusion model with application to protein backbone generation. ar Xiv preprint ar Xiv:2302.02277, 2023. Cited on pages 4, 7. Supplementary to: Geometric Neural Diffusion Processes A Organisation of appendices In this supplementary, we first introduce in App. B an Ornstein Uhlenbeck process on function space (via finite marginals) along with several score approximations. Then in App. 
C, we show how this methodology extend to manifold-valued inputs or outputs. Later in App. D, we derive sufficient conditions for this introduced model to yield a group invariant process. What s more in App. E, we study some conditional sampling schemes. Eventually in App. F, we give a thorough description of experimental settings along with additional empirical results. B Ornstein Uhlenbeck on function space B.1 Multivariate Ornstein-Uhlenbeck process First, we aim to show that we can define a stochastic process on an infinite dimensional function space, by defining the joint finite marginals Y(x) as the solution of a multidimensional Ornstein-Uhlenbeck process. In particular, for any set of input x = (x1, , xk) X k, we define the joint marginal as the solution of the following SDE d Yt(x) = (m(x) Yt(x))/2 βtdt + p βt K(x, x)d Bt . (8) Proposition B.1. (Phillips et al., 2022) We assume we are given a data process (Y0(x))x X and we denote by G GP(0, k) a Gaussian process with zero mean and covariance. Then let s define 2 R t s=0 βsds Y0 + 1 e 1 2 R t s=0 βsds m + 1 e R t s=0 βsds 1/2 G. Then (Yt(x))x X is a stochastic process (by virtue of being a linear combination of stochastic processes). We thus have that Yt a.s. t 0 Y0 and Yt a.s. t Y with Y GP(m, k), so effectively (Yt(x))t R+,x X interpolates between the data process and this limiting Gaussian process. Additionally, L(Yt|Y0 = y0) = GP(mt, Kt) with mt = e 1 2 R t s=0 βsds y0 + 1 e 1 2 R t s=0 βsds m and Σt = 1 e R t s=0 βsds K. Furthermore, (Yt(x))t R+,x X is the solution of the SDE in (8). Proof. We aim to compute the mean and covariance of the process (Yt)t 0 described by the SDE (3). First let s recall the time evolution of the mean and covariance of the solution from a multivariate Ornstein-Uhlenbeck process given by d Yt = f(Yt, t)dt + L(Yt, t)d Bt. (9) We know that the time evolution of the mean and the covariance are given respectively by Särkkä and Solin (2019) dt = E[f(Yt, t)] (10) dt = E[f(Yt, t)(mt Yt) ] + E[(mt Yt)f(Yt, t) ] + E[L(Yt, t)L(Yt, t) ]. (11) Plugging in the drift f(Yt, t) = 1/2 (m Yt)βt and diffusion term L(Yt, t) = βt K from (3), we get dt = 1/2 (m Yt)βt (12) dt = βt [K Σt] . (13) Solving these two ODEs we get 2 R t s=0 βsdsm0 + 1 e 1 2 R t s=0 βsds m (14) Σt = K + e R t s=0 βsds (Σ0 K) (15) with m0 E[Y0] and Σ0 Cov[Y0]. Now let s compute the first two moments of (Yt(x))x X . We have E[Yt] = E h e 1 2 R t s=0 βsds Y0 + 1 e 1 2 R t s=0 βsds m + 1 e 1 2 R t s=0 βsds G i (16) 2 R t s=0 βsdsm0 + 1 e 1 2 R t s=0 βsds m (17) Cov[Yt] = Cov h e 1 2 R t s=0 βsds Y0 i + Cov 1 e R t s=0 βsds 1/2 G (19) = e R t s=0 βsds Σ0 + 1 e R t s=0 βsds K (20) = K + e R t s=0 βsds (Σ0 K) (21) = Σt . (22) B.2 Conditional score Hence, condition on Y0 the score is the gradient of the log Gaussian characterised by mean mt|0 = e 1 2 B(t)Y0 and Σt|0 = (1 e B(t))K with B(t) = R t 0 β(s)ds which can be derived from the above marginal mean and covariance with m0 = Y0 and Σ0 = 0. Yt log pt(Yt|Y0) = Yt log N Yt|mt|0, Σt|0 (23) = Yt 1/2(Yt mt|0) Σ 1 t|0(Yt mt|0) + c (24) = Σ 1 t|0(Yt mt|0) (25) = L t|0 L 1 t|0Lt|0ϵ (26) = L t|0 ϵ (27) where Lt|0 denotes the Cholesky decomposition of Σt|0 = Lt|0L t|0, and Yt = mt|0 + Lt|0ϵ. Then we can plugin our learnt (preconditioned) score into the backward SDE 4 which gives d Yt|x = (m(x) Yt)/2 + K(x, x) Yt log p T t(t, x, Yt) dt + p βt K(x, x)βtd Bt (28) B.3 Several score parametrisations In this section, we discuss several parametrisations of the neural network and the objective. 
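Before turning to these parametrisations, the short sketch below numerically checks the conditional-score identity of App. B.2, $\nabla \log p_t(Y_t \mid Y_0) = -\Sigma_{t|0}^{-1}(Y_t - m_{t|0}) = -L_{t|0}^{-\top}\varepsilon$ with $Y_t = m_{t|0} + L_{t|0}\varepsilon$; the kernel, schedule and inputs are illustrative.

```python
import jax.numpy as jnp
import jax.scipy.linalg as jsl
from jax import random

x = jnp.linspace(-1.0, 1.0, 6)
K = jnp.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.3**2) + 1e-6 * jnp.eye(6)
y0 = jnp.cos(2.0 * x)
B_t = 0.7                                        # B(t) = int_0^t beta(s) ds

m_t0 = jnp.exp(-0.5 * B_t) * y0                  # m_{t|0} (zero limiting mean m = 0)
Sigma_t0 = (1.0 - jnp.exp(-B_t)) * K             # Sigma_{t|0}
L_t0 = jnp.linalg.cholesky(Sigma_t0)

eps = random.normal(random.PRNGKey(5), (6,))
y_t = m_t0 + L_t0 @ eps

score_a = -jnp.linalg.solve(Sigma_t0, y_t - m_t0)            # -Sigma_{t|0}^{-1}(Y_t - m_{t|0}), Eq. (25)
score_b = -jsl.solve_triangular(L_t0.T, eps, lower=False)    # -L_{t|0}^{-T} eps, Eq. (27)
print(jnp.max(jnp.abs(score_a - score_b)))                   # ~ 0 up to floating-point error
```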
B.3 Several score parametrisations

In this section, we discuss several parametrisations of the neural network and of the training objective. For the sake of versatility, we employ the symbol D_θ for the network instead of s_θ as in the main text, as it allows us to approximate not only the score but also other quantities from which the score can be derived. In full generality, we use a residual connection, weighted by c_skip, c_out : R -> R, to parametrise the network
D_θ(t, Y_t) = c_skip(t) Y_t + c_out(t) F_θ(t, Y_t).   (29)
We recall that the inputs to the network are the time t and the noised vector Y_t = μ_{t|0} + n, where μ_{t|0} = e^{-B(t)/2} Y_0 and n ~ N(0, Σ_{t|0}) with Σ_{t|0} = (1 - e^{-B(t)}) K. The Gram matrix K corresponds to k(X, X) with k the limiting kernel. We denote by L_{t|0} and S the Cholesky factors of Σ_{t|0} = L_{t|0} L_{t|0}^T and K = S S^T, respectively, and write σ_{t|0}^2 = 1 - e^{-B(t)}, so that L_{t|0} = σ_{t|0} S. The denoising score matching loss weighted by Λ(t) is given by
L(θ) = E[ ||D_θ(t, Y_t) - ∇_{Y_t} log p_t(Y_t | Y_0)||^2_{Λ(t)} ].   (30)

Table 3: Summary of the different score parametrisations, along with the values of c_skip and c_out that we found to work best, based on the recommendation of Karras et al. (2022, Appendix B.6).

                 No precond.                   Precond. K                    Precond. S                    Predict Y_0
c_skip           0                             0                             0                             1
c_out            (σ_{t|0} + 10^{-3})^{-1}      (σ_{t|0} + 10^{-3})^{-1}      (σ_{t|0} + 10^{-3})^{-1}      1
Loss             ||σ_{t|0} S^T D_θ + z||_2^2   ||σ_{t|0} D_θ + S z||_2^2     ||σ_{t|0} D_θ + z||_2^2       ||D_θ - Y_0||_2^2
K ∇ log p_t      K D_θ                         D_θ                           S D_θ                         -K Σ_{t|0}^{-1}(Y_t - e^{-B(t)/2} D_θ)

No preconditioning. By reparametrisation, writing Y_t = μ_{t|0} + L_{t|0} z with z ~ N(0, I), the loss from Eq. (30) can be written as
L(θ) = E[ ||D_θ(t, Y_t) + Σ_{t|0}^{-1}(Y_t - μ_{t|0})||^2_{Λ(t)} ]   (31)
= E[ ||D_θ(t, Y_t) + Σ_{t|0}^{-1} L_{t|0} z||^2_{Λ(t)} ]   (32)
= E[ ||D_θ(t, Y_t) + L_{t|0}^{-T} z||^2_{Λ(t)} ].   (33)
Choosing Λ(t) = Σ_{t|0} = L_{t|0} L_{t|0}^T we obtain
L(θ) = E[ ||L_{t|0}^T D_θ(t, Y_t) + z||_2^2 ]   (35)
= E[ ||σ_{t|0} S^T D_θ(t, Y_t) + z||_2^2 ].   (36)

Preconditioning by K. Alternatively, one can train the neural network to approximate the preconditioned score D_θ ≈ K ∇_{Y_t} log p_t(Y_t | Y_0). The loss, weighted by Λ = σ_{t|0}^2 I, is then given by
L(θ) = E[ ||D_θ(t, Y_t) + K L_{t|0}^{-T} z||^2_{Λ(t)} ]   (37)
= E[ ||D_θ(t, Y_t) + σ_{t|0}^{-1} S z||^2_{Λ(t)} ]   (38)
= E[ ||σ_{t|0} D_θ(t, Y_t) + S z||_2^2 ].   (39)

Preconditioning by S. A variation of the previous one is to precondition the score by the transposed Cholesky factor of the limiting kernel Gram matrix, such that D_θ ≈ S^T ∇_{Y_t} log p_t(Y_t | Y_0). The loss, weighted by Λ = σ_{t|0}^2 I, becomes
L(θ) = E[ ||D_θ(t, Y_t) + S^T L_{t|0}^{-T} z||^2_{Λ(t)} ]   (40)
= E[ ||D_θ(t, Y_t) + σ_{t|0}^{-1} z||^2_{Λ(t)} ]   (41)
= E[ ||σ_{t|0} D_θ(t, Y_t) + z||_2^2 ].   (42)

Predicting Y_0. Finally, an alternative strategy is to predict Y_0 from its noised version Y_t. In this case, the loss takes the simple form L(θ) = E[ ||D_θ(t, Y_t) - Y_0||_2^2 ]. The score can be computed from the network's prediction following
∇ log p_t(Y_t | Y_0) = -Σ_{t|0}^{-1}(Y_t - μ_{t|0})   (43)
= -Σ_{t|0}^{-1}(Y_t - e^{-B(t)/2} Y_0)   (44)
≈ -Σ_{t|0}^{-1}(Y_t - e^{-B(t)/2} D_θ(t, Y_t)).   (45)
Table 3 summarises the different options for parametrising the score, as well as the values of c_skip and c_out that we found to work best, based on the recommendation of Karras et al. (2022, Appendix B.6). In practice, we found the preconditioning-by-K parametrisation to produce the best results, but we refer to App. F.1.3 for a more in-depth ablation study.

B.4 Exact (marginal) score in the Gaussian setting

When interpolating between Gaussian processes GP(m_0, Σ_0) and GP(m, K), the preconditioned marginal score is available in closed form:
K ∇_{Y_t} log p_t(Y_t) = -K Σ_t^{-1}(Y_t - m_t)   (47)
= -K [K + e^{-∫_0^t β_s ds}(Σ_0 - K)]^{-1}(Y_t - m_t)   (48)
= -K (L_t L_t^T)^{-1}(Y_t - m_t)   (49)
= -K L_t^{-T} L_t^{-1}(Y_t - m_t),   (50)
with Σ_t = K + e^{-∫_0^t β_s ds}(Σ_0 - K) = L_t L_t^T obtained via Cholesky decomposition.
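As an illustration of the parametrisations above, the following hedged sketch implements the "precondition by K" objective from Table 3, the option we found to work best in practice. The function `score_net` is a stand-in for the bi-dimensional attention network D_θ(t, x, y); the schedule and kernel are the same illustrative choices as in the previous sketch, not the exact training configuration.

```python
import jax
import jax.numpy as jnp

def B(t, beta_min=1e-4, beta_max=15.0):
    # Illustrative linear schedule, B(t) = int_0^t beta(s) ds.
    return beta_min * t + 0.5 * (beta_max - beta_min) * t**2

def k_preconditioned_loss(params, score_net, key, t, x, y0, kernel_fn, jitter=1e-6):
    n = y0.shape[0]
    K = kernel_fn(x, x) + jitter * jnp.eye(n)
    S = jnp.linalg.cholesky(K)                       # K = S S^T
    a = jnp.exp(-0.5 * B(t))
    sigma = jnp.sqrt(1.0 - a**2)                     # sigma_{t|0}
    z = jax.random.normal(key, (n,))
    yt = a * y0 + sigma * (S @ z)                    # Y_t = mu_{t|0} + L_{t|0} z, L_{t|0} = sigma S
    D = score_net(params, t, x, yt)                  # approximates K * grad log p_t(Y_t | Y_0)
    return jnp.mean((sigma * D + S @ z) ** 2)        # || sigma_{t|0} D_theta + S z ||^2, Eq. (39)
```

In expectation over t, z and the data draws, minimising this quantity recovers the weighted loss of Eq. (39); the other columns of Table 3 only change the target and the weighting.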
B.5 Langevin dynamics

Under mild assumptions on ∇ log p_{T-t} (Durmus and Moulines, 2016), the following SDE
dY_s = (1/2) K ∇ log p_{T-t}(Y_s) ds + K^{1/2} dB_s
admits a solution (Y_s)_{s >= 0} whose law L(Y_s) converges with geometric rate to p_{T-t}, for any invertible matrix K.

B.6 Likelihood evaluation

Similarly to Song et al. (2021), we can derive a deterministic process which has the same marginal density as the SDE (3), given by the following Ordinary Differential Equation (ODE), referred to as the probability flow ODE:
dY_t(x) = (1/2) { m(x) - Y_t(x) - K(x, x) ∇ log p_t(Y_t(x)) } β_t dt,   (53)
d log p_t(Y_t(x)) = -(1/2) div { m(x) - Y_t(x) - K(x, x) ∇ log p_t(Y_t(x)) } β_t dt.
Once the score network is learnt, we can thus use it in conjunction with an ODE solver to compute the likelihood of the model.
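The sketch below shows how the probability-flow ODE of App. B.6 can be used for likelihood evaluation, with the divergence estimated by Hutchinson's trick with Rademacher probes and a plain Euler discretisation. This is a simplified stand-in for the actual evaluation pipeline (which uses a second-order Heun solver, see App. F.2); `score_net`, `beta`, the kernel Gram matrix `K` and mean `m` are assumed to be defined as in the earlier sketches.

```python
import jax
import jax.numpy as jnp

def ode_log_likelihood(params, score_net, key, x, y0, m, K, beta,
                       T=1.0, eps=1e-3, n_steps=1000, n_probes=8):
    n = y0.shape[0]
    L_ref = jnp.linalg.cholesky(K + 1e-6 * jnp.eye(n))

    def drift(y, t):
        # Probability-flow drift: 1/2 beta_t [ m - y - K s_theta(t, x, y) ].
        return 0.5 * beta(t) * (m - y - K @ score_net(params, t, x, y))

    ts = jnp.linspace(eps, T, n_steps, endpoint=False)
    dt = (T - eps) / n_steps
    probes = jax.random.rademacher(key, (n_probes, n)).astype(y0.dtype)

    y, logdet = y0, 0.0
    for t in ts:
        # jvp gives (drift(y, t), J v); averaging v^T J v over probes estimates div f.
        f, jvps = jax.vmap(lambda v: jax.jvp(lambda u: drift(u, t), (y,), (v,)))(probes)
        div = jnp.mean(jnp.sum(probes * jvps, axis=-1))
        y = y + dt * f[0]
        logdet = logdet + dt * div

    # Reference log-density under the limiting GP N(m, K) at the terminal state.
    z = jnp.linalg.solve(L_ref, y - m)
    log_ref = (-0.5 * jnp.sum(z**2) - jnp.sum(jnp.log(jnp.diag(L_ref)))
               - 0.5 * n * jnp.log(2 * jnp.pi))
    return log_ref + logdet            # estimate of log p_0(y0)
```

The change-of-variables identity log p_0(y_0) = log p_T(y_T) + ∫_0^T div f dt underlies the accumulation of `logdet` above.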
B.7 Discussion: consistency

So far we have defined a generative model over functions via its finite marginals Y^θ_T(x). These finite marginals would arise from a stochastic process if, as per the Kolmogorov extension theorem (Øksendal, 2003), they satisfy exchangeability and consistency conditions. Exchangeability can be satisfied by parametrising the score network such that it is equivariant w.r.t. permutations, i.e. s_θ(t, σ * x, σ * y) = σ * s_θ(t, x, y) for any permutation σ in Σ_n. Additionally, the noising process (Y_t(x))_{x in X} is trivially consistent for any t >= 0, since it is a stochastic process as per Prop. B.1, and consequently so is the (true) time-reversal. Yet, when approximating the score s_θ ≈ ∇ log p_t, we lose consistency of the generative process Y^θ_t(x), as the corresponding constraint on the score network is non-trivial to satisfy. This is actually a very strong constraint on the model class, and as soon as one goes beyond linearity (of the posterior w.r.t. the context set), it is non-trivial to enforce without directly parameterising a stochastic process, e.g. as in Phillips et al. (2022). There thus seems to be a strong trade-off between satisfying consistency, and the model's ability to fit complex processes and scale to large datasets.

C Manifold-valued diffusion process

C.1 Manifold-valued inputs

In the main text we dealt with a simplified case of tensor fields, where the tensor fields are over Euclidean space. Nevertheless, it is certainly possible to apply our methods in the more general setting. Significant work has been done on performing convolutions on feature fields on generic manifolds (a superset of tensor fields on generic manifolds), core references being (Cohen, 2021) for the case of homogeneous spaces and (Weiler et al., 2021) for more general Riemannian manifolds. We recommend these as excellent mathematical introductions to the topic and build on them to describe how to formulate diffusion models over these spaces.

Tensor fields as sections of bundles. Formally, the fields we are interested in modelling are sections σ of associated tensor bundles of the principal G-bundle on a manifold M. We shall denote such a bundle BM and the space of sections Γ(BM). The goal, therefore, is to model random elements of this space of sections. For a clear understanding of this definition, please see Weiler et al. (2021, pages 73-95) for an introduction suitable for ML audiences. Prior work looking at this setting is (Hutchinson et al., 2021), which constructs Gaussian processes over tensor fields on manifolds.

Stochastic processes on spaces of sections. Given that we can see sections as maps σ : M -> BM, where an element of BM is a tuple (m, b), with m in the base manifold and b in the typical fibre, alongside the condition that the composition of the projection proj : (m, b) -> m with the section is the identity, proj ∘ σ = Id, it is clear that we can see distributions over sections as stochastic processes with index set the manifold M and output space a point in the bundle BM, with the projection condition satisfied. The projection onto finite marginals, i.e. a finite set of points in the manifold, is defined as π_{m_1,...,m_n}(σ) = (σ(m_1), ..., σ(m_n)).

Noising process. To define a noising process over these marginals, we can use the Gaussian processes over tensor fields defined in (Hutchinson et al., 2021). The convergence results of Phillips et al. (2022) still hold, and so using these Gaussian processes as noising processes on the marginals also defines a noising process on the whole section.

Reverse process. The results of (Cattiaux et al., 2021) are extremely general and continue to hold in this case of SDEs on the space of sections. Note that we do not actually need this to be the case: we can simply work with the reverse process on the marginals themselves, which are much simpler objects. It is good to know that it is a valid process on full sections, though, should one want to try and parameterise a score function on the whole section akin to some other infinite-dimension diffusion models.

Score function. The last thing to do, therefore, is to parameterise the score function on the marginals. If we were trying to parameterise the score function over the whole section at once (akin to a number of other works on infinite-dimension diffusions), this could present some problems in enforcing the smoothness of the score function. As we only deal with the score function on a finite set of marginals, however, we need not deal with this issue, and this presents a distinct advantage in simplicity for our approach. All we need to do is a) pick a way of numerically representing points on the manifold and b) pick a basis for the tangent space at each point on the manifold. This lets us represent elements of the tangent space numerically, and therefore also elements of the tensor space at each point. This done, we can feed these to a neural network to learn to output a numerical representation of the score in the same basis at each point.

C.2 Manifold-valued outputs

In the setting where one aims to model a stochastic process with manifold codomain, Y_t(x) = (Y_t(x_1), ..., Y_t(x_n)) in M^n, things are less trivial, as manifolds do not have the vector space structure necessary to define Gaussian processes. Fortunately, we can still target a known distribution independently on each marginal, since this is well defined, and as such revert to the Riemannian diffusion models introduced in De Bortoli et al. (2021): n independent Langevin noising processes
dY_t(x_k) = -(1/2) ∇U(Y_t(x_k)) β_t dt + sqrt(β_t) dB^M_t   (54)
are applied to each marginal. Hence, in the limit t -> inf, Y_t(x) has a density (assuming it exists) which factorises as dp/dVol_M(y(x_1), ..., y(x_n)) ∝ ∏_k e^{-U(y(x_k))}. For compact manifolds, we can target the uniform distribution by setting U(x) = 0. The reverse-time process will have correlations between the different marginals, and so the score function still needs to be a function of all the points in the marginal of interest.
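To make the per-marginal noising of App. C.2 concrete, here is a minimal sketch of a single noising step on the sphere S^2 with U = 0: ambient Gaussian noise is projected to the tangent space and the point is moved along the geodesic via the exponential map, in the spirit of De Bortoli et al. (2021). The step size and schedule are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def sphere_exp(y, v, eps=1e-12):
    # Exponential map on S^2 at y applied to tangent vector v.
    norm = jnp.sqrt(jnp.sum(v**2) + eps)
    return jnp.cos(norm) * y + jnp.sin(norm) * v / norm

def sphere_noising_step(key, y, t, dt, beta):
    # y: point on S^2 (unit 3-vector). Sample ambient noise, project to T_y S^2,
    # scale by sqrt(beta_t dt), and move along the geodesic (U = 0, so no drift).
    z = jax.random.normal(key, (3,))
    v = z - jnp.dot(z, y) * y                  # tangent-space projection
    return sphere_exp(y, jnp.sqrt(beta(t) * dt) * v)

key = jax.random.PRNGKey(0)
y = jnp.array([0.0, 0.0, 1.0])
beta = lambda t: 1e-4 + (15.0 - 1e-4) * t      # illustrative linear schedule
y_next = sphere_noising_step(key, y, t=0.1, dt=1e-3, beta=beta)
```

Iterating this geodesic random walk over each marginal independently converges to the uniform distribution on the sphere, the limiting measure used for the manifold-output experiments.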
D Invariant neural diffusion processes

D.1 E(n)-equivariant kernels

A kernel k : R^n x R^n -> R^{d x d} is equivariant if it satisfies the following constraints: (a) k is stationary, that is, for all x, x' in R^n,
k(x, x') = k(x - x'),   (55)
and (b) it satisfies the angular constraint, for any h in H,
k(h x, h x') = ρ(h) k(x, x') ρ(h)^T.   (56)
A trivial example of such an equivariant kernel is the diagonal kernel k(x, x') = k_0(x, x') I (Holderrieth et al., 2021), with k_0 stationary. This kernel can be understood as having d independent Gaussian processes with uni-dimensional output, that is, there is no inter-dimensional correlation. Less trivial examples are the E(n)-equivariant kernels proposed in Macêdo and Castro (2010), namely curl-free and divergence-free kernels, which allow one to model for instance electric or magnetic fields. Formally, we have k_curl = k_0 A and k_div = k_0 B, with k_0 stationary, e.g. the squared exponential kernel k_0(x, x') = σ^2 exp(-||x - x'||^2 / (2 ℓ^2)), and A and B given by
A(x, x') = I - (x - x')(x - x')^T / ℓ^2,
B(x, x') = (x - x')(x - x')^T / ℓ^2 + (n - 1 - ||x - x'||^2 / ℓ^2) I.
See Holderrieth et al. (2021, Appendix C) for a proof.

D.2 Proof of Prop. 3.2

Below we give two proofs of the group invariance of the generative process: one via the probability flow ODE and one directly via the reverse SDE.

Proof (Reverse ODE). The reverse probability flow associated with the forward SDE (3), with approximate score s_θ(t, ·) ≈ ∇ log p_t, is given by
dY_t = (1/2) [ Y_t - m(x) + K(x, x) s_θ(T - t, x, Y_t) ] dt   (59)
     =: b_ODE(t, x, Y_t) dt.   (60)
This ODE induces a flow φ^b_t : X^n x Y^n -> Y^n for a given integration time t, which is G-equivariant if the vector field b is itself G-equivariant, i.e. b(t, g · x, ρ(g) Y_t) = ρ(g) b(t, x, Y_t). We have, for any g in G,
b_ODE(t, g · x, ρ(g) Y_t) = (1/2) [ ρ(g) Y_t - m(g · x) + K(g · x, g · x) s_θ(T - t, g · x, ρ(g) Y_t) ]   (61)
=_(1) (1/2) [ ρ(g) Y_t - ρ(g) m(x) + ρ(g) K(x, x) ρ(g)^T s_θ(T - t, g · x, ρ(g) Y_t) ]   (62)
=_(2) (1/2) [ ρ(g) Y_t - ρ(g) m(x) + ρ(g) K(x, x) ρ(g)^T ρ(g) s_θ(T - t, x, Y_t) ]   (63)
=_(3) (1/2) ρ(g) [ Y_t - m(x) + K(x, x) s_θ(T - t, x, Y_t) ]   (64)
= ρ(g) b_ODE(t, x, Y_t),   (65)
with (1) following from the G-invariant prior GP conditions on m and k, (2) assuming that the score network is G-equivariant, and (3) assuming that ρ(g) in O(n), so that ρ(g)^T ρ(g) = I. To prove the opposite direction, we can simply follow these computations backwards. Finally, we know that the pushforward p_ref ∘ φ^{-1} of a G-invariant probability measure p_ref by a G-equivariant map φ is also G-invariant (Köhler et al., 2020; Papamakarios et al., 2019). Assuming a G-invariant prior GP and a G-equivariant score network, we thus have that the generative model from Sec. 3 defines marginals that are G-invariant.

Proof (Reverse SDE). The reverse SDE associated with the forward SDE (3), with approximate score s_θ(t, ·) ≈ ∇ log p_t, is given by
dY_t | x = [ (1/2)(m(x) - Y_t) + K(x, x) s_θ(T - t, x, Y_t) ] β_t dt + sqrt(β_t K(x, x)) dB_t   (66)
         =: b_SDE(t, x, Y_t) dt + Σ^{1/2}(t, x) dB_t.   (67)
As for the probability flow drift b_ODE, we have that b_SDE is similarly G-equivariant, that is, b_SDE(t, g · x, ρ(g) Y_t) = ρ(g) b_SDE(t, x, Y_t) for any g in G. Additionally, the diffusion matrix is also G-equivariant, since for any g in G we have Σ(t, g · x) = β_t K(g · x, g · x) = β_t ρ(g) K(x, x) ρ(g)^T = ρ(g) Σ(t, x) ρ(g)^T, as K is the Gram matrix of a G-equivariant kernel k. Additionally assuming that b_SDE and Σ are bounded, Yim et al. (2023, Proposition 3.6) yields that the distribution of Y_t is G-invariant for all t, and in particular so is L(Y_0).
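As a concrete illustration of the equivariant kernels of App. D.1 (used for the vector-field experiments of App. F.2), the sketch below evaluates the diagonal, curl-free and divergence-free matrix-valued kernels following the reconstruction above. Constants and normalisation are illustrative rather than the exact experimental settings.

```python
import jax.numpy as jnp

def matrix_valued_kernels(x, x2, lengthscale=5.0, variance=1.0):
    n = x.shape[-1]                                    # input/output dimension (here 2)
    d = x[:, None, :] - x2[None, :, :]                 # pairwise differences, (N, M, n)
    r2 = jnp.sum(d**2, axis=-1)                        # ||x - x'||^2
    k0 = variance * jnp.exp(-0.5 * r2 / lengthscale**2)
    outer = d[..., :, None] * d[..., None, :] / lengthscale**2          # (N, M, n, n)
    eye = jnp.eye(n)
    A = eye - outer                                                      # curl-free part
    B = outer + (n - 1 - r2 / lengthscale**2)[..., None, None] * eye     # div-free part
    k_curl = k0[..., None, None] * A
    k_div = k0[..., None, None] * B
    k_diag = k0[..., None, None] * eye                 # trivial diagonal kernel
    return k_curl, k_div, k_diag
```

Each output is an (N, M, n, n) block of the matrix-valued Gram tensor; flattening the blocks yields the K(x, x) matrices used by the noising process and the score preconditioning.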
D.3 Equivariant posterior maps

Theorem D.1 (An invariant prior stochastic process implies an equivariant posterior map). Using the language of Weiler et al. (2021), our tensor fields are sections of an associated vector bundle A of a manifold M with a G-structure. Let Isom_G M be the group of G-structure preserving isometries on M. The action of this group on a section f in Γ(A) is given by φ ▷ f := φ_{*,A} ∘ f ∘ φ^{-1} (Weiler et al., 2021). Let f ~ P, with P a distribution over the space of sections, and let φ ▷ P be the law of φ ▷ f. Let μ_x = L(f(x)) = π_x # P be the law of f evaluated at a point, where π_x is the canonical projection operator onto the marginal at x, # is the pushforward operator in the measure-theoretic sense, x in M, and y lies in the fibre of the associated bundle. Let μ^{x',y'}_x = L(f(x) | f(x') = y') = π_x # L(f | f(x') = y') be the conditional law of the process given f(x') = y'. Assume that the prior is invariant under the action of Isom_G M, i.e. that
φ ▷ μ_x = (φ_{*,A}) # μ_{φ^{-1}(x)} = μ_x.
Then the conditional measures are equivariant, in the sense that
φ ▷ μ^{x',y'}_x = (φ_{*,A}) # μ^{x',y'}_{φ^{-1}(x)} = μ^{φ(x'), φ_{*,A}(y')}_x = μ^{φ ▷ (x',y')}_x.

Proof. Let A, B be test functions and φ in Isom_G M. On the one hand,
E[B(f(x')) A((φ ▷ f)(x))] = E[B(f(x')) A(φ_{*,A} f(φ^{-1}(x)))]
= E[ B(f(x')) E[ A(φ_{*,A} f(φ^{-1}(x))) | f(x') ] ]
= E[ B(f(x')) ∫ A(y) (φ_{*,A}) # μ^{x', f(x')}_{φ^{-1}(x)}(dy) ]
= ∫ B(y') ∫ A(y) (φ_{*,A}) # μ^{x', y'}_{φ^{-1}(x)}(dy) μ_{x'}(dy')
= ∫ B(y') ∫ A(y) (φ ▷ μ^{x', y'}_x)(dy) μ_{x'}(dy').
On the other hand, by invariance of P this quantity is also equal to
E[ B((φ^{-1} ▷ f)(x')) A((φ ▷ φ^{-1} ▷ f)(x)) ] = E[ B((φ^{-1} ▷ f)(x')) E[ A(f(x)) | (φ^{-1} ▷ f)(x') ] ]
= E[ B(φ^{-1}_{*,A} f(φ(x'))) E[ A(f(x)) | f(φ(x')) ] ]
= E[ B(φ^{-1}_{*,A} f(φ(x'))) ∫ A(y) μ^{φ(x'), f(φ(x'))}_x(dy) ]
= ∫ B(y') ∫ A(y) μ^{φ ▷ (x', y')}_x(dy) (φ^{-1} ▷ μ_{x'})(dy').
Hence
(φ ▷ μ^{x', y'}_x)(dy) μ_{x'}(dy') = μ^{φ ▷ (x', y')}_x(dy) (φ^{-1} ▷ μ_{x'})(dy').
By the stated invariance φ^{-1} ▷ μ_{x'} = μ_{x'}, hence for μ_{x'}-almost every y',
φ ▷ μ^{x', y'}_x = μ^{φ ▷ (x', y')}_x,   (68)
as desired.

E Langevin corrector and the iterative procedure of REPAINT (Lugmayr et al., 2022)

E.1 Langevin sampling scheme

Several schemes exist for conditional sampling from diffusion models. Two different types of conditional sampling exist: those that sample conditionally on some part of the state space over which the diffusion model has been trained, such as in-painting or extrapolation tasks, and those that post-hoc attempt to condition on something outside the state space that the model has been trained on. The first category is the one we are interested in, and within it we have:
- Replacement sampling (Song et al., 2021), where the reverse ODE or SDE is evolved while fixing the conditioning data during the rollout. This method does produce visually coherent samples in some cases, but is not an exact conditional sampling method.
- SMC-based methods (Trippe et al., 2022), which are exact up to the particle filter assumption. These can produce good results but can suffer from the usual downsides of SMC methods on highly multi-modal data, such as particle diversity collapse.
- The RePaint scheme of (Lugmayr et al., 2022). While not originally proposed as an exact sampling scheme, we show later that this method is in fact a specific instantiation of our newly proposed method, and is therefore exact.
- Amortisation methods, e.g. Phillips et al. (2022). While they can be effective, these methods can never perform exact conditional sampling, by definition.
Our goal is to produce an exact sampling scheme that does not rely on SMC-based methods. Instead, we base our method on Langevin dynamics.
If we have a score function trained over the state space x = [x^c, x^*], where x^c are the points we wish to condition on and x^* the points we wish to sample, we exploit the following score decomposition:
∇_{x^*} log p(x^* | x^c) = ∇_{x^*} log p([x^*, x^c]) - ∇_{x^*} log p(x^c) = ∇_{x^*} log p(x).
If we have access to the score on the joint variables, we therefore have access to the conditional score by simply taking the gradient of the joint score with respect to the variables we are not conditioning on. Given we have learnt s_θ(t, x) ≈ ∇_x log p_t(x), we could use this to perform Langevin dynamics at t = ε, some time very close to 0. Similarly to (Song and Ermon, 2019), however, this raises the issue of how to initialise the dynamics, since a random initialisation will start the sampler in a region where the score has been badly learnt, producing slow and inaccurate sampling. Instead, we follow a scheme of tempered Langevin sampling, detailed in Alg. 1 below. Starting at t = T we sample an initialisation of y^* from the reference distribution. Progressing from t = T towards t = ε, we alternate between running a series of Langevin corrector steps to sample from the distribution p_{t,x}(y^*_t | y^c_t), and a single backwards SDE step to sample from p_x(y_{t-γ} | y_t) with step size γ. At each inner and outer step, we sample a noised version of the conditioning points y^c based on the forward SDE applied to these context points, p_{t,x^c}(y^c_t | y^c).

For the exactness of this scheme, all that matters is that at the end of the sampling scheme we are sampling from p_{x^*}(y^* | y^c) (up to the ε-away-from-zero clipping of the SDE). The rest of the scheme is designed to map the initial sample of y^* at t = T to a viable sample through regions where the score has been learnt well. Given that the noising scheme applied to the context points does not actually affect the theoretical exactness of the scheme, only the practical difficulty of staying near regions of well-learnt score, we could make a series of different choices for how to noise the context set at each step. The choices that present themselves are:
1. The initial scheme of sampling context noise from the SDE at every inner and outer step.
2. Only re-sampling the context noise at every outer step, and keeping it fixed for each inner step associated with that outer step.
3. Instead of sampling independent marginal noise at each outer step, sampling a single noising trajectory of the context set from the forward SDE and using this as the noise at each time.
4. Performing no noising at all: effectively the replacement method with added Langevin sampling.
These are illustrated in Fig. 8.

Figure 8: Comparison of different context noising schemes for the conditional sampling (fixed context set; SDE path noise at every outer step; fully independent noise at every outer step; fully independent noise at every inner step).

The main trade-off between the different schemes is the speed at which the noise can be sampled versus sample diversity. In the Euclidean case, we have a closed form for the evolution of the marginal density of the context points under the forward SDE; in this case, sampling the noise at a given time has O(1) cost. On the other hand, in some instances, such as noising SDEs on general manifolds, we have to simulate this noise by discretising the forward SDE; in this case it has O(n) cost, where n is the number of discretisation steps in the SDE. For N outer steps and I inner steps, the complexity of the different noising schemes is compared in Table 4. Note that the conditional sampling scheme, other than the noise sampling, has O(NI) complexity.

Table 4: Comparison of the complexity of different noise sampling schemes for the context set.
Scheme                                  Closed-form noise   Simulated noise
Re-sample noise at every inner step     O(NI)               O(N²I²)
Re-sample noise at every outer step     O(N)                O(N²)
Sampling an SDE path on the context     O(N)                O(N)
No noise applied                        -                   -

Algorithm 1: Conditional sampling with Langevin dynamics.
Require: score network s_θ(t, x, y), conditioning points (x^c, y^c), query locations x^*
  x̃ = [x^c, x^*]                                             # augmented input set
  y^*_T ~ N(m(x^*), k(x^*, x^*))                              # sample initial noise
  for t in {T, T - γ, ..., ε} do
    y^c_t ~ p_{t,x^c}(y^c_t | y^c_0)                          # noise context outputs
    Z ~ N(0, I_d)                                             # sample tangent noise
    [_, y^*_{t-γ}] = [y^c_t, y^*_t] + γ { (1/2)(m(x̃) - [y^c_t, y^*_t]) + K(x̃, x̃) s_θ(t, x̃, [y^c_t, y^*_t]) } + sqrt(γ) K(x̃, x̃)^{1/2} Z   # Euler-Maruyama step
    for l in {1, ..., L} do
      y^c_{t-γ} ~ p_{t-γ,x^c}(y^c_{t-γ} | y^c_0)              # noise context outputs
      Z ~ N(0, I_d)                                           # sample tangent noise
      [_, y^*_{t-γ}] = [y^c_{t-γ}, y^*_{t-γ}] + (γ/2) K(x̃, x̃) s_θ(t - γ, x̃, [y^c_{t-γ}, y^*_{t-γ}]) + sqrt(γ) K(x̃, x̃)^{1/2} Z   # Langevin step
  return y^*_ε
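The following condensed sketch mirrors the structure of Alg. 1 in JAX. For clarity it uses a fixed step size, re-noises the context at every inner step, and omits the β_t weighting and tuning parameters used in practice; `score_net`, the kernel Gram matrix `K` and mean `m` over the augmented inputs, and the closed-form context noising `noise_context` are assumed given, and the names are ours.

```python
import jax
import jax.numpy as jnp

def conditional_sample(params, score_net, key, x_ctx, y_ctx, x_query, K, m,
                       noise_context, T=1.0, n_outer=1000, n_inner=5):
    x_aug = jnp.concatenate([x_ctx, x_query], axis=0)
    n_c, n_q = x_ctx.shape[0], x_query.shape[0]
    L_K = jnp.linalg.cholesky(K + 1e-6 * jnp.eye(n_c + n_q))
    gamma = T / n_outer

    key, sub = jax.random.split(key)
    y_query = m[n_c:] + L_K[n_c:, n_c:] @ jax.random.normal(sub, (n_q,))   # initial noise

    for t in jnp.linspace(T, gamma, n_outer):
        # One reverse (Euler-Maruyama) step, keeping only the query block.
        key, k1, k2 = jax.random.split(key, 3)
        y_aug = jnp.concatenate([noise_context(k1, t, y_ctx), y_query])
        drift = 0.5 * (m - y_aug) + K @ score_net(params, t, x_aug, y_aug)
        y_aug = y_aug + gamma * drift + jnp.sqrt(gamma) * (L_K @ jax.random.normal(k2, y_aug.shape))
        y_query = y_aug[n_c:]
        # Inner Langevin corrector steps at time t - gamma.
        for _ in range(n_inner):
            key, k1, k2 = jax.random.split(key, 3)
            y_aug = jnp.concatenate([noise_context(k1, t - gamma, y_ctx), y_query])
            y_aug = (y_aug + 0.5 * gamma * (K @ score_net(params, t - gamma, x_aug, y_aug))
                     + jnp.sqrt(gamma) * (L_K @ jax.random.normal(k2, y_aug.shape)))
            y_query = y_aug[n_c:]
    return y_query
```

Only the query block is ever updated; the context block is overwritten at every step with a freshly noised copy of the observed context values, which corresponds to choice 1 in the list above.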
E.2 Correspondence with REPAINT (Lugmayr et al., 2022)

In this section, we show that:
(a) Alg. 1 and Alg. 2 (RePaint, from Lugmayr et al. (2022)) are equivalent in a specific setting.
(b) There exists a continuous limit (SDE) for both procedures. This SDE targets a probability density which does not correspond to p(x_{t_0} | x^c_0).
(c) When t_0 -> 0 this probability measure converges to p(x_0 | x^c_0), which ensures the correctness of the proposed sampling scheme.
We begin by recalling the conditional sampling algorithms we study in Alg. 1 and Alg. 2.

Algorithm 2: REPAINT (Lugmayr et al., 2022).
Require: score network s_θ(t, x, y), conditioning points (x^c, y^c), query locations x^*
  x̃ = [x^c, x^*]                                             # augmented input set
  [y^c_T, y^*_T] ~ N(m(x̃), k(x̃, x̃))                         # sample initial noise
  for t in {T, T - γ, ..., ε} do
    ỹ^*_t = y^*_t
    for l in {1, ..., L} do
      y^c_t ~ N(m_t(x^c; y^c), k_t(x^c, x^c; y^c))            # noise context outputs
      Z ~ N(0, I_d)                                           # sample tangent noise
      [_, ỹ^*_{t-γ}] = [y^c_t, ỹ^*_t] + γ { (1/2)(m(x̃) - [y^c_t, ỹ^*_t]) + K(x̃, x̃) s_θ(t, x̃, [y^c_t, ỹ^*_t]) } + sqrt(γ) K(x̃, x̃)^{1/2} Z   # reverse step
      Z ~ N(0, I_d)                                           # sample tangent noise
      ỹ^*_t = ỹ^*_{t-γ} + γ (1/2)(m(x^*) - ỹ^*_{t-γ}) + sqrt(γ) K(x^*, x^*)^{1/2} Z   # forward step
    y^*_{t-γ} = ỹ^*_{t-γ}
  return y^*_ε

First, we describe the RePaint algorithm (Lugmayr et al., 2022). We consider (Z^0_k, Z^1_k, Z^2_k)_{k in N} a sequence of independent Gaussian random variables such that, for any k in N, Z^1_k and Z^2_k are d-dimensional Gaussian random variables with zero mean and identity covariance matrix, and Z^0_k is a p-dimensional Gaussian random variable with zero mean and identity covariance matrix. We assume that the whole sequence to be inferred is of size d while the context is of size p. For simplicity, we only consider the Euclidean setting with K = Id. The proofs can be adapted to cover the case K ≠ Id without loss of generality. Let us fix a time t_0 in [0, T]. We consider the chain (X_k)_{k in N} given by X_0 in R^d and, for any k in N,
X_{k+1/2} = e^γ X_k + 2(e^γ - 1) ∇_{X_k} log p_{t_0}([X_k, X^c_k]) + (e^{2γ} - 1)^{1/2} Z^1_k,   (69)
where X^c_k = e^{-t_0} X^c_0 + (1 - e^{-2 t_0})^{1/2} Z^0_k. Finally, we consider
X_{k+1} = e^{-γ} X_{k+1/2} + (1 - e^{-2γ})^{1/2} Z^2_k.   (70)
Note that (69) corresponds to one step of backward SDE integration and (70) corresponds to one step of forward SDE integration. In both cases we have used the exponential integrator, see (De Bortoli, 2022) for instance. While we use the exponential integrator in the proofs for simplicity, other integrators such as the classical Euler-Maruyama scheme could have been used.
Combining (69) and (70), we get that for any k in N we have
X_{k+1} = X_k + 2(1 - e^{-γ}) ∇_{X_k} log p_{t_0}([X_k, X^c_k]) + (1 - e^{-2γ})^{1/2}(Z^1_k + Z^2_k).   (71)
Remarking that (Z_k)_{k in N} = ((Z^1_k + Z^2_k)/sqrt(2))_{k in N} is a family of d-dimensional Gaussian random variables with zero mean and identity covariance matrix, we get that for any k in N,
X_{k+1} = X_k + 2(1 - e^{-γ}) ∇_{X_k} log p_{t_0}([X_k, X^c_k]) + sqrt(2) (1 - e^{-2γ})^{1/2} Z_k,   (72)
where we recall that X^c_k = e^{-t_0} X^c_0 + (1 - e^{-2 t_0})^{1/2} Z^0_k. Note that the process (72) is another version of the RePaint algorithm (Lugmayr et al., 2022), where we have concatenated the denoising and noising procedures. With this formulation, it is clear that RePaint is equivalent to Alg. 1. In what follows, we identify the limiting SDE of this process and describe the limiting behaviour of (72) under mild assumptions on the target distribution. For any x_{t_0} in R^d, we denote
b(x_{t_0}) = 2 ∫_{R^p} ∇_{x_{t_0}} log p_{t_0}([x_{t_0}, x^c_{t_0}]) p_{t_0|0}(x^c_{t_0} | x^c_0) dx^c_{t_0}.   (73)
We emphasize that b/2 ≠ ∇_{x_{t_0}} log p_{t_0}(· | x^c_0). In particular, using Tweedie's identity, we have that for any x_{t_0} in R^d,
∇ log p_{t_0}(x_{t_0} | x^c_0) = ∫_{R^p} ∇_{x_{t_0}} log p_{t_0}([x_{t_0}, x^c_{t_0}] | x^c_0) p(x^c_{t_0} | x_{t_0}, x^c_0) dx^c_{t_0}.   (74)
We introduce the following assumption.

Assumption 1. There exist L, C >= 0 and m > 0 such that for any x^c_{t_0}, y^c_t in R^p and x_{t_0}, y_t in R^d,
||∇ log p_{t_0}([x_{t_0}, x^c_{t_0}]) - ∇ log p_{t_0}([y_t, y^c_t])|| <= L(||x_{t_0} - y_t|| + ||x^c_{t_0} - y^c_t||).   (75)

Assumption 1 ensures that there exists a unique strong solution to the SDE associated with (72). Note that conditions under which ∇ log p_{t_0} is Lipschitz are studied in De Bortoli (2022). In the theoretical literature on diffusion models this Lipschitzness assumption is classical, see Lee et al. (2023) and Chen et al. (2022).
We denote by ((X^γ_t)_{t >= 0})_{γ > 0} the family of processes such that, for any k in N and γ > 0 and any t in [kγ, (k+1)γ), X^γ_t = (1 - (t - kγ)/γ) X^γ_{kγ} + ((t - kγ)/γ) X^γ_{(k+1)γ} and
X^γ_{(k+1)γ} = X^γ_{kγ} + 2(1 - e^{-γ}) ∇_{X^γ_{kγ}} log p_{t_0}([X^γ_{kγ}, X^{c,γ}_{kγ}]) + sqrt(2) (1 - e^{-2γ})^{1/2} Z^γ_{kγ},   (76)
where (Z^γ_{kγ})_{k in N, γ > 0} is a family of independent d-dimensional Gaussian random variables with zero mean and identity covariance matrix and, for any k in N and γ > 0, X^{c,γ}_{kγ} = e^{-t_0} x^c_0 + (1 - e^{-2 t_0})^{1/2} Z^{0,γ}_{kγ}, where (Z^{0,γ}_{kγ})_{k in N, γ > 0} is a family of independent p-dimensional Gaussian random variables with zero mean and identity covariance matrix. This is a linear interpolation of the RePaint algorithm in the form of (72). Finally, we denote by (X_t)_{t >= 0} the solution of
dX_t = b(X_t) dt + 2 dB_t,   X_0 = x_0.   (77)
We recall that b depends on t_0, but t_0 is fixed here. This means that we are at time t_0 in the diffusion and consider a corrector at this stage; the variable t does not correspond to the backward evolution but to the forward evolution in the corrector stage. Under Assumption 1, (77) admits a unique strong solution. The rest of this section is dedicated to the proof of the following result.

Theorem E.1. Assume Assumption 1. Then lim_{n -> +inf} (X^{1/n}_t)_{t >= 0} = (X_t)_{t >= 0}.

This result is an application of Stroock and Varadhan (2007, Theorem 11.2.3). It makes explicit the continuous limit of the RePaint algorithm (Lugmayr et al., 2022). In what follows, we verify that the assumptions of this result hold in our setting. For any γ > 0 and x in R^d, we define
b^γ(x) = (2/γ)(1 - e^{-γ}) ∫_{R^p} ∇_{x_{t_0}} log p_{t_0}([x, x^c_{t_0}]) p_{t_0|0}(x^c_{t_0} | x^c_0) dx^c_{t_0}   (78)
       - (1/γ) E[(X^γ_{(k+1)γ} - X^γ_{kγ}) 1_{||X^γ_{(k+1)γ} - X^γ_{kγ}|| >= 1} | X^γ_{kγ} = x],   (79)
Σ^γ(x) = (4/γ)(1 - e^{-γ})^2 ∫_{R^p} ∇_{x_{t_0}} log p_{t_0}([x, x^c_{t_0}])^{⊗2} p_{t_0|0}(x^c_{t_0} | x^c_0) dx^c_{t_0} + (2/γ)(1 - e^{-2γ}) Id   (80)
       - (1/γ) E[(X^γ_{(k+1)γ} - X^γ_{kγ})^{⊗2} 1_{||X^γ_{(k+1)γ} - X^γ_{kγ}|| >= 1} | X^γ_{kγ} = x].   (81)
Note that for any γ > 0 and x in R^d, we have
b^γ(x) = (1/γ) E[ 1_{||X^γ_{(k+1)γ} - X^γ_{kγ}|| <= 1} (X^γ_{(k+1)γ} - X^γ_{kγ}) | X^γ_{kγ} = x ],   (82)
Σ^γ(x) = (1/γ) E[ 1_{||X^γ_{(k+1)γ} - X^γ_{kγ}|| <= 1} (X^γ_{(k+1)γ} - X^γ_{kγ})^{⊗2} | X^γ_{kγ} = x ].   (83)

Lemma E.2. Assume Assumption 1. Then we have that for any R, ε > 0 and γ in (0, 1),
lim_{γ -> 0} sup{ ||Σ^γ(x) - Σ(x)|| : x in R^d, ||x|| <= R } = 0,   (85)
lim_{γ -> 0} sup{ ||b^γ(x) - b(x)|| : x in R^d, ||x|| <= R } = 0,   (86)
lim_{γ -> 0} (1/γ) sup{ P(||X^γ_{(k+1)γ} - X^γ_{kγ}|| >= ε | X^γ_{kγ} = x) : x in R^d, ||x|| <= R } = 0,   (87)
where we recall that for any x in R^d,
b(x) = 2 ∫_{R^p} ∇_{x_{t_0}} log p_{t_0}([x, x^c_{t_0}]) p_{t_0|0}(x^c_{t_0} | x^c_0) dx^c_{t_0},   Σ(x) = 4 Id.   (88)

Proof. Let R, ε > 0 and γ in (0, 1). Using Assumption 1, there exists C >= 0 such that for any x_{t_0} in R^d with ||x_{t_0}|| <= R, we have ||∇_{x_{t_0}} log p_{t_0}([x_{t_0}, x^c_{t_0}])|| <= C(1 + ||x^c_{t_0}||). Since p^c_{t_0|0} is Gaussian with covariance matrix (1 - e^{-2 t_0}) Id, we get that for any p in N there exists A_p >= 0 such that for any x_{t_0} in R^d with ||x_{t_0}|| <= R,
∫_{R^p} ||∇_{x_{t_0}} log p_{t_0}([x_{t_0}, x^c_{t_0}])||^p p^c_{t_0|0}(x^c_{t_0} | x^c_0) dx^c_{t_0} <= A_p (1 + ||x^c_0||^p).   (89)
Therefore, using this result and the fact that for any s >= 0, 1 - e^{-s} <= s, we get that there exists B_p >= 0 such that for any k, p in N and any x in R^d with ||x|| <= R,
E[ ||X^γ_{(k+1)γ} - X^γ_{kγ}||^p | X^γ_{kγ} = x ] <= B_p γ^{p/2} (1 + ||x^c_0||^p).   (90)
Therefore, combining this result and the Markov inequality, we get that
lim_{γ -> 0} (1/γ) sup{ P(||X^γ_{(k+1)γ} - X^γ_{kγ}|| >= ε | X^γ_{kγ} = x) : ||x|| <= R } = 0,   (91)
lim_{γ -> 0} (1/γ) sup{ ||E[(X^γ_{(k+1)γ} - X^γ_{kγ}) 1_{||X^γ_{(k+1)γ} - X^γ_{kγ}|| >= 1} | X^γ_{kγ} = x]|| : ||x|| <= R } = 0,   (92)
lim_{γ -> 0} (1/γ) sup{ ||E[(X^γ_{(k+1)γ} - X^γ_{kγ})^{⊗2} 1_{||X^γ_{(k+1)γ} - X^γ_{kγ}|| >= 1} | X^γ_{kγ} = x]|| : ||x|| <= R } = 0.   (93)
In addition, we have that for any x in R^d with ||x|| <= R,
| (2/γ)(1 - e^{-γ}) - 2 | || ∫_{R^p} ∇_{x_{t_0}} log p_{t_0}([x, x^c_{t_0}]) p_{t_0|0}(x^c_{t_0} | x^c_0) dx^c_{t_0} ||   (94)
  <= A_1 (1 + ||x^c_0||) (2/γ) | e^{-γ} - 1 + γ |.   (95)
We also have that for any x in R^d with ||x|| <= R,
(4/γ) |1 - e^{-γ}|^2 || ∫_{R^p} ∇_{x_{t_0}} log p_{t_0}([x, x^c_{t_0}])^{⊗2} p_{t_0|0}(x^c_{t_0} | x^c_0) dx^c_{t_0} ||   (96)
  <= A_2 (1 + ||x^c_0||^2) (4/γ) |1 - e^{-γ}|^2.   (97)
Combining these results, (91)-(93), the fact that lim_{γ -> 0} (4/γ)|1 - e^{-γ}|^2 = 0 and lim_{γ -> 0} (2/γ)|e^{-γ} - 1 + γ| = 0, we get that lim_{γ -> 0} sup{ ||Σ^γ(x) - Σ(x)|| : ||x|| <= R } = 0. Similarly, using (91)-(93) and (94), we get that lim_{γ -> 0} sup{ ||b^γ(x) - b(x)|| : ||x|| <= R } = 0.

We can now conclude the proof of Theorem E.1.

Proof. The maps x -> b(x) and x -> Σ(x) are continuous. Combining this with Lemma E.2, we conclude the proof upon applying Stroock and Varadhan (2007, Theorem 11.2.3).

Theorem E.1 is a non-quantitative result which identifies the limiting chain of the REPAINT procedure. Note that if we do not resample, we get instead
b_cond(x) = 2 ∇_{x_{t_0}} log p_{t_0}([x_{t_0}, x^c_{t_0}]),   Σ(x) = 4 Id.   (98)
Recalling (88), b is an amortised version of b_cond, and similar convergence results can be derived in this case. Note that it is also possible to obtain quantitative discretisation bounds between (X_t)_{t >= 0} and (X^{1/n}_t)_{t >= 0} under the ℓ2 distance. These bounds are usually obtained by leveraging the Girsanov theorem (Durmus and Moulines, 2017; Dalalyan, 2017). We leave the study of such bounds for future work. We also remark that b(x_{t_0}) is not given by ∇ log p_{t_0}(x_{t_0} | x^c_0). Denoting by U_{t_0} the potential such that for any x_{t_0} in R^d,
U_{t_0}(x_{t_0}) = -∫_{R^p} (log p_{t_0}(x_{t_0} | x^c_{t_0})) p_{t_0|0}(x^c_{t_0} | x^c_0) dx^c_{t_0},   (99)
we have that -2 ∇U_{t_0}(x_{t_0}) = b(x_{t_0}), under mild integration assumptions. In addition, using Jensen's inequality, we have
∫_{R^d} exp[-U_{t_0}(x_{t_0})] dx_{t_0} <= ∫_{R^d} ∫_{R^p} p_{t_0}(x_{t_0} | x^c_{t_0}) p_{t_0|0}(x^c_{t_0} | x^c_0) dx^c_{t_0} dx_{t_0} = 1.   (100)
Hence π_{t_0}, with density proportional to x -> exp[-U_{t_0}(x)], defines a valid probability measure. We make the following assumption, which allows us to control the ergodicity of the process (X_t)_{t >= 0}.

Assumption 2.
There exist m > 0 and C >= 0 such that for any x_t in R^d and x^c_t in R^p,
< ∇_{x_t} log p_{t_0}([x_t, x^c_t]), x_t > <= -m ||x_t||^2 + C(1 + ||x^c_t||^2).   (101)

The following proposition ensures the ergodicity of the chain (X_t)_{t >= 0}. It is a direct application of Roberts and Tweedie (1996, Theorem 2.1).

Proposition E.3. Assume Assumption 1 and Assumption 2. Then π_{t_0} is the unique invariant probability measure of (X_t)_{t >= 0} and lim_{t -> inf} ||L(X_t) - π_{t_0}||_{TV} = 0, where L(X_t) is the distribution of X_t.

Finally, for any t_0 > 0, denote by π_{t_0} the probability measure with density proportional to exp[-U_{t_0}], where U_{t_0} is given for any x_{t_0} in R^d by
U_{t_0}(x_{t_0}) = -∫_{R^p} (log p_{t_0}(x_{t_0} | x^c_{t_0})) p_{t_0|0}(x^c_{t_0} | x^c_0) dx^c_{t_0}.   (102)
We show that the family of measures (π_{t_0})_{t_0 > 0} approximates the posterior with density x_0 -> p(x_0 | x^c_0) when t_0 is small enough.

Proposition E.4. Assume Assumption 1. We have that lim_{t_0 -> 0} π_{t_0} = π_0, where π_0 admits a density w.r.t. the Lebesgue measure given by x_0 -> p(x_0 | x^c_0).

Proof. This is a direct consequence of the fact that p_{t_0|0}(· | x^c_0) -> δ_{x^c_0} as t_0 -> 0.

This last result shows that even though we do not target x_{t_0} -> p_{t_0|0}(x_{t_0} | x^c_0) with this corrector, we still target p(x_0 | x^c_0) as t_0 -> 0, which corresponds to the desired output of the algorithm.

F Experimental details

Models, training and evaluation have been implemented in JAX (Bradbury et al., 2018). We used Python (Van Rossum and Drake Jr, 1995) for all programming, along with Hydra (Yadan, 2019), NumPy (Harris et al., 2020), SciPy (Virtanen et al., 2020), Matplotlib (Hunter, 2007), and Pandas (McKinney et al., 2010). The code is publicly available at https://github.com/cambridge-mlg/neural_diffusion_processes.

F.1 Regression 1D

F.1.1 Data generation

We follow the same experimental setup as Bruinsma et al. (2020) to generate the 1D synthetic data. It consists of Gaussian (Squared Exponential (SE), MATÉRN(5/2), WEAKLY PERIODIC) and non-Gaussian (SAWTOOTH and MIXTURE) sample paths, where MIXTURE is a combination of the other four datasets with equal weight. The Gaussian datasets are corrupted with observation noise with variance σ^2 = 0.05^2. The left column of Fig. 9 shows example sample paths for each of the five datasets. The training data consists of 2^14 sample paths while the test dataset has 2^12 paths. For each test path we sample the number of context points between 1 and 10; the number of target points is fixed to 50 for the GP datasets and 100 for the non-Gaussian datasets. The input range for the training and interpolation datasets is [-2, 2] for both the context and target sets, while for the extrapolation task the context and target input points are drawn from [2, 6].

Architecture. For all datasets except SAWTOOTH, we use 5 bi-dimensional attention layers (Dutordoir et al., 2022) with 64 hidden dimensions and 8 output heads. For SAWTOOTH, we obtained better performance with a wider and shallower model consisting of 2 bi-dimensional attention layers with a hidden dimensionality of 128. In all experiments, we train the NDP-based models for 300 epochs using a batch size of 256. Furthermore, we use the Adam optimiser for training with the following learning rate schedule: linear warm-up for 10 epochs followed by a cosine decay until the end of training.

F.1.2 Ablation: limiting kernels

The test log-likelihoods (TLLs) reported in App. F.1.3 for the NDP models target a white limiting kernel and train to approximate the preconditioned score K ∇ log p_t. Overall, we found this to be the best performing setting.
Fig. 10 shows an ablation study for different choices of limiting kernel and score parametrisation. We refer to Table 3 for a detailed derivation of the score parametrisations. The dataset in the top row of the figure originates from a Squared Exponential (SE) GP with lengthscale ℓ = 0.25. We compare the performance of three different limiting kernels: white (blue), an SE with a longer lengthscale ℓ = 1 (orange), and an SE with a shorter lengthscale ℓ = 0.1 (green). As the dataset is Gaussian, we have access to the true score. We observe that, across the different parametrisations, the white limiting kernel performs best. However, note that for the white kernel K = I and thus the different parametrisations become identical. For non-white limiting kernels we see a reduction in performance for both the approximate and exact score. We attribute this to the additional complexity of learning a non-diagonal covariance. In the bottom row of Fig. 10 we repeat the experiment for a dataset consisting of samples from a Periodic GP with lengthscale 0.5. We draw similar conclusions: the best performing limiting kernel, across the different parametrisations, is the white noise kernel.

F.1.3 Ablation: conditional sampling

Next, we focus on the empirical performance of the different noising schemes for conditional sampling, as discussed in Fig. 8. For this, we measure the Kullback-Leibler (KL) divergence between two Gaussian distributions: the true GP-based conditional distribution, and a distribution created by drawing conditional samples from the model and fitting a Gaussian to them using the empirical mean and covariance. We perform this test on the 1D squared exponential dataset (described above), as this gives us access to the true posterior. We use 2^12 samples to estimate the empirical mean and covariance, and fix the number of context points to 3. In Fig. 11 we keep the total number of score evaluations fixed to 5000 and vary the number of inner-loop steps L, such that the number of outer steps is given by the ratio 5000/L. From the figure, we observe that the particular choice of noising scheme is of little importance as long as at least a few (around 5) inner steps are taken. We further note that in this experiment we used the true score (available because of the Gaussianity of the dataset), which means that these results may differ if an approximate score network is used.

Figure 9: Visualisation of the 1D regression experiment (columns: data prior, model prior, model posterior, true posterior; rows include the Matérn 5/2 and Weakly Periodic datasets).

Figure 10: Ablation study targeting different limiting kernels (white, and lengthscales ℓ = 1.0 and ℓ = 0.1) and score parametrisations (precondition by K, precondition by S, no preconditioning, predict y_0), with exact and approximate scores. (a) Squared Exponential dataset with lengthscale ℓ = 0.25. (b) Periodic dataset with lengthscale ℓ = 0.25.
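The metric used in this conditional-sampling ablation can be computed in closed form, as in the hedged sketch below: a Gaussian is fitted to the model's conditional samples via the empirical mean and covariance, and its KL divergence to the exact GP posterior is evaluated. The direction of the KL and the helper names are our assumptions.

```python
import jax.numpy as jnp

def gaussian_kl(mu_q, cov_q, mu_p, cov_p, jitter=1e-6):
    # KL( N(mu_q, cov_q) || N(mu_p, cov_p) ) in closed form.
    d = mu_q.shape[0]
    cov_p = cov_p + jitter * jnp.eye(d)
    trace_term = jnp.trace(jnp.linalg.solve(cov_p, cov_q))
    diff = mu_p - mu_q
    maha = diff @ jnp.linalg.solve(cov_p, diff)
    _, logdet_p = jnp.linalg.slogdet(cov_p)
    _, logdet_q = jnp.linalg.slogdet(cov_q + jitter * jnp.eye(d))
    return 0.5 * (trace_term + maha - d + logdet_p - logdet_q)

def empirical_gaussian(samples):
    # samples: (num_samples, num_targets) conditional draws from the model.
    mu = jnp.mean(samples, axis=0)
    centred = samples - mu
    cov = centred.T @ centred / (samples.shape[0] - 1)
    return mu, cov
```

With `mu_p, cov_p` set to the exact GP conditional mean and covariance and `mu_q, cov_q` to the empirical fit, the returned value is the quantity plotted in Fig. 11.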
Table 5: Mean test log-likelihood (TLL) ± 1 standard error, estimated over 4096 test samples. The statistically significantly best non-GP model is in bold in the original. The NP baselines (GNP, ConvCNP, ConvNP and ANP) are quoted from Bruinsma et al. (2020). * stands for a TLL below -10.

INTERPOLATION     SE            MATÉRN 5/2    WEAKLY PER.   SAWTOOTH      MIXTURE
GP (OPTIMUM)      0.70 ± 0.00   0.31 ± 0.00   0.32 ± 0.00   n/a           n/a
T(1) GEOMNDP      0.72 ± 0.03   0.32 ± 0.03   0.38 ± 0.03   3.39 ± 0.04   0.64 ± 0.08
NDP*              0.71 ± 0.03   0.30 ± 0.03   0.37 ± 0.03   3.39 ± 0.04   0.64 ± 0.08
GNP               0.70 ± 0.01   0.30 ± 0.01   0.47 ± 0.01   0.42 ± 0.01   0.10 ± 0.02
CONVCNP           0.80 ± 0.01   0.95 ± 0.01   1.20 ± 0.01   0.55 ± 0.02   0.93 ± 0.02
CONVNP            0.46 ± 0.01   0.67 ± 0.01   1.02 ± 0.01   1.20 ± 0.01   0.50 ± 0.02
ANP               0.61 ± 0.01   0.75 ± 0.01   1.19 ± 0.01   0.34 ± 0.01   0.69 ± 0.02

GENERALISATION    SE            MATÉRN 5/2    WEAKLY PER.   SAWTOOTH      MIXTURE
GP (OPTIMUM)      0.70 ± 0.00   0.31 ± 0.00   0.32 ± 0.00   n/a           n/a
T(1) GEOMNDP      0.70 ± 0.02   0.31 ± 0.02   0.38 ± 0.03   3.39 ± 0.03   0.62 ± 0.02
NDP*              *             *             *             *             *
GNP               0.69 ± 0.01   0.30 ± 0.01   0.47 ± 0.01   0.42 ± 0.01   0.10 ± 0.02
CONVCNP           0.81 ± 0.01   0.95 ± 0.01   1.20 ± 0.01   0.53 ± 0.02   0.96 ± 0.02
CONVNP            0.46 ± 0.01   0.67 ± 0.01   1.02 ± 0.01   1.19 ± 0.01   0.53 ± 0.02
ANP               1.42 ± 0.01   1.34 ± 0.01   1.33 ± 0.00   0.17 ± 0.00   1.24 ± 0.01

Figure 11: Ablation of the noising schemes for conditional sampling (KL divergence against the number of inner steps, L in {0, 1, 5, 10, 20, 100}, for independent context noise without resampling, sample-path context noise, independent context noise with resampling, and raw context).

F.2 Gaussian process vector fields

Data. We create synthetic datasets using samples from two-dimensional zero-mean GPs with the following E(2)-equivariant kernels: a diagonal Squared Exponential (SE) kernel, a zero-curl (CURL-FREE) kernel and a zero-divergence (DIV-FREE) kernel, as described in App. D.1. We set the variance to σ^2 = 1 and the lengthscale to ℓ = 5. We evaluate these GPs on a disk grid, created via a 2D grid with 30 x 30 points regularly spaced on [-10, 10]^2 and keeping only the points inside the disk of radius 10. We create a training dataset of size 80 x 10^3 and a test dataset of size 10 x 10^3.

Models. We compare two flavours of our model GeomNDP. One has a non-equivariant attention-based score network (Figure C.1, Dutordoir et al., 2022), referred to as NDP*. The other has an E(2)-equivariant score architecture, based on steerable CNNs (Thomas et al., 2018; Weiler et al., 2018). We rely on the e3nn library (Geiger and Smidt, 2022) for the implementation. A k-nearest-neighbour graph E is built with k = 20. The pairwise distances are first embedded into μ(r_ab) with a smooth_finite basis of 50 elements via e3nn.soft_one_hot_linspace, with a maximum radius of 2. The time is mapped via a sinusoidal embedding φ(t) (Vaswani et al., 2017). Edge features are then obtained as e_ab = Ψ^(e)(μ(r_ab) || φ(t)) for (a, b) in E_k, with Ψ^(e) an MLP with 2 hidden layers of width 64. We use 5 e3nn.FullyConnectedTensorProduct layers, with update given by
V^{k+1}_a = Σ_{b in N(a, E_k)} V^k_a ⊗_{Ψ_v(e_ab || V^k_a || V^k_b)} Y(r̂_ab),
with Y spherical harmonics up to order 2, Ψ_v an MLP with 2 hidden layers of width 64 acting on invariant features, and node features V^k having irreps 12x0e + 12x0o + 4x1e + 4x1o. Each layer has a gate non-linearity (Weiler et al., 2018).
We also evaluate two neural processes: a translation-equivariant CONVCNP (Gordon et al., 2020) with a decoder architecture based on 2D convolutional layers (LeCun et al., 1998), and a C4 ⋉ R^2 E(2)-equivariant STEERCNP (Holderrieth et al., 2021) with a decoder architecture based on 2D steerable convolutions (Weiler and Cesa, 2021). Specific details can be found in the accompanying codebase of Holderrieth et al. (2021), https://github.com/PeterHolderrieth/Steerable_CNPs.

Optimisation. Models are trained for 80k iterations with Adam (Kingma and Ba, 2015), with a learning rate of 5e-4 and a batch size of 32. The neural diffusion processes are trained unconditionally, that is, we feed GP samples evaluated on the full disk grid.
Their weights are updated with an exponential moving average with coefficient 0.99. The diffusion coefficient is weighted by β : t -> β_min + (β_max - β_min) t, with β_min = 1e-4 and β_max = 15. As standard, the neural processes are trained by splitting the training batches into a context and an evaluation set, similarly to how the models are evaluated. Models have been trained on A100-SXM-80GB GPUs.

Evaluation. We measure the predictive log-likelihood of the data process samples under the model on a held-out test dataset. The context sets are of size 25 and uniformly sampled from a disk grid of size 648, and the models are evaluated on the complement of the grid. For the neural diffusion processes, we estimate the likelihood by solving the associated probability flow ODE (53). The divergence is estimated with the Hutchinson estimator, with Rademacher noise and 8 samples, whilst the ODE is solved with the second-order Heun solver with 100 discretisation steps. We also report the performance of the data-generating GP, and of the same GP but with diagonal posterior covariance, GP (DIAG.).

F.3 Tropical cyclone trajectory prediction

Data. The data is drawn from the International Best Track Archive for Climate Stewardship (IBTrACS) Project, Version 4 (Knapp et al., 2010; Knapp et al., 2018). The tracks are taken from the "all" dataset, covering tracks from all cyclone basins across the globe. The tracks are logged at intervals of 3 hours. From the dataset, we selected tracks at least 50 time points long and clipped any longer ones to this length, resulting in 5224 cyclones. 90% was used for training and 10% held out for evaluation; this split was changed across seeds. More interesting schemes handling variable-length tracks are of interest, but not pursued here in this demonstrative experiment. Natively the track locations live in latitude-longitude coordinates, although they are processed into different forms for different models. The time stamps are processed into the number of days since the cyclone formed, and this format is common between all models.

Four models were evaluated. The GP (R -> R^2) took the raw latitude-longitude data and normalised it. Using a 2-output RBF kernel with no covariance between the latitude and longitude, and taking the cyclone time as input, it places a GP over the data. The hyperparameters of this kernel were optimised using a maximum-likelihood grid search over the data. Note that this model places density outside the bounding box [-90, 90] x [-180, 180] that defines the range of latitude and longitude, and so does not place a proper distribution on the space of paths on the sphere.

The STEREOGRAPHIC GP (R -> R^2\{0}) instead transformed the data under a stereographic projection centred at the north pole, and used the same GP and optimisation as above. Since this model only fails to place density on a set of measure zero of the sphere, it does induce a proper distribution on the space of paths on the sphere.

The NDP (R -> R^2) uses the same preprocessing as the GP (R -> R^2) but uses a Neural Diffusion Process from (Dutordoir et al., 2022) to model the data. This has the same shortcoming as the GP (R -> R^2) in not placing a proper density on the space of paths on the sphere. The network used for the score function and the optimisation procedure are detailed below. A linear beta schedule was used with β_0 = 1e-4 and β_1 = 10. The reverse model was integrated back to ε = 5e-4 for numerical stability. The reference measure was a white noise kernel with a variance of 0.05.
ODEs and SDEs were discretised with 1000 steps.

The GEOMNDP (R -> S^2) works with the data projected into 3D space on the surface of the sphere. This projection makes no difference to the results of the model, but makes the computation of manifold functions such as the exp map easier, and makes it easier to define a smooth score function on the sphere. This is done by outputting a score vector from the neural network in 3D space, and projecting it onto the tangent space of the sphere at the given point. For the necessity of this, see (De Bortoli et al., 2021). The network used for the score function and the optimisation procedure are detailed below. A linear beta schedule was used with β_0 = 1e-4 and β_1 = 15. The reverse model was integrated back to ε = 5e-4 for numerical stability. The reference measure was a white noise kernel with a variance of 0.05. ODEs and SDEs were discretised with 1000 steps.

Neural network. The network used to learn the score function for both the NDP (R -> R^2) and the GEOMNDP (R -> S^2) is a bi-dimensional attention network from Dutordoir et al. (2022) with 5 layers, a hidden size of 128 and 4 heads per layer. This results in 924k parameters.

Optimisation. The NDP (R -> R^2) and GEOMNDP (R -> S^2) were both optimised using (correctly implemented) Adam for 250k steps, using a batch size of 1024 and global norm clipping of 1. Batches were drawn from the shuffled data and refreshed each time the dataset was exhausted. A learning rate schedule was used with 1000 warm-up steps going linearly from 1e-5 to 1e-3, followed by a cosine schedule decaying from 1e-3 to 1e-5. With even probability, either the whole cyclone track was used in the batch, or 20 random points were sub-sampled, to train the model better for the conditional sampling task.

Conditional sampling. The GP models used closed-form conditional sampling as described. Both diffusion-based models used the Langevin sampling scheme described in this work. 1000 outer steps were used with 25 inner steps. We use ψ = 1.0 and λ_0 = 2.5. In addition, at the end of the Langevin sampling, we run an additional 150 Langevin steps at t = ε, as this visually improved performance.

Evaluation. The (conditional) log-probabilities of the GP models were computed in closed form. For the diffusion-based models, they were computed using the auxiliary likelihood ODE discretised over 1000 steps. The conditional probabilities were computed via the difference between the log-likelihood of the whole trajectory and the log-likelihood of the context set only. The mean squared errors were computed using the geodesic distance, averaged over 10 conditionally sampled trajectories, as described above.
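For reference, the sketch below collects the sphere utilities assumed by the GEOMNDP (R -> S^2) setup described above: mapping latitude/longitude to unit vectors, projecting an ambient score output onto the tangent space (cf. De Bortoli et al., 2021), and the geodesic distance used for the mean-squared-error metric. The function names are ours and the code is an illustrative sketch rather than the released implementation.

```python
import jax.numpy as jnp

def latlon_to_xyz(lat_deg, lon_deg):
    # Latitude/longitude in degrees to a unit vector on S^2.
    lat, lon = jnp.deg2rad(lat_deg), jnp.deg2rad(lon_deg)
    return jnp.stack([jnp.cos(lat) * jnp.cos(lon),
                      jnp.cos(lat) * jnp.sin(lon),
                      jnp.sin(lat)], axis=-1)

def project_to_tangent(y, ambient_vec):
    # Remove the radial component so the score lives in T_y S^2.
    return ambient_vec - jnp.sum(ambient_vec * y, axis=-1, keepdims=True) * y

def geodesic_distance(y1, y2):
    # Great-circle distance between unit vectors (unit-radius sphere).
    cos = jnp.clip(jnp.sum(y1 * y2, axis=-1), -1.0, 1.0)
    return jnp.arccos(cos)
```

Squaring `geodesic_distance` between a predicted and a reference track point, and averaging over conditionally sampled trajectories, yields the MSE metric reported for the cyclone experiment.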