Journal of Machine Learning Research 26 (2025) 1-54          Submitted 1/25; Published 6/25

Autoencoders in Function Space

Justin Bunker (jb2200@cantab.ac.uk)
Department of Engineering, University of Cambridge, Cambridge, CB2 1TN, United Kingdom

Mark Girolami (mgirolami@turing.ac.uk)
Department of Engineering, University of Cambridge, and Alan Turing Institute, Cambridge, CB2 1TN, United Kingdom

Hefin Lambley (hefin.lambley@warwick.ac.uk)
Mathematics Institute, University of Warwick, Coventry, CV4 7AL, United Kingdom

Andrew M. Stuart (astuart@caltech.edu)
Computing + Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, United States of America

T. J. Sullivan (t.j.sullivan@warwick.ac.uk)
Mathematics Institute & School of Engineering, University of Warwick, Coventry, CV4 7AL, United Kingdom

Editor: Mahdi Soltanolkotabi

Abstract

Autoencoders have found widespread application in both their original deterministic form and in their variational formulation (VAEs). In scientific applications and in image processing it is often of interest to consider data that are viewed as functions; while discretisation (of differential equations arising in the sciences) or pixellation (of images) renders problems finite-dimensional in practice, conceiving first of algorithms that operate on functions, and only then discretising or pixellating, leads to better algorithms that operate smoothly across resolutions. In this paper function-space versions of the autoencoder (FAE) and variational autoencoder (FVAE) are introduced, analysed, and deployed. Well-definedness of the objective governing VAEs is a subtle issue, particularly in function space, limiting applicability. For the FVAE objective to be well defined requires compatibility of the data distribution with the chosen generative model; this can be achieved, for example, when the data arise from a stochastic differential equation, but is generally restrictive. The FAE objective, on the other hand, is well defined in many situations where FVAE fails to be. Pairing the FVAE and FAE objectives with neural operator architectures that can be evaluated on any mesh enables new applications of autoencoders to inpainting, superresolution, and generative modelling of scientific data.

Keywords: variational inference on function space, operator learning, variational autoencoders, regularised autoencoders, scientific machine learning

©2025 Justin Bunker, Mark Girolami, Hefin Lambley, Andrew M. Stuart and T. J. Sullivan. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/25-0035.html.

1. Introduction

Functional data, and data that can be viewed as a high-resolution approximation of functions, are ubiquitous in data science (Ramsay and Silverman, 2002). Recent years have seen much interest in machine learning in this setting, with the promise of architectures that can be trained and evaluated across resolutions. A variety of methods now exist for the supervised learning of operators between function spaces, starting with the work of Chen and Chen (1995), followed by DeepONet (Lu et al., 2021), PCA-Net (Bhattacharya et al., 2021), Fourier neural operators (FNOs; Li et al., 2021) and variants (Kovachki et al., 2023). These methods have proven useful in diverse applications such as surrogate modelling for costly simulators of dynamical systems (Azizzadenesheli et al., 2024).
Practical algorithms for functional data must necessarily operate on discrete representations of the underlying infinite-dimensional objects, identifying salient features independent of resolution. Some models make this dimension reduction explicit by representing outputs as a linear combination of basis functions: learned from data in DeepONet, and computed using principal component analysis (PCA) in PCA-Net. Others do this implicitly, as in, for example, the deep layered structure of FNOs involving repeated application of the discrete Fourier transform followed by pointwise activation. While linear dimension-reduction methods such as PCA adapt readily to function space, there are many types of data, such as solutions to advection-dominated partial differential equations (PDEs), for which linear approximations are provably inefficient, a phenomenon known as the Kolmogorov barrier (Peherstorfer, 2022). This suggests the need for nonlinear dimension-reduction techniques on function space.

Motivated by this we propose an extension of variational autoencoders (VAEs; Kingma and Welling, 2014) to functional data using operator learning; we refer to the resulting model as the functional variational autoencoder (FVAE). As a probabilistic latent-variable model, FVAE allows for both dimension reduction and principled generative modelling on function space. We define the FVAE objective as the Kullback–Leibler (KL) divergence between two joint distributions, both on the product of the data and latent spaces, defined by the encoder and decoder models; we then derive conditions under which this objective is well defined. This differs from the usual presentation of VAEs in which one maximises a lower bound on the data likelihood, the evidence lower bound (ELBO); we show that our objective is equivalent and that it generalises naturally to function space.

The FVAE objective proves to be well defined only under a compatibility condition between the data and the generative model; this condition is easily satisfied in finite dimensions but is restrictive for functional data. Our applications of FVAE rest on establishing such compatibility, which is possible for problems in the sciences such as those arising in Bayesian inverse problems with Gaussian priors (Stuart, 2010), and those governed by stochastic differential equations (SDEs) (Hairer et al., 2011). However, there are many problems in the sciences, and in generative models for PDEs in particular, for which application of FVAE fails because the generative model is incompatible with the data; using FVAE in such settings leads to foundational theoretical issues and, as a result, to practical problems in the infinite-resolution and infinite-data limits.

To overcome these foundational issues we propose a deterministic regularised autoencoder that can be applied in very general settings, which we call the functional autoencoder (FAE). We show that FAE is an effective tool for dimension reduction, and that it can be used as a versatile generative model for functional data. We complement the FVAE and FAE objectives on function space with neural-operator architectures that can be discretised on any mesh. The ability to discretise both the encoder and the decoder on arbitrary meshes extends prior work such as the variational autoencoding neural operator (VANO; Seidman et al., 2023), and is highly empowering, enabling a variety of new applications of autoencoders to scientific data such as inpainting and superresolution.
Code accompanying the paper is available at https://github.com/htlambley/functional_autoencoders.

Contributions. We make the following contributions to the development and application of autoencoders to functional data:

(C1) we propose FVAE, an extension of VAEs to function space, finding that the training objective is well defined so long as the generative model is compatible with the data;

(C2) we complement the FVAE training objective with mesh-invariant architectures that can be deployed on any discretisation, even irregular, non-grid discretisations;

(C3) we show that when the data and generative model are incompatible, the discretised FVAE objective may diverge in the infinite-resolution and infinite-data limits or entirely fail to minimise the divergence between the encoder and decoder;

(C4) we propose FAE, an extension of regularised autoencoders to function space, and show that its objective is well defined in many cases where the FVAE objective is not;

(C5) exploiting mesh-invariance, we propose masked training schemes which exhibit greater robustness at inference time, faster training, and lower memory usage;

(C6) we validate FAE and FVAE on examples from the sciences, including problems governed by SDEs and PDEs, to discover low-dimensional latent structure from data, and use our models for inpainting, superresolution, and generative modelling, exploiting the ability to discretise the encoder and decoder on any mesh.

Outline. Section 2 extends VAEs to function space (Contribution (C1)). We then describe the mesh-invariant architectures of Contribution (C2); we also validate our approach, FVAE, on examples such as SDE path distributions where compatibility between the data and the generative model can be verified. Section 3 gives an example of the problems arising when applying FVAE in the misspecified setting of Contribution (C3). In Section 4 we propose FAE (Contribution (C4)) and apply our data-driven method to two examples from scientific machine learning: Navier–Stokes fluid flows and Darcy flows in porous media. In these problems we make use of the mesh-invariance of the FAE to apply the masked training scheme of Contribution (C5); masking proves to be vital for the applications to inpainting and superresolution in Contribution (C6). Section 5 discusses related work, and Section 6 discusses limitations and topics for future research.

2. Variational Autoencoders on Function Space

In this section we define the VAE objective by minimising the KL divergence between two distinct joint distributions on the product of data and latent spaces, one defined by the encoder and the other by the decoder. We show that this gives a tractable objective when written in terms of a suitable reference distribution (Section 2.1). This objective coincides with maximising the ELBO in finite dimensions (Section 2.2) and extends readily to infinite dimensions. The divergence between the encoder and decoder models is finite only under a compatibility condition between the data and the generative model. This condition is restrictive in infinite dimensions. We identify problem classes for which this compatibility holds (Section 2.3) and pair the objective with mesh-invariant encoder and decoder architectures (Section 2.4), leading to a model we call the functional variational autoencoder (FVAE). We then validate our method on several problems from scientific machine learning (Section 2.5) where the data are governed by SDEs.
2.1 Training Objective

We begin by formulating an objective in infinite dimensions, making the following standard assumption in unsupervised learning throughout Section 2. At this stage we focus on the continuum problem, and address the question of discretisation in Section 2.4. In what follows $\mathcal{P}(X)$ is the set of Borel probability measures on the separable Banach space $X$.

Assumption 1 Take the data space $(U, \|\cdot\|)$ to be a separable Banach space. There exists a data distribution $\Upsilon \in \mathcal{P}(U)$ from which we have access to $N$ independent and identically distributed samples $\{u^{(n)}\}_{n=1}^N \subset U$.

This setting is convenient to work with yet general enough to include many spaces of interest; $U$ could be, for example, a Euclidean space $\mathbb{R}^k$, or the infinite-dimensional space $L^2(\Omega)$ of (equivalence classes of) square-integrable functions with domain $\Omega \subseteq \mathbb{R}^d$. Separability is a technical condition used to guarantee the existence of conditional distributions in the encoder and decoder models defined shortly (see Chang and Pollard, 1997, Theorem 1).

An autoencoder consists of two nonlinear transformations: an encoder mapping data into a low-dimensional latent space $Z$, and a decoder mapping from $Z$ back into $U$. The goal is to choose transformations such that, for $u \sim \Upsilon$, composing the encoder and decoder approximates the identity. In this way the autoencoder can be used for dimension reduction, with the encoder output being a compressed representation of the data. To make this precise, fix $Z = \mathbb{R}^{d_Z}$ and define encoder and decoder transformations depending on parameters $\theta \in \Theta$ and $\psi \in \Psi$, respectively, by

\[ \text{(encoder)} \quad U \ni u \mapsto Q^\theta_{z|u} \in \mathcal{P}(Z), \tag{1a} \]
\[ \text{(decoder)} \quad Z \ni z \mapsto P^\psi_{u|z} \in \mathcal{P}(U). \tag{1b} \]

To ensure the statistical models we are interested in are well defined, we require both maps to be Markov kernels; that is, for all Borel measurable sets $A \subseteq Z$ and $B \subseteq U$ and all parameters $\theta \in \Theta$ and $\psi \in \Psi$, the maps $u \mapsto Q^\theta_{z|u}(A)$ and $z \mapsto P^\psi_{u|z}(B)$ are measurable. Since the encoder and decoder both output probability distributions, we must be precise about what it means to compose them: we mean the distribution

\[ \text{(autoencoding distribution)} \quad A^{\theta,\psi}_u(B) = \int_Z P^\psi_{u|z}(B)\, Q^\theta_{z|u}(\mathrm{d}z), \qquad B \subseteq U \text{ measurable}. \tag{2} \]

Remark 2 Given $u \in U$, $A^{\theta,\psi}_u(\cdot)$ is the distribution given by drawing $z \sim Q^\theta_{z|u}$ and then sampling from $P^\psi_{u|z}$; we would like this to be close to a Dirac distribution $\delta_u$ when $u$ is drawn from $\Upsilon$; enforcing this, approximately, will be used to determine the parameters $(\theta, \psi)$.

Autoencoding as Matching Joint Distributions. Now let us fix a latent distribution $P_z \in \mathcal{P}(Z)$ and define two distributions for $(z, u)$ on the product space $Z \times U$:

\[ \text{(joint encoder model)} \quad Q^\theta_{z,u}(\mathrm{d}z, \mathrm{d}u) = Q^\theta_{z|u}(\mathrm{d}z)\,\Upsilon(\mathrm{d}u), \tag{3a} \]
\[ \text{(joint decoder model)} \quad P^\psi_{z,u}(\mathrm{d}z, \mathrm{d}u) = P^\psi_{u|z}(\mathrm{d}u)\,P_z(\mathrm{d}z). \tag{3b} \]

The joint encoder model (3a) is the distribution on $(z, u)$ given by applying the encoder to data $u \sim \Upsilon$, while the joint decoder model (3b) is the distribution on $(z, u)$ given by applying the decoder to latent vectors $z \sim P_z$. We emphasise that $\Upsilon$ is given, while the distributions $Q^\theta_{z|u}$, $P^\psi_{u|z}$, and $P_z$ are to be specified; the choice of decoder distribution $P^\psi_{u|z}$ on the (possibly infinite-dimensional) space $U$ will require particular attention in what follows. The marginal and conditional distributions of (3a) and (3b) will be important and we write, for example, $P^\psi_{z|u}$ for the $z \mid u$-conditional of the joint decoder model and $Q^\theta_z$ for the $z$-marginal of the joint encoder model. In particular, the $u$-marginal of the joint decoder model is the generative model $P^\psi_u$.
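To make the composition in (2) concrete, the following minimal NumPy sketch draws one sample from the autoencoding distribution under the finite-dimensional Gaussian model introduced later in Section 2.2; the maps `enc_mean`, `enc_cov_diag`, and `dec_mean` are illustrative placeholders for the learnable maps $f$, $\Sigma$, and $g$, not the paper's implementation.

```python
# Minimal sketch of the autoencoding distribution (2) in finite dimensions:
# draw z ~ Q^theta_{z|u}, then u' ~ P^psi_{u|z}.
import numpy as np

rng = np.random.default_rng(0)
d_u, d_z, beta = 8, 2, 0.1

def enc_mean(u):        # placeholder for f(u; theta)
    return u[:d_z]

def enc_cov_diag(u):    # placeholder for the diagonal of Sigma(u; theta)
    return 0.01 * np.ones(d_z)

def dec_mean(z):        # placeholder for g(z; psi)
    return np.tile(z, d_u // d_z)

def sample_autoencoding(u):
    """One draw from A^{theta,psi}_u: encode, sample latent, decode, add noise."""
    z = enc_mean(u) + np.sqrt(enc_cov_diag(u)) * rng.standard_normal(d_z)
    return dec_mean(z) + np.sqrt(beta) * rng.standard_normal(d_u)

u = rng.standard_normal(d_u)
print(sample_autoencoding(u))   # should concentrate near u once trained
```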
To match the joint encoder model (3a) with the joint decoder model (3b), we seek to solve, for some statistical distance or divergence $d$ on $\mathcal{P}(Z \times U)$, the minimisation problem

\[ \operatorname*{arg\,min}_{\theta \in \Theta,\, \psi \in \Psi} d\big( Q^\theta_{z,u}, P^\psi_{z,u} \big). \tag{4} \]

Doing so determines an autoencoder and a generative model. The choice of $d$ is constrained by the need for it to be possible to evaluate the objective with $\Upsilon$ known only empirically; examples for which this is the case include the Wasserstein metrics and the KL divergence.

VAE Objective as Minimisation of KL Divergence. Using the KL divergence as $d$ in (4) leads to the goal of finding $\theta \in \Theta$ and $\psi \in \Psi$ to minimise

\[ D_{\mathrm{KL}}\big( Q^\theta_{z,u} \,\big\|\, P^\psi_{z,u} \big) = \mathbb{E}_{(z,u) \sim Q^\theta_{z,u}} \left[ \log \frac{\mathrm{d}Q^\theta_{z,u}}{\mathrm{d}P^\psi_{z,u}}(z,u) \right] = \mathbb{E}_{u \sim \Upsilon}\, \mathbb{E}_{z \sim Q^\theta_{z|u}} \left[ \log \frac{\mathrm{d}Q^\theta_{z,u}}{\mathrm{d}P^\psi_{z,u}}(z,u) \right]; \tag{5} \]

here $\mathrm{d}Q^\theta_{z,u}/\mathrm{d}P^\psi_{z,u}$ is the Radon–Nikodym derivative of $Q^\theta_{z,u}$ with respect to $P^\psi_{z,u}$, the appropriate infinite-dimensional analogue of the ratio of probability densities. This exists only when $Q^\theta_{z,u}$ is absolutely continuous with respect to $P^\psi_{z,u}$, meaning that $Q^\theta_{z,u}$ assigns probability zero to a set whenever $P^\psi_{z,u}$ does; we take (5) to be infinite otherwise. This objective is equivalent to the standard VAE objective in the case $U = \mathbb{R}^k$ but has the additional advantage that it can be used in the infinite-dimensional setting.

KL divergence has many benefits justifying its use across statistics. Its value in inference and generative modelling comes from the connection between maximum-likelihood methods and minimising KL divergence, as well as its information-theoretic interpretation (Cover and Thomas, 2006, Sec. 2.3). KL divergence is asymmetric and has two useful properties justifying the order of the arguments in (5): it can be evaluated using only samples of the distribution in its first argument, and it requires no knowledge of the normalisation constant of the distribution in its second argument, since minimisers of $\nu \mapsto D_{\mathrm{KL}}(\nu \| \mu)$ are invariant under scaling of $\mu$ (Chen et al., 2023 and Bach et al., 2024, Sec. 12.2).

To justify the use of the joint divergence (5), in Theorem 3 we decompose the objective as the sum of two interpretable terms. In Theorem 6 we write the objective in a form that leads to actionable algorithms.

Theorem 3 For all parameters $\theta \in \Theta$ and $\psi \in \Psi$ for which $D_{\mathrm{KL}}(Q^\theta_{z,u} \| P^\psi_{z,u}) < \infty$,

\[ D_{\mathrm{KL}}\big( Q^\theta_{z,u} \,\big\|\, P^\psi_{z,u} \big) = \underbrace{D_{\mathrm{KL}}\big( \Upsilon \,\big\|\, P^\psi_u \big)}_{\text{(I)}} + \mathbb{E}_{u \sim \Upsilon} \Big[ \underbrace{D_{\mathrm{KL}}\big( Q^\theta_{z|u} \,\big\|\, P^\psi_{z|u} \big)}_{\text{(II)}} \Big]. \]

Proof Factorise $Q^\theta_{z,u}(\mathrm{d}z, \mathrm{d}u) = Q^\theta_{z|u}(\mathrm{d}z)\,\Upsilon(\mathrm{d}u)$ and $P^\psi_{z,u}(\mathrm{d}z, \mathrm{d}u) = P^\psi_{z|u}(\mathrm{d}z)\,P^\psi_u(\mathrm{d}u)$ and substitute into (5) to obtain

\[ D_{\mathrm{KL}}\big( Q^\theta_{z,u} \,\big\|\, P^\psi_{z,u} \big) = \mathbb{E}_{u \sim \Upsilon}\, \mathbb{E}_{z \sim Q^\theta_{z|u}} \left[ \log \frac{\mathrm{d}\Upsilon}{\mathrm{d}P^\psi_u}(u) + \log \frac{\mathrm{d}Q^\theta_{z|u}}{\mathrm{d}P^\psi_{z|u}}(z) \right]. \tag{6} \]

The result follows by inserting the definition of the KL divergences on the right-hand side of (6), and all terms are finite owing to the assumption that $D_{\mathrm{KL}}(Q^\theta_{z,u} \| P^\psi_{z,u}) < \infty$.

Theorem 3 decomposes the divergence (5) into (I) the error in the generative model and (II) the error in approximating the decoder posterior $P^\psi_{z|u}$ with $Q^\theta_{z|u}$ via variational inference; jointly training a variational-inference model and a generative model is exactly the goal of a VAE. However, minimising (5) makes sense only if $(\theta, \psi) \mapsto D_{\mathrm{KL}}(Q^\theta_{z,u} \| P^\psi_{z,u})$ is finite for some $\theta$ and $\psi$. Verification of this property is a necessary step that we perform for our settings of interest in Propositions 8, 16, and 24.

Remark 4 (Posterior collapse) While minimising (5) typically leads to a useful autoencoder, this is not guaranteed: the autoencoding distribution (2) may be very far from a Dirac distribution.
For example, when $P^\psi_{z|u} \approx P_z$, the optimal decoder distribution $P^\psi_{u|z}$ may well ignore the latent variable $z$ entirely. Indeed, if $\Upsilon = Q_{z|u} = P_{u|z} = P_z = N(0, 1)$ on $U = \mathbb{R}$, then $D_{\mathrm{KL}}(Q_{z,u} \| P_{z,u}) = 0$, but the autoencoding distribution $A_u = N(0, 1)$ for all $u \in U$. This is not close to the desired Dirac distribution referred to in Remark 2. This issue is known in the VAE literature as posterior collapse (Wang et al., 2021). In practice, the choice of model classes for $Q_{z|u}$ and $P_{u|z}$ avoids this issue.

Tractable Training Objective. Since we can access $\Upsilon$ only through training data, we must decompose the objective function into a sum of a term which involves $\Upsilon$ only through samples and a term which is independent of the parameters $\theta$ and $\psi$. The first of these terms may then be used to define a tractable objective function over the parameters. To address this issue we introduce a reference distribution $\Lambda \in \mathcal{P}(U)$ and impose the following conditions on the encoder and the reference distribution.

Assumption 5
(a) There exists a reference distribution $\Lambda \in \mathcal{P}(U)$ such that:
    (i) for all $\psi \in \Psi$ and $z \in Z$, $P^\psi_{u|z}$ is mutually absolutely continuous with $\Lambda$; and
    (ii) the data distribution $\Upsilon$ satisfies the finite-information condition $D_{\mathrm{KL}}(\Upsilon \| \Lambda) < \infty$.
(b) For all $\theta \in \Theta$ and $\Upsilon$-almost all $u \in U$, $D_{\mathrm{KL}}(Q^\theta_{z|u} \| P_z) < \infty$.

Theorem 6 If Assumption 5 is satisfied for some $\Lambda \in \mathcal{P}(U)$, then for all parameters $\theta \in \Theta$ and $\psi \in \Psi$ for which $D_{\mathrm{KL}}(Q^\theta_{z,u} \| P^\psi_{z,u}) < \infty$, we have

\[ D_{\mathrm{KL}}\big( Q^\theta_{z,u} \,\big\|\, P^\psi_{z,u} \big) = \mathbb{E}_{u \sim \Upsilon}\big[ L(u; \theta, \psi) \big] + D_{\mathrm{KL}}(\Upsilon \| \Lambda), \tag{7a} \]

\[ \text{(per-sample loss)} \quad L(u; \theta, \psi) = \mathbb{E}_{z \sim Q^\theta_{z|u}} \left[ -\log \frac{\mathrm{d}P^\psi_{u|z}}{\mathrm{d}\Lambda}(u) \right] + D_{\mathrm{KL}}\big( Q^\theta_{z|u} \,\big\|\, P_z \big). \tag{7b} \]

Proof Write $Q^\theta_{z,u}(\mathrm{d}z, \mathrm{d}u) = Q^\theta_{z|u}(\mathrm{d}z)\,\Upsilon(\mathrm{d}u)$ and $P^\psi_{z,u}(\mathrm{d}z, \mathrm{d}u) = P^\psi_{u|z}(\mathrm{d}u)\,P_z(\mathrm{d}z)$, factor through the distribution $\Lambda$ in (5), and apply the definition of the KL divergence to obtain

\begin{align*}
D_{\mathrm{KL}}\big( Q^\theta_{z,u} \,\big\|\, P^\psi_{z,u} \big)
&= \mathbb{E}_{(z,u) \sim Q^\theta_{z,u}} \left[ \log \frac{\mathrm{d}Q^\theta_{z|u}}{\mathrm{d}P_z}(z) + \log \frac{\mathrm{d}\Lambda}{\mathrm{d}P^\psi_{u|z}}(u) + \log \frac{\mathrm{d}\Upsilon}{\mathrm{d}\Lambda}(u) \right] \\
&= \mathbb{E}_{u \sim \Upsilon} \left[ \mathbb{E}_{z \sim Q^\theta_{z|u}} \left[ -\log \frac{\mathrm{d}P^\psi_{u|z}}{\mathrm{d}\Lambda}(u) \right] + D_{\mathrm{KL}}\big( Q^\theta_{z|u} \,\big\|\, P_z \big) \right] + D_{\mathrm{KL}}(\Upsilon \| \Lambda),
\end{align*}

where all terms are finite as a consequence of Assumptions 5(a)(ii) and (b).

Since $D_{\mathrm{KL}}(\Upsilon \| \Lambda)$ is assumed to be finite and depends on neither $\theta$ nor $\psi$, Theorem 6 shows that the joint KL divergence (5) is equivalent, up to a finite constant, to

\[ \text{(FVAE objective)} \quad J^{\mathrm{FVAE}}(\theta, \psi) = \mathbb{E}_{u \sim \Upsilon}\big[ L(u; \theta, \psi) \big]. \tag{8} \]

In particular, minimising $J^{\mathrm{FVAE}}$ is equivalent to minimising the joint divergence, and $J^{\mathrm{FVAE}}(\theta, \psi)$ is finite if and only if $D_{\mathrm{KL}}(Q^\theta_{z,u} \| P^\psi_{z,u})$ is finite. Moreover (8) can be approximated using samples from $\Upsilon$:

\[ \text{(empirical FVAE objective)} \quad J^{\mathrm{FVAE}}_N(\theta, \psi) = \frac{1}{N} \sum_{n=1}^N L\big( u^{(n)}; \theta, \psi \big). \tag{9} \]

In the limit of infinite data, the empirical objective (9) converges to (8); but both (8) and (9) may be infinite in many practical settings, as we shall see in the following sections.

Remark 7
(a) Assumption 5(a) ensures that the density $\mathrm{d}P^\psi_{u|z}/\mathrm{d}\Lambda$ in the per-sample loss exists, and that minimising (5) and (8) is equivalent; Assumption 5(b) ensures that, when the joint divergence (5) is finite, the per-sample loss $L$ is finite for $\Upsilon$-almost all $u \in U$. We could also formulate Assumption 5(a) with a $\sigma$-finite reference measure $\Lambda$, e.g., Lebesgue measure, but for our theory it suffices to consider probability measures.

(b) The proof of Theorem 6 shows why we take $Q^\theta_{z,u}$ as the first argument and $P^\psi_{z,u}$ as the second in (5) to obtain a tractable objective.
Reversing the arguments in the divergence gives an expectation with respect to the joint decoder model (3b):

\[ D_{\mathrm{KL}}\big( P^\psi_{z,u} \,\big\|\, Q^\theta_{z,u} \big) = \mathbb{E}_{(z,u) \sim P^\psi_{z,u}} \left[ \log \frac{\mathrm{d}P_z}{\mathrm{d}Q^\theta_{z|u}}(z) + \log \frac{\mathrm{d}P^\psi_{u|z}}{\mathrm{d}\Lambda}(u) \right] + \mathbb{E}_{(z,u) \sim P^\psi_{z,u}} \left[ \log \frac{\mathrm{d}\Lambda}{\mathrm{d}\Upsilon}(u) \right]. \]

Unlike in Theorem 6, the term involving $\mathrm{d}\Lambda/\mathrm{d}\Upsilon$ is not a constant: it depends on the parameter $\psi$ and would therefore need to be evaluated during the optimisation process, but this is intractable because we have only samples from $\Upsilon$ and not its density.

2.2 Objective in Finite Dimensions

We now show that the theory in Section 2.1 simplifies in finite dimensions to the usual VAE objective. To do so we assume $U = \mathbb{R}^k$ and that $\Upsilon \in \mathcal{P}(U)$ has strictly positive probability density $\upsilon \colon U \to (0, \infty)$. We moreover assume that $Z = \mathbb{R}^{d_Z}$ and that the latent distribution and the distributions returned by the encoder (1a) and decoder (1b) are Gaussian, taking the form

\[ Q^\theta_{z|u} = N\big( f(u; \theta), \Sigma(u; \theta) \big) = f(u; \theta) + \Sigma(u; \theta)^{\frac12} N(0, I_Z), \tag{10a} \]
\[ P^\psi_{u|z} = N\big( g(z; \psi), \beta I_U \big) = g(z; \psi) + \beta^{\frac12} N(0, I_U), \tag{10b} \]
\[ P_z = N(0, I_Z), \tag{10c} \]

where $\beta > 0$ is fixed and the parameters of (10a) and (10b) are given by learnable maps

\[ \mathsf{f} = (f, \Sigma) \colon U \times \Theta \to Z \times S_+(Z), \tag{11a} \]
\[ g \colon Z \times \Psi \to U, \tag{11b} \]

with $S_+(Z)$ denoting the set of positive-semidefinite matrices on $Z$. The Gaussian model (10a)-(10c) is the standard setting in which VAEs are applied, resulting in the joint decoder model (3b) being the distribution of $(z, u)$ in the model

\[ u \mid z = g(z; \psi) + \eta, \qquad z \sim N(0, I_Z), \qquad \eta \sim P_\eta = \beta^{\frac12} N(0, I_U). \tag{12} \]

Other decoder models are also possible (Kingma and Welling, 2014), and in infinite dimensions we will consider a wide class of decoders, including Gaussians as a particular case. In practice (11a) and (11b) will come from a parametrised class of functions, e.g., a class of neural networks. Provided these classes are sufficiently large and the data distribution has finite information with respect to $P_\eta$, the joint divergence (5) is finite for at least one choice of $\theta$ and $\psi$.

Proposition 8 Suppose that for parameters $\theta^\star \in \Theta$ and $\psi^\star \in \Psi$ we have $\mathsf{f}(u; \theta^\star) = (0, I_Z)$ and $g(z; \psi^\star) = 0$. Then $D_{\mathrm{KL}}\big( Q^{\theta^\star}_{z,u} \,\big\|\, P^{\psi^\star}_{z,u} \big) = D_{\mathrm{KL}}(\Upsilon \| P_\eta)$. In particular, if $D_{\mathrm{KL}}(\Upsilon \| P_\eta) < \infty$, then the joint divergence (5) has finite infimum.

Proof Evaluating the joint encoder model (3a) and the joint decoder model (3b) at $\theta^\star$ and $\psi^\star$ gives $Q^{\theta^\star}_{z,u}(\mathrm{d}z, \mathrm{d}u) = P_z(\mathrm{d}z)\,\Upsilon(\mathrm{d}u)$ and $P^{\psi^\star}_{z,u}(\mathrm{d}z, \mathrm{d}u) = P_z(\mathrm{d}z)\,P_\eta(\mathrm{d}u)$, so

\[ D_{\mathrm{KL}}\big( Q^{\theta^\star}_{z,u} \,\big\|\, P^{\psi^\star}_{z,u} \big) = \mathbb{E}_{u \sim \Upsilon}\, \mathbb{E}_{z \sim P_z} \left[ \log \frac{\mathrm{d}P_z}{\mathrm{d}P_z}(z) + \log \frac{\mathrm{d}\Upsilon}{\mathrm{d}P_\eta}(u) \right] = D_{\mathrm{KL}}(\Upsilon \| P_\eta). \]

Remark 9 In finite dimensions, many data distributions satisfy $D_{\mathrm{KL}}(\Upsilon \| P_\eta) < \infty$. However there are cases in which this fails, e.g., when $\Upsilon$ is a Cauchy distribution on $\mathbb{R}$. In such cases (5) is infinite even with the trivial maps $\mathsf{f}(u; \theta^\star) = (0, I_Z)$ and $g(z; \psi^\star) = 0$.

Training Objective. In Section 2.1 we proved that the joint divergence (5) can be approximated by the tractable objective (8) and its empiricalisation (9) whenever there is a reference distribution $\Lambda$ satisfying Assumption 5. We now show that taking $\Lambda = \Upsilon$ satisfies this assumption and results in a training objective equivalent to maximising the ELBO. Recall the data density $\upsilon$ associated with the data measure $\Upsilon$.

Proposition 10 Under the model (10a)-(10c) with reference distribution $\Lambda = \Upsilon$, Assumption 5 is satisfied, and, for some finite constant $C > 0$ independent of $u$, $\theta$, and $\psi$,

\[ L(u; \theta, \psi) = \mathbb{E}_{z \sim Q^\theta_{z|u}} \big[ -\log p^\psi_{u|z}(u) \big] + \log \upsilon(u) + D_{\mathrm{KL}}\big( Q^\theta_{z|u} \,\big\|\, P_z \big) \tag{13a} \]
\[ \phantom{L(u; \theta, \psi)} = \mathbb{E}_{z \sim Q^\theta_{z|u}} \Big[ (2\beta)^{-1} \big\| g(z; \psi) - u \big\|_2^2 \Big] + \log \upsilon(u) + D_{\mathrm{KL}}\big( Q^\theta_{z|u} \,\big\|\, P_z \big) + C. \tag{13b} \]

Proof For Assumption 5(a), mutual absolute continuity of $P^\psi_{u|z}$ and $\Lambda$ follows as both distributions have strictly positive densities, and evidently $D_{\mathrm{KL}}(\Upsilon \| \Lambda) = 0$.
Assumption 5(b) holds since both the encoder distribution (10a) and the latent distribution (10c) are Gaussian, with KL divergence available in closed form (Remark 11). The expression for $L$ follows from (7b) using that $P^\psi_{u|z}$ is Gaussian with density $p^\psi_{u|z}$ and $\mathrm{d}P^\psi_{u|z}/\mathrm{d}\Upsilon(u) = p^\psi_{u|z}(u)/\upsilon(u)$.

While the per-sample loss (13a) involves the unknown density $\upsilon$, we can drop this without affecting the objective (8) if $\Upsilon$ has finite differential entropy $\mathbb{E}_{u \sim \Upsilon}\big[ -\log \upsilon(u) \big]$. This follows because

\[ J^{\mathrm{FVAE}}(\theta, \psi) = \mathbb{E}_{u \sim \Upsilon}\big[ L^{\mathrm{VAE}}(u; \theta, \psi) \big] + \mathbb{E}_{u \sim \Upsilon}\big[ \log \upsilon(u) \big] + C, \tag{14a} \]
\[ L^{\mathrm{VAE}}(u; \theta, \psi) = \mathbb{E}_{z \sim Q^\theta_{z|u}} \Big[ (2\beta)^{-1} \big\| g(z; \psi) - u \big\|_2^2 \Big] + D_{\mathrm{KL}}\big( Q^\theta_{z|u} \,\big\|\, P_z \big). \tag{14b} \]

Thus $J^{\mathrm{VAE}}(\theta, \psi) = \mathbb{E}_{u \sim \Upsilon}\big[ L^{\mathrm{VAE}}(u; \theta, \psi) \big]$ is equivalent, up to a finite constant, to $J^{\mathrm{FVAE}}(\theta, \psi)$, and $J^{\mathrm{VAE}}$ is tractable. Requiring that $\Upsilon$ has finite differential entropy is a mild condition, and one can expect this to be the case for the vast majority of distributions arising in the finite-dimensional setting.

Remark 11 (VAEs as regularised autoencoders) We can write the divergence in (14b) in closed form as

\[ D_{\mathrm{KL}}\big( Q^\theta_{z|u} \,\big\|\, P_z \big) = \tfrac12 \Big( \big\| f(u; \theta) \big\|_2^2 + \operatorname{tr} \Sigma(u; \theta) - \log \det \Sigma(u; \theta) - d_Z \Big). \]

Applying the reparametrisation trick (Kingma and Welling, 2014) to write the expectation over $z \sim Q^\theta_{z|u}$ in terms of $\xi \sim N(0, I_Z)$, the interpretation of a VAE as a regularised autoencoder is clear:

\[ L^{\mathrm{VAE}}(u; \theta, \psi) = \mathbb{E}_{\xi \sim N(0, I_Z)} \Big[ (2\beta)^{-1} \big\| g\big( f(u; \theta) + \Sigma(u; \theta)^{\frac12} \xi;\; \psi \big) - u \big\|_2^2 \Big] + \tfrac12 \Big( \big\| f(u; \theta) \big\|_2^2 + \operatorname{tr} \Sigma(u; \theta) - \log \det \Sigma(u; \theta) - d_Z \Big). \]

Remark 12 (The evidence lower bound) The usual derivation of VAEs specifies the decoder model (12) and performs variational inference on the posterior for $z \mid u$, seeking to maximise a lower bound, the ELBO, on the likelihood of the data. Denoting by $p^\psi_u$ the density of the generative model $P^\psi_u$, by $q^\theta_{z|u}$ the density of $Q^\theta_{z|u}$, and so forth, the log-likelihood of $u \in U$ is

\[ \log p^\psi_u(u) = D_{\mathrm{KL}}\big( q^\theta_{z|u} \,\big\|\, p^\psi_{z|u} \big) + \mathrm{ELBO}(u; \theta, \psi), \tag{15a} \]
\[ \mathrm{ELBO}(u; \theta, \psi) = \mathbb{E}_{z \sim q^\theta_{z|u}} \big[ \log p^\psi_{u|z}(u) \big] - D_{\mathrm{KL}}\big( q^\theta_{z|u} \,\big\|\, p_z \big). \tag{15b} \]

This demonstrates that $\mathrm{ELBO}(u; \theta, \psi)$ is indeed a lower bound on the log-data likelihood under the decoder model. Minimising our per-sample loss (13a) is equivalent to maximising the ELBO (15b), but, notably, our underlying objective is shown to correspond exactly to the underlying KL divergence (5) and avoids the use of any bounds on the data likelihood.

Remark 13 An interpretation of the VAE objective as minimisation of (5) is also adopted by Kingma (2017, Sec. 2.8) and Kingma and Welling (2019). Our approach differs by writing the joint encoder model (3a) in terms of $\Upsilon$ rather than the empirical distribution

\[ \text{(empirical data distribution)} \quad \Upsilon_N = \frac{1}{N} \sum_{n=1}^N \delta_{u^{(n)}}. \]

Since $\Upsilon_N$ is not absolutely continuous with respect to the generative model $P^\psi_u$ of (10a)-(10c), using this would result in the joint divergence (5) being infinite for all $\theta$ and $\psi$.

2.3 Objective in Infinite Dimensions

We return to the setting of Assumption 1 and adopt a generalisation of the Gaussian model (10a)-(10c), with distributional parameters given by learnable maps $\mathsf{f} = (f, \Sigma) \colon U \times \Theta \to Z \times S_+(Z)$ and $g \colon Z \times \Psi \to U$:

\[ Q^\theta_{z|u} = N\big( f(u; \theta), \Sigma(u; \theta) \big) = f(u; \theta) + \Sigma(u; \theta)^{\frac12} N(0, I_Z), \tag{16a} \]
\[ P^\psi_{u|z} = P_\eta\big( \cdot - g(z; \psi) \big) = g(z; \psi) + P_\eta, \tag{16b} \]
\[ P_\eta \in \mathcal{P}(U), \tag{16c} \]
\[ P_z = N(0, I_Z). \tag{16d} \]

The only change from the finite-dimensional model is in the decoder distribution (16b), which we now write as the shift of the decoder-noise distribution $P_\eta$ by the mean $g(z; \psi)$. As we will see shortly, the choice of decoder noise will be very important in infinite dimensions, and restricting attention solely to Gaussian white noise will no longer be feasible.
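Before examining the obstacles that arise in infinite dimensions, it may help to record the finite-dimensional training loss which the model (16a)-(16d) generalises. The following is a minimal NumPy sketch of (14b), combining a Monte Carlo reconstruction term (via the reparametrisation trick) with the closed-form Gaussian KL of Remark 11; the maps `f_mean`, `f_logvar`, and `g` are hypothetical stand-ins for the learnable encoder and decoder, not the paper's implementation.

```python
# Sketch of the finite-dimensional per-sample loss (14b) with a diagonal
# encoder covariance, parametrised by its log-diagonal.
import numpy as np

rng = np.random.default_rng(0)
beta = 0.1

def loss_vae(u, f_mean, f_logvar, g, n_mc=8):
    mu, logvar = f_mean(u), f_logvar(u)            # parameters of Q^theta_{z|u}
    # Monte Carlo estimate of E_{z ~ Q}[ (2 beta)^{-1} ||g(z) - u||^2 ]
    recon = 0.0
    for _ in range(n_mc):
        z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
        recon += np.sum((g(z) - u) ** 2) / (2.0 * beta)
    recon /= n_mc
    # D_KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form (Remark 11)
    kl = 0.5 * np.sum(mu ** 2 + np.exp(logvar) - logvar - 1.0)
    return recon + kl

# Toy usage with linear maps (hypothetical, for illustration only)
d_u, d_z = 8, 2
u = rng.standard_normal(d_u)
print(loss_vae(u,
               f_mean=lambda u: u[:d_z],
               f_logvar=lambda u: -4.0 * np.ones(d_z),
               g=lambda z: np.tile(z, d_u // d_z)))
```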
Obstacles in Infinite Dimensions. Many new issues arise in infinite dimensions, necessitating a more careful treatment of the FVAE objective; for example, we can no longer work with probability density functions, since there is no uniform measure analogous to Lebesgue measure (Sudakov, 1959). The fundamental obstacle, however, is that, unlike in finite dimensions, the joint divergence (5) is often ill defined, satisfying $D_{\mathrm{KL}}\big( Q^\theta_{z,u} \,\big\|\, P^\psi_{z,u} \big) = \infty$ for all $\theta \in \Theta$ and $\psi \in \Psi$. An important situation in which this arises is misspecification of the generative model $P^\psi_u$; consequently, great care is needed in the choice of decoder to avoid this issue.

To illustrate this we show that the extension of the white-noise model of (10a)-(10c) is ill defined in infinite dimensions: the resulting joint divergence (5) is always infinite. To do this we define white noise using a Karhunen–Loève expansion (Sullivan, 2015, Sec. 11.1).

Definition 14 Let $U = L^2([0,1])$ with orthonormal basis $e_j(x) = \sqrt{2} \sin(\pi j x)$, $j \in \mathbb{N}$. We say that the random variable $\eta$ is $L^2$-white noise if it has Karhunen–Loève expansion

\[ \text{($L^2$-white noise)} \quad \eta = \sum_{j \in \mathbb{N}} \xi_j e_j, \qquad \xi_j \overset{\text{i.i.d.}}{\sim} N(0, 1). \]

Proposition 15 Let $P_\eta$ be the distribution of the $L^2$-white noise $\eta$. Then $P_\eta$ is a probability distribution supported on the Sobolev space $H^s([0,1])$ if and only if $s < -1/2$; in particular $P_\eta$ assigns probability zero to $L^2([0,1])$. Moreover, these statements remain true for any shift $P_\eta(\cdot - h)$ of the distribution $P_\eta$ by $h \in L^2([0,1])$.

The proof, which makes use of the Borel–Cantelli lemma and the Kolmogorov two-series theorem, is stated in Appendix A.

Example 1 Let $U = L^2([0,1])$, fix a data distribution $\Upsilon \in \mathcal{P}(U)$ and take the model (16a)-(16d), where we further assume that $g(z; \psi) \in L^2([0,1])$ for all $z \in Z$ and $\psi \in \Psi$, and with $P_\eta$ taken to be the distribution of $L^2$-white noise. Under this model, the joint divergence (5) is infinite for all $\theta$ and $\psi$. To see this, note that $\Upsilon$ assigns probability one to $U$, but, as $P^\psi_{u|z}$ assigns zero probability to $U$ by Proposition 15, we have

\[ P^\psi_u(U) = \int_Z P^\psi_{u|z}(U)\, P_z(\mathrm{d}z) = 0. \]

Thus $\Upsilon$ is not absolutely continuous with respect to $P^\psi_u$, and so $D_{\mathrm{KL}}\big( Q^\theta_{z,u} \,\big\|\, P^\psi_{z,u} \big) = \infty$.

The VANO model (Seidman et al., 2023), which we discuss in detail in the related work (Section 5), adopts the setting of (16a)-(16d) with white decoder noise. It thus suffers from not being well defined. The issues in Example 1 stem from the difference in regularity between the data, which lie in $L^2([0,1])$, and draws from the generative model, which lie in $H^s([0,1])$, $s < -1/2$, and not in $L^2([0,1])$. A difference in regularity is not the only possible issue: even two Gaussians supported on the same space need not be absolutely continuous, owing to the Feldman–Hájek theorem (Bogachev, 1998, Ex. 2.7.4). But, as in Proposition 8, we can state a sufficient condition for the divergence (5) to have finite infimum.

Proposition 16 Suppose that $\mathsf{f}(u; \theta^\star) = (0, I_Z)$ and $g(z; \psi^\star) = 0$ for some $\theta^\star \in \Theta$ and $\psi^\star \in \Psi$, and that $D_{\mathrm{KL}}(\Upsilon \| P_\eta) < \infty$. Then (5) has finite infimum.

Well-Defined Objective for Specific Problem Classes. The issues we have seen suggest that one must choose the decoder model based on the structure of the data distribution: fixing a decoder model a priori typically results, in infinite dimensions, in the joint divergence being infinite. However, we now give examples to show that, in important classes of problems arising in science and engineering, there is a clear choice of decoder noise $P_\eta$ arising from the problem structure.
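Proposition 15 can also be checked numerically: the expected squared $L^2$-norm of a $J$-term truncation of the Karhunen–Loève expansion equals $J$, so white-noise samples leave $L^2([0,1])$ in the limit $J \to \infty$. A short NumPy sketch, with grid resolution chosen for illustration:

```python
# Truncations eta_J = sum_{j<=J} xi_j e_j with e_j(x) = sqrt(2) sin(pi j x)
# have E ||eta_J||_{L^2}^2 = J, so the L^2-norm diverges as J grows.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 2049)

for J in [16, 64, 256, 1024]:
    xi = rng.standard_normal(J)
    basis = np.sqrt(2.0) * np.sin(np.pi * np.outer(np.arange(1, J + 1), x))
    eta_J = xi @ basis              # truncated Karhunen-Loeve sum on the grid
    l2_sq = np.mean(eta_J ** 2)     # quadrature approximation of ||eta_J||_{L^2}^2
    print(f"J = {J:5d}   ||eta_J||^2 ~ {l2_sq:8.1f}   (expected value {J})")
```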
Adopting this decoder noise and taking the reference distribution $\Lambda = P_\eta$ will ensure that the hypotheses of Theorem 6 are satisfied: the joint divergence can be shown to have finite infimum, and Assumption 5 holds. Thus we can apply the actionable algorithms derived in Section 2.1.

2.3.1 SDE Path Distributions

One class of data to which we can apply FVAE arises in the study of random dynamical systems (E et al., 2004). We choose $\Upsilon$ to be the distribution over paths of the diffusion process defined, for a standard Brownian motion $(w_t)_{t \in [0,T]}$ on $\mathbb{R}^m$ and $\varepsilon > 0$, by

\[ \mathrm{d}u_t = b(u_t)\, \mathrm{d}t + \sqrt{\varepsilon}\, \mathrm{d}w_t, \qquad u_0 = 0, \qquad t \in [0, T]. \tag{17} \]

We assume that the drift $b \colon \mathbb{R}^m \to \mathbb{R}^m$ is regular enough that (17) is well defined. The theory we outline also applies to systems with anisotropic diffusion and time-dependent coefficients (Särkkä and Solin, 2019, Sec. 7.3), and to systems with nonzero initial condition, but we focus on the setting (17) for simplicity. The path distribution $\Upsilon$ is defined on the space $U = C_0([0,T], \mathbb{R}^m)$ of continuous functions $u \colon [0,T] \to \mathbb{R}^m$ with $u(0) = 0$.

Recall (16b), where we define the decoder distribution $P^\psi_{u|z}$ as the shift of the noise distribution $P_\eta$ by $g(z; \psi)$, and recall Assumption 5, which in particular demands that $D_{\mathrm{KL}}(\Upsilon \| P_\eta) < \infty$ and that $P^\psi_{u|z}$ is mutually absolutely continuous with $P_\eta$. Here we choose $P_\eta$ to be the law of an auxiliary diffusion process, which in our examples will be an Ornstein–Uhlenbeck (OU) process. This auxiliary process must have zero initial condition and must have the same noise structure as that in (17) to ensure that $D_{\mathrm{KL}}(\Upsilon \| P_\eta) < \infty$. We will learn the shift $g$, and this will have to satisfy the zero initial condition to ensure that $P^\psi_{u|z}$ is mutually absolutely continuous with $P_\eta$.

Decoder-Noise Distribution $P_\eta$. We take $P_\eta$ to be the law of the auxiliary SDE

\[ \mathrm{d}\eta_t = c(\eta_t)\, \mathrm{d}t + \sqrt{\varepsilon}\, \mathrm{d}w_t, \qquad \eta_0 = 0, \qquad t \in [0, T], \tag{18} \]

with drift $c \colon \mathbb{R}^m \to \mathbb{R}^m$, and with $\varepsilon > 0$ being the same in both (17) and (18). To determine whether $D_{\mathrm{KL}}(\Upsilon \| P_\eta) < \infty$, we will need to evaluate the term $\mathrm{d}\Upsilon/\mathrm{d}P_\eta$ in the KL divergence; an expression for this is given by the Girsanov theorem (Liptser and Shiryaev, 2001, Chap. 7).

Proposition 17 (Girsanov theorem) Suppose that $U = C_0([0,T], \mathbb{R}^m)$ and that $\mu \in \mathcal{P}(U)$ and $\nu \in \mathcal{P}(U)$ are the laws of the $\mathbb{R}^m$-valued diffusions

\[ \mathrm{d}u_t = p(u_t)\, \mathrm{d}t + \sqrt{\varepsilon}\, \mathrm{d}w_t, \qquad u_0 = 0, \qquad t \in [0, T], \]
\[ \mathrm{d}v_t = q(v_t)\, \mathrm{d}t + \sqrt{\varepsilon}\, \mathrm{d}w_t, \qquad v_0 = 0, \qquad t \in [0, T]. \]

Suppose that the Novikov condition (Øksendal, 2003, eq. (8.6.8)) holds for both processes:

\[ \mathbb{E}_{u \sim \mu} \left[ \exp\left( \frac{1}{2\varepsilon} \int_0^T \| p(u_t) \|_2^2\, \mathrm{d}t \right) \right] < \infty \quad \text{and} \quad \mathbb{E}_{v \sim \nu} \left[ \exp\left( \frac{1}{2\varepsilon} \int_0^T \| q(v_t) \|_2^2\, \mathrm{d}t \right) \right] < \infty. \tag{19} \]

Then $\mu$ and $\nu$ are mutually absolutely continuous, with

\[ \frac{\mathrm{d}\mu}{\mathrm{d}\nu}(u) = \exp\left( \frac{1}{2\varepsilon} \int_0^T \Big( \| q(u_t) \|_2^2 - \| p(u_t) \|_2^2 \Big)\, \mathrm{d}t - \frac{1}{\varepsilon} \int_0^T \big\langle q(u_t) - p(u_t), \mathrm{d}u_t \big\rangle \right). \tag{20} \]

The second integral in (20) is a stochastic integral with respect to $(u_t)_{t \in [0,T]}$ (Särkkä and Solin, 2019, Chap. 4). The Novikov condition suffices for our needs, but the theorem also holds under weaker conditions such as the Kazamaki condition (Liptser and Shiryaev, 2001, p. 249). Applying Proposition 17 to evaluate the KL divergence (Appendix A) yields

\[ D_{\mathrm{KL}}\big( \Upsilon \,\big\|\, P_\eta \big) = \mathbb{E}_{u \sim \Upsilon} \left[ \frac{1}{2\varepsilon} \int_0^T \big\| b(u_t) - c(u_t) \big\|_2^2\, \mathrm{d}t \right]. \]

Thus the condition $D_{\mathrm{KL}}(\Upsilon \| P_\eta) < \infty$ is satisfied quite broadly, e.g., if $b$ and $c$ are bounded.

Per-Sample Loss. With the condition $D_{\mathrm{KL}}(\Upsilon \| P_\eta) < \infty$ verified, it remains to choose $g$ to ensure that $P^\psi_{u|z}$ is mutually absolutely continuous with $P_\eta$, and to derive the corresponding density. Once this is done, we arrive at the actionable per-sample loss derived in Theorem 6. To do this we again apply the Girsanov theorem, using that, when $g(z; \psi) \in H^1([0,T])$ takes value zero at $t = 0$, the distribution $P^\psi_{u|z}$ is the law of the SDE

\[ \mathrm{d}v_t = g(z; \psi)'(t)\, \mathrm{d}t + c\big( v_t - g(z; \psi)(t) \big)\, \mathrm{d}t + \sqrt{\varepsilon}\, \mathrm{d}w_t, \qquad v_0 = 0, \qquad t \in [0, T]. \tag{21} \]
As we will parametrise $g(z; \psi)$ using a neural network, we can assume it to be in $C^1([0,T])$, and hence in $H^1([0,T])$; moreover we shall enforce that $g(z; \psi)(0) \approx 0$ through the loss, as described shortly. To be concrete, we restrict attention to decoder noise with $c(x) = -\kappa x$, making $(\eta_t)_{t \in [0,T]}$ an OU process if $\kappa > 0$ and Brownian motion if $\kappa = 0$. In this case the per-sample loss can be derived by applying the Girsanov theorem to (18) and (21).

Proposition 18 (SDE per-sample loss) Suppose that $c(x) = -\kappa x$, $\kappa \geq 0$, and that $g(z; \psi) \in H^1([0,T])$ with $g(z; \psi)(0) = 0$ for all $z \in Z$ and $\psi \in \Psi$. Then Assumption 5 holds and

\[ L(u; \theta, \psi) = \mathbb{E}_{z \sim Q^\theta_{z|u}} \left[ -\log \frac{\mathrm{d}P^\psi_{u|z}}{\mathrm{d}P_\eta}(u) \right] + D_{\mathrm{KL}}\big( Q^\theta_{z|u} \,\big\|\, P_z \big), \]

\[ \log \frac{\mathrm{d}P^\psi_{u|z}}{\mathrm{d}P_\eta}(u) = \frac{1}{\varepsilon} \int_0^T \big\langle g(z; \psi)'(t) + \kappa g(z; \psi)(t), \mathrm{d}u_t \big\rangle - \frac{1}{2\varepsilon} \int_0^T \Big( \big\| g(z; \psi)'(t) - \kappa\big( u(t) - g(z; \psi)(t) \big) \big\|_2^2 - \big\| \kappa u(t) \big\|_2^2 \Big)\, \mathrm{d}t. \]

In practice, we make two modifications to $L$. First, the initial condition $g(z; \psi)(0) = 0$ is not enforced exactly; instead, we add a Tikhonov-like zero-penalty term with regularisation parameter $\lambda > 0$ to favour $g(z; \psi)(0) \approx 0$. Second, to allow variation of the strength of the KL regularisation, we multiply the term $D_{\mathrm{KL}}(Q^\theta_{z|u} \| P_z)$ by a regularisation parameter $\beta > 0$. Setting $\beta \neq 1$ breaks the exact correspondence between the FVAE objective and the joint KL divergence (5), but can nevertheless be useful in computational practice (Higgins et al., 2017). This leads us to the SDE per-sample loss

\[ L^{\mathrm{SDE}}_{\lambda,\beta}(u; \theta, \psi) = \mathbb{E}_{z \sim Q^\theta_{z|u}} \left[ -\log \frac{\mathrm{d}P^\psi_{u|z}}{\mathrm{d}P_\eta}(u) + \lambda \big\| g(z; \psi)(0) \big\|_2^2 \right] + \beta\, D_{\mathrm{KL}}\big( Q^\theta_{z|u} \,\big\|\, P_z \big). \]
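Anticipating the discretisation of Section 2.4.3, the Girsanov log-density above can be evaluated on a time grid with a left-endpoint sum for the stochastic integral and a Riemann sum for the time integral. A minimal NumPy sketch for the one-dimensional case; `g_vals` and `dg_vals` (the decoded mean and its time derivative, e.g., obtained by automatic differentiation) are assumed given:

```python
# Sketch of the discretised log dP^psi_{u|z}/dP_eta for OU decoder noise
# c(x) = -kappa x (Proposition 18), on a grid t_0 < t_1 < ... < t_I.
import numpy as np

def log_density_sde(u, g_vals, dg_vals, t, eps, kappa):
    """Discretised log-density; all arrays are evaluated on the grid t."""
    dt = np.diff(t)
    du = np.diff(u)
    drift = dg_vals + kappa * g_vals                      # g'(t) + kappa g(t)
    stoch = np.sum(drift[:-1] * du) / eps                 # left-endpoint sum
    integrand = (dg_vals - kappa * (u - g_vals)) ** 2 - (kappa * u) ** 2
    lebesgue = np.sum(integrand[:-1] * dt) / (2.0 * eps)
    return stoch - lebesgue

# Toy usage with a hypothetical decoded mean g(t) = t and data u(t) = t^2
t = np.linspace(0.0, 5.0, 513)
print(log_density_sde(u=t**2, g_vals=t, dg_vals=np.ones_like(t), t=t,
                      eps=1.0, kappa=25.0))
```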
2.3.2 Posterior Distributions in Bayesian Inverse Problems

Our theory can also be applied to posterior distributions arising in Bayesian inverse problems (Stuart, 2010), which we illustrate through the following additive-noise inverse problem. Let $U$ be a separable Hilbert space with norm $\|\cdot\|_U$, let $Y = \mathbb{R}^{d_Y}$, and let $G \colon U \to Y$ be a (possibly nonlinear) observation operator. Suppose that $y \in Y$ is given by the model

\[ y = G(u) + \xi, \qquad u \sim \mu_0 \in \mathcal{P}(U), \qquad \xi \sim N(0, \Sigma) \in \mathcal{P}(Y), \tag{22} \]

with noise covariance $\Sigma \in S_+(Y)$, and with prior distribution $\mu_0 = N(0, \mathcal{C})$ having covariance operator $\mathcal{C} \colon U \to U$. Models of this type arise, for example, in both Eulerian and Lagrangian data assimilation problems in oceanography (Cotter et al., 2010).

Given an observation $y \in Y$ from (22), the Bayesian approach seeks to infer $u \in U$ by computing the posterior distribution $\mu^y \in \mathcal{P}(U)$ representing the distribution of $u \mid y$. In the setting of (22), $\mu^y$ has a density with respect to $\mu_0$ thanks to Bayes' rule (Dashti and Stuart, 2017, Theorem 14), taking the form

\[ \frac{\mathrm{d}\mu^y}{\mathrm{d}\mu_0}(u) = \frac{1}{Z(y)} \exp\big( -\Phi(u; y) \big), \qquad \Phi(u; y) = \tfrac12 \big\| G(u) - y \big\|_\Sigma^2, \qquad \|\cdot\|_\Sigma = \big\| \Sigma^{-1/2}\, \cdot\, \big\|_2, \]

where $Z(y) \in (0, 1]$ owing to the nonnegativity of $\Phi$. A simple calculation then reveals

\[ D_{\mathrm{KL}}\big( \mu^y \,\big\|\, \mu_0 \big) = \mathbb{E}_{u \sim \mu^y} \left[ \log \frac{\mathrm{d}\mu^y}{\mathrm{d}\mu_0}(u) \right] = \mathbb{E}_{u \sim \mu^y} \big[ -\log Z(y) - \Phi(u; y) \big] \leq -\log Z(y) < \infty. \tag{23} \]

Similar arguments apply quite generally for observation models other than (22), provided the resulting log-density $\log \mathrm{d}\mu^y/\mathrm{d}\mu_0$ satisfies suitable boundedness or integrability conditions.

We now assume that the data distribution $\Upsilon$ to be learned is the posterior $\mu^y$, and that we have samples from it. This setting could arise, for example, when attempting to generate further approximate samples from the posterior $\mu^y$, taking as data the output of a function-space MCMC method (Cotter et al., 2013), with the ambition of faster sampling under FVAE than under MCMC. Recall the definition of the decoder distribution $P^\psi_{u|z}$ in (16b). We take $P_\eta$ to be the prior $\mu_0$; this is a natural choice as (23) shows that $D_{\mathrm{KL}}(\Upsilon \| P_\eta) < \infty$. We next discuss the choice of shift $g$, and the per-sample loss that results from these choices.

Per-Sample Loss. Since $P_\eta$ and $P^\psi_{u|z}$ are Gaussian, we can use the Cameron–Martin theorem (Bogachev, 1998, Corollary 2.4.3) to derive conditions for their mutual absolute continuity. For this to be the case, the shift $g(z; \psi)$ must lie in the Cameron–Martin space $H(P_\eta) \subseteq U$. Before stating the theorem, we recall the following facts about Gaussian measures. The space $H(P_\eta)$ is Hilbert, and for fixed $h \in H(P_\eta)$, the $H(P_\eta)$-inner product $\langle h, \cdot \rangle_{H(P_\eta)}$ extends uniquely (up to equivalence $P_\eta$-almost everywhere) to a measurable linear functional (Bogachev, 1998, Theorem 2.10.11), denoted by $U \ni u \mapsto \langle h, u \rangle_{H(P_\eta)}$.

Proposition 19 (Cameron–Martin theorem) Let $P_\eta \in \mathcal{P}(U)$ be a Gaussian measure with Cameron–Martin space $H(P_\eta)$. Then $P^\psi_{u|z} = P_\eta(\cdot - g(z; \psi))$ is mutually absolutely continuous with $P_\eta$ if and only if $g(z; \psi) \in H(P_\eta)$, and

\[ \frac{\mathrm{d}P^\psi_{u|z}}{\mathrm{d}P_\eta}(u) = \exp\left( \big\langle g(z; \psi), u \big\rangle_{H(P_\eta)} - \tfrac12 \big\| g(z; \psi) \big\|_{H(P_\eta)}^2 \right). \tag{24} \]

Remark 20 The exponent in (24) should be viewed as the misfit $\frac12 \| g(z; \psi) - u \|_{H(P_\eta)}^2$ with the almost-surely-infinite term $\frac12 \| u \|_{H(P_\eta)}^2$ subtracted (Stuart, 2010, Remark 3.8). When $P_\eta$ is Brownian motion on $\mathbb{R}$, for example, $\langle g(z; \psi), u \rangle_{H(P_\eta)}$ is a stochastic integral and $H(P_\eta) = H^1([0,T])$; this is implicit in the calculations underlying Proposition 18. When $P_\eta$ is $L^2$-white noise, $H(P_\eta) = L^2([0,1])$.

When $g$ takes values in $H(P_\eta)$, we can use the Cameron–Martin theorem to write down the per-sample loss explicitly, since $H(P_\eta) = \mathcal{C}^{1/2} U$, with $\| h \|_{H(P_\eta)} = \| \mathcal{C}^{-1/2} h \|_U$ and $\langle h, u \rangle_{H(P_\eta)} = \langle \mathcal{C}^{-1/2} h, \mathcal{C}^{-1/2} u \rangle_U$ for $h \in H(P_\eta)$ and $u \in U$.

Proposition 21 (Bayesian inverse problem per-sample loss) Suppose that $g(z; \psi) \in H(P_\eta)$ for all $z \in Z$ and $\psi \in \Psi$. Then Assumption 5 holds and

\[ L^{\mathrm{BIP}}(u; \theta, \psi) = \mathbb{E}_{z \sim Q^\theta_{z|u}} \Big[ \tfrac12 \big\| \mathcal{C}^{-1/2} g(z; \psi) \big\|_U^2 - \big\langle \mathcal{C}^{-1/2} g(z; \psi), \mathcal{C}^{-1/2} u \big\rangle_U \Big] + D_{\mathrm{KL}}\big( Q^\theta_{z|u} \,\big\|\, P_z \big). \]

Depending on the choice of $P_\eta$, the condition $g(z; \psi) \in H(P_\eta)$ may follow immediately, e.g., when $P_\eta$ is Brownian motion and $g$ is parametrised by a neural network.

2.4 Architecture and Algorithms

In practice, we do not have access to training data $\{u^{(n)}\}_{n=1}^N \subset U$; instead we have access to finite-dimensional discretisations $\mathsf{u}^{(n)}$. We would like to evaluate the encoder and decoder, and to compute the empirical objective (9) for training, using only these discretisations. In our architectures we will assume that $U$ is a Banach space of functions, evaluable pointwise almost everywhere, with domain $\Omega \subseteq \mathbb{R}^d$ and range $\mathbb{R}^m$; in this setting we will assume that the discretisation $\mathsf{u}$ of a function $u \in U$ consists of evaluations at $\{x_i\}_{i=1}^I \subset \Omega$:

\[ \text{(discretisation of function $u \in U$)} \quad \mathsf{u} = \big\{ \big( x_i, u(x_i) \big) \big\}_{i=1}^I. \]

Crucially, the number and location of mesh points may differ for each discretised sample: our aim is to allow for FVAE to be trained and evaluated across different resolutions, with data provided on sparse and potentially irregular meshes. We therefore discuss how to discretise the loss, and propose encoder/decoder architectures that can be evaluated on any mesh.

2.4.1 Encoder Architecture

The encoder $\mathsf{f}$ is a map from a function $u \colon \Omega \to \mathbb{R}^m$ to the parameters of the encoder distribution (16a): the mean $f(u; \theta) \in Z$ and covariance matrix $\Sigma(u; \theta) \in S_+(Z)$. We assume $\Sigma(u; \theta)$ is diagonal, so $\mathsf{f}$ need only return two vectors: the mean $f(u; \theta)$ and the log-diagonal of $\Sigma(u; \theta)$.
We thus define

\[ \mathsf{f}(u; \theta) = \rho\left( \int_\Omega \kappa\big( x, u(x); \theta \big)\, \mathrm{d}x;\; \theta \right) \in Z \times Z = \mathbb{R}^{2 d_Z}, \tag{25} \]

where $\kappa \colon \Omega \times \mathbb{R}^m \times \Theta \to \mathbb{R}^\ell$ is parametrised as a neural network with two hidden layers of width 64 and output dimension $\ell = 64$, using GELU activation (Hendrycks and Gimpel, 2016), and $\rho \colon \mathbb{R}^\ell \times \Theta \to \mathbb{R}^{2 d_Z}$ is parametrised as a linear layer $\rho(v; \theta) = W^\theta v + b^\theta$, with $W^\theta \in \mathbb{R}^{2 d_Z \times \ell}$ and $b^\theta \in \mathbb{R}^{2 d_Z}$. We augment $x \in \Omega$ with 16 random Fourier features (Appendix B.1) to aid learning of high-frequency features (Tancik et al., 2020).

After discretisation on data $\mathsf{u} = \{(x_i, u(x_i))\}_{i=1}^I$, in which we approximate the integral over $\Omega$ by a normalised sum, our architecture resembles set-to-vector maps such as deep sets (Zaheer et al., 2017), PointNet (Qi et al., 2017), and statistic networks (Edwards and Storkey, 2017), which take the form

\[ \big\{ \big( x_i, u(x_i) \big) \;\big|\; i = 1, 2, \ldots, I \big\} \mapsto \rho\Big( \mathrm{pool}\big( \big\{ \kappa\big( x_i, u(x_i); \theta \big) \;\big|\; i = 1, 2, \ldots, I \big\} \big);\; \theta \Big), \]

where pool is a pooling operation invariant to the order of its inputs, in our case the mean. Unlike these works we design our architecture for functions and only then discretise; we believe there is great potential to extend other point-cloud and set architectures similarly. Many other function-to-vector architectures have been proposed, e.g., the variable-input DeepONet (VIDON; Prasthofer et al., 2022), the mesh-independent neural operator (MINO; Lee, 2022) and continuum attention (Calvello et al., 2024), and our proposal is most similar to the linear-functional layer of Fourier neural mappings (Huang et al., 2025) and the neural functional of Rahman et al. (2022). These differ from our approach by preceding (25) with a neural operator; on the problems we consider, we find our encoder map to be equally expressive.

2.4.2 Decoder Architecture

The decoder $g$ is a map from a latent vector $z \in Z$ to a function $g(z; \psi) \colon \Omega \to \mathbb{R}^m$, which we parametrise using a coordinate neural network $\gamma \colon Z \times \Omega \times \Psi \to \mathbb{R}^m$ with 5 hidden layers of width 100, using GELU activation throughout, so that

\[ g(z; \psi)(x) = \gamma(z, x; \psi). \tag{26} \]

As before, we augment $x \in \Omega$ with 16 random Fourier features (Appendix B.1). Our proposed architecture allows for discretisation of the decoded function $g(z; \psi)$ on any mesh, and the cost of evaluating the decoder (26) grows linearly with the number of mesh points.

There are several related approaches in the literature to parametrise vector-to-function maps. Huang et al. (2025) lift the input by multiplying with a learnable constant function, then apply an operator architecture such as FNO. Seidman et al. (2023) propose both a DeepONet-inspired decoder using a linear combination of learnable basis functions, and a nonlinear decoder essentially the same as what we propose, which is also similar to the architectures of the nonlinear manifold decoder (NOMAD; Seidman et al., 2022) and PARANet (de Hoop et al., 2022). Also related are implicit neural representations (Sitzmann et al., 2020), in which one regresses on a fixed image using a coordinate neural network and treats the resulting weights as a resolution-independent representation of the data.
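The following NumPy sketch shows the discretised encoder (25) and decoder (26) side by side: the integral in (25) becomes a normalised sum (mean pooling) over an arbitrary mesh, and the decoder is a coordinate network evaluable at any query points. Layer sizes and initialisation are illustrative, and the random Fourier features and the paper's exact widths are omitted for brevity.

```python
# Mesh-invariant encoder (25) and coordinate-network decoder (26), discretised.
import numpy as np

rng = np.random.default_rng(0)
d_z, width = 2, 64

def mlp(params, v):
    """Tiny MLP; params is a list of (W, b) pairs, GELU on all but the last layer."""
    for W, b in params[:-1]:
        v = v @ W + b
        v = 0.5 * v * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (v + 0.044715 * v**3)))
    W, b = params[-1]
    return v @ W + b

def init(sizes):
    return [(0.1 * rng.standard_normal((m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

kappa = init([1 + 1, width, width])    # kappa(x, u(x)): (x, u(x)) -> R^width
rho = init([width, 2 * d_z])           # linear read-out to mean and log-variance
gamma = init([d_z + 1, width, 1])      # gamma(z, x): coordinate-network decoder

def encode(xs, us):
    """f(u) ~ rho( mean_i kappa(x_i, u(x_i)) ): works on any mesh {x_i}."""
    feats = mlp(kappa, np.stack([xs, us], axis=-1))   # (I, width)
    out = mlp(rho, feats.mean(axis=0))                # normalised sum = mean pooling
    return out[:d_z], out[d_z:]                       # mean, log-variance diagonal

def decode(z, xs):
    """g(z)(x) = gamma(z, x): evaluable at any query points xs."""
    zx = np.concatenate([np.tile(z, (len(xs), 1)), xs[:, None]], axis=-1)
    return mlp(gamma, zx)[:, 0]

xs = np.sort(rng.uniform(size=37))                    # irregular mesh
us = np.sin(2 * np.pi * xs)
mean, logvar = encode(xs, us)
print(decode(mean, np.linspace(0, 1, 101)).shape)     # evaluate on a finer mesh
```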
2.4.3 Discretisation of Per-Sample Loss

To discretise the per-sample losses derived in Section 2.3, we make two approximations. First, we approximate the expectation over $z \sim Q^\theta_{z|u}$ by Monte Carlo sampling (Kingma and Welling, 2014), with the number of samples viewed as a hyperparameter. Second, we approximate the integrals, norms, and inner products arising in the loss, as we now outline.

Per-Sample Loss $L^{\mathrm{SDE}}_{\lambda,\beta}$. Since the terms appearing in $L^{\mathrm{SDE}}_{\lambda,\beta}$ are integral functionals of the data and decoded functions, we can discretise on any partition $0 = t_0 < t_1 < \cdots < t_I = T$ and work with data discretised at any time step. The deterministic integral can be approximated by a normalised sum, and the stochastic integral can be discretised as

\[ \int_0^T \big\langle g(z; \psi)'(t), \mathrm{d}u_t \big\rangle \approx \sum_{i=1}^I \Big\langle g(z; \psi)'(t_{i-1}),\; u(t_i) - u(t_{i-1}) \Big\rangle, \]

which converges in probability in the limit $I \to \infty$ to the true stochastic integral (Särkkä and Solin, 2019, eq. (4.6)). Since the decoder will be a differentiable neural network, terms involving the derivative $g(z; \psi)'(t)$ can be computed using automatic differentiation; we find this to be much more stable than using a finite-difference approximation of the derivative.

Per-Sample Loss $L^{\mathrm{BIP}}$. For many Bayesian inverse problems, $U$ is a function space such as $L^2(\Omega)$, and so again the norms and inner products are integral functionals amenable to discretisation on any mesh. However, applying the operator $\mathcal{C}^{-1/2}$ is typically tractable only in special cases. One widely used setting is the one in which the eigenbasis of $\mathcal{C}$ is known and basis coefficients are readily computable; this arises when $\mathcal{C}$ is an inverse power of the Laplacian on a rectangle and a fast Fourier transform may be used.

2.5 Numerical Experiments

We now apply FVAE to two examples where $\Upsilon$ is an SDE path distribution. Both examples serve as prototypes for more complex problems such as those arising in molecular dynamics. For all experiments, we adopt the architecture of Section 2.4. A summary of conclusions to be drawn from the numerical experiments with these examples is as follows:

(a) FVAE captures properties of individual paths as well as ensemble properties of the data set, with the learned latent variables being physically interpretable (Section 2.5.1);

(b) choosing decoder noise that accurately reflects the stochastic variability in the data is essential to obtain a high-quality generative model (Section 2.5.1);

(c) FVAE is robust to changes of mesh in the encoder and decoder, enabling training with heterogeneous data and generative modelling at any resolution (Section 2.5.2).

We emphasise that in these experiments FVAE does this purely from data, with no knowledge of the data-generating process other than in the choice of decoder noise.

2.5.1 Brownian Dynamics

The Brownian dynamics model (see Schlick, 2010, Chap. 14), also known as the Langevin model, is a stochastic approximation of deterministic Newtonian models for molecular dynamics. In this model, the configuration $u_t$ (in some configuration space $X \subseteq \mathbb{R}^m$) of a molecule is assumed to follow the gradient flow of a potential $U \colon X \to \mathbb{R}$ perturbed by additive thermal noise with temperature $\varepsilon > 0$. This leads to the Langevin SDE

\[ \mathrm{d}u_t = -\nabla U(u_t)\, \mathrm{d}t + \sqrt{\varepsilon}\, \mathrm{d}w_t, \qquad t \in [0, T], \tag{27} \]

where $(w_t)_{t \in [0,T]}$ is a Brownian motion on $\mathbb{R}^m$. As a prototype for the more sophisticated, high-dimensional potentials arising in molecular dynamics, such as the Lennard–Jones potential (Schlick, 2010), we take $X = \mathbb{R}$ and consider the asymmetric double-well potential

\[ U(x) = 3x^4 + 2x^3 - 6x^2 - 6x. \tag{28} \]

This has a local minimum at $x_1 = -1$ and a global minimum at $x_2 = +1$ (Figure 1(a)). We take $\Upsilon$ to be the corresponding path distribution, with temperature $\varepsilon = 1$, final time $T = 5$, and initial condition $u_0 = x_1$. (The preceding developments fixed $u_0 = 0$ but are readily adapted to any fixed initial condition.)
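As a concrete instance of the data-generating process, the following sketch simulates (27) with the double-well potential (28) by the Euler–Maruyama method; the step size matches the stated time step 5/512, while the path count is illustrative.

```python
# Euler-Maruyama simulation of the Langevin SDE (27) with potential (28).
import numpy as np

rng = np.random.default_rng(0)
eps, T, n_steps = 1.0, 5.0, 512
dt = T / n_steps

def grad_U(x):
    # U(x) = 3x^4 + 2x^3 - 6x^2 - 6x  =>  U'(x) = 12x^3 + 6x^2 - 12x - 6
    return 12 * x**3 + 6 * x**2 - 12 * x - 6

def simulate(n_paths, u0=-1.0):
    u = np.full(n_paths, u0)
    path = [u.copy()]
    for _ in range(n_steps):
        u = u - grad_U(u) * dt + np.sqrt(eps * dt) * rng.standard_normal(n_paths)
        path.append(u.copy())
    return np.stack(path, axis=1)     # shape (n_paths, n_steps + 1)

paths = simulate(8)
print((paths[:, -1] > 0).mean())      # most paths end near the global minimum x2 = +1
```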
The training data set consists of 8,192 paths with time step 5/512 in $[0, T]$, and in each path 50% of time steps are missing (see Appendix B.2).

Figure 1: (a) Realisations of the SDE (27) follow the gradient flow of the potential $U$. (b) Sample paths $(u_t)_{t \in [0,T]} \sim \Upsilon$ begin at $x_1 = -1$ and transition with high probability to the lower-potential state $x_2 = +1$ as a result of the additive thermal noise.

Sample paths drawn from $\Upsilon$ start at $x_1$ and transition with very high probability to the potential-minimising state $x_2$; the time at which the transition begins is determined by the thermal noise, but, once the transition has begun, the manner in which the transition occurs is largely consistent across realisations. Such universal transition phenomena occur quite generally in the study of random dynamical systems, as a consequence of large-deviation theory (see E et al., 2004).

We train FVAE using the SDE loss (Section 2.3.1) with regularisation parameter $\beta = 1.2$ and zero-penalty scale $\lambda = 10$. Motivated by the observation that trajectories are determined chiefly by the transition time, we use latent dimension $d_Z = 1$.

Choice of Noise Process. The choice of decoder noise greatly affects FVAE's performance as an autoencoder and a generative model. To investigate this, we first train three instances of FVAE with different restoring forces $\kappa$ in the decoder-noise process. Then, to evaluate autoencoding performance, we draw samples $u \sim \Upsilon$ from the held-out set and compute the reconstruction $g(f(u; \theta); \psi)$, which is the mean of the decoder distribution $P^\psi_{u|z}$ with $z = f(u; \theta)$ taken to be the mean of the encoder distribution $Q^\theta_{z|u}$ (Figure 2(a)). To evaluate FVAE as a generative model, we draw samples from the latent distribution $P_z$ and display the mean $g(z; \psi)$ of the decoder distribution $P^\psi_{u|z}$ along with a shaded region indicating one standard deviation of the noise process (Figure 2(b)); moreover we draw samples $g(z; \psi) + \eta$ to illustrate their qualitative behaviour (Figure 2(c)).

Figure 2: The SDE loss gives much freedom in the choice of noise process $(\eta_t)_t$. The top row uses Brownian motion as the decoder noise; the second and third rows use OU processes ($\kappa = 25$ and $\kappa = 10000$) with different asymptotic variances. (a) Data $u \sim \Upsilon$ (red) and reconstructions $g(f(u; \theta); \psi)$ (black). (b) Distribution of $g(z; \psi) + \eta$ with one standard deviation of noise. (c) Samples $g(z; \psi) + \eta$ from $P^\psi_u$, $z \sim P_z$, $\eta \sim P_\eta$. While all choices lead to high-quality reconstructions, only the OU process with $\kappa = 25$ gives a generative model that agrees well with the data.

Using Brownian motion as the decoder noise leads to excellent reconstructions, but samples from the generative model appear different from the training data. By using OU-distributed noise with restoring force $\kappa > 0$ we obtain similar reconstructions to those achieved under Brownian motion, but samples from the generative model match the data distribution more closely. This is because the variance of Brownian motion grows unboundedly with time, while the asymptotic variance under the OU process is $\varepsilon/2\kappa$, better reflecting the behaviour of the data for well-chosen $\kappa$.
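Discretisation bias aside, the variance claim can be verified with a few lines of simulation; the $\kappa$ values follow the experiment above, while the path count is illustrative.

```python
# Under d eta_t = -kappa eta_t dt + sqrt(eps) dw_t the variance approaches
# eps/(2 kappa); Brownian motion (kappa = 0) has variance eps * t.
import numpy as np

rng = np.random.default_rng(0)
eps, T, n_steps, n_paths = 1.0, 5.0, 512, 20000
dt = T / n_steps

for kappa in [0.0, 25.0]:
    eta = np.zeros(n_paths)
    for _ in range(n_steps):
        eta += -kappa * eta * dt + np.sqrt(eps * dt) * rng.standard_normal(n_paths)
    target = eps * T if kappa == 0.0 else eps / (2 * kappa)
    print(f"kappa = {kappa:5.1f}: empirical var at T = {eta.var():.3f}, predicted = {target:.3f}")
```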
Figure 3: Reconstruction MSE on the held-out set against wall-clock time, for $\kappa = 0$, $\kappa = 25$, and $\kappa = 10000$: using OU noise ($\kappa > 0$) leads to faster training convergence.

On this data set, choosing a suitable noise process $(\eta_t)_t$ has the added benefit of significantly accelerating training: autoencoding mean-squared error (MSE) decreases much faster under OU noise ($\kappa > 0$) than under Brownian-motion noise (Figure 3). We expect the choice of noise should depend in general on properties of the data distribution, with the OU process being particularly suited to this data set; in the discussion that follows, we use an OU process with restoring force $\kappa = 25$.

Figure 4: (a) Decoded paths $g(z; \psi)(t)$, $z \in [-2.5, 2.5]$: the latent variable $z$ identified by FVAE corresponds to the first-crossing time $T_0$ of the decoded path $g(z; \psi)$. (b) Kernel density estimates of the distributions of $T_0$ under the FVAE generative model and under direct numerical simulation (16,384 samples each) closely agree.

Unsupervised Learning of Physically Relevant Quantities. Our choice of latent dimension $d_Z = 1$ was motivated by the heuristic that the time of the transition from $x_1 = -1$ to $x_2 = +1$ essentially determines the SDE trajectory. FVAE identifies this purely from data, with the learned latent variable $z \in Z$ being in correspondence with the transition time: larger values of $z$ map to paths $g(z; \psi)$ transitioning later in time (Figure 4(a)). To understand whether FVAE captures ensemble statistical properties of the data distribution, we compare the distributions of the first-crossing time

\[ T_0(u) = \inf\{ t > 0 \mid u_t \geq 0 \} \]

estimated using 16,384 paths from the generative model and 16,384 direct simulations, using kernel density estimates based on Gaussian kernels with bandwidths selected by Scott's rule (Scott, 2015, eq. (6.44)); we find that the two distributions closely agree (Figure 4(b)).

2.5.2 Estimation of Markov State Models

In practical applications of molecular dynamics, one is often interested in the evolution of large molecules on long timescales. For example, in the study of protein folding (Konovalov et al., 2021), it is of interest to capture the complex, multistage transitions of proteins between configurations. Moving beyond the toy one-dimensional problem in Section 2.5.1, the chief difficulty is the very high dimension of such systems, which makes simulations possible only on timescales orders of magnitude shorter than those of physical interest.

Markov state models (MSMs) offer one method of distilling many simulations on short timescales into a statistical model permitting much longer simulations (Husic and Pande, 2018). Assuming that the dynamics are given by a random process $(u_t)_{t \geq 0}$ taking values in the configuration space $X$, an MSM can be constructed by partitioning $X$ into disjoint state sets $X = X_1 \cup \cdots \cup X_p$, and, for some lag time $\tau > 0$, considering the discrete-time process $(U_k)_{k \in \mathbb{N}}$ for which $U_k = i$ if and only if $u_{k\tau} \in X_i$. One hopes that, if $\tau$ is sufficiently large, the process $(U_k)_{k \in \mathbb{N}}$ is approximately Markov, and thus its distribution can be characterised by learning the probabilities of transitioning in time $\tau$ from one state to another. These probabilities can be determined using the short-run simulations, which can be generated in parallel, and the resulting MSM can be used to simulate on longer timescales; a sketch of the state-assignment step follows below.
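A minimal sketch of the state-assignment step just described, assuming the nine-state partition used in the experiment below; the 3×3 labelling convention is one illustrative choice, not necessarily the paper's.

```python
# Map path values u_{k tau} to one of p = 9 regions cut out by the four
# lines x1 = +/-0.1 and x2 = +/-0.1.
import numpy as np

def state_index(point, cut=0.1):
    """Map a point in R^2 to a state in {0, ..., 8} via a 3x3 grid of regions."""
    def bin1d(c):
        return 0 if c < -cut else (1 if c <= cut else 2)
    return 3 * bin1d(point[1]) + bin1d(point[0])

def discretise_path(path_at_lag_times):
    """(U_k)_k for a path already sampled at the lag times k * tau."""
    return np.array([state_index(p) for p in path_at_lag_times])

print(discretise_path(np.array([[0.0, 0.0], [0.2, 0.2], [-0.2, 0.0]])))  # e.g. [4 8 3]
```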
Motivated by this application, we consider the problem of constructing an MSM from data provided at sparse or irregular intervals that do not necessarily align with the lag time $\tau$; in this case, computing the probability of transition in time $\tau$ directly may not be possible.

Figure 5: (a) Contour plot of the potential $U$ and the division of the state space $X$ into states $X_i$, $i = 1, \ldots, 9$. All paths start at $t = 0$ at the origin (in state 5). (b) Maximum-likelihood transition matrices with lag $\tau = 3/512$ computed using FVAE and through direct numerical simulation ($T^{\mathrm{DNS}}(\tau)$), both on the time interval $[0, 3]$.

We show the power of FVAE in this problem by first learning a generative model from the heterogeneously sampled data and then using the generative model to draw paths sampled at the regular time step $\tau$; constructing an MSM from these paths is then straightforward. We give an example based on the Brownian dynamics model (27) on $X = \mathbb{R}^2$ using a multiwell potential $U$ (Figure 5(a)), stated precisely in Appendix B.3, which we take to be a quadratic bowl, perturbed by a linear function to break the symmetry, and by six Gaussian densities to act as potential wells with minima at $(0, 0)$, $(0.2, 0.2)$, $(-0.2, 0.2)$, $(0.2, -0.2)$, $(0, -0.2)$ and $(-0.2, 0)$. We take $U = C_0([0,T], X)$ and let $\Upsilon \in \mathcal{P}(U)$ be the path distribution, with temperature $\varepsilon = 0.1$, final time $T = 3$ and initial condition $u_0 = 0$. The training data set consists of 16,384 paths discretised with time step 3/512, where, for each sample, it is assumed that 50% of steps are missing (details in Appendix B.3). We train FVAE using the SDE loss (Section 2.3.1) with $\kappa = 100$, $\lambda = 50$, $\beta = 0.02$, and latent dimension $d_Z = 16$.

Partitioning the Configuration Space. The states of an MSM can be selected manually using expert knowledge or automatically using variational or machine-learning methods (Mardt et al., 2018). For simplicity, we choose the states by hand, partitioning $X = \mathbb{R}^2$ into $p = 9$ disjoint regions (Figure 5(a)) divided by the four lines $x_1 = \pm 0.1$ and $x_2 = \pm 0.1$.

Estimating Transition Probabilities with FVAE. After training FVAE with irregularly sampled data, we draw samples from the generative model with regular time step $\tau$ and use these samples to compute the MSM transition probabilities. Setting aside the question of Markovianity for simplicity, we draw from the generative model $M = 2{,}048$ paths $\{v^{(m)}\}_{m=1}^M$ discretised on a mesh of $K = 513$ equally spaced points with time step $\tau = 3/512$ on $[0, T]$, and compute the count matrix

\[ C^{\mathrm{FVAE}}(\tau) = \big( C^{\mathrm{FVAE}}_{ij}(\tau) \big)_{i,j \in \{1, \ldots, p\}}, \qquad C^{\mathrm{FVAE}}_{ij}(\tau) = \sum_{m=1}^M \sum_{k=0}^{K-2} \mathbb{1}\Big[ v^{(m)}_{k\tau} \in X_i \text{ and } v^{(m)}_{(k+1)\tau} \in X_j \Big]. \]

We then derive the corresponding maximum-likelihood transition matrix $T^{\mathrm{FVAE}}(\tau)$ by normalising each row of $C^{\mathrm{FVAE}}(\tau)$ to sum to one; for simplicity we do not constrain the transition matrix to satisfy the detailed-balance condition (see Prinz et al., 2011, Sec. IV.D). The resulting transition matrix $T^{\mathrm{FVAE}}(\tau)$ agrees closely with the matrix $T^{\mathrm{DNS}}(\tau)$ computed analogously using 2,048 direct numerical simulations on the regular time step $\tau$ (Figure 5(b)).
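A matching sketch of the count-matrix and maximum-likelihood estimates above, reusing `discretise_path` from the previous sketch:

```python
# Accumulate the count matrix C(tau) over consecutive state pairs, then
# normalise each row to obtain the maximum-likelihood transition matrix.
import numpy as np

def transition_matrix(discrete_paths, p=9):
    C = np.zeros((p, p))
    for U in discrete_paths:          # U is the state sequence (U_k)_k of one path
        for i, j in zip(U[:-1], U[1:]):
            C[i, j] += 1.0
    rows = C.sum(axis=1, keepdims=True)
    return np.divide(C, rows, out=np.zeros_like(C), where=rows > 0)

# Usage: T = transition_matrix([discretise_path(v) for v in decoded_paths])
```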
3. Problems with VAEs in Infinite Dimensions

As we have seen in Section 2.1, the empirical FVAE objective $J^{\mathrm{FVAE}}_N(\theta,\psi)$ for the data set $\{u^{(n)}\}_{n=1}^N$ is based on a sequence of approximations and equalities:

$$\underbrace{\frac{1}{N}\sum_{n=1}^{N} L\bigl(u^{(n)};\theta,\psi\bigr)}_{=:\,J^{\mathrm{FVAE}}_N(\theta,\psi)} \;\overset{\text{(A)}}{\approx}\; \underbrace{\mathbb{E}_{u\sim\Upsilon}\bigl[L(u;\theta,\psi)\bigr]}_{=:\,J^{\mathrm{FVAE}}(\theta,\psi)} \;\overset{\text{(B)}}{=}\; D_{\mathrm{KL}}\bigl(\mathbb{Q}^{\theta}_{z,u}\,\big\|\,\mathbb{P}^{\psi}_{z,u}\bigr) - \underbrace{D_{\mathrm{KL}}(\Upsilon\,\|\,\Lambda)}_{\text{constant}}.$$

The approximation (A) is based on the law of large numbers and becomes an equality almost surely in the limit N → ∞. The equality (B) is true by Theorem 6, with a finite constant $D_{\mathrm{KL}}(\Upsilon\|\Lambda)$ provided that Assumption 5 holds; if the assumption does not hold, this constant may well be infinite. So, while it is tempting to apply the empirical objective $J^{\mathrm{FVAE}}_N$ without first checking the validity of (A) and (B), this strategy is fraught with pitfalls.

To illustrate this we apply FVAE in the white-noise setting of Example 1; this coincides with the setting of the VANO model (Seidman et al., 2023), which we discuss in detail in the related work (Section 5). In this example we derive the per-sample loss L and apply the resulting empirical objective $J^{\mathrm{FVAE}}_N$ for training; but both the joint divergence $D_{\mathrm{KL}}(\mathbb{Q}^{\theta}_{z,u}\|\mathbb{P}^{\psi}_{z,u})$ and the constant $D_{\mathrm{KL}}(\Upsilon\|\Lambda)$ turn out to be infinite. Consequently, the approximations (A) and (B) break down, which we see numerically: discretisations of $J^{\mathrm{FVAE}}_N$ appear to diverge as resolution is refined, suggesting that they have no continuum limit.

Example 2 Take $U = H^{-1}([0,1])$ and assume that Υ is the distribution of u in the model

$$\xi \sim \mathrm{Uniform}[0,1], \qquad u \mid \xi = \delta_\xi.$$

Realisations of u in this model lie in $H^{s}([0,1])$, $s < -1/2$, with probability one, so $\Upsilon \in \mathcal{P}(U)$. These data can be viewed as a prototype for rough behaviour such as the derivative of shock profiles arising in hyperbolic PDEs with random initial data. They will serve as an extreme example allowing us to isolate the numerical issues associated with using FVAE or VANO in the misspecified setting. Take the model (16a)–(16d) for real-valued functions on [0, 1], with $\mathbb{P}_\eta$ taken to be L²-white noise. As discussed in Proposition 15, $\mathbb{P}_\eta \in \mathcal{P}(H^s([0,1]))$ for $s < -1/2$, and $H(\mathbb{P}_\eta) = L^2([0,1])$. Fixing the reference distribution $\Lambda = \mathbb{P}_\eta$ and assuming g takes values in $L^2([0,1])$, the Cameron–Martin theorem (Proposition 19) ensures that the density $\mathrm{d}\mathbb{P}^{\psi}_{u|z}/\mathrm{d}\Lambda$ exists, and consequently

$$L(u;\theta,\psi) = \mathbb{E}_{z\sim\mathbb{Q}^{\theta}_{z|u}}\Bigl[\tfrac{1}{2}\bigl\|g(z;\psi)\bigr\|_{L^2}^2 - \bigl\langle g(z;\psi),\,u\bigr\rangle_{L^2}\Bigr] + D_{\mathrm{KL}}\bigl(\mathbb{Q}^{\theta}_{z|u}\,\big\|\,\mathbb{P}_z\bigr). \tag{29}$$

At this stage, we have not verified that the joint divergence $D_{\mathrm{KL}}(\mathbb{Q}^{\theta}_{z,u}\|\mathbb{P}^{\psi}_{z,u})$ has finite infimum, nor that Assumption 5 holds; indeed, we will see that both of these conditions fail.

Figure 6: (a) Schematic diagram of the reconstruction g(z; ψ), with learnable mean µ and learnable standard deviation σ, together with the discrete representation (squares) of the Dirac mass $\delta_\xi$ on a grid of 8 points. (b) In Example 2, the FVAE empirical objective ($\mathbb{P}_\eta = N(0, I)$; 50 runs) at the minimising parameters diverges as the resolution I ∈ {8, 16, 32, 64, 128} is increased. (c) To overcome this, we propose a regularised autoencoder, FAE, in Section 4. Repeating the experiment with the FAE objective ($U = H^{-1}([0,1])$; 50 runs) suggests that the FAE empirical objective has a well-defined continuum limit.

Nevertheless, we can attempt to train FVAE with the empirical objective resulting from (29). To do this we choose $Z = \mathbb{R}$ and use an encoder map f and a decoder map g tailored to this problem, since the architectures of Section 2.4 are not equipped to deal with functions of negative Sobolev regularity. We parametrise f as

$$f(u;\theta) = \rho\Bigl(\operatorname*{arg\,max}_{x\in[0,1]}\,(\varphi * u)(x);\,\theta\Bigr) \in Z\times Z = \mathbb{R}^2,$$

where ρ is a neural network and φ is a compactly supported smooth mollifier, chosen such that φ ∗ u has a well-defined maximum, and we parametrise g to return the Gaussian density

$$g(z;\psi)(x) = N\bigl(\mu(z;\psi),\,\sigma(z;\psi)^2;\,x\bigr),$$

with mean µ(z; ψ) ∈ [0, 1] and standard deviation σ(z; ψ) > 0 computed from a neural network (see Appendix B.4). In other words, f applies a neural network to the location of the maximum of u, while g returns a Gaussian density with learned mean and variance (Figure 6(a)).
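For concreteness, a minimal PyTorch sketch of such an encoder–decoder pair follows; the layer widths, the mollifier kernel, and the lower bound on σ are placeholders rather than the exact configuration of Appendix B.4, and gradients flow only through ρ, since the argmax location is piecewise constant in the input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiracEncoder(nn.Module):
    """f(u; θ) = ρ(argmax_x (φ*u)(x); θ): mollify u, locate the peak, and
    map its location through an MLP ρ to the mean and log-SD of the
    Gaussian variational distribution Q^θ_{z|u} on Z = R."""
    def __init__(self, width=128):
        super().__init__()
        self.rho = nn.Sequential(nn.Linear(1, width), nn.GELU(),
                                 nn.Linear(width, 2))

    def forward(self, x, u, kernel):
        # (φ*u): discrete convolution with a fixed mollifier kernel
        smoothed = F.conv1d(u.unsqueeze(1), kernel.view(1, 1, -1),
                            padding="same").squeeze(1)
        peak = x[smoothed.argmax(dim=-1)].unsqueeze(-1)  # location of max
        mean, log_sd = self.rho(peak).unbind(-1)
        return mean, log_sd

class DiracDecoder(nn.Module):
    """g(z; ψ)(x) = N(μ(z; ψ), σ(z; ψ)²; x): a Gaussian density with
    learned mean μ ∈ [0, 1] and standard deviation σ > 0."""
    def __init__(self, width=128, sigma_min=1e-3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, width), nn.GELU(),
                                 nn.Linear(width, 2))
        self.sigma_min = sigma_min

    def forward(self, z, x):
        mu_raw, sd_raw = self.net(z.unsqueeze(-1)).unbind(-1)
        mu = torch.sigmoid(mu_raw)[:, None]              # mean in [0, 1]
        sigma = (self.sigma_min + F.softplus(sd_raw))[:, None]
        return torch.exp(-0.5 * ((x[None, :] - mu) / sigma) ** 2) \
               / (sigma * (2 * torch.pi) ** 0.5)
```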
To investigate the behaviour of discretisations of $J^{\mathrm{FVAE}}_N$ as resolution is refined, we generate a sequence of data sets in which we discretise on a mesh of I ∈ {8, 16, 32, 64, 128} equally spaced points $\{i/(I+1)\}_{i=1,\dots,I} \subset [0,1]$. At each resolution we generate I training samples: one discretised Dirac function at each mesh point, normalised to have unit L¹-norm. We train 50 independent instances of FVAE at each resolution and record the value of the empirical objective at convergence (Figure 6(b)). Notably, the empirical objective appears to diverge as resolution is refined. This has two major causes:

(a) Υ is not absolutely continuous with respect to $\Lambda = \mathbb{P}_\eta$: the set $\{\delta_x \mid x \in [0,1]\}$ has probability zero under Λ but probability one under Υ; thus $D_{\mathrm{KL}}(\Upsilon\|\Lambda) = \infty$.

(b) Υ is not absolutely continuous with respect to $\mathbb{P}^{\psi}_u$, meaning that $D_{\mathrm{KL}}(\mathbb{Q}^{\theta}_{z,u}\|\mathbb{P}^{\psi}_{z,u}) = \infty$ for all θ and ψ. To see this, we again note that $\{\delta_x \mid x\in[0,1]\}$ has probability one under Υ, but, as a consequence of the Cameron–Martin theorem, $\mathbb{P}^{\psi}_{u|z}$ and $\mathbb{P}_\eta$ are mutually absolutely continuous, and thus, as $\mathbb{P}_\eta(\{\delta_x \mid x\in[0,1]\}) = 0$,

$$\mathbb{P}^{\psi}_u\bigl(\{\delta_x \mid x\in[0,1]\}\bigr) = \int_Z \mathbb{P}^{\psi}_{u|z}\bigl(\{\delta_x \mid x\in[0,1]\}\bigr)\,\mathbb{P}_z(\mathrm{d}z) = 0.$$

The problematic term in the per-sample loss is the measurable linear functional $\langle g(z;\psi), u\rangle_{L^2}$; this is defined only up to modification on $\mathbb{P}_\eta$-probability-zero sets, and yet we evaluate it on just such a set, namely the $\mathbb{P}_\eta$-probability-zero set $\{\delta_x \mid x\in[0,1]\}$.

Remark 22 The joint divergence $D_{\mathrm{KL}}(\mathbb{Q}^{\theta}_{z,u}\|\mathbb{P}^{\psi}_{z,u})$ would also be infinite if Υ were supported on $L^2([0,1])$, as seen in Example 1, but it is harder to observe any numerical issue in training. This is because the measurable linear functional $\langle g(z;\psi), u\rangle_{L^2}$ reduces to the usual L²-inner product, and so, even though the FVAE objective is not well defined, the per-sample loss can still be viewed as a regularised misfit (see Theorem 20):

$$L(u;\theta,\psi) = \mathbb{E}_{z\sim\mathbb{Q}^{\theta}_{z|u}}\Bigl[\tfrac{1}{2}\bigl\|g(z;\psi) - u\bigr\|_{L^2}^2\Bigr] + D_{\mathrm{KL}}\bigl(\mathbb{Q}^{\theta}_{z|u}\,\big\|\,\mathbb{P}_z\bigr) - \tfrac{1}{2}\|u\|_{L^2}^2.$$

As a result, one can reinterpret the objective as that of a regularised autoencoder (see Remark 11). This motivates our use of a regularised autoencoder, FAE, in Section 4.

4. Regularised Autoencoders on Function Space

To overcome the issues in applying VAEs in infinite dimensions, we set aside the probabilistic motivation for FVAE and define a regularised autoencoder in function space, the functional autoencoder (FAE), avoiding the need for onerous conditions on the data distribution. In Section 4.1, we state the FAE objective and make connections to the FVAE objective. Section 4.2 outlines minor adaptations to the FVAE encoder and decoder for use with FAE. Section 4.3 demonstrates FAE on two examples from the sciences: incompressible fluid flows governed by the Navier–Stokes equation, and porous-medium flows governed by Darcy's law.
4.1 Training Objective

Throughout Section 4 we make the following assumption on the data, postponing discussion of discretisation to Section 4.2. Unlike in Section 2, we do not need U to be separable, allowing us to consider data from an even wider variety of spaces, such as the (non-separable) space BV(Ω) of bounded-variation functions on $\Omega \subset \mathbb{R}^d$.

Assumption 23 Let $(U, \|\cdot\|)$ be a Banach space. There exists a data distribution $\Upsilon \in \mathcal{P}(U)$ from which we have access to N independent and identically distributed samples $\{u^{(n)}\}_{n=1}^N \subset U$.

To define our regularised autoencoder, we fix a latent space $Z = \mathbb{R}^{d_Z}$ and define encoder and decoder transformations f and g which, unlike in FVAE, return points rather than probability distributions:

$$\text{(encoder)} \quad U \ni u \mapsto f(u;\theta) \in Z, \tag{30a}$$
$$\text{(decoder)} \quad Z \ni z \mapsto g(z;\psi) \in U. \tag{30b}$$

We then take as our objective the sum of a misfit term between the data and its reconstruction, and a regularisation term, with regularisation parameter β > 0, on the encoded vectors:

$$\text{(FAE objective)} \quad J^{\mathrm{FAE}}_\beta(\theta,\psi) = \mathbb{E}_{u\sim\Upsilon}\Bigl[\tfrac{1}{2}\bigl\|g\bigl(f(u;\theta);\psi\bigr) - u\bigr\|^2 + \beta\,\|f(u;\theta)\|_2^2\Bigr]. \tag{31}$$

As in Section 2.1, the expectation over u ∼ Υ is approximated by an average over the training data, resulting in the empirical objective $J^{\mathrm{FAE}}_{\beta,N}$. There is great flexibility in the choice of regularisation term; we adopt the squared Euclidean norm $\|f(u;\theta)\|_2^2$ as a simplifying choice consistent with using a Gaussian prior $\mathbb{P}_z$ in a VAE (Remark 11). While (31) has much in common with the FVAE objective $J^{\mathrm{FVAE}}$, it is not marred by the foundational issues raised in Section 3; indeed, the FAE objective is broadly applicable, as the following result shows.

Proposition 24 Suppose Υ has finite second moment, i.e., $\mathbb{E}_{u\sim\Upsilon}\|u\|^2 < \infty$. If there exist $\theta^\star \in \Theta$ and $\psi^\star \in \Psi$ such that $f(\,\cdot\,;\theta^\star) \equiv 0$ and $g(\,\cdot\,;\psi^\star) \equiv 0$, then (31) has finite infimum.

Proof This follows immediately from evaluating $J^{\mathrm{FAE}}_\beta$ at $\theta^\star$ and $\psi^\star$, since $J^{\mathrm{FAE}}_\beta(\theta^\star,\psi^\star) = \mathbb{E}_{u\sim\Upsilon}\bigl[\tfrac{1}{2}\|u\|^2\bigr]$, and the expectation is finite by hypothesis. ∎

4.2 Architecture and Algorithms

To train FAE we must discretise the objective $J^{\mathrm{FAE}}_\beta$ and parametrise the encoder f and decoder g with learnable maps. We moreover propose a masked training scheme that appears to be new to the operator-learning literature; as we will show in Section 4.3, this scheme greatly improves the robustness of FAE to changes of mesh.

Encoder and Decoder Architecture. As in Section 2.4, we construct encoder and decoder architectures under the assumption that U is a Banach space of functions, evaluable pointwise almost everywhere, with domain $\Omega \subset \mathbb{R}^d$ and range $\mathbb{R}^m$, and that we have access to discretisations of the data $u^{(n)}$ comprising evaluations at finitely many mesh points. We adopt an architecture near-identical to that used for FVAE. More precisely, we parametrise the encoder as

$$f(u;\theta) = \rho\Bigl(\int_\Omega \kappa\bigl(x, u(x);\,\theta\bigr)\,\mathrm{d}x;\;\theta\Bigr) \in Z,$$

with $\kappa\colon \Omega\times\mathbb{R}^m\times\Theta \to \mathbb{R}^{\ell}$ parametrised as a neural network with two hidden layers of width 64 and output dimension ℓ = 64, and with $\rho\colon \mathbb{R}^{\ell}\times\Theta \to \mathbb{R}^{d_Z}$ parametrised as the linear layer $\rho(v;\theta) = W_\theta v + b_\theta$. We parametrise the decoder as the coordinate neural network $\gamma\colon Z\times\Omega\times\Psi\to\mathbb{R}^m$ with 5 hidden layers of width 100, so that $g(z;\psi)(x) = \gamma(z, x;\psi) \in \mathbb{R}^m$. In both cases we use GELU activation and augment x with 16 random Fourier features (Appendix B.1). Relative to the architectures of Section 2.4, the only change is in the range of f, which now takes values in Z instead of returning distributional parameters for $\mathbb{Q}^{\theta}_{z|u}$.

Discretisation of $J^{\mathrm{FAE}}_\beta$. The FAE objective can be applied whenever U is a Banach space, but in this article we take $U = L^2(\Omega)$, where $\Omega = \mathbb{T}^d$ or $\Omega = [0,1]^d$, and discretise the U-norm with a normalised sum, as in the sketch below.
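As an illustration, a minimal sketch of the resulting empirical objective on a fixed mesh follows, assuming hypothetical `encoder` and `decoder` modules following the interfaces above; since the domain has unit volume, the mean over mesh points approximates the squared L²-norm.

```python
import torch

def fae_objective(u, encoder, decoder, beta=1e-3):
    """Discretised FAE objective (31) for a batch of samples on a mesh:
    the squared L^2 misfit is approximated by a normalised sum over mesh
    points, plus beta times the squared Euclidean norm of the latents."""
    z = encoder(u)                                  # (batch, d_Z)
    recon = decoder(z)                              # (batch, I) on the mesh
    misfit = 0.5 * ((recon - u) ** 2).mean(dim=-1)  # ≈ ½‖g(f(u;θ);ψ) − u‖²
    reg = beta * (z ** 2).sum(dim=-1)               # β‖f(u;θ)‖²₂
    return (misfit + reg).mean()                    # average over the batch
```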
One can readily imagine other possibilities, e.g., taking U to be a Sobolev space of order s ≥ 1 if derivative information is available (Czarnecki et al., 2017), and approximating the L²-norm of the data and its derivatives by sums. More generally, we may use linear functionals as the starting point for approximation.

4.2.1 Masked Training

Self-supervised training, i.e., learning to predict the missing data from masked inputs, has proven valuable in both language models such as BERT (Devlin et al., 2019) and vision models such as the masked autoencoder (MAE; He et al., 2022). This method has been shown both to reduce training time and to improve generalisation. We propose two schemes making use of the ability to discretise the encoder and decoder on arbitrary meshes: complement masking and random masking, both sketched in code below. Under both schemes we transform each discretised training sample $\bar{u} = \{(x_i, u(x_i))\}_{i=1}^{I}$ by subsampling with index sets $I_{\mathrm{enc}}$ and $I_{\mathrm{dec}}$, which may change at each training step, to obtain

$$\bar{u}_{\mathrm{enc}} = \bigl\{(x_i, u(x_i))\bigr\}_{i\in I_{\mathrm{enc}}}, \qquad \bar{u}_{\mathrm{dec}} = \bigl\{(x_i, u(x_i))\bigr\}_{i\in I_{\mathrm{dec}}}.$$

We supply $\bar{u}_{\mathrm{enc}}$ as input to the discretised encoder; moreover we discretise the decoder on the mesh $\{x_i\}_{i\in I_{\mathrm{dec}}}$ and compare the decoder output to the masked data $\bar{u}_{\mathrm{dec}}$. In both of the strategies we propose, $I_{\mathrm{enc}}$ and $I_{\mathrm{dec}}$ will be unstructured random subsets of {1, ..., I}, but in principle other masks, such as polygons, could be considered.

Complement Masking. The chief strategy used in the numerical experiments is to draw $I_{\mathrm{enc}}$ as a random subset of {1, ..., I} and take $I_{\mathrm{dec}} = \{1,\dots,I\}\setminus I_{\mathrm{enc}}$, fixing the (encoder) point ratio $r_{\mathrm{enc}} = |I_{\mathrm{enc}}|/I$ as a hyperparameter. This is similar to the strategy adopted by MAE, though our approach differs by masking individual mesh points instead of patches. We explore the tradeoffs in the choice of point ratio in Section 4.3.1.

Random Masking. A second strategy we consider is to draw $I_{\mathrm{enc}}$ and $I_{\mathrm{dec}}$ independently as random subsets of {1, ..., I}, fixing both the encoder point ratio $r_{\mathrm{enc}} = |I_{\mathrm{enc}}|/I$ and the decoder point ratio $r_{\mathrm{dec}} = |I_{\mathrm{dec}}|/I$. This gives greater control over the cost of evaluating the encoder and decoder: by taking both $r_{\mathrm{enc}}$ and $r_{\mathrm{dec}}$ to be small, we significantly reduce the cost of each training step, which is useful when the number of mesh points I is large.
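A minimal sketch of the two masking schemes, using 0-indexed mesh indices:

```python
import numpy as np

def complement_mask(I, r_enc, rng):
    """Draw I_enc as a random subset of {0, ..., I-1} with |I_enc| ≈ r_enc·I
    and take I_dec to be its complement."""
    perm = rng.permutation(I)
    n_enc = int(r_enc * I)
    return perm[:n_enc], perm[n_enc:]

def random_mask(I, r_enc, r_dec, rng):
    """Draw I_enc and I_dec independently, fixing both point ratios."""
    i_enc = rng.choice(I, size=int(r_enc * I), replace=False)
    i_dec = rng.choice(I, size=int(r_dec * I), replace=False)
    return i_enc, i_dec

# At each training step: encode on (x[i_enc], u[i_enc]) and compare the
# decoder output on the mesh x[i_dec] against the masked data u[i_dec].
rng = np.random.default_rng(0)
i_enc, i_dec = complement_mask(64 * 64, r_enc=0.3, rng=rng)
```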
4.3 Numerical Experiments

In Sections 4.3.1 and 4.3.2, we apply FAE as an out-of-the-box method to discover a low-dimensional latent space for solutions to the incompressible Navier–Stokes equations and the Darcy model for flow in a porous medium, respectively. We find that, for these data sets:

(a) FAE's mesh-invariant architecture autoencodes with performance comparable to convolutional neural network (CNN) architectures of similar size (Section 4.3.1);

(b) the ability to discretise the encoder and decoder on different meshes enables new applications to inpainting and data-driven superresolution (Section 4.3.1), as well as extensions of the existing zero-shot superresolution proposed for VANO by Seidman et al. (2023);

(c) masked training significantly improves performance under mesh changes (Section 4.3.1) and can accelerate training while reducing memory demand (Section 4.3.2);

(d) training a generative model on the FAE latent space leads to a resolution-invariant generative model which accurately captures distributional properties (Section 4.3.2).

4.3.1 Incompressible Navier–Stokes Equations

We first illustrate how FAE can be used to learn a low-dimensional representation for snapshots of the vorticity of a fluid flow in two spatial dimensions governed by the incompressible Navier–Stokes equations, and illustrate some of the benefits of our mesh-invariant model. Let $\Omega = \mathbb{T}^2$ be the torus, viewed as the square [0, 1]² with opposing edges identified and with unit normal $\hat{z}$, and let $U = L^2(\Omega)$. While the incompressible Navier–Stokes equations are typically formulated in terms of the primitive variables of velocity u and pressure p, it is more natural in this case to work with the vorticity–streamfunction formulation (see Chandler and Kerswell, 2013, eq. (2.6)). In particular, the vorticity $\nabla\times u$ is zero except in the out-of-plane component $\hat{z}\omega$. The scalar component of the vorticity, ω, then satisfies

$$\partial_t\omega = -\hat{z}\cdot\nabla\times(u\times\omega\hat{z}) + \nu\,\Delta\omega + \phi, \quad (x,t)\in\Omega\times(0,T]; \qquad \omega(x,0) = \omega_0(x), \quad x\in\Omega. \tag{32}$$

In this setting the velocity is given by $u = \nabla\times(\psi_s\hat{z})$, where the streamfunction $\psi_s$ satisfies $\omega = -\Delta\psi_s$.¹ Thus, using periodicity, $\psi_s$ is uniquely defined, up to an irrelevant constant, in terms of ω, and (32) defines a closed evolution equation for ω. We suppose that $\Upsilon\in\mathcal{P}(U)$ is the distribution of the scalar vorticity ω(·, T = 50) given by (32), with viscosity ν = 10⁻⁴ and forcing

$$\phi(x) = \tfrac{1}{10}\sin\bigl(2\pi x_1 + 2\pi x_2\bigr) + \tfrac{1}{10}\cos\bigl(2\pi x_1 + 2\pi x_2\bigr).$$

We assume that the initial condition ω₀ has distribution N(0, C) with covariance operator $C = 7^{3/2}(49 I - \Delta)^{-5/2}$, where Δ is the Laplacian operator for spatially-mean-zero functions on the torus $\mathbb{T}^2$. The training data set, based on that of Li et al. (2021), consists of 8,000 samples from Υ generated on a 64 × 64 grid using a pseudospectral solver, with a further 2,000 independent samples held out as an evaluation set. The data are scaled so that ω(x, T) ∈ [0, 1] for all x ∈ Ω. Further details are provided in Appendix B.5. We train FAE with latent dimension $d_Z = 64$ and regularisation parameter β = 10⁻³, and train with complement masking using a point ratio $r_{\mathrm{enc}}$ of 30%.

1. While it is typical in the literature (e.g., Chandler and Kerswell, 2013) to denote the streamfunction by ψ, we instead denote it by $\psi_s$ to avoid ambiguity with the decoder parameter ψ.

Performance at Fixed Resolution. We compare the autoencoding performance of our mesh-invariant FAE architecture to a standard fixed-resolution CNN architecture, both trained using the FAE objective with all other hyperparameters the same. Our goal is to understand whether our architecture is competitive even without the inductive bias of CNNs. To do this we fix a class of FAE and CNN architectures for 64 × 64 unmasked data, and perform a search to select the best-performing models with similar parameter counts (details in Appendix B.5). The FAE architecture achieves reconstruction MSE slightly greater than the CNN on the held-out data, with a comparable number of parameters (Table 1).

Table 1: Autoencoding on the evaluation set (64 × 64 grid); mean ± 1 standard deviation over 5 training runs. Our resolution-invariant architecture performs comparably to CNNs with similar parameter counts.

                      MSE                            Parameters
    FAE architecture  4.82 × 10⁻⁴ ± 2.57 × 10⁻⁵      64,857
    CNN architecture  2.38 × 10⁻⁴ ± 9.43 × 10⁻⁶      71,553

It is reasonable to expect that the mesh-invariance of the FAE architecture comes at some cost to performance at fixed resolution, especially as the CNN benefits from a strong inductive bias, but the results of Table 1 suggest that the cost is modest. Further research is desirable to close this gap through better mesh-invariant architectures.
Inpainting. Methods for inpainting, i.e., inferring missing parts of an input based on observations and prior knowledge from training data, and related inverse problems are of great interest in computer vision and in scientific applications (Quan et al., 2024). We exploit the ability of FAE to encode on any mesh, and decode on any other mesh, to solve a variety of inpainting tasks. More precisely, after training FAE, we take data from the held-out set on a 64 × 64 grid and, for each discretised sample, we apply one of three possible masks (masks (ii) and (iii) are sketched in code below):

(i) random masking with point ratio 5%, i.e., masking 95% of mesh points; or

(ii) masking of all mesh points lying in a square with random centre and side length; or

(iii) masking of all mesh points in the 0.05-superlevel set of a draw from the Gaussian random field $N\bigl(0, (30^2 I - \Delta)^{-1.2}\bigr)$, where Δ is the Laplacian for functions on the torus.

Figure 7: (a) Inpainting with missing mesh points (input, reconstruction, ground truth): FAE can solve a variety of inpainting tasks; further samples in Appendix B.5. (b) Reconstruction MSE [×10⁻⁴] as a function of the training and evaluation point ratios (mean ± 1 standard deviation; 5 training runs): training the encoder on a sparse mesh has a regularising effect on FAE, leading to lower evaluation MSE on dense meshes, but harms performance on very sparse meshes.

Decoding these samples on a 64 × 64 grid (Figure 7) leads to reconstructions that agree well with the ground truth, and we find FAE to be robust even with a significant amount of the original mesh missing. As a consequence of the autoencoding procedure, the observed region of the input may also be modified, an effect most pronounced in (ii), where some features in the input are oversmoothed in the reconstruction. We hypothesise that the failure to capture fine-scale features could be mitigated with better neural-operator architectures.

To understand the effect of the training point ratio on inpainting quality, we first train instances of FAE with complement masking with point ratios 10%, 50%, and 90%. Then, for each model, we evaluate its autoencoding performance by applying random masking to each sample from the held-out set with point ratio $r_{\mathrm{eval}} \in \{10\%, 30\%, 50\%, 70\%, 90\%\}$, reconstructing on the full 64 × 64 grid, and computing the reconstruction MSE averaged over the held-out set (Figure 7(b)). We observe that the best choice of $r_{\mathrm{enc}}$ depends on the point ratio $r_{\mathrm{eval}}$ of the input. We hypothesise that when $r_{\mathrm{eval}}$ is large, training with $r_{\mathrm{enc}}$ small is helpful, as training with few mesh points regularises the model. On the other hand, when $r_{\mathrm{eval}}$ is small, it is likely that the evaluation mesh is almost disjoint from any mesh seen during training, harming performance. Further analysis is provided in Appendix B.5.
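For illustration, minimal sketches of masks (ii) and (iii) above follow; the spectral sampling of the Gaussian random field and its normalisation are rough modelling choices in this sketch rather than the exact procedure used in our experiments, and the square mask ignores wrap-around on the torus.

```python
import numpy as np

def square_mask(n=64, rng=None):
    """Mask (ii): all mesh points in an axis-aligned square with random
    centre and side length (clipped at the boundary)."""
    rng = rng or np.random.default_rng()
    cx, cy = rng.integers(0, n, size=2)
    half = rng.integers(4, n // 4)
    mask = np.zeros((n, n), dtype=bool)
    mask[max(cx - half, 0):cx + half, max(cy - half, 0):cy + half] = True
    return mask

def grf_superlevel_mask(n=64, level=0.05, rng=None):
    """Mask (iii): mesh points in the `level`-superlevel set of a draw from
    N(0, (900 I − Δ)^{−1.2}) on the torus, sampled spectrally: scale white
    noise in Fourier space by the square root of the covariance spectrum."""
    rng = rng or np.random.default_rng()
    k = np.fft.fftfreq(n, d=1.0 / n)               # integer Fourier modes
    k2 = k[:, None] ** 2 + k[None, :] ** 2
    lam = (900.0 + (2 * np.pi) ** 2 * k2) ** -1.2  # covariance eigenvalues
    xi = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    field = np.real(np.fft.ifft2(np.sqrt(lam) * xi)) * n  # rough scaling
    return field > level
```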
Superresolution. The ability to encode and decode on different meshes also enables the use of FAE for single-image superresolution: high-resolution reconstruction of a low-resolution input. Superresolution methods based on deep learning have found use in varied applications including imaging (Li et al., 2024) and fluid dynamics, e.g., in increasing the resolution ("downscaling") of numerical simulations (Kochkov et al., 2021; Bischoff and Deck, 2024). Generalising other continuous superresolution models such as the local implicit image function (Chen et al., 2021), a single trained FAE model can be applied with any upsampling factor, and FAE has the further advantage of accepting inputs at any resolution.

Figure 8: (a) Examples of data-driven superresolution (input, reconstruction, ground truth; 16 × 16 and 8 × 8 inputs upsampled to 64 × 64): FAE can encode low-resolution inputs and decode at higher resolution, recovering fine-scale features using knowledge of the underlying data. Further examples in Appendix B.5. (b) Superresolution on patches: taking input data on a coarse 64 × 64 mesh and refining in a specific region with mesh spacing 1/400 requires roughly 38k decoder network evaluations, against 160k on the full grid, so evaluating the decoder in a specific subregion can lead to significant computational savings.

For data-driven superresolution, where we train a model at high resolution and use it to enhance low-resolution inputs at inference time, FAE is able to resolve unseen features from 8 × 8 and 16 × 16 inputs on a 64 × 64 output grid after training at resolution 64 × 64 (Figure 8(a)). As with inpainting, superresolution performance could be further improved with an architecture that is better able to capture the turbulent dynamics in the data.

We also investigate the stability of FAE for zero-shot superresolution (Li et al., 2021), where the model is evaluated on higher resolutions than seen during training. Since FAE is purely data-driven, we view this as a test of the model's mesh-invariance and do not expect to resolve high-frequency features that were not seen during training. Our architecture proves robust when autoencoding on meshes much finer than the original 64 × 64 training grid (Figure 9(a)); moreover, our coordinate-MLP architecture allows us to decode on extremely fine meshes without exhausting the GPU memory (details in Appendix B.5). While zero-shot superresolution is possible with VANO when the input is given on the mesh seen during training, FAE can be used for superresolution with any input.

Efficient Superresolution on Regions of Interest. Since our decoder can be evaluated on any mesh, we can perform superresolution in a specific subregion without upsampling across the whole domain. Doing this can significantly reduce inference time, memory usage, and energy cost. As an example, we consider the task of reconstructing a circular subregion of interest with target mesh spacing 1/400 (Figure 8(b)(i)). Achieving this resolution over the whole domain, corresponding to a 400 × 400 grid, would involve 160,000 evaluations of the decoder network; decoding on the subregion requires just one quarter of this (Figure 8(b)(ii)).

Figure 9: (a) Autoencoding beyond the training resolution with zero-shot superresolution, at a 32× increase (2,048 × 2,048) and a 512× increase (32,768 × 32,768): FAE can stably decode at resolutions much higher than the training resolution (best viewed digitally). (b) Latent interpolation $g\bigl(\alpha f(u_1;\theta) + (1-\alpha)f(u_2;\theta);\,\psi\bigr)$: the regularised latent space Z allows for meaningful interpolation between samples. Further examples are given in Appendix B.5.

Applications of the Latent Space Z. The regularised FAE latent space gives a well-structured finite representation of the infinite-dimensional data u ∈ U. We expect there to be benefit in using this representation as a building block for applications such as supervised learning and generative modelling on functional data, similar in spirit to other supervised operator-learning methods with encoder–decoder structure (Seidman et al., 2022). As a first step towards verifying that the latent space does indeed capture useful structure beyond mere memorisation of the training data, we draw u₁ and u₂ from the held-out set, compute latent vectors $z_1 = f(u_1;\theta)$ and $z_2 = f(u_2;\theta) \in Z$, and evaluate the decoder g along the convex combination $\alpha z_1 + (1-\alpha)z_2$. This leads to a sensible interpolation in U, suggesting that the latent representation is robust and well-regularised (Figure 9(b)).
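A minimal sketch of this interpolation, assuming hypothetical `encoder` and `decoder` wrappers around the trained FAE:

```python
import torch

def latent_interpolation(u1, u2, encoder, decoder, n_steps=7):
    """Decode along the convex combination α f(u1;θ) + (1−α) f(u2;θ)
    for α on a uniform grid in [0, 1]."""
    z1, z2 = encoder(u1), encoder(u2)
    alphas = torch.linspace(0.0, 1.0, n_steps)
    return [decoder(a * z1 + (1.0 - a) * z2) for a in alphas]
```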
4.3.2 Darcy Flow

Darcy flow is a model of steady-state flow in a porous medium, derivable from first principles using homogenisation; see, e.g., Freeze and Cherry (1979, Sec. 2.11) and Keller (1980). We restrict attention to the two-dimensional domain Ω = [0, 1]² and suppose that, for some permeability field $k\colon\Omega\to\mathbb{R}$ and forcing $\phi\colon\Omega\to\mathbb{R}$, the pressure field $p\colon\Omega\to\mathbb{R}$ satisfies

$$-\nabla\cdot\bigl(k\,\nabla p\bigr) = \phi \ \text{ on } \Omega, \qquad p = 0 \ \text{ on } \partial\Omega. \tag{33}$$

We assume ϕ ≡ 1 and that k is distributed as the pushforward of the distribution $N\bigl(0, (-\Delta + 9I)^{-2}\bigr)$, where Δ is the Laplacian restricted to functions defined on Ω with zero Neumann data on ∂Ω, under the map $x \mapsto 3 + 9\,\mathbb{1}[x \geq 0]$. We take $U = L^2(\Omega)$ and define $\Upsilon\in\mathcal{P}(U)$ to be the distribution of pressure fields p solving (33) with permeability k. While solutions to this elliptic PDE can be expected to have greater smoothness (Evans, 2010, Sec. 6.3), we assume only that $p\in L^2(\Omega)$ and use the L²-norm in the FAE objective (31). The training data set is based on that of Li et al. (2021) and consists of 1,024 samples from Υ on a 421 × 421 grid, with a further 1,024 samples held out as an evaluation set. Data are scaled so that p(x) ∈ [0, 1] for all x ∈ Ω and, where specified, we downsample as described in Appendix B.6. We train FAE with $d_Z = 64$ and β = 10⁻³, and use complement masking with a point ratio $r_{\mathrm{enc}}$ of 30%.

Figure 10: Reconstruction MSE on the held-out set (211 × 211) against wall-clock time [s] for training point ratios of 10%, 50%, and 100% (mean over 5 training runs): training with masking in the encoder and decoder reduces training time.

Accelerating Training Using Masking. As well as improving reconstructions and robustness to mesh changes, masked training can greatly reduce the cost of training. To illustrate this we compare the training dynamics of FAE on data downsampled to resolution 211 × 211, using random masking with point ratio $r_{\mathrm{enc}} = r_{\mathrm{dec}} \in \{10\%, 50\%, 90\%\}$. Since the evaluation cost of the encoder and decoder scales linearly with the number of mesh points, we expect significant computational gains when using low point ratios. We perform five training runs for each model and compute the average reconstruction MSE over time on held-out data at resolution 211 × 211. The models trained with masking converge faster, as the smaller data tensors allow for better use of GPU parallelism (Figure 10). At higher resolutions, memory constraints may preclude training on the full grid, making masking vital. Related ideas are used in the adaptive-subsampling training scheme for FNOs proposed by Lanthaler et al. (2024), which involves training first on a coarse grid and refining the mesh each time the evaluation metric plateaus; our approach differs by dropping mesh points randomly, which would not be possible with FNO. One can readily imagine training FAE with a combination of adaptive subsampling and masking.
Generative Modelling. While FAE is not itself a generative model, it can be made so by training a fixed-dimension generative model on the latent space Z (Ghosh et al., 2020; Vahdat et al., 2021). More precisely, we know that applying the FAE encoder f to data induces a distribution $\Sigma_\theta \in \mathcal{P}(Z)$ for z given by

$$z \mid u = f(u;\theta), \qquad u \sim \Upsilon. \tag{34}$$

Unlike with FVAE, there is no reason that this should be close to Gaussian. However, we can approximate $\Sigma_\theta$ with a fixed-resolution generative model $\mathbb{P}^{\phi}_z \in \mathcal{P}(Z)$ parametrised by φ ∈ Φ, and define the FAE generative model $\mathbb{P}^{\psi,\phi}_u$ for data u by

$$\text{(FAE generative model)} \quad u \mid z = g(z;\psi), \qquad z \sim \mathbb{P}^{\phi}_z. \tag{35}$$

Since applying the decoder to $\Sigma_\theta$ should approximately recover the data distribution if $g(f(u;\theta);\psi) \approx u$ for u ∼ Υ, we hope that when $\Sigma_\theta \approx \mathbb{P}^{\phi}_z$, samples from (35) will be approximately distributed according to Υ. As a simple illustration, we train FAE at resolution 47 × 47 and fit a Gaussian mixture model $\mathbb{P}^{\phi}_z$ with 10 components to $\Sigma_\theta$ using the expectation-maximisation algorithm (see Bishop, 2006, Sec. 9.2.2); a sketch is given below. Samples from (35) closely resemble those from the held-out data set (Figure 11(a)), and, as a result of our mesh-invariant architecture, it is possible to generate new samples on any mesh.

Figure 11: (a) Uncurated samples from the FAE generative model for the pressure field p, $g(z;\psi)$ with $z\sim\mathbb{P}^{\phi}_z$, alongside samples p ∼ Υ (both on a 47 × 47 grid); further samples are provided in Appendix B.6. (b) Kernel density estimates for the quantities of interest $Q_1(p) = \max_{x\in\Omega} p(x)$ and $Q_2(p) = \|p\|_{L^2}$, based on 1,024 samples from the generative model: the distributions computed using the FAE generative model closely agree with the ground truth.

To measure generative performance, we approximate the distributions of physically relevant quantities of interest $Q_i(p)$ depending on the data p ∈ U, comparing 1,024 samples from the generative model to the held-out data. Using kernel density estimates as in Figure 4(b), we see close agreement between the distributions (Figure 11(b)). While we could also evaluate the generative model using distances such as the maximum mean discrepancy (MMD; Borgwardt et al., 2006), we focus on interpretable quantities relevant to the physical system at hand. Though we here adopt the convention of training the autoencoder and generative model separately (Rombach et al., 2022), the models could also be trained jointly; we leave this, and an investigation of generative models on the FAE latent space, to future work.
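A minimal sketch of this two-stage construction, using scikit-learn's expectation-maximisation implementation and hypothetical wrappers `encode_all` and `decode` around the trained FAE:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit P^φ_z to the encoded training data (an empirical approximation of
# Σ_θ) by expectation-maximisation, then push samples through the decoder
# to realise the generative model (35) on any chosen mesh.
latents = encode_all(train_data)                # array of shape (N, d_Z)
gmm = GaussianMixture(n_components=10).fit(latents)
z_new, _ = gmm.sample(n_samples=1024)           # draws z ~ P^φ_z
samples = decode(z_new, mesh)                   # u = g(z; ψ) on the mesh
```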
5. Related Work

Variational Autoencoding Neural Operators. The VANO model (Seidman et al., 2023) was the first to attempt a systematic extension of the VAE objective to function space. The paper uses ideas from operator learning to construct a model that can decode, but not encode, at any resolution. Our approach differs in both training objective and practical implementation, as we now outline.

The most significant difference between what is proposed in this paper and in VANO is the objective on function space: the VANO objective coincides with a specific case of our model (16a)–(16d) with the decoder noise $\mathbb{P}_\eta$ being white noise on $L^2([0,1]^d)$. As a consequence, the generative model for VANO takes values in $U = H^s([0,1]^d)$ if and only if s < −d/2; in particular, generated draws are not in $L^2([0,1]^d)$. Unlike our approach, VANO aims to maximise an extension of the ELBO (15b), in which a regularisation parameter β > 0 is chosen as a hyperparameter, and the ELBO takes the form

$$\mathrm{ELBO}^{\mathrm{VANO}}_\beta(u;\theta,\psi) = \mathbb{E}_{z\sim\mathbb{Q}^{\theta}_{z|u}}\Bigl[\log\frac{\mathrm{d}\mathbb{P}^{\psi}_{u|z}}{\mathrm{d}\mathbb{P}_\eta}(u)\Bigr] - \beta\,D_{\mathrm{KL}}\bigl(\mathbb{Q}^{\theta}_{z|u}\,\big\|\,\mathbb{P}_z\bigr) = \mathbb{E}_{z\sim\mathbb{Q}^{\theta}_{z|u}}\Bigl[\bigl\langle g(z;\psi), u\bigr\rangle_{L^2} - \tfrac{1}{2}\bigl\|g(z;\psi)\bigr\|_{L^2}^2\Bigr] - \beta\,D_{\mathrm{KL}}\bigl(\mathbb{Q}^{\theta}_{z|u}\,\big\|\,\mathbb{P}_z\bigr),$$

where the second equality comes from the Cameron–Martin theorem, as in Example 2. Maximising $\mathrm{ELBO}^{\mathrm{VANO}}_\beta$ with β = 1 is precisely equivalent to minimising the per-sample loss (29) from Example 2, so naive application of $\mathrm{ELBO}^{\mathrm{VANO}}_\beta$ will result in the same issues seen there. In particular, discretisations of the ELBO may diverge as resolution is refined; moreover, for data with L²-regularity, the generative model $\mathbb{P}^{\psi}_u$ is greatly misspecified, with draws $g(z;\psi) + \eta$, $z\sim\mathbb{P}_z$, $\eta\sim\mathbb{P}_\eta$, lying in a Sobolev space of lower regularity than the data. This issue is obscured by the convention in the VAE literature of considering only the decoder mean g(z; ψ); considering the full generative model with draws g(z; ψ) + η reveals the incompatibility more clearly. We argue that the empirical success of VANO in autoencoding arises because the objective can be seen as that of a regularised autoencoder (Remark 22).

Along with the differences in the training objective and its interpretation, FVAE differs greatly from VANO in architecture. While the VANO decoders can be discretised on any mesh, and our decoder closely resembles VANO's nonlinear decoder, its encoders assume a fixed mesh for training and inference. In contrast, our encoder can be discretised on any mesh, enabling many of our contributions, such as masked training, inpainting, and superresolution, which are not possible within VANO.

Generative Models on Function Space. Aside from VANO, recent years have seen significant interest in the development of generative models on function space. Several extensions of score-based (Song et al., 2021) and denoising diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) to function space have been proposed (e.g., Pidstrigach et al., 2023; Hagemann et al., 2023; Lim et al., 2023; Kerrigan et al., 2023; Franzese et al., 2023; Zhang and Wonka, 2024). Rahman et al. (2022) propose the generative adversarial neural operator (GANO), extending Wasserstein generative adversarial networks (Arjovsky et al., 2017) to function space, with FNOs in the generator and discriminator to achieve resolution-invariance.

Variational Inference on Function Space. In machine learning, variational inference on function space also arises in the context of Bayesian neural networks (Sun et al., 2019; Burt et al., 2021; Cinquin and Bamler, 2024). In this setting one wishes to minimise the KL divergence between the posterior in function space and a computationally tractable approximation; but, as in our study, this divergence may be infinite owing to a lack of absolute continuity between the two distributions.

Learning on Point Clouds. Our architecture takes inspiration from the literature on machine learning on point clouds, where data are viewed as sets of points with arbitrary cardinality.
Several models for autoencoding and generative modelling with point clouds have been proposed, such as energy-based processes (Yang et al., 2020) and SetVAE (Kim et al., 2021); our work differs by defining a loss in function space, ensuring that our model converges to a continuum limit as the mesh is refined. Continuum limits of semisupervised algorithms for graphs and point clouds have also been studied (e.g., Dunlop et al., 2020).

6. Conclusion

Our study of autoencoders on function space has led to FVAE, an extension of VAEs which imposes stringent requirements on the data distribution in infinite dimensions but benefits from firm probabilistic foundations; it has also led to the non-probabilistic FAE, a regularised autoencoder which can be applied much more broadly to functional data.

Benefits. Both FVAE and FAE offer significant benefits when working with functional data, such as enabling training with data across resolutions, inpainting, superresolution, and generative modelling. These benefits are possible only through our pairing of a well-defined objective in function space with mesh-invariant encoder and decoder architectures.

Limitations. FVAE can be applied only when the generative model is sufficiently compatible with the data distribution, a condition that is difficult to satisfy in infinite dimensions and that restricts the applicability of FVAE to specific problem classes. FAE overcomes this restriction, but does not share the probabilistic foundations of FVAE. The desire to discretise the encoder and decoder on arbitrary meshes rules out many high-performing grid-based architectures, including convolutional networks and FNOs. We believe this is a limiting factor in the numerical experiments, and that combining our work with more complex operator architectures (e.g., Kovachki et al., 2023) or continuum extensions of point-cloud CNNs (Li et al., 2018) would yield further improvements.

Future Work. Our work gives new methods for nonlinear dimension reduction in function space, and we expect there to be benefit in building operator-learning methods that make use of the resulting latent space, in the spirit of PCA-Net (Bhattacharya et al., 2021). For FAE, which unlike FVAE is not inherently a generative model, we expect particular benefit in the use of more sophisticated generative models on the latent space, for example diffusion models, analogous to Stable Diffusion (Rombach et al., 2022). While our focus has been on scientific problems with synthetic data, our methods could also be applied to real-world data, for example in computer vision; for these challenging data sets, further research on improved mesh-invariant architectures will be vital. Our study has also focussed on the typical machine-learning setting of a fixed data set of size N; research into the behaviour of FVAE and FAE in the infinite-data limit using tools from statistical learning theory would also be of interest.

Acknowledgments

JB is supported by Splunk Inc. MG is supported by a Royal Academy of Engineering Research Chair, and Engineering and Physical Sciences Research Council (EPSRC) grants EP/T000414/1, EP/W005816/1, EP/V056441/1, EP/V056522/1, EP/R018413/2, EP/R034710/1, and EP/R004889/1. HL is supported by the Warwick Mathematics Institute Centre for Doctoral Training and gratefully acknowledges funding from the University of Warwick and the EPSRC (grant EP/W524645/1).
AMS is supported by a Department of Defense Vannevar Bush Faculty Fellowship and by the SciAI Center, funded by the Office of Naval Research (ONR) under grant N00014-23-1-2729. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising.

Appendix A. Supporting Results

In the following proof we use the fact that the norm of the Sobolev space $H^s([0,1])$ can be written as a weighted sum of frequencies (see Krein and Petunin, 1966, Sec. 9):

$$\|u\|^2_{H^s([0,1])} = \sum_{j\in\mathbb{N}} \bigl(1+j^2\bigr)^s |\alpha_j|^2, \qquad u = \sum_{j\in\mathbb{N}} \alpha_j e_j, \qquad e_j(x) = \sqrt{2}\sin(\pi j x). \tag{36}$$

Proof of Proposition 15 Let η be an L²-white noise, let $h = \sum_{j\in\mathbb{N}} h_j e_j \in L^2([0,1])$, and note that $\mathbb{P}_\eta(\,\cdot\, - h)$ is the distribution of the random variable η + h. Thus, writing out the $H^s$-norm using the Karhunen–Loève expansion of η, we see that

$$\|\eta + h\|^2_{H^s([0,1])} = \sum_{j\in\mathbb{N}} \bigl(1+j^2\bigr)^s \bigl(\xi_j + h_j\bigr)^2. \tag{37}$$

First, we show that $\|\eta+h\|_{H^s([0,1])} < \infty$ almost surely when s < −1/2. To do this we apply the Kolmogorov two-series theorem (Durrett, 2019, Theorem 2.5.6), which states that the random series (37) converges almost surely if

$$\sum_{j\in\mathbb{N}} \bigl(1+j^2\bigr)^s\, \mathbb{E}\bigl[(\xi_j + h_j)^2\bigr] < \infty \quad\text{and}\quad \sum_{j\in\mathbb{N}} \bigl(1+j^2\bigr)^{2s}\, \mathrm{Var}\bigl((\xi_j+h_j)^2\bigr) < \infty.$$

But, since $\xi_j \sim N(0,1)$, we know that $\mathbb{E}[\xi_j^2] = 1$ and $\mathrm{Var}(\xi_j^2) = 2$; applying this, the elementary inequality $(x+y)^2 \leq 2x^2 + 2y^2$ for x, y ∈ ℝ, and the fact that $\sum_{j\in\mathbb{N}} j^{\alpha} < \infty$ for α < −1 shows that the two series are finite.

To see that $\mathbb{P}_\eta(\,\cdot\, - h)$ assigns zero probability to $L^2([0,1])$, suppose for contradiction that η + h had finite L²-norm; then η would also have finite L²-norm. But, as a consequence of the Borel–Cantelli lemma,

$$\|\eta\|^2_{L^2([0,1])} = \sum_{j\in\mathbb{N}} \xi_j^2 = \infty \quad\text{almost surely},$$

because the summands are independent and identically distributed, and thus, for any constant c > 0, infinitely many summands exceed c with probability one. ∎

Lemma 25 Suppose that $U = C_0([0,T];\mathbb{R}^m)$ and that $\mu\in\mathcal{P}(U)$ and $\nu\in\mathcal{P}(U)$ are the laws of the $\mathbb{R}^m$-valued diffusions

$$\mathrm{d}u_t = b(u_t)\,\mathrm{d}t + \sqrt{\varepsilon}\,\mathrm{d}w_t, \quad u_0 = 0, \quad t\in[0,T],$$
$$\mathrm{d}v_t = c(v_t)\,\mathrm{d}t + \sqrt{\varepsilon}\,\mathrm{d}w_t, \quad v_0 = 0, \quad t\in[0,T],$$

where $(w_t)_{t\in[0,T]}$ is a Brownian motion on $\mathbb{R}^m$. Suppose that the Novikov condition (19) holds for both processes. Then

$$D_{\mathrm{KL}}(\mu\,\|\,\nu) = \mathbb{E}_{u\sim\mu}\Bigl[\frac{1}{2\varepsilon}\int_0^T \bigl\|b(u_t) - c(u_t)\bigr\|_2^2\,\mathrm{d}t\Bigr].$$

Proof Applying the Girsanov formula (20) to obtain the density dμ/dν, taking logarithms to evaluate $D_{\mathrm{KL}}(\mu\|\nu)$, and noting that under μ we have $\mathrm{d}u_t = b(u_t)\,\mathrm{d}t + \sqrt{\varepsilon}\,\mathrm{d}w_t$, we obtain

$$D_{\mathrm{KL}}(\mu\,\|\,\nu) = \mathbb{E}_{u\sim\mu}\Bigl[\frac{1}{2\varepsilon}\int_0^T \bigl\|b(u_t)-c(u_t)\bigr\|_2^2\,\mathrm{d}t + \frac{1}{\sqrt{\varepsilon}}\int_0^T \bigl\langle b(u_t)-c(u_t),\,\mathrm{d}w_t\bigr\rangle\Bigr].$$

Under μ, the process $(w_t)_{t\in[0,T]}$ is a Brownian motion and so the second expectation is zero. ∎

Appendix B. Experimental Details

In this section we provide additional details, training configurations, samples, and analysis for the numerical experiments in Section 2.5 and Section 4.3. All experiments were run on a single NVIDIA GeForce RTX 4090 GPU with 24 GB of VRAM.

B.1 Base Architecture

We use the common architecture described in Section 2.4 and Section 4.2 for all experiments, using the Adam optimiser (Kingma and Ba, 2015) with the default hyperparameters ε, β₁, and β₂; we specify the learning rate and learning-rate decay schedule for each experiment in what follows.

Positional Encodings. Where specified, both the encoder and decoder make use of Gaussian random Fourier features (Tancik et al., 2020), pairing the query coordinate $x\in\Omega\subset\mathbb{R}^d$ with a positional encoding $\gamma(x)\in\mathbb{R}^{2k}$. To generate these encodings, a matrix $B\in\mathbb{R}^{k\times d}$ with independent N(0, I) entries is sampled and viewed as a hyperparameter of the model, to be used in both the encoder and decoder. The positional encoding γ(x) is then given by the concatenated vector

$$\gamma(x) = \bigl[\cos(2\pi Bx);\; \sin(2\pi Bx)\bigr]^{\mathsf{T}} \in \mathbb{R}^{2k},$$

where the sine and cosine functions are applied componentwise to the vector 2πBx.
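A minimal sketch of this encoding:

```python
import torch

def fourier_features(x, B):
    """γ(x) = [cos(2πBx); sin(2πBx)]: map coordinates x ∈ R^d to R^{2k}
    using a fixed matrix B ∈ R^{k×d} with independent N(0, I) entries."""
    proj = 2.0 * torch.pi * x @ B.T                  # (batch, k)
    return torch.cat([torch.cos(proj), torch.sin(proj)], dim=-1)

# B is sampled once and then kept fixed as a hyperparameter of the model.
B = torch.randn(16, 2)                               # k = 16 features, d = 2
gamma = fourier_features(torch.rand(8, 2), B)        # shape (8, 32)
```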
B.2 Brownian Dynamics

The training data consist of 8,192 samples from the path distribution Υ of the SDE (27)–(28) on the time interval [0, T], T = 5. Trajectories are generated using the Euler–Maruyama scheme with internal time step 1/8,192 (unrelated to the choice to take 8,192 training samples), and the resulting paths are then subsampled by a factor of 80 to obtain the training data. Thus the data have effective time increment 5/512; moreover, the path information is removed at 50% of the points resulting from these time increments, chosen uniformly at random.

Experimental Setup. We train for 100,000 steps with initial learning rate 10⁻³ and an exponential decay of 0.98 applied every 1,000 steps, with batch size 32 and 4 Monte Carlo samples for $\mathbb{Q}^{\theta}_{z|u}$. We use latent dimension $d_Z = 1$, β = 1.2, and λ = 10. The three sets of simulations shown in Figure 2 use κ = 0, 25, and 10,000 respectively.

B.3 Estimation of Markov State Models

Data and Discretisation. To validate the ability of FVAE to model higher-dimensional SDE trajectories, we specify a simple potential with qualitative features similar to those of the complex potential surfaces arising in molecular dynamics. To this end, define the centres $c_1 = (0, 0)$, $c_2 = (0.2, 0.2)$, $c_3 = (-0.2, 0.2)$, $c_4 = (0.2, -0.2)$, $c_5 = (0, -0.2)$, and $c_6 = (-0.2, 0)$; standard deviations σ₁ = σ₂ = σ₃ = σ₄ = 0.1 and σ₅ = σ₆ = 0.03; and masses m₁ = m₂ = m₃ = m₄ = 0.1 and m₅ = m₆ = 0.01. Then let

$$U(x) = 0.5\,(x_1 + x_2) + x_1^2 + x_2^2 - \sum_{i=1}^{6} m_i\, N\bigl(x;\, c_i,\, \sigma_i^2 I_2\bigr).$$

This potential has three key components: a linear term breaking the symmetry; a quadratic term preventing paths from veering too far from the path's starting point, the origin; and negative Gaussian densities serving as potential wells positioned at the centres $c_i$ (Figure 12).

Figure 12: Surface and heatmap of the potential function $U\colon\mathbb{R}^2\to\mathbb{R}$ for Section 2.5.2.

Sample paths of (27) with initial condition u₀ = 0, temperature ε = 0.1, and final time T = 3 show significant diversity, with many paths transitioning at least once between different wells (see the ground truth in Figure 13).

Experimental Setup. The training set consists of 16,384 paths generated with an Euler–Maruyama scheme with internal time step 1/8,192, subsampled by a factor of 48 to obtain an equally spaced mesh of 513 points. We take $d_Z = 16$, β = 10, κ = 50, and λ = 50, and, as in Appendix B.2, train on data where 50% of the points on the path are missing. We also use the same learning rate, learning-rate decay schedule, step limit, and batch size.

Results. FVAE's reconstructions closely match the inputs (Figure 13), and FVAE produces convincing generative samples capturing qualitative features of the data (Figure 14).

Figure 13: Held-out ground-truth data from the SDE in Section 2.5.2 ("True" row) and the corresponding FVAE reconstructions of sample paths ("Reconstructed" row).

Figure 14: (a) Samples of the SDE in Section 2.5.2 drawn from the FVAE generative model with randomly drawn latent vector z ∼ $\mathbb{P}_z$. (b) Ground-truth paths of the SDE in Section 2.5.2 generated using an Euler–Maruyama solver. In both subfigures, the evolution through time t ∈ [0, 3] is depicted as a transition in colour from blue to green.

B.4 Dirac Distributions

Data and Discretisation. We view Υ as a probability distribution on $U = H^{-1}([0,1])$. At each resolution I, we discretise the domain [0, 1] using an evenly spaced mesh of points $\{i/(I+1)\}_{i=1,\dots,I}$ and approximate the Dirac mass $\delta_\xi$, ξ ∈ [0, 1], by the optimal L¹-approximation: a discretised function which is zero except at the mesh point closest to ξ, normalised to have unit L¹-norm. The training data set consists of discretised Dirac functions at each mesh point; the goal is not to train a practical model for generalisation, but to isolate the effect of the objective.
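A minimal sketch of this construction; with mesh weight 1/(I+1), the nonzero value I+1 gives unit discrete L¹-norm.

```python
import numpy as np

def discretised_dirac(xi, I):
    """Optimal L^1-approximation of δ_ξ on the mesh {i/(I+1)}, i = 1..I:
    zero except at the mesh point closest to ξ, normalised so that
    Σ_i |u_i| / (I+1) = 1."""
    x = np.arange(1, I + 1) / (I + 1)
    u = np.zeros(I)
    u[np.argmin(np.abs(x - xi))] = I + 1.0
    return x, u
```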
Experimental Setup. We train FVAE and FAE models at resolutions I ∈ {8, 16, 32, 64, 128}. For each model, we perform 50 independent runs of 30,000 steps with batch size 6.

Architecture. The neural network $\rho\colon\mathbb{R}\times\Theta\to\mathbb{R}\times\mathbb{R}$ in the encoder map f is assumed to have 3 hidden layers of width 128, and the mean µ(z; ψ) and standard deviation σ(z; ψ) in the decoder are computed from a 3-layer neural network of width 128. For numerical stability, we impose a lower bound on σ based on the mesh spacing Δx, given by $\sigma_{\min}(\Delta x) = (2\pi)^{-1/2}\,\Delta x$.

FVAE Configuration. We view data u ∼ Υ as lying in the Sobolev space $U = H^{-1}([0,1])$; the decoder g outputs functions in $L^2([0,1])$ and we take the decoder-noise distribution $\mathbb{P}_\eta = N(0, I)$, noting that $\mathbb{P}_\eta\in\mathcal{P}(H^s([0,1]))$ if and only if s < −1/2; in particular, white-noise samples do not lie in the space $L^2([0,1])$. We modify the per-sample loss (7b) by reweighting the term $D_{\mathrm{KL}}(\mathbb{Q}^{\theta}_{z|u}\|\mathbb{P}_z)$ by β = 10⁻⁴, and take 16 Monte Carlo samples for $\mathbb{Q}^{\theta}_{z|u}$. We use an initial learning rate of 10⁻⁴, decaying exponentially by a factor of 0.7 every 1,000 steps.

FAE Configuration. To compute the H⁻¹-norm we truncate the series expansion (36) and compute the coefficients $\alpha_j$ from a discretisation of u using the discrete sine transform (sketched below). We use initial learning rate 10⁻⁴, decaying exponentially by a factor of 0.9 every 1,000 steps, and take β = 10⁻¹². For consistency with the FVAE loss, we subtract the squared data norm $\tfrac{1}{2}\|u\|^2_{H^{-1}}$ from the FAE loss, yielding the expression

$$\tfrac{1}{2}\bigl\|g(f(u;\theta);\psi) - u\bigr\|_{H^{-1}}^2 - \tfrac{1}{2}\|u\|_{H^{-1}}^2 = \tfrac{1}{2}\bigl\|g(f(u;\theta);\psi)\bigr\|_{H^{-1}}^2 - \bigl\langle g(f(u;\theta);\psi),\, u\bigr\rangle_{H^{-1}}.$$
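The following sketch illustrates how the truncated $H^s$-norm might be computed with the type-I discrete sine transform; the quadrature normalisation of the coefficients is an assumption of this sketch.

```python
import numpy as np
from scipy.fft import dst

def h_norm_sq(u, s=-1.0):
    """Squared H^s([0,1])-norm of u sampled at x_i = i/(I+1), i = 1..I,
    via the sine-series expansion (36). The coefficients alpha_j ≈
    <u, e_j>_{L^2}, with e_j(x) = sqrt(2) sin(pi j x), are approximated
    by a normalised sum over the mesh using the type-I DST."""
    I = u.shape[-1]
    j = np.arange(1, I + 1)
    alpha = np.sqrt(2.0) * dst(u, type=1) / (2.0 * (I + 1))
    return np.sum((1.0 + j**2) ** s * alpha**2, axis=-1)
```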
Results. As expected, the final training loss under both models decreases as the resolution is refined, since the lower bound $\sigma_{\min}$ decreases. However, the FAE loss appears to converge and is stable across runs, while the FVAE loss appears to diverge and becomes increasingly unstable across runs. This gives convincing empirical evidence that the joint divergence (5) for FVAE is not defined, as a result of the misspecified decoder noise; the use of FAE with an appropriate data norm alleviates this issue. Since the FVAE objective with $\mathbb{P}_\eta = N(0, I)$ coincides with the VANO objective, this issue would also be present for VANO. Under both models, training becomes increasingly unstable at high resolutions: when σ is small, the loss becomes highly sensitive to changes in µ; this instability is unrelated to the divergence of the FVAE training loss and is a consequence of training through gradient descent.

B.5 Incompressible Navier–Stokes Equations

Table 2: Details of the Navier–Stokes data sets.

    Viscosity ν    Resolution    Train Samples    Eval. Samples    Snapshot Time T
    10⁻³           64 × 64       4,000            1,000            50
    10⁻⁴           64 × 64       8,000            2,000            50
    10⁻⁵           64 × 64       960              240              20

Data and Discretisation. We use data as provided online by Li et al. (2021). Solutions of (32) are generated by sampling the initial condition from the Gaussian random field N(0, C), $C = 7^{3/2}(49I - \Delta)^{-5/2}$, and evolving in time using a pseudospectral method. While the data of Li et al. (2021) include the full time evolution, we use only snapshots of the vorticity at the final time. Every snapshot is a 64 × 64 image, normalised to take values in [0, 1]; details of this data set are given in Table 2.

Effects of Point Ratios. Here we extend the analysis of Figure 7(b) to understand how the point ratio used during training affects reconstruction performance. We first train two FAE models on the Navier–Stokes data set with viscosity ν = 10⁻⁴, using complement masking with a point ratio $r_{\mathrm{enc}}$ of 10% and 90% respectively. Then, we fix an arbitrary sample from the held-out set and, for each model, generate 1,000 distinct masks with point ratios 10%, 30%, 50%, 70%, and 90%. We then encode on each mesh, decode on the full grid, and compute kernel density estimates of the reconstruction MSE (Figure 15). The model trained with $r_{\mathrm{enc}} = 10\%$ is much more sensitive to the location of the evaluation mesh points, especially when the evaluation point ratio is low; with a sufficiently high encoder point ratio at evaluation time, however, the reconstruction MSE of the model trained using $r_{\mathrm{enc}} = 10\%$ drops below that of the model trained at $r_{\mathrm{enc}} = 90\%$. This suggests a tradeoff whereby a higher training point ratio provides more stability, at the cost of increased autoencoding MSE, particularly when the point ratio of the evaluation data is high. We hypothesise that a lower training ratio regularises the model to attain a more robust internal representation.

Figure 15: Kernel density estimates of full-grid reconstruction MSE on the reference sample across 1,000 randomly chosen meshes with evaluation point ratios 10%, 30%, 50%, 70%, and 90%, (a) for a model trained with point ratio 10% and (b) for a model trained with point ratio 90%. Training with a low point ratio regularises, reducing MSE when the evaluation data has a high point ratio, but at the cost of greater variance when evaluating on low point ratios.

We also investigate the sensitivity of the models to a specific encoder mesh, seeking to understand whether an encoder mesh achieving low MSE on one image leads to low MSE on other images. The procedure is as follows: we select an image arbitrarily from the held-out set (the reference sample) and draw 1,000 random meshes with point ratio 10%; then, we select the mesh resulting in the lowest reconstruction MSE for each of the two models. For the nearest neighbours of the chosen sample in the held-out set, the reconstruction error on this MSE-minimising mesh is lower than average (Figure 16(a); dashed lines), suggesting that a good configuration will yield good results on similar samples. On the other hand, using the MSE-minimising mesh on arbitrary samples from the held-out set yields an MSE somewhat lower than a randomly chosen mesh; unsurprisingly, however, the arbitrarily chosen samples appear to benefit less than the nearest neighbours (Figure 16(b)).

Experimental Setup. We train for 50,000 steps with batch size 32 and initial learning rate 10⁻³, decaying exponentially by a factor of 0.98 every 1,000 steps. We use complement masking with $r_{\mathrm{enc}} = 0.3$, providing a good balance of performance and robustness to masking.
Architecture. Both the CNN and FAE architectures use Gaussian random positional encodings with k = 16. For the sake of comparison, we use a standard CNN architecture inspired by the VGG model (Simonyan and Zisserman, 2015), gradually contracting/expanding the feature map while increasing/decreasing the channel dimensions. The architecture we use was identified by a search over parameters such as the network depth, while maintaining a parameter count similar to that of our baseline FAE model. The encoder consists of four CNN layers with output channel dimensions 4, 4, 8, and 16 and kernel sizes 2, 2, 4, and 4 respectively, all with stride 2. The result is flattened and passed through a single-hidden-layer MLP of width 64 to obtain a vector of dimension 64. The decoder consists of a single-layer MLP of width 64 and output dimension 512, which is then rearranged to a 4 × 4 feature map with channel size 32. This feature map is then passed through four layers of transposed convolutions mapping to 16, 8, 4, and 4 channel dimensions, with kernel sizes 4, 4, 2, and 2 respectively, and stride 2. The result is then mapped by two CNN layers with kernel size 3, stride 1, and output channel dimensions 8 and 1 respectively.

Figure 16: Kernel density estimates of full-grid reconstruction MSE for models trained at 10% (Low) and 90% (High) point ratios, (a) on the nearest neighbours of the reference sample in the held-out set and (b) on arbitrarily chosen samples from the held-out set. Dashed lines indicate the MSE obtained using the mesh minimising MSE on the reference sample.

Uncurated Reconstructions and Samples. Reconstructions of randomly selected data from the held-out sets for viscosities ν = 10⁻³, 10⁻⁴, and 10⁻⁵ are provided in Figures 17, 18, and 19 respectively. As described in Section 4.3.2, we apply FAE as a generative model by fitting a Gaussian mixture with 10 components on the latent space. Samples from models trained at ν = 10⁻³, 10⁻⁴, and 10⁻⁵ are shown in Figures 20, 21, and 22 respectively.

Figure 17: FAE reconstructions of Navier–Stokes data with viscosity ν = 10⁻³.

Figure 18: FAE reconstructions of Navier–Stokes data with viscosity ν = 10⁻⁴.

Figure 19: FAE reconstructions of Navier–Stokes data with viscosity ν = 10⁻⁵.

Figure 20: Samples from the FAE generative model and from the data set for Navier–Stokes data with viscosity ν = 10⁻³.

Figure 21: Samples from the FAE generative model and from the data set for Navier–Stokes data with viscosity ν = 10⁻⁴.

Figure 22: Samples from the FAE generative model and from the data set for Navier–Stokes data with viscosity ν = 10⁻⁵.

Evaluation at Very High Resolutions. In Figure 9, we demonstrate zero-shot superresolution by evaluating the decoder on grids of resolution 2,048 × 2,048 and 32,768 × 32,768. While the former requires approximately 16 MB to store using 32-bit floating-point numbers, the latter requires 4.3 GB, and thus applying a neural network directly to the 32,768 × 32,768 image is likely to exhaust GPU memory. To allow evaluation of the decoder at this resolution, we partition the domain into 1,000 chunks and evaluate the decoder on each chunk in turn; we then reassemble the resulting data in RAM. To ensure that each chunk has an integer number of points, we take the first 824 chunks to contain 1,073,742 mesh points (≈ 4 MB), and take the remaining 176 chunks to contain 1,073,741 points.
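A minimal sketch of this chunked evaluation, assuming a hypothetical `decoder(z, coords)` callable for the coordinate-MLP decoder and a CUDA device:

```python
import torch

@torch.no_grad()
def decode_in_chunks(decoder, z, coords, n_chunks=1000):
    """Evaluate g(z; ψ) on a very fine mesh by splitting the query
    coordinates into chunks, decoding each chunk on the GPU, and
    reassembling the outputs in CPU memory."""
    outputs = []
    for chunk in torch.chunk(coords, n_chunks, dim=0):
        outputs.append(decoder(z, chunk.cuda()).cpu())
    return torch.cat(outputs, dim=0)
```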
B.6 Darcy Flow

Data Set. The data we use are based on those provided online by Li et al. (2021), given on a 421 × 421 grid and generated through a finite-difference scheme. Where described, we downsample these data to lower resolutions by applying a low-pass filter in Fourier space and subsampling the resulting image. The low-pass filter is a mollification of an ideal sinc filter with bandwidth selected to eliminate frequencies beyond the Nyquist frequency of the target resolution, computed by convolving the ideal filter in Fourier space with a Gaussian kernel with standard deviation σ = 0.1, truncated to a 7 × 7 convolutional filter.

Experimental Setup. We follow the same setup used for the Navier–Stokes data set: we train for 50,000 steps, with batch size 32 and complement masking with $r_{\mathrm{enc}} = 30\%$. An initial learning rate of 10⁻³ is used, with an exponential decay factor of 0.98 applied every 1,000 steps. We make use of positional embeddings (Appendix B.1) with k = 16 Gaussian random Fourier features. When performing the wall-clock training-time experiment (Figure 10), we downsample the training and evaluation data to resolution 211 × 211.

Uncurated Reconstructions and Samples. Reconstructions of randomly selected examples from the held-out evaluation data set are shown in Figure 23. Samples from the FAE generative model and draws from the evaluation data set are shown in Figure 24.

Figure 23: FAE reconstructions of Darcy flow data.

Figure 24: Samples of Darcy flow data.

References

The Fifth International Conference on Learning Representations (ICLR 2017), 2017.

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. IEEE.

The Ninth International Conference on Learning Representations (ICLR 2021), 2021.

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022. IEEE.

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning (ICML 2017), volume 70 of Proceedings of Machine Learning Research, pages 214–223, 2017. URL https://proceedings.mlr.press/v70/arjovsky17a.html. arXiv:1701.07875.

K. Azizzadenesheli, N. Kovachki, Z. Li, M. Liu-Schiaffini, J. Kossaifi, and A. Anandkumar. Neural operators for accelerating scientific simulations and design. Nat. Rev. Phys., 6:320–328, 2024. doi:10.1038/s42254-024-00712-5.

E. Bach, R. Baptista, D. Sanz-Alonso, and A. Stuart. Inverse problems and data assimilation: A machine learning approach, 2024. arXiv:2410.10523.

J. Bengio and Y. LeCun, editors. The Third International Conference on Learning Representations (ICLR 2015), 2015.

K. Bhattacharya, B. Hosseini, N. B. Kovachki, and A. M. Stuart. Model reduction and neural networks for parametric PDEs. SMAI J. Comput. Math., 7:121–157, 2021. doi:10.5802/smai-jcm.74.

T. Bischoff and K. Deck. Unpaired downscaling of fluid flows with diffusion bridges. Artif. Intell. Earth Syst., 3:e230039, 22pp., 2024. doi:10.1175/AIES-D-23-0039.1.

C. M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, 2006. ISBN 978-0-387-31073-2.

V. I. Bogachev.
References

The Fifth International Conference on Learning Representations (ICLR 2017), 2017.

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. IEEE.

The Ninth International Conference on Learning Representations (ICLR 2021), 2021.

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022. IEEE.

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning (ICML 2017), volume 70 of Proceedings of Machine Learning Research, pages 214–223, 2017. URL https://proceedings.mlr.press/v70/arjovsky17a.html. arXiv:1701.07875.

K. Azizzadenesheli, N. Kovachki, Z. Li, M. Liu-Schiaffini, J. Kossaifi, and A. Anandkumar. Neural operators for accelerating scientific simulations and design. Nat. Rev. Phys., 6:320–328, 2024. doi:10.1038/s42254-024-00712-5.

E. Bach, R. Baptista, D. Sanz-Alonso, and A. Stuart. Inverse problems and data assimilation: A machine learning approach, 2024. arXiv:2410.10523.

Y. Bengio and Y. LeCun, editors. The Third International Conference on Learning Representations (ICLR 2015), 2015.

K. Bhattacharya, B. Hosseini, N. B. Kovachki, and A. M. Stuart. Model reduction and neural networks for parametric PDEs. SMAI J. Comput. Math., 7:121–157, 2021. doi:10.5802/smai-jcm.74.

T. Bischoff and K. Deck. Unpaired downscaling of fluid flows with diffusion bridges. Artif. Intell. Earth Syst., 3:e230039, 22pp., 2024. doi:10.1175/AIES-D-23-0039.1.

C. M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, 2006. ISBN 978-0-387-31073-2.

V. I. Bogachev. Gaussian Measures, volume 62 of Mathematical Surveys and Monographs. American Mathematical Society, 1998. doi:10.1090/surv/062.

K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinform., 22(14):e49–e57, 2006. doi:10.1093/bioinformatics/btl242.

D. R. Burt, S. W. Ober, A. Garriga-Alonso, and M. van der Wilk. Understanding variational inference in function-space. In 3rd Symposium on Advances in Approximate Bayesian Inference, 2021. arXiv:2011.09421.

E. Calvello, N. B. Kovachki, M. E. Levine, and A. M. Stuart. Continuum attention for neural operators, 2024. arXiv:2406.06486.

G. J. Chandler and R. R. Kerswell. Invariant recurrent solutions embedded in a turbulent two-dimensional Kolmogorov flow. J. Fluid Mech., 722:554–595, 2013. doi:10.1017/jfm.2013.122.

J. T. Chang and D. Pollard. Conditioning as disintegration. Stat. Neerl., 51(3):287–317, 1997. doi:10.1111/1467-9574.00056.

T. Chen and H. Chen. Approximations of continuous functionals by neural networks with application to dynamic systems. IEEE Trans. Neural Netw., 4(6):910–918, 1993. doi:10.1109/72.286886.

Y. Chen, S. Liu, and X. Wang. Learning continuous image representation with local implicit image function. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) CVP (2021), pages 8624–8634. doi:10.1109/CVPR46437.2021.00852.

Y. Chen, D. Z. Huang, J. Huang, S. Reich, and A. M. Stuart. Sampling via gradient flows in the space of probability measures, 2023. arXiv:2310.03597.

T. Cinquin and R. Bamler. Regularized KL-divergence for well-defined function-space variational inference in Bayesian neural networks, 2024. arXiv:2406.04317.

S. L. Cotter, M. Dashti, and A. M. Stuart. Approximation of Bayesian inverse problems for PDEs. SIAM J. Numer. Anal., 48(1):322–345, 2010. doi:10.1137/090770734.

S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White. MCMC methods for functions: Modifying old algorithms to make them faster. Stat. Sci., 28(3), 2013. doi:10.1214/13-STS421.

T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, Hoboken, NJ, second edition, 2006. doi:10.1002/047174882X.

W. M. Czarnecki, S. Osindero, M. Jaderberg, G. Swirszcz, and R. Pascanu. Sobolev training for neural networks. In Guyon et al. (2017), pages 4278–4287. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/758a06618c69880a6cee5314ee42d52f-Paper.pdf. arXiv:1706.04859.

M. Dashti and A. M. Stuart. The Bayesian approach to inverse problems. In Handbook of Uncertainty Quantification. Vol. 1, 2, 3, chapter 7, pages 311–428. Springer, Cham, 2017. doi:10.1007/978-3-319-12385-1_7.

M. V. de Hoop, D. Z. Huang, E. Qian, and A. M. Stuart. The cost-accuracy trade-off in operator learning with neural networks. J. Mach. Learn., 1(3):299–341, 2022. doi:10.4208/jml.220509.

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, MN, 2019. Association for Computational Linguistics. doi:10.18653/v1/N19-1423.

M. M. Dunlop, D. Slepčev, A. M. Stuart, and M. Thorpe. Large data and zero noise limits of graph-based semi-supervised learning algorithms. Appl. Comput. Harmon. Anal., 49(2):655–697, 2020. doi:10.1016/j.acha.2019.03.005.
R. Durrett. Probability: Theory and Examples. Number 49 in Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, fifth edition, 2019. doi:10.1017/9781108591034.

W. E, W. Ren, and E. Vanden-Eijnden. Minimum action method for the study of rare events. Comm. Pure Appl. Math., 57(5):637–656, 2004. doi:10.1002/cpa.20005.

H. Edwards and A. Storkey. Towards a neural statistician. In The Fifth International Conference on Learning Representations (ICLR 2017) ICL (2017). arXiv:1606.02185.

L. Evans. Partial Differential Equations, volume 19 of Graduate Studies in Mathematics. American Mathematical Society, second edition, 2010. doi:10.1090/gsm/019.

G. Franzese, G. Corallo, S. Rossi, M. Heinonen, M. Filippone, and P. Michiardi. Continuous-time functional diffusion processes. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 37370–37400. Curran Associates, Inc., 2023.

R. A. Freeze and J. A. Cherry. Groundwater. Prentice-Hall, Englewood Cliffs, NJ, 1979. ISBN 0-13-365312-9.

P. Ghosh, M. S. M. Sajjadi, A. Vergari, M. Black, and B. Schölkopf. From variational to deterministic autoencoders. In The Eighth International Conference on Learning Representations (ICLR 2020), 2020. arXiv:1903.12436.

I. Guyon, U. von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors. Advances in Neural Information Processing Systems, volume 30, 2017. Curran Associates, Inc.

P. Hagemann, L. Ruthotto, G. Steidl, and N. T. Yang. Multilevel diffusion: Infinite dimensional score-based diffusion models for image generation, 2023. arXiv:2303.04772.

M. Hairer, A. Stuart, and J. Voss. Signal processing problems on function space: Bayesian formulation, stochastic PDEs and effective MCMC methods. In The Oxford Handbook of Nonlinear Filtering, pages 833–873. Oxford University Press, Oxford, 2011. ISBN 978-0-19-953290-2.

K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) CVP (2022), pages 15979–15988. doi:10.1109/CVPR52688.2022.01553.

D. Hendrycks and K. Gimpel. Gaussian error linear units (GELUs), 2016. arXiv:1606.08415.

I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In The Fifth International Conference on Learning Representations (ICLR 2017) ICL (2017). URL https://openreview.net/pdf?id=Sy2fzU9gl.

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In Larochelle et al. (2020), pages 6840–6851. arXiv:2006.11239.

D. Z. Huang, N. H. Nelsen, and M. Trautner. An operator learning perspective on parameter-to-observable maps. Found. Data Sci., 7:163–225, 2025. doi:10.3934/fods.2024037.

B. E. Husic and V. S. Pande. Markov state models: From an art to a science. J. Am. Chem. Soc., 140(7):2386–2396, 2018. doi:10.1021/jacs.7b12191.

J. B. Keller. Darcy's law for flow in porous media and the two-space method. In Nonlinear Partial Differential Equations in Engineering and Applied Science, pages 429–443. Routledge, 1980. doi:10.1201/9780203745465-27.

G. Kerrigan, J. Ley, and P. Smyth. Diffusion generative models in infinite dimensions. In F. Ruiz, J. Dy, and J.-W. van de Meent, editors, Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS 2023), volume 206 of Proceedings of Machine Learning Research, pages 9538–9563, 2023. arXiv:2212.00886.
J. Kim, J. Yoo, J. Lee, and S. Hong. SetVAE: Learning hierarchical composition for generative modeling of set-structured data. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) CVP (2021), pages 15054–15063. doi:10.1109/CVPR46437.2021.01481.

D. P. Kingma. Variational inference and deep learning: A new synthesis. PhD thesis, University of Amsterdam, 2017. ISBN 978-94-6299-745-5.

D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In Bengio and LeCun (2015). arXiv:1412.6980.

D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Y. Bengio and Y. LeCun, editors, The Second International Conference on Learning Representations (ICLR 2014), 2014. arXiv:1312.6114.

D. P. Kingma and M. Welling. An introduction to variational autoencoders. Found. Trends Mach. Learn., 12(4):307–392, 2019. doi:10.1561/2200000056.

D. Kochkov, J. A. Smith, A. Alieva, Q. Wang, M. P. Brenner, and S. Hoyer. Machine learning-accelerated computational fluid dynamics. Proc. Nat. Acad. Sci. USA, 118(21):e2101784118, 8pp., 2021. doi:10.1073/pnas.2101784118.

K. A. Konovalov, I. C. Unarta, S. Cao, E. C. Goonetilleke, and X. Huang. Markov state models to study the functional dynamics of proteins in the wake of machine learning. J. Am. Chem. Soc. Au, 1(9):1330–1341, 2021. doi:10.1021/jacsau.1c00254.

N. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anandkumar. Neural operator: Learning maps between function spaces with applications to PDEs. J. Mach. Learn. Res., 23:1–97, 2023. arXiv:2108.08481.

S. G. Krein and Yu. I. Petunin. Scales of Banach spaces. Russ. Math. Surv., 21(2):85–159, 1966. doi:10.1070/RM1966v021n02ABEH004151.

S. Lanthaler, A. M. Stuart, and M. Trautner. Discretization error of Fourier neural operators, 2024. arXiv:2405.02221.

H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors. Advances in Neural Information Processing Systems, volume 33, 2020. Curran Associates, Inc.

S. Lee. Mesh-independent operator learning for partial differential equations. In 2nd AI4Science Workshop at the 39th International Conference on Machine Learning, 2022. URL https://openreview.net/pdf?id=JUtZG8-2vGp.

J. Li, Z. Pei, W. Li, G. Gao, L. Wang, Y. Wang, and T. Zeng. A systematic survey of deep learning-based single-image super-resolution. ACM Comput. Surv., 56(10):1–40, 2024. doi:10.1145/3659100.

Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen. PointCNN: Convolution on X-transformed points. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31, pages 820–830. Curran Associates, Inc., 2018. arXiv:1801.07791.

Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar. Fourier neural operator for parametric partial differential equations. In The Ninth International Conference on Learning Representations (ICLR 2021) ICL (2021). arXiv:2010.08895.

J. H. Lim, N. B. Kovachki, R. Baptista, C. Beckham, K. Azizzadenesheli, J. Kossaifi, V. Voleti, J. Song, K. Kreis, J. Kautz, C. Pal, A. Vahdat, and A. Anandkumar. Score-based diffusion models in function space, 2023. arXiv:2302.07400.
R. S. Liptser and A. N. Shiryaev. Statistics of Random Processes. Springer, Berlin, Heidelberg, second edition, 2001. doi:10.1007/978-3-662-13043-8.

L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nat. Mach. Intell., 3(3):218–229, 2021. doi:10.1038/s42256-021-00302-5.

A. Mardt, L. Pasquali, H. Wu, and F. Noé. VAMPnets for deep learning of molecular kinetics. Nat. Commun., 9(1):5, 11pp., 2018. doi:10.1038/s41467-017-02388-1.

B. Øksendal. Stochastic Differential Equations. Universitext. Springer, Berlin, Heidelberg, sixth edition, 2003. doi:10.1007/978-3-642-14394-6.

B. Peherstorfer. Breaking the Kolmogorov barrier with nonlinear model reduction. Not. Am. Math. Soc., 69(5):725–733, 2022. doi:10.1090/noti2475.

J. Pidstrigach, Y. Marzouk, S. Reich, and S. Wang. Infinite-dimensional diffusion models for function spaces, 2023. arXiv:2302.10130.

M. Prasthofer, T. De Ryck, and S. Mishra. Variable-input deep operator networks, 2022. arXiv:2205.11404.

J.-H. Prinz, H. Wu, M. Sarich, B. Keller, M. Senne, M. Held, J. D. Chodera, C. Schütte, and F. Noé. Markov models of molecular kinetics: Generation and validation. J. Chem. Phys., 134(17):174105, 23pp., 2011. doi:10.1063/1.3565032.

C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 77–85. IEEE, 2017. doi:10.1109/CVPR.2017.16.

W. Quan, J. Chen, Y. Liu, D.-M. Yan, and P. Wonka. Deep learning-based image and video inpainting: A survey. Int. J. Comput. Vis., 132:2364–2400, 2024. doi:10.1007/s11263-023-01977-6.

M. A. Rahman, M. A. Florez, A. Anandkumar, Z. E. Ross, and K. Azizzadenesheli. Generative adversarial neural operators. Transact. Mach. Learn. Res., 2022. arXiv:2205.03017.

J. O. Ramsay and B. W. Silverman, editors. Applied Functional Data Analysis: Methods and Case Studies. Springer Series in Statistics. Springer, 2002. doi:10.1007/b98886.

M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, editors. Advances in Neural Information Processing Systems, volume 34, 2021. Curran Associates, Inc.

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) CVP (2022), pages 10674–10685. doi:10.1109/CVPR52688.2022.01042.

S. Särkkä and A. Solin. Applied Stochastic Differential Equations. Cambridge University Press, first edition, 2019. doi:10.1017/9781108186735.

T. Schlick. Molecular Modeling and Simulation: An Interdisciplinary Guide, volume 21 of Interdisciplinary Applied Mathematics. Springer, New York, second edition, 2010. doi:10.1007/978-1-4419-6351-2.

D. W. Scott. Multivariate Density Estimation. Wiley Series in Probability and Statistics. John Wiley and Sons, second edition, 2015. doi:10.1002/9781118575574.

J. H. Seidman, G. Kissas, P. Perdikaris, and G. J. Pappas. NOMAD: Nonlinear manifold decoders for operator learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 5601–5613. Curran Associates, Inc., 2022. arXiv:2206.03551.

J. H. Seidman, G. Kissas, G. J. Pappas, and P. Perdikaris. Variational autoencoding neural operators. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning (ICML 2023), volume 202 of Proceedings of Machine Learning Research, pages 30491–30522, 2023. arXiv:2302.10351.
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Bengio and LeCun (2015). arXiv:1409.1556.

V. Sitzmann, J. N. P. Martel, A. W. Bergman, D. B. Lindell, and G. Wetzstein. Implicit neural representations with periodic activation functions. In Larochelle et al. (2020), pages 7462–7473. arXiv:2006.09661.

J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), volume 37 of Proceedings of Machine Learning Research, pages 2256–2265, 2015. arXiv:1503.03585.

Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In The Ninth International Conference on Learning Representations (ICLR 2021) ICL (2021). arXiv:2011.13456.

A. M. Stuart. Inverse problems: A Bayesian perspective. Acta Numer., 19:451–559, 2010. doi:10.1017/S0962492910000061.

V. N. Sudakov. Linear sets with quasi-invariant measure. Dokl. Akad. Nauk SSSR, 127:524–525, 1959.

T. J. Sullivan. Introduction to Uncertainty Quantification, volume 63 of Texts in Applied Mathematics. Springer, 2015. doi:10.1007/978-3-319-23395-6.

S. Sun, G. Zhang, J. Shi, and R. Grosse. Functional variational Bayesian neural networks. In The Seventh International Conference on Learning Representations (ICLR 2019), 2019. arXiv:1903.05779.

M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In Larochelle et al. (2020), pages 7537–7547. arXiv:2006.10739.

A. Vahdat, K. Kreis, and J. Kautz. Score-based generative modeling in latent space. In Ranzato et al. (2021), pages 11287–11302. arXiv:2106.05931.

Y. Wang, D. Blei, and J. P. Cunningham. Posterior collapse and latent variable non-identifiability. In Ranzato et al. (2021), pages 5443–5455. arXiv:2301.00537.

M. Yang, B. Dai, H. Dai, and D. Schuurmans. Energy-based processes for exchangeable data. In H. Daumé III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning (ICML 2020), volume 119 of Proceedings of Machine Learning Research, pages 10681–10692, 2020. arXiv:2003.07521.

M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Póczos, R. Salakhutdinov, and A. J. Smola. Deep sets. In Guyon et al. (2017), pages 3391–3401. arXiv:1703.06114.

B. Zhang and P. Wonka. Functional diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4723–4732, 2024. arXiv:2311.15435.