Journal of Machine Learning Research 26 (2025) 1-54          Submitted 1/25; Published 6/25

Autoencoders in Function Space

Justin Bunker (jb2200@cantab.ac.uk)
Department of Engineering, University of Cambridge, Cambridge, CB2 1TN, United Kingdom

Mark Girolami (mgirolami@turing.ac.uk)
Department of Engineering, University of Cambridge, and Alan Turing Institute, Cambridge, CB2 1TN, United Kingdom

Hefin Lambley (hefin.lambley@warwick.ac.uk)
Mathematics Institute, University of Warwick, Coventry, CV4 7AL, United Kingdom

Andrew M. Stuart (astuart@caltech.edu)
Computing + Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, United States of America

T. J. Sullivan (t.j.sullivan@warwick.ac.uk)
Mathematics Institute & School of Engineering, University of Warwick, Coventry, CV4 7AL, United Kingdom

Editor: Mahdi Soltanolkotabi

Abstract

Autoencoders have found widespread application in both their original deterministic form and in their variational formulation (VAEs). In scientific applications and in image processing it is often of interest to consider data that are viewed as functions; while discretisation (of differential equations arising in the sciences) or pixellation (of images) renders problems finite-dimensional in practice, conceiving first of algorithms that operate on functions, and only then discretising or pixellating, leads to better algorithms that operate smoothly across resolutions. In this paper function-space versions of the autoencoder (FAE) and variational autoencoder (FVAE) are introduced, analysed, and deployed. Well-definedness of the objective governing VAEs is a subtle issue, particularly in function space, limiting applicability. For the FVAE objective to be well defined requires compatibility of the data distribution with the chosen generative model; this can be achieved, for example, when the data arise from a stochastic differential equation, but is generally restrictive. The FAE objective, on the other hand, is well defined in many situations where FVAE fails to be. Pairing the FVAE and FAE objectives with neural operator architectures that can be evaluated on any mesh enables new applications of autoencoders to inpainting, superresolution, and generative modelling of scientific data.

Keywords: variational inference on function space, operator learning, variational autoencoders, regularised autoencoders, scientific machine learning

©2025 Justin Bunker, Mark Girolami, Hefin Lambley, Andrew M. Stuart and T. J. Sullivan. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/25-0035.html.

1. Introduction

Functional data, and data that can be viewed as a high-resolution approximation of functions, are ubiquitous in data science (Ramsay and Silverman, 2002). Recent years have seen much interest in machine learning in this setting, with the promise of architectures that can be trained and evaluated across resolutions. A variety of methods now exist for the supervised learning of operators between function spaces, starting with the work of Chen and Chen (1995), followed by DeepONet (Lu et al., 2021), PCA-Net (Bhattacharya et al., 2021), Fourier neural operators (FNOs; Li et al., 2021) and variants (Kovachki et al., 2023). These methods have proven useful in diverse applications such as surrogate modelling for costly simulators of dynamical systems (Azizzadenesheli et al., 2024).
Practical algorithms for functional data must necessarily operate on discrete representations of the underlying infinite-dimensional objects, identifying salient features independent of resolution. Some models make this dimension reduction explicit by representing outputs as a linear combination of basis functions: learned from data in DeepONet, and computed using principal component analysis (PCA) in PCA-Net. Others do this implicitly, as in, for example, the deep layered structure of FNOs involving repeated application of the discrete Fourier transform followed by pointwise activation. While linear dimension-reduction methods such as PCA adapt readily to function space, there are many types of data, such as solutions to advection-dominated partial differential equations (PDEs), for which linear approximations are provably inefficient, a phenomenon known as the Kolmogorov barrier (Peherstorfer, 2022). This suggests the need for nonlinear dimension-reduction techniques on function space.

Motivated by this we propose an extension of variational autoencoders (VAEs; Kingma and Welling, 2014) to functional data using operator learning; we refer to the resulting model as the functional variational autoencoder (FVAE). As a probabilistic latent-variable model, FVAE allows for both dimension reduction and principled generative modelling on function space. We define the FVAE objective as the Kullback–Leibler (KL) divergence between two joint distributions, both on the product of the data and latent spaces, defined by the encoder and decoder models; we then derive conditions under which this objective is well defined. This differs from the usual presentation of VAEs in which one maximises a lower bound on the data likelihood, the evidence lower bound (ELBO); we show that our objective is equivalent and that it generalises naturally to function space.

The FVAE objective proves to be well defined only under a compatibility condition between the data and the generative model; this condition is easily satisfied in finite dimensions but is restrictive for functional data. Our applications of FVAE rest on establishing such compatibility, which is possible for problems in the sciences such as those arising in Bayesian inverse problems with Gaussian priors (Stuart, 2010), and those governed by stochastic differential equations (SDEs) (Hairer et al., 2011). However, there are many problems in the sciences, and in generative models for PDEs in particular, for which application of FVAE fails because the generative model is incompatible with the data; using FVAE in such settings leads to foundational theoretical issues and, as a result, to practical problems in the infinite-resolution and infinite-data limits.

To overcome these foundational issues we propose a deterministic regularised autoencoder that can be applied in very general settings, which we call the functional autoencoder (FAE). We show that FAE is an effective tool for dimension reduction, and that it can be used as a versatile generative model for functional data. We complement the FVAE and FAE objectives on function space with neural-operator architectures that can be discretised on any mesh. The ability to discretise both the encoder and the decoder on arbitrary meshes extends prior work such as the variational autoencoding neural operator (VANO; Seidman et al., 2023), and is highly empowering, enabling a variety of new applications of autoencoders to scientific data such as inpainting and superresolution.
Code accompanying the paper is available at https://github.com/htlambley/functional_autoencoders.

Contributions. We make the following contributions to the development and application of autoencoders to functional data:

(C1) we propose FVAE, an extension of VAEs to function space, finding that the training objective is well defined so long as the generative model is compatible with the data;

(C2) we complement the FVAE training objective with mesh-invariant architectures that can be deployed on any discretisation, even irregular, non-grid discretisations;

(C3) we show that when the data and generative model are incompatible, the discretised FVAE objective may diverge in the infinite-resolution and infinite-data limits or entirely fail to minimise the divergence between the encoder and decoder;

(C4) we propose FAE, an extension of regularised autoencoders to function space, and show that its objective is well defined in many cases where the FVAE objective is not;

(C5) exploiting mesh-invariance, we propose masked training schemes which exhibit greater robustness at inference time, faster training, and lower memory usage;

(C6) we validate FAE and FVAE on examples from the sciences, including problems governed by SDEs and PDEs, to discover low-dimensional latent structure from data, and use our models for inpainting, superresolution, and generative modelling, exploiting the ability to discretise the encoder and decoder on any mesh.

Outline. Section 2 extends VAEs to function space (Contribution (C1)). We then describe the mesh-invariant architectures of Contribution (C2); we also validate our approach, FVAE, on examples such as SDE path distributions where compatibility between the data and the generative model can be verified. Section 3 gives an example of the problems arising when applying FVAE in the misspecified setting of Contribution (C3). In Section 4 we propose FAE (Contribution (C4)) and apply our data-driven method to two examples from scientific machine learning: Navier–Stokes fluid flows and Darcy flows in porous media. In these problems we make use of the mesh-invariance of the FAE to apply the masked training scheme of Contribution (C5); masking proves to be vital for the applications to inpainting and superresolution in Contribution (C6). Section 5 discusses related work, and Section 6 discusses limitations and topics for future research.

2. Variational Autoencoders on Function Space

In this section we define the VAE objective by minimising the KL divergence between two distinct joint distributions on the product of data and latent spaces, one defined by the encoder and the other by the decoder. We show that this gives a tractable objective when written in terms of a suitable reference distribution (Section 2.1). This objective coincides with maximising the ELBO in finite dimensions (Section 2.2) and extends readily to infinite dimensions. The divergence between the encoder and decoder models is finite only under a compatibility condition between the data and the generative model. This condition is restrictive in infinite dimensions. We identify problem classes for which this compatibility holds (Section 2.3) and pair the objective with mesh-invariant encoder and decoder architectures (Section 2.4), leading to a model we call the functional variational autoencoder (FVAE). We then validate our method on several problems from scientific machine learning (Section 2.5) where the data are governed by SDEs.
2.1 Training Objective

We begin by formulating an objective in infinite dimensions, making the following standard assumption in unsupervised learning throughout Section 2. At this stage we focus on the continuum problem, and address the question of discretisation in Section 2.4. In what follows $\mathcal{P}(X)$ is the set of Borel probability measures on the separable Banach space $X$.

Assumption 1 Take the data space $(U, \|\cdot\|)$ to be a separable Banach space. There exists a data distribution $\Upsilon \in \mathcal{P}(U)$ from which we have access to $N$ independent and identically distributed samples $\{u^{(n)}\}_{n=1}^N \subset U$.

This setting is convenient to work with yet general enough to include many spaces of interest; $U$ could be, for example, a Euclidean space $\mathbb{R}^k$, or the infinite-dimensional space $L^2(\Omega)$ of (equivalence classes of) square-integrable functions with domain $\Omega \subseteq \mathbb{R}^d$. Separability is a technical condition used to guarantee the existence of conditional distributions in the encoder and decoder models defined shortly (see Chang and Pollard, 1997, Theorem 1).

An autoencoder consists of two nonlinear transformations: an encoder mapping data into a low-dimensional latent space $Z$, and a decoder mapping from $Z$ back into $U$. The goal is to choose transformations such that, for $u \sim \Upsilon$, composing the encoder and decoder approximates the identity. In this way the autoencoder can be used for dimension reduction, with the encoder output being a compressed representation of the data. To make this precise, fix $Z = \mathbb{R}^{d_Z}$ and define encoder and decoder transformations depending on parameters $\theta \in \Theta$ and $\psi \in \Psi$, respectively, by

\[ \text{(encoder)} \quad U \ni u \mapsto Q^\theta_{z|u} \in \mathcal{P}(Z), \tag{1a} \]
\[ \text{(decoder)} \quad Z \ni z \mapsto P^\psi_{u|z} \in \mathcal{P}(U). \tag{1b} \]

To ensure the statistical models we are interested in are well defined, we require both maps to be Markov kernels; that is, for all Borel measurable sets $A \subseteq Z$ and $B \subseteq U$ and all parameters $\theta \in \Theta$ and $\psi \in \Psi$, the maps $u \mapsto Q^\theta_{z|u}(A)$ and $z \mapsto P^\psi_{u|z}(B)$ are measurable. Since the encoder and decoder both output probability distributions, we must be precise about what it means to compose them: we mean the distribution

\[ \text{(autoencoding distribution)} \quad A^{\theta,\psi}_u(B) = \int_Z P^\psi_{u|z}(B)\, Q^\theta_{z|u}(\mathrm{d}z), \qquad B \subseteq U \text{ measurable}. \tag{2} \]

Remark 2 Given $u \in U$, $A^{\theta,\psi}_u(\cdot)$ is the distribution given by drawing $z \sim Q^\theta_{z|u}$ and then sampling from $P^\psi_{u|z}$; we would like this to be close to a Dirac distribution $\delta_u$ when $u$ is drawn from $\Upsilon$; enforcing this, approximately, will be used to determine the parameters $(\theta, \psi)$.

Autoencoding as Matching Joint Distributions. Now let us fix a latent distribution $P_z \in \mathcal{P}(Z)$ and define two distributions for $(z, u)$ on the product space $Z \times U$:

\[ \text{(joint encoder model)} \quad Q^\theta_{z,u}(\mathrm{d}z, \mathrm{d}u) = Q^\theta_{z|u}(\mathrm{d}z)\,\Upsilon(\mathrm{d}u), \tag{3a} \]
\[ \text{(joint decoder model)} \quad P^\psi_{z,u}(\mathrm{d}z, \mathrm{d}u) = P^\psi_{u|z}(\mathrm{d}u)\,P_z(\mathrm{d}z). \tag{3b} \]

The joint encoder model (3a) is the distribution on $(z, u)$ given by applying the encoder to data $u \sim \Upsilon$, while the joint decoder model (3b) is the distribution on $(z, u)$ given by applying the decoder to latent vectors $z \sim P_z$. We emphasise that $\Upsilon$ is given, while the distributions $Q^\theta_{z|u}$, $P^\psi_{u|z}$, and $P_z$ are to be specified; the choice of decoder distribution $P^\psi_{u|z}$ on the (possibly infinite-dimensional) space $U$ will require particular attention in what follows. The marginal and conditional distributions of (3a) and (3b) will be important and we write, for example, $P^\psi_{z|u}$ for the $z \mid u$-conditional of the joint decoder model and $Q^\theta_z$ for the $z$-marginal of the joint encoder model. In particular, the $u$-marginal of the joint decoder model is the generative model $P^\psi_u$.
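To make the composition in (2) concrete, the following minimal NumPy sketch draws one sample from the autoencoding distribution under the finite-dimensional Gaussian model introduced later in Section 2.2; the maps `enc_mean`, `enc_cov_diag`, and `dec_mean` are illustrative placeholders for the learnable maps $f$, $\Sigma$, and $g$, not the paper's implementation.

```python
# Minimal sketch of the autoencoding distribution (2) in finite dimensions:
# draw z ~ Q^theta_{z|u}, then u' ~ P^psi_{u|z}.
import numpy as np

rng = np.random.default_rng(0)
d_u, d_z, beta = 8, 2, 0.1

def enc_mean(u):        # placeholder for f(u; theta)
    return u[:d_z]

def enc_cov_diag(u):    # placeholder for the diagonal of Sigma(u; theta)
    return 0.01 * np.ones(d_z)

def dec_mean(z):        # placeholder for g(z; psi)
    return np.tile(z, d_u // d_z)

def sample_autoencoding(u):
    """One draw from A^{theta,psi}_u: encode, sample latent, decode, add noise."""
    z = enc_mean(u) + np.sqrt(enc_cov_diag(u)) * rng.standard_normal(d_z)
    return dec_mean(z) + np.sqrt(beta) * rng.standard_normal(d_u)

u = rng.standard_normal(d_u)
print(sample_autoencoding(u))   # should concentrate near u once trained
```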
To match the joint encoder model (3a) with the joint decoder model (3b), we seek to solve, for some statistical distance or divergence $d$ on $\mathcal{P}(Z \times U)$, the minimisation problem

\[ \operatorname*{arg\,min}_{\theta \in \Theta,\, \psi \in \Psi} d\big( Q^\theta_{z,u}, P^\psi_{z,u} \big). \tag{4} \]

Doing so determines an autoencoder and a generative model. The choice of $d$ is constrained by the need for it to be possible to evaluate the objective with $\Upsilon$ known only empirically; examples for which this is the case include the Wasserstein metrics and the KL divergence.

VAE Objective as Minimisation of KL Divergence. Using the KL divergence as $d$ in (4) leads to the goal of finding $\theta \in \Theta$ and $\psi \in \Psi$ to minimise

\[ D_{\mathrm{KL}}\big( Q^\theta_{z,u} \,\big\|\, P^\psi_{z,u} \big) = \mathbb{E}_{(z,u) \sim Q^\theta_{z,u}} \left[ \log \frac{\mathrm{d}Q^\theta_{z,u}}{\mathrm{d}P^\psi_{z,u}}(z,u) \right] = \mathbb{E}_{u \sim \Upsilon}\, \mathbb{E}_{z \sim Q^\theta_{z|u}} \left[ \log \frac{\mathrm{d}Q^\theta_{z,u}}{\mathrm{d}P^\psi_{z,u}}(z,u) \right]; \tag{5} \]

here $\mathrm{d}Q^\theta_{z,u}/\mathrm{d}P^\psi_{z,u}$ is the Radon–Nikodym derivative of $Q^\theta_{z,u}$ with respect to $P^\psi_{z,u}$, the appropriate infinite-dimensional analogue of the ratio of probability densities. This exists only when $Q^\theta_{z,u}$ is absolutely continuous with respect to $P^\psi_{z,u}$, meaning that $Q^\theta_{z,u}$ assigns probability zero to a set whenever $P^\psi_{z,u}$ does; we take (5) to be infinite otherwise. This objective is equivalent to the standard VAE objective in the case $U = \mathbb{R}^k$ but has the additional advantage that it can be used in the infinite-dimensional setting.

KL divergence has many benefits justifying its use across statistics. Its value in inference and generative modelling comes from the connection between maximum-likelihood methods and minimising KL divergence, as well as its information-theoretic interpretation (Cover and Thomas, 2006, Sec. 2.3). KL divergence is asymmetric and has two useful properties justifying the order of the arguments in (5): it can be evaluated using only samples of the distribution in its first argument, and it requires no knowledge of the normalisation constant of the distribution in its second argument, since minimisers of $\nu \mapsto D_{\mathrm{KL}}(\nu \| \mu)$ are invariant under scaling of $\mu$ (Chen et al., 2023 and Bach et al., 2024, Sec. 12.2).

To justify the use of the joint divergence (5), in Theorem 3 we decompose the objective as the sum of two interpretable terms. In Theorem 6 we write the objective in a form that leads to actionable algorithms.

Theorem 3 For all parameters $\theta \in \Theta$ and $\psi \in \Psi$ for which $D_{\mathrm{KL}}(Q^\theta_{z,u} \| P^\psi_{z,u}) < \infty$,

\[ D_{\mathrm{KL}}\big( Q^\theta_{z,u} \,\big\|\, P^\psi_{z,u} \big) = \underbrace{D_{\mathrm{KL}}\big( \Upsilon \,\big\|\, P^\psi_u \big)}_{\text{(I)}} + \mathbb{E}_{u \sim \Upsilon} \Big[ \underbrace{D_{\mathrm{KL}}\big( Q^\theta_{z|u} \,\big\|\, P^\psi_{z|u} \big)}_{\text{(II)}} \Big]. \]

Proof Factorise $Q^\theta_{z,u}(\mathrm{d}z, \mathrm{d}u) = Q^\theta_{z|u}(\mathrm{d}z)\,\Upsilon(\mathrm{d}u)$ and $P^\psi_{z,u}(\mathrm{d}z, \mathrm{d}u) = P^\psi_{z|u}(\mathrm{d}z)\,P^\psi_u(\mathrm{d}u)$ and substitute into (5) to obtain

\[ D_{\mathrm{KL}}\big( Q^\theta_{z,u} \,\big\|\, P^\psi_{z,u} \big) = \mathbb{E}_{u \sim \Upsilon}\, \mathbb{E}_{z \sim Q^\theta_{z|u}} \left[ \log \frac{\mathrm{d}\Upsilon}{\mathrm{d}P^\psi_u}(u) + \log \frac{\mathrm{d}Q^\theta_{z|u}}{\mathrm{d}P^\psi_{z|u}}(z) \right]. \tag{6} \]

The result follows by inserting the definition of the KL divergences on the right-hand side of (6), and all terms are finite owing to the assumption that $D_{\mathrm{KL}}(Q^\theta_{z,u} \| P^\psi_{z,u}) < \infty$.

Theorem 3 decomposes the divergence (5) into (I) the error in the generative model and (II) the error in approximating the decoder posterior $P^\psi_{z|u}$ with $Q^\theta_{z|u}$ via variational inference; jointly training a variational-inference model and a generative model is exactly the goal of a VAE. However, minimising (5) makes sense only if $(\theta, \psi) \mapsto D_{\mathrm{KL}}(Q^\theta_{z,u} \| P^\psi_{z,u})$ is finite for some $\theta$ and $\psi$. Verification of this property is a necessary step that we perform for our settings of interest in Propositions 8, 16, and 24.

Remark 4 (Posterior collapse) While minimising (5) typically leads to a useful autoencoder, this is not guaranteed: the autoencoding distribution (2) may be very far from a Dirac distribution.
For example, when $P^\psi_{z|u} \approx P_z$, the optimal decoder distribution $P^\psi_{u|z}$ may well ignore the latent variable $z$ entirely. Indeed, if $\Upsilon = Q_{z|u} = P_{u|z} = P_z = N(0, 1)$ on $U = \mathbb{R}$, then $D_{\mathrm{KL}}(Q_{z,u} \| P_{z,u}) = 0$, but the autoencoding distribution $A_u = N(0, 1)$ for all $u \in U$. This is not close to the desired Dirac distribution referred to in Remark 2. This issue is known in the VAE literature as posterior collapse (Wang et al., 2021). In practice, the choice of model classes for $Q_{z|u}$ and $P_{u|z}$ avoids this issue.

Tractable Training Objective. Since we can access $\Upsilon$ only through training data, we must decompose the objective function into a sum of a term which involves $\Upsilon$ only through samples and a term which is independent of the parameters $\theta$ and $\psi$. The first of these terms may then be used to define a tractable objective function over the parameters. To address this issue we introduce a reference distribution $\Lambda \in \mathcal{P}(U)$ and impose the following conditions on the encoder and the reference distribution.

Assumption 5
(a) There exists a reference distribution $\Lambda \in \mathcal{P}(U)$ such that:
    (i) for all $\psi \in \Psi$ and $z \in Z$, $P^\psi_{u|z}$ is mutually absolutely continuous with $\Lambda$; and
    (ii) the data distribution $\Upsilon$ satisfies the finite-information condition $D_{\mathrm{KL}}(\Upsilon \| \Lambda) < \infty$.
(b) For all $\theta \in \Theta$ and $\Upsilon$-almost all $u \in U$, $D_{\mathrm{KL}}(Q^\theta_{z|u} \| P_z) < \infty$.

Theorem 6 If Assumption 5 is satisfied for some $\Lambda \in \mathcal{P}(U)$, then for all parameters $\theta \in \Theta$ and $\psi \in \Psi$ for which $D_{\mathrm{KL}}(Q^\theta_{z,u} \| P^\psi_{z,u}) < \infty$, we have

\[ D_{\mathrm{KL}}\big( Q^\theta_{z,u} \,\big\|\, P^\psi_{z,u} \big) = \mathbb{E}_{u \sim \Upsilon}\big[ L(u; \theta, \psi) \big] + D_{\mathrm{KL}}(\Upsilon \| \Lambda), \tag{7a} \]

\[ \text{(per-sample loss)} \quad L(u; \theta, \psi) = \mathbb{E}_{z \sim Q^\theta_{z|u}} \left[ -\log \frac{\mathrm{d}P^\psi_{u|z}}{\mathrm{d}\Lambda}(u) \right] + D_{\mathrm{KL}}\big( Q^\theta_{z|u} \,\big\|\, P_z \big). \tag{7b} \]

Proof Write $Q^\theta_{z,u}(\mathrm{d}z, \mathrm{d}u) = Q^\theta_{z|u}(\mathrm{d}z)\,\Upsilon(\mathrm{d}u)$ and $P^\psi_{z,u}(\mathrm{d}z, \mathrm{d}u) = P^\psi_{u|z}(\mathrm{d}u)\,P_z(\mathrm{d}z)$, factor through the distribution $\Lambda$ in (5), and apply the definition of the KL divergence to obtain

\begin{align*}
D_{\mathrm{KL}}\big( Q^\theta_{z,u} \,\big\|\, P^\psi_{z,u} \big)
&= \mathbb{E}_{(z,u) \sim Q^\theta_{z,u}} \left[ \log \frac{\mathrm{d}Q^\theta_{z|u}}{\mathrm{d}P_z}(z) + \log \frac{\mathrm{d}\Lambda}{\mathrm{d}P^\psi_{u|z}}(u) + \log \frac{\mathrm{d}\Upsilon}{\mathrm{d}\Lambda}(u) \right] \\
&= \mathbb{E}_{u \sim \Upsilon} \left[ \mathbb{E}_{z \sim Q^\theta_{z|u}} \left[ -\log \frac{\mathrm{d}P^\psi_{u|z}}{\mathrm{d}\Lambda}(u) \right] + D_{\mathrm{KL}}\big( Q^\theta_{z|u} \,\big\|\, P_z \big) \right] + D_{\mathrm{KL}}(\Upsilon \| \Lambda),
\end{align*}

where all terms are finite as a consequence of Assumptions 5(a)(ii) and (b).

Since $D_{\mathrm{KL}}(\Upsilon \| \Lambda)$ is assumed to be finite and depends on neither $\theta$ nor $\psi$, Theorem 6 shows that the joint KL divergence (5) is equivalent, up to a finite constant, to

\[ \text{(FVAE objective)} \quad J^{\mathrm{FVAE}}(\theta, \psi) = \mathbb{E}_{u \sim \Upsilon}\big[ L(u; \theta, \psi) \big]. \tag{8} \]

In particular, minimising $J^{\mathrm{FVAE}}$ is equivalent to minimising the joint divergence, and $J^{\mathrm{FVAE}}(\theta, \psi)$ is finite if and only if $D_{\mathrm{KL}}(Q^\theta_{z,u} \| P^\psi_{z,u})$ is finite. Moreover (8) can be approximated using samples from $\Upsilon$:

\[ \text{(empirical FVAE objective)} \quad J^{\mathrm{FVAE}}_N(\theta, \psi) = \frac{1}{N} \sum_{n=1}^N L\big( u^{(n)}; \theta, \psi \big). \tag{9} \]

In the limit of infinite data, the empirical objective (9) converges to (8); but both (8) and (9) may be infinite in many practical settings, as we shall see in the following sections.

Remark 7
(a) Assumption 5(a) ensures that the density $\mathrm{d}P^\psi_{u|z}/\mathrm{d}\Lambda$ in the per-sample loss exists, and that minimising (5) and (8) is equivalent; Assumption 5(b) ensures that, when the joint divergence (5) is finite, the per-sample loss $L$ is finite for $\Upsilon$-almost all $u \in U$. We could also formulate Assumption 5(a) with a $\sigma$-finite reference measure $\Lambda$, e.g., Lebesgue measure, but for our theory it suffices to consider probability measures.

(b) The proof of Theorem 6 shows why we take $Q^\theta_{z,u}$ as the first argument and $P^\psi_{z,u}$ as the second in (5) to obtain a tractable objective.
Reversing the arguments in the divergence gives an expectation with respect to the joint decoder model (3b):

\[ D_{\mathrm{KL}}\big( P^\psi_{z,u} \,\big\|\, Q^\theta_{z,u} \big) = \mathbb{E}_{(z,u) \sim P^\psi_{z,u}} \left[ \log \frac{\mathrm{d}P_z}{\mathrm{d}Q^\theta_{z|u}}(z) + \log \frac{\mathrm{d}P^\psi_{u|z}}{\mathrm{d}\Lambda}(u) \right] + \mathbb{E}_{(z,u) \sim P^\psi_{z,u}} \left[ \log \frac{\mathrm{d}\Lambda}{\mathrm{d}\Upsilon}(u) \right]. \]

Unlike in Theorem 6, the term involving $\mathrm{d}\Lambda/\mathrm{d}\Upsilon$ is not a constant: it depends on the parameter $\psi$ and would therefore need to be evaluated during the optimisation process, but this is intractable because we have only samples from $\Upsilon$ and not its density.

2.2 Objective in Finite Dimensions

We now show that the theory in Section 2.1 simplifies in finite dimensions to the usual VAE objective. To do so we assume $U = \mathbb{R}^k$ and that $\Upsilon \in \mathcal{P}(U)$ has strictly positive probability density $\upsilon \colon U \to (0, \infty)$. We moreover assume that $Z = \mathbb{R}^{d_Z}$ and that the latent distribution and the distributions returned by the encoder (1a) and decoder (1b) are Gaussian, taking the form

\[ Q^\theta_{z|u} = N\big( f(u; \theta), \Sigma(u; \theta) \big) = f(u; \theta) + \Sigma(u; \theta)^{\frac12} N(0, I_Z), \tag{10a} \]
\[ P^\psi_{u|z} = N\big( g(z; \psi), \beta I_U \big) = g(z; \psi) + \beta^{\frac12} N(0, I_U), \tag{10b} \]
\[ P_z = N(0, I_Z), \tag{10c} \]

where $\beta > 0$ is fixed and the parameters of (10a) and (10b) are given by learnable maps

\[ \mathsf{f} = (f, \Sigma) \colon U \times \Theta \to Z \times S_+(Z), \tag{11a} \]
\[ g \colon Z \times \Psi \to U, \tag{11b} \]

with $S_+(Z)$ denoting the set of positive-semidefinite matrices on $Z$. The Gaussian model (10a)-(10c) is the standard setting in which VAEs are applied, resulting in the joint decoder model (3b) being the distribution of $(z, u)$ in the model

\[ u \mid z = g(z; \psi) + \eta, \qquad z \sim N(0, I_Z), \qquad \eta \sim P_\eta = \beta^{\frac12} N(0, I_U). \tag{12} \]

Other decoder models are also possible (Kingma and Welling, 2014), and in infinite dimensions we will consider a wide class of decoders, including Gaussians as a particular case. In practice (11a) and (11b) will come from a parametrised class of functions, e.g., a class of neural networks. Provided these classes are sufficiently large and the data distribution has finite information with respect to $P_\eta$, the joint divergence (5) is finite for at least one choice of $\theta$ and $\psi$.

Proposition 8 Suppose that for parameters $\theta^\star \in \Theta$ and $\psi^\star \in \Psi$ we have $\mathsf{f}(u; \theta^\star) = (0, I_Z)$ and $g(z; \psi^\star) = 0$. Then $D_{\mathrm{KL}}\big( Q^{\theta^\star}_{z,u} \,\big\|\, P^{\psi^\star}_{z,u} \big) = D_{\mathrm{KL}}(\Upsilon \| P_\eta)$. In particular, if $D_{\mathrm{KL}}(\Upsilon \| P_\eta) < \infty$, then the joint divergence (5) has finite infimum.

Proof Evaluating the joint encoder model (3a) and the joint decoder model (3b) at $\theta^\star$ and $\psi^\star$ gives $Q^{\theta^\star}_{z,u}(\mathrm{d}z, \mathrm{d}u) = P_z(\mathrm{d}z)\,\Upsilon(\mathrm{d}u)$ and $P^{\psi^\star}_{z,u}(\mathrm{d}z, \mathrm{d}u) = P_z(\mathrm{d}z)\,P_\eta(\mathrm{d}u)$, so

\[ D_{\mathrm{KL}}\big( Q^{\theta^\star}_{z,u} \,\big\|\, P^{\psi^\star}_{z,u} \big) = \mathbb{E}_{u \sim \Upsilon}\, \mathbb{E}_{z \sim P_z} \left[ \log \frac{\mathrm{d}P_z}{\mathrm{d}P_z}(z) + \log \frac{\mathrm{d}\Upsilon}{\mathrm{d}P_\eta}(u) \right] = D_{\mathrm{KL}}(\Upsilon \| P_\eta). \]

Remark 9 In finite dimensions, many data distributions satisfy $D_{\mathrm{KL}}(\Upsilon \| P_\eta) < \infty$. However there are cases in which this fails, e.g., when $\Upsilon$ is a Cauchy distribution on $\mathbb{R}$. In such cases (5) is infinite even with the trivial maps $\mathsf{f}(u; \theta^\star) = (0, I_Z)$ and $g(z; \psi^\star) = 0$.

Training Objective. In Section 2.1 we proved that the joint divergence (5) can be approximated by the tractable objective (8) and its empiricalisation (9) whenever there is a reference distribution $\Lambda$ satisfying Assumption 5. We now show that taking $\Lambda = \Upsilon$ satisfies this assumption and results in a training objective equivalent to maximising the ELBO. Recall the data density $\upsilon$ associated with the data measure $\Upsilon$.

Proposition 10 Under the model (10a)-(10c) with reference distribution $\Lambda = \Upsilon$, Assumption 5 is satisfied, and, for some finite constant $C > 0$ independent of $u$, $\theta$, and $\psi$,

\[ L(u; \theta, \psi) = \mathbb{E}_{z \sim Q^\theta_{z|u}} \big[ -\log p^\psi_{u|z}(u) \big] + \log \upsilon(u) + D_{\mathrm{KL}}\big( Q^\theta_{z|u} \,\big\|\, P_z \big) \tag{13a} \]
\[ \phantom{L(u; \theta, \psi)} = \mathbb{E}_{z \sim Q^\theta_{z|u}} \Big[ (2\beta)^{-1} \big\| g(z; \psi) - u \big\|_2^2 \Big] + \log \upsilon(u) + D_{\mathrm{KL}}\big( Q^\theta_{z|u} \,\big\|\, P_z \big) + C. \tag{13b} \]

Proof For Assumption 5(a), mutual absolute continuity of $P^\psi_{u|z}$ and $\Lambda$ follows as both distributions have strictly positive densities, and evidently $D_{\mathrm{KL}}(\Upsilon \| \Lambda) = 0$.
Assumption 5(b) holds since both the encoder distribution (10a) and the latent distribution (10c) are Gaussian, with KL divergence available in closed form (Remark 11). The expression for $L$ follows from (7b) using that $P^\psi_{u|z}$ is Gaussian with density $p^\psi_{u|z}$ and $\mathrm{d}P^\psi_{u|z}/\mathrm{d}\Upsilon(u) = p^\psi_{u|z}(u)/\upsilon(u)$.

While the per-sample loss (13a) involves the unknown density $\upsilon$, we can drop this without affecting the objective (8) if $\Upsilon$ has finite differential entropy $\mathbb{E}_{u \sim \Upsilon}\big[ -\log \upsilon(u) \big]$. This follows because

\[ J^{\mathrm{FVAE}}(\theta, \psi) = \mathbb{E}_{u \sim \Upsilon}\big[ L^{\mathrm{VAE}}(u; \theta, \psi) \big] + \mathbb{E}_{u \sim \Upsilon}\big[ \log \upsilon(u) \big] + C, \tag{14a} \]
\[ L^{\mathrm{VAE}}(u; \theta, \psi) = \mathbb{E}_{z \sim Q^\theta_{z|u}} \Big[ (2\beta)^{-1} \big\| g(z; \psi) - u \big\|_2^2 \Big] + D_{\mathrm{KL}}\big( Q^\theta_{z|u} \,\big\|\, P_z \big). \tag{14b} \]

Thus $J^{\mathrm{VAE}}(\theta, \psi) = \mathbb{E}_{u \sim \Upsilon}\big[ L^{\mathrm{VAE}}(u; \theta, \psi) \big]$ is equivalent, up to a finite constant, to $J^{\mathrm{FVAE}}(\theta, \psi)$, and $J^{\mathrm{VAE}}$ is tractable. Requiring that $\Upsilon$ has finite differential entropy is a mild condition, and one can expect this to be the case for the vast majority of distributions arising in the finite-dimensional setting.

Remark 11 (VAEs as regularised autoencoders) We can write the divergence in (14b) in closed form as

\[ D_{\mathrm{KL}}\big( Q^\theta_{z|u} \,\big\|\, P_z \big) = \tfrac12 \Big( \big\| f(u; \theta) \big\|_2^2 + \operatorname{tr} \Sigma(u; \theta) - \log \det \Sigma(u; \theta) - d_Z \Big). \]

Applying the reparametrisation trick (Kingma and Welling, 2014) to write the expectation over $z \sim Q^\theta_{z|u}$ in terms of $\xi \sim N(0, I_Z)$, the interpretation of a VAE as a regularised autoencoder is clear:

\[ L^{\mathrm{VAE}}(u; \theta, \psi) = \mathbb{E}_{\xi \sim N(0, I_Z)} \Big[ (2\beta)^{-1} \big\| g\big( f(u; \theta) + \Sigma(u; \theta)^{\frac12} \xi;\; \psi \big) - u \big\|_2^2 \Big] + \tfrac12 \Big( \big\| f(u; \theta) \big\|_2^2 + \operatorname{tr} \Sigma(u; \theta) - \log \det \Sigma(u; \theta) - d_Z \Big). \]

Remark 12 (The evidence lower bound) The usual derivation of VAEs specifies the decoder model (12) and performs variational inference on the posterior for $z \mid u$, seeking to maximise a lower bound, the ELBO, on the likelihood of the data. Denoting by $p^\psi_u$ the density of the generative model $P^\psi_u$, by $q^\theta_{z|u}$ the density of $Q^\theta_{z|u}$, and so forth, the log-likelihood of $u \in U$ is

\[ \log p^\psi_u(u) = D_{\mathrm{KL}}\big( q^\theta_{z|u} \,\big\|\, p^\psi_{z|u} \big) + \mathrm{ELBO}(u; \theta, \psi), \tag{15a} \]
\[ \mathrm{ELBO}(u; \theta, \psi) = \mathbb{E}_{z \sim q^\theta_{z|u}} \big[ \log p^\psi_{u|z}(u) \big] - D_{\mathrm{KL}}\big( q^\theta_{z|u} \,\big\|\, p_z \big). \tag{15b} \]

This demonstrates that $\mathrm{ELBO}(u; \theta, \psi)$ is indeed a lower bound on the log-data likelihood under the decoder model. Minimising our per-sample loss (13a) is equivalent to maximising the ELBO (15b), but, notably, our underlying objective is shown to correspond exactly to the underlying KL divergence (5) and avoids the use of any bounds on the data likelihood.

Remark 13 An interpretation of the VAE objective as minimisation of (5) is also adopted by Kingma (2017, Sec. 2.8) and Kingma and Welling (2019). Our approach differs by writing the joint encoder model (3a) in terms of $\Upsilon$ rather than the empirical distribution

\[ \text{(empirical data distribution)} \quad \Upsilon_N = \frac{1}{N} \sum_{n=1}^N \delta_{u^{(n)}}. \]

Since $\Upsilon_N$ is not absolutely continuous with respect to the generative model $P^\psi_u$ of (10a)-(10c), using this would result in the joint divergence (5) being infinite for all $\theta$ and $\psi$.

2.3 Objective in Infinite Dimensions

We return to the setting of Assumption 1 and adopt a generalisation of the Gaussian model (10a)-(10c), with distributional parameters given by learnable maps $\mathsf{f} = (f, \Sigma) \colon U \times \Theta \to Z \times S_+(Z)$ and $g \colon Z \times \Psi \to U$:

\[ Q^\theta_{z|u} = N\big( f(u; \theta), \Sigma(u; \theta) \big) = f(u; \theta) + \Sigma(u; \theta)^{\frac12} N(0, I_Z), \tag{16a} \]
\[ P^\psi_{u|z} = P_\eta\big( \cdot - g(z; \psi) \big) = g(z; \psi) + P_\eta, \tag{16b} \]
\[ P_\eta \in \mathcal{P}(U), \tag{16c} \]
\[ P_z = N(0, I_Z). \tag{16d} \]

The only change from the finite-dimensional model is in the decoder distribution (16b), which we now write as the shift of the decoder-noise distribution $P_\eta$ by the mean $g(z; \psi)$. As we will see shortly, the choice of decoder noise will be very important in infinite dimensions, and restricting attention solely to Gaussian white noise will no longer be feasible.
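Before examining the obstacles that arise in infinite dimensions, it may help to record the finite-dimensional training loss which the model (16a)-(16d) generalises. The following is a minimal NumPy sketch of (14b), combining a Monte Carlo reconstruction term (via the reparametrisation trick) with the closed-form Gaussian KL of Remark 11; the maps `f_mean`, `f_logvar`, and `g` are hypothetical stand-ins for the learnable encoder and decoder, not the paper's implementation.

```python
# Sketch of the finite-dimensional per-sample loss (14b) with a diagonal
# encoder covariance, parametrised by its log-diagonal.
import numpy as np

rng = np.random.default_rng(0)
beta = 0.1

def loss_vae(u, f_mean, f_logvar, g, n_mc=8):
    mu, logvar = f_mean(u), f_logvar(u)            # parameters of Q^theta_{z|u}
    # Monte Carlo estimate of E_{z ~ Q}[ (2 beta)^{-1} ||g(z) - u||^2 ]
    recon = 0.0
    for _ in range(n_mc):
        z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
        recon += np.sum((g(z) - u) ** 2) / (2.0 * beta)
    recon /= n_mc
    # D_KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form (Remark 11)
    kl = 0.5 * np.sum(mu ** 2 + np.exp(logvar) - logvar - 1.0)
    return recon + kl

# Toy usage with linear maps (hypothetical, for illustration only)
d_u, d_z = 8, 2
u = rng.standard_normal(d_u)
print(loss_vae(u,
               f_mean=lambda u: u[:d_z],
               f_logvar=lambda u: -4.0 * np.ones(d_z),
               g=lambda z: np.tile(z, d_u // d_z)))
```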
Obstacles in Infinite Dimensions. Many new issues arise in infinite dimensions, necessitating a more careful treatment of the FVAE objective; for example, we can no longer work with probability density functions, since there is no uniform measure analogous to Lebesgue measure (Sudakov, 1959). The fundamental obstacle, however, is that, unlike in finite dimensions, the joint divergence (5) is often ill defined, satisfying $D_{\mathrm{KL}}\big( Q^\theta_{z,u} \,\big\|\, P^\psi_{z,u} \big) = \infty$ for all $\theta \in \Theta$ and $\psi \in \Psi$. An important situation in which this arises is misspecification of the generative model $P^\psi_u$; consequently, great care is needed in the choice of decoder to avoid this issue.

To illustrate this we show that the extension of the white-noise model of (10a)-(10c) is ill defined in infinite dimensions: the resulting joint divergence (5) is always infinite. To do this we define white noise using a Karhunen–Loève expansion (Sullivan, 2015, Sec. 11.1).

Definition 14 Let $U = L^2([0,1])$ with orthonormal basis $e_j(x) = \sqrt{2} \sin(\pi j x)$, $j \in \mathbb{N}$. We say that the random variable $\eta$ is $L^2$-white noise if it has Karhunen–Loève expansion

\[ \text{($L^2$-white noise)} \quad \eta = \sum_{j \in \mathbb{N}} \xi_j e_j, \qquad \xi_j \overset{\text{i.i.d.}}{\sim} N(0, 1). \]

Proposition 15 Let $P_\eta$ be the distribution of the $L^2$-white noise $\eta$. Then $P_\eta$ is a probability distribution supported on the Sobolev space $H^s([0,1])$ if and only if $s < -1/2$; in particular $P_\eta$ assigns probability zero to $L^2([0,1])$. Moreover, these statements remain true for any shift $P_\eta(\cdot - h)$ of the distribution $P_\eta$ by $h \in L^2([0,1])$.

The proof, which makes use of the Borel–Cantelli lemma and the Kolmogorov two-series theorem, is stated in Appendix A.

Example 1 Let $U = L^2([0,1])$, fix a data distribution $\Upsilon \in \mathcal{P}(U)$ and take the model (16a)-(16d), where we further assume that $g(z; \psi) \in L^2([0,1])$ for all $z \in Z$ and $\psi \in \Psi$, and with $P_\eta$ taken to be the distribution of $L^2$-white noise. Under this model, the joint divergence (5) is infinite for all $\theta$ and $\psi$. To see this, note that $\Upsilon$ assigns probability one to $U$, but, as $P^\psi_{u|z}$ assigns zero probability to $U$ by Proposition 15, we have

\[ P^\psi_u(U) = \int_Z P^\psi_{u|z}(U)\, P_z(\mathrm{d}z) = 0. \]

Thus $\Upsilon$ is not absolutely continuous with respect to $P^\psi_u$, and so $D_{\mathrm{KL}}\big( Q^\theta_{z,u} \,\big\|\, P^\psi_{z,u} \big) = \infty$.

The VANO model (Seidman et al., 2023), which we discuss in detail in the related work (Section 5), adopts the setting of (16a)-(16d) with white decoder noise. It thus suffers from not being well defined. The issues in Example 1 stem from the difference in regularity between the data, which lie in $L^2([0,1])$, and draws from the generative model, which lie in $H^s([0,1])$, $s < -1/2$, and not in $L^2([0,1])$. A difference in regularity is not the only possible issue: even two Gaussians supported on the same space need not be absolutely continuous, owing to the Feldman–Hájek theorem (Bogachev, 1998, Ex. 2.7.4). But, as in Proposition 8, we can state a sufficient condition for the divergence (5) to have finite infimum.

Proposition 16 Suppose that $\mathsf{f}(u; \theta^\star) = (0, I_Z)$ and $g(z; \psi^\star) = 0$ for some $\theta^\star \in \Theta$ and $\psi^\star \in \Psi$, and that $D_{\mathrm{KL}}(\Upsilon \| P_\eta) < \infty$. Then (5) has finite infimum.

Well-Defined Objective for Specific Problem Classes. The issues we have seen suggest that one must choose the decoder model based on the structure of the data distribution: fixing a decoder model a priori typically results, in infinite dimensions, in the joint divergence being infinite. However, we now give examples to show that, in important classes of problems arising in science and engineering, there is a clear choice of decoder noise $P_\eta$ arising from the problem structure.
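Proposition 15 can also be checked numerically: the expected squared $L^2$-norm of a $J$-term truncation of the Karhunen–Loève expansion equals $J$, so white-noise samples leave $L^2([0,1])$ in the limit $J \to \infty$. A short NumPy sketch, with grid resolution chosen for illustration:

```python
# Truncations eta_J = sum_{j<=J} xi_j e_j with e_j(x) = sqrt(2) sin(pi j x)
# have E ||eta_J||_{L^2}^2 = J, so the L^2-norm diverges as J grows.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 2049)

for J in [16, 64, 256, 1024]:
    xi = rng.standard_normal(J)
    basis = np.sqrt(2.0) * np.sin(np.pi * np.outer(np.arange(1, J + 1), x))
    eta_J = xi @ basis              # truncated Karhunen-Loeve sum on the grid
    l2_sq = np.mean(eta_J ** 2)     # quadrature approximation of ||eta_J||_{L^2}^2
    print(f"J = {J:5d}   ||eta_J||^2 ~ {l2_sq:8.1f}   (expected value {J})")
```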
Adopting this decoder noise and taking the reference distribution $\Lambda = P_\eta$ will ensure that the hypotheses of Theorem 6 are satisfied: the joint divergence can be shown to have finite infimum, and Assumption 5 holds. Thus we can apply the actionable algorithms derived in Section 2.1.

2.3.1 SDE Path Distributions

One class of data to which we can apply FVAE arises in the study of random dynamical systems (E et al., 2004). We choose $\Upsilon$ to be the distribution over paths of the diffusion process defined, for a standard Brownian motion $(w_t)_{t \in [0,T]}$ on $\mathbb{R}^m$ and $\varepsilon > 0$, by

\[ \mathrm{d}u_t = b(u_t)\, \mathrm{d}t + \sqrt{\varepsilon}\, \mathrm{d}w_t, \qquad u_0 = 0, \qquad t \in [0, T]. \tag{17} \]

We assume that the drift $b \colon \mathbb{R}^m \to \mathbb{R}^m$ is regular enough that (17) is well defined. The theory we outline also applies to systems with anisotropic diffusion and time-dependent coefficients (Särkkä and Solin, 2019, Sec. 7.3), and to systems with nonzero initial condition, but we focus on the setting (17) for simplicity. The path distribution $\Upsilon$ is defined on the space $U = C_0([0,T], \mathbb{R}^m)$ of continuous functions $u \colon [0,T] \to \mathbb{R}^m$ with $u(0) = 0$.

Recall (16b), where we define the decoder distribution $P^\psi_{u|z}$ as the shift of the noise distribution $P_\eta$ by $g(z; \psi)$, and recall Assumption 5, which in particular demands that $D_{\mathrm{KL}}(\Upsilon \| P_\eta) < \infty$ and that $P^\psi_{u|z}$ is mutually absolutely continuous with $P_\eta$. Here we choose $P_\eta$ to be the law of an auxiliary diffusion process, which in our examples will be an Ornstein–Uhlenbeck (OU) process. This auxiliary process must have zero initial condition and must have the same noise structure as that in (17) to ensure that $D_{\mathrm{KL}}(\Upsilon \| P_\eta) < \infty$. We will learn the shift $g$, and this will have to satisfy the zero initial condition to ensure that $P^\psi_{u|z}$ is mutually absolutely continuous with $P_\eta$.

Decoder-Noise Distribution $P_\eta$. We take $P_\eta$ to be the law of the auxiliary SDE

\[ \mathrm{d}\eta_t = c(\eta_t)\, \mathrm{d}t + \sqrt{\varepsilon}\, \mathrm{d}w_t, \qquad \eta_0 = 0, \qquad t \in [0, T], \tag{18} \]

with drift $c \colon \mathbb{R}^m \to \mathbb{R}^m$, and with $\varepsilon > 0$ being the same in both (17) and (18). To determine whether $D_{\mathrm{KL}}(\Upsilon \| P_\eta) < \infty$, we will need to evaluate the term $\mathrm{d}\Upsilon/\mathrm{d}P_\eta$ in the KL divergence; an expression for this is given by the Girsanov theorem (Liptser and Shiryaev, 2001, Chap. 7).

Proposition 17 (Girsanov theorem) Suppose that $U = C_0([0,T], \mathbb{R}^m)$ and that $\mu \in \mathcal{P}(U)$ and $\nu \in \mathcal{P}(U)$ are the laws of the $\mathbb{R}^m$-valued diffusions

\[ \mathrm{d}u_t = p(u_t)\, \mathrm{d}t + \sqrt{\varepsilon}\, \mathrm{d}w_t, \qquad u_0 = 0, \qquad t \in [0, T], \]
\[ \mathrm{d}v_t = q(v_t)\, \mathrm{d}t + \sqrt{\varepsilon}\, \mathrm{d}w_t, \qquad v_0 = 0, \qquad t \in [0, T]. \]

Suppose that the Novikov condition (Øksendal, 2003, eq. (8.6.8)) holds for both processes:

\[ \mathbb{E}_{u \sim \mu} \left[ \exp\left( \frac{1}{2\varepsilon} \int_0^T \| p(u_t) \|_2^2\, \mathrm{d}t \right) \right] < \infty \quad \text{and} \quad \mathbb{E}_{v \sim \nu} \left[ \exp\left( \frac{1}{2\varepsilon} \int_0^T \| q(v_t) \|_2^2\, \mathrm{d}t \right) \right] < \infty. \tag{19} \]

Then $\mu$ and $\nu$ are mutually absolutely continuous, with

\[ \frac{\mathrm{d}\mu}{\mathrm{d}\nu}(u) = \exp\left( \frac{1}{2\varepsilon} \int_0^T \Big( \| q(u_t) \|_2^2 - \| p(u_t) \|_2^2 \Big)\, \mathrm{d}t - \frac{1}{\varepsilon} \int_0^T \big\langle q(u_t) - p(u_t), \mathrm{d}u_t \big\rangle \right). \tag{20} \]

The second integral in (20) is a stochastic integral with respect to $(u_t)_{t \in [0,T]}$ (Särkkä and Solin, 2019, Chap. 4). The Novikov condition suffices for our needs, but the theorem also holds under weaker conditions such as the Kazamaki condition (Liptser and Shiryaev, 2001, p. 249). Applying Proposition 17 to evaluate the KL divergence (Appendix A) yields

\[ D_{\mathrm{KL}}\big( \Upsilon \,\big\|\, P_\eta \big) = \mathbb{E}_{u \sim \Upsilon} \left[ \frac{1}{2\varepsilon} \int_0^T \big\| b(u_t) - c(u_t) \big\|_2^2\, \mathrm{d}t \right]. \]

Thus the condition $D_{\mathrm{KL}}(\Upsilon \| P_\eta) < \infty$ is satisfied quite broadly, e.g., if $b$ and $c$ are bounded.

Per-Sample Loss. With the condition $D_{\mathrm{KL}}(\Upsilon \| P_\eta) < \infty$ verified, it remains to choose $g$ to ensure that $P^\psi_{u|z}$ is mutually absolutely continuous with $P_\eta$, and to derive the corresponding density. Once this is done, we arrive at the actionable per-sample loss derived in Theorem 6. To do this we again apply the Girsanov theorem, using that, when $g(z; \psi) \in H^1([0,T])$ takes value zero at $t = 0$, the distribution $P^\psi_{u|z}$ is the law of the SDE

\[ \mathrm{d}v_t = g(z; \psi)'(t)\, \mathrm{d}t + c\big( v_t - g(z; \psi)(t) \big)\, \mathrm{d}t + \sqrt{\varepsilon}\, \mathrm{d}w_t, \qquad v_0 = 0, \qquad t \in [0, T]. \tag{21} \]
As we will parametrise $g(z; \psi)$ using a neural network, we can assume it to be in $C^1([0,T])$, and hence in $H^1([0,T])$; moreover we shall enforce that $g(z; \psi)(0) \approx 0$ through the loss, as described shortly. To be concrete, we restrict attention to decoder noise with $c(x) = -\kappa x$, making $(\eta_t)_{t \in [0,T]}$ an OU process if $\kappa > 0$ and Brownian motion if $\kappa = 0$. In this case the per-sample loss can be derived by applying the Girsanov theorem to (18) and (21).

Proposition 18 (SDE per-sample loss) Suppose that $c(x) = -\kappa x$, $\kappa \geq 0$, and that $g(z; \psi) \in H^1([0,T])$ with $g(z; \psi)(0) = 0$ for all $z \in Z$ and $\psi \in \Psi$. Then Assumption 5 holds and

\[ L(u; \theta, \psi) = \mathbb{E}_{z \sim Q^\theta_{z|u}} \left[ -\log \frac{\mathrm{d}P^\psi_{u|z}}{\mathrm{d}P_\eta}(u) \right] + D_{\mathrm{KL}}\big( Q^\theta_{z|u} \,\big\|\, P_z \big), \]

\[ \log \frac{\mathrm{d}P^\psi_{u|z}}{\mathrm{d}P_\eta}(u) = \frac{1}{\varepsilon} \int_0^T \big\langle g(z; \psi)'(t) + \kappa g(z; \psi)(t), \mathrm{d}u_t \big\rangle - \frac{1}{2\varepsilon} \int_0^T \Big( \big\| g(z; \psi)'(t) - \kappa\big( u(t) - g(z; \psi)(t) \big) \big\|_2^2 - \big\| \kappa u(t) \big\|_2^2 \Big)\, \mathrm{d}t. \]

In practice, we make two modifications to $L$. First, the initial condition $g(z; \psi)(0) = 0$ is not enforced exactly; instead, we add a Tikhonov-like zero-penalty term with regularisation parameter $\lambda > 0$ to favour $g(z; \psi)(0) \approx 0$. Second, to allow variation of the strength of the KL regularisation, we multiply the term $D_{\mathrm{KL}}(Q^\theta_{z|u} \| P_z)$ by a regularisation parameter $\beta > 0$. Setting $\beta \neq 1$ breaks the exact correspondence between the FVAE objective and the joint KL divergence (5), but can nevertheless be useful in computational practice (Higgins et al., 2017). This leads us to the SDE per-sample loss

\[ L^{\mathrm{SDE}}_{\lambda,\beta}(u; \theta, \psi) = \mathbb{E}_{z \sim Q^\theta_{z|u}} \left[ -\log \frac{\mathrm{d}P^\psi_{u|z}}{\mathrm{d}P_\eta}(u) + \lambda \big\| g(z; \psi)(0) \big\|_2^2 \right] + \beta\, D_{\mathrm{KL}}\big( Q^\theta_{z|u} \,\big\|\, P_z \big). \]
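Anticipating the discretisation of Section 2.4.3, the Girsanov log-density above can be evaluated on a time grid with a left-endpoint sum for the stochastic integral and a Riemann sum for the time integral. A minimal NumPy sketch for the one-dimensional case; `g_vals` and `dg_vals` (the decoded mean and its time derivative, e.g., obtained by automatic differentiation) are assumed given:

```python
# Sketch of the discretised log dP^psi_{u|z}/dP_eta for OU decoder noise
# c(x) = -kappa x (Proposition 18), on a grid t_0 < t_1 < ... < t_I.
import numpy as np

def log_density_sde(u, g_vals, dg_vals, t, eps, kappa):
    """Discretised log-density; all arrays are evaluated on the grid t."""
    dt = np.diff(t)
    du = np.diff(u)
    drift = dg_vals + kappa * g_vals                      # g'(t) + kappa g(t)
    stoch = np.sum(drift[:-1] * du) / eps                 # left-endpoint sum
    integrand = (dg_vals - kappa * (u - g_vals)) ** 2 - (kappa * u) ** 2
    lebesgue = np.sum(integrand[:-1] * dt) / (2.0 * eps)
    return stoch - lebesgue

# Toy usage with a hypothetical decoded mean g(t) = t and data u(t) = t^2
t = np.linspace(0.0, 5.0, 513)
print(log_density_sde(u=t**2, g_vals=t, dg_vals=np.ones_like(t), t=t,
                      eps=1.0, kappa=25.0))
```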
2.3.2 Posterior Distributions in Bayesian Inverse Problems

Our theory can also be applied to posterior distributions arising in Bayesian inverse problems (Stuart, 2010), which we illustrate through the following additive-noise inverse problem. Let $U$ be a separable Hilbert space with norm $\|\cdot\|_U$, let $Y = \mathbb{R}^{d_Y}$, and let $G \colon U \to Y$ be a (possibly nonlinear) observation operator. Suppose that $y \in Y$ is given by the model

\[ y = G(u) + \xi, \qquad u \sim \mu_0 \in \mathcal{P}(U), \qquad \xi \sim N(0, \Sigma) \in \mathcal{P}(Y), \tag{22} \]

with noise covariance $\Sigma \in S_+(Y)$, and with prior distribution $\mu_0 = N(0, \mathcal{C})$ having covariance operator $\mathcal{C} \colon U \to U$. Models of this type arise, for example, in both Eulerian and Lagrangian data assimilation problems in oceanography (Cotter et al., 2010).

Given an observation $y \in Y$ from (22), the Bayesian approach seeks to infer $u \in U$ by computing the posterior distribution $\mu^y \in \mathcal{P}(U)$ representing the distribution of $u \mid y$. In the setting of (22), $\mu^y$ has a density with respect to $\mu_0$ thanks to Bayes' rule (Dashti and Stuart, 2017, Theorem 14), taking the form

\[ \frac{\mathrm{d}\mu^y}{\mathrm{d}\mu_0}(u) = \frac{1}{Z(y)} \exp\big( -\Phi(u; y) \big), \qquad \Phi(u; y) = \tfrac12 \big\| G(u) - y \big\|_\Sigma^2, \qquad \|\cdot\|_\Sigma = \big\| \Sigma^{-1/2}\, \cdot\, \big\|_2, \]

where $Z(y) \in (0, 1]$ owing to the nonnegativity of $\Phi$. A simple calculation then reveals

\[ D_{\mathrm{KL}}\big( \mu^y \,\big\|\, \mu_0 \big) = \mathbb{E}_{u \sim \mu^y} \left[ \log \frac{\mathrm{d}\mu^y}{\mathrm{d}\mu_0}(u) \right] = \mathbb{E}_{u \sim \mu^y} \big[ -\log Z(y) - \Phi(u; y) \big] \leq -\log Z(y) < \infty. \tag{23} \]

Similar arguments apply quite generally for observation models other than (22), provided the resulting log-density $\log \mathrm{d}\mu^y/\mathrm{d}\mu_0$ satisfies suitable boundedness or integrability conditions.

We now assume that the data distribution $\Upsilon$ to be learned is the posterior $\mu^y$, and that we have samples from it. This setting could arise, for example, when attempting to generate further approximate samples from the posterior $\mu^y$, taking as data the output of a function-space MCMC method (Cotter et al., 2013), with the ambition of faster sampling under FVAE than under MCMC. Recall the definition of the decoder distribution $P^\psi_{u|z}$ in (16b). We take $P_\eta$ to be the prior $\mu_0$; this is a natural choice as (23) shows that $D_{\mathrm{KL}}(\Upsilon \| P_\eta) < \infty$. We next discuss the choice of shift $g$, and the per-sample loss that results from these choices.

Per-Sample Loss. Since $P_\eta$ and $P^\psi_{u|z}$ are Gaussian, we can use the Cameron–Martin theorem (Bogachev, 1998, Corollary 2.4.3) to derive conditions for their mutual absolute continuity. For this to be the case, the shift $g(z; \psi)$ must lie in the Cameron–Martin space $H(P_\eta) \subseteq U$. Before stating the theorem, we recall the following facts about Gaussian measures. The space $H(P_\eta)$ is Hilbert, and for fixed $h \in H(P_\eta)$, the $H(P_\eta)$-inner product $\langle h, \cdot \rangle_{H(P_\eta)}$ extends uniquely (up to equivalence $P_\eta$-almost everywhere) to a measurable linear functional (Bogachev, 1998, Theorem 2.10.11), denoted by $U \ni u \mapsto \langle h, u \rangle_{H(P_\eta)}$.

Proposition 19 (Cameron–Martin theorem) Let $P_\eta \in \mathcal{P}(U)$ be a Gaussian measure with Cameron–Martin space $H(P_\eta)$. Then $P^\psi_{u|z} = P_\eta(\cdot - g(z; \psi))$ is mutually absolutely continuous with $P_\eta$ if and only if $g(z; \psi) \in H(P_\eta)$, and

\[ \frac{\mathrm{d}P^\psi_{u|z}}{\mathrm{d}P_\eta}(u) = \exp\left( \big\langle g(z; \psi), u \big\rangle_{H(P_\eta)} - \tfrac12 \big\| g(z; \psi) \big\|_{H(P_\eta)}^2 \right). \tag{24} \]

Remark 20 The exponent in (24) should be viewed as the misfit $\frac12 \| g(z; \psi) - u \|_{H(P_\eta)}^2$ with the almost-surely-infinite term $\frac12 \| u \|_{H(P_\eta)}^2$ subtracted (Stuart, 2010, Remark 3.8). When $P_\eta$ is Brownian motion on $\mathbb{R}$, for example, $\langle g(z; \psi), u \rangle_{H(P_\eta)}$ is a stochastic integral and $H(P_\eta) = H^1([0,T])$; this is implicit in the calculations underlying Proposition 18. When $P_\eta$ is $L^2$-white noise, $H(P_\eta) = L^2([0,1])$.

When $g$ takes values in $H(P_\eta)$, we can use the Cameron–Martin theorem to write down the per-sample loss explicitly, since $H(P_\eta) = \mathcal{C}^{1/2} U$, with $\| h \|_{H(P_\eta)} = \| \mathcal{C}^{-1/2} h \|_U$ and $\langle h, u \rangle_{H(P_\eta)} = \langle \mathcal{C}^{-1/2} h, \mathcal{C}^{-1/2} u \rangle_U$ for $h \in H(P_\eta)$ and $u \in U$.

Proposition 21 (Bayesian inverse problem per-sample loss) Suppose that $g(z; \psi) \in H(P_\eta)$ for all $z \in Z$ and $\psi \in \Psi$. Then Assumption 5 holds and

\[ L^{\mathrm{BIP}}(u; \theta, \psi) = \mathbb{E}_{z \sim Q^\theta_{z|u}} \Big[ \tfrac12 \big\| \mathcal{C}^{-1/2} g(z; \psi) \big\|_U^2 - \big\langle \mathcal{C}^{-1/2} g(z; \psi), \mathcal{C}^{-1/2} u \big\rangle_U \Big] + D_{\mathrm{KL}}\big( Q^\theta_{z|u} \,\big\|\, P_z \big). \]

Depending on the choice of $P_\eta$, the condition $g(z; \psi) \in H(P_\eta)$ may follow immediately, e.g., when $P_\eta$ is Brownian motion and $g$ is parametrised by a neural network.

2.4 Architecture and Algorithms

In practice, we do not have access to training data $\{u^{(n)}\}_{n=1}^N \subset U$; instead we have access to finite-dimensional discretisations $\mathsf{u}^{(n)}$. We would like to evaluate the encoder and decoder, and to compute the empirical objective (9) for training, using only these discretisations. In our architectures we will assume that $U$ is a Banach space of functions, evaluable pointwise almost everywhere, with domain $\Omega \subseteq \mathbb{R}^d$ and range $\mathbb{R}^m$; in this setting we will assume that the discretisation $\mathsf{u}$ of a function $u \in U$ consists of evaluations at $\{x_i\}_{i=1}^I \subset \Omega$:

\[ \text{(discretisation of function $u \in U$)} \quad \mathsf{u} = \big\{ \big( x_i, u(x_i) \big) \big\}_{i=1}^I. \]

Crucially, the number and location of mesh points may differ for each discretised sample: our aim is to allow for FVAE to be trained and evaluated across different resolutions, with data provided on sparse and potentially irregular meshes. We therefore discuss how to discretise the loss, and propose encoder/decoder architectures that can be evaluated on any mesh.

2.4.1 Encoder Architecture

The encoder $\mathsf{f}$ is a map from a function $u \colon \Omega \to \mathbb{R}^m$ to the parameters of the encoder distribution (16a): the mean $f(u; \theta) \in Z$ and covariance matrix $\Sigma(u; \theta) \in S_+(Z)$. We assume $\Sigma(u; \theta)$ is diagonal, so $\mathsf{f}$ need only return two vectors: the mean $f(u; \theta)$ and the log-diagonal of $\Sigma(u; \theta)$.
We thus define

\[ \mathsf{f}(u; \theta) = \rho\left( \int_\Omega \kappa\big( x, u(x); \theta \big)\, \mathrm{d}x;\; \theta \right) \in Z \times Z = \mathbb{R}^{2 d_Z}, \tag{25} \]

where $\kappa \colon \Omega \times \mathbb{R}^m \times \Theta \to \mathbb{R}^\ell$ is parametrised as a neural network with two hidden layers of width 64 and output dimension $\ell = 64$, using GELU activation (Hendrycks and Gimpel, 2016), and $\rho \colon \mathbb{R}^\ell \times \Theta \to \mathbb{R}^{2 d_Z}$ is parametrised as a linear layer $\rho(v; \theta) = W^\theta v + b^\theta$, with $W^\theta \in \mathbb{R}^{2 d_Z \times \ell}$ and $b^\theta \in \mathbb{R}^{2 d_Z}$. We augment $x \in \Omega$ with 16 random Fourier features (Appendix B.1) to aid learning of high-frequency features (Tancik et al., 2020).

After discretisation on data $\mathsf{u} = \{(x_i, u(x_i))\}_{i=1}^I$, in which we approximate the integral over $\Omega$ by a normalised sum, our architecture resembles set-to-vector maps such as deep sets (Zaheer et al., 2017), PointNet (Qi et al., 2017), and statistic networks (Edwards and Storkey, 2017), which take the form

\[ \big\{ \big( x_i, u(x_i) \big) \;\big|\; i = 1, 2, \ldots, I \big\} \mapsto \rho\Big( \mathrm{pool}\big( \big\{ \kappa\big( x_i, u(x_i); \theta \big) \;\big|\; i = 1, 2, \ldots, I \big\} \big);\; \theta \Big), \]

where pool is a pooling operation invariant to the order of its inputs, in our case the mean. Unlike these works we design our architecture for functions and only then discretise; we believe there is great potential to extend other point-cloud and set architectures similarly. Many other function-to-vector architectures have been proposed, e.g., the variable-input DeepONet (VIDON; Prasthofer et al., 2022), the mesh-independent neural operator (MINO; Lee, 2022) and continuum attention (Calvello et al., 2024), and our proposal is most similar to the linear-functional layer of Fourier neural mappings (Huang et al., 2025) and the neural functional of Rahman et al. (2022). These differ from our approach by preceding (25) with a neural operator; on the problems we consider, we find our encoder map to be equally expressive.

2.4.2 Decoder Architecture

The decoder $g$ is a map from a latent vector $z \in Z$ to a function $g(z; \psi) \colon \Omega \to \mathbb{R}^m$, which we parametrise using a coordinate neural network $\gamma \colon Z \times \Omega \times \Psi \to \mathbb{R}^m$ with 5 hidden layers of width 100, using GELU activation throughout, so that

\[ g(z; \psi)(x) = \gamma(z, x; \psi). \tag{26} \]

As before, we augment $x \in \Omega$ with 16 random Fourier features (Appendix B.1). Our proposed architecture allows for discretisation of the decoded function $g(z; \psi)$ on any mesh, and the cost of evaluating the decoder (26) grows linearly with the number of mesh points.

There are several related approaches in the literature to parametrise vector-to-function maps. Huang et al. (2025) lift the input by multiplying with a learnable constant function, then apply an operator architecture such as FNO. Seidman et al. (2023) propose both a DeepONet-inspired decoder using a linear combination of learnable basis functions, and a nonlinear decoder essentially the same as what we propose, which is also similar to the architectures of the nonlinear manifold decoder (NOMAD; Seidman et al., 2022) and PARANet (de Hoop et al., 2022). Also related are implicit neural representations (Sitzmann et al., 2020), in which one regresses on a fixed image using a coordinate neural network and treats the resulting weights as a resolution-independent representation of the data.
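The following NumPy sketch shows the discretised encoder (25) and decoder (26) side by side: the integral in (25) becomes a normalised sum (mean pooling) over an arbitrary mesh, and the decoder is a coordinate network evaluable at any query points. Layer sizes and initialisation are illustrative, and the random Fourier features and the paper's exact widths are omitted for brevity.

```python
# Mesh-invariant encoder (25) and coordinate-network decoder (26), discretised.
import numpy as np

rng = np.random.default_rng(0)
d_z, width = 2, 64

def mlp(params, v):
    """Tiny MLP; params is a list of (W, b) pairs, GELU on all but the last layer."""
    for W, b in params[:-1]:
        v = v @ W + b
        v = 0.5 * v * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (v + 0.044715 * v**3)))
    W, b = params[-1]
    return v @ W + b

def init(sizes):
    return [(0.1 * rng.standard_normal((m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

kappa = init([1 + 1, width, width])    # kappa(x, u(x)): (x, u(x)) -> R^width
rho = init([width, 2 * d_z])           # linear read-out to mean and log-variance
gamma = init([d_z + 1, width, 1])      # gamma(z, x): coordinate-network decoder

def encode(xs, us):
    """f(u) ~ rho( mean_i kappa(x_i, u(x_i)) ): works on any mesh {x_i}."""
    feats = mlp(kappa, np.stack([xs, us], axis=-1))   # (I, width)
    out = mlp(rho, feats.mean(axis=0))                # normalised sum = mean pooling
    return out[:d_z], out[d_z:]                       # mean, log-variance diagonal

def decode(z, xs):
    """g(z)(x) = gamma(z, x): evaluable at any query points xs."""
    zx = np.concatenate([np.tile(z, (len(xs), 1)), xs[:, None]], axis=-1)
    return mlp(gamma, zx)[:, 0]

xs = np.sort(rng.uniform(size=37))                    # irregular mesh
us = np.sin(2 * np.pi * xs)
mean, logvar = encode(xs, us)
print(decode(mean, np.linspace(0, 1, 101)).shape)     # evaluate on a finer mesh
```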
2.4.3 Discretisation of Per-Sample Loss

To discretise the per-sample losses derived in Section 2.3, we make two approximations. First, we approximate the expectation over $z \sim Q^\theta_{z|u}$ by Monte Carlo sampling (Kingma and Welling, 2014), with the number of samples viewed as a hyperparameter. Second, we approximate the integrals, norms, and inner products arising in the loss, as we now outline.

Per-Sample Loss $L^{\mathrm{SDE}}_{\lambda,\beta}$. Since the terms appearing in $L^{\mathrm{SDE}}_{\lambda,\beta}$ are integral functionals of the data and decoded functions, we can discretise on any partition $0 = t_0 < t_1 < \cdots < t_I = T$ and work with data discretised at any time step. The deterministic integral can be approximated by a normalised sum, and the stochastic integral can be discretised as

\[ \int_0^T \big\langle g(z; \psi)'(t), \mathrm{d}u_t \big\rangle \approx \sum_{i=1}^I \Big\langle g(z; \psi)'(t_{i-1}),\; u(t_i) - u(t_{i-1}) \Big\rangle, \]

which converges in probability in the limit $I \to \infty$ to the true stochastic integral (Särkkä and Solin, 2019, eq. (4.6)). Since the decoder will be a differentiable neural network, terms involving the derivative $g(z; \psi)'(t)$ can be computed using automatic differentiation; we find this to be much more stable than using a finite-difference approximation of the derivative.

Per-Sample Loss $L^{\mathrm{BIP}}$. For many Bayesian inverse problems, $U$ is a function space such as $L^2(\Omega)$, and so again the norms and inner products are integral functionals amenable to discretisation on any mesh. However, applying the operator $\mathcal{C}^{-1/2}$ is typically tractable only in special cases. One widely used setting is the one in which the eigenbasis of $\mathcal{C}$ is known and basis coefficients are readily computable; this arises when $\mathcal{C}$ is an inverse power of the Laplacian on a rectangle and a fast Fourier transform may be used.

2.5 Numerical Experiments

We now apply FVAE to two examples where $\Upsilon$ is an SDE path distribution. Both examples serve as prototypes for more complex problems such as those arising in molecular dynamics. For all experiments, we adopt the architecture of Section 2.4. A summary of conclusions to be drawn from the numerical experiments with these examples is as follows:

(a) FVAE captures properties of individual paths as well as ensemble properties of the data set, with the learned latent variables being physically interpretable (Section 2.5.1);

(b) choosing decoder noise that accurately reflects the stochastic variability in the data is essential to obtain a high-quality generative model (Section 2.5.1);

(c) FVAE is robust to changes of mesh in the encoder and decoder, enabling training with heterogeneous data and generative modelling at any resolution (Section 2.5.2).

We emphasise that in these experiments FVAE does this purely from data, with no knowledge of the data-generating process other than in the choice of decoder noise.

2.5.1 Brownian Dynamics

The Brownian dynamics model (see Schlick, 2010, Chap. 14), also known as the Langevin model, is a stochastic approximation of deterministic Newtonian models for molecular dynamics. In this model, the configuration $u_t$ (in some configuration space $X \subseteq \mathbb{R}^m$) of a molecule is assumed to follow the gradient flow of a potential $U \colon X \to \mathbb{R}$ perturbed by additive thermal noise with temperature $\varepsilon > 0$. This leads to the Langevin SDE

\[ \mathrm{d}u_t = -\nabla U(u_t)\, \mathrm{d}t + \sqrt{\varepsilon}\, \mathrm{d}w_t, \qquad t \in [0, T], \tag{27} \]

where $(w_t)_{t \in [0,T]}$ is a Brownian motion on $\mathbb{R}^m$. As a prototype for the more sophisticated, high-dimensional potentials arising in molecular dynamics, such as the Lennard–Jones potential (Schlick, 2010), we take $X = \mathbb{R}$ and consider the asymmetric double-well potential

\[ U(x) = 3x^4 + 2x^3 - 6x^2 - 6x. \tag{28} \]

This has a local minimum at $x_1 = -1$ and a global minimum at $x_2 = +1$ (Figure 1(a)). We take $\Upsilon$ to be the corresponding path distribution, with temperature $\varepsilon = 1$, final time $T = 5$, and initial condition $u_0 = x_1$. (The preceding developments fixed $u_0 = 0$ but are readily adapted to any fixed initial condition.)
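As a concrete instance of the data-generating process, the following sketch simulates (27) with the double-well potential (28) by the Euler–Maruyama method; the step size matches the stated time step 5/512, while the path count is illustrative.

```python
# Euler-Maruyama simulation of the Langevin SDE (27) with potential (28).
import numpy as np

rng = np.random.default_rng(0)
eps, T, n_steps = 1.0, 5.0, 512
dt = T / n_steps

def grad_U(x):
    # U(x) = 3x^4 + 2x^3 - 6x^2 - 6x  =>  U'(x) = 12x^3 + 6x^2 - 12x - 6
    return 12 * x**3 + 6 * x**2 - 12 * x - 6

def simulate(n_paths, u0=-1.0):
    u = np.full(n_paths, u0)
    path = [u.copy()]
    for _ in range(n_steps):
        u = u - grad_U(u) * dt + np.sqrt(eps * dt) * rng.standard_normal(n_paths)
        path.append(u.copy())
    return np.stack(path, axis=1)     # shape (n_paths, n_steps + 1)

paths = simulate(8)
print((paths[:, -1] > 0).mean())      # most paths end near the global minimum x2 = +1
```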
The training data set consists of 8,192 paths with time step 5/512 in $[0, T]$, and in each path 50% of time steps are missing (see Appendix B.2).

Figure 1: (a) Realisations of the SDE (27) follow the gradient flow of the potential $U$. (b) Sample paths $(u_t)_{t \in [0,T]} \sim \Upsilon$ begin at $x_1 = -1$ and transition with high probability to the lower-potential state $x_2 = +1$ as a result of the additive thermal noise.

Sample paths drawn from $\Upsilon$ start at $x_1$ and transition with very high probability to the potential-minimising state $x_2$; the time at which the transition begins is determined by the thermal noise, but, once the transition has begun, the manner in which the transition occurs is largely consistent across realisations. Such universal transition phenomena occur quite generally in the study of random dynamical systems, as a consequence of large-deviation theory (see E et al., 2004).

We train FVAE using the SDE loss (Section 2.3.1) with regularisation parameter $\beta = 1.2$ and zero-penalty scale $\lambda = 10$. Motivated by the observation that trajectories are determined chiefly by the transition time, we use latent dimension $d_Z = 1$.

Choice of Noise Process. The choice of decoder noise greatly affects FVAE's performance as an autoencoder and a generative model. To investigate this, we first train three instances of FVAE with different restoring forces $\kappa$ in the decoder-noise process. Then, to evaluate autoencoding performance, we draw samples $u \sim \Upsilon$ from the held-out set and compute the reconstruction $g(f(u; \theta); \psi)$, which is the mean of the decoder distribution $P^\psi_{u|z}$ with $z = f(u; \theta)$ taken to be the mean of the encoder distribution $Q^\theta_{z|u}$ (Figure 2(a)). To evaluate FVAE as a generative model, we draw samples from the latent distribution $P_z$ and display the mean $g(z; \psi)$ of the decoder distribution $P^\psi_{u|z}$ along with a shaded region indicating one standard deviation of the noise process (Figure 2(b)); moreover we draw samples $g(z; \psi) + \eta$ to illustrate their qualitative behaviour (Figure 2(c)).

Figure 2: The SDE loss gives much freedom in the choice of noise process $(\eta_t)_t$. The top row uses Brownian motion as the decoder noise; the second and third rows use OU processes ($\kappa = 25$ and $\kappa = 10000$) with different asymptotic variances. (a) Data $u \sim \Upsilon$ (red) and reconstructions $g(f(u; \theta); \psi)$ (black). (b) Distribution of $g(z; \psi) + \eta$ with one standard deviation of noise. (c) Samples $g(z; \psi) + \eta$ from $P^\psi_u$, $z \sim P_z$, $\eta \sim P_\eta$. While all choices lead to high-quality reconstructions, only the OU process with $\kappa = 25$ gives a generative model that agrees well with the data.

Using Brownian motion as the decoder noise leads to excellent reconstructions, but samples from the generative model appear different from the training data. By using OU-distributed noise with restoring force $\kappa > 0$ we obtain similar reconstructions to those achieved under Brownian motion, but samples from the generative model match the data distribution more closely. This is because the variance of Brownian motion grows unboundedly with time, while the asymptotic variance under the OU process is $\varepsilon/2\kappa$, better reflecting the behaviour of the data for well-chosen $\kappa$.
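Discretisation bias aside, the variance claim can be verified with a few lines of simulation; the $\kappa$ values follow the experiment above, while the path count is illustrative.

```python
# Under d eta_t = -kappa eta_t dt + sqrt(eps) dw_t the variance approaches
# eps/(2 kappa); Brownian motion (kappa = 0) has variance eps * t.
import numpy as np

rng = np.random.default_rng(0)
eps, T, n_steps, n_paths = 1.0, 5.0, 512, 20000
dt = T / n_steps

for kappa in [0.0, 25.0]:
    eta = np.zeros(n_paths)
    for _ in range(n_steps):
        eta += -kappa * eta * dt + np.sqrt(eps * dt) * rng.standard_normal(n_paths)
    target = eps * T if kappa == 0.0 else eps / (2 * kappa)
    print(f"kappa = {kappa:5.1f}: empirical var at T = {eta.var():.3f}, predicted = {target:.3f}")
```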
Figure 3: Reconstruction MSE on the held-out set against wall-clock time, for $\kappa = 0$, $\kappa = 25$, and $\kappa = 10000$: using OU noise ($\kappa > 0$) leads to faster training convergence.

On this data set, choosing a suitable noise process $(\eta_t)_t$ has the added benefit of significantly accelerating training: autoencoding mean-squared error (MSE) decreases much faster under OU noise ($\kappa > 0$) than under Brownian-motion noise (Figure 3). We expect the choice of noise should depend in general on properties of the data distribution, with the OU process being particularly suited to this data set; in the discussion that follows, we use an OU process with restoring force $\kappa = 25$.

Figure 4: (a) Decoded paths $g(z; \psi)(t)$, $z \in [-2.5, 2.5]$: the latent variable $z$ identified by FVAE corresponds to the first-crossing time $T_0$ of the decoded path $g(z; \psi)$. (b) Kernel density estimates of the distributions of $T_0$ under the FVAE generative model and under direct numerical simulation (16,384 samples each) closely agree.

Unsupervised Learning of Physically Relevant Quantities. Our choice of latent dimension $d_Z = 1$ was motivated by the heuristic that the time of the transition from $x_1 = -1$ to $x_2 = +1$ essentially determines the SDE trajectory. FVAE identifies this purely from data, with the learned latent variable $z \in Z$ being in correspondence with the transition time: larger values of $z$ map to paths $g(z; \psi)$ transitioning later in time (Figure 4(a)). To understand whether FVAE captures ensemble statistical properties of the data distribution, we compare the distributions of the first-crossing time

\[ T_0(u) = \inf\{ t > 0 \mid u_t \geq 0 \} \]

estimated using 16,384 paths from the generative model and 16,384 direct simulations, using kernel density estimates based on Gaussian kernels with bandwidths selected by Scott's rule (Scott, 2015, eq. (6.44)); we find that the two distributions closely agree (Figure 4(b)).

2.5.2 Estimation of Markov State Models

In practical applications of molecular dynamics, one is often interested in the evolution of large molecules on long timescales. For example, in the study of protein folding (Konovalov et al., 2021), it is of interest to capture the complex, multistage transitions of proteins between configurations. Moving beyond the toy one-dimensional problem in Section 2.5.1, the chief difficulty is the very high dimension of such systems, which makes simulations possible only on timescales orders of magnitude shorter than those of physical interest.

Markov state models (MSMs) offer one method of distilling many simulations on short timescales into a statistical model permitting much longer simulations (Husic and Pande, 2018). Assuming that the dynamics are given by a random process $(u_t)_{t \geq 0}$ taking values in the configuration space $X$, an MSM can be constructed by partitioning $X$ into disjoint state sets $X = X_1 \cup \cdots \cup X_p$, and, for some lag time $\tau > 0$, considering the discrete-time process $(U_k)_{k \in \mathbb{N}}$ for which $U_k = i$ if and only if $u_{k\tau} \in X_i$. One hopes that, if $\tau$ is sufficiently large, the process $(U_k)_{k \in \mathbb{N}}$ is approximately Markov, and thus its distribution can be characterised by learning the probabilities of transitioning in time $\tau$ from one state to another. These probabilities can be determined using the short-run simulations, which can be generated in parallel, and the resulting MSM can be used to simulate on longer timescales; a sketch of the state-assignment step follows below.
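A minimal sketch of the state-assignment step just described, assuming the nine-state partition used in the experiment below; the 3×3 labelling convention is one illustrative choice, not necessarily the paper's.

```python
# Map path values u_{k tau} to one of p = 9 regions cut out by the four
# lines x1 = +/-0.1 and x2 = +/-0.1.
import numpy as np

def state_index(point, cut=0.1):
    """Map a point in R^2 to a state in {0, ..., 8} via a 3x3 grid of regions."""
    def bin1d(c):
        return 0 if c < -cut else (1 if c <= cut else 2)
    return 3 * bin1d(point[1]) + bin1d(point[0])

def discretise_path(path_at_lag_times):
    """(U_k)_k for a path already sampled at the lag times k * tau."""
    return np.array([state_index(p) for p in path_at_lag_times])

print(discretise_path(np.array([[0.0, 0.0], [0.2, 0.2], [-0.2, 0.0]])))  # e.g. [4 8 3]
```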
Motivated by this application, we consider the problem of constructing an MSM from data provided at sparse or irregular intervals that do not necessarily align with the lag time $\tau$; in this case, computing the probability of transition in time $\tau$ directly may not be possible.

Figure 5: (a) Contour plot of the potential $U$ and the division of the state space $X$ into states $X_i$, $i = 1, \ldots, 9$. All paths start at $t = 0$ at the origin (in state 5). (b) Maximum-likelihood transition matrices with lag $\tau = 3/512$ computed using FVAE and through direct numerical simulation ($T^{\mathrm{DNS}}(\tau)$), both on the time interval $[0, 3]$.

We show the power of FVAE in this problem by first learning a generative model from the heterogeneously sampled data and then using the generative model to draw paths sampled at the regular time step $\tau$; constructing an MSM from these paths is then straightforward. We give an example based on the Brownian dynamics model (27) on $X = \mathbb{R}^2$ using a multiwell potential $U$ (Figure 5(a)), stated precisely in Appendix B.3, which we take to be a quadratic bowl, perturbed by a linear function to break the symmetry, and by six Gaussian densities to act as potential wells with minima at $(0, 0)$, $(0.2, 0.2)$, $(-0.2, 0.2)$, $(0.2, -0.2)$, $(0, -0.2)$ and $(-0.2, 0)$. We take $U = C_0([0,T], X)$ and let $\Upsilon \in \mathcal{P}(U)$ be the path distribution, with temperature $\varepsilon = 0.1$, final time $T = 3$ and initial condition $u_0 = 0$. The training data set consists of 16,384 paths discretised with time step 3/512, where, for each sample, it is assumed that 50% of steps are missing (details in Appendix B.3). We train FVAE using the SDE loss (Section 2.3.1) with $\kappa = 100$, $\lambda = 50$, $\beta = 0.02$, and latent dimension $d_Z = 16$.

Partitioning the Configuration Space. The states of an MSM can be selected manually using expert knowledge or automatically using variational or machine-learning methods (Mardt et al., 2018). For simplicity, we choose the states by hand, partitioning $X = \mathbb{R}^2$ into $p = 9$ disjoint regions (Figure 5(a)) divided by the four lines $x_1 = \pm 0.1$ and $x_2 = \pm 0.1$.

Estimating Transition Probabilities with FVAE. After training FVAE with irregularly sampled data, we draw samples from the generative model with regular time step $\tau$ and use these samples to compute the MSM transition probabilities. Setting aside the question of Markovianity for simplicity, we draw from the generative model $M = 2{,}048$ paths $\{v^{(m)}\}_{m=1}^M$ discretised on a mesh of $K = 513$ equally spaced points with time step $\tau = 3/512$ on $[0, T]$, and compute the count matrix

\[ C^{\mathrm{FVAE}}(\tau) = \big( C^{\mathrm{FVAE}}_{ij}(\tau) \big)_{i,j \in \{1, \ldots, p\}}, \qquad C^{\mathrm{FVAE}}_{ij}(\tau) = \sum_{m=1}^M \sum_{k=0}^{K-2} \mathbb{1}\Big[ v^{(m)}_{k\tau} \in X_i \text{ and } v^{(m)}_{(k+1)\tau} \in X_j \Big]. \]

We then derive the corresponding maximum-likelihood transition matrix $T^{\mathrm{FVAE}}(\tau)$ by normalising each row of $C^{\mathrm{FVAE}}(\tau)$ to sum to one; for simplicity we do not constrain the transition matrix to satisfy the detailed-balance condition (see Prinz et al., 2011, Sec. IV.D). The resulting transition matrix $T^{\mathrm{FVAE}}(\tau)$ agrees closely with the matrix $T^{\mathrm{DNS}}(\tau)$ computed analogously using 2,048 direct numerical simulations on the regular time step $\tau$ (Figure 5(b)).
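A matching sketch of the count-matrix and maximum-likelihood estimates above, reusing `discretise_path` from the previous sketch:

```python
# Accumulate the count matrix C(tau) over consecutive state pairs, then
# normalise each row to obtain the maximum-likelihood transition matrix.
import numpy as np

def transition_matrix(discrete_paths, p=9):
    C = np.zeros((p, p))
    for U in discrete_paths:          # U is the state sequence (U_k)_k of one path
        for i, j in zip(U[:-1], U[1:]):
            C[i, j] += 1.0
    rows = C.sum(axis=1, keepdims=True)
    return np.divide(C, rows, out=np.zeros_like(C), where=rows > 0)

# Usage: T = transition_matrix([discretise_path(v) for v in decoded_paths])
```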
3. Problems with VAEs in Infinite Dimensions

As we have seen in Section 2.1, the empirical FVAE objective $J^{\mathrm{FVAE}}_N(\theta,\psi)$ for the data set $\{u^{(n)}\}_{n=1}^N$ is based on a sequence of approximations and equalities:

$$\underbrace{\frac{1}{N}\sum_{n=1}^{N} L\bigl(u^{(n)};\theta,\psi\bigr)}_{=:\,J^{\mathrm{FVAE}}_N(\theta,\psi)} \;\overset{\text{(A)}}{\approx}\; \underbrace{\mathbb{E}_{u\sim\Upsilon}\bigl[L(u;\theta,\psi)\bigr]}_{=:\,J^{\mathrm{FVAE}}(\theta,\psi)} \;\overset{\text{(B)}}{=}\; D_{\mathrm{KL}}\bigl(\mathbb{Q}^{\theta}_{z,u}\,\big\|\,\mathbb{P}^{\psi}_{z,u}\bigr) - \underbrace{D_{\mathrm{KL}}(\Upsilon\,\|\,\Lambda)}_{\text{constant}}.$$

The approximation (A) is based on the law of large numbers and becomes an equality almost surely in the limit N → ∞. The equality (B) is true by Theorem 6, with a finite constant $D_{\mathrm{KL}}(\Upsilon\|\Lambda)$ provided that Assumption 5 holds; if the assumption does not hold, this constant may well be infinite. So, while it is tempting to apply the empirical objective $J^{\mathrm{FVAE}}_N$ without first checking the validity of (A) and (B), this strategy is fraught with pitfalls.

To illustrate this we apply FVAE in the white-noise setting of Example 1; this coincides with the setting of the VANO model (Seidman et al., 2023), which we discuss in detail in the related work (Section 5). In this example we derive the per-sample loss L and apply the resulting empirical objective $J^{\mathrm{FVAE}}_N$ for training; but both the joint divergence $D_{\mathrm{KL}}(\mathbb{Q}^{\theta}_{z,u}\|\mathbb{P}^{\psi}_{z,u})$ and the constant $D_{\mathrm{KL}}(\Upsilon\|\Lambda)$ turn out to be infinite. Consequently, the approximations (A) and (B) break down, which we see numerically: discretisations of $J^{\mathrm{FVAE}}_N$ appear to diverge as resolution is refined, suggesting that they have no continuum limit.

Example 2 Take $U = H^{-1}([0,1])$ and assume that Υ is the distribution of u in the model

$$\xi \sim \mathrm{Uniform}[0,1], \qquad u \mid \xi = \delta_\xi.$$

Realisations of u in this model lie in $H^{s}([0,1])$, $s < -1/2$, with probability one, so $\Upsilon \in \mathcal{P}(U)$. These data can be viewed as a prototype for rough behaviour such as the derivative of shock profiles arising in hyperbolic PDEs with random initial data. They will serve as an extreme example allowing us to isolate the numerical issues associated with using FVAE or VANO in the misspecified setting. Take the model (16a)–(16d) for real-valued functions on [0, 1], with $\mathbb{P}_\eta$ taken to be L²-white noise. As discussed in Proposition 15, $\mathbb{P}_\eta \in \mathcal{P}(H^s([0,1]))$ for $s < -1/2$, and $H(\mathbb{P}_\eta) = L^2([0,1])$. Fixing the reference distribution $\Lambda = \mathbb{P}_\eta$ and assuming g takes values in $L^2([0,1])$, the Cameron–Martin theorem (Proposition 19) ensures that the density $\mathrm{d}\mathbb{P}^{\psi}_{u|z}/\mathrm{d}\Lambda$ exists, and consequently

$$L(u;\theta,\psi) = \mathbb{E}_{z\sim\mathbb{Q}^{\theta}_{z|u}}\Bigl[\tfrac{1}{2}\bigl\|g(z;\psi)\bigr\|_{L^2}^2 - \bigl\langle g(z;\psi),\,u\bigr\rangle_{L^2}\Bigr] + D_{\mathrm{KL}}\bigl(\mathbb{Q}^{\theta}_{z|u}\,\big\|\,\mathbb{P}_z\bigr). \tag{29}$$

At this stage, we have not verified that the joint divergence $D_{\mathrm{KL}}(\mathbb{Q}^{\theta}_{z,u}\|\mathbb{P}^{\psi}_{z,u})$ has finite infimum, nor that Assumption 5 holds; indeed, we will see that both of these conditions fail.

Figure 6: (a) Schematic diagram of the reconstruction g(z; ψ), with learnable mean µ and learnable standard deviation σ, together with the discrete representation (squares) of the Dirac mass $\delta_\xi$ on a grid of 8 points. (b) In Example 2, the FVAE empirical objective ($\mathbb{P}_\eta = N(0, I)$; 50 runs) at the minimising parameters diverges as the resolution I ∈ {8, 16, 32, 64, 128} is increased. (c) To overcome this, we propose a regularised autoencoder, FAE, in Section 4. Repeating the experiment with the FAE objective ($U = H^{-1}([0,1])$; 50 runs) suggests that the FAE empirical objective has a well-defined continuum limit.

Nevertheless, we can attempt to train FVAE with the empirical objective resulting from (29). To do this we choose $Z = \mathbb{R}$ and use an encoder map f and a decoder map g tailored to this problem, since the architectures of Section 2.4 are not equipped to deal with functions of negative Sobolev regularity. We parametrise f as

$$f(u;\theta) = \rho\Bigl(\operatorname*{arg\,max}_{x\in[0,1]}\,(\varphi * u)(x);\,\theta\Bigr) \in Z\times Z = \mathbb{R}^2,$$

where ρ is a neural network and φ is a compactly supported smooth mollifier, chosen such that φ ∗ u has a well-defined maximum, and we parametrise g to return the Gaussian density

$$g(z;\psi)(x) = N\bigl(\mu(z;\psi),\,\sigma(z;\psi)^2;\,x\bigr),$$

with mean µ(z; ψ) ∈ [0, 1] and standard deviation σ(z; ψ) > 0 computed from a neural network (see Appendix B.4). In other words, f applies a neural network to the location of the maximum of u, while g returns a Gaussian density with learned mean and variance (Figure 6(a)).
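For concreteness, a minimal PyTorch sketch of such an encoder–decoder pair follows; the layer widths, the mollifier kernel, and the lower bound on σ are placeholders rather than the exact configuration of Appendix B.4, and gradients flow only through ρ, since the argmax location is piecewise constant in the input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiracEncoder(nn.Module):
    """f(u; θ) = ρ(argmax_x (φ*u)(x); θ): mollify u, locate the peak, and
    map its location through an MLP ρ to the mean and log-SD of the
    Gaussian variational distribution Q^θ_{z|u} on Z = R."""
    def __init__(self, width=128):
        super().__init__()
        self.rho = nn.Sequential(nn.Linear(1, width), nn.GELU(),
                                 nn.Linear(width, 2))

    def forward(self, x, u, kernel):
        # (φ*u): discrete convolution with a fixed mollifier kernel
        smoothed = F.conv1d(u.unsqueeze(1), kernel.view(1, 1, -1),
                            padding="same").squeeze(1)
        peak = x[smoothed.argmax(dim=-1)].unsqueeze(-1)  # location of max
        mean, log_sd = self.rho(peak).unbind(-1)
        return mean, log_sd

class DiracDecoder(nn.Module):
    """g(z; ψ)(x) = N(μ(z; ψ), σ(z; ψ)²; x): a Gaussian density with
    learned mean μ ∈ [0, 1] and standard deviation σ > 0."""
    def __init__(self, width=128, sigma_min=1e-3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, width), nn.GELU(),
                                 nn.Linear(width, 2))
        self.sigma_min = sigma_min

    def forward(self, z, x):
        mu_raw, sd_raw = self.net(z.unsqueeze(-1)).unbind(-1)
        mu = torch.sigmoid(mu_raw)[:, None]              # mean in [0, 1]
        sigma = (self.sigma_min + F.softplus(sd_raw))[:, None]
        return torch.exp(-0.5 * ((x[None, :] - mu) / sigma) ** 2) \
               / (sigma * (2 * torch.pi) ** 0.5)
```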
To investigate the behaviour of discretisations of $J^{\mathrm{FVAE}}_N$ as resolution is refined, we generate a sequence of data sets in which we discretise on a mesh of I ∈ {8, 16, 32, 64, 128} equally spaced points $\{i/(I+1)\}_{i=1,\dots,I} \subset [0,1]$. At each resolution we generate I training samples: one discretised Dirac function at each mesh point, normalised to have unit L¹-norm. We train 50 independent instances of FVAE at each resolution and record the value of the empirical objective at convergence (Figure 6(b)). Notably, the empirical objective appears to diverge as resolution is refined. This has two major causes:

(a) Υ is not absolutely continuous with respect to $\Lambda = \mathbb{P}_\eta$: the set $\{\delta_x \mid x \in [0,1]\}$ has probability zero under Λ but probability one under Υ; thus $D_{\mathrm{KL}}(\Upsilon\|\Lambda) = \infty$.

(b) Υ is not absolutely continuous with respect to $\mathbb{P}^{\psi}_u$, meaning that $D_{\mathrm{KL}}(\mathbb{Q}^{\theta}_{z,u}\|\mathbb{P}^{\psi}_{z,u}) = \infty$ for all θ and ψ. To see this, we again note that $\{\delta_x \mid x\in[0,1]\}$ has probability one under Υ, but, as a consequence of the Cameron–Martin theorem, $\mathbb{P}^{\psi}_{u|z}$ and $\mathbb{P}_\eta$ are mutually absolutely continuous, and thus, as $\mathbb{P}_\eta(\{\delta_x \mid x\in[0,1]\}) = 0$,

$$\mathbb{P}^{\psi}_u\bigl(\{\delta_x \mid x\in[0,1]\}\bigr) = \int_Z \mathbb{P}^{\psi}_{u|z}\bigl(\{\delta_x \mid x\in[0,1]\}\bigr)\,\mathbb{P}_z(\mathrm{d}z) = 0.$$

The problematic term in the per-sample loss is the measurable linear functional $\langle g(z;\psi), u\rangle_{L^2}$; this is defined only up to modification on $\mathbb{P}_\eta$-probability-zero sets, and yet we evaluate it on just such a set, namely the $\mathbb{P}_\eta$-probability-zero set $\{\delta_x \mid x\in[0,1]\}$.

Remark 22 The joint divergence $D_{\mathrm{KL}}(\mathbb{Q}^{\theta}_{z,u}\|\mathbb{P}^{\psi}_{z,u})$ would also be infinite if Υ were supported on $L^2([0,1])$, as seen in Example 1, but it is harder to observe any numerical issue in training. This is because the measurable linear functional $\langle g(z;\psi), u\rangle_{L^2}$ reduces to the usual L²-inner product, and so, even though the FVAE objective is not well defined, the per-sample loss can still be viewed as a regularised misfit (see Theorem 20):

$$L(u;\theta,\psi) = \mathbb{E}_{z\sim\mathbb{Q}^{\theta}_{z|u}}\Bigl[\tfrac{1}{2}\bigl\|g(z;\psi) - u\bigr\|_{L^2}^2\Bigr] + D_{\mathrm{KL}}\bigl(\mathbb{Q}^{\theta}_{z|u}\,\big\|\,\mathbb{P}_z\bigr) - \tfrac{1}{2}\|u\|_{L^2}^2.$$

As a result, one can reinterpret the objective as that of a regularised autoencoder (see Remark 11). This motivates our use of a regularised autoencoder, FAE, in Section 4.

4. Regularised Autoencoders on Function Space

To overcome the issues in applying VAEs in infinite dimensions, we set aside the probabilistic motivation for FVAE and define a regularised autoencoder in function space, the functional autoencoder (FAE), avoiding the need for onerous conditions on the data distribution. In Section 4.1, we state the FAE objective and make connections to the FVAE objective. Section 4.2 outlines minor adaptations to the FVAE encoder and decoder for use with FAE. Section 4.3 demonstrates FAE on two examples from the sciences: incompressible fluid flows governed by the Navier–Stokes equation, and porous-medium flows governed by Darcy's law.
4.1 Training Objective

Throughout Section 4 we make the following assumption on the data, postponing discussion of discretisation to Section 4.2. Unlike in Section 2, we do not need U to be separable, allowing us to consider data from an even wider variety of spaces, such as the (non-separable) space BV(Ω) of bounded-variation functions on $\Omega \subset \mathbb{R}^d$.

Assumption 23 Let $(U, \|\cdot\|)$ be a Banach space. There exists a data distribution $\Upsilon \in \mathcal{P}(U)$ from which we have access to N independent and identically distributed samples $\{u^{(n)}\}_{n=1}^N \subset U$.

To define our regularised autoencoder, we fix a latent space $Z = \mathbb{R}^{d_Z}$ and define encoder and decoder transformations f and g which, unlike in FVAE, return points rather than probability distributions:

$$\text{(encoder)} \quad U \ni u \mapsto f(u;\theta) \in Z, \tag{30a}$$
$$\text{(decoder)} \quad Z \ni z \mapsto g(z;\psi) \in U. \tag{30b}$$

We then take as our objective the sum of a misfit term between the data and its reconstruction, and a regularisation term, with regularisation parameter β > 0, on the encoded vectors:

$$\text{(FAE objective)} \quad J^{\mathrm{FAE}}_\beta(\theta,\psi) = \mathbb{E}_{u\sim\Upsilon}\Bigl[\tfrac{1}{2}\bigl\|g\bigl(f(u;\theta);\psi\bigr) - u\bigr\|^2 + \beta\,\|f(u;\theta)\|_2^2\Bigr]. \tag{31}$$

As in Section 2.1, the expectation over u ∼ Υ is approximated by an average over the training data, resulting in the empirical objective $J^{\mathrm{FAE}}_{\beta,N}$. There is great flexibility in the choice of regularisation term; we adopt the squared Euclidean norm $\|f(u;\theta)\|_2^2$ as a simplifying choice consistent with using a Gaussian prior $\mathbb{P}_z$ in a VAE (Remark 11). While (31) has much in common with the FVAE objective $J^{\mathrm{FVAE}}$, it is not marred by the foundational issues raised in Section 3; indeed, the FAE objective is broadly applicable, as the following result shows.

Proposition 24 Suppose Υ has finite second moment, i.e., $\mathbb{E}_{u\sim\Upsilon}\|u\|^2 < \infty$. If there exist $\theta^\star \in \Theta$ and $\psi^\star \in \Psi$ such that $f(\,\cdot\,;\theta^\star) \equiv 0$ and $g(\,\cdot\,;\psi^\star) \equiv 0$, then (31) has finite infimum.

Proof This follows immediately from evaluating $J^{\mathrm{FAE}}_\beta$ at $\theta^\star$ and $\psi^\star$, since $J^{\mathrm{FAE}}_\beta(\theta^\star,\psi^\star) = \mathbb{E}_{u\sim\Upsilon}\bigl[\tfrac{1}{2}\|u\|^2\bigr]$, and the expectation is finite by hypothesis. ∎

4.2 Architecture and Algorithms

To train FAE we must discretise the objective $J^{\mathrm{FAE}}_\beta$ and parametrise the encoder f and decoder g with learnable maps. We moreover propose a masked training scheme that appears to be new to the operator-learning literature; as we will show in Section 4.3, this scheme greatly improves the robustness of FAE to changes of mesh.

Encoder and Decoder Architecture. As in Section 2.4, we construct encoder and decoder architectures under the assumption that U is a Banach space of functions, evaluable pointwise almost everywhere, with domain $\Omega \subset \mathbb{R}^d$ and range $\mathbb{R}^m$, and that we have access to discretisations of the data $u^{(n)}$ comprising evaluations at finitely many mesh points. We adopt an architecture near-identical to that used for FVAE. More precisely, we parametrise the encoder as

$$f(u;\theta) = \rho\Bigl(\int_\Omega \kappa\bigl(x, u(x);\,\theta\bigr)\,\mathrm{d}x;\;\theta\Bigr) \in Z,$$

with $\kappa\colon \Omega\times\mathbb{R}^m\times\Theta \to \mathbb{R}^{\ell}$ parametrised as a neural network with two hidden layers of width 64 and output dimension ℓ = 64, and with $\rho\colon \mathbb{R}^{\ell}\times\Theta \to \mathbb{R}^{d_Z}$ parametrised as the linear layer $\rho(v;\theta) = W_\theta v + b_\theta$. We parametrise the decoder as the coordinate neural network $\gamma\colon Z\times\Omega\times\Psi\to\mathbb{R}^m$ with 5 hidden layers of width 100, so that $g(z;\psi)(x) = \gamma(z, x;\psi) \in \mathbb{R}^m$. In both cases we use GELU activation and augment x with 16 random Fourier features (Appendix B.1). Relative to the architectures of Section 2.4, the only change is in the range of f, which now takes values in Z instead of returning distributional parameters for $\mathbb{Q}^{\theta}_{z|u}$.

Discretisation of $J^{\mathrm{FAE}}_\beta$. The FAE objective can be applied whenever U is a Banach space, but in this article we take $U = L^2(\Omega)$, where $\Omega = \mathbb{T}^d$ or $\Omega = [0,1]^d$, and discretise the U-norm with a normalised sum, as in the sketch below.
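As an illustration, a minimal sketch of the resulting empirical objective on a fixed mesh follows, assuming hypothetical `encoder` and `decoder` modules following the interfaces above; since the domain has unit volume, the mean over mesh points approximates the squared L²-norm.

```python
import torch

def fae_objective(u, encoder, decoder, beta=1e-3):
    """Discretised FAE objective (31) for a batch of samples on a mesh:
    the squared L^2 misfit is approximated by a normalised sum over mesh
    points, plus beta times the squared Euclidean norm of the latents."""
    z = encoder(u)                                  # (batch, d_Z)
    recon = decoder(z)                              # (batch, I) on the mesh
    misfit = 0.5 * ((recon - u) ** 2).mean(dim=-1)  # ≈ ½‖g(f(u;θ);ψ) − u‖²
    reg = beta * (z ** 2).sum(dim=-1)               # β‖f(u;θ)‖²₂
    return (misfit + reg).mean()                    # average over the batch
```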
One can readily imagine other possibilities, e.g., taking U to be a Sobolev space of order s ≥ 1 if derivative information is available (Czarnecki et al., 2017), and approximating the L²-norm of the data and its derivatives by sums. More generally, we may use linear functionals as the starting point for approximation.

4.2.1 Masked Training

Self-supervised training, i.e., learning to predict the missing data from masked inputs, has proven valuable in both language models such as BERT (Devlin et al., 2019) and vision models such as the masked autoencoder (MAE; He et al., 2022). This method has been shown both to reduce training time and to improve generalisation. We propose two schemes making use of the ability to discretise the encoder and decoder on arbitrary meshes: complement masking and random masking, both sketched in code below. Under both schemes we transform each discretised training sample $\bar{u} = \{(x_i, u(x_i))\}_{i=1}^{I}$ by subsampling with index sets $I_{\mathrm{enc}}$ and $I_{\mathrm{dec}}$, which may change at each training step, to obtain

$$\bar{u}_{\mathrm{enc}} = \bigl\{(x_i, u(x_i))\bigr\}_{i\in I_{\mathrm{enc}}}, \qquad \bar{u}_{\mathrm{dec}} = \bigl\{(x_i, u(x_i))\bigr\}_{i\in I_{\mathrm{dec}}}.$$

We supply $\bar{u}_{\mathrm{enc}}$ as input to the discretised encoder; moreover we discretise the decoder on the mesh $\{x_i\}_{i\in I_{\mathrm{dec}}}$ and compare the decoder output to the masked data $\bar{u}_{\mathrm{dec}}$. In both of the strategies we propose, $I_{\mathrm{enc}}$ and $I_{\mathrm{dec}}$ will be unstructured random subsets of {1, ..., I}, but in principle other masks, such as polygons, could be considered.

Complement Masking. The chief strategy used in the numerical experiments is to draw $I_{\mathrm{enc}}$ as a random subset of {1, ..., I} and take $I_{\mathrm{dec}} = \{1,\dots,I\}\setminus I_{\mathrm{enc}}$, fixing the (encoder) point ratio $r_{\mathrm{enc}} = |I_{\mathrm{enc}}|/I$ as a hyperparameter. This is similar to the strategy adopted by MAE, though our approach differs by masking individual mesh points instead of patches. We explore the tradeoffs in the choice of point ratio in Section 4.3.1.

Random Masking. A second strategy we consider is to draw $I_{\mathrm{enc}}$ and $I_{\mathrm{dec}}$ independently as random subsets of {1, ..., I}, fixing both the encoder point ratio $r_{\mathrm{enc}} = |I_{\mathrm{enc}}|/I$ and the decoder point ratio $r_{\mathrm{dec}} = |I_{\mathrm{dec}}|/I$. This gives greater control over the cost of evaluating the encoder and decoder: by taking both $r_{\mathrm{enc}}$ and $r_{\mathrm{dec}}$ to be small, we significantly reduce the cost of each training step, which is useful when the number of mesh points I is large.
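A minimal sketch of the two masking schemes, using 0-indexed mesh indices:

```python
import numpy as np

def complement_mask(I, r_enc, rng):
    """Draw I_enc as a random subset of {0, ..., I-1} with |I_enc| ≈ r_enc·I
    and take I_dec to be its complement."""
    perm = rng.permutation(I)
    n_enc = int(r_enc * I)
    return perm[:n_enc], perm[n_enc:]

def random_mask(I, r_enc, r_dec, rng):
    """Draw I_enc and I_dec independently, fixing both point ratios."""
    i_enc = rng.choice(I, size=int(r_enc * I), replace=False)
    i_dec = rng.choice(I, size=int(r_dec * I), replace=False)
    return i_enc, i_dec

# At each training step: encode on (x[i_enc], u[i_enc]) and compare the
# decoder output on the mesh x[i_dec] against the masked data u[i_dec].
rng = np.random.default_rng(0)
i_enc, i_dec = complement_mask(64 * 64, r_enc=0.3, rng=rng)
```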
4.3 Numerical Experiments

In Sections 4.3.1 and 4.3.2, we apply FAE as an out-of-the-box method to discover a low-dimensional latent space for solutions to the incompressible Navier–Stokes equations and the Darcy model for flow in a porous medium, respectively. We find that, for these data sets:

(a) FAE's mesh-invariant architecture autoencodes with performance comparable to convolutional neural network (CNN) architectures of similar size (Section 4.3.1);

(b) the ability to discretise the encoder and decoder on different meshes enables new applications to inpainting and data-driven superresolution (Section 4.3.1), as well as extensions of the existing zero-shot superresolution proposed for VANO by Seidman et al. (2023);

(c) masked training significantly improves performance under mesh changes (Section 4.3.1) and can accelerate training while reducing memory demand (Section 4.3.2);

(d) training a generative model on the FAE latent space leads to a resolution-invariant generative model which accurately captures distributional properties (Section 4.3.2).

4.3.1 Incompressible Navier–Stokes Equations

We first illustrate how FAE can be used to learn a low-dimensional representation for snapshots of the vorticity of a fluid flow in two spatial dimensions governed by the incompressible Navier–Stokes equations, and illustrate some of the benefits of our mesh-invariant model. Let $\Omega = \mathbb{T}^2$ be the torus, viewed as the square [0, 1]² with opposing edges identified and with unit normal $\hat{z}$, and let $U = L^2(\Omega)$. While the incompressible Navier–Stokes equations are typically formulated in terms of the primitive variables of velocity u and pressure p, it is more natural in this case to work with the vorticity–streamfunction formulation (see Chandler and Kerswell, 2013, eq. (2.6)). In particular, the vorticity $\nabla\times u$ is zero except in the out-of-plane component $\hat{z}\omega$. The scalar component of the vorticity, ω, then satisfies

$$\partial_t\omega = -\hat{z}\cdot\nabla\times(u\times\omega\hat{z}) + \nu\,\Delta\omega + \phi, \quad (x,t)\in\Omega\times(0,T]; \qquad \omega(x,0) = \omega_0(x), \quad x\in\Omega. \tag{32}$$

In this setting the velocity is given by $u = \nabla\times(\psi_s\hat{z})$, where the streamfunction $\psi_s$ satisfies $\omega = -\Delta\psi_s$.¹ Thus, using periodicity, $\psi_s$ is uniquely defined, up to an irrelevant constant, in terms of ω, and (32) defines a closed evolution equation for ω. We suppose that $\Upsilon\in\mathcal{P}(U)$ is the distribution of the scalar vorticity ω(·, T = 50) given by (32), with viscosity ν = 10⁻⁴ and forcing

$$\phi(x) = \tfrac{1}{10}\sin\bigl(2\pi x_1 + 2\pi x_2\bigr) + \tfrac{1}{10}\cos\bigl(2\pi x_1 + 2\pi x_2\bigr).$$

We assume that the initial condition ω₀ has distribution N(0, C) with covariance operator $C = 7^{3/2}(49 I - \Delta)^{-5/2}$, where Δ is the Laplacian operator for spatially-mean-zero functions on the torus $\mathbb{T}^2$. The training data set, based on that of Li et al. (2021), consists of 8,000 samples from Υ generated on a 64 × 64 grid using a pseudospectral solver, with a further 2,000 independent samples held out as an evaluation set. The data are scaled so that ω(x, T) ∈ [0, 1] for all x ∈ Ω. Further details are provided in Appendix B.5. We train FAE with latent dimension $d_Z = 64$ and regularisation parameter β = 10⁻³, and train with complement masking using a point ratio $r_{\mathrm{enc}}$ of 30%.

1. While it is typical in the literature (e.g., Chandler and Kerswell, 2013) to denote the streamfunction by ψ, we instead denote it by $\psi_s$ to avoid ambiguity with the decoder parameter ψ.

Performance at Fixed Resolution. We compare the autoencoding performance of our mesh-invariant FAE architecture to a standard fixed-resolution CNN architecture, both trained using the FAE objective with all other hyperparameters the same. Our goal is to understand whether our architecture is competitive even without the inductive bias of CNNs. To do this we fix a class of FAE and CNN architectures for 64 × 64 unmasked data, and perform a search to select the best-performing models with similar parameter counts (details in Appendix B.5). The FAE architecture achieves reconstruction MSE slightly greater than the CNN on the held-out data, with a comparable number of parameters (Table 1).

Table 1: Autoencoding on the evaluation set (64 × 64 grid); mean ± 1 standard deviation over 5 training runs. Our resolution-invariant architecture performs comparably to CNNs with similar parameter counts.

                      MSE                            Parameters
    FAE architecture  4.82 × 10⁻⁴ ± 2.57 × 10⁻⁵      64,857
    CNN architecture  2.38 × 10⁻⁴ ± 9.43 × 10⁻⁶      71,553

It is reasonable to expect that the mesh-invariance of the FAE architecture comes at some cost to performance at fixed resolution, especially as the CNN benefits from a strong inductive bias, but the results of Table 1 suggest that the cost is modest. Further research is desirable to close this gap through better mesh-invariant architectures.
Inpainting. Methods for inpainting, i.e., inferring missing parts of an input based on observations and prior knowledge from training data, and related inverse problems are of great interest in computer vision and in scientific applications (Quan et al., 2024). We exploit the ability of FAE to encode on any mesh, and decode on any other mesh, to solve a variety of inpainting tasks. More precisely, after training FAE, we take data from the held-out set on a 64 × 64 grid and, for each discretised sample, we apply one of three possible masks (masks (ii) and (iii) are sketched in code below):

(i) random masking with point ratio 5%, i.e., masking 95% of mesh points; or

(ii) masking of all mesh points lying in a square with random centre and side length; or

(iii) masking of all mesh points in the 0.05-superlevel set of a draw from the Gaussian random field $N\bigl(0, (30^2 I - \Delta)^{-1.2}\bigr)$, where Δ is the Laplacian for functions on the torus.

Figure 7: (a) Inpainting with missing mesh points (input, reconstruction, ground truth): FAE can solve a variety of inpainting tasks; further samples in Appendix B.5. (b) Reconstruction MSE [×10⁻⁴] as a function of the training and evaluation point ratios (mean ± 1 standard deviation; 5 training runs): training the encoder on a sparse mesh has a regularising effect on FAE, leading to lower evaluation MSE on dense meshes, but harms performance on very sparse meshes.

Decoding these samples on a 64 × 64 grid (Figure 7) leads to reconstructions that agree well with the ground truth, and we find FAE to be robust even with a significant amount of the original mesh missing. As a consequence of the autoencoding procedure, the observed region of the input may also be modified, an effect most pronounced in (ii), where some features in the input are oversmoothed in the reconstruction. We hypothesise that the failure to capture fine-scale features could be mitigated with better neural-operator architectures.

To understand the effect of the training point ratio on inpainting quality, we first train instances of FAE with complement masking with point ratios 10%, 50%, and 90%. Then, for each model, we evaluate its autoencoding performance by applying random masking to each sample from the held-out set with point ratio $r_{\mathrm{eval}} \in \{10\%, 30\%, 50\%, 70\%, 90\%\}$, reconstructing on the full 64 × 64 grid, and computing the reconstruction MSE averaged over the held-out set (Figure 7(b)). We observe that the best choice of $r_{\mathrm{enc}}$ depends on the point ratio $r_{\mathrm{eval}}$ of the input. We hypothesise that when $r_{\mathrm{eval}}$ is large, training with $r_{\mathrm{enc}}$ small is helpful, as training with few mesh points regularises the model. On the other hand, when $r_{\mathrm{eval}}$ is small, it is likely that the evaluation mesh is almost disjoint from any mesh seen during training, harming performance. Further analysis is provided in Appendix B.5.
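For illustration, minimal sketches of masks (ii) and (iii) above follow; the spectral sampling of the Gaussian random field and its normalisation are rough modelling choices in this sketch rather than the exact procedure used in our experiments, and the square mask ignores wrap-around on the torus.

```python
import numpy as np

def square_mask(n=64, rng=None):
    """Mask (ii): all mesh points in an axis-aligned square with random
    centre and side length (clipped at the boundary)."""
    rng = rng or np.random.default_rng()
    cx, cy = rng.integers(0, n, size=2)
    half = rng.integers(4, n // 4)
    mask = np.zeros((n, n), dtype=bool)
    mask[max(cx - half, 0):cx + half, max(cy - half, 0):cy + half] = True
    return mask

def grf_superlevel_mask(n=64, level=0.05, rng=None):
    """Mask (iii): mesh points in the `level`-superlevel set of a draw from
    N(0, (900 I − Δ)^{−1.2}) on the torus, sampled spectrally: scale white
    noise in Fourier space by the square root of the covariance spectrum."""
    rng = rng or np.random.default_rng()
    k = np.fft.fftfreq(n, d=1.0 / n)               # integer Fourier modes
    k2 = k[:, None] ** 2 + k[None, :] ** 2
    lam = (900.0 + (2 * np.pi) ** 2 * k2) ** -1.2  # covariance eigenvalues
    xi = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    field = np.real(np.fft.ifft2(np.sqrt(lam) * xi)) * n  # rough scaling
    return field > level
```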
Superresolution. The ability to encode and decode on different meshes also enables the use of FAE for single-image superresolution: high-resolution reconstruction of a low-resolution input. Superresolution methods based on deep learning have found use in varied applications including imaging (Li et al., 2024) and fluid dynamics, e.g., in increasing the resolution ("downscaling") of numerical simulations (Kochkov et al., 2021; Bischoff and Deck, 2024). Generalising other continuous superresolution models such as the local implicit image function (Chen et al., 2021), a single trained FAE model can be applied with any upsampling factor, and FAE has the further advantage of accepting inputs at any resolution.

Figure 8: (a) Examples of data-driven superresolution (input, reconstruction, ground truth; 16 × 16 and 8 × 8 inputs upsampled to 64 × 64): FAE can encode low-resolution inputs and decode at higher resolution, recovering fine-scale features using knowledge of the underlying data. Further examples in Appendix B.5. (b) Superresolution on patches: taking input data on a coarse 64 × 64 mesh and refining in a specific region with mesh spacing 1/400 requires roughly 38k decoder network evaluations, against 160k on the full grid, so evaluating the decoder in a specific subregion can lead to significant computational savings.

For data-driven superresolution, where we train a model at high resolution and use it to enhance low-resolution inputs at inference time, FAE is able to resolve unseen features from 8 × 8 and 16 × 16 inputs on a 64 × 64 output grid after training at resolution 64 × 64 (Figure 8(a)). As with inpainting, superresolution performance could be further improved with an architecture that is better able to capture the turbulent dynamics in the data.

We also investigate the stability of FAE for zero-shot superresolution (Li et al., 2021), where the model is evaluated on higher resolutions than seen during training. Since FAE is purely data-driven, we view this as a test of the model's mesh-invariance and do not expect to resolve high-frequency features that were not seen during training. Our architecture proves robust when autoencoding on meshes much finer than the original 64 × 64 training grid (Figure 9(a)); moreover, our coordinate-MLP architecture allows us to decode on extremely fine meshes without exhausting the GPU memory (details in Appendix B.5). While zero-shot superresolution is possible with VANO when the input is given on the mesh seen during training, FAE can be used for superresolution with any input.

Efficient Superresolution on Regions of Interest. Since our decoder can be evaluated on any mesh, we can perform superresolution in a specific subregion without upsampling across the whole domain. Doing this can significantly reduce inference time, memory usage, and energy cost. As an example, we consider the task of reconstructing a circular subregion of interest with target mesh spacing 1/400 (Figure 8(b)(i)). Achieving this resolution over the whole domain, corresponding to a 400 × 400 grid, would involve 160,000 evaluations of the decoder network; decoding on the subregion requires just one quarter of this (Figure 8(b)(ii)).

Figure 9: (a) Autoencoding beyond the training resolution with zero-shot superresolution, at a 32× increase (2,048 × 2,048) and a 512× increase (32,768 × 32,768): FAE can stably decode at resolutions much higher than the training resolution (best viewed digitally). (b) Latent interpolation $g\bigl(\alpha f(u_1;\theta) + (1-\alpha)f(u_2;\theta);\,\psi\bigr)$: the regularised latent space Z allows for meaningful interpolation between samples. Further examples are given in Appendix B.5.

Applications of the Latent Space Z. The regularised FAE latent space gives a well-structured finite representation of the infinite-dimensional data u ∈ U. We expect there to be benefit in using this representation as a building block for applications such as supervised learning and generative modelling on functional data, similar in spirit to other supervised operator-learning methods with encoder–decoder structure (Seidman et al., 2022). As a first step towards verifying that the latent space does indeed capture useful structure beyond mere memorisation of the training data, we draw u₁ and u₂ from the held-out set, compute latent vectors $z_1 = f(u_1;\theta)$ and $z_2 = f(u_2;\theta) \in Z$, and evaluate the decoder g along the convex combination $\alpha z_1 + (1-\alpha)z_2$. This leads to a sensible interpolation in U, suggesting that the latent representation is robust and well-regularised (Figure 9(b)).
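A minimal sketch of this interpolation, assuming hypothetical `encoder` and `decoder` wrappers around the trained FAE:

```python
import torch

def latent_interpolation(u1, u2, encoder, decoder, n_steps=7):
    """Decode along the convex combination α f(u1;θ) + (1−α) f(u2;θ)
    for α on a uniform grid in [0, 1]."""
    z1, z2 = encoder(u1), encoder(u2)
    alphas = torch.linspace(0.0, 1.0, n_steps)
    return [decoder(a * z1 + (1.0 - a) * z2) for a in alphas]
```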
4.3.2 Darcy Flow

Darcy flow is a model of steady-state flow in a porous medium, derivable from first principles using homogenisation; see, e.g., Freeze and Cherry (1979, Sec. 2.11) and Keller (1980). We restrict attention to the two-dimensional domain Ω = [0, 1]² and suppose that, for some permeability field $k\colon\Omega\to\mathbb{R}$ and forcing $\phi\colon\Omega\to\mathbb{R}$, the pressure field $p\colon\Omega\to\mathbb{R}$ satisfies

$$-\nabla\cdot\bigl(k\,\nabla p\bigr) = \phi \ \text{ on } \Omega, \qquad p = 0 \ \text{ on } \partial\Omega. \tag{33}$$

We assume ϕ ≡ 1 and that k is distributed as the pushforward of the distribution $N\bigl(0, (-\Delta + 9I)^{-2}\bigr)$, where Δ is the Laplacian restricted to functions defined on Ω with zero Neumann data on ∂Ω, under the map $x \mapsto 3 + 9\,\mathbb{1}[x \geq 0]$. We take $U = L^2(\Omega)$ and define $\Upsilon\in\mathcal{P}(U)$ to be the distribution of pressure fields p solving (33) with permeability k. While solutions to this elliptic PDE can be expected to have greater smoothness (Evans, 2010, Sec. 6.3), we assume only that $p\in L^2(\Omega)$ and use the L²-norm in the FAE objective (31). The training data set is based on that of Li et al. (2021) and consists of 1,024 samples from Υ on a 421 × 421 grid, with a further 1,024 samples held out as an evaluation set. Data are scaled so that p(x) ∈ [0, 1] for all x ∈ Ω and, where specified, we downsample as described in Appendix B.6. We train FAE with $d_Z = 64$ and β = 10⁻³, and use complement masking with a point ratio $r_{\mathrm{enc}}$ of 30%.

Figure 10: Reconstruction MSE on the held-out set (211 × 211) against wall-clock time [s] for training point ratios of 10%, 50%, and 100% (mean over 5 training runs): training with masking in the encoder and decoder reduces training time.

Accelerating Training Using Masking. As well as improving reconstructions and robustness to mesh changes, masked training can greatly reduce the cost of training. To illustrate this we compare the training dynamics of FAE on data downsampled to resolution 211 × 211, using random masking with point ratio $r_{\mathrm{enc}} = r_{\mathrm{dec}} \in \{10\%, 50\%, 90\%\}$. Since the evaluation cost of the encoder and decoder scales linearly with the number of mesh points, we expect significant computational gains when using low point ratios. We perform five training runs for each model and compute the average reconstruction MSE over time on held-out data at resolution 211 × 211. The models trained with masking converge faster, as the smaller data tensors allow for better use of GPU parallelism (Figure 10). At higher resolutions, memory constraints may preclude training on the full grid, making masking vital. Related ideas are used in the adaptive-subsampling training scheme for FNOs proposed by Lanthaler et al. (2024), which involves training first on a coarse grid and refining the mesh each time the evaluation metric plateaus; our approach differs by dropping mesh points randomly, which would not be possible with FNO. One can readily imagine training FAE with a combination of adaptive subsampling and masking.
Generative Modelling. While FAE is not itself a generative model, it can be made so by training a fixed-dimension generative model on the latent space Z (Ghosh et al., 2020; Vahdat et al., 2021). More precisely, we know that applying the FAE encoder f to data induces a distribution $\Sigma_\theta \in \mathcal{P}(Z)$ for z given by

$$z \mid u = f(u;\theta), \qquad u \sim \Upsilon. \tag{34}$$

Unlike with FVAE, there is no reason that this should be close to Gaussian. However, we can approximate $\Sigma_\theta$ with a fixed-resolution generative model $\mathbb{P}^{\phi}_z \in \mathcal{P}(Z)$ parametrised by φ ∈ Φ, and define the FAE generative model $\mathbb{P}^{\psi,\phi}_u$ for data u by

$$\text{(FAE generative model)} \quad u \mid z = g(z;\psi), \qquad z \sim \mathbb{P}^{\phi}_z. \tag{35}$$

Since applying the decoder to $\Sigma_\theta$ should approximately recover the data distribution if $g(f(u;\theta);\psi) \approx u$ for u ∼ Υ, we hope that when $\Sigma_\theta \approx \mathbb{P}^{\phi}_z$, samples from (35) will be approximately distributed according to Υ. As a simple illustration, we train FAE at resolution 47 × 47 and fit a Gaussian mixture model $\mathbb{P}^{\phi}_z$ with 10 components to $\Sigma_\theta$ using the expectation-maximisation algorithm (see Bishop, 2006, Sec. 9.2.2); a sketch is given below. Samples from (35) closely resemble those from the held-out data set (Figure 11(a)), and, as a result of our mesh-invariant architecture, it is possible to generate new samples on any mesh.

Figure 11: (a) Uncurated samples from the FAE generative model for the pressure field p, $g(z;\psi)$ with $z\sim\mathbb{P}^{\phi}_z$, alongside samples p ∼ Υ (both on a 47 × 47 grid); further samples are provided in Appendix B.6. (b) Kernel density estimates for the quantities of interest $Q_1(p) = \max_{x\in\Omega} p(x)$ and $Q_2(p) = \|p\|_{L^2}$, based on 1,024 samples from the generative model: the distributions computed using the FAE generative model closely agree with the ground truth.

To measure generative performance, we approximate the distributions of physically relevant quantities of interest $Q_i(p)$ depending on the data p ∈ U, comparing 1,024 samples from the generative model to the held-out data. Using kernel density estimates as in Figure 4(b), we see close agreement between the distributions (Figure 11(b)). While we could also evaluate the generative model using distances such as the maximum mean discrepancy (MMD; Borgwardt et al., 2006), we focus on interpretable quantities relevant to the physical system at hand. Though we here adopt the convention of training the autoencoder and generative model separately (Rombach et al., 2022), the models could also be trained jointly; we leave this, and an investigation of generative models on the FAE latent space, to future work.
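A minimal sketch of this two-stage construction, using scikit-learn's expectation-maximisation implementation and hypothetical wrappers `encode_all` and `decode` around the trained FAE:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit P^φ_z to the encoded training data (an empirical approximation of
# Σ_θ) by expectation-maximisation, then push samples through the decoder
# to realise the generative model (35) on any chosen mesh.
latents = encode_all(train_data)                # array of shape (N, d_Z)
gmm = GaussianMixture(n_components=10).fit(latents)
z_new, _ = gmm.sample(n_samples=1024)           # draws z ~ P^φ_z
samples = decode(z_new, mesh)                   # u = g(z; ψ) on the mesh
```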
5. Related Work

Variational Autoencoding Neural Operators. The VANO model (Seidman et al., 2023) was the first to attempt a systematic extension of the VAE objective to function space. The paper uses ideas from operator learning to construct a model that can decode, but not encode, at any resolution. Our approach differs in both training objective and practical implementation, as we now outline.

The most significant difference between what is proposed in this paper and in VANO is the objective on function space: the VANO objective coincides with a specific case of our model (16a)–(16d) with the decoder noise $\mathbb{P}_\eta$ being white noise on $L^2([0,1]^d)$. As a consequence, the generative model for VANO takes values in $U = H^s([0,1]^d)$ if and only if s < −d/2; in particular, generated draws are not in $L^2([0,1]^d)$. Unlike our approach, VANO aims to maximise an extension of the ELBO (15b), in which a regularisation parameter β > 0 is chosen as a hyperparameter, and the ELBO takes the form

$$\mathrm{ELBO}^{\mathrm{VANO}}_\beta(u;\theta,\psi) = \mathbb{E}_{z\sim\mathbb{Q}^{\theta}_{z|u}}\Bigl[\log\frac{\mathrm{d}\mathbb{P}^{\psi}_{u|z}}{\mathrm{d}\mathbb{P}_\eta}(u)\Bigr] - \beta\,D_{\mathrm{KL}}\bigl(\mathbb{Q}^{\theta}_{z|u}\,\big\|\,\mathbb{P}_z\bigr) = \mathbb{E}_{z\sim\mathbb{Q}^{\theta}_{z|u}}\Bigl[\bigl\langle g(z;\psi), u\bigr\rangle_{L^2} - \tfrac{1}{2}\bigl\|g(z;\psi)\bigr\|_{L^2}^2\Bigr] - \beta\,D_{\mathrm{KL}}\bigl(\mathbb{Q}^{\theta}_{z|u}\,\big\|\,\mathbb{P}_z\bigr),$$

where the second equality comes from the Cameron–Martin theorem, as in Example 2. Maximising $\mathrm{ELBO}^{\mathrm{VANO}}_\beta$ with β = 1 is precisely equivalent to minimising the per-sample loss (29) from Example 2, so naive application of $\mathrm{ELBO}^{\mathrm{VANO}}_\beta$ will result in the same issues seen there. In particular, discretisations of the ELBO may diverge as resolution is refined; moreover, for data with L²-regularity, the generative model $\mathbb{P}^{\psi}_u$ is greatly misspecified, with draws $g(z;\psi) + \eta$, $z\sim\mathbb{P}_z$, $\eta\sim\mathbb{P}_\eta$, lying in a Sobolev space of lower regularity than the data. This issue is obscured by the convention in the VAE literature of considering only the decoder mean g(z; ψ); considering the full generative model with draws g(z; ψ) + η reveals the incompatibility more clearly. We argue that the empirical success of VANO in autoencoding arises because the objective can be seen as that of a regularised autoencoder (Remark 22).

Along with the differences in the training objective and its interpretation, FVAE differs greatly from VANO in architecture. While the VANO decoders can be discretised on any mesh, and our decoder closely resembles VANO's nonlinear decoder, its encoders assume a fixed mesh for training and inference. In contrast, our encoder can be discretised on any mesh, enabling many of our contributions, such as masked training, inpainting, and superresolution, which are not possible within VANO.

Generative Models on Function Space. Aside from VANO, recent years have seen significant interest in the development of generative models on function space. Several extensions of score-based (Song et al., 2021) and denoising diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) to function space have been proposed (e.g., Pidstrigach et al., 2023; Hagemann et al., 2023; Lim et al., 2023; Kerrigan et al., 2023; Franzese et al., 2023; Zhang and Wonka, 2024). Rahman et al. (2022) propose the generative adversarial neural operator (GANO), extending Wasserstein generative adversarial networks (Arjovsky et al., 2017) to function space, with FNOs in the generator and discriminator to achieve resolution-invariance.

Variational Inference on Function Space. In machine learning, variational inference on function space also arises in the context of Bayesian neural networks (Sun et al., 2019; Burt et al., 2021; Cinquin and Bamler, 2024). In this setting one wishes to minimise the KL divergence between the posterior in function space and a computationally tractable approximation; but, as in our study, this divergence may be infinite owing to a lack of absolute continuity between the two distributions.

Learning on Point Clouds. Our architecture takes inspiration from the literature on machine learning on point clouds, where data are viewed as sets of points with arbitrary cardinality.
Several models for autoencoding and generative modelling with point clouds have been proposed, such as energy-based processes (Yang et al., 2020) and SetVAE (Kim et al., 2021); our work differs by defining a loss in function space, ensuring that our model converges to a continuum limit as the mesh is refined. Continuum limits of semisupervised algorithms for graphs and point clouds have also been studied (e.g., Dunlop et al., 2020).

6. Conclusion

Our study of autoencoders on function space has led to FVAE, an extension of VAEs which imposes stringent requirements on the data distribution in infinite dimensions but benefits from firm probabilistic foundations; it has also led to the non-probabilistic FAE, a regularised autoencoder which can be applied much more broadly to functional data.

Benefits. Both FVAE and FAE offer significant benefits when working with functional data, such as enabling training with data across resolutions, inpainting, superresolution, and generative modelling. These benefits are possible only through our pairing of a well-defined objective in function space with mesh-invariant encoder and decoder architectures.

Limitations. FVAE can be applied only when the generative model is sufficiently compatible with the data distribution, a condition that is difficult to satisfy in infinite dimensions and that restricts the applicability of FVAE to specific problem classes. FAE overcomes this restriction, but does not share the probabilistic foundations of FVAE. The desire to discretise the encoder and decoder on arbitrary meshes rules out many high-performing grid-based architectures, including convolutional networks and FNOs. We believe this is a limiting factor in the numerical experiments, and that combining our work with more complex operator architectures (e.g., Kovachki et al., 2023) or continuum extensions of point-cloud CNNs (Li et al., 2018) would yield further improvements.

Future Work. Our work gives new methods for nonlinear dimension reduction in function space, and we expect there to be benefit in building operator-learning methods that make use of the resulting latent space, in the spirit of PCA-Net (Bhattacharya et al., 2021). For FAE, which unlike FVAE is not inherently a generative model, we expect particular benefit in the use of more sophisticated generative models on the latent space, for example diffusion models, analogous to Stable Diffusion (Rombach et al., 2022). While our focus has been on scientific problems with synthetic data, our methods could also be applied to real-world data, for example in computer vision; for these challenging data sets, further research on improved mesh-invariant architectures will be vital. Our study has also focussed on the typical machine-learning setting of a fixed data set of size N; research into the behaviour of FVAE and FAE in the infinite-data limit using tools from statistical learning theory would also be of interest.

Acknowledgments

JB is supported by Splunk Inc. MG is supported by a Royal Academy of Engineering Research Chair, and Engineering and Physical Sciences Research Council (EPSRC) grants EP/T000414/1, EP/W005816/1, EP/V056441/1, EP/V056522/1, EP/R018413/2, EP/R034710/1, and EP/R004889/1. HL is supported by the Warwick Mathematics Institute Centre for Doctoral Training and gratefully acknowledges funding from the University of Warwick and the EPSRC (grant EP/W524645/1).
AMS is supported by a Department of Defense Vannevar Bush Faculty Fellowship and by the SciAI Center, funded by the Office of Naval Research (ONR) under grant N00014-23-1-2729. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising.

Appendix A. Supporting Results

In the following proof we use the fact that the norm of the Sobolev space $H^s([0,1])$ can be written as a weighted sum of frequencies (see Krein and Petunin, 1966, Sec. 9):

$$\|u\|^2_{H^s([0,1])} = \sum_{j\in\mathbb{N}} \bigl(1+j^2\bigr)^s |\alpha_j|^2, \qquad u = \sum_{j\in\mathbb{N}} \alpha_j e_j, \qquad e_j(x) = \sqrt{2}\sin(\pi j x). \tag{36}$$

Proof of Proposition 15 Let η be an L²-white noise, let $h = \sum_{j\in\mathbb{N}} h_j e_j \in L^2([0,1])$, and note that $\mathbb{P}_\eta(\,\cdot\, - h)$ is the distribution of the random variable η + h. Thus, writing out the $H^s$-norm using the Karhunen–Loève expansion of η, we see that

$$\|\eta + h\|^2_{H^s([0,1])} = \sum_{j\in\mathbb{N}} \bigl(1+j^2\bigr)^s \bigl(\xi_j + h_j\bigr)^2. \tag{37}$$

First, we show that $\|\eta+h\|_{H^s([0,1])} < \infty$ almost surely when s < −1/2. To do this we apply the Kolmogorov two-series theorem (Durrett, 2019, Theorem 2.5.6), which states that the random series (37) converges almost surely if

$$\sum_{j\in\mathbb{N}} \bigl(1+j^2\bigr)^s\, \mathbb{E}\bigl[(\xi_j + h_j)^2\bigr] < \infty \quad\text{and}\quad \sum_{j\in\mathbb{N}} \bigl(1+j^2\bigr)^{2s}\, \mathrm{Var}\bigl((\xi_j+h_j)^2\bigr) < \infty.$$

But, since $\xi_j \sim N(0,1)$, we know that $\mathbb{E}[\xi_j^2] = 1$ and $\mathrm{Var}(\xi_j^2) = 2$; applying this, the elementary inequality $(x+y)^2 \leq 2x^2 + 2y^2$ for x, y ∈ ℝ, and the fact that $\sum_{j\in\mathbb{N}} j^{\alpha} < \infty$ for α < −1 shows that the two series are finite.

To see that $\mathbb{P}_\eta(\,\cdot\, - h)$ assigns zero probability to $L^2([0,1])$, suppose for contradiction that η + h had finite L²-norm; then η would also have finite L²-norm. But, as a consequence of the Borel–Cantelli lemma,

$$\|\eta\|^2_{L^2([0,1])} = \sum_{j\in\mathbb{N}} \xi_j^2 = \infty \quad\text{almost surely},$$

because the summands are independent and identically distributed, and thus, for any constant c > 0, infinitely many summands exceed c with probability one. ∎

Lemma 25 Suppose that $U = C_0([0,T];\mathbb{R}^m)$ and that $\mu\in\mathcal{P}(U)$ and $\nu\in\mathcal{P}(U)$ are the laws of the $\mathbb{R}^m$-valued diffusions

$$\mathrm{d}u_t = b(u_t)\,\mathrm{d}t + \sqrt{\varepsilon}\,\mathrm{d}w_t, \quad u_0 = 0, \quad t\in[0,T],$$
$$\mathrm{d}v_t = c(v_t)\,\mathrm{d}t + \sqrt{\varepsilon}\,\mathrm{d}w_t, \quad v_0 = 0, \quad t\in[0,T],$$

where $(w_t)_{t\in[0,T]}$ is a Brownian motion on $\mathbb{R}^m$. Suppose that the Novikov condition (19) holds for both processes. Then

$$D_{\mathrm{KL}}(\mu\,\|\,\nu) = \mathbb{E}_{u\sim\mu}\Bigl[\frac{1}{2\varepsilon}\int_0^T \bigl\|b(u_t) - c(u_t)\bigr\|_2^2\,\mathrm{d}t\Bigr].$$

Proof Applying the Girsanov formula (20) to obtain the density dμ/dν, taking logarithms to evaluate $D_{\mathrm{KL}}(\mu\|\nu)$, and noting that under μ we have $\mathrm{d}u_t = b(u_t)\,\mathrm{d}t + \sqrt{\varepsilon}\,\mathrm{d}w_t$, we obtain

$$D_{\mathrm{KL}}(\mu\,\|\,\nu) = \mathbb{E}_{u\sim\mu}\Bigl[\frac{1}{2\varepsilon}\int_0^T \bigl\|b(u_t)-c(u_t)\bigr\|_2^2\,\mathrm{d}t + \frac{1}{\sqrt{\varepsilon}}\int_0^T \bigl\langle b(u_t)-c(u_t),\,\mathrm{d}w_t\bigr\rangle\Bigr].$$

Under μ, the process $(w_t)_{t\in[0,T]}$ is a Brownian motion and so the second expectation is zero. ∎

Appendix B. Experimental Details

In this section we provide additional details, training configurations, samples, and analysis for the numerical experiments in Section 2.5 and Section 4.3. All experiments were run on a single NVIDIA GeForce RTX 4090 GPU with 24 GB of VRAM.

B.1 Base Architecture

We use the common architecture described in Section 2.4 and Section 4.2 for all experiments, using the Adam optimiser (Kingma and Ba, 2015) with the default hyperparameters ε, β₁, and β₂; we specify the learning rate and learning-rate decay schedule for each experiment in what follows.

Positional Encodings. Where specified, both the encoder and decoder make use of Gaussian random Fourier features (Tancik et al., 2020), pairing the query coordinate $x\in\Omega\subset\mathbb{R}^d$ with a positional encoding $\gamma(x)\in\mathbb{R}^{2k}$. To generate these encodings, a matrix $B\in\mathbb{R}^{k\times d}$ with independent N(0, I) entries is sampled and viewed as a hyperparameter of the model, to be used in both the encoder and decoder. The positional encoding γ(x) is then given by the concatenated vector

$$\gamma(x) = \bigl[\cos(2\pi Bx);\; \sin(2\pi Bx)\bigr]^{\mathsf{T}} \in \mathbb{R}^{2k},$$

where the sine and cosine functions are applied componentwise to the vector 2πBx.
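A minimal sketch of this encoding:

```python
import torch

def fourier_features(x, B):
    """γ(x) = [cos(2πBx); sin(2πBx)]: map coordinates x ∈ R^d to R^{2k}
    using a fixed matrix B ∈ R^{k×d} with independent N(0, I) entries."""
    proj = 2.0 * torch.pi * x @ B.T                  # (batch, k)
    return torch.cat([torch.cos(proj), torch.sin(proj)], dim=-1)

# B is sampled once and then kept fixed as a hyperparameter of the model.
B = torch.randn(16, 2)                               # k = 16 features, d = 2
gamma = fourier_features(torch.rand(8, 2), B)        # shape (8, 32)
```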
B.2 Brownian Dynamics

The training data consist of 8,192 samples from the path distribution Υ of the SDE (27)–(28) on the time interval [0, T], T = 5. Trajectories are generated using the Euler–Maruyama scheme with internal time step 1/8,192 (unrelated to the choice to take 8,192 training samples), and the resulting paths are then subsampled by a factor of 80 to obtain the training data. Thus the data have effective time increment 5/512; moreover, the path information is removed at 50% of the points resulting from these time increments, chosen uniformly at random.

Experimental Setup. We train for 100,000 steps with initial learning rate 10⁻³ and an exponential decay of 0.98 applied every 1,000 steps, with batch size 32 and 4 Monte Carlo samples for $\mathbb{Q}^{\theta}_{z|u}$. We use latent dimension $d_Z = 1$, β = 1.2, and λ = 10. The three sets of simulations shown in Figure 2 use κ = 0, 25, and 10,000 respectively.

B.3 Estimation of Markov State Models

Data and Discretisation. To validate the ability of FVAE to model higher-dimensional SDE trajectories, we specify a simple potential with qualitative features similar to those of the complex potential surfaces arising in molecular dynamics. To this end, define the centres $c_1 = (0, 0)$, $c_2 = (0.2, 0.2)$, $c_3 = (-0.2, 0.2)$, $c_4 = (0.2, -0.2)$, $c_5 = (0, -0.2)$, and $c_6 = (-0.2, 0)$; standard deviations σ₁ = σ₂ = σ₃ = σ₄ = 0.1 and σ₅ = σ₆ = 0.03; and masses m₁ = m₂ = m₃ = m₄ = 0.1 and m₅ = m₆ = 0.01. Then let

$$U(x) = 0.5\,(x_1 + x_2) + x_1^2 + x_2^2 - \sum_{i=1}^{6} m_i\, N\bigl(x;\, c_i,\, \sigma_i^2 I_2\bigr).$$

This potential has three key components: a linear term breaking the symmetry; a quadratic term preventing paths from veering too far from the path's starting point, the origin; and negative Gaussian densities serving as potential wells positioned at the centres $c_i$ (Figure 12).

Figure 12: Surface and heatmap of the potential function $U\colon\mathbb{R}^2\to\mathbb{R}$ for Section 2.5.2.

Sample paths of (27) with initial condition u₀ = 0, temperature ε = 0.1, and final time T = 3 show significant diversity, with many paths transitioning at least once between different wells (see the ground truth in Figure 13).

Experimental Setup. The training set consists of 16,384 paths generated with an Euler–Maruyama scheme with internal time step 1/8,192, subsampled by a factor of 48 to obtain an equally spaced mesh of 513 points. We take $d_Z = 16$, β = 10, κ = 50, and λ = 50, and, as in Appendix B.2, train on data where 50% of the points on the path are missing. We also use the same learning rate, learning-rate decay schedule, step limit, and batch size.

Results. FVAE's reconstructions closely match the inputs (Figure 13), and FVAE produces convincing generative samples capturing qualitative features of the data (Figure 14).

Figure 13: Held-out ground-truth data from the SDE in Section 2.5.2 ("True" row) and the corresponding FVAE reconstructions of sample paths ("Reconstructed" row).

Figure 14: (a) Samples of the SDE in Section 2.5.2 drawn from the FVAE generative model with randomly drawn latent vector z ∼ $\mathbb{P}_z$. (b) Ground-truth paths of the SDE in Section 2.5.2 generated using an Euler–Maruyama solver. In both subfigures, the evolution through time t ∈ [0, 3] is depicted as a transition in colour from blue to green.

B.4 Dirac Distributions

Data and Discretisation. We view Υ as a probability distribution on $U = H^{-1}([0,1])$. At each resolution I, we discretise the domain [0, 1] using an evenly spaced mesh of points $\{i/(I+1)\}_{i=1,\dots,I}$ and approximate the Dirac mass $\delta_\xi$, ξ ∈ [0, 1], by the optimal L¹-approximation: a discretised function which is zero except at the mesh point closest to ξ, normalised to have unit L¹-norm. The training data set consists of discretised Dirac functions at each mesh point; the goal is not to train a practical model for generalisation, but to isolate the effect of the objective.
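A minimal sketch of this construction; with mesh weight 1/(I+1), the nonzero value I+1 gives unit discrete L¹-norm.

```python
import numpy as np

def discretised_dirac(xi, I):
    """Optimal L^1-approximation of δ_ξ on the mesh {i/(I+1)}, i = 1..I:
    zero except at the mesh point closest to ξ, normalised so that
    Σ_i |u_i| / (I+1) = 1."""
    x = np.arange(1, I + 1) / (I + 1)
    u = np.zeros(I)
    u[np.argmin(np.abs(x - xi))] = I + 1.0
    return x, u
```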
Experimental Setup. We train FVAE and FAE models at resolutions I ∈ {8, 16, 32, 64, 128}. For each model, we perform 50 independent runs of 30,000 steps with batch size 6.

Architecture. The neural network $\rho\colon\mathbb{R}\times\Theta\to\mathbb{R}\times\mathbb{R}$ in the encoder map f is assumed to have 3 hidden layers of width 128, and the mean µ(z; ψ) and standard deviation σ(z; ψ) in the decoder are computed from a 3-layer neural network of width 128. For numerical stability, we impose a lower bound on σ based on the mesh spacing Δx, given by $\sigma_{\min}(\Delta x) = (2\pi)^{-1/2}\,\Delta x$.

FVAE Configuration. We view data u ∼ Υ as lying in the Sobolev space $U = H^{-1}([0,1])$; the decoder g outputs functions in $L^2([0,1])$ and we take the decoder-noise distribution $\mathbb{P}_\eta = N(0, I)$, noting that $\mathbb{P}_\eta\in\mathcal{P}(H^s([0,1]))$ if and only if s < −1/2; in particular, white-noise samples do not lie in the space $L^2([0,1])$. We modify the per-sample loss (7b) by reweighting the term $D_{\mathrm{KL}}(\mathbb{Q}^{\theta}_{z|u}\|\mathbb{P}_z)$ by β = 10⁻⁴, and take 16 Monte Carlo samples for $\mathbb{Q}^{\theta}_{z|u}$. We use an initial learning rate of 10⁻⁴, decaying exponentially by a factor of 0.7 every 1,000 steps.

FAE Configuration. To compute the H⁻¹-norm we truncate the series expansion (36) and compute the coefficients $\alpha_j$ from a discretisation of u using the discrete sine transform (sketched below). We use initial learning rate 10⁻⁴, decaying exponentially by a factor of 0.9 every 1,000 steps, and take β = 10⁻¹². For consistency with the FVAE loss, we subtract the squared data norm $\tfrac{1}{2}\|u\|^2_{H^{-1}}$ from the FAE loss, yielding the expression

$$\tfrac{1}{2}\bigl\|g(f(u;\theta);\psi) - u\bigr\|_{H^{-1}}^2 - \tfrac{1}{2}\|u\|_{H^{-1}}^2 = \tfrac{1}{2}\bigl\|g(f(u;\theta);\psi)\bigr\|_{H^{-1}}^2 - \bigl\langle g(f(u;\theta);\psi),\, u\bigr\rangle_{H^{-1}}.$$
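The following sketch illustrates how the truncated $H^s$-norm might be computed with the type-I discrete sine transform; the quadrature normalisation of the coefficients is an assumption of this sketch.

```python
import numpy as np
from scipy.fft import dst

def h_norm_sq(u, s=-1.0):
    """Squared H^s([0,1])-norm of u sampled at x_i = i/(I+1), i = 1..I,
    via the sine-series expansion (36). The coefficients alpha_j ≈
    <u, e_j>_{L^2}, with e_j(x) = sqrt(2) sin(pi j x), are approximated
    by a normalised sum over the mesh using the type-I DST."""
    I = u.shape[-1]
    j = np.arange(1, I + 1)
    alpha = np.sqrt(2.0) * dst(u, type=1) / (2.0 * (I + 1))
    return np.sum((1.0 + j**2) ** s * alpha**2, axis=-1)
```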
Results. As expected, the final training loss under both models decreases as the resolution is refined, since the lower bound $\sigma_{\min}$ decreases. However, the FAE loss appears to converge and is stable across runs, while the FVAE loss appears to diverge and becomes increasingly unstable across runs. This gives convincing empirical evidence that the joint divergence (5) for FVAE is not defined, as a result of the misspecified decoder noise; the use of FAE with an appropriate data norm alleviates this issue. Since the FVAE objective with $\mathbb{P}_\eta = N(0, I)$ coincides with the VANO objective, this issue would also be present for VANO. Under both models, training becomes increasingly unstable at high resolutions: when σ is small, the loss becomes highly sensitive to changes in µ; this instability is unrelated to the divergence of the FVAE training loss and is a consequence of training through gradient descent.

B.5 Incompressible Navier–Stokes Equations

Table 2: Details of the Navier–Stokes data sets.

    Viscosity ν    Resolution    Train Samples    Eval. Samples    Snapshot Time T
    10⁻³           64 × 64       4,000            1,000            50
    10⁻⁴           64 × 64       8,000            2,000            50
    10⁻⁵           64 × 64       960              240              20

Data and Discretisation. We use data as provided online by Li et al. (2021). Solutions of (32) are generated by sampling the initial condition from the Gaussian random field N(0, C), $C = 7^{3/2}(49I - \Delta)^{-5/2}$, and evolving in time using a pseudospectral method. While the data of Li et al. (2021) include the full time evolution, we use only snapshots of the vorticity at the final time. Every snapshot is a 64 × 64 image, normalised to take values in [0, 1]; details of this data set are given in Table 2.

Effects of Point Ratios. Here we extend the analysis of Figure 7(b) to understand how the point ratio used during training affects reconstruction performance. We first train two FAE models on the Navier–Stokes data set with viscosity ν = 10⁻⁴, using complement masking with a point ratio $r_{\mathrm{enc}}$ of 10% and 90% respectively. Then, we fix an arbitrary sample from the held-out set and, for each model, generate 1,000 distinct masks with point ratios 10%, 30%, 50%, 70%, and 90%. We then encode on each mesh, decode on the full grid, and compute kernel density estimates of the reconstruction MSE (Figure 15). The model trained with $r_{\mathrm{enc}} = 10\%$ is much more sensitive to the location of the evaluation mesh points, especially when the evaluation point ratio is low; with a sufficiently high encoder point ratio at evaluation time, however, the reconstruction MSE of the model trained using $r_{\mathrm{enc}} = 10\%$ drops below that of the model trained at $r_{\mathrm{enc}} = 90\%$. This suggests a tradeoff whereby a higher training point ratio provides more stability, at the cost of increased autoencoding MSE, particularly when the point ratio of the evaluation data is high. We hypothesise that a lower training ratio regularises the model to attain a more robust internal representation.

Figure 15: Kernel density estimates of full-grid reconstruction MSE on the reference sample across 1,000 randomly chosen meshes with evaluation point ratios 10%, 30%, 50%, 70%, and 90%, (a) for a model trained with point ratio 10% and (b) for a model trained with point ratio 90%. Training with a low point ratio regularises, reducing MSE when the evaluation data has a high point ratio, but at the cost of greater variance when evaluating on low point ratios.

We also investigate the sensitivity of the models to a specific encoder mesh, seeking to understand whether an encoder mesh achieving low MSE on one image leads to low MSE on other images. The procedure is as follows: we select an image arbitrarily from the held-out set (the reference sample) and draw 1,000 random meshes with point ratio 10%; then, we select the mesh resulting in the lowest reconstruction MSE for each of the two models. For the nearest neighbours of the chosen sample in the held-out set, the reconstruction error on this MSE-minimising mesh is lower than average (Figure 16(a); dashed lines), suggesting that a good configuration will yield good results on similar samples. On the other hand, using the MSE-minimising mesh on arbitrary samples from the held-out set yields an MSE somewhat lower than a randomly chosen mesh; unsurprisingly, however, the arbitrarily chosen samples appear to benefit less than the nearest neighbours (Figure 16(b)).

Experimental Setup. We train for 50,000 steps with batch size 32 and initial learning rate 10⁻³, decaying exponentially by a factor of 0.98 every 1,000 steps. We use complement masking with $r_{\mathrm{enc}} = 0.3$, providing a good balance of performance and robustness to masking.
Architecture. Both the CNN and FAE architectures use Gaussian random positional encodings with k = 16. For the sake of comparison, we use a standard CNN architecture inspired by the VGG model (Simonyan and Zisserman, 2015), gradually contracting/expanding the feature map while increasing/decreasing the channel dimensions. The architecture we use was identified by a search over parameters such as the network depth, while maintaining a parameter count similar to that of our baseline FAE model. The encoder consists of four CNN layers with output channel dimensions 4, 4, 8, and 16 and kernel sizes 2, 2, 4, and 4 respectively, all with stride 2. The result is flattened and passed through a single-hidden-layer MLP of width 64 to obtain a vector of dimension 64. The decoder consists of a single-layer MLP of width 64 and output dimension 512, which is then rearranged to a 4 × 4 feature map with channel size 32. This feature map is then passed through four layers of transposed convolutions mapping to 16, 8, 4, and 4 channel dimensions, with kernel sizes 4, 4, 2, and 2 respectively, and stride 2. The result is then mapped by two CNN layers with kernel size 3, stride 1, and output channel dimensions 8 and 1 respectively.

Figure 16: Kernel density estimates of full-grid reconstruction MSE for models trained at 10% (Low) and 90% (High) point ratios, (a) on the nearest neighbours of the reference sample in the held-out set and (b) on arbitrarily chosen samples from the held-out set. Dashed lines indicate the MSE obtained using the mesh minimising MSE on the reference sample.

Uncurated Reconstructions and Samples. Reconstructions of randomly selected data from the held-out sets for viscosities ν = 10⁻³, 10⁻⁴, and 10⁻⁵ are provided in Figures 17, 18, and 19 respectively. As described in Section 4.3.2, we apply FAE as a generative model by fitting a Gaussian mixture with 10 components on the latent space. Samples from models trained at ν = 10⁻³, 10⁻⁴, and 10⁻⁵ are shown in Figures 20, 21, and 22 respectively.

Figure 17: FAE reconstructions of Navier–Stokes data with viscosity ν = 10⁻³.

Figure 18: FAE reconstructions of Navier–Stokes data with viscosity ν = 10⁻⁴.

Figure 19: FAE reconstructions of Navier–Stokes data with viscosity ν = 10⁻⁵.

Figure 20: Samples from the FAE generative model and from the data set for Navier–Stokes data with viscosity ν = 10⁻³.

Figure 21: Samples from the FAE generative model and from the data set for Navier–Stokes data with viscosity ν = 10⁻⁴.

Figure 22: Samples from the FAE generative model and from the data set for Navier–Stokes data with viscosity ν = 10⁻⁵.

Evaluation at Very High Resolutions. In Figure 9, we demonstrate zero-shot superresolution by evaluating the decoder on grids of resolution 2,048 × 2,048 and 32,768 × 32,768. While the former requires approximately 16 MB to store using 32-bit floating-point numbers, the latter requires 4.3 GB, and thus applying a neural network directly to the 32,768 × 32,768 image is likely to exhaust GPU memory. To allow evaluation of the decoder at this resolution, we partition the domain into 1,000 chunks and evaluate the decoder on each chunk in turn; we then reassemble the resulting data in RAM. To ensure that each chunk has an integer number of points, we take the first 824 chunks to contain 1,073,742 mesh points (≈ 4 MB), and take the remaining 176 chunks to contain 1,073,741 points.
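A minimal sketch of this chunked evaluation, assuming a hypothetical `decoder(z, coords)` callable for the coordinate-MLP decoder and a CUDA device:

```python
import torch

@torch.no_grad()
def decode_in_chunks(decoder, z, coords, n_chunks=1000):
    """Evaluate g(z; ψ) on a very fine mesh by splitting the query
    coordinates into chunks, decoding each chunk on the GPU, and
    reassembling the outputs in CPU memory."""
    outputs = []
    for chunk in torch.chunk(coords, n_chunks, dim=0):
        outputs.append(decoder(z, chunk.cuda()).cpu())
    return torch.cat(outputs, dim=0)
```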
B.6 Darcy Flow

Data Set. The data we use are based on those provided online by Li et al. (2021), given on a 421 × 421 grid and generated through a finite-difference scheme. Where described, we downsample these data to lower resolutions by applying a low-pass filter in Fourier space and subsampling the resulting image. The low-pass filter is a mollification of an ideal sinc filter with bandwidth selected to eliminate frequencies beyond the Nyquist frequency of the target resolution, computed by convolving the ideal filter in Fourier space with a Gaussian kernel with standard deviation σ = 0.1, truncated to a 7 × 7 convolutional filter.

Experimental Setup. We follow the same setup used for the Navier–Stokes data set: we train for 50,000 steps, with batch size 32 and complement masking with $r_{\mathrm{enc}} = 30\%$. An initial learning rate of 10⁻³ is used, with an exponential decay factor of 0.98 applied every 1,000 steps. We make use of positional embeddings (Appendix B.1) with k = 16 Gaussian random Fourier features. When performing the wall-clock training-time experiment (Figure 10), we downsample the training and evaluation data to resolution 211 × 211.

Uncurated Reconstructions and Samples. Reconstructions of randomly selected examples from the held-out evaluation data set are shown in Figure 23. Samples from the FAE generative model and draws from the evaluation data set are shown in Figure 24.

Figure 23: FAE reconstructions of Darcy flow data.

Figure 24: Samples of Darcy flow data.

References

The Fifth International Conference on Learning Representations (ICLR 2017), 2017.

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. IEEE.

The Ninth International Conference on Learning Representations (ICLR 2021), 2021.

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022. IEEE.

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning (ICML 2017), volume 70 of Proceedings of Machine Learning Research, pages 214–223, 2017. URL https://proceedings.mlr.press/v70/arjovsky17a.html. arXiv:1701.07875.

K. Azizzadenesheli, N. Kovachki, Z. Li, M. Liu-Schiaffini, J. Kossaifi, and A. Anandkumar. Neural operators for accelerating scientific simulations and design. Nat. Rev. Phys., 6:320–328, 2024. doi:10.1038/s42254-024-00712-5.

E. Bach, R. Baptista, D. Sanz-Alonso, and A. Stuart. Inverse problems and data assimilation: A machine learning approach, 2024. arXiv:2410.10523.

J. Bengio and Y. LeCun, editors. The Third International Conference on Learning Representations (ICLR 2015), 2015.

K. Bhattacharya, B. Hosseini, N. B. Kovachki, and A. M. Stuart. Model reduction and neural networks for parametric PDEs. SMAI J. Comput. Math., 7:121–157, 2021. doi:10.5802/smai-jcm.74.

T. Bischoff and K. Deck. Unpaired downscaling of fluid flows with diffusion bridges. Artif. Intell. Earth Syst., 3:e230039, 22pp., 2024. doi:10.1175/AIES-D-23-0039.1.

C. M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, 2006. ISBN 978-0-387-31073-2.

V. I. Bogachev.
References

The Fifth International Conference on Learning Representations (ICLR 2017), 2017.

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. IEEE.

The Ninth International Conference on Learning Representations (ICLR 2021), 2021.

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022. IEEE.

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning (ICML 2017), volume 70 of Proceedings of Machine Learning Research, pages 214–223, 2017. URL https://proceedings.mlr.press/v70/arjovsky17a.html. arXiv:1701.07875.

K. Azizzadenesheli, N. Kovachki, Z. Li, M. Liu-Schiaffini, J. Kossaifi, and A. Anandkumar. Neural operators for accelerating scientific simulations and design. Nat. Rev. Phys., 6:320–328, 2024. doi:10.1038/s42254-024-00712-5.

E. Bach, R. Baptista, D. Sanz-Alonso, and A. Stuart. Inverse problems and data assimilation: A machine learning approach, 2024. arXiv:2410.10523.

Y. Bengio and Y. LeCun, editors. The Third International Conference on Learning Representations (ICLR 2015), 2015.

K. Bhattacharya, B. Hosseini, N. B. Kovachki, and A. M. Stuart. Model reduction and neural networks for parametric PDEs. SMAI J. Comput. Math., 7:121–157, 2021. doi:10.5802/smai-jcm.74.

T. Bischoff and K. Deck. Unpaired downscaling of fluid flows with diffusion bridges. Artif. Intell. Earth Syst., 3:e230039, 22pp., 2024. doi:10.1175/AIES-D-23-0039.1.

C. M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, 2006. ISBN 978-0-387-31073-2.

V. I. Bogachev. Gaussian Measures, volume 62 of Mathematical Surveys and Monographs. American Mathematical Society, 1998. doi:10.1090/surv/062.

K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinform., 22(14):e49–e57, 2006. doi:10.1093/bioinformatics/btl242.

D. R. Burt, S. W. Ober, A. Garriga-Alonso, and M. van der Wilk. Understanding variational inference in function-space. In 3rd Symposium on Advances in Approximate Bayesian Inference, 2021. arXiv:2011.09421.

E. Calvello, N. B. Kovachki, M. E. Levine, and A. M. Stuart. Continuum attention for neural operators, 2024. arXiv:2406.06486.

G. J. Chandler and R. R. Kerswell. Invariant recurrent solutions embedded in a turbulent two-dimensional Kolmogorov flow. J. Fluid Mech., 722:554–595, 2013. doi:10.1017/jfm.2013.122.

J. T. Chang and D. Pollard. Conditioning as disintegration. Stat. Neerl., 51(3):287–317, 1997. doi:10.1111/1467-9574.00056.

T. Chen and H. Chen. Approximations of continuous functionals by neural networks with application to dynamic systems. IEEE Trans. Neural Netw., 4(6):910–918, 1993. doi:10.1109/72.286886.

Y. Chen, S. Liu, and X. Wang. Learning continuous image representation with local implicit image function. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) CVP (2021), pages 8624–8634. doi:10.1109/CVPR46437.2021.00852.

Y. Chen, D. Z. Huang, J. Huang, S. Reich, and A. M. Stuart. Sampling via gradient flows in the space of probability measures, 2023. arXiv:2310.03597.

T. Cinquin and R. Bamler. Regularized KL-divergence for well-defined function-space variational inference in Bayesian neural networks, 2024. arXiv:2406.04317.

S. L. Cotter, M. Dashti, and A. M. Stuart. Approximation of Bayesian inverse problems for PDEs. SIAM J. Numer. Anal., 48(1):322–345, 2010. doi:10.1137/090770734.

S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White. MCMC methods for functions: Modifying old algorithms to make them faster. Stat. Sci., 28(3), 2013. doi:10.1214/13-STS421.

T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, Hoboken, NJ, second edition, 2006. doi:10.1002/047174882X.

W. M. Czarnecki, S. Osindero, M. Jaderberg, G. Swirszcz, and R. Pascanu. Sobolev training for neural networks. In Guyon et al. (2017), pages 4278–4287. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/758a06618c69880a6cee5314ee42d52f-Paper.pdf. arXiv:1706.04859.

M. Dashti and A. M. Stuart. The Bayesian approach to inverse problems. In Handbook of Uncertainty Quantification. Vol. 1, 2, 3, chapter 7, pages 311–428. Springer, Cham, 2017. doi:10.1007/978-3-319-12385-1_7.

M. V. de Hoop, D. Z. Huang, E. Qian, and A. M. Stuart. The cost-accuracy trade-off in operator learning with neural networks. J. Mach. Learn., 1(3):299–341, 2022. doi:10.4208/jml.220509.

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, MN, 2019. Association for Computational Linguistics. doi:10.18653/v1/N19-1423.

M. M. Dunlop, D. Slepčev, A. M. Stuart, and M. Thorpe. Large data and zero noise limits of graph-based semi-supervised learning algorithms. Appl. Comput. Harmon. Anal., 49(2):655–697, 2020. doi:10.1016/j.acha.2019.03.005.
R. Durrett. Probability: Theory and Examples. Number 49 in Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, fifth edition, 2019. doi:10.1017/9781108591034.

W. E, W. Ren, and E. Vanden-Eijnden. Minimum action method for the study of rare events. Comm. Pure Appl. Math., 57(5):637–656, 2004. doi:10.1002/cpa.20005.

H. Edwards and A. Storkey. Towards a neural statistician. In The Fifth International Conference on Learning Representations (ICLR 2017) ICL (2017). arXiv:1606.02185.

L. Evans. Partial Differential Equations, volume 19 of Graduate Studies in Mathematics. American Mathematical Society, second edition, 2010. doi:10.1090/gsm/019.

G. Franzese, G. Corallo, S. Rossi, M. Heinonen, M. Filippone, and P. Michiardi. Continuous-time functional diffusion processes. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 37370–37400. Curran Associates, Inc., 2023.

R. A. Freeze and J. A. Cherry. Groundwater. Prentice-Hall, Englewood Cliffs, NJ, 1979. ISBN 0-13-365312-9.

P. Ghosh, M. S. M. Sajjadi, A. Vergari, M. Black, and B. Schölkopf. From variational to deterministic autoencoders. In The Eighth International Conference on Learning Representations (ICLR 2020), 2020. arXiv:1903.12436.

I. Guyon, U. von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors. Advances in Neural Information Processing Systems, volume 30, 2017. Curran Associates, Inc.

P. Hagemann, L. Ruthotto, G. Steidl, and N. T. Yang. Multilevel diffusion: Infinite dimensional score-based diffusion models for image generation, 2023. arXiv:2303.04772.

M. Hairer, A. Stuart, and J. Voss. Signal processing problems on function space: Bayesian formulation, stochastic PDEs and effective MCMC methods. In The Oxford Handbook of Nonlinear Filtering, pages 833–873. Oxford University Press, Oxford, 2011. ISBN 978-0-19-953290-2.

K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) CVP (2022), pages 15979–15988. doi:10.1109/CVPR52688.2022.01553.

D. Hendrycks and K. Gimpel. Gaussian error linear units (GELUs), 2016. arXiv:1606.08415.

I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In The Fifth International Conference on Learning Representations (ICLR 2017) ICL (2017). URL https://openreview.net/pdf?id=Sy2fzU9gl.

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In Larochelle et al. (2020), pages 6840–6851. arXiv:2006.11239.

D. Z. Huang, N. H. Nelsen, and M. Trautner. An operator learning perspective on parameter-to-observable maps. Found. Data Sci., 7:163–225, 2025. doi:10.3934/fods.2024037.

B. E. Husic and V. S. Pande. Markov state models: From an art to a science. J. Am. Chem. Soc., 140(7):2386–2396, 2018. doi:10.1021/jacs.7b12191.

J. B. Keller. Darcy's law for flow in porous media and the two-space method. In Nonlinear Partial Differential Equations in Engineering and Applied Science, pages 429–443. Routledge, 1980. doi:10.1201/9780203745465-27.

G. Kerrigan, J. Ley, and P. Smyth. Diffusion generative models in infinite dimensions. In F. Ruiz, J. Dy, and J.-W. van de Meent, editors, Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS 2023), volume 206 of Proceedings of Machine Learning Research, pages 9538–9563, 2023. arXiv:2212.00886.
J. Kim, J. Yoo, J. Lee, and S. Hong. SetVAE: Learning hierarchical composition for generative modeling of set-structured data. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) CVP (2021), pages 15054–15063. doi:10.1109/CVPR46437.2021.01481.

D. P. Kingma. Variational inference and deep learning: A new synthesis. PhD thesis, University of Amsterdam, 2017. ISBN 978-94-6299-745-5.

D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In Bengio and LeCun (2015). arXiv:1412.6980.

D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Y. Bengio and Y. LeCun, editors, The Second International Conference on Learning Representations (ICLR 2014), 2014. arXiv:1312.6114.

D. P. Kingma and M. Welling. An introduction to variational autoencoders. Found. Trends Mach. Learn., 12(4):307–392, 2019. doi:10.1561/2200000056.

D. Kochkov, J. A. Smith, A. Alieva, Q. Wang, M. P. Brenner, and S. Hoyer. Machine learning-accelerated computational fluid dynamics. Proc. Nat. Acad. Sci. USA, 118(21):e2101784118, 8pp., 2021. doi:10.1073/pnas.2101784118.

K. A. Konovalov, I. C. Unarta, S. Cao, E. C. Goonetilleke, and X. Huang. Markov state models to study the functional dynamics of proteins in the wake of machine learning. J. Am. Chem. Soc. Au, 1(9):1330–1341, 2021. doi:10.1021/jacsau.1c00254.

N. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anandkumar. Neural operator: Learning maps between function spaces with applications to PDEs. J. Mach. Learn. Res., 23:1–97, 2023. arXiv:2108.08481.

S. G. Krein and Yu. I. Petunin. Scales of Banach spaces. Russ. Math. Surv., 21(2):85–159, 1966. doi:10.1070/RM1966v021n02ABEH004151.

S. Lanthaler, A. M. Stuart, and M. Trautner. Discretization error of Fourier neural operators, 2024. arXiv:2405.02221.

H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors. Advances in Neural Information Processing Systems, volume 33, 2020. Curran Associates, Inc.

S. Lee. Mesh-independent operator learning for partial differential equations. In 2nd AI4Science Workshop at the 39th International Conference on Machine Learning, 2022. URL https://openreview.net/pdf?id=JUtZG8-2vGp.

J. Li, Z. Pei, W. Li, G. Gao, L. Wang, Y. Wang, and T. Zeng. A systematic survey of deep learning-based single-image super-resolution. ACM Comput. Surv., 56(10):1–40, 2024. doi:10.1145/3659100.

Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen. PointCNN: Convolution on X-transformed points. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31, pages 820–830. Curran Associates, Inc., 2018. arXiv:1801.07791.

Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar. Fourier neural operator for parametric partial differential equations. In The Ninth International Conference on Learning Representations (ICLR 2021) ICL (2021). arXiv:2010.08895.

J. H. Lim, N. B. Kovachki, R. Baptista, C. Beckham, K. Azizzadenesheli, J. Kossaifi, V. Voleti, J. Song, K. Kreis, J. Kautz, C. Pal, A. Vahdat, and A. Anandkumar. Score-based diffusion models in function space, 2023. arXiv:2302.07400.
R. S. Liptser and A. N. Shiryaev. Statistics of Random Processes. Springer, Berlin, Heidelberg, second edition, 2001. doi:10.1007/978-3-662-13043-8.

L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nat. Mach. Intell., 3(3):218–229, 2021. doi:10.1038/s42256-021-00302-5.

A. Mardt, L. Pasquali, H. Wu, and F. Noé. VAMPnets for deep learning of molecular kinetics. Nat. Commun., 9(1):5, 11pp., 2018. doi:10.1038/s41467-017-02388-1.

B. Øksendal. Stochastic Differential Equations. Universitext. Springer, Berlin, Heidelberg, sixth edition, 2003. doi:10.1007/978-3-642-14394-6.

B. Peherstorfer. Breaking the Kolmogorov barrier with nonlinear model reduction. Not. Am. Math. Soc., 69(5):725–733, 2022. doi:10.1090/noti2475.

J. Pidstrigach, Y. Marzouk, S. Reich, and S. Wang. Infinite-dimensional diffusion models for function spaces, 2023. arXiv:2302.10130.

M. Prasthofer, T. De Ryck, and S. Mishra. Variable-input deep operator networks, 2022. arXiv:2205.11404.

J.-H. Prinz, H. Wu, M. Sarich, B. Keller, M. Senne, M. Held, J. D. Chodera, C. Schütte, and F. Noé. Markov models of molecular kinetics: Generation and validation. J. Chem. Phys., 134(17):174105, 23pp., 2011. doi:10.1063/1.3565032.

C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 77–85. IEEE, 2017. doi:10.1109/CVPR.2017.16.

W. Quan, J. Chen, Y. Liu, D.-M. Yan, and P. Wonka. Deep learning-based image and video inpainting: A survey. Int. J. Comput. Vis., 132:2364–2400, 2024. doi:10.1007/s11263-023-01977-6.

M. A. Rahman, M. A. Florez, A. Anandkumar, Z. E. Ross, and K. Azizzadenesheli. Generative adversarial neural operators. Transact. Mach. Learn. Res., 2022. arXiv:2205.03017.

J. O. Ramsay and B. W. Silverman, editors. Applied Functional Data Analysis: Methods and Case Studies. Springer Series in Statistics. Springer, 2002. doi:10.1007/b98886.

M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, editors. Advances in Neural Information Processing Systems, volume 34, 2021. Curran Associates, Inc.

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) CVP (2022), pages 10674–10685. doi:10.1109/CVPR52688.2022.01042.

S. Särkkä and A. Solin. Applied Stochastic Differential Equations. Cambridge University Press, first edition, 2019. doi:10.1017/9781108186735.

T. Schlick. Molecular Modeling and Simulation: An Interdisciplinary Guide, volume 21 of Interdisciplinary Applied Mathematics. Springer, New York, second edition, 2010. doi:10.1007/978-1-4419-6351-2.

D. W. Scott. Multivariate Density Estimation. Wiley Series in Probability and Statistics. John Wiley and Sons, second edition, 2015. doi:10.1002/9781118575574.

J. H. Seidman, G. Kissas, P. Perdikaris, and G. J. Pappas. NOMAD: Nonlinear manifold decoders for operator learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 5601–5613. Curran Associates, Inc., 2022. arXiv:2206.03551.

J. H. Seidman, G. Kissas, G. J. Pappas, and P. Perdikaris. Variational autoencoding neural operators. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning (ICML 2023), volume 202 of Proceedings of Machine Learning Research, pages 30491–30522, 2023. arXiv:2302.10351.
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Bengio and LeCun (2015). arXiv:1409.1556.

V. Sitzmann, J. N. P. Martel, A. W. Bergman, D. B. Lindell, and G. Wetzstein. Implicit neural representations with periodic activation functions. In Larochelle et al. (2020), pages 7462–7473. arXiv:2006.09661.

J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), volume 37 of Proceedings of Machine Learning Research, pages 2256–2265, 2015. arXiv:1503.03585.

Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In The Ninth International Conference on Learning Representations (ICLR 2021) ICL (2021). arXiv:2011.13456.

A. M. Stuart. Inverse problems: A Bayesian perspective. Acta Numer., 19:451–559, 2010. doi:10.1017/S0962492910000061.

V. N. Sudakov. Linear sets with quasi-invariant measure. Dokl. Akad. Nauk SSSR, 127:524–525, 1959.

T. J. Sullivan. Introduction to Uncertainty Quantification, volume 63 of Texts in Applied Mathematics. Springer, 2015. doi:10.1007/978-3-319-23395-6.

S. Sun, G. Zhang, J. Shi, and R. Grosse. Functional variational Bayesian neural networks. In The Seventh International Conference on Learning Representations (ICLR 2019), 2019. arXiv:1903.05779.

M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In Larochelle et al. (2020), pages 7537–7547. arXiv:2006.10739.

A. Vahdat, K. Kreis, and J. Kautz. Score-based generative modeling in latent space. In Ranzato et al. (2021), pages 11287–11302. arXiv:2106.05931.

Y. Wang, D. Blei, and J. P. Cunningham. Posterior collapse and latent variable non-identifiability. In Ranzato et al. (2021), pages 5443–5455. arXiv:2301.00537.

M. Yang, B. Dai, H. Dai, and D. Schuurmans. Energy-based processes for exchangeable data. In H. Daumé III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning (ICML 2020), volume 119 of Proceedings of Machine Learning Research, pages 10681–10692, 2020. arXiv:2003.07521.

M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Póczos, R. Salakhutdinov, and A. J. Smola. Deep sets. In Guyon et al. (2017), pages 3391–3401. arXiv:1703.06114.

B. Zhang and P. Wonka. Functional diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4723–4732, 2024. arXiv:2311.15435.