# Continuous-Time Functional Diffusion Processes

Giulio Franzese (EURECOM, France), Giulio Corallo (EURECOM, France), Simone Rossi (Stellantis, France; work done while at EURECOM), Markus Heinonen (Aalto University, Finland), Maurizio Filippone (EURECOM, France), Pietro Michiardi (EURECOM, France)

Abstract. We introduce Functional Diffusion Processes (FDPs), which generalize score-based diffusion models to infinite-dimensional function spaces. FDPs require a new mathematical framework to describe the forward and backward dynamics, and several extensions to derive practical training objectives. These include infinite-dimensional versions of the Girsanov theorem, in order to be able to compute an ELBO, and of the sampling theorem, in order to guarantee that functional evaluations on a countable set of points are equivalent to infinite-dimensional functions. We use FDPs to build a new breed of generative models in function spaces, which do not require specialized network architectures, and which can work with any kind of continuous data. Our results on real data show that FDPs achieve high-quality image generation, using a simple MLP architecture with orders of magnitude fewer parameters than existing diffusion models. Code available here.

1 Introduction

Diffusion models have recently gained a lot of attention both from academia and industry. The seminal work on denoising diffusion (Sohl-Dickstein et al., 2015) has spurred interest in the understanding of such models from several perspectives, ranging from denoising autoencoders (Vincent, 2011) with multiple noise levels (Ho et al., 2020), to variational interpretations (Kingma et al., 2021), annealed (Song & Ermon, 2019) and continuous-time score matching (Song & Ermon, 2020; Song et al., 2021). Several recent extensions of the theory underpinning diffusion models tackle alternatives to Gaussian noise (Bansal et al., 2022; Rissanen et al., 2022), second-order dynamics (Dockhorn et al., 2022), and improved training and sampling (Xiao et al., 2022; Kim et al., 2022b; Franzese et al., 2022). Diffusion models have rapidly become the go-to approach for generative modeling, surpassing GANs (Dhariwal & Nichol, 2021) for image generation, and have recently been applied to various modalities such as audio (Kong et al., 2021; Liu et al., 2022), video (Ho et al., 2022; He et al., 2022), molecular structures, and general 3D shapes (Trippe et al., 2022; Hoogeboom et al., 2022; Luo & Hu, 2021; Zeng et al., 2022). Recently, the generation of diverse and realistic data modalities (images, videos, sound) from open-ended text prompts (Ramesh et al., 2022; Saharia et al., 2022; Rombach et al., 2022) has projected practitioners into a whole new paradigm for content creation.

A common trait of diffusion models is the need to understand their design space (Karras et al., 2022) and to tailor the inner working parts to the chosen application and data domain. Diffusion models require specialization, ranging from architectural choices of the neural networks used to approximate the score (Dhariwal & Nichol, 2021; Karras et al., 2022), to fine-grained details such as an appropriate definition of a noise schedule (Dhariwal & Nichol, 2021; Salimans & Ho, 2022), and mechanisms to deal with resolution and scale (Ho et al., 2021). Clearly, the data domain profoundly impacts such design choices.
As a consequence, a growing body of work has focused on the projection of data modalities into a latent space (Rombach et al., 2022), either by using auxiliary models such as VAEs (Vahdat et al., 2021), or by using a functional data representation (Dupont et al., 2022a). These approaches lead to increased efficiency, because they operate on lower-dimensional spaces, and they constitute a step toward broadening the applicability of diffusion models to general data. The idea of modelling data with continuous functions has several advantages (Dupont et al., 2022a): it allows working with data at arbitrary resolutions, it enjoys improved memory efficiency, and it allows simple architectures to represent a variety of data modalities. However, a theoretically grounded understanding of how diffusion models can operate directly on continuous functions has been elusive so far. Preliminary studies apply established diffusion algorithms to a discretization of functional data by conditioning on point-wise values (Dutordoir et al., 2022; Zhuang et al., 2023). A line of work that is closely related to ours includes approaches such as Kerrigan et al. (2022), who consider a Gaussian noise corruption process in Hilbert space and derive a loss function formulated on infinite-dimensional measures to approximate the conditional mean of the reverse process. Within this line of work, Mittal et al. (2022) consider diffusion of Gaussian processes. We are aware of other concurrent works that study diffusion processes in Hilbert spaces (Lim et al., 2023; Pidstrigach et al., 2023; Hagemann et al., 2023). However, differently from us, these works do not formally prove that the score matching optimization is a proper evidence lower bound (ELBO), but simply propose it as a heuristic. None of these prior works discusses the limits of discretization, and consequently they fail to identify which subset of functions can be reconstructed through sampling. Finally, the parametrization we present in our work merges, within a single, simple model, how functions and the score are approximated.

The main goal of our work is to deepen our understanding of diffusion models in function space. We present a new mathematical framework to lift diffusion models from finite-dimensional inputs to function spaces, contributing to a general method for data represented by continuous functions. In Section 2, we present Functional Diffusion Processes (FDPs), which generalize diffusion processes to infinite-dimensional function spaces. We define forward (Section 2.1) and backward (Section 2.2) FDPs, and consider generic functional perturbations, including noising and Laplacian blurring. Using an extension of the Girsanov theorem, we derive in Section 2.3 an ELBO, which allows defining a parametric model to approximate the score of the functional density of FDPs. Given an FDP and the associated ELBO, we are one step closer to the definition of a loss function to learn the parametric score. However, our formulation still resides in an abstract, infinite-dimensional Hilbert space. Then, for practical reasons, in Section 3, we specify for which subclass of functions we can perfectly reconstruct the original function given only its evaluation on a countable set of points. This is an extension of the sampling theorem, which we use to move from the infinite-dimensional domain of functions to a finite-dimensional domain of discrete mappings.
In Section 4, we discuss various options to implement such discrete mappings. In this work, we explore in particular the usage of implicit neural representations (INRs) (Sitzmann et al., 2020) and Transformers (Vaswani et al., 2017) to jointly model both the sampled version of infinite-dimensional functions, and the score network, which is central to the training of FDPs and is required to simulate the backward process. Our training procedure, discussed in Section 5, involves approximate, finite-dimensional Stochastic Differential Equations (SDEs) for the forward and backward processes, as well as for the ELBO. We complement our theory with a series of experiments illustrating the viability of FDPs in Section 6. In our experiments, the score network is a simple multilayer perceptron (MLP), with several orders of magnitude fewer parameters than any existing score-based diffusion model. To the best of our knowledge, we are the first to show that a function-space diffusion model can generate realistic image data, beyond simple datasets and toy models.

2 Functional Diffusion Processes (FDPs)

We begin by defining diffusion processes in Hilbert spaces, which we call functional diffusion processes (FDPs). While the study of diffusion processes in Hilbert spaces is not new (Föllmer & Wakolbinger, 1986; Millet et al., 1989; Da Prato & Zabczyk, 2014), our objectives depart from prior work, and call for an appropriate treatment of the intricacies of FDPs when used in the context of generative modeling. In Section 2.1 we introduce a generic class of diffusion processes in Hilbert spaces. The key object is Equation (1), together with its associated path measure $Q$ and the time-varying measure $\rho_t$, where $\rho_0$ represents the starting (data) measure. In Section 2.2 we derive the reverse FDP with the associated path-reversed measure $\hat{Q}$, and in Section 2.3 we use an extension of the Girsanov theorem for infinite-dimensional SDEs to compute the ELBO. The ELBO is a training objective involving a generalization of the score function (Song et al., 2021).

2.1 The forward diffusion process

We consider $H$ to be a real, separable Hilbert space with inner product $\langle \cdot, \cdot \rangle$, norm $\|\cdot\|_H$, and countable orthonormal basis $\{e_k\}_{k=1}^{\infty}$. Let $L(H)$ be the set of bounded linear operators on $H$, $\mathcal{B}(H)$ its Borel $\sigma$-algebra, $B_b(H)$ the set of bounded $\mathcal{B}(H)$-measurable functions $H \to \mathbb{R}$, and $\mathcal{P}(H)$ the set of probability measures on $(H, \mathcal{B}(H))$. Consider the following $H$-valued SDE:
$$\mathrm{d}X_t = \big(A X_t + f(X_t, t)\big)\,\mathrm{d}t + \mathrm{d}W_t, \qquad X_0 \sim \rho_0 \in \mathcal{P}(H), \qquad (1)$$
where $t \in [0, T]$, $W_t$ is an $R$-Wiener process on $H$ defined on the quadruplet $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t \ge 0}, Q)$, and $\Omega, \mathcal{F}$ are the sample space and canonical filtration, respectively. The domain of $f$ is $D(f) \in \mathcal{B}([0, T] \times H)$, where $f : D(f) \subseteq [0, T] \times H \to H$ is a measurable map. The operator $A : D(A) \subseteq H \to H$ is the infinitesimal generator of a $C_0$-semigroup $\exp(tA)$ in $H$ ($t \ge 0$), and $\rho_0$ is a probability measure on $H$. We take $\Omega$ to be $C([0, T]; H)$, that is, the space of all continuous mappings $[0, T] \to H$, and $X_t(\omega) = \omega(t)$, $\omega \in \Omega$, to be the canonical process. The requirements on the terms $A, f$ that ensure existence of solutions to Equation (1) depend on the type of noise, trace-class ($\mathrm{Tr}\{R\} < \infty$) or cylindrical ($R = I$), used in the FDP (Da Prato & Zabczyk (2014), Hypothesis 7.1 or Hypothesis 7.2 for trace-class and cylindrical noise, respectively). The measure associated with Equation (1) is indicated with $Q$. The law induced at time $\tau \in [0, T]$ by the canonical process on the measure $Q$ is indicated with $\rho_\tau \in \mathcal{P}(H)$, where $\rho_\tau(S) = Q(\{\omega \in \Omega : X_\tau(\omega) \in S\})$, for any $S \in \mathcal{B}(H)$.
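To make the objects in Equation (1) concrete, the following minimal sketch, written in JAX (the framework we also use in Section 6), represents a function through a truncated vector of basis coefficients $\langle x, e_k \rangle$ and draws an increment of a trace-class $R$-Wiener process. The truncation level `K` and the eigenvalues `r` are illustrative assumptions, not values prescribed by the theory.

```python
# Minimal sketch (illustrative): a function is stored as its first K basis
# coefficients <x, e_k>; an R-Wiener increment has independent coordinates
# W^k with variance r_k * dt, and Tr{R} = sum_k r_k < infinity (trace-class).
import jax
import jax.numpy as jnp

K = 64                                     # truncation level (assumption)
r = 1.0 / (jnp.arange(1, K + 1) ** 2)      # eigenvalues of R, summable

def wiener_increment(key, dt):
    return jnp.sqrt(r * dt) * jax.random.normal(key, (K,))

dW = wiener_increment(jax.random.PRNGKey(0), dt=1e-3)
```

With cylindrical noise ($R = I$) the coordinate variances would not decay, which is one way to see why the trace-class case is the practically relevant one, as discussed below.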
Notice that in infinite-dimensional spaces there is no equivalent of the Lebesgue measure from which densities can be obtained. In our case we consider, when it exists, the one-dimensional conditional density $\rho^{(d)}_\tau(x^i \mid x^{j \neq i})$, defined implicitly through $\mathrm{d}\rho_\tau(x^i \mid x^{j \neq i}) = \rho^{(d)}_\tau(x^i \mid x^{j \neq i})\,\mathrm{d}x^i$, where $\mathrm{d}x^i$ is the Lebesgue measure. To avoid cluttering the notation, in this work we simply shorten $\rho^{(d)}_\tau(x^i \mid x^{j \neq i})$ to $\rho^{(d)}_\tau(x)$ whenever unambiguous. In Appendix B we provide additional details on the time-varying measure $\rho_t(\mathrm{d}x)\mathrm{d}t$.

Before proceeding, it is useful to notice that Equation (1) can also be expressed as an (infinite) system of stochastic differential equations in terms of $X^k_t = \langle X_t, e_k \rangle$ as:
$$\mathrm{d}X^k_t = b^k(X_t, t)\,\mathrm{d}t + \mathrm{d}W^k_t, \qquad k = 1, \dots, \infty, \qquad (2)$$
where we introduced the projection $b^k(X_t, t) = \langle A X_t + f(X_t, t), e_k \rangle$. Moreover, $\mathrm{d}W^k_t = \langle \mathrm{d}W_t, e_k \rangle$, with covariance given by $\mathbb{E}[W^k_t W^j_s] = \delta(k - j)\, r^k \min(s, t)$, with $\delta$ in the Kronecker sense, and $r^k$ the projection of $R$ onto the basis element $e_k$.

2.2 The reverse diffusion process

We now derive the reverse-time dynamics for FDPs of the form defined in Equation (1). We require that the time reversal of the canonical process, $\hat{X}_t = X_{T - t}$, is again a diffusion process, with distribution given by the path-reversed measure $\hat{Q}(\omega)$, along with the reversed filtration $\hat{\mathcal{F}}$. Note that the time reversal of an infinite-dimensional process is more involved than for the finite-dimensional case (Anderson, 1982; Föllmer, 1985). There are two major approaches to guarantee the existence of the reverse diffusion process. The first approach (Föllmer & Wakolbinger, 1986) is applicable only when $R = I$ (the case of cylindrical Wiener processes) and it relies on a finite local entropy condition. The second approach, which is valid in the case of trace-class noise $\mathrm{Tr}\{R\} < \infty$, is based on the stochastic calculus of variations (Millet et al., 1989). The full technical analysis of the necessary assumptions for the two approaches is involved, and we postpone formal details to Appendix A.

Theorem 1. Consider Equation (1). If $R = I$, suppose Assumption 1 in Appendix A.1 holds; otherwise ($R \neq I$), suppose Assumption 5 and Assumption 6 in Appendix A.2 hold. Then $\hat{X}_t$, corresponding to the path measure $\hat{Q}(\omega)$, has the following SDE representation:
$$\mathrm{d}\hat{X}_t = \Big(-A\hat{X}_t - f(\hat{X}_t, T - t) + R\, D_x \log \rho^{(d)}_{T - t}(\hat{X}_t)\Big)\,\mathrm{d}t + \mathrm{d}\hat{W}_t, \qquad \hat{X}_0 \sim \rho_T, \qquad (3)$$
where $\hat{W}$ is a $\hat{Q}$ $R$-Wiener process, and the notation $D_x \log \rho^{(d)}_t(x)$ stands for the mapping $H \to H$ that, when projected, satisfies $\big\langle D_x \log \rho^{(d)}_t(x), e_k \big\rangle = \frac{\partial}{\partial x^k} \log \rho^{(d)}_t(x^k \mid x^{i \neq k})$.

By projecting onto the eigenbasis, we have an infinite system of SDEs:
$$\mathrm{d}\hat{X}^k_t = \Big(-b^k(\hat{X}_t, T - t) + r^k\, \tfrac{\partial}{\partial x^k} \log \rho^{(d)}_{T - t}(\hat{X}_t)\Big)\,\mathrm{d}t + \mathrm{d}\hat{W}^k_t, \qquad k = 1, \dots, \infty. \qquad (4)$$

The methodology proposed in this work requires operating on proper Wiener processes, with $\mathrm{Tr}\{R\} < \infty$, which intuitively implies that the considered noise has finite variance. We now discuss a Corollary, in which Assumption 5 is replaced by stricter conditions, which we use to check the validity of the practical implementation of FDPs.

Corollary 1. Suppose Assumption 6 from Appendix A.2 holds. Assume that i) $\mathrm{Tr}\{R\} = \sum_i r^i < \infty$, ii) $b^i(x, t) = b^i x^i$ for all $i$, i.e. the drift term is linear and depends on $x$ only through its projection onto the corresponding basis element, and iii) the drift is bounded, such that there exists $K > 0$ with $-K < b^i < 0$ for all $i$. Then, the reverse process evolves according to Equation (4).

Theorem 1 stipulates that, given some conditions, the reverse-time dynamics for FDPs of the form defined in Equation (1) exist.
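Under the conditions of Corollary 1 (linear drift that is diagonal on the basis, and trace-class noise), the coordinate-wise dynamics become fully explicit. The sketch below, under those assumptions, samples the forward transition of Equation (2) in closed form and takes one Euler-Maruyama step of the reverse system of Equation (4); the drift values `b`, the eigenvalues `r`, and the placeholder `score_fn` are illustrative, not the choices used in our experiments.

```python
# Minimal sketch (assumptions as stated in the lead-in; not the paper's code).
import jax
import jax.numpy as jnp

K = 64
b = -jnp.ones(K)                              # drift eigenvalues, bounded and negative (Corollary 1)
r = 1.0 / (jnp.arange(1, K + 1) ** 2)         # eigenvalues of R

def forward_sample(key, x0, t):
    """Exact transition of dX^k = b_k X^k dt + dW^k: Gaussian with known mean and variance."""
    mean = jnp.exp(b * t) * x0
    var = r * (jnp.exp(2.0 * b * t) - 1.0) / (2.0 * b)
    return mean + jnp.sqrt(var) * jax.random.normal(key, (K,))

def reverse_euler_step(key, x_hat, t_fwd, dt, score_fn):
    """One Euler-Maruyama step of Eq. (4); score_fn(x, t) stands in for the projected score at forward time t_fwd."""
    drift = -b * x_hat + r * score_fn(x_hat, t_fwd)
    return x_hat + drift * dt + jnp.sqrt(r * dt) * jax.random.normal(key, (K,))
```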
Our analysis provides theoretical grounding to the observations in concurrent work (Lim et al., 2023), where it is empirically observed that the cylindrical class of noise is not suitable. We argue that, when $R = I$, the difficulty stems from designing the coefficients $b^i$ of the SDEs such that both the forward process (see requirement (5.3) in Da Prato & Zabczyk (2014)) and the backward process (Assumption 1) exist. The work by Bond-Taylor & Willcocks (2023) uses cylindrical (white) noise, but we are not aware of any theoretical justification, since the model architecture is only partially suited to the functional domain. As an addendum, we note that the advantages of projecting the forward and backward processes onto the eigenbasis of the Hilbert space $H$, as in Equation (2) and Equation (4), become evident when discussing the implementation of FDPs, specifically when we derive practical expressions for training and for the simulation of the backward process, as discussed in Section 5, and in a fully expanded toy example in Appendix D.

2.3 A Girsanov formula for the ELBO

Direct simulation of the backward FDP described by Equation (3) is not possible. Indeed, we have no access to the true score of the density $\rho^{(d)}_\tau$ induced at time $\tau \in [0, T]$. To solve the problem, we introduce a parametric score function $s_\theta : H \times [0, T] \times \mathbb{R}^m \to H$. We consider the dynamics:
$$\mathrm{d}\hat{X}_t = \Big(-A\hat{X}_t - f(\hat{X}_t, T - t) + R\, s_\theta(\hat{X}_t, T - t)\Big)\,\mathrm{d}t + \mathrm{d}\bar{W}_t, \qquad \hat{X}_0 \sim \chi_T \in \mathcal{P}(H), \qquad (5)$$
with path measure $\hat{P}^{\chi_T}$, and $\bar{W}_t$ being a $\hat{P}^{\chi_T}$ $R$-Wiener process. To emphasize the connection between Equation (3) and Equation (5), we define initial conditions with the subscript $T$, instead of $0$. In principle, we should have $\chi_T = \rho_T$, as will be evident from the ELBO in Equation (8). However, $\rho_T$ has a simple and easy-to-sample-from distribution only for $T \to \infty$, which is not compatible with a realistic implementation. The analysis of the discrepancy when $T$ is finite is outside the scope of this work, and the interested reader can refer to Franzese et al. (2022) for an analysis on standard diffusion models. The final measure of the new process at time $T$ is indicated by $\chi_0$, i.e. $\chi_0(S) = \hat{P}^{\chi_T}(\{\omega \in \Omega : \hat{X}_T(\omega) \in S\})$.

Next, we quantify the discrepancy between $\chi_0$ and the true data measure $\rho_0$ through an ELBO. Thanks to an extension of the Girsanov theorem to infinite-dimensional SDEs (Da Prato & Zabczyk, 2014), it is possible to relate the path measures ($\hat{Q}$ and $\hat{P}^{\chi_T}$, respectively) of the process $\hat{X}_t$ induced by the different drift terms in Equation (3) and Equation (5) and by the different initial conditions. Starting from the score function $s_\theta$, we define:
$$\gamma_\theta(x, t) = R\Big(s_\theta(x, T - t) - D_x \log \rho^{(d)}_{T - t}(x)\Big). \qquad (6)$$
Under loose regularity assumptions (see Condition 2 in Appendix A.4), $\bar{W}_t = \hat{W}_t - \int_0^t \gamma_\theta(\hat{X}_s, s)\,\mathrm{d}s$ is a $\hat{P}^{\rho_T}$ $R$-Wiener process (Theorem 10.14 in Da Prato & Zabczyk (2014)), where the Girsanov theorem also tells us that the measure $\hat{P}^{\rho_T}$ satisfies the Radon-Nikodym derivative:
$$\frac{\mathrm{d}\hat{P}^{\rho_T}}{\mathrm{d}\hat{Q}} = \exp\left( \int_0^T \big\langle \gamma_\theta(\hat{X}_t, t), \mathrm{d}\hat{W}_t \big\rangle_{R^{-1/2}H} - \frac{1}{2}\int_0^T \big\|\gamma_\theta(\hat{X}_t, t)\big\|^2_{R^{-1/2}H}\,\mathrm{d}t \right). \qquad (7)$$
By virtue of the disintegration theorem, $\mathrm{d}\hat{Q} = \mathrm{d}\hat{Q}^0\,\mathrm{d}\rho_T$ and similarly $\mathrm{d}\hat{P}^{\rho_T} = \mathrm{d}\hat{P}^0\,\mathrm{d}\rho_T$, with $\hat{Q}^0, \hat{P}^0$ the measures of the processes for a given initial value. Then, $\hat{P}^{\chi_T}$ satisfies $\mathrm{d}\hat{P}^{\chi_T} = \mathrm{d}\hat{P}^{\rho_T}\,\frac{\mathrm{d}\chi_T}{\mathrm{d}\rho_T}$, for any measure $\chi_T \in \mathcal{P}(H)$. Consequently, the canonical process $\hat{X}_t$ has an SDE representation according to Equation (5) under the new path measure $\hat{P}^{\chi_T}$. Then (see Appendix A.5 for the derivation) we obtain the ELBO:
$$\mathrm{KL}\big[\rho_0 \,\|\, \chi_0\big] \;\le\; \frac{1}{2}\,\mathbb{E}\left[\int_0^T \big\|\gamma_\theta(X_t, t)\big\|^2_{R^{-1/2}H}\,\mathrm{d}t\right] + \mathrm{KL}\big[\rho_T \,\|\, \chi_T\big], \qquad (8)$$
where the expectation is taken with respect to the forward path measure $Q$. Provided that the required assumptions in Theorem 1 are met, the validity of Equation (8) is general.
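As a minimal illustration of what the bound in Equation (8) asks the network to minimize, the integrand can be written coordinate-wise under the same truncated-basis assumptions as in the earlier sketches; `score_fn` and `true_score_fn` are illustrative stand-ins for $s_\theta$ and $D_x \log \rho^{(d)}_{T-t}$. The true score is of course unavailable in practice, which is why the conditional reformulation introduced next is the one actually used for training.

```python
# Minimal sketch (illustrative): the ELBO integrand of Eq. (8) in coordinates,
# with gamma_theta from Eq. (6) and the weighted norm sum_k gamma_k^2 / r_k.
import jax.numpy as jnp

def elbo_integrand(x, t, T, r, score_fn, true_score_fn):
    gamma = r * (score_fn(x, T - t) - true_score_fn(x, T - t))   # gamma_theta(x, t), Eq. (6)
    return 0.5 * jnp.sum(gamma ** 2 / r)
```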
Our goal, however, is to set the stage for a practical implementation of FDPs, which calls for design choices that make it easy to satisfy the required assumptions for the theory to hold. Then, for the remainder of the paper, we consider the particular case where $f = 0$ in Equation (1). This simplifies the dynamics as follows:
$$\mathrm{d}X_t = A X_t\,\mathrm{d}t + \mathrm{d}W_t, \qquad X_0 \sim \rho_0 \in \mathcal{P}(H), \qquad (9)$$
$$\mathrm{d}\hat{X}_t = \Big(-A\hat{X}_t + R\, s_\theta(\hat{X}_t, T - t)\Big)\,\mathrm{d}t + \mathrm{d}\bar{W}_t, \qquad \hat{X}_0 \sim \chi_T \in \mathcal{P}(H). \qquad (10)$$
Since the only drift component in Equation (9) is the linear term $A$, the projection $b^j$ will be linear as well. Such a design choice, although not necessary from a theoretical point of view, carries several advantages. The design of a drift term satisfying the conditions of Corollary 1 becomes straightforward, and such conditions naturally align with the requirements for the existence of the forward process (Chapter 5 of Da Prato & Zabczyk (2014)). Moreover, the forward process conditioned on given initial conditions admits known solutions, which means that simulation of SDE paths is cheap and straightforward, without the need to perform full numerical integration. Finally, it is possible to claim existence of the true score function and even provide its analytic expression (full derivation in Appendix A.7) as:
$$D_x \log \rho^{(d)}_t(x) = -S(t)^{-1}\big(x - \exp(tA)\,\mathbb{E}[X_0 \mid X_t = x]\big), \quad \text{where } S(t) = \int_{s=0}^{t} \exp((t - s)A)\, R\, \exp((t - s)A^*)\,\mathrm{d}s. \qquad (11)$$
This last aspect is particularly useful when considering the conditional version of Equation (6), through $\big\langle D_x \log \rho^{(d)}_t(x \mid x_0), e_k \big\rangle = \frac{\partial}{\partial x^k} \log \rho^{(d)}_t(x^k \mid x^{i \neq k}, x_0)$, as:
$$\gamma_\theta(x, x_0, t) = R\Big(s_\theta(x, T - t) - D_x \log \rho^{(d)}_{T - t}(x \mid x_0)\Big), \qquad (12)$$
where, similarly to the unconditional case, we have $D_x \log \rho^{(d)}_t(x \mid x_0) = -S(t)^{-1}\big(x - \exp(tA)\,x_0\big)$. Then, Equation (12) can be used to rewrite Equation (8):
$$\mathbb{E}\left[\int_0^T \big\|\gamma_\theta(X_t, t)\big\|^2_{R^{-1/2}H}\,\mathrm{d}t\right] = \mathbb{E}\left[\int_0^T \big\|\gamma_\theta(X_t, X_0, t)\big\|^2_{R^{-1/2}H}\,\mathrm{d}t\right] + I, \qquad (13)$$
where $I$ is a quantity independent of $\theta$. Knowledge of the conditional true score $D_x \log \rho^{(d)}_t(x \mid x_0)$ and cheap simulation of the forward dynamics allow for easier numerical optimization than in the more general case of $f \neq 0$.

3 Sampling theorem for FDPs

The theory of FDPs developed so far is valid for real, separable Hilbert spaces. Our goal now is to specify for which subclass of functions it is possible to perfectly reconstruct the original function given only its evaluation on a countable set of points. We present a generalization of the sampling theorem (Shannon, 1949), which allows us to move from generic Hilbert spaces to a domain which is amenable to a practical implementation of FDPs, and to their application to common functional representations of data such as images, data on manifolds, and more. We model these functions as objects belonging to the set of square-integrable functions over $C^\infty$ homogeneous manifolds $M$ (such as $\mathbb{R}^N$, $S^N$, etc.), i.e., the Hilbert space $H = L^2(M)$. Then, exact reconstruction implies that all the relevant information about the considered functions is contained in the set of sampled points. First, we define functions that are band-limited:

Definition 1. A function $x$ in $H = L^2(M)$ is a spectral entire function of exponential type $\nu$ (SE-$\nu$) if $\|\Delta^{k/2} x\| \le \nu^k \|x\|$ for all $k \in \mathbb{N}$, where $\Delta$ is the Laplace operator on $M$.

Informally, the Fourier transform of $x$ is contained in the interval $[0, \nu]$ (Pesenson, 2000). Second, we define grids that cover the manifold with balls, without too much overlap. These grids will be used to collect the function samples. Their formal definition is as follows:
Definition 2. $Y(r, \lambda)$ denotes the set of all sets of points $Z = \{p_i\}$ such that: i) $\inf_{j \neq i} \mathrm{dist}(p_j, p_i) > 0$ and ii) the balls $B(p_i, \lambda)$ form a cover of $M$ with multiplicity $< r$.

Combining the two definitions, we can state the key result of this Section. As long as the sampled function is band-limited, if the sampling grid is sufficiently fine, exact reconstruction is possible:

Theorem 2. For any set $Z \in Y(r, \lambda)$, any SE-$\nu$ function $x$ is uniquely determined by its set of values in $Z$ (i.e. $\{x[p_i]\}$) as long as $\lambda < d$, that is
$$x = \sum_{p_i \in Z} x[p_i]\, m_{p_i}, \qquad (14)$$
where the $m_{p_i} \in H$ are known polynomials², and the notation $x[p]$ indicates that the function $x$ is evaluated at point $p$.

² Precisely, they are the limits of spline polynomials that form a Riesz basis for the Hilbert space of polyharmonic functions with singularities in $Z$ (Pesenson, 2000).

A precise definition of the value of the constant $d$ and its interpretation are outside the goal of this work, and we refer the interested reader to Pesenson (2000) for additional details. For our purposes, it is sufficient to interpret the condition in Theorem 2 as a generalization of the classical Shannon-Nyquist sampling theorem (Shannon, 1949). Under this light, Theorem 2 has practical relevance, because it gives the conditions under which the sampled version of a function contains all the information of the original function. Indeed, given the set of points $p_i$ on which the function $x$ is evaluated, it is possible to reconstruct $x[p]$ exactly for arbitrary $p$.

The uncertainty principle. It is not always possible to define Hilbert spaces of square-integrable functions that are simultaneously homogeneous and separable, for all the manifolds $M$ of interest. In other words, it is difficult in practice to satisfy both the requirements for FDPs to exist and for the sampling theorem to be valid (see an example in Appendix C). Nevertheless, it is possible to quantify the reconstruction error, and to realize that practical applications of FDPs are viable. Indeed, given a compactly supported function $x$, and a set of points $Z$ with finite cardinality, we can upper-bound the reconstruction error $\big\|\sum_{p_i \in Z} x[p_i] m_{p_i} - x\big\|_H$ with:
$$\Big\|\sum_{p_i \in Z} x[p_i] m_{p_i} - x\Big\|_H \;\le\; \underbrace{\Big\|\sum_{p_i \in Z} \big(x[p_i] - x_\nu[p_i]\big) m_{p_i}\Big\|_H}_{\epsilon_1} + \underbrace{\Big\|\sum_{p_i \in Z} x_\nu[p_i] m_{p_i} - x_\nu\Big\|_H}_{\epsilon_2} + \underbrace{\big\|x_\nu - x\big\|_H}_{\epsilon_3}, \qquad (15)$$
where $x_\nu$ is the SE-$\nu$ band-limited version of $x$, obtained by filtering out in the frequency domain any component larger than $\nu$. The error $\epsilon_1$ is due to $x \neq x_\nu$. The term $\epsilon_2$ is the reconstruction error due to the finiteness of $|Z|$: the sampling theorem applies to $x_\nu$, but the corresponding sampling grid has infinite cardinality. Finally, the term $\epsilon_3$ quantifies the energy omitted by filtering out the frequency components of $x$ larger than $\nu$. This (loose) upper bound allows us to understand quantitatively the degree to which the sampling theorem does not apply in the cases of interest. Although deriving tighter bounds is possible, this is outside the scope of this work. What suffices is that in many practical cases, when functions are obtained from natural sources, it has been observed that they are nearly time- and band-limited (Slepian, 1983). Consequently, as long as the sampling grid is sufficiently fine, the reconstruction error $\epsilon$ is negligible. We now hold all the ingredients to formulate generative functional diffusion models using the Hilbert space formalism and to implement them using a finite grid of points, which is what we do next.
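As a concrete special case of Theorem 2, the sketch below reconstructs a band-limited function on $M = \mathbb{R}$ from samples on a uniform grid, where the $m_{p_i}$ of Equation (14) reduce to shifted sinc kernels (the classical Shannon-Nyquist setting). The finite grid makes the result only approximate, which is precisely the $\epsilon_2$ term of Equation (15); the test function and grid spacing are illustrative.

```python
# Minimal sketch (classical Shannon-Nyquist special case of Theorem 2).
import jax.numpy as jnp

def reconstruct(samples, grid, query, spacing):
    """Evaluate sum_i x[p_i] * sinc((p - p_i) / spacing) at the query points."""
    kernels = jnp.sinc((query[:, None] - grid[None, :]) / spacing)
    return kernels @ samples

spacing = 0.05                                    # grid spacing; Nyquist bandwidth is 1 / (2 * spacing) = 10
grid = jnp.arange(0.0, 1.0, spacing)              # finite sampling grid Z (hence epsilon_2 > 0)
x = jnp.sin(2.0 * jnp.pi * 3.0 * grid)            # band-limited test function (3 cycles, below Nyquist)
x_dense = reconstruct(x, grid, jnp.linspace(0.0, 1.0, 200), spacing)
```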
4 Score Network Architectural Implementations

We are now equipped with the ELBO (Equation (8)) and a score function $s_\theta$ that implements the mapping $H \times [0, T] \times \mathbb{R}^m \to H$. We could then train the score by optimizing the ELBO and produce samples arbitrarily close to the true data measure $\rho_0$. However, since the domain of the score function is the infinite-dimensional Hilbert space, such a mapping cannot be implemented in practice. Indeed, having access to samples of functions on a finite grid of points is, in general, not sufficient. However, when the conditions for Theorem 2 hold, we can substitute, with no information loss, $x \in H$ with its collection of samples $\{x[p_i], p_i\}$. This allows considering score network architectures that receive as input a collection of points, and not abstract functions. Such architectures should be flexible enough to work with an arbitrary number of input samples at arbitrary grid points, and produce as outputs functions in $H$.

4.1 Implicit Neural Representation

The first approach we consider in this work is based on the idea of Implicit Neural Representations (INRs) (Sitzmann et al., 2020). These architectures can receive as inputs functions sampled at arbitrary, possibly irregular points, and produce output functions that can be evaluated at any desired point. Unfortunately, the encoding of the inputs is not as straightforward as in the Neural Fourier Operator (NFO) case, and some form of autoencoding is necessary. Note, however, that in traditional score-based diffusion models (Song et al., 2021), the parametric score function can be thought of as a denoising autoencoder. This is a valid interpretation also in our case, as is evident by observing the term $\mathbb{E}[X_0 \mid X_t = x]$ of the true score function in Equation (11). Since INRs are powerful denoisers (Kim et al., 2022a), and given their simple design and small number of parameters, in this Section we discuss how to implement the score network of FDPs using INRs.

We define a valid INR as a parametric family $n(\psi, t, \theta)$ of functions in $H$, i.e., mappings $\mathbb{R}^m \times [0, T] \times \mathbb{R}^m \to H$. A valid INR is the central building block for the implementation of the parametric score function, and it relies on two sets of parameters: $\theta$, the parameters of the score function that we optimize according to Equation (8), and $\psi$, which serves the purpose of building a mapping from $H$ into a finite-dimensional space. More formally:

Definition 3. Given a manifold $M$, a valid Implicit Neural Representation (INR) is an element of $H$ defined by a family of parametric mappings $n(\psi, t, \theta)$, with $t \in [0, T]$, $\theta, \psi \in \mathbb{R}^m$. That is, for $p \in M$, we have $n(\psi, t, \theta)[p] \in \mathbb{R}$. Moreover, we require $n(\psi, t, \theta) \in L^2(M)$.

A valid INR as defined in Definition 3 is not sufficiently flexible to implement the parametric score function $s_\theta$, as it cannot accept input elements from the infinite-dimensional Hilbert space $H$: indeed, the score function is defined as a mapping over $H \times [0, T] \times \mathbb{R}^m \to H$, whereas the valid INR is a mapping defined over $\mathbb{R}^m \times [0, T] \times \mathbb{R}^m \to H$. We therefore use the second set of parameters $\psi$ to condense all the information of a generic $x \in H$ into a finite-dimensional vector. When the conditions for Theorem 2 hold, we can substitute, with no information loss, $x \in H$ with its collection of samples $\{x[p_i], p_i\}$. Then, we can construct an implicitly defined mapping $g : H \times [0, T] \times \mathbb{R}^m \to \mathbb{R}^m$ as:
$$g(\{x[p_i], p_i\}, t, \theta) = \arg\min_\psi \sum_{p_i} \big(n(\psi, t, \theta)[p_i] - x[p_i]\big)^2. \qquad (16)$$
In this work, we consider the modulation approach to INRs. The set of parameters $\psi$ is obtained by minimizing Equation (16) using a few steps of gradient descent on the objective $\sum_{p_i} (n(\psi, t, \theta)[p_i] - x[p_i])^2$, starting from the zero initialization of $\psi$. This approach, also explored by Dupont et al. (2022b), is based on the concept of meta-learning (Finn et al., 2017).
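A minimal sketch of this inner-loop modulation is given below, assuming a toy single-hidden-layer INR with a sine activation (the architecture actually used in Section 6 is different); the names `inr` and `encode`, the hidden width, the step size, and the number of inner steps are all illustrative assumptions.

```python
# Minimal sketch (illustrative): Eq. (16) approximated with a few gradient steps
# on the modulation psi, starting from the zero initialization, as in the text.
import jax
import jax.numpy as jnp

def inr(psi, t, theta, p):
    """Toy valid INR n(psi, t, theta)[p]: one hidden layer, sine activation, psi as a shift modulation."""
    h = jnp.sin(theta["w"][:, 0] * p + theta["w"][:, 1] * t + psi)
    return theta["v"] @ h

def encode(x_samples, points, t, theta, n_steps=3, lr=1e-2):
    """Inner-loop minimisation of Eq. (16), returning g({x[p_i], p_i}, t, theta)."""
    def loss(psi):
        preds = jax.vmap(lambda p: inr(psi, t, theta, p))(points)
        return jnp.mean((preds - x_samples) ** 2)
    psi = jnp.zeros(theta["w"].shape[0])
    for _ in range(n_steps):
        psi = psi - lr * jax.grad(loss)(psi)
    return psi

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
hidden = 32
theta = {"w": jax.random.normal(k1, (hidden, 2)), "v": jax.random.normal(k2, (hidden,)) / hidden}
points = jnp.linspace(0.0, 1.0, 64)
psi = encode(jnp.sin(2.0 * jnp.pi * points), points, t=0.1, theta=theta)
```

In the meta-learning view, gradients with respect to $\theta$ also flow through these few inner steps during training.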
In summary, our method constructs mappings $H \times [0, T] \times \mathbb{R}^m \to H$, where the same INR is used first to encode $x$ into $\psi$, and subsequently to output the function values at any desired input point $p$, thus implementing the following score network:
$$s_\theta(x, t) = -\big(S(t)\big)^{-1}\Big(x - \exp(tA)\, n\big(g(\{x[p_i], p_i\}, t, \theta), t, \theta\big)\Big). \qquad (17)$$

4.2 Transformers

As an alternative approach, we consider implementing the score function $s_\theta$ using transformer architectures (Vaswani et al., 2017), interpreting them as mappings between Hilbert spaces (Cao, 2021). We briefly summarize here such a perspective, focusing on a single attention layer for simplicity, and adapt the notation used throughout the paper accordingly. Consider the space $L^2(M)$, with the usual collection of samples $\{x[p_i], p_i\}$. As a first step, both the features $\{x[p_i]\}$ and the positions $\{p_i\}$ are embedded into some higher-dimensional space and summed together, to obtain a sequence of vectors $\{y_i\}$. Then, three different (learnable) matrices $\theta^{(Q)}, \theta^{(K)}, \theta^{(V)}$ are used to construct linear transformations of the vector sequence $\{y_i\}$ as $\hat{Y}^{(Q)} = \{\hat{y}_i^{(Q)} = \theta^{(Q)} y_i\}$, $\hat{Y}^{(K)} = \{\hat{y}_i^{(K)} = \theta^{(K)} y_i\}$, $\hat{Y}^{(V)} = \{\hat{y}_i^{(V)} = \theta^{(V)} y_i\}$. Finally, the three matrices $\hat{Y}^{(Q, K, V)}$ are multiplied together, according to any variant of the attention mechanism. Indeed, different choices for the order of multiplication and for the normalization schemes in the products and in the matrices correspond to different attention layers (Vaswani et al., 2017). In practical implementations, these operations can be repeated multiple times (multiple attention layers) and can be performed in parallel according to multiple projection matrices (multiple heads).

The perspective explored in Cao (2021) is that it is possible to interpret the sequences $\hat{y}_i^{(Q, K, V)}$ as learnable basis functions in some underlying latent Hilbert space, evaluated at the set of coordinates $\{p_i\}$. Furthermore, depending on the type of attention mechanism selected, the operation can be interpreted as a different mapping between Hilbert spaces, such as a Fredholm equation of the second kind or a Petrov-Galerkin-type projection (Cao (2021), Eqs. 9 and 14). While a complete treatment of such an interpretation is outside the scope of this work, what suffices is that transformer architectures are a viable candidate for the implementation of the desired mapping $H \times [0, T] \times \mathbb{R}^m \to H$, a possibility that we explore experimentally in this work. It is worth noticing that, compared to the approach based on INRs, resolution invariance is only learned, not guaranteed, and that the number of parameters is generally higher than for an INR. Nevertheless, learning the parameters of transformer architectures does not require meta-learning, which is a practical pain point of INRs used in our context. Additional details for the transformer-based implementation of the score network are available in Appendix E.

Finally, for completeness, it is worth mentioning that a related class of architectures, Neural Operators and NFOs (Kovachki et al., 2021; Li et al., 2020), are also valid alternatives. However, such architectures require the input grid to be regularly spaced (Li et al., 2020), and their output function is available only at the same points $p_i$ as the input, which would reduce the flexibility of FDPs.
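To make the single-attention-layer perspective concrete, the following sketch applies one single-head attention layer to a collection $\{x[p_i], p_i\}$ and returns new values at the same points. It is a generic illustration of the mapping described by Cao (2021), not the UViT architecture used in Section 6, and all parameter names and sizes are assumptions.

```python
# Minimal sketch (illustrative, not the UViT used in Section 6).
import jax
import jax.numpy as jnp

def attention_layer(params, x_samples, points):
    y = x_samples[:, None] * params["feat"] + points[:, None] * params["pos"]   # (N, d) embeddings y_i
    q, k, v = y @ params["Wq"], y @ params["Wk"], y @ params["Wv"]              # learnable basis functions
    att = jax.nn.softmax(q @ k.T / jnp.sqrt(q.shape[-1]), axis=-1)
    return (att @ v) @ params["out"]                                            # (N,) new values at the p_i

d = 16
keys = jax.random.split(jax.random.PRNGKey(0), 6)
params = {"feat": jax.random.normal(keys[0], (d,)), "pos": jax.random.normal(keys[1], (d,)),
          "Wq": jax.random.normal(keys[2], (d, d)), "Wk": jax.random.normal(keys[3], (d, d)),
          "Wv": jax.random.normal(keys[4], (d, d)), "out": jax.random.normal(keys[5], (d,))}
points = jnp.linspace(0.0, 1.0, 64)
values = attention_layer(params, jnp.sin(2.0 * jnp.pi * points), points)
```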
5 Training and sampling of FDPs

Given the parametric score function $s_\theta$ from Equation (17), by simulating the reverse FDP we generate samples whose statistical measure $\chi_0$ is close, in KL sense, to $\rho_0$. Next, we explain how to numerically compute the quantities in Equation (13), which is part of the ELBO in Equation (8), and how to generate new samples from the trained FDP (simulation of Equation (10)).

ELBO computation. Equation (8) involves Equation (13), which requires the computation of the Hilbert space norm. The grid of points $x[p_i]$ is interpolated in $H$ as $\sum_i x[p_i] \xi_i$. Then, the norm of interest can be computed as:
$$\Big\|\sum_i x[p_i] \xi_i\Big\|^2_{R^{-1/2}H} = \Big\langle R^{-\frac{1}{2}} \sum_i x[p_i] \xi_i,\; R^{-\frac{1}{2}} \sum_i x[p_i] \xi_i \Big\rangle_H = \sum_{k=1}^{\infty} (r^k)^{-1} \Big( \sum_{i=1}^{N} x[p_i]\, \langle \xi_i, e_k \rangle \Big)^2. \qquad (18)$$
Depending on the choice of $\xi_i, e_i$, the sum over the index $k$ is either naturally truncated or needs to be further approximated by selecting a cutoff index value. Finally, training can then be performed by minimizing:
$$\mathbb{E}\left[\int_0^T \big\|\gamma_\theta(X_t, t)\big\|^2_{R^{-1/2}H}\,\mathrm{d}t\right] \approx \mathbb{E}\left[\int_0^T \sum_{k=1}^{\infty} (r^k)^{-1} \Big( \sum_{i=1}^{N} \gamma_\theta\Big(\sum_j X_t[p_j]\xi_j,\, t\Big)[p_i]\, \langle \xi_i, e_k \rangle \Big)^2 \mathrm{d}t\right]. \qquad (19)$$

Numerical integration. Simulation of infinite-dimensional SDEs is a well-studied domain (Debussche, 2011), including finite-difference schemes (Gyöngy, 1998, 1999; Yoo, 2000), and finite element methods and/or Galerkin schemes (Hausenblas, 2003a,b; Shardlow, 1999). In this work, we adopt a finite element approximation scheme, and introduce the interpolation operator from $\mathbb{R}^{|Z|}$ to $H$, i.e. $x \mapsto \sum_i x[p_i] \xi_i$ (Hausenblas, 2003b). Notice that, in general, the functions $\xi_i$ differ from the basis $e_i$. In addition, the projection operator maps functions from $H$ into $\mathbb{R}^L$, as $x \mapsto \langle x, \zeta_j \rangle$, with $\zeta_j \in H$. Usually, $L = |Z|$. When $\zeta_i = \xi_i$ the scheme is referred to as the Galerkin scheme. We consider instead a point-matching scheme (Hausenblas, 2003b), in which $\zeta_i = \delta[p - p_i]$ with $\delta$ in the Dirac sense, and consequently $\langle x, \zeta_i \rangle = x[p_i]$. Then, the infinite-dimensional SDE of the forward process from Equation (9) is approximated by the finite ($|Z|$-dimensional) SDE:
$$\mathrm{d}X_t[p_k] = \Big\langle A \sum_i X_t[p_i] \xi_i,\; \zeta_k \Big\rangle\, \mathrm{d}t + \mathrm{d}W_t[p_k], \qquad k = 1, \dots, |Z|. \qquad (20)$$
Similarly, the reverse process described by Equation (10) corresponds to the following SDE:
$$\mathrm{d}\hat{X}_t[p_k] = \Big( \Big\langle -A \sum_i \hat{X}_t[p_i] \xi_i,\; \zeta_k \Big\rangle + \Big\langle R\, s_\theta\Big(\sum_i \hat{X}_t[p_i] \xi_i,\; T - t\Big),\; \zeta_k \Big\rangle \Big)\, \mathrm{d}t + \mathrm{d}\hat{W}_t[p_k], \qquad k = 1, \dots, |Z|. \qquad (21)$$
Equation (21) is a finite-dimensional SDE, and consequently we can use any known numerical integrator to simulate its paths. In Appendix D we provide a complete toy example illustrating our approach in a simple scenario, where we emphasize practical choices.
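Putting the pieces together, the following sketch illustrates, under the same simplifying assumptions as the earlier snippets ($f = 0$, drift diagonal on the basis, trace-class noise, and a grid whose interpolation makes basis coefficients and grid values interchangeable), a Monte-Carlo estimate of the conditional objective of Equation (13) and an Euler-Maruyama sampler for the reverse dynamics of Equation (21); `score_net` is a placeholder for any of the architectures in Section 4, and all constants are illustrative.

```python
# Minimal sketch (same simplifying assumptions as earlier snippets; not the paper's code).
import jax
import jax.numpy as jnp

K = 64
b = -jnp.ones(K)                              # diagonal drift eigenvalues
r = 1.0 / (jnp.arange(1, K + 1) ** 2)         # eigenvalues of R

def loss_fn(score_net, key, x0, T=1.0):
    """Monte-Carlo estimate of the conditional objective in Eq. (13)."""
    k1, k2 = jax.random.split(key)
    t = jax.random.uniform(k1, (), minval=1e-3, maxval=T)
    mean = jnp.exp(b * t) * x0
    var = r * (jnp.exp(2.0 * b * t) - 1.0) / (2.0 * b)       # eigenvalues of S(t)
    xt = mean + jnp.sqrt(var) * jax.random.normal(k2, (K,))
    target = -(xt - mean) / var                              # conditional score, Eq. (12)
    diff = score_net(xt, t) - target                         # gamma_theta = R * diff
    return 0.5 * T * jnp.sum(r * diff ** 2)                  # weighted norm as in Eqs. (18)-(19)

def sample(score_net, key, n_steps=500, T=1.0):
    """Euler-Maruyama integration of the reverse SDE, Eq. (21), in basis coordinates."""
    key, sub = jax.random.split(key)
    x = jnp.sqrt(r / (-2.0 * b)) * jax.random.normal(sub, (K,))   # approximate stationary chi_T
    dt = T / n_steps
    for i in range(n_steps):
        key, sub = jax.random.split(key)
        t_fwd = T - i * dt
        drift = -b * x + r * score_net(x, t_fwd)
        x = x + drift * dt + jnp.sqrt(r * dt) * jax.random.normal(sub, (K,))
    return x
```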
6 Experiments

Despite a rather involved theoretical treatment, the implementation of FDPs is simple. We implemented our approach in JAX (Bradbury et al., 2018), and use WANDB (Biewald, 2020) for our experimental protocol. Additional details on the implementation and the experimental setup, as well as more experiments, are available in Appendix E. We evaluate our approach on image data, using the CELEBA 64x64 (Liu et al., 2015) dataset. Our comparative analysis with the state of the art includes generative quality, using the FID score (Heusel et al., 2017), and the parameter count of the score network. We also discuss (informally) the complexity of the network architecture, as a measure of the engineering effort required to explore the design space of the score network.

We compare against vanilla Score-Based Diffusion (SBD) (Song et al., 2021); From Data To Functa (FD2F) (Dupont et al., 2022a), which diffuses latent variables obtained from an INR; and Infinite Diffusion (∞-DIFF) (Bond-Taylor & Willcocks, 2023), a recent approach that is only partially suited to the functional domain, as it relies on the combination of Fourier Neural Operators and a classical convolutional U-NET backbone. Our FDP method is implemented using either an MLP or Transformers. In the first case, we consider a score network implemented as a simple MLP with 15 layers and 256 neurons in each layer, with a Gabor wavelet activation function (Saragadam et al., 2023). In the latter case, our approach is built upon the UViT backbone as detailed by Bao et al. (2022). The architecture comprises 7 layers, each composed of a self-attention mechanism with 8 attention heads and a feedforward layer.

We present quantitative results in Table 1, showing that our method FDP(MLP) achieves an impressively low FID score, given the extremely low parameter count and the simplicity of the architecture. FD2F obtains a worse (larger) FID score, while having many more parameters, due to the complex parametrization of its score network. As a reference we report the results of SBD, where the price to pay for an extremely low FID is many more parameters and a much more intricate architecture. Finally, the very recent ∞-DIFF method has a low FID-CLIP score (Kynkäänniemi et al., 2022), but requires a very complex architecture and more than two orders of magnitude more parameters than our approach. Showcasing the flexibility of the proposed methodology, we also consider a more complex architecture based on Vision Transformers (FDP(UViT)). The corresponding results indicate improvements in terms of image quality (FID score of 11) and do not require meta-learning steps, but require more parameters (O(20M)) than the INR variant. To the best of our knowledge, none of the related works in the purely functional domain (Lim et al., 2023; Hagemann et al., 2023; Dutordoir et al., 2022; Kerrigan et al., 2022) provides results going beyond simple datasets. Finally, we present some qualitative results in Figures 1 and 2, clearly showing that the proposed methodology is capable of producing diverse and detailed images.

Table 1: Quantitative results on the CELEBA dataset (FID-CLIP from Kynkäänniemi et al. (2022)).

| Method | FID (↓) | FID-CLIP (↓) | Params |
|---|---|---|---|
| FDP(MLP) | 35.00 | 12.44 | O(1 M) |
| FDP(UViT) | 11.00 | 6.55 | O(20 M) |
| FD2F | 40.40 | - | O(10 M) |
| SBD | 3.30 | - | O(100 M) |
| ∞-DIFF | - | 4.57 | O(100 M) |

Figure 1: Qualitative results with MLP.
Figure 2: Qualitative results with UViT.

7 Conclusion, Limitations and Broader Impact

We presented a theoretical framework to define functional diffusion processes for generative modeling. FDPs generalize traditional score-based diffusion models to infinite-dimensional function spaces, and in this context we were the first to provide a full characterization of forward and backward dynamics, together with a formal derivation of an ELBO that allowed the estimation of the parametric score function driving the reverse dynamics. To use FDPs in practice, we carefully studied for which subset of functions it was possible to operate on a countable set of samples without losing information.
We then proceeded to introduce a series of methods to jointly model, using only a simple INR or a Transformer, an approximate functional representation of data on discrete grids, and an approximate score function. Additionally, we detailed practical training procedures for FDPs, and integration schemes to generate new samples. The implementation of FDPs for generative modeling is simple. We validated the viability of FDPs through a series of experiments on real images, where we show extremely promising results in terms of generation quality, while only using a simple MLP to learn the score network. Like other works in the literature, the proposed method can have both positive (e.g., synthesizing new data automatically) and negative (e.g., deep fakes) impacts on society, depending on the application.

8 Acknowledgments

GF gratefully acknowledges support from Huawei Paris and the European Commission (ADROIT6G Grant agreement ID: 101095363). MF gratefully acknowledges support from the AXA Research Fund and the Agence Nationale de la Recherche (grant ANR-18-CE46-0002 and ANR-19-P3IA-0002).

References

Brian D. O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.

Arpit Bansal, Eitan Borgnia, Hong-Min Chu, Jie S Li, Hamid Kazemi, Furong Huang, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Cold diffusion: Inverting arbitrary image transforms without noise. arXiv preprint arXiv:2208.09392, 2022.

Fan Bao, Chongxuan Li, Yue Cao, and Jun Zhu. All are worth words: a ViT backbone for score-based diffusion models. arXiv preprint arXiv:2209.12152, 2022.

Lukas Biewald. Experiment tracking with Weights and Biases, 2020. URL https://www.wandb.com/. Software available from wandb.com.

Vladimir Bogachev, Giuseppe Da Prato, and Michael Röckner. Existence and uniqueness of solutions for Fokker-Planck equations on Hilbert spaces. J. Evol. Equ., 10, 07 2009.

Vladimir Bogachev, Giuseppe Da Prato, and Michael Röckner. Uniqueness for solutions of Fokker-Planck equations on infinite dimensional spaces. Communications in Partial Differential Equations, 36(6):925–939, 2011.

Sam Bond-Taylor and Chris G Willcocks. ∞-diff: Infinite resolution diffusion with subsampled mollified states. arXiv preprint arXiv:2303.18242, 2023.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, et al. JAX: composable transformations of Python+NumPy programs. 5:14–24, 2018.

Shuhao Cao. Choose a transformer: Fourier or Galerkin. Advances in Neural Information Processing Systems, 34:24924–24940, 2021.

Giuseppe Da Prato and Jerzy Zabczyk. Stochastic equations in infinite dimensions. Cambridge University Press, 2014.

Arnaud Debussche. Weak approximation of stochastic partial differential equations: the nonlinear case. Mathematics of Computation, 80(273):89–117, 2011.

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 8780–8794. Curran Associates, Inc., 2021.

Tim Dockhorn, Arash Vahdat, and Karsten Kreis. Score-based generative modeling with critically-damped Langevin diffusion. In International Conference on Learning Representations, 2022.

Emilien Dupont, Hyunjik Kim, S. M. Ali Eslami, Danilo Jimenez Rezende, and Dan Rosenbaum.
From data to functa: Your data point is a function and you can treat it like one. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 5694 5725. PMLR, 17 23 Jul 2022a. Emilien Dupont, Hrushikesh Loya, Milad Alizadeh, Adam Golinski, Yee Whye Teh, and Arnaud Doucet. COIN++: Neural compression across modalities. Transactions of Machine Learning Research, 2022b. Vincent Dutordoir, Alan Saul, Zoubin Ghahramani, and Fergus Simpson. Neural diffusion processes. ar Xiv preprint ar Xiv:2206.03992, 2022. Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pp. 1126 1135. PMLR, 2017. Hans F ollmer. An entropy approach to the time reversal of diffusion processes. In Stochastic Differential Systems Filtering and Control, pp. 156 163. Springer, 1985. Hans F ollmer. Time reversal on wiener space. In Stochastic processes mathematics and physics, pp. 119 129. Springer, 1986. Giulio Franzese, Simone Rossi, Lixuan Yang, Alessandro Finamore, Dario Rossi, Maurizio Filippone, and Pietro Michiardi. How much is enough? a study on diffusion times in score-based generative models. ar Xiv preprint ar Xiv:2206.05173, 2022. H. F ollmer and A. Wakolbinger. Time reversal of infinite-dimensional diffusions. Stochastic Processes and their Applications, 22(1):59 77, 1986. ISSN 0304-4149. doi: https://doi.org/10. 1016/0304-4149(86)90114-6. URL https://www.sciencedirect.com/science/article/ pii/0304414986901146. Istv an Gy ongy. Lattice approximations for stochastic quasi-linear parabolic partial differential equations driven by space-time white noise i. Potential Analysis, 9(1):1 25, 1998. Istv an Gy ongy. Lattice approximations for stochastic quasi-linear parabolic partial differential equations driven by space-time white noise ii. Potential Analysis, 11(1):1 37, 1999. Paul Hagemann, Sophie Mildenberger, Lars Ruthotto, Gabriele Steidl, and Nicole Tianjiao Yang. Multilevel diffusion: Infinite dimensional score-based diffusion models for image generation, 2023. Erika Hausenblas. Approximation for semilinear stochastic evolution equations. Potential Analysis, 18(2):141 186, 2003a. Erika Hausenblas. Weak approximation for semilinear stochastic evolution equations. In Stochastic analysis and related topics VIII, pp. 111 128. Springer, 2003b. Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity video generation with arbitrary lengths. ar Xiv preprint ar Xiv:2211.13221, 2022. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 17, pp. 6629 6640, 2017. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 6840 6851. Curran Associates, Inc., 2020. Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation, 2021. URL https://arxiv.org/ abs/2106.15282. 
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen video: High definition video generation with diffusion models, 2022. URL https://arxiv.org/abs/2210. 02303. Emiel Hoogeboom, V ıctor Garcia Satorras, Cl ement Vignac, and Max Welling. Equivariant diffusion for molecule generation in 3D. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 8867 8887. PMLR, 17 23 Jul 2022. Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusionbased generative models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. Gavin Kerrigan, Justin Ley, and Padhraic Smyth. Diffusion generative models in infinite dimensions, 2022. URL https://arxiv.org/abs/2212.00886. Chaewon Kim, Jaeho Lee, and Jinwoo Shin. Zero-shot blind image denoising via implicit neural representations. ar Xiv preprint ar Xiv:2204.02405, 2022a. Dongjun Kim, Seungjae Shin, Kyungwoo Song, Wanmo Kang, and Il-Chul Moon. Soft truncation: A universal training technique of score-based diffusion model for high precision score estimation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 11201 11228. PMLR, 17 23 Jul 2022b. Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=2Ld Bqxc1Yv. Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In International Conference on Learning Representations, 2021. Nikola Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Learning maps between function spaces. ar Xiv preprint ar Xiv:2108.08481, 2021. Tuomas Kynk a anniemi, Tero Karras, Miika Aittala, Timo Aila, and Jaakko Lehtinen. The role of imagenet classes in fr echet inception distance. ar Xiv preprint ar Xiv:2203.06026, 2022. Yann Le Cun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010. Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. ar Xiv preprint ar Xiv:2010.08895, 2020. Jae Hyun Lim, Nikola B Kovachki, Ricardo Baptista, Christopher Beckham, Kamyar Azizzadenesheli, Jean Kossaifi, Vikram Voleti, Jiaming Song, Karsten Kreis, Jan Kautz, et al. Score-based diffusion models in function space. ar Xiv preprint ar Xiv:2302.07400, 2023. Jinglin Liu, Chengxi Li, Yi Ren, Feiyang Chen, and Zhou Zhao. Diffsinger: Singing voice synthesis via shallow diffusion mechanism. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10):11020 11028, Jun. 2022. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. 
In Proceedings of International Conference on Computer Vision (ICCV), December 2015. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ar Xiv preprint ar Xiv:1711.05101, 2017. Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2837 2845, June 2021. Annie Millet, David Nualart, and Marta Sanz. Time reversal for infinite-dimensional diffusions. Probability theory and related fields, 82(3):315 347, 1989. Sarthak Mittal, Guillaume Lajoie, Stefan Bauer, and Arash Mehrjou. From points to functions: Infinite-dimensional representations in diffusion models. In ICLR Workshop on Deep Generative Models for Highly Structured Data, 2022. Isaac Pesenson. A sampling theorem on homogeneous manifolds. Transactions of the American Mathematical Society, 352(9):4257 4269, 2000. Angus Phillips, Thomas Seror, Michael Hutchinson, Valentin De Bortoli, Arnaud Doucet, and Emile Mathieu. Spectral diffusion processes. ar Xiv preprint ar Xiv:2209.14125, 2022. Jakiw Pidstrigach, Youssef Marzouk, Sebastian Reich, and Sven Wang. Infinite-dimensional diffusion models for function spaces. ar Xiv preprint ar Xiv: 2302.10130, 2023. Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical textconditional image generation with clip latents, 2022. URL https://arxiv.org/abs/2204. 06125. Severi Rissanen, Markus Heinonen, and Arno Solin. Generative modelling with inverse heat dissipation. ar Xiv preprint ar Xiv:2206.13397, 2022. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj orn Ommer. Highresolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684 10695, June 2022. Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. Vishwanath Saragadam, Daniel Le Jeune, Jasper Tan, Guha Balakrishnan, Ashok Veeraraghavan, and Richard G Baraniuk. Wire: Wavelet implicit neural representations. ar Xiv preprint ar Xiv:2301.05187, 2023. C.E. Shannon. Communication in the presence of noise. Proceedings of the IRE, 37(1):10 21, 1949. Tony Shardlow. Numerical methods for stochastic parabolic pdes. Numerical functional analysis and optimization, 20(1-2):121 145, 1999. Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 7462 7473. Curran Associates, Inc., 2020. David Slepian. Some comments on fourier analysis, uncertainty and modeling. SIAM review, 25(3): 379 393, 1983. Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. 
In Francis Bach and David Blei (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 2256 2265, Lille, France, 07 09 Jul 2015. PMLR. Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alch e-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 12438 12448. Curran Associates, Inc., 2020. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. Brian L. Trippe, Jason Yim, Doug Tischer, David Baker, Tamara Broderick, Regina Barzilay, and Tommi Jaakkola. Diffusion probabilistic modeling of protein backbones in 3d for the motifscaffolding problem, 2022. URL https://arxiv.org/abs/2206.04119. Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661 1674, 2011. Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion GANs. In International Conference on Learning Representations (ICLR), 2022. Hyek Yoo. Semi-discretization of stochastic partial differential equations on r by a finite-difference method. Mathematics of computation, 69(230):653 666, 2000. Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. Lion: Latent point diffusion models for 3d shape generation. In Advances in Neural Information Processing Systems (Neur IPS), 2022. Juntang Zhuang, Tommy Tang, Yifan Ding, Sekhar C Tatikonda, Nicha Dvornek, Xenophon Papademetris, and James Duncan. Adabelief optimizer: Adapting stepsizes by the belief in observed gradients. Advances in neural information processing systems, 33:18795 18806, 2020. Peiye Zhuang, Samira Abnar, Jiatao Gu, Alex Schwing, Joshua M. Susskind, and Miguel Angel Bautista. Diffusion probabilistic fields. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=ik91m Y-2GN. Supplementary Material: Continuous-Time Functional Diffusion Processes A Reverse Functional Diffusion Processes In this Section, we review the mathematical details to obtain the backward FDP discussed in Theorem 1. Depending on the considered class of noise, different approaches are needed. First, we present in Appendix A.1 the conditions to ensure existence of the backward process , which we use if the C operator is an identity matrix, C = I. Then we move to a different approach in Appendix A.2 for the case C = I. 
A.1 Follmer Formulation The work in F ollmer (1986) is based on a finite entropy condition, which we report here as Condition 1. One simple way to ensure that the condition is satisfied is to assume: Condition 1. For a given k, define Q(k) to be the path measure corresponding to the (infinite) system d Xi t = bi(Xt, t)dt + d W i t , i = k d Xi t = d W k t , i = k. (22) We say that Q satisfies the finite local entropy condition if KL Q Q(k) < , k. Define F(i) t = σ(Xi 0, Xj s, 0 s t, j = i). Assumption 1. 0 bi(Xt, t)2dt + X j =i E[ Z T bj(Xt, t) E h bj(Xt, t) | F(i) t i 2 dt] < , Q(i)a.s. (23) Notice that if Assumption 1 is true, then Condition 1 holds (F ollmer (1986), Thm. 2.23) Theorem 3. If KL Q Q(k) < , then KL h ˆQ ˆQ(k) i < . Proof. The proof can be obtained by adapting the result of Lemma 3.6 of F ollmer & Wakolbinger (1986). This Theorem states that if the forward FDP path measure Q satisfies the finite local entropy condition, then also the reverse FDP path measure ˆQ satisfies the finite local entropy condition. Theorem 4. Let Q be a finite entropy measure. Then: ( d Xk t = bk(Xt, t)dt + d W k t , under Q d ˆXk t = ˆbk( ˆXt, t)dt + d ˆW k t , under ˆQ (24) where: log ρ(d) t (xk | xj, j = k) xk = ˆbk(x, T t) + bk(x, t) (25) Proof. For the proof, we refer to Theorem 3.14 of F ollmer & Wakolbinger (1986). A.2 Millet Formulation Let L2(R) = {x H : P ri(xi)2 < }. For simplicity, we overload the notation of the letter K, and use it for generic constants, that might be different on a case by case basis. Assumption 2. x L2(R), sup t { X ri(bi(x, t))2} + X (ri)2 K(1 + X ri(xi)2) x, y L2(R), sup t { X ri(bi(x, t) bi(y, t))2} K X ri(xi yi)2 This assumption is simply the translation of H1 from Millet et al. (1989) to our notation. Assumption 3. There exists an increasing sequence of finite subsets J(n), n N, n J(n) = N such that n N, M > 0 there exists a constant K(M, n) such that the following holds: sup x |bi(x, t)| : sup j J(n) |xj| M Again, this assumption is simply the translation of H5 from Millet et al. (1989) to our notation. Assumption 4. Either i): x, y L2(R), sup t { X ri(bi(x, t) bi(y, t))2} K X (ri)2(xi yi)2, or ii): i, bi(x) is a function of x for at most M coordinates and x, y L2(R), sup t { X (ri)2(bi(x, t) bi(y, t))2} K X (ri)2(xi yi)2. This corresponds to satisfying either H3 or jointly H2 and H4 of Millet et al. (1989). For simplicity, we can combine together the different assumptions into Assumption 5. Let Assumption 2, Assumption 3, and Assumption 4 hold. Finally, we state required assumptions about the density: Assumption 6. Suppose that the initial condition is X0 L2(R). Assume that the conditional law of xi given xj, j = i has density ρ(d) t (xi | xj, j = i) w.r.t Lebesgue measure on R. Assume that R 1 t0 R dxi (ρ(d) t (xi | xj, j = i))|dxiρt(dxj =i)dt < , for fixed subset J N,t0 > 0 and DJ = {(Q j / J R), Kj compact in R} L2(R). We reported in our notation the content of Theorem 4.3 of Millet et al. (1989). This can be used to prove the existence of the backward process. A.3 Proof of Theorem 1 If R = I, then we assume Assumption 1. Consequently, Q is a finite entropy measure. Then Theorem 4 holds, from which the desired result. If, instead R = I, then we require Assumption 5,Assumption 6. Application of Thm 4.3 of Millet et al. (1989) allows to prove the validity of Theorem 1 also in this case. A.3.1 Proof of Corollary 1 Assumption 5 is required directly. We need to show that with the considered restrictions Assumption 6 is valid. i ri < , then P i(ri)2 = Ka < . 
Moreover, $b_i(x^i, t)^2 \le K_b^2 (x^i)^2$. Then, for all $x \in L^2(R)$, the following holds:
$$
\sup_t \Big\{ \sum_i r^i\, b_i(x, t)^2 \Big\} + \sum_i (r^i)^2 \le \sum_i r^i K_b^2 (x^i)^2 + K_a \le \max(K_a, K_b^2)\Big(1 + \sum_i r^i (x^i)^2\Big).
$$
Similarly, for all $x, y \in L^2(R)$, we have
$$
\sup_t \Big\{ \sum_i r^i \big(b_i(x, t) - b_i(y, t)\big)^2 \Big\} \le \sum_i r^i K_b^2 (x^i - y^i)^2.
$$
Thus Assumption 2 is satisfied. Since $b_i(x, t)$ is bounded and independent of $t$, Assumption 3 is satisfied, as explicitly discussed in Millet et al. (1989). Finally, since $b_i(x)$ is a function of at most $M = 1$ coordinates of $x$, and
$$
\sup_t \Big\{ \sum_i (r^i)^2 \big(b_i(x, t) - b_i(y, t)\big)^2 \Big\} \le \sum_i (r^i)^2 K_b^2 (x^i - y^i)^2,
$$
Assumption 4 is satisfied. Combining the three, Assumption 5 holds.

A.4 Girsanov Regularity

Condition 2. Assume that $\gamma_\theta(x, t)$ is an $\hat{\mathcal{F}}$-measurable process and that either
$$
\int_0^T \big\| \gamma_\theta(\hat{X}_t, t) \big\|^2_{R^{-\frac{1}{2}}H}\, dt < \infty \quad \hat{Q}\text{-a.s.}, \qquad \text{or} \qquad \exists\, \delta > 0: \ \mathbb{E}_{\hat{Q}}\Big[ \big\| \gamma_\theta(\hat{X}_\delta, \delta) \big\|^2_{R^{-\frac{1}{2}}H} \Big] < \infty. \tag{27}
$$
This is equivalent to the regularity condition in Eq. 10.23 of Da Prato & Zabczyk (2014), or to Proposition 10.17 in Da Prato & Zabczyk (2014).

A.5 Proof of KL divergence expression

We leverage Equation (7), together with the chain rule for the KL divergence between path measures, to express the Kullback–Leibler divergence as:
$$
\begin{aligned}
\mathrm{KL}\big[\hat{Q} \,\|\, \hat{P}^{\chi_T}\big]
&= \mathbb{E}_{\hat{Q}}\!\left[ \int_0^T \big\langle \gamma_\theta(\hat{X}_t, t),\, d\hat{W}_t \big\rangle_{R^{-\frac{1}{2}}H} + \frac{1}{2} \int_0^T \big\| \gamma_\theta(\hat{X}_t, t) \big\|^2_{R^{-\frac{1}{2}}H}\, dt \right] + \mathrm{KL}\left[\rho_T \,\|\, \chi_T\right] \\
&= \frac{1}{2}\, \mathbb{E}_{\hat{Q}}\!\left[ \int_0^T \big\| \gamma_\theta(\hat{X}_t, t) \big\|^2_{R^{-\frac{1}{2}}H}\, dt \right] + \mathrm{KL}\left[\rho_T \,\|\, \chi_T\right] \\
&= \frac{1}{2}\, \mathbb{E}_{Q}\!\left[ \int_0^T \big\| \gamma_\theta(X_t, t) \big\|^2_{R^{-\frac{1}{2}}H}\, dt \right] + \mathrm{KL}\left[\rho_T \,\|\, \chi_T\right],
\end{aligned}
$$
where the stochastic-integral term vanishes in expectation. Moreover, the same divergence admits an analogous decomposition under the forward measure $Q$, involving the initial marginal $\rho_0$ and the term $\mathrm{KL}[\rho_0 \,\|\, \chi_0]$; combining the two results we obtain Equation (8).

A.6 Conditional score matching

In this subsection we prove the equality in Equation (13). Starting from the left-hand side:
$$
\begin{aligned}
&\int_0^T \mathbb{E}_Q\Big[ \big\| \gamma_\theta(X_t, t) \big\|^2_{R^{-\frac{1}{2}}H} \Big]\, dt
= \int_0^T\!\!\int_H \big\| \gamma_\theta(x, t) \big\|^2_{R^{-\frac{1}{2}}H}\, d\rho_t(x)\, dt \\
&\quad= \int_0^T\!\!\int_H \big\| D_x \log \rho^{(d)}_{T-t}(x) - s_\theta(x, T-t) \big\|^2_{R^{-\frac{1}{2}}H}\, d\rho_t(x)\, dt
= \int_0^T\!\!\int \big\| D_x \log \rho^{(d)}_{t}(x) - s_\theta(x, t) \big\|^2_{R^{-\frac{1}{2}}H}\, d\rho_t(x, x_0)\, dt \\
&\quad= \int_0^T\!\!\int \big\| D_x \log \rho^{(d)}_{t}(x) - D_x \log \rho^{(d)}_{t}(x \mid x_0) + D_x \log \rho^{(d)}_{t}(x \mid x_0) - s_\theta(x, t) \big\|^2_{R^{-\frac{1}{2}}H}\, d\rho_t(x, x_0)\, dt \\
&\quad= \int_0^T\!\!\int \Big( \big\| D_x \log \rho^{(d)}_{t}(x) - D_x \log \rho^{(d)}_{t}(x \mid x_0) \big\|^2_{R^{-\frac{1}{2}}H} + \big\| D_x \log \rho^{(d)}_{t}(x \mid x_0) - s_\theta(x, t) \big\|^2_{R^{-\frac{1}{2}}H} \\
&\qquad\qquad + 2 \Big\langle D_x \log \rho^{(d)}_{t}(x) - D_x \log \rho^{(d)}_{t}(x \mid x_0),\ D_x \log \rho^{(d)}_{t}(x \mid x_0) - s_\theta(x, t) \Big\rangle \Big)\, d\rho_t(x, x_0)\, dt.
\end{aligned}
$$
To simplify the last term, notice that
$$
\rho^{(d)}_t(x^i \mid x^{j \neq i})\, dx^i = d\rho_t(x^i \mid x^{j \neq i}) = \int_{x_0} d\rho_t(x^i, x_0 \mid x^{j \neq i}) = dx^i \int_{x_0} \rho^{(d)}_t(x^i \mid x_0, x^{j \neq i})\, d\rho_t(x_0 \mid x^{j \neq i}).
$$
Then, computing
$$
\begin{aligned}
\int \frac{d}{dx^i} \log \rho^{(d)}_t(x^i \mid x^{j \neq i}, x_0)\, d\rho_t(x, x_0)
&= \int \frac{\frac{d}{dx^i} \rho^{(d)}_t(x^i \mid x^{j \neq i}, x_0)}{\rho^{(d)}_t(x^i \mid x^{j \neq i}, x_0)}\, d\rho_t(x, x_0)
= \int \frac{d}{dx^i} \rho^{(d)}_t(x^i \mid x^{j \neq i}, x_0)\, dx^i\, d\rho_t(x_0, x^{j \neq i}) \\
&= \int \frac{d}{dx^i} \Big( \int_{x_0} \rho^{(d)}_t(x^i \mid x^{j \neq i}, x_0)\, d\rho_t(x_0 \mid x^{j \neq i}) \Big)\, dx^i\, d\rho_t(x^{j \neq i})
= \int \frac{d}{dx^i} \rho^{(d)}_t(x^i \mid x^{j \neq i})\, dx^i\, d\rho_t(x^{j \neq i}) \\
&= \int \frac{d}{dx^i} \log \rho^{(d)}_t(x^i \mid x^{j \neq i})\, \rho^{(d)}_t(x^i \mid x^{j \neq i})\, dx^i\, d\rho_t(x^{j \neq i})
= \int \frac{d}{dx^i} \log \rho^{(d)}_t(x^i \mid x^{j \neq i})\, d\rho_t(x),
\end{aligned}
$$
we conclude that
$$
\int \Big\langle D_x \log \rho^{(d)}_t(x) - D_x \log \rho^{(d)}_t(x \mid x_0),\ s_\theta(x, t) \Big\rangle\, d\rho_t(x, x_0) = 0.
$$
Combining the above and rearranging the terms, we get the desired Equation (13).

A.7 Explicit expression of score function

As mentioned in the main text, we consider the case $f = 0$. In this case, there exists a weak solution to Equation (1) given by:
$$
X_t = \exp(tA)\, X_0 + \int_0^t \exp\big((t-s)A\big)\, dW_s. \tag{28}
$$
Consequently, the true score function has expression:
$$
\begin{aligned}
\frac{d}{dx^i} \log \rho^{(d)}_t(x^i \mid x^{j \neq i})
&= \frac{\frac{d}{dx^i} \rho^{(d)}_t(x^i \mid x^{j \neq i})}{\rho^{(d)}_t(x^i \mid x^{j \neq i})}
= \frac{\int_{x_0} \frac{d}{dx^i} \rho^{(d)}_t(x^i \mid x_0, x^{j \neq i})\, d\rho_t(x_0 \mid x^{j \neq i})}{\rho^{(d)}_t(x^i \mid x^{j \neq i})} \\
&= \frac{\int_{x_0} (s^i)^{-1}\big( -x^i + \exp(t b_i)\, x^i_0 \big)\, \rho^{(d)}_t(x^i \mid x_0, x^{j \neq i})\, d\rho_t(x_0 \mid x^{j \neq i})}{\rho^{(d)}_t(x^i \mid x^{j \neq i})} \\
&= (s^i)^{-1}\left( -x^i + \frac{\int_{x^i_0} \exp(t b_i)\, x^i_0\, \rho^{(d)}_t(x^i, x^i_0 \mid x^{j \neq i})\, dx^i_0}{\rho^{(d)}_t(x^i \mid x^{j \neq i})} \right) \\
&= (s^i)^{-1}\left( -x^i + \int_{x^i_0} \exp(t b_i)\, x^i_0\, \rho^{(d)}_t(x^i_0 \mid x)\, dx^i_0 \right),
\end{aligned}
$$
where $s^i = r^i\, \frac{\exp(2 b_i t) - 1}{2 b_i}$. This is exactly the desired Equation (11). Similar calculations allow proving $D_x \log \rho^{(d)}_t(x \mid x_0) = -S(t)^{-1}\big( x - \exp(tA)\, x_0 \big)$.

B Fokker–Planck equation

In this Section we discuss the infinite-dimensional generalization of the classical Fokker–Planck equation. We can associate to Eq. (1) the differential operator:
$$
\mathcal{L}_0 u(x, t) = D_t u(x, t) + \underbrace{\frac{1}{2}\, \mathrm{Tr}\big[ R\, D^2_x u(x, t) \big] + \big\langle Ax + f(x, t),\ D_x u(x, t) \big\rangle}_{\mathcal{L} u(x, t)}, \quad x \in H,\ t \in [0, T], \tag{29}
$$
where $D_t$ is the time derivative and $D_x, D^2_x$ are the first- and second-order Fréchet derivatives in space. The domain of the operator $\mathcal{L}_0$ is $D(\mathcal{L}_0)$, the linear span of real parts of functions $u_{\phi, h} = \phi(t) \exp(i \langle x, h(t) \rangle)$, $x \in H$, $t \in [0, T]$, where $\phi \in C^1([0, T])$, $\phi(T) = 0$, and $h \in C^1([0, T]; D(A^*))$, where $^*$ indicates the adjoint. Provided appropriate conditions are satisfied (see for example Bogachev et al. (2009, 2011)), the time-varying measure $\rho_t(dx)\, dt$ exists, is unique, and solves the Fokker–Planck equation $\mathcal{L}^*_0 \rho_t = 0$.

C Uncertainty principle

We clarify here that Hilbert spaces of square-integrable functions are not, in general, simultaneously homogeneous and separable. For example, while $\mathbb{R}$ is homogeneous, the set of square-integrable functions over $\mathbb{R}$ is not separable, since a basis is the uncountable set $\{\cos(2\pi\nu p), \sin(2\pi\nu p)\},\ \nu \in \mathbb{R}$. Then, FDP requirements are not met, as we need a countable basis. Moreover, we would in general need an infinite number of samples (a grid over the whole of $\mathbb{R}$) to reconstruct the functions. Conversely, a set like the interval $I = [0, 1] \subset \mathbb{R}$ has the countable basis $\{\cos(2\pi t p), \sin(2\pi t p)\},\ t \in \mathbb{Z}$ (and thus is separable) and, considering $x$ to be band-limited, a sampling grid with finite cardinality allows reconstruction of the function. However, $I$ is not homogeneous, as no isometry group exists. Consequently, Theorem 2 is not applicable. To fix the issue, one could naively think of extending any function defined over $I$ to the whole of $\mathbb{R}$ by considering $\bar{x}[p] = x[p],\ p \in I$ and $\bar{x}[p] = 0,\ p \notin I$. Obviously, if $x \in L^2(I)$ then $\bar{x} \in L^2(\mathbb{R})$. However, since $\bar{x}$ has finite support, it cannot be band-limited, making such an approach not a viable solution. In classical signal processing literature, this problem is usually referred to as the uncertainty principle (Slepian, 1983).

D A complete example

We present an example in which we cast Equation (20) for square-integrable functions over the interval $I = [0, 1]$, $L^2(I)$. In this case, one natural selection for the basis is the Fourier basis³ $e_k = \{\dots, \exp(-j2\pi 2p), \exp(-j2\pi p), 1, \exp(j2\pi p), \exp(j2\pi 2p), \dots\}$. Assume the operator $A$ to be a pseudo-differential operator, such that $\langle Ax, e_k \rangle = b_k x^k$.
Also, assume that $b_k, r_k$ are selected such that the conditions of Corollary 1 are met, and consequently the backward process exists. Since we are working with samples collected on the grid $x[i/N]$, we first map the samples to the frequency domain, and then build a Fourier-like representation with a finite set of sinusoids. We define the mapping $F(z^i)^k \overset{\mathrm{def}}{=} \sum_{i=0}^{N-1} z^i \exp\big(-j2\pi k \tfrac{i}{N}\big)$ and its inverse $I(z^i)^k \overset{\mathrm{def}}{=} N^{-1} \sum_{i=0}^{N-1} z^i \exp\big(j2\pi k \tfrac{i}{N}\big)$. This suggests considering the following expression for the interpolating functions:
$$
\xi_i(p) = \frac{1}{N} \sum_{k=0}^{N-1} e_k \exp\Big(-j2\pi k \frac{i}{N}\Big) = \frac{1}{N} \sum_{k=0}^{N-1} \exp\Big(j2\pi k \big(p - \tfrac{i}{N}\big)\Big).
$$
Those functions are indeed nothing but a frequency-truncated version of the sinc function, which is the classical reconstruction function of the sampling theorem for 1-D signals. Moreover, $\langle \xi_i, \zeta_k \rangle = \delta(i - k)$. We are now ready to show: i) the expression of the forward process, ii) the expression of the parametric score function $s_\theta$ and of $\gamma_\theta$, iii) the computation of the ELBO and, finally, iv) the expression of the backward process.

The forward process defined in Equation (20) has expression:
$$
dX_t\big[\tfrac{k}{N}\big] = I\big( b_l\, F(X_t[\tfrac{i}{N}])^l \big)^k\, dt + d\bar{W}_t\big[\tfrac{k}{N}\big], \quad k = 1, \dots, |Z|, \tag{30}
$$
where $d\bar{W}_t[\tfrac{k}{N}] \approx F(dW^i_t)^k$. Simple calculations show that $X_t[\tfrac{k}{N}]$ is equal in distribution to
$$
X_t\big[\tfrac{k}{N}\big] = I\Big( \exp(b_l t)\, F(X_0[\tfrac{i}{N}])^l + \sqrt{s^l}\, \epsilon^l \Big)^k, \tag{31}
$$
where $s^l = \langle S(t) e_l, e_l \rangle = r^l\, \frac{\exp(2 b_l t) - 1}{2 b_l}$ and $\epsilon^l \sim \mathcal{N}(0, 1)$, allowing simulation of the forward process in a single step. The parametric score function can be approximated as:
$$
s_\theta\Big( \sum_i X_t[\tfrac{i}{N}]\, \xi_i,\ t \Big)\big[\tfrac{i}{N}\big] \approx I\Big( (s^k)^{-1} \Big( \exp(b_k t)\, F\big( n(g(X_t[\tfrac{l}{N}]), t, \theta)[\tfrac{l}{N}] \big)^k - F(X_t[\tfrac{l}{N}])^k \Big) \Big)^i, \tag{32}
$$
and, similarly,
$$
\gamma_\theta\Big( \sum_i X_t[\tfrac{i}{N}]\, \xi_i,\ \sum_i X_0[\tfrac{i}{N}]\, \xi_i,\ t \Big)\big[\tfrac{i}{N}\big] \approx I\Big( (s^k)^{-1} \exp(b_k t)\, F\big( n(g(X_t[\tfrac{l}{N}]), t, \theta)[\tfrac{l}{N}] - X_0[\tfrac{l}{N}] \big)^k \Big)^i. \tag{33}
$$

³We stress that, although we should consider a real Hilbert space, we select the complex exponentials to avoid cluttering the notation. It is possible to select $\{1, \cos(2\pi p), \sin(2\pi p), \cos(2\pi 2p), \sin(2\pi 2p), \dots\}$ as a basis and, redoing the calculations in this Section, obtain a scheme functionally equivalent to the one based on the complex basis.

Combining Equation (31) and Equation (33), we can fully characterize the training objective defined in Equation (19). Then, it is possible to optimize the value of the parameters $\theta$ with any gradient-based optimizer. Finally, the backward process approximation is expressed as:
$$
d\hat{X}_t\big[\tfrac{k}{N}\big] = \Big( -I\big( b_l\, F(\hat{X}_t[\tfrac{i}{N}])^l \big)^k + I\Big( r^l\, F\Big( s_\theta\big( \textstyle\sum_i \hat{X}_t[\tfrac{i}{N}]\, \xi_i,\ T - t \big)[\tfrac{i}{N}] \Big)^l \Big)^k \Big)\, dt + d\bar{W}_t\big[\tfrac{k}{N}\big], \quad k = 1, \dots, |Z|, \tag{34}
$$
from which new samples can be generated.

We start by proving Equation (30). Starting from the drift term of Equation (20), we have the following chain of equalities:
$$
\begin{aligned}
\Big\langle A \sum_{i=0}^{N-1} X_t[\tfrac{i}{N}]\, \xi_i,\ \zeta_k \Big\rangle
&= \Big\langle \sum_{i=0}^{N-1} X_t[\tfrac{i}{N}]\, A\, \frac{1}{N} \sum_{l=0}^{N-1} e_l \exp\big(-j2\pi l \tfrac{i}{N}\big),\ \zeta_k \Big\rangle
= \Big\langle \sum_{i=0}^{N-1} X_t[\tfrac{i}{N}]\, \frac{1}{N} \sum_{l=0}^{N-1} b_l\, e_l \exp\big(-j2\pi l \tfrac{i}{N}\big),\ \zeta_k \Big\rangle \\
&= \sum_{i=0}^{N-1} X_t[\tfrac{i}{N}]\, \frac{1}{N} \sum_{l=0}^{N-1} b_l \exp\big(j2\pi l \tfrac{k}{N}\big) \exp\big(-j2\pi l \tfrac{i}{N}\big)
= \frac{1}{N} \sum_{l=0}^{N-1} b_l \exp\big(j2\pi l \tfrac{k}{N}\big)\, F(X_t[\tfrac{i}{N}])^l
= I\big( b_l\, F(X_t[\tfrac{i}{N}])^l \big)^k.
\end{aligned}
$$
The noise term $d\bar{W}_t[\tfrac{k}{N}]$ is approximated as:
$$
d\bar{W}_t\big[\tfrac{k}{N}\big] = \big\langle dW_t,\ \zeta_k \big\rangle = \sum_{i} dW^i_t\, \langle e_i,\ \zeta_k \rangle \approx \sum_{i=0}^{N-1} dW^i_t \exp\big(-j2\pi i \tfrac{k}{N}\big) = F(dW^i_t)^k,
$$
where we are truncating the (infinite) sum. The score term has expression:
$$
s_\theta\Big( \sum_i X_t[\tfrac{i}{N}]\, \xi_i,\ t \Big) = \big(S(t)\big)^{-1} \Big( \exp(tA)\, n\big(g(X_t[\tfrac{i}{N}]), t, \theta\big) - \sum_i X_t[\tfrac{i}{N}]\, \xi_i \Big),
$$
whose $k$-th coefficient reads
$$
(s^k)^{-1} \Big( \exp(b_k t)\, \big\langle n\big(g(X_t[\tfrac{i}{N}]), t, \theta\big),\ e_k \big\rangle - \underbrace{\big\langle \textstyle\sum_i X_t[\tfrac{i}{N}]\, \xi_i,\ e_k \big\rangle}_{= F(X_t[\frac{i}{N}])^k \,\overset{\mathrm{def}}{=}\, C^k_t} \Big)
\approx (s^k)^{-1} \Big( \exp(b_k t)\, N^{-1} \sum_r n\big(g(X_t[\tfrac{i}{N}]), t, \theta\big)\big[\tfrac{r}{N}\big] \exp\big(-j2\pi k \tfrac{r}{N}\big) - C^k_t \Big),
$$
where the approximation is due to the substitution of the explicit scalar product with its discretized version through $F$.
When evaluated on the grid of interest, this yields
$$
s_\theta\Big( \sum_i X_t[\tfrac{i}{N}]\, \xi_i,\ t \Big)\big[\tfrac{i}{N}\big]
\approx \sum_{k} \frac{ \exp(b_k t)\, N^{-1} \sum_r n\big(g(X_t[\tfrac{i}{N}]), t, \theta\big)[\tfrac{r}{N}] \exp\big(-j2\pi k \tfrac{r}{N}\big) - C^k_t }{ s^k }\, \exp\big(j2\pi k \tfrac{i}{N}\big),
$$
which is the composition of $F$, the per-frequency rescaling by $\exp(b_k t)$ and $(s^k)^{-1}$, and $I$ appearing on the right-hand side of Equation (32). The value of $\gamma_\theta$, Equation (33), and the expression of the backward process, Equation (34), are obtained similarly, considering the above results.

E Implementation Details and Additional Experiments

In all experiments we use the complex Fourier basis for the Hilbert spaces, indexed by $k$; this extends to the 2-dimensional case what we described in Appendix D. As stated in the main paper, our practical implementation sets $f = 0$: then, we only need to specify the values of the parameters $b_k, r_k$. In our implementation we consider an extended class of SDEs that includes time-varying multiplicative coefficients in front of the drift and diffusion terms, as done for example in the Variance Preserving SDE originally described by Song & Ermon (2020). This can simply be interpreted as a time-rescaled version of autonomous SDEs.

E.1 Architectural details

INR-based score network. In our implementation, we use the original INR architecture (Sitzmann et al., 2020). For the specific denoising task we consider in our model, we extend the input of the network architecture to include the corrupted version of the input sample and the diffusion time $t$, in addition to the spatial coordinates. We emphasize that our architectural design is simple, and does not require self-attention mechanisms (Song & Ermon, 2020). The non-linearity we use in our network is a Gabor wavelet activation function (Saragadam et al., 2023). Furthermore, we found the inclusion of skip connections to be beneficial. As stated in the main paper, we consider the modulation approach to INRs. In particular, we implement the meta-learning scheme described by Dupont et al. (2022b); Finn et al. (2017): the outer loop is dedicated to learning the base parameters of the model, while the inner loop focuses on refining the base parameters for each input sample (see the sketch at the end of this subsection). In the outer loop, the optimization algorithm is AdaBelief (Zhuang et al., 2020), sweeping the learning rate over {1e-4, 1e-5, 1e-6}. We found the use of a cosine warm-up schedule to be beneficial for avoiding training instabilities and convergence to sub-optimal solutions. The inner loop is implemented by using three steps of stochastic gradient descent (SGD).

Transformer-based score network. In our experiments with the Transformer architecture for score modeling, we employed the UViT backbone (Bao et al., 2022). This backbone processes all inputs, be they temporal or noisy image patches, as tokens. Rather than utilizing UViT's default learned positional embeddings, we adapted it to integrate 2D sinusoidal positional encodings. For the noisy input images, patch embeddings transform them into a sequence of tokens. Notably, we chose a patch size of 1 to fully harness the functional properties of our framework. Time embeddings are computed from the diffusion time and then concatenated with the image tokens. Our chosen Transformer architecture comprises 7 layers, each composed of a self-attention mechanism with 8 attention heads and a feedforward layer. Furthermore, we use long skip connections between the shallower and deeper layers, as outlined by Bao et al. (2022). For optimization during training, we utilized the AdamW (Loshchilov & Hutter, 2017) algorithm with a weight decay of 0.03. We employed a cosine warm-up schedule for the learning rate, which ends at a value of 2e-4.
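To make the modulation-based inner loop concrete, the following is a minimal sketch under simplifying assumptions: the toy network uses sine activations and shift modulations as a stand-in for our actual architecture (Gabor wavelet activations and skip connections), and all names (ModulatedSiren, inner_loop) are illustrative rather than taken from our code base.

```python
# Minimal sketch of the MAML-style inner loop on per-sample modulations (illustrative only).
import torch
import torch.nn as nn

class ModulatedSiren(nn.Module):
    """Toy modulated INR: sine activations, with a per-sample shift modulation added
    to the hidden pre-activations (a simplified stand-in for our architecture)."""
    def __init__(self, in_dim=1, hidden=64, out_dim=1):
        super().__init__()
        self.modulation_dim = hidden
        self.l1 = nn.Linear(in_dim, hidden)
        self.l2 = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, out_dim)

    def forward(self, coords, modulation):
        h = torch.sin(30.0 * self.l1(coords))
        h = torch.sin(self.l2(h) + modulation)   # shift modulation
        return self.out(h)

def inner_loop(inr, coords, target, n_steps=3, lr=1e-2):
    """Adapt only the per-sample modulation with a few differentiable SGD steps."""
    modulation = torch.zeros(inr.modulation_dim, requires_grad=True)
    for _ in range(n_steps):
        loss = ((inr(coords, modulation) - target) ** 2).mean()
        (grad,) = torch.autograd.grad(loss, modulation, create_graph=True)
        modulation = modulation - lr * grad      # differentiable SGD update
    return modulation

# Usage: the outer loop differentiates through the inner steps to update the shared weights.
coords = torch.linspace(0.0, 1.0, 32).unsqueeze(-1)               # grid of coordinates
target = torch.sin(2.0 * torch.pi * coords) + 0.1 * torch.randn_like(coords)
inr = ModulatedSiren()
m = inner_loop(inr, coords, target)                               # adapted modulation
outer_loss = ((inr(coords, m) - target) ** 2).mean()
outer_loss.backward()                                             # gradients w.r.t. shared weights
```

In the actual training loop the outer optimizer (AdaBelief) plays the role of the final `backward()` call above, updating the shared weights by differentiating through the three inner SGD steps.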
Figure 3: Left: real (red) and generated (blue) samples. Center and right: samples diffused for times 0.2 and 1.0, respectively.

E.2 Additional results

E.2.1 A Toy example

We present some qualitative examples on a synthetic dataset of functions in $L^2([-1, 1])$, and therefore consider the settings described in Appendix D. The Quadratic data is generated as in Phillips et al. (2022), i.e. $X_0[p] = q p^2 + \epsilon$, where $\epsilon \sim \mathcal{N}(0, 0.1)$ and $q$ is a binary random variable that takes values $\{-1, 1\}$ with equal probability. Concerning the design of the forward SDE, we select $b_k = \min(-k, -10)$ and $r_k = k^{-2}$ (thus satisfying Corollary 1). The real data is generated considering a grid of 100 equally spaced points. We show some qualitative results in Figure 3: on the left, real (red) and FDP-generated (blue) samples show good agreement; the center and right plots depict examples of diffused samples for times 0.2 and 1.0, respectively.

E.2.2 MNIST data-set

We evaluate our approach on a simple data-set, using MNIST 32×32 (LeCun et al., 2010). In this experiment, we compare our method against the baseline score-based diffusion models from Song et al. (2021), which we take from the official code repository https://github.com/yang-song/score_sde. The baseline implements the score network using a U-NET with self-attention and skip connections, as indicated by current best practices, which amounts to $O(10^8)$ parameters. Instead, our method uses a score-network/INR implemented as a simple MLP with 8 layers and 128 neurons in each layer. The activation function is a sinusoidal non-linearity (Sitzmann et al., 2020). Our model counts $O(10^5)$ parameters. We consider an SDE with parameters $r_{k,m} = \frac{176}{k^2 + m^2 + 2}$⁴ and $b_{k,m} = \min\big((k^2 + m^2 + 0.3)^{-1} + \frac{r_{k,m}}{4},\ 3.6\big)$. These values have been determined empirically by observing the power spectral density of the data-set, to ensure a well-behaved signal-to-noise ratio evolution throughout the diffusion process for all frequency components.

⁴Strictly speaking, the series $\sum_{k,m} r_{k,m}$ is not convergent. We experimented with changing the decay to ensure convergence, but we observed no numerical difference with respect to the settings we used. It is an interesting avenue for future work to study whether this approximation has an impact for higher-resolution data-sets.

Figure 6: Example of super-resolution of MNIST images. From left to right: initial (training) resolution to higher resolutions.

Figure 4: MNIST samples generated according to our proposed FDPs.

Figure 5: Top right: MNIST real samples. Top left: each sample is diffused for a given random time. Bottom: output of the INR for the corresponding noisy input image.

In Figure 4 we report uncurated samples generated according to our FDP. In Figure 5 we present various intermediate noisy versions of the training data, to illustrate the kind of noise we use to train the score network, together with the output of the denoising INR. We also report the Fréchet Inception Distance (FID) score computed using 16k samples (lower is better). For the baseline we obtain FID=0.05, whereas for the proposed method we obtain FID=0.43. Although the FID score is in favor of the baseline, we believe that our results obtained with a simple MLP are very promising, as further corroborated by experiments on a more complex dataset, which we show next.
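As an illustration of how the single-step forward corruption of Equation (31) is simulated for 2-D data, the snippet below is a minimal NumPy sketch. It glosses over DFT normalization constants relative to the exact convention of Appendix D, and it assumes a sign convention in which the drift coefficients are negative (contracting); the coefficient formulas mirror those given above.

```python
# Minimal sketch of the single-step forward corruption for 2-D data (illustrative only).
import numpy as np

def forward_sample(x0, t, rng):
    """Decay each frequency of x0 by exp(b t) and add Gaussian noise whose power
    spectrum is shaped by the per-frequency variance s."""
    N = x0.shape[0]
    freqs = np.fft.fftfreq(N, d=1.0 / N)                          # integer frequency indices
    kk, mm = np.meshgrid(freqs, freqs, indexing="ij")
    r = 176.0 / (kk**2 + mm**2 + 2.0)                             # diffusion coefficients r_{k,m}
    b = -np.minimum(1.0 / (kk**2 + mm**2 + 0.3) + r / 4.0, 3.6)   # drift coefficients (sign assumed)
    s = r * (np.exp(2.0 * b * t) - 1.0) / (2.0 * b)               # per-frequency variance
    mean = np.fft.ifft2(np.exp(b * t) * np.fft.fft2(x0)).real     # exp(tA) X_0 on the grid
    noise = np.fft.ifft2(np.sqrt(s) * np.fft.fft2(rng.standard_normal(x0.shape))).real
    return mean + noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 32))     # stand-in for a 32x32 image
xt = forward_sample(x0, t=0.3, rng=rng)
```

Inspecting the ratio between the decayed signal power and the noise variance per frequency, as a function of t, is how we checked that the signal-to-noise ratio evolution is well behaved.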
Super resolution. We demonstrate how an INR trained at a 32×32 resolution on MNIST can be seamlessly applied to increase the resolution of the generated data points. Given that the INR establishes a mapping between a grid and its corresponding values, we initiated the diffusion process from a grid at a 32×32 resolution. The diffusion process continued until the final step, where we used the last learned parameters to extrapolate outputs at a higher resolution. A significant advantage of using INRs is the ability to conduct the resource-intensive sampling at a lower resolution and then effortlessly transition between resolutions with just a single forward call to the model. Figure 6 shows our results at different resolutions.

E.2.3 CELEBA data-set

For the CELEBA data-set we considered the same SDE as for the MNIST experiment. Results reported in the main paper have been obtained using a numerical integration scheme based on a variant of the predictor-corrector scheme of Song & Ermon (2020), which we adapted to the SDEs we consider in our work. In Figure 8 and Figure 9 we report additional uncurated samples obtained with the INR and the Transformer, respectively. We describe further experiments in the following.

Conditional generation. We consider three use-cases for conditional generation: in-painting, de-blurring, and colorization. All these additional experiments were completed using the same architecture and configuration as the unconditional generation described above.

In-painting. We perform in-painting experiments by adopting the same approach described by Song & Ermon (2020), and report results in Figure 10. Original images (left column of Figure 10) are masked (center column of Figure 10), where we set the value of the missing pixels to 0. The right column of Figure 10 shows the results of the in-painting scheme where, qualitatively, it is possible to observe that the conditional generation is able to fill the missing portion of the images while maintaining good semantic coherence.

De-blurring. Our FDPs are naturally suited for the de-blurring use-case, as shown in Figure 12. In this experiment, we take the original images (left column of Figure 12) and filter them with a low-pass filter (center column of Figure 12). The de-blurring scheme is implemented as the in-painting approach described by Song & Ermon (2020), with the only difference that the masking at each update is applied in the frequency domain (see the sketch below). The right column of Figure 12 shows that our technique gracefully recovers missing details and is capable of producing high-quality images conditioned on the distorted inputs.

Colorization. In this use-case, we adapt the approach from Song & Ermon (2020) to our setting. Figure 11 depicts qualitative results of the colorization experiment, confirming the flexibility of the proposed scheme.
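The de-blurring conditioning relies on a frequency-domain masking step, sketched below. The cutoff value and the exact position of this step inside the predictor-corrector loop are illustrative assumptions of the sketch; the overall scheme follows the in-painting approach referenced above.

```python
# Minimal sketch of the frequency-domain masking step for de-blurring conditioning.
import numpy as np

def lowpass_mask(shape, cutoff):
    """Boolean mask selecting the low-frequency band assumed to be observed."""
    N = shape[0]
    f = np.fft.fftfreq(N, d=1.0 / N)
    kk, mm = np.meshgrid(f, f, indexing="ij")
    return (kk**2 + mm**2) <= cutoff**2

def impose_observation(x_current, y_blurred, mask):
    """One conditioning update: keep the sampler's high frequencies and overwrite the
    observed low-frequency band with the content of the blurred image."""
    Xc, Yb = np.fft.fft2(x_current), np.fft.fft2(y_blurred)
    return np.fft.ifft2(np.where(mask, Yb, Xc)).real

# Toy usage: y_blurred plays the role of the low-pass filtered observation.
rng = np.random.default_rng(0)
x_true = rng.standard_normal((32, 32))
mask = lowpass_mask(x_true.shape, cutoff=4)
y_blurred = np.fft.ifft2(np.where(mask, np.fft.fft2(x_true), 0.0)).real
x_iterate = rng.standard_normal((32, 32))        # current reverse-SDE iterate
x_iterate = impose_observation(x_iterate, y_blurred, mask)
```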
E.2.4 SPOKEN DIGIT data-set

To demonstrate the versatility of our framework, we conducted preliminary experiments on an audio dataset, specifically the Spoken Digit Dataset. This dataset comprises recordings of spoken digits, stored as wav files at an 8 kHz sampling rate, with each recording trimmed to minimize silence at the beginning and end. The dataset features five speakers who have contributed a total of 2,500 recordings, providing 50 recordings of each digit per speaker. The dataset is publicly available in the TensorFlow Dataset Catalog. As preprocessing, each sample was either padded with zeros or truncated to a maximum duration of one second. Subsequently, the data was normalized using the effects.normalize function from the pydub library, and each sample was scaled to the range [-1, +1] by dividing it by the dataset's maximum intensity (see the sketch at the end of this Section). For the audio experiments, we fed the raw audio data directly into the Transformer model, without converting it to log mel spectrograms, which is a common practice in audio processing tasks. The Transformer model was configured with a patch size of 2, an embedding size of 512, 13 layers, and 8 heads. We employed the AdamW optimizer with a weight decay of 0.03 and a cosine warm-up schedule that decays to a value of 1e-5. The preliminary examination of the audio generated by our model reveals its ability to effectively generate spoken digits. Upon listening to the model's generated samples, we were able to recognize the digits accurately, showcasing the model's potential in audio generation tasks. Figure 7 provides a comparative analysis of the waveforms generated by our model against real examples from the dataset.

Figure 7: Comparison of real (a) and generated (b) waveforms.

Figure 8: Uncurated CELEBA samples generated by the INR.

Figure 9: Uncurated CELEBA samples generated by the Transformer.

Figure 10: In-painting experiment using the INR. Left: real samples; center: masked samples; right: reconstructed samples.

Figure 11: Colorization experiment using the INR. Left: real samples; center: gray-scale samples; right: reconstructed samples.

Figure 12: De-blurring experiment using the INR. Left: real samples; center: blurred samples; right: reconstructed samples.
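For completeness, the following is a minimal sketch of the audio preprocessing described in Appendix E.2.4. The helper name, the exact ordering of the steps, and the precomputation of the dataset-wide maximum intensity are assumptions of this sketch; only effects.normalize from pydub is taken from the text above.

```python
# Minimal preprocessing sketch for the Spoken Digit recordings (illustrative only).
import numpy as np
from pydub import AudioSegment, effects

SAMPLE_RATE = 8000          # 8 kHz recordings
TARGET_LEN = SAMPLE_RATE    # one second of audio

def preprocess(path: str, dataset_max: float) -> np.ndarray:
    audio = effects.normalize(AudioSegment.from_wav(path))       # pydub normalization
    x = np.asarray(audio.get_array_of_samples(), dtype=np.float32)
    if len(x) >= TARGET_LEN:                                      # truncate ...
        x = x[:TARGET_LEN]
    else:                                                         # ... or zero-pad to one second
        x = np.pad(x, (0, TARGET_LEN - len(x)))
    return x / dataset_max                                        # scale to [-1, 1]
```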