# generative_fractional_diffusion_models__fe37f11d.pdf Generative Fractional Diffusion Models Gabriel Nobis Fraunhofer HHI Maximilian Springenberg Fraunhofer HHI Marco Aversa Dotphoton Michael Detzel Fraunhofer HHI Rembert Daems Ghent University Flanders Make MIRO Roderick Murray-Smith University of Glasgow Shinichi Nakajima BIFOLD, TU Berlin RIKEN AIP Sebastian Lapuschkin Fraunhofer HHI Stefano Ermon Stanford University Tolga Birdal Imperial College London Manfred Opper TU Berlin University of Potsdam University of Birmingham Christoph Knochenhauer Technical University of Munich Luis Oala Dotphoton Wojciech Samek Fraunhofer HHI TU Berlin BIFOLD We introduce the first continuous-time score-based generative model that leverages fractional diffusion processes for its underlying dynamics. Although diffusion models have excelled at capturing data distributions, they still suffer from various limitations such as slow convergence, mode-collapse on imbalanced data, and lack of diversity. These issues are partially linked to the use of light-tailed Brownian motion (BM) with independent increments. In this paper, we replace BM with an approximation of its non-Markovian counterpart, fractional Brownian motion (f BM), characterized by correlated increments and Hurst index H (0, 1), where H = 0.5 recovers the classical BM. To ensure tractable inference and learning, we employ a recently popularized Markov approximation of f BM (MA-f BM) and derive its reverse-time model, resulting in generative fractional diffusion models (GFDM). We characterize the forward dynamics using a continuous reparameterization trick and propose augmented score matching to efficiently learn the score function, which is partly known in closed form, at minimal added cost. The ability to drive our diffusion model via MA-f BM offers flexibility and control. H 0.5 enters the regime of rough paths whereas H > 0.5 regularizes diffusion paths and invokes long-term memory. The Markov approximation allows added control by varying the number of Markov processes linearly combined to approximate f BM. Our evaluations on real image datasets demonstrate that GFDM achieves greater pixel-wise diversity and enhanced image quality, as indicated by a lower FID, offering a promising alternative to traditional diffusion models1. 1 Introduction Recent years have witnessed a remarkable leap in generative diffusion models [1, 2, 3], celebrated for their ability to accurately learn data distributions and generate high-fidelity samples. These models have made significant impact across a wide spectrum of application domains, including the generation of complex molecular structures [4] for material [5] or drug discovery [6], realistic Corresponding author gabriel.nobis@hhi.fraunhofer.de 1The implementation of our framework is available at https://github.com/Gabriel Nobis/gfdm. 38th Conference on Neural Information Processing Systems (Neur IPS 2024). H = 0.1 H = 0.3 H = 0.5 H = 0.7 H = 0.9 Correlated OU processes Joint OU density Probability flow ODE trajectories d Xt = µ(t)Xtdt + g(t)d ˆ BH t F(t)Zt G(t)G(t)T z log pt(Zt) i dt + G(t)d Bt Y0 YT d Yt = [ γYt 1K,K y log qt(Yt)]dt + 1d Bt d Yt = γYtdt + 1d Bt Known guiding score function Figure 1: Each data dimension transitions to a known prior distribution through a forward process that approximates a fractional diffusion process. The Hurst index H on the LHS interpolates between the roughness of a Brownian driven SDE and the underlying integration in PF ODEs. 
The driving noise process is a linear combination of the correlated processes on the RHS, all driven by the same Brownian motion. The score function of these augmenting processes is available in closed form and serves as guidance for the unknown score function.
audio samples [7, 8], 3D objects [9, 10] or textures [11], medical images [12], aerospace applications [13], and DNA sequence design [14, 15]. Despite these successes, modern score-based generative models formulated in continuous time [16] face limitations due to their reliance on a simplistic driving noise, the Brownian motion (BM) [17, 18, 19]. As a light-tailed process, BM can result in slow convergence rates and susceptibility to mode-collapse, especially with imbalanced data [20]. Additionally, its purely Markovian nature may make it hard to capture the full complexity and richness of real-world data. These issues have attracted a number of attempts to involve different noise types [20, 21]. In this paper, we propose leveraging fractional noises, particularly the renowned non-Markovian fractional BM (fBM) [22, 23], to drive diffusion models. fBM extends BM to stationary increments with a more complex dependence structure, i.e., long-range dependence vs. roughness/regularity controlled by a Hurst index, a measure of "mild" or "wild" randomness [24]. This all comes at the expense of computational challenges and intractability of inference, mostly stemming from its non-Markovian nature. Moreover, deriving a reverse-time model poses theoretical challenges, as fBM is not only non-Markovian but also not a semimartingale [25]. To overcome these limitations, we leverage recent work on Markov approximations of fBM (MA-fBM) [26, 27] and establish a framework for training continuous-time score-based generative models using an approximate fractional diffusion process, as well as for generating samples from the corresponding tractable reverse process. Notably, our method maintains the same number of score model evaluations during both training and data generation, with only a minimal increase in computational load. Our contributions are:
- We derive the time-reversal of forward dynamics driven by a Markovian approximation of fractional Brownian motion in a way that the dimensionality of the unknown part of the score function matches that of the data.
- We derive an explicit formula for the marginals of the conditional forward process via a continuous reparameterization trick.
- We introduce a novel augmented score matching loss for learning the score function in our generative fractional diffusion model, which can be minimized by a score model of data dimension.
- Our experimental evaluation validates our contributions, demonstrating the gains of correlated noise with long-term memory, approximated by a linear combination of Markov processes, where the number of processes further controls the diversity.
Differentiation from existing work. Yoon et al. [20] generalize score-based generative models from an underlying BM to a driving Lévy process, a stochastic process with independent and stationary increments; a driving noise with correlated increments is not included in their framework. Moreover, every Lévy process is a semimartingale [28]; since fBM is not a semimartingale, it is not a Lévy process and therefore also falls outside the framework of Yoon et al. [20]. The closest work to ours is Tong et al. [29], which constructs a neural SDE based on correlated noise and uses it as the forward process of a score-based generative model.
Our framework with exact reverse-time model is based on the integral representation of f BM derived in Harms and Stefanovits [26] and the optimal approximation coefficients of Daems et al. [27], while the fractional noise in [29] is sparsely approximated by a linear combination of independent standard normal random variables without exact reverse-time model. Moreover, the framework of Tong et al. [29] is limited to H > 1 3 and only compatible with the Euler-Maruyama sample schema [30] while our framework is up to numerical stability applicable for any H (0, 1) and compatible with any suitable SDE or ODE solver. To the best of our knowledge, we are the first to build a framework for continuous-time score-based generative models that includes driving noise processes converging to non-Markovian processes with infinite quadratic variation. 2 Background Modeling the distribution transforming process of a score-based generative model through stochastic differential equations (SDEs) [16] offers a unifying framework to generate data from an unknown probability distribution. Instead of injecting a finite number of fixed noise scales via a Markov chain, infinitely many noise scales tailored to the continuous dynamics of the Markov process X = (Xt)t [0,T ] are utilized during the distribution transformation, offering considerable practical advantages over discrete time diffusion models [16]. The forward dynamics, transitioning from a data sample X0 p0 to a tractable noise sample XT p T are specified by a continuous drift function f and a continuous diffusion coefficient g. These dynamics define a diffusion process that solves the SDE d Xt = f(Xt, t)dt + g(t)d Bt, X0 p0 (1) driven by a multivariate BM B. To sample data from noise, a reverse-time model is needed that defines the backward transformation from the tractable noise distribution to the data distribution. Whenever X = (Xt)t [0,T ] is a stochastic process and g is a function on [0, T], we write Xt = XT t for the reverse-time model and g(t) = g(T t) for the reverse-time function. The marginal density of the stochastic process X at time t is denoted by pt throughout this work2. Remarkably, an exact reverse-time model to the forward model in eq. (1) is given by the backward dynamics [31, 32, 33] d Xt = f(Xt, t) g2(t) x log pt(Xt) dt + g(t)d Bt, X0 = XT p T , (2) where the only unknown is the score function x log pt, inheriting the intractability from the unknown initial distribution p0. In addition to the stochastic dynamics, the reverse-time model provides deterministic backward dynamics via an ordinary differential equation (ODE) by the so called probability flow ODE (PF ODE) [16] d xt = f( xt, t) 1 2 g2(t) x log pt( xt, t) dt, x T p T . (3) Stochasticity is only injected into the system through the random initialization x T p T , implying a deterministic and bijective map from noise to data [16]. Conditioning the forward process on a data sample x0 p0 results for linear f( , t) in a tractable Gaussian forward process with conditional score function x log p0t(x|x0) in closed form. To approximate the exact reverse-time model, this tractable score function is used to train a time-dependent score model Sθ via score matching [34, 35]. Upon training, any solver for SDEs or ODEs can be utilized to generate data from noise by simulating the stochastic or deterministic backward dynamics of the reverse-time model with Sθ x log p. Simulation error of the reverse-time model. 
The two main sources of error when simulating the reverse-time model are the approximation error due to Sθ only approximating x log p, and the discretization error, which arises from transitioning from continuous-time to discrete steps. Simulating the PF ODE with the Euler method over N N equidistant time steps results in a global error of order N 1 [36]. In contrast, the expected global error for simulating the SDE using the Euler-Maruyama method is of a lower order N 1 2 , indicating a larger error for the same number of steps [30, 36]. From this perspective it is reasonable that sampling from the PF ODE requires fewer steps. Yet, the source of qualitative differences between sampling from the ODE and the SDE [16] remains unclear. A pathwise perspective on sampling. The roughness of a path can be measured by its Hölder exponent 0 < δ 1 [37]. For example, BM as the integrator in the backward dynamics eq. (2) has δ-Hölder continuous paths for any 0 < δ < 1 2, whereas the integrator t 7 t of the PF ODE eq. (3) can be regarded as a Hölder continuous path with exponent δ = 1. Therefore, from a pathwise perspective, we move away from a rough path when we sample using the PF ODE. An unexplored topic in score-based generative models is the interpolation between the SDE and the PF ODE in terms 2See Appendix I for the notational conventions of this work. of the Hölder exponent. It remains to be examined whether there is, to some extent, an optimal degree of Hölder continuity in between, or if an even rougher path with δ 1 2 could yield an advantageous data generator. The process that naturally arises from this line of thought is f BM [22, 23] with Hurst index H (0, 1), where almost all paths are Hölder continuous for any exponent δ < H, controlled by H. In terms of roughness, the Hurst index interpolates between the paths of Brownian driven SDEs and those of the underlying integration in PF ODEs, while also offering the potential for even rougher paths. Motivated by these observations, we define a novel score-based generative model with underlying dynamics that approximate a fractional diffusion process. 3 Fractional driving noise Before describing the challenges in defining a score-based generative model with control over the roughness of the distribution transforming path, we introduce f BM. The literature distinguishes between Type I f BM and Type II f BM [38] having stationary and non-stationary increments, respectively. The type II f BM, also called Riemann-Liouville f BM, possesses smaller deviations from its mean, potentially an advantageous property for a driving noise of a score-based generative model, since large deviations of the sampling process to the data mean can lead to sample artifacts [39]. Here and in the experiments we focus on type II f BM. However, our theoretical framework generalizes to both types as detailed in Appendix A. The empirical study of a score-based generative model approximating a fractional diffusion process driven by type I f BM is dedicated to future work. We begin with the definition of Riemann-Liouville f BM [22], a generalization of BM permitting correlated increments. Definition 3.1 (Type II Fractional Brownian Motion [22]). Let B = (Bt)t 0 be a standard Brownian Motion (BM) and Γ the Gamma function. The centered Gaussian process BH t = 1 Γ(H + 1 2 d Bs, t 0, (4) uniquely characterized in law by its covariances E BH t BH s = 1 Γ2(H + 1 0 ((t u)(s u))H 1 2 du, t, s [0, ) (5) is called type II fractional Brownian motion (f BM) with Hurst index H (0, 1). 
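For concreteness, the integral representation in eq. (4) can be discretized directly. The following sketch (assuming NumPy/SciPy; the function name and grid choices are illustrative, not part of our framework) approximates type II fBM paths by a left-point Riemann sum against Brownian increments and compares the empirical variance at time T with t^{2H} / (2H Γ²(H + 1/2)), which follows from eq. (5) with s = t and reduces to t for H = 0.5.

```python
import numpy as np
from scipy.special import gamma


def type2_fbm_paths(H, T=1.0, n_steps=500, n_paths=2000, seed=0):
    """Crude Riemann-sum approximation of type II (Riemann-Liouville) fBM,
    B^H_t = 1/Gamma(H+1/2) * int_0^t (t - s)^(H - 1/2) dB_s  (cf. eq. (4))."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    t = np.linspace(0.0, T, n_steps + 1)                       # grid 0 = t_0 < ... < t_N = T
    dB = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))  # Brownian increments
    B_H = np.zeros((n_paths, n_steps + 1))
    for i in range(1, n_steps + 1):
        # kernel (t_i - s_j)^(H - 1/2) at the left endpoints s_j = t_j, j < i;
        # for H < 0.5 the kernel is singular near s = t, so this sum is only a rough proxy
        kern = (t[i] - t[:i]) ** (H - 0.5)
        B_H[:, i] = (kern * dB[:, :i]).sum(axis=1) / gamma(H + 0.5)
    return t, B_H


if __name__ == "__main__":
    for H in (0.1, 0.5, 0.9):   # sub-diffusive, Brownian, and super-diffusive regimes
        t, paths = type2_fbm_paths(H)
        exact_var = t[-1] ** (2 * H) / (2 * H * gamma(H + 0.5) ** 2)
        print(f"H={H}: empirical Var(B^H_T)={paths[:, -1].var():.3f}, exact={exact_var:.3f}")
```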
BM being the unique continuous and centered Gaussian process with covariance min{t, s} is recovered for H = 0.5, since Γ(1) = 1. In comparison to the purely Brownian setting with independent increments (diffusion), the path of BH becomes more smooth for H > 0.5 due to positively correlated increments (super-diffusion) and more rough for H < 0.5 due to negatively correlated increments (sub-diffusion). These three regimes are reflected in the Hölder exponent of δ < H for almost all paths. Generalization challenges. The most challenging part in defining a score-based generative model driven by f BM is the derivation of a reverse-time model. Due to its covariance structure, f BM is not a Markov process [40] and the shift in the roughness of the sample path leads to changes in its quadratic variation: from t in the purely Brownian (diffusion) regime to zero in the smooth regime, and to infinite in the rough regime [30]. For that reason f BM is neither a Markov process nor a semimartingale [25] for all H = 0.5. Hence, we cannot make use of the Markov property or the Kolmogorov equations (Fokker-Planck) that are used to derive the reverse-time model of Brownian driven SDEs [31, 32, 33]. See Appendix H for a more illustrative view of the problem. The existence of a reverse-time model can be proven in the smooth regime of f BM [41]. However, due to the absence of an explicit score function in Darses and Saussereau [41] it does not provide a sufficient structure to train a score-based generative model. To overcome this difficulty we follow [26, 27] and define the driving noise of our generative model by a linear combination of Markovian semimartingales. The approximation is based on the exact infinite-dimensional Markovian representation of f BM given in Theorem A.2. Definition 3.2 (Markov approximation of f BM [26, 27]). Choose K N Ornstein Uhlenbeck (OU) processes Y k t = Z t 0 e γk(t s)d Bs, k N, t 0, (6) with speeds of mean reversion γ1, ..., γK and dynamics d Y k t = γk Y k t dt + d Bt. Given a Hurst index H (0, 1) and a geometrically spaced grid γk = rk n with r > 1 and n = K+1 2 we call the process k=1 ωk Y k t , H (0, 1), t 0, (7) Markov-approximate fractional Brownian motion (MA-f BM) with approximation coefficients ω1, ..., ωK R and denote by ˆBH = ( ˆBH 1 , ..., ˆBH D ) the corresponding D-dimensional process where ˆBH i and ˆBH j are independent for i = j inheriting independence from the underlying standard BMs Bi and Bj. Our framework is conceptually independent of the specific choice of spatial grid and approximation coefficients. To achieve strong convergence rates with a high polynomial order in K for H < 0.5 in the driving noise to f BM, one may follow the approach outlined in Harms [42]. Consequently, our framework includes driving noise processes that converge to non-Markovian processes with infinite quadratic variation. For computational efficiency, we instead follow the approach of Daems et al. [27] to choose the L2(P) optimal approximation coefficients for a given K, achieving empirically good results in approximating f BM, even with a small number of OU processes. Proposition 3.3 (Optimal Approximation Coefficients [27]). 
The optimal approximation coefficients ω = (ω1, ..., ωK) RK for a given Hurst index H (0, 1), a terminal time T > 0 and a fixed geometrically spaced grid to minimize the L2(P)-error E(ω) := Z T 0 E BH t ˆBH t 2 dt (8) are given by the closed-form expression Aω = b with Ai,j := 2T + e (γi+γj )T 1 γi+γj γi + γj , bk := T 2 k P H + 1 2, γk T H + 1 2 k P H + 3 2, γk T (9) and where P(z, x) = 1 Γ(z) R x 0 tz 1e tdt is the regularized lower incomplete gamma function. MA-f BM serves as the driving noise of our generative model, replacing BM in the distribution transforming process solving eq. (1), approximating a fractional diffusion process. See Figure 1 for an illustration of the underlying processes. 4 A score-based generative model based on fractional noise In this section, we define a continuous-time score-based generative model driven by MA-f BM. A detailed treatment of the theory can be found in Appendix A. We begin with the forward dynamics, transitioning data to noise. Definition 4.1 (Forward process). Let ˆBH be a D-dimensional MA-f BM with Hurst index H (0, 1). For continuous functions µ : [0, T] R and g : [0, T] R we define the forward process X = (Xt)t [0,T ] of a generative fractional diffusion model (GFDM) by d Xt = µ(t)Xtdt + g(t)dˆBH t , X0 = x0 p0, t [0, T], (10) where p0 is the unknown data distribution from which we aim to sample from. Considering both the forward process as well as the OU processes defining the driving noise ˆBH, we have for every data dimension an augmented vector of correlated processes (X, Y 1, . . . , Y K), driven by the same BM, approximating the time-correlated behavior of a one-dimensional fractional diffusion process [27]. We denote the stacked process of the D augmented vectors as Z = (Zt)t [0,T ] and refer to the resulting D(K+1)-dimensional process as the augmented forward process. Rewriting the dynamics of the forward process we observe that the augmented forward process Z solves a linear SDE. Hence, Z|x0, the augmented forward process conditioned on a data sample x0 p0, is a linear transformation of BM. Thus Z|x0 is a Gaussian process and so is X|x0 [43]. For each dimension 1 d D, we have a system of K + 1 trajectories that transform x0,d according to the augmented forward process with D = 1, following the dynamics d Zt = F(t)Ztdt + G(t)d Bt, (11) where all K + 1 processes are driven by the same one-dimensional BM B with matrix valued functions F and G defined in Appendix A.2. To efficiently sample for every t (0, T] from the conditional augmented forward distribution during training, we characterize its marginal statistics. Derivation of marginal statistics. The marginal mean E[Xt|x0] = x0 exp( R t 0 µ(s)ds) of the conditional forward process is unaffected by changing the driving noise to MA-f BM, and the mean of the augmenting OU processes is zero. See Appendix A.2 for a detailed derivation of the marginal statistics of the augmenting processes. The missing components in the marginal covariance matrix Σt of the conditional augmented forward process Z|x0 are the marginal variance of the forward process and the marginal correlation between the conditional forward process and the augmenting processes. We derive by reparameteriziation an explicit formula for the marginal variance of the conditional forward process. This generalizes the formula for the perturbation kernel p0t(x|x0) = N(x; c(t)x0, c2(t)σ2(t)ID) given in Karras et al. [44] to a driving MA-f BM and is reminiscent of the reparameterization trick used in discrete time. 
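As a concrete complement to Proposition 3.3, the coefficient solve can be written out numerically. In the sketch below (assuming NumPy/SciPy; `geometric_grid` and `optimal_coefficients` are illustrative names, not the released implementation), A and b are re-derived from the defining integrals of E(ω) via Itô isometry for the type II setting with Y_0 = 0, namely A_ij = ∫_0^T E[Y^i_t Y^j_t] dt and b_k = ∫_0^T E[B^H_t Y^k_t] dt; these closed forms should agree with eq. (9) up to layout, but treat them as a re-derivation rather than a verbatim copy of the paper's formula.

```python
import numpy as np
from scipy.special import gammainc  # regularized lower incomplete gamma P(a, x)


def geometric_grid(K, r=2.0):
    """Geometrically spaced speeds of mean reversion gamma_k = r**(k - n), n = (K + 1)/2."""
    n = (K + 1) / 2.0
    return r ** (np.arange(1, K + 1) - n)


def optimal_coefficients(H, K, T, r=2.0):
    """L2(P)-optimal weights omega solving A omega = b (cf. Proposition 3.3), with
    A_ij = int_0^T (1 - exp(-(g_i+g_j) t)) / (g_i+g_j) dt and
    b_k  = int_0^T g_k^-(H+1/2) P(H+1/2, g_k t) dt, evaluated in closed form."""
    gam = geometric_grid(K, r)
    s = gam[:, None] + gam[None, :]                      # gamma_i + gamma_j
    A = T / s + (np.exp(-s * T) - 1.0) / s ** 2
    b = (T * gam ** -(H + 0.5) * gammainc(H + 0.5, gam * T)
         - (H + 0.5) * gam ** -(H + 1.5) * gammainc(H + 1.5, gam * T))
    return gam, np.linalg.solve(A, b)


if __name__ == "__main__":
    gam, omega = optimal_coefficients(H=0.7, K=5, T=1.0)
    print("gamma_k:", gam)
    print("omega_k:", omega)
```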
Proposition 4.2 (Continuous Reparameterization Trick). The forward process X of GFDM conditioned on x0 RD admits the continuous reparameterization Xt = c(t) x0 + Z t 0 α(t, s)d Bs N(c(t)x0, c2(t)σ2(t)ID) (12) with c(t) = exp R t 0 µ(s)ds and σ2(t) = R t 0 α2(t, s)ds where α is given by s fk(u, s)du , fk(u, s) = g(u) c(u) e γk(u s). (13) Sketch of Proof. Reparameterization of the forward dynamics in eq. (10) and the Stochastic Fubini Theorem yields the Gaussian process Xt = c(t)(x0 + R t 0 α(t, s)d Bs) with variance V [Xt] = c2(t) R t 0 α2(t, s)ds by Itô isometry. See Theorem A.3 for the full proof. By the above definition of α, we retrieve the perturbation kernel of the purely Brownian setting given in Karras et al. [44, Equation 12] for K = 1, γ1 = 0 and ω1 = 1 . When, depending on the choice of forward dynamics, R t 0 α(t, s)ds is not accessible in closed form, Σt can be described by an ODE and solved numerically as described in Appendix B. Thus our method admits any choice of forward dynamics in terms of µ and g. Explicit fractional forward dynamics. Although our framework is not bound to any specific dynamics, this work s empirical evaluation focuses on Fractional Variance Exploding (FVE) dynamics given by d Xt = σmin σmin dˆBH t , t [0, T] (14) with (σmin, σmax) = (0.01, 50) and Fractional Variance Preserving (FVP) dynamics given by 2β(t)Xtdt + p β(t)dˆBH t , t [0, T] (15) with β(t) = β(t) = βmin + t βmax βmin and ( βmin, βmax) = (0.1, 20) [16]. Leveraging the continuous reparameterization trick we derive in Appendix B the conditional marginal covariance matrix of FVE in closed form. To the best of our knowledge, the integral in eq. (13), needed to compute α in the setting of FVP dynamics, is not accessible in closed form. Therefore, we use a numerical ODE solver to estimate this quantity for FVP dynamics. See Appendix B for details on the computation of the marginal variances and an illustration of the resulting variance schedules. The reverse-time model. We observe that the augmented forward dynamics of GFDM are already encompassed in the general framework presented in Song et al. [16, Appendix A], although they differ from the Variance Exploding (VE), Variance Preserving (VP), and sub-VP dynamics discussed therein. To simplify notation, we use pt here to denote the marginal density of both Zt and Xt. The specific density referred to will be clear from the context. By the significant results of [31, 32, 33], the reverse-time model of GFDM is given by the backward dynamics F(t)Zt G(t)G(t)T z log pt(Zt) dt + G(t)d Bt, t [0, T]. (16) However, a direct application of [16] would require to train a score model with input and output dimension of D(K + 1). By proposing augmented score matching below, we show that learning a score model with input and output dimension D is sufficient, enabling the use of the same highly curated model architecture as in traditional diffusion models to approximate the score function. Augmented score matching. We condition the score function z log pt on a data sample x0 p0 and additionally on the states of the stacked vector Y[K] t := (Y1 t , ..., YK t ) of augmenting processes. To train our time-dependent score model sθ we propose the augmented score matching loss E(X0,Y[K] t )E(Xt|Y[K] t ,X0) k ηk t Yk t , t) x log p0t(Xt|Y[K] t , X0) 2 2 (17) The weights η1 t , ..., ηK t arise from conditioning Zt|x0 on Y[K] t and the time points t are uniformly sampled from U[0, T]. We show in the following that the optimal sθ w.r.t. 
the augmented score matching loss is the L2-optimal approximation of the score function of our reverse-time model. Proposition 4.3 (Optimal Score Model). Assume that sθ is optimal w.r.t. the augmented score matching loss L. The score model Sθ(Zt, t) := k ηk t Yk t , t), η1 t sθ(Xt X k ηk t Yk t , t), ..., ηK t sθ(Xt X k ηYk t , t) yields the optimal L2(P) approximation of z log pt(Zt) via Sθ(Zt, t) + z log qt(Y[K] t ) z log pt(Zt). (19) Sketch of Proof. Using the relation x log p0t = ηk t yk log p0t and the independence of X0 and Y[K] t yields the claim. See Appendix A.3 for the full proof. In addition to the result that a score model of data dimension D minimizes the proposed augmented score matching loss, Proposition 4.3 also implies that GFDM requires the same number of score model evaluations during sampling from the reverse-time model as traditional diffusion models. This is because, for a given time point t, we only need to evaluate sθ( , t) once at Xt P k ηk t Yk t to compute Sθ(Zt, t) according to eq. (18), and Sθ is all that is required to approximate the reverse-time dynamics described below. We provide a thorough quantitative evaluation of compute time in seconds for GFDM in Appendix F, validating the theoretical reasoning in this section that GFDM incur only minimal additional computational cost. Sampling from reverse-time model. Once we trained our score model Sθ via augmented score matching, we simulate the reverse-time model backward in time and sample from the reverse-time model via the SDE F(t)Zt G(t)G(t)T h Sθ(Zt, t) + z log qt(Y [K] t ) io dt + G(t)d Bt, t [0, T] (20) or the corresponding augmented PF ODE [16] 2G(t)G(t)T h Sθ(zt, t) + z log qt(y[K] t ) i dt, t [0, T], (21) where we initialize in both cases the reverse dynamics with the centered (non-isotropic) Gaussian Z0 with covariance matrix ΣT . To traverse backward from noise to data, we may deploy any suitable SDE or ODE solver. In both cases, for each data dimension, we have K +1 trajectories that transform the Gaussian initialization into an approximate sample of the data distribution. The PF ODE enables in addition negative log-likelihoods (NLLs) estimation of test data under the learned density [16]. See Appendix G for the computation details of NLLs. Remark 4.4. We showed in this section that it suffices to approximate a D-dimensional score to reverse the D(K + 1)-dimensional MA-f BM driven SDE with unknown starting distribution. Since this holds for any fixed K N an interesting task is to examine the behaviour of the reverse-time model as K and potentially link it to the dynamics of a reverse-time model of true f BM. To the best of our knowledge, existence of such a reverse-time model is not known for H < 0.5 and the drift of the reverse-time model for H > 0.5 lacks sufficient structure to train a score-based generative model [41]. FVE(H = 0.5) FID NLLs Test VSp VE (retrained) 10.82 2.73 24.20 K = 1 10.30 2.55 24.22 K = 2 9.89 3.03 24.15 K = 3 9.74 2.93 24.42 K = 4 11.25 3.10 24.54 K = 5 25.51 3.94 23.08 FVP(H = 0.5) FID NLLs Test VSp VP (retrained) 1.44 2.38 23.64 K = 1 2.81 3.90 23.69 K = 2 2.92 4.57 23.63 K = 3 3.51 7.02 23.78 K = 4 1.86 5.71 24.50 K = 5 4.89 7.09 24.56 Table 1: Effect of augmenting processes on conditional image generation on MNIST for FVE and FVP dynamics. 
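The assembly of the augmented score in eq. (18)-(19) reduces, per time step, to a single evaluation of the data-dimensional score model. The following PyTorch-style sketch is purely illustrative: names and shapes are not the released API, the conditioning weights η^k_t and the closed-form OU score ∇_y log q_t are assumed to be precomputed elsewhere, and the argument X_t − Σ_k η^k_t Y^k_t reflects our best reading of eq. (18), so it should be checked against Appendix A.3 before relying on it.

```python
import torch


def augmented_score(s_theta, x_t, y_t, eta_t, ou_score_t, t):
    """Assemble the D(K+1)-dimensional augmented score from a D-dimensional score model.

    x_t        : (B, D)    forward-process state
    y_t        : (B, D, K) augmenting OU states
    eta_t      : (K,)      conditioning weights eta^k_t (assumed precomputed)
    ou_score_t : (B, D, K) closed-form score of the augmenting OU marginal
    s_theta    : score network of data dimension D, called as s_theta(x, t)
    Returns (score_x, score_y) with shapes (B, D) and (B, D, K).
    """
    # one network evaluation per time step, at X_t - sum_k eta^k_t Y^k_t (eq. (18))
    s = s_theta(x_t - (y_t * eta_t).sum(dim=-1), t)      # (B, D)
    score_x = s                                          # X-component of S_theta
    # Y-components: eta^k_t-scaled copies of s, plus the known OU score per eq. (19);
    # the X-component of grad log q_t is zero since q_t depends only on Y.
    score_y = eta_t * s.unsqueeze(-1) + ou_score_t       # (B, D, K)
    return score_x, score_y
```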
(a) Conditional image generation on MNIST (FVP rows are MA-fBM driven; each cell reports FID / VSp):

| MNIST | H = 0.9 | H = 0.7 | H = 0.5 | H = 0.1 |
|---|---|---|---|---|
| VE (retrained) | - | - | 10.82 / 24.20 | - |
| VP (retrained) | - | - | 1.44 / 23.64 | - |
| FVP(H, K = 1) | 2.86 / 23.56 | 3.01 / 23.78 | 2.81 / 23.69 | 2.92 / 23.59 |
| FVP(H, K = 2) | 1.93 / 24.00 | 2.30 / 23.82 | 2.92 / 23.63 | 2.56 / 23.82 |
| FVP(H, K = 3) | 0.72 / 24.18 | 2.67 / 23.96 | 3.51 / 23.78 | 4.87 / 23.60 |
| FVP(H, K = 4) | 1.22 / 24.76 | 0.86 / 24.39 | 1.86 / 24.50 | 6.25 / 23.89 |
| FVP(H, K = 5) | 2.17 / 25.15 | 1.36 / 24.63 | 4.89 / 24.56 | 9.57 / 23.70 |

(b) Conditional image generation on CIFAR10 (FVP rows are MA-fBM driven; each cell reports FID / VSp):

| CIFAR10 | H = 0.9 | H = 0.7 | H = 0.5 | H = 0.1 |
|---|---|---|---|---|
| VE (retrained) | - | - | 5.20 / 3.42 | - |
| VP (retrained) | - | - | 4.85 / 3.28 | - |
| FVP(H, K = 1) | 4.79 / 3.53 | 4.96 / 3.84 | 4.19 / 3.99 | 4.60 / 3.46 |
| FVP(H, K = 2) | 3.77 / 3.60 | 4.17 / 3.35 | 4.85 / 4.04 | 5.77 / 3.43 |
| FVP(H, K = 3) | 14.22 / 3.38 | 6.12 / 3.39 | 6.32 / 3.49 | 5.95 / 3.34 |
| FVP(H, K = 4) | 29.72 / 3.67 | 8.35 / 3.24 | 8.85 / 3.65 | 5.02 / 3.26 |
| FVP(H, K = 5) | 69.06 / 6.61 | 35.91 / 5.20 | 96.54 / 7.30 | 7.38 / 3.11 |

Table 2: FID and pixel-wise diversity VSp of GFDM compared to the original setting of purely Brownian driven VE and VP. In bold the scores that are better than both purely Brownian driven dynamics. The overall best scores within the experiment are boxed in, indicating that the highest scores on both datasets are achieved in the super-diffusive regime for H = 0.9.

5 Experiments

We conduct experiments on MNIST and CIFAR10 to evaluate the ability of GFDM to generate real images. First, we measure the quality and the pixel-wise diversity of the generated images across different numbers of augmenting processes and various Hurst indices, showing that the super-diffusive regime with H > 0.5 yields better performance compared to the purely Brownian driven dynamics. Second, we further evaluate the best performing models in terms of class-wise image quality and class-wise distribution coverage. We measure image quality by the Fréchet Inception Distance (FID) [45] and the Inception score (IS) [46], pixel-wise diversity by the pixel Vendi Score (VSp) [47], and class-wise distribution coverage by improved recall (Recall) [48]. See Appendix D for the implementation details and additional experimental results. We begin with the empirical evaluation of how the augmenting processes affect performance on MNIST.

Effect of augmentation on MNIST. To isolate the effect of the augmenting processes on MNIST while minimally adapting the driving noise distribution, we fix H = 0.5 so that the weighted sum of the augmenting processes approximates BM, rather than fBM. We observe an increase of the pixel-wise diversity VSp for both FVE and FVP dynamics with increasing K. In Table 1 we can observe that VSp increases from 24.20 to 24.54 for FVE dynamics and from 23.64 to 24.56 for FVP dynamics. The enhanced pixel-wise diversity on MNIST comes at the cost of a reduced likelihood of test data under the learned density, indicated by higher NLLs for more augmenting processes.

Quality results across different Hurst indices. On both MNIST and CIFAR10, we obtain the best performance in terms of FID and VSp in the super-diffusive regime with H = 0.9 and FVP dynamics. On MNIST we achieve a state-of-the-art FID of 0.72, compared to an FID of 1.44 with the purely Brownian VP dynamics (Table 2a).

Figure 2: Comparison of the super-diffusive regime and purely Brownian dynamics in terms of average FID over three rounds of sampling, plotted across different NFEs.
Comparing FVP to the best-performing purely BM driven VP dynamics, we observe not only an improvement in quality but also an increase in pixel-wise diversity from 23.64 to 24.18, as measured by VSp. In Table 2b we observe the same behaviour on CIFAR10. The best performing configuration in terms of FID and pixel-wise diversity is achieved for FVP(H = 0.9, K = 2) with an FID of 3.77 instead of 4.85 and a VSp of 3.60 instead of 3.28. Additionally, in Figure 7, we show the FID evolution of the super-diffusive regime for various numbers of augmenting processes, showing a similar pattern in which either K = 2 or K = 3 yields the best performance across different datasets and dynamics. Evaluating the performance with different numbers of sampling steps in Figure 2 shows that the super-diffusive regime with K = 2 already saturates at 500 function evaluations (NFEs), at a lower level than both purely Brownian driven dynamics VP and VE. See Figure 2 in Appendix D for the exact FID values.

| Metric | Dynamics | airplane | automobile | bird | cat | deer | dog | frog | horse | ship | truck |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FID | VP | 15.29 | 12.06 | 14.08 | 18.08 | 10.68 | 16.92 | 16.48 | 12.49 | 10.74 | 10.57 |
| FID | FVP(H = 0.7, K = 2) | 14.67 | 9.55 | 14.02 | 16.97 | 11.05 | 17.14 | 16.43 | 10.97 | 9.91 | 8.81 |
| FID | FVP(H = 0.9, K = 2) | 14.37 | 8.94 | 14.18 | 16.38 | 10.52 | 16.76 | 15.37 | 10.28 | 10.04 | 8.76 |
| Recall | VP | 0.6814 | 0.6186 | 0.6860 | 0.6466 | 0.7002 | 0.6730 | 0.6758 | 0.6392 | 0.6468 | 0.5982 |
| Recall | FVP(H = 0.7, K = 2) | 0.6838 | 0.6436 | 0.6870 | 0.6712 | 0.7140 | 0.6844 | 0.6922 | 0.6764 | 0.6550 | 0.6508 |
| Recall | FVP(H = 0.9, K = 2) | 0.7038 | 0.6614 | 0.7188 | 0.6842 | 0.7284 | 0.7096 | 0.7104 | 0.6806 | 0.6772 | 0.6852 |

Table 3: The class-wise image quality and class-wise distribution coverage of the super-diffusive regime FVP(H = 0.9, K = 2) compared to the purely Brownian VP dynamics.

Figure 4: Visual comparison of PF ODE samples. (LHS) Purely Brownian VP dynamics. (RHS) Super-diffusive regime FVP(H = 0.9, K = 2).

Class-wise distribution coverage. We evaluate the capability to generate samples from different classes in terms of FID and class-wise distribution coverage, measured by Recall, comparing the best-performing purely Brownian driven dynamics to the super-diffusive regime with K = 2. In Table 3 we observe that the super-diffusive regime with K = 2 outperforms in both FID and Recall, where H = 0.7 and H = 0.9 achieve better class-wise FID for all but two and one class, respectively (deer and dog for H = 0.7, bird for H = 0.9). Additionally, the super-diffusive regime shows improved class-wise distribution coverage, as indicated by a higher Recall across all classes. Overall, both H = 0.7 and H = 0.9 perform significantly better in terms of distribution coverage than VP dynamics, with H = 0.9 being the best performing model.

Sampling with the augmented probability flow ODE. We compare the performance of sampling via the PF ODE for the best performing models from above.

| Dynamics | FID | IS | VSp |
|---|---|---|---|
| Sampled with SDE | | | |
| VE (retrained) | 5.20 | 9.60 | 3.42 |
| VP (retrained) | 4.85 | 9.64 | 3.28 |
| FVP(H = 0.7, K = 2) | 4.17 | 9.51 | 3.35 |
| FVP(H = 0.9, K = 2) | 3.77 | 9.41 | 3.60 |
| Sampled with PF ODE | | | |
| VE (retrained) | 6.40 | 9.22 | 3.14 |
| VP (retrained) | 5.63 | 9.23 | 3.91 |
| FVP(H = 0.7, K = 2) | 12.23 | 9.73 | 4.38 |
| FVP(H = 0.9, K = 2) | 12.26 | 9.55 | 4.89 |

Figure 3: Quantitative performance comparison of SDE and PF ODE sampling.

For MA-fBM driven dynamics, we have K + 1 deterministic trajectories for each pixel, traversing from noise to data.
As shown in Figure 3, the PF ODE associated with purely Brownian dynamics outperforms the super-diffusive regime in terms of FID, while the superdiffusive regime achieves the overall highest pixel-wise diversity of VSp = 4.89 confirmed mildly perceptually in Figure 4. See Appendix E for additional visualization of the generated data. Our experiments show that, compared to purely Brownian dynamics, the super-diffusive regime of MA-f BM yields higher image quality with fewer NFEs, improved pixelwise diversity and better distribution coverage. 6 Related work Diffusion models in continuous-time. The seminal work of Song et al. [16] offers a unifying framework modeling the distribution transforming process by a stochastic processes in continuoustime with exact reverse-time model. Extensive research has been carried out to examine [44, 49, 50] and extend [39, 51, 52, 53, 54, 55] the continuous-time view on generative models through the lens of SDEs, including deterministic corruptions [56] and blurring diffusion [57]. While critic on this view question the usefulness of the theoretical superstructure [58], others extend in line with our work the theoretical framework to new types of underlying diffusion processes [59]. Conceptually similar to our work,Yoon et al. [20] generalizes the score-based generative model from an underlying Brownian motion to a driving Lévy process, thereby dropping the Gaussian assumptions on the increments. In contrast to our work, the framework of Yoon et al. [20] does not include correlated increments. Importantly, every Lévy process is a semimartingale, which means that f BM is not a Lévy process. Fractional noises in machine learning. Recently, Hayashi and Nakagawa [60] considered neural SDEs driven by fractional noise. Yet they do not study diffusion models. The closest work to our work, Tong et al. [29] approximated the type-II f BM with sparse Gaussian processes constructing a neural SDE as a forward process of a score-based generative model, without exact reverse-time dynamics. Unfortunately, they are also limited to Euler-Maruyama solvers and to the case of H > 1/3, while our framework is up to numerical stability applicable for any H (0, 1) and compatible with any suitable SDE or ODE solver. Daems et al. [27], who inspired our Markov-approximate noise, includes a more elaborate discussion as well as a variational inference framework for MA-f BM. Rough path theory. The pathwise analysis of SDEs driven by processes with a Hölder exponent less than 0.5, including f BM for H < 0.5 and BM, is encompassed by rough path theory [37]. Rough path theory is applied in machine learning in several ways including (i) deriving stability bounds for the trained weights of a residual neural network [61], (ii) enabling rough control of neural ODEs [62], and (iii) modeling long time series behavior via neural rough differential equations [63, 64]. In finance the famous Black-Scholes model [65] is driven by BM, while more recent continuous-time models employ fractional noise to model price processes [66, 67] or rough volatility [68, 69] to more closely mimic real-world behavior. 7 Conclusion In this work, we propose a generalized framework of continuous-time score-based generative models, introducing a novel generative model driven by MA-f BM with control over the roughness of distribution transformation paths via augmenting processes. 
Despite the increased dimensionality of the forward process, learning a score model with the dimensionality of the data distribution, guided by the marginal known score of the augmenting processes, is sufficient. Consequently, both training and sampling is efficient. Our experimental results show that the super-diffusive regime of our MA-f BM driven dynamics achieves superior performance in terms of FID and pixel-wise diversity. Additionally, the FID saturates at a lower level with fewer function evaluations compared to purely Brownian driven dynamics. The super-diffusive regime also improves class-wise distribution coverage, as measured by Recall. Based on these results, GFDM offers a promising alternative to traditional diffusion models for generating data from an unknown distribution. Limitations and future work. Several practical and theoretical questions remain open. While we draw our conclusions from experiments conducted on MNIST and CIFAR10, generalizing the observed behavior to other datasets and data modalities may not be valid. In future work, we aim to empirically and theoretically determine the optimal degree of correlated noise, and thus the optimal Hurst index, for training and sampling across different data modalities. Beyond image data, a particularly interesting modality could be the generation of rough time series data using dynamics of the sub-diffusive regime. A theoretical open question is the limiting behavior of GFDM s reverse dynamics with infinitely many augmenting processes and whether this limit is connected to the reverse time model for true f BM. An intriguing extension would be to adapt the dynamics of our framework to switch between two unknown distributions. This adaptation would enable the use of MA-f BM driven dynamics in the sciences to model real-world evolution between two states of unknown distributions. This is a promising direction, as the assumption of independent increments in real-world noise processes is often too strong. Broader impact. Our contribution advances generative modeling by introducing a specific driving noise process to improve the learning of an unknown distribution. This conceptual work aims to support impactful applications of generative modeling, such as molecular structure generation, medical imaging, drug discovery, and DNA sequence design. However, we acknowledge that generative models can reflect biases in the datasets they are trained on and may pose risks, including misuse for human impersonation and the spread of fake content. Acknowledgements We would like to give a special thanks to Thorsten Selinger for his support in utilizing Fraunhofer HHI s GPU cluster. We also thank the anonymous reviewers for their constructive feedback, which helped improve our work. This work was supported by the Federal Ministry of Education and Research (BMBF) as grants [Sy Real (01IS21069B)]. R.M-S. & M.A. acknowledge funding from the Quant IC Project funded by EPSRC Quantum Technology Programme (grant EP/MO1326X/1, EP/T00097X/1), and dot Photon AG. R.M-S acknowledges funding from EP/R018634/1, EP/T021020/1, EP/Y029178/1, and Google. SN acknowledges funding from the German Federal Ministry of Education and Research under the grant BIFOLD24B. RD acknowledges funding from the Flemish Government under the "Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen" programme and from Flanders Make under the SBO project CADAIVISION. TB was supported by a UKRI Future Leaders Fellowship [grant number MR/Y018818/1]. 
MO has been partially funded by Deutsche Forschungsgemeinschaft (DFG) - Project - ID 318763901 - SFB1294. WS acknowledges financial support by the German Research Foundation (DFG) - Research Unit KI-FOR 5363 (project ID: 459422098). [1] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2256 2265. PMLR, 2015. [2] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pages 6840 6851. Curran Associates, Inc., 2020. [3] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. [4] Emiel Hoogeboom, Víctor Garcia Satorras, Clément Vignac, and Max Welling. Equivariant diffusion for molecule generation in 3D. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 8867 8887. PMLR, 2022. [5] Matteo Manica, Jannis Born, Joris Cadow, Dimitrios Christofidellis, Ashish Dave, Dean Clarke, Yves Gaetan Nana Teukam, Giorgio Giannone, Samuel C Hoffman, Matthew Buchan, et al. Accelerating material design with the generative toolkit for scientific discovery. npj Computational Materials, 9(1):69, 2023. [6] Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, and Tommi S. Jaakkola. Diffdock: Diffusion steps, twists, and turns for molecular docking. In The Eleventh International Conference on Learning Representations, 2023. [7] Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre Defossez. Simple and controllable music generation. In Advances in Neural Information Processing Systems, volume 36, pages 47704 47720. Curran Associates, Inc., 2023. [8] Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi. Audiogen: Textually guided audio generation. In The Eleventh International Conference on Learning Representations, 2023. [9] Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. Lion: Latent point diffusion models for 3D shape generation. In Advances in Neural Information Processing Systems, volume 35, pages 10021 10039. Curran Associates, Inc., 2022. [10] Zhiying Leng, Tolga Birdal, Xiaohui Liang, and Federico Tombari. Hyper SDFusion: Bridging Hierarchical Structures in Language and Geometry for Enhanced 3D Text2Shape Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19691 19700, 2024. [11] Simone Foti, Stefanos Zafeiriou, and Tolga Birdal. UV-free texture generation with denoising and geodesic heat diffusions. In Advances in Neural Information Processing Systems, 2024. [12] Marco Aversa, Gabriel Nobis, Miriam Hägele, Kai Standvoss, Mihaela Chirica, Roderick Murray-Smith, Ahmed Alaa, Lukas Ruff, Daniela Ivanova, Wojciech Samek, Frederick Klauschen, Bruno Sanguinetti, and Luis Oala. Diff Infinite: Large mask-image synthesis via parallel random patch diffusion in histopathology. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. [13] Miguel Espinosa and Elliot J. Crowley. 
Generate your own scotland: Satellite image generation conditioned on maps. Neur IPS 2023 Workshop on Diffusion Models, Aug 2023. [14] Pavel Avdeyev, Chenlai Shi, Yuhao Tan, Kseniia Dudnyk, and Jian Zhou. Dirichlet diffusion score model for biological sequence generation. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 1276 1301. PMLR, 2023. [15] Chenyu Wang, Masatoshi Uehara, Yichun He, Amy Wang, Tommaso Biancalani, Avantika Lal, Tommi Jaakkola, Sergey Levine, Hanchen Wang, and Aviv Regev. Fine-tuning discrete diffusion models via reward optimization with applications to dna and protein design. ar Xiv preprint ar Xiv:2410.13643, 2024. [16] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. [17] Robert Brown. XXVII. A brief account of microscopical observations made in the months of June, July and August 1827, on the particles contained in the pollen of plants; and on the general existence of active molecules in organic and inorganic bodies. The Philosophical Magazine, 4 (21):161 173, 1828. [18] Albert Einstein. Über die von der molekularkinetischen Theorie der Wärme geforderte Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen. Annalen der Physik, pages 549 560, 1905. [19] Norbert Wiener. Differential-space. Journal of Mathematics and Physics, 2:131 174, 1923. [20] Eunbi BI Yoon, Keehun Park, Sungwoong Kim, and Sungbin Lim. Score-based generative models with Lévy processes. In Advances in Neural Information Processing Systems, volume 36, pages 40694 40707. Curran Associates, Inc., 2023. [21] Hengyuan Ma, Li Zhang, Xiatian Zhu, and Jianfeng Feng. Approximated anomalous diffusion: Gaussian mixture score-based generative models, 2023. URL https://openreview.net/ forum?id=yc9xen7EAzd. [22] Paul Lévy. Random functions: general theory with special reference to Laplacian random functions. University of California Publications in Statistics, 1:331 390, 1953. [23] Benoit B. Mandelbrot and John W. Van Ness. Fractional Brownian Motions, fractional noises and applications. SIAM Review, 10(4):422 437, 1968. [24] Jerry Stinson. The (mis) behavior of markets. Journal of Personal Finance, 4(4):99, 2005. [25] Francesca Biagini, Yaozhong Hu, Bernt Øksendal, and Tusheng Zhang. Stochastic Calculus for Fractional Brownian Motion and Applications. Springer-Verlag London Limited 2008, 2008. doi: 10.1007/978-1-84628-797-8. [26] Philipp Harms and David Stefanovits. Affine representations of fractional processes with applications in mathematical finance. Stochastic Processes and their Applications, 129(4): 1185 1228, 2019. ISSN 0304-4149. [27] Rembert Daems, Manfred Opper, Guillaume Crevecoeur, and Tolga Birdal. Variational inference for SDEs driven by fractional noise. In The Twelfth International Conference on Learning Representations, 2024. [28] Philip E. Protter. Stochastic Integration and Differential Equations. Stochastic Modelling and Applied Probability. Springer Berlin, Heidelberg, 2nd edition, 2013. ISBN 978-3-662-10061-5. [29] Anh Tong, Thanh Nguyen-Tang, Toan Tran, and Jaesik Choi. Learning fractional white noises in neural stochastic differential equations. In Advances in Neural Information Processing Systems, volume 35, pages 37660 37675. Curran Associates, Inc., 2022. [30] Samuel N. Cohen and Robert J. Elliott. 
Stochastic Calculus and Applications. Probability and Its Applications. Birkhäuser, New York, NY, 2st edition, 2015. ISBN 978-1-4939-2866-8. [31] R. L. Stratonovich. Conditional Markov Processes. Theory of Probability & Its Applications, 5 (2):156 178, 1960. [32] Brian D.O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313 326, 1982. ISSN 0304-4149. [33] Hans Föllmer. Time reversal on Wiener space. Stochastic Processes - Mathematic and Physics, volume 1158 of Lecture Notes in Math, page 119 129, 1986. [34] Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(24):695 709, 2005. [35] Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, page 204, 2019. [36] Christian Bayer, Antonis Papapantoleon, and Raul Tempone. Computational Finance. Technical University of Berlin, 2021. URL https://www.wias-berlin.de/people/bayerc/files/ lecture.pdf. Lecture notes. [37] Terry J. Lyons. Differential equations driven by rough signals. Revista Matemática Iberoamericana, 14(2):215 310, 1998. [38] James Davidson and Nigar Hashimzade. Type I and type II fractional Brownian motions: A reconsideration. Computational Statistics & Data Analysis, 53(6):2089 2106, 2009. ISSN 0167-9473. The Fourth Special Issue on Computational Econometrics. [39] Aaron Lou and Stefano Ermon. Reflected diffusion models. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 22675 22701. PMLR, 2023. [40] Dang Huy. A remark on non-Markov property of a fractional Brownian motion. Vietnam Journal of Mathematics, 31, 01 2003. [41] Sebastien Darses and Bruno Saussereau. Time Reversal for Drifted Fractional Brownian Motion with Hurst Index H > 1/2. Electronic Journal of Probability, 12(none):1181 1211, 2007. doi: 10.1214/EJP.v12-439. [42] Philipp Harms. Strong convergence rates for Markovian representations of fractional processes. Discrete and Continuous Dynamical Systems - B, 26(10):5567 5579, 2021. ISSN 1531-3492. [43] Simo Särkkä and Arno Solin. Applied Stochastic Differential Equations, volume 10. Cambridge University Press, 2019. [44] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems, volume 35, pages 26565 26577. Curran Associates, Inc., 2022. [45] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. [46] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. [47] Dan Friedman and Adji Bousso Dieng. The Vendi Score: A diversity evaluation metric for machine learning. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. [48] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In Advances in Neural Information Processing Systems, volume 32. 
Curran Associates, Inc., 2019. [49] Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R. Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In The Eleventh International Conference on Learning Representations, 2023. [50] Raghav Singhal, Mark Goldstein, and Rajesh Ranganath. Where to diffuse, how to diffuse, and how to get back: Automated learning for multivariate diffusions. In The Eleventh International Conference on Learning Representations, 2023. [51] Bowen Jing, Gabriele Corso, Renato Berlinghieri, and Tommi Jaakkola. Subspace diffusion generative models. In Lecture Notes in Computer Science, pages 274 289. Springer Nature Switzerland, 2022. [52] Dongjun Kim, Byeonghu Na, Se Jung Kwon, Dongsoo Lee, Wanmo Kang, and Il-chul Moon. Maximum likelihood training of implicit nonlinear diffusion model. In Advances in Neural Information Processing Systems, volume 35, pages 32270 32284. Curran Associates, Inc., 2022. [53] Chin-Wei Huang, Milad Aghajohari, Joey Bose, Prakash Panangaden, and Aaron C Courville. Riemannian diffusion models. In Advances in Neural Information Processing Systems, volume 35, pages 2750 2761. Curran Associates, Inc., 2022. [54] Charlotte Bunne, Ya-Ping Hsieh, Marco Cuturi, and Andreas Krause. The Schrödinger bridge between Gaussian measures has a closed form. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2023. [55] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 32211 32252. PMLR, 2023. [56] Giannis Daras, Mauricio Delbracio, Hossein Talebi, Alex Dimakis, and Peyman Milanfar. Soft diffusion: Score matching with general corruptions. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. [57] Emiel Hoogeboom and Tim Salimans. Blurring diffusion models. In The Eleventh International Conference on Learning Representations, 2023. [58] Arpit Bansal, Eitan Borgnia, Hong-Min Chu, Jie Li, Hamid Kazemi, Furong Huang, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Cold diffusion: Inverting arbitrary image transforms without noise. In Advances in Neural Information Processing Systems, volume 36, pages 41259 41282. Curran Associates, Inc., 2023. [59] Severi Rissanen, Markus Heinonen, and Arno Solin. Generative modelling with inverse heat dissipation. In The Eleventh International Conference on Learning Representations, 2023. [60] Kohei Hayashi and Kei Nakagawa. Fractional SDE-Net: Generation of time series data with long-term memory. In 2022 IEEE 9th International Conference on Data Science and Advanced Analytics (DSAA), pages 1 10, 2022. [61] Christian Bayer, Peter K. Friz, and Nikolas Tapia. Stability of deep neural networks via discrete rough paths. SIAM Journal on Mathematics of Data Science, 5(1):50 76, 2023. [62] P Kidger. On neural differential equations. Ph D thesis, University of Oxford, 2021. [63] Shujian Liao, Terry Lyons, Weixin Yang, and Hao Ni. Learning stochastic differential equations using RNN with log signature features. ar Xiv preprint ar Xiv:1908.08286, 2019. [64] James Morrill, Cristopher Salvi, Patrick Kidger, and James Foster. Neural rough differential equations for long time series. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 7829 7838. PMLR, 2021. [65] Fischer Black and Myron Scholes. 
The pricing of options and corporate liabilities. The Journal of Political Economy, 81(3):637 654, 1973. [66] Christoph Czichowsky, Rémi Peyre, Walter Schachermayer, and Junjian Yang. Shadow prices, fractional Brownian motion, and portfolio optimisation under transaction costs. Finance and Stochastics, 22:161 180, 2018. [67] Paolo Guasoni, Zsolt Nika, and Miklós Rásonyi. Trading fractional Brownian motion. SIAM journal on financial mathematics, 10(3):769 789, 2019. [68] Christian Bayer, Peter Friz, and Jim Gatheral. Pricing under rough volatility. Quantitative Finance, 16(6):887 904, 2016. [69] Jim Gatheral, Thibault Jaisson, and Mathieu Rosenbaum. Volatility is rough. In Commodities, pages 659 690. Chapman and Hall/CRC, 2022. [70] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234 241. Springer, 2015. [71] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Co RR, December 2014. ar Xiv:1412.6980 [cs.LG]. [72] Leslie N. Smith and Nicholay Topin. Super-convergence: very fast training of neural networks using large learning rates. In Defense + Commercial Sensing, 2018. [73] Marjorie G. Hahn, Kei Kobayashi, and Sabir Umarov. Fokker-Planck-Kolmogorov equations associated with time-changed fractional Brownian motion. ar Xiv: Mathematical Physics, 139: 691 705, 2011. A The mathematical framework of generative fractional diffusion models 17 A.1 A Markovian representation of fractional Brownian motion . . . . . . . . . . . . . 17 A.2 The forward model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 A.3 Estimating the score via augmented score matching loss . . . . . . . . . . . . . . . 21 B Forward sampling 23 C Implementation details 26 D Additional experiments 26 E Illustration of generated data 28 F Computational cost of augmenting processes 30 G Likelihood computation 32 H Challenges in the attempt to generalize 32 I Notational conventions 34 A The mathematical framework of generative fractional diffusion models In this section we provide the mathematical details of the score-based generative model defined in the main paper. The driving noise of the underlying stochastic process is based on the affine representation of fractional processes from Harms and Stefanovits [26] and further simplified by the closed form expression to determine optimal approximation coefficients of Daems et al. [27]. A.1 A Markovian representation of fractional Brownian motion We begin with the definition of type I fractional Brownian motion, defined on the whole real line, possessing correlated increments that are in contrast to type II fractional Brownian motion stationary. Definition A.1 (Type I Fractional Brownian Motion [23]). Let (Ω, F, P) be a complete probability space equipped with a complete and right continuous filtration {Ft} and Γ the Gamma function. For two standard independent {Ft}-Brownian motions (BMs) B and B the centered Gaussian process W H = (W H t )t R with W H t := 1 Γ(H + 1 2 )d Bs + 1 Γ(H + 1 2 d Bs (22) uniquely characterized in law by its covariances E W H t W H s = 1 2 t2H + s2H (t s)2H , t s > 0 (23) is called type I fractional Brownian motion (f BM) with Hurst index H (0, 1). Type II f BM from the main paper is retrieved by setting the additionally defined BM B on the negative real line to zero. Therefore, the difference to type II f BM is the stochastic integral w.r.t. 
B̄, which yields stationary increments and a non-trivial distribution at t = 0. For H = 0.5, the process is a BM and thus has independent increments. For H ∈ (0, 1) \ {1/2}, the process possesses correlated increments and, compared to BM, smoother paths for H > 0.5 due to positively correlated increments (super-diffusion) and rougher paths for H < 0.5 due to negatively correlated increments (sub-diffusion). For type I fBM these three regimes are reflected in the same change of quadratic variation: from t at H = 0.5 to zero quadratic variation in the smooth regime and to infinite quadratic variation in the rough regime [30]. To prepare the approximation of the non-Markovian and non-semimartingale fBM [25] via Markovian semimartingales, define for every γ ∈ (0, ∞) the Ornstein-Uhlenbeck process Y^γ given by

Y^γ_t := Y^γ_0 e^{−tγ} + ∫_0^t e^{−γ(t−s)} dB_s,   t ≥ 0,   Y^γ_0 := ∫_{−∞}^0 e^{sγ} dB̄_s,   (24)

with speed of mean reversion γ and non-trivial starting value, in contrast to the OU processes defined in eq. (6) of the main paper. By Itô's product rule [30], the process Y^γ solves the same SDE

dY^γ_t = −γ Y^γ_t dt + dB_t,   Y^γ_0 = ∫_{−∞}^0 e^{sγ} dB̄_s,   (25)

with a different starting value. Following Harms and Stefanovits [26], we represent fBM by an integral over the predefined family of Ornstein-Uhlenbeck processes.

Theorem A.2 (Markovian representation of fBM [26, 27]). The non-Markovian process W^H permits the infinite-dimensional Markovian representation

W^H_t = ∫_0^∞ (Y^γ_t − Y^γ_0) ν_1(γ) dγ for H ≤ 1/2,   W^H_t = ∫_0^∞ γ (Y^γ_t − Y^γ_0) ν_2(γ) dγ for H > 1/2,   (26)

with densities ν_1(γ) ∝ γ^{−(H+1/2)} and ν_2(γ) ∝ γ^{−(H−1/2)}.   (27)

Note that we follow Daems et al. [27] in replacing the process Z^γ_t := Z^γ_0 e^{−tγ} + ∫_0^t e^{−(t−s)γ} Y^γ_s ds from the original theorem throughout this work by Z^γ_t = γ^{−1} Y^γ_t + (−γ^{−1} Y^γ_0 + Z^γ_0) e^{−tγ}. This is justified by Harms and Stefanovits [26, Remark 3.5] and simplifies, for H > 1/2, the approximation of fBM and the definition of our generative model, since we only have to reverse the Y^γ processes instead of the pairs (Y^γ, Z^γ). For Y^γ_0 = 0, eq. (26) yields an infinite-dimensional Markovian representation of type II fBM [27]. The MA-fBM from Definition 3.2 in the main paper becomes for type I fBM

B̂^H_t = Σ_{k=1}^K ω_k (Y^k_t − Y^k_0),   H ∈ (0, 1),   t ≥ 0,   (28)

with non-trivial Y_0 = (Y^1_0, ..., Y^K_0) that is a centered multivariate Gaussian with covariances E[Y^k_0 Y^l_0] = 1/(γ_k + γ_l) [27]. Theorem 3.3 holds true for type I fBM as well, with optimal approximation coefficients given in Daems et al. [27, Proposition 5]. For more details on the properties and distinction of type I and type II fBM we refer the reader to Daems et al. [27].

A.2 The forward model

We define in the following a score-based generative model approximating a fractional diffusion process driven by type I fBM. For the remainder of Appendix A we assume Y^k_0 = ∫_{−∞}^0 e^{sγ_k} dB̄_s for all 1 ≤ k ≤ K, where the setting from the main paper with type II fBM is recovered by choosing Y^k_0 = 0 instead. Let B̂^H be a D-dimensional MA-fBM with Hurst index H ∈ (0, 1). For continuous functions µ : [0, T] → ℝ and g : [0, T] → ℝ we define the forward process X = (X_t)_{t ∈ [0,T]} by

dX_t = µ(t) X_t dt + g(t) dB̂^H_t,   X_0 = x_0 ∼ p_0,   t ∈ [0, T],   (29)

where p_0 is an unknown data distribution from which we aim to sample. Using eq. (25) we note

dB̂^H_t = −Σ_{k=1}^K ω_k γ_k Y^k_t dt + Σ_{k=1}^K ω_k dB_t,   (30)

where B = (B^1, ..., B^D) is a multivariate BM.
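To make the forward dynamics of eqs. (29)–(30) concrete, the following minimal sketch simulates the K augmenting OU processes and the MA-fBM-driven forward process with a plain Euler–Maruyama scheme in the type II setting (Y^k_0 = 0). The function name, the uniform placeholder weights and the particular choices of µ and g are illustrative assumptions and not the released implementation.

```python
import numpy as np

def simulate_forward(x0, omega, gamma, mu, g, T=1.0, n_steps=1000, seed=None):
    """Euler-Maruyama sketch of dX_t = mu(t) X_t dt + g(t) dB^H_t with
    dB^H_t = -sum_k omega_k gamma_k Y^k_t dt + (sum_k omega_k) dB_t (type II, Y^k_0 = 0)."""
    rng = np.random.default_rng(seed)
    omega, gamma = np.asarray(omega, float), np.asarray(gamma, float)
    D, K = x0.shape[0], len(omega)
    dt = T / n_steps
    x = x0.astype(float).copy()
    y = np.zeros((K, D))                              # augmenting OU processes Y^1, ..., Y^K
    for n in range(n_steps):
        t = n * dt
        dB = rng.normal(scale=np.sqrt(dt), size=D)    # one shared BM increment per data dimension
        dBH = -(omega * gamma) @ y * dt + omega.sum() * dB
        x = x + mu(t) * x * dt + g(t) * dBH
        y = y + (-gamma[:, None] * y * dt + dB)       # dY^k = -gamma_k Y^k dt + dB (same BM)
    return x, y

# illustrative usage: geometric grid of mean-reversion speeds and uniform placeholder weights
gammas = np.geomspace(0.1, 10.0, num=5)
weights = np.full(5, 1.0 / 5)                         # the paper uses the optimal coefficients instead
x_T, y_T = simulate_forward(np.zeros(3), weights, gammas, mu=lambda t: -0.5, g=lambda t: 1.0)
```

Each data dimension reuses the same Brownian increment for X and for all K augmenting processes, mirroring the shared driving noise in eq. (30).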
With ω := PK k=1 ωk we rewrite the dynamics of the forward process as µ(t)Xt g(t) k=1 ωkγk Yk t dt + ωg(t)d Bt, t [0, T], (31) Taking into account the dynamics of the OU processes, we define the augmented forward process Z = (Zt)t [0,T ] by Zt = (Xt,1, Y 1 t,1, ..., Y K t,1, Xt,2, Y 1 t,2, ..., Y K t,2, ..., ..., ..., Xt,D, Y 1 t,D, ...Y K t,D) RD(K+1) (32) following the dynamics d Zt = F(t)Ztdt + G(t)d Bt (33) with F(t) = diag(R(t), ..., R(t)) RD(K+1),D(K+1), R(t) = µ(t) g(t)ω1γ1 . . . g(t)ωKγK 0K diag(γ1, ..., γK) RK+1,K+1 (34) and G(t) = ( ωg(t)ID ID . . . ID)T RD(K+1),D. (35) For each dimension 1 d D, the dynamics of the process transforming x0,d reduce to those of the augmented forward process with D = 1, given by d Zt = F(t)Ztdt + G(t)d Bt, (36) where the K + 1 processes that transform x0,d are all driven by the same one-dimensional BM B. The augmented forward process Z conditioned on y1 0, ..., y K 0 and a data sample x0 p0 is a linear transformation of BM and hence a Gaussian process and so is X [43]. Since the integral w.r.t BM has zero mean, the mean vector of the augmenting processes is E Yk t = 0d for all 1 k K and the mean of the conditional forward process is the solution of the ODE t E [Xt|x0] = µ(t)E [Xt|x0] (37) and hence the marginal mean E [Xt|x0] = c(t)x0 with c(t) = exp Z t 0 µ(s)ds (38) is not affected by changing the driving noise to MA-f BM. The marginal covariance matrix Σt of the conditional augmented forward process can be approximated numerically by solving an ODE, see Appendix B for details. In addition we present a continuous reparameterization of the forward process, resulting for some forward dynamics in a closed form solution of the marginal covariance matrix. Our result generalizes the explicit formula for the perturbation kernel p0t(x|x0) = N(x; c(t)x0, c2(t)σ2(t)Id) given in Karras et al. [44]. Proposition A.3 (Continuous Reparameterization Trick). Let x0 be a fixed realisation drawn from p0. The forward process X = (Xt)t [0,T ] conditioned on x0 admits the continuous reparameterization Xt = c(t) x0 + Z t 0 α(t, s)d Bs g(s) c(s) e sγkds Yk 0 | {z } =0 for type II f BM since Yk 0 =0 with c(t) = exp R t 0 µ(s)ds and g(u) c(u) e γk(u s)du + ω g(s) such that Xt|x0 N c(t)x0, c2(t)σ2(t) + σ2 K(t) Id is a Gaussian random vector for all t (0, T] with σ2(t) = Z t 0 α2(t, s)ds (41) σ2 K = c2(t) g(s) c(u)du 2 (42) g(s) c(s) e sγkds Z t g(s) c(s) e sγlds (43) vanishing for an underlying type II f BM. Proof. By continuity, the functions µ and σ are bounded. Moreover, the processes Y 1 j , ..., Y K j posses continuous, hence bounded, paths and thus 0 |µ(u)|du < , Z t 0 σ2(u)du < and Z t k ωkγk Yk t |du < P a.s., (44) where the last integral is understood entrywise. Hence, by Cohen and Elliott [30, Theorem 16.6.1], the unique solution of the SDE eq. (31) is given explicitly as k=1 ωkγk Yk u g(u) c(u) d Bu with c(t) = exp R t 0 µ(s)ds . Define J Y[K] 0 , t := g(s) c(s) e sγkds Yk 0 (46) and by the definition of Y k j in (24) we calculate using the Stochastic Fubini Theorem [26] k=1 ωkγk Yk u g(u) c(u) e γk(u s)d Bsdu + J(Y[K] 0 , t) (47) g(u) c(u) e γk(u s)dud Bs + J Y[K] 0 , t (48) k=1 ωkγk Yk u g(u) c(u) d Bu g(u) c(u) e γk(u s)dud Bs + ω Z t g(u) c(u) d Bu J Y[K] 0 , t ! g(u) c(u) e γk(u s)du + ω g(s) d Bs J Y[K] 0 , t ! = c(t)x0 + c(t) Z t 0 α(t, s)d Bs c(t)J Y[K] 0 , t (50) g(u) c(u) e γk(u s)du + ω g(s) c(s) . (51) Since α(t, ) is continuous for every fixed t [0, T] we have R t 0 α2(t, s)ds < . Using that the integral of a bounded deterministic function w.r.t. 
Brownian motion is a Gaussian process we have by Itô s isometry Z t 0 α(t, s)d Bs N 0d, σ2(t)Id with σ2(t) = Z t 0 α2(t, s)ds. (52) Therefore, conditional on x0, the random vector Xt is Gaussian with mean vector mx t = c(t)x0 + E h J(Y[K] 0 ) i = x0 exp Z t 0 µ(s)ds . (53) Moreover, Bj and Bj corresponding to the entries of B = ( B1, ..., Bd) and B = (B1, ..., Bd) are independent by Theorem A.1 resulting in the entrywise variance Σx t,j,j = c2(t) Z t 0 α2(t, s)ds + σ2 K(t) (54) σ2 K(t) = V h J(Y[K] 0 )j i = c2(t) g(s) c(u)du 2 (55) g(s) c(s) e sγkds Z t g(s) c(s) e sγlds, (56) where we used again Itô s isometry to calculate E Y k 0,j Y l 0,j = E Z 0 e(γk+γl)sds = 1 γk + γl . (57) Since the entries of B are independent, we find the covariance matrix Σx t = c2(t)σ2(t) + σ2 K(t) Id. (58) The preceding proposition generalizes the reparameterization trick 3 from discrete time to continuous-time in the sense that 1 αtnϵ, ϵ N(0d, Id) (59) used in discrete time [2] with time steps 0 = t0 < ... < t N = T is replaced by our continuous-time reparameterization Xt = c(t) x0 + Z t 0 α(t, s)d Bs g(s) c(s) e sγkds Yk 0, (60) enabling to directly sample Xt|x0 N(c(t)x0 + c2(t)σ2(t) + σ2 K(t) ID) for a given data sample x0 and time point t (0, T], in case that σ2(t) and σ2 K(t) have a closed form solution. For a complete characterization of the marginal covariance matrix Σt of the conditioned augmented forward process we calculate by Itô isometry with X = Xj and Y l = Y l j for all 1 j D, 1 l K and any t [0, T] E Xt Y l t = c(t) Z t 0 α(t, s)e γk(t s)ds + c(t) ωkγk γk + γl e γlt Z t g(s) c(s) e sγkds (61) E Y k t Y l t = e (γk+γl)s γk + γl + 1 e (γk+γl)t γk + γl = 1 γk + γl (62) reducing for type II f BM to E Xt Y l t = c(t) Z t 0 α(t, s)e γk(t s)ds and E Y k t Y l t = 1 e (γk+γl)t γk + γl . (63) We denote in the following the stacked vector of the augmenting processes by Y[K] t = (Y 1 t,1, Y 2 t,1, ..., Y K t,1, Y 1 t,2, Y 2 t,2, ..., Y K t,2, ...., Y 1 t,D, Y 2 t,D, ..., Y K t,D) RD(K+1). (64) The random vector Y[K] t is a centered Gaussian process with covariance matrix Λt = diag(Σy t , ..., Σy t ) RD K,D K, Σy t RK,K, [Σy t ]k,l = E Y k t Y l t (65) where Σy t does not depend on the dimension 1 j D and we write qt for the multivariate Gaussian density of Y[K] t . Since we know the distribution of Y[K] 0 , we can directly calculate the corresponding score function by y[K] log qt Y[K] t = Λ 1 t Y[K] t . (66) A.3 Estimating the score via augmented score matching loss Conditioning Zt on x0 p0 and a realisation y[K] t of the stacked augmenting processes Y[K] t defined in eq. (64) at fixed time t [0, T] results in the Gaussian vector Xt N( mt, Σt) with mean mt = c(t)x0 + k=1 ηk t yk t , where ηk t = l=1 E Xt Y l t h (Σy t ) 1i and covariance Σt = c2(t)σ2(t) τ 2 t Id, where τ 2 t = k=1 ηk t E Xt Y k t . (68) We denote with x log p0t the conditional score function of Xt and calculate for the gradient w.r.t. x = (x1, ..., x D) RD x log p0t(x|y[K] t , x0) = Σ 1 t (x mt) = (x mt) (c2(t)σ2(t) τ 2 t ). (69) 3See https://lilianweng.github.io/posts/2021-07-11-diffusion-models/ for the derivation in discrete time. and for the gradient w.r.t. yk = (yk 1, ..., yk D) RD yk log p0t(x|y[K] t , x0) = 1 2 yk h (x mt)T Σ 1 t (x mt) i (70) = ηk t x log p0t(x|y[K] t , x0). 
(71) Deploying this relation of x log p0t and yk log p0t we derive the augmenting score matching loss that reduces the dimensionality of the score model we have to learn to the dimensionality of the data distribution and results in a score model guided by the the known score function y[K] log qt. Proposition A.4 (Optimal Score Model). Assume that sθ is optimal w.r.t. the augmented score matching loss L. The score model Sθ(Zt, t) := k ηk t Yk t , t), η1 t sθ(Xt X k ηk t Yk t , t), ..., ηK t sθ(Xt X k ηYk t , t) yields the optimal L2(P) approximation of z log pt(Zt) via Sθ(Zt, t) + z log qt(Y[K] t ) z log pt(Zt). (72) Proof. Fix t [0, T]. We write paug t for the density of Zt, paug 0t for the conditional density of Zt on X0, p0t for the density of Xt and q0t for the conditional density of Y[K] t on X0. First note that Y[K] t and X0 are independent by assumption and hence qt = q0t. By direct calculations we find x log paug t (Zt) = E(X0|Xt,Y[K] t ) [ x log paug 0t (Zt|X0)] (73) = E(X0|Xt,Y[K] t ) h x log p0t(Xt|Y[K] t , X0)q0t(Y[K] t |X0) i (74) = E(X0|Xt,Y[K] t ) x log p0t(Xt|Y[K] t , X0) + x log qt(Y[K] t ) | {z } =0d = E(X0|Xt,Y[K] t ) h x log p0t(Xt|Y[K] t , X0) i (76) (69) = E(X0|Xt,Y[K] t ) k ηk t Yk t c(t)X0 c2(t)σ2(t) τ 2 t Hence the best L2(P)-approximation of x log paug t (Zt) is a minimizer of the augmented score matching loss by x log paug t (Zt) (77) = E(X0|Xt,Y[K] t ) k ηk t Yk t c(t)X0 c2(t)σ2(t) τ 2 t = arg min sθ E(X0,Y[K] t )E(Xt|Y[K] t ,X0) k=1 ηk t Yk t , t) Xt P k ηk t Yk t c(t)X0 c2(t)σ2(t) τ 2 t (69) = arg min sθ E(X0,Y[K] t )E(Xt|Y[K] t ,X0) k=1 ηk t Yk t , t) x log p0t(Xt|Y[K] t , X0) Assume now that sθ is a minimizer of the augmented score matching loss. Similar to the calculation above we have yk log paug t (Zt) = E(X0|Xt,Y[K] t ) yk log paug 0t (Zt|X0) (81) = E(X0|Xt,Y[K] t ) h yk log p0t(Xt|Y[K] t , X0)q0t(Y[K] t |X0) i (82) = E(X0|Xt,Y[K] t ) h yk log p0t(Xt|Y[K] t , X0) + yk log qt(Y[K] t ) i (83) (70) = ηk t E(X0|Xt,Y[K] t ) h x log p0t(Xt|Y[K] t , X0) i + yk log qt(Y[K] t ) (84) and hence ηk t sθ(Xt P k ηk t Yk t ) + yk log qt(Y[K] t ) is the best approximation of yk log paug t (Zt) in L2(P) and the score model Sθ(Zt, t) := k ηk t Yk t , t), η1 t sθ(Xt X k ηk t Yk t , t), ..., ηK t sθ(Xt X k ηYk t , t) yields the best L2(P)-approximator of z log pt via Sθ(Zt, t) + z log qt(Y[K] t ) z log pt(Zt). (85) B Forward sampling We assume throughout this section type II f BM. Given the marginal covariance matrix Σt of Zt|x0 we uniformly sample first a time point t (0, T] and second Zt N(ˆzt, Σt) with ˆzt = (c(t)x0,1, 0, ..., 0, c(t)x0,2, 0, ..., 0, ..., ..., ..., c(t)x0,D, 0, ...0) RD(K+1) (86) where we use E [Xt|x0] = c(t)x0 and E Yk t = 0D. In the following we characterize further the entries of the marginal covariance matrix Σt. The calculations in this section are straightforward; nevertheless, we present them in full detail to facilitate easy understanding for the interested reader. 
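As an illustration of this sampling step, the sketch below draws the augmented state Z_t | x_0 in one shot from its Gaussian law, exploiting the block-diagonal structure over data dimensions; the (K+1)×(K+1) per-dimension covariance block is assumed to be supplied, e.g. from the closed forms derived below or from the covariance ODE (119).

```python
import numpy as np

def sample_augmented_state(x0, c_t, Sigma_block, seed=None):
    """Draw Z_t | x_0 ~ N(z_hat_t, Sigma_t) with Sigma_t = diag(Sigma_block, ..., Sigma_block).

    x0          : (D,) data sample
    c_t         : scalar c(t) = exp(int_0^t mu(s) ds)
    Sigma_block : (K+1, K+1) covariance of (X_t, Y^1_t, ..., Y^K_t) | x_0 for one data dimension
    """
    rng = np.random.default_rng(seed)
    D = x0.shape[0]
    L = np.linalg.cholesky(Sigma_block)        # one factorization reused for every data dimension
    z = np.zeros((D, Sigma_block.shape[0]))
    z[:, 0] = c_t * x0                         # E[X_t | x_0] = c(t) x_0, augmenting means are zero
    z = z + rng.normal(size=z.shape) @ L.T     # correlated noise within each dimension's block
    return z.reshape(-1)                       # stacked as (X_{t,1}, Y^1_{t,1}, ..., X_{t,2}, ...)
```

Because the covariance block does not depend on the data dimension, a single Cholesky factorization suffices regardless of D.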
We begin with rewriting σ2 from Proposition 4.2 given by σ2(t) = c2(t) Z t 0 α2(t, s)ds (87) α(t, s) = ω g(s) g(u) c(u) e γk(u s)du (88) g(u) c(u) e γk(u s)du | {z } =:αk(t,s) fk(u, s) := g(u) c(u) e γk(u s) and Ik(t, s) := Z t s fk(u, s)du (90) σ2 t = c2(t) Z t 0 α2(t, s)ds (91) = c2(t) Z t s fk(u, s)du #2 = c2(t) Z t k=1 ωkαk(t, s) = c2(t) Z t i=1,j=1 ωiωjαi(t, s)αj(t, s)ds (94) i=1,j=1 ωiωjc2(t) Z t 0 αi(t, s)αj(t, s)ds (95) i=1,j=1 ωiωjc2(t) Z t c(s) γi Ii(t, s) g(s) c(s) γj Ij(t, s) ds (96) var B(t) c2(t) Z t c(s) γi Ii(t, s) + γj Ij(t, s) γiγj Ii(t, s)Ij(t, s) ds , var B(t) = c2(t) Z t g2(s) c2(s) ds (98) corresponds to the purely Brownian marginal variance, explicitly calculated for VE and VP in Song et al. [16]. Using the above derivation, we derive the closed form variance schedule for FVE dynamics. Fractional Variance Exploding Fix σmax > σmin > 0 and define r := σmax σmin . Following Song et al. [16] we set µ(t) 0 and g(t) = art with a = σmin p 2 log(r) (99) such that c(t) = exp(0) = 1 and calculate Ik(t, s) = Z t s fk(u, s)du = Z t s arue γk(u s)du = F(t) F(s) (100) = a ln(r) γk | {z } ak eln(r)t γkt+γks eln(r)s = ak rte γk(t s) rs , (101) since the derivative of F(u) = akrue γk(u s) is given by d du F(u) = d h akrue γk(u s)i = akru ln(r)e γk(u s) + akrue γk(u s)( γk) (102) = a ln(r) γk (ln(r) γk)(rue γk(u s)) = arue γk(u s). (103) We calculate for the variance of Xt|x0 V [Xt|x0] = i,j=1 ωiωj{var B(t) aγi 0 rs Ii(t, s)ds | {z } Ji(t) 0 rs Ij(t, s)ds | {z } Jj(t) 0 Ii(t, s)Ij(t, s)ds | {z } =Ji,j(t) 0 rs rte γk(t s) rs ds = ak 0 rt+se γk(t s)ds ak 0 r2sds (106) = ak [F1(t) F1(0)] ak [F2(t) F2(0)] = ak r2t rte γkt ln (r) + γk r2t 1 d ds F1(s) = rt+s ln(r)e γk(t s) + rt+se γk(t s)(γk) ln(r) + γk = rt+se γk(t s), (108) d ds F2(s) = d = r2s ln(r)2 2 ln(r) = r2s. (109) Ji,j(t) = aiaj rte γi(t s) rs rte γj(t s) rs ds (110) " r2t 1 e t(γi+γj) r2t rte γit γi + ln(r) r2t rte γjt γj + ln(r) + r2t 1 0.0 0.2 0.4 0.6 0.8 1.0 t Analytical Approximated (a) Variance schedule of the forward FVE process. 0.0 0.2 0.4 0.6 0.8 1.0 t Var[Y k t ] Analytical Approximated (b) Variance schedule of the augmenting processes. Figure 5: Analytical solution (blue) used by our method for FVE dynamics with K = 5 and H = 0.5 compared to the approximated solution (dashed red) resulting from solving ODE (119). We calculate the covariance of Xt|x0 and Y l t cov(Xt|x0, Y l t ) = c(t) Z t 0 α(t, s)e γl(t s)ds (112) s fk(u, s)du e γl(t s)ds (113) 0 rse γl(t s)ds γk s fk(u, s)due γl(t s)ds (114) 0 rse γl(t s)ds γkak rte γk(t s) rs e γl(t s)ds 0 rseγlsds γkak rte γk(t s) rs e γl(t s)ds (a + akγk)(rt e γlt) γl + ln(r) γkak rt 1 e t(γk+γl) Fractional Variance Preserving To the best of our knowledge, there is no closed form solution for R t s fk(u, s)du for the dynamics of FVP. In this case, we numerically solve an ODE to determine the marginal covariance matrix of the conditional augmented forward process. General Dynamics. The covariance matrix of the conditional augmented forward process with dynamics d Zt = F(t)Ztdt + G(t)d Bt, (118) solves the ODE tΣt = F(t)Σt + Σt F(t)T + G(t)G(t)T , (119) lacking in general a closed form solution [43] in contrast to the setting of Song et al. [16]. This approach is applicable for any choice of µ and g in the forward dynamics, but depending on the choice of drift and diffusion function it might not yield a numerically stable solution. 
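For such general dynamics, a minimal numerical sketch of eq. (119) for the per-dimension block (D = 1) could look as follows; the explicit Euler stepping and the sub-step count are illustrative assumptions, and a stiffer or higher-order solver may be preferable for large γ_K.

```python
import numpy as np

def covariance_ode(mu, g, omega, gamma, t_eval, n_sub=100):
    """Integrate dSigma/dt = F(t) Sigma + Sigma F(t)^T + G(t) G(t)^T for D = 1.

    Returns the (K+1, K+1) covariance block of (X_t, Y^1_t, ..., Y^K_t) | x_0 at every
    time in t_eval, starting from Sigma_0 = 0 (type II fBM)."""
    omega, gamma = np.asarray(omega, float), np.asarray(gamma, float)
    K = len(omega)

    def F(t):
        R = np.zeros((K + 1, K + 1))
        R[0, 0] = mu(t)
        R[0, 1:] = -g(t) * omega * gamma      # coupling of X to the OU processes, cf. eq. (31)
        R[1:, 1:] = -np.diag(gamma)           # dY^k = -gamma_k Y^k dt + dB
        return R

    def G(t):                                  # (omega_bar * g(t), 1, ..., 1)^T, cf. eq. (35)
        return np.concatenate(([omega.sum() * g(t)], np.ones(K)))[:, None]

    Sigma = np.zeros((K + 1, K + 1))
    out, t_prev = [], 0.0
    for t_next in t_eval:
        dt = (t_next - t_prev) / n_sub
        for i in range(n_sub):                 # explicit Euler sub-steps between output times
            t = t_prev + i * dt
            Sigma = Sigma + dt * (F(t) @ Sigma + Sigma @ F(t).T + G(t) @ G(t).T)
        out.append(Sigma.copy())
        t_prev = t_next
    return out
```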
We empirically observe in Figure 5 that the analytical solution for FVE and the numerical approximation of the variance schedule, determined by solving eq. (119), do not differ significantly.

Variance schedules. We normalize the variance schedules of FVE and FVP dynamics such that the variance at t = 0 and at t = T equals the variance used in the purely Brownian setting of VE and VP dynamics. For both FVE and FVP dynamics we calculate ω according to Proposition 3.3, determine σ²_T and define ω = ω/σ²_T to weight the OU processes. By doing so, the terminal variance remains the same throughout different choices of H, as empirically confirmed in Figure 6.

Figure 6: Normalized variance schedules for K = 5 over time. (a) Variance schedules of the forward FVE process, calculated in closed form according to the derived formulas; the shape of the schedule is preserved throughout different values of H (shown for H = 0.1, 0.5, 0.9). (b) Variance schedules of the forward FVP process, numerically approximated; the shape of the schedule is shifted for different values of H (shown for H = 0.1, 0.3, 0.5, 0.7, 0.9).

In Figure 6 we observe for FVE dynamics that not only the terminal variance is the same across different choices of H but also the shape of the variance schedule. For FVP dynamics, the shape of the variance schedule shifts with different values of H, approaching a nearly linear schedule for H = 0.1, while H = 0.9 yields a decreasing variance towards the end near t = T.

C Implementation details

We used for all experiments a conditional U-Net [70] architecture and the Adam optimizer [71] with PyTorch's OneCycle learning rate scheduler [72]. On MNIST we trained without exponential moving average (EMA), while on CIFAR10 we conducted experiments with and without EMA.

Setup on MNIST. We used an attention resolution of [4, 2], 3 ResNet blocks and a channel multiplication of [1, 2, 2, 2, 2], and trained with a maximal learning rate of 10^{-4} for 50k iterations and a batch size of 1024. For all MNIST training runs we used one A100 GPU per run, taking approximately 17 hours.

Setup on CIFAR10. We used an attention resolution of [8], 4 ResNet blocks and a channel multiplication of [1, 2, 2, 2, 2]. For the experiments without EMA, we used the same setup as with MNIST, but trained the models in parallel on two A100 GPUs for 300k iterations with an effective batch size of 1024. When training with EMA, we followed the setup of Song et al. [16], using an EMA decay of 0.9999 for all FVP dynamics and an EMA decay of 0.999 for all FVE dynamics. In contrast to Song et al. [16] we used PyTorch's OneCycleLR learning rate scheduler with a maximal learning rate of 2·10^{-4} and trained only for 1 million iterations instead of the 1.3 million iterations in Song et al. [16].

D Additional experiments

In addition to the experiments presented in the main part, we provide additional results here, including a full evaluation of FVE dynamics on MNIST, as well as training on CIFAR10 without EMA.

Evaluation of different Hurst indices of FVE dynamics on MNIST. In Table 4 we provide the evaluation of FVE dynamics. For ease of comparison, we include the quantitative results on FVP dynamics already presented in the main part.
For FVE dynamics both, the super-diffusive regime and the sub-diffusive regime achieve a higher FID as the purely Brownian dynamics for K = 1, 2 throughout all tested Hurst indices and for K = 3 throughout all tested Hurst indices except for H = 0.9 with a higher pixel-wise diversity in the sub-diffusive regime of H < 0.5. Training on CIFAR10 without EMA. As Song et al. [16] point out, the empirically optimal EMA decay rate for VP dynamics differs from that for VE dynamics. Since we do not have the computational resources to optimize the EMA decay rate for every configuration of our framework, we evaluated it in line with Song et al. [16] using a consistent EMA decay rate of 0.999 across all configurations of FVE dynamics and 0.9999 across all configurations of FVP dynamics. Nevertheless, because the optimal EMA decay rate appears to depend on the dynamics of the underlying stochastic MNIST H = 0.9 H = 0.7 H = 0.5 H = 0.1 FID VSp FID VSp FID VSp FID VSp VP (retrained) - - - - 1.44 23.64 - - MA-f BM driven FVP(H, K = 1) 2.86 23.56 3.01 23.78 2.81 23.69 2.92 23.59 FVP(H, K = 2) 1.93 24.00 2.30 23.82 2.92 23.63 2.56 23.82 FVP(H, K = 3) 0.72 24.18 2.67 23.96 3.51 23.78 4.87 23.60 FVP(H, K = 4) 1.22 24.76 0.86 24.39 1.86 24.50 6.25 23.89 FVP(H, K = 5) 2.17 25.15 1.36 24.63 4.89 24.56 9.57 23.70 (a) Conditional image generation on MNIST with FVP. MNIST H = 0.9 H = 0.7 H = 0.5 H = 0.1 FID VSp FID VSp FID VSp FID VSp VE (retrained) - - - - 10.82 24.20 - - MA-f BM driven FVE(H, K = 1) 10.06 24.05 9.95 24.24 10.30 24.22 9.98 24.20 FVE(H, K = 2) 9.82 24.07 9.73 24.13 9.89 24.15 9.42 24.28 FVE(H, K = 3) 11.02 24.53 9.96 24.37 9.74 24.42 10.12 24.44 FVE(H, K = 4) 31.67 22.44 11.37 24.34 11.25 24.54 9.56 24.58 FVE(H, K = 5) 50.42 23.74 22.03 22.09 25.51 23.08 10.39 24.33 (b) Conditional image generation on MNIST with FVE. Table 4: FID and pixel-wise diversity scores of GFDM compared to the original setting of purely Brownian driven dynamics VE and VP. In bold the scores that are better than both purely Brownian driven dynamics VE and VP. The overall best scores within the experiment are boxed in. CIFAR10 H = 0.9 H = 0.5 H = 0.1 FID VSp FID VSp FID VSp VE (retrained) - - 9.38 3.21 - - VP (retrained) - - 17.29 2.24 - - MA-f BM driven FVE(H, K = 1) 9.52 3.22 9.46 3.22 8.93 3.26 FVE(H, K = 2) 8.99 3.26 9.62 3.22 10.23 3.09 FVE(H, K = 3) 16.67 2.175 13.41 2.94 16.54 2.62 FVE(H, K = 4) 40.03 1.41 17.74 2.26 14.49 2.46 Table 5: Quantitative results for FVE dynamics and varying Hurst index on CIFAR10 trained without EMA. In bold the scores that are better than both purely Brownian driven dynamics VE and VP. The overall best scores within the experiment are boxed in. process, we also evaluate our framework without EMA. In Table 5 we observe that the best performing configuration in terms of FID is FVP(H = 0.9, K = 2) with an FID of 8.99 and FVP(H = 0.1, K = 1) with an FID of 8.93 compared to the purely Brownian dynamics VE with an FID of 9.38 and VP with an FID of 17.29. Due to limited computational resources we only compared the best purely Brownian dynamics (VE) with the performance of corresponding augmented FVE dynamic of GFDM. As to be expected, using EMA for training of GFDM results in improved performnace w.r.t. image quality measured by FID obervable in Table 2b. Effect of the number of augmenting processes in the super-diffusive regime. 
Additionally, in Figure 7 we show the FID evolution of the super-diffusive regime for various numbers of augmenting processes, showing a similar pattern in which either K = 2 or K = 3 yields the best performance across different datasets and dynamics.

Figure 7 (FID over the number K of augmenting processes; four panels: MNIST generation with VE, FVE(H = 0.9), FVE(H = 0.7); MNIST generation with VP, FVP(H = 0.9), FVP(H = 0.7); CIFAR10 generation without EMA with VE, FVE(H = 0.9); CIFAR10 generation with EMA with VP, FVP(H = 0.9), FVP(H = 0.7)): Dynamics driven by MA-fBM with super-diffusive Hurst index H = 0.9 and H = 0.7 perform better in all four experiments we conducted than the original purely Brownian driven dynamics, where either K = 2 or K = 3 yields the best performance.

250 NFEs | 500 NFEs | 750 NFEs | 1000 NFEs
VE: 5.65 ± 0.02 | 5.28 ± 0.04 | 5.20 ± 0.02 | 5.19 ± 0.02
VP: 15.12 ± 0.11 | 5.86 ± 0.07 | 4.79 ± 0.11 | 4.79 ± 0.11
MA-fBM driven
FVP(H = 0.7, K = 2): 15.44 ± 0.09 | 4.47 ± 0.03 | 4.13 ± 0.03 | 4.12 ± 0.03
FVP(H = 0.9, K = 2): 12.44 ± 0.08 | 3.71 ± 0.03 | 3.70 ± 0.03 | 3.70 ± 0.03
Table 6: Averaged FID values for different NFEs of the super-diffusive regime compared to purely Brownian dynamics.

E Illustration of generated data

Visual comparison of generated CIFAR10 images sampled with SDE dynamics. (a) Purely Brownian VP sample. (b) Super-diffusive regime of MA-fBM with H = 0.9. Figure 8: (LHS) Images generated with the purely Brownian driven VP dynamics sampled with SDE dynamics, an FID of 4.85 and a pixel-wise diversity of 3.42. (RHS) Images generated with FVP(H = 0.9, K = 2) dynamics sampled with SDE dynamics, an FID of 3.77 and a pixel-wise diversity of 3.60.

Visual comparison of generated CIFAR10 images sampled with PF ODE dynamics. (a) Purely Brownian VP sample. (b) Super-diffusive regime of MA-fBM with H = 0.9. Figure 9: (LHS) Images generated with the purely Brownian driven VP dynamics sampled with the PF ODE, an FID of 5.63 and a pixel-wise diversity of 3.91. (RHS) Images generated with FVP(H = 0.9, K = 2) dynamics sampled with the PF ODE, an FID of 12.36 and a pixel-wise diversity of 4.89.

Visual comparison of generated MNIST images sampled from the SDE. (a) FVP(K = 3, H = 0.9) with FID = 0.72 and VSp = 24.18. (b) FVP(K = 4, H = 0.9) with FID = 1.22 and VSp = 24.76. (c) FVP(K = 5, H = 0.9) with FID = 2.17 and VSp = 25.15. Figure 10: Diversifying effect of the augmenting processes with FVP dynamics on MNIST in the super-diffusive regime with H = 0.9: for K = 5 instead of K = 3 augmenting processes, the pixel-wise diversity VSp increases from 24.18 to 25.15.

F Computational cost of augmenting processes

In this section we compare the computation time of GFDM to the purely Brownian setting of traditional diffusion models. For a given Hurst index H ∈ (0, 1) and a given K, the optimal coefficients ω_1, ..., ω_K are calculated only once before training. For completeness of our quantitative compute-time evaluation, we provide the average computation time in seconds needed to compute ω_1, ..., ω_K on a Tesla V100 GPU with 32 GB RAM. We randomly sample H ∼ U[0.1, 0.9] 1000 times for a given K ∈ {1, 2, 3, 4, 5} and report the average computation time in Table 7.

K = 1 | K = 2 | K = 3 | K = 4 | K = 5
time [s]: 0.0043 | 0.0003 | 0.0003 | 0.0003 | 0.0003
Table 7: Averaged time in seconds needed before training to calculate, for a given K, the optimal approximation coefficients using the approach of Daems et al. [27].
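For orientation, a least-squares variant of this precomputation can be sketched as follows: the weights minimizing ∫_0^T E[(B̂^H_t − B^H_t)²] dt for type II fBM solve a K×K linear system whose entries only involve the known covariances E[Y^k_t Y^l_t] and E[B^H_t Y^k_t]. The quadrature-based sketch below illustrates this idea; it is not the closed-form expression of Daems et al. [27], and the grid, quadrature resolution and horizon T are illustrative assumptions.

```python
import numpy as np
from math import gamma as Gamma

def optimal_weights(gammas, H, T=1.0, n_quad=500):
    """Least-squares sketch: fit sum_k w_k Y^k_t to type II fBM B^H_t in L2([0,T] x Omega).

    Solves A w = b with A_kl = int_0^T E[Y^k_t Y^l_t] dt and b_k = int_0^T E[B^H_t Y^k_t] dt,
    both evaluated by simple quadrature (an illustration, not the closed form of Daems et al.)."""
    gammas = np.asarray(gammas, dtype=float)
    K = len(gammas)
    t = (np.arange(n_quad) + 0.5) * T / n_quad
    dt = T / n_quad

    # time-integrated OU covariances E[Y^k_t Y^l_t] = (1 - exp(-(g_k+g_l) t)) / (g_k + g_l)
    A = np.empty((K, K))
    for k in range(K):
        for l in range(K):
            s = gammas[k] + gammas[l]
            A[k, l] = np.sum((1.0 - np.exp(-s * t)) / s) * dt

    # time-integrated cross covariances E[B^H_t Y^k_t] = int_0^t u^{H-1/2} e^{-g_k u} du / Gamma(H+1/2)
    b = np.empty(K)
    for k in range(K):
        inner = np.empty(n_quad)
        for i, ti in enumerate(t):
            u = (np.arange(n_quad) + 0.5) * ti / n_quad   # midpoint rule over u = t - s
            inner[i] = np.sum(u ** (H - 0.5) * np.exp(-gammas[k] * u)) * (ti / n_quad)
        b[k] = np.sum(inner) * dt / Gamma(H + 0.5)

    return np.linalg.solve(A, b)

# illustrative usage on a geometric grid of mean-reversion speeds
print(optimal_weights(np.geomspace(0.1, 10.0, num=5), H=0.7))
```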
Computation time during training The computational difference during training consists of the computation of the covariance matrix Σt instead of the marginal variance and sampling from a multivariate Gaussian instead of a univariate Gaussian. Note however, that we only need to calculate Σt for D = 1 and also sample only once for a given time t and a given data point. In Table 8 and Table 9 we report the average time of one training step measured in seconds calculated over 1000 training steps on CIFAR10. The underlying conditional U-Net has 58.7mio and EMA is applied. The batch size is 128 and all computation have been carried out on a GPU Tesla V100 with 32 GB RAM. We observe that the computation time depends only minimaly increases when switching from the original model to the augmented system and increases across FVE and FVP dynamics by at most 11/1000 seconds, while the choice of the Hurst index H has no effect on the computation time. Computation time during sampling Since the augmented system depends for fixed K only on the approximating coefficients ω1, ..., ωK it would suffice to report the average sampling time for FVP and FVE dynamics for varying K. Nevertheless, we report for H {0.9, 0.5, 0.1} in Table 10 and Table 11 the average time to sample a batch of 1000 images over 1000 discretization steps of the reverse-time SDE over 10 trials. We observe that the average time in seconds for one sampling step in the reverse dynamics of FVE and FVP dynamics increases for K 4 by at most 2/100 seconds. Only for K = 5 we observe a significant increase of average sampling time of roughly 4/10 seconds. training step time [s] H = 0.9 H = 0.5 H = 0.1 average VE - 0.0478 0.1702 - 0.0478 FVE(H, K = 1) 0.0489 0.0927 0.0483 0.0893 0.0485 0.0922 0.0486 FVE(H, K = 2) 0.0486 0.0944 0.0484 0.0484 0.0485 0.0904 0.0485 FVE(H, K = 3) 0.0487 0.0967 0.0484 0.0892 0.0493 0.0924 0.0488 FVE(H, K = 4) 0.0492 0.0939 0.0484 0.0897 0.0487 0.0952 0, 0488 FVE(H, K = 5) 0.0487 0.0939 0.0488 0.0906 0.0486 0.0933 0.0487 Table 8: Average time in seconds for one training step with FVE dynamics on CIFAR10 with a batch size of 128, a conditional U-Net with 58.7mio parameters and EMA. training step time [s] H = 0.9 H = 0.5 H = 0.1 average VP - 0.0478 0.1688 - 0.0478 K = 1 0.0475 0.0899 0.0475 0.0938 0.0475 0.0900 0.0475 K = 2 0.0476 0.0907 0.0477 0.0917 0.0481 0.0907 0.0478 K = 3 0.0483 0.0937 0.0477 0.0909 0.0477 0.0950 0.0479 K = 4 0.0476 0.0899 0.0479 0.0916 0.0484 0.0937 0.0480 K = 5 0.0484 0.0942 0.0479 0.0925 0.0479 0.0930 0.0481 Table 9: Average time in seconds for one training step with FVP dynamics on CIFAR10 with a batch size of 128, a conditional U-Net with 58.7mio parameters and EMA. sampling step time [s] H = 0.9 H = 0.5 H = 0.1 average VE - 2.3092 0.1462 - 2.3092 K = 1 2.3125 0.1275 2.3269 0.1069 2.3261 0.1093 2.3218 K = 2 2.3095 0.1280 2.3297 0.1077 2.3107 0.1593 2.3166 K = 3 2.3071 0.1213 2.3133 0.1083 2.3063 0.1058 2.3089 K = 4 2.3322 0.1086 2.3323 0.1067 2.3156 0.1122 2.3267 K = 5 2.6515 0.0930 2.6560 0.1067 2.6510 0.0953 2.6528 Table 10: Average time in seconds for one sampling step in the reverse dynamics of FVE to generate data of dimension (3, 32, 32) with a batch size of 1000 using a conditional U-Net with 58.7mio and EMA. 
sampling step time [s]: H = 0.9 | H = 0.5 | H = 0.1 | average
VP: – | 2.3013 ± 0.1511 | – | 2.3013
K = 1: 2.3036 ± 0.1290 | 2.3120 ± 0.1062 | 2.3031 ± 0.1133 | 2.3062
K = 2: 2.3139 ± 0.1166 | 2.3070 ± 0.1102 | 2.3154 ± 0.1555 | 2.3121
K = 3: 2.3134 ± 0.1246 | 2.3168 ± 0.1056 | 2.3309 ± 0.1096 | 2.3204
K = 4: 2.3199 ± 0.1109 | 2.3091 ± 0.1132 | 2.3210 ± 0.1383 | 2.3167
K = 5: 2.6568 ± 0.0984 | 2.6603 ± 0.0978 | 2.6692 ± 0.0975 | 2.6621
Table 11: Average time in seconds for one sampling step in the reverse dynamics of FVP to generate data of dimension (3, 32, 32) with a batch size of 1000, using a conditional U-Net with 58.7 million parameters and EMA.

G Likelihood computation

Given the approximate PF ODE corresponding to the augmented forward process

dz_t = f_θ(z_t, t) dt with f_θ(z_t, t) := F(t)z_t − (1/2) G(t)G(t)^T [S_θ(z_t, t) + ∇_z log q_t(y^{[K]}_t)],   t ∈ [0, T],   (120)

we estimate, according to Song et al. [16], the log-likelihoods of test data z_0 under the learned density p^aug_0 via

log p^aug_0(z_0) = log p^aug_T(z_T) + ∫_0^T ∇ · f_θ(z_t, t) dt.   (121)

Following Song et al. [16], we integrate over [ϵ, T] rather than [0, T], using the same value ϵ = 10^{-3}, which has been empirically shown to yield the best performance when simulating the SDE. For ϵ > 0 and type II fBM we need to adjust the starting value of the augmenting processes from zero to a jointly sampled vector y_ϵ = (y^1_ϵ, ..., y^K_ϵ) ∼ N(0_K, Λ_ϵ) with

(Λ_ϵ)_{k,l} = E[y^k_ϵ y^l_ϵ] = ∫_0^ϵ e^{−(γ_k+γ_l)(ϵ−s)} ds = (1 − e^{−(γ_k+γ_l)ϵ}) / (γ_k + γ_l).   (122)

Using the exact likelihood of y_ϵ and the independence of y_ϵ and x_0 we have

log p^aug_0(z_ϵ) = log p_0(x_0) + log q_ϵ(y_ϵ),   (123)

where p_0 is the learned density of x_0 corresponding to θ. Hence, in total, by eq. (121),

log p_0(x_0) = log p^aug_T(z_T) + ∫_0^T ∇ · f_θ(z_t, t) dt − log q_ϵ(y_ϵ),   (124)

and we define the negative log-likelihood NLLs of test data x_0 under the learned density by

NLLs(x_0, θ) := −log p^aug_0(z_0) + log q_ϵ(y_ϵ).   (125)

H Challenges in the attempt to generalize

In this work, we seek to determine the extent to which the continuous-time framework of score-based generative models can be generalized from an underlying BM to an underlying fBM. For an fBM W^H it is not straightforward to define the forward process

X_t = X_0 + ∫_0^t f(X_s, s) ds + ∫_0^t g(X_s, s) dW^H_s,   t ∈ [0, T],   (126)

driven by fBM, since fBM is neither a Markov process nor a semimartingale [25], and hence Itô calculus may not be applied to define the second integral. However, a definition of the integral w.r.t. fBM is established [25, 73], such that the remaining problem is the derivation of the reverse-time model. Following the second and more intuitive derivation of the reverse-time model for BM from Anderson [32], the conditional backward Kolmogorov equation and the unconditional forward Kolmogorov equation are applied. The starting point of the derivation is to rewrite p(x_t, t, x_s, s) = p(x_s, s|x_t, t) p(x_t, t) with Bayes' theorem and to calculate with the product rule

∂p(x_t, t, x_s, s)/∂t = ∂p(x_s, s|x_t, t)/∂t · p(x_t, t) + ∂p(x_t, t)/∂t · p(x_s, s|x_t, t),   s ≤ t.   (127)

Replacing ∂p(x_t, t)/∂t with the RHS of the unconditional forward Kolmogorov equation and ∂p(x_s, s|x_t, t)/∂t with the RHS of the conditional backward Kolmogorov equation, one derives an equation that only depends on the joint density p(x_t, t, x_s, s). Using Bayes' theorem again leads to a conditional backward Kolmogorov equation for p(x_t, t|x_s, s) that defines the dynamics of the reverse process by the one-to-one correspondence between the conditional backward Kolmogorov equation and the reverse-time SDE [32]. Following these steps for fBM, starting from eq.
(127) and deploying the one-to-one correspondence of f BM and the evolution of its density [73], we could replace p(xt,t) t in (127) by the RHS of i=1 fi(t, x) p(t, x) xi + Ht2H 1 d X i,j=1 gij(x, t) 2p(t, x) xi xj . (128) The missing part is however an analogous to the conditional backward Kolmogorov equation to replace p(xs,s|xt,t) t in eq. (127). The derivation of such an equation is to the best of our knowledge yet unsolved problem and hence the limiting factor in the generalization of continuous-time score-based generative models from an underlying BM to an underlying f BM. I Notational conventions [0, T] Time horizon with terminal time T > 0 X = (Xt)t [0,T ] Stochastic forward process taking values in R D N Data dimension X Vector valued stochastic forward process X = (Xt)t [0,T ] with Xt = (Xt,1, ..., Xt,D) X Reverse time stochastic process with Xt = XT t f Vector valued function f : RD [0, T] RD µ, g Functions µ, g : [0, T] R f Reverse time function with f(x, t) = f(x, T t) µ, g Reverse time functions with µ(t) = µ(T t) and g(t) = g(T t) p0 Data distribution pt Marginal density of (augmented) forward process at t [0, T] B Brownian motion (BM) H Hurst index H (0, 1) W H Type I fractional Brownian motion (f BM) BH Type II fractional Brownian motion (f BM) Y γ = (Y γ t )t [0,T ] Ornstein Uhlenbeck (OU) process with speed of mean reversion γ R K N Number of augmenting processes γ1, ..., γK Geometrically spaced grid ω1, ..., ωK Approximation coefficients ω Optimal approximation coefficients ω = (ω 1, ..., ω K) ω Sum of optimal approximation coefficients ˆBH Markov-approximate fractional Brownian motion (MA-f BM) k k N with 1 k K Y k OU processes Y k = Y γk Y1, ..., YK Augmenting processes with Yk = (Y k, ..., Y k) F, G Vector valued functions F, G : [0, T] RD (K+1) F, G Reverse time vector valued functions with F(t) = F(T t) and G(t) = G(T t) Z By Y1, ..., YK augmented forward process Y[K] Stacked vector of augmenting processes qt Marginal density of Y[K] at t [0, T] θ Weight vector of a neural network Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: We summarize the contribution of our work in the introduction and in the abstract. Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: We include a section on limitations of our work where we discuss the limitations of our results. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. 
The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: We give a complete proof for our own theoretical results and refer to complete proofs for he theoretical results of others. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: We give implementation details in the appendix revealing the used model architecture and training procedures. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. 
If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We release our code upon publication. Together with the implementation details given in the paper our results can be reproduced. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. 
At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: We describe the hyperparameters in our section on implementation details in the appednix. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [No] Justification: Unfortunately we don not have the computational resources to run all experiments a sufficient number of times to provide statistical certainty. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We reveal the hardware specification we use and report the number of hours of training in our section on implementation details. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. 
The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: Our work fully confirms with the Neur IPS Code of Ethics. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [Yes] Justification: We include a broader impact statement at the end of our paper discussing potential misuse. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: We only feature experiments on small scale dataset up to size 3 32 32. The models trained in this work do not pose such risks. Guidelines: The answer NA means that the paper poses no such risks. 
Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: We properly cite all research works we build on and use the code of others only according to its license. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? [Yes] Justification: We release our code upon publication alongside a proper documentation under the MIT license. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: Our work does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. 
According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: our work does not involve crowdsourcing nor research with human subjects. 16. Depending on the country in Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.