# Time Series Diffusion in the Frequency Domain

Jonathan Crabbé\*¹, Nicolas Huynh\*¹, Jan Stanczuk¹, Mihaela van der Schaar¹

**Abstract**

Fourier analysis has been an instrumental tool in the development of signal processing. This leads us to wonder whether this framework could similarly benefit generative modelling. In this paper, we explore this question through the scope of time series diffusion models. More specifically, we analyze whether representing time series in the frequency domain is a useful inductive bias for score-based diffusion models. By starting from the canonical SDE formulation of diffusion in the time domain, we show that a dual diffusion process occurs in the frequency domain with an important nuance: Brownian motions are replaced by what we call mirrored Brownian motions, characterized by mirror symmetries among their components. Building on this insight, we show how to adapt the denoising score matching approach to implement diffusion models in the frequency domain. This results in frequency diffusion models, which we compare to canonical time diffusion models. Our empirical evaluation on real-world datasets, covering various domains like healthcare and finance, shows that frequency diffusion models better capture the training distribution than time diffusion models. We explain this observation by showing that time series from these datasets tend to be more localized in the frequency domain than in the time domain, which makes them easier to model in the former case. All our observations point towards impactful synergies between Fourier analysis and diffusion models.

## 1. Introduction

Deep generative modelling leverages the inductive bias of neural networks to learn complex, high-dimensional probability distributions from real-world datasets. Among other applications, generative models allow for the generation of new synthetic samples consistent with the distribution of the training data, yet distinct from the actual data encountered during training. Recently, this field has seen tremendous progress in various modalities, including image (Karras et al., 2020; Dhariwal & Nichol, 2021), audio (Kong et al., 2021; Donahue et al., 2018), video (Rombach et al., 2022) and text (Dieleman et al., 2022) generation, as well as in addressing inverse problems such as in-painting (Lugmayr et al., 2022) or super-resolution (Saharia et al., 2022). Moreover, deep generative models have started showing significant potential in contributing to the natural sciences, through protein design (Watson et al., 2023), drug development (Xu et al., 2022) and material synthesis (Zeni et al., 2023). However, the application of these models to time series data has not seen the same level of advancement (Gatta et al., 2022). Some notable examples of time series generative models include TimeGAN (Yoon et al., 2019), Fourier Flow (Alaa et al., 2021), and RCGAN (Esteban et al., 2017), yet this area remains less explored compared to other applications. While research on generative modeling for time series has not progressed as quickly as in the static setting, it is an equally important problem.

\*Equal contribution. ¹DAMTP, University of Cambridge. Correspondence to: Jonathan Crabbé, Nicolas Huynh. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).
For example, generative modelling for time series is a promising avenue to reconcile privacy with the development of machine learning models, notably in high-stakes domains such as healthcare, where access to time series data is subject to strong regulations by medical institutions (Miller & Tucker, 2009). Another example is generating time series for data augmentation, in order to increase the dataset size for downstream tasks or to address imbalance problems (Nikolaidis et al., 2019).

**Diffusion Models.** In recent years, diffusion models (Hyvärinen & Dayan, 2005; Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020) have emerged as one of the most promising research avenues in deep generative modelling, achieving state-of-the-art results across many generative modelling tasks (Dhariwal & Nichol, 2021; Saharia et al., 2022). Diffusion models have been applied to time series modelling, achieving promising results (Lin et al., 2023). However, there is substantial room for development and refinement in these early-stage applications.

**Fourier Analysis.** Fourier analysis is a remarkably powerful tool in signal processing, compression and machine learning (Körner, 2022). It has been shown to significantly improve the performance of many state-of-the-art deep learning based time series analysis techniques (Yi et al., 2023), with some recent applications in dataset distillation (Shin et al., 2023). In the context of deep generative models, this is exemplified by Alaa et al. (2021), where the application of normalizing flows to Fourier representations yielded promising results. More recently, Phillips et al. (2022) have studied diffusion on functional spaces, which include Fourier representations of signals, although their paper does not specifically focus on the Fourier basis.

**Motivation.** Despite the widespread success of Fourier analysis, its application to diffusion models for time series remains largely unexplored. This paper seeks to fill this research gap by examining whether spectral representations can improve diffusion models for time series modelling. Our focus is not on achieving state-of-the-art results, but rather on investigating whether representing time series in the frequency domain is a useful inductive bias for diffusion models.

**Our contributions.** (1) *Formalizing frequency diffusion.* In Section 3, we show theoretically how to translate SDE-based diffusion of time series to the frequency domain. We demonstrate that the denoising score matching recipe can be adapted by replacing standard Brownian motions with what we call mirrored Brownian motions, characterized by mirror symmetries in their components. (2) *Comparing time and frequency diffusion.* In Section 4.1, we compare the ability of the time and frequency score models to generate samples that are faithful to the training sets by leveraging sliced Wasserstein distances. Through an extensive analysis on 6 real-world datasets covering fields like healthcare, finance, engineering and climate modelling, we demonstrate that frequency score models consistently outperform time score models. (3) *Understanding why and when frequency diffusion is preferable.* In Section 4.2, we demonstrate that the signals in all 6 datasets concentrate most of their power spectrum on the low frequencies. We hypothesize that this localization in the frequency domain explains the superior performance of frequency diffusion models.
In Section 4.3, we confirm this hypothesis by artificially delocalizing the spectral representation of real signals and showing that the gap between time and frequency diffusion closes.

## 2. Background

**Notations.** We consider multivariate time series of fixed size¹ $x \in \mathbb{R}^{N \times M}$, where $N \in \mathbb{N}$ is the number of time steps and $M \in \mathbb{N}$ is the number of features tracked over time. Often, we will denote by $d_X = N \cdot M$ the total dimension of the time series $x$. We shall use Greek letters for components of the time series. In this way, $x_\tau \in \mathbb{R}^M$ denotes the feature vector at time $\tau \in [N]$ and $x_{\tau,\nu}$ denotes the value of feature $\nu \in [M]$ at time $\tau$. We denote by $[K] := \{0, 1, \dots, K-1\}$ the integers between 0 (included) and $K \in \mathbb{N}$ (excluded). To avoid any confusion between time series steps and diffusion steps, we shall use Latin letters for the diffusion process. In this way, the diffusion process is described by a family of time series $\{x(t) \in \mathbb{R}^{d_X}\}_{t=0}^{T}$ indexed by a continuous diffusion variable $t \in [0, T]$. Thanks to these notations, we unambiguously interpret $x_\tau(t) \in \mathbb{R}^M$ as the feature vector at time step $\tau \in [N]$ and at diffusion step $t \in [0, T]$. We shall detail below how this diffusion process is defined.

¹Padding over time can be used in cases where the datasets contain time series of different lengths.

### 2.1. Score-based Generative Modeling with SDEs

In continuous-time diffusion modelling, one assumes access to samples drawn from an unknown density $p_{\text{data}}$. The objective of generative modelling is to obtain a tractable approximation of this distribution.

**Forward Diffusion.** Score-based generative modeling with stochastic differential equations (SDEs) (Song et al., 2020) typically operates by first constructing a forward diffusion process. In the case of time series, forward continuous diffusion is described by the following SDE, with $t \in [0, T]$:

$$\mathrm{d}x = f(x, t)\,\mathrm{d}t + G(t)\,\mathrm{d}w, \tag{1}$$

where $f : \mathbb{R}^{d_X} \times [0, T] \to \mathbb{R}^{d_X}$ is the drift, $w$ is a standard Brownian motion in $\mathbb{R}^{d_X}$, and $G : [0, T] \to \mathbb{R}^{N \times N}$ is the diffusion matrix. We denote by $p_t$ the probability density of the solution $x(t)$ of Equation (1) at time $t \in [0, T]$. With the slight abuse of notation from Song et al. (2020), we shall abbreviate $p_t(x(t))$ by $p_t(x)$. Together with the SDE, we impose the initial condition $p_0 = p_{\text{data}}$, which corresponds to samples initially drawn from the data density $p_{\text{data}}$. In practice, we consider $f$ and $G$ such that $p_{\text{data}}$ is transported to a final density $p_T$ close to an isotropic Gaussian.

**Reverse Diffusion.** The reverse diffusion process performs the inverse transformation by transporting the isotropic Gaussian density $p_T$ to the data density $p_0 = p_{\text{data}}$. Hence, applying reverse diffusion to samples drawn from the isotropic Gaussian makes it possible to sample from the unknown density $p_{\text{data}}$. It was shown by Anderson (1982) that this reverse diffusion satisfies the following SDE:

$$\mathrm{d}x = b(x, t)\,\mathrm{d}t + G(t)\,\mathrm{d}\hat{w}, \tag{2}$$

where $b(x, t) = f(x, t) - G(t)G(t)^T \nabla_x \log p_t(x)$, $\mathrm{d}t$ is a negative infinitesimal time step, and $\hat{w}$ is a Brownian time increment with time going backwards from $T$ to 0.
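To make the forward dynamics in Equation (1) concrete, here is a minimal sketch of an Euler–Maruyama simulation of the forward SDE. The drift `f` and diffusion coefficient `g` below are illustrative placeholders (an Ornstein–Uhlenbeck-like choice), not the specific schedule used later in the paper.

```python
import numpy as np

def euler_maruyama_forward(x0, f, g, T=1.0, n_steps=1000, rng=None):
    """Simulate dx = f(x, t) dt + g(t) dw (Eq. 1 with G(t) = g(t) I)
    with the Euler-Maruyama scheme."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps
    x = x0.copy()
    for k in range(n_steps):
        t = k * dt
        # Deterministic drift plus a Brownian increment of variance dt.
        x = x + f(x, t) * dt + g(t) * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

# Example with hypothetical drift f(x, t) = -x/2 and constant diffusion g = 1.
x0 = np.random.randn(187, 1)   # one time series with N = 187 steps, M = 1
xT = euler_maruyama_forward(x0, f=lambda x, t: -0.5 * x, g=lambda t: 1.0)
```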
**Denoising Score Matching.** In order to run the reverse diffusion process, one needs access to the score $s(x, t) := \nabla_x \log p_t(x)$. In practice, the ground-truth density $p_t$ is unknown. Denoising score matching circumvents this problem by estimating the ground-truth score with a function $s_\theta$ whose parameters $\theta^*$ minimize the following score matching objective computed from the data samples (Hyvärinen & Dayan, 2005; Song & Ermon, 2019):

$$\theta^* = \arg\min_{\theta \in \Theta} \mathbb{E}_{t, x(0), x(t)} \left[ \mathcal{L}_{\mathrm{SM}}(s_\theta, s_{t|0}, x, t) \right], \tag{3}$$

$$\mathcal{L}_{\mathrm{SM}}(s_\theta, s_{t|0}, x, t) := \left\| s_\theta(x, t) - s_{t|0}(x, t) \right\|^2, \tag{4}$$

where $\|\cdot\|$ denotes the Frobenius norm, $t \sim U(0, T)$, $x(0) \sim p_0(x)$, $x(t) \sim p_{t|0}(x(t) \mid x(0))$ with $p_{t|0}$ denoting the transition kernel from 0 to $t$, and $s_{t|0}(x, t) := \nabla_{x(t)} \log p_{t|0}(x(t) \mid x(0))$. With sufficient model capacity, the parameters $\theta^*$ provide an approximation $s_{\theta^*}$ that is equal to the score $s$ for almost all $x$ and $t$ in the large data limit (Vincent, 2011). Equipped with an approximation of the score $s_{\theta^*} \approx s$, one can generate data by sampling according to the solution of the reverse diffusion process from Equation (2).

### 2.2. Discrete Fourier Transform

**DFT.** Considering a time series $x = (x_0, \dots, x_{N-1}) \in \mathbb{R}^{d_X}$, the Discrete Fourier Transform (DFT), denoted as $\hat{x} = \mathcal{F}[x]$, is defined as

$$\hat{x}_\kappa = N^{-1/2} \sum_{\tau=0}^{N-1} x_\tau \exp\left(-i\, \frac{2\pi \kappa \tau}{N}\right) \tag{5}$$

for all $\kappa \in [N]$. In the signal processing literature, each $\kappa$ corresponds to a harmonic of frequency $\omega_\kappa := \frac{2\pi \kappa}{N}$. For this reason, the DFT $\hat{x}$ is said to represent the time series $x$ in the frequency domain, as opposed to the time domain. We also note that the DFT is complex-valued ($\hat{x} \in \mathbb{C}^{d_X}$).

**Matrix Representation.** We note that the DFT operator $\mathcal{F}$ is linear with respect to the time components $(x_0, \dots, x_{N-1})$. It can therefore be expressed through a left matrix multiplication $\hat{x} = \mathcal{F}[x] = Ux$, where $U \in \mathbb{C}^{N \times N}$ is defined as $[U]_{\kappa\tau} := N^{-1/2} \exp(-i\omega_\kappa \tau)$. It can easily be checked (see Appendix A.1) that the matrix $U$ is unitary: $U^* U = U U^* = I_N$, where $U^*$ is the conjugate transpose of $U$ and $I_N$ is the $N \times N$ identity matrix. This implies that the DFT operator is invertible and that the original time series can be reconstructed from its representation in the frequency domain: $x = \mathcal{F}^{-1}[\hat{x}] := U^* \hat{x}$.

**DFT of a Real-Valued Sequence.** While the DFT $\hat{x}$ is defined in $\mathbb{C}^{d_X}$, some of its components are made redundant by the fact that $x$ is a real-valued time series. One can easily check (see Appendix A.1) that this constraint imposes the following mirror symmetry on the DFT for all $\kappa \in [N]$:

$$\hat{x}_\kappa = \overline{\hat{x}_{N-\kappa}}, \tag{6}$$

where $\bar{z}$ denotes the complex conjugate of $z \in \mathbb{C}$ and we define $\hat{x}_N := \hat{x}_0$ for consistency. Through this symmetry, we observe that the components $\hat{x}_\kappa$ with $\kappa \leq \lfloor N/2 \rfloor$ uniquely define the DFT of a real-valued time series. For this reason, the frequencies beyond the Nyquist frequency $\omega_{\mathrm{Nyq}} := \omega_{\lfloor N/2 \rfloor}$ are redundant with respect to the lower frequencies. In the frequency domain, one then only needs to diffuse $N$ real numbers (per feature) extracted from the DFT; the rest of $\hat{x}$ can be deduced from Equation (6).

**Signal Energy.** An important quantity related to a time series $x$ is its total energy, which simply corresponds to the squared Frobenius norm $\|x\|^2 := \sum_{\tau=0}^{N-1} \sum_{\nu=1}^{M} |x_{\tau,\nu}|^2$, where $|\cdot|$ denotes the modulus of a complex number. Through Parseval's theorem, this energy can be evaluated by computing the same norm for the DFT $\hat{x}$ of $x$: $\|x\|^2 = \|\hat{x}\|^2$. We note that the total energy is obtained by summing over all time steps or frequencies. To characterize how the energy is distributed over the time steps $\tau \in [N]$, we use the energy density defined as the squared Euclidean norm $\|x_\tau\|_2^2 := \sum_{\nu=1}^{M} |x_{\tau,\nu}|^2$. Similarly, the spectral energy density defined as $\|\hat{x}_\kappa\|_2^2$ describes how the signal energy is distributed across the frequencies $\kappa \in [N]$.
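The properties above are easy to verify numerically. The following sketch builds the unitary DFT matrix $U$ and checks unitarity, the mirror symmetry of Equation (6), and Parseval's theorem on a random real-valued time series.

```python
import numpy as np

N, M = 8, 3
x = np.random.randn(N, M)                       # real-valued time series

# Unitary DFT matrix [U]_{kt} = N^{-1/2} exp(-i w_k t) with w_k = 2 pi k / N.
idx = np.arange(N)
U = np.exp(-2j * np.pi * np.outer(idx, idx) / N) / np.sqrt(N)

x_hat = U @ x                                   # DFT, applied feature-wise
assert np.allclose(U.conj().T @ U, np.eye(N))   # unitarity (Proposition A.1)
assert np.allclose(x_hat, np.conj(x_hat[(N - idx) % N]))   # mirror symmetry, Eq. (6)
assert np.isclose(np.sum(np.abs(x) ** 2), np.sum(np.abs(x_hat) ** 2))  # Parseval
```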
**Probability Density in the Complex Space.** Adapting the diffusion formalism to the frequency domain requires defining a probability density for the complex-valued random variable $\hat{x} \in \mathbb{C}^{d_X}$. Following Schreier & Scharf (2010), this density is written in terms of the real and imaginary parts of the signal: $p(\hat{x}) := p(\Re[\hat{x}], \Im[\hat{x}])$. The score function follows a similar decomposition in terms of the signal's real and imaginary parts: $\hat{s}(\hat{x}) := \nabla_{\Re[\hat{x}]} \log p(\hat{x}) + i\, \nabla_{\Im[\hat{x}]} \log p(\hat{x})$. We note that the gradient involved in the definition of the scores is non-trivial when the constraint in Equation (6) is enforced. In Appendix A.2, we establish a formal definition in this setting by interpreting the complex signals fulfilling this constraint as a submanifold of $\mathbb{C}^{d_X}$. This constraint implies that the score components follow an analogous mirror symmetry: $\hat{s}_\kappa = \overline{\hat{s}_{N-\kappa}}$ for all $\kappa \in [N]$. In the following, we shall implicitly rely on this definition.

## 3. Diffusing in the Frequency Domain

In the previous section, we have described how the typical diffusion formalism applies to time series. We have also described how the DFT $\hat{x} = \mathcal{F}[x]$ offers a full description of the time series $x$ in the frequency domain. The first step is to define how time-based diffusion translates to the frequency domain. Note that this is non-trivial, as the DFTs are complex-valued signals $\hat{x} \in \mathbb{C}^{d_X}$. To solve this, we shall assume that the stochastic process in the time domain $\{x(t)\}_{t=0}^{T}$, written compactly as $x$, follows the diffusion process described in Equation (1). By leveraging the matrix formulation of the DFT $\hat{x} = Ux$, we will now derive diffusion SDEs in the frequency domain.

### 3.1. Diffusion SDEs

In order to derive the diffusion SDEs in the frequency domain, we shall simply apply the DFT operator to the forward diffusion SDE in the time domain from Equation (1). We note that this equation contains a standard Brownian motion $w$. In the lemma below, we describe the DFT of $w$ and show that it contains two copies of a non-standard Brownian motion related by the constraint from Equation (6). We refer to this as a mirrored Brownian motion.

**Lemma 3.1** (DFT of standard Brownian motion). *Let $w$ be a standard Brownian motion on $\mathbb{R}^{d_X}$ with $d_X = N \cdot M$, where $N \in \mathbb{N}^+$ is the number of time series steps and $M \in \mathbb{N}^+$ is the number of features tracked over time. Then $v = Uw$ is a continuous stochastic process endowed with:*

*(1) Mirror Symmetry. For all $\kappa \in [N]$, $v_\kappa = \overline{v_{N-\kappa}}$.*

*(2) Real Brownian Motion. $v_0$ is a (real) standard Brownian motion on $\mathbb{R}^M$.*

*(3) Complex Brownian Motions. For all $\kappa$ with $1 \leq \kappa \leq \lfloor N/2 \rfloor$, we can write $v_\kappa = (w^1_\kappa + i\, w^2_\kappa)/\sqrt{2}$, where $w^1_\kappa$ and $w^2_\kappa$ are independent standard Brownian motions on $\mathbb{R}^M$, except when $N$ is even and $\kappa = N/2$, where $v_{N/2}$ is a real standard Brownian motion on $\mathbb{R}^M$.*

*(4) Independence. The stochastic processes $\{v_\kappa\}_{\kappa=0}^{\lfloor N/2 \rfloor}$ are mutually independent.*

*We call any stochastic process satisfying the above constraints a mirrored Brownian motion on $\mathbb{C}^{d_X}$.*

*Proof.* The proof is given in Appendix A.3.

**Remark 3.2.** Note that $v$ is not strictly speaking a Brownian motion, since it contains duplicate components due to the mirror symmetry. However, our theoretical analysis in Appendix A demonstrates that we can treat it as such by restricting to a subset of non-redundant components.
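The construction of Lemma 3.1 translates directly into a sampler. The sketch below draws one increment of a mirrored Brownian motion by assembling the components described in the lemma; equivalently, one could simply compute $U\,\mathrm{d}w$ for a standard Brownian increment $\mathrm{d}w$.

```python
import numpy as np

def mirrored_brownian_increment(N, M, dt, rng=None):
    """Sample an increment dv of a mirrored Brownian motion on C^{N x M},
    following the component structure of Lemma 3.1."""
    rng = np.random.default_rng() if rng is None else rng
    dv = np.zeros((N, M), dtype=complex)
    std = np.sqrt(dt)
    dv[0] = std * rng.standard_normal(M)          # real Brownian component
    for k in range(1, N // 2 + 1):
        if N % 2 == 0 and k == N // 2:
            dv[k] = std * rng.standard_normal(M)  # real component at Nyquist
        else:
            re, im = rng.standard_normal(M), rng.standard_normal(M)
            dv[k] = std * (re + 1j * im) / np.sqrt(2)   # complex component
            dv[N - k] = np.conj(dv[k])            # mirror symmetry, Eq. (6)
    return dv

dv = mirrored_brownian_increment(N=187, M=1, dt=1e-3)
```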
We now leverage Lemma 3.1 to show that $\hat{x}$ can be described by diffusion SDEs in the frequency domain which involve mirrored Brownian motions.

**Proposition 3.3** (Diffusion process in frequency domain). *Let us assume that $x$ is a diffusion process that is a solution of Equation (1), with $G(t) = g(t) \cdot I_N$. Then $\hat{x} = \mathcal{F}[x]$ is a solution to the forward diffusion process defined by:*

$$\mathrm{d}\hat{x} = \hat{f}(\hat{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\hat{v}, \tag{7}$$

*where $\hat{f}(\hat{x}, t) = U f(U^* \hat{x}, t)$ and $\hat{v}$ is a mirrored Brownian motion on $\mathbb{C}^{d_X}$. The associated reverse diffusion process is defined by:*

$$\mathrm{d}\hat{x} = \hat{b}(\hat{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\hat{v}, \tag{8}$$

*where $\hat{b}(\hat{x}, t) = \hat{f}(\hat{x}, t) - g^2(t)\,\Lambda^2\, \hat{s}(\hat{x}, t)$, $\Lambda \in \mathbb{R}^{N \times N}$ is a diagonal matrix such that*

$$[\Lambda]_{\kappa\kappa} = \begin{cases} 1 & \text{if } \kappa = 0, \text{ or } N \text{ is even and } \kappa = N/2, \\ \frac{1}{\sqrt{2}} & \text{otherwise}, \end{cases}$$

*$\mathrm{d}t$ is a negative infinitesimal time step, and $\hat{v}$ is a mirrored Brownian motion on $\mathbb{C}^{d_X}$ with time going from $T$ to 0.*

*Proof.* The proof is given in Appendix A.3.

Proposition 3.3 gives us a recipe to implement diffusion in the frequency domain. It guarantees that the formalism introduced by Song et al. (2020) extends to this setting with one important difference: the Brownian motion must be replaced by a mirrored Brownian motion. Ignoring this prescription by taking $\hat{v}$ to be a Brownian motion on $\mathbb{C}^{d_X}$ could lead to unintended consequences, such as generating complex-valued time series $x \in \mathbb{C}^{d_X} \setminus \mathbb{R}^{d_X}$.

### 3.2. Denoising Score Matching

The reverse diffusion process given in Equation (8) provides an explicit way to sample time series in the frequency domain, provided we can compute $\hat{b}(\hat{x}, t)$, which involves the unknown score $\hat{s}$. As in the time domain, and motivated by Equation (8), we build an approximation of the score with a function $\hat{s}_{\hat{\theta}}$, whose parameters $\hat{\theta}^*$ minimize the score matching objective:

$$\hat{\theta}^* = \arg\min_{\hat{\theta} \in \Theta} \mathbb{E}_{t, \hat{x}(0), \hat{x}(t)} \left[ \mathcal{L}_{\mathrm{SM}}\left(\hat{s}_{\hat{\theta}}, \Lambda^2\, \hat{s}_{t|0}, \hat{x}, t\right) \right] \tag{9}$$

with $t \sim U(0, T)$, $\hat{x}(0) \sim p_0(\hat{x})$, $\hat{x}(t) \sim p_{t|0}(\hat{x}(t) \mid \hat{x}(0))$, and $\Lambda$ the diagonal matrix defined in Proposition 3.3. In practice, this objective is evaluated by first obtaining frequency representations of time series, and then sampling from $p_{t|0}$ using Equation (7). Having trained $\hat{s}_{\hat{\theta}}$, the backward process together with $\hat{s}_{\hat{\theta}^*}$ permits drawing samples from $p_0$. It then suffices to apply the inverse DFT $\mathcal{F}^{-1}$ to map the resulting complex-valued signals back into the time domain.

One important question remains at this stage. How does training a score in the frequency domain allow us to generate DFTs of time series sampled from $p_{\text{data}}$? In other words, how does minimizing the score matching objective in Equation (9) imply that $p_0 \approx p_{\text{data}}$? To answer this question, a key observation is that we can associate an auxiliary score $s_{\hat{\theta}}$ in the time domain to the score $\hat{s}_{\hat{\theta}}$ by applying an inverse DFT $\mathcal{F}^{-1}$. Below, we show that minimizing the score matching loss from Equation (9) for the score $\hat{s}_{\hat{\theta}}$ is equivalent to minimizing the score matching loss from Equation (3) for the auxiliary score $s_{\hat{\theta}}$. This important observation connects the reverse diffusion process in the frequency domain described by Equation (8) with a reverse diffusion process in the time domain following Equation (2).

**Proposition 3.4** (Score matching equivalence). *Consider a score $\hat{s}_{\hat{\theta}} : \mathbb{C}^{d_X} \times [0, T] \to \mathbb{C}^{d_X}$ defined in the frequency domain and satisfying the mirror symmetry $[\hat{s}_{\hat{\theta}}]_\kappa = \overline{[\hat{s}_{\hat{\theta}}]_{N-\kappa}}$ for all $\kappa \in [N]$. Let us define an auxiliary score $s_{\hat{\theta}} : \mathbb{R}^{d_X} \times [0, T] \to \mathbb{R}^{d_X}$ as $(x, t) \mapsto s_{\hat{\theta}}(x, t) = U^* \hat{s}_{\hat{\theta}}(Ux, t)$ in the time domain. The score matching loss in the frequency domain is equivalent to the score matching loss for the auxiliary score in the time domain:*

$$\mathcal{L}_{\mathrm{SM}}\left(\hat{s}_{\hat{\theta}}, \Lambda^2\, \hat{s}_{t|0}, \hat{x}, t\right) = \mathcal{L}_{\mathrm{SM}}\left(s_{\hat{\theta}}, s_{t|0}, x, t\right), \tag{10}$$

*where $\hat{s}_{t|0}(\hat{x}, t) = \nabla_{\hat{x}(t)} \log p_{t|0}(\hat{x}(t) \mid \hat{x}(0))$, $s_{t|0}(x, t) = \nabla_{x(t)} \log p_{t|0}(x(t) \mid x(0))$, and $\Lambda$ is the diagonal matrix in Proposition 3.3.*

*Proof.* The proof is given in Appendix A.4.
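For a Gaussian transition kernel (as with the VP-SDE used in our experiments), the objective of Equation (9) admits a simple Monte Carlo sketch. By Proposition 3.4, the target $\Lambda^2 \hat{s}_{t|0}$ then reduces to the DFT of the usual conditional score $-(x(t) - \text{mean})/\text{var}$. The `score_model` argument below is a placeholder for the network $\hat{s}_{\hat{\theta}}$, and the kernel parameters `mean_coef` and `var` are assumed to be given.

```python
import numpy as np

def freq_score_matching_loss(score_model, x0_hat, t, mean_coef, var, rng=None):
    """One-sample sketch of the denoising objective in Eq. (9), assuming a
    Gaussian transition kernel p_{t|0} with mean mean_coef * x0 and
    coordinate-wise variance var."""
    rng = np.random.default_rng() if rng is None else rng
    N, M = x0_hat.shape
    idx = np.arange(N)
    U = np.exp(-2j * np.pi * np.outer(idx, idx) / N) / np.sqrt(N)
    # The DFT of white Gaussian noise is one draw of mirrored Gaussian noise.
    noise_hat = U @ rng.standard_normal((N, M))
    x_t_hat = mean_coef * x0_hat + np.sqrt(var) * noise_hat  # sample p_{t|0}
    # Target Lambda^2 s_{t|0}: the conditional score mapped through the DFT.
    target = -noise_hat / np.sqrt(var)
    residual = score_model(x_t_hat, t) - target
    return np.sum(np.abs(residual) ** 2)          # squared Frobenius norm
```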
Propositions 3.3 and 3.4 provide an explicit way to translate diffusion in the time domain to diffusion in the frequency domain. We note that attempting to solve Equations (3) and (9) in the finite-sample regime yields a local minimum solution in practice. Hence, there is no guarantee that training a score model in the frequency domain will converge to an auxiliary score $s_{\hat{\theta}} = s_\theta$ identical to the one obtained by training the score model in the time domain. In particular, having a score function defined in the frequency domain is an important inductive bias, which is likely to alter the training dynamics. Through our experiments in the next section, we study the effect of this inductive bias on the resulting diffusion processes.

**Take-away 1.** Diffusion in the frequency domain can be implemented by replacing the standard Brownian motions with mirrored Brownian motions in the diffusion SDEs. The associated score can be optimized by minimizing a denoising score matching loss.

## 4. Comparing Time and Frequency Diffusion

In this section, we empirically analyze the effect of performing time series diffusion in the frequency domain.² In Section 4.1, we show that frequency diffusion models better capture their training distribution than time models. In Section 4.2, we argue that these differences in performance can be attributed to the localization of the time series in the frequency domain. Finally, in Section 4.3, we artificially create settings where time models outperform frequency models in order to confirm this hypothesis. The code necessary to reproduce the results, along with detailed instructions, is provided as supplementary material.

²The code is publicly available at the following links: https://github.com/JonathanCrabbe/FourierDiffusion and https://github.com/vanderschaarlab/FourierDiffusion

**Data.** To illustrate the breadth of time series applications, we work with 6 different datasets described in Table 1. We observe that these datasets cover many use-cases (healthcare, finance, engineering and climate modelling), sample sizes, sequence lengths $N$ and numbers of features tracked over time $M$. All the datasets are standardized before being fed to models. We also split the datasets into a training set $\mathcal{D}_{\text{train}}$ and a validation set $\mathcal{D}_{\text{val}}$. We provide more details on the datasets in Appendix B.1.

**Models.** For each dataset, we parametrize the time score model $s_\theta$ and the frequency score model $\hat{s}_{\hat{\theta}}$ as transformer encoders with 10 attention and MLP layers, each with 12 heads and dimension $d_{\text{model}} = 72$. Both models have learnable positional encoding as well as diffusion time $t$ encoding through random Fourier features composed with a learnable dense layer. This results in models with 3.2M parameters. We use a VP-SDE with linear noise scheduling and $\beta_{\min} = 0.1$ and $\beta_{\max} = 20$, as in Song et al. (2020). The score models are trained with the denoising score matching loss, as defined in Section 3. All the models are trained for 200 epochs with batch size 64, the AdamW optimizer and cosine learning rate scheduling (20 warmup epochs, $\mathrm{lr}_{\max} = 10^{-3}$). The selected model is the one achieving the lowest validation loss.
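As a point of reference, the transition kernel of the VP-SDE with a linear schedule has a closed form (Song et al., 2020); the following sketch computes the mean coefficient and variance that would feed a denoising loss like the one in Section 3.2. The function name is ours.

```python
import numpy as np

def vp_sde_kernel(t, beta_min=0.1, beta_max=20.0):
    """Mean coefficient and variance of the VP-SDE transition kernel p_{t|0}
    for the linear schedule beta(t) = beta_min + t * (beta_max - beta_min),
    with t in [0, 1] (Song et al., 2020)."""
    log_coef = -0.25 * t**2 * (beta_max - beta_min) - 0.5 * t * beta_min
    mean_coef = np.exp(log_coef)        # exp(-1/2 int_0^t beta(s) ds)
    var = 1.0 - np.exp(2.0 * log_coef)  # 1 - exp(-int_0^t beta(s) ds)
    return mean_coef, var

mean_coef, var = vp_sde_kernel(t=0.5)   # e.g. halfway through diffusion
```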
**Time and Frequency.** Crucially, the only difference between the time and the frequency diffusion models is the domain in which their input time series are represented. Since all datasets are expressed in the time domain, they can directly be fed to the time diffusion model $s_\theta$. When it comes to the frequency diffusion model $\hat{s}_{\hat{\theta}}$, the data is first mapped to the frequency domain by applying a DFT $\mathcal{F}$ to each time series. In the time domain, the forward and reverse diffusion obey the SDEs in Equations (1) and (2). In the frequency domain, the forward and reverse diffusion obey the modified SDEs in Equations (7) and (8). The denoised samples $\hat{x}(0)$ obtained in the frequency domain can be pulled back to the time domain by applying an inverse DFT $\hat{x}(0) \mapsto \mathcal{F}^{-1}[\hat{x}(0)]$. In the following, we shall denote by $\mathcal{S}_{\text{time}} \subset \mathbb{R}^{d_X}$ and $\mathcal{S}_{\text{freq}} \subset \mathbb{R}^{d_X}$ the time representations of the samples generated by the time and frequency models. Similarly, we shall denote by $\hat{\mathcal{S}}_{\text{time}} := \mathcal{F}[\mathcal{S}_{\text{time}}]$ and $\hat{\mathcal{S}}_{\text{freq}} := \mathcal{F}[\mathcal{S}_{\text{freq}}]$ the frequency representations of these time series. We sample $|\mathcal{S}_{\text{time}}| = |\mathcal{S}_{\text{freq}}| = 10{,}000$ samples for each model by applying $T = 1{,}000$ diffusion time steps.

*Table 1. Various datasets used in our experiments and some of their properties.*

| Dataset | Reference | Field | # Samples | # Steps $N$ | # Features $M$ |
|---|---|---|---|---|---|
| ECG | (Kachuee et al., 2018) | Healthcare | 87,553 | 187 | 1 |
| MIMIC-III | (Johnson et al., 2016) | Healthcare | 19,155 | 24 | 40 |
| NASDAQ-2019 | (Onyshchak, 2020) | Finance | 4,827 | 252 | 5 |
| NASA-Charge | (Saha & Goebel, 2007) | Engineering | 2,396 | 251 | 4 |
| NASA-Discharge | (Saha & Goebel, 2007) | Engineering | 1,755 | 134 | 5 |
| US-Droughts | (Minixhofer, 2021) | Climate | 2,797 | 365 | 13 |

### 4.1. Which Samples Better Capture the Distribution?

**Methodology.** We are interested in the faithfulness of the samples generated by the time and frequency diffusion models. Ideally, this faithfulness should be evaluated by computing the Wasserstein distance between the true distribution and the distribution spanned by our diffusion models. However, the exact computation of the Wasserstein distance is intractable in input spaces of large dimension $d_X \gg 1$. In the case of images, Heusel et al. (2017) mitigate this problem by mapping all the images to a lower-dimensional representation space (the activations of the penultimate layer of an Inception-V3 model). This crucially relies on the fact that Inception-V3 provides high-quality representations of images. Unfortunately, such a general representation of time series does not exist in practice. Hence, our evaluation needs to be performed in the input space $\mathbb{R}^{d_X}$ directly. For this reason, we shall rely on the sliced Wasserstein distance introduced by Bonneel et al. (2015), which has similar properties to the Wasserstein distance and can be efficiently estimated in high-dimensional spaces. With a slight abuse of notation, we shall denote by $SW(\mathcal{S}_1, \mathcal{S}_2)$ the sliced Wasserstein distance between the empirical distributions corresponding to the samples $\mathcal{S}_1$ and $\mathcal{S}_2$. Its detailed definition is provided in Appendix B.2.

*Table 2. Sliced Wasserstein distances (↓) evaluated in the time domain ($SW(\mathcal{D}_{\text{train}}, \mathcal{S}_{\text{time}})$, $SW(\mathcal{D}_{\text{train}}, \mathcal{S}_{\text{freq}})$) and in the frequency domain ($SW(\hat{\mathcal{D}}_{\text{train}}, \hat{\mathcal{S}}_{\text{time}})$, $SW(\hat{\mathcal{D}}_{\text{train}}, \hat{\mathcal{S}}_{\text{freq}})$) on the various datasets. For each distance, we report its mean ± 2 standard errors.*

| Dataset | Metric Domain | Frequency Diffusion | Time Diffusion |
|---|---|---|---|
| ECG | Frequency | 0.012 ± 0.000 | 0.020 ± 0.000 |
| | Time | 0.015 ± 0.000 | 0.021 ± 0.000 |
| MIMIC-III | Frequency | 0.144 ± 0.004 | 0.206 ± 0.006 |
| | Time | 0.152 ± 0.004 | 0.211 ± 0.006 |
| NASDAQ-2019 | Frequency | 45.812 ± 2.096 | 64.056 ± 3.040 |
| | Time | 43.602 ± 2.044 | 60.512 ± 2.960 |
| NASA-Charge | Frequency | 0.211 ± 0.008 | 0.270 ± 0.006 |
| | Time | 0.229 ± 0.008 | 0.316 ± 0.008 |
| NASA-Discharge | Frequency | 1.999 ± 0.084 | 2.974 ± 0.134 |
| | Time | 2.028 ± 0.082 | 2.942 ± 0.134 |
| US-Droughts | Frequency | 0.633 ± 0.018 | 2.849 ± 0.090 |
| | Time | 0.738 ± 0.020 | 2.913 ± 0.092 |
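A minimal sketch of the sliced Wasserstein estimate used for this evaluation is given below (its formal definition and the Monte Carlo estimator are in Appendix B.2). Note that `scipy.stats.wasserstein_distance` computes the order-1 distance, whereas the paper uses $p = 2$ with 10,000 projections; the sketch is only meant to illustrate the projection mechanism.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def sliced_wasserstein(s1, s2, n_proj=1000, rng=None):
    """Monte Carlo estimate of the sliced Wasserstein distance between two
    sample sets of shape (n_samples, d), via random 1D projections."""
    rng = np.random.default_rng() if rng is None else rng
    d = s1.shape[1]
    total = 0.0
    for _ in range(n_proj):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)          # uniform direction on the unit sphere
        # 1D Wasserstein distance between the projected samples.
        total += wasserstein_distance(s1 @ u, s2 @ u)
    return total / n_proj
```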
**Analysis.** The sliced Wasserstein distances are reported in Table 2. Interestingly, we observe that the frequency diffusion models consistently outperform the time diffusion models for all datasets, both in the time and the frequency domain. In Appendix B.4, we show that using marginal Wasserstein distances instead of sliced Wasserstein distances leads to essentially the same conclusion. In order to verify that our observations are not specific to transformer backbones, we reproduce the same experiment with LSTM backbones in Appendix D and obtain similar results showing the superior performance of frequency diffusion models. While this observation already suggests the benefits of diffusing in the frequency domain rather than in the time domain, it is important to understand how these performance gains emerge. For this, we need to gain a better understanding of the training distributions, which is the object of the next section.

### 4.2. How to Explain the Differences?

**Signal energy analysis.** Before formulating a hypothesis as to why the frequency models are better, it is helpful to gain a better understanding of the training distribution $\mathcal{D}_{\text{train}}$. To that end, we leverage the energy and spectral densities related to the time series, described in Section 2. These densities are represented in Figure 1, where we have averaged the densities over all time series in $\mathcal{D}_{\text{train}}$.

*Figure 1. Localization of time series in the time and frequency domains: (a) energy density $\|x_\tau\|_2^2 / \|x\|^2$ against time $\tau/N$; (b) spectral density $\|\hat{x}_\kappa\|_2^2 / \|x\|^2$ against frequency $\omega_\kappa / \omega_{\mathrm{Nyq}}$, for each dataset. Time series are more localized in the frequency domain.*

By analyzing Figure 1b, we make a key observation: for all datasets, most of the time series energy in the frequency domain is localized on the frequency $\omega_0 = 0$, also known as the fundamental frequency. Furthermore, we observe that the energy quickly decays as the frequency increases. This observation suggests that the low frequencies capture most of the time series information. This is in stark contrast with the energy distribution over time in Figure 1a, which is more uniform over time for all the datasets. This asymmetry between the time series' spectral localization and their temporal delocalization is a promising candidate to explain the superior performance of frequency diffusion. We now make this observation more quantitative.

*Figure 2. By evaluating our delocalization metrics in the time domain ($\Delta_{\text{time}}$) and the frequency domain ($\Delta_{\text{freq}}$), we quantitatively confirm that all the datasets are significantly more localized in the frequency domain. All the metrics are averaged over $\mathcal{D}_{\text{train}}$; their mean is reported with a 95% confidence interval.*

**Quantitative Signal Localization.** In order to measure how delocalized a time series $x \in \mathbb{R}^{d_X}$ is in the time domain, we use the delocalization metric introduced by Nam (2013):

$$\Delta_{\text{time}}(x) := \min_{\tau' \in [N]} \frac{1}{\|x\|^2} \sum_{\tau \in [N]} d_{\text{cyc}}(\tau, \tau')\, \|x_\tau\|_2^2, \tag{11}$$

where $d_{\text{cyc}} : [N]^2 \to [N]$ is the cyclic distance defined as $d_{\text{cyc}}(\tau, \tau') = \min(|\tau - \tau'|, N - |\tau - \tau'|)$, $\|\cdot\|$ denotes the Frobenius norm and $\|\cdot\|_2$ the Euclidean norm. Similarly, we can compute the delocalization in the frequency domain $\Delta_{\text{freq}}$ by replacing $x \mapsto \hat{x}$ and $\tau, \tau' \mapsto \kappa, \kappa'$ in Equation (11).
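The delocalization metric of Equation (11) can be computed in a few lines; applying it to the DFT of $x$ gives the frequency-domain variant $\Delta_{\text{freq}}$. A sketch:

```python
import numpy as np

def delocalization(x):
    """Delocalization metric of Eq. (11) for a time series x of shape (N, M).
    Apply it to the (orthonormal) DFT of x for the frequency-domain variant."""
    N = x.shape[0]
    energy = np.sum(np.abs(x) ** 2, axis=1)   # ||x_tau||_2^2 for each tau
    energy = energy / energy.sum()            # normalize by ||x||^2
    idx = np.arange(N)
    diff = np.abs(idx[:, None] - idx[None, :])
    d_cyc = np.minimum(diff, N - diff)        # cyclic distance d_cyc(tau, tau')
    # Energy-weighted cyclic spread, minimized over the candidate center tau'.
    return float(np.min(d_cyc @ energy))

x = np.random.randn(187, 1)
delta_time = delocalization(x)
delta_freq = delocalization(np.fft.fft(x, axis=0, norm="ortho"))
```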
We report these delocalization metrics for each dataset in Figure 2. This quantitative analysis confirms our previous observations: the time series in all the datasets appear significantly more localized in the frequency domain. Interestingly, we never observe a time series that is localized in both the frequency and the time domain simultaneously. This is in agreement with the uncertainty principle from Nam (2013), which echoes the foundational work of Heisenberg (1927). This verification is made in Appendix B.4.

**A Localization Explanation.** Based on the previous observation, we postulate that the higher localization of the time series in the frequency domain contributes to the superior performance of frequency diffusion models. While we will test this hypothesis in the next section, it is useful to discuss the intuition behind it first. Due to the frequency localization, the frequency score model is presented with a representation of the time series where most of the relevant information is aligned with a few components of the model's input (i.e. the lower frequencies, especially the fundamental). This is in contrast with the time model, which is presented with an input where all the components matter equally. It follows that the frequency model does not need to learn a good distribution over all frequencies in order to generate samples of high quality, provided the lower frequency distributions are properly learned. The time model, on the other hand, needs to model all the time steps accurately in order to generate high-quality samples. If this intuition is correct, it would imply that delocalizing the signal in the frequency domain should reduce the performance gap between time and frequency models. This is analyzed in the next section.

*Figure 3. Sliced Wasserstein distances of time and frequency models (blue) and localization metrics in the time and frequency domains (red) when smoothing the spectral representations of the time series with Gaussian kernels of variable width: (a) time domain; (b) frequency domain. Increasing the kernel width removes the localization in the frequency domain and increases the localization in the time domain. Coincidentally, the time diffusion model becomes better than the frequency diffusion model.*

### 4.3. Should we Always Diffuse in the Frequency Domain?

**Removing Spectral Localization.** We would like to assess whether the localization of these signals in the frequency domain contributes to the superior performance of frequency diffusion models over time diffusion models. To test this hypothesis, we intervene on a given dataset with the objective of varying the localization of the frequency and time representations. With this in mind, it is useful to start from a dataset where the imbalance between time and frequency localization is not severe. By looking at Figure 2, it is clear that the ECG dataset is the best candidate. In order to gradually remove the spectral localization from the ECG dataset $\mathcal{D}_{\text{ECG}}$, we convolve the time series $x$ in the frequency domain with Gaussians of increasing kernel width $\sigma \in \mathbb{R}^+$ and define $x^\sigma := \mathcal{F}^{-1}[\mathcal{F}[x] * g^\sigma]$, where $*$ denotes the convolution between two signals and $g^\sigma_\kappa := Z^{-1} \exp[-\kappa^2 / (2\sigma^2)]$ for all $\kappa \in [N]$ is a Gaussian kernel with normalization $Z = \sum_{\kappa=1}^{N} \exp[-\kappa^2 / (2\sigma^2)]$.
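A sketch of this spectral smoothing operation is given below. The exact indexing and normalization of the kernel are our reading of the definition above; since the smoothed spectrum may slightly break the mirror symmetry of Equation (6), we keep the real part after the inverse DFT.

```python
import numpy as np

def circular_convolve(a, g):
    """Circular convolution along axis 0: (a * g)[k] = sum_j a[j] g[(k-j) % N]."""
    N = len(g)
    out = np.zeros_like(a)
    for k in range(N):
        out[k] = np.sum(a * g[(k - np.arange(N)) % N][:, None], axis=0)
    return out

def smooth_spectrum(x, sigma):
    """Delocalize a time series x of shape (N, M) in the frequency domain by
    convolving its DFT with a Gaussian kernel of width sigma (Section 4.3)."""
    if sigma == 0:
        return x                              # sigma = 0: original dataset
    N = x.shape[0]
    kappa = np.arange(1, N + 1)
    g = np.exp(-kappa**2 / (2 * sigma**2))
    g = g / g.sum()                           # normalization constant Z
    x_hat = np.fft.fft(x, axis=0, norm="ortho")
    x_sigma = np.fft.ifft(circular_convolve(x_hat, g), axis=0, norm="ortho")
    return np.real(x_sigma)                   # discard small imaginary residue
```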
This results in a family of corrupted datasets $\mathcal{D}^\sigma_{\text{ECG}}$, where the localization in the frequency domain decreases as $\sigma$ increases. This is indeed what we observe in Figure 3 with the red curves. Coincidentally, the delocalization decreases in the time domain, in agreement with the uncertainty principle. The two curves cross at $\sigma \approx 2$, beyond which the time series are more localized in the time domain. Let us now analyze how the model performances evolve with different values of $\sigma$.

**Analysis.** We train time and frequency diffusion models on the datasets $\mathcal{D}^\sigma_{\text{ECG}}$ for $\sigma \in \{0, 5, 7, 10, 20\}$, where $\sigma = 0$ corresponds to the original ECG dataset: $\mathcal{D}^{\sigma=0}_{\text{ECG}} = \mathcal{D}_{\text{ECG}}$. As in Section 4.1, we measure the quality of $10{,}000$ samples $\mathcal{S}$ produced by these models with the sliced Wasserstein distances $SW(\mathcal{D}^\sigma_{\text{ECG}}, \mathcal{S})$ in Figure 3a and $SW(\hat{\mathcal{D}}^\sigma_{\text{ECG}}, \hat{\mathcal{S}})$ in Figure 3b. By inspecting the blue curves in these figures, we notice that decreasing the frequency localization closes the gap between the time and the frequency models. Moreover, the two curves cross at $\sigma \approx 7$, beyond which the time model outperforms the frequency model. This confirms our hypothesis that the localization (at least) partially explains the better performance of the frequency diffusion model. Similar results can be obtained with other datasets, as discussed in Appendix B.4.

**A Cautionary Remark.** Before concluding, we would like to add a bit of nuance to our analysis. While the above results suggest that localization is an important factor in explaining the superior performance of frequency diffusion models, we do not claim that it is the only explanation. There are essentially two observations suggesting that it only partially explains the results from Section 4.1. (1) In Figure 3, the blue curves and the red curves do not cross for the same value of $\sigma$ (the red curves cross at $\sigma \approx 2$ and the blue curves at $\sigma \approx 7$). Hence, the time model requires time series that are substantially more localized in the time domain in order to outperform the frequency model. (2) In the limit $\sigma \to +\infty$, the time series become constant in the frequency domain. While this corresponds to a minimal localization in the frequency domain, this should also be very easy for a frequency diffusion model to learn. Hence, decreasing the localization of the time series in a given domain does not necessarily imply that the resulting time series are more difficult to model in that domain.

**Take-away 2.** For all datasets in this paper, diffusing in the frequency domain yields better performance than in the time domain. A promising explanation for this is that time series from these datasets are substantially more localized in their spectral representation than in their temporal representation.

## 5. Discussion

In this work, we have improved the understanding of how diffusion models should be used with time series. We constructed a theoretical framework that extends the score-based SDE formulation of diffusion to complex-valued time series representations in the frequency domain. We then demonstrated empirically that implementing time series diffusion in the frequency domain consistently outperforms the canonical diffusion in the time domain. Finally, we showed that the spectral localization of the time series plays a significant role in explaining this phenomenon. There are a number of interesting ways to extend our work.
**Time Localized Datasets.** While all 6 datasets we have studied appear substantially more localized in their spectral representation, we do not claim that this is a universal property of real-world time series. In particular, it would be of interest to survey a large number of time series datasets to determine the extent to which this phenomenon occurs.

**Multiresolution Analysis.** It may be fruitful to represent time series at different resolutions to facilitate generative modeling. A multiresolution representation of time series can be obtained, for example, by computing a wavelet transform (Mallat, 1999) of these time series. Adapting the theory from Section 3 to wavelets in order to represent time series at multiple resolutions could represent an interesting avenue for future work.

**Latent Diffusion.** Latent diffusion has emerged as a fruitful direction of research in the diffusion literature (Rombach et al., 2022). A promising direction would be to study how spectral representations of time series can be incorporated into their latent representations and whether this benefits the quality of the generated samples.

We leave these insightful research directions for future work.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## Acknowledgements

The authors are grateful to the 4 ICML reviewers for their useful comments on an earlier version of the manuscript. Jonathan Crabbé and Jan Stanczuk are funded by Aviva, Nicolas Huynh by Illumina and Mihaela van der Schaar by the Office of Naval Research (ONR), NSF 172251.

## References

Alaa, A. M., Chan, A. J., and van der Schaar, M. Generative time-series modeling with Fourier flows. In International Conference on Learning Representations, 2021.

Anderson, B. D. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313-326, 1982. ISSN 0304-4149. doi: 10.1016/0304-4149(82)90051-5.

Bonneel, N., Rabin, J., Peyré, G., and Pfister, H. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51:22-45, 2015.

Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780-8794, 2021.

Dieleman, S., Sartran, L., Roshannai, A., Savinov, N., Ganin, Y., Richemond, P. H., Doucet, A., Strudel, R., Dyer, C., Durkan, C., et al. Continuous diffusion for categorical data. arXiv preprint arXiv:2211.15089, 2022.

Donahue, C., McAuley, J., and Puckette, M. Adversarial audio synthesis. In International Conference on Learning Representations, 2018.

Esteban, C., Hyland, S. L., and Rätsch, G. Real-valued (medical) time series generation with recurrent conditional GANs, 2017.

Gatta, F., Giampaolo, F., Prezioso, E., Mei, G., Cuomo, S., and Piccialli, F. Neural networks generative models for time series. J. King Saud Univ. Comput. Inf. Sci., 34:7920-7939, 2022.

Heisenberg, W. Über den anschaulichen Inhalt der quantentheoretischen Kinematik und Mechanik. Zeitschrift für Physik, 43(3):172-198, 1927.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models.
Advances in Neural Information Processing Systems, 33:6840-6851, 2020.

Hyvärinen, A. and Dayan, P. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.

Johnson, A., Pollard, T., and Mark, R. MIMIC-III clinical database (version 1.4). PhysioNet, 10(C2XW26):2, 2016.

Kachuee, M., Fazeli, S., and Sarrafzadeh, M. ECG heartbeat classification: A deep transferable representation. In 2018 IEEE International Conference on Healthcare Informatics (ICHI), pp. 443-444. IEEE, 2018.

Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110-8119, 2020.

Kloeden, P. E. and Platen, E. Stochastic differential equations. Springer, 1992.

Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. DiffWave: A versatile diffusion model for audio synthesis. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021.

Körner, T. W. Fourier analysis. Cambridge University Press, 2022.

Lin, L., Li, Z., Li, R., Li, X., and Gao, J. Diffusion models for time-series applications: a survey. Frontiers of Information Technology & Electronic Engineering, pp. 1-23, 2023.

Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., and Van Gool, L. RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11461-11471, 2022.

Mallat, S. A wavelet tour of signal processing. Elsevier, 1999.

Miller, A. R. and Tucker, C. Privacy protection and technology diffusion: The case of electronic medical records. Management Science, 55(7):1077-1093, 2009.

Minixhofer, C. Predict droughts using weather and soil data, 2021.

Müller, M. Dynamic time warping. Information Retrieval for Music and Motion, pp. 69-84, 2007.

Nam, S. An uncertainty principle for discrete signals. In SampTA, Bremen, Germany, 2013.

Nikolaidis, K., Kristiansen, S., Goebel, V., Plagemann, T., Liestøl, K., and Kankanhalli, M. Augmenting physiological time series data: A case study for sleep apnea detection. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 376-399. Springer, 2019.

Onyshchak, O. Stock market dataset, 2020.

Phillips, A., Seror, T., Hutchinson, M., De Bortoli, V., Doucet, A., and Mathieu, E. Spectral diffusion processes. arXiv preprint arXiv:2209.14125, 2022.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684-10695, 2022.

Saha, B. and Goebel, T. Battery data set, NASA Ames Prognostics Data Repository, 2007.

Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D. J., and Norouzi, M. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4713-4726, 2022.

Schreier, P. J. and Scharf, L. L. Statistical Signal Processing of Complex-Valued Data: The Theory of Improper and Noncircular Signals. Cambridge University Press, 2010. doi: 10.1017/CBO9780511815911.

Shin, D., Shin, S., and Moon, I.-C. Frequency domain-based dataset distillation. arXiv preprint arXiv:2311.08819, 2023.
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256-2265. PMLR, 2015.

Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2020.

Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661-1674, 2011.

Wang, S., McDermott, M. B., Chauhan, G., Ghassemi, M., Hughes, M. C., and Naumann, T. MIMIC-Extract: A data extraction, preprocessing, and representation pipeline for MIMIC-III. In Proceedings of the ACM Conference on Health, Inference, and Learning, pp. 222-235, 2020.

Watson, J. L., Juergens, D., Bennett, N. R., Trippe, B. L., Yim, J., Eisenach, H. E., Ahern, W., Borst, A. J., Ragotte, R. J., Milles, L. F., et al. De novo design of protein structure and function with RFdiffusion. Nature, 620(7976):1089-1100, 2023.

Xu, M., Yu, L., Song, Y., Shi, C., Ermon, S., and Tang, J. GeoDiff: a geometric diffusion model for molecular conformation generation. arXiv, abs/2203.02923, 2022.

Yi, K., Zhang, Q., Cao, L., Wang, S., Long, G., Hu, L., He, H., Niu, Z., Fan, W., and Xiong, H. A survey on deep learning based time series analysis with frequency transformation. CoRR, abs/2302.02173, 2023.

Yoon, J., Jarrett, D., and van der Schaar, M. Time-series generative adversarial networks. Advances in Neural Information Processing Systems, 32, 2019.

Zeni, C., Pinsler, R., Zügner, D., Fowler, A., Horton, M., Fu, X., Shysheya, S., Crabbé, J., Sun, L., Smith, J., et al. MatterGen: a generative model for inorganic materials design. arXiv preprint arXiv:2312.03687, 2023.

## A. Mathematical details

### A.1. DFT properties

**Proposition A.1** (Unitarity of the DFT operator). *The DFT matrix $U \in \mathbb{C}^{N \times N}$ with elements $[U]_{\kappa\tau} := N^{-1/2} \exp(-i\omega_\kappa \tau)$, where $\omega_\kappa := \frac{2\pi\kappa}{N}$, is unitary.*

*Proof.* Let $U^*$ denote the conjugate transpose of $U$. For any $\kappa$ and $\tau$ in $[N]$, we have:

$$[UU^*]_{\kappa\tau} = \sum_{\beta=0}^{N-1} [U]_{\kappa\beta} [U^*]_{\beta\tau} = \frac{1}{N} \sum_{\beta=0}^{N-1} \exp(-i\omega_\kappa \beta) \exp(i\omega_\tau \beta) = \frac{1}{N} \sum_{\beta=0}^{N-1} \exp(-i\omega_{\kappa-\tau} \beta).$$

Hence, if $\kappa = \tau$, we have $[UU^*]_{\kappa\tau} = 1$; otherwise $[UU^*]_{\kappa\tau} = 0$, since $\exp(-i\omega_{\kappa-\tau} N) = 1$. This is equivalent to $UU^* = I_N$, i.e. $U$ is unitary. ∎

**Proposition A.2.** *The DFT $\hat{x} = \mathcal{F}[x] = Ux$ of a real-valued time series $x \in \mathbb{R}^{d_X}$ verifies the following mirror symmetry for all $\kappa \in [N]$: $\hat{x}_\kappa = \overline{\hat{x}_{N-\kappa}}$.*

*Proof.* Let $\kappa$ and $\tau$ be in $[N]$. We first note that $\exp(i\omega_{N-\kappa}\tau) = \exp(i(\omega_N - \omega_\kappa)\tau) = \exp(-i\omega_\kappa \tau)$. Hence,

$$\overline{\hat{x}_{N-\kappa}} = \overline{\sum_{\tau=0}^{N-1} [U]_{N-\kappa,\tau}\, x_\tau} = N^{-1/2} \sum_{\tau=0}^{N-1} \exp(i\omega_{N-\kappa}\tau)\, x_\tau = N^{-1/2} \sum_{\tau=0}^{N-1} \exp(-i\omega_\kappa \tau)\, x_\tau = \sum_{\tau=0}^{N-1} [U]_{\kappa,\tau}\, x_\tau = \hat{x}_\kappa,$$

where the second equality uses the fact that $x$ is real-valued. ∎

### A.2. Densities and scores for constrained signals

As we have mentioned in Section 2, the redundancy between certain components of the DFT $\hat{x} \in \mathbb{C}^{d_X}$ of $x$ expressed by Equation (6) needs to be taken into account if we wish to define a density $p$ for the time series distribution in the frequency domain. In particular, this redundancy implies that the density is really defined on a submanifold

$$\mathbb{C}^{d_X}_{\text{constr}} := \left\{ \hat{x} = (\hat{x}_0, \dots, \hat{x}_{N-1}) \in \mathbb{C}^{d_X} \mid \hat{x}_\kappa = \overline{\hat{x}_{N-\kappa}} \ \forall \kappa \in [N] \right\}$$

of complex signals fulfilling the constraint. We can define a coordinate chart $\varphi : \mathbb{C}^{d_X}_{\text{constr}} \to \mathbb{R}^{d_X}$ on this submanifold by extracting the unconstrained part of a DFT $\hat{x}$.
This operator simply concatenates the relevant real and imaginary parts of the DFT and is defined as follows for all $\hat{x} \in \mathbb{C}^{d_X}_{\text{constr}}$:

$$\varphi[\hat{x}] = \begin{cases} (\Re[\hat{x}_\kappa])_{\kappa=0}^{N/2} \oplus (\Im[\hat{x}_\kappa])_{\kappa=1}^{N/2-1} & \text{if } N \in 2\mathbb{N}, \\ (\Re[\hat{x}_\kappa])_{\kappa=0}^{\lfloor N/2 \rfloor} \oplus (\Im[\hat{x}_\kappa])_{\kappa=1}^{\lfloor N/2 \rfloor} & \text{else}, \end{cases} \tag{12}$$

where $v_1 \oplus v_2$ denotes the concatenation of two vectors $v_1 \in \mathbb{R}^{d_1}$ and $v_2 \in \mathbb{R}^{d_2}$, with $d_1, d_2 \in \mathbb{N}$. Due to Equation (6), one can unambiguously reconstruct $\hat{x} \in \mathbb{C}^{d_X}_{\text{constr}}$ from $\varphi[\hat{x}]$. Hence, the coordinate chart admits an inverse $\varphi^{-1} : \mathbb{R}^{d_X} \to \mathbb{C}^{d_X}_{\text{constr}}$ defined as follows for all $z = (z_0, \dots, z_{N-1}) \in \mathbb{R}^{d_X}$:

$$\varphi^{-1}[z] = \begin{cases} (z_0) \oplus (z_\kappa + i\, z_{N/2+\kappa})_{\kappa=1}^{N/2-1} \oplus (z_{N/2}) \oplus (z_{N/2-\kappa} - i\, z_{N-\kappa})_{\kappa=1}^{N/2-1} & \text{if } N \in 2\mathbb{N}, \\ (z_0) \oplus (z_\kappa + i\, z_{\lfloor N/2 \rfloor + \kappa})_{\kappa=1}^{\lfloor N/2 \rfloor} \oplus (z_{\lceil N/2 \rceil - \kappa} - i\, z_{N-\kappa})_{\kappa=1}^{\lfloor N/2 \rfloor} & \text{else}. \end{cases} \tag{13}$$

With this coordinate chart, it becomes possible to rigorously define the density $p : \mathbb{C}^{d_X}_{\text{constr}} \to \mathbb{R}^+$. One simply defines a probability density $p_\varphi : \mathbb{R}^{d_X} \to \mathbb{R}^+$ over the real vector space $\mathbb{R}^{d_X}$ on which the coordinate chart is defined and pulls it back to the manifold of constrained signals: $p := p_\varphi \circ \varphi$. This indeed defines a density that depends on the real and imaginary parts of the frequency representations $\hat{x} \in \mathbb{C}^{d_X}_{\text{constr}}$ of time series, as announced in Section 2.

Finally, it remains to rigorously define the score $\hat{s} : \mathbb{C}^{d_X}_{\text{constr}} \times [0, T] \to \mathbb{C}^{d_X}$ in the frequency domain. This can be done by starting from the real score $s_\varphi : \mathbb{R}^{d_X} \times [0, T] \to \mathbb{R}^{d_X}$ that is well-defined for all $z \in \mathbb{R}^{d_X}$ and $t \in [0, T]$ as $s_\varphi(z, t) = \nabla_z \log p_{\varphi,t}(z)$. Again, one can extend this vector field to the constrained manifold by defining, for all $\hat{x} \in \mathbb{C}^{d_X}_{\text{constr}}$ and all $t \in [0, T]$:

$$\hat{s}(\hat{x}, t) := \varphi^{-1}[s_\varphi(\varphi(\hat{x}), t)]. \tag{14}$$

This indeed defines a vector field involving partial derivatives of the log density with respect to the real and imaginary parts of the frequency representations $\hat{x} \in \mathbb{C}^{d_X}_{\text{constr}}$ of time series, and it respects the mirror symmetry by virtue of Equation (13). Everything is then consistent with the discussion from Section 2.

### A.3. Diffusion SDEs in the frequency domain

**Lemma 3.1** (DFT of standard Brownian motion). *Let $w$ be a standard Brownian motion on $\mathbb{R}^{d_X}$ with $d_X = N \cdot M$, where $N \in \mathbb{N}^+$ is the number of time series steps and $M \in \mathbb{N}^+$ is the number of features tracked over time. Then $v = Uw$ is a continuous stochastic process endowed with: (1) Mirror Symmetry: for all $\kappa \in [N]$, $v_\kappa = \overline{v_{N-\kappa}}$. (2) Real Brownian Motion: $v_0$ is a (real) standard Brownian motion on $\mathbb{R}^M$. (3) Complex Brownian Motions: for all $\kappa$ with $1 \leq \kappa \leq \lfloor N/2 \rfloor$, we can write $v_\kappa = (w^1_\kappa + i\, w^2_\kappa)/\sqrt{2}$, where $w^1_\kappa$ and $w^2_\kappa$ are independent standard Brownian motions on $\mathbb{R}^M$, except when $N$ is even and $\kappa = N/2$, where $v_{N/2}$ is a real standard Brownian motion on $\mathbb{R}^M$. (4) Independence: the stochastic processes $\{v_\kappa\}_{\kappa=0}^{\lfloor N/2 \rfloor}$ are mutually independent. We call any stochastic process satisfying the above constraints a mirrored Brownian motion on $\mathbb{C}^{d_X}$.*

*Proof.* **(1) Mirror Symmetry.** This point directly follows from the symmetry of the DFT, proved in Proposition A.2.

In what follows, we consider without loss of generality the case $M = 1$, since the cases $M > 1$ can be handled similarly by flattening matrices into vectors and using the same arguments as below. Let us first decompose $U$ into its real and imaginary parts, i.e. $U = U_{\mathrm{re}} + i\, U_{\mathrm{im}}$, where $U_{\mathrm{re}}$ and $U_{\mathrm{im}}$ are in $\mathbb{R}^{N \times N}$. Note that these two matrices are both symmetric. Computing the distribution of the components of $v$ will require the knowledge of covariance matrices, which will depend on $U_{\mathrm{re}}^2$, $U_{\mathrm{im}}^2$ and $U_{\mathrm{re}} U_{\mathrm{im}}$.
For any $\kappa$ and $\tau$ in $[N]$,

$$[U_{\mathrm{re}}^2]_{\kappa\tau} = \frac{1}{N} \sum_{\gamma=0}^{N-1} \cos\!\left(\tfrac{2\pi}{N}\kappa\gamma\right) \cos\!\left(\tfrac{2\pi}{N}\tau\gamma\right) = \frac{1}{2N} \sum_{\gamma=0}^{N-1} \left[ \cos\!\left(\tfrac{2\pi}{N}(\kappa+\tau)\gamma\right) + \cos\!\left(\tfrac{2\pi}{N}(\kappa-\tau)\gamma\right) \right],$$

$$[U_{\mathrm{im}}^2]_{\kappa\tau} = \frac{1}{N} \sum_{\gamma=0}^{N-1} \sin\!\left(\tfrac{2\pi}{N}\kappa\gamma\right) \sin\!\left(\tfrac{2\pi}{N}\tau\gamma\right) = \frac{1}{2N} \sum_{\gamma=0}^{N-1} \left[ \cos\!\left(\tfrac{2\pi}{N}(\kappa-\tau)\gamma\right) - \cos\!\left(\tfrac{2\pi}{N}(\kappa+\tau)\gamma\right) \right].$$

To compute these sums, we consider the situations where:

- $N \mid \kappa - \tau$: this is equivalent to $\kappa = \tau$, since $-(N-1) \leq \kappa - \tau \leq N-1$;
- $N \mid \kappa + \tau$: this is equivalent to $\kappa = \tau = 0$ or $\kappa + \tau = N$, since $0 \leq \kappa + \tau \leq 2(N-1)$.

Hence, if $N$ is even:

$$[U_{\mathrm{re}}^2]_{\kappa\tau} = \begin{cases} 1 & \text{if } \kappa = \tau = 0 \text{ or } \kappa = \tau = \frac{N}{2}, \\ \frac{1}{2} & \text{if } \kappa \notin \{0, N/2\} \text{ and } (\kappa = \tau \text{ or } \kappa = N - \tau), \\ 0 & \text{otherwise}, \end{cases} \qquad [U_{\mathrm{im}}^2]_{\kappa\tau} = \begin{cases} 0 & \text{if } \kappa = \tau = 0 \text{ or } \kappa = \tau = \frac{N}{2}, \\ \frac{1}{2} & \text{if } \kappa \notin \{0, N/2\} \text{ and } \kappa = \tau, \\ -\frac{1}{2} & \text{if } \kappa \notin \{0, N/2\} \text{ and } \kappa = N - \tau, \\ 0 & \text{otherwise}. \end{cases}$$

If $N$ is odd:

$$[U_{\mathrm{re}}^2]_{\kappa\tau} = \begin{cases} 1 & \text{if } \kappa = \tau = 0, \\ \frac{1}{2} & \text{if } \kappa \neq 0 \text{ and } (\kappa = \tau \text{ or } \kappa = N - \tau), \\ 0 & \text{otherwise}, \end{cases} \qquad [U_{\mathrm{im}}^2]_{\kappa\tau} = \begin{cases} 0 & \text{if } \kappa = \tau = 0, \\ \frac{1}{2} & \text{if } \kappa \neq 0 \text{ and } \kappa = \tau, \\ -\frac{1}{2} & \text{if } \kappa = N - \tau, \\ 0 & \text{otherwise}. \end{cases}$$

Finally, we compute $U_{\mathrm{re}} U_{\mathrm{im}}$:

$$[U_{\mathrm{re}} U_{\mathrm{im}}]_{\kappa\tau} = \frac{1}{N} \sum_{\gamma=0}^{N-1} \cos\!\left(\tfrac{2\pi}{N}\kappa\gamma\right) \sin\!\left(\tfrac{2\pi}{N}\tau\gamma\right) = \frac{1}{2N} \sum_{\gamma=0}^{N-1} \left[ \sin\!\left(\tfrac{2\pi}{N}(\tau-\kappa)\gamma\right) + \sin\!\left(\tfrac{2\pi}{N}(\kappa+\tau)\gamma\right) \right] = 0.$$

Hence, $U_{\mathrm{re}} U_{\mathrm{im}} = 0_N$, and similarly $U_{\mathrm{im}} U_{\mathrm{re}} = 0_N$ by taking the transpose and using the symmetry of $U_{\mathrm{re}}$ and $U_{\mathrm{im}}$.

We can now characterize the distribution followed by the stochastic process $v$. We first write $U_{\mathrm{col}} = \begin{pmatrix} U_{\mathrm{re}} \\ U_{\mathrm{im}} \end{pmatrix}$, and then notice that $v$ can be investigated through the lens of its flattened version $v_{\mathrm{flat}} = U_{\mathrm{col}}\, w$, which is a stochastic process in $\mathbb{R}^{2N}$.

**(2) Real Brownian Motion.** First, $v_0$ is real-valued, by (1) Mirror Symmetry. We then have the following:

- $v_0(0) = 0$ almost surely: this stems from $w(0) = 0$ almost surely, since $w$ is a multivariate standard Brownian motion and $v = Uw$.
- Continuity of $t \mapsto v_0(t)$ almost surely: $w$ satisfies the continuity property and the DFT operator (seen as a complex operator) is linear, hence $v_0$ is also continuous with respect to $t$ almost surely.
- Stationary and independent increments: this follows from the linearity of the DFT operator and $w$ being a Brownian motion.
- For any $t, s \geq 0$, $v_0(t+s) - v_0(s) \sim \mathcal{N}(0, t)$: to see this, we notice that $v_0(t+s) - v_0(s)$ is Gaussian³, since it is a linear transform of $w(t+s) - w(s)$. Moreover, its mean and variance are given by:

$$\mathbb{E}[v_0(t+s) - v_0(s)] = \left[ U_{\mathrm{col}}\, \mathbb{E}[w(t+s) - w(s)] \right]_0 = 0,$$

$$\mathrm{Var}(v_0(t+s) - v_0(s)) = \left[ U_{\mathrm{col}}\, \mathrm{Cov}\!\left(w(t+s) - w(s),\, w(t+s) - w(s)\right) U_{\mathrm{col}}^T \right]_{0,0} = t\, [U_{\mathrm{col}} U_{\mathrm{col}}^T]_{0,0} = t\, [U_{\mathrm{re}}^2]_{0,0} = t.$$

Hence, we have shown that $v_0$ is a real Brownian motion.

**(3) Complex Brownian Motions.** Let $1 \leq \kappa < \lceil N/2 \rceil$. Then, $\Re(v_\kappa)$ and $\Im(v_\kappa)$ satisfy the first three properties of a Brownian motion, using the same arguments as above. For the last point, we first characterize the distributions of $\Re(v_\kappa)$ and $\Im(v_\kappa)$, and then show that they are independent.

*Distribution of $\Re(v_\kappa)$.* $\sqrt{2}\, \Re(v_\kappa)$ is a standard Brownian motion in $\mathbb{R}$: for any $t, s \geq 0$, $\Re(v_\kappa)(t+s) - \Re(v_\kappa)(s)$ is Gaussian since it is a linear transform of $w(t+s) - w(s)$, which is a Gaussian vector. We can compute its mean and variance:

$$\mathbb{E}[\Re(v_\kappa)(t+s) - \Re(v_\kappa)(s)] = \left[ U_{\mathrm{col}}\, \mathbb{E}[w(t+s) - w(s)] \right]_\kappa = 0,$$

$$\mathrm{Var}(\Re(v_\kappa)(t+s) - \Re(v_\kappa)(s)) = \left[ U_{\mathrm{col}}\, \mathrm{Cov}\!\left(w(t+s) - w(s),\, w(t+s) - w(s)\right) U_{\mathrm{col}}^T \right]_{\kappa,\kappa} = t\, [U_{\mathrm{re}}^2]_{\kappa,\kappa} = \frac{t}{2}.$$

*Distribution of $\Im(v_\kappa)$.* Similarly, we can prove with the same arguments that $\sqrt{2}\, \Im(v_\kappa)$ is a standard Brownian motion in $\mathbb{R}$.

*Independence of $\Re(v_\kappa)$ and $\Im(v_\kappa)$.* Let $k$ and $m$ be two strictly positive integers, and let $(t_1, \dots, t_k) \in (\mathbb{R}^+)^k$ and $(t'_1, \dots, t'_m) \in (\mathbb{R}^+)^m$. We need to show that the vectors $v^{\mathrm{re}}_\kappa = (\Re(v_\kappa)(t_1), \dots, \Re(v_\kappa)(t_k))$ and $v^{\mathrm{im}}_\kappa = (\Im(v_\kappa)(t'_1), \dots, \Im(v_\kappa)(t'_m))$ are independent. First, the concatenation of $v^{\mathrm{re}}_\kappa$ and $v^{\mathrm{im}}_\kappa$ can be expressed as a linear transform of $(w(t_1), \dots, w(t_k), w(t'_1), \dots, w(t'_m))$, which is a Gaussian vector since $w$ is a Brownian motion. Consequently, $(v^{\mathrm{re}}_\kappa, v^{\mathrm{im}}_\kappa)$ is also a Gaussian vector.
Now, let $l \in \{1, \dots, k\}$ and $n \in \{1, \dots, m\}$. Then,

$$\mathrm{Cov}\!\left(\Re(v_\kappa)(t_l),\, \Im(v_\kappa)(t'_n)\right) = \left[ U_{\mathrm{re}}\, \mathrm{Cov}\!\left(w(t_l), w(t'_n)\right) U_{\mathrm{im}} \right]_{\kappa,\kappa} = \min(t_l, t'_n)\, [U_{\mathrm{re}} U_{\mathrm{im}}]_{\kappa,\kappa} = 0.$$

Given this covariance structure and the fact that $(v^{\mathrm{re}}_\kappa, v^{\mathrm{im}}_\kappa)$ is a Gaussian vector, we have $v^{\mathrm{re}}_\kappa \perp v^{\mathrm{im}}_\kappa$. Since this holds true for any choice of $(t_1, \dots, t_k) \in (\mathbb{R}^+)^k$ and $(t'_1, \dots, t'_m) \in (\mathbb{R}^+)^m$, we conclude that $\Re(v_\kappa) \perp \Im(v_\kappa)$. The case $\kappa = \lfloor N/2 \rfloor$ can be handled using the same arguments, by distinguishing the cases $N$ odd and $N$ even.

**(4) Independence.** The mutual independence of the stochastic processes $\{v_\kappa\}_{\kappa=0}^{\lfloor N/2 \rfloor}$ follows from the structure of $U_{\mathrm{re}}^2$ and $U_{\mathrm{im}}^2$. Indeed, for any $m$ and $n$ such that $m \neq n$, $0 \leq m \leq \lfloor N/2 \rfloor$, and $0 \leq n \leq \lfloor N/2 \rfloor$, we have $[U_{\mathrm{re}}^2]_{m,n} = [U_{\mathrm{im}}^2]_{m,n} = [U_{\mathrm{re}} U_{\mathrm{im}}]_{m,n} = 0$. We can then apply the same argument as in the independence step of (3) Complex Brownian Motions to obtain the mutual independence of the stochastic processes $\{v_\kappa\}_{\kappa=0}^{\lfloor N/2 \rfloor}$. ∎

³Note that a random variable almost surely equal to 0 can be seen as a degenerate Gaussian with mean 0 and variance 0.
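The statements of Lemma 3.1 are straightforward to check by simulation. The following sketch draws many realizations of $w(1) \sim \mathcal{N}(0, I_N)$ (a standard Brownian motion evaluated at $t = 1$, with $M = 1$), applies the DFT, and verifies the mirror symmetry, the component variances, and the vanishing real–imaginary correlation.

```python
import numpy as np

N, n_paths = 8, 200_000
rng = np.random.default_rng(0)
idx = np.arange(N)
U = np.exp(-2j * np.pi * np.outer(idx, idx) / N) / np.sqrt(N)

w = rng.standard_normal((n_paths, N))    # w(1) ~ N(0, I_N) for each path
v = w @ U.T                              # v = U w, path-wise

assert np.allclose(v[:, 1:], np.conj(v[:, :0:-1]))   # (1) mirror symmetry
print(np.var(v[:, 0].real))              # ~ 1   (2) real Brownian component
print(np.var(v[:, 1].real), np.var(v[:, 1].imag))    # ~ 1/2 each, cf. (3)
print(np.var(v[:, N // 2].real))         # ~ 1   (N even, kappa = N/2)
print(np.mean(v[:, 1].real * v[:, 1].imag))          # ~ 0, independence
print(np.mean(v[:, 1].real * v[:, 2].real))          # ~ 0, cf. (4)
```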
A.4. Denoising score matching in the frequency domain

Proposition 3.4 (Score matching equivalence). Consider a score $\tilde{s}_\theta : \mathbb{C}^{d_X} \times [0, T] \to \mathbb{C}^{d_X}$ defined in the frequency domain and satisfying the mirror symmetry $[\tilde{s}_\theta]_\kappa = \overline{[\tilde{s}_\theta]_{N-\kappa}}$ for all κ in [N]. Let us define an auxiliary score $s_\theta : \mathbb{R}^{d_X} \times [0, T] \to \mathbb{R}^{d_X}$ as $(x, t) \mapsto s_\theta(x, t) = U^\dagger \tilde{s}_\theta(Ux, t)$ in the time domain. The score matching loss in the frequency domain is equivalent to the score matching loss for the auxiliary score in the time domain:

$$L_{SM}\big(\tilde{s}_\theta, \Lambda^2 \tilde{s}_{t|0}, \tilde{x}, t\big) = L_{SM}\big(s_\theta, s_{t|0}, x, t\big) \qquad (10)$$

where $\tilde{s}_{t|0}(\tilde{x}, t) = \nabla_{\tilde{x}(t)} \log p_{t|0}(\tilde{x}(t) \mid \tilde{x}(0))$, $s_{t|0}(x, t) = \nabla_{x(t)} \log p_{t|0}(x(t) \mid x(0))$, and Λ is the diagonal matrix in Proposition 3.3.

Proof. Let $\Lambda \in \mathbb{R}^{N \times N}$ be the diagonal matrix such that

$$[\Lambda]_{\kappa,\kappa} = \begin{cases} 1 & \text{if } \kappa = 0, \text{ or } N \text{ is even and } \kappa = N/2 \\ \tfrac{1}{\sqrt{2}} & \text{otherwise.} \end{cases}$$

Step 1: We first express the score of x with respect to the score of the truncation $\varphi[\tilde{x}]$. By definition of $\varphi^{-1}$ in Equation (13), we have $x = U^\dagger \varphi^{-1}(\varphi[\tilde{x}])$. Hence, we can write, using the change of variable formula:

$$p_{t|0}(x(t) \mid x(0)) = C\, p_{t|0}\big(\varphi[\tilde{x}(t)] \mid \varphi[\tilde{x}(0)]\big) \qquad (23)$$

where C is a constant which does not depend on x, since $x \mapsto \varphi[Ux]$ is linear. Moreover, let us write $x \mapsto \varphi[Ux]$ in matrix form, i.e. for all $x \in \mathbb{R}^{d_X}$, $\varphi[Ux] = V U_{\mathrm{col}} x = Qx$, where $V \in \mathbb{R}^{N \times 2N}$, $U_{\mathrm{col}} = \binom{U_{\mathrm{re}}}{U_{\mathrm{im}}}$, and Q is an invertible matrix in $\mathbb{R}^{N \times N}$. For the rest of the proof, we shall build on the following results:

Result 1. $QQ^T = \Lambda^2$. To see this, write $QQ^T = V U_{\mathrm{col}} U_{\mathrm{col}}^T V^T$. The matrix $U_{\mathrm{col}} U_{\mathrm{col}}^T$ is equal to $\begin{pmatrix} U_{\mathrm{re}}^2 & 0_N \\ 0_N & U_{\mathrm{im}}^2 \end{pmatrix}$ (see the proof of Lemma 3.1), while the multiplication by V, on the left and on the right of $U_{\mathrm{col}} U_{\mathrm{col}}^T$, extracts the submatrix corresponding to the indices retained by the truncation φ. Hence, $V U_{\mathrm{col}} U_{\mathrm{col}}^T V^T = \Lambda^2$.

Result 2. For any $x \in \mathbb{R}^{d_X}$, we have $Q^T x = U^\dagger \varphi^{-1}[\Lambda^2 x]$. To see this, notice that Result 1 implies that $Q^T x = Q^{-1} \Lambda^2 x = U^\dagger \varphi^{-1}[\Lambda^2 x]$ for all $x \in \mathbb{R}^{d_X}$.

Equipped with these results, we can now complete the rest of the proof. First, we have:

$$\nabla_{x(t)} \log p_{t|0}(x(t) \mid x(0)) = \nabla_{x(t)} \log p_{t|0}\big(\varphi[\tilde{x}(t)] \mid \varphi[\tilde{x}(0)]\big) \qquad (24)$$

$$= Q^T\, \nabla_{\varphi[\tilde{x}(t)]} \log p_{t|0}\big(\varphi[\tilde{x}(t)] \mid \varphi[\tilde{x}(0)]\big) \quad \text{(chain rule)} \qquad (25)$$

Step 2: We then obtain:

$$L_{SM}\big(s_\theta, s_{t|0}, x, t\big) := \big\| s_\theta(x, t) - \nabla_{x(t)} \log p_{t|0}(x(t) \mid x(0)) \big\|^2$$
$$= \big\| U s_\theta(x, t) - U \nabla_{x(t)} \log p_{t|0}(x(t) \mid x(0)) \big\|^2 \quad \text{(Parseval identity)}$$
$$= \big\| \tilde{s}_\theta(\tilde{x}, t) - U Q^T \nabla_{\varphi[\tilde{x}(t)]} \log p_{t|0}\big(\varphi[\tilde{x}(t)] \mid \varphi[\tilde{x}(0)]\big) \big\|^2 \quad \text{(Equation (25))}$$
$$= \big\| \tilde{s}_\theta(\tilde{x}, t) - U U^\dagger \varphi^{-1}\big[\Lambda^2 \nabla_{\varphi[\tilde{x}(t)]} \log p_{t|0}\big(\varphi[\tilde{x}(t)] \mid \varphi[\tilde{x}(0)]\big)\big] \big\|^2 \quad \text{(Result 2)}$$
$$= \big\| \tilde{s}_\theta(\tilde{x}, t) - \Lambda^2 \varphi^{-1}\big[\nabla_{\varphi[\tilde{x}(t)]} \log p_{t|0}\big(\varphi[\tilde{x}(t)] \mid \varphi[\tilde{x}(0)]\big)\big] \big\|^2 \quad \text{(Proposition A.1 \& definition of } \varphi^{-1}\text{)}$$
$$= \big\| \tilde{s}_\theta(\tilde{x}, t) - \Lambda^2 \tilde{s}_{t|0}(\tilde{x}, t) \big\|^2 \quad \text{(Equation (14))}$$
$$= L_{SM}\big(\tilde{s}_\theta, \Lambda^2 \tilde{s}_{t|0}, \tilde{x}, t\big).$$

B. Empirical details

Compute resources. All the models were trained and used for sampling on a single machine equipped with an 18-core Intel Core i9-10980XE CPU, an NVIDIA RTX A4000 GPU, and an NVIDIA GeForce RTX 3080.

B.1. Details on datasets

In this subsection, we give detailed information about the 6 datasets used throughout our experiments and the preprocessing steps for each of them.
ECG. We use two collections of heartbeat signals, from the MIT-BIH Arrhythmia Dataset and the PTB Diagnostic ECG Database (Kachuee et al., 2018). No preprocessing was performed on this dataset.

MIMIC-III. MIMIC-III (Johnson et al., 2016) is a database consisting of deidentified records for patients who stayed in critical care units. Preprocessing. We use the vitals labs table of the database, which corresponds to time-varying vitals and labs. We extract the rows of the dataset which correspond to the first 24 hours of stay by using MIMIC-Extract (Wang et al., 2020). The features are then standardized across all times and patients. We also perform imputation to handle missing values in the dataset. To do so, we consider the mean features (average measurement over 1 hour). For each patient and each missing value, we propagate the last observation forward if this is possible. If not, we fill the missing value with the mean value for the patient (which is computed over the whole stay). If no mean value is available, we fill the entry with 0.

NASDAQ-2019. This dataset (Onyshchak, 2020) contains daily prices for tickers trading on NASDAQ, with prices up to the 1st of April 2020. Preprocessing. We considered one year of daily prices, from the 1st of January 2019 to the 1st of January 2020. Each sample corresponds to one stock, and we remove the stocks which are not active over this whole time interval or contain missing values.

NASA battery. The NASA battery dataset (Saha & Goebel, 2007) consists of profiles for Li-ion batteries, under charge and discharge. Preprocessing. For both the charge and discharge datasets, we bin the time values (bins of size 10 for Charge, 15 for Discharge) and compute the mean of each feature inside each bin.

US-Droughts. This dataset (Minixhofer, 2021) consists of drought levels in different US counties, from 2000 to 2020. Preprocessing. We consider one year of history, from the 1st of January 2011 to the 1st of January 2012, and drop the columns with missing values.

B.2. Details on evaluation

Sliced Wasserstein Distances. The sliced Wasserstein distance (Bonneel et al., 2015) is a metric which can handle high-dimensional distributions. It is motivated by the fact that the Wasserstein distance is easy to compute when comparing two one-dimensional distributions. The idea of the sliced Wasserstein distance is to map the high-dimensional distributions of interest to one-dimensional distributions, by considering random projections on vectors of the unit sphere. For two distributions µ1 and µ2, it can be written as:

$$SW_p(\mu_1, \mu_2) := \int_{S^{d-1}} W_p\big(P_{u\#}\mu_1, P_{u\#}\mu_2\big)\, du \qquad (26)$$

where $S^{d-1}$ is the unit sphere in dimension d, $P_u(x) = u^\top x$ denotes the projection of x on u, $P_{u\#}\mu$ is the push-forward of µ by $P_u$, and $W_p$ is the Wasserstein distance of order p. To estimate this quantity in practice, we sample n = 10,000 random vectors $\{u_i \mid i \in [n]\}$ uniformly distributed on $S^{d-1}$ and consider p = 2. Hence, we can approximate $SW_p$ by the Monte Carlo estimator:

$$\widehat{SW}_p(\mu_1, \mu_2) = \frac{1}{n} \sum_{i=1}^{n} W_p\big(P_{u_i\#}\mu_1, P_{u_i\#}\mu_2\big) \qquad (27)$$

Marginal Wasserstein Distances. In addition to the sliced Wasserstein distance, we also consider the marginal Wasserstein distance. For any $j \in \{1, \dots, d\}$, the j-th marginal Wasserstein distance is defined as:

$$MW_p^{(j)}(\mu_1, \mu_2) = W_p\big(P_{e_j\#}\mu_1, P_{e_j\#}\mu_2\big) \qquad (28)$$

where $e_j$ is the j-th vector of the standard basis of $\mathbb{R}^d$. Throughout our experiments in Section 4, we compute the Wasserstein distances with respect to $\mathcal{D}_{\mathrm{train}}$ and its frequency representation $\tilde{\mathcal{D}}_{\mathrm{train}}$.
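For completeness, the estimator of Equation (27) can be sketched in a few lines of numpy (our illustration; it assumes the two empirical distributions contain the same number of samples, in which case the one-dimensional Wasserstein distance reduces to comparing order statistics):

```python
import numpy as np

def sliced_wasserstein(x1, x2, n_proj=10_000, p=2, seed=0):
    """Monte Carlo estimate of SW_p between two empirical distributions.

    x1, x2: arrays of shape (n_samples, d) with the SAME number of samples.
    """
    rng = np.random.default_rng(seed)
    d = x1.shape[1]
    # Random directions drawn uniformly on the unit sphere S^{d-1}
    u = rng.normal(size=(n_proj, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    # Project both sample sets on every direction: shape (n_proj, n_samples)
    proj1, proj2 = (x1 @ u.T).T, (x2 @ u.T).T
    # For equal-size 1-D empirical measures, W_p^p is the mean p-th power of the
    # differences between sorted samples (order statistics)
    w = np.mean(np.abs(np.sort(proj1, axis=1) - np.sort(proj2, axis=1)) ** p, axis=1) ** (1 / p)
    return w.mean()

# Example: two Gaussian clouds in R^16
rng = np.random.default_rng(1)
a, b = rng.normal(size=(1000, 16)), rng.normal(loc=0.5, size=(1000, 16))
print(sliced_wasserstein(a, b, n_proj=500))
```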
B.3. Computational cost

Since the Discrete Fourier Transform (DFT) is a bijection, and by virtue of the mirror symmetry from Equation (6), the frequency representation of a time series has the same number of independent real components as its time representation. As a consequence, diffusion in the frequency domain can be performed without any additional cost by restricting to these independent components, while keeping in mind that the remaining components can be deduced from the mirror symmetry. We now make this more precise. Due to the mirror symmetry in the frequency representation $\tilde{x} \in \mathbb{C}^{d_X}$ of a time series $x \in \mathbb{R}^{d_X}$, one can define a coordinate chart φ that extracts a subset of non-redundant real components of $\tilde{x}$. This map is defined as (see also Equation (12) for the same definition used in a different context):

$$\varphi[\tilde{x}] = \begin{cases} (\Re[\tilde{x}_\kappa])_{\kappa=0}^{N/2} \oplus (\Im[\tilde{x}_\kappa])_{\kappa=1}^{N/2-1} & \text{if } N \text{ is even} \\ (\Re[\tilde{x}_\kappa])_{\kappa=0}^{\lfloor N/2 \rfloor} \oplus (\Im[\tilde{x}_\kappa])_{\kappa=1}^{\lfloor N/2 \rfloor} & \text{else,} \end{cases}$$

where $v_1 \oplus v_2$ denotes the concatenation of two vectors $v_1 \in \mathbb{R}^{d_1}$ and $v_2 \in \mathbb{R}^{d_2}$, with $d_1, d_2 \in \mathbb{N}$. As we can see, this map truncates the full frequency representation by extracting half of the real components in $\tilde{x} \in \mathbb{C}^{d_X}$. We deduce that φ maps to $\mathbb{R}^{d_X}$ and, therefore, $\varphi[\tilde{x}]$ has the same number of real components as x. By virtue of the mirror symmetry, $\tilde{x}$ can be reconstructed from $\varphi[\tilde{x}]$ by applying the inverse map:

$$\varphi^{-1}(z) = \begin{cases} (z_0) \oplus (z_\kappa + i\, z_{N/2+\kappa})_{\kappa=1}^{N/2-1} \oplus (z_{N/2}) \oplus (z_{N/2-\kappa} - i\, z_{N-\kappa})_{\kappa=1}^{N/2-1} & \text{if } N \text{ is even} \\ (z_0) \oplus (z_\kappa + i\, z_{\lfloor N/2 \rfloor+\kappa})_{\kappa=1}^{\lfloor N/2 \rfloor} \oplus (z_{\lceil N/2 \rceil-\kappa} - i\, z_{N-\kappa})_{\kappa=1}^{\lfloor N/2 \rfloor} & \text{else.} \end{cases}$$

One can check that the mirror symmetry makes it possible to reconstruct the initial frequency representation via $\tilde{x} = \varphi^{-1}(\varphi[\tilde{x}])$. By exploiting this truncation with the coordinate chart φ, the implementation of frequency diffusion can be described as follows (a code sketch of φ and its inverse is given after this list):

1. Bring the training time series $\mathcal{D}_{\mathrm{train}} \subset \mathbb{R}^{d_X}$ to the frequency domain by computing the DFT of each time series: $\tilde{\mathcal{D}}_{\mathrm{train}} \leftarrow F[\mathcal{D}_{\mathrm{train}}] \subset \mathbb{C}^{d_X}$.
2. Truncate the frequency representation to extract the non-redundant components: $\mathcal{D}^\varphi_{\mathrm{train}} \leftarrow \varphi(\tilde{\mathcal{D}}_{\mathrm{train}}) \subset \mathbb{R}^{d_X}$.
3. Train a frequency diffusion model by using denoising score matching on $\mathcal{D}^\varphi_{\mathrm{train}}$, using the diffusion process defined in Equation (7) and Equation (8).
4. Generate truncated samples $\mathcal{S}^\varphi_{\mathrm{freq}} \subset \mathbb{R}^{d_X}$ with the learned score model.
5. Compute the full frequency representation of these samples by using the inverse coordinate chart: $\tilde{\mathcal{S}}_{\mathrm{freq}} \leftarrow \varphi^{-1}(\mathcal{S}^\varphi_{\mathrm{freq}}) \subset \mathbb{C}^{d_X}$.
6. Optional: Bring these time series back to the time domain by computing their inverse DFT: $\mathcal{S}_{\mathrm{freq}} \leftarrow F^{-1}[\tilde{\mathcal{S}}_{\mathrm{freq}}] \subset \mathbb{R}^{d_X}$.

In Appendix A, we provide a detailed description of how this practical implementation relates to the theoretical discussion in Section 3.
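To make the coordinate chart φ and its inverse concrete, here is a minimal numpy sketch (our illustration, not the implementation used in our experiments) covering both the even and odd N cases, together with a round-trip check:

```python
import numpy as np

def phi(x_tilde):
    """Extract the N non-redundant real components of a mirrored spectrum."""
    N = x_tilde.shape[-1]
    h = N // 2
    re = x_tilde[..., : h + 1].real              # Re[x~_0 .. x~_{floor(N/2)}]
    im = x_tilde[..., 1 : h + (N % 2)].imag      # Im[x~_1 .. x~_{h-1}] (even) or .. x~_h (odd)
    return np.concatenate([re, im], axis=-1)

def phi_inv(z):
    """Rebuild the full mirrored spectrum from its truncation (inverse of phi)."""
    N = z.shape[-1]
    h = N // 2
    re, im_half = z[..., : h + 1], z[..., h + 1 :]
    # Imaginary parts vanish at kappa = 0 (and at kappa = N/2 when N is even)
    im = np.zeros_like(re)
    im[..., 1 : 1 + im_half.shape[-1]] = im_half
    half = re + 1j * im                          # x~_0 .. x~_{floor(N/2)}
    # Mirror symmetry: x~_{N - kappa} = conj(x~_kappa)
    mirror = np.conj(half[..., 1 : (N + 1) // 2])[..., ::-1]
    return np.concatenate([half, mirror], axis=-1)

# Round-trip check on the unitary DFT of a real signal (works for even and odd N)
x = np.random.default_rng(0).normal(size=7)
x_tilde = np.fft.fft(x) / np.sqrt(len(x))
assert np.allclose(phi_inv(phi(x_tilde)), x_tilde)
```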
Since this implementation performs the frequency diffusion in $\mathbb{R}^{d_X}$, we deduce that it does not lead to additional computational costs compared to diffusing in the time domain (computing the DFT and its inverse with the Fast Fourier Transform takes a negligible amount of time compared to training / sampling from the diffusion model). For transparency, we have also timed the training of a few diffusion models trained in the time and frequency domain. These results are reported in Table 3. As we can observe, the computational times of the time and frequency models are similar.

Table 3. Training times for time domain and frequency domain models.

Dataset | Time domain model | Frequency domain model
MIMIC-III | 30m 2s | 30m 13s
Nasa-Charge | 44m 51s | 43m 45s
US-Droughts | 2h 1m 30s | 2h 1m 33s

B.4. Additional plots

Sliced Wasserstein Distances. In Figure 4, we show the distribution of the sliced Wasserstein distances over all slices. In addition, we have included the average sliced Wasserstein distances obtained with 2 baselines. The first baseline is simply the Wasserstein distance between the training set and a set of samples only containing identical copies of the average sample, $SW(\mathcal{D}_{\mathrm{train}}, \mathcal{S}_{\mathrm{mean}})$, where $\mathcal{S}_{\mathrm{mean}} = \{\mathbb{E}_{X \sim U(\mathcal{D}_{\mathrm{train}})}[X]\}$. It represents the performance of a dummy generator that only generates the average time series and is denoted by mean in Figure 4. The second baseline is the Wasserstein distance between two random splits of the training set, $SW(\mathcal{D}^{1/2}_{\mathrm{train}}, \mathcal{D}^{2/2}_{\mathrm{train}})$, where $\mathcal{D}_{\mathrm{train}} = \mathcal{D}^{1/2}_{\mathrm{train}} \sqcup \mathcal{D}^{2/2}_{\mathrm{train}}$ is a decomposition of the training set into two disjoint random splits of equal size $|\mathcal{D}^{1/2}_{\mathrm{train}}| = |\mathcal{D}^{2/2}_{\mathrm{train}}|$. It represents the distance between two samples from the ground-truth distribution and is denoted by self in Figure 4. As we can observe, both the time and frequency diffusion models substantially outperform the mean baseline (as expected) and perform on par with the self baseline. This indicates that the models learned a good approximation of the real distribution. Furthermore, we notice that the frequency diffusion models tend to have smaller quantiles than the time diffusion models. This confirms that frequency diffusion models outperform the time diffusion models, as discussed in Section 4.

Marginal Wasserstein Distances. In Figure 5, we show the distribution of the marginal Wasserstein distances over all marginals. In addition, we have included the average marginal Wasserstein distances obtained with the 2 baselines defined in the previous paragraph. Again, both the time and frequency diffusion models tend to outperform the mean baseline (as expected) and perform on par with the self baseline. Furthermore, we notice that the frequency diffusion models tend to have smaller quantiles than the time diffusion models. This is consistent with the observations made in the above paragraph.

Per-Sample Localization. In Figure 6, we show the distribution of our localization metrics σtime and σfreq for each sample and each dataset from Section 4. We notice that most samples are located below the y = x axis, which confirms that most samples are more localized in the frequency domain. Interestingly, we also observe that none of the samples is located close to the origin. This is consistent with the uncertainty theorem from (Nam, 2013).

Empirical Evidence for Localization Claim. We repeat the same experiment as in Section 4.3 with the datasets which have the least extreme imbalance between time and frequency localization, namely MIMIC-III, Nasa-Charge and US-Droughts. In this experiment, we increase the delocalization of the frequency representations of the original samples by convolving them with Gaussian kernels of various widths (a sketch of this procedure is given below). We report these results in Figure 7. These results show that increasing the delocalization of the frequency representations decreases the delocalization in the time domain (in agreement with the uncertainty principle from (Nam, 2013)). For MIMIC-III and US-Droughts, this leads to better time diffusion models, compared to their frequency counterparts. This observation directly mirrors the conclusion given for the ECG example in Section 4.3. It is worth noting that in the case of the Nasa-Charge dataset, the frequency diffusion model remains better than the time diffusion model for all the levels of smoothing considered. This corroborates the cautionary remark included in Section 4.3: while localization is a key element which can explain the superior performance of frequency diffusion models, it may not be the only explanation for this phenomenon.
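For reference, the smoothing procedure used in this experiment can be sketched as follows (our illustration; the kernel widths are hypothetical). Convolving the spectrum with a wider Gaussian kernel delocalizes the frequency representation, which in turn concentrates the energy of the corresponding time series, as the shrinking effective support below illustrates:

```python
import numpy as np

def smooth_spectrum(x, width):
    """Convolve the spectrum of x with a circular Gaussian kernel of the given width."""
    N = len(x)
    x_tilde = np.fft.fft(x)
    dist = np.minimum(np.arange(N), N - np.arange(N))  # circular distance to frequency 0
    kernel = np.exp(-0.5 * (dist / width) ** 2)
    kernel /= kernel.sum()
    # Circular convolution of the spectrum with the kernel, via the convolution theorem
    smoothed = np.fft.ifft(np.fft.fft(x_tilde) * np.fft.fft(kernel))
    return np.fft.ifft(smoothed).real  # back to the time domain (imaginary part ~ 0)

rng = np.random.default_rng(0)
x = rng.normal(size=128)            # white noise: delocalized in time
for width in (0.5, 4.0, 16.0):      # hypothetical kernel widths
    y = smooth_spectrum(x, width)
    p = y ** 2 / np.sum(y ** 2)     # energy distribution over time steps
    print(width, 1.0 / np.sum(p ** 2))  # effective support in time: shrinks as width grows
```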
Figure 4. Sliced Wasserstein distances of time and frequency diffusion models. Each panel corresponds to one dataset ((b) MIMIC-III, (c) NASDAQ-2019, (d) NASA-Charge, (e) NASA-Discharge, (f) US-Droughts) and compares the time diffusion model, the frequency diffusion model, and the mean and self baselines in both the time and frequency metric domains.

Figure 5. Marginal Wasserstein distances of time and frequency diffusion models, with the same layout as Figure 4.

Figure 6. Localization metrics σtime and σfreq for all the samples of all datasets. We observe that no sample has a high localization (i.e. low σ) in the time and frequency domain simultaneously.

Figure 7. Additional results on the links between performance and localization.

C. Other Distance Metric

While the experimental results in Section 4 were reported with sliced Wasserstein distances (see Appendix B.2 for details), which quantify the distance between two probability distributions, more specialized metrics exist for time series. Indeed, the dynamic time warping (DTW) distance (Müller, 2007) measures the dissimilarity DTW(x, x′) between two time series $x, x' \in \mathbb{R}^{d_X}$. While this metric is not designed to compare two distributions of time series, it can be aggregated over all the pairs of time series we wish to compare. In our case, we want to compare the training set $\mathcal{D}_{\mathrm{train}} \subset \mathbb{R}^{d_X}$ with the samples $\mathcal{S} \subset \mathbb{R}^{d_X}$ generated by a diffusion model:

$$\overline{DTW} := \frac{1}{|\mathcal{D}_{\mathrm{train}}|\,|\mathcal{S}|} \sum_{x \in \mathcal{D}_{\mathrm{train}}} \sum_{x' \in \mathcal{S}} DTW(x, x')$$

A clear limitation of this metric is that it needs to be evaluated on all possible pairs, which typically represents $O(10^8)$ pairs when we generate 10,000 samples for a dataset of comparable size. This is prohibitively expensive for some of our datasets where time series have many time steps (for instance, this metric would take 7.9 days to evaluate on the NASA dataset). To make the computation time tractable, we resort to a Monte Carlo evaluation by sampling pairs of samples uniformly:

$$\widehat{DTW} := \mathbb{E}_{x \sim U(\mathcal{D}_{\mathrm{train}}),\, x' \sim U(\mathcal{S})} \big[DTW(x, x')\big].$$

For each diffusion model and each dataset, we compute this mean on 1,000 such random pairs of samples.
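For reference, a minimal sketch of this Monte Carlo estimate (our illustration; dtw_distance below is the classic quadratic dynamic program, standing in for any DTW implementation):

```python
import numpy as np

def dtw_distance(x, y):
    """Classic O(n*m) dynamic-programming DTW between two (multivariate) series."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def mc_dtw(train, samples, n_pairs=1000, seed=0):
    """Monte Carlo estimate of the mean pairwise DTW between two sets of series."""
    rng = np.random.default_rng(seed)
    i = rng.integers(len(train), size=n_pairs)    # uniform draws from D_train
    j = rng.integers(len(samples), size=n_pairs)  # uniform draws from S
    return float(np.mean([dtw_distance(train[a], samples[b]) for a, b in zip(i, j)]))

# Example with two small synthetic sets of series (50 and 40 series, 30 steps, 2 features)
rng = np.random.default_rng(0)
train, samples = rng.normal(size=(50, 30, 2)), rng.normal(size=(40, 30, 2))
print(mc_dtw(train, samples, n_pairs=100))
```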
We report the results in Table 4 (computed as mean ± 2 standard errors).

Table 4. Comparison of frequency and time diffusion $\widehat{DTW}$ across the different datasets.

Dataset | Frequency Diffusion $\widehat{DTW}$ (↓) | Time Diffusion $\widehat{DTW}$ (↓)
ECG | 1.166 ± 0.034 | 1.179 ± 0.036
MIMIC-III | 35.828 ± 0.578 | 36.136 ± 0.76
NASA-Charge | 8.203 ± 1.04 | 11.556 ± 0.954
NASA-Discharge | 138.993 ± 6.71 | 175.405 ± 8.656
NASDAQ-2019 | 1857.703 ± 254.372 | 2236.094 ± 262.882
US-Droughts | 274.9 ± 5.206 | 399.342 ± 12.794

These results lead to the same conclusions as the ones stated in Section 4: frequency diffusion models outperform time diffusion models. We note that these metrics are considerably more computationally expensive than the Wasserstein metrics. Indeed, the computation of these metrics typically takes several minutes, while the sliced Wasserstein distances can be computed in a few seconds.

D. Alternative Backbone

LSTM Models. For each dataset, we try an alternative parametrization of the time score model $s_\theta$ and the frequency score model $\tilde{s}_\theta$ as LSTM encoders with 10 layers, each with dimension $d_{\mathrm{model}} = 72$. Both models encode the diffusion time t through random Fourier features composed with a learnable dense layer (a sketch of this encoding is given at the end of this appendix). This results in models with 427k parameters. The data is noised by using a VP-SDE, as in (Song et al., 2020). The score models are trained with the denoising score-matching loss, as defined in Section 3. All the models are trained for 200 epochs with batch size 64, the AdamW optimizer, and cosine learning rate scheduling (20 warmup epochs, $lr_{\max} = 10^{-3}$). The selected model achieves the lowest validation loss.

Sliced Wasserstein Distances. In Figure 8, we show the distribution of the sliced Wasserstein distances over all slices for the LSTM models. In addition, we have included the average sliced Wasserstein distances obtained with the 2 baselines defined in Appendix B.4. As observed for the transformer models, both the time and frequency diffusion models substantially outperform the mean baseline (as expected) and perform on par with the self baseline. This indicates that the models learned a good approximation of the real distribution. Furthermore, we notice that the frequency diffusion models tend to have smaller quantiles than the time diffusion models. This confirms that frequency diffusion models outperform the time diffusion models, as observed for the transformer models in Section 4. We note that the Nasa-Charge and NASDAQ-2019 datasets are absent from Figure 8: we did not manage to obtain diffusion models performing better than the mean baseline for these datasets, which would have led to uninformative comparisons between poorly performing models.

Figure 8. Sliced Wasserstein distances of time and frequency LSTM models. Each panel corresponds to one dataset ((b) MIMIC-III, (c) NASA-Discharge, (d) US-Droughts) and compares the time diffusion model, the frequency diffusion model, and the mean and self baselines in both the time and frequency metric domains.

Other Attempts. In order to minimize the inductive bias in our models, we also tried to train diffusion models with simple feed-forward neural networks. Unfortunately, this attempt was unsuccessful and resulted in models performing worse than the mean baseline in each case. This emphasizes the value of incorporating inductive biases in time series diffusion models.
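For concreteness, here is a minimal PyTorch sketch of the random-Fourier-feature encoding of the diffusion time mentioned above (our illustration; the number of random features and the frequency scale below are illustrative choices, not the exact values used in our experiments):

```python
import torch
import torch.nn as nn

class TimeEmbedding(nn.Module):
    """Random Fourier features of the diffusion time t, followed by a dense layer."""

    def __init__(self, n_features=32, d_model=72, scale=10.0):
        super().__init__()
        # Fixed random frequencies (not trained), as in random Fourier features
        self.register_buffer("freqs", torch.randn(n_features) * scale)
        self.dense = nn.Linear(2 * n_features, d_model)

    def forward(self, t):
        # t: tensor of shape (batch,) containing diffusion times in [0, T]
        angles = 2 * torch.pi * t[:, None] * self.freqs[None, :]
        feats = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
        return self.dense(feats)  # shape (batch, d_model)

emb = TimeEmbedding()
print(emb(torch.rand(4)).shape)  # torch.Size([4, 72])
```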
E. Comparison with TimeGAN

We compare the performance of our diffusion models with TimeGAN models (Yoon et al., 2019), which are Generative Adversarial Networks specifically designed for time series data. Following Section 4.1 of our paper, we report the sliced Wasserstein distances of the samples generated by TimeGAN models (trained with the hyperparameters used by the authors) and compare them to our diffusion models. We report the distances in Table 5 as mean ± 2 standard errors.

Table 5. Comparison with TimeGAN.

Dataset | TimeGAN Wasserstein (↓) | Frequency Diffusion Wasserstein (↓) | Time Diffusion Wasserstein (↓)
ECG | 0.72 ± 0.031 | 0.015 ± 0.000 | 0.021 ± 0.000
MIMIC-III | 0.88 ± 0.0071 | 0.152 ± 0.004 | 0.211 ± 0.006
NASDAQ-2019 | 89 ± 4.25 | 43.602 ± 2.044 | 60.512 ± 2.960
Nasa-Charge | 1.9 ± 0.087 | 0.229 ± 0.008 | 0.316 ± 0.008
Nasa-Discharge | 10.0 ± 0.47 | 2.028 ± 0.082 | 2.942 ± 0.134
US-Droughts | 23.0 ± 1.0 | 0.738 ± 0.020 | 2.913 ± 0.092

As we can observe, the diffusion models significantly outperform the GANs, even after performing a grid search for TimeGAN on a few datasets. The differences can therefore be attributed to the type of generative model used (i.e., diffusion vs GAN) rather than to its architecture. This further reinforces our motivation to focus on diffusion models, which seem to offer better performances.

F. Forecasting Use Case

The diffusion models we study in this work approximate a joint distribution $P(x_0, \dots, x_{N-1})$. We note that modelling the joint distribution makes it possible to derive other quantities, such as conditionals. For example, given a partition $I \sqcup J = [N]$, we can infer $P(x_I \mid x_J)$ from the joint distribution, where $x_I$ denotes the restriction of $x \in \mathbb{R}^{d_X}$ to the indices in the set I. We can then use the inferred conditionals for practical tasks, such as forecasting. Motivated by this insight, we conducted an additional experiment to illustrate the use of our frequency diffusion models for forecasting, in a data augmentation setting. For each of the datasets investigated in our manuscript, we use the synthetic time series $\mathcal{S}$ generated by the diffusion model trained in the frequency domain and augment the training set $\mathcal{D}_{\mathrm{train}}$ with $\mathcal{S}$ to obtain the augmented dataset $\mathcal{D}_{\mathrm{aug}} = \mathcal{D}_{\mathrm{train}} \cup \mathcal{S}$. The forecasting task requires the definition of a forecast window, which is the number of time steps for which we would like to predict the values of the time series. Let N be the sequence length of the samples in $\mathcal{D}_{\mathrm{train}}$, M be the number of features, and W be the size of the forecast window. For the ECG dataset, we let $W = N/2$ (because the electrocardiograms are zero-padded in the last time steps, which makes the forecasting task too easy if we use a small W). For the other datasets, we let $W = N/4$. We split every time series x of sequence length N in the training set into an observable part $x_{[N-W]}$ (used as input to our forecasting model) and a target $x_{[N] \setminus [N-W]}$ (to be predicted by the forecasting model). We train a forecasting model with an LSTM backbone, with 2 layers and a hidden dimension of 256. The last hidden state of the LSTM is used as input to a linear layer with output size $W \cdot M$, whose output is reshaped to obtain time series of dimension (W, M). The forecasting models are trained for 50 epochs, and we use early stopping based on a validation set, with a train/validation set ratio of 0.8.
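For illustration, the forecasting backbone described above can be sketched as follows (a PyTorch sketch with the layer sizes from the text; all other details are illustrative choices rather than our exact training setup):

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """Forecasts the last W steps of a series from its first N - W steps."""

    def __init__(self, n_features, horizon, hidden=256, layers=2):
        super().__init__()
        self.horizon = horizon
        self.lstm = nn.LSTM(n_features, hidden, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, horizon * n_features)

    def forward(self, x_obs):
        # x_obs: (batch, N - W, M), the observable part of the series
        _, (h, _) = self.lstm(x_obs)
        out = self.head(h[-1])                               # last hidden state -> W * M values
        return out.view(-1, self.horizon, x_obs.shape[-1])   # reshape to (batch, W, M)

# Splitting a batch of series of length N into observable part and target
N, W, M = 128, 32, 4                                         # W = N / 4, as for most datasets
x = torch.randn(8, N, M)
x_obs, target = x[:, : N - W], x[:, N - W :]
pred = LSTMForecaster(n_features=M, horizon=W)(x_obs)
loss = nn.L1Loss()(pred, target)                             # MAE, matching the test metric
print(loss.item())
```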
We then evaluate the forecasting performance by computing the test Mean Absolute Error (MAE), the test set being held-out time series from the original datasets that are not observed by the diffusion model. We report the results in Table 6 (mean ± 2 standard errors over 5 seeds). The results show the potential of diffusion-based generative modelling in a data augmentation scenario, as augmenting $\mathcal{D}_{\mathrm{train}}$ with time series generated by the diffusion model improves forecasting performance for half of the datasets.

Table 6. Forecasting performance for different datasets when training on Dtrain and Daug.

Dataset | Train on Dtrain: Test MAE (↓) | Train on Daug: Test MAE (↓)
ECG | 0.046 ± 0.0039 | 0.035 ± 0.0009
MIMIC-III | 0.186 ± 0.001 | 0.179 ± 0.001
NASDAQ-2019 | 3.058 ± 0.65 | 2.614 ± 0.20
Nasa-Charge | 0.019 ± 0.001 | 0.033 ± 0.002
Nasa-Discharge | 0.21 ± 0.02 | 0.28 ± 0.02
US-Droughts | 0.60 ± 0.02 | 0.76 ± 0.01

G. Sample Visualization

In Figures 9 to 13, we visualize a few examples generated by each diffusion model, along with ground-truth training examples. We do not include samples from the MIMIC-III dataset, in accordance with the dataset licence. We observe that the frequency diffusion models generate samples that are substantially less noisy than the ones generated by the time diffusion models. All the generated samples resemble training samples, with the only exception of the NASDAQ-2019 dataset, where the models appear to struggle with the high correlation between the different features.

Figure 9. Samples for the ECG dataset (training samples, samples generated by the frequency domain model, and samples generated by the time domain model). The y axis corresponds to the different features, while the x axis corresponds to the different time steps.

Figure 10. Samples for the NASA-Charge dataset (training samples, samples generated by the frequency domain model, and samples generated by the time domain model). The y axis corresponds to the different features, while the x axis corresponds to the different time steps.
Figure 11. Samples for the NASA-Discharge dataset (training samples, samples generated by the frequency domain model, and samples generated by the time domain model). The y axis corresponds to the different features, while the x axis corresponds to the different time steps.

Figure 12. Samples for the NASDAQ-2019 dataset (training samples, samples generated by the frequency domain model, and samples generated by the time domain model). The y axis corresponds to the different features, while the x axis corresponds to the different time steps.

Figure 13. Samples for the US-Droughts dataset, represented as heatmaps. The y axis corresponds to the different features, while the x axis corresponds to the different time steps.