Published as a conference paper at ICLR 2025

HEAVY-TAILED DIFFUSION WITH DENOISING LÉVY PROBABILISTIC MODELS

Dario Shariatian1, Umut Simsekli1, Alain Durmus2
1INRIA - Department of Computer Science, PSL Research University, Paris, France
2École Polytechnique - CMAP, IP Paris, Palaiseau, France
{dario.shariatian, umut.simsekli}@inria.fr, alain.durmus@polytechnique.edu

ABSTRACT

Exploring noise distributions beyond Gaussian in diffusion models remains an open challenge. While Gaussian-based models succeed within a unified SDE framework, recent studies suggest that heavy-tailed noise distributions, like α-stable distributions, may better handle mode collapse and effectively manage datasets exhibiting class imbalance, heavy tails, or prominent outliers. Recently, Yoon et al. (NeurIPS 2023) presented the Lévy-Itô model (LIM), directly extending the SDE-based framework to a class of heavy-tailed SDEs, where the injected noise follows an α-stable distribution, a rich class of heavy-tailed distributions. However, the LIM framework relies on highly involved mathematical techniques with limited flexibility, potentially hindering broader adoption and further development. In this study, instead of starting from the SDE formulation, we extend the denoising diffusion probabilistic model (DDPM) by replacing the Gaussian noise with α-stable noise. Using only elementary proof techniques, the proposed approach, the Denoising Lévy Probabilistic Model (DLPM), boils down to vanilla DDPM with minor modifications. As opposed to the Gaussian case, DLPM and LIM yield different training algorithms and different backward processes, leading to distinct sampling algorithms. These fundamental differences translate favorably for DLPM as compared to LIM: our experiments show improvements in coverage of data distribution tails, better robustness to unbalanced datasets, and improved computation times requiring a smaller number of backward steps.

1 INTRODUCTION

The evolution of generative models has introduced several approaches, with diffusion models emerging as one of the most prominent. These models transform a data distribution into a Gaussian distribution via a forward noising process and then learn to reverse it. The foundational work on denoising in this context was presented by Sohl-Dickstein et al. (2015), where the goal is to reverse a Markov chain that progressively adds Gaussian noise to the data. This framework culminated in denoising diffusion probabilistic models (DDPM) by Ho et al. (2020), which demonstrated state-of-the-art performance in image generation while drawing connections to score matching techniques (Song & Ermon (2020)). A unified theoretical framework, based on stochastic differential equations (SDEs), further integrated score matching and the denoising framework (Song et al. (2021)). Various generative models build on this framework, improving its performance (Dhariwal & Nichol (2021a); Karras et al. (2022)). Despite their success, diffusion models exhibit limitations, such as requiring a large number of steps (Ho et al. (2020)) and empirically struggling with imbalanced datasets (Zhang et al. (2024); Yoon et al. (2023)), often leading to mode collapse (Dhariwal & Nichol (2021a); Deasy et al. (2022)). Tackling heavy-tailed datasets also presents a challenge for diffusion models, as finite-time diffusion processes generate finite-variance data distributions, which are ill-suited for modeling heavy-tailed data.
Such datasets, like financial datasets, could benefit from extensions beyond the Gaussian setting, as demonstrated in Borak et al. (2005). Moreover, although many techniques exist to improve quality at the expense of diversity (Dhariwal & Nichol (2021b); Song et al. (2023)) using only Gaussian noise, these methods typically require highly nontrivial setups in order to obtain improved diversity while maintaining a reasonable quality (e.g., FID score) (Sadat et al. (2024); Nobis et al. (2024)).

Several approaches have been explored to address the limitations of diffusion models, particularly through the use of non-Gaussian, heavy-tailed noise distributions. The motivation behind this is that heavy-tailed distributions, which can take on larger values, may reduce the number of sampling steps and better capture multimodal data distributions by identifying isolated modes through large noise injections (Yoon et al. (2023)). These distributions are also capable of modeling extreme events or rare occurrences in the tails, making them suitable for tasks such as audio generation (Chen et al. (2020); Kong et al. (2021)), where rare but important variations in amplitude or pitch (e.g., prosodies) can enhance sample quality and diversity. An early attempt at using heavy-tailed noise distributions was made by Nachmani et al. (2021), who replaced Gaussian noise with Gamma-distributed noise, reducing the number of diffusion steps and improving data diversity. Similarly, Deasy et al. (2022) employed generalized Gaussian distributions in a score-matching formulation to improve robustness on unbalanced datasets. However, despite these promising directions, the performance boosts are limited, since both approaches fail to provide a time-reversal formula, resorting either to annealed Langevin dynamics when the score is available, or to heuristics based on the Gaussian approach.

Very recently, partially inspired by Simsekli (2017); Huang et al. (2020b), who employ heavy-tailed SDEs in Monte Carlo sampling for challenging distributions, Yoon et al. (2023) extended the SDE framework by replacing the light-tailed Brownian motion with a heavy-tailed driving process, introducing the Lévy-Itô model (LIM) and improving performance on image data, particularly for unbalanced datasets, with gains in metrics like FID and diversity. The tail index α controls the degree of tail heaviness, enabling tunable performance based on the characteristics of the data. When α = 2 the noising process reduces to Brownian motion (light-tailed); however, whenever α < 2, the process becomes heavy-tailed with infinite variance (notably, heavier-tailed than the distributions explored by Nachmani et al. (2021) and Deasy et al. (2022)). While injecting heavy-tailed, infinite-variance noise might seem natural when the data distribution itself is heavy-tailed, Yoon et al. (2023) further illustrated that heavy-tailed noise can also be beneficial when sampling from compactly supported data distributions (e.g., generating images), especially in the presence of class imbalance (the large jumps introduced by the heavy tails can help find weakly represented modes). Perhaps counter-intuitively, it has been shown that a heavy-tailed process can indeed converge to a light-tailed distribution with appropriate care (Simsekli (2017); Simsekli et al. (2020); Huang et al. (2020a)), making such processes suitable for sampling from a broad range of data distributions.
Motivation. While LIM demonstrates promising results, the technical complexity of the time-reversed SDE presents significant challenges. The Lévy process with α < 2 has discontinuous paths and no variance, preventing the use of standard analysis tools. The proof techniques rely on fractional calculus and estimates for pseudo-differential operators, which might not be accessible to the broader community of diffusion-based generative modeling. While the theory is elegant, we argue that the highly technical nature of LIM, which originates from the use of continuous-time processes, might hinder its development. For instance, it is highly non-trivial to use arbitrary noise schedules. Moreover, the loss function used in LIM presents some shortcomings: the theory requires a squared ℓ2 loss, assuming the loss remains finite for the considered neural network. However, this may not always hold, as the noise term is heavy-tailed and admits no variance, potentially leading to infinite loss values. As a result, LIM experiments must revert to an ℓ1 loss for stable training, suggesting that the original loss function may indeed be unworkable. Additionally, LIM is constrained to isotropic noise, further limiting its flexibility.

To overcome these issues, we propose a simpler yet effective alternative for incorporating heavy-tailed distributions, focusing on a discrete-time framework that leverages more elementary mathematical tools while maintaining performance improvements over Gaussian-based models.

[Figure 1: Illustration of available methods. DDPM and the score-based SDE framework are unified in the Gaussian case; replacing the Gaussian noise with α-stable noise yields DLPM (this study) from DDPM and LIM (Yoon et al. (2023)) from the SDE framework; DLPM and LIM are not unified.]

Contributions. As opposed to LIM, which extended the SDE-based framework, here we take a step back and directly work on the discrete-time DDPM process, replacing the Gaussian noise with α-stable noise. More precisely, we propose the following Markov process as the noising process:

X_t = γ_t X_{t−1} + σ_t ε_t^{(α)} ,    (1)

where ε_t^{(α)} follows a multivariate α-stable distribution. When α = 2, the process recovers the standard Gaussian DDPM, but for α < 2, it introduces heavy-tailed noise with infinite variance. A comparison between DLPM and LIM is provided in Figure 1. Our contributions are as follows.

Simplified mathematical framework. Leveraging a property of stable distributions (cf. Theorem 1), we decompose ε_t^{(α)} into a product of a one-dimensional random variable and a Gaussian vector. This transformation reduces the forward process to a Gaussian one with modified scaling, allowing us to approximate the reverse process using only elementary tools. We call the resulting generative model the Denoising Lévy Probabilistic Model (DLPM).

Extension to deterministic sampling. Building on the DDIM framework (Song et al. (2020)), we introduce a deterministic sampler for DLPM, termed DLIM, which further reduces the number of sampling iterations, boosting efficiency.

Compatibility with existing methods. DLPM maintains compatibility with existing DDPM implementations, requiring only minor modifications, making it a practical and flexible alternative to LIM. This simplicity extends, for instance, to noise schedules, which are difficult to manipulate in the continuous-time LIM framework.

Distinct algorithms from LIM. Unlike the Gaussian case, where DDPM and the score-based SDE formulation are two sides of the same coin (Song et al. (2021)), we show that DLPM and LIM result in distinct training algorithms and backward processes.
Importantly, as these heavy-tailed distributions admit no variance, we cannot simply use the squared ℓ2 loss, a problem that DLPM carefully addresses.

Improved performances. Thanks to our conditionally Gaussian representation strategy, our networks only model conditional densities given the heavy-tailed variables. Hence the network is not learning a heavy-tailed distribution directly, but a light-tailed conditional distribution, intuitively an easier task. These differences work in favor of DLPM across several performance aspects, particularly where heavy-tailed noise injections are already known to offer advantages. Our experiments show that DLPM provides (i) better coverage of the tails of the data distribution, (ii) improved generation on unbalanced datasets, and (iii) faster computation times, requiring fewer backward steps.

2 BACKGROUND ON α-STABLE DISTRIBUTIONS

The family of α-stable distributions appears as the limiting distribution in the generalized central limit theorem (Gnedenko & Kolmogorov (1968)). In the one-dimensional case, an α-stable random variable X is defined through its characteristic function (Samorodnitsky et al. (1996)): for u ∈ R,

E[e^{iuX}] = exp{iuµ − |σu|^α (1 − iφβ sgn(u))} , where φ = tan(πα/2) if α ≠ 1, and φ = −(2/π) log|σu| otherwise.

Here, (i) µ ∈ R is the location parameter, (ii) α ∈ (0, 2] is the tail index, (iii) σ > 0 is the scale parameter, (iv) β ∈ [−1, 1] determines the right- or left-skewness, and sgn is the sign function. We denote the α-stable distribution by S_{α,β}(µ, σ). In the case where α < 1 and β = 1, the support of the distribution becomes the positive real line (i.e., the random variable is positive), hence we call this distribution positive stable. On the other hand, in the case where β = 0, the distribution S_{α,0}(µ, σ) is symmetric around µ and is denoted by S_α(µ, σ). Furthermore, in the case α = 2, the distribution reduces to a Gaussian, S_α(µ, σ) = N(µ, 2σ²), hence it is light-tailed. However, whenever α < 2, S_α(µ, σ) has heavy tails, i.e., the decay rate of its tail distribution satisfies P(|X| > r) ∝ r^{−α} as r → ∞ (see (Nolan, 2020, Theorem 1.2)). This implies that E[|X|^p] < ∞ if and only if p < α < 2.

As opposed to Gaussians, there are multiple ways of extending α-stable distributions to the multivariate setting. In this paper, we will be interested in two major cases: (i) the isotropic (also called rotationally invariant) case, which is the noise distribution used in LIM (note that our framework allows for different types of α-stable distributions by following a single mathematical recipe), and (ii) the non-isotropic case with independent components. These distributions are also defined through their respective characteristic functions. The random variable X ∈ R^d is isotropic α-stable if its characteristic function is given by: for all u ∈ R^d,

E[exp(i⟨u, X⟩)] = exp(i⟨µ, u⟩ − σ^α ‖u‖^α) ,

where µ ∈ R^d is the location parameter and σI_d plays the role of a covariance matrix (in this isotropic case the components are not independent, even though the covariance matrix is diagonal). We denote it by X ∼ S^i_α(µ, σI_d). Similarly, X follows the non-isotropic α-stable distribution S^n_α(µ, σI_d) if, for any u ∈ R^d,

E[exp(i⟨u, X⟩)] = exp(i⟨µ, u⟩ − σ^α ∑_{i=1}^d |u_i|^α) .

While both of these distributions share similar characteristics, such as having power-law tails with the same exponent, the components of the isotropic case are dependent, which results in a significant difference compared to the non-isotropic case, whose coordinates are independent. When α = 2, both options coincide with a multivariate Gaussian.

The following property of stable distributions will form the backbone of our algorithm.

Theorem 1 (See (Samorodnitsky et al., 1996, Equation 2.5.3)). Let α < 2, and let X ∼ S^i_α(µ, σI_d). Then,

X =_d µ + σ A^{1/2} G ,

where =_d denotes equality in distribution, A ∼ S_{α/2,1}(0, c_A) is a one-dimensional positive stable random variable with c_A := cos^{2/α}(πα/4), and G ∼ N(0, I_d).

This theorem shows that a zero-mean, unit-scale isotropic stable random vector can be equivalently written as the product of a one-dimensional positive stable random variable and a standard Gaussian random vector. This fundamental property will have a significant impact in terms of incorporating α-stable noise in DDPMs in a simple way since, conditioned on A, the distribution of X is just Gaussian. We conclude this section by noting that a similar decomposition holds for the non-isotropic case: if X ∼ S^n_α(µ, σI_d), then X =_d µ + σ A^{1/2} ⊙ G, where ⊙ is the component-wise multiplication and A^{1/2} = {A_i^{1/2}}_{i=1}^d ∈ R^d is a vector with i.i.d. components, A_i ∼ S_{α/2,1}(0, c_A).
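To make Theorem 1 concrete, the sketch below samples X = A^{1/2}G with NumPy. It relies on Kanter's classical representation for positive stable variables; the function names are ours, and the factor 2 in front of A is an assumption on the normalization, chosen so that A^{1/2}G has characteristic function exp(−‖u‖^α) and A → 2 as α → 2 (matching the convention S_2(0, 1) = N(0, 2) above); this is a minimal illustration, not the authors' reference implementation.

```python
import numpy as np

def sample_positive_stable(alpha, size, rng):
    # Kanter's representation: S >= 0 with E[exp(-lam * S)] = exp(-lam ** (alpha / 2)).
    rho = alpha / 2.0
    u = rng.uniform(0.0, np.pi, size)
    e = rng.exponential(1.0, size)
    return (np.sin(rho * u) / np.sin(u) ** (1.0 / rho)
            * (np.sin((1.0 - rho) * u) / e) ** ((1.0 - rho) / rho))

def sample_isotropic_stable(alpha, n, d, rng):
    # Theorem 1: X = A**(1/2) * G, with A positive (alpha/2)-stable and G standard Gaussian.
    # The factor 2 is our normalization assumption so that A -> 2 when alpha -> 2.
    a = 2.0 * sample_positive_stable(alpha, (n, 1), rng)
    g = rng.standard_normal((n, d))
    return np.sqrt(a) * g

rng = np.random.default_rng(0)
x = sample_isotropic_stable(1.7, 10_000, 2, rng)  # heavy-tailed isotropic samples
```

At α = 2 the Kanter representation degenerates to S ≡ 1, so A ≡ 2 and X = √2 G, recovering the Gaussian case.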
3 DENOISING LÉVY PROBABILISTIC MODELS

In this section, we develop our algorithm by following a route similar to the development of DDPM: we identify the forward and backward processes associated with (1) and construct a variational approximation of the backward chain for sampling. Here, we focus on isotropic α-stable noise; however, the adaptation to non-isotropic α-stable noise is straightforward, as all our derivations rely on Theorem 1, hence it is omitted.

3.1 MARKOVIAN FORWARD PROCESS

Recall that DLPM is based on the following recursion (restatement of (1)):

X_0 ∼ p_* , and X_t = γ_t X_{t−1} + σ_t ε_t^{(α)} ,    (2)

where p_* is the data distribution and {ε_t^{(α)}}_{t=1}^T are independent and distributed as S^i_α(0, I_d). Thanks to the stability property of α-stable distributions, i.e., the sum of two α-stable random variables is still α-stable (see Appendix B for details), we can explicitly characterize the conditional distribution of X_t given X_0. Setting, for any t ∈ {1, …, T},

γ_{1:t} := ∏_{i=1}^t γ_i , and σ_{1:t} := ( ∑_{i=1}^t (γ_{1:t} σ_i / γ_{1:i})^α )^{1/α} ,

we show in Proposition 4 that:

X_t =_d γ_{1:t} X_0 + σ_{1:t} ε_t , where ε_t ∼ S^i_α(0, I_d) .

Similarly to DDPM, the noising schedule parameters {(γ_t, σ_t)}_{t=1}^T and the horizon T are set so that the final distribution of X_T is approximately equal to S^i_α(0, σ_{1:T} I_d), choosing either (i) γ_{1:t} → 0 as t increases, or (ii) γ_{1:t} = 1 and σ_{1:t} increasing with t. Following the terminology used in DDPM (in the DDPM literature, these noising schedules are referred to as variance preserving and variance exploding; we use the term scale instead of variance here, since the variance does not exist when α < 2), we refer to schedule (i) as scale preserving and schedule (ii) as scale exploding.
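As an illustration of how the schedule constants of Proposition 4 and the resulting forward marginal can be computed, here is a short NumPy sketch (names ours; sample_isotropic_stable is the helper sketched at the end of Section 2):

```python
import numpy as np

def cumulative_schedule(gammas, sigmas, alpha):
    # gamma_{1:t} = prod_{i<=t} gamma_i ;
    # sigma_{1:t} = ( sum_{i<=t} (gamma_{1:t} sigma_i / gamma_{1:i})**alpha )**(1/alpha)
    g1t = np.cumprod(gammas)
    s1t = g1t * np.cumsum((sigmas / g1t) ** alpha) ** (1.0 / alpha)
    return g1t, s1t

def forward_marginal(x0, t, g1t, s1t, alpha, rng):
    # One-shot sample of X_t | X_0 (Proposition 4); t is 1-indexed as in the text.
    n, d = x0.shape
    eps = sample_isotropic_stable(alpha, n, d, rng)
    return g1t[t - 1] * x0 + s1t[t - 1] * eps
```

Thanks to the stability property, noising to any step t costs a single stable draw rather than t recursive steps.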
3.2 GENERATIVE PROCESS

Once the forward process is run for a large enough number of time steps T, it is clear that X_T will be approximately stable-distributed. Hence, to go back to the data distribution p_*, we now need to time-revert the forward process so that the reversed process can take some α-stable noise and generate a sample from p_*, which is our ultimate goal. More precisely, we aim to find a backward process associated with the Markov chain {X_t}_{t=0}^T, i.e., a Markov chain {X̃_t}_{t=0}^T such that the two processes {X̃_{T−t}}_{t=0}^T and {X_t}_{t=0}^T have the same distributions. Since, by (Nolan, 2010, Theorem 1.9), any non-degenerate α-stable distribution has a smooth density with respect to the Lebesgue measure, the Markov chain (2) admits a transition density, denoted by k^{(α)}_{t|t−1}. In addition, X_t admits a density as well, denoted by p^{(α)}_t. Then, it can be easily verified that a Markov process starting from p^{(α)}_T and with transition densities, for t ∈ {1, …, T},

k̄^{(α)}_{t−1|t}(x_{t−1}|x_t) ∝ p^{(α)}_{t−1}(x_{t−1}) k^{(α)}_{t|t−1}(x_t|x_{t−1}) for any x_{t−1}, x_t ,

is a backward process associated with {X_t}_{t=0}^T, where ∝ denotes equality up to a normalization constant. As in the case of DDPM, these backward transition densities are unfortunately intractable; hence, we will develop a variational scheme for their approximation.

3.2.1 APPROXIMATION OF THE BACKWARD TRANSITION DENSITIES IN DDPM

To ease the introduction of our approach, let us first recall the strategy taken in DDPM, which approximates the backward kernels by relying on a variational approximation of

k̄^{(α)}_{0:T}(x_{0:T}) := p^{(α)}_T(x_T) ∏_{t=T}^{1} k̄^{(α)}_{t−1|t}(x_{t−1}|x_t) ,

where α = 2 and x_{0:T} := (x_0, …, x_T) ∈ R^{d×(T+1)}. More precisely, the goal is to find the closest distribution to k̄^{(α)}_{0:T} in a family of distributions {q̄^θ_{0:T} : θ ∈ Θ}, indexed by a parameter θ taking values in some parameter space Θ (typically the weights of a neural network). The variational family is assumed to have the same decomposition as k̄^{(α)}_{0:T}(x_{0:T}), i.e., q̄^θ_{0:T}(x_{0:T}) := q̄^θ_T(x_T) ∏_{t=T}^{1} q̄^θ_{t−1|t}(x_{t−1}|x_t), where q̄^θ_T is chosen to be the density of N(0, σ²_{1:T} I_d) as an approximation of p^{(α)}_T. Then, θ is obtained by minimizing the following objective function (Ho et al., 2020, Equation 5):

∑_{t=2}^T L^D_{t−1}(θ) , with L^D_{t−1}(θ) = E[ KL( k^{(α)}_{t−1|0,t}(·|X_0, X_t) ‖ q̄^θ_{t−1|t}(·|X_t) ) ] ,    (3)

where KL denotes the Kullback-Leibler divergence and k^{(α)}_{t−1|0,t} denotes the conditional density of X_{t−1} given X_0 and X_t. As α = 2 in this case, k^{(α)}_{t−1|0,t} is Gaussian (Ho et al., 2020, Equations 6, 7), motivating the choice of Gaussian densities q̄^θ_{t−1|t}(x_{t−1}|x_t) as elements of the variational family at hand, since one obtains a closed-form formula for the KL terms, i.e.,

q̄^θ_{t−1|t}(x_{t−1}|x_t) = φ_d( x_{t−1} | m̂^θ_{t−1}(x_t), Σ̂^θ_{t−1}(x_t) ) ,    (4)

where (x, m, Σ) ↦ φ_d(x|m, Σ) is the density of the d-dimensional Gaussian distribution with mean m and covariance matrix Σ, and m̂^θ_{t−1}, Σ̂^θ_{t−1} are functions of x_t parameterized by θ. This approach relies on the fact that k^{(α)}_{t−1|0,t} is analytically tractable. Unfortunately, when α < 2, this is no longer the case. We now expose our methodology to address this limitation.

3.2.2 A DATA AUGMENTATION APPROACH

To obtain a tractable objective function for learning a variational approximation of the backward transition densities, we rely on a data augmentation approach, a classical MCMC technique (see Brooks et al. (2011), Chapter 10). Consider the Markov chain

Y_0 ∼ p_* , and for t ∈ {1, …, T}, Y_t = γ_t Y_{t−1} + σ_t A_t^{1/2} G_t ,    (5)

where {G_t}_{t=1}^T and {A_t}_{t=1}^T are independent random variables, distributed according to G_t ∼ N(0, I_d) and A_t ∼ S_{α/2,1}(0, c_A), with c_A = cos^{2/α}(πα/4). From Theorem 1, {Y_t}_{t=0}^T is a Markov chain that admits the same distribution as {X_t}_{t=0}^T. As a result, conditioned on {A_t}_{t=1}^T and Y_0, {Y_t}_{t=1}^T is a Markov chain with Gaussian transition densities:

k^{(α)}_{1:T|0,a}(y_{1:T}|y_0, a_{1:T}) = ∏_{t=1}^T φ_d( y_t | γ_t y_{t−1}, σ_t² a_t I_d ) .
Again, we can explicitly characterize the conditional distribution of Y_t given Y_0 and {A_t}_{t=1}^T. Setting, for any t ∈ {1, …, T},

Σ_{1:t}(a_{1:t}) := ∑_{k=1}^t (γ_{1:t} a_k^{1/2} σ_k / γ_{1:k})² ,

we show in Proposition 5 that:

Y_t =_d γ_{1:t} Y_0 + Σ_{1:t}(A_{1:t})^{1/2} ε_t ,    (6)

where ε_t ∼ N(0, I_d). We propose approximating the backward process associated with {Y_t}_{t=0}^T, given {A_t}_{t=1}^T, by adapting the DDPM approach. This time, for the backward process, we use the conditional density of Y_{t−1} given Y_0, Y_t and A_{1:T}:

k̄^{(α)}_{1:T|0,a}(y_{1:T}|y_0, a_{1:T}) := p^{(α)}_T(y_T) ∏_{t=T}^{2} k^{(α)}_{t−1|0,t,a}(y_{t−1}|y_t, y_0, a_{1:t}) ,

where k^{(α)}_{t−1|0,t,a}(·|y_t, y_0, a_{1:t}) is now the tractable density of a Gaussian distribution:

Proposition 1. The density of the backward process associated with {Y_t}_{t=0}^T given Y_0, {A_t}_{t=1}^T, denoted by k^{(α)}_{t−1|t,0,a}(·|y_t, y_0, a_{1:t}), is the density of a Gaussian distribution N(m̃_{t−1}, Σ̃_{t−1}), with mean m̃_{t−1} and variance Σ̃_{t−1} equal to:

m̃_{t−1}(y_t, y_0, a_{1:t}) = (1/γ_t) ( y_t − Γ_t(a_{1:t}) σ_{1:t} ε_t(y_t, y_0) ) , Σ̃_{t−1}(a_{1:t}) = Γ_t(a_{1:t}) Σ_{1:t−1}(a_{1:t−1}) ,

where Σ_{1:t} is as in (6), Γ_t = 1 − γ_t² Σ_{1:t−1}/Σ_{1:t}, and ε_t(y_t, y_0) = (y_t − γ_{1:t} y_0)/σ_{1:t}.

See Appendix C.3 for the derivations required for Proposition 1. For our generative model, we reconsider the family of Gaussian variational approximations introduced in (4), modified to account for an i.i.d. sequence {A_t}_{t=1}^T:

q̄^θ_{0:T}(y_{0:T}) := ∫ q̄^θ_T(y_T) ∏_{t=T}^{1} q̄^θ_{t−1|t,a}(y_{t−1}|y_t, a_{1:t}) ψ^{(α)}_{1:T}(a_{1:T}) da_{1:T} ,

where ψ^{(α)}_{1:T}(a_{1:T}) = ∏_{t=1}^T ψ^{(α)}(a_t), ψ^{(α)} denotes the density of S_{α/2,1}(0, c_A), and

q̄^θ_{t−1|t,a}(y_{t−1}|y_t, a_{1:t}) = φ_d( y_{t−1} | m̂^θ_{t−1}(y_t, a_{1:t}), Σ̂^θ_{t−1}(y_t, a_{1:t}) ) .

3.3 VARIATIONAL INFERENCE OBJECTIVE

We consider the following loss function:

L^L(θ) := ∑_{t=2}^T E[ (L^L_{t−1}(θ, A_{1:t}))^r ] , where    (7)

L^L_{t−1}(θ, A_{1:t}) := E[ KL( k^{(α)}_{t−1|t,0,a}(·|Y_t, Y_0, A_{1:t}) ‖ q̄^θ_{t−1|t,a}(·|Y_t, A_{1:t}) ) | A_{1:t} ] ,

r > 0, and k^{(α)}_{t−1|0,t,a} denotes the conditional density of Y_{t−1} given Y_0, Y_t and A_{1:T}. In order to ensure that the expectations with respect to A_{1:T} are finite, we need to choose r < α/2 when α ∈ (1, 2). For simplicity, in the rest of the paper, we will use r = 1/2.

A comparison with DDPM's loss function L^D (cf. (3)) immediately illustrates that, thanks to Theorem 1, L^L is almost identical to L^D, up to taking expectations with respect to one-dimensional random variables and taking the square root of the summands (note that, with the choices α = 2 and r = 1, we exactly recover DDPM). We show in Appendix C.4.3 that, as in DDPM, our loss is obtained from a KL minimization principle, serving as an upper bound to

E[ KL( k^{(α)}_{0|a}(·|A_{1:T}) ‖ q̄^θ_{0|a}(·|A_{1:T}) )^r ] ,

when r ∈ (0, 1]. This additionally shows that any zero of the loss corresponds to a perfect generative model, while maintaining a similar objective function. The crucial property of (7) is that, since both k^{(α)}_{t−1|t,0,a} and q̄^θ_{t−1|t,a} are Gaussian (thanks to the conditioning), the KL term admits a closed-form analytical formula, as in the case of DDPM. From a practical perspective, (7) suggests that we can use the same software architecture as for DDPM, with a slight modification to compute the outer expectation, which can simply be estimated by a Monte Carlo or median-of-means procedure (Lugosi & Mendelson (2019)).
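Concretely, the Gaussian quantities entering these closed-form KL terms are cheap to evaluate; the following sketch transcribes Proposition 1 (helper and variable names are ours; arrays are 0-indexed with entry t − 1 holding the step-t quantity, Sigma1t stores Σ_{1:t}(a_{1:t}) for the realized a's, and t ≥ 2 is assumed):

```python
import numpy as np

def posterior_params(y_t, y_0, t, gammas, g1t, s1t, Sigma1t):
    # Gaussian backward kernel of Proposition 1 for realized a_{1:t}.
    eps_t = (y_t - g1t[t - 1] * y_0) / s1t[t - 1]                  # eps_t(y_t, y_0)
    Gamma_t = 1.0 - gammas[t - 1] ** 2 * Sigma1t[t - 2] / Sigma1t[t - 1]
    mean = (y_t - Gamma_t * s1t[t - 1] * eps_t) / gammas[t - 1]    # tilde m_{t-1}
    var = Gamma_t * Sigma1t[t - 2]                                 # tilde Sigma_{t-1}; covariance = var * I_d
    return mean, var
```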
In order to obtain a final denoising training loss, we make three design choices, for which full details are given in Appendix C.4. They are similar to what is classically done in diffusion models (Ho et al. (2020); Nichol & Dhariwal (2021); Karras et al. (2022)):

D1. We set a fixed variance Σ̂^θ_t = Σ̃_t for the reverse process.

D2. We reparameterize the model to predict the value of ε_t(y_t, y_0) with a network ε̂^θ_t, setting

m̂^θ_{t−1}(Y_t, A_{1:t}) = (1/γ_t) ( Y_t − σ_{1:t} Γ_t(A_{1:t}) ε̂^θ_t(Y_t, A_{1:t}) ) .

Moreover, we drop the dependency of ε̂^θ_t on {A_t}_{t=1}^T, making ε̂^θ_t only a function of Y_t. This enables re-using classical diffusion model network architectures.

D3. Assuming D1, D2, we obtain

L^L_{t−1}(θ) = E[ ( λ²_{t,A_{1:t}} E[ ‖ε̂^θ_t(Y_t) − ε_t(Y_t, Y_0)‖² | A_{1:t} ] )^{1/2} ] ,

where λ_{t,a_{1:t}} = Γ_t(a_{1:t}) σ_{1:t} / (γ_t (2 Σ̃_{t−1}(a_{1:t−1}))^{1/2}) and ε_t(y_t, y_0) = (y_t − γ_{1:t} y_0)/σ_{1:t}. We then fix λ_{t,a_{1:t}} = 1.

Proposition 2 (Simplified denoising loss). With the design choices D1, D2, D3, we obtain the simplified denoising objective function:

L_Simple(θ) = ∑_{t=1}^T E[ E[ ‖ε̂^θ_t(Y_t) − ε_t(Y_t, Y_0)‖² | A_{1:t} ]^{1/2} ] ,    (8)

where the model ε̂^θ_t is designed to fit the noise ε_t(Y_t, Y_0) = (Y_t − γ_{1:t} Y_0)/σ_{1:t} added at time step t, and γ_{1:t}, σ_{1:t} are as given in Proposition 4. Thus the model is not learning a heavy-tailed distribution directly, but the light-tailed conditional distribution.

See Appendix C.4.2 for the derivations required for Proposition 2. Finally, in Appendix C.5, we show that, under some conditions satisfied under design choices D1, D2, D3, the expectation of each term in L^L can be rewritten as an expectation with respect to only one univariate random variable (as opposed to t variables, i.e., A_{1:t}), reducing the additional computational burden of accommodating heavy tails. As we estimate the expectations by Monte Carlo averaging, reducing the number of random variables in the expectation is equally important to reduce the estimation error. The resulting loss is given in Proposition 9.

3.4 DETERMINISTIC SAMPLING WITH DLIM

Using the same techniques as in denoising diffusion implicit models (DDIM, Song et al. (2020)), we can recover a deterministic sampling scheme. We call this algorithmic extension Denoising Lévy Implicit Models (DLIM), the details of which are given in Appendix D. We recapitulate the resulting sampling and training algorithms in Appendix A. In this context, and in the case of the non-isotropic Cauchy distribution (α = 1), it is possible to bypass the data augmentation technique, since a closed-form KL divergence between two such distributions exists. Full details about this Cauchy DLIM are given in Appendix D.4, but we leave its experimental exploration to future work.

3.5 COMPARING DLPM TO LIM

The objective function (7) is slightly different from the one obtained in the continuous setting of LIM by Yoon et al. (2023). The training equations are very similar, and can be reformulated to involve a denoising loss (see Appendix E.1, and (29)):

L_{t−1} : θ ↦ E[ ‖ε̂^θ_t(X_t) − ε_t(X_t, X_0)‖_p^η ] .

As a refresher, in the case of DDPM, one sets p = 2, η = 2. In the case of DLPM, our discussion leads us to the choice p = 2, η = 1 (see (8)). In the case of LIM, the theory relies on a squared ℓ2 loss, setting p = 2, η = 2, in order to properly derive the loss and effectively approximate the true score of the data at various noise scales. One must therefore make the assumption that L_{t−1} is not infinite for each parameter θ considered, which may not hold since ε_t(X_t, X_0) is α-stable distributed and admits no variance. In the experiments for LIM, one is forced to revert to an ℓ1 loss, by setting η = 1, p = 1, to obtain stable training, potentially indicating that the loss is infinite. Additionally, DLPM and LIM yield different backward processes, which in turn lead to distinct sampling algorithms, cf. Table 4 in the Appendix. Finally, the LIM framework can only accommodate isotropic noise. We refer to Appendix E.1 for a detailed comparison between DLPM and LIM.
4 EXPERIMENTS

After the groundwork of the previous sections, we design experiments to demonstrate the practical strengths of our DLPM approach as compared to LIM, apart from its technical simplicity. We recall that setting α = 2 simply reverts LIM to classical diffusion, and DLPM to DDPM, apart from the square root in the loss function. As specified in (Yoon et al. (2023), Appendix G), LIM relies on gradient and noise clipping, which introduces extra hyperparameters that must be fine-tuned for each dataset. In the experiments, we use these clipping parameters only when specified. The experimental details relative to this section are available in Appendix G.

As our loss function (7) involves an expectation with respect to A_{1:T}, we propose estimating it by using the median-of-means estimator, which is known to have better performance for heavy-tailed distributions (Lugosi & Mendelson (2019)). For an integer M, this approach requires sampling M² many A_{1:T} terms, then splitting them into M groups of size M. To approximate the expectation, we take the sample mean of each group, and finally take the median of the computed M sample means. In our experiments, we explore M = 1 (approximating the expectation with only one sample), denoted simply by DLPM, and M = 5, denoted by DLPM5. Similarly, we denote by DLIM and DLIM5 the corresponding deterministic sampling schemes. Finally, we consider the range 1.5 ≤ α ≤ 2.0. In our experiments on images, we make use of the dataset CIFAR10_LT (long tail), which was introduced in Yoon et al. (2023) as an unbalanced modification of the CIFAR10 dataset.
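For reference, the median-of-means estimator is a few lines of NumPy (a sketch; names ours):

```python
import numpy as np

def median_of_means(values, m):
    # values: m*m i.i.d. loss evaluations; average within m groups of size m, then take the median.
    v = np.asarray(values)[: m * m].reshape(m, m)
    return float(np.median(v.mean(axis=1)))
```

In this notation, DLPM corresponds to m = 1 and DLPM5 to m = 5.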
4.1 DATA COVERAGE AND MODE COLLAPSE IN TWO-DIMENSIONAL DATA

Before progressing to higher-dimensional problems, we start with easily controlled and visualized two-dimensional datasets, in order to validate the competitiveness of our method in the contexts where heavy-tailed diffusions are of interest. In particular, we consider heavy-tailed and unbalanced multi-modal datasets. See Appendix G.1 for details about the experimental setup in these settings.

[Figure 2: DLPM with α = 1.7 and α = 2. The lighter-tailed process fails to capture the distribution's tail.]

Enhancing data coverage: capturing the tail of the distribution. We set d = 2 and generate data points distributed as S^i_α(0, 0.05 I_d), with α = 1.7. Our aim is to test the ability of each method to cover the dataset correctly; the main challenge is to correctly capture the tails. As we can visually observe in Figure 2, the backward diffusion process in the Gaussian case cannot produce truly heavy-tailed data, mainly stemming from the fact that its variance is always finite. As expected, we see improvements when using noise with α < 2. To quantify this behaviour, we utilize the mean square logarithmic error (MSLE) metric, which compares the relative error between the higher quantiles (i.e., the quantiles corresponding to the tail region that has probability 1 − ξ) of the true and the generated data distributions (see Appendix G.3 for detailed definitions). We observe in Table 1 how, as α gets smaller, one gets better tail approximation. Furthermore, DLPM consistently outperforms LIM, indicating that, in this setting, the generation process benefits from the heavy-tailed denoising formulation rather than the continuous-time one.

Table 1: MSLE_{ξ=0.95} (↓), averaged over 20 runs. Figures in parentheses are p-values from Welch's t-test (assuming unequal variances), comparing the mean of DLPM with the given method.

Method | α = 1.5 | α = 1.6 | α = 1.7 | α = 1.8 | α = 1.9 | α = 2.0
DLPM | 0.160 ± 0.128 | 0.081 ± 0.078 | 0.071 ± 0.028 | 0.099 ± 0.044 | 0.132 ± 0.101 | 0.798 ± 0.601
DDPM | - | - | - | - | - | 0.528 ± 0.400 (1.0e-1)
LIM | 0.743 ± 0.290 (1.0e-08) | 0.497 ± 0.311 (8.6e-06) | 0.267 ± 0.077 (1.3e-10) | 0.653 ± 0.413 (8.8e-06) | 2.444 ± 1.067 (7.9e-09) | 1.239 ± 0.240 (5.0e-3)

[Figure 3: Gaussian grid.]

Enhancing data coverage: addressing mode collapse. To assess the robustness of DLPM to mode collapse, we consider an unbalanced mixture of nine Gaussian distributions. We set their standard deviation to 0.05 and arrange them in a grid-like pattern with equal spacing, in the square [−1, 1]². Their mixture weights range from .01 to .3 (the exact mixture weights are {.01, .02, .02, .05, .1, .15, .15, .2, .3}). We use the F^pr_1 score (i.e., the harmonic mean of precision and recall, see Appendix G.3) to assess, in a single summary statistic, the quality and diversity of the generated data. As shown in Table 2, we are able to achieve improved scores by choosing a tail index α < 2 with DLPM. This is not necessarily the case for LIM, which is consistently outperformed. Finally, DLPM5 shows its strengths with better performance over the whole range of α, though at the cost of 5 times the run time.

Table 2: F^pr_1 score (↑), averaged over 30 runs. Figures in parentheses are p-values from Welch's t-test (assuming unequal variances), comparing the mean of DLPM with the given method.

Method | α = 1.5 | α = 1.6 | α = 1.7 | α = 1.8 | α = 1.9 | α = 2.0
DLPM | 0.933 ± 0.018 | 0.923 ± 0.005 | 0.933 ± 0.028 | 0.923 ± 0.024 | 0.907 ± 0.034 | 0.862 ± 0.028
DLPM5 | 0.944 ± 0.013 (9.0e-3) | 0.943 ± 0.021 (1.6e-05) | 0.943 ± 0.010 (7.4e-2) | 0.941 ± 0.014 (9.0e-4) | 0.928 ± 0.016 (3.9e-3) | -
LIM | 0.842 ± 0.039 (1.7e-14) | 0.850 ± 0.046 (1.3e-09) | 0.868 ± 0.034 (5.7e-11) | 0.874 ± 0.030 (3.9e-09) | 0.884 ± 0.017 (1.9e-3) | 0.874 ± 0.027 (9.6e-2)
DDPM | - | - | - | - | - | 0.867 ± 0.029 (5.0e-1)

4.2 EXPERIMENTS ON IMAGE DATA

To fairly illustrate the differences between LIM and DLPM, we use the same improved DDPM neural network architecture, as designed in Nichol & Dhariwal (2021). The specific configuration for each dataset is carefully described in Appendix G.2. Our experiments are designed to compare deterministic and stochastic generation methods under varying conditions. As a visual check, examples of generated images are listed in Appendix G.4.

Convergence speed. Consistent with the existing literature (Song et al. (2020)), our findings, as shown in Figure 4, confirm that deterministic generation significantly outperforms its stochastic counterpart, especially when fewer than 100 diffusion steps are used, on both MNIST and CIFAR10_LT. As the number of diffusion steps increases, both of these sampling methods produce similar results. This observation highlights that the advantages of the diffusion process do not only stem from increased randomness at sampling time. These heavy-tailed processes may define a more appropriate vector field along which the noise is transported back to the original data distribution, which would lead to improved model performance (see Karras et al. (2022) for similar discussions on DDPM vs DDIM).

[Figure 4: FID (↓), α = 1.7.]

The previous observations are quantitatively supported in Table 3, where we present results for both deterministic and stochastic sampling strategies.
Table 3: FID (↓); 1000 sampling steps for LIM and DLPM, 25 sampling steps for LIM-ODE and DLIM.

MNIST | α = 1.5 | α = 1.7 | α = 1.8 | α = 1.9 | α = 2.0
DDPM | - | - | - | - | 3.43
LIM | 14.37 | 11.54 | 11.18 | 13.75 | 11.69
LIM w/ clipping | 4.08 | 5.17 | 6.81 | 11.20 | -
DLPM5 | 3.80 | 3.03 | 2.51 | 2.71 | -
DLPM | 5.39 | 2.94 | 2.93 | 3.24 | 3.63
DDIM | - | - | - | - | 5.16
LIM-ODE | 49.63 | 78.59 | 92.93 | 109.48 | 29.04
LIM-ODE w/ clipping | 45.72 | 68.15 | 85.09 | 113.20 | -
DLIM5 | 3.37 | 2.93 | 3.44 | 4.31 | -
DLIM | 3.38 | 2.81 | 3.18 | 3.27 | 5.18

CIFAR10_LT | α = 1.5 | α = 1.7 | α = 1.8 | α = 1.9 | α = 2.0
DDPM | - | - | - | - | 19.05
LIM | 75.38 | 35.15 | 31.14 | 21.68 | 21.56
LIM w/ clipping | 16.13 | 16.21 | 17.67 | 19.24 | -
DLPM | 16.10 | 18.00 | 19.94 | 20.21 | 21.07
DDIM | - | - | - | - | 23.44
LIM-ODE | 42.07 | 91.64 | 105.95 | 407.79 | 32.00
LIM-ODE w/ clipping | 30.17 | 65.78 | 84.55 | 101.70 | -
DLIM | 20.69 | 20.77 | 21.96 | 22.79 | 23.99

[Figure 5: DLIM and LIM-ODE with a small number of steps (5 steps), on the Gaussian grid of Figure 3.]

We compare both methods on stochastic generation at a high step count, to compare their performance in their best regime, and on deterministic generation at a small step count, to assess the computation/quality tradeoff offered by both methods. As we can see, DLPM surpasses LIM on both datasets. Moreover, these results show that LIM's performance deteriorates significantly when clipping is not used, raising questions about whether the framework of LIM is inherently well-suited for heavy-tailed distributions. More interestingly, we observe in Table 3 that DLPM consistently outperforms LIM and offers satisfying image quality at a low number of steps, both for stochastic and deterministic sampling. Generated images after 25 steps achieve an FID score of 2.81 on MNIST and of 20.69 on CIFAR10_LT, as compared to 45.72 and 30.17 respectively for LIM-ODE with clipping. On MNIST, with α = 1.7, DLIM is able to match the sample quality of DLPM with 40 times fewer diffusion steps, further proving its efficacy. To visualize these behaviours, we display in Figure 5 different generations with varying time horizon T. We can see how the backward process defined by DLIM is able to approach the true data distribution more accurately. Even though lower α usually entails lower FID in this table, LIM-ODE shows worse performance than Gaussian diffusion at 25 reverse steps; since the image quality is still monotonically decreasing with α, except for α = 2, we can conjecture that the initial instability introduced by heavy tails is quickly counterbalanced by their benefits. DLPM5 shows consistent improvement over the baseline, particularly for stochastic sampling. We provide results for non-isotropic generation, and additional metrics on image data, in Appendix G.4, further supporting our claims.

5 CONCLUSION

In this study, we proposed DLPM and DLIM as heavy-tailed generalizations of DDPM and DDIM. Contrary to similar methods, we believe our approach will be more accessible to the community, thanks to its elementary tools. The various experiments conducted suggest DLPM is more effective in leveraging the characteristics of heavy-tailed distributions, providing robust performance across heavy-tailed data and unbalanced datasets, while requiring a lower number of diffusion steps.

ACKNOWLEDGMENT

A.D. is funded by the European Union (ERC, Ocean, 101071601). D.S. and U.S. are partially funded by the European Union (ERC, Dynasty, 101039676).
Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them. U.S. is additionally funded by the French government under the management of Agence Nationale de la Recherche as part of the France 2030 program, reference ANR-23-IACL-0008 (PR[AI]RIE-PSAI). The authors are grateful to the CLEPS infrastructure from the Inria of Paris for providing resources and support.

REFERENCES

Michaël Allouche, Stéphane Girard, and Emmanuel Gobet. EV-GAN: Simulation of extreme events with ReLU neural networks. Journal of Machine Learning Research, 23(150):1–39, 2022. URL https://hal.science/hal-03250663.

Szymon Borak, Wolfgang Karl Härdle, and Rafał Weron. Stable distributions. Statistical Tools for Finance and Insurance, 2005. doi: 10.1007/3-540-27395-6_1.

Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of Markov Chain Monte Carlo. CRC Press, 2011.

Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan. WaveGrad: Estimating gradients for waveform generation, 2020. URL https://arxiv.org/abs/2009.00713.

Frédéric Chyzak and Frank Nielsen. A closed-form formula for the Kullback-Leibler divergence between Cauchy distributions, 2019. URL https://arxiv.org/abs/1905.10965.

Jacob Deasy, Nikola Simidjievski, and Pietro Liò. Heavy-tailed denoising score matching, 2022.

Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis, 2021a.

Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis, 2021b. URL https://arxiv.org/abs/2105.05233.

B. V. Gnedenko and A. N. Kolmogorov. Limit Distributions for Sums of Independent Random Variables. Translated from the Russian, annotated, and revised by K. L. Chung, with appendices by J. L. Doob and P. L. Hsu. Revised edition. Addison-Wesley, Reading, Mass.–London–Don Mills, Ont., 1968.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020.

William Holt and Duy Nguyen. Essential aspects of Bayesian data imputation. SSRN, 2023. doi: 10.2139/ssrn.4494314. URL http://dx.doi.org/10.2139/ssrn.4494314.

Lu-Jing Huang, Mateusz B. Majka, and Jian Wang. Approximation of heavy-tailed distributions via stable-driven SDEs, 2020a. URL https://arxiv.org/abs/2007.02212.

Lu-Jing Huang, Mateusz B. Majka, and Jian Wang. Approximation of heavy-tailed distributions via stable-driven SDEs, 2020b. URL https://arxiv.org/abs/2007.02212.

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models, 2022.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.

Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. DiffWave: A versatile diffusion model for audio synthesis, 2021. URL https://arxiv.org/abs/2009.09761.

Gábor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145–1190, 2019.

Eliya Nachmani, Robin San Roman, and Lior Wolf. Denoising diffusion gamma models, 2021.

Alex Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models, 2021.
Gabriel Nobis, Maximilian Springenberg, Marco Aversa, Michael Detzel, Rembert Daems, Roderick Murray-Smith, Shinichi Nakajima, Sebastian Lapuschkin, Stefano Ermon, Tolga Birdal, Manfred Opper, Christoph Knochenhauer, Luis Oala, and Wojciech Samek. Generative fractional diffusion models, 2024. URL https://arxiv.org/abs/2310.17638.

J. P. Nolan. Stable Distributions - Models for Heavy Tailed Data. Birkhäuser, Boston, 2010. In progress; Chapter 1 online at academic2.american.edu/~jpnolan.

John P. Nolan. Univariate Stable Distributions. Springer Series in Operations Research and Financial Engineering, 2020.

Manuel D. Ortigueira, Taous-Meriem Laleg-Kirati, and J. A. Tenreiro Machado. Riesz potential versus fractional Laplacian. Journal of Statistical Mechanics: Theory and Experiment, 2014(9):P09032, 2014. doi: 10.1088/1742-5468/2014/09/P09032. URL https://dx.doi.org/10.1088/1742-5468/2014/09/P09032.

Seyedmorteza Sadat, Jakob Buhmann, Derek Bradley, Otmar Hilliges, and Romann M. Weber. CADS: Unleashing the diversity of diffusion models through condition-annealed sampling, 2024. URL https://arxiv.org/abs/2310.17347.

Mehdi S. M. Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall, 2018.

Gennady Samorodnitsky, Murad S. Taqqu, and R. W. Linde. Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance. Bulletin of the London Mathematical Society, 28(134):554–555, 1996.

Umut Simsekli. Fractional Langevin Monte Carlo: Exploring Lévy driven stochastic differential equations for Markov chain Monte Carlo. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3200–3209. PMLR, 2017. URL https://proceedings.mlr.press/v70/simsekli17a.html.

Umut Simsekli, Lingjiong Zhu, Yee Whye Teh, and Mert Gurbuzbalaban. Fractional underdamped Langevin dynamics: Retargeting SGD with momentum under heavy-tailed gradient noise. In International Conference on Machine Learning, pp. 8970–8980. PMLR, 2020.

Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics, 2015.

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. CoRR, abs/2010.02502, 2020. URL https://arxiv.org/abs/2010.02502.

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution, 2020.

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations, 2021.

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models, 2023. URL https://arxiv.org/abs/2303.01469.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023.

Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011. doi: 10.1162/NECO_a_00142. URL https://doi.org/10.1162/NECO_a_00142.

Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications, 2024.
Eun Bi Yoon, Keehun Park, Sungwoong Kim, and Sungbin Lim. Score-based generative models with Lévy processes. In Advances in Neural Information Processing Systems, volume 36, pp. 40694–40707. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/8011b23e1dc3f57e1b6211ccad498919-Paper-Conference.pdf.

Tianjiao Zhang, Huangjie Zheng, Jiangchao Yao, Xiangfeng Wang, Mingyuan Zhou, Ya Zhang, and Yanfeng Wang. Long-tailed diffusion models with oriented calibration. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NW2s5XXwXU.

The Appendix is organized as follows. In Appendix A, we provide the pseudo-code for the training and sampling algorithms of DLPM and DLIM. In Appendix B, we characterize the stability property of the α-stable distribution, and we give explicit formulas for the distribution of the sum of two α-stable random variables. In Appendix C, we provide the detailed theory and derivations of the DLPM framework. Working on the process {X_t}_{t=0}^T defined in (2), and its data augmentation counterpart {Y_t}_{t=0}^T defined in (5), we start in Appendix C.2 by characterizing the distribution of X_t given X_0 for a given noise schedule {(γ_t, σ_t)}_{t=1}^T, as this defines the characteristic location γ_{1:t} and scale σ_{1:t} of the process at time t. Likewise, we characterize the distribution of Y_t given Y_0, {A_t}_{t=1}^T, as this defines the characteristic location γ_{1:t} and variance Σ_{1:t} of the augmented process at time t. In Appendix C.3, we focus on the Gaussian trick exploited by the process {Y_t}_{t=0}^T defined in (5). This leads us to an explicit formula for the Gaussian density of the backward process conditioned on a sequence {A_t}_{t=1}^T of α-stable random variables. In Appendix C.4, we further put this characterization to good use by obtaining a closed-form formula for the loss function. There follows a discussion on design choices for the model and the loss function at hand, which leads us to the simplified training loss (8) corresponding to the denoising loss for α-stable diffusion. In Appendix C.4.3, we provide a more principled approach to derive the loss function at hand. Finally, in Appendix C.5, we provide a faster sampling strategy for training DLPM, computing each loss term L^L_{t−1} using only two heavy-tailed random variables per datapoint, instead of t random variables. In Appendix D, we adapt the setting for deterministic sampling in classical denoising diffusion, DDIM (Song et al. (2020)), to our α-stable framework; we naturally call this extension DLIM. As with DDIM, the same neural networks can be used for both the DLIM and DLPM generation procedures. In Appendix E, we compare in more detail the discrete and continuous frameworks DLPM and LIM, underlining how they yield two distinct loss functions, training algorithms and sampling algorithms. In Appendix F, we give proofs relative to our technique for faster training, as introduced in Appendix C.5. In Appendix G, we provide additional experimental details.

A ALGORITHMS FOR DLPM AND DLIM

In this section, we explicitly provide the algorithms needed to train and sample from the DLPM and DLIM generative methods.

A.1 TRAINING

The same models can be shared between DLPM and DLIM, as underlined in Appendix D.3. We introduce the values γ_{1:t}, σ_{1:t} determined by the noise schedule, as presented in Proposition 4:

γ_{1:t} = ∏_{i=1}^t γ_i , σ_{1:t} = ( ∑_{i=1}^t (γ_{1:t} σ_i / γ_{1:i})^α )^{1/α} .

We define c_A = cos^{2/α}(πα/4) as in Theorem 1. We make the design choices D1, D2, D3 for our model, as described in Appendix C.4.2. Finally, we use the method for faster sampling described in Appendix C.5, Proposition 9. The resulting training algorithm is given in Algorithm 2.

Algorithm 2 DLPM training - simplified loss
Require: model {ε̂^θ_t}_{t=1}^T, noise schedule {(γ_t, σ_t)}_{t=1}^T, data Y_0
1: Sample t ∼ U[1, T]
2: Sample A_t ∼ S_{α/2,1}(0, c_A)
3: Sample G_t ∼ N(0, I_d)
4: Y_t ← γ_{1:t} Y_0 + σ_{1:t} A_t^{1/2} G_t
5: L_{t−1} ← ‖ε̂^θ_t(Y_t) − A_t^{1/2} G_t‖²
return L_{t−1}
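For concreteness, a single training step of Algorithm 2 reads as follows in PyTorch (a minimal sketch, not the authors' reference implementation; eps_model and all variable names are ours, g1t and s1t are precomputed tensors of γ_{1:t} and σ_{1:t}, and sample_positive_stable is the NumPy helper sketched in Section 2):

```python
import numpy as np
import torch

def dlpm_training_loss(eps_model, y0, g1t, s1t, alpha, rng):
    # One Monte Carlo sample (M = 1) of the simplified loss (8); the square root on the
    # squared error reflects the exponent r = 1/2 applied to the conditional expectation.
    b, d = y0.shape
    t = torch.randint(1, g1t.shape[0] + 1, (b,))                    # t ~ U[1, T]
    a = 2.0 * torch.from_numpy(sample_positive_stable(alpha, (b, 1), rng)).float()
    eps = torch.sqrt(a) * torch.randn(b, d)                         # A_t^{1/2} G_t
    y_t = g1t[t - 1].unsqueeze(1) * y0 + s1t[t - 1].unsqueeze(1) * eps
    sq_err = ((eps_model(y_t, t) - eps) ** 2).sum(dim=1)
    return sq_err.sqrt().mean()
```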
A.2 SAMPLING

Given the noise schedule and a sequence {A_t}_{t=1}^T of α-stable random variables, we define:

Σ_{1:t}(A_{1:t}) = ∑_{k=1}^t (γ_{1:t} A_k^{1/2} σ_k / γ_{1:k})² , Γ_t(A_{1:t}) = 1 − γ_t² Σ_{1:t−1}(A_{1:t−1}) / Σ_{1:t}(A_{1:t}) ,    (9)

see Proposition 6 and Equation (12) from Proposition 5 for precise statements. We give our sampling algorithm for DLPM in Algorithm 3.

Algorithm 3 Stochastic sampling (DLPM)
Require: model {ε̂^θ_t}_{t=1}^T, noise schedule {(γ_t, σ_t)}_{t=1}^T
1: Sample Y_T ∼ S^i_α(0, σ_{1:T} I_d)
2: Sample {A_t}_{t=1}^T i.i.d., A_t ∼ S_{α/2,1}(0, c_A)
3: for t = T to 1 do
4: Compute Σ_{1:t}, Σ_{1:t−1}, Γ_t as in (9)
5: Sample G_t ∼ N(0, I_d)
6: Σ̂^θ_{t−1} ← Γ_t Σ_{1:t−1}
7: Y_{t−1} ← (1/γ_t) ( Y_t − Γ_t σ_{1:t} ε̂^θ_t(Y_t) ) + (Σ̂^θ_{t−1})^{1/2} G_t
8: end for
return Y_0

For DLIM, one can potentially provide Y_T as input, in order to use the support of the noise distribution as a latent space, as is done by Song et al. (2020). We give our DLIM sampling algorithm in Algorithm 4.

Algorithm 4 Deterministic sampling (DLIM)
Require: model {ε̂^θ_t}_{t=1}^T, noise schedule {(γ_t, σ_t)}_{t=1}^T
1: Sample Y_T ∼ S^i_α(0, I_d)
2: for t = T to 1 do
3: Y_{t−1} ← γ_{1:t−1} ( Y_t − σ_{1:t} ε̂^θ_t(Y_t) ) / γ_{1:t} + σ_{1:t−1} ε̂^θ_t(Y_t)
4: end for
return Y_0
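Algorithm 3 can be transcribed almost line by line; the following NumPy sketch (names ours) reuses the helper functions from the sketches in Sections 2 and 3.1 and treats eps_model(y, t) as the trained network. It is an illustration under those assumptions, not a reference implementation.

```python
import numpy as np

def dlpm_sample(eps_model, gammas, sigmas, alpha, n, d, rng):
    T = len(gammas)
    g1t, s1t = cumulative_schedule(gammas, sigmas, alpha)
    a = 2.0 * sample_positive_stable(alpha, T, rng)                 # A_1, ..., A_T
    Sigma = np.empty(T)                                             # Sigma_{1:t}(a_{1:t})
    for t in range(T):                                              # recurrence of Appendix C.2
        Sigma[t] = sigmas[t] ** 2 * a[t] + gammas[t] ** 2 * (Sigma[t - 1] if t > 0 else 0.0)
    y = s1t[-1] * sample_isotropic_stable(alpha, n, d, rng)         # Y_T ~ S^i_alpha(0, sigma_{1:T} I_d)
    for t in range(T, 0, -1):
        prev = Sigma[t - 2] if t > 1 else 0.0                       # Sigma_{1:t-1}, zero at t = 1
        Gamma_t = 1.0 - gammas[t - 1] ** 2 * prev / Sigma[t - 1]
        mean = (y - Gamma_t * s1t[t - 1] * eps_model(y, t)) / gammas[t - 1]
        y = mean + np.sqrt(Gamma_t * prev) * rng.standard_normal((n, d))
    return y
```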
B ADDITIONAL REMARK ON α-STABLE DISTRIBUTIONS

Stable distributions are closed under convolution for a fixed value of α. Since convolution is equivalent to multiplication of the Fourier-transformed functions, it follows that the product of two stable characteristic functions with the same α yields another such characteristic function. We precisely characterize this stability property in the following proposition:

Proposition 3 (See (Nolan, 2020, Proposition 1.3)). Let X_1, X_2 be two random variables respectively distributed as X_1 ∼ S_{α,β_1}(µ_1, σ_1) and X_2 ∼ S_{α,β_2}(µ_2, σ_2), with µ_1, µ_2, β_1, β_2 ∈ R and σ_1, σ_2 > 0. Then X = X_1 + X_2 is distributed as S_{α,β}(µ, σ), where:

µ = µ_1 + µ_2 , σ = (σ_1^α + σ_2^α)^{1/α} , β = (β_1 σ_1^α + β_2 σ_2^α) / (σ_1^α + σ_2^α) .

In particular, when X_1, X_2 are such that X_1 ∼ S_α(0, σ_1), X_2 ∼ S_α(0, σ_2), then X = X_1 + X_2 ∼ S_α(0, (σ_1^α + σ_2^α)^{1/α}), which is the key relation used in the later characterizations of the distribution of our forward process.

C THEORETICAL DERIVATIONS FOR DLPM

In this section, we provide the detailed theory and associated derivations of the DLPM framework.

C.1 SETTING AND NOTATIONS

We will denote by φ_d(·|µ, Σ) the density of N(µ, Σ), where µ ∈ R^d and Σ ∈ R^{d×d}. We will denote by ψ^{(α)} the density of S_{α/2,1}(0, c_A), where c_A = cos^{2/α}(πα/4).

Forward process. We reintroduce the setting presented in Section 3, with the noising schedule denoted by {(γ_t, σ_t)}_{t=1}^T, and the following forward process on which DLPM is based:

X_0 ∼ p_* , and X_t = γ_t X_{t−1} + σ_t ε_t^{(α)} ,    (10)

where p_* is the data distribution and {ε_t^{(α)}}_{t=1}^T are independent random variables distributed as S^i_α(0, I_d).

Data augmentation process. We also introduce the associated data augmentation process:

Y_0 ∼ p_* , and Y_t = γ_t Y_{t−1} + σ_t A_t^{1/2} G_t ,    (11)

where {G_t}_{t=1}^T and {A_t}_{t=1}^T are independent random variables distributed according to G_t ∼ N(0, I_d) and A_t ∼ S_{α/2,1}(0, c_A), with c_A = cos^{2/α}(πα/4). From Theorem 1, {Y_t}_{t=0}^T is a Markov chain that admits the same distribution as {X_t}_{t=0}^T. We will denote by p^{(α)}_t the distribution of Y_t, and by k^{(α)}_{t|t−1}(·|·) the transition density associated with the Markov chain (10).

Backward process. A backward process associated with the Markov chain {Y_t}_{t=0}^T is a Markov chain {Ȳ_t}_{t=0}^T such that the two processes {Ȳ_{T−t}}_{t=0}^T and {Y_t}_{t=0}^T have the same distribution. For ease of presentation, and following classical notations, we will rather consider {Ỹ_t}_{t=0}^T where Ỹ_t = Ȳ_{T−t}. We will denote by k̄^{(α)}_{t−1|t}(·|·) the transition densities associated with the process {Ỹ_t}_{t=0}^T. Since the true backward process is never available to us, we will focus on an approximation induced by a variational family. We consider the process {Ỹ^θ_t}_{t=0}^T, where Ỹ^θ_T is distributed as S^i_α(0, σ_{1:T} I_d), and the density of the distribution of Ỹ^θ_{t−1} conditioned on Ỹ^θ_t is given by q̄^θ_{t−1|t}(·|·), where θ ∈ R^D parameterizes a neural network. We also denote by q̄^θ_{0:T} the joint distribution of {Ỹ^θ_t}_{t=0}^T and by q̄^θ_t the marginal distribution of Ỹ^θ_t.

Further notations. Finally, we denote by p^{(α)}_t(·|a_{1:t}), k^{(α)}_{t|t−1,a}(·|·, a_{1:t}), and q̄^θ_{t−1|t,a}(·|·, a_{1:t}) the densities/transition densities associated with the processes {Y_t}_{t=0}^T and {Ỹ^θ_t}_{t=0}^T given A_t = a_t for 1 ≤ t ≤ T. We will also write p^{(α)}_t(·|y_0, a_{1:t}), k^{(α)}_{t|t−1,0,a}(·|·, y_0, a_{1:t}), and q̄^θ_{t−1|t,0,a}(·|·, y_0, a_{1:t}) when further conditioning on Y_0.

C.2 FORWARD PROCESS

Let us now characterize the distribution of X_t given X_0, and of Y_t given Y_0, {A_t}_{t=1}^T, which are tractable thanks to Proposition 3. These will come in handy, for instance, when working on the backward process in Appendix C.3.

Proposition 4 (Distribution of X_t given X_0). Let {X_t}_{t=0}^T be the forward process as given in (10), and {(γ_t, σ_t)}_{t=1}^T the noise schedule. Then the distribution of X_t given X_0 is given for any t by X_t =_d γ_{1:t} X_0 + σ_{1:t} ε_t, where ε_t ∼ S^i_α(0, I_d), and γ_{1:t}, σ_{1:t} are given by:

γ_{1:t} = ∏_{k=1}^t γ_k , σ_{1:t} = ( ∑_{k=1}^t (γ_{1:t} σ_k / γ_{1:k})^α )^{1/α} .

The proof is an elementary induction based on Proposition 3.

Proposition 5 (Distribution of Y_t given Y_0, {A_t}_{t=1}^T). Let {Y_t}_{t=0}^T be the forward process as given in (11), {(γ_t, σ_t)}_{t=1}^T the noise schedule, and {A_t}_{t=1}^T the associated α/2-stable random variables parameterizing the variance of the Gaussian noise increments. Then the distribution of Y_t given Y_0, {A_t}_{t=1}^T is the Gaussian distribution with mean γ_{1:t} Y_0 and covariance matrix Σ_{1:t} I_d, i.e.,

Y_t =_d γ_{1:t} Y_0 + Σ_{1:t}(A_{1:t})^{1/2} G_t ,    (12)

where G_t ∼ N(0, I_d), and

γ_{1:t} = ∏_{k=1}^t γ_k , Σ_{1:t}(a_{1:t}) = ∑_{k=1}^t (γ_{1:t} a_k^{1/2} σ_k / γ_{1:k})² .

The proof is elementary and omitted. It is worth mentioning the following recurrence, to speed up the computation of the sequence {Σ_{1:t}(a_{1:t})}_{t=1}^T:

Σ_{1:t}(a_{1:t}) = σ_t² a_t + γ_t² Σ_{1:t−1}(a_{1:t−1}) .
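As a quick numerical sanity check of this recurrence against the direct sum in Proposition 5, one can run the following throwaway NumPy sketch (names ours, reusing sample_positive_stable from the sketch in Section 2):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 50
gammas = rng.uniform(0.9, 1.0, T)
sigmas = rng.uniform(0.05, 0.2, T)
a = 2.0 * sample_positive_stable(1.7, T, rng)

g1t = np.cumprod(gammas)
direct = np.array([sum((g1t[t] * np.sqrt(a[k]) * sigmas[k] / g1t[k]) ** 2
                       for k in range(t + 1)) for t in range(T)])
recurrent = np.zeros(T)
for t in range(T):
    recurrent[t] = sigmas[t] ** 2 * a[t] + gammas[t] ** 2 * (recurrent[t - 1] if t > 0 else 0.0)
assert np.allclose(direct, recurrent)  # O(T) recurrence matches the O(T^2) direct sum
```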
C.3 BACKWARD PROCESS

Consider the setting of the data augmentation approach as given in (11). By the same arguments used in Section 3, it can be verified that a process starting from p^{(α)}_T and with transition densities k̄^{(α)}_{t−1|t}(y_{t−1}|y_t) ∝ p^{(α)}_{t−1}(y_{t−1}) k^{(α)}_{t|t−1}(y_t|y_{t−1}) for any y_{t−1}, y_t is a backward process associated with {Y_t}_{t=0}^T. However, this raises two main problems. First (i), we cannot characterize the distribution k̄^{(α)}_{t−1|t}(y_{t−1}|y_t), since we do not know the data distribution. Second (ii), in the case where α < 2, we do not have access to an explicit expression for k^{(α)}_{t−1|t,0}(y_{t−1}|y_t, y_0).

Regarding (i), we have access to the distribution of Y_t given Y_0, so a valid strategy consists in devising a method relying on characterizing the backward of the process {Y_t}_{t=0}^T given Y_0. This is the classical strategy used in DDPM (Ho et al. (2020)), which is possible in the case α = 2 since k^{(α)}_{t−1|t,0}(y_{t−1}|y_t, y_0) admits an analytical expression for any y_0, y_{t−1}, y_t, thanks to the properties of the Gaussian distribution.

Regarding (ii), in the case where α < 2, we make use of the trick introduced in Theorem 1, justifying the data augmentation approach. We will rather characterize the density of the Markov kernels associated with the backward of the process {Y_t}_{t=0}^T given Y_0 and {A_t}_{t=1}^T. This time, since we manage Gaussian noise increments, we can fall back on the classical strategy, as we develop in the following proposition.

Proposition 6 (Density of the backward process associated with {Y_t}_{t=0}^T given Y_0, {A_t}_{t=1}^T). Consider the setting of the data augmentation approach as given in (11). Let {(γ_t, σ_t)}_{t=1}^T be the noise schedule at hand. Let k^{(α)}_{t−1|0,t,a}(·|y_t, y_0, a_{1:t}) be the density of the backward process associated with {Y_t}_{t=0}^T given Y_0 and {A_t}_{t=1}^T. Then k^{(α)}_{t−1|t,0,a}(·|y_t, y_0, a_{1:t}) is the density of a Gaussian distribution N(m̃_{t−1}, Σ̃_{t−1}) with mean m̃_{t−1} and variance Σ̃_{t−1} such that

m̃_{t−1}(y_t, y_0, a_{1:t}) = (1/γ_t) ( y_t − Γ_t(a_{1:t}) σ_{1:t} ε_t(y_t, y_0) ) , Σ̃_{t−1}(a_{1:t}) = Γ_t(a_{1:t}) Σ_{1:t−1}(a_{1:t−1}) ,

where

Σ_{1:t}(a_{1:t}) = ∑_{k=1}^t (γ_{1:t} a_k^{1/2} σ_k / γ_{1:k})² , ε_t(y_t, y_0) = (y_t − γ_{1:t} y_0)/σ_{1:t} , Γ_t(a_{1:t}) = 1 − γ_t² Σ_{1:t−1}(a_{1:t−1}) / Σ_{1:t}(a_{1:t}) .

Even though Γ_t involves multiple heavy-tailed random variables, it is nonetheless bounded: 0 ≤ Γ_t ≤ 1.

Proof. To determine k^{(α)}_{t−1|t,0,a}(·|y_t, y_0, a_{1:t}), we need to work on the joint distribution of (Y_{t−1}, Y_t) conditioned on Y_0, {A_t}_{t=1}^T, which is a Gaussian vector for which classical techniques let us derive the distribution of Y_{t−1} given Y_t. Before doing so, we need to compute ρ_t, the covariance of Y_{t−1} and Y_t given Y_0, {A_t}_{t=1}^T, which we do thanks to Proposition 5:

ρ_t I_d = Cov(Y_t, Y_{t−1}|Y_0, A_{1:T}) = γ_t Cov(Y_{t−1}, Y_{t−1}|Y_0, A_{1:T}) = γ_t Σ_{1:t−1} I_d .

Denote by k^{(α)}_{t−1,t|0,a} the density of (Y_{t−1}, Y_t) conditioned on Y_0, A_{1:T}, and by φ_d(·|µ, Σ) the density of a d-dimensional Gaussian distribution with mean µ and covariance Σ. From the results of Proposition 5, we can write

k^{(α)}_{t−1,t|0,a}(y_{t−1}, y_t|y_0, a_{1:t}) = φ_{2d}( (y_{t−1}, y_t) | (γ_{1:t−1} y_0, γ_{1:t} y_0), [ Σ_{1:t−1}(a_{1:t−1}) I_d , ρ_t I_d ; ρ_t I_d , Σ_{1:t}(a_{1:t}) I_d ] ) .

Then the distribution of Y_{t−1} given Y_t, Y_0, A_{1:T} is a Gaussian distribution N(m̃_{t−1}, Σ̃_{t−1}) (Holt & Nguyen, 2023, Theorem 3) with mean m̃_{t−1} and variance Σ̃_{t−1} satisfying:

m̃_{t−1}(y_t, y_0, a_{1:t}) = γ_{1:t−1} y_0 + (ρ_t / Σ_{1:t}(a_{1:t})) (y_t − γ_{1:t} y_0) , Σ̃_{t−1}(a_{1:t}) = Σ_{1:t−1}(a_{1:t−1}) − ρ_t² / Σ_{1:t}(a_{1:t}) .

By defining

ε_t(y_t, y_0) = (y_t − γ_{1:t} y_0)/σ_{1:t} , Γ_t(a_{1:t}) = 1 − γ_t² Σ_{1:t−1}(a_{1:t−1}) / Σ_{1:t}(a_{1:t}) ,

we give the final expression for the mean m̃_{t−1} and variance Σ̃_{t−1} of the distribution of Y_{t−1} given Y_t, Y_0 and {A_t}_{t=1}^T:

m̃_{t−1}(y_t, y_0, a_{1:t}) = (1/γ_t) ( y_t − σ_{1:t} Γ_t(a_{1:t}) ε_t(y_t, y_0) )    (14)
Σ̃_{t−1}(a_{1:t}) = Γ_t(a_{1:t}) Σ_{1:t−1}(a_{1:t−1}) .

Since Γ_t(a_{1:t}) = 1 − γ_t² Σ_{1:t−1}(a_{1:t−1})/Σ_{1:t}(a_{1:t}) = a_t σ_t²/Σ_{1:t}(a_{1:t}) and a_t, γ_t, Σ_{1:t}, σ_{1:t} > 0, we have 0 ≤ Γ_t ≤ 1.
Case $\alpha = 2$. As we set $\alpha = 2$, the random variables $\{A_t\}_{t=1}^T$ become deterministic, equal to 2. One can check that in this case, with the variance-preserving schedule
$$\gamma_t = \sqrt{1-\beta_t}\ ,\quad \gamma_{1:t} = \sqrt{\bar\alpha_t}\ ,\quad \sigma_t = \sqrt{\beta_t}\ ,\quad \sigma_{1:t} = \sqrt{1-\bar\alpha_t}\ ,$$
we have $\Sigma_{1:t} = 2\,\sigma_{1:t}^2 = 2(1-\bar\alpha_t)$. Further noticing that $\gamma_t = \gamma_{1:t}/\gamma_{1:t-1}$, one computes
$$\Gamma_t = 1 - \frac{\sigma_{1:t-1}^2\,\bar\alpha_t/\bar\alpha_{t-1}}{\sigma_{1:t}^2} = 1 - \frac{(1-\bar\alpha_{t-1})\,\bar\alpha_t/\bar\alpha_{t-1}}{1-\bar\alpha_t} = \frac{1-\bar\alpha_t/\bar\alpha_{t-1}}{1-\bar\alpha_t}\ ,$$
so that one recovers the famous equations made popular in the seminal DDPM paper (Ho et al., 2020, Equation 7):
$$\tilde m_{t-1} = \sqrt{\frac{\bar\alpha_{t-1}}{\bar\alpha_t}}\left(Y_t - \frac{1-\bar\alpha_t/\bar\alpha_{t-1}}{\sqrt{1-\bar\alpha_t}}\,\epsilon_t(Y_t, Y_0)\right)\ ,\qquad \tilde\Sigma_{t-1} = (1-\bar\alpha_{t-1})\,\frac{1-\bar\alpha_t/\bar\alpha_{t-1}}{1-\bar\alpha_t}\ , \tag{15}$$
with $\epsilon_t(Y_t, Y_0) = (Y_t - \sqrt{\bar\alpha_t}\, Y_0)/\sqrt{1-\bar\alpha_t}$.

Model for the reverse process. We propose approximating the backward process associated to $\{Y_t\}_{t=0}^T$ given $\{A_t\}_{t=1}^T$, adapting the DDPM approach. This time, for the backward process, we have characterized the conditional density of $Y_{t-1}$ given $Y_0, Y_t$ and $\{A_t\}_{t=1}^T$:
$$\bar k^{(\alpha)}_{1:T|0,a}(y_{1:T}\mid y_0, a_{1:T}) := p^{(\alpha)}_T(y_T)\prod_{t=T}^{1} k^{(\alpha)}_{t-1|t,0,a}(y_{t-1}\mid y_t, y_0, a_{1:t})\ ,$$
where $k^{(\alpha)}_{t-1|t,0,a}(\cdot\mid y_t, y_0, a_{1:t})$ is the tractable density of a Gaussian distribution, as we have just proved in Proposition 6. We similarly reconsider the family of Gaussian variational approximations introduced in (4), modified to account for an i.i.d. sequence $\{A_t\}_{t=1}^T$:
$$\bar q^\theta_{t-1|t,a}(y_{t-1}\mid y_t, a_{1:t}) = \phi_d\big(y_{t-1}\mid \hat m^\theta_{t-1}(y_t, a_{1:t}),\ \hat\Sigma^\theta_{t-1}(y_t, a_{1:t})\big)\ , \tag{16}$$
where $\phi_d$ is the density of the multivariate Gaussian distribution, so that the overall model for the backward process is the following:
$$\bar q^\theta_{0:T}(y_{0:T}) = \int \psi^{(\alpha)}_{1:T}(a_{1:T})\,\bar p_T(y_T)\prod_{t=1}^{T}\bar q^\theta_{t-1|t,a}(y_{t-1}\mid y_t, a_{1:t})\,\mathrm{d}a_{1:T}\ ,$$
where $\psi^{(\alpha)}_{1:T}(a_{1:T}) = \prod_{i=1}^{T}\psi^{(\alpha)}(a_i)$, and $\bar p_T$ is the density of $\mathcal{S}i_\alpha(0, \sigma_{1:T} I_d)$.

C.4 TRAINING LOSS

In this section, we draw inspiration from DDPM (Ho et al. (2020)) to obtain a loss function admitting a closed-form formula. We further provide three design choices which lead to a simplified training loss, corresponding to the denoising loss for $\alpha$-stable diffusion.

C.4.1 CLASSICAL LOSS FOR DDPM, $\alpha = 2$

We start by reviewing what is classically done for DDPM, i.e., the case $\alpha = 2$. The variational approximation $\{\bar q^\theta_{0:T} : \theta \in \Theta\}$, for some parameter space $\Theta$, is designed to admit the same decomposition as $\bar k^{(2)}_{0:T}$, i.e., $\bar q^\theta_{0:T}(x_{0:T}) = \bar q^\theta_T(x_T)\prod_{t=T}^{1}\bar q^\theta_{t-1|t}(x_{t-1}\mid x_t)$, where $\bar q^\theta_T$ is chosen to be the density of $\mathcal{S}i_\alpha(0, \sigma_{1:T} I_d)$ as an approximation of $p^{(\alpha)}_T$. It is then trained on a classical upper bound of the KL loss $L^D : \theta \mapsto \mathrm{KL}(p^\star\,\|\,\bar q^\theta_0)$ between the true and the generated distribution, which is a form of evidence lower bound loss (Ho et al., 2020, Equation 5). Thus one resorts to optimizing the following sum:
$$L^D(\theta) \le L^D_T + \sum_{t=2}^{T} L^D_{t-1}(\theta) + L^D_0(\theta) + C\ , \tag{17}$$
where $C$ is a constant that does not depend on $\theta$, and
$$L^D_T = \mathbb{E}\big[\mathrm{KL}\big(k^{(2)}_{T|0}(\cdot\mid X_0)\,\big\|\,\phi_d(\cdot\mid 0, \sigma_{1:T} I_d)\big)\big]\ ,\qquad L^D_0(\theta) = \mathbb{E}\big[-\log \bar q^\theta_{0|1}(X_0\mid X_1)\big]\ ,$$
$$L^D_{t-1}(\theta) = \mathbb{E}\big[\mathrm{KL}\big(k^{(2)}_{t-1|0,t}(\cdot\mid X_0, X_t)\,\big\|\,\bar q^\theta_{t-1|t}(\cdot\mid X_t)\big)\big]\ ,$$
where $\{X_t\}_{t=0}^T$ is the process defined in (10), and $k^{(2)}_{t-1|0,t}$ is the conditional density of $X_{t-1}$ given $X_0, X_t$. We make the following classical remarks on the terms of this loss (Sohl-Dickstein et al. (2015), Yang et al. (2024)). The term $L^D_T$ does not depend on $\theta$ but only on the chosen time horizon for the forward process, which determines the final variance of the Gaussian distribution $\mathcal{N}(0, \sigma_{1:T} I_d)$; it is neglected. The effect of optimizing the first term $L^D_0(\theta)$ is negligible too. More importantly, for the term $L^D_{t-1}(\theta)$, when using Gaussian variational approximations, i.e., $\bar q^\theta_{t-1|t}(x_{t-1}\mid x_t) = \phi_d\big(x_{t-1}\mid \hat m^\theta_{t-1}(x_t), \hat\Sigma^\theta_{t-1}(x_t)\big)$, where $(x, m, \Sigma) \mapsto \phi_d(x\mid m, \Sigma)$ is the $d$-dimensional density of the Gaussian distribution with mean $m$ and covariance matrix $\Sigma$, and $\hat m^\theta_{t-1}, \hat\Sigma^\theta_{t-1}$ are some functions of $x_t$ parameterized by $\theta$, it turns out that $L^D_{t-1}$ admits a closed-form expression. For a fixed variance $\hat\Sigma^\theta_{t-1} = \tilde\Sigma_{t-1}$, with $\tilde\Sigma_{t-1}$ given in (15), one resorts to optimizing a convenient $L^2$ loss function:
$$L^D_{t-1}(\theta) = \lambda_t\,\big\|\tilde m_{t-1}(x_t, x_0) - \hat m^\theta_{t-1}(x_t)\big\|^2\ , \tag{18}$$
where $\lambda_t$ and $\tilde m_{t-1}$ depend on the noise schedule $(\gamma_t, \sigma_t)$ and on $x_t, x_0$.
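As a quick numerical sanity check of the $\alpha = 2$ specialization above, the following snippet (ours) verifies that $\Gamma_t$ from Proposition 6 reduces to the DDPM coefficient $(1-\bar\alpha_t/\bar\alpha_{t-1})/(1-\bar\alpha_t)$ when $A_t \equiv 2$; the schedule values are illustrative.

```python
# Check: at alpha = 2 (A_t = 2), Gamma_t matches the DDPM coefficient.
import numpy as np

betas = np.linspace(1e-4, 0.02, 100)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
t = 50
Sigma_1t = 2.0 * (1.0 - alpha_bar[t])          # Sigma_{1:t} = 2 sigma_{1:t}^2
Sigma_1tm1 = 2.0 * (1.0 - alpha_bar[t - 1])
Gamma_t = 1.0 - alphas[t] * Sigma_1tm1 / Sigma_1t   # gamma_t^2 = alpha_t
assert np.isclose(Gamma_t, (1.0 - alphas[t]) / (1.0 - alpha_bar[t]))
```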
Unfortunately, as we mentioned in Section 3.2.2, this solution cannot be used as such to learn the backward transitions associated to $\{X_t\}_{t=0}^T$ for $\alpha < 2$. The main issue stems from the fact that the densities of $\alpha$-stable distributions are in most cases unknown, in contrast to Gaussian distributions. As a result, the conditional density $(x_{t-1}, x_0, x_t) \mapsto k^{(\alpha)}_{t-1|0,t}(x_{t-1}\mid x_0, x_t)$ is unknown for $\alpha < 2$, which prevents us from obtaining an explicit expression for $\theta \mapsto L^D_{t-1}(\theta)$. Moreover, the absence of a second-order moment for $\alpha$-stable distributions defeats the most straightforward adaptation of the previous loss to the data augmentation setting. Indeed, to fit $\theta$ to the data distribution, we aim to rely on Kullback–Leibler minimization, a.k.a. the maximum likelihood principle, and some associated upper bounds. A naive solution would consist in considering the bound obtained by applying the Jensen inequality,
$$\mathrm{KL}(p^\star\,\|\,\bar q^\theta_0) \le \mathbb{E}\big[\mathrm{KL}\big(p^\star(\cdot)\,\big\|\,\bar q^\theta_{0|a}(\cdot\mid A_{1:T})\big)\big]\ ,$$
and to fall back to the expression obtained in (17), only with conditioning on $\{A_t\}_{t=1}^T$. However, as we see in (18), this expression would involve taking an expectation over $A_t$, while it is distributed as $\mathcal{S}_{\alpha/2,1}(0, c_A)$ and admits no first-order moment. We are not aware of any bounds on $\mathrm{KL}(p^\star\,\|\,\bar q^\theta_{0|a}(\cdot\mid A_{1:T}))$ that would lead to a meaningful optimization problem, due to the heavy-tailed nature of the distribution of $\{A_t\}_{t=1}^T$.

C.4.2 LOSS FOR DLPM, FOR ANY $\alpha$

We now expose our methodology to address this limitation. We keep the structure of the classical loss and aim at minimizing the error between the backward process $\{\bar Y_t\}_{t=0}^T$ and its variational approximation $\{\bar Y^\theta_t\}_{t=0}^T$. To do so we consider the same loss structure as before, but raise the individual KL terms to a fractional power; see Appendix C.4.3 for a more principled approach leading to a similar loss. Thus we consider the following valid loss function:
$$L^L : \theta \mapsto L^L(\theta) = \mathbb{E}\Big[\sum_{t=1}^{T} L^L_{t-1}(\theta, A_{1:t})^{\,r}\Big]\ , \tag{19}$$
where $r > 0$,
$$L^L_{t-1}(\theta, A_{1:t}) = \mathbb{E}\big[\mathrm{KL}\big(k^{(\alpha)}_{t-1|t,0,a}(\cdot\mid Y_t, Y_0, A_{1:t})\,\big\|\,\bar q^\theta_{t-1|t,a}(\cdot\mid Y_t, A_{1:t})\big)\,\big|\, A_{1:t}\big]\ ,$$
and $k^{(\alpha)}_{t-1|t,0,a}$ denotes the conditional density of $Y_{t-1}$ given $Y_0, Y_t$ and $\{A_t\}_{t=1}^T$. In order to ensure that the expectations with respect to $A_{1:T}$ are finite, we need to choose $r < \alpha/2$ when $\alpha \in (1, 2)$. For simplicity, in the rest of the paper, we will use $r = 1/2$.

Proposition 7 (Training loss for DLPM). The loss $L^L$ admits a closed-form expression, such that one resorts to optimizing the following loss for $1 \le t \le T$:
$$L^L_{t-1}(\theta, A_{1:t}) = \mathbb{E}\left[\frac{1}{2}\Big(\log\frac{\hat\Sigma^\theta_{t-1}}{\tilde\Sigma_{t-1}} - 1\Big) + \frac{\tilde\Sigma_{t-1} + \big\|\tilde m_{t-1} - \hat m^\theta_{t-1}\big\|^2}{2\,\hat\Sigma^\theta_{t-1}}\ \middle|\ A_{1:t}\right]\ , \tag{20}$$
where
$$\tilde m_{t-1}(Y_t, Y_0, A_{1:t}) = \frac{1}{\gamma_t}\big(Y_t - \sigma_{1:t}\,\Gamma_t(A_{1:t})\,\epsilon_t(Y_t, Y_0)\big)\ ,\qquad \tilde\Sigma_{t-1}(A_{1:t}) = \Gamma_t(A_{1:t})\,\Sigma_{1:t-1}(A_{1:t-1})\ ,$$
$$\epsilon_t(Y_t, Y_0) = \frac{Y_t - \gamma_{1:t}\, Y_0}{\sigma_{1:t}}\ ,\qquad \Sigma_{1:t}(A_{1:t}) = \sigma_t^2 A_t + \gamma_t^2\,\Sigma_{1:t-1}(A_{1:t-1})\ ,\qquad \Gamma_t(A_{1:t}) = 1 - \frac{\gamma_t^2\,\Sigma_{1:t-1}(A_{1:t-1})}{\Sigma_{1:t}(A_{1:t})}\ ,$$
and where $\hat m^\theta_{t-1}, \hat\Sigma^\theta_{t-1}$ are the mean and variance of the backward transition kernels $\bar q^\theta_{t-1|t}$. We have omitted the arguments of the mean and variance functions for clarity.

Proof. Recall (Proposition 6) that the backward process $Y_{t-1}$ conditioned on $\{A_t\}_{t=1}^T, Y_t, Y_0$ at time $t$ is distributed as $\mathcal{N}(\tilde m_{t-1}, \tilde\Sigma_{t-1})$, and, by design (Section 3.2.2), the backward transition kernels $\bar q^\theta_{t-1|t,a}$ of each element of the variational family describe a Gaussian transition kernel of mean $\hat m^\theta_{t-1}$ and variance $\hat\Sigma^\theta_{t-1}$ at time $t$, as defined in (16).
Since the KL term in $L^L_{t-1}(\theta, A_{1:t})$ corresponds to a KL divergence between two Gaussian distributions, a closed-form formula is readily available. Here we rewrite the equation with all the function arguments written out explicitly:
$$L^L_{t-1}(\theta, A_{1:t}) = \mathbb{E}\left[\frac{1}{2}\Big(\log\frac{\hat\Sigma^\theta_{t-1}(A_{1:t})}{\tilde\Sigma_{t-1}(A_{1:t})} - 1\Big) + \frac{\tilde\Sigma_{t-1}(A_{1:t}) + \big\|\tilde m_{t-1}(Y_t, Y_0, A_{1:t}) - \hat m^\theta_{t-1}(Y_t, A_{1:t})\big\|^2}{2\,\hat\Sigma^\theta_{t-1}(A_{1:t})}\ \middle|\ A_{1:t}\right]\ .$$

Now we discuss further design choices for $L^L_{t-1}(\theta, A_{1:t})$ and $\bar q^\theta_{t-1|t,a}$, leading to a simplified denoising loss, as is usually done in the literature. We denote them by D1, D2 and D3.

D1. We set a fixed variance $\hat\Sigma^\theta_t = \tilde\Sigma_t$ for the reverse process; we expect a study on the effect of learning the variance to yield results similar to the original DDPM (Ho et al. (2020)), and especially its improved version (Nichol & Dhariwal (2021)).

D2. Following our own experimental results and the usual recommendation for denoising diffusion models (see, e.g., Yang et al. (2024); Karras et al. (2022)), we reparameterize the output of the model to predict the value $\epsilon_t(y_t, y_0)$ at time-step $t$ rather than $\tilde m_{t-1}(y_t, y_0, a_{1:t})$. Since
$$\tilde m_{t-1}(Y_t, Y_0, A_{1:t}) = \frac{1}{\gamma_t}\big(Y_t - \sigma_{1:t}\,\Gamma_t(A_{1:t})\,\epsilon_t(Y_t, Y_0)\big)\ ,$$
we re-parameterize $\hat m^\theta_{t-1}$ to be
$$\hat m^\theta_{t-1}(Y_t, A_{1:t}) = \frac{1}{\gamma_t}\big(Y_t - \sigma_{1:t}\,\Gamma_t(A_{1:t})\,\hat\epsilon^\theta_t(Y_t, A_{1:t})\big)\ ,$$
with $\hat\epsilon^\theta_t$ being the output of the model. Following experimental results (see the introduction of Appendix G), we drop the dependency of $\hat\epsilon^\theta_t$ on $\{A_t\}_{t=1}^T$, which is reasonable since it approximates $\epsilon_t : (Y_t, Y_0) \mapsto \epsilon_t(Y_t, Y_0)$; thus $\hat\epsilon^\theta_t$ only depends on $t, Y_t$. This choice achieves better performance in our experiments, allows further computational tricks (introduced in Appendix C.5), and enables one to re-use existing neural network architectures.

D3. Assuming D1 and D2, $L^L_{t-1}(\theta)$ becomes
$$L^L_{t-1}(\theta) = \mathbb{E}\big[\lambda^2_{t,A_{1:t}}\,\big\|\hat\epsilon^\theta_t(Y_t, A_{1:t}) - \epsilon_t(Y_t, Y_0)\big\|^2\big]\ , \tag{21}$$
where $\lambda_{t,a_{1:t}} = \Gamma_t(a_{1:t})\,\sigma_{1:t}\big/\big(\gamma_t\sqrt{2\,\tilde\Sigma_{t-1}}\big)$, and $\epsilon_t(Y_t, Y_0) = (Y_t - \gamma_{1:t} Y_0)/\sigma_{1:t}$. The methodological knowledge of diffusion models motivates making specific choices for $\lambda$, different from its defined value, a classical technique for improving performance (e.g., see Karras et al. (2022); Ho et al. (2020); Nichol & Dhariwal (2021); Yang et al. (2024)). We will stick to the classical choice $\lambda_{t,a_{1:t}} = 1$, which experimentally works better and draws similarities to the score-based perspective (see Appendix E). Other choices and optimizations are left to further work. The proof of Proposition 2 follows immediately from these design choices, and is hence omitted.

C.4.3 A PRINCIPLED APPROACH FOR DERIVING THE LOSS FUNCTION

In this section, we provide a more principled approach to derive the loss function for DLPM, as initially given in (19). We show the derivation for $r = 1/2$; however, the same derivation applies for any $r \in (0, 1]$. Noting that $Y_0$ is independent of $\{A_t\}_{t=1}^T$ in (5), $p^\star$ is equal to $k^{(\alpha)}_{0|a}(\cdot\mid a_{1:T})$, the conditional density of $Y_0$ given $A_t = a_t$ for any $t \in \{1, \ldots, T\}$, and therefore we consider the valid loss function
$$L^L(\theta) : \theta \mapsto \int \mathrm{d}a_{1:T}\,\psi^{(\alpha)}_{1:T}(a_{1:T})\,\big[\mathrm{KL}\big(k^{(\alpha)}_{0|a}(\cdot\mid a_{1:T})\,\big\|\,\bar q^\theta_{0|a}(\cdot\mid a_{1:T})\big)\big]^{1/2}\ .$$
While this function is still intractable, we can provide an upper bound which we can minimize.
Indeed, using the Jensen inequality twice, we bound this function by
$$L^L(\theta) = \int \mathrm{d}a_{1:T}\,\psi^{(\alpha)}_{1:T}(a_{1:T})\,\Big\{\int \mathrm{d}y_0\, k^{(\alpha)}_{0|a}(y_0\mid a_{1:T})\big(\log k^{(\alpha)}_{0|a}(y_0\mid a_{1:T}) - \log \bar q^\theta_{0|a}(y_0\mid a_{1:T})\big)\Big\}^{1/2}$$
$$\le \int \mathrm{d}a_{1:T}\,\psi^{(\alpha)}_{1:T}(a_{1:T})\,\Big\{-\int \mathrm{d}y_{0:T}\, k^{(\alpha)}_{0:T|a}(y_{0:T}\mid a_{1:T})\,\log\frac{\bar q^\theta_{0:T|a}(y_{0:T}\mid a_{1:T})}{k^{(\alpha)}_{1:T|0,a}(y_{1:T}\mid y_0, a_{1:T})} + \mathrm{Cst}_1\Big\}^{1/2}$$
$$= \int \mathrm{d}a_{1:T}\,\psi^{(\alpha)}_{1:T}(a_{1:T})\,\Big\{\sum_{t=0}^{T-1} L^L_t(\theta, a_{1:T}) + \mathrm{Cst}_1 + \mathrm{Cst}_2\Big\}^{1/2}$$
$$\le \int \mathrm{d}a_{1:T}\,\psi^{(\alpha)}_{1:T}(a_{1:T})\,\Big\{\sum_{t=0}^{T-1} L^L_t(\theta, a_{1:T})^{1/2} + \mathrm{Cst}_1^{1/2} + \mathrm{Cst}_2^{1/2}\Big\}\ ,$$
where we used $\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$ for $a, b \ge 0$, and
$$L^L_0(\theta, a_{1:T}) = \mathbb{E}\big[-\log \bar p_\theta(Y_0\mid Y_1, a_{1:T})\,\big|\,\{A_t\}_{t=1}^T = \{a_t\}_{t=1}^T\big]\ ,$$
and, for $t > 0$,
$$L^L_t(\theta, a_{1:T}) = \mathbb{E}\big[\mathrm{KL}\big(k^{(\alpha)}_{t|t+1,0,a}(\cdot\mid Y_{t+1}, Y_0, a_{1:T})\,\big\|\,\bar q^\theta_{t|t+1,a}(\cdot\mid Y_{t+1}, a_{1:T})\big)\,\big|\,\{A_t\}_{t=1}^T = \{a_t\}_{t=1}^T\big]\ ,$$
where $\mathrm{Cst}_1 = \int \mathrm{d}y_0\, p^\star(y_0)\log p^\star(y_0)$ and $\mathrm{Cst}_2 = \mathbb{E}\big[\mathrm{KL}\big(k^{(\alpha)}_{T|0,a}(\cdot\mid Y_0, a_{1:T})\,\big\|\,\bar q^\theta_T\big)\,\big|\,\{A_t\}_{t=1}^T = \{a_t\}_{t=1}^T\big]$ does not depend on $\theta$ since $\bar q^\theta_T$ is chosen as $\mathcal{S}_\alpha(0, \sigma_{1:T} I_d)$. Regarding $L^L_0$, we neglect this term, replacing the distribution $\bar q^\theta_{0|1,a}(\cdot\mid y_1, a_{1:T})$ by a deterministic Dirac; one could alternatively employ the strategy of the discrete decoder for image data as described by Ho et al. (2020). We end up with the final loss function:
$$L^L(\theta) = \int \mathrm{d}a_{1:T}\,\psi^{(\alpha)}_{1:T}(a_{1:T})\,\Big\{\sum_{t=0}^{T-1} L^L_t(\theta, a_{1:T})^{1/2}\Big\}\ .$$
We can then provide an explicit expression for $L^L_t(\theta, a_{1:T})$ based on the result of Proposition 6.

C.5 REDUCING THE COMPUTATIONAL COST WITH FASTER SAMPLING AT EACH TIMESTEP

In this section, we provide a faster algorithm for training DLPM, computing each loss term $L^L_{t-1}$ using only two heavy-tailed random variables per datapoint, instead of $t$ random variables. Consider again the process $\{Y_t\}_{t=0}^T$. We replace the loss function (19) by an equivalent one:
$$L^L_{\mathrm{time}}(\theta) = \mathbb{E}\big[L^L_{t-1}(\theta, A_{1:t})^{1/2}\big]\ , \tag{22}$$
where $t \sim \mathcal{U}[2, T] = \big(\sum_{i=2}^T \delta_i\big)/(T-1)$. The standard technique for computing the loss consists in the following loop (a code sketch is given after this list):

1. Take a batch of $B$ datapoints $\{Y^i_0\}_{i=1}^B$.
2. For each datapoint $Y^i_0$, draw a random $t_i$, as suggested by the alternative loss (22). Indeed, for a single datapoint, (i) training on all timesteps rather than just one yields equal or inferior results, for a much higher computational cost; (ii) it is beneficial for the model to proportionally spend more time learning specific time ranges; (iii) thus the distribution of $t$ can be optimized and is a matter of ongoing methodological research, e.g., see Karras et al. (2022).
3. Draw sequences $\{A^i_t\}_{t=1}^{t_i}$ of heavy-tailed random variables.
4. Compute the noised datapoints $\{Y^i_{t_i}\}_{i=1}^B$.
5. Compute the batch loss
$$\hat L^L(\theta) = \frac{1}{B}\sum_{i=1}^{B} L^L_{t_i-1}(\theta, Y^i_{t_i}, A^i_{1:t_i})^{1/2}\ ,$$
such that $\hat L^L(\theta) \approx L^L_{\mathrm{time}}(\theta)$.
6. Do an optimization step.

Step 3 can be expensive, since one has to sample on average $O(T)$ $d$-dimensional heavy-tailed random variables to compute a single noised datapoint $Y_t$ from $Y_0$. This is all the more inefficient as $T$ can be quite large (indeed, on image datasets we can have $T = 4000$, see Appendix G). This is excessive, especially since characterizing the distribution of $Y_t$ given $Y_0$ only requires a single heavy-tailed random variable:
$$Y_t \stackrel{d}{=} \gamma_{1:t}\, Y_0 + \sigma_{1:t}\, A^{1/2}\, G_t\ ,$$
where $A \sim \mathcal{S}_{\alpha/2,1}(0, c_A)$ and $G_t \sim \mathcal{N}(0, I_d)$. As we formalize in the next proposition, it is indeed possible to bypass the sampling of a whole sequence. Since we manipulate the joint distribution of $(Y_0, Y_{t-1}, Y_t)$ for the loss term $L^L_{t-1}(\theta)$, we will actually need to sample two heavy-tailed random variables.
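The sketch below (ours; `sample_A` and `loss_term` are hypothetical helpers, not the authors' API) implements the naive loop above in its $O(T)$ form, which Proposition 8 then replaces with the two-variable trick.

```python
# Sketch of the naive training loop (steps 1-6), per-datapoint for clarity;
# a practical implementation would vectorize over the batch.
import torch

def training_step(model, optimizer, y0_batch, T, sample_A, loss_term):
    B = y0_batch.shape[0]
    ts = torch.randint(2, T + 1, (B,))                  # step 2: t ~ U[2, T]
    total = 0.0
    for i in range(B):
        t_i = int(ts[i])
        a_seq = sample_A(t_i)                           # step 3: A_1, ..., A_t
        loss_i = loss_term(model, y0_batch[i], t_i, a_seq)
        total = total + loss_i.sqrt()                   # r = 1/2, as in (22)
    loss = total / B                                    # step 5
    optimizer.zero_grad()
    loss.backward()                                     # step 6
    optimizer.step()
    return float(loss)
```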
Proposition 8 (Sampling two heavy-tailed r.v.s for each loss term). Suppose that the functions $\hat m^\theta_{t-1}, \hat\Sigma^\theta_{t-1}$ satisfy, for any $y_t, a_{1:t}$,
$$\hat m^\theta_{t-1}(y_t, a_{1:t}) = M^\theta_{t-1}\big(y_t, a_t, \Sigma_{1:t-1}(a_{1:t-1})\big)\ ,\qquad \hat\Sigma^\theta_{t-1} = S^\theta_{t-1}\big(y_t, a_t, \Sigma_{1:t-1}(a_{1:t-1})\big)$$
for some functions $M^\theta_{t-1}, S^\theta_{t-1}$, where $\Sigma_{1:t}(a_{1:t})$ is as given in (12). Then each term $\mathbb{E}[L^L_{t-1}(\theta, A_{1:t})]^{1/2}$ of the loss can be computed by sampling only two independent random variables $\bar A^t_0, \bar A^t_1$ distributed as $\mathcal{S}_{\alpha/2,1}(0, c_A)$:
$$\mathbb{E}\big[L^L_{t-1}(\theta, A_{1:t})\big]^{1/2} = \mathbb{E}\big[L^{\mathrm{Less}}_{t-1}(\theta, \bar A^t_{0,1})\big]^{1/2}\ ,$$
where $\bar A^t_{0,1} := (\bar A^t_0, \bar A^t_1)$, and
$$L^{\mathrm{Less}}_{t-1}(\theta, \bar A^t_{0,1}) = \mathbb{E}\left[\frac{1}{2}\Big(\log\frac{S^\theta_{t-1}(Z_t, \bar A^t_{0,1})}{\tilde\Sigma_{t-1}(\bar A^t_{0,1})} - 1\Big) + \frac{\tilde\Sigma_{t-1}(\bar A^t_{0,1}) + \big\|\tilde m_{t-1}(Z_t, Z_0, \bar A^t_{0,1}) - M^\theta_{t-1}(Z_t, \bar A^t_{0,1})\big\|^2}{2\, S^\theta_{t-1}(Z_t, \bar A^t_{0,1})}\right]\ , \tag{23}$$
with $\{Z_t\}_{t=0}^T$ a stochastic process defined as
$$Z_0 = Y_0\ ,\qquad Z_t = \gamma_{1:t}\, Z_0 + \Sigma^\star_t(\bar A^t_{0,1})^{1/2}\, G_t\ ,$$
where $\{G_t\}_{t=1}^T$ is an i.i.d. sequence distributed as $\mathcal{N}(0, I_d)$, and
$$\Sigma^\star_{t-1}(\bar A^t_0) = \sigma_{1:t-1}^2\,\bar A^t_0\ ,\qquad \Sigma^\star_t(\bar A^t_{0,1}) = \sigma_t^2\,\bar A^t_1 + \gamma_t^2\,\Sigma^\star_{t-1}(\bar A^t_0)\ ,\qquad \Gamma^\star_t(\bar A^t_{0,1}) = 1 - \frac{\gamma_t^2\,\Sigma^\star_{t-1}(\bar A^t_0)}{\Sigma^\star_t(\bar A^t_{0,1})}\ ,$$
such that $Z_t \stackrel{d}{=} Y_t$, and
$$\tilde m_{t-1}(Z_t, Z_0, \bar A^t_{0,1}) = \frac{1}{\gamma_t}\big(Z_t - \sigma_{1:t}\,\Gamma^\star_t(\bar A^t_{0,1})\,\epsilon_t(Z_t, Z_0)\big)\ ,\quad \tilde\Sigma_{t-1}(\bar A^t_{0,1}) = \Gamma^\star_t(\bar A^t_{0,1})\,\Sigma^\star_{t-1}(\bar A^t_0)\ ,\quad \epsilon_t(Z_t, Z_0) = \frac{Z_t - \gamma_{1:t}\, Z_0}{\sigma_{1:t}}\ .$$
In order to keep the notations similar for all $t \ge 2$, in the case of $L^{\mathrm{Less}}_1$ we always set $\bar A^2_0 = 0$.

Proof. Recall the full equation for the loss, first given in Proposition 7 (20):
$$L^L_{t-1}(\theta, A_{1:t}) = \mathbb{E}\left[\frac{1}{2}\Big(\log\frac{\hat\Sigma^\theta_{t-1}(A_{1:t})}{\tilde\Sigma_{t-1}(A_{1:t})} - 1\Big) + \frac{\tilde\Sigma_{t-1}(A_{1:t}) + \big\|\tilde m_{t-1}(Y_t, Y_0, A_{1:t}) - \hat m^\theta_{t-1}(Y_t, A_{1:t})\big\|^2}{2\,\hat\Sigma^\theta_{t-1}(A_{1:t})}\right]\ .$$
Now, all the required variables and computations only depend on $A_t$ and $\Sigma_{1:t-1}$; this is the case for $\hat m^\theta_{t-1}, \hat\Sigma^\theta_{t-1}$ by hypothesis, and this is the case for $\tilde m_{t-1}, \tilde\Sigma_{t-1}$ as one can see in (13). Rewriting the previous loss as
$$L^L_{t-1}(\theta, A_{1:t}) = \mathbb{E}\left[\frac{1}{2}\Big(\log\frac{S^\theta_{t-1}\big(Y_t, A_t, \Sigma_{1:t-1}(A_{1:t-1})\big)}{\tilde\Sigma_{t-1}(A_{1:t})} - 1\Big) + \frac{\tilde\Sigma_{t-1}(A_{1:t}) + \big\|\tilde m_{t-1}(Y_t, Y_0, A_{1:t}) - M^\theta_{t-1}\big(Y_t, A_t, \Sigma_{1:t-1}(A_{1:t-1})\big)\big\|^2}{2\, S^\theta_{t-1}\big(Y_t, A_t, \Sigma_{1:t-1}(A_{1:t-1})\big)}\right]\ ,$$
it becomes clear that the expectation can be taken with respect to the joint distribution of $\big(Y_0, Y_{t-1}, Y_t, A_t, \Sigma_{1:t-1}(A_{1:t-1})\big)$. A direct application of Lemma 1 shows that this expectation can equivalently be taken with respect to the joint distribution of the five random variables $(Z_0, Z_{t-1}, Z_t, \bar A^t_1, \bar A^t_0)$, which only necessitates sampling two heavy-tailed random variables $\bar A^t_0, \bar A^t_1$. Using the formulas for $Z_0, Z_{t-1}$ and $Z_t$ given $\bar A^t_0, \bar A^t_1$ as defined in Lemma 1, we obtain the equivalent loss (23).

As we prove next, the conditions of Proposition 8 are always satisfied under design choices D1, D2. Under design choice D3, we can also rewrite the simplified denoising loss given in Proposition 2.

Proposition 9 (Sampling one heavy-tailed r.v. in the simplified loss). Assume the design choices D1, D2, D3 are satisfied. Then one can obtain the following simplified denoising objective function from the full objective function given in (23):
$$L^{\mathrm{Simple}}_{\mathrm{Less}}(\theta) = \sum_{t=1}^{T}\mathbb{E}\Big[\mathbb{E}\big[\big\|\hat\epsilon^\theta_t(Z_t) - \bar A_t^{1/2}\,\bar G_t\big\|^2\,\big|\,\bar A_t\big]^{1/2}\Big]\ ,$$
where $\{\bar A_t\}_{t=1}^T$ is an i.i.d. sequence distributed as $\mathcal{S}_{\alpha/2,1}(0, c_A)$, and
$$Z_t = \gamma_{1:t}\, Z_0 + \sigma_{1:t}\,\bar A_t^{1/2}\,\bar G_t\ ,$$
with $\{\bar G_t\}_{t=1}^T$ an i.i.d. sequence distributed as $\mathcal{N}(0, I_d)$.

Proof. Let us recall design choice D1, $\hat\Sigma^\theta_{t-1}(A_{1:t}) = \Gamma_t(A_{1:t})\,\Sigma_{1:t-1}(A_{1:t-1})$, and design choice D2,
$$\hat m^\theta_{t-1}(Z_t, A_{1:t}) = \frac{1}{\gamma_t}\big(Z_t - \sigma_{1:t}\,\Gamma_t(A_{1:t})\,\hat\epsilon^\theta_t(Z_t)\big)\ .$$
Since $\Gamma_t$ only depends on $\Sigma_{1:t}$ and $\Sigma_{1:t-1}$, both $\hat m^\theta_{t-1}$ and $\tilde\Sigma_{t-1}$ can be expressed as functions of $z_t, a_t$ and $\Sigma_{1:t-1}(a_{1:t-1})$; thus the assumptions of Proposition 8 are satisfied. Using the same notations, we can apply the same algebraic transformations as in (21), and by design choice D3 obtain:
$$L^{\mathrm{Less}}(\theta) = \sum_{t=1}^{T}\mathbb{E}\Big[\mathbb{E}\big[\big\|\hat\epsilon^\theta_t(Z_t) - \epsilon_t(Z_t, Z_0)\big\|^2\,\big|\,\bar A^t_0, \bar A^t_1\big]^{1/2}\Big]\ .$$
Finally, we apply Lemma 2 to $\Sigma^\star_t$ to affirm that $\Sigma^\star_t \stackrel{d}{=} \sigma_{1:t}^2\,\bar A_t$, where $\bar A_t \sim \mathcal{S}_{\alpha/2,1}(0, c_A)$, and obtain the final loss we presented.
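The simplified objective of Proposition 9 is the loss actually used for training. The sketch below is ours (not the authors' code); `a_t` is assumed to be drawn per datapoint as in the insert after Appendix C.2, shaped to broadcast against the data, and the inner conditional expectation is approximated with a single draw before taking the square root corresponding to $r = 1/2$.

```python
# Sketch of the simplified DLPM objective (Proposition 9): one heavy-tailed
# variable per datapoint.
import torch

def dlpm_simple_loss(model, y0, t, gamma_1t, sigma_1t, a_t):
    g = torch.randn_like(y0)
    eps = torch.sqrt(a_t) * g                 # eps_t = A^{1/2} G, alpha-stable
    z_t = gamma_1t * y0 + sigma_1t * eps      # Z_t has the law of Y_t (Lemma 2)
    eps_hat = model(z_t, t)                   # predicts eps_t from (Z_t, t)
    sq = ((eps_hat - eps) ** 2).flatten(1).sum(dim=1)
    return sq.sqrt().mean()                   # square root, then batch average
```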
We stress that this denoising training loss is similar to that of LIM (Yoon et al., 2023, Theorem 4.3), but raised to the power necessary to guarantee that the loss is finite. See Appendix E.1 for a more detailed discussion.

D DENOISING LÉVY IMPLICIT MODELS (DLIM)

Using the same techniques as in DDIM (Song et al. (2020)), we obtain a deterministic sampling process which we naturally call Denoising Lévy Implicit Models (DLIM). As in the Gaussian case treated in the original DDIM work, we will show that DLPM and DLIM can share the same neural network.

D.1 NON-MARKOVIAN FORWARD PROCESS

Let $\{\rho_t\}_{t=1}^T$ be an alternative noise schedule, proper to DLIM, that will ultimately tend to zero for deterministic generation. In the same way as in Section 3.2.2, we take a data augmentation approach. We consider a process $\{Z_t\}_{t=1}^T$ defined by $Z_0 \sim p^\star$, where $p^\star$ is the data distribution, $Z_T \sim \mathcal{S}(\gamma_{1:T} Z_0, \sigma_{1:T} I_d)$ and, for $1 < t \le T$,
$$Z_{t-1} = \gamma_{1:t-1}\, Z_0 + \big(\sigma_{1:t-1}^{\alpha} - \rho_t^{\alpha}\big)^{1/\alpha}\,\epsilon_t(Z_t, Z_0) + \rho_t\, A_t^{1/2}\, G_t\ ,$$
where $\epsilon_t(Z_t, Z_0) = (Z_t - \gamma_{1:t} Z_0)/\sigma_{1:t}$, and $\{A_t\}_{t=1}^T, \{G_t\}_{t=1}^T$ are independent random variables distributed according to $A_t \sim \mathcal{S}_{\alpha/2,1}(0, c_A)$ and $G_t \sim \mathcal{N}(0, I_d)$.

Proposition 10. The distribution of $Z_t$ given $Z_0$ is the same as that of $Y_t$ given $Y_0$.

Proof. This is a simple proof by induction, re-adapting the technique of (Song et al., 2020, Lemma B.1) with the addition property of $\alpha$-stable variables introduced in Proposition 3. The case $t = T$ is true by construction. Suppose now that the property is verified at timestep $t$, where $1 \le t \le T$. Then, focusing on the distribution of $Z_{t-1}$ given $Z_0$: $\epsilon_t(Z_t, Z_0) = (Z_t - \gamma_{1:t} Z_0)/\sigma_{1:t}$ is distributed as $\mathcal{S}i_\alpha(0, I_d)$ by hypothesis, and thus, by Proposition 3 and since $A_t^{1/2} G_t \sim \mathcal{S}i_\alpha(0, I_d)$, we can write
$$Z_{t-1} \stackrel{d}{=} \gamma_{1:t-1}\, Z_0 + \sigma_{1:t-1}\,\epsilon'_t\ ,\qquad \epsilon'_t \sim \mathcal{S}i_\alpha(0, I_d)\ ,$$
which shows that $Z_{t-1}$ given $Z_0$ indeed admits the same distribution as $Y_{t-1}$ given $Y_0$.

The design of this process makes the distribution of $Z_t$ given $Z_0$ match that of $Y_t$ given $Y_0$, where $\{Y_t\}_{t=0}^T$ is the forward process of DLPM (5). Sampling from it is thus efficient and straightforward.

D.2 GENERATIVE PROCESS

We similarly reconsider the family of Gaussian variational approximations introduced in (4), which accounts for an i.i.d. sequence $\{A_t\}_{t=1}^T$:
$$\bar q^\theta_{0:T}(y_{0:T}) = \int \psi^{(\alpha)}_{1:T}(a_{1:T})\,\bar p_T(y_T)\prod_{t=1}^{T}\bar q^\theta_{t-1|t,a}(y_{t-1}\mid y_t, a_{1:t})\,\mathrm{d}a_{1:T}\ ,$$
where $\psi^{(\alpha)}_{1:T}(a_{1:T}) = \prod_{i=1}^{T}\psi^{(\alpha)}(a_i)$, $\psi^{(\alpha)}$ is the density of $\mathcal{S}_{\alpha/2,1}(0, c_A)$, and $\bar p_T$ is the density of $\mathcal{S}(0, \sigma_{1:T} I_d)$. We set
$$\bar q^\theta_{t-1|t,a}(z_{t-1}\mid z_t, a_{1:t}) = \phi_d\big(z_{t-1}\mid \hat m^\theta_{t-1}(z_t),\ \rho_t^2\, a_t\, I_d\big)\ ,$$
with $\phi_d$ being the density of the multivariate Gaussian distribution: the variance is fixed, determined by the alternative noise schedule $\{\rho_t\}_{t=1}^T$. For deterministic sampling, one will ultimately choose $\rho_t = 0$ for all $t$ and sample the chain $\{\bar Z^\theta_t\}_{t=0}^T$ as follows:
$$\bar Z^\theta_T \sim \mathcal{S}(0, \sigma_{1:T} I_d)\ ,\qquad \bar Z^\theta_{t-1} = \hat m^\theta_t(\bar Z^\theta_t)\quad \text{for } t \in \{T, \ldots, 1\}\ .$$
As $z_t$ is available as input, the model can be fit to approximate the value of $\epsilon_t(z_t, z_0)$, and we reparameterize $\hat m^\theta_t$ as follows:
$$\hat m^\theta_t(z_t) = \frac{z_t - \sigma_{1:t}\,\hat\epsilon^\theta_t(z_t)}{\gamma_t} + \big(\sigma_{1:t-1}^{\alpha} - \rho_t^{\alpha}\big)^{1/\alpha}\,\hat\epsilon^\theta_t(z_t)\ . \tag{24}$$
This is akin to design choice D2 in Appendix C.4.2.
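A single DLIM update is one line of code once $\hat\epsilon^\theta_t$ is available. The sketch below is ours; argument names are assumptions, and setting `rho_t = 0` gives the fully deterministic sampler.

```python
# Sketch of one DLIM update, following the reparameterization (24).
import torch

@torch.no_grad()
def dlim_step(model, z_t, t, gamma_t, sigma_1t, sigma_1tm1, alpha, rho_t=0.0):
    eps_hat = model(z_t, t)
    jump = (sigma_1tm1 ** alpha - rho_t ** alpha) ** (1.0 / alpha)
    return (z_t - sigma_1t * eps_hat) / gamma_t + jump * eps_hat
```

With $\rho_t = 0$ the coefficient `jump` reduces to $\sigma_{1:t-1}$, so the same network trained for DLPM can be reused unchanged, only the update rule differs.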
D.3 LOSS FUNCTION AND EQUIVALENCE WITH DLPM

We denote by $h^{(\alpha)}_{t-1|t,0,a}$ the density of $Z_{t-1}$ given $Z_t, Z_0$ and $A_{1:T}$, which is the density of the Gaussian distribution with mean $\gamma_{1:t-1} Z_0 + (\sigma_{1:t-1}^{\alpha} - \rho_t^{\alpha})^{1/\alpha}\,\epsilon_t(Z_t, Z_0)$ and covariance $\rho_t^2\, A_t\, I_d$. Since this distribution is now a given, we are inclined to use the loss function introduced in (7), which is
$$L^L(\theta) := \mathbb{E}\Big[\sum_{t=1}^{T} L^L_{t-1}(\theta, A_{1:t})^{1/2}\Big]\ ,\qquad L^L_{t-1}(\theta, A_{1:t}) := \mathbb{E}\big[\mathrm{KL}\big(h^{(\alpha)}_{t-1|t,0,a}(\cdot\mid Z_t, Z_0, A_{1:t})\,\big\|\,\bar q^\theta_{t-1|t,a}(\cdot\mid Z_t, A_{1:t})\big)\,\big|\,A_{1:T}\big]\ .$$
Since for $2 \le t \le T$ both $h^{(\alpha)}_{t-1|t,0,a}$ and $\bar q^\theta_{t-1|t,a}$ are densities of Gaussian distributions, we can analytically compute each term of the loss, as in (19):
$$L^L_{t-1}(\theta, A_{1:t}) = \frac{1}{2\rho_t^2 A_t}\,\big\|\gamma_{1:t-1} Z_0 + (\sigma_{1:t-1}^{\alpha} - \rho_t^{\alpha})^{1/\alpha}\,\epsilon_t(Z_t, Z_0) - \hat m^\theta_t(Z_t, A_{1:t})\big\|^2\ ,$$
where $\epsilon_t(Z_t, Z_0) = (Z_t - \gamma_{1:t} Z_0)/\sigma_{1:t}$. Since the variance of the elements of our variational family $\{\bar q^\theta_{0:T}\}$ has been designed to match that of the backward process given $Z_0, A_{1:T}$, the expression for the loss readily comes in a simpler form. Finally, using the reparameterization given in (24), the loss term $L^L_{t-1}(\theta, A_{1:t})$ becomes
$$L^L_{t-1}(\theta, A_{1:t}) = \lambda'_{t,A_t}\,\big\|\epsilon_t(Z_t, Z_0) - \hat\epsilon^\theta_t(Z_t)\big\|^2\ ,$$
where $\lambda'_{t,a_t} = \big(\sigma_{1:t} - (\sigma_{1:t-1}^{\alpha} - \rho_t^{\alpha})^{1/\alpha}\big)^2\big/\big(2\rho_t^2\, a_t\big)$. By comparing with the simpler DLPM loss (21) under design choices D1, D2, as introduced in Appendix C.4.2, we see that we obtain the same loss term, with a different multiplicative factor $\lambda'_{t,a_t}$ instead of $\lambda_{t,a_{1:t}}$ in (21). Finally, considering the alternative loss where $\lambda'_{t,a_t} = 1$ for all $t$, akin to design choice D3 in Appendix C.4.2, we fall back to the same simplified objective function obtained for DLPM:
$$L^{\mathrm{Simple}}(\theta) = \sum_{t=1}^{T}\mathbb{E}\Big[\mathbb{E}\big[\big\|\hat\epsilon^\theta_t(Y_t) - \epsilon_t(Y_t, Y_0)\big\|^2\,\big|\,A_{1:t}\big]^{1/2}\Big]\ .$$

D.4 CAUCHY DLIM

In the special case of a non-isotropic Cauchy distribution ($\alpha = 1$), it is possible to bypass the data augmentation machinery, since there exists a closed-form formula for the KL divergence between two Cauchy distributions. Denote by $\mathrm{Cauchy}(\mu, \sigma)$ the one-dimensional Cauchy distribution centered at $\mu$ and of scale $\sigma$. Then (Chyzak & Nielsen, 2019, Theorem 1):
$$\mathrm{KL}\big(\mathrm{Cauchy}(\mu_1, \sigma_1)\,\big\|\,\mathrm{Cauchy}(\mu_2, \sigma_2)\big) = \log\frac{(\mu_1 - \mu_2)^2 + (\sigma_1 + \sigma_2)^2}{4\,\sigma_1\sigma_2}\ . \tag{25}$$
The forward process $\{Z_t\}_{t=1}^T$ is defined with $Z_0 \sim p^\star$, where $p^\star$ is the data distribution, $Z_T \sim \mathcal{S}(\gamma_{1:T} Z_0, \sigma_{1:T} I_d)$ and, for $1 < t \le T$,
$$Z_{t-1} = \gamma_{1:t-1}\, Z_0 + \big(\sigma_{1:t-1}^{\alpha} - \rho_t^{\alpha}\big)^{1/\alpha}\,\epsilon_t(Z_t, Z_0) + \rho_t\,\epsilon^{(\alpha)}_t\ ,$$
where $\epsilon_t(Z_t, Z_0) = (Z_t - \gamma_{1:t} Z_0)/\sigma_{1:t}$, and $\{\epsilon^{(\alpha)}_t\}_{t=1}^T \sim \mathrm{Cauchy}(0, I_d)^{\otimes T}$, where $\mathrm{Cauchy}(0, I_d)$ is a $d$-dimensional Cauchy distribution with independent components, centered at 0, of unit scale. The distribution of $Z_t$ given $Z_0$ admits the same closed-form expression given in Proposition 10. We denote by $h^{(\alpha)}_{t-1|t,0}$ the density of $Z_{t-1}$ given $Z_t, Z_0$. Our generative process will be an element of a parameterized family of distributions admitting Cauchy transitions:
$$\bar q^\theta_{0:T}(y_{0:T}) = \bar p_T(y_T)\prod_{t=1}^{T}\bar q^\theta_{t-1|t}(y_{t-1}\mid y_t)\ ,$$
where $\bar p_T$ is the density of $\mathcal{S}(0, \sigma_{1:T} I_d)$, and $\bar q^\theta_{t-1|t}(z_{t-1}\mid z_t) = \mathcal{C}\big(z_{t-1}\mid \hat m^\theta_{t-1}(z_t), \rho_t I_d\big)$, where $\mathcal{C}(\cdot\mid \hat m^\theta_{t-1}(z_t), \rho_t I_d)$ is the density of the multivariate non-isotropic Cauchy distribution $\mathrm{Cauchy}(\hat m^\theta_{t-1}(z_t), \rho_t I_d)$. Instead of using the loss function $L^L$ defined in (7), we derive the loss via the conventional evidence lower bound (ELBO) approach (see, e.g., Ho et al. (2020)). Omitting extremal terms, this yields:
$$L^{\mathrm{Cauchy}}(\theta) := \sum_{t=2}^{T} L^{\mathrm{Cauchy}}_{t-1}(\theta)\ ,\qquad L^{\mathrm{Cauchy}}_{t-1}(\theta) := \mathbb{E}\big[\mathrm{KL}\big(h^{(\alpha)}_{t-1|t,0}(\cdot\mid Z_t, Z_0)\,\big\|\,\bar q^\theta_{t-1|t}(\cdot\mid Z_t)\big)\big]\ .$$
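Since the transition densities factorize over coordinates, each KL term is a sum of one-dimensional Cauchy KLs given by (25). The following sketch is our implementation of that formula, not the authors' code.

```python
# Sketch of the closed-form Cauchy KL (25), applied elementwise.
import torch

def cauchy_kl(mu1, sigma1, mu2, sigma2):
    """KL(Cauchy(mu1, sigma1) || Cauchy(mu2, sigma2))."""
    num = (mu1 - mu2) ** 2 + (sigma1 + sigma2) ** 2
    return torch.log(num / (4.0 * sigma1 * sigma2))
```

Summing this over coordinates with $\sigma_1 = \sigma_2 = \rho_t$ yields the closed-form loss derived next.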
Using (25), we obtain a closed-form formula for the final loss:
$$L^{\mathrm{Cauchy}}_{t-1}(\theta) = \mathbb{E}\left[\sum_{i=1}^{d}\log\frac{\big(\tilde m_{i,t-1}(Z_t, Z_0) - \hat m^\theta_{i,t-1}(Z_t)\big)^2 + 4\rho_t^2}{4\rho_t^2}\right]\ ,$$
which could also serve as a template for another family of losses for heavy-tailed diffusion models. We leave these methodological explorations, and possible extensions to the isotropic Cauchy case, for future work. Based on our experimental findings, we expect an isotropic implementation to significantly outperform a non-isotropic one. We outline again that such simplifications are not available for DLPM, since we are not able to characterize the distribution of $X_{t-1}$ given $X_t, X_0$ from the forward process $\{X_t\}_{t=0}^T$.

E ADDITIONAL INFORMATION ON LÉVY–ITÔ MODELS (LIM)

Here we briefly recapitulate the work done by Yoon et al. (2023), introducing continuous diffusion models with $\alpha$-stable heavy-tailed noise. Using notations closer to ours, we define the noising schedule as any locally bounded continuous functions $\gamma : (t, X) \mapsto \gamma(t, X)$ and $\sigma : t \mapsto \sigma(t)$. We denote by $L^\alpha_t$ the Lévy process whose increments between times $s < t$ follow a symmetric isotropic $\alpha$-stable distribution $\mathcal{S}i_\alpha(0, (t-s) I_d)$. In this setting, the forward process $X_t$, with $X_0 \sim p^\star$, is written
$$\mathrm{d}X_t = \gamma(t, X_{t^-})\,\mathrm{d}t + \sigma(t)\,\mathrm{d}L^\alpha_t\ , \tag{26}$$
where $X_{t^-}$ denotes the left limit of $X$ at time $t$. Similarly, $X_t$ is distributed as $\mathcal{S}i_\alpha(\gamma_{1:t} X_0, \sigma_{1:t} I_d)$ when using Euler steps. This defines the càdlàg (right continuous with left limits) solution, which in the case $\alpha < 2$ a.s. admits discontinuous jumps. We then consider the following backward process $\bar X_t$:
$$\mathrm{d}\bar X_t = \big(\gamma(t, \bar X_{t^+}) - \alpha\,\sigma^{\alpha}(t)\, S^{(\alpha)}_t(\bar X_{t^+})\big)\,\mathrm{d}t + \sigma(t)\,\mathrm{d}\bar L^\alpha_t + \mathrm{d}\bar Z_t\ , \tag{27}$$
where $\bar Z_t$ is the backward version of a Lévy-type stochastic integral $Z_t$ such that $\mathbb{E}[Z_t] = 0$ with finite variation, and $S^{(\alpha)}_t$ is the fractional score function, defined to be
$$S^{(\alpha)}_t(x) = \frac{\Delta^{(\alpha-2)/2}\,\nabla p_t(x)}{p_t(x)}\ ,$$
where $\Delta^{\eta/2}$ denotes the fractional Laplacian of order $\eta/2$ (Ortigueira et al. (2014)). More precisely, $\Delta^{\eta/2} f(x) = \mathcal{F}^{-1}\{\|u\|^{\eta}\,\mathcal{F}\{f\}(u)\}(x)$, where $\mathcal{F}, \mathcal{F}^{-1}$ are the Fourier and inverse Fourier transforms. The training loss is obtained using the classical technique of denoising score matching (Vincent (2011)), where the following losses
$$L : \theta \mapsto \mathbb{E}\big[\big\|s_\theta(X_t, t) - S^{(\alpha)}_t(X_t)\big\|^2\big]\ ,\qquad \bar L : \theta \mapsto \mathbb{E}\big[\big\|s_\theta(X_t, t) - S^{(\alpha)}_t(X_t\mid X_0)\big\|^2\big]\ , \tag{28}$$
are proven to be equivalent objective functions, with $s_\theta$ being the score approximation given by the model.

E.1 COMPARING LIM AND DLPM

Let $(X_t)_{0\le t\le T}$ be the forward process introduced in (26). As stressed initially, the framework of LIM is not straightforward to manipulate, so we do not characterize explicitly the distribution of $X_t$ given $X_0$ for an arbitrary noise schedule in the continuous case. Since the work done for LIM by Yoon et al. (2023) only provides the formulas for the scale-preserving schedule, we stick to it in the following: we keep the notation $\gamma_{1:t}, \sigma_{1:t}$ for the continuous-time equivalent of the scale-preserving schedule we introduce in Appendix G, which matches it at integer times $t$. Considering an Euler scheme to obtain discretizations of the forward and backward processes, and using our own notations, both LIM and DLPM admit the same forward process $\{X_t\}_{t=1}^T$, $X_0 \sim p^\star$ and
$$X_t = \gamma_t\, X_{t-1} + \sigma_t\,\epsilon^{(\alpha)}_t\ ,$$
where $\{\epsilon^{(\alpha)}_t\}_{t=1}^T$ is an i.i.d. sequence of random variables distributed as $\mathcal{S}i_\alpha(0, I_d)$. We denote by $\{\bar X^\theta_t\}_{t=T}^{0}$ the backward process associated to the Euler discretization of (27), where we use a neural network $s_\theta$ to approximate the true score $S^{(\alpha)}_t$.
Since the true score of the data $S^{(\alpha)}_t(x_t\mid x_0)$ can be expressed as
$$S^{(\alpha)}_t(x_t\mid x_0) = -\frac{1}{\alpha\,\sigma_{1:t}^{\alpha-1}}\,\epsilon_t(x_t, x_0)\ ,$$
where $\epsilon_t(x_t, x_0) = (x_t - \gamma_{1:t} x_0)/\sigma_{1:t}$, we write
$$s_\theta(x_t, t) = -\frac{1}{\alpha\,\sigma_{1:t}^{\alpha-1}}\,\hat\epsilon^\theta_t(x_t)\ ,$$
so that we rather work with $\hat\epsilon^\theta_t$, with the same intention that led us to the design choices given in Appendix C.4.2. Moreover, we denote by $\{\bar Y^\theta_t\}_{t=0}^T$ the backward process of DLPM, as introduced in (16). As emphasized in Table 4, the sampling strategies for LIM and DLPM differ fundamentally when $\alpha \ne 2$. This is also the case for the training procedure.

Stochastic sampling. The DLPM approach introduces the bounded random variable $0 \le \Gamma_t \le 1$, interacting with the mean and variance of the Gaussian conditional at hand. Three points: first, when $\alpha = 2$, $\Gamma_t$ becomes deterministic and one recovers the DDPM formulas. Second, $\Gamma_t$ brings additional stochasticity into the sampling process. Third, it does so in the interesting manner that it simultaneously scales both (i) the magnitude of the noise added at time $t-1$ and (ii) the output of the noise model.

Deterministic sampling. In the case of the scale-preserving schedule, these two equations do not describe the same sampling procedure.

Table 4: Distribution of $\bar X^\theta_{t-1}$, $\bar Y^\theta_{t-1}$.

                       Stochastic                                                                        Deterministic
Continuous (LIM):      $\dfrac{\bar X^\theta_t}{\gamma_t} - \alpha\Big(\dfrac{1}{\gamma_t} - 1\Big)\hat\epsilon^\theta_t + \Big(\dfrac{1}{\gamma_t^{\alpha}} - 1\Big)^{1/\alpha}\epsilon'_t$          $\dfrac{\bar X^\theta_t}{\gamma_t} - \Big(\dfrac{\sigma^{\alpha-1}_{1:t}}{\gamma_t} - \sigma^{\alpha-1}_{1:t-1}\Big)\hat\epsilon^\theta_t$
Denoising (DLPM):      $\dfrac{1}{\gamma_t}\big(\bar Y^\theta_t - \Gamma_t\,\sigma_{1:t}\,\hat\epsilon^\theta_t\big) + \sqrt{\Gamma_t\,\Sigma_{1:t-1}}\; G'_t$          $\dfrac{1}{\gamma_t}\big(\bar Y^\theta_t - \sigma_{1:t}\,\hat\epsilon^\theta_t\big) + \sigma_{1:t-1}\,\hat\epsilon^\theta_t$

$\{G'_t\}_{t=T}^{1}$ are independent random variables distributed as $\mathcal{N}(0, I_d)$, $\{\epsilon'_t\}_{t=T}^{1}$ are independent random variables distributed as $\mathcal{S}i_\alpha(0, I_d)$. $\hat\epsilon^\theta_t$ is the model at hand at time $t$, the formula for $\Sigma_{1:t}$ is given in (12), and $\Gamma_t = 1 - \gamma_t^2\,\Sigma_{1:t-1}/\Sigma_{1:t}$. Even though $\Gamma_t$ involves two heavy-tailed random variables, it is bounded: $0 \le \Gamma_t \le 1$ (see Appendix C.3).

Training. As in the Gaussian case ($\alpha = 2$), the score $S^{(\alpha)}_t(x_t\mid x_0)$ is a linear expression of the noise term $\epsilon_t(x_t, x_0)$, so the training equations are very similar, and can be reformulated to involve a denoising loss:
$$L_{t-1} : \theta \mapsto \mathbb{E}\big[\big\|\hat\epsilon^\theta_t(X_t) - \epsilon_t(X_t, X_0)\big\|_p^{\eta}\big]\ . \tag{29}$$
In the case of DLPM, our discussion leads us to the choice $p = 2$ and $\eta = 1$ (see (8)). In the case of LIM, the theory must rely on the choice $p = 2$ and $\eta = 2$ in order to obtain the denoising score matching loss equivalence (i.e., $L, \bar L$ are equivalent in (28)). One must make the assumption that the losses $L, \bar L$ are not infinite for some $\theta$, which is not necessarily realistic, because $S_t(X_t), S_t(X_t\mid X_0)$ are heavy-tailed random variables involving $\alpha$-stable noise, and as such admit no variance. In the LIM experiments, the parameters $p = 1$ and $\eta = 1$ are chosen instead of the previous squared loss, in order to obtain more stable training, potentially indicating that $L, \bar L$ in (28) might indeed be infinite.
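The denoising row of Table 4 is directly implementable from Proposition 6. The sketch below is ours (schedule arguments are assumptions); it performs one stochastic DLPM backward step.

```python
# Sketch of one stochastic DLPM backward step (denoising row of Table 4).
import torch

@torch.no_grad()
def dlpm_step(model, y_t, t, gamma_t, sigma_1t, Sigma_1tm1, Sigma_1t):
    Gamma_t = 1.0 - gamma_t ** 2 * Sigma_1tm1 / Sigma_1t   # bounded in [0, 1]
    eps_hat = model(y_t, t)
    mean = (y_t - Gamma_t * sigma_1t * eps_hat) / gamma_t
    noise_scale = (Gamma_t * Sigma_1tm1) ** 0.5
    return mean + noise_scale * torch.randn_like(y_t)
```

Note that $\Gamma_t$ scales both the model output in the mean and the injected noise, which is the coupling discussed in the "Stochastic sampling" paragraph above.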
F TECHNICAL RESULTS

In this section, we give the proofs related to our technique for faster training, as introduced in Appendix C.5.

Lemma 1. Let $\bar A^t_0, \bar A^t_1$ be two independent random variables distributed as $\mathcal{S}_{\alpha/2,1}(0, c_A)$. Define $Z_0 = Y_0$, and
$$Z_t = \gamma_{1:t}\, Z_0 + \sigma_{1:t}\,(\bar A^t_1)^{1/2}\, G_t\ .$$
Moreover, let $Z_{t-1}$ be equal to
$$Z_{t-1} = \frac{1}{\gamma_t}\big(Z_t - \Gamma^\star_t(\bar A^t_{0,1})\,\sigma_{1:t}\,\epsilon_t(Z_t, Z_0)\big) + \big(\Gamma^\star_t(\bar A^t_{0,1})\big)^{1/2}\,\sigma_{1:t-1}\,(\bar A^t_0)^{1/2}\, G_{t-1}\ ,$$
where
$$\Gamma^\star_t(\bar A^t_{0,1}) = \frac{\bar A^t_1\,\sigma_t^2}{\bar A^t_1\,\sigma_t^2 + \gamma_t^2\,\sigma_{1:t-1}^2\,\bar A^t_0}\ ,\qquad \epsilon_t(Z_t, Z_0) = \frac{Z_t - \gamma_{1:t}\, Z_0}{\sigma_{1:t}}\ .$$
Then the joint distribution of $(Z_0, Z_{t-1}, Z_t, \bar A^t_1, \bar A^t_0)$ matches the joint distribution of $\big(Y_0, Y_{t-1}, Y_t, A_t, \Sigma_{1:t-1}(A_{1:t-1})/\sigma_{1:t-1}^2\big)$.

Proof. Consider the setting of Proposition 6. The distribution of $Y_{t-1}$ given $Y_t, Y_0, A_{1:T}$ is characterized by the values of $\Sigma_{1:t}, \Gamma_t$:
$$\tilde m_{t-1} = \frac{1}{\gamma_t}\big(Y_t - \Gamma_t\,\sigma_{1:t}\,\epsilon_t(Y_t, Y_0)\big)\ ,\qquad \tilde\Sigma_{t-1}(A_{1:t}) = \Gamma_t(A_{1:t})\,\Sigma_{1:t-1}(A_{1:t-1})\ ,\qquad \Sigma_{1:t}(A_{1:t}) = \sigma_t^2 A_t + \gamma_t^2\,\Sigma_{1:t-1}(A_{1:t-1})\ ,$$
$$\epsilon_t(Y_t, Y_0) = \frac{Y_t - \gamma_{1:t}\, Y_0}{\sigma_{1:t}}\ ,\qquad \Gamma_t(A_{1:t}) = 1 - \frac{\gamma_t^2\,\Sigma_{1:t-1}(A_{1:t-1})}{\Sigma_{1:t}(A_{1:t})}\ .$$
Directly applying the result of Lemma 2, we can affirm that $\Sigma_{1:t-1}(A_{1:t-1}) \stackrel{d}{=} \sigma_{1:t-1}^2\,\bar A^t_0$, where $\bar A^t_0 \sim \mathcal{S}_{\alpha/2,1}(0, c_A)$. In these conditions, the distribution of $\Gamma_t(A_{1:t})$ is equal to that of $\Gamma^\star_t(\bar A^t_{0,1})$, where
$$\Gamma^\star_t(\bar A^t_{0,1}) = 1 - \frac{\gamma_t^2\,\sigma_{1:t-1}^2\,\bar A^t_0}{\sigma_t^2\,\bar A^t_1 + \gamma_t^2\,\sigma_{1:t-1}^2\,\bar A^t_0}\ .$$
Since the distribution of $Z_t$ does not change if we draw another independent $\bar A^t_1$ instead of $A_t$, this ends the proof.

Lemma 2 (Sampling $\Sigma_{1:t}$ with a single heavy-tailed r.v.). Consider the setting of the data augmentation approach in Section 3.2.2, where in particular $\{A_t\}_{t=1}^T$ are independent random variables distributed according to $A_t \sim \mathcal{S}^1_{\alpha/2,1}(0, c_A)$, with $c_A = \cos^{2/\alpha}(\pi\alpha/4)$. Consider the random variable $\Sigma_{1:t}(A_{1:t})$, as defined in (12):
$$\Sigma_{1:t}(A_{1:t}) = \sum_{k=1}^{t}\gamma_{k+1:t}^{2}\,\sigma_k^{2}\,A_k\ .$$
Then $\Sigma_{1:t}(A_{1:t}) \stackrel{d}{=} \sigma_{1:t}^2\, A$, where $A \sim \mathcal{S}^1_{\alpha/2,1}(0, c_A)$.

Proof. By Proposition 5, $Y_t$ given $Y_0, A_{1:t}$ is a random variable distributed as a Gaussian with variance $\Sigma_{1:t}(A_{1:t})$:
$$Y_t \stackrel{d}{=} \gamma_{1:t}\, Y_0 + \sqrt{\Sigma_{1:t}(A_{1:t})}\; G_t\ ,$$
where $G_t$ is distributed as a standard Gaussian. Remember that $Y_t$ and $X_t$ admit the same distribution, with
$$X_t \stackrel{d}{=} \gamma_{1:t}\, X_0 + \sigma_{1:t}\,\epsilon_t\ ,$$
where $\epsilon_t$ is distributed as $\mathcal{S}i_\alpha(0, I_d)$. In the same spirit, we can define a third sequence of random variables $\{Z_t\}_{t=0}^T$ with $Z_0 = X_0$ and
$$Z_t = \gamma_{1:t}\, Z_0 + \sigma_{1:t}\,\sqrt{A'_t}\; G_t\ ,$$
where $\{A'_t\}_{t=0}^T$ are independent random variables distributed according to $A'_t \sim \mathcal{S}^1_{\alpha/2,1}(0, c_A)$. It is then quite clear from Section 2 that $Z_t$ and $Y_t$ admit the same distribution; in particular,
$$\sqrt{\Sigma_{1:t}}\; G_t \stackrel{d}{=} \sigma_{1:t}\,\sqrt{A'_t}\; G_t\ .$$
From there, we use Lemma 3 to conclude that $\Sigma_{1:t} \stackrel{d}{=} \sigma_{1:t}^2\, A'_t$, which ends the proof.

Lemma 3. Let $A, A'$ be positive real random variables, and let $Z$ be a real continuous random variable with density $p_Z$. Suppose that $AZ$ and $A'Z$ admit the same distribution. Then $A, A'$ admit the same distribution too.

Proof. Let $h$ be a measurable function. Then $\mathbb{E}[h(A)] = \mathbb{E}[h(AZ/Z)] = \mathbb{E}[h(A'Z/Z)] = \mathbb{E}[h(A')]$. This shows that $A, A'$ have the same distribution.

G ADDITIONAL EXPERIMENTAL DETAILS

All experiments are conducted using PyTorch. We use linear timesteps during training and sampling, and the scale-preserving process6, which is the only forward process readily provided by LIM. This entails choosing a sequence $\{\beta_t\}_{t=1}^T$ such that
$$\gamma_t = (1-\beta_t)^{1/\alpha}\ ,\qquad \sigma_t = (1-\gamma_t^{\alpha})^{1/\alpha}\ ,$$
resulting in $\sigma_{1:t} = (1-\gamma_{1:t}^{\alpha})^{1/\alpha}$ and $\gamma_{1:t} = \prod_{i=1}^{t}\gamma_i$. With this choice, we obtain approximately $X_T \sim \mathcal{S}_\alpha(0, I_d)$. We choose $\{\beta_i\}_{i=1}^T$ as the cosine schedule, as introduced by Nichol & Dhariwal (2021).

We do not give any of the heavy-tailed random variables $\{A_t\}_{t=1}^T$ as input to the neural network architecture, as we have witnessed worse performance in every scenario we tried: a learned embedding added to each model layer, concatenation to the model input, concatenation at each layer, or feeding $\log(A_{1:T})$ instead of $A_{1:T}$ to better manage large jumps. This corresponds to the design choice D2 in Appendix C.4.2. For image data generation with LIM, we use the same clipping hyper-parameters specified in Yoon et al. (2023). All training and experiments are conducted on four NVIDIA RTX8000 GPUs and four NVIDIA V100 GPUs, where a single training run on MNIST or CIFAR10_LT takes approximately 1 day per GPU, and requires about 4-12GB of VRAM for the batch sizes we use.
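For reference, the scale-preserving schedule described above combines with the cosine beta sequence of Nichol & Dhariwal (2021) as follows; this sketch is our implementation, with the usual clipping of the betas at 0.999.

```python
# Sketch of the scale-preserving schedule with cosine betas.
import numpy as np

def scale_preserving_schedule(T, alpha, s=0.008):
    ts = np.arange(T + 1) / T
    f = np.cos((ts + s) / (1 + s) * np.pi / 2) ** 2
    betas = np.clip(1 - f[1:] / f[:-1], 0.0, 0.999)       # cosine schedule
    gammas = (1 - betas) ** (1 / alpha)
    sigmas = (1 - gammas ** alpha) ** (1 / alpha)
    gamma_1t = np.cumprod(gammas)                          # gamma_{1:t}
    sigma_1t = (1 - gamma_1t ** alpha) ** (1 / alpha)      # sigma_{1:t}
    return gammas, sigmas, gamma_1t, sigma_1t
```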
Generating 5000 images with 1000 backward steps takes approximately 3-4 hours on one RTX8000 GPU.

6We mention again that it is traditionally called the variance-preserving process.

G.1 2D DATA

We give more details about the mixture of Gaussians we consider in our experiment. It is designed in a grid-like pattern in $[-1, 1]^2$, as follows:
$$\sum_{i=1}^{9} w_i\,\mathcal{N}(\mu_i, \sigma^2 I_2)\ ,$$
where $(w_i)_{i=1}^{9} = (0.01, 0.1, 0.3, 0.2, 0.02, 0.15, 0.02, 0.15, 0.05)$, $\mu_i = (\mu_1, \mu_2)$ with $\mu_1 = (i \bmod 3) - 1$, $\mu_2 = \lfloor i/3\rfloor - 1$, and $\sigma = 0.05$. For our 2D datasets, we use 32000 datapoints for training, a batch size of 1024, and 25000 points for evaluation. We train each model for 10000 steps. Since we do not focus on the effect of diffusion steps, we set their number to 100, where all methods have been observed to perform optimally. The optimizer is Adam (Kingma & Ba (2017)) with learning rate 5e-3. We use a neural network consisting of four time-conditioned MLP blocks with skip connections, each consisting of two fully connected layers of width 64. The time $t$ passes through two fully connected layers of size 32x32, and is fed to each time-conditioned block, where it passes through an additional 32x64 fully connected layer before being component-wise added to the middle layer. We compute a mean squared logarithmic error (MSLE) loss, designed to assess the fit to the tails of distributions. Since it depends on the one-dimensional cumulative distribution function, we calculate it after projecting the data onto each dimension; in this simple setting, we keep the score computed on the first dimension. We also compute the precision/recall metrics, as presented in Appendix G.3.

G.2 IMAGE DATA

We work on the MNIST and CIFAR10_LT datasets. CIFAR10_LT consists of the CIFAR10 images where artificial class imbalance has been introduced; the specific class counts we use are [5000, 2997, 1796, 1077, 645, 387, 232, 139, 83, 50]. The optimizer is Adam (Kingma & Ba (2017)) with learning rate 1e-3 for MNIST and 2e-4 for CIFAR10_LT. We use the StepLR scheduler, which scales the learning rate by γ = 0.99 every N = 1000 steps for CIFAR10_LT and N = 400 for MNIST. To establish a fair comparison, LIM and DLPM use the same network model. We use a U-Net following the implementation of Nichol & Dhariwal (2021) available at https://github.com/openai/improved-diffusion. We dimension the network as follows: we set the hidden layers to [128, 256, 256, 256], fix the number of residual blocks to 2 at each level, and add a self-attention block at resolution 16x16, using 4 heads. We use an exponential moving average with a rate of 0.99 for MNIST and 0.9999 for CIFAR10_LT. We use the SiLU activation function at every layer. Diffusion time $t$ is rescaled to $(0, 1)$ and fed to the model through the Transformer sinusoidal position embedding (Vaswani et al. (2023)). We train MNIST for 120000 steps with batch size 256 and a time horizon $T = 1000$, and CIFAR10_LT for 400000 steps with batch size 100 and a time horizon $T = 4000$. We use the FID metric for assessing the quality of our generative models, computing this metric between 5000 original images and 5000 generated images.
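For reproducibility, the grid-like mixture of Appendix G.1 can be sampled as follows; this sketch is ours and indexes the components from 0 so that the means cover the grid $\{-1, 0, 1\}^2$.

```python
# Sketch of the 2d Gaussian-mixture dataset of Appendix G.1.
import numpy as np

def sample_grid_mixture(n, seed=0):
    rng = np.random.default_rng(seed)
    w = np.array([0.01, 0.1, 0.3, 0.2, 0.02, 0.15, 0.02, 0.15, 0.05])
    idx = np.arange(9)
    mus = np.stack([idx % 3 - 1, idx // 3 - 1], axis=1).astype(float)
    comp = rng.choice(9, size=n, p=w)            # weights sum to 1
    return mus[comp] + 0.05 * rng.standard_normal((n, 2))
```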
G.3 METRICS FOR GENERATIVE MODELS

MSLE. We use a mean squared logarithmic error (MSLE) metric tailored to measure the fit on the tails of the distribution at hand. Drawing inspiration from Allouche et al. (2022), we define the MSLE as the squared distance between the logarithms of the inverse cumulative distributions of the original and generated data. If $F, \hat F$ denote respectively the cumulative distribution function of the original data and the empirical cumulative distribution function of the generated data, then
$$\mathrm{MSLE}(\xi) = \int_{\xi}^{1}\big(\log F^{-1}(p) - \log \hat F^{-1}(p)\big)^2\,\mathrm{d}p\ ,$$
where $\xi$ is chosen to be 0.95.

Precision/recall. These metrics were introduced in the setting of generative models by Sajjadi et al. (2018), and assess the overlap of sample distributions using local geometric structures. Precision measures how much of the generated distribution is contained in the original data distribution (measuring quality), and recall measures how much of the original data distribution is covered by the generated distribution (measuring diversity). We also consider the $F^{pr}_1$ score, which we define as the harmonic mean of these two values:
$$F^{pr}_1 = \frac{2\cdot\mathrm{precision}\cdot\mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}\ .$$

G.4 ADDITIONAL RESULTS

We provide some more results on MNIST and CIFAR10_LT, with FID for non-isotropic noise, and with the $F^{pr}_1$ metric for other methods (with clipping enabled in LIM and LIM-ODE). We also provide grid images in order to visually check the performance of DLPM.

Table 5: FID (lower is better) on MNIST; 1000 sampling steps for DLPMni, 25 sampling steps for DLIMni.

           α = 1.5   α = 1.7   α = 1.8   α = 1.9   α = 2.0
DLPMni      44.17     14.06      5.74      3.62       -
DLIMni      14.96     51.58     59.84     76.03       -

Table 6: MNIST, $F^{pr}_1$.

  α      DLIM    DLPM     LIM    LIM-ODE   DDPM
 1.7    0.884   0.887   0.857    0.869       -
 1.8    0.874   0.881   0.821    0.875       -
 1.9    0.877   0.878   0.700    0.808       -
 2.0    0.820   0.871   0.694    0.772     0.881

Table 7: CIFAR10_LT, $F^{pr}_1$.

  α      DLIM    DLPM     LIM    LIM-ODE   DDPM
 1.7    0.676   0.675   0.679    0.677       -
 1.8    0.669   0.680   0.677    0.673       -
 1.9    0.667   0.669   0.661    0.669       -
 2.0    0.664   0.667   0.660    0.665     0.666

Figure 6: MNIST, DLPM (α = 1.7).

Figure 7: CIFAR10_LT, DLPM (α = 1.7).
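Finally, the MSLE metric of Appendix G.3 can be estimated from samples with empirical quantiles. The sketch below is our implementation, approximating the integral on a grid of tail levels $p \in [\xi, 1)$ and assuming the tail quantiles are positive so that the logarithms are well defined.

```python
# Sketch of the MSLE metric (Appendix G.3) from 1-d samples.
import numpy as np

def msle(real, generated, xi=0.95, n_grid=200):
    ps = np.linspace(xi, 1.0, n_grid, endpoint=False)
    q_real = np.quantile(real, ps)        # F^{-1}(p)
    q_gen = np.quantile(generated, ps)    # \hat F^{-1}(p)
    return np.mean((np.log(q_real) - np.log(q_gen)) ** 2)
```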