# closedform_diffusion_models__da005855.pdf

Published in Transactions on Machine Learning Research (05/2025)

Closed-Form Diﬀusion Models

Christopher Scarvelis scarv@mit.edu MIT CSAIL

Haitz Sáez de Ocáriz Borde chri6704@ox.ac.uk University of Oxford

Justin Solomon jsolomon@mit.edu MIT CSAIL

Reviewed on Open Review: ht tp s: // op en re vi ew .n et /f or um ?i d= Jk Mi fr 17wc

Score-based generative models (SGMs) sample from a target distribution by iteratively transforming noise using the score function of the perturbed target. For any ﬁnite training set, this score function can be evaluated in closed form, but the resulting SGM memorizes its training data and does not generate novel samples. In practice, one approximates the score by training a neural network via score-matching. The error in this approximation promotes generalization, but neural SGMs are costly to train and sample, and the eﬀective regularization this error provides is not well-understood theoretically. In this work, we instead explicitly smooth the closed-form score to obtain an SGM that generates novel samples without training. We analyze our model and propose an eﬃcient nearest-neighbor-based estimator of its score function. Using this estimator, our method achieves competitive sampling times while running on consumer-grade CPUs.

1 Introduction

Score-based generative models (SGMs) draw samples from a target distribution Ä1 by sampling Gaussian noise and ﬂowing it through a possibly noisy velocity ﬁeld vt. This velocity ﬁeld depends on the score function of the perturbed target distribution Ät, which existing SGMs parameterize as a neural network and learn via score-matching (Hyvärinen & Dayan, 2005) or denoising (Vincent, 2011; Ho et al., 2020). Although the target distribution Ä1 (for example, the distribution over human face images) is typically assumed to be continuous, in practice score-matching and denoising problems are solved using an empirical approximation ˆÄ1 to the target distribution constructed from a ﬁnite training set.

When ˆÄ1 is the empirical distribution over a ﬁnite training set {xi}N i=1, the perturbed target distribution Ät is a mixture of Gaussians, whose score function log Ät(z) has a simple closed-form expression. This score function is a vector pointing from z toward a distance-weighted average of all N rescaled training points and is the exact solution to the score-matching problem (Miyasawa, 1961; Raphan & Simoncelli, 2011; Karras et al., 2022). By evaluating this closed-form score during sampling, one obtains a training-free sampler for ˆÄ1. While this approach seems tempting at ﬁrst glance, two ﬂaws render it unsuitable for real-world applications:

1. Many applications involve large training sets, prohibiting O(N) computation of the closed-form score.

2. Flowing base samples through the closed-form velocity ﬁeld simply outputs training samples xi, which is not useful in practice.

Existing work avoids these issues by neurally approximating the score of Ät. By compressing training data into the score model s weights, neural score functions replace a sum over N training points with a neural network

Published in Transactions on Machine Learning Research (05/2025)

evaluation whose complexity does not depend directly on N. Moreover, neural SGMs generate novel samples given ﬁnite training data thanks to approximation error (from limited model capacity) and optimization error (from undertraining) in learning the score (Pidstrigach, 2022; Yoon et al., 2023; Yi et al., 2023). While neural SGMs are successful, they are costly to train, and sampling them requires many (typically GPU-bound) neural network evaluations. Furthermore, the form of the error that enables neural SGMs to generalize is unknown, making it diﬃcult to characterize the distribution from which these models sample in practice.

Our key insight is that the ﬂaws of naïve closed-form SGMs (in particular, lack of generalization and poor scalability) can be addressed without resorting to costly black-box neural approximations. To this end, we make use of a well-known score formula and introduce smoothed closed-form diﬀusion models (smoothed CFDMs), a class of training-free diﬀusion models that require only access to the training set at sampling time. Smoothed CFDMs generate novel samples from a ﬁnite training set by ﬂowing Gaussian noise through a velocity ﬁeld built from a smoothed closed-form score. Our method is eﬃcient, has few hyperparameters, and generates plausible samples in high-dimensional tasks such as image generation. By developing this algorithm, we demonstrate that a closed-form score formula can be adapted to build a non-neural sampler that scales to non-trivial generative tasks.

Our speciﬁc contributions are as follows:

1. In Section 4, we show that smoothing the exact solution to the score-matching problem promotes generalization by encouraging the score function to point towards barycenters of training samples.

2. Using our smoothed score, in Section 5.1 we construct a closed-form sampler that generates novel samples without requiring any training, and characterize the support of its samples.

3. In Section 5.4, we accelerate our sampler using a nearest-neighbor-based estimator of our smoothed score, and show in Section 6.2 that in practice, one can aggressively approximate our smoothed score at little cost to sample accuracy.

4. In Section 6, we scale our method to high-dimensional tasks such as image generation. By operating in the latent space of a pretrained autoencoder, we generate novel samples from popular image datasets at speeds competitive with existing GPU-bound methods while running on a consumer-grade laptop with no dedicated GPU.

2 Related work

Diﬀusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) have recently achieved state of the art performance in image (Rombach et al., 2022; Zhang & Agrawala, 2023) and video generation (Ho et al., 2022a;b). They have also shown promise in 3D synthesis (Luo & Hu, 2021; Poole et al., 2022; Watson et al., 2022; Lukoianov et al., 2024) and in crucial steps of the drug discovery pipeline such as molecular docking (Corso et al., 2023) and generation (Hoogeboom et al., 2022; Schneuing et al., 2022). Despite this progress, however, diﬀusion models remain costly to train and sample from (Shih et al., 2023). Prior work has sought to accelerate the sampling of diﬀusion models via model distillation (Salimans & Ho, 2022), operating in a pre-trained autoencoder s latent space (Vahdat et al., 2021; Rombach et al., 2022), modifying the generative process (Song et al., 2020), using alternative time discretizations for sampling (Zhang & Chen, 2023; Liu et al., 2022; Wu et al., 2023), or by parallelizing sampling steps (Shih et al., 2023). Latent diﬀusion models also beneﬁt from lower training expenses (Rombach et al., 2022), but publicly-reported training costs for state-of-the-art diﬀusion models remain high (Bastian, 2022).

Recent works propose alternative diﬀusion-like models that discard the Markov chain and SDE formalisms from earlier work. Liu et al. (2023) introduce a uniﬁed framework for ﬂow-based generative modeling that subsumes diﬀusion models and show that straightening their model s ﬂows enables few-step sample generation. Heitz et al. (2023) use a similar objective to construct a straightforward graphics-inspired sampler, and Delbracio & Milanfar (2023) concurrently generalize this framework to arbitrary data perturbations and apply it to image restoration and generation tasks. All of these methods parametrize their ﬂows by neural networks that require extensive training.

Published in Transactions on Machine Learning Research (05/2025)

While diﬀusion models draw inspiration from mathematical theory (Feller, 1949; Stroock & Varadhan, 1969a;b; 1972), there have been limited attempts to develop a theoretical understanding of their behavior. Salmona et al. (2022); Koehler et al. (2023) study the statistical limitations of diﬀusion models trained via score-matching, De Bortoli et al. (2021); Lee et al. (2023) present convergence results for diﬀusion models with absolutely continuous targets, and De Bortoli (2022) extends these results to manifold-supported distributions. However, as diﬀusion models are trained on an empirical approximation to their target distributions, these results can only show that a diﬀusion model converges to the empirical distribution of its training set, whereas one is typically interested in generating novel samples.

Pidstrigach (2022) takes an initial step in this direction by studying the support of an SGM s model distribution and providing conditions under which an SGM memorizes its training data or learns to sample from the true data manifold. Oko et al. (2023) further show that diﬀusion models can attain nearly minimax estimation rates for the true data distribution provided its density lies in an appropriate function class. Yoon et al. (2023) propose and empirically test a memorization-generalization dichotomy, which states that diﬀusion models may only generalize when they are parametrized by neural networks with insuﬃcient capacity relative to the size of their training set. Yi et al. (2023) note that standard training objectives for diﬀusion models have closed-form optima given ﬁnite training sets and show via experiments that the approximation error of neural score functions enables existing diﬀusion models to generalize. Recently, Kadkhodaie et al. (2024) study generalization in diﬀusion models using techniques from applied harmonic analysis and demonstrate that SGMs trained on suﬃciently large datasets learn a distribution that is eﬀectively independent of the training set, and Aithal et al. (2024) show that neural SGMs hallucinate by generating data that lies outside the support of the target distribution because they learn smooth approximations to the ground truth score function. Whereas these works study the generalization of existing SGMs, we construct a novel SGM that explicitly perturbs the closed-form score to generalize without the indeterminate approximation error and training costs of a neural score.

Recent works in graphics and vision have also noted that neural networks are unnecessary for tasks such as novel view synthesis, where neural radiance ﬁelds (Ne RFs) had previously achieved SOTA results (Barron et al., 2022). In light of this, Kerbl et al. (2023) use eﬃciently optimized 3D Gaussian scene representations to achieve SOTA visual quality in novel view synthesis while operating in real time. In this work, we adopt a similar perspective and investigate the extent to which neural networks can be replaced with eﬃcient and well-understood classical approaches in generative modeling.

3 Preliminaries: The closed-form score

Flow-based generative models draw samples from a target distribution Ä1 by sampling from a known base distribution Ä0 (typically N(0, I)) and ﬂowing these samples through a velocity ﬁeld vt from t = 0 to t = 1. For an appropriately-chosen vt, the samples will be distributed according to the target distribution Ä1 at t = 1. SGMs employ a vt that depends on the score function log Ät of the perturbed data distribution Ät. For example, when Ä0 = N(0, I), one velocity ﬁeld satisfying this property is v t (z) = 1

t (z + (1 t) log Ä t (z)) (Liu et al., 2023), where Ä t is the marginal distribution of the random variable z = tx + (1 t)ϵ, whose samples are target samples x Ä1 that have been rescaled by t and perturbed by Gaussian noise (1 t)ϵ N(0, (1 t)2I). The score function log Ä t (z) is typically learned via score-matching or denoising.

In practice, one learns an SGM from a ﬁnite training set {xi}N i=1. In this case, the target distribution ˆÄ1 is the empirical distribution over {xi}N i=1, and for the ﬁeld v t deﬁned above, the perturbed target distribution Ä t is a mixture of Gaussians with means txi and common covariance matrix (1 t)2I. (We will subsequently use the fact that Ä t is a mixture of Gaussians to accelerate our sampler in Sections 5.2 and 6.2.) Its score log Ä t (z) has a closed-form expression:

log Ä t (z) = 1 (1 t)2 (kt(z) z) , (1)

where kt(z) =

i=1 softmax z t X 2

Published in Transactions on Machine Learning Research (05/2025)

in which we let z t X 2 denote the vector whose i-th entry is z txi 2. This log Ä t (z) is a vector pointing from z toward a distance-weighted average kt(z) of all N rescaled training points and is the exact solution to the score-matching problem given ﬁnite training data. Equation 1 is well-known, having appeared in the empirical Bayes literature as early as in Miyasawa (1961) and more recently in works such as Raphan & Simoncelli (2011) and Karras et al. (2022, Appendix B.3). It has also inspired machine learning methods such as denoising score-matching (Vincent, 2011) and score interpolation (Dieleman et al., 2022).

We deﬁne a closed-form diﬀusion model (CFDM) to be the SGM that ﬂows N(0, I) base samples through this v t (z) while evaluating the score log Ä t (z) in closed form as needed during sampling. As this model can only generate samples from the empirical distribution over training data, CFDMs are not useful in practice.

4 Smoothed closed-form diﬀusion models

Pidstrigach (2022); Yi et al. (2023) ﬁnd that existing diﬀusion models generalize due to approximation error incurred during score-matching. Rather than studying the generalization of neural SGMs, we take inspiration from this observation and construct a training-free SGM that generalizes by explicitly inducing error in the closed-form score.

4.1 Deﬁnition

Deep neural networks ﬁt the low-frequency components of their target functions ﬁrst during training, a phenomenon known as spectral bias that results in excessively smooth approximations to the target function (Rahaman et al., 2019). Hence, to model the bias of a neural SGM, we induce error in the score function by smoothing it. To smooth a function f, one chooses a zero-mean noise distribution pϵ and replaces f(z) with the convolution f(z) = E ϵ pϵ[f(z + ϵ)]. In practice, we compute the smoothed score function sÃ,t(z)

by ﬁxing a smoothing parameter Ã g 0, drawing M noise samples ϵm pϵ, and computing

sÃ,t(z) = 1 (1 t)2

m=1 kt(z + Ãϵm) z

That is, we average the weights kt in Equation 2 over M small perturbations Ãϵm of z; as Ã 0, we approach the unsmoothed score in Equation 1. We do not add noise to the z term in the score because it vanishes in expectation. The smoothing procedure in Equation 3 is the key ingredient enabling our model to generalize without a learned approximation to the score function. The smoothed score sÃ,t can then be inserted into an SGM sampling loop to yield a closed-form sampler that generates novel samples.

To develop intuition for why this simple modiﬁcation of the closed-form score promotes generalization, we consider the behavior of the closed-form score as t 1. Figure 1 depicts the closed-form score (Equation 1) and its smoothed counterpart (Equation 3) at t = 0.95 for a simple case where the training data consists of two points x0 (in blue) and x1 (in red). In this regime, the temperature (1 t)2 of the softmax in Equation 2

(a) Closed-form score

(b) Smoothed score

Figure 1: Eﬀect of smoothing on the closed-form score (yellow streamplot). Colors represent distance weights in kt(z); blue regions of space are drawn to the blue point on the left, and vice-versa.

Published in Transactions on Machine Learning Research (05/2025)

is low, and kt(z) is eﬀectively equal to the nearest neighbor of z within the training set. Flowing points z through a velocity ﬁeld such as Liu et al. (2023) s v t (z) = 1

t (z + (1 t) log Ä t (z)) causes them to ﬂow towards their nearest training sample. As a result, an SGM based on this score function simply outputs training data.

On the other hand, the small perturbations Ãϵm in Equation 3 occasionally push points z near the Voronoi boundary between x0 and x1 into their neighboring Voronoi cell. Averaging kt over these perturbations yields a score function that instead points towards the line segment connecting x0 and x1. An SGM based on the smoothed score function will hence cause samples to ﬂow towards weighted barycenters of the training points, which promotes generalization, especially when the data lie on a manifold of suﬃciently low curvature. We will make these intuitions rigorous in the following section by proving Proposition 4.1, which will enable us to constrain the support of our model s samples.

4.2 Eﬀect of smoothing the score

In this section, we show that as t 1, the smoothed score points towards barycenters of these tuples rather than towards training points, thereby enabling our sampler to generalize. We ﬁrst note that via a straightforward computation,

kt(z + Ãϵm) =

i=1 softmax z t X 2 + Ãtui,m

where ui,m = 2ïϵm, xið is a scalar random variable. This shows that smoothing the score acts by perturbing the distance weights z txi , so one can directly add scalar noise ui,m pu to these weights instead of perturbing the inputs z with noise ϵm pϵ. To simplify our exposition, we will frame the remainder of our results from this perspective.

We now show that smoothing the closed-form score yields a function sÃ,t(z) that points from z towards a convex combination kÃ,t(z) of barycenters t ck = 1 M PM m=1 txi(k,m) of tuples t Ck = (txi(k,m) : m = 1, ..., M) of rescaled training points. In this notation, i(k, m) picks out an index i corresponding to one of the N training points {x1, ..., x N} that depends both on the identity of the tuple t Ck (represented by the argument k and the index s position in the tuple (represented by the argument m). In this way, the k-th tuple t Ck contains M training samples xi(k,m) for m = 1, ..., M. The weights of this convex combination depend not only on the distance z t ck between z and the barycenters t ck, but also on the variance of the tuples t Ck and the noise terms uk = 1 M PM m=1 ui(k,m). Tuples of tightly-clustered points have low variance and hence receive large weights in kÃ,t(z), whereas tuples of distant points have high variance and receive small weights in kÃ,t(z). We prove the following proposition in Appendix B.1.

Proposition 4.1 (sÃ,t points towards barycenters of training points). The smoothed score sÃ,t(z) can be expressed as:

sÃ,t(z) = 1 (1 t)2 (kÃ,t(z) z) , where

k=1 softmax

M z t ck 2 + Var(t Ck) + Ãt uk

k t ck. (5)

5 Sampling algorithm

5.1 Forward Euler scheme for sampling

Armed with the smoothed score sÃ,t, we are now in position to deﬁne our sampler. Following Liu et al. (2023), we draw N(0, I) base samples and ﬂow them through

vÃ,t(z) = 1

t (z + (1 t)sÃ,t(z)) , (6)

Published in Transactions on Machine Learning Research (05/2025)

Algorithm 1 Sampling

Input: Training set {xi}N i=1, noise {ui,m}, step size h = 1

S , initial sample z0 N(0, I) for n = 0, ..., S 1 do tn = n

S zn+1 zn + hvÃ,tn(zn) end for return z T

from t = 0 to t = 1. We discretize this ODE using a forward Euler scheme, leading to Algorithm 1 for sampling using the smoothed score.

The smoothed score in Equation 3 and Algorithm 1 jointly deﬁne our smoothed closed-form diﬀusion model; given a smoothing parameter Ã, we call this a Ã-CFDM. Using Algorithm 1, we can sample from a Ã-CFDM given access only to the training data {xi}N i=1 and noise samples. Notably, no training phase or neural network is needed for this procedure. By explicitly smoothing the closed-form score rather than relying on a neural network s approximation error, we can determine the support of our Ã-CFDM s distribution. For suﬃciently small step sizes, our model s samples will lie at the barycenters of tuples of training points.

Theorem 5.1 (Support of Ã-CFDM samples). All samples returned by Algorithm 1 are of form z S = S S 1kÃ, S 1

S (z S 1). As the number of sampling steps S (equivalently, as the step size 1

S 0), the model samples converge towards barycenters z S = ck of M-tuples of training points.

We prove this theorem in Appendix B.2. While our sampler is easy to implement and training-free, it may be costly if the number of training samples N and the number of sampling steps S are large. We address these issues in the following sections. In Section 5.4, we show how to approximate our smoothed score using eﬃcient nearest-neighbor search. In Section 6.2, we demonstrate that one may take fewer sampling steps by initializing the sampler at a non-zero start time at little cost to sample accuracy, and provide complementary analysis in Section 5.2. This will permit our method to scale to high-dimensional real-world datasets while achieving sampling times competitive with existing methods and running on consumer-grade CPUs.

5.2 Taking fewer sampling steps

As a CFDM s distribution Ä t is simply a time-dependent mixture of Gaussians centered at the training points, it can be directly sampled at any time t by uniformly sampling a mixture mean txi and then sampling from a Gaussian centered at txi. We use this fact to sample a Ã-CFDM with fewer steps by starting at T > 0 with samples from its corresponding unsmoothed CFDM. As a Ã-CFDM does not have the same distribution as an unsmoothed CFDM, this approximation incurs some error, which we bound in the following theorem.

Theorem 5.2 (Approximation error from starting at T > 0). Let ÄT 1 ϵ be the model distribution at t = 1 ϵ obtained by starting sampling a Ã-CFDM at T > 0 with samples from the unsmoothed CFDM, and let Ä0 Ã,1 ϵ be the corresponding Ã-CFDM model distribution when sampling starting at T = 0. Then for any ﬁxed T and ϵ, W2(Ä0 Ã,1 ϵ, ÄT Ã,1 ϵ) = O(Ã). (7)

where W2 is the 2-Wasserstein distance.

Following De Bortoli (2022), we stop sampling at time 1 ϵ for some truncation parameter ϵ > 0 to account for the fact that the smoothed score sÃ,t blows up as t 1 due to division by (1 t)2. We prove this theorem in Appendix B.3.

This result shows that initializing a Ã-CFDM with samples from the unsmoothed CFDM Ä T at time T > 0 results in bounded error that scales linearly with Ã. Intuitively, increasing Ã causes the unsmoothed velocity ﬁeld v t to be a worse approximation to the smoothed velocity ﬁeld vÃ,t at any time t; Theorem 5.2 conﬁrms that the cost to sample accuracy is linear in Ã.

Published in Transactions on Machine Learning Research (05/2025)

5.3 Distribution of one-step samples under Gumbel weight perturbations

When the scalar noise ui,m perturbing the distance weights in Equation 4 is drawn from a Gumbel(0, 1) distribution, we can precisely characterize the smoothed model s distribution when performing single-step sampling by starting sampling at the ﬁnal Euler iteration in Algorithm 1. Proposition 5.3. Suppose we begin sampling a smoothed CFDM at iteration S 1 of Algorithm 1 using samples z S 1 Ä t S 1 from the unsmoothed CFDM at t S 1. Suppose also that the perturbations ui,m to the distance weights in Equation 4 are drawn from a Gumbel(0, 1) distribution. Then, as the number of Euler steps S , the model samples z S are of the form z S = 1 M XIÃ, where X RD N is the matrix whose i-th column is training sample xi RD and IÃ Multinomial(ÃÃ, M). The probability Ãi Ã of training point xi is given by Ãi Ã = softmax 1

Ã z S 1 xi 2 .

We prove this proposition in Appendix B.4. This result provides further intuition on the role of the smoothing parameter Ã in determining the distribution of a smoothed CFDM s samples: It is the temperature of the softmax determining Ãi Ã = softmax 1

Ã z S 1 xi 2 . When Ã = 0, the softmax simply picks out the training sample xi that is closest to z S 1. Conversely, as Ã , the event probabilities Ãi Ã become uniform and z S becomes the barycenter of M uniformly-sampled training points.

5.4 Fast score computation via approximate nearest-neighbor search

Each sampling step in Algorithm 1 requires the evaluation of the smoothed score sÃ,t(z) and hence a sum over O(N) terms. For large datasets, each evaluation of sÃ,t(z) is therefore costly and places substantial demands on memory.

In the t 1 regime, the temperature of the softmax in Equation 4 is low, and the large sum is dominated by the handful of terms corresponding to the smallest values of z txi 2 Ãtui,m. If Ã is suﬃciently small, these terms correspond to the nearest neighbors of z among the rescaled training points txi. This suggests that we can approximate the smoothed score sÃ,t by subsampling terms in the O(N) sum while ensuring that the nearest neighbors of z are included with high probability.

Noting that the closed-form score log Ä t (z) = Ä t (z) Ä t (z) is the score of a Gaussian kernel density estimator (KDE) Ä t , we employ Karppa et al. (2022) s unbiased nearest-neighbor estimator for KDEs to estimate the denominator, and take its gradient to obtain an unbiased estimate of the numerator. We provide details on this estimator in Appendix A. Our estimator is computed using the K nearest neighbors of z in the training set and L random samples from the remainder of the training set; we study the accuracy tradeoﬀs associated with K and L in Section 6.2.

Figure 2: W2 between Ã-CFDM model samples and true samples. We depict model samples for Ã {0, 0.26, 1}.

Given this estimator for the closed-form score, we estimate the smoothed score d sÃ,t via convolution against a smoothing kernel as in Section 4.1. By using the approximate nearest neighbor search algorithms implemented in Faiss (Douze et al., 2024), we are able to scale our method to high-dimensional real-world datasets and achieve sampling times competitive with neural SGMs while running on consumer-grade CPUs; see Sections 6.3 and 6.4 for examples and runtime metrics.

6.1 Impact of Ã on generalization

We now show that a Ã-CFDM s model distribution approaches the true distribution Ä1 of its training samples xi Ä1 for appropriate values of Ã. We ﬁx a continuous target distribution Ä1 and draw N = 5000 samples yi to serve as a discrete approximation to Ä1. We then draw a smaller subset of n = 500 training samples xi and construct a Ã-CFDM on these samples while varying Ã.

Published in Transactions on Machine Learning Research (05/2025)

For each Ã, we measure the 2-Wasserstein distance W2 between the Ã-CFDM s generated samples and the true samples yi Ä1, and use this as a tractable proxy for the distance between the Ã-CFDM s model distribution and the true distribution Ä1. We present the results of this experiment for the Checkerboard distribution in Figure 2.

When Ã = 0, the support of our model s samples (left side of Figure 2) coincides with the training samples xi. The 2-Wasserstein distance between the model samples and true samples yi decreases for small values of Ã as the model samples become convex combinations of nearby points in the training set; we depict model samples for Ã = 0.26 in the center of Figure 2. However, as Ã grows large, the model samples spread out to ﬁll the convex hull of the training set (right side of Figure 2) and the distance between our model s samples and true samples yi grows rapidly. These results suggest that for appropriate values of Ã, our method can use a ﬁxed training set {xi}N i=1 to generate novel samples that remain close to the target distribution Ä1. Experiments of this type may be used to select appropriate values of Ã for a given application.

In Figure 3, we demonstrate that with an appropriate choice of Ã, our method can sample from a 2D surface embedded in R3 given a sparse blue noise sampling of the surface; this is a low-dimensional case of a manifold-supported distribution, which is typical in machine learning applications. Our method s samples (blue points) ﬁll in the gaps between the sparse training samples (red points) while remaining close to the true manifold. This occurs because Ã-CFDM samples are barycenters of tuples of nearby training points, with Ã controlling the variance of these tuples. For appropriate values of Ã and suﬃciently dense samplings of training points, these barycenters will approximately lie on tangent planes to the surface, and hence lie near the surface but away from the training data.

(a) σ = 0.2: 28.9% in W2

(b) σ = 0.375: 13.4% in W2

(c) σ = 0.4: 34.1% in W2

Figure 3: Sampling a Ã-CFDM (blue points) yields a dense point cloud given sparse mesh samples (red points). We report % drop in W2 distance to a dense mesh sampling when using our Ã-CFDM s samples relative to the sparse training samples. We render these point clouds in Polyscope (Sharp et al., 2019).

6.2 Ablation and computational trade-oﬀs

In this section, we investigate the impact of the start time T and the parameters of our nearest-neighbor-based score estimator (9) on the distribution of our model s samples.

Impact of T. As a CFDM s distribution Ä t is simply a time-dependent mixture of Gaussians centered at training points, it can be directly sampled at any time t by uniformly sampling a mixture mean txi and then sampling from a Gaussian centered at txi. We use this fact to sample a Ã-CFDM with fewer steps by starting at T > 0 with samples from its corresponding unsmoothed CFDM. We show here that for practical values of Ã, one can begin sampling at T close to 1 with little accuracy loss.

We ﬁx a continuous target distribution Ä1, draw n = 500 training samples xi, and construct a Ã-CFDM on these samples for Ã {0, 0.2, 0.4, 0.6, 0.8, 1.0}. We then vary the initial sampling times T and compute the 2-Wasserstein distance W2 between model samples generated starting at T = 0 and at T > 0. We compare this to the average W2 distance between batches of Ã-CFDM samples generated by starting at T = 0 (which

Published in Transactions on Machine Learning Research (05/2025)

is nonzero due to randomness in sampling) and report the percent change in W2 relative to this baseline value. We present the results of this experiment for the Checkerboard distribution in Figure 4.

Figure 4: % change in W2 between ÃCFDM model samples generated starting at T = 0 and samples generated starting at T > 0.

For Ã < 0.4, there is little accuracy loss when starting at T > 0, even for start times close to 1. When Ã g 0.4, the accuracy of this approximation begins to decline for start times T g 0.4, with large reductions in approximation quality when both Ã and T are large. As we have found that our model has performed best with Ã f 0.4 in the applications considered in this work, this section s results support the use of few sampling steps in practice. The results in Sections 6.3 and 6.4 further support the use of late start times T for image generation; we ﬁnd in these experiments that we can start sampling as late as T = 0.98 while maintaining good sample quality.

Impact of K and L on the NN-based score estimator. In Section 5.4, we proposed an eﬃcient score estimator based on fast nearest-neighbor search. We now study the impact of the number of nearest neighbors K and the number of random samples L from the remainder of the training set on our model s samples.

We ﬁx a continuous target distribution Ä1, draw n = 500 training samples xi, and construct a Ã-CFDM on these samples for Ã = 0.3; this value is typical for real-world applications. We then vary the number of nearest neighbors K and the number of random samples L used to compute the score estimator Equation 9 and measure the 2-Wasserstein distance between model samples generated using the full smoothed score and using the estimator Equation 9. We present the results of this experiment for the Checkerboard distribution in Figure 5. We center the diverging color scheme at the W2 distance between two batches of samples from a Ã-CFDM using the full smoothed score; this noise threshold encodes the inherent randomness in our model s samples across batches.

Figure 5: W2 between Ã-CFDM model samples generated using the full score and our NN-based estimator for varying # of NN K (horizontal axis) and # of random samples L (vertical axis).

The error arising from the NN-based estimator is decreasing in K and L, with especially poor approximation quality when using a single random sample xℓ. However, the accuracy of the model samples approaches the noise threshold for small values of K, L. For example, with K = L = 15 (which samples just 6% of the terms in kÃ,t), the W2 distance between samples generated using the full score and the NN-based estimator is 0.1865, a value close to the noise threshold of 0.1791. In Sections 6.3 and 6.4, we additionally show that one can generate high-quality images while subsampling kÃ,t at a far lower rate, thereby enabling our method to scale to real-world datasets.

6.3 Image generation in pixel space

In this section, we use our Ã-CFDM to sample images in pixel space that are similar to images from the Smithsonian Butterﬂies dataset,1 rescaled to 128 128. We benchmark our model s sample quality, training time, and sampling time against a denoising diﬀusion probabilistic model (DDPM) (Ho et al., 2020) and provide training details in Appendix C.1.

We display images from a held-out test set along with DDPM samples and our model s samples in Figure 6. Both our model and the DDPM generate images that qualitatively resemble the test images, but as our model can only output barycenters of training samples (see Theorem 5.1), our samples exhibit softer details than the test and DDPM samples. Table 1 records sample quality metrics and training and generation times for our method and the DDPM baseline. Our training-free method achieves

1Dataset available on Hugging Face: huggan/smithsonian_butterflies_subset

Published in Transactions on Machine Learning Research (05/2025)

(a) Ground truth images

(b) DDPM samples

(c) σ-CFDM samples

Figure 6: Ground truth images from Smithsonian Butterﬂies (left), DDPM samples (center), and Ã-CFDM samples (right).

comparable sample quality to a DDPM that has been trained for 5.34 hours on a single V100 GPU, and achieves over 2.9 times the sample throughput of a DDPM running on a V100 GPU while running on a Macbook M1 Pro CPU with 16 GB of RAM.

As 128 128 RGB images lie in 49,152-dimensional space, this experiment demonstrates that our method scales to high-dimensional problems. As our method is able to generate plausible samples despite being restricted to outputting barycenters of training samples, it also demonstrates that there exist image manifolds for which our Ã-CFDM s inductive bias is reasonable. However, we do not expect this inductive bias to be suitable for most real-world image data, where barycenters of training samples typically lie oﬀ-manifold and fail to resemble ground truth samples. To narrow this gap between theory and practice, we show in the following section that by sampling in the latent space of an autoencoder, our method can generate plausible and diverse images of human faces, comparing favorably with a VAE at marginally higher sampling costs.

Table 1: Metrics for sample quality and generation time in pixel space. Our Ã-CFDM achieves competitive sample quality and generation time while requiring no costly training.

Method Metric Butterﬂies

Inception score 1.87 0.225 KID ³ 0.0220 0.0038 Training time 5.24 h Sampling time (GPU) 1.20 s

Inception score 2.20 0.150 KID ³ 0.0114 0.0048 Training time 0 h Sampling time (CPU) 0.4124 s

6.4 Image generation in latent space

Theorem 5.1 shows that in the limit of small step sizes, a Ã-CFDM s samples are barycenters of nearby training points. This is typically a poor prior for images in pixel space, but an appropriately-chosen autoencoder may map the training data to a latent manifold that more closely satisﬁes this local linearity assumption. To this end, we train the nuclear norm-regularized autoencoder proposed by Scarvelis & Solomon (2024), which encourages latent vectors to lie on a low-dimensional manifold. We then sample from a Ã-CFDM in the latent space of this pretrained autoencoder for the Celeb A dataset (Liu et al., 2015) and discard samples that are identical to their nearest latents from the training set. As our method relies on tractable nearest-neighbor queries in the training set at sampling time, this is a feasible post-processing step for our

Published in Transactions on Machine Learning Research (05/2025)

(a) Ground truth samples

(b) VAE samples

(c) σ-CFDM samples

Figure 7: Ground truth images from Celeb A (left), VAE samples (center), and Ã-CFDM samples (right).

sampler. We benchmark our method against a Variational Auto Encoder (VAE) (Kingma, 2013) trained for the same number of epochs and employing the same architecture as the nuclear norm-regularized autoencoder. We train both autoencoders using the log hyperbolic cosine reconstruction loss, which has been found to improve sample quality in VAEs (Chen et al., 2019).

A VAE is an appropriate baseline for a latent Ã-CFDM because both models train a regularized autoencoder to obtain well-structured latent representations, and then employ a training-free process to generate new samples in latent space. A VAE s sampling procedure is simple and data-independent: One draws a normally-distributed latent and decodes it. However, one must use heavy regularization to ensure the latent distribution is nearly Gaussian, and VAEs suﬀer from poor sample quality as a result. In contrast, a Ã-CFDM s data-dependent sampling procedure merely requires that the latent distribution be supported on a manifold of suﬃciently low curvature, so that barycenters of nearby latents continue to lie on this manifold. We consequently expect that one may sample a Ã-CFDM in a weakly regularized latent space to obtain better-quality decoded samples than a VAE while preserving the ability to sample on CPU without requiring additional training.

We display our model s samples, along with VAE samples and ground truth samples from the Celeb A dataset in Figure 7. Barycenters of natural images in pixel space typically do not resemble natural images unless they are well-registered (as with the butterﬂies in Section 6.3), but operating in an autoencoder s latent space allows our method to generate plausible and diverse images of human faces. In particular, our method s samples exhibit greater qualitative diversity than the VAE samples, at times including features such as hats and glasses that seldom or never appear in the VAE baseline s samples.

Table 2: Metrics for sample quality and generation time in latent space. Our Ã-CFDM improves signiﬁcantly on a VAE s sample quality at a marginal sampling cost on CPU.

Method Metric Celeb A

Inception score 1.68 0.08 KID ³ 0.108 0.0066 Sampling time (CPU)

Inception score 2.22 0.19 KID ³ 0.092 0.0075 Sampling time (CPU) 44 ms

We report sample quality and generation time metrics, including inception scores (Salimans et al., 2016) and kernel inception distances (KID) (Bińkowski et al., 2018) between generated samples and samples from the Celeb A test partition in Table 2. Our Ã-CFDM results in a 15.0% improvement in KID and 32.4% improvement in inception score compared to the VAE baseline. While the VAE s sampling cost, which

Published in Transactions on Machine Learning Research (05/2025)

amounts to the cost of generating Gaussian noise, is negligible, our method s sampling time is just 44 ms per sample. For the sake of fairness, this cost is amortized over the number of Ã-CFDM samples left after discarding nearest neighbors to ensure novelty.

7 Conclusion and future work

In this work, we introduced smoothed closed-form diﬀusion models (smoothed CFDMs): a class of training-free diﬀusion models requiring only access to the training set at sampling time. Smoothed CFDMs leverage the availability of an exact solution to the score-matching problem which alone does not yield generalization and explicitly induce error by smoothing. This results in a model that generalizes by provably outputting barycenters of training points. Our method is eﬃcient and scalable, and runs on a consumer-grade laptop with no dedicated GPU.

Our results suggest that it is possible to design SGMs that generalize without relying on neural score approximations. They also suggest that smoothness is among the inductive biases enabling neural SGMs to generalize in spite of the uninteresting global optimum of their training objective, which only allows for memorization. However, because our method generates barycenters of training points, its inductive bias is unsuitable on its own for sparsely-sampled manifolds in high-dimensional space, which one typically encounters in modern applications such as image generation. In this work, we partially address this shortfall by using our method to sample in an appropriately-structured latent space, but our sample quality lags behind that of state of the art neural SGMs. We therefore encourage further work to close this gap in sample quality, and describe some potential directions for future work below.

State of the art SGMs are typically built upon convolutional architectures with self-attention layers, which both feature unique inductive biases. Concurrent work by Kamb & Ganguli (2024) investigates the impact of locality and equivariance constraints on the optimum of the score-matching objective, and Niedoba et al. (2024) empirically investigate whether a locality bias can explain the behavior of neural denoisers. Combining these constraints with our smoothing approach and explicitly modeling the inductive biases of self-attention layers may yield further insights into the generalization of neural diﬀusion models and lead to new strategies for building training-free diﬀusion models that generalize.

Most interesting image generation tasks are conditional. For instance, a user may provide a text prompt and seek an image whose subject and style match the prompt. While state-of-the-art diﬀusion models typically employ classiﬁer-free guidance (Ho & Salimans, 2021) to introduce conditioning information, it is unclear how to extend our training-free method to include an analogous form of guidance. On the other hand, Dhariwal & Nichol (2021) s classiﬁer guidance would likely be a feasible addition to our method, amounting to augmenting our velocity ﬁeld (6) with the gradient of a pretrained classiﬁer. As classiﬁer guidance is known to improve diﬀusion models sample quality, this may have the additional beneﬁt of narrowing the gap between our samples and those generated by state-of-the-art neural methods.

Acknowledgements

The MIT Geometric Data Processing Group acknowledges the generous support of Army Research Oﬃce grants W911NF2010168 and W911NF2110293, of National Science Foundation grant IIS2335492, from the CSAIL Future of Data program, from the MIT IBM Watson AI Laboratory, from the Wistron Corporation, and from the Toyota CSAIL Joint Research Center.

Christopher Scarvelis acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), funding reference number CGSD3-557558-2021, and a 2024 Exponent fellowship.

Sumukh K Aithal, Pratyush Maini, Zachary Chase Lipton, and J Zico Kolter. Understanding hallucinations in diﬀusion models through mode interpolation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=a NTn HBkw4T.

Published in Transactions on Machine Learning Research (05/2025)

Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance ﬁelds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5470 5479, 2022.

Matthias Bastian. Training cost for stable diﬀusion was just $600,000 and that is a good sign for ai progress. https://the-decoder.com/training-cost-for-stable-diffusion-was-just-600000-and-that-i s-a-good-sign-for-ai-progress/, 2022.

Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?i d=r1l UOz WCW.

Pengfei Chen, Guangyong Chen, and Shengyu Zhang. Log hyperbolic cosine loss improves variational auto-encoder, 2019. URL https://openreview.net/forum?id=rkglvs C9Ym.

Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, and Tommi S. Jaakkola. Diﬀdock: Diﬀusion steps, twists, and turns for molecular docking. In The Eleventh International Conference on Learning Representations, 2023.

Valentin De Bortoli. Convergence of denoising diﬀusion models under the manifold hypothesis. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openreview.net/forum?id=Mh K5 a Xo3g B. Expert Certiﬁcation.

Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diﬀusion schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34: 17695 17709, 2021.

Mauricio Delbracio and Peyman Milanfar. Inversion by direct iteration: An alternative to denoising diﬀusion for image restoration. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=Vmy FF5l L3F. Featured Certiﬁcation.

Prafulla Dhariwal and Alexander Quinn Nichol. Diﬀusion models beat GANs on image synthesis. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=AAWu Cvza Vt.

Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, et al. Continuous diﬀusion for categorical data. ar Xiv preprint ar Xiv:2211.15089, 2022.

Matthijs Douze, Alexandr Guzhva, Chengqi Deng, JeﬀJohnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library. 2024.

William Feller. On the theory of stochastic processes, with particular reference to applications. 1949.

Eric Heitz, Laurent Belcour, and Thomas Chambon. Iterative ³-(de)blending: A minimalist deterministic diﬀusion model. In ACM SIGGRAPH 2023 Conference Proceedings, SIGGRAPH 23, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701597. doi: 10.1145/3588432.3591540. URL https://doi.org/10.1145/3588432.3591540.

Jonathan Ho and Tim Salimans. Classiﬁer-free diﬀusion guidance. In Neur IPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. URL https://openreview.net/forum?id=qw8A Kxf Yb I.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diﬀusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 6840 6851. Curran Associates, Inc., 2020.

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey A. Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen video: High deﬁnition video generation with diﬀusion models. Ar Xiv, abs/2210.02303, 2022a. URL https://api.semanticscho lar.org/Corpus ID:252715883.

Published in Transactions on Machine Learning Research (05/2025)

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diﬀusion models. Ar Xiv, abs/2204.03458, 2022b. URL https://api.semanticscholar.org/Corpus ID: 248006185.

Emiel Hoogeboom, Víctor Garcia Satorras, Clément Vignac, and Max Welling. Equivariant diﬀusion for molecule generation in 3D. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 8867 8887. PMLR, 17 23 Jul 2022.

Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.

Zahra Kadkhodaie, Florentin Guth, Eero P Simoncelli, and Stéphane Mallat. Generalization in diﬀusion models arises from geometry-adaptive harmonic representations. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=ANvm VS2Yr0.

Mason Kamb and Surya Ganguli. An analytic theory of creativity in convolutional diﬀusion models. ar Xiv preprint ar Xiv:2412.20292, 2024.

Matti Karppa, Martin Aumüller, and Rasmus Pagh. DEANN: speeding up kernel-density estimation using approximate nearest neighbor search. In AISTATS, volume 151 of Proceedings of Machine Learning Research, pp. 3108 3137. PMLR, 2022.

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diﬀusion-based generative models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=k7Fu TOWMOc7.

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance ﬁeld rendering. ACM Transactions on Graphics (To G), 42(4):1 14, 2023.

Diederik P Kingma. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013.

Frederic Koehler, Alexander Heckett, and Andrej Risteski. Statistical eﬃciency of score matching: The view from isoperimetry. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=TD7An Qj Nz R6.

Dohyun Kwon, Ying Fan, and Kangwook Lee. Score-based generative modeling secretly minimizes the wasserstein distance. Advances in Neural Information Processing Systems, 35:20205 20217, 2022.

Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence of score-based generative modeling for general data distributions. In International Conference on Algorithmic Learning Theory, pp. 946 985. PMLR, 2023.

Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diﬀusion models on manifolds. Ar Xiv, abs/2202.09778, 2022. URL https://api.semanticscholar.org/Corpus ID:247011732.

Xingchao Liu, Chengyue Gong, and qiang liu. Flow straight and fast: Learning to generate and transfer data with rectiﬁed ﬂow. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=XVj TT1nw5z.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2017. URL https://api.semanticscholar.org/Corpus ID:53592270.

Artem Lukoianov, Haitz Sáez de Ocáriz Borde, Kristjan Greenewald, Vitor Campagnolo Guizilini, Timur Bagautdinov, Vincent Sitzmann, and Justin Solomon. Score distillation via reparametrized DDIM. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https: //openreview.net/forum?id=4Dcp Fag Q9e.

Published in Transactions on Machine Learning Research (05/2025)

Shitong Luo and Wei Hu. Diﬀusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2837 2845, 2021.

Koichi Miyasawa. An empirical bayes estimator of the mean of a normal population. Bulletin of the International Statistical Institute, 38(4):181 188, 1961.

Matthew Niedoba, Berend Zwartsenberg, Kevin Murphy, and Frank Wood. Towards a mechanistic explanation of diﬀusion model generalization. ar Xiv preprint ar Xiv:2411.19339, 2024.

Kazusato Oko, Shunta Akiyama, and Taiji Suzuki. Diﬀusion models are minimax optimal distribution estimators. Ar Xiv, abs/2303.01861, 2023. URL https://api.semanticscholar.org/Corpus ID: 257353832.

Jakiw Pidstrigach. Score-based generative models detect manifolds. Advances in Neural Information Processing Systems, 35:35852 35865, 2022.

Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diﬀusion. ar Xiv, 2022.

Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. In International Conference on Machine Learning, pp. 5301 5310. PMLR, 2019.

Martin Raphan and Eero P Simoncelli. Least squares estimation without priors or supervision. Neural computation, 23(2):374 420, 2011.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diﬀusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684 10695, 2022.

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diﬀusion models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TId IXIpzho I.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.

Antoine Salmona, Valentin De Bortoli, Julie Delon, and Agnès Desolneux. Can push-forward generative models ﬁt multimodal distributions? In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/f orum?id=Tsy9WCO_f K1.

Filippo Santambrogio. Optimal transport for applied mathematicians. Birkäuser, NY, 55(58-63):94, 2015.

Christopher Scarvelis and Justin Solomon. Nuclear norm regularization for deep learning. ar Xiv preprint ar Xiv:2405.14544, 2024.

Arne Schneuing, Yuanqi Du, Charles Harris, Arian Jamasb, Ilia Igashov, Weitao Du, Tom Blundell, Pietro Lió, Carla Gomes, Max Welling, Michael Bronstein, and Bruno Correia. Structure-based drug design with equivariant diﬀusion models. ar Xiv preprint ar Xiv:2210.13695, 2022.

Nicholas Sharp et al. Polyscope, 2019. www.polyscope.run.

Andy Shih, Suneel Belkhale, Stefano Ermon, Dorsa Sadigh, and Nima Anari. Parallel sampling of diﬀusion models, 2023.

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Francis Bach and David Blei (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 2256 2265, Lille, France, 07 09 Jul 2015. PMLR.

Published in Transactions on Machine Learning Research (05/2025)

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diﬀusion implicit models. ar Xiv:2010.02502, October 2020. URL https://arxiv.org/abs/2010.02502.

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Daniel W Stroock and SR Srinivasa Varadhan. Diﬀusion processes with continuous coeﬃcients, i. Communications on Pure and Applied Mathematics, 22(3):345 400, 1969a.

Daniel W Stroock and SR Srinivasa Varadhan. Diﬀusion processes with continuous coeﬃcients, ii. Communications on Pure and Applied Mathematics, 22(4):479 530, 1969b.

Daniel W Stroock and SR Srinivasa Varadhan. Diﬀusion processes. In Proc. Sixth Berkeley Symp. Math. Statist. Probab, volume 3, pp. 361 368, 1972.

Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34:11287 11302, 2021.

Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23 (7):1661 1674, 2011.

Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diﬀusion models. Ar Xiv, abs/2210.04628, 2022. URL https: //api.semanticscholar.org/Corpus ID:252780361.

Zike Wu, Pan Zhou, Kenji Kawaguchi, and Hanwang Zhang. Fast diﬀusion model. Ar Xiv, abs/2306.06991, 2023. URL https://api.semanticscholar.org/Corpus ID:259138469.

Mingyang Yi, Jiacheng Sun, and Zhenguo Li. On the generalization of diﬀusion model. ar Xiv preprint ar Xiv:2305.14712, 2023.

Tae Ho Yoon, Joo Young Choi, Sehyun Kwon, and Ernest K. Ryu. Diﬀusion probabilistic models generalize when they fail to memorize. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023. URL https://openreview.net/forum?id=shci Cb Sk9h.

Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diﬀusion models. Ar Xiv, abs/2302.05543, 2023. URL https://api.semanticscholar.org/Corpus ID:256827727.

Qinsheng Zhang and Yongxin Chen. Fast sampling of diﬀusion models with exponential integrator. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/f orum?id=Loek7hfb46P.

Published in Transactions on Machine Learning Research (05/2025)

A Details on nearest-neighbor estimator of closed-form score

Karppa et al. (2022) propose an unbiased estimator of a kernel density estimate KDE(z). Given a kernel function Kh(z) with bandwidth h > 0 and a dataset {xi}N i=1, their estimator ﬁrst searches for the K-nearest neighbors {xk}K k=1 of z in the dataset, then draws L random samples {xℓ}L ℓ=1 from the remainder of the dataset, and approximates KDE(z) as follows:

\ KDE(z) = 1

k=1 Kh(xk, z) + N K

ℓ=1 Kh(xℓ, z) (8)

This estimator is unbiased for any subset of points xk {xi}N i=1 drawn in the ﬁrst stage. In particular, using approximate nearest-neighbors (ANNs) rather than exact nearest neighbors of z increases the variance of Equation 8 but does not introduce bias.

As the closed-form score Ä t is the score of a Gaussian KDE Ä t with bandwidth h = 2(1 t)2, we approximate the closed-form score using the following ratio estimator:

\ log Ä t (z) = \ Ä t (z) Ä t (z)

= bÄ t (z) bÄ t (z) , (9)

where bÄ t (z) is Karppa et al. (2022) s estimator Equation 8. Since the gradient operator is linear, both the numerator and denominator in Equation 9 are unbiased estimates of their respective terms in the closed-form score.

B.1 Proof of Proposition 4.1

For each k = 1, ..., N M, let t Ck = (txi(k,m) : m = 1, ..., M) be an M-tuple of rescaled training points txi. (The same point txi can appear multiple times in an M-tuple.)

Deﬁne the barycenters and variances of these tuples as follows:

m=1 txi(k,m), uk = 1

m=1 ui(k,m), Var(t Ck) = 1

m=1 txi(k,m) t ck 2. (10)

We will show that up to a constant factor, the smoothed score Equation 3 is itself the score of a mixture of N M Gaussians. Rewriting the smoothed score in gradient form, we have:

Published in Transactions on Machine Learning Research (05/2025)

sÃ,t(z) = 1 (1 t)2 1 M

i=1 softmax z t X 2 + Ãtui,m

i=1 exp z txi 2 + Ãtui,m

= z 1 M log

i=1 exp z txi 2 + Ãtui,m

= z 1 M log

PM m=1( z txi(k,m) 2 + Ãtui(k,m))

= z 1 M log

M z t ck 2 + Var(t Ck) + Ãt uk

k=1 exp M(Var(t Ck) + Ãt uk)

exp M z t ck 2

k=1 wk(t) exp M z t ck 2

M z log qt(z)

This shows that up to a constant factor 1 M , the smoothed score sÃ,t(z) is the score of a large mixture of

Gaussians qt(z) = PNM

k=1 wk(t) exp M( z t ck 2)

. The mean of each Gaussian is the barycenter t ck of some

M-tuple t Ck of training points txi(k,m), and its common covariance matrix is (1 t)2

M I. The time-dependent

mixture weights wk(t) exp M(Var(t Ck)+Ãt uk)

2(1 t)2 are decreasing in the variance of the M-tuples t Ck but

are subject to the presence of noise terms Ãt uk = Ãt

Finally, by expanding the gradient in ( ), we straightforwardly obtain:

sÃ,t(z) = 1 (1 t)2

k=1 softmax M 2(1 t)2 z t ck 2 + Var(t Ck) + Ãt uk

B.2 Proof of Theorem 5.1

k=1 softmax M 2(1 t)2 z t ck 2 + Var(t Ck) + Ãt uk

k t ck (11)

so that sÃ,t(z) = 1 (1 t)2 (kÃ,t(z) z). Then

Published in Transactions on Machine Learning Research (05/2025)

vÃ,t(z) = 1

t (z + (1 t)sÃ,t(z))

z + 1 (1 t)(kÃ,t(z) z)

t kÃ,t(z) z

Expanding the formula for the ﬁnal Euler step using this expression for vÃ,t(z) and t S 1 = S 1

S , we obtain:

z S = z S 1 + 1

S vÃ,t S 1(z S 1)

= z S 1 + 1

S (z S 1) z S 1

= z S 1 + S S 1kÃ, S 1

S (z S 1) z S 1

= S S 1kÃ, S 1

k=1 softmax MS2

S ck 2 + Var(S 1

S Ck) + Ã S 1

In the ﬁnal line, we use the fact that as S , the temperature of the softmax goes to 0 and picks out a single index k such that

k = argmax k z S 1 ck 2 + Var(Ck) + Ã uk

z S 1 ck 2 + Var(Ck) + Ã uk

B.3 Proof of Theorem 5.2

We divide the proof of this theorem into three propositions. We ﬁrst sketch the proof and state the propositions, and then prove each proposition in subsections below.

Our ﬁrst result shows that ﬂowing Ä0 through two similar velocity ﬁelds v t , vÃ,t yields similar model distributions Ä T , ÄÃ,T at some terminal time T:

Proposition B.1. Suppose a measure Ä0 is pushed through velocity ﬁelds v t , vÃ,t, and denote the respective pushforward measures at time t by Ä t , ÄÃ,t. Then,

W2(Ä T , ÄÃ,T ) f Z T

E z Ä t v t (z) vÃ,t(z) 2dt (12)

where (t) := exp R T t Lv s ds and Lv s g 0 is the Lipschitz constant of v s.

The result above applies to any two velocity ﬁelds, subject to some weak regularity conditions. To apply this result to the unsmoothed and smoothed velocity ﬁelds v t and vÃ,t, we bound (t) and E v t (z) vÃ,t(z) 2 in terms of Ã:

Published in Transactions on Machine Learning Research (05/2025)

Proposition B.2. Let v t be velocity ﬁeld of an unsmoothed CFDM, and let v Ã,t be the velocity ﬁeld Equation 6 of the corresponding Ã-CFDM. Then,

(t) f exp D2(2T 1)

E z Ä t v t (z) vt(z) 2 f C1 Ãt 2(1 t)3 (14)

where D is the diameter of the training data and C1 is a constant depending on the training data and the distribution pu of the scalar noise ui,m perturbing the distance weights in Equation 4.

Combining these results, we obtain the following bound on W2(Ä T , ÄÃ,T ):

W2(Ä T , ÄÃ,T ) = O(Ã). (15)

This shows that one can approximate a Ã-CFDM s model samples at some time T > 0 by model samples from its corresponding unsmoothed CFDM (i.e. a mixture of Gaussians) when the smoothing parameter Ã is small.

We now show that ﬂowing two similar distributions Ä T and ÄÃ,T through a Ã-CFDM s velocity ﬁeld from time T to 1 ϵ yields similar terminal distributions ÄT Ã,1 ϵ, Ä0 Ã,1 ϵ. Following De Bortoli (2022), we stop sampling at time 1 ϵ for some truncation parameter ϵ > 0 to account for the fact that the smoothed score sÃ,t blows up as t 1 due to division by (1 t)2.

Proposition B.3. Suppose Ä T and ÄÃ,T are pushed through the velocity ﬁeld vÃ,t of a Ã-CFDM, and let ÄT Ã,1 ϵ, Ä0 Ã,1 ϵ denote their respective terminal distributions at time 1 ϵ. Then

W2(ÄT Ã,1 ϵ, Ä0 Ã,1 ϵ) f O

2 exp D2(1 2ϵ)

W2(Ä T , ÄÃ,T ), (16)

where D is the diameter of the training data.

By combining Equation 15 and Equation 16 and treating T and the truncation parameter ϵ as ﬁxed, we ﬁnally obtain a global upper bound on W2(ÄT Ã,1 ϵ, Ä0 Ã,1 ϵ):

W2(ÄT Ã,1 ϵ, Ä0 Ã,1 ϵ) = O(Ã) (17)

where ÄT Ã,1 ϵ is the model distribution obtained by starting sampling at T > 0 with samples from the unsmoothed CFDM and Ä0 Ã,1 ϵ is true model distribution of the Ã-CFDM.

B.3.1 Proof of Proposition B.1

Our proof for this proposition employs techniques similar to those used to prove Theorem 1 and Proposition 1 in Kwon et al. (2022).

We begin with the following well-known result (Santambrogio, 2015, Corollary 5.25):

Suppose that two measures Ä and Ä are each pushed through velocity ﬁelds v t , vt respectively and denote the pushforward measures at time t by Ä t , = Ä t . Then:

1 2 d dt W 2 2 (Ä t , Ät) = E (x,y) µtïy x, v t (y) vt(x)ð (18)

Published in Transactions on Machine Learning Research (05/2025)

where µt is the W2 coupling between Ä t and Ät.

For any x, y we can use Cauchy-Schwarz and the triangle inequality to obtain the following bound:

ïy x, v t (y) vt(x)ð f y x ( v t (y) v t (x) + v t (x) vt(x) ) (19)

We can then bound v t (y) v t (x) in terms of maximum of the Jacobian Dv t of v t on the line segment [x, y] := {ty + (1 t)x : 0 f t f 1} to obtain:

ïy x, v t (y) vt(x)ð f max p [x,y] Dv t y x 2 + y x v t (x) vt(x) (20)

This constant is in turn upper-bounded by the Lipschitz constant Lv t of v t on the convex hull of supp(Ä t ) supp(Ät), so we in fact have:

ïy x, v t (y) vt(x)ð f Lv t y x 2 + y x v t (x) vt(x) (21)

Adding E (x,y) µt back in, we get:

1 2 d dt W 2 2 (Ä t , Ät) = E (x,y) µtïy x, v t (y) vt(x)ð

f Lv t E (x,y) µt y x 2 + E (x,y) µt y x v t (x) vt(x)

= Lv t W 2 2 (Ä t , Ät) + E (x,y) µt y x v t (x) vt(x)

f Lv t W 2 2 (Ä t , Ät) + r

E (x,y) µt y x 2 r

E (x,y) µt v t (x) vt(x) 2

= Lv t W 2 2 (Ä t , Ät) + W2(Ä t , Ät) r

E x Ä t v t (x) vt(x) 2

where we use Cauchy-Schwarz for random variables in passing from the third to fourth lines and then the fact that Ä t is one of the marginals of µt. Using the chain rule on the LHS and cancelling a factor of W2(Ä t , Ät) from both sides, we obtain the following diﬀerential inequality:

d dt W2(Ä t , Ät) f Lv t W2(Ä t , Ät) + r

E x Ä t v t (x) vt(x) 2 (22)

We can now solve the diﬀerential inequality Equation 22 to obtain:

W2(Ä T , ÄT ) f Z T

E x Ä t v t (x) vt(x) 2dt (23)

Published in Transactions on Machine Learning Research (05/2025)

B.3.2 Proof of Proposition B.2

We ﬁrst estimate (t) = exp R T t Lv s ds . As

v t (z) = 1 t(1 t)k t (z) 1 1 tz, (25)

we can bound its Lipschitz constant by Lv t f 2 max n 1 t(1 t)Lk t , 1 1 t o . Our next step is therefore to bound

Lk t . We will do so by bounding the spectral norm of the Jacobian Jk t (z) 2 of k t for any z RD.

To this end, we ﬁrst note that for any z RD, this Jacobian has the form of a weighted covariance matrix:

Jk t (z) = 1 (1 t)2

i=1 wi(z)(txi k t (z))(txi k t (z)) ,

where wi(z) is the i-th component of the weight vector softmax z t X 2

2(1 t)2 . To upper bound its spectral norm, we ﬁrst apply the triangle inequality and use the absolute homogeneity of norms:

Jk t (z) 2 =

i=1 wi(z)(txi k t (z))(txi k t (z)) 2

i=1 wi(z) (txi k t (z))(txi k t (z)) 2.

But it is well-known that the spectral norm of a rank-1 matrix uv is given by uv 2 = u 2 v 2, so in fact

Jk t (z) 2 f 1 (1 t)2

i=1 wi(z) txi k t (z) 2 2. (26)

Now, because k t (z) is a convex combination of the txi, it lies in the convex hull of the txi. The diameter of a convex hull is bounded by the diameter of its extreme points, so if D = diam({xi}) and t D = diam({txi}), then txi k t (z) 2 2 f (t D)2. Substituting this back into equation 26 and using the fact that the wi(z) sum

to 1 for any z, we obtain the bound Jk t (z) 2 f t D 1 t 2 . This bound holds for all z RD, so it follows that

Lk t f t D 1 t 2 .

Hence Lv t f 2 max n t D2 (1 t)3 , 1 1 t o .

We now use this bound on Lv t to estimate (t). Let s [0, T] denote the time from which s D2 (1 s)3 g 1 1 s.

D2 + 4 + 2). Decomposing the integral that deﬁnes log (t), we obtain:

Published in Transactions on Machine Learning Research (05/2025)

t Lv s ds = Z s

t Lv s ds + Z T

1 1 sds + 2D2 Z T

s (1 s)3 ds

= 2 log 1 t

(1 T)2 2 s 1

f 2 log 1 t

+ D2 2T 1 (1 T)2 4D2 D

D2 + 4 + 1 (D2 D

Substituting this bound into (t) = exp( R T t Lv s ) and simplifying, we obtain:

(t) f C(T) 1 t

where C(T) = exp D2 2T 1 (1 T )2 4 D2 D

D2+4+1 (D2 D

D2+4)2 = exp D2(2T 1)

(1 T )2 and D2 =: C0 depends only on the training data.

We now estimate r

E x Ä t v t (x) vt(x) 2.

We ﬁrst observe that v t (z) vt(z) = 1 t(1 t)(k t (z) kt(z)). Once again letting X RD N be the matrix

of training data and w (z) = softmax z tx 2

2(1 t)2 RN, wm(z) = softmax z tx 2+Ãtui,m

2(1 t)2 RN be the

vector of weights, we have that k t (z) = t Xw (z) and hence

v t (z) vt(z) = 1 1 t X(w (z) 1

m=1 wm(z)) f 1 1 t X 1

m=1 w (z) wm(z) . (28)

Once again using the Lipschitz continuity of w(z), we obtain the bound

w (z) wm(z) f Ãtui,m 2(1 t)2 , (29)

and by substituting this into our bound for v t (z) vt(z) , we obtain:

v t (z) vt(z) 2 f t2Ã2 X 2 u2 i 4(1 t)6 , (30)

where ui = 1 M P m ui,m. As this bound holds for all z, it also holds in expectation, so we ﬁnally conclude that

E x Ä t v t (x) vt(x) 2 f tÃ X ui

2(1 t)3 . (31)

B.3.3 Proof of Proposition B.3

We now begin with the diﬀerential inequality

d dt W2(Ä t , Ät) f Lv t W2(Ä t , Ät) + r

E x Ä t v t (x) vt(x) 2, (32)

Published in Transactions on Machine Learning Research (05/2025)

that we derived in the proof of Proposition B.1, which bounds the rate of change in W2(Ä t , Ät) when ﬂowing Ä t and Ät through two velocity ﬁelds v t and vt, respectively. As we now consider the case where Ä T and ÄÃ,t both ﬂow through the smoothed velocity ﬁeld vÃ,t from time T to 1 ϵ, r

E x Ä t v t (x) vt(x) 2 = 0 and the

diﬀerential inequality becomes:

d dt W2(Ä t , ÄÃ,t) f LvÃ,t W2(Ä t , ÄÃ,t). (33)

Solving this diﬀerential inequality, we obtain

W2(ÄT Ã,1 ϵ, Ä0 Ã,1 ϵ) := W2(Ä 1 ϵ, ÄÃ,1 ϵ) f (T)W2(Ä T , ÄÃ,T ) (34)

where (T) = exp R 1 ϵ T Lvs,Ãds . Using the same bounds as in our proof of Proposition B.2 while noting that vÃ,t is at least as smooth as v t , we obtain

ϵ ) + D2 1 2ϵ

D2 + 4 + 1 (D2 D

D2 + 4 + 1 (D2 D

2 exp D2(1 2ϵ)

where D2 is the diameter of the training data, which we treat as constant for a given CFDM.

Substituting this into Equation 34, we obtain

W2(ÄT Ã,1 ϵ, Ä0 Ã,1 ϵ) f O

2 exp D2(1 2ϵ)

W2(Ä T , ÄÃ,T ). (35)

B.4 Proof of Proposition 5.3

We showed in Theorem 5.1 (see Appendix B.2) that as the number of sampling steps S , the samples from a smoothed CFDM converge towards barycenters z S = ck of M-tuples of training points for indices k

k (z S 1) = argmax k z S 1 ck 2 + Var( Ck) + Ã uk (36)

Using an equivalent expression for kÃ,t, these barycenters can also be written as

z S = ck = 1

m=1 xi (z S 1,m), (37)

Published in Transactions on Machine Learning Research (05/2025)

i (z S 1, m) = argmax i z S 1 xi 2 + Ãui,m

= argmax i 1

Ã z S 1 xi 2 + ui,m

If ui,m Gumbel(0, 1), then by applying the Gumbel max-trick, we conclude that i (z S 1, m) Categorical(Ãi Ã). This is a distribution over the indices i = 1, ..., N of training samples, with event probabilities given by

Ãi Ã = softmax 1

Ã z S 1 xi 2

If we represent xi as Xei , where X RD N is the matrix whose i-th column is training sample xi and ei RN is the i -th standard basis vector, then

m=1 xi (z S 1,m)

But IÃ := PM m=1 ei is a realization of a Multinomial(ÃÃ, M) random variable; this fact completes the proof of Proposition 5.3.

C Additional Experimental Details and Results

In this appendix, we provide details for our pixel space and latent space image generation experiments.

C.1 Pixel space DDPM training details

Our training data is drawn from the dataset huggan/smithsonian_butterflies_subset, which is publicly available on huggingface and contains 1000 images of butterﬂies. We extract RGB images from the image column of their dataset and reshape them to 128 128 before using them in our experiments. We construct an 80/20 train-test split and use the train partition to train the DDPM and to construct our Ã-CFDM, and use the test partition to compute metrics.

We use the DDPM implemented in the lucidrains library denoisingdiffusionpytorch as our baseline. We use their UNet with dim_mults=(1, 2, 4, 8) as a backbone. We use 1000 time steps during training, and use DDIM sampling with 100 time steps during sample generation. We train the diﬀusion model with a batch size of 8 at a learning rate of 5 10 5 for 20,000 iterations.

We center and normalize the training data to lie in the unit ball before using it to construct our Ã-CFDM. We set M = 2 and Ã = 0.1 for this experiment, and compute the smoothed score exactly rather than using our nearest neighbor-based estimator from Section 5.4 due to this dataset s relatively small size. We start sampling at T = 0.98 and use step size 0.01. We ﬁlter out model samples whose Euclidean distance is within

Published in Transactions on Machine Learning Research (05/2025)

10 6 of their nearest neighbor in the training set; with these hyperparameters, roughly 60% of the model samples remain after this ﬁltering step.

We compute our metrics using the torchmetrics implementation of the kernel inception distance (KID) and inception score. We compute KID scores with subset_size=50 between 500 randomly-chosen images from the test partition and our CFDM and DDPM samples.

C.2 Celeb A latent space generation details

Our method uses the nuclear norm-regularized autoencoder from Scarvelis & Solomon (2024). This autoencoder operates on 256 256 images from the Celeb A dataset. To reduce the memory and compute costs of our autoencoder, we perform a discrete cosine transform (DCT) using the torch-dct package and keep only the ﬁrst 80 DCT coeﬃcients. We then pass these coeﬃcients into the autoencoder.

The autoencoder consists of an encoder f¹ followed by a decoder gϕ. The encoder f¹ is parametrized as a two-layer MLP with 10,000 hidden units; the latent space is 700-dimensional. The decoder gϕ consists of a two-layer MLP with 10,000 hidden units and 3 80 80 = 19200 output dimensions, followed by an inverse DCT, and ﬁnally a UNet. We set the regularization parameter to = 4 (see (Scarvelis & Solomon, 2024, Appendix B.3) for details on the training objective) and use a log-cosh reconstruction loss (Chen et al., 2019) for improved sample quality. We train for 100 epochs at a learning rate of 10 4 using the Adam W optimizer (Loshchilov & Hutter, 2017).

We then sample our Ã-CFDM in the latent space of this pre-trained autoencoder. We center and normalize the training data to lie in the unit ball before using it to construct our Ã-CFDM. We set M = 2 and Ã = 0.025 for this experiment and use the nearest neighbor-based score estimator described in Section 5.4. We start sampling at T = 0.99 and use step size 0.01. We ﬁlter out model samples whose Euclidean distance is within 10 6 of their nearest neighbor in the training set; with these hyperparameters, roughly 34% of the model samples remain after this ﬁltering step.

Our baseline is a VAE with the same architecture as the nuclear norm-regularized autoencoder and the same log-cosh reconstruction loss. We set the regularization strength at 10 3 and train for 100 epochs at a learning rate of 10 4 using the Adam W optimizer. At sampling time, we decode Gaussian samples drawn from N(0, 10I); we ﬁnd that this results in improved sample quality relative to sampling from a standard normal distribution.

We compute our metrics using the torchmetrics implementation of the kernel inception distance (KID) and inception score. We compute KID scores with subset_size=50 between 500 randomly-chosen images from the test partition and our CFDM and DDPM samples.

D Impact of M on model samples

In this appendix, we demonstrate the impact of M the number of noise samples used to computed the smoothed score Equation 3 on a Ã-CFDM s model samples. In Figure 8, we use a simple training set of 2 points (in blue), ﬁx Ã = 1, generate 100 Ã-CFDM samples (in red) for diﬀerent values of M. Note in particular that for large values of M, the model samples cluster around the centroid of the two training points. We conjecture that this phenomenon may be explained by the law of large numbers: As M , 1 M PM m=1 kt(x + Ãϵm) Eϵkt(x + Ãϵ), which is a deterministic quantity lying on the line segment connecting the two training points. In this regime, the reasoning used in the proof of Theorem 5.1 suggests that conditional on the second-to-last sampling iterate z S 1, the output of a Ã-CFDM becomes deterministic and all randomness in the model samples originates from z S 1.

In Figure 9, we carry out a similar experiment with a training set consisting of 500 samples from the checkerboard distribution and Ã = 0.3. Note that for large values of M, the model samples recede from boundary of the convex hull of the training data; we conjecture that this is an instance of the same phenomenon as in Figure 8.

Published in Transactions on Machine Learning Research (05/2025)

We ﬁnally experiment with the impact of M on the latent sampling algorithm introduced in Section 6.4. We ﬁx all hyperparameters but M to the values described in Appendix C.2 and use our Ã-CFDM to sample with M {2, 4, 6, 8}. We depict grids of decoded samples in Figure 10 and report sample quality metrics in Table 3. Using small values of M tends to yield increased sample diversity while keeping sampling costs low, whereas large values of M yield a greater proportion of samples which qualitatively resemble barycenters of face images. This is also likely an instance of the phenomenon illustrated in Figure 8, in which model samples cluster around centroids of training samples for large values of M.

Table 3: Metrics for Ã-CFDM sample quality and generation time in latent space as a function of M.

Value of M Metric Celeb A

Inception score 2.18 0.22 KID ³ 0.091 0.0071 Sampling time (CPU) 41 ms

Inception score 2.13 0.16 KID ³ 0.095 0.0077 Sampling time (CPU) 52 ms

Inception score 2.03 0.17 KID ³ 0.098 0.0088 Sampling time (CPU) 97 ms

Inception score 2.11 0.19 KID ³ 0.095 0.0073 Sampling time (CPU) 368 ms

Figure 8: Ã-CFDM samples (in red) generated given two training points (in blue) for various M.

E Comparison of Gaussian and Gumbel noise for latent space sampling

In this appendix, we brieﬂy illustrate our method s robustness to the distribution of noise used for smoothing the closed-form score. We use our Ã-CFDM to sample Celeb A images in latent space using the same hyperparameters as described in Section C.2, and compare the eﬀects of smoothing the closed-form score with Gaussian noise and with Gumbel noise whose ﬁrst two moments match those of the Gaussian noise. We

Published in Transactions on Machine Learning Research (05/2025)

Figure 9: Ã-CFDM samples (in red) generated given 500 training samples from the checkerboard distribution (in blue) for various M.

depict grids of decoded samples in Figure 11 and report sample quality metrics in Table 4. Our method is robust to the distribution of smoothing noise, with Gaussian and Gumbel noise resulting in similar sample quality metrics and qualitatively similar samples.

Table 4: Metrics for Ã-CFDM sample quality and generation time in latent space when smoothing with Gaussian and Gumbel noise with matched mean and covariance.

Noise distribution Metric Celeb A

Inception score 2.18 0.22 KID ³ 0.091 0.0071 Sampling time (CPU) 41 ms

Inception score 2.22 0.19 KID ³ 0.092 0.0082 Sampling time (CPU) 50 ms

F Adding noise to the velocity ﬁeld does not induce generalization

The proof of Theorem 5.1 shows that in the limit of small step sizes, a Ã-CFDM outputs barycenters of training samples in its ﬁnal sampling iteration. This is true regardless of the position of the second-to-last iterate z S 1. Consequently, adding noise to the velocity ﬁeld at each sampling step, as is typical for the Euler-Maruyama scheme to simulate stochastic diﬀerential equations (SDEs), does not fundamentally change the result of Theorem 5.1: SDE sampling for a Ã-CFDM would still output barycenters of training samples, provided the noise variance vanishes as t 1. (If the noise variance does not vanish at the ﬁnal sampling iteration, then the sampler will return noisy samples as some noise remains present in the ﬁnal iteration.)

Furthermore, adding noise to an unsmoothed CFDM s velocity ﬁeld is insuﬃcient to induce generalization. Theorem 5.1 also shows that the output z S of our sampler is of the form S S 1kÃ, S 1

S (z S 1). This would not change if the velocity ﬁeld were augmented with noise that vanishes in the ﬁnal sampling iteration; only z S 1 would change. When the smoothing parameter Ã = 0, kÃ, S 1

S (z S 1) = k S 1

S (z S 1) =

Published in Transactions on Machine Learning Research (05/2025)

Figure 10: Ã-CFDM samples drawn in latent space with M {2, 4, 6, 8}.

PN i=1 softmax z S 1

S xi. For suﬃciently small step sizes, the temperature of this softmax is

nearly 0, which implies that the unsmoothed CFDM outputs a training sample in its ﬁnal iteration, regardless of the penultimate sample z S 1. It follows that augmenting the velocity ﬁeld vt of an unsmoothed CFDM with noise does not induce generalization.

We demonstrate this empirically in Figure 12, in which we sample from an unsmoothed CFDM constructed from 500 empirical samples from the 2D checkerboard distribution, and from the same CFDM whose velocity ﬁeld at time t is augmented with Gaussian noise with covariance

0.1(1 t)I. The step size in these experiments is 10 2. Adding noise to the velocity ﬁeld does not induce generalization, and the model samples are numerically identical to training samples in each case: the chamfer distance between the noiseless CFDM s model samples and the training samples is 6.41 10 4, and the chamfer distance between the noisy CFDM s model samples and the training samples is 6.59 10 4.

G Impact of step size on generalization

In this section, we demonstrate empirically that an unsmoothed CFDM memorizes its training data, even when sampled with relatively few iterations.

Published in Transactions on Machine Learning Research (05/2025)

(a) Gaussian noise

(b) Gumbel noise

Figure 11: Ã-CFDM samples drawn in latent space using Gaussian noise (left) and Gumbel noise (right) with matched means and covariances to smooth the closed-form score.

(a) Noiseless velocity ﬁeld

(b) Noisy velocity ﬁeld

Figure 12: Augmenting an unsmoothed CFDM s velocity ﬁeld with Gaussian noise that vanishes in the ﬁnal step does not induce generalization.

We ﬁrst show that sampling an unsmoothed CFDM with large step sizes may in principle lead to generalization. To do so, we employ similar arguments as in the proof of Theorem 5.1 in Appendix B.2. In particular, if Ã = 0, then the smoothed velocity ﬁeld vÃ,t(z) can be expressed as v0,t(z) = vt(z) = 1 1 t 1

t kt(z) z . By expanding the formula for the ﬁnal Euler step using this expression for vt(z) and t S 1 = S 1

S in the same way as in Appendix B.2 but substituting the softmax weights from the deﬁnition of kt(z) in equation 2, one sees that the ﬁnal iterate z S will be of the form

i=1 softmax

S2 z ( S S 1)X 2

If the number of sampling iterations S is small, then the temperature of this softmax may be suﬃciently large that z S is a non-trivial convex combination of training samples. However, we demonstrate empirically

Published in Transactions on Machine Learning Research (05/2025)

that this does not occur in practice, even for relatively small S. We construct an unsmoothed CFDM using 500 samples from a 100-dimensional standard Gaussian distribution to mimic the high-dimensional data that is typical of real-world applications, sample it with a number of iterations ranging from S = 1 to S = 100, and compute the chamfer distance between the model samples and the training samples for each value of S. We depict the result of this experiment in Figure 13. An unsmoothed CFDM memorizes its training data when sampled with as few as 5 iterations, demonstrating that in practice, large step sizes alone are insuﬃcient to cause a CFDM to generalize.

Figure 13: An unsmoothed CFDM memorizes its training data when sampled with as few as 5 iterations.

H Comparing neural SGMs to σ-CFDMs.

In this section, we present some preliminary ﬁndings on the relationship between the inductive bias of neural SGMs and that of our proposed Ã-CFDM. We ﬁrst show that in low-dimensional problems, one may approximate the velocity ﬁeld of a neural SGM with that of a Ã-CFDM for an appropriate choice of Ã. We then show that this is no longer the case for high-dimensional image datasets such as MNIST with the velocity ﬁeld vt(z) parametrized by a Unet. Our strategy will be to extract kt(z) from a neurally-parametrized vt(z) and show that in contrast to a Ã-CFDM, the neural kt(z) does not output convex combinations of training samples for t 1.

H.1 2D datasets

For this experiment, we train a simple 3-layer MLP on the ﬂow-matching objective from (Liu et al., 2023), whose theoretical optimum is attained by the velocity ﬁeld vt(z) := 1

t (z + (1 t) log Ä t (z). (Recall that for all 0 f t f 1, Ä t is a mixture of Gaussians centered at rescaled training samples with common covariance (1 t)2I.) Our MLP has 64 neurons in each hidden layer and uses Softplus activations. We train this MLP for 20k epochs at a learning rate of 10 3 using Adam W, and take the target distributions to be the empirical distribution over ﬁxed sets of 500 samples from the 2D Checkerboard dataset used throughout this paper and a 2D spirals dataset, respectively.

We then sweep over Ã [0, 2] and construct a Ã-CFDM on the same 500 training samples for each value of Ã. For each value of Ã, we then sweep over t (0, 1), draw batches of samples from Ä t , and compute the average squared L2 distance between the neural velocity ﬁeld vt and the velocity ﬁeld vÃ,t deﬁned in equation 6. We

Published in Transactions on Machine Learning Research (05/2025)

report the average squared L2 distance across t for each value of Ã in Figure 14. The unsmoothed CFDM s velocity ﬁeld v t is a poor approximation to the neural SGM s velocity ﬁeld, indicating that the neural SGM does not learn the closed-form score for its training set. However, smoothing the CFDM signiﬁcantly improves the quality of the approximation, with Ã = 0.45 achieving an 88.8% reduction in squared L2 error against the neural velocity ﬁeld relative to the unsmoothed velocity ﬁeld on the checkerboard dataset, and Ã = 0.15 achieving an 81.1% reduction in squared L2 error on the spirals dataset.

(a) Checkerboard dataset

(b) Spirals dataset

Figure 14: An unsmoothed CFDM s velocity ﬁeld is a poor approximation to a neural SGM trained on the same dataset, but smoothing it signiﬁcantly improves the quality of the approximation.

In Figure 15, we compare samples drawn from the neural SGM (in red) to samples drawn from a Ã-CFDM with the optimal values of Ã = 0.45 and Ã = 0.15 for the checkerboard and spirals datasets, respectively. While each distribution s samples are qualitatively similar, we note in particular that the Ã-CFDM places less mass near extreme points of the support of the training data. We observed a similar phenomenon in Appendix D and conjectured that it results from 1 M PM m=1 kt(x + Ãϵm) converging towards a deterministic quantity for suﬃciently large values of M. We further conjecture that this phenomenon may partially explain why neural SGMs generalize better on high-dimensional image generation problems, as they place more mass near extreme points of the data distribution and less mass on barycenters of training samples, which typically do not resemble natural images.

Finally, in Figure 16, we compare the velocity ﬁelds of a neural SGM and an appropriately-smoothed Ã-CFDM for the checkerboard and spirals datasets at three times: t {0.1, 0.5, 0.9}. We normalize the velocity ﬁelds in the top row of each subﬁgure to facilitate a comparison of the direction of each velocity ﬁeld, and depict the diﬀerence of the non-normalized velocity ﬁelds in the bottom rows. Smoothing a CFDM yields a velocity ﬁeld that accurately approximates the corresponding neural SGM s velocity ﬁeld for t close to 0, but the accuracy of this approximation deteriorates as t 1.

In this section, we will show that if one parametrizes vt by a Unet and trains it on a high-dimensional image dataset such as MNIST, the resulting model does not behave like a Ã-CFDM. Because sample estimates of the squared L2 distance between the neural vt and a Ã-CFDM s vÃ,t are noisy in high dimensions, we do not employ the strategy from the previous section to compare a neural SGM and our Ã-CFDM, but instead study the extent to which a neural SGM points towards convex combinations of training samples in image generation problems.

Published in Transactions on Machine Learning Research (05/2025)

(a) Checkerboard dataset, σ = 0.45

(b) Two Spirals dataset, σ = 0.15

Figure 15: A Ã-CFDM generates samples that are qualitatively similar to a neural SGM on low-dimensional datasets, but sends less mass near extreme points of the data distribution.

Given a velocity ﬁeld vt(z) := 1

t (z + (1 t) log Ä t (z)), where log Ä t (z) = 1 (1 t)2 (kt(z) z), one may extract the corresponding kt by substituting the formula for log Ä t (z) into the formula for vt and rearranging. Equation 4 and Proposition 4.1 show that if kÃ,t(z) is extracted from either of the unsmoothed or smoothed scores sÃ,t (Ã = 0 and Ã > 0, resp.), then it must output convex combinations of rescaled training samples for any z: kÃ,t(z) = PN i=1 wi(z)txi with wi(z) g 0 and PN i=1 wi(z) = 1. We will show that for t 1, the kt function extracted from a neural SGM s velocity ﬁeld vt is not well-approximated by convex combinations of rescaled training samples. This will imply that unlike a Ã-CFDM, a neural SGM s score function does not point towards convex combinations of training data.

To this end, we train a Unet on the ﬂow-matching objective from (Liu et al., 2023), whose theoretical optimum is attained by the velocity ﬁeld vt(z) := 1

t (z + (1 t) log Ä t (z)). The training set consists of the 60k images of handwritten digits from the MNIST train partition. We train for 10 epochs at a learning rate of 10 4

using Adam W.

We then draw a batch of 512 samples xk from the MNIST test partition, ﬁx a value of t (0, 1), and compute noisy samples of the form zt,k = txk + (1 t)ϵ, where ϵ N(0, I). We compute the neural SGM s kt(zt,k) for each noisy sample, and use projected gradient descent to regress each kt(zt,k) on the set of rescaled training samples txi subject to the constraint that the weights lie in the probability simplex. If the neural SGM is well-approximated by a Ã-CFDM, then we would expect the MSE of this regression to be close to 0.

In the left panel of Figure 17, we depict the average MSE of the regression of each 1

t kt(zt,k) onto the convex hull of the training samples xi as a function of t. (We rescale kt(zt,k) by 1

t to enable a direct comparison to the convex hull of the training samples xi; otherwise, the MSE values for small t would be small simply because the data has been scaled by t.) For small values of t, the function 1

t kt extracted from neural SGM is well-approximated by a convex combination of training samples xi, as one would expect for a Ã-CFDM. However, the quality of this approximation deteriorates as t 1. This indicates that by pointing towards the convex hull of the training data, a neural SGM behaves like a Ã-CFDM for t 0, but that this behavior vanishes as t 1.

If a neural SGM fails to behave like a Ã-CFDM for t 1, then how does it instead behave? The right panel of Figure 17 indicates that for t 1, the kt extracted from a neural SGM approximates an optimal denoiser for test data. In particular, by computing kt(zt) for a noisy test sample zt = tx + (1 t)ϵ, one recovers the rescaled test sample tx. Because test samples often lie outside the convex hull of a neural SGM s training data, this behavior cannot be replicated by a Ã-CFDM. This provides evidence of a neural SGM s generalization

Published in Transactions on Machine Learning Research (05/2025)

capabilities by showing that its score points towards regions of the support of the target distribution that are outside the convex hull of the training data.

We further illustrate this phenomenon in Figure 18. When t = 0.2, the output of the neural SGM s kt(z) on noisy data is well-approximated by a convex combination of training samples, as the theory of Ã-CFDMs predicts. In contrast, when t = 0.8, a neural SGM s kt(z) behaves like an optimal denoiser for test data, mapping noisy test samples to their clean counterparts. While a comprehensive study of the generalization of neural SGMs is out of scope for this work, we point to concurrent work such as Kamb & Ganguli (2024), which studies closed-form solutions to the score-matching problem under locality and equivariance constraints to better understand the generalization of SGMs parametrized by convolutional architectures.

Published in Transactions on Machine Learning Research (05/2025)

(a) Checkerboard dataset, σ = 0.45

(b) Two Spirals dataset, σ = 0.15

Figure 16: Smoothing a CFDM yields a velocity ﬁeld that accurately approximates a neural SGM s velocity ﬁeld for small times, but the accuracy of the approximation deteriorates as t 1.

Published in Transactions on Machine Learning Research (05/2025)

(a) A neural SGM s kt is close to the convex hull of the training data for small t, but not for t 1.

(b) A neural SGM s kt behaves like an optimal denoiser for test data as t 1.

Figure 17: A neural SGM s kt behaves like a Ã-CFDM s kt for t 0. However, as t 1, it instead behaves like an optimal denoiser for test data.

Figure 18: Row 1 shows that for small t, a neural SGM s kt(z) outputs convex combinations of training samples, as one expects for a Ã-CFDM. In contrast, Row 2 shows that for large t, a neural SGM s kt(z) behaves like an optimal denoiser for test data, mapping noisy test samples to their clean counterparts