The Diffusion Duality

Subham Sekhar Sahoo 1, Justin Deschenaux 2, Aaron Gokaslan 1, Guanghan Wang 1, Justin Chiu 3, Volodymyr Kuleshov 1

Abstract

Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and masked diffusion models. In this work, we narrow this performance gap by leveraging a key insight: uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Our method, Duo, transfers powerful techniques from Gaussian diffusion to improve both training and sampling. First, we introduce a curriculum learning strategy guided by the Gaussian process that doubles training speed by reducing variance. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting. This algorithm unlocks few-step generation in diffusion language models by accelerating sampling by two orders of magnitude. We provide the code and model checkpoints on the project page: https://s-sahoo.com/duo

1. Introduction

An eternal theme in mathematics is that discreteness emerges from underlying continuity. From quantum mechanics, where the quantized energy states of electrons arise as solutions to continuous wave equations, to the binary logic of digital circuits, fundamentally driven by smooth analog currents, discreteness has repeatedly and naturally emerged from an underlying continuum. Our work continues this tradition by demonstrating that a discrete diffusion process is, in fact, an emergent phenomenon of an underlying continuous Gaussian diffusion process. This perspective enables the design of faster training and sampling algorithms for discrete diffusion models.

1 Computer and Information Science, Cornell Tech, NY, USA. 2 School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland. 3 Cohere, NY, USA. Correspondence to: Subham Sekhar Sahoo.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Figure 1: An illustration of Uniform-state discrete diffusion (top) and the underlying Gaussian diffusion (bottom). While both are separate Markov processes, applying arg max maps Gaussian latents w_t ∈ R^K to discrete latents z_t ∈ V, transforming their marginals from q̃_t(· | x; ᾱ_t) (6) to q_t(· | x; T(ᾱ_t)) (1) and adjusting the diffusion parameter from ᾱ_t to α_t = T(ᾱ_t) (10). Notably, the ELBO for Uniform-state diffusion induces a tighter bound on the likelihood than Gaussian diffusion, as established in Theorem 3.1.

Diffusion models (Sohl-Dickstein et al., 2015) are powerful generative models inspired by physics. Gaussian diffusion models excel at synthesizing realistic and high-quality continuous-valued data such as images (Ho et al., 2020; Rombach et al., 2022), audio (Kong et al., 2021; Liu et al., 2023b), and videos (Ho et al., 2022; Wu et al., 2023; Esser et al., 2023; Blattmann et al., 2023). Gaussian diffusion is well studied: the success of these models is rooted in techniques such as efficient parameterizations of the denoising model, which improve upon the standard mean parameterization (Ho et al., 2020; Salimans & Ho, 2022; Zheng et al., 2023), faster training techniques (Kingma et al., 2021), efficient samplers (Karras et al., 2022), and distillation schemes that enable single-step generation (Song et al., 2023; Song & Dhariwal, 2023; Yin et al., 2024). While Gaussian diffusion is well studied, it underperforms discrete diffusion models on tasks involving discrete data such as text (Sahoo et al., 2024b), graphs (Liu et al., 2023a), and molecules (Lee et al., 2025).
However, from the perspective of Gaussian diffusion, the design space for discrete diffusion models remains primitive: mean parameterization for the denoising model (Sahoo et al., 2024a; Schiff et al., 2025) and slow ancestral sampling (Austin et al., 2021) are still the dominant approaches. Recent work on distilling Masked Discrete Diffusion Models (MDMs) improves sampling speed (Deschenaux & Gulcehre, 2024), but performance degrades severely in the few-step regime. Unlike Gaussian diffusion models with Probability Flow ODEs (Song et al., 2020), MDMs lack an implicit property: a deterministic mapping from noise to data. This property is vital for few-step distillation methods (Song et al., 2023; Frans et al., 2024), but MDMs forgo it because their deterministic prior requires stochasticity during sampling. Our objective is twofold: (1) to design a framework for discrete diffusion that enables the transfer of advanced training and inference techniques from Gaussian diffusion to discrete diffusion models, and (2) to create language models that support few-step generation. To this end, we focus on Uniform-state Diffusion Models (USDMs) (Austin et al., 2021). In this paper, we discover a remarkable property of USDMs: they emerge from Gaussian diffusion processes, as illustrated in Fig. 1. We call this phenomenon the Diffusion Duality; it expands the design space of USDMs, making it possible to incorporate techniques developed for Gaussian diffusion. Notably, unlike MDMs, USDMs allow token updates during reverse sampling, naturally correcting earlier mistakes without requiring costly predictor-corrector steps (Zhao et al., 2024; Wang et al., 2025), thereby saving function evaluations (NFEs). However, these models have historically underperformed compared to MDMs (Austin et al., 2021; Lou et al., 2023), raising the key question: Can USDMs be made competitive with MDMs?
And more importantly, can the implicit property of the underlying Gaussian diffusion be leveraged for fast, few-step generation? We answer both questions with Duo, a rich framework of theoretical connections between USDMs and Gaussian diffusion. Duo enriches the design space of USDMs by incorporating Gaussian diffusion, which allows us to develop efficient training strategies that accelerate the training of USDMs, significantly reducing the performance gap between MDMs and AR models on standard language generation benchmarks. Notably, we surpass AR models on 3 out of 7 zero-shot datasets (Table 2). Furthermore, this duality allows us to adapt consistency distillation (Song et al., 2023) from Gaussian to discrete diffusion, reducing NFEs from 1024 to 8 with minimal effect on sample quality (Sec. 5.2). Importantly, in the low-NFE regime, Duo outperforms MDMs. Our main contributions are threefold: (1) We establish a theoretical connection between continuous and discrete diffusion, demonstrating that discrete diffusion arises from an underlying continuous Gaussian diffusion. This insight enables the transfer of techniques from the continuous domain to the discrete setting, opening up new possibilities. (2) Our framework doubles the training speed of USDMs by introducing a low-variance curriculum, and (3) it accelerates sampling by two orders of magnitude by adapting efficient distillation methods from continuous diffusion models.

2. Background

Notation. We represent scalar discrete random variables that can take K values as one-hot column vectors and define V = {x ∈ {0,1}^K : Σ_{i=1}^K x_i = 1} as the set of all such vectors. Define Cat(·; π) as the categorical distribution over K classes with probabilities given by π ∈ Δ^K, where Δ^K denotes the K-simplex. Additionally, let 1 = (1, …, 1)^T ∈ R^K, and let ⟨a, b⟩ and a ⊙ b respectively denote the dot and Hadamard products between two vectors a and b. We use x^{1:L} ∈ V^L and [x^ℓ]_{ℓ=1}^L ∈ V^L to denote sequences of length L.

2.1.
Discrete Diffusion Models

Consider clean data x ∈ V drawn from the data distribution q_data. In the discrete diffusion framework (Sohl-Dickstein et al., 2015; Austin et al., 2021), the complex data distribution q_data is mapped to a simple distribution through a sequence of Markov states. Sahoo et al. (2024a) propose a simplified variant, an interpolating noise framework, in which the forward process (q_t)_{t∈[0,1]} smoothly transitions from q_data to a prior distribution Cat(·; π) by introducing latent variables z_t ∈ V whose marginals conditioned on x at time t are given by:

q_t(· | x; α_t) = Cat(·; α_t x + (1 − α_t) π),  (1)

where the diffusion parameter α_t ∈ [0,1] is a strictly decreasing function of t, with α_{t=0} ≈ 1 and α_{t=1} ≈ 0. A discrete diffusion process is characterized by the time evolution of its marginals, which follows a linear ordinary differential equation (Anderson, 2012):

(d/dt) q_t = Q_t q_t,  (2)

where Q_t ∈ R^{K×K} is the state transition matrix. There are two main variants of interpolating noise frameworks: MDMs (Sahoo et al., 2024b), which use a masked token prior π = m with m ∈ V a special mask token, and USDMs (Schiff et al., 2025), which use a uniform prior over V (π = 1/K). These frameworks differ in their forward corruption dynamics. In MDMs, the clean data x either stays unchanged or transitions to the mask token m, after which it remains masked for the rest of the process. In contrast, USDMs allow each token to either stay the same or transition uniformly to any other token in V, with the transition probability determined by the diffusion timestep (see Fig. 5 for examples). These forward dynamics impact the reverse generation process: USDMs permit continual token updates, while MDMs fix tokens once unmasked. To mitigate this limitation, predictor-corrector methods have been proposed for MDMs (Campbell et al., 2022; Gat et al., 2024; Wang et al., 2025), but at the cost of added computation.
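To make these dynamics concrete, here is a minimal NumPy sketch (ours, not the paper's implementation) of the uniform-state forward process in (1) with π = 1/K, together with the true reverse posterior q_{s|t}(· | z_t, x) stated in Sec. 2.1; the function names are our own.

```python
import numpy as np

def uniform_forward_sample(x, alpha_t, K, rng):
    """Sample z_t ~ Cat(alpha_t * x + (1 - alpha_t) / K), eq. (1) with pi = 1/K.

    x holds integer token ids of shape (L,). Keeping each token with
    probability alpha_t and otherwise resampling uniformly over the K-word
    vocabulary (which may return the original token) realizes this marginal.
    """
    keep = rng.random(x.shape) < alpha_t
    noise = rng.integers(0, K, size=x.shape)
    return np.where(keep, x, noise)

def reverse_posterior(z_t, x, alpha_s, alpha_t, K):
    """True reverse posterior q_{s|t}(. | z_t, x) for a uniform-state process.

    z_t and x are one-hot vectors of length K, and alpha_s > alpha_t are the
    diffusion parameters at times s < t. Returns a length-K probability vector.
    """
    alpha_ts = alpha_t / alpha_s  # alpha_{t|s}
    num = (K * alpha_t * (z_t * x)
           + (alpha_ts - alpha_t) * z_t
           + (alpha_s - alpha_t) * x
           + (1.0 - alpha_ts) * (1.0 - alpha_s) * np.ones(K) / K)
    den = K * alpha_t * float(z_t @ x) + 1.0 - alpha_t
    return num / den
```

Note that the posterior's normalizer K α_t ⟨z_t, x⟩ + 1 − α_t makes the returned vector sum to one whether or not z_t equals x, and setting α_t = 1 in the forward sampler leaves the sequence untouched.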
In contrast, USDMs naturally exhibit a self-correcting property absent in AR and MDM approaches. As a result, our work focuses primarily on the USDM framework. Schiff et al. (2025) show that for USDMs, the state transition matrix Q_t is given by:

Q_t = (α′_t / (K α_t)) [11^T − K I_K],  (3)

where α′_t is the time derivative of α_t, and the true reverse posterior for a timestep s < t is given by:

q_{s|t}(· | z_t, x) = Cat(·; [K α_t (z_t ⊙ x) + (α_{t|s} − α_t) z_t + (α_s − α_t) x + (1 − α_{t|s})(1 − α_s) 1/K] / [K α_t ⟨z_t, x⟩ + 1 − α_t]),  (4)

where α_{t|s} = α_t / α_s. Since x is unavailable during inference, we approximate it with a neural network x_θ : V × [0,1] → Δ^K with parameters θ. The resulting approximate reverse posterior is defined as p_θ^{s|t}(· | z_t) = q_{s|t}(· | z_t, x = x_θ(z_t, t)). The denoising model is trained by minimizing the Negative Evidence Lower Bound (NELBO) (Schiff et al., 2025):

NELBO(q, p_θ; x) = E_{t∼U[0,1], z_t∼q_t(·|x;α_t)} f(z_t, x_θ(z_t, t), α_t; x),  (5)

where f is defined in (40). Sampling from this model begins with the prior z_{t=1} ∼ Cat(·; 1/K) and proceeds via ancestral denoising, i.e., by drawing z_s ∼ p_θ^{s|t}(· | z_t) at each step.

2.2. Gaussian Diffusion Models

Gaussian diffusion maps a data distribution q_data to a simple prior distribution, usually a Normal distribution N(0, I_K), through a sequence of noisy latents w_t ∼ q̃_t(· | x), whose marginal distribution is given by:

q̃_t(· | x; ᾱ_t) = N(ᾱ_t x, (1 − ᾱ_t²) I_K),  (6)

where the diffusion parameter ᾱ_t ∈ [0,1] is a monotonically decreasing function of t. For ᾱ_{t=0} = 1 and ᾱ_{t=1} = 0, the NELBO for such a process is given by (Kingma et al., 2021):

NELBO(q̃, p̃_θ; x) = E_{t∼U[0,1], w_t∼q̃_t(·|x;ᾱ_t)} [−ν′(t) ‖x − x̃_θ(w_t, t)‖²₂],  (7)

where ν′(t) is the time derivative of the signal-to-noise ratio ν(t) = ᾱ_t² / (1 − ᾱ_t²) of the Gaussian diffusion process.

2.3. Consistency Distillation

Consistency models (Song et al., 2023; Song & Dhariwal, 2023) are a class of generative models that define a bijective mapping between samples from the noise distribution N(0, I_K) and the data distribution q_data.
They build on deterministic samplers for Gaussian diffusion (Song et al., 2020; 2021), specifically the Probability-Flow ODE (PF-ODE). Given a pre-trained Gaussian diffusion model x̃_θ that requires hundreds or thousands of sampling steps, Consistency Distillation is a popular technique for distilling it into a few-step model, enabling much faster generation. The distillation begins with a teacher model x̃_{θ⁻}, often the Exponential Moving Average (EMA) of the student model x̃_θ obtained during the course of training. A noisy sample w_t is drawn from the forward process q̃_t(· | x) (6), and a less noisy sample w_s is obtained by numerically solving one PF-ODE step using x̃_{θ⁻}. The student model is then trained to match the teacher's estimate of the clean sample by minimizing the following loss:

L(θ, θ⁻) = λ(t) d(x̃_θ(w_t, t), x̃_{θ⁻}(w_s, s)),  (8)

where d : R^K × R^K → R₊ measures the error between the teacher's reconstruction x̃_{θ⁻}(w_s, s) and the student's reconstruction x̃_θ(w_t, t) of the original sample, and λ : [0,1] → R₊ is a weighting function that scales the loss based on the diffusion timestep t.

3. The Diffusion Duality

Unlike discrete diffusion, Gaussian diffusion is replete with well-established empirical techniques, which have driven significant advances in both training (Ho et al., 2020; Salimans & Ho, 2022; Zheng et al., 2023) and sampling (Karras et al., 2022; Song et al., 2023; Song & Dhariwal, 2023; Yin et al., 2024). Our goal in this section is to establish a theoretical bridge between discrete-state and continuous-state diffusion, which will enable us to leverage tools from the latter to improve the former. We propose a simple method to map a Gaussian latent to the discrete space: the arg max operator. But does this transformation of latents also transform a Gaussian diffusion process into a discrete one?
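The consistency-distillation objective of Sec. 2.3, eq. (8), can be sketched in a few lines. This is a schematic, not the paper's algorithm: `student`, `teacher`, and `ode_step` are hypothetical callables standing in for x̃_θ, its EMA teacher, and one numerical PF-ODE step, and squared error stands in for the distance d.

```python
import numpy as np

def consistency_distillation_loss(student, teacher, ode_step, w_t, t, s, weight=1.0):
    """One Monte-Carlo term of L(theta, theta^-) = lambda(t) d(x_theta(w_t, t),
    x_theta^-(w_s, s)) from eq. (8), with d taken to be squared error."""
    w_s = ode_step(teacher, w_t, t, s)   # less noisy latent from one PF-ODE step
    target = teacher(w_s, s)             # teacher reconstruction (held fixed in practice)
    pred = student(w_t, t)               # student reconstruction from the noisier latent
    return weight * float(np.sum((pred - target) ** 2))
```

In a real implementation the target would be computed under a stop-gradient and `weight` would follow the schedule λ(t).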
A necessary and sufficient condition for this is that the marginal distribution of the discretized vector satisfies the characteristic ODE of a discrete diffusion process (2). We first derive a closed-form expression for this marginal and show that arg max maps the marginals of a Gaussian diffusion to those of a Uniform-state discrete diffusion, including a transformation of the diffusion parameters (9). Finally, we verify that this marginal evolves according to (11), establishing that arg max transforms a Gaussian diffusion process into a Uniform-state discrete diffusion process. We begin by defining a Gaussian diffusion process on x ∈ V as per (6), with q̃_{t=0} ≈ q_data and q̃_{t=1} = N(0, I_K). Let w_t ∼ q̃_t(· | x; ᾱ_t) be an intermediate latent at time t. Next, define the operation arg max : R^K → V that maps a continuous vector w ∈ R^K to the one-hot vector corresponding to the index of its largest entry, i.e., arg max(w) = arg max_{z∈V} ⟨z, w⟩.

Discrete Marginals. Let z_t = arg max(w_t) and let P_t(· | x) denote its conditional pmf, marginalized over w_t ∼ q̃_t(· | x; ᾱ_t). In Suppl. A.1, we show:

z_t ∼ P_t(· | x; T(ᾱ_t)) = Cat(·; T(ᾱ_t) x + (1 − T(ᾱ_t)) 1/K),  (9)

where the function T : [0,1] → [0,1] is the Diffusion Transformation operator, defined as:

T(ᾱ_t) = (K / (K − 1)) [∫_{−∞}^{∞} φ(z − √(ᾱ_t² / (1 − ᾱ_t²))) Φ^{K−1}(z) dz − 1/K],  (10)

where φ(z) = exp(−z²/2)/√(2π) is the standard Normal density and Φ(z) = ∫_{−∞}^{z} exp(−u²/2) du/√(2π) is its cumulative distribution function.

Time Evolution of Marginals. Next, we examine how the discrete marginal P_t evolves in time as the continuous vector w_t undergoes Gaussian diffusion. In Suppl. A.2, we show that P_t evolves according to the following linear ODE:

(d/dt) P_t = (Ṫ(ᾱ_t) / (K T(ᾱ_t))) [11^T − K I_K] P_t,  (11)

where Ṫ denotes the time derivative of T(ᾱ_t) and the matrix prefactor plays the role of Q_t. From (2) and (3), we infer that (11) characterizes a Uniform-state discrete diffusion process with diffusion parameter T(ᾱ_t).
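The operator T can be checked numerically. Assuming the Gaussian marginal N(ᾱ_t x, (1 − ᾱ_t²) I_K) from (6), the sketch below (our own sanity check, not the paper's code) evaluates T(ᾱ_t) two ways: by quadrature, and by Monte-Carlo discretization of Gaussian latents with arg max, using the identity T(ᾱ) = (K/(K−1))(p − 1/K) implied by (9), where p is the probability that arg max(w_t) recovers x.

```python
import numpy as np
from math import erf, sqrt

def std_normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def T_quadrature(alpha_bar, K, grid=np.linspace(-8.0, 8.0, 4001)):
    """T(alpha_bar) via the integral form; the shift is the square root of the
    SNR nu(t) = alpha_bar^2 / (1 - alpha_bar^2) under the marginal (6)."""
    shift = alpha_bar / np.sqrt(1.0 - alpha_bar**2)
    phi = np.exp(-0.5 * (grid - shift) ** 2) / np.sqrt(2.0 * np.pi)
    cdf_pow = np.vectorize(std_normal_cdf)(grid) ** (K - 1)
    p_correct = float(np.sum(phi * cdf_pow) * (grid[1] - grid[0]))
    return K / (K - 1) * (p_correct - 1.0 / K)

def T_monte_carlo(alpha_bar, K, n=200_000, seed=0):
    """T(alpha_bar) by discretizing w_t ~ N(alpha_bar * x, (1 - alpha_bar^2) I)
    with arg max; the true class sits at index 0."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, np.sqrt(1.0 - alpha_bar**2), size=(n, K))
    w[:, 0] += alpha_bar
    p_correct = float(np.mean(np.argmax(w, axis=1) == 0))
    return K / (K - 1) * (p_correct - 1.0 / K)
```

The two estimates agree to Monte-Carlo accuracy, and T is monotonically increasing in ᾱ, consistent with the limits T(0) = 0 and T(1) = 1.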
It is important to note that while the marginals of the discretized latents evolve according to a Markovian Uniform-state discrete diffusion process, a Gaussian diffusion trajectory, after discretization, might not map to a discrete diffusion trajectory. We discuss this in detail in Suppl. A.3.

Duality. The implications of (9) and (11) are profound. They reveal a fundamental connection between Uniform-state discrete diffusion and Gaussian diffusion, bridged by the arg max operator: the arg max operation transforms Gaussian diffusion into Uniform-state diffusion, with the diffusion parameters related by (10). More formally, this can be expressed as:

q_t(z_t | x; T(ᾱ_t)) = [arg max]♯ q̃_t(w_t | x; ᾱ_t),  (12)

where the operator ♯ denotes the pushforward of the K-dimensional Gaussian density q̃_t under the arg max map, yielding a categorical distribution q_t with K classes. Thus, as x diffuses in the discrete space via a Uniform-state process, there exists an underlying continuous-space representation in which x follows Gaussian diffusion, as illustrated in Fig. 1.

ELBO. Note that these two processes are separate Markov chains with no transitions between them, and they induce different variational bounds on the log-likelihood. Specifically, they yield distinct ELBOs: (5) for the discrete diffusion process and (7) for the Gaussian diffusion process.

Theorem 3.1. The ELBO for the Uniform-state discrete diffusion is tighter than that for the underlying Gaussian diffusion.

We provide a proof in Suppl. A.4. Briefly, we show that:

log p_θ(x) ≥ ELBO(q, p_θ; x) ≥ ELBO(q̃, p̃_θ; x),  (13)

where p̃_θ is the denoiser in the Gaussian diffusion space and p_θ = [arg max]♯ p̃_θ is the denoiser in the discrete diffusion space. Since the ELBO is inherently tighter in the discrete space, it is advantageous to design the denoising model to operate on discrete latents. Hence, we choose (5) as the training and evaluation objective.
The term f in (5) involves materializing the one-hot vector x, which increases memory usage and slows down training. We reformulate it as a Rao-Blackwellized objective (41) that avoids materializing one-hot vectors and also reduces training variance, resulting in faster training and lower perplexity (Sec. 5.1).

Sampler. To sample from Duo, we use ancestral sampling for USDMs (Sec. 2.1). Furthermore, to improve text quality, we propose the Greedy-Tail Sampler, which improves sample quality by decreasing the sample entropy, in a manner similar to nucleus sampling in AR models (Holtzman et al., 2020). Specifically, during the final denoising step, instead of sampling the clean sequence via x ∼ p_θ^{0|δ}(·), we perform greedy decoding: x = arg max(p_θ^{0|δ}(·)), where δ denotes the time discretization.

We exploit this duality to incorporate Gaussian diffusion into the design space of USDMs. This allows us to design a low-variance training algorithm that leads to faster training (Sec. 4.1) and a distillation scheme that unlocks few-step generation in diffusion language models (Sec. 4.2).

4. Applications

We now present two applications where discrete diffusion models benefit from leveraging the underlying Gaussian diffusion. In Sec. 4.1, we introduce a curriculum learning strategy that reduces training variance and leads to faster training. Then, in Sec. 4.2, we propose a distillation algorithm that cuts the number of sampling steps by two orders of magnitude with minimal impact on sample quality. We extend our discrete diffusion framework to sequences x^{1:L} ∼ q_data of length L. The forward and reverse processes factorize independently over tokens as:

q_t(z^{1:L}_t | x^{1:L}; α_t) = ∏_{ℓ∈[L]} q_t(z^ℓ_t | x^ℓ; α_t), based on (1), and

p_θ(z^{1:L}_s | z^{1:L}_t) = ∏_{ℓ∈[L]} q_{s|t}(z^ℓ_s | z^ℓ_t, x^ℓ_θ(z^{1:L}_t, t)), based on (4),

respectively. Here, x_θ : V^L × [0,1] → (Δ^K)^L denotes the denoising model.
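The factorized reverse process above, combined with the Greedy-Tail final step, can be sketched as follows. `denoiser` and `posterior_probs` are hypothetical stand-ins for x_θ and the per-token approximate posterior p_θ^{s|t}; only the control flow, ancestral steps followed by a greedy arg max at the last step, reflects the text.

```python
import numpy as np

def sample_duo(denoiser, posterior_probs, n_steps, L, K, seed=0):
    """Ancestral sampling from the uniform prior with a Greedy-Tail final step.

    denoiser(z, t) -> (L, K) clean-token probabilities (stand-in for x_theta);
    posterior_probs(z, x_probs, s, t) -> (L, K) reverse-posterior probabilities.
    """
    rng = np.random.default_rng(seed)
    z = rng.integers(0, K, size=L)                  # z_{t=1} ~ uniform prior over V
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    for t, s in zip(ts[:-2], ts[1:-1]):             # ancestral steps down to t = delta
        p = posterior_probs(z, denoiser(z, t), s, t)
        z = np.array([rng.choice(K, p=row) for row in p])
    return np.argmax(denoiser(z, ts[-2]), axis=-1)  # greedy decode at the last step
```

The greedy arg max at t = δ replaces the final categorical draw, lowering the entropy of the emitted sequence as described for the Greedy-Tail Sampler.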
Consequently, the sequence-level NELBO decomposes into a sum of token-level losses:

NELBO(q, p_θ; x^{1:L}) = E_{t∼U[0,1], q_t} Σ_{ℓ∈[L]} f_Duo(z^ℓ_t, x^ℓ_θ(z^{1:L}_t, t), α_t; x^ℓ),  (14)

where f_Duo is defined in (41).

4.1. Faster Training using Curriculum Learning

Curriculum learning (Bengio et al., 2009) gradually exposes models to increasingly complex data, starting with simpler, easier-to-denoise noise patterns and progressing to more challenging ones. Here, we design a curriculum for USDMs by exploiting the underlying Gaussian diffusion. Similar to relaxation methods in discrete gradient estimation (Jang et al., 2017; Maddison et al., 2017), our curriculum is centered around annealing the temperature parameter of a smooth approximation of arg max. We reformulate the NELBO for discrete diffusion in terms of the arg max of Gaussian latents (Sec. 4.1.1). The denoising model is then trained to operate on the arg max of these Gaussian variables. We then relax the arg max using a tempered softmax (Sec. 4.1.2), which yields a lower-variance but biased estimator of the ELBO. Initially, the model operates on fully relaxed, continuous Gaussian latents. As training progresses, the temperature is gradually decreased, transitioning the model's inputs from soft (continuous) to hard (discrete). By the end of the curriculum, the model effectively operates on discrete latents, closing the gap between training and inference-time behavior.

4.1.1. DISCRETE NELBO WITH GAUSSIAN LATENTS

Consider the discrete diffusion NELBO (14), which marginalizes f_Duo over the discrete latents z^{1:L}_t ∼ q_t(· | x^{1:L}; α_t). Our goal is to re-express this objective in terms of Gaussian latents w^{1:L}_t ∼ q̃_t(· | x^{1:L}; ᾱ_t) such that marginalizing over w^{1:L}_t yields the same numerical value for the NELBO. In Suppl.
B.1, we show:

NELBO(q, p_θ; x^{1:L}) = E_{t, q_t(z^{1:L}_t | x^{1:L}; α_t)} Σ_{ℓ∈[L]} f_Duo(z^ℓ_t, x^ℓ_θ(z^{1:L}_t, t), α_t; x^ℓ)
= E_{t, q̃_t(w^{1:L}_t | x^{1:L}; ᾱ_t)} Σ_{ℓ∈[L]} f_Duo(z^ℓ_t = arg max(w^ℓ_t), x^ℓ_θ([arg max(w^{ℓ′}_t)]^L_{ℓ′=1}, t), α_t = T(ᾱ_t); x^ℓ),  (15)

where α_t = T(ᾱ_t) is obtained via (10) from the Gaussian diffusion coefficient ᾱ_t; we also verify this equality empirically. As discussed in Sec. 3, these are distinct Markov chains whose marginal distributions are related only through (12). This reparameterization underpins the curriculum learning strategy we present in the next section.

4.1.2. LOW-VARIANCE TRAINING LOSS

To reduce training variance, we replace arg max(w^ℓ_t) in the denoising model's input (15) with a tempered softmax. We argue that this substitution eases recovery of the clean sequence from its noisy counterpart, and that the difficulty of this recovery is regulated by the temperature parameter. As shown in prior work (Jang et al., 2017; Maddison et al., 2017), arg max is the limiting case of softmax:

arg max(w^ℓ_t) = lim_{τ→0⁺} softmax(w^ℓ_t / τ).  (16)

We relax this operation by setting the temperature parameter τ > 0. While computing the NELBO in (15), note that the discrete diffusion parameter T(ᾱ_t) spans the interval [0,1], as does its Gaussian counterpart ᾱ_t. The diffusion transformation operator T (10) has a crucial property: as the vocabulary size K increases, a small sub-interval [a, 1], 0 ≤ a < 1, within the domain of T is sufficient to map onto the full range [0,1]. For instance, in Suppl. C.7, we observe that for K = 30K, when ᾱ_t ∈ [0.85, 1], the corresponding α_t = T(ᾱ_t) nearly spans the entire interval [0,1]. This observation is counter-intuitive: since the Gaussian latents mostly resemble x, one might expect the discrete NELBO to approach zero when evaluated with ᾱ_t restricted to such a narrow range. However, in practice, the NELBO remains largely unchanged. Why? The key reason lies in the discretization step.
Even small amounts of Gaussian noise in w^ℓ_t can cause the output of the arg max operation to change drastically, as it is highly sensitive to perturbations. As a result, much of the extra signal is lost due to discretization. To mitigate this, we allow the denoising model x_θ to access the continuous latent w^ℓ_t through the tempered softmax in (16). This relaxation helps preserve more of the signal, making the reconstruction task easier. In this way, the temperature parameter τ effectively controls the difficulty of the learning problem.

Figure 2: Curriculum learning drastically lowers the gradient variance in Duo trained with a fixed τ = 0.001. The figure shows the summed gradient variance of the 100 weights with the highest variance, comparing Duo with CL (blue) and without CL (grey).

Hence, unlike prior discrete diffusion methods, we design the denoising model x_θ to accept both continuous latents and discrete latents; see Suppl. C.2 for details. During training, we sample t ∼ U[β, γ], 0 ≤ β < γ ≤ 1, from a sub-interval so that ᾱ_t ∈ [a, b], 0 ≤ a
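The tempered-softmax relaxation in (16), which the curriculum anneals from soft to hard, can be sketched as follows (a minimal NumPy version, ours):

```python
import numpy as np

def tempered_softmax(w, tau):
    """softmax(w / tau); as tau -> 0+ this converges to the one-hot arg max of w
    (eq. 16), while a large tau yields a nearly uniform, fully relaxed input."""
    a = w / tau
    a = a - a.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)
```

Annealing `tau` toward a small constant (e.g. the fixed τ = 0.001 used in Fig. 2) moves the denoiser's inputs from continuous relaxations to effectively discrete one-hot latents, matching inference-time behavior by the end of training.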